date:20181017

On Thu, Oct 18, 2018 at 05:54:29AM +0100, Al Viro wrote:
> On Wed, Oct 17, 2018 at 09:47:18PM -0700, Darrick J. Wong wrote:
> 
> > > > +#define REMAP_FILE_DEDUP   (1 << 0)
> > > > +
> > > > +/*
> > > > + * These flags should be taken care of by the implementation (possibly using
> > > > + * vfs helpers) but can be ignored by the implementation.
> > > > + */
> > > > +#define REMAP_FILE_ADVISORY  (0)
> > > 
> > > ???
> > 
> > Sorry if this wasn't clear.  How about this?
> > 
> > /*
> >  * These flags signal that the caller is ok with altering various aspects of
> >  * the behavior of the remap operation.  The changes must be made by the
> >  * implementation; the vfs remap helper functions can take advantage of them.
> >  * Flags in this category exist to preserve the quirky behavior of the hoisted
> >  * btrfs clone/dedupe ioctls.
> >  */
> 
> Something like "currently we have no such flags, but some will appear
> in subsequent commits", removed once such flags do appear, perhaps?

Done.

--D

Re: [PATCH 09/29] vfs: combine the clone and dedupe into a single remap_file_range

2018-10-17 Thread Al Viro

On Wed, Oct 17, 2018 at 09:47:18PM -0700, Darrick J. Wong wrote:

> > > +#define REMAP_FILE_DEDUP (1 << 0)
> > > +
> > > +/*
> > > + * These flags should be taken care of by the implementation (possibly using
> > > + * vfs helpers) but can be ignored by the implementation.
> > > + */
> > > +#define REMAP_FILE_ADVISORY  (0)
> > 
> > ???
> 
> Sorry if this wasn't clear.  How about this?
> 
> /*
>  * These flags signal that the caller is ok with altering various aspects of
>  * the behavior of the remap operation.  The changes must be made by the
>  * implementation; the vfs remap helper functions can take advantage of them.
>  * Flags in this category exist to preserve the quirky behavior of the hoisted
>  * btrfs clone/dedupe ioctls.
>  */

Something like "currently we have no such flags, but some will appear
in subsequent commits", removed once such flags do appear, perhaps?

Re: [PATCH 09/29] vfs: combine the clone and dedupe into a single remap_file_range

On Thu, Oct 18, 2018 at 01:48:26AM +0100, Al Viro wrote:
> On Wed, Oct 17, 2018 at 03:45:17PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong 
> > 
> > Combine the clone_file_range and dedupe_file_range operations into a
> > single remap_file_range file operation dispatch since they're
> > fundamentally the same operation.  The differences between the two can
> > be made in the prep functions.
> > 
> > Signed-off-by: Darrick J. Wong 
> > Reviewed-by: Amir Goldstein 
> > Reviewed-by: Christoph Hellwig 
> > ---
> >  Documentation/filesystems/vfs.txt |   13 +--
> >  fs/btrfs/ctree.h  |8 ++-
> >  fs/btrfs/file.c   |3 +-
> >  fs/btrfs/ioctl.c  |   45 +++--
> >  fs/cifs/cifsfs.c  |   22 +++---
> >  fs/nfs/nfs4file.c |   10 ++--
> >  fs/ocfs2/file.c   |   24 +++-
> >  fs/overlayfs/file.c   |   30 ++---
> >  fs/read_write.c   |   18 +++
> >  fs/xfs/xfs_file.c |   23 ++-
> >  include/linux/fs.h|   20 +---
> >  11 files changed, 110 insertions(+), 106 deletions(-)
> > 
> > 
> > diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> > index a6c6a8af48a2..bb3183334ab9 100644
> > --- a/Documentation/filesystems/vfs.txt
> > +++ b/Documentation/filesystems/vfs.txt
> > @@ -883,8 +883,9 @@ struct file_operations {
> > unsigned (*mmap_capabilities)(struct file *);
> >  #endif
> > ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
> > -   int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> > -   int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> > +   int (*remap_file_range)(struct file *file_in, loff_t pos_in,
> > +   struct file *file_out, loff_t pos_out,
> > +   u64 len, unsigned int remap_flags);
> > int (*fadvise)(struct file *, loff_t, loff_t, int);
> >  };
> 
> Documentation/filesystems/porting part, please.  And document remap_flags.

Ok, will do.

> > +#define REMAP_FILE_DEDUP   (1 << 0)
> > +
> > +/*
> > + * These flags should be taken care of by the implementation (possibly using
> > + * vfs helpers) but can be ignored by the implementation.
> > + */
> > +#define REMAP_FILE_ADVISORY  (0)
> 
> ???

Sorry if this wasn't clear.  How about this?

/*
 * These flags signal that the caller is ok with altering various aspects of
 * the behavior of the remap operation.  The changes must be made by the
 * implementation; the vfs remap helper functions can take advantage of them.
 * Flags in this category exist to preserve the quirky behavior of the hoisted
 * btrfs clone/dedupe ioctls.
 */


--D
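
For readers following the flag discussion, here is a minimal userspace sketch of how an implementation might validate remap_flags.  The two flag values are copied from the quoted patch, and the mask test mirrors the one in xfs_file_remap_range later in this series; the helper name check_remap_flags is hypothetical and does not appear in the patches.

```c
#include <errno.h>

/* Values as posted in the patch; the advisory set is empty for now. */
#define REMAP_FILE_DEDUP	(1 << 0)
#define REMAP_FILE_ADVISORY	(0)

/*
 * Hypothetical helper (not part of the series): reject any flag the
 * implementation does not recognise.  While REMAP_FILE_ADVISORY is 0,
 * only REMAP_FILE_DEDUP passes the mask.
 */
static int check_remap_flags(unsigned int remap_flags)
{
	if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
		return -EINVAL;
	return 0;
}
```

Once advisory flags are added to REMAP_FILE_ADVISORY, the same mask would let them through without further changes to the check.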

Re: [PATCH 04/29] vfs: strengthen checking of file range inputs to generic_remap_checks

On Thu, Oct 18, 2018 at 01:41:56AM +0100, Al Viro wrote:
> On Wed, Oct 17, 2018 at 03:44:43PM -0700, Darrick J. Wong wrote:
> > +static int generic_access_check_limits(struct file *file, loff_t pos,
> > +  loff_t *count)
> > +{
> > +   struct inode *inode = file->f_mapping->host;
> > +
> > +   /* Don't exceed the LFS limits. */
> > +   if (unlikely(pos + *count > MAX_NON_LFS &&
> > +   !(file->f_flags & O_LARGEFILE))) {
> > +   if (pos >= MAX_NON_LFS)
> > +   return -EFBIG;
> > +   *count = min(*count, (loff_t)MAX_NON_LFS - pos);
> 
>   Can that be different from MAX_NON_LFS - pos?
> 
> > +   }
> > +
> > +   /*
> > +* Don't operate on ranges the page cache doesn't support.
> > +*
> > +* If we have written data it becomes a short write.  If we have
> > +* exceeded without writing data we send a signal and return EFBIG.
> > +* Linus frestrict idea will clean these up nicely..
> > +*/
> > +   if (unlikely(pos >= inode->i_sb->s_maxbytes))
> > +   return -EFBIG;
> > +
> > +   *count = min(*count, inode->i_sb->s_maxbytes - pos);
> > +   return 0;
> > +}
> 
> Anyway, I would rather do this here:
> 
>   struct inode *inode = file->f_mapping->host;
>   loff_t max_size = inode->i_sb->s_maxbytes;
> 
>   if (!(file->f_flags & O_LARGEFILE))
>   max_size = MAX_NON_LFS;
> 
>   if (unlikely(pos >= max_size))
>   return -EFBIG;
>   *count = min(*count, max_size - pos);
>   return 0;

Sounds much better to me. :)

--D

Untar on empty partition returns ENOSPACE

2018-10-17 Thread Jean-Denis Girard

Hi list,

My goal is to duplicate some SD cards, to prepare 50 similar Raspberry Pi.

First, I made a tar of my master SD card (unmounted). Then I made a
script, which creates 2 partitions (50 MB for boot, 14 GB for /),
creates the file-systems (vfat and btrfs, default options), mounts the
two partitions:

mount $part2 $mnt -ocompress=zstd,space_cache=v2,autodefrag,noatime,nodiratime
mkdir $mnt/boot
mount $part1 $mnt/boot

When untarring, I get many errors, like:
tar: ./usr/lib/libreoffice/share/gallery/arrows/A45-TrendArrow-Red-GoUp.png : open impossible: No space left on device
tar: ./usr/lib/libreoffice/share/gallery/arrows/A53-TrendArrow-LightBlue-TwoDirections.svg : open impossible: No space left on device
tar: ./usr/lib/libreoffice/share/gallery/arrows/A27-CurvedArrow-DarkRed.png : open impossible: No space left on device
tar: ./usr/lib/libreoffice/share/gallery/arrows/A41-CurvedArrow-Gray-Left.svg : open impossible: No space left on device

This usually results in an unusable SD card.

When the first errors occur, less than 1 GB has been written. I tried
changing mount options, especially the commit interval, but still got ENOSPC.

What I ended up doing is limiting write speed with:
xzcat $archive | pv -L 4M | tar x
(with -L 5M I start to get a couple of errors)

Is there a better workaround? Or a patch I could test that might help?

The machine that runs the script has:
[jdg@tiare tar-install]$ uname -r
4.19.0-rc7-snx
[jdg@tiare tar-install]$ btrfs version
btrfs-progs v4.17.1


Thanks,
-- 
Jean-Denis Girard

SysNux   Systèmes   Linux   en   Polynésie  française
https://www.sysnux.pf/   Tél: +689 40.50.10.40 / GSM: +689 87.797.527

Re: Conversion to btrfs raid1 profile on added ext device renders some systems unable to boot into converted rootfs

2018-10-17 Thread Qu Wenruo



On 2018/10/18 12:38 AM, Tony Prokott wrote:
> Good day. My technical trouble seems to be beyond the scope of active helpers 
> on debian's irc support channel. Reasonable supposition that it's quite 
> particular to the development stage of btrfs infrastructure on 4.17.xxx 
> backport kernels and userland tools available on debian 9.5 stretch as well 
> as buster, the testing suite to be released in the next several months as 
> 10.0 stable. 
> 
>  > / # uname -a; lsb_release -a
>  > Linux localhost 4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1 (2018-08-27) x86_64 GNU/Linux
>  > Distributor ID: LinuxMint
>  > Description: LMDE 3 Cindy
>  > Release: 3
>  > Codename: cindy
>  > 
>  > / # btrfs --version
>  > btrfs-progs v4.7.3
>  > 
>  > / # btrfs fi sh
>  > Label: 'sys'  uuid: [snip]
>  > Total devices 2 FS bytes used 24.07GiB
>  > devid    1 size 401.59GiB used 26.03GiB path /dev/sda2
>  > devid    2 size 401.76GiB used 26.03GiB path /dev/sdc1
>  > 
>  > / # btrfs fi df /
>  > Data, RAID1: total=24.00GiB, used=23.27GiB
>  > System, RAID1: total=32.00MiB, used=16.00KiB
>  > Metadata, RAID1: total=2.00GiB, used=820.00MiB
>  > GlobalReserve, single: total=69.17MiB, used=0.00B
>  > 
>  > / # btrfs su li -ta /
>  > ID   gen     top level  path
>  > ---  ------  ---------  ----
>  > 260  115103  5          /d9
>  > 261  115103  5          /d10
>  > 262  123876  5          /home
>  > 263  115148  261        /d10/@
>  > 264  115136  261        /d10/@home
>  > 443  123874  447        /md3/@
>  > 444  123876  447        /md3/@home
>  > 447  115103  5          /md3
>  > 451  115144  260        /d9/@
>  > 452  115136  260        /d9/@home
> 
> Providing no dmesg content so far, as it doesn't bear on the kind of 
> difficulty in question. My system requires expert help now to restore 
> bootability to 2 of its OS installations; it has a btrfs root file system in 
> subvolumes for stretch, buster, and LMDE3(cindy) which derives directly from 
> stretch and so has most core elements if not cfg defaults in common; even 
> kernel versions are alike, besides buster. subvolid=262 is a  /home fs shared 
> among  linux distros; 451, 263, and 443 are rootfs for stretch, buster and 
> cindy respectively.
> 
> All 3 installations had been booting and running fine when data block group 
> profile was "single" on an internal sata HDD /dev/sda2; then an external usb3 
> drive enclosure's sata HDD partition /dev/sdc1, also of size ~0.4TiB, was 
> added and balanced as btrfs "raid1"; raid conversion did not damage subvolume 
> content or filesystem integrity afaict, but rather rendered stretch and 
> buster unbootable (more to follow), whereas cindy carried on without hiccup.
> 
> At first it seemed as though the initrd's might be missing a module or so, to 
> allow access to external drives -- i.e. grub starts the unbootable 
> kernel/initrd but drops to busybox prompt right away without starting 
> external drives, referring to allegedly "missing" btrfs device's UUID_SUB.
> 
> But after chrooting to update-initramfs and cataloging resulting image 
> content, usb_storage and uas were present under /lib/modules/xxx already, and 
> failing systems still just busybox without a real rootfs rather than launch 
> systemd; even tried kernel option "rootwait" which had no effect on access to 
> ext storage; udev still seems not to have noticed the ext drives once busybox 
> had control.

This still looks like an initramfs problem rather than a btrfs problem.

In the busybox environment, have you tried listing /dev to see if that
external device is found?

> 
> I could list all initrd modules present in cindy & absent for others, but 
> need better knowledge than my reasonable guesses of what's required to make 
> btrfs volume companion devices cooperate at boot time, as initrd transitions 
> to steady state rootfs.

Since you have a busybox environment, have you checked whether the
"btrfs" command is present in the initramfs?

IIRC you need at least the following to boot:

1) USB and SATA drivers
   This means you can see both devices under /dev in the busybox
   environment.

2) The "btrfs" command
   Mostly for the device scan.

Then you could try the following commands in the busybox environment:

# btrfs device scan
# mount  

If that works, your initramfs is probably not running "btrfs device scan"
during boot, so the kernel can't see all of the RAID1 devices and the
mount of the root filesystem fails.

Please refer to your distribution's initramfs creation tool to see how to
add that scan. (Some distros have a special hook for btrfs to handle this
case.)

Thanks,
Qu

> 
> What would be a more practical diagnostic? Could stretch & buster initrd's 
> somehow be failing to do a btrfs device scan at the proper moment? Not so 
> interested in giving up on btrfs software raid so early in the game.
> 
> thanks in advance-
> TP [not a list subscriber]
> 
> 




Re: [PATCH 09/29] vfs: combine the clone and dedupe into a single remap_file_range

2018-10-17 Thread Al Viro

On Wed, Oct 17, 2018 at 03:45:17PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> Combine the clone_file_range and dedupe_file_range operations into a
> single remap_file_range file operation dispatch since they're
> fundamentally the same operation.  The differences between the two can
> be made in the prep functions.
> 
> Signed-off-by: Darrick J. Wong 
> Reviewed-by: Amir Goldstein 
> Reviewed-by: Christoph Hellwig 
> ---
>  Documentation/filesystems/vfs.txt |   13 +--
>  fs/btrfs/ctree.h  |8 ++-
>  fs/btrfs/file.c   |3 +-
>  fs/btrfs/ioctl.c  |   45 +++--
>  fs/cifs/cifsfs.c  |   22 +++---
>  fs/nfs/nfs4file.c |   10 ++--
>  fs/ocfs2/file.c   |   24 +++-
>  fs/overlayfs/file.c   |   30 ++---
>  fs/read_write.c   |   18 +++
>  fs/xfs/xfs_file.c |   23 ++-
>  include/linux/fs.h|   20 +---
>  11 files changed, 110 insertions(+), 106 deletions(-)
> 
> 
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index a6c6a8af48a2..bb3183334ab9 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -883,8 +883,9 @@ struct file_operations {
>   unsigned (*mmap_capabilities)(struct file *);
>  #endif
>   ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
> - int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> - int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
> + int (*remap_file_range)(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + u64 len, unsigned int remap_flags);
>   int (*fadvise)(struct file *, loff_t, loff_t, int);
>  };

Documentation/filesystems/porting part, please.  And document remap_flags.

> +#define REMAP_FILE_DEDUP (1 << 0)
> +
> +/*
> + * These flags should be taken care of by the implementation (possibly using
> + * vfs helpers) but can be ignored by the implementation.
> + */
> +#define REMAP_FILE_ADVISORY  (0)

???

Re: Urgent: Need BTRFS-Expert

2018-10-17 Thread Qu Wenruo



On 2018/10/17 11:50 PM, Michael Post wrote:
> Hello Qu,
> Hello Hugo,
> 
> I got this result when I tried to recover the chunk tree.
> 
> # btrfs check /dev/mapper/vg0-virtualbox
> bytenr mismatch, want=1263835381760, have=0
> ERROR: cannot open file system

This means some essential trees are corrupted.
I recall that recent btrfs-progs should report which tree is corrupted.

Would you please try again with the latest btrfs-progs?

Also, would you please paste the output of the following commands?

 # btrfs ins dump-super 
 # btrfs ins dump-tree -t chunk 

Thanks,
Qu

> 
> 
> # btrfs rescue chunk-recover -y  /dev/mapper/vg0-virtualbox
> Scanning: DONE in dev0
> open with broken chunk error
> Chunk tree recovery failed
> 
> Do you have any clue?
> 
> 
> Michael
> 




Re: [PATCH 04/29] vfs: strengthen checking of file range inputs to generic_remap_checks

2018-10-17 Thread Al Viro

On Wed, Oct 17, 2018 at 03:44:43PM -0700, Darrick J. Wong wrote:
> +static int generic_access_check_limits(struct file *file, loff_t pos,
> +loff_t *count)
> +{
> + struct inode *inode = file->f_mapping->host;
> +
> + /* Don't exceed the LFS limits. */
> + if (unlikely(pos + *count > MAX_NON_LFS &&
> + !(file->f_flags & O_LARGEFILE))) {
> + if (pos >= MAX_NON_LFS)
> + return -EFBIG;
> + *count = min(*count, (loff_t)MAX_NON_LFS - pos);

Can that be different from MAX_NON_LFS - pos?

> + }
> +
> + /*
> +  * Don't operate on ranges the page cache doesn't support.
> +  *
> +  * If we have written data it becomes a short write.  If we have
> +  * exceeded without writing data we send a signal and return EFBIG.
> +  * Linus frestrict idea will clean these up nicely..
> +  */
> + if (unlikely(pos >= inode->i_sb->s_maxbytes))
> + return -EFBIG;
> +
> + *count = min(*count, inode->i_sb->s_maxbytes - pos);
> + return 0;
> +}

Anyway, I would rather do this here:

struct inode *inode = file->f_mapping->host;
loff_t max_size = inode->i_sb->s_maxbytes;

if (!(file->f_flags & O_LARGEFILE))
max_size = MAX_NON_LFS;

if (unlikely(pos >= max_size))
return -EFBIG;
*count = min(*count, max_size - pos);
return 0;
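
To make the effect of the simplified version concrete, here is a userspace sketch of the same clamp logic.  It is an illustration only: check_limits, the largefile parameter (standing in for the O_LARGEFILE test) and MAX_NON_LFS_SKETCH are hypothetical names, and the constant assumes the kernel's MAX_NON_LFS value of (1UL << 31) - 1.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same value as the kernel's MAX_NON_LFS, (1UL << 31) - 1. */
#define MAX_NON_LFS_SKETCH	((int64_t)INT32_MAX)

/*
 * Hypothetical userspace rendering of the suggested logic: pick one
 * size limit up front, fail if pos is already at or past it, otherwise
 * shorten *count so that the range ends at the limit.
 */
static int check_limits(int64_t pos, int64_t *count, int64_t s_maxbytes,
			bool largefile)
{
	int64_t max_size = s_maxbytes;

	if (!largefile)
		max_size = MAX_NON_LFS_SKETCH;

	if (pos >= max_size)
		return -EFBIG;
	if (*count > max_size - pos)
		*count = max_size - pos;
	return 0;
}
```

With a single max_size there is only one comparison and one clamp, which is what makes the two separate blocks in the posted patch unnecessary.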

[PATCH 24/29] xfs: fix pagecache truncation prior to reflink

From: Darrick J. Wong 

Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache.  Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them.  So, round the start offset
down and the end offset up to page boundaries.  We already wrote all
the dirty data so the larger range shouldn't be a problem.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Dave Chinner 
Reviewed-by: Christoph Hellwig 
---
 fs/xfs/xfs_reflink.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 9b1ea42c81d1..e8e86646bb4b 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1369,8 +1369,9 @@ xfs_reflink_remap_prep(
goto out_unlock;
 
/* Zap any page cache for the destination file's range. */
-   truncate_inode_pages_range(&inode_out->i_data, pos_out,
-  PAGE_ALIGN(pos_out + *len) - 1);
+   truncate_inode_pages_range(&inode_out->i_data,
+   round_down(pos_out, PAGE_SIZE),
+   round_up(pos_out + *len, PAGE_SIZE) - 1);
 
return 1;
 out_unlock:
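
As an illustration of the rounding in this patch, the sketch below computes the byte range that would be zapped for a given pos_out and len.  It assumes 4 KiB pages; zap_range, sketch_round_down and sketch_round_up are hypothetical userspace stand-ins for the kernel macros and are not part of the patch.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; the real PAGE_SIZE is architecture dependent. */
#define SKETCH_PAGE_SIZE	4096ULL

/* Power-of-two rounding, standing in for the kernel's round_down/round_up. */
static uint64_t sketch_round_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

static uint64_t sketch_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: compute the inclusive byte range handed to the
 * truncation call, widened outward to whole pages on both ends.
 */
static void zap_range(uint64_t pos_out, uint64_t len,
		      uint64_t *first, uint64_t *last)
{
	*first = sketch_round_down(pos_out, SKETCH_PAGE_SIZE);
	*last = sketch_round_up(pos_out + len, SKETCH_PAGE_SIZE) - 1;
}
```

For example, with 1 KiB blocks a remap of 1024 bytes at offset 1024 yields the range 0..4095, so the whole page is dropped instead of having part of it zeroed.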

[PATCH 18/29] vfs: clean up generic_remap_file_range_prep return value

From: Darrick J. Wong 

Since the remap prep function can update the length of the remap
request, we can change this function to return the usual return status
instead of the odd behavior it has now.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/ocfs2/refcounttree.c |2 +-
 fs/read_write.c |6 +++---
 fs/xfs/xfs_reflink.c|4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 6a42c04ac0ab..46bbd315c39f 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4852,7 +4852,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
 
ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
&len, remap_flags);
-   if (ret <= 0)
+   if (ret < 0 || len == 0)
goto out_unlock;
 
/* Lock out changes to the allocation maps and remap. */
diff --git a/fs/read_write.c b/fs/read_write.c
index e4d295d0d236..6b40a43edf18 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1848,8 +1848,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  * sense, and then flush all dirty data.  Caller must ensure that the
  * inodes have been locked against any other modifications.
  *
- * Returns: 0 for "nothing to clone", 1 for "something to clone", or
- * the usual negative error code.
+ * If there's an error, then the usual negative error code is returned.
+ * Otherwise returns 0 with *len set to the request length.
  */
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
  struct file *file_out, loff_t pos_out,
@@ -1945,7 +1945,7 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
return ret;
}
 
-   return 1;
+   return 0;
 }
 EXPORT_SYMBOL(generic_remap_file_range_prep);
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 3dbe5fb7e9c0..9b1ea42c81d1 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1329,7 +1329,7 @@ xfs_reflink_remap_prep(
 
ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
len, remap_flags);
-   if (ret <= 0)
+   if (ret < 0 || *len == 0)
goto out_unlock;
 
/*
@@ -1409,7 +1409,7 @@ xfs_reflink_remap_range(
/* Prepare and then clone file data. */
ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
&len, remap_flags);
-   if (ret <= 0)
+   if (ret < 0 || len == 0)
return ret;
 
trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
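
The caller-side pattern after this change can be sketched in userspace as follows.  This is only an illustration of the "0 on success, length reported through the pointer" convention; sketch_prep and sketch_remap are hypothetical, and the clamping done in sketch_prep is a simplification, not what generic_remap_file_range_prep actually checks.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a prep function: returns a negative errno
 * on failure, otherwise 0 with *len updated to the usable length
 * (possibly 0 when there is nothing to do).
 */
static int sketch_prep(int64_t file_size, int64_t pos, int64_t *len)
{
	if (pos < 0 || *len < 0)
		return -EINVAL;
	if (pos >= file_size)
		*len = 0;
	else if (*len > file_size - pos)
		*len = file_size - pos;
	return 0;
}

/* Caller: bail out on error or on an empty request, as in the hunks above. */
static int64_t sketch_remap(int64_t file_size, int64_t pos, int64_t len)
{
	int ret = sketch_prep(file_size, pos, &len);

	if (ret < 0 || len == 0)
		return ret;
	return len;	/* pretend the whole (shortened) range was remapped */
}
```

The point of the convention is that "nothing to do" no longer needs its own return value: the caller tests len instead.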

[PATCH 17/29] vfs: hide file range comparison function

From: Darrick J. Wong 

There are no callers of vfs_dedupe_file_range_compare, so we might as
well make it a static helper and remove the export.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
Reviewed-by: Christoph Hellwig 
---
 fs/read_write.c|  187 +---
 include/linux/fs.h |3 -
 2 files changed, 91 insertions(+), 99 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index c0bcc1a20650..e4d295d0d236 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1752,6 +1752,97 @@ static int generic_remap_check_len(struct inode *inode_in,
return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL;
 }
 
+/*
+ * Read a page's worth of file data into the page cache.  Return the page
+ * locked.
+ */
+static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
+{
+   struct page *page;
+
+   page = read_mapping_page(inode->i_mapping, offset >> PAGE_SHIFT, NULL);
+   if (IS_ERR(page))
+   return page;
+   if (!PageUptodate(page)) {
+   put_page(page);
+   return ERR_PTR(-EIO);
+   }
+   lock_page(page);
+   return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ * Caller must have locked both inodes to prevent write races.
+ */
+static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+struct inode *dest, loff_t destoff,
+loff_t len, bool *is_same)
+{
+   loff_t src_poff;
+   loff_t dest_poff;
+   void *src_addr;
+   void *dest_addr;
+   struct page *src_page;
+   struct page *dest_page;
+   loff_t cmp_len;
+   bool same;
+   int error;
+
+   error = -EINVAL;
+   same = true;
+   while (len) {
+   src_poff = srcoff & (PAGE_SIZE - 1);
+   dest_poff = destoff & (PAGE_SIZE - 1);
+   cmp_len = min(PAGE_SIZE - src_poff,
+ PAGE_SIZE - dest_poff);
+   cmp_len = min(cmp_len, len);
+   if (cmp_len <= 0)
+   goto out_error;
+
+   src_page = vfs_dedupe_get_page(src, srcoff);
+   if (IS_ERR(src_page)) {
+   error = PTR_ERR(src_page);
+   goto out_error;
+   }
+   dest_page = vfs_dedupe_get_page(dest, destoff);
+   if (IS_ERR(dest_page)) {
+   error = PTR_ERR(dest_page);
+   unlock_page(src_page);
+   put_page(src_page);
+   goto out_error;
+   }
+   src_addr = kmap_atomic(src_page);
+   dest_addr = kmap_atomic(dest_page);
+
+   flush_dcache_page(src_page);
+   flush_dcache_page(dest_page);
+
+   if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+   same = false;
+
+   kunmap_atomic(dest_addr);
+   kunmap_atomic(src_addr);
+   unlock_page(dest_page);
+   unlock_page(src_page);
+   put_page(dest_page);
+   put_page(src_page);
+
+   if (!same)
+   break;
+
+   srcoff += cmp_len;
+   destoff += cmp_len;
+   len -= cmp_len;
+   }
+
+   *is_same = same;
+   return 0;
+
+out_error:
+   return error;
+}
+
 /*
  * Check that the two inodes are eligible for cloning, the ranges make
  * sense, and then flush all dirty data.  Caller must ensure that the
@@ -1923,102 +2014,6 @@ loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 }
 EXPORT_SYMBOL(vfs_clone_file_range);
 
-/*
- * Read a page's worth of file data into the page cache.  Return the page
- * locked.
- */
-static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
-{
-   struct address_space *mapping;
-   struct page *page;
-   pgoff_t n;
-
-   n = offset >> PAGE_SHIFT;
-   mapping = inode->i_mapping;
-   page = read_mapping_page(mapping, n, NULL);
-   if (IS_ERR(page))
-   return page;
-   if (!PageUptodate(page)) {
-   put_page(page);
-   return ERR_PTR(-EIO);
-   }
-   lock_page(page);
-   return page;
-}
-
-/*
- * Compare extents of two files to see if they are the same.
- * Caller must have locked both inodes to prevent write races.
- */
-int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
- struct inode *dest, loff_t destoff,
- loff_t len, bool *is_same)
-{
-   loff_t src_poff;
-   loff_t dest_poff;
-   void *src_addr;
-   void *dest_addr;
-   struct page *src_page;
-   struct page *dest_page;
-   loff_t cmp_len;
-   bool same;
-   int error;
-
-   error = -EINVAL;
-

[PATCH 27/29] xfs: remove redundant remap partial EOF block checks

From: Darrick J. Wong 

Now that we've moved the partial EOF block checks to the VFS helpers, we
can remove the redundant functionality from XFS.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Dave Chinner 
---
 fs/xfs/xfs_reflink.c |   19 ---
 1 file changed, 19 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 4abb2aea8f31..bccc66316cc4 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1314,7 +1314,6 @@ xfs_reflink_remap_prep(
    struct inode        *inode_out = file_inode(file_out);
    struct xfs_inode    *dest = XFS_I(inode_out);
    bool                same_inode = (inode_in == inode_out);
-   u64                 blkmask = i_blocksize(inode_in) - 1;
ssize_t ret;
 
/* Lock both files against IO */
@@ -1342,24 +1341,6 @@ xfs_reflink_remap_prep(
if (ret < 0 || *len == 0)
goto out_unlock;
 
-   /*
-* If the dedupe data matches, chop off the partial EOF block
-* from the source file so we don't try to dedupe the partial
-* EOF block.
-*/
-   if (remap_flags & REMAP_FILE_DEDUP) {
-   *len &= ~blkmask;
-   } else if (*len & blkmask) {
-   /*
-* The user is attempting to share a partial EOF block,
-* if it's inside the destination EOF then reject it.
-*/
-   if (pos_out + *len < i_size_read(inode_out)) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
-   }
-
/* Attach dquots to dest inode before changing block map */
ret = xfs_qm_dqattach(dest);
if (ret)
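
For reference, the check being removed here can be restated as a small userspace sketch.  The function name and parameters are hypothetical; the logic follows the deleted hunk above, where blkmask is the block size minus one.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the removed logic.  For dedupe, the
 * length is trimmed down to a whole number of blocks.  For clone, a
 * length ending in a partial block is only accepted when the range
 * reaches or passes the destination's EOF.
 */
static int sketch_trim_partial_block(uint64_t blocksize, uint64_t pos_out,
				     uint64_t dest_size, bool dedup,
				     uint64_t *len)
{
	uint64_t blkmask = blocksize - 1;	/* blocksize is a power of two */

	if (dedup) {
		*len &= ~blkmask;
	} else if (*len & blkmask) {
		if (pos_out + *len < dest_size)
			return -EINVAL;
	}
	return 0;
}
```

With 4096-byte blocks, a 5000-byte dedupe request is shortened to 4096 bytes, while a 5000-byte clone into an 8192-byte destination at offset 0 is rejected.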

[PATCH 29/29] xfs: remove [cm]time update from reflink calls

From: Darrick J. Wong 

Now that the vfs remap helper dirties the inode [cm]time for us, xfs no
longer needs to do that on its own.

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_reflink.c |7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 84f372f7ea04..e72218477bf2 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -927,8 +927,7 @@ xfs_reflink_update_dest(
    struct xfs_trans    *tp;
int error;
 
-   if ((remap_flags & REMAP_FILE_DEDUP) &&
-   newlen <= i_size_read(VFS_I(dest)) && cowextsize == 0)
+   if (newlen <= i_size_read(VFS_I(dest)) && cowextsize == 0)
return 0;
 
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
@@ -949,10 +948,6 @@ xfs_reflink_update_dest(
dest->i_d.di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
}
 
-   if (!(remap_flags & REMAP_FILE_DEDUP)) {
-   xfs_trans_ichgtime(tp, dest,
-  XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
-   }
xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
 
error = xfs_trans_commit(tp);

[PATCH 28/29] xfs: remove xfs_reflink_remap_range

From: Darrick J. Wong 

Since xfs_file_remap_range is a thin wrapper, move the contents of
xfs_reflink_remap_range into the shell.  This cuts down on the vfs
calls being made from internal xfs code.

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_file.c|   65 --
 fs/xfs/xfs_reflink.c |   70 +++---
 fs/xfs/xfs_reflink.h |   10 +++
 3 files changed, 70 insertions(+), 75 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7d42ab8fe6e1..53c9ab8fb777 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -919,20 +919,67 @@ xfs_file_fallocate(
return error;
 }
 
-STATIC loff_t
+
+loff_t
 xfs_file_remap_range(
-   struct file     *file_in,
-   loff_t          pos_in,
-   struct file     *file_out,
-   loff_t          pos_out,
-   loff_t          len,
-   unsigned int    remap_flags)
+   struct file     *file_in,
+   loff_t          pos_in,
+   struct file     *file_out,
+   loff_t          pos_out,
+   loff_t          len,
+   unsigned int    remap_flags)
 {
+   struct inode        *inode_in = file_inode(file_in);
+   struct xfs_inode    *src = XFS_I(inode_in);
+   struct inode        *inode_out = file_inode(file_out);
+   struct xfs_inode    *dest = XFS_I(inode_out);
+   struct xfs_mount    *mp = src->i_mount;
+   loff_t              remapped = 0;
+   xfs_extlen_t        cowextsize;
+   int                 ret;
+
if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
return -EINVAL;
 
-   return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-   len, remap_flags);
+   if (!xfs_sb_version_hasreflink(&mp->m_sb))
+   return -EOPNOTSUPP;
+
+   if (XFS_FORCED_SHUTDOWN(mp))
+   return -EIO;
+
+   /* Prepare and then clone file data. */
+   ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
+   &len, remap_flags);
+   if (ret < 0 || len == 0)
+   return ret;
+
+   trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
+
+   ret = xfs_reflink_remap_blocks(src, pos_in, dest, pos_out, len,
+   &remapped);
+   if (ret)
+   goto out_unlock;
+
+   /*
+* Carry the cowextsize hint from src to dest if we're sharing the
+* entire source file to the entire destination file, the source file
+* has a cowextsize hint, and the destination file does not.
+*/
+   cowextsize = 0;
+   if (pos_in == 0 && len == i_size_read(inode_in) &&
+   (src->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE) &&
+   pos_out == 0 && len >= i_size_read(inode_out) &&
+   !(dest->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
+   cowextsize = src->i_d.di_cowextsize;
+
+   ret = xfs_reflink_update_dest(dest, pos_out + len, cowextsize,
+   remap_flags);
+
+out_unlock:
+   xfs_reflink_remap_unlock(file_in, file_out);
+   if (ret)
+   trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
+   return remapped > 0 ? remapped : ret;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index bccc66316cc4..84f372f7ea04 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -916,7 +916,7 @@ xfs_reflink_set_inode_flag(
 /*
  * Update destination inode size & cowextsize hint, if necessary.
  */
-STATIC int
+int
 xfs_reflink_update_dest(
    struct xfs_inode    *dest,
    xfs_off_t           newlen,
@@ -1116,7 +1116,7 @@ xfs_reflink_remap_extent(
 /*
  * Iteratively remap one file's extents (and holes) to another's.
  */
-STATIC int
+int
 xfs_reflink_remap_blocks(
    struct xfs_inode    *src,
loff_t  pos_in,
@@ -1232,7 +1232,7 @@ xfs_iolock_two_inodes_and_break_layout(
 }
 
 /* Unlock both inodes after they've been prepped for a range clone. */
-STATIC void
+void
 xfs_reflink_remap_unlock(
struct file *file_in,
struct file *file_out)
@@ -1300,7 +1300,7 @@ xfs_reflink_zero_posteof(
  * stale data in the destination file. Hence we reject these clone attempts with
  * -EINVAL in this case.
  */
-STATIC int
+int
 xfs_reflink_remap_prep(
struct file *file_in,
loff_t  pos_in,
@@ -1370,68 +1370,6 @@ xfs_reflink_remap_prep(
return ret;
 }
 
-/*
- * Link a range of blocks from one file to another.
- */
-loff_t
-xfs_reflink_remap_range(
-   struct file *file_in,
-   loff_t  pos_in,
-   struct file *file_out,
-   loff_t  pos_out,
-   loff_t  len,
-   unsigned int    remap_flags)
-{
-   struct inode

[PATCH 26/29] xfs: support returning partial reflink results

From: Darrick J. Wong 

Back when the XFS reflink code only supported clone_file_range, we were
only able to return zero or negative error codes to userspace.  However,
now that copy_file_range (which returns bytes copied) can use XFS'
clone_file_range, we have the opportunity to return partial results.
For example, if userspace sends a 1GB clone request and we run out of
space halfway through, we at least can tell userspace that we completed
512M of that request like a regular write.
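
A minimal user-space sketch of the return convention described above. The
names remap_chunk and remap_range are hypothetical stand-ins, not XFS
functions; the point is only that any progress takes priority over the error
code, so a failure halfway through still reports the bytes already remapped.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the per-extent remap step: it fails with
 * -ENOSPC once fewer than `chunk` bytes of `budget` remain.
 */
static int remap_chunk(int64_t *budget, int64_t chunk)
{
	if (*budget < chunk)
		return -ENOSPC;
	*budget -= chunk;
	return 0;
}

/*
 * Remap `len` bytes in steps of at most `chunk` bytes.  Returns the number
 * of bytes completed if any progress was made, otherwise the negative errno
 * (or 0 for an empty request).
 */
static int64_t remap_range(int64_t len, int64_t chunk, int64_t budget)
{
	int64_t remapped = 0;
	int ret = 0;

	while (remapped < len) {
		int64_t step = len - remapped < chunk ? len - remapped : chunk;

		ret = remap_chunk(&budget, step);
		if (ret)
			break;
		remapped += step;
	}

	return remapped > 0 ? remapped : ret;
}
```

With a 1GB request, 1MB steps and 512MB of budget, remap_range reports 512MB
rather than -ENOSPC, which mirrors the example in the commit message.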

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_file.c|5 +
 fs/xfs/xfs_reflink.c |   17 -
 fs/xfs/xfs_reflink.h |2 +-
 3 files changed, 14 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 38fde4e11714..7d42ab8fe6e1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -928,14 +928,11 @@ xfs_file_remap_range(
loff_t  len,
unsigned int remap_flags)
 {
-   int ret;
-
if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
return -EINVAL;
 
-   ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+   return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out,
len, remap_flags);
-   return ret < 0 ? ret : len;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 79dec457f7fb..4abb2aea8f31 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1122,13 +1122,15 @@ xfs_reflink_remap_blocks(
loff_t  pos_in,
struct xfs_inode*dest,
loff_t  pos_out,
-   loff_t  remap_len)
+   loff_t  remap_len,
+   loff_t  *remapped)
 {
struct xfs_bmbt_irec imap;
xfs_fileoff_t   srcoff;
xfs_fileoff_t   destoff;
xfs_filblks_t   len;
xfs_filblks_t   range_len;
+   xfs_filblks_t   remapped_len = 0;
xfs_off_t   new_isize = pos_out + remap_len;
int nimaps;
int error = 0;
@@ -1175,10 +1177,13 @@ xfs_reflink_remap_blocks(
srcoff += range_len;
destoff += range_len;
len -= range_len;
+   remapped_len += range_len;
}
 
if (error)
trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
+   *remapped = min_t(loff_t, remap_len,
+ XFS_FSB_TO_B(src->i_mount, remapped_len));
return error;
 }
 
@@ -1387,7 +1392,7 @@ xfs_reflink_remap_prep(
 /*
  * Link a range of blocks from one file to another.
  */
-int
+loff_t
 xfs_reflink_remap_range(
struct file *file_in,
loff_t  pos_in,
@@ -1401,8 +1406,9 @@ xfs_reflink_remap_range(
struct inode*inode_out = file_inode(file_out);
struct xfs_inode*dest = XFS_I(inode_out);
struct xfs_mount*mp = src->i_mount;
+   loff_t  remapped = 0;
xfs_extlen_t cowextsize;
-   ssize_t ret;
+   int ret;
 
if (!xfs_sb_version_hasreflink(&mp->m_sb))
return -EOPNOTSUPP;
@@ -1418,7 +1424,8 @@ xfs_reflink_remap_range(
 
trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
 
-   ret = xfs_reflink_remap_blocks(src, pos_in, dest, pos_out, len);
+   ret = xfs_reflink_remap_blocks(src, pos_in, dest, pos_out, len,
+   &remapped);
if (ret)
goto out_unlock;
 
@@ -1441,7 +1448,7 @@ xfs_reflink_remap_range(
xfs_reflink_remap_unlock(file_in, file_out);
if (ret)
trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
-   return ret;
+   return remapped > 0 ? remapped : ret;
 }
 
 /*
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index c3c46c276fe1..cbc26ff79a8f 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -27,7 +27,7 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
-extern int xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
+extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out, loff_t len,
unsigned int remap_flags);
 extern int xfs_reflink_inode_has_shared_extents(struct xfs_trans *tp,

[PATCH 25/29] xfs: clean up xfs_reflink_remap_blocks call site

From: Darrick J. Wong 

Move the offset <-> blocks unit conversions into
xfs_reflink_remap_blocks to make the call site less ugly.
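
A rough sketch of the conversions being moved, assuming a power-of-two block
size given as its log2 (blocklog). The helper names below are hypothetical
models of the truncating and round-up byte-to-block macros, not the XFS
macros themselves: start offsets are truncated to a block, while the length
is rounded up so a trailing partial block is still covered.

```c
#include <stdint.h>

/* Hypothetical model of a truncating bytes -> blocks conversion. */
static uint64_t bytes_to_blocks_trunc(uint64_t bytes, unsigned int blocklog)
{
	return bytes >> blocklog;
}

/* Hypothetical model of a round-up bytes -> blocks conversion. */
static uint64_t bytes_to_blocks_roundup(uint64_t bytes, unsigned int blocklog)
{
	return (bytes + ((uint64_t)1 << blocklog) - 1) >> blocklog;
}

/* Block-unit view of a byte-unit remap request. */
struct block_range {
	uint64_t srcoff;	/* first source block */
	uint64_t destoff;	/* first destination block */
	uint64_t len;		/* number of blocks to remap */
};

/* Do all three conversions in one place, as the callee now does. */
static struct block_range to_block_range(uint64_t pos_in, uint64_t pos_out,
					 uint64_t remap_len,
					 unsigned int blocklog)
{
	struct block_range r = {
		.srcoff = bytes_to_blocks_trunc(pos_in, blocklog),
		.destoff = bytes_to_blocks_trunc(pos_out, blocklog),
		.len = bytes_to_blocks_roundup(remap_len, blocklog),
	};

	return r;
}
```

The caller then passes byte offsets only and no longer needs block-unit
temporaries of its own.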

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_reflink.c |   37 ++---
 1 file changed, 18 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index e8e86646bb4b..79dec457f7fb 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1119,16 +1119,23 @@ xfs_reflink_remap_extent(
 STATIC int
 xfs_reflink_remap_blocks(
struct xfs_inode*src,
-   xfs_fileoff_t   srcoff,
+   loff_t  pos_in,
struct xfs_inode*dest,
-   xfs_fileoff_t   destoff,
-   xfs_filblks_t   len,
-   xfs_off_t   new_isize)
+   loff_t  pos_out,
+   loff_t  remap_len)
 {
struct xfs_bmbt_irec imap;
+   xfs_fileoff_t   srcoff;
+   xfs_fileoff_t   destoff;
+   xfs_filblks_t   len;
+   xfs_filblks_t   range_len;
+   xfs_off_t   new_isize = pos_out + remap_len;
int nimaps;
int error = 0;
-   xfs_filblks_t   range_len;
+
+   destoff = XFS_B_TO_FSBT(src->i_mount, pos_out);
+   srcoff = XFS_B_TO_FSBT(src->i_mount, pos_in);
+   len = XFS_B_TO_FSB(src->i_mount, remap_len);
 
/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
while (len) {
@@ -1143,7 +1150,7 @@ xfs_reflink_remap_blocks(
error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
xfs_iunlock(src, lock_mode);
if (error)
-   goto err;
+   break;
ASSERT(nimaps == 1);
 
trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
@@ -1157,11 +1164,11 @@ xfs_reflink_remap_blocks(
error = xfs_reflink_remap_extent(dest, &imap, destoff,
new_isize);
if (error)
-   goto err;
+   break;
 
if (fatal_signal_pending(current)) {
error = -EINTR;
-   goto err;
+   break;
}
 
/* Advance drange/srange */
@@ -1170,10 +1177,8 @@ xfs_reflink_remap_blocks(
len -= range_len;
}
 
-   return 0;
-
-err:
-   trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
+   if (error)
+   trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
return error;
 }
 
@@ -1396,8 +1401,6 @@ xfs_reflink_remap_range(
struct inode*inode_out = file_inode(file_out);
struct xfs_inode*dest = XFS_I(inode_out);
struct xfs_mount*mp = src->i_mount;
-   xfs_fileoff_t   sfsbno, dfsbno;
-   xfs_filblks_t   fsblen;
xfs_extlen_t cowextsize;
ssize_t ret;
 
@@ -1415,11 +1418,7 @@ xfs_reflink_remap_range(
 
trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
 
-   dfsbno = XFS_B_TO_FSBT(mp, pos_out);
-   sfsbno = XFS_B_TO_FSBT(mp, pos_in);
-   fsblen = XFS_B_TO_FSB(mp, len);
-   ret = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
-   pos_out + len);
+   ret = xfs_reflink_remap_blocks(src, pos_in, dest, pos_out, len);
if (ret)
goto out_unlock;

[PATCH 23/29] xfs: add a per-xfs trace_printk macro

From: Darrick J. Wong 

Add a "xfs_tprintk" macro so that developers can use trace_printk to
print out arbitrary debugging information with the XFS device name
attached to the trace output.
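
For illustration only, a user-space analogue of such a macro. struct
fake_mount and fake_tprintk are hypothetical names, and snprintf stands in
for trace_printk; like the macro in this patch, it relies on the GNU named
variadic argument and ## extensions, and on the format being a string
literal so it can be pasted onto the prefix.

```c
#include <stdio.h>

/* Hypothetical stand-in for the mount; only the device numbers matter. */
struct fake_mount {
	unsigned int major;
	unsigned int minor;
};

/*
 * Prefix every message with "dev MAJOR:MINOR ".  `fmt` must be a string
 * literal; `##args` drops the trailing comma when no arguments are given
 * (GNU extension, accepted by gcc and clang).
 */
#define fake_tprintk(buf, size, mp, fmt, args...) \
	snprintf((buf), (size), "dev %u:%u " fmt, \
		 (mp)->major, (mp)->minor, ##args)
```

A call such as fake_tprintk(buf, sizeof(buf), &m, "remapped %d blocks", 3)
formats the device prefix followed by the message into buf.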

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_error.h |6 ++
 1 file changed, 6 insertions(+)


diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 246d3e989c6c..5caa8bdf6c38 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -76,6 +76,11 @@ extern int xfs_errortag_set(struct xfs_mount *mp, unsigned int error_tag,
unsigned int tag_value);
 extern int xfs_errortag_add(struct xfs_mount *mp, unsigned int error_tag);
 extern int xfs_errortag_clearall(struct xfs_mount *mp);
+
+/* trace printk version of xfs_err and friends */
+#define xfs_tprintk(mp, fmt, args...) \
+   trace_printk("dev %d:%d " fmt, MAJOR((mp)->m_super->s_dev), \
+   MINOR((mp)->m_super->s_dev), ##args)
 #else
 #define xfs_errortag_init(mp)  (0)
 #define xfs_errortag_del(mp)
@@ -83,6 +88,7 @@ extern int xfs_errortag_clearall(struct xfs_mount *mp);
 #define xfs_errortag_set(mp, tag, val) (ENOSYS)
 #define xfs_errortag_add(mp, tag)  (ENOSYS)
 #define xfs_errortag_clearall(mp)  (ENOSYS)
+#define xfs_tprintk(mp, fmt, args...)  do { } while (0)
 #endif /* DEBUG */
 
 /*

[PATCH 21/29] ocfs2: support partial clone range and dedupe range

From: Darrick J. Wong 

Change the ocfs2 remap code to allow for returning partial results.

Signed-off-by: Darrick J. Wong 
---
 fs/ocfs2/file.c |7 +
 fs/ocfs2/refcounttree.c |   72 +--
 fs/ocfs2/refcounttree.h |   12 
 3 files changed, 46 insertions(+), 45 deletions(-)


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index fbaeafe44b5f..8125c5ccf821 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2531,14 +2531,11 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
 struct file *file_out, loff_t pos_out,
 loff_t len, unsigned int remap_flags)
 {
-   int ret;
-
if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
return -EINVAL;
 
-   ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-   len, remap_flags);
-   return ret < 0 ? ret : len;
+   return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
+   len, remap_flags);
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 7c709229e108..c7409578657b 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4507,14 +4507,14 @@ static int ocfs2_reflink_update_dest(struct inode *dest,
 }
 
 /* Remap the range pos_in:len in s_inode to pos_out:len in t_inode. */
-static int ocfs2_reflink_remap_extent(struct inode *s_inode,
- struct buffer_head *s_bh,
- loff_t pos_in,
- struct inode *t_inode,
- struct buffer_head *t_bh,
- loff_t pos_out,
- loff_t len,
- struct ocfs2_cached_dealloc_ctxt *dealloc)
+static loff_t ocfs2_reflink_remap_extent(struct inode *s_inode,
+struct buffer_head *s_bh,
+loff_t pos_in,
+struct inode *t_inode,
+struct buffer_head *t_bh,
+loff_t pos_out,
+loff_t len,
+struct ocfs2_cached_dealloc_ctxt *dealloc)
 {
struct ocfs2_extent_tree s_et;
struct ocfs2_extent_tree t_et;
@@ -4522,8 +4522,9 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode,
struct buffer_head *ref_root_bh = NULL;
struct ocfs2_refcount_tree *ref_tree;
struct ocfs2_super *osb;
+   loff_t remapped_bytes = 0;
loff_t pstart, plen;
-   u32 p_cluster, num_clusters, slast, spos, tpos;
+   u32 p_cluster, num_clusters, slast, spos, tpos, remapped_clus = 0;
unsigned int ext_flags;
int ret = 0;
 
@@ -4605,30 +4606,34 @@ static int ocfs2_reflink_remap_extent(struct inode *s_inode,
 next_loop:
spos += num_clusters;
tpos += num_clusters;
+   remapped_clus += num_clusters;
}
 
-out:
-   return ret;
+   goto out;
 out_unlock_refcount:
ocfs2_unlock_refcount_tree(osb, ref_tree, 1);
brelse(ref_root_bh);
-   return ret;
+out:
+   remapped_bytes = ocfs2_clusters_to_bytes(t_inode->i_sb, remapped_clus);
+   remapped_bytes = min_t(loff_t, len, remapped_bytes);
+
+   return remapped_bytes > 0 ? remapped_bytes : ret;
 }
 
 /* Set up refcount tree and remap s_inode to t_inode. */
-static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
- struct buffer_head *s_bh,
- loff_t pos_in,
- struct inode *t_inode,
- struct buffer_head *t_bh,
- loff_t pos_out,
- loff_t len)
+static loff_t ocfs2_reflink_remap_blocks(struct inode *s_inode,
+struct buffer_head *s_bh,
+loff_t pos_in,
+struct inode *t_inode,
+struct buffer_head *t_bh,
+loff_t pos_out,
+loff_t len)
 {
struct ocfs2_cached_dealloc_ctxt dealloc;
struct ocfs2_super *osb;
struct ocfs2_dinode *dis;
struct ocfs2_dinode *dit;
-   int ret;
+   loff_t ret;
 
osb = OCFS2_SB(s_inode->i_sb);
dis = (struct ocfs2_dinode *)s_bh->b_data;
@@ -4700,7 +4705,7 @@ static int ocfs2_reflink_remap_blocks(struct inode *s_inode,
/* Actually remap extents now. */
ret = ocfs2_

[PATCH 22/29] ocfs2: remove ocfs2_reflink_remap_range

From: Darrick J. Wong 

Since ocfs2_remap_file_range is a thin shell around
ocfs2_reflink_remap_range, move everything from the latter into the
former.

Signed-off-by: Darrick J. Wong 
---
 fs/ocfs2/file.c |   68 +++-
 fs/ocfs2/refcounttree.c |  113 +++
 fs/ocfs2/refcounttree.h |   24 +++---
 3 files changed, 102 insertions(+), 103 deletions(-)


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 8125c5ccf821..fe570824b991 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2531,11 +2531,75 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
 struct file *file_out, loff_t pos_out,
 loff_t len, unsigned int remap_flags)
 {
+   struct inode *inode_in = file_inode(file_in);
+   struct inode *inode_out = file_inode(file_out);
+   struct ocfs2_super *osb = OCFS2_SB(inode_in->i_sb);
+   struct buffer_head *in_bh = NULL, *out_bh = NULL;
+   bool same_inode = (inode_in == inode_out);
+   loff_t remapped = 0;
+   ssize_t ret;
+
if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
return -EINVAL;
+   if (!ocfs2_refcount_tree(osb))
+   return -EOPNOTSUPP;
+   if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb))
+   return -EROFS;
 
-   return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-   len, remap_flags);
+   /* Lock both files against IO */
+   ret = ocfs2_reflink_inodes_lock(inode_in, &in_bh, inode_out, &out_bh);
+   if (ret)
+   return ret;
+
+   /* Check file eligibility and prepare for block sharing. */
+   ret = -EINVAL;
+   if ((OCFS2_I(inode_in)->ip_flags & OCFS2_INODE_SYSTEM_FILE) ||
+   (OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
+   goto out_unlock;
+
+   ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
+   &len, remap_flags);
+   if (ret < 0 || len == 0)
+   goto out_unlock;
+
+   /* Lock out changes to the allocation maps and remap. */
+   down_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+   if (!same_inode)
+   down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
+ SINGLE_DEPTH_NESTING);
+
+   /* Zap any page cache for the destination file's range. */
+   truncate_inode_pages_range(&inode_out->i_data,
+  round_down(pos_out, PAGE_SIZE),
+  round_up(pos_out + len, PAGE_SIZE) - 1);
+
+   remapped = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in,
+   inode_out, out_bh, pos_out, len);
+   up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
+   if (!same_inode)
+   up_write(&OCFS2_I(inode_out)->ip_alloc_sem);
+   if (remapped < 0) {
+   ret = remapped;
+   mlog_errno(ret);
+   goto out_unlock;
+   }
+
+   /*
+* Empty the extent map so that we may get the right extent
+* record from the disk.
+*/
+   ocfs2_extent_map_trunc(inode_in, 0);
+   ocfs2_extent_map_trunc(inode_out, 0);
+
+   ret = ocfs2_reflink_update_dest(inode_out, out_bh, pos_out + len);
+   if (ret) {
+   mlog_errno(ret);
+   goto out_unlock;
+   }
+
+out_unlock:
+   ocfs2_reflink_inodes_unlock(inode_in, in_bh, inode_out, out_bh);
+   return remapped > 0 ? remapped : ret;
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index c7409578657b..dc66b80585ec 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4468,9 +4468,9 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 }
 
 /* Update destination inode size, if necessary. */
-static int ocfs2_reflink_update_dest(struct inode *dest,
-struct buffer_head *d_bh,
-loff_t newlen)
+int ocfs2_reflink_update_dest(struct inode *dest,
+ struct buffer_head *d_bh,
+ loff_t newlen)
 {
handle_t *handle;
int ret;
@@ -4621,13 +4621,13 @@ static loff_t ocfs2_reflink_remap_extent(struct inode *s_inode,
 }
 
 /* Set up refcount tree and remap s_inode to t_inode. */
-static loff_t ocfs2_reflink_remap_blocks(struct inode *s_inode,
-struct buffer_head *s_bh,
-loff_t pos_in,
-struct inode *t_inode,
-struct buffer_head *t_bh,
-loff_t pos_out,
-loff_t len)
+loff_t ocfs2_reflink_remap_blocks(struct inode *s_inode,
+

[PATCH 16/29] vfs: enable remap callers that can handle short operations

From: Darrick J. Wong 

Plumb in a remap flag that enables the filesystem remap handler to
shorten remapping requests for callers that can handle it.  Now
copy_file_range can report partial success (in case we run up against
alignment problems, resource limits, etc.).

We also enable CAN_SHORTEN for fideduperange to maintain existing
userspace-visible behavior where xfs/btrfs shorten the dedupe range to
avoid stale post-eof data exposure.
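
The length check this enables can be modelled in user space roughly as
follows. check_len is a hypothetical name, the flag values are copied from
this patch, and dest_size stands in for the destination file's i_size; treat
it as a sketch of the logic in the diff below, not the kernel function.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EBADE
#define EBADE 52	/* Linux value; fallback for non-Linux errno.h */
#endif

#define REMAP_FILE_DEDUP	(1 << 0)
#define REMAP_FILE_CAN_SHORTEN	(1 << 1)

/*
 * If the request ends mid-block, either shorten it to a block boundary
 * (when the caller allows that) or reject it.  `blocksize` must be a
 * power of two.  Returns 0 on success or a negative errno.
 */
static int check_len(int64_t pos_out, int64_t *len, int64_t dest_size,
		     int64_t blocksize, unsigned int remap_flags)
{
	int64_t blkmask = blocksize - 1;
	int64_t new_len = *len;

	if ((*len & blkmask) == 0)
		return 0;

	/* Dedupe, or a clone ending below the destination EOF, must align. */
	if ((remap_flags & REMAP_FILE_DEDUP) || pos_out + *len < dest_size)
		new_len &= ~blkmask;

	if (new_len == *len)
		return 0;

	if (remap_flags & REMAP_FILE_CAN_SHORTEN) {
		*len = new_len;
		return 0;
	}

	return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL;
}
```

A 5000-byte clone into the middle of a larger file is trimmed to 4096 bytes
when REMAP_FILE_CAN_SHORTEN is set and rejected with -EINVAL otherwise; the
same clone ending at or past the destination EOF is left untouched.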

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
---
 fs/read_write.c|   28 
 include/linux/fs.h |4 +++-
 mm/filemap.c   |   11 +++
 3 files changed, 30 insertions(+), 13 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index ea30666013b0..c0bcc1a20650 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1593,7 +1593,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 
cloned = file_in->f_op->remap_file_range(file_in, pos_in,
file_out, pos_out,
-   min_t(loff_t, MAX_RW_COUNT, len), 0);
+   min_t(loff_t, MAX_RW_COUNT, len),
+   REMAP_FILE_CAN_SHORTEN);
if (cloned > 0) {
ret = cloned;
goto done;
@@ -1721,6 +1722,8 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
  * can't meaningfully compare post-EOF contents.
  *
  * For clone we only link a partial EOF block above the destination file's EOF.
+ *
+ * Shorten the request if possible.
  */
 static int generic_remap_check_len(struct inode *inode_in,
   struct inode *inode_out,
@@ -1729,16 +1732,24 @@ static int generic_remap_check_len(struct inode *inode_in,
   unsigned int remap_flags)
 {
u64 blkmask = i_blocksize(inode_in) - 1;
+   loff_t new_len = *len;
 
if ((*len & blkmask) == 0)
return 0;
 
-   if (remap_flags & REMAP_FILE_DEDUP)
-   *len &= ~blkmask;
-   else if (pos_out + *len < i_size_read(inode_out))
-   return -EINVAL;
+   if ((remap_flags & REMAP_FILE_DEDUP) ||
+   pos_out + *len < i_size_read(inode_out))
+   new_len &= ~blkmask;
 
-   return 0;
+   if (new_len == *len)
+   return 0;
+
+   if (remap_flags & REMAP_FILE_CAN_SHORTEN) {
+   *len = new_len;
+   return 0;
+   }
+
+   return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL;
 }
 
 /*
@@ -2014,7 +2025,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 {
loff_t ret;
 
-   WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP));
+   WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP |
+REMAP_FILE_CAN_SHORTEN));
 
ret = mnt_want_write_file(dst_file);
if (ret)
@@ -2115,7 +2127,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 
deduped = vfs_dedupe_file_range_one(file, off, dst_file,
info->dest_offset, len,
-   0);
+   REMAP_FILE_CAN_SHORTEN);
if (deduped == -EBADE)
info->status = FILE_DEDUPE_RANGE_DIFFERS;
else if (deduped < 0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ea2c2f673ecb..0b750e3f8f20 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1726,14 +1726,16 @@ struct block_device_operations;
  * If it is called with len == 0 that means "remap to end of source file".
  *
  * REMAP_FILE_DEDUP: only remap if contents identical (i.e. deduplicate)
+ * REMAP_FILE_CAN_SHORTEN: caller can handle a shortened request
  */
 #define REMAP_FILE_DEDUP   (1 << 0)
+#define REMAP_FILE_CAN_SHORTEN (1 << 1)
 
 /*
  * These flags should be taken care of by the implementation (possibly using
  * vfs helpers) but can be ignored by the implementation.
  */
-#define REMAP_FILE_ADVISORY    (0)
+#define REMAP_FILE_ADVISORY    (REMAP_FILE_CAN_SHORTEN)
 
 struct iov_iter;
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e93269efafe..0731869541ce 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3052,8 +3052,7 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
bcount = ALIGN(size_in, bs) - pos_in;
} else {
if (!IS_ALIGNED(count, bs))
-   return -EINVAL;
-
+   count = ALIGN_DOWN(count, bs);
bcount = count;
}
 
@@ -3063,10 +3062,14 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
pos_out < pos_in + bcount)
return -EINVAL;
 
-   /* For now we don't support changing the length. */
-   if (*req_count !

[PATCH 19/29] ocfs2: truncate page cache for clone destination file before remapping

From: Darrick J. Wong 

When cloning blocks into another file, truncate the page cache before we
start remapping blocks so that concurrent reads wait for us to finish.

Signed-off-by: Darrick J. Wong 
---
 fs/ocfs2/refcounttree.c |   10 --
 1 file changed, 4 insertions(+), 6 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 46bbd315c39f..2a5c96bc9677 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4861,14 +4861,12 @@ int ocfs2_reflink_remap_range(struct file *file_in,
down_write_nested(&OCFS2_I(inode_out)->ip_alloc_sem,
  SINGLE_DEPTH_NESTING);
 
-   ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
-out_bh, pos_out, len);
-
/* Zap any page cache for the destination file's range. */
-   if (!ret)
-   truncate_inode_pages_range(&inode_out->i_data, pos_out,
-  PAGE_ALIGN(pos_out + len) - 1);
+   truncate_inode_pages_range(&inode_out->i_data, pos_out,
+  PAGE_ALIGN(pos_out + len) - 1);
 
+   ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
+out_bh, pos_out, len);
up_write(&OCFS2_I(inode_in)->ip_alloc_sem);
if (!same_inode)
up_write(&OCFS2_I(inode_out)->ip_alloc_sem);

[PATCH 20/29] ocfs2: fix pagecache truncation prior to reflink

From: Darrick J. Wong 

Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache.  Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them.  So, round the start offset
down and the end offset up to page boundaries.  We already wrote all
the dirty data so the larger range should be fine.
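
To make the rounding concrete, here is a small sketch assuming a 4096-byte
page. SKETCH_PAGE_SIZE, the rounding helpers and truncate_range are
hypothetical names, not kernel symbols; the result is the inclusive byte
range that would be handed to the page cache truncation.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for this example */

/* Inclusive byte range [start, end]. */
struct byte_range {
	int64_t start;
	int64_t end;
};

/* Both helpers require `align` to be a power of two. */
static int64_t round_down_pow2(int64_t x, int64_t align)
{
	return x & ~(align - 1);
}

static int64_t round_up_pow2(int64_t x, int64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Widen [pos_out, pos_out + len) outward to whole pages. */
static struct byte_range truncate_range(int64_t pos_out, int64_t len)
{
	struct byte_range r = {
		.start = round_down_pow2(pos_out, SKETCH_PAGE_SIZE),
		.end = round_up_pow2(pos_out + len, SKETCH_PAGE_SIZE) - 1,
	};

	return r;
}
```

For pos_out = 5120 and len = 1024 this gives 4096..8191, so the whole page is
dropped, whereas starting at the unrounded 5120 would leave the head of that
page in the cache.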

Signed-off-by: Darrick J. Wong 
---
 fs/ocfs2/refcounttree.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 2a5c96bc9677..7c709229e108 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4862,8 +4862,9 @@ int ocfs2_reflink_remap_range(struct file *file_in,
  SINGLE_DEPTH_NESTING);
 
/* Zap any page cache for the destination file's range. */
-   truncate_inode_pages_range(&inode_out->i_data, pos_out,
-  PAGE_ALIGN(pos_out + len) - 1);
+   truncate_inode_pages_range(&inode_out->i_data,
+  round_down(pos_out, PAGE_SIZE),
+  round_up(pos_out + len, PAGE_SIZE) - 1);
 
ret = ocfs2_reflink_remap_blocks(inode_in, in_bh, pos_in, inode_out,
 out_bh, pos_out, len);

[PATCH 13/29] vfs: make remap_file_range functions take and return bytes completed

From: Darrick J. Wong 

Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on.  This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value.  For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.
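
The two ends of the new convention can be sketched like this. Both helper
names are hypothetical; the first mirrors the "ret < 0 ? ret : len" adapters
added for filesystems that still work all-or-nothing, and the second mirrors
how a caller that cannot report short results (such as the clone ioctl path)
folds a byte count back into success or failure.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Filesystem side: turn a legacy all-or-nothing result (0 or -errno)
 * into "bytes completed".
 */
static int64_t legacy_to_bytes(int ret, int64_t len)
{
	return ret < 0 ? ret : len;
}

/*
 * Caller side: anything short of the full length is an error.  A request
 * length of 0 means "to end of source file", so there is nothing to
 * compare against in that case.
 */
static int bytes_to_all_or_nothing(int64_t cloned, int64_t len)
{
	if (cloned < 0)
		return (int)cloned;
	if (len && cloned != len)
		return -EINVAL;
	return 0;
}
```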

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
---
 Documentation/filesystems/vfs.txt |6 ++---
 fs/btrfs/ctree.h  |6 ++---
 fs/btrfs/ioctl.c  |   13 ++
 fs/cifs/cifsfs.c  |6 ++---
 fs/ioctl.c|   10 +++-
 fs/nfs/nfs4file.c |6 ++---
 fs/nfsd/vfs.c |8 +-
 fs/ocfs2/file.c   |   16 ++--
 fs/ocfs2/refcounttree.c   |2 +-
 fs/ocfs2/refcounttree.h   |2 +-
 fs/overlayfs/copy_up.c|6 ++---
 fs/overlayfs/file.c   |   12 +
 fs/read_write.c   |   49 -
 fs/xfs/xfs_file.c |9 +--
 fs/xfs/xfs_reflink.c  |4 ++-
 fs/xfs/xfs_reflink.h  |2 +-
 include/linux/fs.h|   27 +++-
 mm/filemap.c  |2 +-
 18 files changed, 106 insertions(+), 80 deletions(-)


diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index bb3183334ab9..8ba47d9d6cae 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -883,9 +883,9 @@ struct file_operations {
unsigned (*mmap_capabilities)(struct file *);
 #endif
ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
-   int (*remap_file_range)(struct file *file_in, loff_t pos_in,
-   struct file *file_out, loff_t pos_out,
-   u64 len, unsigned int remap_flags);
+   loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
+  struct file *file_out, loff_t pos_out,
+  loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
 };
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 124a05662fc2..771a961d77ad 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3247,9 +3247,9 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
  size_t num_pages, loff_t pos, size_t write_bytes,
  struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
-int btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
-  struct file *file_out, loff_t pos_out, u64 len,
-  unsigned int remap_flags);
+loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t len, unsigned int remap_flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index bfd99c66723e..b0c513e10977 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4328,10 +4328,12 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
return ret;
 }
 
-int btrfs_remap_file_range(struct file *src_file, loff_t off,
-   struct file *dst_file, loff_t destoff, u64 len,
+loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
+   struct file *dst_file, loff_t destoff, loff_t len,
unsigned int remap_flags)
 {
+   int ret;
+
if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
return -EINVAL;
 
@@ -4349,10 +4351,11 @@ int btrfs_remap_file_range(struct file *src_file, loff_t off,
return -EINVAL;
}
 
-   return btrfs_extent_same(src, off, len, dst, destoff);
+   ret = btrfs_extent_same(src, off, len, dst, destoff);
+   } else {
+   ret = btrfs_clone_files(dst_file, src_file, off, len, destoff);
}
-
-   return btrfs_clone_files(dst_file, src_file, off, len, destoff);
+   return ret < 0 ? ret : len;
 }
 
 static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index e8144d0dcde2..5ca71c6c8be2 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/c

[PATCH 12/29] vfs: remap helper should update destination inode metadata

From: Darrick J. Wong 

Extend generic_remap_file_range_prep to handle inode metadata updates
when remapping into a file.  If the operation can possibly alter the
file contents, we must update the ctime and mtime and remove security
privileges, just like we do for regular file writes.
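
As a decision table, the rule reads roughly as below. struct prep_actions and
remap_prep_actions are hypothetical names used only for this sketch; the
nocmtime argument stands in for the FMODE_NOCMTIME check in the diff that
follows.

```c
#include <stdbool.h>

#define REMAP_FILE_DEDUP (1 << 0)	/* value taken from this series */

/* Which metadata side effects the prep step should apply. */
struct prep_actions {
	bool update_times;	/* bump ctime/mtime on the destination */
	bool remove_privs;	/* drop setuid/setgid style privileges */
};

static struct prep_actions remap_prep_actions(unsigned int remap_flags,
					      bool nocmtime)
{
	struct prep_actions a = { false, false };

	/* Dedupe never changes file contents, so leave metadata alone. */
	if (remap_flags & REMAP_FILE_DEDUP)
		return a;

	/* A clone can change contents: behave like a regular write. */
	a.update_times = !nocmtime;
	a.remove_privs = true;
	return a;
}
```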

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
---
 fs/read_write.c  |   19 +++
 fs/xfs/xfs_reflink.c |   23 ---
 2 files changed, 19 insertions(+), 23 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index ebcbfc4f2907..b61bd3fc7154 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1820,6 +1820,25 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (ret)
return ret;
 
+   /* If we can't alter the file contents, we're done. */
+   if (!(remap_flags & REMAP_FILE_DEDUP)) {
+   /* Update the timestamps, since we can alter file contents. */
+   if (!(file_out->f_mode & FMODE_NOCMTIME)) {
+   ret = file_update_time(file_out);
+   if (ret)
+   return ret;
+   }
+
+   /*
+* Clear the security bits if the process is not being run by
+* root.  This keeps people from modifying setuid and setgid
+* binaries.
+*/
+   ret = file_remove_privs(file_out);
+   if (ret)
+   return ret;
+   }
+
return 1;
 }
 EXPORT_SYMBOL(generic_remap_file_range_prep);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 29aab196ce7e..2d7dd8b28d7c 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1372,29 +1372,6 @@ xfs_reflink_remap_prep(
truncate_inode_pages_range(&inode_out->i_data, pos_out,
   PAGE_ALIGN(pos_out + *len) - 1);
 
-   /* If we're altering the file contents... */
-   if (!(remap_flags & REMAP_FILE_DEDUP)) {
-   /*
-* ...update the timestamps (which will grab the ilock again
-* from xfs_fs_dirty_inode, so we have to call it before we
-* take the ilock).
-*/
-   if (!(file_out->f_mode & FMODE_NOCMTIME)) {
-   ret = file_update_time(file_out);
-   if (ret)
-   goto out_unlock;
-   }
-
-   /*
-* ...clear the security bits if the process is not being run
-* by root.  This keeps people from modifying setuid and setgid
-* binaries.
-*/
-   ret = file_remove_privs(file_out);
-   if (ret)
-   goto out_unlock;
-   }
-
return 1;
 out_unlock:
xfs_reflink_remap_unlock(file_in, file_out);

[PATCH 15/29] vfs: plumb remap flags through the vfs dedupe functions

From: Darrick J. Wong 

Plumb a remap_flags argument through the vfs_dedupe_file_range_one
functions so that dedupe can take advantage of it.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
---
 fs/overlayfs/file.c |3 ++-
 fs/read_write.c |9 ++---
 include/linux/fs.h  |2 +-
 3 files changed, 9 insertions(+), 5 deletions(-)


diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 0393815c8971..84dd957efa24 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -467,7 +467,8 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 
case OVL_DEDUPE:
ret = vfs_dedupe_file_range_one(real_in.file, pos_in,
-   real_out.file, pos_out, len);
+   real_out.file, pos_out, len,
+   flags);
break;
}
revert_creds(old_cred);
diff --git a/fs/read_write.c b/fs/read_write.c
index 0d1ac1b9bc22..ea30666013b0 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2010,10 +2010,12 @@ EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
 
 loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 struct file *dst_file, loff_t dst_pos,
-loff_t len)
+loff_t len, unsigned int remap_flags)
 {
loff_t ret;
 
+   WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP));
+
ret = mnt_want_write_file(dst_file);
if (ret)
return ret;
@@ -2044,7 +2046,7 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
}
 
ret = dst_file->f_op->remap_file_range(src_file, src_pos, dst_file,
-   dst_pos, len, REMAP_FILE_DEDUP);
+   dst_pos, len, remap_flags | REMAP_FILE_DEDUP);
 out_drop_write:
mnt_drop_write_file(dst_file);
 
@@ -2112,7 +2114,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
}
 
deduped = vfs_dedupe_file_range_one(file, off, dst_file,
-   info->dest_offset, len);
+   info->dest_offset, len,
+   0);
if (deduped == -EBADE)
info->status = FILE_DEDUPE_RANGE_DIFFERS;
else if (deduped < 0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bc78ad7e21b2..ea2c2f673ecb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1854,7 +1854,7 @@ extern int vfs_dedupe_file_range(struct file *file,
 struct file_dedupe_range *same);
 extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
struct file *dst_file, loff_t dst_pos,
-   loff_t len);
+   loff_t len, unsigned int remap_flags);
 
 
 struct super_operations {

[PATCH 14/29] vfs: plumb remap flags through the vfs clone functions

From: Darrick J. Wong 

Plumb a remap_flags argument through the {do,vfs}_clone_file_range
functions so that clone can take advantage of it.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
---
 fs/ioctl.c |2 +-
 fs/nfsd/vfs.c  |2 +-
 fs/overlayfs/copy_up.c |2 +-
 fs/overlayfs/file.c|6 +++---
 fs/read_write.c|   13 +
 include/linux/fs.h |4 ++--
 6 files changed, 17 insertions(+), 12 deletions(-)


diff --git a/fs/ioctl.c b/fs/ioctl.c
index 72537b68c272..505275ec5596 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -232,7 +232,7 @@ static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
if (src_file.file->f_path.mnt != dst_file->f_path.mnt)
goto fdput;
cloned = vfs_clone_file_range(src_file.file, off, dst_file, destoff,
- olen);
+ olen, 0);
if (cloned < 0)
ret = cloned;
else if (olen && cloned != olen)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ac6cb6101cbe..726fc5b2b27a 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -543,7 +543,7 @@ __be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst,
 {
loff_t cloned;
 
-   cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count);
+   cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count, 0);
if (count && cloned != count)
cloned = -EINVAL;
return nfserrno(cloned < 0 ? cloned : 0);
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 8750b7235516..5f82fece64a0 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -142,7 +142,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
}
 
/* Try to use clone_file_range to clone up within the same fs */
-   cloned = do_clone_file_range(old_file, 0, new_file, 0, len);
+   cloned = do_clone_file_range(old_file, 0, new_file, 0, len, 0);
if (cloned == len)
goto out;
/* Couldn't clone, so now we try to copy the data */
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 6c3fec6168e9..0393815c8971 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -462,7 +462,7 @@ static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in,
 
case OVL_CLONE:
ret = vfs_clone_file_range(real_in.file, pos_in,
-  real_out.file, pos_out, len);
+  real_out.file, pos_out, len, flags);
break;
 
case OVL_DEDUPE:
@@ -512,8 +512,8 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in,
  !ovl_inode_upper(file_inode(file_out))))
return -EPERM;
 
-   return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
-   op);
+   return ovl_copyfile(file_in, pos_in, file_out, pos_out, len,
+   remap_flags, op);
 }
 
 const struct file_operations ovl_file_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index 356641afa487..0d1ac1b9bc22 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1848,12 +1848,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 EXPORT_SYMBOL(generic_remap_file_range_prep);
 
 loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
-  struct file *file_out, loff_t pos_out, loff_t len)
+  struct file *file_out, loff_t pos_out,
+  loff_t len, unsigned int remap_flags)
 {
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
loff_t ret;
 
+   WARN_ON_ONCE(remap_flags);
+
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
@@ -1884,7 +1887,7 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
return ret;
 
ret = file_in->f_op->remap_file_range(file_in, pos_in,
-   file_out, pos_out, len, 0);
+   file_out, pos_out, len, remap_flags);
if (ret < 0)
return ret;
 
@@ -1895,12 +1898,14 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 EXPORT_SYMBOL(do_clone_file_range);
 
 loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in,
-   struct file *file_out, loff_t pos_out, loff_t len)
+   struct file *file_out, loff_t pos_out,
+   loff_t len, unsigned int remap_flags)
 {
loff_t ret;
 
file_start_write(file_out);
-   ret = do_clone_file_range(file_in, pos_in, file_out, pos_out, len);
+   ret = do_clone_file_range(file_in, pos_in, file_out, pos_out,

[PATCH 11/29] vfs: pass remap flags to generic_remap_checks

From: Darrick J. Wong 

Pass the same remap flags to generic_remap_checks for consistency.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
Reviewed-by: Christoph Hellwig 
---
 fs/read_write.c|2 +-
 include/linux/fs.h |2 +-
 mm/filemap.c   |4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 201381689284..ebcbfc4f2907 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1782,7 +1782,7 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 
/* Check that we don't violate system file offset limits. */
ret = generic_remap_checks(file_in, pos_in, file_out, pos_out, len,
-   (remap_flags & REMAP_FILE_DEDUP));
+   remap_flags);
if (ret)
return ret;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ecdfcb8b15ff..c3e807f1f022 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2981,7 +2981,7 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
-   uint64_t *count, bool is_dedupe);
+   uint64_t *count, unsigned int remap_flags);
 extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 08ad210fee49..b0f1f6d93d9c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3001,7 +3001,7 @@ EXPORT_SYMBOL(generic_write_checks);
  */
 int generic_remap_checks(struct file *file_in, loff_t pos_in,
 struct file *file_out, loff_t pos_out,
-uint64_t *req_count, bool is_dedupe)
+uint64_t *req_count, unsigned int remap_flags)
 {
struct inode *inode_in = file_in->f_mapping->host;
struct inode *inode_out = file_out->f_mapping->host;
@@ -3023,7 +3023,7 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
size_out = i_size_read(inode_out);
 
/* Dedupe requires both ranges to be within EOF. */
-   if (is_dedupe &&
+   if ((remap_flags & REMAP_FILE_DEDUP) &&
(pos_in >= size_in || pos_in + count > size_in ||
 pos_out >= size_out || pos_out + count > size_out))
return -EINVAL;
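
As an aside, the dedupe-only EOF rule that generic_remap_checks now keys off
the flag word can be modelled in plain userspace C.  This is only a sketch:
model_dedupe_range_check and MODEL_REMAP_FILE_DEDUP are hypothetical names
invented for illustration, and the flag value (1 << 0) is taken from the
REMAP_FILE_DEDUP definition quoted earlier in the thread.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for REMAP_FILE_DEDUP; value assumed to be (1 << 0). */
#define MODEL_REMAP_FILE_DEDUP	(1u << 0)

/*
 * Userspace model (not kernel code): a dedupe request must keep both the
 * source and the destination range inside their respective EOFs.  Requests
 * without the dedupe flag are not subject to this particular check.
 */
int model_dedupe_range_check(unsigned int remap_flags,
			     uint64_t pos_in, uint64_t size_in,
			     uint64_t pos_out, uint64_t size_out,
			     uint64_t count)
{
	if ((remap_flags & MODEL_REMAP_FILE_DEDUP) &&
	    (pos_in >= size_in || pos_in + count > size_in ||
	     pos_out >= size_out || pos_out + count > size_out))
		return -EINVAL;

	return 0;
}
```

Passing the whole flag word rather than a bool means the callee can test for
additional flags later without another signature change.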

[PATCH 10/29] vfs: pass remap flags to generic_remap_file_range_prep

From: Darrick J. Wong 

Plumb the remap flags through the filesystem from the vfs function
dispatcher all the way to the prep function to prepare for behavior
changes in subsequent patches.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
Reviewed-by: Christoph Hellwig 
---
 fs/ocfs2/file.c |2 +-
 fs/ocfs2/refcounttree.c |4 ++--
 fs/ocfs2/refcounttree.h |2 +-
 fs/read_write.c |   14 +++---
 fs/xfs/xfs_file.c   |2 +-
 fs/xfs/xfs_reflink.c|   21 +++--
 fs/xfs/xfs_reflink.h|3 ++-
 include/linux/fs.h  |2 +-
 8 files changed, 26 insertions(+), 24 deletions(-)


diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 0b757a24567c..9809b0e5746f 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2538,7 +2538,7 @@ static int ocfs2_remap_file_range(struct file *file_in,
return -EINVAL;
 
return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out,
-len, remap_flags & REMAP_FILE_DEDUP);
+len, remap_flags);
 }
 
 const struct inode_operations ocfs2_file_iops = {
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 36c56dfbe485..df9781567ec0 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4825,7 +4825,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
  struct file *file_out,
  loff_t pos_out,
  u64 len,
- bool is_dedupe)
+ unsigned int remap_flags)
 {
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -4851,7 +4851,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
goto out_unlock;
 
ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-   &len, is_dedupe);
+   &len, remap_flags);
if (ret <= 0)
goto out_unlock;
 
diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h
index 4af55bf4b35b..d2c5f526edff 100644
--- a/fs/ocfs2/refcounttree.h
+++ b/fs/ocfs2/refcounttree.h
@@ -120,6 +120,6 @@ int ocfs2_reflink_remap_range(struct file *file_in,
  struct file *file_out,
  loff_t pos_out,
  u64 len,
- bool is_dedupe);
+ unsigned int remap_flags);
 
 #endif /* OCFS2_REFCOUNTTREE_H */
diff --git a/fs/read_write.c b/fs/read_write.c
index 766bdcb381f3..201381689284 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1722,14 +1722,14 @@ static int generic_remap_check_len(struct inode *inode_in,
   struct inode *inode_out,
   loff_t pos_out,
   u64 *len,
-  bool is_dedupe)
+  unsigned int remap_flags)
 {
u64 blkmask = i_blocksize(inode_in) - 1;
 
if ((*len & blkmask) == 0)
return 0;
 
-   if (is_dedupe)
+   if (remap_flags & REMAP_FILE_DEDUP)
*len &= ~blkmask;
else if (pos_out + *len < i_size_read(inode_out))
return -EINVAL;
@@ -1747,7 +1747,7 @@ static int generic_remap_check_len(struct inode *inode_in,
  */
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
  struct file *file_out, loff_t pos_out,
- u64 *len, bool is_dedupe)
+ u64 *len, unsigned int remap_flags)
 {
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -1771,7 +1771,7 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
if (*len == 0) {
loff_t isize = i_size_read(inode_in);
 
-   if (is_dedupe || pos_in == isize)
+   if ((remap_flags & REMAP_FILE_DEDUP) || pos_in == isize)
return 0;
if (pos_in > isize)
return -EINVAL;
@@ -1782,7 +1782,7 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 
/* Check that we don't violate system file offset limits. */
ret = generic_remap_checks(file_in, pos_in, file_out, pos_out, len,
-   is_dedupe);
+   (remap_flags & REMAP_FILE_DEDUP));
if (ret)
return ret;
 
@@ -1804,7 +1804,7 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
/*
 * Check that the extents are the same.
 */
-   if (is_dedupe) {
+   if (remap_flags & REMAP_FILE_DEDUP) {
bool is_same = false;
 
ret = vfs_dedupe_file_range_compa

[PATCH 09/29] vfs: combine the clone and dedupe into a single remap_file_range

From: Darrick J. Wong 

Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation.  The differences between the two can
be made in the prep functions.
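
To illustrate the idea of one dispatch point serving both operations, here is
a small userspace model.  It is only a sketch under stated assumptions: every
name in it (model_file_ops, model_clone, model_dedupe, toy_remap) is
hypothetical, the flag value mirrors the (1 << 0) definition quoted in this
thread, and returning -EOPNOTSUPP for a missing handler in both paths is a
simplification; the kernel's dedupe path may report a different error.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for REMAP_FILE_DEDUP; value assumed to be (1 << 0). */
#define MODEL_REMAP_FILE_DEDUP	(1u << 0)

/* One operation replaces the former clone/dedupe pair. */
struct model_file_ops {
	int (*remap_file_range)(uint64_t pos_in, uint64_t pos_out,
				uint64_t len, unsigned int remap_flags);
};

/* Common dispatcher: the only difference between callers is the flag word. */
static int model_remap(const struct model_file_ops *ops, uint64_t pos_in,
		       uint64_t pos_out, uint64_t len, unsigned int flags)
{
	if (ops->remap_file_range == NULL)
		return -EOPNOTSUPP;	/* simplification, see lead-in */
	return ops->remap_file_range(pos_in, pos_out, len, flags);
}

int model_clone(const struct model_file_ops *ops, uint64_t pos_in,
		uint64_t pos_out, uint64_t len)
{
	return model_remap(ops, pos_in, pos_out, len, 0);
}

int model_dedupe(const struct model_file_ops *ops, uint64_t pos_in,
		 uint64_t pos_out, uint64_t len)
{
	return model_remap(ops, pos_in, pos_out, len, MODEL_REMAP_FILE_DEDUP);
}

/* Toy "filesystem" handler: returns 2 for dedupe and 1 for clone. */
int toy_remap(uint64_t pos_in, uint64_t pos_out, uint64_t len,
	      unsigned int remap_flags)
{
	(void)pos_in;
	(void)pos_out;
	(void)len;
	return (remap_flags & MODEL_REMAP_FILE_DEDUP) ? 2 : 1;
}
```

A filesystem that supports only one of the two operations would inspect
remap_flags at the top of its handler and reject the request it cannot serve.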

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
Reviewed-by: Christoph Hellwig 
---
 Documentation/filesystems/vfs.txt |   13 +--
 fs/btrfs/ctree.h  |8 ++-
 fs/btrfs/file.c   |3 +-
 fs/btrfs/ioctl.c  |   45 +++--
 fs/cifs/cifsfs.c  |   22 +++---
 fs/nfs/nfs4file.c |   10 ++--
 fs/ocfs2/file.c   |   24 +++-
 fs/overlayfs/file.c   |   30 ++---
 fs/read_write.c   |   18 +++
 fs/xfs/xfs_file.c |   23 ++-
 include/linux/fs.h|   20 +---
 11 files changed, 110 insertions(+), 106 deletions(-)


diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index a6c6a8af48a2..bb3183334ab9 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -883,8 +883,9 @@ struct file_operations {
unsigned (*mmap_capabilities)(struct file *);
 #endif
ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
-   int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
-   int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64);
+   int (*remap_file_range)(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   u64 len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
 };
 
@@ -960,11 +961,9 @@ otherwise noted.
 
   copy_file_range: called by the copy_file_range(2) system call.
 
-  clone_file_range: called by the ioctl(2) system call for FICLONERANGE and
-   FICLONE commands.
-
-  dedupe_file_range: called by the ioctl(2) system call for FIDEDUPERANGE
-   command.
+  remap_file_range: called by the ioctl(2) system call for FICLONERANGE and
+   FICLONE and FIDEDUPERANGE commands to remap file ranges.  Note that
+   a zero length implies "remap to end of source file".
 
   fadvise: possibly called by the fadvise64() system call.
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..124a05662fc2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3218,9 +3218,6 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
struct btrfs_ioctl_space_info *space);
 void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
   struct btrfs_ioctl_balance_args *bargs);
-int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
-   struct file *dst_file, loff_t dst_loff,
-   u64 olen);
 
 /* file.c */
 int __init btrfs_auto_defrag_init(void);
@@ -3250,8 +3247,9 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages,
  size_t num_pages, loff_t pos, size_t write_bytes,
  struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
-int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
-  struct file *file_out, loff_t pos_out, u64 len);
+int btrfs_remap_file_range(struct file *file_in, loff_t pos_in,
+  struct file *file_out, loff_t pos_out, u64 len,
+  unsigned int remap_flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2be00e873e92..9a963f061393 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3269,8 +3269,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = btrfs_compat_ioctl,
 #endif
-   .clone_file_range = btrfs_clone_file_range,
-   .dedupe_file_range = btrfs_dedupe_file_range,
+   .remap_file_range = btrfs_remap_file_range,
 };
 
 void __cold btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d60b6caf09e8..bfd99c66723e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3627,26 +3627,6 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
return ret;
 }
 
-int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
-   struct file *dst_file, loff_t dst_loff,
-   u64 olen)
-{
-   struct inode *src = file_inode(src_file);
-   struct inode *dst = file_inode(dst_file);
-   u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
-
-   if (WARN_ON_ONCE(bs < PAGE_

[PATCH 06/29] vfs: skip zero-length dedupe requests

From: Darrick J. Wong 

Don't bother calling the filesystem for a zero-length dedupe request;
we can return zero and exit.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
---
 fs/read_write.c |5 +
 1 file changed, 5 insertions(+)


diff --git a/fs/read_write.c b/fs/read_write.c
index 0f0a6efdd502..f5395d8da741 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2009,6 +2009,11 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
if (!dst_file->f_op->dedupe_file_range)
goto out_drop_write;
 
+   if (len == 0) {
+   ret = 0;
+   goto out_drop_write;
+   }
+
ret = dst_file->f_op->dedupe_file_range(src_file, src_pos,
dst_file, dst_pos, len);
 out_drop_write:

[PATCH 05/29] vfs: avoid problematic remapping requests into partial EOF block

From: Darrick J. Wong 

A deduplication data corruption is exposed in XFS and btrfs. It is
caused by extending the block match range to include the partial EOF
block, but then allowing unknown data beyond EOF to be considered a
"match" to data in the destination file because the comparison is only
made to the end of the source file. This corrupts the destination file
when the source extent is shared with it.

The VFS remapping prep functions only support whole block dedupe, but
we still need to appear to support whole file dedupe correctly.  Hence
if the dedupe request includes the last block of the source file, don't
include it in the actual dedupe operation. If the rest of the range
dedupes successfully, then reject the entire request.  A subsequent
patch will enable us to shorten dedupe requests correctly.

When reflinking sub-file ranges, a data corruption can occur when the
source file range includes a partial EOF block. This shares the unknown
data beyond EOF into the second file at a position inside EOF, exposing
stale data in the second file.

If the reflink request includes the last block of the source file, only
proceed with the reflink operation if it lands at or past the
destination file's current EOF. If it lands within the destination file
EOF, reject the entire request with -EINVAL and make the caller go the
hard way.  A subsequent patch will enable us to shorten reflink requests
correctly.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/read_write.c |   33 +
 1 file changed, 33 insertions(+)


diff --git a/fs/read_write.c b/fs/read_write.c
index 2456da3f8a41..0f0a6efdd502 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1708,6 +1708,34 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
 
return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
 }
+/*
+ * Ensure that we don't remap a partial EOF block in the middle of something
+ * else.  Assume that the offsets have already been checked for block
+ * alignment.
+ *
+ * For deduplication we always scale down to the previous block because we
+ * can't meaningfully compare post-EOF contents.
+ *
+ * For clone we only link a partial EOF block above the destination file's EOF.
+ */
+static int generic_remap_check_len(struct inode *inode_in,
+  struct inode *inode_out,
+  loff_t pos_out,
+  u64 *len,
+  bool is_dedupe)
+{
+   u64 blkmask = i_blocksize(inode_in) - 1;
+
+   if ((*len & blkmask) == 0)
+   return 0;
+
+   if (is_dedupe)
+   *len &= ~blkmask;
+   else if (pos_out + *len < i_size_read(inode_out))
+   return -EINVAL;
+
+   return 0;
+}
 
 /*
  * Check that the two inodes are eligible for cloning, the ranges make
@@ -1787,6 +1815,11 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
return -EBADE;
}
 
+   ret = generic_remap_check_len(inode_in, inode_out, pos_out, len,
+   is_dedupe);
+   if (ret)
+   return ret;
+
return 1;
 }
 EXPORT_SYMBOL(vfs_clone_file_prep);
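
For readers who want to experiment with the partial-EOF-block rule outside the
kernel, the helper above can be approximated in userspace.  This is a sketch
only: model_remap_check_len is a hypothetical name, and the block size and
destination file size are passed in directly instead of being read from
inodes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the partial-EOF-block rule (not kernel code).
 * blocksize must be a power of two; size_out is the destination file's
 * current size.  Offsets are assumed to be block aligned already.
 */
int model_remap_check_len(uint64_t blocksize, uint64_t pos_out,
			  uint64_t size_out, uint64_t *len, bool is_dedupe)
{
	uint64_t blkmask = blocksize - 1;

	assert(blocksize != 0 && (blocksize & blkmask) == 0);

	/* Whole-block requests need no adjustment. */
	if ((*len & blkmask) == 0)
		return 0;

	if (is_dedupe)
		*len &= ~blkmask;	/* round down to the previous block */
	else if (pos_out + *len < size_out)
		return -EINVAL;		/* partial block would land inside EOF */

	return 0;
}
```

With a 4096-byte block size, a dedupe request of 4196 bytes is trimmed to
4096, while a clone request of the same length is rejected unless it ends at
or beyond the destination's EOF.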

[PATCH 07/29] vfs: rename vfs_clone_file_prep to be more descriptive

From: Darrick J. Wong 

The vfs_clone_file_prep is a generic function to be called by filesystem
implementations only.  Rename the prefix to generic_ and make it more
clear that it applies to remap operations, not just clones.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
---
 fs/ocfs2/refcounttree.c |2 +-
 fs/read_write.c |8 
 fs/xfs/xfs_reflink.c|2 +-
 include/linux/fs.h  |6 +++---
 4 files changed, 9 insertions(+), 9 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 19e03936c5e1..36c56dfbe485 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4850,7 +4850,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
(OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
goto out_unlock;
 
-   ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
+   ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
&len, is_dedupe);
if (ret <= 0)
goto out_unlock;
diff --git a/fs/read_write.c b/fs/read_write.c
index f5395d8da741..aca75a97a695 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1745,9 +1745,9 @@ static int generic_remap_check_len(struct inode *inode_in,
  * Returns: 0 for "nothing to clone", 1 for "something to clone", or
  * the usual negative error code.
  */
-int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
-   struct file *file_out, loff_t pos_out,
-   u64 *len, bool is_dedupe)
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ u64 *len, bool is_dedupe)
 {
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
@@ -1822,7 +1822,7 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
 
return 1;
 }
-EXPORT_SYMBOL(vfs_clone_file_prep);
+EXPORT_SYMBOL(generic_remap_file_range_prep);
 
 int do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out, u64 len)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 281d5f53f2ec..a7757a128a78 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1326,7 +1326,7 @@ xfs_reflink_remap_prep(
if (IS_DAX(inode_in) || IS_DAX(inode_out))
goto out_unlock;
 
-   ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
+   ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
len, is_dedupe);
if (ret <= 0)
goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ba93a6e7dac4..55729e1c2e75 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1825,9 +1825,9 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
   loff_t, size_t, unsigned int);
-extern int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
-  struct file *file_out, loff_t pos_out,
-  u64 *count, bool is_dedupe);
+extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+struct file *file_out, loff_t pos_out,
+u64 *count, bool is_dedupe);
 extern int do_clone_file_range(struct file *file_in, loff_t pos_in,
   struct file *file_out, loff_t pos_out, u64 len);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,

[PATCH 08/29] vfs: rename clone_verify_area to remap_verify_area

From: Darrick J. Wong 

Since we use clone_verify_area for both clone and dedupe range checks,
rename the function to make it clear that it's for both.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
---
 fs/read_write.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index aca75a97a695..734c5661fb69 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1686,7 +1686,7 @@ SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
return ret;
 }
 
-static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
+static int remap_verify_area(struct file *file, loff_t pos, u64 len, bool write)
 {
struct inode *inode = file_inode(file);
 
@@ -1852,11 +1852,11 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in,
if (!file_in->f_op->clone_file_range)
return -EOPNOTSUPP;
 
-   ret = clone_verify_area(file_in, pos_in, len, false);
+   ret = remap_verify_area(file_in, pos_in, len, false);
if (ret)
return ret;
 
-   ret = clone_verify_area(file_out, pos_out, len, true);
+   ret = remap_verify_area(file_out, pos_out, len, true);
if (ret)
return ret;
 
@@ -1989,7 +1989,7 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
if (ret)
return ret;
 
-   ret = clone_verify_area(dst_file, dst_pos, len, true);
+   ret = remap_verify_area(dst_file, dst_pos, len, true);
if (ret < 0)
goto out_drop_write;
 
@@ -2051,7 +2051,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
if (!S_ISREG(src->i_mode))
goto out;
 
-   ret = clone_verify_area(file, off, len, false);
+   ret = remap_verify_area(file, off, len, false);
if (ret < 0)
goto out;
ret = 0;

[PATCH 04/29] vfs: strengthen checking of file range inputs to generic_remap_checks

From: Darrick J. Wong 

File range remapping, if allowed to run past the destination file's EOF,
is an optimization on a regular file write.  Regular file writes that
extend the file length are subject to various constraints which are not
checked by range cloning.

This is a correctness problem because we're never allowed to touch
ranges that the page cache can't support (s_maxbytes); we're not
supposed to deal with large offsets (MAX_NON_LFS) if O_LARGEFILE isn't
set; and we must obey resource limits (RLIMIT_FSIZE).

Therefore, add these checks to the new generic_remap_checks function so
that we curtail unexpected behavior.
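
The three limits named above can be modelled as a pure function in userspace.
This is a rough sketch rather than the kernel implementation: the struct and
function names are hypothetical, the limits are plain parameters instead of
being read from the task and superblock, and the model only returns -EFBIG
where the kernel would additionally raise SIGXFSZ.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NO_LIMIT	INT64_MAX	/* hypothetical stand-in for RLIM_INFINITY */

struct model_limits {
	int64_t fsize_limit;	/* models RLIMIT_FSIZE, or MODEL_NO_LIMIT */
	int64_t max_non_lfs;	/* models MAX_NON_LFS */
	int64_t maxbytes;	/* models the filesystem's s_maxbytes */
	bool largefile;		/* models O_LARGEFILE being set */
};

/* Clamp *count so that [pos, pos + *count) respects all three limits. */
int model_write_check_limits(const struct model_limits *lim, int64_t pos,
			     int64_t *count)
{
	assert(pos >= 0 && *count >= 0);

	/* Resource limit on file size. */
	if (lim->fsize_limit != MODEL_NO_LIMIT) {
		if (pos >= lim->fsize_limit)
			return -EFBIG;	/* the kernel also sends SIGXFSZ here */
		if (*count > lim->fsize_limit - pos)
			*count = lim->fsize_limit - pos;
	}

	/* Large offsets are only allowed with O_LARGEFILE. */
	if (!lim->largefile &&
	    (pos > lim->max_non_lfs || *count > lim->max_non_lfs - pos)) {
		if (pos >= lim->max_non_lfs)
			return -EFBIG;
		*count = lim->max_non_lfs - pos;
	}

	/* Never touch ranges the page cache cannot support. */
	if (pos >= lim->maxbytes)
		return -EFBIG;
	if (*count > lim->maxbytes - pos)
		*count = lim->maxbytes - pos;

	return 0;
}
```

The comparisons are written as "count > limit - pos" rather than
"pos + count > limit" so that the model cannot overflow on large inputs.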

Signed-off-by: Darrick J. Wong 
Reviewed-by: Amir Goldstein 
Reviewed-by: Christoph Hellwig 
---
 mm/filemap.c |   91 ++
 1 file changed, 59 insertions(+), 32 deletions(-)


diff --git a/mm/filemap.c b/mm/filemap.c
index 47e6bfd45a91..08ad210fee49 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2915,6 +2915,49 @@ struct page *read_cache_page_gfp(struct address_space *mapping,
 }
 EXPORT_SYMBOL(read_cache_page_gfp);
 
+static int generic_access_check_limits(struct file *file, loff_t pos,
+  loff_t *count)
+{
+   struct inode *inode = file->f_mapping->host;
+
+   /* Don't exceed the LFS limits. */
+   if (unlikely(pos + *count > MAX_NON_LFS &&
+   !(file->f_flags & O_LARGEFILE))) {
+   if (pos >= MAX_NON_LFS)
+   return -EFBIG;
+   *count = min(*count, (loff_t)MAX_NON_LFS - pos);
+   }
+
+   /*
+* Don't operate on ranges the page cache doesn't support.
+*
+* If we have written data it becomes a short write.  If we have
+* exceeded without writing data we send a signal and return EFBIG.
+* Linus frestrict idea will clean these up nicely..
+*/
+   if (unlikely(pos >= inode->i_sb->s_maxbytes))
+   return -EFBIG;
+
+   *count = min(*count, inode->i_sb->s_maxbytes - pos);
+   return 0;
+}
+
+static int generic_write_check_limits(struct file *file, loff_t pos,
+ loff_t *count)
+{
+   unsigned long limit = rlimit(RLIMIT_FSIZE);
+
+   if (limit != RLIM_INFINITY) {
+   if (pos >= limit) {
+   send_sig(SIGXFSZ, current, 0);
+   return -EFBIG;
+   }
+   *count = min(*count, (loff_t)limit - pos);
+   }
+
+   return generic_access_check_limits(file, pos, count);
+}
+
 /*
  * Performs necessary checks before doing a write
  *
@@ -2926,8 +2969,8 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 {
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
-   unsigned long limit = rlimit(RLIMIT_FSIZE);
-   loff_t pos;
+   loff_t count;
+   int ret;
 
if (!iov_iter_count(from))
return 0;
@@ -2936,40 +2979,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
 
-   pos = iocb->ki_pos;
-
if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
return -EINVAL;
 
-   if (limit != RLIM_INFINITY) {
-   if (iocb->ki_pos >= limit) {
-   send_sig(SIGXFSZ, current, 0);
-   return -EFBIG;
-   }
-   iov_iter_truncate(from, limit - (unsigned long)pos);
-   }
+   count = iov_iter_count(from);
+   ret = generic_write_check_limits(file, iocb->ki_pos, &count);
+   if (ret)
+   return ret;
 
-   /*
-* LFS rule
-*/
-   if (unlikely(pos + iov_iter_count(from) > MAX_NON_LFS &&
-   !(file->f_flags & O_LARGEFILE))) {
-   if (pos >= MAX_NON_LFS)
-   return -EFBIG;
-   iov_iter_truncate(from, MAX_NON_LFS - (unsigned long)pos);
-   }
-
-   /*
-* Are we about to exceed the fs block limit ?
-*
-* If we have written data it becomes a short write.  If we have
-* exceeded without writing data we send a signal and return EFBIG.
-* Linus frestrict idea will clean these up nicely..
-*/
-   if (unlikely(pos >= inode->i_sb->s_maxbytes))
-   return -EFBIG;
-
-   iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
+   iov_iter_truncate(from, count);
return iov_iter_count(from);
 }
 EXPORT_SYMBOL(generic_write_checks);
@@ -2991,6 +3009,7 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
uint64_t bcount;
loff_t size_in, size_out;
loff_t bs = inode_out->i_sb->s_blocksize;
+   int ret;
 
/* The start of both ranges must be aligned to an fs block. */

[PATCH 03/29] vfs: exit early from zero length remap operations

From: Darrick J. Wong 

If a remap caller asks us to remap to the source file's EOF and the
source file length leaves us with a zero byte request, exit early.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/read_write.c |2 ++
 1 file changed, 2 insertions(+)


diff --git a/fs/read_write.c b/fs/read_write.c
index d6e8e242a15f..2456da3f8a41 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1748,6 +1748,8 @@ int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
if (pos_in > isize)
return -EINVAL;
*len = isize - pos_in;
+   if (*len == 0)
+   return 0;
}
 
/* Check that we don't violate system file offset limits. */
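
The zero-length convention handled here ("dedupe of nothing is a no-op, clone
of length zero means clone to the source's EOF") can be sketched in userspace
as follows.  model_resolve_len is a hypothetical name, and the model covers
only the length resolution; the real prep function performs many further
checks before it reports that there is something to remap.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model (not kernel code) of zero-length request handling.
 * Returns 0 for "nothing to do", 1 for "something to remap", or -EINVAL.
 * On a zero-length clone, *len is rewritten to reach the source's EOF.
 */
int model_resolve_len(int64_t pos_in, int64_t isize, uint64_t *len,
		      bool is_dedupe)
{
	if (*len == 0) {
		/* Zero length dedupe exits immediately; clone goes to EOF. */
		if (is_dedupe || pos_in == isize)
			return 0;
		if (pos_in > isize)
			return -EINVAL;
		*len = (uint64_t)(isize - pos_in);
		if (*len == 0)
			return 0;
	}

	return 1;
}
```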

[PATCH 02/29] vfs: check file ranges before cloning files

From: Darrick J. Wong 

Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location.  This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
---
 fs/ocfs2/refcounttree.c |2 +
 fs/read_write.c |   55 +
 fs/xfs/xfs_reflink.c|2 +
 include/linux/fs.h  |9 --
 mm/filemap.c|   69 +++
 5 files changed, 90 insertions(+), 47 deletions(-)


diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 7a5ee145c733..19e03936c5e1 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4850,7 +4850,7 @@ int ocfs2_reflink_remap_range(struct file *file_in,
(OCFS2_I(inode_out)->ip_flags & OCFS2_INODE_SYSTEM_FILE))
goto out_unlock;
 
-   ret = vfs_clone_file_prep_inodes(inode_in, pos_in, inode_out, pos_out,
+   ret = vfs_clone_file_prep(file_in, pos_in, file_out, pos_out,
&len, is_dedupe);
if (ret <= 0)
goto out_unlock;
diff --git a/fs/read_write.c b/fs/read_write.c
index 260797b01851..d6e8e242a15f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1717,13 +1717,12 @@ static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
  * Returns: 0 for "nothing to clone", 1 for "something to clone", or
  * the usual negative error code.
  */
-int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
-  struct inode *inode_out, loff_t pos_out,
-  u64 *len, bool is_dedupe)
+int vfs_clone_file_prep(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   u64 *len, bool is_dedupe)
 {
-   loff_t bs = inode_out->i_sb->s_blocksize;
-   loff_t blen;
-   loff_t isize;
+   struct inode *inode_in = file_inode(file_in);
+   struct inode *inode_out = file_inode(file_out);
bool same_inode = (inode_in == inode_out);
int ret;
 
@@ -1740,10 +1739,10 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
return -EINVAL;
 
-   isize = i_size_read(inode_in);
-
/* Zero length dedupe exits immediately; reflink goes to EOF. */
if (*len == 0) {
+   loff_t isize = i_size_read(inode_in);
+
if (is_dedupe || pos_in == isize)
return 0;
if (pos_in > isize)
@@ -1751,36 +1750,11 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
*len = isize - pos_in;
}
 
-   /* Ensure offsets don't wrap and the input is inside i_size */
-   if (pos_in + *len < pos_in || pos_out + *len < pos_out ||
-   pos_in + *len > isize)
-   return -EINVAL;
-
-   /* Don't allow dedupe past EOF in the dest file */
-   if (is_dedupe) {
-   loff_t  disize;
-
-   disize = i_size_read(inode_out);
-   if (pos_out >= disize || pos_out + *len > disize)
-   return -EINVAL;
-   }
-
-   /* If we're linking to EOF, continue to the block boundary. */
-   if (pos_in + *len == isize)
-   blen = ALIGN(isize, bs) - pos_in;
-   else
-   blen = *len;
-
-   /* Only reflink if we're aligned to block boundaries */
-   if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
-   !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
-   return -EINVAL;
-
-   /* Don't allow overlapped reflink within the same file */
-   if (same_inode) {
-   if (pos_out + blen > pos_in && pos_out < pos_in + blen)
-   return -EINVAL;
-   }
+   /* Check that we don't violate system file offset limits. */
+   ret = generic_remap_checks(file_in, pos_in, file_out, pos_out, len,
+   is_dedupe);
+   if (ret)
+   return ret;
 
/* Wait for the completion of any pending IOs on both files */
inode_dio_wait(inode_in);
@@ -1813,7 +1787,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 
return 1;
 }
-EXPORT_SYMBOL(vfs_clone_file_prep_inodes);
+EXPORT_SYMBOL(vfs_clone_file_prep);
 
 int do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out, u64 len)
@@ -1851,9 +1825,6 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;
 
-   if (pos_in + len > i_size_read(inode_in))
-

[PATCH v6 00/29] fs: fixes for serious clone/dedupe problems

Hi all,

Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
reflink implementation, and tracked it down to reflink forgetting to do
some of the file-extending activities that must happen for regular
writes.

We then started auditing the clone, dedupe, and copyfile code and
realized that from a file contents perspective, clonerange isn't any
different from a regular file write. Unfortunately, we also noticed
that *unlike* a regular write, clonerange skips a ton of overflow
checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
and RLIMIT_FSIZE. We also observed that cloning into a file did not
strip security privileges (suid, capabilities) like a regular write
would. I also noticed that xfs and ocfs2 need to dump the page cache
before remapping blocks, not after.
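
As a rough illustration of the kind of range validation described above, here is a minimal userspace sketch. It is not the kernel's generic_remap_checks(); the helper name, the single max_bytes limit (standing in for something like s_maxbytes), and the choice of error codes are all assumptions made for the example.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of remap range validation.
 * max_bytes stands in for a file size limit such as s_maxbytes.
 * Returns 0 on success (possibly shortening *len) or a negative errno.
 */
static int sketch_remap_checks(uint64_t pos_in, uint64_t isize,
                               uint64_t pos_out, uint64_t max_bytes,
                               uint64_t *len)
{
    /* Reject ranges whose end offset would wrap around. */
    if (*len > UINT64_MAX - pos_in || *len > UINT64_MAX - pos_out)
        return -EINVAL;

    /* The source range must lie inside the source file. */
    if (pos_in + *len > isize)
        return -EINVAL;

    /* A destination starting at or past the limit cannot be written. */
    if (pos_out >= max_bytes)
        return -EFBIG;

    /* Otherwise clamp the request so that it ends at the limit. */
    if (*len > max_bytes - pos_out)
        *len = max_bytes - pos_out;

    return 0;
}
```

Because the request can be clamped, a caller of such a helper has to look at the updated length afterwards, which is why reporting short operations back to userspace becomes useful.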

In fixing the range checking problems I also realized that both dedupe
and copyfile tell userspace how much of the requested operation was
acted upon. Since the range validation can shorten a clone request (or
we can ENOSPC midway through), we might as well plumb the short
operation reporting back through the VFS indirection code to userspace.
I added a few more cleanups to the xfs code per reviewer suggestions.

So, here's the whole giant pile of patches[1] that fix all the problems.
This branch is against current upstream (4.19-rc8). The patch
"generic: test reflink side effects" recently sent to fstests exercises
the fixes in this series. Tests are in [2].

--D

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

[PATCH 01/29] vfs: vfs_clone_file_prep_inodes should return EINVAL for a clone from beyond EOF

From: Darrick J. Wong 

vfs_clone_file_prep_inodes must not return 0 when it is asked to remap from
a zero-byte file; it should return -EINVAL instead, because that's what
btrfs does.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/read_write.c |3 ---
 1 file changed, 3 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index 8a2737f0d61d..260797b01851 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1740,10 +1740,7 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
return -EINVAL;
 
-   /* Are we going all the way to the end? */
isize = i_size_read(inode_in);
-   if (isize == 0)
-   return 0;
 
/* Zero length dedupe exits immediately; reflink goes to EOF. */
if (*len == 0) {

Re: [PATCH] btrfs: delayed-ref: extract find_first_ref_head from find_ref_head

On Mon, Oct 15, 2018 at 02:25:38PM +0800, Lu Fengqi wrote:
> The find_ref_head shouldn't return the first entry even if no exact match
> is found. So move the hidden behavior to higher level.
> 
> Besides, remove the useless local variables in the btrfs_select_ref_head.
> 
> Signed-off-by: Lu Fengqi 

Added to misc-next, thanks.

Conversion to btrfs raid1 profile on added ext device renders some systems unable to boot into converted rootfs

2018-10-17 Thread Tony Prokott

Good day. My technical trouble seems to be beyond the scope of the active
helpers on Debian's IRC support channel. It seems reasonable to suppose that it
is quite particular to the development stage of the btrfs infrastructure in the
4.17.xxx backport kernels and userland tools available on Debian 9.5 stretch as
well as buster, the testing suite to be released in the next several months as
10.0 stable.

 > / # uname -a; lsb_release -a
 > Linux localhost 4.17.0-0.bpo.3-amd64 #1 SMP Debian 4.17.17-1~bpo9+1 
 > (2018-08-27) x86_64 GNU/Linux
 > Distributor ID: LinuxMint
 > Description: LMDE 3 Cindy
 > Release: 3
 > Codename: cindy
 > 
 > / # btrfs --version
 > btrfs-progs v4.7.3
 > 
 > / # btrfs fi sh
 > Label: 'sys'  uuid: [snip]
 > Total devices 2 FS bytes used 24.07GiB
 > devid1 size 401.59GiB used 26.03GiB path /dev/sda2
 > devid2 size 401.76GiB used 26.03GiB path /dev/sdc1
 > 
 > / # btrfs fi df /
 > Data, RAID1: total=24.00GiB, used=23.27GiB
 > System, RAID1: total=32.00MiB, used=16.00KiB
 > Metadata, RAID1: total=2.00GiB, used=820.00MiB
 > GlobalReserve, single: total=69.17MiB, used=0.00B
 > 
 > / # btrfs su li -ta /
 > ID   gen top level   path
 > --   --- -   
 > 260  115103  5   /d9
 > 261  115103  5   /d10
 > 262  123876  5   /home
 > 263  115148  261 /d10/@
 > 264  115136  261 /d10/@home
 > 443  123874  447 /md3/@
 > 444  123876  447 /md3/@home
 > 447  115103  5   /md3
 > 451  115144  260 /d9/@
 > 452  115136  260 /d9/@home

Providing no dmesg content so far, as it doesn't bear on the kind of difficulty 
in question. My system requires expert help now to restore bootability to 2 of 
its OS installations; it has a btrfs root file system in subvolumes for 
stretch, buster, and LMDE3(cindy) which derives directly from stretch and so 
has most core elements if not cfg defaults in common; even kernel versions are 
alike, besides buster. subvolid=262 is a  /home fs shared among  linux distros; 
451, 263, and 443 are rootfs for stretch, buster and cindy respectively.

All 3 installations had been booting and running fine when data block group 
profile was "single" on an internal sata HDD /dev/sda2; then an external usb3 
drive enclosure's sata HDD partition /dev/sdc1, also of size ~0.4TiB, was added 
and balanced as btrfs "raid1"; raid conversion did not damage subvolume content 
or filesystem integrity afaict, but rather rendered stretch and buster 
unbootable (more to follow), whereas cindy carried on without hiccup.

At first it seemed as though the initrd's might be missing a module or so, to 
allow access to external drives -- i.e. grub starts the unbootable 
kernel/initrd but drops to busybox prompt right away without starting external 
drives, referring to allegedly "missing" btrfs device's UUID_SUB.

But after chrooting to run update-initramfs and cataloging the resulting image
content, usb_storage and uas were already present under /lib/modules/xxx, and
the failing systems still just drop to busybox without a real rootfs rather
than launching systemd. I even tried the kernel option "rootwait", which had no
effect on access to external storage; udev still seems not to have noticed the
external drives once busybox had control.

I could list all initrd modules present in cindy & absent for others, but need 
better knowledge than my reasonable guesses of what's required to make btrfs 
volume companion devices cooperate at boot time, as initrd transitions to 
steady state rootfs.

What would be a more practical diagnostic? Could stretch & buster initrd's 
somehow be failing to do a btrfs device scan at the proper moment? Not so 
interested in giving up on btrfs software raid so early in the game.

thanks in advance-
TP [not a list subscriber]

Re: [PATCH 25/26] xfs: support returning partial reflink results

On Wed, Oct 17, 2018 at 01:40:02AM -0700, Christoph Hellwig wrote:
> > @@ -1415,11 +1419,17 @@ xfs_reflink_remap_range(
> >  
> > trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> >  
> > +   if (len == 0) {
> > +   ret = 0;
> > +   goto out_unlock;
> > +   }
> > +
> 
> As pointed out last time this check is superfluous, right above we have
> this check:
> 
>   if (ret < 0 || len == 0)
>   return ret;

Oops, sorry I missed that, will fix now.

> > ret = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
> > -   pos_out + len);
> > +   &remappedfsb, pos_out + len);
> > +   remapped_bytes = min_t(loff_t, len, XFS_FSB_TO_B(mp, remappedfsb));
> 
> I still think returning the bytes from the function would be saner,
> but maybe that's just me.

Hmmm, this call site is getting messy; I'll tack on another patch to
clean that up too.

--D

Re: [PATCH 17/26] vfs: enable remap callers that can handle short operations

On Wed, Oct 17, 2018 at 01:36:52AM -0700, Christoph Hellwig wrote:
> >  /* Update inode timestamps and remove security privileges when remapping. */
> > @@ -2023,7 +2034,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
> >  {
> > loff_t ret;
> >  
> > -   WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP));
> > +   WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP |
> > +REMAP_FILE_CAN_SHORTEN));
> 
> I guess this is where you could actually use REMAP_FILE_VALID_FLAGS..
> 
> >  /* REMAP_FILE flags taken care of by the vfs. */
> > -#define REMAP_FILE_ADVISORY(0)
> > +#define REMAP_FILE_ADVISORY(REMAP_FILE_CAN_SHORTEN)
> 
> And btw, they are not 'taken care of by the VFS', they need to be
> taken care of by the fs (possibly using helpers) to take effect,
> but they can be safely ignored.

Ok, I'll update the comment.

> > +   if (!IS_ALIGNED(count, bs)) {
> > +   if (remap_flags & REMAP_FILE_CAN_SHORTEN)
> > +   count = ALIGN_DOWN(count, bs);
> > +   else
> > +   return -EINVAL;
> 
>   if (!(remap_flags & REMAP_FILE_CAN_SHORTEN))
>   return -EINVAL;
>   count = ALIGN_DOWN(count, bs);

Seeing as we return EINVAL on shortened count and !CAN_SHORTEN below
this, I think this can be simplified further:

if (pos_in + count == size_in) {
bcount = ALIGN(size_in, bs) - pos_in;
} else {
if (!IS_ALIGNED(count, bs))
count = ALIGN_DOWN(count, bs);
bcount = count;
}

--D
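
The simplified alignment logic above can be modelled in userspace as follows. This is only a sketch: the helper name, the macro stand-ins for the kernel's ALIGN()/ALIGN_DOWN(), and the boolean can_shorten parameter (standing in for REMAP_FILE_CAN_SHORTEN) are assumptions, and the block size is assumed to be a power of two.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for ALIGN_DOWN()/ALIGN(); a must be a power of two. */
#define SKETCH_ALIGN_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))
#define SKETCH_ALIGN(x, a)      SKETCH_ALIGN_DOWN((x) + (uint64_t)(a) - 1, (a))

/*
 * Hypothetical helper: compute the block-aligned length (bcount) for a remap
 * request. If the request ends at EOF, round up to the block boundary;
 * otherwise round the byte count down. A shortened count is only accepted
 * when the caller said it can handle short operations.
 */
static int sketch_align_count(uint64_t pos_in, uint64_t size_in, uint64_t bs,
                              int can_shorten, uint64_t *count,
                              uint64_t *bcount)
{
    uint64_t new_count = *count;
    uint64_t new_bcount;

    if (pos_in + new_count == size_in) {
        /* Linking to EOF: continue to the block boundary. */
        new_bcount = SKETCH_ALIGN(size_in, bs) - pos_in;
    } else {
        new_count = SKETCH_ALIGN_DOWN(new_count, bs);
        new_bcount = new_count;
    }

    /* Refuse to shorten unless the caller opted in. */
    if (new_count != *count && !can_shorten)
        return -EINVAL;

    *count = new_count;
    *bcount = new_bcount;
    return 0;
}
```

With a 4096-byte block size, a 10000-byte request that ends exactly at EOF keeps its count and gets a bcount of 12288, while a 5000-byte request in the middle of a larger file is either shortened to 4096 or rejected.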

Re: Urgent: Need BTRFS-Expert

Hello Qu,
Hello Hugo,

I got this result when I tried to recover the chunk tree.

# btrfs check /dev/mapper/vg0-virtualbox
bytenr mismatch, want=1263835381760, have=0
ERROR: cannot open file system


# btrfs rescue chunk-recover -y  /dev/mapper/vg0-virtualbox
Scanning: DONE in dev0
open with broken chunk error
Chunk tree recovery failed

Do you have any clue?


Michael




Re: [PATCH] Btrfs: fix assertion on fsync of regular file when using no-holes feature

On Mon, Oct 15, 2018 at 09:51:00AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When using the NO_HOLES feature and logging a regular file, we were
> expecting that if we find an inline extent, that either its size in ram
> (uncompressed and unencoded) matches the size of the file or if it does
> not, that it matches the sector size and it represents compressed data.
> This assertion does not cover a case where the length of the inline extent
> is smaller than the sector size and also smaller than the file's size; such
> a case is possible through fallocate. Example:
> 
>   $ mkfs.btrfs -f -O no-holes /dev/sdb
>   $ mount /dev/sdb /mnt
> 
>   $ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
>   $ xfs_io -c "falloc 40 40" /mnt/foobar
>   $ xfs_io -c "fsync" /mnt/foobar
> 
> In the above example we trigger the assertion because the inline extent's
> length is 21 bytes while the file size is 80 bytes. The fallocate() call
> merely updated the file's size and did not touch the existing inline
> extent, as expected.
> 
> So fix this by adjusting the assertion so that an inline extent length
> smaller than the file size is valid if the file size is smaller than the
> filesystem's sector size.
> 
> A test case for fstests follows soon.
> 
> Reported-by: Anatoly Trosinenko 
> Link: 
> https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_...@mail.gmail.com/
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.
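
The relaxed assertion described in the quoted changelog can be sketched as a pair of predicates. This is a hedged reading of the changelog text, not the actual btrfs code: the function names are hypothetical, and the 4096-byte sector size mentioned in the note below is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old expectation, per the changelog: the inline extent length matches the
 * file size, or it matches the sector size and holds compressed data.
 */
static bool sketch_inline_len_ok_old(uint64_t len, uint64_t i_size,
                                     uint64_t sectorsize, bool compressed)
{
    return len == i_size || (len == sectorsize && compressed);
}

/*
 * Relaxed expectation: additionally accept an inline extent shorter than the
 * file when the file itself is smaller than one sector.
 */
static bool sketch_inline_len_ok_new(uint64_t len, uint64_t i_size,
                                     uint64_t sectorsize, bool compressed)
{
    return sketch_inline_len_ok_old(len, i_size, sectorsize, compressed) ||
           (len < i_size && i_size < sectorsize);
}
```

For the reproducer in the changelog (a 21-byte inline extent in an 80-byte file), the old predicate is false and the new one is true when the sector size is 4096.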

Re: [PATCH v4 0/4] btrfs: Refactor find_free_extent()

On Wed, Oct 17, 2018 at 02:56:02PM +0800, Qu Wenruo wrote:
> Can be fetched from github:
> https://github.com/adam900710/linux/tree/refactor_find_free_extent
> 
> Which is based on v4.19-rc1.
> 
> extent-tree.c::find_free_extent() could be one of the most
> ill-structured functions, it has at least 6 non-exit tags and jumps
> between them.
> 
> Refactor it into 4 parts:
> 
> 1) find_free_extent()
>The main entrance, does the main work of block group iteration and
>block group selection.
>    This function no longer handles the free extent search itself.
> 
> 2) find_free_extent_clustered()
>Do clustered free extent search.
>May try to build/re-fill cluster.
> 
> 3) find_free_extent_unclustered()
>Do unclustered free extent search.
>May try to fill free space cache.
> 
> 4) find_free_extent_update_loop()
>Do the loop based black magic.
>May allocate new chunk.
> 
> This patch should at least make find_free_extent() a little easier to read,
> and it provides the basis for later work on this function.
> 
> Current refactor is trying not to touch the original functionality, thus
> the helper structure find_free_extent_ctl still contains a lot of
> unrelated members.
> But it should not change how find_free_extent() works at all.

Thanks, patches added to for-next. It looks much better than before,
more cleanups welcome.

Re: [PATCH v4 2/4] btrfs: Refactor clustered extent allocation into find_free_extent_clustered()

On Wed, Oct 17, 2018 at 02:56:04PM +0800, Qu Wenruo wrote:
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7261,6 +7261,115 @@ struct find_free_extent_ctl {
>   u64 found_offset;
>  };
>  
> +

No extra newlines.

> +/*
> + * Helper function for find_free_extent().
> + *
> + * Return -ENOENT to inform caller that we need fallback to unclustered mode.
> + * Return -EAGAIN to inform caller that we need to re-search this block group
> + * Return >0 to inform caller that we find nothing
> + * Return 0 means we have found a location and set ffe_ctl->found_offset.
> + */
> +static int find_free_extent_clustered(struct btrfs_block_group_cache *bg,
> + struct btrfs_free_cluster *last_ptr,
> + struct find_free_extent_ctl *ffe_ctl,
> + struct btrfs_block_group_cache **cluster_bg_ret)
> +{
> + struct btrfs_fs_info *fs_info = bg->fs_info;
> + struct btrfs_block_group_cache *cluster_bg;
> + u64 aligned_cluster;
> + u64 offset;
> + int ret;
> +
> + cluster_bg = btrfs_lock_cluster(bg, last_ptr, ffe_ctl->delalloc);
> + if (!cluster_bg)
> + goto refill_cluster;
> + if (cluster_bg != bg && (cluster_bg->ro ||
> + !block_group_bits(cluster_bg, ffe_ctl->flags)))
> + goto release_cluster;
> +
> + offset = btrfs_alloc_from_cluster(cluster_bg, last_ptr,
> + ffe_ctl->num_bytes, cluster_bg->key.objectid,
> + &ffe_ctl->max_extent_size);
> + if (offset) {
> + /* we have a block, we're done */
> + spin_unlock(&last_ptr->refill_lock);
> + trace_btrfs_reserve_extent_cluster(cluster_bg,
> + ffe_ctl->search_start, ffe_ctl->num_bytes);
> + *cluster_bg_ret = cluster_bg;
> + ffe_ctl->found_offset = offset;
> + return 0;
> + }
> + WARN_ON(last_ptr->block_group != cluster_bg);
> +release_cluster:
> + /* If we are on LOOP_NO_EMPTY_SIZE, we can't set up a new clusters, so

If you move or update comment that does not follow our preferred style,
please fix it.

I'll fix both and maybe more that I find while committing the patches.

> +  * lets just skip it and let the allocator find whatever block it can
> +  * find. If we reach this point, we will have tried the cluster
> +  * allocator plenty of times and not have found anything, so we are
> +  * likely way too fragmented for the clustering stuff to find anything.
> +  *
> +  * However, if the cluster is taken from the current block group,
> +  * release the cluster first, so that we stand a better chance of
> +  * succeeding in the unclustered allocation.
> +  */
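
The return convention documented for find_free_extent_clustered() in the quoted patch can be summarised with a small classifier. This is only a sketch of how a caller might interpret the codes; the enum and function names are hypothetical, and the handling of other negative values is an assumption.

```c
#include <errno.h>

/* Hypothetical labels for the outcomes listed in the helper's comment. */
enum sketch_action {
    SKETCH_USE_FOUND_OFFSET,   /* 0: ffe_ctl->found_offset was set */
    SKETCH_TRY_UNCLUSTERED,    /* -ENOENT: fall back to unclustered mode */
    SKETCH_RESEARCH_GROUP,     /* -EAGAIN: re-search this block group */
    SKETCH_NOTHING_FOUND,      /* > 0: nothing found */
    SKETCH_UNEXPECTED          /* any other negative value (assumption) */
};

static enum sketch_action sketch_classify(int ret)
{
    if (ret == 0)
        return SKETCH_USE_FOUND_OFFSET;
    if (ret == -ENOENT)
        return SKETCH_TRY_UNCLUSTERED;
    if (ret == -EAGAIN)
        return SKETCH_RESEARCH_GROUP;
    if (ret > 0)
        return SKETCH_NOTHING_FOUND;
    return SKETCH_UNEXPECTED;
}
```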

Re: [PATCH] Btrfs: fix deadlock when writing out free space caches

On Fri, Oct 12, 2018 at 10:03:55AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When writing out a block group free space cache we can end up deadlocking
> with ourselves on an extent buffer lock resulting in a warning like the
> following:
> 
>   [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
>   [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G W I  4.16.8 #1
>   [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
>   [245043.396791] RSP: 0018:c9000424b840 EFLAGS: 00010246
>   [245043.398093] RAX: 0a30 RBX: 8807e20a3d20 RCX: 0001
>   [245043.399414] RDX: 0001 RSI: 0002 RDI: 8807e20a3d20
>   [245043.400732] RBP: 0001 R08: 88041f39a700 R09: 8800
>   [245043.402021] R10: 0040 R11: 8807e20a3d20 R12: 8807cb220630
>   [245043.403296] R13: 0001 R14: 8807cb220628 R15: 88041fbdf000
>   [245043.404780] FS:  () GS:88082fc8() knlGS:
>   [245043.406050] CS:  0010 DS:  ES:  CR0: 80050033
>   [245043.407321] CR2: 7fffdbdb9f10 CR3: 01c09005 CR4: 000206e0
>   [245043.408670] Call Trace:
>   [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
>   [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
>   [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
>   [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
>   [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
>   [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
>   [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
>   [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
>   [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
>   [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
>   [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
>   [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
>   [245043.425538]  ? iput+0x72/0x1e0
>   [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
>   [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
>   [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
>   [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
>   [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
>   [245043.433341]  kthread+0xf0/0x130
>   [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
>   [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
>   [245043.437236]  ret_from_fork+0x1f/0x30
>   [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
> 
> This is because at write_one_cache_group() when we are COWing a leaf from
> the extent tree we end up allocating a new block group (chunk) and,
> because we have hit a threshold on the number of bytes reserved for system
> chunks, we attempt to finalize the creation of new block groups from the
> current transaction, by calling btrfs_create_pending_block_groups().
> However here we also need to modify the extent tree in order to insert
> a block group item, and if the location for this new block group item
> happens to be in the same leaf that we were COWing earlier, we deadlock
> since btrfs_search_slot() tries to write lock the extent buffer that we
> locked before at write_one_cache_group().
> 
> We have already hit similar cases in the past and commit d9a0540a79f8
> ("Btrfs: fix deadlock when finalizing block group creation") fixed some
> of those cases by delaying the creation of pending block groups at the
> known specific spots that could lead to a deadlock. This change reworks
> that commit to be more generic so that we don't have to add similar logic
> to every possible path that can lead to a deadlock. This is done by
> making __btrfs_cow_block() disallow the creation of new block groups
> (setting the transaction's can_flush_pending_bgs to false) before it
> attempts to allocate a new extent buffer for either the extent, chunk or
> device trees, since those are the trees that pending block group creation
> modifies. Once the new extent buffer is allocated, it allows creation of
> pending block groups to happen again.
> 
> This change depends on a recent patch from Josef which is not yet in
> Linus' tree, named "btrfs: make sure we create all new block groups" in
> order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().

Thanks for mentioning it, the referenced patch has been in misc-next and
is scheduled for 4.20 so your patch will be added too.
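
The approach described in the quoted changelog (clear can_flush_pending_bgs around the allocation, then set it again) can be modelled in userspace roughly as below. Only the can_flush_pending_bgs name comes from the changelog; the struct, the functions, and the is_bg_tree flag (standing in for "the extent, chunk or device tree") are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced model of a transaction handle. */
struct sketch_trans {
    bool can_flush_pending_bgs;
    bool flush_allowed_during_alloc;  /* what the allocator observed */
};

/* Stand-in for the extent buffer allocation; it only records the flag. */
static void sketch_alloc_tree_block(struct sketch_trans *trans)
{
    trans->flush_allowed_during_alloc = trans->can_flush_pending_bgs;
}

/*
 * Model of the COW step: for trees that pending block group creation
 * modifies, forbid flushing pending block groups while allocating, then
 * allow it again once the new buffer exists.
 */
static void sketch_cow_block(struct sketch_trans *trans, bool is_bg_tree)
{
    if (is_bg_tree)
        trans->can_flush_pending_bgs = false;

    sketch_alloc_tree_block(trans);

    if (is_bg_tree)
        trans->can_flush_pending_bgs = true;
}
```

The point of the pattern is that the allocation path can still create a chunk, but finalizing the pending block groups is deferred until the leaf lock taken by the caller is no longer in the way.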

Re: [PATCH 9/9] btrfs: Add RAID 6 recovery for a btrfs filesystem.

On Thu, Oct 11, 2018 at 08:51:03PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> Add the RAID 6 recovery, in order to use a RAID 6 filesystem even if some
> disks (up to two) are missing. This code uses the md RAID 6 code already
> present in grub.
>
> Signed-off-by: Goffredo Baroncelli 
> Reviewed-by: Daniel Kiper 
> ---
>  grub-core/fs/btrfs.c | 60 +++-
>  1 file changed, 54 insertions(+), 6 deletions(-)
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index d066d54cc..d20ee09e4 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  GRUB_MOD_LICENSE ("GPLv3+");
>
> @@ -702,11 +703,36 @@ rebuild_raid5 (char *dest, struct raid56_buffer *buffers,
>  }
>  }
>
> +static grub_err_t
> +raid6_recover_read_buffer (void *data, int disk_nr,
> +grub_uint64_t addr __attribute__ ((unused)),
> +void *dest, grub_size_t size)
> +{
> +struct raid56_buffer *buffers = data;
> +
> +if (!buffers[disk_nr].data_is_valid)
> + return grub_errno = GRUB_ERR_READ_ERROR;
> +
> +grub_memcpy(dest, buffers[disk_nr].buf, size);
> +
> +return grub_errno = GRUB_ERR_NONE;
> +}
> +
> +static void
> +rebuild_raid6 (struct raid56_buffer *buffers, grub_uint64_t nstripes,
> +   grub_uint64_t csize, grub_uint64_t parities_pos, void *dest,
> +   grub_uint64_t stripen)
> +
> +{
> +  grub_raid6_recover_gen (buffers, nstripes, stripen, parities_pos,
> +  dest, 0, csize, 0, raid6_recover_read_buffer);
> +}
> +
>  static grub_err_t
>  raid56_read_retry (struct grub_btrfs_data *data,
>  struct grub_btrfs_chunk_item *chunk,
> -grub_uint64_t stripe_offset,
> -grub_uint64_t csize, void *buf)
> +grub_uint64_t stripe_offset, grub_uint64_t stripen,
> +grub_uint64_t csize, void *buf, grub_uint64_t parities_pos)
>  {
>struct raid56_buffer *buffers;
>grub_uint64_t nstripes = grub_le_to_cpu16 (chunk->nstripes);
> @@ -779,6 +805,15 @@ raid56_read_retry (struct grub_btrfs_data *data,
>ret = GRUB_ERR_READ_ERROR;
>goto cleanup;
>  }
> +  else if (failed_devices > 2 && (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID6))
> +{
> +  grub_dprintf ("btrfs",
> + "not enough disks for raid6: total %" PRIuGRUB_UINT64_T

s/not enough disks for raid6/not enough disks for RAID 6/

You are using "RAID 5" in earlier patch, so, please be consistent
and use "RAID 6" here.

> + ", missing %" PRIuGRUB_UINT64_T "\n",
> + nstripes, failed_devices);
> +  ret = GRUB_ERR_READ_ERROR;
> +  goto cleanup;
> +}
>else
>  grub_dprintf ("btrfs",
> "enough disks for RAID 5 rebuilding: total %"
> @@ -789,7 +824,7 @@ raid56_read_retry (struct grub_btrfs_data *data,
>if (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID5)
>  rebuild_raid5 (buf, buffers, nstripes, csize);
>else
> -grub_dprintf ("btrfs", "called rebuild_raid6(), NOT IMPLEMENTED\n");
> +rebuild_raid6 (buffers, nstripes, csize, parities_pos, buf, stripen);
>
>ret = GRUB_ERR_NONE;
>   cleanup:
> @@ -879,9 +914,11 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
>   unsigned redundancy = 1;
>   unsigned i, j;
>   int is_raid56;
> + grub_uint64_t parities_pos = 0;
>
> - is_raid56 = !!(grub_le_to_cpu64 (chunk->type) &
> -GRUB_BTRFS_CHUNK_TYPE_RAID5);
> +is_raid56 = !!(grub_le_to_cpu64 (chunk->type) &
> +(GRUB_BTRFS_CHUNK_TYPE_RAID5 |
> + GRUB_BTRFS_CHUNK_TYPE_RAID6));
>
>   if (grub_le_to_cpu64 (chunk->size) <= off)
> {
> @@ -1030,6 +1067,17 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
>  */
> grub_divmod64 (high + stripen, nstripes, &stripen);
>
> +   /*
> +* parities_pos is equal to "(high - nparities) % nstripes"
> +* (see the diagram above).
> +* However "high - nparities" might be negative (eg when high
> +* == 0) leading to an incorrect computation.

However, "high - nparities" can be negative, e.g. when high == 0,
leading to incorrect results.

> +* Instead "high + nstripes - nparities" is always positive and
> +* in modulo nstripes is equal to "(high - nparities) % nstripes"

"high + nstripes - nparities" is always positive and modulo
nstripes is equal to "(high - nparities) % nstripes".

If you change above mentioned things you can retain my
  Reviewed-by: Daniel Kiper 

Daniel
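
The point made in the quoted comment about "high - nparities" can be checked with a short sketch using unsigned 64-bit arithmetic. The function names are hypothetical; the formulas are the two expressions quoted in the comment.

```c
#include <stdint.h>

/* Position of the first parity in a row, computed as in the quoted comment. */
static uint64_t sketch_parities_pos(uint64_t high, uint64_t nstripes,
                                    uint64_t nparities)
{
    return (high + nstripes - nparities) % nstripes;
}

/*
 * Naive variant: with unsigned types "high - nparities" wraps around when
 * high < nparities. The result only happens to agree with the version above
 * when nstripes is a power of two (because 2^64 is then a multiple of
 * nstripes); for other values, such as 3, it differs.
 */
static uint64_t sketch_parities_pos_naive(uint64_t high, uint64_t nstripes,
                                          uint64_t nparities)
{
    return (high - nparities) % nstripes;
}
```

For example, with nstripes = 3 and nparities = 2, row 0 should give position 1, but the naive form yields a different value because of the wrap-around.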

Re: [PATCH 7/9] btrfs: Add support for recovery for a RAID 5 btrfs profiles.

On Thu, Oct 11, 2018 at 08:51:01PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> Add support for recovery for a RAID 5 btrfs profile. In addition,
> some code is added as preparatory work for RAID 6 recovery.
>
> Signed-off-by: Goffredo Baroncelli 
> ---
>  grub-core/fs/btrfs.c | 162 +--
>  1 file changed, 157 insertions(+), 5 deletions(-)
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index 899dc32b7..d066d54cc 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  GRUB_MOD_LICENSE ("GPLv3+");
>
> @@ -665,6 +666,141 @@ btrfs_read_from_chunk (struct grub_btrfs_data *data,
>  return err;
>  }
>
> +struct raid56_buffer {
> +  void *buf;
> +  int  data_is_valid;
> +};
> +
> +static void
> +rebuild_raid5 (char *dest, struct raid56_buffer *buffers,
> +grub_uint64_t nstripes, grub_uint64_t csize)
> +{
> +  grub_uint64_t i;
> +  int first;
> +
> +  for(i = 0; buffers[i].data_is_valid && i < nstripes; i++);
> +
> +  if (i == nstripes)
> +{
> +  grub_dprintf ("btrfs", "called rebuild_raid5(), but all disks are OK\n");
> +  return;
> +}
> +
> +  grub_dprintf ("btrfs", "rebuilding RAID 5 stripe #%" PRIuGRUB_UINT64_T "\n", i);
> +
> +  for (i = 0, first = 1; i < nstripes; i++)
> +{
> +  if (!buffers[i].data_is_valid)
> + continue;
> +
> +  if (first) {
> + grub_memcpy(dest, buffers[i].buf, csize);
> + first = 0;
> +  } else
> + grub_crypto_xor (dest, dest, buffers[i].buf, csize);
> +

Please drop this empty line.

> +}
> +}
> +
> +static grub_err_t
> +raid56_read_retry (struct grub_btrfs_data *data,
> +struct grub_btrfs_chunk_item *chunk,
> +grub_uint64_t stripe_offset,
> +grub_uint64_t csize, void *buf)
> +{
> +  struct raid56_buffer *buffers;
> +  grub_uint64_t nstripes = grub_le_to_cpu16 (chunk->nstripes);
> +  grub_uint64_t chunk_type = grub_le_to_cpu64 (chunk->type);
> +  grub_err_t ret = GRUB_ERR_OUT_OF_MEMORY;
> +  grub_uint64_t i, failed_devices;
> +
> +  buffers = grub_zalloc (sizeof(*buffers) * nstripes);
> +  if (!buffers)
> +goto cleanup;
> +
> +  for (i = 0; i < nstripes; i++)
> +{
> +  buffers[i].buf = grub_zalloc (csize);
> +  if (!buffers[i].buf)
> + goto cleanup;
> +}
> +
> +  for (failed_devices = 0, i = 0; i < nstripes; i++)
> +{
> +  struct grub_btrfs_chunk_stripe *stripe;
> +  grub_disk_addr_t paddr;
> +  grub_device_t dev;
> +  grub_err_t err;
> +
> +  /* after the struct grub_btrfs_chunk_item, there is an array of
> + struct grub_btrfs_chunk_stripe */

/* Struct grub_btrfs_chunk_stripe lives behind struct grub_btrfs_chunk_item. */

> +  stripe = (struct grub_btrfs_chunk_stripe *) (chunk + 1) + i;
> +
> +  paddr = grub_le_to_cpu64 (stripe->offset) + stripe_offset;
> +  grub_dprintf ("btrfs", "reading paddr %" PRIxGRUB_UINT64_T
> +" from stripe ID %" PRIxGRUB_UINT64_T "\n", paddr,
> +stripe->device_id);
> +
> +  dev = find_device (data, stripe->device_id);
> +  if (!dev)
> + {
> +   buffers[i].data_is_valid = 0;
> +   grub_dprintf ("btrfs", "stripe %" PRIuGRUB_UINT64_T " FAILED (dev ID %"
> + PRIxGRUB_UINT64_T ")\n", i, stripe->device_id);
> +   failed_devices++;
> +   continue;
> + }
> +
> +  err = grub_disk_read (dev->disk, paddr >> GRUB_DISK_SECTOR_BITS,
> + paddr & (GRUB_DISK_SECTOR_SIZE - 1),
> + csize, buffers[i].buf);
> +  if (err == GRUB_ERR_NONE)
> + {
> +   buffers[i].data_is_valid = 1;
> +   grub_dprintf ("btrfs", "stripe %" PRIuGRUB_UINT64_T " Ok (dev ID %"

s/Ok/OK/

> + PRIxGRUB_UINT64_T ")\n", i, stripe->device_id);
> + }
> +  else
> + {
> +   buffers[i].data_is_valid = 0;
> +   grub_dprintf ("btrfs", "stripe %" PRIuGRUB_UINT64_T
> + " FAILED (dev ID %" PRIxGRUB_UINT64_T ")\n", i,
> + stripe->device_id);
> +   failed_devices++;
> + }
> +}
> +
> +  if (failed_devices > 1 && (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID5))
> +{
> +  grub_dprintf ("btrfs",
> + "not enough disks for RAID 5: total %" PRIuGRUB_UINT64_T
> + ", missing %" PRIuGRUB_UINT64_T "\n",
> + nstripes, failed_devices);
> +  ret = GRUB_ERR_READ_ERROR;
> +  goto cleanup;
> +}
> +  else
> +grub_dprintf ("btrfs",
> +   "enough disks for RAID 5 rebuilding: total %"

s/enough disks for RAID 5 rebuilding/enough disks for RAID 5/

> +   PRIuGRUB_UINT64_T ", missing %" PRIuGRUB_UINT64_T "\n",
> +   nstripes, failed_devices);
> +
> +  /* if these are enough, try to rebuild the data */

/* We have enough disks. So, rebuild
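
The XOR-based rebuild performed by rebuild_raid5() in the quoted patch can be illustrated with a standalone userspace sketch. It uses plain memcpy and a byte loop instead of the GRUB helpers, and the struct and function names are hypothetical; it assumes exactly one buffer is invalid.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace counterpart of the patch's struct raid56_buffer. */
struct sketch_buffer {
    const unsigned char *buf;
    int data_is_valid;
};

/*
 * Rebuild the single missing stripe of a RAID 5 row: XOR together all valid
 * buffers (data and parity). The result is the content of the missing one.
 */
static void sketch_rebuild_raid5(unsigned char *dest,
                                 const struct sketch_buffer *buffers,
                                 size_t nstripes, size_t csize)
{
    int first = 1;

    for (size_t i = 0; i < nstripes; i++) {
        if (!buffers[i].data_is_valid)
            continue;

        if (first) {
            /* Seed the accumulator with the first valid buffer. */
            memcpy(dest, buffers[i].buf, csize);
            first = 0;
        } else {
            for (size_t j = 0; j < csize; j++)
                dest[j] ^= buffers[i].buf[j];
        }
    }
}
```

This works because the parity is the XOR of all data stripes, so XOR-ing the parity with the surviving data stripes cancels them out and leaves the missing stripe.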

Re: [PATCH] btrfs: Fix the return value in case of error in 'btrfs_mark_extent_written()'

On Wed, Oct 17, 2018 at 11:13:59AM +0200, Christophe JAILLET wrote:
> We return 0 unconditionally in most of the error handling paths of
> 'btrfs_mark_extent_written()'.
> However, 'ret' is set to some error codes in several error handling paths.
> 
> Return 'ret' instead to propagate the error code.
> 
> Fixes: 9c8e63db1de9 ("Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written")
> Signed-off-by: Christophe JAILLET 
> ---
> This patch proposal is purely speculative.
> I'm not sure at all that returning 'ret' is correct (but it looks like it
> is :) )

Agreed.

> What puzzles me is when 'ret' is set, 'btrfs_abort_transaction()' is also
> called.
> However, the only caller of 'btrfs_mark_extent_written()' (i.e.
> 'btrfs_finish_ordered_io()') also calls 'btrfs_abort_transaction()' if an
> error is returned.
> So returning an error code here, would lead to a double call to this abort
> function.

Calling abort multiple times should not hurt; there's a bit set atomically
so that the stack trace is printed only the first time. There's a
possibility of the trans->aborted value being overwritten if two completely
different aborts happen at exactly the same time, but both values still
get printed in the log, so there's enough information.

> I'm unsure if it is correct and/or intended.
> If returning 'ret' is correct, should we also axe the
> 'btrfs_abort_transaction()' calls here, and let the caller do the clean-up?

This depends on the context where the abort is used. Functions that do a
lot of things but do not start the transaction are allowed to call abort
after the first unrecoverable failure, so this possibly logs the exact
error. If the abort is only in the caller, we would not know which of
the many calls in btrfs_mark_extent_written failed.

> Before the commit in the Fixes tag, we were BUGing_ON in case of error. So
> propagating the error was pointless.

Right, now the error must be propagated to btrfs_finish_ordered_io so
the "if (ret < 0) abort" is not skipped and code continues to
add_pending_csums, btrfs_update_inode_fallback etc.

Good catch, please resend with updated changelog, you can use the
relevant parts of my comments above to explain what's wrong. Thanks.
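
The effect of returning 'ret' instead of 0 can be shown with a toy model. Everything here is hypothetical (the function names, the propagate switch, the counter standing in for the later steps such as add_pending_csums); it only illustrates why the caller's "if (ret < 0)" check needs the real error code.

```c
#include <errno.h>
#include <stdbool.h>

/* Counts how often the caller went on to its later steps. */
static int sketch_later_steps_run;

/* Model of a helper whose error paths all jump to a common exit label. */
static int sketch_mark_extent_written(bool fail, bool propagate)
{
    int ret = 0;

    if (fail) {
        ret = -EIO;  /* some step failed; the transaction was aborted */
        goto out;
    }
out:
    /* The fix: return ret here rather than an unconditional 0. */
    return propagate ? ret : 0;
}

/* Model of the caller: it skips the later steps only if it sees an error. */
static int sketch_finish_ordered_io(bool fail, bool propagate)
{
    int ret = sketch_mark_extent_written(fail, propagate);

    if (ret < 0)
        return ret;

    sketch_later_steps_run++;
    return 0;
}
```

Without propagation the caller carries on after a failure; with propagation it stops at the check.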

Re: [PATCH 4/9] btrfs: Avoid a rescan for a device which was already not found.

On Thu, Oct 11, 2018 at 08:50:58PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> Currently, a read from a missing device triggers a rescan. However, it is
> never recorded that the device is missing. So, each read of a missing device
> triggers a rescan again and again. This behavior causes a lot of unneeded
> rescans, leading to huge slowdowns.
>
> This patch fixes the above-mentioned issue. Information about missing devices
> is stored in the data->devices_attached[] array as a NULL value in the dev
> member. A rescan is triggered only if no information is found for a given
> device. This means that only the first read triggers a rescan.
>
> The patch drops the premature return. This way data->devices_attached[] is
> filled even when a given device is missing.
>
> Signed-off-by: Goffredo Baroncelli 

I changed commit message, so, you should add
  Signed-off-by: Daniel Kiper 

I simply forgot to tell you about that. Sorry.

And below you can add
  Reviewed-by: Daniel Kiper 

Daniel
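
The caching idea in the quoted commit message (remember a missing device as a NULL entry so that it is not rescanned) can be sketched in userspace as follows. The devices_attached name mirrors the commit message; the struct layout, the fixed-size table, and the stub scanner that never finds anything are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DEVS 8

struct sketch_dev_entry {
    uint64_t id;
    void *dev;  /* NULL means: already looked for, known to be missing */
};

struct sketch_data {
    struct sketch_dev_entry devices_attached[SKETCH_MAX_DEVS];
    size_t n_devices;
    unsigned rescans;  /* how many expensive scans were performed */
};

/* Stand-in for the expensive rescan; in this sketch no device is ever found. */
static void *sketch_rescan(struct sketch_data *data, uint64_t id)
{
    (void) id;
    data->rescans++;
    return NULL;
}

static void *sketch_find_device(struct sketch_data *data, uint64_t id)
{
    void *dev;

    /* A cached entry answers the lookup, even if it records a miss. */
    for (size_t i = 0; i < data->n_devices; i++)
        if (data->devices_attached[i].id == id)
            return data->devices_attached[i].dev;

    /* No information yet: scan once and record the outcome, found or not. */
    dev = sketch_rescan(data, id);
    if (data->n_devices < SKETCH_MAX_DEVS) {
        data->devices_attached[data->n_devices].id = id;
        data->devices_attached[data->n_devices].dev = dev;
        data->n_devices++;
    }
    return dev;
}
```

Looking up the same missing device twice therefore costs one rescan rather than two.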

Re: [PATCH 1/9] btrfs: Add support for reading a filesystem with a RAID 5 or RAID 6 profile.

On Thu, Oct 11, 2018 at 08:50:55PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> Signed-off-by: Goffredo Baroncelli 
> Signed-off-by: Daniel Kiper 

One nit pick below...

Otherwise you can add Reviewed-by: Daniel Kiper 

> ---
>  grub-core/fs/btrfs.c | 73 
>  1 file changed, 73 insertions(+)
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index be195448d..933a57d3b 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
>  #define GRUB_BTRFS_CHUNK_TYPE_RAID1 0x10
>  #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED 0x20
>  #define GRUB_BTRFS_CHUNK_TYPE_RAID10 0x40
> +#define GRUB_BTRFS_CHUNK_TYPE_RAID5 0x80
> +#define GRUB_BTRFS_CHUNK_TYPE_RAID6 0x100
>grub_uint8_t dummy2[0xc];
>grub_uint16_t nstripes;
>grub_uint16_t nsubstripes;
> @@ -764,6 +766,77 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
> grub_disk_addr_t addr,
> stripe_offset = low + chunk_stripe_length
>   * high;
> csize = chunk_stripe_length - low;
> +   break;
> + }
> +   case GRUB_BTRFS_CHUNK_TYPE_RAID5:
> +   case GRUB_BTRFS_CHUNK_TYPE_RAID6:
> + {
> +   grub_uint64_t nparities, stripe_nr, high, low;
> +
> +   redundancy = 1;   /* no redundancy for now */
> +
> +   if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
> + {
> +   grub_dprintf ("btrfs", "RAID5\n");
> +   nparities = 1;
> + }
> +   else
> + {
> +   grub_dprintf ("btrfs", "RAID6\n");
> +   nparities = 2;
> + }
> +
> +   /*
> +* RAID 6 layout consists of several stripes spread over
> +* the disks, e.g.:
> +*
> +*   Disk_0  Disk_1  Disk_2  Disk_3
> +* A0  B0  P0  Q0
> +* Q1  A1  B1  P1
> +* P2  Q2  A2  B2
> +*
> +* Note: placement of the parities depend on row number.
> +*
> +* Pay attention that the btrfs terminology may differ from
> +* terminology used in other RAID implementations, e.g. LVM,
> +* dm or md. The main difference is that btrfs calls contiguous
> +* block of data on a given disk, e.g. A0, stripe instead of 
> chunk.
> +*
> +* The variables listed below have following meaning:
> +*   - stripe_nr is the stripe number excluding the parities
> +* (A0 = 0, B0 = 1, A1 = 2, B1 = 3, etc.),
> +*   - high is the row number (0 for A0...Q0, 1 for Q1...P1, 
> etc.),
> +*   - stripen is the disk number in a row (0 for A0, Q1, P2,
> +* 1 for B0, A1, Q2, etc.),
> +*   - off is the logical address to read,
> +*   - chunk_stripe_length is the size of a stripe (typically 64 
> KiB),
> +*   - nstripes is the number of disks in a row,
> +*   - low is the offset of the data inside a stripe,
> +*   - stripe_offset is the data offset in an array,
> +*   - csize is the "potential" data to read; it will be reduced
> +* to size if the latter is smaller,
> +*   - nparities is the number of parities (1 for RAID 5, 2 for
> +* RAID 6); used only in RAID 5/6 code.
> +*/
> +   stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
> +
> +   /*
> +* stripen is computed without the parities
> +* (0 for A0, A1, A2, 1 for B0, B1, B2, etc.).
> +*/
> +   high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
> +
> +   /*
> +* The stripes are spread over the disks. Every each row their
> +* positions are shifted by 1 place. So, the real disks number
> +* change. Hence, we have to take current row number modulo
> +* nstripes into account (0 for A0, 1 for A1, 2 for A2, etc.).

s/current row number modulo nstripes into account/into account current row 
number modulo nstripes/

Daniel
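
The address arithmetic described in the quoted comment can be sketched as a
small stand-alone function. This is not the GRUB code: it uses plain division
instead of grub_divmod64, and struct raid56_pos and raid56_map are
hypothetical names. The rotation step itself is cut off in the quote above,
so the "(row + index) % nstripes" line is an assumption inferred from the
comment and the layout diagram.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type; field names follow the quoted comment. */
struct raid56_pos {
	uint64_t row;	/* "high": row number */
	uint64_t disk;	/* "stripen" after rotation: disk index in the row */
	uint64_t low;	/* offset of the data inside the stripe */
};

static struct raid56_pos raid56_map(uint64_t off, uint64_t stripe_len,
				    uint64_t nstripes, uint64_t nparities)
{
	struct raid56_pos p;
	uint64_t stripe_nr, data_idx;

	assert(stripe_len > 0 && nstripes > nparities);

	/* Stripe number excluding parities (A0 = 0, B0 = 1, A1 = 2, ...). */
	stripe_nr = off / stripe_len;
	p.low = off % stripe_len;

	/* Row number, and the data index within the row without parities. */
	p.row = stripe_nr / (nstripes - nparities);
	data_idx = stripe_nr % (nstripes - nparities);

	/* Assumed rotation: positions shift by one place per row. */
	p.disk = (p.row + data_idx) % nstripes;
	return p;
}
```

For the four-disk RAID 6 example in the comment (nstripes = 4, nparities = 2)
this places A0, B0 on disks 0, 1; A1, B1 on disks 1, 2; and A2, B2 on disks
2, 3, which matches the diagram.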

Re: CRC mismatch

2018-10-17 Thread Austin S. Hemmelgarn


On 2018-10-16 16:27, Chris Murphy wrote:

On Tue, Oct 16, 2018 at 9:42 AM, Austin S. Hemmelgarn
 wrote:

On 2018-10-16 11:30, Anton Shepelev wrote:


Hello, all

What may be the reason for a CRC mismatch on a BTRFS file in
a virtual machine:

 csum failed ino 175524 off 1876295680 csum 451760558
 expected csum 1446289185

Shall I seek the culprit in the host machine or in the guest
one?  Supposing the host machine healthy, what operations on
the guest might have caused a CRC mismatch?


Possible causes include:

* On the guest side:
   - Unclean shutdown of the guest system (not likely even if this did
happen).
   - A kernel bug in the guest.
   - Something directly modifying the block device (also not very likely).

* On the host side:
   - Unclean shutdown of the host system without properly flushing data from
the guest.  Not likely unless you're using an actively unsafe caching mode
for the guest's storage back-end.
   - At-rest data corruption in the storage back-end.
   - A bug in the host-side storage stack.
   - A transient error in the host-side storage stack.
   - A bug in the hypervisor.
   - Something directly modifying the back-end storage.

Of these, the statistically most likely location for the issue is probably
the storage stack on the host.


Is there still that O_DIRECT related "bug" (or more of a limitation)
if the guest is using cache=none on the block device?

I had actually forgotten about this, and I'm not quite sure if it's
fixed or not.


Anton what virtual machine tech are you using? qemu/kvm managed with
virt-manager? The configuration affects host behavior; but the
negative effect manifests inside the guest as corruption. If I
remember correctly.
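
For readers unfamiliar with what a "csum failed" line means: btrfs by default
checksums data blocks with CRC-32C, and the message reports that the value
computed on read differs from the stored one. Below is a minimal bitwise
CRC-32C (Castagnoli) sketch. It is the textbook algorithm, not the kernel's
implementation, and the function name is only illustrative; the values it
produces are not claimed to match the numbers btrfs prints, since seeding and
finalization details may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C, reflected polynomial 0x82F63B78. Slow but simple. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;	/* standard initial value */

	assert(buf != NULL || len == 0);

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++) {
			/* XOR in the polynomial when the low bit is set. */
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
		}
	}
	return crc ^ 0xFFFFFFFFu;	/* standard final inversion */
}
```

Any single flipped bit in the block changes the result, which is why silent
corruption anywhere between the guest filesystem and the physical medium
shows up as a mismatch on read.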

Re: [PATCH 42/42] btrfs: don't run delayed_iputs in commit

On Fri, Oct 12, 2018 at 03:32:56PM -0400, Josef Bacik wrote:
> This could result in a really bad case where we do something like
> 
> evict
>   evict_refill_and_join
> btrfs_commit_transaction
>   btrfs_run_delayed_iputs
> evict
>   evict_refill_and_join
> btrfs_commit_transaction
> ... forever
> 
> We have plenty of other places where we run delayed iputs that are much
> safer, let those do the work.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: David Sterba

Re: [PATCH 19/42] btrfs: set max_extent_size properly

On Fri, Oct 12, 2018 at 03:32:33PM -0400, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We can't use entry->bytes if our entry is a bitmap entry, we need to use
> entry->max_extent_size in that case.  Fix up all the logic to make this
> consistent.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/free-space-cache.c | 30 --
>  1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index e077ad3b4549..2dd773e96530 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -1770,6 +1770,13 @@ static int search_bitmap(struct btrfs_free_space_ctl 
> *ctl,
>   return -1;
>  }
>  
> +static inline u64 get_max_extent_size(struct btrfs_free_space *entry)
> +{
> + if (entry->bitmap)
> + return entry->max_extent_size;
> + return entry->bytes;
> +}

> + *max_extent_size = max(get_max_extent_size(entry),
> +*max_extent_size);

Looks reasonable.

Reviewed-by: David Sterba
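
A stand-alone sketch of why the helper matters. struct free_space below is a
simplified, hypothetical stand-in for struct btrfs_free_space, and
largest_extent() is an illustrative loop, not a function from the patch. For
a bitmap entry, 'bytes' is the total free space tracked by the bitmap, which
can be much larger than the largest contiguous run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct btrfs_free_space (assumed fields only). */
struct free_space {
	uint64_t bytes;			/* extent length, or total free in bitmap */
	uint64_t max_extent_size;	/* largest contiguous run in a bitmap */
	const unsigned long *bitmap;	/* non-NULL for bitmap entries */
};

/* Same logic as the helper in the quoted patch. */
static uint64_t get_max_extent_size(const struct free_space *entry)
{
	if (entry->bitmap)
		return entry->max_extent_size;
	return entry->bytes;
}

/* Illustrative only: track the largest usable extent over several entries. */
static uint64_t largest_extent(const struct free_space *entries, size_t n)
{
	uint64_t best = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t sz = get_max_extent_size(&entries[i]);

		if (sz > best)
			best = sz;
	}
	return best;
}
```

Using entry->bytes for a bitmap entry would overstate the largest extent; in
the test values used for this sketch it would report 1 MiB where only 8 KiB
is contiguous.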

Re: btrfs send receive: No space left on device

2018-10-17 Thread Henk Slager

On Wed, Oct 17, 2018 at 10:29 AM Libor Klepáč  wrote:
>
> Hello,
> i have new 32GB SSD in my intel nuc, installed debian9 on it, using btrfs as 
> a rootfs.
> Then i created subvolumes /system and /home and moved system there.
>
> System was installed using kernel 4.9.x and filesystem created using 
> btrfs-progs 4.7.x
> Details follow:
> main filesystem
>
> # btrfs filesystem usage /mnt/btrfs/ssd/
> Overall:
> Device size:  29.08GiB
> Device allocated:  4.28GiB
> Device unallocated:   24.80GiB
> Device missing:  0.00B
> Used:  2.54GiB
> Free (estimated): 26.32GiB  (min: 26.32GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:   16.00MiB  (used: 0.00B)
>
> Data,single: Size:4.00GiB, Used:2.48GiB
>/dev/sda3   4.00GiB
>
> Metadata,single: Size:256.00MiB, Used:61.05MiB
>/dev/sda3 256.00MiB
>
> System,single: Size:32.00MiB, Used:16.00KiB
>/dev/sda3  32.00MiB
>
> Unallocated:
>/dev/sda3  24.80GiB
>
> #/etc/fstab
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /mnt/btrfs/ssd  btrfs 
> noatime,space_cache=v2,compress=lzo,commit=300,subvolid=5 0   0
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /   btrfs 
>   noatime,space_cache=v2,compress=lzo,commit=300,subvol=/system 0 
>   0
> UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /home   btrfs 
>   noatime,space_cache=v2,compress=lzo,commit=300,subvol=/home 0   0
>
> -
> Then i installed kernel from backports:
> 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1
> and btrfs-progs 4.17
>
> For backups , i have created 16GB iscsi device on my qnap and mounted it, 
> created filesystem, mounted like this:
> LABEL=backup   /mnt/btrfs/backup   btrfs
>   noatime,space_cache=v2,compress=lzo,subvolid=5,nofail,noauto 0  
>  0
>
> After send-receive operation on /home subvolume, usage looks like this:
>
> # btrfs filesystem usage /mnt/btrfs/backup/
> Overall:
> Device size:  16.00GiB
> Device allocated:  1.27GiB
> Device unallocated:   14.73GiB
> Device missing:  0.00B
> Used:                844.18MiB
> Free (estimated): 14.92GiB  (min: 14.92GiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:   16.00MiB  (used: 0.00B)
>
> Data,single: Size:1.01GiB, Used:833.36MiB
>    /dev/sdb    1.01GiB
>
> Metadata,single: Size:264.00MiB, Used:10.80MiB
>    /dev/sdb  264.00MiB
>
> System,single: Size:4.00MiB, Used:16.00KiB
>    /dev/sdb    4.00MiB
>
> Unallocated:
>/dev/sdb   14.73GiB
>
>
> Problem is, during send-receive of system subvolume, it runs out of space:
>
> # btrbk run /mnt/btrfs/ssd/system/ -v
> btrbk command line client, version 0.26.1  (Wed Oct 17 09:51:20 2018)
> Using configuration: /etc/btrbk/btrbk.conf
> Using transaction log: /var/log/btrbk.log
> Creating subvolume snapshot for: /mnt/btrfs/ssd/system
> [snapshot] source: /mnt/btrfs/ssd/system
> [snapshot] target: /mnt/btrfs/ssd/_snapshots/system.20181017T0951
> Checking for missing backups of subvolume "/mnt/btrfs/ssd/system" in 
> "/mnt/btrfs/backup/"
> Creating subvolume backup (send-receive) for: 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> No common parent subvolume present, creating full backup...
> [send/receive] source: /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> [send/receive] target: /mnt/btrfs/backup/system.20181016T2034
> mbuffer: error: outputThread: error writing to  at offset 0x4b5bd000: 
> Broken pipe
> mbuffer: warning: error during output to : Broken pipe
> WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, 
> receive=/mnt/btrfs/backup) At subvol 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034
> WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, 
> receive=/mnt/btrfs/backup) At subvol system.20181016T2034
> ERROR: rename o77417-5519-0 -> 
> lib/modules/4.18.0-0.bpo.1-amd64/kernel/drivers/watchdog/pcwd_pci.ko failed: 
> No space left on device
> ERROR: Failed to send/receive btrfs subvolume: 
> /mnt/btrfs/ssd/_snapshots/system.20181016T2034  -> /mnt/btrfs/backup
> [delete] options: commit-after
> [delete] target: /mnt/btrfs/backup/system.20181016T2034
> WARNING: Deleted partially received (garbled) subvolume: 
> /mnt/btrfs/backup/system.20181016T2034
> ERROR: Error while resuming backups, aborting
> Created 0/2 missing backups
> WARNING: Skipping cleanup of snapshots for subvolume "/mnt/btrfs/ssd/system", 
> as at least one target aborted earlier
> Completed within: 116s  (Wed Oct 17 09:53:16 2018)
> ---

Re: Urgent: Need BTRFS-Expert

2018-10-17 Thread Qu Wenruo



On 2018/10/17 下午5:27, Michael Post wrote:
> Hello Hugo,
> 
> thanks for your information.
> 
> I have an 1 TB BTRFS-Partition which suddenly could not be mounted.
> 
> I tried the following commands
> 
> 
> host@:btrfs check -b /dev/mapper/vg0-virtualbox
> 

This is the important part.
Please paste them out.

Thanks,
Qu

> Errors found in extent allocation tree or chunk allocation
> bytenr mismatch, want=1263661723648, have=0
> 
> 
> 
> 
> host@:btrfs rescue super-recover -y  /dev/mapper/vg0-virtualbox
> All supers are valid, no need to recover
> 
> 
> Then i did
> btrfs rescue chunk-recover -y  /dev/mapper/vg0-virtualbox
> This command fails after around 2.5 hours with the message that it
> could not recover the chunk tree.
> 
> What are the next possibilities?
> 
> Greetings,
> 
> Michael
> 




Re: Urgent: Need BTRFS-Expert

Hello Hugo,

thanks for your information.

I have an 1 TB BTRFS-Partition which suddenly could not be mounted.

I tried the following commands


host@:btrfs check -b /dev/mapper/vg0-virtualbox

Errors found in extent allocation tree or chunk allocation
bytenr mismatch, want=1263661723648, have=0




host@:btrfs rescue super-recover -y  /dev/mapper/vg0-virtualbox
All supers are valid, no need to recover


Then i did
btrfs rescue chunk-recover -y  /dev/mapper/vg0-virtualbox
This command fails after around 2.5 hours with the message that it
could not recover the chunk tree.

What are the next possibilities?

Greetings,

Michael




Re: Urgent: Need BTRFS-Expert

2018-10-17 Thread Hugo Mills

   Hi, Michael,

On Wed, Oct 17, 2018 at 09:58:31AM +0200, Michael Post wrote:
> Hello together,
> 
> i need a BTRFS-Expert for remote support.
> 
> Anyone who can assist me?

   This is generally the wrong approach to take in open-source
circles. Instead, if you describe your problem here on this mailing
list, you'll get *most* of the experts looking at it, rather than just
the one, and you'll generally get a much better (and easier to use)
service.

   Hugo.

-- 
Hugo Mills | The early bird gets the worm, but the second mouse
hugo@... carfax.org.uk | gets the cheese.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


Re: CRC mismatch

2018-10-17 Thread Anton Shepelev

[I accidentally replied to Chris instead of the mailing list]
Chris Murphy:

>Is there still that O_DIRECT related "bug" (or more of a
>limitation) if the guest is using cache=none on the block
>device?

I know nothing about it.

>Anton what virtual machine tech are you using?  qemu/kvm
>managed with virt-manager?  The configuration affects host
>behavior; but the negative effect manifests inside the
>guest as corruption.  If I remember correctly.

This is a commercial system run inside VMware.

-- 
()  ascii ribbon campaign - against html e-mail
/\  http://preview.tinyurl.com/qcy6mjc [archived]

[PATCH] btrfs: Fix the return value in case of error in 'btrfs_mark_extent_written()'

2018-10-17 Thread Christophe JAILLET

We return 0 unconditionally in most of the error handling paths of
'btrfs_mark_extent_written()'.
However, 'ret' is set to some error codes in several error handling paths.

Return 'ret' instead to propagate the error code.

Fixes: 9c8e63db1de9 ("Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written")
Signed-off-by: Christophe JAILLET 
---
This patch proposal is purely speculative.
I'm not sure at all that returning 'ret' is correct (but it looks like it
is :) )

What puzzles me is when 'ret' is set, 'btrfs_abort_transaction()' is also
called.
However, the only caller of 'btrfs_mark_extent_written()' (i.e.
'btrfs_finish_ordered_io()') also calls 'btrfs_abort_transaction()' if an
error is returned.
So returning an error code here, would lead to a double call to this abort
function.

I'm unsure whether it is correct and/or intended.
If returning 'ret' is correct, should we also axe the 
'btrfs_abort_transaction()'
calls here, and leave the caller do the clean-up?

Before the commit in the Fixes tag, we were BUGing_ON in case of error. So
propagating the error was pointless.
---
 fs/btrfs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 15b925142793..cac0bd744de3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1374,7 +1374,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle 
*trans,
}
 out:
btrfs_free_path(path);
-   return 0;
+   return ret;
 }
 
 /*
-- 
2.17.1

Re: [PATCH 26/26] xfs: remove redundant remap partial EOF block checks

On Mon, Oct 15, 2018 at 08:21:02PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> Now that we've moved the partial EOF block checks to the VFS helpers, we
> can remove the redundant functionality from XFS.
> 
> Signed-off-by: Darrick J. Wong 
> Reviewed-by: Dave Chinner 

Looks fine,

Reviewed-by: Christoph Hellwig

Re: [PATCH 25/26] xfs: support returning partial reflink results

> @@ -1415,11 +1419,17 @@ xfs_reflink_remap_range(
>  
>   trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
>  
> + if (len == 0) {
> + ret = 0;
> + goto out_unlock;
> + }
> +

As pointed out last time this check is superfluous, right above we have
this check:

if (ret < 0 || len == 0)
return ret;

>   ret = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
> - pos_out + len);
> + &remappedfsb, pos_out + len);
> + remapped_bytes = min_t(loff_t, len, XFS_FSB_TO_B(mp, remappedfsb));

I still think returning the bytes from the function would be saner,
but maybe that's just me.
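
For reference, the clamp in the quoted hunk can be sketched outside the
kernel as below. The helper name and the 'blocklog' parameter are
hypothetical; the shift mirrors how XFS_FSB_TO_B converts filesystem blocks
to bytes using the block-size log. The clamp is needed because whole blocks
are remapped, while the request may end in the middle of the last block.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a count of remapped filesystem blocks to bytes and clamp it to
 * the requested length. 'blocklog' is log2 of the block size (assumed).
 */
static int64_t remapped_bytes(int64_t len, uint64_t remappedfsb,
			      unsigned int blocklog)
{
	int64_t bytes;

	assert(len >= 0 && blocklog < 32);

	bytes = (int64_t)(remappedfsb << blocklog);
	return bytes < len ? bytes : len;
}
```

A function that returned this byte count directly would spare the caller the
block-to-byte conversion.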

Re: [PATCH 24/26] xfs: fix pagecache truncation prior to reflink

On Mon, Oct 15, 2018 at 08:20:48PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> Prior to remapping blocks, it is necessary to remove pages from the
> destination file's page cache.  Unfortunately, the truncation is not
> aggressive enough -- if page size > block size, we'll end up zeroing
> subpage blocks instead of removing them.  So, round the start offset
> down and the end offset up to page boundaries.  We already wrote all
> the dirty data so the larger range shouldn't be a problem.
> 
> Signed-off-by: Darrick J. Wong 
> Reviewed-by: Dave Chinner 

Looks fine,

Reviewed-by: Christoph Hellwig
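
The rounding described in the commit message can be sketched as follows. The
patch body is not quoted in this thread, so this is only an illustration of
the arithmetic; struct byte_range and page_align_range are hypothetical names
and the kernel uses its own rounding helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inclusive byte range to drop from the page cache. */
struct byte_range {
	uint64_t start;	/* first byte, rounded down to a page boundary */
	uint64_t end;	/* last byte, rounded up to the end of its page */
};

static struct byte_range page_align_range(uint64_t pos, uint64_t len,
					  uint64_t page_size)
{
	struct byte_range r;

	assert(len > 0);
	/* page_size must be a non-zero power of two. */
	assert(page_size && (page_size & (page_size - 1)) == 0);

	r.start = pos & ~(page_size - 1);
	r.end = ((pos + len + page_size - 1) & ~(page_size - 1)) - 1;
	return r;
}
```

With 64 KiB pages and 4 KiB blocks, a remap of one block at offset 4096 then
covers the whole first page (bytes 0 to 65535) rather than a sub-page range
that would merely be zeroed.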

Re: [PATCH 19/26] vfs: clean up generic_remap_file_range_prep return value

On Mon, Oct 15, 2018 at 08:20:14PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> Since the remap prep function can update the length of the remap
> request, we can change this function to return the usual return status
> instead of the odd behavior it has now.
> 
> Signed-off-by: Darrick J. Wong 

Looks fine,

Reviewed-by: Christoph Hellwig

Re: [PATCH 17/26] vfs: enable remap callers that can handle short operations

>  /* Update inode timestamps and remove security privileges when remapping. */
> @@ -2023,7 +2034,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, 
> loff_t src_pos,
>  {
>   loff_t ret;
>  
> - WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP));
> + WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP |
> +  REMAP_FILE_CAN_SHORTEN));

I guess this is where you could actually use REMAP_FILE_VALID_FLAGS..

>  /* REMAP_FILE flags taken care of by the vfs. */
> -#define REMAP_FILE_ADVISORY  (0)
> +#define REMAP_FILE_ADVISORY  (REMAP_FILE_CAN_SHORTEN)

And btw, they are not 'taken care of by the VFS', they need to be
taken care of by the fs (possibly using helpers) to take effect,
but they can be safely ignored.

> + if (!IS_ALIGNED(count, bs)) {
> + if (remap_flags & REMAP_FILE_CAN_SHORTEN)
> + count = ALIGN_DOWN(count, bs);
> + else
> + return -EINVAL;

if (!(remap_flags & REMAP_FILE_CAN_SHORTEN))
return -EINVAL;
count = ALIGN_DOWN(count, bs);
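
The suggested early-return form can be tried out in user space with a sketch
like the one below. remap_align_count is a hypothetical wrapper, the bit
value chosen for REMAP_FILE_CAN_SHORTEN is an assumption (only
REMAP_FILE_DEDUP's value appears in the series as quoted), and the mask
arithmetic stands in for the kernel's IS_ALIGNED and ALIGN_DOWN macros.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define REMAP_FILE_DEDUP	(1u << 0)
#define REMAP_FILE_CAN_SHORTEN	(1u << 1)	/* bit value assumed */

/*
 * Align *count down to the block size 'bs' if the caller allows a shorter
 * operation; otherwise reject an unaligned count.
 */
static int remap_align_count(uint64_t *count, uint64_t bs,
			     unsigned int remap_flags)
{
	/* bs must be a non-zero power of two. */
	assert(bs && (bs & (bs - 1)) == 0);

	if ((*count & (bs - 1)) == 0)
		return 0;			/* already aligned */

	if (!(remap_flags & REMAP_FILE_CAN_SHORTEN))
		return -EINVAL;

	*count &= ~(bs - 1);			/* ALIGN_DOWN(count, bs) */
	return 0;
}
```

Handling the error case first keeps the shortening path unindented, which is
the point of the suggested rewrite.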

Re: [PATCH 13/26] vfs: create generic_remap_file_range_touch to update inode metadata

On Mon, Oct 15, 2018 at 08:19:26PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> Create a new VFS helper to handle inode metadata updates when remapping
> into a file.  If the operation can possibly alter the file contents, we
> must update the ctime and mtime and remove security privileges, just
> like we do for regular file writes.  Wire up ocfs2 to ensure consistent
> behavior.

Subject line doesn't match the actual function name..

> +/* Update inode timestamps and remove security privileges when remapping. */
> +static int generic_remap_file_range_target(struct file *file,
> +unsigned int remap_flags)
> +{
> + int ret;
> +
> + /* If can't alter the file contents, we're done. */
> + if (remap_flags & REMAP_FILE_DEDUP)
> + return 0;
> +
> + /* Update the timestamps, since we can alter file contents. */
> + if (!(file->f_mode & FMODE_NOCMTIME)) {
> + ret = file_update_time(file);
> + if (ret)
> + return ret;
> + }
> +
> + /*
> +  * Clear the security bits if the process is not being run by root.
> +  * This keeps people from modifying setuid and setgid binaries.
> +  */
> + return file_remove_privs(file);
> +}
> +
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
>   * sense, and then flush all dirty data.  Caller must ensure that the
> @@ -1820,6 +1844,10 @@ int generic_remap_file_range_prep(struct file 
> *file_in, loff_t pos_in,
>   if (ret)
>   return ret;
>  
> + ret = generic_remap_file_range_target(file_out, remap_flags);
> + if (ret)
> + return ret;
> +

Also I find the name still somewhat odd.  Why don't we side-step that
issue by moving the code directly into generic_remap_file_range_prep?

Something like this folded in:

diff --git a/fs/read_write.c b/fs/read_write.c
index 37a7d3fe35d8..6de813cf9e63 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1752,30 +1752,6 @@ static int generic_remap_check_len(struct inode 
*inode_in,
return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL;
 }
 
-/* Update inode timestamps and remove security privileges when remapping. */
-static int generic_remap_file_range_target(struct file *file,
-  unsigned int remap_flags)
-{
-   int ret;
-
-   /* If can't alter the file contents, we're done. */
-   if (remap_flags & REMAP_FILE_DEDUP)
-   return 0;
-
-   /* Update the timestamps, since we can alter file contents. */
-   if (!(file->f_mode & FMODE_NOCMTIME)) {
-   ret = file_update_time(file);
-   if (ret)
-   return ret;
-   }
-
-   /*
-* Clear the security bits if the process is not being run by root.
-* This keeps people from modifying setuid and setgid binaries.
-*/
-   return file_remove_privs(file);
-}
-
 /*
  * Read a page's worth of file data into the page cache.  Return the page
  * locked.
@@ -1950,9 +1926,25 @@ int generic_remap_file_range_prep(struct file *file_in, 
loff_t pos_in,
if (ret)
return ret;
 
-   ret = generic_remap_file_range_target(file_out, remap_flags);
-   if (ret)
-   return ret;
+   if (!(remap_flags & REMAP_FILE_DEDUP)) {
+   /*
+* Update the timestamps, since we can alter file contents.
+*/
+   if (!(file_out->f_mode & FMODE_NOCMTIME)) {
+   ret = file_update_time(file_out);
+   if (ret)
+   return ret;
+   }
+
+   /*
+* Clear the security bits if the process is not being run by
+* root.  This keeps people from modifying setuid and setgid
+* binaries.
+*/
+   ret = file_remove_privs(file_out);
+   if (ret)
+   return ret;
+   }
 
return 0;
 }

Re: [PATCH 12/26] vfs: pass remap flags to generic_remap_checks

On Mon, Oct 15, 2018 at 08:11:19PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> Pass the same remap flags to generic_remap_checks for consistency.
> 
> Signed-off-by: Darrick J. Wong 
> Reviewed-by: Amir Goldstein 

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCH 11/26] vfs: pass remap flags to generic_remap_file_range_prep

On Mon, Oct 15, 2018 at 08:11:12PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> Plumb the remap flags through the filesystem from the vfs function
> dispatcher all the way to the prep function to prepare for behavior
> changes in subsequent patches.

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCH 10/26] vfs: combine the clone and dedupe into a single remap_file_range

> +/* All valid REMAP_FILE flags */
> +#define REMAP_FILE_VALID_FLAGS   (REMAP_FILE_DEDUP)

It looks like this still isn't used after the whole series.

With it removed:

Reviewed-by: Christoph Hellwig

btrfs send receive: No space left on device

2018-10-17 Thread Libor Klepáč

Hello,
i have new 32GB SSD in my intel nuc, installed debian9 on it, using btrfs as a 
rootfs.
Then i created subvolumes /system and /home and moved system there.

System was installed using kernel 4.9.x and filesystem created using 
btrfs-progs 4.7.x
Details follow:
main filesystem

# btrfs filesystem usage /mnt/btrfs/ssd/
Overall:
Device size:  29.08GiB
Device allocated:  4.28GiB
Device unallocated:   24.80GiB
Device missing:  0.00B
Used:  2.54GiB
Free (estimated): 26.32GiB  (min: 26.32GiB)
Data ratio:   1.00
Metadata ratio:   1.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,single: Size:4.00GiB, Used:2.48GiB
   /dev/sda3   4.00GiB

Metadata,single: Size:256.00MiB, Used:61.05MiB
   /dev/sda3 256.00MiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda3  32.00MiB

Unallocated:
   /dev/sda3  24.80GiB

#/etc/fstab
UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /mnt/btrfs/ssd  btrfs 
noatime,space_cache=v2,compress=lzo,commit=300,subvolid=5 0   0
UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /   btrfs   
noatime,space_cache=v2,compress=lzo,commit=300,subvol=/system 0   0
UUID=d801da52-813d-49da-bdda-87fc6363e0ac   /home   btrfs   
noatime,space_cache=v2,compress=lzo,commit=300,subvol=/home 0   0

-
Then i installed kernel from backports:
4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1
and btrfs-progs 4.17

For backups , i have created 16GB iscsi device on my qnap and mounted it, 
created filesystem, mounted like this:
LABEL=backup   /mnt/btrfs/backup   btrfs   
noatime,space_cache=v2,compress=lzo,subvolid=5,nofail,noauto 0   0

After send-receive operation on /home subvolume, usage looks like this:

# btrfs filesystem usage /mnt/btrfs/backup/
Overall:
Device size:  16.00GiB
Device allocated:  1.27GiB
Device unallocated:   14.73GiB
Device missing:  0.00B
Used:                844.18MiB
Free (estimated): 14.92GiB  (min: 14.92GiB)
Data ratio:   1.00
Metadata ratio:   1.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,single: Size:1.01GiB, Used:833.36MiB
   /dev/sdb    1.01GiB

Metadata,single: Size:264.00MiB, Used:10.80MiB
   /dev/sdb  264.00MiB

System,single: Size:4.00MiB, Used:16.00KiB
   /dev/sdb    4.00MiB

Unallocated:
   /dev/sdb   14.73GiB


Problem is, during send-receive of system subvolume, it runs out of space:

# btrbk run /mnt/btrfs/ssd/system/ -v  
btrbk command line client, version 0.26.1  (Wed Oct 17 09:51:20 2018)
Using configuration: /etc/btrbk/btrbk.conf
Using transaction log: /var/log/btrbk.log
Creating subvolume snapshot for: /mnt/btrfs/ssd/system
[snapshot] source: /mnt/btrfs/ssd/system
[snapshot] target: /mnt/btrfs/ssd/_snapshots/system.20181017T0951
Checking for missing backups of subvolume "/mnt/btrfs/ssd/system" in 
"/mnt/btrfs/backup/"
Creating subvolume backup (send-receive) for: 
/mnt/btrfs/ssd/_snapshots/system.20181016T2034
No common parent subvolume present, creating full backup...
[send/receive] source: /mnt/btrfs/ssd/_snapshots/system.20181016T2034
[send/receive] target: /mnt/btrfs/backup/system.20181016T2034
mbuffer: error: outputThread: error writing to  at offset 0x4b5bd000: 
Broken pipe
mbuffer: warning: error during output to : Broken pipe
WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, 
receive=/mnt/btrfs/backup) At subvol 
/mnt/btrfs/ssd/_snapshots/system.20181016T2034
WARNING: [send/receive] (send=/mnt/btrfs/ssd/_snapshots/system.20181016T2034, 
receive=/mnt/btrfs/backup) At subvol system.20181016T2034
ERROR: rename o77417-5519-0 -> 
lib/modules/4.18.0-0.bpo.1-amd64/kernel/drivers/watchdog/pcwd_pci.ko failed: No 
space left on device
ERROR: Failed to send/receive btrfs subvolume: 
/mnt/btrfs/ssd/_snapshots/system.20181016T2034  -> /mnt/btrfs/backup
[delete] options: commit-after
[delete] target: /mnt/btrfs/backup/system.20181016T2034
WARNING: Deleted partially received (garbled) subvolume: 
/mnt/btrfs/backup/system.20181016T2034
ERROR: Error while resuming backups, aborting
Created 0/2 missing backups
WARNING: Skipping cleanup of snapshots for subvolume "/mnt/btrfs/ssd/system", 
as at least one target aborted earlier
Completed within: 116s  (Wed Oct 17 09:53:16 2018)

Backup Summary (btrbk command line client, version 0.26.1)

Date:   Wed Oct 17 09:51:20 2018
Config: /etc/btrbk/btrbk.conf
Filter: subvolume=/mnt/btrfs/ssd/system

Legend:
===  up-to-date subvolume (source snapshot)
+++  created subvolume (source snapsho

Re: [PATCH 04/26] vfs: exit early from zero length remap operations

On Mon, Oct 15, 2018 at 08:10:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> If a remap caller asks us to remap to the source file's EOF and the
> source file has zero bytes, exit early.
> 
> Signed-off-by: Darrick J. Wong 

Looks good,

Reviewed-by: Christoph Hellwig

Urgent: Need BTRFS-Expert

Hello together,

i need a BTRFS-Expert for remote support.

Anyone who can assist me via Skype or teamviewer?
I will pay for it.

Please let me know,

Michael Post

Urgent: Need BTRFS-Expert