Re: [survey] sysfs layout for btrfs

2015-08-18 Thread Gabriel de Perthuis
On Sat, 15 Aug 2015 07:40:40 +0800, Anand Jain wrote:

 Hello,
 
 as of now btrfs sysfs does not include the attributes for the volume
 manager part in its sysfs layout, so it's being developed, and there are
 two types of layout, shown below. I have a quick survey to know which
 will be preferred. The contenders are:
 1. FS and VM (volume manager) attributes[1] merged sysfs layout
 
/sys/fs/btrfs/fsid -- holds FS attr, VM attr will be added here.
/sys/fs/btrfs/fsid/devices/uuid [2]  -- btrfs_devices attr here

My vote is for the first one.
Lengthening the UI/API with /pools/ seems unnecessary, and it's
better to get attributes exposed earlier.

 2. FS and VM attributes separated sysfs layout.
 
   /sys/fs/btrfs/fsid --- as is, will continue to hold fs attributes.
   /sys/fs/btrfs/pools/fsid/ -- will hold VM attributes
   /sys/fs/btrfs/pools/fsid/devices/sdx -- btrfs_devices attr here



Re: btrfs filesystem show _exact_ freaking size?

2014-11-18 Thread Gabriel de Perthuis
On 18/11/2014 11:39, Robert White wrote:
 Howdy,
 
 How does one get the exact size (in blocks preferably, but bytes okay)
 of the filesystem inside a partition? I know how to get the partition
 size, but that's not useful when shrinking a partition...

dev_item.total_bytes in btrfs-show-super's output is what you're after.



Re: Manual deduplication would be useful

2013-07-23 Thread Gabriel de Perthuis
 Hello,
 
 For over a year now, I've been experimenting with stacked filesystems
 as a way to save on resources.  A basic OS layer is shared among
 Containers, each of which stacks a layer with modifications on top of
 it.  This approach means that Containers share buffer cache and
 loaded executables.  Concrete technology choices aside, the result is
 rock-solid and the efficiency improvements are incredible, as
 documented here:
 
 http://rickywiki.vanrein.org/doku.php?id=openvz-aufs
 
 One problem with this setup is updating software.  In lieu of
 stacking-support in package managers, it is necessary to do this on a
 per-Container basis, meaning that each installs their own versions,
 including overwrites of the basic OS layer.  Deduplication could
 remedy this, but the generic mechanism is known from ZFS to be fairly
 inefficient.
 
 Interestingly however, this particular use case demonstrates that a
 much simpler deduplication mechanism than normally considered could
 be useful.  It would suffice if the filesystem could check on manual
 hints, or stack-specifying hints, to see if overlaid files share the
 same file contents; when they do, deduplication could commence.  This
 saves searching through the entire filesystem for every file or block
 written.  It might also mean that the actual stacking is not needed,
 but instead a basic OS could be cloned to form a new basic install,
 and kept around for this hint processing.
 
 I'm not sure if this should ideally be implemented inside the
 stacking approach (where it would be
 stacking-implementation-specific) or in the filesystem (for which it
 might be too far off the main purpose) but I thought it wouldn't hurt
 to start a discussion on it, given that (1) filesystems nowadays
 service multiple instances, (2) filesystems like Btrfs are based on
 COW, and (3) deduplication is a goal but the generic mechanism could
 use some efficiency improvements.
 
 I hope having seen this approach is useful to you!

Have a look at bedup[1] (disclaimer: I wrote it).  The normal mode
does incremental scans, and there's also a subcommand for
deduplicating files that you already know are identical:
  bedup dedup-files

The implementation in master uses a clone ioctl.  Here is Mark
Fasheh's latest patch series to implement a dedup ioctl[2]; it
also comes with a command to work on listed files
(btrfs-extent-same in [3]).

[1] https://github.com/g2p/bedup
[2] http://comments.gmane.org/gmane.comp.file-systems.btrfs/26310/
[3] https://github.com/markfasheh/duperemove
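
For reference, here is a minimal userspace sketch (not bedup's actual code) of
driving the clone ioctl once two files are already known to be identical.  It
assumes a kernel/header combination that exposes BTRFS_IOC_CLONE_RANGE in
linux/btrfs.h; a source length of 0 means "clone up to the end of the source":

#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Sketch only: src_fd open for read, dst_fd open for write, both on the
 * same btrfs filesystem, contents already verified identical. */
static int clone_whole_file(int src_fd, int dst_fd)
{
	struct btrfs_ioctl_clone_range_args args = {
		.src_fd = src_fd,
		.src_offset = 0,
		.src_length = 0,	/* 0 = to the end of the source file */
		.dest_offset = 0,
	};

	return ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, &args);
}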


Re: Q: Why subvolumes?

2013-07-23 Thread Gabriel de Perthuis
Now... since the snapshot's FS tree is a direct duplicate of the
 original FS tree (actually, it's the same tree, but they look like
 different things to the outside world), they share everything --
 including things like inode numbers. This is OK within a subvolume,
 because we have the semantics that subvolumes have their own distinct
 inode-number spaces. If we could snapshot arbitrary subsections of the
 FS, we'd end up having to fix up inode numbers to ensure that they
 were unique -- which can't really be an atomic operation (unless you
 want to have the FS locked while the kernel updates the inodes of the
 billion files you just snapshotted).

I don't think so; I just checked some snapshots and the inos are the same.
Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).
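
A quick illustration from userspace (hypothetical paths): the same file seen
through a subvolume and through a snapshot of it typically reports the same
st_ino but a different st_dev, which is what keeps (st_dev, st_ino) unique:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat a, b;

	/* hypothetical paths to the same file in a subvolume and its snapshot */
	if (stat("/mnt/subvol/file", &a) || stat("/mnt/snap/file", &b))
		return 1;
	printf("ino %llu / %llu (usually equal)\n",
	       (unsigned long long)a.st_ino, (unsigned long long)b.st_ino);
	printf("dev %llu / %llu (differ across subvolumes)\n",
	       (unsigned long long)a.st_dev, (unsigned long long)b.st_dev);
	return 0;
}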

The other thing to talk about here is that while the FS tree is a
 tree structure, it's not a direct one-to-one map to the directory tree
 structure. In fact, it looks more like a list of inodes, in inode
 order, with some extra info for easily tracking through the list. The
 B-tree structure of the FS tree is just a fast indexing method. So
 snapshotting a directory entry within the FS tree would require
 (somehow) making an atomic copy, or CoW copy, of only the parts of the
 FS tree that fall under the directory in question -- so you'd end up
 trying to take a sequence of records in the FS tree, of arbitrary size
 (proportional roughly to the number of entries in the directory) and
 copying them to somewhere else in the same tree in such a way that you
 can automatically dereference the copies when you modify them. So,
 ultimately, it boils down to being able to do CoW operations at the
 byte level, which is going to introduce huge quantities of extra
 metadata, and it all starts looking really awkward to implement (plus
 having to deal with the long time taken to copy the directory entries
 for the thing you're snapshotting).

Btrfs already does CoW of arbitrarily-large files (extent lists);
doing the same for directories doesn't seem impossible.


Re: Q: Why subvolumes?

2013-07-23 Thread Gabriel de Perthuis
On Tue, 23 Jul 2013 21:30:13 CEST, Hugo Mills wrote:
 On Tue, Jul 23, 2013 at 07:47:41PM +0200, Gabriel de Perthuis wrote:
Now... since the snapshot's FS tree is a direct duplicate of the
 original FS tree (actually, it's the same tree, but they look like
 different things to the outside world), they share everything --
 including things like inode numbers. This is OK within a subvolume,
 because we have the semantics that subvolumes have their own distinct
 inode-number spaces. If we could snapshot arbitrary subsections of the
 FS, we'd end up having to fix up inode numbers to ensure that they
 were unique -- which can't really be an atomic operation (unless you
 want to have the FS locked while the kernel updates the inodes of the
 billion files you just snapshotted).

 I don't think so; I just checked some snapshots and the inos are the same.
 Btrfs just changes the dev_id of subvolumes (somehow the vfs allows this).

That's what I said. Our current implementation allows different
 subvolumes to have the same inode numbers, which is what makes it
 work. If you threw out the concept of subvolumes, or allowed snapshots
 within subvolumes, then you'd be duplicating inodes within a
 subvolume, which is one reason it doesn't work.

Sorry for misreading you.
Directory snapshots can work by giving a new device number to the snapshot.
There is no need to update inode numbers in that case.


Re: Lots of harddrive chatter on after booting with btrfs on root (slow boot)

2013-07-20 Thread Gabriel de Perthuis
On Sat, 20 Jul 2013 17:15:50 +0200, Jason Russell wrote:
 I've also noted that this excessive hdd chatter does not occur
 immediately after a fresh format with arch on btrfs root.
 
 I've made some deductions/assumptions:
 This only seems to occur with btrfs roots.
 This only happens after some number of reboots OR after the partition
 fills up a little bit.
 I'm pretty sure I've ruled out everything except for the filesystem.

In my experience (as of 3.8 or so), Btrfs performance degrades on a
filled-up filesystem, even a comparatively new one.  Various
background workers start to eat io according to atop.

 I have just done two clean installs to more thoroughly compare ext4
 and btrfs roots. So far no excessive hdd chatter from btrfs.
 
 I have also seen what I have described on two other computers
 (different hardware entirely) where there is lots of hdd chatter from
 btrfs root, and nothing from ext4.
 
 Here are two threads:
 https://bbs.archlinux.org/viewtopic.php?pid=1117932
 https://bbs.archlinux.org/viewtopic.php?pid=1301684



Re: [PATCH 4/4] btrfs: offline dedupe

2013-07-16 Thread Gabriel de Perthuis
On Mon, 15 Jul 2013 13:55:51 -0700, Zach Brown wrote:
 I'd get rid of all this code by only copying each input argument on to
 the stack as it's needed and by getting rid of the writable output
 struct fields.  (more on this later)

 As I said, I'd get rid of the output fields.  Like the other vectored io
 syscalls, the return value can indicate the number of initial
 consecutive bytes that worked.  When no progress was made then it can
 return errors.  Userspace is left to sort out the resulting state and
 figure out the extents to retry in exactly the same way that it found
 the initial extents to attempt to dedupe in the first place.
 
 (And imagine strace trying to print the inputs and outputs.  Poor, poor,
 strace! :))

The dedup branch that uses this syscall[1] doesn't compare files
before submitting them anymore (the kernel will do it, and ranges
may not fit in cache once I get rid of an unnecessary loop).

I don't have strong opinions on the return style, but it would be
good to have the syscall always make progress by finding at least
one good range before bailing out, and signaling which files were
involved.  With those constraints, the current struct seems like the
cleanest way to pass the data.  The early return you suggest is a
good idea if Mark agrees, but the return condition should be
something like: if any range that already has bytes_deduped != 0 does not
get bytes_deduped incremented by iteration_len in this iteration, bail
out.
That's sufficient to guarantee progress and to know which ranges
were involved.
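
To make that condition concrete, a hypothetical helper (illustration only, not
from the patch set) over the per-destination bytes_deduped counters sampled
before and after one iteration could look like this:

#include <stdbool.h>
#include <linux/types.h>

/* before[i]/after[i]: bytes_deduped of destination i before and after one
 * iteration covering iteration_len bytes.  Bail out as soon as a destination
 * that had been making progress stops advancing by the full iteration. */
static bool keep_going(const __u64 *before, const __u64 *after,
		       int dest_count, __u64 iteration_len)
{
	for (int i = 0; i < dest_count; i++)
		if (before[i] != 0 && after[i] - before[i] != iteration_len)
			return false;
	return true;
}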

 I hope this helps!
 
 - z

Thank you and everyone involved for the progress on this.

[1] https://github.com/g2p/bedup/tree/wip/dedup-syscall




[XFSTESTS PATCH] btrfs: Test deduplication

2013-06-26 Thread Gabriel de Perthuis
---
The matching kernel patch is here:
https://github.com/g2p/linux/tree/v3.10%2Bextent-same (rebased on 3.10, fixing 
a small conflict)
Requires the btrfs-extent-same command:

- http://permalink.gmane.org/gmane.comp.file-systems.btrfs/26579
- https://github.com/markfasheh/duperemove


 tests/btrfs/313 | 93 +
 tests/btrfs/313.out | 25 ++
 tests/btrfs/group   |  1 +
 3 files changed, 119 insertions(+)
 create mode 100755 tests/btrfs/313
 create mode 100644 tests/btrfs/313.out

diff --git a/tests/btrfs/313 b/tests/btrfs/313
new file mode 100755
index 000..04e4ccb
--- /dev/null
+++ b/tests/btrfs/313
@@ -0,0 +1,93 @@
+#! /bin/bash
+# FS QA Test No. 313
+#
+# Test the deduplication syscall
+#
+#---
+# Copyright (c) 2013 Red Hat, Inc.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -f $tmp.*
+}
+
+. ./common/rc
+. ./common/filter
+
+ESAME=`set_prog_path btrfs-extent-same`
+
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_command $ESAME
+_require_command $XFS_IO_PROG
+_require_scratch
+
+_scratch_mkfs > /dev/null
+_scratch_mount >> $seqres.full 2>&1
+
+fiemap() {
+    xfs_io -r -c fiemap $1 | tail -n+2
+}
+
+dedup() {
+    ! diff -q <(fiemap $1) <(fiemap $2)
+    $ESAME $(stat -c %s $1) $1 0 $2 0
+    diff -u <(fiemap $1) <(fiemap $2)
+}
+
+echo Silence is golden
+set -e
+
+v1=$SCRATCH_MNT/v1
+v2=$SCRATCH_MNT/v2
+v3=$SCRATCH_MNT/v3
+
+$BTRFS_UTIL_PROG subvolume create $v1
+$BTRFS_UTIL_PROG subvolume create $v2
+
+dd bs=1M status=none if=/dev/urandom of=$v1/file1 count=1
+dd bs=1M status=none if=/dev/urandom of=$v1/file2 count=1
+dd bs=1M status=none if=$v1/file1 of=$v2/file3
+dd bs=1M status=none if=$v1/file1 of=$v2/file4
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $v2 $v3
+
+# identical, multiple volumes
+dedup $v1/file1 $v2/file3
+
+# not identical, same volume
+! $ESAME $((2**20)) $v1/file1 0 $v1/file2 0
+
+# identical, second file on a frozen volume
+dedup $v1/file1 $v3/file4
+
+_scratch_unmount
+_check_scratch_fs
+status=0
+exit
diff --git a/tests/btrfs/313.out b/tests/btrfs/313.out
new file mode 100644
index 000..eabe6be
--- /dev/null
+++ b/tests/btrfs/313.out
@@ -0,0 +1,25 @@
+QA output created by 313
+Silence is golden
+Create subvolume 'sdir/v1'
+Create subvolume 'sdir/v2'
+Create a readonly snapshot of 'sdir/v2' in 'sdir/v3'
+Files /dev/fd/63 and /dev/fd/62 differ
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v2/file3
+1 files asked to be deduped
+i: 0, status: 0, bytes_deduped: 1048576
+1048576 total bytes deduped in this operation
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v1/file2
+1 files asked to be deduped
+i: 0, status: 1, bytes_deduped: 0
+0 total bytes deduped in this operation
+Files /dev/fd/63 and /dev/fd/62 differ
+Deduping 2 total files
+(0, 1048576): sdir/v1/file1
+(0, 1048576): sdir/v3/file4
+1 files asked to be deduped
+i: 0, status: 0, bytes_deduped: 1048576
+1048576 total bytes deduped in this operation
diff --git a/tests/btrfs/group b/tests/btrfs/group
index bc6c256..4c868c8 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -7,5 +7,6 @@
 264 auto
 265 auto
 276 auto rw metadata
 284 auto
 307 auto quick
+313 auto
-- 
1.8.3.1.588.gb04834f



Re: Two identical copies of an image mounted result in changes to both images if only one is modified

2013-06-20 Thread Gabriel de Perthuis
On Thu, 20 Jun 2013 10:16:22 +0100, Hugo Mills wrote:
 On Thu, Jun 20, 2013 at 10:47:53AM +0200, Clemens Eisserer wrote:
 Hi,
 
 I've observed a rather strange behaviour while trying to mount two
 identical copies of the same image to different mount points.
 Each modification to one image is also performed in the second one.

 touch m2/hello
 ls -la m1  //will now also include a file called hello
 
 Is this behaviour intentional and known or should I create a bug-report?
 
It's known, and not desired behaviour. The problem is that you've
 ended up with two filesystems with the same UUID, and the FS code gets
 rather confused about that. The same problem exists with LVM snapshots
 (or other block-device-layer copies).
 
The solution is a combination of a tool to scan an image and change
 the UUID (offline), and of some code in the kernel that detects when
 it's being told about a duplicate image (rather than an additional
 device in the same FS). Neither of these has been written yet, I'm
 afraid.

To clarify, the loop devices are properly distinct, but the first
device ends up mounted twice.

I've had a look at the vfs code, and it doesn't seem to be uuid-aware,
which makes sense because the uuid is a property of the superblock and
the fs structure doesn't expose it.  It's a Btrfs problem.

Instead of redirecting to a different block device, Btrfs could and
should refuse to mount an already-mounted superblock when the block
device doesn't match, somewhere in or below btrfs_mount.  Registering
extra, distinct superblocks for an already mounted raid is a different
matter, but that isn't done through the mount syscall anyway.

 I've deleted quite a bunch of files on my production system because of 
 this...
 
Oops. I'm sorry to hear that. :(
 
Hugo.




Re: Two identical copies of an image mounted result in changes to both images if only one is modified

2013-06-20 Thread Gabriel de Perthuis
 Instead of redirecting to a different block device, Btrfs could and
 should refuse to mount an already-mounted superblock when the block
 device doesn't match, somewhere in or below btrfs_mount.  Registering
 extra, distinct superblocks for an already mounted raid is a different
 matter, but that isn't done through the mount syscall anyway.
 
The problem here is that you could quite legitimately mount
 /dev/sda (with UUID=AA1234) on, say, /mnt/fs-a, and /dev/sdb (with
 UUID=AA1234) on /mnt/fs-b -- _provided_ that /dev/sda and /dev/sdb are
 both part of the same filesystem. So you can't simply prevent mounting
 based on the device that the mount's being done with.

Okay.  The check should rely on a list of known block devices
for a given filesystem uuid.
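
A rough sketch of what such a check might look like (illustration only, not
actual btrfs code; the structure and member names here are assumptions that
may not match the kernel exactly): the fsid being mounted already maps to the
set of scanned devices, so the mount could be refused when the block device
being mounted is not among them:

static bool device_known_for_fsid(struct btrfs_fs_devices *fs_devices,
				  struct block_device *bdev)
{
	struct btrfs_device *device;

	list_for_each_entry(device, &fs_devices->devices, dev_list)
		if (device->bdev == bdev)
			return true;
	/* same uuid but an unknown device: likely a duplicated image */
	return false;
}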




Re: [PATCH 0/4] btrfs: offline dedupe v2

2013-06-11 Thread Gabriel de Perthuis
On 11/06/2013 22:31, Mark Fasheh wrote:
 Perhaps this isn't a limiation per-se but extent-same requires read/write
 access to the files we want to dedupe.  During my last series I had a
 conversation with Gabriel de Perthuis about access checking where we tried
 to maintain the ability for a user to run extent-same against a readonly
 snapshot. In addition, I reasoned that since the underlying data won't
 change (at least to the user) that we ought only require the files to be
 open for read.
 
 What I found however is that neither of these is a great idea ;)
 
 - We want to require that the inode be open for writing so that an
   unprivileged user can't do things like run dedupe on a performance
   sensitive file that they might only have read access to.  In addition I
   could see it as kind of a surprise (non-standard behavior) to an
   administrator that users could alter the layout of files they are only
   allowed to read.
 
  - Readonly snapshots won't let you open for write anyway (unsurprisingly,
   open() returns -EROFS).  So that kind of kills the idea of them being able
   to open those files for write which we want to dedupe.
 
 That said, I still think being able to run this against a set of readonly
 snapshots makes sense especially if those snapshots are taken for backup
 purposes. I'm just not sure how we can sanely enable it.

The check could be: if (fmode_write || cap_sys_admin).
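
Spelled out as code, that suggested policy would be something like the sketch
below (not the posted patch):

#include <linux/fs.h>
#include <linux/capability.h>

/* A dedupe destination is acceptable if its fd is writable, or if the
 * caller is privileged (so read-only snapshots still work for root). */
static bool dedupe_dst_allowed(struct file *dst_file)
{
	return (dst_file->f_mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN);
}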

This isn't incompatible with mnt_want_write, that check is at the
level of the superblocks and vfsmount and not the subvolume fsid.

 Code review is very much appreciated. Thanks,
  --Mark
 
 
 ChangeLog
 
 - check that we have appropriate access to each file before deduping. For
   the source, we only check that it is opened for read. Target files have to
   be open for write.
 
 - don't dedupe on readonly submounts (this is to maintain 
 
 - check that we don't dedupe files with different checksumming states
  (compare BTRFS_INODE_NODATASUM flags)
 
 - get and maintain write access to the mount during the extent same
   operation (mount_want_write())
 
 - allocate our read buffers up front in btrfs_ioctl_file_extent_same() and
   pass them through for re-use on every call to btrfs_extent_same(). (thanks
   to David Sterba dste...@suse.cz for reporting this
 
 - As the read buffers could possibly be up to 1MB (depending on user
   request), we now conditionally vmalloc them.
 
 - removed redundant check for same inode. btrfs_extent_same() catches it now
   and bubbles the error up.
 
 - remove some unnecessary printks
 
 Changes from RFC to v1:
 
 - don't error on large length value in btrfs exent-same, instead we just
   dedupe the maximum allowed.  That way userspace doesn't have to worry
   about an arbitrary length limit.
 
 - btrfs_extent_same will now loop over the dedupe range at 1MB increments (for
   a total of 16MB per request)
 
 - cleaned up poorly coded while loop in __extent_read_full_page() (thanks to
   David Sterba dste...@suse.cz for reporting this)
 
 - included two fixes from Gabriel de Perthuis g2p.c...@gmail.com:
- allow dedupe across subvolumes
- don't lock compressed pages twice when deduplicating
 
 - removed some unused / poorly designed fields in btrfs_ioctl_same_args.
   This should also give us a bit more reserved bytes.
 
 - return -E2BIG instead of -ENOMEM when arg list is too large (thanks to
   David Sterba dste...@suse.cz for reporting this)
 
 - Some more reserved bytes are now included as a result of some of my
   cleanups. Quite possibly we could add a couple more.
 



Re: [PATCH 0/4] btrfs: offline dedupe v2

2013-06-11 Thread Gabriel de Perthuis
On 11/06/2013 23:04, Mark Fasheh wrote:
 On Tue, Jun 11, 2013 at 10:56:59PM +0200, Gabriel de Perthuis wrote:
 What I found however is that neither of these is a great idea ;)

 - We want to require that the inode be open for writing so that an
   unprivileged user can't do things like run dedupe on a performance
   sensitive file that they might only have read access to.  In addition I
   could see it as kind of a surprise (non-standard behavior) to an
   administrator that users could alter the layout of files they are only
   allowed to read.

  - Readonly snapshots won't let you open for write anyway (unsurprisingly,
   open() returns -EROFS).  So that kind of kills the idea of them being able
   to open those files for write which we want to dedupe.

 That said, I still think being able to run this against a set of readonly
 snapshots makes sense especially if those snapshots are taken for backup
 purposes. I'm just not sure how we can sanely enable it.

 The check could be: if (fmode_write || cap_sys_admin).

 This isn't incompatible with mnt_want_write, that check is at the
 level of the superblocks and vfsmount and not the subvolume fsid.
 
 Oh ok that's certainly better. I think we still have a problem though - how
  does a process get write access to a file from a ro-snapshot? If I open a
 file (as root) on a ro-snapshot on my test machine here I'll get -EROFS.

Your first series did work in that case.
The process does get a read-only fd, but that's no obstacle for the ioctl.

 I'm a bit confused - how does mnt_want_write factor in here? I think that's
  for a totally separate kind of accounting, right?

It doesn't, it's just that I had spent a few minutes checking anyway.




Re: [PATCH 4/4] btrfs: offline dedupe

2013-05-24 Thread Gabriel de Perthuis
 +#define BTRFS_MAX_DEDUPE_LEN   (16 * 1024 * 1024)
 +#define BTRFS_ONE_DEDUPE_LEN   (1 * 1024 * 1024)
 +
 +static long btrfs_ioctl_file_extent_same(struct file *file,
 +void __user *argp)
 +{
 +   struct btrfs_ioctl_same_args *args;
 +   struct btrfs_ioctl_same_args tmp;
 +   struct btrfs_ioctl_same_extent_info *info;
+   struct inode *src = file->f_dentry->d_inode;
 +   struct file *dst_file = NULL;
 +   struct inode *dst;
 +   u64 off;
 +   u64 len;
 +   int args_size;
 +   int i;
 +   int ret;
+   u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;

  The ioctl is available to non-root, so extra care should be taken with
  potential overflows etc. I haven't spotted anything so far.
 
 
 Sure. Actually, you got me thinking about some sanity checks... I need to
 add at least this check:
 
   if (btrfs_root_readonly(root))
   return -EROFS;
 
 which isn't in there as of now.

It's not needed and I'd rather do without it; read-only snapshots and
deduplication go together well for backups.
Data and metadata are guaranteed to be immutable, extent storage isn't.  This 
is also the case with raid.


 Also I don't really check the open mode (read, write, etc) on files passed
 in. We do this in the clone ioctl and it makes sense there since data (to
 the user) can change. With this ioctl though data won't ever change (even if
 the underlying extent does). So I left the checks out. A part of me is
 thinking we might want to be conservative to start with though and just add
 those type of checks in. Basically, I figure the source should be open for
 read at least and target files need write access.

I don't know of any privileged files that one would be able to open(2), but if 
this is available to unprivileged users the files all need to be open for 
reading so that it can't be used to guess at their contents.
As long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't 
hurt my use case.



Re: [PATCH 4/4] btrfs: offline dedupe

2013-05-24 Thread Gabriel de Perthuis
On Sat, 25 May 2013 00:38:27 CEST, Mark Fasheh wrote:
 On Fri, May 24, 2013 at 09:50:14PM +0200, Gabriel de Perthuis wrote:
 Sure. Actually, you got me thinking about some sanity checks... I need to
 add at least this check:

 if (btrfs_root_readonly(root))
 return -EROFS;

 which isn't in there as of now.

 It's not needed and I'd rather do without, read-only snapshots and 
 deduplication go together well for backups.
 Data and metadata are guaranteed to be immutable, extent storage isn't.  
 This is also the case with raid.

 You're absolutely right - I miswrote the check I meant.
 Specifically, I was thinking about when the entire fs is readonly due to
 either some error or the user mounted with -oro. So something more like:

   if (root->fs_info->sb->s_flags & MS_RDONLY)
   return -EROFS;

 I think that should be reasonable and wouldn't affect most use cases,
 right?

That's all right.

 Also I don't really check the open mode (read, write, etc) on files passed
 in. We do this in the clone ioctl and it makes sense there since data (to
 the user) can change. With this ioctl though data won't ever change (even if
 the underlying extent does). So I left the checks out. A part of me is
 thinking we might want to be conservative to start with though and just add
 those type of checks in. Basically, I figure the source should be open for
 read at least and target files need write access.

 I don't know of any privileged files that one would be able to open(2),
 but if this is available to unprivileged users the files all need to be
 open for reading so that it can't be used to guess at their contents. As
 long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't
 hurt my use case.

 Oh ok so this seems to make sense. How does this logic sound:

 We're not going to worry about write access since it would be entirely
 reasonable for the user to want to do this on a readonly submount
 (specifically for the purpose of deduplicating backups).

 Read access needs to be provided however so we know that the user has access
 to the file data.

 So basically, if a user can open any files for read, they can check their
 contents and dedupe them.

 Letting users dedupe files in say, /etc seems kind of weird to me but I'm
 struggling to come up with a good explanation of why that should mean we
 limit this ioctl to root.
   --Mark

I agree with that model.  Most of the code is shared with clone (and the
copy_range RFC) which are unprivileged, so it doesn't increase the potential
surface for bugs much.


Re: [PATCH 0/4] a structure for the disks scan for btrfs

2013-05-17 Thread Gabriel de Perthuis
On Fri, 17 May 2013 18:54:38 +0800, Anand Jain wrote:
 The idea was to introduce /dev/mapper for finding btrfs disks.
 However, I found we first need to consolidate the disk scan
 procedure into one function, so it can be tuned consistently
 across btrfs-progs. As of now both fi show and
 dev scan do their own disk scans.
 
 So this consolidates the btrfs disk scans in the function
 scan_devs_for_btrfs, adds /dev/mapper to the locations scanned
 for btrfs, and updates the calling functions, plus a few bug fixes.

Just scan /dev/block/*.  That contains all block devices.




Re: [PATCH 0/4] a structure for the disks scan for btrfs

2013-05-17 Thread Gabriel de Perthuis
 Just scan /dev/block/*.  That contains all block devices.

Oh, this is about finding nicer names.  Never mind.




Re: subvol copying

2013-05-15 Thread Gabriel de Perthuis
 A user of a workstation has a home directory /home/john as a subvolume.  I 
 wrote a cron job to make read-only snapshots of it under /home/john/backup 
 which was fortunate as they just ran a script that did something like
 rm -rf ~.
 
 Apart from copying dozens of gigs of data back, is there a good way of 
 recovering it all?  Whatever you suggest isn't going to work for this time 
 (the copy is almost done) but will be useful for next time.
 
 Should I have put the backups under /backup instead so that I could just 
 delete the corrupted subvol and make a read-write snapshot of the last good 
 one?

You can move subvolumes at any time, as if they were regular directories.

For example: move the backups to an external location, move what's left
of the home to another location out of the way, and make a snapshot to
restore.




Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()

2013-05-09 Thread Gabriel de Perthuis
 We want this for btrfs_extent_same. Basically readpage and friends do their
 own extent locking but for the purposes of dedupe, we want to have both
 files locked down across a set of readpage operations (so that we can
 compare data). Introduce this variant and a flag which can be set for
 extent_read_full_page() to indicate that we are already locked.
 
 This one can get stuck in TASK_UNINTERRUPTIBLE:
 
 [32129.522257] SysRq : Show Blocked State
 [32129.524337]   taskPC stack   pid father
 [32129.526515] python  D 88021f394280 0 16281  1 
 0x0004
 [32129.528656]  88020e079a48 0082 88013d3cdd40 
 88020e079fd8
 [32129.530840]  88020e079fd8 88020e079fd8 8802138dc5f0 
 88013d3cdd40
 [32129.533044]   1fff 88015286f440 
 0008
 [32129.535285] Call Trace:
 [32129.537522]  [816dcca9] schedule+0x29/0x70
 [32129.539829]  [a02b4908] wait_extent_bit+0xf8/0x150 [btrfs]
 [32129.542130]  [8107ea00] ? finish_wait+0x80/0x80
 [32129.544463]  [a02b4f84] lock_extent_bits+0x44/0xa0 [btrfs]
 [32129.546824]  [a02b4ff3] lock_extent+0x13/0x20 [btrfs]
 [32129.549198]  [a02dc0cf] add_ra_bio_pages.isra.8+0x17f/0x2d0 
 [btrfs]
 [32129.551602]  [a02dccfc] btrfs_submit_compressed_read+0x25c/0x4c0 
 [btrfs]
 [32129.554028]  [a029d131] btrfs_submit_bio_hook+0x1d1/0x1e0 [btrfs]
 [32129.556457]  [a02b2d07] submit_one_bio+0x67/0xa0 [btrfs]
 [32129.558899]  [a02b7ecd] extent_read_full_page_nolock+0x4d/0x60 
 [btrfs]
 [32129.561290]  [a02c8052] fill_data+0xb2/0x230 [btrfs]
 [32129.563623]  [a02cd57e] btrfs_ioctl+0x1f7e/0x2560 [btrfs]
 [32129.565924]  [816ddbae] ? _raw_spin_lock+0xe/0x20
 [32129.568207]  [8119b907] ? inode_get_bytes+0x47/0x60
 [32129.570472]  [811a8297] do_vfs_ioctl+0x97/0x560
 [32129.572700]  [8119bb5a] ? sys_newfstat+0x2a/0x40
 [32129.574882]  [811a87f1] sys_ioctl+0x91/0xb0
 [32129.577008]  [816e64dd] system_call_fastpath+0x1a/0x1f

For anyone trying those patches, there's a fix here:
https://github.com/g2p/linux/tree/v3.9%2Bbtrfs-extent-same



Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()

2013-05-07 Thread Gabriel de Perthuis
 We want this for btrfs_extent_same. Basically readpage and friends do their
 own extent locking but for the purposes of dedupe, we want to have both
 files locked down across a set of readpage operations (so that we can
 compare data). Introduce this variant and a flag which can be set for
 extent_read_full_page() to indicate that we are already locked.

This one can get stuck in TASK_UNINTERRUPTIBLE:

[32129.522257] SysRq : Show Blocked State
[32129.524337]   taskPC stack   pid father
[32129.526515] python  D 88021f394280 0 16281  1 0x0004
[32129.528656]  88020e079a48 0082 88013d3cdd40 
88020e079fd8
[32129.530840]  88020e079fd8 88020e079fd8 8802138dc5f0 
88013d3cdd40
[32129.533044]   1fff 88015286f440 
0008
[32129.535285] Call Trace:
[32129.537522]  [816dcca9] schedule+0x29/0x70
[32129.539829]  [a02b4908] wait_extent_bit+0xf8/0x150 [btrfs]
[32129.542130]  [8107ea00] ? finish_wait+0x80/0x80
[32129.544463]  [a02b4f84] lock_extent_bits+0x44/0xa0 [btrfs]
[32129.546824]  [a02b4ff3] lock_extent+0x13/0x20 [btrfs]
[32129.549198]  [a02dc0cf] add_ra_bio_pages.isra.8+0x17f/0x2d0 [btrfs]
[32129.551602]  [a02dccfc] btrfs_submit_compressed_read+0x25c/0x4c0 
[btrfs]
[32129.554028]  [a029d131] btrfs_submit_bio_hook+0x1d1/0x1e0 [btrfs]
[32129.556457]  [a02b2d07] submit_one_bio+0x67/0xa0 [btrfs]
[32129.558899]  [a02b7ecd] extent_read_full_page_nolock+0x4d/0x60 
[btrfs]
[32129.561290]  [a02c8052] fill_data+0xb2/0x230 [btrfs]
[32129.563623]  [a02cd57e] btrfs_ioctl+0x1f7e/0x2560 [btrfs]
[32129.565924]  [816ddbae] ? _raw_spin_lock+0xe/0x20
[32129.568207]  [8119b907] ? inode_get_bytes+0x47/0x60
[32129.570472]  [811a8297] do_vfs_ioctl+0x97/0x560
[32129.572700]  [8119bb5a] ? sys_newfstat+0x2a/0x40
[32129.574882]  [811a87f1] sys_ioctl+0x91/0xb0
[32129.577008]  [816e64dd] system_call_fastpath+0x1a/0x1f

Side note, I wish btrfs used TASK_KILLABLE[1] instead.

[1]: https://lwn.net/Articles/288056/
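
For illustration, the difference is a one-line change at each wait site:
wait_event() parks the task in TASK_UNINTERRUPTIBLE (as in the trace above),
while wait_event_killable() lets a fatal signal such as SIGKILL end the wait
(sketch only, not btrfs code):

#include <linux/wait.h>

static int wait_for_condition(wait_queue_head_t *wq, bool *done)
{
	/* unkillable form: wait_event(*wq, *done); return 0; */
	return wait_event_killable(*wq, *done);	/* -ERESTARTSYS on SIGKILL */
}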



Re: [RFC 0/5] BTRFS hot relocation support

2013-05-07 Thread Gabriel de Perthuis
 How will it compare to bcache? I'm currently thinking about buying an SSD,
 but bcache requires some effort to migrate the existing storage. And after
 all those hassles I am not even sure it would work easily with a
 dracut-generated initramfs.

   On a side note: dm-cache, which is already in-kernel, does not need the
 backing storage to be reformatted.

On the other hand dm-cache is somewhat complex to assemble, and letting
the system automount the unsynchronised backing device is a recipe for
data loss.

It will need lvm integration to become really convenient to use.

Anyway, here's a shameless plug for a tool that converts to bcache
in-place:  https://github.com/g2p/blocks#bcache-conversion



Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-07 Thread Gabriel de Perthuis
 Do you plan to support deduplication on a finer grained basis than file 
 level? As an example, in the end it could be interesting to deduplicate 1M 
 blocks of huge files. Backups of VM images come to my mind as a good 
 candidate. While my current backup script[1] takes care of this by using 
 rsync --inplace it won't consider files moved between two backup cycles. 
 This is the main purpose I'm using bedup for on my backup drive.
 
 Maybe you could define another cutoff value to consider huge files for 
 block-level deduplication?

I'm considering deduplicating aligned blocks of large files sharing the
same size (VMs with the same baseline.  Those would ideally come
pre-cowed, but rsync or scp could have broken that).

It sounds simple, and was sort-of prompted by the new syscall taking
short ranges, but it is tricky figuring out a sane heuristic (when to
hash, when to bail, when to submit without comparing, what should be the
source in the last case), and it's not something I have an immediate
need for.  It is also possible to use 9p (with standard cow and/or
small-file dedup) and trade a bit of configuration for much more
space-efficient VMs.

Finer-grained tracking of which ranges have changed, and maybe some
caching of range hashes, would be a good first step before doing any
crazy large-file heuristics.  The hash caching would actually benefit
all use cases.

 Regards,
 Kai
 
 [1]: https://gist.github.com/kakra/5520370




Re: [RFC 0/5] BTRFS hot relocation support

2013-05-07 Thread Gabriel de Perthuis
On Tue, 07 May 2013 23:58:08 +0200, Kai Krakow wrote:
 Gabriel de Perthuis g2p.c...@gmail.com wrote:
   On a side note: dm-cache, which is already in-kernel, does not need the
 backing storage to be reformatted.
 
 On the other hand dm-cache is somewhat complex to assemble, and letting
 the system automount the unsynchronised backing device is a recipe for
 data loss.
 
 Yes, that was my first impression, too, after reading about how it works. How
 safe is bcache in that respect?

The bcache superblock is there just to prevent the naked backing device
from becoming available.  So it's safe in that respect.  LVM has
something similar with hidden volumes.

 Anyway, here's a shameless plug for a tool that converts to bcache
 in-place:  https://github.com/g2p/blocks#bcache-conversion
 
 Did I say: I love your shameless plugs? ;-)
 
 I've read the docs for this tool with interest. Still I do not feel very 
 comfortable with converting my storage for some unknown outcome. Sure, I can 
 take backups (and by any means: I will). But it takes time: backup, try, 
 restore, try again, maybe restore... I don't want to find out that it was 
 all useless because it's just not ready to boot a multi-device btrfs through 
 dracut. So you see, the point is: Will that work? I didn't see any docs 
 answering my questions.

Try it with a throwaway filesystem inside a VM.  The bcache list will
appreciate the feedback on Dracut, even if you don't make the switch
for real.

 Of course, if it would work I'd happily contribute documentation to your 
 project.

That would be very welcome.




Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-07 Thread Gabriel de Perthuis
On Wed, 08 May 2013 01:04:38 +0200, Kai Krakow wrote:
 Gabriel de Perthuis g2p.c...@gmail.com wrote:
 It sounds simple, and was sort-of prompted by the new syscall taking
 short ranges, but it is tricky figuring out a sane heuristic (when to
 hash, when to bail, when to submit without comparing, what should be the
 source in the last case), and it's not something I have an immediate
 need for.  It is also possible to use 9p (with standard cow and/or
 small-file dedup) and trade a bit of configuration for much more
 space-efficient VMs.
 
 Finer-grained tracking of which ranges have changed, and maybe some
 caching of range hashes, would be a good first step before doing any
 crazy large-file heuristics.  The hash caching would actually benefit
 all use cases.
 
 Looking back to the good old peer-2-peer days (I think we all got in touch
 with them one way or another), one title pops back into my mind:
 tiger-tree-hash...
 
 I'm not really into it, but would it be possible to use tiger-tree-hashes to
 find identical blocks? Even across different-sized files...

Possible, but bedup is all about doing as little io as it can get away
with, doing streaming reads only when it has sampled that the files are
likely duplicates and not spending a ton of disk space for indexing.

Hashing everything in the hope that there are identical blocks at
unrelated places on the disk is a much more resource-intensive approach;
Liu Bo is working on that, following ZFS's design choices.




[PATCH] btrfs: don't stop searching after encountering the wrong item

2013-05-06 Thread Gabriel de Perthuis
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occuring before any search result is
found would trigger an overflow and stop the search entirely.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Signed-off-by: Gabriel de Perthuis g2p.code+bt...@gmail.com
---
 fs/btrfs/ioctl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 95d46cc..b3f0276 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1797,23 +1797,23 @@ static noinline int copy_to_sk(struct btrfs_root *root,
 
 	for (i = slot; i < nritems; i++) {
 		item_off = btrfs_item_ptr_offset(leaf, i);
 		item_len = btrfs_item_size_nr(leaf, i);
 
-		if (item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
+		btrfs_item_key_to_cpu(leaf, key, i);
+		if (!key_in_sk(key, sk))
+			continue;
+
+		if (sizeof(sh) + item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
 			item_len = 0;
 
 		if (sizeof(sh) + item_len + *sk_offset >
 		    BTRFS_SEARCH_ARGS_BUFSIZE) {
 			ret = 1;
 			goto overflow;
 		}
 
-		btrfs_item_key_to_cpu(leaf, key, i);
-		if (!key_in_sk(key, sk))
-			continue;
-
 		sh.objectid = key->objectid;
 		sh.offset = key->offset;
 		sh.type = key->type;
 		sh.len = item_len;
 		sh.transid = found_transid;
-- 
1.8.2.1.419.ga0b97c6



[PATCH] btrfs: don't stop searching after encountering the wrong item

2013-05-06 Thread Gabriel de Perthuis
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occuring before any search result is
found would trigger an overflow and stop the search entirely.

Cc: sta...@vger.kernel.org
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Signed-off-by: Gabriel de Perthuis g2p.code+bt...@gmail.com
---
(resent, with the correct header to have stable copied)

 fs/btrfs/ioctl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2c02310..f49b62f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1794,23 +1794,23 @@ static noinline int copy_to_sk(struct btrfs_root *root,
 
 	for (i = slot; i < nritems; i++) {
 		item_off = btrfs_item_ptr_offset(leaf, i);
 		item_len = btrfs_item_size_nr(leaf, i);
 
-		if (item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
+		btrfs_item_key_to_cpu(leaf, key, i);
+		if (!key_in_sk(key, sk))
+			continue;
+
+		if (sizeof(sh) + item_len > BTRFS_SEARCH_ARGS_BUFSIZE)
 			item_len = 0;
 
 		if (sizeof(sh) + item_len + *sk_offset >
 		    BTRFS_SEARCH_ARGS_BUFSIZE) {
 			ret = 1;
 			goto overflow;
 		}
 
-		btrfs_item_key_to_cpu(leaf, key, i);
-		if (!key_in_sk(key, sk))
-			continue;
-
 		sh.objectid = key->objectid;
 		sh.offset = key->offset;
 		sh.type = key->type;
 		sh.len = item_len;
 		sh.transid = found_transid;
-- 
1.8.2.1.419.ga0b97c6



Re: Possible to deduplicate read-only snapshots for space-efficient backups

2013-05-05 Thread Gabriel de Perthuis
On Sun, 05 May 2013 12:07:17 +0200, Kai Krakow wrote:
 Hey list,
 
 I wonder if it is possible to deduplicate read-only snapshots.
 
 Background:
 
 I'm using an bash/rsync script[1] to backup my whole system on a nightly 
 basis to an attached USB3 drive into a scratch area, then take a snapshot of 
 this area. I'd like to have these snapshots immutable, so they should be 
 read-only.
 
 Since rsync won't discover moved files but instead place a new copy of that 
 in the backup, I'm running the wonderful bedup application[2] to deduplicate 
 my backup drive from time to time and it almost always gains back a good 
 pile of gigabytes. The rest of storage space issues is taken care of by 
 using rsync's inplace option (although this won't cover the case of files 
 moved and changed between backup runs) and using compress-force=gzip.

 I've read about ongoing work to integrate offline (and even online) 
 deduplication into the kernel so that this process can be made atomic (and 
 even block-based instead of file-based). This would - to my understanding -
 result in the immutable attribute no longer being needed. So, given the fact above
 and for the case read-only snapshots cannot be used for this application 
 currently, will these patches address the problem and read-only snapshots 
 could be deduplicated? Or are read-only snapshots meant to be what the name 
 suggests: Immutable, even for deduplication?

There's no deep reason read-only snapshots should keep their storage
immutable, they can be affected by raid rebalancing for example.

The current bedup restriction comes from the clone call; Mark Fasheh's
dedup ioctl[3] appears to be fine with snapshots.  The bedup integration
(in a branch) is a work in progress at the moment.  I need to fix a scan
bug, tweak parameters for the latest kernel dedup patch, remove a lot of
logic that is now unnecessary, and figure out the compatibility story.

 Regards,
 Kai
 
 [1]: https://gist.github.com/kakra/5520370
 [2]: https://github.com/g2p/bedup

[3]: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25062




Re: Best Practice - Partition, or not?

2013-05-01 Thread Gabriel de Perthuis
 Hello
 
 If I want to manage a complete disk with btrfs, what's the Best
 Practice? Would it be best to create the btrfs filesystem on
 /dev/sdb, or would it be better to create just one partition from
 start to end and then do mkfs.btrfs /dev/sdb1?

Partitions (GPT) are always more flexible and future-proof.  If you ever
need to shrink the btrfs filesystem and give the space to another
partition, or do a conversion to lvm/bcache/luks (shameless plug:
https://github.com/g2p/blocks ), it'd be stupid to be locked into your
current setup for want of a few megabytes of space before your
filesystem.

 Would the same recomendation hold true, if we're talking about huge
 disks, like 4TB or so?

More so, since it can be infeasible to move this much data.




Re: [PATCH v3 2/2] Btrfs: online data deduplication

2013-05-01 Thread Gabriel de Perthuis
  #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
   struct btrfs_ioctl_dev_replace_args)
 +#define BTRFS_IOC_DEDUP_REGISTER _IO(BTRFS_IOCTL_MAGIC, 54)

This number has already been used by the offline dedup patches.
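
For context, here are the two definitions side by side as posted in the
respective series (the offline-dedupe ioctl also claims request number 54):

/* offline dedupe series */
#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
					 struct btrfs_ioctl_same_args)
/* this series -- same number, so one of the two has to move */
#define BTRFS_IOC_DEDUP_REGISTER _IO(BTRFS_IOCTL_MAGIC, 54)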



Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-22 Thread Gabriel de Perthuis

On Sat, Apr 20, 2013 at 05:49:25PM +0200, Gabriel de Perthuis wrote:

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.


I am a fan of this patch, the API is just right.  I just have a few tweaks
to suggest to the argument checking.


Awesome, thanks for looking this over!


At first the 1M limitation on the length was a bit inconvenient, but making
repeated calls in userspace is okay and allows for better error recovery
(for example, repeating the calls when only a portion of the ranges is
identical).  The destination file appears to be fragmented into 1M extents,
but these are contiguous so it's not really a problem.


Yeah I agree it's a bit inconvenient. To that end, I fixed things up so that
instead of erroring, we just limit the dedupe to 1M. If you want to see what
I'm talking about, the patch is at the top of my tree now:

https://github.com/markfasheh/btrfs-extent-same/commit/b39f93c2e78385ceea850b59edbd759120543a8b

This way userspace doesn't have to guess at what size is the max, and we can
change it in the future, etc.

Furthermore, I'm thinking it might even be better for us to just internally
loop on the entire range asked for. That won't necessarily fix the issue
where we fragment into 1M extents, but it would ease the interface even
more.

My only concern with looping over a large range would be the (almost)
unbounded nature of the operation... For example, if someone passes in a 4
Gig range and 100 files to do that on, we could be in the ioctl for some
time.

The middle ground would be to loop like I was talking about but limit the
maximum length (by just truncating the value, as above). The limit in this
case would obviously be much larger than 1 megabyte but not so large that we
can go off for an extreme amount of time. I'm thinking maybe 16 megabytes or
so to start?


A cursor-style API could work here: make the offset and length 
parameters in/out, exit early in case of error or after the read quota 
has been used up.


The caller can retry as long as the length is nonzero (and at least one 
block), and the syscall will return frequently enough that it won't 
block an unmount operation or concurrent access to the ranges.
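
As a purely hypothetical illustration (names invented here, this is not a
posted API), such a cursor-style argument block could look like the following,
with the kernel advancing the in/out fields as it makes progress and the
caller reissuing the ioctl until length reaches zero:

#include <linux/types.h>

struct dedupe_cursor_args {
	__u64 src_offset;	/* in/out - advanced past the bytes handled */
	__u64 length;		/* in/out - bytes left to examine */
	__s64 dest_fd;		/* in */
	__u64 dest_offset;	/* in/out - kept in step with src_offset */
	__s32 status;		/* out - 0, negative error, or data-differs */
	__u32 reserved;
};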



Requiring the offset or the length to align is spurious however; it doesn't
translate to any alignment in the extent tree (especially with
compression).  Requiring a minimum length of a few blocks or dropping the
alignment condition entirely would make more sense.


I'll take a look at this. Some of those checks are there for my own sanity
at the moment.

I really like that the start offset should align but there's no reason that
length can't be aligned to blocksize internally.

Are you sure that extents don't have to start at block boundaries? If that's
the case and we never have to change the start offset (to align it) then we
could drop the check entirely.


I've had a look, and btrfs fiemap only sets FIEMAP_EXTENT_NOT_ALIGNED 
for inline extents, so the alignment requirement makes sense.  The 
caller should do the alignment and decide if it wants to extend a bit 
and accept a not-same status or shrink a bit, so just keep it as is and 
maybe add explanatory comments.
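
For example, a caller that prefers to shrink the range to whole filesystem
blocks before issuing the call could do something like this sketch (the block
size would come from statfs or the filesystem's reported sectorsize):

#include <linux/types.h>

/* Round the start up and the end down so only whole blocks are covered;
 * blocksize must be a power of two. */
static void shrink_to_blocks(__u64 *offset, __u64 *length, __u64 blocksize)
{
	__u64 end = (*offset + *length) & ~(blocksize - 1);

	*offset = (*offset + blocksize - 1) & ~(blocksize - 1);
	*length = end > *offset ? end - *offset : 0;
}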



I notice there is a restriction on cross-subvolume deduplication. Hopefully
it can be lifted like it was done in 3.6 for the clone ioctl.


Ok if it was removed from clone then this might be a spurious check on my
part. Most of the real extent work in btrfs-same is done by the code from
the clone ioctl.


Good to have this code shared (compression support is another win). 
bedup will need feature parity to switch from one ioctl to the other.



Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an
experimental branch:

   https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in the
bedup readme and some unnecessary subtleties in the code.


Again, I'm really glad this is working out for you :)

I'll check out your bedup patch early this week. It will be instructive to
see how another engineer uses the ioctl.


See ranges_same and dedup_fileset.  The ImmutableFDs stuff can be 
removed and the fact that dedup can be partially successful over a range 
will ripple through.



I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.

The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */


Nit: total_files sounds like it would count the source file.
dest_count would be better.

By the way, extent-same might be better named range-same, since there is
no need for the input to fall on extent boundaries.

Re: [PATCH 0/4] [RFC] btrfs: offline dedupe

2013-04-20 Thread Gabriel de Perthuis

Hi,

The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.


I am a fan of this patch, the API is just right.  I just have a few 
tweaks to suggest to the argument checking.


At first the 1M limitation on the length was a bit inconvenient, but 
making repeated calls in userspace is okay and allows for better error 
recovery (for example, repeating the calls when only a portion of the 
ranges is identical).  The destination file appears to be fragmented 
into 1M extents, but these are contiguous so it's not really a problem.


Requiring the offset or the length to align is spurious however; it 
doesn't translate to any alignment in the extent tree (especially with 
compression).  Requiring a minimum length of a few blocks or dropping 
the alignment condition entirely would make more sense.


I notice there is a restriction on cross-subvolume deduplication. 
Hopefully it can be lifted like it was done in 3.6 for the clone ioctl.


Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an 
experimental branch:


  https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in 
the bedup readme and some unnecessary subtleties in the code.



I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.

The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
__u64 logical_offset;   /* in - start of extent in source */
__u64 length;   /* in - length of extent */
__u16 total_files;  /* in - total elements in info array */


Nit: total_files sounds like it would count the source file.
dest_count would be better.

By the way, extent-same might be better named range-same, since there is 
no need for the input to fall on extent boundaries.



__u16 files_deduped;/* out - number of files that got deduped */
__u32 reserved;
struct btrfs_ioctl_same_extent_info info[0];
};

Userspace puts each duplicate extent (other than the source) in an
item in the info array. As there can be multiple dedupes in one
operation, each info item has its own status and 'bytes_deduped'
member. This provides a number of benefits:

- We don't have to fail the entire ioctl because one of the dedupes failed.

- Userspace will always know how much progress was made on a file as we always
   return the number of bytes deduped.


#define BTRFS_SAME_DATA_DIFFERS 1
/* For extent-same ioctl */
struct btrfs_ioctl_same_extent_info {
__s64 fd;   /* in - destination file */
__u64 logical_offset;   /* in - start of extent in destination */
__u64 bytes_deduped;/* out - total # of bytes we were able
 * to dedupe from this file */
/* status of this dedupe operation:
 * 0 if dedup succeeds
 * < 0 for error
 * == BTRFS_SAME_DATA_DIFFERS if data differs
 */
__s32 status;   /* out - see above description */
__u32 reserved;
};
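
To make the calling convention concrete, here is a minimal single-destination
sketch. The struct layout and field names are copied from the quote above;
the ioctl request number is an assumption (btrfs ioctl magic 0x94, nr 54, the
value btrfs eventually used) and may not match this exact patch series, so
treat the whole thing as illustrative only:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

#define BTRFS_SAME_DATA_DIFFERS 1

/* Struct layout exactly as quoted above (RFC field names). */
struct btrfs_ioctl_same_extent_info {
        __s64 fd;
        __u64 logical_offset;
        __u64 bytes_deduped;
        __s32 status;
        __u32 reserved;
};

struct btrfs_ioctl_same_args {
        __u64 logical_offset;
        __u64 length;
        __u16 total_files;
        __u16 files_deduped;
        __u32 reserved;
        struct btrfs_ioctl_same_extent_info info[0];
};

/* Assumed request number; may differ for this patch series. */
#define BTRFS_IOC_FILE_EXTENT_SAME \
        _IOWR(0x94, 54, struct btrfs_ioctl_same_args)

/* One call, one destination: returns bytes deduped, 0 if the data differs,
 * -1 on error (either the ioctl itself or a negative per-file status). */
static int64_t btrfs_extent_same_once(int src_fd, uint64_t src_off,
                                      int dst_fd, uint64_t dst_off,
                                      uint64_t len)
{
        struct btrfs_ioctl_same_args *args;
        struct btrfs_ioctl_same_extent_info *info;
        int64_t ret = -1;

        args = calloc(1, sizeof(*args) + sizeof(*info));
        if (!args)
                return -1;
        args->logical_offset = src_off;
        args->length = len;
        args->total_files = 1;          /* one destination */
        info = &args->info[0];
        info->fd = dst_fd;
        info->logical_offset = dst_off;

        if (ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args) == 0) {
                if (info->status == 0)
                        ret = (int64_t)info->bytes_deduped;
                else if (info->status == BTRFS_SAME_DATA_DIFFERS)
                        ret = 0;
        }
        free(args);
        return ret;
}

With more than one destination the per-item status is the point: a differing
file only fails its own slot, the others still report bytes_deduped, which is
exactly the benefit listed above.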




Re: Problem with building instructions for btrfs-tools in https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories

2013-03-20 Thread Gabriel de Perthuis
 There is a missing dependency: liblzo2-dev

 I suggest amending the wiki and adding liblzo2-dev to the apt-get line
 for Ubuntu/Debian.

Added. Other distros may need some additions too.

Anyone can edit the wiki, as the spambots will attest; a ConfirmEdit 
captcha at signup would be nice.




Permanent uncancellable balance

2013-03-02 Thread Gabriel de Perthuis
Hello,
I have a filesystem that has become unusable because of a balance I can't 
stop. It is very close to full, and the balance is preventing me from 
growing it.

It was started like this:
sudo btrfs filesystem balance start -v -musage=60 -dusage=60 /srv/backups

It has been stuck at 0% across reboots and kernel upgrades (currently on 
3.8.1), and cancelling it had no effect:

Balance on '/srv/backups' is running
0 out of about 5 chunks balanced (95 considered), 100% left

According to atop it is writing but not reading anything.
Unmounting never terminates, and neither does remounting read-only; the only
way to temporarily kill it is to reboot. SIGKILL has no effect either. Is
there *any* way I can get rid of it?



Re: Permanent uncancellable balance

2013-03-02 Thread Gabriel de Perthuis
On Sat, 02 Mar 2013 17:12:37 +0600, Roman Mamedov wrote:
 Mount with the skip_balance option
 https://btrfs.wiki.kernel.org/index.php/Mount_options then you can issue
 btrfs fi balance cancel and it will succeed.

Excellent, thank you.
I had just thought of doing the same thing with ro and it worked.



[PATCH] Fix a sign bug causing invalid memory access in the ino_paths ioctl.

2012-10-10 Thread Gabriel de Perthuis
To see the problem, create many hardlinks to the same file (120 should do it),
then look up paths by inode with:

  ls -i
  btrfs inspect inode-resolve -v $ino /mnt/btrfs

I noticed the memory layout of the fspath->val data had some irregularities
(some unnecessary gaps that stop appearing about halfway),
so I'm not sure there aren't any bugs left in it.

---
 fs/btrfs/backref.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 868cf5b..29d05c6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1131,7 +1131,7 @@ char *btrfs_iref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 	int slot;
 	u64 next_inum;
 	int ret;
-	s64 bytes_left = size - 1;
+	s64 bytes_left = ((s64)size) - 1;
 	struct extent_buffer *eb = eb_in;
 	struct btrfs_key found_key;
 	int leave_spinning = path->leave_spinning;
-- 
1.7.12.117.gdc24c27
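
An aside on why the one-character cast above matters (plain C, nothing
btrfs-specific): size is an unsigned 32-bit parameter, so size - 1 is
evaluated in u32 arithmetic before being widened to s64, and the only value
where that changes the result is size == 0 -- presumably what the
many-hardlinks reproducer ends up hitting:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t size = 0;                      /* stands in for the u32 size parameter */
        int64_t wrong = size - 1;               /* wraps in u32 first: 4294967295 */
        int64_t right = ((int64_t)size) - 1;    /* widen first: -1, as intended */

        printf("%lld %lld\n", (long long)wrong, (long long)right);
        return 0;
}

A bytes_left of four billion instead of -1 is what lets the path-building
code write outside its buffer.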



[PATCH] warn when skipping snapshots created with older kernels

2012-09-05 Thread Gabriel de Perthuis
Thanks, I fixed the objectid test.
Apply with --scissors.

-- >8 --
Subject: [PATCH] btrfs send: warn when skipping snapshots created with older
 kernels.

This message is more explicit than "ERROR: could not resolve root_id",
the message that will be shown immediately before `btrfs send` bails.

Also skip invalid high OIDs, to prevent spurious warnings.

Signed-off-by: Gabriel de Perthuis g2p.code+bt...@gmail.com
---
 send-utils.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/send-utils.c b/send-utils.c
index a43d47e..03ca72a 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -224,13 +224,18 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
 
 			if ((sh->objectid != 5 &&
 			    sh->objectid < BTRFS_FIRST_FREE_OBJECTID) ||
-			    sh->objectid == BTRFS_FREE_INO_OBJECTID)
+			    sh->objectid > BTRFS_LAST_FREE_OBJECTID)
 				goto skip;
 
 			if (sh->type == BTRFS_ROOT_ITEM_KEY) {
 				/* older kernels don't have uuids+times */
 				if (sh->len < sizeof(root_item)) {
 					root_item_valid = 0;
+					fprintf(stderr,
+						"Ignoring subvolume id %llu, "
+						"btrfs send needs snapshots "
+						"created with kernel 3.6+\n",
+						sh->objectid);
 					goto skip;
 				}
 				root_item_ptr = (struct btrfs_root_item *)
-- 
1.7.12.117.gdc24c27
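
For readers not fluent in btrfs object ids, the check the fixed patch ends up
doing is roughly the predicate below. The constant values are assumptions
taken from ctree.h; note that BTRFS_FREE_INO_OBJECTID (-12ULL) sits above
BTRFS_LAST_FREE_OBJECTID (-256ULL), which is why the explicit free-ino
comparison could be replaced by the upper-bound check:

#include <stdbool.h>
#include <stdint.h>

/* Assumed values, as defined in ctree.h. */
#define BTRFS_FS_TREE_OBJECTID          5ULL
#define BTRFS_FIRST_FREE_OBJECTID       256ULL
#define BTRFS_LAST_FREE_OBJECTID        (-256ULL)

/* A search result can describe a subvolume root only if it is the top-level
 * FS tree (5) or its objectid lies in the range handed out to user-created
 * subvolumes; everything above LAST_FREE (free-ino cache and other special
 * items) is skipped, which also silences the spurious warnings mentioned
 * above. */
static bool objectid_is_subvolume(uint64_t objectid)
{
        return objectid == BTRFS_FS_TREE_OBJECTID ||
               (objectid >= BTRFS_FIRST_FREE_OBJECTID &&
                objectid <= BTRFS_LAST_FREE_OBJECTID);
}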



[PATCH] Warn when skipping snapshots created with older kernels.

2012-09-04 Thread Gabriel de Perthuis
This message is more explicit than "ERROR: could not resolve root_id",
the message that will be shown immediately before `btrfs send` bails.

Also skip invalid high OIDs.

Signed-off-by: Gabriel de Perthuis g2p.code+bt...@gmail.com
---
 send-utils.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/send-utils.c b/send-utils.c
index a43d47e..386aeb3 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -224,6 +224,7 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
 
 			if ((sh->objectid != 5 &&
 			    sh->objectid < BTRFS_FIRST_FREE_OBJECTID) ||
+			    sh->objectid >= BTRFS_LAST_FREE_OBJECTID ||
 			    sh->objectid == BTRFS_FREE_INO_OBJECTID)
 				goto skip;
 
@@ -231,6 +232,11 @@ int subvol_uuid_search_init(int mnt_fd, struct subvol_uuid_search *s)
 				/* older kernels don't have uuids+times */
 				if (sh->len < sizeof(root_item)) {
 					root_item_valid = 0;
+					fprintf(stderr,
+						"Ignoring subvolume id %llu, "
+						"btrfs send needs snapshots "
+						"created with kernel 3.6+\n",
+						sh->objectid);
 					goto skip;
 				}
 				root_item_ptr = (struct btrfs_root_item *)
-- 
1.7.12.117.gdc24c27
