sendfile(2) not killable on btrfs

2016-01-06 Thread Akihiro Suda
Hi,

I have an issue on btrfs, and opened a bugzilla ticket:
https://bugzilla.kernel.org/show_bug.cgi?id=110391

=
Commit 296291cd should have made sendfile(2) killable.

https://github.com/torvalds/linux/commit/296291cd
> Currently a simple program below issues a sendfile(2) system call which
> takes about 62 days to complete in my test KVM instance.
>
> int fd;
> off_t off = 0;
>
> fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644);
> ftruncate(fd, 2);
> lseek(fd, 0, SEEK_END);
> sendfile(fd, fd, &off, 0xfffffff);
>
> Now you should not ask kernel to do a stupid stuff like copying 256MB in
> 2-byte chunks and call fsync(2) after each chunk but if you do, sysadmin
> should have a way to stop you.
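
As a rough user-space illustration of what "killable" means here, the sketch below copies data in small chunks and checks for a stop request before each one. It is only an analogy under stated assumptions, not the kernel code from the commit: the names copy_interruptible, request_stop and stop_requested are hypothetical, and the kernel checks for a pending fatal signal rather than a flag.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Set to 1 to ask a running copy to stop; safe to set from a signal handler. */
static volatile sig_atomic_t stop_requested;

/* Hypothetical handler; a real program would install it with sigaction(). */
void request_stop(int signo)
{
	(void)signo;
	stop_requested = 1;
}

/*
 * Copy up to len bytes from in_fd to out_fd in pieces of at most chunk bytes.
 * Returns the number of bytes copied, or -1 with errno set. If a stop was
 * requested before any byte was copied, errno is EINTR.
 */
ssize_t copy_interruptible(int in_fd, int out_fd, size_t len, size_t chunk)
{
	char buf[4096];
	size_t done = 0;

	assert(chunk > 0);
	if (chunk > sizeof(buf))
		chunk = sizeof(buf);

	while (done < len) {
		size_t want = (len - done < chunk) ? len - done : chunk;
		size_t off = 0;
		ssize_t n;

		/* The important part: look for a stop request before every chunk. */
		if (stop_requested) {
			if (done > 0)
				break;
			errno = EINTR;
			return -1;
		}

		n = read(in_fd, buf, want);
		if (n < 0)
			return -1;
		if (n == 0)
			break;	/* end of input */

		/* Handle short writes so every byte read is written out. */
		while (off < (size_t)n) {
			ssize_t w = write(out_fd, buf + off, (size_t)n - off);

			if (w < 0)
				return -1;
			off += (size_t)w;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

With a check like this, even a request made of millions of 2-byte chunks can be abandoned after the chunk currently in flight, which is the property the report says is missing on btrfs.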


However, it is still not killable on btrfs.

Stack trace:

  [] write_all_supers.isra.45+0x960/0xb00 [btrfs]
  [] write_ctree_super+0x17/0x20 [btrfs]
  [] btrfs_sync_log+0x897/0xb40 [btrfs]
  [] btrfs_sync_file+0x328/0x360 [btrfs]
  [] vfs_fsync_range+0x4b/0xb0
  [] btrfs_file_write_iter+0x207/0x510 [btrfs]
  [] new_sync_write+0x9b/0xe0
  [] __vfs_write+0x26/0x40
  [] __kernel_write+0x53/0xf0
  [] write_pipe_buf+0x72/0xa0
  [] __splice_from_pipe+0xf9/0x170
  [] splice_from_pipe+0x5e/0x90
  [] default_file_splice_write+0x1d/0x30
  [] direct_splice_actor+0x36/0x40
  [] splice_direct_to_actor+0xe6/0x210
  [] do_splice_direct+0x98/0xd0
  [] do_sendfile+0x1bf/0x3a0
  [] SyS_sendfile64+0x5e/0xb0
  [] entry_SYSCALL_64_fastpath+0x16/0x75
  [] 0x
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Add big device, remove small device, read-only

2016-01-06 Thread Rasmus Abrahamsen


On Wed, Jan 6, 2016, at 07:45 AM, Duncan wrote:
> Rasmus Abrahamsen posted on Fri, 01 Jan 2016 21:20:13 +0100 as excerpted:
> 
> > I accidentally sent my messages directly to Duncan, I am copying them
> > in here.
> > 
> > Hello Duncan,
> > 
> > Thank you for the amazing response. Wow, you are awesome.
> 
> Just a note to mention that real life (TM) got in the way, and I'm a few 
> days and a couple hundred posts behind on the list, now.  Sounds like you 
> have a backup tho, and if worse comes to worse, you can simply blow away 
> the filesystem and start over.  Between that and Chris Murphy helping you 
> now, I read the thread to date but am simply marking it read without 
> further replies as it exists ATM, but might reply to new posts to the 
> thread from now, if I think I can be helpful.
> 
> (Which is why I try to discourage direct replies, too.  With direct 
> replies to a single person, if that person doesn't get back...  While if 
> it's to the list, there's more that can take up the thread, it's not on 
> just one person.  Of course the just to me was an accident, but...)
> 
> -- 
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
> 

Yeah, I have made many mistakes as this is my first time participating
on a mailing list :)

I decided to blow away the filesystem and start over. I did so by
physically removing the new 4 TB drive, mounting -o degraded and then
running:

find . -type f -exec cat {} >> /dev/null \;

In all folders that matter to me to make sure everything is readable
without the new drive. I then formatted the new drive as BTRFS and moved
over all my data. Then I added my old 4 TB drive to the new filesystem
and did balance -dconvert=raid1 -mconvert=raid1. I am now online with
the new filesystem and things are good.

I was expecting btrfs to just work (tm) when adding the new drive and
removing the old, so that made me a little sad. However, it's awesome
that no data was lost and my method actually worked.

Thanks for a great place and product.


[PATCH v3] Btrfs: Check metadata redundancy on balance

2016-01-06 Thread sam tygier
From: Sam Tygier 
Date: Wed, 6 Jan 2016 08:46:12 +
Subject: [PATCH] Btrfs: Check metadata redundancy on balance

When converting a filesystem via balance, check that the metadata mode
is at least as redundant as the data mode. For example, give a warning
when:
-dconvert=raid1 -mconvert=single

Signed-off-by: Sam Tygier 
---
v3:
  Use btrfs_warn()
  Mention profiles in message
v2:
  Use btrfs_get_num_tolerated_disk_barrier_failures()
---
 fs/btrfs/volumes.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a23399e..be91458 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3756,6 +3756,14 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
}
} while (read_seqretry(&fs_info->profiles_lock, seq));
 
+   if (btrfs_get_num_tolerated_disk_barrier_failures(bctl->meta.target) <
+       btrfs_get_num_tolerated_disk_barrier_failures(bctl->data.target)) {
+   btrfs_warn(fs_info,
+   "Warning: metadata profile %llu has lower redundancy "
+   "than data profile %llu\n", bctl->meta.target,
+   bctl->data.target);
+   }
+
if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
fs_info->num_tolerated_disk_barrier_failures = min(
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info),
-- 
2.4.3
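
The comparison the patch makes can be sketched outside the kernel as follows. This is a minimal illustration, not the patch's code: the enum, the function names and the per-profile failure counts (0 for single/dup/raid0, 1 for raid1/raid10/raid5, 2 for raid6) are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the block group profile flags. */
enum profile {
	PROFILE_SINGLE,
	PROFILE_DUP,
	PROFILE_RAID0,
	PROFILE_RAID1,
	PROFILE_RAID10,
	PROFILE_RAID5,
	PROFILE_RAID6,
};

/* Number of device failures each profile is assumed to survive. */
int tolerated_device_failures(enum profile p)
{
	switch (p) {
	case PROFILE_RAID6:
		return 2;
	case PROFILE_RAID1:
	case PROFILE_RAID10:
	case PROFILE_RAID5:
		return 1;
	case PROFILE_SINGLE:
	case PROFILE_DUP:
	case PROFILE_RAID0:
		return 0;
	}
	assert(!"unknown profile");
	return 0;
}

/* True when a balance target deserves the warning added by the patch. */
bool metadata_less_redundant(enum profile meta, enum profile data)
{
	return tolerated_device_failures(meta) < tolerated_device_failures(data);
}
```

Under these assumptions, -dconvert=raid1 -mconvert=single triggers the warning, while converting both to raid1 does not.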



[PATCH 1/3] btrfs: Continue write in case of can_not_nocow

2016-01-06 Thread Zhao Lei
btrfs failed in xfstests btrfs/080 with -o nodatacow.

Can be reproduced by following script:
  DEV=/dev/vdg
  MNT=/mnt/tmp

  umount $DEV &>/dev/null
  mkfs.btrfs -f $DEV
  mount -o nodatacow $DEV $MNT

  dd if=/dev/zero of=$MNT/test bs=1 count=2048 &
  btrfs subvolume snapshot -r $MNT $MNT/test_snap &
  wait
  --
  We can see dd failed on NO_SPACE.

Reason:
  __btrfs_buffered_write() should fall back to a COW write when a nocow
  write is impossible, and the current code is designed with that logic.
  But check_can_nocow() has 2 kinds of return value (0 and <0) for the
  can-not-nocow case, and the current code only continues the write in
  the first case. The second case happens while a snapshot of the
  subvolume is being created.

Fix:
  Continue the write when check_can_nocow() returns 0 or <0.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/file.c | 37 +
 1 file changed, 17 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 977e715..11fd981 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1516,27 +1516,24 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
 
reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
 
-   if (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
-BTRFS_INODE_PREALLOC)) {
-   ret = check_can_nocow(inode, pos, &write_bytes);
-   if (ret < 0)
-   break;
-   if (ret > 0) {
-   /*
-* For nodata cow case, no need to reserve
-* data space.
-*/
-   only_release_metadata = true;
-   /*
-* our prealloc extent may be smaller than
-* write_bytes, so scale down.
-*/
-   num_pages = DIV_ROUND_UP(write_bytes + offset,
-PAGE_CACHE_SIZE);
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
-   goto reserve_metadata;
-   }
+   if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) &&
+   check_can_nocow(inode, pos, &write_bytes) > 0) {
+   /*
+* For nodata cow case, no need to reserve
+* data space.
+*/
+   only_release_metadata = true;
+   /*
+* our prealloc extent may be smaller than
+* write_bytes, so scale down.
+*/
+   num_pages = DIV_ROUND_UP(write_bytes + offset,
+PAGE_CACHE_SIZE);
+   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+   goto reserve_metadata;
}
+
ret = btrfs_check_data_free_space(inode, pos, write_bytes);
if (ret < 0)
break;
-- 
1.8.5.1
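
The behaviour change in this patch can be summarised with a small decision function. This is only a sketch under assumptions: choose_path_before, choose_path_after and enum write_path are hypothetical names, and can_nocow stands for the return value of check_can_nocow() (>0 nocow possible, 0 not possible, <0 the check itself failed).

```c
#include <errno.h>
#include <stdbool.h>

enum write_path {
	WRITE_NOCOW,	/* write in place, only metadata is reserved */
	WRITE_COW,	/* reserve data space and do a normal COW write */
	WRITE_FAIL,	/* abort the buffered write with an error */
};

/* Old logic: a negative result from the nocow check aborted the write. */
enum write_path choose_path_before(bool nocow_inode, int can_nocow)
{
	if (nocow_inode) {
		if (can_nocow < 0)
			return WRITE_FAIL;
		if (can_nocow > 0)
			return WRITE_NOCOW;
	}
	return WRITE_COW;
}

/* New logic: anything other than "nocow is possible" falls back to COW. */
enum write_path choose_path_after(bool nocow_inode, int can_nocow)
{
	if (nocow_inode && can_nocow > 0)
		return WRITE_NOCOW;
	return WRITE_COW;
}
```

In this sketch the COW fallback can still fail later if data space really cannot be reserved. The point is only that a failed nocow check no longer ends the write by itself.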





[PATCH 2/3] btrfs: delete no_used argument in btrfs_copy_from_user

2016-01-06 Thread Zhao Lei
size_t write_bytes is not necessary for btrfs_copy_from_user(),
delete it.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/file.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 11fd981..e4f287c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -406,8 +406,7 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
 /* simple helper to fault in pages and copy.  This should go away
  * and be replaced with calls into generic code.
  */
-static noinline int btrfs_copy_from_user(loff_t pos, int num_pages,
-size_t write_bytes,
+static noinline int btrfs_copy_from_user(loff_t pos, size_t write_bytes,
 struct page **prepared_pages,
 struct iov_iter *i)
 {
@@ -1575,8 +1574,7 @@ again:
ret = 0;
}
 
-   copied = btrfs_copy_from_user(pos, num_pages,
-  write_bytes, pages, i);
+   copied = btrfs_copy_from_user(pos, write_bytes, pages, i);
 
/*
 * if we have trouble faulting in the pages, fall
-- 
1.8.5.1





[PATCH 3/3] btrfs: merge functions for wait snapshot creation

2016-01-06 Thread Zhao Lei
wait_for_snapshot_creation() belongs to the same group as these other two:
 btrfs_start_write_no_snapshoting()
 btrfs_end_write_no_snapshoting()

Rename wait_for_snapshot_creation() and move it next to the
other two.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 20 
 fs/btrfs/inode.c   | 22 +-
 3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0912f89..5d7a2c4d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3566,6 +3566,7 @@ int btrfs_delayed_refs_qgroup_accounting(struct 
btrfs_trans_handle *trans,
 int __get_raid_index(u64 flags);
 int btrfs_start_write_no_snapshoting(struct btrfs_root *root);
 void btrfs_end_write_no_snapshoting(struct btrfs_root *root);
+void btrfs_wait_for_snapshot_creation(struct btrfs_root *root);
 void check_system_chunk(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
const u64 type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dc3763a..e83ffa8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10704,3 +10704,23 @@ int btrfs_start_write_no_snapshoting(struct btrfs_root 
*root)
}
return 1;
 }
+
+static int wait_snapshoting_atomic_t(atomic_t *a)
+{
+   schedule();
+   return 0;
+}
+
+void btrfs_wait_for_snapshot_creation(struct btrfs_root *root)
+{
+   while (true) {
+   int ret;
+
+   ret = btrfs_start_write_no_snapshoting(root);
+   if (ret)
+   break;
+   wait_on_atomic_t(&root->will_be_snapshoted,
+wait_snapshoting_atomic_t,
+TASK_UNINTERRUPTIBLE);
+   }
+}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 994490d..21d7cd1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4897,26 +4897,6 @@ next:
return err;
 }
 
-static int wait_snapshoting_atomic_t(atomic_t *a)
-{
-   schedule();
-   return 0;
-}
-
-static void wait_for_snapshot_creation(struct btrfs_root *root)
-{
-   while (true) {
-   int ret;
-
-   ret = btrfs_start_write_no_snapshoting(root);
-   if (ret)
-   break;
-   wait_on_atomic_t(&root->will_be_snapshoted,
-wait_snapshoting_atomic_t,
-TASK_UNINTERRUPTIBLE);
-   }
-}
-
 static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -4948,7 +4928,7 @@ static int btrfs_setsize(struct inode *inode, struct 
iattr *attr)
 * truncation, it must capture all writes that happened before
 * this truncation.
 */
-   wait_for_snapshot_creation(root);
+   btrfs_wait_for_snapshot_creation(root);
ret = btrfs_cont_expand(inode, oldsize, newsize);
if (ret) {
btrfs_end_write_no_snapshoting(root);
-- 
1.8.5.1





Re: [PATCH] Btrfs: Intialize btrfs_root->highest_objectid when loading tree root and subvolume roots

2016-01-06 Thread Chandan Rajendra
On Tuesday 05 Jan 2016 13:12:34 David Sterba wrote:
> Sorry for not answering that. As you're going to resend it, please
> use EOVERFLOW in the btrfs_init_fs_root. We should not hit the overflow
> error in the mount path.
Right. Now I understand that.

David, Replacing the following code snippet instances (in both open_ctree()
and btrfs_init_fs_root()) ...

if (unlikely(root->highest_objectid >= BTRFS_LAST_FREE_OBJECTID)) {
mutex_unlock(&root->objectid_mutex);
ret = -EOVERFLOW;
goto free_root_dev;
}

with 

 ASSERT(tree_root->highest_objectid <= BTRFS_LAST_FREE_OBJECTID);

is probably a better option?

The validation of root->highest_objectid must have been done by
btrfs_find_free_objectid() when creating the subvolume. If the parent
subvolume already has an objectid with BTRFS_LAST_FREE_OBJECTID as the value,
btrfs_find_free_objectid() would return with an error and hence we should
never have subvolumes containing other subvolumes with objectid greater than
BTRFS_LAST_FREE_OBJECTID.
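
The argument above can be illustrated with a toy allocator. This is a sketch under assumptions, not the btrfs code: struct root_sketch, find_free_objectid and the -ENOSPC error code are hypothetical, and LAST_FREE_OBJECTID is assumed to mirror BTRFS_LAST_FREE_OBJECTID.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed to mirror BTRFS_LAST_FREE_OBJECTID. */
#define LAST_FREE_OBJECTID ((uint64_t)-256)

/* Hypothetical, stripped-down stand-in for struct btrfs_root. */
struct root_sketch {
	uint64_t highest_objectid;
};

/*
 * Hand out the next objectid, refusing once the limit has been reached.
 * Because of the refusal, highest_objectid can never exceed the limit.
 */
int find_free_objectid(struct root_sketch *root, uint64_t *objectid)
{
	assert(root && objectid);

	if (root->highest_objectid >= LAST_FREE_OBJECTID)
		return -ENOSPC;

	*objectid = ++root->highest_objectid;
	return 0;
}
```

If every objectid is handed out this way, highest_objectid <= LAST_FREE_OBJECTID holds as an invariant, which is why an assertion at load time, rather than a runtime error path, is being proposed.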

-- 
chandan



Re: cannot repair filesystem

2016-01-06 Thread Patrik Lundquist
On 1 January 2016 at 16:44, Jan Koester  wrote:
>
> Hi,
>
> if I try to repair the filesystem I get an assert. I use Raid6.
>
> Linux dibsi 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt4-3~bpo70+1 
> (2015-02-12) x86_64 GNU/Linux

Raid6 wasn't completed until Linux 3.19 and I wouldn't call it stable yet.

https://btrfs.wiki.kernel.org/index.php/RAID56

I suggest you upgrade from Wheezy to Jessie and install the latest
backports kernel and latest btrfs-progs from Git (there's no
stable-bpo for btrfs-tools) if you want to use raid56.


Re: Purposely using btrfs RAID1 in degraded mode ?

2016-01-06 Thread Alphazo
Thanks Chris for the warning. I agree that mounting both drives
separately in degraded r/w will lead to very funky results when trying
to scrub them when put together.

On Mon, Jan 4, 2016 at 6:41 PM, Chris Murphy  wrote:
> On Mon, Jan 4, 2016 at 10:00 AM, Alphazo  wrote:
>
>> I have tested the above use case with a couple of USB flash drive and
>> even used btrfs over dm-crypt partitions and it seemed to work fine
>> but I wanted to get some advices from the community if this is really
>> a bad practice that should not be used on the long run. Is there any
>> limitation/risk to read/write to/from a degraded filesystem knowing it
>> will be re-synced later?
>
> As long as you realize you're testing a sort of edge case, but an
> important one (it should work, that's the point of rw degraded mounts
> being possible), then I think it's fine.
>
> The warning though is, you need to designate a specific drive for the
> rw,degraded mounts. If you were to separately rw,degraded mount the
> two drives, the fs will become irreparably corrupt if they are
> rejoined. And you'll probably lose everything on the volume. The other
> thing is that to "resync" you have to manually initiate a scrub, it's
> not going to resync automatically, and it has to read everything on
> both drives to compare and fix what's missing. There is no equivalent
> to a write intent bitmap on Btrfs like with mdadm (the information
> ostensibly could be inferred from btrfs generation metadata similar to
> how incremental snapshot send/receive works) but that work isn't done.
>
>
>
>
> --
> Chris Murphy


Re: sendfile(2) not killable on btrfs

2016-01-06 Thread Chris Samuel
On Wed, 6 Jan 2016 06:08:46 PM Akihiro Suda wrote:

> However, it is still not killable on btrfs.

Your bugzilla entry is for the 4.2 kernel in the current Ubuntu, but this 
patch was only merged into mainline at v4.3-rc6-123-g296291c, so unless 
it was backported by Canonical it won't be present.

Could you test with the 4.3 kernel please?

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.3-wily/

cheers!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



Re: [dm-devel] [PATCH 01/35] block/fs/drivers: remove rw argument from submit_bio

2016-01-06 Thread Bart Van Assche

On 01/05/2016 09:53 PM, mchri...@redhat.com wrote:

From: Mike Christie 

This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.


Reviewed-by: Bart Van Assche 


Re: raid1 vs raid5

2016-01-06 Thread Sean Greenslade
On Tue, Jan 05, 2016 at 05:24:31PM +0100, Psalle wrote:
> Hello all and excuse me if this is a silly question. I looked around in the
> wiki and list archives but couldn't find any in-depth discussion about this:
> 
> I just realized that, since raid1 in btrfs is special (meaning only two
> copies in different devices), the effect in terms of resilience achieved
> with raid1 and raid5 are the same: you can lose one drive and not lose data.
> 
> So!, presuming that raid5 were at the same level of maturity, what would be
> the pros/cons of each mode?

This is true for "classic" RAID: assume you have 3x 1TB disks. RAID1
will give you 1.5TB, whereas RAID5 will give you 2TB.

RAID1 = 1/2 total disk space (assuming equally-sized disks)
RAID5 = (N-1)*single disk space (same assumption)
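
The two formulas can be written out as a quick calculation. This is a minimal sketch assuming equally-sized disks. The function names are hypothetical, and the size unit is whatever the caller passes in.

```c
#include <assert.h>
#include <stdint.h>

/* btrfs-style RAID1: two copies of everything, so half the raw space. */
uint64_t raid1_usable(unsigned int ndisks, uint64_t disk_size)
{
	assert(ndisks >= 2);
	return (uint64_t)ndisks * disk_size / 2;
}

/* RAID5: one disk's worth of space goes to parity. */
uint64_t raid5_usable(unsigned int ndisks, uint64_t disk_size)
{
	assert(ndisks >= 2);
	return (uint64_t)(ndisks - 1) * disk_size;
}
```

For three 1000 GB disks, raid1_usable(3, 1000) gives 1500 and raid5_usable(3, 1000) gives 2000, matching the 1.5TB and 2TB figures above.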

> As a corollary, I guess that if raid1 is considered a good compromise, then
> functional equivalents to raid6 and beyond could simply be implemented as
> "storing n copies in different devices", dropping any complex parity
> computations and making this mode entirely generic.

This is akin to what has been mentioned on the list earlier as "N-way
mirroring" and I agree that it will be very nice to have once
implemented. However it is not the same as RAID5/6 since the parity
schemes are used to get more usable storage than just simple mirroring
would allow for.

Thus, the main pro of RAID5/6 is more usable storage, and the main con
is more computational complexity (and thus more cpu requirements, slower
access time, more fragile error states, etc.)

> Since this seems pretty obvious, I'd welcome your insights on what are
> the things I'm missing, since it doesn't exist (and it isn't planned
> to be this way, AFAIK). I can foresee consistency difficulties, but
> that seems hardly insurmountable if it's being done for raid1?

Fixing an inconsistency in RAID1 is much easier than RAID5/6. No math,
just checking csums. Fixing an inconsistency in RAID5/6 involves busting
out the parity math. This is why repairing RAID5/6 only became possible
in btrfs relatively recently. Generating the parity data was relatively
easy, but rebuilding missing data with it was a more difficult task.

--Sean


[PATCH V2] Btrfs: fix output of compression message in btrfs_parse_options()

2016-01-06 Thread Tsutomu Itoh
The compression message might not be correctly output.
Fix it.

[[before fix]]

# mount -o compress /dev/sdb3 /test3
[  996.874264] BTRFS info (device sdb3): disk space caching is enabled
[  996.874268] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs 
(rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress-force /dev/sdb3 /test3
[ 1035.075017] BTRFS info (device sdb3): force zlib compression
[ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs 
(rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress /dev/sdb3 /test3
[ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs 
(rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

[[after fix]]

# mount -o compress /dev/sdb3 /test3
[  401.021753] BTRFS info (device sdb3): use zlib compression
[  401.021758] BTRFS info (device sdb3): disk space caching is enabled
[  401.021760] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs 
(rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress-force /dev/sdb3 /test3
[  439.824624] BTRFS info (device sdb3): force zlib compression
[  439.824629] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs 
(rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)

# mount -o remount,compress /dev/sdb3 /test3
[  459.918430] BTRFS info (device sdb3): use zlib compression
[  459.918434] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs 
(rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)

Signed-off-by: Tsutomu Itoh 
---
V1->V2: It is corrected that API doesn't change.

 fs/btrfs/super.c | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 24154e4..12d04c9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -381,6 +381,9 @@ int btrfs_parse_options(struct btrfs_root *root, char 
*options)
int ret = 0;
char *compress_type;
bool compress_force = false;
+   enum btrfs_compression_type saved_compress_type;
+   bool saved_compress_force;
+   int no_compress = 0;
 
cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
if (cache_gen)
@@ -458,6 +461,10 @@ int btrfs_parse_options(struct btrfs_root *root, char 
*options)
/* Fallthrough */
case Opt_compress:
case Opt_compress_type:
+   saved_compress_type = btrfs_test_opt(root, COMPRESS) ?
+   info->compress_type : BTRFS_COMPRESS_NONE;
+   saved_compress_force =
+   btrfs_test_opt(root, FORCE_COMPRESS);
if (token == Opt_compress ||
token == Opt_compress_force ||
strcmp(args[0].from, "zlib") == 0) {
@@ -466,6 +473,7 @@ int btrfs_parse_options(struct btrfs_root *root, char 
*options)
btrfs_set_opt(info->mount_opt, COMPRESS);
btrfs_clear_opt(info->mount_opt, NODATACOW);
btrfs_clear_opt(info->mount_opt, NODATASUM);
+   no_compress = 0;
} else if (strcmp(args[0].from, "lzo") == 0) {
compress_type = "lzo";
info->compress_type = BTRFS_COMPRESS_LZO;
@@ -473,25 +481,21 @@ int btrfs_parse_options(struct btrfs_root *root, char 
*options)
btrfs_clear_opt(info->mount_opt, NODATACOW);
btrfs_clear_opt(info->mount_opt, NODATASUM);
btrfs_set_fs_incompat(info, COMPRESS_LZO);
+   no_compress = 0;
} else if (strncmp(args[0].from, "no", 2) == 0) {
compress_type = "no";
btrfs_clear_opt(info->mount_opt, COMPRESS);
btrfs_clear_opt(info->mount_opt, 
FORCE_COMPRESS);
compress_force = false;
+   no_compress++;
} else {
ret = -EINVAL;
goto out;
}
 
if (compress_force) {
-   btrfs_set_and_info(root, FORCE_COMPRESS,
-  "force %s compression",
-  compress_type);
+   btrfs_set_opt(info->mount_opt, 

Re: Confining scrub to a subvolume

2016-01-06 Thread Duncan
Sree Harsha Totakura posted on Mon, 04 Jan 2016 12:01:58 +0100 as
excerpted:

> On 12/30/2015 07:26 PM, Duncan wrote:
>> David Sterba posted on Wed, 30 Dec 2015 18:39:49 +0100 as excerpted:
>> 
>>> On Wed, Dec 30, 2015 at 01:00:34AM +0100, Sree Harsha Totakura wrote:
>>>> Is it possible to confine scrubbing to a subvolume instead of the
>>>> whole file system?
>>>
>>> No. Scrub reads the blocks from devices (without knowing which files
>>> own them) and compares them to the stored checksums.
>> 
>> Of course if like me you prefer not to have all your data eggs in one
>> filesystem basket and have used partitions (or LVM) and multiple
>> independent btrfs, in which case you scrub the filesystem you want, and
>> don't worry about the others. =:^)
> 
> I considered it, but after reading somewhere (couldn't find the source)
> that having a single btrfs could be beneficial, I decided not to.
> Clearly, it doesn't seem to be true in this case.

It depends very much on your viewpoint and use-case.

Arguably, btrfs should /eventually/ provide a more flexible alternative 
to lvm as a volume manager, letting you do subvolumes, restrict them if 
desired to some maximum size via quotas (which remain buggy and 
unreliable on btrfs ATM so don't try to use them for that yet), and let 
you "magically" add and remove devices from a single btrfs storage pool, 
changing quota settings as desired, without the hassle of managing 
individual partitions and multiple filesystems.

But IMO at least, and I'd guess in the opinion of most on the list, btrfs 
at its present "still stabilizing, not yet fully stable and mature" 
status, remains at least partially unsuited to that.  Besides general 
stability, some options, including quotas, simply don't work reliably 
yet, and other tools and features have yet to be developed.

But even in the future, when btrfs is stable and many are using these 
sorts of logical volume management and integrated filesystem features to 
do everything on a single filesystem, I'm still very likely to consider 
that a higher risk than I'm willing to take, because it'll still be 
ultimately putting all those data eggs in the filesystem basket, which if 
the bottom falls out...

In addition, I actually tried, for instance, big partitioned mdraids, 
with several filesystems each on their own partition of that mdraid, and 
I eventually came to the conclusion that at some point, they simply get 
too big, and maintenance simply takes too long, to be practical for me.  
When adding a multi-TB device to a big mdraid takes days...  I'd much 
rather have multiple much smaller mdraids, or now, btrfs raids, and 
perhaps still take days overall to do the same thing if I'm doing it to 
all of them, but in increments of a few hours each on multiple much 
smaller capacity ones, rather than a single-shot instance that takes days.

Meanwhile, my current btrfs layout is multiple mostly raid1 btrfs, on a 
pair of partitioned SSDs, with each partition under 50 GiB, under 100 
GiB for the raid1 filesystem, 50 on each device.  On that, scrubs 
normally take, literally, under a minute, full balances well under ten, 
per filesystem.  Sure, to do every single filesystem might still take say 
a half hour, but most of the time, not all filesystems are even mounted, 
and most of the time, I only need to scrub or balance perhaps three of 
them, so while if they were all in one I might do it in say 20 minutes, 
and if I had to do all of them it might take me 30 because I have to 
repeatedly type in the command for each one, because I have to do only 
three of them, it's done in five minutes or less.

That's in addition to the fact that if a filesystem dies, I've only a 
fraction of the data to btrfs restore or to copy over from backup, 
because most of it was on other filesystems, many of which weren't even 
mounted or in some cases (my /) were mounted read-only, so they're just 
fine and I don't have to btrfs restore or copy back over from backup, for 
them.

These are lessons I've learned in a quarter century of working with 
computers, about a decade on MS, a decade and a half later this year, on 
Linux.  They may not always apply to everyone, but I've definitely 
learned how to spare myself unnecessary pain, as I've learned how they 
apply to me. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 02/35] block: add REQ_OP definitions and bi_op/op fields

2016-01-06 Thread Martin K. Petersen
> "Mike" == mchristi   writes:

+enum req_op {
+   REQ_OP_READ,
+   REQ_OP_WRITE= REQ_WRITE,
+   REQ_OP_DISCARD  = REQ_DISCARD,
+   REQ_OP_WRITE_SAME   = REQ_WRITE_SAME,
+};
+

I have been irked by the REQ_ prefix in bios since the flags were
consolidated a while back. When I attempted to fix the READ/WRITE mess I
used a BLK_ prefix as a result.

Anyway. Just bikeshedding...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: Broken RAID6, segfault on chunk-recover

2016-01-06 Thread Abe

Here is the backtrace.
Any chance using chunk-recover will help repair this filesystem?
Thanks


btrfs-progs# git rev-parse HEAD
7c3394ed9ef2063a7256d4bc078a485b6f826bc5

btrfs-progs# gdb --args ./btrfs rescue chunk-recover -v /dev/sdf1

(gdb) r
Starting program: /root/btrfs-progs/btrfs rescue chunk-recover -v 
/dev/sdf1

[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".

All Devices:
Device: id = 7, name = /dev/sdh1
Device: id = 8, name = /dev/sdg1
Device: id = 4, name = /dev/sde1
Device: id = 2, name = /dev/sdc1
Device: id = 3, name = /dev/sdd1
Device: id = 5, name = /dev/sdb1
Device: id = 1, name = /dev/sdf1

[New Thread 0x76f95700 (LWP 26524)]
[New Thread 0x76794700 (LWP 26525)]
[New Thread 0x75f93700 (LWP 26526)]
[New Thread 0x75792700 (LWP 26527)]
[New Thread 0x74f91700 (LWP 26528)]
[New Thread 0x7fffe7fff700 (LWP 26529)]
[New Thread 0x7fffe77fe700 (LWP 26530)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x75f93700 (LWP 26526)]
btrfs_new_block_group_record (leaf=leaf@entry=0x7fffec0008c0, 
key=key@entry=0x75f92c30, slot=slot@entry=3) at cmds-check.c:5013

5013rec->flags = btrfs_disk_block_group_flags(leaf, ptr);


(gdb) bt

#0  btrfs_new_block_group_record (leaf=leaf@entry=0x7fffec0008c0, 
key=key@entry=0x75f92c30, slot=slot@entry=3) at cmds-check.c:5013
#1  0x004309c6 in process_block_group_item (slot=3, 
key=0x75f92c30, leaf=0x7fffec0008c0, bg_cache=0x7fffe3e8) at 
chunk-recover.c:232
#2  extract_metadata_record (rc=rc@entry=0x7fffe3b0, 
leaf=leaf@entry=0x7fffec0008c0) at chunk-recover.c:717
#3  0x00431190 in scan_one_device (dev_scan_struct=0x695820) at 
chunk-recover.c:807
#4  0x77341284 in start_thread (arg=0x75f93700) at 
pthread_create.c:333
#5  0x7707e74d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109



(gdb) thread apply all bt

Thread 8 (Thread 0x7fffe77fe700 (LWP 26530)):
#0  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:81
#1  0x773411c0 in ?? () at pthread_create.c:237 from 
/lib/x86_64-linux-gnu/libpthread.so.0

#2  0x7fffe77fe700 in ?? ()
#3  0x in ?? ()

Thread 7 (Thread 0x7fffe7fff700 (LWP 26529)):
#0  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:81
#1  0x773411c0 in ?? () at pthread_create.c:237 from 
/lib/x86_64-linux-gnu/libpthread.so.0

#2  0x7fffe7fff700 in ?? ()
#3  0x in ?? ()

Thread 6 (Thread 0x74f91700 (LWP 26528)):
#0  0x7734a013 in pread64 () at 
../sysdeps/unix/syscall-template.S:81
#1  0x00430ea5 in pread64 (__offset=532480, __nbytes=<optimized out>, __buf=0x7fffdc00093c, __fd=7) at 
/usr/include/x86_64-linux-gnu/bits/unistd.h:117

#2  scan_one_device (dev_scan_struct=0x695860) at chunk-recover.c:776
#3  0x77341284 in start_thread (arg=0x74f91700) at 
pthread_create.c:333
#4  0x7707e74d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109


Thread 5 (Thread 0x75792700 (LWP 26527)):
#0  0x7734a013 in pread64 () at 
../sysdeps/unix/syscall-template.S:81
#1  0x00430ea5 in pread64 (__offset=368640, __nbytes=<optimized out>, __buf=0x7fffe93c, __fd=6) at 
/usr/include/x86_64-linux-gnu/bits/unistd.h:117

#2  scan_one_device (dev_scan_struct=0x695840) at chunk-recover.c:776
#3  0x77341284 in start_thread (arg=0x75792700) at 
pthread_create.c:333
#4  0x7707e74d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109


Thread 4 (Thread 0x75f93700 (LWP 26526)):
#0  btrfs_new_block_group_record (leaf=leaf@entry=0x7fffec0008c0, 
key=key@entry=0x75f92c30, slot=slot@entry=3) at cmds-check.c:5013
#1  0x004309c6 in process_block_group_item (slot=3, 
key=0x75f92c30, leaf=0x7fffec0008c0, bg_cache=0x7fffe3e8) at 
chunk-recover.c:232
#2  extract_metadata_record (rc=rc@entry=0x7fffe3b0, 
leaf=leaf@entry=0x7fffec0008c0) at chunk-recover.c:717
#3  0x00431190 in scan_one_device (dev_scan_struct=0x695820) at 
chunk-recover.c:807
#4  0x77341284 in start_thread (arg=0x75f93700) at 
pthread_create.c:333
#5  0x7707e74d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109


Thread 3 (Thread 0x76794700 (LWP 26525)):
#0  0x7734a013 in pread64 () at 
../sysdeps/unix/syscall-template.S:81
#1  0x00430ea5 in pread64 (__offset=3493888, __nbytes=<optimized out>, __buf=0x7fffe800093c, __fd=4)

at /usr/include/x86_64-linux-gnu/bits/unistd.h:117
#2  scan_one_device (dev_scan_struct=0x695800) at chunk-recover.c:776
#3  0x77341284 in start_thread (arg=0x76794700) at 
pthread_create.c:333
#4  0x7707e74d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109


Thread 2 (Thread 0x76f95700 (LWP 26524)):
#0  0x7734a013 in pread64 () at 
../sysdeps/unix/syscall-template.S:81
#1  

[PATCH v2] Btrfs: fix regression that makes fitrim ioctl discard a device's MBR

2016-01-06 Thread fdmanana
From: Filipe Manana 

As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not part of a chunk/block group, including the MBR
(master boot record).
Fix this by not allowing trim/discard of any region of the device that
starts at an offset below max(alloc_start_mount_option, 1MB), just as we
already do for space allocated to chunks/block groups.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: sta...@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana 
---

V2: Fix possibility of an endless loop with the fitrim call.

 fs/btrfs/volumes.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a114b7b..fd71ee3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1258,6 +1258,14 @@ int find_free_dev_extent_start(struct btrfs_transaction 
*transaction,
int ret;
int slot;
struct extent_buffer *l;
+   u64 min_search_start;
+
+   /*
+* we don't want to overwrite the superblock on the drive (nor the MBR
+* sector), so we make sure to start at an offset of at least 1MB
+*/
+   min_search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
+   search_start = max(search_start, min_search_start);
 
path = btrfs_alloc_path();
if (!path)
@@ -1398,18 +1406,9 @@ int find_free_dev_extent(struct btrfs_trans_handle 
*trans,
 struct btrfs_device *device, u64 num_bytes,
 u64 *start, u64 *len)
 {
-   struct btrfs_root *root = device->dev_root;
-   u64 search_start;
-
/* FIXME use last free of some kind */
-
-   /*
-* we don't want to overwrite the superblock on the drive,
-* so we make sure to start at an offset of at least 1MB
-*/
-   search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
return find_free_dev_extent_start(trans->transaction, device,
- num_bytes, search_start, start, len);
+ num_bytes, 0, start, len);
 }
 
 static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
-- 
2.1.3
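
The clamping that v2 moves into find_free_dev_extent_start() can be sketched in userspace C as below. This is only an illustration under stated assumptions: clamp_search_start() and max_u64() are hypothetical names, not kernel functions, and alloc_start stands in for root->fs_info->alloc_start. The point it tries to show is that every caller, including the fitrim path, ends up with a search start of at least 1MB regardless of the value it passes in.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_DEV_OFFSET (1024ULL * 1024)	/* 1MB, the floor used in the patch */

/* Hypothetical stand-in for the kernel's max() macro. */
static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical helper: return the offset from which a free-extent search
 * may begin.  The requested start is raised to max(alloc_start, 1MB) so
 * the superblock/MBR area is never handed out or trimmed.
 */
uint64_t clamp_search_start(uint64_t search_start, uint64_t alloc_start)
{
	uint64_t min_search_start = max_u64(alloc_start, MIN_DEV_OFFSET);
	uint64_t result = max_u64(search_start, min_search_start);

	assert(result >= MIN_DEV_OFFSET);
	return result;
}
```

With this shape, a caller that passes 0 (as find_free_dev_extent() does after the patch) still starts at 1MB or at alloc_start, whichever is larger.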



[PATCH] Btrfs: fix regression that makes fitrim ioctl discard a device's MBR

2016-01-06 Thread fdmanana
From: Filipe Manana 

As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not part of a chunk/block group, including the MBR
(master boot record).
Fix this by not allowing trim/discard of any region of the device that
starts at an offset below max(alloc_start_mount_option, 1MB), just as we
already do for space allocated to chunks/block groups.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: sta...@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana 
---
 fs/btrfs/extent-tree.c |  7 +++
 fs/btrfs/volumes.c | 38 ++
 fs/btrfs/volumes.h |  5 +
 3 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cf8983e..7af2f03 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9482,8 +9482,8 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 
bytenr)
 */
if (device->total_bytes > device->bytes_used + min_free &&
!device->is_tgtdev_for_dev_replace) {
-   ret = find_free_dev_extent(trans, device, min_free,
-  &dev_offset, NULL);
+   ret = find_free_dev_extent(trans->transaction, device,
+  min_free, &dev_offset, NULL);
if (!ret)
dev_nr++;
 
@@ -10677,8 +10677,7 @@ static int btrfs_trim_free_extents(struct btrfs_device 
*device,
atomic_inc(&trans->use_count);
spin_unlock(&fs_info->trans_lock);
 
-   ret = find_free_dev_extent_start(trans, device, minlen, start,
-   &start, &len);
+   ret = find_free_dev_extent(trans, device, minlen, &start, &len);
if (trans)
btrfs_put_transaction(trans);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a114b7b..207f077 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1222,10 +1222,9 @@ again:
 
 
 /*
- * find_free_dev_extent_start - find free space in the specified device
+ * find_free_dev_extent - find free space in the specified device
  * @device:  the device which we search the free space in
  * @num_bytes:   the size of the free space that we need
- * @search_start: the position from which to begin the search
  * @start:   store the start of the free space.
  * @len: the size of the free space. that we find, or the size
  *   of the max free space if we don't find suitable free space
@@ -1242,9 +1241,9 @@ again:
  * But if we don't find suitable free space, it is used to store the size of
  * the max free space.
  */
-int find_free_dev_extent_start(struct btrfs_transaction *transaction,
-  struct btrfs_device *device, u64 num_bytes,
-  u64 search_start, u64 *start, u64 *len)
+int find_free_dev_extent(struct btrfs_transaction *transaction,
+struct btrfs_device *device, u64 num_bytes,
+u64 *start, u64 *len)
 {
struct btrfs_key key;
struct btrfs_root *root = device->dev_root;
@@ -1258,6 +1257,15 @@ int find_free_dev_extent_start(struct btrfs_transaction 
*transaction,
int ret;
int slot;
struct extent_buffer *l;
+   u64 search_start;
+
+   /* FIXME use last free of some kind */
+
+   /*
+* we don't want to overwrite the superblock on the drive (nor the MBR
+* sector), so we make sure to start at an offset of at least 1MB
+*/
+   search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
 
path = btrfs_alloc_path();
if (!path)
@@ -1394,24 +1402,6 @@ out:
return ret;
 }
 
-int find_free_dev_extent(struct btrfs_trans_handle *trans,
-struct btrfs_device *device, u64 num_bytes,
-u64 *start, u64 *len)
-{
-   struct btrfs_root *root = device->dev_root;
-   u64 search_start;
-
-   /* FIXME use last free of some kind */
-
-   /*
-* we don't want to overwrite the superblock on the drive,
-* so we make sure to start at an offset of at least 1MB
-*/
-   search_start = max(root->fs_info->alloc_start, 1024ull * 1024);
-   return find_free_dev_extent_start(trans->transaction, device,
- num_bytes, search_start, start, len);
-}
-
 static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
  struct btrfs_device *device,
  u64 start, u64 *dev_extent_len)
@@ -4597,7 +4587,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle 
*trans,
if (total_avail == 0)

Re: sendfile(2) not killable on btrfs

2016-01-06 Thread Akihiro Suda
>> However, it is still not killable on btrfs.
>
>Your bugzilla entry is for the 4.2 kernel in the current Ubuntu, but this
>patch was only merged into mainline at v4.3-rc6-123-g296291c, so unless
>it was backported by Canonical it won't be present.
>
>Could you test with the 4.3 kernel please?
>
>http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.3-wily/
>
>cheers!
>Chris

Hi,
Thank you for the comment.
The patch has been backported to v4.2 by Canonical:
http://kernel.ubuntu.com/git/ubuntu/ubuntu-wily.git/commit/mm/filemap.c?id=4d291df30fb7a94d13c6d38addf8d85d38f0111b

I tried linux-image-4.3.3-040303-generic_4.3.3-040303.201512150130_amd64.deb,
and I can still hit the bug.
Should I also test the vanilla kernel?

Sorry if this mail breaks the In-Reply-To header; I'm not subscribed to
the ML and am writing this reply manually.


[PATCH v3 01/16] btrfs: dedup: Introduce dedup framework and its header

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the btrfs online (write-time) de-duplication framework and
the header it needs.

The new de-duplication framework is going to support 2 different dedup
methods and 1 dedup hash.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
v3:
  Reduce the size of struct btrfs_dedup_hash.
  Increase max dedup size to 8M for better performance.
---
 fs/btrfs/ctree.h |   3 ++
 fs/btrfs/dedup.h | 123 +++
 2 files changed, 126 insertions(+)
 create mode 100644 fs/btrfs/dedup.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4c23f34..62fed1d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1816,6 +1816,9 @@ struct btrfs_fs_info {
 * and will be latter freed. Protected by fs_info->chunk_mutex.
 */
struct list_head pinned_chunks;
+
+   /* reference to inband de-duplication info */
+   struct btrfs_dedup_info *dedup_info;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedup.h b/fs/btrfs/dedup.h
new file mode 100644
index 000..1e04d89
--- /dev/null
+++ b/fs/btrfs/dedup.h
@@ -0,0 +1,123 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUP__
+#define __BTRFS_DEDUP__
+
+#include 
+#include 
+
+/*
+ * Dedup storage backend
+ * On disk is persist storage but overhead is large
+ * In memory is fast but will lose all its hash on umount
+ */
+#define BTRFS_DEDUP_BACKEND_INMEMORY   0
+#define BTRFS_DEDUP_BACKEND_ONDISK 1
+#define BTRFS_DEDUP_BACKEND_LAST   2
+
+/* Dedup block size limit and default value */
+#define BTRFS_DEDUP_BLOCKSIZE_MAX  (8 * 1024 * 1024)
+#define BTRFS_DEDUP_BLOCKSIZE_MIN  (16 * 1024)
+#define BTRFS_DEDUP_BLOCKSIZE_DEFAULT  (32 * 1024)
+
+/* Hash algorithm, only support SHA256 yet */
+#define BTRFS_DEDUP_HASH_SHA256    0
+
+static int btrfs_dedup_sizes[] = { 32 };
+
+/*
+ * For caller outside of dedup.c
+ *
+ * Different dedup backends should have their own hash structure
+ */
+struct btrfs_dedup_hash {
+   u64 bytenr;
+   u32 num_bytes;
+
+   /* last field is a variable length array of dedup hash */
+   u8 hash[];
+};
+
+struct btrfs_dedup_info {
+   /* dedup blocksize */
+   u64 blocksize;
+   u16 backend;
+   u16 hash_type;
+
+   /* Hash driver */
+   struct crypto_shash *dedup_driver;
+
+   /* following members are only used in in-memory dedup mode */
+   struct rb_root hash_root;
+   struct rb_root bytenr_root;
+   struct list_head lru_list;
+   spinlock_t lock;
+   u64 limit_nr;
+   u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+int btrfs_dedup_hash_size(u16 type);
+struct btrfs_dedup_hash *btrfs_dedup_alloc_hash(u16 type);
+
+/*
+ * Initial inband dedup info
+ * Called at either dedup enable or mount time.
+ */
+int btrfs_dedup_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+  u64 blocksize, u64 limit);
+
+/*
+ * Disable dedup and invalidate all its dedup data.
+ * Called at dedup disable time.
+ */
+int btrfs_dedup_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedup.
+ * Caller must ensure [start, start + dedup_bs) has valid data.
+ */
+int btrfs_dedup_calc_hash(struct btrfs_root *root, struct inode *inode,
+ u64 start, struct btrfs_dedup_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedup_calc_hash() first to get the hash.
+ *
+ * @inode: the inode for we are writing
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * INCREASED.
+ * Return 0 for a hash miss. Nothing is done
+ */
+int btrfs_dedup_search(struct inode *inode, u64 file_pos,
+  struct btrfs_dedup_hash *hash);
+
+/* Add a dedup hash into dedup tree */
+int btrfs_dedup_add(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+   struct btrfs_dedup_hash *hash);
+
+/* Remove a dedup hash from dedup tree */
+int btrfs_dedup_del(struct btrfs_trans_handle 
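
struct btrfs_dedup_hash above ends in a flexible array member, so its allocation size depends on the hash type. A minimal userspace sketch of that sizing follows. It is an assumption-laden illustration, not the patch's code: dedup_hash_size() and dedup_alloc_hash() are hypothetical analogues of the declared btrfs_dedup_hash_size() and btrfs_dedup_alloc_hash(), calloc() stands in for the kernel allocator, and the -1 error return is a choice made only for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Digest length per hash type; only SHA256 (32 bytes), as in the header. */
static const int dedup_sizes[] = { 32 };

/* Userspace mirror of the layout: fixed fields, then the digest bytes. */
struct dedup_hash {
	uint64_t bytenr;
	uint32_t num_bytes;
	uint8_t hash[];		/* flexible array member */
};

/* Total bytes needed for one hash of the given type, or -1 if unknown. */
int dedup_hash_size(uint16_t type)
{
	if (type >= ARRAY_SIZE(dedup_sizes))
		return -1;
	assert(dedup_sizes[type] > 0);
	return (int)(sizeof(struct dedup_hash) + (size_t)dedup_sizes[type]);
}

/* Allocate a zeroed hash with room for the digest; caller frees it. */
struct dedup_hash *dedup_alloc_hash(uint16_t type)
{
	int size = dedup_hash_size(type);

	if (size < 0)
		return NULL;
	return calloc(1, (size_t)size);
}
```

A caller would obtain a hash with dedup_alloc_hash(0), fill hash->hash with the 32-byte digest, and release it with free().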

[PATCH v3 02/16] btrfs: dedup: Introduce function to initialize dedup info

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Add generic function to initialize dedup info.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/Makefile |  2 +-
 fs/btrfs/dedup.c  | 96 +++
 fs/btrfs/dedup.h  | 14 ++--
 3 files changed, 109 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/dedup.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 6d1d0b9..a8bd917 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o hash.o
+  uuid-tree.o props.o hash.o dedup.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
new file mode 100644
index 000..c1fadff
--- /dev/null
+++ b/fs/btrfs/dedup.c
@@ -0,0 +1,96 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedup.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+int btrfs_dedup_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+  u64 blocksize, u64 limit)
+{
+   struct btrfs_dedup_info *dedup_info;
+   int ret = 0;
+
+   /* Sanity check */
+   if (blocksize > BTRFS_DEDUP_BLOCKSIZE_MAX ||
+   blocksize < BTRFS_DEDUP_BLOCKSIZE_MIN ||
+   blocksize < fs_info->tree_root->sectorsize ||
+   !is_power_of_2(blocksize))
+   return -EINVAL;
+   if (type >= ARRAY_SIZE(btrfs_dedup_sizes))
+   return -EINVAL;
+   if (backend >= BTRFS_DEDUP_BACKEND_LAST)
+   return -EINVAL;
+   if (backend == BTRFS_DEDUP_BACKEND_INMEMORY && limit == 0)
+   limit = 4096; /* default value */
+   if (backend == BTRFS_DEDUP_BACKEND_ONDISK && limit != 0)
+   limit = 0;
+
+   if (fs_info->dedup_info) {
+   dedup_info = fs_info->dedup_info;
+
+   /* Check if we are re-enable for different dedup config */
+   if (dedup_info->blocksize != blocksize ||
+   dedup_info->hash_type != type ||
+   dedup_info->backend != backend) {
+   btrfs_dedup_disable(fs_info);
+   goto enable;
+   }
+
+   /* On-fly limit change is OK */
+   spin_lock(&dedup_info->lock);
+   fs_info->dedup_info->limit_nr = limit;
+   spin_unlock(&dedup_info->lock);
+   return 0;
+   }
+
+enable:
+   fs_info->dedup_info = kzalloc(sizeof(*dedup_info), GFP_NOFS);
+   if (!fs_info->dedup_info)
+   return -ENOMEM;
+
+   dedup_info = fs_info->dedup_info;
+
+   dedup_info->hash_type = type;
+   dedup_info->backend = backend;
+   dedup_info->blocksize = blocksize;
+   dedup_info->limit_nr = limit;
+
+   /* Only support SHA256 yet */
+   dedup_info->dedup_driver = crypto_alloc_shash("sha256", 0, 0);
+   if (IS_ERR(dedup_info->dedup_driver)) {
+   btrfs_err(fs_info, "failed to init sha256 driver");
+   ret = PTR_ERR(dedup_info->dedup_driver);
+   goto out;
+   }
+
+   dedup_info->hash_root = RB_ROOT;
+   dedup_info->bytenr_root = RB_ROOT;
+   dedup_info->current_nr = 0;
+   INIT_LIST_HEAD(&dedup_info->lru_list);
+   spin_lock_init(&dedup_info->lock);
+
+   fs_info->dedup_info = dedup_info;
+out:
+   if (ret < 0) {
+   kfree(dedup_info);
+   fs_info->dedup_info = NULL;
+   }
+   return ret;
+}
diff --git a/fs/btrfs/dedup.h b/fs/btrfs/dedup.h
index 1e04d89..d06d667 100644
--- a/fs/btrfs/dedup.h
+++ b/fs/btrfs/dedup.h
@@ -74,8 +74,18 @@ struct btrfs_dedup_info {
 
 struct btrfs_trans_handle;
 
-int btrfs_dedup_hash_size(u16 type);
-struct btrfs_dedup_hash *btrfs_dedup_alloc_hash(u16 type);
+static inline int btrfs_dedup_hash_size(u16 type)
+{
+   if 
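
The block size sanity check at the top of btrfs_dedup_enable() can be tried out in userspace with the sketch below. It assumes the limits from dedup.h (16K minimum, 8M maximum) and takes the sector size as a plain parameter; dedup_check_blocksize() and this local is_power_of_2() are hypothetical names written for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEDUP_BLOCKSIZE_MAX (8ULL * 1024 * 1024)
#define DEDUP_BLOCKSIZE_MIN (16ULL * 1024)

/* Local stand-in for the kernel's is_power_of_2(). */
static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical helper: return 0 if blocksize is acceptable, -EINVAL if it
 * is out of range, smaller than a sector, or not a power of two.
 */
int dedup_check_blocksize(uint64_t blocksize, uint64_t sectorsize)
{
	assert(sectorsize != 0);

	if (blocksize > DEDUP_BLOCKSIZE_MAX ||
	    blocksize < DEDUP_BLOCKSIZE_MIN ||
	    blocksize < sectorsize ||
	    !is_power_of_2(blocksize))
		return -EINVAL;
	return 0;
}
```

For example, 32K with 4K sectors passes, while 48K (not a power of two), 8K (below the minimum) and 16K with 64K sectors (smaller than a sector) are all rejected.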

[PATCH v3 00/14][For 4.6] Btrfs: Add inband (write time) de-duplication framework

2016-01-06 Thread Qu Wenruo
This updated version of inband de-duplication has the following features:
1) ONE unified dedup framework.
   Most of its code is hidden in dedup.c, which exports only a minimal
   set of interfaces to its callers.
   Reviewers and future developers should benefit from the unified
   framework.

2) TWO different back-ends with different trade-offs
   One is an improved version of the previous Fujitsu in-memory-only dedup.
   The other is an enhanced version of Liu Bo's dedup implementation.
   Its tree structure was changed to handle bytenr -> hash searches when
   deleting a hash, without the hideous data backref hack.

3) Ioctl interface with persistent dedup status
   As advised by David, we now use an ioctl to enable/disable dedup.
   Later we will add a per-file/dir dedup disable command to that ioctl.

   We now also have a dedup status, recorded in the first item of the
   dedup tree.
   Just like quota, once enabled, no extra ioctl is needed on the next
   mount.

TODO:
1) Per-file ioctl to disable dedup
   Just like the per-file compression setting

2) Support compression for the hash miss case
   Currently a hash miss goes through the non-compress routine, so the data
   is not compressed even when it could be.
   This may require changing the on-disk format of the on-disk backend.

3) Make the code more elegant
   A lot of places are quite ugly, especially the delayed_ref related
   parts, such as btrfs_dedup_add() and btrfs_dedup_del().
   We may need to rework delayed_ref into a simpler and more
   straightforward implementation.

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug
  Handle the corner case of multiple hashes on the same bytenr, to fix
  an abort trans error
  Increase dedup rate by enhancing the delayed ref handler for both backends.
  Move dedup_add() to run_delayed_ref() time, to fix an abort trans error.
  Increase the dedup block size upper limit to 8M.

Qu Wenruo (7):
  btrfs: delayed-ref: Add support for atomic increasing extent ref
  btrfs: delayed_ref: Add support for handle dedup hash
  btrfs: dedup: Add basic tree structure for on-disk dedup method
  btrfs: dedup: Introduce interfaces to resume and cleanup dedup info
  btrfs: dedup: Add support for on-disk hash search
  btrfs: dedup: Add support to delete hash for on-disk backend
  btrfs: dedup: Add support for adding hash for on-disk backend

Wang Xiaoguang (9):
  btrfs: dedup: Introduce dedup framework and its header
  btrfs: dedup: Introduce function to initialize dedup info
  btrfs: dedup: Introduce function to add hash into in-memory tree
  btrfs: dedup: Introduce function to remove hash from in-memory tree
  btrfs: dedup: Introduce function to search for an existing hash
  btrfs: dedup: Implement btrfs_dedup_calc_hash interface
  btrfs: ordered-extent: Add support for dedup
  btrfs: dedup: Inband in-memory only de-duplication implement
  btrfs: dedup: Add ioctl for inband deduplication

 fs/btrfs/Makefile|2 +-
 fs/btrfs/ctree.h |   79 +++-
 fs/btrfs/dedup.c | 1053 ++
 fs/btrfs/dedup.h |  155 +++
 fs/btrfs/delayed-ref.c   |   26 +-
 fs/btrfs/delayed-ref.h   |9 +-
 fs/btrfs/disk-io.c   |   28 +-
 fs/btrfs/disk-io.h   |1 +
 fs/btrfs/extent-tree.c   |   39 +-
 fs/btrfs/extent_io.c |   30 +-
 fs/btrfs/extent_io.h |   15 +
 fs/btrfs/inode.c |  305 +++-
 fs/btrfs/ioctl.c |   63 +++
 fs/btrfs/ordered-data.c  |   33 +-
 fs/btrfs/ordered-data.h  |   13 +
 include/trace/events/btrfs.h |3 +-
 include/uapi/linux/btrfs.h   |   23 +
 17 files changed, 1829 insertions(+), 48 deletions(-)
 create mode 100644 fs/btrfs/dedup.c
 create mode 100644 fs/btrfs/dedup.h

-- 
2.6.4





[PATCH v3 08/16] btrfs: dedup: Implement btrfs_dedup_calc_hash interface

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Unlike the dedup backends (in-memory and on-disk), only one hash method,
SHA256, is supported so far, so implement the btrfs_dedup_calc_hash()
interface using SHA256.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedup.c | 57 
 1 file changed, 57 insertions(+)

diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
index 4863cf2..4f24a2c 100644
--- a/fs/btrfs/dedup.c
+++ b/fs/btrfs/dedup.c
@@ -499,3 +499,60 @@ int btrfs_dedup_search(struct inode *inode, u64 file_pos,
}
return ret;
 }
+
+static int hash_data(struct btrfs_dedup_info *dedup_info, const char *data,
+u64 length, struct btrfs_dedup_hash *hash)
+{
+   struct crypto_shash *tfm = dedup_info->dedup_driver;
+   struct {
+   struct shash_desc desc;
+   char ctx[crypto_shash_descsize(tfm)];
+   } sdesc;
+   int ret;
+
+   sdesc.desc.tfm = tfm;
+   sdesc.desc.flags = 0;
+
+   ret = crypto_shash_digest(&sdesc.desc, data, length,
+ (char *)(hash->hash));
+   return ret;
+}
+
+int btrfs_dedup_calc_hash(struct btrfs_root *root, struct inode *inode,
+ u64 start, struct btrfs_dedup_hash *hash)
+{
+   struct page *p;
+   struct btrfs_dedup_info *dedup_info = root->fs_info->dedup_info;
+   char *data;
+   int i;
+   int ret;
+   u64 dedup_bs;
+   u64 sectorsize = root->sectorsize;
+
+   if (!dedup_info || !hash)
+   return 0;
+
+   WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+   dedup_bs = dedup_info->blocksize;
+   sectorsize = root->sectorsize;
+
+   data = kmalloc(dedup_bs, GFP_NOFS);
+   if (!data)
+   return -ENOMEM;
+   for (i = 0; sectorsize * i < dedup_bs; i++) {
+   char *d;
+
+   /* TODO: Add support for subpage size case */
+   p = find_get_page(inode->i_mapping,
+ (start >> PAGE_CACHE_SHIFT) + i);
+   WARN_ON(!p);
+   d = kmap_atomic(p);
+   memcpy((data + sectorsize * i), d, sectorsize);
+   kunmap_atomic(d);
+   page_cache_release(p);
+   }
+   ret = hash_data(dedup_info, data, dedup_bs, hash);
+   kfree(data);
+   return ret;
+}
-- 
2.6.4





[PATCH v3 15/16] btrfs: dedup: Add support for adding hash for on-disk backend

2016-01-06 Thread Qu Wenruo
The on-disk backend can now add hashes.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 

---
v3:
  Fix a bug in writing extent buffer with wrong offset.
  Add handler for same bytenr different hashes case.
---
 fs/btrfs/dedup.c | 84 
 1 file changed, 84 insertions(+)

diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
index f448e27..3e09fd8 100644
--- a/fs/btrfs/dedup.c
+++ b/fs/btrfs/dedup.c
@@ -246,6 +246,8 @@ out:
return ret;
 }
 
+static int ondisk_search_hash(struct btrfs_dedup_info *dedup_info, u8 *hash,
+ u64 *bytenr_ret, u64 *num_bytes_ret);
 static void inmem_destroy(struct btrfs_fs_info *fs_info);
 int btrfs_dedup_cleanup(struct btrfs_fs_info *fs_info)
 {
@@ -392,6 +394,86 @@ out:
return 0;
 }
 
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+   struct btrfs_dedup_info *dedup_info,
+   struct btrfs_path *path, u64 bytenr,
+   int prepare_del);
+static int ondisk_add(struct btrfs_trans_handle *trans,
+ struct btrfs_dedup_info *dedup_info,
+ struct btrfs_dedup_hash *hash)
+{
+   struct btrfs_path *path;
+   struct btrfs_root *dedup_root = dedup_info->dedup_root;
+   struct btrfs_key key;
+   struct btrfs_dedup_hash_item *hash_item;
+   u64 bytenr;
+   u64 num_bytes;
+   int hash_len = btrfs_dedup_sizes[dedup_info->hash_type];
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   mutex_lock(&dedup_info->ondisk_lock);
+
+   /* Search for duplicated bytenr, refer inmem_add() for reason */
+   ret = ondisk_search_bytenr(NULL, dedup_info, path, hash->bytenr, 0);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   btrfs_release_path(path);
+
+   ret = ondisk_search_hash(dedup_info, hash->hash, &bytenr, &num_bytes);
+   if (ret < 0)
+   goto out;
+   /* Same hash found, don't re-add to save dedup tree space */
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+
+   /* Insert hash->bytenr item */
+   memcpy(&key.objectid, hash->hash + hash_len - 8, 8);
+   key.type = BTRFS_DEDUP_HASH_ITEM_KEY;
+   key.offset = hash->bytenr;
+
+   ret = btrfs_insert_empty_item(trans, dedup_root, path, &key,
+   sizeof(*hash_item) + hash_len);
+   WARN_ON(ret == -EEXIST);
+   if (ret < 0)
+   goto out;
+   hash_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+  struct btrfs_dedup_hash_item);
+   btrfs_set_dedup_hash_len(path->nodes[0], hash_item, hash->num_bytes);
+   write_extent_buffer(path->nodes[0], hash->hash,
+   (unsigned long)(hash_item + 1), hash_len);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+   btrfs_release_path(path);
+
+   /* Then bytenr->hash item */
+   key.objectid = hash->bytenr;
+   key.type = BTRFS_DEDUP_BYTENR_ITEM_KEY;
+   memcpy(&key.offset, hash->hash + hash_len - 8, 8);
+
+   ret = btrfs_insert_empty_item(trans, dedup_root, path, &key, hash_len);
+   WARN_ON(ret == -EEXIST);
+   if (ret < 0)
+   goto out;
+   write_extent_buffer(path->nodes[0], hash->hash,
+   btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+   hash_len);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+
+out:
+   mutex_unlock(&dedup_info->ondisk_lock);
+   btrfs_free_path(path);
+   return ret;
+}
+
 int btrfs_dedup_add(struct btrfs_trans_handle *trans, struct btrfs_root *root,
struct btrfs_dedup_hash *hash)
 {
@@ -406,6 +488,8 @@ int btrfs_dedup_add(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
 
if (dedup_info->backend == BTRFS_DEDUP_BACKEND_INMEMORY)
return inmem_add(dedup_info, hash);
+   if (dedup_info->backend == BTRFS_DEDUP_BACKEND_ONDISK)
+   return ondisk_add(trans, dedup_info, hash);
return -EINVAL;
 }
 
-- 
2.6.4





[PATCH v3 05/16] btrfs: delayed-ref: Add support for atomic increasing extent ref

2016-01-06 Thread Qu Wenruo
Slightly modify btrfs_add_delayed_data_ref() so that it can allocate with
GFP_ATOMIC and be called inside a spinlock.

This is used by later dedup patches.

Signed-off-by: Qu Wenruo 
---
v3:
  Newly introduced
---
 fs/btrfs/ctree.h   |  4 
 fs/btrfs/delayed-ref.c | 25 +
 fs/btrfs/delayed-ref.h |  2 +-
 fs/btrfs/extent-tree.c | 24 +---
 4 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 62fed1d..450790b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3469,6 +3469,10 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
*trans,
 struct btrfs_root *root,
 u64 bytenr, u64 num_bytes, u64 parent,
 u64 root_objectid, u64 owner, u64 offset);
+int btrfs_inc_extent_ref_atomic(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root, u64 bytenr,
+   u64 num_bytes, u64 parent,
+   u64 root_objectid, u64 owner, u64 offset);
 
 int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans,
   struct btrfs_root *root);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index e06dd75..94609ec 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -812,26 +812,31 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
   u64 bytenr, u64 num_bytes,
   u64 parent, u64 ref_root,
   u64 owner, u64 offset, u64 reserved, int action,
-  struct btrfs_delayed_extent_op *extent_op)
+  int atomic)
 {
struct btrfs_delayed_data_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
+   gfp_t gfp_flags;
+
+   if (atomic)
+   gfp_flags = GFP_ATOMIC;
+   else
+   gfp_flags = GFP_NOFS;
 
-   BUG_ON(extent_op && !extent_op->is_data);
-   ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+   ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, gfp_flags);
if (!ref)
return -ENOMEM;
 
-   head_ref = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+   head_ref = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, gfp_flags);
if (!head_ref) {
kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
return -ENOMEM;
}
 
if (fs_info->quota_enabled && is_fstree(ref_root)) {
-   record = kmalloc(sizeof(*record), GFP_NOFS);
+   record = kmalloc(sizeof(*record), gfp_flags);
if (!record) {
kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
kmem_cache_free(btrfs_delayed_ref_head_cachep,
@@ -840,10 +845,13 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
}
}
 
-   head_ref->extent_op = extent_op;
+   head_ref->extent_op = NULL;
 
delayed_refs = &trans->transaction->delayed_refs;
-   spin_lock(&delayed_refs->lock);
+
+   /* For atomic case, caller should already hold the delayed_refs lock */
+   if (!atomic)
+   spin_lock(&delayed_refs->lock);
 
/*
 * insert both the head node and the new ref without dropping
@@ -856,7 +864,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
   num_bytes, parent, ref_root, owner, offset,
   action);
-   spin_unlock(&delayed_refs->lock);
+   if (!atomic)
+   spin_unlock(&delayed_refs->lock);
 
return 0;
 }
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 00ed02c..8928fe7 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -249,7 +249,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
   u64 bytenr, u64 num_bytes,
   u64 parent, u64 ref_root,
   u64 owner, u64 offset, u64 reserved, int action,
-  struct btrfs_delayed_extent_op *extent_op);
+  int atomic);
 int btrfs_add_delayed_qgroup_reserve(struct btrfs_fs_info *fs_info,
 struct btrfs_trans_handle *trans,
 u64 ref_root, u64 bytenr, u64 num_bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c4661db..d80f74d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2089,11 +2089,29 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
*trans,
ret = 

[PATCH v3 09/16] btrfs: ordered-extent: Add support for dedup

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Add ordered-extent support for dedup.

Note: the current ordered-extent support only handles non-compressed source
extents.
Support for compressed source extents will be added later.
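
For readers skimming the diff: the comment this patch adds to struct
btrfs_ordered_extent describes three states for the new hash pointer. A
minimal userspace sketch of that convention follows. It is not the patch's
code; struct dedup_hash, enum dedup_state and dedup_classify() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct btrfs_dedup_hash. */
struct dedup_hash {
	uint64_t bytenr;    /* 0 means "no existing extent was found" */
	uint32_t num_bytes;
};

enum dedup_state {
	DEDUP_NONE,  /* hash == NULL: extent does not take part in dedup */
	DEDUP_MISS,  /* bytenr == 0: hash is new and gets added to the tree */
	DEDUP_HIT    /* bytenr != 0: extent ref was already increased */
};

static enum dedup_state dedup_classify(const struct dedup_hash *hash)
{
	if (!hash)
		return DEDUP_NONE;
	return hash->bytenr == 0 ? DEDUP_MISS : DEDUP_HIT;
}

/* Example walk through the three states. */
static void dedup_classify_demo(void)
{
	struct dedup_hash miss = { 0, 16384 };
	struct dedup_hash hit = { 1048576, 16384 };

	assert(dedup_classify(NULL) == DEDUP_NONE);
	assert(dedup_classify(&miss) == DEDUP_MISS);
	assert(dedup_classify(&hit) == DEDUP_HIT);
}
```

Using a zero bytenr as the "miss" marker avoids a separate flag field, at the
cost of treating bytenr 0 as never valid for a dedup hit.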

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ordered-data.c | 33 +
 fs/btrfs/ordered-data.h | 13 +
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 8c27292..46493f5 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -25,6 +25,7 @@
 #include "btrfs_inode.h"
 #include "extent_io.h"
 #include "disk-io.h"
+#include "dedup.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -183,12 +184,14 @@ static inline struct rb_node *tree_search(struct 
btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ struct btrfs_dedup_hash *hash)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_inode_tree *tree;
struct rb_node *node;
struct btrfs_ordered_extent *entry;
+   struct btrfs_dedup_info *dedup_info = root->fs_info->dedup_info;
 
tree = &BTRFS_I(inode)->ordered_tree;
entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
@@ -203,6 +206,20 @@ static int __btrfs_add_ordered_extent(struct inode *inode, 
u64 file_offset,
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
+   entry->hash = NULL;
+   if (hash && dedup_info) {
+   entry->hash = btrfs_dedup_alloc_hash(dedup_info->hash_type);
+   if (!entry->hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   /* Hash contains locks, only copy what we need */
+   entry->hash->bytenr = hash->bytenr;
+   entry->hash->num_bytes = hash->num_bytes;
+   memcpy(entry->hash->hash, hash->hash,
+  btrfs_dedup_sizes[dedup_info->hash_type]);
+   }
+
if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
set_bit(type, &entry->flags);
 
@@ -249,15 +266,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  struct btrfs_dedup_hash *hash)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 u64 start, u64 len, u64 disk_len, int type)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -266,7 +291,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, 
u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, NULL);
 }
 
 /*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 23c9605..58519ce 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,16 @@ struct btrfs_ordered_extent {
struct completion completion;
struct btrfs_work flush_work;
struct list_head work_list;
+
+   /*
+* For inband deduplication
+* If hash is NULL, no deduplication.
+* If hash->bytenr is zero, means this is a dedup miss, hash will
+* be added into dedup tree.
+* If hash->bytenr is non-zero, this is a dedup hit. Extent ref is
+* *ALREADY* increased.
+*/
+   struct btrfs_dedup_hash *hash;
 };
 
 /*
@@ -172,6 

[PATCH v3 07/16] btrfs: dedup: Introduce function to search for an existing hash

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_search() to handle the job for in-memory
hash tree.

The trick is, we must ensure the delayed ref head is not being run at
the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedup_search()
interface.
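
As an illustration only (not the patch's code), the hash lookup described
above can be sketched in userspace C. The sketch assumes a plain unbalanced
binary tree and a hand-rolled LRU list in place of the kernel's rb-tree and
list_head; struct hash_entry, struct hash_index and the index_*() helpers are
hypothetical names. It compares each hash once per node and moves a hit to
the front of the LRU list. Locking and the delayed ref head handling are left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 32 /* SHA-256 digest size in bytes */

/* Hypothetical userspace stand-in for the patch's struct inmem_hash. */
struct hash_entry {
	unsigned char hash[HASH_LEN];
	unsigned long long bytenr;
	struct hash_entry *left, *right; /* search tree links */
	struct hash_entry *prev, *next;  /* LRU list links */
};

struct hash_index {
	struct hash_entry *root;
	struct hash_entry lru; /* sentinel; lru.next is the most recent entry */
};

static void index_init(struct hash_index *idx)
{
	idx->root = NULL;
	idx->lru.next = idx->lru.prev = &idx->lru;
}

static void lru_unlink(struct hash_entry *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void lru_push_front(struct hash_index *idx, struct hash_entry *e)
{
	e->next = idx->lru.next;
	e->prev = &idx->lru;
	idx->lru.next->prev = e;
	idx->lru.next = e;
}

/* Insert a caller-allocated entry; returns 0, or -1 if the hash exists. */
static int index_insert(struct hash_index *idx, struct hash_entry *e)
{
	struct hash_entry **p = &idx->root;

	while (*p) {
		int cmp = memcmp(e->hash, (*p)->hash, HASH_LEN);

		if (cmp == 0)
			return -1;
		p = cmp < 0 ? &(*p)->left : &(*p)->right;
	}
	e->left = e->right = NULL;
	*p = e;
	lru_push_front(idx, e);
	return 0;
}

/* Look up a hash; a hit is moved to the front of the LRU list. */
static struct hash_entry *index_search(struct hash_index *idx,
				       const unsigned char *hash)
{
	struct hash_entry *e;

	assert(idx && hash);
	e = idx->root;
	while (e) {
		int cmp = memcmp(hash, e->hash, HASH_LEN);

		if (cmp == 0) {
			lru_unlink(e);
			lru_push_front(idx, e);
			return e;
		}
		e = cmp < 0 ? e->left : e->right;
	}
	return NULL;
}
```

A caller would run index_init() once, index_insert() for each new hash, and
index_search() on the write path. Storing the memcmp() result in a local
avoids comparing the same pair of hashes twice per node.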

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
v3:
  Use struct inmem_hash instead of struct btrfs_dedup_hash.
---
 fs/btrfs/dedup.c | 163 +++
 1 file changed, 163 insertions(+)

diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
index 0272411..4863cf2 100644
--- a/fs/btrfs/dedup.c
+++ b/fs/btrfs/dedup.c
@@ -336,3 +336,166 @@ int btrfs_dedup_disable(struct btrfs_fs_info *fs_info)
inmem_destroy(fs_info);
return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedup_info *dedup_info, u8 *hash)
+{
+   struct rb_node **p = &dedup_info->hash_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+   u16 hash_type = dedup_info->hash_type;
+   int hash_len = btrfs_dedup_sizes[hash_type];
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+   if (memcmp(hash, entry->hash, hash_len) < 0) {
+   p = &(*p)->rb_left;
+   } else if (memcmp(hash, entry->hash, hash_len) > 0) {
+   p = &(*p)->rb_right;
+   } else {
+   /* Found, need to re-add it to LRU list head */
+   list_del(&entry->lru_list);
+   list_add(&entry->lru_list, &dedup_info->lru_list);
+   return entry;
+   }
+   }
+   return NULL;
+}
+
+static int inmem_search(struct inode *inode, u64 file_pos,
+   struct btrfs_dedup_hash *hash)
+{
+   int ret;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *head;
+   struct inmem_hash *found_hash;
+   struct btrfs_dedup_info *dedup_info = fs_info->dedup_info;
+   u64 bytenr, num_bytes;
+
+   spin_lock(&dedup_info->lock);
+   found_hash = inmem_search_hash(dedup_info, hash->hash);
+   /* If we don't find a duplicated extent, just return. */
+   if (!found_hash) {
+   spin_unlock(&dedup_info->lock);
+   return 0;
+   }
+   bytenr = found_hash->bytenr;
+   num_bytes = found_hash->num_bytes;
+   spin_unlock(&dedup_info->lock);
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
+again:
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   if (!head) {
+   /*
+* We can safely insert a new delayed_ref as long as we
+* hold delayed_refs->lock.
+* Only need to use atomic inc_extent_ref()
+*/
+   ret = btrfs_inc_extent_ref_atomic(trans, root, bytenr,
+   num_bytes, 0, root->root_key.objectid,
+   btrfs_ino(inode), file_pos);
+   spin_unlock(&delayed_refs->lock);
+   btrfs_end_transaction(trans, root);
+
+   if (ret == 0) {
+   hash->bytenr = bytenr;
+   hash->num_bytes = num_bytes;
+   ret = 1;
+   }
+   return ret;
+   }
+
+   /*
+* we may have dropped the delayed_refs->lock to get the head mutex
+* lock, and that might have given someone else time to free the head.
+* If that's true, it has been removed from our list and we can move on.
+*/
+   ret = btrfs_delayed_ref_lock(trans, head);
+   spin_unlock(&delayed_refs->lock);
+   if (ret == -EAGAIN) {
+   spin_lock(&dedup_info->lock);
+   found_hash = inmem_search_hash(dedup_info, hash->hash);
+   if (!found_hash) {
+   spin_unlock(&dedup_info->lock);
+   goto out;
+   }
+   bytenr = found_hash->bytenr;
+   num_bytes = found_hash->num_bytes;
+   spin_unlock(&dedup_info->lock);
+   goto again;
+   }
+
+   /* We still need to look up the hash again... */
+   spin_lock(&dedup_info->lock);
+   found_hash = inmem_search_hash(dedup_info, hash->hash);
+   if (!found_hash) {
+   spin_unlock(&dedup_info->lock);
+   mutex_unlock(&head->mutex);
+   goto out;
+   }
+
+   /* The bytenr has 

[PATCH v3 11/16] btrfs: dedup: Add basic tree structure for on-disk dedup method

2016-01-06 Thread Qu Wenruo
Introduce a new tree, the dedup tree, to record on-disk dedup hashes.
It serves as persistent hash storage, instead of an in-memory-only
implementation.

Unlike Liu Bo's implementation, this version does not use a hack for
bytenr -> hash search. Instead it adds a new type, DEDUP_BYTENR_ITEM, for
that search case, just like the in-memory backend.
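
To make the two lookup directions concrete, here is a small userspace sketch
of how the two keys could be built from one hash and one bytenr, following
the key layout given in the comments of this patch. It is a sketch, not the
kernel code: struct key, hash_tail() and the *_item_key() helpers are
hypothetical, and reading the last 8 bytes of the hash as a little-endian
integer is an assumption, since the patch only says "last 64 bit of the hash".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_LEN 32               /* SHA-256 digest size in bytes */
#define DEDUP_HASH_ITEM_KEY   231 /* key type values taken from the patch */
#define DEDUP_BYTENR_ITEM_KEY 232

/* Hypothetical stand-in for a btrfs key: (objectid, type, offset). */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Assumption: the trailing 8 bytes are read as a little-endian integer. */
static uint64_t hash_tail(const uint8_t *hash)
{
	uint64_t v = 0;
	int i;

	assert(hash != NULL);
	for (i = 0; i < 8; i++)
		v |= (uint64_t)hash[HASH_LEN - 8 + i] << (8 * i);
	return v;
}

/* hash -> bytenr search: (last 64 bits of hash, HASH_ITEM, bytenr) */
static struct key hash_item_key(const uint8_t *hash, uint64_t bytenr)
{
	struct key k = { hash_tail(hash), DEDUP_HASH_ITEM_KEY, bytenr };

	return k;
}

/* bytenr -> hash search: (bytenr, BYTENR_ITEM, last 64 bits of hash) */
static struct key bytenr_item_key(const uint8_t *hash, uint64_t bytenr)
{
	struct key k = { bytenr, DEDUP_BYTENR_ITEM_KEY, hash_tail(hash) };

	return k;
}
```

The two keys carry the same pair of values with objectid and offset swapped,
so either value can be used as the leading search field.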

Signed-off-by: Liu Bo 
Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
v2:
  Use special type for bytenr-> hash search, instead of Liu's backref
  hack
v3:
  None
---
 fs/btrfs/ctree.h | 67 +++-
 fs/btrfs/dedup.h |  8 ++
 fs/btrfs/disk-io.c   |  1 +
 include/trace/events/btrfs.h |  3 +-
 4 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aa7f5fb..87158ac 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -97,6 +97,9 @@ struct btrfs_dedup_hash;
 /* for storing items that use the BTRFS_UUID_KEY* types */
 #define BTRFS_UUID_TREE_OBJECTID 9ULL
 
+/* on-disk dedup tree (EXPERIMENTAL) */
+#define BTRFS_DEDUP_TREE_OBJECTID 10ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -524,10 +527,12 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES(1ULL << 9)
 
+#define BTRFS_FEATURE_COMPAT_RO_DEDUP  (1ULL << 0)
 #define BTRFS_FEATURE_COMPAT_SUPP  0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET  0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR0ULL
-#define BTRFS_FEATURE_COMPAT_RO_SUPP   0ULL
+#define BTRFS_FEATURE_COMPAT_RO_SUPP   \
+   (BTRFS_FEATURE_COMPAT_RO_DEDUP)
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET   0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR 0ULL
 
@@ -955,6 +960,46 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/*
+ * Objectid: 0
+ * Type: BTRFS_DEDUP_STATUS_ITEM_KEY
+ * Offset: 0
+ */
+struct btrfs_dedup_status_item {
+   __le64 blocksize;
+   __le64 limit_nr;
+   __le16 hash_type;
+   __le16 backend;
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: Last 64 bit of the hash
+ * Type: BTRFS_DEDUP_HASH_ITEM_KEY
+ * Offset: Bytenr of the hash
+ *
+ * Used for hash <-> bytenr search
+ * XXX: On-disk format not stable yet, see the unused one
+ */
+struct btrfs_dedup_hash_item {
+   /* on disk length of dedup range */
+   __le64 len;
+
+   /* Spare space */
+   u8 __unused[16];
+
+   /* Hash follows */
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUP_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * all its content is hash.
+ * So no special item struct is needed.
+ */
+
 struct btrfs_dev_stats_item {
/*
 * grow this item struct at the end for future enhancements and keep
@@ -2101,6 +2146,13 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY   228
 
 /*
+ * Dedup item and status
+ */
+#define BTRFS_DEDUP_STATUS_ITEM_KEY230
+#define BTRFS_DEDUP_HASH_ITEM_KEY  231
+#define BTRFS_DEDUP_BYTENR_ITEM_KEY232
+
+/*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
  * (0, BTRFS_QGROUP_STATUS_KEY, 0)
@@ -3158,6 +3210,19 @@ static inline unsigned long btrfs_leaf_data(struct 
extent_buffer *l)
return offsetof(struct btrfs_leaf, items);
 }
 
+/* btrfs_dedup_status */
+BTRFS_SETGET_FUNCS(dedup_status_blocksize, struct btrfs_dedup_status_item,
+  blocksize, 64);
+BTRFS_SETGET_FUNCS(dedup_status_limit, struct btrfs_dedup_status_item,
+  limit_nr, 64);
+BTRFS_SETGET_FUNCS(dedup_status_hash_type, struct btrfs_dedup_status_item,
+  hash_type, 16);
+BTRFS_SETGET_FUNCS(dedup_status_backend, struct btrfs_dedup_status_item,
+  backend, 16);
+
+/* btrfs_dedup_hash_item */
+BTRFS_SETGET_FUNCS(dedup_hash_len, struct btrfs_dedup_hash_item, len, 64);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedup.h b/fs/btrfs/dedup.h
index d06d667..990e922 100644
--- a/fs/btrfs/dedup.h
+++ b/fs/btrfs/dedup.h
@@ -54,6 +54,8 @@ struct btrfs_dedup_hash {
u8 hash[];
 };
 
+struct btrfs_root;
+
 struct btrfs_dedup_info {
/* dedup blocksize */
u64 blocksize;
@@ -70,6 +72,12 @@ struct btrfs_dedup_info {
spinlock_t lock;
u64 limit_nr;
u64 current_nr;
+
+   /* for persist data like dedup-hash and dedup status */
+   struct btrfs_root *dedup_root;
+
+   /* on-disk mode only mutex */
+   struct mutex ondisk_lock;
 };
 
 struct btrfs_trans_handle;
diff --git a/fs/btrfs/disk-io.c 

[PATCH v3 16/16] btrfs: dedup: Add ioctl for inband deduplication

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Add ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

We will later add ioctl to disable inband dedup for given file/dir.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/ioctl.c   | 63 ++
 include/uapi/linux/btrfs.h | 23 +
 4 files changed, 88 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 87158ac..4845933 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1865,6 +1865,7 @@ struct btrfs_fs_info {
 
/* reference to inband de-duplication info */
struct btrfs_dedup_info *dedup_info;
+   struct mutex dedup_ioctl_mutex;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0d84e17..341f4f0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2562,6 +2562,7 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->delete_unused_bgs_mutex);
mutex_init(&fs_info->reloc_mutex);
mutex_init(&fs_info->delalloc_root_mutex);
+   mutex_init(&fs_info->dedup_ioctl_mutex);
seqlock_init(&fs_info->profiles_lock);
init_rwsem(&fs_info->delayed_iput_sem);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7375cf2..6749c84 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -59,6 +59,7 @@
 #include "props.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedup.h"
 
 #ifdef CONFIG_64BIT
 /* If we have a 32-bit userspace and 64-bit kernel, then the UAPI
@@ -3206,6 +3207,66 @@ out:
return ret;
 }
 
+static long btrfs_ioctl_dedup_ctl(struct btrfs_root *root, void __user *args)
+{
+   struct btrfs_ioctl_dedup_args *dargs;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_dedup_info *dedup_info;
+   int ret;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   dargs = memdup_user(args, sizeof(*dargs));
+   if (IS_ERR(dargs)) {
+   ret = PTR_ERR(dargs);
+   return ret;
+   }
+
+   if (dargs->cmd >= BTRFS_DEDUP_CTL_LAST) {
+   ret = -EINVAL;
+   goto out;
+   }
+   switch (dargs->cmd) {
+   case BTRFS_DEDUP_CTL_ENABLE:
+   ret = btrfs_dedup_enable(fs_info, dargs->hash_type,
+dargs->backend, dargs->blocksize,
+dargs->limit_nr);
+   break;
+   case BTRFS_DEDUP_CTL_DISABLE:
+   ret = btrfs_dedup_disable(fs_info);
+   break;
+   case BTRFS_DEDUP_CTL_STATUS:
+   dedup_info = fs_info->dedup_info;
+   if (dedup_info) {
+   dargs->status = 1;
+   dargs->blocksize = dedup_info->blocksize;
+   dargs->backend = dedup_info->backend;
+   dargs->hash_type = dedup_info->hash_type;
+   dargs->limit_nr = dedup_info->limit_nr;
+   dargs->current_nr = dedup_info->current_nr;
+   } else {
+   dargs->status = 0;
+   dargs->blocksize = 0;
+   dargs->backend = 0;
+   dargs->hash_type = 0;
+   dargs->limit_nr = 0;
+   dargs->current_nr = 0;
+   }
+   if (copy_to_user(args, dargs, sizeof(*dargs)))
+   ret = -EFAULT;
+   else
+   ret = 0;
+   break;
+   default:
+   ret = -EINVAL;
+   break;
+   }
+out:
+   kfree(dargs);
+   return ret;
+}
+
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
 struct inode *inode,
 u64 endoff,
@@ -5565,6 +5626,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_set_fslabel(file, argp);
case BTRFS_IOC_FILE_EXTENT_SAME:
return btrfs_ioctl_file_extent_same(file, argp);
+   case BTRFS_IOC_DEDUP_CTL:
+   return btrfs_ioctl_dedup_ctl(root, argp);
case BTRFS_IOC_GET_SUPPORTED_FEATURES:
return btrfs_ioctl_get_supported_features(file, argp);
case BTRFS_IOC_GET_FEATURES:
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..b33da24 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -445,6 +445,27 @@ struct btrfs_ioctl_get_dev_stats {
__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
 };
 
+/*
+ * de-duplication control modes
+ * For re-config, re-enable will handle it
+ * TODO: Add support to disable per-file/dir dedup operation
+ */
+#define 

Re: [PATCH 00/35 v2] separate operations from flags in the bio/request structs

2016-01-06 Thread Martin K. Petersen
> "Mike" == mchristi   writes:

Mike> The following patches begin to cleanup the request->cmd_flags and
bio-> bi_rw mess. We currently use cmd_flags to specify the operation,
Mike> attributes and state of the request. For bi_rw we use it for
Mike> similar info and also the priority but then also have another
Mike> bi_flags field for state. At some point, we abused them so much we
Mike> just made cmd_flags 64 bits, so we could add more.

Mike> The following patches separate the operation (read, write, discard,
Mike> flush, etc) from cmd_flags/bi_rw.

Mike> This patchset was made against linux-next from today Jan 5 2016.
Mike> (git tag next-20160105).

Very nice work. Thanks for doing this!

I think it's a much needed cleanup. I focused mainly on the core block,
discard, write same and sd.c pieces and everything looks sensible to me.

I wonder what the best approach is to move a patch set with this many
stakeholders forward? Set a "speak now or forever hold your peace"
review deadline?

-- 
Martin K. Petersen  Oracle Linux Engineering


Recovery of a raid1 FS

2016-01-06 Thread Tom Hunt
I've got a two-disk RAID1 btrfs volume, which crashed for no apparent
reason and was corrupt on next boot. Relevant command runs and
outputs:

[root@archiso ~]# mount -osubvol=.,ro,recovery /dev/mapper/rootvol_1 mnt
mount: wrong fs type, bad option, bad superblock on /dev/mapper/rootvol_1,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.
[  372.665800] Btrfs loaded
[  372.666481] BTRFS: device fsid 91f7e058-1142-4036-b1f7-4f64163dc8ae
devid 1 transid 257371 /dev/dm-4
[  389.339860] BTRFS: device fsid 91f7e058-1142-4036-b1f7-4f64163dc8ae
devid 2 transid 257371 /dev/dm-5
[  781.695931] BTRFS info (device dm-5): enabling auto recovery
[  781.695933] BTRFS info (device dm-5): disk space caching is enabled
[  781.695934] BTRFS: has skinny extents
[  781.776198] BTRFS: bdev /dev/mapper/rootvol_2 errs: wr 0, rd 0,
flush 0, corrupt 10, gen 426
[  791.982552] BTRFS critical (device dm-5): corrupt leaf, bad key
order: block=6210228551680,root=1, slot=49
[  791.982836] BTRFS critical (device dm-5): corrupt leaf, bad key
order: block=6210228551680,root=1, slot=49
[  791.982862] BTRFS: Failed to read block groups: -5
[  792.030513] BTRFS: open_ctree failed

[root@archiso ~]# btrfs restore -D /dev/mapper/rootvol_1
/root/tmp_data | tee data_mnt/tom/restore_list.txt
bad key ordering 49 50
Error searching -1
Error searching /root/tmp_data/home
This is a dry-run, no files are going to be restored
We have looped trying to restore files in
/exherbo/root/squashfs-root/usr/bin too many times to be making
progress, stopping
We have looped trying to restore files in
/exherbo/root/squashfs-root/usr/share/man/man1 too many times to be
making progress, stopping
We have looped trying to restore files in
/exherbo/root/squashfs-root/usr/share/man/man3 too many times to be
making progress, stopping
We have looped trying to restore files in
/exherbo/usr/x86_64-pc-linux-gnu/bin too many times to be making
progress, stopping
We have looped trying to restore files in
/exherbo/usr/x86_64-pc-linux-gnu/include/firefox/mozilla/dom too many
times to be making progress, stopping
We have looped trying to restore files in
/exherbo/usr/x86_64-pc-linux-gnu/include/firefox too many times to be
making progress, stopping
We have looped trying to restore files in
/exherbo/usr/x86_64-pc-linux-gnu/lib/python3.5/test/__pycache__ too
many times to be making progress, stopping
We have looped trying to restore files in
/exherbo/usr/x86_64-pc-linux-gnu/lib/python2.7/test too many times to
be making progress, stopping
We have looped trying to restore files in
/exherbo/usr/x86_64-pc-linux-gnu/lib/debug/usr/x86_64-pc-linux-gnu/bin
too many times to be making progress, stopping
We have looped trying to restore files in /exherbo/usr/share/man/man3
too many times to be making progress, stopping
We have looped trying to restore files in /exherbo/usr/share/man/man1
too many times to be making progress, stopping
We have looped trying to restore files in
/exherbo/usr/share/idl/firefox too many times to be making progress,
stopping

[root@archiso ~]# btrfs restore -v -l /dev/mapper/rootvol_1
bad key ordering 49 50
 tree key (EXTENT_TREE ROOT_ITEM 0) 6210219163648 level 2
 tree key (DEV_TREE ROOT_ITEM 0) 379264000 level 1
 tree key (FS_TREE ROOT_ITEM 0) 3792595566592 level 0
 tree key (CSUM_TREE ROOT_ITEM 0) 6210220048384 level 3
 tree key (UUID_TREE ROOT_ITEM 0) 3792855105536 level 0
 tree key (257 ROOT_ITEM 0) 3792551755776 level 2
 tree key (258 ROOT_ITEM 0) 6210218557440 level 2
 tree key (3785 ROOT_ITEM 0) 7588191682560 level 0
 tree key (3786 ROOT_ITEM 0) 7588191633408 level 0
 tree key (7565 ROOT_ITEM 0) 6210221588480 level 2
 tree key (7566 ROOT_ITEM 0) 4055721377792 level 0
 tree key (7567 ROOT_ITEM 101377) 3792849616896 level 2
 tree key (DATA_RELOC_TREE ROOT_ITEM 0) 3792520544256 level 0

Any help available? I tried using btrfs-restore previously on a
different machine with only one of the disks plugged in, and it at
least showed some files, though the 'Error searching -1' still showed
up. (The /home subvolume is the only one I really care about;
everything else is nice-to-have.)

-- 
Tom Hunt


[PATCH v2 5/7] btrfs-progs: Add dedup feature for mkfs and convert

2016-01-06 Thread Qu Wenruo
Add new DEDUP ro compat flag and corresponding mkfs/convert flag
'dedup'.

Since the dedup tree is completely isolated from the fs tree, even an old
kernel can still mount the filesystem read-only.
So add it as an RO compat flag instead of a common incompat flag.
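
The usual meaning of the two flag classes can be sketched as follows. This is
a simplified model, not the kernel's mount code; check_features() and
enum mount_decision are hypothetical names. Unknown incompat bits make the
filesystem unmountable, while unknown RO compat bits still allow a read-only
mount.

```c
#include <assert.h>
#include <stdint.h>

#define FEATURE_COMPAT_RO_DEDUP (1ULL << 0) /* bit value from the kernel patch */

enum mount_decision { MOUNT_RW_OK, MOUNT_RO_ONLY, MOUNT_REFUSED };

/*
 * sb_*:   feature bits recorded in the superblock.
 * supp_*: feature bits this (hypothetical) kernel understands.
 */
static enum mount_decision check_features(uint64_t sb_incompat,
					  uint64_t sb_compat_ro,
					  uint64_t supp_incompat,
					  uint64_t supp_compat_ro)
{
	/* Any unknown incompat bit: the on-disk format cannot be read safely. */
	if (sb_incompat & ~supp_incompat)
		return MOUNT_REFUSED;
	/* Any unknown RO compat bit: reading is safe, writing is not. */
	if (sb_compat_ro & ~supp_compat_ro)
		return MOUNT_RO_ONLY;
	return MOUNT_RW_OK;
}

/* Example: a kernel without dedup support meets a dedup-enabled filesystem. */
static void demo_old_kernel(void)
{
	assert(check_features(0, FEATURE_COMPAT_RO_DEDUP, 0, 0) == MOUNT_RO_ONLY);
}
```

Under this model an old kernel can still read files from a dedup-enabled
filesystem, but it refuses a read-write mount, so it cannot leave the dedup
tree out of sync with the extents.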

Signed-off-by: Qu Wenruo 
---
 Documentation/mkfs.btrfs.asciidoc |  9 +
 btrfs-convert.c   | 19 +--
 mkfs.c|  8 ++--
 utils.c   | 38 +++---
 utils.h   |  7 ---
 5 files changed, 59 insertions(+), 22 deletions(-)

diff --git a/Documentation/mkfs.btrfs.asciidoc 
b/Documentation/mkfs.btrfs.asciidoc
index 12d8840..5c5b41d 100644
--- a/Documentation/mkfs.btrfs.asciidoc
+++ b/Documentation/mkfs.btrfs.asciidoc
@@ -207,6 +207,15 @@ reduced-size metadata for extent references, saves a few 
percent of metadata
 improved representation of file extents where holes are not explicitly
 stored as an extent, saves a few percent of metadata if sparse files are used
 
+*dedup*::
+allow btrfs to use the new on-disk format designed for in-band (write time)
+de-duplication.
++
+the on-disk storage backend and persistent de-duplication status need this
+feature.
++
+unlike other features, this is an RO compat flag, which means an old kernel
+can still mount the fs read-only.
+
 BLOCK GROUPS, CHUNKS, RAID
 --
 
diff --git a/btrfs-convert.c b/btrfs-convert.c
index 02e5cdb..bf1df48 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -2288,7 +2288,7 @@ err:
 
 static int do_convert(const char *devname, int datacsum, int packing, int 
noxattr,
u32 nodesize, int copylabel, const char *fslabel, int progress,
-   u64 features)
+   u64 features, u64 ro_features)
 {
int i, ret, blocks_per_node;
int fd = -1;
@@ -2343,8 +2343,9 @@ static int do_convert(const char *devname, int datacsum, 
int packing, int noxatt
fprintf(stderr, "unable to open %s\n", devname);
goto fail;
}
-   btrfs_parse_features_to_string(features_buf, features);
-   if (features == BTRFS_MKFS_DEFAULT_FEATURES)
+   btrfs_parse_features_to_string(features_buf, features, ro_features);
+   if (features == BTRFS_MKFS_DEFAULT_FEATURES &&
+   ro_features == 0)
strcat(features_buf, " (default)");
 
printf("create btrfs filesystem:\n");
@@ -2360,6 +2361,7 @@ static int do_convert(const char *devname, int datacsum, 
int packing, int noxatt
mkfs_cfg.sectorsize = blocksize;
mkfs_cfg.stripesize = blocksize;
mkfs_cfg.features = features;
+   mkfs_cfg.ro_features = ro_features;
 
ret = make_btrfs(fd, &mkfs_cfg);
if (ret) {
@@ -2898,6 +2900,7 @@ int main(int argc, char *argv[])
char *file;
char fslabel[BTRFS_LABEL_SIZE];
u64 features = BTRFS_MKFS_DEFAULT_FEATURES;
+   u64 ro_features = 0;
 
while(1) {
enum { GETOPT_VAL_NO_PROGRESS = 256 };
@@ -2956,7 +2959,8 @@ int main(int argc, char *argv[])
char *orig = strdup(optarg);
char *tmp = orig;
 
-   tmp = btrfs_parse_fs_features(tmp, &features);
+   tmp = btrfs_parse_fs_features(tmp, &features,
+ &ro_features);
if (tmp) {
fprintf(stderr,
"Unrecognized filesystem 
feature '%s'\n",
@@ -2974,7 +2978,9 @@ int main(int argc, char *argv[])
char buf[64];
 
btrfs_parse_features_to_string(buf,
-   features & 
~BTRFS_CONVERT_ALLOWED_FEATURES);
+   features &
+   ~BTRFS_CONVERT_ALLOWED_FEATURES,
+   ro_features);
fprintf(stderr,
"ERROR: features not allowed 
for convert: %s\n",
buf);
@@ -3025,7 +3031,8 @@ int main(int argc, char *argv[])
ret = do_rollback(file);
} else {
ret = do_convert(file, datacsum, packing, noxattr, nodesize,
-   copylabel, fslabel, progress, features);
+   copylabel, fslabel, progress, features,
+   ro_features);
}
if (ret)
return 1;
diff --git a/mkfs.c b/mkfs.c
index 5f1411f..9c7acc2 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1372,6 +1372,7 @@ int main(int ac, char **av)
int saved_optind;
char fs_uuid[BTRFS_UUID_UNPARSED_SIZE] = { 0 };
u64 features = 

[PATCH v2 2/7] btrfs-progs: dedup: Add enable command for dedup command group

2016-01-06 Thread Qu Wenruo
Add enable subcommand for dedup command group.
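
For reference, the block size rule stated in the documentation part of this
patch (a power of 2 from 16K to 128K) can be checked as sketched below. The
helper names are hypothetical and this is not the code in cmds-dedup.c. The
default baseline is passed in as a parameter because the asciidoc text in
this patch says '32K' while the usage string says 16K.

```c
#include <assert.h>
#include <stdint.h>

#define DEDUP_BLOCKSIZE_MIN (16ULL * 1024)
#define DEDUP_BLOCKSIZE_MAX (128ULL * 1024)

/* Returns 1 if blocksize is a power of two within [16K, 128K], else 0. */
static int dedup_blocksize_valid(uint64_t blocksize)
{
	if (blocksize < DEDUP_BLOCKSIZE_MIN || blocksize > DEDUP_BLOCKSIZE_MAX)
		return 0;
	/* A power of two has exactly one bit set. */
	return (blocksize & (blocksize - 1)) == 0;
}

/* Default: the larger of the page size and a caller-supplied baseline. */
static uint64_t dedup_blocksize_default(uint64_t pagesize, uint64_t baseline)
{
	assert(dedup_blocksize_valid(baseline));
	return pagesize > baseline ? pagesize : baseline;
}
```

With a 4K page size and a 32K baseline, dedup_blocksize_default() returns 32K;
on a system with 64K pages it returns 64K.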

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedup.asciidoc |  62 ++-
 cmds-dedup.c   | 120 +
 ioctl.h|   2 +
 kerncompat.h   |   5 ++
 4 files changed, 188 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedup.asciidoc 
b/Documentation/btrfs-dedup.asciidoc
index 354313f..652f22d 100644
--- a/Documentation/btrfs-dedup.asciidoc
+++ b/Documentation/btrfs-dedup.asciidoc
@@ -19,7 +19,67 @@ use with caution.
 
 SUBCOMMAND
 --
-Nothing yet
+*enable* [options] ::
+Enable in-band de-duplication for a filesystem.
++
+`Options`
++
+-s|--storage-backend 
+Specify de-duplication hash storage backend.
+Supported backends are 'ondisk' and 'inmemory'
+If not specified, default value is 'inmemory'.
++
+Refer to the *BACKENDS* section for more information.
+
+-b|--blocksize 
+Specify dedup block size.
+Supported values are powers of 2 from '16K' to '128K'.
+Default value is the larger of page size and '32K'.
++
+Refer to the *BLOCKSIZE* section for more information.
+
+-a|--hash-algorithm 
+Specify hash algorithm.
+Only 'sha256' is supported so far.
+
+-l|--limit 
+Specify limit of hash number.
+Only works for 'inmemory' backend.
+If *LIMIT* is zero, there will be no limit at all. Use with caution as it can
+use up all the memory for dedup hash.
+Default value is 4096 if using 'inmemory' backend.
+
+BACKENDS
+
+Btrfs in-band de-duplication supports two different backends, each with its
+own features.
+
+In-memory backend::
+Designed for speed, the in-memory backend keeps all dedup hashes in memory.
+It limits the number of hashes kept in memory.
+Hashes over the limit are dropped in least-recently-used (LRU) order.
+So this backend has a consistent overhead for a given limit, but can't ensure
+that all duplicated data will be de-duplicated.
++
+After umount and mount, the in-memory backend needs to refill its hash table.
+
+On-disk backend::
+Designed for de-duplication rate, the on-disk backend keeps dedup hashes on disk.
+This behavior may cause extra disk IO for de-duplication, but gives a much
+higher dedup rate.
++
+After umount and mount, the on-disk backend still has its hashes on disk, so
+there is no need to refill its dedup hash table.
+
+BLOCKSIZE
+-
+In-band de-duplication is done in units of the dedup block size.
+Any data smaller than the dedup block size won't go through the dedup backends.
+
+A smaller block size causes more fragmentation and lower performance, but a
+higher dedup rate.
+A larger block size causes less fragmentation and higher performance, but a
+lower dedup rate.
 
 EXIT STATUS
 ---
diff --git a/cmds-dedup.c b/cmds-dedup.c
index 800df34..e116f4c 100644
--- a/cmds-dedup.c
+++ b/cmds-dedup.c
@@ -36,8 +36,128 @@ static const char * const dedup_cmd_group_usage[] = {
 static const char dedup_cmd_group_info[] =
 "manage inband(write time) de-duplication";
 
+static const char * const cmd_dedup_enable_usage[] = {
+   "btrfs dedup enable [options] ",
+   "Enable in-band(write time) de-duplication of a btrfs.",
+   "",
+   "-s|--storage-backend ",
+   "   specify dedup hash storage backend",
+   "   supported backend: 'ondisk', 'inmemory'",
+   "   inmemory is the default backend",
+   "-b|--blocksize ",
+   "   specify dedup block size",
+   "   default value is the larger of page size and 16K",
+   "-a|--hash-algorithm ",
+   "   specify hash algorithm",
+   "   only 'sha256' is supported yet",
+   "-l|--limit ",
+   "   specify limit of hash number",
+   "   only for 'inmemory' backend",
+   "   default value is 4096 if using 'inmemory' backend",
+   NULL
+};
+
+static int cmd_dedup_enable(int argc, char **argv)
+{
+   int ret;
+   int fd;
+   char *path;
+   int pagesize = sysconf(_SC_PAGESIZE);
+   u64 blocksize = max(pagesize, BTRFS_DEDUP_BLOCKSIZE_DEFAULT);
+   u16 hash_type = BTRFS_DEDUP_HASH_SHA256;
+   u16 backend = BTRFS_DEDUP_BACKEND_INMEMORY;
+   u64 limit = 0;
+   struct btrfs_ioctl_dedup_args dargs;
+   DIR *dirstream;
+
+   while (1) {
+   int c;
+   static const struct option long_options[] = {
+   { "storage-backend", required_argument, NULL, 's'},
+   { "blocksize", required_argument, NULL, 'b'},
+   { "hash-algorithm", required_argument, NULL, 'a'},
+   { "limit", required_argument, NULL, 'l'},
+   { NULL, 0, NULL, 0}
+   };
+
+   c = getopt_long(argc, argv, "s:b:a:l:", long_options, NULL);
+   if (c < 0)
+   break;
+   switch (c) {
+   case 's':
+   if 

[PATCH v2 7/7] btrfs-progs: dedup-tree: Add dedup tree support

2016-01-06 Thread Qu Wenruo
Add dedup tree support for btrfs-debug-tree.

Signed-off-by: Qu Wenruo 
---
v2:
  Add support to print hex objectid/offset for dedup hash.
  Add support to print hex hash.
---
 btrfs-debug-tree.c |  4 +++
 ctree.h|  7 +
 print-tree.c   | 75 ++
 3 files changed, 86 insertions(+)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 8adc39f..8b04df1 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -381,6 +381,10 @@ again:
printf("multiple");
}
break;
+   case BTRFS_DEDUP_TREE_OBJECTID:
+   if (!skip)
+   printf("dedup");
+   break;
default:
if (!skip) {
printf("file");
diff --git a/ctree.h b/ctree.h
index 20305de..eacad7d 100644
--- a/ctree.h
+++ b/ctree.h
@@ -76,6 +76,9 @@ struct btrfs_free_space_ctl;
 /* for storing items that use the BTRFS_UUID_KEY* */
 #define BTRFS_UUID_TREE_OBJECTID 9ULL
 
+/* on-disk dedup tree (EXPERIMENTAL) */
+#define BTRFS_DEDUP_TREE_OBJECTID 10ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -1180,6 +1183,10 @@ struct btrfs_root {
 #define BTRFS_DEV_ITEM_KEY 216
 #define BTRFS_CHUNK_ITEM_KEY   228
 
+#define BTRFS_DEDUP_STATUS_ITEM_KEY230
+#define BTRFS_DEDUP_HASH_ITEM_KEY  231
+#define BTRFS_DEDUP_BYTENR_ITEM_KEY232
+
 #define BTRFS_BALANCE_ITEM_KEY 248
 
 /*
diff --git a/print-tree.c b/print-tree.c
index 4d4c3a2..edc79c4 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -25,6 +25,7 @@
 #include "disk-io.h"
 #include "print-tree.h"
 #include "utils.h"
+#include "dedup.h"
 
 
 static void print_dir_item_type(struct extent_buffer *eb,
@@ -658,6 +659,15 @@ static void print_key_type(u64 objectid, u8 type)
case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
printf("UUID_KEY_RECEIVED_SUBVOL");
break;
+   case BTRFS_DEDUP_STATUS_ITEM_KEY:
+   printf("DEDUP_STATUS_ITEM");
+   break;
+   case BTRFS_DEDUP_HASH_ITEM_KEY:
+   printf("DEDUP_HASH_ITEM");
+   break;
+   case BTRFS_DEDUP_BYTENR_ITEM_KEY:
+   printf("DEDUP_BYTENR_ITEM");
+   break;
default:
printf("UNKNOWN.%d", type);
};
@@ -677,6 +687,8 @@ static void print_objectid(u64 objectid, u8 type)
case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
printf("0x%016llx", (unsigned long long)objectid);
return;
+   case BTRFS_DEDUP_HASH_ITEM_KEY:
+   printf("0x%016llx", objectid);
}
 
switch (objectid) {
@@ -740,6 +752,9 @@ static void print_objectid(u64 objectid, u8 type)
case BTRFS_MULTIPLE_OBJECTIDS:
printf("MULTIPLE");
break;
+   case BTRFS_DEDUP_TREE_OBJECTID:
+   printf("DEDUP_TREE");
+   break;
case (u64)-1:
printf("-1");
break;
@@ -773,6 +788,7 @@ void btrfs_print_key(struct btrfs_disk_key *disk_key)
break;
case BTRFS_UUID_KEY_SUBVOL:
case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
+   case BTRFS_DEDUP_BYTENR_ITEM_KEY:
printf(" 0x%016llx)", (unsigned long long)offset);
break;
default:
@@ -803,6 +819,49 @@ static void print_uuid_item(struct extent_buffer *l, 
unsigned long offset,
}
 }
 
+static void print_dedup_status(struct extent_buffer *node, int slot)
+{
+   struct btrfs_dedup_status_item *status_item;
+   u64 blocksize;
+   u64 limit;
+   u16 hash_type;
+   u16 backend;
+
+   status_item = btrfs_item_ptr(node, slot,
+   struct btrfs_dedup_status_item);
+   blocksize = btrfs_dedup_status_blocksize(node, status_item);
+   limit = btrfs_dedup_status_limit(node, status_item);
+   hash_type = btrfs_dedup_status_hash_type(node, status_item);
+   backend = btrfs_dedup_status_backend(node, status_item);
+
+   printf("\t\tdedup status item ");
+   if (backend == BTRFS_DEDUP_BACKEND_INMEMORY)
+   printf("backend: inmemory\n");
+   else if (backend == BTRFS_DEDUP_BACKEND_ONDISK)
+   printf("backend: ondisk\n");
+   else
+   printf("backend: Unrecognized(%u)\n", backend);
+
+   if (hash_type == BTRFS_DEDUP_HASH_SHA256)
+   printf("\t\thash algorithm: SHA-256 ");
+   else
+   printf("\t\thash algorithm: Unrecognized(%u) ", hash_type);
+
+   printf("blocksize: %llu limit: %llu\n", blocksize, limit);
+}
+
+static void print_dedup_hash(struct extent_buffer *eb, unsigned long offset)
+{
+   u8 buf[32];
+   int i;
+
+   

[PATCH v2 1/7] btrfs-progs: Basic framework for dedup command group

2016-01-06 Thread Qu Wenruo
Add basic ioctl header and command group framework for later use.
Along with a basic man page.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedup.asciidoc | 37 +
 Makefile.in|  2 +-
 btrfs.c|  1 +
 cmds-dedup.c   | 48 ++
 commands.h |  2 ++
 ctree.h| 34 ++-
 dedup.h| 39 +++
 ioctl.h| 21 +
 8 files changed, 182 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/btrfs-dedup.asciidoc
 create mode 100644 cmds-dedup.c
 create mode 100644 dedup.h

diff --git a/Documentation/btrfs-dedup.asciidoc 
b/Documentation/btrfs-dedup.asciidoc
new file mode 100644
index 000..354313f
--- /dev/null
+++ b/Documentation/btrfs-dedup.asciidoc
@@ -0,0 +1,37 @@
+btrfs-dedup(8)
+==============
+
+NAME
+----
+btrfs-dedup - manage in-band (write time) de-duplication of a btrfs filesystem
+
+SYNOPSIS
+--------
+*btrfs dedup*  
+
+DESCRIPTION
+-----------
+*btrfs dedup* is used to enable/disable or show current in-band de-duplication
+status of a btrfs filesystem.
+
+WARNING: In-band de-duplication is still an experimental feature of btrfs,
+use with caution.
+
+SUBCOMMAND
+----------
+Nothing yet
+
+EXIT STATUS
+-----------
+*btrfs dedup* returns a zero exit status if it succeeds. Non zero is
+returned in case of failure.
+
+AVAILABILITY
+------------
+*btrfs* is part of btrfs-progs.
+Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for
+further details.
+
+SEE ALSO
+--------
+`mkfs.btrfs`(8),
diff --git a/Makefile.in b/Makefile.in
index 8e24808..f1fb54c 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -74,7 +74,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o 
cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
-  cmds-property.o cmds-fi-usage.o
+  cmds-property.o cmds-fi-usage.o cmds-dedup.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
   uuid-tree.o utils-lib.o rbtree-utils.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
diff --git a/btrfs.c b/btrfs.c
index 14b556b..0774ebb 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -208,6 +208,7 @@ static const struct cmd_group btrfs_cmd_group = {
{ "receive", cmd_receive, cmd_receive_usage, NULL, 0 },
{ "quota", cmd_quota, NULL, &quota_cmd_group, 0 },
{ "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 },
+   { "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 },
{ "replace", cmd_replace, NULL, &replace_cmd_group, 0 },
{ "help", cmd_help, cmd_help_usage, NULL, 0 },
{ "version", cmd_version, cmd_version_usage, NULL, 0 },
diff --git a/cmds-dedup.c b/cmds-dedup.c
new file mode 100644
index 000..800df34
--- /dev/null
+++ b/cmds-dedup.c
@@ -0,0 +1,48 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include 
+#include 
+#include 
+
+#include "ctree.h"
+#include "ioctl.h"
+
+#include "commands.h"
+#include "utils.h"
+#include "kerncompat.h"
+#include "dedup.h"
+
+static const char * const dedup_cmd_group_usage[] = {
+   "btrfs dedup  [options] ",
+   NULL
+};
+
+static const char dedup_cmd_group_info[] =
+"manage inband(write time) de-duplication";
+
+const struct cmd_group dedup_cmd_group = {
+   dedup_cmd_group_usage, dedup_cmd_group_info, {
+   NULL_CMD_STRUCT
+   }
+};
+
+int cmd_dedup(int argc, char **argv)
+{
+   return handle_command_group(&dedup_cmd_group, argc, argv);
+}
diff --git a/commands.h b/commands.h
index d2bb093..26d1af3 100644
--- a/commands.h
+++ b/commands.h
@@ -93,6 +93,7 @@ extern const struct cmd_group inspect_cmd_group;
 extern const struct cmd_group property_cmd_group;
 extern const struct cmd_group quota_cmd_group;
 extern const struct cmd_group qgroup_cmd_group;
+extern const struct cmd_group dedup_cmd_group;
 extern const struct 

[PATCH v2 3/7] btrfs-progs: dedup: Add disable support for inband deduplication

2016-01-06 Thread Qu Wenruo
Add disable subcommand for dedup command group.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedup.asciidoc |  5 +
 cmds-dedup.c   | 42 ++
 2 files changed, 47 insertions(+)

diff --git a/Documentation/btrfs-dedup.asciidoc 
b/Documentation/btrfs-dedup.asciidoc
index 652f22d..08d1050 100644
--- a/Documentation/btrfs-dedup.asciidoc
+++ b/Documentation/btrfs-dedup.asciidoc
@@ -19,6 +19,11 @@ use with caution.
 
 SUBCOMMAND
 ----------
+*disable* ::
+Disable in-band de-duplication for a filesystem.
++
+This will trash all stored dedup hash.
++
 *enable* [options] ::
 Enable in-band de-duplication for a filesystem.
 +
diff --git a/cmds-dedup.c b/cmds-dedup.c
index e116f4c..f15c2c2 100644
--- a/cmds-dedup.c
+++ b/cmds-dedup.c
@@ -155,9 +155,51 @@ out:
return ret;
 }
 
+static const char * const cmd_dedup_disable_usage[] = {
+   "btrfs dedup disable ",
+   "Disable in-band(write time) de-duplication of a btrfs.",
+   NULL
+};
+
+static int cmd_dedup_disable(int argc, char **argv)
+{
+   struct btrfs_ioctl_dedup_args dargs;
+   DIR *dirstream;
+   char *path;
+   int fd;
+   int ret;
+
+   if (check_argc_exact(argc, 2))
+   usage(cmd_dedup_disable_usage);
+
+   path = argv[1];
+   fd = open_file_or_dir(path, &dirstream);
+   if (fd < 0) {
+   error("failed to open file or directory: %s", path);
+   return 1;
+   }
+   memset(&dargs, 0, sizeof(dargs));
+   dargs.cmd = BTRFS_DEDUP_CTL_DISABLE;
+
+   ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, &dargs);
+   if (ret < 0) {
+   error("failed to disable inband deduplication: %s",
+ strerror(errno));
+   ret = 1;
+   goto out;
+   }
+   ret = 0;
+
+out:
+   close_file_or_dir(fd, dirstream);
+   return ret;
+}
+
 const struct cmd_group dedup_cmd_group = {
dedup_cmd_group_usage, dedup_cmd_group_info, {
{ "enable", cmd_dedup_enable, cmd_dedup_enable_usage, NULL, 0},
+   { "disable", cmd_dedup_disable, cmd_dedup_disable_usage,
+ NULL, 0},
NULL_CMD_STRUCT
}
 };
-- 
2.6.4
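
For reference, the open / control call / cleanup flow of a subcommand like
this one can be sketched in plain C with a single exit path. This is only a
sketch under assumptions: run_ctl(), ctl_ok() and ctl_fail() are hypothetical
names, and the function pointer stands in for the BTRFS_IOC_DEDUP_CTL call,
which is not issued here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for the control ioctl: 0 on success, -1 with errno set. */
typedef int (*ctl_fn)(int fd);

static int ctl_ok(int fd)
{
	(void)fd;
	return 0;
}

static int ctl_fail(int fd)
{
	(void)fd;
	errno = EINVAL;
	return -1;
}

/* Open path, issue one control call, report 0 on success and 1 on failure. */
static int run_ctl(const char *path, ctl_fn ctl)
{
	int ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		fprintf(stderr, "ERROR: failed to open %s: %s\n",
			path, strerror(errno));
		return 1;	/* nothing to clean up yet */
	}

	if (ctl(fd) < 0) {
		fprintf(stderr, "ERROR: control call failed: %s\n",
			strerror(errno));
		ret = 1;
		goto out;
	}
	ret = 0;
out:
	/* Single exit path: release the descriptor, then hand back the status. */
	close(fd);
	return ret;
}
```

The point of the single "out" label is that the recorded status survives the
cleanup, so a failed control call still reaches the caller as a non-zero
exit status.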





[PATCH v2 4/7] btrfs-progs: dedup: Add status subcommand

2016-01-06 Thread Qu Wenruo
Add status subcommand for dedup command group.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-dedup.asciidoc |  3 ++
 cmds-dedup.c   | 72 ++
 2 files changed, 75 insertions(+)

diff --git a/Documentation/btrfs-dedup.asciidoc 
b/Documentation/btrfs-dedup.asciidoc
index 08d1050..8b605fb 100644
--- a/Documentation/btrfs-dedup.asciidoc
+++ b/Documentation/btrfs-dedup.asciidoc
@@ -54,6 +54,9 @@ If *LIMIT* is zero, there will be no limit at all. Use with 
caution as it can
 use up all the memory for dedup hash.
 Default value is 4096 if using 'inmemory' backend.
 
+*status* ::
+Show current in-band de-duplication status of a filesystem.
+
 BACKENDS
 
 Btrfs in-band de-duplication support two different backends with their own
diff --git a/cmds-dedup.c b/cmds-dedup.c
index f15c2c2..5792420 100644
--- a/cmds-dedup.c
+++ b/cmds-dedup.c
@@ -195,11 +195,83 @@ out:
return 0;
 }
 
+static const char * const cmd_dedup_status_usage[] = {
+   "btrfs dedup status ",
+   "Show current in-band(write time) de-duplication status of a btrfs.",
+   NULL
+};
+
+static int cmd_dedup_status(int argc, char **argv)
+{
+   struct btrfs_ioctl_dedup_args dargs;
+   DIR *dirstream;
+   char *path;
+   int fd;
+   int ret;
+   int print_limit = 1;
+
+   if (check_argc_exact(argc, 2))
+   usage(cmd_dedup_status_usage);
+
+   path = argv[1];
+   fd = open_file_or_dir(path, &dirstream);
+   if (fd < 0) {
+   error("failed to open file or directory: %s", path);
+   ret = 1;
+   goto out;
+   }
+   memset(&dargs, 0, sizeof(dargs));
+   dargs.cmd = BTRFS_DEDUP_CTL_STATUS;
+
+   ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, &dargs);
+   if (ret < 0) {
+   error("failed to get inband deduplication status: %s",
+ strerror(errno));
+   ret = 1;
+   goto out;
+   }
+   ret = 0;
+   if (dargs.status == 0) {
+   printf("Status: Disabled\n");
+   goto out;
+   }
+   printf("Status: Enabled\n");
+
+   if (dargs.hash_type == BTRFS_DEDUP_HASH_SHA256)
+   printf("Hash algorithm: SHA-256\n");
+   else
+   printf("Hash algorithm: Unrecognized(%x)\n",
+   dargs.hash_type);
+
+   if (dargs.backend == BTRFS_DEDUP_BACKEND_INMEMORY) {
+   printf("Backend: In-memory\n");
+   print_limit = 1;
+   } else if (dargs.backend == BTRFS_DEDUP_BACKEND_ONDISK) {
+   printf("Backend: On-disk\n");
+   print_limit = 0;
+   } else  {
+   printf("Backend: Unrecognized(%x)\n",
+   dargs.backend);
+   }
+
+   printf("Dedup Blocksize: %llu\n", dargs.blocksize);
+
+   if (print_limit) {
+   printf("Current number of hash: %llu\n", dargs.current_nr);
+   printf("Max number of hash: %llu\n", dargs.limit_nr);
+   }
+out:
+   close_file_or_dir(fd, dirstream);
+   return ret;
+}
+
 const struct cmd_group dedup_cmd_group = {
dedup_cmd_group_usage, dedup_cmd_group_info, {
{ "enable", cmd_dedup_enable, cmd_dedup_enable_usage, NULL, 0},
{ "disable", cmd_dedup_disable, cmd_dedup_disable_usage,
  NULL, 0},
+   { "status", cmd_dedup_status, cmd_dedup_status_usage,
+ NULL, 0},
NULL_CMD_STRUCT
}
 };
-- 
2.6.4
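
The report produced by the status subcommand can be sketched as a pure
formatting function, which makes the output easy to check without a btrfs
mount. Everything below is an assumption for illustration: struct
dedup_status, the HASH_* and BACKEND_* values, append() and format_status()
are hypothetical and do not come from the patched ioctl.h.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical constants; the real values live in the patched headers. */
enum { HASH_SHA256 = 0 };
enum { BACKEND_INMEMORY = 0, BACKEND_ONDISK = 1 };

/* Hypothetical mirror of the fields the status report reads. */
struct dedup_status {
	int enabled;
	unsigned hash_type;
	unsigned backend;
	unsigned long long blocksize;
	unsigned long long current_nr;
	unsigned long long limit_nr;
};

/* Append formatted text at pos; once the buffer is full, further calls are no-ops. */
static size_t append(char *buf, size_t len, size_t pos, const char *fmt, ...)
{
	va_list ap;
	int w;

	if (pos >= len)
		return pos;
	va_start(ap, fmt);
	w = vsnprintf(buf + pos, len - pos, fmt, ap);
	va_end(ap);
	return w < 0 ? pos : pos + (size_t)w;
}

/* Render the status report into buf and return the length that was needed. */
static size_t format_status(const struct dedup_status *s, char *buf, size_t len)
{
	size_t n = 0;
	int show_limit = 0;

	if (len)
		buf[0] = '\0';
	if (!s->enabled)
		return append(buf, len, n, "Status: Disabled\n");
	n = append(buf, len, n, "Status: Enabled\n");

	if (s->hash_type == HASH_SHA256)
		n = append(buf, len, n, "Hash algorithm: SHA-256\n");
	else
		n = append(buf, len, n, "Hash algorithm: Unrecognized(%x)\n",
			   s->hash_type);

	if (s->backend == BACKEND_INMEMORY) {
		n = append(buf, len, n, "Backend: In-memory\n");
		show_limit = 1;
	} else if (s->backend == BACKEND_ONDISK) {
		n = append(buf, len, n, "Backend: On-disk\n");
	} else {
		n = append(buf, len, n, "Backend: Unrecognized(%x)\n",
			   s->backend);
	}

	n = append(buf, len, n, "Dedup Blocksize: %llu\n", s->blocksize);
	if (show_limit) {
		n = append(buf, len, n, "Current number of hash: %llu\n",
			   s->current_nr);
		n = append(buf, len, n, "Max number of hash: %llu\n",
			   s->limit_nr);
	}
	return n;
}
```

One deliberate difference in this sketch: the hash counters are shown only
for the in-memory backend, so an unrecognized backend prints no limit lines.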





[PATCH v2 0/7] btrfs-progs: Support for in-band de-duplication

2016-01-06 Thread Qu Wenruo
Preparation patchset for the upcoming (aimed at 4.6) kernel in-band
de-duplication patchset.

The new kernel dedup will have 2 different dedup backends and an ioctl
interface to enable/disable dedup.

The ioctl interface and on-disk format are (mostly) settled, so this
patchset is submitted ahead of the kernel de-duplication patchset.

v2:
  Better objectid/offset format for dedup hash items
  Output hash for dedup hash items

Qu Wenruo (7):
  btrfs-progs: Basic framework for dedup command group
  btrfs-progs: dedup: Add enable command for dedup command group
  btrfs-progs: dedup: Add disable support for inband deduplication
  btrfs-progs: dedup: Add status subcommand
  btrfs-progs: Add dedup feature for mkfs and convert
  btrfs: dedup: Add show-super support for new DEDUP flag
  btrfs-progs: dedup-tree: Add dedup tree support

 Documentation/btrfs-dedup.asciidoc | 105 ++
 Documentation/mkfs.btrfs.asciidoc  |   9 ++
 Makefile.in|   2 +-
 btrfs-convert.c|  19 ++-
 btrfs-debug-tree.c |   4 +
 btrfs-show-super.c |  17 +++
 btrfs.c|   1 +
 cmds-dedup.c   | 282 +
 commands.h |   2 +
 ctree.h|  41 +-
 dedup.h|  39 +
 ioctl.h|  23 +++
 kerncompat.h   |   5 +
 mkfs.c |   8 +-
 print-tree.c   |  75 ++
 utils.c|  38 +++--
 utils.h|   7 +-
 17 files changed, 653 insertions(+), 24 deletions(-)
 create mode 100644 Documentation/btrfs-dedup.asciidoc
 create mode 100644 cmds-dedup.c
 create mode 100644 dedup.h

-- 
2.6.4





[PATCH v2 6/7] btrfs: dedup: Add show-super support for new DEDUP flag

2016-01-06 Thread Qu Wenruo
Now btrfs-show-super can handle DEDUP ro compat flag.

Signed-off-by: Qu Wenruo 
---
 btrfs-show-super.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/btrfs-show-super.c b/btrfs-show-super.c
index d8ad69e..7c0dfa9 100644
--- a/btrfs-show-super.c
+++ b/btrfs-show-super.c
@@ -291,6 +291,15 @@ struct readable_flag_entry {
u64 bit;
char *output;
 };
+#define DEF_RO_COMPAT_FLAG_ENTRY(bit_name) \
+   {BTRFS_FEATURE_COMPAT_RO_##bit_name, #bit_name}
+
+struct readable_flag_entry ro_compat_flags_array[] = {
+   DEF_RO_COMPAT_FLAG_ENTRY(DEDUP)
+};
+
+static const int ro_compat_flags_num = sizeof(ro_compat_flags_array) /
+  sizeof(struct readable_flag_entry);
 
 #define DEF_INCOMPAT_FLAG_ENTRY(bit_name)  \
{BTRFS_FEATURE_INCOMPAT_##bit_name, #bit_name}
@@ -363,6 +372,13 @@ static void __print_readable_flag(u64 flag, struct 
readable_flag_entry *array,
printf(")\n");
 }
 
+static void print_readable_ro_compat_flag(u64 ro_flag)
+{
+   return __print_readable_flag(ro_flag, ro_compat_flags_array,
+ro_compat_flags_num,
+BTRFS_FEATURE_COMPAT_RO_SUPP);
+}
+
 static void print_readable_incompat_flag(u64 flag)
 {
return __print_readable_flag(flag, incompat_flags_array,
@@ -454,6 +470,7 @@ static void dump_superblock(struct btrfs_super_block *sb, 
int full)
   (unsigned long long)btrfs_super_compat_flags(sb));
printf("compat_ro_flags\t\t0x%llx\n",
   (unsigned long long)btrfs_super_compat_ro_flags(sb));
+   print_readable_ro_compat_flag(btrfs_super_compat_ro_flags(sb));
printf("incompat_flags\t\t0x%llx\n",
   (unsigned long long)btrfs_super_incompat_flags(sb));
print_readable_incompat_flag(btrfs_super_incompat_flags(sb));
-- 
2.6.4
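
The table-driven flag decoding used here can be sketched outside
btrfs-progs as follows. This is a sketch under assumptions: the bit value of
COMPAT_RO_DEDUP, the output format, and the names flag_entry, FLAG_ENTRY and
decode_flags() are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bit assignment, for illustration only. */
#define COMPAT_RO_DEDUP (1ULL << 1)

struct flag_entry {
	uint64_t bit;
	const char *name;
};

/* Build a table row from the flag suffix: the bit value plus its printable name. */
#define FLAG_ENTRY(name) { COMPAT_RO_##name, #name }

static const struct flag_entry ro_compat_flags[] = {
	FLAG_ENTRY(DEDUP),
};

/* Write known flag names separated by '|', then any unknown bits in hex. */
static void decode_flags(uint64_t flags, char *buf, size_t len)
{
	size_t used = 0;
	uint64_t unknown = flags;
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof(ro_compat_flags) / sizeof(ro_compat_flags[0]); i++) {
		const struct flag_entry *e = &ro_compat_flags[i];

		if (!(flags & e->bit))
			continue;
		unknown &= ~e->bit;
		if (used < len)
			used += (size_t)snprintf(buf + used, len - used, "%s%s",
						 used ? "|" : "", e->name);
	}
	if (unknown && used < len)
		snprintf(buf + used, len - used, "%sunknown(0x%llx)",
			 used ? "|" : "", (unsigned long long)unknown);
}
```

Adding support for another flag then only needs one more FLAG_ENTRY() row;
bits that have no row are still reported instead of being silently dropped.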





[PATCH v3 04/16] btrfs: dedup: Introduce function to remove hash from in-memory tree

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_del() to remove hash from in-memory
dedup tree.
And implement btrfs_dedup_del() and btrfs_dedup_destroy() interfaces.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
v3:
  Use struct inmem_hash instead of btrfs_dedup_hash.
---
 fs/btrfs/dedup.c | 77 
 1 file changed, 77 insertions(+)

diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
index 279f50b..0272411 100644
--- a/fs/btrfs/dedup.c
+++ b/fs/btrfs/dedup.c
@@ -259,3 +259,80 @@ int btrfs_dedup_add(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
return inmem_add(dedup_info, hash);
return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedup_info *dedup_info, u64 bytenr)
+{
+   struct rb_node **p = &dedup_info->bytenr_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+   if (bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return entry;
+   }
+
+   return NULL;
+}
+
+/* Delete a hash from in-memory dedup tree */
+static int inmem_del(struct btrfs_dedup_info *dedup_info, u64 bytenr)
+{
+   struct inmem_hash *hash;
+
+   spin_lock(&dedup_info->lock);
+   hash = inmem_search_bytenr(dedup_info, bytenr);
+   if (!hash) {
+   spin_unlock(&dedup_info->lock);
+   return 0;
+   }
+
+   __inmem_del(dedup_info, hash);
+   spin_unlock(&dedup_info->lock);
+   return 0;
+}
+
+/* Remove a dedup hash from dedup tree */
+int btrfs_dedup_del(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+   u64 bytenr)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_dedup_info *dedup_info = fs_info->dedup_info;
+
+   if (!dedup_info)
+   return 0;
+
+   if (dedup_info->backend == BTRFS_DEDUP_BACKEND_INMEMORY)
+   return inmem_del(dedup_info, bytenr);
+   return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_fs_info *fs_info)
+{
+   struct inmem_hash *entry, *tmp;
+   struct btrfs_dedup_info *dedup_info = fs_info->dedup_info;
+
+   spin_lock(&dedup_info->lock);
+   list_for_each_entry_safe(entry, tmp, &dedup_info->lru_list, lru_list)
+   __inmem_del(dedup_info, entry);
+   spin_unlock(&dedup_info->lock);
+}
+
+int btrfs_dedup_disable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedup_info *dedup_info = fs_info->dedup_info;
+
+   if (!dedup_info)
+   return 0;
+
+   if (dedup_info->backend == BTRFS_DEDUP_BACKEND_INMEMORY)
+   inmem_destroy(fs_info);
+   return 0;
+}
-- 
2.6.4
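
The add / delete-by-bytenr / destroy life cycle of the in-memory backend can
be sketched in userspace C as below. This is not the kernel code: it assumes
a plain linked list with linear search instead of rb-trees, no locking, and
eviction in insertion order as a simplification of LRU. struct store and the
store_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One cached hash record, keyed by the extent's byte number. */
struct entry {
	uint64_t bytenr;
	struct entry *prev, *next;	/* newest entry first */
};

struct store {
	struct entry *head, *tail;
	uint64_t current_nr;
	uint64_t limit_nr;		/* 0 means unlimited */
};

static struct entry *store_find(struct store *s, uint64_t bytenr)
{
	struct entry *e;

	for (e = s->head; e; e = e->next)
		if (e->bytenr == bytenr)
			return e;
	return NULL;
}

/* Unlink and free one entry, keeping the counter in step. */
static void store_unlink(struct store *s, struct entry *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		s->head = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		s->tail = e->prev;
	s->current_nr--;
	free(e);
}

/* Add a record; a bytenr that is already present is left untouched. */
static int store_add(struct store *s, uint64_t bytenr)
{
	struct entry *e;

	if (store_find(s, bytenr))
		return 0;
	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;
	e->bytenr = bytenr;
	e->next = s->head;
	if (s->head)
		s->head->prev = e;
	else
		s->tail = e;
	s->head = e;
	s->current_nr++;

	/* Evict from the old end until we are back under the limit. */
	while (s->limit_nr && s->current_nr > s->limit_nr)
		store_unlink(s, s->tail);
	return 0;
}

/* Deleting a bytenr that is not cached is not an error. */
static int store_del(struct store *s, uint64_t bytenr)
{
	struct entry *e = store_find(s, bytenr);

	if (e)
		store_unlink(s, e);
	return 0;
}

/* Drop every record, as a disable operation would. */
static void store_destroy(struct store *s)
{
	while (s->head)
		store_unlink(s, s->head);
}
```

As in the patch, deleting an unknown bytenr returns 0, so callers on the
extent-free path do not need to know whether a hash was ever cached.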





[PATCH v3 03/16] btrfs: dedup: Introduce function to add hash into in-memory tree

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_add() to add hash into in-memory tree.
And now we can implement the btrfs_dedup_add() interface.

Sgined-o
Signed-off-by: Wang Xiaoguang 
---
v3:
  Use specific struct inmem_hash for inmem backend
  Handle corner case where same bytenr may have 2 different hashes.
---
 fs/btrfs/dedup.c | 165 +++
 1 file changed, 165 insertions(+)

diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
index c1fadff..279f50b 100644
--- a/fs/btrfs/dedup.c
+++ b/fs/btrfs/dedup.c
@@ -21,6 +21,25 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 
+struct inmem_hash {
+   struct rb_node hash_node;
+   struct rb_node bytenr_node;
+   struct list_head lru_list;
+
+   u64 bytenr;
+   u32 num_bytes;
+
+   u8 hash[];
+};
+
+static inline struct inmem_hash *inmem_alloc_hash(u16 type)
+{
+   if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedup_sizes)))
+   return NULL;
+   return kzalloc(sizeof(struct inmem_hash) + btrfs_dedup_sizes[type],
+   GFP_NOFS);
+}
+
 int btrfs_dedup_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
   u64 blocksize, u64 limit)
 {
@@ -94,3 +113,149 @@ out:
}
return ret;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+struct inmem_hash *hash, int hash_len)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+   if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+   p = &(*p)->rb_left;
+   else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->hash_node, parent, p);
+   rb_insert_color(&hash->hash_node, root);
+   return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+  struct inmem_hash *hash)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+   if (hash->bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (hash->bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->bytenr_node, parent, p);
+   rb_insert_color(&hash->bytenr_node, root);
+   return 0;
+}
+
+static void __inmem_del(struct btrfs_dedup_info *dedup_info,
+   struct inmem_hash *hash)
+{
+   list_del(&hash->lru_list);
+   rb_erase(&hash->hash_node, &dedup_info->hash_root);
+   rb_erase(&hash->bytenr_node, &dedup_info->bytenr_root);
+
+   if (!WARN_ON(dedup_info->current_nr == 0))
+   dedup_info->current_nr--;
+
+   kfree(hash);
+}
+
+/*
+ * Insert a hash into in-memory dedup tree
+ * Will remove excess least recently used hashes.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory
+ */
+static int inmem_add(struct btrfs_dedup_info *dedup_info,
+struct btrfs_dedup_hash *hash)
+{
+   int ret = 0;
+   u16 type = dedup_info->hash_type;
+   struct inmem_hash *ihash;
+
+   ihash = inmem_alloc_hash(type);
+
+   if (!ihash)
+   return -ENOMEM;
+
+   /* Copy the data out */
+   ihash->bytenr = hash->bytenr;
+   ihash->num_bytes = hash->num_bytes;
+   memcpy(ihash->hash, hash->hash, btrfs_dedup_sizes[type]);
+
+   spin_lock(&dedup_info->lock);
+
+   /*
+* Insert bytenr node first
+* It is *POSSIBLE* that the same bytenr has a different hash,
+* since a hash is added at finish_ordered_io, but freed
+* at delayed_ref time.
+* If an extent is allocated and freed, then reallocated in the
+* same trans, the same bytenr will have a different hash.
+* So let insert_bytenr handle it.
+* TODO: Maybe just free previous inmem_hash would be better?
+*/
+   ret = inmem_insert_bytenr(&dedup_info->bytenr_root, ihash);
+   if (ret > 0) {
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   ret = inmem_insert_hash(&dedup_info->hash_root, ihash,
+   btrfs_dedup_sizes[type]);
+   if (ret > 0) {
+   /*
+* We only keep one hash in tree to save memory, so if
+* hash conflicts, free current one.
+*/
+   rb_erase(&ihash->bytenr_node, &dedup_info->bytenr_root);
+   kfree(ihash);
+   ret = 0;
+  

Re: [PATCH 00/35 v2] separate operations from flags in the bio/request structs

2016-01-06 Thread Dave Chinner
On Wed, Jan 06, 2016 at 08:40:09PM -0500, Martin K. Petersen wrote:
> > "Mike" == mchristi   writes:
> 
> Mike> The following patches begin to cleanup the request->cmd_flags and
> bio-> bi_rw mess. We currently use cmd_flags to specify the operation,
> Mike> attributes and state of the request. For bi_rw we use it for
> Mike> similar info and also the priority but then also have another
> Mike> bi_flags field for state. At some point, we abused them so much we
> Mike> just made cmd_flags 64 bits, so we could add more.
> 
> Mike> The following patches separate the operation (read, write, discard,
> Mike> flush, etc) from cmd_flags/bi_rw.
> 
> Mike> This patchset was made against linux-next from today Jan 5 2016.
> Mike> (git tag next-20160105).
> 
> Very nice work. Thanks for doing this!
> 
> I think it's a much needed cleanup. I focused mainly on the core block,
> discard, write same and sd.c pieces and everything looks sensible to me.
> 
> I wonder what the best approach is to move a patch set with this many
> stakeholders forward? Set a "speak now or forever hold your peace"
> review deadline?

I say just ask Linus to pull it immediately after the next merge
window closes

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH v3 12/16] btrfs: dedup: Introduce interfaces to resume and cleanup dedup info

2016-01-06 Thread Qu Wenruo
Since we will introduce a new on-disk based dedup method, introduce new
interfaces to resume previous dedup setup.

And since we introduce a new tree for status, also add disable handler
for it.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
v2:
  Newly introduced
v3:
  None
---
 fs/btrfs/dedup.c   | 193 +++--
 fs/btrfs/dedup.h   |  11 +++
 fs/btrfs/disk-io.c |  26 +++-
 fs/btrfs/disk-io.h |   1 +
 4 files changed, 210 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
index 4f24a2c..5edb923 100644
--- a/fs/btrfs/dedup.c
+++ b/fs/btrfs/dedup.c
@@ -20,6 +20,43 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "disk-io.h"
+
+static int init_dedup_info(struct btrfs_fs_info *fs_info, u16 type, u16 
backend,
+  u64 blocksize, u64 limit)
+{
+   struct btrfs_dedup_info *dedup_info;
+   int ret;
+
+   fs_info->dedup_info = kzalloc(sizeof(*dedup_info), GFP_NOFS);
+   if (!fs_info->dedup_info)
+   return -ENOMEM;
+
+   dedup_info = fs_info->dedup_info;
+
+   dedup_info->hash_type = type;
+   dedup_info->backend = backend;
+   dedup_info->blocksize = blocksize;
+   dedup_info->limit_nr = limit;
+
+   /* Only support SHA256 yet */
+   dedup_info->dedup_driver = crypto_alloc_shash("sha256", 0, 0);
+   if (IS_ERR(dedup_info->dedup_driver)) {
+   btrfs_err(fs_info, "failed to init sha256 driver");
+   ret = PTR_ERR(dedup_info->dedup_driver);
+   kfree(fs_info->dedup_info);
+   fs_info->dedup_info = NULL;
+   return ret;
+   }
+
+   dedup_info->hash_root = RB_ROOT;
+   dedup_info->bytenr_root = RB_ROOT;
+   dedup_info->current_nr = 0;
+   INIT_LIST_HEAD(&dedup_info->lru_list);
+   spin_lock_init(&dedup_info->lock);
+   mutex_init(&dedup_info->ondisk_lock);
+   return 0;
+}
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -44,6 +81,13 @@ int btrfs_dedup_enable(struct btrfs_fs_info *fs_info, u16 
type, u16 backend,
   u64 blocksize, u64 limit)
 {
struct btrfs_dedup_info *dedup_info;
+   struct btrfs_root *dedup_root;
+   struct btrfs_key key;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_path *path;
+   struct btrfs_dedup_status_item *status;
+   int create_tree;
+   u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
int ret = 0;
 
/* Sanity check */
@@ -61,6 +105,18 @@ int btrfs_dedup_enable(struct btrfs_fs_info *fs_info, u16 
type, u16 backend,
if (backend == BTRFS_DEDUP_BACKEND_ONDISK && limit != 0)
limit = 0;
 
+   /*
+* If current fs doesn't support DEDUP feature, don't enable
+* on-disk dedup.
+*/
+   if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUP) &&
+   backend == BTRFS_DEDUP_BACKEND_ONDISK)
+   return -EINVAL;
+
+   /* Meaningless and unable to enable dedup for RO fs */
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EINVAL;
+
if (fs_info->dedup_info) {
dedup_info = fs_info->dedup_info;
 
@@ -80,32 +136,63 @@ int btrfs_dedup_enable(struct btrfs_fs_info *fs_info, u16 
type, u16 backend,
}
 
 enable:
-   fs_info->dedup_info = kzalloc(sizeof(*dedup_info), GFP_NOFS);
-   if (!fs_info->dedup_info)
-   return -ENOMEM;
+   create_tree = compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUP;
 
+   ret = init_dedup_info(fs_info, type, backend, blocksize, limit);
dedup_info = fs_info->dedup_info;
+   if (ret < 0)
+   goto out;
 
-   dedup_info->hash_type = type;
-   dedup_info->backend = backend;
-   dedup_info->blocksize = blocksize;
-   dedup_info->limit_nr = limit;
+   if (!create_tree)
+   goto out;
 
-   /* Only support SHA256 yet */
-   dedup_info->dedup_driver = crypto_alloc_shash("sha256", 0, 0);
-   if (IS_ERR(dedup_info->dedup_driver)) {
-   btrfs_err(fs_info, "failed to init sha256 driver");
-   ret = PTR_ERR(dedup_info->dedup_driver);
+   /* Create dedup tree for status at least */
+   path = btrfs_alloc_path();
+   if (!path) {
+   ret = -ENOMEM;
goto out;
}
 
-   dedup_info->hash_root = RB_ROOT;
-   dedup_info->bytenr_root = RB_ROOT;
-   dedup_info->current_nr = 0;
-   INIT_LIST_HEAD(&dedup_info->lru_list);
-   spin_lock_init(&dedup_info->lock);
+   trans = btrfs_start_transaction(fs_info->tree_root, 2);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   btrfs_free_path(path);
+   goto out;
+   }
+
+   dedup_root = btrfs_create_tree(trans, fs_info,
+  

[PATCH v3 10/16] btrfs: dedup: Inband in-memory only de-duplication implement

2016-01-06 Thread Qu Wenruo
From: Wang Xiaoguang 

Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedup hash,
and uses the dedup hash to do inband de-duplication at extent level.

The workflow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedup_bs
3) For hash match (duplicated) case, just increase source extent ref
   and insert file extent.
   For hash mismatch case, go through the normal cow_file_range()
   fallback, and add hash into dedup_tree.
   Compression for the hash miss case is not supported yet.

The current implementation stores all dedup hashes in an in-memory rb-tree,
with LRU behavior to control the limit.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
v3:
  Fix a wrong page parameter for cow_file_range().
  Fix a memory leak.
  Move dedup_add() to run_delayed_ref() to fix an abort transaction.
---
 fs/btrfs/extent-tree.c |   6 +
 fs/btrfs/extent_io.c   |  30 ++---
 fs/btrfs/extent_io.h   |  15 +++
 fs/btrfs/inode.c   | 305 +++--
 4 files changed, 331 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a2e4c2b..8eb8d85 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6677,6 +6677,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle 
*trans,
btrfs_release_path(path);
 
if (is_data) {
+   ret = btrfs_dedup_del(trans, root, bytenr);
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, extent_root,
+   ret);
+   goto out;
+   }
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
btrfs_abort_transaction(trans, extent_root, 
ret);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 33a01ea..b7a6612 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2549,7 +2549,7 @@ int end_extent_writepage(struct page *page, int err, u64 
start, u64 end)
  * Scheduling is not allowed, so the extent state tree is expected
  * to have one and only one object corresponding to this IO.
  */
-static void end_bio_extent_writepage(struct bio *bio)
+void end_bio_extent_writepage(struct bio *bio)
 {
struct bio_vec *bvec;
u64 start;
@@ -2813,8 +2813,8 @@ struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned 
int nr_iovecs)
 }
 
 
-static int __must_check submit_one_bio(int rw, struct bio *bio,
-  int mirror_num, unsigned long bio_flags)
+int __must_check submit_one_bio(int rw, struct bio *bio,
+   int mirror_num, unsigned long bio_flags)
 {
int ret = 0;
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -2851,18 +2851,18 @@ static int merge_bio(int rw, struct extent_io_tree 
*tree, struct page *page,
 
 }
 
-static int submit_extent_page(int rw, struct extent_io_tree *tree,
- struct writeback_control *wbc,
- struct page *page, sector_t sector,
- size_t size, unsigned long offset,
- struct block_device *bdev,
- struct bio **bio_ret,
- unsigned long max_pages,
- bio_end_io_t end_io_func,
- int mirror_num,
- unsigned long prev_bio_flags,
- unsigned long bio_flags,
- bool force_bio_submit)
+int submit_extent_page(int rw, struct extent_io_tree *tree,
+   struct writeback_control *wbc,
+   struct page *page, sector_t sector,
+   size_t size, unsigned long offset,
+   struct block_device *bdev,
+   struct bio **bio_ret,
+   unsigned long max_pages,
+   bio_end_io_t end_io_func,
+   int mirror_num,
+   unsigned long prev_bio_flags,
+   unsigned long bio_flags,
+   bool force_bio_submit)
 {
int ret = 0;
struct bio *bio;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index f4c1ae1..ae17832 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -360,6 +360,21 @@ int clean_io_failure(struct inode *inode, u64 start, 
struct page *page,
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 int mirror_num);
+int submit_extent_page(int rw, struct extent_io_tree *tree,
+  

[PATCH v3 06/16] btrfs: delayed_ref: Add support for handle dedup hash

2016-01-06 Thread Qu Wenruo
Add support for delayed_ref to handle dedup_hash.

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
v3:
  Newly introduced.
---
 fs/btrfs/ctree.h   |  4 +++-
 fs/btrfs/delayed-ref.c |  3 ++-
 fs/btrfs/delayed-ref.h |  9 -
 fs/btrfs/extent-tree.c | 17 -
 fs/btrfs/inode.c   |  2 +-
 5 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 450790b..aa7f5fb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -48,6 +48,7 @@ extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
 extern struct kmem_cache *btrfs_free_space_cachep;
 struct btrfs_ordered_sum;
+struct btrfs_dedup_hash;
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 #define STATIC noinline
@@ -3436,7 +3437,8 @@ int btrfs_alloc_reserved_file_extent(struct 
btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 root_objectid, u64 owner,
 u64 offset, u64 ram_bytes,
-struct btrfs_key *ins);
+struct btrfs_key *ins,
+struct btrfs_dedup_hash *hash);
 int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
   u64 root_objectid, u64 owner, u64 offset,
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 94609ec..fb9f47b 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -812,7 +812,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
   u64 bytenr, u64 num_bytes,
   u64 parent, u64 ref_root,
   u64 owner, u64 offset, u64 reserved, int action,
-  int atomic)
+  int atomic, struct btrfs_dedup_hash *hash)
 {
struct btrfs_delayed_data_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
@@ -846,6 +846,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
}
 
head_ref->extent_op = NULL;
+   ref->hash = hash;
 
delayed_refs = &trans->transaction->delayed_refs;
 
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8928fe7..d67766b 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -24,6 +24,7 @@
 #define BTRFS_ADD_DELAYED_EXTENT 3 /* record a full extent allocation */
 #define BTRFS_UPDATE_DELAYED_HEAD 4 /* not changing ref count on head ref */
 
+struct btrfs_dedup_hash;
 /*
  * XXX: Qu: I really hate the design that ref_head and tree/data ref shares the
  * same ref_node structure.
@@ -152,6 +153,12 @@ struct btrfs_delayed_data_ref {
u64 parent;
u64 objectid;
u64 offset;
+
+   /*
+* For dedup hash miss case, to add it into dedup tree at
+* run_delayed_ref() time.
+*/
+   struct btrfs_dedup_hash *hash;
 };
 
 struct btrfs_delayed_ref_root {
@@ -249,7 +256,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
   u64 bytenr, u64 num_bytes,
   u64 parent, u64 ref_root,
   u64 owner, u64 offset, u64 reserved, int action,
-  int atomic);
+  int atomic, struct btrfs_dedup_hash *hash);
 int btrfs_add_delayed_qgroup_reserve(struct btrfs_fs_info *fs_info,
 struct btrfs_trans_handle *trans,
 u64 ref_root, u64 bytenr, u64 num_bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d80f74d..a2e4c2b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -36,6 +36,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedup.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2089,7 +2090,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
num_bytes, parent, root_objectid,
owner, offset, 0,
-   BTRFS_ADD_DELAYED_REF, 0);
+   BTRFS_ADD_DELAYED_REF, 0, NULL);
}
return ret;
 }
@@ -2109,7 +2110,7 @@ int btrfs_inc_extent_ref_atomic(struct btrfs_trans_handle 
*trans,
return -EINVAL;
return btrfs_add_delayed_data_ref(fs_info, trans, bytenr,
num_bytes, parent, root_objectid,
-   owner, offset, 0, BTRFS_ADD_DELAYED_REF, 1);
+   owner, offset, 0, BTRFS_ADD_DELAYED_REF, 1, NULL);
 }
 
 static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
@@ -2196,6 +2197,10 

[PATCH v3 13/16] btrfs: dedup: Add support for on-disk hash search

2016-01-06 Thread Qu Wenruo
The on-disk backend should now be able to search hashes.
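
The lookup in the diff below uses the last 8 bytes of the hash as the key
objectid, walks the matching items backwards, and only reports a hit after
comparing the full hash. A rough user-space sketch of that idea, with a plain
array standing in for the dedup tree (sketch_hash_item, sketch_hash_objectid
and sketch_search_hash are hypothetical names, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for one hash item: key (objectid, offset) plus its payload. */
struct sketch_hash_item {
	uint64_t objectid;   /* last 8 bytes of the hash */
	uint64_t offset;     /* bytenr of the extent holding the data */
	uint64_t num_bytes;  /* length of that extent */
	const uint8_t *hash; /* full hash stored in the item */
};

/* Derive the key objectid from the tail of the hash. */
uint64_t sketch_hash_objectid(const uint8_t *hash, size_t hash_len)
{
	uint64_t objectid;

	assert(hash_len >= sizeof(objectid));
	memcpy(&objectid, hash + hash_len - sizeof(objectid), sizeof(objectid));
	return objectid;
}

/*
 * Return 0 for not found, 1 for found (bytenr_ret and num_bytes_ret set).
 * Items sharing the same objectid are only candidates; the full hash
 * must match before the extent is reported.
 */
int sketch_search_hash(const struct sketch_hash_item *items, size_t nr,
		       const uint8_t *hash, size_t hash_len,
		       uint64_t *bytenr_ret, uint64_t *num_bytes_ret)
{
	uint64_t objectid = sketch_hash_objectid(hash, hash_len);
	size_t i;

	/* Walk backwards, similar in spirit to stepping to previous items. */
	for (i = nr; i-- > 0; ) {
		if (items[i].objectid != objectid)
			continue;
		if (!memcmp(items[i].hash, hash, hash_len)) {
			*bytenr_ret = items[i].offset;
			*num_bytes_ret = items[i].num_bytes;
			return 1;
		}
	}
	return 0;
}
```

Two different hashes can share the same trailing 8 bytes, which is why the
full comparison is needed before a match is trusted.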

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
v2:
  Newly introduced
v3:
  None
---
 fs/btrfs/dedup.c | 169 ++-
 fs/btrfs/dedup.h |   3 +
 2 files changed, 171 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
index 5edb923..4bcdf5d 100644
--- a/fs/btrfs/dedup.c
+++ b/fs/btrfs/dedup.c
@@ -493,6 +493,172 @@ int btrfs_dedup_disable(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * Return 0 for not found
+ * Return >0 for found and set bytenr_ret
+ * Return <0 for error
+ */
+static int ondisk_search_hash(struct btrfs_dedup_info *dedup_info, u8 *hash,
+ u64 *bytenr_ret, u64 *num_bytes_ret)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_root *dedup_root = dedup_info->dedup_root;
+   u8 *buf = NULL;
+   u64 hash_key;
+   int hash_len = btrfs_dedup_sizes[dedup_info->hash_type];
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   buf = kmalloc(hash_len, GFP_NOFS);
+   if (!buf) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   memcpy(&hash_key, hash + hash_len - 8, 8);
+   key.objectid = hash_key;
+   key.type = BTRFS_DEDUP_HASH_ITEM_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(NULL, dedup_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   WARN_ON(ret == 0);
+   while (1) {
+   struct extent_buffer *node;
+   struct btrfs_dedup_hash_item *hash_item;
+   int slot;
+
+   ret = btrfs_previous_item(dedup_root, path, hash_key,
+ BTRFS_DEDUP_HASH_ITEM_KEY);
+   if (ret < 0)
+   goto out;
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+
+   node = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(node, &key, slot);
+
+   if (key.type != BTRFS_DEDUP_HASH_ITEM_KEY ||
+   memcmp(&key.objectid, hash + hash_len - 8, 8))
+   break;
+   hash_item = btrfs_item_ptr(node, slot,
+   struct btrfs_dedup_hash_item);
+   read_extent_buffer(node, buf, (unsigned long)(hash_item + 1),
+  hash_len);
+   if (!memcmp(buf, hash, hash_len)) {
+   ret = 1;
+   *bytenr_ret = key.offset;
+   *num_bytes_ret = btrfs_dedup_hash_len(node, hash_item);
+   break;
+   }
+   }
+out:
+   kfree(buf);
+   btrfs_free_path(path);
+   return ret;
+}
+
+static int ondisk_search(struct inode *inode, u64 file_pos,
+struct btrfs_dedup_hash *hash)
+{
+   int ret;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_trans_handle *trans = NULL;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *head;
+   struct btrfs_dedup_info *dedup_info = fs_info->dedup_info;
+   u64 old_bytenr;
+   u64 bytenr;
+   u64 num_bytes;
+
+   /*
+* TODO: Optimize the super-hot mutex.
+*/
+   mutex_lock(&dedup_info->ondisk_lock);
+   ret = ondisk_search_hash(dedup_info, hash->hash, &bytenr, &num_bytes);
+   mutex_unlock(&dedup_info->ondisk_lock);
+   if (ret <= 0)
+   goto out;
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
+again:
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   if (!head || head->processing == 1) {
+   /*
+* Somebody else may be trying to run the refs, the found
+* duplicated extent may be freed, so here we just
+* choose to abort this dedup handle.
+* XXX: we need to find a better method to improve it.
+*/
+   spin_unlock(&delayed_refs->lock);
+   ret = 0;
+   goto out;
+   }
+
+   ret = btrfs_delayed_ref_lock(trans, head);
+   spin_unlock(&delayed_refs->lock);
+   if (ret == -EAGAIN) {
+   mutex_lock(&dedup_info->ondisk_lock);
+   ret = ondisk_search_hash(dedup_info, hash->hash, &bytenr,
+&num_bytes);
+   mutex_unlock(&dedup_info->ondisk_lock);
+   if (ret <= 0)
+   goto out;
+   goto again;
+   }
+   /*
+* Still need to search the hash again to ensure the hash is not
+* 

Re: [PATCH 13/35] xfs: set bi_op to REQ_OP

2016-01-06 Thread Dave Chinner
On Tue, Jan 05, 2016 at 02:53:16PM -0600, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> This patch has xfs set the bio bi_op to a REQ_OP, and
> rq_flag_bits to bi_rw.
> 
> Note:
> I have run xfs tests on these btrfs patches. There were some failures
> with and without the patches. I have not had time to track down why
> xfstest fails without the patches.
> 
> Signed-off-by: Mike Christie 
> ---
>  fs/xfs/xfs_aops.c |  3 ++-
>  fs/xfs/xfs_buf.c  | 27 +++
>  2 files changed, 17 insertions(+), 13 deletions(-)

Not sure which patches your note is referring to here.

The XFS change here looks fine.

Acked-by: Dave Chinner 

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 25/35] target: set bi_op to REQ_OP

2016-01-06 Thread Nicholas A. Bellinger
Hi Mike,

On Tue, 2016-01-05 at 14:53 -0600, mchri...@redhat.com wrote:
> From: Mike Christie 
> 
> This patch has the target modules set the bio bi_op to a REQ_OP, and
> rq_flag_bits to bi_rw.
> 
> This patch is compile tested only.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/target/target_core_iblock.c | 38 
> ++---
>  drivers/target/target_core_pscsi.c  |  2 +-
>  2 files changed, 24 insertions(+), 16 deletions(-)
> 

Nice work cleaning up this long-standing abuse.  ;)

For the target/iblock bit:

Acked-by: Nicholas Bellinger 



Re: floating point exception (core dumped) - btrfs rescue chunk-recover

2016-01-06 Thread Henk Slager
On Wed, Jan 6, 2016 at 5:37 AM, P R Shah  wrote:
> Hello,
>
> TL;DR ==
>
> btrfs 3x500GB RAID 5 - One device failed. Added a new device (btrfs device
> add) and tried to remove the failed device (btrfs device delete).
It is a pity that you used add and delete, especially with an already
failing device in a btrfs raid5; this easily leads to the situation you
have now. A delete assumes all devices are healthy, and that wasn't the case.
You should have used
 btrfs replace  , likely with option -r.
You could also have tried some non-btrfs rescue tricks, like unmounting
the fs and doing a
 dd_rescue  failing_device new_device
then disconnecting the failing_device from the system, doing a btrfs dev
scan, mounting the fs and running a scrub. I have used this once
successfully. But btrfs replace works well enough with recent
kernels/tools, so I would simply use that.
Note that scrub for raid5 in the latest kernels is slow, roughly 10MiB/s
per disk for the disks you use, I think.

> I tried to mount the array in degraded mode, but that didn't work either.
> After multiple attempts (including adding back the failed HDD), I finally
> ran the btrfs rescue chunk-recover command on the primary member /dev/sdb.
>
> This ran for about 4 hours, and then failed with "floating point exception
> (core dumped)".
> ==
>
> I am testing out btrfs to gain familiarity with it. I am quite amazed at
> its capabilities and performance. However, I am either not able to
> understand or implement RAID5 fault tolerance.
>
> I understand from the wiki that RAID56 is experimental. The data I am
> working with is backed up elsewhere and for all intents and purposes,
> discard-able.
>
> I have set up a btrfs RAID5 with 3x500GB Seagate HDDs, with a mount point of
> /storage. Booting is off a fourth HDD (ext4, lubuntu 64bit) that is not
> involved in the RAID.
>
> Everything was working amazingly well, until one HDD failed and was quietly
> offlined. For a couple of days, the RAID was running off 2 HDDs and I didn't
> notice.
>
> When I DID realize, I shut down the system, bought a new HDD (2TB), which
> took a couple of days to arrive.
>
> When I powered up the system again, the failed 500GB was back. Everything
> loaded fine, and looked good. To be on the safe side, I ran a badblocks test
> (ro) on the failing HDD.
>
> Halfway through the test, the HDD disappeared again. After a cold reboot, it
> was loaded fine again.
>
> At this point, I decided to replace the failed HDD. I shutdown, plugged in
> the new HDD in place of the boot HDD, booted up with Lubuntu live, mounted
> (/storage) and added the device to the RAID.
>
> After adding the device successfully, I gave a device delete command for the
> failed HDD. Partway through the process, the failing HDD (/dev/sdc)
> disappeared again, and after waiting a couple of hours, I hard-reset the
> system, and removed the failing HDD, assuming that the RAID will re-build on
> the existing devices.
>
> Now, the RAID (/storage) refused to mount. I got a c_tree error (please see
> enclosed logs below).
>
> I tried to mount the array in degraded mode, but that didn't work either.
> After multiple attempts (including adding back the failed HDD), I finally
> ran the btrfs rescue chunk-recover command on the primary member /dev/sdb.
>
> This ran for about 4 hours, and then failed with "floating point exception
> (core dumped)".
>
> Can I recover the array or should I start again? The data is not important,
> but I would like to know the recovery process, or any misconceptions in my
> thinking that RAID5 with 3 devices is enough for SOHO-level fault tolerance?
You could try to mount with -o recovery with a 4.4 kernel and see what
happens, but I would start again. Maybe you can restart with the
3x500GB, and when you detect the first failure again, do a replace and
see if it works. Then you have an answer to the second question. It
is also up to you whether you are OK with the slow scrubs and the lack
of a device-failure alert.

> Any advice, pointers, etc, much appreciated. Tech level: medium-high (RHCE).
>
> Relevant system information:
> === uname -a
> Linux lubuntu 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 15:35:06 UTC 2015
> x86_64 x86_64 x86_64 GNU/Linux
If you want to try/use raid5, it is best to use the latest kernel from
kernel.org, like 4.4-rc8. Kernel 4.2 is no longer supported by
kernel.org, only by Canonical, so in theory you should look up which
patches are in this 4.2.0-16-generic #19-Ubuntu and ask them for
support.

> == btrfs --version
> btrfs-progs v4.0
Same for the tools version: get the latest (currently 4.3.1).

> == btrfs fi show
> warning, device 2 is missing
> Label: 'storage'  uuid: 5a3d6590-df08-4520-b61b-802d350849c7
> Total devices 4 FS bytes used 176.91GiB
> devid1 size 465.76GiB used 90.03GiB path /dev/sdb
> devid3 size 465.76GiB used 90.01GiB path /dev/sdc
> devid4 size 1.82TiB used 10.00GiB path /dev/sda
> *** Some devices missing
>
> == dmesg 

[PATCH] Btrfs-progs: fix typo in parse_range

2016-01-06 Thread Liu Bo
s/*end/*start.

This makes 'btrfs balance start -dvrange=xxx..yyy' really work.
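
For illustration, a range of the form xxx..yyy can be parsed along these
lines. This is a standalone sketch written for this note, not the
cmds-balance.c source; sketch_parse_range is a hypothetical name, and the
treatment of open-ended ranges ("..yyy" and "xxx..") is an assumption. The
point it shows is the one the patch fixes: the number before the dots must be
stored in *start, not *end.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return 0 on success, 1 on a malformed or inverted range. */
int sketch_parse_range(const char *range, uint64_t *start, uint64_t *end)
{
	const char *dots;
	char *endptr;

	assert(range && start && end);

	dots = strstr(range, "..");
	if (!dots)
		return 1;

	/* Reject a bare ".." with neither bound given. */
	if (dots == range && dots[2] == '\0')
		return 1;

	if (dots == range) {
		*start = 0; /* "..yyy": open lower bound */
	} else {
		/* The left-hand number is the START of the range. */
		*start = strtoull(range, &endptr, 10);
		if (endptr != dots)
			return 1;
	}

	if (dots[2] == '\0') {
		*end = UINT64_MAX; /* "xxx..": open upper bound */
	} else {
		*end = strtoull(dots + 2, &endptr, 10);
		if (*endptr != '\0')
			return 1;
	}

	return *start > *end;
}
```

With the bug described above, the left-hand value would land in *end and
*start would be left unset, so a range like 10..20 could not work as intended.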

Signed-off-by: Liu Bo 
---
 cmds-balance.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index c5be6b9..9f647cd 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -123,7 +123,7 @@ static int parse_range(const char *range, u64 *start, u64 
*end)
*start = 0;
skipped++;
} else {
-   *end = strtoull(range, &endptr, 10);
+   *start = strtoull(range, &endptr, 10);
if (*endptr != 0 && *endptr != '.')
return 1;
}
-- 
2.5.0
