Re: [PATCH 1/2] btrfs-progs: Add missing devices check for mounted btrfs.

2014-04-09 Thread Qu Wenruo


On 2014-04-09 12:33, Anand Jain wrote:



A bit of background on btrfs fi show.

The original btrfs fi show had too many problems because btrfs-progs
rarely consulted the btrfs kernel code to determine the status of
mounted devices. This was a serious problem some time back.
Various patches made btrfs-progs communicate with the kernel
and report the exact status as seen by the kernel,
avoiding any fabrication of btrfs status in userland for
mounted disks.

The old code was moved under the -d option, and the newer code, which
prints status as reported by the kernel, comes under -m,
which stands for mounted.

The CLI btrfs fi show (with no option) uses both -m
and userland (lblkid) for unmounted btrfs to show the status of the
various disks.

But this patch brings the old problem and idea back to life:
it fabricates "missing" in userland for a mounted btrfs.
As shown in the test case below, it has bad cascading consequences.

This patch must be taken out.

If your worry is that xfstests btrfs/003 fails: yes, it does, and we
don't have the right solution to fix it as of now.


Yes, I'm still worried about the xfstests test case.
Personally, I think you should first try to remove the disk removal test
from btrfs/003.

(Though I think that could be much harder to achieve.)

If the xfstests devs can be persuaded and a NOTE is added to the btrfs-progs
documentation, I'll be completely OK with reverting this patch.




On 09/04/2014 11:26, Qu Wenruo wrote:

Yes, the deleted device scan is still one of the deep problems.

But my patch is not intended to deal with anything related to that problem.


For me, I am *only* going to deal with the *exit code* problem:
'btrfs fi show device' executes correctly (OK, only part of it, to be
exact),

but the exit code is still 1, which is the bug I'm trying to fix.


This looks out of context; am I missing something?


Sorry for mixing this up with the patch dealing with the exit code
(the previous patch I sent).




IMO, users/admins may never be interested in the internal mechanism
or algorithm,
only in the output and the exit value.


Users trust btrfs-progs to tell them what's going on in the btrfs kernel.


IMO users trust btrfs-progs to give them the right info on btrfs; they
neither care about nor are aware of its relationship with the kernel.



There is no point in fabricating things (like missing) in userland
when the btrfs kernel isn't aware of them.


From the perspective of normal users/admins, they do not care how it is
done, only about the result.
So if users find missing devices still shown in the 'btrfs fi show'
output, they will be confused.


Though I think any sane user/admin will call 'btrfs dev del' first,
so the problem still comes down to the btrfs/003 test case.




This bug is much like the previous 'btrfs fi show' bugs that break some
xfstests test cases


I guess you are talking about the btrfs/003 test case. I think the
correct way for an xfstest to determine whether the disk is
missing is to dump btrfs_fs_devices -> btrfs_device and check
whether the missing flag is set. xfstests trusts btrfs-progs, but
btrfs-progs doesn't do it.

If there were a real test program that checked btrfs disk status from
the btrfs kernel memory, then this bug (missing flag is not set)
would again exist.


This bug is not about dev delete but about the error message caused by
the unimplemented btrfs ioctl mentioned below.





(always showing an error due to the scan_kernel_v2 function, which calls an
unimplemented ioctl interface),


The problem is that its matching kernel patch is not integrated.



if some patch breaks the old exit code or output, then it should be
fixed to maintain the old
output/exit code (unless some big decision is made to change it).


Yes, but fix it the right way: notify the btrfs kernel about the
missing disks, which would/should set the missing flag in the btrfs_device
list.


As I mentioned, this patch is just a workaround for test case btrfs/003.
It is not meant to fix the whole thing.

As mentioned at the beginning, I would prefer to remove the device removal
test from the test case; then the patch can be reverted without any
complaint from me.

Thanks,
Qu.



So for output/exit code consistency, it should be fixed, even if the
patch is not a cure but
only a workaround in your view.


There is no shortcut. As shown, your patch has the cascading problem.
When the btrfs kernel fixes this issue, and if David and Josef agree, a
patch on top of the btrfs-devlist patch will help to fix this problem
once and for all.

Thanks, Anand


Thanks,
Qu.

On 2014-04-09 11:04, Anand Jain wrote:



 Below shows the bug cascading to this patch.

 And now, to fix this, I think we shouldn't fix or work around it in
 btrfs-progs again! Fix it in the btrfs kernel (or leave it open
 until a suitable fix is found; I tried and failed, but don't fix it
 the wrong way). If you want to help fix this problem: find out
 whether we can get a kobject notification within the kernel when disks
 disappear.

 I have been advocating that btrfs-progs should _not_ add its own
 intelligence and it should be as 

[PATCH v9 03/16] Btrfs: introduce dedup tree operations

2014-04-09 Thread Liu Bo
The operations consist of finding matched items, adding new items and
removing items.
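
For readers following the lookup code in this patch, here is a minimal
user-space sketch of the key layout that the comment in
btrfs_find_dedup_extent() describes: the last 64 bits of a SHA-256 digest
become key.objectid, the rest of the digest is kept as the item payload, and
key.offset carries the bytenr of the extent. The names split_dedup_hash,
struct dedup_key and SHA256_WORDS are made up for this illustration and are
not part of the patch.

```c
#include <stdint.h>
#include <string.h>

#define SHA256_WORDS 4 /* a 256-bit digest viewed as four 64-bit words */

/* Hypothetical stand-in for the (objectid, offset) part of a btrfs key. */
struct dedup_key {
	uint64_t objectid; /* last 64 bits of the digest */
	uint64_t offset;   /* bytenr of the extent holding the data */
};

/*
 * Split a digest into the key part and the payload part.
 * payload receives the first SHA256_WORDS - 1 words, which is what a
 * lookup compares to tell apart digests that share the same last word.
 */
static void split_dedup_hash(const uint64_t digest[SHA256_WORDS],
			     uint64_t bytenr,
			     struct dedup_key *key,
			     uint64_t payload[SHA256_WORDS - 1])
{
	key->objectid = digest[SHA256_WORDS - 1];
	key->offset = bytenr;
	memcpy(payload, digest, (SHA256_WORDS - 1) * sizeof(uint64_t));
}
```

Keying on only 64 bits keeps the tree key small; the payload comparison is
what makes a match on objectid a match on the full digest.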

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.h |   9 +++
 fs/btrfs/file-item.c | 210 +++
 2 files changed, 219 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 54c29d2..7b14e2e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3760,6 +3760,15 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 struct list_head *list, int search_commit);
 
+int noinline_for_stack
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash);
+int noinline_for_stack
+btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_dedup_hash *hash);
+int noinline_for_stack
+btrfs_free_dedup_extent(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root, u64 hash, u64 bytenr);
 /* inode.c */
 struct btrfs_delalloc_work {
struct inode *inode;
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 9d84658..068112c 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -884,3 +884,213 @@ out:
 fail_unlock:
goto out;
 }
+
+/* 1 means we find one, 0 means we don't. */
+int noinline_for_stack
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash)
+{
+   struct btrfs_key key;
+   struct btrfs_path *path;
+   struct extent_buffer *leaf;
+   struct btrfs_root *dedup_root;
+   struct btrfs_dedup_item *item;
+   u64 hash_value;
+   u64 length;
+   u64 dedup_size;
+   int compression;
+   int found = 0;
+   int index;
+   int ret;
+
+   if (!hash) {
+   WARN_ON(1);
+   return 0;
+   }
+   if (!root->fs_info->dedup_root) {
+   WARN(1, KERN_INFO "dedup not enabled\n");
+   return 0;
+   }
+   dedup_root = root->fs_info->dedup_root;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return 0;
+
+   /*
+* For SHA256 dedup algorithm, we store the last 64bit as the
+* key.objectid, and the rest in the tree item.
+*/
+   index = btrfs_dedup_lens[hash->type] - 1;
+   dedup_size = btrfs_dedup_sizes[hash->type] - sizeof(u64);
+
+   hash_value = hash->hash[index];
+
+   key.objectid = hash_value;
+   key.offset = (u64)-1;
+   btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY);
+
+   ret = btrfs_search_slot(NULL, dedup_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+   if (ret == 0) {
+   WARN_ON(1);
+   goto out;
+   }
+
+prev_slot:
+   /* this will do match checks. */
+   ret = btrfs_previous_item(dedup_root, path, hash_value,
+ BTRFS_DEDUP_ITEM_KEY);
+   if (ret)
+   goto out;
+
+   leaf = path->nodes[0];
+   btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+   if (key.objectid != hash_value)
+   goto out;
+
+   item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dedup_item);
+   /* disk length of dedup range */
+   length = btrfs_dedup_len(leaf, item);
+
+   compression = btrfs_dedup_compression(leaf, item);
+   if (compression > BTRFS_COMPRESS_TYPES) {
+   WARN_ON(1);
+   goto out;
+   }
+
+   if (btrfs_dedup_type(leaf, item) != hash->type)
+   goto prev_slot;
+
+   if (memcmp_extent_buffer(leaf, hash->hash, (unsigned long)(item + 1),
+dedup_size)) {
+   pr_info("goto prev\n");
+   goto prev_slot;
+   }
+
+   hash->bytenr = key.offset;
+   hash->num_bytes = length;
+   hash->compression = compression;
+   found = 1;
+out:
+   btrfs_free_path(path);
+   return found;
+}
+
+int noinline_for_stack
+btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_dedup_hash *hash)
+{
+   struct btrfs_key key;
+   struct btrfs_path *path;
+   struct extent_buffer *leaf;
+   struct btrfs_root *dedup_root;
+   struct btrfs_dedup_item *dedup_item;
+   u64 ins_size;
+   u64 dedup_size;
+   int index;
+   int ret;
+
+   if (!hash) {
+   WARN_ON(1);
+   return 0;
+   }
+
+   WARN_ON(hash->num_bytes > root->fs_info->dedup_bs);
+
+   if (!root->fs_info->dedup_root) {
+   WARN(1, KERN_INFO "dedup not enabled\n");
+   return 0;
+   }
+   dedup_root = root->fs_info->dedup_root;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   /*
+* For SHA256 dedup algorithm, we store the last 64bit 

[RFC PATCH v9 00/16] Online(inband) data deduplication

2014-04-09 Thread Liu Bo
Hello,

This is the 9th attempt at in-band data dedupe.

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to "Content based storage" in the project ideas[2];
it introduces in-band data deduplication for btrfs, with dedup/dedupe used for short.

* PATCH 1 is a speed-up improvement, which is about dedup and quota.

* PATCH 2-5 are the preparation work for the dedup implementation.

* PATCH 6 shows how we implement dedup feature.

* PATCH 7 fixes a backref walking bug with dedup.

* PATCH 8 fixes a free space bug of dedup extents on error handling.

* PATCH 9 adds the ioctl to control dedup feature.

* PATCH 10 targets delayed refs' scalability problem of deleting refs, which is 
  uncovered by the dedup feature.

* PATCH 11-16 fix bugs of dedupe including a race bug, a deadlock, an abnormal
  transaction abort and a crash.

* btrfs-progs patch(PATCH 17) which offers all details about how to control the
  dedup feature on progs side.

I've tested this with xfstests by adding an inline dedup 'enable' and 'on' to
xfstests' mount and scratch_mount.


***NOTE***
Known bugs:
* Mounting with options flushoncommit and enabling dedupe feature will end up
  with _deadlock_.


TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v9:
- fix a deadlock and a crash reported by users.
- fix the metadata ENOSPC problem with dedup again.

v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the latest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use ioctl method instead.
- fix two crashes due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (16):
  Btrfs: disable qgroups accounting when quata_enable is 0
  Btrfs: introduce dedup tree and relatives
  Btrfs: introduce dedup tree operations
  Btrfs: introduce dedup state
  Btrfs: make ordered extent aware of dedup
  Btrfs: online(inband) data dedup
  Btrfs: skip dedup reference during backref walking
  Btrfs: don't return space for dedup extent
  Btrfs: add ioctl of dedup control
  Btrfs: improve the delayed refs process in rm case
  Btrfs: fix a crash of dedup ref
  Btrfs: fix deadlock of dedup work
  Btrfs: fix transactin abortion in __btrfs_free_extent
  Btrfs: fix wrong pinned bytes in __btrfs_free_extent
  Btrfs: use total_bytes instead of bytes_used for global_rsv
  Btrfs: fix dedup enospc problem

 fs/btrfs/backref.c   |   9 +
 fs/btrfs/ctree.c |   2 +-
 fs/btrfs/ctree.h |  86 ++
 fs/btrfs/delayed-ref.c   |  26 +-
 fs/btrfs/delayed-ref.h   |   3 +
 fs/btrfs/disk-io.c   |  37 +++
 fs/btrfs/extent-tree.c   | 235 +---
 fs/btrfs/extent_io.c |  22 +-
 fs/btrfs/extent_io.h |  16 ++
 fs/btrfs/file-item.c | 244 +
 fs/btrfs/inode.c | 635 ++-
 fs/btrfs/ioctl.c | 167 
 fs/btrfs/ordered-data.c  |  44 ++-
 fs/btrfs/ordered-data.h  |  13 +-
 fs/btrfs/qgroup.c|   3 +
 fs/btrfs/relocation.c|   3 +
 fs/btrfs/transaction.c   |  41 +++
 fs/btrfs/transaction.h   |   1 +
 include/trace/events/btrfs.h |   3 +-
 include/uapi/linux/btrfs.h   |  11 +
 20 files changed, 1470 insertions(+), 131 deletions(-)

-- 
1.8.2.1
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v9 01/16] Btrfs: disable qgroups accounting when quata_enable is 0

2014-04-09 Thread Liu Bo
It's unnecessary to do qgroups accounting without enabling quota.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.c   |  2 +-
 fs/btrfs/delayed-ref.c | 18 ++
 fs/btrfs/qgroup.c  |  3 +++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 88d1b1e..54f3c67 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -406,7 +406,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
 
tree_mod_log_write_lock(fs_info);
spin_lock(&fs_info->tree_mod_seq_lock);
-   if (!elem->seq) {
+   if (elem && !elem->seq) {
elem->seq = btrfs_inc_tree_mod_seq_major(fs_info);
list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
}
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3129964..3ab37b6 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -656,8 +656,13 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
ref->is_head = 0;
ref->in_tree = 1;
 
-   if (need_ref_seq(for_cow, ref_root))
-   seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+   if (need_ref_seq(for_cow, ref_root)) {
+   struct seq_list *elem = NULL;
+
+   if (fs_info->quota_enabled)
+   elem = &trans->delayed_ref_elem;
+   seq = btrfs_get_tree_mod_seq(fs_info, elem);
+   }
ref->seq = seq;
 
full_ref = btrfs_delayed_node_to_tree_ref(ref);
@@ -718,8 +723,13 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
ref->is_head = 0;
ref->in_tree = 1;
 
-   if (need_ref_seq(for_cow, ref_root))
-   seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+   if (need_ref_seq(for_cow, ref_root)) {
+   struct seq_list *elem = NULL;
+
+   if (fs_info->quota_enabled)
+   elem = &trans->delayed_ref_elem;
+   seq = btrfs_get_tree_mod_seq(fs_info, elem);
+   }
ref->seq = seq;
 
full_ref = btrfs_delayed_node_to_data_ref(ref);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 2cf9058..c634b3e 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1186,6 +1186,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 {
struct qgroup_update *u;
 
+   if (!trans->root->fs_info->quota_enabled)
+   return 0;
+
BUG_ON(!trans->delayed_ref_elem.seq);
u = kmalloc(sizeof(*u), GFP_NOFS);
if (!u)
-- 
1.8.2.1



[PATCH v9 11/16] Btrfs: fix a crash of dedup ref

2014-04-09 Thread Liu Bo
The dedup reference is a special kind of delayed refs, and the delayed refs
are batched to be processed later.

If we find a matched dedup extent, then we queue an ADD delayed ref on it within
endio work, but there is already a DROP delayed ref queued,

        t1                      t2                      t3
->writepage                commit transaction
  ->run_delalloc_dedup
      find_dedup
------------------------------------------------------------------
                                process_delayed refs
                                (it deletes the dedup extent)
      add ordered extent      |
      submit pages            |
      finish ordered io       |
        insert file extents   |
        queue delayed refs    |
        queue dedup ref       |
                                process delayed refs continues
                                (insert a ref on an extent
                                 deleted by the above)

This scenario ends up with a crash because we're going to insert a ref on
a deleted extent.

To avoid the race, we need to check whether there is an ADD delayed ref when
deleting the extent, and protect this check with a lock.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.h   |  3 ++-
 fs/btrfs/extent-tree.c | 35 +++
 fs/btrfs/file-item.c   | 36 +++-
 fs/btrfs/inode.c   | 10 ++
 4 files changed, 58 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 65adf07..48f587e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3764,7 +3764,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 struct list_head *list, int search_commit);
 
 int noinline_for_stack
-btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash);
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash,
+   struct inode *inode, u64 file_pos);
 int noinline_for_stack
 btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 191f0a7..a8da7aa 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5937,9 +5937,23 @@ again:
goto again;
}
} else {
-   if (!dedup_hash && is_data &&
-   root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
-   dedup_hash = extent_data_ref_offset(root, path, iref);
+   if (is_data && root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+   if (!dedup_hash)
+   dedup_hash = extent_data_ref_offset(root,
+   path, iref);
+
+   ret = btrfs_free_dedup_extent(trans, root,
+ dedup_hash, bytenr);
+   if (ret) {
+   if (ret == -EAGAIN)
+   ret = 0;
+   else
+   btrfs_abort_transaction(trans,
+   extent_root,
+   ret);
+   goto out;
+   }
+   }
 
if (found_extent) {
BUG_ON(is_data && refs_to_drop !=
@@ -5964,21 +5978,10 @@ again:
if (is_data) {
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
-   btrfs_abort_transaction(trans, extent_root, ret);
+   btrfs_abort_transaction(trans,
+   extent_root, ret);
goto out;
}
-
-   if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
-   ret = btrfs_free_dedup_extent(trans, root,
- dedup_hash,
- bytenr);
-   if (ret) {
-   btrfs_abort_transaction(trans,
-   extent_root,
-   ret);
-   goto out;
- 

[PATCH v9 13/16] Btrfs: fix transactin abortion in __btrfs_free_extent

2014-04-09 Thread Liu Bo
We need to reset @refs_to_drop to 1 when we're going to delete the last
special dedup reference, otherwise we can trigger (@refs < @refs_to_drop)
and end up with a transaction abort.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c1c342..1cb3ec5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5931,6 +5931,7 @@ again:
btrfs_release_path(path);
root_objectid = BTRFS_DEDUP_TREE_OBJECTID;
parent = 0;
+   refs_to_drop = 1;
 
goto again;
}
-- 
1.8.2.1



[PATCH v9 12/16] Btrfs: fix deadlock of dedup work

2014-04-09 Thread Liu Bo
Checking for dedup references needs to allocate memory so it cannot
be run within spin_lock, otherwise it will end up with heavy deadlock.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a8da7aa..4c1c342 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5720,7 +5720,6 @@ again:
dedup_hash = 0;
 
path->reada = 1;
-   path->leave_spinning = 1;
 
is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
BUG_ON(!is_data && refs_to_drop != 1);
@@ -5774,7 +5773,6 @@ again:
goto out;
}
btrfs_release_path(path);
-   path->leave_spinning = 1;
 
key.objectid = bytenr;
key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -5942,6 +5940,7 @@ again:
dedup_hash = extent_data_ref_offset(root,
path, iref);
 
+   WARN_ON_ONCE(path->leave_spinning);
ret = btrfs_free_dedup_extent(trans, root,
  dedup_hash, bytenr);
if (ret) {
-- 
1.8.2.1



[PATCH v9 09/16] Btrfs: add ioctl of dedup control

2014-04-09 Thread Liu Bo
So far we have 4 commands to control dedup behaviour,
- btrfs dedup enable
Create the dedup tree, and it's the very first step when you're going to use
the dedup feature.

- btrfs dedup disable
Delete the dedup tree, after this we're not able to use dedup any more unless
you enable it again.

- btrfs dedup on [-b]
Switch on the dedup feature temporarily; this is the second step of applying
dedup to writes.  Option '-b' sets the dedup blocksize.
The default blocksize is 8192 (no special reason, you may argue), and the current
limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the
upper limit of btrfs's compression.

- btrfs dedup off
Switch off the dedup feature temporarily, but the dedup tree remains.
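
As a rough illustration of the blocksize limits described for 'btrfs dedup
on', the sketch below rounds a requested blocksize up to the sector size and
caps it at 128K, which mirrors the ALIGN()/min_t() pair in
btrfs_set_dedup_bs(). The function name normalize_dedup_bs and the constant
DEDUP_BS_MAX are hypothetical, and treating 0 as "dedup switched off" is an
assumption based on the dedup_bs != 0 check in btrfs_disable_dedup().

```c
#include <stdint.h>

#define DEDUP_BS_MAX (128 * 1024ULL) /* upper limit from the description */

/*
 * Round bs up to a multiple of sectorsize (must be a power of two),
 * then cap it at DEDUP_BS_MAX.
 * Passing 0 through unchanged, meaning "off", is an assumption.
 */
static uint64_t normalize_dedup_bs(uint64_t bs, uint64_t sectorsize)
{
	if (bs == 0)
		return 0;
	bs = (bs + sectorsize - 1) & ~(sectorsize - 1);
	return bs < DEDUP_BS_MAX ? bs : DEDUP_BS_MAX;
}
```

With a 4K sector size the lower limit of 4096 falls out of the rounding, so
only the upper bound needs an explicit check.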

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.h   |   3 +
 fs/btrfs/disk-io.c |   1 +
 fs/btrfs/ioctl.c   | 167 +
 include/uapi/linux/btrfs.h |  11 +++
 4 files changed, 182 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7b14e2e..65adf07 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1740,6 +1740,9 @@ struct btrfs_fs_info {
u64 dedup_bs;
 
int dedup_type;
+
+   /* protect user change for dedup operations */
+   struct mutex dedup_ioctl_mutex;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index adc7e90e..cb9b844 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2365,6 +2365,7 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
mutex_init(&fs_info->dev_replace.lock_management_lock);
mutex_init(&fs_info->dev_replace.lock);
+   mutex_init(&fs_info->dedup_ioctl_mutex);
 
spin_lock_init(&fs_info->qgroup_lock);
mutex_init(&fs_info->qgroup_ioctl_lock);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6778fa3..d50c953 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4842,6 +4842,171 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg)
return btrfs_commit_transaction(trans, root);
 }
 
+static long btrfs_enable_dedup(struct btrfs_root *root)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_trans_handle *trans = NULL;
+   struct btrfs_root *dedup_root;
+   int ret = 0;
+
+   mutex_lock(&fs_info->dedup_ioctl_mutex);
+   if (fs_info->dedup_root) {
+   pr_info("btrfs: dedup has already been enabled\n");
+   mutex_unlock(&fs_info->dedup_ioctl_mutex);
+   return 0;
+   }
+
+   trans = btrfs_start_transaction(root, 2);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   mutex_unlock(&fs_info->dedup_ioctl_mutex);
+   return ret;
+   }
+
+   dedup_root = btrfs_create_tree(trans, fs_info,
+  BTRFS_DEDUP_TREE_OBJECTID);
+   if (IS_ERR(dedup_root))
+   ret = PTR_ERR(dedup_root);
+
+   if (ret)
+   btrfs_end_transaction(trans, root);
+   else
+   ret = btrfs_commit_transaction(trans, root);
+
+   if (!ret) {
+   pr_info("btrfs: dedup enabled\n");
+   fs_info->dedup_root = dedup_root;
+   fs_info->dedup_root->block_rsv = &fs_info->global_block_rsv;
+   btrfs_set_fs_incompat(fs_info, DEDUP);
+   }
+
+   mutex_unlock(&fs_info->dedup_ioctl_mutex);
+   return ret;
+}
+
+static long btrfs_disable_dedup(struct btrfs_root *root)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_root *dedup_root;
+   int ret;
+
+   mutex_lock(&fs_info->dedup_ioctl_mutex);
+   if (!fs_info->dedup_root) {
+   pr_info("btrfs: dedup has been disabled\n");
+   mutex_unlock(&fs_info->dedup_ioctl_mutex);
+   return 0;
+   }
+
+   if (fs_info->dedup_bs != 0) {
+   pr_info("btrfs: cannot disable dedup until switching off dedup!\n");
+   mutex_unlock(&fs_info->dedup_ioctl_mutex);
+   return -EBUSY;
+   }
+
+   dedup_root = fs_info->dedup_root;
+
+   ret = btrfs_drop_snapshot(dedup_root, NULL, 1, 0);
+
+   if (!ret) {
+   fs_info->dedup_root = NULL;
+   pr_info("btrfs: dedup disabled\n");
+   }
+
+   mutex_unlock(&fs_info->dedup_ioctl_mutex);
+   WARN_ON(ret < 0 && ret != -EAGAIN && ret != -EROFS);
+   return ret;
+}
+
+static long btrfs_set_dedup_bs(struct btrfs_root *root, u64 bs)
+{
+   struct btrfs_fs_info *info = root->fs_info;
+   int ret = 0;
+
+   mutex_lock(&info->dedup_ioctl_mutex);
+   if (!info->dedup_root) {
+   pr_info("btrfs: dedup is disabled, we cannot switch on/off dedup\n");
+   ret = -EINVAL;
+   goto out;
+   }
+
+   bs = ALIGN(bs, root->sectorsize);
+   bs = min_t(u64, bs, (128 * 1024ULL));
+
+   if (bs == 

[PATCH v9 10/16] Btrfs: improve the delayed refs process in rm case

2014-04-09 Thread Liu Bo
While removing a file with dedup extents, we can have a great number of
delayed refs pending, and each of these refs drops a ref of the extent,
i.e. it is of type BTRFS_DROP_DELAYED_REF.

But in order to prevent an extent's ref count from going down to zero while
there are still pending delayed refs, we first select the refs that add a
ref, i.e. those of type BTRFS_ADD_DELAYED_REF.

So in the removal case, all of our delayed refs are of type
BTRFS_DROP_DELAYED_REF, but we still have to walk all the refs issued to the
extent looking for BTRFS_ADD_DELAYED_REF, find that there are none, and then
start over again to find a BTRFS_DROP_DELAYED_REF.

This is really unnecessary; we can improve this by tracking how many
BTRFS_ADD_DELAYED_REF refs we have and searching for the right type.
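
A simplified sketch of the idea, assuming the refs of one head sit in a plain
array rather than the rb-tree the kernel uses: a counter of pending ADD refs
decides up front which type to look for, so a head holding only DROP refs is
satisfied by the first entry instead of a full walk. All names here
(select_ref, struct ref_head, ADD_REF, DROP_REF) are illustrative and not the
kernel's.

```c
#include <stddef.h>

enum ref_action { ADD_REF = 1, DROP_REF = 2 };

struct ref {
	enum ref_action action;
};

struct ref_head {
	struct ref *refs; /* pending refs for one extent */
	size_t nr;
	int add_cnt;      /* how many of them are ADD_REF */
};

/*
 * Pick the next ref to run: an ADD_REF while any are pending,
 * otherwise a DROP_REF. Falls back to the first ref of the other
 * type if the counter and the list disagree.
 */
static struct ref *select_ref(struct ref_head *head)
{
	enum ref_action want = head->add_cnt ? ADD_REF : DROP_REF;
	struct ref *fallback = NULL;

	for (size_t i = 0; i < head->nr; i++) {
		struct ref *r = &head->refs[i];

		if (r->action == want) {
			if (want == ADD_REF)
				head->add_cnt--;
			return r;
		}
		if (!fallback)
			fallback = r;
	}
	return fallback;
}
```

The sketch does not remove the selected ref from the array; a caller would do
that, and would bump add_cnt again if it has to put an ADD ref back, as the
__btrfs_run_delayed_refs() hunk does.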

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/delayed-ref.c |  8 
 fs/btrfs/delayed-ref.h |  3 +++
 fs/btrfs/extent-tree.c | 16 +---
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3ab37b6..6435d78 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -538,6 +538,9 @@ update_existing_head_ref(struct btrfs_delayed_ref_node *existing,
 * currently, for refs we just added we know we're a-ok.
 */
existing->ref_mod += update->ref_mod;
+   WARN_ON(update->ref_mod > 1);
+   if (update->ref_mod == 1)
+   existing_ref->add_cnt++;
spin_unlock(&existing_ref->lock);
 }
 
@@ -601,6 +604,11 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
head_ref->is_data = is_data;
head_ref->ref_root = RB_ROOT;
head_ref->processing = 0;
+   /* track added ref, more comments in select_delayed_ref() */
+   if (count_mod == 1)
+   head_ref->add_cnt = 1;
+   else
+   head_ref->add_cnt = 0;
 
spin_lock_init(&head_ref->lock);
mutex_init(&head_ref->mutex);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 4ba9b93..905f991 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -87,6 +87,9 @@ struct btrfs_delayed_ref_head {
struct rb_node href_node;
 
struct btrfs_delayed_extent_op *extent_op;
+
+   int add_cnt;
+
/*
 * when a new extent is allocated, it is just reserved in memory
 * The actual extent isn't inserted into the extent allocation tree
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 088846c..191f0a7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2347,7 +2347,11 @@ static noinline struct btrfs_delayed_ref_node *
 select_delayed_ref(struct btrfs_delayed_ref_head *head)
 {
struct rb_node *node;
-   struct btrfs_delayed_ref_node *ref, *last = NULL;;
+   struct btrfs_delayed_ref_node *ref, *last = NULL;
+   int action = BTRFS_ADD_DELAYED_REF;
+
+   if (head->add_cnt == 0)
+   action = BTRFS_DROP_DELAYED_REF;
 
/*
 * select delayed ref of type BTRFS_ADD_DELAYED_REF first.
@@ -2358,10 +2362,13 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head)
while (node) {
ref = rb_entry(node, struct btrfs_delayed_ref_node,
rb_node);
-   if (ref->action == BTRFS_ADD_DELAYED_REF)
+   if (ref->action == action) {
+   if (ref->action == BTRFS_ADD_DELAYED_REF)
+   head->add_cnt--;
return ref;
-   else if (last == NULL)
+   } else if (last == NULL) {
last = ref;
+   }
node = rb_next(node);
}
return last;
@@ -2435,6 +2442,9 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 
if (ref && ref->seq &&
btrfs_check_delayed_seq(fs_info, delayed_refs, ref->seq)) {
+   if (ref->action == BTRFS_ADD_DELAYED_REF)
+   locked_ref->add_cnt++;
+
spin_unlock(&locked_ref->lock);
btrfs_delayed_ref_unlock(locked_ref);
spin_lock(&delayed_refs->lock);
-- 
1.8.2.1



[PATCH v9 05/16] Btrfs: make ordered extent aware of dedup

2014-04-09 Thread Liu Bo
This adds a dedup flag and dedup hash into ordered extent so that
we can insert dedup extents to dedup tree at endio time.

The benefit is simplicity: we don't need to fall back and clean up dedup
structures if the write is cancelled for some reason.
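
The ownership rule can be sketched in user space as follows, under the
assumption (taken from the hash-copy check in __btrfs_add_ordered_extent())
that an ordered extent keeps a private copy of the hash only when the write is
not itself a dedup hit, and releases it on the final put. The types and
function names below are simplified stand-ins, not the kernel structures.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct btrfs_dedup_hash. */
struct dedup_hash {
	uint64_t bytenr;
	uint64_t num_bytes;
	uint64_t hash[4];
};

/* Simplified stand-in for the dedup fields of an ordered extent. */
struct ordered_extent {
	int dedup;               /* non-zero: reuses an existing extent */
	struct dedup_hash *hash; /* private copy, inserted at endio time */
};

static struct ordered_extent *ordered_extent_new(int dedup,
						 const struct dedup_hash *hash)
{
	struct ordered_extent *entry = calloc(1, sizeof(*entry));

	if (!entry)
		return NULL;
	entry->dedup = dedup;
	if (!dedup && hash) {
		entry->hash = malloc(sizeof(*hash));
		if (!entry->hash) {
			free(entry);
			return NULL;
		}
		memcpy(entry->hash, hash, sizeof(*hash));
	}
	return entry;
}

/* Dropping the entry also drops its hash copy, so a cancelled
 * write leaves nothing dedup-related behind to clean up. */
static void ordered_extent_put(struct ordered_extent *entry)
{
	if (!entry)
		return;
	free(entry->hash);
	free(entry);
}
```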

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ordered-data.c | 38 --
 fs/btrfs/ordered-data.h | 13 -
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a94b05f..c520e13 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -183,7 +183,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ int dedup, struct btrfs_dedup_hash *hash)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_inode_tree *tree;
@@ -199,10 +200,23 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
entry->start = start;
entry->len = len;
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) &&
-   !(type == BTRFS_ORDERED_NOCOW))
+   !(type == BTRFS_ORDERED_NOCOW) && !dedup)
entry->csum_bytes_left = disk_len;
entry->disk_len = disk_len;
entry->bytes_left = len;
+   entry->dedup = dedup;
+   entry->hash = NULL;
+
+   if (!dedup && hash) {
+   entry->hash = kzalloc(btrfs_dedup_hash_size(hash->type),
+ GFP_NOFS);
+   if (!entry->hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   memcpy(entry->hash, hash, btrfs_dedup_hash_size(hash->type));
+   }
+
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
@@ -251,7 +265,17 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, 0, NULL);
+}
+
+int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  int dedup, struct btrfs_dedup_hash *hash,
+  int compress_type)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ compress_type, dedup, hash);
 }
 
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
@@ -259,16 +283,17 @@ int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, 0, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int compress_type)
+ int type, int compress_type,
+ struct btrfs_dedup_hash *hash)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, 0, hash);
 }
 
 /*
@@ -530,6 +555,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
list_del(&sum->list);
kfree(sum);
}
+   kfree(entry->hash);
kmem_cache_free(btrfs_ordered_extent_cache, entry);
}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 2468970..efbb11f 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -109,6 +109,9 @@ struct btrfs_ordered_extent {
/* compression algorithm */
int compress_type;
 
+   /* whether this ordered extent is marked for dedup or not */
+   int dedup;
+
/* reference count */
atomic_t refs;
 
@@ -135,6 +138,9 @@ struct btrfs_ordered_extent {
struct completion completion;
struct btrfs_work flush_work;

[PATCH v9 08/16] Btrfs: don't return space for dedup extent

2014-04-09 Thread Liu Bo
If the ordered extent had an IOERR or something else went wrong we need to
return the space for this ordered extent back to the allocator, but if the
extent is marked as a dedup one, we don't free the space because we just
use the existing space instead of allocating new space.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c69e530..d32b066 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3213,6 +3213,7 @@ out:
 * truncated case if we didn't write out the extent at all.
 */
if ((ret || !logical_len) &&
+   !ordered_extent->dedup &&
!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
!test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
btrfs_free_reserved_extent(root, ordered_extent->start,
-- 
1.8.2.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4] Btrfs-progs: add dedup subcommand

2014-04-09 Thread Liu Bo
This adds deduplication subcommands, 'btrfs dedup <command> <path>',
including enable/disable/on/off.

- btrfs dedup enable
Create the dedup tree, and it's the very first step when you're going to use
the dedup feature.

- btrfs dedup disable
Delete the dedup tree, after this we're not able to use dedup any more unless
you enable it again.

- btrfs dedup on [-b]
Switch on the dedup feature temporarily, and it's the second step of applying
dedup with writes.  Option '-b' is used to set dedup blocksize.
The default blocksize is 8192(no special reason, you may argue), and the current
limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the
upper limit of btrfs's compression.

- btrfs dedup off
Switch off the dedup feature temporarily, but the dedup tree remains.

-
Usage:
Step 1: btrfs dedup enable /btrfs
Step 2: btrfs dedup on /btrfs or btrfs dedup on -b 4K /btrfs
Step 3: now we have dedup, run your test.
Step 4: btrfs dedup off /btrfs
Step 5: btrfs dedup disable /btrfs
-
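
The blocksize limits described above can be sketched as a small range check.
This is only an illustration: the names DEDUP_BS_MIN, DEDUP_BS_MAX,
DEDUP_BS_DEFAULT and dedup_blocksize_ok() are hypothetical and not taken from
the patch, and the real validation in the kernel ioctl path may apply further
rules.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the values come from the commit message above. */
#define DEDUP_BS_MIN     4096ULL          /* generic page size */
#define DEDUP_BS_MAX     (128ULL * 1024)  /* upper limit of btrfs compression */
#define DEDUP_BS_DEFAULT 8192ULL          /* default when -b is not given */

/* Return true if bs lies in the inclusive range [4096, 128 * 1024]. */
static bool dedup_blocksize_ok(uint64_t bs)
{
	return bs >= DEDUP_BS_MIN && bs <= DEDUP_BS_MAX;
}
```

With this check, 'btrfs dedup on -b 4K' would pass, while a 2K blocksize
would be rejected before the ioctl is issued.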

Signed-off-by: Liu Bo bo.li@oracle.com
---
v4: rebase and reserve spare space in btrfs_ioctl_dedup_args struct. 
v3: add commands 'btrfs dedup on/off'
v2: add manpage


 Makefile   |   3 +-
 btrfs.c|   1 +
 cmds-dedup.c   | 178 +
 commands.h |   2 +
 ctree.h|   2 +
 ioctl.h|  13 +
 man/btrfs.8.in |  31 +-
 7 files changed, 226 insertions(+), 4 deletions(-)
 create mode 100644 cmds-dedup.c

diff --git a/Makefile b/Makefile
index 0874a41..092f2db 100644
--- a/Makefile
+++ b/Makefile
@@ -13,7 +13,8 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
-  cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o
+  cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
+  cmds-dedup.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
   uuid-tree.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
diff --git a/btrfs.c b/btrfs.c
index d5fc738..dfae35f 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -255,6 +255,7 @@ static const struct cmd_group btrfs_cmd_group = {
{ "quota", cmd_quota, NULL, &quota_cmd_group, 0 },
{ "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 },
{ "replace", cmd_replace, NULL, &replace_cmd_group, 0 },
+   { "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 },
{ "help", cmd_help, cmd_help_usage, NULL, 0 },
{ "version", cmd_version, cmd_version_usage, NULL, 0 },
NULL_CMD_STRUCT
diff --git a/cmds-dedup.c b/cmds-dedup.c
new file mode 100644
index 000..b959349
--- /dev/null
+++ b/cmds-dedup.c
@@ -0,0 +1,178 @@
+/*
+ * Copyright (C) 2013 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <getopt.h>
+
+#include "ctree.h"
+#include "ioctl.h"
+
+#include "commands.h"
+#include "utils.h"
+
+static const char * const dedup_cmd_group_usage[] = {
+   "btrfs dedup <command> [options] <path>",
+   NULL
+};
+
+int dedup_ctl(char *path, struct btrfs_ioctl_dedup_args *args)
+{
+   int ret = 0;
+   int fd;
+   int e;
+   DIR *dirstream = NULL;
+
+   fd = open_file_or_dir(path, &dirstream);
+   if (fd < 0) {
+   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   return -EACCES;
+   }
+
+   ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, args);
+   e = errno;
+   close_file_or_dir(fd, dirstream);
+   if (ret < 0) {
+   fprintf(stderr, "ERROR: dedup command failed: %s\n",
+   strerror(e));
+   if (args->cmd == BTRFS_DEDUP_CTL_DISABLE ||
+   args->cmd == BTRFS_DEDUP_CTL_SET_BS)
+   fprintf(stderr, "please refer to 'dmesg | tail' for more info\n");
+   return -EINVAL;
+   }
+   return 0;
+}
+
+static const char * const cmd_dedup_enable_usage[] = 

[PATCH v9 02/16] Btrfs: introduce dedup tree and relatives

2014-04-09 Thread Liu Bo
This is a preparation step for online/inband dedup tree.
It introduces dedup tree and its relatives, including hash driver and
some structures.
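
The in-memory hash structure in this patch ends in a variable-length hash
array, so an allocation has to cover the fixed header plus the digest. A
minimal userspace sketch of that sizing follows; struct dedup_hash_sketch,
SKETCH_SHA256_BYTES and both helpers are hypothetical stand-ins, and calloc()
replaces the kernel allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified userspace mirror of the in-memory hash structure. */
struct dedup_hash_sketch {
	uint64_t bytenr;
	uint64_t num_bytes;
	int type;          /* hash algorithm */
	int compression;
	uint64_t hash[];   /* flexible array member holding the digest */
};

#define SKETCH_SHA256_BYTES 32  /* 256 bits, 32 bytes */

/* Total bytes needed for the header plus a digest of digest_bytes. */
static size_t dedup_hash_alloc_size(size_t digest_bytes)
{
	return sizeof(struct dedup_hash_sketch) + digest_bytes;
}

/* Allocate a zeroed hash object; the caller releases it with free(). */
static struct dedup_hash_sketch *dedup_hash_alloc(size_t digest_bytes)
{
	return calloc(1, dedup_hash_alloc_size(digest_bytes));
}
```

Because sizeof() does not count the flexible array member, forgetting to add
the digest length would under-allocate by exactly the size of the hash.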

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.h | 73 
 fs/btrfs/disk-io.c   | 36 ++
 fs/btrfs/extent-tree.c   |  2 ++
 include/trace/events/btrfs.h |  3 +-
 4 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2a9d32e..54c29d2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -33,6 +33,7 @@
 #include <asm/kmap_types.h>
 #include <linux/pagemap.h>
 #include <linux/btrfs.h>
+#include <crypto/hash.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -101,6 +102,9 @@ struct btrfs_ordered_sum;
 /* for storing items that use the BTRFS_UUID_KEY* types */
 #define BTRFS_UUID_TREE_OBJECTID 9ULL
 
+/* dedup tree(experimental) */
+#define BTRFS_DEDUP_TREE_OBJECTID 10ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -523,6 +527,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_RAID56  (1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_DEDUP   (1ULL << 10)
 
 #define BTRFS_FEATURE_COMPAT_SUPP  0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET  0ULL
@@ -540,6 +545,7 @@ struct btrfs_super_block {
 BTRFS_FEATURE_INCOMPAT_RAID56 |\
 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |   \
+BTRFS_FEATURE_INCOMPAT_DEDUP | \
 BTRFS_FEATURE_INCOMPAT_NO_HOLES)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET\
@@ -915,6 +921,51 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/* dedup */
+enum btrfs_dedup_type {
+   BTRFS_DEDUP_SHA256 = 0,
+   BTRFS_DEDUP_LAST = 1,
+};
+
+static int btrfs_dedup_lens[] = { 4, 0 };
+static int btrfs_dedup_sizes[] = { 32, 0 };/* 256bit, 32bytes */
+
+struct btrfs_dedup_item {
+   /* disk length of dedup range */
+   __le64 len;
+
+   u8 type;
+   u8 compression;
+   u8 encryption;
+
+   /* spare for later use */
+   __le16 other_encoding;
+
+   /* hash/fingerprints go here */
+} __attribute__ ((__packed__));
+
+struct btrfs_dedup_hash {
+   u64 bytenr;
+   u64 num_bytes;
+
+   /* hash algorithm */
+   int type;
+
+   int compression;
+
+   /* last field is a variable length array of dedup hash */
+   u64 hash[];
+};
+
+static inline int btrfs_dedup_hash_size(int type)
+{
+   WARN_ON((btrfs_dedup_lens[type] * sizeof(u64)) !=
+btrfs_dedup_sizes[type]);
+
+   return sizeof(struct btrfs_dedup_hash) + btrfs_dedup_sizes[type];
+}
+
+
 struct btrfs_dev_stats_item {
/*
 * grow this item struct at the end for future enhancements and keep
@@ -1320,6 +1371,7 @@ struct btrfs_fs_info {
struct btrfs_root *dev_root;
struct btrfs_root *fs_root;
struct btrfs_root *csum_root;
+   struct btrfs_root *dedup_root;
struct btrfs_root *quota_root;
struct btrfs_root *uuid_root;
 
@@ -1680,6 +1732,14 @@ struct btrfs_fs_info {
 
struct semaphore uuid_tree_rescan_sem;
unsigned int update_uuid_tree_gen:1;
+
+   /* reference to deduplication algorithm driver via cryptoapi */
+   struct crypto_shash *dedup_driver;
+
+   /* dedup blocksize */
+   u64 dedup_bs;
+
+   int dedup_type;
 };
 
 struct btrfs_subvolume_writers {
@@ -2013,6 +2073,8 @@ struct btrfs_ioctl_defrag_range_args {
  */
 #define BTRFS_STRING_ITEM_KEY  253
 
+#define BTRFS_DEDUP_ITEM_KEY   254
+
 /*
  * Flags for mount options.
  *
@@ -3047,6 +3109,14 @@ static inline u32 btrfs_file_extent_inline_len(struct extent_buffer *eb,
 }
 
 
+/* btrfs_dedup_item */
+BTRFS_SETGET_FUNCS(dedup_len, struct btrfs_dedup_item, len, 64);
+BTRFS_SETGET_FUNCS(dedup_compression, struct btrfs_dedup_item, compression, 8);
+BTRFS_SETGET_FUNCS(dedup_encryption, struct btrfs_dedup_item, encryption, 8);
+BTRFS_SETGET_FUNCS(dedup_other_encoding, struct btrfs_dedup_item,
+  other_encoding, 16);
+BTRFS_SETGET_FUNCS(dedup_type, struct btrfs_dedup_item, type, 8);
+
 /* btrfs_dev_stats_item */
 static inline u64 btrfs_dev_stats_value(struct extent_buffer *eb,
struct btrfs_dev_stats_item *ptr,
@@ -3521,6 +3591,8 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_root *root)
 
 static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 {
+   if (fs_info->dedup_driver)
+   crypto_free_shash(fs_info->dedup_driver);
kfree(fs_info->balance_ctl);
kfree(fs_info->delayed_root);
kfree(fs_info->extent_root);
@@ -3687,6 +3759,7 @@ 

[PATCH v9 16/16] Btrfs: fix dedup enospc problem

2014-04-09 Thread Liu Bo
In the case of dedupe, btrfs will produce a large number of delayed refs, and
processing them can very likely eat all of the space reserved in
global_block_rsv, so we end up aborting the transaction due to ENOSPC.

I tried several different ways to reserve more space for global_block_rsv,
hoping it would be enough for flushing delayed refs, but I failed and the code
could become very messy.

I found that under high delayed refs pressure, the throttle work in
end_transaction was of little use since it didn't block the insertion of new
delayed refs, so I moved the throttling to the very start,
i.e. start_transaction.

We take the worst case into account in the throttle code, that is,
every delayed ref would update the btree, so when we reach the limit where
it may use up all the reserved space of global_block_rsv, we kick
transaction_kthread to commit the transaction, which processes these delayed
refs, refreshes global_block_rsv's space, and gets pinned space back as well.
That way we get rid of the annoying ENOSPC problem.

However, this leads to a new problem: it cannot be used along with the
flushoncommit option, otherwise it can cause an ABBA deadlock between
commit_transaction and the ordered extents flush.
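
The worst-case estimate described above can be written out as plain
arithmetic. This is a sketch, not the kernel code: worst_case_bytes() and
should_throttle() are hypothetical names, per_ref_bytes stands in for the
per-item metadata reservation, and the caller is assumed to guarantee
num_entries >= num_heads.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Worst case from the commit message: every delayed ref that is not a
 * head updates the btree, each costing per_ref_bytes of metadata.
 * Assumes num_entries >= num_heads.
 */
static uint64_t worst_case_bytes(uint64_t num_entries, uint64_t num_heads,
				 uint64_t per_ref_bytes)
{
	return (num_entries - num_heads) * per_ref_bytes;
}

/* True when the reserve could be used up and a commit should be kicked. */
static bool should_throttle(uint64_t reserved, uint64_t num_entries,
			    uint64_t num_heads, uint64_t per_ref_bytes)
{
	return reserved <= worst_case_bytes(num_entries, num_heads,
					    per_ref_bytes);
}
```

For example, with 10 entries, 2 heads and an illustrative cost of 100 bytes
per ref, the worst case is (10 - 2) * 100 = 800 bytes, so a reserve of 800
bytes or less would trigger the throttle.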

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c  | 50 ++---
 fs/btrfs/ordered-data.c |  6 ++
 fs/btrfs/transaction.c  | 41 
 fs/btrfs/transaction.h  |  1 +
 4 files changed, 87 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6f8b012..ec6f42d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2695,24 +2695,52 @@ static inline u64 heads_to_leaves(struct btrfs_root *root, u64 heads)
 int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
   struct btrfs_root *root)
 {
+   struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_block_rsv *global_rsv;
-   u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
+   u64 num_heads;
+   u64 num_entries;
u64 num_bytes;
int ret = 0;
 
-   num_bytes = btrfs_calc_trans_metadata_size(root, 1);
-   num_heads = heads_to_leaves(root, num_heads);
-   if (num_heads > 1)
-   num_bytes += (num_heads - 1) * root->leafsize;
-   num_bytes <<= 1;
global_rsv = &root->fs_info->global_block_rsv;
 
-   /*
-* If we can't allocate any more chunks lets make sure we have _lots_ of
-* wiggle room since running delayed refs can create more delayed refs.
-*/
-   if (global_rsv->space_info->full)
+   if (trans) {
+   num_heads = trans->transaction->delayed_refs.num_heads_ready;
+   num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+   num_heads = heads_to_leaves(root, num_heads);
+   if (num_heads > 1)
+   num_bytes += (num_heads - 1) * root->leafsize;
num_bytes <<= 1;
+   /*
+* If we can't allocate any more chunks lets make sure we have
+* _lots_ of wiggle room since running delayed refs can create
+* more delayed refs.
+*/
+   if (global_rsv->space_info->full)
+   num_bytes <<= 1;
+   } else {
+   if (root->fs_info->dedup_bs == 0)
+   return 0;
+
+   /* dedup enabled */
+   spin_lock(&root->fs_info->trans_lock);
+   if (!root->fs_info->running_transaction) {
+   spin_unlock(&root->fs_info->trans_lock);
+   return 0;
+   }
+
+   delayed_refs =
+&root->fs_info->running_transaction->delayed_refs;
+
+   num_entries = atomic_read(&delayed_refs->num_entries);
+   num_heads = delayed_refs->num_heads;
+
+   spin_unlock(&root->fs_info->trans_lock);
+
+   /* The worst case */
+   num_bytes = (num_entries - num_heads) *
+   btrfs_calc_trans_metadata_size(root, 1);
+   }
 
spin_lock(&global_rsv->lock);
if (global_rsv->reserved <= num_bytes)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index c520e13..72c0caa 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -747,6 +747,12 @@ int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
  &cur_trans->ordered_operations);
spin_unlock(&root->fs_info->ordered_root_lock);
 
+   if (cur_trans->blocked) {
+   cur_trans->blocked = 0;
+   if (waitqueue_active(&cur_trans->commit_wait))
+   wake_up(&cur_trans->commit_wait);
+   }
+
work = btrfs_alloc_delalloc_work(inode, wait, 1);
if (!work) {

[PATCH v9 06/16] Btrfs: online(inband) data dedup

2014-04-09 Thread Liu Bo
The main part of data dedup.

This introduces a FORMAT CHANGE.

Btrfs provides online(inband/synchronous) and block-level dedup.

It maps naturally to btrfs's block back-reference, which enables us
to store multiple copies of data as single copy with references
on that copy.

The workflow is
(1) write some data,
(2) get the hash of these data based on btrfs's dedup blocksize.
(3) find matched extents by hash and decide whether to mark it
as a duplicate one or not.  If no, write the data onto disk,
otherwise, add a reference to the matched extent.
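
The three workflow steps above can be illustrated with a toy in-memory table.
This is a sketch under loud assumptions: every name here is hypothetical,
FNV-1a is swapped in for the SHA-256 the patch set uses, "writing" is only
recording a table entry, and a match on the fingerprint alone is treated as a
duplicate, which ignores hash collisions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BS      4096  /* dedup blocksize used by this toy */
#define SKETCH_MAX_EXT 64    /* capacity of the toy extent table */

struct sketch_extent {
	uint64_t hash;   /* fingerprint of the block */
	uint32_t refs;   /* number of references to this single copy */
};

struct sketch_store {
	struct sketch_extent ext[SKETCH_MAX_EXT];
	int nr;          /* extents actually "written" */
};

/* FNV-1a, a stand-in for the SHA-256 used by the patch set. */
static uint64_t sketch_hash(const unsigned char *buf, size_t len)
{
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * Steps (2) and (3): hash one block, then either add a reference to a
 * matching extent or record a new one. Returns the extent index, or -1
 * if the toy table is full.
 */
static int sketch_write_block(struct sketch_store *s, const unsigned char *blk)
{
	uint64_t h = sketch_hash(blk, SKETCH_BS);

	for (int i = 0; i < s->nr; i++) {
		if (s->ext[i].hash == h) {
			s->ext[i].refs++;  /* duplicate: no new data written */
			return i;
		}
	}
	if (s->nr == SKETCH_MAX_EXT)
		return -1;
	s->ext[s->nr].hash = h;            /* unique: "write" the data */
	s->ext[s->nr].refs = 1;
	return s->nr++;
}
```

Writing the same block twice leaves one extent with two references, which is
the "multiple copies stored as a single copy with references" idea from the
description.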

Btrfs's built-in dedup supports normal writes and compressed writes.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 150 ++--
 fs/btrfs/extent_io.c   |   8 +-
 fs/btrfs/extent_io.h   |  11 +
 fs/btrfs/inode.c   | 640 +++--
 4 files changed, 712 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 06124c1..088846c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1123,8 +1123,16 @@ static noinline int lookup_extent_data_ref(struct btrfs_trans_handle *trans,
key.offset = parent;
} else {
key.type = BTRFS_EXTENT_DATA_REF_KEY;
-   key.offset = hash_extent_data_ref(root_objectid,
- owner, offset);
+
+   /*
+* we've not got the right offset and owner, so search by -1
+* here.
+*/
+   if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
+   key.offset = (u64)-1;
+   else
+   key.offset = hash_extent_data_ref(root_objectid,
+ owner, offset);
}
 again:
recow = 0;
@@ -1151,6 +1159,10 @@ again:
goto fail;
}
 
+   if (ret > 0 && root_objectid == BTRFS_DEDUP_TREE_OBJECTID &&
+   path->slots[0] > 0)
+   path->slots[0]--;
+
leaf = path->nodes[0];
nritems = btrfs_header_nritems(leaf);
while (1) {
@@ -1174,14 +1186,22 @@ again:
ref = btrfs_item_ptr(leaf, path->slots[0],
 struct btrfs_extent_data_ref);
 
-   if (match_extent_data_ref(leaf, ref, root_objectid,
- owner, offset)) {
-   if (recow) {
-   btrfs_release_path(path);
-   goto again;
+   if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+   if (btrfs_extent_data_ref_root(leaf, ref) ==
+   root_objectid) {
+   err = 0;
+   break;
+   }
+   } else {
+   if (match_extent_data_ref(leaf, ref, root_objectid,
+ owner, offset)) {
+   if (recow) {
+   btrfs_release_path(path);
+   goto again;
+   }
+   err = 0;
+   break;
}
-   err = 0;
-   break;
}
path->slots[0]++;
}
@@ -1325,6 +1345,32 @@ static noinline int remove_extent_data_ref(struct btrfs_trans_handle *trans,
return ret;
 }
 
+static noinline u64 extent_data_ref_offset(struct btrfs_root *root,
+ struct btrfs_path *path,
+ struct btrfs_extent_inline_ref *iref)
+{
+   struct btrfs_key key;
+   struct extent_buffer *leaf;
+   struct btrfs_extent_data_ref *ref1;
+   u64 offset = 0;
+
+   leaf = path->nodes[0];
+   btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+   if (iref) {
+   WARN_ON(btrfs_extent_inline_ref_type(leaf, iref) !=
+   BTRFS_EXTENT_DATA_REF_KEY);
+   ref1 = (struct btrfs_extent_data_ref *)(&iref->offset);
+   offset = btrfs_extent_data_ref_offset(leaf, ref1);
+   } else if (key.type == BTRFS_EXTENT_DATA_REF_KEY) {
+   ref1 = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_extent_data_ref);
+   offset = btrfs_extent_data_ref_offset(leaf, ref1);
+   } else {
+   WARN_ON(1);
+   }
+   return offset;
+}
+
 static noinline u32 extent_data_ref_count(struct btrfs_root *root,
  struct btrfs_path *path,
  struct btrfs_extent_inline_ref *iref)
@@ -1591,7 +1637,8 @@ again:
err = -ENOENT;
while (1) {
if (ptr >= end) {
-   

[PATCH v9 14/16] Btrfs: fix wrong pinned bytes in __btrfs_free_extent

2014-04-09 Thread Liu Bo
With the special dedup reference, in the case of (refs == 1) in
__btrfs_free_extent, we'll actually free the extent, so pinned_bytes of it
should not be added to that global counter.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1cb3ec5..b8fee86 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5915,9 +5915,6 @@ again:
goto out;
}
}
-   add_pinned_bytes(root->fs_info, -num_bytes, owner_objectid,
-root_objectid);
-
/*
 * special case for dedup
 *
@@ -5934,6 +5931,9 @@ again:
refs_to_drop = 1;
 
goto again;
+   } else {
+   add_pinned_bytes(root->fs_info, -num_bytes,
+owner_objectid, root_objectid);
}
} else {
if (is_data && root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
-- 
1.8.2.1



[PATCH v9 15/16] Btrfs: use total_bytes instead of bytes_used for global_rsv

2014-04-09 Thread Liu Bo
Because of dedupe, the data space info cannot reflect how much data has
been written. To size global_rsv more appropriately, use total_bytes
instead of bytes_used.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b8fee86..6f8b012 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4692,14 +4692,14 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
 
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
spin_lock(&sinfo->lock);
-   data_used = sinfo->bytes_used;
+   data_used = sinfo->total_bytes;
spin_unlock(&sinfo->lock);
 
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
spin_lock(&sinfo->lock);
if (sinfo->flags & BTRFS_BLOCK_GROUP_DATA)
data_used = 0;
-   meta_used = sinfo->bytes_used;
+   meta_used = sinfo->total_bytes;
spin_unlock(&sinfo->lock);
 
num_bytes = (data_used >> fs_info->sb->s_blocksize_bits) *
-- 
1.8.2.1



[PATCH v9 07/16] Btrfs: skip dedup reference during backref walking

2014-04-09 Thread Liu Bo
The dedup ref is a special one: it is only used to store the hash value
of the extent and cannot be used to find data, so we skip it during backref
walking.
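
The skip rule can be pictured as a simple filter over a list of references. A
minimal sketch, assuming hypothetical names (struct sketch_ref,
count_walkable_refs) and a flat array in place of the kernel's delayed,
inline and keyed ref lookups:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DEDUP_TREE_OBJECTID 10ULL  /* objectid used by the patch set */

struct sketch_ref {
	uint64_t root;    /* tree that owns this reference */
	uint64_t offset;
};

/* Count the references a backref walk would follow, skipping dedup refs. */
static size_t count_walkable_refs(const struct sketch_ref *refs, size_t n)
{
	size_t walkable = 0;

	for (size_t i = 0; i < n; i++) {
		if (refs[i].root == SKETCH_DEDUP_TREE_OBJECTID)
			continue;  /* hash-only ref: cannot locate data */
		walkable++;
	}
	return walkable;
}
```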

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/backref.c| 9 +
 fs/btrfs/relocation.c | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index aad7201..5e57949 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -623,6 +623,9 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
key.objectid = ref->objectid;
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = ref->offset;
+   if (ref->root == BTRFS_DEDUP_TREE_OBJECTID)
+   break;
+
ret = __add_prelim_ref(prefs, ref->root, &key, 0, 0,
   node->bytenr,
   node->ref_mod * sgn, GFP_ATOMIC);
@@ -743,6 +746,9 @@ static int __add_inline_refs(struct btrfs_fs_info *fs_info,
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = btrfs_extent_data_ref_offset(leaf, dref);
root = btrfs_extent_data_ref_root(leaf, dref);
+   if (root == BTRFS_DEDUP_TREE_OBJECTID)
+   break;
+
ret = __add_prelim_ref(prefs, root, &key, 0, 0,
   bytenr, count, GFP_NOFS);
break;
@@ -826,6 +832,9 @@ static int __add_keyed_refs(struct btrfs_fs_info *fs_info,
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = btrfs_extent_data_ref_offset(leaf, dref);
root = btrfs_extent_data_ref_root(leaf, dref);
+   if (root == BTRFS_DEDUP_TREE_OBJECTID)
+   break;
+
ret = __add_prelim_ref(prefs, root, &key, 0, 0,
   bytenr, count, GFP_NOFS);
break;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index def428a..8431294 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3508,6 +3508,9 @@ static int find_data_references(struct reloc_control *rc,
ref_offset = btrfs_extent_data_ref_offset(leaf, ref);
ref_count = btrfs_extent_data_ref_count(leaf, ref);
 
+   if (ref_root == BTRFS_DEDUP_TREE_OBJECTID)
+   return 0;
+
/*
 * This is an extent belonging to the free space cache, lets just delete
 * it and redo the search.
-- 
1.8.2.1



Re: dm-crypt + btrfs preformance - long lockups during io

2014-04-09 Thread Tobias Grosser

On Sat, Apr 5, 2014 at 10:54 AM, Anders Aagaard aagaande@x wrote:
> Hi
>
> I just recently repartitioned my harddrive, and in the process switched from
> ext4+ecryptfs to dm-crypt and btrfs. I'm on ubuntu 14.04, using kernel
> 3.14.0-031400-generic. I'm using a intel ssd, which btrfs detects (ssd mode
> enabled according to dmesg).

I also saw and still see lockups on linux 3.8 with a similar 
configuration. btrfs on dm-crypt with SSD mode. I use slightly different 
mount options:


/dev/mapper/sda4_crypt on /home type btrfs (rw,noatime,flushoncommit,subvol=@home)
/dev/mapper/sda4_crypt on /store type btrfs (rw,subvol=@store)
/dev/mapper/sda4_crypt on /home/grosser type btrfs (rw,noatime,flushoncommit,subvol=@grosser)


(I previously also used compression. This is disabled at the moment, but 
compressed files are still around)


latencytop gives the following information:


firefox: 22909.8 ms latency, cause fsync() on a file

The corresponding backtrace is:

btrfs_log_inode_parent
[btrfs]
btrfs_log_dentry_safe
[btrfs]
btrfs_sync_file
[btrfs]
do_fsync
sys_fsync
system_call_fastpath

There are also 6 processes called

btrfs-endio-wri?? with 1050-1391 ms latency due to [sleep_on_page].

The corresponding backtraces are (3x):

sleep_on_page
wait_on_page_bit
read_extent_buffer_pages
[btrfs]
btree_read_extent_buffer_pages.constprop.119
[btrfs]
read_tree_block
[btrfs]
read_node_slot
[btrfs]
push_leaf_right
[btrfs]
split_leaf
[btrfs]
btrfs_search_slot
[btrfs]
btrfs_csum_file_blocks
[btrfs]
add_pending_csums.isra.42
[btrfs]
btrfs_finish_ordered_io
[btrfs]

1x:

sleep_on_page
wait_on_page_bit
read_extent_buffer_pages
[btrfs]
btree_read_extent_buffer_pages.constprop.119
[btrfs]
read_tree_block
[btrfs]
read_block_for_search.isra.51
[btrfs]
btrfs_search_slot
[btrfs]
btrfs_insert_empty_items
[btrfs]
run_clustered_refs
[btrfs]
btrfs_run_delayed_refs
[btrfs]
__btrfs_end_transaction
[btrfs]
btrfs_end_transaction
[btrfs]

1x
sleep_on_page
wait_on_page_bit
read_extent_buffer_pages
[btrfs]
btree_read_extent_buffer_pages.constprop.119
[btrfs]
read_tree_block
[btrfs]
read_node_slot
[btrfs]
push_leaf_left
[btrfs]
split_leaf
[btrfs]
btrfs_search_slot
[btrfs]
btrfs_insert_empty_items
[btrfs]
btrfs_csum_file_blocks
[btrfs]
add_pending_csums.isra.42
[btrfs]

1x
sleep_on_page
wait_on_page_bit
read_extent_buffer_pages
[btrfs]
btree_read_extent_buffer_pages.constprop.119
[btrfs]
read_tree_block
[btrfs]
read_node_slot
[btrfs]
push_leaf_right
[btrfs]
split_leaf
[btrfs]
btrfs_search_slot
[btrfs]
btrfs_csum_file_blocks
[btrfs]
add_pending_csums.isra.42
[btrfs]
btrfs_finish_ordered_io
[btrfs]

At the moment of deadlock, I am just running normal desktop applications 
like thunderbird and firefox.


Cheers,
Tobias

Any idea where such long delays may come from?


Re: [PATCH v9 01/16] Btrfs: disable qgroups accounting when quata_enable is 0

2014-04-09 Thread Liu Bo

Sorry, there is a typo in the subject line, thanks cwillu for pointing it out.

s/quata/quota/g.

-liubo

On Wed, Apr 09, 2014 at 03:08:29PM +0800, Liu Bo wrote:
> It's unnecessary to do qgroups accounting without enabling quota.
>
> Signed-off-by: Liu Bo bo.li@oracle.com
> ---
>  fs/btrfs/ctree.c   |  2 +-
>  fs/btrfs/delayed-ref.c | 18 ++
>  fs/btrfs/qgroup.c  |  3 +++
>  3 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 88d1b1e..54f3c67 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -406,7 +406,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
>  
>   tree_mod_log_write_lock(fs_info);
>   spin_lock(&fs_info->tree_mod_seq_lock);
> - if (!elem->seq) {
> + if (elem && !elem->seq) {
>   elem->seq = btrfs_inc_tree_mod_seq_major(fs_info);
>   list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
>   }
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 3129964..3ab37b6 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -656,8 +656,13 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
>   ref->is_head = 0;
>   ref->in_tree = 1;
>  
> - if (need_ref_seq(for_cow, ref_root))
> - seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
> + if (need_ref_seq(for_cow, ref_root)) {
> + struct seq_list *elem = NULL;
> +
> + if (fs_info->quota_enabled)
> + elem = &trans->delayed_ref_elem;
> + seq = btrfs_get_tree_mod_seq(fs_info, elem);
> + }
>   ref->seq = seq;
>  
>   full_ref = btrfs_delayed_node_to_tree_ref(ref);
> @@ -718,8 +723,13 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
>   ref->is_head = 0;
>   ref->in_tree = 1;
>  
> - if (need_ref_seq(for_cow, ref_root))
> - seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
> + if (need_ref_seq(for_cow, ref_root)) {
> + struct seq_list *elem = NULL;
> +
> + if (fs_info->quota_enabled)
> + elem = &trans->delayed_ref_elem;
> + seq = btrfs_get_tree_mod_seq(fs_info, elem);
> + }
>   ref->seq = seq;
>  
>   full_ref = btrfs_delayed_node_to_data_ref(ref);
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 2cf9058..c634b3e 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1186,6 +1186,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
>  {
>   struct qgroup_update *u;
>  
> + if (!trans->root->fs_info->quota_enabled)
> + return 0;
> +
>   BUG_ON(!trans->delayed_ref_elem.seq);
>   u = kmalloc(sizeof(*u), GFP_NOFS);
>   if (!u)
> -- 
> 1.8.2.1
>


Re: [PATCH 1/2] btrfs-progs: Add missing devices check for mounted btrfs.

2014-04-09 Thread Anand Jain



> As mentioned at the beginning, I prefer to remove the device remove test
> in testcase, then the patch can be reverted without any complain from me.


 That's a wrong approach as well.
 Test case 003 covers a very real case in data centers,
 especially since iscsi/san luns go offline/online all the time.




[PATCH v5] Btrfs-progs: add dedup subcommand

2014-04-09 Thread Liu Bo
This adds deduplication subcommands, 'btrfs dedup <command> <path>',
including enable/disable/on/off.

- btrfs dedup enable
Create the dedup tree, and it's the very first step when you're going to use
the dedup feature.

- btrfs dedup disable
Delete the dedup tree, after this we're not able to use dedup any more unless
you enable it again.

- btrfs dedup on [-b]
Switch on the dedup feature temporarily, and it's the second step of applying
dedup with writes.  Option '-b' is used to set dedup blocksize.
The default blocksize is 8192(no special reason, you may argue), and the current
limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the
upper limit of btrfs's compression.

- btrfs dedup off
Switch off the dedup feature temporarily, but the dedup tree remains.

-
Usage:
Step 1: btrfs dedup enable /btrfs
Step 2: btrfs dedup on /btrfs or btrfs dedup on -b 4K /btrfs
Step 3: now we have dedup, run your test.
Step 4: btrfs dedup off /btrfs
Step 5: btrfs dedup disable /btrfs
-

Signed-off-by: Liu Bo bo.li@oracle.com
---
v4: rebase and reserve spare space in btrfs_ioctl_dedup_args struct. 
v3: add commands 'btrfs dedup on/off'
v2: add manpage

 Makefile   |   2 +-
 btrfs.c|   1 +
 cmds-dedup.c   | 178 +
 commands.h |   2 +
 ctree.h|   2 +
 ioctl.h|  12 
 man/btrfs.8.in |  31 +-
 7 files changed, 224 insertions(+), 4 deletions(-)
 create mode 100644 cmds-dedup.c

diff --git a/Makefile b/Makefile
index da05197..369df6c 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
-  cmds-property.o
+  cmds-property.o cmds-dedup.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
   uuid-tree.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
diff --git a/btrfs.c b/btrfs.c
index 98ff6f5..16458ef 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -256,6 +256,7 @@ static const struct cmd_group btrfs_cmd_group = {
    { "quota", cmd_quota, NULL, &quota_cmd_group, 0 },
    { "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 },
    { "replace", cmd_replace, NULL, &replace_cmd_group, 0 },
+   { "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 },
    { "help", cmd_help, cmd_help_usage, NULL, 0 },
    { "version", cmd_version, cmd_version_usage, NULL, 0 },
    NULL_CMD_STRUCT
diff --git a/cmds-dedup.c b/cmds-dedup.c
new file mode 100644
index 0000000..b959349
--- /dev/null
+++ b/cmds-dedup.c
@@ -0,0 +1,178 @@
+/*
+ * Copyright (C) 2013 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <getopt.h>
+
+#include "ctree.h"
+#include "ioctl.h"
+
+#include "commands.h"
+#include "utils.h"
+
+static const char * const dedup_cmd_group_usage[] = {
+	"btrfs dedup command [options] path",
+	NULL
+};
+
+int dedup_ctl(char *path, struct btrfs_ioctl_dedup_args *args)
+{
+	int ret = 0;
+	int fd;
+	int e;
+	DIR *dirstream = NULL;
+
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: can't access '%s'\n", path);
+		return -EACCES;
+	}
+
+	ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, args);
+	e = errno;
+	close_file_or_dir(fd, dirstream);
+	if (ret < 0) {
+		fprintf(stderr, "ERROR: dedup command failed: %s\n",
+			strerror(e));
+		if (args->cmd == BTRFS_DEDUP_CTL_DISABLE ||
+		    args->cmd == BTRFS_DEDUP_CTL_SET_BS)
+			fprintf(stderr, "please refer to 'dmesg | tail' for more info\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static const char * const cmd_dedup_enable_usage[] = {
+	"btrfs dedup enable path",
+	"Enable data deduplication support for a filesystem.",
+ 

Re: [PATCH v5] Btrfs-progs: add dedup subcommand

2014-04-09 Thread Liu Bo
On Wed, Apr 09, 2014 at 06:10:40PM +0800, Liu Bo wrote:
> This adds deduplication subcommands, 'btrfs dedup command path',
> including enable/disable/on/off.
>
> - btrfs dedup enable
> Create the dedup tree, and it's the very first step when you're going to use
> the dedup feature.
>
> - btrfs dedup disable
> Delete the dedup tree, after this we're not able to use dedup any more unless
> you enable it again.
>
> - btrfs dedup on [-b]
> Switch on the dedup feature temporarily, and it's the second step of applying
> dedup with writes.  Option '-b' is used to set dedup blocksize.
> The default blocksize is 8192(no special reason, you may argue), and the current
> limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the
> upper limit of btrfs's compression.
>
> - btrfs dedup off
> Switch off the dedup feature temporarily, but the dedup tree remains.
>
> -
> Usage:
> Step 1: btrfs dedup enable /btrfs
> Step 2: btrfs dedup on /btrfs or btrfs dedup on -b 4K /btrfs
> Step 3: now we have dedup, run your test.
> Step 4: btrfs dedup off /btrfs
> Step 5: btrfs dedup disable /btrfs
> -
>
> Signed-off-by: Liu Bo bo.li@oracle.com
> ---

v5: rebase onto the latest btrfs-progs v3.14.

-liubo


Re: BTRFS setup advice for laptop performance ?

2014-04-09 Thread Chris Samuel
On Mon, 7 Apr 2014 11:11:11 AM Austin S Hemmelgarn wrote:

> This is because every other filesystem (except ZFS) doesn't use COW
> semantics.

There is an interesting article on LWN at the moment (subscriber only for the
next day or two, but if you can afford it I'd suggest considering subscribing)
about the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit
discussions around the impact of the new Shingled Magnetic Recording (SMR)
drives that may change that.

Given these devices are likely to have large sequential-write-only areas, it's
going to make getting existing Linux filesystems to work on them interesting.
In the section on Dave Chinner's filesystem part of the discussion it says:

# Any of the existing filesystems that do not support copy-on-write (COW)
# cannot really be optimized for SMR, he said, because you can't overwrite
# data in sequential zones. That would mean adding COW to ext4 and XFS,
# Chinner said. 

https://lwn.net/Articles/592091/

There's some background to SMR drives (available to all) from the LSFMM here:

https://lwn.net/Articles/591782/

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC





Unable to remove read-only seeding device from a filesystem.

2014-04-09 Thread Jan Kouba
When running this script:

dd if=/dev/zero of=seed-disk.img bs=1M seek=1k count=0
dd if=/dev/zero of=test-disk.img bs=1M seek=1k count=0

# Make image of seed device
mkfs.btrfs seed-disk.img
seed_dev=`losetup -f --show seed-disk.img`
mount $seed_dev /mnt/tmp
touch /mnt/tmp/a
umount /mnt/tmp
losetup -d $seed_dev
btrfstune -S 1 seed-disk.img

# Make read-only seed device
seed_dev=`losetup -f -r --show seed-disk.img`

test_dev=`losetup -f --show test-disk.img`

mount $seed_dev /mnt/tmp
btrfs dev add $test_dev /mnt/tmp
mount -o remount,rw /mnt/tmp

# This fails
btrfs dev delete $seed_dev /mnt/tmp

# cleanup
umount /mnt/tmp
losetup -d $seed_dev
losetup -d $test_dev

rm seed-disk.img
rm test-disk.img




The command

btrfs dev delete $seed_dev /mnt/tmp

fails with the message

ERROR: error removing the device '/dev/loop0' - Permission denied

If /dev/loop0 is not read-only, everything works.

I tested this on Ubuntu 13.10 and 14.04 with the stock kernel and btrfs-progs,
and on Ubuntu 14.04 with the latest PPA kernel (3.14.0-031400-generic) and
btrfs-progs v3.14.

Is this behaviour expected or is it a bug?

I thought that btrfs never changes seeding devices, so I don't understand why
a seeding device needs to be writeable in order to be removed from a filesystem.


Jan Kouba







Re: BTRFS setup advice for laptop performance ?

2014-04-09 Thread Chris Samuel
On Fri, 4 Apr 2014 10:02:27 AM Swâmi Petaramesh wrote:

> However I'm still concerned with chronic BTRFS dreadful performance and
> still  find that BRTFS degrades much over time even with periodic defrag
> and best practices etc.

That's odd, I've been running it on laptops with SSDs since 2009 and haven't
hit this yet. I'm a KDE user too (though not using Kmail/Akonadi on the
machines in question).

Sorry I can't do much more than say "works for me", but that happens to be the
case.. :-(

cheers!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC





Using noCow with snapshots ?

2014-04-09 Thread Swâmi Petaramesh
Hi,

In the quest for BTRFS and performance, and having received the advice to 
chattr +C my akonadi DB directory to make it noCow, I would like to be sure 
about what will happen when I take a snapshot of the concerned BTRFS 
subvolume.

1/ Being noCow, will the database be modified in the snapshot as well,
effectively defeating the snapshot ?

2/ Being snapshotted, will the database be COWed even though it's supposed to
be noCow ?

3/ Are both options mutually incompatible in some more obscure ways ?

I'd like to know where I'm going with this ;-)

TIA and kind regards.

-- 
Swâmi Petaramesh sw...@petaramesh.org http://petaramesh.org PGP 9076E32E




[PATCH] Btrfs: make sure there are not any read requests before stopping workers

2014-04-09 Thread Wang Shilong
In close_ctree(), after we have stopped all workers, there may still be
some read requests (for example readahead) to submit, and this *may* trigger
an oops that a user reported before:

kernel BUG at fs/btrfs/async-thread.c:619!

By hacking the code, I can reproduce this problem with one CPU available.
We fix this potential problem by invalidating all btree inode pages before
stopping all workers.

Thanks to Miao for pointing out this problem.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
 fs/btrfs/disk-io.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d9698fd..8a49823 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3619,6 +3619,11 @@ int close_ctree(struct btrfs_root *root)
 
btrfs_free_block_groups(fs_info);
 
+   /*
+* we must make sure there is not any read request to
+* submit after we stopping all workers.
+*/
+   invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
btrfs_stop_all_workers(fs_info);
 
free_root_pointers(fs_info, 1);
-- 
1.9.0



Re: Using noCow with snapshots ?

2014-04-09 Thread Hugo Mills
On Wed, Apr 09, 2014 at 01:15:24PM +0200, Swâmi Petaramesh wrote:
> Hi,
>
> In the quest for BTRFS and performance, and having received the advice to
> chattr +C my akonadi DB directory to make it noCow, I would like to be sure
> about what will happen when I take a snapshot of the concerned BTRFS
> subvolume.
>
> 1/ Being noCow, will the database be modified in the snapshot as well,
> effectively defeating the snapshot ?

   No (see below)

> 2/ Being snapshotted, will the database be COWed even though it's
> supposed to be noCow ?

   Yes -- once.

   When you make a snapshot of a nodatacow file, the data is shared
between the snapshot and the original as normal. The extents are
reference counted, so the original data now has two references to it.

   When one of these copies is written to, the writes are placed
somewhere else on the disk, still marked as nodatacow, and the
reference count is reduced to 1 for each copy again. (Note that this
is done on a per-block basis, although the 30-second transaction
commit will tend to coalesce adjacent blocks to reduce fragmentation;
autodefrag helps here, too).

   Basically, a snapshot of a nodatacow file will increase the
reference count for its blocks. A write to a block with a reference
count of more than one will *always* write a new block elsewhere. A
write to a block with a reference count of exactly one will not do so
if the file is marked nodatacow. I hope that's clear.
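
   The rule in the paragraph above can be written down as a tiny model. It
is only a sketch under stated assumptions: struct block_model,
snapshot_block() and write_needs_new_block() are hypothetical names made up
for illustration, not btrfs structures or functions, and real extents carry
much more state than a single reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one data block; not a btrfs structure. */
struct block_model {
	unsigned int refs;   /* how many files/snapshots reference the block */
};

/* Taking a snapshot adds one reference to each block of the file. */
void snapshot_block(struct block_model *b)
{
	assert(b != NULL);
	b->refs++;
}

/*
 * Decide whether a write has to go to a new location.
 * A shared block (refs > 1) is always written elsewhere; an unshared
 * block is overwritten in place only when the file is nodatacow.
 */
bool write_needs_new_block(const struct block_model *b, bool nodatacow)
{
	if (b->refs > 1)
		return true;
	return !nodatacow;
}
```

   In this model a nodatacow block with one reference is overwritten in
place, the first write after snapshot_block() goes elsewhere, and once each
copy is back to a single reference the writes are in place again.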

> 3/ Are both options mutually incompatible in some more obscure ways ?

   Only as noted above.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- There are three mistaikes in this sentance. ---   




Re: Using noCow with snapshots ?

2014-04-09 Thread Duncan
Swâmi Petaramesh posted on Wed, 09 Apr 2014 13:15:24 +0200 as excerpted:

> In the quest for BTRFS and performance, and having received the advice
> to chattr +C my akonadi DB directory to make it noCow, I would like to
> be sure about what will happen when I take a snapshot of the concerned
> BTRFS subvolume.
>
> 1/ Being noCow, will the database be modified in the snapshot as well,
> effectively defeating the snapshot ?
>
> 2/ Being snapshotted, will the database be COWed even though it's
> supposed to be noCow ?
>
> 3/ Are both options mutually incompatible in some more obscure ways ?
>
> I'd like to know where I'm going with this ;-)


Good questions. =:^)

#2. That's from one of the devs when the question came up perhaps a 
couple months ago.

On a NOCOW file the first write to a fileblock (4096 bytes) after a
snapshot must still be COW, because the snapshot locks the old version in
place, and now the fileblock has changed, so it MUST be written elsewhere
despite the NOCOW in order to keep the snapshot as it was.  However,
the file does retain the NOCOW attribute and additional writes to the
same fileblock will be in-place... until the next snapshot of course.

This is why on filesystems with scripted snapshots as close as a minute
apart (I even saw one guy say he was doing them every 30 seconds!!),
setting NOCOW has very little value -- they aren't NOCOW on the first
write after a snapshot, and with snapshots happening every minute...
Hourly snapshots are still likely to be a problem on a regularly changing
file, tho with daily snapshots you'd probably save some fragmentation
over the fairly short term anyway, but it'd still be a problem longer
term.

Which is why I suggest putting such files on a separate subvolume and not 
snapshotting that subvolume, since snapshots stop at the subvolume 
boundary.  That gives NOCOW a chance to actually *BE* NOCOW.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



[PATCH] Btrfs: check if items are ordered when a leaf is marked dirty

2014-04-09 Thread Filipe David Borba Manana
This eases finding bugs during development where a btree leaf is modified
in such a way that its items are no longer sorted by key. Since this
is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY
is set, which isn't meant to be enabled for regular users.

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
 fs/btrfs/disk-io.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d9698fd..8bf6628 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3695,6 +3695,12 @@ void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
 	__percpu_counter_add(&root->fs_info->dirty_metadata_bytes,
 			     buf->len,
 			     root->fs_info->dirty_metadata_batch);
+#ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
+   if (btrfs_header_level(buf) == 0 && check_leaf(root, buf)) {
+   btrfs_print_leaf(root, buf);
+   ASSERT(0);
+   }
+#endif
 }
 
 static void __btrfs_btree_balance_dirty(struct btrfs_root *root,
-- 
1.9.1



[PATCH] Btrfs: don't access non-existent key when csum tree is empty

2014-04-09 Thread Filipe David Borba Manana
When the csum tree is empty, our leaf (path->nodes[0]) has a number
of items equal to 0 and since btrfs_header_nritems() returns an
unsigned integer (and so is our local nritems variable) the following
comparison always evaluates to false:

     if (path->slots[0] >= nritems - 1) {

As the casting rules lead to:

     if ((u32)0 >= (u32)4294967295) {

This makes us access key at slot path->slots[0] + 1 (1) of the empty leaf
some lines below:

    btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
    if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
        found_key.type != BTRFS_EXTENT_CSUM_KEY) {
            found_next = 1;
            goto insert;
    }

So just don't access such non-existent slot and don't set found_next to 1
when the tree is empty. It's very unlikely we'll get a random key with the
objectid and type values above, which is where we could go into trouble.

If nritems is 0, just set found_next to 1 anyway as it will make us insert
a csum item covering our whole extent (or the whole leaf) when the tree is
empty.
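
The unsigned wraparound described above can be reproduced outside the kernel.
The following is a minimal user-space sketch, not kernel code: the function
names at_last_slot_old(), at_last_slot_new() and wraparound_selfcheck() are
made up for illustration, and uint32_t stands in for the kernel's u32.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old test: with nritems == 0, nritems - 1 wraps around to UINT32_MAX. */
bool at_last_slot_old(uint32_t slot, uint32_t nritems)
{
	return slot >= (uint32_t)(nritems - 1);
}

/* Patched test: an empty leaf is handled explicitly before subtracting. */
bool at_last_slot_new(uint32_t slot, uint32_t nritems)
{
	return !nritems || slot >= (uint32_t)(nritems - 1);
}

/* Sanity check of the wraparound the commit message relies on. */
void wraparound_selfcheck(void)
{
	uint32_t nritems = 0;

	assert((uint32_t)(nritems - 1) == UINT32_MAX);
}
```

For an empty leaf (slot 0, nritems 0) the old test is false, so the code falls
through to reading a key from a slot that does not exist, while the patched
test is true and takes the btrfs_next_leaf() branch. For non-empty leaves both
tests agree.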

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
 fs/btrfs/file-item.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 9d84658..0721113 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -749,7 +749,7 @@ again:
 		int slot = path->slots[0] + 1;
 		/* we didn't find a csum item, insert one */
 		nritems = btrfs_header_nritems(path->nodes[0]);
-		if (path->slots[0] >= nritems - 1) {
+		if (!nritems || (path->slots[0] >= nritems - 1)) {
ret = btrfs_next_leaf(root, path);
if (ret == 1)
found_next = 1;
-- 
1.7.9.5



Re: [PATCH] Btrfs: do not reset last_snapshot after relocation

2014-04-09 Thread Shilong Wang
Hi Josef,

2014-03-28 2:56 GMT+08:00 Josef Bacik jba...@fb.com:
> This was done to allow NO_COW to continue to be NO_COW after relocation but it
> is not right.  When relocating we will convert blocks to FULL_BACKREF that we
> relocate.  We can leave some of these full backref blocks behind if they are not
> cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
> then just drop the reloc tree.  Then when we go to cow the block again we won't
> lookup the extent flags because we won't think there has been a snapshot
> recently which means we will do our normal ref drop thing instead of adding back
> a tree ref and dropping the shared ref.  This will cause btrfs_free_extent to
> blow up because it can't find the ref we are trying to free.  This was found
> with my ref verifying tool.  Thanks,

Could we pass the error into merge_reloc_roots(), something like this:

merge_reloc_roots(struct reloc_control *rc, int err)

and only skip resetting the root's last snapshot if @err is not 0? I think
it is meaningful to allow NO_COW to continue after relocation.

Thanks,
Wang

> Signed-off-by: Josef Bacik jba...@fb.com
> ---
>  fs/btrfs/relocation.c | 21 -
>  1 file changed, 21 deletions(-)
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index ec00777..f026a82 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -2318,7 +2318,6 @@ void free_reloc_roots(struct list_head *list)
>  static noinline_for_stack
>  int merge_reloc_roots(struct reloc_control *rc)
>  {
> -	struct btrfs_trans_handle *trans;
>  	struct btrfs_root *root;
>  	struct btrfs_root *reloc_root;
>  	u64 last_snap;
> @@ -2376,26 +2375,6 @@ again:
>  				list_add_tail(&reloc_root->root_list,
>  					      &reloc_roots);
>  				goto out;
> -			} else if (!ret) {
> -				/*
> -				 * recover the last snapshot tranid to avoid
> -				 * the space balance break NOCOW.
> -				 */
> -				root = read_fs_root(rc->extent_root->fs_info,
> -						    objectid);
> -				if (IS_ERR(root))
> -					continue;
> -
> -				trans = btrfs_join_transaction(root);
> -				BUG_ON(IS_ERR(trans));
> -
> -				/* Check if the fs/file tree was snapshoted or not. */
> -				if (btrfs_root_last_snapshot(&root->root_item) ==
> -				    otransid - 1)
> -					btrfs_set_root_last_snapshot(&root->root_item,
> -								     last_snap);
> -
> -				btrfs_end_transaction(trans, root);
>  			}
>  		}
>
> --
> 1.8.3.1



Re: Upgrade to 3.14.0 messed up raid0 array (btrfs cleaner crashes in fs/btrfs/extent-tree.c:5748 and fs/btrfs/free-space-cache.c:1183 )

2014-04-09 Thread Marc MERLIN
On Tue, Apr 08, 2014 at 10:31:39PM -0700, Marc MERLIN wrote:
> On Tue, Apr 08, 2014 at 09:31:25PM -0700, Marc MERLIN wrote:
> > On Tue, Apr 08, 2014 at 07:49:14PM -0400, Chris Mason wrote:
> > >
> > > On 04/08/2014 06:09 PM, Marc MERLIN wrote:
> > > > I forgot to add that while I'm not sure if anyone ended up looking at the
> > > > last image I made regarding
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=72801
> > > >
> > > > I can generate a an image of that filesystem if that helps, or try other
> > > > commands which hopefully won't crash my running server :)
> > > > (filesystem is almost 2TB, so the image will again be big)
> > >
> > > Hi Marc,
> > >
> > > So from the messages it looks like your space cache is corrupted.  Lets
> > > start with clearing the space cache and running fsck and seeing exactly
> > > what is wrong.
> >
> > gargamel:~# mount  -o clear_cache /dev/dm-4 /mnt/mnt
> > [48132.661274] BTRFS: device label btrfs_raid0 devid 1 transid 50567 /dev/mapper/raid0d1
> > [48132.703063] BTRFS info (device dm-5): force clearing of disk cache
> > [48132.724780] BTRFS info (device dm-5): disk space caching is enabled

So, I tried again this morning, mounted with clear_cache, let the clearer
process work a bit:
root 25187  0.0  0.0  0 0 ?S07:56   0:00 
[btrfs-freespace]
but even though I did not have the FS mounted, after just one minute, the
kernel went into that death loop again.

Then (2nd log below), I tried mounting with -o clear_cache,nospace_cache and 
had the same problem too.

I'll wait on your next suggestion, with maybe how you'd like me to run btrfsck

Thanks,
Marc



[37652.548583] BTRFS: device label btrfs_raid0 devid 2 transid 50571 
/dev/mapper/raid0d2
[37652.757397] BTRFS info (device dm-5): force clearing of disk cache
[37652.779375] BTRFS info (device dm-5): disk space caching is enabled
[37842.582194] WARNING: CPU: 2 PID: 25231 at fs/btrfs/extent-tree.c:5748 
__btrfs_free_extent+0x359/0x712()
[37842.613790] Modules linked in: udp_diag tcp_diag inet_diag ip6table_filter 
ip6_tables ebtable_nat ebtables tun ppdev lp autofs4 binfmt_misc kl5kusb105 
deflate ctr twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 
twofish_generic twofish_common camellia_x86_64 camellia_generic 
serpent_sse2_x86_64 serpent_avx_x86_64 glue_helper lrw serpent_generic 
blowfish_x86_64 blowfish_generic blowfish_common cast5_avx_x86_64 ablk_helper 
cast5_generic cast_common des_generic cmac xcbc rmd160 sha512_ssse3 
sha512_generic ftdi_sio crypto_null keyspan af_key xfrm_algo dm_mirror 
dm_region_hash dm_log nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc 
ipt_REJECT xt_conntrack xt_nat xt_tcpudp xt_LOG iptable_mangle iptable_filter 
aes_x86_64 lm85 hwmon_vid dm_snapshot dm_bufio iptable_nat ip_tables 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_conntrack_ftp ipt_MASQUERADE 
nf_nat x_tables nf_conntrack sg st snd_pcm_oss snd_mixer_oss fuse microcode 
snd_hda_codec_realtek snd_cmipci snd_hda_codec_generic kvm_intel gameport kvm 
eeepc_wmi snd_hda_intel asus_wmi sparse_keymap snd_opl3_lib snd_mpu401_uart 
snd_seq_midi snd_hda_codec rfkill snd_seq_midi_event snd_seq snd_rawmidi 
snd_hwdep snd_pcm snd_timer tpm_infineon battery snd_seq_device wmi coretemp 
processor rc_ati_x10 pl2303 pcspkr snd intel_rapl asix tpm_tis 
x86_pkg_temp_thermal parport_pc ati_remote libphy ezusb soundcore i2c_i801 
parport tpm intel_powerclamp rc_core lpc_ich xhci_hcd usbnet usbserial evdev 
xts gf128mul dm_crypt dm_mod raid456 async_raid6_recov async_pq async_xor 
async_memcpy async_tx e1000e ptp pps_core crc32_pclmul crc32c_intel ehci_pci 
sata_sil24 r8169 ehci_hcd thermal mii crct10dif_pclmul fan sata_mv 
ghash_clmulni_intel cryptd usbcore usb_common [last unloaded: kl5kusb105]
[37843.113872] CPU: 2 PID: 25231 Comm: btrfs-cleaner Not tainted 
3.14.0-amd64-i915-preempt-20140216 #2
[37843.143161] Hardware name: System manufacturer System Product Name/P8H67-M 
PRO, BIOS 3806 08/20/2012
[37843.143161] Hardware name: System manufacturer System Product Name/P8H67-M 
PRO, BIOS 3806 08/20/2012
[37843.172720]   880076d15b38 8160a06d 

[37843.197245]  880076d15b70 81050025 812170f6 
8801bbdf3580
[37843.221785]  fffe 000ee061  
880076d15b80
[37843.246199] Call Trace:
[37843.255694]  [8160a06d] dump_stack+0x4e/0x7a
[37843.273251]  [81050025] warn_slowpath_common+0x7f/0x98
[37843.293448]  [812170f6] ? __btrfs_free_extent+0x359/0x712
[37843.314212]  [810500ec] warn_slowpath_null+0x1a/0x1c
[37843.334376]  [812170f6] __btrfs_free_extent+0x359/0x712
[37843.354692]  [8160f97b] ? _raw_spin_unlock+0x17/0x2a
[37843.374076]  [8126518b] ? btrfs_check_delayed_seq+0x84/0x90
[37843.395273]  [8121c262] __btrfs_run_delayed_refs+0xa94/0xbdf
[37843.417102]  [8113fcf3] ? __cache_free.isra.39+0x1b4/0x1c3
[37843.437969]  [8121df46] btrfs_run_delayed_refs+0x81/0x18f
[37843.458651]  

Re: Upgrade to 3.14.0 messed up raid0 array (btrfs cleaner crashes in fs/btrfs/extent-tree.c:5748 and fs/btrfs/free-space-cache.c:1183 )

2014-04-09 Thread Chris Mason



On 04/09/2014 11:42 AM, Marc MERLIN wrote:
> On Tue, Apr 08, 2014 at 10:31:39PM -0700, Marc MERLIN wrote:
> > On Tue, Apr 08, 2014 at 09:31:25PM -0700, Marc MERLIN wrote:
> > > On Tue, Apr 08, 2014 at 07:49:14PM -0400, Chris Mason wrote:
> > > >
> > > > On 04/08/2014 06:09 PM, Marc MERLIN wrote:
> > > > > I forgot to add that while I'm not sure if anyone ended up looking at the
> > > > > last image I made regarding
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=72801
> > > > >
> > > > > I can generate a an image of that filesystem if that helps, or try other
> > > > > commands which hopefully won't crash my running server :)
> > > > > (filesystem is almost 2TB, so the image will again be big)
> > > >
> > > > Hi Marc,
> > > >
> > > > So from the messages it looks like your space cache is corrupted.  Lets
> > > > start with clearing the space cache and running fsck and seeing exactly
> > > > what is wrong.
> > >
> > > gargamel:~# mount  -o clear_cache /dev/dm-4 /mnt/mnt
> > > [48132.661274] BTRFS: device label btrfs_raid0 devid 1 transid 50567 /dev/mapper/raid0d1
> > > [48132.703063] BTRFS info (device dm-5): force clearing of disk cache
> > > [48132.724780] BTRFS info (device dm-5): disk space caching is enabled
>
> So, I tried again this morning, mounted with clear_cache, let the clearer
> process work a bit:
> root 25187  0.0  0.0  0 0 ?S07:56   0:00 [btrfs-freespace]
> but even though I did not have the FS mounted, after just one minute, the
> kernel went into that death loop again.
>
> Then (2nd log below), I tried mounting with -o clear_cache,nospace_cache and
> had the same problem too.
>
> I'll wait on your next suggestion, with maybe how you'd like me to run btrfsck


Downloading the image now.  I'd just run a readonly btrfsck /dev/xxx

-chris


Re: [PATCH] btrfs: fix lockdep warning with reclaim lock inversion

2014-04-09 Thread Filipe David Manana
On Wed, Mar 26, 2014 at 6:11 PM, Jeff Mahoney je...@suse.com wrote:
> When encountering memory pressure, testers have run into the following
> lockdep warning. It was caused by __link_block_group calling kobject_add
> with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
> which gets us into reclaim context. The kobject doesn't actually need
> to be added under the lock -- it just needs to ensure that it's only
> added for the first block group to be linked.
>
> =
> [ INFO: possible irq lock inversion dependency detected ]
> 3.14.0-rc8-default #1 Not tainted
> -
> kswapd0/169 just changed the state of lock:
>  (&delayed_node->mutex){+.+.-.}, at: [a018baea] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
> but this lock took another, RECLAIM_FS-unsafe lock in the past:
>  (&found->groups_sem){+.}
>
> and interrupts could create inverse lock ordering between them.
>
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
>        CPU0                    CPU1
>
>   lock(&found->groups_sem);
>                                local_irq_disable();
>                                lock(&delayed_node->mutex);
>                                lock(&found->groups_sem);
>   <Interrupt>
>     lock(&delayed_node->mutex);
>
>  *** DEADLOCK ***
> 2 locks held by kswapd0/169:
>  #0:  (shrinker_rwsem){..}, at: [81159e8a] shrink_slab+0x3a/0x160
>  #1:  (&type->s_umount_key#27){..}, at: [811bac6f] grab_super_passive+0x3f/0x90
>
> Signed-off-by: Jeff Mahoney je...@suse.com

Hi Jeff,

The same kind of problem happens when deleting the kobj. See the trace
below, thanks.

[23510.196996] BTRFS: device fsid 6b4518c2-60e8-4990-8c6e-fb15054b1e46
devid 1 transid 4 /dev/sdd
[23510.203212] BTRFS info (device sdd): disk space caching is enabled
[23510.203214] BTRFS: flagging fs with big metadata feature
[23510.227365] BTRFS: creating UUID tree
[23661.587069]
[23661.587073] =
[23661.587074] [ INFO: possible irq lock inversion dependency detected ]
[23661.587076] 3.13.0-fdm-btrfs-next-24+ #7 Not tainted
[23661.587077] -
[23661.587078] kswapd0/42 just changed the state of lock:
[23661.587079]  (delayed_node-mutex){+.+.-.}, at:
[a036d73c] __btrfs_release_delayed_node+0x4c/0x1f0 [btrfs]
[23661.587095] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[23661.587096]  (sysfs_mutex){+.+.+.}
[23661.587096]
[23661.587096] and interrupts could create inverse lock ordering between them.
[23661.587096]
[23661.587098]
[23661.587098] other info that might help us debug this:
[23661.587099] Chain exists of:
[23661.587099]   delayed_node-mutex -- found-groups_sem -- sysfs_mutex
[23661.587099]
[23661.587102]  Possible interrupt unsafe locking scenario:
[23661.587102]
[23661.587103]CPU0CPU1
[23661.587103]
[23661.587104]   lock(sysfs_mutex);
[23661.587105]local_irq_disable();
[23661.587106]lock(delayed_node-mutex);
[23661.587107]lock(found-groups_sem);
[23661.587108]   Interrupt
[23661.587109] lock(delayed_node-mutex);
[23661.587110]
[23661.587110]  *** DEADLOCK ***
[23661.587110]
[23661.587111] 2 locks held by kswapd0/42:
[23661.587112]  #0:  (shrinker_rwsem){..}, at:
[8114d0ec] shrink_slab+0x3c/0x3a0
[23661.587117]  #1:  (type-s_umount_key#31){+.}, at:
[811a43d4] grab_super_passive+0x44/0x90
[23661.587122]
[23661.587122] the shortest dependencies between 2nd lock and 1st lock:
[23661.587126]   - (sysfs_mutex){+.+.+.} ops: 12918522 {
[23661.587129]  HARDIRQ-ON-W at:
[23661.587130] [81099900]
__lock_acquire+0x670/0x1e40
[23661.587133] [8109b6e5]
lock_acquire+0x85/0x110
[23661.587134] [816f7f36]
mutex_lock_nested+0x76/0x390
[23661.587137] [8121f1b6]
sysfs_mount+0x166/0x210
[23661.587140] [811a4d53] mount_fs+0x43/0x1b0
[23661.587142] [811c2403]
vfs_kern_mount+0x73/0x160
[23661.587144] [811c2509]
kern_mount_data+0x19/0x30
[23661.587145] [81f16082] sysfs_init+0x55/0xb5
[23661.587148] [81f13435] mnt_init+0xcd/0x1d8
[23661.587149] [81f130d7]
vfs_caches_init+0x99/0x11c
[23661.587151] [81eebe1e]
start_kernel+0x3ab/0x40c
[23661.587154] [81eeb579]
x86_64_start_reservations+0x2a/0x2c
[23661.587155] [81eeb672]
x86_64_start_kernel+0xf7/0xfb

Re: Upgrade to 3.14.0 messed up raid0 array (btrfs cleaner crashes in fs/btrfs/extent-tree.c:5748 and fs/btrfs/free-space-cache.c:1183 )

2014-04-09 Thread Marc MERLIN
On Wed, Apr 09, 2014 at 11:46:13AM -0400, Chris Mason wrote:
 Downloading the image now.  I'd just run a readonly btrfsck /dev/xxx

http://marc.merlins.org/tmp/btrfs-raid0-image-fsck.txt (6MB)

I admit to not knowing how to read that output; I've only ever seen
thousands of lines of output from it on any filesystem, but hopefully you
know how to grep out expected noise.

But since we're talking about this, is btrfsck ever supposed to return clean
on a clean filesystem?

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


[PATCH] btrfs/035: update clone test to expect EOPNOTSUPP

2014-04-09 Thread David Disseldorp
With kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, the first
clone-range overwrite attempt now fails with EOPNOTSUPP.

FIXME: The second clone-range causes EIO on subsequent read attempts.

Signed-off-by: David Disseldorp dd...@suse.de
---
 tests/btrfs/035 | 10 ++
 tests/btrfs/035.out |  5 +
 2 files changed, 15 insertions(+)

diff --git a/tests/btrfs/035 b/tests/btrfs/035
index 6808179..21a9059 100755
--- a/tests/btrfs/035
+++ b/tests/btrfs/035
@@ -57,21 +57,31 @@ src_str=aa
 echo -n "$src_str" > $SCRATCH_MNT/src
 
 $CLONER_PROG $SCRATCH_MNT/src  $SCRATCH_MNT/src.clone1
+cat $SCRATCH_MNT/src.clone1
+echo
 
 src_str=bbcc
 
 echo -n "$src_str" > $SCRATCH_MNT/src
 
 $CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone2
+cat $SCRATCH_MNT/src.clone2
+echo
 
+# Prior to kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, this clone
+# resulted in a BUG_ON in __btrfs_drop_extents(). The kernel now returns
+# EOPNOTSUPP up to userspace.
 snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone1 | awk '{print $5}'`
 echo "attempting ioctl (src.clone1 src)"
 $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
 	$SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src
+cat $SCRATCH_MNT/src
+echo
 
 snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'`
 echo "attempting ioctl (src.clone2 src)"
 $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
 	$SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src
+cat $SCRATCH_MNT/src
 
 status=0 ; exit
diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out
index f86cadf..0ea2c4f 100644
--- a/tests/btrfs/035.out
+++ b/tests/btrfs/035.out
@@ -1,3 +1,8 @@
 QA output created by 035
+aa
+bbcc
 attempting ioctl (src.clone1 src)
+clone failed: Operation not supported
+bbcc
 attempting ioctl (src.clone2 src)
+bbcc
-- 
1.8.4.5



[PATCH] Btrfs: fix a crash of clone with inline extents's split

2014-04-09 Thread David Disseldorp
Thanks for the BUG_ON() fix here.
Strangely, I'm now seeing EIO returned for reads following the second
clone-range.

Please see the subsequent xfstests patch.

Cheers, David


Re: btrfs on 3.14rc5 stuck on btrfs_tree_read_lock sync

2014-04-09 Thread Marc MERLIN
On Mon, Apr 07, 2014 at 01:00:02PM -0700, Marc MERLIN wrote:
 On Mon, Apr 07, 2014 at 03:32:13PM -0400, Chris Mason wrote:
  You're recommending that I try btrfs-next on a 3.15 pre kernel, correct?
  If so would it be likely to fix my filesystem and let me go back to a
  stable 3.14? (I'm a bit wary about running some unstable 3.15 on it :).
  
  Right now the fixes for this are in the integration branch on my git 
  tree.  I think we've shaken  out all the problems, but if you want to 
  wait until tomorrow I'll have it in my next branch (for linux-next).
 
 I can wait, even a few more days if needed.
 But just to be clear: will this new kernel be something that will be
 required for me to run from there on to avoid all those deadlocks and very
 poor performance I'm seeing, or the new kernel will fix things up, and then
 if other stuff isn't quite stable, I can downgrade back to 3.14 stable?
 
 By the way, I think I know which filesystem is causing this, and one unusual
 thing is that it uses a lot of hardlinks.
 
 In case that helps, there are only 40 snapshots on it, but many inodes, of
 which many are hardlinked together:
 
 gargamel:/mnt/btrfs_pool2# btrfs filesystem df `pwd`
 Data, single: total=3.28TiB, used=2.30TiB
 System, DUP: total=8.00MiB, used=384.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, DUP: total=74.50GiB, used=70.11GiB
 Metadata, single: total=8.00MiB, used=0.00
 gargamel:/mnt/btrfs_pool2# btrfs filesystem show `pwd`
 Label: btrfs_pool2  uuid: cb9df6d3-a528-4afc-9a45-4fed5ec358d6
   Total devices 1 FS bytes used 2.37TiB
   devid1 size 7.28TiB used 3.43TiB path /dev/mapper/dshelf2

Back on that front, while debugging the other problem I sent you, I've been
having more issues with this device too.

After boot, I've been getting several of these:
gargamel login: [ 1328.241302] INFO: task btrfs-cleaner:3571 blocked for more than 120 seconds.
[ 1328.264046]   Not tainted 3.14.0-amd64-i915-preempt-20140216 #2
[ 1328.284413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1328.309394] btrfs-cleaner   D 88020d5ea800 0  3571  2 0x
[ 1328.331996]  8800c8985d00 0046 8800c8985fd8 
88020d5ea2d0
[ 1328.355774]  000141c0 88020d5ea2d0 8801d9a7ee50 
880213bfc9e8
[ 1328.379408]   880213bfc800 88020c5b7ce0 
8800c8985d10
[ 1328.403654] Call Trace:
[ 1328.412617]  [8160d2a1] schedule+0x73/0x75
[ 1328.429189]  [8122aa77] wait_current_trans.isra.15+0x98/0xf4
[ 1328.450402]  [81085116] ? finish_wait+0x65/0x65
[ 1328.467957]  [8122bf1c] start_transaction+0x48e/0x4f2
[ 1328.487315]  [8122c2ff] ? __btrfs_end_transaction+0x2a1/0x2c6
[ 1328.508614]  [8122bf9b] btrfs_start_transaction+0x1b/0x1d
[ 1328.528842]  [8121cc7d] btrfs_drop_snapshot+0x443/0x610
[ 1328.548481]  [8122c73d] btrfs_clean_one_deleted_snapshot+0x103/0x10f
[ 1328.571518]  [81224f09] cleaner_kthread+0x103/0x136
[ 1328.590436]  [81224e06] ? btrfs_alloc_root+0x26/0x26
[ 1328.609348]  [8106bc62] kthread+0xae/0xb6
[ 1328.625275]  [8106bbb4] ? __kthread_parkme+0x61/0x61
[ 1328.644406]  [8161637c] ret_from_fork+0x7c/0xb0
[ 1328.662075]  [8106bbb4] ? __kthread_parkme+0x61/0x61

But more annoyingly, accessing the mountpoint was hanging, so I've now mounted
it with recovery,ro and am backing up all the data to another device so that I
can destroy/recreate this device, which clearly has severe performance issues.

Do you want btrfsck output and an image of that one too?
(this one is not raid0, it's on top of a dm-encrypted md raid5 array)

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


btrfsck - process_inode_item: Assertion `!(rec->ino != key->objectid || rec->refs > 1)' failed

2014-04-09 Thread Tomasz Mloduchowski
Hi,

I'm getting the following error when running btrfsck --repair on a
damaged filesystem (some files report 'stale NFS handle' errors when
accessed).

checking extents
parent transid verify failed on 19964 wanted 1868 found 1586
parent transid verify failed on 19964 wanted 1868 found 1586
parent transid verify failed on 19964 wanted 1868 found 1586
parent transid verify failed on 19964 wanted 1868 found 1586
Ignoring transid failure
leaf parent key incorrect 19964
bad block 19964
Chunk[256, 228, 1103101952]: length(1073741824), offset(1103101952),
type(1) is not found in block group

-- snip --

Chunk[256, 228, 21793024]: length(1023410176), offset(21793024),
type(1) is not found in block group
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
parent transid verify failed on 19964 wanted 1868 found 1586
Ignoring transid failure

-- snip --

parent transid verify failed on 19964 wanted 1868 found 1586
Ignoring transid failure
btrfsck: cmds-check.c:512: process_inode_item: Assertion `!(rec->ino !=
key->objectid || rec->refs > 1)' failed.


The image (btrfs-image  -s -c 9 -t 4) of this filesystem is available here:

http://static.qdot.me/home-btrfs-fail-2.img

btrfs-progs:
commit 761650b628d3b8964cc55da68ad5c8187f55c543

Thanks a lot for pointers or assistance (or patches to try out).

Cheers,
Tomasz


Re: Upgrade to 3.14.0 messed up raid0 array (btrfs cleaner crashes in fs/btrfs/extent-tree.c:5748 and fs/btrfs/free-space-cache.c:1183 )

2014-04-09 Thread Chris Mason

On 04/09/2014 12:51 PM, Marc MERLIN wrote:

On Wed, Apr 09, 2014 at 11:46:13AM -0400, Chris Mason wrote:

Downloading the image now.  I'd just run a readonly btrfsck /dev/xxx


http://marc.merlins.org/tmp/btrfs-raid0-image-fsck.txt (6MB)

I admit to not knowing how to read that output; I've only ever seen
thousands of lines of output from it on any filesystem, but hopefully you
know how to grep out expected noise.

But since we're talking about this, is btrfsck ever supposed to return clean
on a clean filesystem?


Looks like I'm getting different results from btrfsck on the image. 
Still a ton of corruptions but complaints about different blocks.


Can you please use btrfs-map-logical to dump both copies of block
245432320 and send me the results?

-chris


Re: Upgrade to 3.14.0 messed up raid0 array (btrfs cleaner crashes in fs/btrfs/extent-tree.c:5748 and fs/btrfs/free-space-cache.c:1183 )

2014-04-09 Thread Duncan
Marc MERLIN posted on Wed, 09 Apr 2014 09:51:34 -0700 as excerpted:

 But since we're talking about this, is btrfsck ever supposed to return
 clean on a clean filesystem?

FWIW, it seems to return clean here, on everything I've tried it on.

But I run relatively small partitions; the biggest is, I believe, 40 gig.
My media partitions are still reiserfs on spinning rust, while all my
btrfs partitions are on SSD. Most are raid1 for both data and metadata;
the exceptions (my normal /boot and the backup /boot on the other ssd in
the pair that's btrfs raid1 for most partitions) are tiny
mixed-data/metadata dup. I keep them pretty clean, running balance and
scrub when needed.

I had seen some scrub recoveries back when I was doing suspend-to-ram and
the system wasn't reliably resuming. I've quit doing that and recently
did a new mkfs.btrfs and restored from backup on the affected filesystems
in order to take advantage of newer features like 16k metadata nodes,
so in fact I have never personally seen an unclean output of any type from
btrfs check.

Tho I don't run btrfs check regularly: in normal mode it's read-only
anyway, and I know it can make some problems worse instead of fixing them
in repair mode. So my normal idea is, why run it and see stuff that might
make me worried if I can't really do much about it? I prefer balance
and scrub instead if there are problems.  But I have run it a few times as
I was curious just what it /would/ output, and everything came up clean
on the filesystems I ran it on.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: [PATCH] btrfs: fix lockdep warning with reclaim lock inversion

2014-04-09 Thread Jeff Mahoney

On 4/9/14, 12:05 PM, Filipe David Manana wrote:
 On Wed, Mar 26, 2014 at 6:11 PM, Jeff Mahoney je...@suse.com
 wrote:
 When encountering memory pressure, testers have run into the
 following lockdep warning. It was caused by __link_block_group
 calling kobject_add with the groups_sem held. kobject_add calls
 kvasprintf with GFP_KERNEL, which gets us into reclaim context.
 The kobject doesn't actually need to be added under the lock --
 it just needs to ensure that it's only added for the first block
 group to be linked.
 
 = [ INFO:
 possible irq lock inversion dependency detected ] 
 3.14.0-rc8-default #1 Not tainted 
 - 
 kswapd0/169 just changed the state of lock: 
 (delayed_node-mutex){+.+.-.}, at: [a018baea]
 __btrfs_release_delayed_node+0x3a/0x200 [btrfs] but this lock
 took another, RECLAIM_FS-unsafe lock in the past: 
 (found-groups_sem){+.}
 
 and interrupts could create inverse lock ordering between them.
 
 other info that might help us debug this: Possible interrupt
 unsafe locking scenario: CPU0CPU1 
  lock(found-groups_sem); local_irq_disable(); 
 lock(delayed_node-mutex); lock(found-groups_sem); 
 Interrupt lock(delayed_node-mutex);
 
 *** DEADLOCK *** 2 locks held by kswapd0/169: #0:
 (shrinker_rwsem){..}, at: [81159e8a]
 shrink_slab+0x3a/0x160 #1:  (type-s_umount_key#27){..}, at:
 [811bac6f] grab_super_passive+0x3f/0x90
 
 Signed-off-by: Jeff Mahoney je...@suse.com
 
 Hi Jeff,
 
 The same kind of problem happens when deleting the kobj. See the
 trace below, thanks.


Thanks, Filipe. I have an updated patch I'll post in a minute.

-Jeff

 [23510.196996] BTRFS: device fsid
 6b4518c2-60e8-4990-8c6e-fb15054b1e46 devid 1 transid 4 /dev/sdd 
 [23510.203212] BTRFS info (device sdd): disk space caching is
 enabled [23510.203214] BTRFS: flagging fs with big metadata
 feature [23510.227365] BTRFS: creating UUID tree [23661.587069] 
 [23661.587073]
 = 
 [23661.587074] [ INFO: possible irq lock inversion dependency
 detected ] [23661.587076] 3.13.0-fdm-btrfs-next-24+ #7 Not tainted 
 [23661.587077]
 - 
 [23661.587078] kswapd0/42 just changed the state of lock: 
 [23661.587079]  (delayed_node-mutex){+.+.-.}, at: 
 [a036d73c] __btrfs_release_delayed_node+0x4c/0x1f0
 [btrfs] [23661.587095] but this lock took another,
 RECLAIM_FS-unsafe lock in the past: [23661.587096]
 (sysfs_mutex){+.+.+.} [23661.587096] [23661.587096] and interrupts
 could create inverse lock ordering between them. [23661.587096] 
 [23661.587098] [23661.587098] other info that might help us debug
 this: [23661.587099] Chain exists of: [23661.587099]
 delayed_node-mutex -- found-groups_sem -- sysfs_mutex 
 [23661.587099] [23661.587102]  Possible interrupt unsafe locking
 scenario: [23661.587102] [23661.587103]CPU0
 CPU1 [23661.587103] 
 [23661.587104]   lock(sysfs_mutex); [23661.587105]
 local_irq_disable(); [23661.587106]
 lock(delayed_node-mutex); [23661.587107]
 lock(found-groups_sem); [23661.587108]   Interrupt 
 [23661.587109] lock(delayed_node-mutex); [23661.587110] 
 [23661.587110]  *** DEADLOCK *** [23661.587110] [23661.587111] 2
 locks held by kswapd0/42: [23661.587112]  #0:
 (shrinker_rwsem){..}, at: [8114d0ec]
 shrink_slab+0x3c/0x3a0 [23661.587117]  #1:
 (type-s_umount_key#31){+.}, at: [811a43d4]
 grab_super_passive+0x44/0x90 [23661.587122] [23661.587122] the
 shortest dependencies between 2nd lock and 1st lock: [23661.587126]
 - (sysfs_mutex){+.+.+.} ops: 12918522 { [23661.587129]
 HARDIRQ-ON-W at: [23661.587130]
 [81099900] __lock_acquire+0x670/0x1e40 [23661.587133]
 [8109b6e5] lock_acquire+0x85/0x110 [23661.587134]
 [816f7f36] mutex_lock_nested+0x76/0x390 [23661.587137]
 [8121f1b6] sysfs_mount+0x166/0x210 [23661.587140]
 [811a4d53] mount_fs+0x43/0x1b0 [23661.587142]
 [811c2403] vfs_kern_mount+0x73/0x160 [23661.587144]
 [811c2509] kern_mount_data+0x19/0x30 [23661.587145]
 [81f16082] sysfs_init+0x55/0xb5 [23661.587148]
 [81f13435] mnt_init+0xcd/0x1d8 [23661.587149]
 [81f130d7] vfs_caches_init+0x99/0x11c [23661.587151]
 [81eebe1e] start_kernel+0x3ab/0x40c [23661.587154]
 [81eeb579] x86_64_start_reservations+0x2a/0x2c 
 [23661.587155] [81eeb672] 
 x86_64_start_kernel+0xf7/0xfb [23661.587157]  SOFTIRQ-ON-W at: 
 [23661.587158] [8109993c] 
 __lock_acquire+0x6ac/0x1e40 [23661.587160]
 [8109b6e5] lock_acquire+0x85/0x110 [23661.587162]
 [816f7f36] mutex_lock_nested+0x76/0x390 [23661.587164]
 [8121f1b6] sysfs_mount+0x166/0x210 

[PATCH] btrfs: fix lockdep warning with reclaim lock inversion

2014-04-09 Thread Jeff Mahoney
When encountering memory pressure, testers have run into the following
lockdep warning. It was caused by __link_block_group calling kobject_add
with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
which gets us into reclaim context. The kobject doesn't actually need
to be added under the lock -- it just needs to ensure that it's only
added for the first block group to be linked. We also need to release
the lock before removing the kobjects.

=
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc8-default #1 Not tainted
-
kswapd0/169 just changed the state of lock:
 (delayed_node-mutex){+.+.-.}, at: [a018baea] 
__btrfs_release_delayed_node+0x3a/0x200 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 (found-groups_sem){+.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:
   CPU0CPU1
   
  lock(found-groups_sem);
   local_irq_disable();
   lock(delayed_node-mutex);
   lock(found-groups_sem);
  Interrupt
lock(delayed_node-mutex);

 *** DEADLOCK ***
2 locks held by kswapd0/169:
 #0:  (shrinker_rwsem){..}, at: [81159e8a] shrink_slab+0x3a/0x160
 #1:  (type-s_umount_key#27){..}, at: [811bac6f] 
grab_super_passive+0x3f/0x90

Signed-off-by: Jeff Mahoney je...@suse.com
---
 fs/btrfs/extent-tree.c |   20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8343,9 +8343,15 @@ static void __link_block_group(struct bt
 			       struct btrfs_block_group_cache *cache)
 {
 	int index = get_block_group_index(cache);
+	bool first = false;
 
 	down_write(&space_info->groups_sem);
-	if (list_empty(&space_info->block_groups[index])) {
+	if (list_empty(&space_info->block_groups[index]))
+		first = true;
+	list_add_tail(&cache->list, &space_info->block_groups[index]);
+	up_write(&space_info->groups_sem);
+
+	if (first) {
 		struct kobject *kobj = &space_info->block_group_kobjs[index];
 		int ret;
 
@@ -8357,8 +8363,6 @@ static void __link_block_group(struct bt
 			kobject_put(&space_info->kobj);
 		}
 	}
-	list_add_tail(&cache->list, &space_info->block_groups[index]);
-	up_write(&space_info->groups_sem);
 }
 
 static struct btrfs_block_group_cache *
@@ -8693,6 +8697,7 @@ int btrfs_remove_block_group(struct btrf
 	struct btrfs_root *tree_root = root->fs_info->tree_root;
 	struct btrfs_key key;
 	struct inode *inode;
+	bool cleanup_needed = false;
 	int ret;
 	int index;
 	int factor;
@@ -8791,12 +8796,15 @@ int btrfs_remove_block_group(struct btrf
 	 * are still on the list after taking the semaphore
 	 */
 	list_del_init(&block_group->list);
-	if (list_empty(&block_group->space_info->block_groups[index])) {
+	if (list_empty(&block_group->space_info->block_groups[index]))
+		cleanup_needed = true;
+	up_write(&block_group->space_info->groups_sem);
+
+	if (cleanup_needed) {
+		clear_avail_alloc_bits(root->fs_info, block_group->flags);
 		kobject_del(&block_group->space_info->block_group_kobjs[index]);
 		kobject_put(&block_group->space_info->block_group_kobjs[index]);
-		clear_avail_alloc_bits(root->fs_info, block_group->flags);
 	}
-	up_write(&block_group->space_info->groups_sem);
 
 	if (block_group->cached == BTRFS_CACHE_STARTED)
 		wait_block_group_cache_done(block_group);

-- 
Jeff Mahoney
SUSE Labs
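
For readers following the thread, the locking change in this patch can be
sketched in userspace C. This is only an illustration under assumptions:
struct group_list, publish() and unpublish() are hypothetical stand-ins for
the space_info, kobject_add() and kobject_del()/kobject_put(), and a pthread
mutex stands in for groups_sem taken for writing. The list state is examined
and updated under the lock, the result is remembered in a local flag, and the
work that may allocate memory runs only after the lock has been dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-space_info state in the patch. */
struct group_list {
	pthread_mutex_t lock;	/* plays the role of groups_sem */
	int nr_groups;		/* length of the block group list */
	bool published;		/* whether the "kobject" is registered */
};

/*
 * Stand-ins for kobject_add() and kobject_del(): in the kernel these may
 * allocate memory, so they must not be called with the lock held.
 */
static void publish(struct group_list *gl)
{
	gl->published = true;
}

static void unpublish(struct group_list *gl)
{
	gl->published = false;
}

static void link_group(struct group_list *gl)
{
	bool first;

	pthread_mutex_lock(&gl->lock);
	first = (gl->nr_groups == 0);	/* decide under the lock */
	gl->nr_groups++;
	pthread_mutex_unlock(&gl->lock);

	if (first)			/* act after dropping it */
		publish(gl);
}

static void unlink_group(struct group_list *gl)
{
	bool cleanup_needed;

	pthread_mutex_lock(&gl->lock);
	gl->nr_groups--;
	cleanup_needed = (gl->nr_groups == 0);
	pthread_mutex_unlock(&gl->lock);

	if (cleanup_needed)
		unpublish(gl);
}
```

The sketch does not address ordering between a concurrent link and unlink
once the lock is dropped; it only shows where the decision and the action
sit relative to the lock.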


[PATCH v10 01/16] Btrfs: disable qgroups accounting when quota_enable is 0

2014-04-09 Thread Liu Bo
It's unnecessary to do qgroups accounting without enabling quota.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.c   |  2 +-
 fs/btrfs/delayed-ref.c | 18 ++
 fs/btrfs/qgroup.c  |  3 +++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 88d1b1e..54f3c67 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -406,7 +406,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
 
 	tree_mod_log_write_lock(fs_info);
 	spin_lock(&fs_info->tree_mod_seq_lock);
-	if (!elem->seq) {
+	if (elem && !elem->seq) {
 		elem->seq = btrfs_inc_tree_mod_seq_major(fs_info);
 		list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
 	}
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3129964..3ab37b6 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -656,8 +656,13 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 
-	if (need_ref_seq(for_cow, ref_root))
-		seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+	if (need_ref_seq(for_cow, ref_root)) {
+		struct seq_list *elem = NULL;
+
+		if (fs_info->quota_enabled)
+			elem = &trans->delayed_ref_elem;
+		seq = btrfs_get_tree_mod_seq(fs_info, elem);
+	}
 	ref->seq = seq;
 
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
@@ -718,8 +723,13 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 
-	if (need_ref_seq(for_cow, ref_root))
-		seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+	if (need_ref_seq(for_cow, ref_root)) {
+		struct seq_list *elem = NULL;
+
+		if (fs_info->quota_enabled)
+			elem = &trans->delayed_ref_elem;
+		seq = btrfs_get_tree_mod_seq(fs_info, elem);
+	}
 	ref->seq = seq;
 
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 2cf9058..c634b3e 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1186,6 +1186,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 {
 	struct qgroup_update *u;
 
+	if (!trans->root->fs_info->quota_enabled)
+		return 0;
+
 	BUG_ON(!trans->delayed_ref_elem.seq);
 	u = kmalloc(sizeof(*u), GFP_NOFS);
 	if (!u)
-- 
1.8.1.4



[PATCH v10 09/16] Btrfs: add ioctl of dedup control

2014-04-09 Thread Liu Bo
So far we have 4 commands to control dedup behaviour,
- btrfs dedup enable
Create the dedup tree, and it's the very first step when you're going to use
the dedup feature.

- btrfs dedup disable
Delete the dedup tree, after this we're not able to use dedup any more unless
you enable it again.

- btrfs dedup on [-b]
Switch on the dedup feature temporarily; this is the second step in applying
dedup to writes.  Option '-b' sets the dedup blocksize.
The default blocksize is 8192 (no special reason, you may argue), and the
current limit is [4096, 128 * 1024], because 4K is the generic page size and
128K is the upper limit of btrfs's compression.

- btrfs dedup off
Switch off the dedup feature temporarily, but the dedup tree remains.
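
As a rough illustration of how the four commands interact, here is a small
userspace model in C. It is a sketch under assumptions, not the kernel code:
struct dedup_state and the dedup_* helpers are hypothetical names, a
blocksize of 0 is assumed to mean "switched off", and the explicit lower
clamp to 4096 is an assumption taken from the limit stated above. The
rounding to the sector size and the 128K cap follow the ioctl code in this
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEDUP_BS_MIN	4096ULL
#define DEDUP_BS_MAX	(128 * 1024ULL)

/* Hypothetical model of the per-filesystem dedup state. */
struct dedup_state {
	bool has_tree;	/* set by "enable", cleared by "disable" */
	uint64_t bs;	/* dedup blocksize; 0 means switched off */
};

/* "btrfs dedup enable": create the dedup tree; enabling twice is harmless. */
static int dedup_enable(struct dedup_state *s)
{
	s->has_tree = true;
	return 0;
}

/* "btrfs dedup disable": refused while dedup is still switched on. */
static int dedup_disable(struct dedup_state *s)
{
	if (!s->has_tree)
		return 0;
	if (s->bs != 0)
		return -EBUSY;
	s->has_tree = false;
	return 0;
}

/* "btrfs dedup on -b <bs>": sectorsize is assumed to be a power of two. */
static int dedup_on(struct dedup_state *s, uint64_t bs, uint64_t sectorsize)
{
	if (!s->has_tree)
		return -EINVAL;
	bs = (bs + sectorsize - 1) & ~(sectorsize - 1);	/* round up */
	if (bs < DEDUP_BS_MIN)
		bs = DEDUP_BS_MIN;
	if (bs > DEDUP_BS_MAX)
		bs = DEDUP_BS_MAX;
	s->bs = bs;
	return 0;
}

/* "btrfs dedup off": the tree stays, only the blocksize is cleared. */
static int dedup_off(struct dedup_state *s)
{
	if (!s->has_tree)
		return -EINVAL;
	s->bs = 0;
	return 0;
}
```

In this model a request of 5000 bytes with a 4096-byte sector size ends up as
8192, and "disable" only succeeds after "off".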

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.h   |   3 +
 fs/btrfs/disk-io.c |   1 +
 fs/btrfs/ioctl.c   | 167 +
 include/uapi/linux/btrfs.h |  12 
 4 files changed, 183 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ca1b516..feebfab 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1740,6 +1740,9 @@ struct btrfs_fs_info {
u64 dedup_bs;
 
int dedup_type;
+
+   /* protect user change for dedup operations */
+   struct mutex dedup_ioctl_mutex;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a2586ac..3be947f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2362,6 +2362,7 @@ int open_ctree(struct super_block *sb,
mutex_init(fs_info-dev_replace.lock_finishing_cancel_unmount);
mutex_init(fs_info-dev_replace.lock_management_lock);
mutex_init(fs_info-dev_replace.lock);
+   mutex_init(fs_info-dedup_ioctl_mutex);
 
spin_lock_init(fs_info-qgroup_lock);
mutex_init(fs_info-qgroup_ioctl_lock);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0401397..45c183c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4820,6 +4820,171 @@ static int btrfs_ioctl_set_features(struct file *file, 
void __user *arg)
return btrfs_commit_transaction(trans, root);
 }
 
+static long btrfs_enable_dedup(struct btrfs_root *root)
+{
+   struct btrfs_fs_info *fs_info = root-fs_info;
+   struct btrfs_trans_handle *trans = NULL;
+   struct btrfs_root *dedup_root;
+   int ret = 0;
+
+   mutex_lock(fs_info-dedup_ioctl_mutex);
+   if (fs_info-dedup_root) {
+   pr_info(btrfs: dedup has already been enabled\n);
+   mutex_unlock(fs_info-dedup_ioctl_mutex);
+   return 0;
+   }
+
+   trans = btrfs_start_transaction(root, 2);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   mutex_unlock(fs_info-dedup_ioctl_mutex);
+   return ret;
+   }
+
+   dedup_root = btrfs_create_tree(trans, fs_info,
+  BTRFS_DEDUP_TREE_OBJECTID);
+   if (IS_ERR(dedup_root))
+   ret = PTR_ERR(dedup_root);
+
+   if (ret)
+   btrfs_end_transaction(trans, root);
+   else
+   ret = btrfs_commit_transaction(trans, root);
+
+   if (!ret) {
+   pr_info(btrfs: dedup enabled\n);
+   fs_info-dedup_root = dedup_root;
+   fs_info-dedup_root-block_rsv = fs_info-global_block_rsv;
+   btrfs_set_fs_incompat(fs_info, DEDUP);
+   }
+
+   mutex_unlock(fs_info-dedup_ioctl_mutex);
+   return ret;
+}
+
+static long btrfs_disable_dedup(struct btrfs_root *root)
+{
+   struct btrfs_fs_info *fs_info = root-fs_info;
+   struct btrfs_root *dedup_root;
+   int ret;
+
+   mutex_lock(fs_info-dedup_ioctl_mutex);
+   if (!fs_info-dedup_root) {
+   pr_info(btrfs: dedup has been disabled\n);
+   mutex_unlock(fs_info-dedup_ioctl_mutex);
+   return 0;
+   }
+
+   if (fs_info-dedup_bs != 0) {
+   pr_info(btrfs: cannot disable dedup until switching off 
dedup!\n);
+   mutex_unlock(fs_info-dedup_ioctl_mutex);
+   return -EBUSY;
+   }
+
+   dedup_root = fs_info-dedup_root;
+
+   ret = btrfs_drop_snapshot(dedup_root, NULL, 1, 0);
+
+   if (!ret) {
+   fs_info-dedup_root = NULL;
+   pr_info(btrfs: dedup disabled\n);
+   }
+
+   mutex_unlock(fs_info-dedup_ioctl_mutex);
+   WARN_ON(ret  0  ret != -EAGAIN  ret != -EROFS);
+   return ret;
+}
+
+static long btrfs_set_dedup_bs(struct btrfs_root *root, u64 bs)
+{
+   struct btrfs_fs_info *info = root-fs_info;
+   int ret = 0;
+
+   mutex_lock(info-dedup_ioctl_mutex);
+   if (!info-dedup_root) {
+   pr_info(btrfs: dedup is disabled, we cannot switch on/off 
dedup\n);
+   ret = -EINVAL;
+   goto out;
+   }
+
+   bs = ALIGN(bs, root-sectorsize);
+   bs = min_t(u64, bs, (128 * 1024ULL));
+
+   if (bs == 

[PATCH v10 05/16] Btrfs: make ordered extent aware of dedup

2014-04-09 Thread Liu Bo
This adds a dedup flag and dedup hash into ordered extent so that
we can insert dedup extents to dedup tree at endio time.

The benefit is simplicity: we don't need to fall back to cleaning up dedup
structures if the write is cancelled for some reason.
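
The ownership idea can be sketched in userspace C as follows. This is an
illustration under assumptions: struct dedup_hash and struct ordered_entry
here are simplified, hypothetical versions of the kernel structures, and
malloc()/free() stand in for kzalloc()/kfree(). The entry takes a private
copy of the hash when it is created, using the same condition as the patch
as far as the archived text shows it, and the single put path frees that
copy, so a cancelled write needs no separate cleanup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified hash descriptor; the real structure differs. */
struct dedup_hash {
	int type;
	unsigned char digest[32];
};

/* Hypothetical, simplified ordered extent. */
struct ordered_entry {
	int dedup;			/* extent is marked for dedup */
	struct dedup_hash *hash;	/* private copy, or NULL */
};

/* Returns 0 on success, -1 if the copy could not be allocated. */
static int entry_init(struct ordered_entry *e, int dedup,
		      const struct dedup_hash *hash)
{
	e->dedup = dedup;
	e->hash = NULL;

	if (!dedup && hash) {
		e->hash = malloc(sizeof(*hash));
		if (!e->hash)
			return -1;
		memcpy(e->hash, hash, sizeof(*hash));
	}
	return 0;
}

/* One teardown path, whether the write completed or was cancelled. */
static void entry_put(struct ordered_entry *e)
{
	free(e->hash);	/* free(NULL) is a no-op */
	e->hash = NULL;
}
```

Because the entry owns its copy, the caller's hash can go away as soon as
the entry has been set up.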

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ordered-data.c | 38 --
 fs/btrfs/ordered-data.h | 13 -
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a94b05f..c520e13 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -183,7 +183,8 @@ static inline struct rb_node *tree_search(struct 
btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ int dedup, struct btrfs_dedup_hash *hash)
 {
struct btrfs_root *root = BTRFS_I(inode)-root;
struct btrfs_ordered_inode_tree *tree;
@@ -199,10 +200,23 @@ static int __btrfs_add_ordered_extent(struct inode 
*inode, u64 file_offset,
entry-start = start;
entry-len = len;
if (!(BTRFS_I(inode)-flags  BTRFS_INODE_NODATASUM) 
-   !(type == BTRFS_ORDERED_NOCOW))
+   !(type == BTRFS_ORDERED_NOCOW)  !dedup)
entry-csum_bytes_left = disk_len;
entry-disk_len = disk_len;
entry-bytes_left = len;
+   entry-dedup = dedup;
+   entry-hash = NULL;
+
+   if (!dedup  hash) {
+   entry-hash = kzalloc(btrfs_dedup_hash_size(hash-type),
+ GFP_NOFS);
+   if (!entry-hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   memcpy(entry-hash, hash, btrfs_dedup_hash_size(hash-type));
+   }
+
entry-inode = igrab(inode);
entry-compress_type = compress_type;
entry-truncated_len = (u64)-1;
@@ -251,7 +265,17 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, 0, NULL);
+}
+
+int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  int dedup, struct btrfs_dedup_hash *hash,
+  int compress_type)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ compress_type, dedup, hash);
 }
 
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
@@ -259,16 +283,17 @@ int btrfs_add_ordered_extent_dio(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, 0, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int compress_type)
+ int type, int compress_type,
+ struct btrfs_dedup_hash *hash)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, 0, hash);
 }
 
 /*
@@ -530,6 +555,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent 
*entry)
 			list_del(&sum->list);
 			kfree(sum);
 		}
+		kfree(entry->hash);
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 2468970..efbb11f 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -109,6 +109,9 @@ struct btrfs_ordered_extent {
/* compression algorithm */
int compress_type;
 
+   /* whether this ordered extent is marked for dedup or not */
+   int dedup;
+
/* reference count */
atomic_t refs;
 
@@ -135,6 +138,9 @@ struct btrfs_ordered_extent {
struct completion completion;
struct btrfs_work flush_work;

[PATCH v10 02/16] Btrfs: introduce dedup tree and relatives

2014-04-09 Thread Liu Bo
This is a preparation step for online/inband dedup.
It introduces the dedup tree and its relatives, including the hash driver and
some structures.
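
As a rough illustration (a userspace sketch, not the kernel code itself), the
in-memory hash structure is a fixed header followed by a variable-length
digest, so its allocation size depends on the hash type. The names below
mirror the patch but drop the btrfs_ prefix and use stdint types; treat them
as assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum dedup_type { DEDUP_SHA256 = 0, DEDUP_LAST = 1 };

/* Digest length in 64-bit words and in bytes, indexed by dedup type. */
static const int dedup_lens[]  = { 4, 0 };
static const int dedup_sizes[] = { 32, 0 };	/* 256 bits = 32 bytes */

struct dedup_hash {
	uint64_t bytenr;
	uint64_t num_bytes;
	int type;		/* hash algorithm */
	int compression;
	uint64_t hash[];	/* flexible array member holding the digest */
};

/* Bytes to allocate for a hash of the given type: header plus digest. */
static size_t dedup_hash_size(int type)
{
	assert(type >= 0 && type < DEDUP_LAST);
	/* The two tables must describe the same digest length. */
	assert((size_t)dedup_lens[type] * sizeof(uint64_t) ==
	       (size_t)dedup_sizes[type]);
	return sizeof(struct dedup_hash) + (size_t)dedup_sizes[type];
}
```

A caller would allocate dedup_hash_size(type) bytes and fill hash[0..3] with
the SHA-256 digest.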

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.h | 73 
 fs/btrfs/disk-io.c   | 36 ++
 fs/btrfs/extent-tree.c   |  2 ++
 include/trace/events/btrfs.h |  3 +-
 4 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bc96c03..da4320d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -33,6 +33,7 @@
 #include <asm/kmap_types.h>
 #include <linux/pagemap.h>
 #include <linux/btrfs.h>
+#include <crypto/hash.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -101,6 +102,9 @@ struct btrfs_ordered_sum;
 /* for storing items that use the BTRFS_UUID_KEY* types */
 #define BTRFS_UUID_TREE_OBJECTID 9ULL
 
+/* dedup tree(experimental) */
+#define BTRFS_DEDUP_TREE_OBJECTID 10ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -523,6 +527,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES	(1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_DEDUP		(1ULL << 10)
 
 #define BTRFS_FEATURE_COMPAT_SUPP  0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET  0ULL
@@ -540,6 +545,7 @@ struct btrfs_super_block {
 BTRFS_FEATURE_INCOMPAT_RAID56 |\
 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |   \
+BTRFS_FEATURE_INCOMPAT_DEDUP | \
 BTRFS_FEATURE_INCOMPAT_NO_HOLES)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET\
@@ -915,6 +921,51 @@ struct btrfs_csum_item {
u8 csum;
 } __attribute__ ((__packed__));
 
+/* dedup */
+enum btrfs_dedup_type {
+   BTRFS_DEDUP_SHA256 = 0,
+   BTRFS_DEDUP_LAST = 1,
+};
+
+static int btrfs_dedup_lens[] = { 4, 0 };
+static int btrfs_dedup_sizes[] = { 32, 0 };/* 256bit, 32bytes */
+
+struct btrfs_dedup_item {
+   /* disk length of dedup range */
+   __le64 len;
+
+   u8 type;
+   u8 compression;
+   u8 encryption;
+
+   /* spare for later use */
+   __le16 other_encoding;
+
+   /* hash/fingerprints go here */
+} __attribute__ ((__packed__));
+
+struct btrfs_dedup_hash {
+   u64 bytenr;
+   u64 num_bytes;
+
+   /* hash algorithm */
+   int type;
+
+   int compression;
+
+   /* last field is a variable length array of dedup hash */
+   u64 hash[];
+};
+
+static inline int btrfs_dedup_hash_size(int type)
+{
+   WARN_ON((btrfs_dedup_lens[type] * sizeof(u64)) !=
+btrfs_dedup_sizes[type]);
+
+   return sizeof(struct btrfs_dedup_hash) + btrfs_dedup_sizes[type];
+}
+
+
 struct btrfs_dev_stats_item {
/*
 * grow this item struct at the end for future enhancements and keep
@@ -1320,6 +1371,7 @@ struct btrfs_fs_info {
struct btrfs_root *dev_root;
struct btrfs_root *fs_root;
struct btrfs_root *csum_root;
+   struct btrfs_root *dedup_root;
struct btrfs_root *quota_root;
struct btrfs_root *uuid_root;
 
@@ -1680,6 +1732,14 @@ struct btrfs_fs_info {
 
struct semaphore uuid_tree_rescan_sem;
unsigned int update_uuid_tree_gen:1;
+
+   /* reference to deduplication algorithm driver via cryptoapi */
+   struct crypto_shash *dedup_driver;
+
+   /* dedup blocksize */
+   u64 dedup_bs;
+
+   int dedup_type;
 };
 
 struct btrfs_subvolume_writers {
@@ -2013,6 +2073,8 @@ struct btrfs_ioctl_defrag_range_args {
  */
 #define BTRFS_STRING_ITEM_KEY  253
 
+#define BTRFS_DEDUP_ITEM_KEY   254
+
 /*
  * Flags for mount options.
  *
@@ -3047,6 +3109,14 @@ static inline u32 btrfs_file_extent_inline_len(struct 
extent_buffer *eb,
 }
 
 
+/* btrfs_dedup_item */
+BTRFS_SETGET_FUNCS(dedup_len, struct btrfs_dedup_item, len, 64);
+BTRFS_SETGET_FUNCS(dedup_compression, struct btrfs_dedup_item, compression, 8);
+BTRFS_SETGET_FUNCS(dedup_encryption, struct btrfs_dedup_item, encryption, 8);
+BTRFS_SETGET_FUNCS(dedup_other_encoding, struct btrfs_dedup_item,
+  other_encoding, 16);
+BTRFS_SETGET_FUNCS(dedup_type, struct btrfs_dedup_item, type, 8);
+
 /* btrfs_dev_stats_item */
 static inline u64 btrfs_dev_stats_value(struct extent_buffer *eb,
struct btrfs_dev_stats_item *ptr,
@@ -3521,6 +3591,8 @@ static inline int btrfs_need_cleaner_sleep(struct 
btrfs_root *root)
 
 static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 {
+	if (fs_info->dedup_driver)
+		crypto_free_shash(fs_info->dedup_driver);
 	kfree(fs_info->balance_ctl);
 	kfree(fs_info->delayed_root);
 	kfree(fs_info->extent_root);
@@ -3687,6 +3759,7 @@ 

[PATCH v10 11/16] Btrfs: fix a crash of dedup ref

2014-04-09 Thread Liu Bo
The dedup reference is a special kind of delayed ref, and delayed refs
are batched to be processed later.

If we find a matched dedup extent, we queue an ADD delayed ref on it within
endio work, but a DROP delayed ref may already be queued:

   t1 t2  t3
->writepage commit transaction
  ->run_delalloc_dedup
 find_dedup
--
   process_delayed refs
(it deletes the dedup extent)
 add ordered extent|
 submit pages  |
  finish ordered io|
insert file extents|
queue delayed refs |
queue dedup ref|
 process delayed refs continues
 (insert a ref on an extent
  deleted by the above)

This scenario ends up with a crash because we're going to insert a ref on
a deleted extent.

To avoid the race, we need to check whether there is an ADD delayed ref when
deleting the extent, and protect this check with a lock.
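
A simplified userspace model of that rule, for illustration only: the
deleting side checks for a pending ADD under a lock and backs off with
-EAGAIN instead of deleting. The struct, the toy spinlock built on C11
atomic_flag, and the -ENOENT returned to a late ADD are assumptions of this
sketch; the real patch works on delayed ref heads inside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

struct dedup_extent {
	atomic_flag lock;	/* toy spinlock guarding the fields below */
	int pending_add;	/* ADD refs queued but not yet processed */
	int deleted;
};

static void extent_init(struct dedup_extent *e)
{
	atomic_flag_clear(&e->lock);
	e->pending_add = 0;
	e->deleted = 0;
}

static void extent_lock(struct dedup_extent *e)
{
	while (atomic_flag_test_and_set_explicit(&e->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void extent_unlock(struct dedup_extent *e)
{
	atomic_flag_clear_explicit(&e->lock, memory_order_release);
}

/* endio side: queue an ADD ref unless the extent is already gone. */
static int queue_add_ref(struct dedup_extent *e)
{
	int ret = 0;

	extent_lock(e);
	if (e->deleted)
		ret = -ENOENT;
	else
		e->pending_add++;
	extent_unlock(e);
	return ret;
}

/* commit side: refuse to delete while an ADD ref is still pending. */
static int free_dedup_extent(struct dedup_extent *e)
{
	int ret = 0;

	extent_lock(e);
	if (e->pending_add > 0)
		ret = -EAGAIN;
	else
		e->deleted = 1;
	extent_unlock(e);
	return ret;
}
```

Since the check and the deletion happen under the same lock, an ADD can no
longer slip in between them.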

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.h   |  3 ++-
 fs/btrfs/extent-tree.c | 35 +++
 fs/btrfs/file-item.c   | 36 +++-
 fs/btrfs/inode.c   | 10 ++
 4 files changed, 58 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index feebfab..2e8c443 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3764,7 +3764,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 
start, u64 end,
 struct list_head *list, int search_commit);
 
 int noinline_for_stack
-btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash 
*hash);
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash,
+   struct inode *inode, u64 file_pos);
 int noinline_for_stack
 btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 191f0a7..a8da7aa 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5937,9 +5937,23 @@ again:
goto again;
}
} else {
-   if (!dedup_hash  is_data 
-   root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
-   dedup_hash = extent_data_ref_offset(root, path, iref);
+   if (is_data  root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+   if (!dedup_hash)
+   dedup_hash = extent_data_ref_offset(root,
+   path, iref);
+
+   ret = btrfs_free_dedup_extent(trans, root,
+ dedup_hash, bytenr);
+   if (ret) {
+   if (ret == -EAGAIN)
+   ret = 0;
+   else
+   btrfs_abort_transaction(trans,
+   extent_root,
+   ret);
+   goto out;
+   }
+   }
 
if (found_extent) {
BUG_ON(is_data  refs_to_drop !=
@@ -5964,21 +5978,10 @@ again:
if (is_data) {
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
-   btrfs_abort_transaction(trans, extent_root, 
ret);
+   btrfs_abort_transaction(trans,
+   extent_root, ret);
goto out;
}
-
-   if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
-   ret = btrfs_free_dedup_extent(trans, root,
- dedup_hash,
- bytenr);
-   if (ret) {
-   btrfs_abort_transaction(trans,
-   extent_root,
-   ret);
-   goto out;
- 

[PATCH v10 12/16] Btrfs: fix deadlock of dedup work

2014-04-09 Thread Liu Bo
Checking for dedup references needs to allocate memory, so it cannot
be run under a spin_lock; otherwise it will end up in a deadlock.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a8da7aa..4c1c342 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5720,7 +5720,6 @@ again:
 	dedup_hash = 0;
 
 	path->reada = 1;
-	path->leave_spinning = 1;
 
 	is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
 	BUG_ON(!is_data && refs_to_drop != 1);
@@ -5774,7 +5773,6 @@ again:
goto out;
}
btrfs_release_path(path);
-			path->leave_spinning = 1;
 
key.objectid = bytenr;
key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -5942,6 +5940,7 @@ again:
dedup_hash = extent_data_ref_offset(root,
path, iref);
 
+			WARN_ON_ONCE(path->leave_spinning);
ret = btrfs_free_dedup_extent(trans, root,
  dedup_hash, bytenr);
if (ret) {
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v10 03/16] Btrfs: introduce dedup tree operations

2014-04-09 Thread Liu Bo
The operations consist of finding matched items, adding new items and
removing items.
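
To make the three operations concrete, here is a toy in-memory model (not the
btree code in this patch). It follows the keying used for SHA-256 in the
patch: the last 64 bits of the digest act as the key objectid, the extent
bytenr as the key offset, and the remaining digest words are stored in the
item. The fixed-size table and the function names are assumptions of this
sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_WORDS 4
#define MAX_ITEMS 16

struct dedup_item {
	uint64_t objectid;		/* last 64 bits of the digest */
	uint64_t offset;		/* bytenr of the extent */
	uint64_t len;			/* disk length of the dedup range */
	uint64_t rest[HASH_WORDS - 1];	/* remaining digest words */
	int used;
};

static struct dedup_item table[MAX_ITEMS];

/* Add a new item; returns 0 on success, -1 when the table is full. */
static int dedup_insert(const uint64_t hash[HASH_WORDS], uint64_t bytenr,
			uint64_t len)
{
	for (int i = 0; i < MAX_ITEMS; i++) {
		if (table[i].used)
			continue;
		table[i].objectid = hash[HASH_WORDS - 1];
		table[i].offset = bytenr;
		table[i].len = len;
		memcpy(table[i].rest, hash, sizeof(table[i].rest));
		table[i].used = 1;
		return 0;
	}
	return -1;
}

/* Returns 1 and fills *bytenr and *len on a match, 0 otherwise. */
static int dedup_find(const uint64_t hash[HASH_WORDS], uint64_t *bytenr,
		      uint64_t *len)
{
	for (int i = 0; i < MAX_ITEMS; i++) {
		if (!table[i].used || table[i].objectid != hash[HASH_WORDS - 1])
			continue;
		if (memcmp(table[i].rest, hash, sizeof(table[i].rest)))
			continue;	/* same key, different digest: keep looking */
		*bytenr = table[i].offset;
		*len = table[i].len;
		return 1;
	}
	return 0;
}

/* Remove the item identified by (objectid, bytenr); 0 on success. */
static int dedup_remove(uint64_t objectid, uint64_t bytenr)
{
	for (int i = 0; i < MAX_ITEMS; i++) {
		if (table[i].used && table[i].objectid == objectid &&
		    table[i].offset == bytenr) {
			table[i].used = 0;
			return 0;
		}
	}
	return -1;
}
```

Two digests that share their last 64 bits land under the same objectid, which
is why the lookup compares the remaining words before reporting a match.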

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ctree.h |   9 +++
 fs/btrfs/file-item.c | 210 +++
 2 files changed, 219 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index da4320d..ca1b516 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3760,6 +3760,15 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct 
inode *inode,
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 struct list_head *list, int search_commit);
 
+int noinline_for_stack
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash 
*hash);
+int noinline_for_stack
+btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_dedup_hash *hash);
+int noinline_for_stack
+btrfs_free_dedup_extent(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root, u64 hash, u64 bytenr);
 /* inode.c */
 struct btrfs_delalloc_work {
struct inode *inode;
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 127555b..6437ebe 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -885,3 +885,213 @@ out:
 fail_unlock:
goto out;
 }
+
+/* 1 means we find one, 0 means we dont. */
+int noinline_for_stack
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash)
+{
+   struct btrfs_key key;
+   struct btrfs_path *path;
+   struct extent_buffer *leaf;
+   struct btrfs_root *dedup_root;
+   struct btrfs_dedup_item *item;
+   u64 hash_value;
+   u64 length;
+   u64 dedup_size;
+   int compression;
+   int found = 0;
+   int index;
+   int ret;
+
+   if (!hash) {
+   WARN_ON(1);
+   return 0;
+   }
+	if (!root->fs_info->dedup_root) {
+		WARN(1, KERN_INFO "dedup not enabled\n");
+		return 0;
+	}
+	dedup_root = root->fs_info->dedup_root;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return 0;
+
+	/*
+	 * For SHA256 dedup algorithm, we store the last 64bit as the
+	 * key.objectid, and the rest in the tree item.
+	 */
+	index = btrfs_dedup_lens[hash->type] - 1;
+	dedup_size = btrfs_dedup_sizes[hash->type] - sizeof(u64);
+
+	hash_value = hash->hash[index];
+
+	key.objectid = hash_value;
+	key.offset = (u64)-1;
+	btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY);
+
+	ret = btrfs_search_slot(NULL, dedup_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	if (ret == 0) {
+		WARN_ON(1);
+		goto out;
+	}
+
+prev_slot:
+	/* this will do match checks. */
+	ret = btrfs_previous_item(dedup_root, path, hash_value,
+				  BTRFS_DEDUP_ITEM_KEY);
+	if (ret)
+		goto out;
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	if (key.objectid != hash_value)
+		goto out;
+
+	item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dedup_item);
+	/* disk length of dedup range */
+	length = btrfs_dedup_len(leaf, item);
+
+	compression = btrfs_dedup_compression(leaf, item);
+	if (compression > BTRFS_COMPRESS_TYPES) {
+		WARN_ON(1);
+		goto out;
+	}
+
+	if (btrfs_dedup_type(leaf, item) != hash->type)
+		goto prev_slot;
+
+	if (memcmp_extent_buffer(leaf, hash->hash, (unsigned long)(item + 1),
+				 dedup_size)) {
+		pr_info("goto prev\n");
+		goto prev_slot;
+	}
+
+	hash->bytenr = key.offset;
+	hash->num_bytes = length;
+	hash->compression = compression;
+	found = 1;
+out:
+   btrfs_free_path(path);
+   return found;
+}
+
+int noinline_for_stack
+btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_dedup_hash *hash)
+{
+   struct btrfs_key key;
+   struct btrfs_path *path;
+   struct extent_buffer *leaf;
+   struct btrfs_root *dedup_root;
+   struct btrfs_dedup_item *dedup_item;
+   u64 ins_size;
+   u64 dedup_size;
+   int index;
+   int ret;
+
+   if (!hash) {
+   WARN_ON(1);
+   return 0;
+   }
+
+   WARN_ON(hash-num_bytes  root-fs_info-dedup_bs);
+
+   if (!root-fs_info-dedup_root) {
+   WARN(1, KERN_INFO dedup not enabled\n);
+   return 0;
+   }
+   dedup_root = root-fs_info-dedup_root;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   /*
+* For SHA256 dedup algorithm, we store the last 64bit 

[PATCH v10 16/16] Btrfs: fix dedup enospc problem

2014-04-09 Thread Liu Bo
In the case of dedupe, btrfs will produce a large number of delayed refs, and
processing them can very likely eat all of the space reserved in
global_block_rsv, so we end up with a transaction abort due to ENOSPC.

I tried several different ways to reserve more space for global_block_rsv in
the hope that it would be enough for flushing delayed refs, but they failed
and the code became very messy.

I found that under high delayed refs pressure, the throttle work in
end_transaction is of little use since it doesn't block the insertion of new
delayed refs, so I moved the throttling to the very start,
i.e. start_transaction.

We take the worst case into account in the throttle code, that is,
every delayed ref would update the btree. So when we reach the limit where
they may use up all the reserved space of global_block_rsv, we kick
transaction_kthread to commit the transaction, which processes these delayed
refs, refreshes global_block_rsv's space, and gets pinned space back as well.
That way we get rid of the annoying ENOSPC problem.
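
The worst-case estimate can be sketched as a small standalone function. This
is an illustration of the arithmetic only: should_throttle() and its
parameters are hypothetical, and bytes_per_update stands in for whatever
btrfs_calc_trans_metadata_size(root, 1) returns on a given filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the worst-case metadata cost of the pending delayed
 * refs reaches what is currently reserved. Worst case means every ref
 * (entries minus heads) needs its own btree update.
 */
static bool should_throttle(uint64_t num_entries, uint64_t num_heads,
			    uint64_t bytes_per_update, uint64_t reserved)
{
	uint64_t refs = num_entries > num_heads ? num_entries - num_heads : 0;

	/* Treat an overflowing estimate as "certainly too much". */
	if (refs != 0 && bytes_per_update > UINT64_MAX / refs)
		return true;

	return refs * bytes_per_update >= reserved;
}
```

For example, 900 pending refs at 4096 bytes each need about 3.5MiB in the
worst case, so a 1MiB reservation would trigger a commit in this model.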

However, this leads to a new problem: it cannot be used along with the
flushoncommit option, otherwise it can cause an ABBA deadlock between
commit_transaction and the ordered extents flush.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c  | 50 ++---
 fs/btrfs/ordered-data.c |  6 ++
 fs/btrfs/transaction.c  | 41 
 fs/btrfs/transaction.h  |  1 +
 4 files changed, 87 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6f8b012..ec6f42d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2695,24 +2695,52 @@ static inline u64 heads_to_leaves(struct btrfs_root 
*root, u64 heads)
 int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
   struct btrfs_root *root)
 {
+	struct btrfs_delayed_ref_root *delayed_refs;
 	struct btrfs_block_rsv *global_rsv;
-	u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
+	u64 num_heads;
+	u64 num_entries;
 	u64 num_bytes;
 	int ret = 0;
 
-	num_bytes = btrfs_calc_trans_metadata_size(root, 1);
-	num_heads = heads_to_leaves(root, num_heads);
-	if (num_heads > 1)
-		num_bytes += (num_heads - 1) * root->leafsize;
-	num_bytes <<= 1;
 	global_rsv = &root->fs_info->global_block_rsv;
 
-	/*
-	 * If we can't allocate any more chunks lets make sure we have _lots_ of
-	 * wiggle room since running delayed refs can create more delayed refs.
-	 */
-	if (global_rsv->space_info->full)
+	if (trans) {
+		num_heads = trans->transaction->delayed_refs.num_heads_ready;
+		num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+		num_heads = heads_to_leaves(root, num_heads);
+		if (num_heads > 1)
+			num_bytes += (num_heads - 1) * root->leafsize;
 		num_bytes <<= 1;
+		/*
+		 * If we can't allocate any more chunks lets make sure we have
+		 * _lots_ of wiggle room since running delayed refs can create
+		 * more delayed refs.
+		 */
+		if (global_rsv->space_info->full)
+			num_bytes <<= 1;
+	} else {
+		if (root->fs_info->dedup_bs == 0)
+			return 0;
+
+		/* dedup enabled */
+		spin_lock(&root->fs_info->trans_lock);
+		if (!root->fs_info->running_transaction) {
+			spin_unlock(&root->fs_info->trans_lock);
+			return 0;
+		}
+
+		delayed_refs =
+			 &root->fs_info->running_transaction->delayed_refs;
+
+		num_entries = atomic_read(&delayed_refs->num_entries);
+		num_heads = delayed_refs->num_heads;
+
+		spin_unlock(&root->fs_info->trans_lock);
+
+		/* The worst case */
+		num_bytes = (num_entries - num_heads) *
+			btrfs_calc_trans_metadata_size(root, 1);
+	}
 
 	spin_lock(&global_rsv->lock);
 	if (global_rsv->reserved <= num_bytes)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index c520e13..72c0caa 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -747,6 +747,12 @@ int btrfs_run_ordered_operations(struct btrfs_trans_handle 
*trans,
  cur_trans-ordered_operations);
spin_unlock(root-fs_info-ordered_root_lock);
 
+   if (cur_trans-blocked) {
+   cur_trans-blocked = 0;
+   if (waitqueue_active(cur_trans-commit_wait))
+   wake_up(cur_trans-commit_wait);
+   }
+
work = btrfs_alloc_delalloc_work(inode, wait, 1);
if (!work) {

[PATCH v10 14/16] Btrfs: fix wrong pinned bytes in __btrfs_free_extent

2014-04-09 Thread Liu Bo
With the special dedup reference, in the case of (refs == 1) in
__btrfs_free_extent, we'll actually free the extent, so its pinned_bytes
should not be added to that global counter.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1cb3ec5..b8fee86 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5915,9 +5915,6 @@ again:
goto out;
}
}
-		add_pinned_bytes(root->fs_info, -num_bytes, owner_objectid,
-				 root_objectid);
-
/*
 * special case for dedup
 *
@@ -5934,6 +5931,9 @@ again:
refs_to_drop = 1;
 
goto again;
+			} else {
+				add_pinned_bytes(root->fs_info, -num_bytes,
+						 owner_objectid, root_objectid);
 			}
 		} else {
 			if (is_data && root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
-- 
1.8.1.4



[PATCH v5] Btrfs-progs: add dedup subcommand

2014-04-09 Thread Liu Bo
This adds deduplication subcommands, 'btrfs dedup <command> <path>',
including enable/disable/on/off.

- btrfs dedup enable
Create the dedup tree. This is the very first step when you're going to use
the dedup feature.

- btrfs dedup disable
Delete the dedup tree. After this, dedup can no longer be used unless
you enable it again.

- btrfs dedup on [-b]
Switch on the dedup feature temporarily. This is the second step of applying
dedup to writes.  Option '-b' sets the dedup blocksize.
The default blocksize is 8192 (no special reason, you may argue), and the current
limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the
upper limit of btrfs's compression.

- btrfs dedup off
Switch off the dedup feature temporarily, but the dedup tree remains.

-
Usage:
Step 1: btrfs dedup enable /btrfs
Step 2: btrfs dedup on /btrfs or btrfs dedup on -b 4K /btrfs
Step 3: now we have dedup, run your test.
Step 4: btrfs dedup off /btrfs
Step 5: btrfs dedup disable /btrfs
-
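
As an aside, the blocksize rule described above (default 8192, allowed range
[4096, 128 * 1024]) could be checked with something like the sketch below.
This is not the code in cmds-dedup.c; parse_dedup_blocksize() is a
hypothetical helper that only understands plain byte counts and a 'K' suffix.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEDUP_BS_DEFAULT 8192ULL
#define DEDUP_BS_MIN 4096ULL
#define DEDUP_BS_MAX (128ULL * 1024)

/*
 * Parse a '-b' argument such as "8192" or "4K". A NULL argument selects
 * the default. Returns 0 and stores the size in *bs, or -EINVAL.
 */
static int parse_dedup_blocksize(const char *arg, uint64_t *bs)
{
	char *end;
	unsigned long long v;

	if (!arg) {
		*bs = DEDUP_BS_DEFAULT;
		return 0;
	}
	/* Reject signs and leading whitespace that strtoull would accept. */
	if (*arg < '0' || *arg > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(arg, &end, 10);
	if (errno || end == arg)
		return -EINVAL;

	if ((*end == 'K' || *end == 'k') && end[1] == '\0') {
		if (v > DEDUP_BS_MAX / 1024)
			return -EINVAL;
		v *= 1024;
	} else if (*end != '\0') {
		return -EINVAL;
	}

	if (v < DEDUP_BS_MIN || v > DEDUP_BS_MAX)
		return -EINVAL;

	*bs = v;
	return 0;
}
```

With this helper, "4K" and "128K" are accepted, while "2048" and "256K" are
rejected as out of range.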

Signed-off-by: Liu Bo bo.li@oracle.com
---
v5: rebase onto the latest btrfs-progs v3.14.
v4: rebase and reserve spare space in btrfs_ioctl_dedup_args struct. 
v3: add commands 'btrfs dedup on/off'
v2: add manpage

 Makefile   |   2 +-
 btrfs.c|   1 +
 cmds-dedup.c   | 178 +
 commands.h |   2 +
 ctree.h|   2 +
 ioctl.h|  12 
 man/btrfs.8.in |  31 +-
 7 files changed, 224 insertions(+), 4 deletions(-)
 create mode 100644 cmds-dedup.c

diff --git a/Makefile b/Makefile
index da05197..369df6c 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o 
cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
-  cmds-property.o
+  cmds-property.o cmds-dedup.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
   uuid-tree.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
diff --git a/btrfs.c b/btrfs.c
index 98ff6f5..16458ef 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -256,6 +256,7 @@ static const struct cmd_group btrfs_cmd_group = {
 		{ "quota", cmd_quota, NULL, &quota_cmd_group, 0 },
 		{ "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 },
 		{ "replace", cmd_replace, NULL, &replace_cmd_group, 0 },
+		{ "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 },
 		{ "help", cmd_help, cmd_help_usage, NULL, 0 },
 		{ "version", cmd_version, cmd_version_usage, NULL, 0 },
NULL_CMD_STRUCT
diff --git a/cmds-dedup.c b/cmds-dedup.c
new file mode 100644
index 000..b959349
--- /dev/null
+++ b/cmds-dedup.c
@@ -0,0 +1,178 @@
+/*
+ * Copyright (C) 2013 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <getopt.h>
+
+#include "ctree.h"
+#include "ioctl.h"
+
+#include "commands.h"
+#include "utils.h"
+
+static const char * const dedup_cmd_group_usage[] = {
+	"btrfs dedup <command> [options] <path>",
+	NULL
+};
+
+int dedup_ctl(char *path, struct btrfs_ioctl_dedup_args *args)
+{
+	int ret = 0;
+	int fd;
+	int e;
+	DIR *dirstream = NULL;
+
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: can't access '%s'\n", path);
+		return -EACCES;
+	}
+
+	ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, args);
+	e = errno;
+	close_file_or_dir(fd, dirstream);
+	if (ret < 0) {
+		fprintf(stderr, "ERROR: dedup command failed: %s\n",
+			strerror(e));
+		if (args->cmd == BTRFS_DEDUP_CTL_DISABLE ||
+		    args->cmd == BTRFS_DEDUP_CTL_SET_BS)
+			fprintf(stderr, "please refer to 'dmesg | tail' for more info\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static const char * const cmd_dedup_enable_usage[] = {
+   btrfs dedup enable path,
+   Enable data 

[PATCH v10 10/16] Btrfs: improve the delayed refs process in rm case

2014-04-09 Thread Liu Bo
While removing a file with dedup extents, we could have a great number of
delayed refs pending, and each of these refs drops a ref of the extent,
i.e. it is of BTRFS_DROP_DELAYED_REF type.

But in order to prevent an extent's ref count from going down to zero while
there are still pending delayed refs, we first select the refs that add a
reference, which are of BTRFS_ADD_DELAYED_REF type.

So in the removal case, all of our delayed refs are of BTRFS_DROP_DELAYED_REF type,
but we have to walk all the refs issued to the extent looking for any
BTRFS_ADD_DELAYED_REF, only to find there is none, and then start
over again to find a BTRFS_DROP_DELAYED_REF.

This is really unnecessary. We can improve this by tracking how many
BTRFS_ADD_DELAYED_REF refs we have and searching by the right type.
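
The effect of the counter can be shown with a small model (an illustration
only: it uses a flat array instead of the rbtree, does not remove the selected
ref, and all names are made up for this example). With add_cnt == 0 the search
asks for a DROP ref directly and stops at the first entry instead of scanning
the whole list for an ADD that is not there.

```c
#include <assert.h>
#include <stddef.h>

enum ref_action { REF_ADD = 1, REF_DROP = 2 };

struct ref {
	enum ref_action action;
};

struct ref_head {
	struct ref *refs;
	size_t nr;
	int add_cnt;	/* number of REF_ADD entries still in refs[] */
};

/*
 * Return the index of the next ref to process, or -1 if there is none.
 * *visited reports how many entries were inspected.
 */
static long select_ref(struct ref_head *head, size_t *visited)
{
	enum ref_action want = head->add_cnt > 0 ? REF_ADD : REF_DROP;
	long fallback = -1;

	*visited = 0;
	for (size_t i = 0; i < head->nr; i++) {
		(*visited)++;
		if (head->refs[i].action == want) {
			if (want == REF_ADD)
				head->add_cnt--;
			return (long)i;
		}
		if (fallback < 0)
			fallback = (long)i;	/* remember the first non-match */
	}
	return fallback;
}
```

In the rm case every entry is a DROP, so the first call already returns after
inspecting a single entry.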

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/delayed-ref.c |  8 
 fs/btrfs/delayed-ref.h |  3 +++
 fs/btrfs/extent-tree.c | 16 +---
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3ab37b6..6435d78 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -538,6 +538,9 @@ update_existing_head_ref(struct btrfs_delayed_ref_node 
*existing,
 * currently, for refs we just added we know we're a-ok.
 */
existing-ref_mod += update-ref_mod;
+   WARN_ON(update-ref_mod  1);
+   if (update-ref_mod == 1)
+   existing_ref-add_cnt++;
spin_unlock(existing_ref-lock);
 }
 
@@ -601,6 +604,11 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
head_ref-is_data = is_data;
head_ref-ref_root = RB_ROOT;
head_ref-processing = 0;
+   /* track added ref, more comments in select_delayed_ref() */
+   if (count_mod == 1)
+   head_ref-add_cnt = 1;
+   else
+   head_ref-add_cnt = 0;
 
spin_lock_init(head_ref-lock);
mutex_init(head_ref-mutex);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 4ba9b93..905f991 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -87,6 +87,9 @@ struct btrfs_delayed_ref_head {
struct rb_node href_node;
 
struct btrfs_delayed_extent_op *extent_op;
+
+   int add_cnt;
+
/*
 * when a new extent is allocated, it is just reserved in memory
 * The actual extent isn't inserted into the extent allocation tree
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 088846c..191f0a7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2347,7 +2347,11 @@ static noinline struct btrfs_delayed_ref_node *
 select_delayed_ref(struct btrfs_delayed_ref_head *head)
 {
 	struct rb_node *node;
-	struct btrfs_delayed_ref_node *ref, *last = NULL;;
+	struct btrfs_delayed_ref_node *ref, *last = NULL;
+	int action = BTRFS_ADD_DELAYED_REF;
+
+	if (head->add_cnt == 0)
+		action = BTRFS_DROP_DELAYED_REF;
 
 	/*
 	 * select delayed ref of type BTRFS_ADD_DELAYED_REF first.
@@ -2358,10 +2362,13 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head)
 	while (node) {
 		ref = rb_entry(node, struct btrfs_delayed_ref_node,
 				rb_node);
-		if (ref->action == BTRFS_ADD_DELAYED_REF)
+		if (ref->action == action) {
+			if (ref->action == BTRFS_ADD_DELAYED_REF)
+				head->add_cnt--;
 			return ref;
-		else if (last == NULL)
+		} else if (last == NULL) {
 			last = ref;
+		}
 		node = rb_next(node);
 	}
 	return last;
@@ -2435,6 +2442,9 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
 
if (ref  ref-seq 
btrfs_check_delayed_seq(fs_info, delayed_refs, ref-seq)) {
+   if (ref-action == BTRFS_ADD_DELAYED_REF)
+   locked_ref-add_cnt++;
+
spin_unlock(locked_ref-lock);
btrfs_delayed_ref_unlock(locked_ref);
spin_lock(delayed_refs-lock);
-- 
1.8.1.4



[PATCH v10 06/16] Btrfs: online(inband) data dedup

2014-04-09 Thread Liu Bo
The main part of data dedup.

This introduces a FORMAT CHANGE.

Btrfs provides online (inband/synchronous), block-level dedup.

It maps naturally to btrfs's block back-references, which enable us
to store multiple copies of data as a single copy with references
to that copy.

The workflow is
(1) write some data,
(2) get the hash of the data based on btrfs's dedup blocksize,
(3) find matched extents by hash and decide whether to mark the data
as a duplicate or not.  If not, write the data to disk;
otherwise, add a reference to the matched extent.

Btrfs's built-in dedup supports normal writes and compressed writes.
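
The workflow can be sketched in a few lines of standalone C. This is a toy
model, not the implementation in this patch: it uses 64-bit FNV-1a as a
stand-in for SHA-256 (and therefore also compares the data, since such a
short hash can collide), a fixed 4096-byte block, and an in-memory array in
place of the disk and the dedup tree. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BS 4096
#define MAX_EXTENTS 8

struct extent {
	uint64_t fingerprint;
	unsigned int refs;
	unsigned char data[BS];
};

static struct extent disk[MAX_EXTENTS];
static int nr_extents;

/* 64-bit FNV-1a; a stand-in only, far weaker than SHA-256. */
static uint64_t fingerprint(const unsigned char *buf, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Write one block. Returns the index of the extent holding the data, or
 * -1 when the toy disk is full. *deduped is set to 1 if an existing
 * extent was reused, 0 if the data was newly written.
 */
static int write_block(const unsigned char *buf, int *deduped)
{
	uint64_t fp = fingerprint(buf, BS);			/* step (2) */

	for (int i = 0; i < nr_extents; i++) {			/* step (3) */
		if (disk[i].fingerprint == fp &&
		    !memcmp(disk[i].data, buf, BS)) {
			disk[i].refs++;		/* duplicate: add a reference */
			*deduped = 1;
			return i;
		}
	}
	if (nr_extents == MAX_EXTENTS)
		return -1;

	/* No match: write the data and record its fingerprint. */
	disk[nr_extents].fingerprint = fp;
	disk[nr_extents].refs = 1;
	memcpy(disk[nr_extents].data, buf, BS);
	*deduped = 0;
	return nr_extents++;
}
```

Writing the same block twice leaves one extent with two references, which is
the space saving dedup is after.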

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 150 ++--
 fs/btrfs/extent_io.c   |   8 +-
 fs/btrfs/extent_io.h   |  11 +
 fs/btrfs/inode.c   | 640 +++--
 4 files changed, 712 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 06124c1..088846c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1123,8 +1123,16 @@ static noinline int lookup_extent_data_ref(struct 
btrfs_trans_handle *trans,
key.offset = parent;
} else {
key.type = BTRFS_EXTENT_DATA_REF_KEY;
-   key.offset = hash_extent_data_ref(root_objectid,
- owner, offset);
+
+   /*
+* we've not got the right offset and owner, so search by -1
+* here.
+*/
+   if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
+   key.offset = (u64)-1;
+   else
+   key.offset = hash_extent_data_ref(root_objectid,
+ owner, offset);
}
 again:
recow = 0;
@@ -1151,6 +1159,10 @@ again:
goto fail;
}
 
+   if (ret  0  root_objectid == BTRFS_DEDUP_TREE_OBJECTID 
+   path-slots[0]  0)
+   path-slots[0]--;
+
leaf = path-nodes[0];
nritems = btrfs_header_nritems(leaf);
while (1) {
@@ -1174,14 +1186,22 @@ again:
ref = btrfs_item_ptr(leaf, path-slots[0],
 struct btrfs_extent_data_ref);
 
-   if (match_extent_data_ref(leaf, ref, root_objectid,
- owner, offset)) {
-   if (recow) {
-   btrfs_release_path(path);
-   goto again;
+   if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+   if (btrfs_extent_data_ref_root(leaf, ref) ==
+   root_objectid) {
+   err = 0;
+   break;
+   }
+   } else {
+   if (match_extent_data_ref(leaf, ref, root_objectid,
+ owner, offset)) {
+   if (recow) {
+   btrfs_release_path(path);
+   goto again;
+   }
+   err = 0;
+   break;
}
-   err = 0;
-   break;
}
path-slots[0]++;
}
@@ -1325,6 +1345,32 @@ static noinline int remove_extent_data_ref(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
+static noinline u64 extent_data_ref_offset(struct btrfs_root *root,
+ struct btrfs_path *path,
+ struct btrfs_extent_inline_ref *iref)
+{
+   struct btrfs_key key;
+   struct extent_buffer *leaf;
+   struct btrfs_extent_data_ref *ref1;
+   u64 offset = 0;
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	if (iref) {
+		WARN_ON(btrfs_extent_inline_ref_type(leaf, iref) !=
+			BTRFS_EXTENT_DATA_REF_KEY);
+		ref1 = (struct btrfs_extent_data_ref *)(&iref->offset);
+		offset = btrfs_extent_data_ref_offset(leaf, ref1);
+	} else if (key.type == BTRFS_EXTENT_DATA_REF_KEY) {
+		ref1 = btrfs_item_ptr(leaf, path->slots[0],
+				      struct btrfs_extent_data_ref);
+		offset = btrfs_extent_data_ref_offset(leaf, ref1);
+	} else {
+		WARN_ON(1);
+	}
+	return offset;
+}
+
 static noinline u32 extent_data_ref_count(struct btrfs_root *root,
  struct btrfs_path *path,
  struct btrfs_extent_inline_ref *iref)
@@ -1591,7 +1637,8 @@ again:
err = -ENOENT;
while (1) {
if (ptr = end) {
-   

[PATCH v10 13/16] Btrfs: fix transactin abortion in __btrfs_free_extent

2014-04-09 Thread Liu Bo
We need to reset @refs_to_drop to 1 when we're going to delete the last
special dedup reference; otherwise we can trigger (@refs < @refs_to_drop)
and end up with a transaction abort.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/extent-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c1c342..1cb3ec5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5931,6 +5931,7 @@ again:
btrfs_release_path(path);
root_objectid = BTRFS_DEDUP_TREE_OBJECTID;
parent = 0;
+   refs_to_drop = 1;
 
goto again;
}
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
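
The failure mode described in the commit message above can be illustrated with a toy model. This is only a sketch under stated assumptions: drop_refs() and free_extent_model() are hypothetical names, not kernel functions, and the model assumes an extent that carries some number of file references plus exactly one dedup reference, with the dedup reference removed in a second pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the reference-count check: refuse to drop
 * more references than the extent still has. In the kernel this is the
 * point where the transaction would be aborted.
 */
static int drop_refs(uint64_t *refs, uint64_t refs_to_drop)
{
	assert(refs != NULL);
	if (*refs < refs_to_drop)
		return -1;
	*refs -= refs_to_drop;
	return 0;
}

/*
 * Toy model of freeing an extent with `file_refs` file references and
 * one dedup reference. The first pass drops the file references, the
 * second pass drops the remaining dedup reference.
 */
static int free_extent_model(uint64_t file_refs, int reset_to_one)
{
	uint64_t refs = file_refs + 1;	/* file refs plus the dedup ref */
	uint64_t refs_to_drop = file_refs;

	if (drop_refs(&refs, refs_to_drop))
		return -1;

	/* Only the dedup ref is left; this mirrors "refs_to_drop = 1". */
	if (reset_to_one)
		refs_to_drop = 1;

	return drop_refs(&refs, refs_to_drop);
}
```

With three file references, free_extent_model(3, 0) fails in the second pass because one remaining reference is fewer than the stale refs_to_drop of 3, while free_extent_model(3, 1) succeeds.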


[RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-09 Thread Liu Bo
Hello,

This is the 10th attempt at in-band data dedupe, based on the Linux _3.14_ kernel.

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to the Content based storage item in project
ideas[2]. It introduces in-band data deduplication for btrfs, called dedup
(or dedupe) for short.
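
As a rough illustration of the in-band idea (not the implementation in this series), the sketch below hashes each incoming block, looks the hash up in a small table, and reuses the stored copy when both the hash and the bytes match. Everything here is an assumption made for the example: the names dedup_store, block_hash and dedup_write, the tiny block size, the linear lookup, and FNV-1a as a stand-in hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8	/* tiny block size, for illustration only */
#define MAX_BLOCKS 16	/* capacity of the toy store */

struct dedup_store {
	unsigned char blocks[MAX_BLOCKS][BLOCK_SIZE];	/* unique blocks */
	uint64_t hashes[MAX_BLOCKS];			/* hash of each block */
	unsigned int refs[MAX_BLOCKS];			/* references per block */
	size_t nr;					/* blocks in use */
};

/* FNV-1a over one block; a stand-in hash chosen for brevity. */
static uint64_t block_hash(const unsigned char *data)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (size_t i = 0; i < BLOCK_SIZE; i++) {
		h ^= data[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/*
 * Write one block. Returns the index of the block that now holds the
 * data, or -1 if the store is full. A hash match is confirmed with a
 * byte-for-byte comparison before the existing block is reused.
 */
static int dedup_write(struct dedup_store *s, const unsigned char *data)
{
	uint64_t h;

	assert(s != NULL && data != NULL);
	h = block_hash(data);

	for (size_t i = 0; i < s->nr; i++) {
		if (s->hashes[i] == h &&
		    memcmp(s->blocks[i], data, BLOCK_SIZE) == 0) {
			s->refs[i]++;	/* duplicate: add a reference only */
			return (int)i;
		}
	}

	if (s->nr == MAX_BLOCKS)
		return -1;

	memcpy(s->blocks[s->nr], data, BLOCK_SIZE);
	s->hashes[s->nr] = h;
	s->refs[s->nr] = 1;
	return (int)s->nr++;
}
```

Writing the same 8-byte block twice to a zero-initialized store returns the same index both times and leaves a single stored copy with a reference count of 2.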

* PATCH 1 is a speed-up improvement, which is about dedup and quota.

* PATCH 2-5 are the preparation work for the dedup implementation.

* PATCH 6 shows how we implement the dedup feature.

* PATCH 7 fixes a backref walking bug with dedup.

* PATCH 8 fixes a free space bug of dedup extents on error handling.

* PATCH 9 adds the ioctl to control the dedup feature.

* PATCH 10 targets the delayed refs' scalability problem of deleting refs,
  which is uncovered by the dedup feature.

* PATCH 11-16 fix bugs of dedupe including a race bug, a deadlock, an abnormal
  transaction abort and a crash.

* btrfs-progs patch(PATCH 17) offers all details about how to control the
  dedup feature on the progs side.

I've tested this with xfstests by adding an inline dedup 'enable & on' to
xfstests' mount and scratch_mount.


***NOTE***
Known bugs:
* Mounting with the flushoncommit option and enabling the dedupe feature will
  end up in a _deadlock_.


TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v10:
- fix a typo in the subject line.
- update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
  'Inappropriate ioctl for device'.

v9:
- fix a deadlock and a crash reported by users.
- fix the metadata ENOSPC problem with dedup again.

v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the latest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use the ioctl method instead.
- fix two crashes due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (16):
  Btrfs: disable qgroups accounting when quota_enable is 0
  Btrfs: introduce dedup tree and relatives
  Btrfs: introduce dedup tree operations
  Btrfs: introduce dedup state
  Btrfs: make ordered extent aware of dedup
  Btrfs: online(inband) data dedup
  Btrfs: skip dedup reference during backref walking
  Btrfs: don't return space for dedup extent
  Btrfs: add ioctl of dedup control
  Btrfs: improve the delayed refs process in rm case
  Btrfs: fix a crash of dedup ref
  Btrfs: fix deadlock of dedup work
  Btrfs: fix transaction abortion in __btrfs_free_extent
  Btrfs: fix wrong pinned bytes in __btrfs_free_extent
  Btrfs: use total_bytes instead of bytes_used for global_rsv
  Btrfs: fix dedup enospc problem

 fs/btrfs/backref.c           |   9 +
 fs/btrfs/ctree.c             |   2 +-
 fs/btrfs/ctree.h             |  86 ++
 fs/btrfs/delayed-ref.c       |  26 +-
 fs/btrfs/delayed-ref.h       |   3 +
 fs/btrfs/disk-io.c           |  37 +++
 fs/btrfs/extent-tree.c       | 235 +---
 fs/btrfs/extent_io.c         |  22 +-
 fs/btrfs/extent_io.h         |  16 ++
 fs/btrfs/file-item.c         | 244 +
 fs/btrfs/inode.c             | 635 ++-
 fs/btrfs/ioctl.c             | 167
 fs/btrfs/ordered-data.c      |  44 ++-
 fs/btrfs/ordered-data.h      |  13 +-
 fs/btrfs/qgroup.c            |   3 +
 fs/btrfs/relocation.c        |   3 +
 fs/btrfs/transaction.c       |  41 +++
 fs/btrfs/transaction.h       |   1 +
 include/trace/events/btrfs.h |   3 +-
 include/uapi/linux/btrfs.h   |  12 +
 20 files changed, 1471 insertions(+), 131 deletions(-)

-- 
1.8.2.1


[PATCH v10 07/16] Btrfs: skip dedup reference during backref walking

2014-04-09 Thread Liu Bo
The dedup ref is a special one: it is only used to store the hash value
of the extent and cannot be used to find data, so we skip it during backref
walking.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/backref.c    | 9 +
 fs/btrfs/relocation.c | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index aad7201..5e57949 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -623,6 +623,9 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
key.objectid = ref->objectid;
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = ref->offset;
+   if (ref->root == BTRFS_DEDUP_TREE_OBJECTID)
+   break;
+
ret = __add_prelim_ref(prefs, ref->root, &key, 0, 0,
       node->bytenr,
       node->ref_mod * sgn, GFP_ATOMIC);
@@ -743,6 +746,9 @@ static int __add_inline_refs(struct btrfs_fs_info *fs_info,
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = btrfs_extent_data_ref_offset(leaf, dref);
root = btrfs_extent_data_ref_root(leaf, dref);
+   if (root == BTRFS_DEDUP_TREE_OBJECTID)
+   break;
+
ret = __add_prelim_ref(prefs, root, &key, 0, 0,
   bytenr, count, GFP_NOFS);
break;
@@ -826,6 +832,9 @@ static int __add_keyed_refs(struct btrfs_fs_info *fs_info,
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = btrfs_extent_data_ref_offset(leaf, dref);
root = btrfs_extent_data_ref_root(leaf, dref);
+   if (root == BTRFS_DEDUP_TREE_OBJECTID)
+   break;
+
ret = __add_prelim_ref(prefs, root, &key, 0, 0,
   bytenr, count, GFP_NOFS);
break;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index def428a..8431294 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3508,6 +3508,9 @@ static int find_data_references(struct reloc_control *rc,
ref_offset = btrfs_extent_data_ref_offset(leaf, ref);
ref_count = btrfs_extent_data_ref_count(leaf, ref);
 
+   if (ref_root == BTRFS_DEDUP_TREE_OBJECTID)
+   return 0;
+
/*
 * This is an extent belonging to the free space cache, lets just delete
 * it and redo the search.
-- 
1.8.1.4
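
The skip rule in the patch above can be sketched outside the kernel as a simple filter: while collecting back references for an extent, entries owned by the dedup tree are ignored because they only record a hash and do not lead to file data. The struct data_ref, the function collect_backrefs and the numeric value of DEDUP_ROOT_ID are hypothetical and exist only for this example; the real constant is BTRFS_DEDUP_TREE_OBJECTID, whose value is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value, used only so the example compiles. */
#define DEDUP_ROOT_ID 100ULL

/* Simplified data reference: owning root, inode number, file offset. */
struct data_ref {
	uint64_t root;
	uint64_t objectid;
	uint64_t offset;
};

/*
 * Copy every reference that can be used to find data into `out` and
 * return how many were copied. References owned by the dedup root are
 * skipped. `out` must have room for `nr` entries.
 */
static size_t collect_backrefs(const struct data_ref *refs, size_t nr,
			       struct data_ref *out)
{
	size_t n = 0;

	assert(refs != NULL && out != NULL);
	for (size_t i = 0; i < nr; i++) {
		if (refs[i].root == DEDUP_ROOT_ID)
			continue;	/* hash-only ref, not a data owner */
		out[n++] = refs[i];
	}
	return n;
}
```

Given three references of which one belongs to the dedup root, collect_backrefs() returns 2 and preserves the order of the remaining entries.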



[PATCH v10 08/16] Btrfs: don't return space for dedup extent

2014-04-09 Thread Liu Bo
If the ordered extent had an IOERR or something else went wrong we need to
return the space for this ordered extent back to the allocator, but if the
extent is marked as a dedup one, we don't free the space because we just
use the existing space instead of allocating new space.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e031bf..0c1a43e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3213,6 +3213,7 @@ out:
 * truncated case if we didn't write out the extent at all.
 */
if ((ret || !logical_len) &&
+   !ordered_extent->dedup &&
!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
!test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
btrfs_free_reserved_extent(root, ordered_extent->start,
-- 
1.8.1.4



[PATCH v10 04/16] Btrfs: introduce dedup state

2014-04-09 Thread Liu Bo
This introduces the dedup state and related operations to mark and unmark
a dedup data range; they will be used in later patches.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/extent_io.c | 14 ++
 fs/btrfs/extent_io.h |  5 +
 2 files changed, 19 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ae69a00..d51487b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1296,6 +1296,20 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
cached_state, mask);
 }
 
+int set_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+struct extent_state **cached_state, gfp_t mask)
+{
+   return set_extent_bit(tree, start, end, EXTENT_DEDUP, 0,
+ cached_state, mask);
+}
+
+int clear_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+ struct extent_state **cached_state, gfp_t mask)
+{
+   return clear_extent_bit(tree, start, end, EXTENT_DEDUP, 0, 0,
+   cached_state, mask);
+}
+
 /*
  * either insert or lock state struct between start and end use mask to tell
  * us if waiting is desired.
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 58b27e5..897110d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -20,6 +20,7 @@
 #define EXTENT_NEED_WAIT (1 << 13)
 #define EXTENT_DAMAGED (1 << 14)
 #define EXTENT_NORESERVE (1 << 15)
+#define EXTENT_DEDUP (1 << 16)
 #define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
@@ -226,6 +227,10 @@ int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
struct extent_state **cached_state, gfp_t mask);
 int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
  struct extent_state **cached_state, gfp_t mask);
+int set_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+struct extent_state **cached_state, gfp_t mask);
+int clear_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+  struct extent_state **cached_state, gfp_t mask);
 int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
   gfp_t mask);
 int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
-- 
1.8.1.4
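
The set/clear pair added above follows the usual state-bit pattern: one flag bit per state, set or cleared over an inclusive [start, end] range. A minimal user-space sketch of that pattern follows, assuming a flat per-block array instead of the kernel's extent_io_tree; set_range_bits, clear_range_bits, range_has_bits and NR_BLOCKS are hypothetical names, and only the EXTENT_DEDUP bit position is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXTENT_DEDUP (1U << 16)	/* bit position as in the patch */
#define NR_BLOCKS 8		/* size of the toy state array */

/* Set `bits` on every block in the inclusive range [start, end]. */
static void set_range_bits(uint32_t *state, size_t start, size_t end,
			   uint32_t bits)
{
	assert(state != NULL && start <= end && end < NR_BLOCKS);
	for (size_t i = start; i <= end; i++)
		state[i] |= bits;
}

/* Clear `bits` on every block in the inclusive range [start, end]. */
static void clear_range_bits(uint32_t *state, size_t start, size_t end,
			     uint32_t bits)
{
	assert(state != NULL && start <= end && end < NR_BLOCKS);
	for (size_t i = start; i <= end; i++)
		state[i] &= ~bits;
}

/* Return 1 if every block in [start, end] has all of `bits` set. */
static int range_has_bits(const uint32_t *state, size_t start, size_t end,
			  uint32_t bits)
{
	assert(state != NULL && start <= end && end < NR_BLOCKS);
	for (size_t i = start; i <= end; i++)
		if ((state[i] & bits) != bits)
			return 0;
	return 1;
}
```

Marking blocks 2 to 5 and then clearing block 3 leaves range_has_bits() true for [4, 5] and false for [2, 5].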



[PATCH] btrfs-progs: correct prompt of minimal num of devs for raid56

2014-04-09 Thread Gui Hecheng
For btrfs,
Raid5 can't go below 2 devs, not 3;
Raid6 can't go below 3 devs, not 4.

Signed-off-by: Gui Hecheng <guihc.f...@cn.fujitsu.com>
---
 ioctl.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ioctl.h b/ioctl.h
index c3ee270..6742ba6 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -489,9 +489,9 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
case BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET:
return "unable to go below four devices on raid10";
case BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET:
-   return "unable to go below three devices on raid5";
+   return "unable to go below two devices on raid5";
case BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET:
-   return "unable to go below four devices on raid6";
+   return "unable to go below three devices on raid6";
case BTRFS_ERROR_DEV_TGT_REPLACE:
return "unable to remove the dev_replace target dev";
case BTRFS_ERROR_DEV_MISSING_NOT_FOUND:
case BTRFS_ERROR_DEV_MISSING_NOT_FOUND:
-- 
1.8.1.4
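
The corrected minimums can be restated as a small check. This is a sketch only: enum raid_profile, min_devices and can_remove_device are hypothetical names that do not exist in btrfs-progs, and the numbers are simply the ones from the messages above (raid10: four, raid5: two, raid6: three).

```c
#include <assert.h>

/* Hypothetical profile enum covering only the cases named in the patch. */
enum raid_profile { PROFILE_RAID10, PROFILE_RAID5, PROFILE_RAID6 };

/* Minimum device count the filesystem may not go below. */
static int min_devices(enum raid_profile profile)
{
	switch (profile) {
	case PROFILE_RAID10:
		return 4;
	case PROFILE_RAID5:
		return 2;
	case PROFILE_RAID6:
		return 3;
	}
	return 0;	/* not reached for valid enum values */
}

/* Return 1 if one device can be removed without going below the minimum. */
static int can_remove_device(enum raid_profile profile, int num_devices)
{
	assert(num_devices >= 0);
	return num_devices - 1 >= min_devices(profile);
}
```

Under these numbers a three-device raid5 may drop to two devices, while a two-device raid5 or a three-device raid6 may not lose another one.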
