Re: [PATCH] btrfs: fix check_shared for fiemap ioctl

2016-05-23 Thread luke

Does anyone have interest in this patch?


On 2016-05-16 11:23, Lu Fengqi wrote:

check_shared only identified an extent as shared when the root_id or the
object_id differed. However, if an extent is referenced at different
offsets of the same file, it should also be identified as shared.
In addition, check_shared's loop is at least O(n^3), so an extent with
too many references can even cause a soft hang.

First, add all delayed refs to the ref_tree and calculate unique_refs;
if unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then add the on-disk references (inline/keyed) to the ref_tree one by one,
recalculating unique_refs each time to check whether it is greater than
one. Because we return SHARED as soon as two references are found, the
time complexity is close to constant.
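
The counting rule described above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the patch's code: it assumes a small fixed-size array in place of the rb-tree, and ref_table, ref_table_add() and ref_table_is_shared() are hypothetical names. It tracks how many distinct (root_id, object_id, offset, parent) keys currently have a positive count, so a caller can stop as soon as that number exceeds one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REFS 64 /* arbitrary capacity for this sketch */

struct ref_entry {
    uint64_t root_id, object_id, offset, parent;
    int ref_mod; /* running count for this key */
};

struct ref_table {
    struct ref_entry entries[MAX_REFS];
    size_t nr;                /* distinct keys stored */
    unsigned int unique_refs; /* keys whose ref_mod is positive */
};

void ref_table_init(struct ref_table *t)
{
    assert(t != NULL);
    t->nr = 0;
    t->unique_refs = 0;
}

/* Add count (may be negative) to a key; returns 0, or -1 if the table is full. */
int ref_table_add(struct ref_table *t, uint64_t root_id, uint64_t object_id,
                  uint64_t offset, uint64_t parent, int count)
{
    struct ref_entry *e = NULL;
    size_t i;
    int before;

    /* Linear search stands in for the rb-tree lookup of the patch. */
    for (i = 0; i < t->nr; i++) {
        struct ref_entry *c = &t->entries[i];

        if (c->root_id == root_id && c->object_id == object_id &&
            c->offset == offset && c->parent == parent) {
            e = c;
            break;
        }
    }
    if (!e) {
        if (t->nr == MAX_REFS)
            return -1;
        e = &t->entries[t->nr++];
        e->root_id = root_id;
        e->object_id = object_id;
        e->offset = offset;
        e->parent = parent;
        e->ref_mod = 0;
    }

    before = e->ref_mod;
    e->ref_mod += count;
    /* unique_refs only changes when a key crosses between <= 0 and > 0. */
    if (before <= 0 && e->ref_mod > 0)
        t->unique_refs++;
    else if (before > 0 && e->ref_mod <= 0)
        t->unique_refs--;
    return 0;
}

int ref_table_is_shared(const struct ref_table *t)
{
    return t->unique_refs > 1;
}
```

A caller would add references one at a time and return as soon as ref_table_is_shared() reports true, which is what keeps the common case cheap.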

Reported-by: Tsutomu Itoh 
Signed-off-by: Lu Fengqi 
---
  fs/btrfs/backref.c   | 348 +--
  fs/btrfs/extent_io.c |  18 ++-
  2 files changed, 356 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 80e8472..1118c76 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -17,6 +17,7 @@
   */
  
  #include 

+#include 
  #include "ctree.h"
  #include "disk-io.h"
  #include "backref.h"
@@ -34,6 +35,249 @@ struct extent_inode_elem {
struct extent_inode_elem *next;
  };
  
+/*

+ * ref_root is used as the root of the ref tree that hold a collection
+ * of unique references.
+ */
+struct ref_root {
+   /*
+* the unique_refs represents the number of ref_nodes with a positive
+* count stored in the tree. Even if a ref_node(the count is greater
+* than one) is added, the unique_refs will only increase one.
+*/
+   unsigned int unique_refs;
+
+   struct rb_root rb_root;
+};
+
+/* ref_node is used to store a unique reference to the ref tree. */
+struct ref_node {
+   /* for NORMAL_REF, otherwise all these fields should be set to 0 */
+   u64 root_id;
+   u64 object_id;
+   u64 offset;
+
+   /* for SHARED_REF, otherwise parent field should be set to 0 */
+   u64 parent;
+
+   /* ref to the ref_mod of btrfs_delayed_ref_node(delayed-ref.h) */
+   int ref_mod;
+
+   struct rb_node rb_node;
+};
+
+/* dynamically allocate and initialize a ref_root */
+static struct ref_root *ref_root_alloc(gfp_t gfp_mask)
+{
+   struct ref_root *ref_tree;
+
+   ref_tree = kmalloc(sizeof(*ref_tree), gfp_mask);
+   if (!ref_tree)
+   return NULL;
+
+   ref_tree->rb_root = RB_ROOT;
+   ref_tree->unique_refs = 0;
+
+   return ref_tree;
+}
+
+/* free all node in the ref tree, and reinit ref_root */
+static void ref_root_fini(struct ref_root *ref_tree)
+{
+   struct ref_node *node;
+   struct rb_node *next;
+
+   while ((next = rb_first(&ref_tree->rb_root)) != NULL) {
+   node = rb_entry(next, struct ref_node, rb_node);
+   rb_erase(next, &ref_tree->rb_root);
+   kfree(node);
+   }
+
+   ref_tree->rb_root = RB_ROOT;
+   ref_tree->unique_refs = 0;
+}
+
+/* free dynamically allocated ref_root */
+static void ref_root_free(struct ref_root *ref_tree)
+{
+   if (!ref_tree)
+   return;
+
+   ref_root_fini(ref_tree);
+   kfree(ref_tree);
+}
+
+/*
+ * search ref_node with (root_id, object_id, offset, parent) in the tree
+ *
+ * if found, the pointer of the ref_node will be returned;
+ * if not found, NULL will be returned and pos will point to the rb_node for
+ * insert, pos_parent will point to pos'parent for insert;
+*/
+static struct ref_node *__ref_tree_search(struct ref_root *ref_tree,
+ struct rb_node ***pos,
+ struct rb_node **pos_parent,
+ u64 root_id, u64 object_id,
+ u64 offset, u64 parent)
+{
+   struct ref_node *cur = NULL;
+
+   *pos = &ref_tree->rb_root.rb_node;
+
+   while (**pos) {
+   *pos_parent = **pos;
+   cur = rb_entry(*pos_parent, struct ref_node, rb_node);
+
+   if (cur->root_id < root_id) {
+   *pos = &(**pos)->rb_right;
+   continue;
+   } else if (cur->root_id > root_id) {
+   *pos = &(**pos)->rb_left;
+   continue;
+   }
+
+   if (cur->object_id < object_id) {
+   *pos = &(**pos)->rb_right;
+   continue;
+   } else if (cur->object_id > object_id) {
+   *pos = &(**pos)->rb_left;
+   continue;
+   }
+
+   if (cur->offset < offset) {
+   *pos = &(**pos)->rb_right;
+   continue;
+   } else if 

Re: [RFC PATCH v2.1 16/16] btrfs-progs: fsck: Introduce low memory mode

2016-05-23 Thread Qu Wenruo



David Sterba wrote on 2016/05/23 13:08 +0200:

> On Fri, May 20, 2016 at 10:33:55AM +0800, Qu Wenruo wrote:
> > We'll enrich the test cases for current low memory mode.
>
> I started something to add optional default options for a few basic
> commands (mkfs, fsck, convert) to extend the coverage. I'm not finished,
> the idea is to call the commands via some wrapper that will grab the
> defaults from a file or from environment.



Thank you a lot.

But that's still not enough for low memory fsck yet.

Even if we add a --low-memory option so btrfsck can run on those images,
we still have the following problems:


1) Lack of support for repair
   Repair support for low memory mode is quite tricky, as we need to do a
   lot of record keeping beyond just calling btrfs_previous/next_item().

   This won't be implemented any time soon, and it will make almost
   all repair function tests fail for the low memory backend.

2) btrfs-image bug causing missing chunk stripe
   We're actively working on this before low memory mode for the fs tree
   check.

   In fact the problem has been here for a long time, and another bug
   in btrfsck, which ignores the error returned from the dev_extent
   check, lets btrfsck pass the fsck test images.

   Unfortunately (or fortunately?) low memory mode won't ignore such an
   error and always reports a missing chunk for dev_extent.

   Unless we fix btrfs-image (only the restore part is affected), low
   memory mode will always report an error on btrfs-image restored images.

3) Extra images
   During the development of low memory mode, we found that the current
   test images are all for some special fix case.

   There is no check on healthy images, not to mention tests on all
   possible extent backrefs.

   We have built such images for internal low memory mode tests, and
   hope to push them into the current tests.

   But since we don't have infrastructure for such check-only test cases,
   and due to the bug in 2), we still need some work for this.

So we still have some work to do.

Thanks,
Qu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: utils: use better wrappered random generator

2016-05-23 Thread Qu Wenruo



David Sterba wrote on 2016/05/23 14:01 +0200:

> The API does not seem right. It's fine to provide functions for full
> int/u32/u64 ranges, but in the cases when we know the range from which
> we request the random number, it has to be passed as parameter. Not
> doing the % by hand.

This makes sense.
I'll add a new function to generate a random number within a given range.

> > +u32 rand_u32(void)
> > +{
> > +   struct timeval tv;
> > +   unsigned short rand_seed[3];
>
> This could be made static (with thread local storage) so the state does
> not get regenerated all the time. Possibly it could be initialized from
> some true random source, not time or pid.

I also considered a true random source like /dev/random, but since it can
block waiting for the entropy pool, it would be quite slow and confusing
for users.

So time with pid seems good enough.

> > +   long int ret;
> > +   int i;
> > +
> > +   gettimeofday(&tv, 0);
> > +   rand_seed[0] = getpid() ^ (tv.tv_sec & 0x);
> > +   rand_seed[1] = getppid() ^ (tv.tv_usec & 0x);
> > +   rand_seed[2] = (tv.tv_sec ^ tv.tv_usec) >> 16;
> > +
> > +   /* Crank the random number generator a few times */
> > +   gettimeofday(&tv, 0);
> > +   for (i = (tv.tv_sec ^ tv.tv_sec) ^ 0x1F; i > 0; i--)
> > +   nrand48(rand_seed);
>
> This would be then unnecessary, just draw the number from nrand.

Right, this part is just copied from libuuid, but in fact we don't
really need to be that random.

> About patch separation: please introduce the new api in one patch, use
> in another (ie. drop srand and switch to it).

OK, I'll update it in the next version.

Thanks,
Qu




Re: btrfs filesystem usage - Wrong Unallocated indications - RAID10

2016-05-23 Thread Hugo Mills
On Mon, May 23, 2016 at 08:49:04PM +0200, Zoiled wrote:
[snip]
> For what it's worth...  I have an 8-disk (7x 300GB + 1x 500GB) data
> raid10, metadata raid1 setup and I get the following output of
> btrfs...
> 
> Label: 'xxyyzz'  uuid: 12345678-9abc-def1-2345-6789abcdef01
> Total devices 8 FS bytes used 1.05TiB
> devid1 size 268.05GiB used 265.94GiB path /dev/sda1
> devid2 size 279.40GiB used 277.22GiB path /dev/sdb
> devid3 size 279.40GiB used 277.32GiB path /dev/sdc
> devid4 size 279.40GiB used 278.73GiB path /dev/sdd
> devid5 size 279.40GiB used 277.72GiB path /dev/sde
> devid6 size 279.40GiB used 278.61GiB path /dev/sdf
> devid7 size 279.40GiB used 278.82GiB path /dev/sdg
> devid8 size 465.76GiB used 230.99GiB path /dev/sdh
> 
> # btrfs filesystem usage -T /
> Overall:
> Device size:   2.35TiB
> Device allocated:  2.11TiB
> Device unallocated:  244.83GiB
> Device missing:  0.00B
> Used:  2.11TiB
> Free (estimated):122.57GiB  (min: 122.57GiB)
> Data ratio:   2.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 0.00B)
> 
>                    Data       Data   Metadata     System
> Id Path           RAID1     RAID10      RAID1      RAID1 Unallocated
> -- --------- ---------- ---------- ---------- ---------- -----------
>  1 /dev/sda1          -  132.97GiB          -          -   135.08GiB
>  2 /dev/sdb           -  138.61GiB          -          -   140.79GiB
>  3 /dev/sdc           -  138.66GiB          -          -   140.74GiB
>  4 /dev/sdd           -  138.87GiB    1.00GiB          -   139.53GiB
>  5 /dev/sde     1.00GiB  137.86GiB    1.00GiB          -   139.53GiB
>  6 /dev/sdf           -  138.81GiB    1.00GiB          -   139.59GiB
>  7 /dev/sdg     1.00GiB  138.38GiB    1.00GiB   64.00MiB   138.96GiB
>  8 /dev/sdh           -  113.46GiB    4.00GiB   64.00MiB   348.24GiB
> -- --------- ---------- ---------- ---------- ---------- -----------
>    Total        1.00GiB    1.05TiB    4.00GiB   64.00MiB     1.29TiB
>    Used      1007.30MiB    1.05TiB    1.66GiB  400.00KiB
> 
> What I don't get is... how can I have 244.8 GB unallocated when the
> table clearly shows as much as 1.29TiB unallocated? That does not
> make sense, to me at least...

   This is exactly the issue. The Unallocated value(s) from btrfs fi
usage on at least RAID-10 are simply wrong, any way you look at it.

   Hugo.

-- 
Hugo Mills | Great films about cricket: The Third Man
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs filesystem usage - Wrong Unallocated indications - RAID10

2016-05-23 Thread Zoiled

For what it's worth...  I have an 8-disk (7x 300GB + 1x 500GB) data
raid10, metadata raid1 setup and I get the following output of btrfs...


Label: 'xxyyzz'  uuid: 12345678-9abc-def1-2345-6789abcdef01
Total devices 8 FS bytes used 1.05TiB
devid1 size 268.05GiB used 265.94GiB path /dev/sda1
devid2 size 279.40GiB used 277.22GiB path /dev/sdb
devid3 size 279.40GiB used 277.32GiB path /dev/sdc
devid4 size 279.40GiB used 278.73GiB path /dev/sdd
devid5 size 279.40GiB used 277.72GiB path /dev/sde
devid6 size 279.40GiB used 278.61GiB path /dev/sdf
devid7 size 279.40GiB used 278.82GiB path /dev/sdg
devid8 size 465.76GiB used 230.99GiB path /dev/sdh

# btrfs filesystem usage -T /
Overall:
Device size:   2.35TiB
Device allocated:  2.11TiB
Device unallocated:  244.83GiB
Device missing:  0.00B
Used:  2.11TiB
Free (estimated):122.57GiB  (min: 122.57GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

                   Data       Data   Metadata     System
Id Path           RAID1     RAID10      RAID1      RAID1 Unallocated
-- --------- ---------- ---------- ---------- ---------- -----------
 1 /dev/sda1          -  132.97GiB          -          -   135.08GiB
 2 /dev/sdb           -  138.61GiB          -          -   140.79GiB
 3 /dev/sdc           -  138.66GiB          -          -   140.74GiB
 4 /dev/sdd           -  138.87GiB    1.00GiB          -   139.53GiB
 5 /dev/sde     1.00GiB  137.86GiB    1.00GiB          -   139.53GiB
 6 /dev/sdf           -  138.81GiB    1.00GiB          -   139.59GiB
 7 /dev/sdg     1.00GiB  138.38GiB    1.00GiB   64.00MiB   138.96GiB
 8 /dev/sdh           -  113.46GiB    4.00GiB   64.00MiB   348.24GiB
-- --------- ---------- ---------- ---------- ---------- -----------
   Total        1.00GiB    1.05TiB    4.00GiB   64.00MiB     1.29TiB
   Used      1007.30MiB    1.05TiB    1.66GiB  400.00KiB

What I don't get is... how can I have 244.8 GB unallocated when the
table clearly shows as much as 1.29TiB unallocated? That does not
make sense, to me at least...
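
The two figures can be cross-checked by hand. Below is a small sketch that uses only the numbers from the 'btrfs fi show' output and the Unallocated column of the table; the function names are made up for illustration. The per-device "size minus used" values add up to about 244.86GiB, which matches the "Device unallocated" line, while the table column adds up to about 1322.46GiB (1.29TiB). The table values look consistent with each device's RAID10 allocation being counted at half of what 'fi show' reports as used, although that is only an inference from these numbers and has not been checked against the btrfs-progs source.

```c
#include <assert.h>
#include <stddef.h>

#define NR_DEVS 8

/* "size" and "used" per devid, in GiB, from the 'btrfs fi show' output. */
static const double dev_size_gib[NR_DEVS] = {
    268.05, 279.40, 279.40, 279.40, 279.40, 279.40, 279.40, 465.76
};
static const double dev_used_gib[NR_DEVS] = {
    265.94, 277.22, 277.32, 278.73, 277.72, 278.61, 278.82, 230.99
};
/* "Unallocated" column of 'btrfs filesystem usage -T', in GiB. */
static const double table_unalloc_gib[NR_DEVS] = {
    135.08, 140.79, 140.74, 139.53, 139.53, 139.59, 138.96, 348.24
};

double sum_gib(const double *v, size_t n)
{
    double total = 0.0;
    size_t i;

    assert(v != NULL);
    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Raw space not yet allocated to any chunk, according to 'fi show'. */
double unallocated_from_show_gib(void)
{
    return sum_gib(dev_size_gib, NR_DEVS) - sum_gib(dev_used_gib, NR_DEVS);
}

/* Sum of the per-device Unallocated column of the usage table. */
double unallocated_from_table_gib(void)
{
    return sum_gib(table_unalloc_gib, NR_DEVS);
}
```

For example, /dev/sda1 has 268.05 - 265.94 = 2.11GiB unallocated according to 'fi show', while the table lists 135.08GiB, which equals 268.05 - 132.97.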



Re: btrfs filesystem usage - Wrong Unallocated indications - RAID10

2016-05-23 Thread Nicholas D Steeves
On 23 May 2016 at 09:34, Marco Lorenzo Crociani
 wrote:
> Hi,
> as I wrote today in IRC, I experienced an issue with 'btrfs filesystem usage'.
> I have a 4 partitions RAID10 btrfs filesystem almost full.
> 'btrfs filesystem usage' reports wrong "Unallocated" indications.
>
> Linux 4.5.3
> btrfs-progs v4.5.3
>
>
> # btrfs fi usage /data/
>
> Overall:
> Device size:   13.93TiB
> Device allocated:   13.77TiB
> Device unallocated: 167.54GiB

I wonder if this is related to whatever caused the free space cache
bug for Ivan Pilipenko and myself (linux 4.4.10, btrfs-progs 4.4.1)?

Cheers,
Nicholas


Re: [1/1 v2] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

2016-05-23 Thread Nicholas D Steeves
On 23 May 2016 at 13:01, David Sterba  wrote:
> On Thu, May 19, 2016 at 09:30:49PM -0400, Nicholas D Steeves wrote:
>> Sorry for the noise.  Please disregard my v1 patch and subsequent
>> emails.  This patch is for upstream linux-next.  From now on I think
>> that's what I'm going to work from, to keep things simple, because it
>> seems I'm still inept with git.
>
> The patch applies cleanly on top of the current branch that's going to
> Linus tree, so I'll queue it for the next pull request. All your inline
> notices were addressed. Thanks.

You're welcome, and thank you for the assistance.  I don't want to
annoy everyone with a regular stream of these patches, so what do you
think of the following?  I'll submit a patch for user-facing
typos in btrfs-progs whenever I find one, and a strings &
comments review for both -progs and kernel twice a year, where one
review is part of preparing for an LTS kernel.

Regards,
Nicholas


Re: [1/1 v2] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

2016-05-23 Thread David Sterba
On Thu, May 19, 2016 at 09:30:49PM -0400, Nicholas D Steeves wrote:
> Sorry for the noise.  Please disregard my v1 patch and subsequent
> emails.  This patch is for upstream linux-next.  From now on I think
> that's what I'm going to work from, to keep things simple, because it
> seems I'm still inept with git.

The patch applies cleanly on top of the current branch that's going to
Linus tree, so I'll queue it for the next pull request. All your inline
notices were addressed. Thanks.


btrfs filesystem usage - Wrong Unallocated indications - RAID10

2016-05-23 Thread Marco Lorenzo Crociani

Hi,
as I wrote today in IRC, I experienced an issue with 'btrfs filesystem usage'.
I have a 4 partitions RAID10 btrfs filesystem almost full.
'btrfs filesystem usage' reports wrong "Unallocated" indications.

Linux 4.5.3
btrfs-progs v4.5.3


# btrfs fi usage /data/

Overall:
Device size:  13.93TiB
Device allocated:  13.77TiB
Device unallocated: 167.54GiB
Device missing: 0.00B
Used:  13.44TiB
Free (estimated): 244.39GiB(min: 244.39GiB)
Data ratio:  2.00
Metadata ratio:  2.00
Global reserve: 512.00MiB(used: 0.00B)

Data,single: Size:8.00MiB, Used:0.00B
   /dev/sda4   8.00MiB

Data,RAID10: Size:6.87TiB, Used:6.71TiB
   /dev/sda4   1.72TiB
   /dev/sdb3   1.72TiB
   /dev/sdc3   1.72TiB
   /dev/sdd3   1.72TiB

Metadata,single: Size:8.00MiB, Used:0.00B
   /dev/sda4   8.00MiB

Metadata,RAID10: Size:19.00GiB, Used:14.15GiB
   /dev/sda4   4.75GiB
   /dev/sdb3   4.75GiB
   /dev/sdc3   4.75GiB
   /dev/sdd3   4.75GiB

System,single: Size:4.00MiB, Used:0.00B
   /dev/sda4   4.00MiB

System,RAID10: Size:16.00MiB, Used:768.00KiB
   /dev/sda4   4.00MiB
   /dev/sdb3   4.00MiB
   /dev/sdc3   4.00MiB
   /dev/sdd3   4.00MiB

Unallocated:
   /dev/sda4   1.76TiB
   /dev/sdb3   1.76TiB
   /dev/sdc3   1.76TiB
   /dev/sdd3   1.76TiB

--
# btrfs fi show /data/
Label: 'data'  uuid: df6639d5-3ef2-4ff6-a871-9ede440e2dae
Total devices 4 FS bytes used 6.72TiB
devid1 size 3.48TiB used 3.44TiB path /dev/sda4
devid2 size 3.48TiB used 3.44TiB path /dev/sdb3
devid3 size 3.48TiB used 3.44TiB path /dev/sdc3
devid4 size 3.48TiB used 3.44TiB path /dev/sdd3

--
# btrfs fi df /data/
Data, RAID10: total=6.87TiB, used=6.71TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID10: total=16.00MiB, used=768.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID10: total=19.00GiB, used=14.15GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

--
# df -h
/dev/sda4 7,0T  6,8T  245G  97% /data

Regards,

--
Marco Crociani
Prisma Telecom Testing S.r.l.
via Petrocchi, 4  20127 MILANO  ITALY
Phone:  +39 02 26113507
Fax:  +39 02 26113597
e-mail:  mar...@prismatelecomtesting.com
web:  http://www.prismatelecomtesting.com




JFYI Lock-free Btree

2016-05-23 Thread Timofey Titovets
Hi guys,
I just found this document:
http://www.cs.technion.ac.il/~erez/Papers/lfbtree-full.pdf

It describes an implementation of a lock-free B-tree.
I believe it may be interesting for someone
(AFAIK btrfs uses B-trees).
-- 
Have a nice day,
Timofey.


Re: [PATCH] Btrfs: fix unexpected return value of fiemap

2016-05-23 Thread David Sterba
On Wed, May 18, 2016 at 10:52:25AM -0700, Liu Bo wrote:
> On Wed, May 18, 2016 at 11:41:05AM +0200, David Sterba wrote:
> > On Tue, May 17, 2016 at 05:21:48PM -0700, Liu Bo wrote:
> > > btrfs's fiemap is supposed to return 0 on success and
> > >  return < 0 on error, however, ret becomes 1 after looking
> > > up the last file extent, and if the offset is beyond EOF,
> > > we can return 1.
> > > 
> > > This may confuse applications using ioctl(FS_IOC_FIEMAP).
> > > 
> > > Signed-off-by: Liu Bo 
> > 
> > Reviewed-by: David Sterba 
> > 
> > > ---
> > >  fs/btrfs/extent_io.c | 6 +-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index d247fc0..16ece52 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -4379,8 +4379,12 @@ int extent_fiemap(struct inode *inode, struct 
> > > fiemap_extent_info *fieinfo,
> > >   if (ret < 0) {
> > >   btrfs_free_path(path);
> > >   return ret;
> > > + } else {
> > > + WARN_ON(!ret);
> > > + if (ret == 1)
> > > + ret = 0;
> > >   }
> > 
> > So, ret == 1 can end up here from btrfs_lookup_file_extent ->
> > btrfs_search_slot(..., ins_len=0, cow=0) and the offset does not exist,
> > we'll get path pointed to the slot where it would be inserted and ret is 1.
> 
> Sounds better than the commit log, would you like me to update it?

Done.
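
The return convention discussed in this thread can be sketched in userspace. This is only an illustration under assumptions: search_slot() below is a hypothetical stand-in that mimics the three-way result described for btrfs_search_slot() (negative on error, 0 when the key is found, 1 when it is not found and the slot is where it would be inserted), and lookup_extent() is a made-up caller that, like fiemap, must report only 0 or a negative error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a tree search over a sorted array:
 *   < 0  error
 *   0    key found, *slot is its index
 *   1    key not found, *slot is the index where it would be inserted
 */
int search_slot(const unsigned long *keys, size_t n, unsigned long key,
                size_t *slot)
{
    size_t lo = 0, hi = n;

    if (!keys || !slot)
        return -EINVAL;

    /* Binary search for the first element that is >= key. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    *slot = lo;
    return (lo < n && keys[lo] == key) ? 0 : 1;
}

/* Caller whose contract is "0 on success, negative on error". */
int lookup_extent(const unsigned long *keys, size_t n, unsigned long key,
                  size_t *slot)
{
    int ret = search_slot(keys, n, key, slot);

    if (ret < 0)
        return ret;
    /* "Not found" is not an error here, so do not leak the 1 upward. */
    if (ret == 1)
        ret = 0;
    return ret;
}
```

The point of the wrapper is only that the internal "1 means not found" result never reaches the caller's caller.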


Re: [PATCH] btrfs-progs: utils: use better wrappered random generator

2016-05-23 Thread David Sterba
The API does not seem right. It's fine to provide functions for full
int/u32/u64 ranges, but in the cases when we know the range from which
we request the random number, it has to be passed as parameter. Not
doing the % by hand.

> +u32 rand_u32(void)
> +{
> + struct timeval tv;
> + unsigned short rand_seed[3];

This could be made static (with thread local storage) so the state does
not get regenerated all the time. Possibly it could be initialized from
some true random source, not time or pid.

> + long int ret;
> + int i;
> +
> + gettimeofday(&tv, 0);
> + rand_seed[0] = getpid() ^ (tv.tv_sec & 0x);
> + rand_seed[1] = getppid() ^ (tv.tv_usec & 0x);
> + rand_seed[2] = (tv.tv_sec ^ tv.tv_usec) >> 16;
> +
> + /* Crank the random number generator a few times */
> + gettimeofday(&tv, 0);
> + for (i = (tv.tv_sec ^ tv.tv_sec) ^ 0x1F; i > 0; i--)
> + nrand48(rand_seed);

This would be then unnecessary, just draw the number from nrand.

About patch separation: please introduce the new api in one patch, use
in another (ie. drop srand and switch to it).
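
A rough sketch of the API shape suggested in this review, assuming POSIX jrand48() as the generator and a seed kept in static storage that is initialized once from time and pid. The name rand_range(), the 0xFFFF masks and the rejection loop are assumptions made for illustration and are not taken from the patch; thread-local storage is left out for brevity.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* for jrand48() under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

/* Generator state, seeded once instead of on every call. */
static unsigned short rand_seed[3];
static int rand_seed_ready;

static void rand_seed_init(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    /* The 0xFFFF masks are an assumption made for this sketch. */
    rand_seed[0] = (unsigned short)(getpid() ^ (tv.tv_sec & 0xFFFF));
    rand_seed[1] = (unsigned short)(getppid() ^ (tv.tv_usec & 0xFFFF));
    rand_seed[2] = (unsigned short)((tv.tv_sec ^ tv.tv_usec) >> 16);
    rand_seed_ready = 1;
}

uint32_t rand_u32(void)
{
    if (!rand_seed_ready)
        rand_seed_init();
    /* jrand48() returns a signed 32-bit value; keep its 32 bits as unsigned. */
    return (uint32_t)jrand48(rand_seed);
}

/* Uniform value in [0, upper); upper must be non-zero. */
uint32_t rand_range(uint32_t upper)
{
    uint32_t min, r;

    assert(upper > 0);
    /* 2^32 mod upper: rejecting values below this removes modulo bias. */
    min = (uint32_t)(0u - upper) % upper;
    do {
        r = rand_u32();
    } while (r < min);
    return r % upper;
}
```

With a helper like this, a caller would write rand_range(n) rather than rand_u32() % n.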


Re: Hot data tracking / hybrid storage

2016-05-23 Thread Austin S. Hemmelgarn

On 2016-05-20 18:26, Henk Slager wrote:

> Yes, sorry, I took some shortcut in the discussion and jumped to a
> method for avoiding this 0.5-2% slowdown that you mention. (Or a
> kernel crashing in bcache code due to corrupt SB on a backing device
> or corrupted caching device contents).
> I am actually a bit surprised that there is a measurable slowdown,
> considering that it is basically just one 8KiB offset on a certain
> layer in the kernel stack, but I haven't looked at that code.

There's still a layer of indirection in the kernel code, even in the
pass-through mode with no cache, and that's probably where the slowdown
comes from.  My testing was also in a VM with its backing device on an
SSD though, so you may get different results on other hardware.

> I don't know other tables than MBR and GPT, but this bcache SB
> 'insertion' works with both. Indeed, if GRUB is involved, it can get
> complicated, I have avoided that. If there is less than 8KiB slack
> space on a HDD, I would worry about alignment/performance first, then
> there is likely a reason to fully rewrite the HDD with a standard 1M
> alignment.

The 'alignment' thing is mostly bogus these days.  It originated when
1M was a full track on the disk, and you wanted your filesystem to start
on the beginning of a track for performance reasons.  On most modern
disks though, this is not a full track, but it got kept because a number
of bootloaders (GRUB included) used to use the slack space this caused
to embed themselves before the filesystem.  The only case where 1M
alignment actually makes sense is on SSDs with a 1M erase block size
(which are rare, most consumer devices have a 4M erase block).  As far
as partition tables go, you're not likely to see any other formats these
days (the only ones I've dealt with other than MBR and GPT are APM (the
old pre-OSX Apple format), RDB (the Amiga format, which is kind of neat
because it can embed drivers), and the old Sun disk labels (from before
SunOS became Solaris)), and I had actually forgotten that a GPT is only
32k, hence my comment about it potentially being an issue.

> If there are more partitions, with one in front of the partition you
> would like to be bcached, I personally would shrink it by 8KiB (e.g.
> NTFS or swap or ext4) if that saves me terabytes of data transfers.

Definitely, although depending on how the system is set up, this will
almost certainly need down time.



> > This also doesn't change the fact that without careful initial formatting
> > (it is possible on some filesystems to embed the bcache SB at the beginning
> > of the FS itself, many of them have some reserved space at the beginning of
> > the partition for bootloaders, and this space doesn't have to exist when
> > mounting the FS) or manual alteration of the partition, it's not possible to
> > mount the FS on a system without bcache support.


> If we consider a non-bootable single HDD btrfs FS, are you then
> suggesting that the bcache SB could be placed in the first 64KiB where
> also GRUB stores its code if the FS would need booting ?
> That would be interesting, it would mean that also for btrfs on raw
> device (and also multi-device) there is no extra exclusive 8KiB space
> needed in front.
> Is there someone who has this working? I think it would lead to issues
> on the blocklayer, but I have currently no clue about that.

I don't think it would work on BTRFS, we expect the SB at a fixed
location into the device, and it wouldn't be there on the bcache device.
It might work on ext4 though, but I'm not certain about that.  I do
know of at least one person who got it working with a FAT32 filesystem
as a proof of concept though.  Trying to do that even if it would work
on BTRFS would be _really_ risky though, because the kernel would
potentially see both devices, and you would probably have the same
issues that you do with block level copies.



Re: [RFC PATCH v2.1 16/16] btrfs-progs: fsck: Introduce low memory mode

2016-05-23 Thread David Sterba
On Fri, May 20, 2016 at 10:33:55AM +0800, Qu Wenruo wrote:
> We'll enrich the test cases for current low memory mode.

I started something to add optional default options for a few basic
commands (mkfs, fsck, convert) to extend the coverage. I'm not finished,
the idea is to call the commands via some wrapper that will grab the
defaults from a file or from environment.


Re: [PATCH 1/6] Btrfs: fix race between readahead and device replace/removal

2016-05-23 Thread David Sterba
On Fri, May 20, 2016 at 11:11:57AM -0400, Josef Bacik wrote:
> On Fri, May 20, 2016 at 12:44 AM,   wrote:
> > So fix this by taking the device_list_mutex in the readahead code. We
> > can't use here the lighter approach of using a rcu_read_lock() and
> > rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
> > because we end up doing calls to sleeping functions (kzalloc()) in the
> > respective code path.
> 
> I think it might be time to change this to a rwsem as well as we use
> it in a bunch of places that are read only like statfs and readahead.
> But this works for now.

Sounds good to me.


Re: [RFC PATCH] btrfs: correct inode's outstanding_extents computation

2016-05-23 Thread Filipe Manana
On Mon, May 23, 2016 at 7:05 AM, Wang Xiaoguang
 wrote:
> hello,
>
>
> On 05/19/2016 07:01 PM, Filipe Manana wrote:
>>
>> On Thu, May 19, 2016 at 11:49 AM, Wang Xiaoguang
>>  wrote:
>>>
>>> This issue was revealed by changing BTRFS_MAX_EXTENT_SIZE (128MB) to 64KB.
>>> With that change, the fsstress test often gets these warnings from
>>> btrfs_destroy_inode():
>>>  WARN_ON(BTRFS_I(inode)->outstanding_extents);
>>>  WARN_ON(BTRFS_I(inode)->reserved_extents);
>>>
>>> The simple test program below reproduces this issue reliably.
>>>  #include 
>>>  #include 
>>>  #include 
>>>  #include 
>>>  #include 
>>>
>>>  int main(void)
>>>  {
>>>  int fd;
>>>  char buf[1024*1024];
>>>
>>>  memset(buf, 0, 1024 * 1024);
>>>  fd = open("testfile", O_CREAT | O_EXCL | O_RDWR);
>>>  pwrite(fd, buf, 69954, 693581);
>>>  return;
>>>  }
>>>
>>> Assume the BTRFS_MAX_EXTENT_SIZE is 64KB, and data range is:
>>> 692224                                                 765951
>>> |-----------------------------------------------------------|
>>>                         len(73728)
>>> 1) for the above data range, btrfs_delalloc_reserve_metadata() will
>>> reserve metadata and BTRFS_I(inode)->outstanding_extents will be 2.
>>> (73728 + 65535) / 65536 == 2
>>>
>>> 2) then btrfs_dirty_page() will be called to dirty pages and set the
>>> EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called
>>> 3 times. After the first call, the extent io map looks like this:
>>> 692224    696319 696320                                765951
>>> |--------------| |------------------------------------------|
>>>     len(4096)                   len(69632)
>>> have EXTENT_DELALLOC
>>> Because EXTENT_FIRST_DELALLOC is set, btrfs_set_bit_hook() won't change
>>> BTRFS_I(inode)->outstanding_extents, which is still 2. See the code
>>> logic in btrfs_set_bit_hook().
>>>
>>> 3) second btrfs_set_bit_hook() call.
>>> Because EXTENT_FIRST_DELALLOC has been unset by the previous
>>> btrfs_set_bit_hook() call, btrfs_set_bit_hook() will increase
>>> BTRFS_I(inode)->outstanding_extents by one, so it is now 3. The
>>> extent io map looks like this:
>>> 692224    696319 696320            761855 761856       765951
>>> |--------------| |----------------------| |-----------------|
>>>     len(4096)           len(65536)             len(4096)
>>> have EXTENT_DELALLOC  have EXTENT_DELALLOC
>>>
>>> And because (692224, 696319) and (696320, 761855) are adjacent,
>>> btrfs_merge_extent_hook() will merge them into one delalloc extent, but
>>> according to the computation logic in btrfs_merge_extent_hook(),
>>> BTRFS_I(inode)->outstanding_extents will still be 3.
>>> After the merge, the extent io map looks like this:
>>> 692224                             761855 761856       765951
>>> |---------------------------------------| |-----------------|
>>>               len(69632)                       len(4096)
>>>          have EXTENT_DELALLOC
>>>
>>> 4) third btrfs_set_bit_hook() call.
>>> Again, because EXTENT_FIRST_DELALLOC is not set, btrfs_set_bit_hook()
>>> will increase BTRFS_I(inode)->outstanding_extents by one, so it is
>>> now 4. The extent io map is:
>>> 692224                             761855 761856       765951
>>> |---------------------------------------| |-----------------|
>>>               len(69632)                       len(4096)
>>>          have EXTENT_DELALLOC             have EXTENT_DELALLOC
>>>
>>> Also, because (692224, 761855) and (761856, 765951) are adjacent,
>>> btrfs_merge_extent_hook() will merge them into one delalloc extent.
>>> According to the computation logic in btrfs_merge_extent_hook(),
>>> BTRFS_I(inode)->outstanding_extents will decrease by one, to 3.
>>> So after the merge, the extent io map looks like this:
>>> 692224                                                 765951
>>> |-----------------------------------------------------------|
>>>                         len(73728)
>>>                    have EXTENT_DELALLOC
>>>
>>> But for the original data range (start:692224 end:765951 len:73728), we
>>> should have just 2 outstanding extents, so this will trigger the above
>>> WARNINGs.
>>>
>>> The root cause is that btrfs_delalloc_reserve_metadata() always adds the
>>> needed outstanding extents first, and if btrfs_set_extent_delalloc()
>>> later calls btrfs_set_bit_hook() multiple times, it may wrongly update
>>> BTRFS_I(inode)->outstanding_extents. This patch chooses to also add
>>> BTRFS_I(inode)->outstanding_extents in btrfs_set_bit_hook() 

[PATCH v3] fstests: generic: Test reserved extent map search routine on dedupe file

2016-05-23 Thread Qu Wenruo
For a fully deduped file, meaning all its file extents point to the same
bytenr, btrfs can hit a soft lockup when the fiemap ioctl is called on
that file, as in the following output:
--
CPU: 1 PID: 7500 Comm: xfs_io Not tainted 4.5.0-rc6+ #2
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox
12/01/2006
task: 880027681b40 ti: 8800276e task.ti: 8800276e
RIP: 0010:[]  []
__merge_refs+0x34/0x120 [btrfs]
RSP: 0018:8800276e3c08  EFLAGS: 0202
RAX: 8800269cc330 RBX: 8800269cdb18 RCX: 0007
RDX: 61b0 RSI: 8800269cc4c8 RDI: 8800276e3c88
RBP: 8800276e3c20 R08:  R09: 0001
R10:  R11:  R12: 880026ea3cb0
R13: 8800276e3c88 R14: 880027132a50 R15: 88002743
FS:  7f10201df700() GS:88003fa0()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f10201ec000 CR3: 27603000 CR4: 000406e0
Stack:
    8800276e3ce8
 a0259f38 0005 8800274c6870 8800274c7d88
 00c1  0001 27431190
Call Trace:
 [] find_parent_nodes+0x448/0x740 [btrfs]
 [] btrfs_check_shared+0x102/0x1b0 [btrfs]
 [] ? __might_fault+0x4d/0xa0
 [] extent_fiemap+0x2ac/0x550 [btrfs]
 [] ? __filemap_fdatawait_range+0x96/0x160
 [] ? btrfs_get_extent+0xb30/0xb30 [btrfs]
 [] btrfs_fiemap+0x45/0x50 [btrfs]
 [] do_vfs_ioctl+0x498/0x670
 [] SyS_ioctl+0x79/0x90
 [] entry_SYSCALL_64_fastpath+0x12/0x6f
Code: 41 55 41 54 53 4c 8b 27 4c 39 e7 0f 84 e9 00 00 00 49 89 fd 49 8b
34 24 49 39 f5 48 8b 1e 75 17 e9 d5 00 00 00 49 39 dd 48 8b 03 <48> 89
de 0f 84 b9 00 00 00 48 89 c3 8b 46 2c 41 39 44 24 2c 75
--

Also, btrfs returns the wrong flag for all these extents: they should
have the SHARED (0x2000) flag, but btrfs still considers them exclusive
extents.

On the other hand, xfs with the unmerged reflink patches handles this
case without problems, and so does patched btrfs.

This test case creates a large fully deduped file to check that the fs
can handle the fiemap ioctl and return the correct SHARED flag, for any
fs which supports reflink.
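
As a rough illustration of the flag values involved, the sketch below decodes a fiemap flags word. The constants are local copies of FIEMAP_EXTENT_LAST (0x1), FIEMAP_EXTENT_UNWRITTEN (0x800) and FIEMAP_EXTENT_SHARED (0x2000) from <linux/fiemap.h>; the EX_ names and the helper functions are made up for this example and are not part of fstests or btrfs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the <linux/fiemap.h> values, so the sketch builds anywhere. */
#define EX_FIEMAP_EXTENT_LAST      0x00000001u
#define EX_FIEMAP_EXTENT_UNWRITTEN 0x00000800u
#define EX_FIEMAP_EXTENT_SHARED    0x00002000u

/* The two bits are independent, so a shared last extent reports both. */
static_assert((EX_FIEMAP_EXTENT_SHARED & EX_FIEMAP_EXTENT_LAST) == 0,
              "flag bits must not overlap");

/* True if the extent is referenced more than once (reflinked or deduped). */
bool ex_extent_is_shared(uint32_t flags)
{
        return (flags & EX_FIEMAP_EXTENT_SHARED) != 0;
}

/* True if this is the last extent of the file. */
bool ex_extent_is_last(uint32_t flags)
{
        return (flags & EX_FIEMAP_EXTENT_LAST) != 0;
}

/* True if the extent is allocated but not yet written. */
bool ex_extent_is_unwritten(uint32_t flags)
{
        return (flags & EX_FIEMAP_EXTENT_UNWRITTEN) != 0;
}
```

Under these definitions, a fully deduped file should report 0x2000 on every extent, with the final extent additionally carrying the last-extent bit (0x2001).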

Reported-by: Tsutomu Itoh 
Signed-off-by: Qu Wenruo 
---
v2:
  Use more wrapper of xfs_io
  Add fiemap requirement
  Refactor output to match golden output if LOAD_FACTOR is not 1
v3:
  Fix a bug that temporary file is not removed.
---
 common/punch  | 17 +
 tests/generic/352 | 98 +++
 tests/generic/352.out |  5 +++
 tests/generic/group   |  1 +
 4 files changed, 121 insertions(+)
 create mode 100755 tests/generic/352
 create mode 100644 tests/generic/352.out

diff --git a/common/punch b/common/punch
index 43f04c2..44c6e1c 100644
--- a/common/punch
+++ b/common/punch
@@ -218,6 +218,23 @@ _filter_fiemap()
_coalesce_extents
 }
 
+_filter_fiemap_flags()
+{
+   $AWK_PROG '
+   $3 ~ /hole/ {
+   print $1, $2, $3;
+   next;
+   }
+   $5 ~ /0x[[:xdigit:]]*8[[:xdigit:]][[:xdigit:]]/ {
+   print $1, $2, "unwritten";
+   next;
+   }
+   $5 ~ /0x[[:xdigit:]]+/ {
+   print $1, $2, $5;
+   }' |
+   _coalesce_extents
+}
+
 # Filters fiemap output to only print the 
 # file offset column and whether or not
 # it is an extent or a hole
diff --git a/tests/generic/352 b/tests/generic/352
new file mode 100755
index 000..70e43fb
--- /dev/null
+++ b/tests/generic/352
@@ -0,0 +1,98 @@
+#! /bin/bash
+# FS QA Test 352
+#
+# Test fiemap ioctl on heavily deduped file
+#
+# This test case will check if reserved extent map searching go
+# without problem and return correct SHARED flag.
+# Which btrfs will soft lock up and return wrong shared flag.
+#
+#---
+# Copyright (c) 2016 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # 

Re: [PATCH v2] fstests: generic: Test reserved extent map search routine on deduped file

2016-05-23 Thread Eryu Guan
On Thu, May 12, 2016 at 03:37:39PM +0800, Qu Wenruo wrote:
> For fully deduped file, which means all its file extents are pointing to
> the same bytenr, btrfs can cause soft lockup when calling fiemap ioctl
> on that file, like the following output:
[snip]
> +
> +# then call fiemap on that file to test both the shared flag and if
> +# reserved extent mapping search will cause soft lockup
> +$XFS_IO_PROG -c "fiemap -v" $file | _filter_fiemap_flags > $tmp
> +cat $tmp >> $seqres.full

$tmp won't be removed after the test, because _cleanup() only removes
$tmp.*; using $tmp.out would be better.

> +
> +# refact the $LOAD_FACTOR to 1 to match the golden output
> +sed -i -e "s/$(($last_extent - 1))/$(($orig_last_extent - 1))/" \
> + -e "s/$last_extent/$orig_last_extent/" \
> + -e "s/$end/$orig_end/" $tmp
> +cat $tmp

Same here.

Otherwise looks good to me.

Thanks,
Eryu

> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/352.out b/tests/generic/352.out
> new file mode 100644
> index 000..a87c507
> --- /dev/null
> +++ b/tests/generic/352.out
> @@ -0,0 +1,5 @@
> +QA output created by 352
> +wrote 131072/131072 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +0: [0..2097151]: 0x2000
> +1: [2097152..2097407]: 0x2001
> diff --git a/tests/generic/group b/tests/generic/group
> index 36fb759..3f00386 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -354,3 +354,4 @@
>  349 blockdev quick rw
>  350 blockdev quick rw
>  351 blockdev quick rw
> +352 auto clone
> -- 
> 2.5.5
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] btrfs: Slightly speedup btrfs_read_block_groups

2016-05-23 Thread Qu Wenruo

Any comment on this patch?

BTW, for anyone who is interested in the speedup and the trace results,
I've uploaded them to Google Drive:

https://drive.google.com/open?id=0BxpkL3ehzX3pbFEybXd3X3MzRGM
https://drive.google.com/open?id=0BxpkL3ehzX3pd1ByOFhhbml3Ujg

Thanks,
Qu

Qu Wenruo wrote on 2016/05/05 15:51 +0800:

btrfs_read_block_groups() is the most time-consuming function if the
whole fs is filled with small extents.

For a btrfs filled entirely with 16K files and with 2T of space used,
mounting the fs takes 10 to 12 seconds.

ftrace shows that btrfs_read_block_groups() takes about 9 seconds,
while btrfs_read_chunk_tree() only takes 14ms.
In theory, btrfs_read_chunk_tree() and btrfs_read_block_groups() should
take the same time, as chunks and block groups are mapped 1:1.

However, since block group items are spread across the large extent
tree, it takes a lot of time to search the btree.

Furthermore, the find_first_block_group() function used by
btrfs_read_block_groups() uses a very inefficient method to locate a
block group item: it searches and then checks slot by slot.

In kernel space, checking slot by slot is somewhat time-consuming,
because in the next_leaf() case the kernel needs to do extra locking.

This patch removes the slot-by-slot checking. By the time we call
btrfs_read_block_groups(), we have already read out all chunks and
saved them into map_tree.

So we use map_tree to get the exact block group start and length, then
do an exact btrfs_search_slot(), without the slot-by-slot check, to
speed up the mount.
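
To make the difference concrete, here is a small userspace model, not kernel code. It assumes keys sort by (objectid, type, offset), uses a plain sorted array in place of the extent tree, and every ex_ name is invented for the sketch. The point it illustrates is that once the chunk's start and length are known, the full key can be built and searched for exactly, instead of walking items one by one until a block group item turns up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_BLOCK_GROUP_ITEM_KEY 192 /* numeric value of BTRFS_BLOCK_GROUP_ITEM_KEY */

struct ex_key {
        uint64_t objectid; /* block group start, same as the chunk start */
        uint8_t type;
        uint64_t offset;   /* block group length, same as the chunk length */
};

/* Order keys by objectid, then type, then offset. */
int ex_key_cmp(const void *a, const void *b)
{
        const struct ex_key *x = a;
        const struct ex_key *y = b;

        if (x->objectid != y->objectid)
                return x->objectid < y->objectid ? -1 : 1;
        if (x->type != y->type)
                return x->type < y->type ? -1 : 1;
        if (x->offset != y->offset)
                return x->offset < y->offset ? -1 : 1;
        return 0;
}

/*
 * Exact lookup in a sorted item array. Returns the matching item, or NULL
 * when no such block group item exists (the patch maps that to -ENOENT).
 */
const struct ex_key *ex_find_block_group(const struct ex_key *items, size_t nr,
                                         uint64_t chunk_start,
                                         uint64_t chunk_len)
{
        struct ex_key key = {
                .objectid = chunk_start,
                .type = EX_BLOCK_GROUP_ITEM_KEY,
                .offset = chunk_len,
        };

        assert(items != NULL);
        return bsearch(&key, items, nr, sizeof(*items), ex_key_cmp);
}
```

In this model a lookup with the wrong length simply misses, which is the trade-off of an exact search: it relies on the chunk mapping and the block group item agreeing on both start and length.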

With this patch, time spent on btrfs_read_block_groups() is reduced to
7.56s, compared to old 8.94s.

Reported-by: Tsutomu Itoh 
Signed-off-by: Qu Wenruo 

---
The further fix would change the mount process from reading out all
block groups to reading out block group on demand.

But according to the btrfs_read_chunk_tree() calling time, the real
problem is the on-disk format and btree locking.

If block group items are arranged like chunks, in a dedicated tree,
btrfs_read_block_groups() should take the same time as
btrfs_read_chunk_tree().

Furthermore, if we can split the current huge extent tree into
something like a per-chunk extent tree, a lot of current code such as
delayed_refs can be removed, as extent tree operations will be much
faster.
---
 fs/btrfs/extent-tree.c | 61 --
 fs/btrfs/extent_map.c  |  1 +
 fs/btrfs/extent_map.h  | 22 ++
 3 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8507484..9fa7728 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9520,39 +9520,20 @@ out:
return ret;
 }

-static int find_first_block_group(struct btrfs_root *root,
-   struct btrfs_path *path, struct btrfs_key *key)
+int find_block_group(struct btrfs_root *root,
+  struct btrfs_path *path,
+  struct extent_map *chunk_em)
 {
int ret = 0;
-   struct btrfs_key found_key;
-   struct extent_buffer *leaf;
-   int slot;
-
-   ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
-   if (ret < 0)
-   goto out;
+   struct btrfs_key key;

-   while (1) {
-   slot = path->slots[0];
-   leaf = path->nodes[0];
-   if (slot >= btrfs_header_nritems(leaf)) {
-   ret = btrfs_next_leaf(root, path);
-   if (ret == 0)
-   continue;
-   if (ret < 0)
-   goto out;
-   break;
-   }
-   btrfs_item_key_to_cpu(leaf, &found_key, slot);
+   key.objectid = chunk_em->start;
+   key.offset = chunk_em->len;
+   key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;

-   if (found_key.objectid >= key->objectid &&
-   found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-   ret = 0;
-   goto out;
-   }
-   path->slots[0]++;
-   }
-out:
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret > 0)
+   ret = -ENOENT;
return ret;
 }

@@ -9771,16 +9752,14 @@ int btrfs_read_block_groups(struct btrfs_root *root)
struct btrfs_block_group_cache *cache;
struct btrfs_fs_info *info = root->fs_info;
struct btrfs_space_info *space_info;
-   struct btrfs_key key;
+   struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+   struct extent_map *chunk_em;
struct btrfs_key found_key;
struct extent_buffer *leaf;
int need_clear = 0;
u64 cache_gen;

root = info->extent_root;
-   key.objectid = 0;
-   key.offset = 0;
-   key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
path = btrfs_alloc_path();
if (!path)
return 

Re: [RFC PATCH] btrfs: correct inode's outstanding_extents computation

2016-05-23 Thread Wang Xiaoguang

hello,

On 05/19/2016 07:01 PM, Filipe Manana wrote:

On Thu, May 19, 2016 at 11:49 AM, Wang Xiaoguang
 wrote:

This issue was revealed by changing BTRFS_MAX_EXTENT_SIZE (128MB) to 64KB.
With that change, the fsstress test often gets these warnings from
btrfs_destroy_inode():
 WARN_ON(BTRFS_I(inode)->outstanding_extents);
 WARN_ON(BTRFS_I(inode)->reserved_extents);

The simple test program below reproduces this issue reliably.
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <string.h>
 #include <unistd.h>

 int main(void)
 {
         int fd;
         char buf[1024*1024];

         memset(buf, 0, 1024 * 1024);
         /* O_CREAT requires a mode argument */
         fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
         pwrite(fd, buf, 69954, 693581);
         return 0;
 }

Assume the BTRFS_MAX_EXTENT_SIZE is 64KB, and data range is:
692224                                                 765951
|-----------------------------------------------------------|
                        len(73728)
1) for the above data range, btrfs_delalloc_reserve_metadata() will reserve
metadata and BTRFS_I(inode)->outstanding_extents will be 2.
(73728 + 65535) / 65536 == 2
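
The numbers in step 1 can be reproduced with a few lines of arithmetic. The sketch below assumes a 4096-byte page size (consistent with the len(4096) pieces in this walkthrough); the ex_ helper names are hypothetical and only mirror the rounding described here, not the kernel's own helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align (align must be non-zero). */
uint64_t ex_round_down(uint64_t x, uint64_t align)
{
        assert(align != 0);
        return x - x % align;
}

/* Round x up to a multiple of align (align must be non-zero). */
uint64_t ex_round_up(uint64_t x, uint64_t align)
{
        assert(align != 0);
        return ex_round_down(x + align - 1, align);
}

/* Number of extents of at most max_extent bytes needed to cover len bytes. */
uint64_t ex_count_extents(uint64_t len, uint64_t max_extent)
{
        assert(max_extent != 0);
        return (len + max_extent - 1) / max_extent;
}
```

For the reproducer's pwrite(fd, buf, 69954, 693581), rounding the start down and the end up to 4096 gives the range 692224 to 765951, which is 73728 bytes, and covering that with 64KB extents needs 2 of them.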

2) then btrfs_dirty_page() will be called to dirty pages and set the
EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called
3 times. After the first call, the extent io map looks like this:
692224    696319 696320                                765951
|--------------| |------------------------------------------|
    len(4096)                   len(69632)
have EXTENT_DELALLOC
Because EXTENT_FIRST_DELALLOC is set, btrfs_set_bit_hook() won't change
BTRFS_I(inode)->outstanding_extents, which is still 2. See the code
logic in btrfs_set_bit_hook().

3) second btrfs_set_bit_hook() call.
Because EXTENT_FIRST_DELALLOC has been unset by the previous
btrfs_set_bit_hook() call, btrfs_set_bit_hook() will increase
BTRFS_I(inode)->outstanding_extents by one, so it is now 3. The
extent io map looks like this:
692224    696319 696320            761855 761856       765951
|--------------| |----------------------| |-----------------|
    len(4096)           len(65536)             len(4096)
have EXTENT_DELALLOC  have EXTENT_DELALLOC

And because (692224, 696319) and (696320, 761855) are adjacent,
btrfs_merge_extent_hook() will merge them into one delalloc extent, but
according to the computation logic in btrfs_merge_extent_hook(),
BTRFS_I(inode)->outstanding_extents will still be 3.
After the merge, the extent io map looks like this:
692224                             761855 761856       765951
|---------------------------------------| |-----------------|
              len(69632)                       len(4096)
         have EXTENT_DELALLOC

4) third btrfs_set_bit_hook() call.
Again, because EXTENT_FIRST_DELALLOC is not set, btrfs_set_bit_hook()
will increase BTRFS_I(inode)->outstanding_extents by one, so it is
now 4. The extent io map is:
692224                             761855 761856       765951
|---------------------------------------| |-----------------|
              len(69632)                       len(4096)
         have EXTENT_DELALLOC             have EXTENT_DELALLOC

Also, because (692224, 761855) and (761856, 765951) are adjacent,
btrfs_merge_extent_hook() will merge them into one delalloc extent.
According to the computation logic in btrfs_merge_extent_hook(),
BTRFS_I(inode)->outstanding_extents will decrease by one, to 3.
So after the merge, the extent io map looks like this:
692224                                                 765951
|-----------------------------------------------------------|
                        len(73728)
                   have EXTENT_DELALLOC

But for the original data range (start:692224 end:765951 len:73728), we
should have just 2 outstanding extents, so this will trigger the above
WARNINGs.
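
The counter drift in steps 2 to 4 can be replayed in a small userspace sketch. This is a model under stated assumptions, not the kernel code: the merge rule below is one reading of the behaviour described in this walkthrough (a merge only drops the counter when the merged extent needs fewer 64KB pieces than its two parts did separately), the "merged size fits in one extent" branch is an extra assumption, and all ex_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ceiling division; d must be non-zero. */
uint64_t ex_div_round_up(uint64_t n, uint64_t d)
{
        assert(d != 0);
        return (n + d - 1) / d;
}

/*
 * How much the counter drops when a newly marked piece of piece_len bytes
 * merges with an adjacent delalloc extent of other_len bytes.
 */
unsigned ex_merge_decrement(uint64_t other_len, uint64_t piece_len,
                            uint64_t max_extent)
{
        uint64_t merged = other_len + piece_len;
        uint64_t separate;

        /* Assumption: a merge that still fits in one extent drops one. */
        if (merged <= max_extent)
                return 1;

        separate = ex_div_round_up(other_len, max_extent) +
                   ex_div_round_up(piece_len, max_extent);
        return ex_div_round_up(merged, max_extent) >= separate ? 0 : 1;
}

/* Replays the three set-bit calls; returns the final counter value. */
unsigned ex_replay_walkthrough(void)
{
        const uint64_t max_extent = 65536;
        unsigned outstanding = 2; /* reserved up front for 73728 bytes */

        /* Call 1 (4096 bytes): first-delalloc flag set, counter untouched. */

        /* Call 2 (65536 bytes): +1, then merge with the 4096-byte extent. */
        outstanding += 1;
        outstanding -= ex_merge_decrement(4096, 65536, max_extent);

        /* Call 3 (4096 bytes): +1, then merge with the 69632-byte extent. */
        outstanding += 1;
        outstanding -= ex_merge_decrement(69632, 4096, max_extent);

        return outstanding;
}
```

Replaying the calls this way ends at 3, one more than the 2 extents that a single 73728-byte range needs, which is the mismatch the warnings report.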

The root cause is that btrfs_delalloc_reserve_metadata() always adds the
needed outstanding extents first, and if btrfs_set_extent_delalloc()
later calls btrfs_set_bit_hook() multiple times, it may wrongly update
BTRFS_I(inode)->outstanding_extents. This patch chooses to also add
BTRFS_I(inode)->outstanding_extents in btrfs_set_bit_hook() according to
the data range length, where the added value is the correct number of
outstanding_extents for this data range, and then to decrease the value
which was added in