Re: [PATCH] btrfs: fix check_shared for fiemap ioctl
Does anyone have interest in this patch?

On 2016-05-16 11:23, Lu Fengqi wrote:
> Only in the case of a different root_id or a different object_id does
> check_shared identify an extent as shared. However, if an extent is
> referred to by different offsets of the same file, it should also be
> identified as shared. In addition, check_shared's loop scale is at
> least n^3, so if an extent has too many references, it can even cause
> a soft hang.
>
> First, add all delayed_refs to the ref_tree and calculate unique_refs;
> if unique_refs is greater than one, return BACKREF_FOUND_SHARED.
> Then individually add the on-disk references (inline/keyed) to the
> ref_tree and calculate the unique_refs of the ref_tree to check if
> unique_refs is greater than one. Because we return SHARED as soon as
> there are two references, the time complexity is close to constant.
>
> Reported-by: Tsutomu Itoh
> Signed-off-by: Lu Fengqi
> ---
>  fs/btrfs/backref.c   | 348 +--
>  fs/btrfs/extent_io.c |  18 ++-
>  2 files changed, 356 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 80e8472..1118c76 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -17,6 +17,7 @@
>   */
>
>  #include
> +#include
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "backref.h"
> @@ -34,6 +35,249 @@ struct extent_inode_elem {
>  	struct extent_inode_elem *next;
>  };
>
> +/*
> + * ref_root is used as the root of the ref tree that hold a collection
> + * of unique references.
> + */
> +struct ref_root {
> +	/*
> +	 * the unique_refs represents the number of ref_nodes with a positive
> +	 * count stored in the tree. Even if a ref_node(the count is greater
> +	 * than one) is added, the unique_refs will only increase one.
> +	 */
> +	unsigned int unique_refs;
> +
> +	struct rb_root rb_root;
> +};
> +
> +/* ref_node is used to store a unique reference to the ref tree. */
> +struct ref_node {
> +	/* for NORMAL_REF, otherwise all these fields should be set to 0 */
> +	u64 root_id;
> +	u64 object_id;
> +	u64 offset;
> +
> +	/* for SHARED_REF, otherwise parent field should be set to 0 */
> +	u64 parent;
> +
> +	/* ref to the ref_mod of btrfs_delayed_ref_node(delayed-ref.h) */
> +	int ref_mod;
> +
> +	struct rb_node rb_node;
> +};
> +
> +/* dynamically allocate and initialize a ref_root */
> +static struct ref_root *ref_root_alloc(gfp_t gfp_mask)
> +{
> +	struct ref_root *ref_tree;
> +
> +	ref_tree = kmalloc(sizeof(*ref_tree), gfp_mask);
> +	if (!ref_tree)
> +		return NULL;
> +
> +	ref_tree->rb_root = RB_ROOT;
> +	ref_tree->unique_refs = 0;
> +
> +	return ref_tree;
> +}
> +
> +/* free all node in the ref tree, and reinit ref_root */
> +static void ref_root_fini(struct ref_root *ref_tree)
> +{
> +	struct ref_node *node;
> +	struct rb_node *next;
> +
> +	while ((next = rb_first(&ref_tree->rb_root)) != NULL) {
> +		node = rb_entry(next, struct ref_node, rb_node);
> +		rb_erase(next, &ref_tree->rb_root);
> +		kfree(node);
> +	}
> +
> +	ref_tree->rb_root = RB_ROOT;
> +	ref_tree->unique_refs = 0;
> +}
> +
> +/* free dynamically allocated ref_root */
> +static void ref_root_free(struct ref_root *ref_tree)
> +{
> +	if (!ref_tree)
> +		return;
> +
> +	ref_root_fini(ref_tree);
> +	kfree(ref_tree);
> +}
> +
> +/*
> + * search ref_node with (root_id, object_id, offset, parent) in the tree
> + *
> + * if found, the pointer of the ref_node will be returned;
> + * if not found, NULL will be returned and pos will point to the rb_node for
> + * insert, pos_parent will point to pos'parent for insert;
> +*/
> +static struct ref_node *__ref_tree_search(struct ref_root *ref_tree,
> +					  struct rb_node ***pos,
> +					  struct rb_node **pos_parent,
> +					  u64 root_id, u64 object_id,
> +					  u64 offset, u64 parent)
> +{
> +	struct ref_node *cur = NULL;
> +
> +	*pos = &ref_tree->rb_root.rb_node;
> +
> +	while (**pos) {
> +		*pos_parent = **pos;
> +		cur = rb_entry(*pos_parent, struct ref_node, rb_node);
> +
> +		if (cur->root_id < root_id) {
> +			*pos = &(**pos)->rb_right;
> +			continue;
> +		} else if (cur->root_id > root_id) {
> +			*pos = &(**pos)->rb_left;
> +			continue;
> +		}
> +
> +		if (cur->object_id < object_id) {
> +			*pos = &(**pos)->rb_right;
> +			continue;
> +		} else if (cur->object_id > object_id) {
> +			*pos = &(**pos)->rb_left;
> +			continue;
> +		}
> +
> +		if (cur->offset < offset) {
> +			*pos = &(**pos)->rb_right;
> +			continue;
> +		} else if
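The counting idea described in the commit message (track unique references, stop as soon as there are two) can be illustrated outside the kernel. What follows is only a userspace sketch under stated assumptions: `ref_key`, `ref_set` and `ref_set_add` are hypothetical names, a small fixed array stands in for the rb-tree, and ref_mod handling is reduced to "a key counts as unique while its accumulated ref_mod is positive". It is not the patch's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REF_SET_MAX 16	/* hypothetical capacity, enough for a demo */

/* Same four fields the patch uses to identify a reference. */
struct ref_key {
	uint64_t root_id, object_id, offset, parent;
};

struct ref_entry {
	struct ref_key key;
	int ref_mod;		/* accumulated reference count delta */
};

/* Hypothetical stand-in for ref_root: a flat array instead of an rb-tree. */
struct ref_set {
	struct ref_entry entries[REF_SET_MAX];
	size_t nr;
	unsigned int unique_refs;	/* entries with ref_mod > 0 */
};

static bool ref_key_equal(const struct ref_key *a, const struct ref_key *b)
{
	return a->root_id == b->root_id && a->object_id == b->object_id &&
	       a->offset == b->offset && a->parent == b->parent;
}

/*
 * Apply ref_mod to the entry for key, creating it if needed.
 * Returns 1 once more than one unique reference exists ("shared"),
 * 0 otherwise, and -1 if the demo array is full.
 */
static int ref_set_add(struct ref_set *set, struct ref_key key, int ref_mod)
{
	struct ref_entry *e = NULL;
	bool was_positive, is_positive;
	size_t i;

	for (i = 0; i < set->nr; i++) {
		if (ref_key_equal(&set->entries[i].key, &key)) {
			e = &set->entries[i];
			break;
		}
	}
	if (!e) {
		if (set->nr == REF_SET_MAX)
			return -1;
		e = &set->entries[set->nr++];
		e->key = key;
		e->ref_mod = 0;
	}

	was_positive = e->ref_mod > 0;
	e->ref_mod += ref_mod;
	is_positive = e->ref_mod > 0;

	/* Adding the same key twice only counts once. */
	if (!was_positive && is_positive)
		set->unique_refs++;
	else if (was_positive && !is_positive)
		set->unique_refs--;

	return set->unique_refs > 1;
}
```

A caller would feed references in one at a time and stop at the first return value of 1, which is what keeps the cost close to constant for heavily shared extents. Two keys that differ only in `offset` count as two unique references, matching the "different offset of the same file" case from the commit message.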
Re: [RFC PATCH v2.1 16/16] btrfs-progs: fsck: Introduce low memory mode
David Sterba wrote on 2016/05/23 13:08 +0200:
> On Fri, May 20, 2016 at 10:33:55AM +0800, Qu Wenruo wrote:
>> We'll enrich the test cases for current low memory mode.
>
> I started something to add optional default options for a few basic
> commands (mkfs, fsck, convert) to extend the coverage. I'm not finished,
> the idea is to call the commands via some wrapper that will grab the
> defaults from a file or from environment.

Thanks a lot.

But that's still not enough for low memory fsck yet.

Even if we add a --low-memory option for btrfsck to run on those images,
we still have the following problems:

1) Lack of support for repair
   Repair support for low memory mode is quite tricky, as we need to do
   a lot of record keeping beyond just calling
   btrfs_previous/next_item().
   This won't be implemented any time soon, and it will make almost all
   repair function tests fail for the low memory backend.

2) btrfs-image bug causing missing chunk stripe
   We're actively working on this, ahead of low memory mode for the fs
   tree check.
   In fact the problem has been here for a long time, and another bug in
   btrfsck, which ignores the error returned from the dev_extent check,
   lets btrfsck pass the fsck test images.

   Unfortunately (or fortunately?) low memory mode won't ignore such an
   error and always reports a missing chunk for dev_extent.

   Unless we fix btrfs-image (only the restore part is affected), low
   memory mode will always report an error on a btrfs-image restored
   image.

3) Extra images
   During the development of low memory mode, we found that the current
   test images all target some specific repair case.
   There is no check on healthy images, not to mention tests covering
   all possible extent backrefs.

   We have built such images for internal low memory mode tests, and
   hope to push them into the current test suite.
   But since we don't have infrastructure for such check-only test
   cases, and due to the bug in 2), this still needs some work.

So we still have some work to do.
Thanks, Qu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: utils: use better wrappered random generator
David Sterba wrote on 2016/05/23 14:01 +0200:
> The API does not seem right. It's fine to provide functions for full
> int/u32/u64 ranges, but in the cases when we know the range from which
> we request the random number, it has to be passed as parameter. Not
> doing the % by hand.

This makes sense.
I'll add a new function to create a random number for a given range.

>> +u32 rand_u32(void)
>> +{
>> +	struct timeval tv;
>> +	unsigned short rand_seed[3];
>
> This could be made static (with thread local storage) so the state does
> not get regenerated all the time. Possibly it could be initialized from
> some true random source, not time or pid.

I also considered a true random source like /dev/random, but since it
may block waiting for the entropy pool, it would be quite slow and
confusing for users.

So time with pid seems good enough.

>> +	long int ret;
>> +	int i;
>> +
>> +	gettimeofday(&tv, 0);
>> +	rand_seed[0] = getpid() ^ (tv.tv_sec & 0x);
>> +	rand_seed[1] = getppid() ^ (tv.tv_usec & 0x);
>> +	rand_seed[2] = (tv.tv_sec ^ tv.tv_usec) >> 16;
>> +
>> +	/* Crank the random number generator a few times */
>> +	gettimeofday(&tv, 0);
>> +	for (i = (tv.tv_sec ^ tv.tv_sec) ^ 0x1F; i > 0; i--)
>> +		nrand48(rand_seed);
>
> This would be then unnecessary, just draw the number from nrand.

Right, this part is just copied from libuuid, but in fact we don't
really need to be that random.

> About patch separation: please introduce the new api in one patch, use
> in another (ie. drop srand and switch to it).

OK, I'll update it in the next version.

Thanks,
Qu
Re: btrfs filesystem usage - Wrong Unallocated indications - RAID10
On Mon, May 23, 2016 at 08:49:04PM +0200, Zoiled wrote:
[snip]
> For what it's worth... I have a 8 disk (7x 300GB + 1x 500GB) data
> raid10, metadata raid1 setup and I get the following output of
> btrfs...
>
> Label: 'xxyyzz'  uuid: 12345678-9abc-def1-2345-6789abcdef01
>         Total devices 8 FS bytes used 1.05TiB
>         devid    1 size 268.05GiB used 265.94GiB path /dev/sda1
>         devid    2 size 279.40GiB used 277.22GiB path /dev/sdb
>         devid    3 size 279.40GiB used 277.32GiB path /dev/sdc
>         devid    4 size 279.40GiB used 278.73GiB path /dev/sdd
>         devid    5 size 279.40GiB used 277.72GiB path /dev/sde
>         devid    6 size 279.40GiB used 278.61GiB path /dev/sdf
>         devid    7 size 279.40GiB used 278.82GiB path /dev/sdg
>         devid    8 size 465.76GiB used 230.99GiB path /dev/sdh
>
> # btrfs filesystem usage -T /
> Overall:
>     Device size:                   2.35TiB
>     Device allocated:              2.11TiB
>     Device unallocated:          244.83GiB
>     Device missing:                  0.00B
>     Used:                          2.11TiB
>     Free (estimated):            122.57GiB      (min: 122.57GiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
>                    Data      Data Metadata    System
> Id Path           RAID1    RAID10    RAID1     RAID1 Unallocated
> -- --------- ---------- --------- -------- --------- -----------
>  1 /dev/sda1          - 132.97GiB        -         -   135.08GiB
>  2 /dev/sdb           - 138.61GiB        -         -   140.79GiB
>  3 /dev/sdc           - 138.66GiB        -         -   140.74GiB
>  4 /dev/sdd           - 138.87GiB  1.00GiB         -   139.53GiB
>  5 /dev/sde     1.00GiB 137.86GiB  1.00GiB         -   139.53GiB
>  6 /dev/sdf           - 138.81GiB  1.00GiB         -   139.59GiB
>  7 /dev/sdg     1.00GiB 138.38GiB  1.00GiB  64.00MiB   138.96GiB
>  8 /dev/sdh           - 113.46GiB  4.00GiB  64.00MiB   348.24GiB
> -- --------- ---------- --------- -------- --------- -----------
>    Total        1.00GiB   1.05TiB  4.00GiB  64.00MiB     1.29TiB
>    Used      1007.30MiB   1.05TiB  1.66GiB 400.00KiB
>
> What I don't get is... how can I have 244.8 GB unallocated when the
> table below clearly shows that there is as much as 1.29TiB
> unallocated does not appear to make sense to me at least...

   This is exactly the issue. The Unallocated value(s) from btrfs fi
usage on at least RAID-10 are simply wrong, any way you look at it.

   Hugo.
-- 
Hugo Mills             | Great films about cricket: The Third Man
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
Re: btrfs filesystem usage - Wrong Unallocated indications - RAID10
Marco Lorenzo Crociani wrote:
> Hi,
> as I wrote today in IRC, I experienced an issue with 'btrfs filesystem
> usage'.
> I have a 4 partitions RAID10 btrfs filesystem almost full.
> 'btrfs filesystem usage' reports wrong "Unallocated" indications.
>
> Linux 4.5.3
> btrfs-progs v4.5.3
>
> # btrfs fi usage /data/
>
> Overall:
>     Device size:                  13.93TiB
>     Device allocated:             13.77TiB
>     Device unallocated:          167.54GiB
>     Device missing:                  0.00B
>     Used:                         13.44TiB
>     Free (estimated):            244.39GiB      (min: 244.39GiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
> Data,single: Size:8.00MiB, Used:0.00B
>    /dev/sda4       8.00MiB
>
> Data,RAID10: Size:6.87TiB, Used:6.71TiB
>    /dev/sda4       1.72TiB
>    /dev/sdb3       1.72TiB
>    /dev/sdc3       1.72TiB
>    /dev/sdd3       1.72TiB
>
> Metadata,single: Size:8.00MiB, Used:0.00B
>    /dev/sda4       8.00MiB
>
> Metadata,RAID10: Size:19.00GiB, Used:14.15GiB
>    /dev/sda4       4.75GiB
>    /dev/sdb3       4.75GiB
>    /dev/sdc3       4.75GiB
>    /dev/sdd3       4.75GiB
>
> System,single: Size:4.00MiB, Used:0.00B
>    /dev/sda4       4.00MiB
>
> System,RAID10: Size:16.00MiB, Used:768.00KiB
>    /dev/sda4       4.00MiB
>    /dev/sdb3       4.00MiB
>    /dev/sdc3       4.00MiB
>    /dev/sdd3       4.00MiB
>
> Unallocated:
>    /dev/sda4       1.76TiB
>    /dev/sdb3       1.76TiB
>    /dev/sdc3       1.76TiB
>    /dev/sdd3       1.76TiB
>
> --
>
> # btrfs fi show /data/
>
> Label: 'data'  uuid: df6639d5-3ef2-4ff6-a871-9ede440e2dae
>         Total devices 4 FS bytes used 6.72TiB
>         devid    1 size 3.48TiB used 3.44TiB path /dev/sda4
>         devid    2 size 3.48TiB used 3.44TiB path /dev/sdb3
>         devid    3 size 3.48TiB used 3.44TiB path /dev/sdc3
>         devid    4 size 3.48TiB used 3.44TiB path /dev/sdd3
>
> --
>
> # btrfs fi df /data/
>
> Data, RAID10: total=6.87TiB, used=6.71TiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID10: total=16.00MiB, used=768.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID10: total=19.00GiB, used=14.15GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> --
>
> # df -h
>
> /dev/sda4       7,0T  6,8T  245G  97% /data
>
> Regards,

For what it's worth... I have a 8 disk (7x 300GB + 1x 500GB) data
raid10, metadata raid1 setup and I get the following output of btrfs...

Label: 'xxyyzz'  uuid: 12345678-9abc-def1-2345-6789abcdef01
        Total devices 8 FS bytes used 1.05TiB
        devid    1 size 268.05GiB used 265.94GiB path /dev/sda1
        devid    2 size 279.40GiB used 277.22GiB path /dev/sdb
        devid    3 size 279.40GiB used 277.32GiB path /dev/sdc
        devid    4 size 279.40GiB used 278.73GiB path /dev/sdd
        devid    5 size 279.40GiB used 277.72GiB path /dev/sde
        devid    6 size 279.40GiB used 278.61GiB path /dev/sdf
        devid    7 size 279.40GiB used 278.82GiB path /dev/sdg
        devid    8 size 465.76GiB used 230.99GiB path /dev/sdh

# btrfs filesystem usage -T /
Overall:
    Device size:                   2.35TiB
    Device allocated:              2.11TiB
    Device unallocated:          244.83GiB
    Device missing:                  0.00B
    Used:                          2.11TiB
    Free (estimated):            122.57GiB      (min: 122.57GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

                   Data      Data Metadata    System
Id Path           RAID1    RAID10    RAID1     RAID1 Unallocated
-- --------- ---------- --------- -------- --------- -----------
 1 /dev/sda1          - 132.97GiB        -         -   135.08GiB
 2 /dev/sdb           - 138.61GiB        -         -   140.79GiB
 3 /dev/sdc           - 138.66GiB        -         -   140.74GiB
 4 /dev/sdd           - 138.87GiB  1.00GiB         -   139.53GiB
 5 /dev/sde     1.00GiB 137.86GiB  1.00GiB         -   139.53GiB
 6 /dev/sdf           - 138.81GiB  1.00GiB         -   139.59GiB
 7 /dev/sdg     1.00GiB 138.38GiB  1.00GiB  64.00MiB   138.96GiB
 8 /dev/sdh           - 113.46GiB  4.00GiB  64.00MiB   348.24GiB
-- --------- ---------- --------- -------- --------- -----------
   Total        1.00GiB   1.05TiB  4.00GiB  64.00MiB     1.29TiB
   Used      1007.30MiB   1.05TiB  1.66GiB 400.00KiB

What I don't get is... how can I have 244.8 GB unallocated when the
table above clearly shows that there is as much as 1.29TiB unallocated?
That does not appear to make sense, to me at least...
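The mismatch in the 8-disk report can be checked with plain arithmetic on the posted numbers. The sketch below is only an illustration: `dev_row` and the three helper functions are hypothetical names, the values are copied from the output above (64.00MiB written as 0.0625 GiB), and the "doubled RAID10 column" reading is an observation that happens to fit these numbers, not a confirmed diagnosis of btrfs-progs.

```c
#include <stddef.h>

/* One row per device, all values in GiB, copied from the report. */
struct dev_row {
	double size;		/* "size" from btrfs fi show */
	double used;		/* "used" from btrfs fi show */
	double raid10_col;	/* Data RAID10 column of usage -T */
	double other_cols;	/* Data RAID1 + Metadata + System columns */
	double table_unalloc;	/* Unallocated column of usage -T */
};

static const struct dev_row rows[] = {
	{ 268.05, 265.94, 132.97, 0.0,    135.08 },	/* /dev/sda1 */
	{ 279.40, 277.22, 138.61, 0.0,    140.79 },	/* /dev/sdb */
	{ 279.40, 277.32, 138.66, 0.0,    140.74 },	/* /dev/sdc */
	{ 279.40, 278.73, 138.87, 1.0,    139.53 },	/* /dev/sdd */
	{ 279.40, 277.72, 137.86, 2.0,    139.53 },	/* /dev/sde */
	{ 279.40, 278.61, 138.81, 1.0,    139.59 },	/* /dev/sdf */
	{ 279.40, 278.82, 138.38, 2.0625, 138.96 },	/* /dev/sdg */
	{ 465.76, 230.99, 113.46, 4.0625, 348.24 },	/* /dev/sdh */
};
#define NR_ROWS (sizeof(rows) / sizeof(rows[0]))

/* Unallocated space as implied by "btrfs fi show": size minus used. */
static double sum_show_unallocated(const struct dev_row *r, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += r[i].size - r[i].used;
	return sum;
}

/* Sum of the per-device Unallocated column printed by "usage -T". */
static double sum_table_unallocated(const struct dev_row *r, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += r[i].table_unalloc;
	return sum;
}

/*
 * Unallocated space if the RAID10 column is counted twice per device
 * (assumption: the column shows half of the raw per-device allocation).
 */
static double sum_doubled_raid10_unallocated(const struct dev_row *r, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += r[i].size - r[i].other_cols - 2.0 * r[i].raid10_col;
	return sum;
}
```

With these rows, size minus used sums to about 244.86 GiB, which agrees with the "Device unallocated: 244.83GiB" line up to rounding. The per-device column sums to 1322.46 GiB, i.e. the 1.29TiB total. Counting the RAID10 column twice gives about 244.85 GiB, so the per-device column looks as if it subtracts only half of each device's RAID10 allocation.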
Re: btrfs filesystem usage - Wrong Unallocated indications - RAID10
On 23 May 2016 at 09:34, Marco Lorenzo Crociani wrote:
> Hi,
> as I wrote today in IRC, I experienced an issue with 'btrfs filesystem
> usage'.
> I have a 4 partitions RAID10 btrfs filesystem almost full.
> 'btrfs filesystem usage' reports wrong "Unallocated" indications.
>
> Linux 4.5.3
> btrfs-progs v4.5.3
>
>
> # btrfs fi usage /data/
>
> Overall:
>     Device size:                  13.93TiB
>     Device allocated:             13.77TiB
>     Device unallocated:          167.54GiB

I wonder if this is related to whatever caused the free space cache
bug for Ivan Pilipenko and myself (linux 4.4.10, btrfs-progs 4.4.1)?

Cheers,
Nicholas
Re: [1/1 v2] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.
On 23 May 2016 at 13:01, David Sterba wrote:
> On Thu, May 19, 2016 at 09:30:49PM -0400, Nicholas D Steeves wrote:
>> Sorry for the noise. Please disregard my v1 patch and subsequent
>> emails. This patch is for upstream linux-next. From now on I think
>> that's what I'm going to work from, to keep things simple, because it
>> seems I'm still inept with git.
>
> The patch applies cleanly on top of the current branch that's going to
> Linus tree, so I'll queue it for the next pull request. All your inline
> notices were addressed. Thanks.

You're welcome, and thank you for the assistance. I don't want to
annoy everyone with a regular stream of these patches, so what do you
think of the following? I'll submit a patch for user-facing typos in
btrfs-progs when I find one, if I find any, and a strings & comments
review for both -progs and kernel twice a year, where one review is
part of preparing for an LTS kernel.

Regards,
Nicholas
Re: [1/1 v2] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.
On Thu, May 19, 2016 at 09:30:49PM -0400, Nicholas D Steeves wrote:
> Sorry for the noise. Please disregard my v1 patch and subsequent
> emails. This patch is for upstream linux-next. From now on I think
> that's what I'm going to work from, to keep things simple, because it
> seems I'm still inept with git.

The patch applies cleanly on top of the current branch that's going to
Linus tree, so I'll queue it for the next pull request. All your inline
notices were addressed. Thanks.
btrfs filesystem usage - Wrong Unallocated indications - RAID10
Hi,
as I wrote today in IRC, I experienced an issue with 'btrfs filesystem
usage'.
I have a 4 partitions RAID10 btrfs filesystem almost full.
'btrfs filesystem usage' reports wrong "Unallocated" indications.

Linux 4.5.3
btrfs-progs v4.5.3

# btrfs fi usage /data/

Overall:
    Device size:                  13.93TiB
    Device allocated:             13.77TiB
    Device unallocated:          167.54GiB
    Device missing:                  0.00B
    Used:                         13.44TiB
    Free (estimated):            244.39GiB      (min: 244.39GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:8.00MiB, Used:0.00B
   /dev/sda4       8.00MiB

Data,RAID10: Size:6.87TiB, Used:6.71TiB
   /dev/sda4       1.72TiB
   /dev/sdb3       1.72TiB
   /dev/sdc3       1.72TiB
   /dev/sdd3       1.72TiB

Metadata,single: Size:8.00MiB, Used:0.00B
   /dev/sda4       8.00MiB

Metadata,RAID10: Size:19.00GiB, Used:14.15GiB
   /dev/sda4       4.75GiB
   /dev/sdb3       4.75GiB
   /dev/sdc3       4.75GiB
   /dev/sdd3       4.75GiB

System,single: Size:4.00MiB, Used:0.00B
   /dev/sda4       4.00MiB

System,RAID10: Size:16.00MiB, Used:768.00KiB
   /dev/sda4       4.00MiB
   /dev/sdb3       4.00MiB
   /dev/sdc3       4.00MiB
   /dev/sdd3       4.00MiB

Unallocated:
   /dev/sda4       1.76TiB
   /dev/sdb3       1.76TiB
   /dev/sdc3       1.76TiB
   /dev/sdd3       1.76TiB

--

# btrfs fi show /data/

Label: 'data'  uuid: df6639d5-3ef2-4ff6-a871-9ede440e2dae
        Total devices 4 FS bytes used 6.72TiB
        devid    1 size 3.48TiB used 3.44TiB path /dev/sda4
        devid    2 size 3.48TiB used 3.44TiB path /dev/sdb3
        devid    3 size 3.48TiB used 3.44TiB path /dev/sdc3
        devid    4 size 3.48TiB used 3.44TiB path /dev/sdd3

--

# btrfs fi df /data/

Data, RAID10: total=6.87TiB, used=6.71TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID10: total=16.00MiB, used=768.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID10: total=19.00GiB, used=14.15GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

--

# df -h

/dev/sda4       7,0T  6,8T  245G  97% /data

Regards,

-- 
Marco Crociani
Prisma Telecom Testing S.r.l.
via Petrocchi, 4
20127 MILANO ITALY
Phone: +39 02 26113507
Fax: +39 02 26113597
e-mail: mar...@prismatelecomtesting.com
web: http://www.prismatelecomtesting.com
JFYI Lock-free Btree
Hi guys,
I just found this document:
http://www.cs.technion.ac.il/~erez/Papers/lfbtree-full.pdf

It describes an implementation of a lock-free btree.
I believe it could be interesting for someone (AFAIK btrfs uses btrees).

-- 
Have a nice day,
Timofey.
Re: [PATCH] Btrfs: fix unexpected return value of fiemap
On Wed, May 18, 2016 at 10:52:25AM -0700, Liu Bo wrote:
> On Wed, May 18, 2016 at 11:41:05AM +0200, David Sterba wrote:
> > On Tue, May 17, 2016 at 05:21:48PM -0700, Liu Bo wrote:
> > > btrfs's fiemap is supposed to return 0 on success and
> > > return < 0 on error, however, ret becomes 1 after looking
> > > up the last file extent, and if the offset is beyond EOF,
> > > we can return 1.
> > >
> > > This may confuse applications using ioctl(FS_IOC_FIEMAP).
> > >
> > > Signed-off-by: Liu Bo
> >
> > Reviewed-by: David Sterba
> >
> > > ---
> > >  fs/btrfs/extent_io.c | 6 +-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index d247fc0..16ece52 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -4379,8 +4379,12 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> > >  	if (ret < 0) {
> > >  		btrfs_free_path(path);
> > >  		return ret;
> > > +	} else {
> > > +		WARN_ON(!ret);
> > > +		if (ret == 1)
> > > +			ret = 0;
> > >  	}
> >
> > So, ret == 1 can end up here from btrfs_lookup_file_extent ->
> > btrfs_search_slot(..., ins_len=0, cow=0) and the offset does not exist,
> > we'll get path pointed to the slot where it would be inserted and ret is 1.
>
> Sounds better than the commit log, would you like me to update it?

Done.
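The return-value convention discussed in this thread (negative on error, 0 when the key is found, 1 when it is not found and the slot is where it would be inserted) can be shown with a small userspace analogue. This is a sketch only: `search_slot` and `lookup_for_caller` are hypothetical names, and a sorted array stands in for the btree; it is not the btrfs code.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a btrfs_search_slot()-style lookup over a
 * sorted array:
 *   < 0  error
 *     0  key found, *slot is its index
 *     1  key not found, *slot is the index where it would be inserted
 */
static int search_slot(const unsigned long *keys, size_t nr,
		       unsigned long key, size_t *slot)
{
	size_t lo = 0, hi = nr;

	if (!keys && nr)
		return -EINVAL;

	/* Binary search for the first element that is >= key. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}

	*slot = lo;
	return (lo < nr && keys[lo] == key) ? 0 : 1;
}

/*
 * Wrapper with the contract callers expect: 0 on success, < 0 on
 * error. The internal "not found" value must not leak out.
 */
static int lookup_for_caller(const unsigned long *keys, size_t nr,
			     unsigned long key, size_t *slot)
{
	int ret = search_slot(keys, nr, key, slot);

	if (ret < 0)
		return ret;
	if (ret == 1)	/* "not found" is not an error for the caller */
		ret = 0;
	return ret;
}
```

Searching for a key past the last element (the analogue of an offset beyond EOF) makes `search_slot` return 1 with the slot one past the end, while `lookup_for_caller` still reports 0, which mirrors what the patch does for extent_fiemap().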
Re: [PATCH] btrfs-progs: utils: use better wrappered random generator
The API does not seem right. It's fine to provide functions for full
int/u32/u64 ranges, but in the cases when we know the range from which
we request the random number, it has to be passed as parameter. Not
doing the % by hand.

> +u32 rand_u32(void)
> +{
> +	struct timeval tv;
> +	unsigned short rand_seed[3];

This could be made static (with thread local storage) so the state does
not get regenerated all the time. Possibly it could be initialized from
some true random source, not time or pid.

> +	long int ret;
> +	int i;
> +
> +	gettimeofday(&tv, 0);
> +	rand_seed[0] = getpid() ^ (tv.tv_sec & 0x);
> +	rand_seed[1] = getppid() ^ (tv.tv_usec & 0x);
> +	rand_seed[2] = (tv.tv_sec ^ tv.tv_usec) >> 16;
> +
> +	/* Crank the random number generator a few times */
> +	gettimeofday(&tv, 0);
> +	for (i = (tv.tv_sec ^ tv.tv_sec) ^ 0x1F; i > 0; i--)
> +		nrand48(rand_seed);

This would be then unnecessary, just draw the number from nrand.

About patch separation: please introduce the new api in one patch, use
in another (ie. drop srand and switch to it).
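The two review points above (keep the generator state static instead of reseeding on every call, and take the range as a parameter rather than applying % at each call site) can be sketched as follows. This is a hedged illustration, not the API that was eventually merged: `rand_seed_init` and `rand_range` are hypothetical names, the 0xFFFF masks are an assumption because the constants are cut off in the quoted patch, and the seed is still time/pid based as discussed in the thread.

```c
#define _XOPEN_SOURCE 700	/* for nrand48(), gettimeofday(), getppid() */

#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

/* Generator state kept across calls instead of being rebuilt each time. */
static unsigned short rand_seed[3];
static int rand_seed_initialized;

/* Seed once from time and pid (assumed masks, see the note above). */
static void rand_seed_init(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	rand_seed[0] = getpid() ^ (tv.tv_sec & 0xFFFF);
	rand_seed[1] = getppid() ^ (tv.tv_usec & 0xFFFF);
	rand_seed[2] = (tv.tv_sec ^ tv.tv_usec) >> 16;
	rand_seed_initialized = 1;
}

/*
 * Return a value in [0, upper). Returns 0 when upper is 0 or larger
 * than the 2^31 values nrand48() can produce.
 */
static uint32_t rand_range(uint32_t upper)
{
	const uint32_t span = 1U << 31;	/* nrand48() yields [0, 2^31) */
	uint32_t limit, r;

	if (upper == 0 || upper > span)
		return 0;
	if (!rand_seed_initialized)
		rand_seed_init();

	/* Reject the tail of the span so every result is equally likely. */
	limit = span - (span % upper);
	do {
		r = (uint32_t)nrand48(rand_seed);
	} while (r >= limit);

	return r % upper;
}
```

A caller that previously wrote `rand() % nr_devices` would call `rand_range(nr_devices)` instead. The rejection loop is a design choice of this sketch to avoid modulo bias; a simpler version could return `nrand48(rand_seed) % upper` directly.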
Re: Hot data tracking / hybrid storage
On 2016-05-20 18:26, Henk Slager wrote:
> Yes, sorry, I took some shortcut in the discussion and jumped to a
> method for avoiding this 0.5-2% slowdown that you mention. (Or a
> kernel crashing in bcache code due to corrupt SB on a backing device
> or corrupted caching device contents).
> I am actually a bit surprised that there is a measurable slowdown,
> considering that it is basically just one 8KiB offset on a certain
> layer in the kernel stack, but I haven't looked at that code.

There's still a layer of indirection in the kernel code, even in the
pass-through mode with no cache, and that's probably where the slowdown
comes from. My testing was also in a VM with its backing device on an
SSD though, so you may get different results on other hardware.

> I don't know other tables than MBR and GPT, but this bcache SB
> 'insertion' works with both. Indeed, if GRUB is involved, it can get
> complicated, I have avoided that. If there is less than 8KiB slack
> space on a HDD, I would worry about alignment/performance first, then
> there is likely a reason to fully rewrite the HDD with a standard 1M
> alignment.

The 'alignment' thing is mostly bogus these days. It originated when
1M was a full track on the disk, and you wanted your filesystem to
start on the beginning of a track for performance reasons. On most
modern disks though, this is not a full track, but it got kept because
a number of bootloaders (GRUB included) used to use the slack space
this caused to embed themselves before the filesystem. The only case
where 1M alignment actually makes sense is on SSD's with a 1M erase
block size (which are rare, most consumer devices have a 4M erase
block).

As far as partition tables, you're not likely to see any other formats
these days (the only ones I've dealt with other than MBR and GPT are
APM (the old pre-OSX Apple format), RDB (the Amiga format, which is
kind of neat because it can embed drivers), and the old Sun disk labels
(from before SunOS became Solaris)), and I had actually forgotten that
a GPT is only 32k, hence my comment about it potentially being an
issue.

> If there are more partitions, and there is a partition in front of
> the one you would like to be bcached, I personally would shrink it by
> 8KiB (like NTFS or swap or ext4) if that saves me terabytes of data
> transfers.

Definitely, although depending on how the system is set up, this will
almost certainly need down time.

This also doesn't change the fact that without careful initial
formatting (it is possible on some filesystems to embed the bcache SB
at the beginning of the FS itself, many of them have some reserved
space at the beginning of the partition for bootloaders, and this space
doesn't have to exist when mounting the FS) or manual alteration of the
partition, it's not possible to mount the FS on a system without bcache
support.

> If we consider a non-bootable single HDD btrfs FS, are you then
> suggesting that the bcache SB could be placed in the first 64KiB
> where also GRUB stores its code if the FS would need booting ?
> That would be interesting, it would mean that also for btrfs on raw
> device (and also multi-device) there is no extra exclusive 8KiB space
> needed in front.
> Is there someone who has this working? I think it would lead to
> issues on the blocklayer, but I have currently no clue about that.

I don't think it would work on BTRFS, we expect the SB at a fixed
location into the device, and it wouldn't be there on the bcache
device. It might work on ext4 though, but I'm not certain about that.
I do know of at least one person who got it working with a FAT32
filesystem as a proof of concept though. Trying to do that even if it
would work on BTRFS would be _really_ risky though, because the kernel
would potentially see both devices, and you would probably have the
same issues that you do with block level copies.
Re: [RFC PATCH v2.1 16/16] btrfs-progs: fsck: Introduce low memory mode
On Fri, May 20, 2016 at 10:33:55AM +0800, Qu Wenruo wrote:
> We'll enrich the test cases for current low memory mode.

I started something to add optional default options for a few basic
commands (mkfs, fsck, convert) to extend the coverage. I'm not finished,
the idea is to call the commands via some wrapper that will grab the
defaults from a file or from environment.
Re: [PATCH 1/6] Btrfs: fix race between readahead and device replace/removal
On Fri, May 20, 2016 at 11:11:57AM -0400, Josef Bacik wrote:
> On Fri, May 20, 2016 at 12:44 AM, wrote:
> > So fix this by taking the device_list_mutex in the readahead code. We
> > can't use here the lighter approach of using a rcu_read_lock() and
> > rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
> > because we end up doing calls to sleeping functions (kzalloc()) in the
> > respective code path.
>
> I think it might be time to change this to a rwsem as well as we use
> it in a bunch of places that are read only like statfs and readahead.
> But this works for now.

Sounds good to me.
Re: [RFC PATCH] btrfs: correct inode's outstanding_extents computation
On Mon, May 23, 2016 at 7:05 AM, Wang Xiaoguang wrote:
> hello,
>
>
> On 05/19/2016 07:01 PM, Filipe Manana wrote:
>>
>> On Thu, May 19, 2016 at 11:49 AM, Wang Xiaoguang wrote:
>>>
>>> This issue was revealed by changing BTRFS_MAX_EXTENT_SIZE (128MB) to
>>> 64KB. With BTRFS_MAX_EXTENT_SIZE set to 64KB, the fsstress test often
>>> gets these warnings from btrfs_destroy_inode():
>>>         WARN_ON(BTRFS_I(inode)->outstanding_extents);
>>>         WARN_ON(BTRFS_I(inode)->reserved_extents);
>>>
>>> The simple test program below reproduces this issue reliably.
>>>         #include <sys/types.h>
>>>         #include <sys/stat.h>
>>>         #include <fcntl.h>
>>>         #include <string.h>
>>>         #include <unistd.h>
>>>
>>>         int main(void)
>>>         {
>>>                 int fd;
>>>                 char buf[1024*1024];
>>>
>>>                 memset(buf, 0, 1024 * 1024);
>>>                 fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
>>>                 pwrite(fd, buf, 69954, 693581);
>>>                 return 0;
>>>         }
>>>
>>> Assume BTRFS_MAX_EXTENT_SIZE is 64KB, and the data range is:
>>> 692224                                                    765951
>>> |------------------------------------------------------------|
>>>                          len(73728)
>>>
>>> 1) For the above data range, btrfs_delalloc_reserve_metadata() will
>>> reserve metadata and BTRFS_I(inode)->outstanding_extents will be 2:
>>> (73728 + 65535) / 65536 == 2
>>>
>>> 2) Then btrfs_dirty_page() will be called to dirty pages and set the
>>> EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook will be called
>>> 3 times. For the first call, the extent io map looks like this:
>>> 692224           696319 696320                            765951
>>> |-------------------|   |-----------------------------------|
>>>      len(4096)                     len(69632)
>>> have EXTENT_DELALLOC
>>> and because EXTENT_FIRST_DELALLOC is set, btrfs_set_bit_hook() won't
>>> change BTRFS_I(inode)->outstanding_extents, which is still 2. See the
>>> code logic in btrfs_set_bit_hook().
>>>
>>> 3) Second btrfs_set_bit_hook() call.
>>> Because EXTENT_FIRST_DELALLOC has been unset by the previous
>>> btrfs_set_bit_hook() call, btrfs_set_bit_hook will increase
>>> BTRFS_I(inode)->outstanding_extents by one, so
>>> BTRFS_I(inode)->outstanding_extents is now 3. The extent io map
>>> looks like this:
>>> 692224           696319 696320           761855 761856    765951
>>> |-------------------|   |-------------------|   |-----------|
>>>      len(4096)               len(65536)           len(4096)
>>> have EXTENT_DELALLOC    have EXTENT_DELALLOC
>>>
>>> And because (692224, 696319) and (696320, 761855) are adjacent,
>>> btrfs_merge_extent_hook() will merge them into one delalloc extent,
>>> but according to the computation logic in btrfs_merge_extent_hook(),
>>> BTRFS_I(inode)->outstanding_extents will still be 3.
>>> After the merge, the extent io map looks like this:
>>> 692224                                   761855 761856    765951
>>> |-------------------------------------------|   |-----------|
>>>                  len(69632)                       len(4096)
>>>             have EXTENT_DELALLOC
>>>
>>> 4) Third btrfs_set_bit_hook() call.
>>> Again, because EXTENT_FIRST_DELALLOC is not set, btrfs_set_bit_hook
>>> will increase BTRFS_I(inode)->outstanding_extents by one, so
>>> BTRFS_I(inode)->outstanding_extents is now 4.
>>> The extent io map is:
>>> 692224                                   761855 761856    765951
>>> |-------------------------------------------|   |-----------|
>>>                  len(69632)                       len(4096)
>>>             have EXTENT_DELALLOC            have EXTENT_DELALLOC
>>>
>>> Also, because (692224, 761855) and (761856, 765951) are adjacent,
>>> btrfs_merge_extent_hook() will merge them into one delalloc extent;
>>> according to the computation logic in btrfs_merge_extent_hook(),
>>> BTRFS_I(inode)->outstanding_extents will decrease by one, to 3.
>>> So after the merge, the extent io map looks like this:
>>> 692224                                                    765951
>>> |------------------------------------------------------------|
>>>                          len(73728)
>>>                     have EXTENT_DELALLOC
>>>
>>> But for the original data range (start:692224 end:765951 len:73728)
>>> we should have just 2 outstanding extents, so this triggers the
>>> above WARNINGs.
>>>
>>> The root cause is that btrfs_delalloc_reserve_metadata() always adds
>>> the needed outstanding extents first, and if btrfs_set_extent_delalloc
>>> later calls btrfs_set_bit_hook() multiple times, it may wrongly update
>>> BTRFS_I(inode)->outstanding_extents. This patch chooses to also add
>>> BTRFS_I(inode)->outstanding_extents in btrfs_set_bit_hook()
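The numbers in the walkthrough above can be reproduced with two small helpers. This is only a sketch: `page_align_down`, `page_align_up` and `extents_needed` are hypothetical names, the 4096-byte page size is taken from the len(4096) pieces in the diagrams, and the ceiling division is the same formula quoted in step 1.

```c
#include <stdint.h>

/* Round an offset down to the start of its page. */
static uint64_t page_align_down(uint64_t offset, uint64_t page_size)
{
	return offset - (offset % page_size);
}

/* Round an offset up to the next page boundary. */
static uint64_t page_align_up(uint64_t offset, uint64_t page_size)
{
	return page_align_down(offset + page_size - 1, page_size);
}

/*
 * Number of extents of at most max_extent_size bytes needed to cover
 * len bytes, i.e. (len + max - 1) / max as in step 1.
 */
static uint64_t extents_needed(uint64_t len, uint64_t max_extent_size)
{
	return (len + max_extent_size - 1) / max_extent_size;
}
```

For the reproducer's pwrite() of 69954 bytes at offset 693581, the page-aligned range starts at 692224 and ends just before 765952, giving len 73728. With a 64KB maximum extent size that needs 2 extents, which is the value the counter should return to after all merges; with the default 128MB it needs 1.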
[PATCH v3] fstests: generic: Test reserved extent map search routine on dedupe file
For a fully deduped file, meaning all of its file extents point to the
same bytenr, btrfs can hit a soft lockup when the fiemap ioctl is called
on that file, as in the following output:
--
CPU: 1 PID: 7500 Comm: xfs_io Not tainted 4.5.0-rc6+ #2
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: 880027681b40 ti: 8800276e task.ti: 8800276e
RIP: 0010:[] [] __merge_refs+0x34/0x120 [btrfs]
RSP: 0018:8800276e3c08 EFLAGS: 0202
RAX: 8800269cc330 RBX: 8800269cdb18 RCX: 0007
RDX: 61b0 RSI: 8800269cc4c8 RDI: 8800276e3c88
RBP: 8800276e3c20 R08: R09: 0001
R10: R11: R12: 880026ea3cb0
R13: 8800276e3c88 R14: 880027132a50 R15: 88002743
FS: 7f10201df700() GS:88003fa0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f10201ec000 CR3: 27603000 CR4: 000406e0
Stack:
 8800276e3ce8 a0259f38 0005 8800274c6870 8800274c7d88 00c1 0001 27431190
Call Trace:
 [] find_parent_nodes+0x448/0x740 [btrfs]
 [] btrfs_check_shared+0x102/0x1b0 [btrfs]
 [] ? __might_fault+0x4d/0xa0
 [] extent_fiemap+0x2ac/0x550 [btrfs]
 [] ? __filemap_fdatawait_range+0x96/0x160
 [] ? btrfs_get_extent+0xb30/0xb30 [btrfs]
 [] btrfs_fiemap+0x45/0x50 [btrfs]
 [] do_vfs_ioctl+0x498/0x670
 [] SyS_ioctl+0x79/0x90
 [] entry_SYSCALL_64_fastpath+0x12/0x6f
Code: 41 55 41 54 53 4c 8b 27 4c 39 e7 0f 84 e9 00 00 00 49 89 fd 49 8b 34 24 49 39 f5 48 8b 1e 75 17 e9 d5 00 00 00 49 39 dd 48 8b 03 <48> 89 de 0f 84 b9 00 00 00 48 89 c3 8b 46 2c 41 39 44 24 2c 75
--

Also, btrfs returns the wrong flag for all these extents: they should
have the SHARED (0x2000) flag, while btrfs still considers them exclusive
extents.

On the other hand, with the unmerged xfs reflink patches, xfs handles it
without problem, and patched btrfs handles it as well.

This test case creates a large fully deduped file to check whether the fs
can handle the fiemap ioctl and return the correct SHARED flag, for any fs
which supports reflink.
Reported-by: Tsutomu Itoh
Signed-off-by: Qu Wenruo
---
v2:
  Use more wrapper of xfs_io
  Add fiemap requirement
  Refactor output to match golden output if LOAD_FACTOR is not 1
v3:
  Fix a bug that temporary file is not removed.
---
 common/punch          | 17 +
 tests/generic/352     | 98 +++
 tests/generic/352.out |  5 +++
 tests/generic/group   |  1 +
 4 files changed, 121 insertions(+)
 create mode 100755 tests/generic/352
 create mode 100644 tests/generic/352.out

diff --git a/common/punch b/common/punch
index 43f04c2..44c6e1c 100644
--- a/common/punch
+++ b/common/punch
@@ -218,6 +218,23 @@ _filter_fiemap()
 	_coalesce_extents
 }
 
+_filter_fiemap_flags()
+{
+	$AWK_PROG '
+		$3 ~ /hole/ {
+			print $1, $2, $3;
+			next;
+		}
+		$5 ~ /0x[[:xdigit:]]*8[[:xdigit:]][[:xdigit:]]/ {
+			print $1, $2, "unwritten";
+			next;
+		}
+		$5 ~ /0x[[:xdigit:]]+/ {
+			print $1, $2, $5;
+		}' |
+	_coalesce_extents
+}
+
 # Filters fiemap output to only print the
 # file offset column and whether or not
 # it is an extent or a hole
diff --git a/tests/generic/352 b/tests/generic/352
new file mode 100755
index 000..70e43fb
--- /dev/null
+++ b/tests/generic/352
@@ -0,0 +1,98 @@
+#! /bin/bash
+# FS QA Test 352
+#
+# Test fiemap ioctl on heavily deduped file
+#
+# This test case will check if reserved extent map searching go
+# without problem and return correct SHARED flag.
+# Which btrfs will soft lock up and return wrong shared flag.
+#
+#---
+# Copyright (c) 2016 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1 #
Re: [PATCH v2] fstests: generic: Test reserved extent map search routine on deduped file
On Thu, May 12, 2016 at 03:37:39PM +0800, Qu Wenruo wrote:
> For fully deduped file, which means all its file extents are pointing to
> the same bytenr, btrfs can cause soft lockup when calling fiemap ioctl
> on that file, like the following output:
[snip]
> +
> +# then call fiemap on that file to test both the shared flag and if
> +# reserved extent mapping search will cause soft lockup
> +$XFS_IO_PROG -c "fiemap -v" $file | _filter_fiemap_flags > $tmp
> +cat $tmp >> $seqres.full

$tmp won't be removed after the test: _cleanup() only removes $tmp.*, so
$tmp.out would be better.

> +
> +# refact the $LOAD_FACTOR to 1 to match the golden output
> +sed -i -e "s/$(($last_extent - 1))/$(($orig_last_extent - 1))/" \
> +	-e "s/$last_extent/$orig_last_extent/" \
> +	-e "s/$end/$orig_end/" $tmp
> +cat $tmp

Same here.

Otherwise looks good to me.

Thanks,
Eryu

> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/352.out b/tests/generic/352.out
> new file mode 100644
> index 000..a87c507
> --- /dev/null
> +++ b/tests/generic/352.out
> @@ -0,0 +1,5 @@
> +QA output created by 352
> +wrote 131072/131072 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +0: [0..2097151]: 0x2000
> +1: [2097152..2097407]: 0x2001
> diff --git a/tests/generic/group b/tests/generic/group
> index 36fb759..3f00386 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -354,3 +354,4 @@
>  349 blockdev quick rw
>  350 blockdev quick rw
>  351 blockdev quick rw
> +352 auto clone
> --
> 2.5.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
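The two flag values in the golden output (0x2000 and 0x2001) can be decoded with plain bit tests. A minimal sketch, assuming 0x2000 is the SHARED bit (as stated in the patch description) and 0x1 marks the last extent of the file (as in the Linux fiemap header); the EX_ prefixed constants and the helper names are hypothetical and defined locally so the snippet does not depend on kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the fiemap extent flag bits (hypothetical names). */
#define EX_EXTENT_LAST   0x0001u /* last extent in the file */
#define EX_EXTENT_SHARED 0x2000u /* extent is referenced more than once */

/* Returns 1 when the SHARED bit is set in a fiemap flags word. */
static int extent_is_shared(uint32_t flags)
{
	return (flags & EX_EXTENT_SHARED) != 0;
}

/* Returns 1 when the flags word marks the last extent of the file. */
static int extent_is_last(uint32_t flags)
{
	return (flags & EX_EXTENT_LAST) != 0;
}

/* Returns 1 when every extent in the array carries the SHARED bit. */
static int all_extents_shared(const uint32_t *flags, size_t count)
{
	size_t i;

	assert(flags != NULL);
	for (i = 0; i < count; i++)
		if (!extent_is_shared(flags[i]))
			return 0;
	return 1;
}
```

Read this way, both lines of the golden output are shared extents, and only the second one (0x2001) also carries the last-extent bit.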
Re: [PATCH RFC] btrfs: Slightly speedup btrfs_read_block_groups
Any comment on this patch?

BTW, for anyone who is interested in the speedup and the trace results,
I've uploaded them to Google Drive:

https://drive.google.com/open?id=0BxpkL3ehzX3pbFEybXd3X3MzRGM
https://drive.google.com/open?id=0BxpkL3ehzX3pd1ByOFhhbml3Ujg

Thanks,
Qu

Qu Wenruo wrote on 2016/05/05 15:51 +0800:

btrfs_read_block_groups() is the most time-consuming function when the
whole fs is filled with small extents.

For a btrfs filled entirely with 16K files, with 2T of space used,
mounting the fs takes 10 to 12 seconds.

ftrace shows that btrfs_read_block_groups() takes about 9 seconds, while
btrfs_read_chunk_tree() takes only 14ms.
In theory, btrfs_read_chunk_tree() and btrfs_read_block_groups() should
take the same time, as chunks and block groups are mapped 1:1.

However, since block group items are spread across the large extent tree,
searching the btree takes a lot of time.

Furthermore, the find_first_block_group() function used by
btrfs_read_block_groups() locates the block group item in a very
inefficient way: it searches and then checks slot by slot.

In kernel space, checking slot by slot is somewhat time consuming, because
in the next_leaf() case the kernel needs to do extra locking.

This patch removes the slot-by-slot checking: by the time we call
btrfs_read_block_groups(), we have already read out all chunks and saved
them into map_tree.

So we use map_tree to get the exact block group start and length, then do
an exact btrfs_search_slot(), without the slot-by-slot check, to speed up
the mount.

With this patch, the time spent in btrfs_read_block_groups() is reduced to
7.56s, compared to the old 8.94s.

Reported-by: Tsutomu Itoh
Signed-off-by: Qu Wenruo
---
A further fix would change the mount process from reading out all block
groups to reading out block groups on demand.

But judging by the time spent in btrfs_read_chunk_tree(), the real problem
is the on-disk format and btree locking.
If block group items were arranged like chunks, in a dedicated tree,
btrfs_read_block_groups() should take the same time as
btrfs_read_chunk_tree().

Furthermore, if we can split the current huge extent tree into something
like a per-chunk extent tree, a lot of current code like delayed_refs can
be removed, as extent tree operations will be much faster.
---
 fs/btrfs/extent-tree.c | 61 --
 fs/btrfs/extent_map.c  |  1 +
 fs/btrfs/extent_map.h  | 22 ++
 3 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8507484..9fa7728 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9520,39 +9520,20 @@ out:
 	return ret;
 }
 
-static int find_first_block_group(struct btrfs_root *root,
-				  struct btrfs_path *path, struct btrfs_key *key)
+int find_block_group(struct btrfs_root *root,
+		     struct btrfs_path *path,
+		     struct extent_map *chunk_em)
 {
 	int ret = 0;
-	struct btrfs_key found_key;
-	struct extent_buffer *leaf;
-	int slot;
-
-	ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
-	if (ret < 0)
-		goto out;
+	struct btrfs_key key;
 
-	while (1) {
-		slot = path->slots[0];
-		leaf = path->nodes[0];
-		if (slot >= btrfs_header_nritems(leaf)) {
-			ret = btrfs_next_leaf(root, path);
-			if (ret == 0)
-				continue;
-			if (ret < 0)
-				goto out;
-			break;
-		}
-		btrfs_item_key_to_cpu(leaf, &found_key, slot);
+	key.objectid = chunk_em->start;
+	key.offset = chunk_em->len;
+	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 
-		if (found_key.objectid >= key->objectid &&
-		    found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-			ret = 0;
-			goto out;
-		}
-		path->slots[0]++;
-	}
-out:
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret > 0)
+		ret = -ENOENT;
 	return ret;
 }
 
@@ -9771,16 +9752,14 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 	struct btrfs_block_group_cache *cache;
 	struct btrfs_fs_info *info = root->fs_info;
 	struct btrfs_space_info *space_info;
-	struct btrfs_key key;
+	struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+	struct extent_map *chunk_em;
 	struct btrfs_key found_key;
 	struct extent_buffer *leaf;
 	int need_clear = 0;
 	u64 cache_gen;
 
 	root = info->extent_root;
-	key.objectid = 0;
-	key.offset = 0;
-	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	path = btrfs_alloc_path();
 	if (!path)
 		return
Re: [RFC PATCH] btrfs: correct inode's outstanding_extents computation
hello,

On 05/19/2016 07:01 PM, Filipe Manana wrote:
On Thu, May 19, 2016 at 11:49 AM, Wang Xiaoguang wrote:

This issue was revealed by changing BTRFS_MAX_EXTENT_SIZE (128MB) to 64KB.
With that change, the fsstress test often gets these warnings from
btrfs_destroy_inode():
    WARN_ON(BTRFS_I(inode)->outstanding_extents);
    WARN_ON(BTRFS_I(inode)->reserved_extents);

The simple test program below reproduces this issue reliably.
    #include <fcntl.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int fd;
        char buf[1024*1024];

        memset(buf, 0, 1024 * 1024);
        fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
        pwrite(fd, buf, 69954, 693581);
        return 0;
    }

Assume BTRFS_MAX_EXTENT_SIZE is 64KB, and the data range is:
    [692224, 765951]  len(73728)

1) For the above data range, btrfs_delalloc_reserve_metadata() will
reserve metadata and BTRFS_I(inode)->outstanding_extents will be 2:
    (73728 + 65535) / 65536 == 2

2) Then btrfs_dirty_page() will be called to dirty pages and set the
EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called
3 times. After the first call, the extent io map is:
    [692224, 696319]  len(4096)   have EXTENT_DELALLOC
    [696320, 765951]  len(69632)
Because EXTENT_FIRST_DELALLOC is set, btrfs_set_bit_hook() won't change
BTRFS_I(inode)->outstanding_extents, which is still 2. See the code logic
in btrfs_set_bit_hook().

3) Second btrfs_set_bit_hook() call.
Because EXTENT_FIRST_DELALLOC has been unset by the previous
btrfs_set_bit_hook() call, btrfs_set_bit_hook() will increase
BTRFS_I(inode)->outstanding_extents by one, so it is now 3.
The extent io map is:
    [692224, 696319]  len(4096)   have EXTENT_DELALLOC
    [696320, 761855]  len(65536)  have EXTENT_DELALLOC
    [761856, 765951]  len(4096)

Because (692224, 696319) and (696320, 761855) are adjacent,
btrfs_merge_extent_hook() will merge them into one delalloc extent, but
according to the computation logic in btrfs_merge_extent_hook(),
BTRFS_I(inode)->outstanding_extents will still be 3.
After the merge, the extent io map is:
    [692224, 761855]  len(69632)  have EXTENT_DELALLOC
    [761856, 765951]  len(4096)

4) Third btrfs_set_bit_hook() call.
Again, because EXTENT_FIRST_DELALLOC is not set, btrfs_set_bit_hook()
will increase BTRFS_I(inode)->outstanding_extents by one, so it is now 4.
The extent io map is:
    [692224, 761855]  len(69632)  have EXTENT_DELALLOC
    [761856, 765951]  len(4096)   have EXTENT_DELALLOC

Because (692224, 761855) and (761856, 765951) are adjacent,
btrfs_merge_extent_hook() will merge them into one delalloc extent, and
according to the computation logic in btrfs_merge_extent_hook(),
BTRFS_I(inode)->outstanding_extents will decrease by one, to 3.
After the merge, the extent io map is:
    [692224, 765951]  len(73728)  have EXTENT_DELALLOC

But for the original data range (start:692224 end:765951 len:73728) we
should have just 2 outstanding extents, so it will trigger the above
WARNINGs.

The root cause is that btrfs_delalloc_reserve_metadata() always adds the
needed outstanding extents first, and if btrfs_set_extent_delalloc() later
calls btrfs_set_bit_hook() multiple times, it may wrongly update
BTRFS_I(inode)->outstanding_extents. This patch chooses to also add
BTRFS_I(inode)->outstanding_extents in btrfs_set_bit_hook() according to
the data range length, and the added value is the correct number of
outstanding_extents for this data range, then decrease the value which was
added in