[PATCH 1/1] Btrfs: Remove redundant NULL check before kfree
There is no need for a NULL check before kfree(); remove it.

Signed-off-by: Maninder Singh
Reviewed-by: Akhilesh Kumar
---
 fs/btrfs/free-space-cache.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 9dbe5b5..88f1e16 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2101,8 +2101,7 @@ new_bitmap:
 out:
 	if (info) {
-		if (info->bitmap)
-			kfree(info->bitmap);
+		kfree(info->bitmap);
 		kmem_cache_free(btrfs_free_space_cachep, info);
 	}
@@ -3561,8 +3560,7 @@ again:
 	if (info)
 		kmem_cache_free(btrfs_free_space_cachep, info);
-	if (map)
-		kfree(map);
+	kfree(map);
 	return 0;
 }
--
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
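The cleanup relies on kfree() accepting a NULL pointer and doing nothing. Standard C gives free() the same guarantee, so the before and after shapes can be tried in userspace. This is a minimal sketch only: struct entry and both release functions are hypothetical stand-ins, not names from the btrfs source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an entry that may or may not own a bitmap. */
struct entry {
	unsigned char *bitmap;	/* NULL when no bitmap was allocated */
};

/* Counts entries released, only so the behaviour can be checked. */
static int entries_released;

/* Old shape: explicit NULL check before freeing the bitmap. */
static void entry_release_checked(struct entry *e)
{
	if (!e)
		return;
	if (e->bitmap)
		free(e->bitmap);
	free(e);
	entries_released++;
}

/* New shape: free(NULL) is a no-op, so the inner check is redundant. */
static void entry_release(struct entry *e)
{
	if (!e)
		return;
	free(e->bitmap);
	free(e);
	entries_released++;
}
```

The outer check on the entry itself has to stay in both shapes, because reading e->bitmap dereferences the pointer.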
BTRFS balance fails with -dusage=100
OpenSuSE 13.2 system with a single BTRFS / mounted on top of /dev/md1. /dev/md1 is md raid5 across 4 SATA disks. System details are:

Linux suse132 4.0.5-4.g56152db-default #1 SMP Thu Jun 18 15:11:06 UTC 2015 (56152db) x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.1+20150622

Label: none  uuid: 33b98d97-606b-4968-a266-24a48a9fe50d
	Total devices 1 FS bytes used 884.21GiB
	devid    1 size 1.36TiB used 889.06GiB path /dev/md1

Data, single: total=885.00GiB, used=883.12GiB
System, DUP: total=32.00MiB, used=144.00KiB
Metadata, DUP: total=2.00GiB, used=1.09GiB
GlobalReserve, single: total=384.00MiB, used=0.00B

Relevant entries from the log are:

2015-06-22T22:46:32.238011-05:00 suse132 kernel: [90193.446128] BTRFS: bdev /dev/md1 errs: wr 9977, rd 0, flush 0, corrupt 0, gen 0
2015-06-22T22:46:32.238050-05:00 suse132 kernel: [90193.446158] BTRFS: bdev /dev/md1 errs: wr 9978, rd 0, flush 0, corrupt 0, gen 0
2015-06-22T22:46:32.238054-05:00 suse132 kernel: [90193.446179] BTRFS: bdev /dev/md1 errs: wr 9979, rd 0, flush 0, corrupt 0, gen 0

The system was (and still is, other than btrfs balance) running fine. Then I did massive data I/O, copying and deleting massive amounts of data, to bring the system into its present state. Once I was done with the I/O, I kicked off

btrfs balance start /

That command failed. Then I started running

btrfs balance -dusage=XX /

This command succeeds with XX up to and including 99. It fails when I set XX to 100. btrfs balance also fails if I omit the -dusage option.

The errors in the log make no sense to me, since the md raid device is not reporting any errors at all. Running btrfs scrub also reports no errors at all.

Any ideas on how to get btrfs balance to succeed without errors would be welcome.

Regards,
--Moby
Re: NULL pointer dereference during snapshot removal
On Sat, Jun 20, 2015 at 04:53:24PM +0200, Christoph Biedl wrote:
> Hi there,
>
> I'm having trouble with btrfs where removing a snapshot causes a
> kernel Oops at blk_get_backing_dev_info+0x10/0x1c (plus or minus a
> few bytes). Is this a known issue? Else I'll dig further. Stack
> traces below.

Can you use gdb to locate the line of blk_get_backing_dev_info+0x10/0x1c?

Although the stack trace comes from btrfs, btrfs doesn't play with the inode's bdi.

Thanks,

-liubo

>
> In general these snapshot operations work as expected. In a specific
> setup they fail every time. I can try to trim this down to a simple
> and public reproducer but I expect this will take some time. Basically
> this is a private Debian buildd using sbuild/schroot with btrfs
> snapshots. Building a certain package results in the trouble. That
> package is not public but does a lot of nasty things during the build,
> including probing block devices[1]. The build runs as expected, the
> cleanup however does not.
>
> * btrfs-tools is v3.17
> * kernel is the latest 4.0.x stable series. Note even yesterday's
>   4.0.6-rc1 is affected.
> * userland is both Debian wheezy and jessie
> * the build chroot is Debian jessie, Debian wheezy is not affected
>
>     Christoph
>
> [1] Those who are familiar with sbuild: Build dependencies include
>     dmsetup, lvm2, mdadm, and udev. Starting daemons is disabled
>     by an according policy-rc.d snippet but I expect somebody isn't
>     playing nice here. And still, this must not affect btrfs in such a
>     way.
> > Unable to handle kernel NULL pointer dereference at virtual address 0204 > pgd = ec0b8000 > [0204] *pgd=6e22f831, *pte=, *ppte= > Internal error: Oops: 17 [#1] SMP ARM > Modules linked in: nfsd btrfs xor raid6_pq sunxi_sid > CPU: 1 PID: 7351 Comm: btrfs Not tainted 4.0.6-rc1 #1 > Hardware name: Allwinner sun7i (A20) Family > task: eca16040 ti: e1022000 task.ti: e1022000 > PC is at blk_get_backing_dev_info+0x10/0x1c > LR is at inode_to_bdi+0x38/0x48 > pc : []lr : []psr: 20070013 > sp : e1023b60 ip : e1023b70 fp : e1023b6c > r10: e16e51c8 r9 : 7fff r8 : > r7 : r6 : r5 : edc03890 r4 : ee027000 > r3 : r2 : r1 : 7fff r0 : edc03800 > Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user > Control: 10c5387d Table: 6c0b806a DAC: 0015 > Process btrfs (pid: 7351, stack limit = 0xe1022218) > Stack: (0xe1023b60 to 0xe1024000) > 3b60: e1023b84 e1023b70 c012b794 c02df058 edc03964 e1023bbc e1023b88 > 3b80: c00bd708 c012b768 7fff 7fff > 3ba0: 0001 7fff e1023be4 e1023bc0 c00be5c0 c00bd6d0 > 3bc0: 7fff 0001 e58a2910 e16e51c8 7fff e1023c14 e1023be8 > 3be0: bf14d354 c00be5a8 7fff fffe > 3c00: e16e50b0 e1023c5c e1023c18 bf1530b8 bf14d334 7fff > 3c20: 7fff e16e51c8 > 3c40: e16e50b0 e16e50cc e1023ccc e1023c60 bf140e1c bf153028 > 3c60: e1023cb4 e1023c78 c012ae1c c005e134 e16e5234 0007 > 3c80: 1000 ec5f7800 e1023c90 e1023c90 c09ca300 e16e51c8 > 3ca0: e16e5270 e16e51c8 e16e5270 c09ca300 bf1c28d4 015e ec5f7800 > 3cc0: e1023cec e1023cd0 c011e338 bf140ba0 e16e51c8 ed4ba800 e16e5218 bf1c28d4 > 3ce0: e1023d0c e1023cf0 c011eed4 c011e294 e16e513c ec5f7b50 e16e51c8 > 3d00: e1023d3c e1023d10 bf14132c c011ed5c 2dc0a000 ec942000 ec645000 ec5f7800 > 3d20: eb04fc38 eb0b9920 ec826dc0 e1023dcc e1023d40 bf173e88 bf14117c > 3d40: 0139 ea52f388 0038 c0a15380 ec5f7800 eb04fc38 ec5f7b68 > 3d60: ede805d8 c00c3794 eb0b9990 ede6abd8 ec645000 0004 > 3d80: ed9f6600 00060006 00070001 > 3da0: 00024800 ede6ab68 ec826dc0 ec645000 5000940f ede6ab68 bea3d7a8 ec826dc0 > 3dc0: e1023ef4 e1023dd0 bf177408 bf1738c8 
c09cb880 ee02fe00 eea7adb4 ed81d778 > 3de0: eea7adb4 ed81d740 eea7adb4 0136c000 ed81d778 eea7adb4 e1023e1c e1023e08 > 3e00: 0103 ed5553f8 0136c000 ed81d778 e1023eb4 e1023e20 c00e11e0 c001d3b4 > 3e20: 0024 ec826dc0 ede6ab68 e1023e40 c0110680 ec826dc0 > 3e40: e1023ed0 e1023f5c ec0b8048 0040 05b0 016c 0009 > 3e60: c0112e54 c010e3e4 e1023e94 b6dd e1023f40 bea3d6b0 0079 e9dd1740 > 3e80: e1023fb0 ee02fe00 e1023eb4 e1023fb0 ed81d740 eca16040 0136c0e4 ed5553f8 > 3ea0: ed81d77c 0817 e1023f04 e1023eb8 c001c8f8 c0060268 e1023f4c e1023ec8 > 3ec0: c0113e88 c0112dc8 0043 ede6ab68 ec826dc0 bea3d7a8 5000940f 0003 > 3ee0: e1022000 e1023f7c e1023ef8 c011607c bf175fd8 e1023fac e1023f08 > 3f00: c0008588 c001c79c ede6ab68 4020 c09cbc34 ec942000 ec942000 ec826dc0 > 3f20: 4020 ede6ab68 e1023f4c e1023f38 c01134c4 c00f8348 eca16040 0003 > 3f40: e1023f94 e1023f50 e102
Re: [PATCH v2 0/5] Btrfs: RAID 5/6 missing device scrub+replace
Hi,

I have tested your PATCH v2, but something went wrong.

kernel: 4.1.0-rc7+ with your five patches
VirtualBox ubuntu14.10-server + LVM

I built a new btrfs.ko with your patches, removed the original module with rmmod and loaded the new one with insmod.

With the RAID1/10 profiles, mkfs succeeds, but when mounting the fs, dmesg dumps:

trans: 18446612133975020584 running 5
btrfs transid mismatch buffer 29507584, found 18446612133975020584 running 5
btrfs transid mismatch buffer 29507584, found 18446612133975020584 running 5
btrfs transid mismatch buffer 29507584, found 18446612133975020584 running 5
... ...

With RAID5/6, after mkfs the system stopped at the 'mount -t btrfs /dev/mapper/server-dev1 /mnt' command.

That's all.

On 2015-06-20 02:52, Omar Sandoval wrote:

Hi,

Here's version 2 of the missing device RAID 5/6 fixes. The original problem was reported by a user on Bugzilla: the kernel crashed when attempting to replace a missing device in a RAID 6 filesystem. This is detailed and fixed in patch 4. After the initial posting, Zhao Lei reported a similar issue when doing a scrub on a RAID 5 filesystem with a missing device. This is fixed in the added patch 5.

My new-and-improved-and-overengineered reproducer as well as Zhao Lei's reproducer can be found below.

Thanks!
v1: http://article.gmane.org/gmane.comp.file-systems.btrfs/45045

v1->v2:
- Add missing scrub_wr_submit() in scrub_missing_raid56_worker()
- Add clarifying comment in dev->missing case of scrub_stripe() (Zhaolei)
- Add fix for scrub with missing device (patch 5)

Omar Sandoval (5):
  Btrfs: remove misleading handling of missing device scrub
  Btrfs: count devices correctly in readahead during RAID 5/6 replace
  Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
  Btrfs: fix device replace of a missing RAID 5/6 device
  Btrfs: fix parity scrub of RAID 5/6 with missing device

 fs/btrfs/raid56.c |  87 ---
 fs/btrfs/raid56.h |  10 ++-
 fs/btrfs/reada.c  |   4 +-
 fs/btrfs/scrub.c  | 202 +-
 4 files changed, 259 insertions(+), 44 deletions(-)

Reproducer 1:

#!/bin/bash

usage () {
	USAGE_STRING="Usage: $0 [OPTION]...
Options:
  -m    failure mode; MODE is 'eio', 'missing', or 'corrupt' (defaults to 'missing')
  -n    number of files to write, each twice as big as the last, the first being 1M in size (defaults to 4)
  -o    operation to perform; OP is 'replace' or 'scrub' (defaults to 'replace')
  -r    RAID profile; RAID is 'raid0', 'raid1', 'raid10', 'raid5', or 'raid6' (defaults to 'raid5')

Miscellaneous:
  -h    display this help message and exit"

	case "$1" in
	out)
		echo "$USAGE_STRING"
		exit 0
		;;
	err)
		echo "$USAGE_STRING" >&2
		exit 1
		;;
	esac
}

MODE=missing
RAID=raid5
OP=replace
NUM_FILES=4

while getopts "m:n:o:r:h" OPT; do
	case "$OPT" in
	m)
		MODE="$OPTARG"
		;;
	r)
		RAID="$OPTARG"
		;;
	o)
		OP="$OPTARG"
		;;
	n)
		NUM_FILES="$OPTARG"
		if [[ ! "$NUM_FILES" =~ ^[0-9]+$ ]]; then
			usage "err"
		fi
		;;
	h)
		usage "out"
		;;
	*)
		usage "err"
		;;
	esac
done

case "$MODE" in
eio|missing|corrupt)
	;;
*)
	usage err
	;;
esac

case "$RAID" in
raid[01])
	NUM_RAID_DISKS=2
	;;
raid10)
	NUM_RAID_DISKS=4
	;;
raid5)
	NUM_RAID_DISKS=3
	;;
raid6)
	NUM_RAID_DISKS=4
	;;
*)
	usage err
	;;
esac

case "$OP" in
replace)
	NUM_DISKS=$((NUM_RAID_DISKS + 1))
	;;
scrub)
	NUM_DISKS=$NUM_RAID_DISKS
	;;
*)
	usage err
	;;
esac

echo "Running $OP on $RAID with $MODE"

SRC_DISK=$((NUM_RAID_DISKS - 1))
TARGET_DISK=$((NUM_DISKS - 1))
NUM_SECTORS=$((1024 * 1024))
LOOP_DEVICES=()
DM_DEVICES=()

cleanup () {
	echo "Done. Press enter to cleanup..."
	read
	if findmnt /mnt; then
		umount /mnt
	fi
	for DM in "${DM_DEVICES[@]}"; do
		dmsetup remove "$DM"
	done
	for LOOP in "${LOOP_DEVICES[@]}"; do
		losetup --detach "$LOOP"
	done
	for ((i =
Re: qgroup limit clearing, was Re: Btrfs progs release 4.1
Tsutomu Itoh wrote on 2015/06/23 08:55 +0900:
> On 2015/06/23 3:18, Christian Robottom Reis wrote:
>> On Mon, Jun 22, 2015 at 05:00:23PM +0200, David Sterba wrote:
>>> - qgroup:
>>>   - show: distinguish no limits and 0 limit value
>>>   - limit: ability to clear the limit
>>
>> I'm using kernel 4.1-rc7 as per:
>>
>> root@riff:/var/lib/lxc/juju-trusty-lxc-template/rootfs# uname -a
>> Linux riff 4.1.0-040100rc7-generic #201506080035 SMP Mon Jun 8 04:36:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>
>> But apart from still having major issues with qgroups (quota
>> enforcement triggers even when there seems to be plenty of free space)
>> clearing limits with btrfs-progs 4.1 doesn't revert back to 'none',
>> instead confusingly setting the quota to 16EiB. Using:
>>
>> root@riff:/var/lib/lxc/juju-trusty-lxc-template/rootfs# btrfs version
>> btrfs-progs v4.1
>>
>> I start from:
>>
>> qgroupid         rfer         excl     max_rfer     max_excl
>> 0/5           2.15GiB      1.95GiB         none         none
>> 0/261         1.42GiB      1.11GiB         none    100.00GiB
>> 0/265         1.09GiB    600.59MiB         none    100.00GiB
>> 0/271       793.32MiB    366.40MiB         none    100.00GiB
>> 0/274       514.96MiB    142.92MiB         none    100.00GiB
>>
>> I then issue:
>>
>> root@riff# btrfs qgroup limit -e none 261 /var
>> root@riff# btrfs qgroup limit none 261 /var
>>
>> I end up with:
>>
>> qgroupid         rfer         excl     max_rfer     max_excl
>> 0/5           2.15GiB      1.95GiB         none         none
>> 0/261         1.42GiB      1.11GiB     16.00EiB     16.00EiB
>> 0/265         1.09GiB    600.59MiB         none    100.00GiB
>> 0/271       793.32MiB    366.40MiB         none    100.00GiB
>> 0/274       514.96MiB    142.92MiB         none    100.00GiB
>>
>> Is that expected?
>
> The following fix is necessary for the kernel to display it correctly.
>
> [PATCH] btrfs: qgroup: allow user to clear the limitation on qgroup
> http://marc.info/?l=linux-btrfs&m=143331495409594&w=2
>
> Thanks,
> Tsutomu

I'll send a new pull request containing this patch once we have finished the full test.

The pull will mainly consist of small cleanups and bug fixes, so it should be quite safe, but I still want to make sure it's completely safe anyway.

Thanks,
Qu
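The 16.00EiB figure in the report is what an all-ones 64-bit value looks like once it is scaled to binary units. The arithmetic can be sketched as below, assuming (the thread does not confirm this) that the cleared limit reaches the display code as the maximum u64 value. format_eib is a hypothetical helper, not a btrfs-progs function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: render a byte count in EiB with two decimals,
 * in the same style as the qgroup listing.
 */
static void format_eib(uint64_t bytes, char *buf, size_t len)
{
	const double eib = 1152921504606846976.0;	/* 2^60 bytes */

	snprintf(buf, len, "%.2fEiB", (double)bytes / eib);
}
```

UINT64_MAX is one byte short of 2^64, which is 16 times 2^60, so with two decimals it rounds to 16.00EiB.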
Re: Corrupted btrfs partition (converted from ext4) after balance
Vianney Stroebel wrote on 2015/06/19 01:55 +0200:
> One of my btrfs partitions seems to have been corrupted. Since I've tried
> to balance it, I can only mount it read-only. I have been able to use it
> read-only without problem so far, so the data seems safe.
>
> When I remove the "ro" option, the "mount" command hangs and some programs
> do not function properly (iotop hangs too and Firefox cannot load new web
> pages). Every few seconds a message is printed on syslog (see attached
> file). If I try to terminate the "mount" process with ctrl+c, my whole
> system hangs.
>
> This partition was converted from ext4 and I could use it fine after that.
> It got corrupted when I tried to balance it a few days ago (even though I
> think I had balanced it before, but I'm not sure about this). The balance
> seemed to have started, but "balance status" showed no progress even
> after one hour.
>
> This partition is on one hard disk (no raid). Mount options:
> defaults,compress=lzo,noatime,nodiratime,noauto,ro.
>
> My system also runs on btrfs on another disk (ssd) without any problems
> apart from quite poor performance (but that's for another post).
>
> The command "btrfs scrub start /_big -r" hangs my system.
>
> The command "btrfs check /dev/sdb1" outputs:
> "Checking filesystem on /dev/sdb1
> UUID: 21873ba7-438a-4fbf-a051-ace28bffd264
> checking extents"
> and stops after a few minutes with no other output.

Maybe I'm too late to point out the problem, but you may be too impatient with btrfsck.

Unlike fsck for ext4/xfs, btrfsck always reads the whole metadata to check its consistency, so it may take a long time.

Without the full output of btrfsck, it's quite hard to call this a clear bug report, if you want to save the data in the corrupted partition. On the other hand, if you can provide the full output, there is a chance that developers interested in btrfsck can help solve your problem.

BTW, it seems that you are also impatient about the reply speed on the btrfs mailing list.

IMHO, the current btrfs mailing list is much like a developer mailing list. Although there are talented sysadmins like Duncan or Marc here, most of the developers are hardly interested in a bug report without a reproducer or even btrfsck output. Not to mention they are also busy fixing bugs or developing new features.

So just calm down and be patient with both btrfsck and the developers.

Thanks,
Qu

> I did not try "btrfs check --repair".
>
> "Btrfs-zero-log" doesn't seem to apply here.
>
> I could copy the data to another freshly formatted disk and reformat this
> one, but I am wondering if btrfs is stable enough to be used on my
> professional laptop (where I cannot afford such downtime) or if I should
> go back to ext4.
>
> So the goal of this message is not only to see if I can repair this
> partition, but also to assess if btrfs corrupts partitions randomly and
> irreversibly. If the root cause resides in a non-essential feature
> (conversion or balancing for example), I would happily continue to use it
> without this feature.
>
> This is my first message on this mailing list. I've spent the last hours
> trying to solve this.
>
> More info:
>
> uname -a
> Linux viybel-pc 3.19.0-21-generic #21-Ubuntu SMP Sun Jun 14 18:31:11 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> btrfs --version
> btrfs-progs v3.19.1
>
> btrfs fi show
> Label: none  uuid: 358f485d-690d-436d-ad35-3a1f47329ed7
> 	Total devices 1 FS bytes used 107.75GiB
> 	devid    1 size 111.79GiB used 111.79GiB path /dev/sda1
>
> Label: none  uuid: 21873ba7-438a-4fbf-a051-ace28bffd264
> 	Total devices 1 FS bytes used 606.17GiB
> 	devid    1 size 698.63GiB used 660.03GiB path /dev/sdb1
>
> btrfs fi df /_big
> Data, single: total=431.00GiB, used=419.49GiB
> System, single: total=32.00MiB, used=64.00KiB
> Metadata, single: total=229.00GiB, used=186.67GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> dmesg > dmesg.log (attached)
>
> Vianney
Re: qgroup limit clearing, was Re: Btrfs progs release 4.1
On 2015/06/23 3:18, Christian Robottom Reis wrote:
> On Mon, Jun 22, 2015 at 05:00:23PM +0200, David Sterba wrote:
>> - qgroup:
>>   - show: distinguish no limits and 0 limit value
>>   - limit: ability to clear the limit
>
> I'm using kernel 4.1-rc7 as per:
>
> root@riff:/var/lib/lxc/juju-trusty-lxc-template/rootfs# uname -a
> Linux riff 4.1.0-040100rc7-generic #201506080035 SMP Mon Jun 8 04:36:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> But apart from still having major issues with qgroups (quota
> enforcement triggers even when there seems to be plenty of free space)
> clearing limits with btrfs-progs 4.1 doesn't revert back to 'none',
> instead confusingly setting the quota to 16EiB. Using:
>
> root@riff:/var/lib/lxc/juju-trusty-lxc-template/rootfs# btrfs version
> btrfs-progs v4.1
>
> I start from:
>
> qgroupid         rfer         excl     max_rfer     max_excl
> 0/5           2.15GiB      1.95GiB         none         none
> 0/261         1.42GiB      1.11GiB         none    100.00GiB
> 0/265         1.09GiB    600.59MiB         none    100.00GiB
> 0/271       793.32MiB    366.40MiB         none    100.00GiB
> 0/274       514.96MiB    142.92MiB         none    100.00GiB
>
> I then issue:
>
> root@riff# btrfs qgroup limit -e none 261 /var
> root@riff# btrfs qgroup limit none 261 /var
>
> I end up with:
>
> qgroupid         rfer         excl     max_rfer     max_excl
> 0/5           2.15GiB      1.95GiB         none         none
> 0/261         1.42GiB      1.11GiB     16.00EiB     16.00EiB
> 0/265         1.09GiB    600.59MiB         none    100.00GiB
> 0/271       793.32MiB    366.40MiB         none    100.00GiB
> 0/274       514.96MiB    142.92MiB         none    100.00GiB
>
> Is that expected?

The following fix is necessary for the kernel to display it correctly.

[PATCH] btrfs: qgroup: allow user to clear the limitation on qgroup
http://marc.info/?l=linux-btrfs&m=143331495409594&w=2

Thanks,
Tsutomu
Re: raid 1 to 10 conversion
I can confirm that convert works now with the 4.1 kernel and btrfs-progs.

Suman

On Tue, Jun 9, 2015 at 10:31 PM, Gareth Pye wrote:
> btrfs has a small bug at the moment where balance can't convert raid
> levels (it just does nothing), it is meant to be fixed with the next
> kernel release.
>
> On Wed, Jun 10, 2015 at 3:28 PM, Guilherme Gonçalves
> wrote:
>> Hello!, i think i made a mistake
>> i had two 3tb drives on a raid 1 setup, i bought two additional 3tb
>> drives to make my raid 10 array
>> i used these commands
>>
>> btrfs -f device add /dev/sdc /mnt/nas/    (i used -f because i
>> formatted my new drives using gpt)
>> btrfs -f device add /dev/sdf /mnt/nas/
>>
>> finally:
>> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/nas/
>>
>> after a couple of hours i ran:
>>
>> btrfs filesystem df /mnt/nas/
>>
>> Data, RAID1: total=963.00GiB, used=962.69GiB
>> System, RAID1: total=32.00MiB, used=176.00KiB
>> Metadata, RAID1: total=6.00GiB, used=4.59GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> should that not read raid 10 ?
>>
>> output for btrfs fi usage /mnt/nas
>>
>> Overall:
>>     Device size:                  10.92TiB
>>     Device allocated:              1.89TiB
>>     Device unallocated:            9.02TiB
>>     Device missing:                  0.00B
>>     Used:                          1.89TiB
>>     Free (estimated):              4.51TiB      (min: 4.51TiB)
>>     Data ratio:                       2.00
>>     Metadata ratio:                   2.00
>>     Global reserve:              512.00MiB      (used: 0.00B)
>>
>> Data,RAID1: Size:963.00GiB, Used:962.69GiB
>>    /dev/sdc      481.00GiB
>>    /dev/sdd1     482.00GiB
>>    /dev/sde1     482.00GiB
>>    /dev/sdf      481.00GiB
>>
>> Metadata,RAID1: Size:6.00GiB, Used:4.59GiB
>>    /dev/sdc        4.00GiB
>>    /dev/sdd1       2.00GiB
>>    /dev/sde1       2.00GiB
>>    /dev/sdf        4.00GiB
>>
>> System,RAID1: Size:32.00MiB, Used:176.00KiB
>>    /dev/sdd1      32.00MiB
>>    /dev/sde1      32.00MiB
>>
>> Unallocated:
>>    /dev/sdc        2.25TiB
>>    /dev/sdd1       2.26TiB
>>    /dev/sde1       2.26TiB
>>    /dev/sdf        2.25TiB
>>
>>
>> I think i made a mess here... why is system only on two drives? why
>> is it not showing raid 10?
>> If i actually failed how do i achieve this? i want all four drives in
>> a raid 10 setup.
>>
>> Thanks in advance
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Gareth Pye
> Level 2 MTG Judge, Melbourne, Australia
> "Dear God, I would like to file a bug report"
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] btrfs: fix clone / extent-same deadlocks
Clone and extent same lock their source and target inodes in opposite order. In addition to this, the range locking in clone doesn't take ordering into account. Fix this by having clone use the same locking helpers as btrfs-extent-same.

In addition, I do a small cleanup of the locking helpers, removing a case (both inodes being the same) which was poorly accounted for and never actually used by the callers.

Signed-off-by: Mark Fasheh
---
 fs/btrfs/ioctl.c | 34 ++++++++--------------------------
 1 file changed, 8 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b899584..8d6887d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2831,8 +2831,7 @@ static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2)
 		swap(inode1, inode2);
 
 	mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
-	if (inode1 != inode2)
-		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+	mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
 }
 
 static void btrfs_double_extent_unlock(struct inode *inode1, u64 loff1,
@@ -2850,8 +2849,7 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
 		swap(loff1, loff2);
 	}
 	lock_extent_range(inode1, loff1, len);
-	if (inode1 != inode2)
-		lock_extent_range(inode2, loff2, len);
+	lock_extent_range(inode2, loff2, len);
 }
 
 struct cmp_pages {
@@ -3713,13 +3711,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 		goto out_fput;
 
 	if (!same_inode) {
-		if (inode < src) {
-			mutex_lock_nested(&inode->i_mutex, I_MUTEX_PARENT);
-			mutex_lock_nested(&src->i_mutex, I_MUTEX_CHILD);
-		} else {
-			mutex_lock_nested(&src->i_mutex, I_MUTEX_PARENT);
-			mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
-		}
+		btrfs_double_inode_lock(src, inode);
 	} else {
 		mutex_lock(&src->i_mutex);
 	}
@@ -3769,8 +3761,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 
 		lock_extent_range(src, lock_start, lock_len);
 	} else {
-		lock_extent_range(src, off, len);
-		lock_extent_range(inode, destoff, len);
+		btrfs_double_extent_lock(src, off, inode, destoff, len);
 	}
 
 	ret = btrfs_clone(src, inode, off, olen, len, destoff);
@@ -3781,9 +3772,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 
 		unlock_extent(&BTRFS_I(src)->io_tree, lock_start, lock_end);
 	} else {
-		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1);
-		unlock_extent(&BTRFS_I(inode)->io_tree, destoff,
-			      destoff + len - 1);
+		btrfs_double_extent_unlock(src, off, inode, destoff, len);
 	}
 	/*
 	 * Truncate page cache pages so that future reads will see the cloned
@@ -3792,17 +3781,10 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	truncate_inode_pages_range(&inode->i_data, destoff,
 				   PAGE_CACHE_ALIGN(destoff + len) - 1);
 out_unlock:
-	if (!same_inode) {
-		if (inode < src) {
-			mutex_unlock(&src->i_mutex);
-			mutex_unlock(&inode->i_mutex);
-		} else {
-			mutex_unlock(&inode->i_mutex);
-			mutex_unlock(&src->i_mutex);
-		}
-	} else {
+	if (!same_inode)
+		btrfs_double_inode_unlock(src, inode);
+	else
 		mutex_unlock(&src->i_mutex);
-	}
 out_fput:
 	fdput(src_file);
 out_drop_write:
--
2.1.2
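The deadlock this patch addresses is the classic one where two paths take the same pair of locks in opposite orders. The fix is to pick one global order and use it everywhere. A minimal userspace sketch of that idea follows. struct fake_inode and its lock functions are hypothetical stand-ins, not kernel code, and a counter replaces a real mutex so the acquisition order can be inspected.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode with a lock; not kernel code. */
struct fake_inode {
	int locked;	/* 1 while held */
	int lock_seq;	/* position in the global acquisition order */
};

static int next_seq;

static void fake_lock(struct fake_inode *ino)
{
	assert(!ino->locked);	/* a second acquisition would deadlock */
	ino->locked = 1;
	ino->lock_seq = ++next_seq;
}

static void fake_unlock(struct fake_inode *ino)
{
	assert(ino->locked);
	ino->locked = 0;
}

/*
 * Take both locks in one global order: higher address first, matching
 * the swap in btrfs_double_inode_lock(). Callers must pass two
 * distinct inodes, as the cleaned-up helpers now require.
 */
static void double_lock(struct fake_inode *a, struct fake_inode *b)
{
	if ((uintptr_t)a < (uintptr_t)b) {
		struct fake_inode *tmp = a;

		a = b;
		b = tmp;
	}
	fake_lock(a);
	fake_lock(b);
}

static void double_unlock(struct fake_inode *a, struct fake_inode *b)
{
	fake_unlock(a);
	fake_unlock(b);
}
```

Whichever way round the caller passes the pair, the same inode is locked first, so two callers working on the same pair cannot each hold one lock while waiting for the other.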
[PATCH 2/5] btrfs: fix deadlock with extent-same and readpage
->readpage() does page_lock() before extent_lock(), we do the opposite in extent-same. We want to reverse the order in btrfs_extent_same() but it's not quite straightforward since the page locks are taken inside btrfs_cmp_data(). So I split btrfs_cmp_data() into 3 parts with a small context structure that is passed between them. The first, btrfs_cmp_data_prepare() gathers up the pages needed (taking page lock as required) and puts them on our context structure. At this point, we are safe to lock the extent range. Afterwards, we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free() to clean up our context. Signed-off-by: Mark Fasheh Reviewed-by: David Sterba --- fs/btrfs/ioctl.c | 148 +++ 1 file changed, 117 insertions(+), 31 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 2deea1f..b899584 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2755,14 +2755,11 @@ out: return ret; } -static struct page *extent_same_get_page(struct inode *inode, u64 off) +static struct page *extent_same_get_page(struct inode *inode, pgoff_t index) { struct page *page; - pgoff_t index; struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; - index = off >> PAGE_CACHE_SHIFT; - page = grab_cache_page(inode->i_mapping, index); if (!page) return NULL; @@ -2783,6 +2780,20 @@ static struct page *extent_same_get_page(struct inode *inode, u64 off) return page; } +static int gather_extent_pages(struct inode *inode, struct page **pages, + int num_pages, u64 off) +{ + int i; + pgoff_t index = off >> PAGE_CACHE_SHIFT; + + for (i = 0; i < num_pages; i++) { + pages[i] = extent_same_get_page(inode, index + i); + if (!pages[i]) + return -ENOMEM; + } + return 0; +} + static inline void lock_extent_range(struct inode *inode, u64 off, u64 len) { /* do any pending delalloc/csum calc on src, one way or @@ -2808,52 +2819,120 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len) } } -static void btrfs_double_unlock(struct inode *inode1, 
u64 loff1, - struct inode *inode2, u64 loff2, u64 len) +static void btrfs_double_inode_unlock(struct inode *inode1, struct inode *inode2) { - unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1); - unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1); - mutex_unlock(&inode1->i_mutex); mutex_unlock(&inode2->i_mutex); } -static void btrfs_double_lock(struct inode *inode1, u64 loff1, - struct inode *inode2, u64 loff2, u64 len) +static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2) +{ + if (inode1 < inode2) + swap(inode1, inode2); + + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); + if (inode1 != inode2) + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); +} + +static void btrfs_double_extent_unlock(struct inode *inode1, u64 loff1, + struct inode *inode2, u64 loff2, u64 len) +{ + unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1); + unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1); +} + +static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1, +struct inode *inode2, u64 loff2, u64 len) { if (inode1 < inode2) { swap(inode1, inode2); swap(loff1, loff2); } - - mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); lock_extent_range(inode1, loff1, len); - if (inode1 != inode2) { - mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); + if (inode1 != inode2) lock_extent_range(inode2, loff2, len); +} + +struct cmp_pages { + int num_pages; + struct page **src_pages; + struct page **dst_pages; +}; + +static void btrfs_cmp_data_free(struct cmp_pages *cmp) +{ + int i; + struct page *pg; + + for (i = 0; i < cmp->num_pages; i++) { + pg = cmp->src_pages[i]; + if (pg) + page_cache_release(pg); + pg = cmp->dst_pages[i]; + if (pg) + page_cache_release(pg); + } + kfree(cmp->src_pages); + kfree(cmp->dst_pages); +} + +static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, + struct inode *dst, u64 dst_loff, + u64 len, struct cmp_pages *cmp) +{ + int ret; + int num_pages = 
PAGE_CACHE_ALIGN(len) >> PAGE_CACHE_SHIFT; + struct page **src_pgarr, **dst_pgarr; + + /* +* We must gather up all the pages before we init
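The page arrays in this patch are sized with PAGE_CACHE_ALIGN(len) >> PAGE_CACHE_SHIFT, that is, the length rounded up to a page boundary and then divided by the page size. A minimal sketch of that arithmetic, assuming 4 KiB pages; the SKETCH_ macros and pages_for_len are local stand-ins, not the kernel definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for the sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Round len up to a page boundary, then count whole pages. */
static uint64_t pages_for_len(uint64_t len)
{
	uint64_t aligned = (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);

	return aligned >> SKETCH_PAGE_SHIFT;
}
```

A length of one byte still needs one page, and a length one byte past a page boundary needs two.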
[PATCH 5/5] btrfs: add no_mtime flag to btrfs-extent-same
One issue users have reported is that dedupe changes mtime on files, resulting in tools like rsync thinking that their contents have changed when in fact the data is exactly the same. Clone still wants an mtime change, so we special case this in the code. With this patch an application can pass the BTRFS_SAME_NO_MTIME flag to a dedupe request and the kernel will honor it by only changing ctime. I have an updated version of the btrfs-extent-same test program with a switch to provide this flag at the 'no_time' branch of: https://github.com/markfasheh/duperemove/ Signed-off-by: Mark Fasheh --- fs/btrfs/ioctl.c | 34 -- include/uapi/linux/btrfs.h | 5 - 2 files changed, 28 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 83f4679..8cfc65f 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -87,7 +87,8 @@ struct btrfs_ioctl_received_subvol_args_32 { static int btrfs_clone(struct inode *src, struct inode *inode, - u64 off, u64 olen, u64 olen_aligned, u64 destoff); + u64 off, u64 olen, u64 olen_aligned, u64 destoff, + int no_mtime); /* Mask out flags that are inappropriate for the given type of inode. 
*/ static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags) @@ -2974,7 +2975,7 @@ static int extent_same_check_offsets(struct inode *inode, u64 off, u64 *plen, } static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, -struct inode *dst, u64 dst_loff) +struct inode *dst, u64 dst_loff, int no_mtime) { int ret; u64 len = olen; @@ -3054,7 +3055,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, /* pass original length for comparison so we stay within i_size */ ret = btrfs_cmp_data(src, loff, dst, dst_loff, olen, &cmp); if (ret == 0) - ret = btrfs_clone(src, dst, loff, olen, len, dst_loff); + ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, + no_mtime); if (same_inode) unlock_extent(&BTRFS_I(src)->io_tree, same_lock_start, @@ -3088,6 +3090,7 @@ static long btrfs_ioctl_file_extent_same(struct file *file, u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize; bool is_admin = capable(CAP_SYS_ADMIN); u16 count; + int no_mtime = 0; if (!(file->f_mode & FMODE_READ)) return -EINVAL; @@ -3139,6 +3142,12 @@ static long btrfs_ioctl_file_extent_same(struct file *file, if (!S_ISREG(src->i_mode)) goto out; + ret = -EINVAL; + if (same->flags & ~BTRFS_SAME_FLAGS) + goto out; + if (same->flags & BTRFS_SAME_NO_MTIME) + no_mtime = 1; + /* pre-format output fields to sane values */ for (i = 0; i < count; i++) { same->info[i].bytes_deduped = 0ULL; @@ -3164,7 +3173,8 @@ static long btrfs_ioctl_file_extent_same(struct file *file, info->status = -EACCES; } else { info->status = btrfs_extent_same(src, off, len, dst, - info->logical_offset); +info->logical_offset, +no_mtime); if (info->status == 0) info->bytes_deduped += len; } @@ -3219,13 +3229,17 @@ static int clone_finish_inode_update(struct btrfs_trans_handle *trans, struct inode *inode, u64 endoff, const u64 destoff, -const u64 olen) +const u64 olen, +int no_mtime) { struct btrfs_root *root = BTRFS_I(inode)->root; int ret; inode_inc_iversion(inode); - inode->i_mtime = inode->i_ctime = 
CURRENT_TIME; + if (no_mtime) + inode->i_ctime = CURRENT_TIME; + else + inode->i_mtime = inode->i_ctime = CURRENT_TIME; /* * We round up to the block size at eof when determining which * extents to clone above, but shouldn't round up the file size. @@ -3316,7 +3330,7 @@ static void clone_update_extent_map(struct inode *inode, */ static int btrfs_clone(struct inode *src, struct inode *inode, const u64 off, const u64 olen, const u64 olen_aligned, - const u64 destoff) + const u64 destoff, int no_mtime) { struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_path *path = NULL; @@ -3640,7 +3654,7 @@ process_slot: ro
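The flag handling in this patch has two parts: unknown flag bits are rejected with -EINVAL, and a set no-mtime bit limits the timestamp update to ctime. A minimal userspace sketch of both steps follows. The flag values are hypothetical, since the uapi header hunk is not visible here, and the struct and function names are stand-ins rather than kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values; the uapi header hunk is not shown above. */
#define SKETCH_SAME_NO_MTIME	((uint64_t)1 << 0)
#define SKETCH_SAME_FLAGS	(SKETCH_SAME_NO_MTIME)

/* Stand-in for the two inode timestamps the patch touches. */
struct sketch_times {
	long mtime;
	long ctime;
};

/* Reject unknown bits, then report whether mtime should be left alone. */
static int parse_same_flags(uint64_t flags, int *no_mtime)
{
	if (flags & ~SKETCH_SAME_FLAGS)
		return -EINVAL;
	*no_mtime = (flags & SKETCH_SAME_NO_MTIME) != 0;
	return 0;
}

/* ctime always moves; mtime moves only when no_mtime is clear. */
static void finish_update(struct sketch_times *t, long now, int no_mtime)
{
	t->ctime = now;
	if (!no_mtime)
		t->mtime = now;
}
```

Rejecting unknown bits up front keeps the remaining flag space free for later extensions, because old kernels then fail loudly instead of silently ignoring a flag they do not understand.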
[PATCH 4/5] btrfs: allow dedupe of same inode
clone() supports cloning within an inode so extent-same can do
the same now. This patch fixes up the locking in extent-same to
know about the single-inode case. In addition to that, we add a
check for overlapping ranges, which clone does not allow.

Signed-off-by: Mark Fasheh
---
 fs/btrfs/ioctl.c | 76
 1 file changed, 60 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 8d6887d..83f4679 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2979,27 +2979,61 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 	int ret;
 	u64 len = olen;
 	struct cmp_pages cmp;
+	int same_inode = 0;
+	u64 same_lock_start = 0;
+	u64 same_lock_len = 0;
 
-	/*
-	 * btrfs_clone() can't handle extents in the same file
-	 * yet. Once that works, we can drop this check and replace it
-	 * with a check for the same inode, but overlapping extents.
-	 */
 	if (src == dst)
-		return -EINVAL;
+		same_inode = 1;
 
 	if (len == 0)
 		return 0;
 
-	btrfs_double_inode_lock(src, dst);
+	if (same_inode) {
+		mutex_lock(&src->i_mutex);
 
-	ret = extent_same_check_offsets(src, loff, &len, olen);
-	if (ret)
-		goto out_unlock;
+		ret = extent_same_check_offsets(src, loff, &len, olen);
+		if (ret)
+			goto out_unlock;
 
-	ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
-	if (ret)
-		goto out_unlock;
+		/*
+		 * Single inode case wants the same checks, except we
+		 * don't want our length pushed out past i_size as
+		 * comparing that data range makes no sense.
+		 *
+		 * extent_same_check_offsets() will do this for an
+		 * unaligned length at i_size, so catch it here and
+		 * reject the request.
+		 *
+		 * This effectively means we require aligned extents
+		 * for the single-inode case, whereas the other cases
+		 * allow an unaligned length so long as it ends at
+		 * i_size.
+		 */
+		if (len != olen) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		/* Check for overlapping ranges */
+		if (dst_loff + len > loff && dst_loff < loff + len) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		same_lock_start = min_t(u64, loff, dst_loff);
+		same_lock_len = max_t(u64, loff, dst_loff) + len - same_lock_start;
+	} else {
+		btrfs_double_inode_lock(src, dst);
+
+		ret = extent_same_check_offsets(src, loff, &len, olen);
+		if (ret)
+			goto out_unlock;
+
+		ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
+		if (ret)
+			goto out_unlock;
+	}
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
@@ -3012,18 +3046,28 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 	if (ret)
 		goto out_unlock;
 
-	btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
+	if (same_inode)
+		lock_extent_range(src, same_lock_start, same_lock_len);
+	else
+		btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
 
 	/* pass original length for comparison so we stay within i_size */
 	ret = btrfs_cmp_data(src, loff, dst, dst_loff, olen, &cmp);
 	if (ret == 0)
 		ret = btrfs_clone(src, dst, loff, olen, len, dst_loff);
 
-	btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
+	if (same_inode)
+		unlock_extent(&BTRFS_I(src)->io_tree, same_lock_start,
+			      same_lock_start + same_lock_len - 1);
+	else
+		btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
 
 	btrfs_cmp_data_free(&cmp);
 out_unlock:
-	btrfs_double_inode_unlock(src, dst);
+	if (same_inode)
+		mutex_unlock(&src->i_mutex);
+	else
+		btrfs_double_inode_unlock(src, dst);
 
 	return ret;
 }
-- 
2.1.2
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
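The two same-inode computations in the patch above (the overlap test and the single locked span) can be tried outside the kernel. The following is only a userspace sketch: `ranges_overlap`, `lock_span` and `same_inode_lock_span` are hypothetical names, and only the arithmetic is taken from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Both ranges have the same length, as in the patch. Half-open ranges
 * [loff, loff+len) and [dst_loff, dst_loff+len) overlap when each one
 * starts before the other one ends.
 */
static bool ranges_overlap(uint64_t loff, uint64_t dst_loff, uint64_t len)
{
	return dst_loff + len > loff && dst_loff < loff + len;
}

/* Hypothetical holder for the single extent range that gets locked. */
struct lock_span {
	uint64_t start;
	uint64_t len;
};

/*
 * For the same-inode case one span is locked that covers both ranges:
 * it starts at the lower offset and ends where the higher range ends.
 */
static struct lock_span same_inode_lock_span(uint64_t loff, uint64_t dst_loff,
					     uint64_t len)
{
	uint64_t lo = loff < dst_loff ? loff : dst_loff;
	uint64_t hi = loff < dst_loff ? dst_loff : loff;
	struct lock_span span = { lo, hi + len - lo };

	return span;
}
```

Note that the span also covers any gap between the two ranges, so a dedupe of two far-apart blocks in one file holds the extent lock on everything in between for the duration of the call.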
[PATCH 1/5] btrfs: pass unaligned length to btrfs_cmp_data()
In the case that we dedupe the tail of a file, we might expand the dedupe
len out to the end of our last block. We don't want to compare data past
i_size however, so pass the original length to btrfs_cmp_data().

Signed-off-by: Mark Fasheh
Reviewed-by: David Sterba
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2d24ff4..2deea1f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2933,7 +2933,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 		goto out_unlock;
 	}
 
-	ret = btrfs_cmp_data(src, loff, dst, dst_loff, len);
+	/* pass original length for comparison so we stay within i_size */
+	ret = btrfs_cmp_data(src, loff, dst, dst_loff, olen);
 	if (ret == 0)
 		ret = btrfs_clone(src, dst, loff, olen, len, dst_loff);
 
-- 
2.1.2
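To make the difference between the two lengths concrete, here is a small userspace sketch. It assumes a power-of-two block size and a simplified rule (extend the length only when the request ends exactly at i_size); `align_up`, `dedupe_lens` and `pick_lengths` are hypothetical names, not kernel functions.

```c
#include <stdint.h>

/* Round x up to a multiple of bs; bs is assumed to be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t bs)
{
	return (x + bs - 1) & ~(bs - 1);
}

/* Hypothetical pair: length to compare vs. length to clone. */
struct dedupe_lens {
	uint64_t cmp_len;
	uint64_t clone_len;
};

/*
 * Simplified model: when the request ends at i_size, the clone length
 * is pushed out to the end of the last block, while the comparison
 * keeps the length the user passed so it never reads past i_size.
 */
static struct dedupe_lens pick_lengths(uint64_t off, uint64_t olen,
				       uint64_t i_size, uint64_t bs)
{
	struct dedupe_lens l = { olen, olen };

	if (off + olen == i_size)
		l.clone_len = align_up(off + olen, bs) - off;
	return l;
}
```

With a 4096-byte block size, a 1000-byte tail starting at offset 8192 in a 9192-byte file is compared over 1000 bytes but cloned as one full 4096-byte block.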
[PATCH 0/5] btrfs: dedupe fixes, features V2
Hi Chris,

The following patches are based on top of my patch titled "btrfs:
Handle unaligned length in extent_same", which you have in your
'integration-4.2' branch:

https://git.kernel.org/cgit/linux/kernel/git/mason/linux-btrfs.git/commit/?id=e1d227a42ea2b4664f94212bd1106b9a3413ffb8

I sent out the first two patches from this series last week; the last
three are new to the list:

http://www.spinics.net/lists/linux-btrfs/msg44849.html

The first patch in the series fixes a bug where we were sometimes
passing the aligned length to our comparison function. We can actually
stop at the user-passed length for this, as we don't need to compare
data past i_size (and we only align if the extents go out to i_size).

The 2nd patch fixes a deadlock between btrfs readpage and
btrfs_extent_same. This was reported on the list some months ago.
Basically we had the page and extent locking reversed. My patch fixes
up the locking to be in the right order.

The 3rd patch fixes a deadlock in clone() (wrt extent-same) which David
found while reviewing my fixes. I also found that clone doesn't lock
extent ranges in any particular order, which could obviously be a
problem, so that is fixed there too.

The last two patches add features which have been requested often by
users: the 4th adds the ability to dedupe within the same inode, and
the last patch adds a dedupe flag to avoid mtime updates (this helps
with backup software).

These patches have been tested with the 'btrfs-extent-same' tool that
can be found at:

https://github.com/markfasheh/duperemove/blob/nomtime/btrfs-extent-same.c

Thanks,
	--Mark
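As an aside on the lock-ordering point in the cover letter: the usual way to avoid an ABBA deadlock between two callers locking the same pair of ranges is to always take them in one canonical order. The sketch below is a generic illustration of that idea, not the code from patch 3; `range_req`, `range_before` and `order_pair` are hypothetical names, and ordering by inode number then offset is an assumption.

```c
#include <stdint.h>

/* Hypothetical description of one range that needs locking. */
struct range_req {
	uint64_t ino;	/* inode number */
	uint64_t off;	/* start offset within the inode */
	uint64_t len;
};

/* Canonical order: lower inode number first, then lower offset. */
static int range_before(const struct range_req *a, const struct range_req *b)
{
	if (a->ino != b->ino)
		return a->ino < b->ino;
	return a->off <= b->off;
}

/*
 * Swap the two requests if needed so that *first is always locked
 * before *second, no matter in which order the caller named them.
 */
static void order_pair(const struct range_req **first,
		       const struct range_req **second)
{
	if (!range_before(*first, *second)) {
		const struct range_req *tmp = *first;

		*first = *second;
		*second = tmp;
	}
}
```

Two tasks that call this with (A, B) and (B, A) end up requesting the locks in the same sequence, so neither can hold one lock while waiting for the other in the opposite order.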
qgroup limit clearing, was Re: Btrfs progs release 4.1
On Mon, Jun 22, 2015 at 05:00:23PM +0200, David Sterba wrote:
> - qgroup:
>   - show: distinguish no limits and 0 limit value
>   - limit: ability to clear the limit

I'm using kernel 4.1-rc7 as per:

root@riff:/var/lib/lxc/juju-trusty-lxc-template/rootfs# uname -a
Linux riff 4.1.0-040100rc7-generic #201506080035 SMP Mon Jun 8 04:36:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

But apart from still having major issues with qgroups (quota
enforcement triggers even when there seems to be plenty of free space)
clearing limits with btrfs-progs 4.1 doesn't revert back to 'none',
instead confusingly setting the quota to 16EiB. Using:

root@riff:/var/lib/lxc/juju-trusty-lxc-template/rootfs# btrfs version
btrfs-progs v4.1

I start from:

qgroupid         rfer         excl     max_rfer     max_excl
0/5           2.15GiB      1.95GiB         none         none
0/261         1.42GiB      1.11GiB         none    100.00GiB
0/265         1.09GiB    600.59MiB         none    100.00GiB
0/271       793.32MiB    366.40MiB         none    100.00GiB
0/274       514.96MiB    142.92MiB         none    100.00GiB

I then issue:

root@riff# btrfs qgroup limit -e none 261 /var
root@riff# btrfs qgroup limit none 261 /var

I end up with:

qgroupid         rfer         excl     max_rfer     max_excl
0/5           2.15GiB      1.95GiB         none         none
0/261         1.42GiB      1.11GiB     16.00EiB     16.00EiB
0/265         1.09GiB    600.59MiB         none    100.00GiB
0/271       793.32MiB    366.40MiB         none    100.00GiB
0/274       514.96MiB    142.92MiB         none    100.00GiB

Is that expected?
-- 
Christian Robottom Reis | [+1] 612 888 4935    | http://launchpad.net/~kiko
Canonical VP Hyperscale | [+55 16] 9 9112 6430
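One observation on the 16.00EiB figure: that is exactly what an all-ones 64-bit byte count looks like once it is scaled to binary units, since 2^64 bytes is 16 EiB. Whether the cleared limit is actually stored as all-ones is an assumption here, not something the thread confirms. The sketch below only demonstrates the arithmetic; `pretty_size` is a hypothetical helper, not the btrfs-progs formatter.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a byte count with binary unit suffixes and two decimals,
 * writing the result into buf (at most buflen bytes including NUL).
 */
static void pretty_size(uint64_t bytes, char *buf, size_t buflen)
{
	static const char *units[] = {
		"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"
	};
	double v = (double)bytes;
	int u = 0;

	/* Scale down by 1024 until the value fits the unit, EiB at most. */
	while (v >= 1024.0 && u < 6) {
		v /= 1024.0;
		u++;
	}
	snprintf(buf, buflen, "%.2f%s", v, units[u]);
}
```

Calling `pretty_size(UINT64_MAX, buf, sizeof(buf))` yields the string "16.00EiB", and a limit of 100ULL << 30 bytes yields "100.00GiB", which matches the two kinds of values in the tables above.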
Re: Btrfs progs release 4.1
Wow, nice collection of changes!

On Monday, 22 June 2015, 17:00:23, David Sterba wrote:
> * new
>   - rescue zero-log
>   - btrfstune:
>     - rewrite uuid on a filesystem image
>     - new option to turn on NO_HOLES incompat feature

Did you think about folding btrfstune into the btrfs command as well?

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)
On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> Moving the discussion to fsdevel.
>
> Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> generic 'noiversion' option cannot be used to achieve that. It is
> processed before it reaches btrfs superblock callback, where
> MS_I_VERSION is forced.
>
> The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> to which I object.

The issue is that you can't override IS_I_VERSION(inode) because it
looks at the superblock flag, yes?

So perhaps IS_I_VERSION should become an inode flag, set by the
filesystem at inode instantiation time, and hence filesystems can
choose on a per-inode basis if they want I_VERSION behaviour or not.
At that point, the behaviour of MS_I_VERSION becomes irrelevant to the
discussion, doesn't it?

> xfs also forces I_VERSION if it detects the superblock version 5, so it
> could use the same fix that would work for btrfs.

XFS is a special snowflake - it updates the I_VERSION only when an
inode is otherwise modified in a transaction, so turning it off saves
nothing. (And yes, timestamp updates are transactional in XFS). Hence
XFS behaviour is irrelevant to the discussion, because we aren't ever
going to turn it off.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
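The suggestion above can be pictured with a toy model. Everything in this sketch is hypothetical (the `toy_` types, the flag names and the helpers); it is not the VFS code, only an illustration of testing a per-inode flag instead of a per-superblock one.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_SB_I_VERSION	(1u << 0)	/* stand-in for the superblock flag */
#define TOY_INODE_I_VERSION	(1u << 0)	/* stand-in for a per-inode flag */

struct toy_sb {
	uint32_t flags;
};

struct toy_inode {
	const struct toy_sb *sb;
	uint32_t flags;		/* set by the filesystem at instantiation */
	uint64_t version;
};

/* Today's shape of the check: every inode on the sb gets the same answer. */
static bool is_i_version_sb(const struct toy_inode *inode)
{
	return (inode->sb->flags & TOY_SB_I_VERSION) != 0;
}

/* Proposed shape: the filesystem decides per inode. */
static bool is_i_version_inode(const struct toy_inode *inode)
{
	return (inode->flags & TOY_INODE_I_VERSION) != 0;
}

/* A modification bumps the version only for inodes that opted in. */
static void toy_touch(struct toy_inode *inode)
{
	if (is_i_version_inode(inode))
		inode->version++;
}
```

In this model a filesystem could leave the superblock flag forced on and still skip version updates for inodes it instantiates without the per-inode flag, which is the point of the proposal.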
Re: Btrfs progs release 4.1
On Mon, Jun 22, 2015 at 06:18:35PM +0200, Goffredo Baroncelli wrote:
> Many thanks for your work.
> BTW just for curiosity: is it a coincidence that both Torvalds and you
> released the kernel 4.1/btrfs-progs 4.1 in the same day ? I know that
> the version are coupled, but also the same day

This time around I was ready to do the release on time so there was no
reason to delay it. Previous releases were delayed because of other work
or (my) insufficient confidence in the pending changes.
Re: RAID1: system stability
On Mon, Jun 22, 2015 at 10:36 AM, Timofey Titovets wrote:
> 2015-06-22 19:03 GMT+03:00 Chris Murphy:
>> On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets wrote:
>>> Okay, logs, i did release disk /dev/sde1 and get:
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>>
>> So what's up with this? This only happens after you try to (software)
>> remove /dev/sde1? Or is it happening also before that? Because this
>> looks like some kind of hardware problem when the drive is reporting
>> an error for a particular sector on read, as if it's a bad sector.
>
> Nope, i've physically remove device and as you see it's produce errors
> on block layer -.-
> and this disks have 100% 'health'
>
> Because it's hot-plug device, kernel see what device now missing and
> remove all kernel objects reletad to them.

OK I actually don't know what the intended block layer behavior is
when unplugging a device, if it is supposed to vanish, or change state
somehow so that thing that depend on it can know it's "missing" or
what. So the question here is, is this working as intended? If the
layer Btrfs depends on isn't working as intended, then Btrfs is
probably going to do wild and crazy things. And I don't know that the
part of the block layer Btrfs depends on for this is the same (or
different) as what the md driver depends on.

>
>>
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>>
>> Again same sector as before. This is not a Btrfs error message, it's
>> coming from the block layer.
>>
>>
>>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
>>
>> I'm not a dev so take it with a grain of salt but because this
>> references a logical block, this is the layer in between Btrfs and the
>> physical device. Btrfs works on logical blocks and those have to be
>> translated to device and physical sector. Maybe what's happening is
>> there's confusion somewhere about this device not actually being
>> unavailable so Btrfs or something else is trying to read this logical
>> block again, which causes a read attempt to happen instead of a flat
>> out "this device doesn't exist" type of error. So I don't know if this
>> is a problem strictly in Btrfs missing device error handling, or if
>> there's something else that's not really working correctly.
>>
>> You could test by physically removing the device, if you have hot plug
>> support (be certain all the hardware components support it), you can
>> see if you get different results. Or you could try to reproduce the
>> software delete of the device with mdraid or lvm raid with XFS and no
>> Btrfs at all, and see if you get different results.
>>
>> It's known that the btrfs multiple device failure use case is weak
>> right now. Data isn't lost, but the error handling, notification, all
>> that is almost non-existent compared to mdadm.
>
> So sad -.-
> i've test this test case with md raid1 and system continue work
> without problem when i release one of two md device

OK well then it's either a Btrfs bug or something it directly depends
on that md does not.

> You right about usb devices, it's not produce oops.
> May be its because kernel use different modules for SAS/SATA disks and
> usb sticks.

They appear as sd devices on my system, so they're using libata and as
such they ultimately still depend on the SCSI block layer. But there
may be a very different kind of missing device error handl
Re: RAID1: system stability
2015-06-22 19:03 GMT+03:00 Chris Murphy:
> On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets wrote:
>> Okay, logs, i did release disk /dev/sde1 and get:
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>
> So what's up with this? This only happens after you try to (software)
> remove /dev/sde1? Or is it happening also before that? Because this
> looks like some kind of hardware problem when the drive is reporting
> an error for a particular sector on read, as if it's a bad sector.

Nope, i've physically remove device and as you see it's produce errors
on block layer -.-
and this disks have 100% 'health'

Because it's hot-plug device, kernel see what device now missing and
remove all kernel objects reletad to them.

>
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>
> Again same sector as before. This is not a Btrfs error message, it's
> coming from the block layer.
>
>
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
>
> I'm not a dev so take it with a grain of salt but because this
> references a logical block, this is the layer in between Btrfs and the
> physical device. Btrfs works on logical blocks and those have to be
> translated to device and physical sector. Maybe what's happening is
> there's confusion somewhere about this device not actually being
> unavailable so Btrfs or something else is trying to read this logical
> block again, which causes a read attempt to happen instead of a flat
> out "this device doesn't exist" type of error. So I don't know if this
> is a problem strictly in Btrfs missing device error handling, or if
> there's something else that's not really working correctly.
>
> You could test by physically removing the device, if you have hot plug
> support (be certain all the hardware components support it), you can
> see if you get different results. Or you could try to reproduce the
> software delete of the device with mdraid or lvm raid with XFS and no
> Btrfs at all, and see if you get different results.
>
> It's known that the btrfs multiple device failure use case is weak
> right now. Data isn't lost, but the error handling, notification, all
> that is almost non-existent compared to mdadm.

So sad -.-
i've test this test case with md raid1 and system continue work
without problem when i release one of two md device

>
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
>> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
>> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6:
Re: Btrfs progs release 4.1
On 2015-06-22 17:00, David Sterba wrote:
> Hi,

Many thanks for your work.
BTW just for curiosity: is it a coincidence that both Torvalds and you
released the kernel 4.1/btrfs-progs 4.1 in the same day ? I know that
the version are coupled, but also the same day

BR
G.Baronelli

>
> btrfs-progs 4.1 have been released (in time with kernel 4.1). Unusual load of
> changes.
> [...]
Re: RAID1: system stability
On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets wrote:
> Okay, logs, i did release disk /dev/sde1 and get:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096

So what's up with this? This only happens after you try to (software)
remove /dev/sde1? Or is it happening also before that? Because this
looks like some kind of hardware problem when the drive is reporting
an error for a particular sector on read, as if it's a bad sector.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096

Again same sector as before. This is not a Btrfs error message, it's
coming from the block layer.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read

I'm not a dev so take it with a grain of salt but because this
references a logical block, this is the layer in between Btrfs and the
physical device. Btrfs works on logical blocks and those have to be
translated to device and physical sector. Maybe what's happening is
there's confusion somewhere about this device not actually being
unavailable so Btrfs or something else is trying to read this logical
block again, which causes a read attempt to happen instead of a flat
out "this device doesn't exist" type of error. So I don't know if this
is a problem strictly in Btrfs missing device error handling, or if
there's something else that's not really working correctly.

You could test by physically removing the device, if you have hot plug
support (be certain all the hardware components support it), you can
see if you get different results. Or you could try to reproduce the
software delete of the device with mdraid or lvm raid with XFS and no
Btrfs at all, and see if you get different results.

It's known that the btrfs multiple device failure use case is weak
right now. Data isn't lost, but the error handling, notification, all
that is almost non-existent compared to mdadm.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5,sas_addr 0x5000cca00d0514bd
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0x880449541400)
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd)
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd

OK it looks like not until here does it
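The numbers in the quoted log lines can be cross-checked with a little arithmetic. A SCSI READ(10) CDB carries the starting LBA big-endian in bytes 2 to 5 and the transfer length in bytes 7 and 8, so the logged CDB can be decoded directly. The sketch below also relates the "logical block" reported for sde1 to the whole-disk sector, under two assumptions that the thread does not state: a 4096-byte block size for the buffer I/O message and a partition starting at sector 2048. The helper names are hypothetical.

```c
#include <stdint.h>

/* Decoded fields of a READ(10) command. */
struct read10 {
	uint32_t lba;		/* starting logical block address */
	uint16_t blocks;	/* transfer length in device blocks */
};

/* Decode a 10-byte READ(10) CDB; returns -1 if the opcode is not 0x28. */
static int decode_read10(const uint8_t cdb[10], struct read10 *out)
{
	if (cdb[0] != 0x28)
		return -1;
	out->lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
		   ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	out->blocks = (uint16_t)(((unsigned int)cdb[7] << 8) | cdb[8]);
	return 0;
}

/*
 * Map a block index on a partition to a 512-byte sector on the whole
 * disk. block_size and part_start_sector are assumptions supplied by
 * the caller, not values taken from the log.
 */
static uint64_t partition_block_to_disk_sector(uint64_t block,
					       uint64_t block_size,
					       uint64_t part_start_sector)
{
	return part_start_sector + block * (block_size / 512);
}
```

Decoding `28 00 11 1d 69 00 00 00 08 00` gives LBA 287140096 and a length of 8 sectors (4096 bytes), which matches the sector in the blk_update_request line. With the assumed 4096-byte blocks and a partition start of 2048, block 35892256 on sde1 maps to the same sector 287140096 on sde, so the "logical block" in the Buffer I/O message looks like a block index on the partition device.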
Btrfs progs release 4.1
Hi,

btrfs-progs 4.1 has been released (in time with kernel 4.1). Unusual load of
changes.

Fixed since rc1:
  - uuid rewrite prints the correct original UUID
  - map-logical updated
  - fi show size units
  - typos

* bugfixes
  - fsck.btrfs: no bash-isms
  - bugzilla 97171: invalid memory access (with tests)
  - receive:
    - cloning works with --chroot
    - capabilities not lost
  - mkfs: do not try to register bare file images
  - option --help accepted by the standalone utilities

* enhancements
  - corrupt block: ability to remove csums
  - mkfs:
    - warn if metadata redundancy is lower than for data
    - options to make the output quiet (only errors)
    - mixed case names of raid profiles accepted
    - rework the output:
      - more comprehensive, 'key: value' format
  - subvol:
    - show:
      - print received uuid
      - update the output
      - new options to specify size units
    - sync:
      - grab all deleted ids and print them as they're removed,
        previous implementation only checked if there are any
        to be deleted - change in command semantics
  - scrub: print timestamps in days HMS format
  - receive:
    - can specify mount point, do not rely on /proc
    - can work inside subvolumes
  - send:
    - new option to send stream without data (NO_FILE_DATA)
  - convert:
    - specify incompat features on the new fs
  - qgroup:
    - show: distinguish no limits and 0 limit value
    - limit: ability to clear the limit
  - help for 'btrfs' is shorter, 1st level command overview
  - debug tree: print key names according to their C name

* new
  - rescue zero-log
  - btrfstune:
    - rewrite uuid on a filesystem image
    - new option to turn on NO_HOLES incompat feature

* deprecated
  - standalone btrfs-zero-log

* other
  - testing framework updates
    - uuid rewrite test
    - btrfstune feature setting test
    - zero-log tests
    - more testing image formats
  - manual page updates
  - ioctl.h synced with current kernel uapi version
  - convert: preparatory works for more filesystems (reiserfs pending)
  - use static buffers for path handling where possible
  - add new helpers for send utils that check memory allocations,
    switch all users, deprecate old helpers
  - Makefile: fix build dependency generation
  - map-logical: make it work again

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (2):
      btrfs-progs: add info about list-all to the help
      btrfs-progs: use function is_block_device() instead

Dimitri John Ledkov (1):
      btrfs-progs: fsck.btrfs: Fix bashism and bad getopts processing

Dongsheng Yang (4):
      btrfs-progs: qgroup: show 'none' when we did not limit it on this qgroup
      btrfs-progs: qgroup: allow user to clear some limitation on qgroup.
      btrfs-progs: qgroup limit: error out if input value is negative
      btrfs-progs: qgroup limit: add a check for invalid input of 'T/G/M/K'

Emil Karlson (1):
      btrfs-progs: use openat for process_clone in receive

Goffredo Baroncelli (4):
      btrfs-progs: add strdup in btrfs_add_to_fsid() to track the device path
      btrfs-progs: return the fsid from make_btrfs()
      btrfs-progs: mkfs: track sizes of created block groups
      btrfs-progs: mkfs: print the summary

Jeff Mahoney (8):
      btrfs-progs: convert: clean up blk_iterate_data handling wrt record_file_blocks
      btrfs-progs: convert: remove unused fs argument from block_iterate_proc
      btrfs-progs: convert: remove unused inode_key in copy_single_inode
      btrfs-progs: convert: rename ext2_root to image_root
      btrfs-progs: compat: define DIV_ROUND_UP if not already defined
      btrfs-progs: convert: fix typo in btrfs_insert_dir_item call
      btrfs-progs: convert: factor out adding dirent into convert_insert_dirent
      btrfs-progs: convert: factor out block iteration callback

Josef Bacik (3):
      Btrfs-progs: corrupt-block: add the ability to remove csums
      btrfs-progs: specify mountpoint for recieve
      btrfs-progs: make receive work inside of subvolumes

Qu Wenruo (13):
      btrfs-progs: Enhance read_tree_block to avoid memory corruption
      btrfs-progs: btrfstune: rework change_uuid
      btrfs-progs: btrfstune: add ability to restore unfinished fsid change
      btrfs-progs: btrfstune: add '-U' and '-u' option to change fsid
      btrfs-progs: Documentation: uuid change
      btrfs-progs: btrfstune: fix a bug which makes unfinished fsid change unrecoverable
      btrfs-progs: export read_extent_data function
      btrfs-progs: map-logical: introduce map_one_extent function
      Btrfs-progs: map-logical: introduce print_mapping_info function
      Btrfs-progs: map-logical: introduce write_extent_content function
      btrfs-progs: map-logical: Rework map-logical logics
      btrfs-progs: Allow "filesystem show" command to handle different units
      btrfs-progs: do
Re: [PATCH 1/2] btrfs-progs: Allow "filesystem show" command to handle different units
On Thu, Jun 18, 2015 at 02:46:11PM +0800, Qu Wenruo wrote:
> Now "filesystem show" command can handle different units now.
>
> This is handy for higher level programs to get accurate output from "fi
> show" command.
>
> Signed-off-by: Qu Wenruo

Thanks, both applied with minor fixups.
Re: [PATCH] Btrfs: Check if kobject is initialized before put
On Mon, Jun 22, 2015 at 06:18:32PM +0800, Anand Jain wrote:
> Signed-off-by: Anand Jain

Tested-by: David Sterba

Thanks, fixes the crash in the sysfs update patchset.
Re: Corrupted btrfs partition (converted from ext4) after balance
So, in the btrfs mailing list, nobody will help a user who has had a whole partition corrupted? I think my report was clear and complete. In IRC, the only answer I got was: "format your partition, there's nothing you can do and there's nothing to understand from this" (from nice people I should say).

What I understood from this experience is that btrfs is far from production-ready. How many people every day around the world are losing a lot of time because the "unstable" warning was removed? And losing data too: only a perfect backup system could allow someone to avoid data loss after a crash in a production system. It would require instantaneous replication + instantaneous versioning (without btrfs obviously) + instantaneous restore, which afaik no backup system has.

Thank you Duncan for your reply about btrfs' stability. But frankly, we shouldn't have to speculate how stable btrfs is. I don't get how people in this mailing list and in IRC find this situation acceptable. A file system is too critical to be treated this lightly.

I'm going back to ext4 for the moment and from now on I will only trust reputable third-party sources as to when btrfs is production-ready.

Sorry for the tone. I hope nobody found this message disrespectful.

Vianney

On 19/06/2015 09:53, Duncan wrote:

Vianney Stroebel posted on Fri, 19 Jun 2015 01:55:01 +0200 as excerpted:

I could copy the data on another freshly formatted disk and reformat this one but I am wondering if btrfs is stable enough to be used on my professional laptop (where I cannot afford such downtime) or if I should go back to ext4.

As a btrfs-using admin and list regular, not a dev, I'll reply to just the above more general question, letting others deal with the specific technical issue...

Good question, on which there's apparently a bit of controversy.

My own opinion, TL;DR summary? If you're asking the question and are unlikely to be going ahead anyway, regardless of the answer you get, then btrfs is unlikely to be what you'd call "stable enough", at this point.

The longer version...

The devs have applied patches that have removed most of the warnings, and some distros are now using btrfs by default, generally for the system partitions in order to take advantage of btrfs snapshotting to enable rollback, so it's obviously "stable enough" for them.

But actual non-dev btrfs user and list regular opinion on this list seems to be somewhere between "Are you kidding? After I just got thru dealing with bug , no way, Jose!", and "It's definitely stabilizing and maturing, and is noticeably better than six months ago, which was noticeably better than six months before that, but it's equally definitely not something I'd characterize as fully stable and mature just yet."

An arguably more practical way of stating the latter position, which happens to be my own, is by reference to the sysadmin's rule of backups. This rule says that if a particular set of files isn't backed up, then by definition, you don't care about losing it, despite any claims, possibly after said loss, to the contrary. Additionally, a would-be backup that hasn't passed restorability tests isn't yet complete, and therefore cannot be called a backup for purposes of the above rule. If it isn't backed up, you don't care about losing it. Full stop.

But, because btrfs isn't yet fully stable and mature, that rule applies double. I'd argue that for anyone that accepts that principle, including the doubling, and is still willing to use btrfs, it's "stable enough". Otherwise, better look somewhere else, as what you're looking for isn't found here.

That's the sysadmin-speak test, and result. But there's another way of putting it that's more developer-speak.

As any good developer will tell you, premature optimization is bad, very bad, in no small part because optimization is a LOT of work, and premature optimization either severely limits post-optimization flexibility in order to retain that work, or must be repeated over and over again as the problem and solution space becomes more defined by early trial and mid-stage implementations and better solutions become known.

For reasonably good developers, then (and if you don't consider them good developers, why are you trusting their filesystem work?), developers' own REAL opinion of the stability and maturity of a project is how much it has been optimized, vs. where optimization remains on the TODO list. Once developers are focusing on optimization, arguably they too believe the general solution to be relatively stable and mature. By contrast, if major parts of the code remain unoptimized, particularly where the current code works well enough but is known to be LESS than optimum, developers self-evidently consider it still maturing and subject to change that could possibly undo any current efforts at optimization.

Arguably, that's about as technically reasonable and unbiased as a measure gets, so for those concerned about
Re: RAID1: system stability
And again, if I try echo 1 > /sys/block/sdf/device/delete:

Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [ cut here ]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: kernel BUG at /build/buildd/linux-3.19.0/fs/btrfs/extent_io.c:2056!
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: invalid opcode: [#1] SMP
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc ipmi_ssif amdkfd amd_iommu_v2 gpio_ich radeon ttm drm_kms_helper lpc_ich coretemp drm kvm_intel kvm i5000_edac i2c_algo_bit edac_core i5k_amb shpchp ipmi_si serio_raw 8250_fintek ioatdma dca joydev mac_hid ipmi_msghandler bonding autofs4 btrfs ses enclosure raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq hid_generic raid1 e1000e raid0 usbhid mptsas mptscsih multipath psmouse hid mptbase ptp scsi_transport_sas pps_core linear
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CPU: 0 PID: 1150 Comm: kworker/u16:12 Not tainted 3.19.0-21-generic #21-Ubuntu
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: task: 88044c603110 ti: 88044b4b8000 task.ti: 88044b4b8000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RIP: 0010:[] [] repair_io_failure+0x1a0/0x220 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RSP: 0018:88044b4bbba8 EFLAGS: 00010202
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RAX: RBX: 1000 RCX:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RDX: RSI: 880449841b08 RDI: 880449841a80
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RBP: 88044b4bbc08 R08: 00109000 R09: 880449841a80
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: R10: 9000 R11: 0002 R12: 8803fa878068
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: R13: 880448f5d000 R14: 88044cde8d28 R15: 000524f09000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: FS: () GS:88045fc0() knlGS:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CS: 0010 DS: ES: CR0: 8005003b
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CR2: 7fdcef9cafb8 CR3: 01c13000 CR4: 000407f0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Stack:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 880448f5d100 1000 4b4bbbd8 ea000fb66d40
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 7000 880449841a80 88044b4bbc08 880439a44b58
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 1000 880448f5d000 88044cde8d28 88044cde8bf0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] clean_io_failure+0x19c/0x1b0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] end_bio_extent_readpage+0x310/0x5e0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] ? __slab_free+0xa5/0x320
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] ? native_sched_clock+0x2a/0x90
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] bio_endio+0x6b/0xa0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] ? kmem_cache_free+0x1be/0x200
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] bio_endio_nodec+0x12/0x20
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] end_workqueue_fn+0x3f/0x50 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] normal_work_helper+0xc2/0x2b0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] process_one_work+0x158/0x430
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] worker_thread+0x5b/0x530
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] ? rescuer_thread+0x3a0/0x3a0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] kthread+0xc9/0xe0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] ret_from_fork+0x58/0x90
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Code: f4 fe ff ff 0f 1f 80 00 00 00 00 0f 0b 66 0f 1f 44 00 00 4c 89 e7 e8 e0 e4 f3 c0 41 b9 fb ff ff ff e9 d2 fe ff ff 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 4c 89 e7 e8 c0 e4 f3 c0 31 f6 4c 89 ef
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RIP [] repair_io_failure+0x1a0/0x220 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RSP
Re: [PATCH v2 6/7] Btrfs: incremental send, don't send utimes for non-existing directory
On Mon, Jun 22, 2015 at 10:08 AM, Robbie Ko wrote: > There's one case where we can't issue a utimes operation for a directory. There's one where we attempt to get utimes from a directory that doesn't exist in the send snapshot. > First, 261 can't move to d/item1 without the rename of inode 265. So as 262. > Thus 261 and 262 need to wait for rename. > Second, since 263 will be deleted and there are two waiting sub-directory > 261 and 262, rmdir_ino of 261 will set to 263 and rmdir_ino of 262 is not set. > If 262 is processed earlier than 261, utime of both 263 and 264 will be > updated. However, 263 should not update since it will vanish. You can't just start explaining an example, referring to inode numbers etc, without showing the example before. How the parent and send snapshots look like? So move up the example (parent and send snapshots directory hierarchy) before explaining it. We read top down and not bottom up. > > I've found that the following case is the main cause of such error > and it's fs tree is shown via btrfs-debug-tress as below. > > file tree key (459 ROOT_ITEM 20487) > node 132988928 level 1 items 3 free 490 generation 20487 owner 459 > fs uuid b451ae42-3b03-4003-b0a4-45dce324557f > chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc > key (256 INODE_ITEM 0) block 132710400 (8100) gen 20486 > key (264 INODE_ITEM 0) block 130695168 (7977) gen 20480 > key (266 XATTR_ITEM 952319794) block 126042112 (7693) gen 20464 > leaf 132710400 items 166 free space 3639 generation 20486 owner 455 > fs uuid b451ae42-3b03-4003-b0a4-45dce324557f > chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc > item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 > inode generation 20425 transid 20442 size 32 block > group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0 > item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 > inode ref index 0 namelen 2 name: .. > ... 
> item 165 key (262 XATTR_ITEM 1100961104) itemoff 7789 itemsize 39 > location key (0 UNKNOWN.0 0) type XATTR > namelen 8 datalen 1 name: user.a78 > data a > binary 61 > leaf 130695168 items 133 free space 7332 generation 20480 owner 455 > fs uuid b451ae42-3b03-4003-b0a4-45dce324557f > chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc > item 0 key (264 INODE_ITEM 0) itemoff 16123 itemsize 160 > inode generation 20428 transid 20434 size 10 block > group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0 > item 1 key (264 INODE_REF 256) itemoff 16112 itemsize 11 > inode ref index 11 namelen 1 name: c > ... > > We can see that inode 262 is right at the end of leaf. Then send_utime() will > use btrfs_search_slot() to find a appropriate place to put 262 where is at the > back of 262. However, that place is uninitialized on disk. Suppose we read > atime tv_sec:576469548413222912, tv_nsec:1919251317 and then send it out. > Receiving side will got EINVAL since tv_nsec:1919251317 is greater > than 999,999,999. "...place to put 262..." -> it's actually inode 263, plus we aren't attempting to put anything anywhere. So you can explain this by mentioning that we're trying to send utimes for a directory/inode that doesn't exist in the send snapshot. That send_utimes() will use part of a leaf beyond its boundaries or a wrong slot (belonging to some other unrelated inode), because btrfs_search_slot() returns 1 when we call it to find the inode item to extract a utimes value from, and send_utimes() is not prepared to deal with such case because it assumes no one calls it for an inode that doesn't exist in the send root. And that you fix the problem in the offending caller. > > So fix this by don't send utimes for non-existing directory for this case. 
> > Example: > > Parent snapshot: > | a/ (ino 259) > | c (ino 264) > | b/ (ino 260) > | d (ino 265) > | del/ (ino 263) > | item1/ (ino 261) > | item2/ (ino 262) > > Send snapshot: > | a/ (ino 259) > | b/ (ino 260) > | c/ (ino 264) > | item2 (ino 262) > | d/ (ino 265) > | item1/ (ino 261) > > Signed-off-by: Robbie Ko > --- > > V2:don't send utimes for non-existing directory > > fs/btrfs/send.c | 12 +++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c > index cd22f7d..579a4c8 100644 > --- a/fs/btrfs/send.c > +++ b/fs/btrfs/send.c > @@ -3243,8 +3243,18 @@ finish: > * and old parent(s). > */ > list_for_each_entry(cur, &pm->update_refs, list) { > - if (cur->dir == rmdir_ino) > + /* > +* don't send utimes for non-existing directory > +*/ > + ret = get_inode_info(sctx->send_root, cur->dir, NULL, > +NULL , NULL, NULL, NULL, NULL); > + if (ret == -ENOENT) {
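The failure mode discussed in this review (a lookup that returns 1 for a missing inode, followed by a read of whatever sits at the returned slot) can be illustrated with a small standalone sketch. This is not btrfs code: `demo_item`, `demo_search_slot()`, `demo_get_atime()` and `demo_nsec_valid()` are hypothetical names, and the sorted array only stands in for a leaf. The assumption carried over from the thread is the return convention (0 = found, 1 = not found, slot = insertion point).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an inode item stored in a sorted leaf. */
struct demo_item {
	uint64_t ino;
	int64_t  atime_sec;
	uint32_t atime_nsec;
};

/*
 * Binary search over items sorted by inode number.
 * Returns 0 and sets *slot to the matching index, or returns 1 and sets
 * *slot to the index where the key would be inserted. That index may
 * belong to an unrelated inode, or be one past the last valid item.
 */
static int demo_search_slot(const struct demo_item *items, size_t nr,
			    uint64_t ino, size_t *slot)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (items[mid].ino == ino) {
			*slot = mid;
			return 0;
		}
		if (items[mid].ino < ino)
			lo = mid + 1;
		else
			hi = mid;
	}
	*slot = lo;
	return 1;
}

/* A nanosecond field is only meaningful in the range 0..999999999. */
static int demo_nsec_valid(uint32_t nsec)
{
	return nsec <= 999999999u;
}

/*
 * Copies the item for 'ino' into *out and returns 0, or returns -1 when
 * the inode does not exist, so the caller can skip sending its times
 * instead of reading the slot of some other inode.
 */
static int demo_get_atime(const struct demo_item *items, size_t nr,
			  uint64_t ino, struct demo_item *out)
{
	size_t slot;

	if (demo_search_slot(items, nr, ino, &slot) != 0)
		return -1;
	*out = items[slot];
	return 0;
}
```

With a leaf holding inodes 256, 262 and 264, looking up 263 returns 1 with the slot pointing at 264, which is why the caller has to check the return value before using the slot. The nanosecond check shows why a value such as 1919251317 read from the wrong place would be rejected by the receiving side.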
Re: [PATCH v2 3/7] Btrfs: incremental send, avoid ancestor rename to descendant
On Mon, Jun 22, 2015 at 10:08 AM, Robbie Ko wrote:
> There's one more case where we can't issue a rename operation for a directory
> as soon as we process it. We move a directory from ancestor to descendant.
>
> | a
> | b
> | c
> | d
> "Move a directory from ancestor to descendant" means moving dir. a into dir. c
>
> This case will happen after applying "[PATCH] Btrfs: incremental send,
> don't delay directory renames unnecessarily".
> Because, that patch changes behavior of wait_for_parent_move function.
>
> Example:
> Parent snapshot:
> | @tmp/ (ino 257)
> | pre/ (ino 260)
> | wait_dir (ino 261)
> | ance/ (ino 263)
> | wait_at_below_ance/ (ino 259)
> | desc/ (ino 262)
> | other_dir/ (ino 264)
>
> Send snapshot:
> | @tmp/ (ino 257)
> | other_dir/ (ino 264)
> | wait_at_below_ance/ (ino 259)
> | pre/ (ino 260)
> | wait_dir/ (ino 261)
> | desc/ (ino 262)
> | ance/ (ino 263)
>
> 1. 259 must move to @tmp/other_dir, so it is waiting on other_dir(264).
>
> 2. 260 is able to rename as ance/wait_at_below_ance/pre since
> wait_at_below_ance(259) is waiting and 260 is not the ancestor of
> wait_at_below_ance(259).
>
> 3. 261 must move to @tmp/other_dir, so it is waiting on other_dir(264).
>
> 4. 262 is able to rename as ance/wait_at_below_ance/pre/wait_dir/desc since
> wait_dir(261) is waiting and 262 is not the ancestor of wait_dir(261).
>
> 5. 263 is rename as ance/wait_at_below_ance/pre/wait_dir/desc/ance since
> wait_dir(261) is waiting and 263 is not the ancestor of wait_dir(261).
> At the same time, receiving side will encounter error.
> If anyone calls get_cur_path() to any element in
> ance/wait_at_below_ance/pre/wait_dir/desc/ance like wait_dir(260),
> there will cause path building loop like this : 261 -> 260 -> 259 ->
> 263 -> 262 -> 261
>
> So fix the problem by check path_loop for this case.
>
> Signed-off-by: Robbie Ko
> ---
>
> V2: Always check path_loop, and check Allocation ret value.
>
> fs/btrfs/send.c | 46 +++---
> 1 file changed, 43 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 44ad144..b946067 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3088,15 +3088,23 @@ static int path_loop(struct send_ctx *sctx, struct
> fs_path *name,
>
> *ancestor_ino = 0;
> while (ino != BTRFS_FIRST_FREE_OBJECTID) {
> + struct waiting_dir_move *wdm;
> fs_path_reset(name);
>
> if (is_waiting_for_rm(sctx, ino))
> break;
> - if (is_waiting_for_move(sctx, ino)) {
> +
> + wdm = get_waiting_dir_move(sctx, ino);
> + if (wdm) {
> if (*ancestor_ino == 0)
> *ancestor_ino = ino;
> - ret = get_first_ref(sctx->parent_root, ino,
> - &parent_inode, &parent_gen, name);
> + if (wdm->orphanized) {
> + ret = gen_unique_name(sctx, ino, gen, name);
> + break;
> + } else {
> + ret = get_first_ref(sctx->parent_root, ino,
> +
> &parent_inode, &parent_gen, name);
> + }
> } else {
> ret = __get_cur_name_and_parent(sctx, ino, gen,
> &parent_inode,
> @@ -3743,6 +3751,38 @@ verbose_printk("btrfs: process_recorded_refs %llu\n",
> sctx->cur_ino);
> }
>
> /*
> +* if cur_ino is cur ancestor, can't move now,
> +* find descendant who is waiting, waiting it.
> +*/

If cur_ino is current ancestor of whom?
"find descendant" -> but below we're looking for an ancestor and delaying the rename of the current inode (sctx->cur_ino) to happen after the rename of that ancestor.

> + if(can_rename) {

Again, please run checkpatch.pl against the files. Kernel coding style, add a space between if and the opening parenthesis: if (...) {

> + struct fs_path *name = NULL;
> + u64 ancestor;
> + u64 old_send_progress = sctx->send_progress;
> +
> + name = fs_path_alloc();
> + if (!valid_path) {

Wrong variable. Must be: if (!name) { (...)

> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + sctx->send_progress = sctx->cur_ino + 1;
> + ret = path_loop(sctx, name, sctx->cur_ino,
> sctx->cur_inode_gen, &ancestor);

Need
Re: RAID1: system stability
Okay, here are the logs. I released disk /dev/sde1 and got:

Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5,sas_addr 0x5000cca00d0514bd
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0x880449541400)
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd)
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BTRFS info (device md127): csum failed ino 1039 extent 390332416 csum 2059524288 wanted 343582415 mirror 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at 87fa7ff534
Re: [PATCH v2 2/7] Btrfs: incremental send, avoid circular waiting and descendant overwrite ancestor need to update path
On Mon, Jun 22, 2015 at 10:08 AM, Robbie Ko wrote: > Base on [PATCH] Btrfs: incremental send, check if orphanized dir inode needs > delayed rename This is mentioned on the cover letter, so no need to repeat this on the commit message of every patch in the series. > > Example1: > There's one case where we can't issue a rename operation for a directory > as soon as we process it. Used to delay directory renames if > wait_parent_move or wait_for_dest_dir_move, maybe cause circular waiting. This second sentence is confusing to say the least. What is "if wait_parent_move"? There's nothing in send.c with that name. And the maybe is equally confusing and redundant. You already explain below that the problem is a circular waiting, an example and what is a circular waiting exactly. > > Parent snapshot: > | d/ (ino 257) > | p1 (ino 258) > | p1/ (ino 259) > > Send snapshot: > | d/ (ino 257) > | p1 (ino 259) > | p1/ (ino 258) > > Here we can not rename 258 from d/p1 to p1/p1 without the rename of inode 259. > p1 258 is put into wait_parent_move. "... is put into wait_parent_move" -> what is wait_parent_move? There's nothing in send.c with that name. Is it a function, is it a data structure, or what? Even someone familiar with send's internals scratches his head trying to understand what does this means. A better alternative: "Inode 258 became a child of inode 259 and both were renamed in the send snapshot. Therefore inode 258's rename operation is delayed to happen after 259 is renamed." Or something along those lines. > 259 can't be rename to d/p1, so it is put into It should be mentioned why 259 can't be renamed. > circular waiting happens" -> so 259's rename is delayed to happen after 258's > rename, > which creates a circular dependency (258 -> 259 -> 258). > > Example2: > There's one case where we can't issue a rename operation for a directory > immediately we process it. We are repeating this sentence in every example. 
Just say at the very top that there are several more cases where we can't do the renames immediately. > After moving 262 outside, path of 265 is stored in the name_cache_entry. After renaming inode 262, the name inode 265 has in the parent snapshot is stored in the name cache. > When 263 try to overwrite 265, its ancestor, 265 is moved to orphanized. Path > of 263 > is still the original path, however. This causes error. What error? It's important to mention what error it is. You should explain that after orphanizing 265 we were leaving its old name in the cache and how that causes a problem. > > Parent snapshot: > | a/ (ino 259) > | c (ino 266) > | d/ (ino 260) > | ance (ino 265) > | e (ino 261) > | f (ino 262) > | ance (ino 263) > > Send snapshot: > | a/ (ino 259) > | c/ (ino 266) > | ance (ino 265) > | d/ (ino 260) > | ance (ino 263) > | f/ (ino 262) > | e (ino 261) > > Example3: > There is another case for 2nd scenario where is_ancestor() can't be used. > > Parent snapshot: > | a/ (ino 261) > | c (ino 267) > | d/ (ino 259) > | ance/ (ino 266) > | waiting_dir/ (ino 262) > | pre/ (ino 264) > | ance/ (ino 265) > > Send snapshot: > | a/ (ino 261) > | ance/ (ino 266) > | c (ino 267) > | waiting_dir/ (ino 262) > | pre/ (ino 264) > | d/ (ino 259) > | ance/ (ino 265) > > First, 262 can't move to c/waiting_dir without the rename of inode 267. > Second, 264 can move into dir 262. Although 262 is waiting, 264 is not > parent of 262 in the parent root. > (The second behavior will happen after applying "[PATCH] Btrfs: > incremental send, don't delay directory renames unnecessarily") > Finally, 265 will overwrite 266 and path for 265 should be updated > since 266 is not the ancestor of 265. > Here we need to check the current state of tree rather than parent > root which is_ancestor function does. > > Signed-off-by: Robbie Ko > --- > > V2:when orphanized inode always get_cur_path again. 
> > fs/btrfs/send.c | 38 -- > 1 file changed, 32 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c > index 257753b..44ad144 100644 > --- a/fs/btrfs/send.c > +++ b/fs/btrfs/send.c > @@ -230,7 +230,6 @@ struct pending_dir_move { > u64 parent_ino; > u64 ino; > u64 gen; > - bool is_orphan; > struct list_head update_refs; > }; > > @@ -1840,7 +1839,7 @@ static int will_overwrite_ref(struct send_ctx *sctx, > u64 dir, u64 dir_gen, > * was already unlinked/moved, so we can safely assume that we will > not > * overwrite anything at this point in time. > */ > - if (other_inode > sctx->send_progress) { > + if (other_inode > sctx->send_progress || is_waiting_for_move(sctx, > other_inode)) { > ret = get_inode_info(sctx->pa
Re: [PATCH v2 1/7] Revert "Btrfs: incremental send, remove dead code"
On Mon, Jun 22, 2015 at 10:08 AM, Robbie Ko wrote:
> This reverts commit 5f806c3ae2ff6263a10a6901f97abb74dac03d36.
>

So, this is a revert patch that alone by itself doesn't fix any problem. Fine.
However you are now pasting below the commit message from another patch in the series (patch 3) that actually makes use of this patch and fixes something. Just mention here that this is necessary for a subsequent patch in the series... Explaining here what some other patch fixes and how is confusing.

> Btrfs: incremental send, avoid ancestor rename to descendant
>
> There's one more case where we can't issue a rename operation for a directory
> as soon as we process it. We move a directory from ancestor to descendant.
>
> | a
> | b
> | c
> | d
> "Move a directory from ancestor to descendant" means moving dir. a into dir. c
>
> This case will happen after applying "[PATCH] Btrfs: incremental send,
> don't delay directory renames unnecessarily".
> Because, that patch changes behavior of wait_for_parent_move function.
>
> Parent snapshot:
> | @tmp/ (ino 257)
> | pre/ (ino 259)
> | wait_dir (ino 260)
> | finish_dir2/ (ino 261)
> | ance/ (ino 263)
> | finish_dir1/ (ino 258)
> | desc/ (ino 262)
> | other_dir/ (ino 264)
>
> Send snapshot:
> | @tmp/ (ino 257)
> | other_dir/ (ino 264)
> | wait_dir/ (ino 260)
> | finish_dir2/ (ino 261)
> | desc/ (ino 262)
> | ance/ (ino 263)
> | finish_dir1/ (ino 258)
> | pre/ (ino 259)
>
> 1. 259 can not move under 258 because 263 needs to move to 263 first.
> So 259 is waiting on ance(263).
>
> 2. 260 must move to @tmp/other_dir, so it is waiting on other_dir(264).
>
> 3. 262 is able to rename as pre/wait_dir/finish_dir2(261)/desc since
> wait_dir(260) is waiting and 262 is not the ancestor of wait_dir(260).
>
> 4.263 is able to rename as pre/wait_dir/finish_dir2(261)/ance since
> wait_dir(260) is waiting and 263 is not the ancestor of wait_dir(260).
>
> 5. After wait_dir(263) is finished, all pending dirs. start to run.
> /pre(259) in apply_dir_move() renames /pre as
> pre/wait_dir/finish_dir2/desc/ance/finish_dir1/pre
> At the same time, receiving side will encounter error.
> If anyone calls get_cur_path() to any element in
> pre/wait_dir/finish_dir2/desc/ance/finish_dir1/pre like wait_dir(260),
> there will cause path building loop like this : 260 -> 259 -> 258 ->
> 263 -> 262 -> 261 -> 260
>
> So fix the problem by check path_loop for this case.
>
> Signed-off-by: Robbie Ko
> ---
> fs/btrfs/send.c | 59 +
> 1 file changed, 59 insertions(+)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 1c1f161..257753b 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3080,6 +3080,48 @@ static struct pending_dir_move *get_pending_dir_moves(struct send_ctx *sctx,
>  	return NULL;
>  }
>
> +static int path_loop(struct send_ctx *sctx, struct fs_path *name,
> +		     u64 ino, u64 gen, u64 *ancestor_ino)
> +{
> +	int ret = 0;
> +	u64 parent_inode = 0;
> +	u64 parent_gen = 0;
> +	u64 start_ino = ino;
> +
> +	*ancestor_ino = 0;
> +	while (ino != BTRFS_FIRST_FREE_OBJECTID) {
> +		fs_path_reset(name);
> +
> +		if (is_waiting_for_rm(sctx, ino))
> +			break;
> +		if (is_waiting_for_move(sctx, ino)) {
> +			if (*ancestor_ino == 0)
> +				*ancestor_ino = ino;
> +			ret = get_first_ref(sctx->parent_root, ino,
> +					    &parent_inode, &parent_gen, name);
> +		} else {
> +			ret = __get_cur_name_and_parent(sctx, ino, gen,
> +							&parent_inode,
> +							&parent_gen, name);
> +			if (ret > 0) {
> +				ret = 0;
> +				break;
> +			}
> +		}
> +		if (ret < 0)
> +			break;
> +		if (parent_inode == start_ino) {
> +			ret = 1;
> +			if (*ancestor_ino == 0)
> +				*ancestor_ino = ino;
> +			break;
> +		}
> +		ino = parent_inode;
> +		gen = parent_gen;
> +	}
> +	return ret;
> +}
> +
>  static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
>  {
>  	struct fs_path *from_path = NULL;
> @@ -3091,6 +3133,7 @@ static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
>  	struct waiting_dir_move *dm = NULL;
>
[PATCH] Btrfs: Check if kobject is initialized before put
Signed-off-by: Anand Jain
---
 fs/btrfs/sysfs.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index ea81a05..603b0cc 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -523,9 +523,11 @@ static void __btrfs_sysfs_remove_fsid(struct btrfs_fs_devices *fs_devs)
 		fs_devs->device_dir_kobj = NULL;
 	}
 
-	kobject_del(&fs_devs->super_kobj);
-	kobject_put(&fs_devs->super_kobj);
-	wait_for_completion(&fs_devs->kobj_unregister);
+	if (fs_devs->super_kobj.state_initialized) {
+		kobject_del(&fs_devs->super_kobj);
+		kobject_put(&fs_devs->super_kobj);
+		wait_for_completion(&fs_devs->kobj_unregister);
+	}
 }
 
 /* when fs_devs is NULL it will remove all fsid kobject */
--
2.4.1
[PATCH v2 4/7] Btrfs: incremental send, fix orphan_dir_info leak
There's one case where we leak an orphan_dir_info structure.

Example:

Parent snapshot:
| a/ (ino 279)
| c (ino 282)
| del/ (ino 281)
| tmp/ (ino 280)
| long/ (ino 283)
| longlong/ (ino 284)

Send snapshot:
| a/ (ino 279)
| long (ino 283)
| longlong (ino 284)
| c/ (ino 282)
| tmp/ (ino 280)

Fix this by freeing an existing orphan_dir_info for a directory when we realize we can't rmdir the directory because it has a descendant that wasn't yet processed, and the orphan_dir_info was created because it had a descendant whose rename operation was delayed.

Signed-off-by: Robbie Ko
---
V2: modify comment

 fs/btrfs/send.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index b946067..bc9efbe 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -2913,6 +2913,11 @@ static int can_rmdir(struct send_ctx *sctx, u64 dir, u64 dir_gen,
 		}
 
 		if (loc.objectid > send_progress) {
+			struct orphan_dir_info *odi;
+
+			odi = get_orphan_dir_info(sctx, dir);
+			if (odi)
+				free_orphan_dir_info(sctx, odi);
 			ret = 0;
 			goto out;
 		}
--
1.9.1
[PATCH v2 6/7] Btrfs: incremental send, don't send utimes for non-existing directory
There's one case where we can't issue a utimes operation for a directory.

First, 261 can't move to d/item1 without the rename of inode 265. The same goes for 262. Thus 261 and 262 need to wait for the rename.
Second, since 263 will be deleted and there are two waiting subdirectories, 261 and 262, rmdir_ino of 261 will be set to 263 and rmdir_ino of 262 is not set.
If 262 is processed earlier than 261, the utimes of both 263 and 264 will be updated. However, 263 should not be updated since it will vanish.

I've found that the following case is the main cause of this error. Its fs tree, as shown by btrfs-debug-tree, is below.

file tree key (459 ROOT_ITEM 20487)
node 132988928 level 1 items 3 free 490 generation 20487 owner 459
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
	key (256 INODE_ITEM 0) block 132710400 (8100) gen 20486
	key (264 INODE_ITEM 0) block 130695168 (7977) gen 20480
	key (266 XATTR_ITEM 952319794) block 126042112 (7693) gen 20464
leaf 132710400 items 166 free space 3639 generation 20486 owner 455
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 20425 transid 20442 size 32 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0
	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
		inode ref index 0 namelen 2 name: ..
	...
	item 165 key (262 XATTR_ITEM 1100961104) itemoff 7789 itemsize 39
		location key (0 UNKNOWN.0 0) type XATTR
		namelen 8 datalen 1 name: user.a78
		data a
		binary 61
leaf 130695168 items 133 free space 7332 generation 20480 owner 455
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
	item 0 key (264 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 20428 transid 20434 size 10 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0
	item 1 key (264 INODE_REF 256) itemoff 16112 itemsize 11
		inode ref index 11 namelen 1 name: c
	...

We can see that inode 262 is right at the end of the leaf. send_utimes() then uses btrfs_search_slot() to find an appropriate slot, which lands right after the last item of inode 262. However, that slot is uninitialized on disk.
Suppose we read atime tv_sec:576469548413222912, tv_nsec:1919251317 and then send it out. The receiving side will get EINVAL since tv_nsec:1919251317 is greater than 999,999,999.

So fix this by not sending utimes for a non-existing directory in this case.

Example:

Parent snapshot:
| a/ (ino 259)
| c (ino 264)
| b/ (ino 260)
| d (ino 265)
| del/ (ino 263)
| item1/ (ino 261)
| item2/ (ino 262)

Send snapshot:
| a/ (ino 259)
| b/ (ino 260)
| c/ (ino 264)
| item2 (ino 262)
| d/ (ino 265)
| item1/ (ino 261)

Signed-off-by: Robbie Ko
---
V2: don't send utimes for non-existing directory

 fs/btrfs/send.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index cd22f7d..579a4c8 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3243,8 +3243,18 @@ finish:
 	 * and old parent(s).
 	 */
 	list_for_each_entry(cur, &pm->update_refs, list) {
-		if (cur->dir == rmdir_ino)
+		/*
+		 * don't send utimes for non-existing directory
+		 */
+		ret = get_inode_info(sctx->send_root, cur->dir, NULL,
+				     NULL, NULL, NULL, NULL, NULL);
+		if (ret == -ENOENT) {
+			ret = 0;
 			continue;
+		}
+		if (ret < 0)
+			goto out;
+
 		ret = send_utimes(sctx, cur->dir, cur->dir_gen);
 		if (ret < 0)
 			goto out;
--
1.9.1
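The EINVAL on the receiving side follows from the range rule for nanoseconds. The check below is a minimal user-space sketch of that rule only. The struct and function names are made up for illustration and are not part of btrfs or btrfs-progs, and the special UTIME_NOW/UTIME_OMIT values that utimensat() also accepts are deliberately ignored here.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the timestamp carried by a utimes command. */
struct sketch_timespec {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/*
 * A nanosecond field is only meaningful in [0, 999999999].
 * Returns 0 if the value is in range, -EINVAL otherwise.
 */
static int sketch_check_timespec(const struct sketch_timespec *ts)
{
	if (ts->tv_nsec < 0 || ts->tv_nsec > 999999999LL)
		return -EINVAL;
	return 0;
}
```

Feeding it the values quoted in the commit message (tv_sec 576469548413222912, tv_nsec 1919251317) yields -EINVAL, which is why the garbage read from the uninitialized slot shows up as a receive failure rather than as a silently wrong timestamp.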
[PATCH v2 7/7] Btrfs: incremental send, avoid the overhead of allocating an orphan_dir_info object unnecessarily
Avoid the overhead of allocating an orphan_dir_info object unnecessarily.

Signed-off-by: Robbie Ko
---
 fs/btrfs/send.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 579a4c8..9c60421 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -2785,12 +2785,6 @@ add_orphan_dir_info(struct send_ctx *sctx, u64 dir_ino)
 	struct rb_node *parent = NULL;
 	struct orphan_dir_info *entry, *odi;
 
-	odi = kmalloc(sizeof(*odi), GFP_NOFS);
-	if (!odi)
-		return ERR_PTR(-ENOMEM);
-	odi->ino = dir_ino;
-	odi->gen = 0;
-
 	while (*p) {
 		parent = *p;
 		entry = rb_entry(parent, struct orphan_dir_info, node);
@@ -2799,11 +2793,16 @@ add_orphan_dir_info(struct send_ctx *sctx, u64 dir_ino)
 		} else if (dir_ino > entry->ino) {
 			p = &(*p)->rb_right;
 		} else {
-			kfree(odi);
 			return entry;
 		}
 	}
 
+	odi = kmalloc(sizeof(*odi), GFP_NOFS);
+	if (!odi)
+		return ERR_PTR(-ENOMEM);
+	odi->ino = dir_ino;
+	odi->gen = 0;
+
 	rb_link_node(&odi->node, parent, p);
 	rb_insert_color(&odi->node, &sctx->orphan_dirs);
 	return odi;
--
1.9.1
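The reordering in this patch is the usual "search first, allocate only on a miss" pattern. Here is a minimal user-space sketch of it, assuming a plain unbalanced binary tree in place of the kernel rb-tree. All names (odi_sketch, sketch_add_odi, sketch_allocs) are hypothetical, and the allocation counter exists only to make the effect visible.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node; the kernel code embeds an rb_node instead. */
struct odi_sketch {
	uint64_t ino;
	uint64_t gen;
	struct odi_sketch *left, *right;
};

/* Counts successful allocations, for illustration only. */
static unsigned long sketch_allocs;

/*
 * Return the entry for ino, creating it only if the search finds nothing.
 * Returns NULL on allocation failure.
 */
static struct odi_sketch *sketch_add_odi(struct odi_sketch **root, uint64_t ino)
{
	struct odi_sketch **p = root;
	struct odi_sketch *odi;

	while (*p) {
		if (ino < (*p)->ino)
			p = &(*p)->left;
		else if (ino > (*p)->ino)
			p = &(*p)->right;
		else
			return *p;	/* already present: nothing was allocated */
	}

	/* Only reached on a miss, so there is no allocate-then-free path. */
	odi = malloc(sizeof(*odi));
	if (!odi)
		return NULL;
	sketch_allocs++;
	odi->ino = ino;
	odi->gen = 0;
	odi->left = odi->right = NULL;
	*p = odi;
	return odi;
}
```

Calling sketch_add_odi() twice with the same inode number returns the same pointer and bumps the counter once. With the old ordering, the second call would have allocated and immediately freed an object.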
[PATCH v2 1/7] Revert "Btrfs: incremental send, remove dead code"
This reverts commit 5f806c3ae2ff6263a10a6901f97abb74dac03d36.

Btrfs: incremental send, avoid ancestor rename to descendant

There's one more case where we can't issue a rename operation for a directory as soon as we process it. We move a directory from ancestor to descendant.

| a
| b
| c
| d
"Move a directory from ancestor to descendant" means moving dir. a into dir. c

This case will happen after applying "[PATCH] Btrfs: incremental send, don't delay directory renames unnecessarily".
Because, that patch changes behavior of wait_for_parent_move function.

Parent snapshot:
| @tmp/ (ino 257)
| pre/ (ino 259)
| wait_dir (ino 260)
| finish_dir2/ (ino 261)
| ance/ (ino 263)
| finish_dir1/ (ino 258)
| desc/ (ino 262)
| other_dir/ (ino 264)

Send snapshot:
| @tmp/ (ino 257)
| other_dir/ (ino 264)
| wait_dir/ (ino 260)
| finish_dir2/ (ino 261)
| desc/ (ino 262)
| ance/ (ino 263)
| finish_dir1/ (ino 258)
| pre/ (ino 259)

1. 259 can not move under 258 because 263 needs to move to 263 first.
   So 259 is waiting on ance(263).

2. 260 must move to @tmp/other_dir, so it is waiting on other_dir(264).

3. 262 is able to rename as pre/wait_dir/finish_dir2(261)/desc since
   wait_dir(260) is waiting and 262 is not the ancestor of wait_dir(260).

4. 263 is able to rename as pre/wait_dir/finish_dir2(261)/ance since
   wait_dir(260) is waiting and 263 is not the ancestor of wait_dir(260).

5. After wait_dir(263) is finished, all pending dirs. start to run.
   /pre(259) in apply_dir_move() renames /pre as
   pre/wait_dir/finish_dir2/desc/ance/finish_dir1/pre
   At the same time, receiving side will encounter error.
   If anyone calls get_cur_path() to any element in
   pre/wait_dir/finish_dir2/desc/ance/finish_dir1/pre like wait_dir(260),
   there will cause path building loop like this:
   260 -> 259 -> 258 -> 263 -> 262 -> 261 -> 260

So fix the problem by check path_loop for this case.

Signed-off-by: Robbie Ko
---
 fs/btrfs/send.c | 59 +
 1 file changed, 59 insertions(+)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 1c1f161..257753b 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3080,6 +3080,48 @@ static struct pending_dir_move *get_pending_dir_moves(struct send_ctx *sctx,
 	return NULL;
 }
 
+static int path_loop(struct send_ctx *sctx, struct fs_path *name,
+		     u64 ino, u64 gen, u64 *ancestor_ino)
+{
+	int ret = 0;
+	u64 parent_inode = 0;
+	u64 parent_gen = 0;
+	u64 start_ino = ino;
+
+	*ancestor_ino = 0;
+	while (ino != BTRFS_FIRST_FREE_OBJECTID) {
+		fs_path_reset(name);
+
+		if (is_waiting_for_rm(sctx, ino))
+			break;
+		if (is_waiting_for_move(sctx, ino)) {
+			if (*ancestor_ino == 0)
+				*ancestor_ino = ino;
+			ret = get_first_ref(sctx->parent_root, ino,
+					    &parent_inode, &parent_gen, name);
+		} else {
+			ret = __get_cur_name_and_parent(sctx, ino, gen,
+							&parent_inode,
+							&parent_gen, name);
+			if (ret > 0) {
+				ret = 0;
+				break;
+			}
+		}
+		if (ret < 0)
+			break;
+		if (parent_inode == start_ino) {
+			ret = 1;
+			if (*ancestor_ino == 0)
+				*ancestor_ino = ino;
+			break;
+		}
+		ino = parent_inode;
+		gen = parent_gen;
+	}
+	return ret;
+}
+
 static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
 {
 	struct fs_path *from_path = NULL;
@@ -3091,6 +3133,7 @@ static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
 	struct waiting_dir_move *dm = NULL;
 	u64 rmdir_ino = 0;
 	int ret;
+	u64 ancestor = 0;
 
 	name = fs_path_alloc();
 	from_path = fs_path_alloc();
@@ -3122,6 +3165,22 @@ static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
 		goto out;
 
 	sctx->send_progress = sctx->cur_ino + 1;
+	ret = path_loop(sctx, name, pm->ino, pm->gen, &ancestor);
+	if (ret) {
+		LIST_HEAD(deleted_refs);
+		ASSERT(ancestor > BTRFS_FIRST_FREE_OBJECTID);
+		ret = add_pending_dir_move(sctx, pm->ino, pm->gen, ancestor,
+					   &pm->update_refs, &deleted_refs,
+
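The core idea of path_loop() is to walk parent links upward from an inode and report a loop if the walk comes back to the starting inode. The following is a simplified user-space sketch of just that idea, with the parent links supplied as a small table. It does not model waiting moves, orphanized inodes, or generations, and every name in it is hypothetical. The constant 256 mirrors BTRFS_FIRST_FREE_OBJECTID, which the walk treats as the top of the tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Top of the tree; mirrors BTRFS_FIRST_FREE_OBJECTID (256). */
#define SKETCH_ROOT_INO 256ULL

/* One "ino has this parent" fact in the current (partially renamed) tree. */
struct parent_link {
	uint64_t ino;
	uint64_t parent;
};

/* Look up the parent of ino; inodes not in the table hang off the root. */
static uint64_t sketch_parent_of(const struct parent_link *links, size_t n,
				 uint64_t ino)
{
	for (size_t i = 0; i < n; i++)
		if (links[i].ino == ino)
			return links[i].parent;
	return SKETCH_ROOT_INO;
}

/*
 * Return 1 if walking up from start_ino leads back to start_ino, else 0.
 * The step counter bounds the walk in case of a cycle that does not
 * contain start_ino.
 */
static int sketch_path_loop(const struct parent_link *links, size_t n,
			    uint64_t start_ino)
{
	uint64_t ino = start_ino;
	size_t steps = 0;

	while (ino != SKETCH_ROOT_INO && steps++ <= n) {
		uint64_t parent = sketch_parent_of(links, n, ino);

		if (parent == start_ino)
			return 1;
		ino = parent;
	}
	return 0;
}
```

With the links from the commit message (260 -> 259 -> 258 -> 263 -> 262 -> 261 -> 260), starting at 260 reports a loop. A chain that reaches inode 256 reports none.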
[PATCH v2 2/7] Btrfs: incremental send, avoid circular waiting and descendant overwrite ancestor need to update path
Based on "[PATCH] Btrfs: incremental send, check if orphanized dir inode needs delayed rename".

Example 1:
There's one case where we can't issue a rename operation for a directory as soon as we process it. Delaying directory renames via wait_parent_move or wait_for_dest_dir_move can cause circular waiting.

Parent snapshot:
| d/ (ino 257)
| p1 (ino 258)
| p1/ (ino 259)

Send snapshot:
| d/ (ino 257)
| p1 (ino 259)
| p1/ (ino 258)

Here we can not rename 258 from d/p1 to p1/p1 without the rename of inode 259, so 258 is put into wait_parent_move. Inode 259 can't be renamed to d/p1 either, so its rename is delayed to happen after 258's rename, which creates a circular dependency (258 -> 259 -> 258).

Example 2:
There's another case where we can't issue a rename operation for a directory as soon as we process it. After moving 262 outside, the path of 265 is stored in the name_cache_entry. When 263 tries to overwrite 265, its ancestor, 265 is orphanized. The path of 263 is still the original path, however, which causes an error.

Parent snapshot:
| a/ (ino 259)
| c (ino 266)
| d/ (ino 260)
| ance (ino 265)
| e (ino 261)
| f (ino 262)
| ance (ino 263)

Send snapshot:
| a/ (ino 259)
| c/ (ino 266)
| ance (ino 265)
| d/ (ino 260)
| ance (ino 263)
| f/ (ino 262)
| e (ino 261)

Example 3:
There is another case of the second scenario where is_ancestor() can't be used.

Parent snapshot:
| a/ (ino 261)
| c (ino 267)
| d/ (ino 259)
| ance/ (ino 266)
| waiting_dir/ (ino 262)
| pre/ (ino 264)
| ance/ (ino 265)

Send snapshot:
| a/ (ino 261)
| ance/ (ino 266)
| c (ino 267)
| waiting_dir/ (ino 262)
| pre/ (ino 264)
| d/ (ino 259)
| ance/ (ino 265)

First, 262 can't move to c/waiting_dir without the rename of inode 267.
Second, 264 can move into dir 262. Although 262 is waiting, 264 is not a parent of 262 in the parent root.
(The second behavior will happen after applying "[PATCH] Btrfs: incremental send, don't delay directory renames unnecessarily".)
Finally, 265 will overwrite 266, and the path for 265 should be updated since 266 is not the ancestor of 265. Here we need to check the current state of the tree rather than the parent root, which is what the is_ancestor() function checks.

Signed-off-by: Robbie Ko
---
V2: when orphanized inode always get_cur_path again.

 fs/btrfs/send.c | 38 --
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 257753b..44ad144 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -230,7 +230,6 @@ struct pending_dir_move {
 	u64 parent_ino;
 	u64 ino;
 	u64 gen;
-	bool is_orphan;
 	struct list_head update_refs;
 };
 
@@ -1840,7 +1839,7 @@ static int will_overwrite_ref(struct send_ctx *sctx, u64 dir, u64 dir_gen,
 	 * was already unlinked/moved, so we can safely assume that we will not
 	 * overwrite anything at this point in time.
 	 */
-	if (other_inode > sctx->send_progress) {
+	if (other_inode > sctx->send_progress || is_waiting_for_move(sctx, other_inode)) {
 		ret = get_inode_info(sctx->parent_root, other_inode, NULL,
 				who_gen, NULL, NULL, NULL, NULL);
 		if (ret < 0)
@@ -3014,7 +3013,6 @@ static int add_pending_dir_move(struct send_ctx *sctx,
 	pm->parent_ino = parent_ino;
 	pm->ino = ino;
 	pm->gen = ino_gen;
-	pm->is_orphan = is_orphan;
 	INIT_LIST_HEAD(&pm->list);
 	INIT_LIST_HEAD(&pm->update_refs);
 	RB_CLEAR_NODE(&pm->node);
@@ -3134,6 +3132,7 @@ static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
 	u64 rmdir_ino = 0;
 	int ret;
 	u64 ancestor = 0;
+	bool is_orphan;
 
 	name = fs_path_alloc();
 	from_path = fs_path_alloc();
@@ -3145,9 +3144,10 @@ static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
 	dm = get_waiting_dir_move(sctx, pm->ino);
 	ASSERT(dm);
 	rmdir_ino = dm->rmdir_ino;
+	is_orphan = dm->orphanized;
 	free_waiting_dir_move(sctx, dm);
 
-	if (pm->is_orphan) {
+	if (is_orphan) {
 		ret = gen_unique_name(sctx, pm->ino,
 				      pm->gen, from_path);
 	} else {
@@ -3171,7 +3171,7 @@ static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
 		ASSERT(ancestor > BTRFS_FIRST_FREE_OBJECTID);
 		ret = add_pending_dir_move(sctx, pm->ino, pm->gen, ancestor,
 					   &pm->update_refs, &deleted_refs,
-					   pm->is_orphan);
+					   is_orphan);
 		if (ret < 0)
[PATCH v2 5/7] Btrfs: incremental send, fix rmdir but dir have a unprocess item
There's one case where we attempt to rmdir a directory prematurely.

Example:

Parent snapshot:
| a/ (ino 279)
| c (ino 282)
| del/ (ino 281)
| tmp/ (ino 280)
| long/ (ino 283)

Send snapshot:
| a/ (ino 279)
| long (ino 283)
| c/ (ino 282)
| tmp/ (ino 280)

While processing inode 281, since inode 280 is waiting for inode 282, rmdir_ino of struct waiting_dir_move for inode 280 will be set to 281 and an orphan_dir_info will be created for inode 281 in can_rmdir().
Then, when processing inode 282, we do the following steps:
First, move inode 282 from a/c to c.
Second, move inode 280 from del/tmp to c/tmp.
Third, try to remove inode 281.
In the third step, we pass 283 (sctx->cur_ino + 1) as the send_progress to the can_rmdir() function, and that makes it return true when it shouldn't, because inode 283 wasn't processed yet and is still a child of the directory with inode number 281. This makes the receiver run into an ENOTEMPTY error when attempting to remove the directory.

Signed-off-by: Robbie Ko
---
V2: modify comment

 fs/btrfs/send.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index bc9efbe..cd22f7d 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3213,7 +3213,7 @@ static int apply_dir_move(struct send_ctx *sctx, struct pending_dir_move *pm)
 			/* already deleted */
 			goto finish;
 		}
-		ret = can_rmdir(sctx, rmdir_ino, odi->gen, sctx->cur_ino + 1);
+		ret = can_rmdir(sctx, rmdir_ino, odi->gen, sctx->cur_ino);
 		if (ret < 0)
 			goto out;
 		if (!ret)
--
1.9.1
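The off-by-one can be seen with a reduced model of the check that can_rmdir() applies to each remaining child (loc.objectid > send_progress means "not processed yet, so don't rmdir"). The function below is an illustrative sketch with a hypothetical name. The real can_rmdir() walks directory items in the parent root and handles more cases.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if the directory may be removed, 0 if some remaining child
 * has an inode number above send_progress, meaning it was not processed yet.
 */
static int sketch_can_rmdir(const uint64_t *child_inos, size_t n,
			    uint64_t send_progress)
{
	for (size_t i = 0; i < n; i++)
		if (child_inos[i] > send_progress)
			return 0;
	return 1;
}
```

In the example, inode 283 is the only child left in directory 281 while cur_ino is 282. Passing cur_ino + 1 (283) makes the comparison 283 > 283 false, so the directory looks removable. Passing cur_ino (282) makes 283 > 282 true, so the rmdir is correctly held back.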
[PATCH v2 3/7] Btrfs: incremental send, avoid ancestor rename to descendant
There's one more case where we can't issue a rename operation for a directory as soon as we process it. We move a directory from ancestor to descendant.

| a
| b
| c
| d
"Move a directory from ancestor to descendant" means moving dir. a into dir. c

This case will happen after applying "[PATCH] Btrfs: incremental send, don't delay directory renames unnecessarily", because that patch changes the behavior of the wait_for_parent_move function.

Example:

Parent snapshot:
| @tmp/ (ino 257)
| pre/ (ino 260)
| wait_dir (ino 261)
| ance/ (ino 263)
| wait_at_below_ance/ (ino 259)
| desc/ (ino 262)
| other_dir/ (ino 264)

Send snapshot:
| @tmp/ (ino 257)
| other_dir/ (ino 264)
| wait_at_below_ance/ (ino 259)
| pre/ (ino 260)
| wait_dir/ (ino 261)
| desc/ (ino 262)
| ance/ (ino 263)

1. 259 must move to @tmp/other_dir, so it is waiting on other_dir(264).

2. 260 is able to rename as ance/wait_at_below_ance/pre since
   wait_at_below_ance(259) is waiting and 260 is not the ancestor of wait_at_below_ance(259).

3. 261 must move to @tmp/other_dir, so it is waiting on other_dir(264).

4. 262 is able to rename as ance/wait_at_below_ance/pre/wait_dir/desc since
   wait_dir(261) is waiting and 262 is not the ancestor of wait_dir(261).

5. 263 is renamed as ance/wait_at_below_ance/pre/wait_dir/desc/ance since
   wait_dir(261) is waiting and 263 is not the ancestor of wait_dir(261).
   At this point the receiving side will encounter an error.
   If anyone calls get_cur_path() on any element of
   ance/wait_at_below_ance/pre/wait_dir/desc/ance, such as wait_dir(261),
   it causes a path-building loop like this:
   261 -> 260 -> 259 -> 263 -> 262 -> 261

So fix the problem by checking path_loop for this case.

Signed-off-by: Robbie Ko
---
V2: Always check path_loop, and check allocation return value.

 fs/btrfs/send.c | 46 +++---
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 44ad144..b946067 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3088,15 +3088,23 @@ static int path_loop(struct send_ctx *sctx, struct fs_path *name,
 
 	*ancestor_ino = 0;
 	while (ino != BTRFS_FIRST_FREE_OBJECTID) {
+		struct waiting_dir_move *wdm;
 		fs_path_reset(name);
 
 		if (is_waiting_for_rm(sctx, ino))
 			break;
-		if (is_waiting_for_move(sctx, ino)) {
+
+		wdm = get_waiting_dir_move(sctx, ino);
+		if (wdm) {
 			if (*ancestor_ino == 0)
 				*ancestor_ino = ino;
-			ret = get_first_ref(sctx->parent_root, ino,
-					    &parent_inode, &parent_gen, name);
+			if (wdm->orphanized) {
+				ret = gen_unique_name(sctx, ino, gen, name);
+				break;
+			} else {
+				ret = get_first_ref(sctx->parent_root, ino,
+						    &parent_inode, &parent_gen, name);
+			}
 		} else {
 			ret = __get_cur_name_and_parent(sctx, ino, gen,
 							&parent_inode,
@@ -3743,6 +3751,38 @@ verbose_printk("btrfs: process_recorded_refs %llu\n", sctx->cur_ino);
 	}
 
 	/*
+	 * If cur_ino is an ancestor in the current path, it can't be moved
+	 * now; find the descendant that is waiting and wait for it.
+	 */
+	if (can_rename) {
+		struct fs_path *name = NULL;
+		u64 ancestor;
+		u64 old_send_progress = sctx->send_progress;
+
+		name = fs_path_alloc();
+		if (!name) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		sctx->send_progress = sctx->cur_ino + 1;
+		ret = path_loop(sctx, name, sctx->cur_ino, sctx->cur_inode_gen, &ancestor);
+		if (ret) {
+			ret = add_pending_dir_move(sctx, sctx->cur_ino, sctx->cur_inode_gen,
+					ancestor, &sctx->new_refs, &sctx->deleted_refs, is_orphan);
+			if (ret < 0) {
+				sctx->send_progress = old_send_progress;
+				fs_path_free(name);
+				goto out;
+			}
+			can_rename = false;
+			*pending_move = 1;
+		}
+
[PATCH v2 0/7] Btrfs incremental send fix several cases for rename and rm directory
Patches to fix btrfs send/receive.

These patches are based on v4.1 plus the following patches:
[PATCH] Btrfs: incremental send, don't delay directory renames unnecessarily
[PATCH] Btrfs: incremental send, check if orphanized dir inode needs delayed rename

Thanks.

Robbie Ko (7):
  Revert "Btrfs: incremental send, remove dead code"
  Btrfs: incremental send, avoid circular waiting and descendant overwrite ancestor need to update path
  Btrfs: incremental send, avoid ancestor rename to descendant
  Btrfs: incremental send, fix orphan_dir_info leak
  Btrfs: incremental send, fix rmdir but dir have a unprocess item
  Btrfs: incremental send, don't send utimes for non-existing directory
  Btrfs: incremental send, avoid the overhead of allocating an orphan_dir_info object unnecessarily

 fs/btrfs/send.c | 167 +++-
 1 file changed, 153 insertions(+), 14 deletions(-)

--
1.9.1