Re: Blocked tasks on 3.15.1
MM == Marc MERLIN m...@merlins.org writes:

MM> Note 3.16.0 is actually worse than 3.15 for me.

Here (a single partition btrfs), 3.16.0 works fine, but 3.17-rc1 fails again.

My /var/log is also a compressed, single-partition btrfs; that doesn't show the problem with any version. Just the partition with git, svn and rsync trees.

Last night's test of 3.17-rc1 showed the problem with the first git pull, getting stuck reading FETCH_HEAD. All repos on that fs failed the same way.

But rebooting back to 3.16.0 let everything work perfectly.

-JimC
--
James Cloos cl...@jhcloos.com OpenPGP: 0x997A9F17ED7DAEA6
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Blocked tasks on 3.15.1
The blocked tasks issue that got significantly worse in 3.15 -- did anything go into 3.16 related to this? I didn't see a single "btrfs" in Linus' 3.16 announcement, so I don't know whether it should be better, the same, or worse in this respect...

I haven't seen a definite statement about this on this list, either. Can someone more familiar with the state of development comment on this?

Charles
--
---
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
---
Re: Blocked tasks on 3.15.1
On Mon, Aug 11, 2014 at 08:55:21PM -0600, Charles Cazabon wrote:
> The blocked tasks issue that got significantly worse in 3.15 -- did anything go into 3.16 related to this? I didn't see a single "btrfs" in Linus' 3.16 announcement, so I don't know whether it should be better, the same, or worse in this respect...
>
> I haven't seen a definite statement about this on this list, either. Can someone more familiar with the state of development comment on this?

Good news is that we've figured out the bug and the patch is already under testing :-)

thanks,
-liubo
Re: Blocked tasks on 3.15.1
Liu Bo posted on Tue, 12 Aug 2014 10:56:42 +0800 as excerpted:

> On Mon, Aug 11, 2014 at 08:55:21PM -0600, Charles Cazabon wrote:
>> The blocked tasks issue that got significantly worse in 3.15 -- did anything go into 3.16 related to this? I didn't see a single "btrfs" in Linus' 3.16 announcement, so I don't know whether it should be better, the same, or worse in this respect...
>>
>> I haven't seen a definite statement about this on this list, either. Can someone more familiar with the state of development comment on this?
>
> Good news is that we've figured out the bug and the patch is already under testing :-)

IOW, it's not in 3.16.0, but will hopefully make it into 3.16.2 (it'll likely be too late for 3.16.1).

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: Blocked tasks on 3.15.1
On Mon, Aug 11, 2014 at 08:55:21PM -0600, Charles Cazabon wrote:
> The blocked tasks issue that got significantly worse in 3.15 -- did anything go into 3.16 related to this? I didn't see a single "btrfs" in Linus' 3.16 announcement, so I don't know whether it should be better, the same, or worse in this respect...
>
> I haven't seen a definite statement about this on this list, either.

Yes, 3.15 is unusable for some workloads, mine included. Go back to 3.14 until there is a patch in 3.16, which there isn't quite as of right now, but hopefully very soon.

Note 3.16.0 is actually worse than 3.15 for me.

Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Blocked tasks on 3.15.1
Hi

Is there anything new on this topic? I am using Ubuntu 14.04.1 and experiencing the same problem.

- 6 HDDs
- LUKS on every HDD
- btrfs RAID6 over these 6 crypt-devices

No LVM, no nodatacow files.
Mount-options: defaults,compress-force=lzo,space_cache

With the original 3.13-kernel (3.13.0-32-generic) it is working fine.

Then I tried the following kernels from here: http://kernel.ubuntu.com/~kernel-ppa/mainline/

- linux-image-3.14.15-031415-generic_3.14.15-031415.201407311853_amd64.deb: not even booting, kernel panic at boot.
- linux-image-3.15.6-031506-generic_3.15.6-031506.201407172034_amd64.deb, linux-image-3.15.7-031507-generic_3.15.7-031507.201407281235_amd64.deb, and linux-image-3.16.0-031600-generic_3.16.0-031600.201408031935_amd64.deb: causing the hangs described in this thread.

When doing big IO (unpacking a multi-GB .rar archive) the filesystem stops working. Load stays very high, but nothing actually happens on the drives according to dstat. htop shows state D (uninterruptible sleep, usually IO) for many kworker threads.

Unmounting the btrfs filesystem only works with the -l (lazy) option. Reboot or shutdown doesn't work because of the blocked threads, so only a power cut works. After the reboot, the last data written before the hang is lost.

I am now back on 3.13.

Regards

2014-07-25 4:27 GMT+02:00 Cody P Schafer d...@codyps.com:
> On Tue, Jul 22, 2014 at 9:53 AM, Chris Mason c...@fb.com wrote:
>> On 07/19/2014 02:23 PM, Martin Steigerwald wrote:
>>>> Running 3.15.6 with this patch applied on top:
>>>> - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/`
>>>> - no extra error messages printed (`dmesg | grep racing`) compared to without the patch
>>> I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far.
>>>> To recap some details (so I can have it all in one place):
>>>> - /home/ is btrfs with compress=lzo
>>> BTRFS RAID 1 with lzo.
>>>> - I have _not_ created any nodatacow files.
>>> Me neither.
>>>> - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt)
>>> Same, except no dmcrypt.
>> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.
> No, both of my btrfs filesystems are single disk.
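The D-state kworker threads described in the report above can also be listed without htop. What follows is only a minimal sketch, assuming a Linux `/proc` filesystem; the helper names (`parse_stat_line`, `blocked_tasks`, `read_all_stat_lines`) are hypothetical and not part of any tool mentioned in this thread.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(stat_lines):
    """Keep only tasks in state D (uninterruptible sleep)."""
    parsed = (parse_stat_line(line) for line in stat_lines)
    return [(pid, comm) for pid, comm, state in parsed if state == "D"]


def read_all_stat_lines(proc="/proc"):
    """Read every /proc/<pid>/stat that is still readable."""
    lines = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                lines.append(fh.read())
        except OSError:
            continue  # the task exited while we were scanning
    return lines


if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, comm in blocked_tasks(read_all_stat_lines()):
        print(pid, comm)
```

A task that shows up here only briefly is normal; the hangs reported in this thread are tasks that stay in state D for minutes.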
Re: Blocked tasks on 3.15.1
Tobias Holst posted on Thu, 07 Aug 2014 17:12:17 +0200 as excerpted:

> Is there anything new on this topic? I am using Ubuntu 14.04.1 and experiencing the same problem.
> - 6 HDDs
> - LUKS on every HDD
> - btrfs RAID6 over this 6 crypt-devices
> No LVM, no nodatacow files.
> Mount-options: defaults,compress-force=lzo,space_cache
> With the original 3.13-kernel (3.13.0-32-generic) it is working fine.

I see you're using compress-force. See the recent replies to the "Btrfs: fix compressed write corruption on enospc" thread.

I'm not /sure/ your case is directly related (tho the kworker code is pretty new and 3.13 may be working for you due to being before the migration to kworkers, supporting the case of it being either the same problem or another related to it), but that's certainly one problem they've recently traced down... to a bug in the kworker threads code, which starts a new worker that can race with the first instead of obeying a flag that says to keep the work on the first worker.

Looks like they're doing a patch that takes a slower but safer path to work around the kworker bug for now, as that bug was just traced and fixing it properly is likely to take some time. (There was another bug that originally hid the ultimate problem; a patch for it is available, but obviously that's only half the fix, as it simply revealed another bug underneath.)

Now that it's basically traced, the workaround patch should be published on-list shortly and should make it into 3.17 and back into the stables, altho I'm not sure it'll make it into 3.16.1, etc. But there's certainly progress. =:^)

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: Blocked tasks on 3.15.1
On 23 Jul 2014, at 03:06, Rich Freeman r-bt...@thefreemanclan.net wrote:
> I disabled lzo and haven't had problems since. I'm now running on mainline without issue, but I think I did see the hang on mainline when I tried enabling lzo again briefly.

Can confirm. I’m running mainline 3.16rc5 and was experiencing deadlocks when having LZO enabled. Disabled it, now all seems ok. Using btrfs RAID1 - dm-crypt - SATA.

I’ve attached some more dmesg “blocked” messages using kernel versions 3.15.5, 3.14.6 and 3.16rc5 just in case it helps anyone.

Jul 18 23:36:58 nas kernel: INFO: task sudo:1214 blocked for more than 120 seconds.
Jul 18 23:36:58 nas kernel: Tainted: G O 3.15.5-2-ARCH #1
Jul 18 23:36:58 nas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 18 23:36:58 nas kernel: sudo D 0 1214 1 0x0004
Jul 18 23:36:58 nas kernel: 88001d0ebc20 0086 88002cca5bb0 00014700
Jul 18 23:36:58 nas kernel: 88001d0ebfd8 00014700 88002cca5bb0
Jul 18 23:36:58 nas kernel: 880028ee4000 0003 284e0d53 0002
Jul 18 23:36:58 nas kernel: Call Trace:
Jul 18 23:36:58 nas kernel: [815110dc] ? __do_page_fault+0x2ec/0x600
Jul 18 23:36:58 nas kernel: [81509fa9] schedule+0x29/0x70
Jul 18 23:36:58 nas kernel: [8150a426] schedule_preempt_disabled+0x16/0x20
Jul 18 23:36:58 nas kernel: [8150bda5] __mutex_lock_slowpath+0xe5/0x230
Jul 18 23:36:58 nas kernel: [8150bf07] mutex_lock+0x17/0x30
Jul 18 23:36:58 nas kernel: [811bfa24] lookup_slow+0x34/0xc0
Jul 18 23:36:58 nas kernel: [811c1b73] path_lookupat+0x723/0x880
Jul 18 23:36:58 nas kernel: [8114f111] ? release_pages+0xc1/0x280
Jul 18 23:36:58 nas kernel: [811bfd97] ? getname_flags+0x37/0x130
Jul 18 23:36:58 nas kernel: [811c1cf6] filename_lookup.isra.30+0x26/0x80
Jul 18 23:36:58 nas kernel: [811c4fd7] user_path_at_empty+0x67/0xd0
Jul 18 23:36:58 nas kernel: [81172b52] ? unmap_region+0xe2/0x130
Jul 18 23:36:58 nas kernel: [811c5051] user_path_at+0x11/0x20
Jul 18 23:36:58 nas kernel: [811b979a] vfs_fstatat+0x6a/0xd0
Jul 18 23:36:58 nas kernel: [811b981b] vfs_stat+0x1b/0x20
Jul 18 23:36:58 nas kernel: [811b9df9] SyS_newstat+0x29/0x60
Jul 18 23:36:58 nas kernel: [8117501c] ? vm_munmap+0x4c/0x60
Jul 18 23:36:58 nas kernel: [81175f92] ? SyS_munmap+0x22/0x30
Jul 18 23:36:58 nas kernel: [81515fa9] system_call_fastpath+0x16/0x1b
---
Jul 19 18:34:17 nas kernel: INFO: task rsync:4900 blocked for more than 120 seconds.
Jul 19 18:34:17 nas kernel: Tainted: G O 3.15.5-2-ARCH #1
Jul 19 18:34:17 nas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 18:34:17 nas kernel: rsync D 0 4900 4899 0x
Jul 19 18:34:17 nas kernel: 880005947c20 0082 880034aa4750 00014700
Jul 19 18:34:17 nas kernel: 880005947fd8 00014700 880034aa4750 810a5995
Jul 19 18:34:17 nas kernel: 88011fc14700 8800dd828a30 8800cece6a00 880005947bd8
Jul 19 18:34:17 nas kernel: Call Trace:
Jul 19 18:34:17 nas kernel: [810a5995] ? set_next_entity+0x95/0xb0
Jul 19 18:34:17 nas kernel: [810ac0be] ? pick_next_task_fair+0x46e/0x550
Jul 19 18:34:17 nas kernel: [810136c1] ? __switch_to+0x1f1/0x540
Jul 19 18:34:17 nas kernel: [81509fa9] schedule+0x29/0x70
Jul 19 18:34:17 nas kernel: [8150a426] schedule_preempt_disabled+0x16/0x20
Jul 19 18:34:17 nas kernel: [8150bda5] __mutex_lock_slowpath+0xe5/0x230
Jul 19 18:34:17 nas kernel: [8150bf07] mutex_lock+0x17/0x30
Jul 19 18:34:17 nas kernel: [811bfa24] lookup_slow+0x34/0xc0
Jul 19 18:34:17 nas kernel: [811c1b73] path_lookupat+0x723/0x880
Jul 19 18:34:17 nas kernel: [8150a2bf] ? io_schedule+0xbf/0xf0
Jul 19 18:34:17 nas kernel: [8150a7d1] ? __wait_on_bit_lock+0x91/0xb0
Jul 19 18:34:17 nas kernel: [811bfd97] ? getname_flags+0x37/0x130
Jul 19 18:34:17 nas kernel: [811c1cf6] filename_lookup.isra.30+0x26/0x80
Jul 19 18:34:17 nas kernel: [811c4fd7] user_path_at_empty+0x67/0xd0
Jul 19 18:34:17 nas kernel: [811c5051] user_path_at+0x11/0x20
Jul 19 18:34:17 nas kernel: [811b979a] vfs_fstatat+0x6a/0xd0
Jul 19 18:34:17 nas kernel: [811d4414] ? mntput+0x24/0x40
Jul 19 18:34:17 nas kernel: [811b983e] vfs_lstat+0x1e/0x20
Jul 19 18:34:17 nas kernel: [811b9e59] SyS_newlstat+0x29/0x60
Jul 19 18:34:17 nas kernel: [8108a3c4] ? task_work_run+0xa4/0xe0
Jul 19 18:34:17 nas kernel: [8150e939] ? do_device_not_available+0x19/0x20
Jul 19 18:34:17 nas kernel: [8151760e] ? device_not_available+0x1e/0x30
Jul 19
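For comparing reports like the one above, the "blocked for more than N seconds" headers can be pulled out of a saved log. This is a minimal sketch, assuming the log is available as plain text; the regular expression covers only the header format seen in this thread, and the function name `hung_tasks` is hypothetical.

```python
import re

# Matches e.g. "INFO: task rsync:4900 blocked for more than 120 seconds."
# The task name may itself contain ':' (kworker/u16:3), so the name is
# matched greedily and the last ':<digits>' before " blocked" is the PID.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(log_text):
    """Return a list of (task name, pid, timeout seconds) tuples."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "Jul 18 23:36:58 nas kernel: INFO: task sudo:1214 blocked "
        "for more than 120 seconds.\n"
        "Jul 19 18:34:17 nas kernel: INFO: task rsync:4900 blocked "
        "for more than 120 seconds.\n"
    )
    for comm, pid, secs in hung_tasks(sample):
        print(comm, pid, secs)
```

Feeding it the output of `dmesg` or a syslog file gives a quick list of which tasks were reported, which helps when checking whether two reports show the same pattern.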
Re: Blocked tasks on 3.15.1
On Tuesday, 22 July 2014, 17:15:21, Chris Mason wrote:
> On 07/22/2014 05:13 PM, Martin Steigerwald wrote:
>> On Tuesday, 22 July 2014, 10:53:03, Chris Mason wrote:
>>> On 07/19/2014 02:23 PM, Martin Steigerwald wrote:
>>>>> Running 3.15.6 with this patch applied on top:
>>>>> - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/`
>>>>> - no extra error messages printed (`dmesg | grep racing`) compared to without the patch
>>>> I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far.
>>>>> To recap some details (so I can have it all in one place):
>>>>> - /home/ is btrfs with compress=lzo
>>>> BTRFS RAID 1 with lzo.
>>>>> - I have _not_ created any nodatacow files.
>>>> Me neither.
>>>>> - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt)
>>>> Same, except no dmcrypt.
>>> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.
>> As I told before I am using BTRFS RAID 1. Two logical volumes on two distinct SSDs. RAID is directly in BTRFS, no SoftRAID here (which I wouldn't want to use with SSDs anyway).
> When you say logical volumes, you mean LVM right? Just making sure I know all the pieces involved.

Exactly. As a recap from the other thread:

merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 123.20GiB
        devid    1 size 160.00GiB used 159.98GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 159.98GiB path /dev/dm-3

Btrfs v3.14.1

merkaba:~#1 btrfs fi df /home
Data, RAID1: total=154.95GiB, used=120.61GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=2.59GiB
unknown, single: total=512.00MiB, used=0.00

merkaba:~ df -hT /home
Filesystem             Type   Size  Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs  320G  247G   69G  79% /home

merkaba:~ file -sk /dev/sata/home
/dev/sata/home: symbolic link to `../dm-3'
merkaba:~ file -sk /dev/dm-3
/dev/dm-3: BTRFS Filesystem label home, sectorsize 4096, nodesize 16384, leafsize 16384, UUID=[…], 132303151104/343597383680 bytes used, 2 devices

And LVM layout:

merkaba:~ lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                        8:0    0 279,5G  0 disk
├─sda1                     8:1    0     4M  0 part
├─sda2                     8:2    0   191M  0 part
├─sda3                     8:3    0   286M  0 part
└─sda4                     8:4    0   279G  0 part
  ├─sata-home (dm-3)     254:3    0   160G  0 lvm
  ├─sata-swap (dm-4)     254:4    0    12G  0 lvm  [SWAP]
  └─sata-debian (dm-5)   254:5    0    30G  0 lvm
sdb                        8:16   0 447,1G  0 disk
├─sdb1                     8:17   0   200M  0 part
├─sdb2                     8:18   0   300M  0 part /boot
└─sdb3                     8:19   0 446,7G  0 part
  ├─msata-home (dm-0)    254:0    0   160G  0 lvm
  ├─msata-daten (dm-1)   254:1    0   200G  0 lvm
  └─msata-debian (dm-2)  254:2    0    30G  0 lvm
sr0                       11:0    1  1024M  0 rom

sda is Intel SSD 320 SATA
sdb is Crucial m500 mSATA

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
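The `btrfs fi df` figures in the message above are easier to compare as a used/allocated ratio per chunk type. The following is a minimal sketch, assuming the `Kind, PROFILE: total=..., used=...` line format shown in that output; the function names are hypothetical, and a value printed without a unit (as in `used=0.00`) is treated as bytes, which is an assumption.

```python
import re

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

LINE_RE = re.compile(
    r"^(?P<kind>[^,]+), (?P<profile>[^:]+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]iB|B)?, "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]iB|B)?$"
)


def to_bytes(number, unit):
    # Assumption: a missing unit means plain bytes.
    return float(number) * UNITS[unit or "B"]


def parse_fi_df(text):
    """Parse `btrfs fi df` output into one dict per chunk type."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # prompt lines, blank lines, anything unrecognised
        total = to_bytes(m.group("total"), m.group("tunit"))
        used = to_bytes(m.group("used"), m.group("uunit"))
        rows.append({
            "kind": m.group("kind"),
            "profile": m.group("profile"),
            "total": total,
            "used": used,
            "ratio": used / total if total else 0.0,
        })
    return rows


SAMPLE = """\
Data, RAID1: total=154.95GiB, used=120.61GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=2.59GiB
unknown, single: total=512.00MiB, used=0.00
"""

if __name__ == "__main__":
    for row in parse_fi_df(SAMPLE):
        print(f"{row['kind']:<10} {row['profile']:<7} {row['ratio']:.0%} used")
```

This only restates the numbers already posted; it does not by itself say anything about the cause of the hang.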
Re: Blocked tasks on 3.15.1
Chris Mason c...@fb.com wrote:
> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.

Not me, at least - I'm doing the device aggregation down at the LVM level (sata-dmcrypt-lvm-btrfs stack), so it's presented to btrfs as a single logical device.

Charles
--
---
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
---
Re: Blocked tasks on 3.15.1
On 07/19/2014 02:23 PM, Martin Steigerwald wrote:
>> Running 3.15.6 with this patch applied on top:
>> - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/`
>> - no extra error messages printed (`dmesg | grep racing`) compared to without the patch
> I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far.
>> To recap some details (so I can have it all in one place):
>> - /home/ is btrfs with compress=lzo
> BTRFS RAID 1 with lzo.
>> - I have _not_ created any nodatacow files.
> Me neither.
>> - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt)
> Same, except no dmcrypt.

Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.

-chris
Re: Blocked tasks on 3.15.1
On 07/22/2014 04:53 PM, Chris Mason wrote:
> On 07/19/2014 02:23 PM, Martin Steigerwald wrote:
>>> Running 3.15.6 with this patch applied on top:
>>> - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/`
>>> - no extra error messages printed (`dmesg | grep racing`) compared to without the patch
>> I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far.
>>> To recap some details (so I can have it all in one place):
>>> - /home/ is btrfs with compress=lzo
>> BTRFS RAID 1 with lzo.
>>> - I have _not_ created any nodatacow files.
>> Me neither.
>>> - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt)
>> Same, except no dmcrypt.
> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.
>
> -chris

Hi,

In case it's interesting, from an earlier email thread with subject "3.15-rc6 - btrfs-transacti:4157 blocked for more than 120":

TLDR: yes, btrfs sees multiple devices.

sata - dmcrypt - btrfs raid10

The btrfs raid10 consists of multiple dmcrypt devices from multiple sata devices.

Mount:
/dev/mapper/sdu on /mnt/storage type btrfs (rw,noatime,space_cache,compress=lzo,inode_cache,subvol=storage)
(yes, I know inode_cache is not recommended for general use)

I have a nocow directory in a separate subvolume containing vm-images used by kvm. The same kvm-vms are reading/writing data from that array over nfs.

I'm still holding that system on 3.14. Anything above causes blocks.

--
Torbjørn
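Several reports in this thread hinge on whether `compress=lzo` or `compress-force=lzo` is in the mount options. Below is a minimal sketch for checking that from a `mount`-style line like the one quoted above, assuming the `<device> on <mountpoint> type <fstype> (<options>)` layout; the function names are hypothetical.

```python
def parse_mount_line(line):
    """Split '<dev> on <mnt> type <fs> (<opt>,<opt>=<val>,...)' into parts."""
    device, rest = line.split(" on ", 1)
    mountpoint, rest = rest.split(" type ", 1)
    fstype, opts = rest.split(" (", 1)
    opts = opts.rstrip().rstrip(")")
    options = {}
    for opt in opts.split(","):
        key, sep, value = opt.partition("=")
        # Flags without a value (rw, noatime, ...) are stored as True.
        options[key] = value if sep else True
    return {
        "device": device,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": options,
    }


def compression(options):
    """Return (option name, algorithm) if a compress option is set, else None."""
    for key in ("compress-force", "compress"):
        if key in options:
            return key, options[key]
    return None


if __name__ == "__main__":
    line = ("/dev/mapper/sdu on /mnt/storage type btrfs "
            "(rw,noatime,space_cache,compress=lzo,inode_cache,subvol=storage)")
    info = parse_mount_line(line)
    print(info["mountpoint"], compression(info["options"]))
```

The same check applies to the `defaults,compress-force=lzo,space_cache` option string reported elsewhere in the thread.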
Re: Blocked tasks on 3.15.1
On Tue, Jul 22, 2014 at 10:53:03AM -0400, Chris Mason wrote:
> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.

In the bugs I sent you, it was a mix of arrays that were
mdraid / dmcrypt / btrfs

I have also one array with:
disk1 / dmcrypt \
                 - btrfs (2 drives visible by btrfs)
disk2 / dmcrypt /

The multidrive setup seemed a bit worse, I just destroyed it and went back to putting all the drives together with mdadm and showing a single dmcrypted device to btrfs.

But that is still super unstable on my server with 3.15, while being somewhat usable with my laptop (it still hangs, but more rarely).
The one difference is that my laptop actually does
disk / dmcrypt / btrfs
while my server does
disks / mdadm / dmcrypt / btrfs

Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Blocked tasks on 3.15.1
On 07/22/2014 04:53 PM, Chris Mason wrote:
> On 07/19/2014 02:23 PM, Martin Steigerwald wrote:
>>> Running 3.15.6 with this patch applied on top:
>>> - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/`
>>> - no extra error messages printed (`dmesg | grep racing`) compared to without the patch
>> I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far.
>>> To recap some details (so I can have it all in one place):
>>> - /home/ is btrfs with compress=lzo
>> BTRFS RAID 1 with lzo.
>>> - I have _not_ created any nodatacow files.
>> Me neither.
>>> - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt)
>> Same, except no dmcrypt.
> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.
>
> -chris

3.16-rc6 with your patch on top still causes hangs here.
No traces of "racing" in dmesg.
Hang is on a btrfs raid 0 consisting of 3 drives.
Full stack is: sata - dmcrypt - btrfs raid0

Hang was caused by
1. Several `rsync -av --inplace --delete <source> <backup subvol>`
2. `btrfs subvolume snapshot -r <backup subvol> <backup snap>`

The rsync jobs are done one at a time.
btrfs is stuck when trying to create the read only snapshot.

--
Torbjørn

All output via netconsole.
sysrq-w: https://gist.github.com/anonymous/d1837187e261f9a4cbd2#file-gistfile1-txt
sysrq-t: https://gist.github.com/anonymous/2bdb73f035ab9918c63d#file-gistfile1-txt

dmesg:
[ 9352.784136] INFO: task btrfs-transacti:3874 blocked for more than 120 seconds.
[ 9352.784222] Tainted: GE 3.16.0-rc6+ #64
[ 9352.784270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9352.784354] btrfs-transacti D 88042fc943c0 0 3874 2 0x
[ 9352.784413] 8803fb9dfca0 0002 8800c4214b90 8803fb9dffd8
[ 9352.784502] 000143c0 000143c0 88041977b260 8803d29f23a0
[ 9352.784592] 8803d29f23a8 7fff 8800c4214b90 880232e2c0a8
[ 9352.784682] Call Trace:
[ 9352.784726] [8170eb59] schedule+0x29/0x70
[ 9352.784774] [8170df99] schedule_timeout+0x209/0x280
[ 9352.784827] [8170874b] ? __slab_free+0xfe/0x2c3
[ 9352.784879] [810829f4] ? wake_up_worker+0x24/0x30
[ 9352.784929] [8170f656] wait_for_completion+0xa6/0x160
[ 9352.784981] [8109d4e0] ? wake_up_state+0x20/0x20
[ 9352.785049] [c045b936] btrfs_wait_and_free_delalloc_work+0x16/0x30 [btrfs]
[ 9352.785141] [c04658be] btrfs_run_ordered_operations+0x1ee/0x2c0 [btrfs]
[ 9352.785260] [c044bbb7] btrfs_commit_transaction+0x27/0xa40 [btrfs]
[ 9352.785324] [c0447d65] transaction_kthread+0x1b5/0x240 [btrfs]
[ 9352.785385] [c0447bb0] ? btrfs_cleanup_transaction+0x560/0x560 [btrfs]
[ 9352.785469] [8108cc52] kthread+0xd2/0xf0
[ 9352.785517] [8108cb80] ? kthread_create_on_node+0x180/0x180
[ 9352.785571] [81712dfc] ret_from_fork+0x7c/0xb0
[ 9352.785620] [8108cb80] ? kthread_create_on_node+0x180/0x180
[ 9352.785678] INFO: task kworker/u16:3:6932 blocked for more than 120 seconds.
[ 9352.785732] Tainted: GE 3.16.0-rc6+ #64
[ 9352.785780] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9352.785863] kworker/u16:3 D 88042fd943c0 0 6932 2 0x
[ 9352.785930] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 9352.785983] 88035f1bbb58 0002 880417e564c0 88035f1bbfd8
[ 9352.786072] 000143c0 000143c0 8800c1a03260 88042fd94cd8
[ 9352.786160] 88042ffb4be8 88035f1bbbe0 0002 81159930
[ 9352.786250] Call Trace:
[ 9352.786292] [81159930] ? wait_on_page_read+0x60/0x60
[ 9352.786343] [8170ee6d] io_schedule+0x9d/0x130
[ 9352.786393] [8115993e] sleep_on_page+0xe/0x20
[ 9352.786443] [8170f3e8] __wait_on_bit_lock+0x48/0xb0
[ 9352.786495] [81159a4a] __lock_page+0x6a/0x70
[ 9352.786544] [810b14a0] ? autoremove_wake_function+0x40/0x40
[ 9352.786607] [c046711e] ? flush_write_bio+0xe/0x10 [btrfs]
[ 9352.786669] [c046b0c0] extent_write_cache_pages.isra.28.constprop.46+0x3d0/0x3f0 [btrfs]
[ 9352.786766] [c046cd2d] extent_writepages+0x4d/0x70 [btrfs]
[ 9352.786828] [c04506f0] ? btrfs_submit_direct+0x6a0/0x6a0 [btrfs]
[ 9352.786883] [810b0d78] ? __wake_up_common+0x58/0x90
[ 9352.786943] [c044e1d8] btrfs_writepages+0x28/0x30 [btrfs]
[ 9352.786997] [811668ee] do_writepages+0x1e/0x40
[ 9352.787045] [8115b409] __filemap_fdatawrite_range+0x59/0x60
[ 9352.787097] [8115b4bc] filemap_flush+0x1c/0x20 [
Re: Blocked tasks on 3.15.1
On 07/22/2014 03:42 PM, Torbjørn wrote:
> On 07/22/2014 04:53 PM, Chris Mason wrote:
>> On 07/19/2014 02:23 PM, Martin Steigerwald wrote:
>>>> Running 3.15.6 with this patch applied on top:
>>>> - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/`
>>>> - no extra error messages printed (`dmesg | grep racing`) compared to without the patch
>>> I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far.
>>>> To recap some details (so I can have it all in one place):
>>>> - /home/ is btrfs with compress=lzo
>>> BTRFS RAID 1 with lzo.
>>>> - I have _not_ created any nodatacow files.
>>> Me neither.
>>>> - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt)
>>> Same, except no dmcrypt.
>> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.
>>
>> -chris
> 3.16-rc6 with your patch on top still causes hangs here.
> No traces of "racing" in dmesg.
> Hang is on a btrfs raid 0 consisting of 3 drives.
> Full stack is: sata - dmcrypt - btrfs raid0
>
> Hang was caused by
> 1. Several `rsync -av --inplace --delete <source> <backup subvol>`
> 2. `btrfs subvolume snapshot -r <backup subvol> <backup snap>`
>
> The rsync jobs are done one at a time.
> btrfs is stuck when trying to create the read only snapshot.

The trace is similar, but you're stuck trying to read the free space cache. This one I saw earlier this morning, but I haven't seen these parts from the 3.15 bug reports.

Maybe they are related though, I'll dig into the 3.15 bug reports again.

-chris
Re: Blocked tasks on 3.15.1
On 07/22/2014 09:50 PM, Chris Mason wrote:
> On 07/22/2014 03:42 PM, Torbjørn wrote:
>> On 07/22/2014 04:53 PM, Chris Mason wrote:
>>> On 07/19/2014 02:23 PM, Martin Steigerwald wrote:
>>>>> Running 3.15.6 with this patch applied on top:
>>>>> - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/`
>>>>> - no extra error messages printed (`dmesg | grep racing`) compared to without the patch
>>>> I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far.
>>>>> To recap some details (so I can have it all in one place):
>>>>> - /home/ is btrfs with compress=lzo
>>>> BTRFS RAID 1 with lzo.
>>>>> - I have _not_ created any nodatacow files.
>>>> Me neither.
>>>>> - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt)
>>>> Same, except no dmcrypt.
>>> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.
>>>
>>> -chris
>> 3.16-rc6 with your patch on top still causes hangs here.
>> No traces of "racing" in dmesg.
>> Hang is on a btrfs raid 0 consisting of 3 drives.
>> Full stack is: sata - dmcrypt - btrfs raid0
>>
>> Hang was caused by
>> 1. Several `rsync -av --inplace --delete <source> <backup subvol>`
>> 2. `btrfs subvolume snapshot -r <backup subvol> <backup snap>`
>>
>> The rsync jobs are done one at a time.
>> btrfs is stuck when trying to create the read only snapshot.
> The trace is similar, but you're stuck trying to read the free space cache. This one I saw earlier this morning, but I haven't seen these parts from the 3.15 bug reports.
>
> Maybe they are related though, I'll dig into the 3.15 bug reports again.
>
> -chris

In case it was not clear, this hang was on a different btrfs volume than the 3.15 hang (but the same server).
Earlier the affected volume was readable during the hang. This time the volume is not readable either.

I'll keep the patched 3.16 running and see if I can trigger something similar to the 3.15 hang.

Thanks
--
Torbjørn
Re: Blocked tasks on 3.15.1
On Tuesday, 22 July 2014, 10:53:03, Chris Mason wrote:
> On 07/19/2014 02:23 PM, Martin Steigerwald wrote:
>>> Running 3.15.6 with this patch applied on top:
>>> - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/`
>>> - no extra error messages printed (`dmesg | grep racing`) compared to without the patch
>> I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far.
>>> To recap some details (so I can have it all in one place):
>>> - /home/ is btrfs with compress=lzo
>> BTRFS RAID 1 with lzo.
>>> - I have _not_ created any nodatacow files.
>> Me neither.
>>> - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt)
>> Same, except no dmcrypt.
> Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive.

As I told before I am using BTRFS RAID 1. Two logical volumes on two distinct SSDs. RAID is directly in BTRFS, no SoftRAID here (which I wouldn't want to use with SSDs anyway).

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: Blocked tasks on 3.15.1
On 07/22/2014 05:13 PM, Martin Steigerwald wrote: On Tuesday, 22 July 2014, 10:53:03, Chris Mason wrote: On 07/19/2014 02:23 PM, Martin Steigerwald wrote: Running 3.15.6 with this patch applied on top: - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/` - no extra error messages printed (`dmesg | grep racing`) compared to without the patch I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far. To recap some details (so I can have it all in one place): - /home/ is btrfs with compress=lzo BTRFS RAID 1 with lzo. - I have _not_ created any nodatacow files. Me neither. - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt) Same, except no dmcrypt. Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive. As I said before, I am using BTRFS RAID 1. Two logical volumes on two distinct SSDs. RAID is directly in BTRFS, no SoftRAID here (which I wouldn't want to use with SSDs anyway).

When you say logical volumes, you mean LVM right? Just making sure I know all the pieces involved.

-chris
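For anyone wanting to answer the multi-device question quickly, here is a minimal sketch. It assumes the `Total devices N` line that `btrfs filesystem show` prints (the same line that appears in the `btrfs fi show` output quoted elsewhere in this thread). The function name `btrfs_device_count` is made up for illustration and is not part of btrfs-progs.

```shell
#!/bin/sh
# Hypothetical helper: read `btrfs filesystem show` output on stdin and print
# the device count that btrfs reports for each filesystem listed.
btrfs_device_count() {
    awk '/Total devices/ { print $3 }'
}
```

Something like `btrfs filesystem show /home | btrfs_device_count` should then print a number; anything above 1 means the filesystem is multi-device from the btrfs point of view, regardless of how many physical drives sit underneath.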
Re: Blocked tasks on 3.15.1
On Tue, Jul 22, 2014 at 10:53 AM, Chris Mason c...@fb.com wrote: Thanks for the help in tracking this down everyone. We'll get there! Are you all running multi-disk systems (from a btrfs POV, more than one device?) I don't care how many physical drives this maps to, just does btrfs think there's more than one drive. I've been away on vacation so I haven't been able to try your latest patch, but I can try whatever is out there starting this weekend. I was getting fairly consistent hangs during heavy IO (especially rsync) on 3.15 with lzo enabled. This is on raid1 across 5 drives, directly against the partitions themselves (no dmcrypt, mdadm, lvm, etc). I disabled lzo and haven't had problems since. I'm now running on mainline without issue, but I think I did see the hang on mainline when I tried enabling lzo again briefly. Rich
Re: Blocked tasks on 3.15.1
[ deadlocks during rsync in 3.15 with compression enabled ] Hi everyone, I still haven't been able to reproduce this one here, but I'm going through a series of tests with lzo compression foraced and every operation forced to ordered. Hopefully it'll kick it out soon. While I'm hammering away, could you please try this patch. If this is the buy you're hitting, the deadlock will go away and you'll see this printk in the log. thanks! -chris diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3668048..8ab56df 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8157,6 +8157,13 @@ void btrfs_destroy_inode(struct inode *inode) spin_unlock(root-fs_info-ordered_root_lock); } + spin_lock(root-fs_info-ordered_root_lock); + if (!list_empty(BTRFS_I(inode)-ordered_operations)) { + list_del_init(BTRFS_I(inode)-ordered_operations); +printk(KERN_CRIT racing inode deletion with ordered operations!!!\n); + } + spin_unlock(root-fs_info-ordered_root_lock); + if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM, BTRFS_I(inode)-runtime_flags)) { btrfs_info(root-fs_info, inode %llu still on the orphan list, -- Hi Chris, just had that hang during rsync from /home (ZFS, mirrored) to /bak (Btrfs w. lzo compression) again with that patch applied, it doesn't seem to be related to that issue (or patch) - only applicable to my case, obviously - since search for that string (e.g. racing) doesn't show anything in that message: [16028.169347] INFO: task kworker/u16:2:11956 blocked for more than 180 seconds. [16028.169349] Tainted: P O 3.14.13_btrfs+_BFS_test27_integration #2 [16028.169350] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [16028.169351] kworker/u16:2 D 88081ec13540 0 11956 2 0x0008 [16028.169356] Workqueue: btrfs-delalloc normal_work_helper [16028.169358] 8806180ab8e0 0046 0004 [16028.169359] a000 8806210f16b0 8806180abfd8 81e11500 [16028.169360] 8806210f16b0 0206 8113e6cc 88081ec135c0 [16028.169362] Call Trace: [16028.169367] [8113e6cc] ? 
delayacct_end+0x7c/0x90 [16028.169370] [811689d0] ? wait_on_page_read+0x60/0x60 [16028.169374] [819cfc78] ? io_schedule+0x88/0xe0 [16028.169375] [811689d5] ? sleep_on_page+0x5/0x10 [16028.169377] [819cfffc] ? __wait_on_bit_lock+0x3c/0x90 [16028.169378] [81168ac5] ? __lock_page+0x65/0x70 [16028.169382] [810f5580] ? autoremove_wake_function+0x30/0x30 [16028.169384] [81169854] ? __find_lock_page+0x44/0x70 [16028.169385] [811698ca] ? find_or_create_page+0x2a/0xa0 [16028.169388] [8145a1cf] ? io_ctl_prepare_pages+0x4f/0x150 [16028.169390] [8145bd45] ? __load_free_space_cache+0x195/0x5d0 [16028.169392] [8145c26b] ? load_free_space_cache+0xeb/0x1b0 [16028.169395] [813fd6a1] ? cache_block_group+0x191/0x390 [16028.169396] [810f5550] ? prepare_to_wait_event+0xf0/0xf0 [16028.169398] [814085ea] ? find_free_extent+0x95a/0xdb0 [16028.169400] [81408bf9] ? btrfs_reserve_extent+0x69/0x150 [16028.169403] [81421116] ? cow_file_range+0x136/0x420 [16028.169404] [81422493] ? submit_compressed_extents+0x1f3/0x480 [16028.169406] [81422720] ? submit_compressed_extents+0x480/0x480 [16028.169407] [8144896b] ? normal_work_helper+0x1ab/0x330 [16028.169410] [810df26d] ? process_one_work+0x16d/0x490 [16028.169411] [810dff8b] ? worker_thread+0x12b/0x410 [16028.169412] [810dfe60] ? manage_workers.isra.28+0x2c0/0x2c0 [16028.169414] [810e579a] ? kthread+0xca/0xe0 [16028.169415] [810e56d0] ? kthread_create_on_node+0x180/0x180 [16028.169417] [819d3c7c] ? ret_from_fork+0x7c/0xb0 [16028.169418] [810e56d0] ? kthread_create_on_node+0x180/0x180 [16028.169422] INFO: task btrfs-transacti:12042 blocked for more than 180 seconds. [16028.169422] Tainted: P O 3.14.13_btrfs+_BFS_test27_integration #2 [16028.169423] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. 
[16028.169423] btrfs-transacti D 88081ec13540 0 12042 2 0x0008 [16028.169425] 88009c7adb20 0046 88040d84ca68 [16028.169426] a000 88061f284ba0 88009c7adfd8 81e11500 [16028.169427] 88061f284ba0 88061a21dea8 811b8c2d 8805fc919e00 [16028.169428] Call Trace: [16028.169431] [811b8c2d] ? kmem_cache_alloc_trace+0x14d/0x160 [16028.169433] [813fd632] ? cache_block_group+0x122/0x390 [16028.169434] [810f5550] ? prepare_to_wait_event+0xf0/0xf0 [16028.169436] [814085ea] ? find_free_extent+0x95a/0xdb0 [16028.169437] [81408bf9] ? btrfs_reserve_extent+0x69/0x150 [16028.169439] [81422fa8] ? __btrfs_prealloc_file_range+0xe8/0x380 [16028.169441] [8140b6f2] ? btrfs_write_dirty_block_groups+0x642/0x6d0 [16028.169442] [819cb00c] ?
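When comparing reports like the one above, it can help to boil a long kernel log down to just the hung-task lines. A minimal sketch, assuming the `INFO: task NAME:PID blocked for more than N seconds.` wording seen in the logs in this thread; the function name `blocked_tasks` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print one line per
# hung-task report, formatted as "<task name> <pid> <seconds>".
blocked_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}
```

For example, `dmesg | blocked_tasks | sort | uniq -c` gives a rough count of which tasks keep getting reported.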
Re: Blocked tasks on 3.15.1
On Thu, Jul 17, 2014 at 8:18 AM, Chris Mason c...@fb.com wrote: [ deadlocks during rsync in 3.15 with compression enabled ] Hi everyone, I still haven't been able to reproduce this one here, but I'm going through a series of tests with lzo compression foraced and every operation forced to ordered. Hopefully it'll kick it out soon. While I'm hammering away, could you please try this patch. If this is the buy you're hitting, the deadlock will go away and you'll see this printk in the log. thanks! -chris diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3668048..8ab56df 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8157,6 +8157,13 @@ void btrfs_destroy_inode(struct inode *inode) spin_unlock(root-fs_info-ordered_root_lock); } + spin_lock(root-fs_info-ordered_root_lock); + if (!list_empty(BTRFS_I(inode)-ordered_operations)) { + list_del_init(BTRFS_I(inode)-ordered_operations); +printk(KERN_CRIT racing inode deletion with ordered operations!!!\n); + } + spin_unlock(root-fs_info-ordered_root_lock); + if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM, BTRFS_I(inode)-runtime_flags)) { btrfs_info(root-fs_info, inode %llu still on the orphan list, Thanks Chris. Running 3.15.6 with this patch applied on top: - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/` - no extra error messages printed (`dmesg | grep racing`) compared to without the patch To recap some details (so I can have it all in one place): - /home/ is btrfs with compress=lzo - /mnt/home is btrfs with no compression enabled. - I have _not_ created any nodatacow files. - Both filesystems are on different physical disks. - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Blocked tasks on 3.15.1
Am Samstag, 19. Juli 2014, 12:38:53 schrieb Cody P Schafer: On Thu, Jul 17, 2014 at 8:18 AM, Chris Mason c...@fb.com wrote: [ deadlocks during rsync in 3.15 with compression enabled ] Hi everyone, I still haven't been able to reproduce this one here, but I'm going through a series of tests with lzo compression foraced and every operation forced to ordered. Hopefully it'll kick it out soon. While I'm hammering away, could you please try this patch. If this is the buy you're hitting, the deadlock will go away and you'll see this printk in the log. thanks! -chris diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3668048..8ab56df 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8157,6 +8157,13 @@ void btrfs_destroy_inode(struct inode *inode) spin_unlock(root-fs_info-ordered_root_lock); } + spin_lock(root-fs_info-ordered_root_lock); + if (!list_empty(BTRFS_I(inode)-ordered_operations)) { + list_del_init(BTRFS_I(inode)-ordered_operations); +printk(KERN_CRIT racing inode deletion with ordered operations!!!\n); + } + spin_unlock(root-fs_info-ordered_root_lock); + if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM, BTRFS_I(inode)-runtime_flags)) { btrfs_info(root-fs_info, inode %llu still on the orphan list, Thanks Chris. Running 3.15.6 with this patch applied on top: - still causes a hang with `rsync -hPaHAXx --del /mnt/home/nyx/ /home/nyx/` - no extra error messages printed (`dmesg | grep racing`) compared to without the patch I got same results with 3.16-rc5 + this patch (see thread BTRFS hang with 3.16-rc5). 3.16-rc4 still is fine with me. No hang whatsoever so far. To recap some details (so I can have it all in one place): - /home/ is btrfs with compress=lzo BTRFS RAID 1 with lzo. - I have _not_ created any nodatacow files. Me neither. - Full stack is: sata - dmcrypt - lvm - btrfs (I noticed others mentioning the use of dmcrypt) Same, except no dmcrypt. 
-- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: Blocked tasks on 3.15.1, raid1 btrfs is no ends of trouble for me
On Thu, Jul 17, 2014 at 09:18:07AM -0400, Chris Mason wrote: [ deadlocks during rsync in 3.15 with compression enabled ] Hi everyone, I still haven't been able to reproduce this one here, but I'm going through a series of tests with lzo compression foraced and every operation forced to ordered. Hopefully it'll kick it out soon. While I'm hammering away, could you please try this patch. If this is the buy you're hitting, the deadlock will go away and you'll see this printk in the log. thanks! -chris diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3668048..8ab56df 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8157,6 +8157,13 @@ void btrfs_destroy_inode(struct inode *inode) spin_unlock(root-fs_info-ordered_root_lock); } + spin_lock(root-fs_info-ordered_root_lock); + if (!list_empty(BTRFS_I(inode)-ordered_operations)) { + list_del_init(BTRFS_I(inode)-ordered_operations); +printk(KERN_CRIT racing inode deletion with ordered operations!!!\n); + } + spin_unlock(root-fs_info-ordered_root_lock); + if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM, BTRFS_I(inode)-runtime_flags)) { btrfs_info(root-fs_info, inode %llu still on the orphan list, I've gotten more blocked messages with your patch: See also the message I sent about memory leaks, and how enabling kmemleak gets btrfs to deadlock soon after boot relibably. 
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg35568.html (this was before your patch though) With your patch (and without kmemleak): gargamel:/etc/apache2/sites-enabled# ps -eo pid,etime,wchan:30,args |grep df 349501:48:38 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 410507:48:39 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 12639 48:38 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 12691 48:37 btrfs_statfs df 1475306:48:38 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 1721410:48:39 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 1752603:48:38 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 1871009:48:38 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 2366805:48:38 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 2667511:37:42 btrfs_statfs df . 2682802:48:38 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs 2751508:48:38 btrfs_statfs df -hP -x none -x tmpfs -x iso9660 -x udf -x nfs sysrq-w does not show me output for those and I cannot understand why. Howver, I have found that btrfs raid 1 on top of dmcrypt has given me no ends of trouble. I lost that filesystem twice due to corruption, and now it hangs my machine (strace finds that df is hanging on that partition). 
gargamel:~# btrfs fi df /mnt/btrfs_raid0 Data, RAID1: total=222.00GiB, used=221.61GiB Data, single: total=8.00MiB, used=0.00 System, RAID1: total=8.00MiB, used=48.00KiB System, single: total=4.00MiB, used=0.00 Metadata, RAID1: total=2.00GiB, used=1.10GiB Metadata, single: total=8.00MiB, used=0.00 unknown, single: total=384.00MiB, used=0.00 gargamel:~# btrfs fi show /mnt/btrfs_raid0 Label: 'btrfs_raid0' uuid: 74279e10-46e7-4ac4-8216-a291819a6691 Total devices 2 FS bytes used 222.71GiB devid1 size 836.13GiB used 224.03GiB path /dev/dm-3 devid2 size 836.13GiB used 224.01GiB path /dev/mapper/raid0d2 Btrfs v3.14.1 This is not encouraging, I think I'm going to stop using raid1 in btrfs :( I tried sysrq-t, but the output goes faster than my serial console can capture it, I can't get you a traceback on those df processes. the dmesg buffer is too small I already have Kernel log buffer size (16 = 64KB, 17 = 128KB) (LOG_BUF_SHIFT) [17] (NEW) 17 and the kernel config does not let me increase it to something more useful like 24. Btrfs in 3.15 has been no end of troubles for me on my 2 machines, and I can't even capture useful info when it happens since my long sysrq dumps get truncated and flow faster than syslog can capture and relay them it seems. Do you have any suggestions on how to capture that data better? In the meantime, kernel log when things started hanging is below. the zm processes are indeed accessing that raid1 partition. [67499.502755] INFO: task btrfs-transacti:2867 blocked for more than 120 seconds. [67499.526860] Not tainted 3.15.5-amd64-i915-preempt-20140714cm1 #1 [67499.548624] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [67499.575212] btrfs-transacti D 0001 0 2867 2 0x [67499.598611] 8802135e7e10 0046 880118322158 8802135e7fd8
Re: Blocked tasks on 3.15.1, raid1 btrfs is no ends of trouble for me
On Fri, Jul 18, 2014 at 05:33:45PM -0700, Marc MERLIN wrote: However, I have found that btrfs raid 1 on top of dmcrypt has given me no ends of trouble. I lost that filesystem twice due to corruption, and now it hangs my machine (strace finds that df is hanging on that partition).

gargamel:~# btrfs fi df /mnt/btrfs_raid0
Data, RAID1: total=222.00GiB, used=221.61GiB
Data, single: total=8.00MiB, used=0.00
System, RAID1: total=8.00MiB, used=48.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=2.00GiB, used=1.10GiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=384.00MiB, used=0.00
gargamel:~# btrfs fi show /mnt/btrfs_raid0
Label: 'btrfs_raid0' uuid: 74279e10-46e7-4ac4-8216-a291819a6691
Total devices 2 FS bytes used 222.71GiB
devid 1 size 836.13GiB used 224.03GiB path /dev/dm-3
devid 2 size 836.13GiB used 224.01GiB path /dev/mapper/raid0d2
Btrfs v3.14.1

This is not encouraging, I think I'm going to stop using raid1 in btrfs :(

Sorry, this may be a bit misleading. I actually lost 2 filesystems that were raid0 on top of dmcrypt. This time it's raid1, and the data isn't lost, but btrfs is tripping all over itself and taking my whole system apparently because of that filesystem.

Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Blocked tasks on 3.15.1, raid1 btrfs is no ends of trouble for me
TL;DR: 3.15.5 (or .1 when I tried it) just hang over and over again in multiple ways on my server. They also hang on my laptop reliably if I enable kmemleak, but otherwise my laptop mostly survives with 3.15.x without kmemleak (although it does deadlock eventually, but that could be after days/weeks, not hours). I reverted to 3.14 on that machine, and everything is good again. As a note, this is the 3rd time I try to upgrade this server to 3.15 and everything goes to crap. I then go back to 3.14 and things work again, not great since btrfs has never been great and stable for me, but it works well enough. On Fri, Jul 18, 2014 at 05:44:57PM -0700, Marc MERLIN wrote: On Fri, Jul 18, 2014 at 05:33:45PM -0700, Marc MERLIN wrote: Howver, I have found that btrfs raid 1 on top of dmcrypt has given me no ends of trouble. I lost that filesystem twice due to corruption, and now it hangs my machine (strace finds that df is hanging on that partition). gargamel:~# btrfs fi df /mnt/btrfs_raid0 Data, RAID1: total=222.00GiB, used=221.61GiB Data, single: total=8.00MiB, used=0.00 System, RAID1: total=8.00MiB, used=48.00KiB System, single: total=4.00MiB, used=0.00 Metadata, RAID1: total=2.00GiB, used=1.10GiB Metadata, single: total=8.00MiB, used=0.00 unknown, single: total=384.00MiB, used=0.00 gargamel:~# btrfs fi show /mnt/btrfs_raid0 Label: 'btrfs_raid0' uuid: 74279e10-46e7-4ac4-8216-a291819a6691 Total devices 2 FS bytes used 222.71GiB devid1 size 836.13GiB used 224.03GiB path /dev/dm-3 devid2 size 836.13GiB used 224.01GiB path /dev/mapper/raid0d2 Btrfs v3.14.1 This is not encouraging, I think I'm going to stop using raid1 in btrfs :( Sorry, this may be a bit misleading. I actually lost 2 filesystems that were raid0 on top of dmcrypt. This time it's raid1, and the data isn't lost, but btrfs is tripping all over itself and taking my whole system apparently because of that filesystem. 
And just to say that I'm wrong at pinning this down, the same 3.15.5 with your patch locked up on my root filesystem on the next boot This time sysrq-w worked for a change. Excerpt: 31933 03:54 btrfs_file_llseek tail -n 50 /var/local/src/misterhouse/data/logs/print.log 31960 32:54 btrfs_file_llseek tail -n 50 /var/local/src/misterhouse/data/logs/print.log 32077 18:54 btrfs_file_llseek tail -n 50 /var/local/src/misterhouse/data/logs/print.log [ 2176.230211] tailD 8801b3a567c0 0 25396 22031 0x20020080 [ 2176.252788] 88006fed3e20 0082 00a8 88006fed3fd8 [ 2176.276039] 8801a542a3d0 000141c0 88020c374e10 88020c374e14 [ 2176.299273] 8801a542a3d0 88020c374e18 88006fed3e30 [ 2176.322515] Call Trace: [ 2176.330739] [8161fa5e] schedule+0x73/0x75 [ 2176.346527] [8161fd1f] schedule_preempt_disabled+0x18/0x24 [ 2176.367208] [81620e42] __mutex_lock_slowpath+0x160/0x1d7 [ 2176.386946] [81620ed0] mutex_lock+0x17/0x27 [ 2176.403727] [8123a33a] btrfs_file_llseek+0x40/0x205 [ 2176.422603] [810be59a] ? from_kgid_munged+0x12/0x1e [ 2176.441015] [810482f1] ? cp_stat64+0x50/0x20b [ 2176.457841] [81156627] vfs_llseek+0x2e/0x30 [ 2176.474606] [81156c32] SyS_llseek+0x5b/0xaa [ 2176.490895] [8162ab2c] sysenter_dispatch+0x7/0x21 Full log: http://marc.merlins.org/tmp/btrfs_hang3.txt After reboot, it's now hanging on this: [ 362.811392] INFO: task kworker/u8:0:6 blocked for more than 120 seconds. [ 362.831717] Not tainted 3.15.5-amd64-i915-preempt-20140714cm1 #1 [ 362.851516] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 362.875213] kworker/u8:0D 88021265a800 0 6 2 0x [ 362.896672] Workqueue: btrfs-flush_delalloc normal_work_helper [ 362.914260] 8802148cbb60 0046 8802148cbb30 8802148cbfd8 [ 362.936741] 8802148c4150 000141c0 88021f3941c0 8802148c4150 [ 362.959195] 8802148cbc00 0002 810fdda8 8802148cbb70 [ 362.981602] Call Trace: [ 362.988972] [810fdda8] ? 
wait_on_page_read+0x3c/0x3c [ 363.006769] [8161fa5e] schedule+0x73/0x75 [ 363.021704] [8161fc03] io_schedule+0x60/0x7a [ 363.037414] [810fddb6] sleep_on_page+0xe/0x12 [ 363.053416] [8161ff93] __wait_on_bit_lock+0x46/0x8a [ 363.070980] [810fde71] __lock_page+0x69/0x6b [ 363.086722] [810848d1] ? autoremove_wake_function+0x34/0x34 [ 363.106373] [81242ab0] lock_page+0x1e/0x21 [ 363.121585] [812465bb] extent_write_cache_pages.isra.16.constprop.32+0x10e/0x2c6 [ 363.148103] [81246a19] extent_writepages+0x4b/0x5c [ 363.166792] [81230ce4] ? btrfs_submit_direct+0x3f4/0x3f4 [ 363.187074] [810765ec] ?
Re: Blocked tasks on 3.15.1, raid1 btrfs is no ends of trouble for me
On Fri, 18 Jul 2014 05:44:57 PM Marc MERLIN wrote: Sorry, this may be a bit misleading. I actually lost 2 filesystems that were raid0 on top of dmcrypt. Stupid question I know, but does this happen without dmcrypt? cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: Blocked tasks on 3.15.1, raid1 btrfs is no ends of trouble for me
On Sat, Jul 19, 2014 at 11:59:24AM +1000, Chris Samuel wrote: On Fri, 18 Jul 2014 05:44:57 PM Marc MERLIN wrote: Sorry, this may be a bit misleading. I actually lost 2 filesystems that were raid0 on top of dmcrypt. Stupid question I know, but does this happen without dmcrypt? It's not a stupid question: I don't use btrfs without dmcrypt, so I can't say. (and I'm not interested in trying :) Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/
Re: Blocked tasks on 3.15.1
[ deadlocks during rsync in 3.15 with compression enabled ]

Hi everyone,

I still haven't been able to reproduce this one here, but I'm going through a series of tests with lzo compression forced and every operation forced to ordered. Hopefully it'll kick it out soon.

While I'm hammering away, could you please try this patch. If this is the bug you're hitting, the deadlock will go away and you'll see this printk in the log.

thanks!

-chris

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3668048..8ab56df 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8157,6 +8157,13 @@ void btrfs_destroy_inode(struct inode *inode)
 		spin_unlock(&root->fs_info->ordered_root_lock);
 	}
 
+	spin_lock(&root->fs_info->ordered_root_lock);
+	if (!list_empty(&BTRFS_I(inode)->ordered_operations)) {
+		list_del_init(&BTRFS_I(inode)->ordered_operations);
+printk(KERN_CRIT "racing inode deletion with ordered operations!!!\n");
+	}
+	spin_unlock(&root->fs_info->ordered_root_lock);
+
 	if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 		     &BTRFS_I(inode)->runtime_flags)) {
 		btrfs_info(root->fs_info, "inode %llu still on the orphan list",
Re: Blocked tasks on 3.15.1
On 07/02/2014 08:27 AM, Cody P Schafer wrote: Will do. The rsync I'm running is processing a lot of chromium cache files when it hangs (just for a reference), and ends up triggering a bunch of deletes as well. Still a problem with your v3.15.y (eb97581), here's the log with sysrq-t and sysrq-l http://bpaste.net/show/428234/ Also, correction, it's a firefox cache dir rsync that seems to trigger it (stalls pretty early on and very consistently):

[... snip ...]
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/1F/F43F9d01
5.23M 100% 17.82MB/s 0:00:00 (xfr#452, ir-chk=1201/6659)
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/20/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/20/23A66d01
116.82K 100% 376.50kB/s 0:00:00 (xfr#453, ir-chk=1200/6659)
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/21/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/23/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/24/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/25/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/25/7C836d01
[... stall here ...]

Ok, and just to clarify, are you actively using the files on the destination outside of rsync?

-chris
Re: Blocked tasks on 3.15.1
On 07/02/2014 09:58 AM, Chris Mason wrote: On 07/02/2014 08:27 AM, Cody P Schafer wrote: Will do. The rsync I'm running is processing a lot of chromium cache files when it hangs (just for a reference), and ends up triggering a bunch of deletes as well. Still a problem with your v3.15.y (eb97581), here's the log with sysrq-t and sysrq-l http://bpaste.net/show/428234/ Also, correction, it's a firefox cache dir rsync that seems to trigger it (stalls pretty early on and very consistently):

[... snip ...]
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/1F/F43F9d01
5.23M 100% 17.82MB/s 0:00:00 (xfr#452, ir-chk=1201/6659)
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/20/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/20/23A66d01
116.82K 100% 376.50kB/s 0:00:00 (xfr#453, ir-chk=1200/6659)
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/21/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/23/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/24/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/25/
.cache/mozilla/firefox/kqtl1tlc.test/Cache/7/25/7C836d01
[... stall here ...]

Ok, and just to clarify, are you actively using the files on the destination outside of rsync?

Also, do you have compression on?

-chris
Re: Blocked tasks on 3.15.1
On 06/30/2014 07:42 PM, Cody P Schafer wrote: On Mon, Jun 30, 2014 at 1:30 PM, Chris Mason c...@fb.com wrote: On 06/30/2014 02:11 PM, Chris Mason wrote: On 06/29/2014 04:02 PM, Cody P Schafer wrote: On Fri, Jun 27, 2014 at 7:22 PM, Chris Samuel ch...@csamuel.org wrote: On Fri, 27 Jun 2014 05:20:41 PM Duncan wrote: If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. That patch applies cleanly to 3.15.2 so if it is indeed the fix it should probably go to -stable for the next 3.15 release.. Unfortunately my test system died a while ago (hardware problem) and I've not been able to resurrect it yet. I'm also seeing stuck tasks on btrfs (3.14.4, 3.15.1, 3.15.2). I've also tried 3.15.2 with ea4ebde02e08558b020c4b61bb9a4c applied on top with similar results. I've been triggering the hang with 'rsync -hPaHAXx --del /mnt/home/a/ /home/a/' where /mnt/home and /home are 2 separate btrfs filesystems on 2 separate disks. dmesg with w-trigger: http://bpaste.net/show/419555 -- These traces show us waiting for IO, but it doesn't show anyone doing the IO. Either we're failing to kick off our work queues or they are stuck on something else. Could you please send a sysrq-t and sysrq-l while you're stuck? That will show us all the procs and all the CPUs. Also, do you have any nodatacow files in here? Please say yes. kernel log from 3.15.2 + ea4ebde02 showing the blocked tasks, sysrq-{w,t,l} included http://bpaste.net/show/423296/ I haven't explicitly created any nodatacow files, is there a quick way to tell if there are any? Right now I'm doing `lsattr -R /mnt/home/a/ 2>/dev/null | grep -- '^-*C-* '` to try and check. (2>/dev/null is hiding lots of "Operation not supported While reading flags on" warnings)

If you haven't turned nodatacow on intentionally, you don't have any nodatacow files ;)

I have been trying to reproduce this with rsync and other code that hammers on the ordered writeback, but no luck yet. Before we spend too much time triggering it again, I'd like you to please try a patch from Filipe that is in current mainline. I've cherry picked on top of 3.15.3 in a branch called v3.15.y: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git v3.15.y

-chris
Re: Blocked tasks on 3.15.1
On 06/29/2014 04:02 PM, Cody P Schafer wrote: On Fri, Jun 27, 2014 at 7:22 PM, Chris Samuel ch...@csamuel.org wrote: On Fri, 27 Jun 2014 05:20:41 PM Duncan wrote: If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. That patch applies cleanly to 3.15.2 so if it is indeed the fix it should probably go to -stable for the next 3.15 release.. Unfortunately my test system died a while ago (hardware problem) and I've not been able to resurrect it yet. I'm also seeing stuck tasks on btrfs (3.14.4, 3.15.1, 3.15.2). I've also tried 3.15.2 with ea4ebde02e08558b020c4b61bb9a4c applied on top with similar results. I've been triggering the hang with 'rsync -hPaHAXx --del /mnt/home/a/ /home/a/' where /mnt/home and /home are 2 separate btrfs filesystems on 2 separate disks. dmesg with w-trigger: http://bpaste.net/show/419555 -- These traces show us waiting for IO, but it doesn't show anyone doing the IO. Either we're failing to kick off our work queues or they are stuck on something else. Could you please send a sysrq-t and sysrq-l while you're stuck? That will show us all the procs and all the CPUs. -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
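For anyone else asked for the same dumps, here is a minimal sketch of triggering them from a shell instead of the keyboard. It assumes root and the standard `/proc/sysrq-trigger` interface; the function name `sysrq_dump` and its optional path argument are hypothetical (the argument exists only so the function can be tried against a scratch file).

```shell
#!/bin/sh
# Hypothetical helper: send the SysRq keys discussed in this thread.
#   w = blocked (uninterruptible) tasks, t = all tasks, l = backtrace of active CPUs
# The optional first argument overrides the trigger path (default is the real one).
sysrq_dump() {
    trigger=${1:-/proc/sysrq-trigger}
    for key in w t l; do
        printf '%s\n' "$key" > "$trigger" || return 1
        printf 'sent sysrq-%s\n' "$key"
    done
}
```

While the hang is in progress, something like `sysrq_dump && dmesg > hang.log` should leave the dumps in `hang.log`, provided the kernel log buffer is large enough to hold them.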
Re: Blocked tasks on 3.15.1
On 06/30/2014 02:11 PM, Chris Mason wrote: On 06/29/2014 04:02 PM, Cody P Schafer wrote: On Fri, Jun 27, 2014 at 7:22 PM, Chris Samuel ch...@csamuel.org wrote: On Fri, 27 Jun 2014 05:20:41 PM Duncan wrote: If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. That patch applies cleanly to 3.15.2 so if it is indeed the fix it should probably go to -stable for the next 3.15 release.. Unfortunately my test system died a while ago (hardware problem) and I've not been able to resurrect it yet. I'm also seeing stuck tasks on btrfs (3.14.4, 3.15.1, 3.15.2). I've also tried 3.15.2 with ea4ebde02e08558b020c4b61bb9a4c applied on top with similar results. I've been triggering the hang with 'rsync -hPaHAXx --del /mnt/home/a/ /home/a/' where /mnt/home and /home are 2 separate btrfs filesystems on 2 separate disks. dmesg with w-trigger: http://bpaste.net/show/419555 -- These traces show us waiting for IO, but it doesn't show anyone doing the IO. Either we're failing to kick off our work queues or they are stuck on something else. Could you please send a sysrq-t and sysrq-l while you're stuck? That will show us all the procs and all the CPUs. Also, do you have any nodatacow files in here? Please say yes. -chris
Re: Blocked tasks on 3.15.1
On Mon, Jun 30, 2014 at 1:30 PM, Chris Mason c...@fb.com wrote: On 06/30/2014 02:11 PM, Chris Mason wrote: On 06/29/2014 04:02 PM, Cody P Schafer wrote: On Fri, Jun 27, 2014 at 7:22 PM, Chris Samuel ch...@csamuel.org wrote: On Fri, 27 Jun 2014 05:20:41 PM Duncan wrote: If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. That patch applies cleanly to 3.15.2 so if it is indeed the fix it should probably go to -stable for the next 3.15 release.. Unfortunately my test system died a while ago (hardware problem) and I've not been able to resurrect it yet. I'm also seeing stuck tasks on btrfs (3.14.4, 3.15.1, 3.15.2). I've also tried 3.15.2 with ea4ebde02e08558b020c4b61bb9a4c applied on top with similar results. I've been triggering the hang with 'rsync -hPaHAXx --del /mnt/home/a/ /home/a/' where /mnt/home and /home are 2 separate btrfs filesystems on 2 separate disks. dmesg with w-trigger: http://bpaste.net/show/419555 -- These traces show us waiting for IO, but it doesn't show anyone doing the IO. Either we're failing to kick off our work queues or they are stuck on something else. Could you please send a sysrq-t and sysrq-l while you're stuck? That will show us all the procs and all the CPUs. Also, do you have any nodatacow files in here? Please say yes. kernel log from 3.15.2 + ea4ebde02 showing the blocked tasks, sysrq-{w,t,l} included http://bpaste.net/show/423296/ I haven't explicitly created any nodatacow files, is there a quick way to tell if there are any? Right now I'm doing `lsattr -R /mnt/home/a/ 2>/dev/null | grep -- '^-*C-* '` to try and check. (2>/dev/null is hiding lots of "Operation not supported While reading flags on" warnings)
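The check described above could also be sketched as a small filter. Note that the grep pattern in the message only matches files whose sole attribute is C; the variant below matches C anywhere in the attribute field. `filter_nocow` is a hypothetical helper name, and the sketch assumes lsattr was given an absolute path, so that the directory headers printed by -R start with "/" and cannot be mistaken for an attribute field.

```shell
#!/bin/sh
# Sketch only: filter_nocow is an illustrative name, not an existing tool.
# Reads `lsattr -R` output on stdin and prints the entries whose attribute
# field (first column) contains 'C', the No_COW flag.
# Blank lines and "directory:" headers are skipped because they have fewer
# than two fields or a first field that is not made of letters and dashes.
filter_nocow() {
    awk 'NF >= 2 && $1 ~ /^[-A-Za-z]+$/ && $1 ~ /C/'
}

# Possible use, with the path from the thread:
#   lsattr -R /mnt/home/a/ 2>/dev/null | filter_nocow
```

An empty result would suggest there are no nodatacow files under that tree, at least among the files lsattr was able to read.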
Re: Blocked tasks on 3.15.1
Chris Mason c...@fb.com wrote: On 06/29/2014 04:02 PM, Cody P Schafer wrote: been able to resurrect it yet. I'm also seeing stuck tasks on btrfs (3.14.4, 3.15.1, 3.15.2). I'm seeing these with 3.15.2 as well. Could you please send a sysrq-t and sysrq-l while you're stuck? That will show us all the procs and all the CPUs. For what it's worth, http://bpaste.net/show/BswHMVpHlguSrdELgv7e/ is my syslog covering my most recent stuck event, including the results of sysrq-t and sysrq-l. Charles -- --- Charles Cazabon GPL'ed software available at: http://pyropus.ca/software/ ---
Re: Blocked tasks on 3.15.1
On Fri, Jun 27, 2014 at 7:22 PM, Chris Samuel ch...@csamuel.org wrote: On Fri, 27 Jun 2014 05:20:41 PM Duncan wrote: If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. That patch applies cleanly to 3.15.2 so if it is indeed the fix it should probably go to -stable for the next 3.15 release.. Unfortunately my test system died a while ago (hardware problem) and I've not been able to resurrect it yet. I'm also seeing stuck tasks on btrfs (3.14.4, 3.15.1, 3.15.2). I've also tried 3.15.2 with ea4ebde02e08558b020c4b61bb9a4c applied on top with similar results. I've been triggering the hang with 'rsync -hPaHAXx --del /mnt/home/a/ /home/a/' where /mnt/home and /home are 2 separate btrfs filesystems on 2 separate disks. dmesg with w-trigger: http://bpaste.net/show/419555/ (3.15.2 + ea4ebde)
Re: Blocked tasks on 3.15.1
On Sun, Jun 29, 2014 at 3:02 PM, Cody P Schafer d...@codyps.com wrote: On Fri, Jun 27, 2014 at 7:22 PM, Chris Samuel ch...@csamuel.org wrote: On Fri, 27 Jun 2014 05:20:41 PM Duncan wrote: If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. That patch applies cleanly to 3.15.2 so if it is indeed the fix it should probably go to -stable for the next 3.15 release.. Unfortunately my test system died a while ago (hardware problem) and I've not been able to resurrect it yet. I'm also seeing stuck tasks on btrfs (3.14.4, 3.15.1, 3.15.2). I've also tried 3.15.2 with ea4ebde02e08558b020c4b61bb9a4c applied on top with similar results. I've been triggering the hang with 'rsync -hPaHAXx --del /mnt/home/a/ /home/a/' where /mnt/home and /home are 2 separate btrfs filesystems on 2 separate disks. dmesg with w-trigger: http://bpaste.net/show/419555/ (3.15.2 + ea4ebde) And here's the same thing but with lockdep enabled (in the hope that that info might be useful) http://bpaste.net/show/419899/
Re: Blocked tasks on 3.15.1
On Fri, Jun 27, 2014 at 8:22 PM, Chris Samuel ch...@csamuel.org wrote: On Fri, 27 Jun 2014 05:20:41 PM Duncan wrote: If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. That patch applies cleanly to 3.15.2 so if it is indeed the fix it should probably go to -stable for the next 3.15 release.. I can confirm that 3.15.2 definitely has the deadlock problem. I tried upgrading just to convince myself of this before patching it and it only took a few hours before it stopped syncing with the usual errors. I applied the patch on Jun 28 around 20:00UTC. I haven't had a deadlock since, despite having the file system fairly active with a few reboots, some deleted snapshots, being assimilated by the new sysvinit replacement, etc. That doesn't really prove anything though - for all I know it will hang a week from now. However, the patch seems stable so far on 3.15.2. Rich
Re: Blocked tasks on 3.15.1
I've been getting blocked tasks on 3.15.1 generally at times when the filesystem is somewhat busy (such as doing a backup via scp/clonezilla writing to the disk). A week ago I had enabled snapper for a day which resulted in a daily cleanup of about 8 snapshots at once, which might have contributed, but I've been limping along since. I've started seeing similar on several servers, after upgrading to 3.15 or 3.15.1. With 3.16-rc1 it was even crashing for me. I've rolled back to the latest 3.14.x, and it's still behaving fine. I've signalled it before on the list in the "btrfs filesystem hang with 3.15-rc3 when doing rsync" thread. -- Tomasz Chmielewski http://www.sslrack.com
Re: Blocked tasks on 3.15.1
Tomasz Chmielewski posted on Fri, 27 Jun 2014 12:02:43 +0200 as excerpted: I've been getting blocked tasks on 3.15.1 generally at times when the filesystem is somewhat busy (such as doing a backup via scp/clonezilla writing to the disk). I've started seeing similar on several servers, after upgrading to 3.15 or 3.15.1. With 3.16-rc1 it was even crashing for me. I've rolled back to the latest 3.14.x, and it's still behaving fine. I've signalled it before on the list in the "btrfs filesystem hang with 3.15-rc3 when doing rsync" thread. There is a known btrfs lockup bug that was introduced in the commit-window btrfs pull for 3.16, that was fixed by a pull I believe the day before 3.16-rc2. So 3.16-pre to rc2 is known-bad tho it'll work for a few minutes and didn't do any permanent damage that I could see, here. But from 3.16-rc2 on, the 3.16-pre series has been working fine for me. For 3.15, I didn't run the pre-releases as I had another project I was focusing on, but I experienced no problems with 3.15 itself. However, my use-case is multiple independent small btrfs on partitioned SSD, sub-100-GB per btrfs, so I'd be less likely to experience the blocked task issues that others reported, mostly on TB+ size spinning rust. And it /did/ seem to me that the frequency of blocked-task reports was higher for 3.15 than for previous kernel series, tho 3.15 worked fine for me on small btrfs on SSD, the relatively short time I ran it. Hopefully that problem's fixed on 3.16-rc2+, but as of yet there's not enough 3.16-rc2+ reports out there from folks experiencing issues with 3.15 blocked tasks to rightfully say. What CAN be said is that the known 3.16-series commit-window btrfs lockups bug that DID affect me was fixed right before rc2, and I'm running rc2+ just fine, here. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: Blocked tasks on 3.15.1
On Fri, Jun 27, 2014 at 9:06 AM, Duncan 1i5t5.dun...@cox.net wrote: Hopefully that problem's fixed on 3.16-rc2+, but as of yet there's not enough 3.16-rc2+ reports out there from folks experiencing issues with 3.15 blocked tasks to rightfully say. Any chance that it was backported to 3.15.2? I'd rather not move to mainline just for btrfs. I got another block this morning and failed to capture a log before my terminals gave out. I switched back to 3.15.0 for the moment, and we'll see if that fares any better. Rich
Re: Blocked tasks on 3.15.1
On Jun 27, 2014, at 9:14 AM, Rich Freeman r-bt...@thefreemanclan.net wrote: On Fri, Jun 27, 2014 at 9:06 AM, Duncan 1i5t5.dun...@cox.net wrote: Hopefully that problem's fixed on 3.16-rc2+, but as of yet there's not enough 3.16-rc2+ reports out there from folks experiencing issues with 3.15 blocked tasks to rightfully say. Any chance that it was backported to 3.15.2? I'd rather not move to mainline just for btrfs. The backports don't happen that quickly. I'm uncertain about specifics but I think many such fixes need to be demonstrated in mainline before they get backported to stable. I got another block this morning and failed to capture a log before my terminals gave out. I switched back to 3.15.0 for the moment, and we'll see if that fares any better. Yeah I'd start going backwards. The idea of going forwards is to hopefully get you unstuck or extract data where otherwise you can't, it's not really a recommendation for production usage. It's also often useful if you can reproduce the block with a current rc kernel and issue sysrq+w and post that. Then do your regression with an older kernel. Chris Murphy
Re: Blocked tasks on 3.15.1
Chris Murphy posted on Fri, 27 Jun 2014 09:52:46 -0600 as excerpted: On Jun 27, 2014, at 9:14 AM, Rich Freeman r-bt...@thefreemanclan.net wrote: On Fri, Jun 27, 2014 at 9:06 AM, Duncan 1i5t5.dun...@cox.net wrote: Hopefully that problem's fixed on 3.16-rc2+, but as of yet there's not enough 3.16-rc2+ reports out there from folks experiencing issues with 3.15 blocked tasks to rightfully say. Any chance that it was backported to 3.15.2? I'd rather not move to mainline just for btrfs. The backports don't happen that quickly. The lockup bug that affected early 3.16 was introduced in the commit-window pull for 3.16, so the fix for that shouldn't have needed backported (unless the problem commit ended up in stable too, which I doubt but don't know for sure). 3.15.0 didn't contain that bug, which affected me, but as I said, there did seem to be more blocked-task reports in 3.15, which didn't affect me. I didn't run 3.15.1, however, staying on 3.15.0 until after 3.16-rc2 fixed the earlier 3.16-pre series bug that had kept me from the 3.16 series until then. So anything that might have affected the 3.15 stable series after 3.15.0, I wouldn't know about. If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. But I think the 3.16 commit-window changes introducing the bug weren't btrfs specific but instead at the generic vfs level. If that's the case, then it's possible that the bug was there before 3.16's commit window and might have been triggering some of the 3.15 reports as well, and the 3.16 vfs change simply made it much worse. IOW, I don't know whether that 3.16 series fix will help 3.15 or not, but I don't believe it'll hurt, and it /might/ help. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: Blocked tasks on 3.15.1
On Fri, Jun 27, 2014 at 11:52 AM, Chris Murphy li...@colorremedies.com wrote: On Jun 27, 2014, at 9:14 AM, Rich Freeman r-bt...@thefreemanclan.net wrote: I got another block this morning and failed to capture a log before my terminals gave out. I switched back to 3.15.0 for the moment, and we'll see if that fares any better. Yeah I'd start going backwards. The idea of going forwards is to hopefully get you unstuck or extract data where otherwise you can't, it's not really a recommendation for production usage. It's also often useful if you can reproduce the block with a current rc kernel and issue sysrq+w and post that. Then do your regression with an older kernel. So, obviously I'm getting my money's worth from the btrfs team, but neither is always a great option as neither involves me running a stable kernel. 3.15.0 contains CVE-2014-4014, although I'm running a version patched for that vulnerability. If I go back any further I'd probably have to backport it myself, and I only know about it because my distro patched that CVE on 3.15.0 before moving to 3.15.1. Running 3.16 doesn't bother me much from a btrfs standpoint, but it means I'm getting unstable updates on all the other modules as well. It is just more to deal with. I might give 3.15.2 a shot and see what happens, and I can always fall back to 3.15.0 again. Rich
Re: Blocked tasks on 3.15.1
On Fri, 27 Jun 2014 05:20:41 PM Duncan wrote: If I'm not mistaken the fix for the 3.16 series bug was: ea4ebde02e08558b020c4b61bb9a4c0fcf63028e Btrfs: fix deadlocks with trylock on tree nodes. That patch applies cleanly to 3.15.2 so if it is indeed the fix it should probably go to -stable for the next 3.15 release.. Unfortunately my test system died a while ago (hardware problem) and I've not been able to resurrect it yet. cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC