Re: btrfs send hangs after partial transfer and blocks all IO
On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann wrote: > Am 13.9.2018 14:35, schrieb Nikolay Borisov: >> >> On 13.09.2018 15:30, Jürgen Herrmann wrote: >>> >>> OK, I will install kdump later and perform a dump after the hang. >>> >>> One more noob question beforehand: does this dump contain sensitive >>> information, for example the luks encryption key for the disk etc? A >>> Google search only brings up one relevant search result which can only >>> be viewed with a redhat subscription... >> >> >> >> So a kdump will dump the kernel memory so it's possible that the LUKS >> encryption keys could be extracted from that image. Bummer, it's >> understandable why you wouldn't want to upload it :). In this case you'd >> have to install also the 'crash' utility to open the crashdump and >> extract the calltrace of the btrfs process. The rough process should be : >> >> >> crash 'path to vm linux' 'path to vmcore file', then once inside the >> crash utility : >> >> set , you can acquire the pid by issuing 'ps' >> which will give you a ps-like output of all running processes at the >> time of crash. After the context has been set you can run 'bt' which >> will give you a backtrace of the send process. >> >> >> >>> >>> Best regards, >>> Jürgen >>> >>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov >>> : >>> On 13.09.2018 14:50, Jürgen Herrmann wrote: > > I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also > after the hang because I started the loop before the hang. The dmesg > output should show the hanging tasks from second 346 on or so. Still > not > useful? > So from 346 it's evident that transaction commit is waiting for commit_root_sem to be acquired. So something else is holding it and not giving the transaction chance to finish committing. Now the only place where send acquires this lock is in find_extent_clone around the call to extent_from_logical. The latter basically does an extent tree search and doesn't loop so can't possibly deadlock. Furthermore I don't see any userspace processes being hung in kernel space. Additionally looking at the userspace processes they indicate that find_extent_clone has finished and are blocked in send_write_or_clone which does the write. But I guess this actually happens before the hang. So at this point without looking at the stacktrace of the btrfs send process after the hung has occurred I don't think much can be done >>> >>> > I know this is probably not the correct list to ask this question but maybe > someone of the devs can point me to the right list? > > I cannot get kdump to work. The crashkernel is loaded and everything is > setup for it afaict. I asked a question on this over at stackexchange but no > answer yet. > https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing > > So i did a little digging and added some debug printk() statements to see > whats going on and it seems that panic() is never called. maybe the second > stack trace is the reason? > Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN > > Could someone please tell me where I can report this problem and get some > help on this topic? Try kexec mailing list. They handle kdump. http://lists.infradead.org/mailman/listinfo/kexec -- Chris Murphy
Re: btrfs send hangs after partial transfer and blocks all IO
Am 13.9.2018 14:35, schrieb Nikolay Borisov: On 13.09.2018 15:30, Jürgen Herrmann wrote: OK, I will install kdump later and perform a dump after the hang. One more noob question beforehand: does this dump contain sensitive information, for example the luks encryption key for the disk etc? A Google search only brings up one relevant search result which can only be viewed with a redhat subscription... So a kdump will dump the kernel memory so it's possible that the LUKS encryption keys could be extracted from that image. Bummer, it's understandable why you wouldn't want to upload it :). In this case you'd have to install also the 'crash' utility to open the crashdump and extract the calltrace of the btrfs process. The rough process should be : crash 'path to vm linux' 'path to vmcore file', then once inside the crash utility : set , you can acquire the pid by issuing 'ps' which will give you a ps-like output of all running processes at the time of crash. After the context has been set you can run 'bt' which will give you a backtrace of the send process. Best regards, Jürgen Am 13. September 2018 14:02:11 schrieb Nikolay Borisov : On 13.09.2018 14:50, Jürgen Herrmann wrote: I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also after the hang because I started the loop before the hang. The dmesg output should show the hanging tasks from second 346 on or so. Still not useful? So from 346 it's evident that transaction commit is waiting for commit_root_sem to be acquired. So something else is holding it and not giving the transaction chance to finish committing. Now the only place where send acquires this lock is in find_extent_clone around the call to extent_from_logical. The latter basically does an extent tree search and doesn't loop so can't possibly deadlock. Furthermore I don't see any userspace processes being hung in kernel space. Additionally looking at the userspace processes they indicate that find_extent_clone has finished and are blocked in send_write_or_clone which does the write. But I guess this actually happens before the hang. So at this point without looking at the stacktrace of the btrfs send process after the hung has occurred I don't think much can be done I know this is probably not the correct list to ask this question but maybe someone of the devs can point me to the right list? I cannot get kdump to work. The crashkernel is loaded and everything is setup for it afaict. I asked a question on this over at stackexchange but no answer yet. https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing So i did a little digging and added some debug printk() statements to see whats going on and it seems that panic() is never called. maybe the second stack trace is the reason? Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN Could someone please tell me where I can report this problem and get some help on this topic? Best regards, Jürgen -- Jürgen Herrmann https://t-5.eu ALbertstraße 2 94327 Bogen
Re: btrfs send hangs after partial transfer and blocks all IO
Am 13.9.2018 18:22, schrieb Chris Murphy: (resend to all) On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov wrote: On 13.09.2018 18:30, Chris Murphy wrote: This is the 2nd or 3rd thread containing hanging btrfs send, with kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait" and the other I can't find at the moment. In that case though the hang is reproducible in 4.14.x and weirdly it only happens when a snapshot contains (perhaps many) reflinks. Scrub and check lowmem find nothing wrong. I have snapshots with a few reflinks (cp --reflink and also deduplication), and I see maybe 15-30 second hangs where nothing is apparently happening (in top or iotop), but I'm also not seeing any blocked tasks or high CPU usage. Perhaps in my case it's just recovering quickly. Are there any kernel config options in "# Debug Lockups and Hangs" that might hint at what's going on? Some of these are enabled in Fedora debug kernels, which are built practically daily, e.g. right now the latest in the build system is 4.19.0-0.rc3.git2.1 - which translates to git 54eda9df17f3. If it's a lock-related problem then you need Lock Debugging => Lock debugging: prove locking correctness OK looks like that's under a different section as CONFIG_PROVE_LOCKING which is enabled on Fedora debug kernels. # Debug Lockups and Hangs CONFIG_LOCKUP_DETECTOR=y CONFIG_SOFTLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0 CONFIG_HARDLOCKUP_DETECTOR_PERF=y CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y CONFIG_HARDLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0 # Lock Debugging (spinlocks, mutexes, etc...) CONFIG_LOCK_DEBUGGING_SUPPORT=y CONFIG_PROVE_LOCKING=y CONFIG_LOCK_STAT=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_LOCKDEP=y # CONFIG_DEBUG_LOCKDEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set CONFIG_LOCK_TORTURE_TEST=m Hello again! I have CONFIG_PROVE_LOCKING enabled in my kernel, but no change to the observed behaviour. Best regards, Jürgen -- Jürgen Herrmann https://t-5.eu ALbertstraße 2 94327 Bogen
Re: btrfs send hangs after partial transfer and blocks all IO
(resend to all) On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov wrote: > > > On 13.09.2018 18:30, Chris Murphy wrote: >> This is the 2nd or 3rd thread containing hanging btrfs send, with >> kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait" >> and the other I can't find at the moment. In that case though the hang >> is reproducible in 4.14.x and weirdly it only happens when a snapshot >> contains (perhaps many) reflinks. Scrub and check lowmem find nothing >> wrong. >> >> I have snapshots with a few reflinks (cp --reflink and also >> deduplication), and I see maybe 15-30 second hangs where nothing is >> apparently happening (in top or iotop), but I'm also not seeing any >> blocked tasks or high CPU usage. Perhaps in my case it's just >> recovering quickly. >> >> Are there any kernel config options in "# Debug Lockups and Hangs" >> that might hint at what's going on? Some of these are enabled in >> Fedora debug kernels, which are built practically daily, e.g. right >> now the latest in the build system is 4.19.0-0.rc3.git2.1 - which >> translates to git 54eda9df17f3. > > If it's a lock-related problem then you need Lock Debugging => Lock > debugging: prove locking correctness OK looks like that's under a different section as CONFIG_PROVE_LOCKING which is enabled on Fedora debug kernels. # Debug Lockups and Hangs CONFIG_LOCKUP_DETECTOR=y CONFIG_SOFTLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0 CONFIG_HARDLOCKUP_DETECTOR_PERF=y CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y CONFIG_HARDLOCKUP_DETECTOR=y # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0 # Lock Debugging (spinlocks, mutexes, etc...) CONFIG_LOCK_DEBUGGING_SUPPORT=y CONFIG_PROVE_LOCKING=y CONFIG_LOCK_STAT=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_LOCKDEP=y # CONFIG_DEBUG_LOCKDEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set CONFIG_LOCK_TORTURE_TEST=m -- Chris Murphy
Re: btrfs send hangs after partial transfer and blocks all IO
On 13.09.2018 18:30, Chris Murphy wrote: > This is the 2nd or 3rd thread containing hanging btrfs send, with > kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait" > and the other I can't find at the moment. In that case though the hang > is reproducible in 4.14.x and weirdly it only happens when a snapshot > contains (perhaps many) reflinks. Scrub and check lowmem find nothing > wrong. > > I have snapshots with a few reflinks (cp --reflink and also > deduplication), and I see maybe 15-30 second hangs where nothing is > apparently happening (in top or iotop), but I'm also not seeing any > blocked tasks or high CPU usage. Perhaps in my case it's just > recovering quickly. > > Are there any kernel config options in "# Debug Lockups and Hangs" > that might hint at what's going on? Some of these are enabled in > Fedora debug kernels, which are built practically daily, e.g. right > now the latest in the build system is 4.19.0-0.rc3.git2.1 - which > translates to git 54eda9df17f3. If it's a lock-related problem then you need Lock Debugging => Lock debugging: prove locking correctness > > > Chris Murphy >
Re: btrfs send hangs after partial transfer and blocks all IO
This is the 2nd or 3rd thread containing hanging btrfs send, with kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait" and the other I can't find at the moment. In that case though the hang is reproducible in 4.14.x and weirdly it only happens when a snapshot contains (perhaps many) reflinks. Scrub and check lowmem find nothing wrong. I have snapshots with a few reflinks (cp --reflink and also deduplication), and I see maybe 15-30 second hangs where nothing is apparently happening (in top or iotop), but I'm also not seeing any blocked tasks or high CPU usage. Perhaps in my case it's just recovering quickly. Are there any kernel config options in "# Debug Lockups and Hangs" that might hint at what's going on? Some of these are enabled in Fedora debug kernels, which are built practically daily, e.g. right now the latest in the build system is 4.19.0-0.rc3.git2.1 - which translates to git 54eda9df17f3. Chris Murphy
Re: btrfs send hangs after partial transfer and blocks all IO
On 13.09.2018 15:30, Jürgen Herrmann wrote: > OK, I will install kdump later and perform a dump after the hang. > > One more noob question beforehand: does this dump contain sensitive > information, for example the luks encryption key for the disk etc? A > Google search only brings up one relevant search result which can only > be viewed with a redhat subscription... So a kdump will dump the kernel memory so it's possible that the LUKS encryption keys could be extracted from that image. Bummer, it's understandable why you wouldn't want to upload it :). In this case you'd have to install also the 'crash' utility to open the crashdump and extract the calltrace of the btrfs process. The rough process should be : crash 'path to vm linux' 'path to vmcore file', then once inside the crash utility : set , you can acquire the pid by issuing 'ps' which will give you a ps-like output of all running processes at the time of crash. After the context has been set you can run 'bt' which will give you a backtrace of the send process. > > Best regards, > Jürgen > > Am 13. September 2018 14:02:11 schrieb Nikolay Borisov : > >> On 13.09.2018 14:50, Jürgen Herrmann wrote: >>> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also >>> after the hang because I started the loop before the hang. The dmesg >>> output should show the hanging tasks from second 346 on or so. Still not >>> useful? >>> >> >> So from 346 it's evident that transaction commit is waiting for >> commit_root_sem to be acquired. So something else is holding it and not >> giving the transaction chance to finish committing. Now the only place >> where send acquires this lock is in find_extent_clone around the call >> to extent_from_logical. The latter basically does an extent tree search >> and doesn't loop so can't possibly deadlock. Furthermore I don't see any >> userspace processes being hung in kernel space. >> >> Additionally looking at the userspace processes they indicate that >> find_extent_clone has finished and are blocked in send_write_or_clone >> which does the write. But I guess this actually happens before the hang. >> >> >> So at this point without looking at the stacktrace of the btrfs send >> process after the hung has occurred I don't think much can be done > > > Mit AquaMail Android > https://www.mobisystems.com/aqua-mail > > >
Re: btrfs send hangs after partial transfer and blocks all IO
OK, I will install kdump later and perform a dump after the hang. One more noob question beforehand: does this dump contain sensitive information, for example the luks encryption key for the disk etc? A Google search only brings up one relevant search result which can only be viewed with a redhat subscription... Best regards, Jürgen Am 13. September 2018 14:02:11 schrieb Nikolay Borisov : On 13.09.2018 14:50, Jürgen Herrmann wrote: I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also after the hang because I started the loop before the hang. The dmesg output should show the hanging tasks from second 346 on or so. Still not useful? So from 346 it's evident that transaction commit is waiting for commit_root_sem to be acquired. So something else is holding it and not giving the transaction chance to finish committing. Now the only place where send acquires this lock is in find_extent_clone around the call to extent_from_logical. The latter basically does an extent tree search and doesn't loop so can't possibly deadlock. Furthermore I don't see any userspace processes being hung in kernel space. Additionally looking at the userspace processes they indicate that find_extent_clone has finished and are blocked in send_write_or_clone which does the write. But I guess this actually happens before the hang. So at this point without looking at the stacktrace of the btrfs send process after the hung has occurred I don't think much can be done Mit AquaMail Android https://www.mobisystems.com/aqua-mail
Re: btrfs send hangs after partial transfer and blocks all IO
On 13.09.2018 14:50, Jürgen Herrmann wrote: > I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also > after the hang because I started the loop before the hang. The dmesg > output should show the hanging tasks from second 346 on or so. Still not > useful? > So from 346 it's evident that transaction commit is waiting for commit_root_sem to be acquired. So something else is holding it and not giving the transaction chance to finish committing. Now the only place where send acquires this lock is in find_extent_clone around the call to extent_from_logical. The latter basically does an extent tree search and doesn't loop so can't possibly deadlock. Furthermore I don't see any userspace processes being hung in kernel space. Additionally looking at the userspace processes they indicate that find_extent_clone has finished and are blocked in send_write_or_clone which does the write. But I guess this actually happens before the hang. So at this point without looking at the stacktrace of the btrfs send process after the hung has occurred I don't think much can be done
Re: btrfs send hangs after partial transfer and blocks all IO
I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also after the hang because I started the loop before the hang. The dmesg output should show the hanging tasks from second 346 on or so. Still not useful? Best regards, Jürgen Am 13. September 2018 13:04:39 schrieb Nikolay Borisov : On 13.09.2018 13:56, Jürgen Herrmann wrote: Both loops were started before the hang because after the hang I cannot do that anymore. That's why there is progress in the logs at first. The hang continues for at least 1.5 hours. No data is transferred anymore during this time. I never waited longer than 1.5 hours. So these logs don't provide any useful information then. The other thing which I can advise is to setup kdump and when the kernel hangs cause a crashdump to be taken and try to upload it somewhere alongside your vmlinux file for further debugging. Best regards, Jürgen Am 13. September 2018 12:50:59 schrieb Nikolay Borisov : On 13.09.2018 13:29, Jürgen Herrmann wrote: Am 13.9.2018 10:40, schrieb Nikolay Borisov: On 13.09.2018 11:34, Jürgen Herrmann wrote: Hello! I have a newly installed laptop running a freshly installed (abt. two months ago) laptop running latest linux mint 19. Root filesystem is on a 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G partition. Timeshift-btrfs is enabled for root (@) and home (@home) subvolumes. I want to transfer snapshots to a server with a separated disk via "btrfs send" and ssh. Here's the list of snapshot directories, each containing tow snapshots for root and home: drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01 drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02 drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01 drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01 drwxr-xr-x 1 root root 30 Sep 6 20:00 2018-09-06_20-00-01 drwxr-xr-x 1 root root 30 Sep 6 22:00 2018-09-06_22-00-01 drwxr-xr-x 1 root root 30 Sep 8 16:00 2018-09-08_16-00-01 drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02 drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02 drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01 "btrfs send /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ /dev/null" results in the btrfs task taking 100% cpu time on one cpu and then all IO is blocked -> only reboot can solve the hang. The crash does not happen immediately, as i was on the road using cellular connection it seemed fine at first. That's how I found out that it transfers ~140MB of data before hanging. The snapshot is created on the server and contains data (du shows abt 140MB). I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs 4.17.1 compiled from source. Here's the btrfs filesystem info: Label: none uuid: a914c141-72bf-448b-847f-d64ee82d8b7b Total devices 1 FS bytes used 342.85GiB devid1 size 875.44GiB used 357.05GiB path /dev/mapper/sda3_crypt A scrub shows no errors: scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b scrub started at Thu Sep 13 10:20:18 2018 and finished after 00:12:19 total bytes scrubbed: 342.78GiB with 0 errors What can I do to help debugging this issue? You should provide output of echo w > /proc/sysrq-trigger. Also sample the stack of /proc/[pid of btrfs send]/stack to see if it is changing. Best regards, Jürgen Hello! dmesg output can be found here: https://pastebin.com/g86dPGSZ So from what I see current transaction commit is waiting for root->commit_root_sem and then other threads (in this case systemd) is waiting for transaction commit to finish. stacks can be found here: https://pastebin.com/dCt1YgJp ANd your user process seems to be making some progress as evident from the fact that the call trace of the process is actually changing over the course of sampling. Is it possible that it just takes time to do the IO ? Best regards, Jürgen Mit AquaMail Android https://www.mobisystems.com/aqua-mail Mit AquaMail Android https://www.mobisystems.com/aqua-mail
Re: btrfs send hangs after partial transfer and blocks all IO
On 13.09.2018 13:56, Jürgen Herrmann wrote: > Both loops were started before the hang because after the hang I cannot > do that anymore. That's why there is progress in the logs at first. The > hang continues for at least 1.5 hours. No data is transferred anymore > during this time. I never waited longer than 1.5 hours. So these logs don't provide any useful information then. The other thing which I can advise is to setup kdump and when the kernel hangs cause a crashdump to be taken and try to upload it somewhere alongside your vmlinux file for further debugging. > > Best regards, > Jürgen > > Am 13. September 2018 12:50:59 schrieb Nikolay Borisov : > >> On 13.09.2018 13:29, Jürgen Herrmann wrote: >>> Am 13.9.2018 10:40, schrieb Nikolay Borisov: On 13.09.2018 11:34, Jürgen Herrmann wrote: > Hello! > > I have a newly installed laptop running a freshly installed (abt. two > months ago) laptop running latest linux mint 19. Root filesystem is > on a > 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G > partition. Timeshift-btrfs is enabled for root (@) and home (@home) > subvolumes. I want to transfer snapshots to a server with a separated > disk via "btrfs send" and ssh. > > Here's the list of snapshot directories, each containing tow snapshots > for root and home: > > drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01 > drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02 > drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01 > drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01 > drwxr-xr-x 1 root root 30 Sep 6 20:00 2018-09-06_20-00-01 > drwxr-xr-x 1 root root 30 Sep 6 22:00 2018-09-06_22-00-01 > drwxr-xr-x 1 root root 30 Sep 8 16:00 2018-09-08_16-00-01 > drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02 > drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02 > drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01 > > "btrfs send > /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ >> /dev/null" results in the btrfs task taking 100% cpu time on one cpu > and then all IO is blocked -> only reboot can solve the hang. > > The crash does not happen immediately, as i was on the road using > cellular connection it seemed fine at first. That's how I found out > that > it transfers ~140MB of data before hanging. The snapshot is created on > the server and contains data (du shows abt 140MB). > > I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs > progs > 4.17.1 compiled from source. > > Here's the btrfs filesystem info: > Label: none uuid: a914c141-72bf-448b-847f-d64ee82d8b7b > Total devices 1 FS bytes used 342.85GiB > devid 1 size 875.44GiB used 357.05GiB path > /dev/mapper/sda3_crypt > > A scrub shows no errors: > scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b > scrub started at Thu Sep 13 10:20:18 2018 and finished after > 00:12:19 > total bytes scrubbed: 342.78GiB with 0 errors > > What can I do to help debugging this issue? You should provide output of echo w > /proc/sysrq-trigger. Also sample the stack of /proc/[pid of btrfs send]/stack to see if it is changing. > > Best regards, > Jürgen >>> >>> Hello! >>> >>> dmesg output can be found here: >>> https://pastebin.com/g86dPGSZ >> >> So from what I see current transaction commit is waiting for >> root->commit_root_sem and then other threads (in this case systemd) is >> waiting for transaction commit to finish. >>> >>> stacks can be found here: >>> https://pastebin.com/dCt1YgJp >> >> ANd your user process seems to be making some progress as evident from >> the fact that the call trace of the process is actually changing over >> the course of sampling. Is it possible that it just takes time to do the >> IO ? >>> >>> Best regards, >>> Jürgen > > > Mit AquaMail Android > https://www.mobisystems.com/aqua-mail > > >
Re: btrfs send hangs after partial transfer and blocks all IO
Both loops were started before the hang because after the hang I cannot do that anymore. That's why there is progress in the logs at first. The hang continues for at least 1.5 hours. No data is transferred anymore during this time. I never waited longer than 1.5 hours. Best regards, Jürgen Am 13. September 2018 12:50:59 schrieb Nikolay Borisov : On 13.09.2018 13:29, Jürgen Herrmann wrote: Am 13.9.2018 10:40, schrieb Nikolay Borisov: On 13.09.2018 11:34, Jürgen Herrmann wrote: Hello! I have a newly installed laptop running a freshly installed (abt. two months ago) laptop running latest linux mint 19. Root filesystem is on a 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G partition. Timeshift-btrfs is enabled for root (@) and home (@home) subvolumes. I want to transfer snapshots to a server with a separated disk via "btrfs send" and ssh. Here's the list of snapshot directories, each containing tow snapshots for root and home: drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01 drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02 drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01 drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01 drwxr-xr-x 1 root root 30 Sep 6 20:00 2018-09-06_20-00-01 drwxr-xr-x 1 root root 30 Sep 6 22:00 2018-09-06_22-00-01 drwxr-xr-x 1 root root 30 Sep 8 16:00 2018-09-08_16-00-01 drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02 drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02 drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01 "btrfs send /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ /dev/null" results in the btrfs task taking 100% cpu time on one cpu and then all IO is blocked -> only reboot can solve the hang. The crash does not happen immediately, as i was on the road using cellular connection it seemed fine at first. That's how I found out that it transfers ~140MB of data before hanging. The snapshot is created on the server and contains data (du shows abt 140MB). I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs 4.17.1 compiled from source. Here's the btrfs filesystem info: Label: none uuid: a914c141-72bf-448b-847f-d64ee82d8b7b Total devices 1 FS bytes used 342.85GiB devid1 size 875.44GiB used 357.05GiB path /dev/mapper/sda3_crypt A scrub shows no errors: scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b scrub started at Thu Sep 13 10:20:18 2018 and finished after 00:12:19 total bytes scrubbed: 342.78GiB with 0 errors What can I do to help debugging this issue? You should provide output of echo w > /proc/sysrq-trigger. Also sample the stack of /proc/[pid of btrfs send]/stack to see if it is changing. Best regards, Jürgen Hello! dmesg output can be found here: https://pastebin.com/g86dPGSZ So from what I see current transaction commit is waiting for root->commit_root_sem and then other threads (in this case systemd) is waiting for transaction commit to finish. stacks can be found here: https://pastebin.com/dCt1YgJp ANd your user process seems to be making some progress as evident from the fact that the call trace of the process is actually changing over the course of sampling. Is it possible that it just takes time to do the IO ? Best regards, Jürgen Mit AquaMail Android https://www.mobisystems.com/aqua-mail
Re: btrfs send hangs after partial transfer and blocks all IO
On 13.09.2018 13:29, Jürgen Herrmann wrote: > Am 13.9.2018 10:40, schrieb Nikolay Borisov: >> On 13.09.2018 11:34, Jürgen Herrmann wrote: >>> Hello! >>> >>> I have a newly installed laptop running a freshly installed (abt. two >>> months ago) laptop running latest linux mint 19. Root filesystem is on a >>> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G >>> partition. Timeshift-btrfs is enabled for root (@) and home (@home) >>> subvolumes. I want to transfer snapshots to a server with a separated >>> disk via "btrfs send" and ssh. >>> >>> Here's the list of snapshot directories, each containing tow snapshots >>> for root and home: >>> >>> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01 >>> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02 >>> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01 >>> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01 >>> drwxr-xr-x 1 root root 30 Sep 6 20:00 2018-09-06_20-00-01 >>> drwxr-xr-x 1 root root 30 Sep 6 22:00 2018-09-06_22-00-01 >>> drwxr-xr-x 1 root root 30 Sep 8 16:00 2018-09-08_16-00-01 >>> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02 >>> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02 >>> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01 >>> >>> "btrfs send >>> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ /dev/null" results in the btrfs task taking 100% cpu time on one cpu >>> and then all IO is blocked -> only reboot can solve the hang. >>> >>> The crash does not happen immediately, as i was on the road using >>> cellular connection it seemed fine at first. That's how I found out that >>> it transfers ~140MB of data before hanging. The snapshot is created on >>> the server and contains data (du shows abt 140MB). >>> >>> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs >>> 4.17.1 compiled from source. >>> >>> Here's the btrfs filesystem info: >>> Label: none uuid: a914c141-72bf-448b-847f-d64ee82d8b7b >>> Total devices 1 FS bytes used 342.85GiB >>> devid 1 size 875.44GiB used 357.05GiB path >>> /dev/mapper/sda3_crypt >>> >>> A scrub shows no errors: >>> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b >>> scrub started at Thu Sep 13 10:20:18 2018 and finished after >>> 00:12:19 >>> total bytes scrubbed: 342.78GiB with 0 errors >>> >>> What can I do to help debugging this issue? >> >> >> You should provide output of echo w > /proc/sysrq-trigger. Also >> sample the stack of /proc/[pid of btrfs send]/stack to see if it is >> changing. >> >> >>> >>> Best regards, >>> Jürgen > > Hello! > > dmesg output can be found here: > https://pastebin.com/g86dPGSZ So from what I see current transaction commit is waiting for root->commit_root_sem and then other threads (in this case systemd) is waiting for transaction commit to finish. > > stacks can be found here: > https://pastebin.com/dCt1YgJp ANd your user process seems to be making some progress as evident from the fact that the call trace of the process is actually changing over the course of sampling. Is it possible that it just takes time to do the IO ? > > Best regards, > Jürgen
Re: btrfs send hangs after partial transfer and blocks all IO
Am 13.9.2018 10:40, schrieb Nikolay Borisov: On 13.09.2018 11:34, Jürgen Herrmann wrote: Hello! I have a newly installed laptop running a freshly installed (abt. two months ago) laptop running latest linux mint 19. Root filesystem is on a 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G partition. Timeshift-btrfs is enabled for root (@) and home (@home) subvolumes. I want to transfer snapshots to a server with a separated disk via "btrfs send" and ssh. Here's the list of snapshot directories, each containing tow snapshots for root and home: drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01 drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02 drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01 drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01 drwxr-xr-x 1 root root 30 Sep 6 20:00 2018-09-06_20-00-01 drwxr-xr-x 1 root root 30 Sep 6 22:00 2018-09-06_22-00-01 drwxr-xr-x 1 root root 30 Sep 8 16:00 2018-09-08_16-00-01 drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02 drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02 drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01 "btrfs send /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ /dev/null" results in the btrfs task taking 100% cpu time on one cpu and then all IO is blocked -> only reboot can solve the hang. The crash does not happen immediately, as i was on the road using cellular connection it seemed fine at first. That's how I found out that it transfers ~140MB of data before hanging. The snapshot is created on the server and contains data (du shows abt 140MB). I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs 4.17.1 compiled from source. Here's the btrfs filesystem info: Label: none uuid: a914c141-72bf-448b-847f-d64ee82d8b7b Total devices 1 FS bytes used 342.85GiB devid 1 size 875.44GiB used 357.05GiB path /dev/mapper/sda3_crypt A scrub shows no errors: scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b scrub started at Thu Sep 13 10:20:18 2018 and finished after 00:12:19 total bytes scrubbed: 342.78GiB with 0 errors What can I do to help debugging this issue? You should provide output of echo w > /proc/sysrq-trigger. Also sample the stack of /proc/[pid of btrfs send]/stack to see if it is changing. Best regards, Jürgen Hello! dmesg output can be found here: https://pastebin.com/g86dPGSZ stacks can be found here: https://pastebin.com/dCt1YgJp Best regards, Jürgen -- Jürgen Herrmann https://t-5.eu ALbertstraße 2 94327 Bogen
Re: btrfs send hangs after partial transfer and blocks all IO
On 13.09.2018 11:34, Jürgen Herrmann wrote: > Hello! > > I have a newly installed laptop running a freshly installed (abt. two > months ago) laptop running latest linux mint 19. Root filesystem is on a > 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G > partition. Timeshift-btrfs is enabled for root (@) and home (@home) > subvolumes. I want to transfer snapshots to a server with a separated > disk via "btrfs send" and ssh. > > Here's the list of snapshot directories, each containing tow snapshots > for root and home: > > drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01 > drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02 > drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01 > drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01 > drwxr-xr-x 1 root root 30 Sep 6 20:00 2018-09-06_20-00-01 > drwxr-xr-x 1 root root 30 Sep 6 22:00 2018-09-06_22-00-01 > drwxr-xr-x 1 root root 30 Sep 8 16:00 2018-09-08_16-00-01 > drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02 > drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02 > drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01 > > "btrfs send > /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ >>/dev/null" results in the btrfs task taking 100% cpu time on one cpu > and then all IO is blocked -> only reboot can solve the hang. > > The crash does not happen immediately, as i was on the road using > cellular connection it seemed fine at first. That's how I found out that > it transfers ~140MB of data before hanging. The snapshot is created on > the server and contains data (du shows abt 140MB). > > I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs > 4.17.1 compiled from source. > > Here's the btrfs filesystem info: > Label: none uuid: a914c141-72bf-448b-847f-d64ee82d8b7b > Total devices 1 FS bytes used 342.85GiB > devid 1 size 875.44GiB used 357.05GiB path > /dev/mapper/sda3_crypt > > A scrub shows no errors: > scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b > scrub started at Thu Sep 13 10:20:18 2018 and finished after > 00:12:19 > total bytes scrubbed: 342.78GiB with 0 errors > > What can I do to help debugging this issue? You should provide output of echo w > /proc/sysrq-trigger. Also sample the stack of /proc/[pid of btrfs send]/stack to see if it is changing. > > Best regards, > Jürgen
btrfs send hangs after partial transfer and blocks all IO
Hello! I have a newly installed laptop running a freshly installed (abt. two months ago) laptop running latest linux mint 19. Root filesystem is on a 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G partition. Timeshift-btrfs is enabled for root (@) and home (@home) subvolumes. I want to transfer snapshots to a server with a separated disk via "btrfs send" and ssh. Here's the list of snapshot directories, each containing tow snapshots for root and home: drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01 drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02 drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01 drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01 drwxr-xr-x 1 root root 30 Sep 6 20:00 2018-09-06_20-00-01 drwxr-xr-x 1 root root 30 Sep 6 22:00 2018-09-06_22-00-01 drwxr-xr-x 1 root root 30 Sep 8 16:00 2018-09-08_16-00-01 drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02 drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02 drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01 "btrfs send /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ >/dev/null" results in the btrfs task taking 100% cpu time on one cpu and then all IO is blocked -> only reboot can solve the hang. The crash does not happen immediately, as i was on the road using cellular connection it seemed fine at first. That's how I found out that it transfers ~140MB of data before hanging. The snapshot is created on the server and contains data (du shows abt 140MB). I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs 4.17.1 compiled from source. Here's the btrfs filesystem info: Label: none uuid: a914c141-72bf-448b-847f-d64ee82d8b7b Total devices 1 FS bytes used 342.85GiB devid1 size 875.44GiB used 357.05GiB path /dev/mapper/sda3_crypt A scrub shows no errors: scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b scrub started at Thu Sep 13 10:20:18 2018 and finished after 00:12:19 total bytes scrubbed: 342.78GiB with 0 errors What can I do to help debugging this issue? Best regards, Jürgen -- Jürgen Herrmann https://t-5.eu ALbertstraße 2 94327 Bogen