Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-20 Thread Chris Murphy
On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann  wrote:
> Am 13.9.2018 14:35, schrieb Nikolay Borisov:
>>
>> On 13.09.2018 15:30, Jürgen Herrmann wrote:
>>>
>>> OK, I will install kdump later and perform a dump after the hang.
>>>
>>> One more noob question beforehand: does this dump contain sensitive
>>> information, for example the luks encryption key for the disk etc? A
>>> Google search only brings up one relevant search result which can only
>>> be viewed with a redhat subscription...
>>
>>
>>
>> So a kdump will dump the kernel memory so it's possible that the LUKS
>> encryption keys could be extracted from that image. Bummer, it's
>> understandable why you wouldn't want to upload it :). In this case you'd
>> have to install also the 'crash' utility to open the crashdump and
>> extract the calltrace of the btrfs process. The rough process should be :
>>
>>
>> crash 'path to vm linux' 'path to vmcore file', then once inside the
>> crash utility :
>>
>> set , you can acquire the pid by issuing 'ps'
>> which will give you a ps-like output of all running processes at the
>> time of crash. After the context has been set you can run 'bt' which
>> will give you a backtrace of the send process.
>>
>>
>>
>>>
>>> Best regards,
>>> Jürgen
>>>
>>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov
>>> :
>>>
 On 13.09.2018 14:50, Jürgen Herrmann wrote:
>
> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
> after the hang because I started the loop before the hang. The dmesg
> output should show the hanging tasks from second 346 on or so. Still
> not
> useful?
>

 So from 346 it's evident that transaction commit is waiting for
 commit_root_sem to be acquired. So something else is holding it and not
 giving the transaction chance to finish committing. Now the only place
 where send acquires this lock is in find_extent_clone around the  call
 to extent_from_logical. The latter basically does an extent tree search
 and doesn't loop so can't possibly deadlock. Furthermore I don't see any
 userspace processes being hung in kernel space.

 Additionally looking at the userspace processes they indicate that
 find_extent_clone has finished and are blocked in send_write_or_clone
 which does the write. But I guess this actually happens before the hang.


 So at this point without looking at the stacktrace of the btrfs send
 process after the hung has occurred I don't think much can be done
>>>
>>>
> I know this is probably not the correct list to ask this question but maybe
> someone of the devs can point me to the right list?
>
> I cannot get kdump to work. The crashkernel is loaded and everything is
> setup for it afaict. I asked a question on this over at stackexchange but no
> answer yet.
> https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing
>
> So i did a little digging and added some debug printk() statements to see
> whats going on and it seems that panic() is never called. maybe the second
> stack trace is the reason?
> Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN
>
> Could someone please tell me where I can report this problem and get some
> help on this topic?


Try kexec mailing list. They handle kdump.

http://lists.infradead.org/mailman/listinfo/kexec



-- 
Chris Murphy


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-19 Thread Jürgen Herrmann

Am 13.9.2018 14:35, schrieb Nikolay Borisov:

On 13.09.2018 15:30, Jürgen Herrmann wrote:

OK, I will install kdump later and perform a dump after the hang.

One more noob question beforehand: does this dump contain sensitive
information, for example the luks encryption key for the disk etc? A
Google search only brings up one relevant search result which can only
be viewed with a redhat subscription...



So a kdump will dump the kernel memory so it's possible that the LUKS
encryption keys could be extracted from that image. Bummer, it's
understandable why you wouldn't want to upload it :). In this case 
you'd

have to install also the 'crash' utility to open the crashdump and
extract the calltrace of the btrfs process. The rough process should be 
:



crash 'path to vm linux' 'path to vmcore file', then once inside the
crash utility :

set , you can acquire the pid by issuing 
'ps'

which will give you a ps-like output of all running processes at the
time of crash. After the context has been set you can run 'bt' which
will give you a backtrace of the send process.





Best regards,
Jürgen

Am 13. September 2018 14:02:11 schrieb Nikolay Borisov 
:



On 13.09.2018 14:50, Jürgen Herrmann wrote:
I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work 
also

after the hang because I started the loop before the hang. The dmesg
output should show the hanging tasks from second 346 on or so. Still 
not

useful?



So from 346 it's evident that transaction commit is waiting for
commit_root_sem to be acquired. So something else is holding it and 
not
giving the transaction chance to finish committing. Now the only 
place
where send acquires this lock is in find_extent_clone around the  
call
to extent_from_logical. The latter basically does an extent tree 
search
and doesn't loop so can't possibly deadlock. Furthermore I don't see 
any

userspace processes being hung in kernel space.

Additionally looking at the userspace processes they indicate that
find_extent_clone has finished and are blocked in send_write_or_clone
which does the write. But I guess this actually happens before the 
hang.



So at this point without looking at the stacktrace of the btrfs send
process after the hung has occurred I don't think much can be done


I know this is probably not the correct list to ask this question but 
maybe someone of the devs can point me to the right list?


I cannot get kdump to work. The crashkernel is loaded and everything is 
setup for it afaict. I asked a question on this over at stackexchange 
but no answer yet.

https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing

So i did a little digging and added some debug printk() statements to 
see whats going on and it seems that panic() is never called. maybe the 
second stack trace is the reason?

Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN

Could someone please tell me where I can report this problem and get 
some help on this topic?


Best regards,
Jürgen

--
Jürgen Herrmann
https://t-5.eu
ALbertstraße 2
94327 Bogen


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-19 Thread Jürgen Herrmann

Am 13.9.2018 18:22, schrieb Chris Murphy:

(resend to all)

On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov  
wrote:



On 13.09.2018 18:30, Chris Murphy wrote:

This is the 2nd or 3rd thread containing hanging btrfs send, with
kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
and the other I can't find at the moment. In that case though the 
hang

is reproducible in 4.14.x and weirdly it only happens when a snapshot
contains (perhaps many) reflinks. Scrub and check lowmem find nothing
wrong.

I have snapshots with a few reflinks (cp --reflink and also
deduplication), and I see maybe 15-30 second hangs where nothing is
apparently happening (in top or iotop), but I'm also not seeing any
blocked tasks or high CPU usage. Perhaps in my case it's just
recovering quickly.

Are there any kernel config options in "# Debug Lockups and Hangs"
that might hint at what's going on? Some of these are enabled in
Fedora debug kernels, which are built practically daily, e.g. right
now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
translates to git 54eda9df17f3.


If it's a lock-related problem then you need Lock Debugging => Lock
debugging: prove locking correctness


OK looks like that's under a different section as CONFIG_PROVE_LOCKING
which is enabled on Fedora debug kernels.


# Debug Lockups and Hangs
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
# Lock Debugging (spinlocks, mutexes, etc...)
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_LOCK_TORTURE_TEST=m


Hello again!

I have CONFIG_PROVE_LOCKING enabled in my kernel, but no change to
the observed behaviour.

Best regards, Jürgen


--
Jürgen Herrmann
https://t-5.eu
ALbertstraße 2
94327 Bogen


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Chris Murphy
(resend to all)

On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov  wrote:
>
>
> On 13.09.2018 18:30, Chris Murphy wrote:
>> This is the 2nd or 3rd thread containing hanging btrfs send, with
>> kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
>> and the other I can't find at the moment. In that case though the hang
>> is reproducible in 4.14.x and weirdly it only happens when a snapshot
>> contains (perhaps many) reflinks. Scrub and check lowmem find nothing
>> wrong.
>>
>> I have snapshots with a few reflinks (cp --reflink and also
>> deduplication), and I see maybe 15-30 second hangs where nothing is
>> apparently happening (in top or iotop), but I'm also not seeing any
>> blocked tasks or high CPU usage. Perhaps in my case it's just
>> recovering quickly.
>>
>> Are there any kernel config options in "# Debug Lockups and Hangs"
>> that might hint at what's going on? Some of these are enabled in
>> Fedora debug kernels, which are built practically daily, e.g. right
>> now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
>> translates to git 54eda9df17f3.
>
> If it's a lock-related problem then you need Lock Debugging => Lock
> debugging: prove locking correctness

OK looks like that's under a different section as CONFIG_PROVE_LOCKING
which is enabled on Fedora debug kernels.


# Debug Lockups and Hangs
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
# Lock Debugging (spinlocks, mutexes, etc...)
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_LOCK_TORTURE_TEST=m



-- 
Chris Murphy


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Nikolay Borisov



On 13.09.2018 18:30, Chris Murphy wrote:
> This is the 2nd or 3rd thread containing hanging btrfs send, with
> kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
> and the other I can't find at the moment. In that case though the hang
> is reproducible in 4.14.x and weirdly it only happens when a snapshot
> contains (perhaps many) reflinks. Scrub and check lowmem find nothing
> wrong.
> 
> I have snapshots with a few reflinks (cp --reflink and also
> deduplication), and I see maybe 15-30 second hangs where nothing is
> apparently happening (in top or iotop), but I'm also not seeing any
> blocked tasks or high CPU usage. Perhaps in my case it's just
> recovering quickly.
> 
> Are there any kernel config options in "# Debug Lockups and Hangs"
> that might hint at what's going on? Some of these are enabled in
> Fedora debug kernels, which are built practically daily, e.g. right
> now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
> translates to git 54eda9df17f3.

If it's a lock-related problem then you need Lock Debugging => Lock
debugging: prove locking correctness

> 
> 
> Chris Murphy
> 


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Chris Murphy
This is the 2nd or 3rd thread containing hanging btrfs send, with
kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
and the other I can't find at the moment. In that case though the hang
is reproducible in 4.14.x and weirdly it only happens when a snapshot
contains (perhaps many) reflinks. Scrub and check lowmem find nothing
wrong.

I have snapshots with a few reflinks (cp --reflink and also
deduplication), and I see maybe 15-30 second hangs where nothing is
apparently happening (in top or iotop), but I'm also not seeing any
blocked tasks or high CPU usage. Perhaps in my case it's just
recovering quickly.

Are there any kernel config options in "# Debug Lockups and Hangs"
that might hint at what's going on? Some of these are enabled in
Fedora debug kernels, which are built practically daily, e.g. right
now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
translates to git 54eda9df17f3.


Chris Murphy


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Nikolay Borisov



On 13.09.2018 15:30, Jürgen Herrmann wrote:
> OK, I will install kdump later and perform a dump after the hang.
> 
> One more noob question beforehand: does this dump contain sensitive
> information, for example the luks encryption key for the disk etc? A
> Google search only brings up one relevant search result which can only
> be viewed with a redhat subscription...


So a kdump will dump the kernel memory so it's possible that the LUKS
encryption keys could be extracted from that image. Bummer, it's
understandable why you wouldn't want to upload it :). In this case you'd
have to install also the 'crash' utility to open the crashdump and
extract the calltrace of the btrfs process. The rough process should be :


crash 'path to vm linux' 'path to vmcore file', then once inside the
crash utility :

set , you can acquire the pid by issuing 'ps'
which will give you a ps-like output of all running processes at the
time of crash. After the context has been set you can run 'bt' which
will give you a backtrace of the send process.



> 
> Best regards,
> Jürgen
> 
> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov :
> 
>> On 13.09.2018 14:50, Jürgen Herrmann wrote:
>>> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
>>> after the hang because I started the loop before the hang. The dmesg
>>> output should show the hanging tasks from second 346 on or so. Still not
>>> useful?
>>>
>>
>> So from 346 it's evident that transaction commit is waiting for
>> commit_root_sem to be acquired. So something else is holding it and not
>> giving the transaction chance to finish committing. Now the only place
>> where send acquires this lock is in find_extent_clone around the  call
>> to extent_from_logical. The latter basically does an extent tree search
>> and doesn't loop so can't possibly deadlock. Furthermore I don't see any
>> userspace processes being hung in kernel space.
>>
>> Additionally looking at the userspace processes they indicate that
>> find_extent_clone has finished and are blocked in send_write_or_clone
>> which does the write. But I guess this actually happens before the hang.
>>
>>
>> So at this point without looking at the stacktrace of the btrfs send
>> process after the hung has occurred I don't think much can be done
> 
> 
> Mit AquaMail Android
> https://www.mobisystems.com/aqua-mail
> 
> 
> 


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Jürgen Herrmann

OK, I will install kdump later and perform a dump after the hang.

One more noob question beforehand: does this dump contain sensitive 
information, for example the luks encryption key for the disk etc? A Google 
search only brings up one relevant search result which can only be viewed 
with a redhat subscription...


Best regards,
Jürgen

Am 13. September 2018 14:02:11 schrieb Nikolay Borisov :


On 13.09.2018 14:50, Jürgen Herrmann wrote:

I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
after the hang because I started the loop before the hang. The dmesg
output should show the hanging tasks from second 346 on or so. Still not
useful?



So from 346 it's evident that transaction commit is waiting for
commit_root_sem to be acquired. So something else is holding it and not
giving the transaction chance to finish committing. Now the only place
where send acquires this lock is in find_extent_clone around the  call
to extent_from_logical. The latter basically does an extent tree search
and doesn't loop so can't possibly deadlock. Furthermore I don't see any
userspace processes being hung in kernel space.

Additionally looking at the userspace processes they indicate that
find_extent_clone has finished and are blocked in send_write_or_clone
which does the write. But I guess this actually happens before the hang.


So at this point without looking at the stacktrace of the btrfs send
process after the hung has occurred I don't think much can be done



Mit AquaMail Android
https://www.mobisystems.com/aqua-mail




Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Nikolay Borisov



On 13.09.2018 14:50, Jürgen Herrmann wrote:
> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
> after the hang because I started the loop before the hang. The dmesg
> output should show the hanging tasks from second 346 on or so. Still not
> useful?
> 

So from 346 it's evident that transaction commit is waiting for
commit_root_sem to be acquired. So something else is holding it and not
giving the transaction chance to finish committing. Now the only place
where send acquires this lock is in find_extent_clone around the  call
to extent_from_logical. The latter basically does an extent tree search
and doesn't loop so can't possibly deadlock. Furthermore I don't see any
userspace processes being hung in kernel space.

Additionally looking at the userspace processes they indicate that
find_extent_clone has finished and are blocked in send_write_or_clone
which does the write. But I guess this actually happens before the hang.


So at this point without looking at the stacktrace of the btrfs send
process after the hung has occurred I don't think much can be done



Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Jürgen Herrmann
I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also 
after the hang because I started the loop before the hang. The dmesg output 
should show the hanging tasks from second 346 on or so. Still not useful?


Best regards,
Jürgen

Am 13. September 2018 13:04:39 schrieb Nikolay Borisov :


On 13.09.2018 13:56, Jürgen Herrmann wrote:

Both loops were started before the hang because after the hang I cannot
do that anymore. That's why there is progress in the logs at first. The
hang continues for at least 1.5 hours. No data is transferred anymore
during this time. I never waited longer than 1.5 hours.


So these logs don't provide any useful information then. The other thing
which I can advise is to setup kdump and when the kernel hangs cause a
crashdump to be taken and try to upload it somewhere alongside your
vmlinux file for further debugging.




Best regards,
Jürgen

Am 13. September 2018 12:50:59 schrieb Nikolay Borisov :


On 13.09.2018 13:29, Jürgen Herrmann wrote:

Am 13.9.2018 10:40, schrieb Nikolay Borisov:

On 13.09.2018 11:34, Jürgen Herrmann wrote:

Hello!

I have a newly installed laptop running a freshly installed (abt. two
months ago) laptop running latest linux mint 19. Root filesystem is
on a
1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
partition. Timeshift-btrfs is enabled for root (@) and home (@home)
subvolumes. I want to transfer snapshots to a server with a separated
disk via "btrfs send" and ssh.

Here's the list of snapshot directories, each containing tow snapshots
for root and home:

drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01

"btrfs send
/mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@

/dev/null" results in the btrfs task taking 100% cpu time on one cpu

and then all IO is blocked -> only reboot can solve the hang.

The crash does not happen immediately, as i was on the road using
cellular connection it seemed fine at first. That's how I found out
that
it transfers ~140MB of data before hanging. The snapshot is created on
the server and contains data (du shows abt 140MB).

I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs
progs
4.17.1 compiled from source.

Here's the btrfs filesystem info:
Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
Total devices 1 FS bytes used 342.85GiB
devid1 size 875.44GiB used 357.05GiB path
/dev/mapper/sda3_crypt

A scrub shows no errors:
scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
scrub started at Thu Sep 13 10:20:18 2018 and finished after
00:12:19
total bytes scrubbed: 342.78GiB with 0 errors

What can I do to help debugging this issue?



You should provide output of echo w > /proc/sysrq-trigger. Also
sample the stack of /proc/[pid of btrfs send]/stack to see if it is
changing.




Best regards,
Jürgen


Hello!

dmesg output can be found here:
https://pastebin.com/g86dPGSZ


So from what I see current transaction commit is waiting for
root->commit_root_sem and then other threads (in this case systemd) is
waiting for transaction commit to finish.


stacks can be found here:
https://pastebin.com/dCt1YgJp


ANd your user process seems to be making some progress as evident from
the fact that the call trace of the process is actually changing over
the course of sampling. Is it possible that it just takes time to do the
IO ?


Best regards,
Jürgen



Mit AquaMail Android
https://www.mobisystems.com/aqua-mail






Mit AquaMail Android
https://www.mobisystems.com/aqua-mail




Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Nikolay Borisov



On 13.09.2018 13:56, Jürgen Herrmann wrote:
> Both loops were started before the hang because after the hang I cannot
> do that anymore. That's why there is progress in the logs at first. The
> hang continues for at least 1.5 hours. No data is transferred anymore
> during this time. I never waited longer than 1.5 hours.

So these logs don't provide any useful information then. The other thing
which I can advise is to setup kdump and when the kernel hangs cause a
crashdump to be taken and try to upload it somewhere alongside your
vmlinux file for further debugging.


> 
> Best regards,
> Jürgen
> 
> Am 13. September 2018 12:50:59 schrieb Nikolay Borisov :
> 
>> On 13.09.2018 13:29, Jürgen Herrmann wrote:
>>> Am 13.9.2018 10:40, schrieb Nikolay Borisov:
 On 13.09.2018 11:34, Jürgen Herrmann wrote:
> Hello!
>
> I have a newly installed laptop running a freshly installed (abt. two
> months ago) laptop running latest linux mint 19. Root filesystem is
> on a
> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
> subvolumes. I want to transfer snapshots to a server with a separated
> disk via "btrfs send" and ssh.
>
> Here's the list of snapshot directories, each containing tow snapshots
> for root and home:
>
> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
>
> "btrfs send
> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
>> /dev/null" results in the btrfs task taking 100% cpu time on one cpu
> and then all IO is blocked -> only reboot can solve the hang.
>
> The crash does not happen immediately, as i was on the road using
> cellular connection it seemed fine at first. That's how I found out
> that
> it transfers ~140MB of data before hanging. The snapshot is created on
> the server and contains data (du shows abt 140MB).
>
> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs
> progs
> 4.17.1 compiled from source.
>
> Here's the btrfs filesystem info:
> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>     Total devices 1 FS bytes used 342.85GiB
>     devid    1 size 875.44GiB used 357.05GiB path
> /dev/mapper/sda3_crypt
>
> A scrub shows no errors:
> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>     scrub started at Thu Sep 13 10:20:18 2018 and finished after
> 00:12:19
>     total bytes scrubbed: 342.78GiB with 0 errors
>
> What can I do to help debugging this issue?


 You should provide output of echo w > /proc/sysrq-trigger. Also
 sample the stack of /proc/[pid of btrfs send]/stack to see if it is
 changing.


>
> Best regards,
> Jürgen
>>>
>>> Hello!
>>>
>>> dmesg output can be found here:
>>> https://pastebin.com/g86dPGSZ
>>
>> So from what I see current transaction commit is waiting for
>> root->commit_root_sem and then other threads (in this case systemd) is
>> waiting for transaction commit to finish.
>>>
>>> stacks can be found here:
>>> https://pastebin.com/dCt1YgJp
>>
>> ANd your user process seems to be making some progress as evident from
>> the fact that the call trace of the process is actually changing over
>> the course of sampling. Is it possible that it just takes time to do the
>> IO ?
>>>
>>> Best regards,
>>> Jürgen
> 
> 
> Mit AquaMail Android
> https://www.mobisystems.com/aqua-mail
> 
> 
> 


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Jürgen Herrmann
Both loops were started before the hang because after the hang I cannot do 
that anymore. That's why there is progress in the logs at first. The hang 
continues for at least 1.5 hours. No data is transferred anymore during 
this time. I never waited longer than 1.5 hours.


Best regards,
Jürgen

Am 13. September 2018 12:50:59 schrieb Nikolay Borisov :


On 13.09.2018 13:29, Jürgen Herrmann wrote:

Am 13.9.2018 10:40, schrieb Nikolay Borisov:

On 13.09.2018 11:34, Jürgen Herrmann wrote:

Hello!

I have a newly installed laptop running a freshly installed (abt. two
months ago) laptop running latest linux mint 19. Root filesystem is on a
1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
partition. Timeshift-btrfs is enabled for root (@) and home (@home)
subvolumes. I want to transfer snapshots to a server with a separated
disk via "btrfs send" and ssh.

Here's the list of snapshot directories, each containing tow snapshots
for root and home:

drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01

"btrfs send
/mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@

/dev/null" results in the btrfs task taking 100% cpu time on one cpu

and then all IO is blocked -> only reboot can solve the hang.

The crash does not happen immediately, as i was on the road using
cellular connection it seemed fine at first. That's how I found out that
it transfers ~140MB of data before hanging. The snapshot is created on
the server and contains data (du shows abt 140MB).

I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs
4.17.1 compiled from source.

Here's the btrfs filesystem info:
Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
Total devices 1 FS bytes used 342.85GiB
devid1 size 875.44GiB used 357.05GiB path
/dev/mapper/sda3_crypt

A scrub shows no errors:
scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
scrub started at Thu Sep 13 10:20:18 2018 and finished after
00:12:19
total bytes scrubbed: 342.78GiB with 0 errors

What can I do to help debugging this issue?



You should provide output of echo w > /proc/sysrq-trigger. Also
sample the stack of /proc/[pid of btrfs send]/stack to see if it is
changing.




Best regards,
Jürgen


Hello!

dmesg output can be found here:
https://pastebin.com/g86dPGSZ


So from what I see current transaction commit is waiting for
root->commit_root_sem and then other threads (in this case systemd) is
waiting for transaction commit to finish.


stacks can be found here:
https://pastebin.com/dCt1YgJp


ANd your user process seems to be making some progress as evident from
the fact that the call trace of the process is actually changing over
the course of sampling. Is it possible that it just takes time to do the
IO ?


Best regards,
Jürgen



Mit AquaMail Android
https://www.mobisystems.com/aqua-mail




Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Nikolay Borisov



On 13.09.2018 13:29, Jürgen Herrmann wrote:
> Am 13.9.2018 10:40, schrieb Nikolay Borisov:
>> On 13.09.2018 11:34, Jürgen Herrmann wrote:
>>> Hello!
>>>
>>> I have a newly installed laptop running a freshly installed (abt. two
>>> months ago) laptop running latest linux mint 19. Root filesystem is on a
>>> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
>>> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
>>> subvolumes. I want to transfer snapshots to a server with a separated
>>> disk via "btrfs send" and ssh.
>>>
>>> Here's the list of snapshot directories, each containing tow snapshots
>>> for root and home:
>>>
>>> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
>>> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
>>> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
>>> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
>>> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
>>> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
>>> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
>>> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
>>> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
>>> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
>>>
>>> "btrfs send
>>> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
 /dev/null" results in the btrfs task taking 100% cpu time on one cpu
>>> and then all IO is blocked -> only reboot can solve the hang.
>>>
>>> The crash does not happen immediately, as i was on the road using
>>> cellular connection it seemed fine at first. That's how I found out that
>>> it transfers ~140MB of data before hanging. The snapshot is created on
>>> the server and contains data (du shows abt 140MB).
>>>
>>> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs
>>> 4.17.1 compiled from source.
>>>
>>> Here's the btrfs filesystem info:
>>> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>>>     Total devices 1 FS bytes used 342.85GiB
>>>     devid    1 size 875.44GiB used 357.05GiB path
>>> /dev/mapper/sda3_crypt
>>>
>>> A scrub shows no errors:
>>> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>>>     scrub started at Thu Sep 13 10:20:18 2018 and finished after
>>> 00:12:19
>>>     total bytes scrubbed: 342.78GiB with 0 errors
>>>
>>> What can I do to help debugging this issue?
>>
>>
>> You should provide output of echo w > /proc/sysrq-trigger. Also
>> sample the stack of /proc/[pid of btrfs send]/stack to see if it is
>> changing.
>>
>>
>>>
>>> Best regards,
>>> Jürgen
> 
> Hello!
> 
> dmesg output can be found here:
> https://pastebin.com/g86dPGSZ

So from what I see current transaction commit is waiting for
root->commit_root_sem and then other threads (in this case systemd) is
waiting for transaction commit to finish.
> 
> stacks can be found here:
> https://pastebin.com/dCt1YgJp

ANd your user process seems to be making some progress as evident from
the fact that the call trace of the process is actually changing over
the course of sampling. Is it possible that it just takes time to do the
IO ?
> 
> Best regards,
> Jürgen


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Jürgen Herrmann

Am 13.9.2018 10:40, schrieb Nikolay Borisov:

On 13.09.2018 11:34, Jürgen Herrmann wrote:

Hello!

I have a newly installed laptop running a freshly installed (abt. two
months ago) laptop running latest linux mint 19. Root filesystem is on 
a

1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
partition. Timeshift-btrfs is enabled for root (@) and home (@home)
subvolumes. I want to transfer snapshots to a server with a separated
disk via "btrfs send" and ssh.

Here's the list of snapshot directories, each containing tow snapshots
for root and home:

drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01

"btrfs send
/mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@

/dev/null" results in the btrfs task taking 100% cpu time on one cpu

and then all IO is blocked -> only reboot can solve the hang.

The crash does not happen immediately, as i was on the road using
cellular connection it seemed fine at first. That's how I found out 
that

it transfers ~140MB of data before hanging. The snapshot is created on
the server and contains data (du shows abt 140MB).

I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs 
progs

4.17.1 compiled from source.

Here's the btrfs filesystem info:
Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
    Total devices 1 FS bytes used 342.85GiB
    devid    1 size 875.44GiB used 357.05GiB path
/dev/mapper/sda3_crypt

A scrub shows no errors:
scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
    scrub started at Thu Sep 13 10:20:18 2018 and finished after
00:12:19
    total bytes scrubbed: 342.78GiB with 0 errors

What can I do to help debugging this issue?



You should provide output of echo w > /proc/sysrq-trigger. Also
sample the stack of /proc/[pid of btrfs send]/stack to see if it is
changing.




Best regards,
Jürgen


Hello!

dmesg output can be found here:
https://pastebin.com/g86dPGSZ

stacks can be found here:
https://pastebin.com/dCt1YgJp

Best regards,
Jürgen
--
Jürgen Herrmann
https://t-5.eu
ALbertstraße 2
94327 Bogen


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Nikolay Borisov



On 13.09.2018 11:34, Jürgen Herrmann wrote:
> Hello!
> 
> I have a newly installed laptop running a freshly installed (abt. two
> months ago) laptop running latest linux mint 19. Root filesystem is on a
> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
> subvolumes. I want to transfer snapshots to a server with a separated
> disk via "btrfs send" and ssh.
> 
> Here's the list of snapshot directories, each containing tow snapshots
> for root and home:
> 
> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
> 
> "btrfs send
> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
>>/dev/null" results in the btrfs task taking 100% cpu time on one cpu
> and then all IO is blocked -> only reboot can solve the hang.
> 
> The crash does not happen immediately, as i was on the road using
> cellular connection it seemed fine at first. That's how I found out that
> it transfers ~140MB of data before hanging. The snapshot is created on
> the server and contains data (du shows abt 140MB).
> 
> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs
> 4.17.1 compiled from source.
> 
> Here's the btrfs filesystem info:
> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>     Total devices 1 FS bytes used 342.85GiB
>     devid    1 size 875.44GiB used 357.05GiB path
> /dev/mapper/sda3_crypt
> 
> A scrub shows no errors:
> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>     scrub started at Thu Sep 13 10:20:18 2018 and finished after
> 00:12:19
>     total bytes scrubbed: 342.78GiB with 0 errors
> 
> What can I do to help debugging this issue?


You should provide output of echo w > /proc/sysrq-trigger. Also
sample the stack of /proc/[pid of btrfs send]/stack to see if it is
changing.


> 
> Best regards,
> Jürgen


btrfs send hangs after partial transfer and blocks all IO

2018-09-13 Thread Jürgen Herrmann

Hello!

I have a newly installed laptop running a freshly installed (abt. two 
months ago) laptop running latest linux mint 19. Root filesystem is on a 
1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G 
partition. Timeshift-btrfs is enabled for root (@) and home (@home) 
subvolumes. I want to transfer snapshots to a server with a separated 
disk via "btrfs send" and ssh.


Here's the list of snapshot directories, each containing tow snapshots 
for root and home:


drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01

"btrfs send 
/mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ 
>/dev/null" results in the btrfs task taking 100% cpu time on one cpu and then all IO is blocked -> only reboot can solve the hang.


The crash does not happen immediately, as i was on the road using 
cellular connection it seemed fine at first. That's how I found out that 
it transfers ~140MB of data before hanging. The snapshot is created on 
the server and contains data (du shows abt 140MB).


I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs 
4.17.1 compiled from source.


Here's the btrfs filesystem info:
Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
Total devices 1 FS bytes used 342.85GiB
devid1 size 875.44GiB used 357.05GiB path 
/dev/mapper/sda3_crypt


A scrub shows no errors:
scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
scrub started at Thu Sep 13 10:20:18 2018 and finished after 
00:12:19

total bytes scrubbed: 342.78GiB with 0 errors

What can I do to help debugging this issue?

Best regards,
Jürgen
--
Jürgen Herrmann
https://t-5.eu
ALbertstraße 2
94327 Bogen