Re: btrfs send hung in pipe_wait

2018-09-09 Thread Chris Murphy
On Sun, Sep 9, 2018 at 2:16 PM, Stefan Loewen  wrote:
> I'm not sure about the exact definition of "blocked" here, but I was
> also surprised that there were no blocked tasks listed since I'm
> definitely unable to kill (SIGKILL) that process.
> On the other hand it wakes up hourly to transfer a few bytes.
> The problem is definitely not, that I issued the sysrq too early. I
> think it was after about 45min of no IO.

Another one the devs have asked for in cases where things get slow or
hang without explicit blocked-task messages is sysrq+t. But I'm
throwing spaghetti at the wall at this point; none of it will fix the
problem, and I haven't learned how to read these outputs.
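If all sysrq functions are enabled, something like this (as root)
should capture it; just a sketch:

echo t > /proc/sysrq-trigger   # sysrq+t: dump every task's stack to the kernel log
journalctl -k | tail -n 300    # then grab the tail of the kernel log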



> So there is some problem with this "original" subvol. Maybe I should
> describe how that came into existence.
> Initially I had my data on a NTFS formatted drive. I then created a
> btrfs partition on my second drive and rsynced all my stuff over into
> the root subvol.
> Then I noticed that having all my data in the root subvol was a bad
> idea and created a "data" subvol and reflinked everything into it.
> I deleted the data from the root subvol, made a snapshot of the "data"
> subvol, tried sending that and ran into the problem we're discussing
> here.

That is interesting and useful information. I see nothing invalid
about it at all. However, just for future reference, it is possible to
snapshot the top level (default) subvolume.

By default, the top level subvolume (sometimes referred to as
subvolid=5 or subvolid=0) is what is mounted if you haven't used
'btrfs sub set-default' to change it. You can snapshot that subvolume
by snapshotting the mount point. e.g.

mount /dev/sda1 /mnt
btrfs sub snap /mnt /mnt/subvolume1

So now you have a read-write subvolume called "subvolume1" which
contains everything that was in the top level; you can then delete the
originals from the top level if you're trying to keep things tidy and
just have subvolumes and snapshots in the top level.
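Even if you've already used set-default, the top level can still be
mounted explicitly for the same trick (a sketch; the device name is
just an example):

mount -o subvolid=5 /dev/sda1 /mnt
btrfs sub snap /mnt /mnt/subvolume1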

Anyway, what you did is possibly relevant to the problem. But if it
turns out it's the cause of the problem, it's definitely a bug.


>
> btrfs check in lowmem mode did not find any errors either:
>
> $ sudo btrfs check --mode=lowmem --progress /dev/sdb1
> Opening filesystem to check...
> Checking filesystem on /dev/sdb1
> UUID: cd786597-3816-40e7-bf6c-d585265ad372
> [1/7] checking root items                      (0:00:30 elapsed, 1047408 items checked)
> [2/7] checking extents                         (0:03:55 elapsed, 309170 items checked)
> cache and super generation don't match, space cache will be invalidated
> [3/7] checking free space cache                (0:00:00 elapsed)
> [4/7] checking fs roots                        (0:04:07 elapsed, 85373 items checked)
> [5/7] checking csums (without verifying data)  (0:00:00 elapsed, 253106 items checked)
> [6/7] checking root refs done with fs roots in lowmem mode, skipping
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 708354711552 bytes used, no error found
> total csum bytes: 689206904
> total tree bytes: 2423865344
> total fs tree bytes: 1542914048
> total extent tree bytes: 129843200
> btree space waste bytes: 299191292
> file data blocks allocated: 31709967417344
> referenced 928531877888

OK good to know.


-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-08 Thread Chris Murphy
I don't see any blocked tasks. I wonder if you were too fast with
sysrq+w? Maybe it takes a little bit for the blocked task to actually
develop?

I suggest also 'btrfs check --mode=lowmem' because that is a separate
implementation of btrfs check that tends to catch different things
than the original. It is slow, however.

-- 
Chris Murphy


Fwd: btrfs send hung in pipe_wait

2018-09-08 Thread Stefan Loewen
And this one as well.

-- Forwarded message -
From: Chris Murphy 
Date: Fri, Sep 7, 2018, 11:53 PM
Subject: Re: btrfs send hung in pipe_wait
To: Stefan Loewen 
Cc: Chris Murphy 


On Fri, Sep 7, 2018 at 3:19 PM, Stefan Loewen  wrote:
> Now I also tested with Fedora 28 (linux 4.16) from live-usb (so baremetal).
> Same result.
> Thanks for the pointer towards sysrq. sysrq-w is empty, but I
> sent a bunch of other sysrqs to get stacktraces etc. from the kernel.
> Logs are attached.

Needs a dev to take a look at this, or someone with usb or block
device knowledge to see if something unrelated to btrfs is hung up. I
can't parse it.

Using btrfs-progs 4.17.1, what do you get for 'btrfs check
--mode=lowmem <device>'?



--
Chris Murphy


Fwd: btrfs send hung in pipe_wait

2018-09-08 Thread Stefan Loewen
Oops, forgot to CC the mailing list.

-- Forwarded message -
From: Stefan Loewen 
Date: Fri, Sep 7, 2018, 11:19 PM
Subject: Re: btrfs send hung in pipe_wait
To: Chris Murphy 


No, it does not only happen in VirtualBox. I already tested the following:
- Manjaro baremetal (btrfs-progs v4.17.1; linux v4.18.5 and v4.14.67)
- Ubuntu 18.04 in VirtualBox (btrfs-progs v4.15.1; linux v4.15.0-33-generic)
- ArchLinux in VirtualBox (btrfs-progs v4.17.1; linux v4.18.5-arch1-1-ARCH)
The logs I posted until now were mostly (iirc all of them) from the VM
with Arch.

Now I also tested with Fedora 28 (linux 4.16) from live-usb (so baremetal).
Same result.
Thanks for the pointer towards sysrq. sysrq-w is empty, but I
sent a bunch of other sysrqs to get stacktraces etc. from the kernel.
Logs are attached.

To recap:
I copied (reflink) a 3.8G iso file from a read-only snapshot (A) into
a new subvol (B) just to keep things small and manageable.
There's nothing special about this file other than that it happens to
be one of the files that triggers the later btrfs-send hang.
Not all files from A do this, but there are definitely multiple, and
the problem only occurs when the files are reflinked.
I then create a snapshot (C) of B to be able to btrfs-send it.
Then I run 'btrfs send snap-C > /somewhere' (--no-data leads to the
same result); that process reads some MB from the source disk, writes
a few bytes to /somewhere, and then just hangs without any further IO.
This is where I issued some sysrqs. The output is attached.
Then I tried killing the btrfs-send with ctrl-c and issued the sysrqs
again. I have no idea if that changed anything, but it didn't hurt, so
why not.
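In commands, the reproduction looks roughly like this (paths are
placeholders):

cp --reflink=always /mnt/@data/A/some.iso /mnt/@data/B/   # reflink out of ro snapshot A into subvol B
btrfs sub snap -r /mnt/@data/B /mnt/@data/snap-C          # read-only snapshot C of B
btrfs send /mnt/@data/snap-C > /somewhere                 # reads some MB, then hangs
btrfs send --no-data /mnt/@data/snap-C > /dev/null        # hangs the same way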

I'll try to minimize the dataset and maybe get a small fs-image
without too much personal information that I can upload so the issue
is reproducible by others.

On Fri, Sep 7, 2018 at 9:17 PM Chris Murphy wrote:
>
> On Fri, Sep 7, 2018 at 11:07 AM, Stefan Loewen  
> wrote:
> > List of steps:
> > - A 3.8G iso sits in read-only subvol A
> > - I create subvol B and reflink-copy the iso into it.
> > - I create a read-only snapshot C of B
> > - I "btrfs send --no-data C > /somefile"
> > So you got that right, yes.
>
> OK I can't reproduce it. Sending A and C complete instantly with
> --no-data, and complete in the same time with a full send/receive. In
> my case I used a 4.9G ISO.
>
> I can't think of what local difference accounts for what you're
> seeing. There is really nothing special about --reflinks. The extent
> and csum data are identical to the original file, and that's the bulk
> of the metadata for a given file.
>
> What I can tell you is usually the developers want to see sysrq+w
> whenever there are blocked tasks.
> https://fedoraproject.org/wiki/QA/Sysrq
>
> You'll want to enable all sysrq functions. And next you'll want three
> ssh shells:
>
> 1. sudo journalctl -fk
> 2. sudo -i to become root, and then echo w > /proc/sysrq-trigger but
> do not hit return yet
> 3. sudo btrfs send... to reproduce the problem.
>
> Basically the thing is gonna hang soon after you reproduce the
> problem, so you want to get to shell #2 and just hit return rather
> than dealing with long delays typing that echo command out. And then
> the journal command is so your local terminal captures the sysrq
> output because you're gonna kill the VM instead of waiting it out. I
> have no idea how to read these things but someone might pick up this
> thread and have some idea why these tasks are hanging.
>
>
>
>
> >
> > Unfortunately I don't have any way to connect the drive to a SATA port
> > directly but I tried to switch out as much of the used setup as
> > possible (all changes active at the same time):
> > - I got the original (not the clone) HDD out of the enclosure and used
> > this adapter to connect it:
> > https://www.amazon.de/DIGITUS-Adapterkabel-40pol-480Mbps-schwarz/dp/B007X86VZK
> > - I used a different Notebook
> > - I ran the test natively on that notebook (instead of from
> > VirtualBox; I used VirtualBox for most of the tests as I have to
> > force-poweroff the PC every time btrfs-send hangs, as it is not
> > killable)
>
>
> This problem only happens in VirtualBox? Or it happens on baremetal
> also? And we've established it happens with two different source
> (send) devices, which means two different Btrfs volumes.
>
> All I can say is you need to keep changing things up, process of
> elimination. Rather tedious. Maybe you could try downloading a Fedora
> 28 ISO, make a boot stick out of it, and try to reproduce with the
> same drives. At least that's an easy way to isolate the OS from the
> equation.
>
>
> --
> Chris Murphy


fedora-kernelmsgs.log.gz
Description: application/gzip


Re: btrfs send hung in pipe_wait

2018-09-07 Thread Chris Murphy
On Fri, Sep 7, 2018 at 11:07 AM, Stefan Loewen  wrote:
> List of steps:
> - A 3.8G iso sits in read-only subvol A
> - I create subvol B and reflink-copy the iso into it.
> - I create a read-only snapshot C of B
> - I "btrfs send --no-data C > /somefile"
> So you got that right, yes.

OK I can't reproduce it. Sending A and C complete instantly with
--no-data, and complete in the same time with a full send/receive. In
my case I used a 4.9G ISO.

I can't think of what local difference accounts for what you're
seeing. There is really nothing special about --reflinks. The extent
and csum data are identical to the original file, and that's the bulk
of the metadata for a given file.

What I can tell you is usually the developers want to see sysrq+w
whenever there are blocked tasks.
https://fedoraproject.org/wiki/QA/Sysrq

You'll want to enable all sysrq functions. And next you'll want three
ssh shells:

1. sudo journalctl -fk
2. sudo -i to become root, and then echo w > /proc/sysrq-trigger but
do not hit return yet
3. sudo btrfs send... to reproduce the problem.

Basically the thing is gonna hang soon after you reproduce the
problem, so you want to get to shell #2 and just hit return rather
than dealing with long delays typing that echo command out. And then
the journal command is so your local terminal captures the sysrq
output because you're gonna kill the VM instead of waiting it out. I
have no idea how to read these things but someone might pick up this
thread and have some idea why these tasks are hanging.
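As a rough sketch of the whole setup (paths are examples):

echo 1 > /proc/sys/kernel/sysrq        # one-time, as root: enable all sysrq functions
journalctl -fk                         # shell 1: follow kernel messages
echo w > /proc/sysrq-trigger           # shell 2: pre-typed, hit return once it hangs
btrfs send /mnt/pool/snap > /dev/null  # shell 3: reproduce the hang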




>
> Unfortunately I don't have any way to connect the drive to a SATA port
> directly but I tried to switch out as much of the used setup as
> possible (all changes active at the same time):
> - I got the original (not the clone) HDD out of the enclosure and used
> this adapter to connect it:
> https://www.amazon.de/DIGITUS-Adapterkabel-40pol-480Mbps-schwarz/dp/B007X86VZK
> - I used a different Notebook
> - I ran the test natively on that notebook (instead of from
> VirtualBox; I used VirtualBox for most of the tests as I have to
> force-poweroff the PC every time btrfs-send hangs, as it is not
> killable)


This problem only happens in VirtualBox? Or it happens on baremetal
also? And we've established it happens with two different source
(send) devices, which means two different Btrfs volumes.

All I can say is you need to keep changing things up, process of
elimination. Rather tedious. Maybe you could try downloading a Fedora
28 ISO, make a boot stick out of it, and try to reproduce with the
same drives. At least that's an easy way to isolate the OS from the
equation.


-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-07 Thread Stefan Loewen
List of steps:
- A 3.8G iso sits in read-only subvol A
- I create subvol B and reflink-copy the iso into it.
- I create a read-only snapshot C of B
- I "btrfs send --no-data C > /somefile"
So you got that right, yes.

Unfortunately I don't have any way to connect the drive to a SATA port
directly but I tried to switch out as much of the used setup as
possible (all changes active at the same time):
- I got the original (not the clone) HDD out of the enclosure and used
this adapter to connect it:
https://www.amazon.de/DIGITUS-Adapterkabel-40pol-480Mbps-schwarz/dp/B007X86VZK
- I used a different Notebook
- I ran the test natively on that notebook (instead of from
VirtualBox; I used VirtualBox for most of the tests as I have to
force-poweroff the PC every time btrfs-send hangs, as it is not
killable)
- That notebook runs Manjaro with "Linux asus 4.14.67-1-MANJARO #1 SMP
PREEMPT Fri Aug 24 16:33:26 UTC 2018 x86_64 GNU/Linux"

Same results. btrfs-send freezes.
On Fri, Sep 7, 2018 at 5:44 PM Chris Murphy wrote:
>
> On Fri, Sep 7, 2018 at 6:47 AM, Stefan Loewen  wrote:
> > Well... It seems it's not the hardware.
> > I ran a long SMART check which ran through without errors and
> > reallocation count is still 0.
>
> That only checks the drive, it's an internal test. It doesn't check
> anything else, including connections.
>
> Also you do have a log with a read error and a sector LBA reported. So
> there is a hardware issue, it could just be transient.
>
>
> > So I used clonezilla (partclone.btrfs) to mirror the drive to another
> > drive (same model).
> > Everything copied over just fine. No I/O error in dmesg.
> >
> > The new disk shows the same behavior.
>
> So now I'm suspicious of USB behavior. Like I said earlier, when I've
> got USB-enclosed drives connected to my NUC, regardless of file system,
> I routinely get hangs and USB resets. I have to connect all of my USB
> enclosed drives to a good USB hub, or I have problems.
>
>
>
> > So I created another subvolume, reflinked stuff over and found that it
> > is enough to reflink one file, create a read-only snapshot and try to
> > btrfs-send that. It's not happening with every file, but there are
> > definitely multiple different files. The one I tested with is a 3.8GB
> > ISO file.
> > Even better:
> > 'btrfs send --no-data snap-one > /dev/null'
> > (snap-one containing just one iso file) hangs as well.
>
> Do you have a list of steps to make this clear? It sounds like first
> you copy a 3.8G ISO file to one subvolume, then reflink copy it into
> another subvolume, then snapshot that 2nd subvolume, and try to send
> the snapshot? But I want to be clear.
>
> I've got piles of reflinked files in snapshots and they send OK,
> although like I said I do get sometimes a 15-30 second hang during
> sends.
>
> > Still dmesg shows no IO errors, only "INFO: task btrfs-transacti:541
> > blocked for more than 120 seconds." with associated call trace.
> > btrfs-send reads some MB in the beginning, writes a few bytes and then
> > hangs without further IO.
> >
> > copying the same file without --reflink, snapshotting and sending
> > works without problems.
> >
> > I guess that pretty much eliminates bad sectors and points towards
> > some problem with reflinks / btrfs metadata.
>
> That's pretty weird. I'll keep trying and see if I hit this. What
> happens if you downgrade to an older kernel? Either 4.14 or 4.17 or
> both. The send code is mainly in the kernel, whereas the receive
> code is mainly in the user space tools, so for this testing you don't
> need to downgrade the user space tools. If there's a bug here, I
> expect it's in the kernel.
>
>
>
>
> --
> Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-07 Thread Chris Murphy
On Fri, Sep 7, 2018 at 6:47 AM, Stefan Loewen  wrote:
> Well... It seems it's not the hardware.
> I ran a long SMART check which ran through without errors and
> reallocation count is still 0.

That only checks the drive, it's an internal test. It doesn't check
anything else, including connections.

Also you do have a log with a read error and a sector LBA reported. So
there is a hardware issue, it could just be transient.


> So I used clonezilla (partclone.btrfs) to mirror the drive to another
> drive (same model).
> Everything copied over just fine. No I/O error in dmesg.
>
> The new disk shows the same behavior.

So now I'm suspicious of USB behavior. Like I said earlier, when I've
got USB-enclosed drives connected to my NUC, regardless of file system,
I routinely get hangs and USB resets. I have to connect all of my USB
enclosed drives to a good USB hub, or I have problems.



> So I created another subvolume, reflinked stuff over and found that it
> is enough to reflink one file, create a read-only snapshot and try to
> btrfs-send that. It's not happening with every file, but there are
> definitely multiple different files. The one I tested with is a 3.8GB
> ISO file.
> Even better:
> 'btrfs send --no-data snap-one > /dev/null'
> (snap-one containing just one iso file) hangs as well.

Do you have a list of steps to make this clear? It sounds like first
you copy a 3.8G ISO file to one subvolume, then reflink copy it into
another subvolume, then snapshot that 2nd subvolume, and try to send
the snapshot? But I want to be clear.

I've got piles of reflinked files in snapshots and they send OK,
although like I said I do get sometimes a 15-30 second hang during
sends.

> Still dmesg shows no IO errors, only "INFO: task btrfs-transacti:541
> blocked for more than 120 seconds." with associated call trace.
> btrfs-send reads some MB in the beginning, writes a few bytes and then
> hangs without further IO.
>
> copying the same file without --reflink, snapshotting and sending
> works without problems.
>
> I guess that pretty much eliminates bad sectors and points towards
> some problem with reflinks / btrfs metadata.

That's pretty weird. I'll keep trying and see if I hit this. What
happens if you downgrade to an older kernel? Either 4.14 or 4.17 or
both. The send code is mainly in the kernel, whereas the receive code
is mainly in the user space tools, so for this testing you don't need
to downgrade the user space tools. If there's a bug here, I expect
it's in the kernel.




-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-07 Thread Stefan Loewen
Well... It seems it's not the hardware.
I ran a long SMART check which ran through without errors and
reallocation count is still 0.

So I used clonezilla (partclone.btrfs) to mirror the drive to another
drive (same model).
Everything copied over just fine. No I/O error in dmesg.

The new disk shows the same behavior.
So I created another subvolume, reflinked stuff over and found that it
is enough to reflink one file, create a read-only snapshot and try to
btrfs-send that. It's not happening with every file, but there are
definitely multiple different files. The one I tested with is a 3.8GB
ISO file.
Even better:
'btrfs send --no-data snap-one > /dev/null'
(snap-one containing just one iso file) hangs as well.
Still dmesg shows no IO errors, only "INFO: task btrfs-transacti:541
blocked for more than 120 seconds." with associated call trace.
btrfs-send reads some MB in the beginning, writes a few bytes and then
hangs without further IO.

copying the same file without --reflink, snapshotting and sending
works without problems.

I guess that pretty much eliminates bad sectors and points towards
some problem with reflinks / btrfs metadata.


Btw: thanks for taking so much time to help me find the problem
here, Chris. Very much appreciated!
On Fri, Sep 7, 2018 at 5:29 AM Chris Murphy wrote:
>
> On Thu, Sep 6, 2018 at 2:16 PM, Stefan Loewen  wrote:
>
> > Data,single: Size:695.01GiB, Used:653.69GiB
> > /dev/sdb1 695.01GiB
> > Metadata,DUP: Size:4.00GiB, Used:2.25GiB
> > /dev/sdb1   8.00GiB
> > System,DUP: Size:40.00MiB, Used:96.00KiB
>
>
> > Does that mean Metadata is duplicated?
>
> Yes. Single copy for data. Duplicate for metadata+system, and there
> are no single chunks for metadata/system.
>
> >
> > Ok so to summarize and see if I understood you correctly:
> > There are bad sectors on disk. Running an extended selftest (smartctl -t
> > long) could find those and replace them with spare sectors.
>
> More likely if it finds a persistently failing sector, it will just
> record the first failing sector LBA in its log, and then abort. You'll
> see this info with 'smartctl -a' or with -x.
>
> It is possible to resume the test using the selective option,
> picking a 4K-aligned 512-byte LBA value just past the 4K sector with
> the defect. Just because only one is reported in dmesg doesn't mean
> there isn't another bad one.
>
> It's unlikely the long test is going to actually fix anything, it'll
> just give you more ammunition for getting a likely under warranty
> device replaced because it really shouldn't have any issues at this
> age.
>
>
> > If it does not I can try calculating the physical (4K) sector number and
> > write to that to make the drive notice and mark the bad sector.
> > Is there a way to find out which file I will be writing to beforehand?
>
> I'm not sure how to do it easily.
>
> >Or is
> > it easier to just write to the sector and then wait for scrub to tell me
> > (and the sector is broken anyways)?
>
> If it's a persistent read error, then it's lost. So you might as well
> overwrite it. If it's data, scrub will tell you what file is corrupted
> (and restore can help you recover the whole file, of course it'll have
> a 4K hole of zeros in it). If it's metadata, Btrfs will fix up the 4K
> hole with duplicate metadata.
>
> The gotcha is to make certain you've got the right LBA to write to. You
> can use dd to test this, by reading the suspect bad sector, and if
> you've got the right one, you'll get an I/O error in user space and
> dmesg will have a message like before with sector value. Use the dd
> skip= flag for reading, but make *sure* you use seek= when writing
> *and* make sure you always use bs=4096 count=1 so that if you make a
> mistake you limit the damage haha.
>
> >
> > For the drive: Not under warranty anymore. It's an external HDD that I had
> > lying around for years, mostly unused. Now I wanted to use it as part of my
> > small DIY NAS.
>
> Gotcha. Well you can read up on smartctl and smartd, and set it up for
> regular extended tests, and keep an eye on rapidly changing values. It
> might give you a 50/50 chance of an early heads up before it dies.
>
> I've got an old Hitachi/Apple laptop drive that years ago developed
> multiple bad sectors in different zones of the drive. They got
> remapped and I haven't had a problem with that drive since. *shrug*
> And in fact I did get a discrete error message from the drive for one
> of those and Btrfs overwrote that bad sector with a good copy (it's in
> a raid1 volume), so working as designed I guess.
>
> Since you didn't get a fix up message from Btrfs, either the whole
> thing just got confused with hanging tasks, or it's possible it's a
> data block.
>
>
> --
> Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 2:16 PM, Stefan Loewen  wrote:

> Data,single: Size:695.01GiB, Used:653.69GiB
> /dev/sdb1 695.01GiB
> Metadata,DUP: Size:4.00GiB, Used:2.25GiB
> /dev/sdb1   8.00GiB
> System,DUP: Size:40.00MiB, Used:96.00KiB


> Does that mean Metadata is duplicated?

Yes. Single copy for data. Duplicate for metadata+system, and there
are no single chunks for metadata/system.

>
> Ok so to summarize and see if I understood you correctly:
> There are bad sectors on disk. Running an extended selftest (smartctl -t
> long) could find those and replace them with spare sectors.

More likely if it finds a persistently failing sector, it will just
record the first failing sector LBA in its log, and then abort. You'll
see this info with 'smartctl -a' or with -x.

It is possible to resume the test using the selective option, picking
a 4K-aligned 512-byte LBA value just past the 4K sector with the
defect. Just because only one is reported in dmesg doesn't mean there
isn't another bad one.
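As a sketch, using the sector from your dmesg (354855120, which is
already 4K-aligned, so the next 4K boundary is +8):

smartctl -t long /dev/sdb                   # full surface scan; logs and stops at the first bad LBA
smartctl -t select,354855128-max /dev/sdb   # resume just past the reported defect
smartctl -l selective /dev/sdb              # check the selective self-test log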

It's unlikely the long test is going to actually fix anything, it'll
just give you more ammunition for getting a likely under warranty
device replaced because it really shouldn't have any issues at this
age.


> If it does not I can try calculating the physical (4K) sector number and
> write to that to make the drive notice and mark the bad sector.
> Is there a way to find out which file I will be writing to beforehand?

I'm not sure how to do it easily.

>Or is
> it easier to just write to the sector and then wait for scrub to tell me
> (and the sector is broken anyways)?

If it's a persistent read error, then it's lost. So you might as well
overwrite it. If it's data, scrub will tell you what file is corrupted
(and restore can help you recover the whole file, of course it'll have
a 4K hole of zeros in it). If it's metadata, Btrfs will fix up the 4K
hole with duplicate metadata.

The gotcha is to make certain you've got the right LBA to write to. You
can use dd to test this, by reading the suspect bad sector, and if
you've got the right one, you'll get an I/O error in user space and
dmesg will have a message like before with sector value. Use the dd
skip= flag for reading, but make *sure* you use seek= when writing
*and* make sure you always use bs=4096 count=1 so that if you make a
mistake you limit the damage haha.
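A minimal sketch, assuming the 512-byte LBA from your dmesg
(354855120) and that you're ready to lose that 4KiB block:

lba4k=$((354855120 / 8))                                               # 8 x 512-byte LBAs per 4KiB sector
dd if=/dev/sdb of=/dev/null bs=4096 skip=$lba4k count=1 iflag=direct   # read test: expect an I/O error
dd if=/dev/zero of=/dev/sdb bs=4096 seek=$lba4k count=1 oflag=direct   # overwrite: destroys those 4096 bytes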

>
> For the drive: Not under warranty anymore. It's an external HDD that I had
> lying around for years, mostly unused. Now I wanted to use it as part of my
> small DIY NAS.

Gotcha. Well you can read up on smartctl and smartd, and set it up for
regular extended tests, and keep an eye on rapidly changing values. It
might give you a 50/50 chance of an early heads up before it dies.

I've got an old Hitachi/Apple laptop drive that years ago developed
multiple bad sectors in different zones of the drive. They got
remapped and I haven't had a problem with that drive since. *shrug*
And in fact I did get a discrete error message from the drive for one
of those and Btrfs overwrote that bad sector with a good copy (it's in
a raid1 volume), so working as designed I guess.

Since you didn't get a fix up message from Btrfs, either the whole
thing just got confused with hanging tasks, or it's possible it's a
data block.


-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Loewen

[root@archlinux @data]# btrfs fi us /mnt/intenso_white/
Overall:
    Device size:                 911.51GiB
    Device allocated:            703.09GiB
    Device unallocated:          208.43GiB
    Device missing:                  0.00B
    Used:                        658.19GiB
    Free (estimated):            249.75GiB  (min: 145.53GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB  (used: 0.00B)

Data,single: Size:695.01GiB, Used:653.69GiB
   /dev/sdb1  695.01GiB

Metadata,DUP: Size:4.00GiB, Used:2.25GiB
   /dev/sdb1    8.00GiB

System,DUP: Size:40.00MiB, Used:96.00KiB
   /dev/sdb1   80.00MiB

Unallocated:
   /dev/sdb1  208.43GiB

Does that mean Metadata is duplicated?

Ok so to summarize and see if I understood you correctly:
There are bad sectors on disk. Running an extended selftest (smartctl -t 
long) could find those and replace them with spare sectors.
If it does not, I can try calculating the physical (4K) sector number and
write to that to make the drive notice and mark the bad sector.
Is there a way to find out which file I will be writing to beforehand? 
Or is it easier to just write to the sector and then wait for scrub to 
tell me (and the sector is broken anyways)?


For the drive: Not under warranty anymore. It's an external HDD that I 
had lying around for years, mostly unused. Now I wanted to use it as 
part of my small DIY NAS.



On 9/6/18 9:58 PM, Chris Murphy wrote:

On Thu, Sep 6, 2018 at 12:36 PM, Stefan Loewen  wrote:

Output of the commands is attached.

fdisk
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

smart
Sector Sizes: 512 bytes logical, 4096 bytes physical

So clearly the case is lying about the drive's actual physical sector
size. It's very common. But it means that to fix the bad sector by
writing to it, the write must be a 4K write. A 512-byte write to the
reported LBA will fail because it is a RMW, and the read part will
fail. So if you write to that sector, you'll get a read failure. Kinda
confusing. So you can convert the LBA to a 4K value, and use dd to
write to that "4K LBA" using bs=4096 and a count of 1, but only when
you're ready to lose all 4096 bytes in that sector. If it's data, it's
fine. It's the loss of one file, and scrub will find and report the
path to the file so you know what was affected.

If it's metadata, it could be a problem. What do you get for 'btrfs fi
us ' for this volume? I'm wondering if DUP metadata is
being used across the board with no single chunks. If so, then you can
zero that sector, and Btrfs will detect the missing metadata in that
chunk on scrub, and fix it up from a copy. But if you only have single
copy metadata, it just depends what's on that block as to how
recoverable or repairable this is.


195 Hardware_ECC_Recovered  -O-RCK   100   100   000    -    0
196 Reallocated_Event_Count -O--CK   252   252   000    -    0
197 Current_Pending_Sector  -O--CK   252   252   000    -    0
198 Offline_Uncorrectable   ----CK   252   252   000    -    0

Interesting, no complaints there. Unexpected.

11 Calibration_Retry_Count -O--CK   100   100   000    -    8
200 Multi_Zone_Error_Rate   -O-R-K   100   100   000    -    31

https://kb.acronis.com/content/9136

This is a low hour device, probably still under warranty? I'd get it
swapped out. If you want more ammunition for arguing in favor of a
swap out under warranty you could do

smartctl -t long /dev/sdb

That will take just under 4 hours to run (you can use the drive in the
meantime, but it'll take a bit longer); and then after that

smartctl -x /dev/sdb

And see if it's found a bad sector or updated any of those smart
values for the worse in particular the offline values.




SCT (Get) Error Recovery Control command failed

OK so not configurable, it is whatever it is and we don't know what
that is. Probably one of the really long recoveries.





The broken-sector-theory sounds plausible and is compatible with my new
findings:
I suspected the problem to be in one specific directory, let's call it
"broken_dir".
I created a new subvolume and copied broken_dir over.
- If I copied it with cp --reflink, made a snapshot and tried to btrfs-send
that, it hung
- If I rsynced broken_dir over I could snapshot and btrfs-send without a
problem.

Yeah I'm not sure what it is, maybe a data block.


But shouldn't btrfs scrub or check find such errors?

Nope. Btrfs expects the drive to complete the read command, but always
second guesses the content of the read by comparing to checksums. So
if the drive just supplied corrupt data, Btrfs would detect that and
discretely report, and if there's a good copy it would self heal. But
it can't do that because the drive or USB bus also seems to hang in
such a way that a bunch of tasks are also hung, and none of them are
getting a clear pass/fail for the read. It just hangs.

Arguably the device or the link should not hang. So I'm still
wondering if something else is going on.

Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 12:36 PM, Stefan Loewen  wrote:
> Output of the commands is attached.

fdisk
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

smart
Sector Sizes: 512 bytes logical, 4096 bytes physical

So clearly the case is lying about the drive's actual physical sector
size. It's very common. But it means that to fix the bad sector by
writing to it, the write must be a 4K write. A 512-byte write to the
reported LBA will fail because it is a RMW, and the read part will
fail. So if you write to that sector, you'll get a read failure. Kinda
confusing. So you can convert the LBA to a 4K value, and use dd to
write to that "4K LBA" using bs=4096 and a count of 1, but only when
you're ready to lose all 4096 bytes in that sector. If it's data, it's
fine. It's the loss of one file, and scrub will find and report the
path to the file so you know what was affected.
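Concretely, with the sector from your dmesg, the conversion is just
(a sketch):

echo $((354855120 / 8))   # 512-byte LBA -> 4KiB block number: 44356890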

If it's metadata, it could be a problem. What do you get for 'btrfs fi
us ' for this volume? I'm wondering if DUP metadata is
being used across the board with no single chunks. If so, then you can
zero that sector, and Btrfs will detect the missing metadata in that
chunk on scrub, and fix it up from a copy. But if you only have single
copy metadata, it just depends what's on that block as to how
recoverable or repairable this is.


195 Hardware_ECC_Recovered  -O-RCK   100   100   000    -    0
196 Reallocated_Event_Count -O--CK   252   252   000    -    0
197 Current_Pending_Sector  -O--CK   252   252   000    -    0
198 Offline_Uncorrectable   ----CK   252   252   000    -    0

Interesting, no complaints there. Unexpected.

11 Calibration_Retry_Count -O--CK   100   100   000    -    8
200 Multi_Zone_Error_Rate   -O-R-K   100   100   000    -    31

https://kb.acronis.com/content/9136

This is a low hour device, probably still under warranty? I'd get it
swapped out. If you want more ammunition for arguing in favor of a
swap out under warranty you could do

smartctl -t long /dev/sdb

That will take just under 4 hours to run (you can use the drive in the
meantime, but it'll take a bit longer); and then after that

smartctl -x /dev/sdb

And see if it's found a bad sector or updated any of those smart
values for the worse in particular the offline values.




SCT (Get) Error Recovery Control command failed

OK so not configurable, it is whatever it is and we don't know what
that is. Probably one of the really long recoveries.




>
> The broken-sector-theory sounds plausible and is compatible with my new
> findings:
> I suspected the problem to be in one specific directory, let's call it
> "broken_dir".
> I created a new subvolume and copied broken_dir over.
> - If I copied it with cp --reflink, made a snapshot and tried to btrfs-send
> that, it hung
> - If I rsynced broken_dir over I could snapshot and btrfs-send without a
> problem.

Yeah I'm not sure what it is, maybe a data block.

>
> But shouldn't btrfs scrub or check find such errors?

Nope. Btrfs expects the drive to complete the read command, but always
second guesses the content of the read by comparing to checksums. So
if the drive just supplied corrupt data, Btrfs would detect that and
discretely report, and if there's a good copy it would self heal. But
it can't do that because the drive or USB bus also seems to hang in
such a way that a bunch of tasks are also hung, and none of them are
getting a clear pass/fail for the read. It just hangs.

Arguably the device or the link should not hang. So I'm still
wondering if something else is going on, but this is just the most
obvious first problem, and maybe it's being complicated by another
problem we haven't figured out yet. Anyway, once this problem is
solved, it'll become clear if there are additional problems or not.

In my case, I often get usb reset errors when I directly connect USB
3.0 drives to my Intel NUC, but I don't ever get them when plugging
the drive into a dyconn hub. So if you don't already have a hub in
between the drive and the computer, it might be worth considering.
Basically the hub is going to read and completely rewrite the whole
stream that goes through it (in both directions).



-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Loewen

Output of the commands is attached.

The broken-sector-theory sounds plausible and is compatible with my new 
findings:
I suspected the problem to be in one specific directory, let's call it 
"broken_dir".

I created a new subvolume and copied broken_dir over.
- If I copied it with cp --reflink, made a snapshot and tried to
btrfs-send that, it hung.
- If I rsynced broken_dir over, I could snapshot and btrfs-send
without a problem.
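Concretely, the two tests were along these lines (paths are
placeholders for my actual ones):

cp -a --reflink=always /mnt/data/broken_dir /mnt/newvol/   # snapshot + send of this hangs
rsync -a /mnt/data/broken_dir /mnt/newvol2/                # snapshot + send of this works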


But shouldn't btrfs scrub or check find such errors?


On 9/6/18 8:16 PM, Chris Murphy wrote:

OK you've got a different problem.

[  186.898756] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  186.898762] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a0 d0
00 08 00 00
[  186.898764] print_req_error: I/O error, dev sdb, sector 354853072
[  187.109641] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.345245] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.657844] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.851336] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.026882] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.215881] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.247028] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  188.247041] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a8 d0
00 08 00 00
[  188.247048] print_req_error: I/O error, dev sdb, sector 354855120


This is a read error for a specific sector. So your drive has media
problems. And I think that's the instigating problem here: a bunch of
other tasks hang because they depend on one or more reads that never
complete. But weirdly there also isn't any kind of libata reset. At
least on SATA, by default we see a link reset after a command has not
returned in 30 seconds. That reset would totally clear the drive's
command queue, and then things either can recover or barf. But in your
case, neither happens and it just sits there with hung tasks.

[  189.350360] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0,
rd 2, flush 0, corrupt 0, gen 0

And that's the last we really see from Btrfs. After that, it's all
just hung task traces, which are rather unsurprising to me.

Drives in USB cases add a whole bunch of complicating factors for
troubleshooting and repair, including often masking the actual logical
and physical sector size, the min and max IO size, alignment offset,
and all kinds of things. They can have all sorts of bugs. And I'm also
not totally certain about the relationship between the usb reset
messages and the bad sector. As far as I know the only way we can get
a sector LBA expressly noted in dmesg along with the failed read(10)
command, is if the drive has reported back to libata that discrete
error with sense information. So I'm accepting that as a reliable
error, rather than it being something like a cable. But the reset
messages could possibly be something else in addition to that.

Anyway, the central issue is sector 354855120 is having problems. I
can't tell from the trace if it's transient or persistent. Maybe if
it's transient, that would explain how you sometimes get send to start
working again briefly but then it reverts to hanging. What do you get
for:

fdisk -l /dev/sdb
smartctl -x /dev/sdb
smartctl -l sct erc /dev/sdb

Those are all read only commands, nothing is written or changed.



[root@archlinux ~]# fdisk -l /dev/sdb
Disk /dev/sdb: 931.5 GiB, 1000204140544 bytes, 1953523712 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x1cc21bd2
Device     Boot      Start        End    Sectors  Size Id Type
/dev/sdb1             2048 1933593750 1933591703  922G 83 Linux
/dev/sdb2       1933610176 1953525167   19914992  9.5G 83 Linux


[root@archlinux ~]# smartctl -x /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.18.5-arch1-1-ARCH] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Samsung SpinPoint M8 (AF)
Device Model: ST1000LM024 HN-M101MBB
Serial Number: S2RXJ9FCB07612
LU WWN Device Id: 5 0004cf 208d24759
Firmware Version: 2AR10002
User Capacity:1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate:5400 rpm
Form Factor:  2.5 inches
Device is:In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:Thu Sep  6 18:23:57 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM level is: 254 (maximum performance), recommended: 254
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled

Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 10:03 AM, Stefan Löwen  wrote:
> I have one subvolume (rw) and 2 snapshots (ro) of it.
>
> I just tested 'btrfs send <subvolume> > /dev/null' and that also shows no IO
> after a while but also no significant CPU usage.
> During this I tried 'ls' on the source subvolume and it hangs as well.
> dmesg has some interesting messages I think (see attached dmesg.log)
>

OK you've got a different problem.

[  186.898756] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  186.898762] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a0 d0
00 08 00 00
[  186.898764] print_req_error: I/O error, dev sdb, sector 354853072
[  187.109641] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.345245] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.657844] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.851336] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.026882] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.215881] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.247028] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  188.247041] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a8 d0
00 08 00 00
[  188.247048] print_req_error: I/O error, dev sdb, sector 354855120


This is a read error for a specific sector. So your drive has media
problems. And I think that's the instigating problem here: a bunch of
other tasks hang because they depend on one or more reads that never
complete. But weirdly there also isn't any kind of libata reset. At
least on SATA, by default we see a link reset after a command has not
returned in 30 seconds. That reset would totally clear the drive's
command queue, and then things either can recover or barf. But in your
case, neither happens and it just sits there with hung tasks.

[  189.350360] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0,
rd 2, flush 0, corrupt 0, gen 0

And that's the last we really see from Btrfs. After that, it's all
just hung task traces, which are rather unsurprising to me.

Drives in USB cases add a whole bunch of complicating factors for
troubleshooting and repair, including often masking the actual logical
and physical sector size, the min and max IO size, alignment offset,
and all kinds of things. They can have all sorts of bugs. And I'm also
not totally certain about the relationship between the usb reset
messages and the bad sector. As far as I know the only way we can get
a sector LBA expressly noted in dmesg along with the failed read(10)
command, is if the drive has reported back to libata that discrete
error with sense information. So I'm accepting that as a reliable
error, rather than it being something like a cable. But the reset
messages could possibly be something else in addition to that.

Anyway, the central issue is sector 354855120 is having problems. I
can't tell from the trace if it's transient or persistent. Maybe if
it's transient, that would explain how you sometimes get send to start
working again briefly but then it reverts to hanging. What do you get
for:

fdisk -l /dev/sdb
smartctl -x /dev/sdb
smartctl -l sct erc /dev/sdb

Those are all read only commands, nothing is written or changed.



-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Löwen

I have one subvolume (rw) and 2 snapshots (ro) of it.

I just tested 'btrfs send <subvolume> > /dev/null' and that also
shows no IO after a while but also no significant CPU usage.

During this I tried 'ls' on the source subvolume and it hangs as well.
dmesg has some interesting messages I think (see attached dmesg.log)


On 9/6/18 5:48 PM, Chris Murphy wrote:

On Thu, Sep 6, 2018 at 9:04 AM, Stefan Loewen  wrote:

Update:
It seems like btrfs-send is not completely hung. It somewhat regularly
wakes up every hour to transfer a few bytes. I noticed this via a
periodic 'ls -l' on the snapshot file. These are the last outputs
(uniq'ed):

-rw--- 1 root root 1492797759 Sep  6 08:44 intenso_white.snapshot
-rw--- 1 root root 1493087856 Sep  6 09:44 intenso_white.snapshot
-rw--- 1 root root 1773825308 Sep  6 10:44 intenso_white.snapshot
-rw--- 1 root root 1773976853 Sep  6 11:58 intenso_white.snapshot
-rw--- 1 root root 1774122301 Sep  6 12:59 intenso_white.snapshot
-rw--- 1 root root 1774274264 Sep  6 13:58 intenso_white.snapshot
-rw--- 1 root root 1774435235 Sep  6 14:57 intenso_white.snapshot

I also monitor the /proc/3022/task/*/stack files with 'tail -f' (I
have no idea if this is useful) but there are no changes, even during
the short wakeups.

I have a sort of "me too" here. I definitely see btrfs send just hang
for no apparent reason, but in my case it's for maybe 15-30 seconds.
Not an hour. Looking at top and iotop at the same time as the LED
lights on the drives, there's  definitely zero activity happening. I
can make things happen during this time - like I can read a file or
save a file from/to any location including the send source or receive
destination. It really just behaves as if the send thread is saying
"OK I'm gonna nap now, back in a bit" and then it is.

So what I end up with, on drives with a minimum read/write of 80M/s,
is a send/receive that's getting me a net of about 30M/s.

I have around 100 snapshots on the source device. How many total
snapshots do you have on your source? That does appear to affect
performance for some things, including send/receive.


[attached dmesg.log: ArchLinux 4.18.5-arch1-1-ARCH boot log from the VirtualBox VM, truncated by the archive]

Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 9:04 AM, Stefan Loewen  wrote:
> Update:
> It seems like btrfs-send is not completely hung. It somewhat regularly
> wakes up every hour to transfer a few bytes. I noticed this via a
> periodic 'ls -l' on the snapshot file. These are the last outputs
> (uniq'ed):
>
> -rw--- 1 root root 1492797759 Sep  6 08:44 intenso_white.snapshot
> -rw--- 1 root root 1493087856 Sep  6 09:44 intenso_white.snapshot
> -rw--- 1 root root 1773825308 Sep  6 10:44 intenso_white.snapshot
> -rw--- 1 root root 1773976853 Sep  6 11:58 intenso_white.snapshot
> -rw--- 1 root root 1774122301 Sep  6 12:59 intenso_white.snapshot
> -rw--- 1 root root 1774274264 Sep  6 13:58 intenso_white.snapshot
> -rw--- 1 root root 1774435235 Sep  6 14:57 intenso_white.snapshot
>
> I also monitor the /proc/3022/task/*/stack files with 'tail -f' (I
> have no idea if this is useful) but there are no changes, even during
> the short wakeups.

I have a sort of "me too" here. I definitely see btrfs send just hang
for no apparent reason, but in my case it's for maybe 15-30 seconds.
Not an hour. Looking at top and iotop at the same time as the LED
lights on the drives, there's  definitely zero activity happening. I
can make things happen during this time - like I can read a file or
save a file from/to any location including the send source or receive
destination. It really just behaves as if the send thread is saying
"OK I'm gonna nap now, back in a bit" and then it is.

So what I end up with, on drives with a minimum read/write of 80M/s,
is a send/receive that's getting me a net of about 30M/s.

I have around 100 snapshots on the source device. How many total
snapshots do you have on your source? That does appear to affect
performance for some things, including send/receive.
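e.g. a quick way to count them (just a sketch):

btrfs subvolume list -s /mnt/source | wc -l   # -s lists only snapshots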


-- 
Chris Murphy


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Loewen
Update:
It seems like btrfs-send is not completely hung. It somewhat regularly
wakes up every hour to transfer a few bytes. I noticed this via a
periodic 'ls -l' on the snapshot file. These are the last outputs
(uniq'ed):

-rw--- 1 root root 1492797759 Sep  6 08:44 intenso_white.snapshot
-rw--- 1 root root 1493087856 Sep  6 09:44 intenso_white.snapshot
-rw--- 1 root root 1773825308 Sep  6 10:44 intenso_white.snapshot
-rw--- 1 root root 1773976853 Sep  6 11:58 intenso_white.snapshot
-rw--- 1 root root 1774122301 Sep  6 12:59 intenso_white.snapshot
-rw--- 1 root root 1774274264 Sep  6 13:58 intenso_white.snapshot
-rw--- 1 root root 1774435235 Sep  6 14:57 intenso_white.snapshot

I also monitor the /proc/3022/task/*/stack files with 'tail -f' (I
have no idea if this is useful) but there are no changes, even during
the short wakeups.
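For reference, the monitoring was just along these lines (3022 is the
PID of the send process from the stack traces below):

while sleep 60; do ls -l intenso_white.snapshot; done | uniq   # the periodic size check
cat /proc/3022/task/*/stack                                    # the stacks; they never changed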
On Thu, Sep 6, 2018 at 11:22 AM Stefan Löwen wrote:
>
> Hello linux-btrfs,
>
> I'm trying to clone a subvolume with 'btrfs send' but it always hangs
> for hours.
>
> I tested this on multiple systems. All showed the same result:
> - Manjaro (btrfs-progs v4.17.1; linux v4.18.5-1-MANJARO)
> - Ubuntu 18.04 in VirtualBox (btrfs-progs v4.15.1; linux v4.15.0-33-generic)
> - ArchLinux in VirtualBox (btrfs-progs v4.17.1; linux v4.18.5-arch1-1-ARCH)
> All following logs are from the ArchLinux VM.
>
> To make sure it's not the 'btrfs receive' at fault I tried sending into
> a file using the following command:
> 'strace -o btrfs-send.strace btrfs send -vvv -f intenso_white.snapshot
> /mnt/intenso_white/@data/.snapshots/test-snapshot'
>
> The 'btrfs send' process always copies around 1.2-1.4G of data, then
> stops all disk IO and fully loads one cpu core. btrfs scrub found 0
> errors. Neither did btrfsck. 'btrfs device stats' is all 0.
>
> I would be thankful for all ideas and tips.
>
> Regards
> Stefan
>
> -
>
> The btrfs-send.strace is attached. So is the dmesg.log during the hang.
>
> Stack traces of the hung process:
> --- /proc/3022/task/3022/stack ---
> [<0>] 0x
> --- /proc/3022/task/3023/stack ---
> [<0>] pipe_wait+0x6c/0xb0
> [<0>] splice_from_pipe_next.part.3+0x24/0xa0
> [<0>] __splice_from_pipe+0x43/0x180
> [<0>] splice_from_pipe+0x5d/0x90
> [<0>] default_file_splice_write+0x15/0x20
> [<0>] __se_sys_splice+0x31b/0x770
> [<0>] do_syscall_64+0x5b/0x170
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<0>] 0x
>
> [vagrant@archlinux mnt]$ uname -a
> Linux archlinux 4.18.5-arch1-1-ARCH #1 SMP PREEMPT Fri Aug 24 12:48:58
> UTC 2018 x86_64 GNU/Linux
>
> [vagrant@archlinux mnt]$ btrfs --version
> btrfs-progs v4.17.1
>
> [vagrant@archlinux mnt]$ sudo btrfs fi show /dev/sdb1
> Label: 'intenso_white'  uuid: 07bf61ed-7728-4151-a784-c4b840e343ed
> Total devices 1 FS bytes used 655.82GiB
> devid1 size 911.51GiB used 703.09GiB path /dev/sdb1
>
> [vagrant@archlinux mnt]$ sudo btrfs fi df /mnt/intenso_white/
> Data, single: total=695.01GiB, used=653.69GiB
> System, DUP: total=40.00MiB, used=96.00KiB
> Metadata, DUP: total=4.00GiB, used=2.13GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>


btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Löwen

Hello linux-btrfs,

I'm trying to clone a subvolume with 'btrfs send' but it always hangs 
for hours.


I tested this on multiple systems. All showed the same result:
- Manjaro (btrfs-progs v4.17.1; linux v4.18.5-1-MANJARO)
- Ubuntu 18.04 in VirtualBox (btrfs-progs v4.15.1; linux v4.15.0-33-generic)
- ArchLinux in VirtualBox (btrfs-progs v4.17.1; linux v4.18.5-arch1-1-ARCH)
All following logs are from the ArchLinux VM.

To make sure it's not the 'btrfs receive' at fault I tried sending into 
a file using the following command:
'strace -o btrfs-send.strace btrfs send -vvv -f intenso_white.snapshot 
/mnt/intenso_white/@data/.snapshots/test-snapshot'


The 'btrfs send' process always copies around 1.2-1.4G of data, then 
stops all disk IO and fully loads one cpu core. btrfs scrub found 0 
errors. Neither did btrfsck. 'btrfs device stats' is all 0.
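For reference, those checks were roughly these (the exact invocations
may have differed):

btrfs scrub start -B /mnt/intenso_white   # foreground scrub; found 0 errors
btrfs check /dev/sdb1                     # btrfsck, with the filesystem unmounted
btrfs device stats /mnt/intenso_white     # all counters 0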


I would be thankful for all ideas and tips.

Regards
Stefan

-

The btrfs-send.strace is attached. So is the dmesg.log during the hang.

Stack traces of the hung process:
--- /proc/3022/task/3022/stack ---
[<0>] 0x
--- /proc/3022/task/3023/stack ---
[<0>] pipe_wait+0x6c/0xb0
[<0>] splice_from_pipe_next.part.3+0x24/0xa0
[<0>] __splice_from_pipe+0x43/0x180
[<0>] splice_from_pipe+0x5d/0x90
[<0>] default_file_splice_write+0x15/0x20
[<0>] __se_sys_splice+0x31b/0x770
[<0>] do_syscall_64+0x5b/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0x

[vagrant@archlinux mnt]$ uname -a
Linux archlinux 4.18.5-arch1-1-ARCH #1 SMP PREEMPT Fri Aug 24 12:48:58 
UTC 2018 x86_64 GNU/Linux


[vagrant@archlinux mnt]$ btrfs --version
btrfs-progs v4.17.1

[vagrant@archlinux mnt]$ sudo btrfs fi show /dev/sdb1
Label: 'intenso_white'  uuid: 07bf61ed-7728-4151-a784-c4b840e343ed
Total devices 1 FS bytes used 655.82GiB
devid    1 size 911.51GiB used 703.09GiB path /dev/sdb1

[vagrant@archlinux mnt]$ sudo btrfs fi df /mnt/intenso_white/
Data, single: total=695.01GiB, used=653.69GiB
System, DUP: total=40.00MiB, used=96.00KiB
Metadata, DUP: total=4.00GiB, used=2.13GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

execve("/usr/bin/btrfs", ["btrfs", "send", "-vvv", "-f", 
"intenso_white.snapshot", "/mnt/intenso_white/@data/.snapsh"...], 
0x7ffcb8f52718 /* 16 vars */) = 0
brk(NULL)   = 0x555c9eddb000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe8c9971e0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=28542, ...}) = 0
mmap(NULL, 28542, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffa9a016000
close(3)= 0
openat(AT_FDCWD, "/usr/lib/libuuid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\24\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=26552, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7ffa9a014000
mmap(NULL, 2121752, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7ffa99e0d000
mprotect(0x7ffa99e13000, 2093056, PROT_NONE) = 0
mmap(0x7ffa9a012000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7ffa9a012000
close(3)= 0
openat(AT_FDCWD, "/usr/lib/libblkid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, 
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\231\0\0\0\0\0\0"..., 832) = 
832
fstat(3, {st_mode=S_IFREG|0755, st_size=326480, ...}) = 0
mmap(NULL, 2426656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7ffa99bbc000
mprotect(0x7ffa99c06000, 2097152, PROT_NONE) = 0
mmap(0x7ffa99e06000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4a000) = 0x7ffa99e06000
mmap(0x7ffa99e0c000, 1824, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffa99e0c000
close(3)= 0
openat(AT_FDCWD, "/usr/lib/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320!\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=91912, ...}) = 0
mmap(NULL, 2187280, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7ffa999a5000
mprotect(0x7ffa999bb000, 2093056, PROT_NONE) = 0
mmap(0x7ffa99bba000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0x7ffa99bba000
close(3)= 0
openat(AT_FDCWD, "/usr/lib/liblzo2.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20'\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=137432, ...}) = 0
mmap(NULL, 2232528, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7ffa99783000
mprotect(0x7ffa997a4000, 2093056, PROT_NONE) = 0
mmap(0x7ffa999a3000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2) = 0x7ffa999a3000
close(3)