Re: btrfs on 3.14rc5 stuck on btrfs_tree_read_lock sync

2014-04-09 Thread Marc MERLIN
On Mon, Apr 07, 2014 at 01:00:02PM -0700, Marc MERLIN wrote:
 On Mon, Apr 07, 2014 at 03:32:13PM -0400, Chris Mason wrote:
  You're recommending that I try btrfs-next on a 3.15 pre kernel, correct?
  If so would it be likely to fix my filesystem and let me go back to a
  stable 3.14? (I'm a bit wary about running some unstable 3.15 on it :).
  
  Right now the fixes for this are in the integration branch on my git 
  tree.  I think we've shaken out all the problems, but if you want to 
  wait until tomorrow I'll have it in my next branch (for linux-next).
 
 I can wait, even a few more days if needed.
 But just to be clear: will I need to keep running this new kernel from then
 on to avoid all those deadlocks and the very poor performance I'm seeing, or
 will the new kernel fix things up so that, if other stuff isn't quite
 stable, I can downgrade back to 3.14 stable?
 
 By the way, I think I know which filesystem is causing this, and one unusual
 thing is that it uses a lot of hardlinks.
 
 In case that helps, there are only 40 snapshots on it, but many inodes, of
 which many are hardlinked together:
 
 gargamel:/mnt/btrfs_pool2# btrfs filesystem df `pwd`
 Data, single: total=3.28TiB, used=2.30TiB
 System, DUP: total=8.00MiB, used=384.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, DUP: total=74.50GiB, used=70.11GiB
 Metadata, single: total=8.00MiB, used=0.00
 gargamel:/mnt/btrfs_pool2# btrfs filesystem show `pwd`
 Label: btrfs_pool2  uuid: cb9df6d3-a528-4afc-9a45-4fed5ec358d6
   Total devices 1 FS bytes used 2.37TiB
   devid    1 size 7.28TiB used 3.43TiB path /dev/mapper/dshelf2

Back on that front, while debugging the other problem I sent you, I've been
having more issues with this device too.

At boot time, I've been getting several of these:
gargamel login: [ 1328.241302] INFO: task btrfs-cleaner:3571 blocked for more than 120 seconds.
[ 1328.264046]   Not tainted 3.14.0-amd64-i915-preempt-20140216 #2
[ 1328.284413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1328.309394] btrfs-cleaner   D 88020d5ea800 0  3571  2 0x
[ 1328.331996]  8800c8985d00 0046 8800c8985fd8 88020d5ea2d0
[ 1328.355774]  000141c0 88020d5ea2d0 8801d9a7ee50 880213bfc9e8
[ 1328.379408]   880213bfc800 88020c5b7ce0 8800c8985d10
[ 1328.403654] Call Trace:
[ 1328.412617]  [8160d2a1] schedule+0x73/0x75
[ 1328.429189]  [8122aa77] wait_current_trans.isra.15+0x98/0xf4
[ 1328.450402]  [81085116] ? finish_wait+0x65/0x65
[ 1328.467957]  [8122bf1c] start_transaction+0x48e/0x4f2
[ 1328.487315]  [8122c2ff] ? __btrfs_end_transaction+0x2a1/0x2c6
[ 1328.508614]  [8122bf9b] btrfs_start_transaction+0x1b/0x1d
[ 1328.528842]  [8121cc7d] btrfs_drop_snapshot+0x443/0x610
[ 1328.548481]  [8122c73d] btrfs_clean_one_deleted_snapshot+0x103/0x10f
[ 1328.571518]  [81224f09] cleaner_kthread+0x103/0x136
[ 1328.590436]  [81224e06] ? btrfs_alloc_root+0x26/0x26
[ 1328.609348]  [8106bc62] kthread+0xae/0xb6
[ 1328.625275]  [8106bbb4] ? __kthread_parkme+0x61/0x61
[ 1328.644406]  [8161637c] ret_from_fork+0x7c/0xb0
[ 1328.662075]  [8106bbb4] ? __kthread_parkme+0x61/0x61
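
For comparing blocked tasks, here is a rough sketch in Python (not part of the original report). It pulls the function names out of a hung-task call trace and reports the innermost frames that two traces share. The two samples are copied from traces in this thread (btrfs-cleaner on 3.14.0 and btrfs-transaction on 3.14.0-rc5), with the one wrapped line rejoined. The helper names `frames` and `common_prefix` are made up for illustration. Frames marked with `?` are ones the kernel's unwinder flags as unreliable, so the sketch skips them.

```python
import re

# Matches e.g. "[8122bf1c] start_transaction+0x48e/0x4f2", with an optional
# "? " marker for frames the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\]\s+(?P<unreliable>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)

# Copied from the btrfs-cleaner trace above (3.14.0).
CLEANER = """\
[ 1328.412617]  [8160d2a1] schedule+0x73/0x75
[ 1328.429189]  [8122aa77] wait_current_trans.isra.15+0x98/0xf4
[ 1328.450402]  [81085116] ? finish_wait+0x65/0x65
[ 1328.467957]  [8122bf1c] start_transaction+0x48e/0x4f2
[ 1328.487315]  [8122c2ff] ? __btrfs_end_transaction+0x2a1/0x2c6
[ 1328.508614]  [8122bf9b] btrfs_start_transaction+0x1b/0x1d
[ 1328.528842]  [8121cc7d] btrfs_drop_snapshot+0x443/0x610
[ 1328.548481]  [8122c73d] btrfs_clean_one_deleted_snapshot+0x103/0x10f
[ 1328.571518]  [81224f09] cleaner_kthread+0x103/0x136
[ 1328.609348]  [8106bc62] kthread+0xae/0xb6
[ 1328.644406]  [8161637c] ret_from_fork+0x7c/0xb0
"""

# Copied from the btrfs-transaction trace elsewhere in this thread (3.14.0-rc5).
TRANSACTION = """\
 [8160c331] schedule+0x73/0x75
 [8122a5f9] wait_current_trans.isra.15+0x98/0xf4
 [810850c9] ? finish_wait+0x65/0x65
 [8122b812] start_transaction+0x202/0x4f2
 [8122bb9e] btrfs_attach_transaction+0x17/0x19
 [812277a8] transaction_kthread+0xd6/0x1ab
 [8106bc56] kthread+0xae/0xb6
 [816153fc] ret_from_fork+0x7c/0xb0
"""


def frames(trace_text):
    """Return the function names of the reliable frames, innermost first."""
    out = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m["unreliable"]:
            out.append(m["func"])
    return out


def common_prefix(a, b):
    """Return the leading frames that both call chains have in common."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


if __name__ == "__main__":
    print(common_prefix(frames(CLEANER), frames(TRANSACTION)))
```

On these two samples the shared frames are schedule, wait_current_trans.isra.15 and start_transaction. That only says both threads are waiting on the running transaction; it does not say what is keeping that transaction open.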

But more annoyingly, accessing the mountpoint was hanging, so I've now mounted
it with recovery,ro and am backing up all the data to another device, so that
I can destroy/recreate this device, which clearly has severe performance issues.

Do you want btrfsck output and an image of that one too?
(This one is not raid0; it's on top of a dm-encrypted md raid5 array.)

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs on 3.14rc5 stuck on btrfs_tree_read_lock sync

2014-04-07 Thread Marc MERLIN
I was debugging why my backup failed to run, and eventually found it was
stuck on sync:
14080   18:18 btrfs_tree_read_lock   sync

This was hung for hours on this lock.

Strangely, it looks like taking my sysrq-w hung the machine pretty hard for
close to 30sec, but this seems to have unhung sync and in the end btrfs send
completed after that.

Sysrq-w is here:
http://marc.merlins.org/tmp/sysrq-btrfs-sync-hang.txt

Marc


Re: btrfs on 3.14rc5 stuck on btrfs_tree_read_lock sync

2014-04-07 Thread Josef Bacik

On 04/07/2014 12:05 PM, Marc MERLIN wrote:

I was debugging why my backup failed to run, and eventually found it was
stuck on sync:
14080   18:18 btrfs_tree_read_lock   sync

This was hung for hours on this lock.

Strangely, it looks like taking my sysrq-w hung the machine pretty hard for
close to 30sec, but this seems to have unhung sync and in the end btrfs send
completed after that.

Sysrq-w is here:
http://marc.merlins.org/tmp/sysrq-btrfs-sync-hang.txt


Try Chris's integration branch in a few hours and see if that fixes it. 
 Thanks,


Josef



Re: btrfs on 3.14rc5 stuck on btrfs_tree_read_lock sync

2014-04-07 Thread Marc MERLIN
On Mon, Apr 07, 2014 at 12:10:52PM -0400, Josef Bacik wrote:
 On 04/07/2014 12:05 PM, Marc MERLIN wrote:
 I was debugging why my backup failed to run, and eventually found it was
 stuck on sync:
 14080   18:18 btrfs_tree_read_lock   sync
 
 This was hung for hours on this lock.
 
 Strangely, it looks like taking my sysrq-w hung the machine pretty hard for
 close to 30sec, but this seems to have unhung sync and in the end btrfs send
 completed after that.
 
 Sysrq-w is here:
 http://marc.merlins.org/tmp/sysrq-btrfs-sync-hang.txt
 
 Try Chris's integration branch in a few hours and see if that fixes
 it.  Thanks,

Mmmh, so I rebooted that server with 3.14.0 (no rc), and it was
deadlocked for a long time during boot (about 10 minutes) before it
unlocked itself and finished booting.

This is a bit vexing: I don't yet know which of my 3 btrfs filesystems
is causing this, or how to fix it.
After boot, it seems ok enough.

You're recommending that I try btrfs-next on a 3.15 pre kernel, correct?
If so would it be likely to fix my filesystem and let me go back to a
stable 3.14? (I'm a bit wary about running some unstable 3.15 on it :).

Is there a chance balance or some file system cleaning will fix this?

For now, during boot, I get:
INFO: task btrfs-transacti:3633 blocked for more than 120 seconds.
  Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transacti D 88020d762680 0  3633  2 0x
 88020c6c7dc0 0046 88020c6c7fd8 88020d762150
 000141c0 88020d762150 88020e11be90 8802106271e8
  880210627000 8800c5c82740 88020c6c7dd0
Call Trace:
 [8160c331] schedule+0x73/0x75
 [8122a5f9] wait_current_trans.isra.15+0x98/0xf4
 [810850c9] ? finish_wait+0x65/0x65
 [8122b812] start_transaction+0x202/0x4f2
 [8122bb9e] btrfs_attach_transaction+0x17/0x19
 [812277a8] transaction_kthread+0xd6/0x1ab
 [812276d2] ? btrfs_cleanup_transaction+0x43f/0x43f
 [8106bc56] kthread+0xae/0xb6
 [8106bba8] ? __kthread_parkme+0x61/0x61
 [816153fc] ret_from_fork+0x7c/0xb0
 [8106bba8] ? __kthread_parkme+0x61/0x61
INFO: task btrfs-transacti:3633 blocked for more than 120 seconds.
  Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transacti D 88020d762680 0  3633  2 0x
 88020c6c7dc0 0046 88020c6c7fd8 88020d762150
 000141c0 88020d762150 88020e11be90 8802106271e8
  880210627000 8800c5c82740 88020c6c7dd0
Call Trace:
 [8160c331] schedule+0x73/0x75
 [8122a5f9] wait_current_trans.isra.15+0x98/0xf4
 [810850c9] ? finish_wait+0x65/0x65
 [8122b812] start_transaction+0x202/0x4f2
 [8122bb9e] btrfs_attach_transaction+0x17/0x19
 [812277a8] transaction_kthread+0xd6/0x1ab
 [812276d2] ? btrfs_cleanup_transaction+0x43f/0x43f
 [8106bc56] kthread+0xae/0xb6
 [8106bba8] ? __kthread_parkme+0x61/0x61
 [816153fc] ret_from_fork+0x7c/0xb0
 [8106bba8] ? __kthread_parkme+0x61/0x61
INFO: task btrfs-transacti:3633 blocked for more than 120 seconds.
  Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transacti D 88020d762680 0  3633  2 0x
 88020c6c7dc0 0046 88020c6c7fd8 88020d762150
 000141c0 88020d762150 88020e11be90 8802106271e8
  880210627000 8800c5c82740 88020c6c7dd0
Call Trace:
 [8160c331] schedule+0x73/0x75
 [8122a5f9] wait_current_trans.isra.15+0x98/0xf4
 [810850c9] ? finish_wait+0x65/0x65
 [8122b812] start_transaction+0x202/0x4f2
 [8122bb9e] btrfs_attach_transaction+0x17/0x19
 [812277a8] transaction_kthread+0xd6/0x1ab
 [812276d2] ? btrfs_cleanup_transaction+0x43f/0x43f
 [8106bc56] kthread+0xae/0xb6
 [8106bba8] ? __kthread_parkme+0x61/0x61
 [816153fc] ret_from_fork+0x7c/0xb0
 [8106bba8] ? __kthread_parkme+0x61/0x61

Eventually the boot finishes, but it hangs way too long.
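
When the same report repeats like this, a quick tally helps to see whether one task or several are stuck. A minimal sketch in Python, assuming the log text has been saved somewhere and read into a string; `count_hung_tasks` is a hypothetical helper name, and the sample is abbreviated from the messages above.

```python
import re
from collections import Counter

# Header line printed by the kernel's hung-task detector.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def count_hung_tasks(log_text):
    """Count hung-task reports per (task name, pid)."""
    counts = Counter()
    for m in HUNG_RE.finditer(log_text):
        counts[(m["comm"], int(m["pid"]))] += 1
    return counts


# Abbreviated from the boot messages above: three reports for the same task.
SAMPLE = """\
INFO: task btrfs-transacti:3633 blocked for more than 120 seconds.
  Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
INFO: task btrfs-transacti:3633 blocked for more than 120 seconds.
  Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
INFO: task btrfs-transacti:3633 blocked for more than 120 seconds.
  Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
"""

if __name__ == "__main__":
    for (comm, pid), n in count_hung_tasks(SAMPLE).items():
        print(f"{comm} (pid {pid}): {n} report(s)")
```

Here all three reports are for the same task, btrfs-transacti with pid 3633, so this is one thread staying blocked rather than several different ones.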

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

Re: btrfs on 3.14rc5 stuck on btrfs_tree_read_lock sync

2014-04-07 Thread Chris Mason



On 04/07/2014 02:51 PM, Marc MERLIN wrote:

On Mon, Apr 07, 2014 at 12:10:52PM -0400, Josef Bacik wrote:

On 04/07/2014 12:05 PM, Marc MERLIN wrote:

I was debugging why my backup failed to run, and eventually found it was
stuck on sync:
14080   18:18 btrfs_tree_read_lock   sync

This was hung for hours on this lock.

Strangely, it looks like taking my sysrq-w hung the machine pretty hard for
close to 30sec, but this seems to have unhung sync and in the end btrfs send
completed after that.

Sysrq-w is here:
http://marc.merlins.org/tmp/sysrq-btrfs-sync-hang.txt


Try Chris's integration branch in a few hours and see if that fixes
it.  Thanks,


Mmmh, so I rebooted that server with 3.14.0 (no rc), and it was
deadlocked for a long time during boot (about 10 minutes) before it
unlocked itself and finished booting.

This is a bit vexing: I don't yet know which of my 3 btrfs filesystems
is causing this, or how to fix it.
After boot, it seems ok enough.

You're recommending that I try btrfs-next on a 3.15 pre kernel, correct?
If so would it be likely to fix my filesystem and let me go back to a
stable 3.14? (I'm a bit wary about running some unstable 3.15 on it :).



Right now the fixes for this are in the integration branch on my git 
tree.  I think we've shaken out all the problems, but if you want to 
wait until tomorrow I'll have it in my next branch (for linux-next).


-chris


Re: btrfs on 3.14rc5 stuck on btrfs_tree_read_lock sync

2014-04-07 Thread Marc MERLIN
On Mon, Apr 07, 2014 at 03:32:13PM -0400, Chris Mason wrote:
 You're recommending that I try btrfs-next on a 3.15 pre kernel, correct?
 If so would it be likely to fix my filesystem and let me go back to a
 stable 3.14? (I'm a bit wary about running some unstable 3.15 on it :).
 
 Right now the fixes for this are in the integration branch on my git 
 tree.  I think we've shaken out all the problems, but if you want to 
 wait until tomorrow I'll have it in my next branch (for linux-next).

I can wait, even a few more days if needed.
But just to be clear: will I need to keep running this new kernel from then
on to avoid all those deadlocks and the very poor performance I'm seeing, or
will the new kernel fix things up so that, if other stuff isn't quite
stable, I can downgrade back to 3.14 stable?

By the way, I think I know which filesystem is causing this, and one unusual
thing is that it uses a lot of hardlinks.

In case that helps, there are only 40 snapshots on it, but many inodes, of
which many are hardlinked together:

gargamel:/mnt/btrfs_pool2# btrfs filesystem df `pwd`
Data, single: total=3.28TiB, used=2.30TiB
System, DUP: total=8.00MiB, used=384.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=74.50GiB, used=70.11GiB
Metadata, single: total=8.00MiB, used=0.00
gargamel:/mnt/btrfs_pool2# btrfs filesystem show `pwd`
Label: btrfs_pool2  uuid: cb9df6d3-a528-4afc-9a45-4fed5ec358d6
Total devices 1 FS bytes used 2.37TiB
devid    1 size 7.28TiB used 3.43TiB path /dev/mapper/dshelf2
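
To put the df numbers in proportion, here is a small Python sketch that works out how full each allocated block-group type is. It assumes the line format shown in the output above (other btrfs-progs versions may print it differently), and `parse_df` and `usage_percent` are hypothetical helper names.

```python
import re

# Copied from the `btrfs filesystem df` output above.
SAMPLE = """\
Data, single: total=3.28TiB, used=2.30TiB
System, DUP: total=8.00MiB, used=384.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=74.50GiB, used=70.11GiB
Metadata, single: total=8.00MiB, used=0.00
"""

# Binary unit suffixes as printed above; a bare number (e.g. "0.00") is bytes.
UNITS = {"": 1, "B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

LINE_RE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[A-Za-z]*), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[A-Za-z]*)$"
)


def parse_df(text):
    """Return (kind, profile, used_bytes, total_bytes) for each parsed line."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a df line
        total = float(m["total"]) * UNITS[m["tunit"]]
        used = float(m["used"]) * UNITS[m["uunit"]]
        rows.append((m["kind"], m["profile"], used, total))
    return rows


def usage_percent(rows):
    """Map (kind, profile) to the percentage of allocated space in use."""
    return {
        (kind, profile): 100.0 * used / total
        for kind, profile, used, total in rows
        if total > 0
    }


if __name__ == "__main__":
    for (kind, profile), pct in usage_percent(parse_df(SAMPLE)).items():
        print(f"{kind:8s} {profile:6s} {pct:5.1f}% of allocated space used")
```

On the numbers above this gives roughly 94% for Metadata DUP (70.11GiB of 74.50GiB) and roughly 70% for Data (2.30TiB of 3.28TiB). That is just arithmetic on the output, not a diagnosis of the hangs.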

Marc