Re: Ceph on btrfs 3.4rc

2012-05-24 Thread Christian Brunner
Same thing here.

I've tried really hard, but even after 12 hours I wasn't able to get a
single warning from btrfs.

I think you cracked it!

Thanks,
Christian

2012/5/24 Martin Mailand mar...@tuxadero.com:
 Hi,
 the ceph cluster has been running under heavy load for the last 13 hours without a
 problem, dmesg is empty and the performance is good.

 -martin

 Am 23.05.2012 21:12, schrieb Martin Mailand:

 this patch has been running for 3 hours without a Bug and without the Warning.
 I will let it run overnight and report tomorrow.
 It looks very good ;-)


Re: Ceph on btrfs 3.4rc

2012-05-23 Thread Christian Brunner
2012/5/22 Josef Bacik jo...@redhat.com:


 Yeah you would also need to change orphan_meta_reserved.  I fixed this by just
 taking the BTRFS_I(inode)->lock when messing with these since we don't want to
 take up all that space in the inode just for a marker.  I ran this patch for 3
 hours with no issues, let me know if it works for you.  Thanks,

Compared to the last runs, I had to run it much longer, but somehow I
managed to hit a BUG_ON again:

[448281.002087] couldn't find orphan item for 2027, nlink 1, root 308,
root being deleted no
[448281.011339] [ cut here ]
[448281.016590] kernel BUG at fs/btrfs/inode.c:2230!
[448281.021837] invalid opcode:  [#1] SMP
[448281.026525] CPU 4
[448281.028670] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[448281.052215]
[448281.053977] Pid: 16018, comm: ceph-osd Tainted: PW  O
3.3.5-1.fits.1.el6.x86_64 #1 HP ProLiant DL180 G6
[448281.06] RIP: 0010:[a04a17ab]  [a04a17ab]
btrfs_orphan_del+0x19b/0x1b0 [btrfs]
[448281.075965] RSP: 0018:880458257d18  EFLAGS: 00010292
[448281.081987] RAX: 0063 RBX: 8803a28ebc48 RCX:
2fdb
[448281.090042] RDX:  RSI: 0046 RDI:
0246
[448281.098093] RBP: 880458257d58 R08: 81af6100 R09:

[448281.106146] R10: 0004 R11:  R12:
0001
[448281.114202] R13: 88052e130400 R14: 0001 R15:
8805beae9e10
[448281.122262] FS:  7fa2e772f700() GS:88062728()
knlGS:
[448281.131386] CS:  0010 DS:  ES:  CR0: 80050033
[448281.137879] CR2: ff600400 CR3: 0005015a5000 CR4:
06e0
[448281.145929] DR0:  DR1:  DR2:

[448281.153974] DR3:  DR6: 0ff0 DR7:
0400
[448281.162043] Process ceph-osd (pid: 16018, threadinfo
880458256000, task 88055b711940)
[448281.171646] Stack:
[448281.173987]  880458257dff 8803a28eba98 880458257d58
8805beae9e10
[448281.182377]   88052e130400 88029ff33380
8803a28ebc48
[448281.190766]  880458257e08 a04ab4e6 
8803a28ebc48
[448281.199155] Call Trace:
[448281.202005]  [a04ab4e6] btrfs_truncate+0x5f6/0x660 [btrfs]
[448281.209203]  [a04ab646] btrfs_setattr+0xf6/0x1a0 [btrfs]
[448281.216202]  [811816fb] notify_change+0x18b/0x2b0
[448281.222517]  [81276541] ? selinux_inode_permission+0xd1/0x130
[448281.229990]  [81165f44] do_truncate+0x64/0xa0
[448281.235919]  [81172669] ? inode_permission+0x49/0x100
[448281.242617]  [81166197] sys_truncate+0x137/0x150
[448281.248838]  [8158b1e9] system_call_fastpath+0x16/0x1b
[448281.255631] Code: a0 49 8b 8d f0 02 00 00 8b 53 48 4c 0f 45 c0 48
85 f6 74 1b 80 bb 60 fe ff ff 84 74 12 48 c7 c7 e8 1d 50 a0 31 c0 e8
9d ea 0d e1 0f 0b eb fe 48 8b 73 40 eb e8 66 66 2e 0f 1f 84 00 00 00
00 00
[448281.277435] RIP  [a04a17ab] btrfs_orphan_del+0x19b/0x1b0 [btrfs]
[448281.285229]  RSP 880458257d18
[448281.289667] ---[ end trace 9adc7b36a3e66872 ]---

Sorry,
Christian


Re: Ceph on btrfs 3.4rc

2012-05-22 Thread Christian Brunner
2012/5/21 Miao Xie mi...@cn.fujitsu.com:
 Hi Josef,

 On fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
 diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
 index 9b9b15f..492c74f 100644
 --- a/fs/btrfs/btrfs_inode.h
 +++ b/fs/btrfs/btrfs_inode.h
 @@ -57,9 +57,6 @@ struct btrfs_inode {
       /* used to order data wrt metadata */
       struct btrfs_ordered_inode_tree ordered_tree;

 -     /* for keeping track of orphaned inodes */
 -     struct list_head i_orphan;
 -
       /* list of all the delalloc inodes in the FS.  There are times we need
        * to write all the delalloc pages to disk, and this list is used
        * to walk them all.
 @@ -156,6 +153,8 @@ struct btrfs_inode {
       unsigned dummy_inode:1;
       unsigned in_defrag:1;
       unsigned delalloc_meta_reserved:1;
 +     unsigned has_orphan_item:1;
 +     unsigned doing_truncate:1;

 I think the problem is that we should not use different locks to protect
 bit fields that are stored in the same machine word, or some bit fields may
 be overwritten by the others when someone changes those fields. Could you
 try to declare ->delalloc_meta_reserved and ->has_orphan_item as integers?
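For illustration, a minimal standalone sketch of the effect being described
here; the struct and lock names are made up and this is not btrfs code:

#include <pthread.h>
#include <stdio.h>

struct flags {
	unsigned a:1;		/* imagine this bit protected by lock_a */
	unsigned b:1;		/* imagine this bit protected by lock_b */
};

static struct flags f;
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *set_a(void *arg)
{
	pthread_mutex_lock(&lock_a);
	f.a = 1;	/* compiles to: load the word, set the bit, store the word */
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

static void *set_b(void *arg)
{
	pthread_mutex_lock(&lock_b);
	f.b = 1;	/* same word, different lock: the store can write back a stale f.a */
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t ta, tb;

	pthread_create(&ta, NULL, set_a, NULL);
	pthread_create(&tb, NULL, set_b, NULL);
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);

	/* With an unlucky interleaving this can print "1 0" or "0 1" instead
	 * of "1 1", because the two locks do not serialize access to the
	 * shared word.  Using one lock for both bits, or separate ints,
	 * avoids the problem. */
	printf("%u %u\n", f.a, f.b);
	return 0;
}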

I have tried changing it to:

struct btrfs_inode {
unsigned orphan_meta_reserved:1;
unsigned dummy_inode:1;
unsigned in_defrag:1;
-   unsigned delalloc_meta_reserved:1;
+   int delalloc_meta_reserved;
+   int has_orphan_item;
+   int doing_truncate;

The strange thing is that I'm no longer hitting the BUG_ON, but the
old WARNING (no additional messages):

[351021.157124] [ cut here ]
[351021.162400] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]()
[351021.171812] Hardware name: ProLiant DL180 G6
[351021.176867] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[351021.200236] Pid: 9837, comm: btrfs-transacti Tainted: PW
O 3.3.5-1.fits.1.el6.x86_64 #1
[351021.210126] Call Trace:
[351021.212957]  [8104df6f] warn_slowpath_common+0x7f/0xc0
[351021.219758]  [8104dfca] warn_slowpath_null+0x1a/0x20
[351021.226385]  [a03eb627]
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]
[351021.234461]  [a03e6976] commit_fs_roots+0xc6/0x1c0 [btrfs]
[351021.241669]  [a0438c61] ?
btrfs_run_delayed_items+0xf1/0x160 [btrfs]
[351021.249841]  [a03e7ae4]
btrfs_commit_transaction+0x584/0xa50 [btrfs]
[351021.258006]  [a03e8432] ? start_transaction+0x92/0x310 [btrfs]
[351021.265580]  [81070aa0] ? wake_up_bit+0x40/0x40
[351021.271719]  [a03e2f3b] transaction_kthread+0x26b/0x2e0 [btrfs]
[351021.279405]  [a03e2cd0] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[351021.288934]  [a03e2cd0] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[351021.298449]  [8107040e] kthread+0x9e/0xb0
[351021.303989]  [8158c5a4] kernel_thread_helper+0x4/0x10
[351021.310691]  [81070370] ? kthread_freezable_should_stop+0x70/0x70
[351021.318555]  [8158c5a0] ? gs_change+0x13/0x13
[351021.324479] ---[ end trace 9adc7b36a3e66833 ]---
[351710.339482] [ cut here ]
[351710.344754] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]()
[351710.354165] Hardware name: ProLiant DL180 G6
[351710.359222] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[351710.382569] Pid: 9797, comm: kworker/5:0 Tainted: PW  O
3.3.5-1.fits.1.el6.x86_64 #1
[351710.392075] Call Trace:
[351710.394901]  [8104df6f] warn_slowpath_common+0x7f/0xc0
[351710.401750]  [8104dfca] warn_slowpath_null+0x1a/0x20
[351710.408414]  [a03eb627]
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]
[351710.416528]  [a03e6976] commit_fs_roots+0xc6/0x1c0 [btrfs]
[351710.423775]  [a03e7ae4]
btrfs_commit_transaction+0x584/0xa50 [btrfs]
[351710.431983]  [810127a3] ? __switch_to+0x153/0x440
[351710.438352]  [81070aa0] ? wake_up_bit+0x40/0x40
[351710.444529]  [a03e7fb0] ?
btrfs_commit_transaction+0xa50/0xa50 [btrfs]
[351710.452894]  [a03e7fcf] do_async_commit+0x1f/0x30 [btrfs]
[351710.459979]  [81068959] process_one_work+0x129/0x450
[351710.466576]  [8106b7fb] worker_thread+0x17b/0x3c0
[351710.472884]  [8106b680] ? manage_workers+0x220/0x220
[351710.479472]  [8107040e] kthread+0x9e/0xb0
[351710.485029]  [8158c5a4] kernel_thread_helper+0x4/0x10
[351710.491731]  [81070370] ? kthread_freezable_should_stop+0x70/0x70
[351710.499640]  [8158c5a0] ? gs_change+0x13/0x13
[351710.505590] ---[ end trace 9adc7b36a3e66834 

Re: Ceph on btrfs 3.4rc

2012-05-17 Thread Christian Brunner
2012/5/17 Josef Bacik jo...@redhat.com:
 On Thu, May 17, 2012 at 05:12:55PM +0200, Martin Mailand wrote:
 Hi Josef,
 no, there was nothing above. Here is another dmesg output.


 Hrm ok give this a try and hopefully this is it, still couldn't reproduce.
 Thanks,

 Josef

Well, I hate to say it, but the new patch doesn't seem to change much...

Regards,
Christian

[  123.507444] Btrfs loaded
[  202.683630] device fsid 2aa7531c-0e3c-4955-8542-6aed7ab8c1a2 devid
1 transid 4 /dev/sda
[  202.693704] btrfs: use lzo compression
[  202.697999] btrfs: enabling inode map caching
[  202.702989] btrfs: enabling auto defrag
[  202.707190] btrfs: disk space caching is enabled
[  202.712721] btrfs flagging fs with big metadata feature
[  207.839761] device fsid f81ff6a1-c333-4daf-989f-a28139f15f08 devid
1 transid 4 /dev/sdb
[  207.849681] btrfs: use lzo compression
[  207.853987] btrfs: enabling inode map caching
[  207.858970] btrfs: enabling auto defrag
[  207.863173] btrfs: disk space caching is enabled
[  207.868635] btrfs flagging fs with big metadata feature
[  210.857328] device fsid 9b905faa-f4fa-4626-9cae-2cd0287b30f7 devid
1 transid 4 /dev/sdc
[  210.867265] btrfs: use lzo compression
[  210.871560] btrfs: enabling inode map caching
[  210.876550] btrfs: enabling auto defrag
[  210.880757] btrfs: disk space caching is enabled
[  210.886228] btrfs flagging fs with big metadata feature
[  214.296287] device fsid f7990e4c-90b0-4691-9502-92b60538574a devid
1 transid 4 /dev/sdd
[  214.306510] btrfs: use lzo compression
[  214.310855] btrfs: enabling inode map caching
[  214.315905] btrfs: enabling auto defrag
[  214.320174] btrfs: disk space caching is enabled
[  214.325706] btrfs flagging fs with big metadata feature
[ 1337.937379] [ cut here ]
[ 1337.942526] kernel BUG at fs/btrfs/inode.c:2224!
[ 1337.947671] invalid opcode:  [#1] SMP
[ 1337.952255] CPU 5
[ 1337.954300] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg pcspkr serio_raw iTCO_wdt
iTCO_vendor_support iomemory_vsl(PO) ixgbe dca mdio i7core_edac
edac_core hpsa squashfs [last unloaded: scsi_wait_scan]
[ 1337.978570]
[ 1337.980230] Pid: 6812, comm: ceph-osd Tainted: P   O
3.3.5-1.fits.1.el6.x86_64 #1 HP ProLiant DL180 G6
[ 1337.991592] RIP: 0010:[a035675c]  [a035675c]
btrfs_orphan_del+0x14c/0x150 [btrfs]
[ 1338.001897] RSP: 0018:8805e1171d38  EFLAGS: 00010282
[ 1338.007815] RAX: fffe RBX: 88061c3c8400 RCX: 00b37f48
[ 1338.015768] RDX: 00b37f47 RSI: 8805ec2a1cf0 RDI: ea0017b0a840
[ 1338.023724] RBP: 8805e1171d68 R08: 60f9d88028a0 R09: a033016a
[ 1338.031675] R10:  R11: 0004 R12: 8805de7f57a0
[ 1338.039629] R13: 0001 R14: 0001 R15: 8805ec2a5280
[ 1338.047584] FS:  7f4bffc6e700() GS:8806272a()
knlGS:
[ 1338.056600] CS:  0010 DS:  ES:  CR0: 80050033
[ 1338.063003] CR2: ff600400 CR3: 0005e34c3000 CR4: 06e0
[ 1338.070954] DR0:  DR1:  DR2: 
[ 1338.078909] DR3:  DR6: 0ff0 DR7: 0400
[ 1338.086865] Process ceph-osd (pid: 6812, threadinfo
8805e117, task 88060fa81940)
[ 1338.096268] Stack:
[ 1338.098509]  8805e1171d68 8805ec2a5280 88051235b920

[ 1338.106795]  88051235b920 0008 8805e1171e08
a036043c
[ 1338.115082]    
00011000
[ 1338.123367] Call Trace:
[ 1338.126111]  [a036043c] btrfs_truncate+0x5bc/0x640 [btrfs]
[ 1338.133213]  [a03605b6] btrfs_setattr+0xf6/0x1a0 [btrfs]
[ 1338.140105]  [811816fb] notify_change+0x18b/0x2b0
[ 1338.146320]  [81276541] ? selinux_inode_permission+0xd1/0x130
[ 1338.153699]  [81165f44] do_truncate+0x64/0xa0
[ 1338.159527]  [81172669] ? inode_permission+0x49/0x100
[ 1338.166128]  [81166197] sys_truncate+0x137/0x150
[ 1338.172244]  [8158b1e9] system_call_fastpath+0x16/0x1b
[ 1338.178936] Code: 89 e7 e8 88 7d fe ff eb 89 66 0f 1f 44 00 00 be
a4 08 00 00 48 c7 c7 59 49 3b a0 45 31 ed e8 5c 78 cf e0 45 31 f6 e9
30 ff ff ff 0f 0b eb fe 55 48 89 e5 48 83 ec 40 48 89 5d d8 4c 89 65
e0 4c
[ 1338.200623] RIP  [a035675c] btrfs_orphan_del+0x14c/0x150 [btrfs]
[ 1338.208317]  RSP 8805e1171d38
[ 1338.212681] ---[ end trace 86be14f0f863ea79 ]---


Re: Ceph on btrfs 3.4rc

2012-05-11 Thread Christian Brunner
2012/5/10 Josef Bacik jo...@redhat.com:
 On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
 On 24 April 2012 at 18:26, Sage Weil s...@newdream.net wrote:
  On Tue, 24 Apr 2012, Josef Bacik wrote:
  On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
   After running ceph on XFS for some time, I decided to try btrfs again.
   Performance with the current for-linux-min branch and big metadata
   is much better. The only problem (?) I'm still seeing is a warning
   that seems to occur from time to time:
 
  Actually, before you do that... we have a new tool,
  test_filestore_workloadgen, that generates a ceph-osd-like workload on the
  local file system.  It's a subset of what a full OSD might do, but if
  we're lucky it will be sufficient to reproduce this issue.  Something like
 
   test_filestore_workloadgen --osd-data /foo --osd-journal /bar
 
  will hopefully do the trick.
 
  Christian, maybe you can see if that is able to trigger this warning?
  You'll need to pull it from the current master branch; it wasn't in the
  last release.

 Trying to reproduce with test_filestore_workloadgen didn't work for
 me. So here are some instructions on how to reproduce with a minimal
 ceph setup.
 [...]

 Well I feel like an idiot, I finally get it to reproduce, go look at where I
 want to put my printks and there's the problem staring me right in the face.
 I've looked seriously at this problem 2 or 3 times and have missed this every
 single freaking time.  Here is the patch I'm trying, please try it on yours to
 make sure it fixes the problem.  It takes like 2 hours for it to reproduce for
 me so I won't be able to fully test it until tomorrow, but so far it hasn't
 broken anything so it should be good.  Thanks,

Great! I've put your patch on my testbox and will run a test over the
weekend. I'll report back on monday.

Thanks,
Christian


Re: Ceph on btrfs 3.4rc

2012-05-04 Thread Christian Brunner
2012/5/3 Josef Bacik jo...@redhat.com:
 On Thu, May 03, 2012 at 09:38:27AM -0700, Josh Durgin wrote:
 On Thu, 3 May 2012 11:20:53 -0400, Josef Bacik jo...@redhat.com
 wrote:
  On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
 
  Yeah all that was in the right place, I rebooted and I magically
  stopped getting
  that error, but now I'm getting this
 
  http://fpaste.org/OE92/
 
  with that ping thing repeating over and over.  Thanks,

 That just looks like the osd isn't running. If you restart the
 osd with 'debug osd = 20' the osd log should tell us what's going on.

 Ok, that part was my fault. Duh, I need to redo the tmpfs and mkcephfs stuff
 after reboot.  But now I'm back to my original problem

 http://fpaste.org/PfwO/

 I have the osd class dir = /usr/lib64/rados-classes thing set and libcls_rbd
 is in there, so I'm not sure what is wrong.  Thanks,

That's really strange. Do you have the osd logs in /var/log/ceph? If
so, can you check whether you find anything about rbd or class loading
in there?

Another thing you should try is, whether you can access ceph with rados:

# rados -p rbd ls
# rados -p rbd -i /proc/cpuinfo put testobj
# rados -p rbd -o - get testobj

Regards,
Christian


Re: Ceph on btrfs 3.4rc

2012-04-30 Thread Christian Brunner
2012/4/29 tsuna tsuna...@gmail.com:
 On Fri, Apr 20, 2012 at 8:09 AM, Christian Brunner
 christ...@brunner-muc.de wrote:
 After running ceph on XFS for some time, I decided to try btrfs again.
 Performance with the current for-linux-min branch and big metadata
 is much better.

 I've heard that although performance from btrfs is better at first, it
 degrades over time due to metadata fragmentation, whereas XFS'
 performance starts off a little worse, but remains stable even after
 weeks of heavy utilization.  Would be curious to hear your (or
 others') feedback on that topic.

Metadata fragmentation was a big problem (for us) in the past. With
the big metadata feature (mkfs.btrfs -l 64k -n 64k) these problems
seem to be solved. We do not use it in production yet, but my stress
test didn't show any degradation. The only remaining issues I've seen
are these warnings.

Regards,
Christian


Re: Ceph on btrfs 3.4rc

2012-04-27 Thread Christian Brunner
On 24 April 2012 at 18:26, Sage Weil s...@newdream.net wrote:
 On Tue, 24 Apr 2012, Josef Bacik wrote:
 On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
  After running ceph on XFS for some time, I decided to try btrfs again.
  Performance with the current for-linux-min branch and big metadata
  is much better. The only problem (?) I'm still seeing is a warning
  that seems to occur from time to time:

 Actually, before you do that... we have a new tool,
 test_filestore_workloadgen, that generates a ceph-osd-like workload on the
 local file system.  It's a subset of what a full OSD might do, but if
 we're lucky it will be sufficient to reproduce this issue.  Something like

  test_filestore_workloadgen --osd-data /foo --osd-journal /bar

 will hopefully do the trick.

 Christian, maybe you can see if that is able to trigger this warning?
 You'll need to pull it from the current master branch; it wasn't in the
 last release.

Trying to reproduce with test_filestore_workloadgen didn't work for
me. So here are some instructions on how to reproduce with a minimal
ceph setup.

You will need a single system with two disks and a bit of memory.

- Compile and install ceph (detailed instructions:
http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)

- For the test setup I've used two tmpfs files as journal devices. To
create these, do the following:

# mkdir -p /ceph/temp
# mount -t tmpfs tmpfs /ceph/temp
# dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
# dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k

- Now you should create and mount btrfs. Here is what I did:

# mkfs.btrfs -l 64k -n 64k /dev/sda
# mkfs.btrfs -l 64k -n 64k /dev/sdb
# mkdir /ceph/osd.000
# mkdir /ceph/osd.001
# mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
# mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001

- Create /etc/ceph/ceph.conf similar to the attached ceph.conf (a rough,
unverified sketch is included after these steps). You will probably have
to change the btrfs devices and the hostname (os39).

- Create the ceph filesystems:

# mkdir /ceph/mon
# mkcephfs -a -c /etc/ceph/ceph.conf

- Start ceph (e.g. service ceph start)

- Now you should be able to use ceph - "ceph -s" will tell you about
the state of the ceph cluster.

- "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.

- Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
with "./rbdtest testimg".

I can see the first btrfs_orphan_commit_root warning after an hour or
so... I hope that I've described all necessary steps. If there is a
problem just send me a note.
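For the ceph.conf step above: since the attachment is not reproduced in this
archive, here is a rough, unverified sketch of what a minimal config for the
two-OSD setup might look like. Paths and the hostname os39 are taken from the
steps above; the monitor address, the OSD numbering and the exact option
spellings are assumptions that should be checked against the installed ceph
version:

[global]
        ; no authentication for a throwaway test cluster (assumption)
        auth supported = none

[mon]
        mon data = /ceph/mon

[mon.0]
        host = os39
        mon addr = 192.168.1.39:6789        ; placeholder address

[osd]
        osd journal size = 500

[osd.0]
        host = os39
        osd data = /ceph/osd.000
        osd journal = /ceph/temp/journal0

[osd.1]
        host = os39
        osd data = /ceph/osd.001
        osd journal = /ceph/temp/journal1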

Thanks,
Christian


ceph.conf
Description: Binary data


Re: Ceph on btrfs 3.4rc

2012-04-23 Thread Christian Brunner
I decided to run the test over the weekend. The good news is that the
system is still running without performance degradation. But in the
meantime I've got over 5000 WARNINGs of this kind:

[330700.043557] btrfs: block rsv returned -28
[330700.043559] [ cut here ]
[330700.048898] WARNING: at fs/btrfs/extent-tree.c:6220
btrfs_alloc_free_block+0x357/0x370 [btrfs]()
[330700.058880] Hardware name: ProLiant DL180 G6
[330700.064044] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[330700.090361] Pid: 7954, comm: btrfs-endio-wri Tainted: PW
O 3.3.2-1.fits.1.el6.x86_64 #1
[330700.100393] Call Trace:
[330700.103263]  [8104df6f] warn_slowpath_common+0x7f/0xc0
[330700.110201]  [8104dfca] warn_slowpath_null+0x1a/0x20
[330700.116905]  [a03436f7] btrfs_alloc_free_block+0x357/0x370 [btrfs]
[330700.124988]  [a0330eb0] ? __btrfs_cow_block+0x330/0x530 [btrfs]
[330700.132787]  [a0398174] ?
btrfs_add_delayed_data_ref+0x64/0x1c0 [btrfs]
[330700.141369]  [a0372d8b] ? read_extent_buffer+0xbb/0x120 [btrfs]
[330700.149194]  [a0365d6d] ?
btrfs_token_item_offset+0x5d/0xe0 [btrfs]
[330700.157373]  [a0330cb3] __btrfs_cow_block+0x133/0x530 [btrfs]
[330700.165023]  [a032f2ed] ?
read_block_for_search+0x14d/0x3d0 [btrfs]
[330700.173183]  [a0331684] btrfs_cow_block+0xf4/0x1f0 [btrfs]
[330700.180552]  [a03344b8] btrfs_search_slot+0x3e8/0x8e0 [btrfs]
[330700.188128]  [a03469f4] btrfs_lookup_csum+0x74/0x170 [btrfs]
[330700.195634]  [811589e5] ? kmem_cache_alloc+0x105/0x130
[330700.202551]  [a03477e0] btrfs_csum_file_blocks+0xd0/0x6d0 [btrfs]
[330700.210542]  [a03768b1] ? clear_extent_bit+0x161/0x420 [btrfs]
[330700.218237]  [a0354109] add_pending_csums+0x49/0x70 [btrfs]
[330700.225706]  [a0357de6]
btrfs_finish_ordered_io+0x276/0x3d0 [btrfs]
[330700.233940]  [a0357f8c]
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[330700.242345]  [a0376cb9] end_extent_writepage+0x69/0x100 [btrfs]
[330700.250192]  [a0376db6] end_bio_extent_writepage+0x66/0xa0 [btrfs]
[330700.258327]  [8119959d] bio_endio+0x1d/0x40
[330700.264214]  [a034b135] end_workqueue_fn+0x45/0x50 [btrfs]
[330700.271612]  [a03831df] worker_loop+0x14f/0x5a0 [btrfs]
[330700.278672]  [a0383090] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[330700.286582]  [a0383090] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[330700.294535]  [810703fe] kthread+0x9e/0xb0
[330700.300244]  [8158c224] kernel_thread_helper+0x4/0x10
[330700.307031]  [81070360] ? kthread_freezable_should_stop+0x70/0x70
[330700.315061]  [8158c220] ? gs_change+0x13/0x13
[330700.321167] ---[ end trace b8c31966cca74ca0 ]---

The filesystems have plenty of free space:

/dev/sda  1.9T   16G  1.8T   1% /ceph/osd.000
/dev/sdb  1.9T   15G  1.8T   1% /ceph/osd.001
/dev/sdc  1.9T   13G  1.8T   1% /ceph/osd.002
/dev/sdd  1.9T   14G  1.8T   1% /ceph/osd.003

# btrfs fi df /ceph/osd.000
Data: total=38.01GB, used=15.53GB
System, DUP: total=8.00MB, used=64.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=37.50GB, used=82.19MB
Metadata: total=8.00MB, used=0.00

A few more btrfs_orphan_commit_root WARNINGS are present too. If
needed I could upload the messages file.

Regards,
Christian

On 20 April 2012 at 17:09, Christian Brunner christ...@brunner-muc.de wrote:
 After running ceph on XFS for some time, I decided to try btrfs again.
 Performance with the current for-linux-min branch and big metadata
 is much better. The only problem (?) I'm still seeing is a warning
 that seems to occur from time to time:

 [87703.784552] [ cut here ]
 [87703.789759] WARNING: at fs/btrfs/inode.c:2103
 btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
 [87703.799070] Hardware name: ProLiant DL180 G6
 [87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
 exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
 iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
 iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
 [87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P           O
 3.3.2-1.fits.1.el6.x86_64 #1
 [87703.837513] Call Trace:
 [87703.840280]  [8104df6f] warn_slowpath_common+0x7f/0xc0
 [87703.847016]  [8104dfca] warn_slowpath_null+0x1a/0x20
 [87703.853533]  [a0355686] btrfs_orphan_commit_root+0xf6/0x100 
 [btrfs]
 [87703.861541]  [a0350a06] commit_fs_roots+0xc6/0x1c0 [btrfs]
 [87703.868674]  [a0351bcb]
 btrfs_commit_transaction+0x5db/0xa50 [btrfs]
 [87703.876745]  [810127a3] ? __switch_to+0x153/0x440
 [87703.882966]  [81070a90] ? wake_up_bit+0x40/0x40
 [87703.888997

Ceph on btrfs 3.4rc

2012-04-20 Thread Christian Brunner
After running ceph on XFS for some time, I decided to try btrfs again.
Performance with the current for-linux-min branch and big metadata
is much better. The only problem (?) I'm still seeing is a warning
that seems to occur from time to time:

[87703.784552] [ cut here ]
[87703.789759] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
[87703.799070] Hardware name: ProLiant DL180 G6
[87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P   O
3.3.2-1.fits.1.el6.x86_64 #1
[87703.837513] Call Trace:
[87703.840280]  [8104df6f] warn_slowpath_common+0x7f/0xc0
[87703.847016]  [8104dfca] warn_slowpath_null+0x1a/0x20
[87703.853533]  [a0355686] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
[87703.861541]  [a0350a06] commit_fs_roots+0xc6/0x1c0 [btrfs]
[87703.868674]  [a0351bcb]
btrfs_commit_transaction+0x5db/0xa50 [btrfs]
[87703.876745]  [810127a3] ? __switch_to+0x153/0x440
[87703.882966]  [81070a90] ? wake_up_bit+0x40/0x40
[87703.888997]  [a0352040] ?
btrfs_commit_transaction+0xa50/0xa50 [btrfs]
[87703.897271]  [a035205f] do_async_commit+0x1f/0x30 [btrfs]
[87703.904262]  [81068949] process_one_work+0x129/0x450
[87703.910777]  [8106b7eb] worker_thread+0x17b/0x3c0
[87703.916991]  [8106b670] ? manage_workers+0x220/0x220
[87703.923504]  [810703fe] kthread+0x9e/0xb0
[87703.928952]  [8158c224] kernel_thread_helper+0x4/0x10
[87703.93]  [81070360] ? kthread_freezable_should_stop+0x70/0x70
[87703.943323]  [8158c220] ? gs_change+0x13/0x13
[87703.949149] ---[ end trace b8c31966cca731fa ]---
[91128.812399] [ cut here ]
[91128.817576] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
[91128.826930] Hardware name: ProLiant DL180 G6
[91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: PW  O
3.3.2-1.fits.1.el6.x86_64 #1
[91128.865912] Call Trace:
[91128.868670]  [8104df6f] warn_slowpath_common+0x7f/0xc0
[91128.875379]  [8104dfca] warn_slowpath_null+0x1a/0x20
[91128.881900]  [a0355686] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
[91128.889894]  [a0350a06] commit_fs_roots+0xc6/0x1c0 [btrfs]
[91128.897019]  [a03a2b61] ?
btrfs_run_delayed_items+0xf1/0x160 [btrfs]
[91128.905075]  [a0351bcb]
btrfs_commit_transaction+0x5db/0xa50 [btrfs]
[91128.913156]  [a03524b2] ? start_transaction+0x92/0x310 [btrfs]
[91128.920643]  [81070a90] ? wake_up_bit+0x40/0x40
[91128.926667]  [a034cfcb] transaction_kthread+0x26b/0x2e0 [btrfs]
[91128.934254]  [a034cd60] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[91128.943671]  [a034cd60] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[91128.953079]  [810703fe] kthread+0x9e/0xb0
[91128.958532]  [8158c224] kernel_thread_helper+0x4/0x10
[91128.965133]  [81070360] ? kthread_freezable_should_stop+0x70/0x70
[91128.972913]  [8158c220] ? gs_change+0x13/0x13
[91128.978826] ---[ end trace b8c31966cca731fb ]---

I'm able to reproduce this with ceph on a single server with 4 disks
(4 filesystems/osds) and a small test program based on librbd. It is
simply writing random bytes on a rbd volume (see attachment).

Is this something I should care about? Any hints on solving this
would be appreciated.

Thanks,
Christian
#include <inttypes.h>
#include <rbd/librbd.h>
#include <stdio.h>
#include <signal.h>

int nr_writes = 0;

void
alarm_handler(int sig) {
	fprintf(stderr, "Writes/sec: %i\n", nr_writes/10);
	nr_writes = 0;
	alarm(10);
}


int main(int argc, char *argv[]) {
	char *clientname;
	rados_t cluster;
	rados_ioctx_t io_ctx;
	rbd_image_t image;
	char *pool = "rbd";
	char *imgname = argv[1];

	if (rados_create(&cluster, NULL) < 0) {
		fprintf(stderr, "error initializing");
		return 1;
	}

	rados_conf_read_file(cluster, NULL);

	if (rados_connect(cluster) < 0) {
		fprintf(stderr, "error connecting");
		rados_shutdown(cluster);
		return 1;
	}

	if (rados_ioctx_create(cluster, pool, &io_ctx) < 0) {
		fprintf(stderr, "error opening pool %s", pool);
		rados_shutdown(cluster);
		return 1;
	}

	int r = rbd_open(io_ctx, imgname, &image, NULL);
	if (r < 0) {
		fprintf(stderr, "error reading header from %s", imgname);
		rados_ioctx_destroy(io_ctx);

Re: Strange performance degradation when COW writes happen at fixed offsets

2012-02-27 Thread Christian Brunner
2012/2/24 Nik Markovic nmarkovi.nav...@gmail.com:
 To add... I also tried the nodatasum (only) and nodatacow options. I found
 somewhere that nodatacow doesn't really mean that COW is disabled.
 Test data is still the same - CPU spikes and times are the same.

 On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic nmarkovi.nav...@gmail.com 
 wrote:
 On Fri, Feb 24, 2012 at 12:38 AM, Duncan 1i5t5.dun...@cox.net wrote:
 Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:

 I noticed a few errors in the script that I used. I corrected it and it
 seems that degradation is occurring even at fully random writes:

 I don't have an ssd, but is it possible that you're simply seeing erase-
 block related degradation due to multi-write-block sized erase-blocks?

 It seems to me that when originally written to the btrfs-on-ssd, the file
 will likely be written block-sequentially enough that the file as a whole
 takes up relatively few erase-blocks.  As you COW-write individual
 blocks, they'll be written elsewhere, perhaps all the changed blocks to a
 new erase-block, perhaps each to a different erase block.

 This is a very interesting insight. I wasn't even aware of the
 erase-block issue, so I did some reading up on it...


 As you increase the successive COW generation count, the file's file-
 system/write blocks will be spread thru more and more erase-blocks,
 basically fragmentation but of the SSD-critical type, into more and more
 erase blocks, thus affecting modification and removal time but not read
 time.

 OK, so time to write would increase due to fragmentation and writing,
 it now makes sense (though I don't see why small writes would affect
 this, but my concerns are not writes anyway), but why would cp
 --reflink time increase so much. Yes, new extents would be created,
 but btrfs doesn't write into data blocks, does it? I figured its
 metadata would be kept in one place. I figure the only thing BTRFS
 would do on cp --reflink=always:
 1. Take a collection of extents owned by source.
 2. Make the new copy use the same collection of extents.
 3. Write the collection of extents to the directory.

 Now this process seems to be CPU intensive. When I remove or make a
 reflink copy, one core spikes up to 100%, which tells me that there's a
 performance issue there, not an ssd issue. Also, only one CPU thread
 is being used for this. I figured that I can improve this by some
 setting. Maybe thread_pool mount option? Are there any updates in
 later kernels that I should possibly pick up?

 [...]

 Unless I am wrong, this would disable COW completely and reflink copy.
 Reflinks are a crucial component and the sole
 reason I picked BTRFS for the system that I am writing for my company.
 The autodefrag option addresses multiple writes. Writing is not the
 problem, but cp --reflink should be near-instant. That was the reason
 we chose BTRFS over ZFS, which seemed to be the only feasible
 alternative. ZFS snapshots complicate the design and deduplication copy
 time is the same as (or not much better than) raw copy.

 [...]

 As I mentioned above, the COW is the crucial component of our system,
 XFS won't do. Our system does not do random writes. In fact it is
 mainly heavy on read operations. The system does occasional rotation
 of rust on large files in the way a version control system would
 (large files are modified and then used as a new baseline)

The symptoms you are reporting are quite similar to what I'm seeing in
our Ceph cluster:

http://comments.gmane.org/gmane.comp.file-systems.btrfs/15413

AFAIK, Chris and Josef are working on it, but you'll have to wait for
kernel 3.4 before this is available in mainline. If you are
feeling adventurous, you could try the patches in Josef's git tree,
but I think they're still experimental.

Regards,
Christian


Re: Btrfs slowdown with ceph (how to reproduce)

2012-01-23 Thread Christian Brunner
2012/1/23 Chris Mason chris.ma...@oracle.com:
 On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
 On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
  As you might know, I have been seeing btrfs slowdowns in our ceph
  cluster for quite some time. Even with the latest btrfs code for 3.3
  I'm still seeing these problems. To make things reproducible, I've now
  written a small test, that imitates ceph's behavior:
 
  On a freshly created btrfs filesystem (2 TB size, mounted with
  noatime,nodiratime,compress=lzo,space_cache,inode_cache) I'm opening
  100 files. After that I'm doing random writes on these files with a
  sync_file_range after each write (each write has a size of 100 bytes)
  and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
 
  After approximately 20 minutes, write activity suddenly increases
  fourfold and the average request size decreases (see chart in the
  attachment).
 
  You can find IOstat output here: http://pastebin.com/Smbfg1aG
 
  I hope that you are able to trace down the problem with the test
  program in the attachment.

 Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
 formatted the fs with 64k node and leaf sizes and the problem appeared to go
 away.  So surprise surprise fragmentation is biting us in the ass.  If you can
 try running that branch with 64k node and leaf sizes with your ceph cluster and
 see how that works out.  Course you should only do that if you don't mind if you
 lose everything :).  Thanks,


 Please keep in mind this branch is only out there for development, and
 it really might have huge flaws.  scrub doesn't work with it correctly
 right now, and the IO error recovery code is probably broken too.

 Long term though, I think the bigger block sizes are going to make a
 huge difference in these workloads.

 If you use the very dangerous code:

 mkfs.btrfs -l 64k -n 64k /dev/xxx

 (-l is leaf size, -n is node size).

 64K is the max right now, 32K may help just as much at a lower CPU cost.

Thanks for taking a look. - I'm glad to hear that there is a solution
on the horizon, but I'm not brave enough to try this on our ceph
cluster. I'll try it when the code has stabilized a bit.

Regards,
Christian


Btrfs slowdown with ceph (how to reproduce)

2012-01-20 Thread Christian Brunner
As you might know, I have been seeing btrfs slowdowns in our ceph
cluster for quite some time. Even with the latest btrfs code for 3.3
I'm still seeing these problems. To make things reproducible, I've now
written a small test that imitates ceph's behavior:

On a freshly created btrfs filesystem (2 TB size, mounted with
noatime,nodiratime,compress=lzo,space_cache,inode_cache) I'm opening
100 files. After that I'm doing random writes on these files with a
sync_file_range after each write (each write has a size of 100 bytes)
and ioctl(BTRFS_IOC_SYNC) after every 100 writes.

After approximately 20 minutes, write activity suddenly increases
fourfold and the average request size decreases (see chart in the
attachment).

You can find IOstat output here: http://pastebin.com/Smbfg1aG

I hope that you are able to trace down the problem with the test
program in the attachment.

Thanks,
Christian
#define _GNU_SOURCE

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <attr/xattr.h>

#define FILE_COUNT 100
#define FILE_SIZE 4194304

#define STRING "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_SYNC _IO(BTRFS_IOCTL_MAGIC, 8)

int main(int argc, char *argv[]) {
	char *imgname = argv[1];
	char *tempname;
	int fd[FILE_COUNT];
	int ilen, i;

	ilen = strlen(imgname);
	tempname = malloc(ilen + 8);

	/* create FILE_COUNT files named <imgname>.0 .. <imgname>.99 */
	for (i = 0; i < FILE_COUNT; i++) {
		snprintf(tempname, ilen + 8, "%s.%i", imgname, i);
		fd[i] = open(tempname, O_CREAT|O_RDWR, 0600);
	}

	i = 0;
	while (1) {
		int start = rand() % FILE_SIZE;
		int file = rand() % FILE_COUNT;

		putc('.', stderr);

		/* 100-byte write at a random offset, flushed with
		 * SYNC_FILE_RANGE_WRITE (0x2) */
		lseek(fd[file], start, SEEK_SET);
		write(fd[file], STRING, 100);
		sync_file_range(fd[file], start, 100, 0x2);

		usleep(25000);

		i++;
		if (i == 100) {
			/* full filesystem sync after every 100 writes */
			i = 0;
			ioctl(fd[file], BTRFS_IOC_SYNC);
		}
	}
}
attachment: btrfstest.png

Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-12 Thread Christian Brunner
2012/1/7 Christian Brunner c...@muc.de:
 2012/1/5 Chris Mason chris.ma...@oracle.com:
 On Fri, Jan 06, 2012 at 07:12:16AM +1100, Dave Chinner wrote:
 On Thu, Jan 05, 2012 at 02:45:00PM -0500, Chris Mason wrote:
  On Thu, Jan 05, 2012 at 01:46:57PM -0500, Chris Mason wrote:
  
   Unfortunately, this one works for me.  I'll try it again and see if I
   can push harder.  If not, I'll see if I can trade beer for some
   diagnostic runs.
 
  Aha, if I try it just on the ssd instead of on my full array it triggers
  at 88M files.  Great.

 Good to know.  The error that is generating the BUG on my machine is
 -28 (ENOSPC).  Given there's 17TB free on my filesystem

 Yeah, same thing here.  I'm testing a fix now, it's pretty dumb.  We're
 not allocating more metadata chunks from the drive because of where the
 allocation is happening, so it is just a check for do we need a new
 chunk in the right place.

 I'll make sure it can fill my ssd and then send to you.

 Could you send the patch to the list (or to me), please? Judging from
 what you mentioned on IRC this sounds quite interesting and I would
 like to see if this solves my performance problems with ceph, too...

I apologize for bothering you again, but I would really like to give it a spin.

Thanks,
Christian


Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-07 Thread Christian Brunner
2012/1/5 Chris Mason chris.ma...@oracle.com:
 On Fri, Jan 06, 2012 at 07:12:16AM +1100, Dave Chinner wrote:
 On Thu, Jan 05, 2012 at 02:45:00PM -0500, Chris Mason wrote:
  On Thu, Jan 05, 2012 at 01:46:57PM -0500, Chris Mason wrote:
   On Thu, Jan 05, 2012 at 10:01:22AM +1100, Dave Chinner wrote:
On Thu, Jan 05, 2012 at 09:23:52AM +1100, Chris Samuel wrote:
 On 05/01/12 09:11, Dave Chinner wrote:

  Looks to be reproducable.

 Does this happen with rc6 ?
   
I haven't tried. All I'm doing is running some benchmarks to get
numbers for a talk I'm giving about improvements in XFS metadata
scalability, so I wanted to update my last set of numbers from
2.6.39.
   
As it was, these benchmarks also failed on btrfs with oopsen and
corruptions back in 2.6.39 time frame.  e.g. same VM, same
test, different crashes, similar slowdowns as reported here:
http://comments.gmane.org/gmane.comp.file-systems.btrfs/11062
   
Given that there is now a history of this simple test uncovering
problems, perhaps this is a test that should be run more regularly
by btrfs developers?
  
   Unfortunately, this one works for me.  I'll try it again and see if I
   can push harder.  If not, I'll see if I can trade beer for some
   diagnostic runs.
 
  Aha, if I try it just on the ssd instead of on my full array it triggers
  at 88M files.  Great.

 Good to know.  The error that is generating the BUG on my machine is
 -28 (ENOSPC).  Given there's 17TB free on my filesystem

 Yeah, same thing here.  I'm testing a fix now, it's pretty dumb.  We're
 not allocating more metadata chunks from the drive because of where the
 allocation is happening, so it is just a check for do we need a new
 chunk in the right place.

 I'll make sure it can fill my ssd and then send to you.

Could you send the patch to the list (or to me), please? Judging from
what you mentioned on IRC this sounds quite interesting and I would
like to see if this solves my performance problems with ceph, too...

Thanks,
Christian


Re: WARNING: at fs/btrfs/extent-tree.c:5980

2011-12-13 Thread Christian Brunner
Sorry - I forgot to mention that I'm still seeing this with:

[PATCH] Btrfs: update global block_rsv when creating a new block group

Christian

2011/12/13 Christian Brunner c...@muc.de:
 Hi,

 with the latest btrfs for-linus I'm seeing occasional
 btrfs_alloc_free_block warnings on several nodes in our ceph cluster.

 Before the warning there is an additional block rsv -28 message, but
 there is plenty of free space on the disk.


 [201653.774412] btrfs: block rsv returned -28
 [201653.774415] [ cut here ]
 [201653.779846] WARNING: at fs/btrfs/extent-tree.c:5980
 btrfs_alloc_free_block+0x347/0x360 [btrfs]()

 The complte trace is here:

 http://pastebin.com/0SFeZReg

 The extent-tree.c:5980 is in use_block_rsv():

 5974         if (ret) {
 5975                 static DEFINE_RATELIMIT_STATE(_rs,
 5976                                 DEFAULT_RATELIMIT_INTERVAL,
 5977                                 /*DEFAULT_RATELIMIT_BURST*/ 2);
 5978                 if (__ratelimit(&_rs)) {
 5979                         printk(KERN_DEBUG "btrfs: block rsv returned %d\n", ret);
 5980                         WARN_ON(1);
 5981                 }
 5982                 ret = reserve_metadata_bytes(root, block_rsv, blocksize, 0);

 Thanks,
 Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-12 Thread Christian Brunner
2011/12/12 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 On Dec  7, 2011, Christian Brunner c...@muc.de wrote:

 With this patch applied I get much higher write-io values than without
 it. Some of the other patches help to reduce the effect, but it's
 still significant.

 iostat on an unpatched node is giving me:

 Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
 avgrq-sz avgqu-sz   await  svctm  %util
 sda             105.90     0.37   15.42   14.48  2657.33   560.13
 107.61     1.89   62.75   6.26  18.71

 while on a node with this patch it's
 sda             128.20     0.97   11.10   57.15  3376.80   552.80
 57.58    20.58  296.33   4.16  28.36


 Also interesting, is the fact that the average request size on the
 patched node is much smaller.

 That's probably expected for writes, as bitmaps are expected to be more
 fragmented, even if used only for metadata (or are you on SSD?)


It's a traditional hardware RAID5 with spinning disks. - I would
accept this if the writes started right after the mount, but in
this case it takes a few hours until the writes increase. That's why
I'm almost certain that something is still wrong.

 Bitmaps are just a different in-memory (and on-disk-cache, if enabled)
 representation of free space, that can be far more compact: one bit per
 disk block, rather than an extent list entry.  They're interchangeable
 otherwise, it's just that searching bitmaps for a free block (bit) is
 somewhat more expensive than taking the next entry from a list, but you
 don't want to use up too much memory with long lists of
 e.g. single-block free extents.
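As a toy illustration of that difference (made-up code, not btrfs internals):
with an extent list the next free range is simply the head entry, while with
a bitmap free space has to be found by scanning for a clear bit:

#include <stdint.h>
#include <stdio.h>

/* extent-list style: a free range is an explicit (start, length) entry */
struct free_extent { uint64_t start, len; };

/* bitmap style: one bit per block; finding free space means scanning bits */
static long find_free_block(const uint64_t *bitmap, long nbits)
{
	for (long i = 0; i < nbits; i++)
		if (!(bitmap[i / 64] & (1ULL << (i % 64))))
			return i;	/* block i is free */
	return -1;
}

int main(void)
{
	struct free_extent list_head = { .start = 4096, .len = 65536 };
	uint64_t bitmap[2] = { ~0ULL, 0x0fULL };	/* blocks 0-67 used, 68+ free */

	/* list: O(1) - just take the head entry */
	printf("list:   free range at %llu, len %llu\n",
	       (unsigned long long)list_head.start,
	       (unsigned long long)list_head.len);

	/* bitmap: O(n) scan over the bits */
	printf("bitmap: first free block index %ld\n", find_free_block(bitmap, 128));
	return 0;
}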

Thanks for the explanation! I'll try to insert some debugging code
once my test server is ready.

Christian


Re: avoid redundant block group free-space checks

2011-12-12 Thread Christian Brunner
2011/12/12 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 It was pointed out to me that the test for enough free space in a block
 group was wrong in that it would skip a block group that had most of its
 free space reserved by a cluster.

 I offer two mutually exclusive, (so far) very lightly tested patches to
 address this problem.

 One moves the test to the middle of the clustered allocation logic,
 between the release of the cluster and the attempt to create a new
 cluster, with some ugliness due to more indentation, locking operations
 and testing.

 The other, that I like better but haven't given any significant amount
 of testing yet, only performs the test when we fall back to unclustered
 allocation, relying on btrfs_find_space_cluster to test for enough free
 space early (it does); it also arranges for the cluster in the current
 block group to be released before we try unclustered allocation.

I've chosen to try the second patch in our ceph environment. It seems
that btrfs_find_space_cluster() isn't called any longer.
find_free_extent() is much faster now.

(I think that the write-io numbers are still too high, though.)

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-09 Thread Christian Brunner
2011/12/7 Christian Brunner c...@muc.de:
 2011/12/1 Christian Brunner c...@muc.de:
 2011/12/1 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 On Nov 29, 2011, Christian Brunner c...@muc.de wrote:

 When I'm doing heavy reading in our ceph cluster, the load and wait-io
 on the patched servers is higher than on the unpatched ones.

 That's unexpected.

 In the mean time I know, that it's not related to the reads.

 I suppose I could wave my hands while explaining that you're getting
 higher data throughput, so it's natural that it would take up more
 resources, but that explanation doesn't satisfy me.  I suppose
 allocation might have got slightly more CPU intensive in some cases, as
 we now use bitmaps where before we'd only use the cheaper-to-allocate
 extents.  But that's unsatisfying as well.

 I must admit that I do not completely understand the difference
 between bitmaps and extents.

 From what I see on my servers, I can tell, that the degradation over
 time is gone. (Rebooting the servers every day is no longer needed.
 This is a real plus.) But the performance compared to a freshly
 booted, unpatched server is much slower with my ceph workload.

 I wonder if it would make sense to initialize the list field only
 when the cluster setup fails? This would avoid the fallback to the
 much slower unclustered allocation and would give us the cheaper-to-allocate
 extents.

 I've now tried various combinations of your patches and I can really
 nail it down to this one line.

 With this patch applied I get much higher write-io values than without
 it. Some of the other patches help to reduce the effect, but it's
 still significant.

 iostat on an unpatched node is giving me:

 Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
 avgrq-sz avgqu-sz   await  svctm  %util
 sda             105.90     0.37   15.42   14.48  2657.33   560.13
 107.61     1.89   62.75   6.26  18.71

 while on a node with this patch it's
 sda             128.20     0.97   11.10   57.15  3376.80   552.80
 57.58    20.58  296.33   4.16  28.36


 Also interesting, is the fact that the average request size on the
 patched node is much smaller.

 Josef was telling me that this could be related to the number of
 bitmaps we write out, but I've no idea how to trace this.

 I would be very happy if someone could give me a hint on what to do
 next, as this is one of the last remaining issues with our ceph
 cluster.

This is still bugging me and I just remembered something that might be
helpful. Also I hope that this is not misleading...

Back in 2.6.38 we were running ceph without btrfs performance
degradation. I found a thread on the list where similar problems were
reported:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg10346.html

In that thread someone bisected the issue to

From 4e69b598f6cfb0940b75abf7e179d6020e94ad1e Mon Sep 17 00:00:00 2001
From: Josef Bacik jo...@redhat.com
Date: Mon, 21 Mar 2011 10:11:24 -0400
Subject: [PATCH] Btrfs: cleanup how we setup free space clusters

In this commit the bitmap handling was changed. So I just thought
that this may be related.

I'm still hoping that someone with a deeper understanding of btrfs
could take a look at this.

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-07 Thread Christian Brunner
2011/12/1 Christian Brunner c...@muc.de:
 2011/12/1 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 On Nov 29, 2011, Christian Brunner c...@muc.de wrote:

 When I'm doing heavy reading in our ceph cluster, the load and wait-io
 on the patched servers is higher than on the unpatched ones.

 That's unexpected.

In the mean time I know, that it's not related to the reads.

 I suppose I could wave my hands while explaining that you're getting
 higher data throughput, so it's natural that it would take up more
 resources, but that explanation doesn't satisfy me.  I suppose
 allocation might have got slightly more CPU intensive in some cases, as
 we now use bitmaps where before we'd only use the cheaper-to-allocate
 extents.  But that's unsatisfying as well.

 I must admit that I do not completely understand the difference
 between bitmaps and extents.

 From what I see on my servers, I can tell, that the degradation over
 time is gone. (Rebooting the servers every day is no longer needed.
 This is a real plus.) But the performance compared to a freshly
 booted, unpatched server is much slower with my ceph workload.

 I wonder if it would make sense to initialize the list field only
 when the cluster setup fails? This would avoid the fallback to the
 much slower unclustered allocation and would give us the cheaper-to-allocate
 extents.

I've now tried various combinations of your patches and I can really
nail it down to this one line.

With this patch applied I get much higher write-io values than without
it. Some of the other patches help to reduce the effect, but it's
still significant.

iostat on an unpatched node is giving me:

Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
sda 105.90 0.37   15.42   14.48  2657.33   560.13
107.61 1.89   62.75   6.26  18.71

while on a node with this patch it's
sda 128.20 0.97   11.10   57.15  3376.80   552.80
57.5820.58  296.33   4.16  28.36


Also interesting, is the fact that the average request size on the
patched node is much smaller.

Josef was telling me that this could be related to the number of
bitmaps we write out, but I've no idea how to trace this.
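For what it's worth, a rough count of the free-space-cache writeouts could be
collected with the function tracer. This is only a sketch, and
btrfs_write_out_cache is an assumption for the writeout entry point in this
kernel; the symbol should be checked against available_filter_functions first:

# cd /sys/kernel/debug/tracing
# grep btrfs_write_out_cache available_filter_functions
# echo btrfs_write_out_cache > set_ftrace_filter
# echo function > current_tracer
# echo 1 > tracing_on
  ... run the workload for a while ...
# echo 0 > tracing_on
# wc -l trace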

I would be very happy if someone could give me a hint on what to do
next, as this is one of the last remaining issues with our ceph
cluster.

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-11-29 Thread Christian Brunner
2011/11/28 Alexandre Oliva ol...@lsd.ic.unicamp.br:
 We're failing to create clusters with bitmaps because
 setup_cluster_no_bitmap checks that the list is empty before inserting
 the bitmap entry in the list for setup_cluster_bitmap, but the list
 field is only initialized when it is restored from the on-disk free
 space cache, or when it is written out to disk.

 Besides a potential race condition due to the multiple use of the list
 field, filesystem performance severely degrades over time: as we use
 up all non-bitmap free extents, the try-to-set-up-cluster dance is
 done at every metadata block allocation.  For every block group, we
 fail to set up a cluster, and after failing on them all up to twice,
 we fall back to the much slower unclustered allocation.

This matches exactly what I've been observing in our ceph cluster.
I've now installed your patches (1-11) on two servers.
The cluster setup problem seems to be gone. - A big thanks for that!

However, another thing is causing me some headache:

When I'm doing heavy reading in our ceph cluster, the load and wait-io
on the patched servers is higher than on the unpatched ones.

Dstat from an unpatched server:

total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   6  83   8   0   1|  22M  348k| 336k   93M|   0 0 |8445  3715
  1   5  87   7   0   1|  12M 1808k| 214k   65M|   0 0 |5461  1710
  1   3  85  10   0   0|  11M  640k| 313k   49M|   0 0 |5919  2853
  1   6  84   9   0   1|  12M  608k| 358k   69M|   0 0 |7406  3645
  1   7  78  13   0   1|  15M 5344k| 348k  105M|   0 0 |9765  4403
  1   7  80  10   0   1|  22M 1368k| 358k   89M|   0 0 |8036  3202
  1   9  72  16   0   1|  22M 2424k| 646k  137M|   0 0 |  12k 5527

Dstat from a patched server:

---total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   2  61  35   0   0|2500k 2736k| 141k   34M|   0 0 |4415  1603
  1   4  48  47   0   1|  10M 3924k| 353k   61M|   0 0 |6871  3771
  1   5  55  38   0   1|  10M 1728k| 385k   92M|   0 0 |8030  2617
  2   8  69  20   0   1|  18M 1384k| 435k  130M|   0 0 |  10k 4493
  1   5  85   8   0   1|7664k   84k| 287k   97M|   0 0 |6231  1357
  1   3  91   5   0   0|  10M  144k| 194k   44M|   0 0 |3807  1081
  1   7  66  25   0   1|  20M 1248k| 404k  101M|   0 0 |8676  3632
  0   3  38  58   0   0|8104k 2660k| 176k   40M|   0 0 |4841  2093


This seems to be coming from btrfs-endio-1, a kernel thread that has
not caught my attention on unpatched systems yet.

I did some tracing on that process with ftrace and I can see that the
time is wasted in end_bio_extent_readpage(). In a single call to
end_bio_extent_readpage(), the functions unlock_extent_cached(),
unlock_page() and btrfs_readpage_end_io_hook() are invoked 128 times
(each).
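(A per-call breakdown like this can be captured roughly as follows; this is
only a sketch, and the PID of the btrfs-endio-1 thread is a placeholder:)

# cd /sys/kernel/debug/tracing
# echo <pid of btrfs-endio-1> > set_ftrace_pid
# echo end_bio_extent_readpage > set_graph_function
# echo function_graph > current_tracer
# cat trace_pipe > /tmp/endio.trace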

Do you have any idea what's going on here?

(Please note that the filesystem is still unmodified - metadata
overhead is large).

Thanks,
Christian


Re: WARNING: at fs/btrfs/inode.c:2198 btrfs_orphan_commit_root+0xa8/0xc0

2011-11-26 Thread Christian Brunner
2011/11/26 Stefan Kleijkers ste...@unilogicnetworks.net:
 Hello Josef,

 I've new results, is this the trace you are looking for?

 Trace of OSD0: http://pastebin.com/gddLBXE4
 Dmesg of OSD0: http://pastebin.com/Uebzgkjv

 OSD1 crashed a while later with the same messages.

 Stefan

Hi Josef,

I ran your patch on one of our ceph nodes, too. At the first run it
hit the BUG_ON and crashed. Unfortunately I was not able to get the
trace messages from the server (I'm glad that Stefan managed to fetch
it), so I gave it a second spin. This time it did NOT hit the BUG_ON,
but I wrote the trace to a file, so I can send you the trace output at
that time. You can find dmesg-output here:

http://pastebin.com/pWWsZ79e

The trace messages from 154900 till 154999 are here (don't know if
this is interesting):

http://pastebin.com/01EKHqn5

and the tracing output from 206200 till 206399 is here:

http://pastebin.com/50PNtiF7

I hope that this will give you a better insight into this. I will now
reboot and run it a third time, to see if I can hit the BUG_ON again.

Regards,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG at fs/btrfs/inode.c:1587

2011-11-16 Thread Christian Brunner
2011/11/16 Chris Mason chris.ma...@oracle.com:
 On Tue, Nov 15, 2011 at 09:19:53AM +0100, Christian Brunner wrote:
 Hi,

 this time I've hit a new bug. This happened while ceph was rebuilding
 its filestore (heavy I/O).

 The btrfs version is from 3.2-rc1, applied to a 3.0 kernel.

 This one means some part of the kernel has set a btrfs data page dirty
 without going through the proper setup.  A few of us have hit it, but we
 haven't been able to nail down a solid way to reproduce it.

 Have you hit it more than once?


I'm sorry, I've only hit this once and it's not reproducible.

Regards,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


BUG at fs/btrfs/inode.c:1587

2011-11-15 Thread Christian Brunner
Hi,

this time I've hit a new bug. This happened while ceph was rebuilding
its filestore (heavy I/O).

The btrfs version is from 3.2-rc1, applied to a 3.0 kernel.

Regards,
Christian

[28981.550478] [ cut here ]
[28981.555625] kernel BUG at fs/btrfs/inode.c:1587!
[28981.560773] invalid opcode:  [#1] SMP
[28981.565361] CPU 2
[28981.567407] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
ixgbe dca mdio i7core_edac edac_core iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[28981.591184]
[28981.592842] Pid: 1814, comm: btrfs-fixup-0 Tainted: P
3.0.8-1.fits.4.el6.x86_64 #1 HP ProLiant DL180 G6
[28981.604589] RIP: 0010:[a0292f3c]  [a0292f3c]
btrfs_writepage_fixup_worker+0x14c/0x160 [btrfs]
[28981.616049] RSP: 0018:8805ee735dd0  EFLAGS: 00010246
[28981.621967] RAX:  RBX: ea00132c2520 RCX: 8805ef32ec58
[28981.629918] RDX:  RSI: 003b5000 RDI: 8805ef32ea38
[28981.637870] RBP: 8805ee735e20 R08: 88063f25add0 R09: 8805ee735d88
[28981.645822] R10:  R11: 0001 R12: 003b5000
[28981.653774] R13: 8805ef32eb08 R14:  R15: 003b5fff
[28981.661727] FS:  () GS:88063f24()
knlGS:
[28981.670744] CS:  0010 DS:  ES:  CR0: 8005003b
[28981.677146] CR2: 07737000 CR3: 01a03000 CR4: 06e0
[28981.685098] DR0:  DR1:  DR2: 
[28981.693050] DR3:  DR6: 0ff0 DR7: 0400
[28981.701010] Process btrfs-fixup-0 (pid: 1814, threadinfo
8805ee734000, task 8805f3f54bc0)
[28981.710901] Stack:
[28981.713146]  88045dbf4d20 8805ef32e9a8 00012bc0
88027dcdbd20
[28981.721434]   8805ef99ede0 8805ef99ee30
8805ef99edf8
[28981.729723]  88045dbf4d50 8805ee735e80 8805ee735ee0
a02b39ce
[28981.738013] Call Trace:
[28981.740763]  [a02b39ce] worker_loop+0x13e/0x540 [btrfs]
[28981.747577]  [a02b3890] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[28981.755263]  [a02b3890] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[28981.762931]  [81085c96] kthread+0x96/0xa0
[28981.768373]  [81556844] kernel_thread_helper+0x4/0x10
[28981.774976]  [81085c00] ? kthread_worker_fn+0x1a0/0x1a0
[28981.781772]  [81556840] ? gs_change+0x13/0x13
[28981.787593] Code: e0 48 83 c4 28 5b 41 5c 41 5d 41 5e 41 5f c9 c3
48 8b 7d b8 48 8d 4d c8 41 b8 50 00 00 00 4c 89 fa 4c 89 e6 e8 96 38
01 00 eb bd 0f 0b eb fe 48 89 df e8 c8 0e e7 e0 eb 9d 66 0f 1f 44 00
00 55
[28981.809294] RIP  [a0292f3c]
btrfs_writepage_fixup_worker+0x14c/0x160 [btrfs]
[28981.818150]  RSP 8805ee735dd0
[28981.822721] ---[ end trace 0236051622523829 ]---
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2198 btrfs_orphan_commit_root+0xa8/0xc0

2011-11-09 Thread Christian Brunner
2011/11/9 Stefan Kleijkers ste...@unilogicnetworks.net:
 Hello,

 I'm seeing a lot of warnings in dmesg with a BTRFS filesystem. I'm using the
 3.1 kernel. I found a patch for these warnings
 (http://marc.info/?l=linux-btrfs&m=131547325515336&w=2), but that patch has
 already been included in 3.1. Are there any other patches I can try?

 I'm using BTRFS in combination with Ceph and it looks like after a while
 with a high rsync workload that the IO stalls for some time, could the
 warnings result in IO stall?

This seems to be the same issue I've seen in our ceph cluster. We had
a lengthy discussion about this on the btrfs list:

http://marc.info/?l=linux-btrfs&m=132007001119383&w=2

As far as I know josef is still working on it. Some of the latest
patches he sent seem to be related to this, but I don't know if these
fix the problem.

Regards,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-31 Thread Christian Brunner
2011/10/31 Christian Brunner c...@muc.de:
 2011/10/31 Christian Brunner c...@muc.de:

 The patch didn't hurt, but I have to tell you that I'm still seeing the
 same old problems. Load is going up again:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  5502 root      20   0     0    0    0 S 52.5 0.0 106:29.97 btrfs-endio-wri
  1976 root      20   0  601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd

 And I have hit our warning again:

 [223560.970713] [ cut here ]
 [223560.976043] WARNING: at fs/btrfs/inode.c:2118
 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
 [223560.985411] Hardware name: ProLiant DL180 G6
 [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
 bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
 i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
 [last unloaded: scsi_wait_scan]
 [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P
 3.0.6-1.fits.9.el6.x86_64 #1
 [223561.023874] Call Trace:
 [223561.026738]  [8106344f] warn_slowpath_common+0x7f/0xc0
 [223561.033564]  [810634aa] warn_slowpath_null+0x1a/0x20
 [223561.040272]  [a0282120] btrfs_orphan_commit_root+0xb0/0xc0 
 [btrfs]
 [223561.048278]  [a027ce55] commit_fs_roots+0xc5/0x1b0 [btrfs]
 [223561.055534]  [8154c231] ? mutex_lock+0x31/0x60
 [223561.061666]  [a027ddbe]
 btrfs_commit_transaction+0x3ce/0x820 [btrfs]
 [223561.069876]  [a027d1b8] ? wait_current_trans+0x28/0x110 [btrfs]
 [223561.077582]  [a027e325] ? join_transaction+0x25/0x250 [btrfs]
 [223561.085065]  [81086410] ? wake_up_bit+0x40/0x40
 [223561.091251]  [a025a329] btrfs_sync_fs+0x59/0xd0 [btrfs]
 [223561.098187]  [a02abc65] btrfs_ioctl+0x495/0xd50 [btrfs]
 [223561.105120]  [8125ed20] ? inode_has_perm+0x30/0x40
 [223561.111575]  [81261a2c] ? file_has_perm+0xdc/0xf0
 [223561.117924]  [8117086a] do_vfs_ioctl+0x9a/0x5a0
 [223561.124072]  [81170e11] sys_ioctl+0xa1/0xb0
 [223561.129842]  [81555702] system_call_fastpath+0x16/0x1b
 [223561.136699] ---[ end trace 176e8be8996f25f6 ]---

 [ Not sending this to the lists, as the attachment is large ].

 I've spent a little time to do some tracing with ftrace. Its output
 seems to be right (at least as far as I can tell). I hope that its
 output can give you an insight on what's going on.

 The interesting PIDs in the trace are:

  5502 root      20   0     0    0    0 S 33.6 0.0 118:28.37 btrfs-endio-wri
  5518 root      20   0     0    0    0 S 29.3 0.0 41:23.58 btrfs-endio-wri
  8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
  7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd


[ adding linux-btrfs again ]

I've been digging into this a bit further:

Attached is another ftrace report that I've filtered for btrfs_*
calls and limited to CPU0 (this is where PID 5502 was running).

From what I can see there is a lot of time consumed in
btrfs_reserve_extent(). Is this normal?
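
For reference, the same filtering can also be done at capture time in
ftrace - a minimal sketch, assuming debugfs is mounted at
/sys/kernel/debug and that CPU0 is the CPU of interest:

  cd /sys/kernel/debug/tracing
  echo function > current_tracer
  echo 'btrfs_*' > set_ftrace_filter      # trace only btrfs_* calls
  echo 1 > tracing_cpumask                # bitmask: CPU0 only
  echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
  cat trace > /tmp/ftrace_btrfs_cpu0.txt

This is just one way to get an equivalent report.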

Thanks,
Christian


ftrace_btrfs_cpu0.bz2
Description: BZip2 compressed data


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-31 Thread Christian Brunner
2011/10/31 Christian Brunner c...@muc.de:
 2011/10/31 Christian Brunner c...@muc.de:
 2011/10/31 Christian Brunner c...@muc.de:

 The patch didn't hurt, but I have to tell you that I'm still seeing the
 same old problems. Load is going up again:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  5502 root      20   0     0    0    0 S 52.5 0.0 106:29.97 btrfs-endio-wri
  1976 root      20   0  601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd

 And I have hit our warning again:

 [223560.970713] [ cut here ]
 [223560.976043] WARNING: at fs/btrfs/inode.c:2118
 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
 [223560.985411] Hardware name: ProLiant DL180 G6
 [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
 bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
 i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
 [last unloaded: scsi_wait_scan]
 [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P
 3.0.6-1.fits.9.el6.x86_64 #1
 [223561.023874] Call Trace:
 [223561.026738]  [8106344f] warn_slowpath_common+0x7f/0xc0
 [223561.033564]  [810634aa] warn_slowpath_null+0x1a/0x20
 [223561.040272]  [a0282120] btrfs_orphan_commit_root+0xb0/0xc0 
 [btrfs]
 [223561.048278]  [a027ce55] commit_fs_roots+0xc5/0x1b0 [btrfs]
 [223561.055534]  [8154c231] ? mutex_lock+0x31/0x60
 [223561.061666]  [a027ddbe]
 btrfs_commit_transaction+0x3ce/0x820 [btrfs]
 [223561.069876]  [a027d1b8] ? wait_current_trans+0x28/0x110 
 [btrfs]
 [223561.077582]  [a027e325] ? join_transaction+0x25/0x250 [btrfs]
 [223561.085065]  [81086410] ? wake_up_bit+0x40/0x40
 [223561.091251]  [a025a329] btrfs_sync_fs+0x59/0xd0 [btrfs]
 [223561.098187]  [a02abc65] btrfs_ioctl+0x495/0xd50 [btrfs]
 [223561.105120]  [8125ed20] ? inode_has_perm+0x30/0x40
 [223561.111575]  [81261a2c] ? file_has_perm+0xdc/0xf0
 [223561.117924]  [8117086a] do_vfs_ioctl+0x9a/0x5a0
 [223561.124072]  [81170e11] sys_ioctl+0xa1/0xb0
 [223561.129842]  [81555702] system_call_fastpath+0x16/0x1b
 [223561.136699] ---[ end trace 176e8be8996f25f6 ]---

 [ Not sending this to the lists, as the attachment is large ].

 I've spent a little time to do some tracing with ftrace. Its output
 seems to be right (at least as far as I can tell). I hope that its
 output can give you an insight on what's going on.

 The interesting PIDs in the trace are:

  5502 root      20   0     0    0    0 S 33.6 0.0 118:28.37 btrfs-endio-wri
  5518 root      20   0     0    0    0 S 29.3 0.0 41:23.58 btrfs-endio-wri
  8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
  7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd


 [ adding linux-btrfs again ]

 I've been digging into this a bit further:

 Attached is another ftrace report that I've filtered for btrfs_*
 calls and limited to CPU0 (this is where PID 5502 was running).

 From what I can see there is a lot of time consumed in
 btrfs_reserve_extent(). Is this normal?

Sorry for spamming, but in the meantime I'm almost certain that the
problem is inside find_free_extent (called from btrfs_reserve_extent).

When I'm running ftrace for a sample period of 10s my system is
wasting a total of 4.2 seconds inside find_free_extent(). Each call to
find_free_extent() is taking an average of 4 milliseconds to complete.
On a recently rebooted system this is only 1-2 us!
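
One way to collect such per-function timings is the ftrace function
profiler - a minimal sketch, assuming debugfs is mounted at
/sys/kernel/debug and CONFIG_FUNCTION_PROFILER is enabled:

  cd /sys/kernel/debug/tracing
  echo find_free_extent > set_ftrace_filter
  echo 1 > function_profile_enabled
  sleep 10
  echo 0 > function_profile_enabled
  cat trace_stat/function*   # hit count, total and average time per CPU

The 10s window matches the sample period mentioned above.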

I'm not sure if the problem is occurring suddenly or slowly over time.
(At the moment I suspect that it's occurring suddenly, but I still have
to investigate this).

Thanks,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Christian Brunner
2011/10/27 Josef Bacik jo...@redhat.com:
 On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
 2011/10/24 Josef Bacik jo...@redhat.com:
  On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
  [adding linux-btrfs to cc]
 
  Josef, Chris, any ideas on the below issues?
 
  On Mon, 24 Oct 2011, Christian Brunner wrote:
  
   - When I run ceph with btrfs snaps disabled, the situation is getting
   slightly better. I can run an OSD for about 3 days without problems,
   but then again the load increases. This time, I can see that the
   ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
   than usual.
 
  FYI in this scenario you're exposed to the same journal replay issues that
  ext4 and XFS are.  The btrfs workload that ceph is generating will also
  not be all that special, though, so this problem shouldn't be unique to
  ceph.
 
 
  Can you get sysrq+w when this happens?  I'd like to see what 
  btrfs-endio-write
  is up to.

 Capturing this seems to be not easy. I have a few traces (see
 attachment), but with sysrq+w I do not get a stacktrace of
 btrfs-endio-write. What I have is a latencytop -c output which is
 interesting:

 In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
 tries to balance the load over all OSDs, so all filesystems should get
 a nearly equal load. At the moment one filesystem seems to have a
 problem. When running with iostat I see the following

 Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
 avgrq-sz avgqu-sz   await  svctm  %util
 sdd               0.00     0.00    0.00    4.33     0.00    53.33
 12.31     0.08   19.38  12.23   5.30
 sdc               0.00     1.00    0.00  228.33     0.00  1957.33
 8.57    74.33  380.76   2.74  62.57
 sdb               0.00     0.00    0.00    1.33     0.00    16.00
 12.00     0.03   25.00 19.75 2.63
 sda               0.00     0.00    0.00    0.67     0.00     8.00
 12.00     0.01   19.50  12.50   0.83

 The PID of the ceph-osd that is running on sdc is 2053 and when I look
 with top I see this process and a btrfs-endio-writer (PID 5447):

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri

 In the latencytop output you can see that those processes have a much
 higher latency, than the other ceph-osd and btrfs-endio-writers.

 Regards,
 Christian

 Ok just a shot in the dark, but could you give this a whirl and see if it 
 helps
 you?  Thanks

Thanks for the patch! I'll install it tomorrow and I think that I can
report back on Monday. It always takes a few days until the load goes
up.

Regards,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Christian Brunner
2011/10/26 Sage Weil s...@newdream.net:
 On Wed, 26 Oct 2011, Christian Brunner wrote:
   Christian, have you tweaked those settings in your ceph.conf?  It would 
   be
   something like 'journal dio = false'.  If not, can you verify that
   directio shows true when the journal is initialized from your osd log?
   E.g.,
  
    2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal 
   fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
  
   If directio = 1 for you, something else funky is causing those
   blkdev_fsync's...
 
  I've looked it up in the logs - directio is 1:
 
  Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
  /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
  bytes, directio = 1
 
  Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
  is coming from.

 Here is an strace. I can see a lot of sync_file_range operations.

 Yeah, these all look like the flusher thread, and shouldn't be hitting
 blkdev_fsync.  Can you confirm that with

        filestore flusher = false
        filestore sync flush = false

 you get no sync_file_range at all?  I wonder if this is also perf lying
 about the call chain.

Yes, setting this makes the sync_file_range calls go away.

Is it safe to use these settings with filestore btrfs snap = 0?
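
For clarity, the [osd] combination being discussed would look roughly
like this in ceph.conf (option names as used in this thread; defaults
may differ between ceph versions):

        [osd]
                filestore flusher = false
                filestore sync flush = false
                filestore btrfs snap = 0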

Thanks,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Christian Brunner
2011/10/26 Christian Brunner c...@muc.de:
 2011/10/25 Josef Bacik jo...@redhat.com:
 On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
 On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
  On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
  
   Attached is a perf-report. I have included the whole report, so that
   you can see the difference between the good and the bad
   btrfs-endio-wri.
  
 
  We also shouldn't be running run_ordered_operations, man this is screwed 
  up,
  thanks so much for this, I should be able to nail this down pretty easily.
  Thanks,

 Looks like we're getting there from reserve_metadata_bytes when we join
 the transaction?


 We don't do reservations in the endio stuff, we assume you've reserved all 
 the
 space you need in delalloc, plus we would have seen reserve_metadata_bytes in
 the trace.  Though it does look like perf is lying to us in at least one case
 since btrfs_alloc_logged_file_extent is only called from log replay and not
 during normal runtime, so it definitely shouldn't be showing up.  Thanks,

 Strange! - I'll check if symbols got messed up in the report tomorrow.

I've checked this now: Except for the missing symbols for iomemory_vsl
module, everything is looking normal.

I've also run the report on another OSD again, but the results look
quite similar.

Regards,
Christian

PS: This is what perf report -v is saying...

build id event received for [kernel.kallsyms]:
805ca93f4057cc0c8f53b061a849b3f847f2de40
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/fs/btrfs/btrfs.ko:
64a723e05af3908fb9593f4a3401d6563cb1a01b
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/lib/libcrc32c.ko:
b1391be8d33b54b6de20e07b7f2ee8d777fc09d2
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/bonding/bonding.ko:
663392df0f407211ab8f9527c482d54fce890c5e
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/scsi/hpsa.ko:
676eecffd476aef1b0f2f8c1bf8c8e6120d369c9
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko:
db7c200894b27e71ae6fe5cf7adaebf787c90da9
build id event received for [iomemory_vsl]:
4ed417c9a815e6bbe77a1656bceda95d9f06cb13
build id event received for /lib64/libc-2.12.so:
2ab28d41242ede641418966ef08f9aacffd9e8c7
build id event received for /lib64/libpthread-2.12.so:
c177389a6f119b3883ea0b3c33cb04df3f8e5cc7
build id event received for /sbin/rsyslogd:
1372ef1e2ec550967fe20d0bdddbc0aab0bb36dc
build id event received for /lib64/libglib-2.0.so.0.2200.5:
d880be15bf992b5fbcc629e6bbf1c747a928ddd5
build id event received for /usr/sbin/irqbalance:
842de64f46ca9fde55efa29a793c08b197d58354
build id event received for /lib64/libm-2.12.so:
46ac89195918407d2937bd1450c0ec99c8d41a2a
build id event received for /usr/bin/ceph-osd:
9fcb36e020c49fc49171b4c88bd784b38eb0675b
build id event received for /usr/lib64/libstdc++.so.6.0.13:
d1b2ca4e1ec8f81ba820e5f1375d960107ac7e50
build id event received for /usr/lib64/libtcmalloc.so.0.2.0:
02766551b2eb5a453f003daee0c5fc9cd176e831
Looking at the vmlinux_path (6 entries long)
dso__load_sym: cannot get elf header.
Using /proc/kallsyms for symbols
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id
4ed417c9a815e6bbe77a1656bceda95d9f06cb13 was found
[iomemory_vsl] with build id 4ed417c9a815e6bbe77a1656bceda95d9f06cb13
not found, continuing without symbols
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik jo...@redhat.com:
 On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
 2011/10/24 Josef Bacik jo...@redhat.com:
  On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
  [adding linux-btrfs to cc]
 
  Josef, Chris, any ideas on the below issues?
 
  On Mon, 24 Oct 2011, Christian Brunner wrote:
  
   - When I run ceph with btrfs snaps disabled, the situation is getting
   slightly better. I can run an OSD for about 3 days without problems,
   but then again the load increases. This time, I can see that the
   ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
   than usual.
 
  FYI in this scenario you're exposed to the same journal replay issues that
  ext4 and XFS are.  The btrfs workload that ceph is generating will also
  not be all that special, though, so this problem shouldn't be unique to
  ceph.
 
 
  Can you get sysrq+w when this happens?  I'd like to see what 
  btrfs-endio-write
  is up to.

 Capturing this seems to be not easy. I have a few traces (see
 attachment), but with sysrq+w I do not get a stacktrace of
 btrfs-endio-write. What I have is a latencytop -c output which is
 interesting:

 In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
 tries to balance the load over all OSDs, so all filesystems should get
 a nearly equal load. At the moment one filesystem seems to have a
 problem. When running with iostat I see the following

 Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
 avgrq-sz avgqu-sz   await  svctm  %util
 sdd               0.00     0.00    0.00    4.33     0.00    53.33
 12.31     0.08   19.38  12.23   5.30
 sdc               0.00     1.00    0.00  228.33     0.00  1957.33
 8.57    74.33  380.76   2.74  62.57
 sdb               0.00     0.00    0.00    1.33     0.00    16.00
 12.00     0.03   25.00 19.75 2.63
 sda               0.00     0.00    0.00    0.67     0.00     8.00
 12.00     0.01   19.50  12.50   0.83

 The PID of the ceph-osd that is running on sdc is 2053 and when I look
 with top I see this process and a btrfs-endio-writer (PID 5447):

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri

 In the latencytop output you can see that those processes have a much
 higher latency, than the other ceph-osd and btrfs-endio-writers.


 I'm seeing a lot of this

        [schedule]      1654.6 msec         96.4 %
                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
                generic_write_sync blkdev_aio_write do_sync_readv_writev
                do_readv_writev vfs_writev sys_writev system_call_fastpath

 where ceph-osd's latency is mostly coming from this fsync of a block device
 directly, and not so much being tied up by btrfs directly.  With 22% CPU being
 taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
 perf
 record -ag when this is going on and then perf report so we can see what
 btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to 
 get
 only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
 of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
 horribly wrong or introducing a lot of latency.  Most of it seems to be when
 running the delayed refs and having to read in blocks.  I've been suspecting 
 for
 a while that the delayed ref stuff ends up doing way more work than it needs 
 to
 be per task, and it's possible that btrfs-endio-wri is simply getting screwed 
 by
 other people doing work.

 At this point it seems like the biggest problem with latency in ceph-osd is 
 not
 related to btrfs, the latency seems to all be from the fact that ceph-osd is
 fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems 
 like
 its blowing a lot of CPU time, so perf record -ag is probably going to be your
 best bet when it's using lots of cpu so we can figure out what it's spinning 
 on.

Attached is a perf-report. I have included the whole report, so that
you can see the difference between the good and the bad
btrfs-endio-wri.
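
A capture along the lines Josef describes would be roughly:

  # sample all CPUs with call graphs for a minute while the load is high
  perf record -a -g -o perf.data sleep 60
  # then drill down to the btrfs-endio-wri entries in the report
  perf report -i perf.data

The sampling window is arbitrary; pick one while btrfs-endio-wri is
busy.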

Thanks,
Christian


perf.report.bz2
Description: BZip2 compressed data


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik jo...@redhat.com:
 On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
 2011/10/25 Josef Bacik jo...@redhat.com:
  On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
[...]
 
  In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
  tries to balance the load over all OSDs, so all filesystems should get
  a nearly equal load. At the moment one filesystem seems to have a
  problem. When running with iostat I see the following
 
  Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
  avgrq-sz avgqu-sz   await  svctm  %util
  sdd               0.00     0.00    0.00    4.33     0.00    53.33
  12.31     0.08   19.38  12.23   5.30
  sdc               0.00     1.00    0.00  228.33     0.00  1957.33
  8.57    74.33  380.76   2.74  62.57
  sdb               0.00     0.00    0.00    1.33     0.00    16.00
  12.00     0.03   25.00 19.75 2.63
  sda               0.00     0.00    0.00    0.67     0.00     8.00
  12.00     0.01   19.50  12.50   0.83
 
  The PID of the ceph-osd that is running on sdc is 2053 and when I look
  with top I see this process and a btrfs-endio-writer (PID 5447):
 
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
   5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
 
  In the latencytop output you can see that those processes have a much
  higher latency, than the other ceph-osd and btrfs-endio-writers.
 
 
  I'm seeing a lot of this
 
         [schedule]      1654.6 msec         96.4 %
                 schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
                 generic_write_sync blkdev_aio_write do_sync_readv_writev
                 do_readv_writev vfs_writev sys_writev system_call_fastpath
 
  where ceph-osd's latency is mostly coming from this fsync of a block device
  directly, and not so much being tied up by btrfs directly.  With 22% CPU 
  being
  taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
  perf
  record -ag when this is going on and then perf report so we can see what
  btrfs-endio-wri is doing with the cpu.  You can drill down in perf report 
  to get
  only what btrfs-endio-wri is doing, so that would be best.  As far as the 
  rest
  of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing 
  anything
  horribly wrong or introducing a lot of latency.  Most of it seems to be 
  when
   running the delayed refs and having to read in blocks.  I've been 
  suspecting for
  a while that the delayed ref stuff ends up doing way more work than it 
  needs to
  be per task, and it's possible that btrfs-endio-wri is simply getting 
  screwed by
  other people doing work.
 
  At this point it seems like the biggest problem with latency in ceph-osd 
  is not
  related to btrfs, the latency seems to all be from the fact that ceph-osd 
  is
  fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems 
  like
  its blowing a lot of CPU time, so perf record -ag is probably going to be 
  your
  best bet when it's using lots of cpu so we can figure out what it's 
  spinning on.

 Attached is a perf-report. I have included the whole report, so that
 you can see the difference between the good and the bad
 btrfs-endio-wri.


 We also shouldn't be running run_ordered_operations, man this is screwed up,
 thanks so much for this, I should be able to nail this down pretty easily.

Please note that this is with btrfs snaps disabled in the ceph conf.
When I enable snaps our problems get worse (the btrfs-cleaner thing),
but I would be glad if this one thing gets solved. I can run debugging
with snaps enabled, if you want, but I would suggest that we do this
afterwards.

Thanks,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Sage Weil s...@newdream.net:
 On Tue, 25 Oct 2011, Josef Bacik wrote:
 At this point it seems like the biggest problem with latency in ceph-osd
 is not related to btrfs, the latency seems to all be from the fact that
 ceph-osd is fsyncing a block dev for whatever reason.

 There is one place where we sync_file_range() on the journal block device,
 but that should only happen if directio is disabled (it's on by default).

 Christian, have you tweaked those settings in your ceph.conf?  It would be
 something like 'journal dio = false'.  If not, can you verify that
 directio shows true when the journal is initialized from your osd log?
 E.g.,

  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 
 14: 104857600 bytes, block size 4096 bytes, directio = 1

 If directio = 1 for you, something else funky is causing those
 blkdev_fsync's...

I've looked it up in the logs - directio is 1:

Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
/dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
bytes, directio = 1

Regards,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik jo...@redhat.com:
 On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
 On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
  On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
  
   Attached is a perf-report. I have included the whole report, so that
   you can see the difference between the good and the bad
   btrfs-endio-wri.
  
 
  We also shouldn't be running run_ordered_operations, man this is screwed 
  up,
  thanks so much for this, I should be able to nail this down pretty easily.
  Thanks,

 Looks like we're getting there from reserve_metadata_bytes when we join
 the transaction?


 We don't do reservations in the endio stuff, we assume you've reserved all the
 space you need in delalloc, plus we would have seen reserve_metadata_bytes in
 the trace.  Though it does look like perf is lying to us in at least one case
 since btrfs_alloc_logged_file_extent is only called from log replay and not
 during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Strange! - I'll check if symbols got messed up in the report tomorrow.

Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Christian Brunner
2011/10/24 Chris Mason chris.ma...@oracle.com:
 On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
 On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
  [adding linux-btrfs to cc]
 
  Josef, Chris, any ideas on the below issues?
 
  On Mon, 24 Oct 2011, Christian Brunner wrote:
   Thanks for explaining this. I don't have any objections against btrfs
   as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
   scare me, since I can use the ceph replication to recover a lost
   btrfs-filesystem. The only problem I have is, that btrfs is not stable
   on our side and I wonder what you are doing to make it work. (Maybe
   it's related to the load pattern of using ceph as a backend store for
   qemu).
  
   Here is a list of the btrfs problems I'm having:
  
   - When I run ceph with the default configuration (btrfs snaps enabled)
   I can see a rapid increase in Disk-I/O after a few hours of uptime.
   Btrfs-cleaner is using more and more time in
   btrfs_clean_old_snapshots().
 
  In theory, there shouldn't be any significant difference between taking a
  snapshot and removing it a few commits later, and the prior root refs that
  btrfs holds on to internally until the new commit is complete.  That's
  clearly not quite the case, though.
 
  In any case, we're going to try to reproduce this issue in our
  environment.
 

 I've noticed this problem too, clean_old_snapshots is taking quite a while in
 cases where it really shouldn't.  I will see if I can come up with a 
 reproducer
 that doesn't require setting up ceph ;).

 This sounds familiar though, I thought we had fixed a similar
 regression.  Either way, Arne's readahead code should really help.

 Which kernel version were you running?

 [ ack on the rest of Josef's comments ]

This was with a 3.0 kernel, including all btrfs patches from josef's
git repo, plus the "use the global reserve when truncating the free
space cache inode" patch.

I'll try the readahead code.

Thanks,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2114

2011-10-20 Thread Christian Brunner
2011/10/20 Liu Bo liubo2...@cn.fujitsu.com:
 On 10/17/2011 11:23 PM, Christian Brunner wrote:
 2011/10/11 Christian Brunner c...@muc.de:

 I have updated to a 3.0.6 kernel, with all the btrfs patches from
 josef's git repo this weekend. But I'm still seeing the following
 warning:


 Hi,

 Would you try with this patch:

 http://permalink.gmane.org/gmane.comp.file-systems.btrfs/13728


I have now applied the patch josef sent to the list (Btrfs: use the
global reserve when truncating the free space cache inode), but the
warning is still there:

[   69.153400] [ cut here ]
[   69.158669] WARNING: at fs/btrfs/inode.c:2114
btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
[   69.167984] Hardware name: ProLiant DL180 G6
[   69.173037] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
ixgbe dca mdio i7core_edac edac_core iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[   69.197502] Pid: 3426, comm: ceph-osd Tainted: P
3.0.6-1.fits.8.el6.x86_64 #1
[   69.206591] Call Trace:
[   69.209389]  [8106344f] warn_slowpath_common+0x7f/0xc0
[   69.216144]  [810634aa] warn_slowpath_null+0x1a/0x20
[   69.222647]  [a028c080] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
[   69.230550]  [a0286e15] commit_fs_roots+0xc5/0x1b0 [btrfs]
[   69.237698]  [8154c231] ? mutex_lock+0x31/0x60
[   69.243707]  [a0267a8a] ? btrfs_free_path+0x2a/0x40 [btrfs]
[   69.250966]  [a0287d56]
btrfs_commit_transaction+0x3c6/0x820 [btrfs]
[   69.259087]  [a0287178] ? wait_current_trans+0x28/0x110 [btrfs]
[   69.266720]  [a02882c5] ? join_transaction+0x25/0x250 [btrfs]
[   69.274136]  [81086410] ? wake_up_bit+0x40/0x40
[   69.280210]  [a0264329] btrfs_sync_fs+0x59/0xd0 [btrfs]
[   69.287072]  [a02b5b95] btrfs_ioctl+0x495/0xd50 [btrfs]
[   69.293922]  [8125ed20] ? inode_has_perm+0x30/0x40
[   69.300286]  [81261a2c] ? file_has_perm+0xdc/0xf0
[   69.306558]  [8117086a] do_vfs_ioctl+0x9a/0x5a0
[   69.312657]  [81170e11] sys_ioctl+0xa1/0xb0
[   69.318352]  [81555702] system_call_fastpath+0x16/0x1b
[   69.325107] ---[ end trace 2fd1a5665203d8e3 ]---


Thanks,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2114

2011-10-17 Thread Christian Brunner
2011/10/11 Christian Brunner c...@muc.de:
 2011/10/11 Liu Bo liubo2...@cn.fujitsu.com:
 On 10/10/2011 12:41 AM, Christian Brunner wrote:
 I just realized that this is still the same warning I reported some months
 ago.

 I thought that this had been fixed with

 25d37af374263243214be9d912cbb46a8e469bc7

 which is included in the kernel I'm using. So I think there must be
 another Problem.


 Would you try with this patch:

 http://marc.info/?l=linux-btrfs&m=131547325515336&w=2


 This one is already included in my tree.

I have updated to a 3.0.6 kernel, with all the btrfs patches from
josef's git repo this weekend. But I'm still seeing the following
warning:

[75532.763336] [ cut here ]
[75532.768570] WARNING: at fs/btrfs/inode.c:2114
btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
[75532.777807] Hardware name: ProLiant DL180 G6
[75532.782798] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[75532.806891] Pid: 1858, comm: ceph-osd Tainted: P
3.0.6-1.fits.5.el6.x86_64 #1
[75532.815990] Call Trace:
[75532.818772]  [8106344f] warn_slowpath_common+0x7f/0xc0
[75532.825514]  [810634aa] warn_slowpath_null+0x1a/0x20
[75532.832076]  [a0281ff0] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
[75532.840028]  [a027cd85] commit_fs_roots+0xc5/0x1b0 [btrfs]
[75532.847196]  [8154c231] ? mutex_lock+0x31/0x60
[75532.853198]  [a025da8a] ? btrfs_free_path+0x2a/0x40 [btrfs]
[75532.860476]  [a027dcc6]
btrfs_commit_transaction+0x3c6/0x820 [btrfs]
[75532.868599]  [a027d0e8] ? wait_current_trans+0x28/0x110 [btrfs]
[75532.876264]  [a027e235] ? join_transaction+0x25/0x250 [btrfs]
[75532.883762]  [81086410] ? wake_up_bit+0x40/0x40
[75532.889839]  [a025a329] btrfs_sync_fs+0x59/0xd0 [btrfs]
[75532.896703]  [a02abb25] btrfs_ioctl+0x495/0xd50 [btrfs]
[75532.903544]  [8125ed20] ? inode_has_perm+0x30/0x40
[75532.909902]  [81261a2c] ? file_has_perm+0xdc/0xf0
[75532.916205]  [8117086a] do_vfs_ioctl+0x9a/0x5a0
[75532.922276]  [81170e11] sys_ioctl+0xa1/0xb0
[75532.927988]  [81555702] system_call_fastpath+0x16/0x1b
[75532.934755] ---[ end trace a10c532625ad12af ]---

Regards,
Christian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: allow us to overcommit our enospc reservations TEST THIS PLEASE!!!

2011-10-13 Thread Christian Brunner
2011/10/13 Josef Bacik jo...@redhat.com:
[...]
  [  175.956273] kernel BUG at fs/btrfs/inode.c:2176!
 
  Ok I think I see what's happening, this patch replaces the previous one, 
  let me
  know how it goes.  Thanks,
 

 Getting a slightly different BUG this time:


 Ok looks like I've fixed the original problem and now we're hitting a problem
 with the free space cache.  This patch will replace the last one, its all the
 fixes up to now and a new set of BUG_ON()'s to figure out which free space 
 cache
 inode is screwing us up.  Thanks,

 Josef


 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index fc0de68..e595372 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -3334,7 +3334,7 @@ out:
  * shrink metadata reservation for delalloc
  */
  static int shrink_delalloc(struct btrfs_trans_handle *trans,
 -                          struct btrfs_root *root, u64 to_reclaim, int sync)
 +                          struct btrfs_root *root, u64 to_reclaim, int retries)
  {
        struct btrfs_block_rsv *block_rsv;
        struct btrfs_space_info *space_info;
 @@ -3365,12 +3365,10 @@ static int shrink_delalloc(struct btrfs_trans_handle *trans,
        }

        max_reclaim = min(reserved, to_reclaim);
 +       if (max_reclaim > (2 * 1024 * 1024))
 +               nr_pages = max_reclaim >> PAGE_CACHE_SHIFT;

        while (loops < 1024) {
 -               /* have the flusher threads jump in and do some IO */
 -               smp_mb();
 -               nr_pages = min_t(unsigned long, nr_pages,
 -                      root->fs_info->delalloc_bytes >> PAGE_CACHE_SHIFT);
                writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages);

                spin_lock(&space_info->lock);
 @@ -3384,14 +3382,22 @@ static int shrink_delalloc(struct btrfs_trans_handle *trans,
                if (reserved == 0 || reclaimed >= max_reclaim)
                        break;

 -               if (trans && trans->transaction->blocked)
 +               if (trans)
                        return -EAGAIN;

 -               time_left = schedule_timeout_interruptible(1);
 +               if (!retries) {
 +                       time_left = schedule_timeout_interruptible(1);

 -               /* We were interrupted, exit */
 -               if (time_left)
 -                       break;
 +                       /* We were interrupted, exit */
 +                       if (time_left)
 +                               break;
 +               } else {
 +                       /*
 +                        * We've already done this song and dance once, let's
 +                        * really wait for some work to get done.
 +                        */
 +                       btrfs_wait_ordered_extents(root, 0, 0);
 +               }

                /* we've kicked the IO a few times, if anything has been freed,
                 * exit.  There is no sense in looping here for a long time
 @@ -3399,15 +3405,13 @@ static int shrink_delalloc(struct btrfs_trans_handle *trans,
                 * just too many writers without enough free space
                 */

 -               if (loops > 3) {
 +               if (!retries && loops > 3) {
                        smp_mb();
                        if (progress != space_info->reservation_progress)
                                break;
                }

        }
 -       if (reclaimed < to_reclaim && !trans)
 -               btrfs_wait_ordered_extents(root, 0, 0);
        return reclaimed >= to_reclaim;
  }

 @@ -3552,7 +3556,7 @@ again:
         * We do synchronous shrinking since we don't actually unreserve
         * metadata until after the IO is completed.
         */
 -       ret = shrink_delalloc(trans, root, num_bytes, 1);
 +       ret = shrink_delalloc(trans, root, num_bytes, retries);
        if (ret < 0)
                goto out;

 @@ -3568,17 +3572,6 @@ again:
                goto again;
        }

 -       /*
 -        * Not enough space to be reclaimed, don't bother committing the
 -        * transaction.
 -        */
 -       spin_lock(&space_info->lock);
 -       if (space_info->bytes_pinned < orig_bytes)
 -               ret = -ENOSPC;
 -       spin_unlock(&space_info->lock);
 -       if (ret)
 -               goto out;
 -
        ret = -EAGAIN;
        if (trans)
                goto out;
 diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
 index d6ba353..cb63904 100644
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -782,7 +782,8 @@ static noinline int cow_file_range(struct inode *inode,
        struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
        int ret = 0;

 -       BUG_ON(btrfs_is_free_space_inode(root, inode));
 +       BUG_ON(root == root->fs_info->tree_root);
 +       BUG_ON(BTRFS_I(inode)->location.objectid == BTRFS_FREE_INO_OBJECTID);
        trans = btrfs_join_transaction(root);
        BUG_ON(IS_ERR(trans));
        trans->block_rsv = &root->fs_info->delalloc_block_rsv;
 @@ -2790,7 +2791,8 @@ static struct btrfs_trans_handle 
 

Re: WARNING: at fs/btrfs/inode.c:2114

2011-10-11 Thread Christian Brunner
2011/10/11 Liu Bo liubo2...@cn.fujitsu.com:
 On 10/10/2011 12:41 AM, Christian Brunner wrote:
 I just realized that this is still the same warning I reported some months
 ago.

 I thought that this had been fixed with

 25d37af374263243214be9d912cbb46a8e469bc7

 which is included in the kernel I'm using. So I think there must be
 another Problem.


 Would you try with this patch:

 http://marc.info/?l=linux-btrfs&m=131547325515336&w=2


This one is already included in my tree.

Regards,
Christian


 2011/10/9 Christian Brunner c...@muc.de:
 I gave btrfs for-chris from josef's github repo a try in our ceph
 cluster. During the rebuild I got the following warning.

 Everything still seems to work... Should I be concerned?

 Thanks,
 Christian

 [12554.886362] [ cut here ]
 [12554.891693] WARNING: at fs/btrfs/inode.c:2114
 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
 [12554.901210] Hardware name: ProLiant DL180 G6
 [12554.906338] Modules linked in: btrfs zlib_deflate libcrc32c bonding
 ipv6 pcspkr serio_raw ghes hed iTCO_wdt iTCO_vendor_support
 i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
 [last unloaded: scsi_wait_scan]
 [12554.930791] Pid: 4686, comm: flush-btrfs-1 Tainted: P
 3.0.6-1.fits.1.el6.x86_64 #1
 [12554.940483] Call Trace:
 [12554.943400]  [8106344f] warn_slowpath_common+0x7f/0xc0
 [12554.950378]  [810634aa] warn_slowpath_null+0x1a/0x20
 [12554.957070]  [a022be60] btrfs_orphan_commit_root+0xb0/0xc0 
 [btrfs]
 [12554.965301]  [a0226c45] commit_fs_roots+0xc5/0x1b0 [btrfs]
 [12554.972571]  [815593d1] ? mutex_lock+0x31/0x60
 [12554.978652]  [a0207a5a] ? btrfs_free_path+0x2a/0x40 [btrfs]
 [12554.986017]  [a0227b96]
 btrfs_commit_transaction+0x3c6/0x830 [btrfs]
 [12554.994256]  [a0228115] ? join_transaction+0x25/0x250 [btrfs]
 [12555.001814]  [81086410] ? wake_up_bit+0x40/0x40
 [12555.008061]  [a022b35b] btrfs_write_inode+0xbb/0xc0 [btrfs]
 [12555.015490]  [81184c71] writeback_single_inode+0x201/0x260
 [12555.022879]  [81184f6b] writeback_sb_inodes+0xeb/0x1c0
 [12555.029721]  [8118532f] wb_writeback+0x18f/0x480
 [12555.036031]  [81558105] ? __schedule+0x3f5/0x8b0
 [12555.042295]  [81072f3c] ? lock_timer_base+0x3c/0x70
 [12555.048860]  [811856bd] wb_do_writeback+0x9d/0x270
 [12555.055384]  [81073060] ? del_timer+0xf0/0xf0
 [12555.061375]  [81185932] bdi_writeback_thread+0xa2/0x280
 [12555.068375]  [81185890] ? wb_do_writeback+0x270/0x270
 [12555.075203]  [81185890] ? wb_do_writeback+0x270/0x270
 [12555.081971]  [81085d96] kthread+0x96/0xa0
 [12555.087582]  [815639c4] kernel_thread_helper+0x4/0x10
 [12555.094338]  [81085d00] ? kthread_worker_fn+0x1a0/0x1a0
 [12555.101296]  [815639c0] ? gs_change+0x13/0x13
 [12555.107290] ---[ end trace 57ec2e8544131a12 ]---

 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: WARNING: at fs/btrfs/inode.c:2114

2011-10-09 Thread Christian Brunner
I just realized that this is still the same warning I reported some months ago.

I thought that this had been fixed with

25d37af374263243214be9d912cbb46a8e469bc7

which is included in the kernel I'm using. So I think there must be
another Problem.

Regards,
Christian

2011/10/9 Christian Brunner c...@muc.de:
 I gave btrfs for-chris from josef's github repo a try in our ceph
 cluster. During the rebuild I got the following warning.

 Everything still seems to work... Should I be concerned?

 Thanks,
 Christian

 [12554.886362] [ cut here ]
 [12554.891693] WARNING: at fs/btrfs/inode.c:2114
 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
 [12554.901210] Hardware name: ProLiant DL180 G6
 [12554.906338] Modules linked in: btrfs zlib_deflate libcrc32c bonding
 ipv6 pcspkr serio_raw ghes hed iTCO_wdt iTCO_vendor_support
 i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
 [last unloaded: scsi_wait_scan]
 [12554.930791] Pid: 4686, comm: flush-btrfs-1 Tainted: P
 3.0.6-1.fits.1.el6.x86_64 #1
 [12554.940483] Call Trace:
 [12554.943400]  [8106344f] warn_slowpath_common+0x7f/0xc0
 [12554.950378]  [810634aa] warn_slowpath_null+0x1a/0x20
 [12554.957070]  [a022be60] btrfs_orphan_commit_root+0xb0/0xc0 
 [btrfs]
 [12554.965301]  [a0226c45] commit_fs_roots+0xc5/0x1b0 [btrfs]
 [12554.972571]  [815593d1] ? mutex_lock+0x31/0x60
 [12554.978652]  [a0207a5a] ? btrfs_free_path+0x2a/0x40 [btrfs]
 [12554.986017]  [a0227b96]
 btrfs_commit_transaction+0x3c6/0x830 [btrfs]
 [12554.994256]  [a0228115] ? join_transaction+0x25/0x250 [btrfs]
 [12555.001814]  [81086410] ? wake_up_bit+0x40/0x40
 [12555.008061]  [a022b35b] btrfs_write_inode+0xbb/0xc0 [btrfs]
 [12555.015490]  [81184c71] writeback_single_inode+0x201/0x260
 [12555.022879]  [81184f6b] writeback_sb_inodes+0xeb/0x1c0
 [12555.029721]  [8118532f] wb_writeback+0x18f/0x480
 [12555.036031]  [81558105] ? __schedule+0x3f5/0x8b0
 [12555.042295]  [81072f3c] ? lock_timer_base+0x3c/0x70
 [12555.048860]  [811856bd] wb_do_writeback+0x9d/0x270
 [12555.055384]  [81073060] ? del_timer+0xf0/0xf0
 [12555.061375]  [81185932] bdi_writeback_thread+0xa2/0x280
 [12555.068375]  [81185890] ? wb_do_writeback+0x270/0x270
 [12555.075203]  [81185890] ? wb_do_writeback+0x270/0x270
 [12555.081971]  [81085d96] kthread+0x96/0xa0
 [12555.087582]  [815639c4] kernel_thread_helper+0x4/0x10
 [12555.094338]  [81085d00] ? kthread_worker_fn+0x1a0/0x1a0
 [12555.101296]  [815639c0] ? gs_change+0x13/0x13
 [12555.107290] ---[ end trace 57ec2e8544131a12 ]---

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs slowdown

2011-08-09 Thread Christian Brunner
Hi Sage,

I did some testing with btrfs-unstable yesterday. With the recent
commit from Chris it looks quite good:

Btrfs: force unplugs when switching from high to regular priority bios


However I can't test it extensively, because our main environment is
on ext4 at the moment.

Regards
Christian

2011/8/8 Sage Weil s...@newdream.net:
 Hi Christian,

 Are you still seeing this slowness?

 sage


 On Wed, 27 Jul 2011, Christian Brunner wrote:
 2011/7/25 Chris Mason chris.ma...@oracle.com:
  Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400:
  Hi,
 
 we are running a ceph cluster with btrfs as its base filesystem
  (kernel 3.0). At the beginning everything worked very well, but after
  a few days (2-3) things are getting very slow.
 
  When I look at the object store servers I see heavy disk-i/o on the
  btrfs filesystems (disk utilization is between 60% and 100%). I also
 did some tracing on the Ceph-Object-Store-Daemon, but I'm quite
 certain that the majority of the disk I/O is not caused by ceph or
  any other userland process.
 
 When I reboot the system(s) the problems go away for another 2-3 days,
  but after that, it starts again. I'm not sure if the problem is
  related to the kernel warning I've reported last week. At least there
  is no temporal relationship between the warning and the slowdown.
 
  Any hints on how to trace this would be welcome.
 
  The easiest way to trace this is with latencytop.
 
  Apply this patch:
 
  http://oss.oracle.com/~mason/latencytop.patch
 
  And then use latencytop -c for a few minutes while the system is slow.
  Send the output here and hopefully we'll be able to figure it out.

 I've now installed latencytop. Attached are two output files: The
 first is from yesterday and was created approximately half an hour after
 the boot. The second one is from today; uptime is 19h. The load on the
 system is already rising. Disk utilization is approximately at 50%.

 Thanks for your help.

 Christian

 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs slowdown

2011-07-28 Thread Christian Brunner
2011/7/28 Marcus Sorensen shadow...@gmail.com:
 Christian,

 Have you checked up on the disks themselves and hardware? High
 utilization can mean that the i/o load has increased, but it can also
 mean that the i/o capacity has decreased.  Your traces seem to
 indicate that a good portion of the time is being spent on commits,
 that could be waiting on disk. That wait_for_commit looks to
 basically just spin waiting for the commit to complete, and at least
 one thing that calls it raises a BUG_ON, not sure if it's one you've
 seen even on 2.6.38.

 There could be all sorts of performance related reasons that aren't
 specific to btrfs or ceph, on our various systems we've seen things
 like the raid card module being upgraded in newer kernels and suddenly
 our disks start to go into sleep mode after a bit, dirty_ratio causing
 multiple gigs of memory to sync because its not optimized for the
 workload, external SAS enclosures stop communicating a few days after
 reboot (but the disks keep working with sporadic issues), things like
 patrol read hitting a bad sector on a disk, causing it to go into
 enhanced error recovery and stop responding, etc.

I'm fairly confident that the hardware is ok. We see the problem on
four machines. It could be a problem with the hpsa driver/firmware,
but we haven't seen the behavior with 2.6.38 and the changes in the
hpsa driver are not that big.

 Maybe you have already tried these things. It's where I would start
 anyway. Looking at /proc/meminfo, dirty, writeback, swap, etc both
 while the system is functioning desirably and when it's misbehaving.
 Looking at anything else that might be in D state. Looking at not just
 disk util, but the workload causing it (e.g. Was I doing 300 iops
 previously with an average size of 64k, and now I'm only managing 50
 iops at 64k before the disk util reports 100%?) Testing the system in
 a filesystem-agnostic manner, for example when performance is bad
 through btrfs, is performance the same as you got on fresh boot when
 testing iops on /dev/sdb or whatever? You're not by chance swapping
 after a bit of uptime on any volume that's shared with the underlying
 disks that make up your osd, obfuscated by a hardware raid? I didn't
 see the kernel warning you're referring to, just the ixgbe malloc
 failure you mentioned the other day.

I've looked at most of this. What makes me point to btrfs is that the
problem goes away when I reboot one server in our cluster, but persists
on the other systems. So it can't be related to the number of requests
that come in.

 I do not mean to presume that you have not looked at these things
 already. I am not very knowledgeable in btrfs specifically, but I
 would expect any degradation in performance over time to be due to
 what's on disk (lots of small files, fragmented, etc). This is
 obviously not the case in this situation since a reboot recovers the
 performance. I suppose it could also be a memory leak or something
 similar, but you should be able to detect something like that by
 monitoring your memory situation, /proc/slabinfo etc.

It could be related to a memory leak. The machine has a lot of RAM (24
GB), but we have seen page allocation failures in the ixgbe driver
when we are using jumbo frames.
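
A simple way to watch for that over time - entirely generic, nothing
btrfs-specific - would be something like:

  while sleep 60; do
          date
          grep -E '^(MemFree|Dirty|Writeback|Slab):' /proc/meminfo
  done >> /var/tmp/meminfo-watch.log

plus an occasional snapshot of /proc/slabinfo for comparison.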

 Just my thoughts, good luck on this. I am currently running 2.6.39.3
 (btrfs) on the 7 node cluster I put together, but I just built it and
 am comparing between various configs. It will be a while before it is
 under load for several days straight.

Thanks!

When I look at the latencytop results, I see high latency in calls to
btrfs_commit_transaction_async. Isn't an async commit supposed to
return immediately?
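
As far as I can tell, the async variant only pushes the heavy part of the
commit off to a worker; the caller can still sleep until the commit has at
least started. A compressed user-space sketch of that hand-off (pthread
stand-ins and invented names, not the actual btrfs code):

#include <pthread.h>

struct txn_sketch {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int commit_started;	/* the worker has begun committing */
};

/* Runs in a worker thread and does the real (slow) commit work. */
static void *do_commit_sketch(void *arg)
{
	struct txn_sketch *t = arg;

	pthread_mutex_lock(&t->lock);
	t->commit_started = 1;
	pthread_cond_broadcast(&t->cond);
	pthread_mutex_unlock(&t->lock);
	/* ... writing out the trees and super blocks would happen here ... */
	return NULL;
}

/*
 * "Async" here only means the bulk of the commit runs elsewhere; the caller
 * still blocks until commit_started is set, which is why latencytop can show
 * a noticeable wait inside the supposedly asynchronous call.
 */
void commit_transaction_async_sketch(struct txn_sketch *t)
{
	pthread_t worker;

	pthread_create(&worker, NULL, do_commit_sketch, t);
	pthread_mutex_lock(&t->lock);
	while (!t->commit_started)
		pthread_cond_wait(&t->cond, &t->lock);
	pthread_mutex_unlock(&t->lock);
	pthread_detach(worker);
}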

Regards,
Christian


Btrfs slowdown

2011-07-25 Thread Christian Brunner
Hi,

we are running a ceph cluster with btrfs as its base filesystem
(kernel 3.0). At the beginning everything worked very well, but after
a few days (2-3) things get very slow.

When I look at the object store servers I see heavy disk I/O on the
btrfs filesystems (disk utilization is between 60% and 100%). I also
did some tracing on the Ceph object store daemon, but I'm quite
certain that the majority of the disk I/O is not caused by ceph or
any other userland process.

When I reboot the system(s), the problems go away for another 2-3 days,
but after that it starts again. I'm not sure if the problem is
related to the kernel warning I reported last week. At least there
is no temporal relationship between the warning and the slowdown.

Any hints on how to trace this would be welcome.

Thanks,
Christian


Re: [PATCH] Btrfs: don't be as agressive with delalloc metadata reservations V2

2011-07-21 Thread Christian Brunner
2011/7/18 Josef Bacik jo...@redhat.com:
 On 07/18/2011 02:11 PM, Josef Bacik wrote:
 Currently we reserve enough space to COW an entirely full btree for every
 extent we have reserved for an inode.  This _sucks_, because you only need
 to COW once, and then everybody else is ok.  Unfortunately we don't know
 we'll all be able to get into the same transaction so that's what we have
 had to do.  But the global reserve holds a reservation large enough to
 cover a large percentage of all the metadata currently in the fs.  So all
 we really need to account for is any new blocks that we may allocate.  So
 fix this by

 1) Passing to btrfs_alloc_free_block() whether this is a new block or a COW
 block.  If it is a COW block we use the global reserve, if not we use the
 trans->block_rsv.
 2) Reduce the amount of space we reserve.  Since we don't need to account
 for cow'ing the tree we can just keep track of new blocks to reserve, which
 greatly reduces the reservation amount.

 This makes my basic random write test go from 3 mb/s to 75 mb/s.  I've tested
 this with my horrible ENOSPC test and it seems to work out fine.  Thanks,

 Signed-off-by: Josef Bacik jo...@redhat.com
 ---
 V1->V2:
 -fix a problem reported by Liubo, we need to make sure that we move bytes
 over for any new extents we may add to the extent tree so we don't get a
 bunch of warnings.
 -fix the global reserve to reserve 50% of the metadata space currently used.
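
(Restating point 1 of the quoted description in code form, a minimal sketch
with made-up types, not the actual btrfs implementation: once the caller
tells the allocator whether the block is a COW of an existing tree block or
a brand new one, picking the reserve is a one-liner, and only the new blocks
need an up-front per-extent reservation.)

enum rsv_kind {
	RSV_GLOBAL,	/* the big global reserve, sized from existing metadata */
	RSV_TRANS,	/* the transaction's own reserve, filled by its starter */
};

struct alloc_req_sketch {
	int is_cow;	/* 1 = COW of an existing tree block, 0 = brand new block */
};

/*
 * Sketch of the V2 idea: COWs are paid for out of the global reserve, while
 * genuinely new blocks are charged to the transaction's reserve, so the
 * per-extent reservation only has to cover the new blocks.
 */
enum rsv_kind pick_block_rsv_sketch(const struct alloc_req_sketch *req)
{
	return req->is_cow ? RSV_GLOBAL : RSV_TRANS;
}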

When I run this patch I get a lot of messages like these (V1 seemed to
run fine).

Regards,
Christian

Jul 21 15:25:59 os00 kernel: [   35.411360] [ cut here ]
Jul 21 15:25:59 os00 kernel: [   35.416589] WARNING: at
fs/btrfs/extent-tree.c:5564
btrfs_alloc_reserved_file_extent+0xf8/0x100 [btrfs]()
Jul 21 15:25:59 os00 kernel: [   35.427311] Hardware name: ProLiant DL180 G6
Jul 21 15:25:59 os00 kernel: [   35.432326] Modules linked in: btrfs
zlib_deflate libcrc32c bonding ipv6 serio_raw pcspkr ghes hed iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(P) hpsa squashfs usb_storage [last unloaded:
scsi_wait_scan]
Jul 21 15:25:59 os00 kernel: [   35.456799] Pid: 1876, comm:
btrfs-endio-wri Tainted: P3.0.0-1.fits.4.el6.x86_64 #1
Jul 21 15:25:59 os00 kernel: [   35.466610] Call Trace:
Jul 21 15:25:59 os00 kernel: [   35.469497]  [8106306f]
warn_slowpath_common+0x7f/0xc0
Jul 21 15:25:59 os00 kernel: [   35.476254]  [810630ca]
warn_slowpath_null+0x1a/0x20
Jul 21 15:25:59 os00 kernel: [   35.482839]  [a02227f8]
btrfs_alloc_reserved_file_extent+0xf8/0x100 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.491683]  [a023d871]
insert_reserved_file_extent.clone.0+0x201/0x270 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.500912]  [a023debb]
btrfs_finish_ordered_io+0x2eb/0x360 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.508978]  [81073841] ?
try_to_del_timer_sync+0x81/0xe0
Jul 21 15:25:59 os00 kernel: [   35.516081]  [a023df7c]
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.524340]  [a0277846]
end_compressed_bio_write+0x86/0xf0 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.532259]  [8118f0cd]
bio_endio+0x1d/0x40
Jul 21 15:25:59 os00 kernel: [   35.538034]  [a0232654]
end_workqueue_fn+0xf4/0x130 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.545384]  [a0265e7e]
worker_loop+0x13e/0x540 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.552307]  [a0265d40] ?
btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.560039]  [a0265d40] ?
btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.567768]  [81085836]
kthread+0x96/0xa0
Jul 21 15:25:59 os00 kernel: [   35.573275]  [81562b84]
kernel_thread_helper+0x4/0x10
Jul 21 15:25:59 os00 kernel: [   35.579931]  [810857a0] ?
kthread_worker_fn+0x1a0/0x1a0
Jul 21 15:25:59 os00 kernel: [   35.586816]  [81562b80] ?
gs_change+0x13/0x13
Jul 21 15:25:59 os00 kernel: [   35.592779] ---[ end trace d87e2733f1e978b8 ]---


WARNING: at fs/btrfs/inode.c:2204

2011-07-21 Thread Christian Brunner
I'm running a Ceph Object Store with 3.0-rc7 and patches from Josef.
Occasionally I get the attached warning.

Everything seems to be working after this warning, but I am concerned...

Thanks,
Christian

[13319.808020] [ cut here ]
[13319.813284] WARNING: at fs/btrfs/inode.c:2204
btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
[13319.822563] Hardware name: ProLiant DL180 G6
[13319.827586] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[13319.851192] Pid: 23617, comm: kworker/6:0 Tainted: P
3.0.0-1.fits.2.el6.x86_64 #1
[13319.860661] Call Trace:
[13319.863433]  [8106306f] warn_slowpath_common+0x7f/0xc0
[13319.870172]  [810630ca] warn_slowpath_null+0x1a/0x20
[13319.876724]  [a022d030] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
[13319.884633]  [a0227e05] commit_fs_roots+0xc5/0x1b0 [btrfs]
[13319.891762]  [a02288ae]
btrfs_commit_transaction+0x3ce/0x840 [btrfs]
[13319.899917]  [8105cb6f] ? dequeue_task_fair+0x20f/0x220
[13319.906726]  [8100a38b] ? __switch_to+0x12b/0x320
[13319.912943]  [81085eb0] ? wake_up_bit+0x40/0x40
[13319.918971]  [a0228ff0] ? btrfs_end_transaction+0x20/0x20 [btrfs]
[13319.926775]  [a022900f] do_async_commit+0x1f/0x30 [btrfs]
[13319.933825]  [8107e388] process_one_work+0x128/0x450
[13319.940419]  [8108116b] worker_thread+0x17b/0x3c0
[13319.946670]  [81080ff0] ? manage_workers+0x220/0x220
[13319.953210]  [81085836] kthread+0x96/0xa0
[13319.958682]  [81562b44] kernel_thread_helper+0x4/0x10
[13319.965316]  [810857a0] ? kthread_worker_fn+0x1a0/0x1a0
[13319.972183]  [81562b40] ? gs_change+0x13/0x13
[13319.978065] ---[ end trace 942778a443791443 ]---


Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Christian Brunner
2011/7/13 Josef Bacik jo...@redhat.com:
 On 07/12/2011 11:20 AM, Christian Brunner wrote:
 2011/6/7 Josef Bacik jo...@redhat.com:
 On 06/06/2011 09:39 PM, Miao Xie wrote:
 On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
 I got a lot of these when running stress.sh on my test box



 This is because use_block_rsv() is having to do a
 reserve_metadata_bytes(), which shouldn't happen as we should have
 reserved enough space for those operations to complete.  This is
 happening because use_block_rsv() will call get_block_rsv(), which if
 root->ref_cows is set (which is the case on all fs roots) we will use
 trans->block_rsv, which will only have what the current transaction
 starter had reserved.

 What needs to be done instead is we need to have a block reserve that
 any reservation that is done at create time for these inodes is migrated
 to this special reserve, and then when you run the delayed inode items
 stuff you set trans->block_rsv to the special block reserve so the
 accounting is all done properly.

 This is just off the top of my head, there may be a better way to do it,
 I've not actually looked at the delayed inode code at all.

 I would do this myself but I have an ever increasing list of shit to do
 so will somebody pick this up and fix it please?  Thanks,
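
(If I read the quoted suggestion right, the shape of the fix is roughly the
following; all names here are invented for illustration and this is not the
actual delayed-inode code: the reservation made at inode-creation time is
migrated into a dedicated reserve, and trans->block_rsv is pointed at that
reserve while the delayed items run, so use_block_rsv() no longer falls back
to reserve_metadata_bytes().)

struct rsv_sketch {
	long bytes;			/* metadata space set aside in this reserve */
};

struct trans_sketch {
	struct rsv_sketch *block_rsv;	/* what use_block_rsv() would draw from */
};

/* Move space reserved at inode-creation time into the dedicated reserve. */
void migrate_reservation_sketch(struct rsv_sketch *from, struct rsv_sketch *to,
				long bytes)
{
	from->bytes -= bytes;
	to->bytes += bytes;
}

/*
 * While the delayed inode items are processed, charge them to the dedicated
 * reserve instead of whatever the current transaction starter had reserved.
 */
void run_delayed_items_sketch(struct trans_sketch *trans,
			      struct rsv_sketch *delayed_rsv)
{
	struct rsv_sketch *saved = trans->block_rsv;

	trans->block_rsv = delayed_rsv;
	/* ... insert/update the delayed inode items here ... */
	trans->block_rsv = saved;
}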

 Sorry, it's my mistake.
 I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
 the space from trans_block_rsv to global_block_rsv.

 I'll fix it soon.


 There is another problem, we're failing xfstest 204.  I tried making
 reserve_metadata_bytes commit the transaction regardless of whether or
 not there were pinned bytes but the test just hung there.  Usually it
 takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
 204 just creates a crap ton of files, which is what is killing us.
 There needs to be a way to start flushing delayed inode items so we can
 reclaim the space they are holding onto so we don't get enospc, and it
 needs to be better than just committing the transaction because that is
 dog slow.  Thanks,

 Josef

 Is there a solution for this?

 I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
 (except the plugging). When starting a ceph rebuild on the btrfs
 volumes I get a lot of warnings from block_rsv_use_bytes in
 use_block_rsv:


 Ok I think I've got this nailed down.  Will you run with this patch and make 
 sure the warnings go away?  Thanks,

I'm sorry, I'm still getting a lot of warnings like the one below.

I've also noticed that I'm not getting these messages when the
free_space_cache is disabled.

Christian

[  697.398097] [ cut here ]
[  697.398109] WARNING: at fs/btrfs/extent-tree.c:5693
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[  697.398111] Hardware name: ProLiant DL180 G6
[  697.398112] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[  697.398122] Pid: 6591, comm: btrfs-freespace Tainted: PW
3.0.0-1.fits.1.el6.x86_64 #1
[  697.398124] Call Trace:
[  697.398128]  [810630af] warn_slowpath_common+0x7f/0xc0
[  697.398131]  [8106310a] warn_slowpath_null+0x1a/0x20
[  697.398142]  [a022cb88] btrfs_alloc_free_block+0x1f8/0x360 [btrfs]
[  697.398156]  [a025ae08] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[  697.398316]  [a021d112] split_leaf+0x142/0x8c0 [btrfs]
[  697.398325]  [a021629b] ? generic_bin_search+0x19b/0x210 [btrfs]
[  697.398334]  [a0218a1a] ? btrfs_leaf_free_space+0x8a/0xe0 [btrfs]
[  697.398344]  [a021df63] btrfs_search_slot+0x6d3/0x7a0 [btrfs]
[  697.398355]  [a0230942] btrfs_csum_file_blocks+0x632/0x830 [btrfs]
[  697.398369]  [a025c03a] ? clear_extent_bit+0x17a/0x440 [btrfs]
[  697.398382]  [a023c009] add_pending_csums+0x49/0x70 [btrfs]
[  697.398395]  [a023ef5d] btrfs_finish_ordered_io+0x22d/0x360 [btrfs]
[  697.398408]  [a023f0dc]
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[  697.398422]  [a025c4fb]
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[  697.398425]  [81558b5b] ? schedule_timeout+0x17b/0x2e0
[  697.398436]  [a02336d9] ? end_workqueue_fn+0xe9/0x130 [btrfs]
[  697.398439]  [8118f24d] bio_endio+0x1d/0x40
[  697.398451]  [a02336e4] end_workqueue_fn+0xf4/0x130 [btrfs]
[  697.398464]  [a02671de] worker_loop+0x13e/0x540 [btrfs]
[  697.398477]  [a02670a0] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[  697.398490]  [a02670a0] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[  697.398493]  [81085896] kthread+0x96/0xa0
[  697.398496]  [81563844] kernel_thread_helper+0x4/0x10
[  697.398499]  [81085800] ? kthread_worker_fn+0x1a0/0x1a0
[  697.398502]  [81563840] ? gs_change+0x13/0x13
[  697.398503] ---[ end trace 8c77269b0de3f0fb

Re: Delayed inode operations not doing the right thing with enospc

2011-07-12 Thread Christian Brunner
2011/6/7 Josef Bacik jo...@redhat.com:
 On 06/06/2011 09:39 PM, Miao Xie wrote:
 On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
 I got a lot of these when running stress.sh on my test box



 This is because use_block_rsv() is having to do a
 reserve_metadata_bytes(), which shouldn't happen as we should have
 reserved enough space for those operations to complete.  This is
 happening because use_block_rsv() will call get_block_rsv(), which if
 root->ref_cows is set (which is the case on all fs roots) we will use
 trans->block_rsv, which will only have what the current transaction
 starter had reserved.

 What needs to be done instead is we need to have a block reserve that
 any reservation that is done at create time for these inodes is migrated
 to this special reserve, and then when you run the delayed inode items
 stuff you set trans->block_rsv to the special block reserve so the
 accounting is all done properly.

 This is just off the top of my head, there may be a better way to do it,
 I've not actually looked at the delayed inode code at all.

 I would do this myself but I have an ever increasing list of shit to do
 so will somebody pick this up and fix it please?  Thanks,

 Sorry, it's my mistake.
 I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
 the space from trans_block_rsv to global_block_rsv.

 I'll fix it soon.


 There is another problem, we're failing xfstest 204.  I tried making
 reserve_metadata_bytes commit the transaction regardless of whether or
 not there were pinned bytes but the test just hung there.  Usually it
 takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
 204 just creates a crap ton of files, which is what is killing us.
 There needs to be a way to start flushing delayed inode items so we can
 reclaim the space they are holding onto so we don't get enospc, and it
 needs to be better than just committing the transaction because that is
 dog slow.  Thanks,

 Josef
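
(What the quoted mail asks for, as far as I understand it, is a reclaim step
along these lines; the names and the accounting below are invented for
illustration, not actual btrfs code: before returning ENOSPC, flush the
delayed inode items so their reservations come back, and only fall back to
the slow full transaction commit if that still isn't enough.)

#include <errno.h>

struct space_info_sketch {
	long free_bytes;		/* reservable metadata space */
	long delayed_item_bytes;	/* space pinned by unflushed delayed items */
};

/* Flushing the delayed items releases the space they were holding onto. */
static void flush_delayed_items_sketch(struct space_info_sketch *s)
{
	s->free_bytes += s->delayed_item_bytes;
	s->delayed_item_bytes = 0;
}

/* Stub for the slow, last-resort reclaim path (a full transaction commit). */
static void commit_transaction_sketch(struct space_info_sketch *s)
{
	(void)s;	/* in reality this would also return pinned space */
}

int reserve_metadata_bytes_sketch(struct space_info_sketch *s, long need)
{
	if (s->free_bytes < need)
		flush_delayed_items_sketch(s);	/* cheap reclaim first */
	if (s->free_bytes < need)
		commit_transaction_sketch(s);	/* dog slow, last resort */
	if (s->free_bytes < need)
		return -ENOSPC;

	s->free_bytes -= need;
	return 0;
}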

Is there a solution for this?

I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
(except the plugging). When starting a ceph rebuild on the btrfs
volumes I get a lot of warnings from block_rsv_use_bytes in
use_block_rsv:

[ 2157.922054] [ cut here ]
[ 2157.927270] WARNING: at fs/btrfs/extent-tree.c:5683
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[ 2157.937123] Hardware name: ProLiant DL180 G6
[ 2157.942132] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 pcspkr serio_raw iTCO_wdt iTCO_vendor_support ghes hed
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[ 2157.967386] Pid: 10280, comm: btrfs-freespace Tainted: PW
2.6.38.8-1.fits.4.el6.x86_64 #1
[ 2157.977554] Call Trace:
[ 2157.980383]  [8106482f] ? warn_slowpath_common+0x7f/0xc0
[ 2157.987382]  [8106488a] ? warn_slowpath_null+0x1a/0x20
[ 2157.994192]  [a0240b88] ?
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]
[ 2158.002354]  [a026eda8] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[ 2158.010014]  [a0231132] ? split_leaf+0x142/0x8c0 [btrfs]
[ 2158.016990]  [a022a29b] ? generic_bin_search+0x19b/0x210 [btrfs]
[ 2158.024784]  [a022ca1a] ? btrfs_leaf_free_space+0x8a/0xe0 [btrfs]
[ 2158.032627]  [a0231f83] ? btrfs_search_slot+0x6d3/0x7a0 [btrfs]
[ 2158.040325]  [a0244942] ?
btrfs_csum_file_blocks+0x632/0x830 [btrfs]
[ 2158.048477]  [a026ffda] ? clear_extent_bit+0x17a/0x440 [btrfs]
[ 2158.056026]  [a024ffc5] ? add_pending_csums+0x45/0x70 [btrfs]
[ 2158.063530]  [a0252dad] ?
btrfs_finish_ordered_io+0x22d/0x360 [btrfs]
[ 2158.071755]  [a0252f2c] ?
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[ 2158.080172]  [a027049b] ?
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[ 2158.088505]  [815406fb] ? schedule_timeout+0x17b/0x2e0
[ 2158.095258]  [8118964d] ? bio_endio+0x1d/0x40
[ 2158.101171]  [a0247764] ? end_workqueue_fn+0xf4/0x130 [btrfs]
[ 2158.108621]  [a027b30e] ? worker_loop+0x13e/0x540 [btrfs]
[ 2158.115703]  [a027b1d0] ? worker_loop+0x0/0x540 [btrfs]
[ 2158.122563]  [a027b1d0] ? worker_loop+0x0/0x540 [btrfs]
[ 2158.129413]  [81086356] ? kthread+0x96/0xa0
[ 2158.135093]  [8100ce44] ? kernel_thread_helper+0x4/0x10
[ 2158.141913]  [810862c0] ? kthread+0x0/0xa0
[ 2158.147467]  [8100ce40] ? kernel_thread_helper+0x0/0x10
[ 2158.154287] ---[ end trace 55e53c726a04ecd7 ]---

Thanks,
Christian


kernel BUG at fs/btrfs/extent-tree.c:5637!

2011-05-19 Thread Christian Brunner
Hi,

we are running a ceph cluster with a btrfs store. Last night we ran
across this btrfs BUG.

Any hints on how to solve this are welcome.

Regards
Christian

May 19 06:10:07 os00 kernel: [247212.342712] [ cut here
]
May 19 06:10:07 os00 kernel: [247212.347953] kernel BUG at
fs/btrfs/extent-tree.c:5637!
May 19 06:10:07 os00 kernel: [247212.353773] invalid opcode:  [#1] SMP
May 19 06:10:07 os00 kernel: [247212.358449] last sysfs file:
/sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
May 19 06:10:07 os00 kernel: [247212.367268] CPU 6
May 19 06:10:07 os00 kernel: [247212.369407] Modules linked in: btrfs
zlib_deflate libcrc32c bonding ipv6 serio_raw pcspkr ghes hed iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe mdio iomemory_vsl(P)
hpsa igb dca squashfs usb_storage [last unloaded: scsi_wait_scan]
May 19 06:10:07 os00 kernel: [247212.393864]
May 19 06:10:07 os00 kernel: [247212.395618] Pid: 3074, comm: cosd
Tainted: P2.6.38.6-1.fits.3.el6.x86_64 #1 HP ProLiant
DL180 G6
May 19 06:10:07 os00 kernel: [247212.406885] RIP:
0010:[a025fdcd]  [a025fdcd]
run_clustered_refs+0x54d/0x800 [btrfs]
May 19 06:10:07 os00 kernel: [247212.417468] RSP:
0018:8805dc6b99b8  EFLAGS: 00010282
May 19 06:10:07 os00 kernel: [247212.423482] RAX: ffef
RBX: 88037570ac00 RCX: 8805dc6b8000
May 19 06:10:07 os00 kernel: [247212.431528] RDX: 0008
RSI: 8800 RDI: 8805c7acabb0
May 19 06:10:07 os00 kernel: [247212.439572] RBP: 8805dc6b9a98
R08: 0001 R09: 0001
May 19 06:10:07 os00 kernel: [247212.447617] R10: 8805e0947000
R11: 8802e7e04480 R12: 8805ac2150c0
May 19 06:10:07 os00 kernel: [247212.455663] R13: 8804e758bc00
R14: 8805e0963000 R15: 8802e7e04480
May 19 06:10:07 os00 kernel: [247212.463709] FS:
7fc1691d2700() GS:8800bf2c()
knlGS:
May 19 06:10:07 os00 kernel: [247212.472820] CS:  0010 DS:  ES:
 CR0: 80050033
May 19 06:10:07 os00 kernel: [247212.479317] CR2: 7f907f6313b0
CR3: 0005dfad1000 CR4: 06e0
May 19 06:10:07 os00 kernel: [247212.487363] DR0: 
DR1:  DR2: 
May 19 06:10:07 os00 kernel: [247212.495408] DR3: 
DR6: 0ff0 DR7: 0400
May 19 06:10:07 os00 kernel: [247212.503453] Process cosd (pid: 3074,
threadinfo 8805dc6b8000, task 8805dc56e4c0)
May 19 06:10:07 os00 kernel: [247212.512563] Stack:
May 19 06:10:07 os00 kernel: [247212.514896]  
 88040001 
May 19 06:10:07 os00 kernel: [247212.523301]  8805dfd1c000
8805e1d71288  8805dc6b9ad8
May 19 06:10:07 os00 kernel: [247212.531673]  
0dd0 8805e1d711d0 0002
May 19 06:10:07 os00 kernel: [247212.540054] Call Trace:
May 19 06:10:07 os00 kernel: [247212.542883]  [a02adf01] ?
btrfs_find_ref_cluster+0x1/0x180 [btrfs]
May 19 06:10:07 os00 kernel: [247212.550840]  [a0260148]
btrfs_run_delayed_refs+0xc8/0x230 [btrfs]
May 19 06:10:07 os00 kernel: [247212.558700]  [a026d6a1]
__btrfs_end_transaction+0x71/0x210 [btrfs]
May 19 06:10:07 os00 kernel: [247212.566685]  [a026d895]
btrfs_end_transaction+0x15/0x20 [btrfs]
May 19 06:10:07 os00 kernel: [247212.574382]  [a0273a2a]
btrfs_dirty_inode+0x8a/0x130 [btrfs]
May 19 06:10:07 os00 kernel: [247212.581752]  [8117fa7f]
__mark_inode_dirty+0x3f/0x1e0
May 19 06:10:07 os00 kernel: [247212.588446]  [811715ac]
file_update_time+0xec/0x170
May 19 06:10:07 os00 kernel: [247212.594952]  [a027cbb0]
btrfs_file_aio_write+0x1d0/0x4e0 [btrfs]
May 19 06:10:07 os00 kernel: [247212.602709]  [8126ff31] ?
ima_counts_get+0x61/0x140
May 19 06:10:07 os00 kernel: [247212.609214]  [a027c9e0] ?
btrfs_file_aio_write+0x0/0x4e0 [btrfs]
May 19 06:10:07 os00 kernel: [247212.616970]  [81158ff3]
do_sync_readv_writev+0xd3/0x110
May 19 06:10:07 os00 kernel: [247212.623855]  [81163d42] ?
path_put+0x22/0x30
May 19 06:10:07 os00 kernel: [247212.629675]  [812584a3] ?
selinux_file_permission+0xf3/0x150
May 19 06:10:07 os00 kernel: [247212.637044]  [81251583] ?
security_file_permission+0x23/0x90
May 19 06:10:07 os00 kernel: [247212.644415]  [81159f14]
do_readv_writev+0xd4/0x1e0
May 19 06:10:07 os00 kernel: [247212.650818]  [81540d91] ?
mutex_lock+0x31/0x60
May 19 06:10:07 os00 kernel: [247212.656832]  [8115a066]
vfs_writev+0x46/0x60
May 19 06:10:07 os00 kernel: [247212.662653]  [8115a1a1]
sys_writev+0x51/0xc0
May 19 06:10:07 os00 kernel: [247212.668477]  [8100c002]
system_call_fastpath+0x16/0x1b
May 19 06:10:07 os00 kernel: [247212.675264] Code: 48 8b 75 a0 48 8b
7d a8 ba b0 00 00 00 e8 7c 6c 02 00 48 8b 95 78 ff ff ff 48 8b 75 a0
48 8b 7d a8 e8 68 6b 02 00 e9 04 ff ff ff 0f 0b eb fe 0f 0b eb 

Re: [PATCH] Prevent oopsing in posix_acl_valid()

2011-05-03 Thread Christian Brunner
2011/5/3 Josef Bacik jo...@redhat.com:
 On 05/03/2011 12:44 PM, Daniel J Blueman wrote:

 If posix_acl_from_xattr() returns an error code, a negative address is
 dereferenced causing an oops; fix by checking for error code first.

 Signed-off-by: Daniel J Blueman daniel.blue...@gmail.com
 ---
  fs/btrfs/acl.c |    5 +++--
  1 files changed, 3 insertions(+), 2 deletions(-)

 diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
 index 5d505aa..cad6fbb 100644
 --- a/fs/btrfs/acl.c
 +++ b/fs/btrfs/acl.c
 @@ -178,12 +178,13 @@ static int btrfs_xattr_acl_set(struct dentry *dentry, const char *name,

        if (value) {
                acl = posix_acl_from_xattr(value, size);
 +               if (IS_ERR(acl)

A small typo: The right parenthesis is missing.
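
For what it's worth, the corrected check would presumably end up looking
something like this (I'm guessing at the error path, since the rest of the
hunk is cut off above):

	if (IS_ERR(acl))
		return PTR_ERR(acl);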

Christian