Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Christian Brunner
2011/10/27 Josef Bacik :
> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
>> 2011/10/24 Josef Bacik :
>> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> >> [adding linux-btrfs to cc]
>> >>
>> >> Josef, Chris, any ideas on the below issues?
>> >>
>> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
>> >> >
>> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
>> >> > slightly better. I can run an OSD for about 3 days without problems,
>> >> > but then again the load increases. This time, I can see that the
>> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
>> >> > than usual.
>> >>
>> >> FYI in this scenario you're exposed to the same journal replay issues that
>> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
>> >> not be all that special, though, so this problem shouldn't be unique to
>> >> ceph.
>> >>
>> >
>> > Can you get sysrq+w when this happens?  I'd like to see what 
>> > btrfs-endio-write
>> > is up to.
>>
>> Capturing this seems to be not easy. I have a few traces (see
>> attachment), but with sysrq+w I do not get a stacktrace of
>> btrfs-endio-write. What I have is a "latencytop -c" output which is
>> interesting:
>>
>> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> tries to balance the load over all OSDs, so all filesystems should get
>> a nearly equal load. At the moment one filesystem seems to have a
>> problem. When running iostat I see the following:
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
>> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
>> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
>> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
>>
>> The PID of the ceph-osd that is running on sdc is 2053 and when I look
>> with top I see this process and a btrfs-endio-writer (PID 5447):
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
>>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
>>
>> In the latencytop output you can see that those processes have a much
>> higher latency than the other ceph-osd and btrfs-endio-writers.
>>
>> Regards,
>> Christian
>
> Ok just a shot in the dark, but could you give this a whirl and see if it 
> helps
> you?  Thanks

Thanks for the patch! I'll install it tomorrow and I think that I can
report back on Monday. It always takes a few days until the load goes
up.

Regards,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-31 Thread Christian Brunner
2011/10/31 Christian Brunner :
> 2011/10/31 Christian Brunner :
>>
>> The patch didn't hurt, but I've to tell you that I'm still seeing the
>> same old problems. Load is going up again:
>>
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  5502 root      20   0     0    0    0 S 52.5 0.0 106:29.97 btrfs-endio-wri
>>  1976 root      20   0  601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd
>>
>> And I have hit our warning again:
>>
>> [223560.970713] [ cut here ]
>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118
>> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>> [223560.985411] Hardware name: ProLiant DL180 G6
>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
>> bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
>> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
>> [last unloaded: scsi_wait_scan]
>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P
>> 3.0.6-1.fits.9.el6.x86_64 #1
>> [223561.023874] Call Trace:
>> [223561.026738]  [] warn_slowpath_common+0x7f/0xc0
>> [223561.033564]  [] warn_slowpath_null+0x1a/0x20
>> [223561.040272]  [] btrfs_orphan_commit_root+0xb0/0xc0 
>> [btrfs]
>> [223561.048278]  [] commit_fs_roots+0xc5/0x1b0 [btrfs]
>> [223561.055534]  [] ? mutex_lock+0x31/0x60
>> [223561.061666]  []
>> btrfs_commit_transaction+0x3ce/0x820 [btrfs]
>> [223561.069876]  [] ? wait_current_trans+0x28/0x110 [btrfs]
>> [223561.077582]  [] ? join_transaction+0x25/0x250 [btrfs]
>> [223561.085065]  [] ? wake_up_bit+0x40/0x40
>> [223561.091251]  [] btrfs_sync_fs+0x59/0xd0 [btrfs]
>> [223561.098187]  [] btrfs_ioctl+0x495/0xd50 [btrfs]
>> [223561.105120]  [] ? inode_has_perm+0x30/0x40
>> [223561.111575]  [] ? file_has_perm+0xdc/0xf0
>> [223561.117924]  [] do_vfs_ioctl+0x9a/0x5a0
>> [223561.124072]  [] sys_ioctl+0xa1/0xb0
>> [223561.129842]  [] system_call_fastpath+0x16/0x1b
>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
>
> [ Not sending this to the lists, as the attachment is large ].
>
> I've spent a little time to do some tracing with ftrace. Its output
> seems to be right (at least as far as I can tell). I hope that its
> output can give you an insight on whats going on.
>
> The interesting PIDs in the trace are:
>
>  5502 root      20   0     0    0    0 S 33.6 0.0 118:28.37 btrfs-endio-wri
>  5518 root      20   0     0    0    0 S 29.3 0.0 41:23.58 btrfs-endio-wri
>  8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
>  7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
>

[ adding linux-btrfs again ]

I've been digging into this a bit further:

Attached is another ftrace report that I've filtered for "btrfs_*"
calls and limited to CPU0 (this is where PID 5502 was running).

From what I can see there is a lot of time consumed in
btrfs_reserve_extent(). Is this normal?
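
For anyone who wants to capture a similar trace, here is a rough sketch
of the ftrace setup (assuming debugfs is mounted at /sys/kernel/debug;
this is not necessarily the exact invocation used for the attached file):

cd /sys/kernel/debug/tracing
echo 'btrfs_*' > set_ftrace_filter   # trace only btrfs functions
echo 1 > tracing_cpumask             # restrict tracing to CPU0
echo function > current_tracer
echo 1 > tracing_on
sleep 10
echo 0 > tracing_on
cat trace | bzip2 > /tmp/ftrace_btrfs_cpu0.bz2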

Thanks,
Christian


ftrace_btrfs_cpu0.bz2
Description: BZip2 compressed data


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-31 Thread Christian Brunner
2011/10/31 Christian Brunner :
> 2011/10/31 Christian Brunner :
>> 2011/10/31 Christian Brunner :
>>>
>>> The patch didn't hurt, but I've to tell you that I'm still seeing the
>>> same old problems. Load is going up again:
>>>
>>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>  5502 root      20   0     0    0    0 S 52.5 0.0 106:29.97 btrfs-endio-wri
>>>  1976 root      20   0  601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd
>>>
>>> And I have hit our warning again:
>>>
>>> [223560.970713] [ cut here ]
>>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118
>>> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>>> [223560.985411] Hardware name: ProLiant DL180 G6
>>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
>>> bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
>>> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
>>> [last unloaded: scsi_wait_scan]
>>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P
>>> 3.0.6-1.fits.9.el6.x86_64 #1
>>> [223561.023874] Call Trace:
>>> [223561.026738]  [] warn_slowpath_common+0x7f/0xc0
>>> [223561.033564]  [] warn_slowpath_null+0x1a/0x20
>>> [223561.040272]  [] btrfs_orphan_commit_root+0xb0/0xc0 
>>> [btrfs]
>>> [223561.048278]  [] commit_fs_roots+0xc5/0x1b0 [btrfs]
>>> [223561.055534]  [] ? mutex_lock+0x31/0x60
>>> [223561.061666]  []
>>> btrfs_commit_transaction+0x3ce/0x820 [btrfs]
>>> [223561.069876]  [] ? wait_current_trans+0x28/0x110 
>>> [btrfs]
>>> [223561.077582]  [] ? join_transaction+0x25/0x250 [btrfs]
>>> [223561.085065]  [] ? wake_up_bit+0x40/0x40
>>> [223561.091251]  [] btrfs_sync_fs+0x59/0xd0 [btrfs]
>>> [223561.098187]  [] btrfs_ioctl+0x495/0xd50 [btrfs]
>>> [223561.105120]  [] ? inode_has_perm+0x30/0x40
>>> [223561.111575]  [] ? file_has_perm+0xdc/0xf0
>>> [223561.117924]  [] do_vfs_ioctl+0x9a/0x5a0
>>> [223561.124072]  [] sys_ioctl+0xa1/0xb0
>>> [223561.129842]  [] system_call_fastpath+0x16/0x1b
>>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
>>
>> [ Not sending this to the lists, as the attachment is large ].
>>
>> I've spent a little time to do some tracing with ftrace. Its output
>> seems to be right (at least as far as I can tell). I hope that its
>> output can give you an insight on whats going on.
>>
>> The interesting PIDs in the trace are:
>>
>>  5502 root      20   0     0    0    0 S 33.6 0.0 118:28.37 btrfs-endio-wri
>>  5518 root      20   0     0    0    0 S 29.3 0.0 41:23.58 btrfs-endio-wri
>>  8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
>>  7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
>>
>
> [ adding linux-btrfs again ]
>
> I've been digging into this a bit further:
>
> Attached is another ftrace report that I've filtered for "btrfs_*"
> calls and limited to CPU0 (this is where PID 5502 was running).
>
> From what I can see there is a lot of time consumed in
> btrfs_reserve_extent(). I this normal?

Sorry for spamming, but in the meantime I'm almost certain that the
problem is inside find_free_extent (called from btrfs_reserve_extent).

When I'm running ftrace for a sample period of 10s, my system is
wasting a total of 4.2 seconds inside find_free_extent(). Each call to
find_free_extent() takes an average of 4 milliseconds to complete.
On a recently rebooted system this is only 1-2 us!
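
Numbers like these can be collected with the ftrace function profiler.
A rough sketch, assuming debugfs is mounted at /sys/kernel/debug and
find_free_extent shows up in available_filter_functions:

cd /sys/kernel/debug/tracing
echo find_free_extent > set_ftrace_filter    # profile only this function
echo 1 > function_profile_enabled
sleep 10
echo 0 > function_profile_enabled
cat trace_stat/function*                     # per-CPU hit count, total time

The average time per call shows up in the "Avg" column of the
trace_stat output.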

I'm not sure if the problem is occurring suddenly or building up slowly
over time. (At the moment I suspect that it's occurring suddenly, but I
still have to investigate this.)

Thanks,
Christian


Re: WARNING: at fs/btrfs/inode.c:2198 btrfs_orphan_commit_root+0xa8/0xc0

2011-11-09 Thread Christian Brunner
2011/11/9 Stefan Kleijkers :
> Hello,
>
> I'm seeing a lot of warnings in dmesg with a BTRFS filesystem. I'm using the
> 3.1 kernel, I found a patch for these warnings (
> http://marc.info/?l=linux-btrfs&m=131547325515336&w=2)
> , but that patch has
> already been included in 3.1. Are there any other patches I can try?
>
> I'm using BTRFS in combination with Ceph and it looks like after a while
> with a high rsync workload that the IO stalls for some time, could the
> warnings result in IO stall?

This seems to be the same issue I've seen in our ceph cluster. We had
a lengthy discussion about this on the btrfs list:

http://marc.info/?l=linux-btrfs&m=132007001119383&w=2

As far as I know Josef is still working on it. Some of the latest
patches he sent seem to be related to this, but I don't know if they fix
the problem.

Regards,
Christian


BUG at fs/btrfs/inode.c:1587

2011-11-15 Thread Christian Brunner
Hi,

this time I've hit a new bug. This happened while ceph was rebuilding
its filestore (heavy IO).

The btrfs version is from 3.2-rc1, applied to a 3.0 kernel.

Regards,
Christian

[28981.550478] [ cut here ]
[28981.555625] kernel BUG at fs/btrfs/inode.c:1587!
[28981.560773] invalid opcode:  [#1] SMP
[28981.565361] CPU 2
[28981.567407] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
ixgbe dca mdio i7core_edac edac_core iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[28981.591184]
[28981.592842] Pid: 1814, comm: btrfs-fixup-0 Tainted: P
3.0.8-1.fits.4.el6.x86_64 #1 HP ProLiant DL180 G6
[28981.604589] RIP: 0010:[]  []
btrfs_writepage_fixup_worker+0x14c/0x160 [btrfs]
[28981.616049] RSP: 0018:8805ee735dd0  EFLAGS: 00010246
[28981.621967] RAX:  RBX: ea00132c2520 RCX: 8805ef32ec58
[28981.629918] RDX:  RSI: 003b5000 RDI: 8805ef32ea38
[28981.637870] RBP: 8805ee735e20 R08: 88063f25add0 R09: 8805ee735d88
[28981.645822] R10:  R11: 0001 R12: 003b5000
[28981.653774] R13: 8805ef32eb08 R14:  R15: 003b5fff
[28981.661727] FS:  () GS:88063f24()
knlGS:
[28981.670744] CS:  0010 DS:  ES:  CR0: 8005003b
[28981.677146] CR2: 07737000 CR3: 01a03000 CR4: 06e0
[28981.685098] DR0:  DR1:  DR2: 
[28981.693050] DR3:  DR6: 0ff0 DR7: 0400
[28981.701010] Process btrfs-fixup-0 (pid: 1814, threadinfo
8805ee734000, task 8805f3f54bc0)
[28981.710901] Stack:
[28981.713146]  88045dbf4d20 8805ef32e9a8 00012bc0
88027dcdbd20
[28981.721434]   8805ef99ede0 8805ef99ee30
8805ef99edf8
[28981.729723]  88045dbf4d50 8805ee735e80 8805ee735ee0
a02b39ce
[28981.738013] Call Trace:
[28981.740763]  [] worker_loop+0x13e/0x540 [btrfs]
[28981.747577]  [] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[28981.755263]  [] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
[28981.762931]  [] kthread+0x96/0xa0
[28981.768373]  [] kernel_thread_helper+0x4/0x10
[28981.774976]  [] ? kthread_worker_fn+0x1a0/0x1a0
[28981.781772]  [] ? gs_change+0x13/0x13
[28981.787593] Code: e0 48 83 c4 28 5b 41 5c 41 5d 41 5e 41 5f c9 c3
48 8b 7d b8 48 8d 4d c8 41 b8 50 00 00 00 4c 89 fa 4c 89 e6 e8 96 38
01 00 eb bd <0f> 0b eb fe 48 89 df e8 c8 0e e7 e0 eb 9d 66 0f 1f 44 00
00 55
[28981.809294] RIP  []
btrfs_writepage_fixup_worker+0x14c/0x160 [btrfs]
[28981.818150]  RSP 
[28981.822721] ---[ end trace 0236051622523829 ]---


Re: BUG at fs/btrfs/inode.c:1587

2011-11-16 Thread Christian Brunner
2011/11/16 Chris Mason :
> On Tue, Nov 15, 2011 at 09:19:53AM +0100, Christian Brunner wrote:
>> Hi,
>>
>> this time I've hit a new bug. This happened while ceph was rebuilding
>> his filestore (heavy io).
>>
>> The btrfs version is from 3.2-rc1, applied to a 3.0 kernel.
>
> This one means some part of the kernel has set a btrfs data page dirty
> without going through the proper setup.  A few of us have hit it, but we
> haven't been able to nail down a solid way to reproduce it.
>
> Have you hit it more than once?


I'm sorry, I've only hit this once and it's not reproducible.

Regards,
Christian


Re: WARNING: at fs/btrfs/inode.c:2198 btrfs_orphan_commit_root+0xa8/0xc0

2011-11-26 Thread Christian Brunner
2011/11/26 Stefan Kleijkers :
> Hello Josef,
>
> I've new results, is this the trace you are looking for?
>
> Trace of OSD0: http://pastebin.com/gddLBXE4
> Dmesg of OSD0: http://pastebin.com/Uebzgkjv
>
> OSD1 crashed a while later with the same messages.
>
> Stefan

Hi Josef,

I ran your patch on one of our ceph nodes, too. On the first run it
hit the BUG_ON and crashed. Unfortunately I was not able to get the
trace messages from the server (I'm glad that Stefan managed to fetch
it), so I gave it a second spin. This time it did NOT hit the BUG_ON,
but I wrote the trace to a file, so I can send you the trace output at
that time. You can find dmesg-output here:

http://pastebin.com/pWWsZ79e

The trace messages from 154900 till 154999 are here (don't know if
this is interesting):

http://pastebin.com/01EKHqn5

and the tracing output from 206200 till 206399 is here:

http://pastebin.com/50PNtiF7

I hope that this will give you a better insight into this. I will now
reboot and run it a third time to see if I can hit the BUG_ON again.

Regards,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-11-29 Thread Christian Brunner
2011/11/28 Alexandre Oliva :
> We're failing to create clusters with bitmaps because
> setup_cluster_no_bitmap checks that the list is empty before inserting
> the bitmap entry in the list for setup_cluster_bitmap, but the list
> field is only initialized when it is restored from the on-disk free
> space cache, or when it is written out to disk.
>
> Besides a potential race condition due to the multiple use of the list
> field, filesystem performance severely degrades over time: as we use
> up all non-bitmap free extents, the try-to-set-up-cluster dance is
> done at every metadata block allocation.  For every block group, we
> fail to set up a cluster, and after failing on them all up to twice,
> we fall back to the much slower unclustered allocation.

This matches exactly what I've been observing in our ceph cluster.
I've now installed your patches (1-11) on two servers.
The cluster setup problem seems to be gone. - A big thanks for that!

However, another thing is causing me some headache:

When I'm doing heavy reading in our ceph cluster, the load and wait-io
on the patched servers is higher than on the unpatched ones.

Dstat from an unpatched server:

---total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   6  83   8   0   1|  22M  348k| 336k   93M|   0 0 |8445  3715
  1   5  87   7   0   1|  12M 1808k| 214k   65M|   0 0 |5461  1710
  1   3  85  10   0   0|  11M  640k| 313k   49M|   0 0 |5919  2853
  1   6  84   9   0   1|  12M  608k| 358k   69M|   0 0 |7406  3645
  1   7  78  13   0   1|  15M 5344k| 348k  105M|   0 0 |9765  4403
  1   7  80  10   0   1|  22M 1368k| 358k   89M|   0 0 |8036  3202
  1   9  72  16   0   1|  22M 2424k| 646k  137M|   0 0 |  12k 5527

Dstat from a patched server:

---total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   2  61  35   0   0|2500k 2736k| 141k   34M|   0 0 |4415  1603
  1   4  48  47   0   1|  10M 3924k| 353k   61M|   0 0 |6871  3771
  1   5  55  38   0   1|  10M 1728k| 385k   92M|   0 0 |8030  2617
  2   8  69  20   0   1|  18M 1384k| 435k  130M|   0 0 |  10k 4493
  1   5  85   8   0   1|7664k   84k| 287k   97M|   0 0 |6231  1357
  1   3  91   5   0   0|  10M  144k| 194k   44M|   0 0 |3807  1081
  1   7  66  25   0   1|  20M 1248k| 404k  101M|   0 0 |8676  3632
  0   3  38  58   0   0|8104k 2660k| 176k   40M|   0 0 |4841  2093


This seems to be coming from "btrfs-endio-1", a kernel thread that has
not caught my attention on unpatched systems yet.

I did some tracing on that process with ftrace and I can see that the
time is wasted in end_bio_extent_readpage(). In a single call to
end_bio_extent_readpage() the functions unlock_extent_cached(),
unlock_page() and btrfs_readpage_end_io_hook() are each invoked 128
times.

Do you have any idea what's going on here?

(Please note that the filesystem is still unmodified - metadata
overhead is large).
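
For reference, this kind of per-thread trace can be captured roughly as
follows (a sketch, assuming debugfs is mounted at /sys/kernel/debug and
end_bio_extent_readpage is listed in available_filter_functions):

cd /sys/kernel/debug/tracing
PID=12345      # pid of the busy btrfs-endio-1 thread, as shown by top
echo $PID > set_ftrace_pid
echo end_bio_extent_readpage > set_graph_function
echo function_graph > current_tracer
echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
less trace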

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-01 Thread Christian Brunner
2011/12/1 Alexandre Oliva :
> On Nov 29, 2011, Christian Brunner  wrote:
>
>> When I'm doing havy reading in our ceph cluster. The load and wait-io
>> on the patched servers is higher than on the unpatched ones.
>
> That's unexpected.
>
>> This seems to be coming from "btrfs-endio-1". A kernel thread that has
>> not caught my attention on unpatched systems, yet.
>
> I suppose I could wave my hands while explaining that you're getting
> higher data throughput, so it's natural that it would take up more
> resources, but that explanation doesn't satisfy me.  I suppose
> allocation might have got slightly more CPU intensive in some cases, as
> we now use bitmaps where before we'd only use the cheaper-to-allocate
> extents.  But that's unsafisfying as well.

I must admit that I do not completely understand the difference
between bitmaps and extents.

From what I see on my servers, I can tell that the degradation over
time is gone. (Rebooting the servers every day is no longer needed.
This is a real plus.) But compared to a freshly booted, unpatched
server, performance with my ceph workload is much slower.

I wonder if it would make sense to initialize the list field only
when the cluster setup fails? This would avoid the fallback to the
much slower unclustered allocation and would give us the
cheaper-to-allocate extents.

Regards,
Christian


Re: [PATCH] Btrfs: protect orphan block rsv with spin_lock

2011-12-05 Thread Christian Brunner
2011/12/2 Josef Bacik :
> We've been seeing warnings coming out of the orphan commit stuff forever from
> ceph.  Turns out it's because we're racing with checking if the orphan block
> reserve is set, because we clear it outside of the spin_lock.  So leave the
> normal fastpath checks where they are, but take the spin_lock and _recheck_ to
> make sure we haven't had an orphan block rsv added in the meantime.  Then 
> clear
> the root's orphan block rsv and release the lock.  With this patch a user said
> the warnings went away and they usually showed up pretty soon after he started
> ceph.  Thanks,

*sigh* - As soon as I turned my back on the server console it also
happened again on one of our nodes. That was 25 hours after I started
the system. Usually I see these warnings a few minutes after the
start, but there have been cases in the past where it took longer. So
I'm not sure if the improvement is due to the patch.

Josef: I was still running the patch you sent me, but there was no
message from the printk's you added.

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-07 Thread Christian Brunner
2011/12/1 Christian Brunner :
> 2011/12/1 Alexandre Oliva :
>> On Nov 29, 2011, Christian Brunner  wrote:
>>
>>> When I'm doing havy reading in our ceph cluster. The load and wait-io
>>> on the patched servers is higher than on the unpatched ones.
>>
>> That's unexpected.

In the meantime I know that it's not related to the reads.

>> I suppose I could wave my hands while explaining that you're getting
>> higher data throughput, so it's natural that it would take up more
>> resources, but that explanation doesn't satisfy me.  I suppose
>> allocation might have got slightly more CPU intensive in some cases, as
>> we now use bitmaps where before we'd only use the cheaper-to-allocate
>> extents.  But that's unsafisfying as well.
>
> I must admit, that I do not completely understand the difference
> between bitmaps and extents.
>
> From what I see on my servers, I can tell, that the degradation over
> time is gone. (Rebooting the servers every day is no longer needed.
> This is a real plus.) But the performance compared to a freshly
> booted, unpatched server is much slower with my ceph workload.
>
> I wonder if it would make sense to initialize the list field only,
> when the cluster setup fails? This would avoid the fallback to the
> much unclustered allocation and would give us the cheaper-to-allocate
> extents.

I've now tried various combinations of your patches and I can really
nail it down to this one line.

With this patch applied I get much higher write-io values than without
it. Some of the other patches help to reduce the effect, but it's
still significant.

iostat on an unpatched node is giving me:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda             105.90     0.37   15.42   14.48  2657.33   560.13   107.61     1.89   62.75   6.26  18.71

while on a node with this patch it's

sda             128.20     0.97   11.10   57.15  3376.80   552.80    57.58    20.58  296.33   4.16  28.36


Also interesting is the fact that the average request size on the
patched node is much smaller.

Josef was telling me that this could be related to the number of
bitmaps we write out, but I have no good idea how to trace this. (The
best I can come up with is the rough ftrace sketch below, and I'm not
sure it even measures the right thing.)
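
Sketch for counting free space cache write-outs with the ftrace
function profiler. Whether btrfs_write_out_cache is a suitable function
to watch is an assumption on my side - check available_filter_functions
and the free-space-cache code first:

cd /sys/kernel/debug/tracing
echo btrfs_write_out_cache > set_ftrace_filter
echo 1 > function_profile_enabled
sleep 600
echo 0 > function_profile_enabled
grep btrfs_write_out_cache trace_stat/function*   # per-CPU call counts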

I would be very happy if someone could give me a hint on what to do
next, as this is one of the last remaining issues with our ceph
cluster.

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-09 Thread Christian Brunner
2011/12/7 Christian Brunner :
> 2011/12/1 Christian Brunner :
>> 2011/12/1 Alexandre Oliva :
>>> On Nov 29, 2011, Christian Brunner  wrote:
>>>
>>>> When I'm doing havy reading in our ceph cluster. The load and wait-io
>>>> on the patched servers is higher than on the unpatched ones.
>>>
>>> That's unexpected.
>
> In the mean time I know, that it's not related to the reads.
>
>>> I suppose I could wave my hands while explaining that you're getting
>>> higher data throughput, so it's natural that it would take up more
>>> resources, but that explanation doesn't satisfy me.  I suppose
>>> allocation might have got slightly more CPU intensive in some cases, as
>>> we now use bitmaps where before we'd only use the cheaper-to-allocate
>>> extents.  But that's unsafisfying as well.
>>
>> I must admit, that I do not completely understand the difference
>> between bitmaps and extents.
>>
>> From what I see on my servers, I can tell, that the degradation over
>> time is gone. (Rebooting the servers every day is no longer needed.
>> This is a real plus.) But the performance compared to a freshly
>> booted, unpatched server is much slower with my ceph workload.
>>
>> I wonder if it would make sense to initialize the list field only,
>> when the cluster setup fails? This would avoid the fallback to the
>> much unclustered allocation and would give us the cheaper-to-allocate
>> extents.
>
> I've now tried various combinations of you patches and I can really
> nail it down to this one line.
>
> With this patch applied I get much higher write-io values than without
> it. Some of the other patches help to reduce the effect, but it's
> still significant.
>
> iostat on an unpatched node is giving me:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sda             105.90     0.37   15.42   14.48  2657.33   560.13
> 107.61     1.89   62.75   6.26  18.71
>
> while on a node with this patch it's
> sda             128.20     0.97   11.10   57.15  3376.80   552.80
> 57.58    20.58  296.33   4.16  28.36
>
>
> Also interesting, is the fact that the average request size on the
> patched node is much smaller.
>
> Josef was telling me, that this could be related to the number of
> bitmaps we write out, but I've no idea how to trace this.
>
> I would be very happy if someone could give me a hint on what to do
> next, as this is one of the last remaining issues with our ceph
> cluster.

This is still bugging me and I just remembered something that might be
helpful. I also hope that this is not misleading...

Back in 2.6.38 we were running ceph without btrfs performance
degradation. I found a thread on the list where similar problems were
reported:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg10346.html

In that thread someone bisected the issue to

From 4e69b598f6cfb0940b75abf7e179d6020e94ad1e Mon Sep 17 00:00:00 2001
From: Josef Bacik 
Date: Mon, 21 Mar 2011 10:11:24 -0400
Subject: [PATCH] Btrfs: cleanup how we setup free space clusters

In this commit the bitmap handling was changed, so I just thought
that this may be related.

I'm still hoping that someone with a deeper understanding of btrfs
could take a look at this.

Thanks,
Christian


Re: [PATCH 02/20] Btrfs: initialize new bitmaps' list

2011-12-12 Thread Christian Brunner
2011/12/12 Alexandre Oliva :
> On Dec  7, 2011, Christian Brunner  wrote:
>
>> With this patch applied I get much higher write-io values than without
>> it. Some of the other patches help to reduce the effect, but it's
>> still significant.
>
>> iostat on an unpatched node is giving me:
>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> avgrq-sz avgqu-sz   await  svctm  %util
>> sda             105.90     0.37   15.42   14.48  2657.33   560.13
>> 107.61     1.89   62.75   6.26  18.71
>
>> while on a node with this patch it's
>> sda             128.20     0.97   11.10   57.15  3376.80   552.80
>> 57.58    20.58  296.33   4.16  28.36
>
>
>> Also interesting, is the fact that the average request size on the
>> patched node is much smaller.
>
> That's probably expected for writes, as bitmaps are expected to be more
> fragmented, even if used only for metadata (or are you on SSD?)
>

It's a traditional hardware RAID5 with spinning disks. - I would
accept this if the writes started right after the mount, but in
this case it takes a few hours until the writes increase. That's why
I'm almost certain that something is still wrong.

> Bitmaps are just a different in-memory (and on-disk-cache, if enabled)
> representation of free space, that can be far more compact: one bit per
> disk block, rather than an extent list entry.  They're interchangeable
> otherwise, it's just that searching bitmaps for a free block (bit) is
> somewhat more expensive than taking the next entry from a list, but you
> don't want to use up too much memory with long lists of
> e.g. single-block free extents.

Thanks for the explanation! I'll try to insert some debugging code
once my test server is ready.

Christian


Re: avoid redundant block group free-space checks

2011-12-12 Thread Christian Brunner
2011/12/12 Alexandre Oliva :
> It was pointed out to me that the test for enough free space in a block
> group was wrong in that it would skip a block group that had most of its
> free space reserved by a cluster.
>
> I offer two mutually exclusive, (so far) very lightly tested patches to
> address this problem.
>
> One moves the test to the middle of the clustered allocation logic,
> between the release of the cluster and the attempt to create a new
> cluster, with some ugliness due to more indentation, locking operations
> and testing.
>
> The other, that I like better but haven't given any significant amount
> of testing yet, only performs the test when we fall back to unclustered
> allocation, relying on btrfs_find_space_cluster to test for enough free
> space early (it does); it also arranges for the cluster in the current
> block group to be released before we try unclustered allocation.

I've chosen to try the second patch in our ceph environment. It seems
that btrfs_find_space_cluster() isn't called any longer.
find_free_extent() is much faster now.

(I think that the write-io numbers are still too high, though.)

Thanks,
Christian


WARNING: at fs/btrfs/extent-tree.c:5980

2011-12-13 Thread Christian Brunner
Hi,

with the latest btrfs for-linus I'm seeing occasional
btrfs_alloc_free_block warnings on several nodes in our ceph cluster.

Before the warning there is an additional "block rsv returned -28"
(-ENOSPC) message, but there is plenty of free space on the disk.


[201653.774412] btrfs: block rsv returned -28
[201653.774415] [ cut here ]
[201653.779846] WARNING: at fs/btrfs/extent-tree.c:5980
btrfs_alloc_free_block+0x347/0x360 [btrfs]()

The complete trace is here:

http://pastebin.com/0SFeZReg

Line 5980 of extent-tree.c is in use_block_rsv():

5974         if (ret) {
5975                 static DEFINE_RATELIMIT_STATE(_rs,
5976                                 DEFAULT_RATELIMIT_INTERVAL,
5977                                 /*DEFAULT_RATELIMIT_BURST*/ 2);
5978                 if (__ratelimit(&_rs)) {
5979                         printk(KERN_DEBUG "btrfs: block rsv returned %d\n", ret);
5980                         WARN_ON(1);
5981                 }
5982                 ret = reserve_metadata_bytes(root, block_rsv, blocksize, 0);

Thanks,
Christian


Re: WARNING: at fs/btrfs/extent-tree.c:5980

2011-12-13 Thread Christian Brunner
Sorry - I forgot to mention that I'm still seeing this with:

[PATCH] Btrfs: update global block_rsv when creating a new block group

Christian

2011/12/13 Christian Brunner :
> Hi,
>
> with the latest btrfs for-linus I'm seeing seeing occasional
> btrfs_alloc_free_block warnings on several nodes in our ceph cluster.
>
> Before the warning there is an additional block rsv -28 message, but
> there is plenty of free space on the disk.
>
>
> [201653.774412] btrfs: block rsv returned -28
> [201653.774415] [ cut here ]
> [201653.779846] WARNING: at fs/btrfs/extent-tree.c:5980
> btrfs_alloc_free_block+0x347/0x360 [btrfs]()
>
> The complte trace is here:
>
> http://pastebin.com/0SFeZReg
>
> The extent-tree.c:5980 is in use_block_rsv():
>
> 5974         if (ret) {
> 5975                 static DEFINE_RATELIMIT_STATE(_rs,
> 5976                                 DEFAULT_RATELIMIT_INTERVAL,
> 5977                                 /*DEFAULT_RATELIMIT_BURST*/ 2);
> 5978                 if (__ratelimit(&_rs)) {
> 5979                         printk(KERN_DEBUG "btrfs: block rsv
> returned %d\n", ret);
> 5980                         WARN_ON(1);
> 5981                 }
> 5982                 ret = reserve_metadata_bytes(root, block_rsv,
> blocksize, 0);
>
> Thanks,
> Christian


Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-07 Thread Christian Brunner
2012/1/5 Chris Mason :
> On Fri, Jan 06, 2012 at 07:12:16AM +1100, Dave Chinner wrote:
>> On Thu, Jan 05, 2012 at 02:45:00PM -0500, Chris Mason wrote:
>> > On Thu, Jan 05, 2012 at 01:46:57PM -0500, Chris Mason wrote:
>> > > On Thu, Jan 05, 2012 at 10:01:22AM +1100, Dave Chinner wrote:
>> > > > On Thu, Jan 05, 2012 at 09:23:52AM +1100, Chris Samuel wrote:
>> > > > > On 05/01/12 09:11, Dave Chinner wrote:
>> > > > >
>> > > > > > Looks to be reproducable.
>> > > > >
>> > > > > Does this happen with rc6 ?
>> > > >
>> > > > I haven't tried. All I'm doing is running some benchmarks to get
>> > > > numbers for a talk I'm giving about improvements in XFS metadata
>> > > > scalability, so I wanted to update my last set of numbers from
>> > > > 2.6.39.
>> > > >
>> > > > As it was, these benchmarks also failed on btrfs with oopsen and
>> > > > corruptions back in 2.6.39 time frame.  e.g. same VM, same
>> > > > test, different crashes, similar slowdowns as reported here:
>> > > > http://comments.gmane.org/gmane.comp.file-systems.btrfs/11062
>> > > >
>> > > > Given that there is now a history of this simple test uncovering
>> > > > problems, perhaps this is a test that should be run more regularly
>> > > > by btrfs developers?
>> > >
>> > > Unfortunately, this one works for me.  I'll try it again and see if I
>> > > can push harder.  If not, I'll see if I can trade beer for some
>> > > diagnostic runs.
>> >
>> > Aha, if I try it just on the ssd instead of on my full array it triggers
>> > at 88M files.  Great.
>>
>> Good to know.  The error that is generating the BUG on my machine is
>> -28 (ENOSPC).  Given there's 17TB free on my filesystem
>
> Yeah, same thing here.  I'm testing a fix now, it's pretty dumb.  We're
> not allocating more metadata chunks from the drive because of where the
> allocation is happening, so it is just a check for "do we need a new
> chunk" in the right place.
>
> I'll make sure it can fill my ssd and then send to you.

Could you send the patch to the list (or to me), please? Judging from
what you mentioned on IRC this sounds quite interesting, and I would
like to see if it solves my performance problems with ceph, too...

Thanks,
Christian


Re: [3.2-rc7] slowdown, warning + oops creating lots of files

2012-01-12 Thread Christian Brunner
2012/1/7 Christian Brunner :
> 2012/1/5 Chris Mason :
>> On Fri, Jan 06, 2012 at 07:12:16AM +1100, Dave Chinner wrote:
>>> On Thu, Jan 05, 2012 at 02:45:00PM -0500, Chris Mason wrote:
>>> > On Thu, Jan 05, 2012 at 01:46:57PM -0500, Chris Mason wrote:
>>> > >
>>> > > Unfortunately, this one works for me.  I'll try it again and see if I
>>> > > can push harder.  If not, I'll see if I can trade beer for some
>>> > > diagnostic runs.
>>> >
>>> > Aha, if I try it just on the ssd instead of on my full array it triggers
>>> > at 88M files.  Great.
>>>
>>> Good to know.  The error that is generating the BUG on my machine is
>>> -28 (ENOSPC).  Given there's 17TB free on my filesystem
>>
>> Yeah, same thing here.  I'm testing a fix now, it's pretty dumb.  We're
>> not allocating more metadata chunks from the drive because of where the
>> allocation is happening, so it is just a check for "do we need a new
>> chunk" in the right place.
>>
>> I'll make sure it can fill my ssd and then send to you.
>
> Could you send the patch to the list (or to me), please? Telling from
> what you mentioned on IRC this sounds quite interesting and I would
> like to see if this solves my performance problems with ceph, too...

I apologize for bothering you again, but I would really like to give it a spin.

Thanks,
Christian


Btrfs slowdown with ceph (how to reproduce)

2012-01-20 Thread Christian Brunner
As you might know, I have been seeing btrfs slowdowns in our ceph
cluster for quite some time. Even with the latest btrfs code for 3.3
I'm still seeing these problems. To make things reproducible, I've now
written a small test that imitates ceph's behavior:

On a freshly created btrfs filesystem (2 TB size, mounted with
"noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
100 files. After that I'm doing random writes on these files with a
sync_file_range after each write (each write has a size of 100 bytes)
and ioctl(BTRFS_IOC_SYNC) after every 100 writes.

After approximately 20 minutes, write activity suddenly increases
fourfold and the average request size decreases (see chart in the
attachment).

You can find IOstat output here: http://pastebin.com/Smbfg1aG
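
To watch for the change in write activity and request size while the
test runs, something simple like this is enough (sdX being the disk
that holds the test filesystem):

iostat -x -d sdX 30    # watch the w/s and avgrq-sz columns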

I hope that you are able to trace down the problem with the test
program in the attachment.

Thanks,
Christian
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>

#define FILE_COUNT 100
#define FILE_SIZE 4194304

#define STRING "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_SYNC _IO(BTRFS_IOCTL_MAGIC, 8)

int main(int argc, char *argv[]) {
char *imgname = argv[1]; 
char *tempname;
int fd[FILE_COUNT]; 
int ilen, i;

ilen = strlen(imgname);
tempname = malloc(ilen + 8);

for(i=0; i < FILE_COUNT; i++) {
	snprintf(tempname, ilen + 8, "%s.%i", imgname, i);
	fd[i] = open(tempname, O_CREAT|O_RDWR, 0600); /* O_CREAT needs a mode */
}
	
i=0;
while(1) {
int start = rand() % FILE_SIZE;
int file = rand() % FILE_COUNT;

putc('.', stderr);

lseek(fd[file], start, SEEK_SET);
write(fd[file], STRING, 100);
sync_file_range(fd[file], start, 100, 0x2); /* 0x2 == SYNC_FILE_RANGE_WRITE */

usleep(25000);

i++;
if (i == 100) {
i=0;
ioctl(fd[file], BTRFS_IOC_SYNC);
}
}
}
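
For completeness, the test can be built and pointed at a directory on
the filesystem under test roughly like this (the source file name and
mount point are made up here):

gcc -Wall -o writetest writetest.c
./writetest /mnt/btrfs-test/img    # creates img.0 .. img.99 and keeps writing to them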

Re: Btrfs slowdown with ceph (how to reproduce)

2012-01-23 Thread Christian Brunner
2012/1/23 Chris Mason :
> On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
>> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
>> > As you might know, I have been seeing btrfs slowdowns in our ceph
>> > cluster for quite some time. Even with the latest btrfs code for 3.3
>> > I'm still seeing these problems. To make things reproducible, I've now
>> > written a small test, that imitates ceph's behavior:
>> >
>> > On a freshly created btrfs filesystem (2 TB size, mounted with
>> > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
>> > 100 files. After that I'm doing random writes on these files with a
>> > sync_file_range after each write (each write has a size of 100 bytes)
>> > and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
>> >
>> > After approximately 20 minutes, write activity suddenly increases
>> > fourfold and the average request size decreases (see chart in the
>> > attachment).
>> >
>> > You can find IOstat output here: http://pastebin.com/Smbfg1aG
>> >
>> > I hope that you are able to trace down the problem with the test
>> > program in the attachment.
>>
>> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree 
>> and
>> formatted the fs with 64k node and leaf sizes and the problem appeared to go
>> away.  So surprise surprise fragmentation is biting us in the ass.  If you 
>> can
>> try running that branch with 64k node and leaf sizes with your ceph cluster 
>> and
>> see how that works out.  Course you should only do that if you dont mind if 
>> you
>> lose everything :).  Thanks,
>>
>
> Please keep in mind this branch is only out there for development, and
> it really might have huge flaws.  scrub doesn't work with it correctly
> right now, and the IO error recovery code is probably broken too.
>
> Long term though, I think the bigger block sizes are going to make a
> huge difference in these workloads.
>
> If you use the very dangerous code:
>
> mkfs.btrfs -l 64k -n 64k /dev/xxx
>
> (-l is leaf size, -n is node size).
>
> 64K is the max right now, 32K may help just as much at a lower CPU cost.

Thanks for taking a look. - I'm glad to hear that there is a solution
on the horizon, but I'm not brave enough to try this on our ceph
cluster. I'll try it when the code has stabilized a bit.

Regards,
Christian


Re: Strange prformance degradation when COW writes happen at fixed offsets

2012-02-27 Thread Christian Brunner
2012/2/24 Nik Markovic :
> To add... I also tried nodatasum (only) and nodatacow otions. I found
> somewhere that nodatacow doesn't really mean tthat COW is disabled.
> Test data is still the same - CPU spikes and times are the same.
>
> On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic  
> wrote:
>> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>>
 I noticed a few errors in the script that I used. I corrected it and it
 seems that degradation is occurring even at fully random writes:
>>>
>>> I don't have an ssd, but is it possible that you're simply seeing erase-
>>> block related degradation due to multi-write-block sized erase-blocks?
>>>
>>> It seems to me that when originally written to the btrfs-on-ssd, the file
>>> will likely be written block-sequentially enough that the file as a whole
>>> takes up relatively few erase-blocks.  As you COW-write individual
>>> blocks, they'll be written elsewhere, perhaps all the changed blocks to a
>>> new erase-block, perhaps each to a different erase block.
>>
>> This is a very interesting insight. I wasn't even aware of the
>> erase-block issue, so I did some reading up on it...
>>
>>>
>>> As you increase the successive COW generation count, the file's file-
>>> system/write blocks will be spread thru more and more erase-blocks,
>>> basically fragmentation but of the SSD-critical type, into more and more
>>> erase blocks, thus affecting modification and removal time but not read
>>> time.
>>
>> OK, so time to write would increase due to fragmentation and writing,
>> it now makes sense (though I don't see why small writes would affect
>> this, but my concerns are not writes anyway), but why would cp
>> --reflink time increase so much. Yes, new extents would be created,
>> but btrfs doesn't write into data blocks, does it? I figured its
>> metadata would be kept in one place. I figure the only thing BTRFS
>> would do on cp --reflink=always:
>> 1. Take a collection of extents owned by source.
>> 2. Make the new copy use the same collection of extents.
>> 3. Write the collection of extents to the "directory".
>>
>> Now this process seems to be CPU intensive. When I remove or make a
>> reflink copy, one core pikes up to 100%, which tells me that there's a
>> performance issue there, not an ssd issue. Also, only one CPU thread
>> is being used for this. I figured that I can improve this by some
>> setting. Maybe thread_pool mount option? Are there any updates in
>> later kernels that I should possibly pick up?
>>
>> [...]
>>
>> Unless I am wrong, this would disable COW completely and reflink copy.
>> Reflinks are a crucial component and the sole
>> reason I picked BTRFS for the system that I am writing for my company.
>> The autodefrag option addresses multiple writes. Writing is not the
>> problem, but cp --reflink should be near-instant. That was the reason
>> we chose BTRFS over ZFS, which seemed to be the only feasible
>> alternative. ZFS snapshot complicate the design and deduplication copy
>> time is the same as (or not much better than) raw copy.
>>
>> [...]
>>
>> As I mentioned above, the COW is the crucial component of our system,
>> XFS won't do. Our system does not do random writes. In fact it is
>> mainly heavy on read operation. The system does occasional "rotation
>> of rust" on large files in a way that version control system would
>> (large files are modified and then used as a new baseline)

The symptoms you are reporting are quite similar to what I'm seeing in
our Ceph cluster:

http://comments.gmane.org/gmane.comp.file-systems.btrfs/15413

AFAIK, Chris and Josef are working on it, but you'll have to wait for
kernel 3.4 until this is available in mainline. If you are
feeling adventurous, you could try the patches in Josef's git tree,
but I think they are still experimental.

Regards,
Christian


Ceph on btrfs 3.4rc

2012-04-20 Thread Christian Brunner
After running ceph on XFS for some time, I decided to try btrfs again.
Performance with the current "for-linux-min" branch and big metadata
is much better. The only problem (?) I'm still seeing is a warning
that seems to occur from time to time:

[87703.784552] [ cut here ]
[87703.789759] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
[87703.799070] Hardware name: ProLiant DL180 G6
[87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P   O
3.3.2-1.fits.1.el6.x86_64 #1
[87703.837513] Call Trace:
[87703.840280]  [] warn_slowpath_common+0x7f/0xc0
[87703.847016]  [] warn_slowpath_null+0x1a/0x20
[87703.853533]  [] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
[87703.861541]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
[87703.868674]  []
btrfs_commit_transaction+0x5db/0xa50 [btrfs]
[87703.876745]  [] ? __switch_to+0x153/0x440
[87703.882966]  [] ? wake_up_bit+0x40/0x40
[87703.888997]  [] ?
btrfs_commit_transaction+0xa50/0xa50 [btrfs]
[87703.897271]  [] do_async_commit+0x1f/0x30 [btrfs]
[87703.904262]  [] process_one_work+0x129/0x450
[87703.910777]  [] worker_thread+0x17b/0x3c0
[87703.916991]  [] ? manage_workers+0x220/0x220
[87703.923504]  [] kthread+0x9e/0xb0
[87703.928952]  [] kernel_thread_helper+0x4/0x10
[87703.93]  [] ? kthread_freezable_should_stop+0x70/0x70
[87703.943323]  [] ? gs_change+0x13/0x13
[87703.949149] ---[ end trace b8c31966cca731fa ]---
[91128.812399] [ cut here ]
[91128.817576] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
[91128.826930] Hardware name: ProLiant DL180 G6
[91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: PW  O
3.3.2-1.fits.1.el6.x86_64 #1
[91128.865912] Call Trace:
[91128.868670]  [] warn_slowpath_common+0x7f/0xc0
[91128.875379]  [] warn_slowpath_null+0x1a/0x20
[91128.881900]  [] btrfs_orphan_commit_root+0xf6/0x100 [btrfs]
[91128.889894]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
[91128.897019]  [] ?
btrfs_run_delayed_items+0xf1/0x160 [btrfs]
[91128.905075]  []
btrfs_commit_transaction+0x5db/0xa50 [btrfs]
[91128.913156]  [] ? start_transaction+0x92/0x310 [btrfs]
[91128.920643]  [] ? wake_up_bit+0x40/0x40
[91128.926667]  [] transaction_kthread+0x26b/0x2e0 [btrfs]
[91128.934254]  [] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[91128.943671]  [] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[91128.953079]  [] kthread+0x9e/0xb0
[91128.958532]  [] kernel_thread_helper+0x4/0x10
[91128.965133]  [] ? kthread_freezable_should_stop+0x70/0x70
[91128.972913]  [] ? gs_change+0x13/0x13
[91128.978826] ---[ end trace b8c31966cca731fb ]---

I'm able to reproduce this with ceph on a single server with 4 disks
(4 filesystems/osds) and a small test program based on librbd. It is
simply writing random bytes to an rbd volume (see attachment).

Is this something I should care about? Any hints on solving this
would be appreciated.

Thanks,
Christian
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

int nr_writes=0;

void
alarm_handler(int sig) {
fprintf(stderr, "Writes/sec: %i\n", nr_writes/10);
	nr_writes = 0;
	alarm(10);
}


int main(int argc, char *argv[]) {
char *clientname;
rados_t cluster;
rados_ioctx_t io_ctx;
rbd_image_t image;
char *pool = "rbd";
char *imgname = argv[1];
	
if (rados_create(&cluster, NULL) < 0) {
fprintf(stderr, "error initializing");
return 1;
}

rados_conf_read_file(cluster, NULL);
	
if (rados_connect(cluster) < 0) {
fprintf(stderr, "error connecting");
rados_shutdown(cluster);
return 1;
}

if (rados_ioctx_create(cluster, pool, &io_ctx) < 0) {
fprintf(stderr, "error opening pool %s", pool);
rados_shutdown(cluster);
return 1;
}

int r = rbd_open(io_ctx, imgname, &image, NULL);
if (r < 0) {
fprintf(stderr, "error reading header from %s", imgname);
rados_ioctx_destroy(io_ctx);
rados_shutdown(cluster);
return 1;
}

alarm(10);
(void) signal(SIGALRM, alarm_handler);

while(1) {
#define OFFSET_RANGE 10485760 /* was "#define RAND_MAX 10485760", which has
                                 no effect on what rand() returns; bound the
                                 write offset explicitly instead */
   int start = rand() % OFFSET_RANGE;
   rbd_write(image, start, 1, "a");
   nr_writes++;
}

rados_ioctx_destroy(io_ctx);
rados_shutdown(cluster);
}


Re: Ceph on btrfs 3.4rc

2012-04-23 Thread Christian Brunner
I decided to run the test over the weekend. The good news is that the
system is still running without performance degradation. But in the
meantime I've got over 5000 WARNINGs of this kind:

[330700.043557] btrfs: block rsv returned -28
[330700.043559] [ cut here ]
[330700.048898] WARNING: at fs/btrfs/extent-tree.c:6220
btrfs_alloc_free_block+0x357/0x370 [btrfs]()
[330700.058880] Hardware name: ProLiant DL180 G6
[330700.064044] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
[330700.090361] Pid: 7954, comm: btrfs-endio-wri Tainted: PW
O 3.3.2-1.fits.1.el6.x86_64 #1
[330700.100393] Call Trace:
[330700.103263]  [] warn_slowpath_common+0x7f/0xc0
[330700.110201]  [] warn_slowpath_null+0x1a/0x20
[330700.116905]  [] btrfs_alloc_free_block+0x357/0x370 [btrfs]
[330700.124988]  [] ? __btrfs_cow_block+0x330/0x530 [btrfs]
[330700.132787]  [] ?
btrfs_add_delayed_data_ref+0x64/0x1c0 [btrfs]
[330700.141369]  [] ? read_extent_buffer+0xbb/0x120 [btrfs]
[330700.149194]  [] ?
btrfs_token_item_offset+0x5d/0xe0 [btrfs]
[330700.157373]  [] __btrfs_cow_block+0x133/0x530 [btrfs]
[330700.165023]  [] ?
read_block_for_search+0x14d/0x3d0 [btrfs]
[330700.173183]  [] btrfs_cow_block+0xf4/0x1f0 [btrfs]
[330700.180552]  [] btrfs_search_slot+0x3e8/0x8e0 [btrfs]
[330700.188128]  [] btrfs_lookup_csum+0x74/0x170 [btrfs]
[330700.195634]  [] ? kmem_cache_alloc+0x105/0x130
[330700.202551]  [] btrfs_csum_file_blocks+0xd0/0x6d0 [btrfs]
[330700.210542]  [] ? clear_extent_bit+0x161/0x420 [btrfs]
[330700.218237]  [] add_pending_csums+0x49/0x70 [btrfs]
[330700.225706]  []
btrfs_finish_ordered_io+0x276/0x3d0 [btrfs]
[330700.233940]  []
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[330700.242345]  [] end_extent_writepage+0x69/0x100 [btrfs]
[330700.250192]  [] end_bio_extent_writepage+0x66/0xa0 [btrfs]
[330700.258327]  [] bio_endio+0x1d/0x40
[330700.264214]  [] end_workqueue_fn+0x45/0x50 [btrfs]
[330700.271612]  [] worker_loop+0x14f/0x5a0 [btrfs]
[330700.278672]  [] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[330700.286582]  [] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[330700.294535]  [] kthread+0x9e/0xb0
[330700.300244]  [] kernel_thread_helper+0x4/0x10
[330700.307031]  [] ? kthread_freezable_should_stop+0x70/0x70
[330700.315061]  [] ? gs_change+0x13/0x13
[330700.321167] ---[ end trace b8c31966cca74ca0 ]---

The filesystems have plenty of free space:

/dev/sda  1.9T   16G  1.8T   1% /ceph/osd.000
/dev/sdb  1.9T   15G  1.8T   1% /ceph/osd.001
/dev/sdc  1.9T   13G  1.8T   1% /ceph/osd.002
/dev/sdd  1.9T   14G  1.8T   1% /ceph/osd.003

# btrfs fi df /ceph/osd.000
Data: total=38.01GB, used=15.53GB
System, DUP: total=8.00MB, used=64.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=37.50GB, used=82.19MB
Metadata: total=8.00MB, used=0.00

A few more btrfs_orphan_commit_root WARNINGS are present too. If
needed I could upload the messages file.

Regards,
Christian

On 20 April 2012 17:09, Christian Brunner wrote:
> After running ceph on XFS for some time, I decided to try btrfs again.
> Performance with the current "for-linux-min" branch and big metadata
> is much better. The only problem (?) I'm still seeing is a warning
> that seems to occur from time to time:
>
> [87703.784552] [ cut here ]
> [87703.789759] WARNING: at fs/btrfs/inode.c:2103
> btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> [87703.799070] Hardware name: ProLiant DL180 G6
> [87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> [87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P           O
> 3.3.2-1.fits.1.el6.x86_64 #1
> [87703.837513] Call Trace:
> [87703.840280]  [] warn_slowpath_common+0x7f/0xc0
> [87703.847016]  [] warn_slowpath_null+0x1a/0x20
> [87703.853533]  [] btrfs_orphan_commit_root+0xf6/0x100 
> [btrfs]
> [87703.861541]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
> [87703.868674]  []
> btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> [87703.876745]  [] ? __switch_to+0x153/0x440
> [87703.882966]  [] ? wake_up_bit+0x40/0x40
> [87703.888997]  [] ?
> btrfs_commit_transaction+0xa50/0xa50 [btrfs]
> [87703.897271]  [] do_async_commit+0x1f/0x30 [btrfs]
> [87703.904262]  [] process_one_work+0x129/0x450
> [87703.910777]  [] worker_thread+0x17b/0x3c0
> [87703.916991]  [] ? manage_workers+0x220/0x220
> [87703.923504]  [] kthread+0x9e/0xb0
> [87703.928952]  [] kernel_thread_helper+0x4/0x10
> [87703.93]  [] ? kthread_freezable_should_stop+0x70/0x70
> [87703.94

Re: Ceph on btrfs 3.4rc

2012-04-27 Thread Christian Brunner
On 24 April 2012 18:26, Sage Weil wrote:
> On Tue, 24 Apr 2012, Josef Bacik wrote:
>> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>> > After running ceph on XFS for some time, I decided to try btrfs again.
>> > Performance with the current "for-linux-min" branch and big metadata
>> > is much better. The only problem (?) I'm still seeing is a warning
>> > that seems to occur from time to time:
>
> Actually, before you do that... we have a new tool,
> test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> local file system.  It's a subset of what a full OSD might do, but if
> we're lucky it will be sufficient to reproduce this issue.  Something like
>
>  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>
> will hopefully do the trick.
>
> Christian, maybe you can see if that is able to trigger this warning?
> You'll need to pull it from the current master branch; it wasn't in the
> last release.

Trying to reproduce with test_filestore_workloadgen didn't work for
me. So here are some instructions on how to reproduce with a minimal
ceph setup.

You will need a single system with two disks and a bit of memory.

- Compile and install ceph (detailed instructions:
http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)

- For the test setup I've used two tmpfs files as journal devices. To
create these, do the following:

# mkdir -p /ceph/temp
# mount -t tmpfs tmpfs /ceph/temp
# dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
# dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k

- Now you should create and mount btrfs. Here is what I did:

# mkfs.btrfs -l 64k -n 64k /dev/sda
# mkfs.btrfs -l 64k -n 64k /dev/sdb
# mkdir /ceph/osd.000
# mkdir /ceph/osd.001
# mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
# mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001

- Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
will probably have to change the btrfs devices and the hostname
(os39).

- Create the ceph filesystems:

# mkdir /ceph/mon
# mkcephfs -a -c /etc/ceph/ceph.conf

- Start ceph (e.g. "service ceph start")

- Now you should be able to use ceph - "ceph -s" will tell you about
the state of the ceph cluster.

- "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.

- Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
with "./rbdtest testimg".

I can see the first btrfs_orphan_commit_root warning after an hour or
so... I hope that I've described all necessary steps. If there is a
problem just send me a note.

Thanks,
Christian


[Attachment: ceph.conf]


Re: Ceph on btrfs 3.4rc

2012-04-30 Thread Christian Brunner
2012/4/29 tsuna :
> On Fri, Apr 20, 2012 at 8:09 AM, Christian Brunner
>  wrote:
>> After running ceph on XFS for some time, I decided to try btrfs again.
>> Performance with the current "for-linux-min" branch and big metadata
>> is much better.
>
> I've heard that although performance from btrfs is better at first, it
> degrades over time due to metadata fragmentation, whereas XFS'
> performance starts off a little worse, but remains stable even after
> weeks of heavy utilization.  Would be curious to hear your (or
> others') feedback on that topic.

Metadata fragmentation was a big problem (for us) in the past. With
the "big metatdata feature" (mkfs.btrfs -l 64k -n 64k) these problems
seem to be solved. We do not use it in production yet, but my stress
test didn't show any degradation. The only remaining issues I've seen
are these warnings.

Regards,
Christian


Re: Ceph on btrfs 3.4rc

2012-05-04 Thread Christian Brunner
2012/5/3 Josef Bacik :
> On Thu, May 03, 2012 at 09:38:27AM -0700, Josh Durgin wrote:
>> On Thu, 3 May 2012 11:20:53 -0400, Josef Bacik 
>> wrote:
>> > On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
>> >
>> > Yeah all that was in the right place, I rebooted and I magically
>> > stopped getting
>> > that error, but now I'm getting this
>> >
>> > http://fpaste.org/OE92/
>> >
>> > with that ping thing repeating over and over.  Thanks,
>>
>> That just looks like the osd isn't running. If you restart the
>> osd with 'debug osd = 20' the osd log should tell us what's going on.
>
> Ok that part was my fault, Duh I need to redo the tmpfs and mkcephfs stuff 
> after
> reboot.  But now I'm back to my original problem
>
> http://fpaste.org/PfwO/
>
> I have the osd class dir = /usr/lib64/rados-classes thing set and libcls_rbd 
> is
> in there, so I'm not sure what is wrong.  Thanks,

That's really strange. Do you have the osd logs in /var/log/ceph? If
so, can you look if you find anything about "rbd" or "class" loading
in there?

Another thing you should try is whether you can access ceph with rados:

# rados -p rbd ls
# rados -p rbd -i /proc/cpuinfo put testobj
# rados -p rbd -o - get testobj

Regards,
Christian


Re: Ceph on btrfs 3.4rc

2012-05-11 Thread Christian Brunner
2012/5/10 Josef Bacik :
> On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
>> On 24 April 2012 18:26, Sage Weil wrote:
>> > On Tue, 24 Apr 2012, Josef Bacik wrote:
>> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>> >> > After running ceph on XFS for some time, I decided to try btrfs again.
>> >> > Performance with the current "for-linux-min" branch and big metadata
>> >> > is much better. The only problem (?) I'm still seeing is a warning
>> >> > that seems to occur from time to time:
>> >
>> > Actually, before you do that... we have a new tool,
>> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
>> > local file system.  It's a subset of what a full OSD might do, but if
>> > we're lucky it will be sufficient to reproduce this issue.  Something like
>> >
>> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>> >
>> > will hopefully do the trick.
>> >
>> > Christian, maybe you can see if that is able to trigger this warning?
>> > You'll need to pull it from the current master branch; it wasn't in the
>> > last release.
>>
>> Trying to reproduce with test_filestore_workloadgen didn't work for
>> me. So here are some instructions on how to reproduce with a minimal
>> ceph setup.
>> [...]
>
> Well I feel like an idiot, I finally get it to reproduce, go look at where I
> want to put my printks and theres the problem staring me right in the face.
> I've looked seriously at this problem 2 or 3 times and have missed this every
> single freaking time.  Here is the patch I'm trying, please try it on yours to
> make sure it fixes the problem.  It takes like 2 hours for it to reproduce for
> me so I won't be able to fully test it until tomorrow, but so far it hasn't
> broken anything so it should be good.  Thanks,

Great! I've put your patch on my testbox and will run a test over the
weekend. I'll report back on Monday.

Thanks,
Christian


Re: Ceph on btrfs 3.4rc

2012-05-17 Thread Christian Brunner
2012/5/17 Josef Bacik :
> On Thu, May 17, 2012 at 05:12:55PM +0200, Martin Mailand wrote:
>> Hi Josef,
>> no there was nothing above. Here the is another dmesg output.
>>
>
> Hrm ok give this a try and hopefully this is it, still couldn't reproduce.
> Thanks,
>
> Josef

Well, I hate to say it, but the new patch doesn't seem to change much...

Regards,
Christian

[  123.507444] Btrfs loaded
[  202.683630] device fsid 2aa7531c-0e3c-4955-8542-6aed7ab8c1a2 devid
1 transid 4 /dev/sda
[  202.693704] btrfs: use lzo compression
[  202.697999] btrfs: enabling inode map caching
[  202.702989] btrfs: enabling auto defrag
[  202.707190] btrfs: disk space caching is enabled
[  202.712721] btrfs flagging fs with big metadata feature
[  207.839761] device fsid f81ff6a1-c333-4daf-989f-a28139f15f08 devid
1 transid 4 /dev/sdb
[  207.849681] btrfs: use lzo compression
[  207.853987] btrfs: enabling inode map caching
[  207.858970] btrfs: enabling auto defrag
[  207.863173] btrfs: disk space caching is enabled
[  207.868635] btrfs flagging fs with big metadata feature
[  210.857328] device fsid 9b905faa-f4fa-4626-9cae-2cd0287b30f7 devid
1 transid 4 /dev/sdc
[  210.867265] btrfs: use lzo compression
[  210.871560] btrfs: enabling inode map caching
[  210.876550] btrfs: enabling auto defrag
[  210.880757] btrfs: disk space caching is enabled
[  210.886228] btrfs flagging fs with big metadata feature
[  214.296287] device fsid f7990e4c-90b0-4691-9502-92b60538574a devid
1 transid 4 /dev/sdd
[  214.306510] btrfs: use lzo compression
[  214.310855] btrfs: enabling inode map caching
[  214.315905] btrfs: enabling auto defrag
[  214.320174] btrfs: disk space caching is enabled
[  214.325706] btrfs flagging fs with big metadata feature
[ 1337.937379] [ cut here ]
[ 1337.942526] kernel BUG at fs/btrfs/inode.c:2224!
[ 1337.947671] invalid opcode:  [#1] SMP
[ 1337.952255] CPU 5
[ 1337.954300] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg pcspkr serio_raw iTCO_wdt
iTCO_vendor_support iomemory_vsl(PO) ixgbe dca mdio i7core_edac
edac_core hpsa squashfs [last unloaded: scsi_wait_scan]
[ 1337.978570]
[ 1337.980230] Pid: 6812, comm: ceph-osd Tainted: P   O
3.3.5-1.fits.1.el6.x86_64 #1 HP ProLiant DL180 G6
[ 1337.991592] RIP: 0010:[]  []
btrfs_orphan_del+0x14c/0x150 [btrfs]
[ 1338.001897] RSP: 0018:8805e1171d38  EFLAGS: 00010282
[ 1338.007815] RAX: fffe RBX: 88061c3c8400 RCX: 00b37f48
[ 1338.015768] RDX: 00b37f47 RSI: 8805ec2a1cf0 RDI: ea0017b0a840
[ 1338.023724] RBP: 8805e1171d68 R08: 60f9d88028a0 R09: a033016a
[ 1338.031675] R10:  R11: 0004 R12: 8805de7f57a0
[ 1338.039629] R13: 0001 R14: 0001 R15: 8805ec2a5280
[ 1338.047584] FS:  7f4bffc6e700() GS:8806272a()
knlGS:
[ 1338.056600] CS:  0010 DS:  ES:  CR0: 80050033
[ 1338.063003] CR2: ff600400 CR3: 0005e34c3000 CR4: 06e0
[ 1338.070954] DR0:  DR1:  DR2: 
[ 1338.078909] DR3:  DR6: 0ff0 DR7: 0400
[ 1338.086865] Process ceph-osd (pid: 6812, threadinfo
8805e117, task 88060fa81940)
[ 1338.096268] Stack:
[ 1338.098509]  8805e1171d68 8805ec2a5280 88051235b920

[ 1338.106795]  88051235b920 0008 8805e1171e08
a036043c
[ 1338.115082]    
00011000
[ 1338.123367] Call Trace:
[ 1338.126111]  [] btrfs_truncate+0x5bc/0x640 [btrfs]
[ 1338.133213]  [] btrfs_setattr+0xf6/0x1a0 [btrfs]
[ 1338.140105]  [] notify_change+0x18b/0x2b0
[ 1338.146320]  [] ? selinux_inode_permission+0xd1/0x130
[ 1338.153699]  [] do_truncate+0x64/0xa0
[ 1338.159527]  [] ? inode_permission+0x49/0x100
[ 1338.166128]  [] sys_truncate+0x137/0x150
[ 1338.172244]  [] system_call_fastpath+0x16/0x1b
[ 1338.178936] Code: 89 e7 e8 88 7d fe ff eb 89 66 0f 1f 44 00 00 be
a4 08 00 00 48 c7 c7 59 49 3b a0 45 31 ed e8 5c 78 cf e0 45 31 f6 e9
30 ff ff ff <0f> 0b eb fe 55 48 89 e5 48 83 ec 40 48 89 5d d8 4c 89 65
e0 4c
[ 1338.200623] RIP  [] btrfs_orphan_del+0x14c/0x150 [btrfs]
[ 1338.208317]  RSP 
[ 1338.212681] ---[ end trace 86be14f0f863ea79 ]---


Re: Ceph on btrfs 3.4rc

2012-05-22 Thread Christian Brunner
2012/5/21 Miao Xie :
> Hi Josef,
>
> On fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index 9b9b15f..492c74f 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -57,9 +57,6 @@ struct btrfs_inode {
>>       /* used to order data wrt metadata */
>>       struct btrfs_ordered_inode_tree ordered_tree;
>>
>> -     /* for keeping track of orphaned inodes */
>> -     struct list_head i_orphan;
>> -
>>       /* list of all the delalloc inodes in the FS.  There are times we need
>>        * to write all the delalloc pages to disk, and this list is used
>>        * to walk them all.
>> @@ -156,6 +153,8 @@ struct btrfs_inode {
>>       unsigned dummy_inode:1;
>>       unsigned in_defrag:1;
>>       unsigned delalloc_meta_reserved:1;
>> +     unsigned has_orphan_item:1;
>> +     unsigned doing_truncate:1;
>
> I think the problem is we should not use the different lock to protect the 
> bit fields which
> are stored in the same machine word. Or some bit fields may be covered by the 
> others when
> someone change those fields. Could you try to declare 
> ->delalloc_meta_reserved and ->has_orphan_item
> as a integer?

I have tried changing it to:

struct btrfs_inode {
unsigned orphan_meta_reserved:1;
unsigned dummy_inode:1;
unsigned in_defrag:1;
-   unsigned delalloc_meta_reserved:1;
+   int delalloc_meta_reserved;
+   int has_orphan_item;
+   int doing_truncate;
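
(As an aside, the hazard Miao describes - bit fields packed into the same
machine word, so that updating one field is a read-modify-write of the
whole word - can be reproduced with a small user-space program. The sketch
below is only an illustration of that effect, not kernel code; separate
locks per field would not help either, because neither lock serializes the
other field's writers.)

/* Illustration of the shared-word bit field hazard (user-space sketch,
 * not kernel code). Build with: gcc -O2 -pthread bitfield-race.c */
#include <pthread.h>
#include <stdio.h>

#define ITERS 50000

/* two bit fields sharing one machine word, analogous to the flags in
 * struct btrfs_inode */
static volatile struct {
    unsigned a : 16;
    unsigned b : 16;
} f;

static void *bump_a(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < ITERS; i++)
        f.a++;          /* load whole word, modify, store whole word */
    return NULL;
}

static void *bump_b(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < ITERS; i++)
        f.b++;          /* touches the same word as f.a */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* concurrent read-modify-write cycles clobber each other, so a and b
     * usually end up below ITERS */
    printf("a=%u b=%u (expected %d each)\n",
           (unsigned)f.a, (unsigned)f.b, ITERS);
    return 0;
}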

The strange thing is that I'm no longer hitting the BUG_ON, but the
old WARNING (no additional messages):

[351021.157124] [ cut here ]
[351021.162400] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]()
[351021.171812] Hardware name: ProLiant DL180 G6
[351021.176867] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[351021.200236] Pid: 9837, comm: btrfs-transacti Tainted: PW
O 3.3.5-1.fits.1.el6.x86_64 #1
[351021.210126] Call Trace:
[351021.212957]  [] warn_slowpath_common+0x7f/0xc0
[351021.219758]  [] warn_slowpath_null+0x1a/0x20
[351021.226385]  []
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]
[351021.234461]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
[351021.241669]  [] ?
btrfs_run_delayed_items+0xf1/0x160 [btrfs]
[351021.249841]  []
btrfs_commit_transaction+0x584/0xa50 [btrfs]
[351021.258006]  [] ? start_transaction+0x92/0x310 [btrfs]
[351021.265580]  [] ? wake_up_bit+0x40/0x40
[351021.271719]  [] transaction_kthread+0x26b/0x2e0 [btrfs]
[351021.279405]  [] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[351021.288934]  [] ?
btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
[351021.298449]  [] kthread+0x9e/0xb0
[351021.303989]  [] kernel_thread_helper+0x4/0x10
[351021.310691]  [] ? kthread_freezable_should_stop+0x70/0x70
[351021.318555]  [] ? gs_change+0x13/0x13
[351021.324479] ---[ end trace 9adc7b36a3e66833 ]---
[351710.339482] [ cut here ]
[351710.344754] WARNING: at fs/btrfs/inode.c:2103
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]()
[351710.354165] Hardware name: ProLiant DL180 G6
[351710.359222] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[351710.382569] Pid: 9797, comm: kworker/5:0 Tainted: PW  O
3.3.5-1.fits.1.el6.x86_64 #1
[351710.392075] Call Trace:
[351710.394901]  [] warn_slowpath_common+0x7f/0xc0
[351710.401750]  [] warn_slowpath_null+0x1a/0x20
[351710.408414]  []
btrfs_orphan_commit_root+0xf7/0x100 [btrfs]
[351710.416528]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
[351710.423775]  []
btrfs_commit_transaction+0x584/0xa50 [btrfs]
[351710.431983]  [] ? __switch_to+0x153/0x440
[351710.438352]  [] ? wake_up_bit+0x40/0x40
[351710.444529]  [] ?
btrfs_commit_transaction+0xa50/0xa50 [btrfs]
[351710.452894]  [] do_async_commit+0x1f/0x30 [btrfs]
[351710.459979]  [] process_one_work+0x129/0x450
[351710.466576]  [] worker_thread+0x17b/0x3c0
[351710.472884]  [] ? manage_workers+0x220/0x220
[351710.479472]  [] kthread+0x9e/0xb0
[351710.485029]  [] kernel_thread_helper+0x4/0x10
[351710.491731]  [] ? kthread_freezable_should_stop+0x70/0x70
[351710.499640]  [] ? gs_change+0x13/0x13
[351710.505590] ---[ end trace 9adc7b36a3e66834 ]---


Regards,
Christian


Re: Ceph on btrfs 3.4rc

2012-05-23 Thread Christian Brunner
2012/5/22 Josef Bacik :
>>
>
> Yeah you would also need to change orphan_meta_reserved.  I fixed this by just
> taking the BTRFS_I(inode)->lock when messing with these since we don't want to
> take up all that space in the inode just for a marker.  I ran this patch for 3
> hours with no issues, let me know if it works for you.  Thanks,

Compared to the last runs, I had to run it much longer, but somehow I
managed to hit a BUG_ON again:

[448281.002087] couldn't find orphan item for 2027, nlink 1, root 308,
root being deleted no
[448281.011339] [ cut here ]
[448281.016590] kernel BUG at fs/btrfs/inode.c:2230!
[448281.021837] invalid opcode:  [#1] SMP
[448281.026525] CPU 4
[448281.028670] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(PO) hpsa squashfs [last unloaded: btrfs]
[448281.052215]
[448281.053977] Pid: 16018, comm: ceph-osd Tainted: PW  O
3.3.5-1.fits.1.el6.x86_64 #1 HP ProLiant DL180 G6
[448281.06] RIP: 0010:[]  []
btrfs_orphan_del+0x19b/0x1b0 [btrfs]
[448281.075965] RSP: 0018:880458257d18  EFLAGS: 00010292
[448281.081987] RAX: 0063 RBX: 8803a28ebc48 RCX:
2fdb
[448281.090042] RDX:  RSI: 0046 RDI:
0246
[448281.098093] RBP: 880458257d58 R08: 81af6100 R09:

[448281.106146] R10: 0004 R11:  R12:
0001
[448281.114202] R13: 88052e130400 R14: 0001 R15:
8805beae9e10
[448281.122262] FS:  7fa2e772f700() GS:88062728()
knlGS:
[448281.131386] CS:  0010 DS:  ES:  CR0: 80050033
[448281.137879] CR2: ff600400 CR3: 0005015a5000 CR4:
06e0
[448281.145929] DR0:  DR1:  DR2:

[448281.153974] DR3:  DR6: 0ff0 DR7:
0400
[448281.162043] Process ceph-osd (pid: 16018, threadinfo
880458256000, task 88055b711940)
[448281.171646] Stack:
[448281.173987]  880458257dff 8803a28eba98 880458257d58
8805beae9e10
[448281.182377]   88052e130400 88029ff33380
8803a28ebc48
[448281.190766]  880458257e08 a04ab4e6 
8803a28ebc48
[448281.199155] Call Trace:
[448281.202005]  [] btrfs_truncate+0x5f6/0x660 [btrfs]
[448281.209203]  [] btrfs_setattr+0xf6/0x1a0 [btrfs]
[448281.216202]  [] notify_change+0x18b/0x2b0
[448281.222517]  [] ? selinux_inode_permission+0xd1/0x130
[448281.229990]  [] do_truncate+0x64/0xa0
[448281.235919]  [] ? inode_permission+0x49/0x100
[448281.242617]  [] sys_truncate+0x137/0x150
[448281.248838]  [] system_call_fastpath+0x16/0x1b
[448281.255631] Code: a0 49 8b 8d f0 02 00 00 8b 53 48 4c 0f 45 c0 48
85 f6 74 1b 80 bb 60 fe ff ff 84 74 12 48 c7 c7 e8 1d 50 a0 31 c0 e8
9d ea 0d e1 <0f> 0b eb fe 48 8b 73 40 eb e8 66 66 2e 0f 1f 84 00 00 00
00 00
[448281.277435] RIP  [] btrfs_orphan_del+0x19b/0x1b0 [btrfs]
[448281.285229]  RSP 
[448281.289667] ---[ end trace 9adc7b36a3e66872 ]---

Sorry,
Christian


Re: Ceph on btrfs 3.4rc

2012-05-24 Thread Christian Brunner
Same thing here.

I've tried really hard, but even after 12 hours I wasn't able to get a
single warning from btrfs.

I think you cracked it!

Thanks,
Christian

2012/5/24 Martin Mailand :
> Hi,
> the ceph cluster is running under heavy load for the last 13 hours without a
> problem, dmesg is empty and the performance is good.
>
> -martin
>
> On 23.05.2012 21:12, Martin Mailand wrote:
>
>> this patch is running for 3 hours without a Bug and without the Warning.
>> I will let it run overnight and report tomorrow.
>> It looks very good ;-)


Re: btrfs BUG during Ceph cosd open() syscall

2011-01-27 Thread Christian Brunner
The btrfs_orphan_commit_root warning is also reproducible in our ceph
environment.

Regards
Christian

2011/1/26 Matt Weil :
> heavy writes as well
>
> Jan  5 16:56:46 linuscs101 kernel: [ 3666.496742] [ cut here
> ]
>>
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496754] WARNING: at
>> fs/btrfs/inode.c:2143 btrfs_orphan_commit_root+0xb0/0xc0()
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496756] Hardware name: ProLiant
>> DL380 G5
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496758] Modules linked in: nfsd
>> exportfs nfs lockd nfs_acl auth_rpcgss bonding sunrpc radeon ttm
>> drm_kms_helper drm bnx2 psmouse i5000_edac usbhid lp shpchp ipmi_si
>> i2c_algo_bit hid edac_core parport ipmi_msghandler serio_raw i5k_amb hpilo
>> cciss fbcon tileblit font bitblit softcursor
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496788] Pid: 2764, comm: cosd
>> Not tainted 2.6.37-ceph-client #1
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496790] Call Trace:
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496797]  []
>> warn_slowpath_common+0x7f/0xc0
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496800]  []
>> warn_slowpath_null+0x1a/0x20
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496804]  []
>> btrfs_orphan_commit_root+0xb0/0xc0
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496807]  []
>> commit_fs_roots+0xa1/0x140
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496810]  []
>> btrfs_commit_transaction+0x350/0x730
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496816]  [] ?
>> autoremove_wake_function+0x0/0x40
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496820]  []
>> btrfs_mksubvol+0x363/0x380
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496823]  []
>> btrfs_ioctl_snap_create_transid+0xed/0x140
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496826]  []
>> btrfs_ioctl_snap_create+0xf7/0x140
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496830]  []
>> btrfs_ioctl+0x61f/0xa20
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496834]  [] ?
>> fsnotify+0x1ea/0x320
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496839]  []
>> do_vfs_ioctl+0xa9/0x5a0
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496842]  []
>> sys_ioctl+0x81/0xa0
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496847]  []
>> system_call_fastpath+0x16/0x1b
>>  Jan  5 16:56:46 linuscs101 kernel: [ 3666.496850] ---[ end trace
>> 2a6c3f752cfb5f1b ]---
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.723170] CPU 1
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.723210] Modules linked in: nfsd
>> exportfs nfs lockd nfs_acl auth_rpcgss bonding sunrpc radeon ttm
>> drm_kms_helper drm bnx2 psmouse i5000_edac usbhid lp shpchp ipmi_si
>> i2c_algo_bit hid edac_core parport ipmi_msghandler serio_raw i5k_amb hpilo
>> cciss fbcon tileblit font bitblit softcursor
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724006]
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724041] Pid: 2766, comm: cosd
>> Tainted: G        W   2.6.37-ceph-client #1 /ProLiant DL380 G5
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724169] RIP:
>> 0010:[]  [] btrfs_truncate+0x510/0x530
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724318] RSP:
>> 0018:8803d7e1bd48  EFLAGS: 00010286
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724397] RAX: ffe4
>> RBX: 8803dfaf1800 RCX: 880406ce7090
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724493] RDX: 
>> RSI: ea000e17d288 RDI: 0206
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724592] RBP: 8803d7e1bdd8
>> R08: 0783 R09: 8803d7e1bb28
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724691] R10: ffe4
>> R11: 0001 R12: 8803dee49f00
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724793] R13: 8803d5369c10
>> R14: 8803d5369a78 R15: 8803d5369d38
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.724899] FS:
>>  7f77acfb6710() GS:8800cfc4() knlGS:
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.725019] CS:  0010 DS:  ES:
>>  CR0: 80050033
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.725096] CR2: 7f81cd5b8000
>> CR3: 0003dfad3000 CR4: 06e0
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.725195] DR0: 
>> DR1:  DR2: 
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.725293] DR3: 
>> DR6: 0ff0 DR7: 0400
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.725392] Process cosd (pid:
>> 2766, threadinfo 8803d7e1a000, task 8803dfaf8000)
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.725549]  
>>  8803d5369d78 01da
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.725695]  0fff
>> d5369d38 1000 
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.725841]  8803d5369aa8
>> 8803d5369c10 8803d7e1bdc8 
>>  Jan  5 17:07:45 linuscs101 kernel: [ 4325.726039]  []
>> vmtruncate+0x56/0x70
>>  Jan  5 17:07:45 linu

Re: [PATCH] Prevent oopsing in posix_acl_valid()

2011-05-03 Thread Christian Brunner
2011/5/3 Josef Bacik :
> On 05/03/2011 12:44 PM, Daniel J Blueman wrote:
>>
>> If posix_acl_from_xattr() returns an error code, a negative address is
>> dereferenced causing an oops; fix by checking for error code first.
>>
>> Signed-off-by: Daniel J Blueman
>> ---
>>  fs/btrfs/acl.c |    5 +++--
>>  1 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
>> index 5d505aa..cad6fbb 100644
>> --- a/fs/btrfs/acl.c
>> +++ b/fs/btrfs/acl.c
>> @@ -178,12 +178,13 @@ static int btrfs_xattr_acl_set(struct dentry
>> *dentry, const char *name,
>>
>>        if (value) {
>>                acl = posix_acl_from_xattr(value, size);
>> +               if (IS_ERR(acl)

A small typo: The right parenthesis is missing.

Christian


kernel BUG at fs/btrfs/extent-tree.c:5637!

2011-05-19 Thread Christian Brunner
Hi,

we are running a ceph cluster with a btrfs store. Last night we ran
across this btrfs BUG.

Any hints on how to solve this are welcome.

Regards
Christian

May 19 06:10:07 os00 kernel: [247212.342712] [ cut here
]
May 19 06:10:07 os00 kernel: [247212.347953] kernel BUG at
fs/btrfs/extent-tree.c:5637!
May 19 06:10:07 os00 kernel: [247212.353773] invalid opcode:  [#1] SMP
May 19 06:10:07 os00 kernel: [247212.358449] last sysfs file:
/sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
May 19 06:10:07 os00 kernel: [247212.367268] CPU 6
May 19 06:10:07 os00 kernel: [247212.369407] Modules linked in: btrfs
zlib_deflate libcrc32c bonding ipv6 serio_raw pcspkr ghes hed iTCO_wdt
iTCO_vendor_support i7core_edac edac_core ixgbe mdio iomemory_vsl(P)
hpsa igb dca squashfs usb_storage [last unloaded: scsi_wait_scan]
May 19 06:10:07 os00 kernel: [247212.393864]
May 19 06:10:07 os00 kernel: [247212.395618] Pid: 3074, comm: cosd
Tainted: P2.6.38.6-1.fits.3.el6.x86_64 #1 HP ProLiant
DL180 G6
May 19 06:10:07 os00 kernel: [247212.406885] RIP:
0010:[]  []
run_clustered_refs+0x54d/0x800 [btrfs]
May 19 06:10:07 os00 kernel: [247212.417468] RSP:
0018:8805dc6b99b8  EFLAGS: 00010282
May 19 06:10:07 os00 kernel: [247212.423482] RAX: ffef
RBX: 88037570ac00 RCX: 8805dc6b8000
May 19 06:10:07 os00 kernel: [247212.431528] RDX: 0008
RSI: 8800 RDI: 8805c7acabb0
May 19 06:10:07 os00 kernel: [247212.439572] RBP: 8805dc6b9a98
R08: 0001 R09: 0001
May 19 06:10:07 os00 kernel: [247212.447617] R10: 8805e0947000
R11: 8802e7e04480 R12: 8805ac2150c0
May 19 06:10:07 os00 kernel: [247212.455663] R13: 8804e758bc00
R14: 8805e0963000 R15: 8802e7e04480
May 19 06:10:07 os00 kernel: [247212.463709] FS:
7fc1691d2700() GS:8800bf2c()
knlGS:
May 19 06:10:07 os00 kernel: [247212.472820] CS:  0010 DS:  ES:
 CR0: 80050033
May 19 06:10:07 os00 kernel: [247212.479317] CR2: 7f907f6313b0
CR3: 0005dfad1000 CR4: 06e0
May 19 06:10:07 os00 kernel: [247212.487363] DR0: 
DR1:  DR2: 
May 19 06:10:07 os00 kernel: [247212.495408] DR3: 
DR6: 0ff0 DR7: 0400
May 19 06:10:07 os00 kernel: [247212.503453] Process cosd (pid: 3074,
threadinfo 8805dc6b8000, task 8805dc56e4c0)
May 19 06:10:07 os00 kernel: [247212.512563] Stack:
May 19 06:10:07 os00 kernel: [247212.514896]  
 88040001 
May 19 06:10:07 os00 kernel: [247212.523301]  8805dfd1c000
8805e1d71288  8805dc6b9ad8
May 19 06:10:07 os00 kernel: [247212.531673]  
0dd0 8805e1d711d0 0002
May 19 06:10:07 os00 kernel: [247212.540054] Call Trace:
May 19 06:10:07 os00 kernel: [247212.542883]  [] ?
btrfs_find_ref_cluster+0x1/0x180 [btrfs]
May 19 06:10:07 os00 kernel: [247212.550840]  []
btrfs_run_delayed_refs+0xc8/0x230 [btrfs]
May 19 06:10:07 os00 kernel: [247212.558700]  []
__btrfs_end_transaction+0x71/0x210 [btrfs]
May 19 06:10:07 os00 kernel: [247212.566685]  []
btrfs_end_transaction+0x15/0x20 [btrfs]
May 19 06:10:07 os00 kernel: [247212.574382]  []
btrfs_dirty_inode+0x8a/0x130 [btrfs]
May 19 06:10:07 os00 kernel: [247212.581752]  []
__mark_inode_dirty+0x3f/0x1e0
May 19 06:10:07 os00 kernel: [247212.588446]  []
file_update_time+0xec/0x170
May 19 06:10:07 os00 kernel: [247212.594952]  []
btrfs_file_aio_write+0x1d0/0x4e0 [btrfs]
May 19 06:10:07 os00 kernel: [247212.602709]  [] ?
ima_counts_get+0x61/0x140
May 19 06:10:07 os00 kernel: [247212.609214]  [] ?
btrfs_file_aio_write+0x0/0x4e0 [btrfs]
May 19 06:10:07 os00 kernel: [247212.616970]  []
do_sync_readv_writev+0xd3/0x110
May 19 06:10:07 os00 kernel: [247212.623855]  [] ?
path_put+0x22/0x30
May 19 06:10:07 os00 kernel: [247212.629675]  [] ?
selinux_file_permission+0xf3/0x150
May 19 06:10:07 os00 kernel: [247212.637044]  [] ?
security_file_permission+0x23/0x90
May 19 06:10:07 os00 kernel: [247212.644415]  []
do_readv_writev+0xd4/0x1e0
May 19 06:10:07 os00 kernel: [247212.650818]  [] ?
mutex_lock+0x31/0x60
May 19 06:10:07 os00 kernel: [247212.656832]  []
vfs_writev+0x46/0x60
May 19 06:10:07 os00 kernel: [247212.662653]  []
sys_writev+0x51/0xc0
May 19 06:10:07 os00 kernel: [247212.668477]  []
system_call_fastpath+0x16/0x1b
May 19 06:10:07 os00 kernel: [247212.675264] Code: 48 8b 75 a0 48 8b
7d a8 ba b0 00 00 00 e8 7c 6c 02 00 48 8b 95 78 ff ff ff 48 8b 75 a0
48 8b 7d a8 e8 68 6b 02 00 e9 04 ff ff ff <0f> 0b eb fe 0f 0b eb fe 0f
0b 66 0f 1f 84 00 00 00 00 00 eb f5
May 19 06:10:07 os00 kernel: [247212.697014] RIP  []
run_clustered_refs+0x54d/0x800 [btrfs]
May 19 06:10:07 os00 kernel: [247212.704981]  RSP 
May 19 06:10:07 os00 kernel: [247212.709579] ---[ end trace
b0954a112f69e38b ]---

Re: Delayed inode operations not doing the right thing with enospc

2011-07-12 Thread Christian Brunner
2011/6/7 Josef Bacik :
> On 06/06/2011 09:39 PM, Miao Xie wrote:
>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>> I got a lot of these when running stress.sh on my test box
>>>
>>>
>>>
>>> This is because use_block_rsv() is having to do a
>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>> reserved enough space for those operations to complete.  This is
>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>> trans->block_rsv, which will only have what the current transaction
>>> starter had reserved.
>>>
>>> What needs to be done instead is we need to have a block reserve that
>>> any reservation that is done at create time for these inodes is migrated
>>> to this special reserve, and then when you run the delayed inode items
>>> stuff you set trans->block_rsv to the special block reserve so the
>>> accounting is all done properly.
>>>
>>> This is just off the top of my head, there may be a better way to do it,
>>> I've not actually looked that the delayed inode code at all.
>>>
>>> I would do this myself but I have a ever increasing list of shit to do
>>> so will somebody pick this up and fix it please?  Thanks,
>>
>> Sorry, it's my miss.
>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>> the space from trans_block_rsv to global_block_rsv.
>>
>> I'll fix it soon.
>>
>
> There is another problem, we're failing xfstest 204.  I tried making
> reserve_metadata_bytes commit the transaction regardless of whether or
> not there were pinned bytes but the test just hung there.  Usually it
> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
> 204 just creates a crap ton of files, which is what is killing us.
> There needs to be a way to start flushing delayed inode items so we can
> reclaim the space they are holding onto so we don't get enospc, and it
> needs to be better than just committing the transaction because that is
> dog slow.  Thanks,
>
> Josef

Is there a solution for this?

I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
(except the plugging). When starting a ceph rebuild on the btrfs
volumes I get a lot of warnings from block_rsv_use_bytes in
use_block_rsv:

[ 2157.922054] [ cut here ]
[ 2157.927270] WARNING: at fs/btrfs/extent-tree.c:5683
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[ 2157.937123] Hardware name: ProLiant DL180 G6
[ 2157.942132] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 pcspkr serio_raw iTCO_wdt iTCO_vendor_support ghes hed
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[ 2157.967386] Pid: 10280, comm: btrfs-freespace Tainted: PW
2.6.38.8-1.fits.4.el6.x86_64 #1
[ 2157.977554] Call Trace:
[ 2157.980383]  [] ? warn_slowpath_common+0x7f/0xc0
[ 2157.987382]  [] ? warn_slowpath_null+0x1a/0x20
[ 2157.994192]  [] ?
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]
[ 2158.002354]  [] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[ 2158.010014]  [] ? split_leaf+0x142/0x8c0 [btrfs]
[ 2158.016990]  [] ? generic_bin_search+0x19b/0x210 [btrfs]
[ 2158.024784]  [] ? btrfs_leaf_free_space+0x8a/0xe0 [btrfs]
[ 2158.032627]  [] ? btrfs_search_slot+0x6d3/0x7a0 [btrfs]
[ 2158.040325]  [] ?
btrfs_csum_file_blocks+0x632/0x830 [btrfs]
[ 2158.048477]  [] ? clear_extent_bit+0x17a/0x440 [btrfs]
[ 2158.056026]  [] ? add_pending_csums+0x45/0x70 [btrfs]
[ 2158.063530]  [] ?
btrfs_finish_ordered_io+0x22d/0x360 [btrfs]
[ 2158.071755]  [] ?
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[ 2158.080172]  [] ?
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[ 2158.088505]  [] ? schedule_timeout+0x17b/0x2e0
[ 2158.095258]  [] ? bio_endio+0x1d/0x40
[ 2158.101171]  [] ? end_workqueue_fn+0xf4/0x130 [btrfs]
[ 2158.108621]  [] ? worker_loop+0x13e/0x540 [btrfs]
[ 2158.115703]  [] ? worker_loop+0x0/0x540 [btrfs]
[ 2158.122563]  [] ? worker_loop+0x0/0x540 [btrfs]
[ 2158.129413]  [] ? kthread+0x96/0xa0
[ 2158.135093]  [] ? kernel_thread_helper+0x4/0x10
[ 2158.141913]  [] ? kthread+0x0/0xa0
[ 2158.147467]  [] ? kernel_thread_helper+0x0/0x10
[ 2158.154287] ---[ end trace 55e53c726a04ecd7 ]---

Thanks,
Christian


Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Christian Brunner
2011/7/13 Josef Bacik :
> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>> 2011/6/7 Josef Bacik :
>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>> On fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>> I got a lot of these when running stress.sh on my test box
>>>>>
>>>>>
>>>>>
>>>>> This is because use_block_rsv() is having to do a
>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>> reserved enough space for those operations to complete.  This is
>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>> trans->block_rsv, which will only have what the current transaction
>>>>> starter had reserved.
>>>>>
>>>>> What needs to be done instead is we need to have a block reserve that
>>>>> any reservation that is done at create time for these inodes is migrated
>>>>> to this special reserve, and then when you run the delayed inode items
>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>> accounting is all done properly.
>>>>>
>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>> I've not actually looked that the delayed inode code at all.
>>>>>
>>>>> I would do this myself but I have a ever increasing list of shit to do
>>>>> so will somebody pick this up and fix it please?  Thanks,
>>>>
>>>> Sorry, it's my miss.
>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have 
>>>> migrated
>>>> the space from trans_block_rsv to global_block_rsv.
>>>>
>>>> I'll fix it soon.
>>>>
>>>
>>> There is another problem, we're failing xfstest 204.  I tried making
>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>> not there were pinned bytes but the test just hung there.  Usually it
>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>> 204 just creates a crap ton of files, which is what is killing us.
>>> There needs to be a way to start flushing delayed inode items so we can
>>> reclaim the space they are holding onto so we don't get enospc, and it
>>> needs to be better than just committing the transaction because that is
>>> dog slow.  Thanks,
>>>
>>> Josef
>>
>> Is there a solution for this?
>>
>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>> (except the pluging). When starting a ceph rebuild on the btrfs
>> volumes I get a lot of warnings from block_rsv_use_bytes in
>> use_block_rsv:
>>
>
> Ok I think I've got this nailed down.  Will you run with this patch and make 
> sure the warnings go away?  Thanks,

I'm sorry, I'm still getting a lot of warnings like the one below.

I've also noticed that I'm not getting these messages when the
free_space_cache is disabled.

Christian

[  697.398097] [ cut here ]
[  697.398109] WARNING: at fs/btrfs/extent-tree.c:5693
btrfs_alloc_free_block+0x1f8/0x360 [btrfs]()
[  697.398111] Hardware name: ProLiant DL180 G6
[  697.398112] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
usb_storage [last unloaded: scsi_wait_scan]
[  697.398122] Pid: 6591, comm: btrfs-freespace Tainted: PW
3.0.0-1.fits.1.el6.x86_64 #1
[  697.398124] Call Trace:
[  697.398128]  [] warn_slowpath_common+0x7f/0xc0
[  697.398131]  [] warn_slowpath_null+0x1a/0x20
[  697.398142]  [] btrfs_alloc_free_block+0x1f8/0x360 [btrfs]
[  697.398156]  [] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[  697.398316]  [] split_leaf+0x142/0x8c0 [btrfs]
[  697.398325]  [] ? generic_bin_search+0x19b/0x210 [btrfs]
[  697.398334]  [] ? btrfs_leaf_free_space+0x8a/0xe0 [btrfs]
[  697.398344]  [] btrfs_search_slot+0x6d3/0x7a0 [btrfs]
[  697.398355]  [] btrfs_csum_file_blocks+0x632/0x830 [btrfs]
[  697.398369]  [] ? clear_extent_bit+0x17a/0x440 [btrfs]
[  697.398382]  [] add_pending_csums+0x49/0x70 [btrfs]
[  697.398395]  [] btrfs_finish_ordered_io+0x22d/0x360 [btrfs]
[  697.398408]  []
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
[  697.398422]  []
end_bio_extent_writepage+0x13b/0x180 [btrfs]
[  697.398425]  [] ? schedule_timeout+0x17b/0x2e0
[  697.398436]  [] ? end_workqueue_fn+0xe9/0x13

Re: [PATCH] Btrfs: don't be as agressive with delalloc metadata reservations V2

2011-07-21 Thread Christian Brunner
2011/7/18 Josef Bacik :
> On 07/18/2011 02:11 PM, Josef Bacik wrote:
>> Currently we reserve enough space to COW an entirely full btree for every 
>> extent
>> we have reserved for an inode.  This _sucks_, because you only need to COW 
>> once,
>> and then everybody else is ok.  Unfortunately we don't know we'll all be 
>> able to
>> get into the same transaction so that's what we have had to do.  But the 
>> global
>> reserve holds a reservation large enough to cover a large percentage of all 
>> the
>> metadata currently in the fs.  So all we really need to account for is any 
>> new
>> blocks that we may allocate.  So fix this by
>>
>> 1) Passing to btrfs_alloc_free_block() wether this is a new block or a COW
>> block.  If it is a COW block we use the global reserve, if not we use the
>> trans->block_rsv.
>> 2) Reduce the amount of space we reserve.  Since we don't need to account for
>> cow'ing the tree we can just keep track of new blocks to reserve, which 
>> greatly
>> reduces the reservation amount.
>>
>> This makes my basic random write test go from 3 mb/s to 75 mb/s.  I've tested
>> this with my horrible ENOSPC test and it seems to work out fine.  Thanks,
>>
>> Signed-off-by: Josef Bacik 
>> ---
>> V1->V2:
>> -fix a problem reported by Liubo, we need to make sure that we move bytes
>> over for any new extents we may add to the extent tree so we don't get a 
>> bunch
>> of warnings.
>> -fix the global reserve to reserve 50% of the metadata space currently used.

When I run this patch I get a lot of messages like these (V1 seemed to
run fine).

Regards,
Christian

Jul 21 15:25:59 os00 kernel: [   35.411360] [ cut here ]
Jul 21 15:25:59 os00 kernel: [   35.416589] WARNING: at
fs/btrfs/extent-tree.c:5564
btrfs_alloc_reserved_file_extent+0xf8/0x100 [btrfs]()
Jul 21 15:25:59 os00 kernel: [   35.427311] Hardware name: ProLiant DL180 G6
Jul 21 15:25:59 os00 kernel: [   35.432326] Modules linked in: btrfs
zlib_deflate libcrc32c bonding ipv6 serio_raw pcspkr ghes hed iTCO_wdt
iTCO_vendor_support ixgbe dca mdio i7core_edac edac_core
iomemory_vsl(P) hpsa squashfs usb_storage [last unloaded:
scsi_wait_scan]
Jul 21 15:25:59 os00 kernel: [   35.456799] Pid: 1876, comm:
btrfs-endio-wri Tainted: P3.0.0-1.fits.4.el6.x86_64 #1
Jul 21 15:25:59 os00 kernel: [   35.466610] Call Trace:
Jul 21 15:25:59 os00 kernel: [   35.469497]  []
warn_slowpath_common+0x7f/0xc0
Jul 21 15:25:59 os00 kernel: [   35.476254]  []
warn_slowpath_null+0x1a/0x20
Jul 21 15:25:59 os00 kernel: [   35.482839]  []
btrfs_alloc_reserved_file_extent+0xf8/0x100 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.491683]  []
insert_reserved_file_extent.clone.0+0x201/0x270 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.500912]  []
btrfs_finish_ordered_io+0x2eb/0x360 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.508978]  [] ?
try_to_del_timer_sync+0x81/0xe0
Jul 21 15:25:59 os00 kernel: [   35.516081]  []
btrfs_writepage_end_io_hook+0x4c/0xa0 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.524340]  []
end_compressed_bio_write+0x86/0xf0 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.532259]  []
bio_endio+0x1d/0x40
Jul 21 15:25:59 os00 kernel: [   35.538034]  []
end_workqueue_fn+0xf4/0x130 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.545384]  []
worker_loop+0x13e/0x540 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.552307]  [] ?
btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.560039]  [] ?
btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
Jul 21 15:25:59 os00 kernel: [   35.567768]  []
kthread+0x96/0xa0
Jul 21 15:25:59 os00 kernel: [   35.573275]  []
kernel_thread_helper+0x4/0x10
Jul 21 15:25:59 os00 kernel: [   35.579931]  [] ?
kthread_worker_fn+0x1a0/0x1a0
Jul 21 15:25:59 os00 kernel: [   35.586816]  [] ?
gs_change+0x13/0x13
Jul 21 15:25:59 os00 kernel: [   35.592779] ---[ end trace d87e2733f1e978b8 ]---


WARNING: at fs/btrfs/inode.c:2204

2011-07-21 Thread Christian Brunner
I'm running a Ceph Object Store with 3.0-rc7 and patches from Josef.
Occasionally I get the attached warning.

Everything seems to be working after this warning, but I am concerned...

Thanks,
Christian

[13319.808020] [ cut here ]
[13319.813284] WARNING: at fs/btrfs/inode.c:2204
btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
[13319.822563] Hardware name: ProLiant DL180 G6
[13319.827586] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[13319.851192] Pid: 23617, comm: kworker/6:0 Tainted: P
3.0.0-1.fits.2.el6.x86_64 #1
[13319.860661] Call Trace:
[13319.863433]  [] warn_slowpath_common+0x7f/0xc0
[13319.870172]  [] warn_slowpath_null+0x1a/0x20
[13319.876724]  [] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
[13319.884633]  [] commit_fs_roots+0xc5/0x1b0 [btrfs]
[13319.891762]  []
btrfs_commit_transaction+0x3ce/0x840 [btrfs]
[13319.899917]  [] ? dequeue_task_fair+0x20f/0x220
[13319.906726]  [] ? __switch_to+0x12b/0x320
[13319.912943]  [] ? wake_up_bit+0x40/0x40
[13319.918971]  [] ? btrfs_end_transaction+0x20/0x20 [btrfs]
[13319.926775]  [] do_async_commit+0x1f/0x30 [btrfs]
[13319.933825]  [] process_one_work+0x128/0x450
[13319.940419]  [] worker_thread+0x17b/0x3c0
[13319.946670]  [] ? manage_workers+0x220/0x220
[13319.953210]  [] kthread+0x96/0xa0
[13319.958682]  [] kernel_thread_helper+0x4/0x10
[13319.965316]  [] ? kthread_worker_fn+0x1a0/0x1a0
[13319.972183]  [] ? gs_change+0x13/0x13
[13319.978065] ---[ end trace 942778a443791443 ]---


Btrfs slowdown

2011-07-25 Thread Christian Brunner
Hi,

we are running a ceph cluster with btrfs as its base filesystem
(kernel 3.0). At the beginning everything worked very well, but after
a few days (2-3) things are getting very slow.

When I look at the object store servers I see heavy disk-i/o on the
btrfs filesystems (disk utilization is between 60% and 100%). I also
did some tracing on the Ceph Object Store Daemon, but I'm quite
certain that the majority of the disk I/O is not caused by ceph or
any other userland process.

When I reboot the system(s) the problems go away for another 2-3 days,
but after that, it starts again. I'm not sure if the problem is
related to the kernel warning I've reported last week. At least there
is no temporal relationship between the warning and the slowdown.

Any hints on how to trace this would be welcome.

Thanks,
Christian


Re: Btrfs slowdown

2011-07-28 Thread Christian Brunner
2011/7/28 Marcus Sorensen :
> Christian,
>
> Have you checked up on the disks themselves and hardware? High
> utilization can mean that the i/o load has increased, but it can also
> mean that the i/o capacity has decreased.  Your traces seem to
> indicate that a good portion of the time is being spent on commits,
> that could be waiting on disk. That "wait_for_commit" looks to
> basically just spin waiting for the commit to complete, and at least
> one thing that calls it raises a BUG_ON, not sure if it's one you've
> seen even on 2.6.38.
>
> There could be all sorts of performance related reasons that aren't
> specific to btrfs or ceph, on our various systems we've seen things
> like the raid card module being upgraded in newer kernels and suddenly
> our disks start to go into sleep mode after a bit, dirty_ratio causing
> multiple gigs of memory to sync because its not optimized for the
> workload, external SAS enclosures stop communicating a few days after
> reboot (but the disks keep working with sporadic issues), things like
> patrol read hitting a bad sector on a disk, causing it to go into
> enhanced error recovery and stop responding, etc.

I'm fairly confident that the hardware is ok. We see the problem on
four machines. It could be a problem with the hpsa driver/firmware,
but we haven't seen the behavior with 2.6.38 and the changes in the
hpsa driver are not that big.

> Maybe you have already tried these things. It's where I would start
> anyway. Looking at /proc/meminfo, dirty, writeback, swap, etc both
> while the system is functioning desirably and when it's misbehaving.
> Looking at anything else that might be in D state. Looking at not just
> disk util, but the workload causing it (e.g. Was I doing 300 iops
> previously with an average size of 64k, and now I'm only managing 50
> iops at 64k before the disk util reports 100%?) Testing the system in
> a filesystem-agnostic manner, for example when performance is bad
> through btrfs, is performance the same as you got on fresh boot when
> testing iops on /dev/sdb or whatever? You're not by chance swapping
> after a bit of uptime on any volume that's shared with the underlying
> disks that make up your osd, obfuscated by a hardware raid? I didn't
> see the kernel warning you're referring to, just the ixgbe malloc
> failure you mentioned the other day.

I've looked at most of this. What makes me point to btrfs, is that the
problem goes away when I reboot on server in our cluster, but persists
on the other systems. So it can't be related to the number of requests
that come in.

> I do not mean to presume that you have not looked at these things
> already. I am not very knowledgeable in btrfs specifically, but I
> would expect any degradation in performance over time to be due to
> what's on disk (lots of small files, fragmented, etc). This is
> obviously not the case in this situation since a reboot recovers the
> performance. I suppose it could also be a memory leak or something
> similar, but you should be able to detect something like that by
> monitoring your memory situation, /proc/slabinfo etc.

It could be related to a memory leak. The machine has a lot of RAM (24
GB), but we have seen page allocation failures in the ixgbe driver,
when we are using jumbo frames.

> Just my thoughts, good luck on this. I am currently running 2.6.39.3
> (btrfs) on the 7 node cluster I put together, but I just built it and
> am comparing between various configs. It will be awhile before it is
> under load for several days straight.

Thanks!

When I look at the latencytop results, there is a high latency when
calling "btrfs_commit_transaction_async". Isn't "async" supposed to
return immediately?

Regards,
Christian


Re: Btrfs slowdown

2011-08-09 Thread Christian Brunner
Hi Sage,

I did some testing with btrfs-unstable yesterday. With the recent
commit from Chris it looks quite good:

"Btrfs: force unplugs when switching from high to regular priority bios"


However I can't test it extensively, because our main environment is
on ext4 at the moment.

Regards
Christian

2011/8/8 Sage Weil :
> Hi Christian,
>
> Are you still seeing this slowness?
>
> sage
>
>
> On Wed, 27 Jul 2011, Christian Brunner wrote:
>> 2011/7/25 Chris Mason :
>> > Excerpts from Christian Brunner's message of 2011-07-25 03:54:47 -0400:
>> >> Hi,
>> >>
>> >> we are running a ceph cluster with btrfs as it's base filesystem
>> >> (kernel 3.0). At the beginning everything worked very well, but after
>> >> a few days (2-3) things are getting very slow.
>> >>
>> >> When I look at the object store servers I see heavy disk-i/o on the
>> >> btrfs filesystems (disk utilization is between 60% and 100%). I also
>> >> did some tracing on the Cepp-Object-Store-Daemon, but I'm quite
>> >> certain, that the majority of the disk I/O is not caused by ceph or
>> >> any other userland process.
>> >>
>> >> When reboot the system(s) the problems go away for another 2-3 days,
>> >> but after that, it starts again. I'm not sure if the problem is
>> >> related to the kernel warning I've reported last week. At least there
>> >> is no temporal relationship between the warning and the slowdown.
>> >>
>> >> Any hints on how to trace this would be welcome.
>> >
>> > The easiest way to trace this is with latencytop.
>> >
>> > Apply this patch:
>> >
>> > http://oss.oracle.com/~mason/latencytop.patch
>> >
>> > And then use latencytop -c for a few minutes while the system is slow.
>> > Send the output here and hopefully we'll be able to figure it out.
>>
>> I've now installed latencytop. Attached are two output files: The
>> first is from yesterday and was created aproxematly half an hour after
>> the boot. The second on is from today, uptime is 19h. The load on the
>> system is already rising. Disk utilization is approximately at 50%.
>>
>> Thanks for your help.
>>
>> Christian
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


WARNING: at fs/btrfs/inode.c:2114

2011-10-09 Thread Christian Brunner
I gave btrfs "for-chris" from josef's github repo a try in our ceph
cluster. During the rebuild I got the following warning.

Everything still seems to work... Should I be concerned?

Thanks,
Christian

[12554.886362] [ cut here ]
[12554.891693] WARNING: at fs/btrfs/inode.c:2114
btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
[12554.901210] Hardware name: ProLiant DL180 G6
[12554.906338] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 pcspkr serio_raw ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[12554.930791] Pid: 4686, comm: flush-btrfs-1 Tainted: P
3.0.6-1.fits.1.el6.x86_64 #1
[12554.940483] Call Trace:
[12554.943400]  [] warn_slowpath_common+0x7f/0xc0
[12554.950378]  [] warn_slowpath_null+0x1a/0x20
[12554.957070]  [] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
[12554.965301]  [] commit_fs_roots+0xc5/0x1b0 [btrfs]
[12554.972571]  [] ? mutex_lock+0x31/0x60
[12554.978652]  [] ? btrfs_free_path+0x2a/0x40 [btrfs]
[12554.986017]  []
btrfs_commit_transaction+0x3c6/0x830 [btrfs]
[12554.994256]  [] ? join_transaction+0x25/0x250 [btrfs]
[12555.001814]  [] ? wake_up_bit+0x40/0x40
[12555.008061]  [] btrfs_write_inode+0xbb/0xc0 [btrfs]
[12555.015490]  [] writeback_single_inode+0x201/0x260
[12555.022879]  [] writeback_sb_inodes+0xeb/0x1c0
[12555.029721]  [] wb_writeback+0x18f/0x480
[12555.036031]  [] ? __schedule+0x3f5/0x8b0
[12555.042295]  [] ? lock_timer_base+0x3c/0x70
[12555.048860]  [] wb_do_writeback+0x9d/0x270
[12555.055384]  [] ? del_timer+0xf0/0xf0
[12555.061375]  [] bdi_writeback_thread+0xa2/0x280
[12555.068375]  [] ? wb_do_writeback+0x270/0x270
[12555.075203]  [] ? wb_do_writeback+0x270/0x270
[12555.081971]  [] kthread+0x96/0xa0
[12555.087582]  [] kernel_thread_helper+0x4/0x10
[12555.094338]  [] ? kthread_worker_fn+0x1a0/0x1a0
[12555.101296]  [] ? gs_change+0x13/0x13
[12555.107290] ---[ end trace 57ec2e8544131a12 ]---


Re: WARNING: at fs/btrfs/inode.c:2114

2011-10-09 Thread Christian Brunner
I just realized that this is still the same warning I reported some months ago.

I thought that this had been fixed with

25d37af374263243214be9d912cbb46a8e469bc7

which is included in the kernel I'm using. So I think there must be
another problem.

Regards,
Christian

2011/10/9 Christian Brunner :
> I gave btrfs "for-chris" from josef's github repo a try in our ceph
> cluster. During the rebuild I git the following warning.
>
> Everything still seems to work... Should I be concerned?
>
> Thanks,
> Christian
>
> [12554.886362] [ cut here ]
> [12554.891693] WARNING: at fs/btrfs/inode.c:2114
> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
> [12554.901210] Hardware name: ProLiant DL180 G6
> [12554.906338] Modules linked in: btrfs zlib_deflate libcrc32c bonding
> ipv6 pcspkr serio_raw ghes hed iTCO_wdt iTCO_vendor_support
> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
> [last unloaded: scsi_wait_scan]
> [12554.930791] Pid: 4686, comm: flush-btrfs-1 Tainted: P
> 3.0.6-1.fits.1.el6.x86_64 #1
> [12554.940483] Call Trace:
> [12554.943400]  [] warn_slowpath_common+0x7f/0xc0
> [12554.950378]  [] warn_slowpath_null+0x1a/0x20
> [12554.957070]  [] btrfs_orphan_commit_root+0xb0/0xc0 
> [btrfs]
> [12554.965301]  [] commit_fs_roots+0xc5/0x1b0 [btrfs]
> [12554.972571]  [] ? mutex_lock+0x31/0x60
> [12554.978652]  [] ? btrfs_free_path+0x2a/0x40 [btrfs]
> [12554.986017]  []
> btrfs_commit_transaction+0x3c6/0x830 [btrfs]
> [12554.994256]  [] ? join_transaction+0x25/0x250 [btrfs]
> [12555.001814]  [] ? wake_up_bit+0x40/0x40
> [12555.008061]  [] btrfs_write_inode+0xbb/0xc0 [btrfs]
> [12555.015490]  [] writeback_single_inode+0x201/0x260
> [12555.022879]  [] writeback_sb_inodes+0xeb/0x1c0
> [12555.029721]  [] wb_writeback+0x18f/0x480
> [12555.036031]  [] ? __schedule+0x3f5/0x8b0
> [12555.042295]  [] ? lock_timer_base+0x3c/0x70
> [12555.048860]  [] wb_do_writeback+0x9d/0x270
> [12555.055384]  [] ? del_timer+0xf0/0xf0
> [12555.061375]  [] bdi_writeback_thread+0xa2/0x280
> [12555.068375]  [] ? wb_do_writeback+0x270/0x270
> [12555.075203]  [] ? wb_do_writeback+0x270/0x270
> [12555.081971]  [] kthread+0x96/0xa0
> [12555.087582]  [] kernel_thread_helper+0x4/0x10
> [12555.094338]  [] ? kthread_worker_fn+0x1a0/0x1a0
> [12555.101296]  [] ? gs_change+0x13/0x13
> [12555.107290] ---[ end trace 57ec2e8544131a12 ]---
>


Re: Btrfs High IO-Wait

2011-10-11 Thread Christian Brunner
I think this is related to the sync issues. You could try josef's git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work.git

Since yesterday I'm using it in our ceph cluster and it seems to do a
better job.
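
For anyone who wants to build it, the sequence is roughly the usual kernel
build (starting from the running kernel's config; paths and the job count
are just examples):

  # git clone git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work.git
  # cd btrfs-work
  # cp /boot/config-$(uname -r) .config && make oldconfig
  # make -j8 && make modules_install install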

Regards,
Christian

2011/10/9 Martin Mailand :
> Hi,
> I have high IO-Wait on the ods (ceph), the osd are running a v3.1-rc9
> kernel.
> I also experience high IO-rates, around 500IO/s reported via iostat.
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.00    6.80     0.00    62.40    18.35     0.04    5.29    0.00    5.29   5.29   3.60
> sdb               0.00   249.80    0.40  669.60     1.60  4118.40    12.30    87.47  130.56   15.00  130.63   1.01  67.40
>
> In comparison, the same workload, but the osd uses ext4 as a backing fs.
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.00   10.00     0.00   128.00    25.60     0.03    3.40    0.00    3.40   3.40   3.40
> sdb               0.00    27.80    0.00   48.20     0.00   318.40    13.21     0.43    8.84    0.00    8.84   1.99   9.60
>
> iodump shows similar results, where sdb is the data disk, sda7 the journal
> and sda5 the root.
>
> btrfs
>
> root@s-brick-003:~# echo 1 > /proc/sys/vm/block_dump
> root@s-brick-003:~# while true; do sleep 1; dmesg -c; done | perl
> /usr/local/bin/iodump
> ^C# Caught SIGINT.
> TASK                   PID      TOTAL       READ      WRITE      DIRTY DEVICES
> btrfs-submit-0        8321      28040          0      28040          0 sdb
> ceph-osd              8514        158          0        158          0 sda7
> kswapd0                 46         81          0         81          0 sda1
> bash                 10709         35         35          0          0 sda1
> flush-8:0              962         12          0         12          0 sda5
> kworker/0:1           8897          6          0          6          0 sdb
> kworker/1:1          10354          3          0          3          0 sdb
> kjournald              266          3          0          3          0 sda5
> ceph-osd              8523          2          2          0          0 sda1
> ceph-osd              8531          1          1          0          0 sda1
> dmesg                10712          1          1          0          0 sda5
>
>
> ext4
>
> root@s-brick-002:~# echo 1 > /proc/sys/vm/block_dump
> root@s-brick-002:~# while true; do sleep 1; dmesg -c; done | perl
> /usr/local/bin/iodump
> ^C# Caught SIGINT.
> TASK                   PID      TOTAL       READ      WRITE      DIRTY DEVICES
> ceph-osd              3115        847          0        847          0 sdb
> jbd2/sdb-8            2897        784          0        784          0 sdb
> ceph-osd              3112        728          0        728          0 sda5, sdb
> ceph-osd              3110        191          0        191          0 sda7
> perl                  3628         13         13          0          0 sda5
> flush-8:16            2901          8          0          8          0 sdb
> kjournald              272          3          0          3          0 sda5
> dmesg                 3630          1          1          0          0 sda5
> sleep                 3629          1          1          0          0 sda5
>
>
> I think that is the same problem as in
> http://marc.info/?l=ceph-devel&m=131158049117139&w=2
>
> I also did a latencytop as Chris recommended in the above thread.
>
> Best Regards,
>  martin


Re: WARNING: at fs/btrfs/inode.c:2114

2011-10-11 Thread Christian Brunner
2011/10/11 Liu Bo :
> On 10/10/2011 12:41 AM, Christian Brunner wrote:
>> I just realized that this is still the same warning I reported some month 
>> ago.
>>
>> I thought that this had been fixed with
>>
>> 25d37af374263243214be9d912cbb46a8e469bc7
>>
>> which is included in the kernel I'm using. So I think there must be
>> another Problem.
>>
>
> Would you try with this patch:
>
> http://marc.info/?l=linux-btrfs&m=131547325515336&w=2
>

This one is already included in my tree.

Regards,
Christian


>> 2011/10/9 Christian Brunner :
>>> I gave btrfs "for-chris" from josef's github repo a try in our ceph
>>> cluster. During the rebuild I git the following warning.
>>>
>>> Everything still seems to work... Should I be concerned?
>>>
>>> Thanks,
>>> Christian
>>>
>>> [12554.886362] [ cut here ]
>>> [12554.891693] WARNING: at fs/btrfs/inode.c:2114
>>> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>>> [12554.901210] Hardware name: ProLiant DL180 G6
>>> [12554.906338] Modules linked in: btrfs zlib_deflate libcrc32c bonding
>>> ipv6 pcspkr serio_raw ghes hed iTCO_wdt iTCO_vendor_support
>>> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
>>> [last unloaded: scsi_wait_scan]
>>> [12554.930791] Pid: 4686, comm: flush-btrfs-1 Tainted: P
>>> 3.0.6-1.fits.1.el6.x86_64 #1
>>> [12554.940483] Call Trace:
>>> [12554.943400]  [] warn_slowpath_common+0x7f/0xc0
>>> [12554.950378]  [] warn_slowpath_null+0x1a/0x20
>>> [12554.957070]  [] btrfs_orphan_commit_root+0xb0/0xc0 
>>> [btrfs]
>>> [12554.965301]  [] commit_fs_roots+0xc5/0x1b0 [btrfs]
>>> [12554.972571]  [] ? mutex_lock+0x31/0x60
>>> [12554.978652]  [] ? btrfs_free_path+0x2a/0x40 [btrfs]
>>> [12554.986017]  []
>>> btrfs_commit_transaction+0x3c6/0x830 [btrfs]
>>> [12554.994256]  [] ? join_transaction+0x25/0x250 [btrfs]
>>> [12555.001814]  [] ? wake_up_bit+0x40/0x40
>>> [12555.008061]  [] btrfs_write_inode+0xbb/0xc0 [btrfs]
>>> [12555.015490]  [] writeback_single_inode+0x201/0x260
>>> [12555.022879]  [] writeback_sb_inodes+0xeb/0x1c0
>>> [12555.029721]  [] wb_writeback+0x18f/0x480
>>> [12555.036031]  [] ? __schedule+0x3f5/0x8b0
>>> [12555.042295]  [] ? lock_timer_base+0x3c/0x70
>>> [12555.048860]  [] wb_do_writeback+0x9d/0x270
>>> [12555.055384]  [] ? del_timer+0xf0/0xf0
>>> [12555.061375]  [] bdi_writeback_thread+0xa2/0x280
>>> [12555.068375]  [] ? wb_do_writeback+0x270/0x270
>>> [12555.075203]  [] ? wb_do_writeback+0x270/0x270
>>> [12555.081971]  [] kthread+0x96/0xa0
>>> [12555.087582]  [] kernel_thread_helper+0x4/0x10
>>> [12555.094338]  [] ? kthread_worker_fn+0x1a0/0x1a0
>>> [12555.101296]  [] ? gs_change+0x13/0x13
>>> [12555.107290] ---[ end trace 57ec2e8544131a12 ]---
>>>


Re: OSD: no current directory

2011-10-11 Thread Christian Brunner
2011/10/11 Sage Weil :
> On Tue, 11 Oct 2011, Christian Brunner wrote:
>> Maybe this one is easier:
>>
>> One of our OSDs isn't starting, because ther is no "current"
>> directory. What I have are three snap directories.
>>
>> total 0
>> -rw-r--r-- 1 root root   37 Oct  9 15:57 ceph_fsid
>> -rw-r--r-- 1 root root    8 Oct  9 15:57 fsid
>> -rw-r--r-- 1 root root   21 Oct  9 15:57 magic
>> drwxr-xr-x 1 root root 7986 Oct 11 18:34 snap_506043
>> drwxr-xr-x 1 root root 7986 Oct 11 18:34 snap_507364
>> drwxr-xr-x 1 root root 7814 Oct 11 18:36 snap_507417
>> -rw-r--r-- 1 root root    4 Oct  9 15:57 store_version
>> -rw-r--r-- 1 root root    2 Oct  9 15:57 whoami
>>
>> Is there a way to rollback the latest?
>
> That's what the OSD actually does on startup (roll back to the newest
> snap_).  It's probably a trivial bug that's preventing startup now... I'll
> take a look.  In the meantime, you can clone the latest snap_ to current
> and it should start!
>
> sage

This seems to be a btrfs problem. It fails when I'm trying to create the clone:

# btrfs subvolume snapshot snap_507417 current
Create a snapshot of 'snap_507417' in './current'
ERROR: cannot snapshot 'snap_507417'

And I get the following kernel messages:

[ 5863.263950] [ cut here ]
[ 5863.269125] WARNING: at fs/btrfs/inode.c:2335
btrfs_orphan_cleanup+0xcd/0x3d0 [btrfs]()
[ 5863.278142] Hardware name: ProLiant DL180 G6
[ 5863.283161] Modules linked in: btrfs zlib_deflate libcrc32c bonding
ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support ixgbe dca
mdio i7core_edac edac_core iomemory_vsl(P) hpsa squashfs usb_storage
[last unloaded: scsi_wait_scan]
[ 5863.307774] Pid: 6349, comm: btrfs Tainted: PW
3.0.6-1.fits.2.el6.x86_64 #1
[ 5863.316647] Call Trace:
[ 5863.319648]  [] warn_slowpath_common+0x7f/0xc0
[ 5863.326536]  [] warn_slowpath_null+0x1a/0x20
[ 5863.333146]  [] btrfs_orphan_cleanup+0xcd/0x3d0 [btrfs]
[ 5863.340839]  [] ? join_transaction+0x201/0x250 [btrfs]
[ 5863.348482]  [] ? block_rsv_migrate_bytes+0x3a/0x50 [btrfs]
[ 5863.356590]  [] btrfs_mksubvol+0x2fb/0x380 [btrfs]
[ 5863.363726]  []
btrfs_ioctl_snap_create_transid+0xfa/0x150 [btrfs]
[ 5863.372445]  [] btrfs_ioctl_snap_create+0x56/0x80 [btrfs]
[ 5863.380398]  [] btrfs_ioctl+0x2fe/0xd50 [btrfs]
[ 5863.387344]  [] ? inode_has_perm+0x30/0x40
[ 5863.393798]  [] ? file_has_perm+0xdc/0xf0
[ 5863.400114]  [] do_vfs_ioctl+0x9a/0x5a0
[ 5863.406244]  [] sys_ioctl+0xa1/0xb0
[ 5863.412001]  [] system_call_fastpath+0x16/0x1b
[ 5863.418767] ---[ end trace e3234ecab14ad64c ]---
[ 5863.424084] btrfs: Error removing orphan entry, stopping orphan cleanup
[ 5863.431614] btrfs: could not do orphan cleanup -22

Can I use an older snapshot as well?

Regards,
Christian


Re: OSD: no current directory

2011-10-11 Thread Christian Brunner
2011/10/11 Sage Weil :
> On Tue, 11 Oct 2011, Christian Brunner wrote:
>> 2011/10/11 Sage Weil :
>> > On Tue, 11 Oct 2011, Christian Brunner wrote:
>> >> Maybe this one is easier:
>> >>
>> >> One of our OSDs isn't starting, because ther is no "current"
>> >> directory. What I have are three snap directories.
>> >>
>> >> total 0
>> >> -rw-r--r-- 1 root root   37 Oct  9 15:57 ceph_fsid
>> >> -rw-r--r-- 1 root root    8 Oct  9 15:57 fsid
>> >> -rw-r--r-- 1 root root   21 Oct  9 15:57 magic
>> >> drwxr-xr-x 1 root root 7986 Oct 11 18:34 snap_506043
>> >> drwxr-xr-x 1 root root 7986 Oct 11 18:34 snap_507364
>> >> drwxr-xr-x 1 root root 7814 Oct 11 18:36 snap_507417
>> >> -rw-r--r-- 1 root root    4 Oct  9 15:57 store_version
>> >> -rw-r--r-- 1 root root    2 Oct  9 15:57 whoami
>> >>
>> >> Is there a way to rollback the latest?
>> >
>> > That's what the OSD actually does on startup (roll back to the newest
>> > snap_).  It's probably a trivial bug that's preventing startup now... I'll
>> > take a look.  In the meantime, you can clone the latest snap_ to current
>> > and it should start!
>> >
>> > sage
>>
>> This seems to be a btrfs problem. It fails, when I'm trying to create the 
>> clone
>>
>> # btrfs subvolume snapshot snap_507417 current
>> Create a snapshot of 'snap_507417' in './current'
>> ERROR: cannot snapshot 'snap_507417'
>>
>> And I get the following kernel messages:
>>
>> [ 5863.263950] [ cut here ]
>> [ 5863.269125] WARNING: at fs/btrfs/inode.c:2335
>> btrfs_orphan_cleanup+0xcd/0x3d0 [btrfs]()
>> [ 5863.278142] Hardware name: ProLiant DL180 G6
>> [ 5863.283161] Modules linked in: btrfs zlib_deflate libcrc32c bonding
>> ipv6 serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support ixgbe dca
>> mdio i7core_edac edac_core iomemory_vsl(P) hpsa squashfs usb_storage
>> [last unloaded: scsi_wait_scan]
>> [ 5863.307774] Pid: 6349, comm: btrfs Tainted: P        W
>> 3.0.6-1.fits.2.el6.x86_64 #1
>> [ 5863.316647] Call Trace:
>> [ 5863.319648]  [] warn_slowpath_common+0x7f/0xc0
>> [ 5863.326536]  [] warn_slowpath_null+0x1a/0x20
>> [ 5863.333146]  [] btrfs_orphan_cleanup+0xcd/0x3d0 [btrfs]
>> [ 5863.340839]  [] ? join_transaction+0x201/0x250 [btrfs]
>> [ 5863.348482]  [] ? block_rsv_migrate_bytes+0x3a/0x50 
>> [btrfs]
>> [ 5863.356590]  [] btrfs_mksubvol+0x2fb/0x380 [btrfs]
>> [ 5863.363726]  []
>> btrfs_ioctl_snap_create_transid+0xfa/0x150 [btrfs]
>> [ 5863.372445]  [] btrfs_ioctl_snap_create+0x56/0x80 
>> [btrfs]
>> [ 5863.380398]  [] btrfs_ioctl+0x2fe/0xd50 [btrfs]
>> [ 5863.387344]  [] ? inode_has_perm+0x30/0x40
>> [ 5863.393798]  [] ? file_has_perm+0xdc/0xf0
>> [ 5863.400114]  [] do_vfs_ioctl+0x9a/0x5a0
>> [ 5863.406244]  [] sys_ioctl+0xa1/0xb0
>> [ 5863.412001]  [] system_call_fastpath+0x16/0x1b
>> [ 5863.418767] ---[ end trace e3234ecab14ad64c ]---
>> [ 5863.424084] btrfs: Error removing orphan entry, stopping orphan cleanup
>> [ 5863.431614] btrfs: could not do orphan cleanup -22
>>
>> Can I use an older snapshot as well?
>
> You're able to snapshot the others?
>
> Yeah, any of the snap_ directories will work, although keep in mind when
> the OSD starts up it will immediately remove current/ and re-clone the
> newest snap_ to current/ again.  If the problem is a toxic/broken snap_
> dir, you'll need to rename it out of the way to avoid hitting the problem
> again...
>
> sage

OK - renaming snap_507417 to broken_snap_507417 worked.
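
(For the record, that was nothing more than a rename in the osd data
directory, e.g.

  # cd /path/to/osd/data        (path is just an example)
  # mv snap_507417 broken_snap_507417

and then starting the osd again, which re-cloned the newest remaining
snap_ to current/.)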

Two other OSDs crashed the moment it came back online, but as far as I
can see, this is the same problem I've reported already. After a couple
of OSD restarts, I have them all up again.

Thanks for your help.

Christian


Re: [PATCH] Btrfs: allow us to overcommit our enospc reservations TEST THIS PLEASE!!!

2011-10-13 Thread Christian Brunner
2011/10/13 Josef Bacik :
[...]
>> >> [  175.956273] kernel BUG at fs/btrfs/inode.c:2176!
>> >
>> > Ok I think I see what's happening, this patch replaces the previous one, 
>> > let me
>> > know how it goes.  Thanks,
>> >
>>
>> Getting a slightly different BUG this time:
>>
>
> Ok looks like I've fixed the original problem and now we're hitting a problem
> with the free space cache.  This patch will replace the last one, its all the
> fixes up to now and a new set of BUG_ON()'s to figure out which free space 
> cache
> inode is screwing us up.  Thanks,
>
> Josef
>
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index fc0de68..e595372 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3334,7 +3334,7 @@ out:
>  * shrink metadata reservation for delalloc
>  */
>  static int shrink_delalloc(struct btrfs_trans_handle *trans,
> -                          struct btrfs_root *root, u64 to_reclaim, int sync)
> +                          struct btrfs_root *root, u64 to_reclaim, int 
> retries)
>  {
>        struct btrfs_block_rsv *block_rsv;
>        struct btrfs_space_info *space_info;
> @@ -3365,12 +3365,10 @@ static int shrink_delalloc(struct btrfs_trans_handle 
> *trans,
>        }
>
>        max_reclaim = min(reserved, to_reclaim);
> +       if (max_reclaim > (2 * 1024 * 1024))
> +               nr_pages = max_reclaim >> PAGE_CACHE_SHIFT;
>
>        while (loops < 1024) {
> -               /* have the flusher threads jump in and do some IO */
> -               smp_mb();
> -               nr_pages = min_t(unsigned long, nr_pages,
> -                      root->fs_info->delalloc_bytes >> PAGE_CACHE_SHIFT);
>                writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages);
>
>                spin_lock(&space_info->lock);
> @@ -3384,14 +3382,22 @@ static int shrink_delalloc(struct btrfs_trans_handle 
> *trans,
>                if (reserved == 0 || reclaimed >= max_reclaim)
>                        break;
>
> -               if (trans && trans->transaction->blocked)
> +               if (trans)
>                        return -EAGAIN;
>
> -               time_left = schedule_timeout_interruptible(1);
> +               if (!retries) {
> +                       time_left = schedule_timeout_interruptible(1);
>
> -               /* We were interrupted, exit */
> -               if (time_left)
> -                       break;
> +                       /* We were interrupted, exit */
> +                       if (time_left)
> +                               break;
> +               } else {
> +                       /*
> +                        * We've already done this song and dance once, let's
> +                        * really wait for some work to get done.
> +                        */
> +                       btrfs_wait_ordered_extents(root, 0, 0);
> +               }
>
>                /* we've kicked the IO a few times, if anything has been freed,
>                 * exit.  There is no sense in looping here for a long time
> @@ -3399,15 +3405,13 @@ static int shrink_delalloc(struct btrfs_trans_handle 
> *trans,
>                 * just too many writers without enough free space
>                 */
>
> -               if (loops > 3) {
> +               if (!retries && loops > 3) {
>                        smp_mb();
>                        if (progress != space_info->reservation_progress)
>                                break;
>                }
>
>        }
> -       if (reclaimed < to_reclaim && !trans)
> -               btrfs_wait_ordered_extents(root, 0, 0);
>        return reclaimed >= to_reclaim;
>  }
>
> @@ -3552,7 +3556,7 @@ again:
>         * We do synchronous shrinking since we don't actually unreserve
>         * metadata until after the IO is completed.
>         */
> -       ret = shrink_delalloc(trans, root, num_bytes, 1);
> +       ret = shrink_delalloc(trans, root, num_bytes, retries);
>        if (ret < 0)
>                goto out;
>
> @@ -3568,17 +3572,6 @@ again:
>                goto again;
>        }
>
> -       /*
> -        * Not enough space to be reclaimed, don't bother committing the
> -        * transaction.
> -        */
> -       spin_lock(&space_info->lock);
> -       if (space_info->bytes_pinned < orig_bytes)
> -               ret = -ENOSPC;
> -       spin_unlock(&space_info->lock);
> -       if (ret)
> -               goto out;
> -
>        ret = -EAGAIN;
>        if (trans)
>                goto out;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d6ba353..cb63904 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -782,7 +782,8 @@ static noinline int cow_file_range(struct inode *inode,
>        struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
>        int ret = 0;
>
> -       BUG_ON(btrfs_is_free_space_inode(root, inode));
> +       BUG_ON(root == root->fs_info->tree_root);
> +       BUG_ON(BTRFS_I(inode)->location.objectid == BTRFS_FREE_INO_OBJECTID);
>        trans = btrfs_join_transac

Re: WARNING: at fs/btrfs/inode.c:2114

2011-10-17 Thread Christian Brunner
2011/10/11 Christian Brunner :
> 2011/10/11 Liu Bo :
>> On 10/10/2011 12:41 AM, Christian Brunner wrote:
>>> I just realized that this is still the same warning I reported some month 
>>> ago.
>>>
>>> I thought that this had been fixed with
>>>
>>> 25d37af374263243214be9d912cbb46a8e469bc7
>>>
>>> which is included in the kernel I'm using. So I think there must be
>>> another Problem.
>>>
>>
>> Would you try with this patch:
>>
>> http://marc.info/?l=linux-btrfs&m=131547325515336&w=2
>>
>
> This one is already included in my tree.

I have updated to a 3.0.6 kernel, with all the btrfs patches from
josef's git repo this weekend. But I'm still seeing the following
warning:

[75532.763336] [ cut here ]
[75532.768570] WARNING: at fs/btrfs/inode.c:2114
btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
[75532.777807] Hardware name: ProLiant DL180 G6
[75532.782798] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[75532.806891] Pid: 1858, comm: ceph-osd Tainted: P
3.0.6-1.fits.5.el6.x86_64 #1
[75532.815990] Call Trace:
[75532.818772]  [] warn_slowpath_common+0x7f/0xc0
[75532.825514]  [] warn_slowpath_null+0x1a/0x20
[75532.832076]  [] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
[75532.840028]  [] commit_fs_roots+0xc5/0x1b0 [btrfs]
[75532.847196]  [] ? mutex_lock+0x31/0x60
[75532.853198]  [] ? btrfs_free_path+0x2a/0x40 [btrfs]
[75532.860476]  []
btrfs_commit_transaction+0x3c6/0x820 [btrfs]
[75532.868599]  [] ? wait_current_trans+0x28/0x110 [btrfs]
[75532.876264]  [] ? join_transaction+0x25/0x250 [btrfs]
[75532.883762]  [] ? wake_up_bit+0x40/0x40
[75532.889839]  [] btrfs_sync_fs+0x59/0xd0 [btrfs]
[75532.896703]  [] btrfs_ioctl+0x495/0xd50 [btrfs]
[75532.903544]  [] ? inode_has_perm+0x30/0x40
[75532.909902]  [] ? file_has_perm+0xdc/0xf0
[75532.916205]  [] do_vfs_ioctl+0x9a/0x5a0
[75532.922276]  [] sys_ioctl+0xa1/0xb0
[75532.927988]  [] system_call_fastpath+0x16/0x1b
[75532.934755] ---[ end trace a10c532625ad12af ]---

Regards,
Christian


Re: WARNING: at fs/btrfs/inode.c:2114

2011-10-20 Thread Christian Brunner
2011/10/20 Liu Bo :
> On 10/17/2011 11:23 PM, Christian Brunner wrote:
>> 2011/10/11 Christian Brunner :
>>
>> I have updated to a 3.0.6 kernel, with all the btrfs patches from
>> josef's git repo this weekend. But I'm still seeing the following
>> warning:
>>
>
> Hi,
>
> Would you try with this patch:
>
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/13728
>

I have now applied the patch josef sent to the list (Btrfs: use the
global reserve when truncating the free space cache inode), but the
warning is still there:

[   69.153400] [ cut here ]
[   69.158669] WARNING: at fs/btrfs/inode.c:2114
btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
[   69.167984] Hardware name: ProLiant DL180 G6
[   69.173037] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc
bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support
ixgbe dca mdio i7core_edac edac_core iomemory_vsl(P) hpsa squashfs
[last unloaded: scsi_wait_scan]
[   69.197502] Pid: 3426, comm: ceph-osd Tainted: P
3.0.6-1.fits.8.el6.x86_64 #1
[   69.206591] Call Trace:
[   69.209389]  [] warn_slowpath_common+0x7f/0xc0
[   69.216144]  [] warn_slowpath_null+0x1a/0x20
[   69.222647]  [] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
[   69.230550]  [] commit_fs_roots+0xc5/0x1b0 [btrfs]
[   69.237698]  [] ? mutex_lock+0x31/0x60
[   69.243707]  [] ? btrfs_free_path+0x2a/0x40 [btrfs]
[   69.250966]  []
btrfs_commit_transaction+0x3c6/0x820 [btrfs]
[   69.259087]  [] ? wait_current_trans+0x28/0x110 [btrfs]
[   69.266720]  [] ? join_transaction+0x25/0x250 [btrfs]
[   69.274136]  [] ? wake_up_bit+0x40/0x40
[   69.280210]  [] btrfs_sync_fs+0x59/0xd0 [btrfs]
[   69.287072]  [] btrfs_ioctl+0x495/0xd50 [btrfs]
[   69.293922]  [] ? inode_has_perm+0x30/0x40
[   69.300286]  [] ? file_has_perm+0xdc/0xf0
[   69.306558]  [] do_vfs_ioctl+0x9a/0x5a0
[   69.312657]  [] sys_ioctl+0xa1/0xb0
[   69.318352]  [] system_call_fastpath+0x16/0x1b
[   69.325107] ---[ end trace 2fd1a5665203d8e3 ]---


Thanks,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Christian Brunner
2011/10/24 Chris Mason :
> On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
>> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> > [adding linux-btrfs to cc]
>> >
>> > Josef, Chris, any ideas on the below issues?
>> >
>> > On Mon, 24 Oct 2011, Christian Brunner wrote:
>> > > Thanks for explaining this. I don't have any objections against btrfs
>> > > as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
>> > > scare me, since I can use the ceph replication to recover a lost
>> > > btrfs-filesystem. The only problem I have is, that btrfs is not stable
>> > > on our side and I wonder what you are doing to make it work. (Maybe
>> > > it's related to the load pattern of using ceph as a backend store for
>> > > qemu).
>> > >
>> > > Here is a list of the btrfs problems I'm having:
>> > >
>> > > - When I run ceph with the default configuration (btrfs snaps enabled)
>> > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
>> > > Btrfs-cleaner is using more and more time in
>> > > btrfs_clean_old_snapshots().
>> >
>> > In theory, there shouldn't be any significant difference between taking a
>> > snapshot and removing it a few commits later, and the prior root refs that
>> > btrfs holds on to internally until the new commit is complete.  That's
>> > clearly not quite the case, though.
>> >
>> > In any case, we're going to try to reproduce this issue in our
>> > environment.
>> >
>>
>> I've noticed this problem too, clean_old_snapshots is taking quite a while in
>> cases where it really shouldn't.  I will see if I can come up with a 
>> reproducer
>> that doesn't require setting up ceph ;).
>
> This sounds familiar though, I thought we had fixed a similar
> regression.  Either way, Arne's readahead code should really help.
>
> Which kernel version were you running?
>
> [ ack on the rest of Josef's comments ]

This was with a 3.0 kernel, including all btrfs patches from josef's
git repo plus the "use the global reserve when truncating the free
space cache inode" patch.

I'll try the readahead code.

Thanks,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik :
> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
>> 2011/10/24 Josef Bacik :
>> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> >> [adding linux-btrfs to cc]
>> >>
>> >> Josef, Chris, any ideas on the below issues?
>> >>
>> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
>> >> >
>> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
>> >> > slightly better. I can run an OSD for about 3 days without problems,
>> >> > but then again the load increases. This time, I can see that the
>> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
>> >> > than usual.
>> >>
>> >> FYI in this scenario you're exposed to the same journal replay issues that
>> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
>> >> not be all that special, though, so this problem shouldn't be unique to
>> >> ceph.
>> >>
>> >
>> > Can you get sysrq+w when this happens?  I'd like to see what 
>> > btrfs-endio-write
>> > is up to.
>>
>> Capturing this seems to be not easy. I have a few traces (see
>> attachment), but with sysrq+w I do not get a stacktrace of
>> btrfs-endio-write. What I have is a "latencytop -c" output which is
>> interesting:
>>
>> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> tries to balance the load over all OSDs, so all filesystems should get
>> an nearly equal load. At the moment one filesystem seems to have a
>> problem. When running with iostat I see the following
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> avgrq-sz avgqu-sz   await  svctm  %util
>> sdd               0.00     0.00    0.00    4.33     0.00    53.33
>> 12.31     0.08   19.38  12.23   5.30
>> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
>> 8.57    74.33  380.76   2.74  62.57
>> sdb               0.00     0.00    0.00    1.33     0.00    16.00
>> 12.00     0.03   25.00 19.75 2.63
>> sda               0.00     0.00    0.00    0.67     0.00     8.00
>> 12.00     0.01   19.50  12.50   0.83
>>
>> The PID of the ceph-osd taht is running on sdc is 2053 and when I look
>> with top I see this process and a btrfs-endio-writer (PID 5447):
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
>>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
>>
>> In the latencytop output you can see that those processes have a much
>> higher latency, than the other ceph-osd and btrfs-endio-writers.
>>
>
> I'm seeing a lot of this
>
>        [schedule]      1654.6 msec         96.4 %
>                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>                generic_write_sync blkdev_aio_write do_sync_readv_writev
>                do_readv_writev vfs_writev sys_writev system_call_fastpath
>
> where ceph-osd's latency is mostly coming from this fsync of a block device
> directly, and not so much being tied up by btrfs directly.  With 22% CPU being
> taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
> perf
> record -ag when this is going on and then perf report so we can see what
> btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to 
> get
> only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
> of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
> horribly wrong or introducing a lot of latency.  Most of it seems to be when
> running the dleayed refs and having to read in blocks.  I've been suspecting 
> for
> a while that the delayed ref stuff ends up doing way more work than it needs 
> to
> be per task, and it's possible that btrfs-endio-wri is simply getting screwed 
> by
> other people doing work.
>
> At this point it seems like the biggest problem with latency in ceph-osd is 
> not
> related to btrfs, the latency seems to all be from the fact that ceph-osd is
> fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems 
> like
> its blowing a lot of CPU time, so perf record -ag is probably going to be your
> best bet when it's using lots of cpu so we can figure out what it's spinning 
> on.

Attached is a perf-report. I have included the whole report, so that
you can see the difference between the good and the bad
btrfs-endio-wri.
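
The report was generated roughly like this (duration and file names are
just examples):

  # perf record -ag -o perf.data -- sleep 60
  # perf report -i perf.data > perf.report
  # bzip2 perf.report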

Thanks,
Christian


perf.report.bz2
Description: BZip2 compressed data


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik :
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> 2011/10/25 Josef Bacik :
>> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
[...]
>> >>
>> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> >> tries to balance the load over all OSDs, so all filesystems should get
>> >> an nearly equal load. At the moment one filesystem seems to have a
>> >> problem. When running with iostat I see the following
>> >>
>> >> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> >> avgrq-sz avgqu-sz   await  svctm  %util
>> >> sdd               0.00     0.00    0.00    4.33     0.00    53.33
>> >> 12.31     0.08   19.38  12.23   5.30
>> >> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
>> >> 8.57    74.33  380.76   2.74  62.57
>> >> sdb               0.00     0.00    0.00    1.33     0.00    16.00
>> >> 12.00     0.03   25.00 19.75 2.63
>> >> sda               0.00     0.00    0.00    0.67     0.00     8.00
>> >> 12.00     0.01   19.50  12.50   0.83
>> >>
>> >> The PID of the ceph-osd taht is running on sdc is 2053 and when I look
>> >> with top I see this process and a btrfs-endio-writer (PID 5447):
>> >>
>> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
>> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
>> >>
>> >> In the latencytop output you can see that those processes have a much
>> >> higher latency, than the other ceph-osd and btrfs-endio-writers.
>> >>
>> >
>> > I'm seeing a lot of this
>> >
>> >        [schedule]      1654.6 msec         96.4 %
>> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
>> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
>> >
>> > where ceph-osd's latency is mostly coming from this fsync of a block device
>> > directly, and not so much being tied up by btrfs directly.  With 22% CPU 
>> > being
>> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
>> > perf
>> > record -ag when this is going on and then perf report so we can see what
>> > btrfs-endio-wri is doing with the cpu.  You can drill down in perf report 
>> > to get
>> > only what btrfs-endio-wri is doing, so that would be best.  As far as the 
>> > rest
>> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing 
>> > anything
>> > horribly wrong or introducing a lot of latency.  Most of it seems to be 
>> > when
>> > running the dleayed refs and having to read in blocks.  I've been 
>> > suspecting for
>> > a while that the delayed ref stuff ends up doing way more work than it 
>> > needs to
>> > be per task, and it's possible that btrfs-endio-wri is simply getting 
>> > screwed by
>> > other people doing work.
>> >
>> > At this point it seems like the biggest problem with latency in ceph-osd 
>> > is not
>> > related to btrfs, the latency seems to all be from the fact that ceph-osd 
>> > is
>> > fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems 
>> > like
>> > its blowing a lot of CPU time, so perf record -ag is probably going to be 
>> > your
>> > best bet when it's using lots of cpu so we can figure out what it's 
>> > spinning on.
>>
>> Attached is a perf-report. I have included the whole report, so that
>> you can see the difference between the good and the bad
>> btrfs-endio-wri.
>>
>
> We also shouldn't be running run_ordered_operations, man this is screwed up,
> thanks so much for this, I should be able to nail this down pretty easily.

Please note that this is with "btrfs snaps disabled" in the ceph conf.
When I enable snaps, our problems get worse (the btrfs-cleaner thing),
but I would be glad if this one thing gets solved. I can run debugging
with snaps enabled if you want, but I would suggest that we do this
afterwards.

Thanks,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Sage Weil :
> On Tue, 25 Oct 2011, Josef Bacik wrote:
>> At this point it seems like the biggest problem with latency in ceph-osd
>> is not related to btrfs, the latency seems to all be from the fact that
>> ceph-osd is fsyncing a block dev for whatever reason.
>
> There is one place where we sync_file_range() on the journal block device,
> but that should only happen if directio is disabled (it's on by default).
>
> Christian, have you tweaked those settings in your ceph.conf?  It would be
> something like 'journal dio = false'.  If not, can you verify that
> directio shows true when the journal is initialized from your osd log?
> E.g.,
>
>  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 
> 14: 104857600 bytes, block size 4096 bytes, directio = 1
>
> If directio = 1 for you, something else funky is causing those
> blkdev_fsync's...

I've looked it up in the logs - directio is 1:

Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
/dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
bytes, directio = 1
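
(That line is easy to pull out of the osd log with a plain grep, e.g.

  # grep "journal _open" /var/log/ceph/osd.000.log

- the log path is just an example, adjust to your setup.)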

Regards,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik :
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
>> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
>> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> > >
>> > > Attached is a perf-report. I have included the whole report, so that
>> > > you can see the difference between the good and the bad
>> > > btrfs-endio-wri.
>> > >
>> >
>> > We also shouldn't be running run_ordered_operations, man this is screwed 
>> > up,
>> > thanks so much for this, I should be able to nail this down pretty easily.
>> > Thanks,
>>
>> Looks like we're getting there from reserve_metadata_bytes when we join
>> the transaction?
>>
>
> We don't do reservations in the endio stuff, we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace.  Though it does look like perf is lying to us in at least one case
> sicne btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Strange! I'll check tomorrow whether the symbols got messed up in the report.
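
(A quick way to check is something like

  # perf buildid-list -i perf.data
  # perf report -v -i perf.data 2>&1 | grep -i "not found"

- perf report -v prints the build id of every DSO and complains about
the ones it cannot resolve symbols for.)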

Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Christian Brunner
2011/10/26 Sage Weil :
> On Wed, 26 Oct 2011, Christian Brunner wrote:
>> >> > Christian, have you tweaked those settings in your ceph.conf?  It would 
>> >> > be
>> >> > something like 'journal dio = false'.  If not, can you verify that
>> >> > directio shows true when the journal is initialized from your osd log?
>> >> > E.g.,
>> >> >
>> >> >  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal 
>> >> > fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>> >> >
>> >> > If directio = 1 for you, something else funky is causing those
>> >> > blkdev_fsync's...
>> >>
>> >> I've looked it up in the logs - directio is 1:
>> >>
>> >> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
>> >> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
>> >> bytes, directio = 1
>> >
>> > Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
>> > is coming from.
>>
>> Here is an strace. I can see a lot of sync_file_range operations.
>
> Yeah, these all look like the flusher thread, and shouldn't be hitting
> blkdev_fsync.  Can you confirm that with
>
>        filestore flusher = false
>        filestore sync flush = false
>
> you get no sync_file_range at all?  I wonder if this is also perf lying
> about the call chain.

Yes, setting this makes the sync_file_range calls go away.
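
(For reference, the two lines go into the [osd] section of ceph.conf; the
check itself was simply watching the daemon with something like

  # strace -f -e trace=sync_file_range -p <pid of the ceph-osd>

and seeing that nothing shows up anymore.)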

Is it safe to use these settings with "filestore btrfs snap = 0"?

Thanks,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Christian Brunner
2011/10/26 Christian Brunner :
> 2011/10/25 Josef Bacik :
>> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
>>> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
>>> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>>> > >
>>> > > Attached is a perf-report. I have included the whole report, so that
>>> > > you can see the difference between the good and the bad
>>> > > btrfs-endio-wri.
>>> > >
>>> >
>>> > We also shouldn't be running run_ordered_operations, man this is screwed 
>>> > up,
>>> > thanks so much for this, I should be able to nail this down pretty easily.
>>> > Thanks,
>>>
>>> Looks like we're getting there from reserve_metadata_bytes when we join
>>> the transaction?
>>>
>>
>> We don't do reservations in the endio stuff, we assume you've reserved all 
>> the
>> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
>> the trace.  Though it does look like perf is lying to us in at least one case
>> sicne btrfs_alloc_logged_file_extent is only called from log replay and not
>> during normal runtime, so it definitely shouldn't be showing up.  Thanks,
>
> Strange! - I'll check if symbols got messed up in the report tomorrow.

I've checked this now: except for the missing symbols for the
iomemory_vsl module, everything looks normal.

I've also run the report on another OSD, and the results look quite
similar.

Regards,
Christian

PS: This is what perf report -v is saying...

build id event received for [kernel.kallsyms]:
805ca93f4057cc0c8f53b061a849b3f847f2de40
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/fs/btrfs/btrfs.ko:
64a723e05af3908fb9593f4a3401d6563cb1a01b
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/lib/libcrc32c.ko:
b1391be8d33b54b6de20e07b7f2ee8d777fc09d2
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/bonding/bonding.ko:
663392df0f407211ab8f9527c482d54fce890c5e
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/scsi/hpsa.ko:
676eecffd476aef1b0f2f8c1bf8c8e6120d369c9
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko:
db7c200894b27e71ae6fe5cf7adaebf787c90da9
build id event received for [iomemory_vsl]:
4ed417c9a815e6bbe77a1656bceda95d9f06cb13
build id event received for /lib64/libc-2.12.so:
2ab28d41242ede641418966ef08f9aacffd9e8c7
build id event received for /lib64/libpthread-2.12.so:
c177389a6f119b3883ea0b3c33cb04df3f8e5cc7
build id event received for /sbin/rsyslogd:
1372ef1e2ec550967fe20d0bdddbc0aab0bb36dc
build id event received for /lib64/libglib-2.0.so.0.2200.5:
d880be15bf992b5fbcc629e6bbf1c747a928ddd5
build id event received for /usr/sbin/irqbalance:
842de64f46ca9fde55efa29a793c08b197d58354
build id event received for /lib64/libm-2.12.so:
46ac89195918407d2937bd1450c0ec99c8d41a2a
build id event received for /usr/bin/ceph-osd:
9fcb36e020c49fc49171b4c88bd784b38eb0675b
build id event received for /usr/lib64/libstdc++.so.6.0.13:
d1b2ca4e1ec8f81ba820e5f1375d960107ac7e50
build id event received for /usr/lib64/libtcmalloc.so.0.2.0:
02766551b2eb5a453f003daee0c5fc9cd176e831
Looking at the vmlinux_path (6 entries long)
dso__load_sym: cannot get elf header.
Using /proc/kallsyms for symbols
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id
4ed417c9a815e6bbe77a1656bceda95d9f06cb13 was found
[iomemory_vsl] with build id 4ed417c9a815e6bbe77a1656bceda95d9f06cb13
not found, continuing without symbols