3.10.0 failed paging request from kthread_data

2013-07-17 Thread Jim Schutt
Hi,

I'm trying to test the btrfs and ceph contributions to 3.11, without
testing all of 3.11-rc1 (just yet), so I'm testing with the "next"
branch of Chris Mason's tree  (commit cbacd76bb3 from
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git)
merged into the for-linus branch of the ceph tree (commit 8b8cf8917f
from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git)

One of my ceph clients hit this:

[94633.463166] BUG: unable to handle kernel paging request at ffa8
[94633.464003] IP: [8106a070] kthread_data+0x10/0x20
[94633.464003] PGD 1a0c067 PUD 1a0e067 PMD 0 
[94633.464003] Oops:  [#2] SMP 
[94633.464003] Modules linked in: cbc ceph libceph ib_ipoib rdma_ucm ib_ucm 
ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror 
dm_region_hash dm_log dm_multipath scsi_dh scsi_mod vhost_net macvtap macvlan 
tun uinput iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm microcode button 
serio_raw pcspkr ehci_pci ehci_hcd ib_mthca ib_mad ib_core lpc_ich mfd_core 
uhci_hcd i5k_amb i5000_edac edac_core dm_mod nfsv4 nfsv3 nfs_acl nfsv2 nfs 
lockd sunrpc fscache broadcom tg3 bnx2 igb ptp pps_core i2c_algo_bit i2c_core 
dca hwmon e1000
[94633.464003] CPU: 0 PID: 78416 Comm: kworker/0:1 Tainted: G  D W
3.10.0-00119-g2925339 #601
[94633.464003] Hardware name: Dell Inc. PowerEdge 1950/0NK937, BIOS 1.1.0 
06/21/2006
[94633.464003] task: 880415b6 ti: 88040e39a000 task.ti: 
88040e39a000
[94633.464003] RIP: 0010:[8106a070]  [8106a070] 
kthread_data+0x10/0x20
[94633.464003] RSP: 0018:88040e39b7f8  EFLAGS: 00010092
[94633.464003] RAX:  RBX:  RCX: 81d30320
[94633.464003] RDX:  RSI:  RDI: 880415b6
[94633.464003] RBP: 88040e39b7f8 R08: 880415b60070 R09: 0001
[94633.464003] R10:  R11:  R12: 
[94633.464003] R13: 880415b603e8 R14: 0001 R15: 0002
[94633.464003] FS:  () GS:88042fc0() 
knlGS:
[94633.464003] CS:  0010 DS:  ES:  CR0: 8005003b
[94633.464003] CR2: 0028 CR3: 000415f77000 CR4: 07f0
[94633.464003] DR0:  DR1:  DR2: 
[94633.464003] DR3:  DR6: 0ff0 DR7: 0400
[94633.464003] Stack:
[94633.464003]  88040e39b818 810602a5 88040e39b818 
88042fc139c0
[94633.464003]  88040e39b8a8 814ef79e 8804 
88040e39bfd8
[94633.464003]  88040e39a000 88040e39a000 88040e39a010 
88040e39a000
[94633.464003] Call Trace:
[94633.464003]  [810602a5] wq_worker_sleeping+0x15/0xa0
[94633.464003]  [814ef79e] __schedule+0x17e/0x6b0
[94633.464003]  [814efefd] schedule+0x5d/0x60
[94633.464003]  [8104717b] do_exit+0x3eb/0x440
[94633.464003]  [814f33f8] oops_end+0xd8/0xf0
[94633.464003]  [810362df] no_context+0x1bf/0x1e0
[94633.464003]  [810364f5] __bad_area_nosemaphore+0x1f5/0x230
[94633.464003]  [81036543] bad_area_nosemaphore+0x13/0x20
[94633.464003]  [814f6406] __do_page_fault+0x416/0x4b0
[94633.464003]  [810869ae] ? idle_balance+0x14e/0x180
[94633.464003]  [81077a1f] ? finish_task_switch+0x3f/0x110
[94633.464003]  [814f29e3] ? error_sti+0x5/0x6
[94633.464003]  [8109e859] ? trace_hardirqs_off_caller+0x29/0xd0
[94633.464003]  [8128c6dd] ? trace_hardirqs_off_thunk+0x3a/0x3c
[94633.464003]  [814f64ae] do_page_fault+0xe/0x10
[94633.464003]  [814f27e2] page_fault+0x22/0x30
[94633.464003]  [81285a47] ? rb_erase+0x297/0x3a0
[94633.464003]  [a02b45d8] __remove_osd+0x98/0xd0 [libceph]
[94633.464003]  [a02b49c3] __reset_osd+0xa3/0x1c0 [libceph]
[94633.464003]  [a02b6c5b] ? osd_reset+0x9b/0xd0 [libceph]
[94633.464003]  [a02b695b] __kick_osd_requests+0x7b/0x2e0 [libceph]
[94633.464003]  [a02b6c66] osd_reset+0xa6/0xd0 [libceph]
[94633.464003]  [a02aeb65] con_work+0x445/0x4a0 [libceph]
[94633.464003]  [810635b5] process_one_work+0x2e5/0x510
[94633.464003]  [81063510] ? process_one_work+0x240/0x510
[94633.464003]  [81064975] worker_thread+0x215/0x340
[94633.464003]  [81064760] ? manage_workers+0x170/0x170
[94633.464003]  [8106aa61] kthread+0xe1/0xf0
[94633.464003]  [8106a980] ? __init_kthread_worker+0x70/0x70
[94633.464003]  [814faf5c] ret_from_fork+0x7c/0xb0
[94633.464003]  [8106a980] ? __init_kthread_worker+0x70/0x70
[94633.464003] Code: 90 03 00 00 48 8b 40 98 c9 48 c1 e8 02 83 e0 01 c3 66 2e 
0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 90 03 00 00 <48> 8b 
40 a8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 
[94633.464003] RIP  [8106a070] kthread_data+0x10/0x20
[94633.464003]  RSP 
[94633.464003] CR2: ffa8
[94633.464003] ---[ end trace 89622896705a7fac ]---
[94633.464003] Fixing recursive fault but reboot is needed!
[94633.464003] [ cut here ]

kthread_data disassembles to this:

(gdb) disassemble kthread_data
Dump of assembler code for function kthread_data:
   0x8106a060 <+0>:  push   %rbp
   0x8106a061 <+1>:  mov    %rsp,%rbp
   0x8106a064 <+4>:  callq  0x814fabc0
   0x8106a069 <+9>:  mov    0x390(%rdi),%rax
   0x8106a070 <+16>: 


Re: 3.10-rc3 NFSv3 mount issues

2013-05-30 Thread Jim Schutt
On 05/30/2013 02:26 PM, Chuck Lever wrote:
> 
> On May 30, 2013, at 4:19 PM, Jim Schutt  wrote:
> 
>> Hi,
>>
>> I've been trying to test 3.10-rc3 on some diskless clients, and found
>> that I can no longer mount my root file system via NFSv3.
>>

> 
> 3.10-rc3 appears to be missing the fix for this.  See:
> 
>   http://marc.info/?l=linux-nfs&m=136855668104598&w=2
> 
> Trond, can we get this applied?
> 

That works for me.

Thanks!

-- Jim


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


3.10-rc3 NFSv3 mount issues

2013-05-30 Thread Jim Schutt
Hi,

I've been trying to test 3.10-rc3 on some diskless clients, and found
that I can no longer mount my root file system via NFSv3.

I poked around looking at NFS changes for 3.10, and found these two
commits:

 d497ab9751 "NFSv3: match sec= flavor against server list"
 4580a92d44 "NFS: Use server-recommended security flavor by default (NFSv3)"

If I revert both of these commits from 3.10-rc3, then my diskless
client can mount its root file system.

The busybox mount command fails like this, when using 3.10-rc3:

/ # mount  -t nfs -o ro,nolock,vers=3,proto=tcp 
172.17.0.122:/gmi/images/jaschut/ceph.toss-2x /mnt
mount: mounting 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x on /mnt failed: 
Invalid argument

The commit messages for both these commits seem to say that mounting
with the "sec=sys" option should work, but unfortunately, my busybox doesn't
seem to understand the "sec=" mount option:

/ # mount  -t nfs -o ro,nolock,vers=3,proto=tcp,sec=sys 
172.17.0.122:/gmi/images/jaschut/ceph.toss-2x /mnt
mount: invalid number 'sys'

My NFS server is based on RHEL6, and is not using any "sec=" option
in its export for this file system.  I did try exporting with "sec=sys",
but it didn't seem to make any difference either.

So far, this seems like a regression to me.
Any ideas what I might be doing wrong?  How can I
help make this work again?

Thanks -- Jim





Re: [PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic

2013-05-15 Thread Jim Schutt
On 05/15/2013 10:49 AM, Alex Elder wrote:
> On 05/15/2013 11:38 AM, Jim Schutt wrote:
>> Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while
>> holding a lock, but it's spoiled because ceph_pagelist_addpage() always
>> calls kmap(), which might sleep.  Here's the result:
> This looks good to me, but I admit I didn't take as close
> a look at it this time.
> 
> I appreciate your updating the series to include the things
> I mentioned.
> 
> I'll commit these for you, and I'll get confirmation on the
> byte order thing as well.

Cool.  Thanks, Alex.

-- Jim

> 
> Reviewed-by: Alex Elder 
> 
> 




[PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic

2013-05-15 Thread Jim Schutt
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while
holding a lock, but it's spoiled because ceph_pagelist_addpage() always
calls kmap(), which might sleep.  Here's the result:

[13439.295457] ceph: mds0 reconnect start
[13439.300572] BUG: sleeping function called from invalid context at 
include/linux/highmem.h:58
[13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
[13439.316464] 5 locks held by kworker/1:1/12059:
[13439.320998]  #0:  (ceph-msgr){..}, at: [810609f8] 
process_one_work+0x218/0x480
[13439.329701]  #1:  ((&(&con->work)->work)){..}, at: [810609f8] 
process_one_work+0x218/0x480
[13439.339446]  #2:  (&s->s_mutex){..}, at: [a046273c] 
send_mds_reconnect+0xec/0x450 [ceph]
[13439.349081]  #3:  (&mdsc->snap_rwsem){..}, at: [a04627be] 
send_mds_reconnect+0x16e/0x450 [ceph]
[13439.359278]  #4:  (file_lock_lock){..}, at: [811cadf5] 
lock_flocks+0x15/0x20
[13439.367816] Pid: 12059, comm: kworker/1:1 Tainted: GW
3.9.0-00358-g308ae61 #557
[13439.376225] Call Trace:
[13439.378757]  [81076f4c] __might_sleep+0xfc/0x110
[13439.384353]  [a03f4ce0] ceph_pagelist_append+0x120/0x1b0 [libceph]
[13439.391491]  [a0448fe9] ceph_encode_locks+0x89/0x190 [ceph]
[13439.398035]  [814ee849] ? _raw_spin_lock+0x49/0x50
[13439.403775]  [811cadf5] ? lock_flocks+0x15/0x20
[13439.409277]  [a045e2af] encode_caps_cb+0x41f/0x4a0 [ceph]
[13439.415622]  [81196748] ? igrab+0x28/0x70
[13439.420610]  [a045e9f8] ? iterate_session_caps+0xe8/0x250 [ceph]
[13439.427584]  [a045ea25] iterate_session_caps+0x115/0x250 [ceph]
[13439.434499]  [a045de90] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
[13439.441646]  [a0462888] send_mds_reconnect+0x238/0x450 [ceph]
[13439.448363]  [a0464542] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
[13439.455250]  [a0462e42] check_new_map+0x352/0x500 [ceph]
[13439.461534]  [a04631ad] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
[13439.468432]  [814ebc7e] ? mutex_unlock+0xe/0x10
[13439.473934]  [a043c612] extra_mon_dispatch+0x22/0x30 [ceph]
[13439.480464]  [a03f6c2c] dispatch+0xbc/0x110 [libceph]
[13439.486492]  [a03eec3d] process_message+0x1ad/0x1d0 [libceph]
[13439.493190]  [a03f1498] ? read_partial_message+0x3e8/0x520 
[libceph]
[13439.500583]  [81415184] ? kernel_recvmsg+0x44/0x60
[13439.506324]  [a03ef3a8] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[13439.513140]  [a03f2aae] try_read+0x5fe/0x7e0 [libceph]
[13439.519246]  [a03f39f8] con_work+0x378/0x4a0 [libceph]
[13439.525345]  [8107792f] ? finish_task_switch+0x3f/0x110
[13439.531515]  [81060a95] process_one_work+0x2b5/0x480
[13439.537439]  [810609f8] ? process_one_work+0x218/0x480
[13439.543526]  [81064185] worker_thread+0x1f5/0x320
[13439.549191]  [81063f90] ? manage_workers+0x170/0x170
[13439.555102]  [81069641] kthread+0xe1/0xf0
[13439.560075]  [81069560] ? __init_kthread_worker+0x70/0x70
[13439.566419]  [814f7edc] ret_from_fork+0x7c/0xb0
[13439.571918]  [81069560] ? __init_kthread_worker+0x70/0x70
[13439.587132] ceph: mds0 reconnect success
[13490.720032] ceph: mds0 caps stale
[13501.235257] ceph: mds0 recovery completed
[13501.300419] ceph: mds0 caps renewed

Fix it up by encoding locks into a buffer first, and when the
number of encoded locks is stable, copy that into a ceph_pagelist.

Signed-off-by: Jim Schutt 
---
 fs/ceph/locks.c  |   76 +++---
 fs/ceph/mds_client.c |   65 +++---
 fs/ceph/super.h  |9 -
 3 files changed, 89 insertions(+), 61 deletions(-)

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 4518313..8978851 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -191,29 +191,23 @@ void ceph_count_locks(struct inode *inode, int 
*fcntl_count, int *flock_count)
 }
 
 /**
- * Encode the flock and fcntl locks for the given inode into the pagelist.
- * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
- * sequential flock locks.
- * Must be called with lock_flocks() already held.
- * If we encounter more of a specific lock type than expected,
- * we return the value 1.
+ * Encode the flock and fcntl locks for the given inode into the ceph_filelock
+ * array. Must be called with lock_flocks() already held.
+ * If we encounter more of a specific lock type than expected, return -ENOSPC.
  */
-int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist,
- int num_fcntl_locks, int num_flock_locks)
+int ceph_encode_locks_to_buffer(struct inode *inode,
+   struct ceph_filelock *flocks,
+   int num_fcntl_locks, int num_flock_locks)
 {
struct file_lock *lock;
-   struct ceph_filelock cephlock;
int err = 0;
int seen_fcntl = 0;
int seen_flock = 0;
-   __le32 nlocks;
+   int l = 0;
 
dout("encoding %d flock and %d fcntl locks", num_flock_locks,
 num_fcntl_locks);
-   nlocks = cpu_to_le32(num_fcntl_locks);
-   err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
-   if (err)
-   goto fail;
+
for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) {
if (lock->fl_flags & FL_POSIX) {
++seen_fcntl;
@@ -221,20 +215,1

[PATCH v2 2/3] ceph: add missing cpu_to_le32() calls when encoding a reconnect capability

2013-05-15 Thread Jim Schutt
In his review, Alex Elder mentioned that he hadn't checked that num_fcntl_locks
and num_flock_locks were properly decoded on the server side, from a le32
over-the-wire type to a cpu type.  I checked, and AFAICS it is done; those
interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and
src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph).

I also checked the server side for flock_len decoding, and I believe that
also happens correctly, by virtue of having been declared __le32 in
struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h.

Signed-off-by: Jim Schutt 
---
 fs/ceph/locks.c  |7 +--
 fs/ceph/mds_client.c |2 +-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index ffc86cb..4518313 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -206,10 +206,12 @@ int ceph_encode_locks(struct inode *inode, struct 
ceph_pagelist *pagelist,
int err = 0;
int seen_fcntl = 0;
int seen_flock = 0;
+   __le32 nlocks;
 
dout("encoding %d flock and %d fcntl locks", num_flock_locks,
 num_fcntl_locks);
-   err = ceph_pagelist_append(pagelist, &num_fcntl_locks, sizeof(u32));
+   nlocks = cpu_to_le32(num_fcntl_locks);
+   err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
if (err)
goto fail;
for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) {
@@ -229,7 +231,8 @@ int ceph_encode_locks(struct inode *inode, struct 
ceph_pagelist *pagelist,
goto fail;
}
 
-   err = ceph_pagelist_append(pagelist, &num_flock_locks, sizeof(u32));
+   nlocks = cpu_to_le32(num_flock_locks);
+   err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
if (err)
goto fail;
for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..d9ca152 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2485,7 +2485,7 @@ static int encode_caps_cb(struct inode *inode, struct 
ceph_cap *cap,
lock_flocks();
ceph_count_locks(inode, &num_fcntl_locks,
 &num_flock_locks);
-   rec.v2.flock_len = (2*sizeof(u32) +
+   rec.v2.flock_len = cpu_to_le32(2*sizeof(u32) +
(num_fcntl_locks+num_flock_locks) *
sizeof(struct ceph_filelock));
unlock_flocks();
-- 
1.7.8.2




[PATCH v2 1/3] ceph: fix up comment for ceph_count_locks() as to which lock to hold

2013-05-15 Thread Jim Schutt

Signed-off-by: Jim Schutt 
---
 fs/ceph/locks.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 202dd3d..ffc86cb 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -169,7 +169,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock 
*fl)
 }
 
 /**
- * Must be called with BKL already held. Fills in the passed
+ * Must be called with lock_flocks() already held. Fills in the passed
  * counter variables, so you can prepare pagelist metadata before calling
  * ceph_encode_locks.
  */
-- 
1.7.8.2




[PATCH v2 0/3] ceph: fix might_sleep while atomic

2013-05-15 Thread Jim Schutt
This patch series fixes an issue where encode_caps_cb() ends up
calling kmap() while holding a spinlock.

The original patch (http://www.spinics.net/lists/ceph-devel/msg14573.html)
was expanded to fix some other outstanding issues noticed by Alex Elder
in his review.

Changes from the original version:

 - clean up a comment
 - add missing calls to cpu_to_le32()
 - eliminate unnecessary loop around calls to ceph_pagelist_append()
 - only loop on calls to ceph_count_locks() and ceph_encode_locks_to_buffer()
 if the number of locks changes; exit on other errors
 - return proper error code from encode_caps_cb() if kmalloc fails

In his review, Alex mentioned that he hadn't checked that num_fcntl_locks
and num_flock_locks were properly decoded on the server side, from a le32
over-the-wire type to a cpu type.  I checked, and AFAICS it is done; those
interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and
src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph).

I also checked the server side for flock_len decoding, and I believe that
also happens correctly, by virtue of having been declared __le32 in
struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h.

Jim Schutt (3):
  ceph: fix up comment for ceph_count_locks() as to which lock to hold
  ceph: add missing cpu_to_le32() calls when encoding a reconnect capability
  ceph: ceph_pagelist_append might sleep while atomic

 fs/ceph/locks.c  |   75 --
 fs/ceph/mds_client.c |   65 +++
 fs/ceph/super.h  |9 -
 3 files changed, 90 insertions(+), 59 deletions(-)

-- 
1.7.8.2




- * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
- * sequential flock locks.
- * Must be called with lock_flocks() already held.
- * If we encounter more of a specific lock type than expected,
- * we return the value 1.
+ * Encode the flock and fcntl locks for the given inode into the ceph_filelock
+ * array. Must be called with lock_flocks() already held.
+ * If we encounter more of a specific lock type than expected, return -ENOSPC.
  */
-int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist,
- int num_fcntl_locks, int num_flock_locks)
+int ceph_encode_locks_to_buffer(struct inode *inode,
+   struct ceph_filelock *flocks,
+   int num_fcntl_locks, int num_flock_locks)
 {
struct file_lock *lock;
-   struct

Re: [PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic

2013-05-15 Thread Jim Schutt
On 05/15/2013 10:49 AM, Alex Elder wrote:
> On 05/15/2013 11:38 AM, Jim Schutt wrote:
>> Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while
>> holding a lock, but it's spoiled because ceph_pagelist_addpage() always
>> calls kmap(), which might sleep.  Here's the result:
> This looks good to me, but I admit I didn't take as close
> a look at it this time.
> 
> I appreciate your updating the series to include the things
> I mentioned.
> 
> I'll commit these for you, and I'll get confirmation on the
> byte order thing as well.

Cool.  Thanks, Alex.

-- Jim

> 
> Reviewed-by: Alex Elder el...@inktank.com




Re: [PATCH] libceph: ceph_pagelist_append might sleep while atomic

2013-05-14 Thread Jim Schutt
On 05/14/2013 10:44 AM, Alex Elder wrote:
> On 05/09/2013 09:42 AM, Jim Schutt wrote:
>> Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while
>> holding a lock, but it's spoiled because ceph_pagelist_addpage() always
>> calls kmap(), which might sleep.  Here's the result:
> 
> I finally took a close look at this today, Jim.  Sorry
> for the delay.
> 

No worries - thanks for taking a look.

> The issue is formatting the reconnect message--which will
> hold an arbitrary amount of data and therefore which we'll
> need to do some allocation (and kmap) for--in the face of
> having to hold the flock spinlock while doing so.
> 
> And as you found, ceph_pagelist_addpage(), which is called
> by ceph_pagelist_append(), calls kmap() even if it doesn't
> need to allocate anything.  This means that despite reserving
> the pages, those pages are in the free list and because they'll
> need to be the subject of kmap() their preallocation doesn't
> help.
> 
> Your solution was to pre-allocate a buffer, format the locks
> into that buffer while holding the lock, then append the
> buffer contents to a pagelist after releasing the lock.  You
> check for a changing (increasing) lock count while you format
> the locks, which is good.
> 
> So...  Given that, I think your change looks good.  It's a shame
> we can't format directly into the pagelist buffer but this won't
> happen much so it's not a big deal.  I have a few small suggestions,
> below.
> 
> I do find some byte order bugs though.   They aren't your doing,
> but I think they ought to be fixed first, as a separate patch
> that would precede this one.  The bug is that the lock counts
> that are put into the buffer (num_fcntl_locks and num_flock_locks)
> are not properly byte-swapped.  I'll point it out inline
> in your code, below.
> 
> I'll say that what you have is OK.  Consider my suggestions, and
> if you choose not to fix the byte order bugs, please let me know.

I'll happily fix up a v2 series with your suggestions addressed.
Thanks for catching those issues.  Stay tuned...

Thanks -- Jim




[PATCH] libceph: ceph_pagelist_append might sleep while atomic

2013-05-09 Thread Jim Schutt
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while
holding a lock, but it's spoiled because ceph_pagelist_addpage() always
calls kmap(), which might sleep.  Here's the result:

[13439.295457] ceph: mds0 reconnect start
[13439.300572] BUG: sleeping function called from invalid context at 
include/linux/highmem.h:58
[13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
[13439.316464] 5 locks held by kworker/1:1/12059:
[13439.320998]  #0:  (ceph-msgr){..}, at: [] 
process_one_work+0x218/0x480
[13439.329701]  #1:  ((&(&con->work)->work)){..}, at: [] 
process_one_work+0x218/0x480
[13439.339446]  #2:  (&s->s_mutex){..}, at: [] 
send_mds_reconnect+0xec/0x450 [ceph]
[13439.349081]  #3:  (&mdsc->snap_rwsem){..}, at: [] 
send_mds_reconnect+0x16e/0x450 [ceph]
[13439.359278]  #4:  (file_lock_lock){..}, at: [] 
lock_flocks+0x15/0x20
[13439.367816] Pid: 12059, comm: kworker/1:1 Tainted: GW
3.9.0-00358-g308ae61 #557
[13439.376225] Call Trace:
[13439.378757]  [] __might_sleep+0xfc/0x110
[13439.384353]  [] ceph_pagelist_append+0x120/0x1b0 [libceph]
[13439.391491]  [] ceph_encode_locks+0x89/0x190 [ceph]
[13439.398035]  [] ? _raw_spin_lock+0x49/0x50
[13439.403775]  [] ? lock_flocks+0x15/0x20
[13439.409277]  [] encode_caps_cb+0x41f/0x4a0 [ceph]
[13439.415622]  [] ? igrab+0x28/0x70
[13439.420610]  [] ? iterate_session_caps+0xe8/0x250 [ceph]
[13439.427584]  [] iterate_session_caps+0x115/0x250 [ceph]
[13439.434499]  [] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
[13439.441646]  [] send_mds_reconnect+0x238/0x450 [ceph]
[13439.448363]  [] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
[13439.455250]  [] check_new_map+0x352/0x500 [ceph]
[13439.461534]  [] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
[13439.468432]  [] ? mutex_unlock+0xe/0x10
[13439.473934]  [] extra_mon_dispatch+0x22/0x30 [ceph]
[13439.480464]  [] dispatch+0xbc/0x110 [libceph]
[13439.486492]  [] process_message+0x1ad/0x1d0 [libceph]
[13439.493190]  [] ? read_partial_message+0x3e8/0x520 
[libceph]
[13439.500583]  [] ? kernel_recvmsg+0x44/0x60
[13439.506324]  [] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[13439.513140]  [] try_read+0x5fe/0x7e0 [libceph]
[13439.519246]  [] con_work+0x378/0x4a0 [libceph]
[13439.525345]  [] ? finish_task_switch+0x3f/0x110
[13439.531515]  [] process_one_work+0x2b5/0x480
[13439.537439]  [] ? process_one_work+0x218/0x480
[13439.543526]  [] worker_thread+0x1f5/0x320
[13439.549191]  [] ? manage_workers+0x170/0x170
[13439.555102]  [] kthread+0xe1/0xf0
[13439.560075]  [] ? __init_kthread_worker+0x70/0x70
[13439.566419]  [] ret_from_fork+0x7c/0xb0
[13439.571918]  [] ? __init_kthread_worker+0x70/0x70
[13439.587132] ceph: mds0 reconnect success
[13490.720032] ceph: mds0 caps stale
[13501.235257] ceph: mds0 recovery completed
[13501.300419] ceph: mds0 caps renewed

Fix it up by encoding locks into a buffer first, and when the
number of encoded locks is stable, copy that into a ceph_pagelist.

Signed-off-by: Jim Schutt 
---
 fs/ceph/locks.c  |   73 +
 fs/ceph/mds_client.c |   62 ++
 fs/ceph/super.h  |9 +-
 3 files changed, 88 insertions(+), 56 deletions(-)

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 202dd3d..9a46161 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -191,27 +191,23 @@ void ceph_count_locks(struct inode *inode, int 
*fcntl_count, int *flock_count)
 }
 
 /**
- * Encode the flock and fcntl locks for the given inode into the pagelist.
- * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
- * sequential flock locks.
- * Must be called with lock_flocks() already held.
- * If we encounter more of a specific lock type than expected,
- * we return the value 1.
+ * Encode the flock and fcntl locks for the given inode into the ceph_filelock
+ * array. Must be called with lock_flocks() already held.
+ * If we encounter more of a specific lock type than expected, return -ENOSPC.
  */
-int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist,
- int num_fcntl_locks, int num_flock_locks)
+int ceph_encode_locks_to_buffer(struct inode *inode,
+   struct ceph_filelock *flocks,
+   int num_fcntl_locks, int num_flock_locks)
 {
struct file_lock *lock;
-   struct ceph_filelock cephlock;
int err = 0;
int seen_fcntl = 0;
int seen_flock = 0;
+   int l = 0;
 
dout("encoding %d flock and %d fcntl locks", num_flock_locks,
 num_fcntl_locks);
-   err = ceph_pagelist_append(pagelist, &num_fcntl_locks, sizeof(u32));
-   if (err)
-   goto fail;
+
for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) {
if (lock->fl_flags & FL_POSIX) {
++seen_fcntl;
@@ -219,19 +215,12 @@ int ceph_encode_locks(struct inode 


Re: 3.7.0-rc8 btrfs locking issue

2012-12-12 Thread Jim Schutt
On 12/11/2012 06:37 PM, Liu Bo wrote:
> On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote:
>> On 12/09/2012 07:04 AM, Liu Bo wrote:
>>> On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
>>> Hi Jim,
>>>
>>> Could you please apply the following patch to test if it works?
>>
>> Hi,
>>
>> So far, with your patch applied I've been unable to reproduce
>> the recursive deadlock.  Thanks a lot for this patch!
>> This issue has been troubling me for a while.
> 
> Hi Jim,
> 
> Good news for us :)
> 
>>
>> I've been trying to learn more about btrfs internals -
>> if you have the time to answer a couple questions about
>> your patch, I'd really appreciate it.
> 
> See below.
> 
>>
>>>
>>> (It's against 3.7-rc8.)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 3d3e2c1..100289b 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root
>>> *root, int data)
>>>  
>>> if (data)
>>> flags = BTRFS_BLOCK_GROUP_DATA;
>>> -   else if (root == root->fs_info->chunk_root)
>>> +   else if (root == root->fs_info->chunk_root ||
>>> +root == root->fs_info->dev_root)
>>> flags = BTRFS_BLOCK_GROUP_SYSTEM;
>>> else
>>> flags = BTRFS_BLOCK_GROUP_METADATA;
>>> @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct
>>> btrfs_root *root, u64 type)
>>> num_dev = 1;/* DUP or single */
>>>  
>>> /* metadata for updaing devices and chunk tree */
>>> +   num_dev = num_dev << 1;
>>
>> AFAICS this is doubling the size of the reserve, which
>> reduces the chance of a recursive do_chunk_alloc(), right?
>>
> 
> Not like that, we hit the deadlock because updating device tree also
> uses METADATA chunk, which may be called when we're actually allocating
> a METADATA chunk, so the patch I sent you makes updating device tree
> use SYSTEM chunk, which we'll have some code to check if it is enough
> before allocating a chunk(if not, we'll allocate a SYSTEM chunk first).
> 
> Here I double the size just because the worst case of allocating a
> DATA/METADATA chunk -may- results in
> 
> 1)adding a SYSTEM chunk +
> 2)adding dev extent per chunk stripe +
> 3)updating chunk stripes's bytes_used
> 
>>> return btrfs_calc_trans_metadata_size(root, num_dev + 1);
>>
>> btrfs_calc_trans_metadata_size(root, num_items) multiplies its
>> num_items argument by another factor of three - do you know if
>> there is there some rationale behind that number, or is it
>> perhaps also an empirically determined factor?
> 
> The height of Btrfs's metadata btree is at most 8,
> leaf is on 0 level while node is at most on 7 level.
> 
> Each btree update may result in COWing a node and its sibling nodes,
> which is where the factor of three comes from
> 
>>
>> What I'm wondering about is that if the size of the reserve is
>> empirically determined, will it need to be increased again
>> later when machines are more capable, and can handle a higher
>> load?
>>
>> Do you think it's feasible to modify the locking for
>> do_chunk_alloc to allow it to recurse without deadlock?
> 
> Well, it could be, but IMO it'll bring us complexity, so worse
> maintainance.
> 
> Any questions? Feel free to ask.

Your response was very helpful.  Thanks a lot!

-- Jim

> 
> thanks,
> liubo
> 
> 




Re: 3.7.0-rc8 btrfs locking issue

2012-12-11 Thread Jim Schutt
On 12/09/2012 07:04 AM, Liu Bo wrote:
> On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
>> > Hi,
>> > 
>> > I'm hitting a btrfs locking issue with 3.7.0-rc8.
>> > 
>> > The btrfs filesystem in question is backing a Ceph OSD
>> > under a heavy write load from many cephfs clients.
>> > 
>> > I reported this issue a while ago:
>> >   http://www.spinics.net/lists/linux-btrfs/msg19370.html
>> > when I was testing what I thought might be part of the
>> > 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree.
>> > 
>> > I spent some time attempting to bisect the btrfs patch queue
>> > just before it was merged for 3.7, but got nowhere due to
>> > false negatives.
>> > 
>> > I've just been able to get back to testing 3.7-rc, and found
>> > that I can still trigger the issue.
> Hi Jim,
> 
> Could you please apply the following patch to test if it works?

Hi,

So far, with your patch applied I've been unable to reproduce
the recursive deadlock.  Thanks a lot for this patch!
This issue has been troubling me for a while.

I've been trying to learn more about btrfs internals -
if you have the time to answer a couple questions about
your patch, I'd really appreciate it.

> 
> (It's against 3.7-rc8.)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3d3e2c1..100289b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root
> *root, int data)
>  
>   if (data)
>   flags = BTRFS_BLOCK_GROUP_DATA;
> - else if (root == root->fs_info->chunk_root)
> + else if (root == root->fs_info->chunk_root ||
> +  root == root->fs_info->dev_root)
>   flags = BTRFS_BLOCK_GROUP_SYSTEM;
>   else
>   flags = BTRFS_BLOCK_GROUP_METADATA;
> @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct
> btrfs_root *root, u64 type)
>   num_dev = 1;/* DUP or single */
>  
>   /* metadata for updaing devices and chunk tree */
> + num_dev = num_dev << 1;

AFAICS this is doubling the size of the reserve, which
reduces the chance of a recursive do_chunk_alloc(), right?

>   return btrfs_calc_trans_metadata_size(root, num_dev + 1);

btrfs_calc_trans_metadata_size(root, num_items) multiplies its
num_items argument by another factor of three - do you know if
there is there some rationale behind that number, or is it
perhaps also an empirically determined factor?

What I'm wondering about is that if the size of the reserve is
empirically determined, will it need to be increased again
later when machines are more capable, and can handle a higher
load?

Do you think it's feasible to modify the locking for
do_chunk_alloc to allow it to recurse without deadlock?

Thanks -- Jim


>  }
>  
> @@ -4351,7 +4353,7 @@ static void init_global_block_rsv(struct
> btrfs_fs_info *fs_info)
>  
>   fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
>   fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
> - fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
> + fs_info->dev_root->block_rsv = &fs_info->chunk_block_rsv;
>   fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
>   fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv;
>  
> 
> thanks,
> liubo
> 
> 




Re: 3.7.0-rc8 btrfs locking issue

2012-12-07 Thread Jim Schutt

On 12/05/2012 09:07 AM, Jim Schutt wrote:

Hi,

I'm hitting a btrfs locking issue with 3.7.0-rc8.

The btrfs filesystem in question is backing a Ceph OSD
under a heavy write load from many cephfs clients.

I reported this issue a while ago:
http://www.spinics.net/lists/linux-btrfs/msg19370.html
when I was testing what I thought might be part of the
3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree.

I spent some time attempting to bisect the btrfs patch queue
just before it was merged for 3.7, but got nowhere due to
false negatives.

I've just been able to get back to testing 3.7-rc, and found
that I can still trigger the issue.

First I get this lockdep splat:

[ 1184.201331] =
[ 1184.206716] [ INFO: possible recursive locking detected ]
[ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted
[ 1184.217156] -
[ 1184.222544] ceph-osd/42177 is trying to acquire lock:
[ 1184.227589]  (&fs_info->chunk_mutex){+.+...}, at: [] 
do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.237270]
[ 1184.237270] but task is already holding lock:
[ 1184.243114]  (&fs_info->chunk_mutex){+.+...}, at: [] 
do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.252786]
[ 1184.252786] other info that might help us debug this:
[ 1184.259303]  Possible unsafe locking scenario:
[ 1184.259303]
[ 1184.265220]CPU0
[ 1184.267680]
[ 1184.270133]   lock(&fs_info->chunk_mutex);
[ 1184.274276]   lock(&fs_info->chunk_mutex);
[ 1184.278417]
[ 1184.278417]  *** DEADLOCK ***


To try to debug this, I applied this patch:

-cut here -
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c72ead8..86da319 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1470,6 +1470,10 @@ struct btrfs_fs_info {
int backup_root_index;

int num_tolerated_disk_barrier_failures;
+
+   /* protect against recursive do_chunk_alloc() */
+   const void *chunk_alloc_task;
+   const struct btrfs_space_info *prev_sinfo;
 };

 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7cda519..d61e93d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2087,6 +2087,9 @@ int open_ctree(struct super_block *sb,

fs_info->btree_inode->i_ino = BTRFS_BTREE_INODE_OBJECTID;
set_nlink(fs_info->btree_inode, 1);
+
+   fs_info->chunk_alloc_task = NULL;
+   fs_info->prev_sinfo = NULL;
/*
 * we set the i_size on the btree inode to the max possible int.
 * the real end of the address space is determined by all of
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d3e2c1..1e0e421 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3582,6 +3582,16 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans,
}
BUG_ON(!space_info); /* Logic error */

+   if (mutex_is_locked(&fs_info->chunk_mutex) &&
+   unlikely(fs_info->chunk_alloc_task == current)) {
+   WARN_ONCE(1, "do_chunk_alloc() about to recursively acquire "
+ "fs_info->chunk_mutex: impending deadlock avoided!\n"
+ "outer call space_info = %p flags %#llx\n"
+ "nested call space_info = %p flags %#llx\n",
+ fs_info->prev_sinfo, fs_info->prev_sinfo->flags,
+ space_info, space_info->flags);
+   return -EDEADLK;
+   }
 again:
spin_lock(&space_info->lock);
if (force < space_info->force_alloc)
@@ -3603,6 +3613,8 @@ again:
spin_unlock(&space_info->lock);

mutex_lock(&fs_info->chunk_mutex);
+   fs_info->chunk_alloc_task = current;
+   fs_info->prev_sinfo = space_info;

/*
 * The chunk_mutex is held throughout the entirety of a chunk
@@ -3655,6 +3667,8 @@ again:
space_info->chunk_alloc = 0;
spin_unlock(&space_info->lock);
 out:
+   fs_info->chunk_alloc_task = NULL;
+   fs_info->prev_sinfo = NULL;
mutex_unlock(&fs_info->chunk_mutex);
return ret;
 }
-cut here -
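
The guard in the patch above is just an owner check bolted onto an
ordinary mutex. For anyone who wants to poke at the pattern outside the
kernel, here's a minimal pthreads sketch of the same idea (names are
invented; the unlocked read of the owner flag is only a debug heuristic
here, the same way the kernel check is):

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Userspace sketch of the guard in the debug patch above: remember
 * which thread owns the mutex, and fail with -EDEADLK instead of
 * self-deadlocking on a recursive acquisition attempt.
 */
struct guarded_mutex {
	pthread_mutex_t lock;
	pthread_t	owner;		/* valid only while owned != 0 */
	int		owned;
};

static int guarded_lock(struct guarded_mutex *g)
{
	/* Same idea as the fs_info->chunk_alloc_task == current test. */
	if (g->owned && pthread_equal(g->owner, pthread_self()))
		return -EDEADLK;

	pthread_mutex_lock(&g->lock);
	g->owner = pthread_self();
	g->owned = 1;
	return 0;
}

static void guarded_unlock(struct guarded_mutex *g)
{
	g->owned = 0;
	pthread_mutex_unlock(&g->lock);
}
```

A second guarded_lock() from the same thread returns -EDEADLK rather
than blocking forever, which is exactly the behavior the WARN_ONCE in
the patch relies on.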

After a few trials, I got this during a test:

[ 1614.213516] [ cut here ]
[ 1614.218185] WARNING: at fs/btrfs/extent-tree.c:3592 
do_chunk_alloc+0xec/0x3f0 [btrfs]()
[ 1614.226228] Hardware name: X8DTH-i/6/iF/6F
[ 1614.230484] do_chunk_alloc() about to recursively acquire 
fs_info->chunk_mutex: impending deadlock avoided!
[ 1614.230484] outer call space_info = 880c214dac00 flags 0x1
[ 1614.230484] nested call space_info = 880c214dbc00 flags 0x4
[ 1614.251968] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm 
ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash 
dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput 
joydev sg hid_generic sd_mod iTCO_wdt iTCO_vendor_support ata_piix lib


3.7.0-rc8 btrfs locking issue

2012-12-05 Thread Jim Schutt

Hi,

I'm hitting a btrfs locking issue with 3.7.0-rc8.

The btrfs filesystem in question is backing a Ceph OSD
under a heavy write load from many cephfs clients.

I reported this issue a while ago:
  http://www.spinics.net/lists/linux-btrfs/msg19370.html
when I was testing what I thought might be part of the
3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree.

I spent some time attempting to bisect the btrfs patch queue
just before it was merged for 3.7, but got nowhere due to
false negatives.

I've just been able to get back to testing 3.7-rc, and found
that I can still trigger the issue.

First I get this lockdep splat:

[ 1184.201331] =
[ 1184.206716] [ INFO: possible recursive locking detected ]
[ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted
[ 1184.217156] -
[ 1184.222544] ceph-osd/42177 is trying to acquire lock:
[ 1184.227589]  (&fs_info->chunk_mutex){+.+...}, at: [] 
do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.237270]
[ 1184.237270] but task is already holding lock:
[ 1184.243114]  (&fs_info->chunk_mutex){+.+...}, at: [] 
do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.252786]
[ 1184.252786] other info that might help us debug this:
[ 1184.259303]  Possible unsafe locking scenario:
[ 1184.259303]
[ 1184.265220]CPU0
[ 1184.267680]
[ 1184.270133]   lock(&fs_info->chunk_mutex);
[ 1184.274276]   lock(&fs_info->chunk_mutex);
[ 1184.278417]
[ 1184.278417]  *** DEADLOCK ***
[ 1184.278417]
[ 1184.284325]  May be due to missing lock nesting notation
[ 1184.284325]
[ 1184.291099] 4 locks held by ceph-osd/42177:
[ 1184.295277]  #0:  (sb_writers#7){.+.+.+}, at: [] 
btrfs_file_aio_write+0x64/0x320 [btrfs]
[ 1184.305103]  #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: 
[] btrfs_file_aio_write+0x6e/0x320 [btrfs]
[ 1184.316108]  #2:  (sb_internal){.+.+..}, at: [] 
start_transaction+0x1c4/0x450 [btrfs]
[ 1184.325632]  #3:  (&fs_info->chunk_mutex){+.+...}, at: [] 
do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.335761]
[ 1184.335761] stack backtrace:
[ 1184.340126] Pid: 42177, comm: ceph-osd Not tainted 3.7.0-rc8-00013-gdf2fc24 
#438
[ 1184.347508] Call Trace:
[ 1184.349962]  [] ? vprintk_emit+0x42a/0x4c0
[ 1184.355619]  [] print_deadlock_bug+0xe9/0x100
[ 1184.361556]  [] validate_chain+0x596/0x750
[ 1184.367222]  [] __lock_acquire+0x449/0x510
[ 1184.372894]  [] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.379417]  [] lock_acquire+0xc9/0x120
[ 1184.384855]  [] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.391377]  [] ? __lock_acquire+0x449/0x510
[ 1184.397204]  [] __mutex_lock_common+0x5d/0x3a0
[ 1184.403221]  [] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.409762]  [] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.416323]  [] ? do_chunk_alloc+0x179/0x340 [btrfs]
[ 1184.422849]  [] mutex_lock_nested+0x4a/0x60
[ 1184.428640]  [] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.435018]  [] find_free_extent+0xa3c/0xb70 [btrfs]
[ 1184.441555]  [] ? btrfs_reduce_alloc_profile+0xa9/0x120 
[btrfs]
[ 1184.449051]  [] btrfs_reserve_extent+0x82/0x190 [btrfs]
[ 1184.455843]  [] btrfs_alloc_free_block+0x85/0x230 [btrfs]
[ 1184.462828]  [] __btrfs_cow_block+0x14a/0x4b0 [btrfs]
[ 1184.469471]  [] ? btrfs_set_lock_blocking_rw+0xe3/0x160 
[btrfs]
[ 1184.476962]  [] btrfs_cow_block+0x18d/0x230 [btrfs]
[ 1184.483426]  [] btrfs_search_slot+0x360/0x730 [btrfs]
[ 1184.490067]  [] btrfs_insert_empty_items+0x8d/0x100 [btrfs]
[ 1184.497199]  [] ? kmem_cache_alloc+0xd3/0x170
[ 1184.503151]  [] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs]
[ 1184.510125]  [] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs]
[ 1184.516928]  [] btrfs_alloc_chunk+0x5e/0x90 [btrfs]
[ 1184.523374]  [] ? check_system_chunk+0x71/0x130 [btrfs]
[ 1184.530179]  [] do_chunk_alloc+0x2ce/0x340 [btrfs]
[ 1184.536555]  [] ? btrfs_reduce_alloc_profile+0xa9/0x120 
[btrfs]
[ 1184.544070]  [] btrfs_check_data_free_space+0x13c/0x2b0 
[btrfs]
[ 1184.551574]  [] btrfs_delalloc_reserve_space+0x20/0x60 
[btrfs]
[ 1184.558987]  [] __btrfs_buffered_write+0x15e/0x350 [btrfs]
[ 1184.566042]  [] btrfs_file_aio_write+0x209/0x320 [btrfs]
[ 1184.572944]  [] ? __btrfs_direct_write+0x130/0x130 [btrfs]
[ 1184.579984]  [] do_sync_readv_writev+0x94/0xe0
[ 1184.585985]  [] do_readv_writev+0xe3/0x1e0
[ 1184.591645]  [] ? fget_light+0x122/0x170
[ 1184.597131]  [] vfs_writev+0x46/0x60
[ 1184.602266]  [] sys_writev+0x5f/0xc0
[ 1184.607398]  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1184.613832]  [] system_call_fastpath+0x16/0x1b

and then a little later:

[ 1319.463719] INFO: task ceph-osd:42177 blocked for more than 120 seconds.
[ 1319.470409] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1319.478227] ceph-osdD 0246 0 42177  1 0x
[ 1319.485330]  880402c4d358 0046 7fffeccc49d0 
880402c4dfd8
[ 1319.492852]  880402c4c010 880402c4c000 880402c4c000 
880402c4c000
[ 1319.500449]  880402c4dfd8 880402c4c000 880c2151bec0 


Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-13 Thread Jim Schutt

Hi Mel,

On 08/12/2012 02:22 PM, Mel Gorman wrote:



I went through the patch again but only found the following which is a
weak candidate. Still, can you retest with the following patch on top and
CONFIG_PROVE_LOCKING set please?



I've gotten in several hours of testing on this patch with
no issues at all, and no output from CONFIG_PROVE_LOCKING
(I'm assuming it would show up on a serial console).  So,
it seems to me this patch has done the trick.

CPU utilization is staying under control, and write-out rate
is good.

You can add my Tested-by: as you see fit.  If you work
up any refinements and would like me to test, please
let me know.

Thanks -- Jim





Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-10 Thread Jim Schutt

On 08/10/2012 05:02 AM, Mel Gorman wrote:

On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:




Ok, this is an untested hack and I expect it would drop allocation success
rates again under load (but not as much). Can you test again and see what
effect, if any, it has please?

---8<---
mm: compaction: back out if contended

---




Initial testing with this patch looks very good from
my perspective; CPU utilization stays reasonable,
write-out rate stays high, no signs of stress.
Here's an example after ~10 minutes under my test load:



Hmmm, I wonder if I should have tested this patch longer,
in view of the trouble I ran into testing the new patch?
See below.



Excellent, so it is contention that is the problem.



I'll continue testing tomorrow to be sure nothing
shows up after continued testing.

If this passes your allocation success rate testing,
I'm happy with this performance for 3.6 - if not, I'll
be happy to test any further patches.



It does impair allocation success rates as I expected (they're still ok
but not as high as I'd like) so I implemented the following instead. It
attempts to back off when contention is detected or compaction is taking
too long. It does not back off as quickly as the first prototype did so
I'd like to see if it addresses your problem or not.


I really appreciate getting the chance to test out
your patchset.



I appreciate that you have a workload that demonstrates the problem and
will test patches. I will not abuse this and hope to keep the revisions
to a minimum.

Thanks.

---8<---
mm: compaction: Abort async compaction if locks are contended or taking too long



Hmmm, while testing this patch, a couple of my servers got
stuck after ~30 minutes or so, like this:

[ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
[ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2515.884447] ceph-osdD  0 30375  1 0x
[ 2515.891531]  8802e1a99e38 0082 88056b38e298 
8802e1a99fd8
[ 2515.899013]  8802e1a98010 8802e1a98000 8802e1a98000 
8802e1a98000
[ 2515.906482]  8802e1a99fd8 8802e1a98000 880697d31700 
8802e1a84500
[ 2515.913968] Call Trace:
[ 2515.916433]  [] schedule+0x5d/0x60
[ 2515.921417]  [] rwsem_down_failed_common+0x105/0x140
[ 2515.927938]  [] rwsem_down_write_failed+0x13/0x20
[ 2515.934195]  [] call_rwsem_down_write_failed+0x13/0x20
[ 2515.940934]  [] ? down_write+0x45/0x50
[ 2515.946244]  [] sys_mprotect+0xd2/0x240
[ 2515.951640]  [] system_call_fastpath+0x16/0x1b
[ 2515.957646] INFO: task ceph-osd:95698 blocked for more than 120 seconds.
[ 2515.964330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2515.972141] ceph-osdD  0 95698  1 0x
[ 2515.979223]  8802b049fe38 0082 88056b38e2a0 
8802b049ffd8
[ 2515.986700]  8802b049e010 8802b049e000 8802b049e000 
8802b049e000
[ 2515.994176]  8802b049ffd8 8802b049e000 8809832ddc00 
880611592e00
[ 2516.001653] Call Trace:
[ 2516.004111]  [] schedule+0x5d/0x60
[ 2516.009072]  [] rwsem_down_failed_common+0x105/0x140
[ 2516.015589]  [] rwsem_down_write_failed+0x13/0x20
[ 2516.021861]  [] call_rwsem_down_write_failed+0x13/0x20
[ 2516.028555]  [] ? down_write+0x45/0x50
[ 2516.033859]  [] sys_mprotect+0xd2/0x240
[ 2516.039248]  [] system_call_fastpath+0x16/0x1b
[ 2516.045248] INFO: task ceph-osd:95699 blocked for more than 120 seconds.
[ 2516.051934] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2516.059753] ceph-osdD  0 95699  1 0x
[ 2516.066832]  880c022d3dc8 0082 880c022d2000 
880c022d3fd8
[ 2516.074302]  880c022d2010 880c022d2000 880c022d2000 
880c022d2000
[ 2516.081784]  880c022d3fd8 880c022d2000 8806224cc500 
88096b64dc00
[ 2516.089254] Call Trace:
[ 2516.091702]  [] schedule+0x5d/0x60
[ 2516.096656]  [] rwsem_down_failed_common+0x105/0x140
[ 2516.103176]  [] rwsem_down_write_failed+0x13/0x20
[ 2516.109443]  [] call_rwsem_down_write_failed+0x13/0x20
[ 2516.116134]  [] ? down_write+0x45/0x50
[ 2516.121442]  [] vm_mmap_pgoff+0x6e/0xb0
[ 2516.126861]  [] sys_mmap_pgoff+0x18a/0x190
[ 2516.132552]  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 2516.138985]  [] sys_mmap+0x22/0x30
[ 2516.143945]  [] system_call_fastpath+0x16/0x1b
[ 2516.149949] INFO: task ceph-osd:95816 blocked for more than 120 seconds.
[ 2516.156632] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2516.16] ceph-osdD  0 95816  1 0x
[ 2516.171521]  880332991e38 0082 880332991de8 
880332991fd8
[ 2516.178992]  880332990010 88033299 88033299 
88033299
[ 2516.186466]  880332


Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-09 Thread Jim Schutt

On 08/09/2012 02:46 PM, Mel Gorman wrote:

On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:

On 08/09/2012 07:49 AM, Mel Gorman wrote:

Changelog since V2
o Capture !MIGRATE_MOVABLE pages where possible
o Document the treatment of MIGRATE_MOVABLE pages while capturing
o Expand changelogs

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed 
(minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case.  There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].


On my first test of this patch series on top of 3.5, I ran into an
instance of what I think is the sort of thing that patch 4/5 was
fixing.  Here's what vmstat had to say during that period:




My conclusion looking at the vmstat data is that everything is looking ok
until system CPU usage goes through the roof. I'm assuming that's what we
are all still looking at.


I'm concerned about both the high CPU usage as well as the
reduction in write-out rate, but I've been assuming the latter
is caused by the former.





Ok, this is an untested hack and I expect it would drop allocation success
rates again under load (but not as much). Can you test again and see what
effect, if any, it has please?

---8<---
mm: compaction: back out if contended

---




Initial testing with this patch looks very good from
my perspective; CPU utilization stays reasonable,
write-out rate stays high, no signs of stress.
Here's an example after ~10 minutes under my test load:

2012-08-09 16:26:07.550-06:00
vmstat -w 4 16
procs ------------------memory----------------- ---swap-- ------io------- ----system---- -----cpu-----
  r  b  swpd    free  buff     cache  si  so   bi      bo     in      cs  us sy id wa st
 21 19     0  351628   576  37835440   0   0   17   44394   1241     653   6 20 64  9  0
 11 11     0  365520   576  37893060   0   0  124 2121508 203450  170957  12 46 25 17  0
 13 16     0  359888   576  37954456   0   0   98 2185033 209473  171571  13 44 25 18  0
 17 15     0  353728   576  38010536   0   0   89 2170971 208052  167988  13 43 26 18  0
 17 16     0  349732   576  38048284   0   0  135 2217752 218754  174170  13 49 21 16  0
 43 13     0  343280   576  38046500   0   0  153 2207135 217872  179519  13 47 23 18  0
 26 13     0  350968   576  37937184   0   0  147 2189822 214276  176697  13 47 23 17  0
  4 12     0  350080   576  37958364   0   0  226 2145212 207077  172163  12 44 24 20  0
 15 13     0  353124   576  37921040   0   0  145 2078422 197231  166381  12 41 30 17  0
 14 15     0  348964   576  37949588   0   0  107 2020853 188192  164064  12 39 30 20  0
 21  9     0  354784   576  37951228   0   0  117 2148090 204307  165609  13 48 22 18  0
 36 16     0  347368   576  37989824   0   0  166 2208681 216392  178114  13 47 24 16  0
 28 15     0  300656   576  38060912   0   0  164 2181681 214618  175132  13 45 24 18  0
  9 16     0  295484   576  38092184   0   0  153 2156909 218993  180289  13 43 27 17  0
 17 16     0  346760   576  37979008   0   0  165 2124168 198730  173455  12 44 27 18  0
 14 17     0  360988   576  37957136   0   0  142 2092248 197430  168199  12 42 29 17  0

I'll continue testing tomorrow to be sure nothing
shows up during a longer run.

If this passes your allocation success rate testing,
I'm happy with this performance for 3.6 - if not, I'll
be happy to test any further patches.

I really appreciate getting the chance to test out
your patchset.

Thanks -- Jim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-09 Thread Jim Schutt

On 08/09/2012 07:49 AM, Mel Gorman wrote:

Changelog since V2
o Capture !MIGRATE_MOVABLE pages where possible
o Document the treatment of MIGRATE_MOVABLE pages while capturing
o Expand changelogs

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed 
(minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case.  There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].


On my first test of this patch series on top of 3.5, I ran into an
instance of what I think is the sort of thing that patch 4/5 was
fixing.  Here's what vmstat had to say during that period:

--

2012-08-09 11:58:04.107-06:00
vmstat -w 4 16
procs ------------------memory----------------- ---swap-- ------io------- ----system---- -----cpu-----
  r  b  swpd    free  buff     cache  si  so   bi      bo     in      cs  us sy id wa st
 20 14     0  235884   576  38916072   0   0   12   17047    171     133   3  8 85  4  0
 18 17     0  220272   576  38955912   0   0   86 2131838 200142  162956  12 38 31 19  0
 17  9     0  244284   576  38955328   0   0   19 2179562 213775  167901  13 43 26 18  0
 27 15     0  223036   576  38952640   0   0   24 2202816 217996  158390  14 47 25 15  0
 17 16     0  233124   576  38959908   0   0    5 2268815 224647  165728  14 50 21 15  0
 16 13     0  225840   576  38995740   0   0   52 2253829 216797  160551  14 47 23 16  0
 22 13     0  260584   576  38982908   0   0   92 2196737 211694  140924  14 53 19 15  0
 16 10     0  235784   576  38917128   0   0   22 2157466 210022  137630  14 54 19 14  0
 12 13     0  214300   576  38923848   0   0   31 2187735 213862  142711  14 52 20 14  0
 25 12     0  219528   576  38919540   0   0   11 2066523 205256  142080  13 49 23 15  0
 26 14     0  229460   576  38913704   0   0   49 2108654 200692  135447  13 51 21 15  0
 11 11     0  220376   576  38862456   0   0   45 2136419 207493  146813  13 49 22 16  0
 36 12     0  229860   576  38869784   0   0    7 2163463 212223  151812  14 47 25 14  0
 16 13     0  238356   576  38891496   0   0   67 2251650 221728  154429  14 52 20 14  0
 65 15     0  211536   576  38922108   0   0   59 2237925 224237  156587  14 53 19 14  0
 24 13     0  585024   576  38634024   0   0   37 2240929 229040  148192  15 61 14 10  0

2012-08-09 11:59:04.714-06:00
vmstat -w 4 16
procs ------------------memory----------------- ---swap-- ------io------- ----system---- -----cpu-----
  r  b  swpd    free  buff     cache  si  so   bi      bo     in      cs  us sy id wa st
 43  8     0  794392   576  38382316   0   0   11   20491    576     420   3 10 82  4  0
127  6     0  579328   576  38422156   0   0   21 2006775 205582  119660  12 70 11  7  0
 44  5     0  492860   576  38512360   0   0   46 1536525 173377   85320  10 78  7  4  0
218  9     0  585668   576  38271320   0   0   39 1257266 152869   64023   8 83  7  3  0
101  6     0  600168   576  38128104   0   0   10 1438705 160769   68374   9 84  5  3  0
 62  5     0  597004   576  38098972   0   0   93 1376841 154012   63912   8 82  7  4  0
 61 11     0  850396   576  37808772   0   0   46 1186816 145731   70453   7 78  9  6  0
124  7     0  437388   576  38126320   0   0   15 1208434 149736   57142   7 86  4  3  0
204 11     0 1105816   576  37309532   0   0   20 1327833 145979   52718   7 87  4  2  0
 29  8     0  751020   576  37360332   0   0    8 1405474 169916   61982   9 85  4  2  0
 38  7     0  626448   576  37333244   0   0   14 1328415 174665   74214   8 84  5  3  0
 23  5     0  650040   576  37134280   0   0   28 1351209 179220   71631   8 85  5  2  0
 40 10     0  610988   576  37054292   0   0  104 1272527 167530   73527   7 85  5  3  0
 79 22     0 2076836   576  35487340   0   0  750 1249934 175420   70124   7 88  3  2  0

Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3

2012-08-09 Thread Jim Schutt

Hi Mel,

On 08/09/2012 07:49 AM, Mel Gorman wrote:

Changelog since V2
o Capture !MIGRATE_MOVABLE pages where possible
o Document the treatment of MIGRATE_MOVABLE pages while capturing
o Expand changelogs

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed 
(minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case.  There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].


I was successful at resolving my Ceph issue on 3.6-rc1, but ran
into some other issue that isn't immediately obvious, and prevents
me from testing your patch with 3.6-rc1.  Today I will apply your
patch series to 3.5 and test that way.

Sorry for the delay.

-- Jim




Re: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages

2012-08-07 Thread Jim Schutt

On 08/07/2012 08:52 AM, Mel Gorman wrote:

On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:

On 08/07/2012 08:31 AM, Mel Gorman wrote:

commit [7db8889a: mm: have order > 0 compaction start off where it left]
introduced a caching mechanism to reduce the amount of work the free page
scanner does in compaction. However, it has a problem. Consider two processes
simultaneously scanning free pages

C
Process A   M S F
|---|
Process B   M   FS


Argh. Good spotting.


This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.


Agreed on the "not optimal", but I also cannot think of a better
idea right now. Getting this fixed for 3.6 is important, we can
think of future optimizations in San Diego.



Sounds like a plan.


Signed-off-by: Mel Gorman


Reviewed-by: Rik van Riel



Thanks very much.

Jim, what are the chances of getting this series tested with your large
data workload? As it's on top of 3.5, it should be less scary than
testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
test with just this patch on top.



As it turns out I'm already testing 3.6-rc1, as I'm on
the trail of a Ceph client messaging bug.  I think I've
about got that figured out, and am working on a patch, but
I need it fixed in order to generate enough load to trigger
the problem that your patch addresses.

Which is a long-winded way of saying:  no problem, I'll
roll this into my current testing, but I'll need another
day or two before I'm likely to be able to generate a
high enough load to test effectively.  OK?

Also FWIW, it occurs to me that you might be interested
to know that my load also involves lots of network traffic
using jumbo frames.  I suspect that puts even more stress
on higher-order page allocations, right?

-- Jim




Re: splice/vmsplice performance test results

2006-11-27 Thread Jim Schutt
On Thu, 2006-11-23 at 12:24 +0100, Jens Axboe wrote: 
> On Wed, Nov 22 2006, Jim Schutt wrote:
> > 
> > On Wed, 2006-11-22 at 09:57 +0100, Jens Axboe wrote:
> > > On Tue, Nov 21 2006, Jim Schutt wrote:
> > [snip]
> > > > 
> > > > Hmmm.  Is it worth me trying to do some sort of kernel 
> > > > profiling to see if there is anything unexpected with 
> > > > my setup?  If so, do you have a preference as to what 
> > > > I would use?  
> > > 
> > > Not sure that profiling would be that interesting, as the problem
> > > probably lies in where we are _not_ spending the time. But it certainly
> > > can't hurt. Try to oprofile the kernel for a 10-20 sec interval while
> > > the test is running. Do 3 such runs for the two test cases
> > > (write-to-file, vmsplice/splice-to-file).
> > > 
> > 
> > OK, I've attached results for 20 second profiles of three
> > runs of each test: read-from-socket + write-to-file, and
> > read-from-socket + vmsplice/splice-to-file.
> > 
> > The test case and throughput is in the name: e.g. rvs-1-306MBps
> > is trial 1 of read/vmsplice/splice case, which ran at 306 MB/s.
> > 
> > Let me know if I can help with more testing, and thanks
> > again for looking into this.
> 
> As I suspected, nothing sticks out in these logs as the problem here is
> not due to a maxed out system. The traces look fairly identical, less
> time spent in copy_user with the splice approach.
> 
> Comparing the generic_file_buffered_write() and splice-to-file path,
> there really isn't a whole lot of difference. It would be interesting to
> try and eliminate some of the differences between the two approaches -
> could you try and change the vmsplice to a write-to-pipe instead? And
> add SPLICE_F_NONBLOCK to the splice-to-file as well. Basically I'm
> interested in a something that only really tests splice-to-file vs
> write-to-file. Perhaps easier if you can just run fio to test that, I'm
> inlining a job file to test that specifically.
> 

Sorry for the delayed reply.

Here's results from three fio runs, using the job file
you gave me, and fio-git-20061124142507.tar.gz:

---
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (groupid=0): err= 0:
  write: io= 16384MiB, bw=558404KiB/s, runt= 30766msec
slat (msec): min=0, max=0, avg= 0.00, dev= 0.00
clat (msec): min=0, max=   35, avg= 0.00, dev= 0.08
bw (KiB/s) : min=0, max=644481, per=98.90%, avg=552243.95,
dev=561417.03
  cpu  : usr=0.91%, sys=85.03%, ctx=14121
splice: (groupid=1): err= 0:
  write: io= 16384MiB, bw=486144KiB/s, runt= 35339msec
slat (msec): min=0, max=0, avg= 0.00, dev= 0.00
clat (msec): min=0, max=   30, avg= 0.00, dev= 0.10
bw (KiB/s) : min=0, max=555745, per=99.06%, avg=481567.28,
dev=488565.60
  cpu  : usr=0.85%, sys=88.87%, ctx=12956

Run status group 0 (all jobs):
  WRITE: io=16384MiB, aggrb=558404, minb=558404, maxb=558404,
mint=30766msec, maxt=30766msec

Run status group 1 (all jobs):
  WRITE: io=16384MiB, aggrb=486144, minb=486144, maxb=486144,
mint=35339msec, maxt=35339msec

Disk stats (read/write):
  md0: ios=20/270938, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
---
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=sync, iodepth=1
splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0,
ioengine=splice, iodepth=1
write: (groupid=0): err= 0:
  write: io= 16384MiB, bw=547234KiB/s, runt= 31394msec
slat (msec): min=0, max=1, avg= 0.00, dev= 0.00
clat (msec): min=0, max=   57, avg= 0.00, dev= 0.14
bw (KiB/s) : min=0, max=662568, per=98.94%, avg=541406.71,
dev=550958.67
  cpu  : usr=0.79%, sys=82.80%, ctx=16560
splice: (groupid=1): err= 0:
  write: io= 16384MiB, bw=473313KiB/s, runt= 36297msec
slat (msec): min=0, max=1, avg= 0.00, dev= 0.00
clat (msec): min=0, max=   27, avg= 0.00, dev= 0.11
bw (KiB/s) : min=0, max=562298, per=99.12%, avg=469142.21,
dev=476426.36
  cpu  : usr=1.05%, sys=84.78%, ctx=16043

Run status group 0 (all jobs):
  WRITE: io=16384MiB, aggrb=547234, minb=547234, maxb=547234,
mint=31394msec, maxt=31394msec

Run status group 1 (all jobs):
  WRITE: io=16384MiB, aggrb=473313, minb=473313, maxb=473313,
mint=36297msec, maxt=36297msec

Re: splice/vmsplice performance test results

2006-11-17 Thread Jim Schutt
On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> On Thu, Nov 16 2006, Jim Schutt wrote:
> > Hi,
> > 

> > My test program can do one of the following:
> > 
> > send data:
> >  A) read() from file into buffer, write() buffer into socket
> >  B) mmap() section of file, write() that into socket, munmap()
> >  C) splice() from file to pipe, splice() from pipe to socket
> > 
> > receive data:
> >  1) read() from socket into buffer, write() buffer into file
> >  2) ftruncate() to extend file, mmap() new extent, read() 
> >   from socket into new extent, munmap()
> >  3) read() from socket into buffer, vmsplice() buffer to 
> >  pipe, splice() pipe to file (using the double-buffer trick)
> > 
> > Here's the results, using:
> >  - 64 KiB buffer, mmap extent, or splice
> >  - 1 MiB TCP window
> >  - 16 GiB data sent across network
> > 
> > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> > 
> > A) from file  -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > B) from file  -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > C) from file  -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> > 
> > A) from /dev/zero -> 1) to file  : 375 MB/s (3.00 Gb/s)
> > A) from /dev/zero -> 2) to file  : 150 MB/s (1.20 Gb/s)
> > A) from /dev/zero -> 3) to file  : 286 MB/s (2.29 Gb/s)
> > 
> > I had (naively) hoped the read/vmsplice/splice combination would 
> > run at the same speed I can write a file, i.e. at about 450 MB/s
> > on my setup.  Do any of my numbers seem bogus, so I should look 
> > harder at my test program?
> 
> Could be read-ahead playing in here, I'd have to take a closer look at
> the generated io patterns to say more about that. Any chance you can
> capture iostat or blktrace info for such a run to compare that goes to
> the disk?

I've attached a file with iostat and vmstat results for the case
where I read from a socket and write a file, vs. the case where I
read from a socket and use vmsplice/splice to write the file.
(Sorry it's not inline - my mailer locks up when I try to
include the file.)

Would you still like blktrace info for these two cases?

-- Jim


Here's 'iostat -dx 5' on the receiver, when it reads from a socket
and writes to a file:

Device:rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/srkB/swkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sdb  0.00   0.00  0.00  0.000.000.00 0.00 0.00 0.00 
0.000.00   0.00   0.00
sdc  0.00   0.00  0.00  0.000.000.00 0.00 0.00 0.00 
0.000.00   0.00   0.00
sdd  0.00   0.00  0.00  0.200.000.40 0.00 0.20 2.00 
0.000.00   0.00   0.00
sde  0.00   0.00  0.00  0.000.000.00 0.00 0.00 0.00 
0.000.00   0.00   0.00
md0  0.00   0.00  0.00  0.200.000.40 0.00 0.20 2.00 
0.000.00   0.00   0.00

Device:rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/srkB/swkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sdb  0.00   0.20  0.00 571.260.00 142574.45 0.00 71287.23   
249.58 0.891.56   0.43  24.51
sdc  0.00   0.00  0.00 576.850.00 142558.08 0.00 71279.04   
247.13 1.021.77   0.42  23.95
sdd  0.00   1.20  0.60 577.454.79 142545.91 2.40 71272.95   
246.61 1.162.00   0.54  31.46
sde  0.00   0.60  0.00 573.850.00 142607.58 0.00 71303.79   
248.51 1.342.33   0.54  30.74
md0  0.00   0.00  0.60 2301.404.79 570286.03 2.40 285143.01   
247.74 0.000.00   0.00   0.00

Device:rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/srkB/swkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sdb  0.00   0.20  0.00 750.400.00 187265.80 0.00 93632.90   
249.55 1.101.46   0.42  31.36
sdc  0.00   1.40  0.00 755.800.00 187236.80 0.00 93618.40   
247.73 1.301.72   0.43  32.64
sdd  0.00   2.80  0.00 751.200.00 187253.40 0.00 93626.70   
249.27 1.532.04   0.55  41.36
sde  0.00   0.20  0.00 767.400.00 187235.20 0.00 93617.60   
243.99 1.652.15   0.54  41.52
md0  0.00   0.00  0.00 3029.400.00 748991.20 0.00 374495.60   
247.24 0.000.00   0.00   0.00

Device:rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/srkB/swkB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sdb  0.00   1.00  0.00 771.260.00 187299.20 0.00 93649.60   
242.85 1.211.57   0.44  33.61
sdc  0.00   1.60  0.00 765.870.00 187262.28 0.00 93631.14   
244.51 1.331.73   0.45  34.49
sdd  0.00   1.00  0.00 739.720.00 187258.48 0.00 93629.24   
253.15 1.431.93   0.55  40.56
sde  0.00   0.80  0.00 744.310.00 187227.15 0.00 93613.57   
251.54 1.852.48   0.56  41.68
md0  0.00   0.00  0.00 3025.550.00 749047.11 0.00 374523.55   
247.57 0.000.00   0.00   0.00

Re: splice/vmsplice performance test results

2006-11-16 Thread Jim Schutt
On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> On Thu, Nov 16 2006, Jim Schutt wrote:
> > Hi,
> > 

> > 
> > My test program can do one of the following:
> > 
> > send data:
> >  A) read() from file into buffer, write() buffer into socket
> >  B) mmap() section of file, write() that into socket, munmap()
> >  C) splice() from file to pipe, splice() from pipe to socket
> > 
> > receive data:
> >  1) read() from socket into buffer, write() buffer into file
> >  2) ftruncate() to extend file, mmap() new extent, read() 
> >   from socket into new extent, munmap()
> >  3) read() from socket into buffer, vmsplice() buffer to 
> >  pipe, splice() pipe to file (using the double-buffer trick)
> > 
> > Here's the results, using:
> >  - 64 KiB buffer, mmap extent, or splice
> >  - 1 MiB TCP window
> >  - 16 GiB data sent across network
> > 
> > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> > 
> > A) from file  -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > B) from file  -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > C) from file  -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> > 
> > A) from /dev/zero -> 1) to file  : 375 MB/s (3.00 Gb/s)
> > A) from /dev/zero -> 2) to file  : 150 MB/s (1.20 Gb/s)
> > A) from /dev/zero -> 3) to file  : 286 MB/s (2.29 Gb/s)
> > 
> > I had (naively) hoped the read/vmsplice/splice combination would 
> > run at the same speed I can write a file, i.e. at about 450 MB/s
> > on my setup.  Do any of my numbers seem bogus, so I should look 
> > harder at my test program?
> 
> Could be read-ahead playing in here, I'd have to take a closer look at
> the generated io patterns to say more about that. Any chance you can
> capture iostat or blktrace info for such a run to compare that goes to
> the disk? 

I can try.  Do you prefer iostat or blktrace, or would you like both?
Can you point me at some instructions?

> Can you pass along the test program?

Inserted inline below.

> 
> > Or is read+write really the fastest way to get data off a
> > socket and into a file?
> 
> splice() should be just as fast of course, and more efficient. Not a lot
> of real-life performance tuning has gone into it yet, so I would not be
> surprised if we need to smoothen a few edges.
> 

I'm glad I can help a little here.

-- Jim

Here's a splice.h I'm using, based on your example:
---

/* Implement splice syscall support for glibc versions that don't
 * have it.
 */

#ifndef __do_splice_syscall_h__
#define __do_splice_syscall_h__

#include <sys/syscall.h>
#include <unistd.h>

#if defined(__i386__)

/* From kernel tree include/asm-i386/unistd.h
 */
#ifndef __NR_splice
#define __NR_splice   313
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 316
#endif

#elif defined(__x86_64__)

/* From kernel tree include/asm-x86_64/unistd.h
 */
#ifndef __NR_splice
#define __NR_splice   275
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 278
#endif

#else
#error unsupported architecture
#endif

/* From kernel tree include/linux/pipe_fs_i.h
 */
#define SPLICE_F_MOVE   (0x01)   /* move pages instead of copying */
#define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing (but */
                                 /* we may still block on the fd we splice */
                                 /* from/to, of course) */
#define SPLICE_F_MORE   (0x04)   /* expect more data */
#define SPLICE_F_GIFT   (0x08)   /* pages passed in are a gift */

#ifndef SYS_splice
#define SYS_splice __NR_splice
#endif
#ifndef SYS_vmsplice
#define SYS_vmsplice __NR_vmsplice
#endif


#ifndef _LARGEFILE64_SOURCE
#error need -D_LARGEFILE64_SOURCE
#endif

static inline
int splice(int fd_in, off64_t *off_in, int fd_out, off64_t *off_out,
   size_t len, unsigned int flags)
{
return syscall(SYS_splice, fd_in, off_in,
   fd_out, off_out, len, flags);
}

struct iovec;

static inline
int vmsplice(int fd, const struct iovec *iov,
 unsigned long nr_segs, unsigned int flags)
{
return syscall(SYS_vmsplice, fd, iov, nr_segs, flags);
}


#endif /* __do_splice_syscall_h__ */

--

And here's the test program itself:


/* Copyright 2006 Sandia Corporation.
 *
 * This file is free software; you can redistribute it and/or modify
 * it under the terms of version 2 of the GNU General Public License,
 * as published by the Free Software Foundation.
 */

/* Compile with -DHAVE_SPLICE_SYSCALL if you _know_ your kernel
 * has splice/vmsplice support.
 */

#ifndef _LARGEFILE64_SOURCE
#error need -D_LARGEFILE64_SOURCE
#endif


#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/sendfile.h>

#include <netdb.h>
#include <sys/socket.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <resolv.h>
#define ALIGN(value,size) (((value) + (size) - 1) & ~((size) - 1))

/* Once glibc implements splice(), SYS_splice will be defined in 
 * system headers.  Until then we need to use our own

splice/vmsplice performance test results

2006-11-16 Thread Jim Schutt
Hi,

I've done some testing to see how splice/vmsplice perform
vs. other alternatives on transferring a large file across
a fast network.  One option I tested was to use vmsplice 
to get a 1-copy receive, but it didn't perform as well 
as I had hoped.  I was wondering if my results were at odds
with what other people have observed.

I've two systems, each with:
  Tyan S2895 motherboard
  2 ea. 2.6 GHz Opteron
  1 GiB memory
  Myricom Myri-10G 10 Gb/s NIC (PCIe x8)
  2.6.19-rc5-g134a11f0 on FC4

In addition, one system has a 3ware 9590-8ML (PCIe) and a 3ware
9550SX-8LP (PCI-X), with 16 Seagate Barracuda 7200.10 SATA drives 
(250 GB ea., NCQ enabled).  Write caching is enabled on the 3ware
cards.

The Myricom cards are connected back-to-back using 9000 byte MTU. 
I baseline the network performance with 'iperf -w 1M -l 64K'
and get 6.9 Gb/s.

After a fair amount of testing, I settled on a 4-way software
RAID0 on top of 4-way hardware RAID0 units as giving the best
streaming performance.  The file system is XFS, with the stripe
unit set to the hardware RAID chunk size, and the stripe width 
16 times that.  

Disk tuning parameters in /sys/block/sd*/queue are default
values, except queue/nr_requests = 5 gives me best performance.
(It seems like the 3ware cards slow down a little if I feed them 
too much data on the streaming write test I'm using.)

I baseline file write performance with 
  sync; time { dd if=/dev/zero of=./zero bs=32k count=512k; sync; }
and get 465-520 MB/s (highly variable).

I test baseline file read performance with
  time dd if=./zero of=/dev/null bs=32k count=512k
and get 950 MB/s (fairly repeatable).

My test program can do one of the following:

send data:
 A) read() from file into buffer, write() buffer into socket
 B) mmap() section of file, write() that into socket, munmap()
 C) splice() from file to pipe, splice() from pipe to socket

receive data:
 1) read() from socket into buffer, write() buffer into file
 2) ftruncate() to extend file, mmap() new extent, read() 
  from socket into new extent, munmap()
 3) read() from socket into buffer, vmsplice() buffer to 
 pipe, splice() pipe to file (using the double-buffer trick)

Here's the results, using:
 - 64 KiB buffer, mmap extent, or splice
 - 1 MiB TCP window
 - 16 GiB data sent across network

A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)

A) from file  -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
B) from file  -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
C) from file  -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)

A) from /dev/zero -> 1) to file  : 375 MB/s (3.00 Gb/s)
A) from /dev/zero -> 2) to file  : 150 MB/s (1.20 Gb/s)
A) from /dev/zero -> 3) to file  : 286 MB/s (2.29 Gb/s)

I had (naively) hoped the read/vmsplice/splice combination would 
run at the same speed I can write a file, i.e. at about 450 MB/s
on my setup.  Do any of my numbers seem bogus, so I should look 
harder at my test program?

Or is read+write really the fastest way to get data off a
socket and into a file?

-- Jim Schutt

(Please Cc: me, as I'm not subscribed to lkml.)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Spurious interrupts for UP w/ IO-APIC on ServerWorks

2001-04-24 Thread Jim Schutt

Alan Cox wrote:

> > I've got a dual-processor system built around the Intel SBT2 motherboard, 
> > which uses the ServerWorks LE chipset. 2.4.3 SMP works fine. When I 
> > build a UP kernel with IO-APIC support, I get this during boot: 
> 
> Turn off the OSB4 driver - bet that helps 

It does -- no more spurious interrupts.

Thanks -- Jim

-- 
Jim Schutt  <[EMAIL PROTECTED]>
Sandia National Laboratories, Albuquerque, New Mexico USA




Spurious interrupts for UP w/ IO-APIC on ServerWorks

2001-04-24 Thread Jim Schutt


Hi,

I've got a dual-processor system built around the Intel SBT2 motherboard,
which uses the ServerWorks LE chipset.  2.4.3 SMP works fine.  When I 
build a UP kernel with IO-APIC support, I get this during boot:

[... everything seems ok until ...]
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
ServerWorks OSB4: chipset revision 0
ServerWorks OSB4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0x0840-0x0847, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0x0848-0x084f, BIOS settings: hdc:DMA, hdd:DMA
spurious APIC interrupt on CPU#0, should never happen.
spurious APIC interrupt on CPU#0, should never happen.
hda: CD-ROM 52X/AKH, ATAPI CD/DVD-ROM drive
spurious APIC interrupt on CPU#0, should never happen.
spurious APIC interrupt on CPU#0, should never happen.
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
spurious APIC interrupt on CPU#0, should never happen.
spurious APIC interrupt on CPU#0, should never happen.
hda: lost interrupt
spurious APIC interrupt on CPU#0, should never happen.
hda: lost interrupt
hda: ATAPI 52X CD-ROM drive, 192kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
Floppy drive(s): fd0 is 1.44M
spurious APIC interrupt on CPU#0, should never happen.
spurious APIC interrupt on CPU#0, should never happen.
floppy0: no floppy controllers found
Serial driver version 5.05 (2000-12-13) with MANY_PORTS SHARE_IRQ SERIAL_PCI
ISAPNP enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
SCSI subsystem driver Revision: 1.00
(scsi0)  found at PCI 1/4/0
(scsi0) Wide Channel A, SCSI ID=7, 32/255 SCBs
(scsi0) Downloading sequencer code... 392 instructions downloaded
(scsi1)  found at PCI 1/4/1
(scsi1) Wide Channel B, SCSI ID=7, 32/255 SCBs
(scsi1) Downloading sequencer code... 392 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
   
scsi1 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
   
scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0
Inquiry 00 00 00 ff 00 
spurious APIC interrupt on CPU#0, should never happen.
SCSI host 0 abort (pid 0) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
spurious APIC interrupt on CPU#0, should never happen.
SCSI host 0 channel 0 reset (pid 0) timed out - trying harder
SCSI bus is being reset for host 0 channel 0.
[... repeat above 6 lines until I reset hw ...]


2.4.3 has this behavior, 2.4.3-ac9 works fine. I found this in the -ac
changelog; perhaps it's relevant:

> 2.4.2-ac12 
[...]
> o Remove serverworks handling. The BIOS is our (me) 
> best (and right now only) hope for that chip 

Here's the pertinent (I think) part of 2.4.3 .config:

CONFIG_MPENTIUMIII=y
CONFIG_X86_GOOD_APIC=y
# CONFIG_SMP is not set
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y

Details available on request.

Thanks -- Jim

-- 
Jim Schutt  <[EMAIL PROTECTED]>
Sandia National Laboratories, Albuquerque, New Mexico USA





tulip vs. de4x5 (was Re: Linux 2.4 Status / TODO page ...)

2000-11-08 Thread Jim Schutt

Jeff Garzik wrote:
> 
> 
> de4x5 is becoming EISA-only in 2.5.x too, since its PCI support is
> duplicated now in tulip driver.
> 

I've got some DEC Miatas with DECchip 21142/43 ethernet cards, and I
don't get the same link speeds when using the de4x5 and tulip drivers,
as of 2.4.0-test10.  The machines are connected to a Netgear 16-port 
10/100 hub.  With the tulip driver the hub shows a 10 Mb/s connection;
with the de4x5 driver the hub shows a 100 Mb/s connection.  In both
cases the drivers are configured into the kernel, not as modules, if
it matters.

Am I doing something wrong, or is this a known issue?

FWIW, these machines are diskless, and according to the hub they download
the kernel at 100 Mb/s.  When the tulip-based kernel initializes the
card, the hub shows the link speed switching to 10 Mb/s.


from lspci -v:

00:03.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev
30)
Flags: bus master, medium devsel, latency 255, IRQ 24
I/O ports at 8000 [size=128]
Memory at 0900 (32-bit, non-prefetchable) [size=128]
Expansion ROM at 0904 [disabled] [size=256K]


de4x5 boot messages:

eth0: DC21143 at 0x8000 (PCI bus 0, device 3), h/w address 00:00:f8:76:3c:74,
  and requires IRQ24 (provided by PCI BIOS).
de4x5.c:V0.545 1999/11/28 [EMAIL PROTECTED]


tulip boot messages:

Linux Tulip driver version 0.9.10 (September 6, 2000)
eth0: Digital DS21143 Tulip rev 48 at 0x8000, 00:00:F8:76:3C:74, IRQ 24.
eth0:  EEPROM default media type Autosense.
eth0:  Index #0 - Media 10baseT (#0) described by a 21142 Serial PHY (2) block.
eth0:  Index #1 - Media 10baseT-FD (#4) described by a 21142 Serial PHY (2)
block.
eth0:  Index #2 - Media 10base2 (#1) described by a 21142 Serial PHY (2) block.
eth0:  Index #3 - Media AUI (#2) described by a 21142 Serial PHY (2) block.
eth0:  Index #4 - Media MII (#11) described by a 21142 MII PHY (3) block.
eth0:  MII transceiver #5 config 2000 status 784b advertising 01e1.

-- Jim

-
James A. Schutt
[EMAIL PROTECTED]



