3.10.0 failed paging request from kthread_data
Hi, I'm trying to test the btrfs and ceph contributions to 3.11, without testing all of 3.11-rc1 (just yet), so I'm testing with the "next" branch of Chris Mason's tree (commit cbacd76bb3 from git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git) merged into the for-linus branch of the ceph tree (commit 8b8cf8917f from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git).

One of my ceph clients hit this:

[94633.463166] BUG: unable to handle kernel paging request at ffa8
[94633.464003] IP: [8106a070] kthread_data+0x10/0x20
[94633.464003] PGD 1a0c067 PUD 1a0e067 PMD 0
[94633.464003] Oops: [#2] SMP
[94633.464003] Modules linked in: cbc ceph libceph ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh scsi_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm microcode button serio_raw pcspkr ehci_pci ehci_hcd ib_mthca ib_mad ib_core lpc_ich mfd_core uhci_hcd i5k_amb i5000_edac edac_core dm_mod nfsv4 nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 bnx2 igb ptp pps_core i2c_algo_bit i2c_core dca hwmon e1000
[94633.464003] CPU: 0 PID: 78416 Comm: kworker/0:1 Tainted: G D W 3.10.0-00119-g2925339 #601
[94633.464003] Hardware name: Dell Inc. PowerEdge 1950/0NK937, BIOS 1.1.0 06/21/2006
[94633.464003] task: 880415b6 ti: 88040e39a000 task.ti: 88040e39a000
[94633.464003] RIP: 0010:[8106a070] [8106a070] kthread_data+0x10/0x20
[94633.464003] RSP: 0018:88040e39b7f8 EFLAGS: 00010092
[94633.464003] RAX: RBX: RCX: 81d30320
[94633.464003] RDX: RSI: RDI: 880415b6
[94633.464003] RBP: 88040e39b7f8 R08: 880415b60070 R09: 0001
[94633.464003] R10: R11: R12:
[94633.464003] R13: 880415b603e8 R14: 0001 R15: 0002
[94633.464003] FS: () GS:88042fc0() knlGS:
[94633.464003] CS: 0010 DS: ES: CR0: 8005003b
[94633.464003] CR2: 0028 CR3: 000415f77000 CR4: 07f0
[94633.464003] DR0: DR1: DR2:
[94633.464003] DR3: DR6: 0ff0 DR7: 0400
[94633.464003] Stack:
[94633.464003]  88040e39b818 810602a5 88040e39b818 88042fc139c0
[94633.464003]  88040e39b8a8 814ef79e 8804 88040e39bfd8
[94633.464003]  88040e39a000 88040e39a000 88040e39a010 88040e39a000
[94633.464003] Call Trace:
[94633.464003]  [810602a5] wq_worker_sleeping+0x15/0xa0
[94633.464003]  [814ef79e] __schedule+0x17e/0x6b0
[94633.464003]  [814efefd] schedule+0x5d/0x60
[94633.464003]  [8104717b] do_exit+0x3eb/0x440
[94633.464003]  [814f33f8] oops_end+0xd8/0xf0
[94633.464003]  [810362df] no_context+0x1bf/0x1e0
[94633.464003]  [810364f5] __bad_area_nosemaphore+0x1f5/0x230
[94633.464003]  [81036543] bad_area_nosemaphore+0x13/0x20
[94633.464003]  [814f6406] __do_page_fault+0x416/0x4b0
[94633.464003]  [810869ae] ? idle_balance+0x14e/0x180
[94633.464003]  [81077a1f] ? finish_task_switch+0x3f/0x110
[94633.464003]  [814f29e3] ? error_sti+0x5/0x6
[94633.464003]  [8109e859] ? trace_hardirqs_off_caller+0x29/0xd0
[94633.464003]  [8128c6dd] ? trace_hardirqs_off_thunk+0x3a/0x3c
[94633.464003]  [814f64ae] do_page_fault+0xe/0x10
[94633.464003]  [814f27e2] page_fault+0x22/0x30
[94633.464003]  [81285a47] ? rb_erase+0x297/0x3a0
[94633.464003]  [a02b45d8] __remove_osd+0x98/0xd0 [libceph]
[94633.464003]  [a02b49c3] __reset_osd+0xa3/0x1c0 [libceph]
[94633.464003]  [a02b6c5b] ? osd_reset+0x9b/0xd0 [libceph]
[94633.464003]  [a02b695b] __kick_osd_requests+0x7b/0x2e0 [libceph]
[94633.464003]  [a02b6c66] osd_reset+0xa6/0xd0 [libceph]
[94633.464003]  [a02aeb65] con_work+0x445/0x4a0 [libceph]
[94633.464003]  [810635b5] process_one_work+0x2e5/0x510
[94633.464003]  [81063510] ? process_one_work+0x240/0x510
[94633.464003]  [81064975] worker_thread+0x215/0x340
[94633.464003]  [81064760] ? manage_workers+0x170/0x170
[94633.464003]  [8106aa61] kthread+0xe1/0xf0
[94633.464003]  [8106a980] ? __init_kthread_worker+0x70/0x70
[94633.464003]  [814faf5c] ret_from_fork+0x7c/0xb0
[94633.464003]  [8106a980] ? __init_kthread_worker+0x70/0x70
[94633.464003] Code: 90 03 00 00 48 8b 40 98 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 90 03 00 00 <48> 8b 40 a8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66
[94633.464003] RIP [8106a070] kthread_data+0x10/0x20
[94633.464003] RSP
[94633.464003] CR2: ffa8
[94633.464003] ---[ end trace 89622896705a7fac ]---
[94633.464003] Fixing recursive fault but reboot is needed!
[94633.464003] [ cut here ]

kthread_data disassembles to this:

(gdb) disassemble kthread_data
Dump of assembler code for function kthread_data:
   0x8106a060 <+0>:  push %rbp
   0x8106a061 <+1>:  mov %rsp,%rbp
   0x8106a064 <+4>:  callq 0x814fabc0
   0x8106a069 <+9>:  mov 0x390(%rdi),%rax
   0x8106a070 <+16>:
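For what it's worth, the faulting address is consistent with a NULL base pointer: the marked bytes in the Code: line, `<48> 8b 40 a8`, decode to `mov -0x58(%rax),%rax`, and the reported fault address "ffa8" looks like a truncated print of 0xffffffffffffffa8, which is exactly 0 - 0x58. A tiny userspace check of that arithmetic (my own decoding and interpretation, not from the original report):

```c
#include <assert.h>
#include <stdint.h>

/* The oops prints the fault address as "ffa8" -- almost certainly a
 * truncated 0xffffffffffffffa8.  Reinterpreted as a signed 64-bit value
 * that is -0x58, matching the displacement in the faulting instruction
 * ("mov -0x58(%rax),%rax"), i.e. the load was made through %rax == 0. */
static int64_t fault_addr_as_offset(uint64_t cr2)
{
    /* Two's-complement reinterpretation: a huge unsigned address near
     * the top of the canonical range is a small negative offset. */
    return (int64_t)cr2;
}
```

If this reading is right, kthread_data()'s second load went through a NULL pointer, which fits the recursive-fault path in the trace (the original fault happened under `rb_erase`/`__remove_osd`, and the oops path then faulted again while scheduling the dying worker).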
Re: 3.10-rc3 NFSv3 mount issues
On 05/30/2013 02:26 PM, Chuck Lever wrote:
>
> On May 30, 2013, at 4:19 PM, Jim Schutt wrote:
>
>> Hi,
>>
>> I've been trying to test 3.10-rc3 on some diskless clients, and found
>> that I can no longer mount my root file system via NFSv3.
>>
>
> 3.10-rc3 appears to be missing the fix for this. See:
>
> http://marc.info/?l=linux-nfs&m=136855668104598&w=2
>
> Trond, can we get this applied?
>

That works for me. Thanks!

-- Jim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
3.10-rc3 NFSv3 mount issues
Hi,

I've been trying to test 3.10-rc3 on some diskless clients, and found that I can no longer mount my root file system via NFSv3. I poked around looking at NFS changes for 3.10, and found these two commits:

d497ab9751 "NFSv3: match sec= flavor against server list"
4580a92d44 "NFS: Use server-recommended security flavor by default (NFSv3)"

If I revert both of these commits from 3.10-rc3, then my diskless client can mount its root file system. The busybox mount command fails like this, when using 3.10-rc3:

/ # mount -t nfs -o ro,nolock,vers=3,proto=tcp 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x /mnt
mount: mounting 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x on /mnt failed: Invalid argument

The commit messages for both these commits seem to say that mounting with the "sec=sys" option should work, but unfortunately, my busybox doesn't seem to understand the "sec=" mount option:

/ # mount -t nfs -o ro,nolock,vers=3,proto=tcp,sec=sys 172.17.0.122:/gmi/images/jaschut/ceph.toss-2x /mnt
mount: invalid number 'sys'

My NFS server is based on RHEL6, and is not using any "sec=" option in its export for this file system. I did try exporting with "sec=sys", but it didn't seem to make any difference either.

So far, this seems like a regression to me. Any ideas what I might be doing wrong? How can I help make this work again?

Thanks

-- Jim
Re: [PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic
On 05/15/2013 10:49 AM, Alex Elder wrote:
> On 05/15/2013 11:38 AM, Jim Schutt wrote:
>> Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while
>> holding a lock, but it's spoiled because ceph_pagelist_addpage() always
>> calls kmap(), which might sleep. Here's the result:
>
> This looks good to me, but I admit I didn't take as close
> a look at it this time.
>
> I appreciate your updating the series to include the things
> I mentioned.
>
> I'll commit these for you, and I'll get confirmation on the
> byte order thing as well.

Cool. Thanks, Alex.

-- Jim

> Reviewed-by: Alex Elder
[PATCH v2 3/3] ceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc() while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result:

[13439.295457] ceph: mds0 reconnect start
[13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58
[13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1
[13439.316464] 5 locks held by kworker/1:1/12059:
[13439.320998]  #0: (ceph-msgr){..}, at: [810609f8] process_one_work+0x218/0x480
[13439.329701]  #1: ((&(&con->work)->work)){..}, at: [810609f8] process_one_work+0x218/0x480
[13439.339446]  #2: (&s->s_mutex){..}, at: [a046273c] send_mds_reconnect+0xec/0x450 [ceph]
[13439.349081]  #3: (&mdsc->snap_rwsem){..}, at: [a04627be] send_mds_reconnect+0x16e/0x450 [ceph]
[13439.359278]  #4: (file_lock_lock){..}, at: [811cadf5] lock_flocks+0x15/0x20
[13439.367816] Pid: 12059, comm: kworker/1:1 Tainted: G W 3.9.0-00358-g308ae61 #557
[13439.376225] Call Trace:
[13439.378757]  [81076f4c] __might_sleep+0xfc/0x110
[13439.384353]  [a03f4ce0] ceph_pagelist_append+0x120/0x1b0 [libceph]
[13439.391491]  [a0448fe9] ceph_encode_locks+0x89/0x190 [ceph]
[13439.398035]  [814ee849] ? _raw_spin_lock+0x49/0x50
[13439.403775]  [811cadf5] ? lock_flocks+0x15/0x20
[13439.409277]  [a045e2af] encode_caps_cb+0x41f/0x4a0 [ceph]
[13439.415622]  [81196748] ? igrab+0x28/0x70
[13439.420610]  [a045e9f8] ? iterate_session_caps+0xe8/0x250 [ceph]
[13439.427584]  [a045ea25] iterate_session_caps+0x115/0x250 [ceph]
[13439.434499]  [a045de90] ? set_request_path_attr+0x2d0/0x2d0 [ceph]
[13439.441646]  [a0462888] send_mds_reconnect+0x238/0x450 [ceph]
[13439.448363]  [a0464542] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph]
[13439.455250]  [a0462e42] check_new_map+0x352/0x500 [ceph]
[13439.461534]  [a04631ad] ceph_mdsc_handle_map+0x1bd/0x260 [ceph]
[13439.468432]  [814ebc7e] ? mutex_unlock+0xe/0x10
[13439.473934]  [a043c612] extra_mon_dispatch+0x22/0x30 [ceph]
[13439.480464]  [a03f6c2c] dispatch+0xbc/0x110 [libceph]
[13439.486492]  [a03eec3d] process_message+0x1ad/0x1d0 [libceph]
[13439.493190]  [a03f1498] ? read_partial_message+0x3e8/0x520 [libceph]
[13439.500583]  [81415184] ? kernel_recvmsg+0x44/0x60
[13439.506324]  [a03ef3a8] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
[13439.513140]  [a03f2aae] try_read+0x5fe/0x7e0 [libceph]
[13439.519246]  [a03f39f8] con_work+0x378/0x4a0 [libceph]
[13439.525345]  [8107792f] ? finish_task_switch+0x3f/0x110
[13439.531515]  [81060a95] process_one_work+0x2b5/0x480
[13439.537439]  [810609f8] ? process_one_work+0x218/0x480
[13439.543526]  [81064185] worker_thread+0x1f5/0x320
[13439.549191]  [81063f90] ? manage_workers+0x170/0x170
[13439.555102]  [81069641] kthread+0xe1/0xf0
[13439.560075]  [81069560] ? __init_kthread_worker+0x70/0x70
[13439.566419]  [814f7edc] ret_from_fork+0x7c/0xb0
[13439.571918]  [81069560] ? __init_kthread_worker+0x70/0x70
[13439.587132] ceph: mds0 reconnect success
[13490.720032] ceph: mds0 caps stale
[13501.235257] ceph: mds0 recovery completed
[13501.300419] ceph: mds0 caps renewed

Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist.

Signed-off-by: Jim Schutt
---
 fs/ceph/locks.c      | 76 +++---
 fs/ceph/mds_client.c | 65 +++---
 fs/ceph/super.h      |  9 -
 3 files changed, 89 insertions(+), 61 deletions(-)

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 4518313..8978851 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -191,29 +191,23 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
 }
 
 /**
- * Encode the flock and fcntl locks for the given inode into the pagelist.
- * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
- * sequential flock locks.
- * Must be called with lock_flocks() already held.
- * If we encounter more of a specific lock type than expected,
- * we return the value 1.
+ * Encode the flock and fcntl locks for the given inode into the ceph_filelock
+ * array. Must be called with lock_flocks() already held.
+ * If we encounter more of a specific lock type than expected, return -ENOSPC.
  */
-int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist,
-                      int num_fcntl_locks, int num_flock_locks)
+int ceph_encode_locks_to_buffer(struct inode *inode,
+                                struct ceph_filelock *flocks,
+                                int num_fcntl_locks, int num_flock_locks)
 {
 	struct file_lock *lock;
-	struct ceph_filelock cephlock;
 	int err = 0;
 	int seen_fcntl = 0;
 	int seen_flock = 0;
-	__le32 nlocks;
+	int l = 0;
 
 	dout("encoding %d flock and %d fcntl locks", num_flock_locks,
 	     num_fcntl_locks);
-	nlocks = cpu_to_le32(num_fcntl_locks);
-	err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
-	if (err)
-		goto fail;
+
 	for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) {
 		if (lock->fl_flags & FL_POSIX) {
 			++seen_fcntl;
@@ -221,20 +215,1
[PATCH v2 2/3] ceph: add missing cpu_to_le32() calls when encoding a reconnect capability
In his review, Alex Elder mentioned that he hadn't checked that num_fcntl_locks and num_flock_locks were properly decoded on the server side, from a le32 over-the-wire type to a cpu type. I checked, and AFAICS it is done; those interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph).

I also checked the server side for flock_len decoding, and I believe that also happens correctly, by virtue of having been declared __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h.

Signed-off-by: Jim Schutt
---
 fs/ceph/locks.c      | 7 +--
 fs/ceph/mds_client.c | 2 +-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index ffc86cb..4518313 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -206,10 +206,12 @@ int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist,
 	int err = 0;
 	int seen_fcntl = 0;
 	int seen_flock = 0;
+	__le32 nlocks;
 
 	dout("encoding %d flock and %d fcntl locks", num_flock_locks,
 	     num_fcntl_locks);
-	err = ceph_pagelist_append(pagelist, &num_fcntl_locks, sizeof(u32));
+	nlocks = cpu_to_le32(num_fcntl_locks);
+	err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
 	if (err)
 		goto fail;
 	for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) {
@@ -229,7 +231,8 @@ int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist,
 		goto fail;
 	}
-	err = ceph_pagelist_append(pagelist, &num_flock_locks, sizeof(u32));
+	nlocks = cpu_to_le32(num_flock_locks);
+	err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks));
 	if (err)
 		goto fail;
 	for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..d9ca152 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2485,7 +2485,7 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
 	lock_flocks();
 	ceph_count_locks(inode, &num_fcntl_locks, &num_flock_locks);
-	rec.v2.flock_len = (2*sizeof(u32) +
+	rec.v2.flock_len = cpu_to_le32(2*sizeof(u32) +
 			    (num_fcntl_locks+num_flock_locks) *
 			    sizeof(struct ceph_filelock));
 	unlock_flocks();
-- 
1.7.8.2
[PATCH v2 1/3] ceph: fix up comment for ceph_count_locks() as to which lock to hold
Signed-off-by: Jim Schutt
---
 fs/ceph/locks.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 202dd3d..ffc86cb 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -169,7 +169,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
 }
 
 /**
- * Must be called with BKL already held. Fills in the passed
+ * Must be called with lock_flocks() already held. Fills in the passed
  * counter variables, so you can prepare pagelist metadata before calling
  * ceph_encode_locks.
  */
-- 
1.7.8.2
[PATCH v2 0/3] ceph: fix might_sleep while atomic
This patch series fixes an issue where encode_caps_cb() ends up calling kmap() while holding a spinlock. The original patch (http://www.spinics.net/lists/ceph-devel/msg14573.html) was expanded to fix some other outstanding issues noticed by Alex Elder in his review.

Changes from the original version:
- clean up a comment
- add missing calls to cpu_to_le32()
- eliminate unnecessary loop around calls to ceph_pagelist_append()
- only loop on calls to ceph_count_locks() and ceph_encode_locks_to_buffer() if the number of locks changes; exit on other errors
- return proper error code from encode_caps_cb() if kmalloc fails

In his review, Alex mentioned that he hadn't checked that num_fcntl_locks and num_flock_locks were properly decoded on the server side, from a le32 over-the-wire type to a cpu type. I checked, and AFAICS it is done; those interested can consult Locker::_do_cap_update() in src/mds/Locker.cc and src/include/encoding.h in the Ceph server code (git://github.com/ceph/ceph).

I also checked the server side for flock_len decoding, and I believe that also happens correctly, by virtue of having been declared __le32 in struct ceph_mds_cap_reconnect, in src/include/ceph_fs.h.

Jim Schutt (3):
  ceph: fix up comment for ceph_count_locks() as to which lock to hold
  ceph: add missing cpu_to_le32() calls when encoding a reconnect capability
  ceph: ceph_pagelist_append might sleep while atomic

 fs/ceph/locks.c      | 75 --
 fs/ceph/mds_client.c | 65 +++
 fs/ceph/super.h      |  9 -
 3 files changed, 90 insertions(+), 59 deletions(-)

-- 
1.7.8.2
Re: [PATCH] libceph: ceph_pagelist_append might sleep while atomic
On 05/14/2013 10:44 AM, Alex Elder wrote: > On 05/09/2013 09:42 AM, Jim Schutt wrote: >> Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while >> holding a lock, but it's spoiled because ceph_pagelist_addpage() always >> calls kmap(), which might sleep. Here's the result: > > I finally took a close look at this today, Jim. Sorry > for the delay. > No worries - thanks for taking a look. > The issue is formatting the reconnect message--which will > hold an arbitrary amount of data and therefore which we'll > need to do some allocation (and kmap) for--in the face of > having to hold the flock spinlock while doing so. > > And as you found, ceph_pagelist_addpage(), which is called > by ceph_pagelist_append(), calls kmap() even if it doesn't > need to allocate anything. This means that despite reserving > the pages, those pages are in the free list and because they'll > need to be the subject of kmap() their preallocation doesn't > help. > > Your solution was to pre-allocate a buffer, format the locks > into that buffer while holding the lock, then append the > buffer contents to a pagelist after releasing the lock. You > check for a changing (increasing) lock count while you format > the locks, which is good. > > So... Given that, I think your change looks good. It's a shame > we can't format directly into the pagelist buffer but this won't > happen much so it's not a big deal. I have a few small suggestions, > below. > > I do find some byte order bugs though. They aren't your doing, > but I think they ought to be fixed first, as a separate patch > that would precede this one. The bug is that the lock counts > that are put into the buffer (num_fcntl_locks and num_flock_locks) > are not properly byte-swapped. I'll point it out inline > in your code, below. > > I'll say that what you have is OK. Consider my suggestions, and > if you choose not to fix the byte order bugs, please let me know. 
I'll happily fix up a v2 series with your suggestions addressed. Thanks for catching those issues. Stay tuned... Thanks -- Jim
[PATCH] libceph: ceph_pagelist_append might sleep while atomic
Ceph's encode_caps_cb() worked hard to not call __page_cache_alloc while holding a lock, but it's spoiled because ceph_pagelist_addpage() always calls kmap(), which might sleep. Here's the result: [13439.295457] ceph: mds0 reconnect start [13439.300572] BUG: sleeping function called from invalid context at include/linux/highmem.h:58 [13439.309243] in_atomic(): 1, irqs_disabled(): 0, pid: 12059, name: kworker/1:1 [13439.316464] 5 locks held by kworker/1:1/12059: [13439.320998] #0: (ceph-msgr){..}, at: [] process_one_work+0x218/0x480 [13439.329701] #1: ((&(&con->work)->work)){..}, at: [] process_one_work+0x218/0x480 [13439.339446] #2: (&s->s_mutex){..}, at: [] send_mds_reconnect+0xec/0x450 [ceph] [13439.349081] #3: (&mdsc->snap_rwsem){..}, at: [] send_mds_reconnect+0x16e/0x450 [ceph] [13439.359278] #4: (file_lock_lock){..}, at: [] lock_flocks+0x15/0x20 [13439.367816] Pid: 12059, comm: kworker/1:1 Tainted: GW 3.9.0-00358-g308ae61 #557 [13439.376225] Call Trace: [13439.378757] [] __might_sleep+0xfc/0x110 [13439.384353] [] ceph_pagelist_append+0x120/0x1b0 [libceph] [13439.391491] [] ceph_encode_locks+0x89/0x190 [ceph] [13439.398035] [] ? _raw_spin_lock+0x49/0x50 [13439.403775] [] ? lock_flocks+0x15/0x20 [13439.409277] [] encode_caps_cb+0x41f/0x4a0 [ceph] [13439.415622] [] ? igrab+0x28/0x70 [13439.420610] [] ? iterate_session_caps+0xe8/0x250 [ceph] [13439.427584] [] iterate_session_caps+0x115/0x250 [ceph] [13439.434499] [] ? set_request_path_attr+0x2d0/0x2d0 [ceph] [13439.441646] [] send_mds_reconnect+0x238/0x450 [ceph] [13439.448363] [] ? ceph_mdsmap_decode+0x5e2/0x770 [ceph] [13439.455250] [] check_new_map+0x352/0x500 [ceph] [13439.461534] [] ceph_mdsc_handle_map+0x1bd/0x260 [ceph] [13439.468432] [] ? mutex_unlock+0xe/0x10 [13439.473934] [] extra_mon_dispatch+0x22/0x30 [ceph] [13439.480464] [] dispatch+0xbc/0x110 [libceph] [13439.486492] [] process_message+0x1ad/0x1d0 [libceph] [13439.493190] [] ? read_partial_message+0x3e8/0x520 [libceph] [13439.500583] [] ? 
kernel_recvmsg+0x44/0x60 [13439.506324] [] ? ceph_tcp_recvmsg+0x48/0x60 [libceph] [13439.513140] [] try_read+0x5fe/0x7e0 [libceph] [13439.519246] [] con_work+0x378/0x4a0 [libceph] [13439.525345] [] ? finish_task_switch+0x3f/0x110 [13439.531515] [] process_one_work+0x2b5/0x480 [13439.537439] [] ? process_one_work+0x218/0x480 [13439.543526] [] worker_thread+0x1f5/0x320 [13439.549191] [] ? manage_workers+0x170/0x170 [13439.555102] [] kthread+0xe1/0xf0 [13439.560075] [] ? __init_kthread_worker+0x70/0x70 [13439.566419] [] ret_from_fork+0x7c/0xb0 [13439.571918] [] ? __init_kthread_worker+0x70/0x70 [13439.587132] ceph: mds0 reconnect success [13490.720032] ceph: mds0 caps stale [13501.235257] ceph: mds0 recovery completed [13501.300419] ceph: mds0 caps renewed Fix it up by encoding locks into a buffer first, and when the number of encoded locks is stable, copy that into a ceph_pagelist. Signed-off-by: Jim Schutt --- fs/ceph/locks.c | 73 + fs/ceph/mds_client.c | 62 ++ fs/ceph/super.h |9 +- 3 files changed, 88 insertions(+), 56 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 202dd3d..9a46161 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -191,27 +191,23 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) } /** - * Encode the flock and fcntl locks for the given inode into the pagelist. - * Format is: #fcntl locks, sequential fcntl locks, #flock locks, - * sequential flock locks. - * Must be called with lock_flocks() already held. - * If we encounter more of a specific lock type than expected, - * we return the value 1. + * Encode the flock and fcntl locks for the given inode into the ceph_filelock + * array. Must be called with lock_flocks() already held. + * If we encounter more of a specific lock type than expected, return -ENOSPC. 
*/ -int ceph_encode_locks(struct inode *inode, struct ceph_pagelist *pagelist, - int num_fcntl_locks, int num_flock_locks) +int ceph_encode_locks_to_buffer(struct inode *inode, + struct ceph_filelock *flocks, + int num_fcntl_locks, int num_flock_locks) { struct file_lock *lock; - struct ceph_filelock cephlock; int err = 0; int seen_fcntl = 0; int seen_flock = 0; + int l = 0; dout("encoding %d flock and %d fcntl locks", num_flock_locks, num_fcntl_locks); - err = ceph_pagelist_append(pagelist, &num_fcntl_locks, sizeof(u32)); - if (err) - goto fail; + for (lock = inode->i_flock; lock != NULL; lock = lock->fl_next) { if (lock->fl_flags & FL_POSIX) { ++seen_fcntl; @@ -219,19 +215,12 @@ int ceph_encode_locks(struct inode
Re: 3.7.0-rc8 btrfs locking issue
On 12/11/2012 06:37 PM, Liu Bo wrote: > On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote: >> On 12/09/2012 07:04 AM, Liu Bo wrote: >>> On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote: >>> Hi Jim, >>> >>> Could you please apply the following patch to test if it works? >> >> Hi, >> >> So far, with your patch applied I've been unable to reproduce >> the recursive deadlock. Thanks a lot for this patch! >> This issue has been troubling me for a while. > > Hi Jim, > > Good news for us :) > >> >> I've been trying to learn more about btrfs internals - >> if you have the time to answer a couple questions about >> your patch, I'd really appreciate it. > > See below. > >> >>> >>> (It's against 3.7-rc8.) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 3d3e2c1..100289b 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root >>> *root, int data) >>> >>> if (data) >>> flags = BTRFS_BLOCK_GROUP_DATA; >>> - else if (root == root->fs_info->chunk_root) >>> + else if (root == root->fs_info->chunk_root || >>> +root == root->fs_info->dev_root) >>> flags = BTRFS_BLOCK_GROUP_SYSTEM; >>> else >>> flags = BTRFS_BLOCK_GROUP_METADATA; >>> @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct >>> btrfs_root *root, u64 type) >>> num_dev = 1;/* DUP or single */ >>> >>> /* metadata for updaing devices and chunk tree */ >>> + num_dev = num_dev << 1 >> >> AFAICS this is doubling the size of the reserve, which >> reduces the chance of a recursive do_chunk_alloc(), right? >> > > Not like that, we hit the deadlock because updating device tree also > uses METADATA chunk, which may be called when we're actually allocating > a METADATA chunk, so the patch I sent you makes updating device tree > use SYSTEM chunk, which we'll have some code to check if it is enough > before allocating a chunk(if not, we'll allocate a SYSTEM chunk first). 
> > Here I double the size just because the worst case of allocating a > DATA/METADATA chunk -may- result in > > 1) adding a SYSTEM chunk + > 2) adding a dev extent per chunk stripe + > 3) updating chunk stripes' bytes_used > >>> return btrfs_calc_trans_metadata_size(root, num_dev + 1); >> >> btrfs_calc_trans_metadata_size(root, num_items) multiplies its >> num_items argument by another factor of three - do you know if >> there is some rationale behind that number, or is it >> perhaps also an empirically determined factor? > > The height of Btrfs's metadata btree is at most 8, > leaf is on level 0 while a node is at most on level 7. > > Each btree update may result in COWing a node and its sibling nodes, > which is where the factor of three comes from. > >> >> What I'm wondering about is that if the size of the reserve is >> empirically determined, will it need to be increased again >> later when machines are more capable, and can handle a higher >> load? >> >> Do you think it's feasible to modify the locking for >> do_chunk_alloc to allow it to recurse without deadlock? > > Well, it could be, but IMO it'll bring us complexity, so worse > maintenance. > > Any questions? Feel free to ask. Your response was very helpful. Thanks a lot! -- Jim > > thanks, > liubo > >
Re: 3.7.0-rc8 btrfs locking issue
On 12/09/2012 07:04 AM, Liu Bo wrote: > On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote: >> > Hi, >> > >> > I'm hitting a btrfs locking issue with 3.7.0-rc8. >> > >> > The btrfs filesystem in question is backing a Ceph OSD >> > under a heavy write load from many cephfs clients. >> > >> > I reported this issue a while ago: >> > http://www.spinics.net/lists/linux-btrfs/msg19370.html >> > when I was testing what I thought might be part of the >> > 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. >> > >> > I spent some time attempting to bisect the btrfs patch queue >> > just before it was merged for 3.7, but got nowhere due to >> > false negatives. >> > >> > I've just been able to get back to testing 3.7-rc, and found >> > that I can still trigger the issue. > Hi Jim, > > Could you please apply the following patch to test if it works? Hi, So far, with your patch applied I've been unable to reproduce the recursive deadlock. Thanks a lot for this patch! This issue has been troubling me for a while. I've been trying to learn more about btrfs internals - if you have the time to answer a couple questions about your patch, I'd really appreciate it. > > (It's against 3.7-rc8.) 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 3d3e2c1..100289b 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root > *root, int data) > > if (data) > flags = BTRFS_BLOCK_GROUP_DATA; > - else if (root == root->fs_info->chunk_root) > + else if (root == root->fs_info->chunk_root || > + root == root->fs_info->dev_root) > flags = BTRFS_BLOCK_GROUP_SYSTEM; > else > flags = BTRFS_BLOCK_GROUP_METADATA; > @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct > btrfs_root *root, u64 type) > num_dev = 1;/* DUP or single */ > > /* metadata for updaing devices and chunk tree */ > + num_dev = num_dev << 1 AFAICS this is doubling the size of the reserve, which reduces the chance of a recursive do_chunk_alloc(), right? > return btrfs_calc_trans_metadata_size(root, num_dev + 1); btrfs_calc_trans_metadata_size(root, num_items) multiplies its num_items argument by another factor of three - do you know if there is there some rationale behind that number, or is it perhaps also an empirically determined factor? What I'm wondering about is that if the size of the reserve is empirically determined, will it need to be increased again later when machines are more capable, and can handle a higher load? Do you think it's feasible to modify the locking for do_chunk_alloc to allow it to recurse without deadlock? 
Thanks -- Jim > } > > @@ -4351,7 +4353,7 @@ static void init_global_block_rsv(struct > btrfs_fs_info *fs_info) > > fs_info->extent_root->block_rsv = &fs_info->global_block_rsv; > fs_info->csum_root->block_rsv = &fs_info->global_block_rsv; > - fs_info->dev_root->block_rsv = &fs_info->global_block_rsv; > + fs_info->dev_root->block_rsv = &fs_info->chunk_block_rsv; > fs_info->tree_root->block_rsv = &fs_info->global_block_rsv; > fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv; > > > thanks, > liubo > >
Re: 3.7.0-rc8 btrfs locking issue
On 12/05/2012 09:07 AM, Jim Schutt wrote: Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients. I reported this issue a while ago: http://www.spinics.net/lists/linux-btrfs/msg19370.html when I was testing what I thought might be part of the 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. I spent some time attempting to bisect the btrfs patch queue just before it was merged for 3.7, but got nowhere due to false negatives. I've just been able to get back to testing 3.7-rc, and found that I can still trigger the issue. First I get this lockdep splat: [ 1184.201331] = [ 1184.206716] [ INFO: possible recursive locking detected ] [ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted [ 1184.217156] - [ 1184.222544] ceph-osd/42177 is trying to acquire lock: [ 1184.227589] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.237270] [ 1184.237270] but task is already holding lock: [ 1184.243114] (&fs_info->chunk_mutex){+.+...}, at: [] do_chunk_alloc+0x183/0x340 [btrfs] [ 1184.252786] [ 1184.252786] other info that might help us debug this: [ 1184.259303] Possible unsafe locking scenario: [ 1184.259303] [ 1184.265220]CPU0 [ 1184.267680] [ 1184.270133] lock(&fs_info->chunk_mutex); [ 1184.274276] lock(&fs_info->chunk_mutex); [ 1184.278417] [ 1184.278417] *** DEADLOCK *** To try to debug this, I applied this patch: -cut here - diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c72ead8..86da319 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1470,6 +1470,10 @@ struct btrfs_fs_info { int backup_root_index; int num_tolerated_disk_barrier_failures; + + /* protect against recursive do_chunk_alloc() */ + const void *chunk_alloc_task; + const struct btrfs_space_info *prev_sinfo; }; /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7cda519..d61e93d 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2087,6 +2087,9 @@ 
int open_ctree(struct super_block *sb, fs_info->btree_inode->i_ino = BTRFS_BTREE_INODE_OBJECTID; set_nlink(fs_info->btree_inode, 1); + + fs_info->chunk_alloc_task = NULL; + fs_info->prev_sinfo = NULL; /* * we set the i_size on the btree inode to the max possible int. * the real end of the address space is determined by all of diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3d3e2c1..1e0e421 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3582,6 +3582,16 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, } BUG_ON(!space_info); /* Logic error */ + if (mutex_is_locked(&fs_info->chunk_mutex) && + unlikely(fs_info->chunk_alloc_task == current)) { + WARN_ONCE(1, "do_chunk_alloc() about to recursively acquire " + "fs_info->chunk_mutex: impending deadlock avoided!\n" + "outer call space_info = %p flags %#llx\n" + "nested call space_info = %p flags %#llx\n", + fs_info->prev_sinfo, fs_info->prev_sinfo->flags, + space_info, space_info->flags); + return -EDEADLK; + } again: spin_lock(&space_info->lock); if (force < space_info->force_alloc) @@ -3603,6 +3613,8 @@ again: spin_unlock(&space_info->lock); mutex_lock(&fs_info->chunk_mutex); + fs_info->chunk_alloc_task = current; + fs_info->prev_sinfo = space_info; /* * The chunk_mutex is held throughout the entirety of a chunk @@ -3655,6 +3667,8 @@ again: space_info->chunk_alloc = 0; spin_unlock(&space_info->lock); out: + fs_info->chunk_alloc_task = NULL; + fs_info->prev_sinfo = NULL; mutex_unlock(&fs_info->chunk_mutex); return ret; } -cut here - After a few trials, I got this during a test: [ 1614.213516] [ cut here ] [ 1614.218185] WARNING: at fs/btrfs/extent-tree.c:3592 do_chunk_alloc+0xec/0x3f0 [btrfs]() [ 1614.226228] Hardware name: X8DTH-i/6/iF/6F [ 1614.230484] do_chunk_alloc() about to recursively acquire fs_info->chunk_mutex: impending deadlock avoided! 
[ 1614.230484] outer call space_info = 880c214dac00 flags 0x1 [ 1614.230484] nested call space_info = 880c214dbc00 flags 0x4 [ 1614.251968] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput joydev sg hid_generic sd_mod iTCO_wdt iTCO_vendor_support ata_piix lib
3.7.0-rc8 btrfs locking issue
Hi, I'm hitting a btrfs locking issue with 3.7.0-rc8. The btrfs filesystem in question is backing a Ceph OSD under a heavy write load from many cephfs clients.

I reported this issue a while ago: http://www.spinics.net/lists/linux-btrfs/msg19370.html when I was testing what I thought might be part of the 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree. I spent some time attempting to bisect the btrfs patch queue just before it was merged for 3.7, but got nowhere due to false negatives.

I've just been able to get back to testing 3.7-rc, and found that I can still trigger the issue. First I get this lockdep splat:

[ 1184.201331] =============================================
[ 1184.206716] [ INFO: possible recursive locking detected ]
[ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted
[ 1184.217156] ---------------------------------------------
[ 1184.222544] ceph-osd/42177 is trying to acquire lock:
[ 1184.227589]  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.237270]
[ 1184.237270] but task is already holding lock:
[ 1184.243114]  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.252786]
[ 1184.252786] other info that might help us debug this:
[ 1184.259303]  Possible unsafe locking scenario:
[ 1184.259303]
[ 1184.265220]        CPU0
[ 1184.267680]        ----
[ 1184.270133]   lock(&fs_info->chunk_mutex);
[ 1184.274276]   lock(&fs_info->chunk_mutex);
[ 1184.278417]
[ 1184.278417]  *** DEADLOCK ***
[ 1184.278417]
[ 1184.284325] May be due to missing lock nesting notation
[ 1184.284325]
[ 1184.291099] 4 locks held by ceph-osd/42177:
[ 1184.295277] #0:  (sb_writers#7){.+.+.+}, at: [<ffffffffa0584664>] btrfs_file_aio_write+0x64/0x320 [btrfs]
[ 1184.305103] #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffffa058466e>] btrfs_file_aio_write+0x6e/0x320 [btrfs]
[ 1184.316108] #2:  (sb_internal){.+.+..}, at: [<ffffffffa05746f4>] start_transaction+0x1c4/0x450 [btrfs]
[ 1184.325632] #3:  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.335761]
[ 1184.335761] stack backtrace:
[ 1184.340126] Pid: 42177, comm: ceph-osd Not tainted 3.7.0-rc8-00013-gdf2fc24 #438
[ 1184.347508] Call Trace:
[ 1184.349962]  [<ffffffff81042eca>] ? vprintk_emit+0x42a/0x4c0
[ 1184.355619]  [<ffffffff8109b579>] print_deadlock_bug+0xe9/0x100
[ 1184.361556]  [<ffffffff8109d246>] validate_chain+0x596/0x750
[ 1184.367222]  [<ffffffff8109d849>] __lock_acquire+0x449/0x510
[ 1184.372894]  [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.379417]  [<ffffffff8109d9d9>] lock_acquire+0xc9/0x120
[ 1184.384855]  [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.391377]  [<ffffffff8109d849>] ? __lock_acquire+0x449/0x510
[ 1184.397204]  [<ffffffff814ab17d>] __mutex_lock_common+0x5d/0x3a0
[ 1184.403221]  [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.409762]  [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.416323]  [<ffffffffa055a5c9>] ? do_chunk_alloc+0x179/0x340 [btrfs]
[ 1184.422849]  [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60
[ 1184.428640]  [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.435018]  [<ffffffffa05635bc>] find_free_extent+0xa3c/0xb70 [btrfs]
[ 1184.441555]  [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs]
[ 1184.449051]  [<ffffffffa0563772>] btrfs_reserve_extent+0x82/0x190 [btrfs]
[ 1184.455843]  [<ffffffffa0563905>] btrfs_alloc_free_block+0x85/0x230 [btrfs]
[ 1184.462828]  [<ffffffffa054fb6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs]
[ 1184.469471]  [<ffffffffa05a50d3>] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs]
[ 1184.476962]  [<ffffffffa055056d>] btrfs_cow_block+0x18d/0x230 [btrfs]
[ 1184.483426]  [<ffffffffa0553460>] btrfs_search_slot+0x360/0x730 [btrfs]
[ 1184.490067]  [<ffffffffa055426d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs]
[ 1184.497199]  [<ffffffff81159be3>] ? kmem_cache_alloc+0xd3/0x170
[ 1184.503151]  [<ffffffffa0596f25>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs]
[ 1184.510125]  [<ffffffffa059b0d7>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs]
[ 1184.516928]  [<ffffffffa059b1de>] btrfs_alloc_chunk+0x5e/0x90 [btrfs]
[ 1184.523374]  [<ffffffffa055a391>] ? check_system_chunk+0x71/0x130 [btrfs]
[ 1184.530179]  [<ffffffffa055a71e>] do_chunk_alloc+0x2ce/0x340 [btrfs]
[ 1184.536555]  [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs]
[ 1184.544070]  [<ffffffffa055e35c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs]
[ 1184.551574]  [<ffffffffa055ff00>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs]
[ 1184.558987]  [<ffffffffa05842de>] __btrfs_buffered_write+0x15e/0x350 [btrfs]
[ 1184.566042]  [<ffffffffa0584809>] btrfs_file_aio_write+0x209/0x320 [btrfs]
[ 1184.572944]  [<ffffffffa0584600>] ? __btrfs_direct_write+0x130/0x130 [btrfs]
[ 1184.579984]  [<ffffffff81165014>] do_sync_readv_writev+0x94/0xe0
[ 1184.585985]  [<ffffffff81166023>] do_readv_writev+0xe3/0x1e0
[ 1184.591645]  [<ffffffff81183112>] ? fget_light+0x122/0x170
[ 1184.597131]  [] vfs_writev+0x46/0x60
[ 1184.602266]  [] sys_writev+0x5f/0xc0
[ 1184.607398]  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1184.613832]  [] system_call_fastpath+0x16/0x1b

and then a little later:

[ 1319.463719] INFO: task ceph-osd:42177 blocked for more than 120 seconds.
[ 1319.470409] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1319.478227] ceph-osd        D 0000000000000246     0 42177      1 0x00000000
[ 1319.485330]  ffff880402c4d358 0000000000000046 00007fffeccc49d0 ffff880402c4dfd8
[ 1319.492852]  ffff880402c4c010 ffff880402c4c000 ffff880402c4c000 ffff880402c4c000
[ 1319.500449]  ffff880402c4dfd8 ffff880402c4c000 ffff880c2151bec0
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
Hi Mel,

On 08/12/2012 02:22 PM, Mel Gorman wrote:

I went through the patch again but only found the following which is a weak candidate. Still, can you retest with the following patch on top and CONFIG_PROVE_LOCKING set please?

I've gotten in several hours of testing on this patch with no issues at all, and no output from CONFIG_PROVE_LOCKING (I'm assuming it would show up on a serial console). So, it seems to me this patch has done the trick. CPU utilization is staying under control, and the write-out rate is good.

You can add my Tested-by: as you see fit. If you work up any refinements and would like me to test, please let me know.

Thanks -- Jim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On 08/10/2012 05:02 AM, Mel Gorman wrote:

On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:

Ok, this is an untested hack and I expect it would drop allocation success rates again under load (but not as much). Can you test again and see what effect, if any, it has please?

---8<---
mm: compaction: back out if contended
---

Initial testing with this patch looks very good from my perspective; CPU utilization stays reasonable, write-out rate stays high, no signs of stress. Here's an example after ~10 minutes under my test load:

Hmmm, I wonder if I should have tested this patch longer, in view of the trouble I ran into testing the new patch? See below.

Excellent, so it is contention that is the problem.

I'll continue testing tomorrow to be sure nothing shows up after continued testing. If this passes your allocation success rate testing, I'm happy with this performance for 3.6 - if not, I'll be happy to test any further patches.

It does impair allocation success rates as I expected (they're still ok but not as high as I'd like) so I implemented the following instead. It attempts to back off when contention is detected or compaction is taking too long. It does not back off as quickly as the first prototype did, so I'd like to see if it addresses your problem or not.

I really appreciate getting the chance to test out your patchset.

I appreciate that you have a workload that demonstrates the problem and will test patches. I will not abuse this and hope to keep the revisions to a minimum. Thanks.

---8<---
mm: compaction: Abort async compaction if locks are contended or taking too long

Hmmm, while testing this patch, a couple of my servers got stuck after ~30 minutes or so, like this:

[ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
[ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2515.884447] ceph-osd        D 0000000000000000     0 30375      1 0x00000000
[ 2515.891531]  ffff8802e1a99e38 0000000000000082 ffff88056b38e298 ffff8802e1a99fd8
[ 2515.899013]  ffff8802e1a98010 ffff8802e1a98000 ffff8802e1a98000 ffff8802e1a98000
[ 2515.906482]  ffff8802e1a99fd8 ffff8802e1a98000 ffff880697d31700 ffff8802e1a84500
[ 2515.913968] Call Trace:
[ 2515.916433]  [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2515.921417]  [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2515.927938]  [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2515.934195]  [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2515.940934]  [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2515.946244]  [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2515.951640]  [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2515.957646] INFO: task ceph-osd:95698 blocked for more than 120 seconds.
[ 2515.964330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2515.972141] ceph-osd        D 0000000000000000     0 95698      1 0x00000000
[ 2515.979223]  ffff8802b049fe38 0000000000000082 ffff88056b38e2a0 ffff8802b049ffd8
[ 2515.986700]  ffff8802b049e010 ffff8802b049e000 ffff8802b049e000 ffff8802b049e000
[ 2515.994176]  ffff8802b049ffd8 ffff8802b049e000 ffff8809832ddc00 ffff880611592e00
[ 2516.001653] Call Trace:
[ 2516.004111]  [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.009072]  [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.015589]  [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.021861]  [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.028555]  [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.033859]  [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.039248]  [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.045248] INFO: task ceph-osd:95699 blocked for more than 120 seconds.
[ 2516.051934] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.059753] ceph-osd        D 0000000000000000     0 95699      1 0x00000000
[ 2516.066832]  ffff880c022d3dc8 0000000000000082 ffff880c022d2000 ffff880c022d3fd8
[ 2516.074302]  ffff880c022d2010 ffff880c022d2000 ffff880c022d2000 ffff880c022d2000
[ 2516.081784]  ffff880c022d3fd8 ffff880c022d2000 ffff8806224cc500 ffff88096b64dc00
[ 2516.089254] Call Trace:
[ 2516.091702]  [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.096656]  [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.103176]  [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.109443]  [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.116134]  [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.121442]  [<ffffffff8111362e>] vm_mmap_pgoff+0x6e/0xb0
[ 2516.126861]  [<ffffffff8112486a>] sys_mmap_pgoff+0x18a/0x190
[ 2516.132552]  [<ffffffff8124bd6e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 2516.138985]  [<ffffffff81006b22>] sys_mmap+0x22/0x30
[ 2516.143945]  [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.149949] INFO: task ceph-osd:95816 blocked for more than 120 seconds.
[ 2516.156632] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.16] ceph-osd        D 0000000000000000     0 95816      1 0x00000000
[ 2516.171521]  ffff880332991e38 0000000000000082 ffff880332991de8 ffff880332991fd8
[ 2516.178992]  ffff880332990010 ffff880332990000 ffff880332990000 ffff880332990000
[ 2516.186466]  ffff880332
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On 08/09/2012 02:46 PM, Mel Gorman wrote:

On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:

On 08/09/2012 07:49 AM, Mel Gorman wrote:

Changelog since V2
o Capture !MIGRATE_MOVABLE pages where possible
o Document the treatment of MIGRATE_MOVABLE pages while capturing
o Expand changelogs

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This commit was introduced for good reasons and it was known in advance that the success rates would suffer but it was justified on the grounds that the high allocation success rates were achieved by aggressive reclaim. Success rates are expected to suffer even more in 3.6 due to commit [7db8889a: mm: have order > 0 compaction start off where it left] which testing has shown to severely reduce allocation success rates under load - to 0% in one case. There is a proposed change to that patch in this series and it would be ideal if Jim Schutt could retest the workload that led to commit [7db8889a: mm: have order > 0 compaction start off where it left].

On my first test of this patch series on top of 3.5, I ran into an instance of what I think is the sort of thing that patch 4/5 was fixing. Here's what vmstat had to say during that period:

My conclusion looking at the vmstat data is that everything is looking ok until system CPU usage goes through the roof. I'm assuming that's what we are all still looking at.

I'm concerned about both the high CPU usage as well as the reduction in write-out rate, but I've been assuming the latter is caused by the former.

Ok, this is an untested hack and I expect it would drop allocation success rates again under load (but not as much). Can you test again and see what effect, if any, it has please?

---8<---
mm: compaction: back out if contended
---

Initial testing with this patch looks very good from my perspective; CPU utilization stays reasonable, write-out rate stays high, no signs of stress. Here's an example after ~10 minutes under my test load:

2012-08-09 16:26:07.550-06:00
vmstat -w 4 16
procs ---------memory--------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd  free  buff  cache   si   so    bi    bo   in   cs  us sy id wa st
21 19 0 351628576 378354400017 44394 1241 653 6 20 64 9 0
11 11 0 365520576 3789306000 124 2121508 203450 170957 12 46 25 17 0
13 16 0 359888576 379544560098 2185033 209473 171571 13 44 25 18 0
17 15 0 353728576 380105360089 2170971 208052 167988 13 43 26 18 0
17 16 0 349732576 3804828400 135 2217752 218754 174170 13 49 21 16 0
43 13 0 343280576 3804650000 153 2207135 217872 179519 13 47 23 18 0
26 13 0 350968576 3793718400 147 2189822 214276 176697 13 47 23 17 0
4 12 0 350080576 3795836400 226 2145212 207077 172163 12 44 24 20 0
15 13 0 353124576 3792104000 145 2078422 197231 166381 12 41 30 17 0
14 15 0 348964576 3794958800 107 2020853 188192 164064 12 39 30 20 0
21 9 0 354784576 3795122800 117 2148090 204307 165609 13 48 22 18 0
36 16 0 347368576 3798982400 166 2208681 216392 178114 13 47 24 16 0
28 15 0 300656576 3806091200 164 2181681 214618 175132 13 45 24 18 0
9 16 0 295484576 3809218400 153 2156909 218993 180289 13 43 27 17 0
17 16 0 346760576 3797900800 165 2124168 198730 173455 12 44 27 18 0
14 17 0 360988576 3795713600 142 2092248 197430 168199 12 42 29 17 0

I'll continue testing tomorrow to be sure nothing shows up after continued testing. If this passes your allocation success rate testing, I'm happy with this performance for 3.6 - if not, I'll be happy to test any further patches.

I really appreciate getting the chance to test out your patchset.

Thanks -- Jim
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On 08/09/2012 07:49 AM, Mel Gorman wrote:

Changelog since V2
o Capture !MIGRATE_MOVABLE pages where possible
o Document the treatment of MIGRATE_MOVABLE pages while capturing
o Expand changelogs

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This commit was introduced for good reasons and it was known in advance that the success rates would suffer but it was justified on the grounds that the high allocation success rates were achieved by aggressive reclaim. Success rates are expected to suffer even more in 3.6 due to commit [7db8889a: mm: have order > 0 compaction start off where it left] which testing has shown to severely reduce allocation success rates under load - to 0% in one case. There is a proposed change to that patch in this series and it would be ideal if Jim Schutt could retest the workload that led to commit [7db8889a: mm: have order > 0 compaction start off where it left].

On my first test of this patch series on top of 3.5, I ran into an instance of what I think is the sort of thing that patch 4/5 was fixing. Here's what vmstat had to say during that period:

2012-08-09 11:58:04.107-06:00
vmstat -w 4 16
procs ---------memory--------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd  free  buff  cache   si   so    bi    bo   in   cs  us sy id wa st
20 14 0 235884576 389160720012 17047 171 133 3 8 85 4 0
18 17 0 220272576 389559120086 2131838 200142 162956 12 38 31 19 0
17 9 0 244284576 389553280019 2179562 213775 167901 13 43 26 18 0
27 15 0 223036576 389526400024 2202816 217996 158390 14 47 25 15 0
17 16 0 233124576 3895990800 5 2268815 224647 165728 14 50 21 15 0
16 13 0 225840576 389957400052 2253829 216797 160551 14 47 23 16 0
22 13 0 260584576 389829080092 2196737 211694 140924 14 53 19 15 0
16 10 0 235784576 389171280022 2157466 210022 137630 14 54 19 14 0
12 13 0 214300576 389238480031 2187735 213862 142711 14 52 20 14 0
25 12 0 219528576 389195400011 2066523 205256 142080 13 49 23 15 0
26 14 0 229460576 389137040049 2108654 200692 135447 13 51 21 15 0
11 11 0 220376576 388624560045 2136419 207493 146813 13 49 22 16 0
36 12 0 229860576 3886978400 7 2163463 212223 151812 14 47 25 14 0
16 13 0 238356576 388914960067 2251650 221728 154429 14 52 20 14 0
65 15 0 211536576 389221080059 2237925 224237 156587 14 53 19 14 0
24 13 0 585024576 386340240037 2240929 229040 148192 15 61 14 10 0

2012-08-09 11:59:04.714-06:00
vmstat -w 4 16
procs ---------memory--------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd  free  buff  cache   si   so    bi    bo   in   cs  us sy id wa st
43 8 0 794392576 383823160011 20491 576 420 3 10 82 4 0
127 6 0 579328576 384221560021 2006775 205582 119660 12 70 11 7 0
44 5 0 492860576 385123600046 1536525 173377 85320 10 78 7 4 0
218 9 0 585668576 382713200039 1257266 152869 64023 8 83 7 3 0
101 6 0 600168576 381281040010 1438705 160769 68374 9 84 5 3 0
62 5 0 597004576 380989720093 1376841 154012 63912 8 82 7 4 0
61 11 0 850396576 378087720046 1186816 145731 70453 7 78 9 6 0
124 7 0 437388576 381263200015 1208434 149736 57142 7 86 4 3 0
204 11 01105816576 373095320020 1327833 145979 52718 7 87 4 2 0
29 8 0 751020576 3736033200 8 1405474 169916 61982 9 85 4 2 0
38 7 0 626448576 373332440014 1328415 174665 74214 8 84 5 3 0
23 5 0 650040576 371342800028 1351209 179220 71631 8 85 5 2 0
40 10 0 610988576 3705429200 104 1272527 167530 73527 7 85 5 3 0
79 22 02076836576 3548734000 750 1249934 175420 70124 7 88 3 2 0
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
Hi Mel,

On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if
>   fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order > 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under
> load - to 0% in one case. There is a proposed change to that patch in
> this series and it would be ideal if Jim Schutt could retest the
> workload that led to commit [7db8889a: mm: have order > 0 compaction
> start off where it left].

I was successful at resolving my Ceph issue on 3.6-rc1, but ran into some other issue that isn't immediately obvious and that prevents me from testing your patch with 3.6-rc1. Today I will apply your patch series to 3.5 and test that way. Sorry for the delay.

-- Jim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3
On 08/09/2012 02:46 PM, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:
> > On 08/09/2012 07:49 AM, Mel Gorman wrote:
> > > [snip changelog]
> >
> > On my first test of this patch series on top of 3.5, I ran into an
> > instance of what I think is the sort of thing that patch 4/5 was
> > fixing. Here's what vmstat had to say during that period:
> >
> > SNIP
>
> My conclusion looking at the vmstat data is that everything is looking
> ok until system CPU usage goes through the roof. I'm assuming that's
> what we are all still looking at.

I'm concerned about both the high CPU usage as well as the reduction in write-out rate, but I've been assuming the latter is caused by the former.

> snip
>
> Ok, this is an untested hack and I expect it would drop allocation
> success rates again under load (but not as much). Can you test again
> and see what effect, if any, it has please?
---8--- mm: compaction: back out if contended --- snip Initial testing with this patch looks very good from my perspective; CPU utilization stays reasonable, write-out rate stays high, no signs of stress. Here's an example after ~10 minutes under my test load: 2012-08-09 16:26:07.550-06:00 vmstat -w 4 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b swpd free buff cache si sobibo in cs us sy id wa st 21 19 0 351628576 378354400017 44394 1241 653 6 20 64 9 0 11 11 0 365520576 3789306000 124 2121508 203450 170957 12 46 25 17 0 13 16 0 359888576 379544560098 2185033 209473 171571 13 44 25 18 0 17 15 0 353728576 380105360089 2170971 208052 167988 13 43 26 18 0 17 16 0 349732576 3804828400 135 2217752 218754 174170 13 49 21 16 0 43 13 0 343280576 3804650000 153 2207135 217872 179519 13 47 23 18 0 26 13 0 350968576 3793718400 147 2189822 214276 176697 13 47 23 17 0 4 12 0 350080576 3795836400 226 2145212 207077 172163 12 44 24 20 0 15 13 0 353124576 3792104000 145 2078422 197231 166381 12 41 30 17 0 14 15 0 348964576 3794958800 107 2020853 188192 164064 12 39 30 20 0 21 9 0 354784576 3795122800 117 2148090 204307 165609 13 48 22 18 0 36 16 0 347368576 3798982400 166 2208681 216392 178114 13 47 24 16 0 28 15 0 300656576 3806091200 164 2181681 214618 175132 13 45 24 18 0 9 16 0 295484576 3809218400 153 2156909 218993 180289 13 43 27 17 0 17 16 0 346760576 3797900800 165 2124168 198730 173455 12 44 27 18 0 14 17 0 360988576 3795713600 142 2092248 197430 168199 12 42 29 17 0 I'll continue testing tomorrow to be sure nothing shows up after continued testing. If this passes your allocation success rate testing, I'm happy with this performance for 3.6 - if not, I'll be happy to test any further patches. I really appreciate getting the chance to test out your patchset. 
Thanks -- Jim
Re: [PATCH 6/6] mm: have order > 0 compaction start near a pageblock with free pages
On 08/07/2012 08:52 AM, Mel Gorman wrote:
> On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:
> > On 08/07/2012 08:31 AM, Mel Gorman wrote:
> > > commit [7db8889a: mm: have order > 0 compaction start off where it
> > > left] introduced a caching mechanism to reduce the amount of work
> > > the free page scanner does in compaction. However, it has a problem.
> > > Consider two processes simultaneously scanning free pages
> > > C Process A M S F |---| Process B M FS
> >
> > Argh. Good spotting.
> >
> > > This is not optimal and it can still race but the
> > > compact_cached_free_pfn will be pointing to or very near a pageblock
> > > with free pages.
> >
> > Agreed on the "not optimal", but I also cannot think of a better idea
> > right now. Getting this fixed for 3.6 is important, we can think of
> > future optimizations in San Diego.
>
> Sounds like a plan.
>
> > > Signed-off-by: Mel Gorman
> >
> > Reviewed-by: Rik van Riel
>
> Thanks very much.
>
> Jim, what are the chances of getting this series tested with your large
> data workload? As it's on top of 3.5, it should be less scary than
> testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
> test with just this patch on top.

As it turns out I'm already testing 3.6-rc1, as I'm on the trail of a Ceph client messaging bug. I think I've about got that figured out, and am working on a patch, but I need it fixed in order to generate enough load to trigger the problem that your patch addresses. Which is a long-winded way of saying: no problem, I'll roll this into my current testing, but I'll need another day or two before I'm likely to be able to generate a high enough load to test effectively. OK?

Also FWIW, it occurs to me that you might be interested to know that my load also involves lots of network load where I'm using jumbo frames. I suspect that puts even more stress on higher page order allocations, right?
-- Jim
Re: splice/vmsplice performance test results
On Thu, 2006-11-23 at 12:24 +0100, Jens Axboe wrote:
> On Wed, Nov 22 2006, Jim Schutt wrote:
> >
> > On Wed, 2006-11-22 at 09:57 +0100, Jens Axboe wrote:
> > > On Tue, Nov 21 2006, Jim Schutt wrote:
> > [snip]
> > > >
> > > > Hmmm. Is it worth me trying to do some sort of kernel
> > > > profiling to see if there is anything unexpected with
> > > > my setup? If so, do you have a preference as to what
> > > > I would use?
> > >
> > > Not sure that profiling would be that interesting, as the problem
> > > probably lies in where we are _not_ spending the time. But it
> > > certainly can't hurt. Try to oprofile the kernel for a 10-20 sec
> > > interval while the test is running. Do 3 such runs for the two test
> > > cases (write-to-file, vmsplice/splice-to-file).
> >
> > OK, I've attached results for 20 second profiles of three
> > runs of each test: read-from-socket + write-to-file, and
> > read-from-socket + vmsplice/splice-to-file.
> >
> > The test case and throughput is in the name: e.g. rvs-1-306MBps
> > is trial 1 of read/vmsplice/splice case, which ran at 306 MB/s.
> >
> > Let me know if I can help with more testing, and thanks
> > again for looking into this.
>
> As I suspected, nothing sticks out in these logs as the problem here is
> not due to a maxed out system. The traces look fairly identical, less
> time spent in copy_user with the splice approach.
>
> Comparing the generic_file_buffered_write() and splice-to-file path,
> there really isn't a whole lot of difference. It would be interesting to
> try and eliminate some of the differences between the two approaches -
> could you try and change the vmsplice to a write-to-pipe instead? And
> add SPLICE_F_NONBLOCK to the splice-to-file as well. Basically I'm
> interested in a something that only really tests splice-to-file vs
> write-to-file. Perhaps easier if you can just run fio to test that, I'm
> inlining a job file to test that specifically.

Sorry for the delayed reply.
Here's results from three fio runs, using the job file you gave me, and fio-git-20061124142507.tar.gz: --- write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=sync, iodepth=1 splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=splice, iodepth=1 write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=sync, iodepth=1 splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=splice, iodepth=1 write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=sync, iodepth=1 splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=splice, iodepth=1 write: (groupid=0): err= 0: write: io= 16384MiB, bw=558404KiB/s, runt= 30766msec slat (msec): min=0, max=0, avg= 0.00, dev= 0.00 clat (msec): min=0, max= 35, avg= 0.00, dev= 0.08 bw (KiB/s) : min=0, max=644481, per=98.90%, avg=552243.95, dev=561417.03 cpu : usr=0.91%, sys=85.03%, ctx=14121 splice: (groupid=1): err= 0: write: io= 16384MiB, bw=486144KiB/s, runt= 35339msec slat (msec): min=0, max=0, avg= 0.00, dev= 0.00 clat (msec): min=0, max= 30, avg= 0.00, dev= 0.10 bw (KiB/s) : min=0, max=555745, per=99.06%, avg=481567.28, dev=488565.60 cpu : usr=0.85%, sys=88.87%, ctx=12956 Run status group 0 (all jobs): WRITE: io=16384MiB, aggrb=558404, minb=558404, maxb=558404, mint=30766msec, maxt=30766msec Run status group 1 (all jobs): WRITE: io=16384MiB, aggrb=486144, minb=486144, maxb=486144, mint=35339msec, maxt=35339msec Disk stats (read/write): md0: ios=20/270938, merge=0/0, ticks=0/0, in_queue=0, util=0.00% --- write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=sync, iodepth=1 splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=splice, iodepth=1 write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=sync, iodepth=1 splice: (g=1): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=splice, iodepth=1 write: (g=0): rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=sync, iodepth=1 splice: (g=1): 
rw=write, odir=0, bs=64K-64K/64K-64K, rate=0, ioengine=splice, iodepth=1 write: (groupid=0): err= 0: write: io= 16384MiB, bw=547234KiB/s, runt= 31394msec slat (msec): min=0, max=1, avg= 0.00, dev= 0.00 clat (msec): min=0, max= 57, avg= 0.00, dev= 0.14 bw (KiB/s) : min=0, max=662568, per=98.94%, avg=541406.71, dev=550958.67 cpu : usr=0.79%, sys=82.80%, ctx=16560 splice: (groupid=1): err= 0: write: io= 16384MiB, bw=473313KiB/s, runt= 36297msec slat (msec): min=0, max=1, avg= 0.00, dev= 0.00 clat (msec): min=0, max= 27, avg= 0.00, dev= 0.11 bw (KiB/s) : min=0, max=562298, per=
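Jens's inlined job file is not preserved in this archive. A hypothetical reconstruction, consistent with the parameters visible in the fio output above (64 KiB sequential buffered writes, 16 GiB per job, a sync-engine group followed by a splice-engine group), might look like the following; the `directory` value is an assumption:

```ini
; Hypothetical fio job - not Jens's original file, reconstructed from the
; reported output (bs=64K, io=16384MiB, odir=0, ioengine=sync vs splice).
[global]
rw=write
bs=64k
size=16g
direct=0
directory=/mnt/test   ; assumption: mount point of the md0 filesystem

[write]
ioengine=sync

[splice]
stonewall             ; start group 1 only after group 0 completes
ioengine=splice
```

With `stonewall`, fio reports the two jobs as separate run status groups, matching the "Run status group 0/1" lines in the results above.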
Re: splice/vmsplice performance test results
On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> On Thu, Nov 16 2006, Jim Schutt wrote:
> > Hi,
> >
> > My test program can do one of the following:
> >
> > send data:
> > A) read() from file into buffer, write() buffer into socket
> > B) mmap() section of file, write() that into socket, munmap()
> > C) splice() from file to pipe, splice() from pipe to socket
> >
> > receive data:
> > 1) read() from socket into buffer, write() buffer into file
> > 2) ftruncate() to extend file, mmap() new extent, read()
> > from socket into new extent, munmap()
> > 3) read() from socket into buffer, vmsplice() buffer to
> > pipe, splice() pipe to file (using the double-buffer trick)
> >
> > Here's the results, using:
> > - 64 KiB buffer, mmap extent, or splice
> > - 1 MiB TCP window
> > - 16 GiB data sent across network
> >
> > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> >
> > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> >
> > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> >
> > I had (naively) hoped the read/vmsplice/splice combination would
> > run at the same speed I can write a file, i.e. at about 450 MB/s
> > on my setup. Do any of my numbers seem bogus, so I should look
> > harder at my test program?
>
> Could be read-ahead playing in here, I'd have to take a closer look at
> the generated io patterns to say more about that. Any chance you can
> capture iostat or blktrace info for such a run to compare that goes to
> the disk?

I've attached a file with iostat and vmstat results for the case where I read from a socket and write a file, vs. the case where I read from a socket and use vmsplice/splice to write the file. (Sorry it's not inline - my mailer locks up when I try to include the file.)
Would you still like blktrace info for these two cases? -- Jim Here's 'iostat -dx 5' on the receiver, when it reads from a socket and writes to a file: Device:rrqm/s wrqm/s r/s w/s rsec/s wsec/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 0.00 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdc 0.00 0.00 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdd 0.00 0.00 0.00 0.200.000.40 0.00 0.20 2.00 0.000.00 0.00 0.00 sde 0.00 0.00 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 md0 0.00 0.00 0.00 0.200.000.40 0.00 0.20 2.00 0.000.00 0.00 0.00 Device:rrqm/s wrqm/s r/s w/s rsec/s wsec/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 0.20 0.00 571.260.00 142574.45 0.00 71287.23 249.58 0.891.56 0.43 24.51 sdc 0.00 0.00 0.00 576.850.00 142558.08 0.00 71279.04 247.13 1.021.77 0.42 23.95 sdd 0.00 1.20 0.60 577.454.79 142545.91 2.40 71272.95 246.61 1.162.00 0.54 31.46 sde 0.00 0.60 0.00 573.850.00 142607.58 0.00 71303.79 248.51 1.342.33 0.54 30.74 md0 0.00 0.00 0.60 2301.404.79 570286.03 2.40 285143.01 247.74 0.000.00 0.00 0.00 Device:rrqm/s wrqm/s r/s w/s rsec/s wsec/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 0.20 0.00 750.400.00 187265.80 0.00 93632.90 249.55 1.101.46 0.42 31.36 sdc 0.00 1.40 0.00 755.800.00 187236.80 0.00 93618.40 247.73 1.301.72 0.43 32.64 sdd 0.00 2.80 0.00 751.200.00 187253.40 0.00 93626.70 249.27 1.532.04 0.55 41.36 sde 0.00 0.20 0.00 767.400.00 187235.20 0.00 93617.60 243.99 1.652.15 0.54 41.52 md0 0.00 0.00 0.00 3029.400.00 748991.20 0.00 374495.60 247.24 0.000.00 0.00 0.00 Device:rrqm/s wrqm/s r/s w/s rsec/s wsec/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 1.00 0.00 771.260.00 187299.20 0.00 93649.60 242.85 1.211.57 0.44 33.61 sdc 0.00 1.60 0.00 765.870.00 187262.28 0.00 93631.14 244.51 1.331.73 0.45 34.49 sdd 0.00 1.00 0.00 739.720.00 187258.48 0.00 93629.24 253.15 1.431.93 0.55 40.56 sde 0.00
Re: splice/vmsplice performance test results
On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote: On Thu, Nov 16 2006, Jim Schutt wrote: Hi, My test program can do one of the following: send data: A) read() from file into buffer, write() buffer into socket B) mmap() section of file, write() that into socket, munmap() C) splice() from file to pipe, splice() from pipe to socket receive data: 1) read() from socket into buffer, write() buffer into file 2) ftruncate() to extend file, mmap() new extent, read() from socket into new extent, munmap() 3) read() from socket into buffer, vmsplice() buffer to pipe, splice() pipe to file (using the double-buffer trick) Here's the results, using: - 64 KiB buffer, mmap extent, or splice - 1 MiB TCP window - 16 GiB data sent across network A) from /dev/zero - 1) to /dev/null : 857 MB/s (6.86 Gb/s) A) from file - 1) to /dev/null : 472 MB/s (3.77 Gb/s) B) from file - 1) to /dev/null : 366 MB/s (2.93 Gb/s) C) from file - 1) to /dev/null : 854 MB/s (6.83 Gb/s) A) from /dev/zero - 1) to file : 375 MB/s (3.00 Gb/s) A) from /dev/zero - 2) to file : 150 MB/s (1.20 Gb/s) A) from /dev/zero - 3) to file : 286 MB/s (2.29 Gb/s) I had (naively) hoped the read/vmsplice/splice combination would run at the same speed I can write a file, i.e. at about 450 MB/s on my setup. Do any of my numbers seem bogus, so I should look harder at my test program? Could be read-ahead playing in here, I'd have to take a closer look at the generated io patterns to say more about that. Any chance you can capture iostat or blktrace info for such a run to compare that goes to the disk? I've attached a file with iostat and vmstat results for the case where I read from a socket and write a file, vs. the case where I read from a socket and use vmsplice/splice to write the file. (Sorry it's not inline - my mailer locks up when I try to include the file.) Would you still like blktrace info for these two cases? 
-- Jim Here's 'iostat -dx 5' on the receiver, when it reads from a socket and writes to a file: Device:rrqm/s wrqm/s r/s w/s rsec/s wsec/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 0.00 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdc 0.00 0.00 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdd 0.00 0.00 0.00 0.200.000.40 0.00 0.20 2.00 0.000.00 0.00 0.00 sde 0.00 0.00 0.00 0.000.000.00 0.00 0.00 0.00 0.000.00 0.00 0.00 md0 0.00 0.00 0.00 0.200.000.40 0.00 0.20 2.00 0.000.00 0.00 0.00 Device:rrqm/s wrqm/s r/s w/s rsec/s wsec/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 0.20 0.00 571.260.00 142574.45 0.00 71287.23 249.58 0.891.56 0.43 24.51 sdc 0.00 0.00 0.00 576.850.00 142558.08 0.00 71279.04 247.13 1.021.77 0.42 23.95 sdd 0.00 1.20 0.60 577.454.79 142545.91 2.40 71272.95 246.61 1.162.00 0.54 31.46 sde 0.00 0.60 0.00 573.850.00 142607.58 0.00 71303.79 248.51 1.342.33 0.54 30.74 md0 0.00 0.00 0.60 2301.404.79 570286.03 2.40 285143.01 247.74 0.000.00 0.00 0.00 Device:rrqm/s wrqm/s r/s w/s rsec/s wsec/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 0.20 0.00 750.400.00 187265.80 0.00 93632.90 249.55 1.101.46 0.42 31.36 sdc 0.00 1.40 0.00 755.800.00 187236.80 0.00 93618.40 247.73 1.301.72 0.43 32.64 sdd 0.00 2.80 0.00 751.200.00 187253.40 0.00 93626.70 249.27 1.532.04 0.55 41.36 sde 0.00 0.20 0.00 767.400.00 187235.20 0.00 93617.60 243.99 1.652.15 0.54 41.52 md0 0.00 0.00 0.00 3029.400.00 748991.20 0.00 374495.60 247.24 0.000.00 0.00 0.00 Device:rrqm/s wrqm/s r/s w/s rsec/s wsec/srkB/swkB/s avgrq-sz avgqu-sz await svctm %util sdb 0.00 1.00 0.00 771.260.00 187299.20 0.00 93649.60 242.85 1.211.57 0.44 33.61 sdc 0.00 1.60 0.00 765.870.00 187262.28 0.00 93631.14 244.51 1.331.73 0.45 34.49 sdd 0.00 1.00 0.00 739.720.00 187258.48 0.00 93629.24 253.15 1.431.93 0.55 40.56 sde 0.00 0.80 0.00 744.310.00 187227.15 0.00 93613.57 251.54 1.852.48 0.56 41.68 md0 0.00 0.00 0.00 3025.550.00 749047.11 0.00 374523.55 247.57 0.000.00 0.00 0.00 
Re: splice/vmsplice performance test results
On Thu, 2006-11-16 at 21:25 +0100, Jens Axboe wrote:
> On Thu, Nov 16 2006, Jim Schutt wrote:
> > Hi,
> >
> > My test program can do one of the following:
> >
> > send data:
> > A) read() from file into buffer, write() buffer into socket
> > B) mmap() section of file, write() that into socket, munmap()
> > C) splice() from file to pipe, splice() from pipe to socket
> >
> > receive data:
> > 1) read() from socket into buffer, write() buffer into file
> > 2) ftruncate() to extend file, mmap() new extent, read()
> >    from socket into new extent, munmap()
> > 3) read() from socket into buffer, vmsplice() buffer to
> >    pipe, splice() pipe to file (using the double-buffer trick)
> >
> > Here's the results, using:
> > - 64 KiB buffer, mmap extent, or splice
> > - 1 MiB TCP window
> > - 16 GiB data sent across network
> >
> > A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)
> >
> > A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
> > B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
> > C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)
> >
> > A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
> > A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
> > A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)
> >
> > I had (naively) hoped the read/vmsplice/splice combination would
> > run at the same speed I can write a file, i.e. at about 450 MB/s
> > on my setup.  Do any of my numbers seem bogus, so I should look
> > harder at my test program?
>
> Could be read-ahead playing in here, I'd have to take a closer look at
> the generated io patterns to say more about that.  Any chance you can
> capture iostat or blktrace info for such a run to compare that goes to
> the disk?

I can try.  Do you prefer iostat or blktrace, or would you like both?
Can you point me at some instructions?

> Can you pass along the test program?

Inserted inline below.

> > Or is read+write really the fastest way to get data off a
> > socket and into a file?
> splice() should be just as fast of course, and more efficient.  Not a lot
> of real-life performance tuning has gone into it yet, so I would not be
> surprised if we need to smoothen a few edges.

I'm glad I can help a little here.

-- Jim

Here's a splice.h I'm using, based on your example:

---
/* Implement splice syscall support for glibc versions that don't
 * have it.
 */
#ifndef __do_splice_syscall_h__
#define __do_splice_syscall_h__

#include <sys/syscall.h>
#include <unistd.h>

#if defined(__i386__)
/* From kernel tree include/asm-i386/unistd.h */
#ifndef __NR_splice
#define __NR_splice 313
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 316
#endif
#elif defined(__x86_64__)
/* From kernel tree include/asm-x86_64/unistd.h */
#ifndef __NR_splice
#define __NR_splice 275
#endif
#ifndef __NR_vmsplice
#define __NR_vmsplice 278
#endif
#else
#error unsupported architecture
#endif

/* From kernel tree include/linux/pipe_fs_i.h */
#define SPLICE_F_MOVE     (0x01) /* move pages instead of copying */
#define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing (but */
                                 /* we may still block on the fd we splice */
                                 /* from/to, of course */
#define SPLICE_F_MORE     (0x04) /* expect more data */
#define SPLICE_F_GIFT     (0x08) /* pages passed in are a gift */

#ifndef SYS_splice
#define SYS_splice __NR_splice
#endif
#ifndef SYS_vmsplice
#define SYS_vmsplice __NR_vmsplice
#endif

#ifndef _LARGEFILE64_SOURCE
#error need -D_LARGEFILE64_SOURCE
#endif

static inline
int splice(int fd_in, off64_t *off_in, int fd_out, off64_t *off_out,
           size_t len, unsigned int flags)
{
        return syscall(SYS_splice, fd_in, off_in, fd_out, off_out, len, flags);
}

struct iovec;
static inline
int vmsplice(int fd, const struct iovec *iov,
             unsigned long nr_segs, unsigned int flags)
{
        return syscall(SYS_vmsplice, fd, iov, nr_segs, flags);
}

#endif /* __do_splice_syscall_h__ */
--

And here's the test program itself:

/* Copyright 2006 Sandia Corporation.
 *
 * This file is free software; you can redistribute it and/or modify
 * it under the terms of version 2 of the GNU General Public License,
 * as published by the Free Software Foundation.
 */

/* Compile with -DHAVE_SPLICE_SYSCALL if you _know_ your kernel
 * has splice/vmsplice support.
 */

#ifndef _LARGEFILE64_SOURCE
#error need -D_LARGEFILE64_SOURCE
#endif

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <resolv.h>
splice/vmsplice performance test results
Hi,

I've done some testing to see how splice/vmsplice perform vs. other
alternatives on transferring a large file across a fast network.  One
option I tested was to use vmsplice to get a 1-copy receive, but it
didn't perform as well as I had hoped.  I was wondering if my results
were at odds with what other people have observed.

I've two systems, each with:
  Tyan S2895 motherboard
  2 ea. 2.6 GHz Opteron
  1 GiB memory
  Myricom Myri-10G 10 Gb/s NIC (PCIe x8)
  2.6.19-rc5-g134a11f0 on FC4

In addition, one system has a 3ware 9590-8ML (PCIe) and a 3ware
9550SX-8LP (PCI-X), with 16 Seagate Barracuda 7200.10 SATA drives
(250 GB ea., NCQ enabled).  Write caching is enabled on the 3ware cards.

The Myricom cards are connected back-to-back using 9000 byte MTU.  I
baseline the network performance with 'iperf -w 1M -l 64K' and get
6.9 Gb/s.

After a fair amount of testing, I settled on a 4-way software RAID0 on
top of 4-way hardware RAID0 units as giving the best streaming
performance.  The file system is XFS, with the stripe unit set to the
hardware RAID chunk size, and the stripe width 16 times that.  Disk
tuning parameters in /sys/block/sd*/queue are default values, except
queue/nr_requests = 5 gives me best performance.  (It seems like the
3ware cards slow down a little if I feed them too much data on the
streaming write test I'm using.)

I baseline file write performance with

  sync; time { dd if=/dev/zero of=./zero bs=32k count=512k; sync; }

and get 465-520 MB/s (highly variable).  I test baseline file read
performance with

  time dd if=./zero of=/dev/null bs=32k count=512k

and get 950 MB/s (fairly repeatable).
My test program can do one of the following:

send data:
A) read() from file into buffer, write() buffer into socket
B) mmap() section of file, write() that into socket, munmap()
C) splice() from file to pipe, splice() from pipe to socket

receive data:
1) read() from socket into buffer, write() buffer into file
2) ftruncate() to extend file, mmap() new extent, read()
   from socket into new extent, munmap()
3) read() from socket into buffer, vmsplice() buffer to
   pipe, splice() pipe to file (using the double-buffer trick)

Here are the results, using:
- 64 KiB buffer, mmap extent, or splice
- 1 MiB TCP window
- 16 GiB data sent across network

A) from /dev/zero -> 1) to /dev/null : 857 MB/s (6.86 Gb/s)

A) from file -> 1) to /dev/null : 472 MB/s (3.77 Gb/s)
B) from file -> 1) to /dev/null : 366 MB/s (2.93 Gb/s)
C) from file -> 1) to /dev/null : 854 MB/s (6.83 Gb/s)

A) from /dev/zero -> 1) to file : 375 MB/s (3.00 Gb/s)
A) from /dev/zero -> 2) to file : 150 MB/s (1.20 Gb/s)
A) from /dev/zero -> 3) to file : 286 MB/s (2.29 Gb/s)

I had (naively) hoped the read/vmsplice/splice combination would run at
the same speed I can write a file, i.e. at about 450 MB/s on my setup.
Do any of my numbers seem bogus, so I should look harder at my test
program?  Or is read+write really the fastest way to get data off a
socket and into a file?

-- Jim Schutt

(Please Cc: me, as I'm not subscribed to lkml.)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
#define ALIGN(value,size) (((value) + (size) - 1) & ~((size) - 1))

/* Once glibc implements splice(), SYS_splice will be defined in
 * system headers.  Until then we need to use our own
Re: Spurious interrupts for UP w/ IO-APIC on ServerWorks
Alan Cox wrote:
> > I've got a dual-processor system built around the Intel SBT2 motherboard,
> > which uses the ServerWorks LE chipset.  2.4.3 SMP works fine.  When I
> > build a UP kernel with IO-APIC support, I get this during boot:
>
> Turn off the OSB4 driver - bet that helps

It does -- no more spurious interrupts.

Thanks -- Jim

--
Jim Schutt <[EMAIL PROTECTED]>
Sandia National Laboratories, Albuquerque, New Mexico USA
Spurious interrupts for UP w/ IO-APIC on ServerWorks
Hi,

I've got a dual-processor system built around the Intel SBT2 motherboard,
which uses the ServerWorks LE chipset.  2.4.3 SMP works fine.  When I
build a UP kernel with IO-APIC support, I get this during boot:

[... everything seems ok until ...]
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
ServerWorks OSB4: chipset revision 0
ServerWorks OSB4: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0x0840-0x0847, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0x0848-0x084f, BIOS settings: hdc:DMA, hdd:DMA
spurious APIC interrupt on CPU#0, should never happen.
spurious APIC interrupt on CPU#0, should never happen.
hda: CD-ROM 52X/AKH, ATAPI CD/DVD-ROM drive
spurious APIC interrupt on CPU#0, should never happen.
spurious APIC interrupt on CPU#0, should never happen.
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
spurious APIC interrupt on CPU#0, should never happen.
spurious APIC interrupt on CPU#0, should never happen.
hda: lost interrupt
spurious APIC interrupt on CPU#0, should never happen.
hda: lost interrupt
hda: ATAPI 52X CD-ROM drive, 192kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
Floppy drive(s): fd0 is 1.44M
spurious APIC interrupt on CPU#0, should never happen.
spurious APIC interrupt on CPU#0, should never happen.
floppy0: no floppy controllers found
Serial driver version 5.05 (2000-12-13) with MANY_PORTS SHARE_IRQ SERIAL_PCI ISAPNP enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
SCSI subsystem driver Revision: 1.00
(scsi0) found at PCI 1/4/0
(scsi0) Wide Channel A, SCSI ID=7, 32/255 SCBs
(scsi0) Downloading sequencer code... 392 instructions downloaded
(scsi1) found at PCI 1/4/1
(scsi1) Wide Channel B, SCSI ID=7, 32/255 SCBs
(scsi1) Downloading sequencer code... 392 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
scsi1 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.1/5.2.0
scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0 Inquiry 00 00 00 ff 00
spurious APIC interrupt on CPU#0, should never happen.
SCSI host 0 abort (pid 0) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
spurious APIC interrupt on CPU#0, should never happen.
SCSI host 0 channel 0 reset (pid 0) timed out - trying harder
SCSI bus is being reset for host 0 channel 0.
[... repeat above 6 lines until I reset hw ...]

2.4.3 has this behavior, 2.4.3-ac9 works fine.  I found this in the -ac
changelog; perhaps it's relevant:

> 2.4.2-ac12
[...]
> o  Remove serverworks handling. The BIOS is our (me)
>    best (and right now only) hope for that chip

Here's the pertinent (I think) part of 2.4.3 .config:

CONFIG_MPENTIUMIII=y
CONFIG_X86_GOOD_APIC=y
# CONFIG_SMP is not set
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y

Details available on request.

Thanks -- Jim

--
Jim Schutt <[EMAIL PROTECTED]>
Sandia National Laboratories, Albuquerque, New Mexico USA
tulip vs. de4x5 (was Re: Linux 2.4 Status / TODO page ...)
Jeff Garzik wrote:
>
> de4x5 is becoming EISA-only in 2.5.x too, since its PCI support is
> duplicated now in tulip driver.

I've got some DEC Miatas with DECchip 21142/43 ethernet cards, and I
don't get the same link speeds when using the de4x5 and tulip drivers,
as of 2.4.0-test10.

The machines are connected to a Netgear 16-port 10/100 hub.  With the
tulip driver the hub shows a 10 Mb/s connection; with the de4x5 driver
the hub shows a 100 Mb/s connection.  In both cases the drivers are
configured into the kernel, not as modules, if it matters.  Am I doing
something wrong, or is this a known issue?

FWIW, these machines are diskless, and according to the hub they
download the kernel at 100 Mb/s.  When the tulip-based kernel
initializes the card, the hub shows the link speed switching to 10 Mb/s.

from lspci -v:

00:03.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 30)
        Flags: bus master, medium devsel, latency 255, IRQ 24
        I/O ports at 8000 [size=128]
        Memory at 0900 (32-bit, non-prefetchable) [size=128]
        Expansion ROM at 0904 [disabled] [size=256K]

de4x5 boot messages:

eth0: DC21143 at 0x8000 (PCI bus 0, device 3), h/w address 00:00:f8:76:3c:74,
      and requires IRQ24 (provided by PCI BIOS).
de4x5.c:V0.545 1999/11/28 [EMAIL PROTECTED]

tulip boot messages:

Linux Tulip driver version 0.9.10 (September 6, 2000)
eth0: Digital DS21143 Tulip rev 48 at 0x8000, 00:00:F8:76:3C:74, IRQ 24.
eth0:  EEPROM default media type Autosense.
eth0:  Index #0 - Media 10baseT (#0) described by a 21142 Serial PHY (2) block.
eth0:  Index #1 - Media 10baseT-FD (#4) described by a 21142 Serial PHY (2) block.
eth0:  Index #2 - Media 10base2 (#1) described by a 21142 Serial PHY (2) block.
eth0:  Index #3 - Media AUI (#2) described by a 21142 Serial PHY (2) block.
eth0:  Index #4 - Media MII (#11) described by a 21142 MII PHY (3) block.
eth0:  MII transceiver #5 config 2000 status 784b advertising 01e1.

-- Jim

-----------------------------------------------------------------
James A. Schutt
[EMAIL PROTECTED]