I have not heard of this problem previously. Here are two
ideas to pursue.

Try to determine if these failures are all from a single
program or user and then try to determine what is unusual
about those jobs.
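
As a starting point for that comparison, here is a minimal sketch in
Python that tallies the segfault lines from the quoted kernel log by
program, fault address and instruction pointer, and collects the PIDs
involved. The function name and the regular expression are my own
assumptions, based only on the message format shown in the log below;
it is not part of SLURM.

```python
import re
from collections import Counter

# Assumed format, taken from the kernel messages quoted in this thread:
#   slurmstepd[2000]: segfault at <addr> rip <rip> rsp <rsp> error 6
SEGFAULT_RE = re.compile(
    r"(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)


def summarize_segfaults(log_text):
    """Group segfault messages by (program, fault address, rip).

    Returns a Counter of occurrences per group and a dict mapping each
    group to the list of PIDs that crashed, in log order.
    """
    counts = Counter()
    pids = {}
    # finditer also copes with several messages run together on one line.
    for match in SEGFAULT_RE.finditer(log_text):
        key = (match.group("prog"), match.group("addr"), match.group("rip"))
        counts[key] += 1
        pids.setdefault(key, []).append(int(match.group("pid")))
    return counts, pids
```

Feed it the text of the kernel log (for example the output of dmesg
saved to a file). If every crash falls into a single group, the failures
most likely share one cause, and the PID list gives you something to
match against your own records of which jobs and users were on the node.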

Write the slurmd/slurmstepd information to a log file with
a high level of detail so we can better see what is happening:
SlurmdLogFile=/tmp/slurmd.log
SlurmdDebug=6
It might be helpful to get detailed logging from the slurmctld
daemon too. Note that the log files will get large quickly
with this level of detail.

Moe Jette
SchedMD

Quoting "Glanfield, Wayne (Oxford)" <[email protected]>:

We are seeing a number of jobs hang in the completing state where the slurm daemon has segfaulted with the following message:

Any ideas what could be causing this?

Thanks
Wayne

slurmstepd[2000]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000041e54e40 error 6
slurmstepd[2016]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000041737e40 error 6
slurmstepd[1922]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000041cd1e40 error 6
slurmstepd[1985]: segfault at 0000000000000020 rip 0000003abec72694 rsp 000000004056de40 error 6
slurmstepd[11925]: segfault at 0000000000000020 rip 0000003abec72694 rsp 00007fffdcc5ee80 error 6
slurmstepd[18738]: segfault at 0000000000000020 rip 0000003abec72694 rsp 00007fffecf06ae0 error 6
INFO: task java:15636 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
java D ffffffff80150790 0 15636 15537 15637 15634 (NOTLB)
 ffff81060f5c9dc8 0000000000000086 ffff810d65333910 ffffffff80009a0e
 ffff810c70a75860 0000000000000006 ffff810c70a75860 ffff81101fde9100
 0001d9f0808c6fb2 0000000000000ab4 ffff810c70a75a48 000000071efabed0
Call Trace:
 [<ffffffff80009a0e>] __link_path_walk+0x173/0xf5b
 [<ffffffff8002ca5a>] mntput_no_expire+0x19/0x89
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff800237b5>] __path_lookup_intent_open+0x56/0x97
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff8001b026>] open_namei+0xea/0x6d5
 [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
 [<ffffffff80027533>] do_filp_open+0x1c/0x38
 [<ffffffff80019e5d>] do_sys_open+0x44/0xbe
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task java:15637 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
java D ffffffff80150790 0 15637 15537 15638 15636 (NOTLB)
 ffff811911e01dc8 0000000000000086 ffff810d65333910 ffffffff80009a0e
 ffff81201fa057e0 0000000000000006 ffff81201fa057e0 ffff81101ff400c0
 0001d9f0808c4a39 000000000000260e ffff81201fa059c8 0000000e1efabed0
Call Trace:
 [<ffffffff80009a0e>] __link_path_walk+0x173/0xf5b
 [<ffffffff8002ca5a>] mntput_no_expire+0x19/0x89
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff800237b5>] __path_lookup_intent_open+0x56/0x97
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff8001b026>] open_namei+0xea/0x6d5
 [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
 [<ffffffff80027533>] do_filp_open+0x1c/0x38
 [<ffffffff80019e5d>] do_sys_open+0x44/0xbe
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

slurmstepd[11988]: segfault at 0000000000000020 rip 0000003abec72694 rsp 00007fff19ee0850 error 6
slurmstepd[10490]: segfault at 0000000000000020 rip 0000003abec72694 rsp 000000004116ff60 error 6
rpc: request timed out, request 498658, pan_pm_report_space_rpc_v1 (sync), to 10.220.67.1:10622
nfs: server ahl16 not responding, still trying
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
bnx2x: eth5: using MSI-X  IRQs: sp 163  fp[0] 179 ... fp[14] 100
bnx2x: eth5 NIC Link is Down
bnx2x: eth5 NIC Link is Up, 1000 Mbps full duplex, receive flow control ON
nfs: server kenny not responding, still trying
nfs: server kenny OK
slurmstepd[16615]: segfault at 0000000000000020 rip 0000003abec72694 rsp 000000004022ae40 error 6
slurmstepd[16695]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000040a42e40 error 6
slurmstepd[17737]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000040f87e40 error 6
slurmstepd[17753]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000041551e40 error 6
slurmstepd[17771]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000041fabe40 error 6
      blocks= 143305920 block_size= 512
      heads= 255, sectors= 32, cylinders= 17562

      blocks= 143305920 block_size= 512
      heads= 255, sectors= 32, cylinders= 17562

slurmstepd[17784]: segfault at 0000000000000020 rip 0000003abec72694 rsp 00007ffffa59cd80 error 6
slurmstepd[16416]: segfault at 0000000000000020 rip 0000003abec72694 rsp 00007fffe504af70 error 6
INFO: task cmahostd:7081 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cmahostd D ffffffff80150790 0 7081 1 7103 7077 (NOTLB)
 ffff811ffe09bdd8 0000000000000086 00000000ffffffff 0000010100000000
 0000000700b118fa 0000000000000009 ffff811fffbda820 ffff81101fd6d080
 0002cfa83a7767ee 000000000108476f ffff811fffbdaa08 000000058002ca5a
Call Trace:
 [<ffffffff800646ac>] __down_read+0x7a/0x92
 [<ffffffff800c3054>] access_process_vm+0x47/0x18d
 [<ffffffff8000f339>] __alloc_pages+0x78/0x308
 [<ffffffff801065f4>] proc_pid_cmdline+0x69/0xf4
 [<ffffffff80106b00>] proc_info_read+0x5f/0xb9
 [<ffffffff8000b729>] vfs_read+0xcb/0x171
 [<ffffffff80011c3b>] sys_read+0x45/0x6e
 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76

INFO: task slurmstepd:11311 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
slurmstepd D ffffffff80150790 0 11311 1 11312 11026 (NOTLB)
 ffff810b26153f08 0000000000000082 ffff811ffdea6140 0000000000455420
 0000000000000292 0000000000000007 ffff81101fa98040 ffff81201fdf5860
 0002cfa46db2b134 000000000000702a ffff81101fa98228 00000006ff9cf860
Call Trace:
 [<ffffffff800181d6>] release_task+0x3a5/0x3cb
 [<ffffffff8003bc4e>] remove_wait_queue+0x1c/0x2c
 [<ffffffff80064613>] __down_write_nested+0x7a/0x92
 [<ffffffff800173c4>] sys_brk+0x28/0x110
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task slurmstepd:11313 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
slurmstepd D ffffffff80150790 0 11313 1 11314 11312 (NOTLB)
 ffff8115830bdef8 0000000000000082 ffff812005e18c40 ffff81101fdae8c0
 ffff81101fdaea00 0000000000000009 ffff81200791a040 ffff8120026f00c0
 0002cfa46deecbfd 0000000000004b66 ffff81200791a228 0000000080022290
Call Trace:
 [<ffffffff80226d03>] sys_accept+0x1b8/0x1ea
 [<ffffffff80064613>] __down_write_nested+0x7a/0x92
 [<ffffffff800cef74>] sys_mmap_pgoff+0x55/0xac
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task slurmstepd:19675 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
slurmstepd D ffffffff80150790 0 19675 1 19676 19674 (NOTLB)
 ffff811fa46c5c38 0000000000000082 0000000700000000 0000000000000001
 000000001fa7d1c0 0000000000000007 ffff810f7ed250c0 ffff81156be28820
 0002cfa46db38d5c 00000000000060c5 ffff810f7ed252a8 0000000000000001
Call Trace:
 [<ffffffff8006383f>] schedule_timeout+0x1e/0xad
 [<ffffffff8003acc1>] prepare_to_wait+0x34/0x61
 [<ffffffff800646ac>] __down_read+0x7a/0x92
 [<ffffffff800a327f>] futex_wake+0x24/0xd4
 [<ffffffff800a09d4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8004a6de>] unix_stream_sendmsg+0x281/0x346
 [<ffffffff8003e2e2>] do_futex+0x2c2/0xcf3
 [<ffffffff80045d34>] do_sock_read+0xcf/0x110
 [<ffffffff802259e0>] sock_aio_read+0x4f/0x5e
 [<ffffffff8000ceb5>] do_sync_read+0xc7/0x104
 [<ffffffff8009947a>] __group_send_sig_info+0xb9/0xc8
 [<ffffffff800a411a>] sys_futex+0x10a/0x12b
 [<ffffffff8003539c>] mm_release+0x75/0x89
 [<ffffffff80041e0f>] exit_mm+0x16/0xf5
 [<ffffffff8001581b>] do_exit+0x2b1/0x911
 [<ffffffff80093630>] complete_and_exit+0x0/0x16
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ps:20432 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ps D ffffffff80150790 0 20432 20399 20433 (NOTLB)
 ffff810f4bb2bdd8 0000000000000086 0000000000000000 0000010100000000
 0000000700b118fa 0000000000000002 ffff810f678467a0 ffff81101fd6d080
 0002cfa9172384a6 00000000009094d9 ffff810f67846988 000000058002ca5a
Call Trace:
 [<ffffffff800646ac>] __down_read+0x7a/0x92
 [<ffffffff800c3054>] access_process_vm+0x47/0x18d
 [<ffffffff8000f339>] __alloc_pages+0x78/0x308
 [<ffffffff801065f4>] proc_pid_cmdline+0x69/0xf4
 [<ffffffff80106b00>] proc_info_read+0x5f/0xb9
 [<ffffffff8000b729>] vfs_read+0xcb/0x171
 [<ffffffff80011c3b>] sys_read+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task cmahostd:7081 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cmahostd D ffffffff80150790 0 7081 1 7103 7077 (NOTLB)
 ffff811ffe09bdd8 0000000000000086 00000000ffffffff 0000010100000000
 0000000700b118fa 0000000000000009 ffff811fffbda820 ffff81101fd6d080
 0002cfa83a7767ee 000000000108476f ffff811fffbdaa08 000000058002ca5a
Call Trace:
 [<ffffffff800646ac>] __down_read+0x7a/0x92
 [<ffffffff800c3054>] access_process_vm+0x47/0x18d
 [<ffffffff8000f339>] __alloc_pages+0x78/0x308
 [<ffffffff801065f4>] proc_pid_cmdline+0x69/0xf4
 [<ffffffff80106b00>] proc_info_read+0x5f/0xb9
 [<ffffffff8000b729>] vfs_read+0xcb/0x171
 [<ffffffff80011c3b>] sys_read+0x45/0x6e
 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76

INFO: task slurmstepd:11311 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
slurmstepd D ffffffff80150790 0 11311 1 11312 11026 (NOTLB)
 ffff810b26153f08 0000000000000082 ffff811ffdea6140 0000000000455420
 0000000000000292 0000000000000007 ffff81101fa98040 ffff81201fdf5860
 0002cfa46db2b134 000000000000702a ffff81101fa98228 00000006ff9cf860
Call Trace:
 [<ffffffff800181d6>] release_task+0x3a5/0x3cb
 [<ffffffff8003bc4e>] remove_wait_queue+0x1c/0x2c
 [<ffffffff80064613>] __down_write_nested+0x7a/0x92
 [<ffffffff800173c4>] sys_brk+0x28/0x110
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task slurmstepd:11313 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
slurmstepd D ffffffff80150790 0 11313 1 11314 11312 (NOTLB)
 ffff8115830bdef8 0000000000000082 ffff812005e18c40 ffff81101fdae8c0
 ffff81101fdaea00 0000000000000009 ffff81200791a040 ffff8120026f00c0
 0002cfa46deecbfd 0000000000004b66 ffff81200791a228 0000000080022290
Call Trace:
 [<ffffffff80226d03>] sys_accept+0x1b8/0x1ea
 [<ffffffff80064613>] __down_write_nested+0x7a/0x92
 [<ffffffff800cef74>] sys_mmap_pgoff+0x55/0xac
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

slurmstepd[11344]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000040c2be40 error 6
slurmstepd[1040]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000041322f60 error 6
slurmstepd[1049]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000041398c40 error 6
slurmstepd[1056]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000041175c20 error 6
slurmstepd[11536]: segfault at 0000000000000020 rip 0000003abec72694 rsp 000000004090ce40 error 6
slurmstepd[11314]: segfault at 0000000000000020 rip 0000003abec72694 rsp 00000000412d4e40 error 6
slurmstepd[7846]: segfault at 0000000000000020 rip 0000003abec72694 rsp 0000000040671e40 error 6

Wayne Glanfield
AHL Infrastructure
[email protected]
Tel +44 20 7144 3642
Mob +44 7899 900414

Man Investments Limited | Man Research Laboratory, Eagle House, Walton Well Road | Oxford OX2 6ED | England
Visit us at: www.man.com<http://www.man.com/>

