RE: [slurm-dev] slurmstepd[2000]: segfault

Glanfield, Wayne (Oxford) Tue, 24 May 2011 06:48:39 -0700

And from the Slurm controller log:

[root@dlonapahls121 ~]# grep 195883 /var/log/slurm/slurmctld.log
[2011-05-24T13:42:23] sched: _slurm_rpc_allocate_resources JobId=195883 NodeList=kenny24 usec=1282
[2011-05-24T13:42:23] sched: _slurm_rpc_job_step_create: StepId=195883.0 kenny24 usec=445
[2011-05-24T14:27:24] Time limit exhausted for JobId=195883
[2011-05-24T14:27:24] Signal 9 of StepId=195883.0 by UID=1044: Job/step already completing or completed
[2011-05-24T14:27:25] Signal 9 of StepId=195883.0 by UID=1044: Job/step already completing or completed
[2011-05-24T14:27:25] completing job 195883
[2011-05-24T14:27:25] _slurm_rpc_complete_job_allocation JobId=195883: Job/step already completing or completed
[2011-05-24T14:31:22] Resending TERMINATE_JOB request JobId=195883 Nodelist=kenny24
[2011-05-24T14:35:22] Resending TERMINATE_JOB request JobId=195883 Nodelist=kenny24
[2011-05-24T14:39:22] Resending TERMINATE_JOB request JobId=195883 Nodelist=kenny24
[2011-05-24T14:43:22] Resending TERMINATE_JOB request JobId=195883 Nodelist=kenny24
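
As a rough illustration of how the resend pattern in this excerpt could be quantified, the sketch below extracts the "Resending TERMINATE_JOB" entries for one job and prints the gap between attempts. The sample text is copied from the log above; the function names are hypothetical helpers, not part of Slurm. The pattern only needs the text up to JobId=, so it should also work on lines that the mailer has wrapped before Nodelist=.

```python
import re
from datetime import datetime

# Sample entries copied from the slurmctld.log excerpt above.
LOG = """\
[2011-05-24T14:27:25] completing job 195883
[2011-05-24T14:31:22] Resending TERMINATE_JOB request JobId=195883 Nodelist=kenny24
[2011-05-24T14:35:22] Resending TERMINATE_JOB request JobId=195883 Nodelist=kenny24
[2011-05-24T14:39:22] Resending TERMINATE_JOB request JobId=195883 Nodelist=kenny24
[2011-05-24T14:43:22] Resending TERMINATE_JOB request JobId=195883 Nodelist=kenny24
"""

# Timestamp in brackets, then the resend message and the job ID.
RESEND = re.compile(r"^\[(\S+)\] Resending TERMINATE_JOB request JobId=(\d+)")


def resend_times(text, job_id):
    """Return the timestamps at which TERMINATE_JOB was resent for job_id."""
    times = []
    for line in text.splitlines():
        m = RESEND.match(line)
        if m and int(m.group(2)) == job_id:
            times.append(datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S"))
    return times


def resend_gaps(times):
    """Return the seconds between consecutive resend attempts."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


times = resend_times(LOG, 195883)
gaps = resend_gaps(times)
print(len(times), "resends, gaps in seconds:", gaps)
```

For the entries shown, the attempts are 240 seconds apart, so the controller keeps retrying at a fixed interval while the node never confirms that the job has ended.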


Wayne

-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Glanfield, Wayne (Oxford)
Sent: 24 May 2011 14:44
To: '[email protected]'; [email protected]
Subject: RE: [slurm-dev] slurmstepd[2000]: segfault

Additional info from slurmd on the node. I am not sure the problem is user 
specific; it is possibly application specific.

[2011-05-24T14:26:36] [196773.0] done with job
[2011-05-24T14:27:24] [195883.0] *** STEP 195883.0 CANCELLED AT 2011-05-24T14:27:24 DUE TO TIME LIMIT ***
[2011-05-24T14:27:24] [195885.0] *** STEP 195885.0 CANCELLED AT 2011-05-24T14:27:24 DUE TO TIME LIMIT ***
[2011-05-24T14:27:24] [195884.0] *** STEP 195884.0 CANCELLED AT 2011-05-24T14:27:24 DUE TO TIME LIMIT ***
[2011-05-24T14:31:39] [195885.0] Unable to destroy container 29275 
[2011-05-24T14:31:39] [195883.0] Unable to destroy container 29256 
[2011-05-24T14:31:39] [195884.0] Unable to destroy container 29263 
[2011-05-24T14:33:47] [195883.0] Unable to destroy container 29256 
[2011-05-24T14:33:47] [195885.0] Unable to destroy container 29275 
[2011-05-24T14:33:47] [195884.0] Unable to destroy container 29263 
[2011-05-24T14:35:55] [195883.0] Unable to destroy container 29256 
[2011-05-24T14:35:55] [195885.0] Unable to destroy container 29275 
[2011-05-24T14:35:55] [195884.0] Unable to destroy container 29263 
[2011-05-24T14:38:03] [195883.0] Unable to destroy container 29256 
[2011-05-24T14:38:03] [195885.0] Unable to destroy container 29275 
[2011-05-24T14:38:03] [195884.0] Unable to destroy container 29263 
[2011-05-24T14:40:11] [195883.0] Unable to destroy container 29256 
[2011-05-24T14:40:11] [195885.0] Unable to destroy container 29275 
[2011-05-24T14:40:11] [195884.0] Unable to destroy container 29263


Wayne

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: 24 May 2011 14:28
To: [email protected]; Glanfield, Wayne (Oxford)
Subject: Re: [slurm-dev] slurmstepd[2000]: segfault

I have not heard of this problem previously. Here are two ideas to pursue.

Try to determine if these failures are all from a single program or user and 
then try to determine what is unusual about those jobs.

Write the slurmd/slurmstepd information to a log file with a high level of 
detail so we can better see what is happening:
SlurmdLogFile=/tmp/slurmd.log
SlurmdDebug=6
It might be helpful to get detailed logging from the slurmctld daemon too. Note 
that the log files will get large quickly with this level of detail.

Moe Jette
SchedMD

Quoting "Glanfield, Wayne (Oxford)" <[email protected]>:

> We are seeing a number of jobs hang in the completing state where the
> slurm daemon has segfaulted with the following message
>
> Any ideas what could be causing this?
>
> Thanks
> Wayne
>
> slurmstepd[2000]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000041e54e40 error 6
> slurmstepd[2016]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000041737e40 error 6
> slurmstepd[1922]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000041cd1e40 error 6
> slurmstepd[1985]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 000000004056de40 error 6
> slurmstepd[11925]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 00007fffdcc5ee80 error 6
> slurmstepd[18738]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 00007fffecf06ae0 error 6
> INFO: task java:15636 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> java          D ffffffff80150790     0 15636  15537         15637
> 15634 (NOTLB)
>  ffff81060f5c9dc8 0000000000000086 ffff810d65333910 ffffffff80009a0e
> ffff810c70a75860 0000000000000006 ffff810c70a75860 ffff81101fde9100
>  0001d9f0808c6fb2 0000000000000ab4 ffff810c70a75a48 000000071efabed0
> Call Trace:
>  [<ffffffff80009a0e>] __link_path_walk+0x173/0xf5b
> [<ffffffff8002ca5a>] mntput_no_expire+0x19/0x89  [<ffffffff80063c6f>]
> __mutex_lock_slowpath+0x60/0x9b  [<ffffffff800237b5>]
> __path_lookup_intent_open+0x56/0x97
>  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14  [<ffffffff8001b026>]
> open_namei+0xea/0x6d5  [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
> [<ffffffff80027533>] do_filp_open+0x1c/0x38  [<ffffffff80019e5d>]
> do_sys_open+0x44/0xbe  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
>
> INFO: task java:15637 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> java          D ffffffff80150790     0 15637  15537         15638
> 15636 (NOTLB)
>  ffff811911e01dc8 0000000000000086 ffff810d65333910 ffffffff80009a0e
> ffff81201fa057e0 0000000000000006 ffff81201fa057e0 ffff81101ff400c0
>  0001d9f0808c4a39 000000000000260e ffff81201fa059c8 0000000e1efabed0
> Call Trace:
>  [<ffffffff80009a0e>] __link_path_walk+0x173/0xf5b
> [<ffffffff8002ca5a>] mntput_no_expire+0x19/0x89  [<ffffffff80063c6f>]
> __mutex_lock_slowpath+0x60/0x9b  [<ffffffff800237b5>]
> __path_lookup_intent_open+0x56/0x97
>  [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14  [<ffffffff8001b026>]
> open_namei+0xea/0x6d5  [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
> [<ffffffff80027533>] do_filp_open+0x1c/0x38  [<ffffffff80019e5d>]
> do_sys_open+0x44/0xbe  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
>
> slurmstepd[11988]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 00007fff19ee0850 error 6
> slurmstepd[10490]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 000000004116ff60 error 6
> rpc: request timed out, request 498658, pan_pm_report_space_rpc_v1
> (sync), to 10.220.67.1:10622
> nfs: server ahl16 not responding, still trying
> tg3: eth0: Link is up at 1000 Mbps, full duplex.
> tg3: eth0: Flow control is off for TX and off for RX.
> tg3: eth0: Link is up at 1000 Mbps, full duplex.
> tg3: eth0: Flow control is off for TX and off for RX.
> bnx2x: eth5: using MSI-X  IRQs: sp 163  fp[0] 179 ... fp[14] 100
> bnx2x: eth5 NIC Link is Down
> bnx2x: eth5 NIC Link is Up, 1000 Mbps full duplex, receive flow
> control ON
> nfs: server kenny not responding, still trying
> nfs: server kenny OK
> slurmstepd[16615]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 000000004022ae40 error 6
> slurmstepd[16695]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000040a42e40 error 6
> slurmstepd[17737]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000040f87e40 error 6
> slurmstepd[17753]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000041551e40 error 6
> slurmstepd[17771]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000041fabe40 error 6
>       blocks= 143305920 block_size= 512
>       heads= 255, sectors= 32, cylinders= 17562
>
>       blocks= 143305920 block_size= 512
>       heads= 255, sectors= 32, cylinders= 17562
>
> slurmstepd[17784]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 00007ffffa59cd80 error 6
> slurmstepd[16416]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 00007fffe504af70 error 6
> INFO: task cmahostd:7081 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cmahostd      D ffffffff80150790     0  7081      1          7103
> 7077 (NOTLB)
>  ffff811ffe09bdd8 0000000000000086 00000000ffffffff 0000010100000000
> 0000000700b118fa 0000000000000009 ffff811fffbda820 ffff81101fd6d080
> 0002cfa83a7767ee 000000000108476f ffff811fffbdaa08 000000058002ca5a
> Call Trace:
>  [<ffffffff800646ac>] __down_read+0x7a/0x92  [<ffffffff800c3054>]
> access_process_vm+0x47/0x18d  [<ffffffff8000f339>]
> __alloc_pages+0x78/0x308  [<ffffffff801065f4>]
> proc_pid_cmdline+0x69/0xf4  [<ffffffff80106b00>]
> proc_info_read+0x5f/0xb9  [<ffffffff8000b729>] vfs_read+0xcb/0x171
> [<ffffffff80011c3b>] sys_read+0x45/0x6e  [<ffffffff8006149d>]
> sysenter_do_call+0x1e/0x76
>
> INFO: task slurmstepd:11311 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> slurmstepd    D ffffffff80150790     0 11311      1         11312
> 11026 (NOTLB)
>  ffff810b26153f08 0000000000000082 ffff811ffdea6140 0000000000455420
>  0000000000000292 0000000000000007 ffff81101fa98040 ffff81201fdf5860
>  0002cfa46db2b134 000000000000702a ffff81101fa98228 00000006ff9cf860
> Call Trace:
>  [<ffffffff800181d6>] release_task+0x3a5/0x3cb  [<ffffffff8003bc4e>]
> remove_wait_queue+0x1c/0x2c  [<ffffffff80064613>]
> __down_write_nested+0x7a/0x92  [<ffffffff800173c4>] sys_brk+0x28/0x110
> [<ffffffff8005d28d>] tracesys+0xd5/0xe0
>
> INFO: task slurmstepd:11313 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> slurmstepd    D ffffffff80150790     0 11313      1         11314
> 11312 (NOTLB)
>  ffff8115830bdef8 0000000000000082 ffff812005e18c40 ffff81101fdae8c0
> ffff81101fdaea00 0000000000000009 ffff81200791a040 ffff8120026f00c0
> 0002cfa46deecbfd 0000000000004b66 ffff81200791a228 0000000080022290
> Call Trace:
>  [<ffffffff80226d03>] sys_accept+0x1b8/0x1ea  [<ffffffff80064613>]
> __down_write_nested+0x7a/0x92  [<ffffffff800cef74>]
> sys_mmap_pgoff+0x55/0xac  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
>
> INFO: task slurmstepd:19675 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> slurmstepd    D ffffffff80150790     0 19675      1         19676
> 19674 (NOTLB)
>  ffff811fa46c5c38 0000000000000082 0000000700000000 0000000000000001
> 000000001fa7d1c0 0000000000000007 ffff810f7ed250c0 ffff81156be28820
> 0002cfa46db38d5c 00000000000060c5 ffff810f7ed252a8 0000000000000001
> Call Trace:
>  [<ffffffff8006383f>] schedule_timeout+0x1e/0xad  [<ffffffff8003acc1>]
> prepare_to_wait+0x34/0x61  [<ffffffff800646ac>] __down_read+0x7a/0x92
> [<ffffffff800a327f>] futex_wake+0x24/0xd4  [<ffffffff800a09d4>]
> autoremove_wake_function+0x0/0x2e  [<ffffffff8004a6de>]
> unix_stream_sendmsg+0x281/0x346  [<ffffffff8003e2e2>]
> do_futex+0x2c2/0xcf3  [<ffffffff80045d34>] do_sock_read+0xcf/0x110
> [<ffffffff802259e0>] sock_aio_read+0x4f/0x5e  [<ffffffff8000ceb5>]
> do_sync_read+0xc7/0x104  [<ffffffff8009947a>]
> __group_send_sig_info+0xb9/0xc8  [<ffffffff800a411a>]
> sys_futex+0x10a/0x12b  [<ffffffff8003539c>] mm_release+0x75/0x89
> [<ffffffff80041e0f>] exit_mm+0x16/0xf5  [<ffffffff8001581b>]
> do_exit+0x2b1/0x911  [<ffffffff80093630>] complete_and_exit+0x0/0x16
> [<ffffffff8005d28d>] tracesys+0xd5/0xe0
>
> INFO: task ps:20432 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ps            D ffffffff80150790     0 20432  20399         20433
>     (NOTLB)
>  ffff810f4bb2bdd8 0000000000000086 0000000000000000 0000010100000000
> 0000000700b118fa 0000000000000002 ffff810f678467a0 ffff81101fd6d080
>  0002cfa9172384a6 00000000009094d9 ffff810f67846988 000000058002ca5a
> Call Trace:
>  [<ffffffff800646ac>] __down_read+0x7a/0x92  [<ffffffff800c3054>]
> access_process_vm+0x47/0x18d  [<ffffffff8000f339>]
> __alloc_pages+0x78/0x308  [<ffffffff801065f4>]
> proc_pid_cmdline+0x69/0xf4  [<ffffffff80106b00>]
> proc_info_read+0x5f/0xb9  [<ffffffff8000b729>] vfs_read+0xcb/0x171
> [<ffffffff80011c3b>] sys_read+0x45/0x6e  [<ffffffff8005d28d>]
> tracesys+0xd5/0xe0
>
> INFO: task cmahostd:7081 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cmahostd      D ffffffff80150790     0  7081      1          7103
> 7077 (NOTLB)
>  ffff811ffe09bdd8 0000000000000086 00000000ffffffff 0000010100000000
> 0000000700b118fa 0000000000000009 ffff811fffbda820 ffff81101fd6d080
> 0002cfa83a7767ee 000000000108476f ffff811fffbdaa08 000000058002ca5a
> Call Trace:
>  [<ffffffff800646ac>] __down_read+0x7a/0x92  [<ffffffff800c3054>]
> access_process_vm+0x47/0x18d  [<ffffffff8000f339>]
> __alloc_pages+0x78/0x308  [<ffffffff801065f4>]
> proc_pid_cmdline+0x69/0xf4  [<ffffffff80106b00>]
> proc_info_read+0x5f/0xb9  [<ffffffff8000b729>] vfs_read+0xcb/0x171
> [<ffffffff80011c3b>] sys_read+0x45/0x6e  [<ffffffff8006149d>]
> sysenter_do_call+0x1e/0x76
>
> INFO: task slurmstepd:11311 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> slurmstepd    D ffffffff80150790     0 11311      1         11312
> 11026 (NOTLB)
>  ffff810b26153f08 0000000000000082 ffff811ffdea6140 0000000000455420
>  0000000000000292 0000000000000007 ffff81101fa98040 ffff81201fdf5860
>  0002cfa46db2b134 000000000000702a ffff81101fa98228 00000006ff9cf860
> Call Trace:
>  [<ffffffff800181d6>] release_task+0x3a5/0x3cb  [<ffffffff8003bc4e>]
> remove_wait_queue+0x1c/0x2c  [<ffffffff80064613>]
> __down_write_nested+0x7a/0x92  [<ffffffff800173c4>] sys_brk+0x28/0x110
> [<ffffffff8005d28d>] tracesys+0xd5/0xe0
>
> INFO: task slurmstepd:11313 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> slurmstepd    D ffffffff80150790     0 11313      1         11314
> 11312 (NOTLB)
>  ffff8115830bdef8 0000000000000082 ffff812005e18c40 ffff81101fdae8c0
> ffff81101fdaea00 0000000000000009 ffff81200791a040 ffff8120026f00c0
> 0002cfa46deecbfd 0000000000004b66 ffff81200791a228 0000000080022290
> Call Trace:
>  [<ffffffff80226d03>] sys_accept+0x1b8/0x1ea  [<ffffffff80064613>]
> __down_write_nested+0x7a/0x92  [<ffffffff800cef74>]
> sys_mmap_pgoff+0x55/0xac  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
>
> slurmstepd[11344]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000040c2be40 error 6
> slurmstepd[1040]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000041322f60 error 6
> slurmstepd[1049]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000041398c40 error 6
> slurmstepd[1056]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000041175c20 error 6
> slurmstepd[11536]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 000000004090ce40 error 6
> slurmstepd[11314]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 00000000412d4e40 error 6
> slurmstepd[7846]: segfault at 0000000000000020 rip 0000003abec72694
> rsp 0000000040671e40 error 6
>
> Wayne Glanfield
> AHL Infrastructure
> [email protected]<mailto:[email protected]>
> Tel +44 20 7144 3642
> Mob +44 7899 900414
>
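
The kernel line repeated throughout the quoted report ("segfault at 0000000000000020 rip 0000003abec72694 ... error 6") can be unpacked mechanically. The sketch below is a minimal illustration, not a diagnosis: it assumes the x86 page-fault error-code layout (bit 0 = page present, bit 1 = write access, bit 2 = user mode) and assumes the kernel prints the error value in hexadecimal. The regular expression and function names are hypothetical and written only for the message format seen in this thread, with the wrapped line rejoined.

```python
import re

# Pattern for the kernel message format quoted in the report (assumption:
# written for these lines only, after rejoining the mail-wrapped halves).
SEGV = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+)\s+"
    r"rip (?P<rip>[0-9a-f]+)\s+rsp (?P<rsp>[0-9a-f]+)\s+error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 1),  # False: the page was not mapped
        "write": bool(code & 2),         # True: the access was a write
        "user_mode": bool(code & 4),     # True: the fault came from user space
    }


def parse_segfault(line):
    """Return the fields of one segfault line, or None if it does not match."""
    m = SEGV.search(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        # Assumption: the kernel prints the error code in hex.
        "flags": decode_error(int(m.group("err"), 16)),
    }


sample = ("slurmstepd[2000]: segfault at 0000000000000020 "
          "rip 0000003abec72694 rsp 0000000041e54e40 error 6")
info = parse_segfault(sample)
print(info["comm"], info["pid"], hex(info["addr"]), hex(info["rip"]), info["flags"])
```

Under those assumptions, error 6 reads as a user-mode write to an unmapped page. An address of 0x20 would be consistent with a write through a NULL pointer at a small offset, and the identical rip in every line suggests that each slurmstepd process faulted at the same instruction.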
