Here's a patch, only lightly tested.
--- slurm-14.03.0/src/slurmctld/job_mgr.c 2014-03-27 03:57:22.000000000 +0800
+++ slurm-14.03.patch/src/slurmctld/job_mgr.c 2014-04-11 16:18:22.234954705 +0800
@@ -6966,12 +6966,12 @@
 {
 	uint16_t shared = 0;
-	if (detail_ptr->share_res == 1)
+	if(!detail_ptr)
+		shared = (uint16_t) NO_VAL;
+	else if (detail_ptr->share_res == 1)
 		shared = 1;
 	else if (detail_ptr->whole_node == 1)
 		shared = 0;
-	else
-		shared = (uint16_t) NO_VAL;
 	if (protocol_version >= SLURM_14_03_PROTOCOL_VERSION) {
 		if (detail_ptr) {
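
One caveat on the hunk above: it also removes the trailing else, so a job whose detail_ptr is valid but has neither share_res nor whole_node set would now report shared = 0 rather than NO_VAL. If that is not intended, the selection logic could look roughly like the standalone sketch below. The struct and function names are hypothetical stand-ins, not Slurm's real definitions, and the NO_VAL value is assumed and should be checked against slurm.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for Slurm's NO_VAL sentinel; check slurm.h. */
#define NO_VAL 0xfffffffe

/* Hypothetical, trimmed-down stand-in for struct job_details. */
struct job_details_sketch {
	uint8_t share_res;
	uint8_t whole_node;
};

/*
 * Pick the "shared" value to pack for a job.
 * A NULL detail_ptr is handled first, so nothing is dereferenced
 * before the pointer has been checked.
 */
uint16_t shared_flag(const struct job_details_sketch *detail_ptr)
{
	if (!detail_ptr)
		return (uint16_t) NO_VAL;	/* no details at all */
	if (detail_ptr->share_res == 1)
		return 1;			/* job may share nodes */
	if (detail_ptr->whole_node == 1)
		return 0;			/* job wants whole nodes */
	return (uint16_t) NO_VAL;		/* neither flag set */
}
```

Calling shared_flag(NULL) takes the first branch and returns the sentinel instead of faulting, which is the case hit in the traceback below (detail_ptr=0x0).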
On Fri, 2014-04-11 at 00:50 -0700, Franco Broi wrote:
> Hi
>
> We're getting frequent, repeatable crashes when users submit jobs with
> sbatch arguments like this: -p c2 -N 15-16 -c 6 --no-requeue -H
>
> Here's a traceback of the controller daemon.
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fa7839f9700 (LWP 8038)]
> _pack_pending_job_details (detail_ptr=0x0, buffer=0x7fa784001e48,
> protocol_version=6912) at job_mgr.c:6969
> 6969 if (detail_ptr->share_res == 1)
> (gdb) where
> #0 _pack_pending_job_details (detail_ptr=0x0, buffer=0x7fa784001e48,
> protocol_version=6912) at job_mgr.c:6969
> #1 0x00000000004432dd in pack_job (dump_job_ptr=0x7fa784001768,
> show_flags=<value optimized out>, buffer=0x7fa784001e48,
> protocol_version=<value optimized out>, uid=<value optimized out>)
> at job_mgr.c:6764
> #2 0x000000000044390b in pack_all_jobs (buffer_ptr=0x7fa7839f8e10,
> buffer_size=0x7fa7839f8e48, show_flags=0, uid=1380, filter_uid=4294967294,
> protocol_version=6912) at job_mgr.c:6274
> #3 0x0000000000470b92 in _slurm_rpc_dump_jobs (msg=0x7fa7840008c8) at
> proc_req.c:1076
> #4 slurmctld_req (msg=0x7fa7840008c8) at proc_req.c:209
> #5 0x000000000042fe58 in _service_connection (arg=0x7fa7c8000ca8) at
> controller.c:1075
> #6 0x00000038c4c06ccb in start_thread () from /lib64/libpthread.so.0
> #7 0x00000038c48e0c2d in clone () from /lib64/libc.so.6
>
>
> Cheers,