When I start the slurmd and slurmctld daemons with the attached slurm.conf file, slurmctld always loads the checkpoint/blcr plugin successfully.
SLURMCTLD Output
---------------------------
slurmctld -Dvvvv
slurmctld: error: Unable to open pidfile `/var/run/slurmctld.pid': Permission denied
slurmctld: Not running as root. Can't drop supplementary groups
slurmctld: debug3: Version in last_conf_lite header is 6912
slurmctld: slurmctld version 14.03.4-2 started on cluster cluster
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/select_linear.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/checkpoint_blcr.so
slurmctld: checkpoint/blcr init
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/blcr
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_energy_none.so
slurmctld: AcctGatherEnergy NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_profile_none.so
slurmctld: AcctGatherProfile NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_infiniband_none.so
slurmctld: AcctGatherInfiniband NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_filesystem_none.so
slurmctld: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: No acct_gather.conf file (/usr/local/etc/acct_gather.conf)
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/ext_sensors_none.so
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
slurmctld: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: No backup controller to shutdown
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
slurmctld: debug3: Version in assoc_mgr_state header is 1
slurmctld: debug: Reading slurm.conf file: /usr/local/etc/slurm.conf
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
slurmctld: topology NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: No DownNodes
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/jobcomp_none.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/sched_backfill.so
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Version string in node_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 8 nodes
slurmctld: debug3: Version string in job_state header is PROTOCOL_VERSION
slurmctld: debug3: Job id in job_state header is 64
slurmctld: debug3: Set job_id_sequence to 64
slurmctld: Recovered information about 0 jobs
slurmctld: debug: Updating partition uid access list
slurmctld: debug3: Version string in resv_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/priority_basic.so
slurmctld: debug: Priority BASIC plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: _slurmctld_background pid = 23415
slurmctld: debug3: _slurmctld_rpc_mgr pid = 23415
slurmctld: debug: power_save module disabled, SuspendTime < 0
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/auth_munge.so
slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmctld: debug3: Success.
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode009 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode009 usec=35
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode012 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode012 usec=29
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode013 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode013 usec=29
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug: validate_node_specs: node enode014 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for enode014 usec=29
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode011 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode011 usec=28
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug: validate_node_specs: node enode007 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for enode007 usec=30
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug: validate_node_specs: node enode008 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for enode008 usec=29
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode010 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode010 usec=28
slurmctld: debug: Spawning registration agent for enode[007-014] 8 hosts
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
slurmctld: SchedulingParameters: default_queue_depth=100 max_rpc_cnt=0 max_sched_time=4 partition_job_depth=0
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: Tree sending to enode007
slurmctld: debug3: Tree sending to enode009
slurmctld: debug3: Tree sending to enode008
slurmctld: debug3: Tree sending to enode010
slurmctld: debug3: Tree sending to enode011
slurmctld: debug3: Tree sending to enode012
slurmctld: debug2: Tree head got back 1 looking for 8
slurmctld: debug3: Tree sending to enode013
slurmctld: debug3: Tree sending to enode014
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode007 usec=17
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 3
slurmctld: debug2: Tree head got back 4
slurmctld: debug2: Tree head got back 5
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode009 usec=20
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode008 usec=17
slurmctld: debug2: Tree head got back 6
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode011 usec=17
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode012 usec=23
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode010 usec=17
slurmctld: debug2: Tree head got back 7
slurmctld: debug2: Tree head got back 7
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode014 usec=12
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode013 usec=15
slurmctld: debug2: Tree head got back 8
slurmctld: debug2: node_did_resp enode007
slurmctld: debug2: node_did_resp enode008
slurmctld: debug2: node_did_resp enode009
slurmctld: debug2: node_did_resp enode010
slurmctld: debug2: node_did_resp enode012
slurmctld: debug2: node_did_resp enode011
slurmctld: debug2: node_did_resp enode014
slurmctld: debug2: node_did_resp enode013
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: Testing job time limits and checkpoints

However, none of the compute nodes running slurmd even attempt to load the checkpoint/blcr plugin.

SLURMD Output
-----------------------
: Pid File = `/var/run/slurmd.pid'
slurmd: debug3: Slurm UID = 504
slurmd: debug3: TaskProlog = `(null)'
slurmd: debug3: TaskEpilog = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: Use PAM = 0
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/proctrack_pgid.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/task_none.so
slurmd: task NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/auth_munge.so
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug3: Success.
slurmd: debug: spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
slurmd: Munge cryptographic signature plugin loaded
slurmd: debug3: Success.
slurmd: debug3: initializing slurmd spool directory
slurmd: debug3: slurmd initialization successful
slurmd: slurmd version 14.03.4-2 started
slurmd: debug3: finished daemonize
slurmd: debug3: cred_unpack: job 2 ctime:141025182439 expires:700101053000
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
slurmd: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/job_container_none.so
slurmd: debug: job_container none plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/core_spec_none.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
slurmd: switch NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: successfully opened slurm listen port 130.1.5.24:6818
slurmd: slurmd started on Mon, 03 Nov 2014 17:44:43 +0530
slurmd: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=64417 TmpDisk=50396 Uptime=23751
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_energy_none.so
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_profile_none.so
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_infiniband_none.so
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_filesystem_none.so
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug2: No acct_gather.conf file (/usr/local/etc/acct_gather.conf)

Also, in the slurmctld output, I see that initially the forward trees travel back and forth between the nodes freely:

slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: Tree sending to enode007
slurmctld: debug3: Tree sending to enode008
slurmctld: debug3: Tree sending to enode009
slurmctld: debug3: Tree sending to enode010
slurmctld: debug3: Tree sending to enode011
slurmctld: debug3: Tree sending to enode012
slurmctld: debug2: Tree head got back 1 looking for 8
slurmctld: debug3: Tree sending to enode013
slurmctld: debug3: Tree sending to enode014
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 3
slurmctld: debug2: Tree head got back 4
slurmctld: debug2: Tree head got back 5
slurmctld: debug2: Tree head got back 6
slurmctld: debug2: Tree head got back 7
slurmctld: debug2: Tree head got back 8

When I run a normal MPI job without checkpointing, the trees also go back and forth without any problem. So initialization is fine, and normal MPI jobs run without issue.
*srun -N8 -n8 ./MPIJob*

SLURMCTLD Output
-----------------------------
slurmctld: debug: Configuration for job 68 complete
slurmctld: debug3: step_layout cpus = 12 pos = 0
slurmctld: debug3: step_layout cpus = 12 pos = 1
slurmctld: debug3: step_layout cpus = 12 pos = 2
slurmctld: debug3: step_layout cpus = 12 pos = 3
slurmctld: debug3: step_layout cpus = 12 pos = 4
slurmctld: debug3: step_layout cpus = 12 pos = 5
slurmctld: debug3: step_layout cpus = 12 pos = 6
slurmctld: debug3: step_layout cpus = 12 pos = 7
slurmctld: debug: laying out the 8 tasks on 8 hosts enode[007-014] dist 1
slurmctld: sched: _slurm_rpc_job_step_create: StepId=68.0 enode[007-014] usec=383
slurmctld: debug3: Writing job id 68 to header record of job_state file
slurmctld: debug2: Processing RPC: REQUEST_PARTITION_INFO uid=504
slurmctld: debug2: _slurm_rpc_dump_partitions, size=128 usec=23
slurmctld: debug3: Processing RPC: REQUEST_NODE_INFO from uid=504
slurmctld: debug2: Processing RPC: REQUEST_RESERVATION_INFO from uid=504
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: Performing full system state save
slurmctld: debug3: Writing job id 68 to header record of job_state file
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: Spawning ping agent for enode[007-013]
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_PING
slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: Tree sending to enode007
slurmctld: debug3: Tree sending to enode009
slurmctld: debug3: Tree sending to enode010
slurmctld: debug2: Tree head got back 0 looking for 7
slurmctld: debug3: Tree sending to enode008
slurmctld: debug3: Tree sending to enode011
slurmctld: debug3: Tree sending to enode012
slurmctld: debug3: Tree sending to enode013
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 3
slurmctld: debug2: Tree head got back 4
slurmctld: debug2: Tree head got back 6
slurmctld: debug2: Tree head got back 7
slurmctld: debug2: node_did_resp enode007
slurmctld: debug2: node_did_resp enode010
slurmctld: debug2: node_did_resp enode009
slurmctld: debug2: node_did_resp enode011
slurmctld: debug2: node_did_resp enode013
slurmctld: debug2: node_did_resp enode012
slurmctld: debug2: node_did_resp enode008
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler

SLURMD Output
-----------------------
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug: task_p_slurmd_launch_request: 68.0 7
slurmd: launch task 68.0 request from [email protected] (port 32396)
slurmd: debug3: state for jobid 2: ctime:1414241679 revoked:0 expires:0
slurmd: debug3: state for jobid 65: ctime:1415018192 revoked:1415018901 expires:1415018901
slurmd: debug3: state for jobid 65: ctime:1415018192 revoked:0 expires:0
slurmd: debug: Checking credential with 300 bytes of sig data
slurmd: debug: Calling /usr/local/sbin/slurmstepd spank prolog
spank-prolog: Reading slurm.conf file: /usr/local/etc/slurm.conf
spank-prolog: Running spank/prolog for jobid [68] uid [504]
spank-prolog: spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 7 (enode014), parent rank 0 (enode007), children 0, depth 1, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug: task_p_slurmd_reserve_resources: 68 7
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: Sending signal 995 to step 68.0

However, as soon as I launch the job with checkpointing enabled,

*srun -N8 -n8 --checkpoint 1 --checkpoint-dir /home/arjun/Ckpt_Local ./MPIJob*

the same forward trees no longer go back and forth.

SLURMCTLD Output
-----------------------------
slurmctld: debug3: problems with enode012
slurmctld: debug2: Tree head got back 1
slurmctld: debug3: problems with enode014
slurmctld: debug2: Tree head got back 2
slurmctld: debug3: problems with enode007
slurmctld: debug2: Tree head got back 2
slurmctld: debug3: problems with enode013
slurmctld: debug3: problems with enode011
slurmctld: debug3: problems with enode010
slurmctld: debug3: problems with enode009
slurmctld: debug2: Tree head got back 4
slurmctld: debug3: problems with enode008
slurmctld: debug2: Tree head got back 8
slurmctld: error: checkpoint/blcr: error on checkpoint request 3 to 65.0: Communication connection failure
slurmctld: debug: checkpoint/blcr: file /usr/local/sbin/scch not found
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: Performing full system state save
slurmctld: debug3: Writing job id 65 to header record of job_state file
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug3: checkpoint/blcr: sending checkpoint tasks request 3 to 65.0
slurmctld: debug3: Tree sending to enode007
slurmctld: debug3: Tree sending to enode009
slurmctld: debug3: Tree sending to enode008
*slurmctld: debug2: Tree head got back 0 looking for 8*
slurmctld: debug3: Tree sending to enode010
slurmctld: debug3: Tree sending to enode011
slurmctld: debug3: Tree sending to enode013
slurmctld: debug3: Tree sending to enode012
slurmctld: debug3: Tree sending to enode014

SLURMD Output
-----------------------
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug: task_p_slurmd_launch_request: 65.0 6
slurmd: launch task 65.0 request from [email protected] (port 58091)
slurmd: debug3: state for jobid 2: ctime:1414241746 revoked:0 expires:0
slurmd: debug3: state for jobid 46: ctime:1414397386 revoked:0 expires:0
slurmd: debug: Checking credential with 300 bytes of sig data
slurmd: debug: Calling /usr/local/sbin/slurmstepd spank prolog
spank-prolog: Reading slurm.conf file: /usr/local/etc/slurm.conf
spank-prolog: Running spank/prolog for jobid [65] uid [504]
spank-prolog: spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 6 (enode013), parent rank 0 (enode007), children 0, depth 1, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug: task_p_slurmd_reserve_resources: 65 6
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: Sending signal 995 to step 65.0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6005
*slurmd: debug2: Processing RPC: REQUEST_CHECKPOINT_TASKS*
*slurmd: debug3: in the service_connection*
*slurmd: debug2: got this type of message 6005*
slurmd: debug2: Processing RPC: REQUEST_CHECKPOINT_TASKS
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6005
slurmd: debug2: Processing RPC: REQUEST_CHECKPOINT_TASKS
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 1008

The slurmd daemons just receive the REQUEST_CHECKPOINT_TASKS message and don't do anything at all.
I looked into the code and figured out that it's probably *forward.c* in the src/common directory that is causing the problem. The error code behind "Communication connection failure" is RESPONSE_FORWARD_FAILED, which gets set when errno is SLURM_COMMUNICATIONS_CONNECTION_ERROR.

I put debug3() lines of my own wherever the function mark_as_failed_forward() [defined in common/forward.c] is called, and isolated that the two calls in slurm_protocol_api.c [located in common/] were failing, specifically this one at line 3896 of slurm_protocol_api.c:

    msg->forward_struct = NULL;
    if (!(ret_list = _send_and_recv_msgs(fd, msg, timeout))) {
        mark_as_failed_forward(&ret_list, name, errno);
        errno = SLURM_COMMUNICATIONS_CONNECTION_ERROR;
        return ret_list;
    }

Clearly the _send_and_recv_msgs() call is failing here, but I have no idea why. Why does it fail only when checkpointing is enabled, and work perfectly when there is no checkpointing to be performed? Please help.
slurm.conf
Description: Binary data
