When I start the slurmd and slurmctld daemons with the attached slurm.conf file, slurmctld always loads the checkpoint/blcr plugin successfully.
SLURMCTLD Output
---------------------------
slurmctld -Dvvvv
slurmctld: error: Unable to open pidfile `/var/run/slurmctld.pid': Permission denied
slurmctld: Not running as root. Can't drop supplementary groups
slurmctld: debug3: Version in last_conf_lite header is 6912
slurmctld: slurmctld version 14.03.4-2 started on cluster cluster
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/select_linear.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/checkpoint_blcr.so
slurmctld: checkpoint/blcr init
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/blcr
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_energy_none.so
slurmctld: AcctGatherEnergy NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_profile_none.so
slurmctld: AcctGatherProfile NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_infiniband_none.so
slurmctld: AcctGatherInfiniband NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_filesystem_none.so
slurmctld: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: No acct_gather.conf file (/usr/local/etc/acct_gather.conf)
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/ext_sensors_none.so
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
slurmctld: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: No backup controller to shutdown
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
slurmctld: debug3: Version in assoc_mgr_state header is 1
slurmctld: debug: Reading slurm.conf file: /usr/local/etc/slurm.conf
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
slurmctld: topology NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: No DownNodes
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/jobcomp_none.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/sched_backfill.so
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Version string in node_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 8 nodes
slurmctld: debug3: Version string in job_state header is PROTOCOL_VERSION
slurmctld: debug3: Job id in job_state header is 64
slurmctld: debug3: Set job_id_sequence to 64
slurmctld: Recovered information about 0 jobs
slurmctld: debug: Updating partition uid access list
slurmctld: debug3: Version string in resv_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/priority_basic.so
slurmctld: debug: Priority BASIC plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: _slurmctld_background pid = 23415
slurmctld: debug3: _slurmctld_rpc_mgr pid = 23415
slurmctld: debug: power_save module disabled, SuspendTime < 0
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/auth_munge.so
slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmctld: debug3: Success.
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode009 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode009 usec=35
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode012 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode012 usec=29
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode013 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode013 usec=29
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug: validate_node_specs: node enode014 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for enode014 usec=29
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode011 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode011 usec=28
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug: validate_node_specs: node enode007 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for enode007 usec=30
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug: validate_node_specs: node enode008 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for enode008 usec=29
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: node enode010 returned to service
slurmctld: debug2: _slurm_rpc_node_registration complete for enode010 usec=28
slurmctld: debug: Spawning registration agent for enode[007-014] 8 hosts
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
slurmctld: SchedulingParameters: default_queue_depth=100 max_rpc_cnt=0 max_sched_time=4 partition_job_depth=0
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: Tree sending to enode007
slurmctld: debug3: Tree sending to enode009
slurmctld: debug3: Tree sending to enode008
slurmctld: debug3: Tree sending to enode010
slurmctld: debug3: Tree sending to enode011
slurmctld: debug3: Tree sending to enode012
slurmctld: debug2: Tree head got back 1 looking for 8
slurmctld: debug3: Tree sending to enode013
slurmctld: debug3: Tree sending to enode014
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode007 usec=17
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 3
slurmctld: debug2: Tree head got back 4
slurmctld: debug2: Tree head got back 5
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode009 usec=20
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode008 usec=17
slurmctld: debug2: Tree head got back 6
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode011 usec=17
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode012 usec=23
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode010 usec=17
slurmctld: debug2: Tree head got back 7
slurmctld: debug2: Tree head got back 7
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode014 usec=12
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for enode013 usec=15
slurmctld: debug2: Tree head got back 8
slurmctld: debug2: node_did_resp enode007
slurmctld: debug2: node_did_resp enode008
slurmctld: debug2: node_did_resp enode009
slurmctld: debug2: node_did_resp enode010
slurmctld: debug2: node_did_resp enode012
slurmctld: debug2: node_did_resp enode011
slurmctld: debug2: node_did_resp enode014
slurmctld: debug2: node_did_resp enode013
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: Testing job time limits and checkpoints

However, none of the compute nodes running slurmd even attempt to load the checkpoint/blcr plugin.

SLURMD Output
-----------------------
: Pid File = `/var/run/slurmd.pid'
slurmd: debug3: Slurm UID = 504
slurmd: debug3: TaskProlog = `(null)'
slurmd: debug3: TaskEpilog = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: Use PAM = 0
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/proctrack_pgid.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/task_none.so
slurmd: task NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/auth_munge.so
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug3: Success.
slurmd: debug: spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/crypto_munge.so
slurmd: Munge cryptographic signature plugin loaded
slurmd: debug3: Success.
slurmd: debug3: initializing slurmd spool directory
slurmd: debug3: slurmd initialization successful
slurmd: slurmd version 14.03.4-2 started
slurmd: debug3: finished daemonize
slurmd: debug3: cred_unpack: job 2 ctime:141025182439 expires:700101053000
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/jobacct_gather_none.so
slurmd: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/job_container_none.so
slurmd: debug: job_container none plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/core_spec_none.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
slurmd: switch NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: successfully opened slurm listen port 130.1.5.24:6818
slurmd: slurmd started on Mon, 03 Nov 2014 17:44:43 +0530
slurmd: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=64417 TmpDisk=50396 Uptime=23751
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_energy_none.so
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_profile_none.so
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_infiniband_none.so
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/local/lib/slurm/acct_gather_filesystem_none.so
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug2: No acct_gather.conf file (/usr/local/etc/acct_gather.conf)

Also, in the slurmctld output, I see that initially the forward trees travel back and forth between the nodes freely:

slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: Tree sending to enode007
slurmctld: debug3: Tree sending to enode008
slurmctld: debug3: Tree sending to enode009
slurmctld: debug3: Tree sending to enode010
slurmctld: debug3: Tree sending to enode011
slurmctld: debug3: Tree sending to enode012
slurmctld: debug2: Tree head got back 1 looking for 8
slurmctld: debug3: Tree sending to enode013
slurmctld: debug3: Tree sending to enode014
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 3
slurmctld: debug2: Tree head got back 4
slurmctld: debug2: Tree head got back 5
slurmctld: debug2: Tree head got back 6
slurmctld: debug2: Tree head got back 7
slurmctld: debug2: Tree head got back 8

When I run a normal MPI job without checkpointing, the trees also go back and forth without any problem. So initialization is fine, and normal MPI jobs run without issue.
*srun -N8 -n8 ./MPIJob*

SLURMCTLD Output
-----------------------------
slurmctld: debug: Configuration for job 68 complete
slurmctld: debug3: step_layout cpus = 12 pos = 0
slurmctld: debug3: step_layout cpus = 12 pos = 1
slurmctld: debug3: step_layout cpus = 12 pos = 2
slurmctld: debug3: step_layout cpus = 12 pos = 3
slurmctld: debug3: step_layout cpus = 12 pos = 4
slurmctld: debug3: step_layout cpus = 12 pos = 5
slurmctld: debug3: step_layout cpus = 12 pos = 6
slurmctld: debug3: step_layout cpus = 12 pos = 7
slurmctld: debug: laying out the 8 tasks on 8 hosts enode[007-014] dist 1
slurmctld: sched: _slurm_rpc_job_step_create: StepId=68.0 enode[007-014] usec=383
slurmctld: debug3: Writing job id 68 to header record of job_state file
slurmctld: debug2: Processing RPC: REQUEST_PARTITION_INFO uid=504
slurmctld: debug2: _slurm_rpc_dump_partitions, size=128 usec=23
slurmctld: debug3: Processing RPC: REQUEST_NODE_INFO from uid=504
slurmctld: debug2: Processing RPC: REQUEST_RESERVATION_INFO from uid=504
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: Performing full system state save
slurmctld: debug3: Writing job id 68 to header record of job_state file
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: Spawning ping agent for enode[007-013]
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_PING
slurmctld: debug2: got 1 threads to send out
slurmctld: debug3: Tree sending to enode007
slurmctld: debug3: Tree sending to enode009
slurmctld: debug3: Tree sending to enode010
slurmctld: debug2: Tree head got back 0 looking for 7
slurmctld: debug3: Tree sending to enode008
slurmctld: debug3: Tree sending to enode011
slurmctld: debug3: Tree sending to enode012
slurmctld: debug3: Tree sending to enode013
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 2
slurmctld: debug2: Tree head got back 3
slurmctld: debug2: Tree head got back 4
slurmctld: debug2: Tree head got back 6
slurmctld: debug2: Tree head got back 7
slurmctld: debug2: node_did_resp enode007
slurmctld: debug2: node_did_resp enode010
slurmctld: debug2: node_did_resp enode009
slurmctld: debug2: node_did_resp enode011
slurmctld: debug2: node_did_resp enode013
slurmctld: debug2: node_did_resp enode012
slurmctld: debug2: node_did_resp enode008
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler

SLURMD Output
-----------------------
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug: task_p_slurmd_launch_request: 68.0 7
slurmd: launch task 68.0 request from [email protected] (port 32396)
slurmd: debug3: state for jobid 2: ctime:1414241679 revoked:0 expires:0
slurmd: debug3: state for jobid 65: ctime:1415018192 revoked:1415018901 expires:1415018901
slurmd: debug3: state for jobid 65: ctime:1415018192 revoked:0 expires:0
slurmd: debug: Checking credential with 300 bytes of sig data
slurmd: debug: Calling /usr/local/sbin/slurmstepd spank prolog
spank-prolog: Reading slurm.conf file: /usr/local/etc/slurm.conf
spank-prolog: Running spank/prolog for jobid [68] uid [504]
spank-prolog: spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 7 (enode014), parent rank 0 (enode007), children 0, depth 1, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug: task_p_slurmd_reserve_resources: 68 7
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: Sending signal 995 to step 68.0

However, as soon as I launch the job with checkpointing enabled,

*srun -N8 -n8 --checkpoint 1 --checkpoint-dir /home/arjun/Ckpt_Local ./MPIJob*

the same forward trees no longer go back and forth.

SLURMCTLD Output
-----------------------------
slurmctld: debug3: problems with enode012
slurmctld: debug2: Tree head got back 1
slurmctld: debug3: problems with enode014
slurmctld: debug2: Tree head got back 2
slurmctld: debug3: problems with enode007
slurmctld: debug2: Tree head got back 2
slurmctld: debug3: problems with enode013
slurmctld: debug3: problems with enode011
slurmctld: debug3: problems with enode010
slurmctld: debug3: problems with enode009
slurmctld: debug2: Tree head got back 4
slurmctld: debug3: problems with enode008
slurmctld: debug2: Tree head got back 8
slurmctld: error: checkpoint/blcr: error on checkpoint request 3 to 65.0: Communication connection failure
slurmctld: debug: checkpoint/blcr: file /usr/local/sbin/scch not found
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug2: Performing full system state save
slurmctld: debug3: Writing job id 65 to header record of job_state file
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug3: checkpoint/blcr: sending checkpoint tasks request 3 to 65.0
slurmctld: debug3: Tree sending to enode007
slurmctld: debug3: Tree sending to enode009
slurmctld: debug3: Tree sending to enode008
*slurmctld: debug2: Tree head got back 0 looking for 8*
slurmctld: debug3: Tree sending to enode010
slurmctld: debug3: Tree sending to enode011
slurmctld: debug3: Tree sending to enode013
slurmctld: debug3: Tree sending to enode012
slurmctld: debug3: Tree sending to enode014

SLURMD Output
-----------------------
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug: task_p_slurmd_launch_request: 65.0 6
slurmd: launch task 65.0 request from [email protected] (port 58091)
slurmd: debug3: state for jobid 2: ctime:1414241746 revoked:0 expires:0
slurmd: debug3: state for jobid 46: ctime:1414397386 revoked:0 expires:0
slurmd: debug: Checking credential with 300 bytes of sig data
slurmd: debug: Calling /usr/local/sbin/slurmstepd spank prolog
spank-prolog: Reading slurm.conf file: /usr/local/etc/slurm.conf
spank-prolog: Running spank/prolog for jobid [65] uid [504]
spank-prolog: spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 6 (enode013), parent rank 0 (enode007), children 0, depth 1, max_depth 1
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug: task_p_slurmd_reserve_resources: 65 6
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6004
slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
slurmd: debug: Sending signal 995 to step 65.0
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6005
*slurmd: debug2: Processing RPC: REQUEST_CHECKPOINT_TASKS*
*slurmd: debug3: in the service_connection*
*slurmd: debug2: got this type of message 6005*
slurmd: debug2: Processing RPC: REQUEST_CHECKPOINT_TASKS
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6005
slurmd: debug2: Processing RPC: REQUEST_CHECKPOINT_TASKS
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 1008

The slurmd daemons just receive the REQUEST_CHECKPOINT_TASKS message and don't do anything at all.
I looked into the code and figured out that it's probably *forward.c* in the src/common directory that is causing the problem. The error code behind "Communication connection failure" is RESPONSE_FORWARD_FAILED, which gets set when errno is SLURM_COMMUNICATIONS_CONNECTION_ERROR.

I put debug3() lines of my own wherever the function mark_as_failed_forward() [defined in common/forward.c] is called, and isolated that the two calls in slurm_protocol_api.c [located in common/] were failing, specifically this one at line 3896 of slurm_protocol_api.c:

    msg->forward_struct = NULL;
    if (!(ret_list = _send_and_recv_msgs(fd, msg, timeout))) {
        mark_as_failed_forward(&ret_list, name, errno);
        errno = SLURM_COMMUNICATIONS_CONNECTION_ERROR;
        return ret_list;
    }

Clearly the _send_and_recv_msgs() call is failing here, but I have no idea why. Why does it fail only when checkpointing is enabled, and work perfectly when there is no checkpointing to be performed? Please help.
slurm.conf
Description: Binary data
