I have managed to get SLURM running with checkpointing enabled on my cluster of two machines, named qdr3 and qdr4. However, when I run the command

    srun -N2 -n24 --checkpoint 1 --checkpoint-dir /home/arjun/ACIM/Ctrl ./MPIJob

the MPIJob code does get executed, but all of the checkpointing instructions fail. slurmctld shows the following messages:

    slurmctld: debug3: checkpoint/blcr: sending checkpoint tasks request 3 to 81.0
    slurmctld: debug2: Tree head got back 0 looking for 2
    slurmctld: debug3: Tree sending to qdr4
    slurmctld: debug3: Tree sending to qdr3
    slurmctld: debug: _slurm_recv_timeout at 0 of 4, timeout
    slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
    slurmctld: debug: _slurm_recv_timeout at 0 of 4, timeout
    slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation
    slurmctld: debug3: problems with qdr3
    slurmctld: debug3: problems with qdr4
    slurmctld: debug2: Tree head got back 2
    slurmctld: debug2: Tree head got back 2
    slurmctld: error: checkpoint/blcr: error on checkpoint request 3 to 81.0: Communication connection failure
    slurmctld: debug: checkpoint/blcr: file /usr/local/sbin/scch not found

What could be the reason the checkpointing commands are failing? Also, is the missing /usr/local/sbin/scch an integral part of the problem?
Attachment: slurm.conf
