Dear all,

Every time I try to checkpoint an MPI application using BLCR in Slurm, it fails. The following is my sbatch script:
##########################SBATCH SCRIPT############
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o cr/mm-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=cr
#SBATCH --time=01:30:00
#SBATCH [email protected]
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
srun --mpi=pmi2 ./mm.o
####################################################

I have also tried running the application directly with the srun command, but that failed too. The following is the command I used and the error messages that occurred.

command :
srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o

error :
Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
Received results from task 6
Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'

In the cr directory there are 7 .ckpt files: task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt, task.8.ckpt and task.9.ckpt. There are no checkpoint files called task.0.ckpt, task.3.ckpt and task.4.ckpt, which are the ones mentioned in the error messages. mirror is an NFS directory that is shared across the nodes.
I set the cr directory to permission 777 just to rule out permission issues.

Note: if I run the job with sbatch, I only get a file named script.ckpt. There are no task.[number].ckpt files.

Could anyone please tell me how to solve this? Thank you in advance.

Regards,
Husen
