Are the temporary files created?
Does ls -a on the directory show the missing files? Can you create files
in that directory with touch?

Finally, is cr_checkpoint being run by root? Or some other user? The
checkpoint file will be created by the user invoking cr_checkpoint.

Eric

On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
> Dear all,
>
> I fail every time I try to checkpoint an MPI application using BLCR in
> Slurm. The following is my sbatch script:
>
> ##########################SBATCH SCRIPT############
> #!/bin/bash
> #SBATCH -J MatMul
> #SBATCH -o cr/mm-%j.out
> #SBATCH -A necis
> #SBATCH -N 3
> #SBATCH -n 24
> #SBATCH --checkpoint=1
> #SBATCH --checkpoint-dir=cr
> #SBATCH --time=01:30:00
> #SBATCH --mail-user=[email protected]
> #SBATCH --mail-type=begin
> #SBATCH --mail-type=end
> srun --mpi=pmi2 ./mm.o
> ####################################################
>
> I have also tried to run it directly with the srun command, but that
> failed too. The following is the command I used and the error message
> that occurred.
>
> command:
> srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
>
> error:
> Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> Received results from task 6
> Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
> Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
> Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>
> In the cr directory there are 7 .ckpt files, as follows:
> task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
> task.8.ckpt and task.9.ckpt.
> There are no checkpoint files called task.0.ckpt, task.3.ckpt and
> task.4.ckpt as mentioned in the error message.
>
> mirror is an NFS directory that is shared across the nodes. I set the
> cr directory to have permission 777 just to avoid permission issues.
>
> Note: if I run the command as an sbatch job, I just get a file named
> script.ckpt. There is no task.[number].ckpt file.
>
> Can anyone please tell me how to solve this?
>
> Thank you in advance.
>
> Regards,
> Husen
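
The checks asked about at the top of this message (ls -a, touch, and which
user is running) can be collected in one small script. This is only a
sketch, assuming bash is available on the compute nodes; the function name
check_ckpt_dir, the script name check_ckpt_dir.sh and the probe file name
are made up for illustration and are not part of Slurm or BLCR.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or BLCR): report who is running
# and whether that user can create files in a checkpoint directory.
check_ckpt_dir() {
    local dir="$1"
    local probe="$dir/.write_probe.$$"   # illustrative probe file name

    # Which node and which user are we running as?
    echo "host=$(uname -n) user=$(id -un) dir=$dir"

    # -a also lists dot files such as .task.N.ckpt.tmp
    ls -la "$dir"

    # Same test as running touch by hand in the directory.
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable: yes"
    else
        echo "writable: no"
        return 1
    fi
}

# Only run the check when a directory is passed on the command line.
if [ "$#" -gt 0 ]; then
    check_ckpt_dir "$1"
fi
```

Saved as check_ckpt_dir.sh and made executable, it could be launched with
the same layout as the failing job, for example
"srun -N2 -n10 ./check_ckpt_dir.sh /mirror/source/cr/275.0", so that the
output from each task can be compared. That only shows what the job's own
tasks can do in the directory; it does not show which user cr_checkpoint
itself runs as.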
