Hi,

This is the output of ls -a:

.   .task.0.ckpt.tmp  task.2.ckpt       .task.3.ckpt.tmp  task.5.ckpt  task.7.ckpt  task.9.ckpt
..  task.1.ckpt       .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt  task.8.ckpt

This is the output of ls:

task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt  task.8.ckpt  task.9.ckpt

The temporary files appear when I use the ls -a command. What does that mean?
I can create files in the directory with touch.

I am trying to checkpoint an MPI application as a non-root user with the
Slurm checkpoint interval feature. I don't run cr_checkpoint directly.

Regards,

Husen

On Tue, May 17, 2016 at 10:30 PM, Eric Roman <[email protected]> wrote:

> Are the temporary files created?
>
> Does ls -a on the directory show the missing files?
>
> Can you create files in that directory with touch?
>
> Finally, is cr_checkpoint being run by root? Or some other user? The
> checkpoint file will be created by the user invoking cr_checkpoint.
>
> Eric
>
> On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
> > Dear all,
> >
> > I fail every time I try to checkpoint an MPI application using BLCR
> > in Slurm. The following is my sbatch script:
> >
> > ##########################SBATCH SCRIPT############
> > #!/bin/bash
> > #SBATCH -J MatMul
> > #SBATCH -o cr/mm-%j.out
> > #SBATCH -A necis
> > #SBATCH -N 3
> > #SBATCH -n 24
> > #SBATCH --checkpoint=1
> > #SBATCH --checkpoint-dir=cr
> > #SBATCH --time=01:30:00
> > #SBATCH --mail-user=[email protected]
> > #SBATCH --mail-type=begin
> > #SBATCH --mail-type=end
> > srun --mpi=pmi2 ./mm.o
> > ####################################################
> >
> > I have also tried running it directly with the srun command, but that
> > failed too. The following is the command I used and the error message
> > that occurred.
> > command:
> > srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
> >
> > error:
> > Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> > Received results from task 6
> > Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> >
> > In the cr directory there are 7 .ckpt files, as follows:
> > task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
> > task.8.ckpt and task.9.ckpt.
> > There are no checkpoint files called task.0.ckpt, task.3.ckpt and
> > task.4.ckpt, as mentioned in the error message.
> >
> > mirror is an NFS directory that is shared across the nodes. I set the
> > cr directory to permissions 777 just to avoid permission issues.
> >
> > Note: if I run the command as an sbatch job, I get only a file named
> > script.ckpt. There is no task.[number].ckpt file.
> >
> > Can anyone please tell me how to solve this?
> > Thank you in advance.
> >
> > Regards,
> >
> > Husen
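
For reference, the checks Eric asks about (which user is writing, whether hidden temporary files are left behind, whether touch works) could be collected in one small script. This is only a sketch under assumptions: check_ckpt_dir is a hypothetical helper name, not part of BLCR or Slurm, and the .task.*.ckpt.tmp pattern is taken from the file names in the listing above.

```shell
#!/bin/bash
# Sketch of the directory checks discussed in this thread.
# check_ckpt_dir is a hypothetical helper, not a BLCR or Slurm command.
check_ckpt_dir() {
    local dir="$1"
    local probe tmp_count

    # Report which user and host are doing the check.
    echo "user=$(id -un) host=$(uname -n)"

    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi

    # Show owner and mode of the directory itself.
    ls -ld "$dir"

    # Count leftover hidden temporary files such as .task.0.ckpt.tmp.
    tmp_count=$(find "$dir" -maxdepth 1 -name '.task.*.ckpt.tmp' | wc -l)
    echo "leftover_tmp=$((tmp_count))"

    # Same test as touch: can this user create a file here?
    probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable=yes"
    else
        echo "writable=no"
        return 1
    fi
}
```

Saved to a file that ends with a call such as check_ckpt_dir /mirror/source/cr/275.0, it could be launched through srun so that each node reports from its own view of the NFS mount. A node that prints writable=no, or a different user name, would be the place to look first.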
