This is the output of ls -a -l. The files named in the error message are 0 bytes in size, and all of them were produced by processes on the remote nodes.

drwxrwxr-x  2 necis necis      4096 Mei 17 14:44 .
drwxrwxrwx 12 root  root       4096 Mei 17 17:24 ..
-r--------  1 necis necis         0 Mei 17 14:42 .task.0.ckpt.tmp
-r--------  1 necis necis 183630160 Mei 17 14:44 task.1.ckpt
-r--------  1 necis necis 183630200 Mei 17 14:43 task.2.ckpt
-r--------  1 necis necis         0 Mei 17 14:43 .task.2.ckpt.tmp
-r--------  1 necis necis         0 Mei 17 14:42 .task.3.ckpt.tmp
-r--------  1 necis necis         0 Mei 17 14:42 .task.4.ckpt.tmp
-r--------  1 necis necis 183297635 Mei 17 14:43 task.5.ckpt
-r--------  1 necis necis 183297635 Mei 17 14:43 task.6.ckpt
-r--------  1 necis necis 183297635 Mei 17 14:43 task.7.ckpt
-r--------  1 necis necis 183301731 Mei 17 14:43 task.8.ckpt
-r--------  1 necis necis 183297635 Mei 17 14:43 task.9.ckpt

Regards,

Husen

On Wed, May 18, 2016 at 7:38 AM, Husen R <[email protected]> wrote:

> Hi,
>
> This is the output of ls -a:
>
> .   .task.0.ckpt.tmp  task.2.ckpt       .task.3.ckpt.tmp  task.5.ckpt  task.7.ckpt  task.9.ckpt
> ..  task.1.ckpt       .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt  task.8.ckpt
>
> This is the output of ls:
>
> task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt  task.8.ckpt  task.9.ckpt
>
> The temporary files appear when I use the ls -a command. What does that mean?
> I can create files in the directory with touch.
>
> I am trying to checkpoint an MPI application as a non-root user with the
> Slurm checkpoint interval feature. I don't checkpoint directly with the
> cr_checkpoint command.
>
> Regards,
>
> Husen
>
> On Tue, May 17, 2016 at 10:30 PM, Eric Roman <[email protected]> wrote:
>
>> Are the temporary files created?
>>
>> Does ls -a on the directory show the missing files?
>>
>> Can you create files in that directory with touch?
>>
>> Finally, is cr_checkpoint being run by root? Or some other user? The
>> checkpoint file will be created by the user invoking cr_checkpoint.
>>
>> Eric
>>
>> On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
>> > Dear all,
>> >
>> > Every time I try to checkpoint an MPI application using BLCR in Slurm,
>> > it fails. The following is my sbatch script:
>> >
>> > ##########################SBATCH SCRIPT############
>> > #!/bin/bash
>> > #SBATCH -J MatMul
>> > #SBATCH -o cr/mm-%j.out
>> > #SBATCH -A necis
>> > #SBATCH -N 3
>> > #SBATCH -n 24
>> > #SBATCH --checkpoint=1
>> > #SBATCH --checkpoint-dir=cr
>> > #SBATCH --time=01:30:00
>> > #SBATCH --mail-user=[email protected]
>> > #SBATCH --mail-type=begin
>> > #SBATCH --mail-type=end
>> > srun --mpi=pmi2 ./mm.o
>> > ####################################################
>> >
>> > I have also tried running it directly with the srun command, but that
>> > failed too. The following is the command I used and the error message
>> > that occurred.
>> >
>> > command:
>> > srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
>> >
>> > error:
>> > Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
>> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>> > Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
>> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>> > Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
>> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>> > Received results from task 6
>> > Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
>> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>> > Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
>> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
>> > Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
>> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>> > Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
>> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>> >
>> > In the cr directory there are 7 .ckpt files, as follows:
>> > task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
>> > task.8.ckpt and task.9.ckpt.
>> > There are no checkpoint files called task.0.ckpt, task.3.ckpt and
>> > task.4.ckpt, the ones mentioned in the error message.
>> >
>> > mirror is an NFS directory that is shared across the nodes. I set the
>> > cr directory's permissions to 777 just to avoid permission issues.
>> >
>> > Note: if I run the command as an sbatch job, I only get a file named
>> > script.ckpt. There is no task.[number].ckpt file.
>> >
>> > Can anyone please tell me how to solve this?
>> > Thank you in advance.
>> >
>> > Regards,
>> > Husen
