Husen,

Just to follow up: I looked more closely at the file permissions in
cr_checkpoint. BLCR does create the checkpoint files with mode 400
(read-only), so nothing is changing the permissions. No mystery there.

What's likely happening is that a different checkpoint has failed.
cr_checkpoint is being invoked and the file is being created, but no data
is written to it. Further, cr_checkpoint would normally delete this file.

So a previous checkpoint may still be running but hasn't yet produced any
output. It could be blocked. Or cr_checkpoint may have been killed, which
would prevent that temporary file from being deleted.

Eric

On Tue, May 17, 2016 at 05:43:36PM -0700, Husen R wrote:
> This is the output of ls -a -l. The files that appeared in the error
> message are 0 bytes in size, and they all come from processes on the
> remote nodes.
>
> drwxrwxr-x  2 necis necis      4096 Mei 17 14:44 .
> drwxrwxrwx 12 root  root       4096 Mei 17 17:24 ..
> -r--------  1 necis necis         0 Mei 17 14:42 .task.0.ckpt.tmp
> -r--------  1 necis necis 183630160 Mei 17 14:44 task.1.ckpt
> -r--------  1 necis necis 183630200 Mei 17 14:43 task.2.ckpt
> -r--------  1 necis necis         0 Mei 17 14:43 .task.2.ckpt.tmp
> -r--------  1 necis necis         0 Mei 17 14:42 .task.3.ckpt.tmp
> -r--------  1 necis necis         0 Mei 17 14:42 .task.4.ckpt.tmp
> -r--------  1 necis necis 183297635 Mei 17 14:43 task.5.ckpt
> -r--------  1 necis necis 183297635 Mei 17 14:43 task.6.ckpt
> -r--------  1 necis necis 183297635 Mei 17 14:43 task.7.ckpt
> -r--------  1 necis necis 183301731 Mei 17 14:43 task.8.ckpt
> -r--------  1 necis necis 183297635 Mei 17 14:43 task.9.ckpt
>
> Regards,
> Husen
>
> On Wed, May 18, 2016 at 7:38 AM, Husen R <[email protected]> wrote:
>
> Hi,
>
> This is the output of ls -a:
>
> .   .task.0.ckpt.tmp  task.2.ckpt       .task.3.ckpt.tmp  task.5.ckpt  task.7.ckpt  task.9.ckpt
> ..  task.1.ckpt       .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt  task.8.ckpt
>
> This is the output of ls:
>
> task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt  task.8.ckpt  task.9.ckpt
>
> The temporary files appear when I use the ls -a command. What does that
> mean?
> I can create files in the directory with touch.
>
> I am trying to checkpoint an MPI application as a non-root user with the
> Slurm checkpoint interval feature. I don't invoke the cr_checkpoint
> command directly.
>
> Regards,
> Husen
>
> On Tue, May 17, 2016 at 10:30 PM, Eric Roman <[email protected]> wrote:
>
> Are the temporary files created?
>
> Does ls -a on the directory show the missing files?
>
> Can you create files in that directory with touch?
>
> Finally, is cr_checkpoint being run by root? Or some other user? The
> checkpoint file will be created by the user invoking cr_checkpoint.
>
> Eric
>
> On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
> > dear all,
> >
> > I fail every time I try to checkpoint an MPI application using BLCR
> > in Slurm. The following is my sbatch script:
> >
> > ##########################SBATCH SCRIPT############
> > #!/bin/bash
> > #SBATCH -J MatMul
> > #SBATCH -o cr/mm-%j.out
> > #SBATCH -A necis
> > #SBATCH -N 3
> > #SBATCH -n 24
> > #SBATCH --checkpoint=1
> > #SBATCH --checkpoint-dir=cr
> > #SBATCH --time=01:30:00
> > #SBATCH --mail-user=[email protected]
> > #SBATCH --mail-type=begin
> > #SBATCH --mail-type=end
> > srun --mpi=pmi2 ./mm.o
> > ####################################################
> >
> > I have also tried to run it directly with the srun command, but that
> > failed too. The following is the command I used and the error message
> > that occurred.
> >
> > command:
> > srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
> >
> > error:
> > Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> > Received results from task 6
> > Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> > Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> > Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> >
> > In the cr directory there are 7 .ckpt files, as follows:
> > task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
> > task.8.ckpt and task.9.ckpt.
> > There are no checkpoint files called task.0.ckpt, task.3.ckpt and
> > task.4.ckpt as mentioned in the error message.
> >
> > mirror is an NFS directory that is shared across the nodes. I set the
> > cr directory to permission 777 just to avoid permission issues.
> >
> > Note: if I run the command as an sbatch job, I just get a file named
> > script.ckpt. There is no task.[number].ckpt file.
> >
> > Can anyone please tell me how to solve this?
> >
> > Thank you in advance.
> >
> > Regards,
> > Husen

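As a rough illustration of the diagnosis at the top of this thread, the following Python sketch scans one checkpoint directory and separates tasks that have a non-empty task.N.ckpt from tasks that left behind a zero-byte .task.N.ckpt.tmp. The function name scan_checkpoint_dir is hypothetical, and the file-name patterns are assumed from the ls -a -l listing quoted above, not taken from BLCR or Slurm documentation.

```python
import os
import re

# Assumed naming, inferred from the directory listing in this thread:
#   task.N.ckpt       finished checkpoint for task N
#   .task.N.ckpt.tmp  temporary file created while task N is being checkpointed
FINAL_RE = re.compile(r"^task\.(\d+)\.ckpt$")
TMP_RE = re.compile(r"^\.task\.(\d+)\.ckpt\.tmp$")


def scan_checkpoint_dir(path):
    """Return (complete, stale) sorted lists of task numbers.

    complete: tasks with a non-empty task.N.ckpt
    stale:    tasks with a zero-byte .task.N.ckpt.tmp
    """
    complete, stale = set(), set()
    # os.listdir includes dotfiles, so the hidden .tmp files are seen too.
    for name in os.listdir(path):
        # getsize only needs stat access, so mode-400 files are fine.
        size = os.path.getsize(os.path.join(path, name))
        final = FINAL_RE.match(name)
        if final:
            if size > 0:
                complete.add(int(final.group(1)))
            continue
        tmp = TMP_RE.match(name)
        if tmp and size == 0:
            stale.add(int(tmp.group(1)))
    return sorted(complete), sorted(stale)
```

Called on a directory with the contents shown in the listing (for example /mirror/source/cr/275.0), this would report tasks 1, 2, 5, 6, 7, 8 and 9 as complete and tasks 0, 2, 3 and 4 as stale. A stale entry only says that a temporary file exists with no data in it; it does not say whether cr_checkpoint is blocked or was killed.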