Hi Eric,

Thank you for your reply! I'll try checkpointing with a longer checkpoint interval. The value of 1 was only a test.
In addition to that problem, I'm wondering why the task.[number].ckpt files are not created when I submit the job with sbatch. In that case there is only a file named script.ckpt. The error message appeared after I ran "srun [option]" directly in the terminal.

Thank you in advance.

Regards,
Husen

On Thu, May 19, 2016 at 8:20 AM, Eric Roman <[email protected]> wrote:
>
> Husen,
>
> Just to follow up. I looked closer into the file permissions in
> cr_checkpoint. BLCR does create the checkpoint files with mode 400
> (read-only), so nothing is changing the permissions. No mystery there.
>
> What's likely happening is a different checkpoint has failed.
> cr_checkpoint is being invoked, the file is being created, but no data
> is written to it. Further, cr_checkpoint would normally delete this file.
>
> So, a previous checkpoint may still be running, but hasn't yet produced
> any output. It could be blocked.
>
> Or cr_checkpoint may have been killed, which would prevent that
> temporary file from being deleted.
>
> Eric
>
> On Tue, May 17, 2016 at 05:43:36PM -0700, Husen R wrote:
> > This is the output of ls -a -l. The files that appeared in the error
> > message are 0 bytes in size, and they all result from processes on
> > the remote nodes.
> >
> > drwxrwxr-x  2 necis necis      4096 Mei 17 14:44 .
> > drwxrwxrwx 12 root  root       4096 Mei 17 17:24 ..
> > -r--------  1 necis necis         0 Mei 17 14:42 .task.0.ckpt.tmp
> > -r--------  1 necis necis 183630160 Mei 17 14:44 task.1.ckpt
> > -r--------  1 necis necis 183630200 Mei 17 14:43 task.2.ckpt
> > -r--------  1 necis necis         0 Mei 17 14:43 .task.2.ckpt.tmp
> > -r--------  1 necis necis         0 Mei 17 14:42 .task.3.ckpt.tmp
> > -r--------  1 necis necis         0 Mei 17 14:42 .task.4.ckpt.tmp
> > -r--------  1 necis necis 183297635 Mei 17 14:43 task.5.ckpt
> > -r--------  1 necis necis 183297635 Mei 17 14:43 task.6.ckpt
> > -r--------  1 necis necis 183297635 Mei 17 14:43 task.7.ckpt
> > -r--------  1 necis necis 183301731 Mei 17 14:43 task.8.ckpt
> > -r--------  1 necis necis 183297635 Mei 17 14:43 task.9.ckpt
> >
> > Regards,
> > Husen
> >
> > On Wed, May 18, 2016 at 7:38 AM, Husen R <[email protected]> wrote:
> >
> > Hi,
> >
> > This is the output of ls -a:
> >
> > .   .task.0.ckpt.tmp  task.2.ckpt       .task.3.ckpt.tmp  task.5.ckpt  task.7.ckpt  task.9.ckpt
> > ..  task.1.ckpt       .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt  task.8.ckpt
> >
> > This is the output of ls:
> >
> > task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt  task.8.ckpt  task.9.ckpt
> >
> > The temporary files appear only when I use ls -a. What does that mean?
> >
> > I can create files in the directory with touch.
> >
> > I am trying to checkpoint an MPI application as a non-root user with
> > the Slurm checkpoint interval feature. I don't checkpoint directly
> > with the cr_checkpoint command.
> >
> > Regards,
> > Husen
> >
> > On Tue, May 17, 2016 at 10:30 PM, Eric Roman <[email protected]> wrote:
> >
> > Are the temporary files created?
> >
> > Does ls -a on the directory show the missing files?
> >
> > Can you create files in that directory with touch?
> >
> > Finally, is cr_checkpoint being run by root? Or some other user? The
> > checkpoint file will be created by the user invoking cr_checkpoint.
> >
> > Eric
> >
> > On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
> > > Dear all,
> > >
> > > I fail every time I try to checkpoint an MPI application using BLCR
> > > in Slurm. The following is my sbatch script:
> > >
> > > ##########################SBATCH SCRIPT############
> > > #!/bin/bash
> > > #SBATCH -J MatMul
> > > #SBATCH -o cr/mm-%j.out
> > > #SBATCH -A necis
> > > #SBATCH -N 3
> > > #SBATCH -n 24
> > > #SBATCH --checkpoint=1
> > > #SBATCH --checkpoint-dir=cr
> > > #SBATCH --time=01:30:00
> > > #SBATCH --mail-user=[email protected]
> > > #SBATCH --mail-type=begin
> > > #SBATCH --mail-type=end
> > > srun --mpi=pmi2 ./mm.o
> > > ####################################################
> > >
> > > I have also tried to run it directly with the srun command, but that
> > > failed too. The following is the command I used and the error
> > > message that occurred.
> > >
> > > command:
> > > srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
> > >
> > > error:
> > > Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> > > Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> > > Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> > > Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> > > Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> > > Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> > > Received results from task 6
> > > Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> > > Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> > > Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
> > > Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
> > > Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> > > Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> > > Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> > > Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> > >
> > > In the cr directory there are 7 .ckpt files, as follows:
> > > task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
> > > task.8.ckpt and task.9.ckpt.
> > >
> > > There are no checkpoint files called task.0.ckpt, task.3.ckpt and
> > > task.4.ckpt, the ones mentioned in the error message.
> > >
> > > mirror is an NFS directory that is shared across the nodes. I set
> > > the cr directory to permission 777 just to avoid permission issues.
> > >
> > > Note: if I run the job through sbatch, I just get a file named
> > > script.ckpt. There is no task.[number].ckpt file.
> > >
> > > Can anyone please tell me how to solve this?
> > >
> > > Thank you in advance.
> > >
> > > Regards,
> > > Husen

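Following Eric's diagnosis above, a zero-byte .task.N.ckpt.tmp file marks a checkpoint that was started but never wrote its data. The snippet below is only a minimal sketch for spotting such leftovers in a checkpoint directory. The function name list_stale_ckpt_tmp is hypothetical, the file-name pattern is taken from the listings in this thread, and the script does not call BLCR or Slurm in any way.

```shell
#!/bin/bash
# Sketch only: list zero-byte ".task.N.ckpt.tmp" files in a checkpoint
# directory. The function name is hypothetical; the file-name pattern is
# the one seen in this thread's ls output.
list_stale_ckpt_tmp() {
    local dir="$1" f
    for f in "$dir"/.task.*.ckpt.tmp; do
        # If nothing matches, the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        # -s is true for files with data; print only the empty ones.
        [ -s "$f" ] || printf '%s\n' "${f##*/}"
    done
}

# Run against a directory only when one is given on the command line.
if [ "$#" -gt 0 ]; then
    list_stale_ckpt_tmp "$1"
fi
```

Saved under a name of your choice and run against a step directory such as /mirror/source/cr/275.0, it prints one file name per line. An empty result would suggest that no unfinished temporary files remain there, though it says nothing about whether a checkpoint is still in progress.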