Hi Eric,

Thank you for your reply!
I'll try checkpointing with a longer checkpoint interval. The interval value
of 1 was only a test.

In addition to that problem, I'm wondering why the task.[number].ckpt files
are not created when I submit the job using sbatch. In that case there is
only a file named script.ckpt.

The error message appeared after I executed the "srun [option]" command
directly in the terminal.
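
To make it easier to compare runs, here is a rough sketch of how the state of
one step directory could be summarized. It is only an illustration: the
function names list_stalled_tasks and list_missing_tasks are made up for this
example, and it assumes the file naming seen in this thread (task.N.ckpt for a
finished checkpoint, .task.N.ckpt.tmp for the temporary file).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Slurm or BLCR.
# Assumes the naming seen in this thread: task.N.ckpt and .task.N.ckpt.tmp.

# Print the task numbers that have a zero-byte temporary file left behind.
list_stalled_tasks() {
    local dir="$1"
    local f name
    for f in "$dir"/.task.*.ckpt.tmp; do
        [ -e "$f" ] || continue      # the glob matched nothing
        [ -s "$f" ] && continue      # non-empty file: data has been written
        name="${f##*/}"              # e.g. .task.4.ckpt.tmp
        name="${name#.task.}"        # e.g. 4.ckpt.tmp
        echo "${name%.ckpt.tmp}"     # e.g. 4
    done
}

# Print the task numbers in 0..ntasks-1 that have no finished task.N.ckpt.
list_missing_tasks() {
    local dir="$1"
    local ntasks="$2"
    local i=0
    while [ "$i" -lt "$ntasks" ]; do
        [ -f "$dir/task.$i.ckpt" ] || echo "$i"
        i=$((i + 1))
    done
}
```

For example, "list_stalled_tasks /mirror/source/cr/275.0" followed by
"list_missing_tasks /mirror/source/cr/275.0 10" would be run against the
directory from the srun test with 10 tasks.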

Thank you in advance

Regards,


Husen


On Thu, May 19, 2016 at 8:20 AM, Eric Roman <[email protected]> wrote:

>
>
> Husen,
>
> Just to follow up.  I looked more closely at the file permissions in
> cr_checkpoint.
> BLCR does create the checkpoint files with mode 400 (read-only), so nothing
> is changing the permissions.  No mystery there.
>
> What's likely happening is a different checkpoint has failed.
> cr_checkpoint is being invoked, the file is being created, but no data is
> written to it.  Further, cr_checkpoint would normally delete this file.
>
> So, a previous checkpoint may still be running, but hasn't yet produced any
> output.  It could be blocked.
>
> Or cr_checkpoint may have been killed, which would prevent that temporary
> file from being deleted.
>
> Eric
>
> On Tue, May 17, 2016 at 05:43:36PM -0700, Husen R wrote:
> >    This is the output of ls -a -l. The files that appeared in the error
> >    message are 0 bytes in size, and they all come from processes on the
> >    remote nodes.
> >    drwxrwxr-x  2 necis necis      4096 Mei 17 14:44 .
> >    drwxrwxrwx 12 root  root       4096 Mei 17 17:24 ..
> >    -r--------  1 necis necis         0 Mei 17 14:42 .task.0.ckpt.tmp
> >    -r--------  1 necis necis 183630160 Mei 17 14:44 task.1.ckpt
> >    -r--------  1 necis necis 183630200 Mei 17 14:43 task.2.ckpt
> >    -r--------  1 necis necis         0 Mei 17 14:43 .task.2.ckpt.tmp
> >    -r--------  1 necis necis         0 Mei 17 14:42 .task.3.ckpt.tmp
> >    -r--------  1 necis necis         0 Mei 17 14:42 .task.4.ckpt.tmp
> >    -r--------  1 necis necis 183297635 Mei 17 14:43 task.5.ckpt
> >    -r--------  1 necis necis 183297635 Mei 17 14:43 task.6.ckpt
> >    -r--------  1 necis necis 183297635 Mei 17 14:43 task.7.ckpt
> >    -r--------  1 necis necis 183301731 Mei 17 14:43 task.8.ckpt
> >    -r--------  1 necis necis 183297635 Mei 17 14:43 task.9.ckpt
> >    Regards,
> >    Husen
> >    On Wed, May 18, 2016 at 7:38 AM, Husen R <[1][email protected]> wrote:
> >
> >      Hi,
> >      This is the output of ls -a:
> >      .   .task.0.ckpt.tmp  task.2.ckpt       .task.3.ckpt.tmp  task.5.ckpt  task.7.ckpt  task.9.ckpt
> >      ..  task.1.ckpt       .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt  task.8.ckpt
> >      This is the output of ls:
> >      task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt  task.8.ckpt  task.9.ckpt
> >      The temporary files appear only when I use ls -a. What does that
> >      mean?
> >      I can create files in the directory with touch.
> >      I am trying to checkpoint an MPI application as a non-root user with
> >      Slurm's checkpoint interval feature. I don't invoke the cr_checkpoint
> >      command directly.
> >      Regards,
> >      Husen
> >      On Tue, May 17, 2016 at 10:30 PM, Eric Roman <[2][email protected]>
> wrote:
> >
> >        Are the temporary files created?
> >
> >        Does ls -a on the directory show the missing files?
> >
> >        Can you create files in that directory with touch?
> >
> >        Finally, is cr_checkpoint being run by root?  Or some other user?
> >        The checkpoint file will be created by the user invoking
> >        cr_checkpoint.
> >
> >        Eric
> >        On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
> >        >    Dear all,
> >        >    I fail every time I try to checkpoint an MPI application using
> >        >    BLCR in Slurm. The following is my sbatch script:
> >        >    ##########################SBATCH SCRIPT############
> >        >    #!/bin/bash
> >        >    #SBATCH -J MatMul
> >        >    #SBATCH -o cr/mm-%j.out
> >        >    #SBATCH -A necis
> >        >    #SBATCH -N 3
> >        >    #SBATCH -n 24
> >        >    #SBATCH --checkpoint=1
> >        >    #SBATCH --checkpoint-dir=cr
> >        >    #SBATCH --time=01:30:00
> >        >    #SBATCH --mail-user=[1][3][email protected]
> >        >    #SBATCH --mail-type=begin
> >        >    #SBATCH --mail-type=end
> >        >    srun --mpi=pmi2 ./mm.o
> >        >    ####################################################
> >        >    I have also tried running it directly with the srun command,
> >        >    but that failed too. The following is the command I used and
> >        >    the error message that occurred.
> >        >    command:
> >        >    srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
> >        >    error:
> >        >    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> >        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> >        >    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> >        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> >        >    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> >        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> >        >    Received results from task 6
> >        >    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> >        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> >        >    Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
> >        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
> >        >    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> >        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> >        >    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> >        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> >        >    In the cr directory there are 7 .ckpt files, as follows:
> >        >    task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
> >        >    task.8.ckpt and task.9.ckpt.
> >        >    There are no checkpoint files called task.0.ckpt, task.3.ckpt or
> >        >    task.4.ckpt, although the error message mentions them.
> >        >    mirror is an NFS directory that is shared across the nodes. I
> >        >    set the cr directory to permission 777 just to avoid permission
> >        >    issues.
> >        >    Note: if I submit the job using sbatch, I just get a file named
> >        >    script.ckpt. There is no task.[number].ckpt file.
> >        >    Could anyone please tell me how to solve this?
> >        >    Thank you in advance.
> >        >    Regards,
> >        >    Husen
> >        >
> >        > References
> >        >
> >        >    Visible links
> >        >    1. mailto:[4][email protected]
> >
> > References
> >
> >    Visible links
> >    1. mailto:[email protected]
> >    2. mailto:[email protected]
> >    3. mailto:[email protected]
> >    4. mailto:[email protected]
>
