Husen,

Just to follow up.  I looked more closely at the file permissions in
cr_checkpoint.  BLCR does create the checkpoint files with mode 400
(read-only), so nothing is changing the permissions.  No mystery there.
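
As a quick illustration of why mode 400 matters here: a non-root user cannot
reopen an existing mode-400 file for writing, which would be consistent with
the "Permission denied" errors if a leftover temporary file is still in place.
This is only a sketch of that general behaviour, not something taken from the
BLCR source; the helper name can_write and the scratch file are made up for
the example.

```shell
# Return success if the current user can open the file for writing.
# Append mode is used so the file is never truncated or modified.
can_write() {
    ( : >> "$1" ) 2>/dev/null
}

f=$(mktemp)        # scratch file standing in for a .task.N.ckpt.tmp file
chmod 400 "$f"     # same mode BLCR gives its checkpoint files
if can_write "$f"; then
    echo "writable (expected only when run as root)"
else
    echo "Permission denied"
fi
rm -f "$f"
```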

What's likely happening is that a different checkpoint has failed.
cr_checkpoint is being invoked and the file is being created, but no data is
written to it.  Further, cr_checkpoint would normally delete this file.

So, a previous checkpoint may still be running, but hasn't yet produced any 
output.  It could be blocked.

Or cr_checkpoint may have been killed, which would prevent that temporary file
from being deleted.
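
A minimal sketch for telling those two cases apart, assuming the
.task.N.ckpt.tmp naming seen in the ls output quoted below and that pgrep is
available on the nodes.  The function names stale_ckpt_tmp and ckpt_running
are made up for this example.

```shell
# List zero-length temporary checkpoint files in the given directory.
# The '.task.*.ckpt.tmp' pattern is taken from the ls output in this thread.
stale_ckpt_tmp() {
    find "$1" -maxdepth 1 -type f -name '.task.*.ckpt.tmp' -size 0 | sort
}

# Succeed if a process named exactly cr_checkpoint is running on this node.
ckpt_running() {
    pgrep -x cr_checkpoint >/dev/null
}
```

For example, "stale_ckpt_tmp /mirror/source/cr/275.0" lists the empty
temporary files, and running ckpt_running on each node of the job shows
whether a checkpoint is still in progress there.  If empty temporary files
exist and no cr_checkpoint is running on any node, they are probably
leftovers from a cr_checkpoint that was killed.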

Eric

On Tue, May 17, 2016 at 05:43:36PM -0700, Husen R wrote:
>    This is the output of ls -a -l. The files that appeared in the error
>    message are 0 bytes in size, and they all come from processes on the
>    remote nodes.
>    drwxrwxr-x  2 necis necis      4096 Mei 17 14:44 .
>    drwxrwxrwx 12 root  root       4096 Mei 17 17:24 ..
>    -r--------  1 necis necis         0 Mei 17 14:42 .task.0.ckpt.tmp
>    -r--------  1 necis necis 183630160 Mei 17 14:44 task.1.ckpt
>    -r--------  1 necis necis 183630200 Mei 17 14:43 task.2.ckpt
>    -r--------  1 necis necis         0 Mei 17 14:43 .task.2.ckpt.tmp
>    -r--------  1 necis necis         0 Mei 17 14:42 .task.3.ckpt.tmp
>    -r--------  1 necis necis         0 Mei 17 14:42 .task.4.ckpt.tmp
>    -r--------  1 necis necis 183297635 Mei 17 14:43 task.5.ckpt
>    -r--------  1 necis necis 183297635 Mei 17 14:43 task.6.ckpt
>    -r--------  1 necis necis 183297635 Mei 17 14:43 task.7.ckpt
>    -r--------  1 necis necis 183301731 Mei 17 14:43 task.8.ckpt
>    -r--------  1 necis necis 183297635 Mei 17 14:43 task.9.ckpt
>    Regards,
>    Husen
>    On Wed, May 18, 2016 at 7:38 AM, Husen R <[email protected]> wrote:
> 
>      Hi,
>      This is the output of ls -a:
>      .   .task.0.ckpt.tmp  task.2.ckpt       .task.3.ckpt.tmp  task.5.ckpt  task.7.ckpt  task.9.ckpt
>      ..  task.1.ckpt       .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt  task.8.ckpt
>      This is the output of ls:
>      task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt  task.8.ckpt  task.9.ckpt
>      The temporary files appear when I use the ls -a command. What does that
>      mean?
>      I can create files in the directory with touch.
>      I am trying to checkpoint an MPI application as a non-root user with the
>      Slurm checkpoint interval feature. I don't checkpoint directly with the
>      cr_checkpoint command.
>      Regards,
>      Husen
>      On Tue, May 17, 2016 at 10:30 PM, Eric Roman <[email protected]> wrote:
> 
>        Are the temporary files created?
> 
>        Does ls -a on the directory show the missing files?
> 
>        Can you create files in that directory with touch?
> 
>        Finally, is cr_checkpoint being run by root?  Or some other user?  The
>        checkpoint file will be created by the user invoking cr_checkpoint.
> 
>        Eric
>        On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
>        >    Dear all,
>        >    I fail every time I try to checkpoint an MPI application using BLCR
>        >    in Slurm. The following is my sbatch script:
>        >    ##########################SBATCH SCRIPT############
>        >    #!/bin/bash
>        >    #SBATCH -J MatMul
>        >    #SBATCH -o cr/mm-%j.out
>        >    #SBATCH -A necis
>        >    #SBATCH -N 3
>        >    #SBATCH -n 24
>        >    #SBATCH --checkpoint=1
>        >    #SBATCH --checkpoint-dir=cr
>        >    #SBATCH --time=01:30:00
>        >    #SBATCH --mail-user=[email protected]
>        >    #SBATCH --mail-type=begin
>        >    #SBATCH --mail-type=end
>        >    srun --mpi=pmi2 ./mm.o
>        >    ####################################################
>        >    I have also tried running it directly with the srun command, but
>        >    that failed too. The following is the command I used and the error
>        >    message that occurred.
>        >    command:
>        >    srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
>        >    error:
>        >    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
>        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>        >    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
>        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>        >    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
>        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>        >    Received results from task 6
>        >    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
>        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>        >    Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
>        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
>        >    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
>        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>        >    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
>        >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>        >    In the cr directory there are 7 .ckpt files, as follows:
>        >    task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
>        >    task.8.ckpt and task.9.ckpt.
>        >    There are no checkpoint files called task.0.ckpt, task.3.ckpt and
>        >    task.4.ckpt as mentioned in the error message.
>        >    mirror is an NFS directory that is shared across the nodes. I set
>        >    the cr directory to permission 777 just to avoid permission issues.
>        >    Note: if I run the command as an sbatch job, I just get a file
>        >    named script.ckpt. There is no task.[number].ckpt file.
>        >    Could anyone please tell me how to solve this?
>        >    Thank you in advance.
>        >    Regards,
>        >    Husen
