Are the temporary files created?

Does ls -a on the directory show the missing files?

Can you create files in that directory with touch?

Finally, is cr_checkpoint being run by root?  Or some other user?  The
checkpoint file will be created by the user invoking cr_checkpoint.
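
The checks above can be bundled into one small helper. This is only a sketch: check_ckpt_dir and the .write_probe file name are made up for illustration and are not part of BLCR or Slurm. It assumes a POSIX shell, and it would need to be run on each node as the same user that runs the job.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or Slurm): report which user we are
# and whether that user can create files in the given checkpoint directory.
check_ckpt_dir() {
    dir=$1

    # The checkpoint file is created by whoever invokes cr_checkpoint,
    # so start by printing the effective user.
    echo "user: $(id -un)"

    # Show owner and mode of the directory itself, then its contents,
    # including dot files such as .task.N.ckpt.tmp.
    ls -ld "$dir"
    ls -la "$dir"

    # Same test as touch: try to create a file, then clean it up.
    probe="$dir/.write_probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable: yes"
    else
        echo "writable: no"
        return 1
    fi
}

# Example call, using the per-job path from the error messages:
#   check_ckpt_dir /mirror/source/cr/275.0
```

Note that the errors name the per-job subdirectory (cr/275.0), not cr itself, so it is the subdirectory's owner and mode that matter here.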

Eric

On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
>    Dear all,
>    Every attempt to checkpoint an MPI application using BLCR in Slurm
>    fails. The following is my sbatch script:
>    ##########################SBATCH SCRIPT############
>    #!/bin/bash
>    #SBATCH -J MatMul
>    #SBATCH -o cr/mm-%j.out
>    #SBATCH -A necis
>    #SBATCH -N 3
>    #SBATCH -n 24
>    #SBATCH --checkpoint=1
>    #SBATCH --checkpoint-dir=cr
>    #SBATCH --time=01:30:00
>    #SBATCH --mail-user=[1][email protected]
>    #SBATCH --mail-type=begin
>    #SBATCH --mail-type=end
>    srun --mpi=pmi2 ./mm.o
>    ####################################################
>    I have also tried running it directly with srun, but that failed too.
>    The following is the command I used and the error message that occurred.
>    command :
>    srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
>    error :
>    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission
>    denied
>    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission
>    denied
>    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission
>    denied
>    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>    Received results from task 6
>    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission
>    denied
>    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>    Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission
>    denied
>    Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
>    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission
>    denied
>    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission
>    denied
>    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>    In the cr directory there are 7 .ckpt files, as follows:
>    task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
>    task.8.ckpt and task.9.ckpt.
>    There are no checkpoint files called task.0.ckpt, task.3.ckpt or
>    task.4.ckpt, the ones mentioned in the error messages.
>    mirror is an NFS directory that is shared across the nodes. I set the cr
>    directory to permission 777 just to avoid permission issues.
>    Note: if I run the command as an sbatch job, I only get a file named
>    script.ckpt. There are no task.[number].ckpt files.
>    Can anyone please tell me how to solve this?
>    Thank you in advance.
>    Regards,
>    Husen
> 
> References
> 
>    Visible links
>    1. mailto:[email protected]
