Hi,

This is the output of ls -a:

.   .task.0.ckpt.tmp  task.2.ckpt       .task.3.ckpt.tmp  task.5.ckpt  task.7.ckpt  task.9.ckpt
..  task.1.ckpt       .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt  task.8.ckpt

This is the output of ls:

task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt  task.8.ckpt  task.9.ckpt


The temporary files appear when I use the ls -a command. What does that
mean? I can create files in the directory with touch.

I am trying to checkpoint an MPI application as a non-root user with
Slurm's checkpoint interval feature. I do not run the cr_checkpoint
command directly.

Regards,

Husen

On Tue, May 17, 2016 at 10:30 PM, Eric Roman <[email protected]> wrote:

>
>
> Are the temporary files created?
>
> Does ls -a on the directory show the missing files?
>
> Can you create files in that directory with touch?
>
> Finally, is cr_checkpoint being run by root?  Or some other user?  The
> checkpoint file will be created by the user invoking cr_checkpoint.
>
> Eric
>
> On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
> >    Dear all,
> >    Every attempt to checkpoint my MPI application using BLCR in Slurm
> >    fails. The following is my sbatch script:
> >    ##########################SBATCH SCRIPT############
> >    #!/bin/bash
> >    #SBATCH -J MatMul
> >    #SBATCH -o cr/mm-%j.out
> >    #SBATCH -A necis
> >    #SBATCH -N 3
> >    #SBATCH -n 24
> >    #SBATCH --checkpoint=1
> >    #SBATCH --checkpoint-dir=cr
> >    #SBATCH --time=01:30:00
> >    #SBATCH --mail-user=[1][email protected]
> >    #SBATCH --mail-type=begin
> >    #SBATCH --mail-type=end
> >    srun --mpi=pmi2 ./mm.o
> >    ####################################################
> >    I have also tried running it directly with the srun command, but
> >    that failed too. The following is the command I used and the error
> >    message that occurred.
> >    command:
> >    srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
> >    error:
> >    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> >    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> >    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> >    Received results from task 6
> >    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
> >    Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
> >    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
> >    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
> >    In the cr directory there are 7 .ckpt files, as follows:
> >    task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
> >    task.8.ckpt and task.9.ckpt.
> >    There are no checkpoint files called task.0.ckpt, task.3.ckpt and
> >    task.4.ckpt, as mentioned in the error messages.
> >    mirror is an NFS directory that is shared across the nodes. I set
> >    the cr directory to permission 777 just to avoid permission issues.
> >    Note: if I run the command as an sbatch job, I only get a file named
> >    script.ckpt. There is no task.[number].ckpt file.
> >    Could anyone please tell me how to solve this?
> >    Thank you in advance.
> >    Regards,
> >    Husen
> >
> > References
> >
> >    Visible links
> >    1. mailto:[email protected]
>
