This is the output of ls -a -l. The files named in the error messages are
0 bytes in size, and all of them were created by processes on the remote
nodes.

drwxrwxr-x  2 necis necis      4096 Mei 17 14:44 .
drwxrwxrwx 12 root  root       4096 Mei 17 17:24 ..
-r--------  1 necis necis         0 Mei 17 14:42 .task.0.ckpt.tmp
-r--------  1 necis necis 183630160 Mei 17 14:44 task.1.ckpt
-r--------  1 necis necis 183630200 Mei 17 14:43 task.2.ckpt
-r--------  1 necis necis         0 Mei 17 14:43 .task.2.ckpt.tmp
-r--------  1 necis necis         0 Mei 17 14:42 .task.3.ckpt.tmp
-r--------  1 necis necis         0 Mei 17 14:42 .task.4.ckpt.tmp
-r--------  1 necis necis 183297635 Mei 17 14:43 task.5.ckpt
-r--------  1 necis necis 183297635 Mei 17 14:43 task.6.ckpt
-r--------  1 necis necis 183297635 Mei 17 14:43 task.7.ckpt
-r--------  1 necis necis 183301731 Mei 17 14:43 task.8.ckpt
-r--------  1 necis necis 183297635 Mei 17 14:43 task.9.ckpt
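
To read a listing like this more quickly, here is a minimal bash sketch that prints the task numbers for which only an empty .task.N.ckpt.tmp file exists and no finished task.N.ckpt sits beside it. The function name list_failed_tasks is hypothetical (not part of Slurm or BLCR), and the naming pattern is assumed from the listing above.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm or BLCR tool): print the task numbers
# whose checkpoint exists only as an empty ".task.N.ckpt.tmp" file, with
# no finished "task.N.ckpt" next to it.
list_failed_tasks() {
    local dir="$1" tmp base n
    for tmp in "$dir"/.task.*.ckpt.tmp; do
        [ -e "$tmp" ] || continue      # the glob matched nothing
        [ -s "$tmp" ] && continue      # skip temp files that contain data
        base="${tmp##*/}"              # .task.N.ckpt.tmp
        n="${base#.task.}"             # N.ckpt.tmp
        n="${n%.ckpt.tmp}"             # N
        [ -e "$dir/task.$n.ckpt" ] || echo "$n"
    done
}

# Example call, using the directory from the error messages:
#   list_failed_tasks /mirror/source/cr/275.0
```

A task whose temp file is empty but whose final checkpoint exists (task 2 in the listing) is not reported, since a later attempt evidently succeeded.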

Regards,

Husen


On Wed, May 18, 2016 at 7:38 AM, Husen R <[email protected]> wrote:

> Hi,
>
> This is the output of ls -a:
>
>  .   .task.0.ckpt.tmp  task.2.ckpt       .task.3.ckpt.tmp  task.5.ckpt
>  task.7.ckpt  task.9.ckpt
> ..  task.1.ckpt       .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt
>  task.8.ckpt
>
> This is the output of ls :
>
> task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt
>  task.8.ckpt  task.9.ckpt
>
>
> The temporary files appear only when I use ls -a. What does that mean?
> I can create files in the directory with touch.
>
> I am trying to checkpoint an MPI application as a non-root user with
> Slurm's checkpoint interval feature. I do not run the cr_checkpoint
> command directly.
>
> Regards,
>
> Husen
>
> On Tue, May 17, 2016 at 10:30 PM, Eric Roman <[email protected]> wrote:
>
>>
>>
>> Are the temporary files created?
>>
>> Does ls -a on the directory show the missing files?
>>
>> Can you create files in that directory with touch?
>>
>> Finally, is cr_checkpoint being run by root?  Or some other user?  The
>> checkpoint file will be created by the user invoking cr_checkpoint.
>>
>> Eric
>>
>> On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
>> >    Dear all,
>> >    Every attempt to checkpoint my MPI application using BLCR in
>> >    Slurm fails. The following is my sbatch script:
>> >    ##########################SBATCH SCRIPT############
>> >    #!/bin/bash
>> >    #SBATCH -J MatMul
>> >    #SBATCH -o cr/mm-%j.out
>> >    #SBATCH -A necis
>> >    #SBATCH -N 3
>> >    #SBATCH -n 24
>> >    #SBATCH --checkpoint=1
>> >    #SBATCH --checkpoint-dir=cr
>> >    #SBATCH --time=01:30:00
>> >    #SBATCH --mail-user=[email protected]
>> >    #SBATCH --mail-type=begin
>> >    #SBATCH --mail-type=end
>> >    srun --mpi=pmi2 ./mm.o
>> >    ####################################################
>> >    I have also tried running it directly with the srun command, but that
>> >    failed too. The following is the command I used and the error messages
>> >    that occurred.
>> >    command:
>> >    srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
>> >    error :
>> >    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
>> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>> >    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
>> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>> >    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
>> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>> >    Received results from task 6
>> >    Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
>> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>> >    Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
>> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
>> >    Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
>> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>> >    Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
>> >    Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>> >    In the cr directory there are 7 .ckpt files, as follows:
>> >    task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
>> >    task.8.ckpt and task.9.ckpt.
>> >    There are no checkpoint files called task.0.ckpt, task.3.ckpt and
>> >    task.4.ckpt, the ones mentioned in the error messages.
>> >    mirror is an NFS directory that is shared across the nodes. I set the
>> >    cr directory's permissions to 777 just to avoid permission issues.
>> >    Note: if I run the job with sbatch, I only get a file named
>> >    script.ckpt. There is no task.[number].ckpt file.
>> >    Could anyone please tell me how to solve this?
>> >    Thank you in advance.
>> >    Regards,
>> >    Husen
>> >
>> > References
>> >
>> >    Visible links
>> >    1. mailto:[email protected]
>>
>
>
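
Because the failing files all come from tasks on the remote nodes, it may help to repeat the touch test on every node rather than only on the submit host. The following is a rough bash sketch under that assumption; check_writable is a hypothetical helper name, and the srun line in the comment reuses the -N2 node count and the /mirror/source/cr path from the quoted messages.

```shell
#!/bin/bash
# Hypothetical helper: try to create and remove a probe file in DIR, and
# print the node name and effective uid so that results from different
# nodes can be compared.
check_writable() {
    local dir="$1" probe
    probe="$dir/.write_probe.$(uname -n).$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "$(uname -n) uid=$(id -u): writable"
    else
        echo "$(uname -n) uid=$(id -u): NOT writable"
        return 1
    fi
}

# Possible use, one task per node (assumes bash is available on the nodes):
#   srun -N2 --ntasks-per-node=1 \
#       bash -c "$(declare -f check_writable); check_writable /mirror/source/cr"
```

If one node reports a different uid or "NOT writable", that would point at the NFS export or user mapping on that node rather than at the directory mode.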
