Dear all,

Every attempt I make to checkpoint an MPI application using BLCR in Slurm
fails. The following is my sbatch script:

##########################SBATCH SCRIPT############
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o cr/mm-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=cr
#SBATCH --time=01:30:00
#SBATCH [email protected]
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

####################################################

I have also tried running the application directly with srun, but that
fails as well. The following is the command I used and the error messages
that occurred.

Command:

srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o

Error:

Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
Received results from task 6
Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'


In the cr directory there are 7 .ckpt files, as follows:

task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
task.8.ckpt and task.9.ckpt.

There are no checkpoint files named task.0.ckpt, task.3.ckpt or
task.4.ckpt, which are the ones mentioned in the error messages.
/mirror is an NFS directory that is shared across the nodes. I set the
permissions of the cr directory to 777 just to rule out permission issues.
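
Since the errors name the per-step subdirectory cr/275.0 rather than cr
itself, a check along the following lines might show whether every node can
actually create files there. This is only a sketch: check_writable is a
hypothetical helper name, and the commented srun line assumes it is run
against an existing step directory.

```shell
#!/bin/bash
# Hypothetical helper: report whether the current user can create a file
# in the given directory on the host where this runs.
check_writable() {
    local dir="$1"
    local probe
    # mktemp fails if the directory is missing or not writable.
    if probe=$(mktemp "$dir/.probe.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
        echo "$(hostname): $dir is writable by $(id -un)"
        return 0
    fi
    echo "$(hostname): $dir is NOT writable by $(id -un)"
    return 1
}

# Possible usage, one task per node (path taken from the error messages):
#   srun -N2 --ntasks-per-node=1 bash -c \
#     "$(declare -f check_writable); check_writable /mirror/source/cr/275.0"
```

If one node reports the directory as not writable while another does not,
that would point at the NFS export or the ownership of the subdirectory
rather than at the mode of cr itself.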

Note: if I run the job through sbatch instead, I only get a file named
script.ckpt. There are no task.[number].ckpt files.

Could anyone please tell me how to solve this?
Thank you in advance.

Regards,


Husen
