I'm running Singularity 3.5.3 on a RHEL 7 cluster. OpenMPI 4.0.3 is
installed on the host, and OpenMPI 4.0.3 also lives inside the OpenFOAM
version 10 container (openfoam10.sif). I've confirmed the OpenMPI
versions are the same. Perhaps this is a question for Singularity users
as well, but how can I troubleshoot why mpirun just returns "step
creation temporarily disabled, retrying (Requested nodes are busy)"?
Version checks inside the container and on the host:
Singularity> mpirun -V
mpirun (Open MPI) 4.0.3
Report bugs to http://www.open-mpi.org/community/help/
Singularity> which mpirun
/usr/bin/mpirun
Singularity>
$ mpirun -V
mpirun (Open MPI) 4.0.3
[myuser@node047 motorBike]$ mpirun -n 2 -mca plm_base_verbose 100
--mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca
rmaps_base_verbose 100 singularity exec openfoam simpleFoam
-fileHandler uncollated -parallel | tee log.simpleFoam
(tab completion on "openfoam" listed: openfoam10/ openfoam10.sif
openfoamtestfile.sh openfoam_v2012.sif)
[myuser@node047 motorBike]$ mpirun -n 2 -mca plm_base_verbose 100
--mca ras_base_verbose 100 --mca rss_base_verbose 100 --mca
rmaps_base_verbose 100 singularity exec openfoam10.sif simpleFoam
-parallel | tee log.simpleFoam
[node047:11650] mca: base: components_register: registering framework
plm components
[node047:11650] mca: base: components_register: found loaded component
slurm
[node047:11650] mca: base: components_register: component slurm
register function successful
[node047:11650] mca: base: components_register: found loaded component
isolated
[node047:11650] mca: base: components_register: component isolated has
no register or open function
[node047:11650] mca: base: components_register: found loaded component rsh
[node047:11650] mca: base: components_register: component rsh register
function successful
[node047:11650] mca: base: components_open: opening plm components
[node047:11650] mca: base: components_open: found loaded component slurm
[node047:11650] mca: base: components_open: component slurm open
function successful
[node047:11650] mca: base: components_open: found loaded component
isolated
[node047:11650] mca: base: components_open: component isolated open
function successful
[node047:11650] mca: base: components_open: found loaded component rsh
[node047:11650] mca: base: components_open: component rsh open
function successful
[node047:11650] mca:base:select: Auto-selecting plm components
[node047:11650] mca:base:select:( plm) Querying component [slurm]
[node047:11650] mca:base:select:( plm) Query of component [slurm] set
priority to 75
[node047:11650] mca:base:select:( plm) Querying component [isolated]
[node047:11650] mca:base:select:( plm) Query of component [isolated]
set priority to 0
[node047:11650] mca:base:select:( plm) Querying component [rsh]
[node047:11650] mca:base:select:( plm) Query of component [rsh] set
priority to 10
[node047:11650] mca:base:select:( plm) Selected component [slurm]
[node047:11650] mca: base: close: component isolated closed
[node047:11650] mca: base: close: unloading component isolated
[node047:11650] mca: base: close: component rsh closed
[node047:11650] mca: base: close: unloading component rsh
[node047:11650] mca: base: components_register: registering framework
ras components
[node047:11650] mca: base: components_register: found loaded component
slurm
[node047:11650] mca: base: components_register: component slurm
register function successful
[node047:11650] mca: base: components_register: found loaded component
simulator
[node047:11650] mca: base: components_register: component simulator
register function successful
[node047:11650] mca: base: components_open: opening ras components
[node047:11650] mca: base: components_open: found loaded component slurm
[node047:11650] mca: base: components_open: component slurm open
function successful
[node047:11650] mca: base: components_open: found loaded component
simulator
[node047:11650] mca:base:select: Auto-selecting ras components
[node047:11650] mca:base:select:( ras) Querying component [slurm]
[node047:11650] mca:base:select:( ras) Query of component [slurm] set
priority to 50
[node047:11650] mca:base:select:( ras) Querying component [simulator]
[node047:11650] mca:base:select:( ras) Selected component [slurm]
[node047:11650] mca: base: close: unloading component simulator
[node047:11650] mca: base: components_register: registering framework
rmaps components
[node047:11650] mca: base: components_register: found loaded component seq
[node047:11650] mca: base: components_register: component seq register
function successful
[node047:11650] mca: base: components_register: found loaded component
rank_file
[node047:11650] mca: base: components_register: component rank_file
register function successful
[node047:11650] mca: base: components_register: found loaded component
resilient
[node047:11650] mca: base: components_register: component resilient
register function successful
[node047:11650] mca: base: components_register: found loaded component
mindist
[node047:11650] mca: base: components_register: component mindist
register function successful
[node047:11650] mca: base: components_register: found loaded component
round_robin
[node047:11650] mca: base: components_register: component round_robin
register function successful
[node047:11650] mca: base: components_register: found loaded component ppr
[node047:11650] mca: base: components_register: component ppr register
function successful
[node047:11650] [[57513,0],0] rmaps:base set policy with NULL device
NONNULL
[node047:11650] mca: base: components_open: opening rmaps components
[node047:11650] mca: base: components_open: found loaded component seq
[node047:11650] mca: base: components_open: component seq open
function successful
[node047:11650] mca: base: components_open: found loaded component
rank_file
[node047:11650] mca: base: components_open: component rank_file open
function successful
[node047:11650] mca: base: components_open: found loaded component
resilient
[node047:11650] mca: base: components_open: component resilient open
function successful
[node047:11650] mca: base: components_open: found loaded component mindist
[node047:11650] mca: base: components_open: component mindist open
function successful
[node047:11650] mca: base: components_open: found loaded component
round_robin
[node047:11650] mca: base: components_open: component round_robin open
function successful
[node047:11650] mca: base: components_open: found loaded component ppr
[node047:11650] mca: base: components_open: component ppr open
function successful
[node047:11650] mca:rmaps:select: checking available component seq
[node047:11650] mca:rmaps:select: Querying component [seq]
[node047:11650] mca:rmaps:select: checking available component rank_file
[node047:11650] mca:rmaps:select: Querying component [rank_file]
[node047:11650] mca:rmaps:select: checking available component resilient
[node047:11650] mca:rmaps:select: Querying component [resilient]
[node047:11650] mca:rmaps:select: checking available component mindist
[node047:11650] mca:rmaps:select: Querying component [mindist]
[node047:11650] mca:rmaps:select: checking available component round_robin
[node047:11650] mca:rmaps:select: Querying component [round_robin]
[node047:11650] mca:rmaps:select: checking available component ppr
[node047:11650] mca:rmaps:select: Querying component [ppr]
[node047:11650] [[57513,0],0]: Final mapper priorities
[node047:11650] Mapper: ppr Priority: 90
[node047:11650] Mapper: seq Priority: 60
[node047:11650] Mapper: resilient Priority: 40
[node047:11650] Mapper: mindist Priority: 20
[node047:11650] Mapper: round_robin Priority: 10
[node047:11650] Mapper: rank_file Priority: 0
[node047:11650] [[57513,0],0] plm:slurm: final top-level argv:
srun --ntasks-per-node=1 --kill-on-bad-exit --nodes=1
--nodelist=node048 --ntasks=1 orted -mca ess "slurm" -mca
ess_base_jobid "3769171968" -mca ess_base_vpid "1" -mca
ess_base_num_procs "2" -mca orte_node_regex "t[3:47-48]@0(2)" -mca
orte_hnp_uri "3769171968.0;tcp://10.x.x.47,10.x.x.47:50819" -mca
plm_base_verbose "100" --mca ras_base_verbose "100" --mca
rss_base_verbose "100" --mca rmaps_base_verbose "100"
====================== ALLOCATED NODES ======================
node047: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
node048: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================
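Reading the block above: each node offers one slot, so -n 2 places one
rank on node047 and one on node048. A small sketch to total the slots
from such a block, in case it helps anyone checking a larger allocation
(count_slots is a made-up helper name, not part of Open MPI, and the
two input lines are copied from the output above):

```shell
#!/usr/bin/env bash
# Sum the slots= fields from an Open MPI "ALLOCATED NODES" block.
# count_slots is a hypothetical helper name, not an Open MPI tool.
count_slots() {
  awk '
    /slots=/ {
      # Match only the field that starts with "slots=", so that
      # max_slots= and slots_inuse= are not counted.
      for (i = 1; i <= NF; i++)
        if ($i ~ /^slots=/) { split($i, kv, "="); total += kv[2]; nodes++ }
    }
    END { printf "%d nodes, %d slots\n", nodes, total }
  '
}

count_slots <<'EOF'
node047: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
node048: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
EOF
```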
My process:
myuser 11650 10965 0 22:28 pts/0 00:00:00 mpirun -n 2 -mca
plm_base_verbose 100 --mca ras_base_verbose 100 --mca rss_base_verbose
100 --mca rmaps_base_verbose 100 singularity exec openfoam10.sif
simpleFoam -parallel
strace just hangs at:
strace: Process 11650 attached
restart_syscall(<... resuming interrupted poll ...>^Cstrace: Process
11650 detached
<detached ...>
With or without the --exclusive option, all I get is:
srun: Job 12525169 step creation temporarily disabled, retrying
(Requested nodes are busy)
srun: Job 12525169 step creation temporarily disabled, retrying
(Requested nodes are busy)
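These messages presumably come from the srun that the slurm plm
component launches for orted on node048 (see the "final top-level argv"
line in the log). One guess, which I have not confirmed, is that
another step already occupies the allocation. A sketch for listing the
steps of the job, assuming squeue is on PATH (show_steps is a
hypothetical helper name that only wraps squeue):

```shell
#!/usr/bin/env bash
# List the job steps Slurm currently knows about for one job.
# show_steps is a hypothetical helper name; it only wraps squeue.
show_steps() {
  local jobid="$1"
  if ! command -v squeue >/dev/null 2>&1; then
    echo "squeue not found on PATH" >&2
    return 1
  fi
  # --steps switches squeue to step view; --jobs limits it to one job.
  squeue --steps --jobs="$jobid"
}

# Usage inside the allocation (job ID taken from the srun messages):
# show_steps 12525169
```

If a step other than the one I am trying to start shows up there, that
would at least be consistent with the "Requested nodes are busy" text.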
Are the options not in the correct order?
Thanks,
Rob