Dear sir:
I recently built a parallel computing server in my lab. I use Slurm as
the resource manager and Open MPI to run my own parallel software, but I keep
running into problems: I can't launch my jobs on the compute nodes. I'm sure
there is nothing wrong with my Open MPI build, because I can run my software
when Slurm is down. Below is the error output that Slurm returns. Could anyone
kindly help me?
In the terminal, I typed: salloc -N 1 mpirun -n 16 ./a.out (a.out is my software)
The terminal then returned this error message:
salloc: Granted job allocation 35
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 35
On the control node, the error output is:
slurmctld: sched: _slurm_rpc_allocate_resources JobId=35 NodeList=node142 usec=379
slurmctld: job_step_signal step 35.0 not found
slurmctld: job_step_signal step 35.0 not found
slurmctld: job_complete: JobID=35 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=35 State=0x8005 NodeCnt=1 done
--
-----------------------------
Huang Yang (黄旸)
Dongguan Branch, Institute of High Energy Physics, Chinese Academy of Sciences
Neutron Science Division, Spallation Neutron Source
18710073837