Dear sir:

I recently built a parallel computing server in my lab. I use Slurm as 
the resource manager and Open MPI to run my own parallel software, but I keep 
running into a problem: I can't launch my jobs on the compute nodes. I'm sure 
there is nothing wrong with my Open MPI build, because I can run my software 
when Slurm is down. Below is the output that Slurm returns. Could anyone 
kindly help me?

In the terminal, I typed the following (a.out is my program):

salloc -N 1 mpirun -n 16 ./a.out

The terminal then returned the following error:

salloc: Granted job allocation 35
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 35


On the control node, slurmctld logged the following:

slurmctld: sched: _slurm_rpc_allocate_resources JobId=35 NodeList=node142 usec=379
slurmctld: job_step_signal step 35.0 not found
slurmctld: job_step_signal step 35.0 not found
slurmctld: job_complete: JobID=35 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=35 State=0x8005 NodeCnt=1 done



--
-----------------------------
Huang Yang
Dongguan Branch, Institute of High Energy Physics, Chinese Academy of Sciences
Neutron Science Division, Spallation Neutron Source

18710073837
