Aha! I missed it the first time... In your script, replace "mpirun"
with "srun" and the world should be better.
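A minimal sketch of the revised script, assuming everything else stays exactly as in the script quoted further down; the file name job_mpi.sh is only illustrative. Since the quoted slurm.conf already sets MpiDefault=pmi2, srun should not need an explicit --mpi option here.

```shell
# Write the revised batch script to a file (job_mpi.sh is a made-up name).
cat > job_mpi.sh <<'EOF'
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#
#SBATCH -N 2
module load openmpi
# Launch the tasks through Slurm's own launcher instead of mpirun.
srun test
EOF

# Submit it with: sbatch job_mpi.sh
```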
On 04/30/2016 01:35 PM, Mehdi Acheli
wrote:
Re: [slurm-dev] Re: MPI/OpenMPI send receive not working
Uh, I did a "make all install", so I think PMI
support is installed. And the hello world program works;
would it if PMI support weren't installed?
2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
For Slurm, after the
"make install", did you do a "make install-contrib" (which
builds the pmi2 support)? I think you would have seen a
runtime error if you hadn't, but possibly not.
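One way to check this on an installed system, as a rough sketch: srun --mpi=list reports the MPI plugin types the local Slurm build offers, and ompi_info lists the components Open MPI was built with. The helper name has_pmi2 is made up for this example, and the exact output of both tools varies by version.

```shell
# has_pmi2 is a hypothetical helper: succeeds if "pmi2" appears as a word on stdin.
has_pmi2() {
    grep -qw 'pmi2'
}

# Ask Slurm which MPI plugin types it provides (the list may be written to stderr).
if command -v srun >/dev/null 2>&1; then
    if srun --mpi=list 2>&1 | has_pmi2; then
        echo "Slurm: pmi2 plugin available"
    else
        echo "Slurm: pmi2 plugin not listed"
    fi
fi

# Ask Open MPI which PMI- or Slurm-related components it was built with.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i -E 'pmi|slurm' || echo "Open MPI: no PMI/Slurm components listed"
fi
```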
On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
First of all, thank you for the response.
Here are the answers:
[*]I tried multiple commands:
[*]I started with "srun -N2 --mpi=pmi2 ptest",
then I changed the MPI parameter in slurm.conf
to pmi2, so I no longer need the
option.
[*]I also tried a script submitted via
sbatch. It doesn't work either, and squeue
shows that the job is still running. My program just
passes a number from node 1 to node 2, so it
shouldn't take that long.
[*]The Open MPI version is 1.10.2; Slurm's is
15.08.8.
[*]I built Slurm myself with no specific options.
I installed Open MPI from the
CentOS 7 default repo. I had tried building the
same version earlier with the --with-slurm and
--with-pmi options, but that didn't work
either.
I am attaching a copy of my slurm.conf file and
the script I used to submit the job.
The script:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#
#SBATCH -N 2
module load openmpi
mpirun test
The slurm.conf file:
#
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=m
ControlAddr=m
BackupController=mb
BackupAddr=mb
#
#MailProg=/bin/mail
MpiDefault=pmi2
MpiParams=ports=12000-12999
ProctrackType=proctrack/linuxproc
ReturnToService=2
#SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
#SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
#SlurmdUser=root
#StateSaveLocation=/var/spool/slurm
StateSaveLocation=/mnt/data/spool/slurm
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
PreemptType=preempt/partition_prio
PreemptMode=requeue
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
JobCompType=jobcomp/none
#SlurmctldDebug=3
#SlurmctldLogFile=/var/log/slurmctld.log
SlurmctldLogFile=/mnt/data/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
AccountingStorageBackupHost=mb
#
#
# COMPUTE NODES
NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
2016-04-30 16:40 GMT+01:00 Andy
Riebs <[email protected]>:
Hi,
The one problem that I see in your
description is minor, and probably not
significant: the MPI ports parameter was
needed for very old versions of Open MPI,
IIRC.
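To see which MPI settings the running controller actually uses, something like the following may help. It is only a sketch: scontrol show config prints the active configuration, and mpi_settings is a hypothetical helper name introduced here to filter that output.

```shell
# mpi_settings is a hypothetical helper: keep only the MpiDefault/MpiParams lines from stdin.
mpi_settings() {
    grep -i -E '^[[:space:]]*Mpi(Default|Params)[[:space:]]*=' || true
}

# Active configuration as seen by the Slurm controller.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | mpi_settings
fi
```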
To help debug your problems, please respond
to this list with
[*]What command did you use to invoke
your program?
[*]What versions of Slurm and OpenMPI are
you using?
[*]Did you build them yourself, or use
prebuilt versions?
[*]If you built them yourself, what
configuration options did you use?
[*]If pre-built versions, where did
you get them?
[*]A copy of your slurm.conf file (you
may want to change node names and other
potentially sensitive information)
Andy
On 04/30/2016 10:02 AM, Mehdi Acheli
wrote:
Hello everyone,
I've set up a basic configuration using Slurm with
a master node, a backup node, a login node,
and eight compute nodes. Everything in Slurm is
working fine. I can submit jobs and see
the state of the eight nodes as idle.
The problem is with Open MPI. The parallel hello
program, where each process
prints its rank among the global set,
works, but when I try to establish
communication between nodes through
MPI_Send and MPI_Recv, it just hangs
indefinitely.
I'm using CentOS 7; firewalld and
SELinux are disabled. If I launch my
parallel program, ptest, on two nodes
[n1, n2], a quick check with lsof -i
shows that ptest is listening on
port 1024 on both nodes, which I
find odd, since only one should be
listening. Moreover, I've set Slurm's
MPI parameters to pmi2 and the allowed
ports to [12000-12999], so why is
it still using port 1024?
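A possible explanation, offered as an assumption rather than a confirmed diagnosis: Open MPI's TCP transport picks its own listening ports (its btl_tcp_port_min_v4 parameter defaults to 1024), independently of the MpiParams ports range in slurm.conf, and each rank opens a listening socket so that peers can connect to it. The sketch below shows how the sockets might be inspected and how the range could be moved through MCA environment variables; the values 12000 and 1000 simply mirror the range mentioned above.

```shell
# Show only the Internet sockets held by processes whose command name starts with "ptest".
# -a ANDs the selections, -c filters by command name, -i selects Internet sockets,
# -P and -n keep port numbers and addresses numeric.
if command -v lsof >/dev/null 2>&1; then
    lsof -a -c ptest -i -P -n || echo "no ptest sockets found on this node"
fi

# Assumption: Open MPI's TCP BTL is in use. Ask it to start at port 12000
# and stay within a range of 1000 ports (12000-12999).
export OMPI_MCA_btl_tcp_port_min_v4=12000
export OMPI_MCA_btl_tcp_port_range_v4=1000

# Then launch as before, for example:
#   srun -N2 ptest
```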
I hope you can help me with this problem. I can't
see what's wrong.
Thank you in advance.
M. Acheli.
