I've just done some testing of toy jobs on a two-node Slurm cluster on EC2.

However, I've found that unless I explicitly specify a list of nodes (with
the -w option), all of the tasks end up on the first node.

Can anyone shed some light on this behavior?  Could this be because
the networking on virtual nodes in EC2 is miserable?
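
One comparison I haven't made yet, in case the bare salloc is simply
allocating a single node by default: requesting a node count instead of
naming nodes.  Something like the following (untested here, just a sketch):

salloc -N 2 mpirun -n 2 hostname
srun -N 2 -n 2 hostname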

-Whit


A couple of examples and the config files:


nodeuser@node0:~$ salloc -w node0,node1 mpirun -n 2 hostname
salloc: Granted job allocation 43
node0
node1
salloc: Relinquishing job allocation 43
nodeuser@node0:~$ salloc mpirun -n 2 hostname
salloc: Granted job allocation 44
node0
node0
salloc: Relinquishing job allocation 44
nodeuser@node0:~$ srun mpirun -n 2 hostname
node0
node0
nodeuser@node0:~$ scontrol show nodes
NodeName=node0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
   OS=Linux RealMemory=1 Sockets=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   Reason=(null)

NodeName=node1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
   OS=Linux RealMemory=1 Sockets=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   Reason=(null)

nodeuser@node0:~$
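
(Another check still on my list: see what the bare allocation actually
contains, e.g. something like

salloc -n 2 bash -c 'echo $SLURM_JOB_NODELIST; mpirun -n 2 hostname'

or "scontrol show job <jobid>" while the allocation is active.  I'm assuming
the SLURM_JOB_NODELIST variable here; older versions may only set
SLURM_NODELIST.)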


nodeuser@node0:~$ cat /etc/slurm-llnl/slurm.conf
ControlMachine=node0
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/munge
JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmUser=slurm
StateSaveLocation=/tmp/slurm.state
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=6
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
Include /etc/slurm-llnl/nodes.conf
nodeuser@node0:~$
nodeuser@node0:~$ cat /etc/slurm-llnl/nodes.conf
NodeName=node0
NodeName=node1
PartitionName=prod Nodes=node[0-1] Default=YES MaxTime=INFINITE State=UP
nodeuser@node0:~$
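
One thing I notice writing this up: the node lines don't declare any
resources, and with FastSchedule=1 slurmctld goes by the configured values,
which I believe is why scontrol shows CPUTot=1 and RealMemory=1 above.  If
that turns out to matter, a more explicit nodes.conf would look something
like this (the values are placeholders, not what these instances actually
have):

NodeName=node[0-1] CPUs=1 RealMemory=600 State=UNKNOWN
PartitionName=prod Nodes=node[0-1] Default=YES MaxTime=INFINITE State=UP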
