I've been testing some toy jobs on a small SLURM cluster running on EC2, and I've found that unless I specify a list of nodes (with the -w option), all of the jobs go to the first node.
Can anyone shed some light on this behavior? Could this be because the networking on virtual nodes in EC2 is miserable?

-Whit

A couple of examples and the config files:

nodeuser@node0:~$ salloc -w node0,node1 mpirun -n 2 hostname
salloc: Granted job allocation 43
node0
node1
salloc: Relinquishing job allocation 43
nodeuser@node0:~$ salloc mpirun -n 2 hostname
salloc: Granted job allocation 44
node0
node0
salloc: Relinquishing job allocation 44
nodeuser@node0:~$ srun mpirun -n 2 hostname
node0
node0
nodeuser@node0:~$ scontrol show nodes
NodeName=node0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
   OS=Linux RealMemory=1 Sockets=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   Reason=(null)
NodeName=node1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
   OS=Linux RealMemory=1 Sockets=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   Reason=(null)
nodeuser@node0:~$
nodeuser@node0:~$ cat /etc/slurm-llnl/slurm.conf
ControlMachine=node0
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/munge
JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmUser=slurm
StateSaveLocation=/tmp/slurm.state
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=6
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
Include /etc/slurm-llnl/nodes.conf
nodeuser@node0:~$
nodeuser@node0:~$ cat /etc/slurm-llnl/nodes.conf
NodeName=node0
NodeName=node1
PartitionName=prod Nodes=node[0-1] Default=YES MaxTime=INFINITE State=UP
nodeuser@node0:~$
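In case it's relevant, the NodeName lines in nodes.conf are bare, and scontrol is reporting RealMemory=1 for both nodes. Here is a sketch of what I believe a more explicit node definition would look like; the CPU and memory figures are placeholders I haven't measured on these instances, not values I'm actually running with:

NodeName=node[0-1] CPUs=1 RealMemory=600 State=UNKNOWN
PartitionName=prod Nodes=node[0-1] Default=YES MaxTime=INFINITE State=UP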
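Also, for what it's worth, my (possibly mistaken) understanding is that I should be able to ask for two nodes explicitly instead of naming them. I'm listing these as the kind of invocation I expected to spread the tasks the way the -w example does, not as output I've actually captured:

salloc -N 2 mpirun -n 2 hostname
srun -N 2 -n 2 hostname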
