Please disregard the earlier post. This issue was due to the argument order to salloc and mpirun.
When I call the job this way it runs fine:

    salloc -N 4 mpirun /home/nodeuser/yield.calcs.r

-Whit

On Fri, Feb 25, 2011 at 3:41 PM, Whit Armstrong <[email protected]> wrote:
> I've just done some testing of toy jobs on EC2.
>
> However, I've found that unless I specify a list of nodes (with the -w option), all the jobs go to the first node.
>
> Can anyone shed some light on this behavior? Could this be because the networking on virtual nodes in EC2 is miserable?
>
> -Whit
>
>
> A couple of examples and the config files:
>
> nodeuser@node0:~$ salloc -w node0,node1 mpirun -n 2 hostname
> salloc: Granted job allocation 43
> node0
> node1
> salloc: Relinquishing job allocation 43
> nodeuser@node0:~$ salloc mpirun -n 2 hostname
> salloc: Granted job allocation 44
> node0
> node0
> salloc: Relinquishing job allocation 44
> nodeuser@node0:~$ srun mpirun -n 2 hostname
> node0
> node0
> nodeuser@node0:~$ scontrol show nodes
> NodeName=node0 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
>    OS=Linux RealMemory=1 Sockets=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>    Reason=(null)
>
> NodeName=node1 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
>    OS=Linux RealMemory=1 Sockets=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>    Reason=(null)
>
> nodeuser@node0:~$
>
> nodeuser@node0:~$ cat /etc/slurm-llnl/slurm.conf
> ControlMachine=node0
> AuthType=auth/none
> CacheGroups=0
> CryptoType=crypto/munge
> JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
> JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
> MpiDefault=none
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/tmp/slurmd.spool
> SlurmUser=slurm
> StateSaveLocation=/tmp/slurm.state
> SwitchType=switch/none
> TaskPlugin=task/none
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> FastSchedule=1
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/linear
> AccountingStorageType=accounting_storage/none
> ClusterName=cluster
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=6
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=6
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> Include /etc/slurm-llnl/nodes.conf
> nodeuser@node0:~$
>
> nodeuser@node0:~$ cat /etc/slurm-llnl/nodes.conf
> NodeName=node0
> NodeName=node1
> PartitionName=prod Nodes=node[0-1] Default=YES MaxTime=INFINITE State=UP
> nodeuser@node0:~$
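P.S., for anyone who finds this thread later: the behavior is consistent with how salloc handles its command line. Options intended for salloc must come before the command; everything from the first non-option word onward is treated as the command to launch, so a node count placed after mpirun never reaches salloc, which then falls back to its default allocation. A minimal stand-in sketch of that parsing rule (parse_alloc is a hypothetical illustration of the parsing behavior, not salloc itself):

```shell
#!/bin/sh
# Stand-in for salloc-style argument parsing (hypothetical, for
# illustration only): consume leading options, then treat the rest of
# the argv as the command to run. Flags placed after the command word
# are NOT seen by the allocator -- they are passed to the command.
parse_alloc() {
    nodes=1                              # default allocation size
    while [ $# -gt 0 ]; do
        case "$1" in
            -N) nodes="$2"; shift 2 ;;   # allocator option: node count
            -*) shift ;;                 # other allocator options
            *)  break ;;                 # first non-option word: the command
        esac
    done
    echo "nodes=$nodes cmd=$*"
}

parse_alloc -N 4 mpirun ./yield.calcs.r  # -> nodes=4 cmd=mpirun ./yield.calcs.r
parse_alloc mpirun -N 4 ./yield.calcs.r  # -> nodes=1 cmd=mpirun -N 4 ./yield.calcs.r
```

In the second call, -N 4 arrives after the command word and is handed to mpirun rather than the allocator, which keeps its one-node default, matching the "everything lands on node0" symptom above.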
