Please disregard the earlier post.

The issue was caused by the order of the arguments passed to salloc and mpirun.

When I call the job this way, it runs fine:

salloc -N 4 mpirun /home/nodeuser/yield.calcs.r
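For anyone who hits the same thing: salloc stops parsing its own options at the first non-option token, so any flags placed after the command name (mpirun here) get passed through to that command rather than to salloc. A minimal sketch of the difference, assuming a SLURM cluster with an MPI launcher on the PATH (the node count and script path are just placeholders from my setup):

```shell
# Wrong: salloc sees "mpirun" first, so "-N 4" is handed to mpirun,
# not to salloc, and the allocation falls back to the default (one node).
salloc mpirun -N 4 /home/nodeuser/yield.calcs.r

# Right: salloc options come before the command, so four nodes are
# allocated and mpirun launches across all of them.
salloc -N 4 mpirun /home/nodeuser/yield.calcs.r
```

That also explains the earlier symptom below: `salloc mpirun -n 2 hostname` grants a one-node allocation, so both ranks land on node0.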

-Whit


On Fri, Feb 25, 2011 at 3:41 PM, Whit Armstrong
<[email protected]> wrote:
> I've just done some testing of toy jobs on EC2.
>
> However, I've found that unless I specify a list of nodes (with the -w
> option), all the tasks go to the first node.
>
> Can anyone shed some light on this behavior?  Could this be because
> the networking on virtual nodes in EC2 is miserable?
>
> -Whit
>
>
> A couple of examples and the config files:
>
>
> nodeuser@node0:~$ salloc -w node0,node1 mpirun -n 2 hostname
> salloc: Granted job allocation 43
> node0
> node1
> salloc: Relinquishing job allocation 43
> nodeuser@node0:~$ salloc mpirun -n 2 hostname
> salloc: Granted job allocation 44
> node0
> node0
> salloc: Relinquishing job allocation 44
> nodeuser@node0:~$ srun mpirun -n 2 hostname
> node0
> node0
> nodeuser@node0:~$ scontrol show nodes
> NodeName=node0 Arch=x86_64 CoresPerSocket=1
>   CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
>   OS=Linux RealMemory=1 Sockets=1
>   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>   Reason=(null)
>
> NodeName=node1 Arch=x86_64 CoresPerSocket=1
>   CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
>   OS=Linux RealMemory=1 Sockets=1
>   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>   Reason=(null)
>
> nodeuser@node0:~$
>
>
> nodeuser@node0:~$ cat /etc/slurm-llnl/slurm.conf
> ControlMachine=node0
> AuthType=auth/none
> CacheGroups=0
> CryptoType=crypto/munge
> JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
> JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
> MpiDefault=none
> ProctrackType=proctrack/pgid
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/tmp/slurmd.spool
> SlurmUser=slurm
> StateSaveLocation=/tmp/slurm.state
> SwitchType=switch/none
> TaskPlugin=task/none
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> FastSchedule=1
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/linear
> AccountingStorageType=accounting_storage/none
> ClusterName=cluster
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=6
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=6
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> Include /etc/slurm-llnl/nodes.conf
> nodeuser@node0:~$
> nodeuser@node0:~$ cat /etc/slurm-llnl/nodes.conf
> NodeName=node0
> NodeName=node1
> PartitionName=prod Nodes=node[0-1] Default=YES MaxTime=INFINITE State=UP
> nodeuser@node0:~$
>
