Quoting Greg Wilson <[email protected]>:

> Best of resource management developers,
>
> I have a very strange problem (or at least it seems like one to a novice
> like myself). It's nice that sbatch works, but srun is really useful too,
> so if anyone can figure out how to fix this I'd be very grateful.
>
>
> ### Problem
> All commands work fine (sinfo, squeue, sbatch(!), salloc etc) EXCEPT srun.
> srun hangs/blocks UNLESS the job happens to get allocated on the same node
> on which the srun was issued - then it works. Below I have attached log
> level 9 output and config.
>
> ### Suspicion
> I don't know much about the inner workings of Slurm, but I suspect that
> when a job submitted by srun is starting up on a node, it tries to initiate
> a network connection back to the node where the srun is waiting - possibly
> to feed back stdout/stderr data?

That is correct.
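
For what it's worth, you can see those sockets directly: while an srun is
hanging, something along the following lines on the node where it was
launched will show the ephemeral ports it is listening on (the ss/lsof
invocations are just an illustration, nothing Slurm-specific):

    # List the TCP ports the waiting srun process is listening on
    ss -tlnp | grep srun

    # Equivalent check with lsof
    lsof -iTCP -sTCP:LISTEN -a -c srun

Those are the ports the remote end of the job step tries to connect back
to in order to deliver stdout/stderr.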


> For some reason, that network connection
> won't target port 6817 or 6818, and all other ports are of course
> firewalled. That could explain the hang. If this hypothesis is correct,
> does it mean srun can only be used on internal networks with open
> firewalls on the machines?

srun opens several random ports for communications (the number of ports
varies with the job size). This has been raised as an issue previously,
and a configurable port range for srun to use would be relatively simple
to add, but it does not exist today. Even if it did, you would probably
need to open up tens to hundreds of ports, depending upon how many sruns
are being executed at the same time.


> This seems unlikely, so I must have made a mistake somewhere.
>
> ### Setup
> * machines: I have two Amazon EC2 machines running Ubuntu 12.10.
> * network: Both have ports 22, 6817 and 6818 open for incoming in the
> security group (and Ubuntu's own iptables allowing everything). I have
> verified that all nodes can resolve all other nodes, and all open ports can
> be reached from all nodes. I have entered NodeAddr info, as well as made
> entries in /etc/hosts.
> * slurm: first I tried the version in the standard Ubuntu repos, then I
> compiled the latest release from the Slurm download site. The problem is
> the same with both versions.
>
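
Just to make sure I am reading the setup right: with the NodeAddr values
from your slurm.conf, the /etc/hosts entries on both machines would
presumably look something like this (a hypothetical reconstruction, adjust
if yours differ):

    54.228.34.32    ip-10-39-4-58
    54.247.137.41   ip-10-39-63-199
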
> ### slurmd.log output on receiving node when srun is issued on the other
> node:
>    http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/799179263237/
> Highlights:
> [2013-03-07T20:19:33+00:00] [118.0] Error connecting slurm stream socket at
> 54.247.137.41:45562: Connection timed out
> [2013-03-07T20:19:33+00:00] [118.0] connect io: Connection timed out
> [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> [2013-03-07T20:19:33+00:00] [118.0] Leaving  _setup_normal_io
> [2013-03-07T20:19:33+00:00] [118.0] IO setup failed: Connection timed out
>
> ### slurmctld.log
> http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/660589756816/
>
> ### slurm.conf
> ControlMachine=ip-10-39-4-58
> ControlAddr=54.228.34.32
> AuthType=auth/munge
> CacheGroups=0
> CryptoType=crypto/munge
> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> TaskPluginParam=None
> TreeWidth=1024
> UsePAM=0
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> FastSchedule=1
> SchedulerType=sched/builtin
> SelectType=select/cons_res
> ClusterName=cluster
> DebugFlags=Steps,FrontEnd,Gres,Priority,SelectType,Steps
> JobCompLoc=/tmp/jobcomplete.txt
> JobCompType=jobcomp/filetxt
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=9
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=9
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> NodeName=ip-10-39-4-58 NodeAddr=54.228.34.32 Procs=1 State=UNKNOWN
> NodeName=ip-10-39-63-199 NodeAddr=54.247.137.41 Procs=1 State=UNKNOWN
> PartitionName=testing Nodes=ip-10-39-4-58,ip-10-39-63-199 Default=YES MaxTime=INFINITE State=UP
>
> Thanks for any help. /Greg
>
