Thank you very much for clarifying and the quick response!

Best regards,
Greg
On Thu, Mar 7, 2013 at 10:27 PM, Moe Jette <[email protected]> wrote:
>
> Quoting Greg Wilson <[email protected]>:
>
> > Best of resource management developers,
> >
> > I have a very strange problem (or at least it seems like one to a novice
> > like myself). It's nice that sbatch works, but srun is really useful too,
> > so if anyone can figure out how to fix this I'd be very grateful.
> >
> > ### Problem
> > All commands work fine (sinfo, squeue, sbatch(!), salloc etc.) EXCEPT srun.
> > srun hangs/blocks UNLESS the job happens to get allocated on the same node
> > on which the srun was issued - then it works. Below I have attached log
> > level 9 output and config.
> >
> > ### Suspicion
> > I don't know much about the inner workings of Slurm, but I suspect that
> > when a job submitted by srun is starting up on a node, it tries to initiate
> > a network connection back to the node where the srun is waiting - possibly
> > to feed back stdout/stderr data?
>
> That is correct.
>
> > For some reason, that network connection won't target 6817 or 6818, and
> > all other ports are of course firewalled. That could explain the block.
> > If this hypothesis is correct, does it mean srun can only be used on
> > internal networks with open firewalls on the machines?
>
> Srun opens several random port numbers for communications (the number
> of ports varies by the job size). This has been raised as an issue
> previously, and adding a configurable port range for srun to use would
> be relatively simple to add, but that does not exist today. If this
> were done, you would probably need to open up tens to hundreds of
> ports depending upon how many sruns are being executed at the same time.
>
> > This seems unlikely; I must have made a mistake somewhere.
> >
> > ### Setup
> > * machines: I have two Amazon EC2 machines running Ubuntu 12.10.
> > * network: Both have ports 22, 6817 and 6818 open for incoming traffic in
> > the security group (and Ubuntu's own iptables allowing everything). I have
> > verified that all nodes can resolve all other nodes, and all open ports can
> > be reached from all nodes. I have entered NodeAddr info, as well as made
> > entries in /etc/hosts.
> > * slurm: first I tried the version in the standard Ubuntu repos, then I
> > compiled the latest from the Slurm download site. The problem remains
> > between these versions.
> > ### slurmd.log output on receiving node when srun is issued on the other node:
> > http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/916234438360/
> > Highlights:
> > [2013-03-07T20:19:33+00:00] [118.0] Error connecting slurm stream socket at 54.247.137.41:45562: Connection timed out
> > [2013-03-07T20:19:33+00:00] [118.0] connect io: Connection timed out
> > [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> > [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> > [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> > [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> > [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> > [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> > [2013-03-07T20:19:33+00:00] [118.0] Leaving _setup_normal_io
> > [2013-03-07T20:19:33+00:00] [118.0] IO setup failed: Connection timed out
> >
> > ### slurmctld.log
> > http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/182487506931/
> >
> > ### slurm.conf
> > ControlMachine=ip-10-39-4-58
> > ControlAddr=54.228.34.32
> > AuthType=auth/munge
> > CacheGroups=0
> > CryptoType=crypto/munge
> > JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> > MpiDefault=none
> > ProctrackType=proctrack/linuxproc
> > ReturnToService=1
> > SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> > SlurmctldPort=6817
> > SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> > SlurmdPort=6818
> > SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> > SlurmUser=slurm
> > StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> > SwitchType=switch/none
> > TaskPlugin=task/none
> > TaskPluginParam=None
> > TreeWidth=1024
> > UsePAM=0
> > InactiveLimit=0
> > KillWait=30
> > MinJobAge=300
> > SlurmctldTimeout=120
> > SlurmdTimeout=300
> > Waittime=0
> > FastSchedule=1
> > SchedulerType=sched/builtin
> > SelectType=select/cons_res
> > ClusterName=cluster
> > DebugFlags=Steps,FrontEnd,Gres,Priority,SelectType,Steps
> > JobCompLoc=/tmp/jobcomplete.txt
> > JobCompType=jobcomp/filetxt
> > JobAcctGatherType=jobacct_gather/none
> > SlurmctldDebug=9
> > SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> > SlurmdDebug=9
> > SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> > NodeName=ip-10-39-4-58 NodeAddr=54.228.34.32 Procs=1 State=UNKNOWN
> > NodeName=ip-10-39-63-199 NodeAddr=54.247.137.41 Procs=1 State=UNKNOWN
> > PartitionName=testing Nodes=ip-10-39-4-58,ip-10-39-63-199 Default=YES MaxTime=INFINITE State=UP
> >
> > Thanks for any help. /Greg
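
To confirm that it is the EC2 security group, rather than Slurm itself, that blocks the I/O callback described above, one can test an arbitrary high port between the two instances by hand. A minimal sketch, assuming the OpenBSD netcat that ships with Ubuntu 12.10 and reusing the addresses and the port 45562 that appear in the log and slurm.conf above (any unprivileged port should behave the same way):

    # On the node where srun is waiting (54.247.137.41 in the log above),
    # listen on a high port; with traditional netcat the syntax is "nc -l -p 45562".
    nc -l 45562

    # On the node where the job step runs, try to reach that port (5 s timeout):
    nc -vz -w 5 54.247.137.41 45562

If the second command times out while ports 22, 6817 and 6818 connect fine, the security group is dropping exactly the kind of connection slurmstepd makes back to srun, which matches the "IO setup failed: Connection timed out" lines above.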

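Since the Slurm version discussed here has no configurable port range for srun, a practical workaround on EC2 is to open the whole unprivileged TCP range between the two instances inside the security group, rather than trying to guess which ports srun will pick, while still exposing only 22/6817/6818 externally. A sketch using the AWS CLI; the group name "slurm-cluster" is only a placeholder, and the same self-referencing rule can be added in the EC2 console or with the older ec2-api-tools:

    # Allow members of the security group to reach each other on any
    # unprivileged TCP port, so srun's randomly chosen callback ports
    # get through without opening anything to the outside world.
    aws ec2 authorize-security-group-ingress \
        --group-name slurm-cluster \
        --protocol tcp \
        --port 1024-65535 \
        --source-group slurm-cluster

Because the rule names the group itself as the source, it scales to more nodes and to the "tens to hundreds of ports" mentioned above without further changes.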