Thank you very much for clarifying and the quick response!

Best regards,
Greg
On Thu, Mar 7, 2013 at 10:27 PM, Moe Jette <[email protected]> wrote:
>
> Quoting Greg Wilson <[email protected]>:
>
> > Best of resource management developers,
> >
> > I have a very strange problem (or at least it seems like one to a novice
> > like myself). It's nice that sbatch works, but srun is really useful too,
> > so if anyone can figure out how to fix this I'd be very grateful.
> >
> > ### Problem
> > All commands work fine (sinfo, squeue, sbatch(!), salloc etc.) EXCEPT srun.
> > srun hangs/blocks UNLESS the job happens to get allocated on the same node
> > on which the srun was issued - then it works. Below I have attached log
> > level 9 output and config.
> >
> > ### Suspicion
> > I don't know much about the inner workings of Slurm, but I suspect that
> > when a job submitted by srun is starting up on a node, it tries to initiate
> > a network connection back to the node where the srun is waiting - possibly
> > to feed back stdout/stderr data?
>
> That is correct.
>
> > For some reason, that network connection won't target 6817 or 6818, and
> > all other ports are of course firewalled. That could explain the block.
> > If this hypothesis is correct, does it mean srun can only be used on
> > internal networks with open firewalls on the machines?
>
> Srun opens several random port numbers for communications (the number
> of ports varies by the job size). This has been raised as an issue
> previously, and adding a configurable port range for srun to use would
> be relatively simple to add, but that does not exist today. If this
> were done, you would probably need to open up tens to hundreds of
> ports depending upon how many sruns are being executed at the same time.
>
> > This seems unlikely; I must have made a mistake somewhere.
> >
> > ### Setup
> > * machines: I have two Amazon EC2 machines running Ubuntu 12.10.
> > * network: Both have ports 22, 6817 and 6818 open for incoming traffic in
> > the security group (and Ubuntu's own iptables allowing everything). I have
> > verified that all nodes can resolve all other nodes, and all open ports can
> > be reached from all nodes. I have entered NodeAddr info, as well as made
> > entries in /etc/hosts.
> > * slurm: first I tried the version in the standard Ubuntu repos, then I
> > compiled the latest from the Slurm download site. The problem remains
> > between these versions.
> > ### slurmd.log output on receiving node when srun is issued on the other node:
> > http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/916234438360/
> > Highlights:
> > [2013-03-07T20:19:33+00:00] [118.0] Error connecting slurm stream socket at 54.247.137.41:45562: Connection timed out
> > [2013-03-07T20:19:33+00:00] [118.0] connect io: Connection timed out
> > [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> > [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> > [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> > [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> > [2013-03-07T20:19:33+00:00] [118.0] eio: handling events for 1 objects
> > [2013-03-07T20:19:33+00:00] [118.0] Called _msg_socket_readable
> > [2013-03-07T20:19:33+00:00] [118.0] Leaving _setup_normal_io
> > [2013-03-07T20:19:33+00:00] [118.0] IO setup failed: Connection timed out
> >
> > ### slurmctld.log
> > http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/182487506931/
> >
> > ### slurm.conf
> > ControlMachine=ip-10-39-4-58
> > ControlAddr=54.228.34.32
> > AuthType=auth/munge
> > CacheGroups=0
> > CryptoType=crypto/munge
> > JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> > MpiDefault=none
> > ProctrackType=proctrack/linuxproc
> > ReturnToService=1
> > SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> > SlurmctldPort=6817
> > SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> > SlurmdPort=6818
> > SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> > SlurmUser=slurm
> > StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> > SwitchType=switch/none
> > TaskPlugin=task/none
> > TaskPluginParam=None
> > TreeWidth=1024
> > UsePAM=0
> > InactiveLimit=0
> > KillWait=30
> > MinJobAge=300
> > SlurmctldTimeout=120
> > SlurmdTimeout=300
> > Waittime=0
> > FastSchedule=1
> > SchedulerType=sched/builtin
> > SelectType=select/cons_res
> > ClusterName=cluster
> > DebugFlags=Steps,FrontEnd,Gres,Priority,SelectType,Steps
> > JobCompLoc=/tmp/jobcomplete.txt
> > JobCompType=jobcomp/filetxt
> > JobAcctGatherType=jobacct_gather/none
> > SlurmctldDebug=9
> > SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> > SlurmdDebug=9
> > SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> > NodeName=ip-10-39-4-58 NodeAddr=54.228.34.32 Procs=1 State=UNKNOWN
> > NodeName=ip-10-39-63-199 NodeAddr=54.247.137.41 Procs=1 State=UNKNOWN
> > PartitionName=testing Nodes=ip-10-39-4-58,ip-10-39-63-199 Default=YES MaxTime=INFINITE State=UP
> >
> > Thanks for any help. /Greg
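
To confirm that it is the EC2 security group, rather than Slurm itself, that blocks the I/O callback described above, one can test an arbitrary high port between the two instances by hand. A minimal sketch, assuming the OpenBSD netcat that ships with Ubuntu 12.10 and reusing the addresses and the port 45562 that appear in the log and slurm.conf above (any unprivileged port should behave the same way):

    # On the node where srun is waiting (54.247.137.41 in the log above),
    # listen on a high port; with traditional netcat the syntax is "nc -l -p 45562".
    nc -l 45562

    # On the node where the job step runs, try to reach that port (5 s timeout):
    nc -vz -w 5 54.247.137.41 45562

If the second command times out while ports 22, 6817 and 6818 connect fine, the security group is dropping exactly the kind of connection slurmstepd makes back to srun, which matches the "IO setup failed: Connection timed out" lines above.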

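Since the Slurm version discussed here has no configurable port range for srun, a practical workaround on EC2 is to open the whole unprivileged TCP range between the two instances inside the security group, rather than trying to guess which ports srun will pick, while still exposing only 22/6817/6818 externally. A sketch using the AWS CLI; the group name "slurm-cluster" is only a placeholder, and the same self-referencing rule can be added in the EC2 console or with the older ec2-api-tools:

    # Allow members of the security group to reach each other on any
    # unprivileged TCP port, so srun's randomly chosen callback ports
    # get through without opening anything to the outside world.
    aws ec2 authorize-security-group-ingress \
        --group-name slurm-cluster \
        --protocol tcp \
        --port 1024-65535 \
        --source-group slurm-cluster

Because the rule names the group itself as the source, it scales to more nodes and to the "tens to hundreds of ports" mentioned above without further changes.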