[slurm-dev] Re: srun not executing command on node

Trey Dockendorf Thu, 10 Jul 2014 10:55:24 -0700

Uwe,

I ran into something similar.  Our submit host (login node) could not execute 
srun commands; they would simply hang.  I found that it was due to my iptables 
rules being "too strict".  I had to allow all incoming traffic from our private 
network to the login node.  Once that was done, all srun commands began working. 
My controller host is also on this private network, so I'm not sure whether it 
was the compute nodes or the slurm controller that needed access through the 
firewall.  If you're attempting to set up iptables as restrictively as 
possible, start by allowing incoming traffic from the controller and see if 
that fixes the issue.


You can use something like the following to log all iptables DROPs which can be 
viewed in dmesg:

<ACCEPT rules>
-A INPUT -m limit --limit 2/min -j LOG --log-prefix "IPTABLES: INPUT DROPPED " --log-level 7
-A INPUT -j DROP
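
As a rough sketch of what the <ACCEPT rules> placeholder might contain when 
starting from the narrower setup, assuming the controller sits at 10.0.0.1 and 
the private network is 10.0.0.0/24 (both addresses are made up for 
illustration, so substitute your own):

```
# Hypothetical addresses: 10.0.0.1 = slurm controller, 10.0.0.0/24 = private network
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# First attempt: allow incoming traffic from the controller only
-A INPUT -s 10.0.0.1 -j ACCEPT
# If srun still hangs, widen to the whole private network (what worked for me)
#-A INPUT -s 10.0.0.0/24 -j ACCEPT
```

These lines go ahead of the LOG and DROP rules.  After reloading the rules, 
rerun srun and look in dmesg for lines carrying the "IPTABLES: INPUT DROPPED" 
prefix; the SRC= and DPT= fields in each logged line show which host and port 
were still being blocked.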

- Trey

=============================

Trey Dockendorf 
Systems Analyst I 
Texas A&M University 
Academy for Advanced Telecommunications and Learning Technologies 
Phone: (979)458-2396 
Email: [email protected] 
Jabber: [email protected]

----- Original Message -----
> From: "Uwe Sauter" <[email protected]>
> To: "slurm-dev" <[email protected]>
> Sent: Thursday, July 10, 2014 9:41:15 AM
> Subject: [slurm-dev] srun not executing command on node
> 
> 
> Hi all,
> 
> I have a cluster which is set up as follows
> 
> hostname      description
> cl4intern     admin server, running slurmctld and slurmdbd
> cl4fr1        frontend, not running any slurm service but has slurm installed
> n01 -- n54    compute nodes
> 
> slurm.conf is shared on all hosts via NFS. cl4fr1 does not appear in
> slurm.conf. It should only act as submit node.
> 
> Queuing jobs from cl4fr1 using sbatch works as expected. But if I
> run "srun -N1 -n1 -p mypartition hostname" I do not get the output of
> "hostname" but just a terminal window that is waiting for something
> to happen.
> 
>  > srun -N1 -n1 -p mypartition hostname
> 
> With "scontrol show jobs" I can see that a node was allocated but
> this is just sitting around, doing nothing.
> 
> JobId=2681 Name=df
>     UserId=myuser(15001) GroupId=mygroup(15000)
>     Priority=4294901684 Nice=0 Account=(null) QOS=normal
>     JobState=RUNNING Reason=None Dependency=(null)
>     Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
>     RunTime=00:00:04 TimeLimit=UNLIMITED TimeMin=N/A
>     SubmitTime=2014-07-10T16:30:26 EligibleTime=2014-07-10T16:30:26
>     StartTime=2014-07-10T16:30:26 EndTime=Unknown
>     PreemptTime=None SuspendTime=None SecsPreSuspend=0
>     Partition=mypartition AllocNode:Sid=cl4fr1:18795
>     ReqNodeList=(null) ExcNodeList=(null)
>     NodeList=n52
>     BatchHost=n52
>     NumNodes=1 NumCPUs=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>     Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
>     MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>     Features=(null) Gres=(null) Reservation=(null)
>     Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>     Command=/bin/hostname
>     WorkDir=/home/myuser
> 
> I can cancel this job with Ctrl+C without problem:
> 
> ^Csrun: interrupt (one more within 1 sec to abort)
> srun: task 0: unknown
> ^Csrun: sending Ctrl-C to job 2681.0
> srun: Job step 2681.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
> 
> A combination of salloc + env | grep SLURM_JOB_NODELIST + ssh to node
> + srun -p mypartition -N1 -n1 hostname does execute, but on a
> different node.
> 
> Can someone point me in a direction where to investigate further?
> Did you already have this issue? Is it even possible to have a
> submit node that does not appear in the SLURM configuration?
> 
> 
> Thanks in advance,
> 
>      Uwe Sauter
>
