Hi all,

I have a cluster which is set up as follows:

hostname      description
cl4intern     admin server, running slurmctld and slurmdbd
cl4fr1        frontend, not running any Slurm service but has Slurm installed
n01 -- n54    compute nodes

slurm.conf is shared on all hosts via NFS. cl4fr1 does not appear in slurm.conf; it should act only as a submit node.
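For reference, the relevant parts of our slurm.conf look roughly like this (an anonymized, abridged sketch -- node range and CPU count taken from the job output below, everything else trimmed):

    ControlMachine=cl4intern
    NodeName=n[01-54] CPUs=40 State=UNKNOWN
    PartitionName=mypartition Nodes=n[01-54] State=UP

As said, there is no NodeName or similar entry for cl4fr1 anywhere in the file.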

Queuing jobs from cl4fr1 using sbatch works as expected. But if I run "srun -N1 -n1 -p mypartition hostname", I do not get the output of "hostname" -- just a terminal window waiting for something to happen.

> srun -N1 -n1 -p mypartition hostname

With "scontrol show jobs" I can see that a node was allocated, but the job just sits there doing nothing:

JobId=2681 Name=df
   UserId=myuser(15001) GroupId=mygroup(15000)
   Priority=4294901684 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:04 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2014-07-10T16:30:26 EligibleTime=2014-07-10T16:30:26
   StartTime=2014-07-10T16:30:26 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=mypartition AllocNode:Sid=cl4fr1:18795
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n52
   BatchHost=n52
   NumNodes=1 NumCPUs=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/hostname
   WorkDir=/home/myuser

I can cancel this job with Ctrl+C without problems:

^Csrun: interrupt (one more within 1 sec to abort)
srun: task 0: unknown
^Csrun: sending Ctrl-C to job 2681.0
srun: Job step 2681.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

The following combination does execute, but on a different node than the one originally allocated:

1. salloc
2. env | grep SLURM_JOB_NODELIST
3. ssh to the allocated node
4. srun -p mypartition -N1 -n1 hostname
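Since srun needs the compute nodes to open connections back to the submitting host, one thing I still want to verify is whether the nodes can actually resolve and reach cl4fr1. A quick sketch of such a check (hostnames and the port are just examples -- srun itself listens on ephemeral ports, so this only tests basic reachability):

```python
import socket

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# Hypothetical checks, to be run from a compute node:
# 6817 is Slurm's default SlurmctldPort; the second call stands in for a
# node connecting back to the submit host on some port.
print(reachable("cl4intern", 6817))
print(reachable("cl4fr1", 6817))
```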

Can someone point me in a direction for further investigation? Has anyone seen this issue before? Is it even possible to have a submit node that does not appear in the Slurm configuration?


Thanks in advance,

    Uwe Sauter
