Hi all again. The initial problems are fixed, but there are more - thanks in advance for reading this!
This is Ubuntu 14.04 with Slurm 2.6.5-1. The nodes and the controller seem to be talking to each other: scontrol ping returns

    Slurmctld(primary/backup) at slurm-master/(NULL) are UP/DOWN

and I get the exact same message from each slave. (I haven't configured a backup controller, which I assume explains the (NULL)/DOWN half.)

On one of the slaves, node0, I start the daemon with

    sudo slurmd -Dvvv

which returns

    slurmd: topology NONE plugin loaded
    slurmd: CPU frequency setting not configured for this node
    slurmd: task NONE plugin loaded
    slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
    slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
    slurmd: Munge cryptographic signature plugin loaded
    slurmd: Warning: Core limit is only 0 KB
    slurmd: slurmd version 2.6.5 started
    slurmd: Job accounting gather NOT_INVOKED plugin loaded
    slurmd: switch NONE plugin loaded
    slurmd: slurmd started on Wed, 31 Aug 2016 02:48:03 +0000
    slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=64431 TmpDisk=10042 Uptime=77963
    slurmd: AcctGatherEnergy NONE plugin loaded
    slurmd: AcctGatherProfile NONE plugin loaded
    slurmd: AcctGatherInfiniband NONE plugin loaded
    slurmd: AcctGatherFilesystem NONE plugin loaded
    slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)

All of which seems fine? sinfo on any machine returns

    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    standard*    up   infinite      1  down*  node2
    standard*    up   infinite      2  idle   node[0-1]

node2 isn't connected yet, so I'm not concerned about that.
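For completeness, the relevant parts of my slurm.conf are along these lines (retyped as a sketch rather than pasted verbatim, so treat the exact values as approximate; the node line matches what slurmd detects above, and the ports are the stock defaults):

    # /etc/slurm-llnl/slurm.conf (sketch, not a verbatim copy)
    ControlMachine=slurm-master
    AuthType=auth/munge
    SlurmctldPort=6817     # default controller port
    SlurmdPort=6818        # default slurmd port
    NodeName=node[0-2] CPUs=16 RealMemory=64431 State=UNKNOWN
    PartitionName=standard Nodes=node[0-2] Default=YES MaxTime=INFINITE State=UP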
When I run a job,

    srun -N1 /bin/hostname

the master's debug output says

    slurmctld: debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=1000
    slurmctld: debug: job_submit_plugin_submit: usec=1
    slurmctld: debug2: found 3 usable nodes from config containing node[0-2]
    slurmctld: debug2: sched: JobId=7 allocated resources: NodeList=node0
    slurmctld: sched: _slurm_rpc_allocate_resources JobId=7 NodeList=node0 usec=3578
    slurmctld: debug2: _slurm_rpc_job_ready(7)=3 usec=14
    slurmctld: debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from uid=1000
    slurmctld: debug: Configuration for job 7 complete
    slurmctld: debug: laying out the 1 tasks on 1 hosts node0 dist 1
    slurmctld: sched: _slurm_rpc_job_step_create: StepId=7.0 node0 usec=6717

and node0's says

    slurmd: debug2: got this type of message 1008
    slurmd: debug2: got this type of message 6001
    slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
    slurmd: debug: task_slurmd_launch_request: 7.0 0
    slurmd: launch task 7.0 request from 1000.1000@144.6.230.71 (port 12449)
    slurmd: debug: Checking credential with 256 bytes of sig data
    slurmd: debug: Calling /usr/sbin/slurmstepd spank prolog
    spank-prolog: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
    spank-prolog: Running spank/prolog for jobid [7] uid [1000]
    spank-prolog: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
    slurmd: debug: task_slurmd_reserve_resources: 7 0

The sinfo changes to allocated:

    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    standard*    up   infinite      1  down*  node2
    standard*    up   infinite      1  alloc  node0
    standard*    up   infinite      1  idle   node1

And scontrol show job 7 returns:

    JobId=7 Name=hostname
    UserId=ubuntu(1000) GroupId=ubuntu(1000)
    Priority=4294901754 Account=(null) QOS=(null)
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
    RunTime=00:01:29 TimeLimit=UNLIMITED TimeMin=N/A
    SubmitTime=2016-08-31T02:51:06 EligibleTime=2016-08-31T02:51:06
    StartTime=2016-08-31T02:51:06 EndTime=Unknown
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    Partition=standard AllocNode:Sid=slurm-master:3410
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=node0
    BatchHost=node0
    NumNodes=1 NumCPUs=16 CPUs/Task=1 ReqS:C:T=*:*:*
    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
    Features=(null) Gres=(null) Reservation=(null)
    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
    Command=/bin/hostname
    WorkDir=/home/ubuntu

But the job doesn't actually run - the hostname never comes back, and the job just sits in RUNNING (note the RunTime of 00:01:29 on a job that should finish instantly). Any help is appreciated greatly.

James Venning
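P.S. In case munge is a suspect: I assume it's OK, since the credential check in the node0 log above passes, but for reference the cross-node sanity test I'd use is something like this (node0 standing in for any slave):

    # on the master: encode a credential and decode it locally
    munge -n | unmunge
    # encode on the master, decode on a compute node (keys and clocks must agree)
    munge -n | ssh node0 unmunge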