It looks like the slurmd daemons are not running on your compute nodes. Run "sinfo" to check the state of the nodes.
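A rough sketch of that check, in case it helps: the snippet below filters node-oriented sinfo output for nodes that look unusable. The helper name `unavailable_nodes` is made up for illustration, and the state prefixes it matches are an assumption about what your Slurm version prints, so adjust them as needed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads "NODE STATE" pairs on stdin
# and prints the nodes whose state suggests they cannot accept jobs.
# A trailing "*" on the state marks a node that is not responding.
unavailable_nodes() {
    awk '$2 ~ /^(down|drain|drng|fail|unk)/ || $2 ~ /\*$/ { print $1, $2 }'
}

# Only query Slurm when the client tools are installed on this machine.
if command -v sinfo >/dev/null 2>&1; then
    # -N: one line per node, -h: no header, -o: node name and state
    sinfo -N -h -o "%N %T" | unavailable_nodes
fi
```

On any node this lists, check that slurmd is actually running and start it if it is not. "scontrol show node <name>" should also show the reason recorded for a down or drained node.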

David Race <[email protected]> wrote:
>Hello -
>
>I am sure this is simple, but I cannot get an srun job to start:
>
>Here is my config file:
>
>#
># Example slurm.conf file. Please run configurator.html
># (in doc/html) to build a configuration file customized
># for your environment.
>#
>#
># slurm.conf file generated by configurator.html.
>#
># See the slurm.conf man page for more information.
>#
>ClusterName=Prod
>ControlMachine=tx321
>#ControlAddr=
>#BackupController=
>#BackupAddr=
>#
>SlurmUser=slurm
>#SlurmdUser=root
>SlurmctldPort=6817
>SlurmdPort=6818
>AuthType=auth/munge
>#JobCredentialPrivateKey=
>#JobCredentialPublicCertificate=
>StateSaveLocation=/tmp
>SlurmdSpoolDir=/tmp/slurmd
>SwitchType=switch/none
>MpiDefault=none
>SlurmctldPidFile=/var/run/slurmctld.pid
>SlurmdPidFile=/var/run/slurmd.pid
>ProctrackType=proctrack/pgid
>#PluginDir=
>CacheGroups=0
>#FirstJobId=
>ReturnToService=2
>#MaxJobCount=
>#PlugStackConfig=
>#PropagatePrioProcess=
>#PropagateResourceLimits=
>#PropagateResourceLimitsExcept=
>#Prolog=
>#Epilog=
>#SrunProlog=
>#SrunEpilog=
>#TaskProlog=
>#TaskEpilog=
>#TaskPlugin=
>#TrackWCKey=no
>#TreeWidth=50
>#TmpFs=
>#UsePAM=
>#
># TIMERS
>SlurmctldTimeout=300
>SlurmdTimeout=300
>InactiveLimit=0
>MinJobAge=300
>KillWait=30
>Waittime=0
>#
># SCHEDULING
>SchedulerType=sched/builtin
>#SchedulerAuth=
>#SchedulerPort=
>#SchedulerRootFilter=
>SelectType=select/linear
>FastSchedule=1
>#PriorityType=priority/multifactor
>#PriorityDecayHalfLife=14-0
>#PriorityUsageResetPeriod=14-0
>#PriorityWeightFairshare=100000
>#PriorityWeightAge=1000
>#PriorityWeightPartition=10000
>#PriorityWeightJobSize=1000
>#PriorityMaxAge=1-0
>#
># LOGGING
>SlurmctldDebug=3
>#SlurmctldLogFile=
>SlurmdDebug=3
>#SlurmdLogFile=
>JobCompType=jobcomp/none
>#JobCompLoc=
>#
># ACCOUNTING
>#JobAcctGatherType=jobacct_gather/linux
>#JobAcctGatherFrequency=30
>#
>#AccountingStorageType=accounting_storage/slurmdbd
>#AccountingStorageHost=
>#AccountingStorageLoc=
>#AccountingStoragePass=
>#AccountingStorageUser=
>#
># COMPUTE NODES
>NodeName=prod-000[1-4] Procs=8 State=UNKNOWN
>PartitionName=all.q Nodes=prod-000[1-4] Default=YES MaxTime=INFINITE
>State=UP
>
>=============END CONFIG===================================
>
>I attempt to run srun with
>
>srun -n 8 -vvvvvv date
>
>and I get:
>srun: defined options for program `srun'
>srun: --------------- ---------------------
>srun: user : `root'
>srun: uid : 0
>srun: gid : 0
>srun: cwd : /home/drace/QueueOptions/ACE/ace/daemons
>srun: ntasks : 8 (set)
>srun: nodes : 1 (default)
>srun: jobid : 4294967294 (default)
>srun: partition : default
>srun: job name : `(null)'
>srun: reservation : `(null)'
>srun: wckey : `(null)'
>srun: switches : -1
>srun: wait-for-switches : -1
>srun: distribution : unknown
>srun: cpu_bind : default
>srun: mem_bind : default
>srun: cpu_freq : 4294967294
>srun: verbose : 5
>srun: slurmd_debug : 0
>srun: immediate : false
>srun: label output : false
>srun: unbuffered IO : false
>srun: overcommit : false
>srun: threads : 60
>srun: checkpoint_dir : /home/drace/QueueOptions/ACE/ace/daemons
>srun: wait : 0
>srun: account : (null)
>srun: comment : (null)
>srun: dependency : (null)
>srun: exclusive : false
>srun: qos : (null)
>srun: constraints : tmp-per-node=4294967294
>srun: geometry : (null)
>srun: reboot : yes
>srun: rotate : no
>srun: preserve_env : false
>srun: network : (null)
>srun: propagate : NONE
>srun: prolog : (null)
>srun: epilog : (null)
>srun: mail_type : NONE
>srun: mail_user : (null)
>srun: task_prolog : (null)
>srun: task_epilog : (null)
>srun: multi_prog : no
>srun: sockets-per-node : -2
>srun: cores-per-socket : -2
>srun: threads-per-core : -2
>srun: ntasks-per-node : -2
>srun: ntasks-per-socket : -2
>srun: ntasks-per-core : -2
>srun: plane_size : 4294967294
>srun: remote command : `/bin/date'
>srun: debug: propagating RLIMIT_CPU=18446744073709551615
>srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
>srun: debug: propagating RLIMIT_DATA=18446744073709551615
>srun: debug: propagating RLIMIT_STACK=10485760
>srun: debug: propagating RLIMIT_CORE=0
>srun: debug: propagating RLIMIT_RSS=18446744073709551615
>srun: debug: propagating RLIMIT_NPROC=131072
>srun: debug: propagating RLIMIT_NOFILE=327680
>srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
>srun: debug: propagating RLIMIT_AS=18446744073709551615
>srun: debug: propagating SLURM_PRIO_PROCESS=0
>srun: debug: propagating UMASK=0022
>srun: debug2: srun PMI messages to port=43211
>srun: debug: Entering slurm_allocation_msg_thr_create()
>srun: debug: port from net_stream_listen is 34042
>srun: debug: Entering _msg_thr_internal
>srun: debug4: eio: handling events for 1 objects
>srun: debug3: Called eio_message_socket_readable 0 4
>srun: debug3: Trying to load plugin /slurm/lib/slurm/auth_munge.so
>srun: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>srun: debug3: Success.
>srun: Required node not available (down or drained)
>srun: debug2: Pending job allocation 33
>srun: job 33 queued and waiting for resources
>
>
>Any ideas?
>
>Thanks
>
>David Race

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
