Mr. Jette -
Thanks. This was pilot error. I had some changes to my /home, so my
slurm user disappeared.
DMR
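
A quick way to catch this kind of failure is to confirm that the account named by SlurmUser in slurm.conf still exists on the host. The snippet below is only a sketch: the helper names slurm_user_from_conf and account_exists are made up for illustration, and it assumes accounts are resolved through the normal passwd database. It falls back to "root" when SlurmUser is unset, which is the documented default.

```python
import pwd
import re


def slurm_user_from_conf(conf_text):
    """Return the SlurmUser value found in slurm.conf text.

    Falls back to "root" (the documented default) when the
    parameter is absent or commented out.
    """
    for line in conf_text.splitlines():
        # Drop comments so "#SlurmUser=..." is ignored.
        line = line.split("#", 1)[0].strip()
        match = re.match(r"SlurmUser\s*=\s*(\S+)", line, re.IGNORECASE)
        if match:
            return match.group(1)
    return "root"


def account_exists(name):
    """Return True if the passwd database has an entry for name."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False
```

For example, reading the configuration file into a string and calling account_exists(slurm_user_from_conf(text)) on the controller and on each compute node would show whether the account is missing anywhere.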
On 5/24/2013 6:32 PM, Morris Jette wrote:
It looks like the slurmd daemons are not running on your compute nodes. Run "sinfo".
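
To make the sinfo check repeatable, the output can be filtered for nodes that are down, drained, or not responding. This is a rough sketch, not part of Slurm: unavailable_nodes is a hypothetical helper, and it assumes the text comes from `sinfo -h -N -o "%N %T"` (one "node state" pair per line, where a trailing "*" on the state marks a node that is not responding).

```python
def unavailable_nodes(sinfo_text):
    """Map node name -> state for nodes that cannot accept jobs.

    Expects the output of: sinfo -h -N -o "%N %T"
    A node listed in several partitions appears once in the result.
    """
    bad = {}
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        lowered = state.lower()
        # "down", "drained" and "draining" block scheduling; a trailing
        # "*" means the node is not responding to the controller.
        if lowered.startswith(("down", "drain")) or state.endswith("*"):
            bad[node] = state
    return bad
```

Any node reported here is a candidate for the "Required node not available (down or drained)" message; the next step would be checking slurmd on that node and its log.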
David Race wrote:
Hello -
I am sure this is simple, but I cannot get an srun job to start.
Here is my config file:
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=Prod
ControlMachine=tx321
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/builtin
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
NodeName=prod-000[1-4] Procs=8 State=UNKNOWN
PartitionName=all.q Nodes=prod-000[1-4] Default=YES MaxTime=INFINITE State=UP
=============END CONFIG===================================
I attempt to run srun with
srun -n 8 -vvvvvv date
and I get:
srun: defined options for program `srun'
srun: --------------- ---------------------
srun: user : `root'
srun: uid : 0
srun: gid : 0
srun: cwd : /home/drace/QueueOptions/ACE/ace/daemons
srun: ntasks : 8 (set)
srun: nodes : 1 (default)
srun: jobid : 4294967294 (default)
srun: partition : default
srun: job name : `(null)'
srun: reservation : `(null)'
srun: wckey : `(null)'
srun: switches : -1
srun: wait-for-switches : -1
srun: distribution : unknown
srun: cpu_bind : default
srun: mem_bind : default
srun: cpu_freq : 4294967294
srun: verbose : 5
srun: slurmd_debug : 0
srun: immediate : false
srun: label output : false
srun: unbuffered IO : false
srun: overcommit : false
srun: threads : 60
srun: checkpoint_dir : /home/drace/QueueOptions/ACE/ace/daemons
srun: wait : 0
srun: account : (null)
srun: comment : (null)
srun: dependency : (null)
srun: exclusive : false
srun: qos : (null)
srun: constraints : tmp-per-node=4294967294
srun: geometry : (null)
srun: reboot : yes
srun: rotate : no
srun: preserve_env : false
srun: network : (null)
srun: propagate : NONE
srun: prolog : (null)
srun: epilog : (null)
srun: mail_type : NONE
srun: mail_user : (null)
srun: task_prolog : (null)
srun: task_epilog : (null)
srun: multi_prog : no
srun: sockets-per-node : -2
srun: cores-per-socket : -2
srun: threads-per-core : -2
srun: ntasks-per-node : -2
srun: ntasks-per-socket : -2
srun: ntasks-per-core : -2
srun: plane_size : 4294967294
srun: remote command : `/bin/date'
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=10485760
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=131072
srun: debug: propagating RLIMIT_NOFILE=327680
srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug2: srun PMI messages to port=43211
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 34042
srun: debug: Entering _msg_thr_internal
srun: debug4: eio: handling events for 1 objects
srun: debug3: Called eio_message_socket_readable 0 4
srun: debug3: Trying to load plugin /slurm/lib/slurm/auth_munge.so
srun: auth plugin for Munge (http://code.google.com/p/munge/) loaded
srun: debug3: Success.
srun: Required node not available (down or drained)
srun: debug2: Pending job allocation 33
srun: job 33 queued and waiting for resources
Any ideas?
Thanks
David Race