It looks like the slurmd daemons are not running on your compute nodes. Run "sinfo" to check the node states.
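
A minimal sketch of that check, assuming a POSIX shell on a machine where sinfo is in the PATH. The function name unavailable_nodes is made up for illustration. It only uses standard sinfo options (-h, -N, -o with the %N and %T fields) and filters for states that usually mean slurmd is not reachable or the node has been taken out of service:

```shell
# List nodes that look unavailable to the scheduler.
# unavailable_nodes is a hypothetical helper name, not part of Slurm.
#   -h          suppress the header line
#   -N          print one line per node
#   -o "%N %T"  print node name and state (long form)
unavailable_nodes() {
    sinfo -h -N -o "%N %T" |
        # Match down, drained/draining, and unknown states, plus any state
        # with a trailing "*", which sinfo uses for non-responding nodes.
        awk '$2 ~ /^(down|drain|unknown)/ || $2 ~ /[*]$/ { print $1, $2 }' |
        sort -u   # a node listed in several partitions is reported once
}
```

Calling unavailable_nodes prints one "node state" line per affected node. If prod-000[1-4] show up there, start slurmd on those nodes and look at the slurmd log for errors. With ReturnToService=2 in your config, a down node should normally return to service once its slurmd registers with a valid configuration. If it does not, "scontrol update NodeName=prod-000[1-4] State=RESUME" is the usual way to clear the state by hand.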

David Race wrote:

>Hello -
>
>I am sure this is simple, but I cannot get an srun job to start:
>
>Here is my config file:
>
>#
># Example slurm.conf file. Please run configurator.html
># (in doc/html) to build a configuration file customized
># for your environment.
>#
>#
># slurm.conf file generated by configurator.html.
>#
># See the slurm.conf man page for more information.
>#
>ClusterName=Prod
>ControlMachine=tx321
>#ControlAddr=
>#BackupController=
>#BackupAddr=
>#
>SlurmUser=slurm
>#SlurmdUser=root
>SlurmctldPort=6817
>SlurmdPort=6818
>AuthType=auth/munge
>#JobCredentialPrivateKey=
>#JobCredentialPublicCertificate=
>StateSaveLocation=/tmp
>SlurmdSpoolDir=/tmp/slurmd
>SwitchType=switch/none
>MpiDefault=none
>SlurmctldPidFile=/var/run/slurmctld.pid
>SlurmdPidFile=/var/run/slurmd.pid
>ProctrackType=proctrack/pgid
>#PluginDir=
>CacheGroups=0
>#FirstJobId=
>ReturnToService=2
>#MaxJobCount=
>#PlugStackConfig=
>#PropagatePrioProcess=
>#PropagateResourceLimits=
>#PropagateResourceLimitsExcept=
>#Prolog=
>#Epilog=
>#SrunProlog=
>#SrunEpilog=
>#TaskProlog=
>#TaskEpilog=
>#TaskPlugin=
>#TrackWCKey=no
>#TreeWidth=50
>#TmpFs=
>#UsePAM=
>#
># TIMERS
>SlurmctldTimeout=300
>SlurmdTimeout=300
>InactiveLimit=0
>MinJobAge=300
>KillWait=30
>Waittime=0
>#
># SCHEDULING
>SchedulerType=sched/builtin
>#SchedulerAuth=
>#SchedulerPort=
>#SchedulerRootFilter=
>SelectType=select/linear
>FastSchedule=1
>#PriorityType=priority/multifactor
>#PriorityDecayHalfLife=14-0
>#PriorityUsageResetPeriod=14-0
>#PriorityWeightFairshare=100000
>#PriorityWeightAge=1000
>#PriorityWeightPartition=10000
>#PriorityWeightJobSize=1000
>#PriorityMaxAge=1-0
>#
># LOGGING
>SlurmctldDebug=3
>#SlurmctldLogFile=
>SlurmdDebug=3
>#SlurmdLogFile=
>JobCompType=jobcomp/none
>#JobCompLoc=
>#
># ACCOUNTING
>#JobAcctGatherType=jobacct_gather/linux
>#JobAcctGatherFrequency=30
>#
>#AccountingStorageType=accounting_storage/slurmdbd
>#AccountingStorageHost=
>#AccountingStorageLoc=
>#AccountingStoragePass=
>#AccountingStorageUser=
>#
># COMPUTE NODES
>NodeName=prod-000[1-4] Procs=8 State=UNKNOWN
>PartitionName=all.q Nodes=prod-000[1-4] Default=YES MaxTime=INFINITE State=UP
>
>=============END CONFIG===================================
>
>I attempt to run srun with
>
>srun -n 8 -vvvvvv date
>
>and I get:
>srun: defined options for program `srun'
>srun: --------------- ---------------------
>srun: user           : `root'
>srun: uid            : 0
>srun: gid            : 0
>srun: cwd            : /home/drace/QueueOptions/ACE/ace/daemons
>srun: ntasks         : 8 (set)
>srun: nodes          : 1 (default)
>srun: jobid          : 4294967294 (default)
>srun: partition      : default
>srun: job name       : `(null)'
>srun: reservation    : `(null)'
>srun: wckey          : `(null)'
>srun: switches       : -1
>srun: wait-for-switches : -1
>srun: distribution   : unknown
>srun: cpu_bind       : default
>srun: mem_bind       : default
>srun: cpu_freq       : 4294967294
>srun: verbose        : 5
>srun: slurmd_debug   : 0
>srun: immediate      : false
>srun: label output   : false
>srun: unbuffered IO  : false
>srun: overcommit     : false
>srun: threads        : 60
>srun: checkpoint_dir : /home/drace/QueueOptions/ACE/ace/daemons
>srun: wait           : 0
>srun: account        : (null)
>srun: comment        : (null)
>srun: dependency     : (null)
>srun: exclusive      : false
>srun: qos            : (null)
>srun: constraints    : tmp-per-node=4294967294
>srun: geometry       : (null)
>srun: reboot         : yes
>srun: rotate         : no
>srun: preserve_env   : false
>srun: network        : (null)
>srun: propagate      : NONE
>srun: prolog         : (null)
>srun: epilog         : (null)
>srun: mail_type      : NONE
>srun: mail_user      : (null)
>srun: task_prolog    : (null)
>srun: task_epilog    : (null)
>srun: multi_prog     : no
>srun: sockets-per-node  : -2
>srun: cores-per-socket  : -2
>srun: threads-per-core  : -2
>srun: ntasks-per-node   : -2
>srun: ntasks-per-socket : -2
>srun: ntasks-per-core   : -2
>srun: plane_size        : 4294967294
>srun: remote command    : `/bin/date'
>srun: debug:  propagating RLIMIT_CPU=18446744073709551615
>srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
>srun: debug:  propagating RLIMIT_DATA=18446744073709551615
>srun: debug:  propagating RLIMIT_STACK=10485760
>srun: debug:  propagating RLIMIT_CORE=0
>srun: debug:  propagating RLIMIT_RSS=18446744073709551615
>srun: debug:  propagating RLIMIT_NPROC=131072
>srun: debug:  propagating RLIMIT_NOFILE=327680
>srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
>srun: debug:  propagating RLIMIT_AS=18446744073709551615
>srun: debug:  propagating SLURM_PRIO_PROCESS=0
>srun: debug:  propagating UMASK=0022
>srun: debug2: srun PMI messages to port=43211
>srun: debug:  Entering slurm_allocation_msg_thr_create()
>srun: debug:  port from net_stream_listen is 34042
>srun: debug:  Entering _msg_thr_internal
>srun: debug4: eio: handling events for 1 objects
>srun: debug3: Called eio_message_socket_readable 0 4
>srun: debug3: Trying to load plugin /slurm/lib/slurm/auth_munge.so
>srun: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>srun: debug3: Success.
>srun: Required node not available (down or drained)
>srun: debug2: Pending job allocation 33
>srun: job 33 queued and waiting for resources
>
>
>Any ideas?
>
>Thanks
>
>David Race

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.