[slurm-dev] Configuration Issues
Hello,

I have installed Slurm version 14.11.4 on a RHEL server with the following specs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              6
CPU MHz:               2300.000
BogoMIPS:              4600.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0,1

I wish to designate one core as the controller, and the other core as available for job submissions which require 1 core. I have configured everything; however, I believe I have an error in my slurm.conf file, because when I submit a job it sits in the queue with the node reason below:

JOBID PARTITION     NAME     USER   STATE  TIME TIME_LIMI NODES NODELIST(REASON)
   20   compute calculat SlurmUse PENDING  0:00     10:00     1 (Resources)

I believe I am not properly configuring the resources in my file, but I am unsure of where the issue lies. I hope someone can assist me in properly configuring my server. Thank you in advance.

My current slurm.conf file:

[SlurmUser@sod264 etc]$ cat slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=sod264
ControlAddr=129.XXX
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=0
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=SlurmUser
SlurmdUser=SlurmUser
StateSaveLocation=/var/spool/statesave
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
#SelectType=select/serial
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=MESA-Web
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# COMPUTE NODES
NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 TmpDisk=19895
PartitionName=compute Nodes=sod264 Default=YES STATE=UP

Kind Regards,

Carl
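A quick way to cross-check the NodeName line against what Slurm actually detects (a general sketch, not a step taken in this thread): slurmd can print the node's hardware in slurm.conf syntax. The output shown below is only illustrative for this machine.

    # Run on the node itself; "slurmd -C" prints the detected node
    # configuration and exits. The exact output line is hypothetical.
    [SlurmUser@sod264 ~]$ slurmd -C
    NodeName=sod264 CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 TmpDisk=19895

If the printed values disagree with the NodeName entry in slurm.conf, slurmctld may mark the node as having a low socket*core*thread count and refuse to schedule on it.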
[slurm-dev] Re: Configuration Issues
It would be helpful to see how you submitted the job, and the output from "scontrol show job 20".

Regards,

Uwe

On 30.03.2015 at 19:49, Carl E. Fields wrote:
[...]
[slurm-dev] Re: Configuration Issues
Hi all,

Carl, could you give the output of sinfo?

On 30/03/2015 20:00, Carl E. Fields wrote:
[...]
[slurm-dev] Re: Configuration Issues
Dear Uwe,

Thank you for your response. I submitted the job with:

$ batch calculate.sh

and here is the output you requested:

[SlurmUser@sod264 services]$ scontrol show job 20
JobId=20 JobName=calculate.sh
   UserId=SlurmUser(3099) GroupId=SlurmUser(3099)
   Priority=4294901742 Nice=0 Account=slurmuser QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-03-29T13:09:21 EligibleTime=2015-03-29T13:09:21
   StartTime=2015-03-31T10:56:06 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=compute AllocNode:Sid=sod264:792
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/var/www/virtual/mesa-web.asu.edu/html/services/calculate.sh
   WorkDir=/var/www/virtual/mesa-web.asu.edu/html/services/Work/
   StdErr=/var/www/virtual/mesa-web.asu.edu/html/services/Work//job.%J.err
   StdIn=/dev/null
   StdOut=/var/www/virtual/mesa-web.asu.edu/html/services/Work//job.%J.out

Thanks,

Carl

On Mon, Mar 30, 2015 at 10:53 AM, Uwe Sauter <uwe.sauter...@gmail.com> wrote:
[...]
[slurm-dev] Re: --reboot
Just adding the /var/log/messages entry for this:

slurmd[4526]: error: _step_connect: connect() failed dir /var/slurmd/spool/slurmd node holy2a14107 job 36557200 step -2 Connection refused

On Mar 30, 2015, at 1:02 PM, Paul Edmon <ped...@cfa.harvard.edu> wrote:
[...]
[slurm-dev] --reboot
So, a point of information. My impression is that when you use this flag, the node(s) are rebooted (or the defined script is run) and then the job runs as per normal. Is this correct? Because in our case it is rebooting the nodes, but then the node is defined as closed due to an unexpected reboot. Is there a way to suppress this when the node is rebooted by this flag? Obviously the reboot wasn't unexpected, as Slurm was aware of it due to the flag.

-Paul Edmon-
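One knob that may be relevant here (a hedged suggestion, not something confirmed in the thread): slurm.conf's ReturnToService setting controls whether a node that was set DOWN is automatically returned to service when slurmd registers again. A minimal sketch:

    # slurm.conf sketch; whether this suppresses the post---reboot down
    # state in this scenario is an assumption worth testing, not a
    # documented guarantee.
    #
    # 0 (default): a DOWN node stays down until an admin resumes it.
    # 1: a node set DOWN for being non-responsive returns to service
    #    when it registers with a valid configuration.
    # 2: a DOWN node returns to service upon registration with a valid
    #    configuration, regardless of why it was set DOWN.
    ReturnToService=2

With 0 or 1, the node would need a manual "scontrol update NodeName=<node> State=RESUME" after each reboot.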
[slurm-dev] Re: Configuration Issues
Hi,

please post your answers to the list so others can help, too.

The reason for the node's IDLE+DRAIN state is given by the output of "scontrol show node":

Reason=Low socket*core*thread count, Low CPUs [SlurmUser@2015-03-11T22:15:12]

Why slurmctld decided to put the node into this state is a bit unclear to me, as slurm.conf contains:

NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 TmpDisk=19895

and it correctly detects 2 CPUs (again from "scontrol show node"):

CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.00

You might want to try *not* specifying the sockets/cores/threads in slurm.conf and let slurmd detect the values on its own. If someone else has a better idea, it's welcome.

Regards,

Uwe

On 30.03.2015 at 20:03, Carl E. Fields wrote:

Output of calculate.sh:

[SlurmUser@sod264 services]$ cat calculate.sh
#!/bin/bash
#SBATCH -A SlurmUser
#SBATCH --workdir=/var/www/virtual/mesa-web.asu.edu/html/services/Work/
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --time=00:10:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=c...@asu.edu
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#SBATCH --export=MESA_DIR=/home/cefields/mesa
#SBATCH --export=OMP_NUM_THREADS=1
#SBATCH --export=MESASDK_ROOT=/home/cefields/mesasdk

source $MESASDK_ROOT/bin/mesasdk_init.sh
srun ./mv.sh
echo 'WAITING 1 MINUTE FOR FILES TO BE MOVED...'
srun ./mk
echo 'make...'
srun ./rn
echo 'run...'

State of node:

[SlurmUser@sod264 services]$ scontrol show node sod264
NodeName=sod264 Arch=x86_64 CoresPerSocket=2
   CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.00
   Features=(null) Gres=(null)
   NodeAddr=sod264 NodeHostName=sod264 Version=14.11
   OS=Linux RealMemory=128940 AllocMem=0 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=19895 Weight=1
   BootTime=2015-03-10T12:12:21 SlurmdStartTime=2015-03-29T13:08:30
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low socket*core*thread count, Low CPUs [SlurmUser@2015-03-11T22:15:12]

Thanks,

Carl

On Mon, Mar 30, 2015 at 11:01 AM, Uwe Sauter <uwe.sauter...@gmail.com> wrote:

And what is the state of your node ("sinfo -l" output)? Or "scontrol show node sod264"?

On 30.03.2015 at 19:57, Carl E. Fields wrote:
[...]
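A general follow-up to Uwe's suggestion (a sketch, not a step taken in the thread): after correcting or removing the socket/core/thread values and restarting the daemons, the DRAIN flag still has to be cleared by hand before pending jobs will start.

    # Hypothetical commands on sod264; State=RESUME clears the DRAIN
    # flag so the node can return to IDLE and accept the pending job.
    [SlurmUser@sod264 ~]$ scontrol update NodeName=sod264 State=RESUME
    [SlurmUser@sod264 ~]$ sinfo -l    # the node should now report "idle"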
[slurm-dev] Problems running job
Good afternoon!

I think I've got Slurm up and working (well, maybe) thanks to some help from Paul Edmon. I tried running a simple job:

#!/bin/batch
#SBATCH --job-name=slurmtest
hostname
uptime

But it doesn't seem to run. Here is the output of sinfo and squeue:

[ec2-user@ip-10-0-1-72 ec2-user]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
debug*       up   infinite     1  comp* ip-10-0-2-101
debug*       up   infinite     1  down* ip-10-0-2-102
[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    2     debug slurmtes ec2-user CG  0:00     1 ip-10-0-2-101

The system logs on the master node (controller node) don't show too much:

Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]: _slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2 NodeList=ip-10-0-2-101 #CPUs=1
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x8000 NodeCnt=1 done
Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102 not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101 not responding, setting DOWN

From these logs, it looks like the compute nodes are not responding to the control node (master node). Not sure how to debug this - any tips?

Thanks!

Jeff

P.S. slurm.conf file attached:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=HPCDemo
ControlMachine=ip-10-0-1-72
ControlAddr=10.0.1.72
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=10
#PriorityWeightAge=1000
#PriorityWeightPartition=1
#PriorityWeightJobSize=1000
#
# LOGGING
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
NodeName=ip-10-0-2-[101-102] Procs=1 State=UNKNOWN
PartitionName=debug Nodes=ip-10-0-2-[101-102] Default=YES MaxTime=INFINITE State=UP
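A generic first check for the "not responding" errors above (a hedged sketch using the hostnames and the SlurmdPort from Jeff's slurm.conf, not steps from the thread):

    # Hypothetical commands. On a compute node, run slurmd in the
    # foreground with verbose logging to see whether it starts and
    # registers with the controller:
    sudo slurmd -D -vvv

    # From the control node, confirm the slurmd port (SlurmdPort=6818
    # above) is reachable on each compute node:
    nc -zv ip-10-0-2-101 6818
    nc -zv ip-10-0-2-102 6818

    # And confirm the controller itself answers:
    scontrol ping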
[slurm-dev] Re: Configuration Issues
Hello,

Output below:

[SlurmUser@sod264 ~]$ sinfo -l
Mon Mar 30 11:13:12 2015
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT SHARE GROUPS NODES   STATE NODELIST
compute*     up   infinite 1-infinite   no    NO    all     1 drained sod264
[SlurmUser@sod264 ~]$

Thank you,

Carl

On Mon, Mar 30, 2015 at 11:11 AM, Mehdi Denou <mehdi.de...@atos.net> wrote:
[...]
[slurm-dev] Error connecting slurm stream socket at IP:6817: Connection refused
Hi guys, I'm from Portugal, starting a project with Slurm. Now I have a problem starting slurmd. My config is attached. Can someone help?

Error message:

slurmd -cDvvv
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: No specialized cores configured by default on this node
slurmd: Resource spec: system memory limit not configured for this node
slurmd: debug: task NONE plugin loaded
slurmd: debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug: spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 14.11.4 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Mon, 30 Mar 2015 22:08:54 +0100
slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=992 TmpDisk=6635 Uptime=26 CPUSpecList=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInfiniband NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/usr/local/etc/acct_gather.conf)
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 172.16.40.42:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
. . .

Bests,

Jorge Góis

(slurm.conf attached)
[slurm-dev] Re: Error connecting slurm stream socket at IP:6817: Connection refused
This should get you started: http://slurm.schedmd.com/troubleshoot.html

Quoting Jorge Góis <mail.jg...@gmail.com>:
[...]

--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
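For this particular error ("Failed to contact primary controller: Connection refused"), the usual suspects are that slurmctld is not actually running on the address slurmd is using, or that port 6817 is firewalled. A hedged sketch of the checks, using the controller address from Jorge's log:

    # Hypothetical checks. On the controller (172.16.40.42 per the log):
    ps -ef | grep [s]lurmctld    # is the daemon running at all?
    ss -tlnp | grep 6817         # is it listening on the slurmctld port?

    # From the compute node: is the port reachable through any firewall?
    nc -zv 172.16.40.42 6817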
[slurm-dev] Re: Problems running job
The "requeue" indicates the node failed or was set down.

On 03/30/2015 03:24 PM, Christopher Samuel wrote:
[...]

--
Thanks,
David Bigagli
www.schedmd.com
[slurm-dev] Re: Problems running job
On 31/03/15 07:31, Jeff Layton wrote:

> Good afternoon!

Hiya Jeff,

[...]

> But it doesn't seem to run. Here is the output of sinfo and squeue:

[...]

Actually it does appear to get started (at least), but..

> [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
> JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>     2     debug slurmtes ec2-user CG  0:00     1 ip-10-0-2-101

...the CG state you see there is the completing state, i.e. the state when a job is finishing up.

> The system logs on the master node (controller node) don't show too much:
>
> Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]: _slurm_rpc_submit_batch_job JobId=2 usec=239
> Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2 NodeList=ip-10-0-2-101 #CPUs=1

OK, node allocated.

> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0

Job finishes.

> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue JobID=2 State=0x8000 NodeCnt=1 per user/system request
> Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x8000 NodeCnt=1 done

Not sure of the implication of that requeue there, unless it's the transition to the CG state?

> Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-[101-102] not responding
> Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102 not responding, setting DOWN
> Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101 not responding, setting DOWN

Now the nodes stop responding (not before).

> From these logs, it looks like the compute nodes are not responding to the control node (master node). Not sure how to debug this - any tips?

I would suggest looking at the slurmd logs on the compute nodes to see if they report any problems, and check to see what state the processes are in - especially if they're stuck in a 'D' state waiting on some form of device I/O. I know some people have reported strange interactions between Slurm being on an NFSv4 mount (NFSv3 is fine).

Good luck!

Chris

--
Christopher Samuel    Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/    http://twitter.com/vlsci
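To make those two checks concrete (a hedged sketch; Jeff's slurm.conf leaves SlurmdLogFile unset, so slurmd messages likely land in syslog rather than a dedicated log file):

    # Hypothetical commands run on a compute node such as ip-10-0-2-101.
    # Look for slurmd errors; the log location is an assumption, since
    # with SlurmdLogFile unset they go to the system log:
    grep slurmd /var/log/messages | tail -50

    # List processes stuck in uninterruptible sleep ('D' state), which
    # usually indicates hung device or NFS I/O:
    ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'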