[slurm-dev] Configuration Issues

2015-03-30 Thread Carl E. Fields
Hello,

I have installed Slurm version 14.11.4 on a RHEL server with the
following specs:


Architecture:  x86_64

CPU op-mode(s):32-bit, 64-bit

Byte Order:Little Endian

CPU(s):2

On-line CPU(s) list:   0,1

Thread(s) per core:1

Core(s) per socket:2

Socket(s): 1

NUMA node(s):  1

Vendor ID: GenuineIntel

CPU family:6

Model: 23

Stepping:  6

CPU MHz:   2300.000

BogoMIPS:  4600.00

Hypervisor vendor: VMware

Virtualization type:   full

L1d cache: 32K

L1i cache: 32K

L2 cache:  256K

L3 cache:  15360K

NUMA node0 CPU(s): 0,1



I wish to designate one core for the controller, and another core as
available for job submissions which require 1 core.
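
As an aside, here are two hedged sketches of how such a split is sometimes
expressed in slurm.conf. Neither line is part of the configuration posted below,
and the CoreSpecCount variant assumes core specialization is available and
enforced in this installation:

  # Option A: advertise only one CPU to the scheduler, leaving the other
  # core to the OS and the Slurm daemons
  NodeName=sod264 CPUs=1 RealMemory=128940 TmpDisk=19895

  # Option B: keep both cores visible but reserve one for system use
  NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 CoreSpecCount=1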

I have configured everything; however, I believe I have an error in my
slurm.conf file because when I submit a job, it sits in the queue with the
node reason shown below:

 JOBID PARTITION     NAME     USER    STATE   TIME TIME_LIMI  NODES NODELIST(REASON)
    20   compute calculat SlurmUse  PENDING   0:00     10:00      1 (Resources)



I believe I am not properly configuring the resources in my file, but I am
unsure of where the issue lies. I hope someone can assist me in properly
configuring my server. Thank you in advance.

My current slurm.conf file:



[SlurmUser@sod264 etc]$ cat slurm.conf

# slurm.conf file generated by configurator easy.html.

# Put this file on all nodes of your cluster.

# See the slurm.conf man page for more information.

#

ControlMachine=sod264

ControlAddr=129.XXX

#

#MailProg=/bin/mail

MpiDefault=none

#MpiParams=ports=#-#

ProctrackType=proctrack/pgid

ReturnToService=0

SlurmctldPidFile=/var/run/slurmctld.pid

#SlurmctldPort=6817

SlurmdPidFile=/var/run/slurmd.pid

#SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurmd

SlurmUser=SlurmUser

SlurmdUser=SlurmUser

StateSaveLocation=/var/spool/statesave

SwitchType=switch/none

TaskPlugin=task/none

#

#

# TIMERS

#KillWait=30

#MinJobAge=300

#SlurmctldTimeout=120

#SlurmdTimeout=300

#

#

# SCHEDULING

FastSchedule=1

SchedulerType=sched/backfill

#SchedulerPort=7321

#SelectType=select/serial

SelectType=select/cons_res

SelectTypeParameters=CR_CORE

#

#

# LOGGING AND ACCOUNTING

AccountingStorageType=accounting_storage/none

ClusterName=MESA-Web

#JobAcctGatherFrequency=30

JobAcctGatherType=jobacct_gather/none

SlurmctldDebug=3

SlurmctldLogFile=/var/log/slurm/slurmctld.log

SlurmdDebug=3

SlurmdLogFile=/var/log/slurm/slurmd.log

#


#

# COMPUTE NODES

NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 TmpDisk=19895


PartitionName=compute Nodes=sod264 Default=YES STATE=UP



Kind Regards,

Carl


[slurm-dev] Re: Configuration Issues

2015-03-30 Thread Uwe Sauter

It would be helpful to see how you submitted the job, and the output from
scontrol show job 20.

Regards,

Uwe

On 30.03.2015 at 19:49, Carl E. Fields wrote:
 [...]


[slurm-dev] Re: Configuration Issues

2015-03-30 Thread Mehdi Denou
Hi all,

Carl, could you give the output of sinfo?


On 30/03/2015 at 20:00, Carl E. Fields wrote:
 [...]

[slurm-dev] Re: Configuration Issues

2015-03-30 Thread Carl E. Fields
Dear Uwe,

Thank you for your response,

I have submitted the job by:

$ sbatch calculate.sh

and the output you requested:

[SlurmUser@sod264 services]$ scontrol show job 20

JobId=20 JobName=calculate.sh

   UserId=SlurmUser(3099) GroupId=SlurmUser(3099)

   Priority=4294901742 Nice=0 Account=slurmuser QOS=(null)

   JobState=PENDING Reason=Resources Dependency=(null)

   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0

   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A

   SubmitTime=2015-03-29T13:09:21 EligibleTime=2015-03-29T13:09:21

   StartTime=2015-03-31T10:56:06 EndTime=Unknown

   PreemptTime=None SuspendTime=None SecsPreSuspend=0

   Partition=compute AllocNode:Sid=sod264:792

   ReqNodeList=(null) ExcNodeList=(null)

   NodeList=(null)

   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0

   Features=(null) Gres=(null) Reservation=(null)

   Shared=OK Contiguous=0 Licenses=(null) Network=(null)

   Command=/var/www/virtual/mesa-web.asu.edu/html/services/calculate.sh

   WorkDir=/var/www/virtual/mesa-web.asu.edu/html/services/Work/

   StdErr=/var/www/virtual/mesa-web.asu.edu/html/services/Work//job.%J.err

   StdIn=/dev/null

   StdOut=/var/www/virtual/mesa-web.asu.edu/html/services/Work//job.%J.out




Thanks,

Carl

On Mon, Mar 30, 2015 at 10:53 AM, Uwe Sauter uwe.sauter...@gmail.com
wrote:


 It would be helpful to see how you submitted the job. And the output from
 scontrol show job 20.

 Regards,

 Uwe

 On 30.03.2015 at 19:49, Carl E. Fields wrote:
 [...]



[slurm-dev] Re: --reboot

2015-03-30 Thread Scott Yockel

Just adding the /var/log/messages entry to this:

slurmd[4526]: error: _step_connect: connect() failed dir 
/var/slurmd/spool/slurmd node holy2a14107 job 36557200 step -2 Connection 
refused


 On Mar 30, 2015, at 1:02 PM, Paul Edmon ped...@cfa.harvard.edu wrote:
 
 
 So point of information.  My impression is that when you use this flag, the 
 node/s are rebooted (or the script defined is run) and then the job runs as 
 per normal.  Is this correct?  Because I think in our case it is rebooting 
 the nodes but then the node is defined as closed due to unexpected reboot.  
 Is there a way to suppress this when the node is rebooted by this flag?  
 Obviously the reboot wasn't unexpected as slurm was aware of it due to the 
 flag.
 
 -Paul Edmon-


[slurm-dev] --reboot

2015-03-30 Thread Paul Edmon


So point of information.  My impression is that when you use this flag, 
the node/s are rebooted (or the script defined is run) and then the job 
runs as per normal.  Is this correct?  Because I think in our case it is 
rebooting the nodes but then the node is defined as closed due to 
unexpected reboot.  Is there a way to suppress this when the node is 
rebooted by this flag?  Obviously the reboot wasn't unexpected as slurm 
was aware of it due to the flag.


-Paul Edmon-
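
One knob that is often pointed at for this situation is ReturnToService in
slurm.conf. The sketch below is only a suggestion, under the assumption that the
Slurm version in use supports value 2; it is not a confirmed fix for the
behaviour described above:

  # slurm.conf: let a DOWN node become available again as soon as it
  # re-registers with a valid configuration after the (expected) reboot
  ReturnToService=2

  # alternatively, clear the state by hand once the node is back up
  scontrol update NodeName=<nodename> State=RESUME

Here <nodename> is just a placeholder for the rebooted node.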


[slurm-dev] Re: Configuration Issues

2015-03-30 Thread Uwe Sauter

Hi,

please post your answers to the list so others can help, too.

The reason for the node's IDLE+DRAIN state is given by the output of scontrol 
show node:

  Reason=Low socket*core*thread count, Low CPUs [SlurmUser@2015-03-11T22:15:12]

Why slurmctld decided to put the node into this state is a bit unclear to me, as
slurm.conf contains:

  NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 
TmpDisk=19895

And it correctly detects 2 CPUs (again scontrol show node):

  CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.00


You might want to try *not* specifying the sockets/cores/threads in slurm.conf
and letting slurmd detect the values on its own.
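
A minimal sketch of that suggestion, reusing the memory and disk values from the
posted config; treat it as something to try rather than a verified fix:

  # slurm.conf: let slurmd report sockets/cores/threads itself
  NodeName=sod264 RealMemory=128940 TmpDisk=19895
  PartitionName=compute Nodes=sod264 Default=YES State=UP

  # with FastSchedule=1 the controller trusts the values written in
  # slurm.conf, so it may also be worth testing FastSchedule=0 so that
  # the hardware reported by slurmd is used for scheduling

  # once the daemons have been restarted, the DRAIN state still has to
  # be cleared by hand (drained nodes are not resumed automatically):
  scontrol update NodeName=sod264 State=RESUME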

If someone else has a better idea, it's welcome.

Regards,

Uwe


On 30.03.2015 at 20:03, Carl E. Fields wrote:
 Output of calculate.sh
 
 [SlurmUser@sod264 services]$ cat calculate.sh 
 
 #!/bin/bash
 
 #SBATCH -A SlurmUser
 
 #SBATCH --workdir=/var/www/virtual/mesa-web.asu.edu/html/services/Work/
 
 
 #SBATCH -n 1
 
 #SBATCH -c 1
 
 #SBATCH --time=00:10:00
 
 #SBATCH --mail-type=ALL
 
 #SBATCH --mail-user=c...@asu.edu
 
 #SBATCH --error=job.%J.err
 
 #SBATCH --output=job.%J.out
 
 
 #SBATCH --export=MESA_DIR=/home/cefields/mesa
 
 #SBATCH --export=OMP_NUM_THREADS=1
 
 #SBATCH --export=MESASDK_ROOT=/home/cefields/mesasdk
 
 source $MESASDK_ROOT/bin/mesasdk_init.sh
 
 
 srun ./mv.sh
 
 echo 'WAITING 1 MINUTE FOR FILES TO BE MOVED...'
 
 srun ./mk
 
 echo 'make...'
 
 srun ./rn
 
 echo 'run...'
 
 
 
 
 State of node:
 
 
 [SlurmUser@sod264 services]$ scontrol show node sod264
 
 NodeName=sod264 Arch=x86_64 CoresPerSocket=2
 
CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.00 Features=(null)
 
Gres=(null)
 
NodeAddr=sod264 NodeHostName=sod264 Version=14.11
 
OS=Linux RealMemory=128940 AllocMem=0 Sockets=1 Boards=1
 
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=19895 Weight=1
 
BootTime=2015-03-10T12:12:21 SlurmdStartTime=2015-03-29T13:08:30
 
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
 
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
 
Reason=Low socket*core*thread count, Low CPUs 
 [SlurmUser@2015-03-11T22:15:12]
 
 
 [SlurmUser@sod264 services]$ 
 
 
 
 
 Thanks,
 
 Carl
 
  On Mon, Mar 30, 2015 at 11:01 AM, Uwe Sauter uwe.sauter...@gmail.com wrote:
 
 And what is the state of your node (sinfo -l output)? Or scontrol show 
 node sod264?
 
 
  On 30.03.2015 at 19:57, Carl E. Fields wrote:
  [...]

[slurm-dev] Problems running job

2015-03-30 Thread Jeff Layton

Good afternoon!

I think I've got Slurm up and working (well maybe) thanks
to some help from Paul Edmon. I tried running a simple job

#!/bin/batch

#SBATCH --job-name=slurmtest

hostname
uptime

But it doesn't seem to run. Here is the output of sinfo
and squeue:

[ec2-user@ip-10-0-1-72 ec2-user]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*   up   infinite  1  comp* ip-10-0-2-101
debug*   up   infinite  1  down* ip-10-0-2-102
[ec2-user@ip-10-0-1-72 ec2-user]$ squeue
             JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
                 2     debug slurmtes ec2-user CG  0:00      1 ip-10-0-2-101


The system logs on the master node (controller node)
don't show too much:

Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]: 
_slurm_rpc_submit_batch_job JobId=2 usec=239
Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2 
NodeList=ip-10-0-2-101 #CPUs=1
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 
State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue 
JobID=2 State=0x8000 NodeCnt=1 per user/system request
Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 
State=0x8000 NodeCnt=1 done
Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes 
ip-10-0-2-[101-102] not responding
Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102 
not responding, setting DOWN
Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101 
not responding, setting DOWN



From these logs, it looks like the compute nodes are not
responding to the control node (master node).

Not sure how to debug this - any tips?

Thanks!

Jeff

P.S. slurm.conf file attached



#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=HPCDemo
ControlMachine=ip-10-0-1-72
ControlAddr=10.0.1.72
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=10
#PriorityWeightAge=1000
#PriorityWeightPartition=1
#PriorityWeightJobSize=1000
#
# LOGGING
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
NodeName=ip-10-0-2-[101-102] Procs=1 State=UNKNOWN
PartitionName=debug Nodes=ip-10-0-2-[101-102] Default=YES MaxTime=INFINITE State=UP



[slurm-dev] Re: Configuration Issues

2015-03-30 Thread Carl E. Fields
Hello,

Output below:

[SlurmUser@sod264 ~]$ sinfo -l

Mon Mar 30 11:13:12 2015

PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT SHARE GROUPS  NODES    STATE NODELIST
compute*     up   infinite 1-infinite   no    NO    all      1  drained sod264

[SlurmUser@sod264 ~]$




Thank you,

Carl

On Mon, Mar 30, 2015 at 11:11 AM, Mehdi Denou mehdi.de...@atos.net wrote:

  Hi all,

 Carl, could you give the output of sinfo ?



 On 30/03/2015 at 20:00, Carl E. Fields wrote:

 [...]

[slurm-dev] Error connecting slurm stream socket at IP:6817: Connection refused

2015-03-30 Thread Jorge Góis
Hi guys, I'm from Portugal and am starting a project with Slurm.

Now I have a problem starting slurmd.
My config is attached.
Can someone help?

Error msg:

  slurmd -cDvvv
 slurmd: topology NONE plugin loaded
 slurmd: route default plugin loaded
 slurmd: CPU frequency setting not configured for this node
 slurmd: No specialized cores configured by default on this node
 slurmd: Resource spec: system memory limit not configured for this node
 slurmd: debug:  task NONE plugin loaded
 slurmd: debug:  auth plugin for Munge (http://code.google.com/p/munge/)
 loaded
 slurmd: debug:  spank: opening plugin stack /usr/local/etc/plugstack.conf
 slurmd: Munge cryptographic signature plugin loaded
 slurmd: Warning: Core limit is only 0 KB
 slurmd: slurmd version 14.11.4 started
 slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
 slurmd: debug:  job_container none plugin loaded
 slurmd: debug:  switch NONE plugin loaded
 slurmd: slurmd started on Mon, 30 Mar 2015 22:08:54 +0100
 slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=992
 TmpDisk=6635 Uptime=26 CPUSpecList=(null)
 slurmd: debug:  AcctGatherEnergy NONE plugin loaded
 slurmd: debug:  AcctGatherProfile NONE plugin loaded
 slurmd: debug:  AcctGatherInfiniband NONE plugin loaded
 slurmd: debug:  AcctGatherFilesystem NONE plugin loaded
 slurmd: debug2: No acct_gather.conf file (/usr/local/etc/acct_gather.conf)
 slurmd: debug2: _slurm_connect failed: Connection refused
 slurmd: debug2: Error connecting slurm stream socket at 172.16.40.42:6817:
 Connection refused
 slurmd: debug:  Failed to contact primary controller: Connection refused
 .
 .
 .


Can someone help?

Bests,
  Jorge Góis


slurm.conf
Description: Binary data


[slurm-dev] Re: Error connecting slurm stream socket at IP:6817: Connection refused

2015-03-30 Thread Moe Jette


This should get you started:
http://slurm.schedmd.com/troubleshoot.html
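
For this particular symptom (slurmd reporting Connection refused when contacting
the primary controller), the usual first checks from that guide look roughly like
the following; the commands are generic suggestions, with the controller address
taken from the error message above:

  # on the controller: is slurmctld running and listening on SlurmctldPort?
  ps -ef | grep slurmctld
  netstat -tlnp | grep 6817

  # from the compute node: basic reachability plus Slurm's own check
  ping 172.16.40.42
  scontrol ping

  # also verify that munged is running on both machines with the same
  # munge key, and that no firewall blocks ports 6817/6818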


Quoting Jorge Góis mail.jg...@gmail.com:


 [...]



--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support


[slurm-dev] Re: Problems running job

2015-03-30 Thread David Bigagli


The requeue indicates the node failed or was set down.
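
A quick way to confirm that on the controller side (generic commands, using the
node name already mentioned in the thread):

  # list down/drained nodes together with the recorded reason
  sinfo -R

  # or inspect the node directly
  scontrol show node ip-10-0-2-101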

On 03/30/2015 03:24 PM, Christopher Samuel wrote:


 [...]


--

Thanks,
  David Bigagli

www.schedmd.com


[slurm-dev] Re: Problems running job

2015-03-30 Thread Christopher Samuel

On 31/03/15 07:31, Jeff Layton wrote:

 Good afternoon!

Hiya Jeff,

[...]
 But it doesn't seem to run. Here is the output of sinfo
 and squeue:
[...]

Actually it does appear to get started (at least), but..

 [ec2-user@ip-10-0-1-72 ec2-user]$ squeue
  JOBID PARTITION NAME USER ST TIME  NODES
 NODELIST(REASON)
  2 debug slurmtes ec2-user CG 0:00  1 ip-10-0-2-101

...the CG state you see there is the completing state, i.e. the state
when a job is finishing up.

 The system logs on the master node (contoller node) don't show too much:
 
 Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]:
 _slurm_rpc_submit_batch_job JobId=2 usec=239
 Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2
 NodeList=ip-10-0-2-101 #CPUs=1

OK, node allocated.

 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0

Job finishes.

 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue
 JobID=2 State=0x8000 NodeCnt=1 per user/system request
 Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2
 State=0x8000 NodeCnt=1 done

Not sure of the implication of that requeue there, unless it's the
transition to the CG state?

 Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes
 ip-10-0-2-[101-102] not responding
 Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102
 not responding, setting DOWN
 Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101
 not responding, setting DOWN

Now the nodes stop responding (not before).

 From these logs, it looks like the compute nodes are not
 responding to the control node (master node).
 
 Not sure how to debug this - any tips?

I would suggest looking at the slurmd logs on the compute nodes to see
if they report any problems, and check to see what state the processes
are in - especially if they're stuck in a 'D' state waiting on some form
of device I/O.
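
Concretely, something along these lines on each compute node; the hostname is
taken from the thread, and since the posted slurm.conf leaves SlurmdLogFile
unset, slurmd may well be logging through syslog rather than a file:

  # run slurmd in the foreground with verbose output to see why it
  # stops talking to the controller
  slurmd -D -vvv

  # look for processes stuck in uninterruptible sleep ('D' state)
  ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'

  # confirm munge authentication works against the controller
  munge -n | ssh ip-10-0-1-72 unmunge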

I know some people have reported strange interactions between Slurm
being on an NFSv4 mount (NFSv3 is fine).

Good luck!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci