Dear help,

I have a new install that works when I submit via salloc and srun; however, all batch submissions fail. I think the issue is tied to this error line:

slurmctld: error: slurmd error running JobId=44 on node(s)=node01: Slurmd could not create a batch directory or file


The problem is that I cannot pin down what exactly is causing it. Could it be a permissions issue or a missing directory that I have not set up or allowed? I am at a loss: I have searched everywhere for an explanation of what this error means and cannot find any documentation. It seems like an obvious fix based on the message, but I just can't see it.
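In case it is useful, this is the check I was planning to run next on node01 (the path is the SlurmdSpoolDir from the attached slurm.conf; since SlurmdUser is unset, slurmd runs as root, so I would test as root and, for completeness, as the slurm user as well):

# on node01, as root: does the spool dir exist and is it writable?
ls -ld /g0/opt/slurm/2.5.3/var/spool
touch /g0/opt/slurm/2.5.3/var/spool/write_test && rm /g0/opt/slurm/2.5.3/var/spool/write_test
# same test as the slurm user (the SlurmUser on the controller)
su - slurm -c "touch /g0/opt/slurm/2.5.3/var/spool/write_test && rm /g0/opt/slurm/2.5.3/var/spool/write_test"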


I have attached the logs for slurmctld and slurmd, the slurm.conf, and an example batch file.


Any advice for troubleshooting this would be very much appreciated.

Thank you in advance for the help,
Kevin

--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
  Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.ab...@rutgers.edu





 test.sh
====================
#!/bin/bash
#SBATCH -J TESTJOB
#SBATCH -n 2
#SBATCH -t 1

#prepdir
#cd $JOBDIR
#srun pwd 
srun hostname
date
#exit

====================
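For reference, this is how I submit it (the argv and work_dir lines in the slurmctld log below show the same):

sbatch /g1/home/kabbey/test.sh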





slurm.conf

[slurm@golova etc]$ more slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=golova
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/g0/opt/slurm/2.5.3/var/spool/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/g0/opt/slurm/2.5.3/var/run/slurmd_%h.pid
#SlurmdPort=6818
#SlurmdSpoolDir=/g0/opt/slurm/2.5.3/var/spool/slurmd_%h
SlurmdSpoolDir=/g0/opt/slurm/2.5.3/var/spool
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/g0/opt/slurm/2.5.3/var/spool
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_res
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=golova
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=7
SlurmctldLogFile=/g0/opt/slurm/log/slurmctld.log
SlurmdDebug=7
SlurmdLogFile=/g0/opt/slurm/log/slurmd_%h.log
#
#
# COMPUTE NODES
NodeName=node01 NodeAddr=192.168.0.1 CPUs=32 RealMemory=58505 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=node01 Default=YES MaxTime=43200 State=UP
#
#NodeName=node[01-12] NodeAddr=192.168.0.[1-12] CPUs=32 RealMemory=58505 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
#PartitionName=compute Nodes=node[01-12] Default=YES MaxTime=43200 State=UP

[slurm@golova etc]$   
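One thing I notice while pasting this: SlurmdSpoolDir and StateSaveLocation now both point at /g0/opt/slurm/2.5.3/var/spool. I am not sure whether the controller and the node actually see the same /g0 filesystem (I still need to verify that), but if they do, the two daemons would be sharing one spool directory. A quick comparison I can run from each host:

# run on golova and on node01; matching device and contents would mean a shared spool
df /g0/opt/slurm/2.5.3/var/spool
ls -l /g0/opt/slurm/2.5.3/var/spool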


[root@golova ~]# slurmctld -Dcvvvvvv
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
slurmctld: debug3: Version in assoc_mgr_state header is 1
slurmctld: slurmctld version 2.5.3 started on cluster golova
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/select_cons_res.so
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 1
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  No backup controller to shutdown
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/switch_none.so
slurmctld: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  Reading slurm.conf file: /g0/opt/slurm/2.5.3/etc/slurm.conf
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/topology_none.so
slurmctld: topology NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  No DownNodes
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/jobcomp_none.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/sched_backfill.so
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Version string in job_state header is VER013
slurmctld: debug3: Job ID in job_state header is 43
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: debug3: found batch directory for job_id 43
slurmctld: Purging files for defunct batch job 43
slurmctld: debug:  Updating partition uid access list
slurmctld: debug3: Version string in resv_state header is VER004
slurmctld: Recovered state of 0 reservations
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 1 partitions
slurmctld: Running as primary controller
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/priority_basic.so
slurmctld: debug:  Priority BASIC plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: _slurmctld_rpc_mgr pid = 57969
slurmctld: debug3: _slurmctld_background pid = 57969
slurmctld: debug:  power_save module disabled, SuspendTime < 0
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug:  Spawning registration agent for node01 1 hosts
slurmctld: debug2: Spawning RPC agent for msg_type 1001
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug3: Tree sending to node01
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 192.168.0.1:6818: Connection refused
slurmctld: debug3: connect refused, retrying
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 192.168.0.1:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 192.168.0.1:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 192.168.0.1:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 192.168.0.1:6818: Connection refused
slurmctld: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/auth_munge.so
slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmctld: debug3: Success.
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug:  validate_node_specs: node node01 registered with 0 jobs
slurmctld: debug2: _slurm_rpc_node_registration complete for node01 usec=119
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: _slurm_rpc_node_registration complete for node01 usec=85
slurmctld: debug2: node_did_resp node01
slurmctld: debug2: agent maximum delay 5 seconds
slurmctld: debug:  backfill: no jobs to backfill
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=12901
slurmctld: debug3: JobDesc: user_id=12901 job_id=-1 partition=(null) name=TESTJOB
slurmctld: debug3:    cpus=2-4294967294 pn_min_cpus=-1
slurmctld: debug3:    -N min-[max]: 4294967294-[4294967294]:65534:65534:65534
slurmctld: debug3:    pn_min_memory_job=-1 pn_min_tmp_disk=-1
slurmctld: debug3:    immediate=0 features=(null) reservation=(null)
slurmctld: debug3:    req_nodes=(null) exc_nodes=(null) gres=(null)
slurmctld: debug3:    time_limit=1-1 priority=-1 contiguous=0 shared=-1
slurmctld: debug3:    kill_on_node_fail=-1 script=#!/bin/bash
#SBATCH -J TESTJOB
#SBATCH -...
slurmctld: debug3:    argv="/g1/home/kabbey/test.sh"
slurmctld: debug3:    environment=MANPATH=/ccib-bsb-164/u1/opt/software/python/2.7.3/share/man/man1:/u2/software/perl/man:/u2/software/perl/share/man:/u2/software/soap/man:/opt/xcat/share/man:/g0/opt/slurm/2.5.3/share/man:/g0/opt/blcr/0.8.5/man:/g0/opt/munge/0.5.10/man::/opt/software/ViennaRNA/2.0.1/share/man,BEDTOOLS=/opt/software/BEDTools-Version-2.15.0_default,HOSTNAME=golova.ccib.rutgers.edu,...
slurmctld: debug3:    stdin=/dev/null stdout=(null) stderr=(null)
slurmctld: debug3:    work_dir=/g1/home/kabbey alloc_node:sid=golova:36950
slurmctld: debug3:    resp_host=(null) alloc_resp_port=0  other_port=0
slurmctld: debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
slurmctld: debug3:    mail_type=0 mail_user=(null) nice=55534 num_tasks=2 open_mode=0 overcommit=-1 acctg_freq=-1
slurmctld: debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
slurmctld: debug3:    end_time=Unknown signal=0@0 wait_all_nodes=-1
slurmctld: debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
slurmctld: debug3:    cpus_bind=65534:(null) mem_bind=65534:(null) plane_size:65534
slurmctld: debug2: found 1 usable nodes from config containing node01
slurmctld: debug3: _pick_best_nodes: job 44 idle_nodes 1 share_nodes 1
slurmctld: debug2: select_p_job_test for job 44
slurmctld: debug2: sched: JobId=44 allocated resources: NodeList=(null)
slurmctld: _slurm_rpc_submit_batch_job JobId=44 usec=625
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug2: found 1 usable nodes from config containing node01
slurmctld: debug3: _pick_best_nodes: job 44 idle_nodes 1 share_nodes 1
slurmctld: debug2: select_p_job_test for job 44
slurmctld: debug3: cons_res: best_fit: node[0]: required cpus: 2, min req boards: 1,
slurmctld: debug3: cons_res: best_fit: node[0]: min req sockets: 1, min avail cores: 16
slurmctld: debug3: cons_res: best_fit: using node[0]: board[0]: socket[1]: 8 cores available
slurmctld: debug3: cons_res: _add_job_to_res: job 44 act 0 
slurmctld: debug3: cons_res: adding job 44 to part compute row 0
slurmctld: debug3: sched: JobId=44 initiated
slurmctld: sched: Allocate JobId=44 NodeList=node01 #CPUs=2
slurmctld: debug2: Spawning RPC agent for msg_type 4005
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug3: Tree sending to node01
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug3: Writing job id 44 to header record of job_state file
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: node_did_resp node01
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=44
slurmctld: error: slurmd error running JobId=44 on node(s)=node01: Slurmd could not create a batch directory or file
slurmctld: update_node: node node01 reason set to: batch job complete failure
slurmctld: update_node: node node01 state set to DRAINING
slurmctld: completing job 44
slurmctld: Requeue JobId=44 due to node failure
slurmctld: debug3: cons_res: _rm_job_from_res: job 44 action 0
slurmctld: debug3: cons_res: removed job 44 from part compute row 0
slurmctld: debug2: Spawning RPC agent for msg_type 6011
slurmctld: sched: job_complete for JobId=44 successful
slurmctld: debug2: _slurm_rpc_complete_batch_script JobId=44 usec=171
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug3: Tree sending to node01
slurmctld: debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
slurmctld: debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=44
slurmctld: error: slurmd error running JobId=44 on node(s)=(null): Unspecified error
slurmctld: update_node: invalid node name  (null)
slurmctld: completing job 44
slurmctld: _slurm_rpc_complete_batch_script JobId=44: Invalid node name specified
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug3: make_node_idle: Node node01 is DRAINED
slurmctld: requeue batch job 44
slurmctld: debug2: node_did_resp node01
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug3: Writing job id 44 to header record of job_state file
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug3: sched: JobId=44. State=PENDING. Reason=Resources. Priority=4294901759. Partition=compute.
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: debug:  sched: slurmctld terminating
slurmctld: debug3: _slurmctld_rpc_mgr shutting down
slurmctld: Saving all slurm state
slurmctld: debug3: Writing job id 44 to header record of job_state file
slurmctld: debug3: _slurmctld_background shutting down
[root@golova ~]# 


[root@node01 ~]# slurmd -Dcvvvvvv
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:32 Boards:1 Sockets:2 CoresPerSocket:8 ThreadsPerCore:2
slurmd: debug4: CPU map[0]=>0
slurmd: debug4: CPU map[1]=>16
slurmd: debug4: CPU map[2]=>1
slurmd: debug4: CPU map[3]=>17
slurmd: debug4: CPU map[4]=>2
slurmd: debug4: CPU map[5]=>18
slurmd: debug4: CPU map[6]=>3
slurmd: debug4: CPU map[7]=>19
slurmd: debug4: CPU map[8]=>4
slurmd: debug4: CPU map[9]=>20
slurmd: debug4: CPU map[10]=>5
slurmd: debug4: CPU map[11]=>21
slurmd: debug4: CPU map[12]=>6
slurmd: debug4: CPU map[13]=>22
slurmd: debug4: CPU map[14]=>7
slurmd: debug4: CPU map[15]=>23
slurmd: debug4: CPU map[16]=>8
slurmd: debug4: CPU map[17]=>24
slurmd: debug4: CPU map[18]=>9
slurmd: debug4: CPU map[19]=>25
slurmd: debug4: CPU map[20]=>10
slurmd: debug4: CPU map[21]=>26
slurmd: debug4: CPU map[22]=>11
slurmd: debug4: CPU map[23]=>27
slurmd: debug4: CPU map[24]=>12
slurmd: debug4: CPU map[25]=>28
slurmd: debug4: CPU map[26]=>13
slurmd: debug4: CPU map[27]=>29
slurmd: debug4: CPU map[28]=>14
slurmd: debug4: CPU map[29]=>30
slurmd: debug4: CPU map[30]=>15
slurmd: debug4: CPU map[31]=>31
slurmd: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/topology_none.so
slurmd: topology NONE plugin loaded
slurmd: debug3: Success.
slurmd: Gathering cpu frequency information for 32 cpus
slurmd: debug:  cpu_freq_init: cpu 0, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 1, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 2, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 3, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 4, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 5, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 6, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 7, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 8, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 9, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 10, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 11, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 12, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 13, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 14, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 15, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 16, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 17, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 18, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 19, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 20, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 21, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 22, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 23, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 24, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 25, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 26, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 27, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 28, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 29, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 30, reset freq: 1200000, reset governor: ondemand
slurmd: debug:  cpu_freq_init: cpu 31, reset freq: 1200000, reset governor: ondemand
slurmd: debug3: NodeName    = node01
slurmd: debug3: TopoAddr    = node01
slurmd: debug3: TopoPattern = node
slurmd: debug3: CacheGroups = 0
slurmd: debug3: Confile     = `/g0/opt/slurm/2.5.3/etc/slurm.conf'
slurmd: debug3: Debug       = 7
slurmd: debug3: CPUs        = 32 (CF: 32, HW: 32)
slurmd: debug3: Boards      = 1  (CF:  1, HW:  1)
slurmd: debug3: Sockets     = 2  (CF:  2, HW:  2)
slurmd: debug3: Cores       = 8  (CF:  8, HW:  8)
slurmd: debug3: Threads     = 2  (CF:  2, HW:  2)
slurmd: debug3: UpTime      = 11315 = 03:08:35
slurmd: debug3: Block Map   = 0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23,8,24,9,25,10,26,11,27,12,28,13,29,14,30,15,31
slurmd: debug3: Inverse Map = 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
slurmd: debug3: RealMemory  = 64505
slurmd: debug3: TmpDisk     = 10
slurmd: debug3: Epilog      = `(null)'
slurmd: debug3: Logfile     = `/g0/opt/slurm/log/slurmd_%h.log'
slurmd: debug3: HealthCheck = `(null)'
slurmd: debug3: NodeName    = node01
slurmd: debug3: NodeAddr    = 192.168.0.1
slurmd: debug3: Port        = 6818
slurmd: debug3: Prolog      = `(null)'
slurmd: debug3: TmpFS       = `/tmp'
slurmd: debug3: Public Cert = `(null)'
slurmd: debug3: Slurmstepd  = `/g0/opt/slurm/2.5.3/sbin/slurmstepd'
slurmd: debug3: Spool Dir   = `/g0/opt/slurm/2.5.3/var/spool'
slurmd: debug3: Pid File    = `/g0/opt/slurm/2.5.3/var/run/slurmd_node01.pid'
slurmd: debug3: Slurm UID   = 13001
slurmd: debug3: TaskProlog  = `(null)'
slurmd: debug3: TaskEpilog  = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: Use PAM     = 0
slurmd: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/jobacct_gather_none.so
slurmd: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/proctrack_pgid.so
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/task_affinity.so
slurmd: task affinity plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/auth_munge.so
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug3: Success.
slurmd: debug:  spank: opening plugin stack /g0/opt/slurm/2.5.3/etc/plugstack.conf
slurmd: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/crypto_munge.so
slurmd: Munge cryptographic signature plugin loaded
slurmd: debug3: Success.
slurmd: debug3: initializing slurmd spool directory
slurmd: debug3: slurmd initialization successful
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 2.5.3 started
slurmd: debug3: finished daemonize
slurmd: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/switch_none.so
slurmd: switch NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: successfully opened slurm listen port 192.168.0.1:6818
slurmd: slurmd started on Mon, 18 Feb 2013 21:41:35 -0500
slurmd: Procs=32 Boards=1 Sockets=2 Cores=8 Threads=2 Memory=64505 TmpDisk=10 Uptime=11315
slurmd: debug3: Trying to load plugin /g0/opt/slurm/2.5.3/lib/slurm/acct_gather_energy_none.so
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 1001
slurmd: debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
slurmd: debug3: Procs=32 Boards=1 Sockets=2 Cores=8 Threads=2 Memory=64505 TmpDisk=10 Uptime=11316
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: task_slurmd_batch_request: 44
slurmd: debug3: task/affinity: job 44 CPU mask from slurmctld: 0x8000
slurmd: task/affinity: job 44 CPU input mask for node: 0xC0000000
slurmd: debug3: _lllp_map_abstract_masks
slurmd: task/affinity: job 44 CPU final HW mask for node: 0x80008000
slurmd: debug:  Calling /g0/opt/slurm/2.5.3/sbin/slurmstepd spank prolog
spank-prolog: Reading slurm.conf file: /g0/opt/slurm/2.5.3/etc/slurm.conf
spank-prolog: Running spank/prolog for jobid [44] uid [12901]
spank-prolog: spank: opening plugin stack /g0/opt/slurm/2.5.3/etc/plugstack.conf
slurmd: Launching batch job 44 for UID 12901
slurmd: debug3: _rpc_batch_job: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank -1 (node01), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _send_slurmstepd_init: call to getpwuid_r
slurmd: debug3: _send_slurmstepd_init: return from getpwuid_r
slurmd: debug3: in the service_connection
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job, uid = 13001
slurmd: debug3: _rpc_batch_job: return from _forkexec_slurmstepd: -1
slurmd: debug:  task_slurmd_release_resources: 44
slurmd: debug:  credential for job 44 revoked
slurmd: debug2: No steps in jobid 44 to send signal 998
slurmd: debug2: No steps in jobid 44 to send signal 18
slurmd: debug2: No steps in jobid 44 to send signal 15
slurmd: debug4: sent ALREADY_COMPLETE
slurmd: debug2: set revoke expiration for jobid 44 to 1361242923 UTS
^Cslurmd: got shutdown request
slurmd: all threads complete
slurmd: task affinity plugin unloaded
slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
slurmd: debug3: destroying job 44 state
slurmd: Munge cryptographic signature plugin unloaded
slurmd: Slurmd shutdown completing
[root@node01 ~]# 
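If nothing obvious turns up, my next idea is to re-run slurmd under strace on node01 to catch the exact mkdir()/open() that fails (assuming strace is installed there):

# trace file-related syscalls while resubmitting the batch job from golova
strace -f -e trace=file -o /tmp/slurmd.trace slurmd -Dcvvvvvv
grep -E 'EACCES|ENOENT|EROFS' /tmp/slurmd.trace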

