Hi all,
Sometimes we see slurmd lock up on slave nodes, with the last message in
the slurmd log file being "active_threads == MAX_THREADS(130)".
The node is still physically up, we can ssh to it, and a status request
still reports slurmd as running:
[root@node-sw-008 ~]# /etc/init.d/slurm status
slurmd (pid 5680) is running...
However, Slurm reports the node as down and not responding:
[root@master-01 ~]# scontrol show node node-sw-008
NodeName=node-sw-008 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=N/A Features=(null)
Gres=(null)
NodeAddr=node-sw-008 NodeHostName=node-sw-008 Version=(null)
OS=Linux RealMemory=129138 AllocMem=0 Sockets=2 Boards=1
State=DOWN* ThreadsPerCore=1 TmpDisk=2015 Weight=1
BootTime=2015-01-07T12:54:45 SlurmdStartTime=2015-01-07T12:56:27
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [root@2015-01-07T14:13:15]
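The check above can be scripted when many nodes need watching. The following is only a minimal sketch: it assumes the `scontrol show node` text has already been captured as a string, and the function names `parse_scontrol_node` and `is_not_responding` are made up for illustration. It relies on the trailing `*` in the State value, which Slurm uses to flag a node that is not responding.

```python
def parse_scontrol_node(text):
    """Parse `scontrol show node` output into a dict of Key=Value fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # The Reason value contains spaces, so keep the rest of the line whole.
        if line.startswith("Reason="):
            fields["Reason"] = line[len("Reason="):]
            continue
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
    return fields


def is_not_responding(fields):
    """True when the State value carries the trailing '*' flag."""
    return fields.get("State", "").endswith("*")


# Three lines taken from the output shown above.
sample = """NodeName=node-sw-008 Arch=x86_64 CoresPerSocket=8
State=DOWN* ThreadsPerCore=1 TmpDisk=2015 Weight=1
Reason=Not responding [root@2015-01-07T14:13:15]"""

node = parse_scontrol_node(sample)
print(node["NodeName"], node["State"], is_not_responding(node))
```

Comparing this flag with the init script status on the node itself would show the mismatch described here: slurmd has a live pid while the controller sees no response.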
I have seen something similar mentioned about a year ago, but that was
fixed by an upgrade from 2.6.0 to 2.6.4, and this cluster is running
Slurm 14.03.0:
[root@master-01 ~]# sinfo --version
slurm 14.03.0
Does anyone have any suggestions?
We also have the topology plugin configured. Its configuration is
autogenerated from ibnetdiscover, so it lists some hostnames that are not
used by Slurm (these are the "lookup failure" errors in the log below).
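One way to quiet the lookup errors would be to strip the unused hostnames from the generated file before Slurm reads it. This is a rough sketch, not part of the setup described here: `filter_topology_line` and `known_nodes` are hypothetical names, and it assumes each `Nodes=` field is a plain comma-separated list with no bracketed ranges such as `node-sw-[001-024]`.

```python
def filter_topology_line(line, known_nodes):
    """Drop hostnames not in known_nodes from the Nodes= field of one line.

    Returns the rewritten line, or None when no known node is left on the
    switch (any parent Switches= list would then need adjusting by hand).
    Assumes Nodes= is a plain comma-separated list without hostlist ranges.
    """
    tokens = []
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key.lower() == "nodes":
            kept = [name for name in value.split(",") if name in known_nodes]
            if not kept:
                return None
            token = key + "=" + ",".join(kept)
        tokens.append(token)
    return " ".join(tokens)


# Hostnames taken from switch S010 in the log; only vis-02 is a Slurm node.
known_nodes = {"vis-01", "vis-02"}
line = "SwitchName=S010 Nodes=beegfs-02,beegfs-04,master-02,stor-gw-02,vis-02"
print(filter_topology_line(line, known_nodes))
```

The set of known nodes would have to come from the NodeName entries in slurm.conf; how to extract it is left out of this sketch.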
Below is the slurmd log from an affected node, starting from the last
slurmd start on that node.
At the time of the lock-up the node appears to be idle.
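For anyone wanting to pull the same excerpt from their own nodes, here is a small sketch under stated assumptions: the helper names are invented, the start marker is the "slurmd version ... started" line that appears in this log, and the search string is the message quoted at the top of this post.

```python
import re

# Matches lines such as "[2015-01-07T12:56:16.276] slurmd version 14.03.0 started".
START_RE = re.compile(r"^\[[^\]]+\] slurmd version \S+ started")
THREAD_LIMIT_TEXT = "active_threads == MAX_THREADS"


def lines_since_last_start(log_lines):
    """Return the lines from the most recent slurmd start marker onward.

    If no start marker is found, the whole list is returned.
    """
    last = 0
    for index, line in enumerate(log_lines):
        if START_RE.match(line):
            last = index
    return log_lines[last:]


def thread_limit_hits(log_lines):
    """Return every line that mentions the thread-limit message."""
    return [line for line in log_lines if THREAD_LIMIT_TEXT in line]
```

A typical use would be to read `/var/log/slurmd` (the Logfile value shown in the log), split it into lines, and pass the result of `lines_since_last_start` to `thread_limit_hits`.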
Thanks
Antony
[2015-01-07T12:56:16.102] debug2: hwloc_topology_init
[2015-01-07T12:56:16.158] debug2: hwloc_topology_load
[2015-01-07T12:56:16.171] debug: CPUs:16 Boards:1 Sockets:2
CoresPerSocket:8 ThreadsPerCore:1
[2015-01-07T12:56:16.171] debug4: CPU map[0]=>0
[2015-01-07T12:56:16.171] debug4: CPU map[1]=>1
[2015-01-07T12:56:16.171] debug4: CPU map[2]=>2
[2015-01-07T12:56:16.171] debug4: CPU map[3]=>3
[2015-01-07T12:56:16.171] debug4: CPU map[4]=>4
[2015-01-07T12:56:16.171] debug4: CPU map[5]=>5
[2015-01-07T12:56:16.171] debug4: CPU map[6]=>6
[2015-01-07T12:56:16.171] debug4: CPU map[7]=>7
[2015-01-07T12:56:16.171] debug4: CPU map[8]=>8
[2015-01-07T12:56:16.171] debug4: CPU map[9]=>9
[2015-01-07T12:56:16.171] debug4: CPU map[10]=>10
[2015-01-07T12:56:16.171] debug4: CPU map[11]=>11
[2015-01-07T12:56:16.171] debug4: CPU map[12]=>12
[2015-01-07T12:56:16.171] debug4: CPU map[13]=>13
[2015-01-07T12:56:16.171] debug4: CPU map[14]=>14
[2015-01-07T12:56:16.171] debug4: CPU map[15]=>15
[2015-01-07T12:56:16.172] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/gres_gpu.so
[2015-01-07T12:56:16.173] debug: init: Gres GPU plugin loaded
[2015-01-07T12:56:16.173] debug3: Success.
[2015-01-07T12:56:16.173] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/gres_mic.so
[2015-01-07T12:56:16.175] debug3: Success.
[2015-01-07T12:56:16.222] Gres Name=gpu Count=0
[2015-01-07T12:56:16.222] Gres Name=mic Count=0
[2015-01-07T12:56:16.222] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/topology_tree.so
[2015-01-07T12:56:16.234] topology tree plugin loaded
[2015-01-07T12:56:16.234] debug3: Success.
[2015-01-07T12:56:16.234] debug: Reading the topology.conf file
[2015-01-07T12:56:16.247] error: find_node_record: lookup failure for
beegfs-01
[2015-01-07T12:56:16.247] debug2: _node_name2bitmap: invalid node
specified beegfs-01
[2015-01-07T12:56:16.247] error: find_node_record: lookup failure for
beegfs-03
[2015-01-07T12:56:16.247] debug2: _node_name2bitmap: invalid node
specified beegfs-03
[2015-01-07T12:56:16.247] error: find_node_record: lookup failure for
master-01
[2015-01-07T12:56:16.247] debug2: _node_name2bitmap: invalid node
specified master-01
[2015-01-07T12:56:16.247] error: find_node_record: lookup failure for
stor-gw-01
[2015-01-07T12:56:16.247] debug2: _node_name2bitmap: invalid node
specified stor-gw-01
[2015-01-07T12:56:16.247] error: find_node_record: lookup failure for
beegfs-02
[2015-01-07T12:56:16.247] debug2: _node_name2bitmap: invalid node
specified beegfs-02
[2015-01-07T12:56:16.247] error: find_node_record: lookup failure for
beegfs-04
[2015-01-07T12:56:16.247] debug2: _node_name2bitmap: invalid node
specified beegfs-04
[2015-01-07T12:56:16.247] error: find_node_record: lookup failure for
master-02
[2015-01-07T12:56:16.247] debug2: _node_name2bitmap: invalid node
specified master-02
[2015-01-07T12:56:16.247] error: find_node_record: lookup failure for
stor-gw-02
[2015-01-07T12:56:16.247] debug2: _node_name2bitmap: invalid node
specified stor-gw-02
[2015-01-07T12:56:16.247] error: WARNING: switches lack access to 17
nodes:
dev-ngpu-[01-02],dev-phi-[01-02],node-as-phi-[01-03],node-dw-phi-001-mic0,node-dw-phi-002-mic0,node-dw-phi-003-mic0,node-dw-phi-004-mic0,node-dw-phi-005-mic0,node-dw-phi-006-mic0,node-dw-phi-007-mic0,node-dw-phi-008-mic0,node-sw-[002,004]
[2015-01-07T12:56:16.247] error: WARNING: Invalid hostnames in switch
configuration:
beegfs-[01,03],master-01,stor-gw-01,beegfs-[02,04],master-02,stor-gw-02
[2015-01-07T12:56:16.247] debug: Switch level:0 name:S000
nodes:node-sw-073,node-sw-074,node-sw-075,node-sw-076,node-sw-077,node-sw-078,node-sw-079,node-sw-080,node-sw-081,node-sw-082,node-sw-083,node-sw-084,node-sw-085,node-sw-086,node-sw-087,node-sw-088,node-sw-089,node-sw-090,node-sw-091,node-sw-092,node-sw-093,node-sw-094,node-sw-095,node-sw-096
switches:(null)
[2015-01-07T12:56:16.247] debug: Switch level:0 name:S001
nodes:node-sw-001,node-sw-003,node-sw-005,node-sw-006,node-sw-007,node-sw-008,node-sw-009,node-sw-010,node-sw-011,node-sw-012,node-sw-013,node-sw-014,node-sw-015,node-sw-016,node-sw-017,node-sw-018,node-sw-019,node-sw-020,node-sw-021,node-sw-022,node-sw-023,node-sw-024
switches:(null)
[2015-01-07T12:56:16.247] debug: Switch level:0 name:S002
nodes:node-sw-097,node-sw-098,node-sw-099,node-sw-100,node-sw-101,node-sw-102,node-sw-103,node-sw-104,node-sw-105,node-sw-106,node-sw-107,node-sw-108,node-sw-109,node-sw-110,node-sw-111,node-sw-112,node-sw-113,node-sw-114,node-sw-115,node-sw-116,node-sw-117,node-sw-118,node-sw-119,node-sw-120
switches:(null)
[2015-01-07T12:56:16.247] debug: Switch level:0 name:S003
nodes:node-sw-145,node-sw-146,node-sw-147,node-sw-148,node-sw-149,node-sw-150,node-sw-151,node-sw-152,node-sw-153,node-sw-154,node-sw-155,node-sw-156,node-sw-157,node-sw-158,node-sw-159,node-sw-160,node-sw-161,node-sw-162,node-sw-163,node-sw-164,node-sw-165,node-sw-166,node-sw-167,node-sw-168
switches:(null)
[2015-01-07T12:56:16.248] debug: Switch level:0 name:S004
nodes:balena-01,beegfs-01,beegfs-03,master-01,node-sw-fat-01,node-sw-fat-02,stor-gw-01,vis-01
switches:(null)
[2015-01-07T12:56:16.248] debug: Switch level:0 name:S005
nodes:balena-02,node-as-01,node-as-02,node-as-agpu-01,node-as-ngpu-01,node-as-ngpu-02,node-as-ngpu-03,node-as-ngpu-04
switches:(null)
[2015-01-07T12:56:16.248] debug: Switch level:0 name:S006
nodes:node-sw-025,node-sw-026,node-sw-027,node-sw-028,node-sw-029,node-sw-030,node-sw-031,node-sw-032,node-sw-033,node-sw-034,node-sw-035,node-sw-036,node-sw-037,node-sw-038,node-sw-039,node-sw-040,node-sw-041,node-sw-042,node-sw-043,node-sw-044,node-sw-045,node-sw-046,node-sw-047,node-sw-048
switches:(null)
[2015-01-07T12:56:16.248] debug: Switch level:0 name:S007
nodes:node-sw-049,node-sw-050,node-sw-051,node-sw-052,node-sw-053,node-sw-054,node-sw-055,node-sw-056,node-sw-057,node-sw-058,node-sw-059,node-sw-060,node-sw-061,node-sw-062,node-sw-063,node-sw-064,node-sw-065,node-sw-066,node-sw-067,node-sw-068,node-sw-069,node-sw-070,node-sw-071,node-sw-072
switches:(null)
[2015-01-07T12:56:16.248] debug: Switch level:0 name:S008
nodes:node-sw-121,node-sw-122,node-sw-123,node-sw-124,node-sw-125,node-sw-126,node-sw-127,node-sw-128,node-sw-129,node-sw-130,node-sw-131,node-sw-132,node-sw-133,node-sw-134,node-sw-135,node-sw-136,node-sw-137,node-sw-138,node-sw-139,node-sw-140,node-sw-141,node-sw-142,node-sw-143,node-sw-144
switches:(null)
[2015-01-07T12:56:16.248] debug: Switch level:0 name:S009
nodes:node-dw-ngpu-001,node-dw-ngpu-002,node-dw-ngpu-003,node-dw-ngpu-004,node-dw-phi-001,node-dw-phi-002,node-dw-phi-003,node-dw-phi-004,node-dw-phi-005,node-dw-phi-006,node-dw-phi-007,node-dw-phi-008
switches:(null)
[2015-01-07T12:56:16.248] debug: Switch level:0 name:S010
nodes:beegfs-02,beegfs-04,master-02,stor-gw-02,vis-02 switches:(null)
[2015-01-07T12:56:16.248] debug: Switch level:1 name:S011
nodes:balena-[01-02],node-as-[01-02],node-as-agpu-01,node-as-ngpu-[01-04],node-dw-ngpu-[001-004],node-dw-phi-[001-008],node-sw-[001,003,005-168],node-sw-fat-[01-02],vis-[01-02]
switches:S000,S001,S002,S003,S004,S005,S006,S007,S008,S009,S010
[2015-01-07T12:56:16.248] Gathering cpu frequency information for 16 cpus
[2015-01-07T12:56:16.248] debug: cpu_freq_init: cpu 0, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.248] debug: cpu_freq_init: cpu 1, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.248] debug: cpu_freq_init: cpu 2, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.248] debug: cpu_freq_init: cpu 3, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 4, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 5, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 6, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 7, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 8, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 9, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 10, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 11, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.249] debug: cpu_freq_init: cpu 12, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.250] debug: cpu_freq_init: cpu 13, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.250] debug: cpu_freq_init: cpu 14, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.250] debug: cpu_freq_init: cpu 15, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:16.250] debug3: NodeName = node-sw-008
[2015-01-07T12:56:16.250] debug3: TopoAddr = S011.S001.node-sw-008
[2015-01-07T12:56:16.250] debug3: TopoPattern = switch.switch.node
[2015-01-07T12:56:16.250] debug3: CacheGroups = 0
[2015-01-07T12:56:16.250] debug3: ClusterName = balena_test
[2015-01-07T12:56:16.250] debug3: Confile = `/etc/slurm/slurm.conf'
[2015-01-07T12:56:16.250] debug3: Debug = 9
[2015-01-07T12:56:16.250] debug3: CPUs = 16 (CF: 16, HW: 16)
[2015-01-07T12:56:16.250] debug3: Boards = 1 (CF: 1, HW: 1)
[2015-01-07T12:56:16.250] debug3: Sockets = 2 (CF: 2, HW: 2)
[2015-01-07T12:56:16.250] debug3: Cores = 8 (CF: 8, HW: 8)
[2015-01-07T12:56:16.250] debug3: Threads = 1 (CF: 1, HW: 1)
[2015-01-07T12:56:16.250] debug3: UpTime = 91 = 00:01:31
[2015-01-07T12:56:16.250] debug3: Block Map =
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
[2015-01-07T12:56:16.250] debug3: Inverse Map =
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
[2015-01-07T12:56:16.250] debug3: RealMemory = 129138
[2015-01-07T12:56:16.250] debug3: TmpDisk = 2015
[2015-01-07T12:56:16.250] debug3: Epilog =
`/cm/local/apps/cmd/scripts/epilog'
[2015-01-07T12:56:16.250] debug3: Logfile = `/var/log/slurmd'
[2015-01-07T12:56:16.250] debug3: HealthCheck = `(null)'
[2015-01-07T12:56:16.250] debug3: NodeName = node-sw-008
[2015-01-07T12:56:16.250] debug3: NodeAddr = (null)
[2015-01-07T12:56:16.250] debug3: Port = 6818
[2015-01-07T12:56:16.250] debug3: Prolog =
`/cm/local/apps/cmd/scripts/prolog'
[2015-01-07T12:56:16.250] debug3: TmpFS = `/tmp'
[2015-01-07T12:56:16.250] debug3: Public Cert = `(null)'
[2015-01-07T12:56:16.250] debug3: Slurmstepd =
`/cm/shared/apps/slurm/14.03.0/sbin/slurmstepd'
[2015-01-07T12:56:16.250] debug3: Spool Dir =
`/cm/local/apps/slurm/var/spool'
[2015-01-07T12:56:16.250] debug3: Pid File = `/var/run/slurm/slurmd.pid'
[2015-01-07T12:56:16.250] debug3: Slurm UID = 450
[2015-01-07T12:56:16.250] debug3: TaskProlog = `(null)'
[2015-01-07T12:56:16.250] debug3: TaskEpilog = `(null)'
[2015-01-07T12:56:16.250] debug3: TaskPluginParam = 0
[2015-01-07T12:56:16.250] debug3: Use PAM = 0
[2015-01-07T12:56:16.250] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/proctrack_linuxproc.so
[2015-01-07T12:56:16.252] debug3: Success.
[2015-01-07T12:56:16.252] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/task_none.so
[2015-01-07T12:56:16.254] task NONE plugin loaded
[2015-01-07T12:56:16.254] debug3: Success.
[2015-01-07T12:56:16.254] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/auth_munge.so
[2015-01-07T12:56:16.255] auth plugin for Munge
(http://code.google.com/p/munge/) loaded
[2015-01-07T12:56:16.255] debug3: Success.
[2015-01-07T12:56:16.255] debug: spank: opening plugin stack
/etc/slurm/plugstack.conf
[2015-01-07T12:56:16.255] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/crypto_munge.so
[2015-01-07T12:56:16.259] Munge cryptographic signature plugin loaded
[2015-01-07T12:56:16.259] debug3: Success.
[2015-01-07T12:56:16.259] debug3: initializing slurmd spool directory
[2015-01-07T12:56:16.275] debug3: slurmd initialization successful
[2015-01-07T12:56:16.275] Warning: Core limit is only 0 KB
[2015-01-07T12:56:16.276] slurmd version 14.03.0 started
[2015-01-07T12:56:16.276] debug3: finished daemonize
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12877
ctime:150107121257 revoked:150107121257 expires:150107121257
[2015-01-07T12:56:16.342] debug3: not appending expired job 12877 state
[2015-01-07T12:56:16.342] debug3: destroying job 12877 state
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12878
ctime:150107121315 revoked:150107121315 expires:150107121315
[2015-01-07T12:56:16.342] debug3: not appending expired job 12878 state
[2015-01-07T12:56:16.342] debug3: destroying job 12878 state
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12879
ctime:150107121318 revoked:150107121318 expires:150107121318
[2015-01-07T12:56:16.342] debug3: not appending expired job 12879 state
[2015-01-07T12:56:16.342] debug3: destroying job 12879 state
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12880
ctime:150107121419 revoked:150107121419 expires:150107121419
[2015-01-07T12:56:16.342] debug3: not appending expired job 12880 state
[2015-01-07T12:56:16.342] debug3: destroying job 12880 state
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12881
ctime:150107121424 revoked:150107121424 expires:150107121424
[2015-01-07T12:56:16.342] debug3: not appending expired job 12881 state
[2015-01-07T12:56:16.342] debug3: destroying job 12881 state
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12882
ctime:150107121612 revoked:150107121612 expires:150107121612
[2015-01-07T12:56:16.342] debug3: not appending expired job 12882 state
[2015-01-07T12:56:16.342] debug3: destroying job 12882 state
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12883
ctime:150107121801 revoked:150107121801 expires:150107121801
[2015-01-07T12:56:16.342] debug3: not appending expired job 12883 state
[2015-01-07T12:56:16.342] debug3: destroying job 12883 state
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12884
ctime:150107121824 revoked:150107121824 expires:150107121824
[2015-01-07T12:56:16.342] debug3: not appending expired job 12884 state
[2015-01-07T12:56:16.342] debug3: destroying job 12884 state
[2015-01-07T12:56:16.342] debug3: cred_unpack: job 12885
ctime:150107121839 revoked:150107121839 expires:150107121839
[2015-01-07T12:56:16.342] debug3: not appending expired job 12885 state
[2015-01-07T12:56:16.342] debug3: destroying job 12885 state
[2015-01-07T12:56:16.342] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/jobacct_gather_linux.so
[2015-01-07T12:56:16.344] Job accounting gather LINUX plugin loaded
[2015-01-07T12:56:16.344] debug3: Success.
[2015-01-07T12:56:16.344] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/job_container_none.so
[2015-01-07T12:56:16.346] debug: job_container none plugin loaded
[2015-01-07T12:56:16.346] debug3: Success.
[2015-01-07T12:56:16.347] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/core_spec_none.so
[2015-01-07T12:56:16.348] debug3: Success.
[2015-01-07T12:56:16.348] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/switch_none.so
[2015-01-07T12:56:16.350] switch NONE plugin loaded
[2015-01-07T12:56:16.350] debug3: Success.
[2015-01-07T12:56:16.350] debug3: successfully opened slurm listen port
*:6818
[2015-01-07T12:56:16.350] slurmd started on Wed, 07 Jan 2015 12:56:16 +0000
[2015-01-07T12:56:16.351] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1
Memory=129138 TmpDisk=2015 Uptime=91
[2015-01-07T12:56:16.351] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/acct_gather_energy_ipmi.so
[2015-01-07T12:56:16.676] debug3: Success.
[2015-01-07T12:56:16.676] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/acct_gather_profile_none.so
[2015-01-07T12:56:16.678] AcctGatherProfile NONE plugin loaded
[2015-01-07T12:56:16.678] debug3: Success.
[2015-01-07T12:56:16.678] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/acct_gather_infiniband_none.so
[2015-01-07T12:56:16.679] AcctGatherInfiniband NONE plugin loaded
[2015-01-07T12:56:16.679] debug3: Success.
[2015-01-07T12:56:16.679] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/acct_gather_filesystem_none.so
[2015-01-07T12:56:16.681] AcctGatherFilesystem NONE plugin loaded
[2015-01-07T12:56:16.681] debug3: Success.
[2015-01-07T12:56:16.681] debug2: No acct_gather.conf file
(/etc/slurm/acct_gather.conf)
[2015-01-07T12:56:16.681] AcctGatherEnergy IPMI plugin loaded
[2015-01-07T12:56:19.159] error:
ipmi_monitoring_sensor_readings_by_record_id: internal error
[2015-01-07T12:56:26.822] got shutdown request
[2015-01-07T12:56:26.822] all threads complete
[2015-01-07T12:56:26.851] debug: fini: unloading Gres GPU plugin
[2015-01-07T12:56:26.851] Consumable Resources (CR) Node Selection
plugin shutting down ...
[2015-01-07T12:56:26.851] Munge cryptographic signature plugin unloaded
[2015-01-07T12:56:26.851] Slurmd shutdown completing
[2015-01-07T12:56:27.915] debug2: hwloc_topology_init
[2015-01-07T12:56:27.915] debug2: hwloc_topology_load
[2015-01-07T12:56:27.925] debug: CPUs:16 Boards:1 Sockets:2
CoresPerSocket:8 ThreadsPerCore:1
[2015-01-07T12:56:27.925] debug4: CPU map[0]=>0
[2015-01-07T12:56:27.925] debug4: CPU map[1]=>1
[2015-01-07T12:56:27.925] debug4: CPU map[2]=>2
[2015-01-07T12:56:27.925] debug4: CPU map[3]=>3
[2015-01-07T12:56:27.925] debug4: CPU map[4]=>4
[2015-01-07T12:56:27.925] debug4: CPU map[5]=>5
[2015-01-07T12:56:27.925] debug4: CPU map[6]=>6
[2015-01-07T12:56:27.925] debug4: CPU map[7]=>7
[2015-01-07T12:56:27.925] debug4: CPU map[8]=>8
[2015-01-07T12:56:27.925] debug4: CPU map[9]=>9
[2015-01-07T12:56:27.925] debug4: CPU map[10]=>10
[2015-01-07T12:56:27.925] debug4: CPU map[11]=>11
[2015-01-07T12:56:27.925] debug4: CPU map[12]=>12
[2015-01-07T12:56:27.925] debug4: CPU map[13]=>13
[2015-01-07T12:56:27.925] debug4: CPU map[14]=>14
[2015-01-07T12:56:27.925] debug4: CPU map[15]=>15
[2015-01-07T12:56:27.925] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/gres_gpu.so
[2015-01-07T12:56:27.926] debug: init: Gres GPU plugin loaded
[2015-01-07T12:56:27.926] debug3: Success.
[2015-01-07T12:56:27.926] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/gres_mic.so
[2015-01-07T12:56:27.927] debug3: Success.
[2015-01-07T12:56:27.927] Gres Name=gpu Count=1
[2015-01-07T12:56:27.927] Gres Name=mic Count=1
[2015-01-07T12:56:27.927] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/topology_tree.so
[2015-01-07T12:56:27.928] topology tree plugin loaded
[2015-01-07T12:56:27.928] debug3: Success.
[2015-01-07T12:56:27.928] debug: Reading the topology.conf file
[2015-01-07T12:56:27.930] error: find_node_record: lookup failure for
beegfs-01
[2015-01-07T12:56:27.930] debug2: _node_name2bitmap: invalid node
specified beegfs-01
[2015-01-07T12:56:27.930] error: find_node_record: lookup failure for
beegfs-03
[2015-01-07T12:56:27.930] debug2: _node_name2bitmap: invalid node
specified beegfs-03
[2015-01-07T12:56:27.930] error: find_node_record: lookup failure for
master-01
[2015-01-07T12:56:27.930] debug2: _node_name2bitmap: invalid node
specified master-01
[2015-01-07T12:56:27.930] error: find_node_record: lookup failure for
stor-gw-01
[2015-01-07T12:56:27.930] debug2: _node_name2bitmap: invalid node
specified stor-gw-01
[2015-01-07T12:56:27.930] error: find_node_record: lookup failure for
beegfs-02
[2015-01-07T12:56:27.930] debug2: _node_name2bitmap: invalid node
specified beegfs-02
[2015-01-07T12:56:27.930] error: find_node_record: lookup failure for
beegfs-04
[2015-01-07T12:56:27.930] debug2: _node_name2bitmap: invalid node
specified beegfs-04
[2015-01-07T12:56:27.930] error: find_node_record: lookup failure for
master-02
[2015-01-07T12:56:27.930] debug2: _node_name2bitmap: invalid node
specified master-02
[2015-01-07T12:56:27.930] error: find_node_record: lookup failure for
stor-gw-02
[2015-01-07T12:56:27.930] debug2: _node_name2bitmap: invalid node
specified stor-gw-02
[2015-01-07T12:56:27.930] error: WARNING: switches lack access to 17
nodes:
dev-ngpu-[01-02],dev-phi-[01-02],node-as-phi-[01-03],node-dw-phi-001-mic0,node-dw-phi-002-mic0,node-dw-phi-003-mic0,node-dw-phi-004-mic0,node-dw-phi-005-mic0,node-dw-phi-006-mic0,node-dw-phi-007-mic0,node-dw-phi-008-mic0,node-sw-[002,004]
[2015-01-07T12:56:27.930] error: WARNING: Invalid hostnames in switch
configuration:
beegfs-[01,03],master-01,stor-gw-01,beegfs-[02,04],master-02,stor-gw-02
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S000
nodes:node-sw-073,node-sw-074,node-sw-075,node-sw-076,node-sw-077,node-sw-078,node-sw-079,node-sw-080,node-sw-081,node-sw-082,node-sw-083,node-sw-084,node-sw-085,node-sw-086,node-sw-087,node-sw-088,node-sw-089,node-sw-090,node-sw-091,node-sw-092,node-sw-093,node-sw-094,node-sw-095,node-sw-096
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S001
nodes:node-sw-001,node-sw-003,node-sw-005,node-sw-006,node-sw-007,node-sw-008,node-sw-009,node-sw-010,node-sw-011,node-sw-012,node-sw-013,node-sw-014,node-sw-015,node-sw-016,node-sw-017,node-sw-018,node-sw-019,node-sw-020,node-sw-021,node-sw-022,node-sw-023,node-sw-024
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S002
nodes:node-sw-097,node-sw-098,node-sw-099,node-sw-100,node-sw-101,node-sw-102,node-sw-103,node-sw-104,node-sw-105,node-sw-106,node-sw-107,node-sw-108,node-sw-109,node-sw-110,node-sw-111,node-sw-112,node-sw-113,node-sw-114,node-sw-115,node-sw-116,node-sw-117,node-sw-118,node-sw-119,node-sw-120
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S003
nodes:node-sw-145,node-sw-146,node-sw-147,node-sw-148,node-sw-149,node-sw-150,node-sw-151,node-sw-152,node-sw-153,node-sw-154,node-sw-155,node-sw-156,node-sw-157,node-sw-158,node-sw-159,node-sw-160,node-sw-161,node-sw-162,node-sw-163,node-sw-164,node-sw-165,node-sw-166,node-sw-167,node-sw-168
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S004
nodes:balena-01,beegfs-01,beegfs-03,master-01,node-sw-fat-01,node-sw-fat-02,stor-gw-01,vis-01
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S005
nodes:balena-02,node-as-01,node-as-02,node-as-agpu-01,node-as-ngpu-01,node-as-ngpu-02,node-as-ngpu-03,node-as-ngpu-04
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S006
nodes:node-sw-025,node-sw-026,node-sw-027,node-sw-028,node-sw-029,node-sw-030,node-sw-031,node-sw-032,node-sw-033,node-sw-034,node-sw-035,node-sw-036,node-sw-037,node-sw-038,node-sw-039,node-sw-040,node-sw-041,node-sw-042,node-sw-043,node-sw-044,node-sw-045,node-sw-046,node-sw-047,node-sw-048
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S007
nodes:node-sw-049,node-sw-050,node-sw-051,node-sw-052,node-sw-053,node-sw-054,node-sw-055,node-sw-056,node-sw-057,node-sw-058,node-sw-059,node-sw-060,node-sw-061,node-sw-062,node-sw-063,node-sw-064,node-sw-065,node-sw-066,node-sw-067,node-sw-068,node-sw-069,node-sw-070,node-sw-071,node-sw-072
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S008
nodes:node-sw-121,node-sw-122,node-sw-123,node-sw-124,node-sw-125,node-sw-126,node-sw-127,node-sw-128,node-sw-129,node-sw-130,node-sw-131,node-sw-132,node-sw-133,node-sw-134,node-sw-135,node-sw-136,node-sw-137,node-sw-138,node-sw-139,node-sw-140,node-sw-141,node-sw-142,node-sw-143,node-sw-144
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S009
nodes:node-dw-ngpu-001,node-dw-ngpu-002,node-dw-ngpu-003,node-dw-ngpu-004,node-dw-phi-001,node-dw-phi-002,node-dw-phi-003,node-dw-phi-004,node-dw-phi-005,node-dw-phi-006,node-dw-phi-007,node-dw-phi-008
switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:0 name:S010
nodes:beegfs-02,beegfs-04,master-02,stor-gw-02,vis-02 switches:(null)
[2015-01-07T12:56:27.930] debug: Switch level:1 name:S011
nodes:balena-[01-02],node-as-[01-02],node-as-agpu-01,node-as-ngpu-[01-04],node-dw-ngpu-[001-004],node-dw-phi-[001-008],node-sw-[001,003,005-168],node-sw-fat-[01-02],vis-[01-02]
switches:S000,S001,S002,S003,S004,S005,S006,S007,S008,S009,S010
[2015-01-07T12:56:27.930] Gathering cpu frequency information for 16 cpus
[2015-01-07T12:56:27.930] debug: cpu_freq_init: cpu 0, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.930] debug: cpu_freq_init: cpu 1, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.930] debug: cpu_freq_init: cpu 2, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 3, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 4, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 5, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 6, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 7, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 8, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 9, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 10, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 11, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 12, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 13, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 14, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug: cpu_freq_init: cpu 15, reset freq:
1200000, reset governor: ondemand
[2015-01-07T12:56:27.931] debug3: NodeName = node-sw-008
[2015-01-07T12:56:27.931] debug3: TopoAddr = S011.S001.node-sw-008
[2015-01-07T12:56:27.931] debug3: TopoPattern = switch.switch.node
[2015-01-07T12:56:27.931] debug3: CacheGroups = 0
[2015-01-07T12:56:27.931] debug3: ClusterName = balena_test
[2015-01-07T12:56:27.931] debug3: Confile = `/etc/slurm/slurm.conf'
[2015-01-07T12:56:27.931] debug3: Debug = 9
[2015-01-07T12:56:27.931] debug3: CPUs = 16 (CF: 16, HW: 16)
[2015-01-07T12:56:27.931] debug3: Boards = 1 (CF: 1, HW: 1)
[2015-01-07T12:56:27.931] debug3: Sockets = 2 (CF: 2, HW: 2)
[2015-01-07T12:56:27.931] debug3: Cores = 8 (CF: 8, HW: 8)
[2015-01-07T12:56:27.931] debug3: Threads = 1 (CF: 1, HW: 1)
[2015-01-07T12:56:27.931] debug3: UpTime = 102 = 00:01:42
[2015-01-07T12:56:27.931] debug3: Block Map =
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
[2015-01-07T12:56:27.931] debug3: Inverse Map =
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
[2015-01-07T12:56:27.931] debug3: RealMemory = 129138
[2015-01-07T12:56:27.931] debug3: TmpDisk = 2015
[2015-01-07T12:56:27.931] debug3: Epilog =
`/cm/local/apps/cmd/scripts/epilog'
[2015-01-07T12:56:27.931] debug3: Logfile = `/var/log/slurmd'
[2015-01-07T12:56:27.931] debug3: HealthCheck = `(null)'
[2015-01-07T12:56:27.931] debug3: NodeName = node-sw-008
[2015-01-07T12:56:27.931] debug3: NodeAddr = (null)
[2015-01-07T12:56:27.931] debug3: Port = 6818
[2015-01-07T12:56:27.931] debug3: Prolog =
`/cm/local/apps/cmd/scripts/prolog'
[2015-01-07T12:56:27.931] debug3: TmpFS = `/tmp'
[2015-01-07T12:56:27.931] debug3: Public Cert = `(null)'
[2015-01-07T12:56:27.931] debug3: Slurmstepd =
`/cm/shared/apps/slurm/14.03.0/sbin/slurmstepd'
[2015-01-07T12:56:27.931] debug3: Spool Dir =
`/cm/local/apps/slurm/var/spool'
[2015-01-07T12:56:27.931] debug3: Pid File = `/var/run/slurm/slurmd.pid'
[2015-01-07T12:56:27.931] debug3: Slurm UID = 450
[2015-01-07T12:56:27.931] debug3: TaskProlog = `(null)'
[2015-01-07T12:56:27.931] debug3: TaskEpilog = `(null)'
[2015-01-07T12:56:27.931] debug3: TaskPluginParam = 0
[2015-01-07T12:56:27.931] debug3: Use PAM = 0
[2015-01-07T12:56:27.931] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/proctrack_linuxproc.so
[2015-01-07T12:56:27.933] debug3: Success.
[2015-01-07T12:56:27.933] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/task_none.so
[2015-01-07T12:56:27.934] task NONE plugin loaded
[2015-01-07T12:56:27.934] debug3: Success.
[2015-01-07T12:56:27.934] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/auth_munge.so
[2015-01-07T12:56:27.935] auth plugin for Munge
(http://code.google.com/p/munge/) loaded
[2015-01-07T12:56:27.935] debug3: Success.
[2015-01-07T12:56:27.935] debug: spank: opening plugin stack
/etc/slurm/plugstack.conf
[2015-01-07T12:56:27.935] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/crypto_munge.so
[2015-01-07T12:56:27.936] Munge cryptographic signature plugin loaded
[2015-01-07T12:56:27.936] debug3: Success.
[2015-01-07T12:56:27.936] debug3: initializing slurmd spool directory
[2015-01-07T12:56:27.936] debug3: slurmd initialization successful
[2015-01-07T12:56:27.937] Warning: Core limit is only 0 KB
[2015-01-07T12:56:27.937] slurmd version 14.03.0 started
[2015-01-07T12:56:27.937] debug3: finished daemonize
[2015-01-07T12:56:27.938] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/jobacct_gather_linux.so
[2015-01-07T12:56:27.939] Job accounting gather LINUX plugin loaded
[2015-01-07T12:56:27.939] debug3: Success.
[2015-01-07T12:56:27.939] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/job_container_none.so
[2015-01-07T12:56:27.940] debug: job_container none plugin loaded
[2015-01-07T12:56:27.940] debug3: Success.
[2015-01-07T12:56:27.940] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/core_spec_none.so
[2015-01-07T12:56:27.941] debug3: Success.
[2015-01-07T12:56:27.941] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/switch_none.so
[2015-01-07T12:56:27.942] switch NONE plugin loaded
[2015-01-07T12:56:27.942] debug3: Success.
[2015-01-07T12:56:27.942] debug3: successfully opened slurm listen port
*:6818
[2015-01-07T12:56:27.943] slurmd started on Wed, 07 Jan 2015 12:56:27 +0000
[2015-01-07T12:56:27.943] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1
Memory=129138 TmpDisk=2015 Uptime=102
[2015-01-07T12:56:27.943] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/acct_gather_energy_ipmi.so
[2015-01-07T12:56:27.947] debug3: Success.
[2015-01-07T12:56:27.947] debug3: Trying to load plugin
/cm/shared/apps/slurm/14.03.0/lib64/slurm/acct_gather_profile_none.so
[2015-01-07T12:56:27.948] AcctGatherProfile NONE plugin loaded
[2015-01-07T12:56:27.948] debug3: Success.
[2015-01-07T12:56:27.948] debug3: Trying to load plugin /cm/shared/apps/slurm/14.03.0/lib64/slurm/acct_gather_infiniband_none.so
[2015-01-07T12:56:27.949] AcctGatherInfiniband NONE plugin loaded
[2015-01-07T12:56:27.949] debug3: Success.
[2015-01-07T12:56:27.949] debug3: Trying to load plugin /cm/shared/apps/slurm/14.03.0/lib64/slurm/acct_gather_filesystem_none.so
[2015-01-07T12:56:27.950] AcctGatherFilesystem NONE plugin loaded
[2015-01-07T12:56:27.950] debug3: Success.
[2015-01-07T12:56:27.950] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2015-01-07T12:56:27.950] AcctGatherEnergy IPMI plugin loaded
[2015-01-07T12:56:36.305] debug3: in the service_connection
[2015-01-07T12:56:36.305] debug2: got this type of message 6011
[2015-01-07T12:56:36.305] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2015-01-07T12:56:36.305] debug: _rpc_terminate_job, uid = 450
[2015-01-07T12:56:36.306] debug: task_p_slurmd_release_resources: 12886
[2015-01-07T12:56:36.306] debug: credential for job 12886 revoked
[2015-01-07T12:56:36.306] debug2: No steps in jobid 12886 to send signal 18
[2015-01-07T12:56:36.306] debug2: No steps in jobid 12886 to send signal 15
[2015-01-07T12:56:36.306] debug4: sent SUCCESS
[2015-01-07T12:56:36.306] debug2: set revoke expiration for jobid 12886 to 1420636596 UTS
[2015-01-07T12:56:36.308] debug: Waiting for job 12886's prolog to complete
[2015-01-07T12:56:36.308] debug: Finished wait for job 12886's prolog to complete
[2015-01-07T12:56:36.308] debug: Calling /cm/shared/apps/slurm/14.03.0/sbin/slurmstepd spank epilog
[2015-01-07T12:56:36.590] Reading slurm.conf file: /etc/slurm/slurm.conf
[2015-01-07T12:56:36.595] Running spank/epilog for jobid [12886] uid [1000]
[2015-01-07T12:56:36.595] spank: opening plugin stack /etc/slurm/plugstack.conf
[2015-01-07T12:56:36.619] debug: [job 12886] attempting to run epilog [/cm/local/apps/cmd/scripts/epilog]
[2015-01-07T12:56:36.706] debug: completed epilog for jobid 12886
[2015-01-07T12:56:36.707] debug3: slurm_send_only_controller_msg: sent 0
[2015-01-07T12:56:36.707] debug: Job 12886: sent epilog complete msg: rc = 0
[2015-01-07T12:56:38.248] debug3: in the service_connection
[2015-01-07T12:56:38.249] debug2: got this type of message 1017
[2015-01-07T12:56:38.249] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T12:56:38.952] error: ipmi thread init timeout
[2015-01-07T12:56:38.952] error: AcctGatherEnergy IPMI plugin threads failed to start in a timely manner
[2015-01-07T12:56:49.250] debug3: in the service_connection
[2015-01-07T12:56:49.250] debug2: got this type of message 1017
[2015-01-07T12:56:49.250] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T12:57:00.251] debug3: in the service_connection
[2015-01-07T12:57:00.252] debug2: got this type of message 1017
[2015-01-07T12:57:00.252] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T12:57:11.252] debug3: in the service_connection
[2015-01-07T12:57:11.253] debug2: got this type of message 1017
[2015-01-07T12:57:11.253] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T12:57:22.255] debug3: in the service_connection
[2015-01-07T12:57:22.255] debug2: got this type of message 1017
[2015-01-07T12:57:22.255] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T12:57:33.256] debug3: in the service_connection
[2015-01-07T12:57:33.257] debug2: got this type of message 1017
[2015-01-07T12:57:33.257] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T12:57:44.278] debug3: in the service_connection
[2015-01-07T12:57:44.278] debug2: got this type of message 1017
[2015-01-07T12:57:44.278] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T12:57:56.259] debug3: in the service_connection
[2015-01-07T12:57:56.260] debug2: got this type of message 1008
[2015-01-07T12:57:58.260] debug3: in the service_connection
[2015-01-07T12:57:58.261] debug2: got this type of message 1008
[2015-01-07T12:58:00.277] debug3: in the service_connection
[2015-01-07T12:58:00.278] debug2: got this type of message 1008
[2015-01-07T12:58:02.261] debug3: in the service_connection
[2015-01-07T12:58:02.262] debug2: got this type of message 1008
[2015-01-07T12:58:04.262] debug3: in the service_connection
[2015-01-07T12:58:04.263] debug2: got this type of message 1008
[2015-01-07T12:58:05.264] debug3: in the service_connection
[2015-01-07T12:58:05.264] debug2: got this type of message 1017
[2015-01-07T12:58:05.264] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T12:58:28.267] debug3: in the service_connection
[2015-01-07T12:58:28.267] debug2: got this type of message 1008
[2015-01-07T12:58:30.267] debug3: in the service_connection
[2015-01-07T12:58:30.268] debug2: got this type of message 1008
[2015-01-07T12:58:31.271] debug3: in the service_connection
[2015-01-07T12:58:31.271] debug2: got this type of message 1001
. . . another 2 hours of this:
[2015-01-07T14:02:06.587] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T14:02:20.591] debug3: in the service_connection
[2015-01-07T14:02:20.592] debug2: got this type of message 1008
[2015-01-07T14:02:32.588] debug3: in the service_connection
[2015-01-07T14:02:32.588] debug2: got this type of message 1017
[2015-01-07T14:02:32.588] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T14:02:44.595] debug3: in the service_connection
[2015-01-07T14:02:44.595] debug2: got this type of message 1008
[2015-01-07T14:02:55.595] debug3: in the service_connection
[2015-01-07T14:02:55.595] debug2: got this type of message 1017
[2015-01-07T14:02:55.595] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T14:03:07.601] debug3: in the service_connection
[2015-01-07T14:03:07.601] debug2: got this type of message 1008
[2015-01-07T14:03:18.652] debug3: in the service_connection
[2015-01-07T14:03:18.662] debug2: got this type of message 1017
[2015-01-07T14:03:18.662] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T14:03:30.601] debug3: in the service_connection
[2015-01-07T14:03:30.601] debug2: got this type of message 1008
[2015-01-07T14:03:41.621] debug3: in the service_connection
[2015-01-07T14:03:41.621] debug2: got this type of message 1017
[2015-01-07T14:03:41.621] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T14:03:57.615] debug3: in the service_connection
[2015-01-07T14:03:57.615] debug2: got this type of message 1008
[2015-01-07T14:03:59.654] debug3: in the service_connection
[2015-01-07T14:03:59.655] debug2: got this type of message 1008
[2015-01-07T14:04:01.654] debug3: in the service_connection
[2015-01-07T14:04:01.654] debug2: got this type of message 1008
[2015-01-07T14:04:02.655] debug3: in the service_connection
[2015-01-07T14:04:02.656] debug2: got this type of message 1017
[2015-01-07T14:04:02.674] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T14:04:14.660] debug3: in the service_connection
[2015-01-07T14:04:14.661] debug2: got this type of message 1008
[2015-01-07T14:04:16.659] debug3: in the service_connection
[2015-01-07T14:04:16.659] debug2: got this type of message 1008
[2015-01-07T14:04:18.661] debug3: in the service_connection
[2015-01-07T14:04:18.662] debug2: got this type of message 1008
[2015-01-07T14:04:20.661] debug3: in the service_connection
[2015-01-07T14:04:20.661] debug2: got this type of message 1008
[2015-01-07T14:04:22.662] debug3: in the service_connection
[2015-01-07T14:04:22.662] debug2: got this type of message 1008
[2015-01-07T14:04:23.662] debug3: in the service_connection
[2015-01-07T14:04:23.662] debug2: got this type of message 1017
[2015-01-07T14:04:23.662] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T14:04:35.668] debug3: in the service_connection
[2015-01-07T14:04:35.668] debug2: got this type of message 1001
[2015-01-07T14:04:35.668] debug2: Processing RPC: REQUEST_NODE_REGISTRATION_STATUS
[2015-01-07T14:04:35.668] debug3: CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=129138 TmpDisk=2015 Uptime=4190
[2015-01-07T14:04:46.667] debug3: in the service_connection
[2015-01-07T14:04:46.667] debug2: got this type of message 1017
[2015-01-07T14:04:46.667] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE
[2015-01-07T14:11:09.759] active_threads == MAX_THREADS(130)
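
For anyone wanting to check their own slurmd logs for the same pattern, here is a minimal Python sketch that finds the longest silence between consecutive timestamped log lines. It is not part of Slurm; the names `parse_stamps` and `longest_gap` are made up for illustration, and the only thing it assumes is the `[YYYY-MM-DDTHH:MM:SS.mmm]` prefix seen in the log above. Run on the last few lines of this log, it points at the quiet period just before the MAX_THREADS message.

```python
import re
from datetime import datetime

# Matches the leading "[YYYY-MM-DDTHH:MM:SS.mmm]" stamp used in the slurmd log above.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]")


def parse_stamps(lines):
    """Yield (datetime, line) for every line that starts with a timestamp."""
    for line in lines:
        m = STAMP.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            yield ts, line.rstrip("\n")


def longest_gap(lines):
    """Return (seconds, line_before, line_after) for the largest gap, or None."""
    best = None
    prev = None
    for ts, line in parse_stamps(lines):
        if prev is not None:
            gap = (ts - prev[0]).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev[1], line)
        prev = (ts, line)
    return best


# The last three timestamped entries from the log above.
sample = [
    "[2015-01-07T14:04:35.668] debug2: got this type of message 1001",
    "[2015-01-07T14:04:46.667] debug2: Processing RPC: REQUEST_ACCT_GATHER_UPDATE",
    "[2015-01-07T14:11:09.759] active_threads == MAX_THREADS(130)",
]

gap, before, after = longest_gap(sample)
print(f"{gap:.3f}s of silence before: {after}")
```

To scan a whole file instead of the sample, pass an open file handle to `longest_gap`. Continuation lines without a timestamp are skipped, and a manually trimmed log (such as the elided stretch above) will naturally show up as a large gap too.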