Hi,

I'm having trouble understanding why my jobs won't start after an MCDRAM
change and reboot.

Everything seems to work as expected:
   - the reboot happens
   - the MCDRAM change is applied after the reboot
   - slurmd on the node comes up without any errors
   - the controller seems to acknowledge the MCDRAM change
     (prolog_running_decr: Configuration for job 495 is complete)
   - the node works correctly if the same job is rescheduled

*From the submit node:*
#srun --constraint=flat -w host033 -B 1:1:1 --time=23:00:00 -N 1 -n 64 -p long a.ot 1000000000
srun: error: Nodes host033 are still not ready
srun: error: Something is wrong with the boot of the nodes.

*From the controller logs:*
[2017-03-08T10:49:54.483] sched: _slurm_rpc_allocate_resources JobId=495 NodeList=host033 usec=2887
[2017-03-08T10:55:28.182] _update_node_avail_features: nodes host033 available features set to: a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto,knl
[2017-03-08T10:55:28.182] _update_node_active_features: nodes host033 active features set to: a2a,flat,knl
[2017-03-08T10:55:28.182] Node host033 now responding
[2017-03-08T10:55:29.495] Job 495 boot complete for all 1 nodes
[2017-03-08T10:55:29.496] prolog_running_decr: Configuration for job 495 is complete
[2017-03-08T10:57:30.960] job_complete: JobID=495 State=0x1 NodeCnt=1 WTERMSIG 1
[2017-03-08T10:57:30.960] job_complete: JobID=495 State=0x8003 NodeCnt=1 done
[2017-03-08T10:58:36.605] error: Nodes host033 not responding
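
To make the timing in the controller log easier to read, here is a minimal Python sketch that computes the elapsed time between the key events. The timestamps are copied from the log above; the event labels and the `elapsed` and `deltas` helpers are illustrative names, not Slurm terminology.

```python
from datetime import datetime

# Timestamps copied from the slurmctld log above; the labels are illustrative.
EVENTS = [
    ("job 495 allocated", "2017-03-08T10:49:54.483"),
    ("node responding", "2017-03-08T10:55:28.182"),
    ("job configuration complete", "2017-03-08T10:55:29.496"),
    ("job_complete", "2017-03-08T10:57:30.960"),
    ("node not responding", "2017-03-08T10:58:36.605"),
]

FMT = "%Y-%m-%dT%H:%M:%S.%f"


def elapsed(start, end):
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def deltas(events):
    """Pair each event with the next one and compute the gap in seconds."""
    return [
        (first[0], second[0], elapsed(first[1], second[1]))
        for first, second in zip(events, events[1:])
    ]


for first, second, secs in deltas(EVENTS):
    print(f"{first} -> {second}: {secs:.3f} s")
```

On these timestamps the node comes back roughly 334 seconds after the allocation, and job_complete is logged about 121 seconds after the configuration is reported complete.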

*From the host slurmd logs:*
[2017-03-08T10:49:54.485] Node reboot request with features flat,a2a being processed
[2017-03-08T10:49:55.447] got shutdown request
[2017-03-08T10:49:55.447] waiting on 2 active threads
[2017-03-08T10:55:22.783] Considering each NUMA node as a socket
[2017-03-08T10:55:22.788] Message aggregation disabled
[2017-03-08T10:55:22.794] s_p_parse_file: file "/etc/slurm/gres.conf" is empty
[2017-03-08T10:55:22.795] topology tree plugin loaded
[2017-03-08T10:55:22.797] route default plugin loaded
[2017-03-08T10:55:27.066] Resource spec: Reserved system memory limit not configured for this node
[2017-03-08T10:55:27.689] Considering each NUMA node as a socket
[2017-03-08T10:55:27.694] cgroup namespace 'freezer' is now mounted
[2017-03-08T10:55:27.696] cgroup namespace 'cpuset' is now mounted
[2017-03-08T10:55:27.697] cgroup namespace 'memory' is now mounted
[2017-03-08T10:55:27.698] task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000
000000000000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
[2017-03-08T10:55:27.698] Munge cryptographic signature plugin loaded
[2017-03-08T10:55:27.701] Warning: Core limit is only 0 KB
[2017-03-08T10:55:27.701] slurmd version 17.02.1-2 started
[2017-03-08T10:55:27.707] slurmd started on Wed, 08 Mar 2017 10:55:27 -0500
[2017-03-08T10:55:28.175] CPUs=256 Boards=1 Sockets=1 Cores=64 Threads=4 Memory=209284 TmpDisk=46977 Uptime=71 CPUSpecList=(null) FeaturesAvail=a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto FeaturesActive=a2a,flat


*Node status after reboot:*
scontrol show node host033
NodeName=host033 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=0 CPUErr=0 CPUTot=256 CPULoad=0.46
   AvailableFeatures=a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto,knl
   ActiveFeatures=a2a,flat,knl
   Gres=hbm:16G
   NodeAddr=host033 NodeHostName=host033 Version=17.02
   OS=Linux RealMemory=190000 AllocMem=0 FreeMem=204262 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=4 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=long,scavenger
   BootTime=2017-03-08T10:54:17 SlurmdStartTime=2017-03-08T10:55:27
   CfgTRES=cpu=256,mem=190000M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
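
As a quick sanity check that the requested constraint matches what the node reports, here is a small Python sketch that parses KEY=VALUE tokens from the scontrol output and tests whether a feature is active. The excerpt is copied from the output above; `parse_fields` and `feature_is_active` are hypothetical helper names, and the parsing assumes values contain no spaces.

```python
# Excerpt copied from the `scontrol show node host033` output above.
NODE_INFO = """\
AvailableFeatures=a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto,knl
ActiveFeatures=a2a,flat,knl
State=IDLE ThreadsPerCore=4 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
"""


def parse_fields(text):
    """Split whitespace-separated KEY=VALUE tokens into a dict.

    Assumption: values never contain spaces, which holds for this excerpt.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def feature_is_active(text, feature):
    """Return True if `feature` appears in the ActiveFeatures list."""
    fields = parse_fields(text)
    return feature in fields.get("ActiveFeatures", "").split(",")


# The job was submitted with --constraint=flat.
print(feature_is_active(NODE_INFO, "flat"))
```

For this excerpt the check passes for `flat`, which agrees with the node showing State=IDLE with the requested MCDRAM mode active.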



*Relevant slurm.conf configs:*
ReturnToService          = 1
SlurmctldTimeout=900
SlurmdTimeout=900
ResumeTimeout=900

NodeFeaturesPlugins        = knl_generic
GresTypes                       = hbm
RebootProgram                = /sbin/reboot

*knl_generic.conf:*
SyscfgPath=/usr/bin/syscfg/syscfg
DefaultNUMA=a2a         # NUMA=all2all
AllowNUMA=a2a,snc2,hemi
DefaultMCDRAM=flat     # MCDRAM=flat
BootTime=600
SyscfgTimeout=5000

*slurm version: 17.02.1-2*

Any help or ideas would be more than welcome; I'm not sure what is wrong.

Thank you,
Costin
