Hi, I'm having trouble understanding why my jobs won't start after an MCDRAM change and reboot.
Everything seems to work as expected:
- the reboot happens
- the MCDRAM change is applied after the reboot
- slurmd on the node comes up correctly without any errors
- the controller seems to acknowledge the MCDRAM change (prolog_running_decr: Configuration for job 495 is complete)
- the node runs the job correctly if the same job is rescheduled

*From the submit node:*
#srun --constraint=flat -w host033 -B 1:1:1 --time=23:00:00 -N 1 -n 64 -p long a.ot 1000000000
srun: error: Nodes host033 are still not ready
srun: error: Something is wrong with the boot of the nodes.

*From the controller logs:*
[2017-03-08T10:49:54.483] sched: _slurm_rpc_allocate_resources JobId=495 NodeList=host033 usec=2887
[2017-03-08T10:55:28.182] _update_node_avail_features: nodes host033 available features set to: a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto,knl
[2017-03-08T10:55:28.182] _update_node_active_features: nodes host033 active features set to: a2a,flat,knl
[2017-03-08T10:55:28.182] Node host033 now responding
[2017-03-08T10:55:29.495] Job 495 boot complete for all 1 nodes
[2017-03-08T10:55:29.496] prolog_running_decr: Configuration for job 495 is complete
[2017-03-08T10:57:30.960] job_complete: JobID=495 State=0x1 NodeCnt=1 WTERMSIG 1
[2017-03-08T10:57:30.960] job_complete: JobID=495 State=0x8003 NodeCnt=1 done
[2017-03-08T10:58:36.605] error: Nodes host033 not responding

*From the host slurmd logs:*
[2017-03-08T10:49:54.485] Node reboot request with features flat,a2a being processed
[2017-03-08T10:49:55.447] got shutdown request
[2017-03-08T10:49:55.447] waiting on 2 active threads
[2017-03-08T10:55:22.783] Considering each NUMA node as a socket
[2017-03-08T10:55:22.788] Message aggregation disabled
[2017-03-08T10:55:22.794] s_p_parse_file: file "/etc/slurm/gres.conf" is empty
[2017-03-08T10:55:22.795] topology tree plugin loaded
[2017-03-08T10:55:22.797] route default plugin loaded
[2017-03-08T10:55:27.066] Resource spec: Reserved system memory limit not configured for this node
[2017-03-08T10:55:27.689] Considering each NUMA node as a socket
[2017-03-08T10:55:27.694] cgroup namespace 'freezer' is now mounted
[2017-03-08T10:55:27.696] cgroup namespace 'cpuset' is now mounted
[2017-03-08T10:55:27.697] cgroup namespace 'memory' is now mounted
[2017-03-08T10:55:27.698] task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000000000000000000000000000000000000000000000000000 000000000000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
[2017-03-08T10:55:27.698] Munge cryptographic signature plugin loaded
[2017-03-08T10:55:27.701] Warning: Core limit is only 0 KB
[2017-03-08T10:55:27.701] slurmd version 17.02.1-2 started
[2017-03-08T10:55:27.707] slurmd started on Wed, 08 Mar 2017 10:55:27 -0500
[2017-03-08T10:55:28.175] CPUs=256 Boards=1 Sockets=1 Cores=64 Threads=4 Memory=209284 TmpDisk=46977 Uptime=71 CPUSpecList=(null) FeaturesAvail=a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto FeaturesActive=a2a,flat

*Node status after reboot:*
scontrol show node host033
NodeName=host033 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=0 CPUErr=0 CPUTot=256 CPULoad=0.46
   AvailableFeatures=a2a,hemi,quad,snc2,snc4,cache,flat,hybrid,auto,knl
   ActiveFeatures=a2a,flat,knl
   Gres=hbm:16G
   NodeAddr=host033 NodeHostName=host033 Version=17.02
   OS=Linux RealMemory=190000 AllocMem=0 FreeMem=204262 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=4 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=long,scavenger
   BootTime=2017-03-08T10:54:17 SlurmdStartTime=2017-03-08T10:55:27
   CfgTRES=cpu=256,mem=190000M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

*Relevant slurm.conf configs:*
ReturnToService = 1
SlurmctldTimeout=900
SlurmdTimeout=900
ResumeTimeout=900
NodeFeaturesPlugins = knl_generic
GresTypes = hbm
RebootProgram = /sbin/reboot

*knl_generic.conf:*
SyscfgPath=/usr/bin/syscfg/syscfg
DefaultNUMA=a2a # NUMA=all2all
AllowNUMA=a2a,snc2,hemi
DefaultMCDRAM=flat # MCDRAM=flat
BootTime=600
SyscfgTimeout=5000

*slurm version: 17.02.1-2*

Any help or ideas would be more than welcome; I'm not sure what is wrong.

Thank you,
Costin
