Just leaving a trail for future Googlers. My colleague did an extensive
search for answers and came up empty.
We ran into an issue after disabling hyperthreading on one of our CentOS
clusters.
Here's the scenario:
- Our compute nodes had hyperthreading enabled while we evaluated the
costs and benefits.
- SLURM was configured to schedule only one job per real core. For
example, nodes with 24 cores / 48 virtual are configured as follows:
NodeName=compute-[029-083] RealMemory=64000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
- I added a command to /etc/rc.d/rc.local to disable hyperthreading
on the next reboot.
- No changes were made to slurm.conf.
- After rebooting with hyperthreading disabled, certain jobs landing
on the node would fail with the following error:
slurmstepd: Failed task affinity setup
- Restarting the scheduler cleared up the issue.
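
For anyone trying to reproduce the setup: the exact rc.local command is not
shown above, so the following is only a sketch of one common approach,
assuming the standard Linux sysfs files
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list and
/sys/devices/system/cpu/cpuN/online. It keeps the lowest-numbered thread of
each core and prints (without executing) the writes that would take the
remaining threads offline. The function names are made up for illustration.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,24" or "0-1" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_to_offline(sibling_lists):
    """Given {cpu: siblings_text}, keep the lowest-numbered thread of each
    core and return every other logical CPU, sorted."""
    extra = set()
    for text in sibling_lists.values():
        extra.update(parse_cpu_list(text)[1:])
    return sorted(extra)


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for each CPU directory found under sysfs."""
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    result = {}
    for path in glob.glob(pattern):
        cpu_dir = path.split(os.sep)[-3]  # e.g. "cpu24"
        with open(path) as handle:
            result[int(cpu_dir[3:])] = handle.read()
    return result


if __name__ == "__main__":
    for cpu in siblings_to_offline(read_sibling_lists()):
        # Dry run only: show the write that would take this thread offline.
        print(f"echo 0 > /sys/devices/system/cpu/cpu{cpu}/online")
```

Printing the commands rather than writing to sysfs directly makes it easy to
review the list before pasting it into rc.local.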
Does anybody know what would cause this? My best hypothesis is that
slurmctld is caching some probed hardware info from slurmd that changed
when hyperthreading was disabled.
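
One way to look into that hypothesis, sketched here and not a confirmed
diagnosis: `slurmd -C` prints the hardware that slurmd detects on a node as a
slurm.conf-style line, which can be compared field by field with the
configured NodeName line. The helper names below are hypothetical, and the
"probed" line in the usage example is invented to show what a node with
hyperthreading enabled might report; it is not output from our cluster.

```python
def parse_node_line(line):
    """Split a slurm.conf-style "Key=Value Key=Value" line into a dict."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def topology_diff(configured, probed,
                  keys=("Sockets", "CoresPerSocket", "ThreadsPerCore")):
    """Return {key: (configured_value, probed_value)} for each topology key
    that appears in both lines with different values."""
    conf = parse_node_line(configured)
    seen = parse_node_line(probed)
    return {
        key: (conf[key], seen[key])
        for key in keys
        if key in conf and key in seen and conf[key] != seen[key]
    }


if __name__ == "__main__":
    configured = ("NodeName=compute-[029-083] RealMemory=64000 Sockets=2 "
                  "CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN")
    # Hypothetical probe result, for illustration only.
    probed = ("NodeName=compute-029 CPUs=48 Sockets=2 CoresPerSocket=12 "
              "ThreadsPerCore=2 RealMemory=64000")
    print(topology_diff(configured, probed))
```

Running the comparison before and after the reboot, and again after
restarting the scheduler, would show whether the topology slurmd reports
changed underneath the configuration.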
Cheers,
Jason