Just leaving a trail for future Googlers. My colleague did an extensive search for answers and came up empty.

We ran into an issue after disabling hyperthreading on one of our CentOS clusters.

Here's the scenario:

- Our compute nodes had hyperthreading enabled while we evaluated the costs and benefits.

- SLURM was configured to schedule only one job per physical core. For example, nodes with 24 physical / 48 virtual cores are configured as follows:

    NodeName=compute-[029-083] RealMemory=64000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN

- I added a command to /etc/rc.d/rc.local to disable hyperthreading on the next reboot (roughly the sketch shown after this list).

- No changes were made to slurm.conf.

- After rebooting with hyperthreading disabled, certain jobs landing on those nodes would fail with the following error:

    slurmstepd: Failed task affinity setup

- Restarting the scheduler cleared up the issue.
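
For anyone replicating the hyperthreading change, here is a minimal sketch of the kind of line that went into rc.local. It assumes a kernel that exposes the global SMT switch (present in CentOS/RHEL 7.6+ kernels); on older kernels you would have to offline the sibling CPUs individually instead, and the exact path may differ on your system:

    # appended to /etc/rc.d/rc.local (make sure the file is executable)
    # Disable SMT/hyperthreading at boot if the kernel exposes the global knob.
    if [ -w /sys/devices/system/cpu/smt/control ]; then
        echo off > /sys/devices/system/cpu/smt/control
    fi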

Does anybody know what would cause this? My best hypothesis is that slurmctld caches the hardware layout that slurmd probes at registration, and that cached layout no longer matched the node once hyperthreading was disabled. A quick way to compare the two views is sketched below.
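
For anyone who hits this later, a hedged sketch of how to compare what the node reports against what the controller believes, and how to clear the stale state. compute-029 is just an example node from the range above, and the commands assume CentOS 7 / systemd:

    # On the compute node: what slurmd detects from the hardware right now
    slurmd -C

    # On the controller: what slurmctld currently believes about that node
    scontrol show node compute-029 | grep -E 'CPUTot|CoresPerSocket|ThreadsPerCore|Boards'

    # Restarting the daemons made the two views agree again for us
    systemctl restart slurmd        # on the compute node
    systemctl restart slurmctld     # on the controller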

Cheers,

Jason

--
All wars are civil wars, because all men are brothers ... Each one owes
infinitely more to the human race than to the particular country in
which he was born.
                -- Francois Fenelon
