[slurm-dev] slurmstepd: Failed task affinity setup

Jason Bacon Wed, 18 May 2016 06:01:34 -0700

Just leaving a trail for future Googlers. My colleague did an extensive search for answers and came up empty.

We ran into an issue after disabling hyperthreading on one of our CentOS clusters.


Here's the scenario:

- Our compute nodes had hyperthreading enabled while we evaluated the costs and benefits.

- SLURM was configured to schedule only one job per real core. For example, nodes with 24 cores / 48 virtual are configured as follows:

    NodeName=compute-[029-083] RealMemory=64000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN

- I added a command to /etc/rc.d/rc.local to disable hyperthreading on the next reboot.

- No changes were made to slurm.conf.

- After rebooting with hyperthreading disabled, certain jobs landing on the node would fail with the following error:


    slurmstepd: Failed task affinity setup

- Restarting the scheduler cleared up the issue.

Does anybody know what would cause this? My best hypothesis is that slurmctld is caching some probed hardware info from slurmd that changed when hyperthreading was disabled.
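
For anyone wanting to check a node for this kind of mismatch, here is a rough Python sketch, not anything SLURM itself provides. It parses the NodeName line above, works out how many logical CPUs the Sockets/CoresPerSocket/ThreadsPerCore fields imply, and compares that with what the OS reports on the node where it runs. The helper names (parse_node_line, configured_cpus) are made up for illustration, and treating a missing field as 1 is an assumption of the sketch.

```python
import os
import re

# Node definition copied from the slurm.conf excerpt in this post.
NODE_LINE = ("NodeName=compute-[029-083] RealMemory=64000 Sockets=2 "
             "CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN")


def parse_node_line(line):
    """Return the Key=Value pairs of a NodeName line as a dict of strings."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def configured_cpus(node):
    """Logical CPUs implied by Sockets * CoresPerSocket * ThreadsPerCore.

    Missing fields are treated as 1 (an assumption made for this sketch).
    """
    return (int(node.get("Sockets", 1))
            * int(node.get("CoresPerSocket", 1))
            * int(node.get("ThreadsPerCore", 1)))


if __name__ == "__main__":
    node = parse_node_line(NODE_LINE)
    expected = configured_cpus(node)
    # Logical CPUs the OS reports on the machine running this script.
    reported = os.cpu_count()
    print(f"slurm.conf implies {expected} CPUs; the OS reports {reported}")
    if reported != expected:
        print("Mismatch between the configured and reported CPU counts")
```

On our nodes the configured count works out to 2 * 12 * 1 = 24, so with hyperthreading enabled the OS would have reported 48 logical CPUs against a configured 24, and 24 against 24 once it was disabled. This only shows that the visible CPU layout changed across the reboot; it does not confirm the caching hypothesis. Running "slurmd -C" on the node prints the hardware configuration that slurmd itself detects, which is probably the better thing to compare against.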


Cheers,

    Jason

--
All wars are civil wars, because all men are brothers ... Each one owes
infinitely more to the human race than to the particular country in
which he was born.
                -- Francois Fenelon
