Jason,
have you tried disabling HT from the BIOS instead of doing it from the OS?
Davide
On Wed, 2016-05-18 at 06:02 -0700, Jason Bacon wrote:
>
> Just leaving a trail for future Googlers. My colleague did an extensive
> search for answers and came up empty.
>
> We ran into an issue after disabling hyperthreading on one of our CentOS
> clusters.
>
> Here's the scenario:
>
> - Our compute nodes had hyperthreading enabled while we evaluated the
> costs and benefits.
>
> - SLURM was configured to schedule only one job per real core. For
> example, nodes with 24 cores / 48 virtual are configured as follows:
>
> NodeName=compute-[029-083] RealMemory=64000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
>
> - I added a command to /etc/rc.d/rc.local to disable hyperthreading on
> the next reboot.
>
> - No changes were made to slurm.conf.
>
> - After rebooting with hyperthreading disabled, certain jobs landing on
> the node would fail with the following error:
>
> slurmstepd: Failed task affinity setup
>
> - Restarting the scheduler cleared up the issue.
>
> Does anybody know what would cause this? My best hypothesis is that
> slurmctld is caching some probed hardware info from slurmd that changed
> when hyperthreading was disabled.
>
> Cheers,
>
> Jason
>
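
For anyone following the trail: the original post does not say which command went into rc.local, so the following is only a guess at what "disabling HT from the OS" might look like. On Linux, a logical CPU can be taken offline by writing 0 to /sys/devices/system/cpu/cpuN/online, and the hyperthread siblings of each core are listed in cpuN/topology/thread_siblings_list. A minimal Python sketch, assuming that sysfs layout (the function names here are made up for illustration), that works out which logical CPUs would be offlined if only the first thread of each core were kept:

```python
import glob
import re


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,24" or "0-1" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_to_offline(sibling_lists):
    """Given {cpu: thread_siblings_list text}, return the CPUs to offline.

    The lowest-numbered thread of each core stays online; every other
    sibling is reported. This only computes the list, it changes nothing.
    """
    offline = set()
    for text in sibling_lists.values():
        offline.update(parse_cpu_list(text)[1:])
    return sorted(offline)


def read_sibling_lists(sysfs_root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU that sysfs currently exposes."""
    result = {}
    pattern = sysfs_root + "/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        match = re.search(r"/cpu(\d+)/topology/", path)
        if match is None:
            continue
        with open(path) as handle:
            result[int(match.group(1))] = handle.read()
    return result
```

Calling siblings_to_offline(read_sibling_lists()) on a node with HT enabled lists the logical CPUs that such an rc.local step would take offline. The sketch deliberately stops short of writing to the online files, so it is safe to run as an ordinary user.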