Jason,
Have you tried disabling HT in the BIOS instead of from the OS?
Davide
On Wed, 2016-05-18 at 06:02 -0700, Jason Bacon wrote:
> 
> Just leaving a trail for future Googlers. My colleague did an
> extensive search for answers and came up empty.
> 
> We ran into an issue after disabling hyperthreading on one of our
> CentOS clusters.
> 
> Here's the scenario:
> 
> -    Our compute nodes had hyperthreading enabled while we evaluated
>      the costs and benefits.
> 
> -    SLURM was configured to schedule only one job per physical core.
>      For example, nodes with 24 physical cores / 48 hardware threads
>      are configured as follows:
> 
>      NodeName=compute-[029-083] RealMemory=64000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
> 
> -    I added a command to /etc/rc.d/rc.local to disable hyperthreading
>      on the next reboot.
> 
> -    No changes were made to slurm.conf.
> 
> -    After rebooting with hyperthreading disabled, certain jobs landing
>      on the node would fail with the following error:
> 
>      slurmstepd: Failed task affinity setup
> 
> -    Restarting the scheduler cleared up the issue.
> 
> Does anybody know what would cause this? My best hypothesis is that
> slurmctld is caching some probed hardware info from slurmd that
> changed when hyperthreading was disabled.
> 
> Cheers,
> 
>      Jason
> 
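One way to test the caching hypothesis is to compare what the node itself reports after the reboot with what the controller believes. `slurmd -C` prints the hardware configuration slurmd detects, and `scontrol show node` shows what slurmctld currently holds. For a scripted check, the following is a minimal sketch, assuming a Linux node that exposes CPU topology under /sys/devices/system/cpu (cpuN/online and cpuN/topology/{physical_package_id,core_id}) and sockets with equal core counts. The helper names are hypothetical and not part of SLURM.

```python
import os
from collections import defaultdict

SYSFS_CPU = "/sys/devices/system/cpu"  # assumption: standard Linux sysfs layout


def read_online_topology(root=SYSFS_CPU):
    """Hypothetical helper: map each online logical CPU id to (socket_id, core_id)."""
    cpus = {}
    for name in os.listdir(root):
        # Only cpu0, cpu1, ... entries; skip cpufreq, cpuidle, etc.
        if not (name.startswith("cpu") and name[3:].isdigit()):
            continue
        cpu_dir = os.path.join(root, name)
        online = os.path.join(cpu_dir, "online")  # often absent for cpu0
        if os.path.exists(online):
            with open(online) as f:
                if f.read().strip() == "0":
                    continue  # this hardware thread was taken offline
        topo = os.path.join(cpu_dir, "topology")
        if not os.path.isdir(topo):
            continue
        with open(os.path.join(topo, "physical_package_id")) as f:
            socket_id = int(f.read())
        with open(os.path.join(topo, "core_id")) as f:
            core_id = int(f.read())
        cpus[int(name[3:])] = (socket_id, core_id)
    return cpus


def summarize_topology(cpus):
    """Return (sockets, cores_per_socket, threads_per_core) for the online CPUs."""
    if not cpus:
        raise ValueError("no online CPUs found")
    threads = defaultdict(int)  # (socket_id, core_id) -> online thread count
    for socket_id, core_id in cpus.values():
        threads[(socket_id, core_id)] += 1
    sockets = {socket_id for socket_id, _ in threads}
    # Assumes every socket has the same number of cores.
    cores_per_socket = len(threads) // len(sockets)
    threads_per_core = max(threads.values())
    return len(sockets), cores_per_socket, threads_per_core


def compare_with_config(configured, actual):
    """List the fields where the slurm.conf values differ from the node's."""
    names = ("Sockets", "CoresPerSocket", "ThreadsPerCore")
    return [
        f"{name}: slurm.conf={want}, node={have}"
        for name, want, have in zip(names, configured, actual)
        if want != have
    ]
```

On one of the nodes above, `compare_with_config((2, 12, 1), summarize_topology(read_online_topology()))` would be expected to return an empty list once hyperthreading is off. If it does, and jobs still fail until slurmctld is restarted, that points toward stale state on the controller rather than a slurm.conf mismatch.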
