On 02/04/2015 11:23 AM, Ulf Markwardt wrote:

> DebugFlags=NO_CONF_HASH
> But we do have different slurm.conf files due to different energy
> sensors, prolog/epilog scripts.
The NO_CONF_HASH flag is very dangerous on most systems and should be avoided at all costs.

It is interesting that you have different sensors per node; in that case I can understand setting NO_CONF_HASH. We are thinking of adding a new kind of slurm.conf include file that is not added to the hash. You could put node-specific information like this in it and then remove NO_CONF_HASH.

You might be able to get around the prolog/epilog issue with a master prolog/epilog that in turn calls different scripts depending on the node. The new include file would also eliminate this issue. It doesn't exist today, but is being thought about.
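
As a rough illustration of the master prolog idea, here is a minimal Python sketch. The directory /etc/slurm/prolog.d, the per-node file naming (short hostname, with a "default" fallback) and the function names are all assumptions made up for this example, not anything Slurm defines. The same slurm.conf Prolog= line would point at this one script on every node.

```python
#!/usr/bin/env python3
"""Master prolog sketch: run a node-specific script if one exists."""
import os
import socket
import subprocess

# Hypothetical layout: one executable per node, named after the short
# hostname, plus an optional "default" fallback.
SCRIPT_DIR = "/etc/slurm/prolog.d"


def select_script(hostname, script_dir=SCRIPT_DIR):
    """Return the path of the script to run for this node, or None."""
    short_name = hostname.split(".")[0]
    for name in (short_name, "default"):
        path = os.path.join(script_dir, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def main():
    script = select_script(socket.gethostname())
    if script is None:
        return 0  # no node-specific work on this node
    # Pass the environment through so the SLURM_* variables set for the
    # prolog reach the real script; return its exit code unchanged.
    return subprocess.call([script], env=os.environ)

# When installed as the prolog, the file would end with:
#     import sys; sys.exit(main())
```

A master epilog could follow the same pattern with a second directory.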



>> I am guessing the slurm.conf file on your nodes may be in sync, but
>> perhaps the slurmd on the troubled nodes may be running with an old
>> version.
> All show slurm 14.11.3
I meant an older version of the file, not Slurm :). With NO_CONF_HASH set there isn't a really good way to verify that the slurmds are all running with the same slurm.conf.
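
One manual check, sketched here in Python, is to collect the slurm.conf text from each node yourself (how you fetch it is up to you) and group the nodes by a checksum of the file. The function names and the choice of SHA-256 are assumptions for this example; this is not the hash Slurm computes internally. It also only compares the files on disk, so it cannot tell you what a long-running slurmd actually loaded.

```python
import hashlib
from collections import defaultdict


def conf_digest(text):
    """SHA-256 of a slurm.conf body, skipping blank and comment-only lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            kept.append(stripped)
    return hashlib.sha256("\n".join(kept).encode("utf-8")).hexdigest()


def group_by_digest(conf_by_node):
    """Map each digest to the sorted list of nodes whose file produced it."""
    groups = defaultdict(list)
    for node, text in conf_by_node.items():
        groups[conf_digest(text)].append(node)
    return {digest: sorted(nodes) for digest, nodes in groups.items()}
```

Nodes that end up in a group of their own are the ones to look at first. With intentionally different files per node you would expect several groups, so the useful signal is a node landing in a group you did not expect.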

I would suggest issuing "scontrol shutdown", then restarting all your nodes and your controller. If you still see the problem after that, then something else is indeed the matter, perhaps routing tables.

U
