On 2015-02-05 09:55, Magnus Jonsson wrote:
> It would be nice to eliminate most of the slurm.conf on the nodes.
>
> Most of the information could as easily be fetched (or not needed at
> all) from the slurmctld on the master node.
>
> An API to make a call to the master node and fetch configuration options
> could eliminate the need for NO_CONF_HASH :-)
>
> All that should be needed is a slim slurm.conf with information where
> the slurmctld lives (and how to contact (munge/...)).

There are out-of-the-box solutions for this kind of problem, e.g. etcd or
consul.io, which offer strong consistency using some variant of the Paxos
or Raft protocols; see http://raftconsensus.github.io/

AFAIK etcd and consul also have features allowing a client to "subscribe"
to some data and get notified automatically when the value changes. They
are supposedly scalable, though I'm not sure they really scale to ~1e6
clients or whatever slurm is shooting for these days.

>
> /Magnus
>
> On 2015-02-04 20:54, Danny Auble wrote:
>>
>>
>> On 02/04/2015 11:23 AM, Ulf Markwardt wrote:
>>>
>>>> DebugFlags=NO_CONF_HASH
>>> But we do have different slurm.conf files due to different energy
>>> sensors, prolog/epilog scripts.
>> The NO_CONF_HASH is very dangerous in most systems. It should be
>> avoided at all cost.
>>
>> It is interesting you have different sensors per node. I could
>> understand in this case to have NO_CONF_HASH set. We are thinking of
>> adding a new kind of slurm.conf include that doesn't get added to the
>> hash which you could put node specific information like this and could
>> remove the NO_CONF_HASH.
>>
>> You might be able to get around the pro/epilog issue by having a master
>> pro/epilog that in turn calls different ones depending on the node.
>> Adding the new file would also eliminate this issue as well. This
>> doesn't exist today, but is being thought about.
>>
>>>
>>>
>>>> I am guessing the slurm.conf file on your nodes may be insync, but
>>>> perhaps the slurmd on the troubled nodes may be running with an old
>>>> version.
>>> All show slurm 14.11.3
>> I meant an older version of the file, not Slurm :). With NO_CONF_HASH
>> set there isn't a real good way to verify the slurmd's are all running
>> the same slurm.conf.
>>
>> I would suggest issuing a "scontrol shutdown" then restarting all your
>> nodes and your controller. If you still see the problem after that then
>> indeed something else is the matter. Perhaps routing tables or
>> something else.
>>>
>>> U
>>>
>

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || [email protected]
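
A minimal sketch of the "subscribe" idea mentioned above, in Python. It is a
toy in-process store, not the etcd or consul API, and not anything slurm
provides today; ConfigStore, NodeDaemon, and the file paths are hypothetical
names made up for illustration. A node reads the keys it cares about once at
startup and is then called back whenever one of them changes.

```python
import threading
from typing import Callable, Dict, List, Optional

# Callback signature: (key, old_value, new_value)
Watcher = Callable[[str, Optional[str], str], None]


class ConfigStore:
    """Toy in-memory stand-in for a key-value store with watch support."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Watcher]] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            return self._data.get(key)

    def watch(self, key: str, callback: Watcher) -> None:
        """Register a callback to run whenever `key` changes value."""
        with self._lock:
            self._watchers.setdefault(key, []).append(callback)

    def put(self, key: str, value: str) -> None:
        with self._lock:
            old = self._data.get(key)
            if old == value:
                return  # unchanged value: nobody is notified
            self._data[key] = value
            callbacks = list(self._watchers.get(key, []))
        # Call watchers outside the lock so they may use the store themselves.
        for callback in callbacks:
            callback(key, old, value)


class NodeDaemon:
    """Hypothetical node-side daemon holding a local copy of a few keys."""

    def __init__(self, name: str, store: ConfigStore, keys: List[str]) -> None:
        self.name = name
        self.config: Dict[str, Optional[str]] = {k: store.get(k) for k in keys}
        self.reloads = 0
        for key in keys:
            store.watch(key, self._on_change)

    def _on_change(self, key: str, old: Optional[str], new: str) -> None:
        self.config[key] = new
        self.reloads += 1


if __name__ == "__main__":
    store = ConfigStore()
    store.put("Prolog", "/etc/slurm/prolog.sh")  # hypothetical path
    node = NodeDaemon("node001", store, ["Prolog", "Epilog"])
    store.put("Epilog", "/etc/slurm/epilog.sh")  # node is notified here
    print(node.config, node.reloads)
```

A real deployment would replace ConfigStore with a networked, replicated
service; whether that scales to the client counts discussed here is exactly
the open question.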
