[slurm-dev] Re: Job on wrong node

Janne Blomqvist Fri, 20 Feb 2015 00:53:35 -0800

On 2015-02-05 09:55, Magnus Jonsson wrote:
> It would be nice to eliminate most of the slurm.conf on the nodes.
> 
> Most of the information could as easily be fetched (or not needed at
> all) from the slurmctld on the master node.
> 
> An API to make a call to the master node and fetch configuration options
> could eliminate the need for NO_CONF_HASH :-)
> 
> All that should be needed is a slim slurm.conf with information where
> the slurmctld lives (and how to contact (munge/...)).


There are out-of-the-box solutions for this kind of problem, e.g. etcd
or consul.io, offering strong consistency with some variant of the Paxos
or Raft protocols:

http://raftconsensus.github.io/

AFAIK etcd and consul also have features allowing a client to
"subscribe" to some data and get notified automatically when the value
changes. They are supposedly scalable, though I'm not sure they would
really scale to ~1e6 clients or whatever slurm is shooting for these
days.
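
To make the "subscribe" idea concrete, here is a toy sketch in Python. It
is an in-memory stand-in only: it does not use the real etcd or consul
client APIs and does no consensus or networking. The names ConfigStore,
NodeDaemon and the key "slurm/conf" are made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Callback signature: (key, new_value, revision)
Callback = Callable[[str, str, int], None]


class ConfigStore:
    """Toy single-process key-value store with change notification.

    Hypothetical stand-in for a watched store; a real deployment would
    replicate this state across servers with a consensus protocol.
    """

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._revision = 0
        self._watchers: Dict[str, List[Callback]] = {}

    def watch(self, key: str, callback: Callback) -> None:
        """Register a callback to run whenever `key` changes."""
        self._watchers.setdefault(key, []).append(callback)

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> int:
        """Store a value; notify watchers only if it actually changed."""
        if self._data.get(key) == value:
            return self._revision
        self._revision += 1
        self._data[key] = value
        for callback in self._watchers.get(key, []):
            callback(key, value, self._revision)
        return self._revision


class NodeDaemon:
    """Stand-in for a node daemon that follows the central configuration."""

    def __init__(self, name: str, store: ConfigStore) -> None:
        self.name = name
        self.seen: List[Tuple[int, str]] = []
        store.watch("slurm/conf", self.on_change)

    def on_change(self, key: str, value: str, revision: int) -> None:
        # A real daemon would reload its configuration here.
        self.seen.append((revision, value))


store = ConfigStore()
nodes = [NodeDaemon(f"node{i}", store) for i in range(3)]
store.put("slurm/conf", "SlurmctldPort=6817")
store.put("slurm/conf", "SlurmctldPort=6817")  # unchanged, so no notification
store.put("slurm/conf", "SlurmctldPort=7000")
```

Every node ends up with the same ordered list of revisions, which is the
property one would want from the real thing; whether it holds up at very
large client counts is exactly the open question.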

> 
> /Magnus
> 
> On 2015-02-04 20:54, Danny Auble wrote:
>>
>>
>> On 02/04/2015 11:23 AM, Ulf Markwardt wrote:
>>>
>>>> DebugFlags=NO_CONF_HASH
>>> But we do have different slurm.conf files due to different energy
>>> sensors, prolog/epilog scripts.
>> The NO_CONF_HASH is very dangerous in most systems.  It should be
>> avoided at all cost.
>>
>> It is interesting you have different sensors per node.  I could
>> understand in this case to have NO_CONF_HASH set.  We are thinking of
>> adding a new kind of slurm.conf include that doesn't get added to the
>> hash which you could put node specific information like this and could
>> remove the NO_CONF_HASH.
>>
>> You might be able to get around the pro/epilog issue by having a master
>> pro/epilog that in turn calls different ones depending on the node.
>> Adding the new file would also eliminate this issue. This
>> doesn't exist today, but is being thought about.
>>
>>>
>>>
>>>> I am guessing the slurm.conf file on your nodes may be in sync, but
>>>> perhaps the slurmd on the troubled nodes may be running with an old
>>>> version.
>>> All show slurm 14.11.3
>> I meant an older version of the file, not Slurm :).  With NO_CONF_HASH
>> set there isn't a real good way to verify the slurmd's are all running
>> the same slurm.conf.
>>
>> I would suggest issuing a "scontrol shutdown" then restarting all your
>> nodes and your controller.  If you still see the problem after that then
>> indeed something else is the matter.  Perhaps routing tables or
>> something else.
>>>
>>> U
>>>
> 
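
The master prolog/epilog idea from the quoted text could look roughly like
the Python sketch below: one script is configured everywhere, and it picks
a node-specific script by node name. The script paths and node name
patterns are hypothetical. I am assuming the node name is available in the
SLURMD_NODENAME environment variable, with the short hostname as a
fallback.

```python
import fnmatch
import os
import socket
import subprocess

# Hypothetical rules: first matching node-name pattern wins.
NODE_SCRIPTS = [
    ("gpu*", "/etc/slurm/prolog.d/gpu.sh"),
    ("bigmem*", "/etc/slurm/prolog.d/bigmem.sh"),
]
# Hypothetical fallback for nodes that match no pattern.
DEFAULT_SCRIPT = "/etc/slurm/prolog.d/default.sh"


def pick_script(node, rules=NODE_SCRIPTS, default=DEFAULT_SCRIPT):
    """Return the script path for `node`, or the default if no rule matches."""
    for pattern, script in rules:
        if fnmatch.fnmatchcase(node, pattern):
            return script
    return default


def current_node():
    """Node name from the environment (assumed), else the short hostname."""
    return os.environ.get("SLURMD_NODENAME") or socket.gethostname().split(".")[0]


def main():
    """Run the node-specific script and pass its exit status through."""
    return subprocess.call([pick_script(current_node())])
```

A real master prolog would finish with sys.exit(main()), so that a failing
node-specific script is reported exactly as if it had been configured
directly. The slurm.conf then stays identical on every node, and only the
files under the (hypothetical) prolog.d directory differ.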


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || [email protected]
