Hi,

Has anyone tried creating features for nodes dynamically?

Our use-case is this:

We do rolling updates, so at any given time the nodes in the cluster
may not all have the same version of a given piece of software.
This generally does not cause problems.

However, since a recent update of our InfiniBand stack, a job fails if
it is assigned nodes that do not all have the same version of the
InfiniBand software.

For users who are currently affected, I periodically create a list of
the nodes in the two groups (old software or new software) by doing
something like

pdsh -w 'node[001-112]' ls -l /usr/lib64/libpsm_infinipath.so | \
  grep -F "so.1.15" | sort | \
  awk '{sub(/:$/, "", $1); nodes = (NR > 1 ? nodes "," : "") $1} END {print nodes}'

so that users can exclude one group or the other.

I was wondering how easy it would be to generate features in slurm.conf
dynamically.  I am assuming I would need a NodeName stanza for every
node, rather than being able to use ranges.  As we have only around 100
nodes this seems manageable, but with 1000s of nodes it might not be so
feasible.
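
To make the idea concrete, here is a minimal sketch of how such lines
might be generated. It assumes the per-node versions have already been
collected as "<node> <version>" pairs (for example from the pdsh output
above). The function name gen_feature_lines, the feature prefix "psm_"
and the NODE_OPTS variable are made up for illustration. NodeName
accepts a comma-separated list of nodes, so one line per version may be
enough rather than one stanza per node.

```shell
#!/bin/sh
# Sketch only: read "<node> <version>" pairs on stdin and print one
# slurm.conf NodeName line per version, with the version as a feature.
# Assumptions: "psm_" is an arbitrary feature prefix; NODE_OPTS is a
# hypothetical variable holding the remaining node parameters
# (CPUs, RealMemory, ...), which each node definition would still need.

gen_feature_lines() {
    LC_ALL=C sort -k2,2 -k1,1 | awk -v opts="${NODE_OPTS:-}" '
        function print_line(   feat, suffix) {
            feat = "psm_" ver
            gsub(/[^A-Za-z0-9_]/, "_", feat)   # e.g. 1.15 becomes 1_15
            suffix = (opts != "") ? " " opts : ""
            print "NodeName=" nodes " Feature=" feat suffix
        }
        NF < 2 { next }                        # skip blank or short lines
        $2 != ver {                            # a new version group starts
            if (nodes != "") print_line()
            ver = $2; nodes = $1; next
        }
        { nodes = nodes "," $1 }               # same version: extend the list
        END { if (nodes != "") print_line() }'
}
```

Feeding it "node001 1.15", "node002 1.14" and "node003 1.15" should
give one line for node002 with Feature=psm_1_14 and one for
node001,node003 with Feature=psm_1_15. Users could then ask for one
group with something like sbatch --constraint=psm_1_15 instead of
excluding nodes by hand. Whether regenerating slurm.conf like this is
a sensible workflow is exactly what I am unsure about.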

Has anyone tried anything like this, and would you be willing to share
your experiences or results?

Cheers,

Loris

-- 
This signature is currently under construction.
