Stuart Barkley writes:

> Do you have a working reboot (or email sending) job script you can
> share?

Not one that would be useful, as it depends on your configuration, and I
need to tidy ours up anyway.  The only tricky thing is using the right
qsub parameters to get exclusive access to the node, either by
submitting a parallel job that uses all the slots or by requesting
exclusive=true.  I need to make a specific admin queue across all the
nodes.  You may also want to think about reservation and bumping up the
priority.

Otherwise, sending mail when a restricted node becomes available is
basically just

  qsub -q "*@$node" -b y -l ... -m b /bin/true

and the guts of my reboot job are just

  /usr/bin/sudo /sbin/service sgeexecd softstop
  /usr/bin/sudo /sbin/reboot
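
To make the two fragments above a bit more concrete, here is a rough
sketch of how they could be wrapped up.  It is not our production
script.  The function names, REBOOT_SCRIPT (assumed to be a file holding
the two sudo lines above), NOTIFY_ADDR and the QSUB dry-run switch are
all made up for illustration, and exclusive=true only works if your site
has defined an "exclusive" complex.  The -l request for the mail job is
left to the caller, since it is site-specific.

```shell
#!/bin/sh
# Sketch only.  Hypothetical names: REBOOT_SCRIPT, NOTIFY_ADDR, QSUB,
# and both function names.  With QSUB unset the qsub command lines are
# only printed (dry run); set QSUB=qsub to really submit.

REBOOT_SCRIPT=${REBOOT_SCRIPT:-/usr/local/sbin/sge-reboot-job}  # hypothetical path
NOTIFY_ADDR=${NOTIFY_ADDR:-root}                                # hypothetical default

# Send mail to NOTIFY_ADDR when a trivial job starts on the node, i.e.
# when the node has become available.  Any extra arguments (such as
# your site's -l request) are passed through to qsub unchanged.
notify_when_free() {
    node=$1
    shift
    ${QSUB:-echo qsub} -q "*@$node" -b y "$@" -m b -M "$NOTIFY_ADDR" /bin/true
}

# Queue the reboot script on the node.  Assumes an "exclusive" complex
# is configured; -R y asks the scheduler for a reservation.
submit_reboot() {
    node=$1
    ${QSUB:-echo qsub} -q "*@$node" -l exclusive=true -R y \
        -N "reboot-$node" "$REBOOT_SCRIPT"
}
```

Calling `submit_reboot node01` with QSUB unset just prints the command
line, which is a cheap way to check the arguments before pointing it at
a real node.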

> Some of the other scripts you have recently shared have helped
> my understanding of things.

Good-oh!

> Do you need to remove the host from its original host group or just
> add it to the new one?

If you're talking about restricting a node to admin access, what context
is missing for the scripts I referred to before?  They might well need
more comments.

> How do you keep track of the original configuration/state?

Sorry, I don't understand that.

> How do you keep track of what action is needed for the node?  For our
> needs, I can see creating two new host groups "reboot" and "maint"
> which can keep the basic information.

I just submit a reboot job when I know it's needed, and it runs when the
node is free.

> For hosts in "maint" we could
> look up an RT ticket (also still a work in progress) when the host
> becomes available.

Indeed, there are assorted things you can do with related admin systems.

> For just a reboot (which is our usual case: new image for stateless
> node), can a simple reboot script do the extra work of restoring the
> original configuration?  Including enabling the host and doing the
> reboot before SGE tries to start a job on it?

I don't understand the problem.  I can normally modify the production
image and then just reboot, but to flip images I use pxeconfig.

> On our other cluster torque has a short string which can be associated
> with a node when it is disabled ('pbsnodes -o -N "yyyy/mm/dd: NEED
> REBOOT" nodename').

My sge-restrict-nodes has a --reason argument for that.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
