Stuart Barkley <[email protected]> writes:
> Do you have a working reboot (or email sending) job script you can
> share?
Not one that would be useful, as it depends on your configuration, and I
need to tidy ours up anyway. The only tricky part is using the right
qsub parameters to get exclusive access to the node, either by
submitting a parallel job that uses all the slots or by requesting
exclusive=true. I need to make a specific admin queue across all the
nodes. You may also want to think about reservation and about bumping up
the job's priority.
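As a rough sketch of such a submission, assuming your site defines a boolean "exclusive" complex and that you want both reservation and a raised priority: the function name, the PRIORITY variable and the job script path are made up for illustration, and QSUB is only overridable so the command line can be checked without a cluster.

```shell
#!/bin/sh
# Sketch only: submit a job that needs a whole node to itself.
# Assumes an "exclusive" complex exists at your site; adjust the -l
# request (or use a parallel job filling all slots) to match your setup.

QSUB=${QSUB:-qsub}    # overridable for dry runs

# submit_exclusive NODE SCRIPT   (hypothetical helper)
submit_exclusive() {
    node=$1
    script=$2
    # -q "*@$node"  any queue instance on that host
    # -R y          ask the scheduler to reserve resources for the job
    # -p N          priority; raising it above 0 needs operator or
    #               manager rights
    "$QSUB" -q "*@$node" -l exclusive=true -R y \
        -p "${PRIORITY:-0}" "$script"
}

# Example (not run here):
#   PRIORITY=100 submit_exclusive node01 ./reboot-job.sh
```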
Otherwise, sending mail when a restricted node becomes available is
basically just
  qsub -q "*@$node" -b y -l ... -m b /bin/true
and the guts of my reboot job are just
  /usr/bin/sudo /sbin/service sgeexecd softstop
  /usr/bin/sudo /sbin/reboot
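Wrapped up as shell functions, the two pieces above might look roughly like this. It is a sketch, not our actual scripts: the function names are invented, the resource request elided above is passed in by the caller rather than guessed, and skipping the reboot when softstop fails is my assumption about what you would want. QSUB and SUDO are overridable purely so the commands can be tried without a cluster.

```shell
#!/bin/sh
# Sketch only: helper functions around the commands quoted above.

QSUB=${QSUB:-qsub}
SUDO=${SUDO:-/usr/bin/sudo}

# notify_when_free NODE [resource requests...]   (hypothetical helper)
# Submits a trivial binary job to the node; "-m b" mails the job owner
# when it starts, i.e. when the node has become available to it.
# The caller supplies whatever "-l ..." request the site needs.
notify_when_free() {
    node=$1
    shift
    "$QSUB" -q "*@$node" -b y "$@" -m b /bin/true
}

# reboot_node   (hypothetical helper, meant to be the body of the job)
# Stops the execution daemon first, then reboots. If the softstop
# fails, the reboot is skipped (an assumption, not the original
# behaviour).
reboot_node() {
    "$SUDO" /sbin/service sgeexecd softstop || {
        echo "sgeexecd softstop failed; not rebooting" >&2
        return 1
    }
    "$SUDO" /sbin/reboot
}

# Example (not run here):
#   notify_when_free node01 -l exclusive=true
#   reboot_node        # as the last line of the reboot job script
```

The job needs passwordless sudo for those two commands on the execution hosts, which is site configuration I can't show here.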
> Some of the other scripts you have recently shared have helped
> my understanding of things.
Good-oh!
> Do you need to remove the host from its original host group or just
> add it to the new one?
If you're talking about restricting a node to admin access, what context
is missing for the scripts I referred to before? They might well need
more comments.
> How do you keep track of the original configuration/state?
Sorry, I don't understand that.
> How do you keep track of what action is needed for the node? For our
> needs, I can see creating two new host groups "reboot" and "maint"
> which can keep the basic information.
I just submit a reboot job when I know it's needed, and it runs when the
node is free.
> For hosts in "maint" we could
> look up an RT ticket (also still a work in progress) when the host
> becomes available.
Indeed, there are assorted things you can do with related admin systems.
> For just a reboot (which is our usual case: new image for stateless
> node), can a simple reboot script do the extra work of restoring the
> original configuration? Including enabling the host and doing the
> reboot before SGE tries to start a job on it?
I don't understand the problem. I can normally modify the production
image and then just reboot, but to flip images I use pxeconfig.
> On our other cluster, Torque has a short string which can be associated
> with a node when it is disabled ('pbsnodes -o -N "yyyy/mm/dd: NEED
> REBOOT" nodename').
My sge-restrict-nodes has a --reason argument for that.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users