> I'll let you know what happens,

I got a chance to try things out on a Xen mimic of the grid, and
starting up a new execd does seem to allow one to carry on using a
resource on which you have orphaned jobs by taking out the original
execd.

A full write-up of my testing can be found here

http://homepages.ecs.vuw.ac.nz/~kevin/forSGE/Extending_Grid_Engine_Runtimes_with_an_execd_softstop.html

but the salient points follow to keep things in the thread.
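
(For reference, the softstop itself was done on the execute host via
the execd startup script. The path below assumes a typical
installation, so adjust it to wherever your sgeexecd script lives.)

# shut down sge_execd without killing the sge_shepherd processes,
# so any running jobs are orphaned but keep running
/etc/init.d/sgeexecd softstop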

In between the softstop and the restart, replace the execute host's
configuration, which just had these defaults

execd_spool_dir              /var/opt/gridengine/default/spool
gid_range                    20000-20100

by creating a local conf for it

qconf -mconf localnode

with new values as follows

execd_spool_dir              /var/opt/gridengine/default/spool2
gid_range                    20101-20200
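
and then restart the execd (again assuming the typical startup-script
location)

# start a fresh sge_execd, which picks up the new local configuration
/etc/init.d/sgeexecd start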

The restart even creates the new spool directory.

A qstat still shows the job on that node with a slot taken

# qstat -f -u \*
queuename                      qtype resv/used/tot. load_avg arch        states
-------------------------------------------------------------------------------
[email protected]   BIP   0/0/1          0.00     lx24-amd64
-------------------------------------------------------------------------------
[email protected]   BIP   0/1/1          0.00     lx24-amd64
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

A pstree shows a new execd tree and the orphaned job

     +-sge_execd---4*[{sge_execd}]
     +-sge_shepherd---sh---sleep

Altering the configuration to add another slot on that node also works
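
(a sketch of one way to make that change; the queue name all.q is an
assumption here, as the real name was redacted, while localnode is the
host from the qconf -mconf step above)

# raise the slot count for the queue instance on localnode only
qconf -mattr queue slots "[localnode=2]" all.q

after which qstat reflects the extra slot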

# qstat -f -u \*
queuename                      qtype resv/used/tot. load_avg arch        states
-------------------------------------------------------------------------------
[email protected]   BIP   0/0/1          0.00     lx24-amd64
-------------------------------------------------------------------------------
[email protected]   BIP   0/1/2          0.00     lx24-amd64
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

Submitting another job to the same queue
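
(the submission was along these lines; requesting the node with a -q
wildcard is an assumption, since the exact qsub options aren't in the
thread)

# ask for any queue instance on the reconfigured node
qsub -q '*@localnode' qsub3.sh

then shows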

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05 [email protected]       1
      8 0.55500 qsub3.sh   buckleke     r     06/17/2012 12:07:05 [email protected]       1

with the pstree showing both

     +-sge_execd-+-sge_shepherd---sh---sleep
     |           +-4*[{sge_execd}]
     +-sge_shepherd---sh---sleep

with the Grid Engine now believing that both slots are used

# qstat -f -u \*
queuename                      qtype resv/used/tot. load_avg arch        states
-------------------------------------------------------------------------------
[email protected]   BIP   0/0/1          0.00     lx24-amd64
-------------------------------------------------------------------------------
[email protected]   BIP   0/2/2          0.01     lx24-amd64
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1
      8 0.55500 qsub3.sh   buckleke     r     06/17/2012 12:07:05     1

Eventually, the newer job stops as normal, yet the qmaster thinks the
old one is still running, even though it has finished

# qstat -f -u \*
queuename                      qtype resv/used/tot. load_avg arch        states
-------------------------------------------------------------------------------
[email protected]   BIP   0/0/1          0.00     lx24-amd64
-------------------------------------------------------------------------------
[email protected]   BIP   0/1/2          0.00     lx24-amd64
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05     1

and the Grid Engine knows nothing about it finishing either

# qacct -j 7
error: job id 7 not found

and nor does the user looking for their job

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      7 0.55500 qsub2.sh   buckleke     r     06/17/2012 12:00:05 [email protected]       1

even though that job has run its course on the node we mangled, with
pstree there now showing only

     +-sge_execd---4*[{sge_execd}]

To get back to the "original" environment, we "softstop" the new
execd, although, with no jobs running on it, we could just stop it.

Modify the execd's conf back to what it was (in this case, the
defaults, so we could just delete the local config).
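
A sketch of that cleanup, assuming the same startup-script path as
before (qconf -dconf is the standard way to delete a host's local
configuration):

# on the execute host: stop the second execd
/etc/init.d/sgeexecd softstop

# drop the local configuration so the host reverts to the global defaults
qconf -dconf localnode

# restart the execd against the original spool directory
/etc/init.d/sgeexecd start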

The system now thinks the job that was orphaned finished when it did
(after 10 minutes), with qacct -j 7 now reporting

qsub_time    Sun Jun 17 11:59:53 2012
start_time   Sun Jun 17 12:00:05 2012
end_time     Sun Jun 17 12:10:05 2012

This will get my user out of a major bind, so thanks to all for the
insight and feedback.

Kevin Buckley
ECS, VUW, NZ