Hi:

We've just been bitten by the problem others have seen, where starting
a shadow master, or starting the primary one after a qmaster death,
results in all, or almost all, running and pending jobs being deleted,
with log messages like this:

 15:10:57|worker|rcluster|E|[email protected]
 reports running job (84285.1/master) in queue
 "[email protected]" that was not supposed to be
 there - killing
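
To gauge how many jobs were hit, something like the following could be
run over the qmaster messages file.  This is only a rough sketch: the
regular expression is based solely on the one message quoted above, it
assumes each message sits on a single line in the file (it is wrapped
here only for mail), and the names killed_jobs and summarize are made
up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: the wording matches the single message quoted above;
# other Grid Engine versions may phrase it differently.
KILL_RE = re.compile(
    r'reports running job \((?P<job>\d+)\.(?P<task>\d+)/(?P<pe_task>[^)]+)\)\s+'
    r'in queue\s+"(?P<queue>[^"]+)"\s+that was not supposed to be\s+there'
)


def killed_jobs(lines):
    """Yield (job_id, task_id, queue_instance) for each matching log line."""
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            yield int(m.group("job")), int(m.group("task")), m.group("queue")


def summarize(lines):
    """Return (set of distinct job ids, Counter of matches per queue instance)."""
    jobs = set()
    per_queue = Counter()
    for job, _task, queue in killed_jobs(lines):
        jobs.add(job)
        per_queue[queue] += 1
    return jobs, per_queue


if __name__ == "__main__":
    # Usage (hypothetical script name): python killed_jobs.py /path/to/qmaster/messages
    with open(sys.argv[1]) as fh:
        jobs, per_queue = summarize(fh)
    print(len(jobs), "distinct jobs reported as killed")
    for queue, count in per_queue.most_common():
        print(count, queue)
```

Comparing the job count against what qstat showed before the restart
would indicate whether "all, or almost all" really holds.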

This happened with Univa GE 8.0.0 on RHEL 4, and with Son of Grid
Engine 8.0.0a on RHEL 5, both using classic spooling with SGE_ROOT on
a high-performance, though busy, NFSv3 mount.  It's definitely the
qmaster start, and not an exec host going down, which triggers the job
loss.  The job loss happened whether execd_spool_dir was on that
shared NFS filesystem or internal to each exec host.

I have a hunch that switching from classic spooling to berkeleydb
might prevent this from happening (because the job loss doesn't happen
on the RHEL 4 cluster when it runs SGE 6.2u5 with BDB spooling), but
that's just a hunch.

I'll add that the job loss happens in testing too, when we manually
kill the qmaster, so it's not that the qmaster deaths and the job
losses have a common cause.  (And so it's not quite Dave Love's SGE
ticket #1347.)

In all cases we have
 qmaster_params               none
 execd_params                 none
 reschedule_unknown           00:00:00

And pretty much a default config, qconf-wise.
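
For anyone wanting to compare their own settings, here is a minimal
sketch that pulls those three values out of saved "qconf -sconf"
output.  It assumes the simple "name value" line layout shown above
and does not handle backslash-continued lines; parse_conf and relevant
are hypothetical helper names.

```python
import sys

# The three settings listed above.
KEYS = ("qmaster_params", "execd_params", "reschedule_unknown")


def parse_conf(text):
    """Parse 'name value' lines into a dict, skipping blanks and '#' lines."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def relevant(conf):
    """Return only the three settings of interest (None if absent)."""
    return {key: conf.get(key) for key in KEYS}


if __name__ == "__main__":
    # Usage: qconf -sconf > global.conf; python check_conf.py global.conf
    with open(sys.argv[1]) as fh:
        for key, value in relevant(parse_conf(fh.read())).items():
            print(key, value)
```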

Does anyone have any insight into how to prevent this "job loss upon
qmaster restart"?

And is this still true, as someone posted in March?

"There are the following spooling options if you want to
 setup sge_shadowd:

- classic spooling on nfs (or nfs4)
- Berkeley DB spooling on nfs4
- Berkeley DB RPC server (still available in Grid Engine
   6.2u5, but no longer supported with Univa Grid Engine
   8.0.0)"

I'd be glad to provide any further details.  Thanks!

--
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center
(formerly "Research Computing Center")
Enterprise IT Svcs, University of Georgia
