Hi:

We've just been bitten by the problem others have seen, where starting a shadow master, or starting the primary one after a qmaster death, results in all, or almost all, running and pending jobs being deleted, with messages like:

15:10:57|worker|rcluster|E|[email protected] reports running job (84285.1/master) in queue "[email protected]" that was not supposed to be there - killing

This happened with Univa GE 8.0.0 on RHEL 4, and with Son of Grid Engine 8.0.0a on RHEL 5, both using classic spooling with SGE_ROOT on a high-performance, though busy, NFSv3 mount.

It's definitely the qmaster start, and not an exec host going down, which triggers the job loss. The job loss happened whether execd_spool_dir was on that shared NFS filesystem or local to each exec host.

I have a hunch that switching from classic spooling to berkeleydb might prevent this from happening (because the job loss doesn't happen on the RHEL 4 cluster when it runs SGE 6.2u5 with BDB spooling), but that's just a hunch.

I'll add that the job loss happens in testing too, when we manually kill the qmaster, so it's not that the qmaster deaths and the job losses have a common cause. (And so it's not quite Dave Love's SGE ticket #1347.)

In all cases we have

    qmaster_params               none
    execd_params                 none
    reschedule_unknown           00:00:00

and pretty much a default config, qconf-wise.

Does anyone have insight into how to prevent this "job loss upon qmaster restart"?

And is this still true, as someone posted in March?

"There are the following spooling options if you want to setup sge_shadowd:
- classic spooling on nfs (or nfs4)
- Berkeley DB spooling on nfs4
- Berkeley DB RPC server (still available in Grid Engine 6.2u5, but no longer supported with Univa Grid Engine 8.0.0)"

I'd be glad to provide any further details. Thanks!

--
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (formerly "Research Computing Center")
Enterprise IT Svcs, University of Georgia
