I have had the qmaster killed by Linux's OOM (Out Of Memory) killer on an SGE 6.2u5 cluster that uses classic spooling, and I did not run into the problem you are seeing.
Also, SGE 6.2u5 has been out for ~2 years, yet I have not heard of users hitting this problem. Neither the other mailing lists nor the original Sun Grid Engine list had this problem reported (I joined the Sun list with my original gmail account).

Each year my qmaster machine goes down a few times, because users run jobs on the front-end (they always do that), and sometimes the server gets unplugged by mistake. So I think it has gone down 3 or 4 times since SGE 6.2u5 was installed, but I don't think I have lost any jobs because of this.

Maybe it is due to a regression in one of the 8.0 changes, if others using 8.0 are seeing it too.

--Chi

----- Original Message -----
From: Paul Brunk <[email protected]>

15:10:57|worker|rcluster|E|[email protected] reports running job (84285.1/master) in queue "[email protected]" that was not supposed to be there - killing

This happened with Univa GE 8.0.0 on RHEL 4, and with Son of Grid Engine 8.0.0a on RHEL 5, both using classic spooling with SGE_ROOT on a high-performance, though busy, NFSv3 mount. It's definitely the qmaster start, and not an exec host going down, which triggers the job loss. The job loss happened whether execd_spool_dir was on that shared NFS filesystem or internal to each exec host.

I have a hunch that switching from classic spooling to berkeleydb might prevent this from happening (because the job loss doesn't happen on the RHEL 4 cluster when it runs SGE 6.2u5 with BDB spooling), but that's just a hunch.

I'll add that the job loss happens in testing too, when we manually kill the qmaster, so it's not that the qmaster deaths and the job losses have a common cause. (And so it's not quite Dave Love's SGE ticket #1347.)

In all cases we have

    qmaster_params       none
    execd_params         none
    reschedule_unknown   00:00:00

and pretty much a default config, qconf-wise.

Does anyone have insight so far as to how to prevent this "job loss upon qmaster restart"?

And is this still true, as someone posted in March?

"There are the following spooling options if you want to setup sge_shadowd:
- classic spooling on nfs (or nfs4)
- Berkeley DB spooling on nfs4
- Berkeley DB RPC server (still available in Grid Engine 6.2u5, but no longer supported with Univa Grid Engine 8.0.0)"

I'd be glad to provide any further details. Thanks!

--
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (formerly "Research Computing Center")
Enterprise IT Svcs, University of Georgia

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
