Hi,

On 13.02.2014 at 22:39, Eric Kaufmann wrote:

> We have jobs that are randomly getting killed. We are running GE 6.2u5
> 
> This is from the messages log:
> 
> 02/13/2014 08:34:49|worker|kepler|W|job 31233.1 failed on host 
> c052.cm.cluster assumedly after job because: job 31233.1 died through signal 
> KILL (9)
> 02/13/2014 09:23:00| timer|kepler|W|got timeout error while write data to 
> heartbeat file "heartbeat"
> 02/13/2014 09:44:55|worker|kepler|W|job 30895.1 failed on host 
> c062.cm.cluster assumedly after job because: job 30895.1 died through signal 
> KILL (9)
> 02/13/2014 11:28:26| timer|kepler|W|got timeout error while write data to 
> heartbeat file "heartbeat"
> 02/13/2014 11:41:17|event_|kepler|W|acknowledge timeout after 600 seconds for 
> event client (schedd:0) on host "kepler"
> 02/13/2014 11:48:34| timer|kepler|W|got timeout error while write data to 
> heartbeat file "heartbeat"
> 
> I do have a hard limit set.
> 
> s_rt                  INFINITY
> h_rt                  96:00:00
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> I am running GE from an NFS share. Would this have something to do with the 
> exechost spool directory configuration?

AFAICS the qmaster also has its spool directory on the share - right? And this 
would explain the failure of the heartbeat writing. Often at least the spool 
directory for the qmaster is local, as the qmaster machine is also the file 
server for the cluster. Nevertheless, it would be best for all daemons to have 
their spool directories local. Especially for the exechosts: otherwise the job 
script is transferred to the node once by SGE's own protocol, and then a second 
time by NFS to the spool area.
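
As a sketch (the local path /var/spool/sge is just an example - it must exist 
on each node and be writable by the SGE admin user), you can check the current 
setting and override it per exechost with qconf:

   # show the global setting
   qconf -sconf | grep execd_spool_dir

   # edit the host-specific configuration, e.g. for one node,
   # and set there:  execd_spool_dir  /var/spool/sge
   qconf -mconf c052.cm.cluster

The execd on the node has to be restarted afterwards to pick up the new 
location.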

http://arc.liv.ac.uk/SGE/howto/nfsreduce.html
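
For the qmaster, the spool location is recorded in the bootstrap file of the 
cell (assuming a standard installation):

   grep qmaster_spool_dir $SGE_ROOT/$SGE_CELL/common/bootstrap

Moving it means stopping the qmaster, relocating the directory, and adjusting 
this entry.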

-- Reuti


> Thanks,
> 
> Eric
> 
> 
> 
> -- 
> Eric Kaufmann |  Application Support Analyst -  Advanced Technology Group | 
> Saint Louis University | 314-977-2257 | [email protected] 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


