Hi,

On 13.02.2014 at 22:39, Eric Kaufmann wrote:
> We have jobs that are randomly getting killed. We are running GE 6.2u5.
>
> This is from the messages log:
>
> 02/13/2014 08:34:49|worker|kepler|W|job 31233.1 failed on host c052.cm.cluster assumedly after job because: job 31233.1 died through signal KILL (9)
> 02/13/2014 09:23:00| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"
> 02/13/2014 09:44:55|worker|kepler|W|job 30895.1 failed on host c062.cm.cluster assumedly after job because: job 30895.1 died through signal KILL (9)
> 02/13/2014 11:28:26| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"
> 02/13/2014 11:41:17|event_|kepler|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "kepler"
> 02/13/2014 11:48:34| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"
>
> I do have a hard limit set:
>
> s_rt     INFINITY
> h_rt     96:00:00
> s_cpu    INFINITY
> h_cpu    INFINITY
> s_fsize  INFINITY
> h_fsize  INFINITY
> s_data   INFINITY
> h_data   INFINITY
> s_stack  INFINITY
> h_stack  INFINITY
> s_core   INFINITY
> h_core   INFINITY
> s_rss    INFINITY
> h_rss    INFINITY
> s_vmem   INFINITY
> h_vmem   INFINITY
>
> I am running GE from an NFS share. Would this have something to do with the exechost spool directory configuration?

AFAICS the qmaster also has its spool directory on the share - right? That would explain the failing heartbeat writes. Often at least the qmaster's spool directory is local, as the qmaster machine is frequently also the file server for the cluster. Nevertheless, it would be best for all daemons to have a local spool directory. Especially for the exechosts: otherwise the job script is transferred to the node once by SGE's own protocol, and then a second time by NFS to the spool area.

http://arc.liv.ac.uk/SGE/howto/nfsreduce.html
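For reference, a rough sketch of how this can be inspected and changed with qconf (the host name and the path /var/spool/sge are just examples - pick any local filesystem on the node):

  # show the global configuration, which contains execd_spool_dir
  qconf -sconf

  # check whether a host-local configuration already exists
  qconf -sconf c052.cm.cluster

  # add a host-local configuration (or modify an existing one with
  # -mconf); this opens an editor where you can set e.g.:
  #   execd_spool_dir  /var/spool/sge
  qconf -aconf c052.cm.cluster

The new directory has to exist on the node and be owned by the admin user, and the execd on that host needs a restart to pick up the change.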
--
Reuti

> Thanks,
>
> Eric
>
> --
> Eric Kaufmann | Application Support Analyst - Advanced Technology Group | Saint Louis University | 314-977-2257 | [email protected]

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users