I have a problem with gridengine that disappeared for a long time but has come 
back.

Some users' jobs fail persistently without producing any output files.

This is a sample of the messages file on a node in E state. Spooling is local,
in /var/spool/sge/cn385, which is owned by the SGE admin account:

02/27/2012 12:50:32|execd|cn385|I|sending job start mail to user "xxx"|mailer "/bin/mail"|"Job 28182 (run.sge) Started"
02/27/2012 12:50:32|execd|cn385|E|shepherd of job 28182.1 exited with exit status = 11
02/27/2012 12:50:32|execd|cn385|I|sending admin mail mail to user " admin "|mailer "/bin/mail"|"GE 6.1u6: Job 28182 failed"
02/27/2012 12:51:06|execd|cn385|I|sending job abortion/end mail to user "xxx"|mailer "/bin/mail"|"Job 28182 (run.sge) Aborted"

This is the corresponding error message in qmaster/messages:

02/27/2012 12:51:06|qmaster|ham4|W|job 28182.1 failed on host cn385 general before job because: 02/27/2012 12:50:32 [17813:30041]: unable to find job file "/var/spool/sge/cn385/job_scripts/28182"
02/27/2012 12:51:06|qmaster|ham4|W|rescheduling job 28182.1
02/27/2012 12:51:06|qmaster|ham4|E|queue blades.q marked QERROR as result of job 28182's failure at host cn385
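
To see how often this happens and for which jobs, the messages files can be
scanned for the "unable to find job file" text. The following is only a
minimal Python sketch, not part of Grid Engine. It assumes each entry occupies
a single line in the file on disk and has the five pipe-separated fields seen
in the excerpts (timestamp, daemon, host, level, message). The names LogEntry,
parse_line and missing_job_files are made up for this illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages-file line into its five fields.

    The message text may itself contain '|' (the mailer entries do),
    so only the first four separators are used for splitting.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not a log entry in the assumed format
    return LogEntry(*parts)


# Pattern taken from the qmaster entry quoted in this post.
MISSING_JOB_FILE = re.compile(r'unable to find job file\s+"([^"]+)"')


def missing_job_files(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (timestamp, job script path) for each matching entry."""
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        match = MISSING_JOB_FILE.search(entry.message)
        if match:
            yield entry.timestamp, match.group(1)


if __name__ == "__main__":
    sample = (
        "02/27/2012 12:51:06|qmaster|ham4|W|job 28182.1 failed on host cn385 "
        "general before job because: 02/27/2012 12:50:32 [17813:30041]: "
        'unable to find job file "/var/spool/sge/cn385/job_scripts/28182"'
    )
    for timestamp, path in missing_job_files([sample]):
        print(timestamp, path)
```

On a real system the lines would come from the qmaster messages file, for
example with open(path) passed to missing_job_files.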

It complains that it cannot find the job file, although there is sufficient
space on the node's local disk.
We are running the 6.1u6 binaries on CentOS 6. Version 6.2u5 was tried in the
past but had some problems with array jobs crashing.
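
For reference, the state of the job_scripts directory on the node can be
checked with something like the sketch below. It is an assumption-laden
illustration, not a Grid Engine tool: describe_spool_dir is a hypothetical
helper, and it reports free inodes as well as free bytes because free space
alone does not rule out inode exhaustion. Whether any of these is the actual
cause here is not known.

```python
import os
import pwd
import shutil


def describe_spool_dir(path: str) -> dict:
    """Collect basic facts about a spool directory (Unix only)."""
    info = {"path": path, "exists": os.path.isdir(path)}
    if not info["exists"]:
        return info

    st = os.stat(path)
    try:
        # Translate the numeric owner into an account name if possible.
        info["owner"] = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        info["owner"] = str(st.st_uid)
    info["mode"] = oct(st.st_mode & 0o7777)

    # Free bytes on the filesystem holding the directory.
    info["free_bytes"] = shutil.disk_usage(path).free
    # Free inodes available to unprivileged users.
    info["free_inodes"] = os.statvfs(path).f_favail
    return info


if __name__ == "__main__":
    # Path taken from the qmaster error message quoted in this post.
    print(describe_spool_dir("/var/spool/sge/cn385/job_scripts"))
```

Run on the affected node, the owner field should show the SGE admin account
if the spool directory is set up as described above.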

I would be grateful for any advice.

Henk

