On Fri, 12 Jun 2015 06:46:29 +0000
"[email protected]" <[email protected]> wrote:

> Hi,
> 
> The qacct output for a job says that the job failed with code 11 (failed
> 11 : before job).
> 
> This error seems to occur mostly when the user is too slow and misses the
> time slot for entering the password after launching the job. When the user
> launches the job again, it starts without any issues.
> 
> But these kinds of failed jobs put the queue into an error state.
> 
> What could be the reason for the queue going into an error state?

The reason the queue goes into an error state is that grid engine thinks the
problem is down to the host rather than the job.  Generally, when you have a
problem launching a job, it means something didn't go the way grid engine
expected.  If the error was detected directly by grid engine, then it can
usually do a good job of attributing the problem to the job or the host.
However, if an external command reports an error or fails to do what grid
engine expects, then the attribution of the problem is harder.  It looks like
you think the problem lies with the job and grid engine is mistaken in
attributing it to the host.
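
As a rough sketch of how one might check why a queue instance was put into
the error state and then clear it again: the two function names below are
made up for illustration, and all.q@node01 is a placeholder queue instance,
not something taken from your setup.  Clearing the state does not fix
whatever triggered it, so the queue may well go back into error on the next
failed launch.

```shell
# Hypothetical helper: list queue instances together with the reason
# grid engine recorded for any that are in the error (E) state.
show_queue_errors() {
    qstat -f -explain E
}

# Hypothetical helper: clear the error state of one queue instance.
# $1 is the queue instance, e.g. all.q@node01 (placeholder name).
clear_queue_error() {
    qmod -cq "$1"
}

# Example usage (left commented out so nothing runs by accident):
#   show_queue_errors
#   clear_queue_error all.q@node01
```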

I'm not aware of anything in grid engine itself that requires a password, so
your attribution of the problem to a failure to enter a password makes me think
that you are running some sort of external command as part of the job
startup.  Probably the easiest way to solve that particular issue would be to
remove the requirement for a password to be entered.  If this is a password
prompted for when using qrsh, then either using the builtin qrsh_command and
qrsh_daemon or setting up passwordless ssh should remove the need.
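
To see which remote startup mechanism is currently configured, something
along these lines may help.  This is only a sketch: the function name is
invented, and it assumes the relevant settings are the rsh_*, rlogin_* and
qlogin_* command/daemon parameters described in sge_conf(5), which is what I
take the qrsh settings mentioned above to correspond to.  If they show ssh or
another external program rather than builtin, that is a likely source of the
password prompt.

```shell
# Hypothetical helper: print the remote startup settings from the
# global cluster configuration.  The parameter names are assumed to be
# the ones documented in sge_conf(5).
show_remote_startup_config() {
    qconf -sconf global | grep -E '^(qlogin|rlogin|rsh)_(command|daemon)'
}

# Example usage (commented out):
#   show_remote_startup_config
# A value of "builtin" means grid engine's own mechanism is in use;
# the settings themselves can be edited with "qconf -mconf global".
```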


-- 
William Hay <[email protected]>

