On Fri, 12 Jun 2015 06:46:29 +0000 "[email protected]" <[email protected]> wrote:
> Hi,
>
> The qacct output for a job says that the job is failed with code 11 ( failed
> 11 : before job).
>
> seems like this error occurs mostly when the user probably was too slow and
> misses the timeslot for entering the password after launching the job. But
> when user launches the job again the job gets started without any issues.
>
> But these kind of failed jobs mark the queue in error state.
>
> What could be the reason for the queue going into error state.

The reason the queue goes into an error state is that grid engine thinks the
problem is down to the host rather than the job. Generally when you have a
problem launching a job it means something didn't go the way grid engine
expected. If the error was detected directly by grid engine then it can
usually do a good job of attributing the problem to job or host. However if
an external command reports an error or fails to do what grid engine expects
then the attribution of the problem is harder.

It looks like you think the problem lies with the job and grid engine is
mistaken in attributing it to the host. I'm not aware of anything in grid
engine itself that requires a password so your attribution of the problem to
a failure to enter a password makes me think that you are running some sort
of external command here as part of the job startup.

Probably the easiest way to solve that particular issue would be to remove
the requirement for a password to be entered somehow. If this is a password
prompted for when using qrsh then either using the builtin qrsh_command and
qrsh_daemon or setting up passwordless ssh should remove the need.

-- 
William Hay <[email protected]>
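For anyone following up on the above, a rough sketch of how one might inspect and clear the queue error state and check the remote-startup settings. It assumes the standard Grid Engine client tools (qstat, qconf, qmod) are on PATH. The function names are made up for illustration, and the exact names of the remote-startup parameters on your installation should be checked against sge_conf(5) rather than taken from this sketch.

```shell
#!/bin/sh
# Sketch only: assumes the Grid Engine client tools are installed and on PATH.
# The function names below are hypothetical helpers, not part of Grid Engine.

# List queue instances and, for those in error state (E), the recorded reason.
show_queue_errors() {
    qstat -f -explain E
}

# Print the remote-startup entries from the global cluster configuration.
# The pattern is deliberately broad; which parameters exist depends on the
# installed version (see sge_conf(5)).
show_remote_startup() {
    qconf -sconf | grep -E '(rsh|rlogin|qlogin)_(command|daemon)'
}

# Clear the error state of one queue instance, passed as the first argument.
clear_queue_error() {
    qmod -c "$1"
}
```

Typical use would be to run show_queue_errors first to see why the queue was marked, fix the underlying password prompt, and only then run clear_queue_error with the affected queue instance (for example a placeholder such as all.q@node01). Clearing the state without fixing the cause will most likely just put the queue back into error on the next failed launch.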
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
