So only some jobs/tasks fail for "user1", and it only happens randomly?
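
If it really is intermittent, one way to try to catch it is to hammer the passwd lookup in a loop on an affected node and see whether any individual call fails. Below is a rough sketch of that idea in Python. The function name `probe_passwd_lookup` and its parameters are made up for illustration, and it goes through Python's `pwd` module rather than SGE's own lookup code, so a clean run would not rule out a problem on the SGE side.

```python
import pwd
import time


def probe_passwd_lookup(username, attempts=1000, delay=0.0, lookup=pwd.getpwnam):
    """Repeat a passwd lookup and record every attempt that fails.

    Hypothetical helper, not part of SGE. `lookup` defaults to
    pwd.getpwnam, which resolves through the system's NSS configuration
    (LDAP, NIS, files, ...), and can be swapped out for testing.
    Returns a list of (attempt_index, unix_timestamp, message) tuples.
    """
    failures = []
    for i in range(attempts):
        try:
            lookup(username)
        except (KeyError, OSError) as exc:
            # pwd.getpwnam raises KeyError when no entry comes back;
            # OSError is caught as well in case the lookup layer reports
            # a lower-level error instead.
            failures.append((i, time.time(), str(exc)))
        if delay:
            time.sleep(delay)
    return failures
```

For example, `probe_passwd_lookup("user1", attempts=10000, delay=0.01)` run on node02 would return an empty list if every lookup succeeded. Any entries in the result give timestamps that can be lined up against the LDAP server logs.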

Rayson



On Tue, Feb 7, 2012 at 4:31 PM, Prentice Bisbal <[email protected]> wrote:
> A user submitted an array job with a large number of tasks. For the past
> couple of days, I've been getting e-mails like the one below from SGE
> alerting me that the job failed. Have any of you seen an error like this before?
>
> Job 1252898 caused action: Job-array task 1252898.1 set to ERROR
>  User        = user1
>  Queue       = [email protected]
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed assumedly before job:can't get password entry for user "user1". Either 
> the user does not exist or NIS error!
>
> Looking in /var/log/messages and /var/log/secure, I see no errors. I've been 
> able to 'su - user1' on the nodes where the error occurred, and I can do
> 'getent passwd user1' and get the correct answer on every cluster node. There 
> are jobs running on the cluster for the same user.
>
> The only place I can find errors is in my SGE logs, which show the same
> error as above (no surprise there), but no additional clues as to what may
> have caused this:
>
>  main|node02|E|can't start job "1252898": can't get password entry for user 
> "user1". Either the user does not exist or NIS error!
>
> We use LDAP for account information, and there have been no outages that I 
> know of, and I'd know since I'm the LDAP admin, too!
>
> Any ideas? I'm not even sure if I can reproduce this error.
>
> --
> Prentice Bisbal
> Linux Software Support Specialist/System Administrator
> School of Natural Sciences
> Institute for Advanced Study
> Princeton, NJ
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
