On 21 Apr 2014, at 19:59, Prentice Bisbal <prentice.bis...@rutgers.edu> wrote:

> After one of these qrsh jobs fails, I get the following e-mail:
> 
> Job 5326173 caused action: Job 5326173 set to ERROR
> User        = xxxx
> Queue       =pow1...@yyyy.zzzz
> Start Time  = <unknown>
> End Time    = <unknown>
> failed assumedly before job:can't get password entry for user "xxxx". Either 
> the user does not exist or NIS error!
> 
> 
> This error indicates there's something wrong with getting user information. 
> However, I can ssh into the problematic execution hosts just fine, and when I 
> do a 'getent passwd <username>', I get the correct results. I've gone over my 
> PAM configuration, and my /etc/nsswitch.conf configuration, but I don't see 
> anything obviously wrong. It appears to me that sge_execd is using some other 
> mechanism for getting user information that is not configured correctly on 
> these hosts.

What password/account backend are you using on your system? I have seen this
error occasionally on our system, where users are authenticated using winbind
against an AD. My best guess after some debugging is that the errors are
generated when winbind returns a negative answer because it can't find the
user in its cache and the lookup takes too long due to network latency and/or
a slow AD server. In particular, it was triggered by a system we run that
occasionally submits largish array jobs. When I changed the user running the
job to one found in /etc/passwd, the errors were gone. To conclude, I don't
believe GridEngine is at fault here.
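
If you want to check whether lookups fail intermittently on an execution
host, a rough sketch like the one below might help. It hammers the passwd
lookup in a loop and counts the misses, which a single 'getent passwd' run
would not show. The function name probe_user and its parameters are made up
for this example, and I am assuming sge_execd ends up going through the same
libc/NSS lookup path that Python's pwd module uses.

```python
import pwd
import sys
import time


def probe_user(username, attempts=100, delay=0.0, lookup=pwd.getpwnam):
    """Look up `username` repeatedly; return (failures, slowest_seconds).

    `lookup` defaults to pwd.getpwnam, which raises KeyError when no
    passwd entry is returned. It is a parameter only so the loop can be
    exercised with a stand-in function.
    """
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            lookup(username)
        except KeyError:
            # The backend answered "no such user" for this attempt.
            failures += 1
        slowest = max(slowest, time.monotonic() - start)
        if delay:
            time.sleep(delay)
    return failures, slowest


if __name__ == "__main__":
    # Hypothetical usage: python probe_user.py <username>
    user = sys.argv[1] if len(sys.argv) > 1 else "root"
    failures, slowest = probe_user(user)
    print(f"{user}: {failures} failed lookups, slowest {slowest:.3f}s")
```

Running it on a problematic host while a large array job is starting would
be the interesting case. A non-zero failure count for a user that normally
resolves would point at the name service rather than at GridEngine.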

cheers,
Mikael


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users