On 04/22/2014 03:13 AM, Mikael Brandström Durling wrote:
On 21 Apr 2014, at 19:59, Prentice Bisbal <prentice.bis...@rutgers.edu> wrote:

After one of these qrsh jobs fails, I get the following e-mail:

Job 5326173 caused action: Job 5326173 set to ERROR
User        = xxxx
Queue       =pow1...@yyyy.zzzz
Start Time  = <unknown>
End Time    = <unknown>
failed assumedly before job:can't get password entry for user "xxxx". Either 
the user does not exist or NIS error!


This error indicates there's something wrong with getting user information. However, 
I can ssh into the problematic execution hosts just fine, and when I do a 'getent 
passwd <username>', I get the correct results. I've gone over my PAM 
configuration, and my /etc/nsswitch.conf configuration, but I don't see anything 
obviously wrong. It appears to me that sge_execd is using some other mechanism for 
getting user information that is not configured correctly on these hosts.
What password/account backend are you using for your system? I have been seeing 
this error occasionally on our system, where users are authenticated using winbind 
against an AD. My best guess after some debugging is that those errors are 
generated when winbind for some reason returns a negative answer when it can’t 
find the user in the cache, and the lookup takes too long due to network 
latency and/or a slow AD server. In particular, it was triggered by a system we run 
that occasionally submits largish array jobs. When I changed the user running 
the job to a user found in /etc/passwd, the errors were gone.
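
One way to check for this kind of intermittent failure on an execution host is to repeat the lookup many times and time each call, since a single successful 'getent passwd' does not rule out occasional negative answers. The following is only a sketch: it assumes sge_execd resolves users through the standard NSS getpwnam path (not confirmed in this thread), and the helper name probe_user_lookup and its parameters are made up for illustration.

```python
import pwd
import sys
import time


def probe_user_lookup(username, attempts=100, delay=0.0, lookup=pwd.getpwnam):
    """Look up `username` repeatedly; return (failures, slowest_seconds).

    `lookup` defaults to pwd.getpwnam, which goes through NSS and raises
    KeyError when no passwd entry is returned for the name.
    """
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            lookup(username)
        except KeyError:
            # The name service returned "no such user" for this attempt.
            failures += 1
        slowest = max(slowest, time.monotonic() - start)
        if delay:
            time.sleep(delay)
    return failures, slowest


if __name__ == "__main__":
    # Hypothetical usage: pass the affected user name as the first argument.
    user = sys.argv[1] if len(sys.argv) > 1 else "root"
    attempts = 100
    failures, slowest = probe_user_lookup(user, attempts=attempts)
    print(f"{user}: {failures}/{attempts} lookups failed, slowest {slowest:.3f}s")
```

Running this on a problematic host and on a healthy one, ideally as the same account that runs sge_execd, may show whether lookups fail or stall only some of the time. A nonzero failure count for a user that clearly exists would point at the name service rather than at GridEngine.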

This system is using OpenLDAP to get user information. The nodes are all running Scientific Linux 6, so they are using SSSD on the client-side to provide name services and authentication.

To conclude, I don’t believe GridEngine is at fault for this.


Yes and no. This is clearly a client-side configuration error that is impacting GE, but if GE is the only application/service that can't get user information correctly, maybe there is a problem with GE. Ultimately, I was just asking for help in figuring out what is misconfigured on these two systems that is preventing GE from working on them when it works everywhere else.

Prentice