On 04/22/2014 03:13 AM, Mikael Brandström Durling wrote:
On 21 Apr 2014, at 19:59, Prentice Bisbal <prentice.bis...@rutgers.edu> wrote:
After one of these qrsh jobs fails, I get the following e-mail:
Job 5326173 caused action: Job 5326173 set to ERROR
User = xxxx
Queue =pow1...@yyyy.zzzz
Start Time = <unknown>
End Time = <unknown>
failed assumedly before job:can't get password entry for user "xxxx". Either
the user does not exist or NIS error!
This error indicates there's something wrong with getting user information. However,
I can ssh into the problematic execution hosts just fine, and when I do a 'getent
passwd <username>', I get the correct results. I've gone over my PAM
configuration, and my /etc/nsswitch.conf configuration, but I don't see anything
obviously wrong. It appears to me that sge_execd is using some other mechanism for
getting user information that is not configured correctly on these hosts.
What password/account backend are you using for your system? I have been seeing
this occasionally on our system, where users are authenticated using winbind
against an AD. My best guess after some debugging is that these errors are
generated when winbind for some reason returns a negative answer when it can’t
find the user in the cache and the lookup takes too long due to network
latency and/or a slow AD server. In particular, it was triggered by a system we run
that occasionally submits largish array jobs. When I changed the user running
the job to a user found in /etc/passwd, the errors were gone.
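One way to test the theory above (lookups that occasionally come back negative or slow under load) is to repeat the same name-service lookup many times on an affected execution host and count the misses. What follows is only a minimal sketch, assuming Python is available on the host. The names probe_lookups and summarize are made up for illustration. It goes through the standard pwd module, so it exercises the host's name-service configuration, but it does not reproduce whatever sge_execd does internally.

```python
import pwd
import sys
import time


def probe_lookups(username, attempts=100, delay=0.0, lookup=pwd.getpwnam):
    """Call lookup(username) repeatedly; return a list of (found, seconds)."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            lookup(username)
            found = True
        except KeyError:
            # pwd.getpwnam raises KeyError when no passwd entry is returned.
            found = False
        results.append((found, time.monotonic() - start))
        if delay:
            time.sleep(delay)
    return results


def summarize(results):
    """Return (number of failed lookups, slowest lookup in seconds)."""
    failures = sum(1 for found, _ in results if not found)
    slowest = max((elapsed for _, elapsed in results), default=0.0)
    return failures, slowest


if __name__ == "__main__":
    # The user name is taken from the command line; "root" is only a fallback.
    user = sys.argv[1] if len(sys.argv) > 1 else "root"
    failures, slowest = summarize(probe_lookups(user, attempts=200))
    print(f"{user}: {failures} failed lookups, slowest {slowest:.3f}s")
```

If a user who resolves fine with getent shows any failed lookups here, or the slowest lookup takes seconds rather than milliseconds, that would point at the name-service backend rather than at GE. A clean run does not rule the backend out, since the failures may only appear under the load of a large array job.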
This system is using OpenLDAP to get user information. The nodes are all
running Scientific Linux 6, so they are using SSSD on the client-side to
provide name services and authentication.
To conclude, I don’t believe GridEngine is at fault for this.
Yes and no. This is clearly a client-side configuration error that is
impacting GE, but if GE is the only application/service that can't get
user information correctly, maybe there is a problem with GE.
Ultimately, I was just asking for help in figuring out what is
misconfigured on these two systems that is preventing GE from working on
them when it works everywhere else.
Prentice