On 21 Apr 2014, at 19:59, Prentice Bisbal <prentice.bis...@rutgers.edu> wrote:
> After one of these qrsh jobs fails, I get the following e-mail:
>
> Job 5326173 caused action: Job 5326173 set to ERROR
> User = xxxx
> Queue =pow1...@yyyy.zzzz
> Start Time = <unknown>
> End Time = <unknown>
> failed assumedly before job:can't get password entry for user "xxxx". Either
> the user does not exist or NIS error!
>
>
> This error indicates there's something wrong with getting user information.
> However, I can ssh into the problematic execution hosts just fine, and when I
> do a 'getent passwd <username>', I get the correct results. I've gone over my
> PAM configuration, and my /etc/nsswitch.conf configuration, but I don't see
> anything obviously wrong. It appears to me that sge_execd is using some other
> mechanism for getting user information that is not configured correctly on
> these hosts.

What password/account backend are you using for your system?

I have been seeing this occasionally on our system, where users are
authenticated using winbind against an AD. My best guess after some debugging
is that the errors are generated when winbind returns a negative answer
because it can’t find the user in its cache and the lookup against AD takes
too long, due to network latency and/or a slow AD server. In particular, it
was triggered by a system we run that occasionally submits largish array jobs.
When I changed the user running the job to a user found in /etc/passwd, the
errors were gone.

To conclude, I don’t believe GridEngine is at fault for this.

cheers,
Mikael
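
A single 'getent passwd <username>' on a quiet host may not reproduce an intermittent lookup failure of the kind described above. One way to check is to repeat the lookup many times on an execution host, ideally while a large array job is starting, and count how often it fails. The following is only a rough sketch: the helper name probe_user_lookup and its parameters are made up for illustration, and it assumes that Python's pwd.getpwnam goes through the same NSS configuration as sge_execd, which is plausible but not confirmed in this thread.

```python
import pwd
import time


def probe_user_lookup(username, attempts=100, delay=0.0, lookup=pwd.getpwnam):
    """Repeat a passwd lookup and report failures and worst-case latency.

    `lookup` defaults to pwd.getpwnam, which raises KeyError when no
    entry is found. It is a parameter (hypothetical, for illustration)
    so the probe can be exercised without a real NSS backend.
    """
    failures = 0
    latencies = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            lookup(username)
        except KeyError:
            # Same symptom as "can't get password entry for user".
            failures += 1
        latencies.append(time.monotonic() - start)
        if delay:
            time.sleep(delay)
    return {
        "attempts": attempts,
        "failures": failures,
        "max_latency": max(latencies, default=0.0),
    }
```

Calling probe_user_lookup("xxxx", attempts=1000) on an affected host and seeing a non-zero "failures" count, or a large "max_latency", would be consistent with the winbind explanation. A clean run does not rule it out, since the failures seem to depend on load.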