Rayson,

This is resolved.

I hadn't determined a pattern yet when the user who submitted the jobs
replied to an e-mail I had sent him (I started investigating the error
as soon as SGE reported it, before the user reported it).

It turns out the error is coming not from my cluster nodes, but from
MySQL. The user's jobs store their results in a MySQL database, and
subsequent jobs then query the results of previous jobs. Don't ask
for more information - that's all I know! Anyhow, there was an error in
his script and it wasn't authenticating to MySQL properly, so those
errors were coming from MySQL. The user said he did some debugging, and
the jobs seem to be running just fine now. Since MySQL is not running on
a cluster node, that explains why I didn't find anything in the logs on
my nodes.

Sorry for the false alarm.

Prentice 


On 02/07/2012 04:36 PM, Rayson Ho wrote:
> So only some jobs/tasks fail for "user1", and only happens randomly??
>
> Rayson
>
>
>
> On Tue, Feb 7, 2012 at 4:31 PM, Prentice Bisbal <[email protected]> wrote:
>> A user submitted an array job with a large number of tasks. For the past
>> couple of days, I've been getting e-mails like the one below from SGE
>> alerting me that the job failed. Have any of you seen an error like this before?
>>
>> Job 1252898 caused action: Job-array task 1252898.1 set to ERROR
>>  User        = user1
>>  Queue       = [email protected]
>>  Start Time  = <unknown>
>>  End Time    = <unknown>
>> failed assumedly before job:can't get password entry for user "user1". 
>> Either the user does not exist or NIS error!
>>
>> Looking in /var/log/messages and /var/log/secure, I see no errors. I've been 
>> able to 'su - user1' on the nodes where the error occurred, and I can do
>> 'getent passwd user1' and get the correct answer on every cluster node. 
>> There are jobs running on the cluster for the same user.
>>
>> The only place I can find errors is in my SGE logs, which show the same
>> error as above (no surprise there), but no additional clues as to what may 
>> have caused this:
>>
>>  main|node02|E|can't start job "1252898": can't get password entry for user 
>> "user1". Either the user does not exist or NIS error!
>>
>> We use LDAP for account information, and there have been no outages that I 
>> know of, and I'd know since I'm the LDAP admin, too!
>>
>> Any ideas? I'm not even sure if I can reproduce this error.
>>
>> --
>> Prentice Bisbal
>> Linux Software Support Specialist/System Administrator
>> School of Natural Sciences
>> Institute for Advanced Study
>> Princeton, NJ
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users