Hi,

I saw this in your previous debug log:

>
>     5     worker001     user doesn't match
>     6     worker001     user doesn't match
>     7     worker001     queue doesn't match
>     8     worker001     queue doesn't match
>     9     worker001     user doesn't match
>    10     worker001     user doesn't match
<

And now I see in your response:
>
Shut everything down for a storage update and switch from NIS => LDAP and 
change in DNS server.
<

Yes, NIS --> LDAP! That might be the problem.
Make sure all your nodes have exactly the same LDAP configuration and can
query LDAP in the same way.
If you are using an external LDAP server, the nodes need access to that same
server.
Alternatively, you could set up a resolving LDAP on the headnode and point
the nodes at the headnode's LDAP.
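
One way to check this is to compare what each host returns for the same
account. Below is a minimal sketch in Python, assuming you have already
collected the output of "getent passwd <user>" from every host (for example
over ssh). The host names, user name and UID/GID values in the usage note
are made up for illustration; the function names are mine, not part of SGE.

```python
from collections import defaultdict


def parse_passwd_line(line):
    """Split one passwd-format line into (name, uid, gid)."""
    fields = line.strip().split(":")
    if len(fields) < 4:
        raise ValueError(f"not a passwd entry: {line!r}")
    return fields[0], int(fields[2]), int(fields[3])


def find_mismatches(entries_by_host):
    """Group hosts by the (name, uid, gid) they resolve for one account.

    entries_by_host maps a host name to the line printed by
    `getent passwd <user>` on that host, or to an empty string when the
    lookup returned nothing. Returns an empty dict when every host agrees;
    otherwise returns {answer: [hosts...]}, where answer is the parsed
    tuple or None for a failed lookup.
    """
    groups = defaultdict(list)
    for host, line in entries_by_host.items():
        key = parse_passwd_line(line) if line.strip() else None
        groups[key].append(host)
    if len(groups) <= 1:
        return {}
    return {key: sorted(hosts) for key, hosts in groups.items()}
```

For example, find_mismatches({"headnode": "mark:x:1001:100::/home/mark:/bin/bash",
"worker001": ""}) reports that headnode resolves the account while worker001
does not. An empty result means all hosts returned the same name, UID and GID.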

Regards,
Carel van der Werf (UU - Fac of Science)

-----Original Message-----
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On 
Behalf Of berg...@merctech.com
Sent: 27 September, 2019 23:49
To: Reuti
Cc: users@gridengine.org
Subject: Re: [gridengine users] jobs stuck in transitioning state

In the message dated: Fri, 27 Sep 2019 23:32:43 +0200,
The pithy ruminations from Reuti on 
[Re: [gridengine users] jobs stuck in transitioning state] were:
=> Hi,
=> 
=> Am 27.09.2019 um 22:21 schrieb berg...@merctech.com:
=> 
=> > We're having a problem with submit scripts not being transferred to exec
=> > nodes and jobs being stuck in the [t]ransitioning state.
=> 
=> Did this issue start out of the blue?

Not spontaneously.

We had a working 8.1.6 cluster.

Shut everything down for a storage update and switch from NIS => LDAP and 
change in DNS server.

Upgraded SGE to 8.1.9 at the same time.

Brought everything up, worked out all sorts of little things, then began
having the problem with jobs getting stuck.

Reverted to 8.1.6, problem still exists.

=> 
=> Is the execd running as sge or initially as root? It must be run as root to
=> be able to switch to any user, but it switches to the admin user:

The execd and qmaster both start as root & then become the effective user 'sge'.

=> 
=> > There is successful communication between the qmaster and execd hosts:
=> >            
=> >    qping works in both directions
=> > 
=> >    jobs submitted as binaries (-b y) run correctly
=> > 
=> >    directives from the master to the execd (for example, to delete jobs)
=> >    work
=> > 
=> > If I read the qmaster debug logs correctly, it looks like the qmaster
=> > isn't able to send the submit script to the compute node:


=> > 
=> >    11          worker001     spooling job 9899430.1 <null>
=> >    12          worker001     Making dir "jobs/00/0989/9430/1-4096/1"
=> >    13          worker001     retval = 0
=> >    14          worker001     spooling job 9899430.1 <null>
=> >    15          worker001     Making dir "jobs/00/0989/9430"
=> >    16          worker001     retval = 0
=> >    17          worker001     TRIGGER JOB RESEND 9899430/1 in 300 seconds
=> >    18          worker001     successfully handed off job "9899430" to queue "all.q@2115fmn001.foobar.local"
=> >    19          worker001     NO TICKET DELIVERY
=> > 
=> > 
=> > We don't see corresponding log messages on the client.
=> > 
=> > 
=> > What mechanism is used by SGE to transfer submit scripts (something
=> > specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?
=> 
=> It uses its own protocol. No SSH inside the cluster is necessary.

That's what I thought...and there's no mechanism to change the file transfer 
method.

=> 
=> 
=> > What are the system-level requirements for successfully sending the
=> > submit scripts (for example: same UID for sge across the cluster, same
=> > UID<->username for the user submitting the job across the cluster, etc)?

Are there any other requirements you can think of?

Thanks,

Mark

=> 
=> Yes.
=> 
=> -- Reuti
=> 

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
