Hi,

On 30.04.2014 at 10:05, Arnau Bria wrote:

> yesterday we had big NIS & DNS problems and the nodes were not able
> to resolve either the name of the master or usernames.
> This morning all problems have been fixed, but not all nodes are ok.
> 
> - the nodes know the master:
> 
> # host master
> # master has address 172.X.X.X
> 
> 
> - but from qhost:
> 
> node-hp0506             linux-x64      16     -  126.0G       -   64.0G       -
> 
> qmaster and execd are not able to communicate
> 
> - qping clearly shows that it's a DNS problem:
> 
> # qping -info node-hp0506  6445 execd 1
> endpoint node-hp0506/execd/1 at port 6445: can't find connection
> access denied: server host resolves rdata host "master" as "(HOST_NOT_RESOLVABLE)"
> 
> The quick solution is to restart sge_execd on the node (I've done
> that on some empty nodes), but some other nodes have running jobs and
> I'm not sure how a restart will affect them.
> 
> What will happen to running jobs if I restart execd? Since the shepherd
> processes become orphans, is the new execd able to control them?

Yes. You can use "sgeexecd softstop" (which won't shut down the running jobs)
and then start execd again. The new execd should discover the running jobs,
and they should also reappear in `qstat`.
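
As a rough sketch of the sequence above, assuming the startup script sits in the usual place, $SGE_ROOT/$SGE_CELL/common/sgeexecd (your installation may differ). The function name restart_execd and the DRY_RUN switch are made up for illustration and are not part of SGE:

```shell
#!/bin/sh
# Sketch only: restart sge_execd on a node without killing running jobs.
# Assumptions: SGE_ROOT is set, SGE_CELL falls back to "default", and the
# startup script is at $SGE_ROOT/$SGE_CELL/common/sgeexecd.
# restart_execd and DRY_RUN are hypothetical helpers, not SGE features.

restart_execd() {
    script="${SGE_ROOT:?SGE_ROOT is not set}/${SGE_CELL:-default}/common/sgeexecd"
    # softstop stops execd but leaves the shepherds and their jobs alone;
    # start then brings up a fresh execd.
    for action in softstop start; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print what would be run.
            echo "$script $action"
        else
            "$script" "$action" || return 1
        fi
    done
}
```

Run it as root on the affected node, e.g. first with DRY_RUN=1 to see the two commands, then without. Afterwards check on the master with `qhost` and `qstat` whether the node and its jobs show up again.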

-- Reuti


> Is there any way to tell execd to retry resolving the master's
> name?
> 
> TIA,
> Arnau
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

