Have you tried "softstop"-ing the execd?

It should let you shut down the execd while keeping the jobs running; after restarting the execd, it should be able to pick up the existing shepherds again.
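
Something along these lines might do it on an affected node. This is only a sketch: I'm assuming the execd startup script sits at $SGE_ROOT/$SGE_CELL/common/sgeexecd and accepts "softstop" and "start", which may differ on your installation (the /opt/sge default and the DRY_RUN switch are made up for illustration). By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: soft-stop sge_execd and start it again, leaving shepherds running.
# Assumption: the startup script is $SGE_ROOT/$SGE_CELL/common/sgeexecd and
# supports the "softstop" and "start" arguments; verify this on your site.

SGE_ROOT="${SGE_ROOT:-/opt/sge}"   # hypothetical default, adjust as needed
SGE_CELL="${SGE_CELL:-default}"
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands only, 0 = run them

run() {
    # Always show the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

restart_execd_soft() {
    script="$SGE_ROOT/$SGE_CELL/common/sgeexecd"
    # Start again only if the soft stop succeeded.
    run "$script" softstop && run "$script" start
}

restart_execd_soft
```

Run it once as is to see what would be executed, then with DRY_RUN=0 as root on the node. Afterwards the same qping check from the qmaster should tell you whether the execd is reachable again.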

Txema

On 30/04/14 10:05, Arnau Bria wrote:
Hello all,

Yesterday we had major NIS and DNS problems, and nodes were not able
to resolve the name of the master or usernames.
This morning all the problems have been fixed, but not all nodes are OK.

- nodes know the master:

# host master
# master has address 172.X.X.X


- but from qhost:

node-hp0506             linux-x64      16     -  126.0G       -   64.0G       -

qmaster - execd are not able to communicate

- qping clearly shows that it's a DNS problem:

# qping -info node-hp0506  6445 execd 1
endpoint node-hp0506/execd/1 at port 6445: can't find connection
access denied: server host resolves rdata host "master" as "(HOST_NOT_RESOLVABLE)"

The quick solution is restarting sge_execd on the node (and I've done
it for some empty nodes), but some others have running jobs and I'm not
sure how a restart will affect them.

What will happen to running jobs if I restart execd? As the shepherd
processes become orphans, is the new execd able to control them?
Is there any way to tell execd to retry resolving the master's name?

TIA,
Arnau
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
