Have you tried "softstop"-ing the execd?
It should allow you to shut down the execd and keep the jobs running,
and after re-starting the execd it should be able to catch up with the
shepherds again.
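
A minimal sketch of what that could look like on one node, assuming the standard sgeexecd startup script lives under $SGE_ROOT/$SGE_CELL/common. The /opt/sge and "default" fallbacks and the helper names (run, soft_restart_execd) are placeholders for illustration, not taken from your setup. It defaults to a dry run so it only prints the commands:

```shell
#!/bin/sh
# Sketch: soft-restart sge_execd on a node without killing running jobs.
# Assumption: the startup script is $SGE_ROOT/$SGE_CELL/common/sgeexecd;
# /opt/sge and "default" are placeholder fallbacks, adjust to your install.

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

soft_restart_execd() {
    script="${SGE_ROOT:-/opt/sge}/${SGE_CELL:-default}/common/sgeexecd"
    # softstop shuts down the execd but leaves the shepherd processes alone
    run "$script" softstop || return 1
    # the freshly started execd should pick up the running shepherds again
    run "$script" start
}

soft_restart_execd
```

Run it with DRY_RUN=0 as root on the affected node once the printed paths look right, and try it first on a node whose jobs you can afford to lose.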
Txema
On 30/04/14 10:05, Arnau Bria wrote:
Hello all,
Yesterday we had big NIS & DNS problems and the nodes were not able
to resolve the name of the master or usernames.
This morning all problems have been fixed, but not all nodes are OK.
- the nodes know the master:
# host master
# master has address 172.X.X.X
- but from qhost:
node-hp0506 linux-x64 16 - 126.0G - 64.0G -
qmaster - execd are not able to communicate
- qping clearly shows that it's a DNS problem:
# qping -info node-hp0506 6445 execd 1
endpoint node-hp0506/execd/1 at port 6445: can't find connection
access denied: server host resolves rdata host "master" as
"(HOST_NOT_RESOLVABLE)"
The quick solution is restarting sge_execd on the node (and I've done
that on some empty nodes), but some others have running jobs and I'm not
sure how a restart will affect them.
What will happen to running jobs if I restart execd? Since the shepherd
processes become orphans, will the new execd be able to control them?
Is there any way to tell execd to retry resolving the master's
name?
TIA,
Arnau
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users