Hello all, yesterday we had major NIS & DNS problems, and the nodes could resolve neither the master's hostname nor usernames. This morning all of those problems have been fixed, but not all nodes are OK:
- the nodes know the master:

    # host master
    master has address 172.X.X.X

- but qhost reports:

    node-hp0506 linux-x64 16 - 126.0G - 64.0G -

- qmaster and execd are not able to communicate
- qping clearly shows that it's a DNS problem:

    # qping -info node-hp0506 6445 execd 1
    endpoint node-hp0506/execd/1 at port 6445: can't find connection
    access denied: server host resolves rdata host "master" as "(HOST_NOT_RESOLVABLE)"

The quick solution is to restart sge_execd on the node (and I've done that on some empty nodes), but some others have running jobs, and I'm not sure how a restart will affect them.

What will happen to running jobs if I restart execd? Since the shepherd processes become orphans, will the new execd be able to control them?

Is there any way to tell execd to retry resolving the master's name?

TIA,
Arnau
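
For reference, here is a minimal sketch of how the still-affected nodes could be listed, by looking for the HOST_NOT_RESOLVABLE marker in the qping output quoted above. The function names `classify_qping` and `check_nodes` are hypothetical helpers, and port 6445 is simply the execd port used on this cluster; other sites may differ.

```shell
#!/bin/sh
# Classify `qping -info` output read from stdin.
# Prints "dns" if the HOST_NOT_RESOLVABLE marker is present, "ok" otherwise.
classify_qping() {
    if grep -q 'HOST_NOT_RESOLVABLE'; then
        echo dns
    else
        echo ok
    fi
}

# Hypothetical helper: run the same qping call as above against each node
# given as an argument and print one "node: dns|ok" line per node.
# Port 6445 is an assumption taken from this cluster's setup.
check_nodes() {
    for node in "$@"; do
        printf '%s: ' "$node"
        qping -info "$node" 6445 execd 1 2>&1 | classify_qping
    done
}
```

Something like `check_nodes node-hp0506 node-hp0507` (node names are only examples) would then show which execds still hold the stale resolution. Note that "ok" only means the marker was absent, not that the execd is otherwise healthy.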
