About 20% of my execute nodes are in the au (alarm, unreachable) error state, and I have been unable to clear it.

qstat -f (the fourth column is the load average):
normal2@blade5-5-14  BIP   0/0/24         1.42     lx-amd64
---------------------------------------------------------------------------------
normal2@blade5-5-15  BIP   0/0/24         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
normal2@blade5-5-16  BIP   0/0/24         1.25     lx-amd64
---------------------------------------------------------------------------------
normal2@blade5-5-2    BIP   0/0/24         1.31     lx-amd64
---------------------------------------------------------------------------------
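
To pull out just the affected queue instances from output like the above, a minimal sketch follows. It assumes the state column is the sixth whitespace-separated field, as in the listing here; list_au_queues is a hypothetical helper name, not part of Grid Engine.

```shell
# Hypothetical helper: print queue instances whose state column contains "au".
# Reads qstat -f style text on stdin, so it also works on saved output.
# Assumption: healthy lines have 5 fields; lines with a state have a 6th.
list_au_queues() {
    awk 'NF == 6 && $6 ~ /au/ { print $1 }'
}

# Usage (assumes qstat is on PATH):
#   qstat -f | list_au_queues
```

Reading from stdin keeps the filter independent of the cluster, so it can be tried on a pasted listing first.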

qping allows me to reach the master from the execute node.

About 20% of my nodes now show the "au" error.

I searched online, and the suggestions were to use qping or to reboot.
I rebooted one node, to no avail.

The messages file has hundreds of entries like these:
09/28/2015 10:55:35|worker|blade5-1-1|E|no execd known on host blade5-5-15.dsg.wustl.edu to send conf notification
09/28/2015 10:55:37|worker|blade5-1-1|E|no execd known on host blade5-6-6.dsg.wustl.edu to send conf notification
09/28/2015 10:55:37|worker|blade5-1-1|E|no execd known on host blade5-6-8.dsg.wustl.edu to send conf notification
09/28/2015 10:55:51|worker|blade5-1-1|E|no execd known on host blade5-6-2.dsg.wustl.edu to send conf notification
09/28/2015 10:55:52|worker|blade5-1-1|E|no execd known on host blade5-6-1.dsg.wustl.edu to send conf notification
09/28/2015 10:56:15|worker|blade5-1-1|E|no execd known on host blade5-5-15.dsg.wustl.edu to send conf notification
09/28/2015 10:56:17|worker|blade5-1-1|E|no execd known on host blade5-6-8.dsg.wustl.edu to send conf notificati
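
To see which hosts these errors name and how often, a minimal sketch follows. It assumes the messages file holds one entry per line in the format shown; count_missing_execd is a hypothetical helper name, and the file path in the usage comment is an assumption about a typical install.

```shell
# Hypothetical helper: count "no execd known on host" errors per host.
# Reads messages-file text on stdin, one log entry per line.
count_missing_execd() {
    grep 'no execd known on host' |
        sed 's/.*no execd known on host \([^ ]*\) .*/\1/' |
        sort | uniq -c | sort -rn
}

# Usage (path is an assumption; adjust to where your qmaster spools):
#   count_missing_execd < "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"
```

The output is one line per host with its count first, most frequent host at the top, which can be compared against the nodes showing "au" in qstat -f.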

qmaster]$ ps -eaf|grep execd
sgeadmin 4147 1 0 Aug31 ? 00:09:43 /opt/sge/bin/lx-amd64/sge_execd
sgeadmin 37198 37150  0 10:59 pts/0    00:00:00 grep execd

We have been migrating from one VLAN to another, but most of the affected nodes and the master are on the original VLAN.

Any suggestions on where I might go from here?

Thanks,
Dan
