On Tue, 9 Jun 2015 15:36:00 +0000 Dan Hyatt <[email protected]> wrote:
> >> The execute nodes were updated, and some are not playing well in the
> >> sandbox. When the grid sends a job there, it hangs, sends an error but
> > Not clear what hangs the job or the node?
> > What is the error that is being sent and how is it being sent?
> The nodes are hanging the job, some did not come up correctly. I had
So if I understand you correctly, the job is what is hanging but the node
is otherwise OK, i.e. you can ssh into it. If that is the case, I'd log into
one while a job is hung, look for some sign of the problem, write a load
sensor to detect it, and set an appropriate load_thresholds value on the
queue.
We have one load sensor that backgrounds a ps command and, if it hasn't
finished by the next time the load sensor is queried, flags an error. This
catches a Linux problem where a process never finishes exiting (and locks
out attempts to examine its status).
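For illustration, here is a rough sketch of that kind of sensor. It is not
our production script: the complex name ps_hung, the PROBE_CMD and FLAG
variables, and the file name ps_hung_sensor.sh are all made up for the
example. It assumes the usual load sensor protocol, where execd writes a
line to the sensor's stdin for each query, the sensor answers with a
begin / host:name:value / end block, and "quit" tells it to exit.

```shell
#!/bin/sh
# Sketch of a "backgrounded ps" load sensor. ps_hung, PROBE_CMD, FLAG and
# the file name ps_hung_sensor.sh are illustrative names, not anything
# Grid Engine ships.

HOST=$(hostname)                       # assumed to match the name execd uses
PROBE_CMD=${PROBE_CMD:-"ps -ef"}       # command expected to finish quickly
FLAG=${TMPDIR:-/tmp}/ps_hung_probe.$$  # exists only while a probe is in flight
ps_hung=0

# Set ps_hung=1 if the probe started on an earlier query is still running;
# otherwise set ps_hung=0 and start a fresh probe in the background.
check_probe() {
    if [ -e "$FLAG" ]; then
        ps_hung=1
    else
        ps_hung=0
        : > "$FLAG"
        # The flag is removed only once the probe command has returned.
        ( $PROBE_CMD; rm -f "$FLAG" ) >/dev/null 2>&1 &
    fi
}

# Answer each query line from execd with one report block.
sensor_loop() {
    while read -r line; do
        [ "$line" = "quit" ] && break
        check_probe
        echo "begin"
        echo "$HOST:ps_hung:$ps_hung"
        echo "end"
    done
    rm -f "$FLAG"
}

# Run the loop only when executed under the (hypothetical) installed name,
# so the functions can be sourced and exercised on their own.
case "$0" in
    *ps_hung_sensor.sh) sensor_loop ;;
esac
```

You would still need to define a matching complex (qconf -mc), point the
load_sensor setting at the script, and reference the complex in the queue's
load_thresholds, e.g. something like ps_hung=1.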
Still not sure what you mean by 'sends an error'. That might give a clue as to
what to look for.
> > Not clear what was marking the node or how it was marking the node.
> When I run qhost -j
>
> I assume
> HOSTNAME    ARCH      NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> blade5-5-5  lx-amd64    24    2   12   24  0.00  126.0G    1.2G    2.0G     0.0
> blade5-5-6  lx-amd64    24    2   12   24     -  126.0G       -    2.0G       -
> blade5-5-7  lx-amd64    24    2   12   24     -  126.0G       -    2.0G       -
> blade5-5-8  lx-amd64    24    2   12   24  0.00  126.0G    1.2G    2.0G     0.0
>
You're just looking for the - rather than an actual load value there. If it
is only the job that is hung then the execd will be fine, so that is expected
behavior. The values from those builtin sensors are sometimes cached, so they
can still show up even when the node is hung.
    qselect -qs u | awk -F@ '{print $2}' | sort -u
should give a more reliable indication of which nodes are currently hung.
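In case the pipeline is not obvious: qselect -qs u lists queue instances in
the unknown state as queue@host, and the awk/sort part reduces that to one
line per host. A small sketch of the same filter as a reusable function
(hosts_from_qinstances is a name I made up, and the queue names in the sample
input are only examples):

```shell
#!/bin/sh
# Reduce "queue@host" lines on stdin to a sorted list of unique hosts.
# hosts_from_qinstances is an illustrative name, not a Grid Engine tool.
hosts_from_qinstances() {
    # Skip any line without an @ so stray output is not mistaken for a host.
    awk -F@ 'NF > 1 {print $2}' | sort -u
}

# On a real cluster this would be fed from qselect:
#   qselect -qs u | hosts_from_qinstances
# Sample input standing in for qselect output:
printf '%s\n' 'all.q@blade5-5-6' 'long.q@blade5-5-6' 'all.q@blade5-5-7' |
    hosts_from_qinstances
```

With that sample input each host is listed once, however many of its queue
instances are in the unknown state.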
--
William Hay <[email protected]>
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
