On Tue, 3 Apr 2012 at 10:19pm, Reuti wrote:
On 03.04.2012 at 21:49, Joshua Baker-LePain wrote:
error: commlib error: can't connect to service (Connection timed out)
Does ethtool show the correct speed for the network interface?
Yes indeed -- 1000Mb/s across the board.
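For anyone wanting to repeat that check across many nodes, here is a minimal sketch of how the comparison could be scripted. It assumes the text of `ethtool <iface>` has already been collected per host by some other means, and that it contains a line of the form "Speed: 1000Mb/s". The function names and the 1000 Mb/s default are illustrative, not part of any existing tool.

```python
import re


def parse_ethtool_speed(output):
    """Return the link speed in Mb/s from `ethtool <iface>` text, or None."""
    match = re.search(r"^\s*Speed:\s*(\d+)\s*Mb/s", output, re.MULTILINE)
    return int(match.group(1)) if match else None


def slow_hosts(outputs_by_host, expected_mbps=1000):
    """List hosts whose reported speed is missing or differs from expected."""
    return sorted(
        host
        for host, output in outputs_by_host.items()
        if parse_ethtool_speed(output) != expected_mbps
    )
```

An empty list from slow_hosts() would correspond to "1000Mb/s across the board".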
Sometimes a job will report that error and seem to still run, and other
times it won't report the error but will fail.
The error from the job is different from a timeout. What is it, in detail?
These jobs are submitted with "-sync y". For jobs that fail, qsub reports
"Unable to run job $JOBID". The SGE error logs of those jobs usually (but
not always) contain commlib errors, but they always contain the following
Open MPI errors:
[opt53:20930] [[6569,0],114] routed:binomial: Connection to lifeline [[6569,0],0] lost
Looking at the qmaster and relevant execd messages, the jobs that fail are
in fact killed because they hit their hard wallclock limits. But they hit
that limit without ever using *any* CPU time. In other words they appear
to hang on startup due to the errors, and then SGE kills them when they
hit the runtime limit. Jobs that succeed (same exact binaries and input
parameters) complete well within the runtime limit.
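The pattern described above (wallclock at the limit, essentially no CPU time) could be picked out of the accounting data with something like the sketch below. It assumes one record of `qacct -j` output as text, with the `ru_wallclock` and `cpu` fields in seconds. The function names, the 1-second CPU threshold, and the handling of a trailing "s" on values are assumptions for illustration.

```python
def parse_qacct_record(text):
    """Parse one `qacct -j` record into a dict of field name -> raw string."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Separator lines ("=====") and blank lines have fewer than two parts.
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def _seconds(value):
    # Assumption: some versions may append an "s" to time values; strip it.
    return float(value.rstrip("s"))


def hung_at_startup(record, wallclock_limit, cpu_threshold=1.0):
    """True if the job ran up to its wallclock limit while using almost no CPU."""
    wallclock = _seconds(record["ru_wallclock"])
    cpu = _seconds(record["cpu"])
    return wallclock >= wallclock_limit and cpu < cpu_threshold
```

Passing the job's hard wallclock limit in seconds as wallclock_limit would separate the jobs that hung at startup from ones that ran out of time while doing real work.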
Do you still use the mpiexec the application was compiled with, or start
an old binary with a new mpiexec?
Everything (MPI and the application) is freshly compiled.
Thanks.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF