Hello,

We've been having a seemingly random problem with MPI jobs on our install
of Open Grid Scheduler 2011.11.  For some varying length of time after
the execd processes start up, MPI jobs running across multiple hosts run
fine.  Then, at some point, they start failing at the mpirun step, and
keep failing until execd is restarted on the affected hosts.  After a
restart they work again, until eventually failing once more, and so on.
If I increase the SGE debug level before calling mpirun in my job script,
I see things like this:

   842  11556         main     ../clients/qsh/qsh.c 1840 executing task of job 6805430 failed: failed sending task to execd@<hostname>: got send error

...but nothing more interesting that I can see.  (I also get the same
sort of "send error" message from mpirun itself if I use its --mca
ras_gridengine_debug and --mca ras_gridengine_verbose flags, but nothing
else.)  Jobs that run on multiple cores on a single host are fine, but
ones that try to start workers on additional hosts fail.  Since restarting
execd makes things work again, I assumed the problem was on that end, and
tried dumping verbose log output from execd (using dl 10) to a file.  But
despite many thousands of lines of output, I can't spot anything on the
execd side that looks different between when the jobs are working and when
they start failing.  Ordinary grid jobs (no parallel environment) continue
to run fine no matter what.
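
For reference, the job script boils down to something like the sketch
below; the PE name, slot count, debug levels, and program name are
placeholders rather than our exact values.  SGE_DEBUG_LEVEL is how I turn
up the client-side debug output, and the two --mca parameters are the
ones mentioned above:

   #!/bin/bash
   #$ -pe mpi 16                          # placeholder PE name/slot count
   #$ -cwd
   # Turn up SGE client-side debugging before launching MPI
   # (the eight debug levels here are just examples; adjust as needed)
   export SGE_DEBUG_LEVEL="3 3 3 3 3 3 3 3"
   mpirun --mca ras_gridengine_debug 1 \
          --mca ras_gridengine_verbose 100 \
          ./my_mpi_program                # placeholder binary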

So for now, I'm stumped!  Any other ideas about what to look for, or
thoughts on what the unpredictable off-and-on behavior might mean?
Thanks in advance,

Jesse

P.S.  This is on CentOS 6, with its openmpi 1.5.4 package.
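
P.P.S.  In case it's relevant, the PE is a tight-integration setup
(control_slaves TRUE), so mpirun launches its remote daemons through
qrsh -inherit, which I assume is the path producing the qsh.c error
above.  A representative sketch of the PE definition (the name, slot
count, and allocation rule are placeholders, not a verbatim dump of
ours):

   $ qconf -sp mpi
   pe_name            mpi
   slots              9999
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /bin/true
   stop_proc_args     /bin/true
   allocation_rule    $fill_up
   control_slaves     TRUE
   job_is_first_task  FALSE
   urgency_slots      min
   accounting_summary FALSE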

