Hello,

We've been having a seemingly random problem with MPI jobs on our install of Open Grid Scheduler 2011.11. For some varying length of time after the execd processes start up, MPI jobs running across multiple hosts will run fine. Then, at some point, they will start failing at the mpirun step, and will keep failing until execd is restarted on the affected hosts. They then work again, before eventually failing, and so on.

If I increase the SGE debug level before calling mpirun in my job script, I see things like this:
    842 11556 main ../clients/qsh/qsh.c 1840 executing task of job 6805430 failed: failed sending task to execd@<hostname>: got send error

...but nothing more interesting that I can see. (I also get the same sort of "send error" message from mpirun itself if I use its --mca ras_gridengine_debug and --mca ras_gridengine_verbose flags, but nothing else.)

Jobs that run on multiple cores on a single host are fine, but ones that try to start up workers on additional hosts fail. Since restarting execd makes it work again, I assumed the problem was on that end, and tried dumping verbose log output from execd (using dl 10) to a file. But despite many thousands of lines, I can't spot anything that looks different, as far as execd is concerned, between when the jobs are working and when they start failing. Ordinary grid jobs (no parallel environment) continue to run fine no matter what.

So for now, I'm stumped! Any other ideas of what to look for, or thoughts on what the unpredictable off-and-on behavior could possibly mean?

Thanks in advance,
Jesse

P.S. This is on CentOS 6, with its openmpi 1.5.4 package.
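
P.P.S. In case a concrete picture helps, here is a stripped-down sketch of the kind of job script involved. The PE name "orte", the slot count, the program name, and the exact debug values are placeholders rather than our literal script:

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd -j y
    #$ -pe orte 16        # PE name and slot count are placeholders

    # Crank up Grid Engine's client-side debug output before launching MPI
    # (the value here is just a guess at a reasonably verbose setting)
    export SGE_DEBUG_LEVEL="3 3 3 3 3 3 3 3"

    # Ask Open MPI's gridengine RAS component to report what it is doing
    mpirun --mca ras_gridengine_debug 1 --mca ras_gridengine_verbose 100 \
        ./my_mpi_program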
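
And this is roughly how the execd-side debug output was captured on an affected host; the paths assume a standard $SGE_ROOT install with the default cell, and the init-script name will vary by install:

    # On the affected exec host, as root:
    . $SGE_ROOT/default/common/settings.sh   # SGE environment (default cell assumed)
    . $SGE_ROOT/util/dl.sh                   # defines the "dl" helper
    dl 10                                    # maximum debug level, as mentioned above

    # Stop the regular daemon, then run sge_execd from this shell so
    # its debug output ends up in a file that can be read later.
    /etc/init.d/sgeexecd stop                # init-script name varies by install
    sge_execd > /tmp/execd-debug.log 2>&1 &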
