On Wed, 8 Jan 2014 at 1:59am, Mark Dixon wrote:

> On Tue, 7 Jan 2014, Joshua Baker-LePain wrote:
> ...
>> We're running OGS/GE 2011.11p1 on top of fully updated CentOS 6 on a
>> cluster with ~650 nodes.  Spool directories are local to the nodes.  Our
>> jobs are primarily serial, but with some parallel usage.  One user has
>> been having issues with random tasks of parallel array jobs failing, and
>> I'm having trouble tracking it down.
> ...
>
> Does this sound like your problem?
>
>  http://gridengine.org/pipermail/dev/2011-December/000081.html

That does indeed look like exactly the issue -- thanks!

> There's a patch posted in that thread, although Univa later improved it.
> That improvement can be found in Univa's public git repo here:
>
>  https://github.com/gridengine/gridengine
>
> Alternatively, it was integrated into Son of Gridengine some time ago.

Excellent.  Thanks so much for pointers to the fix.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users