Hi,

On 16.12.2015 at 19:53, Gowtham wrote:
>
> Dear fellow Grid Engine users,
>
> Over the past few days, I have had to re-install compute nodes (12 cores
> each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I
> ensured the extend-*.xml files had no errors in them using the xmllint command
> before rebuilding the distribution. All six compute nodes installed
> successfully, and several test "Hello, World!" cases ran successfully on up
> to 72 cores. I can SSH into any one of these nodes, and SSH between any two
> compute nodes just fine.
>
> As of this morning, all submitted jobs that require more than 12 cores (i.e.,
> spanning more than one compute node) fail about a minute after starting
> successfully. However, all jobs with 12 or fewer cores within a given
> compute node run just fine. The error message for a failed job is as follows:
>
> error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> Ctrl-C caught... cleaning up processes
>
> "Hello, World!" and one other program, both compiled with Intel Cluster
> Studio 2013.0.028, display the same behavior. The line corresponding to the
> failed job from /opt/gridengine/default/spool/qmaster/messages is as follows:
>
> 12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 6129.1 task 1.compute-0-1 failed - killing job
>
> I'd appreciate any insight or help to resolve this issue. If you need
> additional information from my end, please let me know.

Which plain version of Intel MPI does Cluster Studio 2013.0.028 ship with? Is it older than 4.1? IIRC, tight integration was not supported before that version: no call to `qrsh` was set up automatically, and you had to start certain daemons beforehand. Does your version still need mpdboot?

Do you request a properly set up PE in your job submission?

-- Reuti

>
> Thank you for your time and help.
>
> Best regards,
> g
>
> --
> Gowtham, PhD
> Director of Research Computing, IT
> Adj. Asst. Professor, Physics/ECE
> Michigan Technological University
>
> P: (906) 487-3593
> F: (906) 487-2787
> http://it.mtu.edu
> http://hpc.mtu.edu
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
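
As a rough aid for the PE question above, here is a minimal Python sketch that reads a parallel environment definition in the "key value" layout printed by `qconf -sp <pe_name>` and flags settings that usually prevent tight integration. The sample PE below (name `mpi`, 72 slots) is hypothetical and only for illustration, and the two checks are assumptions about what commonly goes wrong, not a complete diagnosis of the failure reported in this thread.

```python
# Hypothetical PE definition, in the layout printed by `qconf -sp <pe_name>`.
# The values are made up for illustration; paste your own output instead.
SAMPLE_PE = """\
pe_name            mpi
slots              72
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
"""


def parse_pe_config(text):
    """Turn 'key value' lines into a dict; blank lines are skipped."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        config[key] = value
    return config


def tight_integration_warnings(config):
    """Return a list of human-readable warnings (empty if none found)."""
    warnings = []
    # Slave tasks are started through `qrsh -inherit`, which the PE only
    # permits when control_slaves is TRUE.
    if config.get("control_slaves", "").upper() != "TRUE":
        warnings.append("control_slaves is not TRUE")
    # A PE without slots cannot be granted to any job.
    try:
        slots = int(config.get("slots", "0"))
    except ValueError:
        slots = 0
    if slots <= 0:
        warnings.append("slots is missing or not a positive integer")
    return warnings


if __name__ == "__main__":
    problems = tight_integration_warnings(parse_pe_config(SAMPLE_PE))
    print(problems if problems else "no obvious PE problems found")
```

To use it with a real cluster, replace `SAMPLE_PE` with the output of `qconf -sp` for the PE named in the job submission. An empty warning list only means these two settings look plausible; it says nothing about the Intel MPI side (e.g., whether mpdboot is still required).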
