Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-17 Thread Kamel Mazouzi
Hi, mpirun (Intel) is just a wrapper for mpdboot + mpiexec + mpdallexit, while mpiexec.hydra is the new Intel MPI process spawner, which has been tightly integrated with Grid Engine since version 4.3.1. Regards, On Thu, Dec 17, 2015 at 4:19 PM, Reuti wrote: > Maybe `mpirun` doesn't support/use Hydr

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-17 Thread Reuti
Maybe `mpirun` doesn't support/use Hydra. Although not required, the MPI standard specifies `mpiexec` as a portable startup mechanism. Doesn't Intel MPI also have an `mpiexec`, which would match the `mpirun` behavior (and doesn't use Hydra)? -- Reuti > On 17.12.2015 at 15:06, Gowtham wrote:

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-17 Thread Gowtham
Yes sir. mpirun and mpiexec.hydra are both from the Intel Cluster Studio suite. To make sure of this, I ran a quick batch job with `which mpirun` and `which mpiexec.hydra`, and it returned /share/apps/intel/2013.0.028/impi/4.1.0.024/intel64/bin/mpirun /share/apps/intel/2013.0.028/impi/4.1.0.024/i
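A check like the one described above can be sketched as a small job script. This is only an illustration, not the script used in the thread: the function name `report_launchers` is hypothetical, the PE name `mpich_unstaged` and the slot count of 16 are taken from other messages in this thread, and the `#$` lines are ordinary comments when the file is run outside Grid Engine.

```shell
#!/bin/sh
#$ -cwd
#$ -j y
#$ -pe mpich_unstaged 16

# report_launchers (hypothetical helper): print where each MPI launcher
# resolves in the job's environment, or note that it is not in PATH.
report_launchers() {
    for launcher in mpirun mpiexec mpiexec.hydra; do
        path=$(command -v "$launcher" 2>/dev/null) || path="(not found in PATH)"
        printf '%-14s %s\n' "$launcher" "$path"
    done
}

report_launchers
```

Running the same script interactively on a compute node and comparing the two outputs would show whether the batch environment picks up a different MPI installation than the login shell.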

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-17 Thread Reuti
> On 17.12.2015 at 13:41, Gowtham wrote: > > > I tried replacing the call to mpirun with mpiexec.hydra and it seems to work > successfully as before. Please find below the contents of *.sh.o file > corresponding to the Hello, World! run spanning more than one compute node: Are both `mpi

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-17 Thread Gowtham
I tried replacing the call to mpirun with mpiexec.hydra and it seems to work successfully as before. Please find below the contents of *.sh.o file corresponding to the Hello, World! run spanning more than one compute node: Parallel version of 'Go Huskies!' with 16 processors

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-17 Thread Gowtham
Here you go, Sir. These two PEs are created by me (not from Rocks) to help our researchers pick one depending on the nature of their job. If a software suite required that all processors/cores belong to the same physical compute node (e.g., MATLAB with Parallel Computing Toolbox), then they wo

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-17 Thread Reuti
> On 16.12.2015 at 21:32, Gowtham wrote: > > > Hi Reuti, > > The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and I > do not need mpdboot. The PE used for this purpose is called mpich_unstaged > (basically, a copy of the original mpich with '$fill_up' rule). The only >

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-16 Thread Kamel Mazouzi
Hi, within Grid Engine, try using mpiexec.hydra instead of mpirun. To check whether mpiexec.hydra integrates with SGE: strings mpiexec.hydra | grep sge Regards, On Wed, Dec 16, 2015 at 9:32 PM, Gowtham wrote: > > Hi Reuti, > > The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and > I do
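The `strings ... | grep sge` check suggested above can be wrapped in a small helper, sketched here under a few assumptions: the function name `has_sge_support` is hypothetical, the match is case-insensitive (the original command is case-sensitive), and `grep -a` is used as a fallback when `strings` is not installed. A match is only a hint that the binary contains Grid Engine related code, not proof that tight integration works.

```shell
#!/bin/sh

# has_sge_support (hypothetical helper): succeed if the given binary
# contains any string mentioning "sge", ignoring case.
# Returns 0 on a match, 1 on no match, 2 if the file cannot be read.
has_sge_support() {
    bin=$1
    [ -r "$bin" ] || return 2
    if command -v strings >/dev/null 2>&1; then
        strings "$bin" | grep -qi 'sge'
    else
        # Fallback: let grep treat the binary as text.
        grep -aqi 'sge' "$bin"
    fi
}

# Locate the launcher in PATH; stay quiet if it is not installed.
hydra=$(command -v mpiexec.hydra 2>/dev/null || true)
if [ -z "$hydra" ]; then
    echo "mpiexec.hydra not found in PATH"
elif has_sge_support "$hydra"; then
    echo "$hydra mentions SGE"
else
    echo "$hydra does not mention SGE"
fi
```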

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-16 Thread Gowtham
Hi Reuti, The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and I do not need mpdboot. The PE used for this purpose is called mpich_unstaged (basically, a copy of the original mpich with '$fill_up' rule). The only other PE in this system is called mpich_staged (a copy of th

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-16 Thread Reuti
Hi, On 16.12.2015 at 19:53, Gowtham wrote: > > Dear fellow Grid Engine users, > > Over the past few days, I have had to re-install compute nodes (12 cores > each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I > ensured the extend-*.xml files had no error in them using

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-16 Thread Gowtham
Thank you, Sir. I made a 'machines' file with a round-robin list of compute node names (repeated 12 times for a total of 72): compute-0-0 compute-0-1 compute-0-2 compute-0-3 compute-0-4 compute-0-5 I then ran the 'Hello, World!' program (renamed 'Go Huskies!' in honor of my University's mascot),
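A round-robin 'machines' file of this shape could be generated as follows. This is a minimal sketch, not the method used in the thread: the six node names and the count of 12 passes come from the message above, while the variable names are made up for illustration.

```shell
#!/bin/sh

# Node names as listed in the message; 12 passes x 6 nodes = 72 lines.
nodes="compute-0-0 compute-0-1 compute-0-2 compute-0-3 compute-0-4 compute-0-5"
slots_per_node=12

# Start from an empty file, then append one full pass over the
# node list per slot so that consecutive ranks land on different nodes.
: > machines
pass=0
while [ "$pass" -lt "$slots_per_node" ]; do
    for node in $nodes; do
        printf '%s\n' "$node" >> machines
    done
    pass=$((pass + 1))
done

# Print the number of entries written.
wc -l < machines
```

A file like this is normally handed to the MPI launcher through its machine-file option when testing outside Grid Engine; the exact command line used in the thread is not shown in this excerpt.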

Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-16 Thread Chris Dagdigian
This looks and feels like an MPI job launching failure, especially as it fails exactly when it tries to cross the threshold from a single chassis to multiple boxes. The #1 debugging advice in this scenario is this: Can you definitively run on more than 12 cores OUTSIDE of Grid Engine? My ex

[gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-16 Thread Gowtham
Dear fellow Grid Engine users, Over the past few days, I have had to re-install compute nodes (12 cores each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I ensured the extend-*.xml files had no error in them using the xmllint command before rebuilding the distribution.