Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-22 Thread Ralph Castain
Had to go back into the deep, dark archives and look at the source code to 
address this. :-)

Your understanding is correct - ompi should see those envars and automatically 
use xgrid.

I don't see anything in the source code beyond those variables described in the 
reference you provided. You could try setting OMPI_MCA_pls=xgrid and 
OMPI_MCA_ras=xgrid in the environment to force the use of xgrid, though it 
should automatically pick it up as you say.
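
For anyone following along, a minimal sketch of forcing those two components through the environment might look like the following. It assumes the 1.2-series framework names (pls, ras) mentioned above; the commented-out mpirun lines only illustrate how the settings would be used and have not been tested against a live xgrid controller.

```shell
# Sketch only: ask Open MPI 1.2 to select its xgrid launcher (pls) and
# allocator (ras) components via MCA environment variables.
export OMPI_MCA_pls=xgrid
export OMPI_MCA_ras=xgrid

# Report the MCA settings that a subsequent mpirun would inherit.
for var in OMPI_MCA_pls OMPI_MCA_ras; do
    eval "value=\${$var:-}"
    echo "$var=$value"
done

# With those exported, the launch command itself is unchanged:
#   mpirun -np 3 /bin/hostname
# The same parameters can also be passed on the command line instead:
#   mpirun -mca pls xgrid -mca ras xgrid -np 3 /bin/hostname
```

Setting the variables in the environment rather than on the command line keeps the mpirun invocation identical to the local case, which makes it easier to compare the two behaviours.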

I don't see anything in the source that applies to setting a grid id. In the 
connection setup, I do see we tell it to use the default port number, which may 
translate under-their-covers to a grid id (I couldn't find anything in the 
xgrid docs that stated either way).

Afraid we can't test it any more as none of the developers has access to an 
xgrid server :-(

Been trying to rectify that as we would like to restore support, but so far no 
joy...

On Jun 21, 2010, at 3:01 PM, charlie strauss wrote:

> To be more specific.
> 
> I have a working xgrid with the environment variables set.  In particular I 
> can run xgrid commands from the shell prompt like this:
> 
> xgrid -job submit /bin/hostname
> 
> and it runs because the environment variables are set.
> 
> My understanding is that openMPI will look for those env vars and, if 
> present, try to run on xgrid.  My understanding is also that no configuration 
> files are needed for this.  It should work out of the box.
> 
> Thus I should be able to type at the same command line:
> mpirun -np 3 /bin/hostname
>  or
> mpirun -np 3 examples/hello_c   (the MPI example)
> 
> and have them run on xgrid (for example, see 
> http://www.macresearch.org/getting_started_with_openmpi_and_xgrid ).
> 
> But that's not what happens; instead they always run on the localhost.
> 
> I know I'm not the only one who has this issue, since I can reproduce it on 6 
> different computers around me and I see questions like mine posted on the web.
> 
> Is there any other configuration one needs to use the built-in openmpi and 
> have it use an available xgrid?
> 
> (Separate question: if so, does it always use the default logical grid, or is 
> there a way to configure which grid id to use?  A given controller_host can 
> partition the grid into logical subsets of nodes.  In xgrid-speak these are 
> called logical grids, and one of these is assigned to be the default grid if 
> the grid id is not specified.)
> 
> On Jun 21, 2010, at 1:40 PM, Barrett, Brian W wrote:
> 
>> You have to set two environment variables (XGRID_CONTROLLER_HOSTNAME and 
>> XGRID_CONTROLLER_PASSWORD) with the correct information in order for the 
>> XGrid starter to work.  Due to the way XGrid works, the nolocal option will 
>> not work properly when launching with XGrid.
>> 
>> Brian
>> 
>> On Jun 21, 2010, at 1:28 PM, charlie strauss wrote:
>> 
>>> Perhaps I was mistaken about 1.5rc1.  As for the installed openMPI on 
>>> Mac OS X, my 10.5 OSX has v1.2.3.  When I try to run it, it works fine 
>>> locally but it never finds the xgrid.
>>> 
>>> Any MPI job I run will run on the localhost, not the xgrid agents.  If I try 
>>> to force the issue by specifying -nolocal then it just complains there are 
>>> no nodes.
>>> 
>>> So how do I use openMPI so that it uses the nodes of an xgrid cluster?
>>> 
>>> mpirun -nolocal -n 32 /bin/hostname
>>> --
>>> There are no available nodes allocated to this job. This could be because
>>> no nodes were found or all the available nodes were already used.
>>> 
>>> Note that since the -nolocal option was given no processes can be 
>>> launched on the local node.
>>> --
>>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource 
>>> in file 
>>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_support_fns.c
>>>  at line 168
>>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource 
>>> in file 
>>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/round_robin/rmaps_rr.c
>>>  at line 402
>>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource 
>>> in file 
>>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_map_job.c
>>>  at line 210
>>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource 
>>> in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmgr/urm/rmgr_urm.c 
>>> at line 372

Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-21 Thread charlie strauss

To be more specific.

I have a working xgrid with the environment variables set.  In
particular I can run xgrid commands from the shell prompt like this:

xgrid -job submit /bin/hostname

and it runs because the environment variables are set.

My understanding is that openMPI will look for those env vars and, if
present, try to run on xgrid.  My understanding is also that no
configuration files are needed for this.  It should work out of the box.

Thus I should be able to type at the same command line:
mpirun -np 3 /bin/hostname
 or
mpirun -np 3 examples/hello_c   (the MPI example)

and have them run on xgrid (for example, see
http://www.macresearch.org/getting_started_with_openmpi_and_xgrid ).

But that's not what happens; instead they always run on the localhost.

I know I'm not the only one who has this issue, since I can reproduce
it on 6 different computers around me and I see questions like mine
posted on the web.

Is there any other configuration one needs to use the built-in openmpi
and have it use an available xgrid?

(Separate question: if so, does it always use the default logical
grid, or is there a way to configure which grid id to use?  A given
controller_host can partition the grid into logical subsets of nodes.
In xgrid-speak these are called logical grids, and one of these is
assigned to be the default grid if the grid id is not specified.)









Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-21 Thread Ralph Castain
If you want, you can upgrade to the last release in the 1.2 series from the 
www.open-mpi.org web site. Anything in 1.2 will work - just not beyond.


On Jun 21, 2010, at 1:40 PM, Barrett, Brian W wrote:

> You have to set two environment variables (XGRID_CONTROLLER_HOSTNAME and 
> XGRID_CONTROLLER_PASSWORD) with the correct information in order for the 
> XGrid starter to work.  Due to the way XGrid works, the nolocal option will 
> not work properly when launching with XGrid.
> 
> Brian
> 
> On Jun 21, 2010, at 1:28 PM, charlie strauss wrote:
> 
>> Perhaps I was mistaken about 1.5rc1.  As for the installed openMPI on Mac 
>> OS X, my 10.5 OSX has v1.2.3.  When I try to run it, it works fine locally 
>> but it never finds the xgrid.
>> 
>> Any MPI job I run will run on the localhost, not the xgrid agents.  If I try 
>> to force the issue by specifying -nolocal then it just complains there are 
>> no nodes.
>> 
>> So how do I use openMPI so that it uses the nodes of an xgrid cluster?
>> 
>> mpirun -nolocal -n 32 /bin/hostname
>> --
>> There are no available nodes allocated to this job. This could be because
>> no nodes were found or all the available nodes were already used.
>> 
>> Note that since the -nolocal option was given no processes can be 
>> launched on the local node.
>> --
>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
>> file 
>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_support_fns.c
>>  at line 168
>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
>> file 
>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/round_robin/rmaps_rr.c 
>> at line 402
>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
>> file 
>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_map_job.c
>>  at line 210
>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
>> file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmgr/urm/rmgr_urm.c at 
>> line 372
>> 
>> On Jun 16, 2010, at 1:36 PM, Ralph Castain wrote:
>> 
>>> Where did you see that 1.5 works with xgrid? That support has been broken 
>>> since the 1.2 series, unfortunately, so it would help to ensure we don't 
>>> have stale docs out there to the contrary.
>>> 
>>> As for the 1.2 results, you are aware (I imagine) that OSX ships with the 
>>> last 1.2 release already installed? You don't have to do anything to use it 
>>> but run.
>>> 
>>> If you are getting peer timeouts, that is almost always a firewall issue. 
>>> But I would try the factory-installed version first to be sure.
>>> 

Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-21 Thread Barrett, Brian W
You have to set two environment variables (XGRID_CONTROLLER_HOSTNAME and 
XGRID_CONTROLLER_PASSWORD) with the correct information in order for the XGrid 
starter to work.  Due to the way XGrid works, the nolocal option will not work 
properly when launching with XGrid.
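
A small pre-flight check along these lines can confirm that both variables are present before launching. The function name xgrid_env_ready is made up for this sketch and is not part of Open MPI or Xgrid; only the two variable names come from the description above.

```shell
# Hypothetical helper (not part of Open MPI or Xgrid): returns 0 only if
# both controller variables are set and non-empty, 1 otherwise.
xgrid_env_ready() {
    missing=0
    for var in XGRID_CONTROLLER_HOSTNAME XGRID_CONTROLLER_PASSWORD; do
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: only attempt the launch when the environment is complete.
# -nolocal is deliberately left out of the suggested command.
if xgrid_env_ready; then
    echo "environment looks complete; try: mpirun -np 3 /bin/hostname"
else
    echo "set the variables listed above before running mpirun"
fi
```

The check only prints variable names, never their values, so the controller password does not end up in terminal scrollback or logs.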

Brian


Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-21 Thread charlie strauss
Perhaps I was mistaken about 1.5rc1.  As for the installed openMPI
on Mac OS X, my 10.5 OSX has v1.2.3.  When I try to run it, it works
fine locally but it never finds the xgrid.

Any MPI job I run will run on the localhost, not the xgrid agents.  If I
try to force the issue by specifying -nolocal then it just complains
there are no nodes.

So how do I use openMPI so that it uses the nodes of an xgrid cluster?

mpirun -nolocal -n 32 /bin/hostname
--
There are no available nodes allocated to this job. This could be because
no nodes were found or all the available nodes were already used.

Note that since the -nolocal option was given no processes can be
launched on the local node.
--
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_support_fns.c at line 168
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/round_robin/rmaps_rr.c at line 402
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_map_job.c at line 210
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmgr/urm/rmgr_urm.c at line 372











___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Charlie Strauss
Bioscience Division
c...@lanl.gov
505 665 4838
Quidquid latine dictum sit, altum sonatur.



Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-16 Thread Ralph Castain
Where did you see that 1.5 works with xgrid? That support has been broken since 
the 1.2 series, unfortunately, so it would help to ensure we don't have stale 
docs out there to the contrary.

As for the 1.2 results, you are aware (I imagine) that OSX ships with the last 
1.2 release already installed? You don't have to do anything to use it but run.

If you are getting peer timeouts, that is almost always a firewall issue. But I 
would try the factory-installed version first to be sure.

On Jun 16, 2010, at 1:14 PM, Charlie E. Strauss wrote:

> I'm new to openMPI.  I'm trying to set it up for using xgrid.  I have read
> that v1.3 and v1.4 are broken on OSX 10.5 and 10.6 although I have seen
> some discussions in the archives of this mail list saying some people have
> v1.4 running on 10.6.
> 
> I have now compiled both openMPI 1.2 and openMPI1.5rc  and neither of
> these is working for me with xgrid.   Both of these say they work with
> xgrid.
> 
> The failure modes are different.
> 
> Anyone know how to get a working install?  I am building this on an OSX 10.5.8
> machine.  The xgrid controller is on an OSX 10.6 server machine.  I have tried
> configuring with and without the --with-xgrid option.
> 
> Behaviour of openMPI1.2
> $ /usr/local/openmpi/bin/mpirun -nolocal -n 2 /bin/hostname
> 
> The job appears in the xgrid queue, and the logs show it is running on a
> remote machine.  However, nothing ever happens, and peeking in the xgrid
> results I see:
> 
> $ xgrid -job results -id 8703
> [brio.llnl.gov:38789] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed: Operation timed out (60) - retrying
> [brio.llnl.gov:38792] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> connection failed: Operation timed out (60) - retrying
> 
> Perhaps a firewall issue?
> 
> Of course I'm more interested in getting the new openMPI1.5 working.
> When I run this, again I get an entry in the queue, and the job runs on a
> remote machine but  I get a job failed message
> 
> $ /usr/local/openmpi5/bin/mpirun -n 2 /bin/hostname
> $ xgrid -job results -id 8702
> [brio.llnl.gov:38776] Error: unknown option "-mca"
> 
> 
> 
> Note I have NOT installed openMPI on any of the other computers in the
> grid.  So perhaps that is the problem?  If I did install it on other
> computers how would I tell mpirun where to find the path to the install
> point?
> 
> 
> 
> 
> Finally, in both cases, I don't see any way to pass xgrid-specific arguments
> in on the mpi command line.  An xgrid controller divides the agents into
> sets of logical grids and you need to specify which logical grid to submit
> the job to.  In xgrid CLI syntax one writes "xgrid -gid 2" for grid 2.
> When I use openMPI all the jobs get sent to just the default grid, which is
> the grid that xgrid uses if no gid is specified.