Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-27 Thread Ralph Castain
Hmmm...well, it looks from your original error message that Rmpi/snow is using 
a single "master" process and then comm_spawn'ing a whole bunch of "workers". I 
tried replicating that on a slurm machine by having a single master comm_spawn 
a whole bunch of processes, and that worked fine. Of course, this was with the 
current 1.6 branch, which may have something patched in it as we are getting 
ready for 1.6.1.

What the error is saying is that a race condition is causing procs to try and 
communicate before the daemon knows how to route their messages. That shouldn't 
be possible - the daemon should unpack the routing info prior to starting any 
local procs. I'll review the code to see if I can spot something.
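
For anyone reading the error output quoted in this thread, it may help to pull the process names out of a log line. The sketch below assumes the usual ORTE print format of [[job family, job number], rank], and assumes job 0 denotes the daemons; the variable names are illustrative only, and the sample line is copied from the original report.

```shell
# One line copied from the error output quoted in this thread.
line="[nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing message from [[48116,2],16] to [[48116,1],0]:16, can't find route"

# Extract every [[family,job],rank] name on the line, one per row.
names=$(printf '%s\n' "$line" | grep -o '\[\[[0-9]*,[0-9]*],[0-9]*]')

# Label the three fields (assumed convention: job 0 = daemons,
# job 1 = the initial "master" job, job 2 = the comm_spawn'ed workers).
parsed=$(printf '%s\n' "$names" | sed -E 's/^\[\[([0-9]+),([0-9]+)],([0-9]+)]$/family=\1 job=\2 rank=\3/')

printf '%s\n' "$parsed"
```

Read that way, the daemon with rank 4 in job 0 was asked to forward a message from rank 16 of the spawned job 2 to rank 0 of job 1, and had no route for it, which is consistent with the race described above.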

Meantime, could you try the current 1.6 branch tarball and/or the trunk, per my 
earlier note? It would help to know if I'm looking for a bug in 1.6.0 or 
something more systemic.

Thanks!
Ralph


On Jul 26, 2012, at 4:32 PM, Brock Palen wrote:

> I think so; sorry if I gave you the impression that Rmpi changed.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jul 26, 2012, at 7:30 PM, Ralph Castain wrote:
> 
>> Guess I'm confused - your original note indicated that something had changed 
>> in Rmpi that broke things. Are you now saying it was something in OMPI?
>> 
>> On Jul 26, 2012, at 4:22 PM, Brock Palen wrote:
>> 
>>> OK, will see. We had Rmpi working with 1.4, and it has not been updated
>>> since 2010; this kinda stinks.
>>> 
>>> I will keep digging into it; thanks for the help.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 

Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Ah - okay, my misunderstanding. Would you be willing to give the trunk a try? 
It might help to know if the problem is solely in 1.6, or continues.


On Jul 26, 2012, at 4:32 PM, Brock Palen wrote:

> I think so; sorry if I gave you the impression that Rmpi changed.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jul 26, 2012, at 7:30 PM, Ralph Castain wrote:
> 
>> Guess I'm confused - your original note indicated that something had changed 
>> in Rmpi that broke things. Are you now saying it was something in OMPI?
>> 
>> On Jul 26, 2012, at 4:22 PM, Brock Palen wrote:
>> 
>>> OK, will see. We had Rmpi working with 1.4, and it has not been updated
>>> since 2010; this kinda stinks.
>>> 
>>> I will keep digging into it; thanks for the help.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 

Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Brock Palen
I think so; sorry if I gave you the impression that Rmpi changed.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Jul 26, 2012, at 7:30 PM, Ralph Castain wrote:

> Guess I'm confused - your original note indicated that something had changed 
> in Rmpi that broke things. Are you now saying it was something in OMPI?
> 
> On Jul 26, 2012, at 4:22 PM, Brock Palen wrote:
> 
>> OK, will see. We had Rmpi working with 1.4, and it has not been updated
>> since 2010; this kinda stinks.
>> 
>> I will keep digging into it; thanks for the help.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Jul 26, 2012, at 7:16 PM, Ralph Castain wrote:
>> 
>>> Crud - afraid you'll have to ask them, then :-(
>>> 
>>> 

Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Guess I'm confused - your original note indicated that something had changed in 
Rmpi that broke things. Are you now saying it was something in OMPI?

On Jul 26, 2012, at 4:22 PM, Brock Palen wrote:

> OK, will see. We had Rmpi working with 1.4, and it has not been updated
> since 2010; this kinda stinks.
> 
> I will keep digging into it; thanks for the help.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jul 26, 2012, at 7:16 PM, Ralph Castain wrote:
> 
>> Crud - afraid you'll have to ask them, then :-(
>> 
>> 
>> On Jul 26, 2012, at 3:50 PM, Brock Palen wrote:
>> 
>>> Ralph,
>>> 
>>> Rmpi wraps everything up, so I tried setting them with
>>> 
>>> export OMPI_plm_base_verbose=5
>>> export OMPI_dpm_base_verbose=5
>>> 
>>> and I get no extra messages, even on a simple MPI-1.0 hello-world example.
>>> 
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 

Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Brock Palen
OK, will see. We had Rmpi working with 1.4, and it has not been updated since
2010; this kinda stinks.

I will keep digging into it; thanks for the help.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Jul 26, 2012, at 7:16 PM, Ralph Castain wrote:

> Crud - afraid you'll have to ask them, then :-(
> 
> 
> On Jul 26, 2012, at 3:50 PM, Brock Palen wrote:
> 
>> Ralph,
>> 
>> Rmpi wraps everything up, so I tried setting them with
>> 
>> export OMPI_plm_base_verbose=5
>> export OMPI_dpm_base_verbose=5
>> 
>> and I get no extra messages, even on a simple MPI-1.0 hello-world example.
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On Jul 26, 2012, at 6:42 PM, Ralph Castain wrote:
>> 
>>> Well, it looks like comm_spawn is working on 1.6. Afraid I don't know 
>>> enough about Rmpi/snow to advise on what changed, but you could add some 
>>> debug params to get an idea of where the problem is occurring:
>>> 
>>> -mca plm_base_verbose 5 -mca dpm_base_verbose 5
>>> 
>>> should tell you from an OMPI perspective. I can try to help debug that end, 
>>> at least.
>>> 
>>> 

Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Crud - afraid you'll have to ask them, then :-(


On Jul 26, 2012, at 3:50 PM, Brock Palen wrote:

> Ralph,
> 
> Rmpi wraps everything up, so I tried setting them with
> 
> export OMPI_plm_base_verbose=5
> export OMPI_dpm_base_verbose=5
> 
> and I get no extra messages, even on a simple MPI-1.0 hello-world example.
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jul 26, 2012, at 6:42 PM, Ralph Castain wrote:
> 
>> Well, it looks like comm_spawn is working on 1.6. Afraid I don't know enough 
>> about Rmpi/snow to advise on what changed, but you could add some debug 
>> params to get an idea of where the problem is occurring:
>> 
>> -mca plm_base_verbose 5 -mca dpm_base_verbose 5
>> 
>> should tell you from an OMPI perspective. I can try to help debug that end, 
>> at least.
>> 
>> 
>> On Jul 26, 2012, at 3:02 PM, Ralph Castain wrote:
>> 
>>> Weird - looks like it has done a comm_spawn and is having trouble connecting
>>> between the jobs. I can check the basic code and make sure it is working - 
>>> I seem to recall someone else recently talking about Rmpi changes causing 
>>> problems (different ones than this, IIRC), so you might want to search our 
>>> user archives for rmpi to see what they ran into. Not sure what rmpi 
>>> changed, or why.
>>> 

Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Brock Palen
Ralph,

Rmpi wraps everything up, so I tried setting them with

export OMPI_plm_base_verbose=5
export OMPI_dpm_base_verbose=5

and I get no extra messages, even on a simple MPI-1.0 hello-world example.
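
For reference, Open MPI only picks up MCA parameters from the environment when the variable name carries the `OMPI_MCA_` prefix, so the two exports above would most likely be ignored, which may explain the missing output. A minimal sketch of the prefixed form, assuming a POSIX-style shell and the same two parameters suggested earlier in the thread:

```shell
# MCA parameters are read from the environment as OMPI_MCA_<param_name>.
export OMPI_MCA_plm_base_verbose=5
export OMPI_MCA_dpm_base_verbose=5

# Show what child processes (R, Rmpi, the spawned workers) will inherit.
env | grep '^OMPI_MCA_' | sort
```

Since Rmpi hides the mpirun command line, exporting these before starting R is one way to get the equivalent of `-mca plm_base_verbose 5 -mca dpm_base_verbose 5`.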


Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Jul 26, 2012, at 6:42 PM, Ralph Castain wrote:

> Well, it looks like comm_spawn is working on 1.6. Afraid I don't know enough 
> about Rmpi/snow to advise on what changed, but you could add some debug 
> params to get an idea of where the problem is occurring:
> 
> -mca plm_base_verbose 5 -mca dpm_base_verbose 5
> 
> should tell you from an OMPI perspective. I can try to help debug that end, 
> at least.
> 
> 
> On Jul 26, 2012, at 3:02 PM, Ralph Castain wrote:
> 
>> Weird - looks like it has done a comm_spawn and is having trouble connecting
>> between the jobs. I can check the basic code and make sure it is working - I 
>> seem to recall someone else recently talking about Rmpi changes causing 
>> problems (different ones than this, IIRC), so you might want to search our 
>> user archives for rmpi to see what they ran into. Not sure what rmpi 
>> changed, or why.
>> 
>> On Jul 26, 2012, at 2:41 PM, Brock Palen wrote:
>> 
>>> I have run into a problem using Rmpi with OpenMPI (trying to get snow 
>>> running).
>>> 
>>> I built OpenMPI following another post where I built static:
>>> 
>>> ./configure --prefix=$INSTALL/gcc-4.4.6-static 
>>> --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
>>> --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran 
>>> F77=gfortran
>>> 
>>> Rmpi/snow work fine when I run on a single node.  When I span more than one 
>>> node I get nasty errors (pasted below).
>>> 
>>> I tested this mpi install with a simple hello world and that works.  Any 
>>> thoughts what is different about Rmpi/snow that could cause this?
>>> 
>>> [nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing 
>>> message from [[48116,2],16] to [[48116,1],0]:16, can't find route
>>> [nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing 
>>> message from [[48116,2],32] to [[48116,1],0]:16, can't find route
>>> [0] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>>>  [0x2b7e9209e0df]
>>> [1] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
>>>  [0x2b7e9206577a]
>>> [2] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
>>>  [0x2b7e920404af]
>>> [3] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
>>>  [0x2b7e92041ed2]
>>> [4] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
>>>  [0x2b7e92087e38]
>>> [5] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
>>>  [0x2b7e92016768]
>>> [6] func:orted(main+0x66) [0x400966]
>>> [7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
>>> [8] func:orted() [0x400839]
>>> [nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing 
>>> message from [[48116,2],7] to [[48116,1],0]:16, can't find route
>>> [nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing 
>>> message from [[48116,2],23] to [[48116,1],0]:16, can't find route
>>> [nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing 
>>> message from [[48116,2],39] to [[48116,1],0]:16, can't find route
>>> [0] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>>>  [0x2ae2ad17d0df]
>>> 
>>> 
>>> 
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Well, it looks like comm_spawn is working on 1.6. Afraid I don't know enough 
about Rmpi/snow to advise on what changed, but you could add some debug params 
to get an idea of where the problem is occurring:

-mca plm_base_verbose 5 -mca dpm_base_verbose 5

should tell you from an OMPI perspective. I can try to help debug that end, at 
least.
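
If the mpirun command line is not directly reachable (for example, when a wrapper package launches things for you), the same parameters can be supplied through the environment: Open MPI reads any MCA parameter from a variable named OMPI_MCA_<param>. A minimal sketch, assuming a bash-like shell; the launch line and the script name snow_job.R are hypothetical placeholders, not taken from this thread.

```shell
# Environment equivalent of: -mca plm_base_verbose 5 -mca dpm_base_verbose 5
export OMPI_MCA_plm_base_verbose=5
export OMPI_MCA_dpm_base_verbose=5

# Hypothetical launch line; replace with however the job is actually started.
# mpirun -np 1 Rscript snow_job.R
```

The variables need to be exported before the master process starts so that it inherits them.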


On Jul 26, 2012, at 3:02 PM, Ralph Castain wrote:

> Weird - looks like it has done a comm_spawn and is having trouble connecting 
> between the jobs. I can check the basic code and make sure it is working - I 
> seem to recall someone else recently talking about Rmpi changes causing 
> problems (different ones than this, IIRC), so you might want to search our 
> user archives for rmpi to see what they ran into. Not sure what rmpi changed, 
> or why.
> 
> On Jul 26, 2012, at 2:41 PM, Brock Palen wrote:
> 
>> I have run into a problem using Rmpi with OpenMPI (trying to get snow 
>> running).
>> 
>> I built OpenMPI following another post where I built static:
>> 
>> ./configure --prefix=$INSTALL/gcc-4.4.6-static 
>> --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
>> --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran 
>> F77=gfortran
>> 
>> Rmpi/snow work fine when I run on a single node.  When I span more than one 
>> node I get nasty errors (pasted below).
>> 
>> I tested this mpi install with a simple hello world and that works.  Any 
>> thoughts what is different about Rmpi/snow that could cause this?
>> 
>> [nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing 
>> message from [[48116,2],16] to [[48116,1],0]:16, can't find route
>> [nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing 
>> message from [[48116,2],32] to [[48116,1],0]:16, can't find route
>> [0] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>>  [0x2b7e9209e0df]
>> [1] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
>>  [0x2b7e9206577a]
>> [2] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
>>  [0x2b7e920404af]
>> [3] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
>>  [0x2b7e92041ed2]
>> [4] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
>>  [0x2b7e92087e38]
>> [5] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
>>  [0x2b7e92016768]
>> [6] func:orted(main+0x66) [0x400966]
>> [7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
>> [8] func:orted() [0x400839]
>> [nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing 
>> message from [[48116,2],7] to [[48116,1],0]:16, can't find route
>> [nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing 
>> message from [[48116,2],23] to [[48116,1],0]:16, can't find route
>> [nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing 
>> message from [[48116,2],39] to [[48116,1],0]:16, can't find route
>> [0] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>>  [0x2ae2ad17d0df]
>> 
>> 
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> 
> 




Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Weird - looks like it has done a comm_spawn and is having trouble connecting 
between the jobs. I can check the basic code and make sure it is working - I 
seem to recall someone else recently talking about Rmpi changes causing 
problems (different ones than this, IIRC), so you might want to search our user 
archives for rmpi to see what they ran into. Not sure what rmpi changed, or why.
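
The pattern can be read straight out of the log: in the output quoted below, all five failed routes are messages from a process in job [48116,2] (presumably the spawned workers) to [[48116,1],0] (presumably the master). A rough sketch for tallying this from a saved copy of the output, assuming standard grep, sed, sort and uniq; the function name and the file name rmpi_errors.log are made up for illustration. It matches only the "message from ... to ..." fragment, so it does not matter whether the log lines have been wrapped.

```shell
# Summarise "can't find route" messages by sender job and destination process.
# Reads log text on stdin; prints one line per pair: count, sender job, destination.
# Note: "]" is literal outside a bracket expression, so only "[" is escaped.
summarize_routes() {
  grep -oE 'message from \[\[[0-9]+,[0-9]+],[0-9]+] to \[\[[0-9]+,[0-9]+],[0-9]+]' |
    sed -E 's/^message from \[(\[[0-9]+,[0-9]+]),[0-9]+] to (.*)$/job \1 -> \2/' |
    sort | uniq -c
}

# Hypothetical usage:
# summarize_routes < rmpi_errors.log
```

A single output line for a multi-node run would point at one direction of traffic (workers back to the master) rather than at a particular node.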

On Jul 26, 2012, at 2:41 PM, Brock Palen wrote:

> I have run into a problem using Rmpi with OpenMPI (trying to get snow 
> running).
> 
> I built OpenMPI following another post where I built static:
> 
> ./configure --prefix=$INSTALL/gcc-4.4.6-static 
> --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
> --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran 
> F77=gfortran
> 
> Rmpi/snow work fine when I run on a single node.  When I span more than one 
> node I get nasty errors (pasted below).
> 
> I tested this mpi install with a simple hello world and that works.  Any 
> thoughts what is different about Rmpi/snow that could cause this?
> 
> [nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing 
> message from [[48116,2],16] to [[48116,1],0]:16, can't find route
> [nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing 
> message from [[48116,2],32] to [[48116,1],0]:16, can't find route
> [0] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>  [0x2b7e9209e0df]
> [1] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
>  [0x2b7e9206577a]
> [2] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
>  [0x2b7e920404af]
> [3] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
>  [0x2b7e92041ed2]
> [4] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
>  [0x2b7e92087e38]
> [5] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
>  [0x2b7e92016768]
> [6] func:orted(main+0x66) [0x400966]
> [7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
> [8] func:orted() [0x400839]
> [nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing 
> message from [[48116,2],7] to [[48116,1],0]:16, can't find route
> [nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing 
> message from [[48116,2],23] to [[48116,1],0]:16, can't find route
> [nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing 
> message from [[48116,2],39] to [[48116,1],0]:16, can't find route
> [0] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>  [0x2ae2ad17d0df]
> 
> 
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> 




[OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Brock Palen
I have run into a problem using Rmpi with OpenMPI (trying to get snow running).

I built OpenMPI as a static build, following another post:

./configure --prefix=$INSTALL/gcc-4.4.6-static \
    --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ \
    --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran F77=gfortran

Rmpi/snow work fine when I run on a single node.  When I span more than one 
node I get nasty errors (pasted below).

I tested this MPI install with a simple hello world and that works.  Any 
thoughts on what is different about Rmpi/snow that could cause this?

[nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in file routed_binomial.c at line 386
[nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing message from [[48116,2],16] to [[48116,1],0]:16, can't find route
[nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in file routed_binomial.c at line 386
[nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing message from [[48116,2],32] to [[48116,1],0]:16, can't find route
[0] func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f) [0x2b7e9209e0df]
[1] func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a) [0x2b7e9206577a]
[2] func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f) [0x2b7e920404af]
[3] func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2) [0x2b7e92041ed2]
[4] func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238) [0x2b7e92087e38]
[5] func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8) [0x2b7e92016768]
[6] func:orted(main+0x66) [0x400966]
[7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
[8] func:orted() [0x400839]
[nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in file routed_binomial.c at line 386
[nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing message from [[48116,2],7] to [[48116,1],0]:16, can't find route
[nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in file routed_binomial.c at line 386
[nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing message from [[48116,2],23] to [[48116,1],0]:16, can't find route
[nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in file routed_binomial.c at line 386
[nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing message from [[48116,2],39] to [[48116,1],0]:16, can't find route
[0] func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f) [0x2ae2ad17d0df]




Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985