Re: [OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Erik Nelson
I see. Thank you both for the prompt replies.

On Thu, Jul 26, 2012 at 8:21 PM, Ralph Castain  wrote:

> Application processes will *only* be placed on nodes included in the
> allocation. The -nolocal flag is intended to ensure that no application
> processes are started on the same node as mpirun in the case where that
> node is included in the allocation. This happens, for example, with Torque,
> where mpirun is executed on one of the allocated nodes.
>
> I believe SGE doesn't do that - and so the allocation won't include the
> submit host, in which case you don't need -nolocal.
>
>
> On Jul 26, 2012, at 5:58 PM, Erik Nelson wrote:
>
> I was under the impression that the -nolocal option keeps processes off
> the submit
> host (since there may be hundreds or thousands of jobs submitted at any
> time,
> and we don't want this host to be overloaded).
>
> My understanding of what you said in your last email is that, by listing
> the hosts,  I
> automatically send all processes (parent and child, or master and slave if
> you
> prefer) to the specified list of hosts.
>
> Reading your email below, it looks like this was the correct understanding.
>
>
> On Thu, Jul 26, 2012 at 5:20 PM, Reuti  wrote:
>
>> Am 26.07.2012 um 23:58 schrieb Erik Nelson:
>>
>> > Reuti,
>> >
>> > Thank you. Our queue is backed up, so it will take a little while
>> before I can try this.
>> >
>> > I assume that by specifying the nodes this way, I don't need (and it
>> would confuse
>> > the system) to add -nolocal. In other words, qsub will try to put the
>> parent node
>> > somewhere in this set.
>> >
>> > Is this the idea?
>>
>> Depends what you refer to by "parent node". I assume you mean the submit
>> host. This is never included in any created selection of SGE unless it's an
>> execution host too.
>>
>> The master host of the parallel job (i.e. the one where the jobscript
>> with the `mpiexec` is running) will be used as a normal machine from MPI's
>> point of view.
>>
>> -- Reuti
>>
>>
>> > Erik
>> >
>> >
>> > On Thu, Jul 26, 2012 at 4:48 PM, Reuti 
>> wrote:
>> > Am 26.07.2012 um 23:33 schrieb Erik Nelson:
>> >
>> > > I have a purely parallel job that runs ~100 processes. Each process
>> has ~identical
>> > > overhead so the speed of the program is dominated by the slowest
>> processor.
>> > >
>> > > For this reason, I would like to restrict the job to a specific set
>> of identical (fast)
>> > > processors on our cluster.
>> > >
>> > > I read the FAQ on -hosts and -hostfile, but it is still unclear to me
>> what effect these
>> > > directives will have in a queuing environment.
>> > >
>> > > Currently, I submit the job using the "qsub" command in the "sge"
>> environment as :
>> > >
>> > > qsub -pe mpich 101 jobfile.job
>> > >
>> > > where jobfile contains the command
>> > >
>> > > mpirun -np 101 -nolocal ./executable
>> >
>> > I would leave -nolocal out here.
>> >
>> > $ qsub -l
>> "h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]" -pe
>> mpich 101 jobfile.job
>> >
>> > -- Reuti
>> >
>> >
>> > > I would like to restrict the job to nodes compute-5-1 to compute-5-32
>> on our machine,
>> > > each containing 8 cpu's (slots). How do I go about this?
>> > >
>> > > Thanks, Erik
>> > >
>> > > --
>> > > Erik Nelson
>> > >
>> > > Howard Hughes Medical Institute
>> > > 6001 Forest Park Blvd., Room ND10.124
>> > > Dallas, Texas 75235-9050
>> > >
>> > > p : 214 645 5981
>> > > f : 214 645 5948
>> > > ___
>> > > users mailing list
>> > > us...@open-mpi.org
>> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>
>
>
>
>
>
>



-- 
Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 
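For reference, Reuti's suggestion can be folded into the job script itself. A minimal sketch only (the file and binary names are hypothetical, and it assumes an SGE cluster with an `mpich` parallel environment, as in the thread):

```shell
# jobfile.job -- sketch of an SGE submit script, not a tested recipe.
# Request 101 slots from the "mpich" parallel environment.
#$ -pe mpich 101
# Restrict scheduling to hosts compute-5-1 .. compute-5-32. SGE host
# patterns cannot express a numeric range directly, so alternation with
# "|" covers 1-9, 10-19, 20-29 and 30-32.
#$ -l h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]

# No -nolocal needed: under SGE the submit host is not part of the
# allocation unless it is also an execution host, so all 101 ranks land
# on the requested compute-5-* nodes.
mpirun -np 101 ./executable
```

Submitted with a plain `qsub jobfile.job`, this should be equivalent to passing the `-l` and `-pe` options on the qsub command line as shown above.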

Re: [OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Ralph Castain
Application processes will *only* be placed on nodes included in the 
allocation. The -nolocal flag is intended to ensure that no application 
processes are started on the same node as mpirun in the case where that node is 
included in the allocation. This happens, for example, with Torque, where 
mpirun is executed on one of the allocated nodes.

I believe SGE doesn't do that - and so the allocation won't include the submit 
host, in which case you don't need -nolocal.





Re: [OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Erik Nelson
I was under the impression that the -nolocal option keeps processes off the
submit
host (since there may be hundreds or thousands of jobs submitted at any
time,
and we don't want this host to be overloaded).

My understanding of what you said in your last email is that, by listing the
hosts,  I
automatically send all processes (parent and child, or master and slave if
you
prefer) to the specified list of hosts.

Reading your email below, it looks like this was the correct understanding.




Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Ah - okay, my misunderstanding. Would you be willing to give the trunk a try? 
It might help to know whether the problem is confined to 1.6 or continues in the trunk.


On Jul 26, 2012, at 4:32 PM, Brock Palen wrote:

> I think so - sorry if I gave you the impression that Rmpi changed.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jul 26, 2012, at 7:30 PM, Ralph Castain wrote:
> 
>> Guess I'm confused - your original note indicated that something had changed 
>> in Rmpi that broke things. Are you now saying it was something in OMPI?
>> 
>> On Jul 26, 2012, at 4:22 PM, Brock Palen wrote:
>> 
>>> OK, will see. We had Rmpi working with 1.4, and it has not been updated 
>>> since 2010, so this kinda stinks.
>>> 
>>> I will keep digging into it thanks for the help.
>>> 
>>> 
>>> 
>>> 
>>> On Jul 26, 2012, at 7:16 PM, Ralph Castain wrote:
>>> 
 Crud - afraid you'll have to ask them, then :-(
 
 
 On Jul 26, 2012, at 3:50 PM, Brock Palen wrote:
 
> Ralph,
> 
> Rmpi wraps everything up, so I tried setting them with
> 
> export OMPI_plm_base_verbose=5
> export OMPI_dpm_base_verbose=5
> 
> and I get no extra messages, even with a simple MPI-1.0 hello-world example. 
> 
> 
> 
> 
> 
> On Jul 26, 2012, at 6:42 PM, Ralph Castain wrote:
> 
>> Well, it looks like comm_spawn is working on 1.6. Afraid I don't know 
>> enough about Rmpi/snow to advise on what changed, but you could add some 
>> debug params to get an idea of where the problem is occurring:
>> 
>> -mca plm_base_verbose 5 -mca dpm_base_verbose 5
>> 
>> should tell you from an OMPI perspective. I can try to help debug that 
>> end, at least.
>> 
>> 
>> On Jul 26, 2012, at 3:02 PM, Ralph Castain wrote:
>> 
>>> Weird - looks like it has done a comm_spawn and is having trouble 
>>> connecting between the jobs. I can check the basic code and make sure 
>>> it is working - I seem to recall someone else recently talking about 
>>> Rmpi changes causing problems (different ones than this, IIRC), so you 
>>> might want to search our user archives for rmpi to see what they ran 
>>> into. Not sure what rmpi changed, or why.
>>> 
>>> On Jul 26, 2012, at 2:41 PM, Brock Palen wrote:
>>> 
 I have run into a problem using Rmpi with OpenMPI (trying to get snow 
 running).
 
 I built OpenMPI following another post where I built static:
 
 ./configure --prefix=$INSTALL/gcc-4.4.6-static 
 --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
 --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran 
 F77=gfortran
 
 Rmpi/snow work fine when I run on a single node.  When I span more 
 than one node I get nasty errors (pasted below).
 
 I tested this mpi install with a simple hello world and that works.  
 Any thoughts on what is different about Rmpi/snow that could cause this?
 
 [nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not 
 found in file routed_binomial.c at line 386
 [nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried 
 routing message from [[48116,2],16] to [[48116,1],0]:16, can't find 
 route
 [nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not 
 found in file routed_binomial.c at line 386
 [nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried 
 routing message from [[48116,2],32] to [[48116,1],0]:16, can't find 
 route
 [0] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
  [0x2b7e9209e0df]
 [1] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
  [0x2b7e9206577a]
 [2] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
  [0x2b7e920404af]
 [3] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
  [0x2b7e92041ed2]
 [4] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
  [0x2b7e92087e38]
 [5] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
  [0x2b7e92016768]
 [6] func:orted(main+0x66) [0x400966]
 [7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
 [8] func:orted() 
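One possible explanation for the silent environment variables in the exchange above: Open MPI only reads MCA parameters from the environment under the `OMPI_MCA_` prefix, so `OMPI_plm_base_verbose` (without the `MCA_`) is simply ignored. A sketch of the two equivalent ways to pass Ralph's suggested settings (the `-np 4 ./a.out` part is a placeholder):

```shell
# Passed directly on the mpirun command line ...
mpirun -mca plm_base_verbose 5 -mca dpm_base_verbose 5 -np 4 ./a.out

# ... or via the environment. Note the OMPI_MCA_ prefix: a variable named
# OMPI_plm_base_verbose is not picked up, which would explain seeing no
# extra output in the test above.
export OMPI_MCA_plm_base_verbose=5
export OMPI_MCA_dpm_base_verbose=5
mpirun -np 4 ./a.out
```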

Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Brock Palen
I think so - sorry if I gave you the impression that Rmpi changed.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985




Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Guess I'm confused - your original note indicated that something had changed in 
Rmpi that broke things. Are you now saying it was something in OMPI?


Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Brock Palen
OK, will see. We had Rmpi working with 1.4, and it has not been updated since 
2010, so this kinda stinks.

I will keep digging into it thanks for the help.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985




Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Crud - afraid you'll have to ask them, then :-(


On Jul 26, 2012, at 3:50 PM, Brock Palen wrote:

> Ralph,
> 
> Rmpi wraps everything up, so I tried setting them with
> 
> export OMPI_plm_base_verbose=5
> export OMPI_dpm_base_verbose=5
> 
> and I get no extra messages even on helloworld example simple MPI-1.0 code. 
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Jul 26, 2012, at 6:42 PM, Ralph Castain wrote:
> 
>> Well, it looks like comm_spawn is working on 1.6. Afraid I don't know enough 
>> about Rmpi/snow to advise on what changed, but you could add some debug 
>> params to get an idea of where the problem is occurring:
>> 
>> -mca plm_base_verbose 5 -mca dpm_base_verbose 5
>> 
>> should tell you from an OMPI perspective. I can try to help debug that end, 
>> at least.
>> 
>> 
>> On Jul 26, 2012, at 3:02 PM, Ralph Castain wrote:
>> 
>>> Weird - looks like it has done a comm_spawn and is having trouble connecting 
>>> between the jobs. I can check the basic code and make sure it is working - 
>>> I seem to recall someone else recently talking about Rmpi changes causing 
>>> problems (different ones than this, IIRC), so you might want to search our 
>>> user archives for rmpi to see what they ran into. Not sure what rmpi 
>>> changed, or why.
>>> 
>>> On Jul 26, 2012, at 2:41 PM, Brock Palen wrote:
>>> 
 I have run into a problem using Rmpi with OpenMPI (trying to get snow 
 running).
 
 I built OpenMPI following another post where I built static:
 
 ./configure --prefix=$INSTALL/gcc-4.4.6-static 
 --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
 --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran 
 F77=gfortran
 
 Rmpi/snow work fine when I run on a single node.  When I span more than 
 one node I get nasty errors (pasted below).
 
 I tested this mpi install with a simple hello world and that works.  Any 
 thoughts on what is different about Rmpi/snow that could cause this?
 
 [nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in 
 file routed_binomial.c at line 386
 [nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing 
 message from [[48116,2],16] to [[48116,1],0]:16, can't find route
 [nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in 
 file routed_binomial.c at line 386
 [nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing 
 message from [[48116,2],32] to [[48116,1],0]:16, can't find route
 [0] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
  [0x2b7e9209e0df]
 [1] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
  [0x2b7e9206577a]
 [2] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
  [0x2b7e920404af]
 [3] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
  [0x2b7e92041ed2]
 [4] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
  [0x2b7e92087e38]
 [5] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
  [0x2b7e92016768]
 [6] func:orted(main+0x66) [0x400966]
 [7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
 [8] func:orted() [0x400839]
 [nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in 
 file routed_binomial.c at line 386
 [nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing 
 message from [[48116,2],7] to [[48116,1],0]:16, can't find route
 [nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in 
 file routed_binomial.c at line 386
 [nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing 
 message from [[48116,2],23] to [[48116,1],0]:16, can't find route
 [nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in 
 file routed_binomial.c at line 386
 [nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing 
 message from [[48116,2],39] to [[48116,1],0]:16, can't find route
 [0] 
 func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
  [0x2ae2ad17d0df]
 
 
 
 
 Brock Palen
 www.umich.edu/~brockp
 CAEN Advanced Computing
 bro...@umich.edu
 (734)936-1985
 
 
 
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> 

Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Brock Palen
Ralph,

Rmpi wraps everything up, so I tried setting them with

export OMPI_plm_base_verbose=5
export OMPI_dpm_base_verbose=5

and I get no extra messages, even on a simple hello-world MPI-1.0 example. 
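
One thing worth checking here: Open MPI only picks MCA parameters out of the environment when the variable name carries the `OMPI_MCA_` prefix, so `OMPI_plm_base_verbose` would be silently ignored — which would explain the lack of extra output. A minimal sketch of the prefixed form (the `mpirun` line in the comment is illustrative):

```shell
# Open MPI reads MCA parameters from environment variables named
# OMPI_MCA_<param>.  These two mirror the -mca command-line form
# suggested earlier in the thread:
export OMPI_MCA_plm_base_verbose=5
export OMPI_MCA_dpm_base_verbose=5

# equivalent command-line form (illustrative):
#   mpirun -np 2 -mca plm_base_verbose 5 -mca dpm_base_verbose 5 ./hello

echo "plm verbosity: $OMPI_MCA_plm_base_verbose"
echo "dpm verbosity: $OMPI_MCA_dpm_base_verbose"
```

The env-var route is useful precisely in the Rmpi/snow case, where the wrapper hides the `mpirun` invocation.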


Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



On Jul 26, 2012, at 6:42 PM, Ralph Castain wrote:

> Well, it looks like comm_spawn is working on 1.6. Afraid I don't know enough 
> about Rmpi/snow to advise on what changed, but you could add some debug 
> params to get an idea of where the problem is occurring:
> 
> -mca plm_base_verbose 5 -mca dpm_base_verbose 5
> 
> should tell you from an OMPI perspective. I can try to help debug that end, 
> at least.
> 
> 
> On Jul 26, 2012, at 3:02 PM, Ralph Castain wrote:
> 
>> Weird - looks like it has done a comm_spawn and is having trouble connecting 
>> between the jobs. I can check the basic code and make sure it is working - I 
>> seem to recall someone else recently talking about Rmpi changes causing 
>> problems (different ones than this, IIRC), so you might want to search our 
>> user archives for rmpi to see what they ran into. Not sure what rmpi 
>> changed, or why.
>> 
>> On Jul 26, 2012, at 2:41 PM, Brock Palen wrote:
>> 
>>> I have run into a problem using Rmpi with OpenMPI (trying to get snow 
>>> running).
>>> 
>>> I built OpenMPI following another post where I built static:
>>> 
>>> ./configure --prefix=$INSTALL/gcc-4.4.6-static 
>>> --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
>>> --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran 
>>> F77=gfortran
>>> 
>>> Rmpi/snow work fine when I run on a single node.  When I span more than one 
>>> node I get nasty errors (pasted below).
>>> 
>>> I tested this mpi install with a simple hello world and that works.  Any 
>>> thoughts on what is different about Rmpi/snow that could cause this?
>>> 
>>> [nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing 
>>> message from [[48116,2],16] to [[48116,1],0]:16, can't find route
>>> [nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing 
>>> message from [[48116,2],32] to [[48116,1],0]:16, can't find route
>>> [0] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>>>  [0x2b7e9209e0df]
>>> [1] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
>>>  [0x2b7e9206577a]
>>> [2] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
>>>  [0x2b7e920404af]
>>> [3] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
>>>  [0x2b7e92041ed2]
>>> [4] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
>>>  [0x2b7e92087e38]
>>> [5] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
>>>  [0x2b7e92016768]
>>> [6] func:orted(main+0x66) [0x400966]
>>> [7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
>>> [8] func:orted() [0x400839]
>>> [nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing 
>>> message from [[48116,2],7] to [[48116,1],0]:16, can't find route
>>> [nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing 
>>> message from [[48116,2],23] to [[48116,1],0]:16, can't find route
>>> [nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in 
>>> file routed_binomial.c at line 386
>>> [nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing 
>>> message from [[48116,2],39] to [[48116,1],0]:16, can't find route
>>> [0] 
>>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>>>  [0x2ae2ad17d0df]
>>> 
>>> 
>>> 
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Well, it looks like comm_spawn is working on 1.6. Afraid I don't know enough 
about Rmpi/snow to advise on what changed, but you could add some debug params 
to get an idea of where the problem is occurring:

-mca plm_base_verbose 5 -mca dpm_base_verbose 5

should tell you from an OMPI perspective. I can try to help debug that end, at 
least.


On Jul 26, 2012, at 3:02 PM, Ralph Castain wrote:

> Weird - looks like it has done a comm_spawn and is having trouble connecting 
> between the jobs. I can check the basic code and make sure it is working - I 
> seem to recall someone else recently talking about Rmpi changes causing 
> problems (different ones than this, IIRC), so you might want to search our 
> user archives for rmpi to see what they ran into. Not sure what rmpi changed, 
> or why.
> 
> On Jul 26, 2012, at 2:41 PM, Brock Palen wrote:
> 
>> I have run into a problem using Rmpi with OpenMPI (trying to get snow 
>> running).
>> 
>> I built OpenMPI following another post where I built static:
>> 
>> ./configure --prefix=$INSTALL/gcc-4.4.6-static 
>> --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
>> --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran 
>> F77=gfortran
>> 
>> Rmpi/snow work fine when I run on a single node.  When I span more than one 
>> node I get nasty errors (pasted below).
>> 
>> I tested this mpi install with a simple hello world and that works.  Any 
>> thoughts on what is different about Rmpi/snow that could cause this?
>> 
>> [nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing 
>> message from [[48116,2],16] to [[48116,1],0]:16, can't find route
>> [nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing 
>> message from [[48116,2],32] to [[48116,1],0]:16, can't find route
>> [0] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>>  [0x2b7e9209e0df]
>> [1] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
>>  [0x2b7e9206577a]
>> [2] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
>>  [0x2b7e920404af]
>> [3] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
>>  [0x2b7e92041ed2]
>> [4] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
>>  [0x2b7e92087e38]
>> [5] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
>>  [0x2b7e92016768]
>> [6] func:orted(main+0x66) [0x400966]
>> [7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
>> [8] func:orted() [0x400839]
>> [nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing 
>> message from [[48116,2],7] to [[48116,1],0]:16, can't find route
>> [nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing 
>> message from [[48116,2],23] to [[48116,1],0]:16, can't find route
>> [nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in 
>> file routed_binomial.c at line 386
>> [nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing 
>> message from [[48116,2],39] to [[48116,1],0]:16, can't find route
>> [0] 
>> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>>  [0x2ae2ad17d0df]
>> 
>> 
>> 
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 




Re: [OMPI users] compilation on windows 7 64-bit

2012-07-26 Thread Damien

Do you have

OMPI_IMPORTS, OPAL_IMPORTS and ORTE_IMPORTS

defined in your preprocessor flags?  You need those.
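
A hedged sketch of where those defines go: in the Visual Studio GUI they live under Project Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions; expressed as compiler flags they look like the following (the source file name and include path are illustrative placeholders, not from the original post):

```shell
# The three *_IMPORTS macros tell the Open MPI headers to declare
# symbols as dllimport, so the linker resolves them against the
# import libraries instead of expecting local definitions.
cl /c /DOMPI_IMPORTS /DOPAL_IMPORTS /DORTE_IMPORTS ^
   /I"C:\path\to\openmpi\include" mpcomm.cpp
```

Without them, MSVC looks for the `_MPI_*`/`_ompi_*` symbols as ordinary externals, which matches the LNK2019/LNK2001 errors quoted below.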

Damien


On 26/07/2012 3:56 PM, Sayre, Alan N wrote:


I'm trying to replace the usage of platform mpi with open mpi. I am 
trying to compile on Windows 7 64 bit using Visual Studio 2010. I have 
added the paths to the openmpi include and library directories and 
added the libmpid.lib and libmpi_cxxd.lib to the linker input. The 
application compiles (finds the mpi headers). When it tries to link I 
get the following output:


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Comm_remote_size referenced in function "struct MpComm_s * 
__cdecl MpCommSpawn(char const *,char const * * const,int,enum 
Bool_e)" (?MpCommSpawn@@YAPAUMpComm_s@@PBDQAPBDHW4Bool_e@@@Z)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Comm_spawn referenced in function "struct MpComm_s * __cdecl 
MpCommSpawn(char const *,char const * * const,int,enum Bool_e)" 
(?MpCommSpawn@@YAPAUMpComm_s@@PBDQAPBDHW4Bool_e@@@Z)


como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_info_null


como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_comm_self


como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_comm_null


como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_op_sum


como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_op_min


como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_op_max


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Intercomm_create referenced in function "int __cdecl 
MpCommCreateCommunicators(struct MpComm_s * *,struct MpComm_s * *)" 
(?MpCommCreateCommunicators@@YAHPAPAUMpComm_s@@0@Z)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Comm_split referenced in function "int __cdecl 
MpCommCreateCommunicators(struct MpComm_s * *,struct MpComm_s * *)" 
(?MpCommCreateCommunicators@@YAHPAPAUMpComm_s@@0@Z)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Comm_rank referenced in function "int __cdecl 
MpCommCreateCommunicators(struct MpComm_s * *,struct MpComm_s * *)" 
(?MpCommCreateCommunicators@@YAHPAPAUMpComm_s@@0@Z)


como_mplib.lib(mpenv.obj) : error LNK2001: unresolved external symbol 
_MPI_Comm_rank


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Comm_size referenced in function "int __cdecl 
MpCommCreateCommunicators(struct MpComm_s * *,struct MpComm_s * *)" 
(?MpCommCreateCommunicators@@YAHPAPAUMpComm_s@@0@Z)


como_mplib.lib(mpenv.obj) : error LNK2001: unresolved external symbol 
_MPI_Comm_size


como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_comm_world


como_mplib.lib(mpenv.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_comm_world


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Comm_get_parent referenced in function "struct MpComm_s * __cdecl 
MpCommNewChild(void)" (?MpCommNewChild@@YAPAUMpComm_s@@XZ)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Comm_free referenced in function "void __cdecl MpCommFree(struct 
MpComm_s *)" (?MpCommFree@@YAXPAUMpComm_s@@@Z)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Send referenced in function "int __cdecl MpCommSend(struct 
MpComm_s *,void const *,int,enum Dtype_e,int,int)" 
(?MpCommSend@@YAHPAUMpComm_s@@PBXHW4Dtype_e@@HH@Z)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Isend referenced in function "int __cdecl MpCommISend(struct 
MpComm_s *,void const *,int,enum Dtype_e,int,int,struct MpRequest_s 
*)" (?MpCommISend@@YAHPAUMpComm_s@@PBXHW4Dtype_e@@HHPAUMpRequest_s@@@Z)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Get_count referenced in function "int __cdecl MpCommRecv(struct 
MpComm_s *,void *,int,enum Dtype_e,int,int,struct MpStatus_s *)" 
(?MpCommRecv@@YAHPAUMpComm_s@@PAXHW4Dtype_e@@HHPAUMpStatus_s@@@Z)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Recv referenced in function "int __cdecl MpCommRecv(struct 
MpComm_s *,void *,int,enum Dtype_e,int,int,struct MpStatus_s *)" 
(?MpCommRecv@@YAHPAUMpComm_s@@PAXHW4Dtype_e@@HHPAUMpStatus_s@@@Z)


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Irecv referenced in function "int __cdecl MpCommIRecv(struct 
MpComm_s *,void *,int,enum Dtype_e,int,int,struct MpRequest_s *)" 
(?MpCommIRecv@@YAHPAUMpComm_s@@PAXHW4Dtype_e@@HHPAUMpRequest_s@@@Z)


como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol 
_ompi_mpi_char


como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol 
_MPI_Probe referenced in function "int __cdecl MpCommProbe(struct 
MpComm_s *,int,int,struct MpStatus_s *)" 

Re: [OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Reuti
Am 26.07.2012 um 23:58 schrieb Erik Nelson:

> Reuti,
> 
> Thank you. Our queue is backed up, so it will take a little while before I 
> can try this. 
> 
> I assume that by specifying the nodes this way, I don't need to add -nolocal 
> (and adding it would confuse the system). In other words, qsub will try to 
> put the parent node somewhere in this set. 
> 
> Is this the idea?

Depends what you refer to by "parent node". I assume you mean the submit host. 
This is never included in any created selection of SGE unless it's an execution 
host too.

The master host of the parallel job (i.e. the one where the jobscript with the 
`mpiexec` is running) will be used as a normal machine from MPI's point of view.

-- Reuti


> Erik
> 
> 
> On Thu, Jul 26, 2012 at 4:48 PM, Reuti  wrote:
> Am 26.07.2012 um 23:33 schrieb Erik Nelson:
> 
> > I have a purely parallel job that runs ~100 processes. Each process has 
> > ~identical
> > overhead so the speed of the program is dominated by the slowest processor.
> >
> > For this reason, I would like to restrict the job to a specific set of 
> > identical (fast)
> > processors on our cluster.
> >
> > I read the FAQ on -hosts and -hostfile, but it is still unclear to me what 
> > effect these
> > directives will have in a queuing environment.
> >
> > Currently, I submit the job using the "qsub" command in the "sge" 
> > environment as :
> >
> > qsub -pe mpich 101 jobfile.job
> >
> > where jobfile contains the command
> >
> > mpirun -np 101 -nolocal ./executable
> 
> I would leave -nolocal out here.
> 
> $ qsub -l 
> "h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]" -pe 
> mpich 101 jobfile.job
> 
> -- Reuti
> 
> 
> > I would like to restrict the job to nodes compute-5-1 to compute-5-32 on 
> > our machine,
> > each containing 8 cpu's (slots). How do I go about this?
> >
> > Thanks, Erik
> >
> > --
> > Erik Nelson
> >
> > Howard Hughes Medical Institute
> > 6001 Forest Park Blvd., Room ND10.124
> > Dallas, Texas 75235-9050
> >
> > p : 214 645 5981
> > f : 214 645 5948
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> -- 
> Erik Nelson
> 
> Howard Hughes Medical Institute
> 6001 Forest Park Blvd., Room ND10.124
> Dallas, Texas 75235-9050
> 
> p : 214 645 5981
> f : 214 645 5948
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Ralph Castain
Weird - looks like it has done a comm_spawn and is having trouble connecting 
between the jobs. I can check the basic code and make sure it is working - I 
seem to recall someone else recently talking about Rmpi changes causing 
problems (different ones than this, IIRC), so you might want to search our user 
archives for rmpi to see what they ran into. Not sure what rmpi changed, or why.

On Jul 26, 2012, at 2:41 PM, Brock Palen wrote:

> I have run into a problem using Rmpi with OpenMPI (trying to get snow 
> running).
> 
> I built OpenMPI following another post where I built static:
> 
> ./configure --prefix=$INSTALL/gcc-4.4.6-static 
> --mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
> --with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran 
> F77=gfortran
> 
> Rmpi/snow work fine when I run on a single node.  When I span more than one 
> node I get nasty errors (pasted below).
> 
> I tested this mpi install with a simple hello world and that works.  Any 
> thoughts on what is different about Rmpi/snow that could cause this?
> 
> [nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing 
> message from [[48116,2],16] to [[48116,1],0]:16, can't find route
> [nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing 
> message from [[48116,2],32] to [[48116,1],0]:16, can't find route
> [0] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>  [0x2b7e9209e0df]
> [1] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
>  [0x2b7e9206577a]
> [2] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
>  [0x2b7e920404af]
> [3] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
>  [0x2b7e92041ed2]
> [4] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
>  [0x2b7e92087e38]
> [5] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
>  [0x2b7e92016768]
> [6] func:orted(main+0x66) [0x400966]
> [7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
> [8] func:orted() [0x400839]
> [nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing 
> message from [[48116,2],7] to [[48116,1],0]:16, can't find route
> [nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing 
> message from [[48116,2],23] to [[48116,1],0]:16, can't find route
> [nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in 
> file routed_binomial.c at line 386
> [nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing 
> message from [[48116,2],39] to [[48116,1],0]:16, can't find route
> [0] 
> func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
>  [0x2ae2ad17d0df]
> 
> 
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Reuti
Am 26.07.2012 um 23:48 schrieb Reuti:

> Am 26.07.2012 um 23:33 schrieb Erik Nelson:
> 
>> I have a purely parallel job that runs ~100 processes. Each process has 
>> ~identical 
>> overhead so the speed of the program is dominated by the slowest processor.
>> 
>> For this reason, I would like to restrict the job to a specific set of 
>> identical (fast)
>> processors on our cluster.
>> 
>> I read the FAQ on -hosts and -hostfile, but it is still unclear to me what 
>> effect these 
>> directives will have in a queuing environment.
>> 
>> Currently, I submit the job using the "qsub" command in the "sge" 
>> environment as :
>> 
>>qsub -pe mpich 101 jobfile.job
>> 
>> where jobfile contains the command
>> 
>>mpirun -np 101 -nolocal ./executable
> 
> I would leave -nolocal out here.
> 
> $ qsub -l 
> "h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]" -pe 
> mpich 101 jobfile.job

Or shorter:

$ qsub -l "h=compute-5*&(*-[1-9]|*-[1-2][0-9]|*-3[0-2])" -pe mpich 101 
jobfile.job
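
As a sanity check on the bracket ranges in the two `qsub` requests above: they are meant to cover exactly compute-5-1 through compute-5-32. SGE's wildcard matching is its own syntax, but for these simple patterns it coincides with shell `case` globbing, so the coverage can be sketched like this:

```shell
# Count which of compute-5-1 .. compute-5-40 the four range patterns
# match; only 1-32 should qualify (9 + 10 + 10 + 3 = 32 hosts).
matched=0
for n in $(seq 1 40); do
  host="compute-5-$n"
  case "$host" in
    compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2])
      matched=$((matched + 1)) ;;
  esac
done
echo "hosts matched: $matched"
```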

-- Reuti


> -- Reuti
> 
> 
>> I would like to restrict the job to nodes compute-5-1 to compute-5-32 on our 
>> machine, 
>> each containing 8 cpu's (slots). How do I go about this?
>> 
>> Thanks, Erik
>> 
>> -- 
>> Erik Nelson
>> 
>> Howard Hughes Medical Institute
>> 6001 Forest Park Blvd., Room ND10.124
>> Dallas, Texas 75235-9050
>> 
>> p : 214 645 5981
>> f : 214 645 5948
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] compilation on windows 7 64-bit

2012-07-26 Thread Sayre, Alan N
I'm trying to replace the usage of platform mpi with open mpi. I am
trying to compile on Windows 7 64 bit using Visual Studio 2010. I have
added the paths to the openmpi include and library directories and added
the libmpid.lib and libmpi_cxxd.lib to the linker input. The application
compiles (finds the mpi headers). When it tries to link I get the
following output:

 

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Comm_remote_size referenced in function "struct MpComm_s * __cdecl
MpCommSpawn(char const *,char const * * const,int,enum Bool_e)"
(?MpCommSpawn@@YAPAUMpComm_s@@PBDQAPBDHW4Bool_e@@@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Comm_spawn referenced in function "struct MpComm_s * __cdecl
MpCommSpawn(char const *,char const * * const,int,enum Bool_e)"
(?MpCommSpawn@@YAPAUMpComm_s@@PBDQAPBDHW4Bool_e@@@Z)

como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_info_null

como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_comm_self

como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_comm_null

como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_op_sum

como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_op_min

como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_op_max

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Intercomm_create referenced in function "int __cdecl
MpCommCreateCommunicators(struct MpComm_s * *,struct MpComm_s * *)"
(?MpCommCreateCommunicators@@YAHPAPAUMpComm_s@@0@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Comm_split referenced in function "int __cdecl
MpCommCreateCommunicators(struct MpComm_s * *,struct MpComm_s * *)"
(?MpCommCreateCommunicators@@YAHPAPAUMpComm_s@@0@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Comm_rank referenced in function "int __cdecl
MpCommCreateCommunicators(struct MpComm_s * *,struct MpComm_s * *)"
(?MpCommCreateCommunicators@@YAHPAPAUMpComm_s@@0@Z)

como_mplib.lib(mpenv.obj) : error LNK2001: unresolved external symbol
_MPI_Comm_rank

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Comm_size referenced in function "int __cdecl
MpCommCreateCommunicators(struct MpComm_s * *,struct MpComm_s * *)"
(?MpCommCreateCommunicators@@YAHPAPAUMpComm_s@@0@Z)

como_mplib.lib(mpenv.obj) : error LNK2001: unresolved external symbol
_MPI_Comm_size

como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_comm_world

como_mplib.lib(mpenv.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_comm_world

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Comm_get_parent referenced in function "struct MpComm_s * __cdecl
MpCommNewChild(void)" (?MpCommNewChild@@YAPAUMpComm_s@@XZ)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Comm_free referenced in function "void __cdecl MpCommFree(struct
MpComm_s *)" (?MpCommFree@@YAXPAUMpComm_s@@@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Send referenced in function "int __cdecl MpCommSend(struct MpComm_s
*,void const *,int,enum Dtype_e,int,int)"
(?MpCommSend@@YAHPAUMpComm_s@@PBXHW4Dtype_e@@HH@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Isend referenced in function "int __cdecl MpCommISend(struct
MpComm_s *,void const *,int,enum Dtype_e,int,int,struct MpRequest_s *)"
(?MpCommISend@@YAHPAUMpComm_s@@PBXHW4Dtype_e@@HHPAUMpRequest_s@@@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Get_count referenced in function "int __cdecl MpCommRecv(struct
MpComm_s *,void *,int,enum Dtype_e,int,int,struct MpStatus_s *)"
(?MpCommRecv@@YAHPAUMpComm_s@@PAXHW4Dtype_e@@HHPAUMpStatus_s@@@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Recv referenced in function "int __cdecl MpCommRecv(struct MpComm_s
*,void *,int,enum Dtype_e,int,int,struct MpStatus_s *)"
(?MpCommRecv@@YAHPAUMpComm_s@@PAXHW4Dtype_e@@HHPAUMpStatus_s@@@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Irecv referenced in function "int __cdecl MpCommIRecv(struct
MpComm_s *,void *,int,enum Dtype_e,int,int,struct MpRequest_s *)"
(?MpCommIRecv@@YAHPAUMpComm_s@@PAXHW4Dtype_e@@HHPAUMpRequest_s@@@Z)

como_mplib.lib(mpcomm.obj) : error LNK2001: unresolved external symbol
_ompi_mpi_char

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Probe referenced in function "int __cdecl MpCommProbe(struct
MpComm_s *,int,int,struct MpStatus_s *)"
(?MpCommProbe@@YAHPAUMpComm_s@@HHPAUMpStatus_s@@@Z)

como_mplib.lib(mpcomm.obj) : error LNK2019: unresolved external symbol
_MPI_Barrier referenced in function "int __cdecl MpCommBarrier(struct
MpComm_s *)" (?MpCommBarrier@@YAHPAUMpComm_s@@@Z)


Re: [OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Reuti
Am 26.07.2012 um 23:33 schrieb Erik Nelson:

> I have a purely parallel job that runs ~100 processes. Each process has 
> ~identical 
> overhead so the speed of the program is dominated by the slowest processor.
>  
> For this reason, I would like to restrict the job to a specific set of 
> identical (fast)
> processors on our cluster.
> 
> I read the FAQ on -hosts and -hostfile, but it is still unclear to me what 
> effect these 
> directives will have in a queuing environment.
> 
> Currently, I submit the job using the "qsub" command in the "sge" environment 
> as :
> 
> qsub -pe mpich 101 jobfile.job
> 
> where jobfile contains the command
> 
> mpirun -np 101 -nolocal ./executable

I would leave -nolocal out here.

$ qsub -l 
"h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]" -pe 
mpich 101 jobfile.job
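Sketched as a complete submission script, this might look like the following. This is a sketch only: the PE name `mpich`, the slot count, and the `compute-5-*` hostnames are taken from the thread, and the exact `-l h=` syntax should be checked against the local SGE configuration.

```shell
#!/bin/sh
# jobfile.job -- hypothetical SGE job script based on the thread
#$ -cwd
#$ -pe mpich 101
# Restrict to the fast nodes at submit time (note: no -nolocal needed,
# since SGE never schedules onto the submit host unless it is also an
# execution host):
#   qsub -l "h=compute-5-[1-9]|compute-5-1[0-9]|compute-5-2[0-9]|compute-5-3[0-2]" jobfile.job
mpirun -np 101 ./executable
```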

-- Reuti


> I would like to restrict the job to nodes compute-5-1 to compute-5-32 on our 
> machine, 
> each containing 8 CPUs (slots). How do I go about this?
> 
> Thanks, Erik
> 
> -- 
> Erik Nelson
> 
> Howard Hughes Medical Institute
> 6001 Forest Park Blvd., Room ND10.124
> Dallas, Texas 75235-9050
> 
> p : 214 645 5981
> f : 214 645 5948
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] OpenMPI and Rmpi/snow

2012-07-26 Thread Brock Palen
I have run into a problem using Rmpi with OpenMPI (trying to get snow running).

I built OpenMPI following another post where I built static:

./configure --prefix=$INSTALL/gcc-4.4.6-static 
--mandir=$INSTALL/gcc-4.4.6-static/man --with-tm=/usr/local/torque/ 
--with-openib --with-psm --enable-static CC=gcc CXX=g++ FC=gfortran F77=gfortran

Rmpi/snow work fine when I run on a single node.  When I span more than one 
node I get nasty errors (pasted below).

I tested this MPI install with a simple hello world and that works.  Any 
thoughts on what is different about Rmpi/snow that could cause this?

[nyx0400.engin.umich.edu:11927] [[48116,0],4] ORTE_ERROR_LOG: Not found in file 
routed_binomial.c at line 386
[nyx0400.engin.umich.edu:11927] [[48116,0],4]:route_callback tried routing 
message from [[48116,2],16] to [[48116,1],0]:16, can't find route
[nyx0405.engin.umich.edu:07707] [[48116,0],8] ORTE_ERROR_LOG: Not found in file 
routed_binomial.c at line 386
[nyx0405.engin.umich.edu:07707] [[48116,0],8]:route_callback tried routing 
message from [[48116,2],32] to [[48116,1],0]:16, can't find route
[0] 
func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
 [0x2b7e9209e0df]
[1] 
func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x9f77a)
 [0x2b7e9206577a]
[2] 
func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(mca_oob_tcp_msg_recv_complete+0x27f)
 [0x2b7e920404af]
[3] 
func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(+0x7bed2)
 [0x2b7e92041ed2]
[4] 
func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_event_base_loop+0x238)
 [0x2b7e92087e38]
[5] 
func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(orte_daemon+0x8d8)
 [0x2b7e92016768]
[6] func:orted(main+0x66) [0x400966]
[7] func:/lib64/libc.so.6(__libc_start_main+0xfd) [0x3d39c1ecdd]
[8] func:orted() [0x400839]
[nyx0397.engin.umich.edu:06959] [[48116,0],1] ORTE_ERROR_LOG: Not found in file 
routed_binomial.c at line 386
[nyx0397.engin.umich.edu:06959] [[48116,0],1]:route_callback tried routing 
message from [[48116,2],7] to [[48116,1],0]:16, can't find route
[nyx0401.engin.umich.edu:07782] [[48116,0],5] ORTE_ERROR_LOG: Not found in file 
routed_binomial.c at line 386
[nyx0401.engin.umich.edu:07782] [[48116,0],5]:route_callback tried routing 
message from [[48116,2],23] to [[48116,1],0]:16, can't find route
[nyx0406.engin.umich.edu:07743] [[48116,0],9] ORTE_ERROR_LOG: Not found in file 
routed_binomial.c at line 386
[nyx0406.engin.umich.edu:07743] [[48116,0],9]:route_callback tried routing 
message from [[48116,2],39] to [[48116,1],0]:16, can't find route
[0] 
func:/home/software/rhel6/openmpi-1.6.0/gcc-4.4.6-static/lib/libopen-rte.so.4(opal_backtrace_print+0x1f)
 [0x2ae2ad17d0df]




Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985






[OMPI users] restricting a job to a set of hosts

2012-07-26 Thread Erik Nelson
I have a purely parallel job that runs ~100 processes. Each process has
~identical
overhead so the speed of the program is dominated by the slowest processor.

For this reason, I would like to restrict the job to a specific set of
identical (fast)
processors on our cluster.

I read the FAQ on -hosts and -hostfile, but it is still unclear to me what
effect these
directives will have in a queuing environment.

Currently, I submit the job using the "qsub" command in the "sge"
environment as :

qsub -pe mpich 101 jobfile.job

where jobfile contains the command

mpirun -np 101 -nolocal ./executable

I would like to restrict the job to nodes compute-5-1 to compute-5-32 on
our machine,
each containing 8 CPUs (slots). How do I go about this?

Thanks, Erik

-- 
Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 645 5948


Re: [OMPI users] issue with addresses

2012-07-26 Thread Priyesh Srivastava
hello  Hristo

Thank you for taking a look at the program and the output.
The detailed explanation was very helpful. I also found out that the
signature of a derived datatype is the specific sequence of the primitive
datatypes and is independent of the displacements. So the differences in
the relative addresses will not cause a problem.

thanks again  :)
priyesh

On Wed, Jul 25, 2012 at 12:00 PM,  wrote:

>
> Today's Topics:
>
>1. Re: issue with addresses (Iliev, Hristo)
>2. Re: Extent of Distributed Array Type? (George Bosilca)
>3. Re: Extent of Distributed Array Type? (Jeff Squyres)
>4. Re: Extent of Distributed Array Type? (Richard Shaw)
>5. Mpi_leave_pinned=1 is thread safe? (tmish...@jcity.maeda.co.jp)
>6. Re: Fortran90 Bindings (Kumar, Sudhir)
>7. Re: Fortran90 Bindings (Damien)
>
>
> --
>
> Message: 1
> Date: Tue, 24 Jul 2012 17:10:33 +
> From: "Iliev, Hristo" 
> Subject: Re: [OMPI users] issue with addresses
> To: Open MPI Users 
> Message-ID: <18d6fe2f-7a68-4d1a-94fe-c14058ba4...@rz.rwth-aachen.de>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi, Priyesh,
>
> The output of your program is pretty much what one would expect.
> 140736841025492 is 0x7FFFD96A87D4 which pretty much corresponds to a
> location in the stack, which is to be expected as a and b are scalar
> variables and most likely end up on the stack. As c is an array, its location
> is compiler-dependent. Some compilers put small arrays on the stack while
> others make them global or allocate them on the heap. In your case 0x6ABAD0
> could either be somewhere in the BSS (where uninitialised global variables
> reside) or in the heap, which starts right after BSS (I would say it is the
> BSS). If the array is placed in BSS its location is fixed with respect to
> the image base.
>
> Linux by default implements partial Address Space Layout Randomisation
> (ASLR) by placing the program stack at a slightly different location with
> each run (this is to make remote stack based exploits harder). That's why
> you see different addresses for variables on the stack. But things in BSS
> would pretty much have the same addresses when the code is executed
> multiple times or on different machines having the same architecture and
> similar OS with similar settings since executable images are still loaded
> at the same base virtual address.
>
> Having different addresses is not an issue for MPI as it only operates
> with pointers which are local to the process as well as with relative
> offsets. You pass the MPI_Send or MPI_Recv function the address of the data
> buffer in the current process and it has nothing to do with where those
> buffers are located in the other processes. Note also that MPI supports
> heterogeneous computing, e.g. the sending process might be 32-bit and the
> receiving one 64-bit. In this scenario it is quite probable that the
> addresses will differ by very large margin (e.g. the stack address of your
> 64-bit output is not even valid on 32-bit system).
>
> Hope that helps more :)
>
> Kind regards,
> Hristo
>
> On 24.07.2012, at 02:02, Priyesh Srivastava wrote:
>
> > hello  Hristo
> >
> > Thank you for your reply. I was able to understand some parts of your
> response, but still had some doubts due to my lack of knowledge about the
> way memory is allocated.
> >
> > I have created a small sample program and the resulting output which
> will help me  pin point my question.
> > The program is :
> >
> >
> > program test
> >   include'mpif.h'
> >
> >   integer a,b,c(10),ierr,id,datatype,size(3),type(3),i,status
> >
> >   integer(kind=MPI_ADDRESS_KIND) add(3)
> >
> >
> >   call MPI_INIT(ierr)
> >   call MPI_COMM_RANK(MPI_COMM_WORLD,id,ierr)
> >   call MPI_GET_ADDRESS(a,add(1),ierr)
> >   write(*,*) 'address of a ,id ', add(1), id
> >   call MPI_GET_ADDRESS(b,add(2),ierr)
> >   write(*,*) 'address of b,id ', add(2), id
> >   call MPI_GET_ADDRESS(c,add(3),ierr)
> >   write(*,*) 'address of c,id ', add(3), id
> >
> >   add(3)=add(3)-add(1)
> >   add(2)=add(2)-add(1)
> >   add(1)=add(1)-add(1)
> >
> >   size(1)=1
> >   size(2)=1
> >   size(3)=10
> >   type(1)=MPI_INTEGER
> >   type(2)=MPI_INTEGER
> >   type(3)=MPI_INTEGER
> >   call MPI_TYPE_CREATE_STRUCT(3,size,add,type,datatype,ierr)
> >   

Re: [OMPI users] Fortran90 Bindings

2012-07-26 Thread Shiqing Fan

No, it's not related to the version of Visual Studio.

On 2012-07-26 4:08 AM, Kumar, Sudhir wrote:


I am wondering if it is related to the version of Visual Studio; I am 
using Visual Studio 2005.


*Sudhir Kumar*

*From:*users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] 
*On Behalf Of *Damien

*Sent:* Wednesday, July 25, 2012 3:35 PM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] Fortran90 Bindings

Hmmm.  My 64-bit builds create mpif77.exe, libmpi_f77.lib and 
libmpi_f77.dll, and they work.


Damien

On 25/07/2012 10:11 AM, Kumar, Sudhir wrote:

Hi

I am new to Open MPI, so please pardon my ignorance. I just came
across an article from June which refers to F77 bindings being available
for 32-bit Windows only. Has something changed since then?

http://www.open-mpi.org/community/lists/users/2012/06/19525.php

Thanks so much.

*From:*users-boun...@open-mpi.org

[mailto:users-boun...@open-mpi.org] *On Behalf Of *Damien
*Sent:* Wednesday, July 25, 2012 10:52 AM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] Fortran90 Bindings

Sudhir,

F77 works on both.

Damien


On 25/07/2012 8:55 AM, Kumar, Sudhir wrote:

Hi

I have one more related question. Is the F77 bindings
available for both 64bit and 32 bit windows environments or
just for the 32 bit environment.

Thanks

*From:*users-boun...@open-mpi.org

[mailto:users-boun...@open-mpi.org] *On Behalf Of *Damien
*Sent:* Wednesday, July 18, 2012 10:11 AM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] Fortran90 Bindings

Hmmm.  6 months ago there weren't F90 bindings in the Windows
version (the F90 bindings are large and tricky).  It's an
option you can select when you compile it yourself, but
looking at the one I just did a month ago, there's still no
mpif90.exe built, so I'd say that's still not supported on
Windows.  :-(

Damien

On 18/07/2012 9:00 AM, Kumar, Sudhir wrote:

Hi, I had meant to ask about Fortran90 bindings for Windows

*Sudhir Kumar*

*From:*users-boun...@open-mpi.org

[mailto:users-boun...@open-mpi.org] *On Behalf Of *Damien
*Sent:* Wednesday, July 18, 2012 9:56 AM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] Fortran90 Bindings

Yep.

On 18/07/2012 8:53 AM, Kumar, Sudhir wrote:

Hi

Just wondering if Fortran90 bindings are available for
Open MPI 1.6

Thanks

*Sudhir Kumar*




























--
---
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: +49 (0)711-685-87234  Fax: +49 (0)711-685-65832
Nobelstrasse 19, 70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de