Re: [OMPI users] mpi with icc, icpc and ifort :: segfault (Jeff Squyres)

2007-07-11 Thread Jeff Squyres

On Jul 11, 2007, at 10:52 AM, Ricardo Reis wrote:


Whoa -- if you are failing here, something is definitely wrong: this
is failing when accessing stack memory!

Are you able to compile/run other trivial and non-trivial C++
applications using your Intel compiler installation?


Please ignore my last reply. As I said previously, I can compile
and use LAM MPI with my Intel compiler installation. I believe that
LAM uses C++ inside, no?


LAM uses C++ for the laminfo command and its wrapper compilers (mpicc  
and friends).  Did you use those successfully?


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] MPI_Reduce problem

2007-07-11 Thread anyi li

Hi Jelena,

 int* ttt = (int*)malloc(2 * sizeof(int));
 ttt[0] = myworldrank + 1;
 ttt[1] = myworldrank * 2;
 if(myworldrank == 0)
   MPI_Reduce(MPI_IN_PLACE, ttt, 2, MPI_INT, MPI_SUM, 0,
MPI_COMM_WORLD); //sum all logdetphi from different nodes
 else
   MPI_Reduce(ttt, NULL, 2, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
//sum all logdetphi from different nodes

 FOR_WORLDNODE0 printf("%d, %d\n" , ttt[0],ttt[1]);


That works. Thanks so much.

Anyi

On 7/11/07, Jelena Pjesivac-Grbovic  wrote:

Hi Anyi,

you are using reduce incorrectly: you cannot use the same buffer as input
and output.
If you want to do the operation in place, you must specify "MPI_IN_PLACE"
as the send buffer at the root process.
Thus, your code should look something like:

   int* ttt = (int*)malloc(2 * sizeof(int));
   ttt[0] = myworldrank + 1;
   ttt[1] = myworldrank * 2;
   if (root == myworldrank) {
      MPI_Reduce(MPI_IN_PLACE, ttt, 2, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
   } else {
      MPI_Reduce(ttt, NULL, 2, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
   }
   FOR_WORLDNODE0 printf("%d, %d\n", ttt[0], ttt[1]);


hope this helps,
Jelena
PS. If I remember the standard correctly, you must specify a send buffer on
non-root nodes - it cannot be MPI_IN_PLACE (if you try it, you'll get a
segfault).
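
For anyone following along in Python rather than C, the same in-place pattern
looks roughly like this with mpi4py (a different binding from anything used in
this thread, so treat it purely as an illustrative sketch):

   from mpi4py import MPI
   import numpy as np

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()
   root = 0

   buf = np.array([rank + 1, rank * 2], dtype='i')

   if rank == root:
       # Root passes MPI.IN_PLACE as the send buffer and receives the sums in buf.
       comm.Reduce(MPI.IN_PLACE, buf, op=MPI.SUM, root=root)
   else:
       # Non-root ranks must pass a real send buffer, never MPI.IN_PLACE.
       comm.Reduce(buf, None, op=MPI.SUM, root=root)

   if rank == root:
       print("%d, %d" % (buf[0], buf[1]))   # expect "10, 12" on 4 processes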

On Wed, 11 Jul 2007 any...@pa.uky.edu wrote:

> Hi,
>  I have code with an identical vector on each node, and I am going to do
> the vector sum and return the result to root.  Something like this,
>
>  int* ttt = (int*)malloc(2 * sizeof(int));
>  ttt[0] = myworldrank + 1;
>  ttt[1] = myworldrank * 2;
>   MPI_Allreduce(ttt, ttt, 2, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
>  FOR_WORLDNODE0 printf("%d, %d\n" , ttt[0],ttt[1]);
>
>  myworldrank is the rank of the local node. If I run this code on 4 nodes, what
> I expect to get back is 10,12. But what I got is 18,24. So I'm confused about
> MPI_Reduce; is it supposed to do the vector sum?
>  I tried MPI_Allreduce, and it gave me the correct answer, 10, 12.
>
>  Has anyone met the same problem, or am I wrong in how I call MPI_Reduce()?
>
>  Thanks.
> Anyi
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Jelena Pjesivac-Grbovic, Pjesa
Graduate Research Assistant
Innovative Computing Laboratory
Computer Science Department, UTK
Claxton Complex 350
(865) 974 - 6722
(865) 974 - 6321
jpjes...@utk.edu

Murphy's Law of Research:
 Enough research will tend to support your theory.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Ralph Castain
Hooray! Glad we could track it down.

The problem here is that you might actually want to provide a set of
variables to direct that second orterun's behavior. Fortunately, we actually
provided you with a way to do it!

You can set any MCA param on the command line by just doing "-mca param
value". So, if you construct the second orterun command as a char string,
you could build into it the complete description of params that you wanted.
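
For example, a rough sketch of building that command string from Python (the
MCA parameter names and values below are only placeholders, not recommendations
from this thread):

   import os

   # Hypothetical example: pass MCA params explicitly on the nested command
   # line instead of relying on inherited OMPI_MCA_* environment variables.
   mca_params = {
       "btl": "tcp,sm,self",        # placeholder value
       "odls_base_verbose": "1",    # placeholder value
   }
   mca_args = " ".join("-mca %s %s" % (k, v) for k, v in mca_params.items())
   cmd = "orterun -np 2 %s nwchem.x nwchem.inp > nwchem.out" % mca_args
   os.system(cmd)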

If you still want to use os.system instead of os.execve, I could give you a
command-line option to orterun that would say "ignore anything in the
environment - only look at the system default param file and the command
line". We probably don't want to ignore the system defaults as that is where
your sys admin can help you (e.g., by specifying params that optimize Open
MPI for that environment).

Would that help? If so, I could send you a patch you could try - might take
me a day or two to compose and test it.

Alternatively, if os.execve is an acceptable option, I could give you a list
of the params you specifically would need to purge. Not everything will
cause problems, and you *may* want to keep those that specified something
useful - e.g., what interconnect you wanted the app to use. You could then
write a filter just to purge the problem ones.
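
A minimal sketch of such a filter, assuming the variables to purge are the
OMPI_MCA_* ones (the exact list is the open question here) and a hypothetical
orterun path; os.spawnve is used instead of os.execve so the calling Python
process is not replaced:

   import os

   def purged_env(prefix="OMPI_MCA_"):
       """Copy of os.environ without variables starting with prefix.
       Assumption: OMPI_MCA_* is the problem set."""
       return dict((k, v) for k, v in os.environ.items()
                   if not k.startswith(prefix))

   # Launch the nested orterun with the filtered environment.
   args = ["orterun", "-np", "2", "nwchem.x", "nwchem.inp"]
   os.spawnve(os.P_WAIT, "/usr/local/bin/orterun", args, purged_env())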

Let me know which way you would like to go...
Ralph



On 7/11/07 6:32 PM, "Lev Gelb"  wrote:

> 
> Well done, that was exactly the problem -
> 
> Python's os.environ passes the complete collection of shell variables.
> 
> I tried a different os method (os.execve) , where I could specify the
> environment (I took out all the OMPI_* variables) and the second orterun
> call worked!
> 
> Now I just need a cleaner way to reset the environment within the
> spawned process.  (Or,  a way to tell orterun to ignore/overwrite the
> existing OMPI_* variables...?)
> 
> Thanks for your help,
> 
> Lev
> 
> 
> 
> On Wed, 11 Jul 2007, Ralph Castain wrote:
> 
>> Hmmm...interesting. As a cross-check on something - if you os.system, does
>> your environment by any chance get copied across? Reason I ask: we set a
>> number of environmental variables when orterun spawns a process. If you call
>> orterun from within that process - and the new orterun sees the enviro
>> variables from the parent process - then I can guarantee it won't work.
>> 
>> What you need is for os.system to start its child with a clean environment.
>> I would imagine if you just os.system'd something that output'd the
>> environment, that would suffice to identify the problem. If you see anything
>> that starts with OMPI_MCA_..., then we are indeed doomed.
>> 
>> Which would definitely explain why the persistent orted wouldn't help solve
>> the problem.
>> 
>> Ralph
>> 
>> 
>> 
>> On 7/11/07 3:05 PM, "Lev Gelb"  wrote:
>> 
>>> 
>>> Thanks for the suggestions.  The separate 'orted' scheme (below) did
>>> not work, unfortunately;  same behavior as before.  I have conducted
>>> a few other simple tests, and found:
>>> 
>>> 1.  The problem only occurs if the first process is "in" MPI;
>>> if it doesn't call MPI_Init or calls MPI_Finalize before it executes
>>> the second orterun, everything works.
>>> 
>>> 2.  Whether or not the second process actually uses MPI doesn't matter.
>>> 
>>> 3.  Using the standalone orted in "debug" mode with "universe"
>>> specified throughout, there does not appear to be any communication to
>>> orted upon the second invocation of orterun
>>> 
>>> (Also, I've been able to get working nested orteruns using simple shell
>>> scripts, but these don't call MPI_Init.)
>>> 
>>> Cheers,
>>> 
>>> Lev
>>> 
>>> 
>>> 
>>> On Wed, 11 Jul 2007, Ralph H Castain wrote:
>>> 
 Hmmm...well, what that indicates is that your application program is losing
 the connection to orterun, but that orterun is still alive and kicking (it
 is alive enough to send the [0,0,1] daemon a message ordering it to exit).
 So the question is: why is your application program dropping the
 connection?
 
 I haven't tried doing embedded orterun commands, so there could be a
 conflict there that causes the OOB connection to drop. Best guess is that
 there is confusion over which orterun it is supposed to connect to. I can
 give it a try and see - this may not be a mode we can support.
 
 Alternatively, you could start a persistent daemon and then just allow both
 orterun instances to report to it. Our method for doing that isn't as
 convenient as we want it to be, and hope to soon have it, but it does work.
 What you have to do is:
 
 1. to start the persistent daemon, type:
 
 "orted --seed --persistent --scope public --universe foo"
 
 where foo can be whatever name you like.
 
 2. when you execute your application, use:
 
 orterun -np 1 --universe foo python ./test.py
 
 where the "foo" matches the name given above.
 
 3. in your os.system command, you'll need that same "--universe foo" option

Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Lev Gelb


Well done, that was exactly the problem -

Python's os.environ passes the complete collection of shell variables.

I tried a different os method (os.execve) , where I could specify the 
environment (I took out all the OMPI_* variables) and the second orterun 
call worked!


Now I just need a cleaner way to reset the environment within the
spawned process.  (Or,  a way to tell orterun to ignore/overwrite the 
existing OMPI_* variables...?)
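
(For instance, something like the following might be cleaner, assuming the
subprocess module is available and that dropping everything that starts with
OMPI_ is sufficient -- just a sketch:)

   import os
   import subprocess

   # Drop every OMPI_* variable that the parent orterun set (assumption:
   # that prefix covers everything that confuses the nested orterun).
   child_env = dict((k, v) for k, v in os.environ.items()
                    if not k.startswith("OMPI_"))

   # shell=True keeps the "> nwchem.out" redirection working as with os.system.
   subprocess.call("orterun -np 2 nwchem.x nwchem.inp > nwchem.out",
                   shell=True, env=child_env)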


Thanks for your help,

Lev



On Wed, 11 Jul 2007, Ralph Castain wrote:


Hmmm...interesting. As a cross-check on something - if you os.system, does
your environment by any chance get copied across? Reason I ask: we set a
number of environmental variables when orterun spawns a process. If you call
orterun from within that process - and the new orterun sees the enviro
variables from the parent process - then I can guarantee it won't work.

What you need is for os.system to start its child with a clean environment.
I would imagine if you just os.system'd something that output'd the
environment, that would suffice to identify the problem. If you see anything
that starts with OMPI_MCA_..., then we are indeed doomed.

Which would definitely explain why the persistent orted wouldn't help solve
the problem.

Ralph



On 7/11/07 3:05 PM, "Lev Gelb"  wrote:



Thanks for the suggestions.  The separate 'orted' scheme (below) did
not work, unfortunately;  same behavior as before.  I have conducted
a few other simple tests, and found:

1.  The problem only occurs if the first process is "in" MPI;
if it doesn't call MPI_Init or calls MPI_Finalize before it executes
the second orterun, everything works.

2.  Whether or not the second process actually uses MPI doesn't matter.

3.  Using the standalone orted in "debug" mode with "universe"
specified throughout, there does not appear to be any communication to
orted upon the second invocation of orterun

(Also, I've been able to get working nested orteruns using simple shell
scripts, but these don't call MPI_Init.)

Cheers,

Lev



On Wed, 11 Jul 2007, Ralph H Castain wrote:


Hmmm...well, what that indicates is that your application program is losing
the connection to orterun, but that orterun is still alive and kicking (it
is alive enough to send the [0,0,1] daemon a message ordering it to exit).
So the question is: why is your application program dropping the connection?

I haven't tried doing embedded orterun commands, so there could be a
conflict there that causes the OOB connection to drop. Best guess is that
there is confusion over which orterun it is supposed to connect to. I can
give it a try and see - this may not be a mode we can support.

Alternatively, you could start a persistent daemon and then just allow both
orterun instances to report to it. Our method for doing that isn't as
convenient as we want it to be, and hope to soon have it, but it does work.
What you have to do is:

1. to start the persistent daemon, type:

"orted --seed --persistent --scope public --universe foo"

where foo can be whatever name you like.

2. when you execute your application, use:

orterun -np 1 --universe foo python ./test.py

where the "foo" matches the name given above.

3. in your os.system command, you'll need that same "--universe foo" option

That may solve the problem (let me know if it does). Meantime, I'll take a
look at the embedded approach without the persistent daemon...may take me
awhile as I'm in the middle of something, but I will try to get to it
shortly.

Ralph


On 7/11/07 1:40 PM, "Lev Gelb"  wrote:



OK, I've added the debug flags - when I add them to the
os.system instance of orterun, there is no additional input,
but when I add them to the orterun instance controlling the
python program, I get the following:


orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py

Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
[druid.wustl.edu:18054] [0,0,1] orted: received launch callback
[druid.wustl.edu:18054] odls: setting up launch for job 1
[druid.wustl.edu:18054] odls: overriding oversubscription
[druid.wustl.edu:18054] odls: oversubscribed set to false want_processor
set to true
[druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
Pypar (version 1.9.3) initialised MPI OK with 1 processors
[druid.wustl.edu:18057] OOB: Connection to HNP lost
[druid.wustl.edu:18054] odls: child process terminated
[druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from
[0,0,0]
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
process [0,1,0]
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive

(the Pypar output is from loading that module; the next thing in
the code is the os.system call to start orterun with 2 processors.)

Also, there is absolutely no output from the second orterun-launched
program (even the first line does not execute.)

Re: [OMPI users] MPI_Reduce problem

2007-07-11 Thread Jelena Pjesivac-Grbovic

Hi Anyi,

you are using reduce incorrectly: you cannot use the same buffer as input 
and output.
If you want to do the operation in place, you must specify "MPI_IN_PLACE" 
as the send buffer at the root process.

Thus, your code should look something like:

  int* ttt = (int*)malloc(2 * sizeof(int));
  ttt[0] = myworldrank + 1;
  ttt[1] = myworldrank * 2;
  if (root == myworldrank) {
     MPI_Reduce(MPI_IN_PLACE, ttt, 2, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
  } else {
     MPI_Reduce(ttt, NULL, 2, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
  }
  FOR_WORLDNODE0 printf("%d, %d\n", ttt[0], ttt[1]);


hope this helps,
Jelena
PS. If I remember the standard correctly, you must specify a send buffer on 
non-root nodes - it cannot be MPI_IN_PLACE (if you try it, you'll get a 
segfault).


On Wed, 11 Jul 2007 any...@pa.uky.edu wrote:


Hi,
 I have code with an identical vector on each node, and I am going to do
the vector sum and return the result to root.  Something like this,

 int* ttt = (int*)malloc(2 * sizeof(int));
 ttt[0] = myworldrank + 1;
 ttt[1] = myworldrank * 2;
  MPI_Allreduce(ttt, ttt, 2, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
 FOR_WORLDNODE0 printf("%d, %d\n" , ttt[0],ttt[1]);

 myworldrank is the rank of the local node. If I run this code on 4 nodes, what
I expect to get back is 10,12. But what I got is 18,24. So I'm confused about
MPI_Reduce; is it supposed to do the vector sum?
 I tried MPI_Allreduce, and it gave me the correct answer, 10, 12.

 Has anyone met the same problem, or am I wrong in how I call MPI_Reduce()?

 Thanks.
Anyi





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jelena Pjesivac-Grbovic, Pjesa
Graduate Research Assistant
Innovative Computing Laboratory
Computer Science Department, UTK
Claxton Complex 350
(865) 974 - 6722 
(865) 974 - 6321

jpjes...@utk.edu

Murphy's Law of Research:
Enough research will tend to support your theory.


[OMPI users] MPI_Reduce problem

2007-07-11 Thread anyili
Hi,
  I have code with an identical vector on each node, and I am going to do
the vector sum and return the result to root.  Something like this,

  int* ttt = (int*)malloc(2 * sizeof(int));
  ttt[0] = myworldrank + 1;
  ttt[1] = myworldrank * 2;
   MPI_Allreduce(ttt, ttt, 2, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  FOR_WORLDNODE0 printf("%d, %d\n" , ttt[0],ttt[1]);

  myworldrank is the rank of the local node. If I run this code on 4 nodes, what
I expect to get back is 10,12. But what I got is 18,24. So I'm confused about
MPI_Reduce; is it supposed to do the vector sum?
  I tried MPI_Allreduce, and it gave me the correct answer, 10, 12.

  Has anyone met the same problem, or am I wrong in how I call MPI_Reduce()?

  Thanks.
Anyi







Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Ralph Castain
Hmmm...interesting. As a cross-check on something - if you os.system, does
your environment by any chance get copied across? Reason I ask: we set a
number of environmental variables when orterun spawns a process. If you call
orterun from within that process - and the new orterun sees the enviro
variables from the parent process - then I can guarantee it won't work.

What you need is for os.system to start its child with a clean environment.
I would imagine if you just os.system'd something that output'd the
environment, that would suffice to identify the problem. If you see anything
that starts with OMPI_MCA_..., then we are indeed doomed.

Which would definitely explain why the persistent orted wouldn't help solve
the problem.

Ralph



On 7/11/07 3:05 PM, "Lev Gelb"  wrote:

> 
> Thanks for the suggestions.  The separate 'orted' scheme (below) did
> not work, unfortunately;  same behavior as before.  I have conducted
> a few other simple tests, and found:
> 
> 1.  The problem only occurs if the first process is "in" MPI;
> if it doesn't call MPI_Init or calls MPI_Finalize before it executes
> the second orterun, everything works.
> 
> 2.  Whether or not the second process actually uses MPI doesn't matter.
> 
> 3.  Using the standalone orted in "debug" mode with "universe"
> specified throughout, there does not appear to be any communication to
> orted upon the second invocation of orterun
> 
> (Also, I've been able to get working nested orteruns using simple shell
> scripts, but these don't call MPI_Init.)
> 
> Cheers,
> 
> Lev
> 
> 
> 
> On Wed, 11 Jul 2007, Ralph H Castain wrote:
> 
>> Hmmm...well, what that indicates is that your application program is losing
>> the connection to orterun, but that orterun is still alive and kicking (it
>> is alive enough to send the [0,0,1] daemon a message ordering it to exit).
>> So the question is: why is your application program dropping the connection?
>> 
>> I haven't tried doing embedded orterun commands, so there could be a
>> conflict there that causes the OOB connection to drop. Best guess is that
>> there is confusion over which orterun it is supposed to connect to. I can
>> give it a try and see - this may not be a mode we can support.
>> 
>> Alternatively, you could start a persistent daemon and then just allow both
>> orterun instances to report to it. Our method for doing that isn't as
>> convenient as we want it to be, and hope to soon have it, but it does work.
>> What you have to do is:
>> 
>> 1. to start the persistent daemon, type:
>> 
>> "orted --seed --persistent --scope public --universe foo"
>> 
>> where foo can be whatever name you like.
>> 
>> 2. when you execute your application, use:
>> 
>> orterun -np 1 --universe foo python ./test.py
>> 
>> where the "foo" matches the name given above.
>> 
>> 3. in your os.system command, you'll need that same "--universe foo" option
>> 
>> That may solve the problem (let me know if it does). Meantime, I'll take a
>> look at the embedded approach without the persistent daemon...may take me
>> awhile as I'm in the middle of something, but I will try to get to it
>> shortly.
>> 
>> Ralph
>> 
>> 
>> On 7/11/07 1:40 PM, "Lev Gelb"  wrote:
>> 
>>> 
>>> OK, I've added the debug flags - when I add them to the
>>> os.system instance of orterun, there is no additional input,
>>> but when I add them to the orterun instance controlling the
>>> python program, I get the following:
>>> 
 orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
>>> Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
>>> [druid.wustl.edu:18054] [0,0,1] orted: received launch callback
>>> [druid.wustl.edu:18054] odls: setting up launch for job 1
>>> [druid.wustl.edu:18054] odls: overriding oversubscription
>>> [druid.wustl.edu:18054] odls: oversubscribed set to false want_processor
>>> set to true
>>> [druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
>>> Pypar (version 1.9.3) initialised MPI OK with 1 processors
>>> [druid.wustl.edu:18057] OOB: Connection to HNP lost
>>> [druid.wustl.edu:18054] odls: child process terminated
>>> [druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from
>>> [0,0,0]
>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
>>> process [0,1,0]
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive
>>> 
>>> (the Pypar output is from loading that module; the next thing in
>>> the code is the os.system call to start orterun with 2 processors.)
>>> 
>>> Also, there is absolutely no output from the second orterun-launched
>>> program (even the first line does not execute.)
>>> 
>>> Cheers,
>>> 
>>> Lev
>>> 
>>> 
>>> 
 Message: 5
 Date: Wed, 11 Jul 2007 13:26:22 -0600
 From: Ralph H Castain 
>

Re: [OMPI users] Problems running openmpi under os x

2007-07-11 Thread Brian Barrett
That's unexpected.  If you run the command 'ompi_info --all', it  
should list (towards the top) things like the Bindir and Libdir.  Can  
you see if those have sane values?  If they do, can you try running a  
simple hello, world type MPI application (there's one in the OMPI  
tarball)?  It almost looks like memory is getting corrupted, which  
would be very unexpected that early in the process.  I'm unable to  
duplicate the problem with 1.2.3 on my Mac Pro, making it all the  
more strange.


Another random thought -- Which compilers did you use to build Open MPI?

Brian


On Jul 11, 2007, at 1:27 PM, Tim Cornwell wrote:



                Open MPI: 1.2.3
   Open MPI SVN revision: r15136
                Open RTE: 1.2.3
   Open RTE SVN revision: r15136
                    OPAL: 1.2.3
       OPAL SVN revision: r15136
                  Prefix: /usr/local
 Configured architecture: i386-apple-darwin8.10.1

Hi Brian,

1.2.3 downloaded and built from source.

Tim

On 12/07/2007, at 12:50 AM, Brian Barrett wrote:


Which version of Open MPI are you using?

Thanks,

Brian

On Jul 11, 2007, at 3:32 AM, Tim Cornwell wrote:



I have a problem running openmpi under OS 10.4.10. My program runs
fine under debian x86_64 on an opteron but under OS X on a number
of Mac Book and Mac Book Pros, I get the following immediately on
startup. This smells like a common problem but I could find
anything relevant anywhere. Can anyone provide a hint or better yet
a solution?

Thanks,

Tim


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000c
0x04510412 in free ()
(gdb) where
#0  0x04510412 in free ()
#1  0x05d24f80 in opal_install_dirs_expand (input=0x5d2a6b0 "$
{prefix}") at base/installdirs_base_expand.c:67
#2  0x05d24584 in opal_installdirs_base_open () at base/
installdirs_base_components.c:94
#3  0x05d01a40 in opal_init_util () at runtime/opal_init.c:150
#4  0x05d01b24 in opal_init () at runtime/opal_init.c:200
#5  0x051fa5cd in ompi_mpi_init (argc=1, argv=0xbfffde74,
requested=0, provided=0xbfffd930) at runtime/ompi_mpi_init.c:219
#6  0x0523a8db in MPI_Init (argc=0xbfffd980, argv=0xbfffde14) at
init.c:71
#7  0x0005a03d in conrad::cp::MPIConnection::initMPI (argc=1,
argv=@0xbfffde14) at mwcommon/MPIConnection.cc:83
#8  0x4163 in main (argc=1, argv=0xbfffde74) at apps/cimager.cc:
155


 
------------------------------------------------------------
Tim Cornwell,  Australia Telescope National Facility, CSIRO
Location: Cnr Pembroke & Vimiera Rds, Marsfield, NSW, 2122,  
AUSTRALIA

Post: PO Box 76, Epping, NSW 1710, AUSTRALIA
Phone:+61 2 9372 4261   Fax:  +61 2 9372 4450 or 4310
Mobile:  +61 4 3366 5399
Email:tim.cornw...@csiro.au
URL:  http://www.atnf.csiro.au/people/tim.cornwell
 
------------------------------------------------------------



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Lev Gelb


Thanks for the suggestions.  The separate 'orted' scheme (below) did
not work, unfortunately;  same behavior as before.  I have conducted
a few other simple tests, and found:

1.  The problem only occurs if the first process is "in" MPI;
if it doesn't call MPI_Init or calls MPI_Finalize before it executes
the second orterun, everything works.

2.  Whether or not the second process actually uses MPI doesn't matter.

3.  Using the standalone orted in "debug" mode with "universe"
specified throughout, there does not appear to be any communication to 
orted upon the second invocation of orterun


(Also, I've been able to get working nested orteruns using simple shell 
scripts, but these don't call MPI_Init.)


Cheers,

Lev



On Wed, 11 Jul 2007, Ralph H Castain wrote:


Hmmm...well, what that indicates is that your application program is losing
the connection to orterun, but that orterun is still alive and kicking (it
is alive enough to send the [0,0,1] daemon a message ordering it to exit).
So the question is: why is your application program dropping the connection?

I haven't tried doing embedded orterun commands, so there could be a
conflict there that causes the OOB connection to drop. Best guess is that
there is confusion over which orterun it is supposed to connect to. I can
give it a try and see - this may not be a mode we can support.

Alternatively, you could start a persistent daemon and then just allow both
orterun instances to report to it. Our method for doing that isn't as
convenient as we want it to be, and hope to soon have it, but it does work.
What you have to do is:

1. to start the persistent daemon, type:

"orted --seed --persistent --scope public --universe foo"

where foo can be whatever name you like.

2. when you execute your application, use:

orterun -np 1 --universe foo python ./test.py

where the "foo" matches the name given above.

3. in your os.system command, you'll need that same "--universe foo" option

That may solve the problem (let me know if it does). Meantime, I'll take a
look at the embedded approach without the persistent daemon...may take me
awhile as I'm in the middle of something, but I will try to get to it
shortly.

Ralph


On 7/11/07 1:40 PM, "Lev Gelb"  wrote:



OK, I've added the debug flags - when I add them to the
os.system instance of orterun, there is no additional input,
but when I add them to the orterun instance controlling the
python program, I get the following:


orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py

Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
[druid.wustl.edu:18054] [0,0,1] orted: received launch callback
[druid.wustl.edu:18054] odls: setting up launch for job 1
[druid.wustl.edu:18054] odls: overriding oversubscription
[druid.wustl.edu:18054] odls: oversubscribed set to false want_processor
set to true
[druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
Pypar (version 1.9.3) initialised MPI OK with 1 processors
[druid.wustl.edu:18057] OOB: Connection to HNP lost
[druid.wustl.edu:18054] odls: child process terminated
[druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from
[0,0,0]
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
process [0,1,0]
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive

(the Pypar output is from loading that module; the next thing in
the code is the os.system call to start orterun with 2 processors.)

Also, there is absolutely no output from the second orterun-launched
program (even the first line does not execute.)

Cheers,

Lev




Message: 5
Date: Wed, 11 Jul 2007 13:26:22 -0600
From: Ralph H Castain 
Subject: Re: [OMPI users] Recursive use of "orterun"
To: "Open MPI Users " 
Message-ID: 
Content-Type: text/plain; charset="US-ASCII"

I'm unaware of any issues that would cause it to fail just because it is
being run via that interface.

The error message is telling us that the procs got launched, but then
orterun went away unexpectedly. Are you seeing your procs complete? We do
sometimes see that message due to a race condition between the daemons
spawned to support the application procs and orterun itself (see other
recent notes in this forum).

If your procs are not completing, then it would mean that either the
connecting fabric is failing for some reason, or orterun is terminating
early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
os.system command, the output from that might help us understand why it is
failing.

Ralph



On 7/11/07 10:49 AM, "Lev Gelb"  wrote:



Hi -

I'm trying to port an application to use OpenMPI, and running
into a problem.  The program (written in Python, parallelized
using either of "pypar" or "pyMPI") itself invokes "mpirun"
>>> in order to manage external, parallel processes.

Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Ralph H Castain
Hmmm...well, what that indicates is that your application program is losing
the connection to orterun, but that orterun is still alive and kicking (it
is alive enough to send the [0,0,1] daemon a message ordering it to exit).
So the question is: why is your application program dropping the connection?

I haven't tried doing embedded orterun commands, so there could be a
conflict there that causes the OOB connection to drop. Best guess is that
there is confusion over which orterun it is supposed to connect to. I can
give it a try and see - this may not be a mode we can support.

Alternatively, you could start a persistent daemon and then just allow both
orterun instances to report to it. Our method for doing that isn't as
convenient as we want it to be, and hope to soon have it, but it does work.
What you have to do is:

1. to start the persistent daemon, type:

"orted --seed --persistent --scope public --universe foo"

where foo can be whatever name you like.

2. when you execute your application, use:

orterun -np 1 --universe foo python ./test.py

where the "foo" matches the name given above.

3. in your os.system command, you'll need that same "--universe foo" option
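
In the myapp.py from the original report, step 3 would then look something like
this (a sketch; program and file names as in the earlier message):

   import os

   # Use the same universe name ("foo") the persistent orted was started with,
   # so the nested orterun reports to that daemon rather than starting its own.
   os.system('orterun -np 2 --universe foo nwchem.x nwchem.inp > nwchem.out')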

That may solve the problem (let me know if it does). Meantime, I'll take a
look at the embedded approach without the persistent daemon...may take me
awhile as I'm in the middle of something, but I will try to get to it
shortly.

Ralph


On 7/11/07 1:40 PM, "Lev Gelb"  wrote:

> 
> OK, I've added the debug flags - when I add them to the
> os.system instance of orterun, there is no additional input,
> but when I add them to the orterun instance controlling the
> python program, I get the following:
> 
>> orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
> Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
> [druid.wustl.edu:18054] [0,0,1] orted: received launch callback
> [druid.wustl.edu:18054] odls: setting up launch for job 1
> [druid.wustl.edu:18054] odls: overriding oversubscription
> [druid.wustl.edu:18054] odls: oversubscribed set to false want_processor
> set to true
> [druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
> Pypar (version 1.9.3) initialised MPI OK with 1 processors
> [druid.wustl.edu:18057] OOB: Connection to HNP lost
> [druid.wustl.edu:18054] odls: child process terminated
> [druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from
> [0,0,0]
> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
> process [0,1,0]
> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive
> 
> (the Pypar output is from loading that module; the next thing in
> the code is the os.system call to start orterun with 2 processors.)
> 
> Also, there is absolutely no output from the second orterun-launched
> program (even the first line does not execute.)
> 
> Cheers,
> 
> Lev
> 
> 
> 
>> Message: 5
>> Date: Wed, 11 Jul 2007 13:26:22 -0600
>> From: Ralph H Castain 
>> Subject: Re: [OMPI users] Recursive use of "orterun"
>> To: "Open MPI Users " 
>> Message-ID: 
>> Content-Type: text/plain; charset="US-ASCII"
>> 
>> I'm unaware of any issues that would cause it to fail just because it is
>> being run via that interface.
>> 
>> The error message is telling us that the procs got launched, but then
>> orterun went away unexpectedly. Are you seeing your procs complete? We do
>> sometimes see that message due to a race condition between the daemons
>> spawned to support the application procs and orterun itself (see other
>> recent notes in this forum).
>> 
>> If your procs are not completing, then it would mean that either the
>> connecting fabric is failing for some reason, or orterun is terminating
>> early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
>> os.system command, the output from that might help us understand why it is
>> failing.
>> 
>> Ralph
>> 
>> 
>> 
>> On 7/11/07 10:49 AM, "Lev Gelb"  wrote:
>> 
>>> 
>>> Hi -
>>> 
>>> I'm trying to port an application to use OpenMPI, and running
>>> into a problem.  The program (written in Python, parallelized
>>> using either of "pypar" or "pyMPI") itself invokes "mpirun"
>>> in order to manage external, parallel processes, via something like:
>>> 
>>> orterun -np 2 python myapp.py
>>> 
>>> where myapp.py contains:
>>> 
>>> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
>>> 
>>> I have this working under both LAM-MPI and MPICH on a variety
>>> of different machines.  However, with OpenMPI,  all I get is an
>>> immediate return from the system call and the error:
>>> 
>>> "OOB: Connection to HNP lost"
>>> 
>>> I have verified that the command passed to os.system is correct,
>>> and even that it runs correctly if "myapp.py" doesn't invoke any
>>> MPI calls of its own.

Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Lev Gelb


OK, I've added the debug flags - when I add them to the
os.system instance of orterun, there is no additional input,
but when I add them to the orterun instance controlling the
python program, I get the following:


orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py

Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
[druid.wustl.edu:18054] [0,0,1] orted: received launch callback
[druid.wustl.edu:18054] odls: setting up launch for job 1
[druid.wustl.edu:18054] odls: overriding oversubscription
[druid.wustl.edu:18054] odls: oversubscribed set to false want_processor 
set to true

[druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
Pypar (version 1.9.3) initialised MPI OK with 1 processors
[druid.wustl.edu:18057] OOB: Connection to HNP lost
[druid.wustl.edu:18054] odls: child process terminated
[druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from 
[0,0,0]

[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child 
process [0,1,0]

[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive

(the Pypar output is from loading that module; the next thing in
the code is the os.system call to start orterun with 2 processors.)

Also, there is absolutely no output from the second orterun-launched
program (even the first line does not execute.)

Cheers,

Lev




Message: 5
Date: Wed, 11 Jul 2007 13:26:22 -0600
From: Ralph H Castain 
Subject: Re: [OMPI users] Recursive use of "orterun"
To: "Open MPI Users " 
Message-ID: 
Content-Type: text/plain;   charset="US-ASCII"

I'm unaware of any issues that would cause it to fail just because it is
being run via that interface.

The error message is telling us that the procs got launched, but then
orterun went away unexpectedly. Are you seeing your procs complete? We do
sometimes see that message due to a race condition between the daemons
spawned to support the application procs and orterun itself (see other
recent notes in this forum).

If your procs are not completing, then it would mean that either the
connecting fabric is failing for some reason, or orterun is terminating
early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
os.system command, the output from that might help us understand why it is
failing.

Ralph



On 7/11/07 10:49 AM, "Lev Gelb"  wrote:



Hi -

I'm trying to port an application to use OpenMPI, and running
into a problem.  The program (written in Python, parallelized
using either of "pypar" or "pyMPI") itself invokes "mpirun"
in order to manage external, parallel processes, via something like:

orterun -np 2 python myapp.py

where myapp.py contains:

os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

I have this working under both LAM-MPI and MPICH on a variety
of different machines.  However, with OpenMPI,  all I get is an
immediate return from the system call and the error:

"OOB: Connection to HNP lost"

I have verified that the command passed to os.system is correct,
and even that it runs correctly if "myapp.py" doesn't invoke any
MPI calls of its own.

I'm testing openMPI on a single box, so there's no machinefile-stuff currently
active.  The system is running Fedora Core 6 x86-64; I'm using the latest
openmpi-1.2.3-1.src.rpm rebuilt on the machine in question.
I can provide additional configuration details if necessary.

Thanks, in advance, for any help or advice,

Lev


--
Lev Gelb Associate Professor Department of Chemistry, Washington University in
St. Louis, St. Louis, MO 63130  USA

email: g...@wustl.edu
phone: (314)935-5026 fax:   (314)935-4481

http://www.chemistry.wustl.edu/~gelb
--

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Problems running openmpi under os x

2007-07-11 Thread Tim Cornwell


                Open MPI: 1.2.3
   Open MPI SVN revision: r15136
                Open RTE: 1.2.3
   Open RTE SVN revision: r15136
                    OPAL: 1.2.3
       OPAL SVN revision: r15136
                  Prefix: /usr/local
 Configured architecture: i386-apple-darwin8.10.1

Hi Brian,

1.2.3 downloaded and built from source.

Tim

On 12/07/2007, at 12:50 AM, Brian Barrett wrote:


Which version of Open MPI are you using?

Thanks,

Brian

On Jul 11, 2007, at 3:32 AM, Tim Cornwell wrote:



I have a problem running openmpi under OS 10.4.10. My program runs
fine under debian x86_64 on an opteron but under OS X on a number
of Mac Book and Mac Book Pros, I get the following immediately on
startup. This smells like a common problem but I could not find
anything relevant anywhere. Can anyone provide a hint or better yet
a solution?

Thanks,

Tim


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000c
0x04510412 in free ()
(gdb) where
#0  0x04510412 in free ()
#1  0x05d24f80 in opal_install_dirs_expand (input=0x5d2a6b0 "$
{prefix}") at base/installdirs_base_expand.c:67
#2  0x05d24584 in opal_installdirs_base_open () at base/
installdirs_base_components.c:94
#3  0x05d01a40 in opal_init_util () at runtime/opal_init.c:150
#4  0x05d01b24 in opal_init () at runtime/opal_init.c:200
#5  0x051fa5cd in ompi_mpi_init (argc=1, argv=0xbfffde74,
requested=0, provided=0xbfffd930) at runtime/ompi_mpi_init.c:219
#6  0x0523a8db in MPI_Init (argc=0xbfffd980, argv=0xbfffde14) at
init.c:71
#7  0x0005a03d in conrad::cp::MPIConnection::initMPI (argc=1,
argv=@0xbfffde14) at mwcommon/MPIConnection.cc:83
#8  0x4163 in main (argc=1, argv=0xbfffde74) at apps/cimager.cc:
155


------------------------------------------------------------
Tim Cornwell,  Australia Telescope National Facility, CSIRO
Location: Cnr Pembroke & Vimiera Rds, Marsfield, NSW, 2122, AUSTRALIA
Post: PO Box 76, Epping, NSW 1710, AUSTRALIA
Phone:+61 2 9372 4261   Fax:  +61 2 9372 4450 or 4310
Mobile:  +61 4 3366 5399
Email:tim.cornw...@csiro.au
URL:  http://www.atnf.csiro.au/people/tim.cornwell
------------------------------------------------------------



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Recursive use of "orterun"

2007-07-11 Thread Ralph H Castain
I'm unaware of any issues that would cause it to fail just because it is
being run via that interface.

The error message is telling us that the procs got launched, but then
orterun went away unexpectedly. Are you seeing your procs complete? We do
sometimes see that message due to a race condition between the daemons
spawned to support the application procs and orterun itself (see other
recent notes in this forum).

If your procs are not completing, then it would mean that either the
connecting fabric is failing for some reason, or orterun is terminating
early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
os.system command, the output from that might help us understand why it is
failing.

Ralph



On 7/11/07 10:49 AM, "Lev Gelb"  wrote:

> 
> Hi -
> 
> I'm trying to port an application to use OpenMPI, and running
> into a problem.  The program (written in Python, parallelized
> using either of "pypar" or "pyMPI") itself invokes "mpirun"
> in order to manage external, parallel processes, via something like:
> 
> orterun -np 2 python myapp.py
> 
> where myapp.py contains:
> 
> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
> 
> I have this working under both LAM-MPI and MPICH on a variety
> of different machines.  However, with OpenMPI,  all I get is an
> immediate return from the system call and the error:
> 
> "OOB: Connection to HNP lost"
> 
> I have verified that the command passed to os.system is correct,
> and even that it runs correctly if "myapp.py" doesn't invoke any
> MPI calls of its own.
> 
> I'm testing openMPI on a single box, so there's no machinefile-stuff currently
> active.  The system is running Fedora Core 6 x86-64; I'm using the latest
> openmpi-1.2.3-1.src.rpm rebuilt on the machine in question.
> I can provide additional configuration details if necessary.
> 
> Thanks, in advance, for any help or advice,
> 
> Lev
> 
> 
> --
> Lev Gelb Associate Professor Department of Chemistry, Washington University in
> St. Louis, St. Louis, MO 63130  USA
> 
> email: g...@wustl.edu
> phone: (314)935-5026 fax:   (314)935-4481
> 
> http://www.chemistry.wustl.edu/~gelb
> --
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] openmpi fails on mx endpoint

2007-07-11 Thread Tim Prins
Or you can simply tell the mx mtl not to run by adding "-mca mtl ^mx" to 
the command line.


George: There is an open bug about this problem: 
https://svn.open-mpi.org/trac/ompi/ticket/1080


Tim

George Bosilca wrote:
There seems to be a problem with MX, because of a conflict between our
MTL and the BTL. So, I suspect that if you want it to run [right now]
you should spawn fewer processes per node than the number of supported MX
endpoints (one fewer). I'll take a look this afternoon.


   Thanks,
 george.

On Jul 11, 2007, at 12:39 PM, Warner Yuen wrote:

The hostfile was changed around as we tried to pull nodes out that
we thought might have been bad. But none were oversubscribed, if
that's what you mean.


Warner Yuen
Scientific Computing Consultant
Apple Computer



On Jul 11, 2007, at 9:00 AM, users-requ...@open-mpi.org wrote:


Message: 3
Date: Wed, 11 Jul 2007 11:27:47 -0400
From: George Bosilca 
Subject: Re: [OMPI users] OMPI users] openmpi fails on mx endpoint
busy
To: Open MPI Users 
Message-ID: <15c9e0ab-6c55-43d9-a40e-82cf973b0...@cs.utk.edu>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

What's in the hostmx10g file ? How many hosts ?

   george.

On Jul 11, 2007, at 1:34 AM, Warner Yuen wrote:


I've also had someone run into the endpoint busy problem. I never
figured it out, I just increased the default endpoints on MX-10G
from 8 to 16 endpoints to make the problem go away. Here's the
actual command and error before setting the endpoints to 16. The
version is MX-1.2.1 with OMPI 1.2.3:

node1:~/taepic tae$ mpirun --hostfile hostmx10g -byslot -mca btl
self,sm,mx -np 12 test_beam_injection test_beam_injection.inp -npx
12 > out12
[node2:00834] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
 
--------------------------------------------------------------------------


Process 0.1.3 is unable to reach 0.1.7 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Recursive use of "orterun"

2007-07-11 Thread Lev Gelb


Hi -

I'm trying to port an application to use OpenMPI, and running
into a problem.  The program (written in Python, parallelized
using either of "pypar" or "pyMPI") itself invokes "mpirun"
in order to manage external, parallel processes, via something like:

   orterun -np 2 python myapp.py

where myapp.py contains:

   os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

I have this working under both LAM-MPI and MPICH on a variety
of different machines.  However, with OpenMPI,  all I get is an
immediate return from the system call and the error:

"OOB: Connection to HNP lost"

I have verified that the command passed to os.system is correct,
and even that it runs correctly if "myapp.py" doesn't invoke any
MPI calls of its own.

I'm testing openMPI on a single box, so there's no machinefile-stuff currently 
active.  The system is running Fedora Core 6 x86-64; I'm using the latest 
openmpi-1.2.3-1.src.rpm rebuilt on the machine in question.

I can provide additional configuration details if necessary.

Thanks, in advance, for any help or advice,

Lev


--
Lev Gelb Associate Professor Department of Chemistry, Washington University in 
St. Louis, St. Louis, MO 63130  USA


email: g...@wustl.edu
phone: (314)935-5026 fax:   (314)935-4481

http://www.chemistry.wustl.edu/~gelb 
--




Re: [OMPI users] openmpi fails on mx endpoint

2007-07-11 Thread George Bosilca
There seems to be a problem with MX, because of a conflict between our
MTL and the BTL. So, I suspect that if you want it to run [right now]
you should spawn fewer processes per node than the number of supported MX
endpoints (one fewer). I'll take a look this afternoon.


  Thanks,
george.

On Jul 11, 2007, at 12:39 PM, Warner Yuen wrote:

The hostfile was changed around as we tried to pull nodes out that
we thought might have been bad. But none were oversubscribed, if
that's what you mean.


Warner Yuen
Scientific Computing Consultant
Apple Computer



On Jul 11, 2007, at 9:00 AM, users-requ...@open-mpi.org wrote:


Message: 3
Date: Wed, 11 Jul 2007 11:27:47 -0400
From: George Bosilca 
Subject: Re: [OMPI users] OMPI users] openmpi fails on mx endpoint
busy
To: Open MPI Users 
Message-ID: <15c9e0ab-6c55-43d9-a40e-82cf973b0...@cs.utk.edu>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

What's in the hostmx10g file ? How many hosts ?

   george.

On Jul 11, 2007, at 1:34 AM, Warner Yuen wrote:


I've also had someone run into the endpoint busy problem. I never
figured it out, I just increased the default endpoints on MX-10G
from 8 to 16 endpoints to make the problem go away. Here's the
actual command and error before setting the endpoints to 16. The
version is MX-1.2.1 with OMPI 1.2.3:

node1:~/taepic tae$ mpirun --hostfile hostmx10g -byslot -mca btl
self,sm,mx -np 12 test_beam_injection test_beam_injection.inp -npx
12 > out12
[node2:00834] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
 
--------------------------------------------------------------------------


Process 0.1.3 is unable to reach 0.1.7 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] openmpi fails on mx endpoint

2007-07-11 Thread Warner Yuen
The hostfile was changed around as we tried to pull nodes out that
we thought might have been bad. But none were oversubscribed, if
that's what you mean.


Warner Yuen
Scientific Computing Consultant
Apple Computer



On Jul 11, 2007, at 9:00 AM, users-requ...@open-mpi.org wrote:


Message: 3
Date: Wed, 11 Jul 2007 11:27:47 -0400
From: George Bosilca 
Subject: Re: [OMPI users] OMPI users] openmpi fails on mx endpoint
busy
To: Open MPI Users 
Message-ID: <15c9e0ab-6c55-43d9-a40e-82cf973b0...@cs.utk.edu>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

What's in the hostmx10g file ? How many hosts ?

   george.

On Jul 11, 2007, at 1:34 AM, Warner Yuen wrote:


I've also had someone run into the endpoint busy problem. I never
figured it out, I just increased the default endpoints on MX-10G
from 8 to 16 endpoints to make the problem go away. Here's the
actual command and error before setting the endpoints to 16. The
version is MX-1.2.1 with OMPI 1.2.3:

node1:~/taepic tae$ mpirun --hostfile hostmx10g -byslot -mca btl
self,sm,mx -np 12 test_beam_injection test_beam_injection.inp -npx
12 > out12
[node2:00834] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
--------------------------------------------------------------------------


Process 0.1.3 is unable to reach 0.1.7 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.




Re: [OMPI users] OMPI users] openmpi fails on mx endpoint busy

2007-07-11 Thread George Bosilca

What's in the hostmx10g file ? How many hosts ?

  george.

On Jul 11, 2007, at 1:34 AM, Warner Yuen wrote:

I've also had someone run into the endpoint busy problem. I never  
figured it out, I just increased the default endpoints on MX-10G  
from 8 to 16 endpoints to make the problem go away. Here's the  
actual command and error before setting the endpoints to 16. The  
version is MX-1.2.1 with OMPI 1.2.3:


node1:~/taepic tae$ mpirun --hostfile hostmx10g -byslot -mca btl  
self,sm,mx -np 12 test_beam_injection test_beam_injection.inp -npx  
12 > out12
[node2:00834] mca_btl_mx_init: mx_open_endpoint() failed with  
status=20
--------------------------------------------------------------------------


Process 0.1.3 is unable to reach 0.1.7 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------

--------------------------------------------------------------------------


Process 0.1.11 is unable to reach 0.1.7 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------


*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--------------------------------------------------------------------------




Warner Yuen
Scientific Computing Consultant
Apple Computer
email: wy...@apple.com
Tel: 408.718.2859
Fax: 408.715.0133


On Jul 10, 2007, at 7:53 AM, users-requ...@open-mpi.org wrote:


--

Message: 2
Date: Tue, 10 Jul 2007 09:19:42 -0400
From: Tim Prins 
Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
To: Open MPI Users 
Message-ID: <4693876e.4070...@open-mpi.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

SLIM H.A. wrote:

Dear Tim


So, you should just be able to run:
mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile
ompi_machinefile ./cpi


I tried

node001>mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile
ompi_machinefile ./cpi

I put in a sleep call to keep it running for some time and to monitor
the endpoints. None of the 4 were open, it must have used tcp.

No, this is not possible. With this command line it will not use tcp.
Are you launching on more than one machine? If the procs are all on one
machine, then it will use the shared memory component to communicate
(sm), although the endpoints should still be opened.

Just to make sure, you did put the sleep between MPI_Init and  
MPI_Finalize?




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] mpi with icc, icpc and ifort :: segfault (Jeff Squyres)

2007-07-11 Thread Ricardo Reis

On Tue, 10 Jul 2007, Jeff Squyres wrote:


Whoa -- if you are failing here, something is definitely wrong: this
is failing when accessing stack memory!

Are you able to compile/run other trivial and non-trivial C++
applications using your Intel compiler installation?


Please ignore my last reply. As I said previously, I can compile and use 
LAM MPI with my Intel compiler installation. I believe that LAM uses C++ 
inside, no?


 greets,

 Ricardo Reis

 'Non Serviam'

 PhD student @ Lasef
 Computational Fluid Dynamics, High Performance Computing, Turbulence
 

 &

 Cultural Instigator @ Rádio Zero
 http://radio.ist.utl.pt

Re: [OMPI users] Problems running openmpi under os x

2007-07-11 Thread Brian Barrett

Which version of Open MPI are you using?

Thanks,

Brian

On Jul 11, 2007, at 3:32 AM, Tim Cornwell wrote:



I have a problem running openmpi under OS 10.4.10. My program runs  
fine under debian x86_64 on an opteron but under OS X on a number  
of Mac Book and Mac Book Pros, I get the following immediately on  
startup. This smells like a common problem but I could not find  
anything relevant anywhere. Can anyone provide a hint or better yet  
a solution?


Thanks,

Tim


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000c
0x04510412 in free ()
(gdb) where
#0  0x04510412 in free ()
#1  0x05d24f80 in opal_install_dirs_expand (input=0x5d2a6b0 "$ 
{prefix}") at base/installdirs_base_expand.c:67
#2  0x05d24584 in opal_installdirs_base_open () at base/ 
installdirs_base_components.c:94

#3  0x05d01a40 in opal_init_util () at runtime/opal_init.c:150
#4  0x05d01b24 in opal_init () at runtime/opal_init.c:200
#5  0x051fa5cd in ompi_mpi_init (argc=1, argv=0xbfffde74,  
requested=0, provided=0xbfffd930) at runtime/ompi_mpi_init.c:219
#6  0x0523a8db in MPI_Init (argc=0xbfffd980, argv=0xbfffde14) at  
init.c:71
#7  0x0005a03d in conrad::cp::MPIConnection::initMPI (argc=1,  
argv=@0xbfffde14) at mwcommon/MPIConnection.cc:83
#8  0x4163 in main (argc=1, argv=0xbfffde74) at apps/cimager.cc: 
155



------------------------------------------------------------

Tim Cornwell,  Australia Telescope National Facility, CSIRO
Location: Cnr Pembroke & Vimiera Rds, Marsfield, NSW, 2122, AUSTRALIA
Post: PO Box 76, Epping, NSW 1710, AUSTRALIA
Phone:+61 2 9372 4261   Fax:  +61 2 9372 4450 or 4310
Mobile:  +61 4 3366 5399
Email:tim.cornw...@csiro.au
URL:  http://www.atnf.csiro.au/people/tim.cornwell
------------------------------------------------------------




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] mpi with icc, icpc and ifort :: segfault (Jeff Squyres)

2007-07-11 Thread Ricardo Reis

On Tue, 10 Jul 2007, Jeff Squyres wrote:


Whoa -- if you are failing here, something is definitely wrong: this
is failing when accessing stack memory!

Are you able to compile/run other trivial and non-trivial C++
applications using your Intel compiler installation?


I don't have trivial or non-trivial C++ apps to compile on this machine... 
Do you want to suggest some? (hello_world works...)


 greets,

 Ricardo Reis

 'Non Serviam'

 PhD student @ Lasef
 Computational Fluid Dynamics, High Performance Computing, Turbulence
 

 &

 Cultural Instigator @ Rádio Zero
 http://radio.ist.utl.pt

[OMPI users] Problems running openmpi under os x

2007-07-11 Thread Tim Cornwell


I have a problem running openmpi under OS 10.4.10. My program runs  
fine under debian x86_64 on an opteron but under OS X on a number of  
Mac Book and Mac Book Pros, I get the following immediately on  
startup. This smells like a common problem but I could not find anything  
relevant anywhere. Can anyone provide a hint or better yet a solution?


Thanks,

Tim


Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x000c
0x04510412 in free ()
(gdb) where
#0  0x04510412 in free ()
#1  0x05d24f80 in opal_install_dirs_expand (input=0x5d2a6b0 "$ 
{prefix}") at base/installdirs_base_expand.c:67
#2  0x05d24584 in opal_installdirs_base_open () at base/ 
installdirs_base_components.c:94

#3  0x05d01a40 in opal_init_util () at runtime/opal_init.c:150
#4  0x05d01b24 in opal_init () at runtime/opal_init.c:200
#5  0x051fa5cd in ompi_mpi_init (argc=1, argv=0xbfffde74,  
requested=0, provided=0xbfffd930) at runtime/ompi_mpi_init.c:219
#6  0x0523a8db in MPI_Init (argc=0xbfffd980, argv=0xbfffde14) at  
init.c:71
#7  0x0005a03d in conrad::cp::MPIConnection::initMPI (argc=1,  
argv=@0xbfffde14) at mwcommon/MPIConnection.cc:83

#8  0x4163 in main (argc=1, argv=0xbfffde74) at apps/cimager.cc:155


 


Tim Cornwell,  Australia Telescope National Facility, CSIRO
Location: Cnr Pembroke & Vimiera Rds, Marsfield, NSW, 2122, AUSTRALIA
Post: PO Box 76, Epping, NSW 1710, AUSTRALIA
Phone:+61 2 9372 4261   Fax:  +61 2 9372 4450 or 4310
Mobile:  +61 4 3366 5399
Email:tim.cornw...@csiro.au
URL:  http://www.atnf.csiro.au/people/tim.cornwell
 
------------------------------------------------------------






[OMPI users] OMPI users] openmpi fails on mx endpoint busy

2007-07-11 Thread Warner Yuen
I've also had someone run into the endpoint busy problem. I never  
figured it out, I just increased the default endpoints on MX-10G from  
8 to 16 endpoints to make the problem go away. Here's the actual  
command and error before setting the endpoints to 16. The version is  
MX-1.2.1 with OMPI 1.2.3:


node1:~/taepic tae$ mpirun --hostfile hostmx10g -byslot -mca btl  
self,sm,mx -np 12 test_beam_injection test_beam_injection.inp -npx 12  
> out12

[node2:00834] mca_btl_mx_init: mx_open_endpoint() failed with status=20
 
--------------------------------------------------------------------------

Process 0.1.3 is unable to reach 0.1.7 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
 
--------------------------------------------------------------------------

--------------------------------------------------------------------------

Process 0.1.11 is unable to reach 0.1.7 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
 
--------------------------------------------------------------------------

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
 
--------------------------------------------------------------------------

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
 
--------------------------------------------------------------------------



Warner Yuen
Scientific Computing Consultant
Apple Computer
email: wy...@apple.com
Tel: 408.718.2859
Fax: 408.715.0133


On Jul 10, 2007, at 7:53 AM, users-requ...@open-mpi.org wrote:


--

Message: 2
Date: Tue, 10 Jul 2007 09:19:42 -0400
From: Tim Prins 
Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
To: Open MPI Users 
Message-ID: <4693876e.4070...@open-mpi.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

SLIM H.A. wrote:

Dear Tim


So, you should just be able to run:
mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile
ompi_machinefile ./cpi


I tried

node001>mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile
ompi_machinefile ./cpi

I put in a sleep call to keep it running for some time and to monitor
the endpoints. None of the 4 were open, it must have used tcp.

No, this is not possible. With this command line it will not use tcp.
Are you launching on more than one machine? If the procs are all on one
machine, then it will use the shared memory component to communicate
(sm), although the endpoints should still be opened.

Just to make sure, you did put the sleep between MPI_Init and  
MPI_Finalize?