Re: [OMPI users] PubSub and MPI

2008-04-24 Thread Tim Prins
Open MPI ships with a full set of man pages for all the MPI functions; 
you might want to start with those.


Tim

Alberto Giannetti wrote:
I am looking to use MPI in a publisher/subscriber context. Haven't  
found much relevant information online.
Basically I would need to deal with dynamic tag subscriptions from  
independent components (connectors) and a number of other issues. I  
can provide more details if there is an interest. Am also looking for  
more information on these calls:


MPI_Open_port
MPI_Publish_name
MPI_Comm_spawn_multiple

Any code example or snapshot would be great.
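Not from the original thread, but a minimal sketch of how these calls fit together: one side opens a port, publishes it under a service name, and accepts a connection, while the other side looks the name up and connects. The service name and the publisher/subscriber roles are made up for illustration, error handling is omitted, and for two independently started jobs to find each other the name service generally has to be shared between them:

/* Hypothetical publish/subscribe skeleton, not code from this thread. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    int is_publisher = (argc > 1 && strcmp(argv[1], "publisher") == 0);

    MPI_Init(&argc, &argv);
    if (is_publisher) {
        MPI_Open_port(MPI_INFO_NULL, port);                  /* get a port name */
        MPI_Publish_name("my_service", MPI_INFO_NULL, port); /* advertise it */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ... exchange messages over the intercommunicator 'inter' ... */
        MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Lookup_name("my_service", MPI_INFO_NULL, port);  /* find the port */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ... exchange messages over 'inter' ... */
    }
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}

MPI_Comm_spawn_multiple, by contrast, starts the other side itself rather than connecting to an already running job.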
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Different Interfaces on Different Nodes .. OpenMPI 1.2.3, 1.2.4 ..

2008-04-17 Thread Tim Prins

Hi Graham,

Have you tried running without the btl_tcp_if_include line in the .conf 
file? Open MPI is usually smart enough to auto detect and choose the 
correct interfaces.


Hope this helps,

Tim

Graham Jenkins wrote:

We're moving from using a single (eth0) interface on our execute nodes
to using a bond interface (bond0) for resilience.
And what we're seeing on those nodes which have been upgraded is:
--
[0,1,1][btl_tcp_component.c:349:mca_btl_tcp_component_create_instances]
invalid interface "eth0"
--

This, of course, is because all nodes share a common copy of
openmpi-mca-params.conf .. in which it says:
--
btl_tcp_if_include=eth0
--

So .. does anybody have a suggestion for a way around this during our
migration/upgrade period?
If we place "bond0" in there as well, then we get error messages about
whichever one is absent on the node where execution is happening.

Regards ..




Re: [OMPI users] Spawn problem

2008-04-04 Thread Tim Prins

Hi Joao,

Thanks for the bug report! You do not have to call free/disconnect 
before MPI_Finalize. If you do not, they will be called automatically. 
Unfortunately, there was a bug in the code that did the free/disconnect 
automatically. This is fixed in r18079.


Thanks again,

Tim


Joao Vicente Lima wrote:

Indeed, MPI_Finalize is crashing, and calling MPI_Comm_{free,disconnect} works!
I don't know whether the free/disconnect must appear before MPI_Finalize
in this case (spawned processes) ... any suggestions?

I use loops around spawn:
- first, for testing :)
- and second, because certain MPI applications don't know in advance
the number of children needed to complete their work.

It's great that spawn works ... I will do more tests.

thanks,
Joao

On Mon, Mar 31, 2008 at 3:03 AM, Matt Hughes
 wrote:

On 30/03/2008, Joao Vicente Lima  wrote:
 > Hi,
 >  sorry to bring this up again ... but I hope to use spawn in ompi someday :-D

 I believe it's crashing in MPI_Finalize because you have not closed
 all communication paths between the parent and the child processes.
 For the parent process, try calling MPI_Comm_free or
 MPI_Comm_disconnect on each intercomm in your intercomm array before
 calling finalize.  On the child, call free or disconnect on the parent
 intercomm before calling finalize.
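A minimal sketch of the cleanup described here (hypothetical, not the poster's actual program; "./spawn1" and the loop bound are taken from the snippet quoted further below):

/* Parent side: spawn children in a loop, then disconnect every
 * intercommunicator before MPI_Finalize.  The child would similarly call
 * MPI_Comm_get_parent() and MPI_Comm_disconnect() on the parent intercomm. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm[2];
    int i;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 2; i++) {
        MPI_Comm_spawn("./spawn1", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm[i], MPI_ERRCODES_IGNORE);
    }
    for (i = 0; i < 2; i++) {
        MPI_Comm_disconnect(&intercomm[i]);   /* or MPI_Comm_free() */
    }
    MPI_Finalize();
    return 0;
}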

 Out of curiosity, why a loop of spawns?  Why not increase the value of
 the maxprocs argument, or if you need to spawn different executables,
 or use different arguments for each instance, why not
 MPI_Comm_spawn_multiple?
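And a hedged sketch of the MPI_Comm_spawn_multiple alternative mentioned above, with two made-up executable names spawned in a single call:

/* All children come back in a single intercommunicator. */
#include <mpi.h>

int main(int argc, char **argv)
{
    char    *cmds[2]     = { "./worker_a", "./worker_b" };  /* hypothetical names */
    int      maxprocs[2] = { 1, 1 };
    MPI_Info infos[2]    = { MPI_INFO_NULL, MPI_INFO_NULL };
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos, 0,
                            MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}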

 mch





 >
 >  The execution of spawn in this way works fine:
 >  MPI_Comm_spawn ("./spawn1", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
 >  MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
 >
 >  but if this code go to a for I get a problem :
 >  for (i= 0; i < 2; i++)
 >  {
 >   MPI_Comm_spawn ("./spawn1", MPI_ARGV_NULL, 1,
 >   MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm[i], MPI_ERRCODES_IGNORE);
 >  }
 >
 >  and the error is:
 >  spawning ...
 >  child!
 >  child!
 >  [localhost:03892] *** Process received signal ***
 >  [localhost:03892] Signal: Segmentation fault (11)
 >  [localhost:03892] Signal code: Address not mapped (1)
 >  [localhost:03892] Failing at address: 0xc8
 >  [localhost:03892] [ 0] /lib/libpthread.so.0 [0x2ac71ca8bed0]
 >  [localhost:03892] [ 1]
 >  /usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_dpm_base_dyn_finalize+0xa3)
 >  [0x2ac71ba7448c]
 >  [localhost:03892] [ 2] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 
[0x2ac71b9decdf]
 >  [localhost:03892] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 
[0x2ac71ba04765]
 >  [localhost:03892] [ 4]
 >  /usr/local/mpi/ompi-svn/lib/libmpi.so.0(PMPI_Finalize+0x71)
 >  [0x2ac71ba365c9]
 >  [localhost:03892] [ 5] ./spawn1(main+0xaa) [0x400ac2]
 >  [localhost:03892] [ 6] /lib/libc.so.6(__libc_start_main+0xf4) 
[0x2ac71ccb7b74]
 >  [localhost:03892] [ 7] ./spawn1 [0x400989]
 >  [localhost:03892] *** End of error message ***
 >  --
 >  mpirun noticed that process rank 0 with PID 3892 on node localhost
 >  exited on signal 11 (Segmentation fault).
 >  --
 >
 >  the attachments contain the ompi_info, config.log and program.
 >
 >  thanks for some check,
 >
 > Joao.
 >



___

 >  users mailing list
 >  us...@open-mpi.org
 >  http://www.open-mpi.org/mailman/listinfo.cgi/users
 >
 >
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] SLURM and OpenMPI

2008-03-20 Thread Tim Prins

Hi Werner,

Open MPI does things a little bit differently than other MPIs when it 
comes to supporting SLURM. See

http://www.open-mpi.org/faq/?category=slurm
for general information about running with Open MPI on SLURM.

After trying the commands you sent, I am actually a bit surprised by the 
results. I would have expected this mode of operation to work. But 
looking at the environment variables that SLURM is setting for us, I can 
see why it doesn't.


On a cluster with 4 cores/node, I ran:
[tprins@odin ~]$ cat mprun.sh
#!/bin/sh
printenv
[tprins@odin ~]$  srun -N 2 -n 2 -b mprun.sh
srun: jobid 55641 submitted
[tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
SLURM_TASKS_PER_NODE=4(x2)
[tprins@odin ~]$

Which seems to be wrong, since the srun man page says that 
SLURM_TASKS_PER_NODE is the "Number  of tasks to be initiated on each 
node". This seems to imply that the value should be "1(x2)". So maybe 
this is a SLURM problem? If this value were correctly reported, Open MPI 
should work fine for what you wanted to do.


Two other things:
1. You should probably use the command line option '--npernode' for 
mpirun instead of setting the rmaps_base_n_pernode directly.
2. In regards to your second example below, Open MPI by default maps 'by 
slot'. That is, it will fill all available slots on the first node 
before moving to the second. You can change this, see:

http://www.open-mpi.org/faq/?category=running#mpirun-scheduling

I have copied Ralph on this mail to see if he has a better response.

Tim

Werner Augustin wrote:

Hi,

At our site here at the University of Karlsruhe we are running two
large clusters with SLURM and HP-MPI. For our new cluster we want to
keep SLURM and switch to OpenMPI. While testing I got the following
problem:

with HP-MPI I do something like

srun -N 2 -n 2 -b mpirun -srun helloworld

and get 


Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

when I try the same with OpenMPI (version 1.2.4)

srun -N 2 -n 2 -b mpirun helloworld

I get

Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with 


srun -N 2 -n 2 -b mpirun -np 2 helloworld

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated
nodes.

OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
variables, uses slurm to start one orted per node, and starts tasks up to the
maximum number of slots on every node. So basically it also does
some 'resource management' and interferes with slurm. OK, I can fix that
with an mpirun wrapper script which calls mpirun with the right -np and
the right rmaps_base_n_pernode setting, but it gets worse. We want to
allocate computing power on a per-cpu basis instead of per node, i.e.
different users might share a node. In addition, slurm allows scheduling
according to memory usage. Therefore it is important that on every node
exactly the number of tasks is running that slurm wants. The only
solution I came up with is to generate a detailed hostfile for every job
and call mpirun --hostfile. Any suggestions for improvement?

I've found a discussion thread "slurm and all-srun orterun" in the
mailinglist archive concerning the same problem, where Ralph Castain
announced that he is working on two new launch methods which would fix
my problems. Unfortunately his email address is deleted from the
archive, so it would be really nice if the friendly elf mentioned there
is still around and could forward my mail to him.

Thanks in advance,
Werner Augustin
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] -prefix option to mpirun.

2008-03-04 Thread Tim Prins
Thanks for the report of the broken link. It is now fixed. I have also 
added a paragraph about --enable-mpirun-prefix-by-default to 
http://www.open-mpi.org/faq/?category=running#mpirun-prefix


Tim

Ashley Pittman wrote:

That looks like just what I need, thank you for the quick response.

The closest I could find in the FAQ is this entry which has a broken
link to the second entry.

http://www.open-mpi.org/faq/?category=running#mpirun-prefix

http://www.open-mpi.org/faq/?category=mpi-aps#why-no-rpath

I need to avoid modifying ld.so.conf or setting up aliases so openmpi
can be properly loaded and unloaded with the modules command.

Ashley,

On Tue, 2008-03-04 at 09:37 -0500, Tim Prins wrote:

Hi Ashley,

Yes, you can have this done automatically. Just use the 
'--enable-mpirun-prefix-by-default' option to configure.


I'm actually a bit surprised this is not in the FAQ. I'll have to add it.

Hope this helps,

Tim

Ashley Pittman wrote:

Hello,

I work for a medium-sized UK-based ISV and am packaging open-mpi so that
it can be made available as an option to our users. So far I've been
very impressed by how smoothly things have gone, but I've got one problem
which doesn't seem to be covered by the FAQ.

We install openmpi to /opt/openmpi-1.2.5 and are using the modules
command to select which mpi to use. The modules command correctly sets
PATH to pick up mpicc and mpirun on the head node; however, the issue
comes with running a job: users need to specify -prefix on the mpirun
command line.  Is there a way to specify this in the environment so I
could make it happen automatically as part of the modules environment?

I've searched the archives for this, the closest I can find is this
exchange in 2006, if I specify a full path to mpirun then it does the
right thing but is there a way to extend this functionality to the case
where mpirun is run from path?
http://www.open-mpi.org/community/lists/users/2006/01/0480.php

Yours,  Ashley Pittman.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] -prefix option to mpirun.

2008-03-04 Thread Tim Prins

Hi Ashley,

Yes, you can have this done automatically. Just use the 
'--enable-mpirun-prefix-by-default' option to configure.


I'm actually a bit surprised this is not in the FAQ. I'll have to add it.

Hope this helps,

Tim

Ashley Pittman wrote:

Hello,

I work for a medium-sized UK-based ISV and am packaging open-mpi so that
it can be made available as an option to our users. So far I've been
very impressed by how smoothly things have gone, but I've got one problem
which doesn't seem to be covered by the FAQ.

We install openmpi to /opt/openmpi-1.2.5 and are using the modules
command to select which mpi to use. The modules command correctly sets
PATH to pick up mpicc and mpirun on the head node; however, the issue
comes with running a job: users need to specify -prefix on the mpirun
command line.  Is there a way to specify this in the environment so I
could make it happen automatically as part of the modules environment?

I've searched the archives for this, the closest I can find is this
exchange in 2006, if I specify a full path to mpirun then it does the
right thing but is there a way to extend this functionality to the case
where mpirun is run from path?
http://www.open-mpi.org/community/lists/users/2006/01/0480.php

Yours,  Ashley Pittman.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Cannot build 1.2.5

2008-02-28 Thread Tim Prins
To clean this up for the web archives, we were able to get it to work by 
using '--disable-dlopen'


Tim

Tim Prins wrote:

Scott,

I can replicate this on big red. Seems to be a libtool problem. I'll 
investigate...


Thanks,

Tim

Teige, Scott W wrote:

Hi all,

Attempting a build of 1.2.5 on a ppc machine, particulars:


uname -a
Linux s10c2b2 2.6.5-7.286-pseries64-lustre-1.4.10.1 #2 SMP Tue Jun 26 
11:36:04 EDT 2007 ppc64 ppc64 ppc64 GNU/Linux


Error message (many times)

../../../opal/.libs/libopen-pal.a(dlopen.o)(.opd+0x0): In function 
`__argz_next':

: multiple definition of `__argz_next'
../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x0): first 
defined here


Output from ./configure  and make all is attached.

Thanks for the help,
S.




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Cannot build 1.2.5

2008-02-27 Thread Tim Prins

Scott,

I can replicate this on big red. Seems to be a libtool problem. I'll 
investigate...


Thanks,

Tim

Teige, Scott W wrote:

Hi all,

Attempting a build of 1.2.5 on a ppc machine, particulars:


uname -a
Linux s10c2b2 2.6.5-7.286-pseries64-lustre-1.4.10.1 #2 SMP Tue Jun 26 
11:36:04 EDT 2007 ppc64 ppc64 ppc64 GNU/Linux


Error message (many times)

../../../opal/.libs/libopen-pal.a(dlopen.o)(.opd+0x0): In function 
`__argz_next':

: multiple definition of `__argz_next'
../../../opal/.libs/libopen-pal.a(libltdlc_la-ltdl.o)(.opd+0x0): first 
defined here


Output from ./configure  and make all is attached.

Thanks for the help,
S.




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_Comm_spawn errors

2008-02-19 Thread Tim Prins

Hi Joao,

Unfortunately, spawn is broken on the development trunk right now. We 
are working on a major revamp of the runtime system which should fix 
these problems, but it is not ready yet.


Sorry about that :(

Tim


Joao Vicente Lima wrote:

Hi all,
I'm getting errors with spawn in the situations:

1) spawn1.c - spawning 2 process on localhost, one by one,  the error is:

spawning ...
[localhost:31390] *** Process received signal ***
[localhost:31390] Signal: Segmentation fault (11)
[localhost:31390] Signal code: Address not mapped (1)
[localhost:31390] Failing at address: 0x98
[localhost:31390] [ 0] /lib/libpthread.so.0 [0x2b1d38a17ed0]
[localhost:31390] [ 1]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_comm_dyn_finalize+0xd2)
[0x2b1d37667cb2]
[localhost:31390] [ 2]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_comm_finalize+0x3b)
[0x2b1d3766358b]
[localhost:31390] [ 3]
/usr/local/mpi/openmpi-svn/lib/libmpi.so.0(ompi_mpi_finalize+0x248)
[0x2b1d37679598]
[localhost:31390] [ 4] ./spawn1(main+0xac) [0x400ac4]
[localhost:31390] [ 5] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b1d38c43b74]
[localhost:31390] [ 6] ./spawn1 [0x400989]
[localhost:31390] *** End of error message ***
--
mpirun has exited due to process rank 0 with PID 31390 on
node localhost calling "abort". This will have caused other processes
in the application to be terminated by signals sent by mpirun
(as reported here).
--

With 1 process spawned, or with 2 processes spawned in one call, there is
no output from the child.

2) spawn2.c - no response; the init here is
 MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &provided)

the attachments contains the programs, ompi_info and config.log.

Any suggestions?

thanks a lot.
Joao.




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] bug in MPI_ACCUMULATE for window offsets > 2**31 - 1 bytes? openmpi v1.2.5

2008-02-07 Thread Tim Prins

The fix I previously sent to the list has been committed in r17400.

Thanks,

Tim

Tim Prins wrote:

Hi Stefan,

I was able to verify the problem. Turns out this is a problem with other 
onesided operations as well. Attached is a simple test case I made in c 
using MPI_Put that also fails.


The problem is that the target count and displacements are both sent as 
signed 32 bit integers. Then, the receiver multiplies them together and 
adds them to the window base. However, this multiplication is done using 
the signed 32 bit integers, which overflows. This is then added to the 
64 bit pointer. This, of course, results in a bad address.
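A small standalone illustration of that overflow (hypothetical values, not the actual one-sided code):

/* Two 32-bit values whose product exceeds 2^31: the 32-bit multiply
 * overflows (formally undefined, in practice it wraps) before the result is
 * widened to 64 bits, so the computed offset is wrong. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t count  = 400000000;   /* displacement in elements, for example */
    int32_t extent = 8;           /* size of an MPI_REAL8 in bytes */

    int64_t bad  = count * extent;           /* multiply in 32 bits, widen later */
    int64_t good = (int64_t)count * extent;  /* widen first, then multiply */

    printf("bad offset:  %lld\n", (long long)bad);
    printf("good offset: %lld\n", (long long)good);
    return 0;
}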


I have attached a patch against a recent development version that fixes 
this for me. I am also copying Brian Barrett, who did all the work on 
the onesided code.


Brian: if possible, please take a look at the attached patch and test case.

Thanks for the report!

Tim Prins

Stefan Knecht wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi all,

I encounter a problem with the routine MPI_ACCUMULATE trying to sum up
MPI_REAL8's on a large memory window with a large offset.
My program running (on a single processor, x86_64 architecture) crashes with
an error message like:

[node14:16236] *** Process received signal ***
[node14:16236] Signal: Segmentation fault (11)
[node14:16236] Signal code: Address not mapped (1)
[node14:16236] Failing at address: 0x2aaa32b16000
[node14:16236] [ 0] /lib64/libpthread.so.0 [0x32e080de00]
[node14:16236] [ 1] 
/home/stefan/bin/openmpi-1.2.5/lib/libmpi.so.0(ompi_mpi_op_sum_double+0x10) 
[0x2af15530]
[node14:16236] [ 2] 
/home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_process_op+0x2d7) 


[0x2aaab1a47257]
[node14:16236] [ 3] 
/home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so 
[0x2aaab1a45432]
[node14:16236] [ 4] 
/home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_passive_unlock+0x93) 


[0x2aaab1a48243]
[node14:16236] [ 5] 
/home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so 
[0x2aaab1a43436]
[node14:16236] [ 6] 
/home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_progress+0xff) 


[0x2aaab1a42e0f]
[node14:16236] [ 7] 
/home/stefan/bin/openmpi-1.2.5/lib/libopen-pal.so.0(opal_progress+0x4a) 
[0x2b3dfa0a]
[node14:16236] [ 8] 
/home/stefan/bin/openmpi-1.2.5/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_module_unlock+0x2a9) 


[0x2aaab1a48629]
[node14:16236] [ 9] 
/home/stefan/bin/openmpi-1.2.5/lib/libmpi.so.0(PMPI_Win_unlock+0xe1) 
[0x2af4a291]
[node14:16236] [10] 
/home/stefan/bin/openmpi-1.2.5/lib/libmpi_f77.so.0(mpi_win_unlock_+0x25) 
[0x2acdd8c5]
[node14:16236] [11] /home/stefan/calc/mpi2_test/a.out(MAIN__+0x809) 
[0x401851]
[node14:16236] [12] /home/stefan/calc/mpi2_test/a.out(main+0xe) 
[0x401bbe]
[node14:16236] [13] /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x32dfc1dab4]

[node14:16236] [14] /home/stefan/calc/mpi2_test/a.out [0x400f99]
[node14:16236] *** End of error message ***
mpirun noticed that job rank 0 with PID 16236 on node node14 exited on 
signal 11 (Segmentation fault).



The relevant part of my FORTRAN source code reads as:

~  program accumulate_test
~  IMPLICIT REAL*8 (A-H,O-Z)
~  include 'mpif.h'
~  INTEGER(KIND=MPI_OFFSET_KIND) MX_SIZE_M
C dummy size parameter
~  PARAMETER (MX_SIZE_M = 1 000 000)
~  INTEGER MPIerr, MYID, NPROC
~  INTEGER ITARGET, MY_X_WIN, JCOUNT, JCOUNT_T
~  INTEGER(KIND=MPI_ADDRESS_KIND) MEM_X, MEM_Y
~  INTEGER(KIND=MPI_ADDRESS_KIND) IDISPL_WIN
~  INTEGER(KIND=MPI_ADDRESS_KIND) PTR1, PTR2
~  INTEGER(KIND=MPI_INTEGER_KIND) ISIZE_REAL8
~  INTEGER*8 NELEMENT_X, NELEMENT_Y
~  POINTER (PTR1, XMAT(MX_SIZE_M))
~  POINTER (PTR2, YMAT(MX_SIZE_M))
C
~  CALL MPI_INIT( MPIerr )
~  CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYID,  MPIerr)
~  CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROC, MPIerr)
C
~  NELEMENT_X = 400 000 000
~  NELEMENT_Y =  10 000
C
~  CALL MPI_TYPE_EXTENT(MPI_REAL8, ISIZE_REAL8, MPIerr)
~  MEM_X = NELEMENT_X * ISIZE_REAL8
~  MEM_Y = NELEMENT_Y * ISIZE_REAL8
C
C allocate memory
C
~  CALL MPI_ALLOC_MEM( MEM_X, MPI_INFO_NULL, PTR1, MPIerr)
~  CALL MPI_ALLOC_MEM( MEM_Y, MPI_INFO_NULL, PTR2, MPIerr)
C
C fill vectors with 0.0D0 and 1.0D0
C
~  CALL DZERO(XMAT,NELEMENT_X)
~  CALL DONE(YMAT,NELEMENT_Y)
C
C open memory window
C
~  CALL MPI_WIN_CREATE( XMAT, MEM_X, ISIZE_REAL8,
~ & MPI_INFO_NULL, MPI_COMM_WORLD,
~ & MY_X_WIN, MPIerr )
C lock window (MPI_LOCK_SHARED mode)
C select target ==> if itarget == myid: no 1-sided communication
C
~  ITARGET = MYID
~  CALL MPI_WIN_LOCK( MPI_LOCK_SHARED, ITARGET, MPI_MODE_NOCHECK,
~ &   MY_X_WIN, MPIerr)
C
C transfer data to target ITARGET
C
~  JCOUNT_T = 10 000
~  JCOUNT   = JCOUNT_T
C set displacement in memory window
~  I

Re: [OMPI users] mpirun, paths and xterm again

2008-02-06 Thread Tim Prins

Jody,

If you want to forward X connections through ssh, you should NOT set the 
DISPLAY variable. ssh will set the proper one for you.


Tim

jody wrote:

Tim

Thank you for your explanation on how OpenMPI uses ssh.



There is a way to force the ssh sessions to stay open. However doing so
will result in a bunch of excess debug output. If you add
"--debug-daemons" to the mpirun command line, the ssh connections should
stay open.


Unfortunately this didn't work either:

[jody]:/mnt/data1/neander:$mpirun -np 4 --debug-daemons --hostfile
testhosts -x DISPLAY=plankton:0.0 xterm -hold -e ../MPITest
Daemon [0,0,1] checking in as pid 19473 on host plankton.unizh.ch
Daemon [0,0,2] checking in as pid 26531 on host nano_00
[plankton.unizh.ch:19473] [0,0,1] orted: received launch callback
[nano_00:26531] [0,0,2] orted: received launch callback
xterm Xt error: Can't open display: plankton:0.0
xterm Xt error: Can't open display: plankton:0.0
xterm Xt error: Can't open display: plankton:0.0
xterm Xt error: Can't open display: plankton:0.0
[plankton.unizh.ch:19473] [0,0,1] orted_recv_pls: received message from [0,0,0]
[plankton.unizh.ch:19473] [0,0,1] orted_recv_pls: received exit
[nano_00:26531] [0,0,2] orted_recv_pls: received message from [0,0,0]
[nano_00:26531] [0,0,2] orted_recv_pls: received exit

If i use ":0.0" instead of "plankton:0.0", at least the local
processes open their X-terms.



Jody
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] mpirun, paths and xterm again

2008-02-05 Thread Tim Prins

Jody,

jody wrote:

Hi Tim


Your desktop is plankton, and you want
to run a job on both plankton and nano, and have xterms show up on nano.


Not on nano, but on plankton - but I think this was just a typo :)

Correct.


It looks like you are already doing this, but to make sure, the way I
would use xhost is:
plankton$ xhost +nano_00
plankton$ mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0
xterm -hold -e ../MPITest

This gives me 2 lines of
  xterm Xt error: Can't open display: plankton:0.0


Can you try running:
plankton$ mpirun -np 1 -host nano_00 -x DISPLAY=plankton:0.0 printenv

This yields
DISPLAY=plankton:0.0





just to make sure the environment variable is being properly set.

You might also try:
in terminal 1:
plankton$ xhost +nano_00

in terminal 2:
plankton$ ssh -x nano_00
nano_00$ export DISPLAY="plankton:0.0"
nano_00$ xterm


This experiment also gives
xterm Xt error: Can't open display: plankton:0.0


This will ssh into nano, disabling ssh X forwarding, and try to launch
an xterm. If this does not work, then something is wrong with your x
setup. If it does work, it should work with Open MPI as well.


So i guess something is wrong with my X setup.
I wonder what it could be ...


So this is an X issue, not an Open MPI issue then. I do not know enough 
about X setup to help here...




Doing the same with X11 forwarding works perfectly.
But why is X11 forwarding bad?  Or, asked differently,
does Open MPI make the ssh connection in such a way
that X11 forwarding is disabled?


What Open MPI does is it uses ssh to launch a daemon on a remote node, 
then it disconnects the ssh session. This is done to prevent running out 
of resources at scale. We then send a message to the daemon to launch 
the client application. So we are not doing anything to prevent ssh X11 
forwarding; it is just that by the time the application is launched, the ssh 
sessions are no longer around.


There is a way to force the ssh sessions to stay open. However doing so 
will result in a bunch of excess debug output. If you add 
"--debug-daemons" to the mpirun command line, the ssh connections should 
stay open.


Hope this helps,

Tim


Re: [OMPI users] mpirun, paths and xterm again

2008-02-05 Thread Tim Prins

Hi Jody,

Just to make sure I understand. Your desktop is plankton, and you want 
to run a job on both plankton and nano, and have xterms show up on nano.


It looks like you are already doing this, but to make sure, the way I 
would use xhost is:

plankton$ xhost +nano_00
plankton$ mpirun -np 4 --hostfile testhosts -x DISPLAY=plankton:0.0 
xterm -hold -e ../MPITest


Can you try running:
plankton$ mpirun -np 1 -host nano_00 -x DISPLAY=plankton:0.0 printenv

just to make sure the environment variable is being properly set.

You might also try:
in terminal 1:
plankton$ xhost +nano_00

in terminal 2:
plankton$ ssh -x nano_00
nano_00$ export DISPLAY="plankton:0.0"
nano_00$ xterm

This will ssh into nano, disabling ssh X forwarding, and try to launch 
an xterm. If this does not work, then something is wrong with your x 
setup. If it does work, it should work with Open MPI as well.


For your second question: I'm not sure why there would be a difference 
in finding the shared libraries in gdb vs. with the xterm.


Tim

jody wrote:

Hi
Sorry to bring this subject up again -
but i have a problem getting xterms
running for all of my processes (for debugging purposes).
There are actually two problems involved:
display, and paths.


my ssh is set up so that X forwarding is allowed,
and, indeed,
  ssh nano_00 xterm
opens an xterm from the remote machine nano_00.

When i run my program normally, it works ok:
 [jody]:/mnt/data1/neander:$mpirun -np 4 --hostfile testhosts ./MPITest
[aim-plankton.unizh.ch]I am #0/4 global
[aim-plankton.unizh.ch]I am #1/4 global
[aim-nano_00]I am #2/4 global
[aim-nano_00]I am #3/4 global

But when i try to see it in xterms
[jody]:/mnt/data1/neander:$mpirun -np 4 --hostfile testhosts -x
DISPLAY xterm -hold -e  ./MPITest
xterm Xt error: Can't open display: :0.0
xterm Xt error: Can't open display: :0.0

(same happens, if i set DISPLAY=plankton:0.0, or if i use plankton's
ip address;
and xhost is enabled for nano_00)

the other two (the "local") xterms open, but they display the message:
 ./MPITest: error while loading shared libraries: libmpi_cxx.so.0:
cannot open shared object file: No such file or directory
(This also happens if i only have local processes)

So my first question is: what do i do to enable nano_00 to display an xterm
on plankton? Using normal ssh there seems to be no problem.

Second question: why does the use of xterm "hide" the open-mpi libs?
Interestingly: if i use xterm with gdb to start my application, it works.

Any ideas?

Thank you
  Jody
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI/Myrinet dynamic process management

2008-01-31 Thread Tim Prins

Hi Kay,

Sorry for the delay in replying, looks like this one slipped through.

The dynamic process management should work fine on GM.

Hope this helps,

Tim

kay kay wrote:


I am looking for dynamic process management support (e.g. MPI_Comm_spawn) 
on the Myrinet platform. From the Myricom website, it looks like MPICH2-GM 
does not support dynamic process management.


I was wondering if OpenMPI supports this feature on Myrinet/GM ?
 
Thanks,

-Kay.




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Torque and OpenMPI 1.2

2007-12-18 Thread Tim Prins
Open MPI v1.2 had some problems with the TM configuration code which were fixed 
in v1.2.1. So any version v1.2.1 or later should work fine (and, as you 
indicate, 1.2.4 works fine).

Tim

On Tuesday 18 December 2007 12:48:40 pm pat.o'bry...@exxonmobil.com wrote:
> Jeff,
> Here is the result of the "pbs-config". By the way, I have successfully
> built OpenMPI 1.2.4 on this same system. The "config.log" for OpenMPI 1.2.4
> shows the correct Torque path. That is not surprising since the "configure"
> script for OpenMPI 1.2.4 uses "pbs-config" while the configure script for
> OpenMPI 1.2 does not.
> ---
>- # pbs-config --libs
> -L/usr/local/pbs/x86_64/lib -ltorque -Wl,--rpath
> -Wl,/usr/local/pbs/x86_64/lib
> ---
>-
>
> Now, to address your concern about the nodes, my users are not "adding
> nodes" to those provided by Torque. They are using a "proper subset" of the
> nodes.  Also,  I believe I read this comment on the OpenMPI web site which
> seems to imply an oversight as far as the "-hostfile" is concerned.
> ---
>
>- Can I specify a hostfile or use
> the --host option to mpirun when running in a Torque / PBS environment?
> As of version v1.2.1, no.
> Open MPI will fail to launch processes properly when a hostfile is specified
> on the mpirun command line, or if the mpirun [--host] option is used.
>
>
> We're working on correcting the error. A future version of Open MPI will
> likely launch on the hosts specified either in the hostfile or via the
> --host option as long as they are a proper subset of the hosts allocated to
> the Torque / PBS Pro job.
> ---
>
>- Thanks,
>
> J.W. (Pat) O'Bryant,Jr.
> Business Line Infrastructure
> Technical Systems, HPC
> Office: 713-431-7022
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] errno=131 ?

2007-11-18 Thread Tim Prins
Or you can follow the advice in this faq:
http://www.open-mpi.org/faq/?category=tcp#tcp-connection-errors

and run:
perl -e 'die$!=131'

Tim

On Sunday 18 November 2007 09:29:25 pm George Bosilca wrote:
> There is a good reason for this. The errno is system dependent. As an
> example on my Debian cluster errno 131 means "ENOTRECOVERABLE".
> Usually, this value is used with mutexes and not with writev.
> If you want to know what the 131 means on your specific system, take a
> look in /usr/include/errno.h.
>
>george.
>
> On Nov 18, 2007, at 8:59 AM, Lydia Heck wrote:
> > One of our programs has got stuck - it has not terminated -
> > with the error messages:
> > mca_btl_tcp_frag_send: writev failed with errno=131.
> >
> > Searching the openmpi web site did not result in a positive hit.
> > What does it mean?
> >
> > I am running 1.2.1r14096
> >
> > Lydia
> >
> >
> > --
> > Dr E L  Heck
> >
> > University of Durham
> > Institute for Computational Cosmology
> > Ogden Centre
> > Department of Physics
> > South Road
> >
> > DURHAM, DH1 3LE
> > United Kingdom
> >
> > e-mail: lydia.h...@durham.ac.uk
> >
> > Tel.: + 44 191 - 334 3628
> > Fax.: + 44 191 - 334 3645
> > ___
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Too many open files Error

2007-10-31 Thread Tim Prins
Hi Clement,

I seem to recall (though this may have changed) that if a system supports 
ipv6, we may open both ipv4 and ipv6 sockets. This can be worked around by 
configuring Open MPI with --disable-ipv6

Other than that, I don't know of anything else to do except raise the limit 
for the number of open files.
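For reference, a small sketch (not from the original reply) that prints the per-process descriptor limit the error is running into; raising the soft limit up to the hard limit can usually be done per shell with ulimit -n, without a reboot:

/* Query the open-file limit with getrlimit(); 'ulimit -n' shows the same
 * soft limit from a shell. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
        printf("open files: soft %llu, hard %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    }
    return 0;
}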

I know it doesn't help you now, but we are actively working on this problem 
for Open MPI 1.3. This version will introduce a tree routing scheme which 
will dramatically reduce the number of open sockets that the runtime system 
needs.

Hope this helps,

Tim

On Tuesday 30 October 2007 07:15:42 pm Clement Kam Man Chu wrote:
> Hi,
>
>   I got a "Too many open files" error while running over 1024 processes
> on 512 cpus.  I found the same error on
> http://www.open-mpi.org/community/lists/users/2006/11/2216.php, but I
> would like to know whether there is another solution besides raising the
> file descriptor limit.  Changing that limit requires root access and
> restarting the system, which I don't want to do.
>
> Regards,
> Clement




Re: [OMPI users] mpirun udapl problem

2007-10-31 Thread Tim Prins
Hi Jon,

Just to make sure, running 'ompi_info' shows that you have the udapl btl 
installed?

Tim

On Wednesday 31 October 2007 06:11:39 pm Jon Mason wrote:
> I am having a bit of a problem getting udapl to work via mpirun (over
> open-mpi, obviously).  I am running a basic pingpong test and I get the
> following error.
>
> # mpirun --n 2 --host vic12-10g,vic20-10g -mca btl udapl,self
> /usr/mpi/gcc/open*/tests/IMB*/IMB-MPI1 pingpong
> --
> Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
> If you specified the use of a BTL component, you may have
> forgotten a component (such as "self") in the list of
> usable components.
> --
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   PML add procs failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> --
> Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> If you specified the use of a BTL component, you may have
> forgotten a component (such as "self") in the list of
> usable components.
> --
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   PML add procs failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
>
>
> The command is successful if udapl is replaced with tcp or openib.  So I
> think my setup is correct.  Also, dapltest successfully completes
> without any problems over IB or iWARP.
>
> Any thoughts or suggestions would be greatly appreciated.
>
> Thanks,
> Jon
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Syntax error in remote rsh execution

2007-10-24 Thread Tim Prins

Glad you found the problem.

Don't worry about the '--num_proc 3'. This does not refer to the number 
of application processes, but rather the number of 'daemon' processes 
plus 1 for mpirun. However, this is an internal interface which changes 
on different versions of Open MPI, so this explanation is subject to 
change :)


Tim

Jorge Parra wrote:

Hi Tim,

Thank you for your reply.

You are right, my openMPI version is rather old. However, I am stuck with 
it until I can get v1.2.4 to compile. I have had some problems with it (I already 
opened a case on Oct 15th).


You were also right about my hostname. uname -n reports (none) and the 
"hostname" command did not exist in the nodes of my cluster. I already 
added it to the nodes and modified the /etc/hosts file. The error went 
away and now I can see that orted runs in the remote node. It is strange 
to me that orted runs with --num_proc 3 when mpirun was executed with -np 
2. Does this sound correct to you? I might open a new case for it 
though...



Thank you for your help,

Jorge

On Mon, 22 Oct 2007, Tim Prins wrote:


Sorry to reply to my own mail.

Just browsing through the logs you sent, and I see that 'hostname' should be
working fine. However, you are using v1.1.5 which is very old. I would
strongly suggest upgrading to v1.2.4. It is a huge improvement over the old
v1.1 series (which is not being maintained anymore).

Tim

On Monday 22 October 2007 08:41:30 pm Tim Prins wrote:

Hi Jorge,

This is interesting. The problem is the universe name:
root@(none):default-universe

The "(none)" part is supposed to be the hostname where mpirun is executed.
Try running:
hostname

and:
uname -n

These should both return valid hostnames for your machine.

Open MPI pretty much assumes that all nodes have a valid (preferably
unique) hostname. If the above commands don't work, you probably need to
fix your cluster.

Let me know if this does not work.

Thanks,

Tim

On Thursday 18 October 2007 09:22:09 pm Jorge Parra wrote:

Hi,

When trying to execute an application that spawns to another node, I
obtain the following message:

# ./mpirun --hostfile /root/hostfile -np 2 greetings
Syntax error: "(" unexpected (expecting ")")
-
- Could not execute the executable
"/opt/OpenMPI/OpenMPI-1.1.5b/exec/bin/greetings
": Exec format error

This could mean that your PATH or executable name is wrong, or that you
do not
have the necessary permissions.  Please ensure that the executable is
able to be

found and executed.
-
-

and in the remote node:

# pam_rhosts_auth[183]: user root has a `+' user entry
pam_rhosts_auth[183]: allowed to root@192.168.1.102 as root
PAM_unix[183]: (rsh) session opened for user root by (uid=0)
in.rshd[184]: root@192.168.1.102 as root: cmd='( ! [ -e ./.profile ] || .
./.pro
file; orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0
--nodename 1
92.168.1.103 --universe root@(none):default-universe --nsreplica
"0.0.0;tcp://19
2.168.1.102:32774" --gprreplica "0.0.0;tcp://192.168.1.102:32774"
--mpi-call-yie
ld 0 )'
PAM_unix[183]: (rsh) session closed for user root

I suspect the command that rsh is trying to execute in the remote node
fails. It seems to me that the first parenthesis in cmd='( ! is not well
interpreted, thus causing the syntax error. This might prevent .profile
from running and correctly setting PATH. Therefore, "greetings" is not found.

I am attaching to this email the appropiate configuration files of my
system and openmpi on it. This is a system in an isolated network, so I
don't care too much for security. Therefore I am using rsh on it.

I would really appreciate any suggestions to correct this problem.

Thank you,

Jorge

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Recursive use of "orterun"

2007-10-22 Thread Tim Prins
Hi Ides,

Thanks for the report and reminder. I have filed a ticket on this 
(https://svn.open-mpi.org/trac/ompi/ticket/1173) and you should receive email 
as it is updated.

I do not know of any more elegant way to work around this at the moment.

Thanks,

Tim

On Friday 19 October 2007 06:31:53 am idesbald van den bosch wrote:
> Hi,
>
> I've run into the same problem as discussed in the thread Lev Gelb: "Re:
> [OMPI users] Recursive use of "orterun" (Ralph H
> Castain)"
>
> I am running a parallel python code, then from python I launch a C++
> parallel program using the python os.system command, then I come back in
> python and keep going.
>
> With LAM/MPI there is no problem with this.
>
> But Open-mpi systematically crashes, because the python os.system command
> launches the C++ program with the same OMPI_* environment variables as for
> the Python program. As discussed in the thread, I have tried filtering the
> OMPI_* variables prior to launching the C++ program with an
> os.execve command, but then it fails to return control to python and
> instead simply terminates when the C++ program ends.
>
> There is a workaround (
> http://thread.gmane.org/gmane.comp.clustering.open-mpi.user/986): create a
> *.sh file with the following lines:
>
> 
> for i in $(env | grep OMPI_MCA |sed 's/=/ /' | awk '{print $1}')
> do
>unset $i
> done
>
> # now the C++ call
> mpirun -np 2  ./MoM/communicateMeshArrays
> --
>
> and then call the *.sh program through the python os.system command.
>
> What I would like to know is whether this "problem" will get fixed in
> Open MPI. Is there another way to solve this issue elegantly? Meanwhile, I
> will stick to the ugly *.sh hack listed above.
>
> Cheers
>
> Ides




Re: [OMPI users] Syntax error in remote rsh execution

2007-10-22 Thread Tim Prins
Sorry to reply to my own mail. 

Just browsing through the logs you sent, and I see that 'hostname' should be 
working fine. However, you are using v1.1.5 which is very old. I would 
strongly suggest upgrading to v1.2.4. It is a huge improvement over the old 
v1.1 series (which is not being maintained anymore).

Tim

On Monday 22 October 2007 08:41:30 pm Tim Prins wrote:
> Hi Jorge,
>
> This is interesting. The problem is the universe name:
> root@(none):default-universe
>
> The "(none)" part is supposed to be the hostname where mpirun is executed.
> Try running:
> hostname
>
> and:
> uname -n
>
> These should both return valid hostnames for your machine.
>
> Open MPI pretty much assumes that all nodes have a valid (preferably
> unique) hostname. If the above commands don't work, you probably need to
> fix your cluster.
>
> Let me know if this does not work.
>
> Thanks,
>
> Tim
>
> On Thursday 18 October 2007 09:22:09 pm Jorge Parra wrote:
> > Hi,
> >
> > When trying to execute an application that spawns to another node, I
> > obtain the following message:
> >
> > # ./mpirun --hostfile /root/hostfile -np 2 greetings
> > Syntax error: "(" unexpected (expecting ")")
> > -
> >- Could not execute the executable
> > "/opt/OpenMPI/OpenMPI-1.1.5b/exec/bin/greetings
> > ": Exec format error
> >
> > This could mean that your PATH or executable name is wrong, or that you
> > do not
> > have the necessary permissions.  Please ensure that the executable is
> > able to be
> >
> > found and executed.
> > -
> >-
> >
> > and in the remote node:
> >
> > # pam_rhosts_auth[183]: user root has a `+' user entry
> > pam_rhosts_auth[183]: allowed to root@192.168.1.102 as root
> > PAM_unix[183]: (rsh) session opened for user root by (uid=0)
> > in.rshd[184]: root@192.168.1.102 as root: cmd='( ! [ -e ./.profile ] || .
> > ./.pro
> > file; orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0
> > --nodename 1
> > 92.168.1.103 --universe root@(none):default-universe --nsreplica
> > "0.0.0;tcp://19
> > 2.168.1.102:32774" --gprreplica "0.0.0;tcp://192.168.1.102:32774"
> > --mpi-call-yie
> > ld 0 )'
> > PAM_unix[183]: (rsh) session closed for user root
> >
> > I suspect the command that rsh is trying to execute in the remote node
> > fails. It seems to me that the first parenthesis in cmd='( ! is not well
> > interpreted, thus causing the syntax error. This might prevent .profile
> > from running and correctly setting PATH. Therefore, "greetings" is not found.
> >
> > I am attaching to this email the appropiate configuration files of my
> > system and openmpi on it. This is a system in an isolated network, so I
> > don't care too much for security. Therefore I am using rsh on it.
> >
> > I would really appreciate any suggestions to correct this problem.
> >
> > Thank you,
> >
> > Jorge
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Syntax error in remote rsh execution

2007-10-22 Thread Tim Prins
Hi Jorge,

This is interesting. The problem is the universe name:
root@(none):default-universe

The "(none)" part is supposed to be the hostname where mpirun is executed. Try 
running:
hostname

and:
uname -n

These should both return valid hostnames for your machine.

Open MPI pretty much assumes that all nodes have a valid (preferably unique) 
hostname. If the above commands don't work, you probably need to fix your 
cluster.

Let me know if this does not work.

Thanks,

Tim

On Thursday 18 October 2007 09:22:09 pm Jorge Parra wrote:
> Hi,
>
> When trying to execute an application that spawns to another node, I
> obtain the following message:
>
> # ./mpirun --hostfile /root/hostfile -np 2 greetings
> Syntax error: "(" unexpected (expecting ")")
> --
> Could not execute the executable
> "/opt/OpenMPI/OpenMPI-1.1.5b/exec/bin/greetings
> ": Exec format error
>
> This could mean that your PATH or executable name is wrong, or that you do
> not
> have the necessary permissions.  Please ensure that the executable is able
> to be
>
> found and executed.
> --
>
> and in the remote node:
>
> # pam_rhosts_auth[183]: user root has a `+' user entry
> pam_rhosts_auth[183]: allowed to root@192.168.1.102 as root
> PAM_unix[183]: (rsh) session opened for user root by (uid=0)
> in.rshd[184]: root@192.168.1.102 as root: cmd='( ! [ -e ./.profile ] || .
> ./.pro
> file; orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0
> --nodename 1
> 92.168.1.103 --universe root@(none):default-universe --nsreplica
> "0.0.0;tcp://19
> 2.168.1.102:32774" --gprreplica "0.0.0;tcp://192.168.1.102:32774"
> --mpi-call-yie
> ld 0 )'
> PAM_unix[183]: (rsh) session closed for user root
>
> I suspect the command that rsh is trying to execute in the remote node
> fails. It seems to me that the first parenthesis in cmd='( ! is not well
> interpreted, thus causing the syntax error. This might prevent .profile to
> run and to correctly set PATH. Therefore, "greetings" is not found.
>
> I am attaching to this email the appropiate configuration files of my
> system and openmpi on it. This is a system in an isolated network, so I
> don't care too much for security. Therefore I am using rsh on it.
>
> I would really appreciate any suggestions to correct this problem.
>
> Thank you,
>
> Jorge




Re: [OMPI users] Query regarding GPR

2007-10-09 Thread Tim Prins

Hi Neeraj,

The GPR is maintained in the mpirun (orterun) process. The data is then 
distributed via the RML/OOB.


Hope this helps,

Tim

Neeraj Chourasia wrote:

Hi everybody,

I have a question regarding ORTE. One of the major functions of 
orte is to maintain the GPR, which subscribes and publishes information to 
the universe. My question is: when we submit a job from a machine, 
where does the GPR get created? Is it on the submit machine (HNP)?
If YES, then how does a compute node get that information 
during execution? Does it use the OOB for it?


-Neeraj




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] problem with 'orted'

2007-10-03 Thread Tim Prins

So you did:
ssh  which orted

and it found the orted?

Tim

Amit Kumar Saha wrote:

Hi sebi!

On 10/2/07, Sebastian Schulz  wrote:

 Amit Kumar Saha wrote:

what i find bizarre is that I used Open MPI 1.2.3 to install on all my
4 machines. whereas, 'orted' is installed in /usr/local/bin on all the
other 3 machines, the 4th machine which is giving me trouble has got
it installed in '/usr/bin' . Yes, 'orted' is accessible from a ssh
login as well.

Note that on Ubuntu (at least on 7.04) the default ~/.bashrc contains the 
following line:

# If not running interactively, don't do anything
[ -z "$PS1" ] && return



Unfortunately, this does not solve the problem. I have got 2 of my
other 3 machines running Ubuntu 7.04 as well, but they are doing fine!

Hope you can provide me some more info!

Thanks,
--Amit



Re: [OMPI users] Bug in MPI_Reduce/MPI_Comm_split?

2007-10-03 Thread Tim Prins

Marco,

Thanks for the report, and sorry for the delayed response. I can 
replicate a problem using your test code, but it does not segfault for 
me (although I am using a different version of Open MPI).


I filed a bug on this so (hopefully) out collective gurus will look at 
it soon. You will receive email updates about the bug. Also, it is here:

https://svn.open-mpi.org/trac/ompi/ticket/1158

Thanks,

Tim

Marco Sbrighi wrote:


Dear Open MPI developers,

I'm using Open MPI 1.2.2 over OFED 1.1 on an 680 nodes dual Opteron dual
core Linux cluster. Of course, with Infiniband interconnect. 
During the execution of big jobs (greater than 128 processes) I've

experienced slowdowns in performance and deadlocks in collective MPI
operations. The job processes often terminate issuing "RETRY EXCEEDED
ERROR" (of course, only if btl_openib_ib_timeout is properly set).
Yes, this kind of error seems to be related to the fabric, but more or
less half of the MPI processes are incurring the timeout.
In order to investigate that behaviour more closely, I've tried to
do some "constrained" tests using SKaMPI, but it is quite difficult to
isolate a single collective operation using SKaMPI. In fact, even though the
SKaMPI script may contain only a request for (say) a Reduce, with many
communicator sizes, the SKaMPI code will also do a lot of bcast,
alltoall etc. by itself.
So I've tried to use a hand-made piece of code, in order to do "only" one
repeated collective operation at a time.
The code is attached to this message; the file is named
collect_noparms.c.
What is happened when I've tried to run this code is reported here:

..

011 - 011 - 039 NOOT START
000 - 000 of 38 - 655360  0.00
[node1049:11804] *** Process received signal ***
[node1049:11804] Signal: Segmentation fault (11)
[node1049:11804] Signal code: Address not mapped (1)
[node1049:11804] Failing at address: 0x18
035 - 035 - 039 NOOT START
000 - 000 of 38 - 786432  0.00
[node1049:11804] [ 0] /lib64/tls/libpthread.so.0 [0x2a964db420]
000 - 000 of 38 - 917504  0.00
[node1049:11804] [ 1] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a9573fa18]
[node1049:11804] [ 2] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a9573f639]
[node1049:11804] [ 3] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_btl_sm_send+0x122)
 [0x2a9573f5e1]
[node1049:11804] [ 4] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a957acac6]
[node1049:11804] [ 5] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_pml_ob1_send_request_start_copy+0x303)
 [0x2a957ace52]
[node1049:11804] [ 6] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a957a2788]
[node1049:11804] [ 7] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0 
[0x2a957a251c]
[node1049:11804] [ 8] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(mca_pml_ob1_send+0x2e2)
 [0x2a957a2d9e]
[node1049:11804] [ 9] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_generic+0x651)
 [0x2a95751621]
[node1049:11804] [10] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_pipeline+0x176)
 [0x2a95751bff]
[node1049:11804] [11] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(ompi_coll_tuned_reduce_intra_dec_fixed+0x3f4)
 [0x2a957475f6]
[node1049:11804] [12] 
/cineca/prod/openmpi/1.2.2/mr/gnu3.4-bc_no_memory_mgr_dbg/lib/libmpi.so.0(PMPI_Reduce+0x3a6)
 [0x2a9570a076]
[node1049:11804] [13] 
/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x(reduce+0x3e) 
[0x404e64]
[node1049:11804] [14] 
/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x(main+0x620) 
[0x404c8e]
[node1049:11804] [15] /lib64/tls/libc.so.6(__libc_start_main+0xdb) 
[0x2a966004bb]
[node1049:11804] [16] 
/bcx/usercin/asm0/mpptools/mpitools/debug/src/collect_noparms_bc.x [0x40448a]
[node1049:11804] *** End of error message ***

...

The behaviour is the same, more or less identical, using either the
Infiniband or the Gigabit interconnect. If I use another MPI implementation
(say MVAPICH), all goes right.
Then I've compiled both my code and Open MPI using gcc 3.4.4 with
bounds-checking and compiler debugging flags, and without the OMPI memory
manager ... the behaviour is identical, but now I have the line where the
SIGSEGV is trapped:



gdb collect_noparms_bc.x core.11580
GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  

Re: [OMPI users] MPI_Gatherv on One Process

2007-10-03 Thread Tim Prins

Thanks for the report!

I have reproduced this bug and have filed a ticket on this 
(https://svn.open-mpi.org/trac/ompi/ticket/1157). You should receive 
updates as this bug is worked on.


Thanks,

Tim

Chris Johnson wrote:

Hi, I'm trying to run an MPI program of mine under OpenMPI 1.2 using
just one process (mpirun -np 1 ./a.out) and I'm getting some
unexpected results.  The problem is that I'm getting unexpected
results from an MPI_Gatherv call when the offset for rank is nonzero.

I've worked up a small sample that can reproduce the problem on the
several machines I've tried.  Here, each process creates a tmp array
of five ints.  These tmp arrays are then gathered by rank into a
buffer, but offset by 10 places.  (These 10 places are initialized
with -1.)  When I run with multiple processes, I see the 10 -1s and
each process's tmp array in the buffer.  But when I run with one
process, the buffer contains funny values.  When I run with one
process under MPICH, the buffer contains the 10 -1s and the rank's
array, as expected.  When the offset is 0, OpenMPI behaves just fine
with one process.

Here's the sample:

--
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define COUNT 5
#define OFFSET 10

int main(int argc, char **argv) {

   int i;
   int *nitems;
   int *offsets;
   int *buffer;
   int tmp[COUNT];
   int rank;
   int nprocs;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

   for (i = 0; i < COUNT; i++) {
  tmp[i] = i + rank * 100;
   }

   if (rank == 0) {
  buffer = malloc(sizeof(int) * (nprocs * COUNT + OFFSET));
  nitems = malloc(sizeof(int) * nprocs);
  offsets = malloc(sizeof(int) * nprocs);
  nitems[0] = COUNT;
  offsets[0] = OFFSET;
  for (i = 1; i < nprocs; i++) {
 nitems[i] = COUNT;
 offsets[i] = offsets[i - 1] + nitems[i - 1];
  }

  for (i = 0; i < OFFSET; i++) {
 buffer[i] = -1;
  }
   }
   MPI_Gatherv(tmp, COUNT, MPI_INT, buffer, nitems, offsets, MPI_INT, 0,
   MPI_COMM_WORLD);

   if (rank == 0) {
  for (i = 0; i < nprocs * COUNT + OFFSET; i++) {
 printf("buffer[%d]: %d\n", i, buffer[i]);
  }
  free(buffer);
  free(nitems);
  free(offsets);
   }

   MPI_Finalize();

   return 0;

}
--

For what it's worth, I've started using MPI_IN_PLACE instead of the
above method.  This works around the problem for now, but I'd
appreciate any insight on how to fix this or confirmation of bug.
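For reference, a sketch of that MPI_IN_PLACE workaround (assumed usage, reusing the variables from the sample above; it replaces the single MPI_Gatherv call there):

   /* At the root, the local contribution is written directly into 'buffer'
    * at its own offset and MPI_IN_PLACE is passed as the send buffer; the
    * send count/type arguments are then ignored at the root. */
   if (rank == 0) {
      for (i = 0; i < COUNT; i++) {
         buffer[offsets[0] + i] = tmp[i];
      }
      MPI_Gatherv(MPI_IN_PLACE, 0, MPI_INT, buffer, nitems, offsets, MPI_INT,
                  0, MPI_COMM_WORLD);
   } else {
      MPI_Gatherv(tmp, COUNT, MPI_INT, NULL, NULL, NULL, MPI_INT, 0,
                  MPI_COMM_WORLD);
   }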
Thanks for your help!

- Chris
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.

2007-10-03 Thread Tim Prins
Unfortunately, I am out of ideas on this one. It is very strange. Maybe 
someone else has an idea.


I would recommend trying to install Open MPI again. First be sure to get 
rid of all of the old installs of Open MPI from all your nodes, then 
reinstall and try again.


Tim

Dino Rossegger wrote:

Here is the syntax & output of the command:
root@sun:~# mpirun --hostfile hostfile saturn
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1164
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:28777] ERROR: A daemon on node saturn failed to start as expected.
[sun:28777] ERROR: There may be more information available from
[sun:28777] ERROR: the remote shell (see above).
[sun:28777] ERROR: The daemon exited unexpectedly with status 255.
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1196
--
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.

--

I'm using version 1.2.3, got it from openmpi.org. I'm using the same
version of openmpi on all nodes.

Thanks
dino

Tim Prins schrieb:
This is very odd. The daemon is being launched properly, but then things 
get strange. It looks like mpirun is sending a message to kill 
application processes on saturn.


What version of Open MPI are you using?

Are you sure that the same version of Open MPI is being used everywhere?

Can you try:
mpirun --hostfile hostfile hostname

Thanks,

Tim

Dino Rossegger wrote:

Hi again,

Tim Prins schrieb:

Hi,

On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:

Hi again,

Yes the error output is the same:
root@sun:~# mpirun --hostfile hostfile main
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1164
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:23748] ERROR: A daemon on node saturn failed to start as expected.
[sun:23748] ERROR: There may be more information available from
[sun:23748] ERROR: the remote shell (see above).
[sun:23748] ERROR: The daemon exited unexpectedly with status 255.
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1196
--
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.

--

Can you try:
mpirun --debug-daemons --hostfile hostfile main


Did it, but it doesn't give me any special output (as far as I can see).
Here's the output:
root@sun:~# mpirun --debug-daemons --hostfile hostfile ./main
Daemon [0,0,1] checking in as pid 27168 on host sun
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:27167] ERROR: A daemon on node saturn failed to start as expected.
[sun:27167] ERROR: There may be more information available from
[sun:27167] ERROR: the remote shell (see above).
[sun:27167] ERROR: The daemon exited unexpectedly with status 255.
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received exit

[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--



This may give more output about the error. Also, try
mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main

Here's the output, but I can't decipher it ^^
root@sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
[sun:27175] pls:rsh: local csh: 0, local sh: 1
[sun:27175] pls:rsh: assuming same remote shell as local shell
[sun:27175] pls:rsh: remote csh: 0, remote sh: 1
[sun:27175] pls:rsh: final template argv:
[sun:27175] pls:rsh: /usr/bin/ssh  orted --bootproxy 1 --name  --num_procs 3 --vpid_start 0 --nodename  --universe 

Re: [OMPI users] OpenMPI binding all tasks to cpu0, leaving cpu1 idle. (2-cpu system)

2007-10-03 Thread Tim Prins

Hi,

Miguel Figueiredo Mascarenhas Sousa Filipe wrote:

Hi there,

I have a 2-cpu system (linux/x86-64), running openmpi-1.1. I do not
specify a hostfile.
Lately I'm having performance problems when running my mpi-app this way:

mpiexec -n 2 ./mpi-app config.ini

Both mpi-app processes are running on cpu0, leaving cpu1 idle.

After reading the mpirun manpage, it seems that openmpi binds tasks to
cpus in a round-robin way, meaning that this should not happen.
But given my problem, I assume that it's not detecting that this is a 2-way
smp system (it assumes a UP system) and is binding both tasks to cpu0.

Is this correct?
By default I do not think Open MPI does any process affinity (although I 
could be wrong). See this FAQ for information on process affinity:

http://www.open-mpi.org/faq/?category=tuning#paffinity-defs
http://www.open-mpi.org/faq/?category=tuning#using-paffinity
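
If you do want each rank pinned to its own CPU, the knob those FAQ entries
describe is (if I remember correctly) the mpi_paffinity_alone MCA parameter,
e.g.:

mpiexec -mca mpi_paffinity_alone 1 -n 2 ./mpi-app config.ini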



The openmpi-default-hostfile says I should not specify localhost in
there.. and let the job dispatcher/rca "detect" the single-node setup.

Where should I define/configure system wide, that this is a
single-node, 2-slot system?
I would like to avoid making the system users be obliged to pass a
hostfile to mpirun/mpiexec. I simply want mpiexec -n N ./mpi-task to
do the proper job of _really_ spreading the processes evenly between
all the system's CPUs.

Best regards, waiting for your answer.


You could put localhost and specify the number of slots in the default 
hostfile, or just pass a hostfile containing local host to mpirun.
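
For example, a minimal default hostfile along these lines should cover the
2-CPU case (the slot count here is just an assumed value for that box):

localhost slots=2

With that in place, mpiexec -n 2 should not consider the node oversubscribed.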


By default Open MPI will run on the localhost assuming 1 slot if it does 
not detect a resource manager or isn't passed a hostfile.




ps.: should I upgrade to latest openMPI to have my problem
"automagically" solved?
I would definitely update to a newer version. The 1.1 series has many 
problems.


Hope this helps,

Tim


Re: [OMPI users] OpenMPI binding all tasks to cpu0, leaving cpu1 idle. (2-cpu system)

2007-10-03 Thread Tim Prins

-c, -np, --np, -n, --n all do exactly the same thing.

Tim

Miguel Figueiredo Mascarenhas Sousa Filipe wrote:

Hi,

On 10/3/07, jody  wrote:

Hi Miguel
I don't know if it's a typo - but actually it should be
 mpiexec -np 2 ./mpi-app config.ini
and not

mpiexec -n 2 ./mpi-app config.ini


thanks for the remark, you're right, but in the man page says -n is a
synonym for -np

Kind regards,





Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.

2007-10-02 Thread Tim Prins
This is very odd. The daemon is being launched properly, but then things 
get strange. It looks like mpirun is sending a message to kill 
application processes on saturn.


What version of Open MPI are you using?

Are you sure that the same version of Open MPI is being used everywhere?

Can you try:
mpirun --hostfile hostfile hostname

Thanks,

Tim

Dino Rossegger wrote:

Hi again,

Tim Prins schrieb:

Hi,

On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:

Hi again,

Yes the error output is the same:
root@sun:~# mpirun --hostfile hostfile main
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1164
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:23748] ERROR: A daemon on node saturn failed to start as expected.
[sun:23748] ERROR: There may be more information available from
[sun:23748] ERROR: the remote shell (see above).
[sun:23748] ERROR: The daemon exited unexpectedly with status 255.
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1196
--
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.

--

Can you try:
mpirun --debug-daemons --hostfile hostfile main


Did it, but it doesn't give me any special output (as far as I can see).
Here's the output:
root@sun:~# mpirun --debug-daemons --hostfile hostfile ./main
Daemon [0,0,1] checking in as pid 27168 on host sun
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:27167] ERROR: A daemon on node saturn failed to start as expected.
[sun:27167] ERROR: There may be more information available from
[sun:27167] ERROR: the remote shell (see above).
[sun:27167] ERROR: The daemon exited unexpectedly with status 255.
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received exit

[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--



This may give more output about the error. Also, try
mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main


Here's the output, but I can't decipher it ^^
root@sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
[sun:27175] pls:rsh: local csh: 0, local sh: 1
[sun:27175] pls:rsh: assuming same remote shell as local shell
[sun:27175] pls:rsh: remote csh: 0, remote sh: 1
[sun:27175] pls:rsh: final template argv:
[sun:27175] pls:rsh: /usr/bin/ssh  orted --bootproxy 1 --name  --num_procs 3 --vpid_start 0 --nodename  --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
[sun:27175] pls:rsh: launching on node sun
[sun:27175] pls:rsh: sun is a LOCAL node
[sun:27175] pls:rsh: changing to directory /root
[sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --set-sid [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
[sun:27175] pls:rsh: launching on node saturn
[sun:27175] pls:rsh: saturn is a REMOTE node
[sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh s

  

Re: [OMPI users] OpenMPI Giving problems when using -mca btl mx, sm, self

2007-10-02 Thread Tim Prins
Hi,

On Monday 01 October 2007 03:08:04 am Hammad Siddiqi wrote:
> One more thing to add -mca mtl mx uses ethernet and IP emulation of
> Myrinet to my knowledge. I want to use Myrinet(not its IP Emulation)
> and shared memory simultaneously.
This is not true (as far as I know...). Open MPI has 2 different network 
stacks, and we can use MX with either. See:
http://www.open-mpi.org/faq/?category=myrinet#myri-btl-mx

The mx mtl relies on the MX library for all communications, and the MX library 
itself does shared memory message passing. In my experience the mx mtl 
performs better than the mx,sm,self btl combination. However, I would 
encourage you to try both with your application and would be interested in 
hearing your opinion.


> > *1.  /opt/SUNWhpc/HPC7.0/bin/mpirun -np 2 -mca btl mx,sm,self  -host
> > "indus1,indus2" -mca btl_base_debug 1000 ./hello*
> >
> > /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca btl mx,sm,self  -host
> > "indus1,indus2,indus3,indus4" -mca btl_base_debug 1000 ./hello
> > [indus1:29331] select: initializing btl component mx
> > [indus1:29331] select: init returned failure
> > [indus1:29331] select: module mx unloaded


So it looks like we are trying to load the mx library, but fail for some 
reason. Are you sure MX is working correctly? Can you run mx_pingpong between 
indus1 and indus2 as described here:
http://www.myri.com/cgi-bin/fom.pl?file=455=file%253D91

> > *2.1  /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca mtl mx -host
> > "indus1,indus2,indus3,indus4" ./hello*
> >
> > This command works fine
Since you did not specify to use the cm pml (which MUST be done to use the mx 
mtl. see: http://www.open-mpi.org/faq/?category=myrinet#myri-btl-mx), you 
were probably actually using tcp for this run since we would automatically 
fall back after the mx btl fails to load.

> > *2.2 /opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca mtl mx -host
> > "indus1,indus2,indus3,indus4" -mca pml cm ./hello*
> >
> > This command works fine.
Good. So maybe there isn't anything wrong with your mx setup.

> > Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 4 -mca pml cm  -host
> > "indus1,indus2,indus3,indus4"  -mca mtl_base_debug 1000 ./hello"*,
> > this command works fine.
Since you selected the cm pml, we should be automatically using the mx mtl 
here.

> > but *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca pml cm  -host
> > "indus1,indus2,indus3,indus4"  -mca mtl_base_debug 1000 ./hello"*
> > hangs for indefinite time.
Strange. I do not know why this would hang.

> > Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx,sm,self -host
> > "indus1,indus2,indus3,indus4"  -mca mtl_base_debug 1000 ./hello"*
> > works fine
Again, you are falling back to using the tcp btl here. BTW, the mtl 
string 'mx,sm,self' is bogus. There are no sm or self mtls.

> >
> > *2.3 /opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx -host
> > "indus1,indus2,indus3,indus4" -mca pml cm ./hello*
> >
> > This command hangs the machines for indefinite time.
> > Also *"/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx -host
> > "indus1,indus2,indus3,indus4" -mca pml cm  -mca mtl_base_debug 1000
> > ./hello"* hangs the
> > systems for indefinite time.
These two commands should have the exact same effect as the hang above.

> >
> > *2.4  /opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -mca mtl mx,sm,self -host
> > "indus1,indus2,indus3,indus4" -mca pml cm  -mca mtl_base_debug 1000
> > ./hello*
> >
> > This command hangs the machines for indefinite time.
Again, the mtl line here is bogus.

> >
> > Please notice that running more than four mpi processes hangs the
> > machines. Any suggestion please.
The first thing I would try is to see if a non-mpi application works. So try:
/opt/SUNWhpc/HPC7.0/bin/mpirun -np 8 -host "indus1,indus2,indus3,indus4" 
hostname

If that works, then try a simple MPI hello application that does no 
communication.

Tim

> >

> >
> > The output of *mx_info* on each node is given below
> >
> > =*=
> > indus1
> > *==
> >
> > MX Version: 1.1.7rc3cvs1_1_fixes
> > MX Build: @indus4:/opt/mx2g-1.1.7rc3 Thu May 31 11:36:59 PKT 2007
> > 2 Myrinet boards installed.
> > The MX driver is configured to support up to 4 instances and 1024
> > nodes.
> > ===
> > Instance #0: 333.2 MHz LANai, 66.7 MHz PCI bus, 2 MB SRAM
> > Status: Running, P0: Link up
> > MAC Address: 00:60:dd:47:ad:7c
> > Product code: M3F-PCIXF-2
> > Part number: 09-03392
> > Serial number: 297218
> > Mapper: 00:60:dd:47:b3:e8, version = 0x7677b8ba, configured
> > Mapped hosts: 10
> >
> >
> > ROUTE COUNT
> > INDEX MAC ADDRESS HOST NAME P0
> > - ---
> > - ---
> >0) 00:60:dd:47:ad:7c indus1:0 1,1
> >2) 00:60:dd:47:ad:68 indus4:0 8,3
> >3) 00:60:dd:47:b3:e8 indus4:1 7,3
> >4) 

Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.

2007-10-01 Thread Tim Prins
Hi,

On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
> Hi again,
>
> Yes the error output is the same:
> root@sun:~# mpirun --hostfile hostfile main
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1164
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
> [sun:23748] ERROR: There may be more information available from
> [sun:23748] ERROR: the remote shell (see above).
> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1196
> --
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
>
> --
Can you try:
mpirun --debug-daemons --hostfile hostfile main

This may give more output about the error. Also, try
mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main

This will print out the exact command that is used to launch the orted.

Also, I would highly recommend not running Open MPI as root. It is just a bad 
idea.
>
> I wrote the following to my .ssh/environment (on all machines)
> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bi
>n:/opt/c3-4/:/usr/lib:/usr/local/lib;
>
> PATH=$PATH:/usr/local/lib;
>
> export LD_LIBRARY_PATH;
> export PATH;
>
> and added the statement you told me to the ssd_config (on all machines):
> PermitUserEnvironment yes
>
> And it seems to me that the paths are correct now.
>
> My shell is bash (/bin/bash)
>
> When running locate orted (to find out where exactly my openmpi
> installation is (compilation defaults) i saw that, on sun there was a
> /usr/bin/orted while there wasn't one on saturn.
> I deleted /usr/bin/orted on sun and tried again with the option --prefix
>  /usr/local/ (which seems to be  my installation directory) but it
> didn't work (same error).
Is it possible that you are mixing 2 different installations of Open MPI? You 
may consider installing OpenMPI to an NFS drive to make these things a bit 
easier.
>
> Is there a script or anything like that with which I can uninstall
> openmpi, because I might try a new compilation to /opt/openmpi since
> it doesn't look like I would be able to solve the problem.
If you still have the tree around that you used to 'make' Open MPI, you can 
just go into that tree and type 'make uninstall'.

Hope this helps,

Tim

>
> jody schrieb:
> > Now that the PATHs seem to be set correctly for
> > ssh i don't know what the problem could be.
> >
> > Is the error message still the same on as in the first mail?
> > Did you do the envorpnment/sshd_config on both machines?
> > What shell are you using?
> >
> > On other test you could make is to start your application
> > with the --prefix option:
> >
> > $mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
> >
> > (assuming your Open MPI installation lies in /opt/openmpi
> > on both machines)
> >
> >
> > Jody
> >
> > On 10/1/07, Dino Rossegger  wrote:
> >> Hi Jodi,
> >> did the steps as you said, but it didn't work for me.
> >> I set LD_LIBRARY_PATH in /etc/environment and ~/.shh/environment and
> >> made the changes to sshd_config.
> >>
> >> But this all didn't solve my problem, although the paths seemed to be
> >> set correctly (judging what ssh saturn `printenv >> test` says). I also
> >> restarted the ssh server, the error is the same.
> >>
> >> Hope you can help me out here and thanks for your help so far
> >> dino
> >>
> >> jody schrieb:
> >>> Dino -
> >>> I had a similar problem.
> >>> I was only able to solve it by setting PATH and LS_LIBRARY_PATH
> >>> in the file ~/ssh/environment on the client and setting
> >>>   PermitUserEnvironment yes
> >>> in /etc/ssh/sshd_config on the server (for this you need root
> >>> prioviledge though)
> >>>
> >>> To be on the safe side, i did both on all my nodes
> >>>
> >>> Jody
> >>>
> >>> On 9/27/07, Dino Rossegger  wrote:
>  Hi Jody,
> 
>  Thanks for your help, it really is the case that neither in PATH nor in
>  LD_LIBRARY_PATH is the path to the libs set correctly. I'll try it out,
>  hope it works.
> 
>  jody schrieb:
> > Hi Dino
> >
> > Try
> >  ssh saturn printenv | grep PATH
> >
> > >from your host sun to see what your environment variables are when
> >
> > ssh is run without a shell.
> >
> > On 9/27/07, Dino Rossegger  wrote:
> >> Hi,
> >>
> >> I have a problem running a simple programm mpihello.cpp.
> >>
> >> Here is a 

Re: [OMPI users] init_thread + spawn error

2007-10-01 Thread Tim Prins
Hi Joao,

Unfortunately Comm_spawn is a bit broken right now on the Open MPI trunk. We 
are currently working on some major changes to the runtime system, so I would 
rather not dig into this until these changes have made it onto the trunk.

I do not know of a timeline for when this these changes will be put in the 
trunk and Comm_spawn (especially with threads) will be expected to work 
correctly again.

Tim

On Monday 01 October 2007 03:40:46 pm Joao Vicente Lima wrote:
> Hi all!
> I'm getting a error on call MPI_Init_thread and MPI_Comm_spawn.
> am I mistaking something?
> the attachments contains my ompi_info and source ...
>
> thank!
> Joao
>
> 
>   char *arg[]= {"spawn1", (char *)0};
>
>   MPI_Init_thread (, , MPI_THREAD_MULTIPLE, );
>   MPI_Comm_spawn ("./spawn_slave", arg, 1,
>   MPI_INFO_NULL, 0, MPI_COMM_SELF, ,
>   MPI_ERRCODES_IGNORE);
> .
>
> and the error:
>
> opal_mutex_lock(): Resource deadlock avoided
> [c8:13335] *** Process received signal ***
> [c8:13335] Signal: Aborted (6)
> [c8:13335] Signal code:  (-6)
> [c8:13335] [ 0] [0xb7fbf440]
> [c8:13335] [ 1] /lib/libc.so.6(abort+0x101) [0xb7abd5b1]
> [c8:13335] [ 2] /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0 [0xb7e2933c]
> [c8:13335] [ 3] /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0 [0xb7e2923a]
> [c8:13335] [ 4] /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0 [0xb7e292e3]
> [c8:13335] [ 5] /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0 [0xb7e29fa7]
> [c8:13335] [ 6] /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0 [0xb7e29eda]
> [c8:13335] [ 7] /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0 [0xb7e2adec]
> [c8:13335] [ 8]
> /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0(ompi_proc_unpack+ 0x181)
> [0xb7e2b142]
> [c8:13335] [ 9]
> /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0(ompi_comm_connect
> _accept+0x57c) [0xb7e0fb70]
> [c8:13335] [10]
> /usr/local/openmpi/openmpi-svn/lib/libmpi.so.0(PMPI_Comm_spawn+0 x395)
> [0xb7e5e285]
> [c8:13335] [11] ./spawn(main+0x7f) [0x80486ef]
> [c8:13335] [12] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7aa7ebc]
> [c8:13335] [13] ./spawn [0x80485e1]
> [c8:13335] *** End of error message ***
> --
> mpirun has exited due to process rank 0 with PID 13335 on
> node c8 calling "abort". This will have caused other processes
> in the application to be terminated by signals sent by mpirun
> (as reported here).
> --




Re: [OMPI users] Rank to host mapping

2007-10-01 Thread Tim Prins
So you know this is something that we are working on for the next major 
release of Open MPI (v 1.3). More details on some of the discussion can 
be found here:

https://svn.open-mpi.org/trac/ompi/ticket/1023

Tim

Torje Henriksen wrote:
Specifying nodes several times in the hostfile or with the --host 
parameter seems to just add up the number of slots availible for the 
given node. It doesn't seem to affect the mapping of the ranks. I think 
this is due to how the hostfile is read into the structure that holds this 
information in the source code.


Adding the host several times to the hostfile was the first thing I tried, 
and I've also gotten that suggestion from others, so it might seem that it 
would make sense to make it work that way.



I've hacked the source to be able to take a ranks-parameter in the 
hostfile like this:


node0 ranks=0,1,3
node1 ranks=2,4,5

so I guess it's not a problem any more, but I would love to know if there 
is a way of doing it without changing the source code.



You're very right about the unix scripting part. It makes sense to create 
the hostfile this way.



-Torje

On Mon, 1 Oct 2007, Christian Bell wrote:


How about a hostfile such as

% cat -n ~/tmp/hostfile
1  node0
2  node0
3  node1
4  node0
5  node1
6  node1

Looks like the function to express the mapping is not anything simple.  If it's
an expressible function but too complicated for open mpi, you'll have to make
your own script to generate the function.  This shouldn't be hard to do with
any standard unix scripting.

. . christian

On Mon, 01 Oct 2007, Torje Henriksen wrote:


Oh man, sorry about that, and thanks for the fast response.
Let me try again, please :)

I want to manually specify what ranks should run on what node.

Here is an example of a mapping that I can't seem to be able to do, since
it isn't a round-robin type of mapping.

hosts ranks
===
node0 0,1,3
node1 2,4,5

No matter what I do, I either get

node0: 0,1,2
node1: 3,4,5

or

node0: 0,2,4
node1: 1,3,5


Hope I got it right this time, and thank you again.

-Torje

On Mon, 1 Oct 2007, jody wrote:


hosts ranks
===
node0 1,2,4
node1 3,4,6

I guess there must be a typo:
You can't assign one rank (4) to two nodes
And ranks start from 0 not from 1.

Check this site,
http://www.open-mpi.org/faq/?category=running#mpirun-host
there might be some info regarding your problem.

Jody
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
christian.b...@qlogic.com
(QLogic Host Solutions Group, formerly Pathscale)
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] --enable-mca-no-build broken or bad docs?

2007-09-27 Thread Tim Prins

Mostyn,

It looks like the documentation is wrong (and has been wrong for years). 
I assume you were looking at the FAQ? I will update it tonight or tomorrow.


Thanks for the report!

Tim

Mostyn Lewis wrote:

I see docs for this like:

--enable-mca-no-build=btl:mvapi,btl:openib,btl:gm,btl:mx,mtl:psm

however, the code in a generated configure that parse this looks like:

...
 ifs_save="$IFS"
 IFS="${IFS}$PATH_SEPARATOR,"
 msg=
 for item in $enable_mca_no_build; do
 type="`echo $item | cut -s -f1 -d-`"
 comp="`echo $item | cut -s -f2- -d-`"
 if test -z $type -o -z $comp ; then
...

So this actually expects "-" and not ":" as a delimiter and

--enable-mca-no-build=btl-mvapi,btl-openib,btl-gm,btl-mx,mtl-psm

would parse.

So, which is it? The docs or the last above?

From a SVN of today.


Regards,
Mostyn Lewis
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] SIGSEG in ompi_comm_start_processes

2007-09-27 Thread Tim Prins

Murat,

Thanks for the bug report. I have fixed this (slightly differently than you 
suggested) in the Open MPI trunk in r16265 and it should be 
available in the nightly trunk tarball tonight.


I will ask to have this moved into the next release of Open MPI.

Thanks,

Tim

Murat Knecht wrote:

Copy-and-paste-error: The second part of the fix ought to be:

if ( !have_wdir ) {
  free(cwd);
}

Murat




Murat Knecht schrieb:

Hi all,

I think, I found a bug and a fix for it.
Could someone verify the rationale behind this bug, as I have this
SIGSEG on only one of two machines, and I don't quite see why it doesn't
occur always. (Same testprogram, equally compiled 1.2.4 OpenMPI).
Though the fix does prevent the segmentation fault. :)

Thanks,
Murat





Where:
Bug:
free() crashes when trying to free stack memory
ompi/communicator/comm_dyn.c:630

OBJ_RELEASE(apps[i]);



SIGSEG:
orte/mca/rmgr/rmgr_types.h:113

free (app_context->cwd);



There are two ways that apps[i]->cwd is filled:

1. dynamically allocated memory
548 if ( !have_wdir ) {
getcwd(cwd, OMPI_PATH_MAX);
apps[i]->cwd = strdup(cwd);// <--
}

2. stack
354char cwd[OMPI_PATH_MAX];
// ...
516 /* check for 'wdir' */
ompi_info_get (array_of_info[i], "wdir", valuelen, cwd, &flag);
if ( flag ) {
apps[i]->cwd = cwd;  // <--
have_wdir = 1;
}



Fix: Allocate cwd always manually and make sure it is freed afterwards.

1.
cwd = strdup(cwd);
   }
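
As a tiny standalone illustration of the crash pattern (hypothetical names,
not the actual Open MPI code): calling free() on memory that was never
heap-allocated is undefined behaviour, which is why the strdup() above matters.

#include <stdlib.h>
#include <string.h>

struct app_context { char *cwd; };

int main(void)
{
    char stack_cwd[256] = "/home/user";
    struct app_context app;

    /* the fix: give the context its own heap copy, so a later free() is safe */
    app.cwd = strdup(stack_cwd);
    /* the bug: app.cwd = stack_cwd; would make free(app.cwd) crash */
    free(app.cwd);
    return 0;
}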



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Bundling OpenMPI

2007-09-27 Thread Tim Prins

Hi Teng,

Teng Lin wrote:

Hi,


We would like to distribute OpenMPI along with our software  to  
customers, is there any legal issue we need to know about?
Not that I know of (disclaimer: IANAL). Open MPI is licensed under the 
new BSD license. Open MPI's license is here:

http://www.open-mpi.org/community/license.php



We can successfully build OpenMPI using
./configure --prefix=/some_path;make;make install

However, if we do

cp -r /some_path /other_path

and try to run
/other_path/bin/orterun,
below error message is thrown:
 
--

Sorry!  You were supposed to get help about:
 orterun:usage
from the file:
 help-orterun.txt
But I couldn't find any file matching that name.  Sorry!
 
--


Apparently, the path is hard-coded in the executable. Is there any  
way to fix it (such as using an environment variable etc)?

There is. See:
http://www.open-mpi.org/faq/?category=building#installdirs
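
In short (a sketch based on that FAQ entry, not something I have tested with
your exact layout): the 1.2 series lets you point a relocated install at its
new home with an environment variable before launching, along the lines of:

  export OPAL_PREFIX=/other_path
  /other_path/bin/orterun ...

where "..." stands for whatever arguments you normally pass.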

Hope this helps,

Tim


Re: [OMPI users] mpirun ERROR: The daemon exited unexpectedly with status 255.

2007-09-27 Thread Tim Prins
Note that you may be able to get some more error output by 
adding --debug-daemons to the mpirun command line.

Tim

On Thursday 27 September 2007 05:12:53 pm Dino Rossegger wrote:
> Hi Jody,
>
> Thanks for your help, it really is the case that neither in PATH nor in
> LD_LIBRARY_PATH is the path to the libs set correctly. I'll try it out,
> hope it works.
>
> jody schrieb:
> > Hi Dino
> >
> > Try
> >  ssh saturn printenv | grep PATH
> >
> >>from your host sun to see what your environment variables are when
> >
> > ssh is run without a shell.
> >
> > On 9/27/07, Dino Rossegger  wrote:
> >> Hi,
> >>
> >> I have a problem running a simple programm mpihello.cpp.
> >>
> >> Here is an excerpt of the error and the command
> >> root@sun:~# mpirun -H sun,saturn main
> >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >> base/pls_base_orted_cmds.c at line 275
> >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> >> line 1164
> >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line
> >> 90 [sun:25213] ERROR: A daemon on node saturn failed to start as
> >> expected. [sun:25213] ERROR: There may be more information available
> >> from [sun:25213] ERROR: the remote shell (see above).
> >> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
> >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >> base/pls_base_orted_cmds.c at line 188
> >> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> >> line 1196
> >> 
> >>-- mpirun was unable to cleanly terminate the daemons for this job.
> >> Returned value Timeout instead of ORTE_SUCCESS.
> >>
> >> 
> >>--
> >>
> >> The program is runable from each node alone (mpirun -np2 main)
> >>
> >> My PathVariables:
> >> $PATH
> >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
> >> echo $LD_LIBRARY_PATH
> >> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
> >>
> >> Passwordless ssh is up 'n running
> >>
> >> I walked through the FAQ and Mailing Lists but couldn't find any
> >> solution for my problem.
> >>
> >> Thanks
> >> Dino R.
> >>
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] incorrect configure code (1.2.4 and earlier)

2007-09-27 Thread Tim Prins

Åke Sandgren wrote:

On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote:

Hi Ake,

Looking at the svn logs it looks like you reported the problems with 
these checks quite a while ago and we fixed them (in r13773 
https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved 
them to the 1.2 branch.


Yes, it's the same. Since I never saw it in the source I tried once more
with some more explanations just in case :-)


I will ask for this to be moved to the 1.2 branch.


Good.

However, the changes made for ompi_config_pthreads.m4 are different than 
you are suggesting now. Is this changeset good enough, or are there 
other changes you think should be made?


The ones I sent today are slightly more correct. There were 2 missing
LIBS="$orig_LIBS" in the failure cases.
But do we really need these? It looks like configure aborts in these 
cases (I am not an autoconf wizard, so I could be completely wrong here).


Tim



If you compare the resulting file after patching you will see the
difference. They are in the "Can not find working threads configuration"
portions.





Re: [OMPI users] incorrect configure code (1.2.4 and earlier)

2007-09-27 Thread Tim Prins

Hi Ake,

Looking at the svn logs it looks like you reported the problems with 
these checks quite a while ago and we fixed them (in r13773 
https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved 
them to the 1.2 branch.


I will ask for this to be moved to the 1.2 branch.

However, the changes made for ompi_config_pthreads.m4 are different than 
you are suggesting now. Is this changeset good enough, or are there 
other changes you think should be made?


Thanks,

Tim



Åke Sandgren wrote:

Hi!

There are a couple of bugs in the configure scripts regarding threads
checking.

In ompi_check_pthread_pids.m4 the actual code for testing is wrong and
is also missing a CFLAG save/add-THREAD_CFLAGS/restore resulting in the
linking always failing for the -pthread test with gcc.
config.log looks like this.
=
configure:50353: checking if threads have different pids (pthreads on
linux)
configure:50409: gcc -o conftest -DNDEBUG -march=k8 -O3 -msse -msse2
-maccumulate-outgoing-args -finline-functions -fno-strict-aliasing
-fexceptions  conftest.c -lnsl -lutil  -lm  >&5
conftest.c: In function 'checkpid':
conftest.c:327: warning: cast to pointer from integer of different size
/tmp/ccqUaAns.o: In function `main':conftest.c:(.text+0x1f): undefined
reference to `pthread_create'
:conftest.c:(.text+0x2e): undefined reference to `pthread_join'
collect2: ld returned 1 exit status
configure:50412: $? = 1
configure: program exited with status 1
=

Adding the CFLAGS save/add/restore make the code return the right answer
both on systems with the old pthreads implementation and NPTL based
systems. BUT, the code as it stands is technically incorrect.
The patch have a corrected version.

There is also two bugs in ompi_config_pthreads.m4.
In OMPI_INTL_POSIX_THREADS_LIBS_CXX it is incorrectly setting
PTHREAD_LIBS to $pl, in the then-part of the second if-statement, which
at the time isn't set yet and forgetting to reset LIBS on failure in the
bottom most if-else case in the for pl loop.

In OMPI_INTL_POSIX_THREADS_LIBS_FC it is resetting LIBS whether
succesfull or not resulting in -lpthread missing when checking for
PTHREAD_MUTEX_ERRORCHECK_NP at least for some versions of pgi, (6.1 and
older fails, 7.0 seems to always add -lpthread with pgf77 as linker)

The output from configure in such a case looks like this:
checking if C compiler and POSIX threads work with -lpthread... yes
checking if C++ compiler and POSIX threads work with -lpthread... yes
checking if F77 compiler and POSIX threads work with -lpthread... yes
checking for PTHREAD_MUTEX_ERRORCHECK_NP... no
checking for PTHREAD_MUTEX_ERRORCHECK... no
(OS: Ubuntu Dapper, Compiler: pgi 6.1)

There is also a problem in the F90 modules include flag search.
The test currently does:
$FC -c conftest-module.f90
$FC conftest.f90

This doesn't work if one has set FCFLAGS=-g in the environment.
At least not with pgf90 since it needs the debug symbols from
conftest-module.o to be able to link.
You have to either add conftest-module.o to the compile line of conftest
or make it a three-stager, $FC -c conftest-module.f90; $FC -c
conftest.f90; $FC conftest.o conftest-module.o





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Open MPI v1.2.4 released

2007-09-26 Thread Tim Prins
Francesco,

I guess the first step would be to decide whether or not you want to upgrade. 
All of the changes are listed below; if none of them affect you and your 
current setup is working fine, I would not bother upgrading.

Also, assuming you installed from a tarball, there is no way that I know of 
to 'upgrade' Open MPI in the strict sense of the word. Rather you have to 
reinstall it.

That being said, if you do want to use the new version you have to decide 
whether or not you want to replace your current installation of Open MPI or 
to install the new version alongside the old version. 

Replacing the old version with the new one is nice because it is simpler and 
there is less to keep track of. However, we make no guarantees about binary 
compatibility between releases (although we try to keep binary compatibility 
between minor releases). So if you replace your installation of Open MPI, the 
only completely safe thing to do would be to recompile all your applications 
with the the new version.

So, if you have decided to keep your old version and add the new version, just 
install Open MPI normally, but install it to a different prefix than the 
other version. See http://www.open-mpi.org/faq/?category=building for 
building instructions. You would then need to modify your PATH and 
LD_LIBRARY_PATH to point to the installation you want to use (as shown in 
http://www.open-mpi.org/faq/?category=running#run-prereqs). Alternatively you 
could use something like Modules (http://modules.sourceforge.net/) or SoftEnv 
(http://www-unix.mcs.anl.gov/systems/software/msys/) to manage multiple 
installations.

If you want to replace your current installation of Open MPI, you have 3 
options:
1. Install the new version exactly as you did the old version, overwriting the 
old version. This should work, but can lead to problems if there are any 
stale files left over. Thus I would recommend not doing this and doing one of 
the following.

2. If you still have the build tree you used to originally install Open MPI, go 
into the build tree and type 'make uninstall'. This should remove all the old 
Open MPI files and allow you to install the new version normally.

3. If you installed Open MPI into a unique prefix, such as /opt/openmpi, just 
delete the directory and then install the new version of Open MPI. 
Personally, I think that one should always install Open MPI into a directory 
where nothing else is installed, as it makes management and upgrading 
significantly easier.

Whatever path you take, remember the new installation must be available on all 
the nodes in your cluster, and that different versions of Open MPI will 
probably not work together. That is, you can't use 1.2.4 on the head node and 
1.2.3 on the compute nodes.

I hope this helps. Let me know if you have any problems,

Tim

On Wednesday 26 September 2007 04:37:16 pm Francesco Pietra wrote:
> Are there any detailed directions for upgrading (for common guys, not experts, I
> mean)? My 1.2.3 version on Debian Linux amd64 runs perfectly.
> Thanks
> francesco pietra
>
> --- Tim Mattox  wrote:
> > The Open MPI Team, representing a consortium of research, academic,
> > and industry partners, is pleased to announce the release of Open MPI
> > version 1.2.4. This release is mainly a bug fix release over the v1.2.3
> > release, but there are few new features.  We strongly recommend
> > that all users upgrade to version 1.2.4 if possible.
> >
> > Version 1.2.4 can be downloaded from the main Open MPI web site or
> > any of its mirrors (mirrors will be updating shortly).
> >
> > Here are a list of changes in v1.2.4 as compared to v1.2.3:
> >
> > - Really added support for TotalView/DDT parallel debugger message queue
> >   debugging (it was mistakenly listed as "added" in the 1.2 release).
> > - Fixed a build issue with GNU/kFreeBSD. Thanks to Petr Salinger for
> >   the patch.
> > - Added missing MPI_FILE_NULL constant in Fortran.  Thanks to
> >   Bernd Schubert for bringing this to our attention.
> > - Change such that the UDAPL BTL is now only built in Linux when
> >   explicitly specified via the --with-udapl configure command line
> >   switch.
> > - Fixed an issue with umask not being propagated when using the TM
> >   launcher.
> > - Fixed behavior if number of slots is not the same on all bproc nodes.
> > - Fixed a hang on systems without GPR support (ex. Cray XT3/4).
> > - Prevent users of 32-bit MPI apps from requesting >= 2GB of shared
> >   memory.
> > - Added a Portals MTL.
> > - Fix 0 sized MPI_ALLOC_MEM requests.  Thanks to Lisandro Dalcin for
> >   pointing out the problem.
> > - Fixed a segfault crash on large SMPs when doing collectives.
> > - A variety of fixes for Cray XT3/4 class of machines.
> > - Fixed which error handler is used when MPI_COMM_SELF is passed
> >   to MPI_COMM_FREE.  Thanks to Lisandro Dalcini for the bug report.
> > - Fixed compilation on platforms that don't have hton/ntoh.
> > - Fixed a logic 

[MTT users] Problem with reporter: selecting bitness

2007-09-20 Thread Tim Prins

Hi,

I was doing a search and hit advanced, entered '32' into the bitness 
field, and pressed submit. I got back the following error:


postgres: ERROR: operator does not exist: bit ~* "unknown" LINE 70: 
(bitness ~* '32') ^ HINT: No operator matches the given name and 
argument type(s). You may need to add explicit type casts. postgres: 
ERROR: operator does not exist: bit ~* "unknown" LINE 70: (bitness ~* 
'32') ^ HINT: No operator matches the given name and argument type(s). 
You may need to add explicit type casts.



Thanks,

Tim


Re: [OMPI users] C and Fortran 77 compilers are not link compatible. Can not continue.

2007-09-20 Thread Tim Prins

Hi,

This is because Open MPI is finding gcc for the C compiler and ifort for 
the Fortran compiler. Please see:


http://www.open-mpi.org/faq/?category=building#build-compilers

For how to specify to use the Intel compilers.

Hope this helps,

Tim

Bertrand P. S. Russell wrote:

Dear OpenMPI users,

I am trying to install OpenMPI-1.2.3 on MacOS X 10. I installed trial 
versions of the Intel C compiler and the ifort Fortran compiler, both version 10.0.16. When 
I issue the ./configure command, the configuration stops with the following 
error message. Could anyone tell me how to solve this problem? Many 
thanks in advance. I am attaching the config.log file and the error 
message shown on screen.



--
Miles to go. Millions to meet...
Bertrand. P. S. Russell
+91 - 98943 98441




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault when spawning

2007-09-18 Thread Tim Prins

Hi Murat,

If the process is being spawned onto a node that you are already running 
on there should not be a problem with ssh-sessions, since if there is 
already a daemon running on the node we do not ssh into it again.


Can you try running again with --debug-daemons added to the mpirun 
command line? This may (or may not) print out some information that 
would be useful.


Also, what does the spawn call look like? What is the mpirun command you 
are using (and the contents of any hostfiles)?


Unfortunately, besides these stabs there is really no way to debug it 
without gdb. Once I get the spawn call and command line you are using I 
will see if I can replicate it.


Thanks,

Tim

Murat Knecht wrote:

Hi all,

I get a segmentation fault when trying to spawn a single process on the
localhost (127.0.0.1).
I tried both the current stable 1.2.3 and the beta 1.2.4, both ended up the
same way.

From the stack trace, i know it's the spawn call.

Is it possible that there is an error with authentication? (I accepted
the localhost certificates manually by opening up ssh-sessions.)


[loud2:15472] *** Process received signal ***
[loud2:15472] Signal: Segmentation fault (11)
[loud2:15472] Signal code: Address not mapped (1)
[loud2:15472] Failing at address: 0x2b7182ea7fe0
[loud2:15472] [ 0] /lib64/libpthread.so.0 [0x2b6983637c10]
[loud2:15472] [ 1] /usr/local/lib/libopen-pal.so.0(_int_free+0x26d)
[0x2b6982d75fdd]
[loud2:15472] [ 2] /usr/local/lib/libopen-pal.so.0(free+0xbd)
[0x2b6982d762fd]
[loud2:15472] [ 3] /usr/local/lib/libopen-rte.so.0 [0x2b6982c33146]
[loud2:15472] [ 4]
/usr/local/lib/libmpi.so.0(ompi_comm_start_processes+0xe61)
[0x2b6982a8a3a1]
[loud2:15472] [ 5] /usr/local/lib/libmpi.so.0(PMPI_Comm_spawn+0x13a)
[0x2b6982aaedfa]
[loud2:15472] [ 6]
queen(_ZNK3MPI9Intracomm5SpawnEPKcPS2_iRKNS_4InfoEi+0x5e) [0x41f64a]
[loud2:15472] [ 7]
queen(_ZN5blink5queen5Queen16startupLandscapeERKSsRSt4listINS0_4HostESaIS5_EE+0x9e2)
 [0x4222ae]
[loud2:15472] [ 8] queen(main+0x936) [0x428c4c]
[loud2:15472] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4)
[0x2b698375e154]
[loud2:15472] [10] queen(__gxx_personality_v0+0xa9) [0x4183f9]
[loud2:15472] *** End of error message ***


All parameters are being checked for correctness, MPI::ARGV_NULL is used
for argv.
Is there a way to enable detailed logging, or are the mpirun arguments
all there is? (In the FAQ and /var/log/ I did not find logs.)
Is there maybe a suggested solution to this problem, or do I have to debug
OpenMPI with gdb now?
Are there secret assumptions regarding the system this is running on? I had
a version of the program running on another machine already (no changes to
MPI related parts) ...
Btw, I very much welcome the recent thoughts about establishing a
documentation project. :)

Thanks for any hint!
Best regards
Murat







mkne@loud2:~/rep/DWA/queen> ompi_info -a
Open MPI: 1.2.4b0
   Open MPI SVN revision: r15441
Open RTE: 1.2.4b0
   Open RTE SVN revision: r15441
OPAL: 1.2.4b0
   OPAL SVN revision: r15441
   MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.4)
  MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.4)
   MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.4)
   MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.4)
   MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.4)
   MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.4)
 MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.4)
 MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.4)
   MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
   MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.4)
MCA coll: self (MCA v1.0, API v1.0, Component v1.2.4)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.4)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.4)
  MCA io: romio (MCA v1.0, API v1.0, Component v1.2.4)
   MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.4)
   MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.4)
 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.4)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.4)
 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.4)
  MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.4)
 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.4)
 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.4)
 MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.4)
 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.4)
  MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.4)
  MCA 

Re: [OMPI users] mpiio romio etc

2007-09-14 Thread Tim Prins

Hi,

To give FLAGS to the ROMIO configuration script, the configure option 
for Open MPI is:


  --with-io-romio-flags=FLAGS

So something like:
  --with-io-romio-flags="--with-filesystems=ufs+nfs+pvfs2"
should work, though I have not tested it.

You can see all the ROMIO configure flags by running:
   ./ompi/mca/io/romio/romio/configure --help
from the top directory of the Open MPI source.

If you want to see what file systems support has been built for, you 
should just be able to look in the config.log for ROMIO:

  grep FILE_SYSTEM ./ompi/mca/io/romio/romio/config.log

I am not an expert in this area, but I hope this helps.

Tim

Robert Latham wrote:

On Fri, Sep 07, 2007 at 10:18:55AM -0400, Brock Palen wrote:

Is there a way to find out which ADIO options romio was built with?


not easily. You can use 'nm' and look at the symbols :>

Also does OpenMPI's romio come with pvfs2 support included? What  
about Luster or GPFS.


OpenMPI has shipped with PVFS v2 support for a long time.  Not sure
how you enable it, though.  --with-filesystems=ufs+nfs+pvfs2 might
work for OpenMPI as it does for MPICH2.

All versions of ROMIO support Lustre and GPFS the same way: with the
"generic unix filesystem" (UFS) driver.  Weikuan Yu at ORNL has been
working on a native "AD_LUSTRE" driver and some improvements to ROMIO
collective I/O.   Likely to be in the next ROMIO release.

For GPFS, the only optimized MPI-IO implementation is IBM's MPI for
AIX.  You're likely to see decent performance with the UFS driver,
though.

==rob



Re: [MTT users] Test runs not getting into database

2007-09-05 Thread Tim Prins

Here is the smallest one. Let me know if you need anything else.

Tim

Jeff Squyres wrote:
Can you send any one of those mtt database files?  We'll need to  
figure out if this is a client or a server problem.  :-(


On Sep 5, 2007, at 7:40 AM, Tim Prins wrote:


Hi,

BigRed has not gotten its test results into the database for a while.
This is running the ompi-core-testers branch. We run by passing the
results through the mtt-relay.

The mtt-output file has lines like:
*** WARNING: MTTDatabase did not get a serial; phases will be isolated
from each other in the reports

Reported to MTTDatabase: 1 successful submit, 0 failed submits

(total of 1 result)

I have the database submit files if they would help.

Thanks,

Tim

___
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users





$VAR1 = {
  'exit_signal_1' => -1,
  'duration_1' => '5 seconds',
  'mpi_version' => '1.3a1r16038',
  'trial' => 0,
  'mpi_install_section_name_1' => 'bigred 32 bit gcc',
  'client_serial' => undef,
  'hostname' => 's1c2b12',
  'result_stdout_1' => '/bin/rm -f *.o *~ PI* core IMB-IO IMB-EXT 
IMB-MPI1 exe_io exe_ext exe_mpi1
touch IMB_declare.h
touch exe_mpi1 *.c; rm -rf exe_io exe_ext
make MPI1 CPP=MPI1
make[1]: Entering directory 
`/N/ptl01/mpiteam/bigred/20070905-Wednesday/pb_0/installs/d7Ri/tests/imb/IMB_2.3/src\'
mpicc  -I.  -DMPI1 -O -c IMB.c
mpicc  -I.  -DMPI1 -O -c IMB_declare.c
mpicc  -I.  -DMPI1 -O -c IMB_init.c
mpicc  -I.  -DMPI1 -O -c IMB_mem_manager.c
mpicc  -I.  -DMPI1 -O -c IMB_parse_name_mpi1.c
mpicc  -I.  -DMPI1 -O -c IMB_benchlist.c
mpicc  -I.  -DMPI1 -O -c IMB_strgs.c
mpicc  -I.  -DMPI1 -O -c IMB_err_handler.c
mpicc  -I.  -DMPI1 -O -c IMB_g_info.c
mpicc  -I.  -DMPI1 -O -c IMB_warm_up.c
mpicc  -I.  -DMPI1 -O -c IMB_output.c
mpicc  -I.  -DMPI1 -O -c IMB_pingpong.c
mpicc  -I.  -DMPI1 -O -c IMB_pingping.c
mpicc  -I.  -DMPI1 -O -c IMB_allreduce.c
mpicc  -I.  -DMPI1 -O -c IMB_reduce_scatter.c
mpicc  -I.  -DMPI1 -O -c IMB_reduce.c
mpicc  -I.  -DMPI1 -O -c IMB_exchange.c
mpicc  -I.  -DMPI1 -O -c IMB_bcast.c
mpicc  -I.  -DMPI1 -O -c IMB_barrier.c
mpicc  -I.  -DMPI1 -O -c IMB_allgather.c
mpicc  -I.  -DMPI1 -O -c IMB_allgatherv.c
mpicc  -I.  -DMPI1 -O -c IMB_alltoall.c
mpicc  -I.  -DMPI1 -O -c IMB_sendrecv.c
mpicc  -I.  -DMPI1 -O -c IMB_init_transfer.c
mpicc  -I.  -DMPI1 -O -c IMB_chk_diff.c
mpicc  -I.  -DMPI1 -O -c IMB_cpu_exploit.c
mpicc   -o IMB-MPI1 IMB.o IMB_declare.o  IMB_init.o IMB_mem_manager.o 
IMB_parse_name_mpi1.o  IMB_benchlist.o IMB_strgs.o IMB_err_handler.o 
IMB_g_info.o  IMB_warm_up.o IMB_output.o IMB_pingpong.o IMB_pingping.o 
IMB_allreduce.o IMB_reduce_scatter.o IMB_reduce.o IMB_exchange.o IMB_bcast.o 
IMB_barrier.o IMB_allgather.o IMB_allgatherv.o IMB_alltoall.o IMB_sendrecv.o 
IMB_init_transfer.o  IMB_chk_diff.o IMB_cpu_exploit.o   
make[1]: Leaving directory 
`/N/ptl01/mpiteam/bigred/20070905-Wednesday/pb_0/installs/d7Ri/tests/imb/IMB_2.3/src\'
',
  'mpi_name' => 'ompi-nightly-trunk',
  'number_of_results' => '1',
  'phase' => 'Test Build',
  'compiler_version_1' => '3.3.3',
  'exit_value_1' => 0,
  'result_message_1' => 'Success',
  'start_timestamp_1' => 'Wed Sep  5 04:16:52 2007',
  'compiler_name_1' => 'gnu',
  'suite_name_1' => 'imb',
  'test_result_1' => 1,
  'mtt_client_version' => '2.1devel',
  'fields' => 
'compiler_name,compiler_version,duration,exit_signal,exit_value,mpi_get_section_name,mpi_install_id,mpi_install_section_name,mpi_name,mpi_version,phase,result_message,result_stdout,start_timestamp,suite_name,test_result',
  'mpi_install_id' => undef,
  'platform_name' => 'IU_BigRed',
  'local_username' => 'mpiteam',
  'mpi_get_section_name_1' => 'ompi-nightly-trunk'
};


[MTT users] Test runs not getting into database

2007-09-05 Thread Tim Prins

Hi,

BigRed has not gotten its test results into the database for a while. 
This is running the ompi-core-testers branch. We run by passing the 
results through the mtt-relay.


The mtt-output file has lines like:
*** WARNING: MTTDatabase did not get a serial; phases will be isolated 
from each other in the reports
>> Reported to MTTDatabase: 1 successful submit, 0 failed submits 
(total of 1 result)


I have the database submit files if they would help.

Thanks,

Tim



Re: [MTT users] Database submit error

2007-08-28 Thread Tim Prins

It installed and the tests built and made it into the database:
http://www.open-mpi.org/mtt/reporter.php?do_redir=293

Tim

Jeff Squyres wrote:

Did you get a correct MPI install section for mpich2?

On Aug 28, 2007, at 9:05 AM, Tim Prins wrote:


Hi all,

I am working with the jms branch, and when trying to use mpich2, I get
the following submit error:

*** WARNING: MTTDatabase server notice: mpi_install_section_name is  
not in

 mtt database.
 MTTDatabase server notice: number_of_results is not in mtt  
database.

 MTTDatabase server notice: phase is not in mtt database.
 MTTDatabase server notice: test_type is not in mtt database.
 MTTDatabase server notice: test_build_section_name is not in mtt
 database.
 MTTDatabase server notice: variant is not in mtt database.
 MTTDatabase server notice: command is not in mtt database.
 MTTDatabase server notice: fields is not in mtt database.
 MTTDatabase server notice: resource_manager is not in mtt  
database.


 MTT submission for test run
 MTTDatabase server notice: Invalid test_build_id (47368) given.
 Guessing that it should be -1
 MTTDatabase server error: ERROR: Unable to find a test_build to
 associate with this test_run.

 MTTDatabase abort: (Tried to send HTTP error) 400
 MTTDatabase abort:
 No test_build associated with this test_run
*** WARNING: MTTDatabase did not get a serial; phases will be  
isolated from

 each other in the reports
Reported to MTTDatabase: 1 successful submit, 0 failed submits  
(total of

   12 results)

This happens for each test run section.

Thanks,

Tim
___
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users








[MTT users] results being submitted as wrong suite

2007-08-27 Thread Tim Prins
Hi folks,

Another question. I am working on trying to get the performance stuff to work, 
using the jms branch. The tests all ran ok, but when they were submitted, all 
of the tests (imb, netpipe, osu, and skampi) are in the database labeled as 
being from the skampi suite.

You can see this here: http://www.open-mpi.org/mtt/reporter.php?do_redir=291
(you may have to enable trial runs).

I looked at my config file but could not find anything immediately wrong. It 
is attached.

Any ideas?

Thanks,

Tim
#==
# Overall configuration
#==

[MTT]

#identifier to say this is part of the collective performance 
#measurements
description = [2007 collective performance bakeoff]

hostfile =
hostlist =
max_np = 
textwrap = 76
drain_timeout = 5
trial = 1

#--

#==
# MPI get phase
#==

[MPI get: ompi-nightly-trunk]
mpi_details = Open MPI

module = OMPI_Snapshot
ompi_snapshot_url = http://www.open-mpi.org/nightly/trunk

#--

#==
# Install MPI phase
#==

[MPI install: odin 32 bit gcc]
mpi_get = ompi-nightly-trunk
save_stdout_on_success = 1
merge_stdout_stderr = 1
ompi_vpath_mode = none

ompi_make_all_arguments = -j 8
ompi_make_check = 1

ompi_compiler_name = gnu
ompi_compiler_version = ("gcc --version | head -n 1 | awk '{ print \$3 
}'")
ompi_configure_arguments = 

[MTT users] trouble with new reporter

2007-08-27 Thread Tim Prins

All,

First, I have to say the new faster reporter is very nice.

However, I am running into some difficulty with trial runs. Here is what 
I did:


1. went to www.open-mpi.org/mtt/reporter.php
2. Clicked preferences, toggled show trial runs
3. typed 'IU' into org
4. Press summary

So far so good, I see the performance results I expect. But then if I 
click on the performance results, I get 'no data available for the 
specified query'


Thanks,

Tim


Re: [OMPI users] openmpi realloc() holding onto memory when glibc doesn't

2007-08-23 Thread Tim Prins
Hi Josh,

I am not an expert in this area of the code, but I'll give it a shot. 

(I assume you are using linux due to your email address) When using the memory 
manager (which is the default on linux), we wrap malloc/realloc/etc with 
ptmalloc2 (which is the same allocator used in glibc 2.3.x).

What I believe is happening is that ptmalloc2 is requesting more memory than 
necessary from the OS, and then lazily releasing it back. Try looking in the 
ompi source at opal/mca/memory/ptmalloc2/README ( 
https://svn.open-mpi.org/trac/ompi/browser/tags/v1.2-series/v1.2.3/opal/mca/memory/ptmalloc2/README#L121
 ).

This mentions some environment variables that can be set to alter ptmalloc2's 
behavior, although I have no idea if they work.

Alternatively, if you are not using a high performance network, there is 
little reason to use the memory manager, so you could just disable it.
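
For example (just a sketch: the MALLOC_* settings below are standard glibc/ptmalloc2 
tuning knobs with illustrative values, and the configure line assumes a source build):

# ask ptmalloc2 to trim and return freed memory to the OS more eagerly
export MALLOC_TRIM_THRESHOLD_=131072
export MALLOC_MMAP_THRESHOLD_=131072

# or rebuild Open MPI without the memory manager entirely
./configure --without-memory-manager --prefix=/path/to/install
make all install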

Tim

On Thursday 23 August 2007 10:18:45 am Josh Aune wrote:
> I have found that the infiniserv MPI that comes with our IB software
> distribution tracks the same behaviour as gcc (releasing memory on
> realloc).  I have also found that building openmpi with
> --without-memory-manager makes openmpi track the same behaviour as
> glibc.   I'm guessing that there is a bug in the pinned pages caching
> code?
>
> On 8/21/07, Josh Aune  wrote:
> > The realloc included with openmpi 1.2.3 is not releasing memory to the
> > OS and is causing apps to go into swap.  Attached is a little test
> > program that shows calls to realloc not releasing the memory when
> > compiled using mpicc, but when compiled directly with gcc (or icc)
> > calling realloc() frees any memory no longer needed.
> >
> > Is this a bug?
> >
> > If not, how can I force openmpi to free the memory that the allocator
> > is sitting on?
> >
> > Thanks,
> > Josh
> >
> > Sample output.  Note the delta between 'total' and 'malloc held' when
> > compiled with mpicc and how the gcc compiled versions track perfectly.
> >
> > $ mpicc -o realloc_test realloc_test.c
> > $ ./realloc_test
> > ...
> > malloc/realloc/free test
> > malloc()  50 MB, total  50 MB, malloc held   50 MB
> > realloc()  1 MB, total   1 MB, malloc held   50 MB
> > malloc()  50 MB, total  51 MB, malloc held  100 MB
> > realloc()  1 MB, total   2 MB, malloc held  100 MB
> > malloc()  50 MB, total  52 MB, malloc held  150 MB
> > realloc()  1 MB, total   3 MB, malloc held  150 MB
> > malloc()  50 MB, total  53 MB, malloc held  200 MB
> > realloc()  1 MB, total   4 MB, malloc held  200 MB
> > malloc()  50 MB, total  54 MB, malloc held  250 MB
> > realloc()  1 MB, total   5 MB, malloc held  250 MB
> > free()     1 MB, total   4 MB, malloc held  200 MB
> > free()     1 MB, total   3 MB, malloc held  150 MB
> > free()     1 MB, total   2 MB, malloc held  100 MB
> > free()     1 MB, total   1 MB, malloc held   50 MB
> > free()     1 MB, total   0 MB, malloc held    0 MB
> > ...
> >
> > $ gcc -o realloc_test realloc_test.c
> > $ ./realloc_test
> > ...
> > malloc/realloc/free test
> > malloc()  50 MB, total  50 MB, malloc held  50 MB
> > realloc()  1 MB, total   1 MB, malloc held   1 MB
> > malloc()  50 MB, total  51 MB, malloc held  51 MB
> > realloc()  1 MB, total   2 MB, malloc held   2 MB
> > malloc()  50 MB, total  52 MB, malloc held  52 MB
> > realloc()  1 MB, total   3 MB, malloc held   3 MB
> > malloc()  50 MB, total  53 MB, malloc held  53 MB
> > realloc()  1 MB, total   4 MB, malloc held   4 MB
> > malloc()  50 MB, total  54 MB, malloc held  54 MB
> > realloc()  1 MB, total   5 MB, malloc held   5 MB
> > free()     1 MB, total   4 MB, malloc held   4 MB
> > free()     1 MB, total   3 MB, malloc held   3 MB
> > free()     1 MB, total   2 MB, malloc held   2 MB
> > free()     1 MB, total   1 MB, malloc held   1 MB
> > free()     1 MB, total   0 MB, malloc held   0 MB
> > ...
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] mpirun hangs

2007-08-14 Thread Tim Prins

Jody,

jody wrote:

Hi Tim
thanks for the suggestions.

I now set both paths  in .zshenv but it seems that LD_LIBRARY_PATH
still does not get set.
The ldd experment shows that all openmpi libraries are not found,
and indeed the printenv shows that PATH is there but LD_LIBRARY_PATH is 
not.
Are you setting LD_LIBRARY_PATH anywhere else in your scripts? I have, 
on more than one occasion, forgotten that I needed to do:

export LD_LIBRARY_PATH="/foo:$LD_LIBRARY_PATH"

Instead of just:
export LD_LIBRARY_PATH="/foo"



It is rather unclear why this happens...

As to thew second problem:
$ mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02 
./MPI2Test2
[aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: 
connect to 130.60.49.134:40618 failed: (103)
[aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: 
connect to 130.60.49.134:40618 failed, 
connecting over all interfaces failed!

[aim-nano_02:05455] OOB: Connection to HNP lost
[aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 275
[aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1164
[aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
errmgr_hnp.c at line 90
[aim-plankton.unizh.ch:24222] ERROR: A daemon on node nano_02 failed to 
start as expected.
[aim-plankton.unizh.ch:24222] ERROR: There may be more information 
available from
[aim-plankton.unizh.ch:24222] ERROR: the remote shell (see above).
[aim-plankton.unizh.ch:24222] ERROR: The daemon exited unexpectedly with 
status 1.
[aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 188
[aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1196


The strange thing is that nano_02's address is 130.60.49.130 and 
plankton's (the caller) is 130.60.49.134.
I also made sure that nano_02 can ssh to plankton without a password, but 
that didn't change the output.


What is happening here is that the daemon launched on nano_02 is trying 
to contact mpirun on plankton, and is failing for some reason.


Do you have any firewalls/port filtering enabled on nano_02? Open MPI 
generally cannot be run when there are any firewalls on the machines 
being used.
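
If you are not sure, a quick way to check is to list the active packet-filter 
rules on nano_02 (a generic Linux example; the exact command depends on the 
distribution):

/sbin/iptables -L -n

If any rules show up there, either open the needed ports or disable the 
filtering between the two machines.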


Hope this helps,

Tim



Does this message give any hints as to the problem?

Jody


On 8/14/07, *Tim Prins* <tpr...@open-mpi.org> wrote:


Hi Jody,

jody wrote:
 > Hi
 > I installed openmpi 1.2.2 on a quad core intel machine running
fedora 6
 > (hostname plankton)
 > I set PATH and LD_LIBRARY in the .zshrc file:
Note that .zshrc is only used for interactive logins. You need to setup
your system so the LD_LIBRARY_PATH and PATH is also set for
non-interactive logins. See this zsh FAQ entry for what files you need
to modify:
http://zsh.sourceforge.net/FAQ/zshfaq03.html#l19

(BTW: I do not use zsh, but my assumption is that the file you want to
set the PATH and LD_LIBRARY_PATH in is .zshenv)
 > $ echo $PATH
 >

/opt/openmpi/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/jody/bin

 >
 > $ echo $LD_LIBRARY_PATH
 > /opt/openmpi/lib:
 >
 > When i run
 > $ mpirun -np 2 ./MPITest2
 > i get the message
 > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
 > cannot open shared object file: No such file or directory
 > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
 > cannot open shared object file: No such file or directory
 >
 > However
 > $ mpirun -np 2 --prefix /opt/openmpi ./MPI2Test2
 > works.  Any explanation?
Yes, the LD_LIBRARY_PATH is probably not set correctly. Try running:
mpirun -np 2 ldd ./MPITest2

This should show what libraries your executable is using. Make sure all
of the libraries are resolved.

Also, try running:
mpirun -np 1 printenv |grep LD_LIBRARY_PATH
to see what the LD_LIBRARY_PATH is for your executables. Note that you
can NOT simply run mpirun echo $LD_LIBRARY_PATH, as the variable
will be
interpreted in the executing shell.

 >
 > Second problem:
 > I have also  installed openmpi 1.2.2 on a

Re: [OMPI users] libmpi.so.0 problem

2007-08-14 Thread Tim Prins

I meant to say, "exporting the variables is *not* good enough".

Tim

Tim Prins wrote:
In general, exporting the variables is good enough. You really should be 
setting the variables in the appropriate shell (non-interactive) login 
scripts, such as .bashrc (I again point you to the same FAQ entries for 
more information: 
http://www.open-mpi.org/faq/?category=running#run-prereqs and 
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path )


Try running:
mpirun -np 1 printenv
to see what variables are set.

Also,
mpirun -np 1 ldd a.out
will show the libraries your executable is trying to use.

Tim

Durga Choudhury wrote:
Did you export your variables? Otherwise the child shell that forks the 
MPI process will not inherit it.



 
On 8/14/07, *Rodrigo Faccioli* <faccioli.postgre...@gmail.com> wrote:


Thanks, Tim Prins for your email.

However it didn't resolve my problem.

I set the environment variable on my Kubuntu Linux:

faccioli@faccioli-desktop:/usr/local/lib$

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/bin

faccioli@faccioli-desktop:/usr/local/lib$
LD_LIBRARY_PATH=/usr/local/lib/
 


Therefore, set command will display:

BASH=/bin/bash
BASH_ARGC=()
BASH_ARGV=()
BASH_COMPLETION=/etc/bash_completion
BASH_COMPLETION_DIR=/etc/bash_completion.d
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="3" [1]="2" [2]="13" [3]="1" [4]="release"
[5]="x86_64-pc-linux-gnu")
BASH_VERSION='3.2.13(1)-release'
COLORTERM=
COLUMNS=83

DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-C83Ve0QbQz,guid=e07c2bd483a99b50932d080046c199e9
DESKTOP_SESSION=default
DIRSTACK=()
DISPLAY=:0.0
DM_CONTROL=/var/run/xdmctl
EUID=1000
GROUPS=()
GS_LIB=/home/faccioli/.fonts

GTK2_RC_FILES=/home/faccioli/.gtkrc-2.0-kde:/home/faccioli/.kde/share/config/gtkrc-2.0

GTK_RC_FILES=/etc/gtk/gtkrc:/home/faccioli/.gtkrc:/home/faccioli/.kde/share/config/gtkrc

HISTCONTROL=ignoreboth
HISTFILE=/home/faccioli/.bash_history
HISTFILESIZE=500
HISTSIZE=500
HOME=/home/faccioli
HOSTNAME=faccioli-desktop
HOSTTYPE=x86_64
IFS=$' \t\n'
KDE_FULL_SESSION=true
KDE_MULTIHEAD=false
KONSOLE_DCOP='DCOPRef(konsole-5587,konsole)'
KONSOLE_DCOP_SESSION='DCOPRef(konsole-5587,session-2)'
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/usr/local/lib/
LESSCLOSE='/usr/bin/lesspipe %s %s'
LESSOPEN='| /usr/bin/lesspipe %s'
LINES=33
LOGNAME=faccioli

LS_COLORS='no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.flac=01;35:*.mp3=01;35:*.mpc=01;35:*.ogg=01;35:*.wav=01;35:'

MACHTYPE=x86_64-pc-linux-gnu
MAILCHECK=60
OLDPWD=/home/faccioli
OPTERR=1
OPTIND=1
OSTYPE=linux-gnu

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/bin

PIPESTATUS=([0]="0")
PPID=5587

Unfortunately,  when I execute mpirun a.out, the message I received
is: a.out:  error while loading shared libraries: libmpi.so.0 :
    cannot open shared object file: No such file or directory

Thanks,


On 8/14/07, *Tim Prins* <tpr...@open-mpi.org> wrote:

You need to set your LD_LIBRARY_PATH. See these FAQ entries:
http://www.open-mpi.org/faq/?category=running#run-prereqs
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

Tim

Rodrigo Faccioli wrote:
 > Hi,
 >
 > I need to know what I can resolve my problem. I'm starting my
study on
 > mpi, more specificaly open-mpi.
 >
 > But, when I execute mpirun a.out, the message I received is:
a.out:
 > error while loading shared libraries: libmpi.so.0: cannot
open shared
 > object file: No such file or directory
 >
 > The a.out file was obtained through mpicc hello.c
 >
 > Thanks.
 >
 >
 >
 >

 >
 > _

Re: [OMPI users] libmpi.so.0 problem

2007-08-14 Thread Tim Prins
In general, exporting the variables is good enough. You really should be 
setting the variables in the appropriate shell (non-interactive) login 
scripts, such as .bashrc (I again point you to the same FAQ entries for 
more information: 
http://www.open-mpi.org/faq/?category=running#run-prereqs and 
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path )


Try running:
mpirun -np 1 printenv
to see what variables are set.

Also,
mpirun -np 1 ldd a.out
will show the libraries your executable is trying to use.

Tim

Durga Choudhury wrote:
Did you export your variables? Otherwise the child shell that forks the 
MPI process will not inherit it.



 
On 8/14/07, *Rodrigo Faccioli* <faccioli.postgre...@gmail.com> wrote:


    Thanks, Tim Prins for your email.

However it didn't resolve my problem.

I set the environment variable on my Kubuntu Linux:

faccioli@faccioli-desktop:/usr/local/lib$

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/bin

faccioli@faccioli-desktop:/usr/local/lib$
LD_LIBRARY_PATH=/usr/local/lib/
 


Therefore, set command will display:

BASH=/bin/bash
BASH_ARGC=()
BASH_ARGV=()
BASH_COMPLETION=/etc/bash_completion
BASH_COMPLETION_DIR=/etc/bash_completion.d
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="3" [1]="2" [2]="13" [3]="1" [4]="release"
[5]="x86_64-pc-linux-gnu")
BASH_VERSION='3.2.13(1)-release'
COLORTERM=
COLUMNS=83

DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-C83Ve0QbQz,guid=e07c2bd483a99b50932d080046c199e9
DESKTOP_SESSION=default
DIRSTACK=()
DISPLAY=:0.0
DM_CONTROL=/var/run/xdmctl
EUID=1000
GROUPS=()
GS_LIB=/home/faccioli/.fonts

GTK2_RC_FILES=/home/faccioli/.gtkrc-2.0-kde:/home/faccioli/.kde/share/config/gtkrc-2.0

GTK_RC_FILES=/etc/gtk/gtkrc:/home/faccioli/.gtkrc:/home/faccioli/.kde/share/config/gtkrc

HISTCONTROL=ignoreboth
HISTFILE=/home/faccioli/.bash_history
HISTFILESIZE=500
HISTSIZE=500
HOME=/home/faccioli
HOSTNAME=faccioli-desktop
HOSTTYPE=x86_64
IFS=$' \t\n'
KDE_FULL_SESSION=true
KDE_MULTIHEAD=false
KONSOLE_DCOP='DCOPRef(konsole-5587,konsole)'
KONSOLE_DCOP_SESSION='DCOPRef(konsole-5587,session-2)'
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/usr/local/lib/
LESSCLOSE='/usr/bin/lesspipe %s %s'
LESSOPEN='| /usr/bin/lesspipe %s'
LINES=33
LOGNAME=faccioli

LS_COLORS='no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.flac=01;35:*.mp3=01;35:*.mpc=01;35:*.ogg=01;35:*.wav=01;35:'

MACHTYPE=x86_64-pc-linux-gnu
MAILCHECK=60
OLDPWD=/home/faccioli
OPTERR=1
OPTIND=1
OSTYPE=linux-gnu

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/bin

PIPESTATUS=([0]="0")
PPID=5587

Unfortunately,  when I execute mpirun a.out, the message I received
is: a.out:  error while loading shared libraries: libmpi.so.0 :
cannot open shared object file: No such file or directory

Thanks,


On 8/14/07, *Tim Prins* <tpr...@open-mpi.org> wrote:

You need to set your LD_LIBRARY_PATH. See these FAQ entries:
http://www.open-mpi.org/faq/?category=running#run-prereqs
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

Tim

Rodrigo Faccioli wrote:
 > Hi,
 >
 > I need to know what I can resolve my problem. I'm starting my
study on
 > mpi, more specificaly open-mpi.
 >
 > But, when I execute mpirun a.out, the message I received is:
a.out:
 > error while loading shared libraries: libmpi.so.0: cannot
open shared
 > object file: No such file or directory
 >
 > The a.out file was obtained through mpicc hello.c
 >
 > Thanks.
 >
 >
 >
 >

 >
 > ___
 > users mailing list
 > us...@open-mpi.org <mailto:us...@open-mpi.org>

Re: [OMPI users] Help : Need some tuning, or is it a bug ?

2007-08-14 Thread Tim Prins

Guillaume THOMAS-COLLIGNON wrote:

Hi,

I wrote an application which works fine on a small number of nodes  
(eg. 4), but it crashes on a large number of CPUs.


In this application, all the slaves send many small messages to the  
master. I use the regular MPI_Send, and since the messages are  
relatively small (1 int, then many times 3296 ints), OpenMPI does a  
very good job at sending them asynchronously, and it maxes out the  
gigabit link on the master node. I'm very happy with this behaviour,  
it gives me the same performance as if I was doing all the  
asynchronous stuff myself, and the code remains simple.


But it crashes when there are too many slaves. 
How many is too many? I successfully ran your code on 96 nodes, with 4 
processes per node and it seemed to work fine. Also, what network are 
you using?


So it looks like at  
some point the master node runs out of buffers and the job crashes  
brutally. 

What do you mean by crashing? Is there a segfault or an error message?

Tim


That's my understanding but I may be wrong.
If I use explicit synchronous sends (MPI_Ssend), it does not crash  
anymore but the performance is a lot lower.


I have 2 questions regarding this :

1) What kind of tuning would help handling more messages and keep the  
master from crashing ?


2) Is this the expected behaviour ? I don't think my code is doing  
anything wrong, so I would not expect a brutal crash.



The workaround I've found so far is to do an MPI_Ssend for the  
request, then use MPI_Send for the data blocks. So all the slaves are  
blocked on the request, it keeps the master from being flooded, and  
the performance is still good. But nothing tells me it won't crash at  
some point if I have more data blocks in my real code, so I'd like to  
know more about what's happening here.


Thanks,

-Guillaume


Here is the code, so you get a better idea of the communication  
scheme, or if you someone wants to reproduce the problem.



#include <stdio.h>
#include <stdlib.h>

#include <mpi.h>

#define BLOCKSIZE 3296
#define MAXBLOCKS 1000
#define NLOOP 4

int main (int argc, char **argv) {
   int i, j, ier, rank, npes, slave, request;
   int *data;
   MPI_Status status;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
   MPI_Comm_size (MPI_COMM_WORLD, &npes);

   if ((data = (int *) calloc (BLOCKSIZE, sizeof (int))) == NULL)
 return -10;

   // Master
   if (rank == 0) {
 // Expect (NLOOP * number of slaves) requests
 for (i=0; i<(npes-1)*NLOOP; i++) {
   /* Wait for a request from any slave. Request contains number  
of data blocks */
   ier = MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, 964,  
MPI_COMM_WORLD, &status);

   if (ier != MPI_SUCCESS)
return -1;
   slave = status.MPI_SOURCE;
   printf ("Master : request for %d blocks from slave %d\n",  
request, slave);


   /* Receive the data blocks from this slave */
   for (j=0; j<request; j++) {

Re: [OMPI users] mpirun hangs

2007-08-14 Thread Tim Prins

Hi Jody,

jody wrote:

Hi
I installed openmpi 1.2.2 on a quad core intel machine running fedora 6 
(hostname plankton)

I set PATH and LD_LIBRARY in the .zshrc file:
Note that .zshrc is only used for interactive logins. You need to setup 
your system so the LD_LIBRARY_PATH and PATH is also set for 
non-interactive logins. See this zsh FAQ entry for what files you need 
to modify:

http://zsh.sourceforge.net/FAQ/zshfaq03.html#l19

(BTW: I do not use zsh, but my assumption is that the file you want to 
set the PATH and LD_LIBRARY_PATH in is .zshenv)
$ echo $PATH 
/opt/openmpi/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/jody/bin 


$ echo $LD_LIBRARY_PATH
/opt/openmpi/lib:

When i run
$ mpirun -np 2 ./MPITest2
i get the message
./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0: 
cannot open shared object file: No such file or directory
./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0: 
cannot open shared object file: No such file or directory


However
$ mpirun -np 2 --prefix /opt/openmpi ./MPI2Test2
works.  Any explanation?

Yes, the LD_LIBRARY_PATH is probably not set correctly. Try running:
mpirun -np 2 ldd ./MPITest2

This should show what libraries your executable is using. Make sure all 
of the libraries are resolved.


Also, try running:
mpirun -np 1 printenv |grep LD_LIBRARY_PATH
to see what the LD_LIBRARY_PATH is for your executables. Note that you 
can NOT simply run mpirun echo $LD_LIBRARY_PATH, as the variable will be 
interpreted in the executing shell.
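
If you do want the remote value of a variable, one way (just a small sketch) is 
to let a shell started by mpirun do the expansion, e.g.:

mpirun -np 2 sh -c 'echo $LD_LIBRARY_PATH'

The single quotes keep your local shell from expanding the variable before 
mpirun runs.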




Second problem:
I have also  installed openmpi 1.2.2 on an AMD machine running gentoo 
linux (hostname nano_02).

Here as well PATH and LD_LIBRARY_PATH are set correctly,
and
$ mpirun -np 2 ./MPITest2
works locally on nano_02.

If, however, from plankton i call
$ mpirun -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
the call hangs with no output whatsoever.
Any pointers on how to solve this problem?

Try running:
mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02 
./MPI2Test2


This should give some more output as to what is happening.

Hope this helps,

Tim



Thank You
  Jody





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] libmpi.so.0 problem

2007-08-14 Thread Tim Prins

You need to set your LD_LIBRARY_PATH. See these FAQ entries:
http://www.open-mpi.org/faq/?category=running#run-prereqs
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
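
For example, assuming Open MPI was installed under the default /usr/local 
prefix (adjust the paths if your prefix differs), something like this in your 
shell startup file should be enough:

export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH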

Tim

Rodrigo Faccioli wrote:

Hi,

I need to know how I can resolve my problem. I'm starting my study on 
MPI, more specifically Open MPI.


But, when I execute mpirun a.out, the message I received is: a.out: 
error while loading shared libraries: libmpi.so.0: cannot open shared 
object file: No such file or directory


The a.out file was obtained through mpicc hello.c

Thanks.





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] orterun mangling executable's "-host" argument

2007-08-10 Thread Tim Prins
Hi Marcus,

Your expectation sounds very reasonable to me. I have filed a bug in our bug 
tracker (https://svn.open-mpi.org/trac/ompi/ticket/1124), and you will 
receive emails as it is updated.

Unfortunately, this is in a part of the code which has not been touched for a 
long time, and is in somewhat of disrepair. So it might take a while to fix 
it.

Thanks,

Tim

On Wednesday 08 August 2007 04:01:01 pm Marcus R. Epperson wrote:
> We have a code that takes "-host " as command-line arguments, and
> when run via orterun they are getting replaced with "-rawmap 1 ".  I
> would have expected orterun to stop parsing its own options after seeing
> the executable name.
>
> Here's a simple reproducer:
>
> $ cat myprogram.sh
> #!/bin/bash
> echo "$@"
>
> $ ./myprogram.sh a b -host c
> a b -host c
>
> $ orterun -n 1 ./myprogram.sh a b -host c
> a b -rawmap 1 c
>
> This seems like a bug to me, but maybe there is some other simple
> invocation that would make it work as expected.  I tried adding a "--"
> argument before the executable name in hopes that it would stop argument
> processing at that point (similar to bash), but it had no effect.
>
> Thanks for any help,
> -Marcus
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] OpenMPI and PathScale problem

2007-08-07 Thread Tim Prins

Have you set up your LD_LIBRARY_PATH variable correctly? See this FAQ entry:
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
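
For example, based on the prefix shown in your 'mpif90 --show' output (adjust 
if your install directory differs), something like this before compiling and 
running should help:

export LD_LIBRARY_PATH=/home/fort/usr/lib:$LD_LIBRARY_PATH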

Hope this helps,

Tim

Michael Komm wrote:

I'm trying to make the PathScale Fortran compiler work with OpenMPI on a 64-bit 
Linux machine and can't get past a simple demo program. Here is detailed info:

pathf90 -v
PathScale EKOPath(TM) Compiler Suite: Version 2.5
Built on: 2006-08-22 21:02:51 -0700
Thread model: posix
GNU gcc version 3.3.1 (PathScale 2.5 driver)

mpif90 --show
pathf90 -I/home/fort/usr//include -pthread -I/home/fort/usr//lib 
-L/home/fort/usr//lib -lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl 
-Wl,--export-dynamic -lnsl -lutil -lm -ldl

The OpenMPI version 1.2.3 resides in the /home/fort/usr/ directory.

When I compile a simple program using 


mpif90 -o test test.f90

I get a binary all right but it has broken linked libraries

ldd test
libmpi_f90.so.0 => not found
libmpi_f77.so.0 => not found
libmpi.so.0 => /usr/lib64/lam/libmpi.so.0 (0x003db360)
libopen-rte.so.0 => not found
libopen-pal.so.0 => not found
libdl.so.2 => /lib64/libdl.so.2 (0x003db320)
libnsl.so.1 => /lib64/libnsl.so.1 (0x003db990)
libutil.so.1 => /lib64/libutil.so.1 (0x003db840)
libmv.so.1 => /opt/pathscale/lib/2.5/libmv.so.1 (0x002a9557f000)
libmpath.so.1 => /opt/pathscale/lib/2.5/libmpath.so.1 
(0x002a956a8000)
libm.so.6 => /lib64/tls/libm.so.6 (0x003db300)
libpathfortran.so.1 => /opt/pathscale/lib/2.5/libpathfortran.so.1 
(0x002a957c9000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x003db380)
libc.so.6 => /lib64/tls/libc.so.6 (0x003db2d0)
/lib64/ld-linux-x86-64.so.2 (0x003db290)

The demo program fails to start due to missing shared libraries. In addition 
the pathf90 uses some lame mpi library instead of openMPI! Any ideas on where 
the problem could be?

 Michael


Mgr. Michael Komm
Tokamak Department
Institute of Plasma Physics of Academy of Sciences of Czech Republic
E-mail:k...@ipp.cas.cz
Za Slovankou 3 
182 00 
PRAGUE 8 



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Tim Prins

> Yes, this helps tremendously.  I installed rsh, and now it pretty much
> works.
Glad this worked out for you.

>
> The one missing detail is that I can't seem to get the stdout/stderr
> output.  For example:
>
> $ orterun -np 1 uptime
> $ uptime
> 18:24:27 up 13 days,  3:03,  0 users,  load average: 0.00, 0.03, 0.00
>
> The man page indicates that stdout/stderr is supposed to come back to
> the stdout/stderr of the orterun process.  Any ideas on why this isn't
> working?
It should work. However, we currently have some I/O forwarding problems which 
show up in some environments that will (hopefully) be fixed in the next 
release. As far as I know, the problem seems to happen mostly with non-mpi 
applications.

Try running a simple mpi application, such as:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
int rank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Hello, world, I am %d of %d\n", rank, size);
MPI_Finalize();

return 0;
}

If that works fine, then it is probably our problem, and not a problem with 
your setup.

Sorry I don't have a better answer :(

Tim






Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Tim Prins

Adam C Powell IV wrote:

As mentioned, I'm running in a chroot environment, so rsh and ssh won't
work: "rsh localhost" will rsh into the primary local host environment,
not the chroot, which will fail.

[The purpose is to be able to build and test MPI programs in the Debian
unstable distribution, without upgrading the whole machine to unstable.
Though most machines I use for this purpose run Debian stable or
testing, the machine I'm currently using runs a very old Fedora, for
which I don't think OpenMPI is available.]


Allright, I understand what you are trying to do now. To be honest, I 
don't think we have ever really thought about this use case. We always 
figured that to test Open MPI people would simply install it in a 
different directory and use it from there.




With MPICH, mpirun -np 1 just runs the new process in the current
context, without rsh/ssh, so it works in a chroot.  Does OpenMPI not
support this functionality?


Open MPI does support this functionality. First, a bit of explanation:

We use 'pls' (process launching system) components to handle the 
launching of processes. There are components for slurm, gridengine, rsh, 
and others. At runtime we open each of these components and query them 
as to whether they can be used. The original error you posted says that 
none of the 'pls' components can be used because all of they detected 
they could not run in your setup. The slurm one excluded itself because 
there were no environment variables set indicating it is running under 
SLURM. Similarly, the gridengine pls said it cannot run as well. The 
'rsh' pls said it cannot run because neither 'ssh' nor 'rsh' are 
available (I assume this is the case, though you did not explicitly say 
they were not available).


But in this case, you do want the 'rsh' pls to be used. It will 
automatically fork any local processes, and will use rsh/ssh to launch 
any remote processes. Again, I don't think we ever imagined the use case 
of a UNIX-like system where there are no launchers like SLURM 
available and rsh/ssh is also not available (Open MPI is, after all, 
primarily concerned with multi-node operation).


So, there are several ways around this:

1. Make rsh or ssh available, even though they will not be used.

2. Tell the 'rsh' pls component to use a dummy program such as 
/bin/false by adding the following to the command line:

-mca pls_rsh_agent /bin/false

3. Create a dummy 'rsh' executable that is available in your path.

For instance:

[tprins@odin ~]$ which ssh
/usr/bin/which: no ssh in 
(/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)

[tprins@odin ~]$ which rsh
/usr/bin/which: no rsh in 
(/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)

[tprins@odin ~]$ mpirun -np 1  hostname
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init_stage1.c at line 317

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS

--
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_system_init.c at line 46
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init.c at line 52
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
orterun.c at line 399


[tprins@odin ~]$ mpirun -np 1 -mca pls_rsh_agent /bin/false  hostname
odin.cs.indiana.edu

[tprins@odin ~]$ touch usr/bin/rsh
[tprins@odin ~]$ chmod +x usr/bin/rsh
[tprins@odin ~]$ mpirun -np 1  hostname
odin.cs.indiana.edu
[tprins@odin ~]$


I hope this helps,

Tim



Thanks,
Adam

On Wed, 2007-07-18 at 11:09 -0400, Tim Prins wrote:
This is strange. I assume that you want to use rsh or ssh to launch the 
processes?


If you want to use ssh, does "which ssh" find ssh? Similarly, if you 
want to use rsh, does "which rsh" find rsh?


Thanks,

Tim

Adam C Powell IV wrote:

On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:

Adam C Powell IV wrote:

Greetings,

I'm running the Debian package of OpenMPI in a chroot (with /proc
mounted properly), and orte_init is failing as follows:
[snip]
What could be wrong?  Does orterun not run in a chroot environment?
What more can I do to investigate further?

Try running mpirun with the added options:
-mca orte_debug 1 -mca pls_base_verbose 20

Then send the output to the list.

Thanks!  Here's the output:

$ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
[new-host-3:19201] mc

Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Tim Prins
This is strange. I assume that you want to use rsh or ssh to launch the 
processes?


If you want to use ssh, does "which ssh" find ssh? Similarly, if you 
want to use rsh, does "which rsh" find rsh?


Thanks,

Tim

Adam C Powell IV wrote:

On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:

Adam C Powell IV wrote:

Greetings,

I'm running the Debian package of OpenMPI in a chroot (with /proc
mounted properly), and orte_init is failing as follows:
[snip]
What could be wrong?  Does orterun not run in a chroot environment?
What more can I do to investigate further?

Try running mpirun with the added options:
-mca orte_debug 1 -mca pls_base_verbose 20

Then send the output to the list.


Thanks!  Here's the output:

$ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
[new-host-3:19201] mca: base: components_open: Looking for pls components
[new-host-3:19201] mca: base: components_open: distilling pls components
[new-host-3:19201] mca: base: components_open: accepting all pls components
[new-host-3:19201] mca: base: components_open: opening pls components
[new-host-3:19201] mca: base: components_open: found loaded component 
gridengine[new-host-3:19201] mca: base: components_open: component gridengine 
open function successful
[new-host-3:19201] mca: base: components_open: found loaded component proxy
[new-host-3:19201] mca: base: components_open: component proxy open function 
successful
[new-host-3:19201] mca: base: components_open: found loaded component rsh
[new-host-3:19201] mca: base: components_open: component rsh open function 
successful
[new-host-3:19201] mca: base: components_open: found loaded component slurm
[new-host-3:19201] mca: base: components_open: component slurm open function 
successful
[new-host-3:19201] orte:base:select: querying component gridengine
[new-host-3:19201] pls:gridengine: NOT available for selection
[new-host-3:19201] orte:base:select: querying component proxy
[new-host-3:19201] orte:base:select: querying component rsh
[new-host-3:19201] orte:base:select: querying component slurm
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init_stage1.c at line 312
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_system_init.c at line 42
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at 
line 52
--
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
--

-Adam




Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Tim Prins

Adam C Powell IV wrote:

Greetings,

I'm running the Debian package of OpenMPI in a chroot (with /proc
mounted properly), and orte_init is failing as follows:

$ uptime
 12:51:55 up 12 days, 21:30,  0 users,  load average: 0.00, 0.00, 0.00
$ orterun -np 1 uptime
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init_stage1.c at line 312
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_system_init.c at line 42
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at 
line 52
--
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
--

Note running with -v produces no more output than this.  Running orted
in the background doesn't seem to help.

What could be wrong?  Does orterun not run in a chroot environment?
What more can I do to investigate further?

Try running mpirun with the added options:
-mca orte_debug 1 -mca pls_base_verbose 20

Then send the output to the list.

Thanks,

Tim



Thanks,
-Adam




Re: [OMPI users] openmpi fails on mx endpoint

2007-07-11 Thread Tim Prins
Or you can simply tell the mx mtl not to run by adding "-mca mtl ^mx" to 
the command line.


George: There is an open bug about this problem: 
https://svn.open-mpi.org/trac/ompi/ticket/1080


Tim

George Bosilca wrote:
There seems to be a problem with MX, because a conflict between out  
MTL and the BTL. So, I suspect that if you want it to run [right now]  
you should spawn less than the MX supported endpoint by node (one  
less). I'll take a look this afternoon.


   Thanks,
 george.

On Jul 11, 2007, at 12:39 PM, Warner Yuen wrote:

The hostfile was changed around as we tried to pull nodes out that  
we thought might have been bad. But none were oversubscribed, if  
that's what you mean.


Warner Yuen
Scientific Computing Consultant
Apple Computer



On Jul 11, 2007, at 9:00 AM, users-requ...@open-mpi.org wrote:


Message: 3
Date: Wed, 11 Jul 2007 11:27:47 -0400
From: George Bosilca 
Subject: Re: [OMPI users] OMPI users] openmpi fails on mx endpoint
busy
To: Open MPI Users 
Message-ID: <15c9e0ab-6c55-43d9-a40e-82cf973b0...@cs.utk.edu>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

What's in the hostmx10g file ? How many hosts ?

   george.

On Jul 11, 2007, at 1:34 AM, Warner Yuen wrote:


I've also had someone run into the endpoint busy problem. I never
figured it out, I just increased the default endpoints on MX-10G
from 8 to 16 endpoints to make the problem go away. Here's the
actual command and error before setting the endpoints to 16. The
version is MX-1.2.1with OMPI 1.2.3:

node1:~/taepic tae$ mpirun --hostfile hostmx10g -byslot -mca btl
self,sm,mx -np 12 test_beam_injection test_beam_injection.inp -npx
12 > out12
[node2:00834] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
 
--


Process 0.1.3 is unable to reach 0.1.7 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [MTT users] Textfile Reporter

2007-07-10 Thread Tim Prins
Looks good. Thanks for doing this. I did need the following patch though to 
correct a capitalization problem...

Tim

Index: lib/MTT/Reporter/ParsableTextfile.pm
===
--- lib/MTT/Reporter/ParsableTextfile.pm(revision 742)
+++ lib/MTT/Reporter/ParsableTextfile.pm(working copy)
@@ -10,7 +10,7 @@
 # $HEADER$
 #

-package MTT::Reporter::ParsableTextFile;
+package MTT::Reporter::ParsableTextfile;

 use strict;
 use Cwd;


On Tuesday 10 July 2007 04:40:40 pm Ethan Mallove wrote:
> Done.
>
> I brought it back as "ParsableTextfile" in both the trunk
> and ompi-core-testers. You'll just have to do two things in
> your INI file:
>
>  * Change "module = Textfile" to "module = ParsableTextfile"
>  * Rename "textfile_" params to "parsabletextfile_" params
>
> Let me know if you run into any other issues with this.
>
> -Ethan
>
> On Tue, Jul/10/2007 02:27:27PM, Tim Prins wrote:
> > Hmm, the INI file reporter does not seem to work for me. For the test
> > results I only get the information about the last test run.
> >
> > Anyways, I like the idea of pulling the data directly in from perl output
> > but just don't have the time to mess with it right now. For me bringing
> > back the old reporter would be easiest for the time being. However, I
> > also need the following patch applied to resurect a couple output fields
> > that were removed which we need:
> >
> > Index: lib/MTT/Test/Analyze/Correctness.pm
> > ===
> > --- lib/MTT/Test/Analyze/Correctness.pm (revision 737)
> > +++ lib/MTT/Test/Analyze/Correctness.pm (working copy)
> > @@ -53,6 +53,8 @@
> >
> >  test_name => $run->{name},
> >  command => $run->{cmd},
> > +test_build_section_name =>
> > $run->{test_build_simple_section_name}, +
> >  np => $run->{np},
> >  exit_value =>
> > MTT::DoCommand::exit_value($results->{exit_status}), exit_signal =>
> > MTT::DoCommand::exit_signal($results->{exit_status}), Index:
> > lib/MTT/MPI/Install.pm
> > ===
> > --- lib/MTT/MPI/Install.pm  (revision 737)
> > +++ lib/MTT/MPI/Install.pm  (working copy)
> > @@ -505,6 +505,8 @@
> >  my $report = {
> >  phase => "MPI Install",
> >
> > +mpi_install_section_name => $config->{simple_section_name},
> > +
> >  bitness => $config->{bitness},
> >  endian => $config->{endian},
> >  compiler_name => $config->{compiler_name},
> >
> >
> > Thanks,
> >
> > Tim
> >
> > On Tuesday 10 July 2007 11:46:34 am Ethan Mallove wrote:
> > > Whoops! I didn't realize anyone was using that Textfile
> > > module. We can resurrect that if you'd like (call it
> > > ParseableTextfile).
> > >
> > > There's also the INIFile Reporter. That might be your best
> > > bet, since there's a Config::INIFiles CPAN module. (Your
> > > wrappers are in Perl, right?) Though wouldn't it be even
> > > easier if there were a PerlDumper Reporter module so you
> > > could read in the data *directly* to your Perl wrappers?
> > > Your wrapper would do no parsing then. E.g.,
> > >
> > > open(FILE, "< $file");
> > > undef $/;
> > > $mtt_results = <FILE>;
> > >
> > > -Ethan
> > >
> > > On Mon, Jul/09/2007 06:07:51PM, Tim Prins wrote:
> > > > Hi,
> > > >
> > > > With the new version of MTT, the textfile report file
> > > > format changed to a more human readable format. Since we
> > > > here at IU use a script to parse this, it presents a bit
> > > > of a problem. I can update our script, but was wondering
> > > > how stable this new output format is.
> > > >
> > > > If it will not be very stable, I was wondering if the
> > > > developers would consider adding a parseable textfile
> > > > output module. The easiest thing to do for this would be
> > > > to just import the old textfile module as a new parseable
> > > > module. I have tried this and it seems to work fine,
> > > > however there may be problems with this that I am unaware
> > > > of.
> > > >
> > > > I can deal with this either way, but just thought it might
> > &

Re: [OMPI users] warning:regcache incompatible with malloc

2007-07-10 Thread Tim Prins
On Tuesday 10 July 2007 03:11:45 pm Scott Atchley wrote:
> On Jul 10, 2007, at 2:58 PM, Scott Atchley wrote:
> > Tim, starting with the recently released 1.2.1, it is the default.
>
> To clarify, MX_RCACHE=1 is the default.

It would be good for the default to be something where there is no warning 
printed (i.e. 0 or 2). I see the warning on the current trunk.
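
In the meantime it can be set by hand for the processes mpirun launches, e.g. 
(the value 2 is the one suggested earlier in this thread; substitute your own 
binary for a.out):

mpirun -x MX_RCACHE=2 -np 4 ./a.out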

Tim


Re: [OMPI users] warning:regcache incompatible with malloc

2007-07-10 Thread Tim Prins
Is this something that Open MPI should be setting automatically?

Tim

On Tuesday 10 July 2007 02:44:04 pm George Bosilca wrote:
> I always use MX_RCACHE=2 for both MTL and BTL. So far I didn't had
> any problems with it.
>
>george.
>
> On Jul 10, 2007, at 2:37 PM, Brian Barrett wrote:
> > On Jul 10, 2007, at 11:40 AM, Scott Atchley wrote:
> >> On Jul 10, 2007, at 1:14 PM, Christopher D. Maestas wrote:
> >>> Has anyone seen the following message with Open MPI:
> >>> ---
> >>> warning:regcache incompatible with malloc
> >>> ---
> >>>
> >>> 
> >>>
> >>> ---
> >>>
> >>> We don't see this message with mpich-mx-1.2.7..4
> >>
> >> MX has an internal registration cache that can be enabled with
> >> MX_RCACHE=1 or disabled with MX_RCACHE=0 (the default before MX-1.2.1
> >> was off, and starting with 1.2.1 the default is on). If it is on, MX
> >> checks to see if the application is trying to override malloc() and
> >> other memory handling functions. If so, it prints the error that you
> >> are seeing and fails to use the registration cache.
> >>
> >> Open MPI can use the regcache if you set MX_RCACHE=2. This tells MX
> >> to skip the malloc() check and use the cache regardless. In the case
> >> of Open MPI, this is believed to be safe. That will not be true for
> >> all applications.
> >>
> >> MPICH-MX does not manage memory, so MX_RCACHE=1 is safe to use unless
> >> the user's application manages memory.
> >
> > Scott -
> >
> > I'm having trouble getting the warning to go away with Open MPI.
> > I've disabled our copy of ptmalloc2, so we're not providing a malloc
> > anymore.  I'm wondering if there's also something with the use of
> > DSOs to load libmyriexpress?  Is your belief that MX_RCACHE=2 is safe
> > just for the BTL or for the MTL as well?
> >
> > Brian
> >
> >
> > --
> >Brian W. Barrett
> >Networking Team, CCS-1
> >Los Alamos National Laboratory
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [MTT users] Textfile Reporter

2007-07-10 Thread Tim Prins
Hmm, the INI file reporter does not seem to work for me. For the test results 
I only get the information about the last test run. 

Anyways, I like the idea of pulling the data directly in from perl output but 
just don't have the time to mess with it right now. For me bringing back the 
old reporter would be easiest for the time being. However, I also need the 
following patch applied to resurect a couple output fields that were removed 
which we need:

Index: lib/MTT/Test/Analyze/Correctness.pm
===
--- lib/MTT/Test/Analyze/Correctness.pm (revision 737)
+++ lib/MTT/Test/Analyze/Correctness.pm (working copy)
@@ -53,6 +53,8 @@

 test_name => $run->{name},
 command => $run->{cmd},
+test_build_section_name => $run->{test_build_simple_section_name},
+
 np => $run->{np},
 exit_value => MTT::DoCommand::exit_value($results->{exit_status}),
 exit_signal => MTT::DoCommand::exit_signal($results->{exit_status}),
Index: lib/MTT/MPI/Install.pm
===
--- lib/MTT/MPI/Install.pm  (revision 737)
+++ lib/MTT/MPI/Install.pm  (working copy)
@@ -505,6 +505,8 @@
 my $report = {
 phase => "MPI Install",

+mpi_install_section_name => $config->{simple_section_name},
+
 bitness => $config->{bitness},
 endian => $config->{endian},
 compiler_name => $config->{compiler_name},


Thanks,

Tim

On Tuesday 10 July 2007 11:46:34 am Ethan Mallove wrote:
> Whoops! I didn't realize anyone was using that Textfile
> module. We can resurrect that if you'd like (call it
> ParseableTextfile).
>
> There's also the INIFile Reporter. That might be your best
> bet, since there's a Config::INIFiles CPAN module. (Your
> wrappers are in Perl, right?) Though wouldn't it be even
> easier if there were a PerlDumper Reporter module so you
> could read in the data *directly* to your Perl wrappers?
> Your wrapper would do no parsing then. E.g.,
>
> open(FILE, "< $file");
> undef $/;
> $mtt_results = <FILE>;
>
> -Ethan
>
> On Mon, Jul/09/2007 06:07:51PM, Tim Prins wrote:
> > Hi,
> >
> > With the new version of MTT, the textfile report file
> > format changed to a more human readable format. Since we
> > here at IU use a script to parse this, it presents a bit
> > of a problem. I can update our script, but was wondering
> > how stable this new output format is.
> >
> > If it will not be very stable, I was wondering if the
> > developers would consider adding a parseable textfile
> > output module. The easiest thing to do for this would be
> > to just import the old textfile module as a new parseable
> > module. I have tried this and it seems to work fine,
> > however there may be problems with this that I am unaware
> > of.
> >
> > I can deal with this either way, but just thought it might
> > make things easier to have a parseable format that is
> > relatively static, and a human readable format that can be
> > tweaked for useability as time goes by.
> >
> > Thanks,
> >
> > Tim
> > ___
> > mtt-users mailing list
> > mtt-us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
>
> ___
> mtt-users mailing list
> mtt-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users


[MTT users] Textfile Reporter

2007-07-09 Thread Tim Prins
Hi,

With the new version of MTT, the textfile report file format changed to a more 
human readable format. Since we here at IU use a script to parse this, it 
presents a bit of a problem. I can update our script, but was wondering how 
stable this new output format is.

If it will not be very stable, I was wondering if the developers would 
consider adding a parseable textfile output module. The easiest thing to do 
for this would be to just import the old textfile module as a new parseable 
module. I have tried this and it seems to work fine, however there may be 
problems with this that I am unaware of.

I can deal with this either way, but just thought it might make things easier 
to have a parseable format that is relatively static, and a human readable 
format that can be tweaked for usability as time goes by.

Thanks,

Tim


Re: [OMPI users] OpenMPI output over several ssh-hops?

2007-07-09 Thread Tim Prins

Hi Jody,

Sorry for the super long delay. I don't know how this one got lost...

I run like this all the time. Unfortunately, it is not as simple as I 
would like. Here is what I do:


1. Log into the machine using ssh -X
2. Run mpirun with the following parameters:
	-mca pls rsh  (This makes sure that Open MPI uses the rsh/ssh launcher. 
It may not be necessary depending on your setup)
	-mca pls_rsh_agent "ssh -X" (To make sure X information is forwarded. 
This might not be necessary if you have ssh setup to always forward X 
information)
	--debug-daemons (This ensures that the ssh connections to the backend 
nodes are kept open. Otherwise, they are closed and X information cannot 
be forwarded. Unfortunately, this will also cause some debugging output 
to be printed, but right now there is no other way :( )


So, the complete command is:
mpirun -np 4 -mca pls rsh -mca pls_rsh_agent "ssh -X" --debug-daemons 
xterm -e gdb my_prog


I hope this helps. Let me know if you are still experiencing problems.

Tim


jody wrote:

Hi
For debugging i usually run each process in a separate X-window.
This works well if i set the DISPLAY variable to the computer
from which i am starting my OpenMPI application.

This method fails, however, if I log in (via ssh) to my workstation
from a third computer and then start my OpenMPI application:
only the processes running on the workstation I logged into can
open their windows on the third computer. The processes on
the other computers can't open their windows.

This is how i start the processes

mpirun -np 4 -x DISPLAY run_gdb.sh ./TestApp

where run_gdb.sh looks like this
-
#!/bin/csh -f

echo "Running GDB on node `hostname`"
xterm -e gdb $*
exit 0
-
The output from the processes on the other computer:
xterm Xt error: Can't open display: localhost:12.0

Is there a way to tell OpenMPI to forward the X windows
over yet another ssh connection?

Thanks
  Jody
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [MTT users] Database insert errors

2007-07-09 Thread Tim Prins
Could these errors have to do with the fact that we are running MTT 
v2.0.1, and not the latest version?


Thanks,

Tim

Ethan Mallove wrote:

Hi Tim,

I see some of these FOREIGN KEY constraint errors every
night. There's a system of speedy and archive tables to keep
short-term queries fast, but it has bugs. There are rows in
the archive tables that should be mirrored in the speedy
tables, but this is not always the case. We (well, mostly
Josh) are working on an improved system of "partitioning"
the huge Postgres tables to keep queries fast which will
hopefully also resolve these referential integrity problems.

-Ethan


On Sun, Jul/01/2007 09:37:17PM, Tim Prins wrote:

Hi Folks,

For a while now we have been getting errors when MTT tries to submit its test 
results to the database. The weird thing is that it only happens on our 1.2 
runs, not our trunk runs. 


Here is the first few lines of the error output:
*** WARNING: MTTDatabase server notice: fields is not in mtt3 database.
MTTDatabase server notice: test_build_section_name is not in mtt3
database.
MTTDatabase server notice: mpi_install_section_name is not in mtt3
database.
MTTDatabase server notice: mtt_version_minor is not in mtt3 database.
MTTDatabase server notice: stop_timestamp is not in mtt3 database.
MTTDatabase server notice: mtt_version_major is not in mtt3 database.
MTTDatabase server notice: number_of_results is not in mtt3 database.
MTTDatabase server notice: test_run_section_name is not in mtt3
database.

MTT submission for test run
MTTDatabase server error:
SQL QUERY:
 INSERT INTO speedy_test_run
 (np,
variant,
test_build_id,
command,
test_name,
test_run_id)
 VALUES
 ('8',
'1',
'20809',
'mpirun  -mca pml ob1 -mca btl_tcp_if_include eth0 -mca btl
tcp,sm,self -np 8 --prefix
/san/homedirs/mpiteam/mtt-runs/thor/20070630-Nightly/pb_0/installs/k1mL
/install collective/allgather ',
'allgather',
'14517807')

SQL ERROR: ERROR:  insert or update on table "speedy_test_run" violates
foreign key constraint "$1"
DETAIL:  Key (test_build_id)=(20809) is not present in table
"speedy_test_build".

Another strange thing is that the output says that the build information and 
some test results have been submitted, but I do not see them in the reporter. 
Any ideas?


Thanks,

Tim
___
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users

___
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users




Re: [OMPI users] Can I run MPI and non MPI programs together

2007-07-08 Thread Tim Prins
On Sunday 08 July 2007 08:22:04 pm Neville Clark wrote:
> I have openmpi installed and running, but have a need to run non mpi
> programs (3rd party software for which I don't have the source) together
> with mpi programs.
>
> Have managed to simplify the problem down to the following
>
> JobA
> int main(.)
> {
> printf("Starting JobA\n");
> MPI::Init();
> printf("JobA Init done\n");
> }
>
> JobB
> Int main(.)
> {
> printf("Starting JobB\n");
> }
>
> And running with
> mpirun -mca btl tcp,self,sm -np 1 -host lyre JobA  : -np 1 -host lyre JobB
>
> The output is the two "Starting ..." messages, and then it hangs.
>
> It would appear that the MPI::Init() is waiting for all Ranks to call
> MPI::Init() before continuing.
This is correct. You cannot run both mpi and non-mpi processes like this 
together. The best you can do is run mpirun twice.
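
For example (just a sketch based on the hosts in your command line; note the 
two jobs will not share an MPI_COMM_WORLD):

mpirun -mca btl tcp,self,sm -np 1 -host lyre JobA &
mpirun -np 1 -host lyre JobB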

Hope this helps,

Tim

>
> Please note the above works as expected if we run either two JobAs or two
> JobBs. Only have a problem if there is a mixture of JobAs and JobBs.
>
> Is there a way around this problem?
>
> Thanks in advance Neville




Re: [OMPI users] openmpi fails on mx endpoint busy

2007-07-06 Thread Tim Prins
Henk,

On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
> Dear Tim
>
> I followed the use of "--mca btl mx,self" as suggested in the FAQ
>
> http://www.open-mpi.org/faq/?category=myrinet#myri-btl
Yeah, that FAQ is wrong. I am working right now to fix it up. It should be 
updated this afternoon.

>
> When I use your extra mca value I get:
> >mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
>
> 
> --
>
> > WARNING: A user-supplied value attempted to override the read-only MCA
> > parameter named "btl_mx_shared_mem".
> >
> > The user-supplied value was ignored.
Oops, on the 1.2 branch this is a read-only parameter. On the current trunk 
the user can change it. Sorry for the confusion. Oh well, you should probably 
use Open MPI's shared memory support instead anyway.

So you should either pass '-mca btl mx,sm,self', or just pass nothing at all. 
Open MPI is fairly smart at figuring out what components to use, so you 
really should not need to specify anything.
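
For example, either of these should be fine (the second lets Open MPI pick
the components on its own):

  mpirun -mca btl mx,sm,self -np 4 ./cpi
  mpirun -np 4 ./cpi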

> followed by the same error messages as before.
>
> Note that although I add "self" the error messages complain about it
>
> missing:
> > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > If you specified the use of a BTL component, you may have
> >
> > forgotten a
> >
> > > component (such as "self") in the list of usable components.
>
> I checked the output from mx_info for both the current node and another,
> there seems not to be a problem.
> I attch the output from ompi_info --all
> Also
>
> >ompi_info | grep mx
>
>   Prefix:
> /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
>  MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
>  MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
>
> As a further check, I rebuild the exe with mpich and that works fine on
> the same node over myrinet. I wonder whether mx is properly include in
> my openmpi build.
> Use of ldd -v on the mpich exe gives references to libmyriexpress.so,
> which is not the case for the ompi built exe, suggesting something is
> missing?
No, this is expected behavior. The Open MPI executables are not linked to 
libmyriexpress.so, only the MX components. So if you run ldd 
on /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so, 
this should show the Myrinet library.

> I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
> and a listing of that directory is
>
> >ls /usr/local/Cluster-Apps/mx/mx-1.1.1
>
> bin  etc  include  lib  lib32  lib64  sbin
>
> This should be sufficient, I don't need --with-mx-libdir?
Correct.


Hope this helps,

Tim

>
> Thanks
>
> Henk
>
> > -Original Message-
> > From: users-boun...@open-mpi.org
> > [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> > Sent: 05 July 2007 18:16
> > To: Open MPI Users
> > Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> >
> > Hi Henk,
> >
> > By specifying '--mca btl mx,self' you are telling Open MPI
> > not to use its shared memory support. If you want to use Open
> > MPI's shared memory support, you must add 'sm' to the list.
> > I.e. '--mca btl mx,sm,self'. If you would rather use MX's shared
> > memory support, instead use '--mca btl mx,self --mca
> > btl_mx_shared_mem 1'. However, in most cases I believe Open
> > MPI's shared memory support is a bit better.
> >
> > Alternatively, if you don't specify any btls, Open MPI should
> > figure out what to use automatically.
> >
> > Hope this helps,
> >
> > Tim
> >
> > SLIM H.A. wrote:
> > > Hello
> > >
> > > I have compiled openmpi-1.2.3 with the --with-mx=
> > > configuration and gcc compiler. On testing with 4-8 slots I get an
> > >
> > > error message, the mx ports are busy:
> > >> mpirun --mca btl mx,self -np 4 ./cpi
> > >
> > > [node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with
> > > status=20 [node001:10074] mca_btl_mx_init:
> >
> > mx_open_endpoint() failed
> >
> > > with status=20 [node001:10073] mca_btl_mx_init: mx_open_endpoint()
> > > failed with status=20
> >
> > --
> >
> > > --
> > > --
> > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > If you specified the use of a BTL component, you may have
> >
> > forgotten a
> >

Re: [OMPI users] openmpi fails on mx endpoint busy

2007-07-05 Thread Tim Prins

Hi Henk,

By specifying '--mca btl mx,self' you are telling Open MPI not to use 
its shared memory support. If you want to use Open MPI's shared memory 
support, you must add 'sm' to the list. I.e. '--mca btl mx,sm,self'. If you 
would rather use MX's shared memory support, instead use '--mca btl 
mx,self --mca btl_mx_shared_mem 1'. However, in most cases I believe 
Open MPI's shared memory support is a bit better.


Alternatively, if you don't specify any btls, Open MPI should figure out 
what to use automatically.


Hope this helps,

Tim

SLIM H.A. wrote:

Hello

I have compiled openmpi-1.2.3 with the --with-mx=
configuration and gcc compiler. On testing with 4-8 slots I get an error
message, the mx ports are busy:


mpirun --mca btl mx,self -np 4 ./cpi

[node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
[node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
[node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with
status=20

--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.

... snipped
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 10071 on node node001 exited on
signal 1 (Hangup).


I would not expect mx messages as communication should not go through
the mx card? (This is a twin dual core  shared memory node)
The same happens when testing on 2 nodes, using a hostfile.
I checked the state of the mx card with mx_endpoint_info and mx_info,
they are healthy and free.
What is missing here?

Thanks

Henk

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI / SLURM Job Issues

2007-06-27 Thread Tim Prins
Hi Jeff,

If you submit a batch script, there is no need to do a salloc. 

See the Open MPI FAQ for details on how to run on SLURM:
http://www.open-mpi.org/faq/?category=slurm
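
For example (just a sketch of what the FAQ and Jeff's mail below describe),
put the mpirun in a small script, say myscript.sh:

  #!/bin/sh
  mpirun my_mpi_application

and submit it with something like 'srun -n 4 -b myscript.sh'. No salloc is
needed, and mpirun will pick up the allocation size automatically.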

Hope this helps.

Tim

On Wednesday 27 June 2007 14:21, Jeff Pummill wrote:
> Hey Jeff,
>
> Finally got my test nodes back and was looking at the info you sent. On
> the SLURM page, it states the following:
>
> *Open MPI*  relies upon SLURM to allocate
> resources for the job and then mpirun to initiate the tasks. When using
> salloc command, mpirun's -nolocal option is recommended. For example:
>
> $ salloc -n4 sh# allocates 4 processors and spawns shell for job
>
> > mpirun -np 4 -nolocal a.out
> > exit  # exits shell spawned by initial salloc command
>
> You are saying that I need to use the SLURM salloc, then pass SLURM a
> script? Or could I just add it all into the script? For example:
>
> #!/bin/sh
> salloc -n4
> mpirun my_mpi_application
>
> Then, run with srun -b myscript.sh
>
>
> Jeff F. Pummill
> Senior Linux Cluster Administrator
> University of Arkansas
> Fayetteville, Arkansas 72701
> (479) 575 - 4590
> http://hpc.uark.edu
>
> "A supercomputer is a device for turning compute-bound
> problems into I/O-bound problems." -Seymour Cray
>
> Jeff Squyres wrote:
> > Ick; I'm surprised that we don't have this info on the FAQ.  I'll try
> > to rectify that shortly.
> >
> > How are you launching your jobs through SLURM?  OMPI currently does
> > not support the "srun -n X my_mpi_application" model for launching
> > MPI jobs.  You must either use the -A option to srun (i.e., get an
> > interactive SLURM allocation) or use the -b option (submit a script
> > that runs on the first node in the allocation).  Your script can be
> > quite short:
> >
> > #!/bin/sh
> > mpirun my_mpi_application
> >
> > Note that OMPI will automatically figure out how many cpu's are in
> > your SLURM allocation, so you don't need to specify "-np X".  Hence,
> > you can run the same script without modification no matter how many
> > cpus/nodes you get from SLURM.
> >
> > It's on the long-term plan to get "srun -n X my_mpi_application"
> > model to work; it just hasn't bubbled up high enough in the priority
> > stack yet... :-\
> >
> > On Jun 20, 2007, at 1:59 PM, Jeff Pummill wrote:
> >> Just started working with OpenMPI / SLURM combo this morning. I can
> >> successfully launch this job from the command line and it runs to
> >> completion, but when launching from SLURM they hang.
> >>
> >> They appear to just sit with no load apparent on the compute nodes
> >> even though SLURM indicates they are running...
> >>
> >> [jpummil@trillion ~]$ sinfo -l
> >> Wed Jun 20 12:32:29 2007
> >> PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT SHARE GROUPS
> >> NODES   STATE NODELIST
> >> debug*   up   infinite 1-infinite   nonoall
> >> 8   allocated compute-1-[1-8]
> >> debug*   up   infinite 1-infinite   nonoall
> >> 1idle compute-1-0
> >>
> >> [jpummil@trillion ~]$ squeue -l
> >> Wed Jun 20 12:32:20 2007
> >>   JOBID PARTITION NAME USERSTATE   TIME TIMELIMIT
> >> NODES NODELIST(REASON)
> >>  79 debug   mpirun  jpummil  RUNNING   5:27
> >> UNLIMITED  2 compute-1-[1-2]
> >>  78 debug   mpirun  jpummil  RUNNING   5:58
> >> UNLIMITED  2 compute-1-[3-4]
> >>  77 debug   mpirun  jpummil  RUNNING   7:00
> >> UNLIMITED  2 compute-1-[5-6]
> >>  74 debug   mpirun  jpummil  RUNNING  11:39
> >> UNLIMITED  2 compute-1-[7-8]
> >>
> >> Are there any known issues of this nature involving OpenMPI and SLURM?
> >>
> >> Thanks!
> >>
> >> Jeff F. Pummill
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [MTT users] OMPI C++ tests just split

2007-06-23 Thread Tim Prins
On Saturday 23 June 2007 10:58:13 am Jeff Squyres wrote:
> If you are running the MPI 2 C++ tests for OMPI testing, note that I
> just split it into 2 executables.  So if you currently have this in
> your .ini file:
>
>  simple_pass:tests = src/mpi2c++_test
>
> You need to change it to this:
>
>  simple_pass:tests = src/mpi2c++_test src/mpi2c++_dyanmics_test

It is actually (note the spelling of 'dynamics'):
 simple_pass:tests = src/mpi2c++_test src/mpi2c++_dynamics_test


Tim


[MTT users] Differentiating builds in the reporter

2007-06-14 Thread Tim Prins
Hi everyone,

This may be a silly question, but I am new to configuring MTT so I'll ask.

Here at IU on our cluster Thor, we are running the trunk both with threads 
enabled and with threads disabled. How should these builds be differentiated 
for the reporter?

Thanks,

Tim


Re: [OMPI users] Compilation bug in libtool

2007-06-02 Thread Tim Prins
Hi Daniel,

I am able to replicate your problem on Mandriva 2007.1, however I'm not sure 
what is going on. 

I was able to build the tarball just fine though, so you may try that.

Tim

On Friday 01 June 2007 12:32:54 pm Daniel Pfenniger wrote:
> Hello,
>
> version 1.2.2 refuses to compile on Mandriva 2007.1:
> (more details are in the attached lg files)
> ...
>
> make[2]: Entering directory `/usr/src/rpm/BUILD/openmpi-1.2.2/opal/asm'
> depbase=`echo asm.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; \
> if /bin/sh ../../libtool --tag=CC --mode=compile gcc
> -DHAVE_CONFIG_H -I. -I. -I../../opal/include -I../../orte/include
> -I../../ompi/include -I../../ompi  
>/include -I../..-O3 -DNDEBUG -finline-functions -fno-strict-aliasing
> -pthr ead -MT asm.lo -MD -MP -MF "$depbase.Tpo" -c -o asm.lo asm.c; \ then
> mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f "$depbase.Tpo"; exi t 1; fi
> ../../libtool: line 813: X--tag=CC: command not found
> ../../libtool: line 846: libtool: ignoring unknown tag : command not found
> ../../libtool: line 813: X--mode=compile: command not found
> ../../libtool: line 979: *** Warning: inferring the mode of operation is
> depreca ted.: command not found
> ../../libtool: line 980: *** Future versions of Libtool will require
> --mode=MODE be specified.: command not found ../../libtool: line 1123:
> Xgcc: command not found
> ../../libtool: line 1123: X-DHAVE_CONFIG_H: command not found
> ../../libtool: line 1123: X-I.: command not found
> ../../libtool: line 1123: X-I.: command not found
> ../../libtool: line 1123: X-I../../opal/include: No such file or directory
> ../../libtool: line 1123: X-I../../orte/include: No such file or directory
> ../../libtool: line 1123: X-I../../ompi/include: No such file or directory
> ../../libtool: line 1123: X-I../../ompi/include: No such file or directory
> ../../libtool: line 1123: X-I../..: No such file or directory
> ../../libtool: line 1123: X-O3: command not found
> ../../libtool: line 1123: X-DNDEBUG: command not found
> ../../libtool: line 1123: X-finline-functions: command not found
> ../../libtool: line 1123: X-fno-strict-aliasing: command not found
> ../../libtool: line 1123: X-pthread: command not found
> ../../libtool: line 1123: X-MT: command not found
> ../../libtool: line 1123: Xasm.lo: command not found
> ../../libtool: line 1123: X-MD: command not found
> ../../libtool: line 1123: X-MP: command not found
> ../../libtool: line 1123: X-MF: command not found
> ../../libtool: line 1123: X.deps/asm.Tpo: No such file or directory
> ../../libtool: line 1123: X-c: command not found
> ../../libtool: line 1175: Xasm.lo: command not found
> ../../libtool: line 1180: libtool: compile: cannot determine name of
> library obj ect from `': command not found make[2]: *** [asm.lo] Error 1
> make[2]: Leaving directory `/usr/src/rpm/BUILD/openmpi-1.2.2/opal/asm'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/usr/src/rpm/BUILD/openmpi-1.2.2/opal'
> make: *** [all-recursive] Error 1
> [root openmpi-1.2.2]#




Re: [OMPI users] port(s) and protocol used by openmpi for interprocess communication

2007-05-18 Thread Tim Prins
Open MPI uses TCP, and does not use any fixed ports. We use whatever ports the 
operating system gives us. At this time there is no way to specify what ports 
to use.
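
Since the ports are picked by the operating system, one workaround for the
sniffer (an untested suggestion) is to filter on the hosts rather than on
ports, e.g. a capture filter such as:

  tcp and (host node01 or host node02)

where node01/node02 are placeholders for your compute nodes.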

Hope this helps,

Tim

On Friday 18 May 2007 05:19 am, Code Master wrote:
> I run my openmpi-based application in a multi-node cluster.  There is also
> a sniffer computer (installed with wireshark) attached to a listener port
> on the switch to sniff any packets.
>
> However I would like to know the protocol (UDP or TCP) as well as the ports
> used by openmpi for interprocess communication so that wireshark can only
> capture these packets.
>
> Thanks!


Re: [OMPI users] debugging my program in openmpi

2007-05-10 Thread Tim Prins
On Thursday 10 May 2007 07:19 pm, Code Master wrote:
> I am a newbie in openmpi.  I have just compiled a program with -g -pg (an
> mpi program with a listener thread, which all MPI calls except
> initialization and MPI_Finalize are placed within)  and I run it.  However
> it crashes and I can't find any core dump, even I set the core dump max
> size to 10 by
>
> ulimit -c 10
You probably need to set the ulimit in your .bashrc to get a core dump, since 
processes are (by default) started via ssh.
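
For example (assuming a bash login shell on the compute nodes), adding this
line to ~/.bashrc should do it:

  ulimit -c unlimited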

>
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:(nil)
> [0] func:raytrace [0x8185581]
> [1] func:[0xe440]
> [2] func:raytrace [0x8056736]
> [3] func:/lib/tls/libpthread.so.0 [0x40063b63]
> [4] func:/lib/tls/libc.so.6(__clone+0x5a) [0x4014618a]
> *** End of error message ***
> I tried to use gdb and I ran:
> gdb mpirun
>
> run --hostfile ../hostfile n 16 raytrace -finputs/car.env
>
> when I type
>
> backtrace
>
>
> after it crashes, it just said "no stack"
This is because you are debugging mpirun, and not your application. Mpirun 
runs to completion successfully, but it is your program which is crashing.
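
One way to debug the application itself (a rough sketch; the xterm trick
assumes X forwarding works from the compute nodes, and is easier with a
small -np) is something like:

  mpirun --hostfile ../hostfile -np 2 xterm -e gdb --args ./raytrace -finputs/car.env

or, with core dumps enabled as above, load the core file into gdb afterwards
and run 'backtrace' there.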

Hope this helps,

Tim

>
> I really want to find out what lines in what function are responsible for
> the crash.  What can I do to find out the culprit?


Re: [OMPI users] OpenMPI 1.2.1: cannot install on IBM SP4

2007-05-10 Thread Tim Prins
On Thursday 10 May 2007 11:35 am, Laurent Nguyen wrote:
> Hi Tim,
>
> Ok, I thank you for all these precisions. I also added "static int
> pls_poe_cancel_operation(void)" similarly to you, and I can continue the
> compilation. But, I had another problem. In ompi/mpi/cxx/mpicxx.cc,
> three variables are already defined. The preprocessor sets them to the
> C constants. So, I commented out these lines:
>//const int SEEK_SET = MPI_SEEK_SET;
>//const int SEEK_CUR = MPI_SEEK_CUR;
>//const int SEEK_END = MPI_SEEK_END;
I remember there was a problem with these constants earlier. You should be 
able to disable them by passing --disable-mpi-cxx-seek to configure. 
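
That is, something like:

  ./configure --disable-mpi-cxx-seek [...your other configure options...]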


> I was interested for OpenMPI because it support MPI-2. Since OpenMPI
> 1.1.1, I install all the version on my SP4 for testing. My impressions are:
> - it seems to be very difficult for developers to implement OpenMPI on
> SP4 and I hope one day they achieve it ;)
> - in my context, my institution puts many restrictions on the use of our
> machine, that's why my tests are incomplete. (On the same way, rsh
> command is forbidden between our nodes...)
Note that the name 'rsh' is a bit of a misnomer. The rsh launcher actually 
uses ssh by default.

Tim
>
> So, I really thank you for your explanations and precisions.
>
> Best Regards,
>
>
> **
> NGUYEN Anh-Khai Laurent
> Equipe Support Utilisateur
>
> Email:laurent.ngu...@idris.fr
> Tél  :01.69.35.85.66
> Adresse  :IDRIS - Institut du Développement et des Ressources en
>Informatique Scientifique
>CNRS
>Batiment 506
>BP 167
>        F - 91403 ORSAY Cedex
> Site Web :http://www.idris.fr
> **
>
> Tim Prins a écrit :
> > Hi Laurent,
> >
> > Unfortunately, as far as I know, none of the current Open MPI developers
> > has access to a system with POE, so the POE process launcher has fallen
> > into disrepair. Attached is a patch that should allow you to compile
> > (however, you may also need to add #include  to
> > pls_poe_module.c).
> >
> > Though this should allow the compile to succeed, launching with POE may
> > not work (it has not been tested for quite a while). If it doesn't work,
> > you should use the rsh launcher instead (pass -mca pls rsh on the command
> > line, or set the parameter using one of the methods here:
> > http://www.open-mpi.org/faq/?category=tuning#setting-mca-params).
> >
> > Sorry about this. We have an IBM machine at my institution which I am
> > told will have POE on it 'soon', but I am not sure when. Once it does, we
> > will be working on getting POE well supported again.
> >
> > I should mention that we do use LoadLeveler on one of our machines and
> > Open MPI seems to work with it quite well. I would be interested in
> > hearing how it works for you.
> >
> > Hope this helps, let me know if this works.
> >
> > Thanks,
> >
> > Tim
> >
> > On Thursday 10 May 2007 02:57 am, Laurent Nguyen wrote:
> >> Hello,
> >>
> >> I tried to install OpenMPI 1.2 but I saw there some problems when
> >> compiling files with POE. When OpenMPI 1.2.1 was released, I saw in the
> >> bug fixes that this problem was fixed. Then I tried, but it still
> >> doesn't work. The problem comes from orte/mca/pls/poe/pls_poe_module.c.
> >> A static function "static int pls_poe_cancel_operation(void);" is
> >> declared but not defined in the files. I don't know if my configuration
> >> make it bug.
> >>
> >> So, if someone achieved to install OpenMPI 1.2.1 on IBM, I would like to
> >> have some advices.
> >>
> >> Thank you for your help,
> >>
> >> PS: I attached some output files of my installation
> >>
> >> 
> >>
> >> Index: orte/mca/pls/poe/pls_poe_module.c
> >> ===
> >> --- orte/mca/pls/poe/pls_poe_module.c  (revision 14640)
> >> +++ orte/mca/pls/poe/pls_poe_module.c  (working copy)
> >> @@ -37,6 +37,7 @@
> >>  #include "opal/mca/base/mca_base_param.h"
> >>  #include "opal/util/argv.h"
> >>  #include "opal/util/opal_environ.h"
> >> +#include "opal/util/output.h"
> >>
> >>  #include "orte/mca/errmgr/errmgr.h"
> >>  #include "orte/mca/gpr/gpr.h"
> >> @@ -69,7 +70,10 @@
> >>  static int pls_poe_signal_job(orte_jobid_t jobid, int32_t signal,
> >> opal_list_t *attrs); static int pls_poe_signal_proc(const
> >> orte_process_name_t *name, int32_t signal); static int
> >> pls_poe_finalize(void);
> >> -static int pls_poe_cancel_operation(void);
> >> +static int pls_poe_cancel_operation(void) {
> >> +return ORTE_ERR_NOT_IMPLEMENTED;
> >> +}
> >> +
> >>
> >>  orte_pls_base_module_t orte_pls_poe_module = {
> >>  pls_poe_launch_job,
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] OpenMPI 1.2.1: cannot install on IBM SP4

2007-05-10 Thread Tim Prins
Hi Laurent,

Unfortunately, as far as I know, none of the current Open MPI developers has 
access to a system with POE, so the POE process launcher has fallen into 
disrepair. Attached is a patch that should allow you to compile (however, you 
may also need to add #include  to pls_poe_module.c). 

Though this should allow the compile to succeed, launching with POE may not 
work (it has not been tested for quite a while). If it doesn't work, you 
should use the rsh launcher instead (pass -mca pls rsh on the command line, 
or set the parameter using one of the methods here: 
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params). 

Sorry about this. We have an IBM machine at my institution which I am told 
will have POE on it 'soon', but I am not sure when. Once it does, we will be 
working on getting POE well supported again.

I should mention that we do use LoadLeveler on one of our machines and Open 
MPI seems to work with it quite well. I would be interested in hearing how it 
works for you.

Hope this helps, let me know if this works.

Thanks,

Tim

On Thursday 10 May 2007 02:57 am, Laurent Nguyen wrote:
> Hello,
>
> I tried to install OpenMPI 1.2 but I saw there some problems when
> compiling files with POE. When OpenMPI 1.2.1 was released, I saw in the
> bug fixes that this problem was fixed. Then I tried, but it still
> doesn't work. The problem comes from orte/mca/pls/poe/pls_poe_module.c.
> A static function "static int pls_poe_cancel_operation(void);" is
> declared but not defined in the files. I don't know if my configuration
> make it bug.
>
> So, if someone achieved to install OpenMPI 1.2.1 on IBM, I would like to
> have some advices.
>
> Thank you for your help,
>
> PS: I attached some output files of my installation
Index: orte/mca/pls/poe/pls_poe_module.c
===
--- orte/mca/pls/poe/pls_poe_module.c	(revision 14640)
+++ orte/mca/pls/poe/pls_poe_module.c	(working copy)
@@ -37,6 +37,7 @@
 #include "opal/mca/base/mca_base_param.h"
 #include "opal/util/argv.h"
 #include "opal/util/opal_environ.h"
+#include "opal/util/output.h"

 #include "orte/mca/errmgr/errmgr.h"
 #include "orte/mca/gpr/gpr.h"
@@ -69,7 +70,10 @@
 static int pls_poe_signal_job(orte_jobid_t jobid, int32_t signal, opal_list_t *attrs);
 static int pls_poe_signal_proc(const orte_process_name_t *name, int32_t signal);
 static int pls_poe_finalize(void);
-static int pls_poe_cancel_operation(void);
+static int pls_poe_cancel_operation(void) {
+return ORTE_ERR_NOT_IMPLEMENTED;
+}
+

 orte_pls_base_module_t orte_pls_poe_module = {
 pls_poe_launch_job,


Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed withstatus=20

2007-04-02 Thread Tim Prins
Yes, only the first segfault is fixed in the nightly builds. You can  
run mx_endpoint_info to see how many endpoints are available and if  
any are in use.


As far as the segfault you are seeing now, I am unsure what is  
causing it. Hopefully someone who knows more about that area of the  
code than me can help.


Thanks,

Tim

On Apr 2, 2007, at 6:12 AM, de Almeida, Valmor F. wrote:



Hi Tim,

I installed the openmpi-1.2.1a0r14178 tarball (took this  
opportunity to
use the intel fortran compiler instead gfortran). With a simple  
test it

seems to work but note the same messages

->mpirun -np 8 -machinefile mymachines a.out
[x1:25417] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:25418] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31983] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31982] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x2:31980] mca_btl_mx_init: mx_open_endpoint() failed with status=20
Hello, world! I am 4 of 7
Hello, world! I am 0 of 7
Hello, world! I am 1 of 7
Hello, world! I am 5 of 7
Hello, world! I am 2 of 7
Hello, world! I am 7 of 7
Hello, world! I am 6 of 7
Hello, world! I am 3 of 7

and the machinefile is

x1  slots=4 max_slots=4
x2  slots=4 max_slots=4

However with a realistic code, it starts fine (same messages as above)
and somewhere later:

[x1:25947] *** Process received signal ***
[x1:25947] Signal: Segmentation fault (11)
[x1:25947] Signal code: Address not mapped (1)
[x1:25947] Failing at address: 0x14
[x1:25947] [ 0] [0xb7f00440]
[x1:25947] [ 1]
/opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so 
(mca_pml_ob1_send_r

equest_start_copy+0x13e) [0xb7a80e6e]
[x1:25947] [ 2]
/opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so 
(mca_pml_ob1_send_r

equest_process_pending+0x1e3) [0xb7a82463]
[x1:25947] [ 3] /opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_pml_ob1.so
[0xb7a7ebf8]
[x1:25947] [ 4]
/opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_btl_sm.so 
(mca_btl_sm_componen

t_progress+0x1813) [0xb7a41923]
[x1:25947] [ 5]
/opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_bml_r2.so 
(mca_bml_r2_progress

+0x36) [0xb7a4fdd6]
[x1:25947] [ 6] /opt/ompi/lib/libopen-pal.so.0(opal_progress+0x79)
[0xb7dc41a9]
[x1:25947] [ 7] /opt/ompi/lib/libmpi.so.0(ompi_request_wait_all+0xb5)
[0xb7e90145]
[x1:25947] [ 8]
/opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so 
(ompi_coll_tuned

_sendrecv_actual+0xc9) [0xb7a167a9]
[x1:25947] [ 9]
/opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so 
(ompi_coll_tuned

_barrier_intra_recursivedoubling+0xe4) [0xb7a1bfb4]
[x1:25947] [10]
/opt/openmpi-1.2.1a0r14178/lib/openmpi/mca_coll_tuned.so 
(ompi_coll_tuned

_barrier_intra_dec_fixed+0x48) [0xb7a16a18]
[x1:25947] [11] /opt/ompi/lib/libmpi.so.0(PMPI_Barrier+0x69)
[0xb7ea4059]
[x1:25947] [12] driver0(_ZNK3MPI4Comm7BarrierEv+0x20) [0x806baf4]
[x1:25947] [13] driver0(_ZN3gms12PartitionSet14ReadData_Case2Ev+0xc92)
[0x808bb78]
[x1:25947] [14] driver0(_ZN3gms12PartitionSet8ReadDataESsSsSst+0xbc)
[0x8086f96]
[x1:25947] [15] driver0(main+0x181) [0x8068c7f]
[x1:25947] [16] /lib/libc.so.6(__libc_start_main+0xdc) [0xb7b6a824]
[x1:25947] [17] driver0(__gxx_personality_v0+0xb9) [0x8068991]
[x1:25947] *** End of error message ***
mpirun noticed that job rank 0 with PID 25945 on node x1 exited on
signal 15 (Terminated).
7 additional processes aborted (not shown)


This code does run to completion using ompi-1.2 if I use only 2 slots
per machine.

Thanks for any help.

--
Valmor


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]

On

Behalf Of Tim Prins
Sent: Friday, March 30, 2007 10:49 PM
To: Open MPI Users
Subject: Re: [OMPI users] mca_btl_mx_init: mx_open_endpoint() failed
withstatus=20

Hi Valmor,

What is happening here is that when Open MPI tries to create MX

endpoint

for
communication, mx returns code 20, which is MX_BUSY.

At this point we should gracefully move on, but there is a bug in  
Open

MPI

1.2
which causes a segmentation fault in case of this type of error. This

will

be
fixed in 1.2.1, and the fix is available now in the 1.2 nightly

tarballs.


Hope this helps,

Tim

On Friday 30 March 2007 05:06 pm, de Almeida, Valmor F. wrote:

Hello,

I am getting this error any time the number of processes requested

per

machine is greater than the number of cpus. I suspect it is

something on

the configuration of mx / ompi that I am missing since another

machine I

have without mx installed runs ompi correctly with oversubscription.

Thanks for any help.

--
Valmor


->mpirun -np 3 --machinefile mymachines-1 a.out
[x1:23624] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[x1:23624] *** Process received signal *** [x1:23624] Signal:
Segmentation fault (11) [x1:23624] Signal code: Address not mapped

(1)

[x1:23624] Failing at address: 0x20 [x1:23624] [ 0] [0xb7f7f440]
[x1:23624] [ 1]
/opt/openmpi-1.2/lib/openmpi/mca_btl_mx.so(mca_btl_mx_finalize+0x25)
[0xb7aca825] [x1:23624] [ 2]

Re: [MTT users] [devel-core] Recent OMPI Trunk fails MPI_Allgatherv_* MTT tests

2007-04-02 Thread Tim Prins
I'm not an MTT developer, but I'll answer #1...

MTT only looks at the return codes from the test programs that are run. This 
is all well and good, but the problem is that some Intel tests return a 
meaningful value, and some don't. A while ago I went through and fixed all 
the C tests so that they return a meaningful value. But I don't know Fortran, 
so I did not even look at the Fortran versions of the tests. It appears that 
they are not returning a meaningful value, and that somebody needs to fix 
them.

Tim

On Sunday 01 April 2007 09:25 pm, Tim Mattox wrote:
> Hi All,
> I just checked the recent nightly MTT results and found two things of note,
> one for the MTT community, the other for the OMPI developers.
>
> For both, see http://www.open-mpi.org/mtt/reporter.php?do_redir=143
> for details of the failed MTT tests with the OMPI trunk at r14180.
>
> 1) For MTT developers:
> The MTT intel test suite is incorrectly seeing a failed MPI_Allgatherv_f
> test as passed, yet is correctly detecting that the MPI_Allgatherv_c
> test is failing.
> The STDOUT from "passed" MPI_Allgatherv_f seems to indicate that the tests
> actually failed in a similar way to the _c version, but MTT thinks it
> passed. I've not had time to diagnose why MTT is missing this...  anyone
> else have some spare cycles to look at this?
>
> 2) For OMPI developers:
> The MPI_Allgatherv_* tests are failing as of r14180 in all test conditions
> on the IU machines, and others, yet this passed the night before on r14172.
>
> Looking at the svn log for r#'s r14173 thru r14180, I can narrow it down to
> one of these changes as the culprit:
> https://svn.open-mpi.org/trac/ompi/changeset/14180
> https://svn.open-mpi.org/trac/ompi/changeset/14179
> https://svn.open-mpi.org/trac/ompi/changeset/14174 (Not likely)
>
> My money is on the much larger r14180 changeset.
> The other r#'s aren't culprits for obvious reasons.


Re: [OMPI users] Torque/OpenMPI

2007-04-01 Thread Tim Prins

Hi Barry,

The problem is the line:
ncpus=`wc -l $PBS_NODEFILE`

wc will print out the file name after the count. So ncpus gets 
"16 /var/spool/torque/aux//350.wc01" and your mpirun command will look like:

mpirun -np 16 /var/spool/torque/aux//350.wc01 /home/test/hpcc-1.0.0/hpcc

So mpirun will try to execute  /var/spool/torque/aux//350.wc01

One solution is to rely on the fact that Open MPI will run on every available  
slot if -np is not passed. So you could just use the script:


HPCC_HOME=/home/test/hpcc-1.0.0
mpirun $HPCC_HOME/hpcc

This will launch one process on every CPU reported by Torque.

Alternatively, you could have wc read from stdin instead of from a file:
ncpus=`wc -l < $PBS_NODEFILE`

this will avoid the filename being printed.
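
Putting it together, the whole script would look something like this (sketch):

  #!/bin/sh
  HPCC_HOME=/home/test/hpcc-1.0.0
  ncpus=`wc -l < $PBS_NODEFILE`
  mpirun -np $ncpus $HPCC_HOME/hpcc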

Hope this helps,

Tim

On Apr 1, 2007, at 9:16 AM, Barry Evans wrote:


Hello,



Having a bit of trouble running Open MPI 1.2 under Torque 2.1.8.



My Script contains the following:

---

HPCC_HOME=/home/test/hpcc-1.0.0

ncpus=`wc -l $PBS_NODEFILE`

mpirun -np $ncpus $HPCC_HOME/hpcc

---





When I try to run on 4 nodes, 4 cpus each I receive the following  
in my err file:




[node003:04409] [0,0,4] ORTE_ERROR_LOG: Not found in file  
odls_default_module.c at line 1188


[node008:06691] [0,0,1] ORTE_ERROR_LOG: Not found in file  
odls_default_module.c at line 1188


[node007:04352] [0,0,2] ORTE_ERROR_LOG: Not found in file  
odls_default_module.c at line 1188


-- 



Failed to find or execute the following executable:



Host:   node007

Executable: /var/spool/torque/aux//350.wc01



Cannot continue.

-- 



[no--- 
---


Failed to find or execute the following executable:



Host:   node004

Executable: /var/spool/torque/aux//350.wc01



Cannot continue.

-- 



de004:04364] [0,0,3] ORTE_ERROR_LOG: Not found in file  
odls_default_module.c at line 1188


-- 



Failed to find or execute the following executable:



Host:   node003

Executable: /var/spool/torque/aux//350.wc01



Cannot continue.

-- 



-- 



Failed to find or execute the following executable:



Host:   node008

Executable: /var/spool/torque/aux//350.wc01



Cannot continue.

-- 



[node007:04352] [0,0,2] ORTE_ERROR_LOG: Not found in file orted.c  
at line 588


[node008:06691] [0,0,1] ORTE_ERROR_LOG: Not found in file orted.c  
at line 588


[node004:04364] [0,0,3] ORTE_ERROR_LOG: Not found in file orted.c  
at line 588


[node003:04409] [0,0,4] ORTE_ERROR_LOG: Not found in file orted.c  
at line 588






Has anyone seen this before? It seems odd that openmpi would be  
trying to execute what is effectively the host file. I stuck a  
sleep in to make sure the file was being distributed, and sure  
enough, it was there. I am able to run mvapich through torque  
without issue and openmpi from the command line.




Cheers,

Barry Evans

Technical Manager

OCF plc

+44 (0)7970 148 121

bev...@ocf.co.uk



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] error in MPI_Waitall

2007-03-23 Thread Tim Prins

Steve,

This list is for supporting Open MPI, not MPICH2 (MPICH2 is an  
entirely different software package).  You should probably redirect  
your question to their support lists.


Thanks,

Tim

On Mar 23, 2007, at 12:46 AM, Jeffrey Stephen wrote:


Hi,

I am trying to run an MPICH2 application over 2 processors on a  
dual processor x64 Linux box (SuSE 10). I am getting the following  
error message:


--
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(242)..: MPI_Waitall(count=2,  
req_array=0x5bbda70, status_array=0x7fff461d9ce0) failed
MPIDI_CH3_Progress_wait(212)..: an error occurred while  
handling an event returned by MPIDU_Sock_Wait()

MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..: connection failure  
(set=0,sock=1,errno=104:Connection reset by peer)

rank 0 in job 2  Demeter_18432   caused collective abort of all ranks
  exit status of rank 0: killed by signal 11
--

The "cpi" example that comes with MPICH2 executes correctly. I am  
using MPICH2-1.0.5p2 which I compiled from source.


Does anyone know what the problem is?

cheers
steve
** 
**


Climate change will impact on everyone… Queensland takes action

Register your interest in attending at http://www.nrw.qld.gov.au/ 
events/nrconference/index.html


Natural Resources Conference 2007

Climate Change - Queensland takes action

Wednesday 23 May 2007
Brisbane Convention and Exhibition Centre

** 
**


The information in this email together with any attachments is

intended only for the person or entity to which it is addressed

and may contain confidential and/or privileged material.

Any form of review, disclosure, modification, distribution

and/or publication of this email message is prohibited, unless

as a necessary part of Departmental business.

If you have received this message in error, you are asked to

inform the sender as quickly as possible and delete this message

and any copies of this message from your computer and/or your

computer system network.

** 
**



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] hostfile syntax

2007-03-22 Thread Tim Prins
Geoff,

'cpu', 'slots', and 'count' all do exactly the same thing.
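
So any one of these hostfile lines is equivalent (just use one of them):

  mybox cpu=4
  mybox slots=4
  mybox count=4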

Tim

On Thursday 22 March 2007 03:03 pm, Geoff Galitz wrote:
> Does the hostfile understand the syntax:
>
> mybox cpu=4
>
> I have some legacy code and scripts that I'd like to move without
> modifying if possible.  I understand the syntax is supposed to be:
>
> mybox slots=4
>
> but using "cpu" seems to work.  Does that achieve the same thing?
>
> -geoff
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] mpirun exit status for non-existent executable

2007-03-20 Thread Tim Prins
Well that's not a good thing. I have filed a bug about this 
(https://svn.open-mpi.org/trac/ompi/ticket/954) and will try to look into it  
soon, but don't know when it will get fixed.


Thanks for bringing this to our attention!

Tim

On Mar 20, 2007, at 1:39 AM, Bill Saphir wrote:



If you ask mpirun to launch an executable that does not exist, it
fails, but returns an exit status of 0.
This makes it difficult to write scripts that invoke mpirun and need
to check for errors.
I'm wondering if a) this is considered a bug and b) whether it might
be fixed in a near term release.

Example:


orterun -np 2 asdflkj
-- 
--

--
Failed to find the following executable:

Host:   build-linux64
Executable: asdflkj

Cannot continue.
-- 
--

--

echo $?

0


I see this behavior for both 1.2 and 1.1.x.

Thanks for your help.

Bill

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] remote execution problem

2007-03-19 Thread Tim Prins

David,

Have you tried something like

mpirun -np 1  --host talisker4 hostname

If that hangs, try adding '--debug-daemons' to the command line and  
see if the output from that helps. If not, please send the output to  
the list.


Thanks,

Tim

On Mar 19, 2007, at 1:59 AM, David Burns wrote:


I neglected to mention that the test is currently running on 100 Mbps
ethernet. I have also tested the setup using a simple "hello world my
rank is_" program and get the same hanging problem.


3d...@qlink.queensu.ca wrote:
If anyone could help me out with this I would greatly appreciate  
it. I
have already read through the entire FAQ and havent seen anyone  
with a

similar problem.

I have successfully tested and run the ompi application I've coded  
locally

on both computers talisker2 and talisker4

mpirun -np 1 --host localhost fdtd : -np 2 --host localhost rnode

However, when attempting to execute processes remotely, eg

mpirun -np 1 --host localhost fdtd : -np 2 --host talisker4 rnode

Nothing happens. The shell just sits there, nothing prints (despite
stdouts), and does not return until I kill it. I have set up ssh with
rsa-authentication, no passphrase. The paths are all set; I have  
tried

purposefully missetting them and the error is reported and returns as
expected (so it isnt that).

More info about the system- fedora core 5, (Open MPI) 1.1.4.  
config.log

and ompi_info outputs attached. Any help or ideas of where to go next
would be greatly appreciated.

Thanks,
David

- 
---


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
- 
---


No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.5.446 / Virus Database: 268.18.13/725 - Release Date:  
17/03/2007 12:33 PM



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] orted takes 100 percent cpu, how to avoid this??

2007-03-19 Thread Tim Prins

Bala,

This is a known problem with the 1.1 series. The bad news is that I  
know of no fix for this, though many people work around this problem  
by running a cleanup script after each unclean run. The good news is  
that the 1.2 series is MUCH better, though still not perfect. I would  
suggest trying out 1.2 and seeing if it works for you.
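
A typical cleanup script is just a loop over the nodes that kills any
leftover daemons, something like (a sketch -- /path/to/nodelist is a
placeholder for however you list your nodes):

  #!/bin/sh
  for node in `cat /path/to/nodelist`; do
      ssh $node pkill -9 orted
  done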


Hope this helps,

Tim

On Mar 17, 2007, at 9:58 AM, Bala wrote:


Hi All,
   we have installed 16 node Intel X86_64
dual CPU and dual core cluster( blade servers)
with OFED-1.1, that installs OpenMPI as well.

 we are able to run some sample programs also,
after few time when we run the sample and do
some Ctrl+C to stop the program we notice that
some "orted" is still running and takes 100% cpu
as well.

1. why some times this "orted" process not stopped
   and how to avoid this??

2. we can kill with -9 option, but the problem is
  while running various OpenMPI programs we can
  see each one has one "orted", don't know
  which process is idle to kill.

regards,
Bala.





Re: [OMPI users] MPI_Comm_Spawn

2007-03-05 Thread Tim Prins

Never mind, I was just able to replicate it. I'll look into it.

Tim

On Mar 5, 2007, at 4:26 PM, Tim Prins wrote:


That is possible. Threading support is VERY lightly tested, but I
doubt it is the problem since it always fails after 31 spawns.

Again, I have tried with these configure options and the same version
of Open MPI and still have not been able to replicate this (after
letting it spawn over 500 times). Have you been able to try a more
recent version of Open MPI? What kind of system is it? How many nodes
are you running on?

Tim

On Mar 5, 2007, at 1:21 PM, rozzen.vinc...@fr.thalesgroup.com wrote:



Maybe the problem comes from the configuration options.
The configuration options used are :
./configure  --enable-mpi-threads --enable-progress-threads --with-
threads=posix --enable-smp-locks
Could you give me your point of view about that please ?
Thanks

-Message d'origine-
De : users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
De la
part de Ralph H Castain
Envoyé : mardi 27 février 2007 16:26
À : Open MPI Users <us...@open-mpi.org>
Objet : Re: [OMPI users] MPI_Comm_Spawn


Now that's interesting! There shouldn't be a limit, but to be
honest, I've
never tested that mode of operation - let me look into it and see.
It sounds
like there is some counter that is overflowing, but I'll look.

Thanks
Ralph


On 2/27/07 8:15 AM, "rozzen.vinc...@fr.thalesgroup.com"
<rozzen.vinc...@fr.thalesgroup.com> wrote:


Do you know if there is a limit to the number of MPI_Comm_spawn we
can use in
order to launch a program?
I want to start and stop a program several times (with the function
MPI_Comm_spawn) but every time after  31 MPI_Comm_spawn, I get a
"segmentation
fault".
Could you give me your point of you to solve this problem?
Thanks

/*file .c : spawned  the file Exe*/
#include 
#include 
#include 
#include "mpi.h"
#include 
#include 
#include 
#include 
#define EXE_TEST "/home/workspace/test_spaw1/src/ 
Exe"




int main( int argc, char **argv ) {

long *lpBufferMpi;
MPI_Comm lIntercom;
int lErrcode;
MPI_Comm lCommunicateur;
int lRangMain,lRangExe,lMessageEnvoi,lIter,NiveauThreadVoulu,
NiveauThreadObtenu,lTailleBuffer;
int *lpMessageEnvoi = &lMessageEnvoi;
MPI_Status lStatus; /*status de reception*/

 lIter=0;


/* MPI environnement */

printf("main***\n");
printf("main : Lancement MPI*\n");

NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,
 &NiveauThreadObtenu );
lpBufferMpi = calloc( 1, sizeof(long));
MPI_Buffer_attach( (void*)lpBufferMpi, 1 * sizeof(long) );

while (lIter<1000){
lIter ++;
lIntercom=(MPI_Comm)-1 ;

MPI_Comm_spawn( EXE_TEST, NULL, 1, MPI_INFO_NULL,
  0, MPI_COMM_WORLD, &lIntercom, &lErrcode );
printf( "%i main***MPI_Comm_spawn return : %d\n",lIter,
lErrcode );

if(lIntercom == (MPI_Comm)-1 ){
printf("%i Intercom null\n",lIter);
return 0;
}
MPI_Intercomm_merge(lIntercom, 0, &lCommunicateur);
MPI_Comm_rank( lCommunicateur, &lRangMain);
lRangExe=1-lRangMain;

printf("%i main***Rang main : %i   Rang exe : %i
\n",lIter,(int)lRangMain,(int)lRangExe);
sleep(2);

}


/* Arret de l'environnement MPI */
lTailleBuffer=1* sizeof(long);
MPI_Buffer_detach( (void*)lpBufferMpi, &lTailleBuffer );
MPI_Comm_free( &lCommunicateur );
MPI_Finalize( );
free( lpBufferMpi );

printf( "Main = End .\n" );
return 0;

}
/
 
*


***/
Exe:
#include 
#include 
#include 
#include 
#include  /* pour sleep() */
#include 
#include 
#include "mpi.h"

int main( int argc, char **argv ) {
/*1)pour communiaction MPI*/
MPI_Comm lCommunicateur;/*communicateur du process*/
MPI_Comm CommParent;/*Communiacteur parent à
récupérer*/
int lRank;  /*rang du communicateur du
process*/
int lRangMain;/*rang du séquenceur si lancé en
mode normal*/
int lTailleCommunicateur;   /*taille du communicateur;*/
long *lpBufferMpi;  /*buffer pour message*/
int lBufferSize;/*taille du buffer*/

/*2) pour les thread*/
int NiveauThreadVoulu, NiveauThreadObtenu;


lCommunicateur   = (MPI_Comm)-1;
NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
int erreur = MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,
 &NiveauThreadObtenu );

if (erreur!=0){
printf("erreur\n");
free( lpBufferMpi );
return -1;
}

   /*2) Attachement à un buffer pour le message*/
lBufferSize=1 * sizeof(long);
lpBufferMpi = calloc( 1, sizeof(long));
erreur = MPI_Buffer_attach( (void*)lpBufferMpi, lBufferSize );

if (erreur!=0){
printf("erreur\n");
   

Re: [OMPI users] MPI_Comm_Spawn

2007-03-05 Thread Tim Prins
That is possible. Threading support is VERY lightly tested, but I  
doubt it is the problem since it always fails after 31 spawns.


Again, I have tried with these configure options and the same version  
of Open MPI and still have not been able to replicate this (after  
letting it spawn over 500 times). Have you been able to try a more  
recent version of Open MPI? What kind of system is it? How many nodes  
are you running on?


Tim

On Mar 5, 2007, at 1:21 PM, rozzen.vinc...@fr.thalesgroup.com wrote:



Maybe the problem comes from the configuration options.
The configuration options used are :
./configure  --enable-mpi-threads --enable-progress-threads --with- 
threads=posix --enable-smp-locks

Could you give me your point of view about that please ?
Thanks

-Message d'origine-
De : users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] 
De la

part de Ralph H Castain
Envoyé : mardi 27 février 2007 16:26
À : Open MPI Users 
Objet : Re: [OMPI users] MPI_Comm_Spawn


Now that's interesting! There shouldn't be a limit, but to be  
honest, I've
never tested that mode of operation - let me look into it and see.  
It sounds

like there is some counter that is overflowing, but I'll look.

Thanks
Ralph


On 2/27/07 8:15 AM, "rozzen.vinc...@fr.thalesgroup.com"
 wrote:

Do you know if there is a limit to the number of MPI_Comm_spawn we  
can use in

order to launch a program?
I want to start and stop a program several times (with the function
MPI_Comm_spawn) but every time after  31 MPI_Comm_spawn, I get a  
"segmentation

fault".
Could you give me your point of you to solve this problem?
Thanks

/*file .c : spawned  the file Exe*/
#include 
#include 
#include 
#include "mpi.h"
#include 
#include 
#include 
#include 
#define EXE_TEST "/home/workspace/test_spaw1/src/Exe"



int main( int argc, char **argv ) {

long *lpBufferMpi;
MPI_Comm lIntercom;
int lErrcode;
MPI_Comm lCommunicateur;
int lRangMain,lRangExe,lMessageEnvoi,lIter,NiveauThreadVoulu,
NiveauThreadObtenu,lTailleBuffer;
int *lpMessageEnvoi = &lMessageEnvoi;
MPI_Status lStatus; /*status de reception*/

 lIter=0;


/* MPI environnement */

printf("main***\n");
printf("main : Lancement MPI*\n");

NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,  
 &NiveauThreadObtenu );

lpBufferMpi = calloc( 1, sizeof(long));
MPI_Buffer_attach( (void*)lpBufferMpi, 1 * sizeof(long) );

while (lIter<1000){
lIter ++;
lIntercom=(MPI_Comm)-1 ;

MPI_Comm_spawn( EXE_TEST, NULL, 1, MPI_INFO_NULL,
  0, MPI_COMM_WORLD, &lIntercom, &lErrcode );
printf( "%i main***MPI_Comm_spawn return : %d\n",lIter,  
lErrcode );


if(lIntercom == (MPI_Comm)-1 ){
printf("%i Intercom null\n",lIter);
return 0;
}
MPI_Intercomm_merge(lIntercom, 0, &lCommunicateur);
MPI_Comm_rank( lCommunicateur, &lRangMain);
lRangExe=1-lRangMain;

printf("%i main***Rang main : %i   Rang exe : %i
\n",lIter,(int)lRangMain,(int)lRangExe);
sleep(2);

}


/* Arret de l'environnement MPI */
lTailleBuffer=1* sizeof(long);
MPI_Buffer_detach( (void*)lpBufferMpi, &lTailleBuffer );
MPI_Comm_free( &lCommunicateur );
MPI_Finalize( );
free( lpBufferMpi );

printf( "Main = End .\n" );
return 0;

}
/ 
* 


***/
Exe:
#include 
#include 
#include 
#include 
#include  /* pour sleep() */
#include 
#include 
#include "mpi.h"

int main( int argc, char **argv ) {
/*1)pour communiaction MPI*/
MPI_Comm lCommunicateur;/*communicateur du process*/
MPI_Comm CommParent;/*Communiacteur parent à  
récupérer*/
int lRank;  /*rang du communicateur du  
process*/
int lRangMain;/*rang du séquenceur si lancé en  
mode normal*/

int lTailleCommunicateur;   /*taille du communicateur;*/
long *lpBufferMpi;  /*buffer pour message*/
int lBufferSize;/*taille du buffer*/

/*2) pour les thread*/
int NiveauThreadVoulu, NiveauThreadObtenu;


lCommunicateur   = (MPI_Comm)-1;
NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
int erreur = MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,
 &NiveauThreadObtenu );

if (erreur!=0){
printf("erreur\n");
free( lpBufferMpi );
return -1;
}

   /*2) Attachement à un buffer pour le message*/
lBufferSize=1 * sizeof(long);
lpBufferMpi = calloc( 1, sizeof(long));
erreur = MPI_Buffer_attach( (void*)lpBufferMpi, lBufferSize );

if (erreur!=0){
printf("erreur\n");
free( lpBufferMpi );
return -1;
}

printf( "Exe : Lance \n" );
MPI_Comm_get_parent( &CommParent );
MPI_Intercomm_merge( CommParent, 1, &lCommunicateur );
MPI_Comm_rank( lCommunicateur, &lRank );
MPI_Comm_size( lCommunicateur, &lTailleCommunicateur );
  

Re: [OMPI users] MPI_Comm_Spawn

2007-03-01 Thread Tim Prins
Actually, I have also tried with the same version you are using and  
cannot reproduce the behavior. Can you get a backtrace from the  
segmentation fault?


Also, as Ralph suggested, you might want to upgrade and see if the  
problem persists.


Tim

On Mar 1, 2007, at 8:52 AM, Ralph Castain wrote:

One thing immediately leaps out at me - you are using a very old  
version of
Open MPI. I suspect Tim is testing on a much newer version, most  
likely the

1.2 version that is about to be released in the next day or two.

If it's at all possible, I would urge you to upgrade to 1.2 - if  
you would
rather not wait for the official release, the web site's latest  
beta is
virtually identical. I believe you will find the code much improved  
and

worth the change.

If you truly want to stick with the 1.1 family, then I would  
suggest you at
least update to the latest release there (we are currently at  
1.1.4, and
1.1.5 - which is planned to be the last in that series - is also  
coming out

in the next day or two).

Hope that helps

Ralph



On 3/1/07 4:44 AM, "rozzen.vinc...@fr.thalesgroup.com"
<rozzen.vinc...@fr.thalesgroup.com> wrote:



Thanks for your help.
Here is attached the output of ompi_info in the file ompi_info.txt.

-Message d'origine-
De : users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] 
De la

part de Tim Prins
Envoyé : jeudi 1 mars 2007 05:45
À : Open MPI Users
Objet : Re: [OMPI users] MPI_Comm_Spawn


I have tried to reproduce this but cannot. I have been able to run  
your test
program to over 100 spawns. So I can track this further, please  
send the

output of ompi_info.

Thanks,

Tim

On Tuesday 27 February 2007 10:15 am,  
rozzen.vinc...@fr.thalesgroup.com wrote:
Do you know if there is a limit to the number of MPI_Comm_spawn  
we can use
in order to launch a program? I want to start and stop a program  
several

times (with the function MPI_Comm_spawn) but every time after  31
MPI_Comm_spawn, I get a "segmentation fault". Could you give me  
your point

of you to solve this problem?
Thanks

/*file .c : spawned  the file Exe*/
#include 
#include 
#include 
#include "mpi.h"
#include 
#include 
#include 
#include 
#define EXE_TEST "/home/workspace/test_spaw1/src/ 
Exe"




int main( int argc, char **argv ) {

long *lpBufferMpi;
MPI_Comm lIntercom;
int lErrcode;
MPI_Comm lCommunicateur;
int lRangMain,lRangExe,lMessageEnvoi,lIter,NiveauThreadVoulu,
NiveauThreadObtenu,lTailleBuffer; int *lpMessageEnvoi = &lMessageEnvoi;

MPI_Status lStatus; /*status de reception*/

 lIter=0;


/* MPI environnement */

printf("main***\n");
printf("main : Lancement MPI*\n");

NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,  
 &NiveauThreadObtenu ); lpBufferMpi = calloc( 1, sizeof(long));
MPI_Buffer_attach( (void*)lpBufferMpi, 1 * sizeof(long) );

while (lIter<1000){
lIter ++;
lIntercom=(MPI_Comm)-1 ;

MPI_Comm_spawn( EXE_TEST, NULL, 1, MPI_INFO_NULL,
  0, MPI_COMM_WORLD, &lIntercom, &lErrcode );
printf( "%i main***MPI_Comm_spawn return : %d\n",lIter,  
lErrcode );


if(lIntercom == (MPI_Comm)-1 ){
printf("%i Intercom null\n",lIter);
return 0;
}
MPI_Intercomm_merge(lIntercom, 0, &lCommunicateur);
MPI_Comm_rank( lCommunicateur, &lRangMain);
lRangExe=1-lRangMain;

printf("%i main***Rang main : %i   Rang exe : %i
\n",lIter,(int)lRangMain,(int)lRangExe); sleep(2);

}


/* Arret de l'environnement MPI */
lTailleBuffer=1* sizeof(long);
MPI_Buffer_detach( (void*)lpBufferMpi, &lTailleBuffer );
MPI_Comm_free( &lCommunicateur );
MPI_Finalize( );
free( lpBufferMpi );

printf( "Main = End .\n" );
return 0;

}
/ 
 
**

**/ Exe:
#include 
#include 
#include 
#include 
#include  /* pour sleep() */
#include 
#include 
#include "mpi.h"

int main( int argc, char **argv ) {
/*1)pour communiaction MPI*/
MPI_Comm lCommunicateur;/*communicateur du process*/
MPI_Comm CommParent;/*Communiacteur parent à  
récupérer*/
int lRank;  /*rang du communicateur du  
process*/

int lRangMain;/*rang du séquenceur si lancé en mode
normal*/ int lTailleCommunicateur;   /*taille du  
communicateur;*/

long *lpBufferMpi;  /*buffer pour message*/
int lBufferSize;/*taille du buffer*/

/*2) pour les thread*/
int NiveauThreadVoulu, NiveauThreadObtenu;


lCommunicateur   = (MPI_Comm)-1;
NiveauThreadVoulu = MPI_THREAD_MULTIPLE;
int erreur = MPI_Init_thread( &argc, &argv, NiveauThreadVoulu,
 &NiveauThreadObtenu );

if (erreur!=0){
printf("erreur\n");
free( lpBufferMpi );
return -

Re: [OMPI users] Build OpenMPI for SHM only

2006-11-21 Thread Tim Prins
Hi,

I don't know if there is a way to do it in configure, but after installing you 
can go into the $prefix/lib/openmpi directory and delete mca_btl_tcp.*

This will remove the tcp component and thus users will not be able to use it. 
Note that you must NOT delete the mca_oob_tcp.* files, as these are used for 
our internal administrative messaging and we currently require it to be 
there.
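
For example, if Open MPI was installed with a prefix of /opt/openmpi (adjust
to your actual prefix):

  rm /opt/openmpi/lib/openmpi/mca_btl_tcp.*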

Thanks,

Tim Prins


On Tuesday 21 November 2006 07:49 pm, Adam Moody wrote:
> Hello,
> We have some clusters which consist of a large pool of 8-way nodes
> connected via ethernet.  On these particular machines, we'd like our
> users to be able to run 8-way MPI jobs on node, but we *don't* want them
> to run MPI jobs across nodes via the ethernet.  Thus, I'd like to
> configure and build OpenMPI to provide shared memory support (or TCP
> loopback) but disable general TCP support.
>
> I realize that you can run without tcp via something like "mpirun --mca
> btl ^tcp", but this is up to the user's discretion.  I need a way to
> disable it systematically.  Is there a way to configure it out at build
> time or is there some runtime configuration file I can modify to turn it
> off?  Also, when we configure "--without-tcp", the configure script
> doesn't complain, but TCP support is added anyway.
>
> Thanks,
> -Adam Moody
> MPI Support @ LLNL
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] configure script not hapy with OpenPBS

2006-10-19 Thread Tim Prins

Hi Martin,

Yeah, we appear to have some mistakes in the configuration macros. I  
will correct them, but they really should not be affecting things in  
this instance.


Whether Open MPI expects a 32 bit or 64 bit library depends on the  
compiler. If your compiler generates 64 bit executables by default,  
we will by default compile Open MPI in 64 bit mode and expect 64 bit  
libraries.


Unfortunately there is no single simple flag to switch between 64 bit  
and 32 bit mode. With gcc I use the follow configure line to compile  
in 32 bit mode:
./configure FCFLAGS=-m32 FFLAGS=-m32 CFLAGS=-m32 CXXFLAGS=-m32 --with- 
wrapper-cflags=-m32 --with-wrapper-cxxflags=-m32 --with-wrapper- 
fflags=-m32 --with-wrapper-fcflags=-m32


I know that is a bit unwieldy, but I believe that is the best way to  
do it right now.


If this does not work, please send the information requested here:
http://www.open-mpi.org/community/help/

Thanks,

Tim

On Oct 19, 2006, at 2:48 PM, Audet, Martin wrote:


Hi,

When I tried to install OpenMPI on the front node of a cluster  
using OpenPBS batch system (e.g. --with-tm=/usr/open-pbs argument  
to configure), it didn't work and I got the error message:


--- MCA component pls:tm (m4 configuration macro)
checking for MCA component pls:tm compile mode... dso
checking tm.h usability... yes
checking tm.h presence... yes
checking for tm.h... yes
looking for library in lib
checking for tm_init in -lpbs... no
looking for library in lib64
checking for tm_init in -lpbs... no
checking tm.h usability... yes
checking tm.h presence... yes
checking for tm.h... yes
looking for library in lib
checking for tm_finalize in -ltorque... no
looking for library in lib64
checking for tm_finalize in -ltorque... no
configure: error: TM support requested but not found.  Aborting

Looking in the very long configure script, I found two typos in  
variable names:


  "ompi_check_tm_hapy" is set at lines 68164 and 76084
  "ompi_check_loadleveler_hapy" is set at line 73086

where the correct names are obviously "ompi_check_tm_happy" and  
"ompi_check_loadleveler_happy" (i.e. "happy", not "hapy"), judging by  
the variables used around them.


I corrected the variable names but unfortunately it didn't fix my  
problem; configure stopped with the same error message (maybe you  
should also correct this in your "svn" repository, since this may be a  
"latent" bug).


I'm now questioning why the configure script didn't find the  
'tm_init' symbol in libpbs.a, since the following command:


nm /usr/open-pbs/lib/libpbs.a | grep -e '\<tm_init\>' -e '\<tm_finalize\>'


prints:

0cd0 T tm_finalize
1270 T tm_init

Is it possible that on an EM64T Linux system the configure script  
requires lib/libpbs.a or lib64/libpbs.a to be a 64 bit library in order  
to be happy? (lib64/libpbs.a doesn't exist and lib/libpbs.a is a 32 bit  
library on our system, since the OpenPBS version we use is a bit old  
(2.3.x) and doesn't appear to be 64 bit clean.)



Martin Audet

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [O-MPI users] Fwd: Fwd: [Beowulf] MorphMPI based on fortran itf

2005-10-13 Thread Tim Prins
Quoting Toon Knapen <toon.kna...@fft.be>:

> Tim Prins wrote:
>
> > I am in the process of developing MorphMPI and have designed my
> > implementation a bit different than what you propose (my apologies if
> > I misunderstood what you have said). I am creating one main library,
> > which users will compile and run against, and which should not need
> > to be recompiled. This library will then open a plugin depending on
> > what MPI the user would like to use. Then, it will dynamically open
> > the actual MPI implementation. In other words, to add support for
> > another MPI one would just need to drop the appropriate plugin into
> > the right directory.
>
>
> Thus IIUC, the app calls your lib and your lib in turn calls a
> plugin?
Not quite. The plugin will merely consist of a data table, which will
tell me all I need to know about the MPI and how to call its functions.
Thus the app will call a function in MorphMPI which will in turn call a
function in the actual MPI.
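
A rough sketch of that layout (purely illustrative; the names, types,
and the morph_fill_table entry point are invented for this example and
are not the actual MorphMPI interface):

/* Illustrative only: not the real MorphMPI code or naming. */
#include <dlfcn.h>
#include <stdio.h>

/* Fixed types exposed by the wrapper; applications compile against these. */
typedef int morph_comm_t;

/* Table of entry points into the real MPI, filled in by a plugin. */
struct morph_mpi_table {
    int (*init)(int *argc, char ***argv);
    int (*comm_rank)(morph_comm_t comm, int *rank);
    int (*finalize)(void);
};

static struct morph_mpi_table mpi;  /* the currently selected implementation */

/* Load a plugin: a shared object that exports a "morph_fill_table" symbol. */
int morph_load_plugin(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "cannot open plugin %s: %s\n", path, dlerror());
        return -1;
    }
    void (*fill)(struct morph_mpi_table *) =
        (void (*)(struct morph_mpi_table *)) dlsym(handle, "morph_fill_table");
    if (fill == NULL) {
        fprintf(stderr, "plugin %s exports no morph_fill_table\n", path);
        return -1;
    }
    /* The plugin knows the real MPI's types and fills in the table. */
    fill(&mpi);
    return 0;
}

/* What the application actually calls: one extra indirection per MPI call. */
int Morph_Init(int *argc, char ***argv)      { return mpi.init(argc, argv); }
int Morph_Comm_rank(morph_comm_t c, int *r)  { return mpi.comm_rank(c, r); }
int Morph_Finalize(void)                     { return mpi.finalize(); }

A plugin for a particular MPI implementation would be built against
that implementation's mpi.h and would point each table entry at (or at
a thin shim around) the corresponding MPI_* call, so the application
itself never links directly against any real MPI library.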

> This involves two dereferences. My idea was to (be able to)
> recompile the MorphMPI for each of the MPI libs and plug this one
> between the app and the MPI. AFAICT this approach has the same set
> of features but is more lightweight.
However, if you have to recompile MorphMPI for each MPI, you lose a lot
of the benefits of having an ABI, i.e. being able to easily run with
multiple implementations without recompiling. In this project I am
really going for easy extensibility and ease of use for the user.

>
> Is your project open-source? If so, can I check it out?
It will be open-source, but right now this project is still in its early
stages so there is nothing to release yet.

Tim