Re: [OMPI users] Ok, I've got OpenMPI set up, now what?!

2010-07-19 Thread Jed Brown
On Mon, 19 Jul 2010 13:33:01 -0600, Damien Hocking  wrote:
> It does.  The big difference is that MUMPS is a 3-minute compile, and 
> PETSc, erm, isn't.  It's..longer...

FWIW, PETSc takes less than 3 minutes to build (after configuration) for
me (I build it every day).  Building MUMPS (with dependencies) is
automatic with PETSc's --download-{blacs,scalapack,mumps}, but is
involved to do by hand (all three require editing makefiles).  I know
people that have configured PETSc just to build code which calls MUMPS
directly (without PETSc).  :-)

Jed


Re: [OMPI users] Ok, I've got OpenMPI set up, now what?!

2010-07-19 Thread Damien Hocking
It does.  The big difference is that MUMPS is a 3-minute compile, and 
PETSc, erm, isn't.  It's..longer...


D

On 19/07/2010 12:56 PM, Daniel Janzon wrote:
> Thanks a lot! PETSc seems to be really solid and integrates with MUMPS
> suggested by Damien.
>
> All the best,
> Daniel Janzon
>
> On 7/18/10, Gustavo Correa wrote:
>> Check PETSc:
>> http://www.mcs.anl.gov/petsc/petsc-as/
>>
>> On Jul 18, 2010, at 12:37 AM, Damien wrote:
>>> You should check out the MUMPS parallel linear solver.
>>>
>>> Damien
>>> Sent from my iPhone
>>>
>>> On 2010-07-17, at 5:16 PM, Daniel Janzon wrote:
>>>> Dear OpenMPI Users,
>>>>
>>>> I successfully installed OpenMPI on some FreeBSD machines and I can
>>>> run MPI programs on the cluster. Yippie!
>>>>
>>>> But I'm not patient enough to write my own MPI-based routines. So I
>>>> thought maybe I could ask here for suggestions. I am primarily
>>>> interested in general linear algebra routines. The best would be to
>>>> for instance start Octave and just use it as normal, only that all
>>>> matrix operations would run on the cluster. Has anyone done that? The
>>>> octave-parallel package seems to be something different.
>>>>
>>>> I installed scalapack and the test files ran successfully with mpirun
>>>> (except a few of them). But the source code examples of scalapack
>>>> look terrible. Is there no higher-level library that provides an API
>>>> with matrix operations, which have all MPI parallelism stuff handled
>>>> for you in the background? Certainly a smart piece of software can
>>>> decide better than me how to chunk up a matrix and pass it out to the
>>>> available processes.
>>>>
>>>> All the best,
>>>> Daniel
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] openmpi v1.5?

2010-07-19 Thread Jeff Squyres
I'm actually waiting for *1* more bug fix before we consider 1.5 "complete".


On Jul 19, 2010, at 3:24 PM, Jed Brown wrote:

> On Mon, 19 Jul 2010 15:16:59 -0400, Michael Di Domenico wrote:
>> Since I am an SVN neophyte can anyone tell me when openmpi 1.5 is
>> scheduled for release?
> 
> https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.5
> 
>> And whether the Slurm srun changes are going to make it in?
> 
> https://svn.open-mpi.org/trac/ompi/wiki/v1.5/planning
> 
> Jed


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] openmpi v1.5?

2010-07-19 Thread Jed Brown
On Mon, 19 Jul 2010 15:16:59 -0400, Michael Di Domenico wrote:
> Since I am an SVN neophyte can anyone tell me when openmpi 1.5 is
> scheduled for release?

https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.5

> And whether the Slurm srun changes are going to make it in?

https://svn.open-mpi.org/trac/ompi/wiki/v1.5/planning

Jed


[OMPI users] openmpi v1.5?

2010-07-19 Thread Michael Di Domenico
Since I am an SVN neophyte, can anyone tell me when openmpi 1.5 is
scheduled for release?  And whether the Slurm srun changes are going
to make it in?

thanks


Re: [OMPI users] Ok, I've got OpenMPI set up, now what?!

2010-07-19 Thread Daniel Janzon
Thanks a lot! PETSc seems to be really solid and integrates with MUMPS
suggested by Damien.

All the best,
Daniel Janzon

On 7/18/10, Gustavo Correa  wrote:
> Check PETSc:
> http://www.mcs.anl.gov/petsc/petsc-as/
>
> On Jul 18, 2010, at 12:37 AM, Damien wrote:
>
>> You should check out the MUMPS parallel linear solver.
>>
>> Damien
>> Sent from my iPhone
>>
>> On 2010-07-17, at 5:16 PM, Daniel Janzon  wrote:
>>
>>> Dear OpenMPI Users,
>>>
>>> I successfully installed OpenMPI on some FreeBSD machines and I can
>>> run MPI programs on the cluster. Yippie!
>>>
>>> But I'm not patient enough to write my own MPI-based routines. So I
>>> thought maybe I could ask here for suggestions. I am primarily
>>> interested in general linear algebra routines. The best would be to
>>> for instance start Octave and just use it as normal, only that all
>>> matrix operations would run on the cluster. Has anyone done that? The
>>> octave-parallel package seems to be something different.
>>>
>>> I installed scalapack and the test files ran successfully with mpirun
>>> (except a few of them). But the source code examples of scalapack
>>> look terrible. Is there no higher-level library that provides an API
>>> with matrix operations, which have all MPI parallelism stuff handled
>>> for you in the background? Certainly a smart piece of software can
>>> decide better than me how to chunk up a matrix and pass it out to the
>>> available processes.
>>>
>>> All the best,
>>> Daniel


Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-19 Thread Edgar Gabriel
Hm, so I am not sure how to approach this. First of all, the test case
works for me. I used up to 80 clients, and for both optimized and
non-optimized compilation. I ran the tests with trunk (not with 1.4
series, but the communicator code is identical in both cases). Clearly,
the patch from Ralph is necessary to make it work.

Additionally, I went through the communicator creation code for dynamic
communicators trying to find spots that could create problems. The only
place where I found the number 64 is the fortran-to-c mapping
arrays (e.g. for communicators), where the initial size of the table is
64. I looked twice over the pointer-array code to see whether we could
have a problem there (since it is a key piece of the cid allocation code
for communicators), but I am fairly confident that it is correct.

Note that we have other (non-dynamic) tests where comm_set is called
100,000 times, and the code per se does not seem to have a problem due
to being called too often. So I am not sure what else to look at.

Edgar
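
The overrun Ralph describes in the quoted messages below (a table hard-wired to 64 job ids that is appended to without a size check) can be illustrated with a minimal sketch. The names jobid_table, jobid_table_init, jobid_table_add and MAX_JOBIDS are made up for illustration and are not taken from the Open MPI source; the only point is the capacity check before the write.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the Open MPI source. */
#define MAX_JOBIDS 64

typedef struct {
    unsigned int ids[MAX_JOBIDS]; /* hard-wired capacity */
    size_t count;                 /* number of slots in use */
} jobid_table;

/* Start with an empty table. */
static void jobid_table_init(jobid_table *t)
{
    assert(t != NULL);
    t->count = 0;
}

/* Record a job id. Returns 0 on success, -1 if the table is full.
 * Without the capacity check, the 65th call would write past ids[]. */
static int jobid_table_add(jobid_table *t, unsigned int jobid)
{
    assert(t != NULL);
    if (t->count >= MAX_JOBIDS) {
        return -1; /* refuse instead of overrunning the array */
    }
    t->ids[t->count++] = jobid;
    return 0;
}
```

In this sketch the 65th insertion fails cleanly with -1 instead of silently overwriting whatever happens to sit behind the array.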



On 7/13/2010 8:42 PM, Ralph Castain wrote:
> As far as I can tell, it appears the problem is somewhere in our communicator 
> setup. The people knowledgeable on that area are going to look into it later 
> this week.
> 
> I'm creating a ticket to track the problem and will copy you on it.
> 
> 
> On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote:
> 
>>
>> On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote:
>>
>>> Bad news..
>>> I've tried the latest patch with and without the prior one, but it
>>> hasn't changed anything. I've also tried using the old code but with
>>> the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but it also didn't
>>> help.
>>> While looking through the sources of openmpi-1.4.2 I couldn't find any
>>> call of the function ompi_dpm_base_mark_dyncomm.
>>
>> It isn't directly called - it shows in ompi_comm_set as 
>> ompi_dpm.mark_dyncomm. You were definitely overrunning that array, but I 
>> guess something else is also being hit. Have to look further...
>>
>>
>>>
>>>
>>> 2010/7/12 Ralph Castain :
>>>> Just so you don't have to wait for 1.4.3 release, here is the patch 
>>>> (doesn't include the prior patch).
>>>>
>>>> On Jul 12, 2010, at 12:13 PM, Grzegorz Maj wrote:
>>>>
>>>>> 2010/7/12 Ralph Castain :
>>>>>> Dug around a bit and found the problem!!
>>>>>>
>>>>>> I have no idea who or why this was done, but somebody set a limit of 64 
>>>>>> separate jobids in the dynamic init called by ompi_comm_set, which 
>>>>>> builds the intercommunicator. Unfortunately, they hard-wired the array 
>>>>>> size, but never check that size before adding to it.
>>>>>>
>>>>>> So after 64 calls to connect_accept, you are overwriting other areas of 
>>>>>> the code. As you found, hitting 66 causes it to segfault.
>>>>>>
>>>>>> I'll fix this on the developer's trunk (I'll also add that original 
>>>>>> patch to it). Rather than my searching this thread in detail, can you 
>>>>>> remind me what version you are using so I can patch it too?
>>>>>
>>>>> I'm using 1.4.2
>>>>> Thanks a lot and I'm looking forward to the patch.
>>>>>
>>>>>>
>>>>>> Thanks for your patience with this!
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On Jul 12, 2010, at 7:20 AM, Grzegorz Maj wrote:
>>>>>>
>>>>>>> 1024 is not the problem: changing it to 2048 hasn't changed anything.
>>>>>>> Following your advice I've run my process using gdb. Unfortunately I
>>>>>>> didn't get anything more than:
>>>>>>>
>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>> [Switching to Thread 0xf7e4c6c0 (LWP 20246)]
>>>>>>> 0xf7f39905 in ompi_comm_set () from /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>
>>>>>>> (gdb) bt
>>>>>>> #0  0xf7f39905 in ompi_comm_set () from 
>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>> #1  0xf7e3ba95 in connect_accept () from
>>>>>>> /home/gmaj/openmpi/lib/openmpi/mca_dpm_orte.so
>>>>>>> #2  0xf7f62013 in PMPI_Comm_connect () from 
>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>> #3  0x080489ed in main (argc=825832753, argv=0x34393638) at client.c:43
>>>>>>>
>>>>>>> What's more: when I've added a breakpoint on ompi_comm_set in 66th
>>>>>>> process and stepped a couple of instructions, one of the other
>>>>>>> processes crashed (as usually on ompi_comm_set) earlier than 66th did.
>>>>>>>
>>>>>>> Finally I decided to recompile openmpi using -g flag for gcc. In this
>>>>>>> case the 66 processes issue has gone! I was running my applications
>>>>>>> exactly the same way as previously (even without recompilation) and
>>>>>>> I've run successfully over 130 processes.
>>>>>>> When switching back to the openmpi compilation without -g it again 
>>>>>>> segfaults.
>>>>>>>
>>>>>>> Any ideas? I'm really confused.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2010/7/7 Ralph Castain :
>>>>>>>> I would guess the #files limit of 1024. However, if it behaves the 
>>>>>>>> same way when spread across multiple machines, I would suspect it is 
 

Re: [OMPI users] MPE logging GUI

2010-07-19 Thread Stefan Kuhne
On 19.07.2010 16:32, Anthony Chan wrote:

Hello Anthony,
> 
> Just curious, is there any reason you are looking for another
> tool to view slog2 file ?
> 
I'm looking for a clearer tool.
I find Jumpshot a little bit overloaded.

Regards,
Stefan Kuhne





[OMPI users] openib issues

2010-07-19 Thread Eloi Gaudry
Hi,

I've been working on a random segmentation fault that seems to occur during a 
collective communication when using the openib btl (see [OMPI users] [openib] 
segfault when using openib btl).

During my tests, I've come across different issues reported by OpenMPI-1.4.2:

1/ 
[[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to: bn0122 
error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 
560618664 opcode 1  vendor error 105 qp_idx 3

2/
[[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05 error 
polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 
162858496 opcode 1  vendor error 136 qp_idx 0
[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04 error 
polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 
485900928 opcode 0  vendor error 249 qp_idx 0

--
The OpenFabrics stack has reported a network error event.  Open MPI will try to 
continue, but your job may end up failing.

  Local host:p'"
  MPI process PID:   20743
  Error number:  3 (IBV_EVENT_QP_ACCESS_ERR)

This error may indicate connectivity problems within the fabric; please contact 
your system administrator.
--

I'd like to know what these two errors mean and where they come from.

Thanks for your help,
Eloi

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


Re: [OMPI users] MPE logging GUI

2010-07-19 Thread Anthony Chan

Just curious, is there any reason you are looking for another
tool to view slog2 file ?

A.Chan

- "Stefan Kuhne"  wrote:

> Hello,
> 
> does anybody know another tool besides Jumpshot to view an MPE logging
> file?
> 
> Regards,
> Stefan Kuhne


Re: [OMPI users] MPICH2 is working OpenMPI Not

2010-07-19 Thread Scott Atchley
Hi Bibrak,

The message about malloc looks like an MX message. Which interconnects did you 
compile support for?

If you are using MX, does it appear when you run with:

$ mpirun --mca pml cm -np 4 ./exec 98

which uses the MX MTL instead of the MX BTL.

Scott
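
The "Send count" and "Dis" lines in the quoted output below (25, 25, 24, 24 and 0, 25, 50, 74 for ./exec 98 on 4 processes) match a block distribution in which the first n % nprocs ranks get one extra element. The program's source is not shown in the thread, so the following is only a sketch of how such values are commonly computed; block_distribution is a hypothetical helper, and treating 98 as the element count is an assumption.

```c
#include <assert.h>

/* Hypothetical helper, not taken from the poster's program.
 * Splits n elements over nprocs ranks: the first (n % nprocs) ranks
 * get one extra element; displs[] holds each rank's starting offset. */
static void block_distribution(int n, int nprocs, int *counts, int *displs)
{
    assert(n >= 0 && nprocs > 0);
    int base = n / nprocs;   /* minimum share per rank */
    int extra = n % nprocs;  /* leftover elements to hand out */
    int offset = 0;
    for (int rank = 0; rank < nprocs; rank++) {
        counts[rank] = base + (rank < extra ? 1 : 0);
        displs[rank] = offset;
        offset += counts[rank];
    }
}
```

For n = 98 and nprocs = 4 this gives counts 25, 25, 24, 24 and displacements 0, 25, 50, 74, the same values as in the output, so the receive and work buffers on each rank can be checked against counts[rank].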

On Jul 18, 2010, at 9:23 AM, Bibrak Qamar wrote:

> Hello,
> 
> I have developed a code which I tested on MPICH2; it works fine.
> 
> But when I compile and run it with OpenMPI, it does not work.
> 
> The output of the program, with the errors from OpenMPI, is below.
> 
> --
> 
> 
> bibrak@barq:~/XXX> mpirun -np 4 ./exec 98
> 
> 
> warning:regcache incompatible with malloc
> warning:regcache incompatible with malloc
> warning:regcache incompatible with malloc
> warning:regcache incompatible with malloc
> Send count -- >> 25 
> Send count -- >> 25 
> Send count -- >> 24 
> Send count -- >> 24 
> Dis -- >> 0 
> Dis -- >> 25 
> Dis -- >> 50 
> Dis -- >> 74 
> 
> 
> 
> 
>  0 d[0] = -14.025975 
>  1 d[0] = -14.025975 
> -- 1 -- 
>  2 d[0] = -14.025975 
> -- 2 -- 
> -- 0 -- 
>  3 d[0] = -14.025975 
>  --3 --
> [barq:27118] *** Process received signal ***
> [barq:27118] Signal: Segmentation fault (11)
> [barq:27118] Signal code: Address not mapped (1)
> [barq:27118] Failing at address: 0x51681f96
> [barq:27121] *** Process received signal ***
> [barq:27121] Signal: Segmentation fault (11)
> [barq:27121] Signal code: Address not mapped (1)
> [barq:27121] Failing at address: 0x77b5685
> [barq:27118] [ 0] [0xe410]
> [barq:27118] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7d20f3c]
> [barq:27118] [ 2] ./exec(main+0x2214) [0x804ad8d]
> [barq:27118] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7cc9705]
> [barq:27121] [ 0] [0xe410]
> [barq:27121] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7d0ef3c]
> [barq:27121] [ 2] ./exec(main+0x2214) [0x804ad8d]
> [barq:27121] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7cb7705]
> [barq:27121] [ 4] ./exec [0x8048b01]
> [barq:27121] *** End of error message ***
> [barq:27118] [ 4] ./exec [0x8048b01]
> [barq:27118] *** End of error message ***
> --
> mpirun noticed that process rank 3 with PID 27121 on node barq exited on 
> signal 11 (Segmentation fault).
> --
> [barq:27120] *** Process received signal ***
> [barq:27120] Signal: Segmentation fault (11)
> [barq:27120] Signal code: Address not mapped (1)
> [barq:27120] Failing at address: 0x4bd1ca3e
> [barq:27120] [ 0] [0xe410]
> [barq:27120] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7c97f3c]
> [barq:27120] [ 2] ./exec(main+0x2214) [0x804ad8d]
> [barq:27120] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7c40705]
> [barq:27120] [ 4] ./exec [0x8048b01]
> [barq:27120] *** End of error message ***
> 
> 
> 
> 
> Because of the "warning:regcache incompatible with malloc" message, I ran
> bibrak@barq:~/XXX> export MX_RCACHE=2
> 
> That takes care of the warning, but the error still remains.
> 
> I would appreciate any help.
> 
> Bibrak Qamar
> NUST-SEECS




Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-07-19 Thread Ralph Castain
I'm wondering if we can't make this simpler. What launch environment are you 
operating under? I know you said you can't use mpiexec, but I'm wondering if we 
could add support for your environment to mpiexec so you could.


On Jul 18, 2010, at 4:09 PM, Philippe wrote:

> Ralph,
> 
> thanks for investigating.
> 
> I've applied the two patches you mentioned earlier and ran with the
> ompi server. Although I was able to run our standalone test, when I
> integrated the changes into our code, the processes entered a crazy loop
> and allocated all the available memory when calling MPI_Port_Connect.
> I was not able to identify why it works standalone but not integrated
> with our code. If I find out why, I'll let you know.
> 
> looking forward to your findings. We'll be happy to test any patches
> if you have some!
> 
> p.
> 
> On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain  wrote:
>> Okay, I can reproduce this problem. Frankly, I don't think this ever worked 
>> with OMPI, and I'm not sure how the choice of BTL makes a difference.
>> 
>> The program is crashing in the communicator definition, which involves a 
>> communication over our internal out-of-band messaging system. That system 
>> has zero connection to any BTL, so it should crash either way.
>> 
>> Regardless, I will play with this a little as time allows. Thanks for the 
>> reproducer!
>> 
>> 
>> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>> 
>>> Hi,
>>> 
>>> I'm trying to run a test program which consists of a server creating a
>>> port using MPI_Open_port and N clients using MPI_Comm_connect to
>>> connect to the server.
>>> 
>>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3
>>> clients, I get the following error message:
>>> 
>>>   [node003:32274] [[37084,0],0]:route_callback tried routing message
>>> from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>> 
>>> This is only happening with the openib BTL. With the tcp BTL it works
>>> perfectly fine (ofud also works, as a matter of fact...). This has been
>>> tested on two completely different clusters, with identical results.
>>> In either case, the IB fabric works normally.
>>> 
>>> Any help would be greatly appreciated! Several people on my team
>>> looked at the problem. Google and the mailing list archive did not
>>> provide any clue. I believe that from an MPI standpoint, my test
>>> program is valid (and it works with TCP, which makes me feel better
>>> about the sequence of MPI calls).
>>> 
>>> Regards,
>>> Philippe.
>>> 
>>> 
>>> 
>>> Background:
>>> 
>>> I intend to use openMPI to transport data inside a much larger
>>> application. Because of that, I cannot use mpiexec. Each process is
>>> started by our own "job management" and uses a name server to find
>>> out about the others. Once all the clients are connected, I would like
>>> the server to do MPI_Recv to get the data from all the clients. I don't
>>> care about the order or which clients are sending data, as long as I
>>> can receive it with one call. To do that, the clients and the server
>>> go through a series of Comm_accept/Comm_connect/Intercomm_merge
>>> so that at the end, all the clients and the server are inside the same
>>> intracomm.
>>> 
>>> Steps:
>>> 
>>> I have a sample program that shows the issue. I tried to make it as
>>> short as possible. It needs to be executed on a shared file system
>>> like NFS because the server writes the port info to a file that the
>>> client will read. To reproduce the issue, the following steps should
>>> be performed:
>>> 
>>> 0. compile the test with "mpicc -o ben12 ben12.c"
>>> 1. ssh to the machine that will be the server
>>> 2. run ./ben12 3 1
>>> 3. ssh to the machine that will be the client #1
>>> 4. run ./ben12 3 0
>>> 5. repeat step 3-4 for client #2 and #3
>>> 
>>> The server accepts the connection from client #1 and merges it into a new
>>> intracomm. It then accepts the connection from client #2 and merges it. When
>>> client #3 arrives, the server accepts the connection, but that
>>> causes clients #1 and #2 to die with the error above (see the complete
>>> trace in the tarball).
>>> 
>>> The exact steps are:
>>> 
>>> - server opens port
>>> - server does accept
>>> - client #1 does connect
>>> - server and client #1 do merge
>>> - server does accept
>>> - client #2 does connect
>>> - server, client #1 and client #2 do merge
>>> - server does accept
>>> - client #3 does connect
>>> - server, client #1, client #2 and client #3 do merge
>>> 
>>> 
>>> My infiniband network works normally with other test programs or
>>> applications (MPI or others like Verbs).
>>> 
>>> Info about my setup:
>>> 
>>>openMPI version = 1.4.1 (I also tried 1.4.2, nightly snapshot of
>>> 1.4.3, nightly snapshot of 1.5 --- all show the same error)
>>>config.log in the tarball
>>>"ompi_info --all" in the tarball
>>>OFED version = 1.3 installed from RHEL 5.3
>>>Distro = Red Hat Enterprise Linux 5.3
>>>Kernel 

Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-07-19 Thread Ralph Castain

On Jul 18, 2010, at 4:09 PM, Philippe wrote:

> Ralph,
> 
> thanks for investigating.
> 
> I've applied the two patches you mentioned earlier and ran with the
> ompi server. Although I was able to run our standalone test, when I
> integrated the changes into our code, the processes entered a crazy loop
> and allocated all the available memory when calling MPI_Port_Connect.
> I was not able to identify why it works standalone but not integrated
> with our code. If I find out why, I'll let you know.

How many processes are we talking about?

> 
> looking forward to your findings. We'll be happy to test any patches
> if you have some!
> 
> p.
> 
> On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain  wrote:
>> Okay, I can reproduce this problem. Frankly, I don't think this ever worked 
>> with OMPI, and I'm not sure how the choice of BTL makes a difference.
>> 
>> The program is crashing in the communicator definition, which involves a 
>> communication over our internal out-of-band messaging system. That system 
>> has zero connection to any BTL, so it should crash either way.
>> 
>> Regardless, I will play with this a little as time allows. Thanks for the 
>> reproducer!
>> 
>> 
>> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>> 
>>> Hi,
>>> 
>>> I'm trying to run a test program which consists of a server creating a
>>> port using MPI_Open_port and N clients using MPI_Comm_connect to
>>> connect to the server.
>>> 
>>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3
>>> clients, I get the following error message:
>>> 
>>>   [node003:32274] [[37084,0],0]:route_callback tried routing message
>>> from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>> 
>>> This is only happening with the openib BTL. With the tcp BTL it works
>>> perfectly fine (ofud also works, as a matter of fact...). This has been
>>> tested on two completely different clusters, with identical results.
>>> In either case, the IB fabric works normally.
>>> 
>>> Any help would be greatly appreciated! Several people on my team
>>> looked at the problem. Google and the mailing list archive did not
>>> provide any clue. I believe that from an MPI standpoint, my test
>>> program is valid (and it works with TCP, which makes me feel better
>>> about the sequence of MPI calls).
>>> 
>>> Regards,
>>> Philippe.
>>> 
>>> 
>>> 
>>> Background:
>>> 
>>> I intend to use openMPI to transport data inside a much larger
>>> application. Because of that, I cannot use mpiexec. Each process is
>>> started by our own "job management" and uses a name server to find
>>> out about the others. Once all the clients are connected, I would like
>>> the server to do MPI_Recv to get the data from all the clients. I don't
>>> care about the order or which clients are sending data, as long as I
>>> can receive it with one call. To do that, the clients and the server
>>> go through a series of Comm_accept/Comm_connect/Intercomm_merge
>>> so that at the end, all the clients and the server are inside the same
>>> intracomm.
>>> 
>>> Steps:
>>> 
>>> I have a sample program that shows the issue. I tried to make it as
>>> short as possible. It needs to be executed on a shared file system
>>> like NFS because the server writes the port info to a file that the
>>> client will read. To reproduce the issue, the following steps should
>>> be performed:
>>> 
>>> 0. compile the test with "mpicc -o ben12 ben12.c"
>>> 1. ssh to the machine that will be the server
>>> 2. run ./ben12 3 1
>>> 3. ssh to the machine that will be the client #1
>>> 4. run ./ben12 3 0
>>> 5. repeat step 3-4 for client #2 and #3
>>> 
>>> The server accepts the connection from client #1 and merges it into a new
>>> intracomm. It then accepts the connection from client #2 and merges it. When
>>> client #3 arrives, the server accepts the connection, but that
>>> causes clients #1 and #2 to die with the error above (see the complete
>>> trace in the tarball).
>>> 
>>> The exact steps are:
>>> 
>>> - server opens port
>>> - server does accept
>>> - client #1 does connect
>>> - server and client #1 do merge
>>> - server does accept
>>> - client #2 does connect
>>> - server, client #1 and client #2 do merge
>>> - server does accept
>>> - client #3 does connect
>>> - server, client #1, client #2 and client #3 do merge
>>> 
>>> 
>>> My infiniband network works normally with other test programs or
>>> applications (MPI or others like Verbs).
>>> 
>>> Info about my setup:
>>> 
>>>openMPI version = 1.4.1 (I also tried 1.4.2, nightly snapshot of
>>> 1.4.3, nightly snapshot of 1.5 --- all show the same error)
>>>config.log in the tarball
>>>"ompi_info --all" in the tarball
>>>OFED version = 1.3 installed from RHEL 5.3
>>>Distro = Red Hat Enterprise Linux 5.3
>>>Kernel = 2.6.18-128.4.1.el5 x86_64
>>>subnet manager = built-in SM from the cisco/topspin switch
>>>output of ibv_devinfo included in the tarball (there are no "bad" nodes)
>>>

[OMPI users] MPE logging GUI

2010-07-19 Thread Stefan Kuhne
Hello,

does anybody know another tool besides Jumpshot to view an MPE logging file?

Regards,
Stefan Kuhne





Re: [OMPI users] is loop unrolling safe for MPI logic?

2010-07-19 Thread Tim Prince

On 7/18/2010 9:09 AM, Anton Shterenlikht wrote:
> On Sat, Jul 17, 2010 at 09:14:11AM -0700, Eugene Loh wrote:
>> Jeff Squyres wrote:
>>> On Jul 17, 2010, at 4:22 AM, Anton Shterenlikht wrote:
>>>> Is loop vectorisation/unrolling safe for MPI logic?
>>>> I presume it is, but are there situations where
>>>> loop vectorisation could e.g. violate the order
>>>> of execution of MPI calls?
>>>
>>> I *assume* that the intel compiler will not unroll loops that contain MPI 
>>> function calls.  That's obviously an assumption, but I would think that unless 
>>> you put some pragmas in there that tell the compiler that it's safe to unroll, 
>>> the compiler will be somewhat conservative about what it automatically unrolls.
>>
>> More generally, a Fortran compiler that optimizes aggressively could
>> "break" MPI code.
>>
>> http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node241
>>
>> That said, you may not need to worry about this in your particular case.
>
> This is a very important point, many thanks Eugene.
> Fortran MPI programmers definitely need to pay attention to this.
>
> MPI-2.2 provides a slightly updated version of this guide:
>
> http://www.mpi-forum.org/docs/mpi22-report/node343.htm#Node348
>
> many thanks
> anton

From the point of view of the compiler developers, auto-vectorization 
and unrolling are distinct questions.  An MPI or other non-inlined 
function call would not be subject to vectorization.  While 
auto-vectorization or unrolling may expose latent bugs, MPI is not 
particularly likely to make them worse.  You have made some misleading 
statements about vectorization along the way, but these aren't likely to 
relate to MPI problems.
Upon my return, I will be working on a case which was developed and 
tested successfully under ifort 10.1 and other compilers, which is 
failing under current ifort versions.  Current Intel MPI throws a run 
time error indicating that the receive buffer has been lost; the openmpi 
failure is more obscure.  I will have to change the code to use distinct 
tags for each MPI send/receive pair in order to track it down.  I'm not 
counting on that magically making the bug go away.  ifort is not 
particularly aggressive about unrolling loops which contain MPI calls, 
but I agree that must be considered.


--
Tim Prince