Re: [OMPI users] MPI_Bcast issue

2010-08-08 Thread Randolph Pullen
Thanks. Although "an intercommunicator cannot be used for collective 
communication" (i.e., bcast calls), I can see how the MPI_Group_xx calls can 
be used to produce a useful group and then a communicator. Thanks again, but 
this is really a side issue to my main question about MPI_Bcast.

I seem to have duplicate concurrent processes interfering with each other.  
This would appear to be a breach of the MPI safety dictum: MPI_COMM_WORLD is 
supposed to include only the processes started by a single mpirun command and 
to isolate these processes safely from other similar groups of processes.

So, it would appear to be a bug.  If so, this has significant implications for 
environments such as mine, where the same program may often be run by 
different users simultaneously.

It is really this issue that is concerning me. I can rewrite the code, but if 
it can crash when two copies run at the same time, I have a much bigger problem.

My suspicion is that within the MPI_Bcast handshaking, a synchronising 
broadcast call may be colliding across the environments.  My only evidence is 
that an otherwise working program waits on broadcast reception forever when two 
or more copies are run at [exactly] the same time.

Has anyone else seen similar behavior in concurrently running programs that 
perform lots of broadcasts?

Randolph


--- On Sun, 8/8/10, David Zhang  wrote:

From: David Zhang 
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" 
Received: Sunday, 8 August, 2010, 12:34 PM

In particular, intercommunicators

On 8/7/10, Aurélien Bouteiller  wrote:
> You should consider reading about communicators in MPI.
>
> Aurelien
> --
> Aurelien Bouteiller, Ph.D.
> Innovative Computing Laboratory, The University of Tennessee.
>
> Sent from my iPad
>
> On Aug 7, 2010, at 1:05, Randolph Pullen  wrote:
>
>> I seem to be having a problem with MPI_Bcast.
>> My massive, I/O-intensive data movement program must broadcast from n to n
>> nodes. My problem starts because I require 2 processes per node, a sender
>> and a receiver, and I have implemented these as separate MPI processes
>> rather than tackle the complexities of threads on MPI.
>>
>> Consequently, broadcast and calls like alltoall are not completely
>> helpful.  The dataset is huge and each node must end up with a complete
>> copy built by the large number of contributing broadcasts from the sending
>> nodes.  Network efficiency and run time are paramount.
>>
>> As I don’t want to needlessly broadcast all this data to the sending nodes
>> and I have a perfectly good MPI program that distributes globally from a
>> single node (1 to N), I took the unusual decision to start N copies of
>> this program by spawning the MPI system from the PVM system in an effort
>> to get my N to N concurrent transfers.
>>
>> It seems that the broadcasts running on concurrent MPI environments
>> collide and cause all but the first process to hang waiting for their
>> broadcasts.  This theory seems to be confirmed by introducing a sleep of
>> n-1 seconds before the first MPI_Bcast  call on each node, which results
>> in the code working perfectly.  (total run time 55 seconds, 3 nodes,
>> standard TCP stack)
>>
>> My guess is that unlike PVM, OpenMPI implements broadcasts with broadcasts
>> rather than multicasts.  Can someone confirm this?  Is this a bug?
>>
>> Is there any multicast or N to N broadcast where sender processes can
>> avoid participating when they don’t need to?
>>
>> Thanks in advance
>> Randolph
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
Sent from my mobile device

David Zhang
University of California, San Diego


Re: [OMPI users] Memory allocation error when linking with MPI libraries

2010-08-08 Thread Nysal Jan
What interconnect are you using? InfiniBand? Use the "--without-memory-manager"
option while building Open MPI in order to disable ptmalloc.
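For reference, the rebuild would look something like the following (a sketch
only; the source directory, install prefix, and parallelism are assumptions,
not taken from this thread):

```shell
# Reconfigure and rebuild Open MPI without the ptmalloc2 memory hooks.
# The paths below are hypothetical; adjust to your own tree.
cd openmpi-1.4.2
./configure --prefix=$HOME/ompi-nomm --without-memory-manager
make -j4 all
make install

# Then recompile/relink the application against the new build:
$HOME/ompi-nomm/bin/mpif90 -g -o testallocate allocate.F90
```

Note that this is a build-time option: the memory manager cannot be removed
from an already-installed library, so the application must be relinked against
the rebuilt Open MPI.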

Regards
--Nysal

On Sun, Aug 8, 2010 at 7:49 PM, Nicolas Deladerriere <
nicolas.deladerri...@gmail.com> wrote:

> Yes, I'm using a 24 GB machine with a 64-bit Linux OS.
> If I compile without the wrapper, I do not get any problems.
>
> It seems that when I link with Open MPI, my program uses a kind of
> Open MPI-implemented malloc. Is it possible to switch it off in order to only
> use the malloc from libc?
>
> Nicolas
>
> 2010/8/8 Terry Frankcombe 
>
> You're trying to do a 6GB allocate.  Can your underlying system handle
>> that?  If you compile without the wrapper, does it work?
>>
>> I see your executable is using the OMPI memory stuff.  IIRC there are
>> switches to turn that off.
>>
>>
>> On Fri, 2010-08-06 at 15:05 +0200, Nicolas Deladerriere wrote:
>> > Hello,
>> >
>> > I'm having a SIGSEGV error when running a simple program compiled and
>> > linked with Open MPI.
>> > I have reproduced the problem using really simple Fortran code. It
>> > actually does not even use MPI, but just links with the MPI shared
>> > libraries. (The problem does not appear when I do not link with the MPI
>> > libraries.)
>> >% cat allocate.F90
>> >program test
>> >implicit none
>> >integer, dimension(:), allocatable :: z
>> >integer(kind=8) :: l
>> >
>> >write(*,*) "l ?"
>> >read(*,*) l
>> >
>> >ALLOCATE(z(l))
>> >z(1) = 111
>> >z(l) = 222
>> >DEALLOCATE(z)
>> >
>> >end program test
>> >
>> > I am using Open MPI 1.4.2 and gfortran for my tests. Here is the
>> > compilation:
>> >
>> >% ./openmpi-1.4.2/build/bin/mpif90 --showme -g -o testallocate
>> > allocate.F90
>> >gfortran -g -o testallocate allocate.F90
>> > -I/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/include -pthread
>> > -I/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/lib
>> > -L/s0/scr1/TOMOT_19311_HAL_/openmpi-1.4.2/build/lib -lmpi_f90
>> > -lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl
>> > -lutil -lm -ldl -pthread
>> >
>> > When I run that test with different lengths, I sometimes get a
>> > "Segmentation fault" error. Here are two examples using two specific
>> > values, but the error happens for many other values of the length (I did
>> > not manage to work out which values of the length give the error).
>> >
>> >%  ./testallocate
>> > l ?
>> >16
>> >Segmentation fault
>> >% ./testallocate
>> > l ?
>> >20
>> >
>> > I used a debugger with a re-compiled version of Open MPI built with
>> > debugging flags. I got the following error in the function sYSMALLOc:
>> >
>> >Program received signal SIGSEGV, Segmentation fault.
>> >0x2b70b3b3 in sYSMALLOc (nb=640016, av=0x2b930200)
>> > at malloc.c:3239
>> >3239set_head(remainder, remainder_size | PREV_INUSE);
>> >Current language:  auto; currently c
>> >(gdb) bt
>> >#0  0x2b70b3b3 in sYSMALLOc (nb=640016,
>> > av=0x2b930200) at malloc.c:3239
>> >#1  0x2b70d0db in opal_memory_ptmalloc2_int_malloc
>> > (av=0x2b930200, bytes=64) at malloc.c:4322
>> >#2  0x2b70b773 in opal_memory_ptmalloc2_malloc
>> > (bytes=64) at malloc.c:3435
>> >#3  0x2b70a665 in opal_memory_ptmalloc2_malloc_hook
>> > (sz=64, caller=0x2bf8534d) at hooks.c:667
>> >#4  0x2bf8534d in _gfortran_internal_free ()
>> > from /usr/lib64/libgfortran.so.1
>> >#5  0x00400bcc in MAIN__ () at allocate.F90:11
>> >#6  0x00400c4e in main ()
>> >(gdb) display
>> >(gdb) list
>> >3234  if ((unsigned long)(size) >= (unsigned long)(nb +
>> > MINSIZE)) {
>> >3235remainder_size = size - nb;
>> >3236remainder = chunk_at_offset(p, nb);
>> >3237av->top = remainder;
>> >3238set_head(p, nb | PREV_INUSE | (av != &main_arena ?
>> > NON_MAIN_ARENA : 0));
>> >3239set_head(remainder, remainder_size | PREV_INUSE);
>> >3240check_malloced_chunk(av, p, nb);
>> >3241return chunk2mem(p);
>> >3242  }
>> >3243
>> >
>> >
>> > I also did the same test in C and I got the same problem.
>> >
>> > Does anyone have any idea that could help me understand what's going
>> > on?
>> >
>> > Regards
>> > Nicolas
>> >


Re: [OMPI users] Memory allocation error when linking with MPI libraries

2010-08-08 Thread Nicolas Deladerriere
Yes, I'm using a 24 GB machine with a 64-bit Linux OS.
If I compile without the wrapper, I do not get any problems.

It seems that when I link with Open MPI, my program uses a kind of
Open MPI-implemented malloc. Is it possible to switch it off in order to only
use the malloc from libc?

Nicolas

2010/8/8 Terry Frankcombe 

> [quoted text trimmed]


Re: [OMPI users] Memory allocation error when linking with MPI libraries

2010-08-08 Thread Terry Frankcombe
You're trying to do a 6GB allocate.  Can your underlying system handle
that?  If you compile without the wrapper, does it work?

I see your executable is using the OMPI memory stuff.  IIRC there are
switches to turn that off.


On Fri, 2010-08-06 at 15:05 +0200, Nicolas Deladerriere wrote:
> [quoted text trimmed]