[OMPI users] Building OpenMPI 1.4.2 with PGI fortran 10.8 and gcc

2010-09-15 Thread Axel Schweiger

 Gus,
Thanks for the suggestion! I had missed the flag for F77.

Axel
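
For context, the kind of mixed-compiler configure invocation being discussed looks roughly like the following; the exact flags and prefix are illustrative (the F77/FC variables are the "flag for F77" referred to above), not quoted from the thread:

```
# Hypothetical hybrid build: gcc/g++ for C/C++, PGI for Fortran.
./configure CC=gcc CXX=g++ F77=pgf77 FC=pgfortran \
    --prefix=/opt/openmpi-pgi-gcc-1.4.2
```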


Re: [OMPI users] Building OpenMPI 1.4.2 with PGI fortran 10.8 and gcc

2010-09-15 Thread Prentice Bisbal
How good are you with reading/editing Makefiles? I find problems like
this are usually solved by searching the Makefiles for the offending
line(s) and removing the problem switch.

In a well-designed make environment, you should only have to edit the
top-level Makefile. In the worst case, you'll have to edit every
Makefile. Fortunately, you can usually speed this up with some shell
kung-fu, if necessary.

This of course doesn't work if the developers were "clever" enough to
use a build environment that overwrites the Makefiles with new ones
every time you try to build. I don't think this applies to Open MPI.
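
The "shell kung-fu" might look something like this hypothetical one-liner (assuming GNU sed, run from the top of the already-configured build tree):

```shell
# Strip the "-pthread" switch from every generated Makefile, keeping
# .bak backups so the edit can be reverted if something breaks.
find . -name Makefile -exec sed -i.bak 's/ -pthread\b//g' {} +
```

Re-running configure regenerates the Makefiles, so the edit would have to be repeated after any reconfigure.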

Prentice


Axel Schweiger wrote:
>  Trying to build a hybrid OpenMPI with PGI fortran and gcc to support
> the WRF model.
> The problem appears to be due to a -pthread switch passed to pgfortran.
> 
> 
> 
> libtool: link: pgfortran -shared  -fpic -Mnomain  .libs/mpi.o
> .libs/mpi_sizeof.o .libs/mpi_comm_spawn_multiple_f90.o
> .libs/mpi_testall_f90.o .libs/mpi_testsome_f90.o .libs/mpi_waitall_f90.o
> .libs/mpi_waitsome_f90.o .libs/mpi_wtick_f90.o .libs/mpi_wtime_f90.o  
> -Wl,-rpath -Wl,/home/axel/AxboxInstall/openmpi-1.4.2/ompi/.libs
> -Wl,-rpath -Wl,/home/axel/AxboxInstall/openmpi-1.4.2/orte/.libs
> -Wl,-rpath -Wl,/home/axel/AxboxInstall/openmpi-1.4.2/opal/.libs
> -Wl,-rpath -Wl,/opt/openmpi-pgi-gcc-1.42/lib
> -L/home/axel/AxboxInstall/openmpi-1.4.2/orte/.libs
> -L/home/axel/AxboxInstall/openmpi-1.4.2/opal/.libs
> ../../../ompi/.libs/libmpi.so
> /home/axel/AxboxInstall/openmpi-1.4.2/orte/.libs/libopen-rte.so
> /home/axel/AxboxInstall/openmpi-1.4.2/opal/.libs/libopen-pal.so -ldl
> -lnsl -lutil -lm -pthread -Wl,-soname -Wl,libmpi_f90.so.0 -o
> .libs/libmpi_f90.so.0.0.0
> pgfortran-Error-Unknown switch: -pthread
> make[4]: *** [libmpi_f90.la] Error 1
> 
> 
> There has been discussion of this issue and the solution below was
> suggested. It doesn't appear to work for the 10.8
> release.
> 
> http://www.open-mpi.org/community/lists/users/2009/04/8911.php
> 
> There was a previous thread:
> http://www.open-mpi.org/community/lists/users/2009/03/8687.php
> 
> suggesting other solutions.
> 
> Wondering if there is a better solution right now? I'm building 1.4.2.
> 
> Thanks
> Axel
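
The workarounds in the linked threads amount to hiding the unsupported switch from the PGI compiler. One hedged sketch of that idea (the wrapper name and filtering logic here are illustrative, not taken from those posts) is a small script that drops -pthread and forwards everything else to pgfortran:

```shell
# Hypothetical workaround: generate a wrapper, "pgfortran-nopthread",
# that strips the "-pthread" switch pgfortran rejects.
cat > pgfortran-nopthread <<'EOF'
#!/bin/sh
n=$#; i=0
while [ "$i" -lt "$n" ]; do
    a=$1; shift
    # keep every argument except the offending switch
    [ "$a" = "-pthread" ] || set -- "$@" "$a"
    i=$((i + 1))
done
exec pgfortran "$@"
EOF
chmod +x pgfortran-nopthread
```

Configure would then be pointed at the wrapper (e.g. FC=/path/to/pgfortran-nopthread), so libtool's link lines never reach the real compiler with the bad flag.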





Re: [OMPI users] [openib] segfault when using openib btl

2010-09-15 Thread Eloi Gaudry
Hi,

I was wondering if anybody got a chance to have a look at this issue.

Regards,
Eloi


On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> Hi Jeff,
> 
> Please find enclosed the output (valgrind.out.gz) from
> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
> openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> btl_openib_want_fork_support 0 -tag-output
> /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
> valgrind.supp --suppressions=./suppressions.python.supp
> /opt/actran/bin/actranpy_mp ...
> 
> Thanks,
> Eloi
> 
> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > I did run our application through valgrind but it couldn't find any
> > > > "Invalid write": there are a bunch of "Invalid read" errors (I'm using
> > > > 1.4.2 with the suppression file), plus "Use of uninitialized bytes" and
> > > > "Conditional jump depending on uninitialized bytes" in different ompi
> > > > routines. Some of them are located in btl_openib_component.c. I'll
> > > > send you an output of valgrind shortly.
> > > 
> > > A lot of them in btl_openib_* are to be expected -- OpenFabrics uses
> > > OS-bypass methods for some of its memory, and therefore valgrind is
> > > unaware of them (and therefore incorrectly marks them as
> > > uninitialized).
> > 
> > would it help if I used the upcoming 1.5 version of openmpi? I read that
> > a huge effort has been made to clean up the valgrind output, but maybe
> > that doesn't concern this btl (for the reasons you mentioned).
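
For readers who want to quiet the expected openib noise themselves, a valgrind suppression entry has roughly this shape; the entry name and frame pattern below are illustrative, not taken from Open MPI's shipped openmpi-valgrind.supp:

```
{
   openib_os_bypass_cond
   Memcheck:Cond
   fun:btl_openib_*
}
```

Such entries go in a .supp file passed via --suppressions=, as in the orterun command quoted above.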
> > 
> > > > Another question, you said that the callback function pointer should
> > > > never be 0. But can the tag be null (hdr->tag) ?
> > > 
> > > The tag is not a pointer -- it's just an integer.
> > 
> > I was wondering whether its value could legitimately be null.
> > 
> > I'll send a valgrind output soon (I need to build libpython without
> > pymalloc first).
> > 
> > Thanks,
> > Eloi
> > 
> > > > Thanks for your help,
> > > > Eloi
> > > > 
> > > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > > >> Sorry for the delay in replying.
> > > >> 
> > > >> Odd; the values of the callback function pointer should never be 0.
> > > >> This seems to suggest some kind of memory corruption is occurring.
> > > >> 
> > > >> I don't know if it's possible, because the stack trace looks like
> > > >> you're calling through python, but can you run this application
> > > >> through valgrind, or some other memory-checking debugger?
> > > >> 
> > > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > > >>> Hi,
> > > >>> 
> > > >>> sorry, i just forgot to add the values of the function parameters:
> > > >>> (gdb) print reg->cbdata
> > > >>> $1 = (void *) 0x0
> > > >>> (gdb) print openib_btl->super
> > > >>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > >>>   btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > >>>   btl_rdma_pipeline_send_length = 1048576,
> > > >>>   btl_rdma_pipeline_frag_size = 1048576,
> > > >>>   btl_min_rdma_pipeline_size = 1060864, btl_exclusivity = 1024,
> > > >>>   btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
> > > >>>   btl_add_procs = 0x2b341eb8ee47, btl_del_procs = 0x2b341eb90156,
> > > >>>   btl_register = 0, btl_finalize = 0x2b341eb93186,
> > > >>>   btl_alloc = 0x2b341eb90a3e, btl_free = 0x2b341eb91400,
> > > >>>   btl_prepare_src = 0x2b341eb91813, btl_prepare_dst = 0x2b341eb91f2e,
> > > >>>   btl_send = 0x2b341eb94517, btl_sendi = 0x2b341eb9340d,
> > > >>>   btl_put = 0x2b341eb94660, btl_get = 0x2b341eb94c4e,
> > > >>>   btl_dump = 0x2b341acd45cb, btl_mpool = 0xf3f4110,
> > > >>>   btl_register_error = 0x2b341eb90565, btl_ft_event = 0x2b341eb952e7}
> > > >>> 
> > > >>> (gdb) print hdr->tag
> > > >>> $3 = 0 '\0'
> > > >>> (gdb) print des
> > > >>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > >>> (gdb) print reg->cbfunc
> > > >>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > > >>> 
> > > >>> Eloi
> > > >>> 
> > > >>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > >  Hi,
> > >  
> > >  Here is the output of a core file generated during a segmentation
> > >  fault observed during a collective call (using openib):
> > >  
> > >  #0  0x in ?? ()
> > >  (gdb) where
> > >  #0  0x in ?? ()
> > >  #1  0x2aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> > >  #2  0x2aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7279ce90) at btl_openib_component.c:3178
> > >  #3  0x2aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
> > >  #4