Hi,
I was wondering whether anybody has had a chance to look at this issue.
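
To restate the problem compactly: in the gdb session quoted below, reg->cbfunc is 0 for hdr->tag == 0, and frame #0 of the backtrace sits at an unresolved address, which is what a call through a null function pointer looks like. Here is a minimal C sketch of that failure mode, assuming a receive-callback table indexed by the one-byte tag. All names in it (recv_cb_fn, recv_reg, cb_table, register_cb, dispatch, count_cb) are hypothetical stand-ins for illustration, not Open MPI's actual types or functions, and the NULL guard is only a debugging aid, not a proposed fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback signature: NOT Open MPI's real prototype. */
typedef void (*recv_cb_fn)(unsigned char tag, void *cbdata);

/* One registration slot per possible tag value (the tag is an integer). */
struct recv_reg {
    recv_cb_fn cbfunc;
    void *cbdata;
};

#define TAG_COUNT 256
static struct recv_reg cb_table[TAG_COUNT]; /* zero-initialized: no callbacks */

/* Register a callback for a tag; a null callback is a programming error. */
static void register_cb(unsigned char tag, recv_cb_fn cbfunc, void *cbdata)
{
    assert(cbfunc != NULL);
    cb_table[tag].cbfunc = cbfunc;
    cb_table[tag].cbdata = cbdata;
}

/*
 * Dispatch an incoming fragment by tag.
 * Without the NULL check, an empty or corrupted slot would be called
 * directly and the process would jump to address 0.
 * Returns 0 when a callback ran, -1 when the slot was empty.
 */
static int dispatch(unsigned char tag)
{
    const struct recv_reg *reg = &cb_table[tag];

    if (reg->cbfunc == NULL) {
        fprintf(stderr, "no callback registered for tag %u\n", (unsigned)tag);
        return -1;
    }
    reg->cbfunc(tag, reg->cbdata);
    return 0;
}

/* Sample callback that only counts how often it was invoked. */
static int calls = 0;
static void count_cb(unsigned char tag, void *cbdata)
{
    (void)tag;
    (void)cbdata;
    calls++;
}
```

With this sketch, dispatch(0) on an untouched table returns -1 instead of crashing, while register_cb(1, count_cb, NULL) followed by dispatch(1) returns 0 and increments calls. Whether tag 0 is ever legitimately registered in the real code is exactly the open question in this thread.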
Regards,
Eloi
On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> Hi Jeff,
>
> Please find enclosed the output (valgrind.out.gz) from
> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
> openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> btl_openib_want_fork_support 0 -tag-output
> /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
> --suppressions=./suppressions.python.supp
> /opt/actran/bin/actranpy_mp ...
>
> Thanks,
> Eloi
>
> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > I did run our application through valgrind, but it couldn't find any
> > > > "Invalid write". There are a bunch of "Invalid read" (I'm using 1.4.2
> > > > with the suppression file), "Use of uninitialized bytes" and
> > > > "Conditional jump depending on uninitialized bytes" reports in
> > > > different ompi routines. Some of them are located in
> > > > btl_openib_component.c. I'll send you the valgrind output shortly.
> > >
> > > A lot of them in btl_openib_* are to be expected -- OpenFabrics uses
> > > OS-bypass methods for some of its memory, and therefore valgrind is
> > > unaware of them (and therefore incorrectly marks them as
> > > uninitialized).
> >
> > Would it help if I used the upcoming 1.5 version of Open MPI? I read that
> > a huge effort has been made to clean up the valgrind output, but maybe
> > that doesn't concern this BTL (for the reasons you mentioned).
> >
> > > > Another question, you said that the callback function pointer should
> > > > never be 0. But can the tag be null (hdr->tag) ?
> > >
> > > The tag is not a pointer -- it's just an integer.
> >
> > I was worried that a value of 0 might not be valid for it.
> >
> > I'll send the valgrind output soon (I need to build libpython without
> > pymalloc first).
> >
> > Thanks,
> > Eloi
> >
> > > > Thanks for your help,
> > > > Eloi
> > > >
> > > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > > >> Sorry for the delay in replying.
> > > >>
> > > >> Odd; the values of the callback function pointer should never be 0.
> > > >> This seems to suggest some kind of memory corruption is occurring.
> > > >>
> > > >> I don't know if it's possible, because the stack trace looks like
> > > >> you're calling through python, but can you run this application
> > > >> through valgrind, or some other memory-checking debugger?
> > > >>
> > > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > > >>> Hi,
> > > >>>
> > > > >>> Sorry, I forgot to add the values of the function parameters:
> > > >>> (gdb) print reg->cbdata
> > > >>> $1 = (void *) 0x0
> > > >>> (gdb) print openib_btl->super
> > > > >>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > > >>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > > >>> btl_rdma_pipeline_send_length = 1048576,
> > > > >>> btl_rdma_pipeline_frag_size = 1048576,
> > > > >>> btl_min_rdma_pipeline_size = 1060864,
> > > > >>> btl_exclusivity = 1024, btl_latency = 10,
> > > > >>> btl_bandwidth = 800, btl_flags = 310,
> > > > >>> btl_add_procs = 0x2b341eb8ee47,
> > > > >>> btl_del_procs = 0x2b341eb90156,
> > > > >>> btl_register = 0,
> > > > >>> btl_finalize = 0x2b341eb93186,
> > > > >>> btl_alloc = 0x2b341eb90a3e,
> > > > >>> btl_free = 0x2b341eb91400,
> > > > >>> btl_prepare_src = 0x2b341eb91813,
> > > > >>> btl_prepare_dst = 0x2b341eb91f2e,
> > > > >>> btl_send = 0x2b341eb94517,
> > > > >>> btl_sendi = 0x2b341eb9340d,
> > > > >>> btl_put = 0x2b341eb94660,
> > > > >>> btl_get = 0x2b341eb94c4e,
> > > > >>> btl_dump = 0x2b341acd45cb,
> > > > >>> btl_mpool = 0xf3f4110,
> > > > >>> btl_register_error = 0x2b341eb90565,
> > > > >>> btl_ft_event = 0x2b341eb952e7}
> > > >>>
> > > >>> (gdb) print hdr->tag
> > > >>> $3 = 0 '\0'
> > > >>> (gdb) print des
> > > >>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > >>> (gdb) print reg->cbfunc
> > > >>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > > >>>
> > > >>> Eloi
> > > >>>
> > > >>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > Hi,
> > >
> > > Here is the output of a core file generated during a segmentation
> > > fault observed during a collective call (using openib):
> > >
> > > #0 0x in ?? ()
> > > (gdb) where
> > > #0 0x in ?? ()
> > > > #1 0x2aedbc4e05f4 in btl_openib_handle_incoming
> > > > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
> > > > byte_len=18) at btl_openib_component.c:2881
> > > > #2 0x2aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0,
> > > > wc=0x7279ce90) at btl_openib_component.c:3178
> > > > #3 0x2aedbc4e2e9d in poll_device
> > > > (device=0x19024ac0, count=2) at btl_openib_component.c:3318
> > > > #4
>