Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Terry Dontje
In some of the testing Eloi did earlier, he disabled eager RDMA and
still saw the issue.


--td

Shamis, Pavel wrote:

Terry,
Ishai Rabinovitz is the HPC team manager (I have added him to CC).

Eloi,

Back to the issue: I have seen a very similar issue a long time ago on some hardware 
platforms that support relaxed-ordering memory operations. If I remember 
correctly, it was some IBM platform.
Do you know if relaxed memory ordering is enabled on your platform? If it is 
enabled, you have to disable eager RDMA.

Regards,
Pasha

On Sep 29, 2010, at 1:04 PM, Terry Dontje wrote:

Pasha, do you by any chance know who at Mellanox might be responsible for 
keeping Open MPI working?

--td

Eloi Gaudry wrote:
 Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. Do you know of any Mellanox developer I could discuss this with, 
preferably someone who has spent some time inside the openib btl?

Regards,
Eloi

On 29/09/2010 06:01, Nysal Jan wrote:
Hi Eloi,
We discussed this issue during the weekly developer meeting & there were no 
further suggestions, apart from checking the driver and firmware levels. The 
consensus was that it would be better if you could take this up directly with your 
IB vendor.

Regards
--Nysal






  



--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Eloi Gaudry

 Pasha,
Thanks for your help.

I'm not aware of such a memory configuration on the new cluster of our 
customer (each computing node runs the Red Hat 5.x operating 
system on Intel X5570 processors).
Anyway, I've already tried deactivating eager_rdma, but this didn't 
solve the hdr->tag=0 issue (in 
share/openmpi/mca-btl-openib-device-params.ini, eager_rdma is enabled for 
[vendor_part_id=26428]).
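
For reference, eager RDMA can typically be turned off directly on the mpirun 
command line (assuming the standard openib MCA parameter name):

$ mpirun --mca btl openib,self --mca btl_openib_use_eager_rdma 0 /opt/actran/bin/actranpy_mp ...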


Ishai,
If you need any more information, please feel free to ask.

Regards,
Eloi

On 29/09/2010 19:49, Shamis, Pavel wrote:

Terry,
Ishai Rabinovitz is the HPC team manager (I have added him to CC).

Eloi,

Back to the issue: I have seen a very similar issue a long time ago on some hardware 
platforms that support relaxed-ordering memory operations. If I remember 
correctly, it was some IBM platform.
Do you know if relaxed memory ordering is enabled on your platform? If it is 
enabled, you have to disable eager RDMA.

Regards,
Pasha

On Sep 29, 2010, at 1:04 PM, Terry Dontje wrote:

Pasha, do you by any chance know who at Mellanox might be responsible for 
keeping Open MPI working?

--td

Eloi Gaudry wrote:
  Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. Do you know of any Mellanox developer I could discuss this with, 
preferably someone who has spent some time inside the openib btl?

Regards,
Eloi

On 29/09/2010 06:01, Nysal Jan wrote:
Hi Eloi,
We discussed this issue during the weekly developer meeting & there were no 
further suggestions, apart from checking the driver and firmware levels. The 
consensus was that it would be better if you could take this up directly with your 
IB vendor.

Regards
--Nysal








Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Shamis, Pavel
Terry,
Ishai Rabinovitz is the HPC team manager (I have added him to CC).

Eloi,

Back to the issue: I have seen a very similar issue a long time ago on some hardware 
platforms that support relaxed-ordering memory operations. If I remember 
correctly, it was some IBM platform.
Do you know if relaxed memory ordering is enabled on your platform? If it is 
enabled, you have to disable eager RDMA.
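
If you are not sure, the PCIe device-control bits reported by lspci usually 
give a hint (if I remember correctly, "RlxdOrd+" means relaxed ordering is 
enabled for that device):

$ lspci -vvv | grep -i RlxdOrd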

Regards,
Pasha

On Sep 29, 2010, at 1:04 PM, Terry Dontje wrote:

Pasha, do you by any chance know who at Mellanox might be responsible for 
keeping Open MPI working?

--td

Eloi Gaudry wrote:
 Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. Do you know of any Mellanox developer I could discuss this with, 
preferably someone who has spent some time inside the openib btl?

Regards,
Eloi

On 29/09/2010 06:01, Nysal Jan wrote:
Hi Eloi,
We discussed this issue during the weekly developer meeting & there were no 
further suggestions, apart from checking the driver and firmware levels. The 
consensus was that it would be better if you could take this up directly with 
your IB vendor.

Regards
--Nysal








Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Terry Dontje
Pasha, do you by any chance know who at Mellanox might be responsible 
for keeping Open MPI working?


--td

Eloi Gaudry wrote:

 Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. Do you know of any Mellanox developer I could 
discuss this with, preferably someone who has spent some time inside the 
openib btl?


Regards,
Eloi

On 29/09/2010 06:01, Nysal Jan wrote:

Hi Eloi,
We discussed this issue during the weekly developer meeting & there 
were no further suggestions, apart from checking the driver and 
firmware levels. The consensus was that it would be better if you 
could take this up directly with your IB vendor.


Regards
--Nysal




--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com



Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Eloi Gaudry

 Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. Do you know of any Mellanox developer I could 
discuss this with, preferably someone who has spent some time inside the 
openib btl?


Regards,
Eloi

On 29/09/2010 06:01, Nysal Jan wrote:

Hi Eloi,
We discussed this issue during the weekly developer meeting & there 
were no further suggestions, apart from checking the driver and 
firmware levels. The consensus was that it would be better if you 
could take this up directly with your IB vendor.


Regards
--Nysal


Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Nysal Jan
> [...] Mellanox (with vendor_part_id=26428). There is no receive_queues
> parameter associated with it.
>
> $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
> [...]
>   # A.k.a. ConnectX
>   [Mellanox Hermon]
>   vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
>   vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
>   use_eager_rdma = 1
>   mtu = 2048
>   max_inline_data = 128
> [..]
>
> $ ompi_info --param btl openib --parsable | grep receive_queues
>   mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
>   mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
>   mca:btl:openib:param:btl_openib_receive_queues:status:writable
>   mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
>   mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>
> I was wondering if these parameters (automatically computed at
> openib btl init, as far as I understand) were not incorrect in
> some way, so I plugged in some other values:
> "P,65536,256,192,128" (someone on the list used those values
> when encountering a different issue). Since then, I haven't
> been able to observe the segfault (occurring as hdr->tag = 0 in
> btl_openib_component.c:2881) yet.
>
> Eloi
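
For reference, such a receive_queues override is normally passed on the mpirun 
command line (assuming the usual MCA syntax), e.g.:

$ mpirun --mca btl openib,self --mca btl_openib_receive_queues "P,65536,256,192,128" /opt/actran/bin/actranpy_mp ...
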
> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
>
> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
> > Eloi, I am curious about your problem.  Can you tell me what
> > size of job it is?  Does it always fail on the same bcast, or
> > same process?
> >
> > Eloi Gaudry wrote:
> > > Hi Nysal,
> > >
> > > Thanks for your suggestions.
> > >
> > > I'm now able to get the checksum computed and redirected to
> > > stdout, thanks (I forgo

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Terry Dontje
[...] Thanks for your suggestions.

I'm now able to get the checksum computed and redirected to
stdout, thanks (I forgot the  "-mca pml_base_verbose 5"
option, you were right). I haven't been able to observe the
segmentation fault (with hdr->tag=0) so far (when using pml
csum) but I 'll let you know when I am.

I've got two others question, which may be related to the
error observed:

1/ does the maximum number of MPI_Comm that can be handled by
OpenMPI somehow depend on the btl being used (i.e. if I'm
using openib, may I use the same number of MPI_Comm objects as
with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?

2/ the segfaults only appear during an MPI collective call,
with very small messages (one int is being broadcast, for
instance); I followed the guidelines given at
http://icl.cs.utk.edu/open-
mpi/faq/?category=openfabrics#ib-small-message-rdma but the
debug build of OpenMPI asserts if I use a min-size other
than 255. Anyway, if I deactivate eager_rdma, the segfaults
remain. Does the openib btl handle very small messages
differently (even with eager_rdma
deactivated) than tcp?

Others on the list: does coalescing happen without eager_rdma?
If so, that would possibly be one difference between the
openib btl and tcp, aside from the actual protocol used.



Is there a way to make sure that large messages and small
messages are handled the same way?

Do you mean so that they all look like eager messages?  How large
are the messages we are talking about here: 1K, 1M, or 10M?

--td



Regards,
Eloi

On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
  

Hi Eloi,
Create a debug build of OpenMPI (--enable-debug) and while
running with the csum PML add "-mca pml_base_verbose 5" to
the command line. This will print the checksum details for
each fragment sent over the wire. I'm guessing it didn't
catch anything because the BTL failed. The checksum
verification is done in the PML, which the BTL calls via a
callback function. In your case the PML callback is never
called because the hdr->tag is invalid. So enabling
checksum tracing also might not be of much use. Is it the
first Bcast that fails or the nth Bcast and what is the
message size? I'm not sure what could be the problem at
this moment. I'm afraid you will have to debug the BTL to
find out more.

--Nysal

On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

thanks for your response.

I've been unable so far to write a test case that could
illustrate the hdr->tag=0 error.
Actually, I'm only observing this issue when running an
internode computation involving infiniband hardware from
Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
2.5GT/s, rev a0) with our time-domain software.

I checked, double-checked, and rechecked again every MPI
use performed during a parallel computation and I couldn't
find any error so far. The fact that the very
same parallel computation run flawlessly when using tcp
(and disabling openib support) might seem to indicate that
the issue is somewhere located inside the
openib btl or at the hardware/driver level.

I've just used the "-mca pml csum" option and I haven't
seen any related messages (when hdr->tag=0 and the
segfaults occurs). Any suggestion ?

Regards,
Eloi

On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
  

Hi Eloi,
Sorry for the delay in response. I haven't read the entire
email thread, but do you have a test case which can
reproduce this error? Without that it will be difficult to
nail down the cause. Just to clarify, I do not work for an
iwarp vendor. I can certainly try to reproduce it on an IB
system. There is also a PML called csum, you can use it
via "-mca pml csum", which will checksum the MPI messages
and verify it at the receiver side for any data
corruption. You can try using it to see if it is able


to

  

catch anything.

Regards
--Nysal

On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

I'm sorry to interrupt, but I was wondering if you had a
chance to look
  

at

  

this error.

Regards,
Eloi



--


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


------ Forwarded message --
From: Eloi Gaudry <e...@fft.be>
To: Open MPI Users <us...@open-mpi.org>
Date: Wed, 15 Sep 2010 16:27:43 +0200
Subject: Re: [OMPI users] [openib] segfault when using
openib btl Hi,

I was wondering if anybody got a chance to have a look at
this issue.

Regards,
Eloi

On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
  

H

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Eloi Gaudry
> >>>>>>>>>>>  mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> >>>>>>>>>>>  mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> >>>>>>>>>>>  mca:btl:openib:param:btl_openib_receive_queues:status:writable
> >>>>>>>>>>>  mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> >>>>>>>>>>>  mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
> >>>>>>>>>>> 
> >>>>>>>>>>> I was wondering if these parameters (automatically computed at
> >>>>>>>>>>> openib btl init, as far as I understand) were not incorrect in
> >>>>>>>>>>> some way, so I plugged in some other values:
> >>>>>>>>>>> "P,65536,256,192,128" (someone on the list used those values
> >>>>>>>>>>> when encountering a different issue). Since then, I haven't
> >>>>>>>>>>> been able to observe the segfault (occurring as hdr->tag = 0 in
> >>>>>>>>>>> btl_openib_component.c:2881) yet.
> >>>>>>>>>>> 
> >>>>>>>>>>> Eloi
> >>>>>>>>>>> 
> >>>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
> >>>>>>>>>>> 
> >>>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
> >>>>>>>>>>>> Eloi, I am curious about your problem.  Can you tell me what
> >>>>>>>>>>>> size of job it is?  Does it always fail on the same bcast,  or
> >>>>>>>>>>>> same process?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Eloi Gaudry wrote:
> >>>>>>>>>>>>> Hi Nysal,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Thanks for your suggestions.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> I'm now able to get the checksum computed and redirected to
> >>>>>>>>>>>>> stdout, thanks (I forgot the  "-mca pml_base_verbose 5"
> >>>>>>>>>>>>> option, you were right). I haven't been able to observe the
> >>>>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml
> >>>>>>>>>>>>> csum) but I 'll let you know when I am.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> I've got two others question, which may be related to the
> >>>>>>>>>>>>> error observed:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled by
> >>>>>>>>>>>>> OpenMPI somehow depends on the btl being used (i.e. if I'm
> >>>>>>>>>>>>> using openib, may I use the same number of MPI_Comm object as
> >>>>>>>>>>>>> with tcp) ? Is there something as MPI_COMM_MAX in OpenMPI ?
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 2/ the segfaults only appears during a mpi collective call,
> >>>>>>>>>>>>> with very small message (one int is being broadcast, for
> >>>>>>>>>>>>> instance) ; i followed the guidelines given at
> >>>>>>>>>>>>> http://icl.cs.utk.edu/open-
> >>>>>>>>>>>>> mpi/faq/?category=openfabrics#

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Terry Dontje
something as MPI_COMM_MAX in OpenMPI ?

2/ the segfaults only appears during a mpi collective call,
with very small message (one int is being broadcast, for
instance) ; i followed the guidelines given at
http://icl.cs.utk.edu/open-
mpi/faq/?category=openfabrics#ib-small-message-rdma but the
debug-build of OpenMPI asserts if I use a different min-size
that 255. Anyway, if I deactivate eager_rdma, the segfaults
remains. Does the openib btl handle very small message
differently (even with eager_rdma
deactivated) than tcp ?
  
Others on the list does coalescing happen with non-eager_rdma? 
If so then that would possibly be one difference between the

openib btl and tcp aside from the actual protocol used.



 is there a way to make sure that large messages and small
 messages are handled the same way ?
  

Do you mean so they all look like eager messages?  How large of
messages are we talking about here 1K, 1M or 10M?

--td



Regards,
Eloi

On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
  

Hi Eloi,
Create a debug build of OpenMPI (--enable-debug) and while
running with the csum PML add "-mca pml_base_verbose 5" to the
command line. This will print the checksum details for each
fragment sent over the wire. I'm guessing it didnt catch
anything because the BTL failed. The checksum verification is
done in the PML, which the BTL calls via a callback function.
In your case the PML callback is never called because the
hdr->tag is invalid. So enabling checksum tracing also might
not be of much use. Is it the first Bcast that fails or the
nth Bcast and what is the message size? I'm not sure what
could be the problem at this moment. I'm afraid you will have
to debug the BTL to find out more.

--Nysal

On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

thanks for your response.

I've been unable so far to write a test case that could
illustrate the hdr->tag=0 error.
Actually, I'm only observing this issue when running an
internode computation involving infiniband hardware from
Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
2.5GT/s, rev a0) with our time-domain software.

I checked, double-checked, and rechecked again every MPI use
performed during a parallel computation and I couldn't find
any error so far. The fact that the very
same parallel computation run flawlessly when using tcp (and
disabling openib support) might seem to indicate that the
issue is somewhere located inside the
openib btl or at the hardware/driver level.

I've just used the "-mca pml csum" option and I haven't seen
any related messages (when hdr->tag=0 and the segfaults
occurs). Any suggestion ?

Regards,
Eloi

On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
  

Hi Eloi,
Sorry for the delay in response. I haven't read the entire
email thread, but do you have a test case which can
reproduce this error? Without that it will be difficult to
nail down the cause. Just to clarify, I do not work for an
iwarp vendor. I can certainly try to reproduce it on an IB
system. There is also a PML called csum, you can use it via
"-mca pml csum", which will checksum the MPI messages and
verify it at the receiver side for any data
corruption. You can try using it to see if it is able


to

  

catch anything.

Regards
--Nysal

On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

I'm sorry to intrrupt, but I was wondering if you had a
chance to look
  

at

  

this error.

Regards,
Eloi



--


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


-- Forwarded message ------
From: Eloi Gaudry <e...@fft.be>
To: Open MPI Users <us...@open-mpi.org>
Date: Wed, 15 Sep 2010 16:27:43 +0200
Subject: Re: [OMPI users] [openib] segfault when using
openib btl Hi,

I was wondering if anybody got a chance to have a look at
this issue.

Regards,
Eloi

On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
  

Hi Jeff,

Please find enclosed the output (valgrind.out.gz) from
/opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host
pbn11,pbn10 --mca


btl

  

openib,self --display-map --verbose --mca mpi_warn_on_fork
0 --mca btl_openib_want_fork_support 0 -tag-output
/opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
--suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/open
mp i- valgrind.supp
--suppressions=./suppressions.python.supp
/opt/actran/bin/actranpy_mp ...

Thanks,
Eloi

On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
  

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Eloi Gaudry
Gaudry wrote:
> >>>>>>>>>>> Hi Nysal,
> >>>>>>>>>>> 
> >>>>>>>>>>> Thanks for your suggestions.
> >>>>>>>>>>> 
> >>>>>>>>>>> I'm now able to get the checksum computed and redirected to
> >>>>>>>>>>> stdout, thanks (I forgot the  "-mca pml_base_verbose 5" option,
> >>>>>>>>>>> you were right). I haven't been able to observe the
> >>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml
> >>>>>>>>>>> csum) but I 'll let you know when I am.
> >>>>>>>>>>> 
> >>>>>>>>>>> I've got two others question, which may be related to the error
> >>>>>>>>>>> observed:
> >>>>>>>>>>> 
> >>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled by
> >>>>>>>>>>> OpenMPI somehow depends on the btl being used (i.e. if I'm
> >>>>>>>>>>> using openib, may I use the same number of MPI_Comm object as
> >>>>>>>>>>> with tcp) ? Is there something as MPI_COMM_MAX in OpenMPI ?
> >>>>>>>>>>> 
> >>>>>>>>>>> 2/ the segfaults only appears during a mpi collective call,
> >>>>>>>>>>> with very small message (one int is being broadcast, for
> >>>>>>>>>>> instance) ; i followed the guidelines given at
> >>>>>>>>>>> http://icl.cs.utk.edu/open-
> >>>>>>>>>>> mpi/faq/?category=openfabrics#ib-small-message-rdma but the
> >>>>>>>>>>> debug-build of OpenMPI asserts if I use a different min-size
> >>>>>>>>>>> that 255. Anyway, if I deactivate eager_rdma, the segfaults
> >>>>>>>>>>> remains. Does the openib btl handle very small message
> >>>>>>>>>>> differently (even with eager_rdma
> >>>>>>>>>>> deactivated) than tcp ?
> >>>>>>>>>> 
> >>>>>>>>>> Others on the list does coalescing happen with non-eager_rdma? 
> >>>>>>>>>> If so then that would possibly be one difference between the
> >>>>>>>>>> openib btl and tcp aside from the actual protocol used.
> >>>>>>>>>> 
> >>>>>>>>>>>  is there a way to make sure that large messages and small
> >>>>>>>>>>>  messages are handled the same way ?
> >>>>>>>>>> 
> >>>>>>>>>> Do you mean so they all look like eager messages?  How large of
> >>>>>>>>>> messages are we talking about here 1K, 1M or 10M?
> >>>>>>>>>> 
> >>>>>>>>>> --td
> >>>>>>>>>> 
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Eloi
> >>>>>>>>>>> 
> >>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
> >>>>>>>>>>>> Hi Eloi,
> >>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug) and while
> >>>>>>>>>>>> running with the csum PML add "-mca pml_base_verbose 5" to the
> >>>>>>>>>>>> command line. This will print the checksum details for each
> >>>>>>>>>>>> fragment sent over the wire. I'm guessing it didnt catch
> >>>>>>>>>>>> anything because the BTL failed. The checksum verification is
> >>>>>>>>>>>> done in the PML, which the BTL calls via a callback function.
> >>>>>>>>>>>> In your case the PML callback is never called because the
> >>>>>>>>>>>> hdr->tag is invalid. So enabling checksum tracing also might
> >>>>>>>>>>>> not be of much use. Is it the first Bcast that fails or the
> >>>>>>>>>>>> n


Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Terry Dontje
is able


to

  

catch anything.

Regards
--Nysal

On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

I'm sorry to intrrupt, but I was wondering if you had a chance to
look
  

at

  

this error.

Regards,
Eloi



--


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


-- Forwarded message --
From: Eloi Gaudry <e...@fft.be>
To: Open MPI Users <us...@open-mpi.org>
Date: Wed, 15 Sep 2010 16:27:43 +0200
Subject: Re: [OMPI users] [openib] segfault when using openib btl
Hi,

I was wondering if anybody got a chance to have a look at this
issue.

Regards,
Eloi

On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
  

Hi Jeff,

Please find enclosed the output (valgrind.out.gz) from
/opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
openib,self --display-map --verbose --mca mpi_warn_on_fork 0
--mca btl_openib_want_fork_support 0 -tag-output
/opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
--suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
--suppressions=./suppressions.python.supp
/opt/actran/bin/actranpy_mp ...

Thanks,
Eloi

On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:


On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
  

On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:


I did run our application through valgrind but it couldn't
find any "Invalid write": there is a bunch of "Invalid read"
(I'm using
  

1.4.2

  

with the suppression file), "Use of uninitialized bytes" and
"Conditional jump depending on uninitialized bytes" in
  

different

  

ompi

  

routines. Some of them are located in btl_openib_component.c.
I'll send you an output of valgrind shortly.
  

A lot of them in btl_openib_* are to be expected --
OpenFabrics uses OS-bypass methods for some of its memory,
and therefore valgrind is unaware of them (and therefore
incorrectly marks them as
uninitialized).


would it help if I use the upcoming 1.5 version of openmpi? I read
that a huge effort has been done to clean up the valgrind output,
but maybe that this doesn't concern this btl (for the reasons
you mentioned).

  

Another question: you said that the callback function pointer should
never be 0. But can the tag be null (hdr->tag)?

The tag is not a pointer -- it's just an integer.


I was wondering whether its value could legitimately be null.

I'll send a valgrind output soon (i need to build libpython
without pymalloc first).

Thanks,
Eloi

  

Thanks for your help,
Eloi

On 16/08/2010 18:22, Jeff Squyres wrote:
  

Sorry for the delay in replying.

Odd; the values of the callback function pointer should never be 0.
This seems to suggest some kind of memory corruption is occurring.

I don't know if it's possible, because the stack trace looks
like you're calling through python, but can you run this
application through valgrind, or some other memory-checking
debugger?

On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:


Hi,

sorry, I just forgot to add the values of the function parameters:

(gdb) print reg->cbdata
$1 = (void *) 0x0
(gdb) print openib_btl->super
$2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
  btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
  btl_rdma_pipeline_send_length = 1048576,
  btl_rdma_pipeline_frag_size = 1048576,
  btl_min_rdma_pipeline_size = 1060864, btl_exclusivity = 1024,
  btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
  btl_add_procs = 0x2b341eb8ee47, btl_del_procs = 0x2b341eb90156,
  btl_register = 0, btl_finalize = 0x2b341eb93186,
  btl_alloc = 0x2b341eb90a3e, btl_free = 0x2b341eb91400,
  btl_prepare_src = 0x2b341eb91813, btl_prepare_dst = 0x2b341eb91f2e,
  btl_send = 0x2b341eb94517, btl_sendi = 0x2b341eb9340d,
  btl_put = 0x2b341eb94660, btl_get = 0x2b341eb94c4e,
  btl_dump = 0x2b341acd45cb, b

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
 anything because the BTL failed.
> >>>>>> The checksum verification is done in the PML, which the BTL calls
> >>>>>> via a callback function. In your case the PML callback is never
> >>>>>> called because the hdr->tag is invalid. So enabling checksum
> >>>>>> tracing also might not be of much use. Is it the first Bcast that
> >>>>>> fails or the nth Bcast and what is the message size? I'm not sure
> >>>>>> what could be the problem at this moment. I'm afraid you will have
> >>>>>> to debug the BTL to find out more.
> >>>>>> 
> >>>>>> --Nysal
> >>>>>> 
> >>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:
> >>>>>>> Hi Nysal,
> >>>>>>> 
> >>>>>>> thanks for your response.
> >>>>>>> 
> >>>>>>> I've been unable so far to write a test case that could illustrate
> >>>>>>> the hdr->tag=0 error.
> >>>>>>> Actually, I'm only observing this issue when running an internode
> >>>>>>> computation involving infiniband hardware from Mellanox (MT25418,
> >>>>>>> ConnectX IB DDR, PCIe 2.0
> >>>>>>> 2.5GT/s, rev a0) with our time-domain software.
> >>>>>>> 
> >>>>>>> I checked, double-checked, and rechecked again every MPI use
> >>>>>>> performed during a parallel computation and I couldn't find any
> >>>>>>> error so far. The fact that the very
> >>>>>>> same parallel computation run flawlessly when using tcp (and
> >>>>>>> disabling openib support) might seem to indicate that the issue is
> >>>>>>> somewhere located inside the
> >>>>>>> openib btl or at the hardware/driver level.
> >>>>>>> 
> >>>>>>> I've just used the "-mca pml csum" option and I haven't seen any
> >>>>>>> related messages (when hdr->tag=0 and the segfaults occurs).
> >>>>>>> Any suggestion ?
> >>>>>>> 
> >>>>>>> Regards,
> >>>>>>> Eloi
> >>>>>>> 
> >>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> >>>>>>>> Hi Eloi,
> >>>>>>>> Sorry for the delay in response. I haven't read the entire email
> >>>>>>>> thread, but do you have a test case which can reproduce this
> >>>>>>>> error? Without that it will be difficult to nail down the cause.
> >>>>>>>> Just to clarify, I do not work for an iwarp vendor. I can
> >>>>>>>> certainly try to reproduce it on an IB system. There is also a
> >>>>>>>> PML called csum, you can use it via "-mca pml csum", which will
> >>>>>>>> checksum the MPI messages and verify it at the receiver side for
> >>>>>>>> any data
> >>>>>>>> corruption. You can try using it to see if it is able
> >>>>>>> 
> >>>>>>> to
> >>>>>>> 
> >>>>>>>> catch anything.
> >>>>>>>> 
> >>>>>>>> Regards
> >>>>>>>> --Nysal
> >>>>>>>> 
> >>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
> >>>>>>>>> Hi Nysal,
> >>>>>>>>> 
> >>>>>>>>> I'm sorry to intrrupt, but I was wondering if you had a chance to
> >>>>>>>>> look
> >>>>>>> 
> >>>>>>> at
> >>>>>>> 
> >>>>>>>>> this error.
> >>>>>>>>> 
> >>>>>>>>> Regards,
> >>>>>>>>> Eloi
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> --
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Eloi Gaudry
> >>>>>>>>> 
> >>>>>>&g

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
/ the segfaults only appears during a mpi collective call, with very
> >>> small message (one int is being broadcast, for instance) ; i followed
> >>> the guidelines given at http://icl.cs.utk.edu/open-
> >>> mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug-build
> >>> of OpenMPI asserts if I use a different min-size that 255. Anyway, if I
> >>> deactivate eager_rdma, the segfaults remains. Does the openib btl
> >>> handle very small message differently (even with eager_rdma
> >>> deactivated) than tcp ?
> >> 
> >> Others on the list does coalescing happen with non-eager_rdma?  If so
> >> then that would possibly be one difference between the openib btl and
> >> tcp aside from the actual protocol used.
> >> 
> >>>  is there a way to make sure that large messages and small messages are
> >>>  handled the same way ?
> >> 
> >> Do you mean so they all look like eager messages?  How large of messages
> >> are we talking about here 1K, 1M or 10M?
> >> 
> >> --td
> >> 
> >>> Regards,
> >>> Eloi
> >>> 
> >>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
> >>>> Hi Eloi,
> >>>> Create a debug build of OpenMPI (--enable-debug) and while running
> >>>> with the csum PML add "-mca pml_base_verbose 5" to the command line.
> >>>> This will print the checksum details for each fragment sent over the
> >>>> wire. I'm guessing it didnt catch anything because the BTL failed.
> >>>> The checksum verification is done in the PML, which the BTL calls via
> >>>> a callback function. In your case the PML callback is never called
> >>>> because the hdr->tag is invalid. So enabling checksum tracing also
> >>>> might not be of much use. Is it the first Bcast that fails or the nth
> >>>> Bcast and what is the message size? I'm not sure what could be the
> >>>> problem at this moment. I'm afraid you will have to debug the BTL to
> >>>> find out more.
> >>>> 
> >>>> --Nysal
> >>>> 
> >>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:
> >>>>> Hi Nysal,
> >>>>> 
> >>>>> thanks for your response.
> >>>>> 
> >>>>> I've been unable so far to write a test case that could illustrate
> >>>>> the hdr->tag=0 error.
> >>>>> Actually, I'm only observing this issue when running an internode
> >>>>> computation involving infiniband hardware from Mellanox (MT25418,
> >>>>> ConnectX IB DDR, PCIe 2.0
> >>>>> 2.5GT/s, rev a0) with our time-domain software.
> >>>>> 
> >>>>> I checked, double-checked, and rechecked again every MPI use
> >>>>> performed during a parallel computation and I couldn't find any
> >>>>> error so far. The fact that the very
> >>>>> same parallel computation run flawlessly when using tcp (and
> >>>>> disabling openib support) might seem to indicate that the issue is
> >>>>> somewhere located inside the
> >>>>> openib btl or at the hardware/driver level.
> >>>>> 
> >>>>> I've just used the "-mca pml csum" option and I haven't seen any
> >>>>> related messages (when hdr->tag=0 and the segfaults occurs).
> >>>>> Any suggestion ?
> >>>>> 
> >>>>> Regards,
> >>>>> Eloi
> >>>>> 
> >>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> >>>>>> Hi Eloi,
> >>>>>> Sorry for the delay in response. I haven't read the entire email
> >>>>>> thread, but do you have a test case which can reproduce this error?
> >>>>>> Without that it will be difficult to nail down the cause. Just to
> >>>>>> clarify, I do not work for an iwarp vendor. I can certainly try to
> >>>>>> reproduce it on an IB system. There is also a PML called csum, you
> >>>>>> can use it via "-mca pml csum", which will checksum the MPI
> >>>>>> messages and verify it at the receiver side for any data
> >>>>>> corruption. You can try using it to see if it is able
> >>>>> 
> >>>>> to
> >>>>> 
> &

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Terry Dontje
sure what could be the
problem at this moment. I'm afraid you will have to debug the BTL to
find out more.

--Nysal

On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

thanks for your response.

I've been unable so far to write a test case that could illustrate the
hdr->tag=0 error.
Actually, I'm only observing this issue when running an internode
computation involving infiniband hardware from Mellanox (MT25418,
ConnectX IB DDR, PCIe 2.0
2.5GT/s, rev a0) with our time-domain software.

I checked, double-checked, and rechecked again every MPI use performed
during a parallel computation and I couldn't find any error so far. The
fact that the very
same parallel computation run flawlessly when using tcp (and disabling
openib support) might seem to indicate that the issue is somewhere
located inside the
openib btl or at the hardware/driver level.

I've just used the "-mca pml csum" option and I haven't seen any
related messages (when hdr->tag=0 and the segfaults occurs).
Any suggestion ?

Regards,
Eloi

On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
  

Hi Eloi,
Sorry for the delay in response. I haven't read the entire email
thread, but do you have a test case which can reproduce this error?
Without that it will be difficult to nail down the cause. Just to
clarify, I do not work for an iwarp vendor. I can certainly try to
reproduce it on an IB system. There is also a PML called csum, you can
use it via "-mca pml csum", which will checksum the MPI messages and
verify it at the receiver side for any data corruption. You can try
using it to see if it is able


to

  

catch anything.

Regards
--Nysal

On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

I'm sorry to intrrupt, but I was wondering if you had a chance to
look
  

at

  

this error.

Regards,
Eloi



--


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


-- Forwarded message --
From: Eloi Gaudry <e...@fft.be>
To: Open MPI Users <us...@open-mpi.org>
Date: Wed, 15 Sep 2010 16:27:43 +0200
Subject: Re: [OMPI users] [openib] segfault when using openib btl
Hi,

I was wondering if anybody got a chance to have a look at this issue.

Regards,
Eloi

On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
  

Hi Jeff,

Please find enclosed the output (valgrind.out.gz) from
/opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca


btl

  

openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
btl_openib_want_fork_support 0 -tag-output
/opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
--suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
valgrind.supp --suppressions=./suppressions.python.supp
/opt/actran/bin/actranpy_mp ...

Thanks,
Eloi

On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:


On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
  

On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:


I did run our application through valgrind but it couldn't
find any "Invalid write": there is a bunch of "Invalid read"
(I'm using
  

1.4.2

  

with the suppression file), "Use of uninitialized bytes" and
"Conditional jump depending on uninitialized bytes" in
  

different

  

ompi

  

routines. Some of them are located in btl_openib_component.c.
I'll send you an output of valgrind shortly.
  

A lot of them in btl_openib_* are to be expected -- OpenFabrics
uses OS-bypass methods for some of its memory, and therefore
valgrind is unaware of them (and therefore incorrectly marks
them as
uninitialized).


would it  help if i use the upcoming 1.5 version of openmpi ? i
  

read

  

that

  

a huge effort has been done to clean-up the valgrind output ? but
maybe that this doesn't concern this btl (for the reasons you
mentionned).

  

Another question, you said that the callback function pointer
  

should

  

never be 0. But can the tag be null (hdr->tag) ?
  

The tag is not a pointer -- it's just an integer.


I was worrying that its value could not be null.

I'll send a valgrind output soon (i need to build libpython
without pymalloc first).

Thanks,
Eloi

  

Thanks for your help,
Eloi

On 16/08/2010 18:22, Jeff Squyres wrote:
  

Sorry for the delay in replying.

Odd; the values of the callback function pointer should
never


be

  

0.

  

This seems to suggest some kind of memory corruption is
occurring.

I don't know if it's possible,

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
ebug the BTL to
> >> find out more.
> >> 
> >> --Nysal
> >> 
> >> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:
> >>> Hi Nysal,
> >>> 
> >>> thanks for your response.
> >>> 
> >>> I've been unable so far to write a test case that could illustrate the
> >>> hdr->tag=0 error.
> >>> Actually, I'm only observing this issue when running an internode
> >>> computation involving infiniband hardware from Mellanox (MT25418,
> >>> ConnectX IB DDR, PCIe 2.0
> >>> 2.5GT/s, rev a0) with our time-domain software.
> >>> 
> >>> I checked, double-checked, and rechecked again every MPI use performed
> >>> during a parallel computation and I couldn't find any error so far. The
> >>> fact that the very
> >>> same parallel computation run flawlessly when using tcp (and disabling
> >>> openib support) might seem to indicate that the issue is somewhere
> >>> located inside the
> >>> openib btl or at the hardware/driver level.
> >>> 
> >>> I've just used the "-mca pml csum" option and I haven't seen any
> >>> related messages (when hdr->tag=0 and the segfaults occurs).
> >>> Any suggestion ?
> >>> 
> >>> Regards,
> >>> Eloi
> >>> 
> >>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> >>>> Hi Eloi,
> >>>> Sorry for the delay in response. I haven't read the entire email
> >>>> thread, but do you have a test case which can reproduce this error?
> >>>> Without that it will be difficult to nail down the cause. Just to
> >>>> clarify, I do not work for an iwarp vendor. I can certainly try to
> >>>> reproduce it on an IB system. There is also a PML called csum, you can
> >>>> use it via "-mca pml csum", which will checksum the MPI messages and
> >>>> verify it at the receiver side for any data corruption. You can try
> >>>> using it to see if it is able
> >>> 
> >>> to
> >>> 
> >>>> catch anything.
> >>>> 
> >>>> Regards
> >>>> --Nysal
> >>>> 
> >>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
> >>>>> Hi Nysal,
> >>>>> 
> >>>>> I'm sorry to intrrupt, but I was wondering if you had a chance to
> >>>>> look
> >>> 
> >>> at
> >>> 
> >>>>> this error.
> >>>>> 
> >>>>> Regards,
> >>>>> Eloi
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> --
> >>>>> 
> >>>>> 
> >>>>> Eloi Gaudry
> >>>>> 
> >>>>> Free Field Technologies
> >>>>> Company Website: http://www.fft.be
> >>>>> Company Phone:   +32 10 487 959
> >>>>> 
> >>>>> 
> >>>>> -- Forwarded message --
> >>>>> From: Eloi Gaudry <e...@fft.be>
> >>>>> To: Open MPI Users <us...@open-mpi.org>
> >>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
> >>>>> Subject: Re: [OMPI users] [openib] segfault when using openib btl
> >>>>> Hi,
> >>>>> 
> >>>>> I was wondering if anybody got a chance to have a look at this issue.
> >>>>> 
> >>>>> Regards,
> >>>>> Eloi
> >>>>> 
> >>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> >>>>>> Hi Jeff,
> >>>>>> 
> >>>>>> Please find enclosed the output (valgrind.out.gz) from
> >>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca
> >>> 
> >>> btl
> >>> 
> >>>>>> openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> >>>>>> btl_openib_want_fork_support 0 -tag-output
> >>>>>> /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> >>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
> >>>>>> valgrind.supp --suppressions=./suppressions.python.supp
> >>>>>> /opt/actran/bin/actranpy_mp ...
> >>>>>> 
> >&g

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-23 Thread Terry Dontje
Eloi, I am curious about your problem.  Can you tell me what size of job 
it is?  Does it always fail on the same bcast,  or same process?

Eloi Gaudry wrote:

Hi Nysal,

Thanks for your suggestions.

I'm now able to get the checksum computed and redirected to stdout, thanks (I forgot the  
"-mca pml_base_verbose 5" option, you were right).
I haven't been able to observe the segmentation fault (with hdr->tag=0) so far 
(when using pml csum) but I 'll let you know when I am.

I've got two other questions, which may be related to the error observed:

1/ does the maximum number of MPI_Comm that can be handled by OpenMPI somehow depend on the btl being used (i.e. if I'm using 
openib, may I use the same number of MPI_Comm objects as with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?

2/ the segfaults only appear during an MPI collective call, with very small 
messages (one int is being broadcast, for instance); I followed the guidelines 
given at http://icl.cs.utk.edu/open-
mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug build of OpenMPI asserts if I use a min-size other than 255. Anyway, if I deactivate eager_rdma, the segfaults remain. 
Does the openib btl handle very small messages differently (even with eager_rdma deactivated) than tcp?
Others on the list: does coalescing happen without eager_rdma?  If so, 
that would possibly be one difference between the openib btl and 
tcp, aside from the actual protocol used.

Is there a way to make sure that large messages and small messages are handled 
the same way?

Do you mean so that they all look like eager messages?  How large are the 
messages we are talking about here: 1K, 1M, or 10M?
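
If the goal is to have small and large messages follow the same (eager) path, 
one knob that might help is the openib eager limit (assuming the usual 
parameter name; anything below that size would be sent as an eager message):

$ mpirun --mca btl openib,self --mca btl_openib_eager_limit 65536 /opt/actran/bin/actranpy_mp ...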


--td

Regards,
Eloi


On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
  

Hi Eloi,
Create a debug build of OpenMPI (--enable-debug) and while running with the
csum PML add "-mca pml_base_verbose 5" to the command line. This will print
the checksum details for each fragment sent over the wire. I'm guessing it
didn't catch anything because the BTL failed. The checksum verification is
done in the PML, which the BTL calls via a callback function. In your case
the PML callback is never called because the hdr->tag is invalid. So
enabling checksum tracing also might not be of much use. Is it the first
Bcast that fails or the nth Bcast and what is the message size? I'm not
sure what could be the problem at this moment. I'm afraid you will have to
debug the BTL to find out more.

--Nysal
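
For what it's worth, one way to start debugging the BTL is to attach gdb to one 
of the ranks and trap the bad tag right where the segfault was reported (a rough 
sketch; the line number is the one quoted earlier in this thread, and hdr is 
assumed to be in scope at that point):

$ gdb -p <pid-of-one-MPI-rank>
(gdb) break btl_openib_component.c:2881 if hdr->tag == 0
(gdb) continue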

On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

thanks for your response.

I've been unable so far to write a test case that could illustrate the
hdr->tag=0 error.
Actually, I'm only observing this issue when running an internode
computation involving infiniband hardware from Mellanox (MT25418,
ConnectX IB DDR, PCIe 2.0
2.5GT/s, rev a0) with our time-domain software.

I checked, double-checked, and rechecked again every MPI use performed
during a parallel computation and I couldn't find any error so far. The
fact that the very
same parallel computation run flawlessly when using tcp (and disabling
openib support) might seem to indicate that the issue is somewhere
located inside the
openib btl or at the hardware/driver level.

I've just used the "-mca pml csum" option and I haven't seen any related
messages (when hdr->tag=0 and the segfaults occurs).
Any suggestion ?

Regards,
Eloi

On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
  

Hi Eloi,
Sorry for the delay in response. I haven't read the entire email
thread, but do you have a test case which can reproduce this error?
Without that it will be difficult to nail down the cause. Just to
clarify, I do not work for an iwarp vendor. I can certainly try to
reproduce it on an IB system. There is also a PML called csum, you can
use it via "-mca pml csum", which will checksum the MPI messages and
verify it at the receiver side for any data corruption. You can try
using it to see if it is able


to

  

catch anything.

Regards
--Nysal

On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:


Hi Nysal,

I'm sorry to intrrupt, but I was wondering if you had a chance to
look
  

at

  

this error.

Regards,
Eloi



--


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


-- Forwarded message --
From: Eloi Gaudry <e...@fft.be>
To: Open MPI Users <us...@open-mpi.org>
Date: Wed, 15 Sep 2010 16:27:43 +0200
Subject: Re: [OMPI users] [openib] segfault when using openib btl
Hi,

I was wondering if anybody got a chance to have a look at this issue.

Regards,
Eloi

On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
  

Hi Jeff,

Please find enclosed the output (valgrind.out.gz) from
/opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca


btl

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-22 Thread Eloi Gaudry
Hi Nysal,

Thanks for your suggestions.

I'm now able to get the checksum computed and redirected to stdout, thanks (I 
forgot the "-mca pml_base_verbose 5" option, you were right).
I haven't been able to observe the segmentation fault (with hdr->tag=0) so far 
(when using pml csum) but I'll let you know when I am.
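
For reference, the corresponding command line looks roughly like this (debug 
build configured with --enable-debug, same paths as in the earlier valgrind run):

$ /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl openib,self --mca pml csum --mca pml_base_verbose 5 /opt/actran/bin/actranpy_mp ...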

I've got two other questions, which may be related to the error observed:

1/ does the maximum number of MPI_Comm that can be handled by OpenMPI somehow 
depend on the btl being used (i.e. if I'm using 
openib, may I use the same number of MPI_Comm objects as with tcp)? Is there 
something like MPI_COMM_MAX in OpenMPI?

2/ the segfaults only appear during an MPI collective call, with very small 
messages (one int is being broadcast, for instance); I followed the guidelines 
given at http://icl.cs.utk.edu/open-
mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug build of 
OpenMPI asserts if I use a min-size other than 255. Anyway, if I deactivate 
eager_rdma, the segfaults remain. 
Does the openib btl handle very small messages differently (even with eager_rdma 
deactivated) than tcp? Is there a way to make sure that large messages and 
small messages are handled the same way?

Regards,
Eloi


On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
> Hi Eloi,
> Create a debug build of OpenMPI (--enable-debug) and while running with the
> csum PML add "-mca pml_base_verbose 5" to the command line. This will print
> the checksum details for each fragment sent over the wire. I'm guessing it
> didn't catch anything because the BTL failed. The checksum verification is
> done in the PML, which the BTL calls via a callback function. In your case
> the PML callback is never called because the hdr->tag is invalid. So
> enabling checksum tracing also might not be of much use. Is it the first
> Bcast that fails or the nth Bcast and what is the message size? I'm not
> sure what could be the problem at this moment. I'm afraid you will have to
> debug the BTL to find out more.
> 
> --Nysal
> 
> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <e...@fft.be> wrote:
> > Hi Nysal,
> > 
> > thanks for your response.
> > 
> > I've been unable so far to write a test case that could illustrate the
> > hdr->tag=0 error.
> > Actually, I'm only observing this issue when running an internode
> > computation involving infiniband hardware from Mellanox (MT25418,
> > ConnectX IB DDR, PCIe 2.0
> > 2.5GT/s, rev a0) with our time-domain software.
> > 
> > I checked, double-checked, and rechecked again every MPI use performed
> > during a parallel computation and I couldn't find any error so far. The
> > fact that the very
> > same parallel computation run flawlessly when using tcp (and disabling
> > openib support) might seem to indicate that the issue is somewhere
> > located inside the
> > openib btl or at the hardware/driver level.
> > 
> > I've just used the "-mca pml csum" option and I haven't seen any related
> > messages (when hdr->tag=0 and the segfaults occurs).
> > Any suggestion ?
> > 
> > Regards,
> > Eloi
> > 
> > On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> > > Hi Eloi,
> > > Sorry for the delay in response. I haven't read the entire email
> > > thread, but do you have a test case which can reproduce this error?
> > > Without that it will be difficult to nail down the cause. Just to
> > > clarify, I do not work for an iwarp vendor. I can certainly try to
> > > reproduce it on an IB system. There is also a PML called csum, you can
> > > use it via "-mca pml csum", which will checksum the MPI messages and
> > > verify it at the receiver side for any data corruption. You can try
> > > using it to see if it is able
> > 
> > to
> > 
> > > catch anything.
> > > 
> > > Regards
> > > --Nysal
> > > 
> > > On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
> > > > Hi Nysal,
> > > > 
> > > > I'm sorry to interrupt, but I was wondering if you had a chance to
> > > > look
> > 
> > at
> > 
> > > > this error.
> > > > 
> > > > Regards,
> > > > Eloi
> > > > 
> > > > 
> > > > 
> > > > --
> > > > 
> > > > 
> > > > Eloi Gaudry
> > > > 
> > > > Free Field Technologies
> > > > Company Website: http://www.fft.be
> > > > Company Phone:   +32 10 487 959
> > > > 
> > > > 
> > > > -- Forwarded

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-17 Thread Eloi Gaudry
Hi Nysal,

thanks for your response.

I've been unable so far to write a test case that could illustrate the 
hdr->tag=0 error.
Actually, I'm only observing this issue when running an internode computation 
involving infiniband hardware from Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0 
2.5GT/s, rev a0) with our time-domain software.

I checked, double-checked, and rechecked every MPI call performed during a 
parallel computation and I couldn't find any error so far. The fact that the very 
same parallel computation runs flawlessly when using tcp (and disabling openib 
support) might indicate that the issue is located somewhere inside the 
openib btl or at the hardware/driver level.
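
(For reference, the two runs differ only in the btl selection passed on the command line, 
i.e. something like "--mca btl tcp,self" versus "--mca btl openib,self"; everything else is 
identical.)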

I've just used the "-mca pml csum" option and I haven't seen any related 
messages (when hdr->tag=0 and the segfault occurs).
Any suggestions?

Regards,
Eloi



On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> Hi Eloi,
> Sorry for the delay in response. I haven't read the entire email thread,
> but do you have a test case which can reproduce this error? Without that
> it will be difficult to nail down the cause. Just to clarify, I do not
> work for an iwarp vendor. I can certainly try to reproduce it on an IB
> system. There is also a PML called csum, you can use it via "-mca pml
> csum", which will checksum the MPI messages and verify it at the receiver
> side for any data corruption. You can try using it to see if it is able to
> catch anything.
> 
> Regards
> --Nysal
> 
> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
> > Hi Nysal,
> > 
> > I'm sorry to intrrupt, but I was wondering if you had a chance to look at
> > this error.
> > 
> > Regards,
> > Eloi
> > 
> > 
> > 
> > --
> > 
> > 
> > Eloi Gaudry
> > 
> > Free Field Technologies
> > Company Website: http://www.fft.be
> > Company Phone:   +32 10 487 959
> > 
> > 
> > -- Forwarded message --
> > From: Eloi Gaudry <e...@fft.be>
> > To: Open MPI Users <us...@open-mpi.org>
> > Date: Wed, 15 Sep 2010 16:27:43 +0200
> > Subject: Re: [OMPI users] [openib] segfault when using openib btl
> > Hi,
> > 
> > I was wondering if anybody got a chance to have a look at this issue.
> > 
> > Regards,
> > Eloi
> > 
> > On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> > > Hi Jeff,
> > > 
> > > Please find enclosed the output (valgrind.out.gz) from
> > > /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
> > > openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> > > btl_openib_want_fork_support 0 -tag-output
> > > /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> > > --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
> > > valgrind.supp --suppressions=./suppressions.python.supp
> > > /opt/actran/bin/actranpy_mp ...
> > > 
> > > Thanks,
> > > Eloi
> > > 
> > > On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > > > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > > > I did run our application through valgrind but it couldn't find
> > > > > > any "Invalid write": there is a bunch of "Invalid read" (I'm
> > > > > > using
> > 
> > 1.4.2
> > 
> > > > > > with the suppression file), "Use of uninitialized bytes" and
> > > > > > "Conditional jump depending on uninitialized bytes" in different
> > 
> > ompi
> > 
> > > > > > routines. Some of them are located in btl_openib_component.c.
> > > > > > I'll send you an output of valgrind shortly.
> > > > > 
> > > > > A lot of them in btl_openib_* are to be expected -- OpenFabrics
> > > > > uses OS-bypass methods for some of its memory, and therefore
> > > > > valgrind is unaware of them (and therefore incorrectly marks them
> > > > > as
> > > > > uninitialized).
> > > > 
> > > > would it  help if i use the upcoming 1.5 version of openmpi ? i read
> > 
> > that
> > 
> > > > a huge effort has been done to clean-up the valgrind output ? but
> > > > maybe that this doesn't concern this btl (for the reasons you
> > > > mentionned).
> > > > 
> > > > > > Another question, you said that the callback function pointer
> > 
> > sh

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-17 Thread Nysal Jan
Hi Eloi,
Sorry for the delay in response. I haven't read the entire email thread, but
do you have a test case which can reproduce this error? Without that it will
be difficult to nail down the cause. Just to clarify, I do not work for an
iwarp vendor. I can certainly try to reproduce it on an IB system. There is
also a PML called csum, you can use it via "-mca pml csum", which will
checksum the MPI messages and verify it at the receiver side for any data
corruption. You can try using it to see if it is able to catch anything.
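
For example, reusing your existing launch line, something along the lines of:

  orterun -np 2 --host pbn11,pbn10 --mca pml csum --mca btl openib,self /opt/actran/bin/actranpy_mp ...

should make the csum PML verify every fragment on the receiver side and report any mismatch.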

Regards
--Nysal

On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:

> Hi Nysal,
>
> I'm sorry to intrrupt, but I was wondering if you had a chance to look at
> this error.
>
> Regards,
> Eloi
>
>
>
> --
>
>
> Eloi Gaudry
>
> Free Field Technologies
> Company Website: http://www.fft.be
> Company Phone:   +32 10 487 959
>
>
> -- Forwarded message --
> From: Eloi Gaudry <e...@fft.be>
> To: Open MPI Users <us...@open-mpi.org>
> Date: Wed, 15 Sep 2010 16:27:43 +0200
> Subject: Re: [OMPI users] [openib] segfault when using openib btl
> Hi,
>
> I was wondering if anybody got a chance to have a look at this issue.
>
> Regards,
> Eloi
>
>
> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> > Hi Jeff,
> >
> > Please find enclosed the output (valgrind.out.gz) from
> > /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
> > openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> > btl_openib_want_fork_support 0 -tag-output
> > /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> > --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
> > valgrind.supp --suppressions=./suppressions.python.supp
> > /opt/actran/bin/actranpy_mp ...
> >
> > Thanks,
> > Eloi
> >
> > On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > > I did run our application through valgrind but it couldn't find any
> > > > > "Invalid write": there is a bunch of "Invalid read" (I'm using
> 1.4.2
> > > > > with the suppression file), "Use of uninitialized bytes" and
> > > > > "Conditional jump depending on uninitialized bytes" in different
> ompi
> > > > > routines. Some of them are located in btl_openib_component.c. I'll
> > > > > send you an output of valgrind shortly.
> > > >
> > > > A lot of them in btl_openib_* are to be expected -- OpenFabrics uses
> > > > OS-bypass methods for some of its memory, and therefore valgrind is
> > > > unaware of them (and therefore incorrectly marks them as
> > > > uninitialized).
> > >
> > > would it  help if i use the upcoming 1.5 version of openmpi ? i read
> that
> > > a huge effort has been done to clean-up the valgrind output ? but maybe
> > > that this doesn't concern this btl (for the reasons you mentionned).
> > >
> > > > > Another question, you said that the callback function pointer
> should
> > > > > never be 0. But can the tag be null (hdr->tag) ?
> > > >
> > > > The tag is not a pointer -- it's just an integer.
> > >
> > > I was worrying that its value could not be null.
> > >
> > > I'll send a valgrind output soon (i need to build libpython without
> > > pymalloc first).
> > >
> > > Thanks,
> > > Eloi
> > >
> > > > > Thanks for your help,
> > > > > Eloi
> > > > >
> > > > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > > > >> Sorry for the delay in replying.
> > > > >>
> > > > >> Odd; the values of the callback function pointer should never be
> 0.
> > > > >> This seems to suggest some kind of memory corruption is occurring.
> > > > >>
> > > > >> I don't know if it's possible, because the stack trace looks like
> > > > >> you're calling through python, but can you run this application
> > > > >> through valgrind, or some other memory-checking debugger?
> > > > >>
> > > > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > > > >>> Hi,
> > > > >>>
> > > > >>> sorry, i just forgot to add the values of the function
> parameters:
> >

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-15 Thread Eloi Gaudry
Hi,

I was wondering if anybody got a chance to have a look at this issue.

Regards,
Eloi


On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> Hi Jeff,
> 
> Please find enclosed the output (valgrind.out.gz) from
> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
> openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> btl_openib_want_fork_support 0 -tag-output
> /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
> valgrind.supp --suppressions=./suppressions.python.supp
> /opt/actran/bin/actranpy_mp ...
> 
> Thanks,
> Eloi
> 
> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > I did run our application through valgrind but it couldn't find any
> > > > "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2
> > > > with the suppression file), "Use of uninitialized bytes" and
> > > > "Conditional jump depending on uninitialized bytes" in different ompi
> > > > routines. Some of them are located in btl_openib_component.c. I'll
> > > > send you an output of valgrind shortly.
> > > 
> > > A lot of them in btl_openib_* are to be expected -- OpenFabrics uses
> > > OS-bypass methods for some of its memory, and therefore valgrind is
> > > unaware of them (and therefore incorrectly marks them as
> > > uninitialized).
> > 
> > would it  help if i use the upcoming 1.5 version of openmpi ? i read that
> > a huge effort has been done to clean-up the valgrind output ? but maybe
> > that this doesn't concern this btl (for the reasons you mentionned).
> > 
> > > > Another question, you said that the callback function pointer should
> > > > never be 0. But can the tag be null (hdr->tag) ?
> > > 
> > > The tag is not a pointer -- it's just an integer.
> > 
> > I was worrying that its value could not be null.
> > 
> > I'll send a valgrind output soon (i need to build libpython without
> > pymalloc first).
> > 
> > Thanks,
> > Eloi
> > 
> > > > Thanks for your help,
> > > > Eloi
> > > > 
> > > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > > >> Sorry for the delay in replying.
> > > >> 
> > > >> Odd; the values of the callback function pointer should never be 0.
> > > >> This seems to suggest some kind of memory corruption is occurring.
> > > >> 
> > > >> I don't know if it's possible, because the stack trace looks like
> > > >> you're calling through python, but can you run this application
> > > >> through valgrind, or some other memory-checking debugger?
> > > >> 
> > > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > > >>> Hi,
> > > >>> 
> > > >>> sorry, i just forgot to add the values of the function parameters:
> > > >>> (gdb) print reg->cbdata
> > > >>> $1 = (void *) 0x0
> > > >>> (gdb) print openib_btl->super
> > > >>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > >>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > >>> btl_rdma_pipeline_send_length = 1048576,
> > > >>> 
> > > >>>   btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size
> > > >>>   = 1060864, btl_exclusivity = 1024, btl_latency = 10,
> > > >>>   btl_bandwidth = 800, btl_flags = 310, btl_add_procs =
> > > >>>   0x2b341eb8ee47, btl_del_procs =
> > > >>>   0x2b341eb90156, btl_register = 0,
> > > >>>   btl_finalize = 0x2b341eb93186, btl_alloc
> > > >>>   = 0x2b341eb90a3e, btl_free =
> > > >>>   0x2b341eb91400, btl_prepare_src =
> > > >>>   0x2b341eb91813, btl_prepare_dst =
> > > >>>   0x2b341eb91f2e, btl_send =
> > > >>>   0x2b341eb94517, btl_sendi =
> > > >>>   0x2b341eb9340d, btl_put =
> > > >>>   0x2b341eb94660, btl_get =
> > > >>>   0x2b341eb94c4e, btl_dump =
> > > >>>   0x2b341acd45cb, btl_mpool = 0xf3f4110,
> > > >>>   btl_register_error =
> > > >>>   0x2b341eb90565, btl_ft_event =
> > > >>>   0x2b341eb952e7}
> > > >>> 
> > > >>> (gdb) print hdr->tag
> > > >>> $3 = 0 '\0'
> > > >>> (gdb) print des
> > > >>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > >>> (gdb) print reg->cbfunc
> > > >>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > > >>> 
> > > >>> Eloi
> > > >>> 
> > > >>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > >  Hi,
> > >  
> > >  Here is the output of a core file generated during a segmentation
> > >  fault observed during a collective call (using openib):
> > >  
> > >  #0  0x in ?? ()
> > >  (gdb) where
> > >  #0  0x in ?? ()
> > >  #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > >  (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
> > >  byte_len=18) at btl_openib_component.c:2881 #2 
> > >  0x2aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0,
> > >  wc=0x7279ce90) at
> > >  btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > >  (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-20 Thread Eloi Gaudry
Hi Jeff,

here is the valgrind output when using OpenMPI 1.5rc5, just in case.

Thanks,
Eloi

On Wednesday 18 August 2010 23:01:49 Jeff Squyres wrote:
> On Aug 17, 2010, at 12:32 AM, Eloi Gaudry wrote:
> > would it  help if i use the upcoming 1.5 version of openmpi ? i read that
> > a huge effort has been done to clean-up the valgrind output ? but maybe
> > that this doesn't concern this btl (for the reasons you mentionned).
> 
> I do not believe that the IB/iWARP vendors have cleaned up the openib BTL
> much in this regard recently.  But then again, we branched for v1.3 a long
> time ago, so I don't remember offhand if any valgrind cleanups occurred
> since then...

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


valgrind.ompi15rc5.out.gz
Description: GNU Zip compressed data


Re: [OMPI users] [openib] segfault when using openib btl

2010-08-18 Thread Jeff Squyres
On Aug 17, 2010, at 12:32 AM, Eloi Gaudry wrote:

> would it  help if i use the upcoming 1.5 version of openmpi ? i read that a 
> huge effort has been done to clean-up the valgrind output ? but maybe that 
> this doesn't 
> concern this btl (for the reasons you mentionned).

I do not believe that the IB/iWARP vendors have cleaned up the openib BTL much 
in this regard recently.  But then again, we branched for v1.3 a long time ago, 
so I don't remember offhand if any valgrind cleanups occurred since then...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] [openib] segfault when using openib btl

2010-08-18 Thread Eloi Gaudry
Hi Jeff,

Please find enclosed the output (valgrind.out.gz) from
/opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl 
openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca 
btl_openib_want_fork_support 0 -tag-output /opt/valgrind-3.5.0/bin/valgrind 
--tool=memcheck --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
valgrind.supp --suppressions=./suppressions.python.supp 
/opt/actran/bin/actranpy_mp ...

Thanks,
Eloi


On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > I did run our application through valgrind but it couldn't find any
> > > "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2
> > > with the suppression file), "Use of uninitialized bytes" and
> > > "Conditional jump depending on uninitialized bytes" in different ompi
> > > routines. Some of them are located in btl_openib_component.c. I'll send
> > > you an output of valgrind shortly.
> > 
> > A lot of them in btl_openib_* are to be expected -- OpenFabrics uses
> > OS-bypass methods for some of its memory, and therefore valgrind is
> > unaware of them (and therefore incorrectly marks them as uninitialized).
> 
> would it  help if i use the upcoming 1.5 version of openmpi ? i read that a
> huge effort has been done to clean-up the valgrind output ? but maybe that
> this doesn't concern this btl (for the reasons you mentionned).
> 
> > > Another question, you said that the callback function pointer should
> > > never be 0. But can the tag be null (hdr->tag) ?
> > 
> > The tag is not a pointer -- it's just an integer.
> 
> I was worrying that its value could not be null.
> 
> I'll send a valgrind output soon (i need to build libpython without
> pymalloc first).
> 
> Thanks,
> Eloi
> 
> > > Thanks for your help,
> > > Eloi
> > > 
> > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > >> Sorry for the delay in replying.
> > >> 
> > >> Odd; the values of the callback function pointer should never be 0.
> > >> This seems to suggest some kind of memory corruption is occurring.
> > >> 
> > >> I don't know if it's possible, because the stack trace looks like
> > >> you're calling through python, but can you run this application
> > >> through valgrind, or some other memory-checking debugger?
> > >> 
> > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > >>> Hi,
> > >>> 
> > >>> sorry, i just forgot to add the values of the function parameters:
> > >>> (gdb) print reg->cbdata
> > >>> $1 = (void *) 0x0
> > >>> (gdb) print openib_btl->super
> > >>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > >>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > >>> btl_rdma_pipeline_send_length = 1048576,
> > >>> 
> > >>>   btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> > >>>   1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth =
> > >>>   800, btl_flags = 310, btl_add_procs =
> > >>>   0x2b341eb8ee47, btl_del_procs =
> > >>>   0x2b341eb90156, btl_register = 0,
> > >>>   btl_finalize = 0x2b341eb93186, btl_alloc =
> > >>>   0x2b341eb90a3e, btl_free =
> > >>>   0x2b341eb91400, btl_prepare_src =
> > >>>   0x2b341eb91813, btl_prepare_dst =
> > >>>   0x2b341eb91f2e, btl_send =
> > >>>   0x2b341eb94517, btl_sendi =
> > >>>   0x2b341eb9340d, btl_put =
> > >>>   0x2b341eb94660, btl_get =
> > >>>   0x2b341eb94c4e, btl_dump =
> > >>>   0x2b341acd45cb, btl_mpool = 0xf3f4110,
> > >>>   btl_register_error =
> > >>>   0x2b341eb90565, btl_ft_event =
> > >>>   0x2b341eb952e7}
> > >>> 
> > >>> (gdb) print hdr->tag
> > >>> $3 = 0 '\0'
> > >>> (gdb) print des
> > >>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > >>> (gdb) print reg->cbfunc
> > >>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > >>> 
> > >>> Eloi
> > >>> 
> > >>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> >  Hi,
> >  
> >  Here is the output of a core file generated during a segmentation
> >  fault observed during a collective call (using openib):
> >  
> >  #0  0x in ?? ()
> >  (gdb) where
> >  #0  0x in ?? ()
> >  #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> >  (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18)
> >  at btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> >  (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> >  btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> >  (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> >  0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
> >  btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
> >  btl_openib_component_progress () at btl_openib_component.c:3451 #6
> >  0x2aedb8b22ab8 in opal_progress () at
> >  runtime/opal_progress.c:207 #7 0x2aedb859f497 in
> >  opal_condition_wait (c=0x2aedb888ccc0, 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Eloi Gaudry
Hi Nysal,

There is only one thread invoking MPI functions in our applications. The other 
threads are related to flexlm protection routines and some self-diagnostics routines 
that don't use any MPI functions. Just to be sure, I built a version of our application 
without any other threads than the flexlm ones, and the error was still observable.

I forgot to mention that I've tried the OpenMPI 1.3.4, 1.4.1, 1.4.2 and 1.4.3a1r23542 
versions and they all failed with the same error (hdr->tag=0 and reg->cbfunc=0).

Do you think that buggy infiniband hardware or a buggy driver could potentially corrupt 
the message being transmitted?

Thanks for your help,
Eloi

On Tuesday 17 August 2010 12:05:13 Nysal Jan wrote:
> Hi Eloi,
> 
> >Do you think that a thread race condition could explain the hdr->tag value
> 
> ?
> Are there multiple threads invoking MPI functions in your application? The
> openib BTL is not yet thread safe in the 1.4 release series. There have
> been improvements to openib BTL thread safety in 1.5, but it is still not
> officially supported.
> 
> --Nysal
> 
> On Tue, Aug 17, 2010 at 1:06 PM, Eloi Gaudry  wrote:
> > Hi Nysal,
> > 
> > This is what I was wondering, it hdr->tag was expected to be null or not.
> > I'll soon send a valgrind output to the list, hoping this could help to
> > locate an invalid
> > memory access allowing to understand why reg->cbfunc / hdr->tag are null.
> > 
> > Do you think that a thread race condition could explain the hdr->tag
> > value ?
> > 
> > Thanks for your help,
> > Eloi
> > 
> > On Monday 16 August 2010 20:46:39 Nysal Jan wrote:
> > > The value of hdr->tag seems wrong.
> > > 
> > > In ompi/mca/pml/ob1/pml_ob1_hdr.h
> > > #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> > > #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
> > > #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
> > > #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
> > > #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
> > > #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
> > > #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
> > > #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
> > > #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
> > > 
> > > and in ompi/mca/btl/btl.h
> > > #define MCA_BTL_TAG_PML 0x40
> > > 
> > > So hdr->tag should be a value >= 65
> > > Since the tag is incorrect you are not getting the proper callback
> > 
> > function
> > 
> > > pointer and hence the SEGV.
> > > I'm not sure at this point as to why you are getting an invalid/corrupt
> > > message header ?
> > > 
> > > --Nysal
> > > 
> > > On Tue, Aug 10, 2010 at 7:45 PM, Eloi Gaudry  wrote:
> > > > Hi,
> > > > 
> > > > sorry, i just forgot to add the values of the function parameters:
> > > > (gdb) print reg->cbdata
> > > > $1 = (void *) 0x0
> > > > (gdb) print openib_btl->super
> > > > $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > > btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > > btl_rdma_pipeline_send_length = 1048576,
> > > > 
> > > >  btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> > > > 
> > > > 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth =
> > > > 800, btl_flags = 310,
> > > > 
> > > >  btl_add_procs = 0x2b341eb8ee47 ,
> > 
> > btl_del_procs
> > 
> > > >  =
> > > > 
> > > > 0x2b341eb90156 , btl_register = 0,
> > 
> > btl_finalize
> > 
> > > > = 0x2b341eb93186 ,
> > > > 
> > > >  btl_alloc = 0x2b341eb90a3e , btl_free =
> > > > 
> > > > 0x2b341eb91400 , btl_prepare_src =
> > > > 0x2b341eb91813 ,
> > > > 
> > > >  btl_prepare_dst = 0x2b341eb91f2e ,
> > 
> > btl_send
> > 
> > > >  =
> > > > 
> > > > 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d
> > > > ,
> > > > 
> > > >  btl_put = 0x2b341eb94660 , btl_get =
> > 
> > 0x2b341eb94c4e
> > 
> > > > , btl_dump = 0x2b341acd45cb ,
> > > > btl_mpool = 0xf3f4110,
> > > > 
> > > >  btl_register_error = 0x2b341eb90565
> > 
> > ,
> > 
> > > > btl_ft_event = 0x2b341eb952e7 }
> > > > (gdb) print hdr->tag
> > > > $3 = 0 '\0'
> > > > (gdb) print des
> > > > $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > > (gdb) print reg->cbfunc
> > > > $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > > > 
> > > > Eloi
> > > > 
> > > > On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > > > Hi,
> > > > > 
> > > > > Here is the output of a core file generated during a segmentation
> > 
> > fault
> > 
> > > > > observed during a collective call (using openib):
> > > > > 
> > > > > #0  0x in ?? ()
> > > > > (gdb) where
> > > > > #0  0x in ?? ()
> > > > > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > > > > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
> > > > > byte_len=18)
> > 
> > at
> > 
> > > > > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > > > > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > > > > 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Nysal Jan
Hi Eloi,
>Do you think that a thread race condition could explain the hdr->tag value ?
Are there multiple threads invoking MPI functions in your application? The
openib BTL is not yet thread safe in the 1.4 release series. There have been
improvements to openib BTL thread safety in 1.5, but it is still not
officially supported.
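
One way to make the threading assumption explicit is to request the level you actually need 
at startup and check what the library grants. A generic illustration, not specific to your 
application:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int provided;
      /* MPI_THREAD_FUNNELED: the process may be multi-threaded, but only the
         thread that called MPI_Init_thread ever makes MPI calls */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      if (provided < MPI_THREAD_FUNNELED) {
          fprintf(stderr, "MPI only provides thread level %d\n", provided);
      }
      /* ... application ... */
      MPI_Finalize();
      return 0;
  }

If none of the other threads ever touch MPI, MPI_THREAD_FUNNELED is all that is needed and the 
openib BTL restriction above should not be an issue by itself.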

--Nysal

On Tue, Aug 17, 2010 at 1:06 PM, Eloi Gaudry  wrote:

> Hi Nysal,
>
> This is what I was wondering, it hdr->tag was expected to be null or not.
> I'll soon send a valgrind output to the list, hoping this could help to
> locate an invalid
> memory access allowing to understand why reg->cbfunc / hdr->tag are null.
>
> Do you think that a thread race condition could explain the hdr->tag value
> ?
>
> Thanks for your help,
> Eloi
>
> On Monday 16 August 2010 20:46:39 Nysal Jan wrote:
> > The value of hdr->tag seems wrong.
> >
> > In ompi/mca/pml/ob1/pml_ob1_hdr.h
> > #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> > #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
> > #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
> > #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
> > #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
> > #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
> > #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
> > #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
> > #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
> >
> > and in ompi/mca/btl/btl.h
> > #define MCA_BTL_TAG_PML 0x40
> >
> > So hdr->tag should be a value >= 65
> > Since the tag is incorrect you are not getting the proper callback
> function
> > pointer and hence the SEGV.
> > I'm not sure at this point as to why you are getting an invalid/corrupt
> > message header ?
> >
> > --Nysal
> >
> > On Tue, Aug 10, 2010 at 7:45 PM, Eloi Gaudry  wrote:
> > > Hi,
> > >
> > > sorry, i just forgot to add the values of the function parameters:
> > > (gdb) print reg->cbdata
> > > $1 = (void *) 0x0
> > > (gdb) print openib_btl->super
> > > $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > > btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > > btl_rdma_pipeline_send_length = 1048576,
> > >
> > >  btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> > >
> > > 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800,
> > > btl_flags = 310,
> > >
> > >  btl_add_procs = 0x2b341eb8ee47 ,
> btl_del_procs
> > >  =
> > >
> > > 0x2b341eb90156 , btl_register = 0,
> btl_finalize
> > > = 0x2b341eb93186 ,
> > >
> > >  btl_alloc = 0x2b341eb90a3e , btl_free =
> > >
> > > 0x2b341eb91400 , btl_prepare_src = 0x2b341eb91813
> > > ,
> > >
> > >  btl_prepare_dst = 0x2b341eb91f2e ,
> btl_send
> > >  =
> > >
> > > 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d
> > > ,
> > >
> > >  btl_put = 0x2b341eb94660 , btl_get =
> 0x2b341eb94c4e
> > >
> > > , btl_dump = 0x2b341acd45cb ,
> > > btl_mpool = 0xf3f4110,
> > >
> > >  btl_register_error = 0x2b341eb90565
> ,
> > >
> > > btl_ft_event = 0x2b341eb952e7 }
> > > (gdb) print hdr->tag
> > > $3 = 0 '\0'
> > > (gdb) print des
> > > $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > (gdb) print reg->cbfunc
> > > $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > >
> > > Eloi
> > >
> > > On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > > Hi,
> > > >
> > > > Here is the output of a core file generated during a segmentation
> fault
> > > > observed during a collective call (using openib):
> > > >
> > > > #0  0x in ?? ()
> > > > (gdb) where
> > > > #0  0x in ?? ()
> > > > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > > > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18)
> at
> > > > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > > > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > > > btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > > > (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> > > > 0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
> > > > btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
> > > > btl_openib_component_progress () at btl_openib_component.c:3451 #6
> > > > 0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
> > > > #7 0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
> > > > m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8
> > >
> > >  0x2aedb859fa31
> > >
> > > > in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
> > > > statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
> > > > ompi_coll_tuned_allreduce_intra_recursivedoubling
> (sbuf=0x7279d444,
> > > > rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
> > > > comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > > > #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
> 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Eloi Gaudry
Hi Nysal,

This is what I was wondering, i.e. whether hdr->tag was expected to be null or not. I'll 
soon send a valgrind output to the list, hoping this could help locate an invalid 
memory access and understand why reg->cbfunc / hdr->tag are null.

Do you think that a thread race condition could explain the hdr->tag value ?

Thanks for your help,
Eloi

On Monday 16 August 2010 20:46:39 Nysal Jan wrote:
> The value of hdr->tag seems wrong.
> 
> In ompi/mca/pml/ob1/pml_ob1_hdr.h
> #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
> #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
> #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
> #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
> #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
> #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
> #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
> #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
> 
> and in ompi/mca/btl/btl.h
> #define MCA_BTL_TAG_PML 0x40
> 
> So hdr->tag should be a value >= 65
> Since the tag is incorrect you are not getting the proper callback function
> pointer and hence the SEGV.
> I'm not sure at this point as to why you are getting an invalid/corrupt
> message header ?
> 
> --Nysal
> 
> On Tue, Aug 10, 2010 at 7:45 PM, Eloi Gaudry  wrote:
> > Hi,
> > 
> > sorry, i just forgot to add the values of the function parameters:
> > (gdb) print reg->cbdata
> > $1 = (void *) 0x0
> > (gdb) print openib_btl->super
> > $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> > btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> > btl_rdma_pipeline_send_length = 1048576,
> > 
> >  btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> > 
> > 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800,
> > btl_flags = 310,
> > 
> >  btl_add_procs = 0x2b341eb8ee47 , btl_del_procs
> >  =
> > 
> > 0x2b341eb90156 , btl_register = 0, btl_finalize
> > = 0x2b341eb93186 ,
> > 
> >  btl_alloc = 0x2b341eb90a3e , btl_free =
> > 
> > 0x2b341eb91400 , btl_prepare_src = 0x2b341eb91813
> > ,
> > 
> >  btl_prepare_dst = 0x2b341eb91f2e , btl_send
> >  =
> > 
> > 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d
> > ,
> > 
> >  btl_put = 0x2b341eb94660 , btl_get = 0x2b341eb94c4e
> > 
> > , btl_dump = 0x2b341acd45cb ,
> > btl_mpool = 0xf3f4110,
> > 
> >  btl_register_error = 0x2b341eb90565 ,
> > 
> > btl_ft_event = 0x2b341eb952e7 }
> > (gdb) print hdr->tag
> > $3 = 0 '\0'
> > (gdb) print des
> > $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > (gdb) print reg->cbfunc
> > $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > 
> > Eloi
> > 
> > On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > Hi,
> > > 
> > > Here is the output of a core file generated during a segmentation fault
> > > observed during a collective call (using openib):
> > > 
> > > #0  0x in ?? ()
> > > (gdb) where
> > > #0  0x in ?? ()
> > > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
> > > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > > btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > > (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> > > 0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
> > > btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
> > > btl_openib_component_progress () at btl_openib_component.c:3451 #6
> > > 0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
> > > #7 0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
> > > m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8
> >  
> >  0x2aedb859fa31
> >  
> > > in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
> > > statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
> > > ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444,
> > > rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
> > > comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > > #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
> > > (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
> > > op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
> > > coll_tuned_decision_fixed.c:63
> > > #11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444,
> > > recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
> > > comm=0x19d81ff0) at pallreduce.c:102 #12 0x04387dbf in
> > > FEMTown::MPI::Allreduce (sendbuf=0x7279d444,
> > > recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
> > > comm=0x19d81ff0) at stubs.cpp:626 #13 0x04058be8 in
> > > FEMTown::Domain::align (itf=
> > 
> > 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Eloi Gaudry
On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > I did run our application through valgrind but it couldn't find any
> > "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2
> > with the suppression file), "Use of uninitialized bytes" and
> > "Conditional jump depending on uninitialized bytes" in different ompi
> > routines. Some of them are located in btl_openib_component.c. I'll send
> > you an output of valgrind shortly.
> 
> A lot of them in btl_openib_* are to be expected -- OpenFabrics uses
> OS-bypass methods for some of its memory, and therefore valgrind is
> unaware of them (and therefore incorrectly marks them as uninitialized).
Would it help if I used the upcoming 1.5 version of openmpi? I read that a 
huge effort has been made to clean up the valgrind output, but maybe this doesn't 
concern this btl (for the reasons you mentioned).
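
(For what it's worth, my understanding is that these false positives are usually silenced in 
the source itself with memcheck client requests, i.e. by marking the buffers written by the 
HCA as defined by hand, something like:

  #include <valgrind/memcheck.h>

  /* after the HCA has written 'len' bytes into 'buf' behind valgrind's back,
     tell memcheck that this memory now holds defined data */
  VALGRIND_MAKE_MEM_DEFINED(buf, len);

so whether 1.5 helps probably depends on whether the openib btl received that treatment.)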

> 
> > Another question, you said that the callback function pointer should
> > never be 0. But can the tag be null (hdr->tag) ?
> 
> The tag is not a pointer -- it's just an integer.
I was wondering whether its value was allowed to be null.

I'll send a valgrind output soon (I need to build libpython without pymalloc 
first).

Thanks,
Eloi

> 
> > Thanks for your help,
> > Eloi
> > 
> > On 16/08/2010 18:22, Jeff Squyres wrote:
> >> Sorry for the delay in replying.
> >> 
> >> Odd; the values of the callback function pointer should never be 0. 
> >> This seems to suggest some kind of memory corruption is occurring.
> >> 
> >> I don't know if it's possible, because the stack trace looks like you're
> >> calling through python, but can you run this application through
> >> valgrind, or some other memory-checking debugger?
> >> 
> >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> >>> Hi,
> >>> 
> >>> sorry, i just forgot to add the values of the function parameters:
> >>> (gdb) print reg->cbdata
> >>> $1 = (void *) 0x0
> >>> (gdb) print openib_btl->super
> >>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> >>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> >>> btl_rdma_pipeline_send_length = 1048576,
> >>> 
> >>>   btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> >>>   1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth =
> >>>   800, btl_flags = 310, btl_add_procs =
> >>>   0x2b341eb8ee47, btl_del_procs =
> >>>   0x2b341eb90156, btl_register = 0,
> >>>   btl_finalize = 0x2b341eb93186, btl_alloc =
> >>>   0x2b341eb90a3e, btl_free =
> >>>   0x2b341eb91400, btl_prepare_src =
> >>>   0x2b341eb91813, btl_prepare_dst =
> >>>   0x2b341eb91f2e, btl_send =
> >>>   0x2b341eb94517, btl_sendi =
> >>>   0x2b341eb9340d, btl_put =
> >>>   0x2b341eb94660, btl_get =
> >>>   0x2b341eb94c4e, btl_dump =
> >>>   0x2b341acd45cb, btl_mpool = 0xf3f4110,
> >>>   btl_register_error =
> >>>   0x2b341eb90565, btl_ft_event =
> >>>   0x2b341eb952e7}
> >>> 
> >>> (gdb) print hdr->tag
> >>> $3 = 0 '\0'
> >>> (gdb) print des
> >>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> >>> (gdb) print reg->cbfunc
> >>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> >>> 
> >>> Eloi
> >>> 
> >>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
>  Hi,
>  
>  Here is the output of a core file generated during a segmentation
>  fault observed during a collective call (using openib):
>  
>  #0  0x in ?? ()
>  (gdb) where
>  #0  0x in ?? ()
>  #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
>  (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18)
>  at btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
>  (device=0x19024ac0, cq=0, wc=0x7279ce90) at
>  btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
>  (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
>  0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
>  btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
>  btl_openib_component_progress () at btl_openib_component.c:3451 #6
>  0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
>  #7 0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
>  m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8 
>  0x2aedb859fa31 in ompi_request_default_wait_all (count=2,
>  requests=0x7279d0e0, statuses=0x0) at request/req_wait.c:262 #9 
>  0x2aedbd7559ad in
>  ompi_coll_tuned_allreduce_intra_recursivedoubling
>  (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
>  op=0x6787a20,
>  comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
>  #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
>  (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
>  op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
>  coll_tuned_decision_fixed.c:63
>  #11 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Nysal Jan
The value of hdr->tag seems wrong.

In ompi/mca/pml/ob1/pml_ob1_hdr.h
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
#define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
#define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
#define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
#define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)

and in ompi/mca/btl/btl.h
#define MCA_BTL_TAG_PML 0x40

So hdr->tag should be a value >= 65
Since the tag is incorrect you are not getting the proper callback function
pointer and hence the SEGV.
I'm not sure at this point why you are getting an invalid/corrupt
message header.
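
To make the failure mode concrete, the receive path does roughly the following
(paraphrased from the 1.4 sources, not a verbatim quote):

    /* btl_openib_handle_incoming(), simplified: hdr->tag indexes the table of
       registered active-message callbacks */
    mca_btl_active_message_callback_t *reg =
        mca_btl_base_active_message_trigger + hdr->tag;
    /* a valid ob1 fragment carries one of the MCA_PML_OB1_HDR_TYPE_* tags
       (i.e. >= 65), whose slot holds the ob1 receive callback; with tag 0
       nothing is registered there (your gdb shows cbfunc = 0, cbdata = 0),
       so the call below jumps to address 0x0 */
    reg->cbfunc(&openib_btl->super, hdr->tag, des, reg->cbdata);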

--Nysal

On Tue, Aug 10, 2010 at 7:45 PM, Eloi Gaudry  wrote:

> Hi,
>
> sorry, i just forgot to add the values of the function parameters:
> (gdb) print reg->cbdata
> $1 = (void *) 0x0
> (gdb) print openib_btl->super
> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> btl_rdma_pipeline_send_length = 1048576,
>  btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size =
> 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800,
> btl_flags = 310,
>  btl_add_procs = 0x2b341eb8ee47 , btl_del_procs =
> 0x2b341eb90156 , btl_register = 0, btl_finalize =
> 0x2b341eb93186 ,
>  btl_alloc = 0x2b341eb90a3e , btl_free =
> 0x2b341eb91400 , btl_prepare_src = 0x2b341eb91813
> ,
>  btl_prepare_dst = 0x2b341eb91f2e , btl_send =
> 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d
> ,
>  btl_put = 0x2b341eb94660 , btl_get = 0x2b341eb94c4e
> , btl_dump = 0x2b341acd45cb ,
> btl_mpool = 0xf3f4110,
>  btl_register_error = 0x2b341eb90565 ,
> btl_ft_event = 0x2b341eb952e7 }
> (gdb) print hdr->tag
> $3 = 0 '\0'
> (gdb) print des
> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> (gdb) print reg->cbfunc
> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
>
> Eloi
>
> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > Hi,
> >
> > Here is the output of a core file generated during a segmentation fault
> > observed during a collective call (using openib):
> >
> > #0  0x in ?? ()
> > (gdb) where
> > #0  0x in ?? ()
> > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
> > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> > 0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
> > btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
> > btl_openib_component_progress () at btl_openib_component.c:3451 #6
> > 0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207 #7
> > 0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
> > m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8
>  0x2aedb859fa31
> > in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
> > statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
> > ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444,
> > rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
> > (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
> > op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
> > coll_tuned_decision_fixed.c:63
> > #11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444,
> > recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0) at pallreduce.c:102 #12 0x04387dbf in
> > FEMTown::MPI::Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440,
> > count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at
> > stubs.cpp:626 #13 0x04058be8 in FEMTown::Domain::align (itf=
> >
> {
> > = {_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn =
> > {pi_ = 0x6}}}, }) at interface.cpp:371
> > #14 0x040cb858 in
> FEMTown::Field::detail::align_itfs_and_neighbhors
> > (dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}},
> > check_info=@0x7279d7f0) at check.cpp:63 #15 0x040cbfa8 in
> > FEMTown::Field::align_elements (set={px = 0x7279d950, pn = {pi_ =
> > 0x66e08d0}}, check_info=@0x7279d7f0) at check.cpp:159 #16
> > 0x039acdd4 in PyField_align_elements (self=0x0,
> > args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31 #17
> > 0x01fbf76d in FEMTown::Main::ExErrCatch<_object* 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Eloi Gaudry

 Hi Jeff,

Thanks for your reply.

I did run our application through valgrind but it couldn't find any 
"Invalid write": there are a bunch of "Invalid read" (I'm using 1.4.2 
with the suppression file), "Use of uninitialized bytes" and 
"Conditional jump depending on uninitialized bytes" reports in different ompi 
routines. Some of them are located in btl_openib_component.c. I'll send 
you an output of valgrind shortly.


Another question: you said that the callback function pointer should 
never be 0, but can the tag (hdr->tag) be null?


Thanks for your help,
Eloi



On 16/08/2010 18:22, Jeff Squyres wrote:

Sorry for the delay in replying.

Odd; the values of the callback function pointer should never be 0.  This seems 
to suggest some kind of memory corruption is occurring.

I don't know if it's possible, because the stack trace looks like you're 
calling through python, but can you run this application through valgrind, or 
some other memory-checking debugger?


On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:


Hi,

sorry, i just forgot to add the values of the function parameters:
(gdb) print reg->cbdata
$1 = (void *) 0x0
(gdb) print openib_btl->super
$2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288, 
btl_rndv_eager_limit = 12288, btl_max_send_size = 65536, 
btl_rdma_pipeline_send_length = 1048576,
   btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 1060864, 
btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
   btl_add_procs = 0x2b341eb8ee47, btl_del_procs = 
0x2b341eb90156, btl_register = 0, btl_finalize = 
0x2b341eb93186,
   btl_alloc = 0x2b341eb90a3e, btl_free = 
0x2b341eb91400, btl_prepare_src = 
0x2b341eb91813,
   btl_prepare_dst = 0x2b341eb91f2e, btl_send = 
0x2b341eb94517, btl_sendi = 0x2b341eb9340d,
   btl_put = 0x2b341eb94660, btl_get = 
0x2b341eb94c4e, btl_dump = 0x2b341acd45cb, 
btl_mpool = 0xf3f4110,
   btl_register_error = 0x2b341eb90565, 
btl_ft_event = 0x2b341eb952e7}
(gdb) print hdr->tag
$3 = 0 '\0'
(gdb) print des
$4 = (mca_btl_base_descriptor_t *) 0xf4a6700
(gdb) print reg->cbfunc
$5 = (mca_btl_base_module_recv_cb_fn_t) 0

Eloi

On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:

Hi,

Here is the output of a core file generated during a segmentation fault
observed during a collective call (using openib):

#0  0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x2aedbc4e05f4 in btl_openib_handle_incoming
(openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
(device=0x19024ac0, cq=0, wc=0x7279ce90) at
btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
(device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
btl_openib_component_progress () at btl_openib_component.c:3451 #6
0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207 #7
0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8  0x2aedb859fa31
in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444,
rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
#10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
(sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
coll_tuned_decision_fixed.c:63
#11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444,
recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
comm=0x19d81ff0) at pallreduce.c:102 #12 0x04387dbf in
FEMTown::MPI::Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440,
count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at
stubs.cpp:626 #13 0x04058be8 in FEMTown::Domain::align (itf=
 {
= {_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn =
{pi_ = 0x6}}},}) at interface.cpp:371
#14 0x040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors
(dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}},
check_info=@0x7279d7f0) at check.cpp:63 #15 0x040cbfa8 in
FEMTown::Field::align_elements (set={px = 0x7279d950, pn = {pi_ =
0x66e08d0}}, check_info=@0x7279d7f0) at check.cpp:159 #16
0x039acdd4 in PyField_align_elements (self=0x0,
args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31 #17
0x01fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*,
_object*, _object*)>::exec<_object>  (this=0x7279dc20, s=0x0,
po1=0x2aaab0765050, po2=0x19d2e950) at
/home/qa/svntop/femtown/modules/main/py/exception.hpp:463
#18 0x039acc82 in 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Jeff Squyres
Sorry for the delay in replying.

Odd; the values of the callback function pointer should never be 0.  This seems 
to suggest some kind of memory corruption is occurring.

I don't know if it's possible, because the stack trace looks like you're 
calling through python, but can you run this application through valgrind, or 
some other memory-checking debugger?
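
Something along these lines should work (adapt the paths to your install; this is just a 
sketch, not a tested command):

  orterun -np 2 --mca btl openib,self \
    valgrind --tool=memcheck \
    --suppressions=<OMPI prefix>/share/openmpi/openmpi-valgrind.supp \
    /path/to/your_application ...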


On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:

> Hi,
> 
> sorry, i just forgot to add the values of the function parameters:
> (gdb) print reg->cbdata
> $1 = (void *) 0x0
> (gdb) print openib_btl->super
> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288, 
> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536, 
> btl_rdma_pipeline_send_length = 1048576,
>   btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 
> 1060864, btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800, 
> btl_flags = 310,
>   btl_add_procs = 0x2b341eb8ee47 , btl_del_procs = 
> 0x2b341eb90156 , btl_register = 0, btl_finalize = 
> 0x2b341eb93186 ,
>   btl_alloc = 0x2b341eb90a3e , btl_free = 
> 0x2b341eb91400 , btl_prepare_src = 0x2b341eb91813 
> ,
>   btl_prepare_dst = 0x2b341eb91f2e , btl_send = 
> 0x2b341eb94517 , btl_sendi = 0x2b341eb9340d 
> ,
>   btl_put = 0x2b341eb94660 , btl_get = 0x2b341eb94c4e 
> , btl_dump = 0x2b341acd45cb , 
> btl_mpool = 0xf3f4110,
>   btl_register_error = 0x2b341eb90565 , 
> btl_ft_event = 0x2b341eb952e7 }
> (gdb) print hdr->tag
> $3 = 0 '\0'
> (gdb) print des
> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> (gdb) print reg->cbfunc
> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> 
> Eloi
> 
> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > Hi,
> >
> > Here is the output of a core file generated during a segmentation fault
> > observed during a collective call (using openib):
> >
> > #0  0x in ?? ()
> > (gdb) where
> > #0  0x in ?? ()
> > #1  0x2aedbc4e05f4 in btl_openib_handle_incoming
> > (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
> > btl_openib_component.c:2881 #2  0x2aedbc4e25e2 in handle_wc
> > (device=0x19024ac0, cq=0, wc=0x7279ce90) at
> > btl_openib_component.c:3178 #3  0x2aedbc4e2e9d in poll_device
> > (device=0x19024ac0, count=2) at btl_openib_component.c:3318 #4
> > 0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at
> > btl_openib_component.c:3426 #5  0x2aedbc4e3561 in
> > btl_openib_component_progress () at btl_openib_component.c:3451 #6
> > 0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207 #7
> > 0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
> > m=0x2aedb888cd20) at ../opal/threads/condition.h:99 #8  0x2aedb859fa31
> > in ompi_request_default_wait_all (count=2, requests=0x7279d0e0,
> > statuses=0x0) at request/req_wait.c:262 #9  0x2aedbd7559ad in
> > ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444,
> > rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> > #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed
> > (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220,
> > op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at
> > coll_tuned_decision_fixed.c:63
> > #11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444,
> > recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20,
> > comm=0x19d81ff0) at pallreduce.c:102 #12 0x04387dbf in
> > FEMTown::MPI::Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440,
> > count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at
> > stubs.cpp:626 #13 0x04058be8 in FEMTown::Domain::align (itf=
> > {
> > = {_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn =
> > {pi_ = 0x6}}}, }) at interface.cpp:371
> > #14 0x040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors
> > (dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}},
> > check_info=@0x7279d7f0) at check.cpp:63 #15 0x040cbfa8 in
> > FEMTown::Field::align_elements (set={px = 0x7279d950, pn = {pi_ =
> > 0x66e08d0}}, check_info=@0x7279d7f0) at check.cpp:159 #16
> > 0x039acdd4 in PyField_align_elements (self=0x0,
> > args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31 #17
> > 0x01fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*,
> > _object*, _object*)>::exec<_object> (this=0x7279dc20, s=0x0,
> > po1=0x2aaab0765050, po2=0x19d2e950) at
> > /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> > #18 0x039acc82 in PyField_align_elements_ewrap (self=0x0,
> > args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39 #19
> > 0x044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag= > optimized out>) at Python/ceval.c:3921 #20 0x0440aae9 in
> > PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=,
> > locals=, args=0x3, argcount=1, 

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-10 Thread Eloi Gaudry
Hi,

Sorry, I just forgot to add the values of the function parameters:
(gdb) print reg->cbdata
$1 = (void *) 0x0
(gdb) print openib_btl->super
$2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288, 
btl_rndv_eager_limit = 12288, btl_max_send_size = 65536, 
btl_rdma_pipeline_send_length = 1048576, 
  btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 1060864, 
btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800, btl_flags = 310, 
  btl_add_procs = 0x2b341eb8ee47 , btl_del_procs = 
0x2b341eb90156 , btl_register = 0, btl_finalize = 
0x2b341eb93186 , 
  btl_alloc = 0x2b341eb90a3e , btl_free = 0x2b341eb91400 
, btl_prepare_src = 0x2b341eb91813 
, 
  btl_prepare_dst = 0x2b341eb91f2e , btl_send = 
0x2b341eb94517 , btl_sendi = 0x2b341eb9340d 
, 
  btl_put = 0x2b341eb94660 , btl_get = 0x2b341eb94c4e 
, btl_dump = 0x2b341acd45cb , btl_mpool 
= 0xf3f4110, 
  btl_register_error = 0x2b341eb90565 , 
btl_ft_event = 0x2b341eb952e7 }
(gdb) print hdr->tag
$3 = 0 '\0'
(gdb) print des
$4 = (mca_btl_base_descriptor_t *) 0xf4a6700
(gdb) print reg->cbfunc
$5 = (mca_btl_base_module_recv_cb_fn_t) 0

Eloi

On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> Hi,
> 
> Here is the output of a core file generated during a segmentation fault
> observed during a collective call (using openib):
> 
> #0  0x in ?? ()
> (gdb) where
> #0  0x in ?? ()
> #1  0x2aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> #2  0x2aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7279ce90) at btl_openib_component.c:3178
> #3  0x2aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
> #4  0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
> #5  0x2aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
> #6  0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
> #7  0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
> #8  0x2aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7279d0e0, statuses=0x0) at request/req_wait.c:262
> #9  0x2aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> #10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_decision_fixed.c:63
> #11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
> #12 0x04387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7279d444, recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
> #13 0x04058be8 in FEMTown::Domain::align (itf={ = {_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, }) at interface.cpp:371
> #14 0x040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7279d7f0) at check.cpp:63
> #15 0x040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7279d7f0) at check.cpp:159
> #16 0x039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
> #17 0x01fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950) at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> #18 0x039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
> #19 0x044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
> #20 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
> #21 0x04408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
> #22 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
> #23 0x04408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
> #24 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-10 Thread Eloi Gaudry
Hi,

Here is the output of a core file generated during a segmentation fault 
observed during a collective call (using openib):

#0  0x in ?? ()
(gdb) where
#0  0x in ?? ()
#1  0x2aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, 
ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
#2  0x2aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, 
wc=0x7279ce90) at btl_openib_component.c:3178
#3  0x2aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at 
btl_openib_component.c:3318
#4  0x2aedbc4e34b8 in progress_one_device (device=0x19024ac0) at 
btl_openib_component.c:3426
#5  0x2aedbc4e3561 in btl_openib_component_progress () at 
btl_openib_component.c:3451
#6  0x2aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
#7  0x2aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, 
m=0x2aedb888cd20) at ../opal/threads/condition.h:99
#8  0x2aedb859fa31 in ompi_request_default_wait_all (count=2, 
requests=0x7279d0e0, statuses=0x0) at request/req_wait.c:262
#9  0x2aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling 
(sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220, 
op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20)
at coll_tuned_allreduce.c:223
#10 0x2aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed 
(sbuf=0x7279d444, rbuf=0x7279d440, count=1, dtype=0x6788220, 
op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20)
at coll_tuned_decision_fixed.c:63
#11 0x2aedb85c7792 in PMPI_Allreduce (sendbuf=0x7279d444, 
recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20, 
comm=0x19d81ff0) at pallreduce.c:102
#12 0x04387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7279d444, 
recvbuf=0x7279d440, count=1, datatype=0x6788220, op=0x6787a20, 
comm=0x19d81ff0) at stubs.cpp:626
#13 0x04058be8 in FEMTown::Domain::align (itf=
{ = 
{_vptr.shared_base_ptr = 0x7279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 
0x6}}}, })
at interface.cpp:371
#14 0x040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors 
(dim=2, set={px = 0x7279d780, pn = {pi_ = 0x2f279d640}}, 
check_info=@0x7279d7f0) at check.cpp:63
#15 0x040cbfa8 in FEMTown::Field::align_elements (set={px = 
0x7279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7279d7f0) at 
check.cpp:159
#16 0x039acdd4 in PyField_align_elements (self=0x0, 
args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
#17 0x01fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, 
_object*, _object*)>::exec<_object> (this=0x7279dc20, s=0x0, 
po1=0x2aaab0765050, po2=0x19d2e950)
at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
#18 0x039acc82 in PyField_align_elements_ewrap (self=0x0, 
args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
#19 0x044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
#20 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
#21 0x04408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
#22 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
#23 0x04408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
#24 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968
#25 0x04408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
#26 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
#27 0x04408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
#28 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b8, defcount=1, closure=0x0) at Python/ceval.c:2968
#29 0x04408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
#30 0x0440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
#31 0x0440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
#32 0x0442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-16 Thread Eloi Gaudry
Hi Edgar,

The only difference I observed was that the segmentation fault sometimes 
appeared later during the parallel computation.

I'm running out of ideas here. I wish I could use "--mca coll tuned" together 
with "--mca btl self,sm,tcp" so that I could check whether the issue is somehow 
limited to the tuned collective routines.

Thanks,
Eloi


On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > hi edgar,
> > 
> > thanks for the tips, I'm gonna try this option as well. the segmentation
> > fault i'm observing always happened during a collective communication
> > indeed... does it basically switch all collective communication to basic
> > mode, right ?
> > 
> > sorry for my ignorance, but what's a NCA ?
> 
> sorry, I meant to type HCA (InfiniBand networking card)
> 
> Thanks
> Edgar
> 
> > thanks,
> > éloi
> > 
> > On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> >> you could try first to use the algorithms in the basic module, e.g.
> >> 
> >> mpirun -np x --mca coll basic ./mytest
> >> 
> >> and see whether this makes a difference. I used to observe sometimes a
> >> (similar ?) problem in the openib btl triggered from the tuned
> >> collective component, in cases where the ofed libraries were installed
> >> but no NCA was found on a node. It used to work however with the basic
> >> component.
> >> 
> >> Thanks
> >> Edgar
> >> 
> >> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> >>> hi Rolf,
> >>> 
> >>> unfortunately, i couldn't get rid of that annoying segmentation fault
> >>> when selecting another bcast algorithm. i'm now going to replace
> >>> MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and
> >>> see if that helps.
> >>> 
> >>> regards,
> >>> éloi
> >>> 
> >>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
>  Hi Rolf,
>  
>  thanks for your input. You're right, I miss the
>  coll_tuned_use_dynamic_rules option.
>  
>  I'll check if I the segmentation fault disappears when using the basic
>  bcast linear algorithm using the proper command line you provided.
>  
>  Regards,
>  Eloi
>  
>  On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> > Hi Eloi:
> > To select the different bcast algorithms, you need to add an extra
> > mca parameter that tells the library to use dynamic selection.
> > --mca coll_tuned_use_dynamic_rules 1
> > 
> > One way to make sure you are typing this in correctly is to use it
> > with ompi_info.  Do the following:
> > ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> > 
> > You should see lots of output with all the different algorithms that
> > can be selected for the various collectives.
> > Therefore, you need this:
> > 
> > --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm
> > 1
> > 
> > Rolf
> > 
> > On 07/13/10 11:28, Eloi Gaudry wrote:
> >> Hi,
> >> 
> >> I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to
> >> switch to the basic linear algorithm. Anyway whatever the algorithm
> >> used, the segmentation fault remains.
> >> 
> >> Does anyone could give some advice on ways to diagnose the issue I'm
> >> facing ?
> >> 
> >> Regards,
> >> Eloi
> >> 
> >> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> >>> Hi,
> >>> 
> >>> I'm focusing on the MPI_Bcast routine that seems to randomly
> >>> segfault when using the openib btl. I'd like to know if there is
> >>> any way to make OpenMPI switch to a different algorithm than the
> >>> default one being selected for MPI_Bcast.
> >>> 
> >>> Thanks for your help,
> >>> Eloi
> >>> 
> >>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
>  Hi,
>  
>  I'm observing a random segmentation fault during an internode
>  parallel computation involving the openib btl and OpenMPI-1.4.2
>  (the same issue can be observed with OpenMPI-1.3.3).
>  
> mpirun (Open MPI) 1.4.2
> Report bugs to http://www.open-mpi.org/community/help/
> [pbn08:02624] *** Process received signal ***
> [pbn08:02624] Signal: Segmentation fault (11)
> [pbn08:02624] Signal code: Address not mapped (1)
> [pbn08:02624] Failing at address: (nil)
> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> [pbn08:02624] *** End of error message ***
> sh: line 1:  2624 Segmentation fault
>  
>  \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5
>  \/ x 86 _6 4\ /bin\/actranpy_mp
>  '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x
>  86 _ 64 /A c tran_11.0.rc2.41872'
>  '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2
>  .d a t' 

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Edgar Gabriel
On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> hi edgar,
> 
> thanks for the tips, I'm gonna try this option as well. the segmentation 
> fault i'm observing always happened during a collective communication 
> indeed...
> does it basically switch all collective communication to basic mode, right ?
> 
> sorry for my ignorance, but what's a NCA ? 

sorry, I meant to type HCA (InfiniBand networking card)

Thanks
Edgar

> 
> thanks,
> éloi
> 
> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
>> you could try first to use the algorithms in the basic module, e.g.
>>
>> mpirun -np x --mca coll basic ./mytest
>>
>> and see whether this makes a difference. I used to observe sometimes a
>> (similar ?) problem in the openib btl triggered from the tuned
>> collective component, in cases where the ofed libraries were installed
>> but no NCA was found on a node. It used to work however with the basic
>> component.
>>
>> Thanks
>> Edgar
>>
>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
>>> hi Rolf,
>>>
>>> unfortunately, i couldn't get rid of that annoying segmentation fault
>>> when selecting another bcast algorithm. i'm now going to replace
>>> MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and
>>> see if that helps.
>>>
>>> regards,
>>> éloi
>>>
>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
 Hi Rolf,

 thanks for your input. You're right, I miss the
 coll_tuned_use_dynamic_rules option.

 I'll check if I the segmentation fault disappears when using the basic
 bcast linear algorithm using the proper command line you provided.

 Regards,
 Eloi

 On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> Hi Eloi:
> To select the different bcast algorithms, you need to add an extra mca
> parameter that tells the library to use dynamic selection.
> --mca coll_tuned_use_dynamic_rules 1
>
> One way to make sure you are typing this in correctly is to use it with
> ompi_info.  Do the following:
> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
>
> You should see lots of output with all the different algorithms that
> can be selected for the various collectives.
> Therefore, you need this:
>
> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
>
> Rolf
>
> On 07/13/10 11:28, Eloi Gaudry wrote:
>> Hi,
>>
>> I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch
>> to the basic linear algorithm. Anyway whatever the algorithm used, the
>> segmentation fault remains.
>>
>> Does anyone could give some advice on ways to diagnose the issue I'm
>> facing ?
>>
>> Regards,
>> Eloi
>>
>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
>>> Hi,
>>>
>>> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
>>> when using the openib btl. I'd like to know if there is any way to
>>> make OpenMPI switch to a different algorithm than the default one
>>> being selected for MPI_Bcast.
>>>
>>> Thanks for your help,
>>> Eloi
>>>
>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
 Hi,

 I'm observing a random segmentation fault during an internode
 parallel computation involving the openib btl and OpenMPI-1.4.2 (the
 same issue can be observed with OpenMPI-1.3.3).

mpirun (Open MPI) 1.4.2
Report bugs to http://www.open-mpi.org/community/help/
[pbn08:02624] *** Process received signal ***
[pbn08:02624] Signal: Segmentation fault (11)
[pbn08:02624] Signal code: Address not mapped (1)
[pbn08:02624] Failing at address: (nil)
[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
[pbn08:02624] *** End of error message ***
sh: line 1:  2624 Segmentation fault

 \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/
 x 86 _6 4\ /bin\/actranpy_mp
 '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86
 _ 64 /A c tran_11.0.rc2.41872'
 '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.d
 a t' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
 '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
 '--parallel=domain'

 If I choose not to use the openib btl (by using --mca btl
 self,sm,tcp on the command line, for instance), I don't encounter
 any problem and the parallel computation runs flawlessly.

 I would like to get some help to be able:
 - to diagnose the issue I'm facing with the openib btl
 - understand why this issue is observed only when using the openib
 btl and not when using self,sm,tcp

 Any help would be very much appreciated.

 The outputs of 

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Eloi Gaudry
hi edgar,

thanks for the tips, I'm going to try this option as well. The segmentation fault 
I'm observing has indeed always happened during a collective communication...
It basically switches all collective communication to basic mode, right ?

sorry for my ignorance, but what's a NCA ? 

thanks,
éloi

On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> you could try first to use the algorithms in the basic module, e.g.
> 
> mpirun -np x --mca coll basic ./mytest
> 
> and see whether this makes a difference. I used to observe sometimes a
> (similar ?) problem in the openib btl triggered from the tuned
> collective component, in cases where the ofed libraries were installed
> but no NCA was found on a node. It used to work however with the basic
> component.
> 
> Thanks
> Edgar
> 
> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> > hi Rolf,
> > 
> > unfortunately, i couldn't get rid of that annoying segmentation fault
> > when selecting another bcast algorithm. i'm now going to replace
> > MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and
> > see if that helps.
> > 
> > regards,
> > éloi
> > 
> > On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> >> Hi Rolf,
> >> 
> >> thanks for your input. You're right, I miss the
> >> coll_tuned_use_dynamic_rules option.
> >> 
> >> I'll check if I the segmentation fault disappears when using the basic
> >> bcast linear algorithm using the proper command line you provided.
> >> 
> >> Regards,
> >> Eloi
> >> 
> >> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> >>> Hi Eloi:
> >>> To select the different bcast algorithms, you need to add an extra mca
> >>> parameter that tells the library to use dynamic selection.
> >>> --mca coll_tuned_use_dynamic_rules 1
> >>> 
> >>> One way to make sure you are typing this in correctly is to use it with
> >>> ompi_info.  Do the following:
> >>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> >>> 
> >>> You should see lots of output with all the different algorithms that
> >>> can be selected for the various collectives.
> >>> Therefore, you need this:
> >>> 
> >>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> >>> 
> >>> Rolf
> >>> 
> >>> On 07/13/10 11:28, Eloi Gaudry wrote:
>  Hi,
>  
>  I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch
>  to the basic linear algorithm. Anyway whatever the algorithm used, the
>  segmentation fault remains.
>  
>  Does anyone could give some advice on ways to diagnose the issue I'm
>  facing ?
>  
>  Regards,
>  Eloi
>  
>  On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > Hi,
> > 
> > I'm focusing on the MPI_Bcast routine that seems to randomly segfault
> > when using the openib btl. I'd like to know if there is any way to
> > make OpenMPI switch to a different algorithm than the default one
> > being selected for MPI_Bcast.
> > 
> > Thanks for your help,
> > Eloi
> > 
> > On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> >> Hi,
> >> 
> >> I'm observing a random segmentation fault during an internode
> >> parallel computation involving the openib btl and OpenMPI-1.4.2 (the
> >> same issue can be observed with OpenMPI-1.3.3).
> >> 
> >>mpirun (Open MPI) 1.4.2
> >>Report bugs to http://www.open-mpi.org/community/help/
> >>[pbn08:02624] *** Process received signal ***
> >>[pbn08:02624] Signal: Segmentation fault (11)
> >>[pbn08:02624] Signal code: Address not mapped (1)
> >>[pbn08:02624] Failing at address: (nil)
> >>[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >>[pbn08:02624] *** End of error message ***
> >>sh: line 1:  2624 Segmentation fault
> >> 
> >> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/
> >> x 86 _6 4\ /bin\/actranpy_mp
> >> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86
> >> _ 64 /A c tran_11.0.rc2.41872'
> >> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.d
> >> a t' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> >> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> >> '--parallel=domain'
> >> 
> >> If I choose not to use the openib btl (by using --mca btl
> >> self,sm,tcp on the command line, for instance), I don't encounter
> >> any problem and the parallel computation runs flawlessly.
> >> 
> >> I would like to get some help to be able:
> >> - to diagnose the issue I'm facing with the openib btl
> >> - understand why this issue is observed only when using the openib
> >> btl and not when using self,sm,tcp
> >> 
> >> Any help would be very much appreciated.
> >> 
> >> The outputs of ompi_info and the configure scripts of OpenMPI are
> >> enclosed to this email, and some information on the infiniband
> 

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Edgar Gabriel
you could try first to use the algorithms in the basic module, e.g.

mpirun -np x --mca coll basic ./mytest

and see whether this makes a difference. I used to observe sometimes a
(similar ?) problem in the openib btl triggered from the tuned
collective component, in cases where the ofed libraries were installed
but no NCA was found on a node. It used to work however with the basic
component.

Thanks
Edgar


On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> hi Rolf,
> 
> unfortunately, i couldn't get rid of that annoying segmentation fault when 
> selecting another bcast algorithm.
> i'm now going to replace MPI_Bcast with a naive implementation (using 
> MPI_Send and MPI_Recv) and see if that helps.
> 
> regards,
> éloi
> 
> 
> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
>> Hi Rolf,
>>
>> thanks for your input. You're right, I miss the
>> coll_tuned_use_dynamic_rules option.
>>
>> I'll check if I the segmentation fault disappears when using the basic
>> bcast linear algorithm using the proper command line you provided.
>>
>> Regards,
>> Eloi
>>
>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
>>> Hi Eloi:
>>> To select the different bcast algorithms, you need to add an extra mca
>>> parameter that tells the library to use dynamic selection.
>>> --mca coll_tuned_use_dynamic_rules 1
>>>
>>> One way to make sure you are typing this in correctly is to use it with
>>> ompi_info.  Do the following:
>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
>>>
>>> You should see lots of output with all the different algorithms that can
>>> be selected for the various collectives.
>>> Therefore, you need this:
>>>
>>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
>>>
>>> Rolf
>>>
>>> On 07/13/10 11:28, Eloi Gaudry wrote:
 Hi,

 I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch
 to the basic linear algorithm. Anyway whatever the algorithm used, the
 segmentation fault remains.

 Does anyone could give some advice on ways to diagnose the issue I'm
 facing ?

 Regards,
 Eloi

 On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> Hi,
>
> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
> when using the openib btl. I'd like to know if there is any way to
> make OpenMPI switch to a different algorithm than the default one
> being selected for MPI_Bcast.
>
> Thanks for your help,
> Eloi
>
> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
>> Hi,
>>
>> I'm observing a random segmentation fault during an internode
>> parallel computation involving the openib btl and OpenMPI-1.4.2 (the
>> same issue can be observed with OpenMPI-1.3.3).
>>
>>mpirun (Open MPI) 1.4.2
>>Report bugs to http://www.open-mpi.org/community/help/
>>[pbn08:02624] *** Process received signal ***
>>[pbn08:02624] Signal: Segmentation fault (11)
>>[pbn08:02624] Signal code: Address not mapped (1)
>>[pbn08:02624] Failing at address: (nil)
>>[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
>>[pbn08:02624] *** End of error message ***
>>sh: line 1:  2624 Segmentation fault
>>
>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x
>> 86 _6 4\ /bin\/actranpy_mp
>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_
>> 64 /A c tran_11.0.rc2.41872'
>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.da
>> t' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
>> '--parallel=domain'
>>
>> If I choose not to use the openib btl (by using --mca btl self,sm,tcp
>> on the command line, for instance), I don't encounter any problem and
>> the parallel computation runs flawlessly.
>>
>> I would like to get some help to be able:
>> - to diagnose the issue I'm facing with the openib btl
>> - understand why this issue is observed only when using the openib
>> btl and not when using self,sm,tcp
>>
>> Any help would be very much appreciated.
>>
>> The outputs of ompi_info and the configure scripts of OpenMPI are
>> enclosed to this email, and some information on the infiniband
>> drivers as well.
>>
>> Here is the command line used when launching a parallel computation
>>
>> using infiniband:
>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
>>--mca
>>
>> btl openib,sm,self,tcp  --display-map --verbose --version --mca
>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
>>
>> and the command line used if not using infiniband:
>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
>>--mca
>>
>> btl self,sm,tcp  --display-map --verbose --version --mca
>> 

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Eloi Gaudry
hi Rolf,

Unfortunately, I couldn't get rid of that annoying segmentation fault when 
selecting another bcast algorithm.
I'm now going to replace MPI_Bcast with a naive implementation (using MPI_Send 
and MPI_Recv) and see if that helps.
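
For what it's worth, the replacement I have in mind is just a linear broadcast along these lines (a rough sketch only, with an arbitrary tag value, not the final code):

#include <mpi.h>

/* naive linear broadcast: the root sends the buffer to every other rank in
 * turn, and every other rank posts the matching receive (sketch only) */
static int naive_bcast(void *buf, int count, MPI_Datatype type,
                       int root, MPI_Comm comm)
{
    int rank, size, peer;
    const int tag = 4242;          /* arbitrary tag reserved for this bcast */

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (peer = 0; peer < size; ++peer) {
            if (peer == root)
                continue;
            MPI_Send(buf, count, type, peer, tag, comm);
        }
    } else {
        MPI_Recv(buf, count, type, root, tag, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}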

regards,
éloi


On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> Hi Rolf,
> 
> thanks for your input. You're right, I miss the
> coll_tuned_use_dynamic_rules option.
> 
> I'll check if I the segmentation fault disappears when using the basic
> bcast linear algorithm using the proper command line you provided.
> 
> Regards,
> Eloi
> 
> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> > Hi Eloi:
> > To select the different bcast algorithms, you need to add an extra mca
> > parameter that tells the library to use dynamic selection.
> > --mca coll_tuned_use_dynamic_rules 1
> > 
> > One way to make sure you are typing this in correctly is to use it with
> > ompi_info.  Do the following:
> > ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> > 
> > You should see lots of output with all the different algorithms that can
> > be selected for the various collectives.
> > Therefore, you need this:
> > 
> > --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> > 
> > Rolf
> > 
> > On 07/13/10 11:28, Eloi Gaudry wrote:
> > > Hi,
> > > 
> > > I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch
> > > to the basic linear algorithm. Anyway whatever the algorithm used, the
> > > segmentation fault remains.
> > > 
> > > Does anyone could give some advice on ways to diagnose the issue I'm
> > > facing ?
> > > 
> > > Regards,
> > > Eloi
> > > 
> > > On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > >> Hi,
> > >> 
> > >> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
> > >> when using the openib btl. I'd like to know if there is any way to
> > >> make OpenMPI switch to a different algorithm than the default one
> > >> being selected for MPI_Bcast.
> > >> 
> > >> Thanks for your help,
> > >> Eloi
> > >> 
> > >> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > >>> Hi,
> > >>> 
> > >>> I'm observing a random segmentation fault during an internode
> > >>> parallel computation involving the openib btl and OpenMPI-1.4.2 (the
> > >>> same issue can be observed with OpenMPI-1.3.3).
> > >>> 
> > >>>mpirun (Open MPI) 1.4.2
> > >>>Report bugs to http://www.open-mpi.org/community/help/
> > >>>[pbn08:02624] *** Process received signal ***
> > >>>[pbn08:02624] Signal: Segmentation fault (11)
> > >>>[pbn08:02624] Signal code: Address not mapped (1)
> > >>>[pbn08:02624] Failing at address: (nil)
> > >>>[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > >>>[pbn08:02624] *** End of error message ***
> > >>>sh: line 1:  2624 Segmentation fault
> > >>> 
> > >>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x
> > >>> 86 _6 4\ /bin\/actranpy_mp
> > >>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_
> > >>> 64 /A c tran_11.0.rc2.41872'
> > >>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.da
> > >>> t' '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> > >>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> > >>> '--parallel=domain'
> > >>> 
> > >>> If I choose not to use the openib btl (by using --mca btl self,sm,tcp
> > >>> on the command line, for instance), I don't encounter any problem and
> > >>> the parallel computation runs flawlessly.
> > >>> 
> > >>> I would like to get some help to be able:
> > >>> - to diagnose the issue I'm facing with the openib btl
> > >>> - understand why this issue is observed only when using the openib
> > >>> btl and not when using self,sm,tcp
> > >>> 
> > >>> Any help would be very much appreciated.
> > >>> 
> > >>> The outputs of ompi_info and the configure scripts of OpenMPI are
> > >>> enclosed to this email, and some information on the infiniband
> > >>> drivers as well.
> > >>> 
> > >>> Here is the command line used when launching a parallel computation
> > >>> 
> > >>> using infiniband:
> > >>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> > >>>--mca
> > >>> 
> > >>> btl openib,sm,self,tcp  --display-map --verbose --version --mca
> > >>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > >>> 
> > >>> and the command line used if not using infiniband:
> > >>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> > >>>--mca
> > >>> 
> > >>> btl self,sm,tcp  --display-map --verbose --version --mca
> > >>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > >>> 
> > >>> Thanks,
> > >>> Eloi
> > > 
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959



Re: [OMPI users] [openib] segfault when using openib btl

2010-07-14 Thread Eloi Gaudry
Hi Rolf,

thanks for your input. You're right, I missed the coll_tuned_use_dynamic_rules 
option.

I'll check if the segmentation fault disappears when using the basic linear 
bcast algorithm with the proper command line you provided.

Regards,
Eloi

On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> Hi Eloi:
> To select the different bcast algorithms, you need to add an extra mca
> parameter that tells the library to use dynamic selection.
> --mca coll_tuned_use_dynamic_rules 1
> 
> One way to make sure you are typing this in correctly is to use it with
> ompi_info.  Do the following:
> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> 
> You should see lots of output with all the different algorithms that can
> be selected for the various collectives.
> Therefore, you need this:
> 
> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> 
> Rolf
> 
> On 07/13/10 11:28, Eloi Gaudry wrote:
> > Hi,
> > 
> > I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch to
> > the basic linear algorithm. Anyway whatever the algorithm used, the
> > segmentation fault remains.
> > 
> > Does anyone could give some advice on ways to diagnose the issue I'm
> > facing ?
> > 
> > Regards,
> > Eloi
> > 
> > On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> >> Hi,
> >> 
> >> I'm focusing on the MPI_Bcast routine that seems to randomly segfault
> >> when using the openib btl. I'd like to know if there is any way to make
> >> OpenMPI switch to a different algorithm than the default one being
> >> selected for MPI_Bcast.
> >> 
> >> Thanks for your help,
> >> Eloi
> >> 
> >> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> >>> Hi,
> >>> 
> >>> I'm observing a random segmentation fault during an internode parallel
> >>> computation involving the openib btl and OpenMPI-1.4.2 (the same issue
> >>> can be observed with OpenMPI-1.3.3).
> >>> 
> >>>mpirun (Open MPI) 1.4.2
> >>>Report bugs to http://www.open-mpi.org/community/help/
> >>>[pbn08:02624] *** Process received signal ***
> >>>[pbn08:02624] Signal: Segmentation fault (11)
> >>>[pbn08:02624] Signal code: Address not mapped (1)
> >>>[pbn08:02624] Failing at address: (nil)
> >>>[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >>>[pbn08:02624] *** End of error message ***
> >>>sh: line 1:  2624 Segmentation fault
> >>> 
> >>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86
> >>> _6 4\ /bin\/actranpy_mp
> >>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64
> >>> /A c tran_11.0.rc2.41872'
> >>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> >>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200'
> >>> '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
> >>> 
> >>> If I choose not to use the openib btl (by using --mca btl self,sm,tcp
> >>> on the command line, for instance), I don't encounter any problem and
> >>> the parallel computation runs flawlessly.
> >>> 
> >>> I would like to get some help to be able:
> >>> - to diagnose the issue I'm facing with the openib btl
> >>> - understand why this issue is observed only when using the openib btl
> >>> and not when using self,sm,tcp
> >>> 
> >>> Any help would be very much appreciated.
> >>> 
> >>> The outputs of ompi_info and the configure scripts of OpenMPI are
> >>> enclosed to this email, and some information on the infiniband drivers
> >>> as well.
> >>> 
> >>> Here is the command line used when launching a parallel computation
> >>> 
> >>> using infiniband:
> >>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> >>> 
> >>> btl openib,sm,self,tcp  --display-map --verbose --version --mca
> >>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>> 
> >>> and the command line used if not using infiniband:
> >>>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> >>> 
> >>> btl self,sm,tcp  --display-map --verbose --version --mca
> >>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>> 
> >>> Thanks,
> >>> Eloi
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


Re: [OMPI users] [openib] segfault when using openib btl

2010-07-13 Thread Rolf vandeVaart

Hi Eloi:
To select the different bcast algorithms, you need to add an extra mca 
parameter that tells the library to use dynamic selection.

--mca coll_tuned_use_dynamic_rules 1

One way to make sure you are typing this in correctly is to use it with 
ompi_info.  Do the following:

ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll

You should see lots of output with all the different algorithms that can 
be selected for the various collectives.

Therefore, you need this:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
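
For example (the process count and executable name below are just placeholders), the complete command line would look like this:

mpirun -np 16 --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1 ./mytest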

Rolf

On 07/13/10 11:28, Eloi Gaudry wrote:

Hi,

I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch to the 
basic linear algorithm.
Anyway whatever the algorithm used, the segmentation fault remains.

Does anyone could give some advice on ways to diagnose the issue I'm facing ?

Regards,
Eloi


On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
  

Hi,

I'm focusing on the MPI_Bcast routine that seems to randomly segfault when
using the openib btl. I'd like to know if there is any way to make OpenMPI
switch to a different algorithm than the default one being selected for
MPI_Bcast.

Thanks for your help,
Eloi

On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:


Hi,

I'm observing a random segmentation fault during an internode parallel
computation involving the openib btl and OpenMPI-1.4.2 (the same issue
can be observed with OpenMPI-1.3.3).

   mpirun (Open MPI) 1.4.2
   Report bugs to http://www.open-mpi.org/community/help/
   [pbn08:02624] *** Process received signal ***
   [pbn08:02624] Signal: Segmentation fault (11)
   [pbn08:02624] Signal code: Address not mapped (1)
   [pbn08:02624] Failing at address: (nil)
   [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
   [pbn08:02624] *** End of error message ***
   sh: line 1:  2624 Segmentation fault

\/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_6
4\ /bin\/actranpy_mp
'--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/A
c tran_11.0.rc2.41872'
'--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
'--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200'
'--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'

If I choose not to use the openib btl (by using --mca btl self,sm,tcp on
the command line, for instance), I don't encounter any problem and the
parallel computation runs flawlessly.

I would like to get some help to be able:
- to diagnose the issue I'm facing with the openib btl
- understand why this issue is observed only when using the openib btl
and not when using self,sm,tcp

Any help would be very much appreciated.

The outputs of ompi_info and the configure scripts of OpenMPI are
enclosed to this email, and some information on the infiniband drivers
as well.

Here is the command line used when launching a parallel computation

using infiniband:
   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca

btl openib,sm,self,tcp  --display-map --verbose --version --mca
mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]

and the command line used if not using infiniband:
   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca

btl self,sm,tcp  --display-map --verbose --version --mca
mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]

Thanks,
Eloi
  


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  




Re: [OMPI users] [openib] segfault when using openib btl

2010-07-13 Thread Eloi Gaudry

Hi,

I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch to the 
basic linear algorithm.
Anyway whatever the algorithm used, the segmentation fault remains.

Does anyone could give some advice on ways to diagnose the issue I'm facing ?

Regards,
Eloi


On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> Hi,
> 
> I'm focusing on the MPI_Bcast routine that seems to randomly segfault when
> using the openib btl. I'd like to know if there is any way to make OpenMPI
> switch to a different algorithm than the default one being selected for
> MPI_Bcast.
> 
> Thanks for your help,
> Eloi
> 
> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > Hi,
> > 
> > I'm observing a random segmentation fault during an internode parallel
> > computation involving the openib btl and OpenMPI-1.4.2 (the same issue
> > can be observed with OpenMPI-1.3.3).
> > 
> >mpirun (Open MPI) 1.4.2
> >Report bugs to http://www.open-mpi.org/community/help/
> >[pbn08:02624] *** Process received signal ***
> >[pbn08:02624] Signal: Segmentation fault (11)
> >[pbn08:02624] Signal code: Address not mapped (1)
> >[pbn08:02624] Failing at address: (nil)
> >[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >[pbn08:02624] *** End of error message ***
> >sh: line 1:  2624 Segmentation fault
> > 
> > \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_6
> > 4\ /bin\/actranpy_mp
> > '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/A
> > c tran_11.0.rc2.41872'
> > '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> > '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200'
> > '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
> > 
> > If I choose not to use the openib btl (by using --mca btl self,sm,tcp on
> > the command line, for instance), I don't encounter any problem and the
> > parallel computation runs flawlessly.
> > 
> > I would like to get some help to be able:
> > - to diagnose the issue I'm facing with the openib btl
> > - understand why this issue is observed only when using the openib btl
> > and not when using self,sm,tcp
> > 
> > Any help would be very much appreciated.
> > 
> > The outputs of ompi_info and the configure scripts of OpenMPI are
> > enclosed to this email, and some information on the infiniband drivers
> > as well.
> > 
> > Here is the command line used when launching a parallel computation
> > 
> > using infiniband:
> >path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> > 
> > btl openib,sm,self,tcp  --display-map --verbose --version --mca
> > mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > 
> > and the command line used if not using infiniband:
> >path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> > 
> > btl self,sm,tcp  --display-map --verbose --version --mca
> > mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > 
> > Thanks,
> > Eloi
> 



Re: [OMPI users] [openib] segfault when using openib btl

2010-07-12 Thread Eloi Gaudry
Hi,

I'm focusing on the MPI_Bcast routine that seems to randomly segfault when 
using the openib btl.
I'd like to know if there is any way to make OpenMPI switch to a different 
algorithm than the default one being selected for MPI_Bcast.

Thanks for your help,
Eloi 

On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> Hi,
> 
> I'm observing a random segmentation fault during an internode parallel
> computation involving the openib btl and OpenMPI-1.4.2 (the same issue
> can be observed with OpenMPI-1.3.3).
>mpirun (Open MPI) 1.4.2
>Report bugs to http://www.open-mpi.org/community/help/
>[pbn08:02624] *** Process received signal ***
>[pbn08:02624] Signal: Segmentation fault (11)
>[pbn08:02624] Signal code: Address not mapped (1)
>[pbn08:02624] Failing at address: (nil)
>[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
>[pbn08:02624] *** End of error message ***
>sh: line 1:  2624 Segmentation fault
> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\
> /bin\/actranpy_mp
> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Ac
> tran_11.0.rc2.41872'
> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200'
> '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
> 
> If I choose not to use the openib btl (by using --mca btl self,sm,tcp on
> the command line, for instance), I don't encounter any problem and the
> parallel computation runs flawlessly.
> 
> I would like to get some help to be able:
> - to diagnose the issue I'm facing with the openib btl
> - understand why this issue is observed only when using the openib btl
> and not when using self,sm,tcp
> 
> Any help would be very much appreciated.
> 
> The outputs of ompi_info and the configure scripts of OpenMPI are
> enclosed to this email, and some information on the infiniband drivers
> as well.
> 
> Here is the command line used when launching a parallel computation
> using infiniband:
>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> btl openib,sm,self,tcp  --display-map --verbose --version --mca
> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> and the command line used if not using infiniband:
>path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> btl self,sm,tcp  --display-map --verbose --version --mca
> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> 
> Thanks,
> Eloi