Re: [OMPI devel] Malloc segfaulting?

2007-09-20 Thread Brian Barrett

On Sep 20, 2007, at 7:02 AM, Tim Prins wrote:

In our nightly runs with the trunk I have started seeing cases where we
appear to be segfaulting within/below malloc. Below is a typical output.

Note that this appears to only happen on the trunk, when we use openib,
and are in 32 bit mode. It seems to happen randomly at a very low
frequency (59 out of about 60,000 32 bit openib runs).

This could be a problem with our machine, and has shown up since I
started testing 32bit ofed 10 days ago.

Anyways, just curious if anyone had any ideas.


As someone else said, this usually points to a duplicate free or the
like in malloc.  You might want to try compiling with
--without-memory-manager, as the ptmalloc2 in glibc frequently is more
verbose about where errors occurred than is the one in Open MPI.


Brian


Re: [OMPI devel] Message Queue debugging support for 1.2.4

2007-09-20 Thread Terry Dontje
I've talked with both Brian and Rich about the measurements and they are 
ok with the new findings.  I also have not received any other comments 
to the negative on putting 1097 into the v1.2 branch.  So I would like 
to instruct Tim Mattox to bring over the 1097 change to v1.2 branch and 
make a new 1.2 RC.


thanks,

--td

Terry Dontje wrote:

Nikolay and Community,

Sorry to be so late in responding to your email but I've been working 
with Pak to determine whether my hasty decision as RM yesterday was 
hasty or not.  To answer your question, we are still trying to determine 
if the message queue support can go in or not and the below is my 
perspective on whether we should.


Community,

A couple things have transpired in the last 24 hours from when we had 
our concall.  As Jeff surmised earlier this morning Pak did accidentally 
have debugging enabled which did skew the numbers quite a bit.  After 
making sure debugging was disabled for both builds (v1.2 and the tmp 
branch with the message queue fixes) we then fretted over the numbers.  
It looks to me that there is quite a bit of variance in the numbers that 
the OSU latency, IMB latency and mpi_ping  all produce. 

For example, in the OSU latency tests we saw the MX MTL have a .01 
us difference between v1.2 and the tmp branch (in favor of v1.2).  
However the mean, trimmed mean and median have about .02-.07 us difference 
(in favor of the tmp branch).  To me the data looks pretty much the same 
and the fact that we are measuring the averages (ie none of the tests 
pick out the minimum value) makes these numbers even harder to really 
nail down IMHO.  I've essentially seen this effect for the other tests 
(IMB and mpi_ping).
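
To illustrate why averages and minimums can tell different stories, here
is a minimal Python sketch that summarizes two sets of latency samples.
The sample values, the 10% trim fraction, and the helper names
(`trimmed_mean`, `summarize`) are assumptions made up for illustration.
They are not measurements or tools from this thread.

```python
import statistics


def trimmed_mean(samples, trim_fraction=0.1):
    """Mean after dropping trim_fraction of the samples from each end."""
    ordered = sorted(samples)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


def summarize(samples):
    """Collect the statistics discussed above for one set of samples."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.fmean(samples),
        "trimmed_mean": trimmed_mean(samples),
    }


# Hypothetical per-run latencies in microseconds (made-up values).
run_a = [1.47, 1.48, 1.47, 1.50, 1.49, 1.47, 1.48, 1.62, 1.47, 1.48]
run_b = [1.47, 1.47, 1.49, 1.48, 1.50, 1.48, 1.47, 1.48, 1.47, 1.55]

for name, run in (("run_a", run_a), ("run_b", run_b)):
    stats = summarize(run)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

In this made-up data a single slow run shifts the mean while the minimum
and median stay put, which is one reason small differences between
averages are hard to interpret.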
 
For the SM timings  using the mpi_ping tests we have seen a range of 
average latencies from 1.47-1.5 us for both the tmp and v1.2 so they 
seem like moral equivalents to me.  Rich Graham has led me to believe 
that he might get more consistent numbers but we are not able to and so 
I can only deduce that the numbers are essentially the same.


In conclusion I believe both the CM PML (MX MTL) and the SM BTL 
performance is about the same between the tmp branch and v1.2.  Because 
of this I would like to request that the 1097 cmr be put into 1.2.4.  If 
others disagree with my assessment above I think a discussion will need 
to ensue and I would welcome further testing by others that may show 
that the changes have regressed performance (or not).  I would like to 
set a timeout of 12 noon ET for others to comment whether these new 
findings put our fears at ease.  At that time if no dissenting 
comments have been received I would like to ask Tim to pull in these 
changes and rebuild 1.2.4.


thanks,

--td



Nikolay Piskun wrote:
  

  Hi,

  Just to verify, before I start testing this: there will be no 
message queue debugging support in this version, correct? This all 
goes into the 1.3 release.

Best Regards,

P.S. It looks like it is time for us to be more formally involved in 
this work.


Nikolay Piskun
Director of Continuing Engineering, TotalView Technologies
24 Prime Parkway, Natick, MA 01760
http://www.totalviewtech.com


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
  



  




Re: [OMPI devel] Malloc segfaulting?

2007-09-20 Thread Aurelien Bouteiller

This usually means something has been freed twice.

Aurelien

On Sep 20, 2007, at 09:02, Tim Prins wrote:


Hi folks,

In our nightly runs with the trunk I have started seeing cases where we
appear to be segfaulting within/below malloc. Below is a typical output.

Note that this appears to only happen on the trunk, when we use openib,
and are in 32 bit mode. It seems to happen randomly at a very low
frequency (59 out of about 60,000 32 bit openib runs).

This could be a problem with our machine, and has shown up since I
started testing 32bit ofed 10 days ago.

Anyways, just curious if anyone had any ideas.

Thanks,

Tim

--

[odin011:04084] *** Process received signal ***
[odin011:04084] Signal: Segmentation fault (11)
[odin011:04084] Signal code: Invalid permissions (2)
[odin011:04084] Failing at address: 0xf7cbea68
[odin011:04084] [ 0] [0xe600]
[odin011:04084] [ 1]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/libopen-pal.so.0(malloc+0x82)
[0xf7e882d2]
[odin011:04084] [ 2]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/libopen-rte.so.0(orte_hash_table_set_proc+0xfa)
[0xf7ec57aa]
[odin011:04084] [ 3]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_lookup+0x11d)
[0xf7cbcebd]
[odin011:04084] [ 4]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_send_nb+0x1f)
[0xf7cbfccf]
[odin011:04084] [ 5]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_rml_oob.so(orte_rml_oob_send_buffer_nb+0x25a)
[0xf7cddfda]
[odin011:04084] [ 6]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_btl_openib.so
[0xf7c145f1]
[odin011:04084] [ 7]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_btl_openib.so
[0xf7c146e9]
[odin011:04084] [ 8]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_endpoint_send+0x345)
[0xf7c0e155]
[odin011:04084] [ 9]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_send+0x3e)
[0xf7c0718e]
[odin011:04084] [10]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x17b)
[0xf7c3c4bb]
[odin011:04084] [11]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x27c)
[0xf7c35adc]
[odin011:04084] [12]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_gather_intra_basic_linear+0x65)
[0xf7bc72a5]
[odin011:04084] [13]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_gather_intra_dec_fixed+0x16a)
[0xf7bba2aa]
[odin011:04084] [14]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/libmpi.so.0(MPI_Gather+0x18c)
[0xf7f62b6c]
[odin011:04084] [15] src/MPI_Gather_c(main+0x5fd) [0x804a101]
[odin011:04084] [16] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0xf7d0fde3]
[odin011:04084] [17] src/MPI_Gather_c [0x8049a81]
[odin011:04084] *** End of error message ***





Re: [OMPI devel] FreeBSD Support?

2007-09-20 Thread Karol Mroz
Hi, Ralf. It seems that either approach works: one can `chmod u+w *`
or `chmod u+w configure` on its own, and either allows the autogen
process to complete successfully on FreeBSD 6.2. Note also that neither
change to autogen.sh breaks the Linux autogen process.
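
For readers less familiar with the mode change, here is a minimal Python
sketch of what `chmod u+w configure` does: it adds the owner-write bit
and leaves the other permission bits alone. The helper name
`make_user_writable` and the temporary stand-in file are assumptions for
illustration only. autogen.sh is a shell script and would simply call
chmod.

```python
import os
import stat
import tempfile


def make_user_writable(path):
    """Add the owner-write bit to path, like `chmod u+w path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a throwaway file rather than a real source tree.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "configure")
    with open(target, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(target, 0o444)  # read-only, as a stand-in for the failing case
    new_mode = make_user_writable(target)
    print(oct(new_mode))
```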

Ralf Wildenhues wrote:
> Hello Karol,
> 
> * Karol Mroz wrote on Wed, Sep 19, 2007 at 07:23:50PM CEST:
>> When running the autogen.sh script as non-root, I see the following error:
> [...]
>>  autom4te-2.61: cannot open configure: Permission denied
> [...]
>> After some searching, it would appear that this is an autoconf issue
>> that crops up in FreeBSD but, for whatever reason, not in Linux. A quick
>> workaround is to add: `chmod -vr u+w *` just before autogen issues the
>> `run_and_check $ompi_autoconf` command on line 438.
> 
> I can look into this if needed.  Is it sufficient to make
> opal/libltdl/configure writable, or its containing directory?
> 
> As a workaround you should be able to use nightly tarballs
> instead of the SVN version.
> 
> Cheers,
> Ralf

Thanks.
-- 
Karol Mroz
km...@cs.ubc.ca


[OMPI devel] Malloc segfaulting?

2007-09-20 Thread Tim Prins

Hi folks,

In our nightly runs with the trunk I have started seeing cases where we 
appear to be segfaulting within/below malloc. Below is a typical output.


Note that this appears to only happen on the trunk, when we use openib, 
and are in 32 bit mode. It seems to happen randomly at a very low 
frequency (59 out of about 60,000 32 bit openib runs).


This could be a problem with our machine, and has shown up since I 
started testing 32bit ofed 10 days ago.


Anyways, just curious if anyone had any ideas.

Thanks,

Tim

--

[odin011:04084] *** Process received signal ***
[odin011:04084] Signal: Segmentation fault (11)
[odin011:04084] Signal code: Invalid permissions (2)
[odin011:04084] Failing at address: 0xf7cbea68
[odin011:04084] [ 0] [0xe600]
[odin011:04084] [ 1]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/libopen-pal.so.0(malloc+0x82)
[0xf7e882d2]
[odin011:04084] [ 2]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/libopen-rte.so.0(orte_hash_table_set_proc+0xfa)
[0xf7ec57aa]
[odin011:04084] [ 3]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_lookup+0x11d)
[0xf7cbcebd]
[odin011:04084] [ 4]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_send_nb+0x1f)
[0xf7cbfccf]
[odin011:04084] [ 5]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_rml_oob.so(orte_rml_oob_send_buffer_nb+0x25a)
[0xf7cddfda]
[odin011:04084] [ 6]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_btl_openib.so
[0xf7c145f1]
[odin011:04084] [ 7]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_btl_openib.so
[0xf7c146e9]
[odin011:04084] [ 8]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_endpoint_send+0x345)
[0xf7c0e155]
[odin011:04084] [ 9]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_send+0x3e)
[0xf7c0718e]
[odin011:04084] [10]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x17b)
[0xf7c3c4bb]
[odin011:04084] [11]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x27c)
[0xf7c35adc]
[odin011:04084] [12]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_gather_intra_basic_linear+0x65)
[0xf7bc72a5]
[odin011:04084] [13]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_gather_intra_dec_fixed+0x16a)
[0xf7bba2aa]
[odin011:04084] [14]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/libmpi.so.0(MPI_Gather+0x18c)
[0xf7f62b6c]
[odin011:04084] [15] src/MPI_Gather_c(main+0x5fd) [0x804a101]
[odin011:04084] [16] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0xf7d0fde3]
[odin011:04084] [17] src/MPI_Gather_c [0x8049a81]
[odin011:04084] *** End of error message ***
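
When many such traces need to be compared, it can help to reduce each one
to its named frames. Below is a minimal Python sketch that does this for
text in the format shown above. The regular expression and the function
name `symbol_frames` are assumptions for illustration, not part of Open
MPI or MTT, and frames without a symbol name are skipped.

```python
import re

# Matches "path/to/lib(symbol+0xoffset)" as printed in the backtrace.
FRAME_RE = re.compile(
    r"(?P<lib>\S+?)\((?P<symbol>\w+)\+0x(?P<offset>[0-9a-fA-F]+)\)"
)


def symbol_frames(trace_text):
    """Return (library basename, symbol, offset) for each named frame."""
    frames = []
    for match in FRAME_RE.finditer(trace_text):
        lib = match.group("lib").rsplit("/", 1)[-1]
        offset = int(match.group("offset"), 16)
        frames.append((lib, match.group("symbol"), offset))
    return frames


# Two frames copied from the trace above.
sample = """\
[odin011:04084] [ 1]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/libopen-pal.so.0(malloc+0x82)
[0xf7e882d2]
[odin011:04084] [ 2]
/san/homedirs/mpiteam/mtt-runs/odin/20070919-Nightly/pb_4/installs/eiso/install/lib/libopen-rte.so.0(orte_hash_table_set_proc+0xfa)
[0xf7ec57aa]
"""

frames = symbol_frames(sample)
for lib, symbol, offset in frames:
    print(f"{lib}: {symbol}+{offset:#x}")
```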


Re: [OMPI devel] FreeBSD Support?

2007-09-20 Thread Ralf Wildenhues
Hello Karol,

* Karol Mroz wrote on Wed, Sep 19, 2007 at 07:23:50PM CEST:
> When running the autogen.sh script as non-root, I see the following error:
[...]
>   autom4te-2.61: cannot open configure: Permission denied
[...]
> After some searching, it would appear that this is an autoconf issue
> that crops up in FreeBSD but, for whatever reason, not in Linux. A quick
> workaround is to add: `chmod -vr u+w *` just before autogen issues the
> `run_and_check $ompi_autoconf` command on line 438.

I can look into this if needed.  Is it sufficient to make
opal/libltdl/configure writable, or its containing directory?

As a workaround you should be able to use nightly tarballs
instead of the SVN version.

Cheers,
Ralf