Re: [OMPI devel] SM init failures

2009-03-27 Thread Eugene Loh
Paul H. Hargrove wrote: Quoting from a different manpage for ftruncate: [T]he POSIX standard allows two behaviours for ftruncate when length exceeds the file length [...]: either returning an error, or extending the file. So, if that is to be trusted, it is not legal by PO

Re: [OMPI devel] MCA component dependency

2009-03-27 Thread Jeff Squyres
On Mar 25, 2009, at 6:09 PM, Aurélien Bouteiller wrote: I'm trying to state that a particular component depends on another that should therefore be dlopened automatically when it is loaded. I found some code doing exactly that in mca_base_component_find:open_component, but can't find any example

Re: [OMPI devel] Bug in MPI_Request_get_status (1.3.1) [PATCH]

2009-03-27 Thread Jeff Squyres
FWIW, MPI_TEST* and MPI_WAIT* all check for MPI_STATUS[ES]_IGNORE at the lower layers. I believe that the correct fix for MPI_REQUEST_GET_STATUS should be the following, because checks for MPI_STATUS_IGNORE are performed later in the function: Index: ompi/mpi/c/request_get_status.c ==

Re: [OMPI devel] SM init failures

2009-03-27 Thread Paul H. Hargrove
Quoting from a different manpage for ftruncate: [T]he POSIX standard allows two behaviours for ftruncate when length exceeds the file length [...]: either returning an error, or extending the file. So, if that is to be trusted, it is not legal by POSIX to *silently* not extend
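
A caller that does not want to rely on either of the two permitted behaviours can check both the return value and the resulting size. The following is a minimal sketch of that check, not Open MPI code: the helper names `extend_and_verify` and `demo_extend` and the scratch path under /tmp are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from Open MPI): grow the file behind fd to
 * `want` bytes, then confirm with fstat that the size really changed.
 * Returns 0 on success, -1 on any failure. */
static int extend_and_verify(int fd, off_t want)
{
    struct stat st;

    if (ftruncate(fd, want) != 0) {
        perror("ftruncate");          /* the "return an error" behaviour */
        return -1;
    }
    if (fstat(fd, &st) != 0) {
        perror("fstat");
        return -1;
    }
    if (st.st_size != want) {         /* guards against a silent no-op */
        fprintf(stderr, "file is %lld bytes, expected %lld\n",
                (long long)st.st_size, (long long)want);
        return -1;
    }
    return 0;
}

/* Create a scratch file, try to extend it, and clean up again.
 * The /tmp location is an assumption for this demo. */
static int demo_extend(off_t want)
{
    char path[] = "/tmp/sm_demo_XXXXXX";
    int fd = mkstemp(path);
    int rc;

    if (fd < 0) {
        perror("mkstemp");
        return -1;
    }
    rc = extend_and_verify(fd, want);
    close(fd);
    unlink(path);
    return rc;
}
```

Calling `demo_extend(4096)` returns 0 on a filesystem that extends the file and -1 where ftruncate reports an error or leaves the size unchanged.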

Re: [OMPI devel] SM init failures

2009-03-27 Thread George Bosilca
Talking with Aurelien here @ UT, we think we came up with a possible way to get such an error. Before explaining this, let me lay out the basics. There are 2 critical functions used in setting up the shared memory file: one is ftruncate, the other is mmap. Here are two snippets from these function
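
For readers unfamiliar with how the two calls fit together, here is a rough sketch of the usual sequence: size the backing file with ftruncate, then map it with mmap. This is not the Open MPI sm code; the function name `map_backing_file` and the /tmp scratch path are assumptions. Note that touching mapped pages that lie beyond the real end of the file raises SIGBUS, which is why the size set by ftruncate matters.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from Open MPI): create a scratch file, size it
 * to `len` bytes (len must be > 0), map it shared, and write to the whole
 * range. Returns 0 if the mapping was usable, -1 otherwise. */
static int map_backing_file(size_t len)
{
    char path[] = "/tmp/sm_map_XXXXXX";
    unsigned char *seg;
    int ok;
    int fd = mkstemp(path);

    if (fd < 0) {
        perror("mkstemp");
        return -1;
    }
    unlink(path);   /* the file persists until fd and mapping are released */

    /* Step 1: give the file its final size before mapping it. */
    if (ftruncate(fd, (off_t)len) != 0) {
        perror("ftruncate");
        close(fd);
        return -1;
    }

    /* Step 2: map the file so that writes are shared through it. */
    seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return -1;
    }
    close(fd);      /* the mapping stays valid after the descriptor closes */

    /* Touch every byte; this would fault if the file were shorter than len. */
    memset(seg, 0xAB, len);
    ok = (seg[0] == 0xAB && seg[len - 1] == 0xAB);

    munmap(seg, len);
    return ok ? 0 : -1;
}
```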

[OMPI devel] Fwd: [Open MPI Announce] Critical bug notice

2009-03-27 Thread Brad Benton
In reference to this critical bug, there are implications for the current 1.3.x release schedule that are alluded to in Jeff's message. In particular, there are two time-critical issues at play: 1) getting a fix for #1853 in time for inclusion for OFED-1.4.1 2) getting in Sun's changes/CMRs in

Re: [OMPI devel] Bug in MPI_Request_get_status (1.3.1) [PATCH]

2009-03-27 Thread George Bosilca
Shaun, Not in Open MPI :) But there is a section in the MPI Standard that talks about MPI_STATUS_IGNORE and lists the functions that can deal with it. george. On Mar 27, 2009, at 15:15 , Shaun Jackman wrote: Hi George, You will need to update MPI_Test and MPI_Wait as well, w

Re: [OMPI devel] Bug in MPI_Request_get_status (1.3.1) [PATCH]

2009-03-27 Thread Shaun Jackman
Hi George, You will need to update MPI_Test and MPI_Wait as well, which do not check that status != NULL. Is there an index of MPI functions by their parameter type, such as the set of functions that take an MPI_Status argument? Cheers, Shaun George Bosilca wrote: Shaun, Thanks for the bu

[OMPI devel] Critical bug notice

2009-03-27 Thread Jeff Squyres
The Open MPI team has uncovered a serious bug in Open MPI v1.3.0 and v1.3.1: when running on OpenFabrics-based networks, silent data corruption is possible in some cases. There are two workarounds to avoid the issue -- please see the bug ticket that has been opened about this issue for fur

Re: [OMPI devel] Bug in MPI_Request_get_status (1.3.1) [PATCH]

2009-03-27 Thread George Bosilca
Shaun, Thanks for the bug report. In general we like to check the arguments against NULL, in order to make sure we don't segfault. However, in this particular context we check against NULL, but NULL is our MPI_STATUS_IGNORE. I think I would prefer a slightly safer solution where we t
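
The conflict described here can be illustrated without MPI at all. The sketch below uses stand-in types and names (`demo_status_t`, `demo_request_t`, `DEMO_STATUS_IGNORE`, `demo_request_get_status`), all hypothetical and unrelated to the actual patch. It assumes only what the message states, namely that the "ignore" sentinel is NULL. The idea is to validate the arguments that must never be NULL, and to treat a NULL status as "do not write" rather than as an error.

```c
#include <stddef.h>

/* Stand-in types for illustration only; these are NOT the MPI definitions. */
typedef struct {
    int source;
    int tag;
    int error;
} demo_status_t;

typedef struct {
    int complete;            /* nonzero once the operation has finished */
    demo_status_t status;    /* result recorded at completion */
} demo_request_t;

/* Assumption taken from the thread: the "ignore" sentinel is NULL. */
#define DEMO_STATUS_IGNORE ((demo_status_t *)NULL)
#define DEMO_SUCCESS 0
#define DEMO_ERR_ARG 1

/* Report whether req has completed. The status pointer is optional:
 * passing DEMO_STATUS_IGNORE is legal and simply skips the copy. */
static int demo_request_get_status(const demo_request_t *req, int *flag,
                                   demo_status_t *status)
{
    /* Only arguments that must be valid are rejected when NULL. */
    if (req == NULL || flag == NULL) {
        return DEMO_ERR_ARG;
    }

    *flag = req->complete;

    /* The sentinel check happens at the point of use, not up front. */
    if (req->complete && status != DEMO_STATUS_IGNORE) {
        *status = req->status;
    }
    return DEMO_SUCCESS;
}
```

With this layout, a caller that passes `DEMO_STATUS_IGNORE` still gets the completion flag, while a NULL `flag` pointer is still reported as an argument error.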

[OMPI devel] Bug in MPI_Request_get_status (1.3.1) [PATCH]

2009-03-27 Thread Shaun Jackman
MPI_Request_get_status fails if the status parameter is passed MPI_STATUS_IGNORE. A patch is attached. Cheers, Shaun 2009-03-26 Shaun Jackman * ompi/mpi/c/request_get_status.c (MPI_Request_get_status): Do not fail if the status argument is NULL, because the application may pas

Re: [OMPI devel] SM init failures

2009-03-27 Thread Tim Mattox
Eugene, I think I remember setting up the MTT tests on Sif so that tests are run both with and without the coll_hierarch component selected. The coll_hierarch component stresses code paths and potential race conditions in its own way. So, if the problems are showing up more frequently for the test

Re: [OMPI devel] SM init failures

2009-03-27 Thread Eugene Loh
Josh Hursey wrote: Sif is also running the coll_hierarch component on some of those tests which has caused some additional problems. I don't know if that is related or not. Indeed. Many of the MTT stack traces (for both 1.3.1 and 1.3.2 and that have seg faults and call out mca_btl_sm.so)

[OMPI devel] cisco mtt failures

2009-03-27 Thread Jeff Squyres
Ignore the 17k+ failures from Cisco last night... I had a bunch of half-complete changes on my cluster last night and forgot to disable MTT overnight. -- Jeff Squyres Cisco Systems

Re: [OMPI devel] SM init failures

2009-03-27 Thread Jeff Squyres
FWIW, when I was looking into this before, the problem was definitely during MPI_INIT. I ran out of time before being able to track it down further, but it was definitely something during the sm startup -- during add_procs, IIRC. It *looked* like there was some kind of bogus value in the b

Re: [OMPI devel] SM init failures

2009-03-27 Thread Ralph Castain
Hmmm...Eugene, you need to be a tad less sensitive. Nobody was attempting to indict you or in any way attack you or your code. What I was attempting to point out is that there are a number of sm failures during sm init. I didn't single you out. I posted it to the community because (a) it is

Re: [OMPI devel] SM init failures

2009-03-27 Thread Josh Hursey
On Mar 26, 2009, at 6:41 PM, Ralph Castain wrote: I suspect Josh or someone at IU could tell you the compiler. I would be very surprised if it wasn't gcc, but I don't know what version. All the MTT runs on Sif are using gcc 4.1.2: -bash-3.2$ gcc --version gcc (GCC) 4.1.2 20080704 (Red Hat

Re: [OMPI devel] SM init failures

2009-03-27 Thread Eugene Loh
Ralph Castain wrote: You are correct - the Sun errors are in a version prior to the insertion of the SM changes. We didn't relabel the version to 1.3.2 until -after- those changes went in, so you have to look for anything with an r number >= 20839. The sif errors are all in that group - I