Re: [OMPI devel] race condition in oob/tcp

2014-09-26 Thread Gilles Gouaillardet
dition vis 1.8 - I agree it > is not a blocker for that release. > > Ralph > > On Sep 22, 2014, at 4:49 PM, Gilles Gouaillardet > <gilles.gouaillar...@gmail.com> wrote: > >> Ralph, >> >> here is the patch i am using so far. >> i will res

Re: [OMPI devel] Conversion to GitHub: POSTPONED

2014-09-24 Thread Gilles Gouaillardet
my 0.02 US$ ... Bitbucket pricing model is per user (but with free public/private repository up to 5 users) whereas github pricing is per *private* repository (and free public repository and with unlimited users) from an OpenMPI point of view, this means : - with github, only the private

Re: [OMPI devel] RFC: "v1.9.0" (vs. "v1.9")

2014-09-22 Thread Gilles Gouaillardet
Folks, if i read between the lines, it looks like the next stable branch will be v2.0 and not v1.10 is there a strong reason for that (such as ABI compatibility will break, or a major but internal refactoring) ? /* other than v1.10 is less than v1.8 when comparing strings :-) */ Cheers, Gilles
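To illustrate the string-comparison remark above, here is a minimal sketch in C. It assumes versions are written as an optional leading "v" followed by dot-separated numbers; version_cmp is a hypothetical helper written for this note, not an Open MPI function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Compare two version strings such as "v1.10" and "v1.8" field by field
 * as numbers. Returns <0, 0 or >0 like strcmp().
 * Hypothetical helper for illustration, not part of Open MPI. */
static int version_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    /* skip an optional leading 'v' */
    if (*a == 'v') a++;
    if (*b == 'v') b++;

    while (*a != '\0' || *b != '\0') {
        char *end_a, *end_b;
        long fa = strtol(a, &end_a, 10);  /* 0 when no digits are left */
        long fb = strtol(b, &end_b, 10);

        if (end_a == a && end_b == b) {
            /* neither side starts with a number: plain string compare */
            return strcmp(a, b);
        }
        if (fa != fb) {
            return (fa < fb) ? -1 : 1;
        }
        /* step over the '.' separator, if any */
        a = (*end_a == '.') ? end_a + 1 : end_a;
        b = (*end_b == '.') ? end_b + 1 : end_b;
    }
    return 0;
}
```

With this sketch, strcmp("v1.10", "v1.8") is negative (the character '1' sorts before '8'), whereas version_cmp("v1.10", "v1.8") is positive.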

Re: [OMPI devel] race condition in oob/tcp

2014-09-22 Thread Gilles Gouaillardet
o ahead, and thanks > > On Sep 20, 2014, at 10:26 PM, Gilles Gouaillardet < > gilles.gouaillar...@gmail.com> wrote: > > Thanks for the pointer George ! > > On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca <bosi...@icl.utk.edu> > wrote: > >> Or copy the han

Re: [OMPI devel] race condition in oob/tcp

2014-09-21 Thread Gilles Gouaillardet
Thanks for the pointer George ! On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca wrote: > Or copy the handshake protocol design of the TCP BTL... > > the main difference between oob/tcp and btl/tcp is the way we resolve the situation in which two processes send their first

Re: [OMPI devel] race condition in oob/tcp

2014-09-19 Thread Gilles Gouaillardet
moved from MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED, retry() should have been invoked ? Cheers, Gilles On 2014/09/18 17:02, Ralph Castain wrote: > The patch looks fine to me - please go ahead and apply it. Thanks! > > On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet > <gilles.goua

[OMPI devel] v1.8 does not compile any more

2014-09-19 Thread Gilles Gouaillardet
Folks, r32716 broke v1.8 :-( the root cause is it included MCA_BASE_VAR_TYPE_VERSION_STRING which has not yet landed into v1.8 the attached trivial patch fixes this issue Can the RM/GK please review it and apply it ? Cheers, Gilles Index: opal/mca/base/mca_base_var.c

[OMPI devel] RFC: remove the --with-threads configure option

2014-09-18 Thread Gilles Gouaillardet
Folks, for both trunk and v1.8 branch, configure takes the --with-threads option. valid usages are --with-threads, --with-threads=yes, --with-threads=posix and --with-threads=no /* v1.6 used to support the --with-threads=solaris */ if we try to configure with --with-threads=no, this will result

Re: [OMPI devel] race condition in oob/tcp

2014-09-18 Thread Gilles Gouaillardet
t triggers it so I > can continue debugging > > Ralph > > On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> wrote: > >> Thanks Ralph, >> >> this is much better but there is still a bug : >> with the very same scen

Re: [OMPI devel] race condition in oob/tcp

2014-09-17 Thread Gilles Gouaillardet
then have the higher vpid retry while the lower one waits. > The logic for that was still in place, but it looks like you are hitting a > different code path, and I found another potential one as well. So I think I > plugged the holes, but will wait to hear if you confirm. >
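The tie-break rule quoted above (the higher vpid retries, the lower one waits) can be sketched as follows. This is only an illustration of the rule as stated in the snippet; the enum and function names are hypothetical and are not the oob/tcp symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; these are not the oob/tcp symbols. */
typedef enum {
    SKETCH_WAIT_FOR_PEER,   /* this side waits */
    SKETCH_RETRY_CONNECT    /* this side retries its connection */
} sketch_action_t;

/* Decide what to do when two peers tried to connect to each other
 * at the same time: the higher vpid retries, the lower one waits. */
static sketch_action_t sketch_resolve_collision(uint32_t my_vpid,
                                                uint32_t peer_vpid)
{
    assert(my_vpid != peer_vpid);  /* a process does not race with itself */
    return (my_vpid > peer_vpid) ? SKETCH_RETRY_CONNECT
                                 : SKETCH_WAIT_FOR_PEER;
}
```

Since both sides evaluate the same rule with the arguments swapped, exactly one of them retries.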

[OMPI devel] race condition in oob/tcp

2014-09-16 Thread Gilles Gouaillardet
Ralph, here is the full description of a race condition in oob/tcp i very briefly mentioned in a previous post : the race condition can occur when two not connected orted try to send a message to each other for the first time and at the same time. that can occur when running mpi helloworld on

Re: [OMPI devel] coll ml error with some nonblocking collectives

2014-09-15 Thread Gilles Gouaillardet
Howard, and Rolf, i initially reported the issue at http://www.open-mpi.org/community/lists/devel/2014/09/15767.php r32659 is not a fix nor a regression, it simply aborts instead of OBJ_RELEASE(mpi_comm_world). /* my point here is we should focus on the root cause and not the consequence */

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Gilles Gouaillardet
and 3 enter the allgather at the send time, they will send a message to each other at the same time and rml fails establishing the connection. i could not find whether this is linked to my changes... Cheers, Gilles > > On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet < > gilles.gouai

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
, 2014, at 4:02 AM, Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> wrote: > >> Ralph, >> >> the root cause is when the second orted/mpirun runs rcd_finalize_coll, >> it does not invoke pmix_server_release >> because allgather_stub was not previou

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
ely not the right fix, it was very lightly tested, but so far, it works for me ... Cheers, Gilles On 2014/09/11 16:11, Gilles Gouaillardet wrote: > Ralph, > > things got worse indeed :-( > > now a simple hello world involving two hosts hangs in mpi_init. > there is still a race conditio

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-11 Thread Gilles Gouaillardet
and for each of them to establish a >> persistent receive. They then can use the signature to tell which collective >> the incoming message belongs to. >> >> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot. >> >> >> On Sep 9, 2014

Re: [OMPI devel] Need to know your Github ID

2014-09-11 Thread Gilles Gouaillardet
ggouaillardet -> ggouaillardet On 2014/09/10 19:46, Jeff Squyres (jsquyres) wrote: > As the next step of the planned migration to Github, I need to know: > > - Your Github ID (so that you can be added to the new OMPI git repo) > - Your SVN ID (so that I can map SVN->Github IDs, and therefore map

[OMPI devel] race condition in grpcomm/rcd

2014-09-09 Thread Gilles Gouaillardet
Folks, Since r32672 (trunk), grpcomm/rcd is the default module. the attached spawn.c test program is a trimmed version of the spawn_with_env_vars.c test case from the ibm test suite. when invoked on two nodes : - the program hangs with -np 2 - the program can crash with np > 2 error message is

Re: [OMPI devel] about r32685

2014-09-08 Thread Gilles Gouaillardet
l I can say is that > "tolower" on my CentOS box is defined in , and that has to be > included in the misc.h header. > > > On Sep 8, 2014, at 5:49 PM, Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> wrote: > >> Ralph and Brice, >> >>

[OMPI devel] about r32685

2014-09-08 Thread Gilles Gouaillardet
Ralph and Brice, i noted Ralph committed r32685 in order to fix a problem with Intel compilers. The very similar issue occurs with clang 3.2 (gcc and clang 3.4 are ok for me) imho, the root cause is in the hwloc configure. in this case, configure fails to detect strncasecmp is part of the C

[OMPI devel] f08 bindings and weak symbols

2014-09-05 Thread Gilles Gouaillardet
Folks, when OpenMPI is configured with --disable-weak-symbols and a fortran 2008 capable compiler (e.g. gcc 4.9), MPI_STATUSES_IGNORE invoked from Fortran is not correctly interpreted as it should. /* instead of being a special array of statuses, it is an array of one status, which can lead to

Re: [OMPI devel] OMPI devel] race condition in coll/ml

2014-09-01 Thread Gilles Gouaillardet
ion of the coll/ml locality requirement. > >Did this patch "fix" the problem by avoiding the segfault due to coll/ml >disqualifying itself? Or did it make everything work okay again? > > >On Sep 1, 2014, at 3:16 AM, Gilles Gouaillardet ><gilles.gouaillar...@iferc.org>

[OMPI devel] race condition in coll/ml

2014-09-01 Thread Gilles Gouaillardet
Folks, mtt recently failed a bunch of times with the trunk. a good suspect is the collective/ibarrier test from the ibm test suite. most of the time, CHECK_AND_RECYCLE will fail /* IS_COLL_SYNCMEM(coll_op) is true */ with this test case, we just get a glory SIGSEGV since OBJ_RELEASE is called

Re: [OMPI devel] about the test_shmem_zero_get.x test from the openshmem test suite

2014-09-01 Thread Gilles Gouaillardet
, Jeff Squyres (jsquyres) wrote: > Gilles -- > > Did you get a reply about this? > > > On Aug 26, 2014, at 3:17 AM, Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> wrote: > >> Folks, >> >> the test_shmem_zero_get.x from the openshmem-release-

Re: [OMPI devel] oshmem-openmpi-1.8.2 causes compile error with -i8(64bit fortarn integer) configuration

2014-09-01 Thread Gilles Gouaillardet
Mishima-san, the root cause is macro expansion does not always occur as one would have expected ... could you please give a try to the attached patch ? it compiles (at least with gcc) and i made zero tests so far Cheers, Gilles On 2014/09/01 10:44, tmish...@jcity.maeda.co.jp wrote: > Hi

Re: [OMPI devel] segfault in openib component on trunk

2014-08-29 Thread Gilles Gouaillardet
original problem that was trying to be > addressed. > > > On Aug 28, 2014, at 10:01 PM, Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> wrote: > >> Howard and Edgar, >> >> i fixed a few bugs (r32639 and r32642) >> >> the bug is trivial

[OMPI devel] mpirun hangs when a task exits with a non zero code

2014-08-29 Thread Gilles Gouaillardet
Ralph and all, The following trivial test hangs /* it hangs at least 99% of the time in my environment, 1% is a race condition and the program behaves as expected */ mpirun -np 1 --mca btl self /bin/false the same behaviour happens with the following trivial but MPI program : #include <mpi.h> int main

Re: [OMPI devel] segfault in openib component on trunk

2014-08-29 Thread Gilles Gouaillardet
Howard and Edgar, i fixed a few bugs (r32639 and r32642) the bug is trivial to reproduce with any mpi hello world program mpirun -np 2 --mca btl openib,self hello_world after setting the mca param in the $HOME/.openmpi/mca-params.conf $ cat ~/.openmpi/mca-params.conf btl_openib_receive_queues

Re: [OMPI devel] intercomm_create from the ibm test suite hangs

2014-08-28 Thread Gilles Gouaillardet
Thanks Ralph ! Cheers, Gilles On 2014/08/28 4:52, Ralph Castain wrote: > Took me awhile to track this down, but it is now fixed - combination of > several minor errors > > Thanks > Ralph > > On Aug 27, 2014, at 4:07 AM, Gilles Gouaillardet > <gilles.gouaillar...@i

[OMPI devel] intercomm_create from the ibm test suite hangs

2014-08-27 Thread Gilles Gouaillardet
Folks, the intercomm_create test case from the ibm test suite can hang under some configuration. basically, it will spawn n tasks in a first communicator, and then n tasks in a second communicator. when i run from node0 : mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2

[OMPI devel] coll/ml without hwloc (?)

2014-08-26 Thread Gilles Gouaillardet
Folks, i just committed r32604 in order to fix compilation (pmix) when ompi is configured with --without-hwloc now, even a trivial hello world program issues the following output (which is a non fatal, and could even be reported as a warning) :

[OMPI devel] about the test_shmem_zero_get.x test from the openshmem test suite

2014-08-26 Thread Gilles Gouaillardet
Folks, the test_shmem_zero_get.x from the openshmem-release-1.0d test suite is currently failing. i looked at the test itself, and compared it to test_shmem_zero_put.x (that is a success) and i am very puzzled ... the test calls several flavors of shmem_*_get where : - the destination is in the

Re: [OMPI devel] OMPI devel] pmix: race condition in dynamic/intercomm_create from the ibm test suite

2014-08-25 Thread Gilles Gouaillardet
>look at that signature to ensure we aren't getting it confused. > >On Aug 25, 2014, at 1:59 AM, Gilles Gouaillardet ><gilles.gouaillar...@iferc.org> wrote: > >> Folks, >> >> when i run >> mpirun -np 1 ./intercomm_create >> from the ibm test

[OMPI devel] pmix: race condition in dynamic/intercomm_create from the ibm test suite

2014-08-25 Thread Gilles Gouaillardet
Folks, when i run mpirun -np 1 ./intercomm_create from the ibm test suite, it either : - success - hangs - mpirun crashes (SIGSEGV) soon after writing the following message ORTE_ERROR_LOG: Not found in file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server.c at line 566 here is what happens :

Re: [OMPI devel] OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-25 Thread Gilles Gouaillardet
ass for me > > > On Aug 22, 2014, at 9:12 AM, Ralph Castain <r...@open-mpi.org> wrote: > >> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet >> <gilles.gouaillar...@gmail.com> wrote: >> >>> Ralph, >>> >>> Will do on Monday >

Re: [OMPI devel] OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-22 Thread Gilles Gouaillardet
; whereas your mpi_no_op.c return 0; Cheers, Gilles Ralph Castain <r...@open-mpi.org> wrote: >You might want to try again with current head of trunk as something seems off >in what you are seeing - more below > > > >On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet ><

Re: [OMPI devel] [OMPI svn] svn:open-mpi r32555 - trunk/opal/mca/btl/scif

2014-08-21 Thread Gilles Gouaillardet
he struct in order to preserve the old behaviour. > > Ashley. > > On 21 Aug 2014, at 04:31, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> > wrote: > >> Paul, >> >> the piece of code that causes an issue with PGI 2013 and older is just a bit >> mor

Re: [OMPI devel] [OMPI svn] svn:open-mpi r32555 - trunk/opal/mca/btl/scif

2014-08-21 Thread Gilles Gouaillardet
> -Nathan >> >> On Tue, Aug 19, 2014 at 10:48:48PM -0400, svn-commit-mai...@open-mpi.org >> wrote: >>> Author: ggouaillardet (Gilles Gouaillardet) >>> Date: 2014-08-19 22:48:47 EDT (Tue, 19 Aug 2014) >>> New Revision: 32555 >>> URL: ht

[OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-20 Thread Gilles Gouaillardet
Folks, let's look at the following trivial test program : #include <stdio.h> #include <mpi.h> int main (int argc, char * argv[]) { int rank, size; MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank(MPI_COMM_WORLD, &rank); printf ("I am %d/%d and i abort\n", rank, size);

Re: [OMPI devel] OMPI devel] [1.8.2rc4] OSHMEM fortran bindings with bad compilers

2014-08-19 Thread Gilles Gouaillardet
r32551 now detects this limitation and automatically disables oshmem profile. I am now revamping the patch for v1.8 Gilles Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote: >In the case of PGI compilers prior to 13, a workaround is to configure with >--disable-oshmem-profil

Re: [OMPI devel] [1.8.2rc4] OSHMEM fortran bindings with bad compilers

2014-08-18 Thread Gilles Gouaillardet
In the case of PGI compilers prior to 13, a workaround is to configure with --disable-oshmem-profile On 2014/08/18 16:21, Gilles Gouaillardet wrote: > Josh, Paul, > > the problem with old PGI compilers comes from the preprocessor (!) > > with pgi 12.10 : > oshmem/shmem/for

Re: [OMPI devel] [1.8.2rc4] OSHMEM fortran bindings with bad compilers

2014-08-18 Thread Gilles Gouaillardet
Josh, Paul, the problem with old PGI compilers comes from the preprocessor (!) with pgi 12.10 : oshmem/shmem/fortran/start_pes_f.c SHMEM_GENERATE_WEAK_BINDINGS(START_PES, start_pes) gets expanded as #pragma weak START_PES = PSTART_PES SHMEM_GENERATE_WEAK_PRAGMA ( weak start_pes_ = pstart_pes_

Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65

2014-08-13 Thread Gilles Gouaillardet
Lenny, that looks related to #4857 which has been fixed in trunk since r32517 could you please update your openmpi library and try again ? Gilles On 2014/08/13 17:00, Lenny Verkhovsky wrote: > Following Jeff's suggestion adding devel mailing list. > > Hi All, > I am currently facing strange

Re: [OMPI devel] Grammar error in git master: 'You job will now abort'

2014-08-13 Thread Gilles Gouaillardet
Thanks Christopher, this has been fixed in the trunk with r32520 Cheers, Gilles On 2014/08/13 14:49, Christopher Samuel wrote: > Hi all, > > We spotted this in 1.6.5 and git grep shows it's fixed in the > v1.8 branch but in master it's still there: > >

[OMPI devel] trunk hang when nodes have similar but private network

2014-08-13 Thread Gilles Gouaillardet
Folks, i noticed mpirun (trunk) hangs when running any mpi program on two nodes *and* each node has a private network with the same ip (in my case, each node has a private network to a MIC) in order to reproduce the problem, you can simply run (as root) on the two compute nodes brctl addbr br0

Re: [OMPI devel] errors and warnings with show_help() usage

2014-08-11 Thread Gilles Gouaillardet
Jeff and all, i fixed the trivial errors in the trunk, there are now 11 non trivial errors. (commits r32490 to r32497) i ran the script vs the v1.8 branch and found 54 errors (first, you need to touch Makefile.ompi-rules in the top-level Open MPI directory in order to make the script happy)

Re: [OMPI devel] ibm abort test hangs on one node

2014-08-11 Thread Gilles Gouaillardet
r_finalize in the first place (which is sufficient but might not be necessary ...) Cheers, Gilles On 2014/08/09 1:27, Ralph Castain wrote: > Committed a fix for this in r32460 - see if I got it! > > On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> w

[OMPI devel] ibm abort test hangs on one node

2014-08-08 Thread Gilles Gouaillardet
Folks, here is the description of a hang i briefly mentioned a few days ago. with the trunk (i did not check 1.8 ...) simply run on one node : mpirun -np 2 --mca btl sm,self ./abort (the abort test is taken from the ibm test suite : process 0 calls MPI_Abort while process 1 enters an infinite

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
as an ORTE_NAME the issue will go away. > > George. > > > > On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet < > gilles.gouaillar...@iferc.org> wrote: > >> Kawashima-san and all, >> >> Here is attached a one off patch for v1.8. >> /* it does

Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc2 on Solaris 10 Sparc

2014-08-08 Thread Gilles Gouaillardet
Kawashima-san, This is interesting :-) proc is in the stack and has type orte_process_name_t with typedef uint32_t orte_jobid_t; typedef uint32_t orte_vpid_t; struct orte_process_name_t { orte_jobid_t jobid; /**< Job number */ orte_vpid_t vpid; /**< Process id - equivalent to

Re: [OMPI devel] v1.8.2 still held up...

2014-08-08 Thread Gilles Gouaillardet
Ralph and all, > * static linking failure - Gilles has posted a proposed fix, but somebody > needs to approve and CMR it. Please see: > https://svn.open-mpi.org/trac/ompi/ticket/4834 Jeff made a better fix (r32447) to which i added a minor correction (r32448). as far as i am concerned,

Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-08-07 Thread Gilles Gouaillardet
to the POSIX prototype (aka. returning the changes value >> instead of doing things inplace). >> >> George. >> >> >> >> On Wed, Aug 6, 2014 at 7:02 AM, Gilles Gouaillardet >> <gilles.gouaillar...@iferc.org> wrote: >> Ralph and George,

Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-08-06 Thread Gilles Gouaillardet
Ralph and George, here is attached a patch that fixes the heterogeneous support without the abstraction violation. Cheers, Gilles On 2014/08/06 9:40, Gilles Gouaillardet wrote: > hummm > > i intentionally did not swap the two 32 bits (!) > > from the top level, what we have i

Re: [OMPI devel] [1.8.2rc3] static linking fails on linux when not building ROMIO

2014-08-06 Thread Gilles Gouaillardet
testing. > Since I've determined already that 1.6.5 did not have the problem while > 1.7.x does, the possibility exists that some smaller change might exist to > restore what ever was lost between the v1.6 and v1.7 branches. > > -Paul > > > On Tue, Aug 5, 2014 at 1:

Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-08-05 Thread Gilles Gouaillardet
y speaking, converting a 64 bits to a big endian >>> representation requires the swap of the 2 32 bits parts. So the correct >>> approach would have been: >>> uint64_t htonll(uint64_t v) >>> { >>> return ((uint64_t)ntohl(n)) << 32 | (uint64_t)ntoh
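For reference, the swap of the two 32-bit halves described in the quoted text can be sketched as below. This is not the patch discussed in the thread; sketch_htonll and sketch_store_be64 are hypothetical names, and the code assumes a POSIX system that provides htonl() in <arpa/inet.h>.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert a 64-bit value from host to network (big endian) byte order.
 * Hypothetical helper for illustration, not the Open MPI implementation. */
static uint64_t sketch_htonll(uint64_t v)
{
    if (htonl(1) == 1) {
        return v;  /* big endian host: nothing to do */
    }
    /* little endian host: byte-swap each half and exchange the halves */
    uint32_t lo = (uint32_t)(v & 0xffffffffu);
    uint32_t hi = (uint32_t)(v >> 32);
    return ((uint64_t)htonl(lo) << 32) | (uint64_t)htonl(hi);
}

/* Store v into out[0..7] in big endian order. */
static void sketch_store_be64(unsigned char *out, uint64_t v)
{
    assert(out != NULL);
    uint64_t n = sketch_htonll(v);
    memcpy(out, &n, sizeof(n));
}
```

After sketch_store_be64(buf, 0x0102030405060708ULL), buf holds the bytes 01 02 ... 08 in that order on both little and big endian hosts.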

Re: [OMPI devel] [1.8.2rc3] static linking fails on linux when not building ROMIO

2014-08-05 Thread Gilles Gouaillardet
Here is a patch that has been minimally tested. this is likely an overkill (at least when dynamic libraries can be used), but it does the job so far ... Cheers, Gilles On 2014/08/05 16:56, Gilles Gouaillardet wrote: > from libopen-pal.la : > dependency_libs=' -lrdmacm -libverbs -lscif

Re: [OMPI devel] [1.8.2rc3] static linking fails on linux when not building ROMIO

2014-08-05 Thread Gilles Gouaillardet
from libopen-pal.la : dependency_libs=' -lrdmacm -libverbs -lscif -lnuma -ldl -lrt -lnsl -lutil -lm' i confirm mpicc fails linking but FWIW, using libtool does work (!) could the bug come from the mpicc (and other) wrappers ? Gilles $ gcc -g -O0 -o hw /csc/home1/gouaillardet/hw.c

Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-08-05 Thread Gilles Gouaillardet
eing empty (bswap_64 or something). > > George. > > On Aug 1, 2014, at 06:52 , Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> wrote: > >> George and Ralph, >> >> i am very confused whether there is an issue or not. >> >> >> anyway, t

Re: [OMPI devel] oshmem enabled by default

2014-08-04 Thread Gilles Gouaillardet
Paul, this is a bit trickier ... on a Linux platform oshmem is built by default, on a non Linux platform, oshmem is *not* built by default. so the configure message (disabled by default) is correct on non Linux platform, and incorrect on Linux platform ... i do not know what should be done,

Re: [OMPI devel] 1.8.2rc3 now out

2014-08-04 Thread Gilles Gouaillardet
Fixed in r32409 : %d and %s were swapped in a MLERROR (printf like) Gilles On 2014/08/02 11:07, Gilles Gouaillardet wrote: > Paul, > > about the second point : > mmap is called with the MAP_FIXED flag, before the fix, the > required address was not aligned on a page size and henc

Re: [OMPI devel] OMPI devel] trunk warnings on x86

2014-08-04 Thread Gilles Gouaillardet
out 25% of the time --mca btl scif,self => always hang only the mpirun process remains and is hanging. i will try to debug this, and i welcome any help ! Cheers, Gilles On 2014/08/04 11:57, Gilles Gouaillardet wrote: > Paul, > > i confirm ampersand was missing and this was a bug

Re: [OMPI devel] OMPI devel] trunk warnings on x86

2014-08-03 Thread Gilles Gouaillardet
on both 32 and 64 bits arch in this case : #if OPAL_ENABLE_DEBUG static inline orte_process_name_t * OMPI_CAST_RTE_NAME(opal_process_name_t * name); #else #define OMPI_CAST_RTE_NAME(a) ((orte_process_name_t*)(a)) #endif Cheers, Gilles On 2014/08/03 14:49, Gilles GOUAILLARDET wrote: > P

Re: [OMPI devel] OMPI devel] trunk warnings on x86

2014-08-03 Thread Gilles GOUAILLARDET
Paul, imho, the root cause is a missing ampersand. I will double check this from tomorrow only Cheers, Gilles Ralph Castain wrote: >Arg - that raises an interesting point. This is a pointer to a 64-bit number. >Will uintptr_t resolve that problem on such platforms? > >

Re: [OMPI devel] 1.8.2rc3 now out

2014-08-01 Thread Gilles Gouaillardet
Paul, about the second point : mmap is called with the MAP_FIXED flag, before the fix, the required address was not aligned on a page size and hence mmap failed. the mmap failure was immediately handled, but for some reason i did not fully investigate yet, this failure was not correctly
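The alignment requirement mentioned above can be sketched as follows. This is a minimal illustration, not the actual fix: sketch_align_up and sketch_pagesize are hypothetical helpers, and the page size is assumed to be a power of two as returned by sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Round addr up to the next multiple of pagesize (a power of two), so the
 * result is suitable as a MAP_FIXED address.
 * Hypothetical helper for illustration, not the Open MPI code. */
static uintptr_t sketch_align_up(uintptr_t addr, size_t pagesize)
{
    assert(pagesize != 0 && (pagesize & (pagesize - 1)) == 0);
    return (addr + (pagesize - 1)) & ~((uintptr_t)pagesize - 1);
}

/* Query the page size at run time instead of hard-coding 4096;
 * fall back to a large value if the query fails (assumption). */
static size_t sketch_pagesize(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    return (sz > 0) ? (size_t)sz : 65536;
}
```

For example, sketch_align_up(65537, 65536) yields 131072, while an address that is already aligned is returned unchanged.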

Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-08-01 Thread Gilles Gouaillardet
? /* and then _process_name_jobid_for_opal, _process_name_vpid_for_opal, opal_process_name_vpid_should_never_be_called should also be updated */ Cheers, Gilles On 2014/08/01 19:52, Gilles Gouaillardet wrote: > George and Ralph, > > i am very confused whether there is an issue or not. > > > anyway, today Pa

Re: [OMPI devel] Trunk broken for PPC64?

2014-08-01 Thread Gilles Gouaillardet
pagesize = 65536; /* safer to overestimate than under */ > #endif > > > opal_pagesize() anyone? > > -Paul > > On Fri, Aug 1, 2014 at 12:50 AM, Gilles Gouaillardet < > gilles.gouaillar...@iferc.org> wrote: > >> Paul, >> >> you are absolu

Re: [OMPI devel] Trunk broken for PPC64?

2014-08-01 Thread Gilles Gouaillardet
Paul and Ralph, for what it's worth : a) i faced the very same issue on my (slow) qemu emulated ppc64 vm b) i was able to run very basic programs when passing --mca coll ^ml to mpirun Cheers, Gilles On 2014/08/01 12:30, Ralph Castain wrote: > Yes, I fear this will require some effort to

Re: [OMPI devel] [OMPI svn] svn:open-mpi r32388 - trunk/ompi/mca/pml/ob1

2014-08-01 Thread Gilles Gouaillardet
e Studio > compilers are preferred. > Let me know if you need me to try any of those gcc installations. > > -Paul > > > On Thu, Jul 31, 2014 at 9:12 PM, Gilles Gouaillardet < > gilles.gouaillar...@iferc.org> wrote: > >> Paul, >> >> As Ralph pointed

Re: [OMPI devel] [OMPI svn] svn:open-mpi r32388 - trunk/ompi/mca/pml/ob1

2014-08-01 Thread Gilles Gouaillardet
Paul, As Ralph pointed, this issue was reported last month on the user mailing list. #include did not help : http://www.open-mpi.org/community/lists/users/2014/07/24883.php I will try if i can reproduce and fix this issue on a solaris10 (but x86) VM BTW, are you using the GNU compiler ?

Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error

2014-07-31 Thread Gilles Gouaillardet
Paul, the ibm test suite from the non public ompi-tests repository has several tests for usempif08. Cheers, Gilles On 2014/08/01 11:04, Paul Hargrove wrote: > Second related issue: > > Can/should examples/hello_usempif08.f90 be extended to use more of the > module such that it would have

Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error

2014-07-31 Thread Gilles Gouaillardet
Paul, in .../ompi/mpi/fortran/use-mpi-f08, can you create the following dumb test program, compile and run nm | grep f08 on the object : $ cat foo.f90 program foo use mpi_f08_sizeof implicit none real :: x integer :: size, ierror call MPI_Sizeof_real_s_4(x, size, ierror) stop end program

Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error

2014-07-31 Thread Gilles Gouaillardet
Paul and all, For what it's worth, with openmpi 1.8.2rc2 and the intel fortran compiler version 14.0.3.174 : $ nm libmpi_usempif08.so| grep -i sizeof there is no such undefined symbol (mpi_f08_sizeof_) as a temporary workaround, did you try to force the linker use

Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-07-30 Thread Gilles Gouaillardet
stead of a pointer to the name > > r32357 > > On Jul 30, 2014, at 7:43 AM, Gilles GOUAILLARDET < > gilles.gouaillar...@gmail.com> wrote: > > Rolf, > > r32353 can be seen as a suspect... > Even if it is correct, it might have exposed the bug discussed in #48

Re: [OMPI devel] OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error

2014-07-30 Thread Gilles GOUAILLARDET
't have a PGI compiler. I also didn't specify a level of Fortran >support, but just had --enable-mpi-fortran > >Maybe we need to revert this commit until we figure out a better solution? > >On Jul 30, 2014, at 12:16 AM, Gilles Gouaillardet ><gilles.gouaillar...@iferc.org>

Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-07-30 Thread Gilles GOUAILLARDET
e/util/name_fns.c:522 > >522       if (name1->jobid < name2->jobid) { > >(gdb) print name1 > >$1 = (const orte_process_name_t *) 0x192350001 > >(gdb) print *name1 > >Cannot access memory at address 0x192350001 > >(gdb) print name2 > >$2 = (const orte_proce

Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error

2014-07-30 Thread Gilles Gouaillardet
Paul, this is a fair point. i committed r32354 in order to abort configure in this case Cheers, Gilles On 2014/07/30 15:11, Paul Hargrove wrote: > On a related topic: > > I configured with an explicit --enable-mpi-fortran=usempif08. > Then configure found PROCEDURE was missing/broken. > The

Re: [OMPI devel] trunk compilation errors in jenkins

2014-07-30 Thread Gilles Gouaillardet
George, #4815 is indirectly related to the move : in bcol/basesmuma, we used to compare ompi_process_name_t, and now we (try to) compare an ompi_process_name_t and an opal_process_name_t (which causes a glory SIGSEGV) i proposed a temporary patch which is both broken and inelegant, could you

Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error

2014-07-30 Thread Gilles Gouaillardet
5-Illegal procedure interface - mpi_user_function (conftest.f90: > 12) 0 inform, 0 warnings, 2 severes, 0 fatal for test_proc > {hargrove@hopper04 OMPI}$ pgf90 -V pgf90 13.10-0 64-bit target on x86-64 > Linux -tp shanghai The Portland Group - PGI Compilers and Tools Copyright > (c) 2013,

Re: [OMPI devel] openmpi-1.8.2rc2 and f08 interface built with PGI-14.7 causes link error

2014-07-30 Thread Gilles Gouaillardet
Paul, from the logs, the only difference i see is about Fortran PROCEDURE. openmpi 1.8 (svn checkout) does not build the usempif08 bindings if PROCEDURE is not supported. from the logs, openmpi 1.8.1 does not check whether PROCEDURE is supported or not here is the sample program to check

Re: [OMPI devel] RFC: Add an __attribute__((destructor)) function to opal

2014-07-18 Thread Gilles Gouaillardet
> > It would make sense, though I guess I always thought that was part of what > happened in OBJ_CLASS_INSTANCE - guess I was wrong. My thinking was that > DEREGISTER would be the counter to INSTANCE, and I do want to keep this > from getting even more clunky - so maybe renaming INSTANCE to be

Re: [OMPI devel] RFC: Add an __attribute__((destructor)) function to opal

2014-07-18 Thread Gilles Gouaillardet
+1 for the overall idea ! On Fri, Jul 18, 2014 at 10:17 PM, Ralph Castain wrote: > > * add an OBJ_CLASS_DEREGISTER and require that all instantiations be > matched by deregister at close of the framework/component that instanced > it. Of course, that requires that we protect

Re: [OMPI devel] Onesided failures

2014-07-17 Thread Gilles Gouaillardet
Rolf, i committed r2389. MPI_Win_allocate_shared is now invoked on a single node communicator Cheers, Gilles On 2014/07/16 22:59, Rolf vandeVaart wrote: > Sounds like a good plan. Thanks for looking into this Gilles! > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf

Re: [OMPI devel] Onesided failures

2014-07-16 Thread Gilles GOUAILLARDET
Rolf, From the man page of MPI_Win_allocate_shared It is the user's responsibility to ensure that the communicator comm represents a group of processes that can create a shared memory segment that can be accessed by all processes in the group And from the mtt logs, you are running 4 tasks on

Re: [OMPI devel] RFC: Add an __attribute__((destructor)) function to opal

2014-07-16 Thread Gilles Gouaillardet
Ralph and all, my understanding is that opal_finalize_util aggressively tries to free memory that would be still allocated otherwise. another way of saying "make valgrind happy" is "fully automate memory leak detection" (Joost pointed to the -fsanitize=leak feature of gcc 4.9 in

Re: [OMPI devel] 100% test failures

2014-07-15 Thread Gilles GOUAILLARDET
r32236 is a suspect i am afk I just read the code and a class is initialized with opal_class_initialize the first time an object is instantiated with OBJ_NEW I would simply revert r32236 or update opal_class_finalize and free(cls->cls_construct_array); only if cls->cls_construct_array is not

Re: [OMPI devel] trunk and fortran errors

2014-07-11 Thread Gilles Gouaillardet
Thanks Jeff, i confirm the problem is fixed on CentOS 5 i committed r32215 because some files were missing from the tarball/nightly snapshot/make dist. Cheers, Gilles On 2014/07/11 4:21, Jeff Squyres (jsquyres) wrote: > As of r32204, this should be fixed. Please let me know if it now works

Re: [OMPI devel] trunk and fortran errors

2014-07-10 Thread Gilles Gouaillardet
On CentOS 5.x, gfortran is unable to compile this simple program : subroutine foo () use, intrinsic :: iso_c_binding, only : c_ptr end subroutine foo another workaround is to install gfortran 4.4 (yum install gcc44-gfortran) and configure with FC=gfortran44 On 2014/07/09 19:46, Jeff Squyres

Re: [OMPI devel] segv in ompi_info

2014-07-09 Thread Gilles Gouaillardet
Mike, how do you test ? i cannot reproduce a bug : if you run ompi_info -a -l 9 | less and i press 'q' at the early stage (e.g. before all output is written to the pipe) then the less process exits and ompi_info receives SIGPIPE and crashes (which is normal Unix behaviour) now if i press the spacebar
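
The pipe behaviour described above can be illustrated with a small POSIX sketch. By default, a process that writes to a pipe whose reader has gone away is terminated by SIGPIPE; if the signal is ignored, write() fails with EPIPE instead. The function name write_to_closed_pipe is hypothetical and has nothing to do with ompi_info itself; the sketch ignores SIGPIPE only so that the failure can be observed as an error code rather than as a killed process.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Write one byte to a pipe whose read end is already closed, with SIGPIPE
 * ignored. Returns the errno reported by write(), 0 if the write
 * unexpectedly succeeded, or -1 if the pipe could not be created. */
int write_to_closed_pipe(void)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    /* Ignore SIGPIPE so that write() returns an error instead of the
     * default action (terminating the writing process). */
    void (*old_handler)(int) = signal(SIGPIPE, SIG_IGN);
    assert(old_handler != SIG_ERR);

    /* Simulate the reader (for example a pager) exiting early. */
    close(fds[0]);

    errno = 0;
    ssize_t written = write(fds[1], "x", 1);
    int err = (written < 0) ? errno : 0;

    close(fds[1]);
    signal(SIGPIPE, old_handler); /* restore the previous disposition */
    return err;
}
```

A writer that keeps the default SIGPIPE disposition never sees this error code: it is simply killed by the signal, which matches the termination described in the message.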

Re: [OMPI devel] centos-7 / rhel-7 build fail (configure fails to recognize g++)

2014-07-07 Thread Gilles Gouaillardet
Olivier, i was unable to reproduce the issue on a centos7 beta with : - trunk (latest nightly snapshot) - 1.8.1 - 1.6.5 the libtool-ltdl-devel package is not installed on this server that being said, i did not use --with-verbs nor --with-tm since these packages are not installed on my server.

Re: [OMPI devel] MPI_Recv_init_null_c from intel test suite fails vs ompi trunk

2014-07-04 Thread Gilles Gouaillardet
Yossi, thanks for reporting this issue. i committed r32139 and r32140 to trunk in order to fix this issue (with MPI_Startall) and some misc extra bugs. i also made CMR #4764 for the v1.8 branch (and asked George to review it) Cheers, Gilles On 2014/07/03 22:25, Yossi Etigin wrote: > Looks

Re: [OMPI devel] trunk broken

2014-06-25 Thread Gilles Gouaillardet
Mike, could you try again with OMPI_MCA_btl=vader,self,openib it seems the sm module causes a hang (which later causes the timeout sending a SIGSEGV) Cheers, Gilles On 2014/06/25 14:22, Mike Dubman wrote: > Hi, > The following commit broke trunk in jenkins: > Per the OMPI developer

Re: [OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-25 Thread Gilles Gouaillardet
Hi Ralph, On 2014/06/25 2:51, Ralph Castain wrote: > Had a chance to review this with folks here, and we think that having > oversubscribe automatically set overload makes some sense. However, we do > want to retain the ability to separately specify oversubscribe and overload > as well since

Re: [OMPI devel] OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles Gouaillardet
1:12, Ralph Castain wrote: > Yeah, we should make that change, if you wouldn't mind doing it. > > > > On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET < > gilles.gouaillar...@gmail.com> wrote: > >> Ralph, >> >> That makes perfect sense. >> &g

Re: [OMPI devel] OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles GOUAILLARDET
have a >reason for changing it other than coll/ml. If so, we'd be happy to revisit the >proposal. > > >Make sense? > >Ralph > > > > >On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet ><gilles.gouaillar...@iferc.org> wrote: > >WHAT:

[OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Gilles Gouaillardet
Folks, this issue is related to the failures reported by mtt on the trunk when the ibm test suite invokes MPI_Comm_spawn. my test bed is made of 3 (virtual) machines with 2 sockets and 8 cpus per socket each. if i run on one host (without any batch manager) mpirun -np 16 --host slurm1

[OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles Gouaillardet
WHAT: semantic change of opal_hwloc_base_get_relative_locality WHY: make it closer to what coll/ml expects. Currently, opal_hwloc_base_get_relative_locality means "at what level do these procs share cpus" however, coll/ml is using it as "at what level are these procs commonly

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Gilles Gouaillardet
_NODE. i am puzzled whether this is a bug in opal_hwloc_base_get_relative_locality or in proc.c that should not call this subroutine because it does not do what should be expected. Cheers, Gilles On 2014/06/20 13:59, Gilles Gouaillardet wrote: > Ralph, > > my test VM is single socket four cor

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Gilles Gouaillardet
Ralph, my test VM is single socket four cores. here is something odd i just found when running mpirun -np 2 intercomm_create. tasks [0,1] are bound on cpus [0,1] => OK tasks[2-3] (first spawn) are bound on cpus [2,3] => OK tasks[4-5] (second spawn) are not bound (and cpuset is [0-3]) => OK in

Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Gilles Gouaillardet
Ralph and Tetsuya, is this related to the hang i reported at http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ? Nathan already replied he is working on a fix. Cheers, Gilles On 2014/06/20 11:54, Ralph Castain wrote: > My guess is that the coll/ml component may have problems

[OMPI devel] v1.8 cannot compile since r31979

2014-06-10 Thread Gilles Gouaillardet
Folks, in mca_oob_tcp_component_hop_unknown, the local variable bpr is not defined, which prevents v1.8 compilation. /* there was a local variable called pr, it seems it was removed instead of being renamed into bpr */ the attached patch fixes this issue. Cheers, Gilles Index:
