[OMPI devel] configure fails on the trunk since r31390

2014-04-14 Thread Gilles Gouaillardet
Folks,

as reported in https://svn.open-mpi.org/trac/ompi/ticket/4521, configure
fails on the trunk :-(

Dear OpenMPI Folks,

since r31390,
configure fails on the trunk because oshmem/include/pshmem.h.in is missing.

i believe r31390 should have moved the profiling api from
oshmem/include/shmem.h.in into oshmem/include/pshmem.h.in.
instead, r31390 simply removed all the profiling api and did not include
oshmem/include/pshmem.h.in

i am now trying to rebuild oshmem/include/pshmem.h.in in order to get
things working again

Best regards,

Gilles




[OMPI devel] coll/tuned MPI_Bcast can crash or silently fail when using distinct datatypes across tasks

2014-04-17 Thread Gilles Gouaillardet
Dear OpenMPI developers,

i just created #4531 in order to track this issue :
https://svn.open-mpi.org/trac/ompi/ticket/4531

Basically, the coll/tuned implementation of MPI_Bcast does not work when
two tasks use datatypes of different sizes.
for example, if the root sends two large vectors of MPI_INT and the
non-root tasks receive many MPI_INT, then MPI_Bcast will crash.
but if the root sends many MPI_INT and the non-root tasks receive two large
vectors of MPI_INT, then MPI_Bcast will silently fail.
(the TRAC ticket has attached test cases)
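
For reference, here is a minimal sketch of the failing pattern (this is *not*
the test case attached to the ticket, and it uses MPI_Type_contiguous for
simplicity where the actual tests use vectors): the type signature is the same
on every task (N MPI_INTs), but the root describes the buffer with a 2-element
derived datatype while the other tasks use plain MPI_INT:

/* sketch only: same type signature (N ints) on all tasks, but the root
 * uses 2 x (N/2 contiguous MPI_INT) while the others use N x MPI_INT */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const int N = 1000000;      /* large enough for coll/tuned to segment */
    int rank, i, *buf;
    MPI_Datatype half;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(N * sizeof(int));
    for (i = 0; i < N; i++) buf[i] = (0 == rank) ? i : -1;

    MPI_Type_contiguous(N / 2, MPI_INT, &half);
    MPI_Type_commit(&half);

    if (0 == rank) {
        MPI_Bcast(buf, 2, half, 0, MPI_COMM_WORLD);    /* two large types */
    } else {
        MPI_Bcast(buf, N, MPI_INT, 0, MPI_COMM_WORLD); /* many MPI_INTs   */
    }

    for (i = 0; i < N; i++) {
        if (buf[i] != i) {
            printf("rank %d: wrong data at index %d\n", rank, i);
            break;
        }
    }

    MPI_Type_free(&half);
    free(buf);
    MPI_Finalize();
    return 0;
}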

i believe this kind of issue could occur in all/most collectives of the
coll/tuned module, so it is not limited to MPI_Bcast.


i am wondering what the best way to solve this would be.

one solution i could think of would be to generate temporary datatypes
in order to send messages whose size is exactly the segment_size.

another solution i could think of would be to add new send/recv
functions :
if we consider the send function :
int mca_pml_ob1_send(void *buf,
 size_t count,
 ompi_datatype_t * datatype,
 int dst,
 int tag,
 mca_pml_base_send_mode_t sendmode,
 ompi_communicator_t * comm)

we could imagine an xsend function :
int mca_pml_ob1_xsend(void *buf,
 size_t count,
 ompi_datatype_t * datatype,
 size_t offset,
 size_t size,
 int dst,
 int tag,
 mca_pml_base_send_mode_t sendmode,
 ompi_communicator_t * comm)

where offset is the number of bytes that should be skipped from the
beginning of buf
and size is the (max) number of bytes to be sent (e.g. the message will
be "truncated"
to size bytes if (count*size(datatype) - offset) > size)

or we could use a buffer if needed, and send/recv with the MPI_PACKED datatype
(this is less efficient, and would it even work on heterogeneous nodes ?
a rough sketch of this option is given after the last alternative below)

or we could simply consider this is just a limitation of coll/tuned
(coll/basic works fine) and do nothing

or something else i did not think of ...
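
For reference, here is a rough sketch of the pack-based alternative, written
at the MPI level only for illustration (inside the pml it would use the
pml/btl internals instead); the helper name send_in_packed_segments and the
segment_size argument are made up for this sketch, and whether shipping slices
of a packed buffer as MPI_PACKED is safe on heterogeneous nodes is exactly the
open question above:

/* sketch only: pack the user datatype into a contiguous buffer, then send
 * it in fixed-size segments of type MPI_PACKED */
#include <stdlib.h>
#include <mpi.h>

static void send_in_packed_segments(void *buf, int count, MPI_Datatype type,
                                    int dst, int tag, MPI_Comm comm,
                                    int segment_size)
{
    int packed_size, position = 0, offset;
    char *packed;

    MPI_Pack_size(count, type, comm, &packed_size);
    packed = malloc(packed_size);
    MPI_Pack(buf, count, type, packed, packed_size, &position, comm);

    /* position now holds the number of packed bytes actually produced */
    for (offset = 0; offset < position; offset += segment_size) {
        int len = position - offset;
        if (len > segment_size) len = segment_size;
        MPI_Send(packed + offset, len, MPI_PACKED, dst, tag, comm);
    }
    free(packed);
}

the receiver would accumulate the MPI_PACKED segments into a contiguous
buffer and MPI_Unpack it into its own datatype once everything has arrived.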


thanks in advance for your feedback

Gilles


Re: [OMPI devel] coll/tuned MPI_Bcast can crash or silently fail when using distinct datatypes across tasks

2014-04-23 Thread Gilles Gouaillardet
Nathan,

i uploaded this part to github :
https://github.com/ggouaillardet/ompi-svn-mirror/tree/flatten-datatype

you really need to check the last commit :
https://github.com/ggouaillardet/ompi-svn-mirror/commit/a8d014c6f144fa5732bdd25f8b6b05b07ea8

please consider this experimental and poorly tested.
that being said, it is only an addition to existing code, so it does not
break anything and could be pushed to the trunk.

Gilles

On 2014/04/23 0:05, Hjelm, Nathan T wrote:
> I need the flatten datatype call for handling true rdma in the one-sided code 
> as well. Is there a plan to implement this feature soon?
>



Re: [OMPI devel] coll/tuned MPI_Bcast can crash or silently fail when using distinct datatypes across tasks

2014-04-23 Thread Gilles Gouaillardet
George,

i am sorry, but i cannot see how the flatten datatype can be helpful here :-(

in this example, the master must broadcast a long vector. this datatype
is contiguous
so the flatten'ed datatype *is* the type provided by the MPI application.

how would pipelining happen in this case (e.g. who has to cut the long
vector into pieces and how) ?

should a temporary buffer be used ? and then should it be sent in
pieces of type MPI_PACKED ?
(and if yes, would this be safe in an heterogenous communicator ?)

Thanks in advance for your insights,

Gilles

On 2014/04/22 12:04, George Bosilca wrote:
> Indeed there are many potential solutions, but all require too much
> intervention on the code to be generic enough. As we discussed
> privately mid last year, the "flatten datatype" approach seems to me
> to be the most profitable. It is simple to implement and it is also
> generic, a simple change will make all pipelined collectives work (not
> only tuned but all the others as well).



Re: [OMPI devel] coll/tuned MPI_Bcast can crash or silently fail when using distinct datatypes across tasks

2014-04-23 Thread Gilles Gouaillardet
my bad :-(

this has just been fixed

Gilles

On 2014/04/23 14:55, Nathan Hjelm wrote:
> The ompi_datatype_flatten.c file appears to be missing. Let me know once
> it is committed and I will take a look. I will see if I can write the
> RMA code using it over the next week or so.
>



[OMPI devel] MPI_Recv_init_null_c from intel test suite fails vs ompi trunk

2014-04-24 Thread Gilles Gouaillardet
Folks,

Here is attached an oversimplified version of the MPI_Recv_init_null_c
test from the
intel test suite.

the test works fine with v1.6, v1.7 and v1.8 branches but fails with the
trunk.

i wonder whether the bug is in OpenMPI or in the test itself.

on one hand, we could consider there is a bug in OpenMPI :
status.MPI_SOURCE should be MPI_PROC_NULL since we explicitly posted a
recv request with MPI_PROC_NULL.

on the other hand, (mpi specs, chapter 3.7.3 and
https://svn.open-mpi.org/trac/ompi/ticket/3475)
we could consider the returned value is not significant, and hence
MPI_Wait should return an
empty status (an empty status has source=MPI_ANY_SOURCE per the MPI specs).

for what it's worth, this test succeeds with mpich (e.g.
status.MPI_SOURCE is MPI_PROC_NULL).


what is the correct interpretation of the MPI specs and what should be
done ?
(e.g. fix OpenMPI or fix/skip the test ?)

Cheers,

Gilles
/*
 *  This test program is an over simplified version of the
 *  MPI_Recv_init_null_c test from the intel test suite.
 *
 *  It can be run on one task :
 *  mpirun -np 1 -host localhost ./a.out
 *
 *  when run on the trunk, since r28431, the test will fail :
 *  status.MPI_SOURCE is MPI_ANY_SOURCE instead of MPI_PROC_NULL
 *
 * Copyright (c) 2014  Research Organization for Information Science
 * and Technology (RIST). All rights reserved.
 */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
    MPI_Status status;
    MPI_Request req;
    int ierr;

    MPI_Init(&argc, &argv);

    ierr = MPI_Recv_init(NULL, 0, MPI_INT, MPI_PROC_NULL, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &req);
    if (ierr != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, 1);

    ierr = MPI_Start(&req);
    if (ierr != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, 2);

    ierr = MPI_Wait(&req, &status);
    if (ierr != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, 3);

    if (MPI_PROC_NULL != status.MPI_SOURCE) {
        if (MPI_ANY_SOURCE == status.MPI_SOURCE) {
            printf("got MPI_ANY_SOURCE=%d instead of MPI_PROC_NULL=%d\n",
                   status.MPI_SOURCE, MPI_PROC_NULL);
        } else {
            printf("got %d instead of MPI_PROC_NULL=%d\n", status.MPI_SOURCE,
                   MPI_PROC_NULL);
        }
    } else {
        printf("OK\n");
    }

    MPI_Finalize();
    return 0;
}


Re: [OMPI devel] RFC: Remove heterogeneous support

2014-04-25 Thread Gilles Gouaillardet
Jeff and all,

On 2014/04/25 18:47, Jeff Squyres (jsquyres) wrote:
> But ask that question a little differently: which is more complicated,
> long-term maintenance of a feature which no one really tests (or even
> has the hardware setup to test) or removal?

it is possible to use qemu in order to emulate unavailable hardware.
for what it's worth, i am now running a ppc64 qemu emulated virtual
machine on an x86_64 workstation.
this is pretty slow (2 hours for configure and even more for make) but
enough for simple tests/debugging.

Gilles


Re: [OMPI devel] RFC: Remove heterogeneous support

2014-04-27 Thread Gilles Gouaillardet
According to Jeff's comment, OpenMPI compiled with
--enable-heterogeneous is broken even on a homogeneous cluster.

as a first step, MTT could be run with OpenMPI compiled with
--enable-heterogeneous and running on a homogeneous cluster
(ideally on both little and big endian) in order to identify and fix the
bug/regression.
/* this build is currently disabled in the MTT config of the
cisco-community cluster */

Gilles

On 2014/04/26 9:41, Ralph Castain wrote:
> So it sounds like we may have a test platform, which leaves the question of 
> repair
>
> George: can you give us some idea of what was broken and/or pointers on what 
> needs to be done to repair it?
>


Re: [OMPI devel] RFC: Remove heterogeneous support

2014-04-28 Thread Gilles Gouaillardet
I might have misunderstood Jeff's comment :

> The broken part(s) is(are) likely somewhere in the datatype and/or PML code 
> (my guess).  Keep in mind that my only testing of this feature is in 
> *homogeneous* mode -- i.e., I compile with --enable-heterogeneous and then 
> run tests on homogeneous machines.  Meaning: it's not only broken for actual 
> heterogeneity, it's also broken in the "unity"/homogeneous case.

Unfortunately, a trivial send/recv can hang in this case
(--enable-heterogeneous and a homogeneous cluster of little endian procs).

i opened #4568 https://svn.open-mpi.org/trac/ompi/ticket/4568 in order
to track this issue
(uninitialized data can cause a hang with this config)

trunk is affected, v1.8 is very likely affected too

Gilles

On 2014/04/28 12:22, Ralph Castain wrote:
> I think you misunderstood his comment. It works fine on a homogeneous 
> cluster, even with --enable-hetero. I've run it that way on my cluster.
>
> On Apr 27, 2014, at 7:50 PM, Gilles Gouaillardet 
>  wrote:
>
>> According to Jeff's comment, OpenMPI compiled with
>> --enable-heterogeneous is broken even in an homogeneous cluster.
>>
>> as a first step, MTT could be ran with OpenMPI compiled with
>> --enable-heterogenous and running on an homogeneous cluster
>> (ideally on both little and big endian) in order to identify and fix the
>> bug/regression.
>> /* this build is currently disabled in the MTT config of the
>> cisco-community cluster */
>>
>> Gilles
>>



Re: [OMPI devel] MPI_Comm_create_group()

2014-04-29 Thread Gilles Gouaillardet
Lisandro,

i assume you are running OpenMPI 1.8

r31554 fixes this issue (and some others)
https://svn.open-mpi.org/trac/ompi/changeset/31554/branches/v1.8/ompi/communicator/comm_cid.c

the root cause was an uninitialized variable (rc in
ompi/communicator/comm_cid.c), and the issue only occurred when using a
communicator of size 1.
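
just to illustrate the failure mode (a made-up sketch, *not* the actual
comm_cid.c code): one plausible shape of such a bug is a return code that is
only assigned inside a loop over the other ranks, so a communicator of size 1
returns whatever garbage was on the stack:

/* sketch only: rc is written inside the loop, the loop body never runs
 * when comm_size == 1, so rc is returned uninitialized */
static int negotiate_something(int comm_size)
{
    int i, rc;                 /* fix: initialize rc to the success code */
    for (i = 1; i < comm_size; i++) {
        rc = 0;                /* stands for "talk to rank i" */
    }
    return rc;
}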

Gilles

On 2014/04/30 2:48, Dave Goodell (dgoodell) wrote:
> Thanks for the bug report.  It seems that nobody has time to work on this at 
> the moment, so I've filed a ticket so that we don't lose track of it:
>
> https://svn.open-mpi.org/trac/ompi/ticket/4577
>
> On Apr 21, 2014, at 9:55 AM, Lisandro Dalcin  wrote:
>
>> A very basic test for MPI_Comm_create_group() is failing for me. I'm
>> pasting the code, the failure, and output from valgrind.
>>
>> [dalcinl@kw2060 openmpi]$ cat comm_create_group.c
>> #include 
>> int main(int argc, char *argv[])
>> {
>>  MPI_Group group;
>>  MPI_Comm comm;
>>  MPI_Init(&argc, &argv);
>>  MPI_Comm_group(MPI_COMM_WORLD, &group);
>>  MPI_Comm_create_group(MPI_COMM_WORLD, group, 0, &comm);
>>  MPI_Comm_free(&comm);
>>  MPI_Group_free(&group);
>>  MPI_Finalize();
>>  return 0;
>> }
>>



Re: [OMPI devel] Wrong Endianness in Open MPI for external32 representation

2014-04-30 Thread Gilles Gouaillardet
Edgar and Christoph,

i do not think ROMIO supports this yet.

from ompi/mca/io/romio/romio/README
"This version of ROMIO includes everything defined in the MPI I/O
chapter except support for file interoperability [...]"

i also ran ompi/mca/io/romio/romio/test/external32.c :

on a x86_64 box (little endian)
$ ./external32
native datarep is LITTLE ENDIAN
external32 datarep is LITTLE ENDIAN
internal datarep is LITTLE ENDIAN

on a ppc64 box (big endian)
$ ./external32
native datarep is BIG ENDIAN
external32 datarep is BIG ENDIAN
internal datarep is BIG ENDIAN


that being said :
with mpich (trunk), on a x86_64 box :
$ ./external32.mpich
native datarep is LITTLE ENDIAN
external32 datarep is BIG ENDIAN
internal datarep is LITTLE ENDIAN

here is the output of mpi-io-external32 (with mpich) :
$ ./mpi-io-external32.mpich
Output file: mpi-io-external32.dat
[-1] Block at address 0x00c6f0e8 is corrupted (probably write
past end)
[-1] Block being freed allocated in
rc/mpich/src/mpi/romio/mpi-io/mpiu_external32.c[159]
[-1] Block cookie should be f0e0d0c9 but was e2ff4c054000

$ od -t x1 ./mpi-io-external32.dat
000 ff ff ff e2 00 00 00 00 40 30 40 00

MPI_INT was written big endian (good)
but
MPI_DOUBLE was written little endian (bad)


my conclusion is that the ROMIO included in OpenMPI is a few steps
behind the one provided with MPICH
and/but MPICH ROMIO does not fully support file interoperability
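
for reference, here is a minimal sketch of the kind of check being discussed
(this is *not* Christoph's attached test, just an illustration): write the
same int (-30) and double (16.25) through a view that selects the "external32"
data representation; since external32 is big endian, the expected file content
is ff ff ff e2 40 30 40 00 00 00 00 00.

/* sketch only: write -30 (int) and 16.25 (double) with the external32
 * datarep, then inspect the file with od/hexdump */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_File fh;
    int i = -30;
    double d = 16.25;

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_SELF, "external32.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "external32", MPI_INFO_NULL);
    MPI_File_write(fh, &i, 1, MPI_INT, MPI_STATUS_IGNORE);
    /* an external32 MPI_INT is 4 bytes, so the double starts at displacement 4 */
    MPI_File_set_view(fh, 4, MPI_DOUBLE, MPI_DOUBLE, "external32", MPI_INFO_NULL);
    MPI_File_write(fh, &d, 1, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}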

Cheers,

Gilles

On 2014/04/29 22:27, Edgar Gabriel wrote:
> the way you launch the app, you will be using ROMIO, and I am not 100%
> sure about how the data representation stuff is integrated with OMPI. I
> am pretty sure that we are not doing the right thing for OMPIO, but I
> will look into later this week.
>
> Thanks
> Edgar
>
> On 4/29/2014 7:03 AM, Christoph Niethammer wrote:
>> Hello,
>>
>> It seems for me that the endianness is wrong in Open MPI's I/O for the 
>> external32 data representation. :O
>>
>> Find attached my test programs which write the integer -30 and the double 
>> 16.25 into a file.
>> Here the output on my Linux system:
>>
>> mpicc c-xdr.c   -o c-xdr
>> mpicc mpi-io-external32.c   -o mpi-io-external32
>> ./c-xdr
>> Output file: c-xdr.dat
>> hexdump c-xdr.dat
>> 000  e2ff 3040 0040    
>> 00c
>> ./mpi-io-external32
>> Output file: mpi-io-external32.dat
>> hexdump mpi-io-external32.dat
>> 000 ffe2    4000 4030  
>> 00c
>>
>>
>> Best regards
>> Christoph Niethammer
>>
>> --
>>
>> Christoph Niethammer
>> High Performance Computing Center Stuttgart (HLRS)
>> Nobelstrasse 19
>> 70569 Stuttgart
>>
>> Tel: ++49(0)711-685-87203
>> email: nietham...@hlrs.de
>> http://www.hlrs.de/people/niethammer
>>



Re: [OMPI devel] memory leaks upon dup/split/create of communicators?

2014-04-30 Thread Gilles Gouaillardet
Joost,

i created #4581 and attached a patch (for the trunk) in order to solve
this leak (and two similar ones)

Cheers,

Gilles

On 2014/04/29 5:18, VandeVondele Joost wrote:
> Hi,
>
> I applied the patch from ticket #4569 (to 1.8.1), and things improved (in 
> particular the reported issue is gone). The next big leaks seems to relate to 
> Cartesian communicators.
>
> Direct leak of 9600 byte(s) in 300 object(s) allocated from:
> #0 0x7f7cd2c8e793 in __interceptor_calloc 
> ../../../../gcc/libsanitizer/lsan/lsan_interceptors.cc:89
> #1 0x7f7cd3a92552 in mca_topo_base_cart_create 
> ../../../../ompi/mca/topo/base/topo_base_cart_create.c:89
> #2 0x7f7cd3a52bfd in PMPI_Cart_create 
> /data/vjoost/openmpi-1.8.1/build/ompi/mpi/c/profile/pcart_create.c:103
> #3 0x7f7cd3d0f4bf in ompi_cart_create_f 
> /data/vjoost/openmpi-1.8.1/build/ompi/mpi/fortran/mpif-h/profile/pcart_create_f.c:82
> #4 0x1bfdf6a in __message_passing_MOD_mp_cart_create 
> /data/vjoost/clean/cp2k/cp2k/src/common/message_passing.F:984
>
> Direct leak of 21600 byte(s) in 300 object(s) allocated from:
> #0 0x7f7cd2c8e3a8 in __interceptor_malloc 
> ../../../../gcc/libsanitizer/lsan/lsan_interceptors.cc:66
> #1 0x7f7cd3a3501f in opal_obj_new ../../opal/class/opal_object.h:467
> #2 0x7f7cd3a3501f in ompi_group_allocate ../../ompi/group/group_init.c:63
> #3 0x7f7cd3a2b192 in ompi_comm_fill_rest 
> ../../ompi/communicator/comm.c:1827
> #4 0x7f7cd3a2b192 in ompi_comm_enable ../../ompi/communicator/comm.c:1782
> #5 0x7f7cd3a924d2 in mca_topo_base_cart_create 
> ../../../../ompi/mca/topo/base/topo_base_cart_create.c:164
> #6 0x7f7cd3a52bfd in PMPI_Cart_create 
> /data/vjoost/openmpi-1.8.1/build/ompi/mpi/c/profile/pcart_create.c:103
> #7 0x7f7cd3d0f4bf in ompi_cart_create_f 
> /data/vjoost/openmpi-1.8.1/build/ompi/mpi/fortran/mpif-h/profile/pcart_create_f.c:82
> #8 0x1bfdf6a in __message_passing_MOD_mp_cart_create 
> /data/vjoost/clean/cp2k/cp2k/src/common/message_passing.F:984
>
> Joost
>



[OMPI devel] scif btl side effects

2014-05-07 Thread Gilles Gouaillardet
Dear OpenMPI Folks,

i noticed some crashes when running OpenMPI (both latest v1.8 and trunk
from svn) on a single linux system where a MIC is available.
/* strictly speaking, MIC hardware is not needed: libscif.so, mic kernel
module and accessible /dev/mic/* are enough */

the attached test_scif program can be used to demonstrate this issue.
/* this is an over simplified version of collective/bcast_struct.c from
the ibm test suite,
it is currently failing on the bend-rsh cluster at intel */

this program will cause a silent failure
(MPI_Recv receives truncated data without issuing any warning)

i ran a few investigations and basically, here is what i found :
MPI_Send will split the message into two fragments. the first fragment
is sent via the vader btl
and the second fragment is sent with the scif btl.

the program will succeed if the scif btl is disabled (mpirun --mca btl
^scif)
interestingly, i found that
mpirun -host localhost -np 2 --mca btl scif,self ./test_scif
does produce correct results with ompi v1.8 r31309 (a crash might happen
in MPI_Finalize)
and produces incorrect results with ompi v1.8 r31671 and trunk (r31667)

imho :
a) the scif btl could/should be automatically disabled if no MIC is
detected on a host
b) the scif btl could/should not be used to communicate between two
cores of the host
(e.g. it should be used *only* when at least one peer is on the MIC)
c) that being said, that should work, so there is a bug
d) there is a regression in v1.8 and a bug that might have always been here

i attached a patch that will automatically disable the scif btl if no
MIC is found
(e.g. scif_get_nodeIDs(...) returns 1), i believe it is safe to use it
(that being said, we might want to add an option to force the use of the
scif btl no matter what)

could you please share your thoughts on my assumptions a) b) c) and d) ?
if b) is what we want to implement, then mca_btl_scif_add_procs could be
modified as follows :

if (!OPAL_PROC_ON_LOCAL_HOST(ompi_proc->proc_flags) ||
my_proc == ompi_proc) {
/* scif can only be used with procs on this board */
continue;
}

becomes

if (!OPAL_PROC_ON_LOCAL_HOST(ompi_proc->proc_flags) ||
my_proc == ompi_proc || (!IS_MIC(my_proc) &&
!IS_MIC(ompi_proc)) {
/* scif can only be used with procs on this board unless
both procs are not on MIC */
continue;
}

and IS_MIC(proc) has to be implemented ...
/* is hwloc 1.7.2 already able to do this ? if yes, pointers will be
highly appreciated */

Cheers,

Gilles
Index: ompi/mca/btl/scif/btl_scif_module.c
===
--- ompi/mca/btl/scif/btl_scif_module.c (revision 31667)
+++ ompi/mca/btl/scif/btl_scif_module.c (working copy)
@@ -78,7 +78,13 @@
 rc = scif_get_nodeIDs (NULL, 0, &mca_btl_scif_module.port_id.node);
 if (-1 == rc) {
 BTL_VERBOSE(("btl/scif error getting node id of this node"));
+scif_close (mca_btl_scif_module.scif_fd);
+mca_btl_scif_module.scif_fd = -1;
 return OMPI_ERROR;
+} else if (1 == rc) {
+BTL_VERBOSE(("btl/scif no MIC detected"));
+mca_btl_scif_module.scif_fd = -1;
+return OMPI_ERROR;
 }

 /* Listen for connections */
/*
 * This test is an oversimplified version of collective/bcast_struct
 * that comes with the ibm test suite.
 * it must be run on two tasks on a single host where the MIC software stack
 * is present (e.g. libscif.so is present, the mic driver is loaded and
 * /dev/mic/* are accessible) and the scif btl is available.
 *
 * mpirun -np 2 -host localhost --mca btl scif,vader,self ./test_scif
 * will produce incorrect results with trunk and v1.8
 *
 * mpirun -np 2 --mca btl ^scif -host localhost ./test_scif
 * will work with trunk and v1.8
 *
 * mpirun -np 2 --mca btl scif,self -host localhost ./test_scif
 * will produce correct results with v1.8 r31309 (but eventually crash in 
MPI_Finalize)
 * and produce incorrect result with v1.8 r31671 and trunk r31667
 *
 * Copyright (c) 2011  Oracle and/or its affiliates.  All rights reserved.
 * Copyright (c) 2014  Research Organization for Information Science
 * and Technology (RIST). All rights reserved.
 */
/*

 MESSAGE PASSING INTERFACE TEST CASE SUITE

 Copyright IBM Corp. 1995

 IBM Corp. hereby grants a non-exclusive license to use, copy, modify, and
 distribute this software for any purpose and without fee provided that the
 above copyright notice and the following paragraphs appear in all copies.

 IBM Corp. makes no representation that the test cases comprising this
 suite are correct or are an accurate representation of any standard.

 In no event shall IBM be liable to any party for direct, indirect, special
 incidental, or consequential damage arising out of the use of this software
 even if IBM Corp. has been 

Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Gilles Gouaillardet

On 2014/05/08 2:15, Ralph Castain wrote:
> I wonder if that might also explain the issue reported by Gilles regarding 
> the scif BTL? In his example, the problem only occurred if the message was 
> split across scif and vader. If so, then it might be that splitting messages 
> in general is broken.
>
i am afraid there is a misunderstanding :
the problem always occurs with scif,vader,self (regardless of the ompi v1.8
version)
the problem occurs with scif,self only if r31496 is applied to ompi v1.8


In my previous email
http://www.open-mpi.org/community/lists/devel/2014/05/14699.php
i reported the following interesting fact :

with ompi v1.8 (latest r31678), the following command produces incorrect
results :
mpirun -host localhost -np 2 --mca btl scif,self ./test_scif

but with ompi v1.8 r31309, the very same command produces correct results

Elena pointed out that r31496 is a suspect. so i took the latest v1.8
(r31678) and reverted r31496 and ...


mpirun -host localhost -np 2 --mca btl scif,self ./test_scif

works again !

note that the "default"
mpirun -host localhost -np 2 --mca btl scif,vader,self ./test_scif
still produces incorrect results

in order to reproduce the issue, a MIC is *not* needed,
you only need to install the software stack, load the mic kernel module
and make sure you can read/write /dev/mic/*

bottom line, there are two issues here :
1) r31496 broke something : mpirun -np 2 -host localhost --mca btl
scif,self ./test_scif
2) something else never worked : mpirun -np 2 -host localhost --mca btl
scif,vader,self ./test_scif

Gilles



Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
George,

you do not need any hardware, just download MPSS from Intel and install it.
make sure the mic kernel module is loaded *and* you can read/write to the
newly created /dev/mic/* devices.

/* i am now running this on a virtual machine with no MIC whatsoever */

i was able to improve things a bit for the new attached test case
/* send MPI_PACKED / recv newtype */
with the attached unpack.patch.

it has to be applied on r31678 (aka the latest checkout of the v1.8 branch)

with this patch (zero regression testing so far, it might solve one problem
but break something else !)

mpirun -np 2 -host localhost --mca btl,scif,vader ./test_scif2
works fine :-)

but

mpirun -np 2 -host localhost --mca btl scif,vader ./test_scif2
still crashes (and it did not crash before r31496)

i will provide the output you requested shortly

Cheers,

Gilles
/*
 * This test is an oversimplified version of collective/bcast_struct
 * that comes with the ibm test suite.
 * it must be run on two tasks on a single host where the MIC software stack
 * is present (e.g. libscif.so is present, the mic driver is loaded and
 * /dev/mic/* are accessible) and the scif btl is available.
 *
 * mpirun -np 2 -host localhost --mca btl scif,vader,self ./test_scif
 * will produce incorrect results with trunk and v1.8
 *
 * mpirun -np 2 --mca btl ^scif -host localhost ./test_scif
 * will work with trunk and v1.8
 *
 * mpirun -np 2 --mca btl scif,self -host localhost ./test_scif
 * will produce correct results with v1.8 r31309 (but eventually crash in 
MPI_Finalize)
 * and produce incorrect result with v1.8 r31671 and trunk r31667
 *
 * Copyright (c) 2011  Oracle and/or its affiliates.  All rights reserved.
 * Copyright (c) 2014  Research Organization for Information Science
 * and Technology (RIST). All rights reserved.
 */
/*

 MESSAGE PASSING INTERFACE TEST CASE SUITE

 Copyright IBM Corp. 1995

 IBM Corp. hereby grants a non-exclusive license to use, copy, modify, and
 distribute this software for any purpose and without fee provided that the
 above copyright notice and the following paragraphs appear in all copies.

 IBM Corp. makes no representation that the test cases comprising this
 suite are correct or are an accurate representation of any standard.

 In no event shall IBM be liable to any party for direct, indirect, special
 incidental, or consequential damage arising out of the use of this software
 even if IBM Corp. has been advised of the possibility of such damage.

 IBM CORP. SPECIFICALLY DISCLAIMS ANY WARRANTIES INCLUDING, BUT NOT LIMITED
 TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS AND IBM
 CORP. HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES,
 ENHANCEMENTS, OR MODIFICATIONS.



 These test cases reflect an interpretation of the MPI Standard.  They
 are, in most cases, unit tests of specific MPI behaviors.  If a user of any
 test case from this set believes that the MPI Standard requires behavior
 different than that implied by the test case we would appreciate feedback.

 Comments may be sent to:
Richard Treumann
treum...@kgn.ibm.com


*/
#include <stdio.h>
#include <stdlib.h>
#include <poll.h>
#include "mpi.h"

#define ompitest_error(file,line,...) {fprintf(stderr, "FUCK at %s:%d root=%d size=%d (i,j)=(%d,%d)\n", file, line, root, i0, i, j); MPI_Abort(MPI_COMM_WORLD, 1);}

const int SIZE = 1000;

int main(int argc, char **argv)
{
   int myself;

   double a[2], t_stop;
   int ii, size;
   int len[2];
   MPI_Aint disp[2];
   MPI_Datatype type[2], newtype, t1, t2;
   struct foo_t {
   int i[3];
   double d[3];
   } foo, *bar;
   struct pfoo_t {
   int i[2];
   double d[2];
   } pfoo, *pbar;
   int i0, i, j, root, nseconds = 600, done_flag;
   int _dbg=0;

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&myself);
   MPI_Comm_size(MPI_COMM_WORLD,&size);
   // _dbg = (0 == myself);
   while (_dbg) poll(NULL,0,1);

   if ( argc > 1 ) nseconds = atoi(argv[1]);
   t_stop = MPI_Wtime() + nseconds;

   /*-*/
   /* Build a datatype that is guaranteed to have holes; send/recv
  large numbers of them */

   MPI_Type_vector(2, 1, 2, MPI_INT, &t1);
   MPI_Type_commit(&t1);
   MPI_Type_vector(2, 1, 2, MPI_DOUBLE, &t2);
   MPI_Type_commit(&t2);

   len[0] = len[1] = 1;
   MPI_Address(&foo.i[0], &disp[0]);
   MPI_Address(&foo.d[0], &disp[1]);
   printf ("%d: %x %x\n", myself, disp[0], disp[1]);
   disp[0] -= (MPI_Aint) &foo;
   disp[1] -= (MPI_Aint) &foo;
   printf ("%d: %ld %ld\n", myself, disp[0], disp[1]);
   type[0] = t1;
   type[1] = t2;
   MPI_Type_struct(2, len, disp, type, &newtype);
   MPI_Type_commit(&newtype);

Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
Nathan and George,

here are the output files of the original test_scif.c
the command line was

mpirun -np 2 -host localhost --mca btl scif,vader,self --mca
mpi_ddt_unpack_debug 1 --mca mpi_ddt_pack_debug 1 --mca
mpi_ddt_position_debug 1 a.out

this is a silent failure and there is no core file
the test itself detects it did not receive the expected value
/* grep "expected" in the output */

Gilles

On 2014/05/08 16:43, Hjelm, Nathan T wrote:
> If you can get me the backtrace from one of the crash core files I would like 
> to see what is going on there.
>



Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
Nathan and George,

here are the (compressed) traces

Gilles

On 2014/05/08 16:43, Hjelm, Nathan T wrote:
> If you can get me the backtrace from one of the crash core files I would like 
> to see what is going on there.
>
> -Nathan
> 
> From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet 
> [gilles.gouaillar...@iferc.org]
> Sent: Thursday, May 08, 2014 1:32 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] regression with derived datatypes
>
> George,
>
> you do not need any hardware, just download MPSS from Intel and install it.
> make sure the mic kernel module is loaded *and* you can read/write to the
> newly created /dev/mic/* devices.
>
> /* i am now running this on a virtual machine with no MIC whatsoever */
>
> i was able to improve things a bit for the new attached test case
> /* send MPI_PACKED / recv newtype */
> with the attached unpack.patch.
>
> it has to be applied on r31678 (aka the latest checkout of the v1.8 branch)
>
> with this patch (zero regression test so far, it might solve one problem
> but break anything else !)
>
> mpirun -np 2 -host localhost --mca btl,scif,vader ./test_scif2
> works fine :-)
>
> but
>
> mpirun -np 2 -host localhost --mca btl scif,vader ./test_scif2
> still crashes (and it did not crash before r31496)
>
> i will provide the output you requested shortly
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14745.php



r31678.log.bz2
Description: Binary data


r31678withoutr31496.log.bz2
Description: Binary data


Re: [OMPI devel] regression with derived datatypes

2014-05-09 Thread Gilles Gouaillardet
I ran some more investigations with --mca btl scif,self

i found that the previous patch i posted was complete crap and i
apologize for it.

on a brighter side, and imho, the issue only occurs if fragments are
received (and then processed) out of order.
/* i did not observe this with the tcp btl, but i always see that with
the scif btl, i guess this can be observed too
with openib+RDMA */

in this case only, opal_convertor_generic_simple_position(...) is
invoked and does not set the pConvertor->pStack
as expected by r31496

i will run some more tests from now

Gilles

On 2014/05/08 2:23, George Bosilca wrote:
> Strange. The outcome and the timing of this issue seems to highlight a link 
> with the other datatype-related issue you reported earlier, and as suggested 
> by Ralph with Gilles scif+vader issue.
>
> Generally speaking, the mechanism used to split the data in the case of 
> multiple BTLs, is identical to the one used to split the data in fragments. 
> So, if the culprit is in the splitting logic, one might see some weirdness as 
> soon as we force the exclusive usage of the send protocol, with an 
> unconventional fragment size.
>
> In other words using the following flags “—mca btl tcp,self —mca 
> btl_tcp_flags 3 —mca btl_tcp_rndv_eager_limit 23 —mca btl_tcp_eager_limit 23 
> —mca btl_tcp_max_send_size 23” should always transfer wrong data, even when 
> only one single BTL is in play.
>
>   George.
>
> On May 7, 2014, at 13:11 , Rolf vandeVaart  wrote:
>
>> OK.  So, I investigated a little more.  I only see the issue when I am 
>> running with multiple ports enabled such that I have two openib BTLs 
>> instantiated.  In addition, large message RDMA has to be enabled.  If those 
>> conditions are not met, then I do not see the problem.  For example:
>> FAILS:
>> Ø  mpirun –np 2 –host host1,host2 –mca btl_openib_if_include 
>> mlx5_0:1,mlx5_0:2 –mca btl_openib_flags 3 MPI_Isend_ator_c
>> PASS:
>> Ø  mpirun –np 2 –host host1,host2 –mca btl_openib_if_include mlx5_0:1 –mca 
>> btl_openib_flags 3 MPI_Isend_ator_c
>> Ø  mpirun –np 2 –host host1,host2 –mca 
>> btl_openib_if_include_mlx5:0:1,mlx5_0:2 –mca btl_openib_flags 1 
>> MPI_Isend_ator_c
>>  
>> So we must have some type of issue when we break up the message between the 
>> two openib BTLs.  Maybe someone else can confirm my observations?
>> I was testing against the latest trunk.
>>



Re: [OMPI devel] regression with derived datatypes

2014-05-09 Thread Gilles Gouaillardet
i opened #4610 https://svn.open-mpi.org/trac/ompi/ticket/4610
and attached a patch for the v1.8 branch

i ran several tests from the intel_tests test suite and did not observe
any regression.

please note there are still issues when running with --mca btl
scif,vader,self

this might be another issue, i will investigate more next week

Gilles

On 2014/05/09 18:08, Gilles Gouaillardet wrote:
> I ran some more investigations with --mca btl scif,self
>
> i found that the previous patch i posted was complete crap and i
> apologize for it.
>
> on a brighter side, and imho, the issue only occurs if fragments are
> received (and then processed) out of order.
> /* i did not observe this with the tcp btl, but i always see that with
> the scif btl, i guess this can be observed too
> with openib+RDMA */
>
> in this case only, opal_convertor_generic_simple_position(...) is
> invoked and does not set the pConvertor->pStack
> as expected by r31496
>
> i will run some more tests from now
>
> Gilles
>
> On 2014/05/08 2:23, George Bosilca wrote:
>> Strange. The outcome and the timing of this issue seems to highlight a link 
>> with the other datatype-related issue you reported earlier, and as suggested 
>> by Ralph with Gilles scif+vader issue.
>>
>> Generally speaking, the mechanism used to split the data in the case of 
>> multiple BTLs, is identical to the one used to split the data in fragments. 
>> So, if the culprit is in the splitting logic, one might see some weirdness 
>> as soon as we force the exclusive usage of the send protocol, with an 
>> unconventional fragment size.
>>
>> In other words using the following flags “—mca btl tcp,self —mca 
>> btl_tcp_flags 3 —mca btl_tcp_rndv_eager_limit 23 —mca btl_tcp_eager_limit 23 
>> —mca btl_tcp_max_send_size 23” should always transfer wrong data, even when 
>> only one single BTL is in play.
>>
>>   George.
>>
>> On May 7, 2014, at 13:11 , Rolf vandeVaart  wrote:
>>
>>> OK.  So, I investigated a little more.  I only see the issue when I am 
>>> running with multiple ports enabled such that I have two openib BTLs 
>>> instantiated.  In addition, large message RDMA has to be enabled.  If those 
>>> conditions are not met, then I do not see the problem.  For example:
>>> FAILS:
>>> Ø  mpirun –np 2 –host host1,host2 –mca btl_openib_if_include 
>>> mlx5_0:1,mlx5_0:2 –mca btl_openib_flags 3 MPI_Isend_ator_c
>>> PASS:
>>> Ø  mpirun –np 2 –host host1,host2 –mca btl_openib_if_include mlx5_0:1 –mca 
>>> btl_openib_flags 3 MPI_Isend_ator_c
>>> Ø  mpirun –np 2 –host host1,host2 –mca 
>>> btl_openib_if_include_mlx5:0:1,mlx5_0:2 –mca btl_openib_flags 1 
>>> MPI_Isend_ator_c
>>>  
>>> So we must have some type of issue when we break up the message between the 
>>> two openib BTLs.  Maybe someone else can confirm my observations?
>>> I was testing against the latest trunk.
>>>



Re: [OMPI devel] scif btl side effects

2014-05-12 Thread Gilles Gouaillardet
Nathan,

On 2014/05/08 4:21, Hjelm, Nathan T wrote:
> c) that being said, that should work so there is a bug
> d) there is a regression in v1.8 and a bug that might have been always here
> This is probably not a regression. The SCIF btl has been part of the 1.7 
> series for some time. The nightly MTTs are probably missing one of the cases 
> that causes this problem. Hopefully we can get this fixed before 1.8.2.
as explained in #4610 (https://svn.open-mpi.org/trac/ompi/ticket/4610)
the root cause is in the way data are unpacked.

The scif btl is ok :-)

when using --mca btl scif,self fragments can be received out of order,
and that can trigger a bug introduced by r31496

that being said, --mca btl scif,vader,self does not work with r31496
reverted.
the root cause is another bug in the way data are unpacked, it happens
also when fragments are received out of order
*and* fragments contain a subpart of a predefined datatype.
in this case, the vader btl received a fragment of size 1325 *and* out
of order and that caused the bug.

Gilles


Re: [OMPI devel] scif btl side effects

2014-05-12 Thread Gilles Gouaillardet
i wrote this too early ...

the attached program produces incorrect results when run with
--mca btl scif,vader,self

once the most up-to-date patch of #4610 has been applied, (at least) one
bug remains, and it is in the scif btl

the attached patch fixes it.

Gilles

On 2014/05/12 16:17, Gilles Gouaillardet wrote:
> Nathan,
>
> On 2014/05/08 4:21, Hjelm, Nathan T wrote:
>> c) that being said, that should work so there is a bug
>> d) there is a regression in v1.8 and a bug that might have been always here
>> This is probably not a regression. The SCIF btl has been part of the 1.7 
>> series for some time. The nightly MTTs are probably missing one of the cases 
>> that causes this problem. Hopefully we can get this fixed before 1.8.2.
> as explained in #4610 (https://svn.open-mpi.org/trac/ompi/ticket/4610)
> the root cause is in the way data are unpacked.
>
> The scif btl is ok :-)
>
> when using --mca btl scif,self fragments can be received out of order,
> and that can trigger a bug introduced by r31496
>
> that being said, --mca btl scif,vader,self does not work with r31496
> reverted.
> the root cause is an other bug in the way data are unpacked, it happen
> also when fragments are received out of order
> *and* fragments contain a subpart of a predefined datatype.
> in this case, the vader btl received a fragment of size 1325 *and* out
> of order and that caused the bug.
>
> Gilles

/*
 * This test is an oversimplified version of collective/bcast_struct
 * that comes with the ibm test suite.
 * it must be run on two tasks on a single host where the MIC software stack
 * is present (e.g. libscif.so is present, the mic driver is loaded and
 * /dev/mic/* are accessible) and the scif btl is available.
 *
 * mpirun -np 2 -host localhost --mca btl scif,vader,self ./test_scif
 * will produce incorrect results with trunk and v1.8
 *
 * mpirun -np 2 --mca btl ^scif -host localhost ./test_scif
 * will work with trunk and v1.8
 *
 * mpirun -np 2 --mca btl scif,self -host localhost ./test_scif
 * will produce correct results with v1.8 r31309 (but eventually crash in 
MPI_Finalize)
 * and produce incorrect result with v1.8 r31671 and trunk r31667
 *
 * Copyright (c) 2011  Oracle and/or its affiliates.  All rights reserved.
 * Copyright (c) 2014  Research Organization for Information Science
 * and Technology (RIST). All rights reserved.
 */
/*

 MESSAGE PASSING INTERFACE TEST CASE SUITE

 Copyright IBM Corp. 1995

 IBM Corp. hereby grants a non-exclusive license to use, copy, modify, and
 distribute this software for any purpose and without fee provided that the
 above copyright notice and the following paragraphs appear in all copies.

 IBM Corp. makes no representation that the test cases comprising this
 suite are correct or are an accurate representation of any standard.

 In no event shall IBM be liable to any party for direct, indirect, special
 incidental, or consequential damage arising out of the use of this software
 even if IBM Corp. has been advised of the possibility of such damage.

 IBM CORP. SPECIFICALLY DISCLAIMS ANY WARRANTIES INCLUDING, BUT NOT LIMITED
 TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS AND IBM
 CORP. HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES,
 ENHANCEMENTS, OR MODIFICATIONS.



 These test cases reflect an interpretation of the MPI Standard.  They
 are, in most cases, unit tests of specific MPI behaviors.  If a user of any
 test case from this set believes that the MPI Standard requires behavior
 different than that implied by the test case we would appreciate feedback.

 Comments may be sent to:
Richard Treumann
treum...@kgn.ibm.com


*/
#include <stdio.h>
#include <stdlib.h>
#include <poll.h>
#include "mpi.h"

#define ompitest_error(file,line,...) {fprintf(stderr, "FUCK at %s:%d root=%d size=%d (i,j)=(%d,%d)\n", file, line, root, i0, i, j); MPI_Abort(MPI_COMM_WORLD, 1);}

const int SIZE = 1000;

int main(int argc, char **argv)
{
   int myself;

   double a[2], t_stop;
   int ii, size;
   int len[2];
   MPI_Aint disp[2];
   MPI_Datatype type[2], newtype, t1, t2;
   struct foo_t {
   int i[3];
   double d[3];
   } foo, *bar;
   int i0, i, j, root, nseconds = 600, done_flag;
   int _dbg=0;

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&myself);
   MPI_Comm_size(MPI_COMM_WORLD,&size);
   // _dbg = (0 == myself);
   while (_dbg) poll(NULL,0,1);

   if ( argc > 1 ) nseconds = atoi(argv[1]);
   t_stop = MPI_Wtime() + nseconds;

   /*---

[OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Gilles Gouaillardet
Folks,

i would like to comment on r31738 :

> There is no reason to cancel the listening thread. It should die
> automatically when the file descriptor is closed.
i could not agree more
> It is sufficient to just wait for the thread to exit with pthread join.
unfortunately, at least in my test environment (an outdated MPSS 2.1) it
is *not* :-(

this is what i described in #4615
https://svn.open-mpi.org/trac/ompi/ticket/4615
in which i attached scif_hang.c, which evidences that (at least in my
environment)
scif_poll(...) does *not* return after scif_close(...) is called, and
hence the scif pthread never ends.

this is likely a bug in MPSS and it might have been fixed in a more
recent release.

Nathan, could you try scif_hang in your environment and report the MPSS
version you are running ?


bottom line, and once again, in my test environment, pthread_join (...)
without pthread_cancel(...)
might cause a hang when the btl/scif module is released.


assuming the bug is in old MPSS and has been fixed in recent releases,
what is the OpenMPI policy ?
a) test the MPSS version and call pthread_cancel() or do *not* call
pthread_join if buggy MPSS is detected ?
b) display an error/warning if a buggy MPSS is detected ?
c) do not call pthread_join at all ? /* SIGSEGV might occur with older
MPSS, it is in MPI_Finalize() so impact is limited */
d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
problem after all ?
e) something else ?

Gilles


Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Gilles Gouaillardet
Ralph,

scif_poll(...) is called with an infinite timeout.

a quick fix would be to use a finite timeout (1s ? 10s ? more ?)
the obvious drawback is the thread has to wake up every xxx seconds and
that would be for
nothing 99.9% of the time.

my analysis (see #4615) is that the crash occurs when the btl/scif is
unloaded from memory (e.g. dlclose()) and
the scif_thread is still running.

Gilles

On 2014/05/14 11:25, Ralph Castain wrote:
> It could be a bug in the software stack, though I wouldn't count on it. 
> Unfortunately, pthread_cancel is known to have bad side effects, and so we 
> avoid its use.
>
> The key here is that the thread must detect that the file descriptor has 
> closed and exit, or use some other method for detecting that it should 
> terminate. We do this in multiple other places in the code, without using 
> pthread_cancel and without hanging. So it is certainly doable.
>
> I don't know the specifics of why Nathan's code is having trouble exiting, 
> but I suspect that a simple solution - not involving pthread_cancel - can be 
> readily developed.
>
>
> On May 13, 2014, at 7:18 PM, Gilles Gouaillardet 
>  wrote:
>
>> Folks,
>>
>> i would like to comment on r31738 :
>>
>>> There is no reason to cancel the listening thread. It should die
>>> automatically when the file descriptor is closed.
>> i could not agree more
>>> It is sufficient to just wait for the thread to exit with pthread join.
>> unfortunatly, at least in my test environment (an outdated MPSS 2.1) it
>> is *not* :-(
>>
>> this is what i described in #4615
>> https://svn.open-mpi.org/trac/ompi/ticket/4615
>> in which i attached scif_hang.c that evidences that (at least in my
>> environment)
>> scif_poll(...) does *not* return after scif_close(...) is closed, and
>> hence the scif pthread never ends.
>>
>> this is likely a bug in MPSS and it might have been fixed in earlier
>> release.
>>
>> Nathan, could you try scif_hang in your environment and report the MPSS
>> version you are running ?
>>
>>
>> bottom line, and once again, in my test environment, pthread_join (...)
>> without pthread_cancel(...)
>> might cause a hang when the btl/scif module is released.
>>
>>
>> assuming the bug is in old MPSS and has been fixed in recent releases,
>> what is the OpenMPI policy ?
>> a) test the MPSS version and call pthread_cancel() or do *not* call
>> pthread_join if buggy MPSS is detected ?
>> b) display an error/warning if a buggy MPSS is detected ?
>> c) do not call pthread_join at all ? /* SIGSEGV might occur with older
>> MPSS, it is in MPI_Finalize() so impact is limited */
>> d) do nothing, let the btl/scif module hang, this is *not* an OpenMPI
>> problem after all ?
>> e) something else ?
>>
>> Gilles
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/05/14786.php
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14787.php



Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread Gilles Gouaillardet
Nathan,

> Looks like this is a scif bug. From the documentation:

and from the source code, scif_poll(...) simply calls poll(...), at least
in MPSS 2.1

> Since that is not the case I will look through the documentation and see
> if there is a way other than pthread_cancel.


what about :

- use a global variable (a boolean called "close_requested")
- update the scif thread so it checks close_requested after each scif_poll,
  and exits if true
- when closing btl/scif :
  * set close_requested to true
  * scif_connect to myself
  * close this connection
  * pthread_join(...)


that's a bit heavyweight, but it does the job. a rough sketch of the idea is
given below.
(and we keep an infinite timeout for scif_poll() so the overhead at runtime
is null)
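
for illustration, here is a minimal POSIX sketch of that shutdown scheme,
with a TCP socket standing in for the scif endpoint (this is *not* the
btl/scif code; close_requested and listener are names made up for the
sketch):

/* sketch only: the listener blocks in poll() with an infinite timeout,
 * shutdown sets a flag, connects to the listening endpoint to wake it up,
 * and then joins the thread */
#include <pthread.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static volatile int close_requested = 0;
static int listen_fd;
static struct sockaddr_in addr;

static void *listener(void *arg)
{
    struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };
    (void)arg;
    while (1) {
        poll(&pfd, 1, -1);                        /* infinite timeout */
        if (close_requested) break;               /* woken up on purpose */
        int conn = accept(listen_fd, NULL, NULL); /* a real connection */
        if (conn >= 0) close(conn);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    socklen_t len = sizeof(addr);
    int fd;

    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                            /* any free port */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    getsockname(listen_fd, (struct sockaddr *)&addr, &len);
    listen(listen_fd, 1);
    pthread_create(&tid, NULL, listener, NULL);

    /* ... the btl is in use here ... */

    close_requested = 1;                          /* 1. set the flag */
    fd = socket(AF_INET, SOCK_STREAM, 0);         /* 2. connect to myself */
    connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    close(fd);                                    /* 3. close this connection */
    pthread_join(tid, NULL);                      /* 4. join the thread */
    close(listen_fd);
    printf("listener thread joined cleanly\n");
    return 0;
}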


i can test this approach from tomorrow if needed


Gilles


[OMPI devel] r31765 causes crash in mpirun

2014-05-15 Thread Gilles Gouaillardet
Folks,

since r31765 (opal/event: release the opal event context when closing
the event base)
mpirun crashes at the end of the job.

for example :

$ mpirun --mca btl tcp,self -n 4 `pwd`/src/MPI_Allreduce_user_c
MPITEST info  (0): Starting MPI_Allreduce_user() test
MPITEST_results: MPI_Allreduce_user() all tests PASSED (7076)
[soleil:10959] *** Process received signal ***
[soleil:10959] Signal: Segmentation fault (11)
[soleil:10959] Signal code: Address not mapped (1)
[soleil:10959] Failing at address: 0x7fd969e75a98
[soleil:10959] [ 0] /lib64/libpthread.so.0[0x3c9da0f500]
[soleil:10959] [ 1]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7bae5)[0x7fd96a55dae5]
[soleil:10959] [ 2]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x7ac97)[0x7fd96a55cc97]
[soleil:10959] [ 3]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_del+0x88)[0x7fd96a55ca15]
[soleil:10959] [ 4]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_free+0x132)[0x7fd96a558831]
[soleil:10959] [ 5]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(+0x74126)[0x7fd96a556126]
[soleil:10959] [ 6]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(mca_base_framework_close+0xdd)[0x7fd96a54026f]
[soleil:10959] [ 7]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-pal.so.0(opal_finalize+0x7e)[0x7fd96a50d36e]
[soleil:10959] [ 8]
/csc/home1/gouaillardet/local/ompi-trunk/lib/libopen-rte.so.0(orte_finalize+0xd3)[0x7fd96a7ead2f]
[soleil:10959] [ 9] mpirun(orterun+0x1298)[0x404f0e]
[soleil:10959] [10] mpirun(main+0x20)[0x4038a4]
[soleil:10959] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c9d21ecdd]
[soleil:10959] [12] mpirun[0x4037c9]
[soleil:10959] *** End of error message ***
Segmentation fault (core dumped)

Gilles



Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-15 Thread Gilles Gouaillardet
Nathan,

this had no effect on my environment :-(

i am not sure you can reuse mca_btl_scif_module.scif_fd with connect()
i had to use a new scif fd for that.

then i ran into an other glitch : if the listen thread does not
scif_accept() the connection,
the scif_connect() will take 30 seconds (default timeout value i guess)

i fixed this in r31772

Gilles

On 2014/05/15 1:16, Nathan Hjelm wrote:
> That is exactly how I decided to fix it. It looks like it is
> working. Please try r31755 when you get a chance.
>



[OMPI devel] yesterday commits caused a crash in helloworld with --mca btl tcp, self

2014-05-16 Thread Gilles Gouaillardet
Folks,

a simple
mpirun -np 2 -host localhost --mca btl,tcp mpi_helloworld

crashes after some of yesterday's commits (i would blame r31778 and/or
r31782,
but i am not 100% sure)

/* a list receives a negative value, so the program takes some time
before crashing,
the symptom may vary from one system to another */

i dug into this, and found what looks like an old bug/typo in
mca_bml_r2_del_procs().
the bug has *not* been introduced by yesterday's commits.
i believe this path was simply not executed before yesterday's commits,
which is why we only now hit the bug

i fixed this in r31786

Gilles


[OMPI devel] RFC : what is the best way to fix the memory leak in mca/pml/bfo

2014-05-16 Thread Gilles Gouaillardet
Folks,

there is a small memory leak in ompi/mca/pml/bfo/pml_bfo_component.c

in my environment, this module is not used.
this means mca_pml_bfo_component_open() and mca_pml_bfo_component_close()
are invoked but
mca_pml_bfo_component_init() and mca_pml_bfo_component_fini() are *not*
invoked.

mca_pml_bfo.allocator is currently allocated in
mca_pml_bfo_component_open() and released in mca_pml_bfo_component_fini()
this causes a leak (at least in my environment, since
mca_pml_bfo_component_fini() is not invoked)

One option is to release the allocator in mca_pml_bfo_component_close()
Another option is to allocate the allocator in mca_pml_bfo_component_init()

Which is the correct/best one ?

i attached two patches, which one (if any) should be committed ?

Thanks in advance for your insights

Gilles
Index: ompi/mca/pml/bfo/pml_bfo_component.c
===
--- ompi/mca/pml/bfo/pml_bfo_component.c	(revision 31788)
+++ ompi/mca/pml/bfo/pml_bfo_component.c	(working copy)
@@ -12,6 +12,8 @@
  * All rights reserved.
  * Copyright (c) 2007-2010 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2010  Oracle and/or its affiliates.  All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -149,25 +151,9 @@
 
 static int mca_pml_bfo_component_open(void)
 {
-mca_allocator_base_component_t* allocator_component;
-
 mca_pml_bfo_output = opal_output_open(NULL);
 opal_output_set_verbosity(mca_pml_bfo_output, mca_pml_bfo_verbose);
 
-allocator_component = mca_allocator_component_lookup( mca_pml_bfo.allocator_name );
-if(NULL == allocator_component) {
-opal_output(0, "mca_pml_bfo_component_open: can't find allocator: %s\n", mca_pml_bfo.allocator_name);
-return OMPI_ERROR;
-}
-
-mca_pml_bfo.allocator = allocator_component->allocator_init(true,
-mca_pml_bfo_seg_alloc,
-mca_pml_bfo_seg_free, NULL);
-if(NULL == mca_pml_bfo.allocator) {
-opal_output(0, "mca_pml_bfo_component_open: unable to initialize allocator\n");
-return OMPI_ERROR;
-}
-
 mca_pml_bfo.enabled = false; 
 return mca_base_framework_open(&ompi_bml_base_framework, 0); 
 }
@@ -191,6 +177,8 @@
 bool enable_progress_threads,
 bool enable_mpi_threads )
 {
+mca_allocator_base_component_t* allocator_component;
+
 opal_output_verbose( 10, mca_pml_bfo_output,
  "in bfo, my priority is %d\n", mca_pml_bfo.priority);
 
@@ -200,6 +188,21 @@
 }
 *priority = mca_pml_bfo.priority;
 
+allocator_component = mca_allocator_component_lookup( mca_pml_bfo.allocator_name );
+if(NULL == allocator_component) {
+opal_output(0, "mca_pml_bfo_component_open: can't find allocator: %s\n", mca_pml_bfo.allocator_name);
+return NULL;
+}
+
+mca_pml_bfo.allocator = allocator_component->allocator_init(true,
+mca_pml_bfo_seg_alloc,
+mca_pml_bfo_seg_free, NULL);
+if(NULL == mca_pml_bfo.allocator) {
+opal_output(0, "mca_pml_bfo_component_open: unable to initialize allocator\n");
+return NULL;
+}
+
+
 if(OMPI_SUCCESS != mca_bml_base_init( enable_progress_threads, 
   enable_mpi_threads)) {
 return NULL;
Index: ompi/mca/pml/bfo/pml_bfo_component.c
===
--- ompi/mca/pml/bfo/pml_bfo_component.c	(revision 31785)
+++ ompi/mca/pml/bfo/pml_bfo_component.c	(working copy)
@@ -12,6 +12,8 @@
  * All rights reserved.
  * Copyright (c) 2007-2010 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2010  Oracle and/or its affiliates.  All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -180,6 +182,9 @@
 if (OMPI_SUCCESS != (rc = mca_base_framework_close(&ompi_bml_base_framework))) {
  return rc;
 }
+if(OMPI_SUCCESS != (rc = mca_pml_bfo.allocator->alc_finalize(mca_pml_bfo.allocator))) {
+return rc;
+}
 opal_output_close(mca_pml_bfo_output);
 
 return OMPI_SUCCESS;
@@ -237,10 +242,6 @@
 OBJ_DESTRUCT(&mca_pml_bfo.rdma_frags);
 OBJ_DESTRUCT(&mca_pml_bfo.lock);
 
-if(OMPI_SUCCESS != (rc = mca_pml_bfo.allocator->alc_finalize(mca_pml_bfo.allocator))) {
-return rc;
-}
-
 #if 0
 if (mca_pml_base_send_requests.fl_

[OMPI devel] problem compiling trunk after r31810

2014-05-18 Thread Gilles Gouaillardet
Folks,

i was unable to compile trunk after svn update.

i use different directories (aka VPATH) for source and build
the error message is related to the missing shmem/java directory
under the oshmem directory.

The attached patch fixed this.

/* that being said, i did not try to build java for oshmem,
so i did not commit this patch since it might not work when needed */

Cheers,

Gilles
Index: oshmem/Makefile.am
===
--- oshmem/Makefile.am	(revision 31810)
+++ oshmem/Makefile.am	(working copy)
@@ -34,6 +34,7 @@
 	include \
 	shmem/c \
 	shmem/fortran
+	shmem/java
 
 if PROJECT_OSHMEM
 # Only traverse these dirs if we're building oshmem
@@ -41,8 +42,7 @@
 	$(MCA_oshmem_FRAMEWORKS_SUBDIRS) \
 	$(MCA_oshmem_FRAMEWORK_COMPONENT_STATIC_SUBDIRS) \
 	. \
-	$(MCA_oshmem_FRAMEWORK_COMPONENT_DSO_SUBDIRS) \
-	shmem/java
+	$(MCA_oshmem_FRAMEWORK_COMPONENT_DSO_SUBDIRS)
 endif
 
 DIST_SUBDIRS = \


Re: [OMPI devel] [OMPI svn] svn:open-mpi r31786 - trunk/ompi/mca/bml/r2

2014-05-19 Thread Gilles Gouaillardet
Nathan,

do you mean the bug/typo was not at line 487
(e.g. btl_send was ok)
but at line 498 ?
(e.g. btl_send must be used instead of btl_eager)

at first sight, that makes sense.

i'd rather let the author/maintainer of this part comment on that

Cheers,

Gilles


On Sat, May 17, 2014 at 5:47 AM, Hjelm, Nathan T  wrote:

> Is this correct? Shouldn't the fix be to change the call before the loop
> to reference btl_send instead of btl_eager?
>
> I ask because it looks like the loop is trying to prevent a btl from
> getting del_procs twice for the same proc. If we do not remove the btl from
> the btl_send array it will get the call twice.
>
> Correct me if I am wrong.
>
> -Nathan
>
> 
> From: svn [svn-boun...@open-mpi.org] on behalf of
> svn-commit-mai...@open-mpi.org [svn-commit-mai...@open-mpi.org]
> Sent: Thursday, May 15, 2014 10:43 PM
> To: s...@open-mpi.org
> Subject: [OMPI svn] svn:open-mpi r31786 - trunk/ompi/mca/bml/r2
>
> Author: ggouaillardet (Gilles Gouaillardet)
> Date: 2014-05-16 00:43:18 EDT (Fri, 16 May 2014)
> New Revision: 31786
> URL: https://svn.open-mpi.org/trac/ompi/changeset/31786
>
> Log:
> Fix a typo in mca_bml_r2_del_procs()
>
> Use bml_endpoint->btl_eager instead of bml_endpoint->btl_send.
>
> cmr=v1.8.2:reviewer=rhc
>
> Text files modified:
>trunk/ompi/mca/bml/r2/bml_r2.c | 4 +++-
>1 files changed, 3 insertions(+), 1 deletions(-)
>
> Modified: trunk/ompi/mca/bml/r2/bml_r2.c
>
> ==
> --- trunk/ompi/mca/bml/r2/bml_r2.c  Thu May 15 20:30:41 2014
>  (r31785)
> +++ trunk/ompi/mca/bml/r2/bml_r2.c  2014-05-16 00:43:18 EDT (Fri, 16
> May 2014)  (r31786)
> @@ -15,6 +15,8 @@
>   * Copyright (c) 2008-2014 Cisco Systems, Inc.  All rights reserved.
>   * Copyright (c) 2013  Intel, Inc. All rights reserved
>   * Copyright (c) 2014  NVIDIA Corporation.  All rights reserved.
> + * Copyright (c) 2014  Research Organization for Information Science
> + * and Technology (RIST). All rights reserved.
>   * $COPYRIGHT$
>   *
>   * Additional copyrights may follow
> @@ -482,7 +484,7 @@
>   */
>  n_size =
> mca_bml_base_btl_array_get_size(&bml_endpoint->btl_eager);
>  for(n_index = 0; n_index < n_size; n_index++) {
> -mca_bml_base_btl_t* search_bml_btl =
> mca_bml_base_btl_array_get_index(&bml_endpoint->btl_send, n_index);
> +mca_bml_base_btl_t* search_bml_btl =
> mca_bml_base_btl_array_get_index(&bml_endpoint->btl_eager, n_index);
>  if(search_bml_btl->btl == btl) {
>  memset(search_bml_btl, 0, sizeof(mca_bml_base_btl_t));
>  break;
> ___
> svn mailing list
> s...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/svn
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14821.php
>


Re: [OMPI devel] RFC : what is the best way to fix the memory leak in mca/pml/bfo

2014-05-19 Thread Gilles Gouaillardet
Thanks guys !

i committed r31816 (bfo: allocate the allocator in init rather than open)
and made a CMR

based on mtt results, i will push George's commit tomorrow,
and based on Rolf's recommendation, i will do the CMR by the end of the week
if everything works fine

Gilles


Re: [OMPI devel] [OMPI svn] svn:open-mpi r31786 - trunk/ompi/mca/bml/r2

2014-05-19 Thread Gilles Gouaillardet
Nathan,

r31829 caused many sigsegv :-(
/* i am testing on a RHEL6.3 like VM with --mca btl,tcp */

this is now fixed in r31830,
i think i get the intent of the code and i believe we are all set now.

bottom line :
- we agree on line 487 (e.g. use btl_send)
- your update of line 485 is correct (e.g. use btl_send)
- my suggested update of line 498 (e.g. use btl_send) was correct.
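
for clarity, here is how the loop around lines 485-487 reads once both spots
use btl_send (reconstructed from the r31786 diff earlier in this thread, the
committed code may differ slightly) :

/* search btl_send (not btl_eager) for this btl and clear it, so the
 * btl does not get del_procs called twice for the same proc */
n_size = mca_bml_base_btl_array_get_size(&bml_endpoint->btl_send);
for (n_index = 0; n_index < n_size; n_index++) {
    mca_bml_base_btl_t* search_bml_btl =
        mca_bml_base_btl_array_get_index(&bml_endpoint->btl_send, n_index);
    if (search_bml_btl->btl == btl) {
        memset(search_bml_btl, 0, sizeof(mca_bml_base_btl_t));
        break;
    }
}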

Cheers,

Gilles

On 2014/05/20 4:06, Nathan Hjelm wrote:
> On Mon, May 19, 2014 at 02:14:57PM +0900, Gilles Gouaillardet wrote:
>>Nathan,
>>
>>do you mean the bug/typo was not at line 487
>>(e.g. btl_send was ok)
>>but at line 498 ?
>>(e.g. btl_send must be used instead of btl_eager)
> Yup. If you look at the next loop (L497) it looks through the btl_send
> list and then calls del_procs if it finds the btl in that list.
>
>>at first sight, that make sense.
>>
>>i'd rather let the author/maintainer of this part comment on that
> I don't know if the original author still works on Open MPI. I think we
> will have to guess the intent of the code. Let me take a closer look and
> see if I can determine for sure what was intended. If I can determine
> for sure I will include this change with another bml fix that needs to
> go in.
>
> -Nathan
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14828.php



[OMPI devel] *neighbor_alltoall* are broken

2014-05-23 Thread Gilles Gouaillardet
Folks,

i noticed that *neighbor_alltoall* are now broken.

the bug is in the way parameters are checked (i revamped this and did the
wrong thing for neighbor communications :-()

this bug is only visible when the number of tasks becomes large
(this can explain why i did not detect this on my VM ...)

i am sorry for the mess and am now working on a fix

Gilles


Re: [OMPI devel] Still problems with del_procs in trunkj

2014-05-25 Thread Gilles Gouaillardet
Rolf,

the assert fails because the endpoint reference count is greater than one.
the root cause is that the endpoint has been added to the list of
eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
ompi/mca/btl/openib/btl_openib_endpoint.c:1009)

a simple workaround is not to use eager rdma with the openib btl
(e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)

here is attached a patch that solves the issue.

because of my poor understanding of the openib btl, i did not commit it.
i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
(e.g. what happens if there are in-flight messages ?)
i also added some comments that indicate the patch might be suboptimal.

Nathan, could you please review the attached patch ?

please note that if the faulty assertion is removed, the endpoint will be
OBJ_RELEASE'd, but only in the btl finalize.

Gilles



On Sat, May 24, 2014 at 12:31 AM, Rolf vandeVaart wrote:

> I am still seeing problems with del_procs with openib.  Do we believe
> everything should be working?  This is with the latest trunk (updated 1
> hour ago).
>
> [rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include
> mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1
> connectivity_cConnectivity test on 2 processes PASSED.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> --
> mpirun noticed that process rank 1 with PID 28443 on node drossetti-ivy1
> exited on signal 11 (Segmentation fault).
> --
> [rvandevaart@drossetti-ivy0 examples]$
>
> ---
> This email message is for the sole use of the intended recipient(s) and
> may contain
> confidential information.  Any unauthorized review, use, disclosure or
> distribution
> is prohibited.  If you are not the intended recipient, please contact the
> sender by
> reply email and destroy all copies of the original message.
>
> ---
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14836.php
>
Index: ompi/mca/btl/openib/btl_openib.c
===
--- ompi/mca/btl/openib/btl_openib.c(revision 31888)
+++ ompi/mca/btl/openib/btl_openib.c(working copy)
@@ -1128,7 +1128,7 @@
 struct ompi_proc_t **procs,
 struct mca_btl_base_endpoint_t ** peers)
 {
-int i,ep_index;
+int i, ep_index;
 mca_btl_openib_module_t* openib_btl = (mca_btl_openib_module_t*) btl;
 mca_btl_openib_endpoint_t* endpoint;
 
@@ -1144,8 +1144,19 @@
 continue;
 }
 if (endpoint == del_endpoint) {
+int j;
             BTL_VERBOSE(("in del_procs %d, setting another endpoint to null",
                          ep_index));
+            /* remove the endpoint from eager_rdma_buffers */
+            for (j = 0; j < openib_btl->device->eager_rdma_buffers_count; j++) {
+                if (openib_btl->device->eager_rdma_buffers[j] == endpoint) {
+                    /* should it be obj_reference_count == 2 ? */
+                    assert(((opal_object_t*)endpoint)->obj_reference_count > 1);
+                    OBJ_RELEASE(endpoint);
+                    openib_btl->device->eager_rdma_buffers[j] = NULL;
+                    /* can we simply break and leave the for loop ? */
+                }
+            }
             opal_pointer_array_set_item(openib_btl->device->endpoints,
                                         ep_index, NULL);
             assert(((opal_object_t*)endpoint)->obj_reference_count == 1);


[OMPI devel] OMPI Opengrok config

2014-05-27 Thread Gilles Gouaillardet
Folks,

OMPI Opengrok search (http://svn.open-mpi.org/source) currently returns
results for :
- trunk
- v1.6 branch
- v1.5 branch
- v1.3 branch

imho, it could/should return results for the following branches :
- trunk
- v1.8 branch
- v1.6 branch
and maybe the v1.4 branch (and the v1.9 branch when it is created)

any thoughts ?

Cheers,

Gilles


[OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-27 Thread Gilles Gouaillardet
Folks,

currently, the dynamic/intercomm_create test from the ibm test suite outputs
the following message :

dpm_base_disconnect_init: error -12 in isend to process 1

the root cause is that task 0 tries to send messages to already exited tasks.

one way of seeing things is that this is an application issue :
task 0 should have MPI_Comm_free'd all its communicator before calling
MPI_Comm_disconnect.
This can be achieved via the attached patch

an other way of seeing things is that this is a bug in OpenMPI.
In this case, what would be the right approach ?
- automatically free communicators (if needed) when MPI_Comm_disconnect is
invoked ?
- simply remove communicators (if needed) from ompi_mpi_communicators when
MPI_Comm_disconnect is invoked ?
  /* this causes a memory leak, but the application can be seen as
responsible for it */
- other ?

Thanks in advance for your feedback,

Gilles
Index: ibm/dynamic/intercomm_create.c
===
--- ibm/dynamic/intercomm_create.c	(revision 2370)
+++ ibm/dynamic/intercomm_create.c	(working copy)
@@ -104,6 +104,10 @@
 err = MPI_Barrier(abc_intra);
 printf( "%s: barrier (%d)\n", whoami, err );
 
+MPI_Comm_free(&abc_intra);
+MPI_Comm_free(&ab_c_inter);
+MPI_Comm_free(&ab_intra);
+MPI_Comm_free(&ac_intra);
 MPI_Comm_disconnect(&ab_inter);
 MPI_Comm_disconnect(&ac_inter);
 }


[OMPI devel] some info is not pushed into the dstore

2014-05-27 Thread Gilles Gouaillardet
Folks,

while debugging the dynamic/intercomm_create from the ibm test suite, i
found something odd.

i ran *without* any batch manager on a VM (one socket and four cpus)
mpirun -np 1 ./dynamic/intercomm_create

it hangs by default
it works with --mca coll ^ml

basically :
- task 0 spawns task 1
- task 0 spawns task 2
- a communicator is created for the 3 tasks via MPI_Intercomm_create()

MPI_Intercomm_create() calls ompi_comm_get_rprocs() which calls
ompi_proc_set_locality()

then, on task 1, ompi_proc_set_locality() calls
opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...) which
fails and this is OK
then
opal_dstore_fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...) which
fails and this is *not* OK

/* on task 2, the first fetch for "task 1" fails but the second succeeds */

my analysis is that when task 2 was created, it updated its
opal_dstore_nonpeer with info from "task 1", which had previously been spawned
by task 0.
when task 1 was spawned, task 2 did not exist yet, so its opal_dstore_nonpeer
contains no reference to task 2.
but when task 2 was spawned, the opal_dstore_nonpeer of task 1 was not
updated, hence the failure

(on task 1, proc_flags of task 2 has incorrect locality; this likely
confuses coll ml and hangs the test)

should task1 have received new information when task 2 was spawned ?
should task2 have sent information to task1 when it was spawned ?
should task1 have (tried to) get fresh information before invoking
MPI_Intercomm_create() ?

incidentally, i found that ompi_proc_set_locality calls opal_dstore.store with
identifier &proc (the argument is &proc->proc_name everywhere else), so this
is likely a bug/typo. the attached patch fixes this.

Thanks in advance for your feedback,

Gilles
Index: ompi/proc/proc.c
===
--- ompi/proc/proc.c	(revision 31891)
+++ ompi/proc/proc.c	(working copy)
@@ -231,7 +231,7 @@
 kvn.key = strdup(OPAL_DSTORE_LOCALITY);
 kvn.type = OPAL_HWLOC_LOCALITY_T;
 kvn.data.uint16 = locality;
-ret = opal_dstore.store(opal_dstore_internal, (opal_identifier_t*)&proc, &kvn);
+ret = opal_dstore.store(opal_dstore_internal, (opal_identifier_t*)&proc->proc_name, &kvn);
 OBJ_DESTRUCT(&kvn);
 /* set the proc's local value as well */
 proc->proc_flags = locality;


Re: [OMPI devel] OMPI Opengrok config

2014-05-27 Thread Gilles Gouaillardet
Thanks Jeff,

i can only speak for myself : i use OpenGrok on a daily basis and it is a
great help

Cheers,

Gilles


On Wed, May 28, 2014 at 8:21 AM, Jeff Squyres (jsquyres)  wrote:

> I can ask IU to adjust the OpenGrok config.
>
>
> On May 27, 2014, at 1:06 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> > Folks,
> >
> > OMPI Opengrok search (http://svn.open-mpi.org/source) currently returns
> results for :
> > - trunk
> > - v1.6 branch
> > - v1.5 branch
> > - v1.3 branch
> >
> > imho, it could/should return results for the following branches :
> > - trunk
> > - v1.8 branch
> > - v1.6 branch
> > and maybe the v1.4 branch (and the v1.9 branch when it is created)
> >
> > any thoughts ?
> >
> > Cheers,
> >
> > Gilles
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14846.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14873.php
>


Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-27 Thread Gilles Gouaillardet
Ralph,

in the case of intercomm_create, the children free all the communicators,
then call MPI_Comm_disconnect() and MPI_Finalize(), and exit.
the parent only calls MPI_Comm_disconnect() without freeing all the communicators.
its MPI_Finalize() then tries to disconnect and communicate with already exited
processes.

my understanding is that there are two ways of seeing things :
a) the "R-way" : the problem is the parent should not try to communicate to
already exited processes
b) the "J-way" : the problem is the children should have waited either in
MPI_Comm_free() or MPI_Finalize()

i did not investigate the loop_spawn test yet, and will do today.

as far as i am concerned, i have no opinion on which of a) or b) is the
correct/most appropriate approach.

Cheers,

Gilles


On Wed, May 28, 2014 at 9:46 AM, Ralph Castain  wrote:

> Since you ignored my response, I'll reiterate and clarify it here. The
> problem in the case of loop_spawn is that the parent process remains
> "connected" to children after the child has finalized and died. Hence, when
> the parent attempts to finalize, it tries to "disconnect" itself from
> processes that no longer exist - and that is what generates the error
> message.
>
> So the issue in that case appears to be that "finalize" is not marking the
> child process as "disconnected", thus leaving the parent thinking that it
> needs to disconnect when it finally ends.
>
>
> On May 27, 2014, at 5:33 PM, Jeff Squyres (jsquyres) 
> wrote:
>
> > Note that MPI says that COMM_DISCONNECT simply disconnects that
> individual communicator.  It does *not* guarantee that the processes
> involved will be fully disconnected.
> >
> > So I think that the freeing of communicators is good app behavior, but
> it is not required by the MPI spec.
> >
> > If OMPI is requiring this for correct termination, then something is
> wrong.  MPI_FINALIZE is supposed to be collective across all connected MPI
> procs -- and if the parent and spawned procs in this test are still
> connected (because they have not disconnected all communicators between
> them), the FINALIZE is supposed to be collective across all of them.
> >
> > This means that FINALIZE is allowed to block if it needs to, such that
> OMPI sending control messages to procs that are still "connected" (in the
> MPI sense) should never cause a race condition.
> >
> > As such, this sounds like an OMPI bug.
> >
> >
> >
> >
> > On May 27, 2014, at 2:27 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >
> >> Folks,
> >>
> >> currently, the dynamic/intercomm_create test from the ibm test suite
> output the following messages :
> >>
> >> dpm_base_disconnect_init: error -12 in isend to process 1
> >>
> >> the root cause it task 0 tries to send messages to already exited tasks.
> >>
> >> one way of seeing things is that this is an application issue :
> >> task 0 should have MPI_Comm_free'd all its communicator before calling
> MPI_Comm_disconnect.
> >> This can be achieved via the attached patch
> >>
> >> an other way of seeing things is that this is a bug in OpenMPI.
> >> In this case, what would be the the right approach ?
> >> - automatically free communicators (if needed) when MPI_Comm_disconnect
> is invoked ?
> >> - simply remove communicators (if needed) from ompi_mpi_communicators
> when MPI_Comm_disconnect is invoked ?
> >>  /* this causes a memory leak, but the application can be seen as
> responsible of it */
> >> - other ?
> >>
> >> Thanks in advance for your feedback,
> >>
> >> Gilles
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14847.php
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14875.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14876.php
>


Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-27 Thread Gilles Gouaillardet
Ralph,


On 2014/05/28 12:10, Ralph Castain wrote:
> my understanding is that there are two ways of seeing things :
> a) the "R-way" : the problem is the parent should not try to communicate to 
> already exited processes
> b) the "J-way" : the problem is the children should have waited either in 
> MPI_Comm_free() or MPI_Finalize()
> I don't think you can use option (b) - we can't have the children lingering 
> around for the parent to call finalize, if I'm understanding you correctly.
you understood me correctly.

once again, i did not start investigating loop_spawn.

in the case of intercomm_create, we would not run into this if the
application had explicitly called MPI_Comm_free in the parent.
so in this case *only*, and as explained by Jeff, b) could be an option
to make OpenMPI happy.
(to be blunt : if the user is not happy with children lingering around,
he can explicitly call MPI_Comm_free before calling MPI_Comm_disconnect)

i will start investigating loop_spawn from now

Cheers,

Gilles


Re: [OMPI devel] Trunk (RDMA and VT) warnings

2014-05-28 Thread Gilles Gouaillardet
Ralph,

can you please describe your environment (at least compiler (and version) +
configure command line)
i checked osc_rdma_data_move.c only :

the variable size_t incoming_length is there only to improve readability:
it is used solely in an assert clause and in OPAL_OUTPUT_VERBOSE

one way to silence the warning is not to use this variable (and compromise
readability).

an other way would be to
#if OPAL_ENABLE_DEBUG
size_t incoming_length = request->req_status._ucount;
#endif

imho, a more elegant way would be to use a macro like
OPAL_IF_DEBUG(size_t incoming_length = request->req_status._ucount;)

/* i am not aware of such a macro, please point me if it already exists */
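
for illustration, such a macro could look like this (OPAL_IF_DEBUG is
hypothetical, it does not exist in the tree today) :

#if OPAL_ENABLE_DEBUG
#define OPAL_IF_DEBUG(stmt) stmt
#else
#define OPAL_IF_DEBUG(stmt)
#endif

/* usage: the variable only exists in debug builds, where the assert and
 * OPAL_OUTPUT_VERBOSE that reference it are compiled in as well */
OPAL_IF_DEBUG(size_t incoming_length = request->req_status._ucount;)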

any thoughts ?


about the other warnings ("xxx may be used uninitialized in this function"), i
was unable to reproduce them and i have to double check again.
so far, it seems this is a false positive/compiler bug that could be
triggered by inlining
/* i could not find any path in which the variable is used uninitialized */

Cheers,

Gilles


On Mon, May 26, 2014 at 12:25 PM, Ralph Castain  wrote:

> Building optimized on an IB-based machine:
>
> osc_rdma_data_move.c: In function 'ompi_osc_rdma_callback':
> osc_rdma_data_move.c:1633: warning: unused variable 'incoming_length'
> osc_rdma_data_move.c: In function 'ompi_osc_rdma_control_send':
> osc_rdma_data_move.c:221: warning: 'ptr' may be used uninitialized in this
> function
> osc_rdma_data_move.c:220: warning: 'frag' may be used uninitialized in
> this function
> osc_rdma_data_move.c: In function 'ompi_osc_gacc_long_start':
> osc_rdma_data_move.c:961: warning: 'acc_data' may be used uninitialized in
> this function
> osc_rdma_data_move.c: In function 'ompi_osc_rdma_gacc_start':
> osc_rdma_data_move.c:912: warning: 'acc_data' may be used uninitialized in
> this function
> osc_rdma_comm.c: In function 'ompi_osc_rdma_rget_accumulate_internal':
> osc_rdma_comm.c:943: warning: 'ptr' may be used uninitialized in this
> function
> osc_rdma_comm.c:940: warning: 'frag' may be used uninitialized in this
> function
> osc_rdma_data_move.c: In function 'ompi_osc_rdma_acc_long_start':
> osc_rdma_data_move.c:827: warning: 'acc_data' may be used uninitialized in
> this function
> osc_rdma_comm.c: In function 'ompi_osc_rdma_rget':
> osc_rdma_comm.c:736: warning: 'ptr' may be used uninitialized in this
> function
> osc_rdma_comm.c:733: warning: 'frag' may be used uninitialized in this
> function
> osc_rdma_comm.c: In function 'ompi_osc_rdma_accumulate_w_req':
> osc_rdma_comm.c:420: warning: 'ptr' may be used uninitialized in this
> function
> osc_rdma_comm.c:417: warning: 'frag' may be used uninitialized in this
> function
> osc_rdma_comm.c: In function 'ompi_osc_rdma_put_w_req':
> osc_rdma_comm.c:251: warning: 'ptr' may be used uninitialized in this
> function
> osc_rdma_comm.c:244: warning: 'frag' may be used uninitialized in this
> function
> osc_rdma_comm.c: In function 'ompi_osc_rdma_get':
> osc_rdma_comm.c:736: warning: 'ptr' may be used uninitialized in this
> function
> osc_rdma_comm.c:733: warning: 'frag' may be used uninitialized in this
> function
>
>
>
>
> vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
> vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
> this function
> vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
> vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
> this function
> vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
> vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
> this function
> vt_plugin_cntr.c: In function 'vt_plugin_cntr_write_post_mortem':
> vt_plugin_cntr.c:1139: warning: 'min_counter' may be used uninitialized in
> this function
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14840.php
>


Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
Ralph,

i could not find anything wrong with loop_spawn and unless i am missing
something obvious :

from mtt http://mtt.open-mpi.org/index.php?do_redir=2196

all tests run this month (both trunk and v1.8) failed (timeout), and there
was no error message such as
dpm_base_disconnect_init: error -12 in isend to process 1

loop_spawn tries to spawn 2000 tasks in 10 minutes.
my system is not fast enough to achieve this so the iteration count is
bumped
/* if time exceeded, then bump iteration count to the end */

the test would succeed in 10 minutes and a few seconds (the extra seconds being
required to complete the last spawn and MPI_Finalize())

the slurm timeout is set to 10 minutes exactly, so the job is aborted
before it has time to finish (and i believe it would have finished
successfully)

you can either increase the slurm timeout (10min30s looks good to me),
decrease nseconds (570 looks good to me) in loop_spawn.c, or run
mpirun ... dynamic/loop_spawn <nseconds>
where nseconds is "a bit less" than 600 seconds (once again, 570 looks good
to me)

did i miss something ?

Cheers,

Gilles


On Wed, May 28, 2014 at 12:53 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
>
> On 2014/05/28 12:10, Ralph Castain wrote:
> > my understanding is that there are two ways of seeing things :
> > a) the "R-way" : the problem is the parent should not try to communicate
> to already exited processes
> > b) the "J-way" : the problem is the children should have waited either
> in MPI_Comm_free() or MPI_Finalize()
> > I don't think you can use option (b) - we can't have the children
> lingering around for the parent to call finalize, if I'm understanding you
> correctly.
> you understood me correctly.
>
> once again, i did not start investigating loop_spawn.
>
> in the case of intercomm_create, we would not run into this if the
> application had explicitly called MPI_Comm_free in the parent.
> so in this case *only*, and as explained by Jeff, b) could be an option
> to make OpenMPI happy.
> (to be blunt : if the user is not happy with children lingering around,
> he can explicitly call MPI_Comm_free before calling MPI_Comm_disconnect)
>
> i will start investigating loop_spawn from now
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14879.php
>


Re: [OMPI devel] some info is not pushed into the dstore

2014-05-28 Thread Gilles Gouaillardet
i finally got it :-)

/* i previously got it "almost" right ... */

here is what happens on job 2 (with trunk) :
MPI_Intercomm_create calls ompi_comm_get_rprocs that calls ompi_proc_unpack
=> ompi_proc_unpack stores job 3 info into opal_dstore_peer


then ompi_comm_get_rprocs calls ompi_proc_set_locality(job 3)
=> ompi_proc_set_locality fetches job 3 info from
opal_dstore_internal (not found) and then opal_dstore_nonpeer (not found
again) and then fails.
this is just the consequence of ompi_proc_unpack storing job 3 info in
opal_dstore_peer and not in opal_dstore_nonpeer

i do not understand which of opal_dstore_peer and opal_dstore_nonpeer
should be used and when, so i wrote a defensive patch (fetch from
opal_dstore_nonpeer and then from opal_dstore_peer if not previously found).
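
for illustration, the defensive fetch looks roughly like this (the exact
opal_dstore.fetch argument list, key and handle names are assumptions based on
the calls quoted above, the actual patch may differ) :

/* try the non-peer store first, then fall back to the peer store,
 * where ompi_proc_unpack stored the info for job 3 */
rc = opal_dstore.fetch(opal_dstore_nonpeer,
                       (opal_identifier_t*)&proc->proc_name, key, &myvals);
if (OPAL_SUCCESS != rc) {
    rc = opal_dstore.fetch(opal_dstore_peer,
                           (opal_identifier_t*)&proc->proc_name, key, &myvals);
}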

please someone review this and comment/fix it if needed
(for example, store in opal_dstore_nonpeer instead of opal_dstore_peer
*or*
fetch in opal_dstore_peer instead of opal_dstore_nonpeer
and/or something else )

and then, locality is correctly set, coll ml receives correct information,
and the test no longer hangs when mpirun is invoked without --mca coll ^ml
(tested on a single node, single socket VM)

bottom line, job 2 *did* receive the information of job 3 but failed to
store/fetch it in the right dstore !

v1.8 is unaffected since there is only one dstore

Cheers,

Gilles


On Wed, May 28, 2014 at 4:51 AM, Ralph Castain  wrote:

> Hmmm...I did some digging, and the best I can tell is that root cause is
> that the second job ("b" in the test program) is never actually calling
> connect_accept!  This looks like a change may have occurred in
> Intercomm_create that is causing it to not recognize the need to do so.
>
> Anyone confirm that diagnosis?
>
> FWIW: job 1 clearly receives and has all the required info in the correct
> places - it is ready to provide it to job 2, if/when job 2 actually calls
> connect_accept.
>
> On May 27, 2014, at 10:13 AM, Ralph Castain  wrote:
>
> > Hi Gilles
> >
> > I concur on the typo and fixed it - thanks for catching it. I'll have to
> look into the problem you reported as it has been fixed in the past, and
> was working last I checked it. The info required for this 3-way
> connect/accept is supposed to be in the modex provided by the common
> communicator.
> >
> > On May 27, 2014, at 3:51 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >
> >> Folks,
> >>
> >> while debugging the dynamic/intercomm_create from the ibm test suite, i
> found something odd.
> >>
> >> i ran *without* any batch manager on a VM (one socket and four cpus)
> >> mpirun -np 1 ./dynamic/intercomm_create
> >>
> >> it hangs by default
> >> it works with --mca coll ^ml
> >>
> >> basically :
> >> - task 0 spawns task 1
> >> - task 0 spawns task 2
> >> - a communicator is created for the 3 tasks via MPI_Intercomm_create()
> >>
> >> MPI_Intercomm_create() calls ompi_comm_get_rprocs() which calls
> ompi_proc_set_locality()
> >>
> >> then, on task 1, ompi_proc_set_locality() calls
> >> opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...) which
> fails and this is OK
> >> then
> >> opal_dstore_fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...) which
> fails and this is *not* OK
> >>
> >> /* on task 2, the first fetch for "task 1" fails but the second success
> */
> >>
> >> my analysis is that when task 2 was created, it updated its
> opal_dstore_nonpeer with info from "task 1" which was previously spawned by
> task 0.
> >> when task 1 was spawned, task 2 did not exist yet and hence
> opal_dstore_nonpeer contains no reference to task 2.
> >> but when task 2 was spawned, opal_dstore_nonpeer of task 1 has not been
> updated, hence the failure
> >>
> >> (on task 1, proc_flags of task 2 has incorrect locality, this likely
> confuses coll ml and hang the test)
> >>
> >> should task1 have received new information when task 2 was spawned ?
> >> shoud task2 have sent information to task1 when it was spawned ?
> >> should task1 have (tried to) get fresh information before invoking
> MPI_Intercomm_create() ?
> >>
> >> incidentally, i found ompi_proc_set_locality calls opal_dstore.store
> with
> >> identifier &proc (the argument is &proc->proc_name everywhere else, so
> this
> >> is likely a bug/typo. the attached patch fixes this.
> >>
> >> Thanks in advance for your feedback,
> >>
> >> Gilles
> >> __

Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
Jeff,

On Wed, May 28, 2014 at 8:31 PM, Jeff Squyres (jsquyres) wrote:
> To be totally clear: MPI says it is erroneous for only some (not all)
processes in a communicator to call MPI_COMM_FREE.  So if that's the real
problem, then the discussion about why the parent(s) is(are) trying to
contact the children is moot -- the test is erroneous, and erroneous
application behavior is undefined.

This is definitely what happens : only some tasks call MPI_Comm_free()
i will commit my changes and the initially reported issue is solved :-)



about the "bonus points" :

v1.8 does not have this issue

i dug into it; bottom line, the parent (which, unlike the children, did not
call MPI_Comm_free) calls ompi_dpm_base_dyn_finalize, which tries to isend to
the already exited tasks.


bottom line, in pml_ob1_sendreq.h line 450

with v1.8
mca_bml_base_btl_array_get_size(&endpoint->btl_eager) = 0
nothing is sent but the isend is reported successful

with trunk
mca_bml_base_btl_array_get_size(&endpoint->btl_eager) = 1
and it then tries to send the message => BOOM

i found various things that seem counter-intuitive to me and will summarize
all this tomorrow.

Cheers,

Gilles


Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
Ralph,


On Wed, May 28, 2014 at 9:33 PM, Ralph Castain  wrote:

> This is definetly what happens : only some tasks call MPI_Comm_free()
>
>
> Really? I don't see how that can happen in loop_spawn - every process is
> clearly calling comm_free. Or are you referring to the intercomm_create
> test?
>
yes, i am referring to the intercomm_create test

about loop_spawn, i could not get any error on my single host single socket
VM.
(i tried --mca btl tcp,sm,self and --mca btl tcp,self)

MPI_Finalize will end up calling ompi_dpm_dyn_finalize which causes the
error message on the parent of intercomm_create.
a necessary condition is ompi_comm_num_dyncomm > 1
/* which by the way sounds odd to me, should it be 0 ? */
which imho cannot happen if all communicators have been freed

can you detail your full mpirun command line, the number of servers you are
using, the btl involved and the ompi release that can be used to reproduce
the issue ?

i will try to reproduce this myself

Cheers,

Gilles


Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
Ralph,

thanks for the info

can you detail your full mpirun command line, the number of servers you are
> using, the btl involved and the ompi release that can be used to reproduce
> the issue ?
>
>
> Running on only one server, using the current head of the svn repo. My
> cluster only has Ethernet, and I let it freely choose the BTLs (so I
> imagine the candidates are sm,self,tcp,vader). The cmd line is really
> trivial:
>
>
is MPSS installed and loaded ?
if yes, scif is also a candidate


> mpirun -n 1 ./loop_spawn
>
> I modified loop_spawn to only run 100 iterations as I am not patient
> enough to wait for 1000, and the number of iters isn't a factor so long as
> it is greater than 1. When the parent calls finalize, I get one of the
> following emitted for every iteration that was done:
>
> dpm_base_disconnect_init: error -12 in isend to process 0
>
>
so we do the same thing but have different behaviour ...

just to be sure :
are we talking about the loop_spawn test from the ibm test suite available
at
http://svn.open-mpi.org/svn/ompi-tests/trunk/ibm/dynamic/loop_spawn.c
and
http://svn.open-mpi.org/svn/ompi-tests/trunk/ibm/dynamic/loop_child.c

number of iterations is 2000 (and not 1000)
MPI_Comm_disconnect is invoked by both parent in loop_spawn.c :

MPI_Comm_free(&comm_merged);
MPI_Comm_disconnect(&comm_spawned);


and children in loop_child.c :

MPI_Comm_free(&merged);
MPI_Comm_disconnect(&parent);


is there any possibility you are running a different test called loop_spawn
or an older version of the dynamic/loop_spawn test from the ibm test suite ?

Cheers,

Gilles


Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
Ralph,

what if ?

the parent :
MPI_Comm_free(&merged);
MPI_Comm_disconnect(&comm);

and the child
MPI_Comm_free(&merged);
MPI_Comm_disconnect(&parent);

Gilles


> Good point - however, that doesn't fix it. Changing the Comm_free calls to
> Comm_disconnect results in the same error messages when the parent
> finalizes:
>
> Parent:
> MPI_Init( &argc, &argv);
>
> for (iter = 0; iter < 100; ++iter) {
> MPI_Comm_spawn(EXE_TEST, NULL, 1, MPI_INFO_NULL,
>0, MPI_COMM_WORLD, &comm, &err);
> printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);
>
> MPI_Intercomm_merge(comm, 0, &merged);
> MPI_Comm_rank(merged, &rank);
> MPI_Comm_size(merged, &size);
> printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n",
>iter, rank, size);
> MPI_Comm_disconnect(&merged);
> }
>
> MPI_Finalize();
>
>
> Child:
> MPI_Init(&argc, &argv);
> printf("Child: launch\n");
> MPI_Comm_get_parent(&parent);
> MPI_Intercomm_merge(parent, 1, &merged);
> MPI_Comm_rank(merged, &rank);
> MPI_Comm_size(merged, &size);
> printf("Child merged rank = %d, size = %d\n", rank, size);
>
>
> MPI_Comm_disconnect(&merged);
> MPI_Finalize();
>
>
>


Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
good to know !

how should we handle this within mtt ?
decrease nseconds to 570 ?

Cheers,

Gilles


On Thu, May 29, 2014 at 12:03 AM, Ralph Castain  wrote:

> Ah, that satisfied it!
>
> Sorry for the chase - I'll update my test.
>
>
> On May 28, 2014, at 7:55 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> what if ?
>
> the parent :
> MPI_Comm_free(&merged);
> MPI_Comm_disconnect(&comm);
>
> and the child
> MPI_Comm_free(&merged);
> MPI_Comm_disconnect(&parent);
>
> Gilles
>
>
>> Good point - however, that doesn't fix it. Changing the Comm_free calls
>> to Comm_disconnect results in the same error messages when the parent
>> finalizes:
>>
>> Parent:
>> MPI_Init( &argc, &argv);
>>
>> for (iter = 0; iter < 100; ++iter) {
>> MPI_Comm_spawn(EXE_TEST, NULL, 1, MPI_INFO_NULL,
>>0, MPI_COMM_WORLD, &comm, &err);
>> printf("parent: MPI_Comm_spawn #%d return : %d\n", iter, err);
>>
>> MPI_Intercomm_merge(comm, 0, &merged);
>> MPI_Comm_rank(merged, &rank);
>> MPI_Comm_size(merged, &size);
>> printf("parent: MPI_Comm_spawn #%d rank %d, size %d\n",
>>iter, rank, size);
>> MPI_Comm_disconnect(&merged);
>> }
>>
>> MPI_Finalize();
>>
>>
>> Child:
>> MPI_Init(&argc, &argv);
>> printf("Child: launch\n");
>> MPI_Comm_get_parent(&parent);
>> MPI_Intercomm_merge(parent, 1, &merged);
>> MPI_Comm_rank(merged, &rank);
>> MPI_Comm_size(merged, &size);
>> printf("Child merged rank = %d, size = %d\n", rank, size);
>>
>> MPI_Comm_disconnect(&merged);
>> MPI_Finalize();
>>
>>
>>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14896.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14897.php
>


Re: [OMPI devel] Trunk (RDMA and VT) warnings

2014-05-29 Thread Gilles Gouaillardet
Ralph,


On Wed, May 28, 2014 at 9:53 PM, Ralph Castain  wrote:

> gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
>  ./configure --prefix=/home/common/openmpi/build/svn-trunk
> --enable-mpi-java --enable-orterun-prefix-by-default
>
> More inline below
>
>
this looks like an up-to-date CentOS box.
i am unable to reproduce the warnings (may be uninitialized in this
function) with a similar box :-(



> On May 27, 2014, at 9:29 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> so far, it seems this is a false positive/compiler bug that could be
> triggered by inlining
>
> /* i could not find any path in which the variable is used uninitialized */
>
>
> I just glanced at the first one (line 221 of osc_rdma_data_move.c), and I
> can see what the compiler is complaining about - have gotten this in other
> places as well. The problem is that you pass the address of ptr into a
> function without first initializing the value of ptr itself. There is no
> guarantee (so far as the compiler can see) that this function will in fact
> set the value of ptr - you are relying solely on the fact that (a) you
> checked that function at one point in time and saw that it always gets set
> to something if ret == OMPI_SUCCESS, and (b) nobody changed that function
> since you checked.
>
> Newer compilers seem to be getting more defensive about such things and
> starting to "bark" when they see it. I think you are correct that inlining
> also impacts that situation, though I've also been seeing it when the
> functions aren't inlined.
>
>
i wrote a simple test program :

#include <string.h>

char * mystring = "hello";
static inline int setif(int mustset, char **ptr) {
if (!mustset) {
return 1;
}
*ptr = mystring;
return 0;
}

void good(int mustset) {
char * ptr;
char buf[256];
if (setif(mustset, &ptr) == 0) {
memcpy(buf, ptr, 6);
}
}

void bad(int mustset) {
char * ptr;
char buf[256];
if (setif(mustset, &ptr) != 0) {
memcpy(buf, ptr, 6);
}
}

please note that :
- the setif function is declared 'inline'
- the setif will set *ptr only if the 'mustset' parameter is nonzero and
then return 0
- the setif will leave *ptr unmodified if the 'mustset' parameter is zero
and then return 1

it is trivial that the 'good' function is ok whereas the 'bad' function has
an issue :
the compiler has a way to figure out that ptr will be uninitialized when
invoking memcpy
(since setif returned a non-zero status and hence did not set *ptr)

gcc -Wall -O0 test.c
does not complain

gcc -Wall -O1 test.c *does* complain
test.c:24: warning: ‘ptr’ may be used uninitialized in this function

if the 'inline' keyword is omitted, -O2 is needed to get a compiler warning.

bottom line, an optimized build (-O3 -finline-functions) correctly issues a
warning.
i checked osc_rdma_data_move.c and osc_rdma_frag.h again and again and i
could not find how ptr can be uninitialized in ompi_osc_rdma_control_send if
ompi_osc_rdma_frag_alloc returned OMPI_SUCCESS
/* not to mention i am unable to reproduce the warning */

about the compiler getting more defensive :

{ int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  rank++;
}

i never saw a compiler issue a warning that rank could be used
uninitialized



> Not sure what to suggest here - hate to add initialization steps in that
> sequence
>
> me too, and i do not see any warnings from the compiler

can you please confirm you can reproduce the issue on the most up to date
trunk revision, on an x86_64 box (one never knows ...) ?
then can you send the output of

cd ompi/mca/osc/rdma
touch osc_rdma_data_move.c
make -n osc_rdma_data_move.lo


Cheers,

Gilles


[OMPI devel] fortran types alignment

2014-05-30 Thread Gilles Gouaillardet
Folks,

i recently had to solve a tricky issue that involves alignment of fortran
types.

the attached program can be built and run on two tasks in order to evidence
the issue.

if gfortran is used (to build both openmpi and the test case), then the
test is successful
if ifort (Intel compiler) is used (to build both openmpi and the test
case), then the test fails.

this was mentionned in the openmpi users list quite a while ago at
http://www.open-mpi.org/community/lists/users/2010/07/13857.php

the root cause is that gfortran considers mpi_real8 must be aligned on 8 bytes,
whereas ifort considers mpi_real8 does not need to be aligned.
consequently, the derived data type ddt is built with an extent of 16
(gfortran) or 12 (ifort).
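
as a rough C analogy of the two layouts (this is only an illustration with a
gcc-specific attribute, it is not extracted from the fortran test case) :

#include <stdio.h>

/* aligned record: the double forces 8-byte alignment, so the struct is
 * padded to 16 bytes - this matches the gfortran extent */
struct aligned_rec { double r8; int i; };

/* packed record: no padding, 12 bytes - this matches the ifort extent */
struct packed_rec  { double r8; int i; } __attribute__((packed));

int main(void)
{
    printf("aligned extent: %zu\n", sizeof(struct aligned_rec)); /* 16 */
    printf("packed  extent: %zu\n", sizeof(struct packed_rec));  /* 12 */
    return 0;
}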


in order to determine the type alignment, configure builds a simple program
mixing c and fortran that involves a common block.
the default behaviour of ifort is to :
- *not* align common
- align records (aka the real8_int fortran type)
hence the mismatch and the failure.

the default behaviour of gfortran is to align both common and records,
hence the success.

/* i "extracted" conftest.c and conftestf.f from configure; they can be used
to build the conftest binary. conftest will store the alignment in the
conftestval file */

i am wondering how this should be dealt by OpenMPI.


here is a non exhaustive list of option :

a) do nothing, this is not related to openmpi, and even if we do something,
applications built with -noalign will break.
b) advise ifort users to configure with FCFLAGS="-align zcommons" since it
is likely this is what they want
c) advise ifort users to build their application with "-noalign" to be on
the safe side (modulo a performance penalty)
d) update OpenMPI so fortran type alignment is determined via a record
instead of a common if fortran >= 90 is used
(so far, i could not find any drawback in doing that)
e) advise ifort users to create ddt with MPI_DOUBLE instead of mpi_real8
(because this works (!), i did not dig to find out why)
f) other ...

any thoughts ?

Cheers,

Gilles


bcast_types.f90
Description: Binary data


conftestf.f
Description: Binary data
#include <stdio.h>
#include <stdlib.h>

#ifdef __cplusplus
extern "C" {
#endif
void align_(char *w, char *x, char *y, char *z)
{   unsigned long aw, ax, ay, az;
FILE *f=fopen("conftestval", "w");
if (!f) exit(1);
aw = (unsigned long) w;
ax = (unsigned long) x;
ay = (unsigned long) y;
az = (unsigned long) z;
if (! ((aw%16)||(ax%16)||(ay%16)||(az%16))) fprintf(f, "%d\n", 16);
else if (! ((aw%12)||(ax%12)||(ay%12)||(az%12))) fprintf(f, "%d\n", 12);
else if (! ((aw%8)||(ax%8)||(ay%8)||(az%8))) fprintf(f, "%d\n", 8);
else if (! ((aw%4)||(ax%4)||(ay%4)||(az%4))) fprintf(f, "%d\n", 4);
else if (! ((aw%2)||(ax%2)||(ay%2)||(az%2))) fprintf(f, "%d\n", 2);
else fprintf(f, "%d\n", 1);
fclose(f);
}
#ifdef __cplusplus
}
#endif



Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Gilles Gouaillardet
Artem,

this looks like the issue initially reported by Rolf
http://www.open-mpi.org/community/lists/devel/2014/05/14836.php

in http://www.open-mpi.org/community/lists/devel/2014/05/14839.php
i posted a patch and a workaround :
export OMPI_MCA_btl_openib_use_eager_rdma=0

i do not recall committing the patch to the trunk (Nathan is reviewing it).

if you have a chance to test it and if it works, i'll commit it tomorrow

Cheers,

Gilles



On Sun, Jun 1, 2014 at 3:57 PM, Artem Polyakov  wrote:

>
> 2. With fixed OpenIB  support  (add export OMPI_MCA_btl="openib,self" in
> attached batch script) I get followint error:
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
>
>


Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Gilles Gouaillardet
Artem,

thanks for the feedback.

i committed the patch to the trunk (r31922)

as i indicated in the commit log, this patch is likely suboptimal and has
room for improvement.

Jeff commented about the usnic related issue, so i will wait for a fix from
the Cisco folks.

Cheers,

Gilles



On Sun, Jun 1, 2014 at 10:12 PM, Artem Polyakov  wrote:

>
> I test your approach. Both:
> a) export OMPI_MCA_btl_openib_use_eager_rdma=0
> b) applying your patch and run without "export
> OMPI_MCA_btl_openib_use_eager_rdma=0"
> works well for me.
> This fixes first part of the problem: when OMPI_MCA_btl="openib,self"
>
> However once I comment out this statement thus giving OMPI the right to
> deside which BTL to use program hangs again. Here is additional information
> that can be useful:
>
> 1. If I set 1 slot per node this problem doesn't rise.
>
> 2. If I use at least 2 cores per node I can see this hang.
> Here is the backtraces for all branches of hanged program:
>
>


[OMPI devel] btl/scif: SIGSEGV in MPI_Finalize()

2014-06-02 Thread Gilles Gouaillardet
Folks,

this email contains :
- the description of a problem
- a possible fix that requires a review


PROBLEM :
i always get SIGSEGV when running
mpirun -np 2 --mca btl scif,self ./test_4610

test_4610.c is attached to https://svn.open-mpi.org/trac/ompi/ticket/4610

in order to reproduce the issue, MPSS must be loaded
/* only MPSS is required, MIC is *not* required */


here is what happens :

ompi_mpi_finalize calls
mca_base_framework_close(&ompi_mpool_base_framework)
at ompi/runtime/ompi_mpi_finalize:411

that ends up crashing when executing

mpool_grdma->resources.deregister_mem
at ompi/mca/mpool/grdma/mpool_grdma_module.c:115

where mpool_grdma->resources.deregister_mem *was* scif_dereg_mem

i wrote *was* and not *is* because before that, ompi_mpi_finalize called

mca_base_framework_close(&ompi_bml_base_framework)
at ompi/runtime/ompi_mpi_finalize:408

which indirectly unloaded the scif btl (and hence the scif_dereg_mem
function)



POSSIBLE FIX :

a naive approach is to call
mca_base_framework_close(&ompi_mpool_base_framework)
*before*
mca_base_framework_close(&ompi_bml_base_framework)

even though i ran very few tests and did not experience any issue, i simply do
not know whether this is the right thing to do and what could be the
consequences of swapping these two calls.
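
for reference, here is a minimal sketch of the proposed swap in
ompi_mpi_finalize(), following the existing error handling pattern (this is
not a committed change) :

/* close the mpool framework while the scif btl (and therefore
 * scif_dereg_mem) is still loaded ... */
if (OMPI_SUCCESS != (ret = mca_base_framework_close(&ompi_mpool_base_framework))) {
    return ret;
}
/* ... and only then close the bml framework, which unloads the btl components */
if (OMPI_SUCCESS != (ret = mca_base_framework_close(&ompi_bml_base_framework))) {
    return ret;
}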

could someone please review and comment on this ?

Thanks in advance,

Gilles


Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
Mike and Ralph,

i got the very same error.

in orte/mca/rtc/freq/rtc_freq.c at line 187
fp = fopen(filename, "r");
and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

there is no error check, so if fp is NULL, orte_getline() will call fgets()
on it and crash.

that can happen, for example, if the intel_pstate (or a similar cpufreq)
kernel module is not loaded on a CentOS 6 box, or if it is not even present
(depending on how the linux kernel was built)
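
the fix is basically a NULL check after fopen() ; here is a minimal sketch
(the surrounding per-core loop and the cleanup are assumptions, the actual fix
may differ) :

fp = fopen(filename, "r");
if (NULL == fp) {
    /* no cpufreq sysfs entry on this node : skip frequency control
     * instead of handing a NULL FILE* to orte_getline()/fgets() */
    free(filename);
    continue;
}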

Cheers,

Gilles


On Mon, Jun 2, 2014 at 1:19 PM, Ralph Castain  wrote:

> It's merrily passing all my MTT tests, so it appears to be fine for me.
>
> It would help if you provided *some* information along with these reports
> - like how was this configured, what environment are you running under, how
> many nodes were you using, etc. Otherwise, it's a totally useless report.
>
>
>


Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
Mike and Ralph,

i could not find a simple workaround.

for the time being, i committed r31926 and invite those who face a similar
issue to use the following workaround :
export OMPI_MCA_rtc_freq_priority=0
/* or mpirun --mca rtc_freq_priority 0 ... */

Cheers,

Gilles



On Mon, Jun 2, 2014 at 3:45 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> in orte/mca/rtc/freq/rtc_freq.c at line 187
> fp = fopen(filename, "r");
> and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
>
> there is no error check, so if fp is NULL, orte_getline() will call
> fgets() that will crash.
>


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Rolf,

i faced a slightly different problem, but one that is 100% reproducible :
- i launch mpirun (no batch manager) from a node with one IB port
- i use -host node01,node02 where node01 and node02 both have two IB ports
  on the same subnet

by default, this will hang.
if this is a "feature" (e.g. openmpi does not support this kind of
configuration) i am fine with it.

when i run mpirun --mca btl_openib_if_exclude mlx4_1 and the application
succeeds, everything works just fine.

if the application calls MPI_Abort() /* and even if all tasks call
MPI_Abort() */ then it will hang 100% of the time.
i do not see that as a feature but as a bug.

in another thread, Jeff mentioned that the usnic btl is doing stuff even
if there is no usnic hardware (this will be fixed shortly).
Do you still see intermittent hangs without listing usnic as a btl ?

Cheers,

Gilles



On Fri, May 30, 2014 at 12:11 AM, Rolf vandeVaart 
wrote:

> Ralph:
>
> I am seeing cases where mpirun seems to hang when one of the applications
> exits with non-zero.  For example, the intel test MPI_Cart_get_c will exit
> that way if there are not enough processes to run the test.  In most cases,
> mpirun seems to return fine with the error code, but sometimes it just
> hangs.   I first started noticing this in my mtt runs.  It seems (but not
> conclusive) that I see this when both the usnic and openib are built, even
> though I am only using the openib (as I have no usnic hardware).
>
>
>
> Anyone else seeing something like this?  Note that I see this on both 1.8
> and trunk, but I show trunk here.
>


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Jeff,

On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres) 
wrote:

> On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> > i faced a bit different problem, but that is 100% reproductible :
> > - i launch mpirun (no batch manager) from a node with one IB port
> > - i use -host node01,node02 where node01 and node02 both have two IB
> port on the
> >   same subnet
>
> FWIW: 2 IB ports on the same subnet?  That's not a good idea.
>
> could you please elaborate a bit ?
from what i saw, this basically doubles the bandwidth (imb PingPong
benchmark) between two nodes (!), which is not a bad thing.
i can only guess this might not scale (e.g. if 16 tasks are running on each
host, the overhead associated with the use of two ports might void the
extra bandwidth)


> > by default, this will hang.
>
> ...but it still shouldn't hang.  I wonder if it's somehow related to
> https://svn.open-mpi.org/trac/ompi/ticket/4442...?
>
>  i doubt it ...

here is my command line (from node0)
`which mpirun` -np 2 -host node1,node2 --mca rtc_freq_priority 0 --mca btl
openib,self --mca btl_openib_if_include mlx4_0 ./abort
on top of that, the usnic btl is not built (nor installed)


> if this is a "feature" (e.g. openmpi does not support this kind of
> configuration) i am fine with it.
> >
> > when i run mpirun --mca btl_openib_if_exclude mlx4_1, then if the
> application is a success, then it works just fine.
> >
> > if the application calls MPI_Abort() /* and even if all tasks call
> MPI_Abort() */ then it will hang 100% of the time.
> > i do not see that as a feature but as a bug.
>
> Yes, OMPI should never hang upon a call to MPI_Abort.
>
> Can you get some stack traces to show where the hung process(es) are
> stuck?  That would help Ralph pin down where things aren't working down in
> ORTE.
>

on node0 :

  \_ -bash
  \_ /.../local/ompi-trunk/bin/mpirun -np 2 -host node1,node2 --mca
rtc_freq_priority 0 --mc
  \_ /usr/bin/ssh -x node1 PATH=/.../local/ompi-trunk/bin:$PATH
; export PATH ; LD_LIBRAR
  \_ /usr/bin/ssh -x node2 PATH=/.../local/ompi-trunk/bin:$PATH
; export PATH ; LD_LIBRAR


pstack (mpirun) :
$ pstack 10913
Thread 2 (Thread 0x7f0ecad35700 (LWP 10914)):
#0  0x003ba66e15e3 in select () from /lib64/libc.so.6
#1  0x7f0ecad4391e in listen_thread () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#2  0x003ba72079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x003ba66e8b6d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f0ecc601700 (LWP 10913)):
#0  0x003ba66df343 in poll () from /lib64/libc.so.6
#1  0x7f0ecc6b1a05 in poll_dispatch () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#2  0x7f0ecc6a641c in opal_libevent2021_event_base_loop () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#3  0x004056a1 in orterun ()
#4  0x004039f4 in main ()


on node 1 :

 sshd: gouaillardet@notty
  \_ bash -c PATH=/.../local/ompi-trunk/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/...
  \_ /.../local/ompi-trunk/bin/orted -mca ess env -mca orte_ess_jobid
3459448832 -mca orte_ess_vpid
  \_ [abort] 

$ pstack (orted)
#0  0x7fe0ba6a0566 in vfprintf () from /lib64/libc.so.6
#1  0x7fe0ba6c9a52 in vsnprintf () from /lib64/libc.so.6
#2  0x7fe0ba6a9523 in snprintf () from /lib64/libc.so.6
#3  0x7fe0bbc019b6 in orte_util_print_jobids () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#4  0x7fe0bbc01791 in orte_util_print_name_args () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#5  0x7fe0b8e16a8b in mca_oob_tcp_component_hop_unknown () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#6  0x7fe0bb94ab7a in event_process_active_single_queue () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#7  0x7fe0bb94adf2 in event_process_active () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#8  0x7fe0bb94b470 in opal_libevent2021_event_base_loop () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#9  0x7fe0bbc1fa7b in orte_daemon () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#10 0x0040093a in main ()


on node 2 :

 sshd: gouaillardet@notty
  \_ bash -c PATH=/.../local/ompi-trunk/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/...
  \_ /.../local/ompi-trunk/bin/orted -mca ess env -mca orte_ess_jobid
3459448832 -mca orte_ess_vpid
  \_ [abort] 

$ pstack (orted)
#0  0x7fe8fc435e39 in strchrnul () from /lib64/libc.so.6
#1  0x7fe8fc3ef8f5 in vfprintf () from /lib64/libc.so.6
#2  0x7fe8fc41aa52 in vsnprintf () from /lib64/libc.so.6
#3  0x7fe8fc3fa523 in snprintf () from /lib64/libc.so.6
#4  0x7fe8fd9529b6 in orte_util_print_jobids () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#5  0x7fe8fd952791 in orte_util_print_name_args () from
/.../local/ompi-trunk/li

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
Mike,

did you apply the patch *and* mpirun --mca rtc_freq_priority 0 ?

*both* are required (--mca rtc_freq_priority 0 is not enough without the
patch)

can you please confirm there is no
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
(pseudo) file on your system ?

if this still does not work for you, then this might be a different issue i
was unable to reproduce.
in this case, could you run mpirun under gdb and send a gdb stack trace ?


Cheers,

Gilles




On Mon, Jun 2, 2014 at 8:26 PM, Mike Dubman 
wrote:

> more info, specifying --mca rtc_freq_priority 0 explicitly, generates
> different kind of fail:
>
> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
> -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
> [vegas12:13887] *** Process received signal ***
> [vegas12:13887] Signal: Segmentation fault (11)
> [vegas12:13887] Signal code: Address not mapped (1)
> [vegas12:13887] Failing at address: 0x20
> [vegas12:13887] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
> [vegas12:13887] [ 1]
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x77dcbe50]
> [vegas12:13887] [ 2]
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x77b1076c]
> [vegas12:13887] [ 3]
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
> [vegas12:13887] [ 4]
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
> [vegas12:13887] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
> [vegas12:13887] [ 6]
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
> [vegas12:13887] *** End of error message ***
> Segmentation fault (core dumped)
>
>
> On Mon, Jun 2, 2014 at 2:24 PM, Mike Dubman 
> wrote:
>
>> Hi,
>> This fix "orte_rtc_base_select: skip a RTC module if it has a zero
>> priority" did not help and jenkins stilll fails as before.
>> The ompi was configured:
>> --with-platform=contrib/platform/mellanox/optimized
>> --with-ompi-param-check --enable-picky --with-knem --with-mxm --with-fca
>>
>> The run was on single node:
>>
>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>  -np 8 -mca btl sm,tcp 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>> [vegas12:13834] *** Process received signal ***
>> [vegas12:13834] Signal: Segmentation fault (11)
>> [vegas12:13834] Signal code: Address not mapped (1)
>> [vegas12:13834] Failing at address: (nil)
>> [vegas12:13834] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>> [vegas12:13834] [ 1] /lib64/libc.so.6(fgets+0x2d)[0x3937466f2d]
>> [vegas12:13834] [ 2] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x1f3f)[0x741f5f3f]
>> [vegas12:13834] [ 3] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x279b)[0x741f679b]
>> [vegas12:13834] [ 4] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_rtc_base_select+0xe6)[0x77ddc036]
>> [vegas12:13834] [ 5] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_ess_hnp.so(+0x4056)[0x7725b056]
>> [vegas12:13834] [ 6] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_init+0x174)[0x77d97254]
>> [vegas12:13834] [ 7] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x863)[0x404613]
>> [vegas12:13834] [ 8] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>> [vegas12:13834] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>> [vegas12:13834] [10] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>> [vegas12:13834] *** End of error message ***
>> Segmentation fault (core dumped)
>>
>>
>>
>>
>> On Mon, Jun 2, 2014 at 10:19 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> Mike and Ralph,
>>>
>>> i could not find a simple workaround.
>>>
>>

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
OK,

please send me a clean gdb backtrace :
ulimit -c unlimited
mpirun ...
/* the crash should generate a core */
gdb mpirun core...
bt

if there is no core :
gdb mpirun
r -np ... --mca ... ...
/* and after the crash */
bt

then i can only review the code and hope to find the root cause of an
error i am unable to reproduce in my environment

Cheers,

Gilles




On Mon, Jun 2, 2014 at 9:03 PM, Mike Dubman 
wrote:

> Hi,
> The jenkins took your commit and applied automatically, I tried with mca
> flag later.
> Also, we don`t have /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> in our system, the cpuspeed daemon is off by default on all our nodes.
>
>
> Regards
> M
>
>
> On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Mike,
>>
>> did you apply the patch *and* mpirun --mca rtc_freq_priority 0 ?
>>
>> *both* are required (--mca rtc_freq_priority 0 is not enough without the
>> patch)
>>
>> can you please confirm there is no 
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>> (pseudo) file on your system ?
>>
>> if this still does not work for you, then this might be a different issue
>> i was unable to reproduce.
>> in this case, could you run mpirun under gdb and send a gdb stack trace ?
>>
>>
>> Cheers,
>>
>> Gilles
>>
>>
>>
>>
>> On Mon, Jun 2, 2014 at 8:26 PM, Mike Dubman 
>> wrote:
>>
>>> more info, specifying --mca rtc_freq_priority 0 explicitly, generates
>>> different kind of fail:
>>>
>>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>> -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>> [vegas12:13887] *** Process received signal ***
>>> [vegas12:13887] Signal: Segmentation fault (11)
>>> [vegas12:13887] Signal code: Address not mapped (1)
>>> [vegas12:13887] Failing at address: 0x20
>>> [vegas12:13887] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>> [vegas12:13887] [ 1]
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x77dcbe50]
>>> [vegas12:13887] [ 2]
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x77b1076c]
>>> [vegas12:13887] [ 3]
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
>>> [vegas12:13887] [ 4]
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>> [vegas12:13887] [ 5]
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>> [vegas12:13887] [ 6]
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>> [vegas12:13887] *** End of error message ***
>>> Segmentation fault (core dumped)
>>>
>>>
>>> On Mon, Jun 2, 2014 at 2:24 PM, Mike Dubman 
>>> wrote:
>>>
>>>> Hi,
>>>> This fix "orte_rtc_base_select: skip a RTC module if it has a zero
>>>> priority" did not help and jenkins stilll fails as before.
>>>> The ompi was configured:
>>>> --with-platform=contrib/platform/mellanox/optimized
>>>> --with-ompi-param-check --enable-picky --with-knem --with-mxm --with-fca
>>>>
>>>> The run was on single node:
>>>>
>>>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>>>  -np 8 -mca btl sm,tcp 
>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>>> [vegas12:13834] *** Process received signal ***
>>>> [vegas12:13834] Signal: Segmentation fault (11)
>>>> [vegas12:13834] Signal code: Address not mapped (1)
>>>> [vegas12:13834] Failing at address: (nil)
>>>> [vegas12:13834] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>>> [vegas12:13834] [ 1] /lib64/libc.so.6(fgets+0x2d)[0x3937466f2d]
>>>> [vegas12:13834] [ 2] 
>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x1f3f)[0x741f5f3f]
>>>> [vegas12:13834] [ 3] 
>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x279b)

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Jeff,

from the FAQ, openmpi should work on nodes that have different numbers of IB
ports (at least since v1.2)

about IB ports on the same subnet, all i was able to find is an explanation
of why i get this warning :

WARNING: There are more than one active ports on host '%s', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical OFA
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate OFA subnet that is
used between connected MPI processes must have different subnet ID
values.


i really had to read between the lines (and thanks to your email) in order
to figure out that putting IB ports on the same subnet is not the optimal way.

the following sentence is even more confusing :

"All this being said, note that there are valid network configurations
where multiple ports on the same host can share the same subnet ID value.
For example, two ports from a single host can be connected to the
*same* network
as a bandwidth multiplier or a high-availability configuration."


from a pragmatic standpoint (and this is not OpenMPI specific), the two IB
ports of the servers are physically connected to the same IB switch.

/* i would guess the NVIDIA Ivy cluster is similar in that sense */

a few years ago (e.g. the last time i checked), using different subnets was
possible by partitioning the switch via OpenSM. IMHO this was not an easy to
maintain solution (e.g. if a switch is replaced, the opensm config had
to be changed as well).

is there a simple and free way today to put ports physically connected to
the same switch in different subnets ?

/* such as tagged vlan in Ethernet => simple switch configuration, and the
host can decide by itself in which vlan a port must be */

Cheers,

Gilles

On Mon, Jun 2, 2014 at 8:50 PM, Jeff Squyres (jsquyres) 
wrote:

>  I'm AFK but let me reply about the IB thing: double ports/multi rail is
> a good thing. It's not a good thing if they're on the same subnet.
>
>  Check the FAQ - http://www.open-mpi.org/faq/?category=openfabrics - I
> can't see it well enough on the small screen of my phone, but I think
> there's a q on there about how multi rail destinations are chosen.
>
>  Spoiler: put your ports in different subnets so that OMPI makes
> deterministic choices.
>
> Sent from my phone. No type good.
>


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Ralph,

i will try this tomorrow

Cheers,

Gilles



On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain  wrote:

> I think I have this fixed with r31928, but have no way to test it on my
> machine. Please see if it works for you.
>
>
> On Jun 2, 2014, at 7:09 AM, Ralph Castain  wrote:
>
> This is indeed the problem - we are trying to send a message and don't
> know how to get it somewhere. I'll break the loop, and then ask that you
> run this again with -mca oob_base_verbose 10 so we can see the intended
> recipient.
>
> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> #7  0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from
> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14954.php
>


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Gilles Gouaillardet
Ralph,

i get no more complaints about rtc :-)

but MPI_Abort still hangs :-(

i reviewed my configuration and the hang is not related to one node having
one IB port and the other node having two IB ports.

the two nodes can establish TCP connections via :
- eth0 (but they are *not* on the same subnet)
- ib0 (and they *are* on the same subnet)

from the logs, it seems eth0 is "discarded" and only ib0 is used.
when the task aborts, it hangs ...



i attached the logs i took on two VMs with a "simpler" config :
- slurm0 has one eth port (eth0)
  * eth0 is on 192.168.122.100/24 (network 0)
  * eth0:1 is on 10.0.0.1/24 (network 0)
- slurm3 has two eth ports (eth0 and eth1)
  * eth0 is on 192.168.222.0/24 (network 1)
  * eth1 is on 10.0.0.2/24 (network 0)

network0 and network1 are connected to a router.


from slurm0, i launch :

mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10
./abort

the oob logs are attached

Cheers,

Gilles

On Tue, Jun 3, 2014 at 12:10 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Thanks Ralph,
>
> i will try this tomorrow
>
> Cheers,
>
> Gilles
>
>
>
> On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain  wrote:
>
>> I think I have this fixed with r31928, but have no way to test it on my
>> machine. Please see if it works for you.
>>
>>
>> On Jun 2, 2014, at 7:09 AM, Ralph Castain  wrote:
>>
>> This is indeed the problem - we are trying to send a message and don't
>> know how to get it somewhere. I'll break the loop, and then ask that you
>> run this again with -mca oob_base_verbose 10 so we can see the intended
>> recipient.
>>
>> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>> #7  0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from
>> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>>
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/06/14954.php
>>
>
>


abort.oob.log.gz
Description: GNU Zip compressed data


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Gilles Gouaillardet
Ralph,

slurm is installed and running on both nodes.

that being said, there is no running job on any node so unless
mpirun automagically detects slurm is up and running, i assume
i am running under rsh.

i can run the test again after i stop slurm if needed, but that will not
happen before tomorrow.

Cheers,

Gilles

> from slurm0, i launch :
>
> mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10
> ./abort
>
>
> Is this running under slurm? Or are you running under rsh?
>
>


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Gilles Gouaillardet
Ralph,

the application still hangs, i attached new logs.

on slurm0, if i run "/sbin/ifconfig eth0:1 down",
then the application does not hang any more

Cheers,

Gilles


On Wed, Jun 4, 2014 at 12:43 PM, Ralph Castain  wrote:

> I appear to have this fixed now - please give the current trunk (r31949 or
> above) a spin to see if I got it for you too.
>
>
>


abort.oob.2.log.gz
Description: GNU Zip compressed data


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Gilles Gouaillardet
Thanks Ralph,

for the time being, i just found a workaround :
--mca oob_tcp_if_include eth0

Generally speaking, is openmpi doing the wisest thing here ?
here is what i mean :
on the cluster i work on (4k+ nodes), each node has two IP interfaces :
 * eth0 (gigabit ethernet) : because of the cluster size, several subnets
are used.
 * ib0 (IP over IB) : only one subnet
i can easily understand such a large cluster is not so common, but on the
other hand i do not believe the IP configuration (subnetted gigE and single
subnet IPoIB) can be called exotic.

if nodes from different eth0 subnets are used, and if i understand
your previous replies correctly, orte will "discard" eth0 because the nodes
cannot contact each other "directly".
by "directly" i mean on the same subnet, which is not the case here. that being
said, the nodes can still communicate via IP thanks to IP routing (i mean IP
routing, i do *not* mean orte routing).
that means orte communications will use IPoIB, which might not be the best
thing to do since establishing an IPoIB connection can be slow (especially
at scale *and* if the arp table is not populated)

is my understanding correct so far ?

bottom line, i would have expected openmpi to use eth0 regardless of whether
IP routing is required, with ib0 simply not used (or possibly used as a
fallback option)

this leads to my next question : is the current default ok ? if not, should
we change it, and how ?
/*
imho :
 - IP routing is not always a bad/slow thing
 - gigE can sometimes be better than IPoIB
*/

i am fine if at the end :
- this issue is fixed
- we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0
the default if this is really thought to be best for the cluster (see the
sketch below ; and i can try to draft a faq if needed)
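
for the record, a sysadmin could make this the default cluster-wide with a
single line in the system-wide MCA parameter file; a sketch only, the path
depends on the installation prefix :

# <prefix>/etc/openmpi-mca-params.conf
oob_tcp_if_include = eth0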

Cheers,

Gilles

On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain  wrote:

>
> I'll work on it - may take a day or two to really fix. Only impacts
> systems with mismatched interfaces, which is why we aren't generally seeing
> it.
>
>


[OMPI devel] MPI_Comm_spawn affinity and coll/ml

2014-06-05 Thread Gilles Gouaillardet
Folks,

on my single socket four cores VM (no batch manager), i am running the
intercomm_create test from the ibm test suite.

mpirun -np 1 ./intercomm_create
=> OK

mpirun -np 2 ./intercomm_create
=> HANG :-(

mpirun -np 2 --mca coll ^ml  ./intercomm_create
=> OK

basically, the first two tasks will call MPI_Comm_spawn(2 tasks) twice,
followed by MPI_Intercomm_merge,
and the 4 spawned tasks will call MPI_Intercomm_merge followed by
MPI_Intercomm_create

i dug a bit into that issue and found two distinct problems :

1) binding :
tasks [0-1] (launched with mpirun) are bound on cores [0-1] => OK
tasks[2-3] (first spawn) are bound on cores [0-1] => ODD, i would have
expected [2-3]
tasks[4-5] (second spawn) are not bound at all => ODD again, could have
made sense if tasks[2-3] were bound on cores [2-3]
i observe the same behaviour  with the --oversubscribe mpirun parameter

2) coll/ml
coll/ml hangs when -np 2 (total 6 tasks, including 2 unbound tasks)
i suspect coll/ml is unable to handle unbound tasks.
if i am correct, should coll/ml detect this and simply automatically
disqualify itself ?

Cheers,

Gilles


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Gilles Gouaillardet
Jeff,

as pointed out by Ralph, i do wish to use eth0 for oob messages.

i work on a 4k+ nodes cluster with a very decent gigabit ethernet
network (reasonable oversubscription + switches
from a reputable vendor you are familiar with ;-) )
my experience is that IPoIB can be very slow at establishing a
connection, especially if the arp table is not populated
(as far as i understand, this involves the subnet manager and
performance can be very random especially if all nodes issue
arp requests at the same time)
on the other hand, performance is much more stable when using the
subnetted IP network.

as Ralph also pointed out, i can imagine some architects neglect their
ethernet network (e.g. highly oversubscribed + low end switches)
and in this case ib0 is a best fit for oob messages.

> As a simple solution, there could be an TCP oob MCA param that says 
> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
> routing will make everything work out ok).
+1 and/or an option to tell oob mca "do not discard the interface simply
because the peer IP is not in the same subnet"

Cheers,

Gilles

On 2014/06/05 23:01, Ralph Castain wrote:
> Because Gilles wants to avoid using IB for TCP messages, and using eth0 also 
> solves the problem (the messages just route)
>
> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres)  
> wrote:
>
>> Another random thought for Gilles situation: why not oob-TCP-if-include ib0? 
>>  (And not eth0)
>>



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Gilles Gouaillardet
Ralph,

sorry for my poor understanding ...

i tried r31956 and it solved both issues :
- MPI_Abort does not hang any more if nodes are on different eth0 subnets
- MPI_Init does not hang any more if hosts have different numbers of IB ports

this likely explains why you are having trouble replicating it ;-)

Thanks a lot !

Gilles


On Fri, Jun 6, 2014 at 11:45 AM, Ralph Castain  wrote:

> I keep explaining that we don't "discard" anything, but there really isn't
> any point to continuing trying to explain the system. With the announced
> intention of completing the move of the BTLs to OPAL, I no longer need the
> multi-module complexity in the OOB/TCP. So I have removed it and gone back
> to the single module that connects to everything.
>
> Try r31956 - hopefully will resolve your connectivity issues.
>
> Still looking at the MPI_Abort hang as I'm having trouble replicating it.
>
>
> On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > Jeff,
> >
> > as pointed by Ralph, i do wish using eth0 for oob messages.
> >
> > i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> > network (reasonable oversubscription + switches
> > from a reputable vendor you are familiar with ;-) )
> > my experience is that IPoIB can be very slow at establishing a
> > connection, especially if the arp table is not populated
> > (as far as i understand, this involves the subnet manager and
> > performance can be very random especially if all nodes issue
> > arp requests at the same time)
> > on the other hand, performance is much more stable when using the
> > subnetted IP network.
> >
> > as Ralf also pointed, i can imagine some architects neglect their
> > ethernet network (e.g. highly oversubscribed + low end switches)
> > and in this case ib0 is a best fit for oob messages.
> >
> >> As a simple solution, there could be an TCP oob MCA param that says
> "regardless of peer IP address, I can connect to them" (i.e., assume IP
> routing will make everything work out ok).
> > +1 and/or an option to tell oob mca "do not discard the interface simply
> > because the peer IP is not in the same subnet"
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/06/05 23:01, Ralph Castain wrote:
> >> Because Gilles wants to avoid using IB for TCP messages, and using eth0
> also solves the problem (the messages just route)
> >>
> >> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) 
> wrote:
> >>
> >>> Another random thought for Gilles situation: why not
> oob-TCP-if-include ib0?  (And not eth0)
> >>>
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14982.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14983.php
>


[OMPI devel] intermittent crash in mpirun upon non zero exit status

2014-06-09 Thread Gilles Gouaillardet
Folks,

several mtt tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a
similar stack trace.

For example, you can refer to :
http://mtt.open-mpi.org/index.php?do_redir=2199

the issue is not related whatsoever to the init_thread_serialized test
(other tests failed with similar symptoms)

so far i could find that :
- the issue is intermittent and can be hard to reproduce (1 failure over
1000 runs)
- per the mtt logs, it seems this is quite a recent failure
- a necessary condition is that MPI tasks exit with a non zero status after
having called MPI_Finalize()
- the crash occurs in orte/mca/oob/base/oob_base_frame.c at line 89 when
invoking
OBJ_RELEASE(value) ;
in some rare cases, value is NULL, which causes the crash.
- though i cannot incriminate one changeset in particular, i highly suspect
the changes that were made in order to address the issue(s) discussed at
http://www.open-mpi.org/community/lists/devel/2014/05/14908.php

i attached a patch that works around this issue.
i did not commit it because i consider this a workaround and not a
fix :
the root cause might be a tricky race condition ("abort" after
MPI_Finalize).


as a side note, here is the definition of OBJ_RELEASE
(opal/class/opal_object.h)
#if OPAL_ENABLE_DEBUG
#define OBJ_RELEASE(object) \
do {\
assert(NULL != ((opal_object_t *) (object))->obj_class);\
assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *)
(object))->obj_magic_id); \
} while (0)
...
#else
...

should we add the following assert at the beginning ?
assert(NULL != object);
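
for illustration only, the debug variant quoted above would then become
something like this (a sketch; the parts elided above are unchanged) :

#if OPAL_ENABLE_DEBUG
#define OBJ_RELEASE(object) \
do {\
assert(NULL != (object)); /* proposed additional check */\
assert(NULL != ((opal_object_t *) (object))->obj_class);\
assert(OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (object))->obj_magic_id); \
} while (0)
...
#else
...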


Thanks in advance for your comments,

Gilles
Index: orte/mca/oob/base/oob_base_frame.c
===
--- orte/mca/oob/base/oob_base_frame.c	(revision 31967)
+++ orte/mca/oob/base/oob_base_frame.c	(working copy)
@@ -13,6 +13,8 @@
  * Copyright (c) 2007  Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2013-2014 Los Alamos National Security, LLC. All rights
  * reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -86,7 +88,11 @@
 rc = opal_hash_table_get_first_key_uint64 (&orte_oob_base.peers, &key,
(void **) &value, &node);
 while (OPAL_SUCCESS == rc) {
-OBJ_RELEASE(value);
+/* in some rare cases, value can be NULL.
+   this would cause a crash in OBJ_RELEASE */
+if (NULL != value) {
+OBJ_RELEASE(value);
+}
 rc = opal_hash_table_get_next_key_uint64 (&orte_oob_base.peers, &key,
   (void **) &value, node, &node);
 }


[OMPI devel] v1.8 cannot compile since r31979

2014-06-10 Thread Gilles Gouaillardet
Folks,

in mca_oob_tcp_component_hop_unknown, the local variable bpr is not
declared, which prevents v1.8 from compiling.

/* there was a local variable called pr ; it seems it was removed instead of
being renamed to bpr */

the attached patch fixes this issue.

Cheers,

Gilles
Index: orte/mca/oob/tcp/oob_tcp_component.c
===
--- orte/mca/oob/tcp/oob_tcp_component.c	(revision 31980)
+++ orte/mca/oob/tcp/oob_tcp_component.c	(working copy)
@@ -928,6 +928,7 @@
 mca_oob_tcp_msg_error_t *mop = (mca_oob_tcp_msg_error_t*)cbdata;
 uint64_t ui64;
 orte_rml_send_t *snd;
+orte_oob_base_peer_t *bpr;
 
 opal_output_verbose(OOB_TCP_DEBUG_CONNECT, orte_oob_base_framework.framework_output,
 "%s tcp:unknown hop called for peer %s",


[OMPI devel] false positive mtt error on v1.8

2014-06-12 Thread Gilles Gouaillardet
Folks,

FYI

here are two mtt errors ( 1.8.2a1r31981 on cisco-usnic cluster)

http://mtt.open-mpi.org/index.php?do_redir=2202
http://mtt.open-mpi.org/index.php?do_redir=2203

OpenMPI is fine :-) the test itself was not :-(

i fixed this in r2379 (strncpy does not NULL terminate the cstr string
and random things happened when calling strcat)

Cheers,

Gilles

Index: ompitest_atoif.c
===
--- ompitest_atoif.c(revision 2378)
+++ ompitest_atoif.c(revision 2379)
@@ -47,6 +47,7 @@

 if (len > 0) {
 strncpy(cstr, str, len);
+cstr[len] = 0;
 strcat(cstr, "\n");
 }
 cstr[len + 1] = 0;



Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-19 Thread Gilles Gouaillardet
Ralph and Tetsuya,

is this related to the hang i reported at
http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ?

Nathan already replied he is working on a fix.

Cheers,

Gilles


On 2014/06/20 11:54, Ralph Castain wrote:
> My guess is that the coll/ml component may have problems with binding a 
> single process across multiple cores like that - it might be that we'll have 
> to have it check for that condition and disqualify itself. It is a 
> particularly bad binding pattern, though, as shared memory gets completely 
> messed up when you split that way.
>



Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Gilles Gouaillardet
Ralph,

my test VM has a single socket with four cores.
here is something odd i just found when running mpirun -np 2
intercomm_create.
tasks [0,1] are bound on cpus [0,1] => OK
tasks[2-3] (first spawn) are bound on cpus [2,3] => OK
tasks[4-5] (second spawn) are not bound (and cpuset is [0-3]) => OK

in ompi_proc_set_locality (ompi/proc/proc.c:228) on task 0
locality = opal_hwloc_base_get_relative_locality(opal_hwloc_topology,
                                                 ompi_process_info.cpuset,
                                                 cpu_bitmap);
where
ompi_process_info.cpuset is "0"
cpu_bitmap is "0-3"

and locality is set to OPAL_PROC_ON_HWTHREAD (!)

is this correct ?

i would have expected OPAL_PROC_ON_L2CACHE (since there is a single L2
cache on my vm,
as reported by lstopo) or even OPAL_PROC_LOCALITY_UNKNOWN

then in mca_coll_ml_comm_query (ompi/mca/coll/ml/coll_ml_module.c:2899),
the module disqualifies itself if !ompi_rte_proc_bound.
if locality were previously set to OPAL_PROC_LOCALITY_UNKNOWN, coll/ml
could instead check the flag of all the procs of the communicator and
disqualify itself if at least one of them is OPAL_PROC_LOCALITY_UNKNOWN.
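
to make the idea concrete, here is a rough, untested sketch of the kind of
check i mean inside mca_coll_ml_comm_query (the loop and the exact flag
handling are hypothetical) :

/* hypothetical sketch : walk the procs of the communicator and have
 * coll/ml disqualify itself if the locality of any of them is unknown */
for (int i = 0; i < ompi_comm_size(comm); i++) {
    ompi_proc_t *proc = ompi_comm_peer_lookup(comm, i);
    if (OPAL_PROC_LOCALITY_UNKNOWN == proc->proc_flags) {
        return NULL; /* let another coll component handle this communicator */
    }
}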


as you wrote, there might be a bunch of other corner cases.
that being said, i'll try to write a simple proof of concept and see if
this specific hang can be avoided

Cheers,

Gilles

On 2014/06/20 12:08, Ralph Castain wrote:
> It is related, but it means that coll/ml has a higher degree of sensitivity 
> to the binding pattern than what you reported (which was that coll/ml doesn't 
> work with unbound processes). What we are now seeing is that coll/ml also 
> doesn't work when processes are bound across sockets.
>
> Which means that Nathan's revised tests are going to have to cover a lot more 
> corner cases. Our locality flags don't currently include 
> "bound-to-multiple-sockets", and I'm not sure how he is going to easily 
> resolve that case.
>



Re: [OMPI devel] trunk hangs when I specify a particular binding by rankfile

2014-06-20 Thread Gilles Gouaillardet
Ralph,

Attached is a patch that fixes/works around my issue.
it is more of a proof of concept, so i did not commit it to the trunk.

basically :

opal_hwloc_base_get_relative_locality (topo, set1, set2)
sets the locality based on the deepest element that is part of both set1 and set2.
in my case, set2 means "all the available cpus", which is why the subroutine
returns OPAL_PROC_ON_HWTHREAD.

the patch uses opal_hwloc_base_get_relative_locality2 instead :
if one of the cpusets covers "all the available cpus", then the subroutine
simply returns OPAL_PROC_ON_NODE.
simply return OPAL_PROC_ON_NODE.

i am puzzled whether this is a bug in opal_hwloc_base_get_relative_locality
or in proc.c, which maybe should not call this subroutine because it does not
do what is expected.

Cheers,

Gilles

On 2014/06/20 13:59, Gilles Gouaillardet wrote:
> Ralph,
>
> my test VM is single socket four cores.
> here is something odd i just found when running mpirun -np 2
> intercomm_create.
> tasks [0,1] are bound on cpus [0,1] => OK
> tasks[2-3] (first spawn) are bound on cpus [2,3] => OK
> tasks[4-5] (second spawn) are not bound (and cpuset is [0-3]) => OK
>
> in ompi_proc_set_locality (ompi/proc/proc.c:228) on task 0
> locality =
> opal_hwloc_base_get_relative_locality(opal_hwloc_topology,
> 
> ompi_process_info.cpuset,
> 
> cpu_bitmap);
> where
> ompi_process_info.cpuset is "0"
> cpu_bitmap is "0-3"
>
> and locality is set to OPAL_PROC_ON_HWTHREAD (!)
>
> is this correct ?
>
> i would have expected OPAL_PROC_ON_L2CACHE (since there is a single L2
> cache on my vm,
> as reported by lstopo) or even OPAL_PROC_LOCALITY_UNKNOWN
>
> then in mca_coll_ml_comm_query (ompi/mca/coll/ml/coll_ml_module.c:2899)
> the module
> disqualifies itself if !ompi_rte_proc_bound.
> if locality were previously set to OPAL_PROC_LOCALITY_UNKNOWN, coll/ml
> could checked the flag
> of all the procs of the communicator and disqualify itself if at least
> one of them is OPAL_PROC_LOCALITY_UNKNOWN.
>
>
> as you wrote, there might be a bunch of other corner cases.
> that being said, i ll try to write a simple proof of concept and see it
> this specific hang can be avoided
>
> Cheers,
>
> Gilles
>
> On 2014/06/20 12:08, Ralph Castain wrote:
>> It is related, but it means that coll/ml has a higher degree of sensitivity 
>> to the binding pattern than what you reported (which was that coll/ml 
>> doesn't work with unbound processes). What we are now seeing is that coll/ml 
>> also doesn't work when processes are bound across sockets.
>>
>> Which means that Nathan's revised tests are going to have to cover a lot 
>> more corner cases. Our locality flags don't currently include 
>> "bound-to-multiple-sockets", and I'm not sure how he is going to easily 
>> resolve that case.
>>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/15036.php

Index: opal/mca/hwloc/base/base.h
===
--- opal/mca/hwloc/base/base.h  (revision 32056)
+++ opal/mca/hwloc/base/base.h  (working copy)
@@ -1,6 +1,8 @@
 /*
  * Copyright (c) 2011-2012 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -86,6 +88,9 @@
 OPAL_DECLSPEC opal_hwloc_locality_t 
opal_hwloc_base_get_relative_locality(hwloc_topology_t topo,
   char 
*cpuset1, char *cpuset2);

+OPAL_DECLSPEC opal_hwloc_locality_t 
opal_hwloc_base_get_relative_locality2(hwloc_topology_t topo,
+  char 
*cpuset1, char *cpuset2);
+
 OPAL_DECLSPEC int opal_hwloc_base_set_binding_policy(opal_binding_policy_t 
*policy, char *spec);

 /**
Index: opal/mca/hwloc/base/hwloc_base_util.c
===
--- opal/mca/hwloc/base/hwloc_base_util.c   (revision 32056)
+++ opal/mca/hwloc/base/hwloc_base_util.c   (working copy)
@@ -13,6 +13,8 @@
  * Copyright (c) 2012-2013 Los Alamos National Security, LLC.
  * All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (

[OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles Gouaillardet
WHAT: semantic change of opal_hwloc_base_get_relative_locality

WHY:  make it closer to what coll/ml expects.

  Currently, opal_hwloc_base_get_relative_locality means "at what level do 
these procs share cpus"
  however, coll/ml is using it as "at what level are these procs commonly 
bound".

  it is important to note that if a task is bound to all the available 
cpus, locality should
  be set to OPAL_PROC_ON_NODE only.
  /* e.g. on a single socket Sandy Bridge system, use OPAL_PROC_ON_NODE 
instead of OPAL_PROC_ON_L3CACHE */

  This has been initially discussed in the devel mailing list
  http://www.open-mpi.org/community/lists/devel/2014/06/15030.php

  as advised by Ralph, i browsed the source code looking for how the 
(ompi_proc_t *)->proc_flags is used.
  so far, it is mainly used to figure out whether the proc is on the same 
node or not.

  notable exceptions are :
   a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c : 
OPAL_PROC_ON_LOCAL_SOCKET
   b) ompi/mca/coll/fca/coll_fca_module.c and 
oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS

  about a) the new definition fixes a hang in coll/ml
  about b) FCA_IS_LOCAL_PROCESS looks like legacy code /* i could only find 
OMPI_PROC_FLAG_LOCAL in v1.3 */
  so this macro can simply be removed and replaced with 
OPAL_PROC_ON_LOCAL_NODE

  at this stage, i cannot find any objection to the described change.
  please report any objection and/or feel free to comment.

WHERE: see the two attached patches

TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago, June 
24-26.
 The RFC will become final only after the meeting.
 /* Ralph already added this topic to the agenda */

Thanks

Gilles

Index: opal/mca/hwloc/base/hwloc_base_util.c
===
--- opal/mca/hwloc/base/hwloc_base_util.c   (revision 32067)
+++ opal/mca/hwloc/base/hwloc_base_util.c   (working copy)
@@ -13,6 +13,8 @@
  * Copyright (c) 2012-2013 Los Alamos National Security, LLC.
  * All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -1315,8 +1317,7 @@
 hwloc_cpuset_t avail;
 bool shared;
 hwloc_obj_type_t type;
-int sect1, sect2;
-hwloc_cpuset_t loc1, loc2;
+hwloc_cpuset_t loc1, loc2, loc;

 /* start with what we know - they share a node on a cluster
  * NOTE: we may alter that latter part as hwloc's ability to
@@ -1337,6 +1338,19 @@
 hwloc_bitmap_list_sscanf(loc1, cpuset1);
 loc2 = hwloc_bitmap_alloc();
 hwloc_bitmap_list_sscanf(loc2, cpuset2);
+loc = hwloc_bitmap_alloc();
+hwloc_bitmap_or(loc, loc1, loc2);
+
+width = hwloc_get_nbobjs_by_depth(topo, 0);
+for (w = 0; w < width; w++) {
+obj = hwloc_get_obj_by_depth(topo, 0, w);
+avail = opal_hwloc_base_get_available_cpus(topo, obj);
+if ( hwloc_bitmap_isequal(avail, loc) ) {
+/* the task is bound to all the node cpus,
+   return without digging further */
+goto out;
+}
+}

 /* start at the first depth below the top machine level */
 for (d=1; d < depth; d++) {
@@ -1362,11 +1376,8 @@
 obj = hwloc_get_obj_by_depth(topo, d, w);
 /* get the available cpuset for this obj */
 avail = opal_hwloc_base_get_available_cpus(topo, obj);
-/* see if our locations intersect with it */
-sect1 = hwloc_bitmap_intersects(avail, loc1);
-sect2 = hwloc_bitmap_intersects(avail, loc2);
-/* if both intersect, then we share this level */
-if (sect1 && sect2) {
+/* see if our locations is included */
+if ( hwloc_bitmap_isincluded(loc, avail) ) {
 shared = true;
 switch(obj->type) {
 case HWLOC_OBJ_NODE:
@@ -1410,9 +1421,11 @@
 }
 }

+out:
 opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
 "locality: %s",
 opal_hwloc_base_print_locality(locality));
+hwloc_bitmap_free(loc);
 hwloc_bitmap_free(loc1);
 hwloc_bitmap_free(loc2);

Index: oshmem/mca/scoll/fca/scoll_fca.h
===
--- oshmem/mca/scoll/fca/scoll_fca.h(revision 32067)
+++ oshmem/mca/scoll/fca/scoll_fca.h(working copy)
@@ -1,12 +1,14 @@
 /**
- *   Copyright (c) 2013  Mellanox Technologies, Inc.
- *   All rights reserved.
- * $COPYRIGHT$
+ * Copyright (c) 2013  Mellanox Technologies, Inc.
+ * All rights reserved.
+ * Copyright (c) 2014  Research Organiza

[OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Gilles Gouaillardet
Folks,

this issue is related to the failures reported by mtt on the trunk when
the ibm test suite invokes MPI_Comm_spawn.

my test bed is made of 3 (virtual) machines with 2 sockets and 8 cpus
per socket each.

if i run on one host (without any batch manager)

mpirun -np 16 --host slurm1 --oversubscribe --mca coll ^ml
./intercomm_create

then the test is a success with the following warning  :

--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:slurm2
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--


now if i run on three hosts

mpirun -np 16 --host slurm1,slurm2,slurm3 --oversubscribe --mca coll ^ml
./intercomm_create

then the test is a success without any warning


but now, if i run on two hosts

mpirun -np 16 --host slurm1,slurm2 --oversubscribe --mca coll ^ml
./intercomm_create

then the test is a failure.

first, i get the following same warning :

--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:slurm2
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

followed by a crash

[slurm1:2482] *** An error occurred in MPI_Comm_spawn
[slurm1:2482] *** reported by process [2068512769,0]
[slurm1:2482] *** on communicator MPI_COMM_WORLD
[slurm1:2482] *** MPI_ERR_SPAWN: could not spawn processes
[slurm1:2482] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[slurm1:2482] ***and potentially your MPI job)


that being said, the following command works :

mpirun -np 16 --host slurm1,slurm2 --mca coll ^ml --bind-to none
./intercomm_create


1) what does the first message mean ?
is it a warning ? /* if yes, why does mpirun on two hosts fail ? */
is it a fatal error ? /* if yes, why does mpirun on one host succeed ? */

2) generally speaking, and assuming the first message is a warning,
should --oversubscribe automatically set overload-allowed ?
/* as far as i am concerned, that would be much more intuitive */
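
as a side note, the behaviour i expected can already be requested explicitly
by adding the qualifier mentioned in the warning to the binding directive,
something like (assuming that qualifier spelling) :

mpirun -np 16 --host slurm1,slurm2 --oversubscribe --mca coll ^ml \
    --bind-to core:overload-allowed ./intercomm_create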

Cheers,

Gilles



Re: [OMPI devel] OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles GOUAILLARDET
Ralph,

That makes perfect sense.

What about FCA_IS_LOCAL_PROCESS ?
Shall we keep it, or shall we use OPAL_PROC_ON_LOCAL_NODE directly ?

Cheers

Gilles

Ralph Castain  wrote:
>Hi Gilles
>
>
>We discussed this at the devel conference this morning. The root cause of the 
>problem is a test in coll/ml that we feel is incorrect - it basically checks 
>to see if the proc itself is bound, and then assumes that all other procs are 
>similarly bound. This in fact is never guaranteed to be true as someone could 
>use the rank_file method to specify that some procs are to be left unbound, 
>while others are to be bound to specified cpus.
>
>
>Nathan has looked at that check before and believes it isn't necessary. All 
>coll/ml really needs to know is that the two procs share the same node, and 
>the current locality algorithm will provide that information. We have asked 
>him to "fix" the coll/ml selection logic to resolve that situation.
>
>
>After then discussing the various locality definitions, it was our feeling 
>that the current definition is probably the better one unless you have a 
>reason for changing it other than coll/ml. If so, we'd be happy to revisit the 
>proposal.
>
>
>Make sense?
>
>Ralph
>
>
>
>
>On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet 
> wrote:
>
>WHAT: semantic change of opal_hwloc_base_get_relative_locality
>
>WHY:  make is closer to what coll/ml expects.
>
>      Currently, opal_hwloc_base_get_relative_locality means "at what level do 
>these procs share cpus"
>      however, coll/ml is using it as "at what level are these procs commonly 
>bound".
>
>      it is important to note that if a task is bound to all the available 
>cpus, locality should
>      be set to OPAL_PROC_ON_NODE only.
>      /* e.g. on a single socket Sandy Bridge system, use OPAL_PROC_ON_NODE 
>instead of OPAL_PROC_ON_L3CACHE */
>
>      This has been initially discussed in the devel mailing list
>      http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>
>      as advised by Ralph, i browsed the source code looking for how the 
>(ompi_proc_t *)->proc_flags is used.
>      so far, it is mainly used to figure out wether the proc is on the same 
>node or not.
>
>      notable exceptions are :
>       a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c : 
>OPAL_PROC_ON_LOCAL_SOCKET
>       b) ompi/mca/coll/fca/coll_fca_module.c and 
>oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>
>      about a) the new definition fixes a hang in coll/ml
>      about b) FCA_IS_LOCAL_SOCKET looks like legacy code /* i could only 
>found OMPI_PROC_FLAG_LOCAL in v1.3 */
>      so this macro can be simply removed and replaced with 
>OPAL_PROC_ON_LOCAL_NODE
>
>      at this stage, i cannot find any objection not to do the described 
>change.
>      please report if any and/or feel free to comment.
>
>WHERE: see the two attached patches
>
>TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago, June 
>24-26.
>         The RFC will become final only after the meeting.
>         /* Ralph already added this topic to the agenda */
>
>Thanks
>
>Gilles
>
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2014/06/15046.php
>
>


Re: [OMPI devel] OMPI devel] RFC: semantic change of opal_hwloc_base_get_relative_locality

2014-06-24 Thread Gilles Gouaillardet
Ralph,

i pushed the change (r32079) and updated the wiki.

the RFC can now be closed ; the consensus is that the semantics of
opal_hwloc_base_get_relative_locality
will not be changed since this is not needed : the hang is a coll/ml
bug, so it will be fixed within coll/ml.

Cheers,

Gilles

On 2014/06/25 1:12, Ralph Castain wrote:
> Yeah, we should make that change, if you wouldn't mind doing it.
>
>
>
> On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> That makes perfect sense.
>>
>> What about FCA_IS_LOCAL_PROCESS ?
>> Shall we keep it or shall we use directly OPAL_PROC_ON_LOCAL_NODE directly
>> ?
>>
>> Cheers
>>
>> Gilles
>>
>> Ralph Castain  wrote:
>> Hi Gilles
>>
>> We discussed this at the devel conference this morning. The root cause of
>> the problem is a test in coll/ml that we feel is incorrect - it basically
>> checks to see if the proc itself is bound, and then assumes that all other
>> procs are similarly bound. This in fact is never guaranteed to be true as
>> someone could use the rank_file method to specify that some procs are to be
>> left unbound, while others are to be bound to specified cpus.
>>
>> Nathan has looked at that check before and believes it isn't necessary.
>> All coll/ml really needs to know is that the two procs share the same node,
>> and the current locality algorithm will provide that information. We have
>> asked him to "fix" the coll/ml selection logic to resolve that situation.
>>
>> After then discussing the various locality definitions, it was our feeling
>> that the current definition is probably the better one unless you have a
>> reason for changing it other than coll/ml. If so, we'd be happy to revisit
>> the proposal.
>>
>> Make sense?
>> Ralph
>>
>>
>>
>> On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>>> WHAT: semantic change of opal_hwloc_base_get_relative_locality
>>>
>>> WHY:  make is closer to what coll/ml expects.
>>>
>>>   Currently, opal_hwloc_base_get_relative_locality means "at what
>>> level do these procs share cpus"
>>>   however, coll/ml is using it as "at what level are these procs
>>> commonly bound".
>>>
>>>   it is important to note that if a task is bound to all the
>>> available cpus, locality should
>>>   be set to OPAL_PROC_ON_NODE only.
>>>   /* e.g. on a single socket Sandy Bridge system, use
>>> OPAL_PROC_ON_NODE instead of OPAL_PROC_ON_L3CACHE */
>>>
>>>   This has been initially discussed in the devel mailing list
>>>   http://www.open-mpi.org/community/lists/devel/2014/06/15030.php
>>>
>>>   as advised by Ralph, i browsed the source code looking for how the
>>> (ompi_proc_t *)->proc_flags is used.
>>>   so far, it is mainly used to figure out wether the proc is on the
>>> same node or not.
>>>
>>>   notable exceptions are :
>>>a) ompi/mca/sbgp/basesmsocket/sbgp_basesmsocket_component.c :
>>> OPAL_PROC_ON_LOCAL_SOCKET
>>>b) ompi/mca/coll/fca/coll_fca_module.c and
>>> oshmem/mca/scoll/fca/scoll_fca_module.c : FCA_IS_LOCAL_PROCESS
>>>
>>>   about a) the new definition fixes a hang in coll/ml
>>>   about b) FCA_IS_LOCAL_SOCKET looks like legacy code /* i could only
>>> found OMPI_PROC_FLAG_LOCAL in v1.3 */
>>>   so this macro can be simply removed and replaced with
>>> OPAL_PROC_ON_LOCAL_NODE
>>>
>>>   at this stage, i cannot find any objection not to do the described
>>> change.
>>>   please report if any and/or feel free to comment.
>>>
>>> WHERE: see the two attached patches
>>>
>>> TIMEOUT: June 30th, after the Open MPI developers meeting in Chicago,
>>> June 24-26.
>>>  The RFC will become final only after the meeting.
>>>  /* Ralph already added this topic to the agenda */
>>>
>>> Thanks
>>>
>>> Gilles
>>>
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/06/15046.php
>>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/06/15049.php
>>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/15050.php



Re: [OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Gilles Gouaillardet
Hi Ralph,

On 2014/06/25 2:51, Ralph Castain wrote:
> Had a chance to review this with folks here, and we think that having
> oversubscribe automatically set overload makes some sense. However, we do
> want to retain the ability to separately specify oversubscribe and overload
> as well since these two terms don't mean quite the same thing.
>
> Our proposal, therefore, is to have the --oversubscribe flag set both the
> --map-by :oversubscribe and --bind-to :overload-allowed properties. If
> someone specifies both the --oversubscribe flag and a conflicting directive
> for one or both of the individual properties, then we'll error out with a
> "bozo" message.
i fully agree.
> The use-cases you describe are (minus the crash) correct as the warning
> only is emitted when you are overloaded (i.e., trying to bind to more cpus
> than you have). So you won't get any warning when running on three nodes as
> you have enough cpus for all the procs, etc.
>
> I'll investigate the crash once I get home and have access to a cluster
> again. The problem likely has to do with not properly responding to the
> failure to spawn.
hmm,

because you already made the change described above (r32072), the crash
does not occur any more.

about the crash, i see things the other way around : spawn should not have
failed.
/* or spawn should also have failed when running on a single node, at least
for the sake of consistency */

but like i said, it works now, so it might be just pedantic to point out a
bug that is still here but can no longer be triggered ...

Cheers,

Gilles


Re: [OMPI devel] trunk broken

2014-06-25 Thread Gilles Gouaillardet
Mike,

could you try again with

OMPI_MCA_btl=vader,self,openib

it seems the sm module causes a hang
(which later causes the timeout to send a SIGSEGV)

Cheers,

Gilles

On 2014/06/25 14:22, Mike Dubman wrote:
> Hi,
> The following commit broke trunk in jenkins:
>
 Per the OMPI developer conference, remove the last vestiges of
> OMPI_USE_PROGRESS_THREADS
>
> *22:15:09* + 
> LD_LIBRARY_PATH=/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib*22:15:09*
> + OMPI_MCA_scoll_fca_enable=1*22:15:09* +
> OMPI_MCA_scoll_fca_np=0*22:15:09* + OMPI_MCA_pml=ob1*22:15:09* +
> OMPI_MCA_btl=sm,self,openib*22:15:09* + OMPI_MCA_spml=yoda*22:15:09* +
> OMPI_MCA_memheap_mr_interleave_factor=8*22:15:09* +
> OMPI_MCA_memheap=ptmalloc*22:15:09* +
> OMPI_MCA_btl_openib_if_include=mlx4_0:1*22:15:09* +
> OMPI_MCA_rmaps_base_dist_hca=mlx4_0*22:15:09* +
> OMPI_MCA_memheap_base_hca_name=mlx4_0*22:15:09* +
> OMPI_MCA_rmaps_base_mapping_policy=dist:mlx4_0*22:15:09* +
> MXM_RDMA_PORTS=mlx4_0:1*22:15:09* +
> SHMEM_SYMMETRIC_HEAP_SIZE=1024M*22:15:09* + timeout -s SIGSEGV 3m
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/bin/oshrun
> -np 8 
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem*22:15:09*
> [vegas12:08101] *** Process received signal 22:15:09*
> [vegas12:08101] Signal: Segmentation fault (11)*22:15:09*
> [vegas12:08101] Signal code: Address not mapped (1)*22:15:09*
> [vegas12:08101] Failing at address: (nil)*22:15:09* [vegas12:08101] [
>



Re: [OMPI devel] trunk broken

2014-06-25 Thread Gilles Gouaillardet
Mike,

by the way, i pushed r32081.
that might not be needed in your environment, but i get a crash without it
in mine.

Cheers,

Gilles

On 2014/06/25 15:11, Gilles Gouaillardet wrote:
> could you try again with
>
> OMPI_MCA_btl=vader,self,openib
>
> it seems the sm module causes a hang
> (which later causes the timeout sending a SIGSEGV)
>
>
> On 2014/06/25 14:22, Mike Dubman wrote:
>> The following commit broke trunk in jenkins:
>>
>>>>> Per the OMPI developer conference, remove the last vestiges of
>> OMPI_USE_PROGRESS_THREADS
>>
>>



Re: [OMPI devel] MPI_Recv_init_null_c from intel test suite fails vs ompi trunk

2014-07-04 Thread Gilles Gouaillardet
Yossi,

thanks for reporting this issue.

i committed r32139 and r32140 to trunk in order to fix this issue (with
MPI_Startall)
and some misc extra bugs.

i also made CMR #4764 for the v1.8 branch (and asked George to review it)

Cheers,

Gilles

On 2014/07/03 22:25, Yossi Etigin wrote:
> Looks like this has to be fixed also for MPI_Startall, right?
>
>



Re: [OMPI devel] centos-7 / rhel-7 build fail (configure fails to recognize g++)

2014-07-06 Thread Gilles Gouaillardet
Olivier,

i was unable to reproduce the issue on a centos7 beta with :

- trunk (latest nightly snapshot)
- 1.8.1
- 1.6.5

the libtool-ltdl-devel package is not installed on this server

that being said, i did not use
--with-verbs
nor
--with-tm

since these packages are not installed on my server.

are you installing from a tarball or from svn/git/hg ?
could you also compress and include the config.log ?

Gilles

On 2014/07/04 22:00, olivier.laha...@free.fr wrote:
> On centos-7 beta, the configure script fails to recognize the g++ compiler. 
> checking for the C++ compiler vendor... unknown 
> checking if C and C++ are link compatible... no 
> ** 
> * It appears that your C++ compiler is unable to link against object 
> * files created by your C compiler. This generally indicates either 
> * a conflict between the options specified in CFLAGS and CXXFLAGS 
> * or a problem with the local compiler installation. More 
> * information (including exactly what command was given to the 
> * compilers and what error resulted when the commands were executed) is 
> * available in the config.log file in this directory. 
> ** 
>
>



Re: [OMPI devel] segv in ompi_info

2014-07-09 Thread Gilles Gouaillardet
Mike,

how do you test ?
i cannot reproduce the bug :

if i run ompi_info -a -l 9 | less

and press 'q' at an early stage (e.g. before all the output has been written
to the pipe),
then the less process exits, and ompi_info receives SIGPIPE and crashes (which
is normal unix behaviour)

now if i press the spacebar until the end of the output (e.g. i get the
(END) message from less)
and then press 'q', there is no problem.

strace -e signal ompi_info -a -l 9 | true
will cause ompi_info to receive a SIGPIPE

strace -e signal dd if=/dev/zero bs=1M count=1 | true
will cause dd to receive a SIGPIPE

unless i am missing something, i would conclude there is no bug
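
for reference, the same behaviour can be reproduced outside of Open MPI with a
minimal C program (a sketch, not taken from ompi_info) : writing to a pipe
whose read end has been closed raises SIGPIPE, and the default disposition
terminates the process, exactly like ompi_info when less exits early.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (0 != pipe(fds)) {
        perror("pipe");
        return EXIT_FAILURE;
    }
    /* close the read end, like 'less' quitting before the output is drained */
    close(fds[0]);
    char c = 'x';
    /* the default action for SIGPIPE terminates the process right here */
    if (write(fds[1], &c, 1) < 0) {
        /* only reached if SIGPIPE is ignored; write then fails with EPIPE */
        perror("write");
    }
    close(fds[1]);
    return EXIT_SUCCESS;
}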

Cheers,

Gilles

On 2014/07/09 19:33, Mike Dubman wrote:
> mxm only intercept signals and prints the stacktrace.
> happens on trunk as well.
> only when "| less" is used.
>
>
>
>
>
>
> On Tue, Jul 8, 2014 at 4:50 PM, Jeff Squyres (jsquyres) 
> wrote:
>
>> I'm unable to replicate.  Please provide more detail...?  Is this a
>> problem in the MXM component?
>>
>> On Jul 8, 2014, at 9:20 AM, Mike Dubman  wrote:
>>
>>>
>>> $/usr/mpi/gcc/openmpi-1.8.2a1/bin/ompi_info -a -l 9|less
>>> Caught signal 13 (Broken pipe)
>>>  backtrace 
>>>  2 0x00054cac mxm_handle_error()
>>  /var/tmp/OFED_topdir/BUILD/mxm-3.2.2883/src/mxm/util/debug/debug.c:653
>>>  3 0x00054e74 mxm_error_signal_handler()
>>  /var/tmp/OFED_topdir/BUILD/mxm-3.2.2883/src/mxm/util/debug/debug.c:628
>>>  4 0x0033fbe32920 killpg()  ??:0
>>>  5 0x0033fbedb650 __write_nocancel()  interp.c:0
>>>  6 0x0033fbe71d53 _IO_file_write@@GLIBC_2.2.5()  ??:0
>>>  7 0x0033fbe73305 _IO_do_write@@GLIBC_2.2.5()  ??:0
>>>  8 0x0033fbe719cd _IO_file_xsputn@@GLIBC_2.2.5()  ??:0
>>>  9 0x0033fbe48410 _IO_vfprintf()  ??:0
>>> 10 0x0033fbe4f40a printf()  ??:0
>>> 11 0x0002bc84 opal_info_out()
>>  
>> /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:853
>>> 12 0x0002c6bb opal_info_show_mca_group_params()
>>  
>> /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:658
>>> 13 0x0002c882 opal_info_show_mca_group_params()
>>  
>> /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:716
>>> 14 0x0002cc13 opal_info_show_mca_params()
>>  
>> /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:742
>>> 15 0x0002d074 opal_info_do_params()
>>  
>> /var/tmp/OFED_topdir/BUILD/openmpi-1.8.2a1/opal/runtime/opal_info_support.c:485
>>> 16 0x0040167b main()  ??:0
>>> 17 0x0033fbe1ecdd __libc_start_main()  ??:0
>>> 18 0x00401349 _start()  ??:0
>>> ===
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/07/15075.php
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/07/15076.php
>>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/07/15080.php



Re: [OMPI devel] trunk and fortran errors

2014-07-10 Thread Gilles Gouaillardet
On CentOS 5.x, gfortran is unable to compile this simple program :

subroutine foo ()
  use, intrinsic :: iso_c_binding, only : c_ptr
end subroutine foo

another workaround is to install gfortran 4.4
(yum install gcc44-gfortran)
and configure with
FC=gfortran44


On 2014/07/09 19:46, Jeff Squyres (jsquyres) wrote:
> This is almost certainly due to r32162 (Fortran commit from last night).
> [...]
> For the moment/as a workaround, use --disable-mpi-fortran in your builds if 
> you are building with an older gfortran.



Re: [OMPI devel] trunk and fortran errors

2014-07-11 Thread Gilles Gouaillardet
Thanks Jeff,

i confirm the problem is fixed on CentOS 5

i committed r32215 because some files were missing from the
tarball/nightly snapshot/make dist.

Cheers,

Gilles

On 2014/07/11 4:21, Jeff Squyres (jsquyres) wrote:
> As of r32204, this should be fixed.  Please let me know if it now works for 
> you.



Re: [OMPI devel] 100% test failures

2014-07-15 Thread Gilles GOUAILLARDET
r32236 is a suspect

i am afk

I just read the code and a class is initialized with opal_class_initialize the 
first time an object is instantiated with OBJ_NEW

I would simply revert r32236, or update opal_class_finalize so that 
free(cls->cls_construct_array); is only called if cls->cls_construct_array is not NULL
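
a minimal sketch of that second option (based on the line shown in the
backtrace below, class/opal_object.c:175) :

    /* in opal_class_finalize() : only free what opal_class_initialize()
     * actually allocated */
    if (NULL != cls->cls_construct_array) {
        free(cls->cls_construct_array);
        cls->cls_construct_array = NULL;
    }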

I hope this helps

Gilles

Ralph Castain  wrote:
>Hi folks
>
>
>The changes to opal_class_finalize are generating 100% segfaults on the trunk:
>
>
>175            free(cls->cls_construct_array);
>
>Missing separate debuginfos, use: debuginfo-install 
>glibc-2.12-1.132.el6_5.2.x86_64 libgcc-4.4.7-4.el6.x86_64 
>numactl-2.0.7-8.el6.x86_64
>
>(gdb) where
>
>#0  0x7f93e3206385 in opal_class_finalize () at class/opal_object.c:175
>
>#1  0x7f93e320b62f in opal_finalize_util () at runtime/opal_finalize.c:110
>
>#2  0x7f93e320b73b in opal_finalize () at runtime/opal_finalize.c:175
>
>#3  0x7f93e350e05f in orte_finalize () at runtime/orte_finalize.c:79
>
>#4  0x004057e2 in orterun (argc=4, argv=0x7fffe27ea718) at 
>orterun.c:1098
>
>#5  0x00403a04 in main (argc=4, argv=0x7fffe27ea718) at main.c:13
>
>
>Can someone please fix this?
>
>Ralph
>
>


Re: [OMPI devel] RFC: Add an __attribute__((destructor)) function to opal

2014-07-16 Thread Gilles Gouaillardet
Ralph and all,

my understanding is that opal_finalize_util aggressively tries to free memory
that would otherwise still be allocated.

another way of saying "make valgrind happy" is "fully automated memory
leak detection"
(Joost pointed to the -fsanitize=leak feature of gcc 4.9 in
http://www.open-mpi.org/community/lists/devel/2014/05/14672.php)

the following simple program :

#include <mpi.h>

int main(int argc, char* argv[])
{
  int ret, provided;
  ret = MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
  ret = MPI_T_finalize();
  return 0;
}

leaks a *lot* of objects (and might remove some environment variables as
well) which have been half destroyed by opal_finalize_util, for example :
- classes are still marked as initialized *but* their cls_construct_array
has been free'd
- the oob framework was not deallocated, it is still marked as
MCA_BASE_FRAMEWORK_FLAG_REGISTERED
  but some mca variables were freed, and that will cause problems when
MPI_Init tries to (re)start the tcp component

now my 0.02$ :

ideally, neither MPI_Finalize nor MPI_T_finalize would leak any memory, and the
framework would be re-initializable.
this could be a goal and George gave some good explanations on why it is
hard to achieve.
from my pragmatic point of view, and for this test case only, i am very
happy with a simple working solution,
even if it means that MPI_T_finalize leaks way too much memory in order
to work around the non re-initializable framework.

Cheers,

Gilles

On 2014/07/16 12:49, Ralph Castain wrote:
> I've attached a solution that blocks the segfault without requiring any 
> gyrations. Can someone explain why this isn't adequate?
>
> Alternate solution was to simply decrement opal_util_initialized in 
> MPI_T_finalize rather than calling finalize itself. Either way resolves the 
> problem in a very simple manner.
>



Re: [OMPI devel] Onesided failures

2014-07-16 Thread Gilles GOUAILLARDET
Rolf,

From the man page of MPI_Win_allocate_shared

It is the user's responsibility to ensure that the communicator comm represents 
a group of processes that can create a shared memory segment that can be 
accessed by all processes in the group

And from the mtt logs, you are running 4 tasks on 2 nodes.

Unless I am missing something obvious, I will update the test tomorrow and add 
a comm split to ensure MPI_Win_allocate_shared is called on a single-node 
communicator, and skip the test if that is impossible
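
something along these lines (just a sketch of the idea, not the actual
ibm/onesided test) :

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm node_comm;
    MPI_Win win;
    int *base;

    MPI_Init(&argc, &argv);
    /* group the tasks that can actually share memory (i.e. tasks running on
     * the same node) and allocate the shared window on that communicator */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            node_comm, &base, &win);
    /* ... exercise the shared window here ... */
    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}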

Cheers,

Gilles

Rolf vandeVaart  wrote:
>
>
>On both 1.8 and trunk (as Ralph mentioned in meeting) we are seeing three 
>tests fail.
>
>http://mtt.open-mpi.org/index.php?do_redir=2205
>
> 
>
>Ibm/onesided/win_allocate_shared
>
>Ibm/onesided/win_allocated_shared_mpifh
>
>Ibm/onesided/win_allocated_shared_usempi
>
> 
>
>Is there a ticket that covers these failures?
>
> 
>
>Thanks,
>
>Rolf
>
>


Re: [OMPI devel] PkgSrc build of 1.8.1 gives a portability error

2014-07-17 Thread Gilles Gouaillardet
Kevin,

thanks for providing the patch.

i pushed it into the trunk :
https://svn.open-mpi.org/trac/ompi/changeset/32253
and made a CMR so it can be available in v1.8.2 :
https://svn.open-mpi.org/trac/ompi/ticket/4793

Thanks,

Gilles

On 2014/07/17 13:32, Kevin Buckley wrote:
> I have been informed, by the PkgSrc build process, of the following
>
> ---8<-8<-8<-8<-8<-8<-8<-8<-8<--
> => Checking for portability problems in extracted files
> ERROR: [check-portability.awk] => Found test ... == ...:
> ERROR: [check-portability.awk] configure:  if test "$enable_oshmem" ==
> "yes" -a "$ompi_fortran_happy" == "1" -a \
>
> Explanation:
> ===
> The "test" command, as well as the "[" command, are not required to know
> the "==" operator. Only a few implementations like bash and some
> versions of ksh support it.
>
> When you run "test foo == foo" on a platform that does not support the
> "==" operator, the result will be "false" instead of "true". This can
> lead to unexpected behavior.
>
> There are two ways to fix this error message. If the file that contains
> the "test ==" is needed for building the package, you should create a
> patch for it, replacing the "==" operator with "=". If the file is not
> needed, add its name to the CHECK_PORTABILITY_SKIP variable in the
> package Makefile.
> ===
>
> ---8<-8<-8<-8<-8<-8<-8<-8<-8<--
>
> Obviously, the file that needs to be patched is really
>
> configure.ac
>
> and not
>
> configure
>
> but anyroad, the place at which the oshmem stanza has used the "non-portable"
> double-equals construct is shown in the following attempted patch
>
>
> ---8<-8<-8<-8<-8<-8<-8<-8<-8<--
> --- configure.ac.orig   2014-04-22 14:51:44.0 +
> +++ configure.ac
> @@ -611,8 +611,8 @@ m4_ifdef([project_ompi], [OMPI_SETUP_MPI
>  ])
>
>  AM_CONDITIONAL(OSHMEM_BUILD_FORTRAN_BINDINGS,
> -[test "$enable_oshmem" == "yes" -a "$ompi_fortran_happy" == "1" -a \
> -  "$OMPI_WANT_FORTRAN_BINDINGS" == "1" -a \
> +[test "$enable_oshmem" = "yes" -a "$ompi_fortran_happy" = "1" -a \
> +  "$OMPI_WANT_FORTRAN_BINDINGS" = "1" -a \
>"$enable_oshmem_fortran" != "no"])
>
>  # checkpoint results
> ---8<-8<-8<-8<-8<-8<-8<-8<-8<--
>
>



Re: [OMPI devel] Onesided failures

2014-07-17 Thread Gilles Gouaillardet
Rolf,

i committed r2389.

MPI_Win_allocate_shared is now invoked on a single-node communicator

Cheers,

Gilles

On 2014/07/16 22:59, Rolf vandeVaart wrote:
> Sounds like a good plan.  Thanks for looking into this Gilles!
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Gilles 
> GOUAILLARDET
> Sent: Wednesday, July 16, 2014 9:53 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Onesided failures
>
>
> Unless I am missing something obvious, I will update the test tomorrow and 
> add a comm split to ensure MPI_Win_allocate_shared is called from single node 
> communicator and skip the test if this impossible
>



Re: [OMPI devel] RFC: Add an __attribute__((destructor)) function to opal

2014-07-18 Thread Gilles Gouaillardet
+1 for the overall idea !

On Fri, Jul 18, 2014 at 10:17 PM, Ralph Castain  wrote:
>
> * add an OBJ_CLASS_DEREGISTER and require that all instantiations be
> matched by deregister at close of the framework/component that instanced
> it. Of course, that requires that we protect the class system against
> someone releasing/deconstructing an object after the class was deregistered
> since we don't know who might be using that class outside of where it was
> created.
>
my understanding is that in theory, we already have an issue and
fortunately, we do not hit it :
let's consider a framework/component that instantiates a class
(OBJ_CLASS_INSTANCE) *with a destructor*, allocates an object of this class
(OBJ_NEW) and expects "someone else" to free it (OBJ_RELEASE)
if this framework/component ends up in a dynamic library that is dlclose'd
when the framework/component is no longer used, then OBJ_RELEASE will try to
call the destructor, which is no longer accessible (since the lib was
dlclose'd)

i have not run into such a scenario yet, and of course, this does not
mean there is no problem. i did run into a somewhat similar situation,
described in http://www.open-mpi.org/community/lists/devel/2014/06/14937.php

back to OBJ_CLASS_DEREGISTER, what about an OBJ_CLASS_REGISTER in order to
make this symmetric and easier to debug ?

currently, OBJ_CLASS_REGISTER is "implied" the first time an object of a
given class is allocated. from opal_obj_new :
if (0 == cls->cls_initialized) opal_class_initialize(cls);

that could be replaced by an error if 0 == cls->cls_initialized
and OBJ_CLASS_REGISTER would simply call opal_class_initialize

of course, this change could be implemented only when compiled
with OPAL_ENABLE_DEBUG
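
for example, something along these lines (just a sketch built on the existing
cls_initialized field and opal_class_initialize, everything else is open for
discussion) :

    /* explicit registration, CLS is an (opal_class_t *) */
    #define OBJ_CLASS_REGISTER(CLS)                 \
        do {                                        \
            if (0 == (CLS)->cls_initialized) {      \
                opal_class_initialize(CLS);         \
            }                                       \
        } while (0)

and in opal_obj_new :

    #if OPAL_ENABLE_DEBUG
        if (0 == cls->cls_initialized) {
            /* error : OBJ_NEW on a class that was never OBJ_CLASS_REGISTER'ed */
            abort();
        }
    #else
        if (0 == cls->cls_initialized) opal_class_initialize(cls);
    #endif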

Cheers,

Gilles


Re: [OMPI devel] RFC: Add an __attribute__((destructor)) function to opal

2014-07-18 Thread Gilles Gouaillardet
>
> It would make sense, though I guess I always thought that was part of what
> happened in OBJ_CLASS_INSTANCE - guess I was wrong. My thinking was that
> DEREGISTER would be the counter to INSTANCE, and I do want to keep this
> from getting even more clunky - so maybe renaming INSTANCE to be REGISTER
> and completing the initialization inside it would be the way to go. Or
> renaming DEREGISTER to something more obviously the counter to INSTANCE?
>
>
just so we are clear :

on one hand OBJ_CLASS_INSTANCE is a macro that must be invoked "outside" of
a function :
It *statically* initializes a struct.

on the other hand, OBJ_CLASS_DEREGISTER is a macro that must be invoked
inside a function.

using OBJ_CLASS_REGISTER is not only about renaming, it also requires moving
all these invocations into functions.

my idea of having both OBJ_CLASS_INSTANCE and OBJ_CLASS_REGISTER is :
- we do not need to move OBJ_CLASS_INSTANCE into functions
- we can have two behaviours depending on OPAL_ENABLE_DEBUG :
OBJ_CLASS_REGISTER would simply do nothing if OPAL_ENABLE_DEBUG is zero
(and opal_class_initialize would still be invoked in opal_obj_new). that
could also be a bit faster than having only one OBJ_CLASS_REGISTER macro in
optimized mode.

that being said, i am also fine with simplifying this, remove
OBJ_CLASS_INSTANCE and use OBJ_CLASS_REGISTER and OBJ_CLASS_DEREGISTER


about the bug you hit, did you already solve it and how ?
a trivial workaround is not to dlclose the dynamic library (ok, that's
cheating ...)
a simple workaround (if it is even doable) is to declare the class
"somewhere else" so the (library containing the) class struct is not
dlclose'd before it is invoked (ok, that's ugly ...).

what i wrote earlier was misleading :
OBJ_CLASS_INSTANCE(class);
foo = OBJ_NEW(class);
then
opal_class_t class_class = {...};
foo->super.obj_class = &class_class;

class_class is no longer accessible when OBJ_RELEASE is called, since the
library was dlclose'd, so you do not even get a chance to invoke the
destructor ...

a possible workaround could be to malloc a copy of class_class, have
foo->super.obj_class point to it after each OBJ_NEW, and finally have its
cls_destruct_array point to NULL when closing the framework/component.
(of course that causes a leak ...)
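
roughly (purely illustrative, and it deliberately leaks the copy) :

    /* after each OBJ_NEW : point the object at a private copy of its class */
    opal_class_t *copy = malloc(sizeof(opal_class_t));
    memcpy(copy, foo->super.obj_class, sizeof(opal_class_t));
    foo->super.obj_class = copy;

    /* when closing the framework/component : make sure a late OBJ_RELEASE
     * cannot jump into code that has been dlclose'd */
    copy->cls_destruct_array = NULL;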

Cheers,

Gilles


Re: [OMPI devel] [OMPI users] Adding a new BTL

2016-02-25 Thread Gilles Gouaillardet
on master/v2.x, you also have to

rm -f opal/mca/btl/lf/.opal_ignore

(and this file would have been .ompi_ignore on v1.10)

Cheers,

Gilles

On Fri, Feb 26, 2016 at 7:44 AM, dpchoudh .  wrote:
> Hello Jeff and other developers:
>
> Attached are five files:
> 1-2: Full output from autogen.pl and configure, captured with: ./ 2>&1
> | tee .log
> 3. Makefile.am of the specific BTL directory
> 4. configure.m4 of the same directory
> 5. config.log, as generated internally by autotools
>
> Thank you
> Durga
>
>
> Life is complex. It has real and imaginary parts.
>
> On Thu, Feb 25, 2016 at 5:15 PM, Jeff Squyres (jsquyres)
>  wrote:
>>
>> Can you send the full output from autogen and configure?
>>
>> Also, this is probably better suited for the Devel list, since we're
>> talking about OMPI internals.
>>
>> Sent from my phone. No type good.
>>
>> On Feb 25, 2016, at 2:06 PM, dpchoudh .  wrote:
>>
>> Hello Gilles
>>
>> Thank you very much for your advice. Yes, I copied the templates from the
>> master branch to the 1.10.2 release, since the release does not have them.
>> And yes, changing the Makefile.am as you suggest did make the autogen error
>> go away.
>>
>> However, in the master branch, the autotools seem to be ignoring the new
>> btl directory altogether; i.e. I do not get a Makefile.in from the
>> Makefile.am.
>>
>> In the 1.10.2 release, doing an identical sequence of steps do create a
>> Makefile.in from Makefile.am (via autogen) and a Makefile from Makefile.in
>> (via configure), but of course, the new BTL does not build because the
>> include paths in master and 1.10.2 are different.
>>
>> My Makefile.am and configure.m4 are as follows. Any thoughts on what it
>> would take in the master branch to hook up the new BTL directory?
>>
>> opal/mca/btl/lf/configure.m4
>> # 
>> AC_DEFUN([MCA_opal_btl_lf_CONFIG],[
>> AC_CONFIG_FILES([opal/mca/btl/lf/Makefile])
>> ])dnl
>>
>> opal/mca/btl/lf/Makefile.am---
>> amca_paramdir = $(AMCA_PARAM_SETS_DIR)
>> dist_amca_param_DATA = netpipe-btl-lf.txt
>>
>> sources = \
>> btl_lf.c \
>> btl_lf.h \
>> btl_lf_component.c \
>> btl_lf_endpoint.c \
>> btl_lf_endpoint.h \
>> btl_lf_frag.c \
>> btl_lf_frag.h \
>> btl_lf_proc.c \
>> btl_lf_proc.h
>>
>> # Make the output library in this directory, and name it either
>> # mca__.la (for DSO builds) or libmca__.la
>> # (for static builds).
>>
>> if MCA_BUILD_opal_btl_lf_DSO
>> lib =
>> lib_sources =
>> component = mca_btl_lf.la
>> component_sources = $(sources)
>> else
>> lib = libmca_btl_lf.la
>> lib_sources = $(sources)
>> component =
>> component_sources =
>> endif
>>
>> mcacomponentdir = $(opallibdir)
>> mcacomponent_LTLIBRARIES = $(component)
>> mca_btl_lf_la_SOURCES = $(component_sources)
>> mca_btl_lf_la_LDFLAGS = -module -avoid-version
>>
>> noinst_LTLIBRARIES = $(lib)
>> libmca_btl_lf_la_SOURCES = $(lib_sources)
>> libmca_btl_lf_la_LDFLAGS = -module -avoid-version
>>
>> -
>>
>> Life is complex. It has real and imaginary parts.
>>
>> On Thu, Feb 25, 2016 at 3:10 AM, Gilles Gouaillardet
>>  wrote:
>>>
>>> Did you copy the template from the master branch into the v1.10 branch ?
>>> if so, replacing MCA_BUILD_opal_btl_lf_DSO with
>>> MCA_BUILD_ompi_btl_lf_DSO will likely solve your issue.
>>> you do need a configure.m4 (otherwise your btl will not be built) but
>>> you do not need AC_MSG_FAILURE
>>>
>>> as far as i am concerned, i would develop in the master branch, and
>>> then back port it into the v1.10 branch when it is ready.
>>>
>>> fwiw, btl used to be in ompi/mca/btl (still the case in v1.10) and
>>> have been moved into opal/mca/btl since v2.x
>>> so it is quite common a bit of porting is required, most of the time,
>>> it consists in replacing OMPI like macros by OPAL like macros
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thu, Feb 25, 2016 at 3:54 PM, dpchoudh .  wrote:
>>> > Hello all
>>> >
>>> > I am not sure if this question belongs in the user list or the
>>> > developer list, but because it is a simpler question I am trying the
>>> > user list first.

Re: [OMPI devel] error while compiling openmpi

2016-02-26 Thread Gilles Gouaillardet

Monika,

Can you send all the information listed here:

https://www.open-mpi.org/community/help/



btw, are you using a cross-compiler ?

can you try to compile this simple program :

typedef struct xxx xxx;

struct xxx {
int i;
xxx *p;
};

void yyy(xxx *x) {
x->i = 0;
x->p = x;
}


Cheers,

Gilles

On 2/26/2016 4:34 PM, Monika Hemnani wrote:
I'm trying to run Open MPI on MicroBlaze (soft core processor), with the 
xilkernel operating system (OS from Xilinx).

I'm getting errors in the file:  opal_object.h .


 This is the part of the code where I'm getting errors.

typedef struct opal_object_t opal_object_t;            //line 1
typedef struct opal_class_t opal_class_t;              //line 2
typedef void (*opal_construct_t) (opal_object_t *);    //line 3
typedef void (*opal_destruct_t) (opal_object_t *);     //line 4



struct opal_class_t {
    const char *cls_name;                   /**< symbolic name for class */
    opal_class_t *cls_parent;               /**< parent class descriptor */   //line 5
    opal_construct_t cls_construct;         /**< class constructor */
    opal_destruct_t cls_destruct;           /**< class destructor */
    int cls_initialized;                    /**< is class initialized */
    int cls_depth;                          /**< depth of class hierarchy tree */
    opal_construct_t *cls_construct_array;  /**< array of parent class constructors */
    opal_destruct_t *cls_destruct_array;    /**< array of parent class destructors */
    size_t cls_sizeof;                      /**< size of an object instance */
};

struct opal_object_t {
#if OMPI_ENABLE_DEBUG
    /** Magic ID -- want this to be the very first item in the
        struct's memory */
    uint64_t obj_magic_id;
#endif
    opal_class_t *obj_class;               /**< class descriptor */   //line 6
    volatile int32_t obj_reference_count;  /**< reference count */
#if OMPI_ENABLE_DEBUG
    const char* cls_init_file_name;  /**< In debug mode store the file where the object get contructed */
    int cls_init_lineno;             /**< In debug mode store the line number where the object get contructed */
#endif /* OMPI_ENABLE_DEBUG */
};



The errors are:

line 1: storage class specified for parameter 'opal_object_t'

line 2: storage class specified for parameter 'opal_class_t'

line 3 and 4: expected declaration specifiers or '...' before 
'opal_object_t'


line 5 and 6: expected specifier-qualifier-list before 'opal_class_t'


The compiler used is microblaze gcc 4.6.2

How can I remove these errors? Is there another way to write these 
definitions so that the compiler won't reject them?





___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2016/02/18631.php




Re: [OMPI devel] Confused topic for developer's meeting

2016-02-26 Thread Gilles Gouaillardet
Ralph,

The goal here is to allow a vendor to distribute binary orte frameworks
(on top of the binary components they can already distribute) that can be used
by a user-compiled "stock" Open MPI library.

Did I get it right so far ?


I gave it some thought and found that it could be simplified.

My understanding is that such frameworks can only be used by 3rd party
component(s) from an existing framework, am I right ?

In this case, what about creating libopen-rte-ext.so with all the 3rd party
frameworks (a single library, in case two frameworks need each other, to avoid
circular dependencies)? libopen-rte-ext.so depends on libopen-rte.so,
and 3rd party components depend on both rte libs.
Build order is important :
- libopen-rte.so
- libopen-rte-ext.so
- components
But there are no circular build dependencies (e.g. libopen-rte.so does not
depend on libopen-rte-ext.so), and there is no need to create another
project.
The only restriction I can think of is that 3rd party frameworks from
vendor A and vendor B cannot depend on each other. There can be
frameworks from different vendors, as long as they use distinct lib
names for their extended orte lib.


Any thoughts ?

Gilles

On Saturday, February 27, 2016, Ralph Castain  wrote:

> There was some confusion yesterday at the developer’s meeting over a topic
> regarding framework dependencies. I apologize - I should have looked over
> the agenda more closely in advance to ensure I recalled everything. Instead
> of the topic I had wanted to discuss, we wound up discussing embedding
> dependency arguments in component definitions.
>
> What I really wanted to raise was the issue of statically including all
> base framework definitions in the core library. In other words, if you want
> to define a new framework for ORTE, you _must_ put the framework header and
> the base directory in libopen-rte. This makes it impossible for a 3rd party
> to add an ORTE framework - they have to get the OMPI community to add it
> upstream first.
>
> Note that you _can_ add components dynamically - you just can’t add a
> framework.
>
> The only solution a 3rd party has today is to create another project layer
> in the code, and put the framework there. However, this may be somewhat
> limiting due to circular build dependencies if, for example, an ORTE
> component needed to reference the new framework, and the new
> project/framework has an explicit link to libopen-rte.
>
> Resolving this would require that we dynamically load the frameworks
> themselves, and not just the components. This point is what led to Jeff’s
> proposal about dependencies - however, the dependency definitions are not
> _required_ in order to make this change.
>
> So the question to the community is: does anyone see an issue with making
> frameworks into dll’s? Obviously, this approach won’t work for static
> builds, but that is a separate issue.
>
> Thanks
> Ralph
>
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/02/18634.php


Re: [OMPI devel] Segmentation fault in opal_fifo (MTT)

2016-03-01 Thread Gilles Gouaillardet
Adrian,

About bitness, it is correctly set when the MPI install succeeds.
See https://mtt.open-mpi.org/index.php?do_redir or even your successful
install on x86_64.

I suspect it is queried once the installation is successful, and I'll try
to have a look at it.

Cheers,

Gilles

On Tuesday, March 1, 2016, Adrian Reber  wrote:

> I have seen it before but it was not reproducible. I have now two
> segfaults in opal_fifo in today's MTT run on master and 2.x:
>
>
> https://mtt.open-mpi.org/index.php?do_redir=2270
> https://mtt.open-mpi.org/index.php?do_redir=2271
>
> The thing that is strange about the MTT output is that MTT does not detect
> the endianess and bitness correctly. It says on a x86_64 (Fedora 23)
> system:
>
> Endian: unknown
> Bitness: 32
>
> Endianess is not mentioned in mtt configuration file and bitness is
> commented out like this:
>
> #CN: bitness = 32
>
> which is probably something I copied from another mtt configuration file
> when initially creating mine.
>
> Adrian
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Searchable archives:
> http://www.open-mpi.org/community/lists/devel/2016/03/18645.php
>


Re: [OMPI devel] MTT setup updated to gcc-6.0 (pre)

2016-03-01 Thread Gilles Gouaillardet

fwiw

in a previous thread, Jeff Hammond explained this is why mpich relies on
C89 instead of C99: C89 appears to be a subset of C++11.
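
as a small illustration (not from that thread, just a sketch) : the following
is valid C99 but is rejected when compiled as C++11, which is the kind of
mismatch that staying close to C89 avoids :

struct point { int x, y; };

int scale(int n, int * restrict a)        /* 'restrict' is C99, not C++     */
{
    struct point p = { .y = 2, .x = 1 };  /* designated initializers are    */
    return n * (p.x + a[0]);              /* C99, rejected by C++11         */
}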

Cheers,

Gilles

On 3/2/2016 1:02 AM, Nathan Hjelm wrote:

I will add to how crazy this is. The C standard has been very careful
to not break existing code. For example the C99 boolean is _Bool not
bool because C reserves _[A-Z]* for its own use. This means a valid C89
program is a valid C99 and C11 program. It looks like this is not true in
C++.

-Nathan

On Thu, Feb 25, 2016 at 09:52:49PM +, Jeff Squyres (jsquyres) wrote:

On Feb 25, 2016, at 3:39 PM, Paul Hargrove  wrote:

A "bare" function name (without parens) is the address of the function, which 
can be converted to an int, long, etc.
So the "rank" identifier can validly refer to the function in this context.

I understand that there's logic behind this.  But it's still crazy to me that:

-
int foo(void) {
   int rank;
   printf("Value: %d", rank);
}
-

is ambiguous.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2016/02/18624.php


___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2016/03/18647.php




Re: [OMPI devel] RFC: warn if running a debug build

2016-03-01 Thread Gilles Gouaillardet

Jeff,

what about *not* issuing this warning if OpenMPI is built from git ?
that would be friendlier for OMPI developers,
and should basically *not* affect endusers, since they would rather 
build OMPI from a tarball.


Cheers,

Gilles

On 3/2/2016 1:00 PM, Jeff Squyres (jsquyres) wrote:

WHAT: Have orterun emit a brief warning when using a debug build.

WHY: So people stop trying to use a debug build for performance results.

WHERE: Mostly in orterun, but a little in orte/runtime

WHEN: No rush on this; the idea came up today at the MPI Forum.  We can discuss 
next Tuesday on the Webex.

MORE DETAIL:

https://github.com/open-mpi/ompi/pull/1417





Re: [OMPI devel] RFC: warn if running a debug build

2016-03-01 Thread Gilles Gouaillardet
In this case, should we only display the warning if the debug build was
implicit ?
for example, ./configure from git would display the warning (implicit debug),
but ./configure --enable-debug would not (explicit debug), regardless of
whether we built from git or a tarball.
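
for example (a hypothetical sketch only, this is *not* the code in the PR :
the OPAL_DEBUG_BUILD_IMPLICIT symbol and the environment variable name are
made up for the sake of the illustration) :

    #if OPAL_ENABLE_DEBUG && OPAL_DEBUG_BUILD_IMPLICIT
        if (NULL == getenv("OMPI_SUPPRESS_DEBUG_BUILD_WARNING")) {
            opal_output(0, "NOTE: this is a debug (non optimized) build of Open MPI,"
                           " do not use it for performance measurements");
        }
    #endif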



On 3/2/2016 1:13 PM, Jeff Squyres (jsquyres) wrote:

On Mar 1, 2016, at 10:06 PM, Gilles Gouaillardet  wrote:

what about *not* issuing this warning if OpenMPI is built from git ?
that would be friendlier for OMPI developers,
and should basically *not* affect endusers, since they would rather build OMPI 
from a tarball.

We're actually specifically trying to catch this case: someone builds from git, 
doesn't know (or care) that it's a debug build, and runs some performance tests 
with Open MPI.

We figured that it would be sufficient for OMPI devs to set the env variable in 
their shell startup files to avoid the message.




