[OMPI devel] intermittent crash in mpirun upon non zero exit status

2014-06-09 Thread Gilles Gouaillardet
Folks, several mtt tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a similar stack trace. For example, you can refer to: http://mtt.open-mpi.org/index.php?do_redir=2199 The issue is not related in any way to the init_thread_serialized test (other tests failed with similar symptoms)

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Gilles Gouaillardet
at the MPI_Abort hang as I'm having trouble replicating it. > > > On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet < > gilles.gouaillar...@iferc.org> wrote: > > > Jeff, > > > > as pointed by Ralph, i do wish using eth0 for oob messages. > > >

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Gilles Gouaillardet
Jeff, as pointed by Ralph, i do wish using eth0 for oob messages. i work on a 4k+ nodes cluster with a very decent gigabit ethernet network (reasonable oversubscription + switches from a reputable vendor you are familiar with ;-) ) my experience is that IPoIB can be very slow at establishing a

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Gilles Gouaillardet
Ralph, the application still hangs, i attached new logs. on slurm0, if i /sbin/ifconfig eth0:1 down then the application does not hang any more Cheers, Gilles On Wed, Jun 4, 2014 at 12:43 PM, Ralph Castain wrote: > I appear to have this fixed now - please give the

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Gilles Gouaillardet
Ralph, slurm is installed and running on both nodes. that being said, there is no running job on any node so unless mpirun automagically detects slurm is up and running, i assume i am running under rsh. i can run the test again after i stop slurm if needed, but that will not happen before

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Gilles Gouaillardet
a btl tcp,self --mca oob_base_verbose 10 ./abort the oob logs are attached Cheers, Gilles On Tue, Jun 3, 2014 at 12:10 AM, Gilles Gouaillardet < gilles.gouaillar...@gmail.com> wrote: > Thanks Ralph, > > i will try this tomorrow > > Cheers, > > Gilles > > > > On Tue

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
recipient. > > On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet < > gilles.gouaillar...@gmail.com> wrote: > > #7 0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from > /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Jeff, from the FAQ, openmpi should work on nodes that have a different number of IB ports (at least since v1.2) about IB ports on the same subnet, all i was able to find is an explanation about why i get this warning : WARNING: There are more than one active ports on host '%s', but the default

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
q/scaling_governor > in our system, the cpuspeed daemon is off by default on all our nodes. > > > Regards > M > > > On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet < > gilles.gouaillar...@gmail.com> wrote: > >> Mike, >> >> did you apply the

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
libc.so.6(__libc_start_main+0xfd)[0x393741ecdd] >> [vegas12:13834] [10] >> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909] >> [vegas12:13834] *** End of error message *** >> Segmentation fault (core dumped) >> >&

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Jeff, On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote: > On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet < > gilles.gouaillar...@gmail.com> wrote: > > > i faced a bit different problem, but that is 100% reproductible : > > -

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
:45 PM, Gilles Gouaillardet < gilles.gouaillar...@gmail.com> wrote: > in orte/mca/rtc/freq/rtc_freq.c at line 187 > fp = fopen(filename, "r"); > and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor" > > there is no error check, so if fp

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
Mike and Ralph, i got the very same error. in orte/mca/rtc/freq/rtc_freq.c at line 187 fp = fopen(filename, "r"); and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor" there is no error check, so if fp is NULL, orte_getline() will call fgets(), which will crash. that can happen

Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Gilles Gouaillardet
Artem, thanks for the feedback. i committed the patch to the trunk (r31922) as i indicated in the commit log, this patch is likely suboptimal and has room for improvement. Jeff commented about the usnic related issue, so i will wait for a fix from the Cisco folks. Cheers, Gilles On Sun,

Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Gilles Gouaillardet
Artem, this looks like the issue initially reported by Rolf http://www.open-mpi.org/community/lists/devel/2014/05/14836.php in http://www.open-mpi.org/community/lists/devel/2014/05/14839.php i posted a patch and a workaround : export OMPI_MCA_btl_openib_use_eager_rdma=0 i do not recall i

[OMPI devel] fortran types alignment

2014-05-30 Thread Gilles Gouaillardet
Folks, i recently had to solve a tricky issue that involves alignment of fortran types. the attached program can be used and ran on two tasks in order to evidence the issue. if gfortran is used (to build both openmpi and the test case), then the test is successful if ifort (Intel compiler) is

Re: [OMPI devel] Trunk (RDMA and VT) warnings

2014-05-29 Thread Gilles Gouaillardet
> this looks like an up-to-date CentOS box. i am unable to reproduce the warnings (may be uninitialized in this function) with a similar box :-( > On May 27, 2014, at 9:29 PM, Gilles Gouaillardet < > gilles.gouaillar...@gmail.com> wrote: > so far, it seems this is a false posit

Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
good to know ! how should we handle this within mtt ? decrease nseconds to 570 ? Cheers, Gilles On Thu, May 29, 2014 at 12:03 AM, Ralph Castain <r...@open-mpi.org> wrote: > Ah, that satisfied it! > > Sorry for the chase - I'll update my test. > > > On May 28,

Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
Ralph, On Wed, May 28, 2014 at 9:33 PM, Ralph Castain wrote: > This is definitely what happens : only some tasks call MPI_Comm_free() > > > Really? I don't see how that can happen in loop_spawn - every process is > clearly calling comm_free. Or are you referring to the

Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
Jeff, On Wed, May 28, 2014 at 8:31 PM, Jeff Squyres (jsquyres) > To be totally clear: MPI says it is erroneous for only some (not all) processes in a communicator to call MPI_COMM_FREE. So if that's the real problem, then the discussion about why the parent(s) is(are) trying to contact the

Re: [OMPI devel] some info is not pushed into the dstore

2014-05-28 Thread Gilles Gouaillardet
wrote: > > > Hi Gilles > > > > I concur on the typo and fixed it - thanks for catching it. I'll have to > look into the problem you reported as it has been fixed in the past, and > was working last I checked it. The info required for this 3-way > connect/accept is suppos

Re: [OMPI devel] Trunk (RDMA and VT) warnings

2014-05-28 Thread Gilles Gouaillardet
Ralph, can you please describe your environment (at least compiler (and version) + configure command line) i checked osc_rdma_data_move.c only : size_t incoming_length; is used to improve readability. it is used only in an assert clause and in OPAL_OUTPUT_VERBOSE one way to silence the warning

Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-28 Thread Gilles Gouaillardet
Ralph, On 2014/05/28 12:10, Ralph Castain wrote: > my understanding is that there are two ways of seeing things : > a) the "R-way" : the problem is the parent should not try to communicate to > already exited processes > b) the "J-way" : the problem is the children should have waited either in

Re: [OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-27 Thread Gilles Gouaillardet
FINALIZE is allowed to block if it needs to, such that > OMPI sending control messages to procs that are still "connected" (in the > MPI sense) should never cause a race condition. > > > > As such, this sounds like an OMPI bug. > > > > > > > > > > On May 27, 2014,

Re: [OMPI devel] OMPI Opengrok config

2014-05-27 Thread Gilles Gouaillardet
Thanks Jeff, i can only speak for myself : i use OpenGrok on a daily basis and it is a great help Cheers, Gilles On Wed, May 28, 2014 at 8:21 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com > wrote: > I can ask IU to adjust the OpenGrok config. > > > On May 27, 2014,

[OMPI devel] some info is not pushed into the dstore

2014-05-27 Thread Gilles Gouaillardet
Folks, while debugging the dynamic/intercomm_create from the ibm test suite, i found something odd. i ran *without* any batch manager on a VM (one socket and four cpus) mpirun -np 1 ./dynamic/intercomm_create it hangs by default it works with --mca coll ^ml basically : - task 0 spawns task 1 -

[OMPI devel] RFC: about dynamic/intercomm_create test from ibm test suite

2014-05-27 Thread Gilles Gouaillardet
Folks, currently, the dynamic/intercomm_create test from the ibm test suite outputs the following message : dpm_base_disconnect_init: error -12 in isend to process 1 the root cause is that task 0 tries to send messages to already exited tasks. one way of seeing things is that this is an application

[OMPI devel] OMPI Opengrok config

2014-05-27 Thread Gilles Gouaillardet
Folks, OMPI Opengrok search (http://svn.open-mpi.org/source) currently returns results for : - trunk - v1.6 branch - v1.5 branch - v1.3 branch imho, it could/should return results for the following branches : - trunk - v1.8 branch - v1.6 branch and maybe the v1.4 branch (and the v1.9 branch when

Re: [OMPI devel] Still problems with del_procs in trunkj

2014-05-26 Thread Gilles Gouaillardet
Rolf, the assert fails because the endpoint reference count is greater than one. the root cause is the endpoint has been added to the list of eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at ompi/mca/btl/openib/btl_openib_endpoint.c:1009) a simple workaround is not to use

Re: [OMPI devel] [OMPI svn] svn:open-mpi r31786 - trunk/ompi/mca/bml/r2

2014-05-20 Thread Gilles Gouaillardet
(e.g. use btl_send) - my suggested update of line 498 (e.g. use btl_send) was correct. Cheers, Gilles On 2014/05/20 4:06, Nathan Hjelm wrote: > On Mon, May 19, 2014 at 02:14:57PM +0900, Gilles Gouaillardet wrote: >>Nathan, >> >>do you mean the bug/typo was not a

Re: [OMPI devel] RFC : what is the best way to fix the memory leak in mca/pml/bfo

2014-05-19 Thread Gilles Gouaillardet
Thanks guys ! i committed r31816 (bfo: allocate the allocator in init rather than open) and made a CMR. based on mtt results, i will push George's commit tomorrow. and based on Rolf recommendation, i will do the CMR by the end of the week if everything works fine Gilles

Re: [OMPI devel] [OMPI svn] svn:open-mpi r31786 - trunk/ompi/mca/bml/r2

2014-05-19 Thread Gilles Gouaillardet
...@open-mpi.org] > Sent: Thursday, May 15, 2014 10:43 PM > To: s...@open-mpi.org > Subject: [OMPI svn] svn:open-mpi r31786 - trunk/ompi/mca/bml/r2 > > Author: ggouaillardet (Gilles Gouaillardet) > Date: 2014-05-16 00:43:18 EDT (Fri, 16 May 2014) > New Revision: 31786 > URL:

[OMPI devel] problem compiling trunk after r31810

2014-05-18 Thread Gilles Gouaillardet
Folks, i was unable to compile trunk after svn update. i use different directories (aka VPATH) for source and build error message is related to the missing shmem/java directory from the oshmem directory. The attached patch fixed this. /* that being said, i did not try to build java for oshmem,

[OMPI devel] RFC : what is the best way to fix the memory leak in mca/pml/bfo

2014-05-16 Thread Gilles Gouaillardet
Folks, there is a small memory leak in ompi/mca/pml/bfo/pml_bfo_component.c in my environment, this module is not used. this means mca_pml_bfo_component_open() and mca_pml_bfo_component_close() are invoked but mca_pml_bfo_component_init() and mca_pml_bfo_component_fini() are *not* invoked.

[OMPI devel] yesterday commits caused a crash in helloworld with --mca btl tcp, self

2014-05-16 Thread Gilles Gouaillardet
Folks, a simple mpirun -np 2 -host localhost --mca btl tcp,self mpi_helloworld crashes after some of yesterday's commits (i would blame r31778 and/or r31782, but i am not 100% sure) /* a list receives a negative value, so the program takes some time before crashing, symptom may vary from one system

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-15 Thread Gilles Gouaillardet
Nathan, this had no effect on my environment :-( i am not sure you can reuse mca_btl_scif_module.scif_fd with connect() i had to use a new scif fd for that. then i ran into an other glitch : if the listen thread does not scif_accept() the connection, the scif_connect() will take 30 seconds

[OMPI devel] r31765 causes crash in mpirun

2014-05-15 Thread Gilles Gouaillardet
Folks, since r31765 (opal/event: release the opal event context when closing the event base) mpirun crashes at the end of the job. for example : $ mpirun --mca btl tcp,self -n 4 `pwd`/src/MPI_Allreduce_user_c MPITEST info (0): Starting MPI_Allreduce_user() test MPITEST_results:

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-14 Thread Gilles Gouaillardet
Nathan, > Looks like this is a scif bug. From the documentation: and from the source code, scif_poll(...) simply calls poll(...) at least in MPSS 2.1 > Since that is not the case I will look through the documentation and see if there is a way other than pthread_cancel. what about : - use a

Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Gilles Gouaillardet
o it is certainly doable. > > I don't know the specifics of why Nathan's code is having trouble exiting, > but I suspect that a simple solution - not involving pthread_cancel - can be > readily developed. > > > On May 13, 2014, at 7:18 PM, Gilles Gouaillardet > <gilles

Re: [OMPI devel] scif btl side effects

2014-05-12 Thread Gilles Gouaillardet
i wrote this too early ... the attached program produces incorrect results when run with --mca btl scif,vader,self once the most up-to-date patch of #4610 has been applied, (at least) one bug remains, and it is in the scif btl the attached patch fixes it. Gilles On 2014/05/12 16:17, Gilles

Re: [OMPI devel] scif btl side effects

2014-05-12 Thread Gilles Gouaillardet
Nathan, On 2014/05/08 4:21, Hjelm, Nathan T wrote: > c) that being said, that should work so there is a bug > d) there is a regression in v1.8 and a bug that might have been always here > This is probably not a regression. The SCIF btl has been part of the 1.7 > series for some time. The nightly

Re: [OMPI devel] regression with derived datatypes

2014-05-09 Thread Gilles Gouaillardet
issue, i will investigate more next week Gilles On 2014/05/09 18:08, Gilles Gouaillardet wrote: > I ran some more investigations with --mca btl scif,self > > i found that the previous patch i posted was complete crap and i > apologize for it. > > on a brighter side, and imho, the

Re: [OMPI devel] regression with derived datatypes

2014-05-09 Thread Gilles Gouaillardet
I ran some more investigations with --mca btl scif,self i found that the previous patch i posted was complete crap and i apologize for it. on a brighter side, and imho, the issue only occurs if fragments are received (and then processed) out of order. /* i did not observe this with the tcp btl,

Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet > [gilles.gouaillar...@iferc.org] > Sent: Thursday, May 08, 2014 1:32 AM > To: Open MPI Developers > Subject: Re: [OMPI devel] regression with derived datatypes > > George, > > you do not need a

Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
Nathan and George, here are the output files of the original test_scif.c the command line was mpirun -np 2 -host localhost --mca btl scif,vader,self --mca mpi_ddt_unpack_debug 1 --mca mpi_ddt_pack_debug 1 --mca mpi_ddt_position_debug 1 a.out this is a silent failure and there is no core file

Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
George, you do not need any hardware, just download MPSS from Intel and install it. make sure the mic kernel module is loaded *and* you can read/write to the newly created /dev/mic/* devices. /* i am now running this on a virtual machine with no MIC whatsoever */ i was able to improve things a

Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
On 2014/05/08 2:15, Ralph Castain wrote: > I wonder if that might also explain the issue reported by Gilles regarding > the scif BTL? In his example, the problem only occurred if the message was > split across scif and vader. If so, then it might be that splitting messages > in general is

[OMPI devel] scif btl side effects

2014-05-07 Thread Gilles Gouaillardet
Dear OpenMPI Folks, i noticed some crashes when running OpenMPI (both latest v1.8 and trunk from svn) on a single linux system where a MIC is available. /* strictly speaking, MIC hardware is not needed: libscif.so, mic kernel module and accessible /dev/mic/* are enough */ the attached test_scif

Re: [OMPI devel] memory leaks upon dup/split/create of communicators?

2014-04-30 Thread Gilles Gouaillardet
Joost, i created #4581 and attached a patch (for the trunk) in order to solve this leak (and two similar ones) Cheers, Gilles On 2014/04/29 5:18, VandeVondele Joost wrote: > Hi, > > I applied the patch from ticket #4569 (to 1.8.1), and things improved (in > particular the reported issue is

Re: [OMPI devel] Wrong Endianness in Open MPI for external32 representation

2014-04-30 Thread Gilles Gouaillardet
Edgar and Christoph, i do not think ROMIO supports this yet. from ompi/mca/io/romio/romio/README "This version of ROMIO includes everything defined in the MPI I/O chapter except support for file interoperability [...]" i also ran ompi/mca/io/romio/romio/test/external32.c : on an x86_64 box

Re: [OMPI devel] MPI_Comm_create_group()

2014-04-30 Thread Gilles Gouaillardet
Lisandro, i assume you are running OpenMPI 1.8 r31554 fixes this issue (and some others) https://svn.open-mpi.org/trac/ompi/changeset/31554/branches/v1.8/ompi/communicator/comm_cid.c the root cause was an uninitialized variable (rc in ompi/communicator/comm_cid.c), and the issue only occurred when

Re: [OMPI devel] RFC: Remove heterogeneous support

2014-04-28 Thread Gilles Gouaillardet
homogeneous > cluster, even with --enable-hetero. I've run it that way on my cluster. > > On Apr 27, 2014, at 7:50 PM, Gilles Gouaillardet > <gilles.gouaillar...@iferc.org> wrote: > >> According to Jeff's comment, OpenMPI compiled with >> --enable-heterogeneous is brok

Re: [OMPI devel] RFC: Remove heterogeneous support

2014-04-27 Thread Gilles Gouaillardet
According to Jeff's comment, OpenMPI compiled with --enable-heterogeneous is broken even in a homogeneous cluster. as a first step, MTT could be run with OpenMPI compiled with --enable-heterogeneous and running on a homogeneous cluster (ideally on both little and big endian) in order to identify

[OMPI devel] MPI_Recv_init_null_c from intel test suite fails vs ompi trunk

2014-04-24 Thread Gilles Gouaillardet
Folks, Here is attached an oversimplified version of the MPI_Recv_init_null_c test from the intel test suite. the test works fine with v1.6, v1.7 and v1.8 branches but fails with the trunk. i wonder whether the bug is in OpenMPI or the test itself. on one hand, we could consider there is a bug

Re: [OMPI devel] coll/tuned MPI_Bcast can crash or silently fail when using distinct datatypes across tasks

2014-04-23 Thread Gilles Gouaillardet
my bad :-( this has just been fixed Gilles On 2014/04/23 14:55, Nathan Hjelm wrote: > The ompi_datatype_flatten.c file appears to be missing. Let me know once > it is committed and I will take a look. I will see if I can write the > RMA code using it over the next week or so. >

Re: [OMPI devel] coll/tuned MPI_Bcast can crash or silently fail when using distinct datatypes across tasks

2014-04-23 Thread Gilles Gouaillardet
George, i am sorry i cannot see how flatten datatype can be helpful here :-( in this example, the master must broadcast a long vector. this datatype is contiguous so the flatten'ed datatype *is* the type provided by the MPI application. how would pipelining happen in this case (e.g. who has to

Re: [OMPI devel] coll/tuned MPI_Bcast can crash or silently fail when using distinct datatypes across tasks

2014-04-23 Thread Gilles Gouaillardet
Nathan, i uploaded this part to github : https://github.com/ggouaillardet/ompi-svn-mirror/tree/flatten-datatype you really need to check the last commit : https://github.com/ggouaillardet/ompi-svn-mirror/commit/a8d014c6f144fa5732bdd25f8b6b05b07ea8 please consider this as experimental and

[OMPI devel] coll/tuned MPI_Bcast can crash or silently fail when using distinct datatypes across tasks

2014-04-17 Thread Gilles Gouaillardet
Dear OpenMPI developers, i just created #4531 in order to track this issue : https://svn.open-mpi.org/trac/ompi/ticket/4531 Basically, the coll/tuned implementation of MPI_Bcast does not work when two tasks use datatypes of different sizes. for example, if the root sends two large vectors of
