dition vis 1.8 - I agree it
> is not a blocker for that release.
>
> Ralph
>
> On Sep 22, 2014, at 4:49 PM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> here is the patch i am using so far.
>> i will res
my 0.02 US$ ...
Bitbucket's pricing model is per user (but with free public/private
repositories for up to 5 users)
whereas github's pricing is per *private* repository (public
repositories are free, with unlimited users)
from an OpenMPI point of view, this means :
- with github, only the private
Folks,
if i read between the lines, it looks like the next stable branch will be
v2.0 and not v1.10
is there a strong reason for that (such as ABI compatibility will break, or
a major but internal refactoring) ?
/* other than v1.10 is less than v1.8 when comparing strings :-) */
Cheers,
Gilles
go ahead, and thanks
>
> On Sep 20, 2014, at 10:26 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> Thanks for the pointer George !
>
> On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> Or copy the han
Thanks for the pointer George !
On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca wrote:
> Or copy the handshake protocol design of the TCP BTL...
>
>
the main difference between oob/tcp and btl/tcp is the way we resolve the
situation in which two processes send their first
moved from
MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED,
retry() should have been invoked ?
Cheers,
Gilles
On 2014/09/18 17:02, Ralph Castain wrote:
> The patch looks fine to me - please go ahead and apply it. Thanks!
>
> On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet
> <gilles.goua
Folks,
r32716 broke v1.8 :-(
the root cause is that it included MCA_BASE_VAR_TYPE_VERSION_STRING, which has
not yet landed into v1.8
the attached trivial patch fixes this issue
Can the RM/GK please review it and apply it ?
Cheers,
Gilles
Index: opal/mca/base/mca_base_var.c
Folks,
for both trunk and v1.8 branch, configure takes the --with-threads option.
valid usages are
--with-threads, --with-threads=yes, --with-threads=posix and
--with-threads=no
/* v1.6 used to support the --with-threads=solaris */
if we try to configure with --with-threads=no, this will result
t triggers it so I
> can continue debugging
>
> Ralph
>
> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Thanks Ralph,
>>
>> this is much better but there is still a bug :
>> with the very same scen
then have the higher vpid retry while the lower one waits.
> The logic for that was still in place, but it looks like you are hitting a
> different code path, and I found another potential one as well. So I think I
> plugged the holes, but will wait to hear if you confirm.
>
Ralph,
here is the full description of a race condition in oob/tcp i very briefly
mentioned in a previous post :
the race condition can occur when two not-yet-connected orteds try to send a
message to each other for the first time and at the same time.
that can occur when running mpi helloworld on
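As a rough illustration of the tie-break Ralph describes above, a minimal
sketch in C (names and structure hypothetical, not the actual oob/tcp code):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: when two peers open connections to each other at
 * the same time, resolve the race deterministically by vpid. The peer
 * with the higher vpid retries its outgoing connection; the peer with
 * the lower vpid drops its attempt and accepts the incoming one. */
static bool keep_outgoing_connection(uint32_t my_vpid, uint32_t peer_vpid)
{
    return my_vpid > peer_vpid;
}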
Howard, and Rolf,
i initially reported the issue at
http://www.open-mpi.org/community/lists/devel/2014/09/15767.php
r32659 is neither a fix nor a regression, it simply aborts instead of
calling OBJ_RELEASE(mpi_comm_world).
/* my point here is we should focus on the root cause and not the
consequence */
and 3 enter the allgather at the same time, they will send a
message to each other at the same time and rml fails to establish the
connection. i could not find out whether this is linked to my changes...
Cheers,
Gilles
>
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <
> gilles.gouai
, 2014, at 4:02 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph,
>>
>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>> it does not invoke pmix_server_release
>> because allgather_stub was not previou
ely not the right fix, it was very lightly
tested, but so far, it works for me ...
Cheers,
Gilles
On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> Ralph,
>
> things got worse indeed :-(
>
> now a simple hello world involving two hosts hangs in mpi_init.
> there is still a race conditio
and for each of them to establish a
>> persistent receive. They then can use the signature to tell which collective
>> the incoming message belongs to.
>>
>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>
>>
>> On Sep 9, 2014
ggouaillardet -> ggouaillardet
On 2014/09/10 19:46, Jeff Squyres (jsquyres) wrote:
> As the next step of the planned migration to Github, I need to know:
>
> - Your Github ID (so that you can be added to the new OMPI git repo)
> - Your SVN ID (so that I can map SVN->Github IDs, and therefore map
Folks,
Since r32672 (trunk), grpcomm/rcd is the default module.
the attached spawn.c test program is a trimmed version of the
spawn_with_env_vars.c test case
from the ibm test suite.
when invoked on two nodes :
- the program hangs with -np 2
- the program can crash with np > 2
error message is
l I can say is that
> "tolower" on my CentOS box is defined in , and that has to be
> included in the misc.h header.
>
>
> On Sep 8, 2014, at 5:49 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph and Brice,
>>
Ralph and Brice,
i noted Ralph committed r32685 in order to fix a problem with Intel
compilers.
A very similar issue occurs with clang 3.2 (gcc and clang 3.4 are ok
for me)
imho, the root cause is in the hwloc configure.
in this case, configure fails to detect strncasecmp is part of the C
Folks,
when OpenMPI is configured with --disable-weak-symbols and a fortran
2008 capable compiler (e.g. gcc 4.9),
MPI_STATUSES_IGNORE invoked from Fortran is not interpreted as it
should be.
/* instead of being a special array of statuses, it is an array of one
status, which can lead to
ion of the coll/ml locality requirement.
>
>Did this patch "fix" the problem by avoiding the segfault due to coll/ml
>disqualifying itself? Or did it make everything work okay again?
>
>
>On Sep 1, 2014, at 3:16 AM, Gilles Gouaillardet
><gilles.gouaillar...@iferc.org>
Folks,
mtt recently failed a bunch of times with the trunk.
a good suspect is the collective/ibarrier test from the ibm test suite.
most of the time, CHECK_AND_RECYCLE will fail
/* IS_COLL_SYNCMEM(coll_op) is true */
with this test case, we just get a glorious SIGSEGV since OBJ_RELEASE is
called
, Jeff Squyres (jsquyres) wrote:
> Gilles --
>
> Did you get a reply about this?
>
>
> On Aug 26, 2014, at 3:17 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Folks,
>>
>> the test_shmem_zero_get.x from the openshmem-release-
Mishima-san,
the root cause is that macro expansion does not always occur as one would
expect ...
could you please give the attached patch a try ?
it compiles (at least with gcc) but i have run zero tests so far
Cheers,
Gilles
On 2014/09/01 10:44, tmish...@jcity.maeda.co.jp wrote:
> Hi
original problem that was trying to be
> addressed.
>
>
> On Aug 28, 2014, at 10:01 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Howard and Edgar,
>>
>> i fixed a few bugs (r32639 and r32642)
>>
>> the bug is trivial
Ralph and all,
The following trivial test hangs
/* it hangs at least 99% of the time in my environment, 1% is a race
condition and the program behaves as expected */
mpirun -np 1 --mca btl self /bin/false
the same behaviour happens with the following trivial MPI program :
#include <mpi.h>
int main
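The snippet is cut off by the archive; a minimal sketch of such a
trivial-but-failing MPI program, assuming the point is a nonzero exit
status like /bin/false:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 1;  /* nonzero exit status, like /bin/false */
}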
Howard and Edgar,
i fixed a few bugs (r32639 and r32642)
the bug is trivial to reproduce with any mpi hello world program
mpirun -np 2 --mca btl openib,self hello_world
after setting the mca param in the $HOME/.openmpi/mca-params.conf
$ cat ~/.openmpi/mca-params.conf
btl_openib_receive_queues
Thanks Ralph !
Cheers,
Gilles
On 2014/08/28 4:52, Ralph Castain wrote:
> Took me awhile to track this down, but it is now fixed - combination of
> several minor errors
>
> Thanks
> Ralph
>
> On Aug 27, 2014, at 4:07 AM, Gilles Gouaillardet
> <gilles.gouaillar...@i
Folks,
the intercomm_create test case from the ibm test suite can hang under
some configurations.
basically, it will spawn n tasks in a first communicator, and then n
tasks in a second communicator.
when i run from node0 :
mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2
Folks,
i just commited r32604 in order to fix compilation (pmix) when ompi is
configured with --without-hwloc
now, even a trivial hello world program issues the following output
(which is non-fatal, and could even be reported as a warning) :
Folks,
the test_shmem_zero_get.x from the openshmem-release-1.0d test suite is
currently failing.
i looked at the test itself, and compared it to test_shmem_zero_put.x
(that is a success) and
i am very puzzled ...
the test calls several flavors of shmem_*_get where :
- the destination is in the
>look at that signature to ensure we aren't getting it confused.
>
>On Aug 25, 2014, at 1:59 AM, Gilles Gouaillardet
><gilles.gouaillar...@iferc.org> wrote:
>
>> Folks,
>>
>> when i run
>> mpirun -np 1 ./intercomm_create
>> from the ibm test
Folks,
when i run
mpirun -np 1 ./intercomm_create
from the ibm test suite, it either :
- succeeds
- hangs
- crashes (mpirun gets a SIGSEGV) soon after writing the following message
ORTE_ERROR_LOG: Not found in file
../../../src/ompi-trunk/orte/orted/pmix/pmix_server.c at line 566
here is what happens :
ass for me
>
>
> On Aug 22, 2014, at 9:12 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>>> Ralph,
>>>
>>> Will do on Monday
whereas your mpi_no_op.c return 0;
Cheers,
Gilles
Ralph Castain <r...@open-mpi.org> wrote:
>You might want to try again with current head of trunk as something seems off
>in what you are seeing - more below
>
>
>
>On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet
><
he struct in order to preserve the old behaviour.
>
> Ashley.
>
> On 21 Aug 2014, at 04:31, Gilles Gouaillardet <gilles.gouaillar...@iferc.org>
> wrote:
>
>> Paul,
>>
>> the piece of code that causes an issue with PGI 2013 and older is just a bit
>> mor
-Nathan
>>
>> On Tue, Aug 19, 2014 at 10:48:48PM -0400, svn-commit-mai...@open-mpi.org
>> wrote:
>>> Author: ggouaillardet (Gilles Gouaillardet)
>>> Date: 2014-08-19 22:48:47 EDT (Tue, 19 Aug 2014)
>>> New Revision: 32555
>>> URL: ht
Folks,
let's look at the following trivial test program :
#include <stdio.h>
#include <mpi.h>
int main (int argc, char * argv[]) {
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf ("I am %d/%d and i abort\n", rank, size);
r32551 now detects this limitation and automatically disables the oshmem profile. I
am now revamping the patch for v1.8
Gilles
Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>In the case of PGI compilers prior to 13, a workaround is to configure with
>--disable-oshmem-profil
In the case of PGI compilers prior to 13, a workaround is to configure
with --disable-oshmem-profile
On 2014/08/18 16:21, Gilles Gouaillardet wrote:
> Josh, Paul,
>
> the problem with old PGI compilers comes from the preprocessor (!)
>
> with pgi 12.10 :
> oshmem/shmem/for
Josh, Paul,
the problem with old PGI compilers comes from the preprocessor (!)
with pgi 12.10 :
oshmem/shmem/fortran/start_pes_f.c
SHMEM_GENERATE_WEAK_BINDINGS(START_PES, start_pes)
gets expanded as
#pragma weak START_PES = PSTART_PES SHMEM_GENERATE_WEAK_PRAGMA ( weak
start_pes_ = pstart_pes_
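for contrast, the intended expansion is presumably one #pragma per weak
alias, i.e. something like (an assumption based on the macro names above):

#pragma weak START_PES = PSTART_PES
#pragma weak start_pes_ = pstart_pes_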
Lenny,
that looks related to #4857 which has been fixed in trunk since r32517
could you please update your openmpi library and try again ?
Gilles
On 2014/08/13 17:00, Lenny Verkhovsky wrote:
> Following Jeff's suggestion adding devel mailing list.
>
> Hi All,
> I am currently facing strange
Thanks Christopher,
this has been fixed in the trunk with r32520
Cheers,
Gilles
On 2014/08/13 14:49, Christopher Samuel wrote:
> Hi all,
>
> We spotted this in 1.6.5 and git grep shows it's fixed in the
> v1.8 branch but in master it's still there:
>
>
Folks,
i noticed mpirun (trunk) hangs when running any mpi program on two nodes
*and* each node has a private network with the same ip
(in my case, each node has a private network to a MIC)
in order to reproduce the problem, you can simply run (as root) on the
two compute nodes
brctl addbr br0
Jeff and all,
i fixed the trivial errors in the trunk; there are now 11 non-trivial
errors.
(commits r32490 to r32497)
i ran the script vs the v1.8 branch and found 54 errors
(first, you need to
touch Makefile.ompi-rules
in the top-level Open MPI directory in order to make the script happy)
r_finalize in the first place
(which is sufficient but might not be necessary ...)
Cheers,
Gilles
On 2014/08/09 1:27, Ralph Castain wrote:
> Committed a fix for this in r32460 - see if I got it!
>
> On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> w
Folks,
here is the description of a hang i briefly mentioned a few days ago.
with the trunk (i did not check 1.8 ...) simply run on one node :
mpirun -np 2 --mca btl sm,self ./abort
(the abort test is taken from the ibm test suite : process 0 call
MPI_Abort while process 1 enters an infinite
as an ORTE_NAME the issue will go away.
>
> George.
>
>
>
> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Kawashima-san and all,
>>
>> Here is attached a one off patch for v1.8.
>> /* it does
Kawashima-san,
This is interesting :-)
proc is in the stack and has type orte_process_name_t
with
typedef uint32_t orte_jobid_t;
typedef uint32_t orte_vpid_t;
struct orte_process_name_t {
orte_jobid_t jobid; /**< Job number */
orte_vpid_t vpid; /**< Process id - equivalent to
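For context, a sketch of the two name layouts being cast between; the
OPAL-side typedef is my assumption (a single 64-bit identifier, which is my
reading of this discussion), not confirmed by the snippet:

#include <stdint.h>

/* OPAL-side name: a single 64-bit identifier (assumption). */
typedef uint64_t opal_process_name_t;

/* ORTE-side name: two 32-bit fields, as quoted above. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} orte_process_name_t;

/* A cast such as (orte_process_name_t *)&opal_name is only safe if the
 * two layouts and their alignment coincide on every platform. */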
Ralph and all,
> * static linking failure - Gilles has posted a proposed fix, but somebody
> needs to approve and CMR it. Please see:
> https://svn.open-mpi.org/trac/ompi/ticket/4834
Jeff made a better fix (r32447) to which i added a minor correction
(r32448).
as far as i am concerned,
to the POSIX prototype (aka. returning the changes value
>> instead of doing things inplace).
>>
>> George.
>>
>>
>>
>> On Wed, Aug 6, 2014 at 7:02 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>> Ralph and George,
Ralph and George,
here is attached a patch that fixes the heterogeneous support without
the abstraction violation.
Cheers,
Gilles
On 2014/08/06 9:40, Gilles Gouaillardet wrote:
> hummm
>
> i intentionally did not swap the two 32 bits (!)
>
> from the top level, what we have i
testing.
> Since I've determined already that 1.6.5 did not have the problem while
> 1.7.x does, the possibility exists that some smaller change might exist to
> restore what ever was lost between the v1.6 and v1.7 branches.
>
> -Paul
>
>
> On Tue, Aug 5, 2014 at 1:
y speaking, converting a 64 bits to a big endian
>>> representation requires the swap of the 2 32 bits parts. So the correct
>>> approach would have been:
>>> uint64_t htonll(uint64_t v)
>>> {
>>> return ((uint64_t)ntohl(n)) << 32 | (uint64_t)ntoh
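for reference, a complete sketch of the swap-both-halves conversion being
quoted (my_htonll is a hypothetical stand-in, with the usual endianness
guard; not the actual commit):

#include <stdint.h>
#include <arpa/inet.h>

static uint64_t my_htonll(uint64_t v)
{
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return v;  /* already in network byte order */
#else
    /* swap the two 32-bit halves, byte-swapping each with htonl() */
    return ((uint64_t)htonl((uint32_t)v) << 32) | htonl((uint32_t)(v >> 32));
#endif
}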
Here is a patch that has been minimally tested.
this is likely overkill (at least when dynamic libraries can be used),
but it does the job so far ...
Cheers,
Gilles
On 2014/08/05 16:56, Gilles Gouaillardet wrote:
> from libopen-pal.la :
> dependency_libs=' -lrdmacm -libverbs -lscif
from libopen-pal.la :
dependency_libs=' -lrdmacm -libverbs -lscif -lnuma -ldl -lrt -lnsl
-lutil -lm'
i confirm mpicc fails linking
but FWIW, using libtool does work (!)
could the bug come from the mpicc (and other) wrappers ?
Gilles
$ gcc -g -O0 -o hw /csc/home1/gouaillardet/hw.c
eing empty (bswap_64 or something).
>
> George.
>
> On Aug 1, 2014, at 06:52 , Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> George and Ralph,
>>
>> i am very confused whether there is an issue or not.
>>
>>
>> anyway, t
Paul,
this is a bit trickier ...
on Linux platforms oshmem is built by default,
on non-Linux platforms, oshmem is *not* built by default.
so the configure message (disabled by default) is correct on non-Linux
platforms, and incorrect on Linux platforms ...
i do not know what should be done,
Fixed in r32409 : %d and %s were swapped in a MLERROR (printf like)
Gilles
On 2014/08/02 11:07, Gilles Gouaillardet wrote:
> Paul,
>
> about the second point :
> mmap is called with the MAP_FIXED flag, before the fix, the
> required address was not aligned on a page size and henc
out 25% of the time
--mca btl scif,self => always hang
only the mpirun process remains and is hanging.
i will try to debug this, and i welcome any help !
Cheers,
Gilles
On 2014/08/04 11:57, Gilles Gouaillardet wrote:
> Paul,
>
> i confirm ampersand was missing and this was a bug
on both 32- and 64-bit archs in this case :
#if OPAL_ENABLE_DEBUG
static inline orte_process_name_t *
OMPI_CAST_RTE_NAME(opal_process_name_t * name);
#else
#define OMPI_CAST_RTE_NAME(a) ((orte_process_name_t*)(a))
#endif
Cheers,
Gilles
On 2014/08/03 14:49, Gilles GOUAILLARDET wrote:
> P
Paul,
imho, the root cause is a missing ampersand.
I will only be able to double check this tomorrow
Cheers,
Gilles
Ralph Castain wrote:
>Arg - that raises an interesting point. This is a pointer to a 64-bit number.
>Will uintptr_t resolve that problem on such platforms?
>
>
Paul,
about the second point :
mmap is called with the MAP_FIXED flag, before the fix, the
required address was not aligned on a page size and hence
mmap failed.
the mmap failure was immediately handled, but for some reason
i have not fully investigated yet, this failure was not correctly
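a minimal sketch of the alignment requirement described above (page_align
is a hypothetical helper, assuming POSIX sysconf and a power-of-two page
size):

#include <stdint.h>
#include <unistd.h>

/* mmap() with MAP_FIXED requires a page-aligned address: round the
 * requested address down to the enclosing page boundary first. */
static void *page_align(void *addr)
{
    uintptr_t pagesize = (uintptr_t)sysconf(_SC_PAGESIZE);
    return (void *)((uintptr_t)addr & ~(pagesize - 1));
}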
?
/* and then _process_name_jobid_for_opal, _process_name_vpid_for_opal,
opal_process_name_vpid_should_never_be_called
should also be updated */
Cheers,
Gilles
On 2014/08/01 19:52, Gilles Gouaillardet wrote:
> George and Ralph,
>
> i am very confused whether there is an issue or not.
>
>
> anyway, today Pa
pagesize = 65536; /* safer to overestimate than under */
> #endif
>
>
> opal_pagesize() anyone?
>
> -Paul
>
> On Fri, Aug 1, 2014 at 12:50 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Paul,
>>
>> you are absolu
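along the lines of the opal_pagesize() suggestion quoted above, a sketch of
such a helper (hypothetical name, not an existing OPAL function):

#include <unistd.h>

/* Query the page size at runtime, falling back to a safe
 * overestimate when sysconf() cannot answer. */
static long opal_pagesize(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    return (pagesize > 0) ? pagesize : 65536;  /* overestimate on failure */
}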
Paul and Ralph,
for what it's worth :
a) i faced the very same issue on my (slow) qemu emulated ppc64 vm
b) i was able to run very basic programs when passing --mca coll ^ml to
mpirun
Cheers,
Gilles
On 2014/08/01 12:30, Ralph Castain wrote:
> Yes, I fear this will require some effort to
e Studio
> compilers are preferred.
> Let me know if you need me to try any of those gcc installations.
>
> -Paul
>
>
> On Thu, Jul 31, 2014 at 9:12 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Paul,
>>
>> As Ralph pointed
Paul,
As Ralph pointed out, this issue was reported last month on the user mailing
list.
#include did not help :
http://www.open-mpi.org/community/lists/users/2014/07/24883.php
I will see whether i can reproduce and fix this issue on a solaris10 (but x86) VM
BTW, are you using the GNU compiler ?
Paul,
the ibm test suite from the non public ompi-tests repository has several
tests for usempif08.
Cheers,
Gilles
On 2014/08/01 11:04, Paul Hargrove wrote:
> Second related issue:
>
> Can/should examples/hello_usempif08.f90 be extended to use more of the
> module such that it would have
Paul,
in .../ompi/mpi/fortran/use-mpi-f08, can you create the following dumb
test program,
compile it, and run nm | grep f08 on the object :
$ cat foo.f90
program foo
use mpi_f08_sizeof
implicit none
real :: x
integer :: size, ierror
call MPI_Sizeof_real_s_4(x, size, ierror)
stop
end program
Paul and all,
For what it's worth, with openmpi 1.8.2rc2 and the intel fortran
compiler version 14.0.3.174 :
$ nm libmpi_usempif08.so | grep -i sizeof
there is no such undefined symbol (mpi_f08_sizeof_)
as a temporary workaround, did you try to force the linker use
stead of a pointer to the name
>
> r32357
>
> On Jul 30, 2014, at 7:43 AM, Gilles GOUAILLARDET <
> gilles.gouaillar...@gmail.com> wrote:
>
> Rolf,
>
> r32353 can be seen as a suspect...
> Even if it is correct, it might have exposed the bug discussed in #48
't have a PGI compiler. I also didn't specify a level of Fortran
>support, but just had --enable-mpi-fortran
>
>Maybe we need to revert this commit until we figure out a better solution?
>
>On Jul 30, 2014, at 12:16 AM, Gilles Gouaillardet
><gilles.gouaillar...@iferc.org>
e/util/name_fns.c:522
>
>522 if (name1->jobid < name2->jobid) {
>
>(gdb) print name1
>
>$1 = (const orte_process_name_t *) 0x192350001
>
>(gdb) print *name1
>
>Cannot access memory at address 0x192350001
>
>(gdb) print name2
>
>$2 = (const orte_proce
Paul,
this is a fair point.
i committed r32354 in order to abort configure in this case
Cheers,
Gilles
On 2014/07/30 15:11, Paul Hargrove wrote:
> On a related topic:
>
> I configured with an explicit --enable-mpi-fortran=usempif08.
> Then configure found PROCEDURE was missing/broken.
> The
George,
#4815 is indirectly related to the move :
in bcol/basesmuma, we used to compare ompi_process_name_t, and now we
(try to)
compare an ompi_process_name_t and an opal_process_name_t (which causes
a glorious SIGSEGV)
i proposed a temporary patch which is both broken and inelegant,
could you
5-Illegal procedure interface - mpi_user_function (conftest.f90:
> 12) 0 inform, 0 warnings, 2 severes, 0 fatal for test_proc
> {hargrove@hopper04 OMPI}$ pgf90 -V pgf90 13.10-0 64-bit target on x86-64
> Linux -tp shanghai The Portland Group - PGI Compilers and Tools Copyright
> (c) 2013,
Paul,
from the logs, the only difference i see is about Fortran PROCEDURE.
openmpi 1.8 (svn checkout) does not build the usempif08 bindings if
PROCEDURE is not supported.
from the logs, openmpi 1.8.1 does not check whether PROCEDURE is
supported or not
here is the sample program to check
>
> It would make sense, though I guess I always thought that was part of what
> happened in OBJ_CLASS_INSTANCE - guess I was wrong. My thinking was that
> DEREGISTER would be the counter to INSTANCE, and I do want to keep this
> from getting even more clunky - so maybe renaming INSTANCE to be
+1 for the overall idea !
On Fri, Jul 18, 2014 at 10:17 PM, Ralph Castain wrote:
>
> * add an OBJ_CLASS_DEREGISTER and require that all instantiations be
> matched by deregister at close of the framework/component that instanced
> it. Of course, that requires that we protect
Rolf,
i committed r2389.
MPI_Win_allocate_shared is now invoked on a single-node communicator
Cheers,
Gilles
On 2014/07/16 22:59, Rolf vandeVaart wrote:
> Sounds like a good plan. Thanks for looking into this Gilles!
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf
Rolf,
From the man page of MPI_Win_allocate_shared
It is the user's responsibility to ensure that the communicator comm represents
a group of processes that can create a shared memory segment that can be
accessed by all processes in the group
And from the mtt logs, you are running 4 tasks on
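a minimal sketch of honoring that requirement with standard MPI-3 calls,
splitting the communicator so the window only spans one node (a sketch
under that assumption, not necessarily what the committed fix does):

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm node_comm;
    MPI_Win win;
    int *base;

    MPI_Init(&argc, &argv);

    /* Restrict the window to processes that can actually map the same
     * shared-memory segment, i.e. processes on the same node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Win_allocate_shared(4096, sizeof(int), MPI_INFO_NULL,
                            node_comm, &base, &win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}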
Ralph and all,
my understanding is that
opal_finalize_util
aggressively tries to free memory that would otherwise still be allocated.
another way of saying "make valgrind happy" is "fully automated memory
leak detection"
(Joost pointed to the -fsanitize=leak feature of gcc 4.9 in
r32236 is a suspect
i am afk
I just read the code and a class is initialized with opal_class_initialize the
first time an object is instantiated with OBJ_NEW
I would simply revert r32236, or update opal_class_finalize to call
free(cls->cls_construct_array); only if cls->cls_construct_array is not NULL
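a sketch of the guarded free suggested above (the surrounding structure is
assumed; only the field name comes from the quote):

#include <stdlib.h>

/* Minimal stand-in for the relevant part of the class structure. */
typedef struct {
    void **cls_construct_array;  /* built lazily on the first OBJ_NEW */
} opal_class_sketch_t;

/* Free the constructor array only if the class was actually initialized,
 * so finalize stays safe for classes that were never instantiated. */
static void class_finalize_sketch(opal_class_sketch_t *cls)
{
    if (NULL != cls->cls_construct_array) {
        free(cls->cls_construct_array);
        cls->cls_construct_array = NULL;
    }
}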
Thanks Jeff,
i confirm the problem is fixed on CentOS 5
i committed r32215 because some files were missing from the
tarball/nightly snapshot/make dist.
Cheers,
Gilles
On 2014/07/11 4:21, Jeff Squyres (jsquyres) wrote:
> As of r32204, this should be fixed. Please let me know if it now works
On CentOS 5.x, gfortran is unable to compile this simple program :
subroutine foo ()
use, intrinsic :: iso_c_binding, only : c_ptr
end subroutine foo
an other workaround is to install gfortran 4.4
(yum install gcc44-gfortran)
and configure with
FC=gfortran44
On 2014/07/09 19:46, Jeff Squyres
Mike,
how do you test ?
i cannot reproduce the bug :
if i run ompi_info -a -l 9 | less
and press 'q' at an early stage (e.g. before all the output has been written to
the pipe)
then the less process exits, and ompi_info receives SIGPIPE and crashes
(which is normal unix behaviour)
now if i press the spacebar
Olivier,
i was unable to reproduce the issue on a centos7 beta with :
- trunk (latest nightly snapshot)
- 1.8.1
- 1.6.5
the libtool-ltdl-devel package is not installed on this server
that being said, i did not use
--with-verbs
nor
--with-tm
since these packages are not installed on my server.
Yossi,
thanks for reporting this issue.
i committed r32139 and r32140 to trunk in order to fix this issue (with
MPI_Startall)
and some misc extra bugs.
i also made CMR #4764 for the v1.8 branch (and asked George to review it)
Cheers,
Gilles
On 2014/07/03 22:25, Yossi Etigin wrote:
> Looks
Mike,
could you try again with
OMPI_MCA_btl=vader,self,openib
it seems the sm module causes a hang
(which later causes the timeout mechanism to send a SIGSEGV)
Cheers,
Gilles
On 2014/06/25 14:22, Mike Dubman wrote:
> Hi,
> The following commit broke trunk in jenkins:
>
Per the OMPI developer
Hi Ralph,
On 2014/06/25 2:51, Ralph Castain wrote:
> Had a chance to review this with folks here, and we think that having
> oversubscribe automatically set overload makes some sense. However, we do
> want to retain the ability to separately specify oversubscribe and overload
> as well since
1:12, Ralph Castain wrote:
> Yeah, we should make that change, if you wouldn't mind doing it.
>
>
>
> On Tue, Jun 24, 2014 at 9:43 AM, Gilles GOUAILLARDET <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> That makes perfect sense.
>>
&g
have a
>reason for changing it other than coll/ml. If so, we'd be happy to revisit the
>proposal.
>
>
>Make sense?
>
>Ralph
>
>
>
>
>On Tue, Jun 24, 2014 at 3:24 AM, Gilles Gouaillardet
><gilles.gouaillar...@iferc.org> wrote:
>
>WHAT:
Folks,
this issue is related to the failures reported by mtt on the trunk when
the ibm test suite invokes MPI_Comm_spawn.
my test bed is made of 3 (virtual) machines with 2 sockets and 8 cpus
per socket each.
if i run on one host (without any batch manager)
mpirun -np 16 --host slurm1
WHAT: semantic change of opal_hwloc_base_get_relative_locality
WHY: make it closer to what coll/ml expects.
Currently, opal_hwloc_base_get_relative_locality means "at what level do
these procs share cpus"
however, coll/ml is using it as "at what level are these procs commonly
_NODE.
i am puzzled whether this is a bug in opal_hwloc_base_get_relative_locality
or in proc.c, which should not call this subroutine because it does not do
what is expected.
Cheers,
Gilles
On 2014/06/20 13:59, Gilles Gouaillardet wrote:
> Ralph,
>
> my test VM is single socket four cor
Ralph,
my test VM is single socket, four cores.
here is something odd i just found when running mpirun -np 2
intercomm_create.
tasks [0,1] are bound on cpus [0,1] => OK
tasks[2-3] (first spawn) are bound on cpus [2,3] => OK
tasks[4-5] (second spawn) are not bound (and cpuset is [0-3]) => OK
in
Ralph and Tetsuya,
is this related to the hang i reported at
http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ?
Nathan already replied he is working on a fix.
Cheers,
Gilles
On 2014/06/20 11:54, Ralph Castain wrote:
> My guess is that the coll/ml component may have problems
Folks,
in mca_oob_tcp_component_hop_unknown, the local variable bpr is not
defined, which prevents v1.8 compilation.
/* there was a local variable called pr; it seems it was removed instead of
being renamed to bpr */
the attached patch fixes this issue.
Cheers,
Gilles
Index: