Kawashima-san,
This is interesting :-)
proc is in the stack and has type orte_process_name_t
with
typedef uint32_t orte_jobid_t;
typedef uint32_t orte_vpid_t;
struct orte_process_name_t {
orte_jobid_t jobid; /**< Job number */
orte_vpid_t vpid; /**< Process id - equivalent to
Kawashima-san and all,
Attached is a one-off patch for v1.8.
/* it does not use the __attribute__ modifier that might not be
supported by all compilers */
as far as i am concerned, the same issue is also in the trunk,
and if you do not hit it, it just means you are lucky :-)
> OPAL_ID_T, we save it as an ORTE_NAME the issue will go away.
>
> George.
>
>
>
> On Fri, Aug 8, 2014 at 1:04 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Kawashima-san and all,
>>
>> Attached is a one-off patch
Folks,
here is the description of a hang i briefly mentioned a few days ago.
with the trunk (i did not check 1.8 ...), simply run on one node :
mpirun -np 2 --mca btl sm,self ./abort
(the abort test is taken from the ibm test suite : process 0 calls
MPI_Abort while process 1 enters an infinite loop)
ize in the first place
(which is sufficient but might not be necessary ...)
Cheers,
Gilles
On 2014/08/09 1:27, Ralph Castain wrote:
> Committed a fix for this in r32460 - see if I got it!
>
> On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet
> wrote:
>
>> Folks,
>>
Jeff and all,
i fixed the trivial errors in the trunk; there are now 11 non-trivial
errors.
(commits r32490 to r32497)
i ran the script vs the v1.8 branch and found 54 errors
(first, you need to
touch Makefile.ompi-rules
in the top-level Open MPI directory in order to make the script happy)
Cheers,
Folks,
i noticed mpirun (trunk) hangs when running any MPI program on two nodes
*and* each node has a private network with the same IP
(in my case, each node has a private network to a MIC)
in order to reproduce the problem, you can simply run (as root) on the
two compute nodes
brctl addbr br0
if
Thanks Christopher,
this has been fixed in the trunk with r32520
Cheers,
Gilles
On 2014/08/13 14:49, Christopher Samuel wrote:
> Hi all,
>
> We spotted this in 1.6.5 and git grep shows it's fixed in the
> v1.8 branch but in master it's still there:
>
> samuel@haswell:~/Code/OMPI/ompi-svn-mirror
Lenny,
that looks related to #4857, which has been fixed in the trunk since r32517
could you please update your Open MPI library and try again ?
Gilles
On 2014/08/13 17:00, Lenny Verkhovsky wrote:
> Following Jeff's suggestion adding devel mailing list.
>
> Hi All,
> I am currently facing strange sit
Josh, Paul,
the problem with old PGI compilers comes from the preprocessor (!)
with pgi 12.10 :
oshmem/shmem/fortran/start_pes_f.c
SHMEM_GENERATE_WEAK_BINDINGS(START_PES, start_pes)
gets expanded as
#pragma weak START_PES = PSTART_PES SHMEM_GENERATE_WEAK_PRAGMA ( weak
start_pes_ = pstart_pes_ )
In the case of PGI compilers prior to 13, a workaround is to configure
with --disable-oshmem-profile
On 2014/08/18 16:21, Gilles Gouaillardet wrote:
> Josh, Paul,
>
> the problem with old PGI compilers comes from the preprocessor (!)
>
> with pgi 12.10 :
> oshmem/shmem/for
r32551 now detects this limitation and automatically disables the oshmem profile. I
am now revamping the patch for v1.8
Gilles
Gilles Gouaillardet wrote:
>In the case of PGI compilers prior to 13, a workaround is to configure with
>--disable-oshmem-profile
>
>On 2014/08/18 1
Folks,
let's look at the following trivial test program :
#include <stdio.h>
#include <mpi.h>
int main (int argc, char * argv[]) {
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf ("I am %d/%d and i abort\n", rank, size);
> int main (void)
> {
> struct S y = { .i = x.i };
> return y.i;
> }
>
>
> -Paul
>
>
> On Wed, Aug 20, 2014 at 7:20 AM, Nathan Hjelm wrote:
>
>> Really? That means PGI 2013 is NOT C99 compliant! Figures.
>>
>> -Nathan
>>
>>
struct in order to preserve the old behaviour.
>
> Ashley.
>
> On 21 Aug 2014, at 04:31, Gilles Gouaillardet
> wrote:
>
>> Paul,
>>
>> the piece of code that causes an issue with PGI 2013 and older is just a bit
>> more complex.
>>
Cheers,
Gilles
On 2014/08/21 6:21, Ralph Castain wrote:
> I'm aware of the problem, but it will be fixed when the PMIx branch is merged
> later this week.
>
> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet
> wrote:
>
>> Folks,
>>
>> let's look
whereas your mpi_no_op.c returns 0;
Cheers,
Gilles
Ralph Castain wrote:
>You might want to try again with current head of trunk as something seems off
>in what you are seeing - more below
>
>
>
>On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet
> wrote:
>
>
>Ralph,
ass for me
>
>
> On Aug 22, 2014, at 9:12 AM, Ralph Castain wrote:
>
>> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet
>> wrote:
>>
>>> Ralph,
>>>
>>> Will do on Monday
>>>
>>> About the first test, in my case echo $?
Folks,
when i run
mpirun -np 1 ./intercomm_create
from the ibm test suite, it either :
- succeeds
- hangs
- or mpirun crashes (SIGSEGV) soon after printing the following message
ORTE_ERROR_LOG: Not found in file
../../../src/ompi-trunk/orte/orted/pmix/pmix_server.c at line 566
here is what happens :
ture to ensure we aren't getting it confused.
>
>On Aug 25, 2014, at 1:59 AM, Gilles Gouaillardet
> wrote:
>
>> Folks,
>>
>> when i run
>> mpirun -np 1 ./intercomm_create
>> from the ibm test suite, it either :
>> - succeeds
>> - hangs
Folks,
the test_shmem_zero_get.x from the openshmem-release-1.0d test suite is
currently failing.
i looked at the test itself and compared it to test_shmem_zero_put.x
(which succeeds), and
i am very puzzled ...
the test calls several flavors of shmem_*_get where :
- the destination is in the
Folks,
i just committed r32604 in order to fix compilation (pmix) when ompi is
configured with --without-hwloc
now, even a trivial hello world program issues the following output
(which is non-fatal, and could even be reported as a warning) :
[soleil][[32389,1],0][../../../../../../src/ompi-tru
Folks,
the intercomm_create test case from the ibm test suite can hang under
some configurations.
basically, it will spawn n tasks in a first communicator, and then n
tasks in a second communicator.
when i run from node0 :
mpirun -np 1 --mca btl tcp,self --mca coll ^ml -host node1,node2
./interco
Thanks Ralph !
Cheers,
Gilles
On 2014/08/28 4:52, Ralph Castain wrote:
> Took me awhile to track this down, but it is now fixed - combination of
> several minor errors
>
> Thanks
> Ralph
>
> On Aug 27, 2014, at 4:07 AM, Gilles Gouaillardet
> wrote:
>
>> Folks
Howard and Edgar,
i fixed a few bugs (r32639 and r32642)
the bug is trivial to reproduce with any mpi hello world program
mpirun -np 2 --mca btl openib,self hello_world
after setting the mca param in the $HOME/.openmpi/mca-params.conf
$ cat ~/.openmpi/mca-params.conf
btl_openib_receive_queues
Ralph and all,
The following trivial test hangs
/* it hangs at least 99% of the time in my environment, 1% is a race
condition and the program behaves as expected */
mpirun -np 1 --mca btl self /bin/false
the same behaviour happens with the following trivial MPI program :
#include <mpi.h>
int main (in
point to the original problem that was trying to be
> addressed.
>
>
> On Aug 28, 2014, at 10:01 PM, Gilles Gouaillardet
> wrote:
>
>> Howard and Edgar,
>>
>> i fixed a few bugs (r32639 and r32642)
>>
>> the bug is trivial to reproduce with any mpi
Mishima-san,
the root cause is that macro expansion does not always occur as one would
expect ...
could you please give the attached patch a try ?
it compiles (at least with gcc) but i have run zero tests so far
Cheers,
Gilles
On 2014/09/01 10:44, tmish...@jcity.maeda.co.jp wrote:
> Hi
, Jeff Squyres (jsquyres) wrote:
> Gilles --
>
> Did you get a reply about this?
>
>
> On Aug 26, 2014, at 3:17 AM, Gilles Gouaillardet
> wrote:
>
>> Folks,
>>
>> the test_shmem_zero_get.x from the openshmem-release-1.0d test suite is
>> currently
Folks,
mtt recently failed a bunch of times with the trunk.
a good suspect is the collective/ibarrier test from the ibm test suite.
most of the time, CHECK_AND_RECYCLE will fail
/* IS_COLL_SYNCMEM(coll_op) is true */
with this test case, we just get a glorious SIGSEGV since OBJ_RELEASE is
called on
ality requirement.
>
>Did this patch "fix" the problem by avoiding the segfault due to coll/ml
>disqualifying itself? Or did it make everything work okay again?
>
>
>On Sep 1, 2014, at 3:16 AM, Gilles Gouaillardet
> wrote:
>
>> Folks,
>>
>> mtt rece
Folks,
when OpenMPI is configured with --disable-weak-symbols and a fortran
2008 capable compiler (e.g. gcc 4.9),
MPI_STATUSES_IGNORE invoked from Fortran is not interpreted as
it should be.
/* instead of being a special array of statuses, it is an array of one
status, which can lead to buf
Ralph and Brice,
i noted Ralph committed r32685 in order to fix a problem with Intel
compilers.
A very similar issue occurs with clang 3.2 (gcc and clang 3.4 are ok
for me)
imho, the root cause is in the hwloc configure.
in this case, configure fails to detect strncasecmp is part of the C
includ
he config change. All I can say is that
> "tolower" on my CentOS box is defined in , and that has to be
> included in the misc.h header.
>
>
> On Sep 8, 2014, at 5:49 PM, Gilles Gouaillardet
> wrote:
>
>> Ralph and Brice,
>>
>> i noted Ralph c
ut this detection code is already a mess so I'd rather not change it again.
>
> Brice
>
>
>
> Le 09/09/2014 04:56, Gilles Gouaillardet a écrit :
>> Ralph,
>>
>> ok, let me clarify my point :
>>
>> tolower() is invoked in :
>> opal/mca/hwloc/hwlo
Folks,
Since r32672 (trunk), grpcomm/rcd is the default module.
the attached spawn.c test program is a trimmed version of the
spawn_with_env_vars.c test case
from the ibm test suite.
when invoked on two nodes :
- the program hangs with -np 2
- the program can crash with np > 2
the error message is
[n
ggouaillardet -> ggouaillardet
On 2014/09/10 19:46, Jeff Squyres (jsquyres) wrote:
> As the next step of the planned migration to Github, I need to know:
>
> - Your Github ID (so that you can be added to the new OMPI git repo)
> - Your SVN ID (so that I can map SVN->Github IDs, and therefore map T
em to establish a
>> persistent receive. They then can use the signature to tell which collective
>> the incoming message belongs to.
>>
>> I'll fix it, but it won't be until tomorrow I'm afraid as today is shot.
>>
>>
>> On Sep 9, 2014, at 3:10 AM
the right fix, it was very lightly
tested, but so far, it works for me ...
Cheers,
Gilles
On 2014/09/11 16:11, Gilles Gouaillardet wrote:
> Ralph,
>
> things got worse indeed :-(
>
> now a simple hello world involving two hosts hang in mpi_init.
> there is still a race conditio
> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet
> wrote:
>
>> Ralph,
>>
>> the root cause is when the second orted/mpirun runs rcd_finalize_coll,
>> it does not invoke pmix_server_release
>> because allgather_stub was not previously invoked since the the
2 and 3 enter the allgather at the send time, they will send a
message to each other at the same time and rml fails to establish the
connection. i could not figure out whether this is linked to my changes...
Cheers,
Gilles
>
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <
> gilles.gouai
Howard, and Rolf,
i initially reported the issue at
http://www.open-mpi.org/community/lists/devel/2014/09/15767.php
r32659 is neither a fix nor a regression, it simply aborts instead of
calling OBJ_RELEASE(mpi_comm_world).
/* my point here is we should focus on the root cause and not the
consequence */
firs
Ralph,
here is the full description of a race condition in oob/tcp i very briefly
mentioned in a previous post :
the race condition can occur when two not-yet-connected orted daemons try
to send a message to each other for the first time and at the same time.
that can occur when running mpi helloworld on 4
nnections, and then have the higher vpid retry while the lower one waits.
> The logic for that was still in place, but it looks like you are hitting a
> different code path, and I found another potential one as well. So I think I
> plugged the holes, but will wait to hear if you confi
e that triggers it so I
> can continue debugging
>
> Ralph
>
> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
> wrote:
>
>> Thanks Ralph,
>>
>> this is much better but there is still a bug :
>> with the very same scenario i described earlier, vpid 2
Folks,
for both the trunk and the v1.8 branch, configure takes the --with-threads option.
valid usages are
--with-threads, --with-threads=yes, --with-threads=posix and
--with-threads=no
/* v1.6 used to support the --with-threads=solaris */
if we try to configure with --with-threads=no, this will result
Folks,
r32716 broke v1.8 :-(
the root cause is that it included MCA_BASE_VAR_TYPE_VERSION_STRING, which has
not yet landed in v1.8
the attached trivial patch fixes this issue
Can the RM/GK please review it and apply it ?
Cheers,
Gilles
Index: opal/mca/base/mca_base_var.c
=
moved from
MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED,
retry() should have been invoked ?
Cheers,
Gilles
On 2014/09/18 17:02, Ralph Castain wrote:
> The patch looks fine to me - please go ahead and apply it. Thanks!
>
> On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet
> wrote:
>
es
On Fri, Sep 19, 2014 at 8:06 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:
> Ralph,
>
> i found another race condition.
> in a very specific scenario, vpid3 is in the MCA_OOB_TCP_CLOSED state,
> and processes data from the socket received from vpid 2
>
Thanks for the pointer George !
On Sat, Sep 20, 2014 at 5:46 AM, George Bosilca wrote:
> Or copy the handshake protocol design of the TCP BTL...
>
>
the main difference between oob/tcp and btl/tcp is the way we resolve the
situation in which two processes send their first message to each other a
that release too long.
> Alternatively, I can take care of it if you don't have time (I'm asking if
> you can do it solely because you have the reproducer).
>
>
> On Sep 21, 2014, at 6:54 AM, Ralph Castain wrote:
>
> Sounds fine with me - please go ahead, and thanks
Folks,
if i read between the lines, it looks like the next stable branch will be
v2.0 and not v1.10
is there a strong reason for that (such as an ABI compatibility break, or
a major but internal refactoring) ?
/* other than v1.10 is less than v1.8 when comparing strings :-) */
Cheers,
Gilles
my 0.02 US$ ...
Bitbucket's pricing model is per user (but with free public/private
repositories for up to 5 users)
whereas GitHub's pricing is per *private* repository (public
repositories are free, with unlimited users)
from an OpenMPI point of view, this means :
- with github, only the private ompi-tes
e race condition vis 1.8 - I agree it
> is not a blocker for that release.
>
> Ralph
>
> On Sep 22, 2014, at 4:49 PM, Gilles Gouaillardet
> wrote:
>
>> Ralph,
>>
>> here is the patch i am using so far.
>> i will resume working on this from Wednesday (t
Nathan,
why not just make the topology information available at that point as
you described it ?
the attached patch does this, could you please review it ?
Cheers,
Gilles
On 2014/09/26 2:50, Nathan Hjelm wrote:
> On Tue, Aug 26, 2014 at 07:03:24PM +0300, Lisandro Dalcin wrote:
>> I finally man
oiding changing
> anything in topo for 1.8.
>
> -Nathan
>
> On Mon, Sep 29, 2014 at 08:02:41PM +0900, Gilles Gouaillardet wrote:
>>Nathan,
>>
>>why not just make the topology information available at that point as you
>>described it ?
>>
Folks,
the dynamic/spawn test from the ibm test suite crashes if the openib btl
is detected
(the test can be run on one node with an IB port)
here is what happens :
in mca_btl_openib_proc_create,
the macro
OPAL_MODEX_RECV(rc, &mca_btl_openib_component.super.btl_version,
p
Thanks Ralph !
it did fix the problem
Cheers,
Gilles
On 2014/10/01 3:04, Ralph Castain wrote:
> I fixed this in r32818 - the components shouldn't be passing back success if
> the requested info isn't found. Hope that fixes the problem.
>
>
> On Sep 30, 2014, at 1:5
time for either graph or dist graph.
>>
>> -Nathan
>>
>> On Tue, Sep 30, 2014 at 02:03:27PM +0900, Gilles Gouaillardet wrote:
>>> Nathan,
>>>
>>> here is a revision of the previously attached patch, and that supports
>>> graph and di
Hi Jeff,
thumbs up for the migration !
the names here are the CMR owners ('Owned by' field in TRAC)
should it be the duty of the creators ('Reported by' field in TRAC) to
re-create the CMR instead?
/* if not, and from a git log point of view, that means the committer
will be the reviewer and not
ose CMRs as pull requests; probably in some
>> cases it's the reporter, probably in some cases it's the owner. :-)
>>
>>
>> On Oct 2, 2014, at 6:33 AM, Gilles Gouaillardet
>> wrote:
>>
>>> Hi Jeff,
>>>
>>> thumbs up for the
Jeff,
i could not find how to apply a label to a PR via the web interface (and
i am not sure i can even do that, since extra permissions might be required)
any idea (maybe a special keyword in the comment ...) ?
Cheers,
Gilles
On 2014/10/03 1:53, Jeff Squyres (jsquyres) wrote:
> On Oct 2, 2014, at 12:4
PR to someone,
> *after* you create the PR (same with creating issues).
>
> See https://github.com/open-mpi/ompi/wiki/SubmittingBugs for details:
>
>
>
>
>
>
> On Oct 2, 2014, at 11:37 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>
>
On Fri, Oct 3, 2014 at 7:29 PM, Jeff Squyres (jsquyres)
wrote:
> On Oct 2, 2014, at 11:33 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > the most painful part is probably to manually retrieve the git commit id
> > of a given svn commit id
>
the OMPI devs have *read* access to ompi-release, meaning you can
>create/comment on issues, but not set labels/milestones/assignees.
>
>I did not expect this behavior. Grumble. Will have to think about it a bit...
>
>
>
>
>On Oct 3, 2014, at 7:07 AM, Gilles Gouaill
will do !
Gilles
"Jeff Squyres (jsquyres)" wrote:
>That's a possibility. IU could probably host this for us.
>
>Would you mind looking into how hard this would be?
>
>
>On Oct 3, 2014, at 8:41 AM, Gilles Gouaillardet
> wrote:
>
>> Jeff,
>>
Folks,
currently, the last commit at https://github.com/open-mpi/ompi was 13 days ago
(see attached snapshot)
this is not the most up-to-date state !
for example, the last commit of my clone is
commit 54c839a970fc3025a08fe1c04b7d4b9767078264
Merge: dee6b63 5c5453b
Author: Gilles Gouaillardet
ion,
there are some more political implications (who
manages/updates/monitors/secures this).
the second option (cron script) could be accepted more easily by IU.
i will experiment in my sandbox from now on.
Cheers,
Gilles
On 2014/10/03 22:20, Gilles Gouaillardet wrote:
> will do !
>
> Gilles
sandbox from now.
>
> Cheers,
>
> Gilles
>
> On 2014/10/03 22:20, Gilles Gouaillardet wrote:
>> will do !
>>
>> Gilles
>>
>> "Jeff Squyres (jsquyres)" wrote:
>>> That's a possibility. IU could probably host this for us.
014, at 6:57 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > so far, using webhooks looks really simple :-)
>
> Good!
>
> > a public web server (apache+php) that can
> > a) process json requests
> > b) issue curl requests
> > is
be 100% sure ... */
Cheers,
Gilles
On 2014/10/07 22:55, Jeff Squyres (jsquyres) wrote:
> Sounds perfect.
>
> On Oct 7, 2014, at 9:49 AM, Gilles Gouaillardet
> wrote:
>
>> Jeff,
>>
>> that should not be an issue since github provides the infrastructure to
>
gt;> Those revisions listed above that are new to this repository have
>> not appeared on any other notification email; so we list those
>> revisions in full, below.
>>
>> - Log ---------
>> https://github.com/o
Ralph,
let me quickly reply about this one :
On 2014/10/16 12:00, Ralph Castain wrote:
> I also don't understand some of the changes in this commit. For example, why
> did you replace the OPAL_MODEX_SEND_STRING macro with essentially a
> hard-coded replica of that macro?
OPAL_MODEX_SEND_STRING
OK, revert done :
commit b5aea782cec116af095a7e7a7310e9e2a018
Author: Gilles Gouaillardet
Date: Thu Oct 16 12:24:38 2014 +0900
Revert "Fix heterogeneous support"
Per the discussion at
http://www.open-mpi.org/community/lists/devel/201
9b9
Author: Gilles Gouaillardet
Date: Thu Oct 16 13:29:32 2014 +0900
pmi/s1: fix large keys
do not overwrite the PMI key when pushing a message that does
not fit within 255 bytes
diff --git a/opal/mca/pmix/base/pmix_base_fns.c
b/opal/mca
Artem,
There is a known issue #235 with modex and i made PR #238 with a tentative fix.
Could you please give it a try and report whether it solves your problem ?
Cheers
Gilles
Artem Polyakov wrote:
>Hello, I have troubles with latest trunk if I use PMI1.
>
>
>For example, if I use 2 nodes the app
ombinations. Also I
>> am curious why basesmuma module listed twice.
>>
>>
>>
>>> Best regards,
>>> Elena
>>>
>>> On Fri, Oct 17, 2014 at 7:01 PM, Artem Polyakov
>>> wrote:
>>>
>>>> Gilles,
>>>>
Mike,
the root cause is that vader was not fully backported to v1.8
(two OPAL_* macros were not converted back to their OMPI_* equivalents)
i fixed it in https://github.com/open-mpi/ompi-release/pull/49
please note a similar warning is fixed in
https://github.com/open-mpi/ompi-release/pull/48
Cheers,
Gilles
On 2014/10/
heterogeneous cluster.
could you please have a look at it when you get a chance ?
Cheers,
Gilles
On 2014/10/16 12:26, Gilles Gouaillardet wrote:
> OK, revert done :
>
> commit b5aea782cec116af095a7e7a7310e9e2a018
> Author: Gilles Gouaillardet
> Date: Thu Oct 16 12:2
Folks,
While investigating an issue started at
http://www.open-mpi.org/community/lists/users/2014/10/25562.php
i found that it is mandatory to compile with -D_REENTRANT on Solaris (10
and 11)
(otherwise errno is not thread-specific, and the pmix thread
silently misinterprets EAGAIN or EWOULDBLOCK)
Thanks Paul !
Gilles
On 2014/10/27 18:47, Paul Hargrove wrote:
> On Mon, Oct 27, 2014 at 2:42 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
> [...]
>
>> Paul, since you have access to many platforms, could you please run this
>> test
which has been gcc, llvm-gcc
> and clang through those OS revs)
>
> Though I have access, I did not try compute nodes on BG/Q or Cray X{E,K,C}.
> Let me know if any of those are of significant concern.
>
> I no longer have AIX or IRIX access.
>
> -Paul
>
>
> On Mon
Folks,
currently, the dynamic/intercomm_create test fails if run on one host with an
IB port :
mpirun -np 1 ./intercomm_create
/* misleading error message is
opal/mca/btl/openib/connect/btl_openib_connect_udcm.c:1899:udcm_process_messages]
could not find associated endpoint */
this program spawns one
-----
>> https://github.com/open-mpi/ompi/commit/68bec0ae1f022e095c132b3f8c7317238b318416
>>
>> commit 68bec0ae1f022e095c132b3f8c7317238b318416
>> Merge: 76ee98c 672d967
>> Author: Gilles Gouaillardet
>> Date: Fri Oct 31 16:34:43 201
Ralph,
FYI, attached is the patch i am working on (still testing ...)
aa207ad2f3de5b649e5439d06dca90d86f5a82c2 should be reverted then.
Cheers,
Gilles
On 2014/11/04 13:56, Paul Hargrove wrote:
> Ralph,
>
> You will see from the message I sent a moment ago that -D_REENTRANT on
> Solaris a
t; section - i.e., add the flag if we are under solaris,
> regardless of someone asking for thread support. Since we require that
> libevent be thread-enabled, it seemed safer to always ensure those flags are
> set.
>
>
>> On Nov 3, 2014, at 9:05 PM, Gilles Gouaillardet
Ralph,
On 2014/11/04 1:54, Ralph Castain wrote:
> Hi folks
>
> Looking at the over-the-weekend MTT reports plus at least one comment on the
> list, we have the following issues to address:
>
> * many-to-one continues to fail. Shall I just assume this is an unfixable
> problem or a bad test and i
Elena,
the first case (-mca btl tcp,self) crashing is a bug and i will have a
look at it.
the second case (-mca btl sm,self) is a feature : the sm btl cannot be used
between tasks
having different jobids (this is the case after a spawn), and obviously,
self cannot be used either,
so the behaviour and erro
SEd by the btl add_proc if it is unreachable ? */
Cheers,
Gilles
On 2014/11/06 12:46, Ralph Castain wrote:
>> On Nov 5, 2014, at 6:11 PM, Gilles Gouaillardet
>> wrote:
>>
>> Elena,
>>
>> the first case (-mca btl tcp,self) crashing is a bug and i will have a
My bad (mostly)
I made quite a lot of PRs to get some review before committing to master, and
did not follow up in a timely manner.
I closed two obsolete PRs today.
#245 should be ready for prime time.
#227 too unless George has an objection.
I asked Jeff to review #232 and #228 because they
Mike,
Jenkins runs automated tests on each pull request, and i think this is a
good thing.
recently, it reported a bunch of failures but i could not find anything
to blame in the PR itself.
so i created a dummy PR https://github.com/open-mpi/ompi/pull/264 with
git commit --allow-empty
and waited
t(s) and make jenkins to pass?
> It will help us to make sure we don`t break something that did work before?
>
> On Tue, Nov 11, 2014 at 7:02 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Mike,
>>
>> Jenkins runs automated tests o
Thanks Mike,
BTW what is the distro running on your test cluster ?
Mike Dubman wrote:
>ok, I disabled vader tests in SHMEM and it passes.
>
>it can be requested from jenkins by specifying "vader" in PR comment line.
>
>
>On Tue, Nov 11, 2014 at 11:04 AM, Gilles Go
Folks,
I found (at least) two issues with oshmem put if btl/vader is used with
knem enabled :
$ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
--
SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
error
Harmut,
this is a known bug.
in the meantime, can you give 1.8.4rc1 a try ?
http://www.open-mpi.org/software/ompi/v1.8/downloads/openmpi-1.8.4rc1.tar.gz
/* if i remember correctly, this is fixed already in the rc1 */
Cheers,
Gilles
On 2014/11/13 19:48, Hartmut Häfner (SCC) wrote:
> Dear d
Hi Marc,
OpenLava is based on a pretty old version of LSF (4.x if i remember
correctly)
and i do not think LSF had tight-integration support for parallel jobs
at that time.
my understanding is that, basically, there are two kinds of direct
integration :
- mpirun launch: mpirun spawns orted via the A
y
> Uppsala University, Sweden
> marc.hoepp...@bils.se
>
>> On 18 Nov 2014, at 08:40, Gilles Gouaillardet
>> wrote:
>>
>> Hi Marc,
>>
>> OpenLava is based on a pretty old version of LSF (4.x if i remember
>> correctly)
>> and i do not think L
Hi Ghislain,
that sounds like a bug in MPI_Dist_graph_create :-(
you can use MPI_Dist_graph_create_adjacent instead :
MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, degrees, &targets[0],
&weights[0],
degrees, &targets[0], &weights[0], info,
rankReordering, &commGraph);
it
Ralph and Paul,
On 2014/11/26 10:37, Ralph Castain wrote:
> So it looks like the issue isn't so much with our code as it is with the OS
> stack, yes? We aren't requiring that the loopback be "up", but the stack is
> in order to establish the connection, even when we are trying a non-lo
> interf
Ralph,
i noted several hangs in mtt with the v1.8 branch.
a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
from the intel_tests suite,
invoke mpirun on one node and run the tasks on another node :
node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f