Re: [OMPI devel] Cuda build break

2017-10-04 Thread Sylvain Jeaugey
See my last comment on #4257 : https://github.com/open-mpi/ompi/pull/4257#issuecomment-332900393 We should completely disable CUDA in hwloc. It is breaking the build, but more importantly, it creates an extra dependency on the CUDA runtime that Open MPI doesn't have, even when compiled with

Re: [OMPI devel] CUDA kernels in OpenMPI

2017-01-27 Thread Sylvain Jeaugey
Hi Chris, First, you will need to have some configure stuff to detect nvcc and use it inside your Makefile. UTK may have some examples to show here. For the C/C++ API, you need to add 'extern "C"' statements around the interfaces you want to export in C so that you can use them inside Open
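
A minimal sketch of the 'extern "C"' pattern described here, assuming a hypothetical kernel in a .cu file built by nvcc and a plain-C caller on the Open MPI side (all names below are made up for illustration):

    /* scale.cu -- compiled with nvcc (hypothetical example) */
    #include <cuda_runtime.h>

    __global__ void scale_kernel(double *buf, double factor, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            buf[i] *= factor;
        }
    }

    /* The wrapper is declared extern "C" so nvcc (a C++ compiler) does not
     * mangle the symbol and the plain-C Open MPI code can link against it. */
    extern "C" void my_scale(double *buf, double factor, size_t n)
    {
        scale_kernel<<<(unsigned)((n + 255) / 256), 256>>>(buf, factor, n);
        cudaDeviceSynchronize();
    }

    /* On the C side, only a plain prototype is needed:
     *     void my_scale(double *buf, double factor, size_t n);
     */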

Re: [OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey
, 2016, at 3:35 PM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote: Indeed, I implied that affinity was set before MPI_Init (usually even before the process is launched). And yes, that would require a modex ... but I thought there was one already and maybe we could pack the affinity infor

Re: [OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey
, then we could do it - but at the cost of forcing a modex. You can only detect your own affinity, so to get the relative placement, you have to do an exchange if we can’t pass it to you. Perhaps we could offer it as an option? On Apr 26, 2016, at 2:27 PM, Sylvain Jeaugey <sjeau...@nvidia.com>

[OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey
Within the BTL code (and surely elsewhere), we can use those convenient OPAL_PROC_ON_LOCAL_{NODE,SOCKET, ...} macros to figure out where another endpoint is located compared to us. The problem is that it only works when ORTE defines it. The NODE works almost always since ORTE is always doing

Re: [OMPI devel] Crash in orte_iof_hnp_read_local_handler

2016-02-26 Thread Sylvain Jeaugey
din ? On 02/26/2016 11:46 AM, Ralph Castain wrote: So the child processes are not calling orte_init or anything like that? I can check it - any chance you can give me a line number via a debug build? On Feb 26, 2016, at 11:42 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote: I got this strange

[OMPI devel] Crash in orte_iof_hnp_read_local_handler

2016-02-26 Thread Sylvain Jeaugey
I got this strange crash on master this night running nv/mpix_test : Signal: Segmentation fault (11) Signal code: Address not mapped (1) Failing at address: 0x50 [ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710] [ 1]

Re: [OMPI devel] [OMPI users] configuring open mpi 10.1.2 with cuda on NVIDIA TK1

2016-01-22 Thread Sylvain Jeaugey
. Thanks, Sylvain On 01/22/2016 10:07 AM, Sylvain Jeaugey wrote: It looks like the errors are produced by the hwloc configure ; this one somehow can't find CUDA (I have to check if that's a problem btw). Anyway, later in the configure, the VT configure finds cuda correctly, so it seems specific

Re: [OMPI devel] FOSS for scientists devroom at FOSDEM 2013

2012-11-20 Thread Sylvain Jeaugey
Hi Jeff, Do you mean "attend" or "do a talk"? Sylvain On 20/11/2012 16:16, Jeff Squyres wrote: Cool! Thanks for the invite. Do we have any European friends who would be able to attend this conference? On Nov 20, 2012, at 10:02 AM, Sylwester Arabas wrote: Dear Open MPI Team, A

Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread sylvain . jeaugey
Hi Matthias, You might want to play with process binding to see if your problem is related to bad memory affinity. Try to launch the pingpong on two CPUs of the same socket, then on different sockets (i.e. bind each process to a core, and try different configurations). Sylvain From:
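
For illustration, with a recent Open MPI the two placements could look roughly like the commands below (option names have changed between releases, so treat them as an assumption to adapt):

    # both ranks on the same socket (adjacent cores)
    mpirun -np 2 --bind-to core --map-by core ./pingpong

    # one rank per socket
    mpirun -np 2 --bind-to core --map-by socket ./pingpong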

Re: [OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread sylvain . jeaugey
the features contained in the change. > Please note that configure requirements on components HAVE > CHANGED. For example, a configure.params file is no longer required > in each component directory. See Jeff's emails for an explanation.

Re: [OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread sylvain . jeaugey
For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. >> From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of Sylvain Jea

[OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread Sylvain Jeaugey
Hi All, I just realized that Bull Vendor IDs for Infiniband cards disappeared from the trunk. Actually, they were removed shortly after we included them in last September. The original commit was : r23715 | derbeyn | 2010-09-03 16:13:19 +0200 (Fri, 03 Sep 2010) | 1 line Added Bull vendor id

Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-29 Thread sylvain . jeaugey
Kawashima-san, Congratulations on your machine, this is a stunning achievement! > Kawashima wrote: > Also, we modified tuned COLL to implement interconnect-and-topology- > specific bcast/allgather/alltoall/allreduce algorithm. These algorithm > implementations

Re: [OMPI devel] BTL preferred_protocol , large message

2011-03-10 Thread Sylvain Jeaugey
On Wed, 9 Mar 2011, George Bosilca wrote: One gets multiple non-overlapping BTL (in terms of peers), each with its own set of parameters and eventually accepted protocols. Mainly there will be one BTL per memory hierarchy. Pretty cool :-) I'll cleanup the code and send you a patch. We'd be

Re: [OMPI devel] BTL preferred_protocol , large message

2011-03-09 Thread Sylvain Jeaugey
Hi George, This certainly looks like our motivations are close. However, I don't see in the presentation how you implement it (maybe I misread it), especially how you manage to not modify the BTL interface. Do you have any code / SVN commit references for us to better understand what it's

Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-16 Thread Sylvain Jeaugey
at 9:00 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> wrote: I already mentioned it when answering Terry's e-mail, but to be sure I'm clear: don't confuse the node's full topology with the MPI job topology. It _is_ different. And every process does not get the whole topology in hitopo, only its own,

Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-15 Thread Sylvain Jeaugey
nter- node. Sylvain On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote: As a followup of Stuttgart's developper's meeting, here is an RFC for our topology detection framework. WHAT: Add a framework for hardware topology detection to be used by any other part of Open MPI to help optimization. WHY: C

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-11-04 Thread Sylvain Jeaugey
AM, Sylvain Jeaugey wrote: On Tue, 26 Oct 2010, Jeff Squyres wrote: I don't think this is the right way to fix it. Sorry! :-( I don't think it is the right way to do it either :-) I say this because it worked somewhat by luck before, and now it's broken. If we put in another "it'll

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-26 Thread Sylvain Jeaugey
On Tue, 26 Oct 2010, Jeff Squyres wrote: I don't think this is the right way to fix it. Sorry! :-( I don't think it is the right way to do it either :-) I say this because it worked somewhat by luck before, and now it's broken. If we put in another "it'll work because of a side effect of

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-26 Thread Sylvain Jeaugey
(one to get the priorities, and then another to execute) and additional API functions in the various modules. On Oct 7, 2010, at 6:25 AM, Sylvain Jeaugey wrote: Hi list, Remember this old bug ? I think I finally found out what was going wrong. The opal "installdirs" framework has

Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-10-07 Thread Sylvain Jeaugey
On Wed, 29 Sep 2010, Ashley Pittman wrote: On 17 Sep 2010, at 11:36, Pascal Deveze wrote: Hi all, In charge of ticket 1888 (see at https://svn.open-mpi.org/trac/ompi/ticket/1888) , I have put the resulting code in bitbucket at: http://bitbucket.org/devezep/new-romio-for-openmpi/ The work in

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-07 Thread Sylvain Jeaugey
s position in the static components array ; 3. Any other idea ? Sylvain On Fri, 19 Jun 2009, Sylvain Jeaugey wrote: On Thu, 18 Jun 2009, Jeff Squyres wrote: On Jun 18, 2009, at 11:25 AM, Sylvain Jeaugey wrote: My problem seems related to library generation through RPM, not with 1.3.2, nor the pat

Re: [OMPI devel] Possible memory leak

2010-09-01 Thread Sylvain Jeaugey
Hi Ananda, I didn't try to run your program, but this seems logical to me. The problem with calling MPI_Bcast repeatedly is that you may get an unbounded desynchronization between the sender and the receiver(s). MPI_Bcast is a unidirectional operation. It does not necessarily block until the
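
A small self-contained sketch of the point (loop counts are arbitrary): the root can return from MPI_Bcast before the receivers have the data, so an occasional collective that really synchronizes keeps the drift bounded:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int buf[1024] = {0};
        MPI_Init(&argc, &argv);
        for (int i = 0; i < 100000; i++) {
            /* On the root, MPI_Bcast may complete before any receiver has the
             * data, so the root can keep posting new broadcasts and run far
             * ahead of slow receivers. */
            MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
            if (i % 1000 == 0) {
                MPI_Barrier(MPI_COMM_WORLD);   /* bound the desynchronization */
            }
        }
        MPI_Finalize();
        return 0;
    }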

Re: [OMPI devel] v1.5: sigsegv in case of extremely low settings in the SRQs

2010-06-23 Thread Sylvain Jeaugey
On Wed, 23 Jun 2010, Jeff Squyres wrote: BTW, are you guys waiting for us to commit that, or do we ever give you guys SVN commit access? Nadia is off today. She should commit it tomorrow. Sylvain

Re: [OMPI devel] v1.5: sigsegv in case of extremely low settings in the SRQs

2010-06-23 Thread Sylvain Jeaugey
Hi Jeff, Why do we want to set this value so low ? Well, just to see if it crashes :-) More seriously, we're working on lowering the memory usage of the openib BTL, which is achieved at most by having only 1 send queue element (at very large scale, send queues prevail). This "extreme"

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Sylvain Jeaugey
On Fri, 11 Jun 2010, Jeff Squyres wrote: On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote: Interesting. Do you think this behavior of the linux kernel would change if the file was unlink()ed after attach ? After a little talk with kernel guys, it seems that unlinking wouldn't change

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey
On Thu, 10 Jun 2010, Jeff Squyres wrote: Sam -- if the shmat stuff fails because the limits are too low, it'll (silently) fall back to the mmap module, right? From my experience, it completely disabled the sm component. Having a nice fallback would be indeed a very Good thing. Sylvain

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey
On Thu, 10 Jun 2010, Paul H. Hargrove wrote: One should not ignore the option of POSIX shared memory: shm_open() and shm_unlink(). When present this mechanism usually does not suffer from the small (eg 32MB) limits of SysV, and uses a "filename" (in an abstract namespace) which can portably
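
A short self-contained example of that POSIX interface (segment name and size are arbitrary; error handling trimmed to the essentials):

    #include <fcntl.h>      /* O_CREAT, O_RDWR */
    #include <sys/mman.h>   /* shm_open, shm_unlink, mmap */
    #include <unistd.h>     /* ftruncate, close */
    #include <stdio.h>

    int main(void)
    {
        const char *name = "/ompi_demo_seg";  /* abstract namespace, not a real path */
        size_t size = 1 << 20;                /* 1 MB, well above typical SysV limits */

        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, (off_t)size) != 0) { perror("ftruncate"); return 1; }

        void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... use the segment ... */

        munmap(base, size);
        close(fd);
        shm_unlink(name);   /* name disappears; memory lives until the last unmap */
        return 0;
    }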

Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey
On Wed, 9 Jun 2010, Jeff Squyres wrote: On Jun 9, 2010, at 3:26 PM, Samuel K. Gutierrez wrote: System V shared memory cleanup is a concern only if a process dies in between shmat and shmctl IPC_RMID. Shared memory segment cleanup should happen automagically in most cases, including abnormal
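
The shmat-then-IPC_RMID window mentioned here, as a minimal sketch (key and size are arbitrary): marking the segment for removal right after attaching is what makes the cleanup automatic once the last process detaches or dies.

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int main(void)
    {
        int id = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        void *base = shmat(id, NULL, 0);
        if (base == (void *)-1) { perror("shmat"); return 1; }

        /* If the process dies right here, the segment id still exists and has
         * to be cleaned up by hand (e.g. ipcrm) -- this is the window discussed. */

        if (shmctl(id, IPC_RMID, NULL) != 0) { perror("shmctl"); return 1; }
        /* From now on the kernel removes the memory automatically when the
         * last attached process exits, even abnormally. */

        /* ... use base ... */
        shmdt(base);
        return 0;
    }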

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
On Wed, 2 Jun 2010, Jeff Squyres wrote: Don't you mean return NULL? This function is supposed to return a (struct ibv_cq *). Oops. My bad. Yes, it should return NULL. And it seems that if I make ibv_create_cq always return NULL, the scenario described by George works smoothly : returned

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
On Tue, 1 Jun 2010, Jeff Squyres wrote: On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote: In my case, the error happens in : mca_btl_openib_add_procs() mca_btl_openib_size_queues() adjust_cq() ibv_create_cq_compat() ibv_create_cq() Can you nail this down

Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey
Couldn't explain it better. Thanks Jeff for the summary ! On Tue, 1 Jun 2010, Jeff Squyres wrote: On May 31, 2010, at 10:27 AM, Ralph Castain wrote: Just curious - your proposed fix sounds exactly like what was done in the OPAL SOS work. Are you therefore proposing to use SOS to provide a

Re: [OMPI devel] BTL add procs errors

2010-05-31 Thread Sylvain Jeaugey
uery sequence is it returning an error for you, Sylvain? Is it just a matter of tidying something up properly before returning the error? On May 28, 2010, at 2:21 PM, George Bosilca wrote: On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote: On Fri, 28 May 2010, Jeff Squyres wrote: On

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Fri, 28 May 2010, Jeff Squyres wrote: On May 28, 2010, at 9:32 AM, Jeff Squyres wrote: Understood, and I agreed that the bug should be fixed. Patches would be welcome. :-) I sent a patch on the bml layer in my first e-mail. We will apply it on our tree, but as always we're trying to

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Fri, 28 May 2010, Jeff Squyres wrote: Herein lies the quandary: we don't/can't know the user or sysadmin intent. They may not care if the IB is borked -- they might just want the job to fall over to TCP and continue. But they may care a lot if IB is borked -- they might want the job to

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Thu, 27 May 2010, Jeff Squyres wrote: On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote: That's pretty much my first proposition : abort when an error arises, because if we don't, we'll crash soon afterwards. That's my original concern and this should really be fixed. Now, if you want

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
n an error, the job should abort. Brian -- Brian W. Barrett Scalable System Software Group Sandia National Laboratories From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of Sylvain Jeaugey [sylvain.jeau...@bull.net] Sent: Thursday, May

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
ys will not be built. george. On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote: Hi, I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error during the "add procs" phase. The current bml/r2 code silently ignores btl->add_procs() error codes with the

[OMPI devel] BTL add procs errors

2010-05-25 Thread Sylvain Jeaugey
Hi, I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error during the "add procs" phase. The current bml/r2 code silently ignores btl->add_procs() error codes with the following comment : ompi/mca/bml/r2/bml_r2.c:208 /* This BTL has troubles

Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-19 Thread Sylvain Jeaugey
On Mon, 17 May 2010, Pavel Shamis (Pasha) wrote: Sylvain Jeaugey wrote: The XRC protocol seems to create shared receive queues, which is a good thing. However, comparing memory used by an "X" queue versus and "S" queue, we can see a large difference. Digging a bit into

Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-17 Thread Sylvain Jeaugey
Thanks Pasha for these details. On Mon, 17 May 2010, Pavel Shamis (Pasha) wrote: blocking is the receive queues, because they are created during MPI_Init, so in a way, they are the "basic fare" of MPI. BTW SRQ resources are also allocated on demand. We start with very small SRQ and it is

Re: [OMPI devel] Thread safety levels

2010-05-10 Thread Sylvain Jeaugey
On Mon, 10 May 2010, N.M. Maclaren wrote: As explained by Sylvain, current Open MPI implementation always returns MPI_THREAD_SINGLE as provided thread level if neither --enable-mpi-threads nor --enable-progress-threads was specified at configure (v1.4). That is definitely the correct action.

[OMPI devel] RDMA with ob1 and openib

2010-04-27 Thread Sylvain Jeaugey
Hi list, I'm currently working on IB bandwidth improvements and maybe some of you can help me understand a few things. I'm trying to align every IB RDMA operation to 64 bytes, because leaving it unaligned can hurt performance anywhere from slightly to very badly, depending on your architecture.
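
A hedged sketch of the alignment computation being described (illustrative names, not the actual ob1/openib code): compute how many bytes to peel off so that the rest of the RDMA transfer starts on a 64-byte boundary.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define RDMA_ALIGN 64

    /* Bytes to send "unaligned" first so that (addr + prefix) is 64-byte aligned. */
    static size_t unaligned_prefix(const void *addr)
    {
        uintptr_t a = (uintptr_t)addr;
        size_t rem = (size_t)(a % RDMA_ALIGN);
        return rem ? RDMA_ALIGN - rem : 0;
    }

    int main(void)
    {
        char buf[256];
        const char *p = buf + 3;                 /* deliberately misaligned */
        printf("prefix = %zu bytes\n", unaligned_prefix(p));
        return 0;
    }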

Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-30 Thread Sylvain Jeaugey
On Mon, 29 Mar 2010, Abhishek Kulkarni wrote: #define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) { static int event = -1; if (OPAL_UNLIKELY(event == -1) { event = opal_sos_create_new_event(eventstr, associated_text); } .. }

Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Sylvain Jeaugey
Hi Ralph, For now, I think that yes, this is a unique identifier. However, in my opinion, this could be improved in the future replacing it by a unique string. Something like : #define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) { static int event = -1; if
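
A self-contained illustration of the register-once pattern this macro implements; register_event and DEFINE_EVENT are hypothetical stand-ins, not ORTE's opal_sos API:

    #include <stdio.h>

    /* Hypothetical stand-in for opal_sos_create_new_event(): registers an
     * event string and hands back an integer id. */
    static int register_event(const char *name, const char *text)
    {
        static int next_id = 0;
        printf("registered '%s' (%s) as id %d\n", name, text, next_id);
        return next_id++;
    }

    /* The function-local static makes the registration happen only the first
     * time this particular call site is executed. */
    #define DEFINE_EVENT(eventstr, text)                        \
        do {                                                    \
            static int event = -1;                              \
            if (event == -1) {                                  \
                event = register_event((eventstr), (text));     \
            }                                                   \
            /* ... pass 'event' on to the notifier here ... */  \
        } while (0)

    int main(void)
    {
        for (int i = 0; i < 3; i++) {
            DEFINE_EVENT("relay_timeout", "the relay took too long");
        }
        return 0;   /* "registered ..." is printed exactly once */
    }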

Re: [OMPI devel] RFC: s/ENABLE_MPI_THREADS/ENABLE_THREAD_SAFETY/g

2010-02-09 Thread Sylvain Jeaugey
While we're at it, why not call the option giving MPI_THREAD_MULTIPLE support --enable-thread-multiple ? About ORTE and OPAL, if you have --enable-thread-multiple=yes, it may force the usage of --enable-thread-safety to configure OPAL and/or ORTE. I know there are other projects using ORTE

[OMPI devel] VT config.h.in

2010-01-19 Thread Sylvain Jeaugey
Hi list, The file ompi/contrib/vt/vt/config.h.in seems to have been added to the repository, but it is also created by autogen.sh. Is it normal ? The result is that when I commit after autogen, I have my patches polluted with diffs in this file. Sylvain

Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-18 Thread Sylvain Jeaugey
On Jan 17, 2010, at 11:31 AM, Ashley Pittman wrote: Tuning the libc malloc implementation using the options they provide to do so is valid and provides real benefit to a lot of applications. For the record we used to disable mmap based allocations by default on Quadrics systems and I can't

Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-08 Thread Sylvain Jeaugey
On Thu, 7 Jan 2010, Eugene Loh wrote: Could someone tell me how these settings are used in OMPI or give any guidance on how they should or should not be used? This is a very good question :-) As this whole e-mail, though it's hard (in my opinion) to give it a Good (TM) answer. This means

[OMPI devel] Thread safety levels

2010-01-05 Thread Sylvain Jeaugey
Hi list, I'm currently playing with thread levels in Open MPI and I'm quite surprised by the current code. First, the C interface : at ompi/mpi/c/init_thread.c:56 we have : #if OPAL_ENABLE_MPI_THREADS *provided = MPI_THREAD_MULTIPLE; #else *provided = MPI_THREAD_SINGLE; #endif prior
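
A hypothetical helper mirroring the behaviour quoted above (this is a sketch, not the actual init_thread.c code): the provided level depends only on the compile-time flag, never on what the caller asked for.

    #include <stdio.h>
    #include <mpi.h>

    /* Hypothetical helper (not the real ompi/mpi/c/init_thread.c): 'provided'
     * follows the compile-time OPAL_ENABLE_MPI_THREADS flag only. */
    static void set_provided_level(int required, int *provided)
    {
        (void)required;                     /* not consulted at all */
    #if OPAL_ENABLE_MPI_THREADS
        *provided = MPI_THREAD_MULTIPLE;
    #else
        *provided = MPI_THREAD_SINGLE;
    #endif
    }

    int main(void)
    {
        int provided;
        set_provided_level(MPI_THREAD_MULTIPLE, &provided);
        printf("provided = %d (SINGLE=%d, MULTIPLE=%d)\n",
               provided, MPI_THREAD_SINGLE, MPI_THREAD_MULTIPLE);
        return 0;
    }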

Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-08 Thread Sylvain Jeaugey
Thanks Rainer for the patch. I confirm it solves my testcase as well as the real application that triggered the bug. Sylvain On Mon, 7 Dec 2009, Rainer Keller wrote: Hello Sylvain, On Friday 04 December 2009 02:27:22 pm Sylvain Jeaugey wrote: There is definitely something wrong in types

Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-04 Thread Sylvain Jeaugey
Fri Dec 04 19:59:26 2009 +0100 @@ -56,7 +56,7 @@ * * XXX TODO Adapt to whatever the OMPI-layer needs */ -#define OPAL_DATATYPE_MAX_SUPPORTED 46 +#define OPAL_DATATYPE_MAX_SUPPORTED 56 /* flags for the datatypes. */ On Fri, 4 Dec 2009, Sylvain Jeaugey wrote: For the record, a

Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-04 Thread Sylvain Jeaugey
For the record, and to try to explain why all MTT tests may have missed this "bug", configuring without --enable-debug makes the bug disappear. Still trying to figure out why. Sylvain On Thu, 3 Dec 2009, Sylvain Jeaugey wrote: Hi list, I hope this time I won't be the only one

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-03 Thread Sylvain Jeaugey
@odin mpi]$ Sorry I don't have more time to continue pursuing this. I have no idea what is going on with your system(s), but it clearly is something peculiar to what you are doing or the system(s) you are running on. Ralph On Dec 2, 2009, at 1:56 AM, Sylvain Jeaugey wrote: Ok, so I tried

[OMPI devel] Crash when using MPI_REAL8

2009-12-03 Thread Sylvain Jeaugey
Hi list, I hope this time I won't be the only one to suffer this bug :) It is very simple indeed, just perform an allreduce with MPI_REAL8 (fortran) and you should get a crash in ompi/op/op.h:411. Tested with trunk and v1.5, working fine on v1.3. From what I understand, in the trunk,

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-02 Thread Sylvain Jeaugey
uns to get it). But since this is a race condition, your mileage may vary on a different cluster. With the patch however, I'm in every time. I'll continue to try different configurations (e.g. without slurm ...) to see if I can reproduce it on much common configurations. Sylvain On Mon, 30 Nov 2009, Sylva

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey
and the compiler. On Nov 30, 2009, at 8:48 AM, Sylvain Jeaugey wrote: Hi Ralph, I'm also puzzled :-) Here is what I did today : * download the latest nightly build (openmpi-1.7a1r22241) * untar it * patch it with my "ORTE_RELAY_DELAY" patch * build it directly on the cluster (ru

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey
in wrote: On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote: Hi Ralph, I tried with the trunk and it makes no difference for me. Strange Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can repro

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-27 Thread Sylvain Jeaugey
s?? That is the only way I can recreate this behavior. I plan to modify the relay/message processing method anyway to clean it up. But there doesn't appear to be anything wrong with the current code. Ralph On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote: Hi Ralph, Thanks for your efforts. I

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Sylvain Jeaugey
crcp enable_io_romio=no On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote: On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote: Thank you Ralph for this precious help. I setup a quick-and-dirty patch basically postponing process_msg (hence daemon_collective) until the launch is done. In process_msg,

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
it 3. now launch the local procs It would be a fairly simple reorganization of the code in the orte/mca/odls area. I can do it this weekend if you like, or you can do it - either way is fine, but if you do it, please contribute it back to the trunk. Ralph On Nov 19, 2009, at 1:39 AM, Sylvain Jeauge

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
2009, at 9:01 AM, Sylvain Jeaugey wrote: I don't think so, and I'm not doing it explicitely at least. How do I know ? Sylvain On Tue, 17 Nov 2009, Ralph Castain wrote: We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion. Did

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey
chance? If so, that definitely won't work. On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote: Hi all, We are currently experiencing problems at launch on the 1.5 branch on relatively large number of nodes (at least 80). Some processes are not spawned and orted processes are deadlocked. When

[OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey
Hi all, We are currently experiencing problems at launch on the 1.5 branch on relatively large number of nodes (at least 80). Some processes are not spawned and orted processes are deadlocked. When MPI processes are calling MPI_Init before send_relay is complete, the send_relay function and

Re: [OMPI devel] [OMPI users] cartofile

2009-10-13 Thread Sylvain Jeaugey
We worked a bit on it and yes, there is some work to do: * The syntax used to describe the various components is far from consistent from one usage to another ("SOCKET", "NODE", ...). We managed to make things work by reading the various out-of-date example files - but mainly the code. *

Re: [OMPI devel] Deadlock with comm_create since cid allocator change

2009-09-21 Thread Sylvain Jeaugey
You were faster to fix the bug than I was to send my bug report :-) So I confirm : this fixes the problem. Thanks ! Sylvain On Mon, 21 Sep 2009, Edgar Gabriel wrote: what version of OpenMPI did you use? Patch #21970 should have fixed this issue on the trunk... Thanks Edgar Sylvain Jeaugey

[OMPI devel] Deadlock with comm_create since cid allocator change

2009-09-21 Thread Sylvain Jeaugey
Hi list, We are currently experiencing deadlocks when using communicators other than MPI_COMM_WORLD. So we made a very simple reproducer (Comm_create then MPI_Barrier on the communicator - see end of e-mail). We can reproduce the deadlock only with openib and with at least 8 cores (no
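
A minimal sketch of the kind of reproducer described (the original attachment is not included here): create a communicator with MPI_Comm_create, then barrier on it.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Group world_grp;
        MPI_Comm  newcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
        /* Same membership as MPI_COMM_WORLD, but a new cid must be allocated. */
        MPI_Comm_create(MPI_COMM_WORLD, world_grp, &newcomm);
        MPI_Barrier(newcomm);

        MPI_Comm_free(&newcomm);
        MPI_Group_free(&world_grp);
        MPI_Finalize();
        return 0;
    }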

Re: [OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey
, but if I'm not mistaken (again !) tcp still hangs. Sylvain On Fri, 4 Sep 2009, Sylvain Jeaugey wrote: Hi Rolf, I was indeed running a more than 4 weeks old trunk, but after pulling the latest version (and checking the patch was in the code), it seems to make no difference. However, I know

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
Understood. So, let's say that we're only implementing a hurdle to discourage users from doing things wrong. I guess the efficiency of this will reside in the message displayed to the user ("You are about to break the entire machine and you will be fined if you try to circumvent this in any

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
Looks like users at LANL are not very nice ;) Indeed, this is not hard security. It is only a way to prevent users from making mistakes. We often give users special tuning for their application and when they see their application is going faster, they start messing with every parameter hoping that it

Re: [OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey
/changeset/21833 If you are running the latest bits and still seeing the problem, then I guess it is something else. Rolf On 09/04/09 04:40, Sylvain Jeaugey wrote: Hi all, We're currently working with romio and we hit a problem when exchanging data with hindexed types with the openib btl

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
On Fri, 4 Sep 2009, Jeff Squyres wrote: I haven't looked at the code deeply, so forgive me if I'm parsing this wrong: is the code actually reading the file into one list and then moving the values to another list? If so, that seems a little hackish. Can't it just read directly to the target

Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
On Fri, 4 Sep 2009, Jeff Squyres wrote: -- *** Checking versions checking for SVN version... done checking Open MPI version... 1.4a1hgf11244ed72b5 up to changeset c4b117c5439b checking Open MPI release date... Unreleased developer copy checking Open MPI Subversion repository version...

[OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey
Hi all, We're currently working with romio and we hit a problem when exchanging data with hindexed types with the openib btl. The attached reproducer (adapted from romio) is working fine on tcp, blocks on openib when using 1 port but works if we use 2 ports (!). I tested it against the

Re: [OMPI devel] RFC: convert send to ssend

2009-08-24 Thread Sylvain Jeaugey
For the record, I see a big interest in this. Sometimes, you have to answer calls for tender featuring applications that must work with no code change, even if the code is completely not MPI-compliant. That's sad, but true (no pun intended :-)) Sylvain On Mon, 24 Aug 2009, George Bosilca

Re: [OMPI devel] Improvement of openmpi.spec

2009-08-06 Thread Sylvain Jeaugey
-pkgname or somesuch to OMPI's configure to override the built-in name? Hum, I guess you're right, this is indeed not something to change. Sorry about that. Sylvain On Jul 31, 2009, at 11:51 AM, Sylvain Jeaugey wrote: Hi all, We had to apply a little set of modifications to the openmpi.s

[OMPI devel] Improvement of openmpi.spec

2009-07-31 Thread Sylvain Jeaugey
n a couple of places - Add an %{opt_prefix} option to be able to install in a specific path (e.g. in /opt//mpi/-/ instead of /opt/-) The patch is done with "hg extract" but should apply on the SVN trunk. Sylvain# HG changeset patch # User Sylvain Jeaugey <sylvain.jeau...@bull.net&g

Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Sylvain Jeaugey
Hi Jeff, I'm interested in joining the effort, since we will likely have the same problem with SLURM's cpuset support. On Wed, 22 Jul 2009, Jeff Squyres wrote: But as to why it's getting EINVAL, that could be wonky. We might want to take this to the PLPA list and have you run some small,

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-19 Thread Sylvain Jeaugey
On Thu, 18 Jun 2009, Jeff Squyres wrote: On Jun 18, 2009, at 11:25 AM, Sylvain Jeaugey wrote: My problem seems related to library generation through RPM, not with 1.3.2, nor the patch. I'm not sure I understand -- is there something we need to fix in our SRPM? I need to dig a bit

Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-18 Thread Sylvain Jeaugey
Ok, never mind. My problem seems related to library generation through RPM, not with 1.3.2, nor the patch. Sylvain On Thu, 18 Jun 2009, Sylvain Jeaugey wrote: Hi all, Until Open MPI 1.3 (maybe 1.3.1), I used to find it convenient to be able to move a library from its "normal&q

[OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-18 Thread Sylvain Jeaugey
Hi all, Until Open MPI 1.3 (maybe 1.3.1), I used to find it convenient to be able to move a library from its "normal" place (either /usr or /opt) to somewhere else (i.e. my NFS home account) to be able to try things only on my account. So, I used to set OPAL_PREFIX to the root of the Open

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-12 Thread Sylvain Jeaugey
to have a real reproducer). Sylvain On Wed, 10 Jun 2009, Sylvain Jeaugey wrote: Hum, very glad that padb works with Open MPI, I couldn't live without it. In my opinion, the best debug tool for parallel applications, and more importantly, the only one that scales. About the issue, I couldn't

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Sylvain Jeaugey
Hum, very glad that padb works with Open MPI, I couldn't live without it. In my opinion, the best debug tool for parallel applications, and more importantly, the only one that scales. About the issue, I couldn't reproduce it on my platform (tried 2 nodes with 2 to 8 processes each, nodes are

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-10 Thread Sylvain Jeaugey
e know so a human can decide what, if anything, to do about it, or provide a hook so that people can explore/utilize different response strategies...or both! HTH Ralph On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> wrote: I understand your point of view, and most

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey
bout it, or provide a hook so that people can explore/utilize different response strategies...or both! HTH Ralph On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> wrote: I understand your point of view, and mostly share it. I think the biggest point i

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey
munication point. We -want- those processes spinning away so that, when the comm starts, it can proceed as quickly as possible. Just some thoughts... Ralph On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote: Sylvain Jeaugey wrote: Hi Ralph, I'm entirely convinced that MPI doesn't have to

Re: [OMPI devel] Multi-rail on openib

2009-06-09 Thread Sylvain Jeaugey
On Mon, 8 Jun 2009, NiftyOMPI Tom Mitchell wrote: ??? dual rail does double the number of switch ports. If you want to address switch failure each rail must connect to a different switch. If you do not want to have isolated fabrics you must have some additional ports on all switches to

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey
- we still wind up adding logic into a very critical timing loop for no reason. A simple configure option of --enable-mpi-progress-monitoring would be sufficient to protect the code. HTH Ralph On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote: What : when nothing has been receiv

Re: [OMPI devel] problem in the ORTE notifier framework

2009-06-08 Thread Sylvain Jeaugey
ty is involved in real world situations. My guess is that it won't be that big, but it's hard to know without seeing how frequently we actually insert this code. Hope that makes sense Ralph On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> wrote: About perf

Re: [OMPI devel] Multi-rail on openib

2009-06-08 Thread Sylvain Jeaugey
Hi Tom, Yes, there is a goal in mind, and it is definitely not performance: we are working on device failover, i.e. when a network adapter or switch fails, use the remaining one. We don't intend to improve performance with multi-rail (which, as you said, will not happen unless you have a DDR card

Re: [OMPI devel] problem in the ORTE notifier framework

2009-05-28 Thread Sylvain Jeaugey
To be more complete, we pull Hg from http://www.open-mpi.org/hg/hgwebdir.cgi/ompi-svn-mirror/ ; are we mistaken ? If not, the code in v1.3 seems to be different from the code in the trunk ... Sylvain On Thu, 28 May 2009, Nadia Derbey wrote: On Tue, 2009-05-26 at 17:24 -0600, Ralph

Re: [OMPI devel] problem in the ORTE notifier framework

2009-05-27 Thread Sylvain Jeaugey
uess is that it won't be that big, but it's hard to know without seeing how frequently we actually insert this code. Hope that makes sense Ralph On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> wrote: About performance, I may miss something, but our first

Re: [OMPI devel] Device failover in dr pml (fwd)

2009-04-16 Thread Sylvain Jeaugey
Well, if reviving means making device failover work, then yes, in a way we revived it ;) We are currently making mostly experiments to figure out how to have device failover working. No big fixes for now, and that's why we are posting here before going further. From what I understand,

Re: [OMPI devel] SM init failures

2009-03-31 Thread Sylvain Jeaugey
Sorry to continue off-topic but going to System V shm would be for me like going back in the past. System V shared memory used to be the main way to do shared memory on MPICH and from my (little) experience, this was truly painful : - Cleanup issues : does shmctl(IPC_RMID) solve _all_ cases ?