Re: [OMPI devel] Annual OMPI membership review: SVN accounts

2013-07-16 Thread Eugene Loh
Terry is dropping his account due to a change in "day job" responsibilities. I'm retaining mine. Oracle status is changing from member to contributor. On 7/16/2013 12:16 AM, Rainer Keller wrote: Hi Josh, thanks for the info. Was about to look at this mail... Is Oracle / Sun not part of OMPI

Re: [OMPI devel] Annual OMPI membership review: SVN accounts

2013-07-09 Thread Eugene Loh
in the past year. Oracle == emallove: Ethan Mallove <ethan.mall...@oracle.com> **NO COMMITS IN LAST YEAR** eugene: Eugene Loh <eugene@oracle.com> tdd: Terry Dontje <terry.don...@oracle.com> Please keep eugene, but close emallove and tdd.

Re: [OMPI devel] v1.7.0rc7

2013-02-26 Thread Eugene Loh
On 02/23/13 14:45, Ralph Castain wrote: This release candidate is the last one we expect to have before release, so please test it. Can be downloaded from the usual place: http://www.open-mpi.org/software/ompi/v1.7/ I haven't looked at this very carefully yet. Maybe someone can confirm what

Re: [OMPI devel] 1.6.4rc5: final rc

2013-02-20 Thread Eugene Loh
On 02/20/13 07:54, Jeff Squyres (jsquyres) wrote: All MTT testing looks good for 1.6.4. There seems to be an MPI dynamics problem when --enable-sparse-groups is used, but this does not look like a regression to me. I put out a final rc, because there was one more minor change to accommodate

Re: [OMPI devel] [patch] Invalid MPI_Status for null or inactive request

2012-10-04 Thread Eugene Loh
On 10/04/12 07:00, Kawashima, Takahiro wrote: (1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE. This bug is caused by a use of an incorrect variable in ompi/mpi/c/wait.c (for MPI_Wait) and by an incorrect initialization of ompi_request_null in
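
For reference, a minimal sketch of the requirement under discussion, based only on the MPI standard's empty-status rule for null requests; this is not the attached patch:

    /* Hedged sketch: waiting on MPI_REQUEST_NULL must return the empty
     * status (MPI_SOURCE == MPI_ANY_SOURCE, MPI_TAG == MPI_ANY_TAG). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Request req = MPI_REQUEST_NULL;
        MPI_Status  status;

        MPI_Init(&argc, &argv);
        MPI_Wait(&req, &status);   /* returns immediately with an empty status */
        printf("MPI_SOURCE=%d (expect %d), MPI_TAG=%d (expect %d)\n",
               status.MPI_SOURCE, MPI_ANY_SOURCE, status.MPI_TAG, MPI_ANY_TAG);
        MPI_Finalize();
        return 0;
    }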

Re: [OMPI devel] [patch] Invalid MPI_Status for null or inactive request

2012-10-04 Thread Eugene Loh
On 10/4/2012 4:00 AM, Kawashima, Takahiro wrote: > Hi Open MPI developers, > > I found some bugs in Open MPI and attach a patch to fix them. > > The bugs are: > > (1) MPI_SOURCE of MPI_Status for a null request must be MPI_ANY_SOURCE. > > (2) MPI_Status for an inactive request must be an empty

[OMPI devel] nightly tarballs

2012-10-02 Thread Eugene Loh
Where do I find the details on how the nightly tarballs are made from the SVN repos?

Re: [OMPI devel] making Fortran MPI_Status components public

2012-09-27 Thread Eugene Loh
On 9/27/2012 11:31 AM, N.M. Maclaren wrote: On Sep 27 2012, Jeff Squyres (jsquyres) wrote: ..."that obscene hack"... ...configure mechanism... Good discussion, but as far as my specific issue goes, it looks like it's some peculiar interaction between different compiler versions. I'm asking

[OMPI devel] making Fortran MPI_Status components public

2012-09-26 Thread Eugene Loh
The ibm tests aren't building for me. One of the issues is mprobe_usempif08.f90 trying to access status%MPI_SOURCE and status%MPI_TAG. I assume this is supposed to work, but it doesn't. E.g., trunk with Oracle Studio compilers: % cat a.f90 use mpi_f08 type(MPI_Status) status

[OMPI devel] trunk's mapping to nodes... local host

2012-09-07 Thread Eugene Loh
Maybe this is related to Reuti's "-hostfile ignored in 1.6.1" on the users mail list, but not quite sure. Let's pretend my nodes are called local, r1, and r2. That is, I launch mpirun from "local" and there are two other (remote) nodes available to me. With the trunk (e.g., v1.9 r27247), I

[OMPI devel] trunk broken?

2012-08-30 Thread Eugene Loh
Trunk broken? Last night, Oracle's MTT trunk runs all came up empty handed. E.g., *** Process received signal *** Signal: Segmentation fault (11) Signal code: Address not mapped (1) Failing at address: (nil) [ 0] [0xe600] [ 1] /lib/libc.so.6(strlen+0x33) [0x3fa0a3] [ 2]

Re: [OMPI devel] r27078 and OMPI build

2012-08-29 Thread Eugene Loh
r27178 seems to build fine. Thanks. On 8/29/2012 7:42 AM, Shamis, Pavel wrote: Eugene, Can you please confirm that the issue is resolved on your setup ? On Aug 29, 2012, at 10:14 AM, Shamis, Pavel wrote: The issue #2 was fixed in r27178.

Re: [OMPI devel] r27078 and OMPI build

2012-08-24 Thread Eugene Loh
g specific to that one machine? I'm wondering because if it is just the one machine, then it might be something strange about how it is set up - perhaps the version of Solaris, or it is configuring --enable-static, or... Just trying to assess

Re: [OMPI devel] r27078 and OMPI build

2012-08-24 Thread Eugene Loh
On 08/24/12 09:54, Shamis, Pavel wrote: Maybe there is a chance to get direct access to this system ? No. But I'm attaching compressed log files from configure/make. tarball-of-log-files.tar.bz2 Description: application/bzip

Re: [OMPI devel] MPI_Mprobe

2012-08-09 Thread Eugene Loh
On 8/7/2012 5:45 AM, Jeff Squyres wrote: So the issue is when (for example) Fortran MPI_Recv says "hey, C ints are the same as Fortran INTEGERs, so I don't need a temporary MPI_Status buffer; I'll just use the INTEGER array that I was given, and pass it to the back-end C MPI_Recv() routine."

Re: [OMPI devel] MPI_Mprobe

2012-07-31 Thread Eugene Loh
On 7/31/2012 5:15 AM, Jeff Squyres wrote: On Jul 31, 2012, at 2:58 AM, Eugene Loh wrote: The main issue is this. If I go to ompi/mpi/fortran/mpif-h, I see six files (*recv_f and *probe_f) that take status arguments. Normally, we do some conversion between Fortran and C status arguments

[OMPI devel] MPI_Mprobe

2012-07-31 Thread Eugene Loh
I have some questions originally motivated by some mpif-h/MPI_Mprobe failures we've seen in SPARC MTT runs at 64-bit in both v1.7 and v1.9, but my poking around spread out from there. The main issue is this. If I go to ompi/mpi/fortran/mpif-h, I see six files (*recv_f and *probe_f) that take
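
For context, a hedged sketch of the C-side calling sequence whose status argument the mpif-h wrappers must convert; source, tag, and buffer size here are arbitrary placeholders:

    #include <mpi.h>

    void probe_then_receive(void)
    {
        MPI_Message msg;
        MPI_Status  status;
        int         buf[4];

        MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &msg, &status);
        /* The status filled in by MPI_Mprobe is what the mpif-h wrapper
         * must translate between C and Fortran representations. */
        MPI_Mrecv(buf, 4, MPI_INT, &msg, &status);
    }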

Re: [OMPI devel] [EXTERNAL] non-blocking collectives, SPARC, and alignment

2012-07-18 Thread Eugene Loh
at these issues. If not, it might be best to remove the libnbc code from 1.7, as it's unfortunately clear that it's not as ready for integration as we believed and I don't have time to fix the code base. On 7/16/12 2:50 PM, "Eugene Loh"<eugene@oracle.com> wrote: The NBC functionality d

Re: [OMPI devel] [EXTERNAL] non-blocking collectives, SPARC, and alignment

2012-07-16 Thread Eugene Loh
to fix the code base. Brian On 7/16/12 2:50 PM, "Eugene Loh"<eugene@oracle.com> wrote: The NBC functionality doesn't fare very well on SPARC. One of the problems is with data alignment. An NBC schedule is a number of variously sized fields laid out contiguously in lin

[OMPI devel] non-blocking collectives, SPARC, and alignment

2012-07-16 Thread Eugene Loh
The NBC functionality doesn't fare very well on SPARC. One of the problems is with data alignment. An NBC schedule is a number of variously sized fields laid out contiguously in linear memory (e.g., see nbc_internal.h or nbc.c) and words don't have much natural alignment. On SPARC, the
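
To illustrate the hazard (this is not the libnbc code), a sketch of an alignment-safe read of a multi-byte field from a packed buffer; on SPARC, a direct pointer cast to a misaligned address can raise SIGBUS:

    #include <stdint.h>
    #include <string.h>

    uint64_t read_u64_unaligned(const char *buf, size_t offset)
    {
        uint64_t value;
        /* *(uint64_t *)(buf + offset) would fault on SPARC if misaligned;
         * memcpy has no alignment requirement. */
        memcpy(&value, buf + offset, sizeof(value));
        return value;
    }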

Re: [OMPI devel] [OMPI svn-docs] svn:open-mpi-tests r2002 - trunk/ibm/collective

2012-07-11 Thread Eugene Loh
thought i would be 100 at the end of that do loop. $%#@#@$% Fortran. :-( On Jul 11, 2012, at 12:25 PM,<svn-commit-mai...@open-mpi.org> wrote: Author: eugene (Eugene Loh) Date: 2012-07-11 12:25:09 EDT (Wed, 11 Jul 2012) New Revision: 2002 Log: Apply the "right value when calling wa

[OMPI devel] ibcast segfault on v1.7 [was: reduce_scatter_block failing on v1.7]

2012-07-07 Thread Eugene Loh
On 07/06/12 14:35, Barrett, Brian W wrote: On 7/6/12 2:31 PM, "Eugene Loh"<eugene@oracle.com> wrote: The new reduce_scatter_block test is segfaulting with v1.7 but not with the trunk. When we drop down into MPI_Reduce_scatter_block and attem

[OMPI devel] ibm/collective/bcast_f08.f90

2012-07-06 Thread Eugene Loh
I assume this is an orphaned file that should be removed? (It looks like a draft version of ibcast_f08.f90.)

[OMPI devel] reduce_scatter_block failing on v1.7

2012-07-06 Thread Eugene Loh
The new reduce_scatter_block test is segfaulting with v1.7 but not with the trunk. When we drop down into MPI_Reduce_scatter_block and attempt to call comm->c_coll.coll_reduce_scatter_block() it's NULL. (So is comm->c_coll.coll_reduce_scatter_block_module.) Is there some work on the trunk
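
For context, a minimal sketch of the call the failing test exercises; a NULL coll_reduce_scatter_block function pointer inside the library would crash at this point. The function body and counts are illustrative:

    #include <mpi.h>

    void reduce_scatter_block_example(void)
    {
        int sendbuf[64], recvbuf[16];
        for (int i = 0; i < 64; i++) sendbuf[i] = i;
        /* each rank contributes 16 ints per peer; assumes at most 4 ranks */
        MPI_Reduce_scatter_block(sendbuf, recvbuf, 16, MPI_INT, MPI_SUM,
                                 MPI_COMM_WORLD);
    }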

[OMPI devel] non-blocking barrier

2012-07-06 Thread Eugene Loh
Either there is a problem with MPI_Ibarrier or I don't understand the semantics. The following example is with openmpi-1.9a1r26747. (Thanks for the fix in 26757. I tried with that as well with same results.) I get similar results for different OSes, compilers, bitness, etc. % cat
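
For reference, the expected MPI_Ibarrier semantics in a hedged sketch (not the failing example itself): the call may return before all ranks arrive, and completion is only guaranteed after the matching MPI_Wait:

    #include <mpi.h>

    void nonblocking_barrier(void)
    {
        MPI_Request req;
        MPI_Ibarrier(MPI_COMM_WORLD, &req);
        /* ... overlap local work here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* now all ranks have entered the barrier */
    }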

[OMPI devel] ibarrier failures on MTT

2012-07-03 Thread Eugene Loh
I'll look at this more, but for now I'll just note that the new ibarrier test is showing lots of failures on MTT (cisco and oracle).

[OMPI devel] u_int32_t typo in nbc_internal.h?

2012-06-27 Thread Eugene Loh
ompi/mca/coll/libnbc/nbc_internal.h 259 /* Schedule cache structures/functions */ 260 u_int32_t adler32(u_int32_t adler, int8_t *buf, int len); 261 void NBC_SchedCache_args_delete(void *entry); 262 void NBC_SchedCache_args_delete_key_dummy(void *k); u_int32_t -> uint32_t

Re: [OMPI devel] openib wasn't building

2012-06-25 Thread Eugene Loh
Thanks. That explains one mystery. I'm still unclear, though. Or, maybe I'm hitting a different problem. I configure with "--with-openib" (along with other stuff). I get: r26639:checking if MCA component btl:openib can compile... yes r26640:checking if MCA component btl:openib can

[OMPI devel] MPI_Reduce_scatter_block

2012-06-25 Thread Eugene Loh
In tarball 26642, Fortran compilation no longer succeeds. I suspect the problem might be 26641. E.g., libmpi_usempif08.so: undefined reference to `ompi_iscan_f' libmpi_mpifh.so: undefined reference to `MPI_Reduce_scatter_block' libmpi_mpifh.so: undefined reference to

Re: [OMPI devel] bug in r26626

2012-06-24 Thread Eugene Loh
Thanks for r26638. Looks like that file still needs a little attention: http://www.open-mpi.org/mtt/index.php?do_redir=2073 On 6/22/2012 10:40 AM, Eugene Loh wrote: Looking good. Just a few more: btl_udapl_endpoint.c has instances of seg_len and seg_addr. udapl may not have much of a future

Re: [OMPI devel] bug in r26626

2012-06-22 Thread Eugene Loh
Looking good. Just a few more: btl_udapl_endpoint.c has instances of seg_len and seg_addr. udapl may not have much of a future, but for now it's still there. On 6/22/2012 7:22 AM, Hjelm, Nathan T wrote: Looks like I missed a few places in udapl and osc. Fixed with r26635 and r26634.

Re: [OMPI devel] hang with launch including remote nodes

2012-06-21 Thread Eugene Loh
'opal_libevent2019_event_base_loop+0x606 /home/eugene/r26609/lib/libopen-rte.so.0.0.0'orte_daemon+0xd6d /home/eugene/r26609/bin/orted'0xd4b [remote1:01409] *** End of error message *** Segmentation Fault (core dumped) On Jun 19, 2012, at 8:31 PM, Eugene Loh wrote: I'm having bad luck with the trunk starting with r26609

[OMPI devel] hang with launch including remote nodes

2012-06-19 Thread Eugene Loh
I'm having bad luck with the trunk starting with r26609. Basically, things hang if I run mpirun -H remote1,remote2 -n 2 hostname where remote1 and remote2 are remote nodes.

Re: [OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types

2012-06-15 Thread Eugene Loh
On 6/15/2012 11:59 AM, Nathan Hjelm wrote: Until we can find the root cause I pushed a change that protects the reset by checking if size> 0. Let me know if that works for you. It does.

Re: [OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types

2012-06-15 Thread Eugene Loh
which only happens if the above described test fails. I had some doubts about r26597, but I don't have time to check into it until Monday. Maybe you can remove it and see if you continue to have the same segfault. george. On Jun 15, 2012, at 01:24 , Eugene Loh wrote: I see a segfault

[OMPI devel] Barrier/coll_tuned/pml_ob1 segfault for derived data types

2012-06-14 Thread Eugene Loh
I see a segfault show up in trunk testing starting with r26598 when tests like ibm collective/struct_gatherv intel src/MPI_Type_free_[types|pending_msg]_[f|c] are run over openib. Here is a typical stack trace: opal_convertor_create_stack_at_begining(convertor = 0x689730,

Re: [OMPI devel] r26565 (orte progress threads and libevent thread support by default) causing segfaults

2012-06-11 Thread Eugene Loh
on segfaults with a variety of tests. So, I think it's not specific to loop_spawn. On Sat, Jun 9, 2012 at 3:35 PM, Eugene Loh <eugene@oracle.com <mailto:eugene@oracle.com>> wrote: On 6/9/2012 12:06 PM, Eugene Loh wrote: With r26565: Enable orte prog

Re: [OMPI devel] r26565 (orte progress threads and libevent thread support by default) causing segfaults

2012-06-09 Thread Eugene Loh
On 6/9/2012 12:06 PM, Eugene Loh wrote: With r26565: Enable orte progress threads and libevent thread support by default Oracle MTT testing started showing new spawn_multiple failures. Sorry. I meant loop_spawn. (And then, starting I think in 26582, the problem is masked behind another

[OMPI devel] r26565 (orte progress threads and libevent thread support by default) causing segfaults

2012-06-09 Thread Eugene Loh
With r26565: Enable orte progress threads and libevent thread support by default Oracle MTT testing started showing new spawn_multiple failures. I've only seen this in 64-bit. Here are two segfaults, both from Linux/x86 systems running over TCP: This one with GNU compilers: [...]

[MTT devel] MTT queries... problems

2012-05-30 Thread Eugene Loh
I seem to get unreliable results from MTT queries. To reproduce: - go to http://www.open-mpi.org/mtt - click on "Test run" - for "Date range:" enter "2012-03-23 00:30:00 - 2012-03-23 23:55:00" - for "Org:" enter "oracle" - for "Platform name:" enter "t2k-0" - for "Suite:" enter "ibm-32" - click

[OMPI devel] orte_util_decode_pidmap and hwloc

2012-05-26 Thread Eugene Loh
I'm suspicious of some code, but would like comment from someone who understands it. In orte/util/nidmap.c orte_util_decode_pidmap(), one cycles through a buffer. One cycles through jobs. For each one, one unpacks num_procs. One also unpacks all sorts of other stuff like bind_idx. In

[OMPI devel] trunk hang (when remote orted has to spawn another orted?)

2012-05-08 Thread Eugene Loh
Here is another trunk hang. I get it if I use at least three remote nodes. E.g., with r26385: % mpirun -H remoteA,remoteB,remoteC -n 2 hostname [remoteA:20508] [[54625,0],1] ORTE_ERROR_LOG: Not found in file base/ess_base_fns.c at line 135 [remoteA:20508] [[54625,0],1] unable to get

[OMPI devel] mpirun hostname hangs on trunk r26380?

2012-05-03 Thread Eugene Loh
I'm hanging on the trunk, even with something as simple as "mpirun hostname". r26377 and earlier are fine, but r26381 is not. Quickly looking at the putback log, r26380 seems to be the likely candidate. I'll look at this some more, but the hang is here (orterun.c): 935 /* loop the

Re: [OMPI devel] Fortran linking problem: libraries have changed

2012-04-23 Thread Eugene Loh
On 4/23/2012 8:22 AM, Jeffrey Squyres wrote: On Apr 23, 2012, at 1:40 AM, Eugene Loh wrote: [rhc@odin001 ~/svn-trunk]$ mpifort --showme gfortran -I/nfs/rinfs/san/homedirs/rhc/openmpi/include -I/nfs/rinfs/san/homedirs/rhc/openmpi/lib -L/nfs/rinfs/san/homedirs/rhc/openmpi/lib -lmpi_usempi

[OMPI devel] Fortran linking problem: libraries have changed

2012-04-22 Thread Eugene Loh
Next Fortran problem. Oracle MTT managed to build the trunk (r26307) in some cases. No test-run failures in these cases, but the pass counts are way low. Turns out, the Fortran tests aren't being built (or run). I try compiling a Fortran code: ld: fatal: library -lmpi_f77: not found ld:

[OMPI devel] configure check for Fortran and threads

2012-04-21 Thread Eugene Loh
Another probably-Fortran-merge problem. Three issues in this e-mail. Introduction: The last two nights, Oracle MTT tests have been unable to build the trunk (r26307) with Oracle Studio compilers. This has been uncovered since the fix of r26302, allowing us to get further in the build

[OMPI devel] testing if Fortran compiler likes the C++ exception flags

2012-04-20 Thread Eugene Loh
I think this is related to the "Fortran merge." Last night, Oracle MTT tests couldn't build the trunk (r26307) with Intel compilers. Specifically, configure fails with checking to see if Fortran compiler likes the C++ exception flags... no configure: WARNING: C++ exception flags are

Re: [OMPI devel] v1.5 r26132 broken on multiple nodes?

2012-03-16 Thread Eugene Loh
the branch. On Mar 14, 2012, at 11:27 PM, Eugene Loh wrote: I'm quitting for the day, but happened to notice that all our v1.5 MTT runs are failing with r26133, though tests ran fine as of r26129. Things run fine on-node, but if you run even just "hostname" on a remote node, the job fails

[OMPI devel] v1.5 r26132 broken on multiple nodes?

2012-03-15 Thread Eugene Loh
I'm quitting for the day, but happened to notice that all our v1.5 MTT runs are failing with r26133, though tests ran fine as of r26129. Things run fine on-node, but if you run even just "hostname" on a remote node, the job fails with orted: Command not found I get this problem whether I

Re: [OMPI devel] trunk regression in mpirun (no --prefix) r26081

2012-03-03 Thread Eugene Loh
Yes, seems to work for me, thanks. On 3/3/2012 3:14 PM, Ralph Castain wrote: Should be fixed in r26093 On Mar 3, 2012, at 4:06 PM, Eugene Loh wrote: I'll look at this some more, but for now I'll note that the trunk has an apparent regression in r26081. ./configure

[OMPI devel] trunk regression in mpirun (no --prefix) r26081

2012-03-03 Thread Eugene Loh
I'll look at this some more, but for now I'll note that the trunk has an apparent regression in r26081. ./configure \ --enable-shared \ --enable-orterun-prefix-by-default \ --disable-peruse \

[OMPI devel] locked memory consumption with openib and spawn

2012-02-27 Thread Eugene Loh
In the test suite, we have an ibm/dynamic/loop_spawn test that looks like this: for (...) { loop_spawn spawns loop_child parent and child execute MPI_Intercomm_merge parent and child execute MPI_Comm_free parent and child execute MPI_Comm_disconnect } If
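
A hedged C sketch of the parent-side pattern described above; the executable name "loop_child" and the single spawned process per iteration come from the test description and are not verified against the actual test source:

    #include <mpi.h>

    void spawn_loop(int iterations)
    {
        for (int i = 0; i < iterations; i++) {
            MPI_Comm child, merged;
            MPI_Comm_spawn("loop_child", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
            MPI_Intercomm_merge(child, 0 /* parent group orders low */, &merged);
            MPI_Comm_free(&merged);
            MPI_Comm_disconnect(&child);
        }
    }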

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 02/22/12 14:54, Ralph Castain wrote: That doesn't really address the issue, though. What I want to know is: what happens when you try to bind processes? What about -bind-to-socket, and -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if the socket layer isn't

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 2/22/2012 11:08 AM, Ralph Castain wrote: On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote: On 22/02/2012 17:48, Ralph Castain wrote: On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: On 2/21/2012 10:31 PM, Eugene Loh wrote: ... "sockets" is unknown and hwloc returns 0 for n

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 2/21/2012 10:31 PM, Eugene Loh wrote: ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by zero. OS info was listed in the original message (below). Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
On 2/21/2012 5:40 PM, Paul H. Hargrove wrote: Here are the first of the results of the testing I promised. I am not 100% sure how to reach the code that Eugene reported as problematic, I don't think you're going to see it. Somehow, hwloc on the config in question thinks there is no socket

Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh
the following should be fixed? *) on this platform, hwloc finds no socket level *) therefore hwloc returns num_sockets==0 to OMPI *) OMPI divides by 0 and barfs on basically everything On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote: We have some amount of MTT testing going on every night and on ONE

[OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Eugene Loh
We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux and I'm encountering the problem with Intel

[OMPI devel] Fortran improbe support

2012-02-15 Thread Eugene Loh
I had a question about our Fortran MPI_Improbe support. If I look at ompi/mpi/f77/improbe_f.c I see basically (lots of code removed): 64 void mpi_improbe_f(MPI_Fint *source, MPI_Fint *tag, MPI_Fint *comm, 65 ompi_fortran_logical_t *flag, MPI_Fint *message,

Re: [MTT devel] duplicate results

2012-01-06 Thread Eugene Loh
losely at results. Mostly, in any case, things look fine. but might be something with the submit.php script - just a guess though at this point. Unfortunately I have zero time to spend on MTT for a few weeks at least. :/ -- Josh On Thu, Jan 5, 2012 at 8:11 PM, Eugene Loh <eugene@o

[OMPI devel] 2012 MTT results

2012-01-02 Thread Eugene Loh
Oracle has MTT jobs that have been running and, according to the log files, been successfully reporting results to the IU database, even in the last few days. If I look at http://www.open-mpi.org/mtt, however, I can't seem to turn up any results for the new calendar year (2012). Any

Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)

2011-12-06 Thread Eugene Loh
On 11/21/11 20:51, Lukas Razik wrote: Hello everybody! I have Sun T5120 (SPARC64) servers with - Debian: 6.0.3 - linux-2.6.39.4 (from kernel.org) - OFED-1.5.3.2 - InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0) with newest FW (2.9.1) and

Re: [OMPI devel] r25470 (hwloc CMR) breaks v1.5

2011-11-16 Thread Eugene Loh
On 11/16/2011 3:32 AM, TERRY DONTJE wrote: On 11/15/2011 10:16 PM, Jeff Squyres wrote: On Nov 14, 2011, at 10:17 PM, Eugene Loh wrote: I tried building v1.5. r25469 builds for me, r25470 does not. This is Friday's hwloc putback of CMR 2866. I'm on Solaris11/x86. The problem is basically

[OMPI devel] r25470 (hwloc CMR) breaks v1.5

2011-11-15 Thread Eugene Loh
I tried building v1.5. r25469 builds for me, r25470 does not. This is Friday's hwloc putback of CMR 2866. I'm on Solaris11/x86. The problem is basically: Making all in tools/ompi_info CC ompi_info.o "../../../opal/include/opal/sys/ia32/atomic.h", line 173: warning: parameter in

Re: [OMPI devel] ibm/io/file_status_get_count

2011-11-04 Thread Eugene Loh
On 11/4/2011 5:56 AM, Jeff Squyres wrote: On Oct 28, 2011, at 1:59 AM, Eugene Loh wrote: In our MTT testing, we see ibm/io/file_status_get_count fail occasionally with: File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type F_RDLCK/0,whence 0) with return value and errno 5

[OMPI devel] ibm/io/file_status_get_count

2011-10-28 Thread Eugene Loh
In our MTT testing, we see ibm/io/file_status_get_count fail occasionally with: File locking failed in ADIOI_Set_lock(fd A,cmd F_SETLKW/7,type F_RDLCK/0,whence 0) with return value and errno 5. - If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon

[OMPI devel] MPI 2.2 datatypes

2011-10-20 Thread Eugene Loh
In MTT testing, we check OMPI version number to decide whether to test MPI 2.2 datatypes. Specifically, in intel_tests/src/mpitest_def.h: #define MPITEST_2_2_datatype 0 #if defined(OPEN_MPI) #if (OMPI_MAJOR_VERSION > 1) || (OMPI_MAJOR_VERSION == 1 && OMPI_MINOR_VERSION >= 7) #
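
A sketch of how such a version gate plausibly continues; the #undef/#define lines after the truncated "#" are assumptions, not the verbatim intel_tests source:

    /* Enable MPI 2.2 datatype tests only for Open MPI >= 1.7 (assumed
     * completion of the truncated guard quoted above). */
    #define MPITEST_2_2_datatype 0
    #if defined(OPEN_MPI)
    #if (OMPI_MAJOR_VERSION > 1) || (OMPI_MAJOR_VERSION == 1 && OMPI_MINOR_VERSION >= 7)
    #undef  MPITEST_2_2_datatype
    #define MPITEST_2_2_datatype 1
    #endif
    #endif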

Re: [OMPI devel] OMPI_MCA_opal_set_max_sys_limits

2011-09-01 Thread Eugene Loh
On 8/31/2011 4:48 AM, Ralph Castain wrote: Perhaps it would help if you had clearly stated your concern. Yeah. It would have helped had I clearly understood what was going on. Most of all, that way I wouldn't have had to ask any questions! :^) From this description, I gather your concern

Re: [OMPI devel] OMPI_MCA_opal_set_max_sys_limits

2011-08-31 Thread Eugene Loh
On 8/30/2011 7:34 PM, Ralph Castain wrote: On Aug 29, 2011, at 11:18 PM, Eugene Loh wrote: Maybe someone can help me from having to think too hard. Let's say I want to max my system limits. I can say this: % mpirun --mca opal_set_max_sys_limits 1 ... Cool. Meanwhile, if I do

[OMPI devel] OMPI_MCA_opal_set_max_sys_limits

2011-08-30 Thread Eugene Loh
Maybe someone can help me from having to think too hard. Let's say I want to max my system limits. I can say this: % mpirun --mca opal_set_max_sys_limits 1 ... Cool. Meanwhile, if I do this: % setenv OMPI_MCA_opal_set_max_sys_limits 1 % mpirun ... remote processes don't see

[OMPI devel] descriptor limits -- FAQ item

2011-08-29 Thread Eugene Loh
It seems to me the FAQ item http://www.open-mpi.org/faq/?category=large-clusters#fd-limits needs updating. I'm willing to give this a try, but need some help first. (I'm even more willing to let someone else do all this, but I'm not holding my breath.) For example, the text sounds dated --

Re: [OMPI devel] ibm/dynamic/loop_spawn

2011-08-20 Thread Eugene Loh
ntioned above so that you can spawn on CPUs that aren't spinning tightly on MPI progress, ...etc. On Aug 15, 2011, at 11:47 AM, Eugene Loh wrote: This is a question about ompi-tests/ibm/dynamic. Some of these tests (spawn, spawn_multiple, loop_spawn/child, and no-disconnect) exercise

[OMPI devel] ibm/dynamic/loop_spawn

2011-08-15 Thread Eugene Loh
This is a question about ompi-tests/ibm/dynamic. Some of these tests (spawn, spawn_multiple, loop_spawn/child, and no-disconnect) exercise MPI_Comm_spawn* functionality. Specifically, they spawn additional processes (beyond the initial mpirun launch) and therefore exert a different load on a

Re: [OMPI devel] [TIPC BTL] test programmes

2011-08-01 Thread Eugene Loh
NAS Parallel Benchmarks are self-verifying. Another option is the MPI Testing Tool http://www.open-mpi.org/projects/mtt/ but it might be more trouble than it's worth. (INCIDENTALLY, THERE ARE TRAC TROUBLES WITH THE THREE LINKS AT THE BOTTOM OF THAT PAGE! COULD SOMEONE TAKE A LOOK?) If

Re: [OMPI devel] [OMPI svn] svn:open-mpi r24903

2011-07-14 Thread Eugene Loh
Thanks for the clarification. My myopic sense of the issue came out of stumbling on this behavior due to MPI_Comm_spawn_multiple failing. I think *multiple* issues caused this problem to escape notice for so long. One is that if the system thought it was oversubscribed, num_procs_alive was

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24830

2011-07-13 Thread Eugene Loh
On 7/13/2011 4:31 PM, Paul H. Hargrove wrote: On 7/13/2011 4:20 PM, Yevgeny Kliteynik wrote: > Finally, are you sure that infiniband/complib/cl_types_osd.h exists on all platforms? (e.g., Solaris) I know you said you don't have any Solaris machines to test with, but you should ping Oracle

[OMPI devel] orte_odls_base_default_launch_local()

2011-07-12 Thread Eugene Loh
The function orte_odls_base_default_launch_local() has a variable num_procs_alive that is basically initialized like this: if ( oversubscribed ) { ... } else { num_procs_alive = ...; } Specifically, if the "oversubscribed" test passes, the variable is not
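
Distilled into a self-contained sketch (not the ORTE code), the hazard is the classic read of a variable that is assigned in only one branch:

    /* Illustrative only; the value 4 is a placeholder. */
    int count_alive(int oversubscribed)
    {
        int num_procs_alive;           /* no initializer */
        if (oversubscribed) {
            /* ... work that never assigns num_procs_alive ... */
        } else {
            num_procs_alive = 4;
        }
        return num_procs_alive;        /* garbage when oversubscribed is true */
    }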

[OMPI devel] orterun hanging

2011-04-06 Thread Eugene Loh
I'm running into a hang that is very easy to reproduce. Basically, something like this: % mpirun -H remote_node hostname remote_node ^C That is, I run a program (doesn't need to be MPI) on a remote node. The program runs, but my local orterun doesn't return. The problem seems

Re: [OMPI devel] turning on progress threads

2011-03-10 Thread Eugene Loh
conf enable_progress) is minor. Either way, things are fine. My concern is more around the accumulation of many such instances. Ralph Castain wrote: On Mar 10, 2011, at 5:54 PM, Eugene Loh wrote: Ralph Castain wrote: Just stale code that doesn't hurt anything Okay, so it'd be

Re: [OMPI devel] turning on progress threads

2011-03-10 Thread Eugene Loh
-code progress threads to off because the code isn't thread safe in key areas involving the event library, for one. On Mar 10, 2011, at 3:43 PM, Eugene Loh wrote: In the trunk, we hardwire progress threads to be off. E.g., % grep progress configure.ac # Hardwire all progress threads to be off

[OMPI devel] turning on progress threads

2011-03-10 Thread Eugene Loh
In the trunk, we hardwire progress threads to be off. E.g., % grep progress configure.ac # Hardwire all progress threads to be off enable_progress_threads="no" [Hardcode the ORTE progress thread to be off]) [Hardcode the OMPI progress thread to be off]) So,

[OMPI devel] multi-threaded test

2011-03-08 Thread Eugene Loh
I've been assigned CMR 2728, which is to apply some thread-support changes to 1.5.x. The trac ticket has amusing language about "needs testing". I'm not sure what that means. We rather consistently say that we don't promise anything with regards to true thread support. We specifically say

Re: [OMPI devel] --enable-opal-multi-threads

2011-02-15 Thread Eugene Loh
EADME consistent with the v1.5 source code (as opposed to talking about features that will appear in unspecified future releases), either: *) the comment should be removed from the README, or *) opal-multi-threads should be CMRed to v1.5 On Feb 14, 2011, at 5:36 PM, Eugene Loh wrote:

[OMPI devel] --enable-opal-multi-threads

2011-02-14 Thread Eugene Loh
In the v1.5 README, I see this: --enable-opal-multi-threads Enables thread lock support in the OPAL and ORTE layers. Does not enable MPI_THREAD_MULTIPLE - see above option for that feature. This is currently disabled by default. I don't otherwise find opal-multi-threads at all in this

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24356

2011-02-03 Thread Eugene Loh
Jeff Squyres wrote: Eugene -- This ROMIO fix needs to go upstream. Makes sense. Whom do I pester about that? Is r24356 (and now CMR 2712) okay as is? The ROMIO change is an unimportant stylistic change, so I'm okay cutting it loose from the other changes in the putback.

Re: [OMPI devel] u_int8_t

2011-01-11 Thread Eugene Loh
Jeff Squyres wrote: On Jan 11, 2011, at 2:05 PM, Eugene Loh wrote: Do we have configure tests for them, or just #define's? Configure tests. Ok, cool. I assume you'll remove the senseless configure tests, too. Right.

Re: [OMPI devel] u_int8_t

2011-01-11 Thread Eugene Loh
no reason). Do we have configure tests for them, or just #define's? Configure tests. On Jan 10, 2011, at 7:51 PM, Eugene Loh wrote: Why do u_int8_t u_int16_t u_int32_t u_int64_t get defined in opal_config.h? I don't see them used anywhere in the OMPI/OPAL/ORTE code base. Okay, one

[OMPI devel] u_int8_t

2011-01-10 Thread Eugene Loh
Why do u_int8_t u_int16_t u_int32_t u_int64_t get defined in opal_config.h? I don't see them used anywhere in the OMPI/OPAL/ORTE code base. Okay, one exception, in opal/util/if.c: #if defined(__DragonFly__) #define IN_LINKLOCAL(i) (((u_int32_t)(i) & 0xffff0000) == 0xa9fe0000)

Re: [OMPI devel] mca_bml_r2_del_proc_btl()

2011-01-04 Thread Eugene Loh
than the minimum already computed). Pre-setting to (size_t)-1 should fix the issue. On Jan 3, 2011, at 17:17 , Eugene Loh wrote: I can't tell if this is a problem, though I suspect it's a small one even if it's a problem at all. In mca_bml_r2_del_proc_btl(), a BTL is removed from the send
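
The suggested fix distills to a standard idiom: when recomputing a minimum after removing an element, seed the accumulator with the largest possible value, (size_t)-1 (i.e., SIZE_MAX), rather than 0. A sketch with illustrative names:

    #include <stddef.h>
    #include <stdint.h>

    size_t min_eager_limit(const size_t *limits, size_t n)
    {
        size_t min = SIZE_MAX;   /* not 0: 0 would always "win" the comparison */
        for (size_t i = 0; i < n; i++) {
            if (limits[i] < min) {
                min = limits[i];
            }
        }
        return min;
    }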

Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Eugene Loh
ain thread (as this only occurs in MPI_Finalize). Can you look in the syslog to see if there is any additional info related to this issue there? Not much. A one-liner like this: Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: EQE local access violation On Dec 30, 2010, at 20:

[OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2010-12-30 Thread Eugene Loh
I was running a bunch of np=4 test programs over two nodes. Occasionally, *one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during MPI_Finalize(). I traced the code and ran another program that mimicked the particular MPI calls made by that program. This other program, too, would

[OMPI devel] async thread in openib BTL

2010-12-23 Thread Eugene Loh
I'm starting to look at the openib BTL for the first time and am puzzled. In btl_openib_async.c, it looks like an asynchronous thread is started. During MPI_Init(), the main thread sends the async thread a file descriptor for each IB interface to be polled. In MPI_Finalize(), the main

Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh
Jeff Squyres (jsquyres) wrote: Ya, it sounds like we should fix this eager limit help text so that others aren't misled. We did say "attempt", but that's probably a bit too subtle. Eugene - iirc: this is in the btl base (or some other central location) because it's shared between all btls.

Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh
George Bosilca wrote: Moreover, eager send can improve performance if and only if the matching receives are already posted on the peer. If not, the data will become unexpected, and there will be one additional memcpy. I don't think the first sentence is strictly true. There is a cost

Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh
Sébastien Boisvert wrote: On Tuesday, November 23, 2010, at 16:07 -0500, Eugene Loh wrote: Sébastien Boisvert wrote: Case 1: 30 MPI ranks, message size is 4096 bytes File: mpirun-np-30-Program-4096.txt Outcome: It hangs -- I killed the poor thing after 30 seconds

Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh
Sébastien Boisvert wrote: Now I can describe the cases. The test cases can all be explained by the test requiring eager messages (something that test4096.cpp does not require). Case 1: 30 MPI ranks, message size is 4096 bytes File: mpirun-np-30-Program-4096.txt Outcome: It hangs -- I
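
A hypothetical reproducer consistent with the cases described (not Sébastien's program): both ranks of a pair call MPI_Send first, which blocks once the message no longer fits under the eager limit:

    #include <mpi.h>

    /* Deadlocks for large 'size' (size must be <= sizeof(buf)); succeeds
     * only while messages are sent eagerly. */
    void head_to_head_send(int size)
    {
        int  rank;
        char buf[1 << 20];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = rank ^ 1;   /* pair ranks 0-1, 2-3, ... */
        MPI_Send(buf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, size, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }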

Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh
To add to Jeff's comments: Sébastien Boisvert wrote: The reason is that I am developing MPI-based software, and I use Open-MPI as it is the only implementation I am aware of that sends messages eagerly (powerful feature, that is). As wonderful as OMPI is, I am fairly sure other MPI

Re: [OMPI devel] knem_dma_min

2010-08-18 Thread Eugene Loh
Eugene Loh wrote: In mca_btl_sm_get_sync(), I see this: /* Use the DMA flag if knem supports it *and* the segment length is greater than the cutoff. Note that if the knem_dma_min value is 0 (i.e., the MCA param was set to 0), the segment size will never be larger

[OMPI devel] knem_dma_min

2010-08-18 Thread Eugene Loh
In mca_btl_sm_get_sync(), I see this: /* Use the DMA flag if knem supports it *and* the segment length is greater than the cutoff. Note that if the knem_dma_min value is 0 (i.e., the MCA param was set to 0), the segment size will never be larger than it, so DMA will never
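
A hedged sketch of the cutoff test quoted above, with illustrative names rather than the actual btl_sm symbols; for the quoted comment to hold (an MCA value of 0 meaning DMA is never used), a zero parameter would have to be remapped internally to the maximum size_t, which is an assumption here:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static bool use_knem_dma(bool dma_supported, size_t seg_len, size_t dma_min)
    {
        if (dma_min == 0) {
            dma_min = SIZE_MAX;   /* assumed: param value 0 disables DMA */
        }
        return dma_supported && seg_len > dma_min;
    }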

[OMPI devel] RFC: mpirun options

2010-04-19 Thread Eugene Loh
Jeff and I were talking about trac 2035 and the handling of mpirun command-line options. While most mpirun options have long, multi-character names prefixed with a double dash, OMPI had originally also wanted to support combinations of short names (e.g., "mpirun -hvq", even if we don't
