[OMPI devel] Broken master

2015-11-05 Thread Rolf vandeVaart
Hi Ralph: Just an FYI that the following change broke the use of --host on master last night. [rvandevaart@drossetti-ivy4 ompi-master-rolfv]$ git bisect bad 169c44258d5c98870872b77166390d4f9a81105e is the first bad commit commit 169c44258d5c98870872b77166390d4f9a81105e Author: Ralph Castain Li

[OMPI devel] Open MPI Weekly Meetings

2015-11-03 Thread Rolf vandeVaart (via Doodle)
Hi there, Rolf vandeVaart (rvandeva...@nvidia.com) invites you to participate in the Doodle poll "Open MPI Weekly Meetings." Should we have Open MPI weekly meetings during SC15 and Thanksgiving week? Let me know if you want to attend one or both of them. Participate now https://doodl

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread Rolf vandeVaart
The bfo was my creation many years ago. Can we keep it around for a little longer? If we blow it away, then we should probably clean up all the code I also have in the openib BTL for supporting failover. There is also some configure code that would have to go as well. Rolf >-Original Me

[OMPI devel] Dual rail IB card problem

2015-08-31 Thread Rolf vandeVaart
There was a problem reported on the User's list about Open MPI always picking one Mellanox card when there were two in the machine. http://www.open-mpi.org/community/lists/users/2015/08/27507.php We dug a little deeper and I think this has to do with how hwloc is figuring out where one of the

Re: [OMPI devel] pgi and fortran in master

2015-08-26 Thread Rolf vandeVaart
I just tested this against the PGI 15.7 compiler and I see the same thing. It appears that we get this error on some of the files called out in ompi/mpi/fortran/use-mpi-f08/mpi-f-interfaces-bind.h as not having an "easy-peasy" solution. All the other files compile just fine. I checked the list

[OMPI devel] Open MPI 1.8.6 memory leak

2015-07-01 Thread Rolf vandeVaart
There have been two reports on the user list about memory leaks. I have reproduced this leak with LAMMPS. Note that this has nothing to do with CUDA-aware features. The steps that Stefan has provided make it easy to reproduce. Here are some more specific steps to reproduce derived from Stefa

Re: [OMPI devel] smcuda higher exclusivity than anything else?

2015-05-20 Thread Rolf vandeVaart
A few observations. 1. The smcuda btl is only built when --with-cuda is part of the configure line so folks who do not do this will not even have this btl and will never run into this issue. 2. The priority of the smcuda btl has been higher since Open MPI 1.7.5 (March 2014). The idea is that if

Re: [OMPI devel] is anyone seeing this on their intel/inifinipath cluster?

2015-05-04 Thread Rolf vandeVaart
I am seeing it on my cluster too. [ivy4:27085] mca_base_component_repository_open: unable to open mca_btl_usnic: /ivylogin/home/rvandevaart/ompi-repos/ompi-master-uvm/64-dbg/lib/libmca_common_libfabric.so.0: undefined symbol: psmx_eq_open (ignored) [ivy4:27085] mca_base_component_repository

Re: [OMPI devel] c_accumulate

2015-04-20 Thread Rolf vandeVaart
Hi Gilles: Is your failure similar to this ticket? https://github.com/open-mpi/ompi/issues/393 Rolf From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Gilles Gouaillardet Sent: Monday, April 20, 2015 9:12 AM To: Open MPI Developers Subject: [OMPI devel] c_accumulate Folks, i (sometimes

Re: [OMPI devel] Problems with some IBM neighbor tests

2015-04-03 Thread Rolf vandeVaart
I ended up looking at this and it was a bug in this set of tests. Needed to check for MPI_COMM_NULL in a few places. This has been fixed. From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart Sent: Thursday, April 02, 2015 10:10 AM To: de...@open-mpi.org Subject: [OMPI

[OMPI devel] Problems with some IBM neighbor tests

2015-04-02 Thread Rolf vandeVaart
I just recently bumped running some tests from np=4 to np=6. I am now seeing failures on the following tests in the ibm/collective directory. ineighbor_allgather, ineighbor_allgatherv, ineighbor_alltoall, ineighbor_alltoallv, ineighbor_alltoallw neighbor_allgather, neighbor_allgatherv, neighbo

[OMPI devel] New binding warnings in master

2015-03-20 Thread Rolf vandeVaart
Greetings: I am now seeing the following message for all my calls to mpirun on ompi master. This started with last night's MTT run. Is this intentional? [rvandevaart@ivy0 ~]$ mpirun -np 1 hostname -- WARNING: a request wa

[OMPI devel] BML changes

2015-02-26 Thread Rolf vandeVaart
This message is mostly for Nathan, but figured I would go with the wider distribution. I have noticed some different behaviour that I assume started with this change. https://github.com/open-mpi/ompi/commit/4bf7a207e90997e75ba1c60d9d191d9d96402d04 I am noticing that the openib BTL will also b

Re: [OMPI devel] Solaris/x86-64 SEGV with 1.8-latest

2014-12-17 Thread Rolf vandeVaart
I think this has already been fixed by Ralph this morning. I had observed the same issue but is now gone. From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Brice Goglin Sent: Wednesday, December 17, 2014 3:53 PM To: de...@open-mpi.org Subject: Re: [OMPI devel] Solaris/x86-64 SEGV with

Re: [OMPI devel] coll ml error with some nonblocking collectives

2014-09-15 Thread Rolf vandeVaart
may be related to change set 32659. If you back this change out, do the tests pass? Howard From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart Sent: Monday, September 15, 2014 8:55 AM To: de...@open-mpi.org<mailto:de...@open-mpi.org> Subject: [OMPI devel] coll ml

[OMPI devel] coll ml error with some nonblocking collectives

2014-09-15 Thread Rolf vandeVaart
I wonder if anyone else is seeing this failure. Not sure when this started but it is only on the trunk. Here is a link to my failures as well as an example below that. There are a variety of nonblocking collectives failing like this. http://mtt.open-mpi.org/index.php?do_redir=2208 [rvandevaar

[OMPI devel] Errors on aborting programs on 1.8 r32515

2014-08-13 Thread Rolf vandeVaart
I noticed MTT failures from last night and then reproduced this morning on 1.8 branch. Looks like maybe a double free. I assume it is related to fixes for aborting programs. Maybe related to https://svn.open-mpi.org/trac/ompi/changeset/32508 but not sure. [rvandevaart@drossetti-ivy0 environme

[OMPI devel] RFC: Change default behavior of calling ibv_fork_init

2014-07-31 Thread Rolf vandeVaart
WHAT: Change default behavior in openib to not call ibv_fork_init() even if available. WHY: There are some strange interactions with ummunotify that cause errors. In addition, see the additional points below. WHEN: After next weekly meeting, August 5, 2014 DETAILS: This change will just be a co
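
A minimal sketch of the proposed default, assuming a boolean knob (want_fork_support is a placeholder name, not necessarily the real MCA parameter): ibv_fork_init() would only be called when fork support is explicitly requested, rather than whenever it is available.

    #include <infiniband/verbs.h>
    #include <stdio.h>

    static int want_fork_support = 0;   /* proposed new default: off */

    /* Call ibv_fork_init() only when the user asked for fork support.
     * It must run before any other verbs calls to take effect. */
    static int maybe_init_fork_support(void)
    {
        if (want_fork_support) {
            int rc = ibv_fork_init();
            if (rc != 0) {
                fprintf(stderr, "ibv_fork_init failed: %d\n", rc);
                return rc;
            }
        }
        return 0;
    }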

Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-07-30 Thread Rolf vandeVaart
mit it and drop a note to #4815 ( I am afk until tomorrow) Cheers, Gilles Rolf vandeVaart mailto:rvandeva...@nvidia.com>> wrote: Just an FYI that my trunk version (r32355) does not work at all anymore if I do not include "--mca coll ^ml".Here is a stack trace from the ibm/pt

Re: [OMPI devel] trunk compilation errors in jenkins

2014-07-30 Thread Rolf vandeVaart
Just an FYI that my trunk version (r32355) does not work at all anymore if I do not include "--mca coll ^ml". Here is a stack trace from the ibm/pt2pt/send test running on a single node. (gdb) where #0 0x7f6c0d1321d0 in ?? () #1 #2 0x7f6c183abd52 in orte_util_compare_name_fie

Re: [OMPI devel] RFC: Bump minimum sm pool size to 128K from 64K

2014-07-26 Thread Rolf vandeVaart
Yes (my mistake) Sent from my iPhone On Jul 26, 2014, at 3:19 PM, "George Bosilca" mailto:bosi...@icl.utk.edu>> wrote: We are talking MB not KB isn't it? George. On Thu, Jul 24, 2014 at 2:57 PM, Rolf vandeVaart mailto:rvandeva...@nvidia.com>> wrote: WHAT:

[OMPI devel] RFC: Bump minimum sm pool size to 128K from 64K

2014-07-24 Thread Rolf vandeVaart
WHAT: Bump up the minimum sm pool size to 128K from 64K. WHY: When running OSU benchmark on 2 nodes and utilizing a larger btl_smcuda_max_send_size, we can run into the case where the free list cannot grow. This is not a common case, but it is something that folks sometimes experiment with.

Re: [OMPI devel] PML-bfo deadlocks for message size > eager limit after connection loss

2014-07-24 Thread Rolf vandeVaart
My guess is that no one is testing the bfo PML. However, I would have expected it to still work with Open MPI 1.6.5. From your description, it works for smaller messages but fails with larger ones? So, if you just send smaller messages and pull the cable, things work correctly? One idea is t

Re: [OMPI devel] Onesided failures

2014-07-16 Thread Rolf vandeVaart
ssing something obvious, I will update the test tomorrow and add a comm split to ensure MPI_Win_allocate_shared is called from single node communicator and skip the test if this impossible Cheers, Gilles Rolf vandeVaart mailto:rvandeva...@nvidia.com>> wrote: On both 1.8 and trunk (as Ralph m

[OMPI devel] Onesided failures

2014-07-16 Thread Rolf vandeVaart
On both 1.8 and trunk (as Ralph mentioned in meeting) we are seeing three tests fail. http://mtt.open-mpi.org/index.php?do_redir=2205 Ibm/onesided/win_allocate_shared Ibm/onesided/win_allocated_shared_mpifh Ibm/onesided/win_allocated_shared_usempi Is there a ticket that covers these failures? T

[OMPI devel] New crash on trunk (r32246)

2014-07-15 Thread Rolf vandeVaart
With the latest trunk (r32246) I am getting crashes while the program is shutting down. I assume this is related to some of the changes George just made. George, can you take a look when you get a chance? Looks like everyone is getting the segv during shutdown (mpirun, orted, and application)

Re: [OMPI devel] Hangs on the trunk

2014-07-14 Thread Rolf vandeVaart
the conversions in ob1. >> >> -Nathan >> >> On Mon, Jul 14, 2014 at 01:38:38PM -0700, Rolf vandeVaart wrote: >> >I have noticed that I am seeing some tests hang on the trunk. For >> >example: >> > >> > >> > >> >$

[OMPI devel] Hangs on the trunk

2014-07-14 Thread Rolf vandeVaart
I have noticed that I am seeing some tests hang on the trunk. For example: $ mpirun --mca btl_tcp_if_include eth0 --host drossetti-ivy0,drossetti-ivy1 -np 2 --mca pml ob1 --mca btl sm,tcp,self --mca coll_mdisable_allgather 1 --mca btl_openib_warn_default_gid_prefix 0 send It is not unusual for

Re: [OMPI devel] iallgather failures with coll ml

2014-06-11 Thread Rolf vandeVaart
Hearing no response, I assume this is not a known issue so I submitted https://svn.open-mpi.org/trac/ompi/ticket/4709 Nathan, is this something that you can look at? Rolf From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart Sent: Friday, June 06, 2014 1:55 PM To: de

[OMPI devel] Open MPI Core Developer - Minutes June 10, 2014

2014-06-10 Thread Rolf vandeVaart
Minutes of June 10, 2014 Open MPI Core Developer Meeting 1. Review 1.6 - Nothing new 2. Review 1.8 - Most things are doing fine. Still several tickets awaiting review. If influx of bugs slows, then we will get 1.8.2 release ready. Rolf was concerned about intermittent hangs, but

Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk

2014-06-06 Thread Rolf vandeVaart
n isend to process 3 dpm_base_disconnect_init: error -12 in isend to process 3 dpm_base_disconnect_init: error -12 in isend to process 1 dpm_base_disconnect_init: error -12 in isend to process 3 [rhc@bend001 mpi]$ On Jun 6, 2014, at 11:26 AM, Rolf vandeVaart mailto:rvandeva...@nvidia.co

[OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk

2014-06-06 Thread Rolf vandeVaart
I am seeing an interesting failure on trunk. intercomm_create, spawn, and spawn_multiple from the IBM tests hang if I explicitly list the hostnames to run on. For example: Good: $ mpirun -np 2 --mca btl self,sm,tcp spawn_multiple Parent: 0 of 2, drossetti-ivy0.nvidia.com (0 in init) Parent: 1

[OMPI devel] iallgather failures with coll ml

2014-06-06 Thread Rolf vandeVaart
On the trunk, I am seeing failures of the ibm tests iallgather and iallgather_in_place. Is this a known issue? $ mpirun --mca btl self,sm,tcp --mca coll ml,basic,libnbc --host drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 iallgather [**ERROR**]: MPI_COMM_WORLD rank 0, file i

Re: [OMPI devel] regression with derived datatypes

2014-05-30 Thread Rolf vandeVaart
s >we force the exclusive usage of the send protocol, with an unconventional >fragment size. >>>> >>>> In other words using the following flags “—mca btl tcp,self —mca >btl_tcp_flags 3 —mca btl_tcp_rndv_eager_limit 23 —mca btl_tcp_eager_limit >23 —mca btl_tcp_max_send_s

[OMPI devel] Intermittent hangs when exiting with error

2014-05-29 Thread Rolf vandeVaart
Ralph: I am seeing cases where mpirun seems to hang when one of the applications exits with non-zero. For example, the intel test MPI_Cart_get_c will exit that way if there are not enough processes to run the test. In most cases, mpirun seems to return fine with the error code, but sometimes i

[OMPI devel] Still problems with del_procs in trunkj

2014-05-23 Thread Rolf vandeVaart
I am still seeing problems with del_procs with openib. Do we believe everything should be working? This is with the latest trunk (updated 1 hour ago). [rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_cCon

[OMPI devel] RFC: [UPDATE] Add some basic CUDA-aware support to reductions

2014-05-21 Thread Rolf vandeVaart
NOTE: This is an update to the RFC after review and help from George WHAT: Add some basic support so that reduction functions can support GPU buffers. Create new coll module that is only compiled in when CUDA-aware support is compiled in. This patch moves the GPU data into a host buffer befor

Re: [OMPI devel] RFC : what is the best way to fix the memory leak in mca/pml/bfo

2014-05-16 Thread Rolf vandeVaart
The bfo PML is mostly a duplicate of the ob1 PML but with extra code to handle failover when running with a cluster with multiple IB NICs. A few observations. 1. Almost no one uses the bfo PML. I have kept it around just in case someone thinks about failover again. 2. The code where you are s

[OMPI devel] RFC: Add some basic CUDA-aware support to reductions

2014-05-14 Thread Rolf vandeVaart
WHAT: Add some basic support so that reduction functions can support GPU buffers. All this patch does is move the GPU data into a host buffer before the reduction call and move it back to GPU after the reduction call. Changes have no effect if CUDA-aware support is not compiled in. WHY: Users
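
A simplified sketch of the staging scheme described above (not the actual coll module; error handling and buffer reuse are omitted): device buffers are copied to temporary host buffers, the normal host reduction runs, and the result is copied back to the device.

    #include <cuda_runtime.h>
    #include <mpi.h>
    #include <stdlib.h>

    /* Stage GPU buffers through host memory around a host-side reduction. */
    int staged_allreduce(const void *d_send, void *d_recv, int count,
                         MPI_Datatype dtype, MPI_Op op, MPI_Comm comm)
    {
        int tsize;
        MPI_Type_size(dtype, &tsize);
        size_t bytes = (size_t)count * tsize;

        void *h_send = malloc(bytes);
        void *h_recv = malloc(bytes);

        cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);  /* GPU -> host */
        int rc = MPI_Allreduce(h_send, h_recv, count, dtype, op, comm);
        cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);  /* host -> GPU */

        free(h_send);
        free(h_recv);
        return rc;
    }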

[OMPI devel] Minutes of Open MPI ConCall Meeting - Tuesday, May 13, 2014

2014-05-13 Thread Rolf vandeVaart
Open MPI 1.6: - Release was waiting on https://svn.open-mpi.org/trac/ompi/ticket/3079 but during meeting we decided it was not necessary. Therefore, Jeff will go ahead and roll Open MPI 1.6.6 RC1. Open MPI 1.8: - Several tickets have been applied. Some discussion about other

Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Rolf vandeVaart
_send_size 23” should always transfer wrong data, even when only one single BTL is in play. George. On May 7, 2014, at 13:11 , Rolf vandeVaart mailto:rvandeva...@nvidia.com>> wrote: OK. So, I investigated a little more. I only see the issue when I am running with multiple ports ena

Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Rolf vandeVaart
OK. So, I investigated a little more. I only see the issue when I am running with multiple ports enabled such that I have two openib BTLs instantiated. In addition, large message RDMA has to be enabled. If those conditions are not met, then I do not see the problem. For example: FAILS:

Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Rolf vandeVaart
This seems similar to what I reported on a different thread. http://www.open-mpi.org/community/lists/devel/2014/05/14688.php I need to try and reproduce again. Elena, what kind of cluster were you running on? Rolf From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Elena Elkina Sent

Re: [OMPI devel] Possible bug with derived datatypes and openib BTL in trunk

2014-04-17 Thread Rolf vandeVaart
g set to 1. I would be >interested in the output you get on your machine. > >George. > > >On Apr 16, 2014, at 14:34 , Rolf vandeVaart wrote: > >> I have seen errors when running the intel test suite using the openib BTL >when transferring derived datatypes. I do not s

[OMPI devel] Possible bug with derived datatypes and openib BTL in trunk

2014-04-16 Thread Rolf vandeVaart
I have seen errors when running the intel test suite using the openib BTL when transferring derived datatypes. I do not see the error with sm or tcp BTLs. The errors begin after this checkin. https://svn.open-mpi.org/trac/ompi/changeset/31370 Timestamp: 04/11/14 16:06:56 (5 days ago) Author: b

Re: [OMPI devel] 1-question developer poll

2014-04-16 Thread Rolf vandeVaart
SVN >-Original Message- >From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan >Hjelm >Sent: Wednesday, April 16, 2014 10:35 AM >To: Open MPI Developers >Subject: Re: [OMPI devel] 1-question developer poll > >* PGP Signed by an unknown key > >Git > >On Wed, Apr 16, 2014 at 10

Re: [OMPI devel] -mca coll "ml" cause segv or hangs with different command lines.

2014-03-04 Thread Rolf vandeVaart
I am still seeing the same issue where I get some type of segv unless I disable the coll ml component. This may be an issue at my end, but just thought I would double check that we are sure this is fixed. Thanks, Rolf >-Original Message- >From: devel [mailto:devel-boun...@open-mpi.org]

[OMPI devel] RFC: Add two new verbose outputs to BML layer

2014-03-03 Thread Rolf vandeVaart
WHAT: Add two new verbose outputs to BML layer WHY: There are times that I really want to know which BTLs are being used. These verbose outputs can help with that. WHERE: ompi/mca/bml/r2/bml_r2.c TIMEOUT: COB Friday, 7 March 2014 MORE DETAIL: I have run into some cases where I have added to a

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r30860 - in trunk/ompi/mca: btl/usnic rte

2014-02-27 Thread Rolf vandeVaart
It could. I added that argument 4 years ago to support my failover work with the BFO. It was a way for a BTL to pass some type of string back to the PML telling the PML who it was for verbose output to understand what was happening. >-Original Message- >From: devel [mailto:devel-b

Re: [OMPI devel] 1.7.5 fails on simple test

2014-02-10 Thread Rolf vandeVaart
I have tracked this down. There is a missing commit that affects ompi_mpi_init.c causing it to initialize bml twice. Ralph, can you apply r30310 to 1.7? Thanks, Rolf From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart Sent: Monday, February 10, 2014 12:29 PM To: Open

Re: [OMPI devel] 1.7.5 fails on simple test

2014-02-10 Thread Rolf vandeVaart
I have seen this same issue although my core dump is a little bit different. I am running with tcp,self. The first entry in the list of BTLs is garbage, but then there is tcp and self in the list. Strange. This is my core dump. Line 208 in bml_r2.c is where I get the SEGV. Program termina

Re: [OMPI devel] Intermittent mpirun crash?

2014-01-30 Thread Rolf vandeVaart
gfaulted as >well), but obviously wouldn't have anything to do with mpirun > >On Jan 30, 2014, at 9:29 AM, Rolf vandeVaart >wrote: > >> I just retested with --mca mpi_leave_pinned 0 and that made no difference. >I still see the mpirun crash. >> >>> -

Re: [OMPI devel] Intermittent mpirun crash?

2014-01-30 Thread Rolf vandeVaart
>fixes the problem, I did not investigate any further. > >Do you see a similar behavior? > > George. > >On Jan 30, 2014, at 17:26 , Rolf vandeVaart wrote: > >> I am seeing this happening to me very intermittently. Looks like mpirun is >getting a SEGV. Is anyone el

Re: [OMPI devel] Intermittent mpirun crash?

2014-01-30 Thread Rolf vandeVaart
, January 30, 2014 11:51 AM To: Open MPI Developers Subject: Re: [OMPI devel] Intermittent mpirun crash? Huh - not much info there, I'm afraid. I gather you didn't build this with --enable-debug? On Jan 30, 2014, at 8:26 AM, Rolf vandeVaart wrote: > I am seeing this happeni

[OMPI devel] Intermittent mpirun crash?

2014-01-30 Thread Rolf vandeVaart
I am seeing this happening to me very intermittently. Looks like mpirun is getting a SEGV. Is anyone else seeing this? This is 1.7.4 built yesterday. (Note that I added some stuff to what is being printed out so the message is slightly different than 1.7.4 output) mpirun -np 6 -host drosse

Re: [OMPI devel] 1.7.4 status update

2014-01-22 Thread Rolf vandeVaart
Hi Ralph: In my opinion, we should still try to get to a stable 1.7.4. I think we can just keep the bar high (as you said in the meeting) about what types of fixes need to get into 1.7.4. I have been telling folks 1.7.4 would be ready "really soon" so the idea of folding in 1.7.5 CMRs and delaying it

[OMPI devel] NUMA bug in openib BTL device selection

2014-01-10 Thread Rolf vandeVaart
I believe I found a bug in openib BTL and just want to see if folks agree with this. When we are running on a NUMA node and we are bound to a CPU, we only want to use the IB device that is closest to us. However, I observed that we always used both devices regardless. I believe there is a bug
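
A hedged sketch of the locality check being described, using hwloc's verbs helper to illustrate the intended selection logic only (this is not the openib BTL code): prefer an IB device whose local CPU set overlaps the CPUs the process is bound to.

    #include <hwloc.h>
    #include <hwloc/openfabrics-verbs.h>
    #include <infiniband/verbs.h>

    /* Return nonzero if the IB device is local to the CPUs we are bound to. */
    static int device_is_near_me(hwloc_topology_t topo, struct ibv_device *dev)
    {
        hwloc_cpuset_t mine = hwloc_bitmap_alloc();
        hwloc_cpuset_t devset = hwloc_bitmap_alloc();
        int near = 1;   /* if we cannot tell, keep the device */

        if (hwloc_get_cpubind(topo, mine, HWLOC_CPUBIND_PROCESS) == 0 &&
            hwloc_ibv_get_device_cpuset(topo, dev, devset) == 0 &&
            !hwloc_bitmap_iszero(mine)) {
            near = hwloc_bitmap_intersects(mine, devset);
        }
        hwloc_bitmap_free(mine);
        hwloc_bitmap_free(devset);
        return near;
    }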

Re: [OMPI devel] CUDA support not working?

2013-11-25 Thread Rolf vandeVaart
Let me know of any other issues you are seeing. Ralph fixed the issue with ob1 and we will move that into Open MPI 1.7.4. Not sure why I never saw that issue. Will investigate some more. >-Original Message- >From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jörg >Bornschein

Re: [OMPI devel] MPIRUN error message after ./configure and sudo make all install...

2013-11-07 Thread Rolf vandeVaart
use --enable-mca-dso...though I don't know if that is the source of the problem. On Nov 7, 2013, at 6:00 AM, Rolf vandeVaart mailto:rvandeva...@nvidia.com>> wrote: Hello Solibakke: Let me try and reproduce with your configure options. Rolf From: devel [mailto:devel-boun...@open-

Re: [OMPI devel] MPIRUN error message after ./configure and sudo make all install...

2013-11-07 Thread Rolf vandeVaart
Hello Solibakke: Let me try and reproduce with your configure options. Rolf From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Solibakke Per Bjarte Sent: Thursday, November 07, 2013 8:40 AM To: 'de...@open-mpi.org' Subject: [OMPI devel] MPIRUN error message after ./configure and sudo m

Re: [OMPI devel] oshmem and CFLAGS removal

2013-10-31 Thread Rolf vandeVaart
>-Original Message- >From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres >(jsquyres) >Sent: Thursday, October 31, 2013 4:12 PM >To: Open MPI Developers >Subject: Re: [OMPI devel] oshmem and CFLAGS removal > >On Oct 31, 2013, at 3:46 PM,

[OMPI devel] oshmem and CFLAGS removal

2013-10-31 Thread Rolf vandeVaart
I noticed that there were some CFLAGS that were no longer set when enabling with --enable-picky for gcc. Specifically, -Wundef and -pedantic were no longer set. This is not a problem for Open MPI 1.7. I believe this is happening because of some code in the config/oshmem_configure_options.m4 f

Re: [OMPI devel] Warnings in v1.7.4: rcache

2013-10-23 Thread Rolf vandeVaart
Yes, that is from one of my CMRs. I always configure with --enable-picky but that did not pick up this warning. I will fix this in the trunk in the morning (watching the Red Sox right now :)) and then file CMR to bring over. Rolf From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralp

[OMPI devel] RFC: Add GPU Direct RDMA support to openib btl

2013-10-08 Thread Rolf vandeVaart
WHAT: Add GPU Direct RDMA support to openib btl WHY: Better latency for small GPU message transfers WHERE: Several files, see ticket for list WHEN: Friday, October 18, 2013 COB More detail: This RFC looks to make use of GPU Direct RDMA support that is coming in the future in Mellanox libraries.

Re: [OMPI devel] RFC: Remove alignment code from rcache

2013-09-18 Thread Rolf vandeVaart
I will wait another week on this since I know a lot of folks were traveling. Any input welcome. From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart Sent: Tuesday, September 10, 2013 2:46 PM To: de...@open-mpi.org Subject: [OMPI devel] RFC: Remove alignment code from

Re: [OMPI devel] Nearly unlimited growth of pml free list

2013-09-13 Thread Rolf vandeVaart
a private email, I had Max add some instrumentation so we could see which list was growing. We now know it is the mca_pml_base_send_requests list. >-Original Message- >From: Max Staufer [mailto:max.stau...@gmx.net] >Sent: Friday, September 13, 2013 7:06 AM >To: Rolf va

Re: [OMPI devel] Nearly unlimited growth of pml free list

2013-09-11 Thread Rolf vandeVaart
Hi Max: You say that that the function keeps "allocating memory in the pml free list." How do you know that is happening? Do you know which free list it is happening on? There are something like 8 free lists associated with the pml ob1 so it would be interesting to know which one you observe

[OMPI devel] RFC: Remove alignment code from rcache

2013-09-10 Thread Rolf vandeVaart
WHAT: Remove alignment code from ompi/mca/rcache/vma module WHY: Because it is redundant and causing problems for memory pools that want different alignment WHERE: ompi/mca/rcache/vma/rcache_vma.c, ompi/mca/mpool/grdma/mpool_grdma_module.c (Detailed changes attached) WHEN: Tuesday, September 17,

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart Sent: Tuesday, September 03, 2013 4:52 PM To: Open MPI Developers Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes Correction: That line below should be: gmake run FILE=p2p_c From: devel [mailto:devel-boun

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
Correction: That line below should be: gmake run FILE=p2p_c From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart Sent: Tuesday, September 03, 2013 4:50 PM To: Open MPI Developers Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes I just retried and I

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
. I've tried up to np=16 without getting a single hiccup. Try a fresh checkout - let's make sure you don't have some old cruft laying around. On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart mailto:rvandeva...@nvidia.com>> wrote: I am running a debug build. Here is my configur

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
M, Ralph Castain mailto:r...@open-mpi.org>> wrote: Dang - I just finished running it on odin without a problem. Are you seeing this with a debug or optimized build? On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart mailto:rvandeva...@nvidia.com>> wrote: Yes, it fails on the current t

Re: [OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
something up in the OOB connect code itself. I'll take a look and see if something leaps out at me - it seems to be working fine on IU's odin cluster, which is the only IB-based system I can access On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart mailto:rvandeva...@nvidia.com>> wr

[OMPI devel] openib BTL problems with ORTE async changes

2013-09-03 Thread Rolf vandeVaart
As mentioned in the weekly conference call, I am seeing some strange errors when using the openib BTL. I have narrowed down the changeset that broke things to the ORTE async code. https://svn.open-mpi.org/trac/ompi/changeset/29058 (and https://svn.open-mpi.org/trac/ompi/changeset/29061 which

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1

2013-08-30 Thread Rolf vandeVaart
>interested in implementing in the future (an intern or some PhD student). > >On Aug 23, 2013, at 21:53 , Rolf vandeVaart wrote: > >> Yes, I agree that the CUDA support is more intrusive and ends up in >different areas. The problem is that the changes could not be simply isol

[OMPI devel] Quick observation - component ignored for 7 years

2013-08-27 Thread Rolf vandeVaart
The ompi/mca/rcache/rb component has been .ompi_ignored for almost 7 years. Should we delete it?

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1

2013-08-23 Thread Rolf vandeVaart
rg] On Behalf Of George >Bosilca >Sent: Friday, August 23, 2013 7:36 AM >To: Open MPI Developers >Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in >trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1 > >Rolf, > >On Aug 22, 2013, at 19:24 , Rolf vandeVaart

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1

2013-08-22 Thread Rolf vandeVaart
George. > >On Aug 21, 2013, at 23:00 , svn-commit-mai...@open-mpi.org wrote: > >> Author: rolfv (Rolf Vandevaart) >> Date: 2013-08-21 17:00:09 EDT (Wed, 21 Aug 2013) New Revision: 29055 >> URL: https://svn.open-mpi.org/trac/ompi/changeset/29055 >> >> Log: >> Fi

Re: [OMPI devel] Annual OMPI membership review: SVN accounts

2013-07-09 Thread Rolf vandeVaart
No changes here. >NVIDIA >== >rolfv:Rolf Vandevaart

[OMPI devel] RGET issue when send is less than receive

2013-06-21 Thread Rolf vandeVaart
I ran into a hang in a test in which the sender sends less data than the receiver is expecting. For example, the following shows the receiver expecting twice what the sender is sending. Rank 0: MPI_Send(buf, BUFSIZE, MPI_INT, 1, 99, MPI_COMM_WORLD) Rank 1: MPI_Recv(buf, BUFSIZE*2, MPI_INT, 0,
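
A minimal standalone reproduction of the pattern described above (BUFSIZE here is an arbitrary value chosen to exceed the eager limit, not taken from the original test). Receiving with a larger count than was sent is legal MPI; the receive should simply complete with the smaller actual count.

    #include <mpi.h>
    #include <stdio.h>
    #define BUFSIZE 1048576   /* large enough to take the rendezvous (RGET) path */

    static int buf[BUFSIZE * 2];

    int main(int argc, char **argv)
    {
        int rank, count;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(buf, BUFSIZE, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receiver asks for twice as much as the sender provides. */
            MPI_Recv(buf, BUFSIZE * 2, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            printf("received %d ints\n", count);   /* expected: BUFSIZE */
        }
        MPI_Finalize();
        return 0;
    }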

[OMPI devel] Build warnings in trunk

2013-05-14 Thread Rolf vandeVaart
I have noticed several warnings while building the trunk. Feel free to fix anything that you are familiar with. CC sys_limits.lo ../../../opal/util/sys_limits.c: In function 'opal_util_init_sys_limits': ../../../opal/util/sys_limits.c:107:20: warning: 'lim' may be used uninitialized in t

Re: [OMPI devel] mpirun -host does not work from r27879 and forward on trunk

2013-01-31 Thread Rolf vandeVaart
31, 2013 11:51 AM >To: Open MPI Developers >Subject: Re: [OMPI devel] mpirun -host does not work from r27879 and >forward on trunk > >Yes - no hostfile and no RM allocation, just -host. > >What is your setup? > >On Jan 31, 2013, at 8:44 AM, Rolf vandeVaart >wrote: >

Re: [OMPI devel] mpirun -host does not work from r27879 and forward on trunk

2013-01-31 Thread Rolf vandeVaart
, Ralph Castain wrote: > >> Ummm...that was fixed a long time ago. You might try a later version. >> >> Or are you saying the head of the trunk doesn't work too? >> >> On Jan 31, 2013, at 7:31 AM, Rolf vandeVaart >wrote: >> >>> I have stum

[OMPI devel] mpirun -host does not work from r27879 and forward on trunk

2013-01-31 Thread Rolf vandeVaart
I have stumbled into a problem with the -host argument. This problem appears to be introduced with changeset r27879 on 1/19/2013 by rhc. With r27877, things work: [rolf@node]$ which mpirun /home/rolf/ompi-trunk-r27877/64/bin/mpirun [rolf@node]$ mpirun -np 2 -host c0-0,c0-3 hostname c0-3 c0-0

Re: [OMPI devel] CUDA support doesn't work starting from 1.9a1r27862

2013-01-24 Thread Rolf vandeVaart
Thanks for this report. I will look into this. Can you tell me what your mpirun command looked like and do you know what transport you are running over? Specifically, is this on a single node or multiple nodes? Rolf From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf

[OMPI devel] RFC: Support for asynchronous copies of GPU buffers over IB

2012-12-17 Thread Rolf vandeVaart
[I sent this out in June, but did not commit it. So resending. Timeout of Jan 5, 2012. Note that this does not use the GPU Direct RDMA] WHAT: Add support for doing asynchronous copies of GPU memory with larger messages. WHY: Improve performance for sending/receiving of larger GPU messages over

Re: [OMPI devel] OpenMPI CUDA 5 readiness?

2012-09-04 Thread Rolf vandeVaart
37 PM >To: Rolf vandeVaart >Cc: de...@open-mpi.org >Subject: Re: OpenMPI CUDA 5 readiness? > >CUDA 5 basically changes char* to void* in some functions. Attached is a small >patch which changes prototypes, depending on used CUDA version. Tested >with CUDA 5 preview and 4.2. > >

Re: [OMPI devel] The hostfile option

2012-07-30 Thread Rolf vandeVaart
>-Original Message- >From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] >On Behalf Of Ralph Castain >Sent: Monday, July 30, 2012 9:29 AM >To: Open MPI Developers >Subject: Re: [OMPI devel] The hostfile option > > >On Jul 30, 2012, at 2:37 AM, George Bosilca wrote: > >> I t

[OMPI devel] FW: add asynchronous copies for large GPU buffers

2012-07-10 Thread Rolf vandeVaart
Adding a timeout to this RFC. TIMEOUT: July 17, 2012 rvandeva...@nvidia.com 781-275-5358 -Original Message- From: Rolf vandeVaart Sent: Wednesday, June 27, 2012 6:13 PM To: de...@open-mpi.org Subject: RFC: add asynchronous copies for large GPU buffers WHAT: Add support for doing

Re: [OMPI devel] RFC: add asynchronous copies for large GPU buffers

2012-06-27 Thread Rolf vandeVaart
PU >buffers > >Can you make your repository public or add me to the access list? > >-Nathan > >On Wed, Jun 27, 2012 at 03:12:34PM -0700, Rolf vandeVaart wrote: >> WHAT: Add support for doing asynchronous copies of GPU memory with >larger messages. >> WHY: I

[OMPI devel] RFC: add asynchronous copies for large GPU buffers

2012-06-27 Thread Rolf vandeVaart
WHAT: Add support for doing asynchronous copies of GPU memory with larger messages. WHY: Improve performance for sending/receiving of larger GPU messages over IB WHERE: ob1, openib, and convertor code. All is protected by compiler directives so no effect on non-CUDA builds. REFEREN

Re: [OMPI devel] RFC: hide btl segment keys within btl

2012-06-18 Thread Rolf vandeVaart
Hi Nathan: I downloaded and tried it out. There were a few issues that I had to work through, but finally got things working. Can you apply this patch to your changes prior to checking things in? I also would suggest configuring with --enable-picky as there are something like 10 warnings genera

[OMPI devel] Modified files after autogen

2012-05-23 Thread Rolf vandeVaart
After doing a fresh checkout of the trunk, and then running autogen, I see this: M opal/mca/event/libevent2019/libevent/Makefile.in M opal/mca/event/libevent2019/libevent/depcomp M opal/mca/event/libevent2019/libevent/include/Makefile.in M opal/mca/event/libevent2019/libeve

Re: [OMPI devel] mca_btl_tcp_alloc

2012-04-04 Thread Rolf vandeVaart
Here is my explanation. The call to MCA_BTL_TCP_FRAG_ALLOC_EAGER or MCA_BTL_TCP_FRAG_ALLOC_MAX allocate a chunk of memory that has space for both the fragment as well as any payload. So, when we do the frag+1, we are setting the pointer in the frag to point where the payload of the message liv
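
A simplified sketch of the frag+1 idiom being explained (illustrative only, not the actual Open MPI code): the fragment header and its payload are carved out of a single allocation, so frag + 1 is the first byte past the struct, i.e. the start of the payload.

    #include <stdlib.h>

    typedef struct my_frag_t {
        size_t size;              /* bookkeeping fields would live here */
        unsigned char *payload;   /* points into the same allocation */
    } my_frag_t;

    static my_frag_t *frag_alloc(size_t payload_bytes)
    {
        /* One chunk: the struct followed immediately by payload space. */
        my_frag_t *frag = malloc(sizeof(my_frag_t) + payload_bytes);
        if (frag != NULL) {
            frag->size = payload_bytes;
            frag->payload = (unsigned char *)(frag + 1);  /* just past the struct */
        }
        return frag;
    }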

[OMPI devel] memory bind warning with -bind-to-core and -bind-to-socket

2012-03-14 Thread Rolf vandeVaart
I am running a simple test and using the -bind-to-core or -bind-to-socket options. I think the CPU binding is working fine, but I see these warnings about not being able to bind to memory. Is this expected? This is trunk code (266128) [dt]$ mpirun --report-bindings -np 2 -bind-to-core conne

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Rolf vandeVaart
[Comment at bottom] >-Original Message- >From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] >On Behalf Of Nathan Hjelm >Sent: Friday, March 09, 2012 2:23 PM >To: Open MPI Developers >Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106 > > > >On Fri, 9 Mar 2012, J

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26039

2012-02-24 Thread Rolf vandeVaart
Hi Jeff: It is set in opal/config/opal_configure_options.m4 >-Original Message- >From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] >On Behalf Of Jeffrey Squyres >Sent: Friday, February 24, 2012 6:07 AM >To: de...@open-mpi.org >Subject: Re: [OMPI devel] [OMPI svn-full]

Re: [OMPI devel] RFC: Allocate free list payload if free list isn't specified

2012-02-21 Thread Rolf vandeVaart
I think I am OK with this. Alternatively, you could have done something like what is done in the TCP BTL, where the payload and header are added together for the frag size? To state more clearly, I was trying to say you could do something similar to what is done at line 1015 in btl_tcp_component.c a

Re: [OMPI devel] MVAPICH2 vs Open-MPI

2012-02-14 Thread Rolf vandeVaart
There are several things going on here that make their library perform better. With respect to inter-node performance, both MVAPICH2 and Open MPI copy the GPU memory into host memory first. However, they are using special host buffers and a code path that allows them to copy the data async
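
A hedged sketch of the technique being described: stage the GPU data through page-locked ("pinned") host buffers with cudaMemcpyAsync so the copy of the next chunk can overlap with sending the current one. The chunk size and the send_chunk callback are placeholders; this is not the MVAPICH2 or Open MPI code path.

    #include <cuda_runtime.h>
    #include <stddef.h>

    #define CHUNK (512 * 1024)   /* arbitrary pipeline chunk size */

    void pipelined_send(const char *d_buf, size_t len,
                        void (*send_chunk)(const void *, size_t))
    {
        char *h_buf[2];
        cudaStream_t stream;
        cudaMallocHost((void **)&h_buf[0], CHUNK);   /* pinned host staging buffers */
        cudaMallocHost((void **)&h_buf[1], CHUNK);
        cudaStreamCreate(&stream);

        size_t nchunks = (len + CHUNK - 1) / CHUNK;
        size_t cur = (len < CHUNK) ? len : CHUNK;
        cudaMemcpyAsync(h_buf[0], d_buf, cur, cudaMemcpyDeviceToHost, stream);

        for (size_t i = 0; i < nchunks; i++) {
            cudaStreamSynchronize(stream);            /* wait for chunk i */
            size_t off_next = (i + 1) * CHUNK, next = 0;
            if (off_next < len) {                      /* start copying chunk i+1 */
                next = (len - off_next < CHUNK) ? (len - off_next) : CHUNK;
                cudaMemcpyAsync(h_buf[(i + 1) % 2], d_buf + off_next, next,
                                cudaMemcpyDeviceToHost, stream);
            }
            send_chunk(h_buf[i % 2], cur);             /* overlaps with that copy */
            cur = next;
        }
        cudaStreamDestroy(stream);
        cudaFreeHost(h_buf[0]);
        cudaFreeHost(h_buf[1]);
    }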

Re: [OMPI devel] GPUDirect v1 issues

2012-01-20 Thread Rolf vandeVaart
el with GPUDirect support * Use the MLNX OFED stack with GPUDirect support * Install the CUDA developer driver Does using CUDA >= 4.0 make one of the above steps redundant? I.e., RHEL or different kernel or MLNX OFED stack with GPUDirect support is not needed any more? Sebastian. Rolf
