Re: [OMPI devel] MPIT solution still wrong

2013-08-19 Thread Ralph Castain
Agreed - I was simply trying to get it to build because it broke some developers here. On Aug 19, 2013, at 6:03 PM, "Jeff Squyres (jsquyres)" wrote: > It looks like https://svn.open-mpi.org/trac/ompi/changeset/29043 is a > stopgap, but it is still definitely wrong. > >

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread Ralph Castain
On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)" wrote: > On Aug 19, 2013, at 8:02 PM, Ralph Castain wrote: > >> That's how it works now. My concern is with the error message scenario. >> IIRC, Jeff's issue was that the error message only

Re: [OMPI devel] [slurm-dev] slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun)

2013-08-19 Thread Christopher Samuel
Hi Ralph, On 12/08/13 06:17, Ralph Castain wrote: > 1. Slurm has no direct knowledge or visibility into the > application procs themselves when launched by mpirun. Slurm only > sees the ORTE daemons. I'm sure that Slurm rolls up all the > resources

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Jeff Squyres (jsquyres)
Thanks for finding r27212. It was about a year ago, and had clearly fallen out of my cache (I have very little to do with the openib BTL these days). Your solution isn't correct, because HAVE_IBV_LINK_LAYER_ETHERNET is defined (or not) via this m4 macro in config/ompi_check_openfabrics.m4:

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread Jeff Squyres (jsquyres)
On Aug 19, 2013, at 8:02 PM, Ralph Castain wrote: > That's how it works now. My concern is with the error message scenario. IIRC, > Jeff's issue was that the error message only contains the hostname of the > proc that generates it - it doesn't tell you the hostname of the

[OMPI devel] MPIT solution still wrong

2013-08-19 Thread Jeff Squyres (jsquyres)
It looks like https://svn.open-mpi.org/trac/ompi/changeset/29043 is a stopgap, but it is still definitely wrong. The MPIT stuff does *not* compile the same way the C bindings compile. Here's how the C bindings compile: a) in ompi/mpi/c/profile: always compile libmpi_c_pmpi.la b) in

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread Ralph Castain
That's how it works now. My concern is with the error message scenario. IIRC, Jeff's issue was that the error message only contains the hostname of the proc that generates it - it doesn't tell you the hostname of the remote proc. Hence, we included that info in the proc_t. However, IIRC we

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread George Bosilca
Some networks need the name to resolve their own communication channel so they will look it up (all based on IP). However, this is indeed not enough for all cases. The general solution will be to set the proc hostname the first time the peer information is looked up in the database

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread Ralph Castain
Hmmm...but then proc->hostname will *never* be filled in, because it is only ever accessed in an error message - i.e., in opal_output and its variants. If we are not going to retrieve it by default, then we need another solution *if* we want hostnames for error messages under direct launch. If

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread Nathan Hjelm
That solution is fine with me. -Nathan On Tue, Aug 20, 2013 at 12:41:49AM +0200, George Bosilca wrote: > If your offer is between quadratic and non-deterministic, I'll take the > former. > > I would advocate for a middle-ground solution. Clearly document in the header > file that the

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread George Bosilca
If your offer is between quadratic and non-deterministic, I'll take the former. I would advocate for a middle-ground solution. Clearly document in the header file that the ompi_proc_get_hostname is __not__ safe to be used in all contexts as it might exhibit recursive behavior due to

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread Ralph Castain
Understood - but George is correct in that a failure to find the hostname in the db will create an infinite loop. Any thoughts on a reliable way to break it? On Aug 19, 2013, at 2:52 PM, Nathan Hjelm wrote: > It would require a db read from every rank which is what we are
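
One common way to break this kind of loop is a reentrancy flag around the lookup. The sketch below is only an illustration of that idea, not the fix adopted in Open MPI: every name in it (demo_proc_t, demo_get_hostname, demo_db_fetch, demo_report_error, DEMO_UNKNOWN_HOST) is hypothetical, the database read is simulated so that it always fails, and the flag is a plain static, so the sketch assumes a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; none of these names come from Open MPI. */
typedef struct {
    int rank;
    const char *hostname;   /* NULL until the name has been resolved */
} demo_proc_t;

static const char DEMO_UNKNOWN_HOST[] = "unknown";
static int demo_lookup_calls = 0;     /* counts simulated db reads */
static bool demo_in_lookup = false;   /* reentrancy flag (single-threaded sketch) */

static const char *demo_get_hostname(demo_proc_t *proc);

/* Error reporter that wants the peer's hostname in its message. */
static void demo_report_error(demo_proc_t *proc, const char *msg)
{
    fprintf(stderr, "[peer %d on %s] %s\n",
            proc->rank, demo_get_hostname(proc), msg);
}

/* Simulated database read: always fails, and reports the failure. */
static const char *demo_db_fetch(demo_proc_t *proc)
{
    demo_lookup_calls++;
    demo_report_error(proc, "hostname not found in db");
    return NULL;
}

/* Lazy hostname lookup that refuses to re-enter itself. */
static const char *demo_get_hostname(demo_proc_t *proc)
{
    assert(proc != NULL);
    if (proc->hostname != NULL) {
        return proc->hostname;        /* already cached */
    }
    if (demo_in_lookup) {
        return DEMO_UNKNOWN_HOST;     /* called from inside the lookup: stop here */
    }
    demo_in_lookup = true;
    const char *name = demo_db_fetch(proc);
    demo_in_lookup = false;
    if (name != NULL) {
        proc->hostname = name;        /* cache only successful lookups */
    }
    return name != NULL ? name : DEMO_UNKNOWN_HOST;
}
```

Calling demo_get_hostname on a proc whose name is not in the simulated database performs a single read and returns the placeholder; the error message emitted during the failed read also gets the placeholder instead of triggering a second lookup. A real implementation would need a per-thread (or per-proc) flag rather than one static.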

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread Nathan Hjelm
It would require a db read from every rank, which is what we are trying to avoid. This scales quadratically at best on Cray systems. -Nathan On Mon, Aug 19, 2013 at 02:48:18PM -0700, Ralph Castain wrote: > Yeah, I have some concerns about it too...been trying to test it out some > more. Would be

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread Ralph Castain
Yeah, I have some concerns about it too...been trying to test it out some more. Would be good to see just how much that one change makes - maybe restoring just the hostname wouldn't have that big an impact. I'm leery of trying to ensure we strip all the opal_output loops if we don't find the

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-19 Thread George Bosilca
As a result of this patch the first decode of a peer host name might happen in the middle of a debug message (on the first call to ompi_proc_get_hostname). Such behavior might generate deadlocks depending on the level of output verbosity, and has significant potential to reintroduce the recursive

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
> -Original Message- > From: Steve Wise [mailto:sw...@opengridcomputing.com] > Sent: Monday, August 19, 2013 4:02 PM > To: 'Open MPI Developers'; 'Jeff Squyres (jsquyres)' > Cc: 'Indranil Choudhury' > Subject: RE: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC > > I guess

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
I guess HAVE_IBV_LINK_LAYER_ETHERNET is guarding against a libibverbs that doesn't have IBV_LINK_LAYER_ETHERNET defined. So the proper fix, I think, is to enhance configure to check this and #define HAVE_IBV_LINK_LAYER_ETHERNET if it exists. Or have it check existence of a link_layer field

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
This patch fixes iwarp. dunno if it breaks RoCE though :) [root@r9 ompi-trunk]# svn diff Index: ompi/mca/btl/openib/connect/btl_openib_connect_oob.c === --- ompi/mca/btl/openib/connect/btl_openib_connect_oob.c(revision

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
> > I could if I had a patch/fix. :) I don't (yet) understand why > > HAVE_IBV_LINK_LAYER_ETHERNET was > added. > > Can the developer who made these changes explain the intent? I think it > > might have to do with RoCE > > support. > > > > Seems like there should be some change to configure

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
> -Original Message- > From: Steve Wise [mailto:sw...@opengridcomputing.com] > Sent: Monday, August 19, 2013 3:25 PM > To: 'Jeff Squyres (jsquyres)' > Cc: 'Open MPI Developers'; 'Indranil Choudhury' > Subject: RE: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC > > > > >

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
> -Original Message- > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com] > Sent: Monday, August 19, 2013 3:23 PM > To: Steve Wise > Cc: Open MPI Developers; Indranil Choudhury > Subject: Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC > > No need to both post to the

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Jeff Squyres (jsquyres)
No need to both post to the ticket and to devel -- just pick one. :-) Can you send a patch/fix? On Aug 19, 2013, at 4:17 PM, Steve Wise wrote: >> -Original Message- >> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Steve Wise >> Sent:

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
> -Original Message- > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Steve Wise > Sent: Monday, August 19, 2013 2:42 PM > To: 'Open MPI Developers'; 'Jeff Squyres (jsquyres)' > Cc: 'Indranil Choudhury' > Subject: Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC >

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
I confirmed that this is a regression from 1.7.1... I'll see if I can figure out what's going on... > -Original Message- > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Steve Wise > Sent: Monday, August 19, 2013 12:15 PM > To: 'Jeff Squyres (jsquyres)' > Cc:

Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
> -Original Message- > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com] > Sent: Monday, August 19, 2013 12:06 PM > To: Steve Wise > Cc: > Subject: Re: openmpi-1.7.2 fails to use the RDMACM CPC > > Not offhand. > > Given the lack of iWARP testing in the

Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r29043 - in trunk: config ompi

2013-08-19 Thread Nathan Hjelm
Because according to Martin there needed to be profiling versions of all the MPI_T_* functions. I used the C bindings profiling as a base for implementing the functions for MPI_T. Not sure what went wrong but I never tried building without profiling support. -Nathan On Mon, Aug 19, 2013 at

[OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-19 Thread Steve Wise
Hello, I just tried to run openmpi-1.7.2 over Chelsio's iWARP device, and it no longer works. It appears that 1.7.2 fails to use the RDMACM CPC. I guess it is trying to use OOB, which is IB-specific. If I explicitly specify the RDMACM CPC via '--mca btl_openib_cpc_include rdmacm' then it

[OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r29043 - in trunk: config ompi

2013-08-19 Thread Jeff Squyres (jsquyres)
@Nathan -- Why is the MPIT convenience library related to whether profiling is enabled or not? Begin forwarded message: > From: > Subject: [OMPI svn-full] svn:open-mpi r29043 - in trunk: config ompi > Date: August 19, 2013 10:48:24 AM EDT > To:

Re: [OMPI devel] Trunk build failures

2013-08-19 Thread Jeff Squyres (jsquyres)
Make it so. :p On Aug 19, 2013, at 10:13 AM, Ralph Castain wrote: > yeah, i'm trying to fix it - could use your help when you quit your manager > impersonation > > On Aug 19, 2013, at 7:04 AM, Jeff Squyres (jsquyres) > wrote: > >> Did something

Re: [OMPI devel] Trunk build failures

2013-08-19 Thread Ralph Castain
yeah, i'm trying to fix it - could use your help when you quit your manager impersonation On Aug 19, 2013, at 7:04 AM, Jeff Squyres (jsquyres) wrote: > Did something happen to break the trunk recently? > > - > [7:03] savbu-usnic-a:~/s/o/o/t/ompi_info ❯❯❯ make > CC

[OMPI devel] Trunk build failures

2013-08-19 Thread Jeff Squyres (jsquyres)
Did something happen to break the trunk recently? - [7:03] savbu-usnic-a:~/s/o/o/t/ompi_info ❯❯❯ make CC ompi_info.o CC param.o CC components.o CC version.o CCLD ompi_info ../../../ompi/.libs/libmpi.so: undefined reference to `mpit_unlock'