Re: [OMPI devel] Intercomm Merge

2013-10-02 Thread Suraj Prabhakaran
Sorry for the very late reply. Everything works now! Thanks a lot!! On Sep 25, 2013, at 7:00 PM, Ralph Castain wrote: > I've committed a fix to the trunk (r29245) and scheduled it for v1.7.3 - > thanks for the debug info! > > Ralph > > On Sep 25, 2013, at 5:00 AM, Suraj Prabhakaran > wrote:

Re: [OMPI devel] Intercomm Merge

2013-09-25 Thread Ralph Castain
I've committed a fix to the trunk (r29245) and scheduled it for v1.7.3 - thanks for the debug info! Ralph On Sep 25, 2013, at 5:00 AM, Suraj Prabhakaran wrote: > Dear Ralph, > > I am sorry but I think I missed adding plm verbosity to 5 last time. Here is > the output of the complete program

Re: [OMPI devel] Intercomm Merge

2013-09-25 Thread Suraj Prabhakaran
Dear Ralph, I am sorry but I think I missed adding plm verbosity to 5 last time. Here is the output of the complete program with and without -novm to the following mpiexec. mpiexec -mca state_base_verbose 10 -mca errmgr_base_verbose 10 -mca plm_base_verbose 5 -mca btl tcp,sm,self -np 2 ./addho

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Ralph Castain
What I find puzzling is that I don't see any output indicating that you went thru the Torque launcher to launch the daemons - not a peep of debug output. This makes me suspicious that something else is going on. Are you sure you sent me all the output? Try adding -novm to your mpirun cmd line a

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Suraj Prabhakaran
Hi Ralph, So here is what I do. I spawn just a "single" process on a new node which is basically not in the $PBS_NODEFILE list. My $PBS_NODEFILE list contains grsacc20 grsacc19 I then start the app with just 2 processes. So one host gets one process and they are successfully spawned through th

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Ralph Castain
I'm going to need a little help here. The problem is that you launch two new daemons, and one of them exits immediately because it thinks it lost the connection back to mpirun - before it even gets a chance to create it. Can you give me a little more info as to exactly what you are doing? Perhap

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Suraj Prabhakaran
Hi Ralph, Output attached in a file. Thanks a lot! Best, Suraj

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Ralph Castain
Afraid I don't see the problem offhand - can you add the following to your cmd line? -mca state_base_verbose 10 -mca errmgr_base_verbose 10 Thanks Ralph On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran wrote: > Hi Ralph, > > I always got this output from any MPI job that ran on our nodes. Th

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Suraj Prabhakaran
Hi Ralph, I always got this output from any MPI job that ran on our nodes. There seems to be a problem somewhere but it never stopped the applications from running. But anyway, I ran it again now with only tcp and excluded the infiniband and I get the same output again. Except that this time,

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Ralph Castain
Your output shows that it launched your apps, but they exited. The error is reported here, though it appears we aren't flushing the message out before exiting due to a race condition: > [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt > / no active ports found Here

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Suraj Prabhakaran
Hi Ralph, I tested it with the trunk r29228. I still have the following problem. Now, it even spawns the daemon on the new node through Torque but then suddenly quits. The following is the output. Can you please have a look? Thanks Suraj [grsacc20:04511] [[6253,0],0] plm:base:receive process

Re: [OMPI devel] Intercomm Merge

2013-09-23 Thread George Bosilca
On Sep 23, 2013, at 01:43 , Ralph Castain wrote: > > On Sep 22, 2013, at 2:15 PM, George Bosilca wrote: > >> In fact there are only two types of information: one that is added by the >> OMPI layer, which is exchanged during the modex exchange stage, and whatever >> else is built on top of th

Re: [OMPI devel] Intercomm Merge

2013-09-22 Thread Ralph Castain
Found a bug in the Torque support - we were trying to connect to the MOM again, which would hang (I imagine). I pushed a fix to the trunk (r29227) and scheduled it to come to 1.7.3 if you want to try it again. On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran wrote: > Dear Ralph, > > This is t

Re: [OMPI devel] Intercomm Merge

2013-09-22 Thread Ralph Castain
On Sep 22, 2013, at 2:15 PM, George Bosilca wrote: > In fact there are only two types of information: one that is added by the OMPI > layer, which is exchanged during the modex exchange stage, and whatever else > is built on top of this information by different pieces of the software stack > (

Re: [OMPI devel] Intercomm Merge

2013-09-22 Thread Suraj Prabhakaran
Dear Ralph, This is the output I get when I execute with the verbose option. [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from [[23526,1],0] [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts [grsacc20

Re: [OMPI devel] Intercomm Merge

2013-09-22 Thread George Bosilca
In fact there are only two types of information: one that is added by the OMPI layer, which is exchanged during the modex exchange stage, and whatever else is built on top of this information by different pieces of the software stack (including the RTE). If we mark these two types of data indepen

Re: [OMPI devel] Intercomm Merge

2013-09-22 Thread Ralph Castain
I'll still need to look at the intercomm_create issue, but I just tested both the trunk and current 1.7.3 branch for "add-host" and both worked just fine. This was on my little test cluster which only has rsh available - no Torque. You might add "-mca plm_base_verbose 5" to your cmd line to get

Re: [OMPI devel] Intercomm Merge

2013-09-21 Thread Ralph Castain
On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran wrote: > Dear all, > > Really thanks a lot for your efforts. I too downloaded the trunk to check if > it works for my case and as of revision 29215, it works for the original case > I reported. Although it works, I still see the following in the

Re: [OMPI devel] Intercomm Merge

2013-09-21 Thread Suraj Prabhakaran
Dear all, Really thanks a lot for your efforts. I too downloaded the trunk to check if it works for my case and as of revision 29215, it works for the original case I reported. Although it works, I still see the following in the output. Does it mean anything? [grsacc17][[13611,1],0][btl_openib_

Re: [OMPI devel] Intercomm Merge

2013-09-20 Thread Jeff Squyres (jsquyres)
Just to close my end of this loop: as of trunk r29213, it all works for me. Thanks! On Sep 18, 2013, at 12:52 PM, Ralph Castain wrote: > Thanks George - much appreciated > > On Sep 18, 2013, at 9:49 AM, George Bosilca wrote: > >> The test case was broken. I just pushed a fix. >> >> George

Re: [OMPI devel] Intercomm Merge

2013-09-19 Thread Ralph Castain
Been wracking my brain on this, and I can't find any way to do this cleanly without invoking some kind of extension/modification to the MPI-RTE interface. The problem is that we are now executing an "in-band" modex operation. This is fine, but the modex operation (no matter how it is executed) i

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread Ralph Castain
Actually, we wouldn't have to modify the interface - just have to define a DB_RTE flag and OR it to the DB_INTERNAL/DB_EXTERNAL one. We'd need to modify the "fetch" routines to pass the flag into them so we fetched the right things, but that's a simple change. On Sep 18, 2013, at 10:12 AM, Ralp
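The flag idea in the message above can be sketched in plain C. This is only an illustration under stated assumptions: the numeric values, the `db_scope_t` type, and the helpers `db_make_scope` and `db_fetch_matches` are hypothetical and are not taken from the Open MPI db framework; only the names DB_INTERNAL, DB_EXTERNAL and DB_RTE come from the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real Open MPI definitions may differ. */
typedef unsigned int db_scope_t;
#define DB_INTERNAL 0x01u  /* data kept local to the process         */
#define DB_EXTERNAL 0x02u  /* data shared with peers via the modex   */
#define DB_RTE      0x04u  /* marker: entry was stored by the RTE    */

/* Build the scope for a store: exactly one visibility bit, optionally
 * OR'ed with DB_RTE, as proposed in the message above. */
db_scope_t db_make_scope(db_scope_t visibility, bool is_rte)
{
    assert(visibility == DB_INTERNAL || visibility == DB_EXTERNAL);
    return is_rte ? (visibility | DB_RTE) : visibility;
}

/* Decide whether a stored entry should be returned by a fetch that asks
 * for a given visibility and for either RTE or non-RTE data. */
bool db_fetch_matches(db_scope_t stored, db_scope_t visibility, bool want_rte)
{
    bool stored_is_rte = (stored & DB_RTE) != 0;

    if ((stored & visibility) == 0) {
        return false;              /* wrong visibility class */
    }
    return stored_is_rte == want_rte;
}
```

With a filter like this, a fetch for all non-RTE external keys would call `db_fetch_matches(entry_scope, DB_EXTERNAL, false)` on each entry, which is roughly the "hook" asked for earlier in the thread.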

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread Ralph Castain
I struggled with that myself when doing my earlier patch - part of the reason why I added the dpm API. I don't know how to update the locality without referencing RTE-specific keys, so maybe the best thing would be to provide some kind of hook into the db that says we want all the non-RTE keys?

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread George Bosilca
I hit send too early. Now that we move the entire "local" modex is there any way to trim it down or to replace the entries that are not correct anymore? Like the locality? George. On Sep 18, 2013, at 18:53 , George Bosilca wrote: > Regarding your comment on the bug trac, I noticed there is

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread George Bosilca
Regarding your comment on the bug trac, I noticed there is a DB_INTERNAL flag. While I see how to set it, I could not figure out any way to get it back. With the required modification of the DB API can't we take advantage of it? George. On Sep 18, 2013, at 18:52 , Ralph Castain wrote: > Thanks G

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread Ralph Castain
Thanks George - much appreciated On Sep 18, 2013, at 9:49 AM, George Bosilca wrote: > The test case was broken. I just pushed a fix. > > George. > > On Sep 18, 2013, at 16:49 , Ralph Castain wrote: > >> Hangs with any np > 1 >> >> However, I'm not sure if that's an issue with the test vs t

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread George Bosilca
The test case was broken. I just pushed a fix. George. On Sep 18, 2013, at 16:49 , Ralph Castain wrote: > Hangs with any np > 1 > > However, I'm not sure if that's an issue with the test vs the underlying > implementation > > On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" > wrote

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread Ralph Castain
Hangs with any np > 1 However, I'm not sure if that's an issue with the test vs the underlying implementation On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" wrote: > Does it hang when you run with -np 4? > > Sent from my phone. No type good. > > On Sep 18, 2013, at 4:10 PM, "Ralph

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread Jeff Squyres (jsquyres)
Does it hang when you run with -np 4? Sent from my phone. No type good. On Sep 18, 2013, at 4:10 PM, "Ralph Castain" wrote: > Strange - it works fine for me on my Mac. However, I see one difference - I > only run it with np=1 > > On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) > wrote

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread Ralph Castain
Strange - it works fine for me on my Mac. However, I see one difference - I only run it with np=1 On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) wrote: > On Sep 18, 2013, at 9:33 AM, George Bosilca wrote: > >> 1. sm doesn't work between spawned processes. So you must have another >> ne

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread Jeff Squyres (jsquyres)
On Sep 18, 2013, at 9:33 AM, George Bosilca wrote: > 1. sm doesn't work between spawned processes. So you must have another > network enabled. I know :-). I have tcp available as well (OMPI will abort if you only run with sm,self because the comm_spawn will fail with unreachable errors -- I j

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread George Bosilca
2 things: 1. sm doesn't work between spawned processes. So you must have another network enabled. 2. Don't use the test case attached to my email, I left an xterm based spawn and the debugging. It can't work without xterm support. Instead try using the test case from the trunk, the one committ

Re: [OMPI devel] Intercomm Merge

2013-09-18 Thread Jeff Squyres (jsquyres)
George -- When I build the SVN trunk (r29201) on 64 bit linux, your attached test case hangs: - ❯❯❯ mpicc intercomm_create.c -o intercomm_create ❯❯❯ mpirun -np 4 intercomm_create b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4] b: MPI_Intercomm_create( intra, 0,

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread George Bosilca
Here is a quick (and definitely not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of 29166. I also added the corrected test case stressing the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread Ralph Castain
Great! I'll welcome the patch - feel free to back mine out when you do. Thanks! On Sep 17, 2013, at 2:43 PM, George Bosilca wrote: > On Sep 17, 2013, at 23:19 , Ralph Castain wrote: > >> I very much doubt that it would work, though I can give it a try, as the >> patch addresses Intercomm_mer

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread George Bosilca
On Sep 17, 2013, at 23:19 , Ralph Castain wrote: > I very much doubt that it would work, though I can give it a try, as the > patch addresses Intercomm_merge and not Intercomm_create. I debated about > putting the patch into "create" instead, but nobody was citing that as being > a problem. In

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread Ralph Castain
On Sep 17, 2013, at 2:01 PM, George Bosilca wrote: > Ralph, > > On Sep 17, 2013, at 20:13 , Ralph Castain wrote: > >> I guess we could argue this for awhile, but I personally don't care how it >> gets fixed. The issue here is that (a) you promised to provide a "better" >> fix nearly a year

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread George Bosilca
Ralph, On Sep 17, 2013, at 20:13 , Ralph Castain wrote: > I guess we could argue this for awhile, but I personally don't care how it > gets fixed. The issue here is that (a) you promised to provide a "better" fix > nearly a year ago, (b) it never happened, and (c) a user who has patiently > wai

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread Ralph Castain
I guess we could argue this for awhile, but I personally don't care how it gets fixed. The issue here is that (a) you promised to provide a "better" fix nearly a year ago, (b) it never happened, and (c) a user who has patiently waited all this time has asked if we could please fix it. It now wo

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread George Bosilca
Ralph, I don't think your patch is addressing the right issue. In fact your commit treats the wrong symptom instead of addressing the core issue that generates the problem. Let me explain this in terms of MPI. The MPI_Intercomm_merge function transforms an inter-comm into an intra-comm, basically
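As a rough illustration of what turning an inter-comm into an intra-comm means for rank numbering, here is a small model in plain C. It makes no MPI calls. It assumes the rule from the MPI standard that, when the two groups pass different `high` values, the "low" group is ordered before the "high" group, and it additionally assumes that ranks keep their relative order inside each group. The function name `merged_rank` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the rank a process would get in the merged intra-comm.
 *
 * local_rank  : rank of the process in its own group of the inter-comm
 * local_size  : size of the process's own group
 * remote_size : size of the other group
 * local_high  : the "high" argument this group passes to the merge;
 *               the other group is assumed to pass the opposite value
 *               (if both pass the same value the order is arbitrary,
 *               which this model does not cover).
 */
int merged_rank(int local_rank, int local_size, int remote_size,
                bool local_high)
{
    assert(local_size > 0 && remote_size > 0);
    assert(local_rank >= 0 && local_rank < local_size);

    /* The "low" group keeps its ranks; the "high" group is appended. */
    return local_high ? remote_size + local_rank : local_rank;
}
```

For example, with two parents passing high = false and one spawned child passing high = true, the parents would keep ranks 0 and 1 and the child would become rank 2 in the merged communicator.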

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread Suraj Prabhakaran
Hi Ralph, Thanks a lot!!! That's really cool!! Best, Suraj On Sep 15, 2013, at 5:01 PM, Ralph Castain wrote: > I fixed it and have filed a cmr to move it to 1.7.3 > > Thanks for your patience, and for reminding me > Ralph > > On Sep 13, 2013, at 12:05 PM, Suraj Prabhakaran > wrote: > >> De

Re: [OMPI devel] Intercomm Merge

2013-09-15 Thread Ralph Castain
I fixed it and have filed a cmr to move it to 1.7.3 Thanks for your patience, and for reminding me Ralph On Sep 13, 2013, at 12:05 PM, Suraj Prabhakaran wrote: > Dear Ralph, that would be great if you could give it a try. We have been > hoping for it for a year now and it could greatly benefi

Re: [OMPI devel] Intercomm Merge

2013-09-13 Thread Suraj Prabhakaran
Dear Ralph, that would be great if you could give it a try. We have been hoping for it for a year now and it could greatly benefit us if this is fixed!! :-) Thanks! Suraj On Fri, Sep 13, 2013 at 5:39 PM, Ralph Castain wrote: > It has been a low priority issue, and hence not resolved yet. I d

Re: [OMPI devel] Intercomm Merge

2013-09-13 Thread Ralph Castain
It has been a low priority issue, and hence not resolved yet. I doubt it will make 1.7.3, though if you need it, I'll give it a try. On Sep 13, 2013, at 7:21 AM, Suraj Prabhakaran wrote: > Hello, > > Is there a plan to fix the problem with MPI_Intercomm_merge with 1.7.3 as > stated in this t