Sorry for the very late reply. Everything works now! Thanks a lot!!
On Sep 25, 2013, at 7:00 PM, Ralph Castain wrote:
> I've committed a fix to the trunk (r29245) and scheduled it for v1.7.3 -
> thanks for the debug info!
>
> Ralph
>
> On Sep 25, 2013, at 5:00 AM, Suraj Prabhakaran
> wrote:
I've committed a fix to the trunk (r29245) and scheduled it for v1.7.3 - thanks
for the debug info!
Ralph
On Sep 25, 2013, at 5:00 AM, Suraj Prabhakaran
wrote:
> Dear Ralph,
>
> I am sorry, but I think I missed setting plm verbosity to 5 last time. Here is
> the output of the complete program
Dear Ralph,
I am sorry, but I think I missed setting plm verbosity to 5 last time. Here is
the output of the complete program with and without -novm added to the following
mpiexec command:
mpiexec -mca state_base_verbose 10 -mca errmgr_base_verbose 10 -mca
plm_base_verbose 5 -mca btl tcp,sm,self -np 2 ./addho
What I find puzzling is that I don't see any output indicating that you went
through the Torque launcher to launch the daemons - not a peep of debug output.
This makes me suspicious that something else is going on. Are you sure you sent
me all the output?
Try adding -novm to your mpirun cmd line a
Hi Ralph,
So here is what I do. I spawn just a "single" process on a new node which is
basically not in the $PBS_NODEFILE list.
My $PBS_NODEFILE list contains
grsacc20
grsacc19
I then start the app with just 2 processes. So one host gets one process and
they are successfully spawned through th
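As a rough sketch of the spawn being described here (the host name and the
worker executable below are placeholders, not the actual addho program from
this thread), the parent side of such an "add-host" spawn might look like:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Ask Open MPI to extend the allocation with a host that is not in
       $PBS_NODEFILE, then spawn a single process there. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "add-host", "grsacc18");   /* placeholder node name */

    MPI_Comm children;
    int errcode;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_WORLD, &children, &errcode);

    /* The spawned ./worker must itself call MPI_Comm_get_parent() and
       disconnect from the inter-communicator on its side. */
    MPI_Info_free(&info);
    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}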
I'm going to need a little help here. The problem is that you launch two new
daemons, and one of them exits immediately because it thinks it lost the
connection back to mpirun - before it even gets a chance to create it.
Can you give me a little more info as to exactly what you are doing? Perhap
Hi Ralph,
Output attached in a file.
Thanks a lot!
Best,
Suraj
Afraid I don't see the problem offhand - can you add the following to your cmd
line?
-mca state_base_verbose 10 -mca errmgr_base_verbose 10
Thanks
Ralph
On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran
wrote:
> Hi Ralph,
>
> I always got this output from any MPI job that ran on our nodes. Th
Hi Ralph,
I always got this output from any MPI job that ran on our nodes. There seems to
be a problem somewhere but it never stopped the applications from running. But
anyway, I ran it again now with only tcp and excluded the infiniband and I get
the same output again. Except that this time,
Your output shows that it launched your apps, but they exited. The error is
reported here, though it appears we aren't flushing the message out before
exiting due to a race condition:
> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt
> / no active ports found
Here
Hi Ralph,
I tested it with the trunk r29228. I still have the following problem. Now, it
even spawns the daemon on the new node through Torque but then suddenly quits.
The following is the output. Can you please have a look?
Thanks
Suraj
[grsacc20:04511] [[6253,0],0] plm:base:receive process
On Sep 23, 2013, at 01:43 , Ralph Castain wrote:
>
> On Sep 22, 2013, at 2:15 PM, George Bosilca wrote:
>
>> In fact there are only two types of information: one that is added by the
>> OMPI layer, which is exchanged during the modex exchange stage, and whatever
>> else is built on top of th
Found a bug in the Torque support - we were trying to connect to the MOM again,
which would hang (I imagine). I pushed a fix to the trunk (r29227) and
scheduled it to come to 1.7.3 if you want to try it again.
On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran
wrote:
> Dear Ralph,
>
> This is t
On Sep 22, 2013, at 2:15 PM, George Bosilca wrote:
> In fact there are only two types of information: one that is added by the OMPI
> layer, which is exchanged during the modex exchange stage, and whatever else
> is built on top of this information by different pieces of the software stack
> (
Dear Ralph,
This is the output I get when I execute with the verbose option.
[grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
[grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from
[[23526,1],0]
[grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
[grsacc20
In fact there are only two types of information: one that is added by the OMPI
layer, which is exchanged during the modex exchange stage, and whatever else is
built on top of this information by different pieces of the software stack
(including the RTE). If we mark these two types of data indepen
I'll still need to look at the intercomm_create issue, but I just tested both
the trunk and current 1.7.3 branch for "add-host" and both worked just fine.
This was on my little test cluster which only has rsh available - no Torque.
You might add "-mca plm_base_verbose 5" to your cmd line to get
On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran
wrote:
> Dear all,
>
> Really thanks a lot for your efforts. I too downloaded the trunk to check if
> it works for my case and as of revision 29215, it works for the original case
> I reported. Although it works, I still see the following in the
Dear all,
Really thanks a lot for your efforts. I too downloaded the trunk to check if it
works for my case and as of revision 29215, it works for the original case I
reported. Although it works, I still see the following in the output. Does it
mean anything?
[grsacc17][[13611,1],0][btl_openib_
Just to close my end of this loop: as of trunk r29213, it all works for me.
Thanks!
On Sep 18, 2013, at 12:52 PM, Ralph Castain wrote:
> Thanks George - much appreciated
>
> On Sep 18, 2013, at 9:49 AM, George Bosilca wrote:
>
>> The test case was broken. I just pushed a fix.
>>
>> George
Been wracking my brain on this, and I can't find any way to do this cleanly
without invoking some kind of extension/modification to the MPI-RTE interface.
The problem is that we are now executing an "in-band" modex operation. This is
fine, but the modex operation (no matter how it is executed) i
Actually, we wouldn't have to modify the interface - just have to define a
DB_RTE flag and OR it to the DB_INTERNAL/DB_EXTERNAL one. We'd need to modify
the "fetch" routines to pass the flag into them so we fetched the right things,
but that's a simple change.
On Sep 18, 2013, at 10:12 AM, Ralp
I struggled with that myself when doing my earlier patch - part of the reason
why I added the dpm API.
I don't know how to update the locality without referencing RTE-specific keys,
so maybe the best thing would be to provide some kind of hook into the db that
says we want all the non-RTE keys?
I hit send too early.
Now that we move the entire "local" modex, is there any way to trim it down or
to replace the entries that are not correct anymore? Like the locality?
George.
On Sep 18, 2013, at 18:53 , George Bosilca wrote:
> Regarding your comment on the bug trac, I noticed there is
Regarding your comment on the bug trac, I noticed there is a DB_INTERNAL flag.
While I see how to set it, I could not figure out any way to get it back.
With the required modification of the DB API, can't we take advantage of it?
George.
On Sep 18, 2013, at 18:52 , Ralph Castain wrote:
> Thanks G
Thanks George - much appreciated
On Sep 18, 2013, at 9:49 AM, George Bosilca wrote:
> The test case was broken. I just pushed a fix.
>
> George.
>
> On Sep 18, 2013, at 16:49 , Ralph Castain wrote:
>
>> Hangs with any np > 1
>>
>> However, I'm not sure if that's an issue with the test vs t
The test case was broken. I just pushed a fix.
George.
On Sep 18, 2013, at 16:49 , Ralph Castain wrote:
> Hangs with any np > 1
>
> However, I'm not sure if that's an issue with the test vs the underlying
> implementation
>
> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)"
> wrote
Hangs with any np > 1
However, I'm not sure if that's an issue with the test vs the underlying
implementation
On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)"
wrote:
> Does it hang when you run with -np 4?
>
> Sent from my phone. No type good.
>
> On Sep 18, 2013, at 4:10 PM, "Ralph
Does it hang when you run with -np 4?
Sent from my phone. No type good.
On Sep 18, 2013, at 4:10 PM, "Ralph Castain" wrote:
> Strange - it works fine for me on my Mac. However, I see one difference - I
> only run it with np=1
>
> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres)
> wrote
Strange - it works fine for me on my Mac. However, I see one difference - I
only run it with np=1
On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) wrote:
> On Sep 18, 2013, at 9:33 AM, George Bosilca wrote:
>
>> 1. sm doesn't work between spawned processes. So you must have another
>> ne
On Sep 18, 2013, at 9:33 AM, George Bosilca wrote:
> 1. sm doesn't work between spawned processes. So you must have another
> network enabled.
I know :-). I have tcp available as well (OMPI will abort if you only run with
sm,self because the comm_spawn will fail with unreachable errors -- I j
2 things:
1. sm doesn't work between spawned processes. So you must have another network
enabled.
2. Don't use the test case attached to my email; I left in an xterm-based spawn
and the debugging. It can't work without xterm support. Instead try using the
test case from the trunk, the one committ
George --
When I build the SVN trunk (r29201) on 64-bit Linux, your attached test case
hangs:
-
❯❯❯ mpicc intercomm_create.c -o intercomm_create
❯❯❯ mpirun -np 4 intercomm_create
b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
b: MPI_Intercomm_create( intra, 0,
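George's attached test case is not included in this archive, so as a generic
illustration only (a sketch, not that test case; the tag 201 merely echoes the
trace above), here is how two halves of MPI_COMM_WORLD can be joined with
MPI_Intercomm_create through a bridge communicator:

#include <mpi.h>

/* Run with np >= 2. Each half of MPI_COMM_WORLD forms a local group;
   the two groups are then joined into an inter-communicator using
   MPI_COMM_WORLD as the bridge. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int color = (rank < size / 2) ? 0 : 1;
    MPI_Comm local;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &local);

    /* The remote leader is rank 0 of the other half, addressed by its
       rank in the bridge communicator (MPI_COMM_WORLD). */
    int remote_leader = (color == 0) ? size / 2 : 0;
    MPI_Comm inter;
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, remote_leader, 201, &inter);

    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}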
Here is a quick (and definitely not the cleanest) patch that addresses the
MPI_Intercomm issue at the MPI level. It should be applied after removal of
r29166. I also added the corrected test case stressing the corner cases by doing
barriers at every inter-comm creation and doing a clean disconnect.
Great! I'll welcome the patch - feel free to back mine out when you do.
Thanks!
On Sep 17, 2013, at 2:43 PM, George Bosilca wrote:
> On Sep 17, 2013, at 23:19 , Ralph Castain wrote:
>
>> I very much doubt that it would work, though I can give it a try, as the
>> patch addresses Intercomm_mer
On Sep 17, 2013, at 23:19 , Ralph Castain wrote:
> I very much doubt that it would work, though I can give it a try, as the
> patch addresses Intercomm_merge and not Intercomm_create. I debated about
> putting the patch into "create" instead, but nobody was citing that as being
> a problem. In
On Sep 17, 2013, at 2:01 PM, George Bosilca wrote:
> Ralph,
>
> On Sep 17, 2013, at 20:13 , Ralph Castain wrote:
>
>> I guess we could argue this for awhile, but I personally don't care how it
>> gets fixed. The issue here is that (a) you promised to provide a "better"
>> fix nearly a year
Ralph,
On Sep 17, 2013, at 20:13 , Ralph Castain wrote:
> I guess we could argue this for awhile, but I personally don't care how it
> gets fixed. The issue here is that (a) you promised to provide a "better" fix
> nearly a year ago, (b) it never happened, and (c) a user who has patiently
> wai
I guess we could argue this for awhile, but I personally don't care how it gets
fixed. The issue here is that (a) you promised to provide a "better" fix nearly
a year ago, (b) it never happened, and (c) a user who has patiently waited all
this time has asked if we could please fix it.
It now wo
Ralph,
I don't think your patch is addressing the right issue. In fact your commit
treats the wrong symptom instead of addressing the core issue that generates the
problem. Let me explain this in terms of MPI.
The MPI_Intercomm_merge function transforms an inter-comm into an intra-comm,
basically
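A minimal sketch of the pattern being discussed (not code from this thread):
MPI_Comm_spawn returns an inter-communicator, and MPI_Intercomm_merge turns it
into a single intra-communicator spanning the parent and the spawned children.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter, merged;
    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: spawn two copies of this binary, then merge. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0 /* parent ordered first */, &merged);
    } else {
        /* Child side: merge with the parent group, children ordered last. */
        inter = parent;
        MPI_Intercomm_merge(inter, 1, &merged);
    }

    /* 'merged' is now an ordinary intra-communicator over all processes. */
    MPI_Barrier(merged);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}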
Hi Ralph,
Thanks a lot!!! That's really cool!!
Best,
Suraj
On Sep 15, 2013, at 5:01 PM, Ralph Castain wrote:
> I fixed it and have filed a cmr to move it to 1.7.3
>
> Thanks for your patience, and for reminding me
> Ralph
>
> On Sep 13, 2013, at 12:05 PM, Suraj Prabhakaran
> wrote:
>
>> De
I fixed it and have filed a cmr to move it to 1.7.3
Thanks for your patience, and for reminding me
Ralph
On Sep 13, 2013, at 12:05 PM, Suraj Prabhakaran
wrote:
> Dear Ralph, that would be great if you could give it a try. We have been
> hoping for it for a year now and it could greatly benefi
Dear Ralph, that would be great if you could give it a try. We have been
hoping for it for a year now and it could greatly benefit us if this is
fixed!! :-)
Thanks!
Suraj
On Fri, Sep 13, 2013 at 5:39 PM, Ralph Castain wrote:
> It has been a low priority issue, and hence not resolved yet. I d
It has been a low priority issue, and hence not resolved yet. I doubt it will
make 1.7.3, though if you need it, I'll give it a try.
On Sep 13, 2013, at 7:21 AM, Suraj Prabhakaran
wrote:
> Hello,
>
> Is there a plan to fix the problem with MPI_Intercomm_merge with 1.7.3 as
> stated in this t