Interesting. I ran a loop calling comm_spawn 1000 times without a problem. I suspect it is the threading that is causing the trouble here.

You are welcome to send me the code. You can find my loop code in your code distribution under orte/test/mpi - look for loop_spawn and loop_child.
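For reference, that loop test boils down to a parent that repeatedly spawns one child and disconnects from it, roughly as sketched below (a minimal sketch, not the actual loop_spawn.c/loop_child.c sources; the iteration count and child path are assumed):

/* parent: repeatedly spawn one child and disconnect from it
 * (sketch only -- see orte/test/mpi/loop_spawn.c for the real test) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    int i;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 1000; i++) {
        MPI_Comm_spawn("./loop_child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&child);
        if (i % 100 == 0) {
            printf("parent: spawn iteration %d done\n", i);
        }
    }
    MPI_Finalize();
    return 0;
}

/* child: find the parent intercommunicator, disconnect, exit */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}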

Ralph

On Oct 3, 2008, at 9:11 AM, Roberto Fichera wrote:

Ralph Castain wrote:

On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:

Ralph Castain wrote:
I committed something to the trunk yesterday. Given the complexity of
the fix, I don't plan to bring it over to the 1.3 branch until
sometime mid-to-end next week so it can be adequately tested.
Ok! So that means I can check out the SVN trunk to get your fix,
right?

Yes, though note that I don't claim it is fully correct yet. Still
needs testing. However, I have tested it a fair amount and it seems okay.

If you do test it, please let me know how it goes.
I executed my test on the SVN trunk version below

               Open MPI: 1.4a1r19677
  Open MPI SVN revision: r19677
  Open MPI release date: Unreleased developer copy
               Open RTE: 1.4a1r19677
  Open RTE SVN revision: r19677
  Open RTE release date: Unreleased developer copy
                   OPAL: 1.4a1r19677
      OPAL SVN revision: r19677
      OPAL release date: Unreleased developer copy
           Ident string: 1.4a1r19677

Below is the output; the run seems to freeze just after the second spawn.

[roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons
--hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0
arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon
INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received
sync+nidmap from local proc [[19516,1],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
Loading the node's ring from file
'/var/torque/aux//932.master.tekno-soft.it'
... adding node #1 host is 'cluster4.tekno-soft.it'
... adding node #2 host is 'cluster3.tekno-soft.it'
... adding node #3 host is 'cluster2.tekno-soft.it'
... adding node #4 host is 'cluster1.tekno-soft.it'
A 4 node's ring has been made
At least one node is available, let's start to distribute 100000 job
across 4 nodes!!!
Setting up the host as 'cluster4.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
Daemon [[19516,0],1] checking in as pid 25123 on host cluster4.tekno-soft.it
Daemon [[19516,0],1] not using static ports
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running -
waiting for commands!
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0
arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon
1 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon
INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
add_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master daemon
0 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4
daemon 1 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received
sync+nidmap from local proc [[19516,2],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
collective data cmd
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
message_local_procs
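
For context, the per-host spawn shown in the output above ("Setting up the host ...", "Setting the work directory ...", "Spawning a task 'testslave.sh' ...") corresponds to an MPI_Comm_spawn call using the predefined "host" and "wdir" info keys, roughly as sketched here (a hypothetical sketch, not the actual test program; the host and directory are placeholders taken from the log):

/* sketch of spawning one task on a named host with a fixed working
 * directory (hypothetical -- not the actual testmaster code) */
#include <mpi.h>
#include <stdio.h>

static void spawn_on_host(const char *host, const char *wdir, MPI_Comm *child)
{
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", (char *)host);   /* run the child on this node only */
    MPI_Info_set(info, "wdir", (char *)wdir);   /* child working directory */

    printf("Spawning a task 'testslave.sh' on node '%s'\n", host);
    MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, child, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
}

int main(int argc, char **argv)
{
    MPI_Comm child;

    MPI_Init(&argc, &argv);
    spawn_on_host("cluster4.tekno-soft.it", "/data/roberto/MPI/TestOpenMPI", &child);
    /* the spawned side must also disconnect for this to return */
    MPI_Comm_disconnect(&child);
    MPI_Finalize();
    return 0;
}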

Let me know if you need my test program.


Thanks
Ralph


Ralph

On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:

Ralph Castain wrote:
Actually, it just occurred to me that you may be seeing a problem in comm_spawn itself that I am currently chasing down. It is in the 1.3
branch and has to do with comm_spawning procs on subsets of nodes
(instead of across all nodes). Could be related to this - you might want to give me a chance to complete the fix. I have identified the problem and should have it fixed later today in our trunk - probably
won't move to the 1.3 branch for several days.
Do you have any news about the above fix? Is the fix already
available for testing?



_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
