Ralph Castain wrote:
On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:
Ralph Castain wrote:
I committed something to the trunk yesterday. Given the complexity of the fix, I don't plan to bring it over to the 1.3 branch until sometime mid-to-end of next week so it can be adequately tested.
OK! So that means I can check out from the SVN trunk to get your fix, right?
Yes, though note that I don't claim it is fully correct yet. Still needs testing. However, I have tested it a fair amount and it seems okay. If you do test it, please let me know how it goes.
I ran my test on the svn/trunk build below:
Open MPI: 1.4a1r19677
Open MPI SVN revision: r19677
Open MPI release date: Unreleased developer copy
Open RTE: 1.4a1r19677
Open RTE SVN revision: r19677
Open RTE release date: Unreleased developer copy
OPAL: 1.4a1r19677
OPAL SVN revision: r19677
OPAL release date: Unreleased developer copy
Ident string: 1.4a1r19677
Below is the output, which seems to freeze just after the second spawn.
[roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons --hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received sync+nidmap from local proc [[19516,1],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
Loading the node's ring from file '/var/torque/aux//932.master.tekno-soft.it'
... adding node #1 host is 'cluster4.tekno-soft.it'
... adding node #2 host is 'cluster3.tekno-soft.it'
... adding node #3 host is 'cluster2.tekno-soft.it'
... adding node #4 host is 'cluster1.tekno-soft.it'
A 4 node's ring has been made
At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
Setting up the host as 'cluster4.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
Daemon [[19516,0],1] checking in as pid 25123 on host cluster4.tekno-soft.it
Daemon [[19516,0],1] not using static ports
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running - waiting for commands!
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon 1 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received add_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master daemon 0 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4 daemon 1 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received sync+nidmap from local proc [[19516,2],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received collective data cmd
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received message_local_procs
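
In case it helps, the spawn step you see in the log above boils down to roughly the following. This is only a minimal sketch, not my actual testmaster code; the nodefile parsing, the fixed work directory and the testslave.sh name are placeholders taken from the output above.

/* Sketch of the master side: read the Torque nodefile, then spawn one
 * slave at a time on a chosen host via the "host"/"wdir" Info keys. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    char hosts[64][256];
    int  nhosts = 0;

    MPI_Init(&argc, &argv);

    /* Load the node ring from $PBS_NODEFILE, passed as the second
     * argument on the mpirun command line. */
    FILE *f = fopen(argv[2], "r");
    while (f && nhosts < 64 && fscanf(f, "%255s", hosts[nhosts]) == 1)
        nhosts++;
    if (f) fclose(f);

    long njobs = atol(argv[1]);                 /* 100000 in the run above */
    for (long job = 0; job < njobs; job++) {
        MPI_Info info;
        MPI_Comm child;

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", hosts[job % nhosts]);   /* target node */
        MPI_Info_set(info, "wdir", "/data/roberto/MPI/TestOpenMPI");

        /* One slave per spawn; the hang reported above shows up on the
         * second iteration of this loop. */
        MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

        MPI_Comm_disconnect(&child);
        MPI_Info_free(&info);
    }

    MPI_Finalize();
    return 0;
}

Note that even with -np 1, each spawn onto a new host goes through a full daemon launch there, which is why a new orted shows up on cluster4 in the log.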
Let me know if you need my test program.
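The slave that testslave.sh wraps does little more than this (again just a sketch of the idea, not the real code):

/* Sketch of the slave side: connect back to the parent job, do the
 * work, then detach cleanly so the master can go on spawning. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    /* ... real work would go here ... */

    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);

    MPI_Finalize();
    return 0;
}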
Thanks
Ralph
On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:
Ralph Castain wrote:
Actually, it just occurred to me that you may be seeing a problem in comm_spawn itself that I am currently chasing down. It is in the 1.3 branch and has to do with comm_spawning procs on subsets of nodes (instead of across all nodes). Could be related to this - you might want to give me a chance to complete the fix. I have identified the problem and should have it fixed later today in our trunk - probably won't move to the 1.3 branch for several days.
Do you have any news about the above fix? Is it already available for testing?
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users