Roberto Fichera wrote:
> Hi all on the list,
>
> I'm trying to run dynamic MPI applications using MPI_Comm_spawn().
> The application I'm using for tests is basically composed of a master,
> which spawns a slave on each assigned node in a multithreaded fashion.
> The master is started with a number of jobs to perform and a filename
> containing the list of assigned nodes. The idea is to handle all the
> dispatching logic within the application, so that the master keeps each
> assigned node as busy as possible. For each spawned job, the master
> allocates a thread that does the spawn and handles the communication,
> then generates a random number and sends it to the slave, which simply
> sends it back to the master. Finally the slave terminates its job and
> the corresponding node becomes free for a new one. This continues until
> all the requested jobs are done.
>
> The test program does *not* work flawlessly in MPICH2 because of a ~24k
> spawned-job limitation: the library's internal context ids, allocated
> for each spawned job, grow monotonically and are never recycled when a
> job terminates, so the application eventually stops on an internal
> overflow. The only MPI-2 implementation (i.e. one supporting
> MPI_Comm_spawn()) that has been able to complete the test so far is
> HP MPI. So now I would like to check whether Open MPI is suitable for
> our dynamic parallel applications.
>
> The test application is linked against Open MPI v1.3a1r19645, running
> on Fedora 8 x86_64 + all updates.
>
> My first attempt ended with the error below, and I basically don't know
> where to look further. Note that I've already checked PATH and
> LD_LIBRARY_PATH; the application is configured correctly, since it is
> started through two scripts where all the paths are set. Basically I
> need to start *one* master application which will handle everything
> needed to manage the slave applications. The communication is *only*
> master <-> slave and never collective, at the moment.
>
> The test program is available on request.
>
> Does anyone have an idea what's going on?
>
> Thanks in advance,
> Roberto Fichera.
>
> [roberto@cluster4 TestOpenMPI]$ orterun -wdir /data/roberto/MPI/TestOpenMPI -np 1 testmaster 10000 $PBS_NODEFILE
> Initializing MPI ...
> Loading the node's ring from file '/var/torque/aux//909.master.tekno-soft.it'
> ... adding node #1 host is 'cluster3.tekno-soft.it'
> ... adding node #2 host is 'cluster2.tekno-soft.it'
> ... adding node #3 host is 'cluster1.tekno-soft.it'
> ... adding node #4 host is 'master.tekno-soft.it'
> A 4 node's ring has been made
> At least one node is available, let's start to distribute 10000 job across 4 nodes!!!
> ****************** Starting job #1
> ****************** Starting job #2
> ****************** Starting job #3
> ****************** Starting job #4
> Setting up the host as 'cluster3.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster3.tekno-soft.it'
> Setting up the host as 'cluster2.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster2.tekno-soft.it'
> Setting up the host as 'cluster1.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'cluster1.tekno-soft.it'
> Setting up the host as 'master.tekno-soft.it'
> Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
> Spawning a task './testslave.sh' on node 'master.tekno-soft.it'
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in
> file base/plm_base_receive.c at line 169
> [cluster4.tekno-soft.it:21287] [[30014,0],0] ORTE_ERROR_LOG: Resource busy in
> file base/plm_base_receive.c at line 169

Just to report a little progress: now everything seems to start, but mpirun doesn't find the executable.
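For reference, each job in my testmaster does roughly the following. This is only a trimmed-down sketch, not the real sources (which, as said, are available on request): run_one_job() is just an illustrative name, './testslave.sh' matches the output above, "host"/"wdir" are the standard MPI_Comm_spawn() info keys for placing the slave and setting its working directory, and the sketch assumes MPI was initialized with MPI_THREAD_MULTIPLE since every job runs in its own dispatcher thread.

    #include <mpi.h>
    #include <stdlib.h>

    /* Spawn one slave on the given node, send it a random number, read the
     * echo back, then tear down the intercommunicator so the node becomes
     * free for the next job.  "host" and "wdir" are the reserved
     * MPI_Comm_spawn() info keys for placement and working directory. */
    static void run_one_job(const char *host, const char *wdir)
    {
        MPI_Comm slave;
        MPI_Info info;
        int number = rand(), echoed = -1;

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", (char *) host);
        MPI_Info_set(info, "wdir", (char *) wdir);

        MPI_Comm_spawn("./testslave.sh", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &slave, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);

        MPI_Send(&number, 1, MPI_INT, 0, 0, slave);                    /* master -> slave */
        MPI_Recv(&echoed, 1, MPI_INT, 0, 0, slave, MPI_STATUS_IGNORE); /* slave -> master */

        MPI_Comm_disconnect(&slave);   /* slave exits, node is free again */
    }

The slave side just mirrors this: MPI_Comm_get_parent(), one MPI_Recv() and one MPI_Send() on the parent intercommunicator, then MPI_Comm_disconnect() and MPI_Finalize().

Here is the output of the new attempt: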
[roberto@cluster4 TestOpenMPI]$ mpirun --verbose --debug-daemons --mca obl -np 1 -wdir `pwd` testmaster 10000 $PBS_NODEFILE
Daemon was launched on cluster3.tekno-soft.it - beginning to initialize
Daemon was launched on cluster2.tekno-soft.it - beginning to initialize
Daemon was launched on cluster1.tekno-soft.it - beginning to initialize
Daemon [[14600,0],2] checking in as pid 28732 on host cluster2.tekno-soft.it
Daemon [[14600,0],2] not using static ports
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted: up and running - waiting for commands!
Daemon [[14600,0],3] checking in as pid 2590 on host cluster1.tekno-soft.it
Daemon [[14600,0],3] not using static ports
[cluster1.tekno-soft.it:02590] [[14600,0],3] orted: up and running - waiting for commands!
Daemon [[14600,0],1] checking in as pid 6969 on host cluster3.tekno-soft.it
Daemon [[14600,0],1] not using static ports
[cluster3.tekno-soft.it:06969] [[14600,0],1] orted: up and running - waiting for commands!
Daemon was launched on master.tekno-soft.it - beginning to initialize
Daemon [[14600,0],4] checking in as pid 1113 on host master.tekno-soft.it
Daemon [[14600,0],4] not using static ports
[master.tekno-soft.it:01113] [[14600,0],4] orted: up and running - waiting for commands!
[cluster4.tekno-soft.it:07953] [[14600,0],0] orted_cmd: received add_local_procs
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[0].name cluster4 daemon 0 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[1].name cluster3 daemon 1 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[2].name cluster2 daemon 2 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[3].name cluster1 daemon 3 arch ffc91200
[cluster4.tekno-soft.it:07953] [[14600,0],0] node[4].name master daemon 4 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] orted_cmd: received add_local_procs
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted_cmd: received add_local_procs
[master.tekno-soft.it:01113] [[14600,0],4] orted_cmd: received add_local_procs
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[0].name cluster4 daemon 0 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[1].name cluster3 daemon 1 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[2].name cluster2 daemon 2
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[0].name cluster4 daemon 0 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[1].name cluster3 daemon 1 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[2].name cluster2 daemon 2
[master.tekno-soft.it:01113] [[14600,0],4] node[0].name cluster4 daemon 0 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[1].name cluster3 daemon 1 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[2].name cluster2 daemon 2 arch farch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[3].name cluster1 daemon 3 arch ffc91200
[cluster3.tekno-soft.it:06969] [[14600,0],1] node[4].name master daemon 4 arch ffc91200 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[3].name cluster1 daemon 3 arch ffc91200
[cluster2.tekno-soft.it:28732] [[14600,0],2] node[4].name master daemon 4 arch ffc91200 fc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[3].name cluster1 daemon 3 arch ffc91200
[master.tekno-soft.it:01113] [[14600,0],4] node[4].name master daemon 4 arch ffc91200
--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it could not find
an executable:

Executable: 1
Node: cluster4.tekno-soft.it

while attempting to start process rank 0.
--------------------------------------------------------------------------
[master.tekno-soft.it:01113] [[14600,0],4] orted_cmd: received exit
[master.tekno-soft.it:01113] [[14600,0],4] orted: finalizing
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted_cmd: received exit
[cluster2.tekno-soft.it:28732] [[14600,0],2] orted: finalizing
[master:01113] *** Process received signal ***
[cluster2:28732] *** Process received signal ***
[cluster2:28732] Signal: Segmentation fault (11)
[cluster2:28732] Signal code: Address not mapped (1)
[cluster2:28732] Failing at address: 0x2aaaab784af0
[master:01113] Signal: Segmentation fault (11)
[master:01113] Signal code: Address not mapped (1)
[master:01113] Failing at address: 0x2aaaab786af0
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
[cluster1.tekno-soft.it:02590] [[14600,0],3] routed:binomial: Connection to lifeline [[14600,0],0] lost
[roberto@cluster4 TestOpenMPI]$