Re: [OMPI users] openMPI (multiple CPUs)
Sure, go buy a motherboard that you can plug two or more CPUs into. Otherwise it would be cheaper to buy another box.

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Rodolfo Chua
Sent: Friday, February 26, 2010 8:14 PM
To: Open MPI Users
Subject: [OMPI users] openMPI (multiple CPUs)

Hi all!

I'm running a code using Open MPI on a quad-core CPU. Though it is working, a quad-core is still not enough. Is there another way, aside from a server, of connecting 2 or 3 CPUs and running them in parallel with MPI?

Thanks.
Rodolfo
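In practice, running Open MPI across several separate boxes just means connecting them over a network (ordinary Ethernet is enough to start), setting up passwordless SSH between them, and listing the machines in a hostfile. A minimal sketch, with hypothetical hostnames, slot counts, and program name:

  $ cat hostfile
  # one machine per line; slots = how many ranks may be placed on it (names are made up)
  node1 slots=4
  node2 slots=4
  node3 slots=4

  $ mpirun --hostfile hostfile -np 12 ./my_mpi_program

Each "slots" value is the number of processes Open MPI may put on that machine; mpirun then spreads the requested ranks across the listed hosts.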
[OMPI users] openMPI (multiple CPUs)
Hi all!

I'm running a code using Open MPI on a quad-core CPU. Though it is working, a quad-core is still not enough. Is there another way, aside from a server, of connecting 2 or 3 CPUs and running them in parallel with MPI?

Thanks.
Rodolfo
Re: [OMPI users] Number of processes and spawn
I'm doing some tests and it seems that it is not able to do a spawn multiple with more than 128 nodes. It just hangs, with no error message.

What do you think? What can I try in order to understand the problem?

Thanks,
Federico

2010/2/26 Ralph Castain:
> No known limitations of which we are aware... the variables are all set to
> int32_t, so INT32_MAX would be the only limit I can imagine. In which case,
> you'll run out of memory long before you hit it.
>
> 2010/2/26 Federico Golfrè Andreasi:
>> Hi!
>>
>> Have you ever done some analysis to understand whether there is a limitation
>> in the number of nodes usable with Open MPI v1.4?
>> Also when using the functions MPI_Comm_spawn or MPI_Comm_spawn_multiple?
>>
>> Thanks,
>> Federico
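A small sketch of how one might gather more detail on where such a spawn stalls, assuming the usual Open MPI 1.4-era verbosity options (the test program name is a placeholder):

  # launch the spawning test with daemon debugging and verbose launch output
  $ mpirun -np 1 --debug-daemons --mca plm_base_verbose 5 ./spawn_multiple_test

The extra output usually shows whether the daemons on the remote nodes ever start, which narrows the hang down to the launch phase versus the later connection setup.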
Re: [OMPI users] OpenMPI with Sun Gridengine: Host key verification failed.
Hi,

On 26.02.2010 at 15:01, Tobias Müller wrote:

> I hope this list is the right place for my problem concerning OpenMPI with
> Sun Gridengine. I'm running OpenMPI with gridengine support:
>
>   MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>   MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
>
> on 4 Debian Lenny systems with Sun Gridengine 6.2.

Which update version of SGE?

> I've written a small test program which only displays the hostname of each
> MPI process it's running on, and start this via a simple script submitted
> with qsub:
>
> #!/bin/bash
> #$ -V
> ### number of processors and parallel environment
> #$ -pe sol 32
> ### Job name
> #$ -N "mpi_test"
> ### Start from current working directory
> #$ -cwd
> #$ -l arch=lx26-amd64
> /usr/bin/mpirun.openmpi --mca pls_gridengine_verbose 1 -v ~/grid/mpi_test/main
>
> The gridengine starts the jobs, but fails with "Host key verification
> failed." in the logfiles:
>
> local configuration sol2.XXX not defined - using global configuration
> Starting server daemon at host "sol2.XXX"
> Starting server daemon at host "sol3.XXX"
> Starting server daemon at host "sol4.XXX"
> Starting server daemon at host "sol1.XXX"
> Server daemon successfully started with task id "1.sol2"
> Server daemon successfully started with task id "1.sol4"
> Server daemon successfully started with task id "1.sol1"
> Server daemon successfully started with task id "1.sol3"
> Establishing /usr/bin/ssh session to host sol2.XXX ...
> Host key verification failed.
> /usr/bin/ssh exited with exit code 255
> reading exit code from shepherd ... 129
> [sol2:22892] ERROR: A daemon on node sol2.XXX failed to start as expected.
> [sol2:22892] ERROR: There may be more information available from
> [sol2:22892] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [sol2:22892] ERROR: If the problem persists, please restart the
> [sol2:22892] ERROR: Grid Engine PE job
> [sol2:22892] ERROR: The daemon exited unexpectedly with status 129.
> ...
>
> The host keys for all 4 solX hosts are in the known_hosts file of the user
> submitting the job and in the known_hosts file of root.

You set up SGE to use SSH as the remote startup method, and it's working otherwise for qrsh and qrsh with a command? Can you try the -builtin- method as an alternative?

-- Reuti

> Any hints why this could go wrong?
>
> Regards
> Tobias
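For reference, switching SGE 6.2 to the builtin startup method is done in the cluster configuration; a sketch, with the parameter names written from memory (verify them against sge_conf(5) on your installation):

  # edit the global cluster configuration
  $ qconf -mconf

  # then set the remote startup entries to the builtin method, roughly:
  #   rsh_command      builtin
  #   rsh_daemon       builtin
  #   rlogin_command   builtin
  #   rlogin_daemon    builtin

With the builtin method the shepherd no longer shells out to ssh at all, which sidesteps host key checking entirely.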
[OMPI users] OpenMPI with Sun Gridengine: Host key verification failed.
Hi everybody!

I hope this list is the right place for my problem concerning OpenMPI with Sun Gridengine. I'm running OpenMPI with gridengine support:

  MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
  MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)

on 4 Debian Lenny systems with Sun Gridengine 6.2.

I've written a small test program which only displays the hostname of each MPI process it's running on, and start this via a simple script submitted with qsub:

#!/bin/bash
#$ -V
### number of processors and parallel environment
#$ -pe sol 32
### Job name
#$ -N "mpi_test"
### Start from current working directory
#$ -cwd
#$ -l arch=lx26-amd64
/usr/bin/mpirun.openmpi --mca pls_gridengine_verbose 1 -v ~/grid/mpi_test/main

The gridengine starts the jobs, but fails with "Host key verification failed." in the logfiles:

local configuration sol2.XXX not defined - using global configuration
Starting server daemon at host "sol2.XXX"
Starting server daemon at host "sol3.XXX"
Starting server daemon at host "sol4.XXX"
Starting server daemon at host "sol1.XXX"
Server daemon successfully started with task id "1.sol2"
Server daemon successfully started with task id "1.sol4"
Server daemon successfully started with task id "1.sol1"
Server daemon successfully started with task id "1.sol3"
Establishing /usr/bin/ssh session to host sol2.XXX ...
Host key verification failed.
/usr/bin/ssh exited with exit code 255
reading exit code from shepherd ... 129
[sol2:22892] ERROR: A daemon on node sol2.XXX failed to start as expected.
[sol2:22892] ERROR: There may be more information available from
[sol2:22892] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[sol2:22892] ERROR: If the problem persists, please restart the
[sol2:22892] ERROR: Grid Engine PE job
[sol2:22892] ERROR: The daemon exited unexpectedly with status 129.
...

The host keys for all 4 solX hosts are in the known_hosts file of the user submitting the job and in the known_hosts file of root.

Any hints why this could go wrong?

Regards
Tobias
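Not from the thread itself, but a common way to rule out host-key problems for every account (including the one SGE's shepherd runs ssh under) is to put the keys into the system-wide known_hosts on each node. A sketch, using the hostnames from the post with the elided domain left off:

  # collect the host keys of all execution hosts into the global file (run as root on each node)
  $ ssh-keyscan sol1 sol2 sol3 sol4 >> /etc/ssh/ssh_known_hosts

  # alternatively, relax checking just for these hosts in ~/.ssh/config:
  #   Host sol*
  #       StrictHostKeyChecking no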
Re: [OMPI users] Number of processes and spawn
No known limitations of which we are aware... the variables are all set to int32_t, so INT32_MAX would be the only limit I can imagine. In which case, you'll run out of memory long before you hit it.

2010/2/26 Federico Golfrè Andreasi:
> Hi!
>
> Have you ever done some analysis to understand whether there is a limitation
> in the number of nodes usable with Open MPI v1.4?
> Also when using the functions MPI_Comm_spawn or MPI_Comm_spawn_multiple?
>
> Thanks,
> Federico
[OMPI users] Segfault in mca_odls_default.so with > ~100 processes.
I am trying to run an MPI code across 136 processes using an appfile (attached), since every process needs to be run with a host/process dependent parameter. This whole system works wonderfully for up to around 100 processes, but usually fails with a segfault, apparently in mca_odls_default.so, during initialization.

The attached appfile is an attempt at 136 processes. If I split the appfile into two, both halves will initialize OK and successfully pass an MPI_Barrier() (the program won't actually work without all 136 nodes, but I'm happy MPI is doing its job). Because both halves work, I think it has to be related to the number of processes, not a problem with a specific appfile entry or machine.

The cluster I am running it on has openmpi-1.3.3, but I have also installed 1.4.1 from the website in my home dir and that does the same (and is where the attached data comes from).

The actual segfault is:

[jac-11:12300] *** Process received signal ***
[jac-11:12300] Signal: Segmentation fault (11)
[jac-11:12300] Signal code: Address not mapped (1)
[jac-11:12300] Failing at address: 0x40
[jac-11:12300] [ 0] [0x74640c]
[jac-11:12300] [ 1] /home/oford/openmpi/lib/openmpi/mca_odls_default.so [0x8863d4]
[jac-11:12300] [ 2] /home/oford/openmpi/lib/libopen-rte.so.0 [0x76ffe9]
[jac-11:12300] [ 3] /home/oford/openmpi/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x2f6) [0x771b86]
[jac-11:12300] [ 4] /home/oford/openmpi/lib/libopen-pal.so.0 [0x5d6ba8]
[jac-11:12300] [ 5] /home/oford/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0x5d6e47]
[jac-11:12300] [ 6] /home/oford/openmpi/lib/libopen-pal.so.0(opal_progress+0xce) [0x5ca00e]
[jac-11:12300] [ 7] /home/oford/openmpi/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x355) [0x7815f5]
[jac-11:12300] [ 8] /home/oford/openmpi/lib/openmpi/mca_plm_rsh.so [0xc73d1b]
[jac-11:12300] [ 9] mpirun [0x804a8f0]
[jac-11:12300] [10] mpirun [0x8049ef6]
[jac-11:12300] [11] /lib/libc.so.6(__libc_start_main+0xe5) [0x1406e5]
[jac-11:12300] [12] mpirun [0x8049e41]
[jac-11:12300] *** End of error message ***
Segmentation fault

The full output with '-d' and the config.log from the build of 1.4.1 are also attached. I don't know the exact setup of the network, but I can ask our sysadmin anything else that might help.
Thanks in advance,

Oliver Ford
Culham Centre for Fusion Energy
Oxford, UK

-np 1 --host jac-11 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-39-80115 Y 11 11 133 debug
-np 1 --host jac-5 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-26-81244 N 11 11 133 debug
-np 1 --host batch-020 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-122-75993 N 11 11 133 debug
-np 1 --host batch-037 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-157-15286 N 11 11 133 debug
-np 1 --host batch-042 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-114-89529 N 11 11 133 debug
-np 1 --host jac-9 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-35-90257 N 11 11 133 debug
-np 1 --host batch-020 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-151-56062 N 11 11 133 debug
-np 1 --host batch-004 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-16-2723 N 11 11 133 debug
-np 1 --host batch-003 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-156-65790 N 11 11 133 debug
-np 1 --host jac-11 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-198-63239 N 11 11 133 debug
-np 1 --host batch-046 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-105-12753 N 11 11 133 debug
-np 1 --host batch-015 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-12-25631 N 11 11 133 debug
-np 1 --host jac-12 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-196-35421 N 11 11 133 debug
-np 1 --host batch-045 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-103-98246 N 11 11 133 debug
-np 1 --host batch-006 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-142-44009 N 11 11 133 debug
-np 1 --host batch-044 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-117-30325 N 11 11 133 debug
-np 1 --host batch-003 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-143-21739 N 11 11 133 debug
-np 1 --host batch-042 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-112-64293 N 11 11 133 debug
-np 1 --host batch-041 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-57-11238 N 11 11 133 debug
-np 1 --host batch-025 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-170-80831 N 11 11 133 debug
-np 1 --host jac-6 /home/oford/java/mcServer/lgidmath/lgidmath
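Purely as a generic first check, not a diagnosis from the thread: launches that only fail past roughly a hundred processes are sometimes hitting a per-process resource limit on the machine running mpirun, so it can be worth looking at the open-file-descriptor limit before starting the appfile. A sketch (the appfile name is a placeholder and the limit value is only an example):

  # show the current open-file limit in this shell
  $ ulimit -n

  # raise it for this session, then launch the appfile as before
  $ ulimit -n 4096
  $ mpirun --app my_appfile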
[OMPI users] Number of processes and spawn
Hi!

Have you ever done some analysis to understand whether there is a limitation in the number of nodes usable with Open MPI v1.4? Also when using the functions MPI_Comm_spawn or MPI_Comm_spawn_multiple?

Thanks,
Federico