Re: [OMPI users] openMPI (multiple CPUs)

2010-02-26 Thread Trent Creekmore
Sure, go buy a motherboard that you can plug 2 or more CPUs into.

Otherwise it would be cheaper to buy another box.
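
For a concrete picture of the second-box route: Open MPI's ssh launcher can already spread ranks across ordinary machines on the same network. A minimal sketch, assuming two hypothetical four-core hosts called node1 and node2, passwordless SSH between them, and the same Open MPI installation path on both:

# hostfile
node1 slots=4
node2 slots=4

# launch 8 ranks spread across both machines
mpirun --hostfile hostfile -np 8 ./my_mpi_program

The binary (here the made-up ./my_mpi_program) has to be reachable at the same path on every host, e.g. via a shared home directory.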



From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Rodolfo Chua
Sent: Friday, February 26, 2010 8:14 PM
To: Open MPI Users
Subject: [OMPI users] openMPI (multiple CPUs)



Hi all! 

I'm running a code using openMPI on a quad-core CPU. Though it is working, a
quad-core is still not enough.
Is there another way, aside from a server, of connecting 2 or 3 CPUs and
running them in parallel with MPI?

Thanks.
Rodolfo





[OMPI users] openMPI (multiple CPUs)

2010-02-26 Thread Rodolfo Chua
Hi all!

I'm running a code using openMPI on a quad-core CPU. Though it is working, a
quad-core is still not enough.
Is there another way, aside from a server, of connecting 2 or 3 CPUs and
running them in parallel with MPI?

Thanks.
Rodolfo





Re: [OMPI users] Number of processes and spawn

2010-02-26 Thread Federico Golfrè Andreasi
I'm doing some tests and it seems that MPI_Comm_spawn_multiple does not work
with more than 128 nodes.

It just hangs, with no error message.

What do you think? What can I try in order to understand the problem?

Thanks,

Federico
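
For reference, a minimal C sketch of the MPI_Comm_spawn_multiple call under discussion; the child binaries worker_a and worker_b and the process counts are made up for illustration and this is not Federico's actual test code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Two hypothetical child executables, 64 copies of each (128 children). */
    char    *cmds[2]     = { "./worker_a", "./worker_b" };
    int      maxprocs[2] = { 64, 64 };
    MPI_Info infos[2]    = { MPI_INFO_NULL, MPI_INFO_NULL };
    int      errcodes[128];
    MPI_Comm children;

    /* MPI_ARGVS_NULL: none of the child commands take extra arguments. */
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                            0, MPI_COMM_WORLD, &children, errcodes);
    printf("spawn_multiple returned\n");

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}

Whether the hang happens while launching the child daemons or while setting up the intercommunicator might be narrowed down by adding launcher verbosity (e.g. --mca plm_base_verbose) on the mpirun command line.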




2010/2/26 Ralph Castain 

> No known limitations of which we are aware...the variables are all set to
> int32_t, so INT32_MAX would be the only limit I can imagine. In which case,
> you'll run out of memory long before you hit it.
>
>
> 2010/2/26 Federico Golfrè Andreasi 
>
>> HI !
>>
>> have you ever done any analysis to understand whether there is a limitation in
>> the number of nodes usable with OpenMPI-v1.4?
>> Also when using the functions MPI_Comm_spawn or MPI_Comm_spawn_multiple.
>>
>> Thanks,
>>Federico
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] OpenMPI with Sun Gridengine: Host key verification failed.

2010-02-26 Thread Reuti

Hi,

On 26.02.2010 at 15:01, Tobias Müller wrote:


I hope this list is the right place for my problem concerning OpenMPI
with Sun Gridengine. I'm running OpenMPI with gridengine support:

MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)

on 4 Debian Lenny systems with Sun Gridengine 6.2. I've written a small


Which exact update release of SGE 6.2 is this?


test program which only displays the hostname of each MPI process it's
running on, and I start it via a simple script submitted with qsub:

#!/bin/bash
#$ -V
### number of processors and parallel environment
#$ -pe sol 32
### Job name
#$ -N "mpi_test"
### Start from current working directory
#$ -cwd
#$ -l arch=lx26-amd64
/usr/bin/mpirun.openmpi --mca pls_gridengine_verbose 1 -v ~/grid/mpi_test/main


Grid Engine starts the jobs, but they fail with "Host key verification
failed." in the log files:

local configuration sol2.XXX not defined - using global configuration
Starting server daemon at host "sol2.XXX"
Starting server daemon at host "sol3.XXX"
Starting server daemon at host "sol4.XXX"
Starting server daemon at host "sol1.XXX"
Server daemon successfully started with task id "1.sol2"
Server daemon successfully started with task id "1.sol4"
Server daemon successfully started with task id "1.sol1"
Server daemon successfully started with task id "1.sol3"
Establishing /usr/bin/ssh session to host sol2.XXX ...
Host key verification failed.
/usr/bin/ssh exited with exit code 255
reading exit code from shepherd ... 129
[sol2:22892] ERROR: A daemon on node sol2.XXX failed to start as expected.
[sol2:22892] ERROR: There may be more information available from
[sol2:22892] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[sol2:22892] ERROR: If the problem persists, please restart the
[sol2:22892] ERROR: Grid Engine PE job
[sol2:22892] ERROR: The daemon exited unexpectedly with status 129.
...

The host keys for all 4 solX hosts are in the known_hosts file of the
user submitting the job and in the known_hosts file of root.


Did you set up SGE to use SSH as the remote startup method, and is it
otherwise working for qrsh and for qrsh with a command? Can you try the
-builtin- method as an alternative?


-- Reuti
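
For readers unfamiliar with the -builtin- method mentioned above: in SGE 6.2 the remote startup mechanism is chosen in the global cluster configuration, and a sketch of what switching it away from ssh typically looks like (edited by the Grid Engine admin; exact values depend on the site) is:

# qconf -mconf        (global cluster configuration)
rsh_command       builtin
rsh_daemon        builtin
rlogin_command    builtin
rlogin_daemon     builtin
qlogin_command    builtin
qlogin_daemon     builtin

With the builtin method no ssh host keys are involved in starting the remote tasks, which should sidestep the verification error shown above.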



Any hints why this could go wrong?

Regards
  Tobias
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





[OMPI users] OpenMPI with Sun Gridengine: Host key verification failed.

2010-02-26 Thread Tobias Müller
Hi everybody!

I hope this list is the right place for my problem concerning OpenMPI
with Sun Gridengine. I'm running OpenMPI with gridengine support:

MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)

on 4 Debian Lenny systems with Sun Gridengine 6.2. I've written a small
test program which only displays the hostname of each MPI process it's
running on, and I start it via a simple script submitted with qsub:

#!/bin/bash
#$ -V
### number of processors and parallel environment
#$ -pe sol 32
### Job name
#$ -N "mpi_test"
### Start from current working directory
#$ -cwd
#$ -l arch=lx26-amd64
/usr/bin/mpirun.openmpi --mca pls_gridengine_verbose 1 -v ~/grid/mpi_test/main

Grid Engine starts the jobs, but they fail with "Host key verification
failed." in the log files:

local configuration sol2.XXX not defined - using global configuration
Starting server daemon at host "sol2.XXX"
Starting server daemon at host "sol3.XXX"
Starting server daemon at host "sol4.XXX"
Starting server daemon at host "sol1.XXX"
Server daemon successfully started with task id "1.sol2"
Server daemon successfully started with task id "1.sol4"
Server daemon successfully started with task id "1.sol1"
Server daemon successfully started with task id "1.sol3"
Establishing /usr/bin/ssh session to host sol2.XXX ...
Host key verification failed.
/usr/bin/ssh exited with exit code 255
reading exit code from shepherd ... 129
[sol2:22892] ERROR: A daemon on node sol2.XXX failed to start as expected.
[sol2:22892] ERROR: There may be more information available from
[sol2:22892] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[sol2:22892] ERROR: If the problem persists, please restart the
[sol2:22892] ERROR: Grid Engine PE job
[sol2:22892] ERROR: The daemon exited unexpectedly with status 129.
...

The host keys for all 4 solX hosts are in the known_hosts file of the
user submitting the job and in the known_hosts file of root.

Any hints why this could go wrong?

Regards
  Tobias
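
One frequent cause of this symptom, even with populated known_hosts files, is that the keys are stored under a different name (short hostname vs. fully qualified name vs. IP) than the one Grid Engine hands to ssh. A hedged sketch of pre-populating the submitting user's known_hosts with the exact names shown in the log (sol1.XXX ... sol4.XXX are the placeholders from above):

# collect the host keys under the exact names Grid Engine uses
ssh-keyscan sol1.XXX sol2.XXX sol3.XXX sol4.XXX >> ~/.ssh/known_hosts

# sanity check: this must succeed without any interactive prompt
ssh sol2.XXX hostname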


Re: [OMPI users] Number of processes and spawn

2010-02-26 Thread Ralph Castain
No known limitations of which we are aware...the variables are all set to
int32_t, so INT32_MAX would be the only limit I can imagine. In which case,
you'll run out of memory long before you hit it.


2010/2/26 Federico Golfrè Andreasi 

> HI !
>
> have you ever done any analysis to understand whether there is a limitation in
> the number of nodes usable with OpenMPI-v1.4?
> Also when using the functions MPI_Comm_spawn or MPI_Comm_spawn_multiple.
>
> Thanks,
>Federico
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Segfault in mca_odls_default.so with > ~100 process.

2010-02-26 Thread Oliver Ford


I am trying to run an MPI code across 136 processes using an appfile
(attached), since every process needs to be run with a host/process
dependent parameter.

This whole system works wonderfully for up to around 100 processes, but
above that it usually fails with a segfault, apparently in
mca_odls_default.so, during initialization.
The attached appfile is an attempt at 136 processes. If I split the
appfile into two, both halves will initialize OK and successfully pass
an MPI_Barrier() (the program won't actually work without all 136 nodes,
but I'm happy MPI is doing its job). Because both halves work, I think
it has to be related to the number of processes - not a problem with a
specific appfile entry or machine.

The cluster I am running it on has openmpi-1.3.3, but I have also
installed 1.4.1 from the website in my home dir and it does the same
(the attached data comes from that build).

The actual segfault is:
[jac-11:12300] *** Process received signal ***
[jac-11:12300] Signal: Segmentation fault (11)
[jac-11:12300] Signal code: Address not mapped (1)
[jac-11:12300] Failing at address: 0x40
[jac-11:12300] [ 0] [0x74640c]
[jac-11:12300] [ 1] /home/oford/openmpi/lib/openmpi/mca_odls_default.so
[0x8863d4]
[jac-11:12300] [ 2] /home/oford/openmpi/lib/libopen-rte.so.0 [0x76ffe9]
[jac-11:12300] [ 3]
/home/oford/openmpi/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x2f6)
[0x771b86]
[jac-11:12300] [ 4] /home/oford/openmpi/lib/libopen-pal.so.0 [0x5d6ba8]
[jac-11:12300] [ 5]
/home/oford/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x27) [0x5d6e47]
[jac-11:12300] [ 6]
/home/oford/openmpi/lib/libopen-pal.so.0(opal_progress+0xce) [0x5ca00e]
[jac-11:12300] [ 7]
/home/oford/openmpi/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x355)
[0x7815f5]
[jac-11:12300] [ 8] /home/oford/openmpi/lib/openmpi/mca_plm_rsh.so
[0xc73d1b]
[jac-11:12300] [ 9] mpirun [0x804a8f0]
[jac-11:12300] [10] mpirun [0x8049ef6]
[jac-11:12300] [11] /lib/libc.so.6(__libc_start_main+0xe5) [0x1406e5]
[jac-11:12300] [12] mpirun [0x8049e41]
[jac-11:12300] *** End of error message ***
Segmentation fault


The full output with '-d' and the config.log from the build of 1.4.1 are
also attached.

I don't know the exact setup of the network, but I can ask our sysadmin
anything else that might help.

Thanks in advance,


Oliver Ford

Culham Centre for Fusion Energy
Oxford, UK
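
Not a diagnosis of this particular crash, but two things commonly checked when ssh-launched jobs start failing somewhere past a hundred processes are the file-descriptor limit on the node running mpirun and the number of ssh launches kept in flight at once; a sketch using Open MPI 1.3/1.4 MCA parameters (with 'appfile' standing for the attached application file):

# descriptors available to the mpirun process; each launched daemon uses several
ulimit -n

# throttle concurrent ssh launches and turn up launcher debug output
mpirun --mca plm_rsh_num_concurrent 32 --mca plm_base_verbose 5 --app appfile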




-np 1 --host jac-11 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-39-80115 Y 11 11 133 debug
-np 1 --host jac-5 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-26-81244 N 11 11 133 debug
-np 1 --host batch-020 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-122-75993 N 11 11 133 debug
-np 1 --host batch-037 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-157-15286 N 11 11 133 debug
-np 1 --host batch-042 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-114-89529 N 11 11 133 debug
-np 1 --host jac-9 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-35-90257 N 11 11 133 debug
-np 1 --host batch-020 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-151-56062 N 11 11 133 debug
-np 1 --host batch-004 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-16-2723 N 11 11 133 debug
-np 1 --host batch-003 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-156-65790 N 11 11 133 debug
-np 1 --host jac-11 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-198-63239 N 11 11 133 debug
-np 1 --host batch-046 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-105-12753 N 11 11 133 debug
-np 1 --host batch-015 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-12-25631 N 11 11 133 debug
-np 1 --host jac-12 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-196-35421 N 11 11 133 debug
-np 1 --host batch-045 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-103-98246 N 11 11 133 debug
-np 1 --host batch-006 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-142-44009 N 11 11 133 debug
-np 1 --host batch-044 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-117-30325 N 11 11 133 debug
-np 1 --host batch-003 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-143-21739 N 11 11 133 debug
-np 1 --host batch-042 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-112-64293 N 11 11 133 debug
-np 1 --host batch-041 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-57-11238 N 11 11 133 debug
-np 1 --host batch-025 /home/oford/java/mcServer/lgidmath/lgidmath /tmp/lgiStaging/mats-94280x5887-170-80831 N 11 11 133 debug
-np 1 --host jac-6 /home/oford/java/mcServer/lgidmath/lgidmath

[OMPI users] Number of processes and spawn

2010-02-26 Thread Federico Golfrè Andreasi
HI !

have you ever done any analysis to understand whether there is a limitation in
the number of nodes usable with OpenMPI-v1.4?
Also when using the functions MPI_Comm_spawn or MPI_Comm_spawn_multiple.

Thanks,
   Federico