Re: [OMPI users] problem with rankfile in openmpi-1.6.4rc2

2013-01-24 Thread Ralph Castain
I built the current 1.6 branch (which hasn't seen any changes that would impact 
this function) and was able to execute it just fine on a single socket machine. 
I then gave it your slot-list, which of course failed as I don't have two 
active sockets (one is empty), but it appeared to parse the list just fine.

From what I can tell, it looks like your linpc1 is unable to detect a second 
socket for some reason when given the slot_list argument. I'll have to try 
again tomorrow when I have access to a dual-socket machine.
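In the meantime, a quick OS-level sanity check may help narrow it down (just a 
sketch, assuming hwloc's lstopo is installed on linpc1 and /proc/cpuinfo is 
available):

lstopo --no-io                                # sockets/cores as hwloc sees them
grep "physical id" /proc/cpuinfo | sort -u    # one line per detected socket

If both sockets show up there, the hardware side is fine and the problem is in 
how the slot_list code maps them.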

On Jan 19, 2013, at 1:45 AM, Siegmar Gross 
 wrote:

> Hi
> 
> I have installed openmpi-1.6.4rc2 and still have a problem with my
> rankfile.
> 
> linpc1 rankfiles 113 ompi_info | grep "Open MPI:"
>Open MPI: 1.6.4rc2r27861
> 
> linpc1 rankfiles 114 cat rf_linpc1 
> rank 0=linpc1 slot=0:0-1,1:0-1
> 
> linpc1 rankfiles 115 mpiexec -report-bindings -np 1 \
>  -rf rf_linpc1 hostname
> 
> We were unable to successfully process/set the requested processor
> affinity settings:
> 
> Specified slot list: 0:0-1,1:0-1
> Error: Error
> 
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.
> 
> 
> mpiexec was unable to start the specified application as it
>  encountered an error:
> 
> Error name: Error
> Node: linpc1
> 
> when attempting to start process rank 0.
> 
> 
> 
> Everything works fine with the following command.
> 
> linpc1 rankfiles 116 mpiexec -report-bindings -np 1 -cpus-per-proc 4 \
>  -bycore -bind-to-core hostname
> [linpc1:20140] MCW rank 0 bound to socket 0[core 0-1]
>  socket 1[core 0-1]: [B B][B B]
> linpc1
> 
> 
> I would be grateful if somebody could fix the problem. Thank you very
> much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> 




Re: [OMPI users] Error when attempting to run LAMMPS on Centos 6.2 with OpenMPI

2013-01-24 Thread Ralph Castain
How was OMPI configured? What type of system are you running on (i.e., what is 
the launcher - ssh, lsf, slurm, ...)?
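A couple of quick checks usually narrow this down (the ompi_info path is taken 
from your mpirun line; the rest is only a suggestion):

which mpirun
/usr/lib64/openmpi/bin/ompi_info | grep -i configured
/usr/lib64/openmpi/bin/ompi_info | grep " plm"

The first makes sure the mpirun you invoke belongs to the same install as the 
libraries being loaded; the last lists the launcher (plm) components that were 
actually built - "orte_plm_base_select failed ... Not found" generally means 
none of them could be selected in your environment.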


On Jan 24, 2013, at 6:35 PM, #YEO JINGJIE#  wrote:

> Dear users,
>  
> Maybe something went wrong as I was compiling OpenMPI, I am very new to 
> linux. When I try to run LAMMPS using the following command:
>  
> /usr/lib64/openmpi/bin/mpirun -n 16 /opt/lammps-21Jan13/lmp_linux < zigzag.in
>  
> I get the following errors:
>  
> 
> [NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> ess_hnp_module.c at line 194
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>   orte_plm_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> [NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
> runtime/orte_init.c at line 128
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>   orte_ess_set_name failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> [NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c 
> at line 616
>  
> Regards,
> Jingjie Yeo
> Ph.D. Student
> School of Mechanical and Aerospace Engineering
> Nanyang Technological University, Singapore



[OMPI users] Error when attempting to run LAMMPS on Centos 6.2 with OpenMPI

2013-01-24 Thread #YEO JINGJIE#
Dear users,



Maybe something went wrong as I was compiling OpenMPI; I am very new to Linux. 
When I try to run LAMMPS using the following command:



/usr/lib64/openmpi/bin/mpirun -n 16 /opt/lammps-21Jan13/lmp_linux < zigzag.in



I get the following errors:



[NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
ess_hnp_module.c at line 194
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
runtime/orte_init.c at line 128
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c 
at line 616



Regards,
Jingjie Yeo
Ph.D. Student
School of Mechanical and Aerospace Engineering
Nanyang Technological University, Singapore


Re: [OMPI users] mpivars.sh - Intel Fortran 13.1 conflict with OpenMPI 1.6.3

2013-01-24 Thread Tim Prince

On 01/24/2013 12:40 PM, Michael Kluskens wrote:

This is for reference and suggestions as this took me several hours to track down and the 
previous discussion on "mpivars.sh" failed to cover this point (nothing in the 
FAQ):

I successfully built and installed OpenMPI 1.6.3 using the following on Debian 
Linux:

./configure --prefix=/opt/openmpi/intel131 --disable-ipv6 
--with-mpi-f90-size=medium --with-f90-max-array-dim=4 --disable-vt 
F77=/opt/intel/composer_xe_2013.1.117/bin/intel64/ifort FC=/opt/
intel/composer_xe_2013.1.117/bin/intel64/ifort CXXFLAGS=-m64 CFLAGS=-m64 CC=gcc 
CXX=g++

(--disable-vt was required because of an error finding -lz, which I gave up on resolving.)

My .tcshrc file HAD the following:

set path = (/opt/openmpi/intel131/bin $path)
setenv LD_LIBRARY_PATH /opt/openmpi/intel131/lib:$LD_LIBRARY_PATH
setenv MANPATH /opt/openmpi/intel131/share/man:$MANPATH
alias mpirun "mpirun --prefix /opt/openmpi/intel131 "
source /opt/intel/composer_xe_2013.1.117/bin/compilervars.csh intel64

For years I have used these procedures on Debian Linux and OS X with earlier 
versions of OpenMPI and Intel Fortran.

However, at some point Intel Fortran started including "mpirt", including: 
/opt/intel/composer_xe_2013.1.117/mpirt/bin/intel64/mpirun

So even though I have the alias set for mpirun, I got the following error:


mpirun -V

.: 131: Can't open 
/opt/intel/composer_xe_2013.1.117/mpirt/bin/intel64/mpivars.sh

Part of the confusion is that OpenMPI source does include a reference to "mpivars" in 
"contrib/dist/linux/openmpi.spec"

The solution only occurred to me as I was writing this up: source the Intel setup first:

source /opt/intel/composer_xe_2013.1.117/bin/compilervars.csh intel64
set path = (/opt/openmpi/intel131/bin $path)
setenv LD_LIBRARY_PATH /opt/openmpi/intel131/lib:$LD_LIBRARY_PATH
setenv MANPATH /opt/openmpi/intel131/share/man:$MANPATH
alias mpirun "mpirun --prefix /opt/openmpi/intel131 "

Now I finally get:


mpirun -V

mpirun (Open MPI) 1.6.3

The MPI runtime should be in the redistributable for their MPI compiler, not in 
the base compiler.  The question is how much of 
/opt/intel/composer_xe_2013.1.117/mpirt I can eliminate safely, and whether I 
should (this is a multi-user machine where each user has their own Intel 
license, so I don't want to troubleshoot this in the future).



ifort mpirt is a run-time to support co-arrays, but not full MPI. This 
version of the compiler checks in its path setting scripts whether Intel 
MPI is already on LD_LIBRARY_PATH, and so there is a conditional setting 
of the internal mpivars.  I assume the co-array feature would be 
incompatible with OpenMPI and you would want to find a way to avoid any 
reference to that library, possibly by avoiding sourcing that part of 
ifort's compilervars.
If you want a response on this subject from the Intel support team, 
their HPC forum might be a place to bring it up: 
http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology


--
Tim Prince



[OMPI users] mpivars.sh - Intel Fortran 13.1 conflict with OpenMPI 1.6.3

2013-01-24 Thread Michael Kluskens
This is for reference and suggestions as this took me several hours to track 
down and the previous discussion on "mpivars.sh" failed to cover this point 
(nothing in the FAQ):

I successfully built and installed OpenMPI 1.6.3 using the following on Debian 
Linux:

./configure --prefix=/opt/openmpi/intel131 --disable-ipv6 
--with-mpi-f90-size=medium --with-f90-max-array-dim=4 --disable-vt 
F77=/opt/intel/composer_xe_2013.1.117/bin/intel64/ifort FC=/opt/
intel/composer_xe_2013.1.117/bin/intel64/ifort CXXFLAGS=-m64 CFLAGS=-m64 CC=gcc 
CXX=g++

(--disable-vt was required because of an error finding -lz, which I gave up on resolving.)

My .tcshrc file HAD the following:

set path = (/opt/openmpi/intel131/bin $path)
setenv LD_LIBRARY_PATH /opt/openmpi/intel131/lib:$LD_LIBRARY_PATH
setenv MANPATH /opt/openmpi/intel131/share/man:$MANPATH
alias mpirun "mpirun --prefix /opt/openmpi/intel131 "
source /opt/intel/composer_xe_2013.1.117/bin/compilervars.csh intel64

For years I have used these procedures on Debian Linux and OS X with earlier 
versions of OpenMPI and Intel Fortran.

However, at some point Intel Fortran started including "mpirt", including: 
/opt/intel/composer_xe_2013.1.117/mpirt/bin/intel64/mpirun

So even though I have the alias set for mpirun, I got the following error:

> mpirun -V
.: 131: Can't open 
/opt/intel/composer_xe_2013.1.117/mpirt/bin/intel64/mpivars.sh

Part of the confusion is that OpenMPI source does include a reference to 
"mpivars" in "contrib/dist/linux/openmpi.spec"

The solution only occurred to me as I was writing this up: source the Intel setup first:

source /opt/intel/composer_xe_2013.1.117/bin/compilervars.csh intel64
set path = (/opt/openmpi/intel131/bin $path)
setenv LD_LIBRARY_PATH /opt/openmpi/intel131/lib:$LD_LIBRARY_PATH
setenv MANPATH /opt/openmpi/intel131/share/man:$MANPATH
alias mpirun "mpirun --prefix /opt/openmpi/intel131 "

Now I finally get:

> mpirun -V
mpirun (Open MPI) 1.6.3

The MPI runtime should be in the redistributable for their MPI compiler, not in 
the base compiler.  The question is how much of 
/opt/intel/composer_xe_2013.1.117/mpirt I can eliminate safely, and whether I 
should (this is a multi-user machine where each user has their own Intel 
license, so I don't want to troubleshoot this in the future).




Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified

2013-01-24 Thread Ralph Castain
Sure - just add --with-openib=no --with-psm=no to your configure line and we'll 
ignore it.
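(If rebuilding isn't convenient, restricting the components at runtime should 
have the same effect - untested on your setup, but something like:

mpirun --mca mtl ^psm --mca btl tcp,sm,self ...

tells OMPI to skip the PSM MTL and use only the TCP and shared-memory BTLs.)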

On Jan 24, 2013, at 7:09 AM, Sabuj Pattanayek  wrote:

> ahha, with --display-allocation I'm getting :
> 
> mca: base: component_find: unable to open
> /sb/apps/openmpi/1.6.3/x86_64/lib/openmpi/mca_mtl_psm:
> libpsm_infinipath.so.1: cannot open shared object file: No such file
> or directory (ignored)
> 
> I think the system I compiled it on has different ib libs than the
> nodes. I'll need to recompile and then see if it runs, but is there
> anyway to get it to ignore IB and just use gigE? Not all of our nodes
> have IB and I just want to use any node.
> 
> On Thu, Jan 24, 2013 at 8:52 AM, Ralph Castain  wrote:
>> How did you configure OMPI? If you add --display-allocation to your cmd 
>> line, does it show all the nodes?
>> 
>> On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek  wrote:
>> 
>>> Hi,
>>> 
>>> I'm submitting a job through torque/PBS, the head node also runs the
>>> Moab scheduler, the .pbs file has this in the resources line :
>>> 
>>> #PBS -l nodes=2:ppn=4
>>> 
>>> I've also tried something like :
>>> 
>>> #PBS -l procs=56
>>> 
>>> and at the end of script I'm running :
>>> 
>>> mpirun -np 8 cat /dev/urandom > /dev/null
>>> 
>>> or
>>> 
>>> mpirun -np 56 cat /dev/urandom > /dev/null
>>> 
>>> ...depending on how many processors I requested. The job starts,
>>> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
>>> the cat's are piled onto the first node. Any idea how I can get this
>>> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
>>> without problems with mvapich and mpich2 on our cluster to launch jobs
>>> across multiple nodes.
>>> 
>>> Thanks,
>>> Sabuj




Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2013-01-24 Thread Number Cruncher
I've looked in more detail at the current two MPI_Alltoallv algorithms 
and wanted to raise a couple of ideas.


Firstly, the new default "pairwise" algorithm:
* There is no optimisation for sparse/empty messages, compared to the old 
basic "linear" algorithm.
* The attached "pairwise-nop" patch adds this optimisation, and on the test 
case I first described in this thread (thousands of small, sparse all-to-alls) 
it cuts runtime by approximately 30%.
* I think the upper bound on the loop counter for the pairwise exchange is 
off-by-one. As the comment notes, it starts "from 1 since local exhange [sic] 
is done"; but on the final iteration (step == size) both sendto and recvfrom 
reduce to rank, and self-exchange is already handled in earlier code (worked 
example below).
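To make that concrete (using the formulas in coll_tuned_alltoallv.c, 
sendto = (rank + step) % size and recvfrom = (rank + size - step) % size): with 
size = 4 and rank = 1, the unpatched loop's last iteration is step = 4, giving 
sendto = (1 + 4) % 4 = 1 and recvfrom = (1 + 4 - 4) % 4 = 1, i.e. rank 1 
exchanging with itself.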


The pairwise algorithm still kills performance on my gigabit Ethernet network. 
My message transmission time must be small compared to latency, and the forced 
synchronisation at each of the comm_size exchange steps introduces a minimum 
delay (single_link_latency * comm_size), i.e. the latency scales linearly with 
comm_size. The linear algorithm doesn't wait for each exchange, so its minimum 
latency is just a single transmit/receive.
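(For anyone else bitten by this in the meantime: the tuned collective component 
can force the old linear algorithm back on at runtime. I'm quoting the 
parameter names from memory, so please check them with 
"ompi_info --param coll tuned", but something like

mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm 1 ...

should select the basic linear implementation.)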


Which brings me to the second idea. The problem with the existing 
implementation of the linear algorithm is that the irecv/isend pattern is 
identical on all processes, meaning that every process starts by having to 
wait for process 0 to send to everyone, and every process finishes by waiting 
for rank (size-1) to send to everyone.


It seems preferable to at least post the send/recv requests in the same 
order as the pairwise algorithm. The attached "linear-alltoallv" patch 
implements this and, on my test case, shows a modest 5% improvement. 
I was wondering whether it would address the concerns which led to the switch 
of the default algorithm.


Simon
diff -r '--exclude=*~' -u openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_alltoallv.c openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_alltoallv.c
--- openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_alltoallv.c	2012-04-03 15:30:17.0 +0100
+++ openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_alltoallv.c	2013-01-24 15:12:13.299568308 +
@@ -70,7 +70,7 @@
 }

 /* Perform pairwise exchange starting from 1 since local exhange is done */
-for (step = 1; step < size + 1; step++) {
+for (step = 1; step < size; step++) {

 /* Determine sender and receiver for this step. */
 sendto  = (rank + step) % size;
diff -r '--exclude=*~' -u openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_util.c openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_util.c
--- openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_util.c	2012-04-03 15:30:17.0 +0100
+++ openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_util.c	2013-01-24 15:11:56.562118400 +
@@ -37,25 +37,31 @@
  ompi_status_public_t* status )

 { /* post receive first, then send, then waitall... should be fast (I hope) */
-int err, line = 0;
+int err, line = 0, nreq = 0;
 ompi_request_t* reqs[2];
 ompi_status_public_t statuses[2];

-/* post new irecv */
-err = MCA_PML_CALL(irecv( recvbuf, rcount, rdatatype, source, rtag, 
-  comm, &reqs[0]));
-if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
-
-/* send data to children */
-err = MCA_PML_CALL(isend( sendbuf, scount, sdatatype, dest, stag, 
-  MCA_PML_BASE_SEND_STANDARD, comm, &reqs[1]));
-if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+if (0 != rcount) {
+/* post new irecv */
+err = MCA_PML_CALL(irecv( recvbuf, rcount, rdatatype, source, rtag, 
+  comm, &reqs[nreq++]));
+if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+}

-err = ompi_request_wait_all( 2, reqs, statuses );
-if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler_waitall; }
+if (0 != scount) {
+/* send data to children */
+err = MCA_PML_CALL(isend( sendbuf, scount, sdatatype, dest, stag, 
+  MCA_PML_BASE_SEND_STANDARD, comm, &reqs[nreq++]));
+if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+}

-if (MPI_STATUS_IGNORE != status) {
-*status = statuses[0];
+if (0 != nreq) {
+err = ompi_request_wait_all( nreq, reqs, statuses );
+if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler_waitall; }
+
+if (MPI_STATUS_IGNORE != status) {
+*status = statuses[0];
+}
 }

 return (MPI_SUCCESS);
@@ -68,7 +74,7 @@
 if( MPI_ERR_IN_STATUS == err ) {
 /* At least we know he error was detected during the wait_all */
 int err_index = 0;
-if( MPI_SUCCESS != statuses[1].MPI_ERROR ) {
+if( nreq > 1 && MPI_SUCCESS != statuses[1].MPI_ERROR ) {
 err_index = 1;
 }
 if (MPI_STATUS_IGNORE != status) {
@@ -107,25 

Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified

2013-01-24 Thread Brock Palen
On Jan 24, 2013, at 10:10 AM, Sabuj Pattanayek wrote:

> or do i just need to compile two versions, one with IB and one without?

You should not need to; we have OMPI compiled for openib/psm and run that same 
install on psm-, tcp-, and verbs (openib)-based gear.

Do all the nodes assigned to your job have QLogic IB adaptors? Do they also have 
libpsm_infinipath installed on all of them?  That will be required.

Also, did you build your Open MPI with tm?  --with-tm=/usr/local/torque/  (or 
wherever the path to lib/libtorque.so is.)

With TM support, mpirun from OMPI will know how to find the CPUs assigned to 
your job by torque.  This is the better way; in a pinch you can also use 
mpirun -machinefile $PBS_NODEFILE -np 8  (see the sketch below).

But really, tm is better.
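A minimal sketch of what the job script can look like once OMPI is built with 
tm support (the resource line is from your script; everything else is an 
assumption - adjust to your install):

#PBS -l nodes=2:ppn=4
cd $PBS_O_WORKDIR
mpirun hostname

With tm, mpirun reads the torque allocation directly, so it should start one 
process per assigned slot (8 here) spread across both nodes without needing 
-np or -machinefile.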

Here is our build line for OMPI:

./configure --prefix=/home/software/rhel6/openmpi-1.6.3-mxm/intel-12.1 
--mandir=/home/software/rhel6/openmpi-1.6.3-mxm/intel-12.1/man 
--with-tm=/usr/local/torque --with-openib --with-psm 
--with-mxm=/home/software/rhel6/mxm/1.5 
--with-io-romio-flags=--with-file-system=testfs+ufs+lustre --disable-dlopen 
--enable-shared CC=icc CXX=icpc FC=ifort F77=ifort

We run torque with OMPI.

> 
> On Thu, Jan 24, 2013 at 9:09 AM, Sabuj Pattanayek  wrote:
>> ahha, with --display-allocation I'm getting :
>> 
>> mca: base: component_find: unable to open
>> /sb/apps/openmpi/1.6.3/x86_64/lib/openmpi/mca_mtl_psm:
>> libpsm_infinipath.so.1: cannot open shared object file: No such file
>> or directory (ignored)
>> 
>> I think the system I compiled it on has different ib libs than the
>> nodes. I'll need to recompile and then see if it runs, but is there
>> anyway to get it to ignore IB and just use gigE? Not all of our nodes
>> have IB and I just want to use any node.
>> 
>> On Thu, Jan 24, 2013 at 8:52 AM, Ralph Castain  wrote:
>>> How did you configure OMPI? If you add --display-allocation to your cmd 
>>> line, does it show all the nodes?
>>> 
>>> On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek  wrote:
>>> 
 Hi,
 
 I'm submitting a job through torque/PBS, the head node also runs the
 Moab scheduler, the .pbs file has this in the resources line :
 
 #PBS -l nodes=2:ppn=4
 
 I've also tried something like :
 
 #PBS -l procs=56
 
 and at the end of script I'm running :
 
 mpirun -np 8 cat /dev/urandom > /dev/null
 
 or
 
 mpirun -np 56 cat /dev/urandom > /dev/null
 
 ...depending on how many processors I requested. The job starts,
 $PBS_NODEFILE has the nodes that the job was assigned listed, but all
 the cat's are piled onto the first node. Any idea how I can get this
 to submit jobs across multiple nodes? Note, I have OSU mpiexec working
 without problems with mvapich and mpich2 on our cluster to launch jobs
 across multiple nodes.
 
 Thanks,
 Sabuj




Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified

2013-01-24 Thread Sabuj Pattanayek
or do i just need to compile two versions, one with IB and one without?

On Thu, Jan 24, 2013 at 9:09 AM, Sabuj Pattanayek  wrote:
> ahha, with --display-allocation I'm getting :
>
> mca: base: component_find: unable to open
> /sb/apps/openmpi/1.6.3/x86_64/lib/openmpi/mca_mtl_psm:
> libpsm_infinipath.so.1: cannot open shared object file: No such file
> or directory (ignored)
>
> I think the system I compiled it on has different ib libs than the
> nodes. I'll need to recompile and then see if it runs, but is there
> anyway to get it to ignore IB and just use gigE? Not all of our nodes
> have IB and I just want to use any node.
>
> On Thu, Jan 24, 2013 at 8:52 AM, Ralph Castain  wrote:
>> How did you configure OMPI? If you add --display-allocation to your cmd 
>> line, does it show all the nodes?
>>
>> On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek  wrote:
>>
>>> Hi,
>>>
>>> I'm submitting a job through torque/PBS, the head node also runs the
>>> Moab scheduler, the .pbs file has this in the resources line :
>>>
>>> #PBS -l nodes=2:ppn=4
>>>
>>> I've also tried something like :
>>>
>>> #PBS -l procs=56
>>>
>>> and at the end of script I'm running :
>>>
>>> mpirun -np 8 cat /dev/urandom > /dev/null
>>>
>>> or
>>>
>>> mpirun -np 56 cat /dev/urandom > /dev/null
>>>
>>> ...depending on how many processors I requested. The job starts,
>>> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
>>> the cat's are piled onto the first node. Any idea how I can get this
>>> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
>>> without problems with mvapich and mpich2 on our cluster to launch jobs
>>> across multiple nodes.
>>>
>>> Thanks,
>>> Sabuj


Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified

2013-01-24 Thread Sabuj Pattanayek
ahha, with --display-allocation I'm getting :

mca: base: component_find: unable to open
/sb/apps/openmpi/1.6.3/x86_64/lib/openmpi/mca_mtl_psm:
libpsm_infinipath.so.1: cannot open shared object file: No such file
or directory (ignored)

I think the system I compiled it on has different IB libs than the
nodes. I'll need to recompile and then see if it runs, but is there
any way to get it to ignore IB and just use gigE? Not all of our nodes
have IB and I just want to use any node.

On Thu, Jan 24, 2013 at 8:52 AM, Ralph Castain  wrote:
> How did you configure OMPI? If you add --display-allocation to your cmd line, 
> does it show all the nodes?
>
> On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek  wrote:
>
>> Hi,
>>
>> I'm submitting a job through torque/PBS, the head node also runs the
>> Moab scheduler, the .pbs file has this in the resources line :
>>
>> #PBS -l nodes=2:ppn=4
>>
>> I've also tried something like :
>>
>> #PBS -l procs=56
>>
>> and at the end of script I'm running :
>>
>> mpirun -np 8 cat /dev/urandom > /dev/null
>>
>> or
>>
>> mpirun -np 56 cat /dev/urandom > /dev/null
>>
>> ...depending on how many processors I requested. The job starts,
>> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
>> the cat's are piled onto the first node. Any idea how I can get this
>> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
>> without problems with mvapich and mpich2 on our cluster to launch jobs
>> across multiple nodes.
>>
>> Thanks,
>> Sabuj


Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified

2013-01-24 Thread Ralph Castain
How did you configure OMPI? If you add --display-allocation to your cmd line, 
does it show all the nodes?

On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek  wrote:

> Hi,
> 
> I'm submitting a job through torque/PBS, the head node also runs the
> Moab scheduler, the .pbs file has this in the resources line :
> 
> #PBS -l nodes=2:ppn=4
> 
> I've also tried something like :
> 
> #PBS -l procs=56
> 
> and at the end of script I'm running :
> 
> mpirun -np 8 cat /dev/urandom > /dev/null
> 
> or
> 
> mpirun -np 56 cat /dev/urandom > /dev/null
> 
> ...depending on how many processors I requested. The job starts,
> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
> the cat's are piled onto the first node. Any idea how I can get this
> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
> without problems with mvapich and mpich2 on our cluster to launch jobs
> across multiple nodes.
> 
> Thanks,
> Sabuj




[OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified

2013-01-24 Thread Sabuj Pattanayek
Hi,

I'm submitting a job through torque/PBS; the head node also runs the
Moab scheduler. The .pbs file has this in the resources line:

#PBS -l nodes=2:ppn=4

I've also tried something like :

#PBS -l procs=56

and at the end of script I'm running :

mpirun -np 8 cat /dev/urandom > /dev/null

or

mpirun -np 56 cat /dev/urandom > /dev/null

...depending on how many processors I requested. The job starts and
$PBS_NODEFILE lists the nodes the job was assigned, but all
the cats are piled onto the first node. Any idea how I can get this
to launch across multiple nodes? Note, I have OSU mpiexec working
without problems with mvapich and mpich2 on our cluster to launch jobs
across multiple nodes.

Thanks,
Sabuj