Re: [OMPI users] problem with rankfile in openmpi-1.6.4rc2
I built the current 1.6 branch (which hasn't seen any changes that would impact this function) and was able to execute it just fine on a single-socket machine. I then gave it your slot-list, which of course failed as I don't have two active sockets (one is empty), but it appeared to parse the list just fine.

From what I can tell, it looks like your linpc1 is unable to detect a second socket for some reason when given the slot_list argument. I'll have to try again tomorrow when I have access to a dual-socket machine.

On Jan 19, 2013, at 1:45 AM, Siegmar Gross wrote:

> Hi,
>
> I have installed openmpi-1.6.4rc2 and still have a problem with my
> rankfile.
>
> linpc1 rankfiles 113 ompi_info | grep "Open MPI:"
>   Open MPI: 1.6.4rc2r27861
>
> linpc1 rankfiles 114 cat rf_linpc1
> rank 0=linpc1 slot=0:0-1,1:0-1
>
> linpc1 rankfiles 115 mpiexec -report-bindings -np 1 \
>   -rf rf_linpc1 hostname
>
> We were unable to successfully process/set the requested processor
> affinity settings:
>
>   Specified slot list: 0:0-1,1:0-1
>   Error: Error
>
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.
>
> mpiexec was unable to start the specified application as it
> encountered an error:
>
>   Error name: Error
>   Node: linpc1
>
> when attempting to start process rank 0.
>
> Everything works fine with the following command.
>
> linpc1 rankfiles 116 mpiexec -report-bindings -np 1 -cpus-per-proc 4 \
>   -bycore -bind-to-core hostname
> [linpc1:20140] MCW rank 0 bound to socket 0[core 0-1]
>   socket 1[core 0-1]: [B B][B B]
> linpc1
>
> I would be grateful if somebody could fix the problem. Thank you very
> much for any help in advance.
>
> Kind regards
>
> Siegmar
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Error when attempting to run LAMMPS on Centos 6.2 with OpenMPI
How was OMPI configured? What type of system are you running on (i.e., what is the launcher - ssh, lsf, slurm, ...)?

On Jan 24, 2013, at 6:35 PM, #YEO JINGJIE# wrote:

> Dear users,
>
> Maybe something went wrong as I was compiling OpenMPI; I am very new to
> Linux. When I try to run LAMMPS using the following command:
>
> /usr/lib64/openmpi/bin/mpirun -n 16 /opt/lammps-21Jan13/lmp_linux < zigzag.in
>
> I get the following errors:
>
> [NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ess_hnp_module.c at line 194
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_plm_base_select failed
> --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> [NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> runtime/orte_init.c at line 128
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_ess_set_name failed
> --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> [NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c
> at line 616
>
> Regards,
> Jingjie Yeo
> Ph.D. Student
> School of Mechanical and Aerospace Engineering
> Nanyang Technological University, Singapore
[OMPI users] Error when attempting to run LAMMPS on Centos 6.2 with OpenMPI
Dear users,

Maybe something went wrong as I was compiling OpenMPI; I am very new to Linux. When I try to run LAMMPS using the following command:

/usr/lib64/openmpi/bin/mpirun -n 16 /opt/lammps-21Jan13/lmp_linux < zigzag.in

I get the following errors:

[NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 194
--
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

orte_plm_base_select failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 128
--
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

orte_ess_set_name failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[NTU-2:28895] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 616

Regards,
Jingjie Yeo
Ph.D. Student
School of Mechanical and Aerospace Engineering
Nanyang Technological University, Singapore
Re: [OMPI users] mpivars.sh - Intel Fortran 13.1 conflict with OpenMPI 1.6.3
On 01/24/2013 12:40 PM, Michael Kluskens wrote:
> This is for reference and suggestions, as this took me several hours to
> track down and the previous discussion on "mpivars.sh" failed to cover
> this point (nothing in the FAQ).
>
> I successfully built and installed OpenMPI 1.6.3 using the following on
> Debian Linux:
>
> ./configure --prefix=/opt/openmpi/intel131 --disable-ipv6 \
>   --with-mpi-f90-size=medium --with-f90-max-array-dim=4 --disable-vt \
>   F77=/opt/intel/composer_xe_2013.1.117/bin/intel64/ifort \
>   FC=/opt/intel/composer_xe_2013.1.117/bin/intel64/ifort \
>   CXXFLAGS=-m64 CFLAGS=-m64 CC=gcc CXX=g++
>
> (--disable-vt was required because of an error finding -lz, which I gave
> up on.)
>
> My .tcshrc file HAD the following:
>
> set path = (/opt/openmpi/intel131/bin $path)
> setenv LD_LIBRARY_PATH /opt/openmpi/intel131/lib:$LD_LIBRARY_PATH
> setenv MANPATH /opt/openmpi/intel131/share/man:$MANPATH
> alias mpirun "mpirun --prefix /opt/openmpi/intel131 "
> source /opt/intel/composer_xe_2013.1.117/bin/compilervars.csh intel64
>
> For years I have used these procedures on Debian Linux and OS X with
> earlier versions of OpenMPI and Intel Fortran.
>
> However, at some point Intel Fortran started including "mpirt", including:
>
> /opt/intel/composer_xe_2013.1.117/mpirt/bin/intel64/mpirun
>
> So even though I have the alias set for mpirun, I got the following error:
>
> mpirun -V
> .: 131: Can't open /opt/intel/composer_xe_2013.1.117/mpirt/bin/intel64/mpivars.sh
>
> Part of the confusion is that the OpenMPI source does include a reference
> to "mpivars" in "contrib/dist/linux/openmpi.spec".
>
> The solution only occurred to me as I was writing this up: source the
> Intel setup first.
>
> source /opt/intel/composer_xe_2013.1.117/bin/compilervars.csh intel64
> set path = (/opt/openmpi/intel131/bin $path)
> setenv LD_LIBRARY_PATH /opt/openmpi/intel131/lib:$LD_LIBRARY_PATH
> setenv MANPATH /opt/openmpi/intel131/share/man:$MANPATH
> alias mpirun "mpirun --prefix /opt/openmpi/intel131 "
>
> Now I finally get:
>
> mpirun -V
> mpirun (Open MPI) 1.6.3
>
> The MPI runtime should be in the redistributable for their MPI compiler,
> not in the base compiler. The question is how much of
> /opt/intel/composer_xe_2013.1.117/mpirt I can eliminate safely, and
> should I? (This is a multi-user machine where each user has their own
> Intel license, so I don't wish to troubleshoot this in the future.)

ifort's mpirt is a run-time to support co-arrays, but not full MPI. This version of the compiler checks in its path-setting scripts whether Intel MPI is already on LD_LIBRARY_PATH, and so there is a conditional sourcing of the internal mpivars. I assume the co-array feature would be incompatible with OpenMPI, and you would want to find a way to avoid any reference to that library, possibly by avoiding sourcing that part of ifort's compilervars.

If you want a response on this subject from the Intel support team, their HPC forum might be a place to bring it up:
http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology

--
Tim Prince
[OMPI users] mpivars.sh - Intel Fortran 13.1 conflict with OpenMPI 1.6.3
This is for reference and suggestions, as this took me several hours to track down and the previous discussion on "mpivars.sh" failed to cover this point (nothing in the FAQ).

I successfully built and installed OpenMPI 1.6.3 using the following on Debian Linux:

./configure --prefix=/opt/openmpi/intel131 --disable-ipv6 \
  --with-mpi-f90-size=medium --with-f90-max-array-dim=4 --disable-vt \
  F77=/opt/intel/composer_xe_2013.1.117/bin/intel64/ifort \
  FC=/opt/intel/composer_xe_2013.1.117/bin/intel64/ifort \
  CXXFLAGS=-m64 CFLAGS=-m64 CC=gcc CXX=g++

(--disable-vt was required because of an error finding -lz, which I gave up on.)

My .tcshrc file HAD the following:

set path = (/opt/openmpi/intel131/bin $path)
setenv LD_LIBRARY_PATH /opt/openmpi/intel131/lib:$LD_LIBRARY_PATH
setenv MANPATH /opt/openmpi/intel131/share/man:$MANPATH
alias mpirun "mpirun --prefix /opt/openmpi/intel131 "
source /opt/intel/composer_xe_2013.1.117/bin/compilervars.csh intel64

For years I have used these procedures on Debian Linux and OS X with earlier versions of OpenMPI and Intel Fortran.

However, at some point Intel Fortran started including "mpirt", including:

/opt/intel/composer_xe_2013.1.117/mpirt/bin/intel64/mpirun

So even though I have the alias set for mpirun, I got the following error:

> mpirun -V
.: 131: Can't open /opt/intel/composer_xe_2013.1.117/mpirt/bin/intel64/mpivars.sh

Part of the confusion is that the OpenMPI source does include a reference to "mpivars" in "contrib/dist/linux/openmpi.spec".

The solution only occurred to me as I was writing this up: source the Intel setup first.

source /opt/intel/composer_xe_2013.1.117/bin/compilervars.csh intel64
set path = (/opt/openmpi/intel131/bin $path)
setenv LD_LIBRARY_PATH /opt/openmpi/intel131/lib:$LD_LIBRARY_PATH
setenv MANPATH /opt/openmpi/intel131/share/man:$MANPATH
alias mpirun "mpirun --prefix /opt/openmpi/intel131 "

Now I finally get:

> mpirun -V
mpirun (Open MPI) 1.6.3

The MPI runtime should be in the redistributable for their MPI compiler, not in the base compiler. The question is how much of /opt/intel/composer_xe_2013.1.117/mpirt I can eliminate safely, and should I? (This is a multi-user machine where each user has their own Intel license, so I don't wish to troubleshoot this in the future.)
Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified
Sure - just add --with-openib=no --with-psm=no to your config line and we'll ignore it.

On Jan 24, 2013, at 7:09 AM, Sabuj Pattanayek wrote:

> ahha, with --display-allocation I'm getting :
>
> mca: base: component_find: unable to open
> /sb/apps/openmpi/1.6.3/x86_64/lib/openmpi/mca_mtl_psm:
> libpsm_infinipath.so.1: cannot open shared object file: No such file
> or directory (ignored)
>
> I think the system I compiled it on has different ib libs than the
> nodes. I'll need to recompile and then see if it runs, but is there
> any way to get it to ignore IB and just use gigE? Not all of our nodes
> have IB and I just want to use any node.
>
> On Thu, Jan 24, 2013 at 8:52 AM, Ralph Castain wrote:
>> How did you configure OMPI? If you add --display-allocation to your cmd
>> line, does it show all the nodes?
>>
>> On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek wrote:
>>
>>> Hi,
>>>
>>> I'm submitting a job through torque/PBS; the head node also runs the
>>> Moab scheduler. The .pbs file has this in the resources line:
>>>
>>> #PBS -l nodes=2:ppn=4
>>>
>>> I've also tried something like:
>>>
>>> #PBS -l procs=56
>>>
>>> and at the end of the script I'm running:
>>>
>>> mpirun -np 8 cat /dev/urandom > /dev/null
>>>
>>> or
>>>
>>> mpirun -np 56 cat /dev/urandom > /dev/null
>>>
>>> ...depending on how many processors I requested. The job starts, and
>>> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
>>> the cat's are piled onto the first node. Any idea how I can get this
>>> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
>>> without problems with mvapich and mpich2 on our cluster to launch jobs
>>> across multiple nodes.
>>>
>>> Thanks,
>>> Sabuj
Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1
I've looked in more detail at the current two MPI_Alltoallv algorithms and wanted to raise a couple of ideas.

Firstly, the new default "pairwise" algorithm:

* There is no optimisation for sparse/empty messages, compared to the old basic "linear" algorithm.
* The attached "pairwise-nop" patch adds this optimisation and, on the test case I first described in this thread (1000's of small, sparse all-to-alls), this cuts runtime by approximately 30%.
* I think the upper bound on the loop counter for pairwise exchange is off by one. As the comment notes, "starting from 1 since local exhange [sic] is done"; but when step = size, the sendto/recvfrom both reduce to rank (self-exchange is already handled in earlier code).

The pairwise algorithm still kills performance on my gigabit ethernet network. My message transmission time must be small compared to latency, and the forced synchronisation steps introduce a minimum delay (single_link_latency * comm_size), i.e. latency scales linearly with comm_size. The linear algorithm doesn't wait for each exchange, so its minimum latency is just a single transmit/receive.

Which brings me to the second idea. The problem with the existing implementation of the linear algorithm is that the irecv/isend pattern is identical on all processes, meaning that every process starts by having to wait for process 0 to send to everyone, and every process finishes by waiting for rank (size-1) to send to everyone. It seems preferable to at least post the send/recv requests in the same order as the pairwise algorithm. The attached "linear-alltoallv" patch implements this and, on my test case, shows a modest 5% improvement. I was wondering if it would address the concerns which led to the switch of default algorithm.
Simon

diff -r '--exclude=*~' -u openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_alltoallv.c openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_alltoallv.c
--- openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_alltoallv.c	2012-04-03 15:30:17.0 +0100
+++ openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_alltoallv.c	2013-01-24 15:12:13.299568308 +
@@ -70,7 +70,7 @@
     }
 
     /* Perform pairwise exchange starting from 1 since local exhange is done */
-    for (step = 1; step < size + 1; step++) {
+    for (step = 1; step < size; step++) {
 
         /* Determine sender and receiver for this step. */
         sendto = (rank + step) % size;

diff -r '--exclude=*~' -u openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_util.c openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_util.c
--- openmpi-1.6.3/ompi/mca/coll/tuned/coll_tuned_util.c	2012-04-03 15:30:17.0 +0100
+++ openmpi-1.6.3.patched/ompi/mca/coll/tuned/coll_tuned_util.c	2013-01-24 15:11:56.562118400 +
@@ -37,25 +37,31 @@
                               ompi_status_public_t* status )
 {
     /* post receive first, then send, then waitall... should be fast (I hope) */
-    int err, line = 0;
+    int err, line = 0, nreq = 0;
     ompi_request_t* reqs[2];
     ompi_status_public_t statuses[2];
 
-    /* post new irecv */
-    err = MCA_PML_CALL(irecv( recvbuf, rcount, rdatatype, source, rtag,
-                              comm, &reqs[0]));
-    if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
-
-    /* send data to children */
-    err = MCA_PML_CALL(isend( sendbuf, scount, sdatatype, dest, stag,
-                              MCA_PML_BASE_SEND_STANDARD, comm, &reqs[1]));
-    if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+    if (0 != rcount) {
+        /* post new irecv */
+        err = MCA_PML_CALL(irecv( recvbuf, rcount, rdatatype, source, rtag,
+                                  comm, &reqs[nreq++]));
+        if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+    }
 
-    err = ompi_request_wait_all( 2, reqs, statuses );
-    if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler_waitall; }
+    if (0 != scount) {
+        /* send data to children */
+        err = MCA_PML_CALL(isend( sendbuf, scount, sdatatype, dest, stag,
+                                  MCA_PML_BASE_SEND_STANDARD, comm, &reqs[nreq++]));
+        if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler; }
+    }
 
-    if (MPI_STATUS_IGNORE != status) {
-        *status = statuses[0];
+    if (0 != nreq) {
+        err = ompi_request_wait_all( nreq, reqs, statuses );
+        if (err != MPI_SUCCESS) { line = __LINE__; goto error_handler_waitall; }
+
+        if (MPI_STATUS_IGNORE != status) {
+            *status = statuses[0];
+        }
     }
 
     return (MPI_SUCCESS);
@@ -68,7 +74,7 @@
     if( MPI_ERR_IN_STATUS == err ) {
         /* At least we know he error was detected during the wait_all */
         int err_index = 0;
-        if( MPI_SUCCESS != statuses[1].MPI_ERROR ) {
+        if( nreq > 1 && MPI_SUCCESS != statuses[1].MPI_ERROR ) {
             err_index = 1;
         }
         if (MPI_STATUS_IGNORE != status) {
@@ -107,25
Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified
On Jan 24, 2013, at 10:10 AM, Sabuj Pattanayek wrote:

> or do i just need to compile two versions, one with IB and one without?

You should not need to; we have OMPI compiled for openib/psm and run that same install on psm/tcp and verbs (openib) based gear.

Do all the nodes assigned to your job have QLogic IB adaptors, and do they all have libpsm_infinipath installed? This will be required.

Also, did you build your openmpi with tm?

--with-tm=/usr/local/torque/   (or wherever the path to lib/libtorque.so is)

With TM support, mpirun from OMPI will know how to find the CPUs assigned to your job by torque. This is the better way; in a pinch you can also use:

mpirun -machinefile $PBS_NODEFILE -np 8

But really tm is better. Here is our build line for OMPI:

./configure --prefix=/home/software/rhel6/openmpi-1.6.3-mxm/intel-12.1 --mandir=/home/software/rhel6/openmpi-1.6.3-mxm/intel-12.1/man --with-tm=/usr/local/torque --with-openib --with-psm --with-mxm=/home/software/rhel6/mxm/1.5 --with-io-romio-flags=--with-file-system=testfs+ufs+lustre --disable-dlopen --enable-shared CC=icc CXX=icpc FC=ifort F77=ifort

We run torque with OMPI.

> On Thu, Jan 24, 2013 at 9:09 AM, Sabuj Pattanayek wrote:
>> ahha, with --display-allocation I'm getting :
>>
>> mca: base: component_find: unable to open
>> /sb/apps/openmpi/1.6.3/x86_64/lib/openmpi/mca_mtl_psm:
>> libpsm_infinipath.so.1: cannot open shared object file: No such file
>> or directory (ignored)
>>
>> I think the system I compiled it on has different ib libs than the
>> nodes. I'll need to recompile and then see if it runs, but is there
>> any way to get it to ignore IB and just use gigE? Not all of our nodes
>> have IB and I just want to use any node.
>>
>> On Thu, Jan 24, 2013 at 8:52 AM, Ralph Castain wrote:
>>> How did you configure OMPI? If you add --display-allocation to your cmd
>>> line, does it show all the nodes?
>>>
>>> On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm submitting a job through torque/PBS; the head node also runs the
>>>> Moab scheduler. The .pbs file has this in the resources line:
>>>>
>>>> #PBS -l nodes=2:ppn=4
>>>>
>>>> I've also tried something like:
>>>>
>>>> #PBS -l procs=56
>>>>
>>>> and at the end of the script I'm running:
>>>>
>>>> mpirun -np 8 cat /dev/urandom > /dev/null
>>>>
>>>> or
>>>>
>>>> mpirun -np 56 cat /dev/urandom > /dev/null
>>>>
>>>> ...depending on how many processors I requested. The job starts, and
>>>> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
>>>> the cat's are piled onto the first node. Any idea how I can get this
>>>> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
>>>> without problems with mvapich and mpich2 on our cluster to launch jobs
>>>> across multiple nodes.
>>>>
>>>> Thanks,
>>>> Sabuj
Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified
or do i just need to compile two versions, one with IB and one without?

On Thu, Jan 24, 2013 at 9:09 AM, Sabuj Pattanayek wrote:

> ahha, with --display-allocation I'm getting :
>
> mca: base: component_find: unable to open
> /sb/apps/openmpi/1.6.3/x86_64/lib/openmpi/mca_mtl_psm:
> libpsm_infinipath.so.1: cannot open shared object file: No such file
> or directory (ignored)
>
> I think the system I compiled it on has different ib libs than the
> nodes. I'll need to recompile and then see if it runs, but is there
> any way to get it to ignore IB and just use gigE? Not all of our nodes
> have IB and I just want to use any node.
>
> On Thu, Jan 24, 2013 at 8:52 AM, Ralph Castain wrote:
>> How did you configure OMPI? If you add --display-allocation to your cmd
>> line, does it show all the nodes?
>>
>> On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek wrote:
>>
>>> Hi,
>>>
>>> I'm submitting a job through torque/PBS; the head node also runs the
>>> Moab scheduler. The .pbs file has this in the resources line:
>>>
>>> #PBS -l nodes=2:ppn=4
>>>
>>> I've also tried something like:
>>>
>>> #PBS -l procs=56
>>>
>>> and at the end of the script I'm running:
>>>
>>> mpirun -np 8 cat /dev/urandom > /dev/null
>>>
>>> or
>>>
>>> mpirun -np 56 cat /dev/urandom > /dev/null
>>>
>>> ...depending on how many processors I requested. The job starts, and
>>> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
>>> the cat's are piled onto the first node. Any idea how I can get this
>>> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
>>> without problems with mvapich and mpich2 on our cluster to launch jobs
>>> across multiple nodes.
>>>
>>> Thanks,
>>> Sabuj
Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified
ahha, with --display-allocation I'm getting :

mca: base: component_find: unable to open /sb/apps/openmpi/1.6.3/x86_64/lib/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)

I think the system I compiled it on has different ib libs than the nodes. I'll need to recompile and then see if it runs, but is there any way to get it to ignore IB and just use gigE? Not all of our nodes have IB and I just want to use any node.

On Thu, Jan 24, 2013 at 8:52 AM, Ralph Castain wrote:

> How did you configure OMPI? If you add --display-allocation to your cmd
> line, does it show all the nodes?
>
> On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek wrote:
>
>> Hi,
>>
>> I'm submitting a job through torque/PBS; the head node also runs the
>> Moab scheduler. The .pbs file has this in the resources line:
>>
>> #PBS -l nodes=2:ppn=4
>>
>> I've also tried something like:
>>
>> #PBS -l procs=56
>>
>> and at the end of the script I'm running:
>>
>> mpirun -np 8 cat /dev/urandom > /dev/null
>>
>> or
>>
>> mpirun -np 56 cat /dev/urandom > /dev/null
>>
>> ...depending on how many processors I requested. The job starts, and
>> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
>> the cat's are piled onto the first node. Any idea how I can get this
>> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
>> without problems with mvapich and mpich2 on our cluster to launch jobs
>> across multiple nodes.
>>
>> Thanks,
>> Sabuj
Re: [OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified
How did you configure OMPI? If you add --display-allocation to your cmd line, does it show all the nodes?

On Jan 24, 2013, at 6:34 AM, Sabuj Pattanayek wrote:

> Hi,
>
> I'm submitting a job through torque/PBS; the head node also runs the
> Moab scheduler. The .pbs file has this in the resources line:
>
> #PBS -l nodes=2:ppn=4
>
> I've also tried something like:
>
> #PBS -l procs=56
>
> and at the end of the script I'm running:
>
> mpirun -np 8 cat /dev/urandom > /dev/null
>
> or
>
> mpirun -np 56 cat /dev/urandom > /dev/null
>
> ...depending on how many processors I requested. The job starts, and
> $PBS_NODEFILE has the nodes that the job was assigned listed, but all
> the cat's are piled onto the first node. Any idea how I can get this
> to submit jobs across multiple nodes? Note, I have OSU mpiexec working
> without problems with mvapich and mpich2 on our cluster to launch jobs
> across multiple nodes.
>
> Thanks,
> Sabuj
[OMPI users] openmpi 1.6.3, job submitted through torque/PBS + Moab (scheduler) only land on one node even though multiple nodes/processors are specified
Hi,

I'm submitting a job through torque/PBS; the head node also runs the Moab scheduler. The .pbs file has this in the resources line:

#PBS -l nodes=2:ppn=4

I've also tried something like:

#PBS -l procs=56

and at the end of the script I'm running:

mpirun -np 8 cat /dev/urandom > /dev/null

or

mpirun -np 56 cat /dev/urandom > /dev/null

...depending on how many processors I requested. The job starts, and $PBS_NODEFILE has the nodes that the job was assigned listed, but all the cat's are piled onto the first node. Any idea how I can get this to submit jobs across multiple nodes? Note, I have OSU mpiexec working without problems with mvapich and mpich2 on our cluster to launch jobs across multiple nodes.

Thanks,
Sabuj