Re: [OMPI users] can't run MPI job under SGE
I will try building a newer ompi version in my home directory, but that
will take me some time. qconf is not available to me on any machine. It
produces that same error wherever I am able to try it:

   denied: host "dblade65.cs.brown.edu" is neither submit nor admin host

Here is what it produces when I have a sysadmin run it:

   $ qconf -sconf | egrep "(command|daemon)"
   qlogin_command               /sysvol/sge.test/bin/qlogin-wrapper
   qlogin_daemon                /sysvol/sge.test/bin/grid-sshd -i
   rlogin_command               builtin
   rlogin_daemon                builtin
   rsh_command                  builtin
   rsh_daemon                   builtin

Does that suggest anything?

Thanks!

-David Laidlaw


On Thu, Jul 25, 2019 at 5:21 PM Reuti wrote:
>
> Am 25.07.2019 um 23:00 schrieb David Laidlaw:
>
> > Here is most of the command output when run on a grid machine:
> >
> > dblade65.dhl(101) mpiexec --version
> > mpiexec (OpenRTE) 2.0.2
>
> This is somewhat old. I would suggest installing a fresh one. You can
> even compile one in your home directory and install it e.g. in
> $HOME/local/openmpi-3.1.4-gcc_7.4.0-shared (by --prefix=…intended path…)
> and then use this for all your jobs (adjust for your version of gcc).
> In your ~/.bash_profile and the job script:
>
> DEFAULT_MANPATH="$(manpath -q)"
> MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
> export PATH="$MY_OMPI/bin:$PATH"
> export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
> export MANPATH="$MY_OMPI/share/man${DEFAULT_MANPATH:+:$DEFAULT_MANPATH}"
> unset MY_OMPI
> unset DEFAULT_MANPATH
>
> Essentially there is no conflict with the already installed version.
>
> > dblade65.dhl(102) ompi_info | grep grid
> >          MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
> > dblade65.dhl(103) c
> > denied: host "dblade65.cs.brown.edu" is neither submit nor admin host
> > dblade65.dhl(104)
>
> On a node it's ok this way.
>
> > Does that suggest anything?
> >
> > qconf is restricted to sysadmins, which I am not.
>
> What error is output if you try it anyway? Usually viewing the
> configuration is always possible.
>
> > I would note that we are running Debian stretch on the cluster
> > machines. On some of our other (non-grid) machines, running Debian
> > buster, the output is:
> >
> > cslab3d.dhl(101) mpiexec --version
> > mpiexec (OpenRTE) 3.1.3
> > Report bugs to http://www.open-mpi.org/community/help/
> > cslab3d.dhl(102) ompi_info | grep grid
> >          MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v3.1.3)
>
> If you compile on such a machine and intend to run the result in the
> cluster it won't work, as the versions don't match. Hence the suggestion
> above: use a personal version in your $HOME for compiling and running
> the applications.
>
> Side note: Open MPI binds processes to cores by default. If more than
> one MPI job is running on a node, you will have to use `mpiexec
> --bind-to none …`, as otherwise all jobs on that node will use core 0
> upwards.
>
> -- Reuti
>
> > Thanks!
> >
> > -David Laidlaw
> >
> > On Thu, Jul 25, 2019 at 2:13 PM Reuti wrote:
> >
> > > Am 25.07.2019 um 18:59 schrieb David Laidlaw via users:
> > >
> > > > I have been trying to run some MPI jobs under SGE for almost a
> > > > year without success. What seems like a very simple test program
> > > > fails; the ingredients of it are below. Any suggestions on any
> > > > piece of the test, reasons for failure, requests for additional
> > > > info, configuration thoughts, etc. would be much appreciated. I
> > > > suspect the linkage between SGE and MPI, but can't identify the
> > > > problem. We do have SGE support built into MPI.
> > > > We also have the SGE parallel environment (PE) set up as
> > > > described in several places on the web.
> > > >
> > > > Many thanks for any input!
> > >
> > > Did you compile Open MPI on your own or was it delivered with the
> > > Linux distribution? That it tries to use `ssh` is quite strange, as
> > > nowadays Open MPI and others have built-in support to detect that
> > > they are running under the control of a queuing system. It should
> > > use `qrsh` in your case.
> > >
> > > What does:
> > >
> > > mpiexec --version
> > > ompi_info | grep grid
> > >
> > > reveal? What does:
> > >
> > > qconf -sconf | egrep "(command|daemon)"
> > >
> > > show?
> > >
> > > -- Reuti
> > >
> > > > Cheers,
> > > >
> > > > -David Laidlaw
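Reuti's build-it-in-$HOME suggestion above usually amounts to something
like the following. This is only a sketch: the version number, compiler
tag, and install path are placeholders rather than values from the
thread, and it assumes the SGE client environment is visible on the
build host.

   # fetch and unpack a current Open MPI release (version is illustrative)
   wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.4.tar.bz2
   tar xjf openmpi-3.1.4.tar.bz2
   cd openmpi-3.1.4

   # install under $HOME so no root access is needed; --with-sge compiles
   # in the gridengine support, so mpirun can start remote daemons via
   # qrsh -inherit instead of ssh
   ./configure --prefix=$HOME/local/openmpi-3.1.4-gcc_7.4.0-shared --with-sge
   make -j4 && make install

   # quick check that the gridengine components made it in
   $HOME/local/openmpi-3.1.4-gcc_7.4.0-shared/bin/ompi_info | grep gridengine

With the PATH/LD_LIBRARY_PATH settings from Reuti's reply in place, both
the compilation of hello and the mpirun inside the job script come from
the same tree, which also avoids the stretch/buster version mismatch
Reuti points out above.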
[OMPI users] can't run MPI job under SGE
I have been trying to run some MPI jobs under SGE for almost a year
without success. What seems like a very simple test program fails; the
ingredients of it are below. Any suggestions on any piece of the test,
reasons for failure, requests for additional info, configuration
thoughts, etc. would be much appreciated. I suspect the linkage between
SGE and MPI, but can't identify the problem. We do have SGE support
built into MPI. We also have the SGE parallel environment (PE) set up as
described in several places on the web.

Many thanks for any input!

Cheers,

-David Laidlaw


Here is how I submit the job:

   /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme


Here is what is in runme:

   #!/bin/bash
   #$ -cwd
   #$ -pe orte_fill 1
   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello


Here is hello.c:

   #include <mpi.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>

   int main(int argc, char** argv) {
       // Initialize the MPI environment
       MPI_Init(NULL, NULL);

       // Get the number of processes
       int world_size;
       MPI_Comm_size(MPI_COMM_WORLD, &world_size);

       // Get the rank of the process
       int world_rank;
       MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

       // Get the name of the processor
       char processor_name[MPI_MAX_PROCESSOR_NAME];
       int name_len;
       MPI_Get_processor_name(processor_name, &name_len);

       // Print off a hello world message
       printf("Hello world from processor %s, rank %d out of %d processors\n",
              processor_name, world_rank, world_size);
       // system("printenv");

       sleep(15); // sleep for 15 seconds

       // Finalize the MPI environment.
       MPI_Finalize();
   }


This command will build it:

   mpicc hello.c -o hello


Running produces the following:

   /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
   dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
   --------------------------------------------------------------------------
   ORTE was unable to reliably start one or more daemons.
   This usually is caused by:

   * not finding the required libraries and/or binaries on
     one or more nodes. Please check your PATH and LD_LIBRARY_PATH
     settings, or configure OMPI with --enable-orterun-prefix-by-default

   * lack of authority to execute on one or more specified nodes.
     Please verify your allocation and authorities.

   * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
     Please check with your sys admin to determine the correct location to use.

   * compilation of the orted with dynamic libraries when static are required
     (e.g., on Cray). Please check your configure cmd line and consider using
     one of the contrib/platform definitions for your system type.

   * an inability to create a connection back to mpirun due to a lack of
     common network interfaces and/or no route found between them. Please
     check network connectivity (including firewalls and network routing
     requirements).
   --------------------------------------------------------------------------

and:
   [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
       /usr/bin/ssh set path = ( /usr/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 )
       set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH
       /usr/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH
       /usr/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set
       OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH
       /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH
       /usr/lib:$DYLD_LIBRARY_PATH ; /usr/bin/orted --hnp-topo-sig
       0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jobid
       "2446000128" -mca ess_base_vpid "" -mca ess_base_num_procs "2" -mca
       orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791" --mca
       plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca
       pmix "^s1,s2,cray"

   ssh_exchange_identification: read: Connection reset by peer

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
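As a point of comparison for the "PE set up as described in several
places on the web" remark, a tight-integration parallel environment for
Open MPI is usually defined roughly as below. This is a generic sketch;
the actual orte_fill definition on this cluster is not shown anywhere in
the thread, and the slot count is arbitrary.

   $ qconf -sp orte_fill
   pe_name            orte_fill
   slots              999
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /bin/true
   stop_proc_args     /bin/true
   allocation_rule    $fill_up
   control_slaves     TRUE
   job_is_first_task  FALSE
   urgency_slots      min
   accounting_summary TRUE

The entries that matter for Open MPI are control_slaves TRUE (so the
orted daemons may be launched through qrsh -inherit) and
job_is_first_task FALSE; if those are set and mpirun still prints an ssh
template like the one above, the Open MPI build itself is the more
likely culprit.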
Re: [OMPI users] can't run MPI job under SGE
Here is most of the command output when run on a grid machine:

   dblade65.dhl(101) mpiexec --version
   mpiexec (OpenRTE) 2.0.2
   dblade65.dhl(102) ompi_info | grep grid
            MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
   dblade65.dhl(103) c
   denied: host "dblade65.cs.brown.edu" is neither submit nor admin host
   dblade65.dhl(104)

Does that suggest anything?

qconf is restricted to sysadmins, which I am not.

I would note that we are running Debian stretch on the cluster machines.
On some of our other (non-grid) machines, running Debian buster, the
output is:

   cslab3d.dhl(101) mpiexec --version
   mpiexec (OpenRTE) 3.1.3
   Report bugs to http://www.open-mpi.org/community/help/
   cslab3d.dhl(102) ompi_info | grep grid
            MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v3.1.3)

Thanks!

-David Laidlaw


On Thu, Jul 25, 2019 at 2:13 PM Reuti wrote:
>
> Am 25.07.2019 um 18:59 schrieb David Laidlaw via users:
>
> > I have been trying to run some MPI jobs under SGE for almost a year
> > without success. What seems like a very simple test program fails;
> > the ingredients of it are below. Any suggestions on any piece of the
> > test, reasons for failure, requests for additional info, configuration
> > thoughts, etc. would be much appreciated. I suspect the linkage
> > between SGE and MPI, but can't identify the problem. We do have SGE
> > support built into MPI. We also have the SGE parallel environment (PE)
> > set up as described in several places on the web.
> >
> > Many thanks for any input!
>
> Did you compile Open MPI on your own or was it delivered with the Linux
> distribution? That it tries to use `ssh` is quite strange, as nowadays
> Open MPI and others have built-in support to detect that they are
> running under the control of a queuing system. It should use `qrsh` in
> your case.
>
> What does:
>
> mpiexec --version
> ompi_info | grep grid
>
> reveal? What does:
>
> qconf -sconf | egrep "(command|daemon)"
>
> show?
>
> -- Reuti
>
> > Cheers,
> >
> > -David Laidlaw
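One quick way to confirm, from inside a queued job, that the gridengine
integration Reuti describes can actually trigger is to print the SGE
variables the job inherits before calling mpirun. The following is purely
a debugging sketch of a variant of runme; the variable names are standard
SGE ones rather than anything quoted in this thread.

   #!/bin/bash
   #$ -cwd
   #$ -pe orte_fill 2

   # Open MPI's gridengine support looks for variables such as these;
   # if they are missing or empty, mpirun will not treat the job as an
   # SGE job and can fall back to the ssh launcher seen in the original
   # error output
   echo "SGE_ROOT    = $SGE_ROOT"
   echo "JOB_ID      = $JOB_ID"
   echo "PE          = $PE"
   echo "NSLOTS      = $NSLOTS"
   echo "PE_HOSTFILE = $PE_HOSTFILE"
   cat "$PE_HOSTFILE"

   # raise the verbosity to watch which launcher (qrsh vs. ssh) is chosen
   mpirun --mca plm_base_verbose 10 -display-allocation ./hello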
Re: [OMPI users] can't run MPI job under SGE
Thanks for the input, John. Here are some responses (inline):

On Thu, Jul 25, 2019 at 1:21 PM John Hearns via users <
users@lists.open-mpi.org> wrote:

> Have you checked your ssh between nodes?

ssh is not allowed between nodes, but my understanding is that processes
should be getting set up and run by SGE, since it handles the queuing.

> Also how is your Path set up?

It should be using the same startup scripts as I use on other machines
within our dept, since the filesystem and home directories are shared
across both grid and non-grid machines. In any case, I have put in fully
qualified pathnames for everything that I start up.

> A. Construct a hosts file and mpirun by hand

I have looked at the hosts file, and it seems correct. I don't know that
I can pass a hosts file to mpirun directly, since SGE queues things and
determines what hosts will be assigned.

> B. Use modules rather than .bashrc files

Hmm. I don't really understand this one. (I know what both are, but I
don't understand the problem that would be solved by converting to
modules.)

> C. Slurm

I don't run the grid/cluster, so I can't choose the queuing tools that
are run. There are plans to migrate to slurm at some point in the future,
but that doesn't help me now...

Thanks!

-David Laidlaw


> On Thu, 25 Jul 2019, 18:00 David Laidlaw via users, <
> users@lists.open-mpi.org> wrote:
>
>> I have been trying to run some MPI jobs under SGE for almost a year
>> without success. What seems like a very simple test program fails; the
>> ingredients of it are below. Any suggestions on any piece of the test,
>> reasons for failure, requests for additional info, configuration
>> thoughts, etc. would be much appreciated. I suspect the linkage between
>> SGE and MPI, but can't identify the problem. We do have SGE support
>> built into MPI. We also have the SGE parallel environment (PE) set up
>> as described in several places on the web.
>>
>> Many thanks for any input!
>>
>> Cheers,
>>
>> -David Laidlaw
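For reference, John's suggestion (A) of bypassing SGE entirely would look
roughly like the following, run interactively on a machine from which the
listed hosts accept logins. The host names are made up for illustration,
and since ssh between the cluster nodes is blocked here, this test only
makes sense on the non-grid machines or wherever such logins are allowed.

   $ cat hosts
   node01 slots=2
   node02 slots=2

   # launch by hand, outside the queuing system; if this works while the
   # qsub route fails, the problem lies in the SGE/Open MPI integration
   $ mpirun --hostfile hosts -np 4 ./hello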