Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread Reuti via users


Am 25.07.2019 um 23:00 schrieb David Laidlaw:

> Here is most of the command output when run on a grid machine:
> 
> dblade65.dhl(101) mpiexec --version
> mpiexec (OpenRTE) 2.0.2

This is somewhat old. I would suggest installing a fresh one. You can even 
compile one in your home directory and install it, e.g., in 
$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared (via --prefix=…intended path…) and 
then use that installation for all your jobs (adjust for your version of gcc). 
In your ~/.bash_profile and the job script:

DEFAULT_MANPATH="$(manpath -q)"
MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
export PATH="$MY_OMPI/bin:$PATH"
export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export MANPATH="$MY_OMPI/share/man${DEFAULT_MANPATH:+:$DEFAULT_MANPATH}"
unset MY_OMPI
unset DEFAULT_MANPATH

Set up this way, there is no conflict with the already installed version.
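A minimal build along those lines could look like the following sketch (version,
compiler, and paths are just examples; --with-sge asks configure to build the
gridengine support):

  tar xjf openmpi-3.1.4.tar.bz2        # tarball from www.open-mpi.org
  cd openmpi-3.1.4
  ./configure --prefix=$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared --with-sge
  make -j4
  make install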


> dblade65.dhl(102) ompi_info | grep grid
>  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component 
> v2.0.2)
> dblade65.dhl(103) c
> denied: host "dblade65.cs.brown.edu" is neither submit nor admin host
> dblade65.dhl(104) 

On a compute node this is expected; such hosts are usually neither submit nor admin hosts.


> Does that suggest anything?
> 
> qconf is restricted to sysadmins, which I am not.

What error do you get if you try it anyway? Viewing the configuration is usually 
possible even for ordinary users.
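If it does work, these read-only queries would already tell a lot (the PE name
orte_fill is taken from your job script; the output will of course differ on
your cluster):

  qconf -sconf | egrep "(command|daemon)"   # shows the rsh/rlogin/qlogin command and daemon settings
  qconf -sp orte_fill                       # control_slaves should be TRUE for tight integration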


> I would note that we are running debian stretch on the cluster machines.  On 
> some of our other (non-grid) machines, running debian buster, the output is:
> 
> cslab3d.dhl(101) mpiexec --version
> mpiexec (OpenRTE) 3.1.3
> Report bugs to http://www.open-mpi.org/community/help/
> cslab3d.dhl(102) ompi_info | grep grid
>  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component 
> v3.1.3)

If you compile on such a machine and intend to run the result in the cluster, it 
won't work, as the versions don't match. Hence the suggestion above: use a 
personal installation in your $HOME for both compiling and running the 
applications.

Side note: Open MPI binds processes to cores by default. If more than one MPI 
job is running on a node, use `mpiexec --bind-to none …`, as otherwise all jobs 
on that node will pile up on core 0 upwards.
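In the job script that would look roughly like this (a sketch; $NSLOTS is the
slot count SGE grants the job):

  mpiexec --bind-to none -np $NSLOTS ./hello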

-- Reuti


> Thanks!
> 
> -David Laidlaw
> 
> On Thu, Jul 25, 2019 at 2:13 PM Reuti  wrote:
> 
> Am 25.07.2019 um 18:59 schrieb David Laidlaw via users:
> 
> > I have been trying to run some MPI jobs under SGE for almost a year without 
> > success.  What seems like a very simple test program fails; the ingredients 
> > of it are below.  Any suggestions on any piece of the test, reasons for 
> > failure, requests for additional info, configuration thoughts, etc. would 
> be much appreciated.  I suspect the linkage between SGE and MPI, but can't 
> identify the problem.  We do have SGE support built into MPI.  We also have 
> > the SGE parallel environment (PE) set up as described in several places on 
> > the web.
> > 
> > Many thanks for any input!
> 
> Did you compile Open MPI on your own or was it delivered with the Linux 
> distribution? That it tries to use `ssh` is quite strange, as nowadays Open 
> MPI and others have built-in support to detect that they are running under 
> the control of a queuing system. It should use `qrsh` in your case.
> 
> What does:
> 
> mpiexec --version
> ompi_info | grep grid
> 
> reveal? What does:
> 
> qconf -sconf | egrep "(command|daemon)"
> 
> show?
> 
> -- Reuti
> 
> 
> > Cheers,
> > 
> > -David Laidlaw
> > 
> > 
> > 
> > 
> > Here is how I submit the job:
> > 
> >/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> > 
> > 
> > Here is what is in runme:
> > 
> >   #!/bin/bash
> >   #$ -cwd
> >   #$ -pe orte_fill 1
> >   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-
> > allocation ./hello
> > 
> > 
> > Here is hello.c:
> > 
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > 
> > int main(int argc, char** argv) {
> > // Initialize the MPI environment
> > MPI_Init(NULL, NULL);
> > 
> > // Get the number of processes
> > int world_size;
> > MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> > 
> > // Get the rank of the process
> > int world_rank;
> > MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> > 
> > // Get the name of the processor
> > char processor_name[MPI_MAX_PROCESSOR_NAME];
> > int name_len;
> > MPI_Get_processor_name(processor_name, &name_len);
> > 
> > // Print off a hello world message
> > printf("Hello world from processor %s, rank %d out of %d processors\n",
> >processor_name, world_rank, world_size);
> > // system("printenv");
> > 
> > sleep(15); // sleep for 15 seconds
> > 
> > // Finalize the MPI environment.
> > MPI_Finalize();
> > }
> > 
> > 
> > This command will build it:
> > 
> >  mpicc hello.c -o hello
> > 
> > 
> > Running produces the following:
> > 
> > /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> > dblade01.cs.brown.edu 1 s

Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread David Laidlaw via users
Here is most of the command output when run on a grid machine:


dblade65.dhl(101) mpiexec --version

mpiexec (OpenRTE) 2.0.2

dblade65.dhl(102) ompi_info | grep grid

 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component
v2.0.2)

dblade65.dhl(103) c

denied: host "dblade65.cs.brown.edu" is neither submit nor admin host

dblade65.dhl(104)


Does that suggest anything?

qconf is restricted to sysadmins, which I am not.

I would note that we are running debian stretch on the cluster machines.
On some of our other (non-grid) machines, running debian buster, the output
is:

cslab3d.dhl(101) mpiexec --version

mpiexec (OpenRTE) 3.1.3

Report bugs to http://www.open-mpi.org/community/help/

cslab3d.dhl(102) ompi_info | grep grid

 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component
v3.1.3)


Thanks!

-David Laidlaw

On Thu, Jul 25, 2019 at 2:13 PM Reuti  wrote:

>
> Am 25.07.2019 um 18:59 schrieb David Laidlaw via users:
>
> > I have been trying to run some MPI jobs under SGE for almost a year
> without success.  What seems like a very simple test program fails; the
> ingredients of it are below.  Any suggestions on any piece of the test,
> reasons for failure, requests for additional info, configuration thoughts,
> etc. would be much appreciated.  I suspect the linkage between SGE and MPI,
> but can't identify the problem.  We do have SGE support built into MPI.  We
> also have the SGE parallel environment (PE) set up as described in several
> places on the web.
> >
> > Many thanks for any input!
>
> Did you compile Open MPI on your own or was it delivered with the Linux
> distribution? That it tries to use `ssh` is quite strange, as nowadays Open
> MPI and others have built-in support to detect that they are running under
> the control of a queuing system. It should use `qrsh` in your case.
>
> What does:
>
> mpiexec --version
> ompi_info | grep grid
>
> reveal? What does:
>
> qconf -sconf | egrep "(command|daemon)"
>
> show?
>
> -- Reuti
>
>
> > Cheers,
> >
> > -David Laidlaw
> >
> >
> >
> >
> > Here is how I submit the job:
> >
> >/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> >
> >
> > Here is what is in runme:
> >
> >   #!/bin/bash
> >   #$ -cwd
> >   #$ -pe orte_fill 1
> >   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-
> > allocation ./hello
> >
> >
> > Here is hello.c:
> >
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> >
> > int main(int argc, char** argv) {
> > // Initialize the MPI environment
> > MPI_Init(NULL, NULL);
> >
> > // Get the number of processes
> > int world_size;
> > MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> >
> > // Get the rank of the process
> > int world_rank;
> > MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> >
> > // Get the name of the processor
> > char processor_name[MPI_MAX_PROCESSOR_NAME];
> > int name_len;
> > MPI_Get_processor_name(processor_name, &name_len);
> >
> > // Print off a hello world message
> > printf("Hello world from processor %s, rank %d out of %d
> processors\n",
> >processor_name, world_rank, world_size);
> > // system("printenv");
> >
> > sleep(15); // sleep for 15 seconds
> >
> > // Finalize the MPI environment.
> > MPI_Finalize();
> > }
> >
> >
> > This command will build it:
> >
> >  mpicc hello.c -o hello
> >
> >
> > Running produces the following:
> >
> > /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> > dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> >
> --
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp
> (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to
> use.
> >
> > *  compilation of the orted with dynamic libraries when static are
> required
> >   (e.g., on Cray). Please check your configure cmd line and consider
> using
> >   one of the contrib/platform definitions for your system type.
> >
> > * an inability to create a connection back to mpirun due to a
> >   lack of common network interfaces and/or no route found between
> >   them. Please check network connectivity (including firewalls
> >   and network routing requirements).
> >
> --
> >
> >
> > and:
> >
> > [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
> > /usr/bin/ssh  set path = ( /usr/bin $path ) ; if (
> $?
> > LD_LIBRARY_PAT

Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread David Laidlaw via users
Thanks for the input, John.  Here are some responses (inline):

On Thu, Jul 25, 2019 at 1:21 PM John Hearns via users <
users@lists.open-mpi.org> wrote:

> Have you checked your ssh between nodes?
>

ssh is not allowed between nodes, but my understanding is that processes
should be getting set up and run by SGE, since it handles the queuing.


> Also how is your Path set up?
>

It should be using the same startup scripts as I use on other machines
within our dept, since the filesystem and home directories are shared
across both grid and non-grid machines.  In any case, I have put in fully
qualified pathnames for everything that I start up.


> A. Construct a hosts file and mpirun by hand
>

I have looked at the hosts file, and it seems correct.  I don't know that I
can pass a hosts file to mpirun directly, since SGE queues things and
determines what hosts will be assigned.
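As a side note, a purely local run outside SGE needs neither a hosts file nor
ssh, so a sketch like the following would at least separate MPI problems from
SGE problems:

  mpirun -np 2 ./hello   # single-node test, no queuing system and no ssh involved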


>
> B. Use modules rather than .bashrc files
>

Hmm.  I don't really understand this one.  (I know what both are, but I
don't understand the problem that would be solved by converting to
modules.)


> C. Slurm
>

I don't run the grid/cluster, so I can't choose the queuing tools that are
run.  There are plans to migrate to slurm at some point in the future, but
that doesn't help me now...

Thanks!

-David Laidlaw


>
> On Thu, 25 Jul 2019, 18:00 David Laidlaw via users, <
> users@lists.open-mpi.org> wrote:
>
>> I have been trying to run some MPI jobs under SGE for almost a year
>> without success.  What seems like a very simple test program fails; the
>> ingredients of it are below.  Any suggestions on any piece of the test,
>> reasons for failure, requests for additional info, configuration thoughts,
>> etc. would be much appreciated.  I suspect the linkage between SGE and MPI,
>> but can't identify the problem.  We do have SGE support built into MPI.  We
>> also have the SGE parallel environment (PE) set up as described in several
>> places on the web.
>>
>> Many thanks for any input!
>>
>> Cheers,
>>
>> -David Laidlaw
>>
>>
>>
>>
>> Here is how I submit the job:
>>
>>/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
>>
>>
>> Here is what is in runme:
>>
>>   #!/bin/bash
>>   #$ -cwd
>>   #$ -pe orte_fill 1
>>   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-
>> allocation ./hello
>>
>>
>> Here is hello.c:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>>
>> int main(int argc, char** argv) {
>> // Initialize the MPI environment
>> MPI_Init(NULL, NULL);
>>
>> // Get the number of processes
>> int world_size;
>> MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>
>> // Get the rank of the process
>> int world_rank;
>> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>
>> // Get the name of the processor
>> char processor_name[MPI_MAX_PROCESSOR_NAME];
>> int name_len;
>> MPI_Get_processor_name(processor_name, &name_len);
>>
>> // Print off a hello world message
>> printf("Hello world from processor %s, rank %d out of %d
>> processors\n",
>>processor_name, world_rank, world_size);
>> // system("printenv");
>>
>> sleep(15); // sleep for 15 seconds
>>
>> // Finalize the MPI environment.
>> MPI_Finalize();
>> }
>>
>>
>> This command will build it:
>>
>>  mpicc hello.c -o hello
>>
>>
>> Running produces the following:
>>
>> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
>> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
>> --
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp
>> (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location to
>> use.
>>
>> *  compilation of the orted with dynamic libraries when static are
>> required
>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>   one of the contrib/platform definitions for your system type.
>>
>> * an inability to create a connection back to mpirun due to a
>>   lack of common network interfaces and/or no route found between
>>   them. Please check network connectivity (including firewalls
>>   and network routing requirements).
>> --
>>
>>
>> and:
>>
>> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
>> /usr/bin/ssh  set path = ( /usr/bin $path ) ; if (
>> $?
>> LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH
>>  == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_hav

Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread Reuti via users


Am 25.07.2019 um 18:59 schrieb David Laidlaw via users:

> I have been trying to run some MPI jobs under SGE for almost a year without 
> success.  What seems like a very simple test program fails; the ingredients 
> of it are below.  Any suggestions on any piece of the test, reasons for 
> failure, requests for additional info, configuration thoughts, etc. would be 
> much appreciated.  I suspect the linkage between SGE and MPI, but can't 
> identify the problem.  We do have SGE support built into MPI.  We also have 
> the SGE parallel environment (PE) set up as described in several places on 
> the web.
> 
> Many thanks for any input!

Did you compile Open MPI on your own or was it delivered with the Linux 
distribution? That it tries to use `ssh` is quite strange, as nowadays Open MPI 
and others have built-in support to detect that they are running under the 
control of a queuing system. It should use `qrsh` in your case.

What does:

mpiexec --version
ompi_info | grep grid

reveal? What does:

qconf -sconf | egrep "(command|daemon)"

show?
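For comparison, a parallel environment set up for tight integration typically
looks roughly like the following; the names and values are only illustrative,
not your cluster's actual settings:

  $ qconf -sp orte_fill
  pe_name            orte_fill
  slots              9999
  allocation_rule    $fill_up
  control_slaves     TRUE
  job_is_first_task  FALSE
  start_proc_args    /bin/true
  stop_proc_args     /bin/true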

-- Reuti


> Cheers,
> 
> -David Laidlaw
> 
> 
> 
> 
> Here is how I submit the job:
> 
>/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> 
> 
> Here is what is in runme:
> 
>   #!/bin/bash
>   #$ -cwd
>   #$ -pe orte_fill 1
>   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-
> allocation ./hello
> 
> 
> Here is hello.c:
> 
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> int main(int argc, char** argv) {
> // Initialize the MPI environment
> MPI_Init(NULL, NULL);
> 
> // Get the number of processes
> int world_size;
> MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> 
> // Get the rank of the process
> int world_rank;
> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> 
> // Get the name of the processor
> char processor_name[MPI_MAX_PROCESSOR_NAME];
> int name_len;
> MPI_Get_processor_name(processor_name, &name_len);
> 
> // Print off a hello world message
> printf("Hello world from processor %s, rank %d out of %d processors\n",
>processor_name, world_rank, world_size);
> // system("printenv");
> 
> sleep(15); // sleep for 15 seconds
> 
> // Finalize the MPI environment.
> MPI_Finalize();
> }
> 
> 
> This command will build it:
> 
>  mpicc hello.c -o hello
> 
> 
> Running produces the following:
> 
> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --
> 
> 
> and:
> 
> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
> /usr/bin/ssh  set path = ( /usr/bin $path ) ; if ( $?
> LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH
>  == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 ) setenv
> LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY
> _PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv
> DYLD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DY
> LD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;   /usr/bin/orted --hnp-topo-sig
> 0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jo
> bid "2446000128" -mca ess_base_vpid "" -mca ess_base_num_procs "2" -
> mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
>  --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca
> pmix "^s1,s2,cray"
> ssh_exchange_identification: read: Connection reset by peer
> 
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Question about OpenMPI paths

2019-07-25 Thread Ewen Chan via users
All:

Whoops.

My apologies to everybody. Accidentally pressed the wrong combination of 
buttons on the keyboard and sent this email out prematurely.

Please disregard.

Thank you.

Sincerely,
Ewen


From: users  on behalf of Ewen Chan via users 

Sent: July 25, 2019 10:19 AM
To: users@lists.open-mpi.org 
Cc: Ewen Chan 
Subject: [OMPI users] Question about OpenMPI paths

To Whom It May Concern:

I am trying to run Converge CFD by Converge Science using OpenMPI in CentOS 
7.6.1810 x86_64 and I am getting the error:

bash: orted: command not found

I've already read the FAQ: 
https://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

Here's my system setup, environment variables, etc.

OpenMPI version 1.10.7

Path to mpirun: /usr/lib64/openmpi/bin
Path to openmpi libs: /usr/lib64/openmpi/lib

$ cat ~/.bashrc
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

$ cat ~/.bash_profile
...
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

$ cat ~/.profile
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH


$ cat /etc/profile
...
PATH=$PATH:...:/usr/lib64/openmpi/bin
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
export LD_LIBRARY_PATH

$ cat /home/user/cluster/node003_16_node004_16.txt
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node004
node004
node004
node004
node004
node004
node004
node004
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread John Hearns via users
Have you checked your ssh between nodes?
Also, how is your PATH set up?
There is a difference between interactive and non-interactive login sessions.

I advise:
A. Construct a hosts file and mpirun by hand (see the sketch after this list)

B. Use modules rather than .bashrc files

C. Slurm
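For suggestion A, a minimal by-hand test could look like the sketch below; the
host names are placeholders, and running it outside SGE assumes passwordless
ssh to those nodes:

  echo "dblade01 slots=2"  > hosts.txt
  echo "dblade02 slots=2" >> hosts.txt
  mpirun --hostfile hosts.txt -np 4 ./hello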

On Thu, 25 Jul 2019, 18:00 David Laidlaw via users, <
users@lists.open-mpi.org> wrote:

> I have been trying to run some MPI jobs under SGE for almost a year
> without success.  What seems like a very simple test program fails; the
> ingredients of it are below.  Any suggestions on any piece of the test,
> reasons for failure, requests for additional info, configuration thoughts,
> etc. would be much appreciated.  I suspect the linkage between SGE and MPI,
> but can't identify the problem.  We do have SGE support built into MPI.  We
> also have the SGE parallel environment (PE) set up as described in several
> places on the web.
>
> Many thanks for any input!
>
> Cheers,
>
> -David Laidlaw
>
>
>
>
> Here is how I submit the job:
>
>/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
>
>
> Here is what is in runme:
>
>   #!/bin/bash
>   #$ -cwd
>   #$ -pe orte_fill 1
>   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-
> allocation ./hello
>
>
> Here is hello.c:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int main(int argc, char** argv) {
> // Initialize the MPI environment
> MPI_Init(NULL, NULL);
>
> // Get the number of processes
> int world_size;
> MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>
> // Get the rank of the process
> int world_rank;
> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>
> // Get the name of the processor
> char processor_name[MPI_MAX_PROCESSOR_NAME];
> int name_len;
> MPI_Get_processor_name(processor_name, &name_len);
>
> // Print off a hello world message
> printf("Hello world from processor %s, rank %d out of %d processors\n",
>processor_name, world_rank, world_size);
> // system("printenv");
>
> sleep(15); // sleep for 15 seconds
>
> // Finalize the MPI environment.
> MPI_Finalize();
> }
>
>
> This command will build it:
>
>  mpicc hello.c -o hello
>
>
> Running produces the following:
>
> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp
> (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to
> use.
>
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --
>
>
> and:
>
> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
> /usr/bin/ssh  set path = ( /usr/bin $path ) ; if ( $?
> LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH
>  == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 )
> setenv
> LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY
> _PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv
> DYLD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DY
> LD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;   /usr/bin/orted
> --hnp-topo-sig
> 0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jo
> bid "2446000128" -mca ess_base_vpid "" -mca ess_base_num_procs
> "2" -
> mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
>  --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca
> pmix "^s1,s2,cray"
> ssh_exchange_identification: read: Connection reset by peer
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] can't run MPI job under SGE

2019-07-25 Thread David Laidlaw via users
I have been trying to run some MPI jobs under SGE for almost a year without
success.  What seems like a very simple test program fails; the ingredients
of it are below.  Any suggestions on any piece of the test, reasons for
failure, requests for additional info, configuration thoughts, etc. would
be much appreciated.  I suspect the linkage between SGE and MPI, but can't
identify the problem.  We do have SGE support built into MPI.  We also have
the SGE parallel environment (PE) set up as described in several places on
the web.

Many thanks for any input!

Cheers,

-David Laidlaw




Here is how I submit the job:

   /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme


Here is what is in runme:

  #!/bin/bash
  #$ -cwd
  #$ -pe orte_fill 1
  env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-
allocation ./hello


Here is hello.c:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char** argv) {
// Initialize the MPI environment
MPI_Init(NULL, NULL);

// Get the number of processes
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

// Get the rank of the process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

// Get the name of the processor
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);

// Print off a hello world message
printf("Hello world from processor %s, rank %d out of %d processors\n",
   processor_name, world_rank, world_size);
// system("printenv");

sleep(15); // sleep for 15 seconds

// Finalize the MPI environment.
MPI_Finalize();
}


This command will build it:

 mpicc hello.c -o hello


Running produces the following:

/var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--


and:

[dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
/usr/bin/ssh  set path = ( /usr/bin $path ) ; if ( $?
LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH
 == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 ) setenv
LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY
_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv
DYLD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DY
LD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;   /usr/bin/orted
--hnp-topo-sig
0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jo
bid "2446000128" -mca ess_base_vpid "" -mca ess_base_num_procs
"2" -
mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
 --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca
pmix "^s1,s2,cray"
ssh_exchange_identification: read: Connection reset by peer
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] bash: orted: command not found -- ran through the FAQ already

2019-07-25 Thread Jeff Squyres (jsquyres) via users
On Jul 25, 2019, at 10:31 AM, Ewen Chan via users  
wrote:
> 
> Here's my configuration:
> 
> OS: CentOS 7.6.1810 x86_64 (it's a fresh install. I installed it last night.)
> OpenMPI version: 1.10.7 (that was the version that was available in the 
> CentOS install repo)
> path to mpirun: /usr/lib64/openmpi/bin
> path to lib: /usr/lib64/openmpi/lib
> 
> $ cat ~/.bashrc
> PATH=$PATH
> export PATH
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH
> export LD_LIBRARY_PATH
> 
> $ cat ~/.bash_profile
> PATH=$PATH
> export PATH
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH
> export LD_LIBRARY_PATH
> 
> $ cat ~/.profile
> PATH=$PATH
> export PATH
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH
> export LD_LIBRARY_PATH

The commands in the 3 files above are not doing anything.  You're assigning a 
variable to itself without modifying anything.  E.g., "PATH=$PATH" just sets 
the PATH variable to itself, so nothing changes.

Did you mean to put the Open MPI installation paths on the right-hand side of 
the assignments?  E.g. (disclaimer: typed directly into email -- I haven't 
tested this):

-
PATH=$PATH:/usr/lib64/openmpi/bin
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
export LD_LIBRARY_PATH
-


> $ cat /etc/profile
> ...
> PATH=$PATH:...:/usr/lib64/openmpi/bin
> export PATH
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
> export LD_LIBRARY_PATH

Aside from the extra "..." in there, those statements look correct.

I don't know offhand whether /etc/profile is executed for non-interactive 
logins.

I suspect that if you fix the statements in $HOME/.bashrc and/or 
$HOME/.bash_profile, that might be good enough.

The key is to make sure that PATH and LD_LIBRARY_PATH are set with the Open MPI 
paths for both interactive and non-interactive logins.  You can test easily 
with:

# Interactive
localnode$ ssh othernode
othernode$ env | grep PATH
...make sure it has the right paths in it...

# Non-interactive:
localnode$ ssh othernode env | grep PATH
...make sure it has the right paths in it...

> command that I am trying to run that results in the error:
> 
> SI8_premix_PFI_SAGE$ time -p mpirun -hostfile node003_16_node004_16.txt 
> /opt/converge_2.4.0/l_x86_64/bin/converge...ompi 2>&1 | tee run.log

This means that mpirun is not in your PATH on your local machine.

For simplicity, once you correct your .bashrc, log out and log in again and see 
if the PATH is now set properly.  Just editing your .bashrc does not modify your 
current environment; there are several ways to refresh it, but the simplest is 
to log out and log in again and check whether your environment was set up 
properly by your .bashrc.
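If fixing the dotfiles on every node turns out to be inconvenient, mpirun's
--prefix option is another way to tell the remote orted where the installation
lives.  A sketch using the paths from this thread:

  /usr/lib64/openmpi/bin/mpirun --prefix /usr/lib64/openmpi \
      -hostfile node003_16_node004_16.txt \
      /opt/converge_2.4.0/l_x86_64/bin/converge...ompi

Invoking mpirun via its absolute path has a similar effect in many Open MPI
versions, which matches the observation that the absolute-path invocation works.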

-- 
Jeff Squyres
jsquy...@cisco.com

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


[OMPI users] bash: orted: command not found -- ran through the FAQ already

2019-07-25 Thread Ewen Chan via users
To Whom It May Concern:

I'm trying to run Converge CFD by Converge Science using OpenMPI and I am 
getting the error:

bash: orted: command not found

I've already read and followed the FAQ about adding OpenMPI to my PATH and 
LD_LIBRARY_PATH 
(https://www.open-mpi.org/faq/?category=running#adding-ompi-to-path).

Here's my configuration:

OS: CentOS 7.6.1810 x86_64 (it's a fresh install. I installed it last night.)
OpenMPI version: 1.10.7 (that was the version that was available in the CentOS 
install repo)
path to mpirun: /usr/lib64/openmpi/bin
path to lib: /usr/lib64/openmpi/lib

$ cat ~/.bashrc
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

$ cat ~/.bash_profile
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

$ cat ~/.profile
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

$ cat /etc/profile
...
PATH=$PATH:...:/usr/lib64/openmpi/bin
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
export LD_LIBRARY_PATH

SI8_premix_PFI_SAGE$ cat node003_16_node004_16.txt
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node004
node004
node004
node004
node004
node004
node004
node004
node004
node004
node004
node004
node004
node004
node004
node004

command that I am trying to run that results in the error:

SI8_premix_PFI_SAGE$ time -p mpirun -hostfile node003_16_node004_16.txt 
/opt/converge_2.4.0/l_x86_64/bin/converge...ompi 2>&1 | tee run.log

However, if I explicitly give the absolute path to mpirun, then it works:

SI8_premix_PFI_SAGE$ time -p /usr/lib64/openmpi/bin/mpirun -hostfile 
node003_16_node004_16.txt /opt/converge_2.4.0/l_x86_64/bin/converge...ompi 2>&1 
| tee run.log

I also tried exporting the PATH and LD_LIBRARY_PATH to the slave node and that 
didn't seem to work (results in the same error):

SI8_premix_PFI_SAGE$ time -p mpirun -x PATH -x LD_LIBRARY_PATH -hostfile 
node003_16_node004_16.txt /opt/converge_2.4.0/l_x86_64/bin/converge...ompi 2>&1 
| tee run.log

I think that I'm using the bash shell (I think that's the default for CentOS 
users), but to be sure, I created the .profile anyway.

passwordless ssh has been properly configured, so that's not an issue (and I've 
tested that permutatively).

The two nodes can ping each other, back and forth, also permutatively as well.

I'm at a loss as to why I need to specify the absolute path to mpirun despite 
having everything else set up; to me, it looks like I've set everything 
up correctly.

Your help is greatly appreciated.

Thank you.

Sincerely,

Ewen Chan
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Question about OpenMPI paths

2019-07-25 Thread Ewen Chan via users
To Whom It May Concern:

I am trying to run Converge CFD by Converge Science using OpenMPI in CentOS 
7.6.1810 x86_64 and I am getting the error:

bash: orted: command not found

I've already read the FAQ: 
https://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

Here's my system setup, environment variables, etc.

OpenMPI version 1.10.7

Path to mpirun: /usr/lib64/openmpi/bin
Path to openmpi libs: /usr/lib64/openmpi/lib

$ cat ~/.bashrc
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

$ cat ~/.bash_profile
...
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

$ cat ~/.profile
PATH=$PATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH


$ cat /etc/profile
...
PATH=$PATH:...:/usr/lib64/openmpi/bin
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib
export LD_LIBRARY_PATH

$ cat /home/user/cluster/node003_16_node004_16.txt
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node003
node004
node004
node004
node004
node004
node004
node004
node004
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] How is the rank determined (Open MPI and Podman)

2019-07-25 Thread Adrian Reber via users
On Wed, Jul 24, 2019 at 09:46:13PM +, Jeff Squyres (jsquyres) wrote:
> On Jul 24, 2019, at 5:16 PM, Ralph Castain via users 
>  wrote:
> > 
> > It doesn't work that way, as you discovered. You need to add this 
> > information at the same place where vader currently calls modex send, and 
> > then retrieve it at the same place vader currently calls modex recv. Those 
> > macros don't do an immediate send/recv like you are thinking - the send 
> > simply adds the value to an aggregated payload, then the "fence" call 
> > distributes that payload to everyone, and then the read extracts the 
> > requested piece from that payload.
> 
> Just to expand on what Ralph said, think of it like this:
> 
> 1. each component/module does a modex "send", which just memcopies the data 
> blob
> 
> 2. the "fence()" is deep within ompi_mpi_init(), which does the actual data 
> exchange of all the module blobs in an efficient manner
> 
> 3. each component/module can then later do a modex "receive", which just 
> memcopies the relevant blob from the module blobs that were actually received 
> in step #2

Thanks for the clarifications on how this works. I was expecting something
completely different.

> (BTW, "modex" = "module exchange")

Ah, also good to know. I was not really sure what it means.

I opened a pull request completely without modex: 
https://github.com/open-mpi/ompi/pull/6844

I would have preferred to have the user namespace detection earlier than
the actual {put,get}(), but I was not able to make it work.

I hope I included all relevant information about my changes in the PR.
Let's continue the discussion there. Thanks everyone for the support.

Adrian

> >> On Jul 24, 2019, at 5:23 AM, Adrian Reber  wrote:
> >> 
> >> On Mon, Jul 22, 2019 at 04:30:50PM +, Ralph Castain wrote:
>  On Jul 22, 2019, at 9:20 AM, Adrian Reber  wrote:
>  
>  I have most of the code ready, but I still have troubles doing
>  OPAL_MODEX_RECV. I am using the following lines, based on the code from
>  orte/test/mpi/pmix.c:
>  
>  OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &value, 
>  OPAL_INT);
>  
>  This sets rc to 0. For receiving:
>  
>  OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &wildcard_rank, &ptr, OPAL_INT);
> >>> 
> >>> You need to replace "wildcard_rank" with the process name of the proc who 
> >>> published the "user_ns_id" key. If/when we have mpirun provide the value, 
> >>> then you can retrieve it from the wildcard rank as it will be coming from 
> >>> the system and not an application proc
> >> 
> >> So I can get the user namespace ID from all involved processes back to
> >> the main process (MCA_BTL_VADER_LOCAL_RANK == 0). But now only this
> >> process knows that the user namespace IDs are different and I have
> >> trouble using MODEX to send the information (do not use cma) back to the
> >> other involved processes. It seems I am not able to use MODEX_{SEND,RECV}
> >> at the same time. One process sends and then waits on a receive from the
> >> other processes. Something like this works
> >> 
> >>PROC 0 PROC 1
> >>recv() sent()
> >> 
> >> 
> >> But this does not work:
> >> 
> >>PROC 0 PROC 1
> >>recv() sent()
> >>sent() recv()
> >> 
> >> If I start the recv() immediately after the send() on PROC 1 no messages
> >> are delivered anymore and everything hangs, even if different MODEX keys
> >> are used. It seems like MODEX cannot fetch messages in a different order
> >> than they were sent. Is that so?
> >> 
> >> Not sure how to tell the other processes to not use CMA, while some
> >> processes are still transmitting their user namespace ID to PROC 0.
> >> 
> >>Adrian
> >> 
>  and rc is always set to -13. Is this how it is supposed to work, or do I
>  have to do it differently?
>  
>   Adrian
>  
>  On Mon, Jul 22, 2019 at 02:03:20PM +, Ralph Castain via users wrote:
> > If that works, then it might be possible to include the namespace ID in 
> > the job-info provided by PMIx at startup - would have to investigate, 
> > so please confirm that the modex option works first.
> > 
> >> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users 
> >>  wrote:
> >> 
> >> Adrian,
> >> 
> >> 
> >> An option is to involve the modex.
> >> 
> >> each task would OPAL_MODEX_SEND() its own namespace ID, and then 
> >> OPAL_MODEX_RECV()
> >> 
> >> the one from its peers and decide whether CMA support can be enabled.
> >> 
> >> 
> >> Cheers,
> >> 
> >> 
> >> Gilles
> >> 
> >> On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
> >>> I had a look at it and not sure if it really makes sense.
> >>> 
> >>> In btl_vader_{put,get}.c it would be easy to check for the user
> >>> namespace ID of the other