Re: [OMPI users] can't run MPI job under SGE

2019-07-29 Thread David Laidlaw via users
I will try building a newer ompi version in my home directory, but that
will take me some time.
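
Following Reuti's suggestion, I expect the build to look roughly like this
(a sketch, not yet tested here; the tarball URL and the --with-sge flag for
the gridengine support are my assumptions):

   wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.4.tar.bz2
   tar xf openmpi-3.1.4.tar.bz2 && cd openmpi-3.1.4
   ./configure --prefix=$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared --with-sge
   make -j 4 all && make install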

qconf is not available to me on any machine.  It produces the same error
wherever I try it:

> denied: host "dblade65.cs.brown.edu" is neither submit nor admin host


Here is what it produces when I have a sysadmin run it:

$ qconf -sconf | egrep "(command|daemon)"
qlogin_command   /sysvol/sge.test/bin/qlogin-wrapper
qlogin_daemon    /sysvol/sge.test/bin/grid-sshd -i
rlogin_command   builtin
rlogin_daemon    builtin
rsh_command      builtin
rsh_daemon       builtin


Does that suggest anything?
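
Would it also help to have a sysadmin run the following?  My understanding
is that tight integration needs control_slaves TRUE in the PE, so that
`qrsh -inherit` can start the remote daemons (a sketch; orte_fill is the PE
named in my job script):

   qconf -sp orte_fill | egrep "(control_slaves|start_proc_args)"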

Thanks!

-David Laidlaw




On Thu, Jul 25, 2019 at 5:21 PM Reuti  wrote:

>
> Am 25.07.2019 um 23:00 schrieb David Laidlaw:
>
> > Here is most of the command output when run on a grid machine:
> >
> > dblade65.dhl(101) mpiexec --version
> > mpiexec (OpenRTE) 2.0.2
>
> This is somewhat old. I would suggest installing a fresh one. You can
> even compile one in your home directory and install it, e.g., in
> $HOME/local/openmpi-3.1.4_gcc-7.4.0_shared (via --prefix=…intended path…),
> and then use this for all your jobs (adjust for your version of gcc). In
> your ~/.bash_profile and the job script:
>
> DEFAULT_MANPATH="$(manpath -q)"
> MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
> export PATH="$MY_OMPI/bin:$PATH"
> export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
> export MANPATH="$MY_OMPI/share/man${DEFAULT_MANPATH:+:$DEFAULT_MANPATH}"
> unset MY_OMPI
> unset DEFAULT_MANPATH
>
> Essentially there is no conflict with the already installed version.
>
>
> > dblade65.dhl(102) ompi_info | grep grid
> >  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)
> > dblade65.dhl(103) c
> > denied: host "dblade65.cs.brown.edu" is neither submit nor admin host
> > dblade65.dhl(104)
>
> On a node it’s ok this way.
>
>
> > Does that suggest anything?
> >
> > qconf is restricted to sysadmins, which I am not.
>
> What error is output if you try it anyway? Usually just viewing the
> configuration is possible for everyone.
>
>
> > I would note that we are running debian stretch on the cluster
> machines.  On some of our other (non-grid) machines, running debian buster,
> the output is:
> >
> > cslab3d.dhl(101) mpiexec --version
> > mpiexec (OpenRTE) 3.1.3
> > Report bugs to http://www.open-mpi.org/community/help/
> > cslab3d.dhl(102) ompi_info | grep grid
> >  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v3.1.3)
>
> If you compile on such a machine and intend to run it in the cluster, it
> won't work, as the versions don't match. Hence the suggestion above: use a
> personal installation available in your $HOME for both compiling and
> running the applications.
>
> Side note: Open MPI binds processes to cores by default. In case more
> than one MPI job is running on a node, one will have to use `mpiexec
> --bind-to none …`, as otherwise all jobs on this node will pile up on
> core 0 upwards.
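>
> For instance, in the job script (a sketch; $NSLOTS is the slot count SGE
> grants the job):
>
>    mpiexec --bind-to none -np $NSLOTS ./hello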
>
> -- Reuti
>
>
> > Thanks!
> >
> > -David Laidlaw
> >
> > On Thu, Jul 25, 2019 at 2:13 PM Reuti 
> wrote:
> >
> > Am 25.07.2019 um 18:59 schrieb David Laidlaw via users:
> >
> > > I have been trying to run some MPI jobs under SGE for almost a year
> without success.  What seems like a very simple test program fails; the
> ingredients of it are below.  Any suggestions on any piece of the test,
> reasons for failure, requests for additional info, configuration thoughts,
> etc. would be much appreciated.  I suspect the linkage between SGE and MPI,
> but can't identify the problem.  We do have SGE support built into MPI.  We
> also have the SGE parallel environment (PE) set up as described in several
> places on the web.
> > >
> > > Many thanks for any input!
> >
> > Did you compile Open MPI on your own or was it delivered with the Linux
> distribution? That it tries to use `ssh` is quite strange, as nowadays Open
> MPI and others have built-in support to detect that they are running under
> the control of a queuing system. It should use `qrsh` in your case.
> >
> > What does:
> >
> > mpiexec --version
> > ompi_info | grep grid
> >
> > reveal? What does:
> >
> > qconf -sconf | egrep "(command|daemon)"
> >
> > show?
> >
> > -- Reuti
> >
> >
> > > Cheers,
> > >
> > > -David Laidlaw
> > >
> > 

[OMPI users] can't run MPI job under SGE

2019-07-25 Thread David Laidlaw via users
I have been trying to run some MPI jobs under SGE for almost a year without
success.  What seems like a very simple test program fails; the ingredients
of it are below.  Any suggestions on any piece of the test, reasons for
failure, requests for additional info, configuration thoughts, etc. would
be much appreciated.  I suspect the linkage between SGE and MPI, but can't
identify the problem.  We do have SGE support built into MPI.  We also have
the SGE parallel environment (PE) set up as described in several places on
the web.

Many thanks for any input!

Cheers,

-David Laidlaw




Here is how I submit the job:

   /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme


Here is what is in runme:

  #!/bin/bash
  #$ -cwd
  #$ -pe orte_fill 1
  env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello


Here is hello.c:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char** argv) {
// Initialize the MPI environment
MPI_Init(NULL, NULL);

// Get the number of processes
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

// Get the rank of the process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

// Get the name of the processor
char processor_name[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processor_name, &name_len);

// Print off a hello world message
printf("Hello world from processor %s, rank %d out of %d processors\n",
   processor_name, world_rank, world_size);
// system("printenv");

sleep(15); // sleep for 15 seconds

// Finalize the MPI environment.
MPI_Finalize();
}


This command will build it:

 mpicc hello.c -o hello
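
Running it directly (outside SGE) serves as a quick sanity check of the
build itself (a sketch; any small process count will do):

   mpirun -np 2 ./hello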


Running it under SGE produces the following:

/var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--


and:

[dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
/usr/bin/ssh  set path = ( /usr/bin $path ) ; if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 ) setenv LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv DYLD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ; /usr/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jobid "2446000128" -mca ess_base_vpid "" -mca ess_base_num_procs "2" -mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791" --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca pmix "^s1,s2,cray"
ssh_exchange_identification: read: Connection reset by peer
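
Note that the template above launches the daemons via /usr/bin/ssh rather
than SGE's qrsh.  One way to check whether this build's rsh launcher knows
about qrsh at all (a sketch; I am guessing at the relevant parameter name):

   ompi_info --param plm rsh --level 9 | grep -i qrsh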
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread David Laidlaw via users
Here is most of the command output when run on a grid machine:


dblade65.dhl(101) mpiexec --version

mpiexec (OpenRTE) 2.0.2

dblade65.dhl(102) ompi_info | grep grid

 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v2.0.2)

dblade65.dhl(103) c

denied: host "dblade65.cs.brown.edu" is neither submit nor admin host

dblade65.dhl(104)


Does that suggest anything?

qconf is restricted to sysadmins, which I am not.

I would note that we are running debian stretch on the cluster machines.
On some of our other (non-grid) machines, running debian buster, the output
is:

cslab3d.dhl(101) mpiexec --version

mpiexec (OpenRTE) 3.1.3

Report bugs to http://www.open-mpi.org/community/help/

cslab3d.dhl(102) ompi_info | grep grid

 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component v3.1.3)


Thanks!

-David Laidlaw

On Thu, Jul 25, 2019 at 2:13 PM Reuti  wrote:

>
> Am 25.07.2019 um 18:59 schrieb David Laidlaw via users:
>
> > I have been trying to run some MPI jobs under SGE for almost a year
> without success.  What seems like a very simple test program fails; the
> ingredients of it are below.  Any suggestions on any piece of the test,
> reasons for failure, requests for additional info, configuration thoughts,
> etc. would be much appreciated.  I suspect the linkage between SGE and MPI,
> but can't identify the problem.  We do have SGE support built into MPI.  We
> also have the SGE parallel environment (PE) set up as described in several
> places on the web.
> >
> > Many thanks for any input!
>
> Did you compile Open MPI on your own or was it delivered with the Linux
> distribution? That it tries to use `ssh` is quite strange, as nowadays Open
> MPI and others have built-in support to detect that they are running under
> the control of a queuing system. It should use `qrsh` in your case.
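>
> For reference, a tight-integration PE (in the spirit of the Open MPI FAQ;
> a sketch, with the commonly recommended values and the name your job
> script requests) looks like:
>
> $ qconf -sp orte_fill
> pe_name            orte_fill
> slots              999
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE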
>
> What does:
>
> mpiexec --version
> ompi_info | grep grid
>
> reveal? What does:
>
> qconf -sconf | egrep "(command|daemon)"
>
> show?
>
> -- Reuti
>
>
> > Cheers,
> >
> > -David Laidlaw
> >
> >
> >
> >
> > Here is how I submit the job:
> >
> >/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> >
> >
> > Here is what is in runme:
> >
> >   #!/bin/bash
> >   #$ -cwd
> >   #$ -pe orte_fill 1
> >   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
> >
> >
> > Here is hello.c:
> >
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> >
> > int main(int argc, char** argv) {
> > // Initialize the MPI environment
> > MPI_Init(NULL, NULL);
> >
> > // Get the number of processes
> > int world_size;
> > MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> >
> > // Get the rank of the process
> > int world_rank;
> > MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> >
> > // Get the name of the processor
> > char processor_name[MPI_MAX_PROCESSOR_NAME];
> > int name_len;
> > MPI_Get_processor_name(processor_name, &name_len);
> >
> > // Print off a hello world message
> > printf("Hello world from processor %s, rank %d out of %d
> processors\n",
> >processor_name, world_rank, world_size);
> > // system("printenv");
> >
> > sleep(15); // sleep for 15 seconds
> >
> > // Finalize the MPI environment.
> > MPI_Finalize();
> > }
> >
> >
> > This command will build it:
> >
> >  mpicc hello.c -o hello
> >
> >
> > Running produces the following:
> >
> > /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> > dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> >
> --
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp
> (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to
> use.
> >
> > *  compilation of the orted with dynamic libraries when static are
> required

Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread David Laidlaw via users
Thanks for the input, John.  Here are some responses (inline):

On Thu, Jul 25, 2019 at 1:21 PM John Hearns via users <
users@lists.open-mpi.org> wrote:

> Have you checked your ssh between nodes?
>

ssh is not allowed between nodes, but my understanding is that processes
should be getting set up and run by SGE, since it handles the queuing.
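
My understanding of tight integration is that Open MPI detects SGE from the
job environment and then launches via `qrsh -inherit` instead of ssh.  A
quick check inside the job script (a sketch; these are standard SGE job
variables):

   echo "JOB_ID=$JOB_ID NSLOTS=$NSLOTS"
   cat "$PE_HOSTFILE"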


> Also how is your Path set up?
>

It should be using the same startup scripts as I use on other machines
within our dept, since the filesystem and home directories are shared
across both grid and non-grid machines.  In any case, I have put in fully
qualified pathnames for everything that I start up.


> A. Construct a hosts file and mpirun by hand
>

I have looked at the hosts file, and it seems correct.  I don't know that I
can pass a hosts file to mpirun directly, since SGE queues things and
determines what hosts will be assigned.
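
If I did want to test by hand outside SGE, I gather it would look something
like this (a sketch; the hostnames are placeholders, and it would need the
very ssh access that is blocked here):

   echo "dblade01 slots=1" >  hosts.txt
   echo "dblade02 slots=1" >> hosts.txt
   mpirun --hostfile hosts.txt -np 2 ./hello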


>
> B. Use modules rather than. Bashrc files
>

Hmm.  I don't really understand this one.  (I know what both are, but I
don't understand what problem would be solved by converting to
modules.)


> C. Slurm
>

I don't run the grid/cluster, so I can't choose the queuing tools that are
run.  There are plans to migrate to slurm at some point in the future, but
that doesn't help me now...

Thanks!

-David Laidlaw


>
> On Thu, 25 Jul 2019, 18:00 David Laidlaw via users, <
> users@lists.open-mpi.org> wrote:
>
>> I have been trying to run some MPI jobs under SGE for almost a year
>> without success.  What seems like a very simple test program fails; the
>> ingredients of it are below.  Any suggestions on any piece of the test,
>> reasons for failure, requests for additional info, configuration thoughts,
>> etc. would be much appreciated.  I suspect the linkage between SGE and MPI,
>> but can't identify the problem.  We do have SGE support built into MPI.  We
>> also have the SGE parallel environment (PE) set up as described in several
>> places on the web.
>>
>> Many thanks for any input!
>>
>> Cheers,
>>
>> -David Laidlaw
>>
>>
>>
>>
>> Here is how I submit the job:
>>
>>/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
>>
>>
>> Here is what is in runme:
>>
>>   #!/bin/bash
>>   #$ -cwd
>>   #$ -pe orte_fill 1
>>   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
>>
>>
>> Here is hello.c:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>>
>> int main(int argc, char** argv) {
>> // Initialize the MPI environment
>> MPI_Init(NULL, NULL);
>>
>> // Get the number of processes
>> int world_size;
>> MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>
>> // Get the rank of the process
>> int world_rank;
>> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>
>> // Get the name of the processor
>> char processor_name[MPI_MAX_PROCESSOR_NAME];
>> int name_len;
>> MPI_Get_processor_name(processor_name, &name_len);
>>
>> // Print off a hello world message
>> printf("Hello world from processor %s, rank %d out of %d
>> processors\n",
>>processor_name, world_rank, world_size);
>> // system("printenv");
>>
>> sleep(15); // sleep for 15 seconds
>>
>> // Finalize the MPI environment.
>> MPI_Finalize();
>> }
>>
>>
>> This command will build it:
>>
>>  mpicc hello.c -o hello
>>
>>
>> Running produces the following:
>>
>> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
>> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
>> --
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>>
>> * not finding the required libraries and/or binaries on
>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>
>> * lack of authority to execute on one or more specified nodes.
>>   Please verify your allocation and authorities.
>>
>> * the inability to write startup files into /tmp
>> (--tmpdir/orte_tmpdir_base).
>>   Please check with your sys admin to determine the correct location to
>> use.
>>
>> *  compilation of the orted with dynamic libraries when static are
>> required
>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>