I have been trying to run some MPI jobs under SGE for almost a year without
success.  What seems like a very simple test program fails; the ingredients
of it are below.  Any suggestions on any piece of the test, reasons for
failure, requests for additional info, configuration thoughts, etc. would
be much appreciated.  I suspect the linkage between SGE and MPI, but can't
identify the problem.  We do have SGE support built into MPI.  We also have
the SGE parallel environment (PE) set up as described in several places on
the web.
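For anyone who wants to double-check those two claims on their own
system, the checks can be sketched roughly as below.  This is only a
sketch: it assumes ompi_info (from Open MPI) and qconf (from SGE) are on
the PATH, run_if_present is a helper name made up for this example, and
orte_fill is the PE name used in runme.  No output is shown because it
depends on the installation.

```shell
#!/bin/sh
# Run a command only if its executable is on PATH; otherwise report
# that on stderr.  (run_if_present is a hypothetical helper.)
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "$1: not found on PATH" >&2
    fi
}

# An SGE-aware Open MPI build should list gridengine components here.
run_if_present ompi_info | grep -i gridengine \
    || echo "no gridengine line in ompi_info output"

# Show the definition of the parallel environment requested in runme.
run_if_present qconf -sp orte_fill \
    || echo "qconf could not show orte_fill"
```
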

Many thanks for any input!


-David Laidlaw

Here is how I submit the job:

   /usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme

Here is what is in runme:

  #$ -cwd
  #$ -pe orte_fill 1
  env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
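A diagnostic addition to runme along the following lines could show what
the job actually sees from SGE before mpirun starts.  This is a sketch
under assumptions: show_var is a helper name invented here, and the
variable list is my understanding of what SGE exports inside a parallel
job (and what Open MPI's gridengine support looks for), not something
confirmed on this cluster.

```shell
#!/bin/sh
# Print one environment variable by name, or "(unset)" if it is
# unset or empty.  (show_var is a hypothetical helper.)
show_var() {
    eval "val=\${$1-}"
    if [ -n "$val" ]; then
        printf '%s=%s\n' "$1" "$val"
    else
        printf '%s=(unset)\n' "$1"
    fi
}

# Variables SGE is expected to export inside a parallel job.
for name in SGE_ROOT ARC JOB_ID NSLOTS PE_HOSTFILE; do
    show_var "$name"
done

# The PE host file lists the hosts and slot counts granted to the job.
if [ -n "${PE_HOSTFILE-}" ] && [ -r "$PE_HOSTFILE" ]; then
    cat "$PE_HOSTFILE"
fi
```

If any of these print as (unset) inside the job, that would suggest
mpirun is not seeing an SGE allocation at all.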

Here is hello.c:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);
    // system("printenv");

    sleep(15); // sleep for 15 seconds

    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}

This command will build it:

     mpicc hello.c -o hello

Running produces the following:

dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).


[dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template>     set path = ( /usr/bin $path ) ;
if ( $?LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH
 == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 ) setenv
_PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv
DYLD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv
DYLD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;   /usr/bin/orted
0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jobid
"2446000128" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2"
-mca orte_hnp_uri "2446000128.0;usock;tcp://"
 --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca
pmix "^s1,s2,cray"
ssh_exchange_identification: read: Connection reset by peer