Hello,
I'm joining the discussion on pmi2.

1. About the behaviour of PMI2_Init() without --mpi=pmi2, it is indeed
    strange that it is treated as a singleton init, especially as the
    lib is distributed with slurm and the error is obvious. Maybe it
    should be an error if we are called by slurm (checking
    $SLURM_JOBID) and a singleton init only otherwise.

    Something like:

--- a/contribs/pmi2/pmi2_api.c
+++ b/contribs/pmi2/pmi2_api.c
@@ -196,18 +196,24 @@ int PMI2_Init(int *spawned, int *size, int *rank, int *appnum)
      p = getenv("PMI2_DEBUG");
      if (p) PMI2_debug = atoi(p);

-    /* Get the fd for PMI commands; if none, we're a singleton */
+    /* Get the fd for PMI commands */
      pmi2_errno = getPMIFD();
      if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-
      if (PMI2_fd == -1) {
-           /* Singleton init: Process not started with mpiexec,
-                  so set size to 1, rank to 0 */
-               PMI2_size = 1;
-               PMI2_rank = 0;
-               *spawned = 0;
+        if (getenv("SLURM_JOBID")) {
+            /* We're probably missing --mpi=pmi2 : error */
+            pmi2_errno = PMI2_FAIL;
+            PMI2U_ERR_POP(pmi2_errno);
+        } else {
+            /* Singleton init: Process not started with mpiexec,
+               so set size to 1, rank to 0 */
+            PMI2_size = 1;
+            PMI2_rank = 0;
+            *spawned = 0;
+
+            PMI2_initialized = SINGLETON_INIT_BUT_NO_PM;
+        }

-               PMI2_initialized = SINGLETON_INIT_BUT_NO_PM;
                 goto fn_exit;
      }
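
    To make the intended decision explicit, here is a standalone sketch
    of the same logic outside of pmi2_api.c. The enum and the function
    name are hypothetical (they do not exist in the slurm sources); fd
    stands for the value getPMIFD() leaves in PMI2_fd, and slurm_jobid
    for the result of getenv("SLURM_JOBID").

```c
#include <stddef.h>

/* Hypothetical names for illustration only; not part of pmi2_api.c. */
enum init_mode { INIT_WITH_PM, INIT_SINGLETON, INIT_ERROR };

/*
 * fd:          PMI file descriptor, or -1 if none was found
 * slurm_jobid: value of $SLURM_JOBID, or NULL if it is not set
 */
static enum init_mode choose_init_mode(int fd, const char *slurm_jobid)
{
    if (fd != -1)
        return INIT_WITH_PM;    /* a process manager gave us a PMI fd */
    if (slurm_jobid != NULL)
        return INIT_ERROR;      /* under slurm but no fd: probably missing --mpi=pmi2 */
    return INIT_SINGLETON;      /* started by hand: size 1, rank 0 */
}
```

    With this, "srun -n 2 ./a.out" would fall into the INIT_ERROR case,
    while a plain "./a.out" outside of any job would still be a
    singleton.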



2. Concerning the error message "no value for key  in req", we indeed
    introduced an error.

    I think the line mentioned below (1488) is right (otherwise the
    bounds passed to the following snprintf calls are too big), but
    the cmdlen computation is not:

--- a/contribs/pmi2/pmi2_api.c
+++ b/contribs/pmi2/pmi2_api.c
@@ -1539,7 +1539,7 @@ int PMIi_WriteSimpleCommand( int fd, PMI2_Command *resp, const char cmd[], PMI2_
      }

      /* prepend the buffer length stripping off the trailing '\0' */
-    cmdlen = PMII_MAX_COMMAND_LEN - remaining_len;
+    cmdlen = PMII_MAX_COMMAND_LEN - remaining_len - PMII_COMMANDLEN_SIZE;
      ret = snprintf(cmdlenbuf, sizeof(cmdlenbuf), "%d", cmdlen);
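
    To illustrate the arithmetic, here is a standalone sketch of the
    framing, not the real PMIi_WriteSimpleCommand(). The helper name,
    the subtract_prefix switch and both constants are assumptions made
    for the example (in particular the buffer size is deliberately
    small). It reserves the length field and decrements remaining_len
    as line 1488 does, then computes cmdlen with or without the
    correction.

```c
#include <stdio.h>
#include <string.h>

#define COMMANDLEN_SIZE 6     /* assumed width of the length field */
#define MAX_COMMAND_LEN 256   /* assumed; kept small for the sketch */

/*
 * Hypothetical helper. Frames `payload` into `buf` (MAX_COMMAND_LEN bytes)
 * and returns the value written into the length field, or -1 if the
 * payload does not fit. subtract_prefix == 0 uses the pre-patch formula.
 */
static int frame_command(char *buf, const char *payload, int subtract_prefix)
{
    char lenbuf[COMMANDLEN_SIZE + 1];
    char *c = buf;
    int remaining_len = MAX_COMMAND_LEN;
    int cmdlen, ret;

    /* leave space for the length field, and account for it */
    memset(c, ' ', COMMANDLEN_SIZE);
    c += COMMANDLEN_SIZE;
    remaining_len -= COMMANDLEN_SIZE;

    ret = snprintf(c, (size_t)remaining_len, "%s", payload);
    if (ret < 0 || ret >= remaining_len)
        return -1;
    remaining_len -= ret;

    /* remaining_len already excludes the prefix, so the plain
       difference counts the prefix bytes as part of the command */
    cmdlen = MAX_COMMAND_LEN - remaining_len;
    if (subtract_prefix)
        cmdlen -= COMMANDLEN_SIZE;

    /* overwrite the leading blanks with the decimal length */
    ret = snprintf(lenbuf, sizeof(lenbuf), "%d", cmdlen);
    memcpy(buf, lenbuf, (size_t)ret);
    return cmdlen;
}
```

    For a 14-byte payload such as "cmd=job-getid;" the pre-patch
    formula announces 20 bytes, while the patched one announces 14,
    which is the number of bytes that actually follow the length field.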


Piotr


 >
 >
 > -------- Original message --------
 > From: Hongjia Cao
 > Date: 07/25/2013 12:25 AM (GMT-05:00)
 > To: slurm-dev
 > Subject: [slurm-dev] Re: Understanding PMI2 support in SLURM 2.6.0
 >
 >
 >
 > On Mon, 2013-07-22 at 07:46 -0700, Andy Riebs wrote:
 >> Hi,
 >>
 >> We're trying to understand how PMI2 support works with SLURM, and have
 >> come up with a test program (see below) that demonstrates unexpected
 >> results.
 >>
 >> Our questions:
 >>     1. What does it mean, when --mpi=pmi2 is _not_ specified, for
 >>        PMI2_Init() to return success, but leave size,rank,appnum
 >>        unchanged?
 > You can run your program directly (execute "./a.out" instead of "srun -n
 > 2 ./a.out") and get similar results. When --mpi=pmi2 is not specified,
 > the PMI2 client library assumes that the program is run as a singleton.
 > "srun -n 2 a.out" will run two copies of the program, each with parallel
 > size 1. You will get 1 in numprocs after calling
 > "MPI_Comm_size(MPI_COMM_WORLD,&numprocs);" in MPI programs.
 >
 >
 >>     2. What does it mean, or what is going wrong, when --mpi=pmi2 is
 >>        specified, to get the "slurmd[hadesn10]: mpi/pmi2: no value for
 >>        key  in req" lines?
 >
 >
 > This message will not appear when running MPI programs. The following is
 > a code segment taken from contribs/pmi2/pmi2_api.c (function
 > PMIi_WriteSimpleCommand()) of SLURM:
 >
 > 1481     int pair_index;
 > 1482
 > 1483     PMI2U_printf("[BEGIN]");
 > 1484
 > 1429     ssize_t nbytes;
 > 1430     ssize_t offset;
 > 1485     /* leave space for length field */
 > 1486     memset(c, ' ', PMII_COMMANDLEN_SIZE);
 > 1487     c += PMII_COMMANDLEN_SIZE;
 > 1488     remaining_len -= PMII_COMMANDLEN_SIZE;
 > 1489
 > 1490     PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno,
 > PMI2_ERR_OTHER, "**cmd_too_long");
 > 1491
 >
 > Line 1488 above is missing from the corresponding file
 > src/pmi/pmi2/simple2pmi.c of MPICH:
 >
 > 1431     int pair_index;
 > 1432
 > 1433     /* leave space for length field */
 > 1434     memset(c, ' ', PMII_COMMANDLEN_SIZE);
 > 1435     c += PMII_COMMANDLEN_SIZE;
 > 1436
 > 1437     PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno,
 > PMI2_ERR_OTHER, "**cmd_too_long");
 > 1438
 >
 > It is not clear which is correct according to the design documents of
 > PMI2 (http://wiki.mpich.org/mpich/index.php/PMI_v2_Wire_Protocol). But
 > the mpi/pmi2 plugin in SLURM, which implements the server part of the PMI
 > protocol, conforms to the MPICH implementation.
 >>
 >> Andy
 >>
 >>
 >> The program:
 >>
 >> --------------------------------------------------------------------------
 >>
 >>
 >> /*
 >>
 >> Using SLURM 2.6.0
 >>
 >> To build and run it:
 >>
 >>
 >>    cc -Wall pmi2-001.c -I/opt/slurm/include -L/opt/slurm/lib64 -lpmi2
 >> && echo "" &&
 >>    srun -n 2 ./a.out && echo "" && srun --mpi=pmi2 -n 2 ./a.out
 >>
 >> Sample output:
 >>
 >>
 >> Init => 0; spawned = 0, size = -99, rank = -99, appnum = -99
 >>
 >> Job_GetId => 14; id =
 >>
 >> Init => 0; spawned = 0, size = -99, rank = -99, appnum = -99
 >>
 >> Job_GetId => 14; id =
 >>
 >>
 >>
 >> slurmd[hadesn10]: mpi/pmi2: no value for key  in req
 >>
 >> Init => 0; spawned = 0, size = 2, rank = 1, appnum = -1
 >>
 >> Job_GetId => 0; id = 252.0
 >>
 >> slurmd[hadesn10]: mpi/pmi2: no value for key  in req
 >>
 >> Init => 0; spawned = 0, size = 2, rank = 0, appnum = -1
 >>
 >> Job_GetId => 0; id = 252.0
 >>
 >>
 >>
 >> */
 >>
 >>
 >>
 >> #include <stdio.h>
 >>
 >> #include "slurm/pmi2.h"
 >>
 >>
 >>
 >> int
 >>
 >> main()
 >>
 >> {
 >>
 >>  int ret;
 >>
 >>  {
 >>
 >>    int spawned = -99, size = -99, rank = -99, appnum = -99;
 >>
 >>    ret = PMI2_Init(&spawned, &size, &rank, &appnum);
 >>
 >>    printf("Init => %d; spawned = %d, size = %d, rank = %d, appnum = %d\n",
 >>
 >>         ret, spawned, size, rank, appnum);
 >>
 >>  }
 >>
 >>  {
 >>
 >>    char id[PMI2_MAX_KEYLEN];
 >>
 >>    ret = PMI2_Job_GetId(id, sizeof(id));
 >>
 >>    printf("Job_GetId => %d; id = %s\n", ret, id);
 >>
 >>  }
 >>
 >>  fflush(NULL);
 >>
 >>  return 0;
 >>
 >> }
 >>
 >> --------------------------------------------------------------------------
 >> --
 >> Andy Riebs
 >> Hewlett-Packard Company
 >> High Performance Computing
 >> +1 404 648 9024
 >> My opinions are not necessarily those of HP
 >>
