Hello,
      specifying --mpi=pmi2 is required; otherwise srun will not load
the pmi2 plugin library that implements the server-side PMI2
functionality.
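
For anyone who wants to catch this from the application side, here is a
minimal sketch of a guard that could run before PMI2_Init(). It assumes
the client library locates its endpoint through the PMI_FD or PMI_PORT
environment variables (my reading of the MPICH-derived getPMIFD(), not
something confirmed in this thread). The function names are hypothetical
and are not part of the PMI2 API.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of libpmi2.
 * Returns 1 when a Slurm job id is present but no PMI endpoint variable
 * is set, which suggests srun was started without --mpi=pmi2.
 * Taking the values as arguments keeps the decision easy to test.
 */
int pmi2_endpoint_probably_missing(const char *jobid,
                                   const char *pmi_fd,
                                   const char *pmi_port)
{
    int in_slurm = (jobid != NULL && jobid[0] != '\0');
    int has_endpoint = (pmi_fd != NULL && pmi_fd[0] != '\0') ||
                       (pmi_port != NULL && pmi_port[0] != '\0');

    return in_slurm && !has_endpoint;
}

/*
 * Convenience wrapper that reads the process environment.
 * PMI_FD and PMI_PORT are assumed variable names (see the note above).
 */
int warn_if_pmi2_endpoint_missing(void)
{
    int missing = pmi2_endpoint_probably_missing(getenv("SLURM_JOBID"),
                                                 getenv("PMI_FD"),
                                                 getenv("PMI_PORT"));
    if (missing)
        fprintf(stderr, "PMI2: running under Slurm without a PMI endpoint; "
                        "was srun started with --mpi=pmi2?\n");
    return missing;
}
```

Calling warn_if_pmi2_endpoint_missing() before PMI2_Init() would at least
make the silent singleton fallback visible in the job output.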

There was an error in contribs/pmi2/pmi2_api.c that caused the 'no value
for key in req' message. The offending instruction was:

->1488     remaining_len -= PMII_COMMANDLEN_SIZE;

This was fixed a couple of days ago and, as already mentioned, pushed to
the 2.6 and master branches.
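
To make the arithmetic concrete, here is a small sketch (not the Slurm or
MPICH source) of how the length field comes out under the three variants
discussed in this thread. The constants and function names are
illustrative assumptions; only the order of the subtractions mirrors the
code quoted below.

```c
#include <stdio.h>

/* Illustrative values only (assumptions, not taken from the real headers). */
#define LEN_FIELD_SIZE   6     /* stands in for PMII_COMMANDLEN_SIZE */
#define MAX_COMMAND_LEN  1024  /* stands in for PMII_MAX_COMMAND_LEN */

enum framing_variant {
    NO_DECREMENT,           /* remaining_len is never reduced for the prefix */
    DECREMENT_ONLY,         /* line 1488 present, cmdlen formula unchanged */
    DECREMENT_COMPENSATED   /* line 1488 present, cmdlen reduced by the prefix */
};

/* Number of payload bytes in "cmd=<name>;", excluding the trailing '\0'. */
int payload_length(const char *cmd)
{
    /* C99: snprintf with a zero-size buffer returns the would-be length. */
    return snprintf(NULL, 0, "cmd=%s;", cmd);
}

/* Value that would be written into the length prefix for each variant. */
int length_field_value(enum framing_variant v, const char *cmd)
{
    int remaining_len = MAX_COMMAND_LEN;
    int cmdlen;

    /* Space reserved for the length prefix (the line 1488 step). */
    if (v != NO_DECREMENT)
        remaining_len -= LEN_FIELD_SIZE;

    /* Space consumed by the payload itself. */
    remaining_len -= payload_length(cmd);

    /* Length computation at the end of the function. */
    cmdlen = MAX_COMMAND_LEN - remaining_len;
    if (v == DECREMENT_COMPENSATED)
        cmdlen -= LEN_FIELD_SIZE;

    return cmdlen;
}
```

With an example command name such as "fullinit", the payload
"cmd=fullinit;" is 13 bytes. NO_DECREMENT and DECREMENT_COMPENSATED both
yield 13, while DECREMENT_ONLY yields 19, so the length field overstates
the payload by the size of the prefix.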

Thanks,
        David

>
>
> Hello,
> I'm joining the discussion on pmi2.
>
> 1. About the behaviour of PMI2_Init() without --mpi=pmi2, it is indeed
>     strange that it is treated as a singleton init, especially as the
>     lib is distributed with slurm and the error is obvious. Maybe it
>     should be an error if we are called by slurm (checking
>     $SLURM_JOBID), and a singleton init only otherwise.
>
>     Something like:
>
> --- a/contribs/pmi2/pmi2_api.c
> +++ b/contribs/pmi2/pmi2_api.c
> @@ -196,18 +196,24 @@ int PMI2_Init(int *spawned, int *size, int *rank, int *appnum)
>       p = getenv("PMI2_DEBUG");
>       if (p) PMI2_debug = atoi(p);
>
> -    /* Get the fd for PMI commands; if none, we're a singleton */
> +    /* Get the fd for PMI commands */
>       pmi2_errno = getPMIFD();
>       if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
> -
>       if (PMI2_fd == -1) {
> -           /* Singleton init: Process not started with mpiexec,
> -                  so set size to 1, rank to 0 */
> -               PMI2_size = 1;
> -               PMI2_rank = 0;
> -               *spawned = 0;
> +        if (getenv("SLURM_JOBID")) {
> +            /* We're probably missing --mpi=pmi2 : error */
> +            pmi2_errno = PMI2_FAIL;
> +            PMI2U_ERR_POP(pmi2_errno);
> +        } else {
> +            /* Singleton init: Process not started with mpiexec,
> +               so set size to 1, rank to 0 */
> +            PMI2_size = 1;
> +            PMI2_rank = 0;
> +            *spawned = 0;
> +
> +            PMI2_initialized = SINGLETON_INIT_BUT_NO_PM;
> +        }
>
> -               PMI2_initialized = SINGLETON_INIT_BUT_NO_PM;
>                  goto fn_exit;
>       }
>
>
>
> 2. Concerning the error message "no value for req", we indeed
>     introduced an error.
>
>     I think the line mentioned below (1488) is right (otherwise the
>     bounds of the following snprintf calls are too big), but the
>     cmdlen computation is not:
>
> --- a/contribs/pmi2/pmi2_api.c
> +++ b/contribs/pmi2/pmi2_api.c
> @@ -1539,7 +1539,7 @@ int PMIi_WriteSimpleCommand( int fd, PMI2_Command *resp, const char cmd[], PMI2_
>       }
>
>       /* prepend the buffer length stripping off the trailing '\0' */
> -    cmdlen = PMII_MAX_COMMAND_LEN - remaining_len;
> +    cmdlen = PMII_MAX_COMMAND_LEN - remaining_len - PMII_COMMANDLEN_SIZE;
>       ret = snprintf(cmdlenbuf, sizeof(cmdlenbuf), "%d", cmdlen);
>
>
> Piotr
>
>
>  >
>  >
>  > -------- Original message --------
>  > From: Hongjia Cao <[email protected]>
>  > Date: 07/25/2013 12:25 AM (GMT-05:00)
>  > To: slurm-dev <[email protected]>
>  > Subject: [slurm-dev] Re: Understanding PMI2 support in SLURM 2.6.0
>  >
>  >
>  >
>  > On Mon, 2013-07-22 at 07:46 -0700, Andy Riebs wrote:
>  >> Hi,
>  >>
>  >> We're trying to understand how PMI2 support works with SLURM, and have
>  >> come up with a test program (see below) that demonstrates unexpected
>  >> results.
>  >>
>  >> Our questions:
>  >>     1. What does it mean, when --mpi=pmi2 is _not_ specified, for
>  >>        PMI2_Init() to return success, but leave size,rank,appnum
>  >>        unchanged?
>  > You can run your program directly (execute "./a.out" instead of
>  > "srun -n 2 ./a.out") and get similar results. When --mpi=pmi2 is not
>  > specified, the PMI2 client library assumes that the program is run as
>  > a singleton. "srun -n 2 a.out" will run two copies of the program,
>  > each with parallel size 1. You will get 1 in numprocs after calling
>  > "MPI_Comm_size(MPI_COMM_WORLD,&numprocs);" in MPI programs.
>  >
>  >
>  >>     2. What does it mean, or what is going wrong, when --mpi=pmi2 is
>  >>        specified, to get the “slurmd[hadesn10]: mpi/pmi2: no value for
>  >>        key  in req” lines?
>  >
>  >
>  > This message will not appear when running MPI programs. The following
>  > is a code segment taken from contribs/pmi2/pmi2_api.c (function
>  > PMIi_WriteSimpleCommand()) of SLURM:
>  >
>  > 1481     int pair_index;
>  > 1482
>  > 1483     PMI2U_printf("[BEGIN]");
>  > 1484
>  > 1429     ssize_t nbytes;
>  > 1430     ssize_t offset;
>  > 1485     /* leave space for length field */
>  > 1486     memset(c, ' ', PMII_COMMANDLEN_SIZE);
>  > 1487     c += PMII_COMMANDLEN_SIZE;
>  > 1488     remaining_len -= PMII_COMMANDLEN_SIZE;
>  > 1489
>  > 1490     PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno,
>  >              PMI2_ERR_OTHER, "**cmd_too_long");
>  > 1491
>  >
>  > the above line 1488 is missing in the corresponding file
>  > src/pmi/pmi2/simple2pmi.c of MPICH:
>  >
>  > 1431     int pair_index;
>  > 1432
>  > 1433     /* leave space for length field */
>  > 1434     memset(c, ' ', PMII_COMMANDLEN_SIZE);
>  > 1435     c += PMII_COMMANDLEN_SIZE;
>  > 1436
>  > 1437     PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno,
>  >              PMI2_ERR_OTHER, "**cmd_too_long");
>  > 1438
>  >
>  > It is not clear which is correct according to the design documents of
>  > PMI2 (http://wiki.mpich.org/mpich/index.php/PMI_v2_Wire_Protocol). But
>  > the mpi/pmi2 plugin in SLURM, which implements the server part of the
>  > PMI protocol, conforms to the MPICH implementation.
>  >>
>  >> Andy
>  >>
>  >>
>  >> The program:
>  >>
>  >>
> --------------------------------------------------------------------------
>  >>
>  >>
>  >> /*
>  >>
>  >> Using SLURM 2.6.0
>  >>
>  >> To build and run it:
>  >>
>  >>
>  >>    cc -Wall pmi2-001.c -I/opt/slurm/include -L/opt/slurm/lib64 -lpmi2 &&
>  >>    echo "" && srun -n 2 ./a.out &&
>  >>    echo "" && srun --mpi=pmi2 -n 2 ./a.out
>  >>
>  >> Sample output:
>  >>
>  >>
>  >> Init => 0; spawned = 0, size = -99, rank = -99, appnum = -99
>  >>
>  >> Job_GetId => 14; id =
>  >>
>  >> Init => 0; spawned = 0, size = -99, rank = -99, appnum = -99
>  >>
>  >> Job_GetId => 14; id =
>  >>
>  >>
>  >>
>  >> slurmd[hadesn10]: mpi/pmi2: no value for key  in req
>  >>
>  >> Init => 0; spawned = 0, size = 2, rank = 1, appnum = -1
>  >>
>  >> Job_GetId => 0; id = 252.0
>  >>
>  >> slurmd[hadesn10]: mpi/pmi2: no value for key  in req
>  >>
>  >> Init => 0; spawned = 0, size = 2, rank = 0, appnum = -1
>  >>
>  >> Job_GetId => 0; id = 252.0
>  >>
>  >>
>  >>
>  >> */
>  >>
>  >>
>  >>
>  >> #include <stdio.h>
>  >> #include "slurm/pmi2.h"
>  >>
>  >> int
>  >> main()
>  >> {
>  >>  int ret;
>  >>  {
>  >>    int spawned = -99, size = -99, rank = -99, appnum = -99;
>  >>    ret = PMI2_Init(&spawned, &size, &rank, &appnum);
>  >>    printf("Init => %d; spawned = %d, size = %d, rank = %d, appnum = %d\n",
>  >>         ret, spawned, size, rank, appnum);
>  >>  }
>  >>  {
>  >>    char id[PMI2_MAX_KEYLEN];
>  >>    ret = PMI2_Job_GetId(id, sizeof(id));
>  >>    printf("Job_GetId => %d; id = %s\n", ret, id);
>  >>  }
>  >>  fflush(NULL);
>  >>  return 0;
>  >> }
>  >>
>  >>
> --------------------------------------------------------------------------
>  >> --
>  >> Andy Riebs
>  >> Hewlett-Packard Company
>  >> High Performance Computing
>  >> +1 404 648 9024
>  >> My opinions are not necessarily those of HP
>  >>
>
>


/David
