Hello,
I'm joining the discussion on pmi2.
1. About the behaviour of PMI2_Init() without --mpi=pmi2: it is indeed
strange that this is treated as a singleton init, especially as the
library is distributed with SLURM and the mistake is an obvious one. Maybe
it should be an error if we are launched by SLURM (detected by checking
$SLURM_JOBID), and a singleton init only otherwise.
Something like:
--- a/contribs/pmi2/pmi2_api.c
+++ b/contribs/pmi2/pmi2_api.c
@@ -196,18 +196,24 @@ int PMI2_Init(int *spawned, int *size, int *rank, int *appnum)
     p = getenv("PMI2_DEBUG");
     if (p) PMI2_debug = atoi(p);
-    /* Get the fd for PMI commands; if none, we're a singleton */
+    /* Get the fd for PMI commands */
     pmi2_errno = getPMIFD();
     if (pmi2_errno) PMI2U_ERR_POP(pmi2_errno);
-
     if (PMI2_fd == -1) {
-        /* Singleton init: Process not started with mpiexec,
-           so set size to 1, rank to 0 */
-        PMI2_size = 1;
-        PMI2_rank = 0;
-        *spawned = 0;
+        if (getenv("SLURM_JOBID")) {
+            /* We're probably missing --mpi=pmi2 : error */
+            pmi2_errno = PMI2_FAIL;
+            PMI2U_ERR_POP(pmi2_errno);
+        } else {
+            /* Singleton init: Process not started with mpiexec,
+               so set size to 1, rank to 0 */
+            PMI2_size = 1;
+            PMI2_rank = 0;
+            *spawned = 0;
+
+            PMI2_initialized = SINGLETON_INIT_BUT_NO_PM;
+        }
-        PMI2_initialized = SINGLETON_INIT_BUT_NO_PM;
         goto fn_exit;
     }
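
To make the intended decision explicit, here is a minimal standalone sketch of the three cases, separate from the patch itself. The enum and the function names (init_mode, classify_init, classify_init_from_env) are hypothetical and only serve the illustration; they are not part of the PMI2 API.

```c
#include <stdlib.h>

/* Hypothetical outcome codes for this sketch; not part of the PMI2 API. */
enum init_mode { INIT_NORMAL, INIT_SINGLETON, INIT_ERROR };

/*
 * Decide how initialization should proceed.
 *   pmi_fd:      descriptor of the PMI channel, or -1 if none was found
 *   slurm_jobid: value of $SLURM_JOBID, or NULL if it is unset
 */
static enum init_mode classify_init(int pmi_fd, const char *slurm_jobid)
{
    if (pmi_fd != -1)
        return INIT_NORMAL;     /* a process manager is talking to us */
    if (slurm_jobid != NULL)
        return INIT_ERROR;      /* under SLURM but no PMI fd: probably missing --mpi=pmi2 */
    return INIT_SINGLETON;      /* started by hand: size 1, rank 0 */
}

/* Convenience wrapper that reads the real environment. */
static enum init_mode classify_init_from_env(int pmi_fd)
{
    return classify_init(pmi_fd, getenv("SLURM_JOBID"));
}
```

With this split, running ./a.out by hand still gives a singleton, while "srun -n 2 ./a.out" without --mpi=pmi2 would fail early instead of silently starting two independent singletons.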
2. Concerning the error message "no value for key in req": we did indeed
introduce an error.
I think the line mentioned below (1488) is right (otherwise the
bounds passed to the following snprintf calls are too big), but the
computation of cmdlen is not:
--- a/contribs/pmi2/pmi2_api.c
+++ b/contribs/pmi2/pmi2_api.c
@@ -1539,7 +1539,7 @@ int PMIi_WriteSimpleCommand( int fd, PMI2_Command *resp, const char cmd[], PMI2_
     }
     /* prepend the buffer length stripping off the trailing '\0' */
-    cmdlen = PMII_MAX_COMMAND_LEN - remaining_len;
+    cmdlen = PMII_MAX_COMMAND_LEN - remaining_len - PMII_COMMANDLEN_SIZE;
     ret = snprintf(cmdlenbuf, sizeof(cmdlenbuf), "%d", cmdlen);
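
The arithmetic can be checked with a small standalone sketch. It is not the real function: the DEMO_ constants, their values, and build_command() are assumptions made up for the illustration, and only the bookkeeping of remaining_len and cmdlen mirrors the code discussed here.

```c
#include <stdio.h>
#include <string.h>

/* Assumed values for illustration; the real constants live in the PMI2 sources. */
#define DEMO_MAX_COMMAND_LEN 256
#define DEMO_COMMANDLEN_SIZE 6

/*
 * Build "<length field><payload>" in buf, which must hold at least
 * DEMO_MAX_COMMAND_LEN bytes. Returns the payload length that was written
 * into the length field, or -1 if something does not fit.
 */
static int build_command(char *buf, const char *cmd)
{
    char lenbuf[DEMO_COMMANDLEN_SIZE + 1];
    char *c = buf;
    int remaining_len = DEMO_MAX_COMMAND_LEN;
    int ret, cmdlen;

    /* leave space for the length field and account for it in remaining_len */
    memset(c, ' ', DEMO_COMMANDLEN_SIZE);
    c += DEMO_COMMANDLEN_SIZE;
    remaining_len -= DEMO_COMMANDLEN_SIZE;

    ret = snprintf(c, (size_t)remaining_len, "cmd=%s;", cmd);
    if (ret < 0 || ret >= remaining_len)
        return -1;
    remaining_len -= ret;

    /* bytes used in total, minus the length field itself = payload length */
    cmdlen = DEMO_MAX_COMMAND_LEN - remaining_len - DEMO_COMMANDLEN_SIZE;

    ret = snprintf(lenbuf, sizeof(lenbuf), "%d", cmdlen);
    if (ret < 0 || ret >= (int)sizeof(lenbuf))
        return -1;
    memcpy(buf, lenbuf, (size_t)ret);  /* left-justified, rest stays blank */
    return cmdlen;
}
```

For the command "fullinit" the payload is "cmd=fullinit;", so the length field should read 13. Without the final subtraction it would read 19, that is, the payload plus the 6 bytes of the length field, so the receiver would expect more bytes than the payload contains.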
Piotr
>
>
> -------- Original message --------
> From: Hongjia Cao <[email protected]>
> Date: 07/25/2013 12:25 AM (GMT-05:00)
> To: slurm-dev <[email protected]>
> Subject: [slurm-dev] Re: Understanding PMI2 support in SLURM 2.6.0
>
>
>
> On Mon, 2013-07-22 at 07:46 -0700, Andy Riebs wrote:
>> Hi,
>>
>> We're trying to understand how PMI2 support works with SLURM, and have
>> come up with a test program (see below) that demonstrates unexpected
>> results.
>>
>> Our questions:
>> 1. What does it mean, when --mpi=pmi2 is _not_ specified, for
>> PMI2_Init() to return success, but leave size,rank,appnum
>> unchanged?
> You can run your program directly (execute "./a.out" instead of "srun -n
> 2 ./a.out") and get similar results. When --mpi=pmi2 is not specified, the
> PMI2 client library assumes that the program is running as a singleton.
> "srun -n 2 a.out" will then run two copies of the program, each with a
> parallel size of 1. You will get 1 in numprocs after calling
> "MPI_Comm_size(MPI_COMM_WORLD,&numprocs);" in MPI programs.
>
>
>> 2. What does it mean, or what is going wrong, when --mpi=pmi2 is
>> specified, to get the “slurmd[hadesn10]: mpi/pmi2: no value for
>> key in req” lines?
>
>
> This message will not appear when running MPI programs. The following is
> a code segment taken from contribs/pmi2/pmi2_api.c (function
> PMIi_WriteSimpleCommand()) of SLURM:
>
> 1481 int pair_index;
> 1482
> 1483 PMI2U_printf("[BEGIN]");
> 1484
> 1485 /* leave space for length field */
> 1486 memset(c, ' ', PMII_COMMANDLEN_SIZE);
> 1487 c += PMII_COMMANDLEN_SIZE;
> 1488 remaining_len -= PMII_COMMANDLEN_SIZE;
> 1489
> 1490 PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno,
> PMI2_ERR_OTHER, "**cmd_too_long");
> 1491
>
> The above line 1488 is missing in the corresponding file
> src/pmi/pmi2/simple2pmi.c of MPICH:
>
> 1429 ssize_t nbytes;
> 1430 ssize_t offset;
> 1431 int pair_index;
> 1432
> 1433 /* leave space for length field */
> 1434 memset(c, ' ', PMII_COMMANDLEN_SIZE);
> 1435 c += PMII_COMMANDLEN_SIZE;
> 1436
> 1437 PMI2U_ERR_CHKANDJUMP(strlen(cmd) > PMI2_MAX_VALLEN, pmi2_errno,
> PMI2_ERR_OTHER, "**cmd_too_long");
> 1438
>
> It is not clear which is correct according to the design documents of
> PMI2 (http://wiki.mpich.org/mpich/index.php/PMI_v2_Wire_Protocol). But
> the mpi/pmi2 plugin in SLURM, which implements the server part of the PMI
> protocol, conforms to the MPICH implementation.
>>
>> Andy
>>
>>
>> The program:
>>
>>
>> --------------------------------------------------------------------------
>>
>>
>> /*
>>
>> Using SLURM 2.6.0
>>
>> To build and run it:
>>
>>
>> cc -Wall pmi2-001.c -I/opt/slurm/include -L/opt/slurm/lib64 -lpmi2
>> && echo "" &&
>> srun -n 2 ./a.out && echo "" && srun --mpi=pmi2 -n 2 ./a.out
>>
>> Sample output:
>>
>>
>> Init => 0; spawned = 0, size = -99, rank = -99, appnum = -99
>>
>> Job_GetId => 14; id =
>>
>> Init => 0; spawned = 0, size = -99, rank = -99, appnum = -99
>>
>> Job_GetId => 14; id =
>>
>>
>>
>> slurmd[hadesn10]: mpi/pmi2: no value for key in req
>>
>> Init => 0; spawned = 0, size = 2, rank = 1, appnum = -1
>>
>> Job_GetId => 0; id = 252.0
>>
>> slurmd[hadesn10]: mpi/pmi2: no value for key in req
>>
>> Init => 0; spawned = 0, size = 2, rank = 0, appnum = -1
>>
>> Job_GetId => 0; id = 252.0
>>
>>
>>
>> */
>>
>>
>>
>> #include <stdio.h>
>>
>> #include "slurm/pmi2.h"
>>
>>
>>
>> int
>>
>> main()
>>
>> {
>>
>> int ret;
>>
>> {
>>
>> int spawned = -99, size = -99, rank = -99, appnum = -99;
>>
>> ret = PMI2_Init(&spawned, &size, &rank, &appnum);
>>
>> printf("Init => %d; spawned = %d, size = %d, rank = %d, appnum = %d\n",
>>
>> ret, spawned, size, rank, appnum);
>>
>> }
>>
>> {
>>
>> char id[PMI2_MAX_KEYLEN];
>>
>> ret = PMI2_Job_GetId(id, sizeof(id));
>>
>> printf("Job_GetId => %d; id = %s\n", ret, id);
>>
>> }
>>
>> fflush(NULL);
>>
>> return 0;
>>
>> }
>>
>>
>> --------------------------------------------------------------------------
>> --
>> Andy Riebs
>> Hewlett-Packard Company
>> High Performance Computing
>> +1 404 648 9024
>> My opinions are not necessarily those of HP
>>