I've doubled the size as Artem suggested in this commit (in v14.03.11):
https://github.com/SchedMD/slurm/commit/c83e2e7dd8f88cb1cea862652b5a07f2ed528fde
In reviewing the code, I found that many buffer overflow checks
just returned an error code without a clear message about what
happened. I corrected that here by adding a number of error messages.
This should help identify the source of such problems more quickly
in the future (in v14.11.1):
https://github.com/SchedMD/slurm/commit/b8773a95b34c7c6215239547d17fe996dec8997c
Quoting Artem Polyakov:
Andy, I have good news for you!
1. I was able to pinpoint the problem on my laptop, and the cause is the
cumulative database size.
2. To check my guess you'll need to change the sources:
a. Go to <slurm-src>/src/common/pack.c and double the values of
#define MAX_PACK_MEM_LEN (16 * 1024 * 1024)
#define MAX_PACK_STR_LEN (16 * 1024 * 1024)
As you can see, this limit is close to the database size that your
messages produce; I think the additional PMI2 headers push the DB
over this limit. NOTE: you'll need to RECOMPILE and REINSTALL slurm
on ALL nodes, since this change affects the slurmds!
While you are checking my guess (the change fixes the problem on my
laptop), I will work on a proper patch that makes PMI2 split the
database into several messages if it can't send it in just one.
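
The splitting idea described above can be sketched as follows. This is
only an illustration in Python, not the actual PMI2 patch:
`split_payload` and `MAX_MSG_LEN` are hypothetical names, and the
16 MiB value simply mirrors the MAX_PACK_MEM_LEN define quoted above.

```python
# Hypothetical sketch: split a large buffer into pieces that each fit
# under a per-message limit. Not Slurm code; names are illustrative.
MAX_MSG_LEN = 16 * 1024 * 1024  # mirrors MAX_PACK_MEM_LEN from the thread


def split_payload(payload: bytes, max_len: int = MAX_MSG_LEN) -> list[bytes]:
    """Return consecutive chunks of payload, each at most max_len bytes."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [payload[i:i + max_len] for i in range(0, len(payload), max_len)]


if __name__ == "__main__":
    db = bytes(40)  # tiny stand-in for the cumulative database
    chunks = split_payload(db, max_len=16)
    print([len(c) for c in chunks])
```

A real patch would presumably need to split on key/value boundaries
rather than at arbitrary byte offsets; the sketch only shows the size
bookkeeping.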
P.S. FYI, I am working on the PMIx plugin now, and it will use the SLURM
communication infrastructure too, so it will be affected by the same
problem. So I am quite interested in this effort.
2014-11-22 3:27 GMT+06:00 Andy Riebs:
Thanks Artem! We'll keep you posted on what we find.
Andy
On 11/21/2014 12:06 PM, Artem Polyakov wrote:
2014-11-21 21:43 GMT+06:00 Artem Polyakov:
Hello, Andy.
I am not a SLURM expert, so please treat the following advice as
optional.
According to the sources, PMI2 fails when trying to broadcast the
cumulative database to the nodes. It tries 5 times, with delays that
increase as powers of two:
1, 2, 4, 8, 16 seconds.
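
The retry schedule described above (five attempts, delays doubling from
1 second) can be sketched as below. This is an illustrative Python
sketch, not the PMI2 source: `send_with_retries` and the `send` callback
are hypothetical, and whether the real code sleeps after the final
failed attempt is not stated in this thread.

```python
import time
from typing import Callable


def backoff_delays(attempts: int = 5, base: float = 1.0) -> list[float]:
    """Delays that grow as powers of two: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(attempts)]


def send_with_retries(send: Callable[[], bool],
                      attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() up to `attempts` times, sleeping between failures."""
    for delay in backoff_delays(attempts):
        if send():
            return True
        sleep(delay)  # assumption: wait after each failed attempt
    return False
```

Under that assumption, a broadcast that never succeeds spends up to
1 + 2 + 4 + 8 + 16 = 31 seconds waiting before the error is reported.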
The cumulative size of the database should be proportional to
1534*24*437 = 16088592 bytes, about 16 MB, which is not that much. I
attached a small patch that outputs the exact DB size right before
broadcasting it to the compute nodes (I assume that you can change the
sources).
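
To make the size estimate concrete, the sketch below computes the raw
payload for the two message sizes mentioned in this thread (436 and 437
bytes) and compares them with a 16 MiB limit. It is back-of-the-envelope
Python, not Slurm code: it assumes one value per task and ignores key
names and PMI2 headers, whose exact size is not given here.

```python
NODES = 1534
PPN = 24                   # tasks per node
LIMIT = 16 * 1024 * 1024   # value of MAX_PACK_MEM_LEN in the thread


def raw_db_size(value_len: int, nodes: int = NODES, ppn: int = PPN) -> int:
    """Bytes of values alone: one value of value_len bytes per task."""
    return nodes * ppn * value_len


def headroom_per_task(value_len: int) -> float:
    """Bytes per task left under LIMIT for keys and headers."""
    tasks = NODES * PPN
    return (LIMIT - raw_db_size(value_len)) / tasks


if __name__ == "__main__":
    for n in (436, 437):
        print(n, raw_db_size(n), round(headroom_per_task(n), 2))
```

Both raw sizes (16051776 and 16088592 bytes) are below 16 MiB, with
roughly 19 bytes per task to spare, so a per-entry overhead of that
order would be enough to push the total across the limit.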
One thing that comes to mind: what if you try half as many tasks with
messages twice as big? 2x437 is still less than 1024
(PMI2_MAX_VALLEN). That would let us exclude the cumulative DB size as
the cause.
Most probably the problem is in slurm_forward_data. I would do the
following two things:
1. Enable the 3rd level of debug on the slurmds (SlurmdDebug=3
configuration option) to see what is failing there. You have probably
done that already but didn't mention it in the mail.
Here I was wrong. Since it is the "srun" process that does the final
broadcast of the DB, you need to launch your application with the -vvv
option to increase its verbosity (srun -vvv pmi2_allgather). I think
you'll see a more detailed error report.
2. Optionally you can try to play with slurm fanout tree width
(TreeWidth=10/50/100/whatever... configuration option).
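
As a rough way to think about the TreeWidth option, the sketch below
counts how many forwarding levels an idealized fanout tree needs to
reach every node. This is a simplified model written for illustration,
not Slurm's actual forwarding algorithm; `fanout_levels` is a
hypothetical helper.

```python
def fanout_levels(nodes: int, width: int) -> int:
    """Levels needed if every node forwards to at most `width` children."""
    if nodes < 0 or width < 1:
        raise ValueError("nodes must be >= 0 and width >= 1")
    levels, reached, frontier = 0, 0, 1  # frontier starts at the sender
    while reached < nodes:
        frontier *= width      # each node on the previous level fans out
        reached += frontier
        levels += 1
    return levels
```

For the 1534-node job in this thread, this model gives 4 levels at
width 10 and 2 levels at widths 50 and 100.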
2014-11-21 19:20 GMT+06:00 Andy Riebs:
Hi Slurm gurus! We are seeing an issue when launching large rank count
jobs in our IB cluster using PMI2 and could use your help. When the jobs
fail, the first line of output seems to have the most useful information:
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
One of our Mellanox friends stripped down the test case to just include
the PMI2 startup code from MPI to help try to isolate the issue further.
This code just takes one argument which is how many bytes to put in the
message. When we start up this test code on 1534 nodes with PPN=24, 436
byte messages will pass, but 437 byte messages will fail. (Maybe those
numbers will help someone figure this out!)
The only interesting slurm configuration option that we have updated is:
MessageTimeout=60, but it did not impact this issue. We are looking for
advice on how to proceed in debugging/troubleshooting this issue further.
I have attached the test program to this message.
We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
Andy
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
--
Morris "Moe" Jette
CTO, SchedMD LLC