Fantastic, Artem!
As you might imagine, my opportunities to test Slurm changes on
thousand-node clusters are limited, so it may take a couple of days,
but I'll let you know as soon as I can.
Andy
On 11/22/2014 09:11 AM, Artem Polyakov wrote:
Re: [slurm-dev] Re: Problem with PMI2 in Slurm 14.03.10
Andy, I have good news for you!
1. I was able to pinpoint the problem on my laptop: the cause is the
cumulative database size.
2. To check my guess, you'll need to change the sources:
a. Go to <slurm-src>/src/common/pack.c
Increase the constants
#define MAX_PACK_MEM_LEN (16 * 1024 * 1024)
#define MAX_PACK_STR_LEN (16 * 1024 * 1024)
to twice their current values. As you can see, this limit is close to the
size of the database your messages produce; I think the additional PMI2
headers push the DB over this limit. NOTE: you'll need to RECOMPILE and
REINSTALL Slurm on ALL nodes, since this change affects the slurmds!
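To make the "near the limit" argument concrete, here is a rough sketch in Python (not Slurm code). It assumes the limit applies to the whole packed database and that every entry adds the same fixed overhead for keys and headers; both are guesses made for illustration only. Under those assumptions, 436-byte values passing and 437-byte values failing would correspond to a per-entry overhead between about 18.7 and 19.7 bytes.

```python
# Assumption: the pack limit applies to the whole packed database, and each
# entry carries a fixed number of extra bytes (key, length fields, headers).
MAX_PACK_MEM_LEN = 16 * 1024 * 1024  # value quoted above from pack.c

NODES = 1534           # from the original report
TASKS_PER_NODE = 24    # PPN from the original report


def overhead_budget(value_len, limit=MAX_PACK_MEM_LEN):
    """Bytes of per-entry overhead that would still fit under `limit`."""
    entries = NODES * TASKS_PER_NODE
    payload = entries * value_len
    return (limit - payload) / entries


if __name__ == "__main__":
    for value_len in (436, 437):
        print(value_len, "byte values ->",
              round(overhead_budget(value_len), 2), "bytes of headroom per entry")
    print("437 byte values, doubled limit ->",
          round(overhead_budget(437, 2 * MAX_PACK_MEM_LEN), 2))
```

Doubling the limit leaves several hundred bytes of headroom per entry in this model, which is why the doubled constants should be a safe test of the guess.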
While you are checking my guess (the change fixes the problem on my laptop),
I will work on a proper patch that makes PMI2 split the database into several
messages if it can't send it in one.
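The splitting idea can be sketched as follows. This is only an illustration in Python of greedy chunking under a size limit, not the actual patch; the function name `split_database` and the representation of entries as byte strings are hypothetical.

```python
def split_database(entries, max_bytes):
    """Group byte-string entries into chunks whose total size is <= max_bytes.

    Hypothetical helper: entries stay in order and are never split themselves.
    """
    chunks, current, current_size = [], [], 0
    for entry in entries:
        if len(entry) > max_bytes:
            raise ValueError("a single entry exceeds the message limit")
        # Start a new chunk when adding this entry would overflow the current one.
        if current and current_size + len(entry) > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += len(entry)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = [b"x" * 437] * 10
    for i, chunk in enumerate(split_database(demo, 1000)):
        print("message", i, "carries", sum(len(e) for e in chunk), "bytes")
```

Each chunk would then be sent as its own message, so no single message has to exceed the pack limit.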
P.S. FYI, I am working on a PMIx plugin now. It will use the Slurm
communication infrastructure too, so it will be affected by the same problem.
That is why I am quite interested in this effort.
2014-11-22 3:27 GMT+06:00 Andy Riebs <[email protected]>:
Thanks Artem! We'll keep you posted on what we find.
Andy
On 11/21/2014 12:06 PM, Artem Polyakov wrote:
2014-11-21 21:43 GMT+06:00 Artem Polyakov <[email protected]>:
Hello, Andy.
I am not a Slurm expert, so please treat the following advice as
optional.
According to the sources, PMI2 fails when trying to broadcast the
cumulative database to the nodes. It tries 5 times with delays that
increase as powers of two: 1, 2, 4, 8, 16 seconds.
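As a sketch of that retry pattern (written in Python for illustration, not taken from the Slurm sources): the helper names are hypothetical, and sleeping after every failed attempt, including the last one, is an assumption. In this model a job that fails all five attempts spends 31 seconds waiting in total.

```python
import time


def backoff_delays(attempts=5, base=1):
    """Delays in seconds that double each time: 1, 2, 4, 8, 16 for 5 attempts."""
    return [base * 2 ** i for i in range(attempts)]


def send_with_retries(send, attempts=5, sleep=time.sleep):
    """Call send() until it returns True, sleeping after each failed attempt.

    Hypothetical sketch: `send` stands in for the broadcast of the database.
    """
    for delay in backoff_delays(attempts):
        if send():
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    print(backoff_delays())       # the delay schedule
    print(sum(backoff_delays()))  # total time spent waiting if every try fails
```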
The cumulative size of the database should be proportional to
1534*24*437 = 16088592 bytes, or about 16 MB, which is not that much.
I attached a small patch that prints the exact DB size right before it
is broadcast to the compute nodes (I assume that you can change the
sources).
One thing that comes to mind: what if you try half as many tasks with
messages twice as large? 2x437 is still less than 1024
(PMI2_MAX_VALLEN). That way we can rule the cumulative DB size problem
in or out.
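The arithmetic behind this experiment, as a small Python sketch. Halving the task count by running 12 tasks per node is an assumption (halving the node count would give the same totals), and the function name is hypothetical. The point is that the raw payload stays the same while the number of entries halves.

```python
PMI2_MAX_VALLEN = 1024  # per-value limit named in the message above


def payload_bytes(nodes, tasks_per_node, value_len):
    """Raw value bytes in the database, ignoring keys and headers."""
    if value_len >= PMI2_MAX_VALLEN:
        raise ValueError("value would not fit within PMI2_MAX_VALLEN")
    return nodes * tasks_per_node * value_len


if __name__ == "__main__":
    original = payload_bytes(1534, 24, 437)      # the failing configuration
    experiment = payload_bytes(1534, 12, 2 * 437)  # assumed: 12 tasks per node
    print("original:  ", 1534 * 24, "entries,", original, "bytes")
    print("experiment:", 1534 * 12, "entries,", experiment, "bytes")
```

If the experiment fails in the same way, the total size is the likely trigger; if it passes, the per-entry overhead or the entry count matters as well.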
Most probably the problem is in slurm_forward_data. I would do the
following two things:
1. Enable the 3rd level of debug on the slurmds (SlurmdDebug=3
configuration option) to see what is failing there. You have probably
done that already but didn't mention it in your mail.
Here I was wrong. Since it is the "srun" process that does the final
broadcast of the DB, you need to launch your application with the -vvv
option to increase its verbosity (srun -vvv pmi2_allgather). I think
you'll see a more detailed error report.
2. Optionally, you can try to play with the Slurm fanout tree width
(TreeWidth=10/50/100/whatever... configuration option).
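To get a feel for what the width changes, here is an idealized Python model, not Slurm's actual forwarding algorithm: it assumes a full tree in which every sender forwards to exactly `tree_width` receivers, and the function name is hypothetical. It shows how many forwarding levels such a tree needs to reach 1534 nodes.

```python
def forwarding_levels(nodes, tree_width):
    """Levels a full fanout tree needs to reach `nodes` nodes (idealized model)."""
    if nodes < 1 or tree_width < 1:
        raise ValueError("nodes and tree_width must be positive")
    reached, level_size, levels = 0, 1, 0
    while reached < nodes:
        level_size *= tree_width  # each node on the last level forwards to tree_width more
        reached += level_size
        levels += 1
    return levels


if __name__ == "__main__":
    for width in (10, 50, 100):
        print("TreeWidth", width, "->", forwarding_levels(1534, width), "levels")
```

A wider tree means fewer hops but more connections per sender, so the failure may move or disappear as the width changes.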
2014-11-21 19:20 GMT+06:00 Andy Riebs <[email protected]>:
Hi Slurm gurus! We are seeing an issue when launching large rank count
jobs in our IB cluster using PMI2 and could use your help. When the
jobs fail, the first line of output seems to have the most useful
information:
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
One of our Mellanox friends stripped down the test case to just include
the PMI2 startup code from MPI to help try to isolate the issue
further. This code just takes one argument which is how many bytes to
put in the message. When we start up this test code on 1534 nodes with
PPN=24, 436 byte messages will pass, but 437 byte messages will fail.
(Maybe those numbers will help someone figure this out!)
The only interesting Slurm configuration option that we have updated
is: MessageTimeout=60, but it did not impact this issue. We are looking
for advice on how to proceed in debugging/troubleshooting this issue
further.
I have attached the test program to this message.
We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
Andy
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
--
Best regards, Artem Y. Polyakov