2014-11-24 22:20 GMT+06:00 Andy Riebs <[email protected]>:

> Hi Artem,
>
> You've made my team very happy, indeed! For the short term, I've increased
> those limits to 64MB, which solves the current problem, though I definitely
> prefer your proposal to split the database into several messages.
>
I am glad, Andy :). I will implement the message split in a week or two and
will need your help in testing this.

>
> Best regards,
> Andy
>
> On 11/22/2014 09:11 AM, Artem Polyakov wrote:
>
> Andy, I have good news for you!
>
> 1. I was able to pinpoint the problem on my laptop, and the cause was the
> cumulative database size.
> 2. To check my guess you'll need to change the sources:
> a. go to <slurm-src>/src/common/pack.c
> Increase the macros
>
> #define MAX_PACK_MEM_LEN (16 * 1024 * 1024)
>
> #define MAX_PACK_STR_LEN (16 * 1024 * 1024)
>
> to twice their current values. As you can see, this limit is close to what
> we have with your messages; I think the additional PMI2 headers make the
> DB exceed it. NOTE that you'll need to RECOMPILE and REINSTALL slurm on ALL
> nodes, since this change affects the slurmds!!
>
> While you are checking my guess (it fixes the problem on my laptop), I will
> work on a proper patch that will force PMI2 to split the database into
> several messages if it can't send it in just one.
>
> P.S. FYI, I am working on the PMIx plugin now, and it will use the SLURM
> communication infrastructure too, so it will be affected by the same
> problem. So I am quite interested in this effort.
>
>
> 2014-11-22 3:27 GMT+06:00 Andy Riebs <[email protected]>:
>
>> Thanks Artem! We'll keep you posted on what we find.
>>
>> Andy
>>
>>
>> On 11/21/2014 12:06 PM, Artem Polyakov wrote:
>>
>>
>>
>> 2014-11-21 21:43 GMT+06:00 Artem Polyakov <[email protected]>:
>>
>>> Hello, Andy.
>>>
>>> I am not a SLURM expert, so please treat the following advice as
>>> optional.
>>>
>>> According to the sources, PMI2 fails when trying to broadcast the
>>> cumulative database to the nodes. It tries 5 times with delays that
>>> increase as powers of two:
>>> 1, 2, 4, 8, 16 seconds.
>>>
>>> The cumulative size of the database should be proportional to
>>> 1534*24*437 = 16088592 bytes (about 16 MB), which is not that much.
>>> I attached a small patch to output the exact DB size right before
>>> broadcasting it to the compute nodes (I assume that you can change the
>>> sources).
>>>
>>> One thing that comes to mind is to try half as many tasks with messages
>>> twice as big. 2x437 is still less than 1024 (PMI2_MAX_VALLEN). That
>>> would let us exclude the cumulative DB size as the problem.
>>>
>>> Most probably the problem is in slurm_forward_data. I would do the
>>> following two things:
>>> 1. Enable the 3rd level of debug on the slurmds (SlurmdDebug=3
>>> configuration option) to see what is failing there. You have probably
>>> done that already but didn't mention it in the mail.
>>>
>>
>> Here I was wrong. Since it is the "srun" process that does the final
>> broadcast of the DB, you need to launch your application with the -vvv
>> option to increase its verbosity (srun -vvv pmi2_allgather). I think
>> you'll see a more detailed error report.
>>
>>> 2. Optionally, you can try to play with the slurm fanout tree width
>>> (TreeWidth=10/50/100/whatever... configuration option).
>>>
>>> 2014-11-21 19:20 GMT+06:00 Andy Riebs <[email protected]>:
>>>
>>>> Hi Slurm gurus! We are seeing an issue when launching large rank
>>>> count jobs in our IB cluster using PMI2 and could use your help. When the
>>>> jobs fail, the first line of output seems to have the most useful
>>>> information:
>>>>
>>>> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
>>>>
>>>> One of our Mellanox friends stripped down the test case to just include
>>>> the PMI2 startup code from MPI to help try to isolate the issue further.
>>>> This code just takes one argument which is how many bytes to put in the
>>>> message. When we start up this test code on 1534 nodes with PPN=24, 436
>>>> byte messages will pass, but 437 byte messages will fail. (Maybe those
>>>> numbers will help someone figure this out!)
>>>>
>>>> The only interesting slurm configuration option that we have updated
>>>> is: MessageTimeout=60, but it did not impact this issue. We are looking
>>>> for advice on how to proceed in debugging/troubleshooting this issue
>>>> further.
>>>>
>>>> I have attached the test program to this message.
>>>>
>>>> We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
>>>>
>>>> Andy
>>>>
>>>> --
>>>> Andy Riebs
>>>> Hewlett-Packard Company
>>>> High Performance Computing
>>>> +1 404 648 9024
>>>> My opinions are not necessarily those of HP
>>>>

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
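
A rough check of the numbers discussed in this thread, as a sketch only and not code taken from the Slurm sources. It compares the raw payload (nodes x PPN x value length) with the 16 MiB MAX_PACK_MEM_LEN quoted above, and asks how much per-entry overhead would be needed to cross that limit. The helper names kvs_payload_bytes and overhead_to_exceed are hypothetical, and the idea that the overhead is a fixed number of bytes per entry is an assumption made purely for illustration.

```c
#include <stdint.h>

/* Limit quoted in the thread from src/common/pack.c. */
#define MAX_PACK_MEM_LEN (16 * 1024 * 1024)

/* Hypothetical helper: raw payload if every task publishes one value. */
uint64_t kvs_payload_bytes(uint64_t nodes, uint64_t ppn, uint64_t value_len)
{
    return nodes * ppn * value_len;
}

/*
 * Hypothetical helper: smallest per-entry overhead (key name, separators,
 * headers; ASSUMED constant per entry) that pushes the total past the limit.
 * Returns 0 when the raw payload alone already exceeds the limit.
 */
uint64_t overhead_to_exceed(uint64_t nodes, uint64_t ppn, uint64_t value_len)
{
    uint64_t payload = kvs_payload_bytes(nodes, ppn, value_len);
    uint64_t entries = nodes * ppn;

    if (entries == 0 || payload > MAX_PACK_MEM_LEN)
        return 0;
    /* Smallest o such that payload + entries * o > MAX_PACK_MEM_LEN. */
    return (MAX_PACK_MEM_LEN - payload) / entries + 1;
}
```

For 1534 nodes with PPN=24, 437-byte values give 16088592 bytes of raw payload against a limit of 16777216, so about 19 bytes of overhead per entry would be enough to cross it, while 436-byte values would need 20. Under the constant-overhead assumption, that is in line with the observation that 436-byte messages pass and 437-byte messages fail.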

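The proposal to split the database into several messages can be pictured with the minimal sketch below. It is not the actual PMI2 patch: send_in_chunks, send_fn and count_bytes are hypothetical names, and the sketch cuts at arbitrary byte offsets, whereas a real implementation would presumably have to split on key/value entry boundaries.

```c
#include <stddef.h>

/* Hypothetical callback type: sends one chunk, returns 0 on success. */
typedef int (*send_fn)(const char *buf, size_t len, void *ctx);

/*
 * Send `len` bytes of `db` in pieces of at most `max_chunk` bytes.
 * Returns the number of chunks sent, or -1 on bad input or a failed send.
 */
long send_in_chunks(const char *db, size_t len, size_t max_chunk,
                    send_fn send, void *ctx)
{
    long chunks = 0;
    size_t off = 0;

    if (max_chunk == 0 || send == NULL || (db == NULL && len > 0))
        return -1;

    while (off < len) {
        size_t n = len - off;          /* bytes still to send */
        if (n > max_chunk)
            n = max_chunk;             /* cap this piece at the limit */
        if (send(db + off, n, ctx) != 0)
            return -1;
        off += n;
        chunks++;
    }
    return chunks;
}

/* Example sink for experiments: only adds up the bytes it is given. */
int count_bytes(const char *buf, size_t len, void *ctx)
{
    (void)buf;
    *(size_t *)ctx += len;
    return 0;
}
```

With a 100-byte buffer and a 32-byte limit, this makes four calls to the callback (32 + 32 + 32 + 4 bytes). In the real case the limit would be something below MAX_PACK_MEM_LEN, leaving room for headers.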