*Update*: I have created the patch and the ticket for the subject problem: http://bugs.schedmd.com/show_bug.cgi?id=1282
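
For readers skimming the thread below, the idea of splitting the database into several messages can be illustrated with a minimal sketch. This is not the patch attached to the ticket. The function names `entry_size` and `split_into_messages`, the (key, value) layout, and the fixed 8-byte per-entry overhead are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Entry = Tuple[bytes, bytes]  # hypothetical (key, value) pair


def entry_size(entry: Entry, overhead: int = 8) -> int:
    """Bytes one entry is assumed to occupy; `overhead` is an illustrative guess."""
    key, value = entry
    return len(key) + len(value) + overhead


def split_into_messages(
    entries: Iterable[Entry], max_bytes: int, overhead: int = 8
) -> Iterator[List[Entry]]:
    """Greedily group entries so that each group stays within max_bytes."""
    batch: List[Entry] = []
    used = 0
    for entry in entries:
        size = entry_size(entry, overhead)
        if size > max_bytes:
            raise ValueError("a single entry exceeds the message limit")
        # Start a new message when the next entry would not fit.
        if batch and used + size > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(entry)
        used += size
    if batch:
        yield batch


if __name__ == "__main__":
    demo = [(f"k{i}".encode(), b"x" * 10) for i in range(5)]
    for number, message in enumerate(split_into_messages(demo, max_bytes=50)):
        print(number, [key.decode() for key, _ in message])
```

The greedy grouping keeps entry order and never splits a single entry, so a receiver could rebuild the database by concatenating the messages in order.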
2014-11-25 7:18 GMT+06:00 Artem Polyakov <[email protected]>:

> 2014-11-24 22:20 GMT+06:00 Andy Riebs <[email protected]>:
>
>> Hi Artem,
>>
>> You've made my team very happy, indeed! For the short term, I've
>> increased those limits to 64MB, which solves the current problem, though I
>> definitely prefer your proposal to split the database into several messages.
>>
>
> I am glad, Andy :).
> I will implement the message split in a week or two and will need your
> help in testing this.
>
>>
>> Best regards,
>> Andy
>>
>> On 11/22/2014 09:11 AM, Artem Polyakov wrote:
>>
>> Andy, I have good news for you!
>>
>> 1. I was able to pinpoint the problem on my laptop, and the cause is
>>    the cumulative database size.
>> 2. To check my guess you'll need to change the sources:
>>    a. Go to <slurm-src>/src/common/pack.c and double the values of
>>       these macros:
>>
>>       #define MAX_PACK_MEM_LEN (16 * 1024 * 1024)
>>
>>       #define MAX_PACK_STR_LEN (16 * 1024 * 1024)
>>
>> As you can see, this limit is close to the size of your messages. I think
>> the additional PMI2 headers push the DB over this limit. NOTE: you'll need
>> to RECOMPILE and REINSTALL Slurm on ALL nodes, since this change affects
>> the slurmds!
>>
>> While you are checking my guess (it fixes the problem on my laptop), I will
>> work on a proper patch that will make PMI2 split the database into
>> several messages if it can't send it in just one.
>>
>> P.S. FYI, I am working on a PMIx plugin now, and it will use the SLURM
>> communication infrastructure too, so it will be affected by the same
>> problem. So I am quite interested in this effort.
>>
>>
>> 2014-11-22 3:27 GMT+06:00 Andy Riebs <[email protected]>:
>>
>>> Thanks Artem! We'll keep you posted on what we find.
>>>
>>> Andy
>>>
>>>
>>> On 11/21/2014 12:06 PM, Artem Polyakov wrote:
>>>
>>> 2014-11-21 21:43 GMT+06:00 Artem Polyakov <[email protected]>:
>>>
>>>> Hello, Andy.
>>>>
>>>> I am not a SLURM expert, so please treat the following advice as
>>>> optional.
>>>>
>>>> According to the sources, PMI2 fails when trying to broadcast the
>>>> cumulative database to the nodes. It tries 5 times, with delays that
>>>> increase as powers of two:
>>>> 1, 2, 4, 8, 16 seconds.
>>>>
>>>> The cumulative size of the database should be proportional to
>>>> 1534*24*437 = 16088592 bytes, about 16 MB, which is not that much. I
>>>> attached a small patch that outputs the exact DB size right before
>>>> broadcasting it to the compute nodes (I assume that you can change the
>>>> sources).
>>>>
>>>> One thing that comes to mind: what if you try half as many tasks with
>>>> messages twice as big? 2x437 is still less than 1024 (PMI2_MAX_VALLEN).
>>>> This would let us rule out a cumulative DB size problem.
>>>>
>>>> Most probably the problem is in slurm_forward_data. I would do the
>>>> following two things:
>>>> 1. Enable the 3rd level of debug on the slurmds (SlurmdDebug=3
>>>> configuration option) to see what is failing there. You have probably
>>>> done that already but didn't mention it in your mail.
>>>>
>>>
>>> Here I was wrong. Since it is the "srun" process that does the final
>>> broadcast of the DB, you need to launch your application with the -vvv
>>> option to increase its verbosity (srun -vvv pmi2_allgather). I think
>>> you'll see a more detailed error report.
>>>
>>>> 2. Optionally, you can try to play with the Slurm fanout tree width
>>>> (TreeWidth=10/50/100/whatever... configuration option).
>>>>
>>>> 2014-11-21 19:20 GMT+06:00 Andy Riebs <[email protected]>:
>>>>
>>>>> Hi Slurm gurus! We are seeing an issue when launching large rank
>>>>> count jobs in our IB cluster using PMI2 and could use your help.
>>>>> When the jobs fail, the first line of output seems to have the most
>>>>> useful information:
>>>>>
>>>>> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
>>>>>
>>>>> One of our Mellanox friends stripped down the test case to just
>>>>> include the PMI2 startup code from MPI to help try to isolate the issue
>>>>> further. This code just takes one argument which is how many bytes to put
>>>>> in the message. When we start up this test code on 1534 nodes with
>>>>> PPN=24, 436 byte messages will pass, but 437 byte messages will fail.
>>>>> (Maybe those numbers will help someone figure this out!)
>>>>>
>>>>> The only interesting slurm configuration option that we have updated
>>>>> is: MessageTimeout=60, but it did not impact this issue. We are looking
>>>>> for advice on how to proceed in debugging/troubleshooting this issue
>>>>> further.
>>>>>
>>>>> I have attached the test program to this message.
>>>>>
>>>>> We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
>>>>>
>>>>> Andy
>>>>>
>>>>> --
>>>>> Andy Riebs
>>>>> Hewlett-Packard Company
>>>>> High Performance Computing
>>>>> +1 404 648 9024
>>>>> My opinions are not necessarily those of HP
>>>>>
>>>>
>>>> --
>>>> С Уважением, Поляков Артем Юрьевич
>>>> Best regards, Artem Y. Polyakov
>>>>
>>>
>>> --
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov
>>>
>>
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
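
The size estimate discussed in the thread can be checked with a few lines of arithmetic. The sketch below only uses numbers quoted above (1534 nodes, PPN=24, 436 and 437 byte values, and the 16 * 1024 * 1024 limit from pack.c). The model behind `overhead_range`, one value per task plus a fixed per-entry overhead, is an assumption, since the thread does not describe the actual PMI2 wire format, and both function names are made up for this example.

```python
# Limit quoted in the thread from <slurm-src>/src/common/pack.c.
MAX_PACK_MEM_LEN = 16 * 1024 * 1024


def payload_bytes(nodes: int, ppn: int, value_len: int) -> int:
    """Raw value bytes contributed by all tasks, ignoring any headers."""
    return nodes * ppn * value_len


def overhead_range(nodes, ppn, last_ok, first_bad, limit=MAX_PACK_MEM_LEN):
    """Per-entry overhead (in bytes) implied by 'last_ok passes, first_bad fails'.

    Assumes total size = tasks * (value_len + overhead), which is only a
    simplified model and not taken from the Slurm sources.
    """
    tasks = nodes * ppn
    budget_per_task = limit / tasks
    return budget_per_task - first_bad, budget_per_task - last_ok


if __name__ == "__main__":
    for value_len in (436, 437):
        size = payload_bytes(1534, 24, value_len)
        print(value_len, size, MAX_PACK_MEM_LEN - size)
    low, high = overhead_range(1534, 24, last_ok=436, first_bad=437)
    print(f"implied per-entry overhead: {low:.2f} to {high:.2f} bytes")
```

Under this simplified model, the raw values alone stay below the limit in both cases, so any failure at 437 bytes would have to come from the extra bytes sent along with each entry, which is consistent with the guess about PMI2 headers.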
