[slurm-dev] Re: Problem with PMI2 in Slurm 14.03.10

Artem Polyakov Mon, 24 Nov 2014 17:18:51 -0800

2014-11-24 22:20 GMT+06:00 Andy Riebs <[email protected]>:

>  Hi Artem,
>
> You've made my team very happy, indeed! For the short term, I've increased
> those limits to 64MB, which solves the current problem, though I definitely
> prefer your proposal to split the database into several messages.
>


I am glad, Andy :).
I will implement the message split in a week or two and will need your help
testing it.


>
> Best regards,
> Andy
>
> On 11/22/2014 09:11 AM, Artem Polyakov wrote:
>
> Andy, I have good news for you!
>
>  1. I was able to pinpoint the problem on my laptop, and the cause is the
> cumulative database size.
> 2. To check my guess, you'll need to change the sources:
> a. Go to <slurm-src>/src/common/pack.c
> and increase the limits
>
> #define MAX_PACK_MEM_LEN      (16 * 1024 * 1024)
>
> #define MAX_PACK_STR_LEN      (16 * 1024 * 1024)
>
>  to twice their current values. As you can see, this limit is close to the
> size of your messages; I think the additional PMI2 headers push the DB over
> this limit. NOTE: you'll need to RECOMPILE and REINSTALL Slurm on ALL
> nodes, since this change affects the slurmds!
>
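To put numbers on how close the failing case is to this limit, here is a small Python sketch. It assumes one KVS value per task (1534 nodes with 24 tasks per node, as reported at the start of the thread) and treats everything beyond the raw values as per-entry overhead. The function names are hypothetical, and the headroom figures are only what the 436/437-byte threshold would imply if the 16 MiB limit is indeed the cause; they are not taken from the Slurm sources.

```python
# Figures reported in this thread.
NODES = 1534
TASKS_PER_NODE = 24
MAX_PACK_MEM_LEN = 16 * 1024 * 1024  # limit quoted from src/common/pack.c


def payload_bytes(value_len: int) -> int:
    """Raw value bytes if every task contributes one value (assumption)."""
    return NODES * TASKS_PER_NODE * value_len


def headroom_per_entry(value_len: int) -> float:
    """Bytes left under the limit per entry after counting the raw values."""
    entries = NODES * TASKS_PER_NODE
    return (MAX_PACK_MEM_LEN - payload_bytes(value_len)) / entries


if __name__ == "__main__":
    for value_len in (436, 437):
        print(value_len, payload_bytes(value_len),
              round(headroom_per_entry(value_len), 2))
```

Both raw payloads (16,051,776 and 16,088,592 bytes) are below 16,777,216 bytes, so under these assumptions the failure at 437 bytes would correspond to a per-entry overhead somewhere between roughly 18.7 and 19.7 bytes.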
>  While you check my guess (it fixes the problem on my laptop), I will
> work on a proper patch that makes PMI2 split the database into
> several messages when it can't send it in just one.
>
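The splitting idea can be illustrated with a generic greedy chunking sketch in Python. This is not the patch discussed here and does not use any Slurm API; `split_into_messages` and its parameters are hypothetical names, and entries are modelled as plain byte strings.

```python
from typing import Iterable, List


def split_into_messages(entries: Iterable[bytes],
                        max_len: int) -> List[List[bytes]]:
    """Greedily group entries so each group's total size stays <= max_len."""
    messages: List[List[bytes]] = []
    current: List[bytes] = []
    current_len = 0
    for entry in entries:
        if len(entry) > max_len:
            raise ValueError("single entry exceeds the message size limit")
        # Start a new message when the next entry would overflow this one.
        if current and current_len + len(entry) > max_len:
            messages.append(current)
            current, current_len = [], 0
        current.append(entry)
        current_len += len(entry)
    if current:
        messages.append(current)
    return messages


if __name__ == "__main__":
    demo = [b"a" * 4, b"b" * 4, b"c" * 4]
    print([len(m) for m in split_into_messages(demo, max_len=8)])
```

A real implementation would also have to account for per-message headers and for the receivers reassembling the pieces, which this sketch leaves out.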
>  P.S. FYI, I am working on a PMIx plugin now, and it will use the SLURM
> communication infrastructure too, so it will be affected by the same
> problem. So I am quite interested in this effort.
>
>
> 2014-11-22 3:27 GMT+06:00 Andy Riebs <[email protected]>:
>
>>  Thanks Artem! We'll keep you posted on what we find.
>>
>> Andy
>>
>>
>> On 11/21/2014 12:06 PM, Artem Polyakov wrote:
>>
>>
>>
>> 2014-11-21 21:43 GMT+06:00 Artem Polyakov <[email protected]>:
>>
>>>  Hello, Andy.
>>>
>>>  I am not a SLURM expert, so please treat the following advice as
>>> optional.
>>>
>>>  According to the sources, PMI2 fails when trying to broadcast the
>>> cumulative database to the nodes. It tries 5 times, with delays that
>>> increase as powers of two:
>>> 1, 2, 4, 8, 16 seconds.
>>>
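The retry schedule described above can be sketched in Python as follows. This is a generic illustration, not the Slurm code: `backoff_delays` and `send_with_retry` are hypothetical names, and whether the real code sleeps after the final failed attempt is not stated in this thread.

```python
import time
from typing import Callable, List


def backoff_delays(attempts: int = 5, first_delay: int = 1) -> List[int]:
    """Delays that double each time: 1, 2, 4, 8, 16 for five attempts."""
    return [first_delay * 2 ** i for i in range(attempts)]


def send_with_retry(send: Callable[[], bool],
                    delays: List[int],
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() once per delay, sleeping after each failed attempt."""
    for delay in delays:
        if send():
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    print(backoff_delays(), sum(backoff_delays()))
```

With this schedule, five consecutive failures add up to 31 seconds of waiting in total.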
>>>  The cumulative size of the database should be proportional to
>>> 1534*24*437 = 16088592 bytes, or about 16 MB, which is not that much. I
>>> attached a small patch that prints the exact DB size right before it is
>>> broadcast to the compute nodes (I assume that you can change the sources).
>>>
>>>  One thing that comes to mind: what if you try half as many tasks with
>>> messages twice as large? 2x437 is still less than 1024
>>> (PMI2_MAX_VALLEN). That way we can rule the cumulative DB size in or out
>>> as the cause.
>>>
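The proposed experiment keeps the raw cumulative size constant while halving the number of entries. A small Python check, assuming one value per task and the node and task counts reported in this thread (the function names are hypothetical):

```python
PMI2_MAX_VALLEN = 1024  # value quoted in this thread


def cumulative_bytes(tasks: int, value_len: int) -> int:
    """Raw value bytes, assuming each task contributes one value."""
    return tasks * value_len


def fits_value_limit(value_len: int) -> bool:
    """True if a single value stays below PMI2_MAX_VALLEN."""
    return value_len < PMI2_MAX_VALLEN


if __name__ == "__main__":
    full = cumulative_bytes(1534 * 24, 437)
    halved = cumulative_bytes(1534 * 24 // 2, 2 * 437)
    print(full, halved, fits_value_limit(2 * 437))
```

Only the raw values stay the same size; any per-entry overhead would be halved along with the number of tasks, so the total database would be slightly smaller in the halved run.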
>>>  Most probably the problem is in slurm_forward_data. I would do the
>>> following two things:
>>> 1. Enable the 3rd level of debug on the slurmds (SlurmdDebug=3
>>> configuration option) to see what is failing there. You have probably
>>> done that already but didn't mention it in your mail.
>>>
>>
>>  Here I was wrong. Since it is the "srun" process that does the final
>> broadcast of the DB, you need to launch your application with the -vvv
>> option to increase its verbosity (srun -vvv pmi2_allgather). I think
>> you'll see a more detailed error report.
>>
>>   2. Optionally, you can experiment with the Slurm fanout tree width
>>> (TreeWidth=10/50/100/whatever... configuration option).
>>>
>>> 2014-11-21 19:20 GMT+06:00 Andy Riebs <[email protected]>:
>>>
>>>> Hi Slurm gurus!   We are seeing an issue when launching large rank
>>>> count jobs in our IB cluster using PMI2 and could use your help. When the
>>>> jobs fail, the first line of output seems to have the most useful
>>>> information:
>>>>
>>>> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
>>>>
>>>> One of our Mellanox friends stripped down the test case to just include
>>>> the PMI2 startup code from MPI to help try to isolate the issue further.
>>>> This code just takes one argument which is how many bytes to put in the
>>>> message.  When we start up this test code on 1534 nodes with PPN=24, 436
>>>> byte messages will pass, but 437 byte messages will fail.  (Maybe those
>>>> numbers will help someone figure this out!)
>>>>
>>>> The only interesting slurm configuration option that we have updated
>>>> is: MessageTimeout=60, but it did not impact this issue.   We are looking
>>>> for advice on how to proceed in debugging/troubleshooting this issue
>>>> further.
>>>>
>>>> I have attached the test program to this message.
>>>>
>>>> We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
>>>>
>>>> Andy
>>>>
>>>> --
>>>> Andy Riebs
>>>> Hewlett-Packard Company
>>>> High Performance Computing
>>>> +1 404 648 9024
>>>> My opinions are not necessarily those of HP
>>>>
>>>>
>>>
>>>
>>>   --
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov
>>>
>>
>>
>>
>>  --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>>
>>
>
>
>  --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>
>
>


-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
