[slurm-dev] Re: Problem with PMI2 in Slurm 14.03.10

Artem Polyakov Mon, 01 Dec 2014 07:55:19 -0800

*Update*: I have created the patch and the ticket for the subject problem:
http://bugs.schedmd.com/show_bug.cgi?id=1282


2014-11-25 7:18 GMT+06:00 Artem Polyakov <[email protected]>:

>
> 2014-11-24 22:20 GMT+06:00 Andy Riebs <[email protected]>:
>
>>  Hi Artem,
>>
>> You've made my team very happy, indeed! For the short term, I've
>> increased those limits to 64MB, which solves the current problem, though I
>> definitely prefer your proposal to split the database into several messages.
>>
>
> I am glad, Andy :).
> I will implement the message split in a week or two and will need your
> help in testing this.
>
>
>>
>> Best regards,
>> Andy
>>
>> On 11/22/2014 09:11 AM, Artem Polyakov wrote:
>>
>> Andy, I have good news for you!
>>
>>  1. I was able to pinpoint the problem on my laptop; the cause is
>> the cumulative database size.
>> 2. To check my guess you'll need to change the sources:
>> a. go to <slurm-src>/src/common/pack.c
>> and double the values of these two macros:
>>
>> #define MAX_PACK_MEM_LEN     (16 * 1024 * 1024)
>>
>> #define MAX_PACK_STR_LEN     (16 * 1024 * 1024)
>>
>>  As you can see, this limit is close to the total size of your messages; I
>> think the additional PMI2 headers push the DB over it. NOTE: you will need
>> to RECOMPILE and REINSTALL Slurm on ALL nodes, since this change affects
>> the slurmds!
>>
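The arithmetic behind "this limit is near to what we have" can be sketched as follows. This is only an illustration: the helper names `raw_db_bytes` and `headroom_bytes` are hypothetical, and it assumes the whole database is packed into a single buffer bounded by `MAX_PACK_MEM_LEN`, counting only the value bytes (no keys or headers).

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PACK_MEM_LEN (16 * 1024 * 1024)  /* limit quoted from pack.c above */

/* Raw value bytes for nodes * ppn tasks, each publishing value_len bytes.
 * Hypothetical helper; keys and PMI2 headers are deliberately not counted. */
uint64_t raw_db_bytes(uint64_t nodes, uint64_t ppn, uint64_t value_len)
{
    return nodes * ppn * value_len;
}

/* Bytes left under the pack limit for keys and headers; 0 if already over. */
uint64_t headroom_bytes(uint64_t raw)
{
    assert(MAX_PACK_MEM_LEN > 0);
    return raw < MAX_PACK_MEM_LEN ? MAX_PACK_MEM_LEN - raw : 0;
}
```

With 1534 nodes and PPN=24, 436-byte values give 16051776 raw bytes and 437-byte values give 16088592, leaving 688624 bytes of headroom in the failing case. Spread over 36816 tasks that is under 19 bytes per task, so a small per-entry overhead would be enough to cross the limit.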
>>  While you are checking my guess (this change fixes the problem on my
>> laptop), I will work on a proper patch that makes PMI2 split the database
>> into several messages when it can't be sent in one.
>>
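The idea of splitting the database into several messages can be sketched roughly as below. This is not the actual patch: `chunk_count` and `chunk_len` are hypothetical helpers that only show how a buffer of `total_len` bytes might be divided when a single message may carry at most `max_chunk` bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Number of messages needed to send total_len bytes with at most
 * max_chunk bytes per message (ceiling division). Hypothetical helper. */
size_t chunk_count(size_t total_len, size_t max_chunk)
{
    assert(max_chunk > 0);
    return (total_len + max_chunk - 1) / max_chunk;
}

/* Length of chunk number idx (0-based); only the last chunk may be shorter. */
size_t chunk_len(size_t total_len, size_t max_chunk, size_t idx)
{
    assert(idx < chunk_count(total_len, max_chunk));
    size_t offset = idx * max_chunk;
    size_t rest = total_len - offset;
    return rest < max_chunk ? rest : max_chunk;
}
```

A sender would loop over `idx` from 0 to `chunk_count(...) - 1` and send `chunk_len(...)` bytes starting at offset `idx * max_chunk`. A real implementation would also have to split on record boundaries and tell the receiver how many parts to expect.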
>>  P.S. FYI, I am working on the PMIx plugin now, and it will use the SLURM
>> communication infrastructure too, so it will be affected by the same
>> problem. That is why I am quite interested in this effort.
>>
>>
>> 2014-11-22 3:27 GMT+06:00 Andy Riebs <[email protected]>:
>>
>>>  Thanks Artem! We'll keep you posted on what we find.
>>>
>>> Andy
>>>
>>>
>>> On 11/21/2014 12:06 PM, Artem Polyakov wrote:
>>>
>>>
>>>
>>> 2014-11-21 21:43 GMT+06:00 Artem Polyakov <[email protected]>:
>>>
>>>>  Hello, Andy.
>>>>
>>>>  I am not a SLURM expert, so please treat the following advice as
>>>> optional.
>>>>
>>>>  According to the sources, PMI2 fails when trying to broadcast the
>>>> cumulative database to the nodes. It tries 5 times, with delays that
>>>> increase as powers of two:
>>>> 1, 2, 4, 8, 16 seconds.
>>>>
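The retry schedule described above (5 attempts, delays doubling from 1 to 16 seconds) can be sketched as below. The names `MAX_RETRIES`, `retry_delay_sec` and `total_retry_delay_sec` are hypothetical and are not taken from the Slurm sources; the sketch only reproduces the delay sequence.

```c
#include <assert.h>

#define MAX_RETRIES 5  /* assumption: taken from the "5 times" in the text */

/* Delay in seconds for retry number `attempt` (0-based): 1, 2, 4, 8, 16. */
unsigned int retry_delay_sec(unsigned int attempt)
{
    assert(attempt < MAX_RETRIES);
    return 1u << attempt;
}

/* Sum of all delays in the schedule, i.e. the waiting time if every
 * attempt fails and each listed delay is actually slept. */
unsigned int total_retry_delay_sec(void)
{
    unsigned int total = 0;
    for (unsigned int i = 0; i < MAX_RETRIES; i++)
        total += retry_delay_sec(i);
    return total;
}
```

The listed delays sum to 31 seconds, so a job that fails this way spends about half a minute retrying before the error is reported.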
>>>>  The cumulative size of the database should be proportional to
>>>> 1534*24*437 = 16088592 bytes, about 16 MB, which is not that much. I
>>>> attached a small patch that outputs the exact DB size right before it is
>>>> broadcast to the compute nodes (I assume that you can change the sources).
>>>>
>>>>  One thing that comes to mind: what if you try half as many tasks with
>>>> a message twice as big? 2x437 is still less than 1024 (PMI2_MAX_VALLEN).
>>>> Since the total size stays the same, that would let us confirm or exclude
>>>> a cumulative DB size problem.
>>>>
>>>>  Most probably the problem is in slurm_forward_data. I would do the
>>>> following two things:
>>>> 1. Enable the 3rd level of debug on the slurmds (SlurmdDebug=3
>>>> configuration option) to see what is failing there. You have probably
>>>> done that already but didn't mention it in your mail.
>>>>
>>>
>>>  Here I was wrong. Since it is the "srun" process that does the final
>>> broadcast of the DB, you need to launch your application with the -vvv
>>> option to increase its verbosity (srun -vvv pmi2_allgather). I think
>>> you'll see a more detailed error report.
>>>
>>>   2. Optionally, you can try adjusting the Slurm fanout tree width
>>>> (TreeWidth=10/50/100/whatever... configuration option).
>>>>
>>>> 2014-11-21 19:20 GMT+06:00 Andy Riebs <[email protected]>:
>>>>
>>>>> Hi Slurm gurus!   We are seeing an issue when launching large rank
>>>>> count jobs in our IB cluster using PMI2 and could use your help. When the
>>>>> jobs fail, the first line of output seems to have the most useful
>>>>> information:
>>>>>
>>>>> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
>>>>>
>>>>> One of our Mellanox friends stripped down the test case to just
>>>>> include the PMI2 startup code from MPI to help try to isolate the issue
>>>>> further.  This code just takes one argument which is how many bytes to put
>>>>> in the message.  When we start up this test code on 1534 nodes with
>>>>> PPN=24, 436 byte messages will pass, but 437 byte messages will fail.
>>>>> (Maybe those numbers will help someone figure this out!)
>>>>>
>>>>> The only interesting slurm configuration option that we have updated
>>>>> is: MessageTimeout=60, but it did not impact this issue.   We are looking
>>>>> for advice on how to proceed in debugging/troubleshooting this issue
>>>>> further.
>>>>>
>>>>> I have attached the test program to this message.
>>>>>
>>>>> We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
>>>>>
>>>>> Andy
>>>>>
>>>>> --
>>>>> Andy Riebs
>>>>> Hewlett-Packard Company
>>>>> High Performance Computing
>>>>> +1 404 648 9024
>>>>> My opinions are not necessarily those of HP
>>>>>
>>>>>
>>>>
>>>>
>>>>   --
>>>> С Уважением, Поляков Артем Юрьевич
>>>> Best regards, Artem Y. Polyakov
>>>>
>>>
>>>
>>>
>>>  --
>>> С Уважением, Поляков Артем Юрьевич
>>> Best regards, Artem Y. Polyakov
>>>
>>>
>>>
>>
>>
>>  --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov
>>
>>
>>
>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
