2014-11-21 21:43 GMT+06:00 Artem Polyakov <[email protected]>:

> Hello, Andy.
>
> I am not a SLURM expert, so please treat the following advice as
> optional.
>
> According to the sources, PMI2 fails when trying to broadcast the
> cumulative database to the nodes. It tries 5 times, with delays that
> increase as powers of two:
> 1, 2, 4, 8, 16 seconds.
>
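
The retry schedule described above can be sketched as follows. This is only
an illustration of "5 tries with power-of-two delays", not the actual Slurm
code: the names backoff_delays and send_with_retry are hypothetical, and
sleeping after the last failed try is an assumption.

```python
import time


def backoff_delays(attempts=5, base=1):
    """Delays in seconds for each try: base * 2**i for i = 0..attempts-1."""
    return [base * 2 ** i for i in range(attempts)]


def send_with_retry(send, attempts=5, sleep=time.sleep):
    """Call send() until it returns True or the tries run out.

    Hypothetical helper, not Slurm code. Assumption: we sleep after every
    failed try, including the last one.
    """
    for delay in backoff_delays(attempts):
        if send():
            return True
        sleep(delay)  # 1, 2, 4, 8, 16 seconds
    return False


print(backoff_delays())       # [1, 2, 4, 8, 16]
print(sum(backoff_delays()))  # 31
```

Under these assumptions a failing broadcast spends at most 31 seconds
sleeping between tries before the error is reported.
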
> The cumulative size of the database should be proportional to
> 1534*24*437 = 16088592 bytes, roughly 16 MB, which is not that much. I
> attached a small patch that outputs the exact DB size right before
> broadcasting it to the compute nodes (I assume that you can change the
> sources).
>
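
For reference, the size estimate can be checked for both message sizes from
the report (436 bytes passes, 437 bytes fails). This counts value payload
only; keys and any per-entry overhead in the real database are ignored, which
is an assumption of this sketch. The function name payload_bytes is made up
for illustration.

```python
NODES = 1534  # node count from the report
PPN = 24      # processes per node from the report


def payload_bytes(msg_bytes, nodes=NODES, ppn=PPN):
    """Total value payload if every rank contributes msg_bytes bytes."""
    return nodes * ppn * msg_bytes


for size in (436, 437):
    print(size, payload_bytes(size))
# 436 16051776
# 437 16088592
```

The two totals differ by only 36816 bytes (one byte per rank), so both are
about 16 MB.
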
> One thing that comes to mind: what if you try half as many tasks with
> messages twice as large? 2x437 = 874 is still less than 1024
> (PMI2_MAX_VALLEN). That way we can check whether the cumulative DB size
> is the problem.
>
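
A small sketch of the proposed experiment: halving the task count while
doubling the message keeps the total payload the same, and the doubled value
still fits under the limit. The helper name experiment is hypothetical, and
the payload figure again ignores keys and per-entry overhead.

```python
PMI2_MAX_VALLEN = 1024  # limit quoted in the message above


def experiment(tasks, msg_bytes):
    """Summarize one run: does the value fit, and what is the total payload?"""
    return {
        "tasks": tasks,
        "msg_bytes": msg_bytes,
        "fits": msg_bytes < PMI2_MAX_VALLEN,
        "total": tasks * msg_bytes,
    }


original = experiment(1534 * 24, 437)        # the failing case
halved = experiment(1534 * 24 // 2, 2 * 437) # half the tasks, double the message

print(original)
print(halved)
print(original["total"] == halved["total"])  # True
```

If the halved run passes with the same total, the cumulative size alone is
probably not the limit; if it fails as well, the total size remains a
suspect.
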
> Most probably the problem is in slurm_forward_data. I would do the
> following two things:
> 1. Enable the 3rd level of debug on the slurmds (SlurmdDebug=3
> configuration option) to see what is failing there. You have probably done
> that already but didn't mention it in the mail.
>

Here I was wrong. Since it is the "srun" process that does the final
broadcast of the DB, you need to launch your application with the -vvv
option to increase its verbosity (srun -vvv pmi2_allgather). I think you'll
see a more detailed error report.

> 2. Optionally, you can try to play with the Slurm fanout tree width
> (TreeWidth=10/50/100/whatever... configuration option).
>
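
To get a feeling for what the tree width changes, here is a rough estimate of
how many forwarding hops are needed to reach 1534 nodes. It assumes an
idealized balanced tree in which every node forwards to exactly `width`
children; the real forwarding in Slurm may differ, and hops_needed is a
made-up name.

```python
def hops_needed(nodes, width):
    """Hops until width + width**2 + ... covers at least `nodes` nodes."""
    reached, level_size, hops = 0, 1, 0
    while reached < nodes:
        level_size *= width    # nodes contacted at this hop
        reached += level_size  # nodes contacted so far
        hops += 1
    return hops


for width in (10, 50, 100):
    print(width, hops_needed(1534, width))
# 10 4
# 50 2
# 100 2
```

In this model a wider tree means fewer hops, but each forwarding node has to
send the database to more children.
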
> 2014-11-21 19:20 GMT+06:00 Andy Riebs <[email protected]>:
>
>> Hi Slurm gurus! We are seeing an issue when launching large rank count
>> jobs in our IB cluster using PMI2 and could use your help. When the jobs
>> fail, the first line of output seems to have the most useful information:
>>
>> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
>>
>> One of our Mellanox friends stripped down the test case to just include
>> the PMI2 startup code from MPI to help try to isolate the issue further.
>> This code just takes one argument which is how many bytes to put in the
>> message.  When we start up this test code on 1534 nodes with PPN=24, 436
>> byte messages will pass, but 437 byte messages will fail.  (Maybe those
>> numbers will help someone figure this out!)
>>
>> The only interesting slurm configuration option that we have updated is:
>> MessageTimeout=60, but it did not impact this issue. We are looking for
>> advice on how to proceed in debugging/troubleshooting this issue further.
>>
>> I have attached the test program to this message.
>>
>> We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
>>
>> Andy
>>
>> --
>> Andy Riebs
>> Hewlett-Packard Company
>> High Performance Computing
>> +1 404 648 9024
>> My opinions are not necessarily those of HP
>>
>>
>
>
> --
> Best regards, Artem Y. Polyakov
>



-- 
Best regards, Artem Y. Polyakov
