2014-11-21 21:43 GMT+06:00 Artem Polyakov <[email protected]>:

> Hello, Andy.
>
> I am not a SLURM expert, so please treat the following advice as
> optional.
>
> According to the sources, PMI2 fails when trying to broadcast the
> cumulative database to the nodes. It tries 5 times, with delays that
> increase as powers of two: 1, 2, 4, 8, 16 seconds.
>
> The cumulative size of the database should be proportional to
> 1534*24*437 = 16088592 bytes, roughly 16 MB, which is not that much.
> I attached a small patch that outputs the exact DB size right before
> it is broadcast to the compute nodes (I assume that you can change
> the sources).
>
> One thing that comes to mind: what if you try half as many tasks with
> a message twice as big? 2x437 is still less than 1024
> (PMI2_MAX_VALLEN). The cumulative DB size stays the same, so this
> would let us check whether the cumulative DB size is the problem.
>
> Most probably the problem is in slurm_forward_data. I would do the
> following two things:
> 1. Enable the 3rd level of debug on the slurmd's (SlurmdDebug=3
> configuration option) to see what is failing there. You have probably
> done that already but didn't mention it in your mail.
>

Here I was wrong. Since it is the "srun" process that does the final
broadcast of the DB, you need to launch your application with the -vvv
option to increase its verbosity (srun -vvv pmi2_allgather). I think
you'll see a more detailed error report.

> 2. Optionally, you can try playing with the Slurm fanout tree width
> (TreeWidth=10/50/100/whatever... configuration option).
>
> 2014-11-21 19:20 GMT+06:00 Andy Riebs <[email protected]>:
>
>> Hi Slurm gurus! We are seeing an issue when launching large rank
>> count jobs on our IB cluster using PMI2 and could use your help. When
>> the jobs fail, the first line of output seems to have the most useful
>> information:
>>
>> srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
>>
>> One of our Mellanox friends stripped down the test case to include
>> just the PMI2 startup code from MPI, to help isolate the issue
>> further. This code takes one argument, which is how many bytes to put
>> in the message. When we start up this test code on 1534 nodes with
>> PPN=24, 436-byte messages will pass, but 437-byte messages will fail.
>> (Maybe those numbers will help someone figure this out!)
>>
>> The only interesting Slurm configuration option that we have updated
>> is MessageTimeout=60, but it did not affect this issue. We are
>> looking for advice on how to proceed in debugging/troubleshooting
>> this issue further.
>>
>> I have attached the test program to this message.
>>
>> We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system.
>>
>> Andy
>>
>> --
>> Andy Riebs
>> Hewlett-Packard Company
>> High Performance Computing
>> +1 404 648 9024
>> My opinions are not necessarily those of HP
>>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
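
For anyone following the numbers in this thread, here is a rough Python sketch (not Slurm code) of the two calculations discussed: the estimated payload size for the passing and failing message sizes, and the retry delay schedule. It assumes the database size is simply nodes x tasks per node x value bytes, ignoring keys and any per-entry overhead, and it reads "half as many tasks" as halving PPN. The function names are hypothetical and do not appear in the Slurm sources.

```python
# Rough sketch only. Function names are hypothetical, not part of Slurm.

def estimated_db_bytes(nodes, ppn, value_len):
    """Estimate of the cumulative DB size: one value per task.

    Assumption: keys and per-entry overhead are ignored, so the real
    database is somewhat larger than this figure.
    """
    return nodes * ppn * value_len


def retry_delays(attempts=5, first_delay=1):
    """Delays (in seconds) that double on every attempt."""
    return [first_delay * 2 ** i for i in range(attempts)]


if __name__ == "__main__":
    # 436-byte messages pass and 437-byte messages fail in the report.
    for value_len in (436, 437):
        size = estimated_db_bytes(1534, 24, value_len)
        print(f"{value_len}-byte values: {size} bytes")

    # Assumption: "half as many tasks" is modelled as PPN=12. With
    # values twice as big, the estimated total is unchanged.
    same = estimated_db_bytes(1534, 12, 2 * 437) == estimated_db_bytes(1534, 24, 437)
    print("same total with half the tasks, double the value size:", same)

    print("retry delays (s):", retry_delays())
```

Under these assumptions the estimate is 16051776 bytes for 436-byte values and 16088592 bytes for 437-byte values, and the five listed delays add up to 31 seconds.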
