Hi Artem,
 
 You've made my team very happy, indeed! For the short term, I've
 increased those limits to 64MB, which solves the current problem,
 though I definitely prefer your proposal to split the database into
 several messages.
 
 Best regards,
 Andy
 
 On 11/22/2014 09:11 AM, Artem Polyakov
   wrote:
   Re: [slurm-dev] Re: Problem with PMI2 in Slurm 14.03.10
   
   Andy, I have good news for you!
     1. I was able to pinpoint the problem on my laptop: the
       cause is the cumulative database size.
     2. To check my guess you'll need to change the sources:
     a. Go to <slurm-src>/src/common/pack.c and double the values of
     these two macros:

       #define MAX_PACK_MEM_LEN (16 * 1024 * 1024)
       #define MAX_PACK_STR_LEN (16 * 1024 * 1024)

       As you can see, this limit is close to the size of the database built
from your messages; I think the additional PMI2 headers push the DB over
this limit. NOTE: you'll need to RECOMPILE and REINSTALL Slurm on ALL
nodes, since this change affects the slurmds!
       While you are checking my guess (the change fixes the problem on my
laptop), I will work on a proper patch that makes PMI2 split the database
into several messages when it can't send it in just one.
       P.S. FYI, I am working on the PMIx plugin now. It will use the SLURM
communication infrastructure too, and so will be affected by the same problem.
That is why I am quite interested in this effort.
     2014-11-22 3:27 GMT+06:00 Andy Riebs:
       
          Thanks Artem! We'll
           keep you posted on what we find.
           
           Andy
               On 11/21/2014 12:06 PM, Artem Polyakov wrote:
                     2014-11-21 21:43 GMT+06:00 Artem Polyakov:
                           Hello, Andy.
                            I am not a SLURM expert, so please
                              treat the following advice as
                              optional.
                            According to the sources, PMI2 fails when
                              trying to broadcast the cumulative database
                              to the nodes. It tries 5 times with
                              delays that increase as powers of two:
                            1, 2, 4, 8, 16 seconds.
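
For readers unfamiliar with the pattern, the retry schedule described above looks roughly like this. It is a generic sketch, not the PMI2 plugin code: send_with_backoff, the two hook types and the demo stubs are hypothetical, and whether the real code also waits after the final failed attempt is an assumption made here so that five tries give the five delays listed.

```c
#include <stddef.h>

#define MAX_TRIES 5   /* number of attempts quoted in the message */

/* Delay in seconds after the failed attempt with 0-based index `attempt`. */
unsigned int backoff_delay(unsigned int attempt)
{
    return 1u << attempt;            /* 1, 2, 4, 8, 16, ... */
}

/* Hypothetical hooks, so the loop does not depend on any Slurm API. */
typedef int  (*send_fn)(void *ctx);          /* returns 0 on success */
typedef void (*wait_fn)(unsigned int secs);  /* e.g. a wrapper around sleep() */

/*
 * Try to send up to MAX_TRIES times, waiting backoff_delay() seconds
 * after each failure. Returns the number of attempts used on success,
 * or -1 if every attempt failed.
 */
int send_with_backoff(send_fn send, void *ctx, wait_fn wait)
{
    for (unsigned int attempt = 0; attempt < MAX_TRIES; attempt++) {
        if (send(ctx) == 0)
            return (int)attempt + 1;
        wait(backoff_delay(attempt));
    }
    return -1;
}

/* Demo stubs: a sender that fails while *ctx is positive, and a wait
 * hook that only records how long it would have slept. */
unsigned int waited_secs;

int flaky_send(void *ctx)
{
    int *failures_left = ctx;
    return (*failures_left)-- > 0 ? -1 : 0;
}

void record_wait(unsigned int secs)
{
    waited_secs += secs;
}
```

If every attempt fails, the waits add up to 1 + 2 + 4 + 8 + 16 = 31 seconds before the error is reported.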
                            The cumulative size of the database
                              should be proportional to 1534*24*437 =
                              16088592 bytes, about 16 MB, which is not
                              that much. I attached a small patch that
                              outputs the exact DB size right before it
                              is broadcast to the compute nodes (I assume
                              that you can change the sources).
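
The size estimate can be checked with a few lines of C. This is a back-of-the-envelope sketch: the function names are hypothetical, and PACK_LIMIT_BYTES simply mirrors the 16 * 1024 * 1024 value of MAX_PACK_MEM_LEN quoted elsewhere in this thread. It counts only the value bytes, so keys and headers are deliberately left out.

```c
#include <stdint.h>

/* Mirrors MAX_PACK_MEM_LEN as quoted in this thread (16 MiB). */
#define PACK_LIMIT_BYTES (16u * 1024u * 1024u)

/* Raw value bytes for the whole job: nodes * ranks per node * bytes per rank. */
uint64_t kvs_payload_bytes(uint64_t nodes, uint64_t ppn, uint64_t value_len)
{
    return nodes * ppn * value_len;
}

/*
 * Bytes per entry left over for keys and headers before the total
 * reaches PACK_LIMIT_BYTES. Returns 0 if the values alone exceed it.
 */
uint64_t per_entry_headroom(uint64_t nodes, uint64_t ppn, uint64_t value_len)
{
    uint64_t entries = nodes * ppn;
    uint64_t budget = PACK_LIMIT_BYTES / entries;   /* integer division */

    return budget > value_len ? budget - value_len : 0;
}
```

For 1534 nodes with 24 ranks each, 436-byte values give 16,051,776 bytes and 437-byte values give 16,088,592 bytes, both below the 16,777,216-byte limit. The headroom is 19 bytes per entry at 436 bytes and 18 at 437, so a per-entry overhead of about 19 bytes would explain why one size passes and the other fails. That overhead figure is inferred from the numbers here, not read from the Slurm sources.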
                            One thing that comes to mind: what if
                              you try half as many tasks with messages
                              twice as large? 2x437 is still less than
                              1024 (PMI2_MAX_VALLEN). That would let us
                              check whether the cumulative DB size is
                              the problem.
                            Most probably the problem is in
                              slurm_forward_data. I would do the
                              following two things:
                            1. Enable the 3rd level of debug on the
                              slurmds (SlurmdDebug=3 configuration
                              option) to see what is failing there.
                              You have probably done that already but
                              didn't mention it in your mail.
                        Here I was wrong. Since it is the "srun"
                          process that does the final broadcast of the
                          DB, you need to launch your application with
                          the -vvv option to increase its verbosity
                          (srun -vvv pmi2_allgather). I think you'll
                          see a more detailed error report.
                           2. Optionally you can try to play
                             with slurm fanout tree width
                             (TreeWidth=10/50/100/whatever...
                             configuration option).
                                  2014-11-21 19:20 GMT+06:00 Andy Riebs:
                                   Hi
                                       Slurm gurus!   We are seeing an
                                     issue when launching large rank
                                     count jobs in our IB cluster
                                     using PMI2 and could use your
                                     help. When the jobs fail, the
                                     first line of output seems to
                                     have the most useful
                                     information:
                                     
                                     srun: error: mpi/pmi2: failed to
                                     send temp kvs to compute nodes
                                     
                                     One of our Mellanox friends
                                     stripped down the test case to
                                     just include the PMI2 startup
                                     code from MPI to help try to
                                     isolate the issue further.  This
                                     code just takes one argument
                                     which is how many bytes to put
                                     in the message.  When we start
                                     up this test code on 1534 nodes
                                     with PPN=24, 436 byte messages
                                     will pass, but 437 byte messages
                                     will fail.  (Maybe those numbers
                                     will help someone figure this
                                     out!)
                                     
                                     The only interesting slurm
                                     configuration option that we
                                     have updated is:
                                     MessageTimeout=60, but it did
                                     not impact this issue.   We are
                                     looking for advice on how to
                                     proceed in
                                     debugging/troubleshooting this
                                     issue further.
                                     
                                     I have attached the test program
                                     to this message.
                                     
                                     We are using RHEL 6.5 x86_64 and
                                     Slurm 14.03.10 on this system.
                                         
                                         Andy
                                         
                                         -- 
                                         Andy Riebs
                                         Hewlett-Packard Company
                                         High Performance Computing
                                         +1 404 648 9024
                                         My opinions are not
                                         necessarily those of HP
                             -- 
                                  Best regards, Artem Y. Polyakov
                     -- 
                      Best regards, Artem Y. Polyakov
     -- 
     Best regards, Artem Y. Polyakov
