Andy, SLURM's implementation of PMI uses MPI_Init to collect and redistribute all of the key-value pairs to all of the tasks, which involves a fair bit of data movement. Other PMI implementations may not distribute this information as part of MPI_Init, but only as it is needed. That results in faster startup, but you would face delays when the tasks first start to move data.
There were some improvements in the scalability of this logic made by Hongjia Cao at NUDT for the Tianhe-1A computer (then the fastest computer in the world), and they are included in SLURM v2.2. If you are not running v2.2, upgrading should help with scalability.

Moe

________________________________________
From: [email protected] [[email protected]] On Behalf Of Andrew Roosen [[email protected]]
Sent: Monday, February 07, 2011 10:19 AM
To: [email protected]
Subject: [slurm-dev] slow MVAPICH2 startup with SLURM PMI

Hi,

We have a cluster with 64 compute nodes, each with 4x 12-core processors, connected via GigE and Mellanox ConnectX-2 InfiniBand. So not small, but not huge, either.

If I run a "null" MVAPICH2 program (just MPI_Init and MPI_Finalize) linked against SLURM's libpmi:

    time srun -p all -n 3000 ./mpinothing

it takes about a minute or so to finish (45-90 s, depending on the particular run). I can run the same code not linked against SLURM's PMI:

    time salloc -p all -n 3000 mpiexec.hydra -bootstrap slurm ./mpinothing

and it completes fairly consistently in about 17 seconds.

Is this to be expected? I've tried tweaking the PMI environment variables without any significant change. "scontrol show config" attached.

Cheers,
Andy Roosen
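For anyone wanting to reproduce the measurement, a minimal "null" program like the one Andy describes might look as follows. This is a sketch, not Andy's actual source; the printf from rank 0 is an addition for confirming the run completed, and the name "mpinothing" is taken from the commands quoted above.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch of the "mpinothing" benchmark: just MPI_Init and
       MPI_Finalize, so the measured srun/salloc wall time is
       dominated by launch and PMI wire-up cost rather than by
       any application work. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("mpinothing: init+finalize complete\n");

        MPI_Finalize();
        return 0;
    }

Build it with mpicc and launch it with either srun (linked against SLURM's libpmi) or mpiexec.hydra, as in the commands above, to compare the two startup paths.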
