Andy,

SLURM's implementation of PMI uses MPI_Init to collect and redistribute all of 
the key-value pairs to all of the tasks, which involves a fair bit of data 
movement. Other PMI implementations may not distribute this information as part 
of MPI_Init, but only as the information is needed, which would result in 
faster startup, but you would face delays when the tasks start to move data.

Hongjia Cao at NUDT made some improvements to the scalability of this logic 
for the Tianhe-1A computer (fastest computer in the world); they are included 
in SLURM v2.2. If you are not running v2.2, upgrading should help with 
scalability.

Moe

________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Andrew Roosen [[email protected]]
Sent: Monday, February 07, 2011 10:19 AM
To: [email protected]
Subject: [slurm-dev] slow MVAPICH2 startup with SLURM PMI

Hi,
We have a cluster with 64 compute nodes, each with 4x12-core processors, 
connected via GigE and Mellanox ConnectX-2 InfiniBand. So not small, but not 
huge, either.

If I run a "null" MVAPICH2 program (just MPI_Init and MPI_Finalize) linked 
against SLURM's libpmi:
        time srun -p all -n 3000 ./mpinothing
it takes about a minute to finish (45-90 s, depending on the particular run).

If I run the same code not linked against SLURM's PMI:
        time salloc -p all -n 3000 mpiexec.hydra -bootstrap slurm ./mpinothing
and it completes pretty consistently in about 17 seconds.

Is this to be expected?

I've tried tweaking the PMI environment variables without any significant 
change.  "scontrol show config" attached.

Cheers,
Andy Roosen
