By this time next year, SLURM should be running on systems much larger than those listed below, including systems with slurmd daemons on each compute node. The scalability issues we see are mostly related to the rate of job submissions rather than to system size, and we're working on that now.
Moe

________________________________________
From: [email protected] [[email protected]] On Behalf Of Rayson Ho [[email protected]]
Sent: Monday, March 21, 2011 8:56 AM
To: [email protected]
Subject: Re: [slurm-dev] design limits for 2.2? SLURM scalability

It seems that SLURM daemons will not be running on each node on Sequoia: slurmd will run on the I/O nodes but not the compute nodes, if I read this presentation correctly:

Multi-Petascale Computing on the Sequoia Architecture:
https://hpcrd.lbl.gov/scidac09/talks/Seager-Sequoia4SciDACv1.pdf

Nevertheless, the installations Jette listed are really massive! The largest known Grid Engine installation is Sun's Ranger at TACC, which has only 62,976 processor cores in 3,936 nodes. As the developer and maintainer of a Grid Engine fork (Oracle ended development of the open-source SGE code base in 2010, so we forked the code and started the purely open-source project called "Open Grid Scheduler"), I don't think Grid Engine will be able to scale to those numbers in the near, or even not-so-near, future! :-(

Rayson


On Sat, Nov 20, 2010 at 1:49 PM, Jette, Moe <[email protected]> wrote:
> I believe that SLURM can manage any machine that HP can build and a customer can pay for ;-)
>
> We have not seen any scaling issues, and some of the machines running SLURM today include:
> Tianhe-1A in China with 186,368 cores,
> Tera-100 at CEA with 138,368 cores, and
> a BlueGene/L at LLNL with 212,992 cores.
>
> We plan to run SLURM on LLNL's 20 PFlop BlueGene/Q system next year with 1.6 million processors
> (http://www-304.ibm.com/jct03004c/press/us/en/pressrelease/26599.wss), and
> I am not expecting any scalability problems, although task launch on the
> BlueGene systems differs from that on typical Linux systems.
>
> At the other end of the spectrum, Intel is using SLURM on their 48-core "cluster on a chip"
> (http://www.hpcwire.com/features/Intel-Unveils-48-Core-Research-Chip-78378487.html).
> SLURM's architecture, with its multitude of plugin options, gives it tremendous flexibility.
>
> Moe
>
> ________________________________________
> From: [email protected] [[email protected]] On Behalf Of Andy Riebs [[email protected]]
> Sent: Friday, November 19, 2010 8:14 AM
> To: [email protected]
> Subject: [slurm-dev] design limits for 2.2?
>
> How large a cluster should one expect to be able to support with Slurm 2.2?
> (One suspects that the number is getting rather large!)
>
> Thanks!
> Andy
>
> --
> Andy Riebs
> Hewlett-Packard Company
> SCI Solutions
> +1-786-263-9743
> My opinions are not necessarily those of HP
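
[Editor's note: for readers unfamiliar with the plugin options mentioned above, they are selected in slurm.conf. The fragment below is only an illustrative sketch with example values, not a recommended configuration; the set of valid plugin names varies by SLURM version.]

```
# Illustrative slurm.conf excerpt: each subsystem is a pluggable choice.
AuthType=auth/munge                 # authentication plugin
SchedulerType=sched/backfill        # scheduling plugin (vs. sched/builtin)
SelectType=select/cons_res          # resource-selection plugin
SelectTypeParameters=CR_Core        # allocate by core rather than whole node
TopologyPlugin=topology/tree        # network-topology-aware placement
TaskPlugin=task/affinity            # task launch / CPU-binding plugin
SwitchType=switch/none              # interconnect plugin
ProctrackType=proctrack/linuxproc   # process-tracking plugin
```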
