I'm working on integrating Slurm with Amazon EC2. I've found several issues (such as CLOUD nodes not working without DNS names), but they look fairly easy to fix. CLOUD mode with suspend/resume works OK too.
However, I have another question: is it possible to somehow make Slurm work in a 'reluctant' mode? Let me explain. Nodes on Amazon EC2 are billed in one-hour increments. So if I start 10 "srun sleep 10" jobs, Slurm is going to resume 10 nodes, causing me to be billed for 10 hours of CPU time, even though all the jobs could be completed on a single host in the time it takes to start all the EC2 nodes. I've tried playing with ResumeRate, but it simply doesn't work well enough. So I'm thinking about a scheduler that would work in conjunction with the backfill scheduler. It would start resuming new nodes only once at least one task in the queue has been waiting for execution for more than N seconds. Is this feasible, or is there a better way to do it?
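
To make the idea concrete, here is a rough Python sketch of the policy I have in mind. It is not Slurm code and uses no Slurm API: `PendingJob`, `billed_hours` and `nodes_to_resume` are hypothetical names made up for illustration, and it assumes one job per node and billing rounded up to whole hours per node.

```python
import math
from dataclasses import dataclass


@dataclass
class PendingJob:
    """Hypothetical record for a queued job (not a Slurm structure)."""
    job_id: int
    submit_time: float  # seconds since some reference point


def billed_hours(node_runtimes_s):
    """Total billed hours when each node is rounded up to a whole hour."""
    return sum(math.ceil(t / 3600) for t in node_runtimes_s if t > 0)


def nodes_to_resume(pending, now, min_wait_s, idle_nodes, max_new_nodes):
    """Return how many powered-down nodes to resume right now.

    Assumption: one job per node. Nothing is resumed until at least one
    job has waited longer than min_wait_s (the "N seconds" threshold).
    """
    # Jobs that have been queued longer than the threshold.
    overdue = [j for j in pending if now - j.submit_time > min_wait_s]
    # Nodes that are already up can take overdue jobs at no extra cost.
    needed = len(overdue) - idle_nodes
    return max(0, min(needed, max_new_nodes))


# Ten 10-second jobs, one node each, versus all ten run back to back on one node.
print(billed_hours([10] * 10))   # ten nodes, each billed a full hour
print(billed_hours([10 * 10]))   # one node, 100 seconds of work

queue = [PendingJob(job_id=i, submit_time=0.0) for i in range(10)]
print(nodes_to_resume(queue, now=5.0, min_wait_s=60, idle_nodes=0, max_new_nodes=3))
print(nodes_to_resume(queue, now=120.0, min_wait_s=60, idle_nodes=0, max_new_nodes=3))
```

With a threshold like this, short jobs would drain through whatever nodes are already running, and new nodes would be resumed only when the queue is really backing up.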