I've found from practical experience that when Slurm gets really busy and unresponsive, turning on defer can relieve the pressure enough for it to clear its head. For instance, if a bunch of jobs exit simultaneously (due to either a cancel or a time limit), that can create a storm of requests to the master. This can cause the thread count to jump to 256, which then makes everything unresponsive as the master tries to dig itself out of the hole.
It would be nice to be able to say: if the thread count jumps to 256, automatically turn on defer. That would basically slow down the scheduler and allow it to breathe and catch up.
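As a rough sketch of the detection half only (not a working solution): the Python below assumes the thread count can be read from the "Server thread count" line of sdiag output. THREAD_LIMIT, parse_thread_count, should_defer, enable_defer and check_once are names made up for illustration. enable_defer is just a placeholder, since defer is normally set through SchedulerParameters in slurm.conf and I don't know of a clean way to flip it on the fly.

```python
import re
import subprocess

# Assumed limit, taken from the thread count mentioned above.
THREAD_LIMIT = 256


def parse_thread_count(sdiag_output):
    """Return the server thread count found in sdiag-style text, or None."""
    match = re.search(r"Server thread count:\s*(\d+)", sdiag_output)
    return int(match.group(1)) if match else None


def should_defer(thread_count, limit=THREAD_LIMIT):
    """True when a thread count was found and it has reached the limit."""
    return thread_count is not None and thread_count >= limit


def enable_defer():
    # Hypothetical placeholder: how to actually switch defer on is the
    # open question here, so this only reports that it would be done.
    print("thread limit reached: defer would be enabled here")


def check_once():
    """Run sdiag once, act if the limit is reached, and return the count."""
    result = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    )
    count = parse_thread_count(result.stdout)
    if should_defer(count):
        enable_defer()
    return count
```

Something like check_once() could be called from cron or a small loop on the controller node. The parsing and threshold check are kept separate from the sdiag call so they can be tried against saved output first.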
Is there a way to automate this?

-Paul Edmon-
