I've got a crapton of running jobs (somewhere around 25,000) on a cluster running Slurm 2.2.4. At some point it crossed a bad threshold: slurmctld is now pegged at 100% CPU, and commands like squeue and scancel started complaining:

    scancel: error: slurm_receive_msg: Insane message length
    slurm_load_jobs error: Insane message length
I can't cancel jobs to get back to a sane number, and since squeue itself dies with the same error, I can't even get a job list to feed into a batch-cancel loop (sketch below).
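Roughly, what I'd want to do is something like this, which fails at the squeue step before scancel ever sees a job ID (the exact flags are just illustrative):

    # batch-cancel sketch -- squeue errors out with "Insane message length"
    # before printing any job IDs, so scancel never runs
    squeue -h -o "%i" -u "$USER" | head -n 1000 | xargs -r scancel

Has anyone seen this before?

-JE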
