I've got a crapton of running jobs (maybe around 25000) on a cluster
running slurm 2.2.4.  At some point it apparently crossed some internal
limit: commands like squeue and scancel started failing, and slurmctld
is pegged at 100% CPU:
scancel: error: slurm_receive_msg: Insane message length
slurm_load_jobs error: Insane message length
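
The only workaround I've been trying is to cancel by explicit job ID in
small batches, on the theory that scancel with explicit IDs can skip
the full job-table fetch that squeue does.  I haven't verified that
against the 2.2.4 source, so treat this as a guess (the job ID range
below is just an example):

  # Cancel jobs by explicit ID, up to 50 per scancel invocation, so
  # scancel (hopefully) never asks slurmctld for the full job list.
  # Substitute a real job ID range for 100000-100999.
  seq 100000 100999 | xargs -n 50 scancel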

From the error text it looks like the job-list RPC reply has outgrown
whatever maximum slurm_receive_msg considers sane, so anything that
loads the full job table fails, and I can't cancel jobs to get back to
a sane number.  Has anyone seen this before?

-JE

