Max RPC length is configured to be 16 MB. You could change MAX_MSG_SIZE in src/common/slurm_protocol_socket_implementation.c if necessary.
I'd first stop queuing new jobs:

    scontrol update partitionname=X state=DRAIN

Hopefully after a bit you'll be able to figure out what all of these jobs are and kill a bunch of them. scancel has a number of filters, so you can probably kill them all with a single command (once the get-job-info RPCs get back down to a reasonable size).

________________________________________
From: [email protected] [[email protected]] On Behalf Of Josh England [[email protected]]
Sent: Thursday, April 28, 2011 3:40 PM
To: [email protected]
Subject: [slurm-dev] Insane message length

I've got a huge number of running jobs (maybe around 25000) on a cluster using slurm 2.2.4. At some point it hit a bad threshold: many commands like squeue and scancel started complaining, and the slurmctld is pegged at 100% CPU:

    scancel: error: slurm_receive_msg: Insane message length
    slurm_load_jobs error: Insane message length

I can't cancel jobs to get back to a sane number. Has anyone seen this before?

-JE
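The drain-then-filter approach above might look like the following. The partition and user names here are placeholders, not taken from the thread:

```shell
# Stop the partition from starting new jobs (partition name is an example)
scontrol update partitionname=batch state=DRAIN

# scancel filters let one command remove many jobs at once, e.g.:
scancel --state=PENDING --user=jengland   # one user's pending jobs
scancel --partition=batch                 # every job in one partition
```

Draining first keeps the backlog from refilling while the bulk scancel works through the existing jobs.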
