The maximum RPC length is configured to be 16MB.
You could change MAX_MSG_SIZE in
src/common/slurm_protocol_socket_implementation.c
and rebuild if necessary.

I'd first stop new jobs from being scheduled:
scontrol update partitionname=X state=DRAIN

Hopefully after a bit you'll be able to figure out what
all of these jobs are and kill a bunch of them.
scancel supports a number of filters (by user, partition,
state, etc.), so you can probably kill them all with a
single command once the get-job-info RPCs shrink back
to a reasonable size.
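Concretely, the cleanup might look something like the following (partition and user names are placeholders; check scancel --help on your 2.2.x build for the exact filters available):

```sh
# Stop the partition from starting any more jobs
scontrol update partitionname=X state=DRAIN

# Thin out the jobs with scancel's filters, e.g. by user,
# partition, and/or state (values below are examples only)
scancel --user=someuser --partition=X --state=running

# Once job counts are sane again, reopen the partition
scontrol update partitionname=X state=UP
```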
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Josh England [[email protected]]
Sent: Thursday, April 28, 2011 3:40 PM
To: [email protected]
Subject: [slurm-dev] Insane message length

I've got a crapton of running jobs (maybe around 25000) on a cluster
using slurm 2.2.4.  At some point it hit a bad threshold.  Many commands
like squeue and scancel started complaining and the slurmctld is pegged
at 100% CPU:
scancel: error: slurm_receive_msg: Insane message length
slurm_load_jobs error: Insane message length

I can't cancel jobs to get back to a sane number.  Has anyone seen this
before?

-JE


