I've bumped the size up from 16MB to 128MB in the version 2.3 code.
________________________________________
From: [email protected] [[email protected]] On Behalf Of Josh England [[email protected]]
Sent: Thursday, April 28, 2011 4:18 PM
To: [email protected]
Subject: RE: [slurm-dev] Insane message length

Bumping up MAX_MSG_SIZE made things work again.

-JE

On Thu, 2011-04-28 at 16:00 -0700, Jette, Moe wrote:
> Max RPC length is configured to be 16MB.
> You could change MAX_MSG_SIZE in
> src/common/slurm_protocol_socket_implementation.c
> if necessary.
>
> I'd first stop queuing new jobs:
> scontrol update partitionname=X state=DRAIN
>
> Hopefully after a bit you'll be able to figure out what
> all of these jobs are and kill a bunch of them.
> scancel has a bunch of filters so you can probably
> kill them all with a single command (after the get job
> info RPCs get down to a reasonable size).
> ________________________________________
> From: [email protected] [[email protected]] On
> Behalf Of Josh England [[email protected]]
> Sent: Thursday, April 28, 2011 3:40 PM
> To: [email protected]
> Subject: [slurm-dev] Insane message length
>
> I've got a crapton of running jobs (maybe around 25000) on a cluster
> using slurm 2.2.4. At some point it hit a bad threshold. Many commands
> like squeue and scancel started complaining and the slurmctld is pegged
> at 100% CPU:
> scancel: error: slurm_receive_msg: Insane message length
> slurm_load_jobs error: Insane message length
>
> I can't cancel jobs to get back to a sane number. Has anyone seen this
> before?
>
> -JE
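
For anyone reading this thread later: all it establishes is that MAX_MSG_SIZE caps the RPC length and that a reply larger than the cap produces "Insane message length". Below is a minimal sketch of how a cap like that might be enforced on a length-prefixed message. The 4-byte big-endian length prefix, the helper name check_msg_length, and the constant names MAX_MSG_SIZE_OLD and MAX_MSG_SIZE_NEW are assumptions made up for illustration; this is not the code in slurm_protocol_socket_implementation.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values taken from the thread: 16MB before, 128MB in 2.3. */
#define MAX_MSG_SIZE_OLD (16u * 1024u * 1024u)
#define MAX_MSG_SIZE_NEW (128u * 1024u * 1024u)

/*
 * Hypothetical helper, not a Slurm function.
 * Decodes a 4-byte big-endian length prefix and rejects it if it
 * exceeds max_size. Returns 0 and stores the length on success,
 * -1 if the length is over the cap ("insane").
 */
static int check_msg_length(const unsigned char header[4],
                            uint32_t max_size, uint32_t *out_len)
{
    assert(header != NULL && out_len != NULL);

    /* Assemble the length byte by byte so host endianness does not matter. */
    uint32_t len = ((uint32_t)header[0] << 24) |
                   ((uint32_t)header[1] << 16) |
                   ((uint32_t)header[2] << 8)  |
                    (uint32_t)header[3];

    if (len > max_size)
        return -1;  /* caller would report the message length error here */

    *out_len = len;
    return 0;
}
```

With a sketch like this, a 20MB job-info reply (prefix bytes 0x01 0x40 0x00 0x00) is refused under the 16MB cap and accepted under the 128MB cap, which matches the behavior described above once there were enough jobs to inflate the reply.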
