Bumping up MAX_MSG_SIZE made things work again.

-JE

On Thu, 2011-04-28 at 16:00 -0700, Jette, Moe wrote:
> Max RPC length is configured to be 16 MB.
> You could change MAX_MSG_SIZE in 
> src/common/slurm_protocol_socket_implementation.c
> if necessary.
> 
> I'd first stop queuing new jobs:
> scontrol update partitionname=X state=DRAIN
> 
> Hopefully after a bit you'll be able to figure out what
> all of these jobs are and kill a bunch of them.
> scancel has a bunch of filters so you can probably 
> kill them all with a single command (after the get job
> info RPCs get down to a reasonable size).
> ________________________________________
> From: [email protected] [[email protected]] On 
> Behalf Of Josh England [[email protected]]
> Sent: Thursday, April 28, 2011 3:40 PM
> To: [email protected]
> Subject: [slurm-dev] Insane message length
> 
> I've got a crapton of running jobs (maybe around 25000) on a cluster
> using slurm 2.2.4.  At some point it hit a bad threshold.  Many commands
> like squeue and scancel started complaining and the slurmctld is pegged
> at 100% CPU:
> scancel: error: slurm_receive_msg: Insane message length
> slurm_load_jobs error: Insane message length
> 
> I can't cancel jobs to get back to a sane number.  Has anyone seen this
> before?
> 
> -JE
> 
> 
> 
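
The drain-then-cancel advice in the quoted reply could be scripted roughly as follows. This is only a sketch under stated assumptions: the partition name X comes from the thread, but the user name, the job state to filter on, and the DRY_RUN switch are placeholders added here for illustration. The scancel filters used (--user, --partition, --state) are standard scancel options; which combination fits depends on what the jobs actually are. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the recovery steps from the thread: drain the partition, then
# cancel jobs in bulk with scancel filters.
# Assumptions (not from the thread): JOB_USER, JOB_STATE, and DRY_RUN.
PARTITION="${PARTITION:-X}"          # partition name as written in the thread
JOB_USER="${JOB_USER:-someuser}"     # hypothetical owner of the jobs
JOB_STATE="${JOB_STATE:-PENDING}"    # hypothetical state filter
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 = partition, $2 = user, $3 = job state
drain_and_cancel() {
    # Step 1: stop new jobs from being queued on the partition.
    run scontrol update partitionname="$1" state=DRAIN
    # Step 2: cancel the matching jobs with a single filtered scancel call.
    run scancel --user="$2" --partition="$1" --state="$3"
}

drain_and_cancel "$PARTITION" "$JOB_USER" "$JOB_STATE"
```

Running it unchanged prints the two commands for review. Setting DRY_RUN=0 executes them, which only makes sense once the job-info RPCs are small enough (or MAX_MSG_SIZE large enough) for scancel to work again.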

