I've bumped the size up from 16MB to 128MB in the version 2.3 code.
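
For anyone reading this in the archive, here is a rough sketch of the kind of check that produces "Insane message length". It is not the actual Slurm source. It assumes each message is framed with a 4-byte length prefix in network byte order, and the names decode_msg_len, check_msg_len, OLD_MAX_MSG_SIZE and NEW_MAX_MSG_SIZE are made up for illustration. Only MAX_MSG_SIZE and the 16MB/128MB values come from this thread.

```c
#include <stdint.h>

/* Hypothetical constants mirroring the two limits mentioned in this thread. */
#define OLD_MAX_MSG_SIZE (16u * 1024u * 1024u)   /* 16MB, the earlier limit */
#define NEW_MAX_MSG_SIZE (128u * 1024u * 1024u)  /* 128MB, the raised limit */

/* Decode a 4-byte big-endian (network byte order) length prefix.
 * Assumption: the framing is a plain 32-bit length header. */
static uint32_t decode_msg_len(const unsigned char hdr[4])
{
    return ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
           ((uint32_t)hdr[2] << 8)  |  (uint32_t)hdr[3];
}

/* Return 0 if the announced length is acceptable, -1 if it exceeds the
 * configured maximum (the "insane" case). */
static int check_msg_len(uint32_t msg_len, uint32_t max_msg_size)
{
    return (msg_len > max_msg_size) ? -1 : 0;
}
```

Under these assumptions, a reply announcing a 32MB payload would be rejected against the old limit and accepted against the new one, which matches the behavior reported below once the job list grew large.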
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Josh England [[email protected]]
Sent: Thursday, April 28, 2011 4:18 PM
To: [email protected]
Subject: RE: [slurm-dev] Insane message length

Bumping up MAX_MSG_SIZE made things work again.

-JE

On Thu, 2011-04-28 at 16:00 -0700, Jette, Moe wrote:
> Max RPC length is configured to be 16MB.
> You could change MAX_MSG_SIZE in 
> src/common/slurm_protocol_socket_implementation.c
> if necessary.
>
> I'd first stop queuing new jobs:
> scontrol update partitionname=X state=DRAIN
>
> Hopefully after a bit you'll be able to figure out what
> all of these jobs are and kill a bunch of them.
> scancel has a bunch of filters so you can probably
> kill them all with a single command (after the get job
> info RPCs get down to a reasonable size).
> ________________________________________
> From: [email protected] [[email protected]] On 
> Behalf Of Josh England [[email protected]]
> Sent: Thursday, April 28, 2011 3:40 PM
> To: [email protected]
> Subject: [slurm-dev] Insane message length
>
> I've got a crapton of running jobs (maybe around 25000) on a cluster
> using slurm 2.2.4.  At some point it hit a bad threshold.  Many commands
> like squeue and scancel started complaining and the slurmctld is pegged
> at 100% CPU:
> scancel: error: slurm_receive_msg: Insane message length
> slurm_load_jobs error: Insane message length
>
> I can't cancel jobs to get back to a sane number.  Has anyone seen this
> before?
>
> -JE
>
>
>


