[slurm-dev] Re: Insane message length

Paul Edmon Sun, 29 Sep 2013 14:39:13 -0700

Yeah, that's why we set the 500,000 job limit. Though I didn't anticipate the insane message length issue.

If I drop the MaxJobCount will it purge jobs to get down to that? Or will it just prohibit new jobs?

I'm assuming that this rebuild would need to be pushed out everywhere as well? Both clients and master?


-Paul Edmon-

On 9/29/2013 3:40 PM, Moe Jette wrote:


See MAX_MSG_SIZE in src/common/slurm_protocol_socket_implementation.c

That should get you going again, but setting per-user job limits is strongly recommended longer term. That should prevent a rogue script from bringing the system to its knees.
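
For readers unfamiliar with why a receiver would reject a reply outright, here is a minimal sketch of a length-prefixed read with a size cap. It is not Slurm's code: the 4-byte header layout, the function name read_message, the exception name, and the cap value are all assumptions made for illustration. The real limit is whatever MAX_MSG_SIZE is defined as in the source file named above.

```python
import io
import struct

# Hypothetical cap for illustration only; the real value is whatever
# MAX_MSG_SIZE is defined as in the Slurm source tree being built.
MAX_MSG_SIZE = 16 * 1024 * 1024


class InsaneMessageLength(Exception):
    """Raised when a declared payload length exceeds the receiver's cap."""


def read_message(stream, max_size=MAX_MSG_SIZE):
    """Read one message framed as a 4-byte big-endian length plus payload."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("short read on length header")
    (length,) = struct.unpack("!I", header)
    # Reject before reading or allocating anything for the payload.
    if length > max_size:
        raise InsaneMessageLength(f"declared {length} bytes, limit {max_size}")
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("short read on payload")
    return payload


if __name__ == "__main__":
    framed = struct.pack("!I", 5) + b"hello"
    print(read_message(io.BytesIO(framed)))
```

Under this kind of scheme, a reply describing roughly 500,000 jobs can exceed the cap even though nothing is corrupt, which is why raising the limit and rebuilding gets the query commands working again.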


Moe


Quoting Paul Edmon <[email protected]>:

Where is the max message size limit set in SLURM? That's probably the best route at this point.


-Paul Edmon-

On 9/29/2013 3:26 PM, Morris Jette wrote:

Here are some options

1. Use scontrol to set queue state to drain and prevent more jobs from being submitted
2. Lower the job limit to block new job submissions
3. Increase the max message size limit and rebuild Slurm
4. Check accounting records for the rogue user
5. Long term, set user job limits and train users to run multiple steps on fewer jobs
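
Option 4 can be sketched as a small tally over accounting output. This assumes the records have already been exported as pipe-delimited "User|JobID" lines, for example with something like sacct -a -X -n -P -o User,JobID (check those flags against your installed version). The function name jobs_per_user and the sample user names are made up for illustration.

```python
from collections import Counter


def jobs_per_user(sacct_text):
    """Count job records per user from pipe-delimited 'User|JobID' lines."""
    counts = Counter()
    for line in sacct_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first field is assumed to be the user name.
        user, _, _jobid = line.partition("|")
        if user:
            counts[user] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample data, not real accounting records.
    sample = "alice|101\nbob|102\nalice|103\nalice|104\n"
    for user, n in jobs_per_user(sample).most_common(5):
        print(user, n)
```

Sorting by count puts the heaviest submitter first, which is usually enough to identify a runaway submission script even when squeue itself cannot return the queue.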


Paul Edmon <[email protected]> wrote:

   [root@holy-slurm01 ~]# squeue
   squeue: error: slurm_receive_msg: Insane message length
   slurm_load_jobs error: Insane message length

   [root@holy-slurm01 ~]# sdiag
   *******************************************************
   sdiag output at Sun Sep 29 15:12:13 2013
   Data since      Sat Sep 28 20:00:01 2013
   *******************************************************
   Server thread count: 3
   Agent queue size:    0

   Jobs submitted: 21797
   Jobs started:   12030
   Jobs completed: 12209
   Jobs canceled:  70
   Jobs failed:    5

   Main schedule statistics (microseconds):
   Last cycle:   9207042
   Max cycle:    10088674
   Total cycles: 1563
   Mean cycle:   17859
   Mean depth cycle:  12138
   Cycles per minute: 1
   Last queue length: 496816

   Backfilling stats
   Total backfilled jobs (since last slurm start): 9325
   Total backfilled jobs (since last stats cycle start): 4952
   Total cycles: 84
   Last cycle when: Sun Sep 29 15:06:15 2013
   Last cycle: 2555321
   Max cycle:  27633565
   Mean cycle: 6115033
   Last depth cycle: 3
   Last depth cycle (try sched): 2
   Depth Mean: 278
   Depth Mean (try depth): 62
   Last queue length: 496814
   Queue length mean: 100807

   I'm guessing this is due to the fact that there are roughly 500,000 jobs
   in the queue.  This is at our upper limit, which is 500,000
   (MaxJobCount).  Is there anything that can be done about this?  It seems
   that commands that query jobs, such as squeue and scancel, are not
   working.  So I can't tell who sent in this many jobs.

   -Paul Edmon-


--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
