Here are some options:
1. Use scontrol to set the partition (queue) state to DRAIN and prevent more
jobs from being submitted
2. Lower the job limit to block new job submissions
3. Increase the max message size limit and rebuild Slurm
4. Check accounting records for the rogue user
5. Long term, set user job limits and train them to run multiple steps in fewer
jobs
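
As a rough sketch of options 1, 4 and 5, something like the following might work. The function names are made up for illustration, and the partition name, user name and limit you pass in are placeholders for your site. I have not run this against a controller in this state, and MaxSubmitJobs is only enforced if accounting limits are enabled (AccountingStorageEnforce includes limits), so treat it as a starting point.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; arguments are site-specific.

# Option 1: put a partition in DRAIN so new submissions are rejected,
# while jobs already queued there can still be scheduled and run.
drain_partition() {
    scontrol update PartitionName="$1" State=DRAIN
}

# Option 4: rank users by number of jobs.
# Reads one user name per line on stdin, prints "count user" lines,
# highest count first. Optional first argument: how many users to show.
top_submitters() {
    sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Option 5: cap the number of jobs a user may have pending or running.
# -i commits the change without an interactive confirmation prompt.
set_submit_limit() {
    sacctmgr -i modify user where name="$1" set MaxSubmitJobs="$2"
}
```

Since squeue is failing, the user names would have to come from accounting, for example:

    sacct -a -X -n -S 2013-09-28 -o user | top_submitters 5

This assumes slurmdbd or another accounting store is still answering queries.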

Paul Edmon <[email protected]> wrote:
>
>[root@holy-slurm01 ~]# squeue
>squeue: error: slurm_receive_msg: Insane message length
>slurm_load_jobs error: Insane message length
>
>[root@holy-slurm01 ~]# sdiag
>*******************************************************
>sdiag output at Sun Sep 29 15:12:13 2013
>Data since      Sat Sep 28 20:00:01 2013
>*******************************************************
>Server thread count: 3
>Agent queue size:    0
>
>Jobs submitted: 21797
>Jobs started:   12030
>Jobs completed: 12209
>Jobs canceled:  70
>Jobs failed:    5
>
>Main schedule statistics (microseconds):
>         Last cycle:   9207042
>         Max cycle:    10088674
>         Total cycles: 1563
>         Mean cycle:   17859
>         Mean depth cycle:  12138
>         Cycles per minute: 1
>         Last queue length: 496816
>
>Backfilling stats
>         Total backfilled jobs (since last slurm start): 9325
>         Total backfilled jobs (since last stats cycle start): 4952
>         Total cycles: 84
>         Last cycle when: Sun Sep 29 15:06:15 2013
>         Last cycle: 2555321
>         Max cycle:  27633565
>         Mean cycle: 6115033
>         Last depth cycle: 3
>         Last depth cycle (try sched): 2
>         Depth Mean: 278
>         Depth Mean (try depth): 62
>         Last queue length: 496814
>         Queue length mean: 100807
>
>I'm guessing this is due to the fact that there are roughly 500,000 jobs
>in the queue.  This is at our upper limit which is 500,000
>(MaxJobCount).  Is there anything that can be done about this?  It seems
>that commands that query jobs such as squeue and scancel are not
>working.  So I can't tell who sent in this many jobs.
>
>-Paul Edmon-

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
