Where is the max message size limit set in SLURM? That's probably the
best route at this point.
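My guess is that it is a compile-time constant in the protocol code
rather than a slurm.conf option, so something like this against the
source tree should turn up the spot (the path is a guess on my part):

    grep -rn "Insane message length" src/

If that is the case, raising the limit means changing the constant and
rebuilding, which lines up with option 3 below.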
-Paul Edmon-
On 9/29/2013 3:26 PM, Morris Jette wrote:
Here are some options:
1. Use scontrol to set the queue state to drain and prevent more jobs
from being submitted (see the example commands after this list)
2. Lower the job limit to block new job submissions
3. Increase the max message size limit and rebuild Slurm
4. Check the accounting records for the rogue user
5. Long term, set per-user job limits and train users to run multiple
steps on fewer jobs
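Roughly, for options 1, 4 and 5 (the partition name, date and script
contents below are placeholders, and 4 assumes accounting is enabled):

    # 1. Stop new work on a partition while cleaning up
    scontrol update PartitionName=<partition> State=DRAIN

    # 4. Count job records per user since the flood started
    sacct -a -X -n -S 2013-09-28 -o User | sort | uniq -c | sort -rn | head

    # 5. Many steps in one job instead of many jobs: a single batch
    #    script can wrap a loop of srun calls
    cat > many_steps.sbatch <<'EOF'
    #!/bin/bash
    #SBATCH -n 1
    #SBATCH -t 1:00:00
    for f in input.*; do
        srun ./process "$f"
    done
    EOF
    sbatch many_steps.sbatch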
Paul Edmon <[email protected]> wrote:
[root@holy-slurm01 ~]# squeue
squeue: error: slurm_receive_msg: Insane message length
slurm_load_jobs error: Insane message length
[root@holy-slurm01 ~]# sdiag
*******************************************************
sdiag output at Sun Sep 29 15:12:13 2013
Data since Sat Sep 28 20:00:01 2013
*******************************************************
Server thread count: 3
Agent queue size: 0
Jobs submitted: 21797
Jobs started: 12030
Jobs completed: 12209
Jobs canceled: 70
Jobs failed: 5
Main schedule statistics (microseconds):
Last cycle: 9207042
Max cycle: 10088674
Total cycles: 1563
Mean cycle: 17859
Mean depth cycle: 12138
Cycles per minute: 1
Last queue length: 496816
Backfilling stats
Total backfilled jobs (since last slurm start): 9325
Total backfilled jobs (since last stats cycle start): 4952
Total cycles: 84
Last cycle when: Sun Sep 29 15:06:15 2013
Last cycle: 2555321
Max cycle: 27633565
Mean cycle: 6115033
Last depth cycle: 3
Last depth cycle (try sched): 2
Depth Mean: 278
Depth Mean (try depth): 62
Last queue length: 496814
Queue length mean: 100807
I'm guessing this is because there are roughly 500,000 jobs in the
queue, which is right at our upper limit of 500,000 (MaxJobCount). Is
there anything that can be done about this? Commands that query jobs,
such as squeue and scancel, are not working, so I can't tell who
submitted this many jobs.
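The configured ceiling itself can at least be confirmed without
touching the job table, since that reply is small:

    scontrol show config | grep MaxJobCount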
-Paul Edmon-
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.