That goes into the Slurm database. There are about 20 different limits available by user or group. See the resource limits web page.
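For example, a per-user cap on submitted jobs can be stored in the accounting database with sacctmgr. This is only a sketch: the user name and limit value below are placeholders, and association limits are only enforced if slurm.conf asks for it.

    # Placeholder user name and limit value:
    sacctmgr modify user where name=someuser set MaxSubmitJobs=5000

    # slurm.conf must enforce association limits, e.g.:
    #   AccountingStorageEnforce=limits
    # then apply the change with: scontrol reconfigure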
Paul Edmon <[email protected]> wrote: > >That's good to hear. Is there an option to do it per user? I didn't >see one in the slurm.conf. I may have missed it. > >-Paul Edmon- > >On 9/29/2013 5:42 PM, Moe Jette wrote: >> >> Quoting Paul Edmon <[email protected]>: >> >>> >>> Yeah, that's why we set the 500,000 job limit. Though I didn't >>> anticipate the insane message length issue. >> >> I'd recommend per-user job limits too. >> >> >>> If I drop the MaxJobCount will it purge jobs to get down to that? Or > >>> will it just prohibit new jobs? >> >> It will only prohibit new jobs. >> >> >>> I'm assuming that this rebuild would need to be pushed out >everywhere >>> as well? Both clients and master? >> >> Only needed on the clients. >> >> >>> -Paul Edmon- >>> >>> On 9/29/2013 3:40 PM, Moe Jette wrote: >>>> >>>> See MAX_MSG_SIZE in >src/common/slurm_protocol_socket_implementation.c >>>> >>>> That should get you going again, but setting per user job limits >>>> strongly recommended longer term. That should prevent a rogue >script >>>> from bringing the system to its knees. >>>> >>>> Moe >>>> >>>> >>>> Quoting Paul Edmon <[email protected]>: >>>> >>>>> Where is the max message size limit set in SLURM? That's probably > >>>>> the best route at this point. >>>>> >>>>> -Paul Edmon- >>>>> >>>>> On 9/29/2013 3:26 PM, Morris Jette wrote: >>>>>> Here are some options >>>>>> 1. User scontrol to set queue state to drain and prevent more >jobs >>>>>> from being submitted >>>>>> 2. Lower the job limit to block new job submissions >>>>>> 3. Increase the max message size limit and rebuild Slurm >>>>>> 4. Check accounting records for the rogue user >>>>>> 5. Long term set user job limits and train them to run multiple >>>>>> steps on fewer jovs >>>>>> >>>>>> Paul Edmon <[email protected]> wrote: >>>>>> >>>>>> [root@holy-slurm01 ~]# squeue >>>>>> squeue: error: slurm_receive_msg: Insane message length >>>>>> slurm_load_jobs error: Insane message length >>>>>> >>>>>> [root@holy-slurm01 ~]# sdiag >>>>>> ******************************************************* >>>>>> sdiag output at Sun Sep 29 15:12:13 2013 >>>>>> Data since Sat Sep 28 20:00:01 2013 >>>>>> ******************************************************* >>>>>> Server thread count: 3 >>>>>> Agent queue size: 0 >>>>>> >>>>>> Jobs submitted: 21797 >>>>>> Jobs started: 12030 >>>>>> Jobs completed: 12209 >>>>>> Jobs canceled: 70 >>>>>> Jobs failed: 5 >>>>>> >>>>>> Main schedule statistics (microseconds): >>>>>> Last cycle: 9207042 >>>>>> Max cycle: 10088674 >>>>>> Total cycles: 1563 >>>>>> Mean cycle: 17859 >>>>>> Mean depth cycle: 12138 >>>>>> Cycles per minute: 1 >>>>>> Last queue length: 496816 >>>>>> >>>>>> Backfilling stats >>>>>> Total backfilled jobs (since last slurm start): 9325 >>>>>> Total backfilled jobs (since last stats cycle >>>>>> start): 4952 >>>>>> Total cycles: 84 >>>>>> Last cycle when: Sun Sep 29 15:06:15 2013 >>>>>> Last cycle: 2555321 >>>>>> Max cycle: 27633565 >>>>>> Mean cycle: 6115033 >>>>>> Last depth cycle: 3 >>>>>> Last depth cycle (try sched): 2 >>>>>> Depth Mean: 278 >>>>>> Depth Mean (try depth): 62 >>>>>> Last queue length: 496814 >>>>>> Queue length mean: 100807 >>>>>> >>>>>> I'm guessing this is due to the fact that there are roughly >>>>>> 500,000 jobs >>>>>> in the queue. This is at our upper limit which is 500,000 >>>>>> (MaxJobCount). Is there anything that can be done about this? > >>>>>> It seems >>>>>> that commands that query jobs such as squeue and scancel are >not >>>>>> working. 
So I can't tell who sent in this many jobs. >>>>>> >>>>>> -Paul Edmon- >>>>>> >>>>>> >>>>>> -- >>>>>> Sent from my Android phone with K-9 Mail. Please excuse my >brevity. >>>>> >>>>> >>>> >>> >> -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
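Since squeue is failing, the accounting records (option 4 in the list above) are probably the quickest way to see who submitted the flood. A rough sketch with sacct; the start time and output fields here are only an illustration and should be adjusted to the actual window:

    # Count recorded jobs per user since the suspected start of the flood
    sacct --allusers --starttime=2013-09-28T20:00 --format=User --noheader \
        | sort | uniq -c | sort -rn | head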
