Just cancel the primary job ID.
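For example, if one of the offending arrays had primary job ID 12345 (a placeholder, not a real ID from this cluster), a single scancel on that ID should remove every task in that array:

    # Cancel all tasks of a job array via its primary job ID (12345 is a placeholder)
    scancel 12345

    # Or, once the accounting records show who submitted them, cancel all of that
    # user's jobs at once ("someuser" is a placeholder)
    scancel --user=someuser
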
Paul Edmon <[email protected]> wrote:

> By the way, it looks like this was caused by a user submitting 47 job
> array jobs, each with 10,000 tasks in the array, which ended up
> producing 470,000 jobs. Is there a quick way to cancel job arrays? If I
> were to guess, canceling the primary ID would take care of all of them?
>
> -Paul Edmon-
>
> On 9/29/2013 6:39 PM, Paul Edmon wrote:
>> Increasing the MAX_MSG_SIZE to 1024*1024*1024 worked. Is there any
>> reason this couldn't be pushed back into the main tree? Or do you
>> guys want to keep the smaller message size?
>>
>> -Paul Edmon-
>>
>> On 9/29/2013 6:00 PM, Paul Edmon wrote:
>>> Ah, okay. I figured that might be the case.
>>>
>>> -Paul Edmon-
>>>
>>> On 9/29/2013 5:58 PM, Morris Jette wrote:
>>>> That goes into the Slurm database. There are about 20 different
>>>> limits available by user or group. See the resource limits web page.
>>>>
>>>> Paul Edmon <[email protected]> wrote:
>>>>
>>>>> That's good to hear. Is there an option to do it per user? I didn't
>>>>> see one in slurm.conf. I may have missed it.
>>>>>
>>>>> -Paul Edmon-
>>>>>
>>>>> On 9/29/2013 5:42 PM, Moe Jette wrote:
>>>>>> Quoting Paul Edmon <[email protected]>:
>>>>>>
>>>>>>> Yeah, that's why we set the 500,000 job limit. Though I
>>>>>>> didn't anticipate the insane message length issue.
>>>>>>
>>>>>> I'd recommend per-user job limits too.
>>>>>>
>>>>>>> If I drop the MaxJobCount will it purge jobs to get down
>>>>>>> to that? Or will it just prohibit new jobs?
>>>>>>
>>>>>> It will only prohibit new jobs.
>>>>>>
>>>>>>> I'm assuming that this rebuild would need to be pushed
>>>>>>> out everywhere as well? Both clients and master?
>>>>>>
>>>>>> Only needed on the clients.
>>>>>>
>>>>>>> -Paul Edmon-
>>>>>>>
>>>>>>> On 9/29/2013 3:40 PM, Moe Jette wrote:
>>>>>>>> See MAX_MSG_SIZE in
>>>>>>>> src/common/slurm_protocol_socket_implementation.c.
>>>>>>>> That should get you going again, but setting per-user job
>>>>>>>> limits is strongly recommended longer term. That should
>>>>>>>> prevent a rogue script from bringing the system to its knees.
>>>>>>>>
>>>>>>>> Moe
>>>>>>>>
>>>>>>>> Quoting Paul Edmon <[email protected]>:
>>>>>>>>
>>>>>>>>> Where is the max message size limit set in SLURM? That's
>>>>>>>>> probably the best route at this point.
>>>>>>>>>
>>>>>>>>> -Paul Edmon-
>>>>>>>>>
>>>>>>>>> On 9/29/2013 3:26 PM, Morris Jette wrote:
>>>>>>>>>> Here are some options:
>>>>>>>>>> 1. Use scontrol to set the queue state to drain and
>>>>>>>>>>    prevent more jobs from being submitted
>>>>>>>>>> 2. Lower the job limit to block new job submissions
>>>>>>>>>> 3. Increase the max message size limit and rebuild Slurm
>>>>>>>>>> 4. Check accounting records for the rogue user
>>>>>>>>>> 5. Long term, set user job limits and train them to run
>>>>>>>>>>    multiple steps on fewer jobs
>>>>>>>>>>
>>>>>>>>>> Paul Edmon <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> [root@holy-slurm01 ~]# squeue
>>>>>>>>>>> squeue: error: slurm_receive_msg: Insane message length
>>>>>>>>>>> slurm_load_jobs error: Insane message length
>>>>>>>>>>>
>>>>>>>>>>> [root@holy-slurm01 ~]# sdiag
>>>>>>>>>>> *******************************************************
>>>>>>>>>>> sdiag output at Sun Sep 29 15:12:13 2013
>>>>>>>>>>> Data since      Sat Sep 28 20:00:01 2013
>>>>>>>>>>> *******************************************************
>>>>>>>>>>> Server thread count: 3
>>>>>>>>>>> Agent queue size:    0
>>>>>>>>>>>
>>>>>>>>>>> Jobs submitted: 21797
>>>>>>>>>>> Jobs started:   12030
>>>>>>>>>>> Jobs completed: 12209
>>>>>>>>>>> Jobs canceled:  70
>>>>>>>>>>> Jobs failed:    5
>>>>>>>>>>>
>>>>>>>>>>> Main schedule statistics (microseconds):
>>>>>>>>>>>   Last cycle:        9207042
>>>>>>>>>>>   Max cycle:         10088674
>>>>>>>>>>>   Total cycles:      1563
>>>>>>>>>>>   Mean cycle:        17859
>>>>>>>>>>>   Mean depth cycle:  12138
>>>>>>>>>>>   Cycles per minute: 1
>>>>>>>>>>>   Last queue length: 496816
>>>>>>>>>>>
>>>>>>>>>>> Backfilling stats
>>>>>>>>>>>   Total backfilled jobs (since last slurm start): 9325
>>>>>>>>>>>   Total backfilled jobs (since last stats cycle start): 4952
>>>>>>>>>>>   Total cycles:     84
>>>>>>>>>>>   Last cycle when:  Sun Sep 29 15:06:15 2013
>>>>>>>>>>>   Last cycle:       2555321
>>>>>>>>>>>   Max cycle:        27633565
>>>>>>>>>>>   Mean cycle:       6115033
>>>>>>>>>>>   Last depth cycle: 3
>>>>>>>>>>>   Last depth cycle (try sched): 2
>>>>>>>>>>>   Depth Mean:       278
>>>>>>>>>>>   Depth Mean (try depth): 62
>>>>>>>>>>>   Last queue length: 496814
>>>>>>>>>>>   Queue length mean: 100807
>>>>>>>>>>>
>>>>>>>>>>> I'm guessing this is due to the fact that there are roughly
>>>>>>>>>>> 500,000 jobs in the queue. This is at our upper limit, which
>>>>>>>>>>> is 500,000 (MaxJobCount). Is there anything that can be done
>>>>>>>>>>> about this? It seems that commands that query jobs, such as
>>>>>>>>>>> squeue and scancel, are not working, so I can't tell who sent
>>>>>>>>>>> in this many jobs.
>>>>>>>>>>>
>>>>>>>>>>> -Paul Edmon-
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>>>>
>>>> --
>>>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
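On the per-user limits mentioned a few messages up: those live in the accounting database rather than slurm.conf, and a cap on queued jobs per user can be set with sacctmgr, for example (the user name and number below are placeholders, not recommendations):

    # Limit how many jobs a single user may have submitted/queued at once
    sacctmgr modify user name=someuser set MaxSubmitJobs=10000

    # Limits only take effect if accounting enforcement is enabled in slurm.conf,
    # e.g. AccountingStorageEnforce=limits
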
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
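
The workaround Paul describes above is a local source edit, roughly along these lines (a sketch, not an official patch):

    # Locate the compiled-in cap that triggers the "Insane message length" errors
    grep -n MAX_MSG_SIZE src/common/slurm_protocol_socket_implementation.c

    # Raise the value (Paul used 1024*1024*1024), then rebuild and reinstall the
    # client commands; per the thread, only the clients needed the rebuild.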

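For options 1 and 2 in Morris's list, blocking new submissions while the queue is cleaned up might look roughly like this (the partition name is a placeholder, and exact partition state handling can vary by Slurm version):

    # Option 1: drain a partition so no new jobs can be submitted to it
    scontrol update PartitionName=general State=DRAIN
    # Re-open it once the backlog is cleared
    scontrol update PartitionName=general State=UP

    # Option 2: the queue ceiling is MaxJobCount in slurm.conf (500000 on this
    # cluster); lowering it blocks new submissions but, as noted in the thread,
    # does not purge jobs that are already queued.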