Just cancel the primary job ID.
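For example (the job ID and user name below are just placeholders):

    # cancel one array (all of its tasks) via its primary job ID
    scancel 1234567

    # or cancel everything belonging to the submitting user
    scancel --user=someuser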

Paul Edmon <[email protected]> wrote:
>By the way, it looks like this was caused by a user submitting 47 job 
>array jobs, each with 10,000 tasks in the array, which ended up 
>producing 470,000 jobs.  Is there a quick way to cancel job arrays?  If 
>I were to guess, cancelling the primary ID would take care of all of 
>them?
>
>-Paul Edmon-
>
>On 9/29/2013 6:39 PM, Paul Edmon wrote:
>> Increasing the MAX_MSG_SIZE to 1024*1024*1024 worked.  Is there any 
>> reason this couldn't be pushed back into the main tree?  Or do you 
>> guys want to keep the smaller message size?
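(For anyone hitting the same "Insane message length" error, a rough sketch
of the procedure; the build commands are generic autotools steps and may
need adjusting for your installation:)

    # the limit is a #define in the protocol code (path quoted further down)
    grep -n MAX_MSG_SIZE src/common/slurm_protocol_socket_implementation.c

    # after raising the value, e.g. to (1024*1024*1024), rebuild and
    # reinstall on the client nodes
    ./configure && make && make install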
>>
>> -Paul Edmon-
>>
>> On 9/29/2013 6:00 PM, Paul Edmon wrote:
>>> Ah, okay.  I figured that might be the case.
>>>
>>> -Paul Edmon-
>>>
>>> On 9/29/2013 5:58 PM, Morris Jette wrote:
>>>> That goes into the Slurm database. There are about 20 different 
>>>> limits available by user or group. See the resource limits web page.
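(As a sketch, a per-user cap set with sacctmgr; the user name and numbers
are placeholders, and AccountingStorageEnforce in slurm.conf must include
"limits" for them to be enforced:)

    # cap how many jobs one user can have submitted and running
    sacctmgr modify user name=someuser set MaxSubmitJobs=10000 MaxJobs=2000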
>>>>
>>>> Paul Edmon <[email protected]> wrote:
>>>>
>>>>     That's good to hear.  Is there an option to do it per user?  I
>>>>     didn't see one in the slurm.conf.  I may have missed it.
>>>>
>>>>     -Paul Edmon-
>>>>
>>>>     On 9/29/2013 5:42 PM, Moe Jette wrote:
>>>>
>>>>         Quoting Paul Edmon <[email protected]>:
>>>>
>>>>             Yeah, that's why we set the 500,000 job limit. Though I
>>>>             didn't anticipate the insane message length issue.
>>>>
>>>>         I'd recommend per-user job limits too.
>>>>
>>>>             If I drop the MaxJobCount will it purge jobs to get down
>>>>             to that? Or will it just prohibit new jobs?
>>>>
>>>>         It will only prohibit new jobs.
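(For reference, the limit itself is just a slurm.conf setting on the
controller; the value shown is only an example, and a restart of slurmctld
may be needed for it to take effect:)

    # slurm.conf
    MaxJobCount=500000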
>>>>
>>>>             I'm assuming that this rebuild would need to be pushed
>>>>             out everywhere as well? Both clients and master?
>>>>
>>>>         Only needed on the clients.
>>>>
>>>>             -Paul Edmon-
>>>>
>>>>             On 9/29/2013 3:40 PM, Moe Jette wrote:
>>>>
>>>>                 See MAX_MSG_SIZE in
>>>>                 src/common/slurm_protocol_socket_implementation.c
>>>>                 That should get you going again, but setting per-user
>>>>                 job limits is strongly recommended longer term.  That
>>>>                 should prevent a rogue script from bringing the system
>>>>                 to its knees.
>>>>
>>>>                 Moe
>>>>
>>>>                 Quoting Paul Edmon <[email protected]>:
>>>>
>>>>                     Where is the max message size limit set in
>>>>                     SLURM? That's probably the best route at this
>>>>                     point.
>>>>
>>>>                     -Paul Edmon-
>>>>
>>>>                     On 9/29/2013 3:26 PM, Morris Jette wrote:
>>>>
>>>>                         Here are some options:
>>>>                         1. Use scontrol to set the queue state to drain
>>>>                            and prevent more jobs from being submitted
>>>>                         2. Lower the job limit to block new job
>>>>                            submissions
>>>>                         3. Increase the max message size limit and
>>>>                            rebuild Slurm
>>>>                         4. Check accounting records for the rogue user
>>>>                         5. Long term, set user job limits and train them
>>>>                            to run multiple steps on fewer jobs
>>>>
>>>>                         Paul Edmon <[email protected]> wrote:
>>>>                         [root@holy-slurm01 ~]# squeue
>>>>                         squeue: error: slurm_receive_msg: Insane message length
>>>>                         slurm_load_jobs error: Insane message length
>>>>                         [root@holy-slurm01 ~]# sdiag
>>>>                         *******************************************************
>>>>                         sdiag output at Sun Sep 29 15:12:13 2013
>>>>                         Data since Sat Sep 28 20:00:01 2013
>>>>                         *******************************************************
>>>>                         Server thread count: 3
>>>>                         Agent queue size:    0
>>>>
>>>>                         Jobs submitted: 21797
>>>>                         Jobs started:   12030
>>>>                         Jobs completed: 12209
>>>>                         Jobs canceled:  70
>>>>                         Jobs failed:    5
>>>>
>>>>                         Main schedule statistics (microseconds):
>>>>                             Last cycle:        9207042
>>>>                             Max cycle:         10088674
>>>>                             Total cycles:      1563
>>>>                             Mean cycle:        17859
>>>>                             Mean depth cycle:  12138
>>>>                             Cycles per minute: 1
>>>>                             Last queue length: 496816
>>>>
>>>>                         Backfilling stats
>>>>                             Total backfilled jobs (since last slurm start): 9325
>>>>                             Total backfilled jobs (since last stats cycle start): 4952
>>>>                             Total cycles: 84
>>>>                             Last cycle when: Sun Sep 29 15:06:15 2013
>>>>                             Last cycle:   2555321
>>>>                             Max cycle:    27633565
>>>>                             Mean cycle:   6115033
>>>>                             Last depth cycle: 3
>>>>                             Last depth cycle (try sched): 2
>>>>                             Depth Mean: 278
>>>>                             Depth Mean (try depth): 62
>>>>                             Last queue length: 496814
>>>>                             Queue length mean: 100807
>>>>
>>>>                         I'm guessing this is due to the fact that there
>>>>                         are roughly 500,000 jobs in the queue. This is
>>>>                         at our upper limit which is 500,000
>>>>                         (MaxJobCount). Is there anything that can be
>>>>                         done about this? It seems that commands that
>>>>                         query jobs such as squeue and scancel are not
>>>>                         working. So I can't tell who sent in this many
>>>>                         jobs.
>>>>
>>>>                         -Paul Edmon-
>>>>
>>>>                         --
>>>>                         Sent from my Android phone with K-9 Mail.
>>>>                         Please excuse my brevity.
>>>>
>>>>
>>>> -- 
>>>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>>>
>>
>
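Since squeue is unusable until the rebuild, here is a rough sketch of
options 1 and 4 from the list quoted above (the partition name is a
placeholder):

    # 1. stop new submissions to a partition until the backlog clears
    scontrol update PartitionName=general State=DRAIN
    #    (set State=UP again afterwards)

    # 4. identify the submitter from the accounting records instead
    sacct --allusers --starttime=2013-09-28 --format=User,JobID,Submit,State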

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
