That goes into the Slurm database. There are about 20 different limits 
available by user or group. See the resource limits web page.
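For reference, a sketch of how the per-user limits are set through the accounting database with sacctmgr (the user name, account name, and limit values below are purely illustrative, and this assumes accounting storage via slurmdbd):

```shell
# Illustrative sketch -- names and values are made up.
# Cap how many jobs one user can have running at once:
sacctmgr modify user where name=someuser set maxjobs=100
# Cap how many jobs that user can have queued in total, which is
# what guards against a runaway submission script:
sacctmgr modify user where name=someuser set maxsubmitjobs=1000
# Verify the association limits that were applied:
sacctmgr show assoc where user=someuser format=user,maxjobs,maxsubmitjobs
```

Note that the limits are only enforced if slurm.conf includes "limits" in AccountingStorageEnforce.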

Paul Edmon <[email protected]> wrote:
>
>That's good to hear.  Is there an option to do it per user?  I didn't 
>see one in the slurm.conf.  I may have missed it.
>
>-Paul Edmon-
>
>On 9/29/2013 5:42 PM, Moe Jette wrote:
>>
>> Quoting Paul Edmon <[email protected]>:
>>
>>>
>>> Yeah, that's why we set the 500,000 job limit.  Though I didn't 
>>> anticipate the insane message length issue.
>>
>> I'd recommend per-user job limits too.
>>
>>
>>> If I drop the MaxJobCount will it purge jobs to get down to that? Or 
>>> will it just prohibit new jobs?
>>
>> It will only prohibit new jobs.
>>
>>
>>> I'm assuming that this rebuild would need to be pushed out everywhere 
>>> as well?  Both clients and master?
>>
>> Only needed on the clients.
>>
>>
>>> -Paul Edmon-
>>>
>>> On 9/29/2013 3:40 PM, Moe Jette wrote:
>>>>
>>>> See MAX_MSG_SIZE in src/common/slurm_protocol_socket_implementation.c
>>>>
>>>> That should get you going again, but setting per-user job limits is 
>>>> strongly recommended longer term. That should prevent a rogue script 
>>>> from bringing the system to its knees.
>>>>
>>>> Moe
>>>>
>>>>
>>>> Quoting Paul Edmon <[email protected]>:
>>>>
>>>>> Where is the max message size limit set in SLURM?  That's probably 
>>>>> the best route at this point.
>>>>>
>>>>> -Paul Edmon-
>>>>>
>>>>> On 9/29/2013 3:26 PM, Morris Jette wrote:
>>>>>> Here are some options:
>>>>>> 1. Use scontrol to set the queue state to drain and prevent more 
>>>>>> jobs from being submitted
>>>>>> 2. Lower the job limit to block new job submissions
>>>>>> 3. Increase the max message size limit and rebuild Slurm
>>>>>> 4. Check the accounting records for the rogue user
>>>>>> 5. Long term, set per-user job limits and train users to run 
>>>>>> multiple steps in fewer jobs
>>>>>>
>>>>>> Paul Edmon <[email protected]> wrote:
>>>>>>
>>>>>>   [root@holy-slurm01 ~]# squeue
>>>>>>   squeue: error: slurm_receive_msg: Insane message length
>>>>>>   slurm_load_jobs error: Insane message length
>>>>>>
>>>>>>   [root@holy-slurm01 ~]# sdiag
>>>>>>   *******************************************************
>>>>>>   sdiag output at Sun Sep 29 15:12:13 2013
>>>>>>   Data since      Sat Sep 28 20:00:01 2013
>>>>>>   *******************************************************
>>>>>>   Server thread count: 3
>>>>>>   Agent queue size:    0
>>>>>>
>>>>>>   Jobs submitted: 21797
>>>>>>   Jobs started:   12030
>>>>>>   Jobs completed: 12209
>>>>>>   Jobs canceled:  70
>>>>>>   Jobs failed:    5
>>>>>>
>>>>>>   Main schedule statistics (microseconds):
>>>>>>   Last cycle:   9207042
>>>>>>   Max cycle:    10088674
>>>>>>   Total cycles: 1563
>>>>>>   Mean cycle:   17859
>>>>>>   Mean depth cycle:  12138
>>>>>>   Cycles per minute: 1
>>>>>>   Last queue length: 496816
>>>>>>
>>>>>>   Backfilling stats
>>>>>>   Total backfilled jobs (since last slurm start): 9325
>>>>>>   Total backfilled jobs (since last stats cycle start): 4952
>>>>>>   Total cycles: 84
>>>>>>   Last cycle when: Sun Sep 29 15:06:15 2013
>>>>>>   Last cycle: 2555321
>>>>>>   Max cycle:  27633565
>>>>>>   Mean cycle: 6115033
>>>>>>   Last depth cycle: 3
>>>>>>   Last depth cycle (try sched): 2
>>>>>>   Depth Mean: 278
>>>>>>   Depth Mean (try depth): 62
>>>>>>   Last queue length: 496814
>>>>>>   Queue length mean: 100807
>>>>>>
>>>>>>   I'm guessing this is due to the fact that there are roughly 
>>>>>>   500,000 jobs in the queue.  This is at our upper limit, which is 
>>>>>>   500,000 (MaxJobCount).  Is there anything that can be done about 
>>>>>>   this?  It seems that commands that query jobs, such as squeue 
>>>>>>   and scancel, are not working.  So I can't tell who sent in this 
>>>>>>   many jobs.
>>>>>>
>>>>>>   -Paul Edmon-
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>>>>>
>>>>>
>>>>
>>>
>>

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
