[slurm-dev] Re: Insane message length

Moe Jette Mon, 30 Sep 2013 08:29:38 -0700

Just set it on the root bank and it will automatically apply to every child bank and user (unless another value is explicitly set for some sub-tree).
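
For reference, a minimal sketch of what setting the limit once at the root could look like. It only assembles the command line and does not run it. The assumptions here are that the root bank is the account named "root", that sacctmgr accepts the "modify account name=... set ..." form shown, and that the value is written without a thousands separator; check the sacctmgr man page for your Slurm version before using it.

```python
import shlex

def root_limit_command(max_submit_jobs: int, account: str = "root") -> list[str]:
    """Build (but do not run) a sacctmgr call that sets MaxSubmitJobs on one account."""
    if max_submit_jobs < 0:
        raise ValueError("limit must be non-negative")
    # -i ("immediate") skips sacctmgr's interactive confirmation prompt.
    # Assumption: the root bank is the account named "root".
    return [
        "sacctmgr", "-i", "modify", "account", f"name={account}",
        "set", f"MaxSubmitJobs={max_submit_jobs}",
    ]

cmd = root_limit_command(50_000)
print(shlex.join(cmd))
```

Child banks and users that have no explicit value of their own would then inherit the limit, as described above.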


Quoting Paul Edmon <[email protected]>:

Quick question: is there an easy one-line command to set MaxSubmitJobs for every user? So let's say I want


MaxSubmitJobs=50,000

And to apply it everywhere in the DB. Is there a way to do it? Or do you have to walk every single user in the DB?


-Paul Edmon-

On 9/29/2013 9:03 PM, Paul Edmon wrote:

Thanks.  That's what I suspected.

-Paul Edmon-

On 9/29/2013 8:20 PM, Morris Jette wrote:

Just cancel the primary job ID.
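
A small sketch of why this is quick in the case quoted below: 47 arrays of 10,000 tasks each need only 47 scancel calls, one per primary job ID. The job IDs in the example are hypothetical, and the commands are only printed, not executed.

```python
def scancel_commands(primary_job_ids):
    """One scancel call per array: the primary job ID addresses the whole array."""
    return [["scancel", str(job_id)] for job_id in primary_job_ids]

arrays, tasks_per_array = 47, 10_000
total_tasks = arrays * tasks_per_array
print(total_tasks)  # number of queued tasks covered by just 47 scancel calls

# Hypothetical primary job IDs, for illustration only.
commands = scancel_commands([1001, 1002])
for cmd in commands:
    print(" ".join(cmd))
```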

Paul Edmon <[email protected]> wrote:

   By the way it looks like this was caused by a user submitting 47
   job array jobs each with 10,000 tasks in the array.  Which ended
   up producing 470,000 jobs.  Is there a quick way to cancel job
   arrays?  If I were to guess if you canceled the primary id that
   would take care of all of them?

   -Paul Edmon-

   On 9/29/2013 6:39 PM, Paul Edmon wrote:

   Increasing the MAX_MSG_SIZE to 1024*1024*1024 worked.  Is there
   any reason this couldn't be pushed back into the main tree?  Or
   do you guys want to keep the smaller message size?
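
To put that value in perspective, a quick back-of-the-envelope calculation. It uses only the two numbers given in this thread (the new MAX_MSG_SIZE and the MaxJobCount of 500,000). The real size of a packed job record is not stated here, so the per-job figure is only the budget the new limit allows, not a measurement.

```python
MAX_MSG_SIZE = 1024 * 1024 * 1024   # new limit from the message above, in bytes
MAX_JOB_COUNT = 500_000             # MaxJobCount reported in this thread

# Total size of the largest message the rebuilt tools will accept (1 GiB).
print(MAX_MSG_SIZE)

# Upper bound on the average bytes per job record if all jobs fit in one reply.
bytes_per_job = MAX_MSG_SIZE // MAX_JOB_COUNT
print(bytes_per_job)
```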

   -Paul Edmon-

   On 9/29/2013 6:00 PM, Paul Edmon wrote:

   Ah, okay.  I figured that might be the case.

   -Paul Edmon-

   On 9/29/2013 5:58 PM, Morris Jette wrote:

   That goes into the Slurm database. There are about 20
   different limits available by user or group. See the resource
   limits web page.

   Paul Edmon <[email protected]> wrote:

        That's good to hear.  Is there an option to do it per user?  I didn't
        see one in the slurm.conf.  I may have missed it.

       -Paul Edmon-

       On 9/29/2013 5:42 PM, Moe Jette wrote:

           Quoting Paul Edmon <[email protected]>:

               Yeah, that's why we set the 500,000 job limit.
               Though I didn't anticipate the insane message
               length issue.

           I'd recommend per-user job limits too.

               If I drop the MaxJobCount will it purge jobs to
               get down to that? Or will it just prohibit new jobs?

           It will only prohibit new jobs.

               I'm assuming that this rebuild would need to be
               pushed out everywhere as well? Both clients and
               master?

           Only needed on the clients.

               -Paul Edmon-

               On 9/29/2013 3:40 PM, Moe Jette wrote:

                   See MAX_MSG_SIZE in
                   src/common/slurm_protocol_socket_implementation.c
                   That should get you going again, but setting
                   per-user job limits is strongly recommended
                   longer term. That should prevent a rogue
                   script from bringing the system to its knees.

                   Moe

                   Quoting Paul Edmon <[email protected]>:

                       Where is the max message size limit set in
                       SLURM? That's probably the best route at
                       this point.

                       -Paul Edmon-

                       On 9/29/2013 3:26 PM, Morris Jette wrote:

                           Here are some options:
                           1. Use scontrol to set queue state to drain and
                              prevent more jobs from being submitted
                           2. Lower the job limit to block new job submissions
                           3. Increase the max message size limit and rebuild
                              Slurm
                           4. Check accounting records for the rogue user
                           5. Long term, set user job limits and train them to
                              run multiple steps on fewer jobs

                           Paul Edmon <[email protected]> wrote:

                           [root@holy-slurm01 ~]# squeue
                           squeue: error: slurm_receive_msg: Insane message length
                           slurm_load_jobs error: Insane message length
                           [root@holy-slurm01 ~]# sdiag
                           *******************************************************
                           sdiag output at Sun Sep 29 15:12:13 2013
                           Data since Sat Sep 28 20:00:01 2013
                           *******************************************************
                           Server thread count: 3
                           Agent queue size: 0

                           Jobs submitted: 21797
                           Jobs started: 12030
                           Jobs completed: 12209
                           Jobs canceled: 70
                           Jobs failed: 5

                           Main schedule statistics (microseconds):
                               Last cycle: 9207042
                               Max cycle: 10088674
                               Total cycles: 1563
                               Mean cycle: 17859
                               Mean depth cycle: 12138
                               Cycles per minute: 1
                               Last queue length: 496816

                           Backfilling stats
                               Total backfilled jobs (since last slurm start): 9325
                               Total backfilled jobs (since last stats cycle start): 4952
                               Total cycles: 84
                               Last cycle when: Sun Sep 29 15:06:15 2013
                               Last cycle: 2555321
                               Max cycle: 27633565
                               Mean cycle: 6115033
                               Last depth cycle: 3
                               Last depth cycle (try sched): 2
                               Depth Mean: 278
                               Depth Mean (try depth): 62
                               Last queue length: 496814
                               Queue length mean: 100807

                           I'm guessing this is due to the fact that there
                           are roughly 500,000 jobs in the queue. This is at
                           our upper limit, which is 500,000 (MaxJobCount).
                           Is there anything that can be done about this? It
                           seems that commands that query jobs, such as
                           squeue and scancel, are not working, so I can't
                           tell who sent in this many jobs.

                           -Paul Edmon-

                           --
                           Sent from my Android phone with K-9 Mail. Please
                           excuse my brevity.
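
Since sdiag kept working while squeue did not, its output is one way to watch how close the queue is to MaxJobCount. A minimal sketch follows, assuming sdiag prints one field per line as it normally does. The sample text is a two-line excerpt of the figures reported above, and the function name is made up for this illustration.

```python
import re

def queue_lengths(sdiag_text: str) -> list[int]:
    """Return every 'Last queue length' value found in sdiag output, in order."""
    return [int(n) for n in re.findall(r"Last queue length:\s*(\d+)", sdiag_text)]

# Excerpt of the sdiag figures reported in this thread.
sample = """
Main schedule statistics (microseconds):
    Last queue length: 496816
Backfilling stats
    Last queue length: 496814
"""

MAX_JOB_COUNT = 500_000  # MaxJobCount reported in this thread
lengths = queue_lengths(sample)
print(lengths)
# Fraction of MaxJobCount currently occupied by queued jobs.
print(f"{max(lengths) / MAX_JOB_COUNT:.1%}")
```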


