Ah, okay.  I figured that might be the case.

-Paul Edmon-

On 9/29/2013 5:58 PM, Morris Jette wrote:
That goes into the Slurm database. There are about 20 different limits available by user or group. See the resource limits web page.
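For example, a per-user cap can be set with sacctmgr (the user name and
values below are only illustrative, and enforcing them assumes accounting
via slurmdbd with AccountingStorageEnforce=limits in slurm.conf):

    # cap one user at 50 running and 200 queued jobs
    sacctmgr modify user where name=someuser set MaxJobs=50 MaxSubmitJobs=200

    # verify what is currently set for that user
    sacctmgr show assoc where user=someuser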

Paul Edmon <[email protected]> wrote:

    That's good to hear.  Is there an option to do it per user?  I didn't
    see one in the slurm.conf.  I may have missed it.

    -Paul Edmon-

    On 9/29/2013 5:42 PM, Moe Jette wrote:

        Quoting Paul Edmon <[email protected]>:

            Yeah, that's why we set the 500,000 job limit. Though I
            didn't anticipate the insane message length issue.

        I'd recommend per-user job limits too.

            If I drop the MaxJobCount will it purge jobs to get down
            to that? Or will it just prohibit new jobs?

        It will only prohibit new jobs.
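        (For reference, the limit in question is the MaxJobCount line in
        slurm.conf; the value below is only an illustration:

            MaxJobCount=100000

        Lowering it leaves already-queued jobs in place, and I believe
        MaxJobCount only takes effect after restarting slurmctld rather
        than via scontrol reconfigure, so check the slurm.conf man page.)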

            I'm assuming that this rebuild would need to be pushed out
            everywhere as well? Both clients and master?

        Only needed on the clients.
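        (The rebuild under discussion is a one-line change in the source
        file named further down in this thread; the exact value and
        surrounding code vary by Slurm version, so treat the line below
        as a sketch only:

            /* src/common/slurm_protocol_socket_implementation.c */
            #define MAX_MSG_SIZE     (16*1024*1024)  /* largest RPC accepted, in bytes */

        RPCs larger than this are rejected with the "Insane message
        length" error shown below, so raising it and rebuilding the
        client commands lets squeue read the oversized job list again.)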

                -Paul Edmon-

                On 9/29/2013 3:40 PM, Moe Jette wrote:

                See MAX_MSG_SIZE in
                src/common/slurm_protocol_socket_implementation.c. That
                should get you going again, but setting per-user job
                limits is strongly recommended longer term. That should
                prevent a rogue script from bringing the system to its
                knees.

                Moe

                Quoting Paul Edmon <[email protected]>:

                    Where is the max message size limit set in SLURM?
                    That's probably the best route at this point.

                    -Paul Edmon-

                    On 9/29/2013 3:26 PM, Morris Jette wrote:

                        Here are some options:

                        1. Use scontrol to set the queue state to drain
                           and prevent more jobs from being submitted
                        2. Lower the job limit to block new job
                           submissions
                        3. Increase the max message size limit and
                           rebuild Slurm
                        4. Check accounting records for the rogue user
                        5. Long term, set per-user job limits and train
                           users to run multiple steps in fewer jobs
                        Paul Edmon <[email protected]> wrote:

                        [root@holy-slurm01 ~]# squeue
                        squeue: error: slurm_receive_msg: Insane message length
                        slurm_load_jobs error: Insane message length
                        [root@holy-slurm01 ~]# sdiag
                        *******************************************************
                        sdiag output at Sun Sep 29 15:12:13 2013
                        Data since      Sat Sep 28 20:00:01 2013
                        *******************************************************
                        Server thread count: 3
                        Agent queue size:    0

                        Jobs submitted: 21797
                        Jobs started:   12030
                        Jobs completed: 12209
                        Jobs canceled:  70
                        Jobs failed:    5

                        Main schedule statistics (microseconds):
                                Last cycle:        9207042
                                Max cycle:         10088674
                                Total cycles:      1563
                                Mean cycle:        17859
                                Mean depth cycle:  12138
                                Cycles per minute: 1
                                Last queue length: 496816

                        Backfilling stats
                                Total backfilled jobs (since last slurm start): 9325
                                Total backfilled jobs (since last stats cycle start): 4952
                                Total cycles:     84
                                Last cycle when:  Sun Sep 29 15:06:15 2013
                                Last cycle:       2555321
                                Max cycle:        27633565
                                Mean cycle:       6115033
                                Last depth cycle: 3
                                Last depth cycle (try sched): 2
                                Depth Mean: 278
                                Depth Mean (try depth): 62
                                Last queue length: 496814
                                Queue length mean: 100807

                        I'm guessing this is due to the fact that there are
                        roughly 500,000 jobs in the queue. This is at our upper
                        limit, which is 500,000 (MaxJobCount). Is there anything
                        that can be done about this? It seems that commands that
                        query jobs, such as squeue and scancel, are not working,
                        so I can't tell who sent in this many jobs.

                        -Paul Edmon-

                        --
                        Sent from my Android phone with K-9 Mail. Please excuse
                        my brevity.


--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
