By the way, it looks like this was caused by a user submitting 47 job array jobs, each with 10,000 tasks in the array, which ended up producing 470,000 jobs. Is there a quick way to cancel job arrays? If I were to guess, cancelling the primary ID would take care of all of them?
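If I had to guess at the commands, something like this (the job ID and user name below are made up; worth checking scancel(1) for our Slurm version):

    scancel 1234            # cancel the whole array by its base job ID
    scancel 1234_7          # cancel a single task of the array, if this version supports it
    scancel -u someuser     # or cancel everything belonging to the submitting user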

-Paul Edmon-

On 9/29/2013 6:39 PM, Paul Edmon wrote:
Increasing the MAX_MSG_SIZE to 1024*1024*1024 worked. Is there any reason this couldn't be pushed back into the main tree? Or do you guys want to keep the smaller message size?
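For anyone following along later, the change was roughly the following, assuming the limit is a plain #define in that file (the stock value will vary by Slurm version):

    # find the current definition before rebuilding
    grep -n "MAX_MSG_SIZE" src/common/slurm_protocol_socket_implementation.c
    # change the definition to:
    #   #define MAX_MSG_SIZE (1024*1024*1024)
    # then rebuild and reinstall the client commands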

-Paul Edmon-

On 9/29/2013 6:00 PM, Paul Edmon wrote:
Ah, okay.  I figured that might be the case.

-Paul Edmon-

On 9/29/2013 5:58 PM, Morris Jette wrote:
That goes into the Slurm database. There are about 20 different limits available by user or group. See the resource limits web page.
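For example, something like this (the user name, account name, and numbers are placeholders; the full list of limits is on that page):

    # limits are only enforced with accounting configured and
    # AccountingStorageEnforce=limits (or similar) set in slurm.conf
    sacctmgr modify user where name=pedmon set MaxSubmitJobs=5000
    sacctmgr modify account where name=rc_lab set GrpJobs=2000 MaxSubmitJobs=10000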

Paul Edmon <[email protected]> wrote:

    That's good to hear.  Is there an option to do it per user?  I didn't
    see one in the slurm.conf.  I may have missed it.

    -Paul Edmon-

    On 9/29/2013 5:42 PM, Moe Jette wrote:

        Quoting Paul Edmon <[email protected]>:

            Yeah, that's why we set the 500,000 job limit. Though I
            didn't anticipate the insane message length issue.

        I'd recommend per-user job limits too.

            If I drop the MaxJobCount will it purge jobs to get down
            to that? Or will it just prohibit new jobs?

        It will only prohibit new jobs.
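        For example, with the setting you mentioned (whether a plain
        "scontrol reconfigure" is enough or slurmctld needs a restart
        when this changes depends on the version, so check the
        slurm.conf man page):

            # slurm.conf on the controller
            MaxJobCount=500000
            # lowering this does not purge already-queued jobs; it only
            # blocks new submissions once the job count is at the limit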

            I'm assuming that this rebuild would need to be pushed
            out everywhere as well? Both clients and master?

        Only needed on the clients.

            -Paul Edmon-

            On 9/29/2013 3:40 PM, Moe Jette wrote:

                See MAX_MSG_SIZE in
                src/common/slurm_protocol_socket_implementation.c
                That should get you going again, but setting per-user
                job limits is strongly recommended longer term. That
                should prevent a rogue script from bringing the system
                to its knees.

                Moe

                Quoting Paul Edmon <[email protected]>:

                    Where is the max message size limit set in SLURM?
                    That's probably the best route at this point.

                    -Paul Edmon-

                    On 9/29/2013 3:26 PM, Morris Jette wrote:

                        Here are some options:
                        1. Use scontrol to set the queue state to drain and
                           prevent more jobs from being submitted
                        2. Lower the job limit to block new job submissions
                        3. Increase the max message size limit and rebuild Slurm
                        4. Check the accounting records for the rogue user
                        5. Longer term, set per-user job limits and train users
                           to run multiple steps in fewer jobs
                        (Options 1 and 4 are sketched after the quoted output
                        below.)

                        Paul Edmon <[email protected]> wrote:

                        [root@holy-slurm01 ~]# squeue
                        squeue: error: slurm_receive_msg: Insane message length
                        slurm_load_jobs error: Insane message length
                        [root@holy-slurm01 ~]# sdiag
                        *******************************************************
                        sdiag output at Sun Sep 29 15:12:13 2013
                        Data since      Sat Sep 28 20:00:01 2013
                        *******************************************************
                        Server thread count: 3
                        Agent queue size:    0

                        Jobs submitted: 21797
                        Jobs started:   12030
                        Jobs completed: 12209
                        Jobs canceled:  70
                        Jobs failed:    5

                        Main schedule statistics (microseconds):
                            Last cycle:        9207042
                            Max cycle:         10088674
                            Total cycles:      1563
                            Mean cycle:        17859
                            Mean depth cycle:  12138
                            Cycles per minute: 1
                            Last queue length: 496816

                        Backfilling stats
                            Total backfilled jobs (since last slurm start): 9325
                            Total backfilled jobs (since last stats cycle start): 4952
                            Total cycles:      84
                            Last cycle when:   Sun Sep 29 15:06:15 2013
                            Last cycle:        2555321
                            Max cycle:         27633565
                            Mean cycle:        6115033
                            Last depth cycle:  3
                            Last depth cycle (try sched): 2
                            Depth Mean:        278
                            Depth Mean (try depth): 62
                            Last queue length: 496814
                            Queue length mean: 100807

                        I'm guessing this is due to the fact that there are
                        roughly 500,000 jobs in the queue, which is at our
                        upper limit of 500,000 (MaxJobCount). Is there anything
                        that can be done about this? Commands that query jobs,
                        such as squeue and scancel, are not working, so I can't
                        tell who sent in this many jobs.

                        -Paul Edmon-

                        --
                        Sent from my Android phone with K-9 Mail. Please excuse my brevity.
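                        A rough sketch of options 1 and 4 (the partition name
                        is a placeholder; check scontrol and sacct docs for
                        your version):

                            # 1. stop new submissions by draining the partition(s)
                            scontrol update PartitionName=general State=DRAIN
                            # set State=UP again once the backlog is under control

                            # 4. use the accounting records to find who submitted
                            #    the flood, since squeue itself is failing
                            sacct -a -X -n -S 2013-09-28 -o User | sort | uniq -c | sort -rn | head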


--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


