Thanks.  That's what I suspected.

-Paul Edmon-

On 9/29/2013 8:20 PM, Morris Jette wrote:
Just cancel the primary job ID.
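
For example, a rough sketch with an illustrative ID: if one of the arrays went
in as job 12345, cancelling that single primary ID should take all of its
tasks with it:

    # 12345 is a made-up example ID for the array's primary job
    scancel 12345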

Paul Edmon <[email protected]> wrote:

    By the way, it looks like this was caused by a user submitting 47
    job array jobs, each with 10,000 tasks in the array, which ended
    up producing 470,000 jobs.  Is there a quick way to cancel job
    arrays?  If I were to guess, canceling the primary ID would take
    care of all of them?
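
    (For reference, a rough sketch of how that shows up in the accounting
    records and what each submission would have looked like; the user
    listing, date, and script name are illustrative:)

        # count submissions per user since the trouble started
        sacct -a -X -n -S 2013-09-28 -o user | sort | uniq -c | sort -rn | head

        # each offending submission expands to 10,000 array tasks
        sbatch --array=0-9999 myscript.sh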

    -Paul Edmon-

    On 9/29/2013 6:39 PM, Paul Edmon wrote:
    Increasing the MAX_MSG_SIZE to 1024*1024*1024 worked.  Is there
    any reason this couldn't be pushed back into the main tree?  Or
    do you guys want to keep the smaller message size?
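
    For the record, roughly what the change amounts to (the exact
    original value and surrounding code may differ by version):

        # src/common/slurm_protocol_socket_implementation.c:
        #     #define MAX_MSG_SIZE (1024*1024*1024)
        #
        # then rebuild and reinstall:
        ./configure && make && make install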

    -Paul Edmon-

    On 9/29/2013 6:00 PM, Paul Edmon wrote:
    Ah, okay.  I figured that might be the case.

    -Paul Edmon-

    On 9/29/2013 5:58 PM, Morris Jette wrote:
    That goes into the Slurm database. There are about 20 different
    limits available by user or group. See the resource limits web
    page.
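
    A minimal sketch of that kind of limit, assuming accounting is
    already set up (user name and values are illustrative):

        # cap how many jobs a user may have submitted and have running at once
        sacctmgr modify user where name=someuser set MaxSubmitJobs=10000 MaxJobs=1000

        # confirm the association limits
        sacctmgr show assoc where user=someuser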

    Paul Edmon <[email protected]> wrote:

        That's good to hear.  Is there an option to do it per user?  I didn't
        see one in the slurm.conf.  I may have missed it.

        -Paul Edmon-

        On 9/29/2013 5:42 PM, Moe Jette wrote:

            Quoting Paul Edmon <[email protected]>:

                Yeah, that's why we set the 500,000 job limit.
                Though I didn't anticipate the insane message
                length issue.

            I'd recommend per-user job limits too.

                If I drop the MaxJobCount will it purge jobs to get
                down to that? Or will it just prohibit new jobs?

            It will only prohibit new jobs.
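
            For completeness, a sketch of lowering it (value illustrative;
            depending on version, MaxJobCount may need a slurmctld restart
            rather than a reconfigure):

                # slurm.conf (path is site-dependent):
                #     MaxJobCount=200000
                scontrol reconfigure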

                I'm assuming that this rebuild would need to be
                pushed out everywhere as well? Both clients and master?

            Only needed on the clients.

                -Paul Edmon-

                On 9/29/2013 3:40 PM, Moe Jette wrote:

                    See MAX_MSG_SIZE in
                    src/common/slurm_protocol_socket_implementation.c

                    That should get you going again, but setting per-user
                    job limits is strongly recommended longer term.  That
                    should prevent a rogue script from bringing the system
                    to its knees.

                    Moe

                    Quoting Paul Edmon <[email protected]>:

                        Where is the max message size limit set in SLURM?
                        That's probably the best route at this point.

                        -Paul Edmon-

                        On 9/29/2013 3:26 PM, Morris Jette wrote:

                            Here are some options:

                            1. Use scontrol to set the queue state to drain
                               and prevent more jobs from being submitted
                            2. Lower the job limit to block new job
                               submissions
                            3. Increase the max message size limit and
                               rebuild Slurm
                            4. Check accounting records for the rogue user
                            5. Long term, set user job limits and train them
                               to run multiple steps on fewer jobs

                            Paul Edmon <[email protected]> wrote:
                            [root@holy-slurm01 ~]# squeue
                            squeue: error: slurm_receive_msg: Insane message length
                            slurm_load_jobs error: Insane message length
                            [root@holy-slurm01 ~]# sdiag
                            *******************************************************
                            sdiag output at Sun Sep 29 15:12:13 2013
                            Data since      Sat Sep 28 20:00:01 2013
                            *******************************************************
                            Server thread count: 3
                            Agent queue size:    0

                            Jobs submitted: 21797
                            Jobs started:   12030
                            Jobs completed: 12209
                            Jobs canceled:  70
                            Jobs failed:    5

                            Main schedule statistics (microseconds):
                                Last cycle:        9207042
                                Max cycle:         10088674
                                Total cycles:      1563
                                Mean cycle:        17859
                                Mean depth cycle:  12138
                                Cycles per minute: 1
                                Last queue length: 496816

                            Backfilling stats
                                Total backfilled jobs (since last slurm start): 9325
                                Total backfilled jobs (since last stats cycle start): 4952
                                Total cycles: 84
                                Last cycle when: Sun Sep 29 15:06:15 2013
                                Last cycle: 2555321
                                Max cycle:  27633565
                                Mean cycle: 6115033
                                Last depth cycle: 3
                                Last depth cycle (try sched): 2
                                Depth Mean: 278
                                Depth Mean (try depth): 62
                                Last queue length: 496814
                                Queue length mean: 100807

                            I'm guessing this is due to the fact that there are
                            roughly 500,000 jobs in the queue.  This is at our
                            upper limit, which is 500,000 (MaxJobCount).  Is
                            there anything that can be done about this?  It
                            seems that commands that query jobs, such as squeue
                            and scancel, are not working, so I can't tell who
                            sent in this many jobs.

                            -Paul Edmon-

                            --
                            Sent from my Android phone with K-9 Mail.  Please excuse my brevity.






--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
