This is a slightly more serious issue than you might think. If you can
send me the exact SQL being sent (do you know where the wacky '��8'
came from?), it may become clearer where the problem lies. In this
situation your DBD will keep backing up information, but new info
will not be added to the database, so it will appear out of sync.
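A hedged way to capture those statements, assuming SUPER privilege on
the MySQL server holding the accounting database (plain MySQL, nothing
Slurm-specific):
set global log_output = 'FILE';
set global general_log = 'ON';
-- reproduce the failing insert, then read the file named by the
-- general_log_file variable; switch the log off again, it grows fast:
set global general_log = 'OFF';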
On 03/14/13 09:42, Chris Read wrote:
Re: [slurm-dev] Re: Duplicate jobs in the grid_job_table
We've just had a whole load of nodes fail, and a lot of jobs were
rescheduled, which recreated the problem of sacct not agreeing with
squeue.
Found a load of these in the slurmdbd.log:
[2013-03-14T09:33:39-05:00] error: mysql_query failed: 1064 You have
an error in your SQL syntax; check the manual that corresponds to your
MySQL server version for the right syntax to use near '��8 ', 'drw',
'short', '**', '49') on duplicate key update job_db_inx=LAST_INSER' at
line 1
insert into "grid_job_table" (id_job, id_assoc, id_qos, id_wckey,
id_user, id_group, nodelist, id_resv, timelimit, time_eligible,
time_submit, time_start, job_name, track_steps, state, priority,
cpus_req, cpus_alloc, nodes_alloc, gres_req, gres_alloc, account,
partition, wckey, node_inx) values (39641248, 139, 1, 0, 7045, 7045,
'n23', 0, 10, 1363271250, 1363271250, 1363271251, '<JOB_NAME>', 0, 3,
2549, 1, 1, 1, '��i�8 ', '�'��8 ', '<acct>', 'short', '**', '49') on
duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx),
id_wckey=0, id_user=7045, id_group=7045, nodelist='n23', id_resv=0,
timelimit=10, time_submit=1363271250, time_start=1363271251,
job_name='<JOB_NAME>', track_steps=0, id_qos=1, state=greatest(state,
3), priority=2549, cpus_req=1, cpus_alloc=1, nodes_alloc=1,
gres_req='��i�8 ', gres_alloc='�'��8 ', account='<acct>',
partition='short', wckey='**', node_inx='49'
So I'm guessing that's how the confusing data got in there in the
first place.
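That guess fits the error: the gres_alloc bytes contain a literal
single quote ('�'��8 '), which closes the string early, so MySQL
parses the remainder as SQL and reports syntax error 1064 at exactly
that fragment. A hedged query for spotting rows where such bytes
already landed in the table (table and column names taken from the
statement above):
select job_db_inx, id_job, hex(gres_req), hex(gres_alloc)
from grid_job_table
where gres_req not regexp '^[[:print:]]*$'
   or gres_alloc not regexp '^[[:print:]]*$';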
On Wed, Mar 13, 2013 at 8:43 PM, Danny Auble <[email protected]> wrote:
I would make sure these jobs weren't requeued. Knowing the
times of the entries in the database would be interesting as
well. Any information about the jobs in the slurmctld log would
probably shed light on the matter. Outside of being
requeued, I wouldn't ever expect duplicates.
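A hedged query for pulling those times out, using only columns that
appear in the failed insert quoted above (substitute the job ids you
are chasing; from_unixtime is stock MySQL):
select job_db_inx, state,
       from_unixtime(time_submit) as submitted,
       from_unixtime(time_start) as started
from grid_job_table
where id_job = 39641248  -- the id from the failed insert; use your own
order by job_db_inx;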
Danny
On 03/12/13 10:57, Chris Read wrote:
Update:
I've just cleaned things up by deleting the duplicates where
state = 0 (PENDING). The correct state for the job is actually 7
(NODE_FAIL), not CANCELLED as I stated earlier.
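A hedged sketch of that cleanup, using the table and column names from
this thread (state 0 is PENDING; run the matching select and take a
backup before deleting anything):
delete from grid_job_table
where state = 0
  and id_job in (
    -- the derived table sidesteps MySQL's refusal to delete from a
    -- table that a subquery in the same statement also reads
    select id_job from (
      select id_job from grid_job_table
      group by id_job
      having count(*) > 1
    ) as dup
  );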
No need to restart slurmdbd either...
Chris
On Tue, Mar 12, 2013 at 4:28 PM, Chris Read <[email protected]> wrote:
Forgot to add another question:
What's the correct way to clean this up? Just delete the
record showing PENDING and restart slurmdbd?
On Tue, Mar 12, 2013 at 4:26 PM, Chris Read <[email protected]> wrote:
Greetings...
Just stumbled across some strange behaviour on the
accounting side of things: we have a collection of jobs
that have duplicate records in the grid_job_table. The
visible symptom is that sacct shows the jobs as still
pending when they are not.
In all of the cases I can find:
- there is no information available in the slurmdbd.log
- the slurmctld.log shows the jobs have been canceled
- the job_db_inx for the entry in PENDING state is greater than the
job_db_inx for the entry in CANCELLED state (a query for listing
such pairs is sketched below)
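A hedged query for listing those duplicates side by side, assuming
direct access to the accounting database (job ids can legitimately
repeat over time, so check the times as well):
select id_job, job_db_inx, state,
       from_unixtime(time_submit) as submitted
from grid_job_table
where id_job in (
    select id_job
    from grid_job_table
    group by id_job
    having count(*) > 1
)
order by id_job, job_db_inx;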
We're currently on 2.5.1.
Anyone have any idea how these got there?
Chris