[slurm-dev] Re: Duplicate jobs in the grid_job_table

Danny Auble Wed, 13 Mar 2013 13:40:09 -0700

I would make sure these jobs weren't requeued. Knowing what the times of the entries in the database were would be interesting as well. Any information about the jobs in the slurmctld log would probably shed light on the matter. Outside of being requeued I wouldn't ever expect duplicates.


Danny


On 03/12/13 10:57, Chris Read wrote:

Update:

I've just cleaned things up by deleting the duplicates where state = 0 (PENDING). The correct state for the job is actually 7 (NODE_FAIL), not CANCELLED as I stated earlier.


No need to restart slurmdbd either...
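
For anyone wanting to check for the same pattern before deleting anything, here is a minimal sketch of the selection logic: find rows in PENDING state whose job also has an older row (lower job_db_inx) in some other state. It is only an illustration under assumptions. It uses Python's sqlite3 with an in-memory table as a stand-in for the real accounting database, the column names id_job and state are assumed here (only job_db_inx and the table name grid_job_table come from this thread), and the helper names are hypothetical. Requeued jobs can legitimately have more than one record, so the matches should be reviewed before anything is removed.

```python
import sqlite3

PENDING = 0    # state code for PENDING, as mentioned in this thread
NODE_FAIL = 7  # state code for NODE_FAIL, as mentioned in this thread


def build_demo_db():
    """Create an in-memory stand-in for grid_job_table (column names are assumptions)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE grid_job_table ("
        "job_db_inx INTEGER PRIMARY KEY, id_job INTEGER, state INTEGER)"
    )
    conn.executemany(
        "INSERT INTO grid_job_table VALUES (?, ?, ?)",
        [
            (1, 100, NODE_FAIL),  # real final record for job 100
            (2, 100, PENDING),    # stale duplicate with a higher job_db_inx
            (3, 101, PENDING),    # genuinely pending job, no other record
            (4, 102, 3),          # unrelated finished job
        ],
    )
    conn.commit()
    return conn


def find_stale_pending(conn, table="grid_job_table"):
    """Return job_db_inx of PENDING rows that have an older non-PENDING row for the same job."""
    query = f"""
        SELECT DISTINCT p.job_db_inx
        FROM {table} AS p
        JOIN {table} AS d
          ON d.id_job = p.id_job
         AND d.job_db_inx < p.job_db_inx
         AND d.state <> ?
        WHERE p.state = ?
        ORDER BY p.job_db_inx
    """
    return [row[0] for row in conn.execute(query, (PENDING, PENDING))]


def delete_rows(conn, inxs, table="grid_job_table"):
    """Delete the given rows by job_db_inx (only after reviewing them)."""
    conn.executemany(
        f"DELETE FROM {table} WHERE job_db_inx = ?", [(i,) for i in inxs]
    )
    conn.commit()
```

Calling find_stale_pending(build_demo_db()) picks out only the duplicate for job 100; the genuinely pending job 101 is left alone because it has no older record.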

Chris

On Tue, Mar 12, 2013 at 4:28 PM, Chris Read <[email protected]<http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/281898238882/>> wrote:


    Forgot to add another question:

    What's the correct way to clean this up? Just delete the record
    showing PENDING and restart slurmdbd?


    On Tue, Mar 12, 2013 at 4:26 PM, Chris Read <[email protected]
    <http://lists.schedmd.com/cgi-bin/dada/mail.cgi/r/slurmdev/281898238882/>>
    wrote:

        Greetings...

        Just stumbled across some strange behaviour on the accounting
        side of things: we have a collection of jobs that have
        duplicate records in the grid_job_table. The visible symptoms
        of this are that sacct shows the jobs as still pending when
        they are not.

        In all of the cases I can find:

        - there is no information available in the slurmdbd.log
        - the slurmctld.log shows the jobs have been canceled
        - the job_db_inx for the entry in PENDING state is > the
        job_db_inx for the entry in CANCELLED

        We're currently on 2.5.1.

        Anyone have any idea how these got there?

        Chris
