On 04/22/14 09:59, Paul Edmon wrote:
Thanks.  Sorry forgot about that thread.
No problem.

I'm wagering that the jobs got orphaned due to timing out. Essentially, they actually launched but they didn't successfully update the database because it was busy.
The slurmctld should keep a record of all jobs ending unless its list got too full and it started throwing away messages to the DBD. That is the only way I would expect orphan jobs like this to still be around. If this is the case, you should see lots of messages about it in the slurmctld log file. Otherwise, a busy DBD/database should be handled.

Danny

-Paul Edmon-

On 04/22/2014 12:15 PM, Danny Auble wrote:
Paul, I think this was covered in this thread: https://groups.google.com/forum/#!searchin/slurm-devel/time_start/slurm-devel/nf7JxV91F40/KUsS1AmyWRYJ

The gist of it is you have to go into the database and manually update the record.

If you know the jobid or the db_inx you can do something like this

update $CLUSTER_job_table set state=3, time_end=time_start where time_end=0 and id_job=$JOBID;

That should make it go away from the check.
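
To make the effect of that statement concrete, here is a minimal sketch that runs the same update against a toy table. Everything beyond the statement itself is an assumption for illustration: it uses an in-memory SQLite database rather than the MySQL/MariaDB database behind slurmdbd, the cluster name `mycluster` and the helper `close_orphan_job` are hypothetical, and the table holds only the four columns the statement touches (the real job table has many more).

```python
import sqlite3

CLUSTER = "mycluster"  # hypothetical cluster name; substitute your own
TABLE = f"{CLUSTER}_job_table"


def close_orphan_job(conn, job_id):
    """Set state=3 and time_end=time_start for a job with no end time.

    Returns the number of rows changed, so 0 means the job was not
    found or already had an end time recorded.
    """
    cur = conn.execute(
        f"UPDATE {TABLE} SET state = 3, time_end = time_start "
        "WHERE time_end = 0 AND id_job = ?",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount


# Toy stand-in for the accounting database (assumption: SQLite, 4 columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    f"CREATE TABLE {TABLE} "
    "(id_job INTEGER, state INTEGER, time_start INTEGER, time_end INTEGER)"
)
conn.executemany(
    f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?)",
    [
        (101, 1, 1398100000, 0),           # orphan: running, no end time
        (102, 3, 1398100500, 1398104100),  # finished normally
    ],
)

changed = close_orphan_job(conn, 101)
rows = conn.execute(
    f"SELECT id_job, state, time_start, time_end FROM {TABLE} ORDER BY id_job"
).fetchall()
print(changed, rows)
```

The `time_end = 0` guard means a job that already has an end time is left alone, so running it twice on the same job is harmless. Against the real database it is worth taking a backup first and checking the row with a select before updating.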

It would be very good to know why the job didn't finish in the database, though, as this shouldn't happen.

Danny

On 04/22/14 06:48, Paul Edmon wrote:
Well, more like the naive ones, namely:

sacctmgr delete job JobID

How do you set the endtime?  Do you do that via scontrol?

-Paul Edmon-


On 04/21/2014 10:14 PM, Danny Auble wrote:
What are the obvious ones?

I would expect setting the end time to the start time and state to 4 (I think that is a completed state) should do it.



On April 21, 2014 6:54:22 PM PDT, Paul Edmon <[email protected]> wrote:

    Sure I can hunt that info down.  So what would be the command to remove
    the job from the DB?  I tried the obvious ones I could think of but with
    no effect.

    -Paul Edmon-

    On 4/21/2014 4:31 PM, Danny Auble wrote:

        Paul, you should be able to remove the job with no issue.
        The real question is why is it still running in the
        database instead of completed. If you happen to have any
        logs on the job and the information from the database it
        would be nice to look at since what you are describing
        shouldn't be possible. I know others have seen this before
        but no one has found a reproducer yet or any evidence on
        how the state was achieved. Let me know if you have
        anything like this.

        Thanks,
        Danny

        On 04/21/14 13:05, Paul Edmon wrote:

            Is there a way to delete a JobID and its relevant data
            from the slurm database? I have a user that I want to
            remove but there is a job which slurm thinks is not
            complete that is preventing me. I want slurm to just
            remove that job data as it shouldn't impact anything.
            -Paul Edmon-



