If you are on Slurm 16, you can try:

sacctmgr show runawayjobs

From the man page:

*RunawayJobs*: Used only with the *list* or *show* command to report
current jobs that have been orphaned on the local cluster and are now
runaway. If there are jobs in this state it will also give you an option to
"fix" them.
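
To cross-check what sacctmgr reports, the same symptom can be spotted in sacct output directly: a record in a finished state whose End is still "Unknown". Below is a rough sketch, assuming pipe-delimited rows such as those printed by `sacct -X -P -n -o jobid,state,end`. The function name find_runaway_candidates and the list of "still active" states are my own assumptions, not something Slurm provides.

```shell
# Print job IDs whose accounting record has no end time although the
# state says the job is no longer active.
# Input (stdin): pipe-delimited rows "JobID|State|End", e.g. from
#   sacct -X -P -n -o jobid,state,end
# The set of active states below is an assumption; extend it as needed.
find_runaway_candidates() {
    awk -F'|' '
        $3 == "Unknown" && $2 !~ /^(RUNNING|PENDING|SUSPENDED)/ { print $1 }
    '
}

# Sample rows modelled on the job array quoted below in this thread.
printf '%s\n' \
    '69204_1|FAILED|Unknown' \
    '69204_296|FAILED|Unknown' \
    '69204_297|FAILED|2016-11-09T17:46:46' |
    find_runaway_candidates
```

This only lists candidates and changes nothing in the database; any actual fix should still go through sacctmgr rather than direct edits.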

2017-01-23 11:39 GMT+01:00 Paddy Doyle <[email protected]>:

>
> Hi Lucas,
>
> This old thread might help:
>
> https://groups.google.com/forum/#!topic/slurm-devel/TQcerLLEKAU
>
> Paddy
>
> On Fri, Jan 20, 2017 at 10:00:00AM -0800, Lucas Vuotto wrote:
>
> >
> > Hi all,
> > sreport was showing that a user was using more CPU hours per week
> > than available. After checking the output of sacct, we found that some
> > jobs from an array didn't end:
> >
> > $ sacct -j 69204 -o jobid%-14,state%6,start,elapsed,end
> >          JobID  State               Start    Elapsed                 End
> > -------------- ------ ------------------- ---------- -------------------
> > 69204_[1-1000] FAILED 2016-11-09T17:46:50   00:00:00 2016-11-09T17:46:50
> > 69204_1        FAILED 2016-11-09T17:46:44 71-20:25:55            Unknown
> > 69204_2        FAILED 2016-11-09T17:46:44 71-20:25:55            Unknown
> > [...]
> > 69204_295      FAILED 2016-11-09T17:46:46 71-20:25:53            Unknown
> > 69204_296      FAILED 2016-11-09T17:46:46 71-20:25:53            Unknown
> > 69204_297      FAILED 2016-11-09T17:46:46   00:00:00 2016-11-09T17:46:46
> > [...]
> > 69204_999      FAILED 2016-11-09T17:46:50   00:00:00 2016-11-09T17:46:50
> >
> > It seems that somehow those jobs got stuck (~72 days after
> > 2016-11-09 is today, 2017-01-20, and that's why the reports are wrong).
> > scancel says that 69204 is an invalid job id.
> >
> > Any idea on how to fix this? We're thinking about deleting the entries
> > of those jobs in the DB. Is it safe to run "arbitrary" commands in the
> > DB, bypassing slurmdbd?
> >
> > Thanks in advance.
> >
> >
> > -- lv.
> >
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
> http://www.tchpc.tcd.ie/
>
