Thanks, Joshua!

That did the 1-Trillion dollar trick!



On 8/7/2019 10:50 PM, Joshua Baker-LePain wrote:
On Wed, 7 Aug 2019 at 4:40pm, Joseph Farran wrote:

A user accidentally submitted a 1.4 BILLION job array on our HPC cluster. How can I remove it?

And I thought I had problems with a user submitting a million+ individual jobs.  That was fun too.

I cannot qdel or qhold the job because doing so crashes SGE. I can restart SGE just fine, but the job remains.

I removed the SGE job script itself from /var/spool/sge/job_scripts and restarted SGE, but the job remains.

You also need to remove the job's entry in the job "database". Assuming you're using flat-file spooling, that entry will be a directory under the "jobs" directory in the spool. If the job ID is 8027327, for example, then the directory is jobs/00/0802/7327. Stop SGE, run 'rm -rf jobs/00/0802/7327', then start SGE up again and the job should be gone.
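
For anyone who needs to find that directory for a different job ID, here is a minimal Python sketch of the mapping. It assumes the layout implied by the example above: the job ID zero-padded to 10 digits and split into groups of 2, 4 and 4. The function name job_spool_dir and the jobs_root parameter are made up for illustration and are not part of SGE. Check that the directory really exists in your spool before removing anything.

```python
from pathlib import PurePosixPath


def job_spool_dir(job_id: int, jobs_root: str = "jobs") -> PurePosixPath:
    """Return the assumed flat-file spool directory for a job ID.

    Assumption (inferred from the single example in this thread): the ID
    is zero-padded to 10 digits and split 2/4/4, so 8027327 becomes
    jobs/00/0802/7327. This helper is hypothetical, not an SGE tool.
    """
    if not 0 <= job_id <= 9_999_999_999:
        raise ValueError(f"job ID out of range for a 10-digit path: {job_id}")
    padded = f"{job_id:010d}"  # e.g. 8027327 -> "0008027327"
    return PurePosixPath(jobs_root, padded[:2], padded[2:6], padded[6:])


if __name__ == "__main__":
    # Only prints the path; stopping SGE and deleting is left to the admin.
    print(job_spool_dir(8027327))
```

The sketch deliberately does not delete anything. It only prints the relative path, so the stop, 'rm -rf', start sequence described above stays a manual step.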
