On Wed, 7 Aug 2019 at 4:40pm, Joseph Farran wrote:

A user accidentally submitted a job array with 1.4 BILLION tasks on our HPC cluster. How can I remove it?

And I thought I had problems with a user submitting a million+ individual jobs. That was fun too.

I cannot qdel or qhold the job because either command crashes SGE. I can restart SGE just fine, but the job remains.

I removed the SGE job script itself from /var/spool/sge/job_scripts and restarted SGE, but the job remains.

You also need to remove the job's entry in the job "database". Assuming you're using flat-file spooling, that entry will be a directory under the "jobs" directory in the spool. For example, if the job ID is 8027327, then the directory is jobs/00/0802/7327. Stop SGE, run 'rm -rf jobs/00/0802/7327', then start SGE up again and the job should be gone.
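
If it helps, here is a rough sketch of how that directory name can be derived from the job ID. It assumes the layout implied by the example above: the ID is zero-padded to ten digits and split into groups of 2, 4 and 4. The function name job_spool_dir and the jobs_root parameter are made up for illustration and are not part of SGE. Check the result against what is actually on disk before deleting anything.

```python
from pathlib import PurePosixPath


def job_spool_dir(job_id: int, jobs_root: str = "jobs") -> PurePosixPath:
    """Return the spool directory for a job ID, relative to the spool root.

    Assumption: the ID is zero-padded to 10 digits and split 2/4/4,
    which matches the 8027327 -> jobs/00/0802/7327 example.
    """
    if not 0 < job_id < 10**10:
        raise ValueError(f"job ID out of range for a 10-digit layout: {job_id}")
    padded = f"{job_id:010d}"  # e.g. 8027327 -> "0008027327"
    return PurePosixPath(jobs_root, padded[:2], padded[2:6], padded[6:])


if __name__ == "__main__":
    # Only prints the path; removing it is left as a manual step
    # to be done while SGE is stopped.
    print(job_spool_dir(8027327))
```

This only computes and prints the path. The removal itself is still the manual 'rm -rf' step, done with SGE stopped.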

Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
users mailing list
