Correction.   1 TRILLION   :-)

On 8/7/2019 4:40 PM, Joseph Farran wrote:

A user accidentally submitted a job array with 1.4 BILLION tasks on our HPC cluster. How can I remove it?

I can neither qdel nor qhold the job, because either command crashes SGE. I can restart SGE just fine, but the job remains.

I removed the SGE job script itself from /var/spool/sge/job_scripts and restarted SGE, but the job remains.

The only thing I can do is remove tasks either one at a time or in groups, which works, but at 1.4 BILLION tasks that will take a while.
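For reference, removing tasks in groups could be scripted roughly as below. This is only a sketch: it assumes that qdel with a -t task range is the form that works without crashing SGE, and the function names, the job ID, the task range and the group size are all placeholders made up for illustration. By default it only prints the qdel commands it would run; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: delete the tasks of one array job in fixed-size groups.
# chunk_ranges and delete_tasks are hypothetical helper names, not SGE tools.

# usage: chunk_ranges FIRST LAST SIZE
# Prints one "start-end" task range per line, covering FIRST..LAST.
chunk_ranges() {
    first=$1; last=$2; size=$3
    start=$first
    while [ "$start" -le "$last" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$last" ]; then
            end=$last
        fi
        echo "${start}-${end}"
        start=$((end + 1))
    done
}

# usage: delete_tasks JOB_ID FIRST LAST SIZE
# Dry run unless DRY_RUN=0: prints the qdel commands instead of running them.
delete_tasks() {
    job=$1
    chunk_ranges "$2" "$3" "$4" | while read -r range; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "qdel $job -t $range"
        else
            qdel "$job" -t "$range"
        fi
    done
}
```

For example, "delete_tasks 12345 1 1000000 100000" would print ten qdel commands for a hypothetical job 12345, each covering 100000 tasks. Even in large groups, a trillion tasks still means a very long loop, so this is a stopgap rather than a real fix.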

I have added max_aj_tasks to the SGE configuration to prevent this in the future.

# qconf -sconf|grep tasks
max_aj_tasks                 100000

Any help appreciated.

Thank you,

users mailing list
