I'm curious what other sites do to keep jobs running in a reservation when one of its nodes has an error. Obviously, if it's an easy fix, you simply fix the node and the reservation can continue to run jobs. If spare nodes are available, you can also add one to the reservation to take up the slack left by the bad one. Or you can make the reservation a few nodes larger up front to account for bad luck.
I'm really wondering if there are any better options or any automated options. What do others do?

Thanks,
Bill.
-- 
Bill Barth, Ph.D., Director, HPC
[email protected] | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
