[gridengine users] exceeded h_rt limit, job aborts, exit status still 0?

2015-09-24 Thread Lane, William
I'm running this on a development cluster to test h_rt limits and job-status email notifications. Job 187 (mpirun) Aborted Exit Status = 0 Signal = KILL User = lanew Queue = short.q@cscld1-0-2 Host = cscld1-0-2.local Start Time
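The report above lists "Exit Status = 0" next to "Signal = KILL". As general background (not a statement about how Grid Engine fills in its mail), a POSIX process that is killed by a signal never returns an exit code of its own, so the two fields carry separate information. The following is a minimal, POSIX-only sketch of that distinction using Python's subprocess module; the helper names `run_and_kill` and `describe` are hypothetical and exist only for this illustration.

```python
import signal
import subprocess
import sys


def run_and_kill():
    """Start a long-running child, kill it with SIGKILL, return its returncode.

    The child stands in for a job that overruns a runtime limit; this does
    not involve Grid Engine at all.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)  # POSIX only
    child.wait()
    return child.returncode


def describe(returncode):
    """Split a subprocess returncode into exit status and terminating signal.

    On POSIX, subprocess reports a negative returncode -N when the child
    was terminated by signal N.
    """
    if returncode < 0:
        return {"exit_status": None, "signal": signal.Signals(-returncode).name}
    return {"exit_status": returncode, "signal": None}


if __name__ == "__main__":
    print(describe(run_and_kill()))
```

A tool that reports only the exit-status field of such a process has to pick some placeholder value for it, which is one possible (unconfirmed) reading of the 0 shown in the mail.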

Re: [gridengine users] At what point does the network overhead of adding additional nodes to a queue offset the benefit?

2015-09-24 Thread Chris Dagdigian
SGE is fine on 1Gb fabrics, and I don't know of anyone who uses 10Gb for SGE unless it's a combined network fabric that carries storage and application traffic along with SGE traffic on the same links. Or if you are running all new stuff with 10Gb for everything and maybe a 1Gb NIC held ba

[gridengine users] At what point does the network overhead of adding additional nodes to a queue offset the benefit?

2015-09-24 Thread Lane, William
If a cluster is running on a relatively slow network backbone (say Gigabit Ethernet or 10 Gb Ethernet, as opposed to InfiniBand), is there any commonly accepted point at which increasing the number of nodes in a queue negatively affects the performance of the queue? Is there any genera

Re: [gridengine users] Create short.q queue definition that limits the runtime of a job

2015-09-24 Thread Lane, William
Reuti, 1. The exechost isn't the head node, is it? We've always referred to our SGE clusters as having three types of nodes: submit nodes, compute nodes and head nodes. Compute ring is an OpenMPI term for the slots to which processes for the job are dispatched, but I meant the compute nodes that

Re: [gridengine users] Create short.q queue definition that limits the runtime of a job

2015-09-24 Thread Reuti
Hi, On 23.09.2015 at 21:18, Lane, William wrote: > Reuti, > > 1. > If more than one compute node takes part in the compute ring, how does one determine > which one is the exechost? What do you mean by compute ring - a parallel job? The exechost is the one where the job script is executed.