While looking into my previous issue with job reservations I
noticed a large performance problem involving RQS and the scheduler.
I had previously noticed my qmaster system often running at 100% CPU when
1000 or more jobs were in the system. I had just assumed this was
normal.
When I set "max_reservation 8" the scheduler task takes almost 5
minutes of 100% cpu on 1 core to run (running on a KVM virtual machine
with 2 cores dedicated). This means that priority changes can take up
to 10 minutes to be fully reflected in qstat output.
I measure the scheduler run time with 'qconf -tsm'.
Starting configuration:
max_reservation 8, max-slots-on-all-hosts enabled:
Tue Apr 24 23:21:07 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 23:25:36 2012|--------------STOP-SCHEDULER-RUN-------------
Tue Apr 24 23:25:36 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 23:30:09 2012|--------------STOP-SCHEDULER-RUN-------------
max_reservation 0, max-slots-on-all-hosts enabled:
Tue Apr 24 22:58:23 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 22:59:22 2012|--------------STOP-SCHEDULER-RUN-------------
This is with ~1800 jobs running, ~20 jobs in 'qw' state.
I have just noticed that when I disable my max-slots-on-all-hosts RQS
the scheduling time drops significantly.
max_reservation 8, max-slots-on-all-hosts disabled:
Thu Apr 26 11:56:24 2012|-------------START-SCHEDULER-RUN-------------
Thu Apr 26 11:56:30 2012|--------------STOP-SCHEDULER-RUN-------------
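
For anyone wanting to compare runs without doing the subtraction by
hand, here is a rough sketch that pairs the START/STOP lines and prints
the elapsed seconds. It assumes the lines look exactly like the ones
quoted in this message (timestamp, '|', marker) and that the month and
day names parse in an English/C locale; the function name run_durations
is my own and not part of Grid Engine.

```python
from datetime import datetime

# Timestamp layout as it appears in the lines quoted above,
# e.g. "Tue Apr 24 23:21:07 2012" (assumes an English/C locale).
FMT = "%a %b %d %H:%M:%S %Y"

def run_durations(lines):
    """Pair START/STOP marker lines; return each run's length in seconds."""
    durations = []
    start = None
    for line in lines:
        stamp, sep, marker = line.partition("|")
        if not sep:
            continue  # not a "timestamp|marker" line, skip it
        when = datetime.strptime(stamp.strip(), FMT)
        if "START-SCHEDULER-RUN" in marker:
            start = when
        elif "STOP-SCHEDULER-RUN" in marker and start is not None:
            durations.append((when - start).total_seconds())
            start = None
    return durations

# The three measurements from this message.
rqs_on_resv_8 = [
    "Tue Apr 24 23:21:07 2012|-------------START-SCHEDULER-RUN-------------",
    "Tue Apr 24 23:25:36 2012|--------------STOP-SCHEDULER-RUN-------------",
    "Tue Apr 24 23:25:36 2012|-------------START-SCHEDULER-RUN-------------",
    "Tue Apr 24 23:30:09 2012|--------------STOP-SCHEDULER-RUN-------------",
]
rqs_on_resv_0 = [
    "Tue Apr 24 22:58:23 2012|-------------START-SCHEDULER-RUN-------------",
    "Tue Apr 24 22:59:22 2012|--------------STOP-SCHEDULER-RUN-------------",
]
rqs_off_resv_8 = [
    "Thu Apr 26 11:56:24 2012|-------------START-SCHEDULER-RUN-------------",
    "Thu Apr 26 11:56:30 2012|--------------STOP-SCHEDULER-RUN-------------",
]

print(run_durations(rqs_on_resv_8))   # [269.0, 273.0]
print(run_durations(rqs_on_resv_0))   # [59.0]
print(run_durations(rqs_off_resv_8))  # [6.0]
```

So with max_reservation 8 a run goes from roughly 270 seconds with the
RQS enabled to 6 seconds with it disabled, about a 45x difference.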
For the record my RQS is now disabled:
{
   name         max-slots-on-all-hosts
   description  "Don't over commit host slots"
   enabled      FALSE
   limit        hosts {*} to slots=$num_proc
}
<rant>My internal logic says it should not take anywhere near the
original time to schedule 2000 jobs. But an awful lot of today's code
will just consume resources without good reason. I come from a time
when compute resources were actually very expensive and people paid
attention to performance. Nowadays, it seems people are willing to
just throw memory and CPU at problems instead of careful
programming.</rant>
This restores my belief in the original Grid Engine coders.
(still using sge6.2u5, CentOS 5)
Stuart Barkley
--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone