Hi all,
I have a 27-node cluster. Currently all 320 of its 320 slots are
filled, all of them by jobs requesting a single slot.
At the top of my waiting queue there are 28 different jobs requesting 3
to 12 cores using two different parallel environments. All of these jobs
request -R y. They are being ignored and overtaken by the many
single-slot jobs behind them in the waiting queue.
I have enabled scheduler logging. During the last 4 hours it has
logged 724 new jobs starting across all 27 nodes. Not a single job on
the system is requesting -l h_rt, but single-core jobs keep being
scheduled and all the parallel jobs are starving.
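For anyone who wants to reproduce the count, here is a minimal sketch of how the schedule file can be tallied. It assumes the colon-separated record layout described in sched_conf(5) for MONITOR=1 (job ID in field 1, task ID in field 2, state in field 3) and counts each job/task once per state, since one job can produce several records. The function name tally_states is made up for illustration.

```shell
#!/bin/sh
# Tally distinct jobs per state in a Grid Engine schedule file.
# Assumed record layout (sched_conf(5), MONITOR=1):
#   jobid:taskid:state:start_time:duration:level_char:object_name:resource_name:utilization
# Lines of bare colons ("::::::::") separate scheduling runs and are skipped.
tally_states() {
    awk -F: '
        $1 != "" && $3 != "" {
            # One job/task can emit several records (one per resource
            # container), so count each job.task only once per state.
            key = $3 SUBSEP $1 "." $2
            if (!(key in seen)) {
                seen[key] = 1
                count[$3]++
            }
        }
        END {
            for (state in count) print state, count[state]
        }
    ' "$1" | LC_ALL=C sort
}

# Usage (path is whatever your installation uses):
#   tally_states "$SGE_ROOT/$SGE_CELL/common/schedule"
```

If the RESERVING line is missing from the output, no reservation was logged in that file at all.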
As far as I understand, backfilling is defeating my reservations, even
though no job requests any run time. However, if I set
"default_duration" to INFINITY, all the RESERVING log messages disappear.
Additionally, for some odd reason, I only see RESERVING messages for
jobs requesting a fixed number of slots (-pe whatever N). The jobs
requesting a slot range (-pe threaded 4-10) seem to reserve nothing.
My scheduler configuration is as follows:
# qconf -ssconf
algorithm default
schedule_interval 0:0:5
maxujobs 0
queue_sort_method load
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
load_formula np_load_avg
schedd_job_info true
flush_submit_sec 0
flush_finish_sec 0
params MONITOR=1
reprioritize_interval 0:0:0
halftime 168
usage_weight_list cpu=0.187000,mem=0.116000,io=0.697000
compensation_factor 5.000000
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
weight_tickets_functional 1000000000
weight_tickets_share 1000000000
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 200
report_pjob_tickets TRUE
max_pending_tasks_per_job 50
halflife_decay_list none
policy_hierarchy OSF
weight_ticket 0.010000
weight_waiting_time 0.000000
weight_deadline 3600000.000000
weight_urgency 0.100000
weight_priority 1.000000
max_reservation 50
default_duration 24:00:00
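Of the settings listed, the ones that bear on reservation are, as far as I know, max_reservation, default_duration and params. A minimal sketch for pulling just those out of the full listing, assuming the qconf -ssconf output is piped in on stdin; the function name reservation_settings is made up for illustration.

```shell
#!/bin/sh
# Print only the scheduler settings relevant to resource reservation.
# Reads `qconf -ssconf` output on stdin and normalizes the whitespace
# between the parameter name and its value to a single space.
reservation_settings() {
    awk '
        $1 == "max_reservation" ||
        $1 == "default_duration" ||
        $1 == "params" {
            $1 = $1      # force awk to rebuild the line with single spaces
            print
        }
    '
}

# Usage:
#   qconf -ssconf | reservation_settings
```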
I have also tested it with params PROFILE=1 and default_duration
INFINITY. With those settings, however, not a single reservation is
logged in /opt/gridengine/default/common/schedule and new jobs keep
starting.
What am I missing? Is it possible to disable backfilling? Are my
reservations really working?
Thanks in advance,
Txema