On 07.10.2013 at 15:59, Txema Heredia wrote:

> On 07/10/13 14:58, Reuti wrote:
>> Hi,
>>
>> On 07.10.2013 at 13:15, Txema Heredia wrote:
>>
>>> The problem is that, right now, the mandatory usage of h_rt is not an
>>> option. So we need to work considering that all jobs will last to infinity
>>> and beyond.
>>>
>>> Right now, the scheduler configuration is:
>>> max_reservation 50
>>> default_duration 24:00:00
>>>
>>> During the weekend, most of the parallel (and -R y) jobs started running,
>>> but now there is something fishy in my queues:
>>>
>>> The first 3 jobs in my waiting queue belong to user1. All 3 jobs request
>>> -pe mpich_round 12, -R y and -l h_vmem=4G (h_vmem is set to consumable =
>>> YES, not JOB).
>> Which amount of memory did you specify in the exechost definition, i.e.
>> what's in the machine physically?
>>
>> -- Reuti
>
> 26 nodes have 96GB of RAM. One node has 48GB.

And you defined it on an exechost level under "complex_values"?

- Reuti

> Currently nodes range from 4 to 10 free slots and from 26 to 82.1 free GB
>
> The first jobs in my waiting queue (after the 3 reserving ones) require
> measly 0.9G, 3G and 12G, all with slots=1 and -R n. None of them is
> scheduled. But if I manually increase their priority so they are put BEFORE
> the 3 -R y jobs, they are immediately scheduled.
>
>>
>>
>>> This user already has one such job running. User1 has an RQS that
>>> limits him to use only 12 slots in the whole cluster. Thus the 3 waiting
>>> jobs will not be able to run until the first one finishes.
>>>
>>> This is the current schedule log:
>>>
>>> # grep "::::\|RESERVING" schedule | tail -200 | grep "::::\|Q:all" | tail -37 | sort
>>> ::::::::
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>>
>>>
>>> Right now, the cluster is using 190 slots of 320 total. The schedule log
>>> says that the 3 waiting jobs from user1 are the only jobs making any kind
>>> of reservation. These jobs are reserving a total of 36 cores. These 3 jobs
>>> are effectively blocking 36 already-free slots because the RQS doesn't
>>> allow user1 to use more than 12 slots at once. This is not "nice"
>>> but I understand that the scheduler has its limitations and cannot predict
>>> the future.
>>>
>>> Taking into account the jobs running + the slots & memory locked by the
>>> reserving jobs, there is a grand total of 226 slots locked, thus leaving 94
>>> free slots.
>>>
>>> Here comes the problem: Even though there are 94 free slots and lots of
>>> spare memory, NONE of the 4300 waiting jobs is running. There are nodes
>>> with 6 free slots and 59 GB of free RAM but none of the waiting jobs is
>>> scheduled. New jobs only start running when one of the 190 slots occupied by
>>> running jobs is freed. None of these other waiting jobs is requesting -R y,
>>> -pe or h_rt.
>>>
>>>
>>> Additionally, this is creating some odd behaviour. It seems that, on each
>>> scheduler run, it is trying to start jobs in those "blocked slots", but it
>>> fails for no apparent reason. Some of the jobs are even trying to start
>>> twice, but almost none (generally none at all) gets to run:
>>>
>>> # tail -2000 schedule | grep -A 1000 "::::::" | grep "Q:all" | grep STARTING | sort
>>> 2734121:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734122:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734123:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734124:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734125:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734126:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734127:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734128:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734129:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734130:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734131:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734132:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734133:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734134:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734135:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734136:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734137:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734138:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734139:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734140:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734141:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734142:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734143:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734144:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734145:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734146:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734147:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734148:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734149:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734150:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734151:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734152:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734153:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734154:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734155:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734156:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734157:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734158:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734159:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734160:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734161:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735158:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735159:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735160:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735161:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735162:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735163:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735164:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735165:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735166:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735167:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735168:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735169:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735170:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735171:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735172:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735173:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735174:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735175:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735176:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735177:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735178:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735179:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735180:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735181:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735182:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735183:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735184:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735185:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735186:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735187:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735188:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735189:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735190:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735191:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735192:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735193:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743479:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743480:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743481:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743482:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743483:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743484:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743485:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743486:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743487:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743488:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743489:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743490:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743491:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743492:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743493:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743494:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743495:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743496:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743497:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743498:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743499:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743500:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743501:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743502:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743503:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743504:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743505:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743506:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743507:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743508:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743509:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743510:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743511:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743512:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743513:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743514:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743515:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743516:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743517:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743518:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>>
>>>
>>> Even though jobs appear here listed as
>>> "starting" they are not running at
>>> all. But they are issuing a "starting" message on each scheduling interval.
>>>
>>> Why are the reservations blocking a third of the cluster??? It shouldn't be
>>> a backfilling issue, they are blocking the usage of 3 times the slots
>>> reserved. Why can't the "starting" jobs run?
>>>
>>> Txema
>>>
>>>
>>>
>>> On 07/10/13 09:28, Christian Krause wrote:
>>>> Hello,
>>>>
>>>> We solved it by setting `h_rt` to FORCED in the complex list:
>>>>
>>>> #name  shortcut  type  relop  requestable  consumable  default  urgency
>>>> #----------------------------------------------------------------------
>>>> h_rt   h_rt      TIME  <=     FORCED       YES         0:0:0    0
>>>>
>>>> And have a JSV rejecting jobs that don't request it (because they would be
>>>> pending indefinitely unless you have a default duration or use qalter).
>>>>
>>>> You could also use a JSV to enforce that only jobs with large resources
>>>> (in your case more than some amount of slots) are able to request
>>>> reservation, i.e.:
>>>>
>>>> # pseudo JSV code
>>>> SLOT_RESERVATION_THRESHOLD=...
>>>> if slots < SLOT_RESERVATION_THRESHOLD then
>>>>     "disable reservation / reject"
>>>> else
>>>>     "enable reservation"
>>>> fi
>>>>
>>>>
>>>> On Fri, Oct 04, 2013 at 04:25:29PM +0200, Txema Heredia wrote:
>>>>> Hi all,
>>>>>
>>>>> I have a 27-node cluster. Currently there are 320 out of 320 slots
>>>>> filled up, all by jobs requesting 1 slot.
>>>>>
>>>>> At the top of my waiting queue there are 28 different jobs
>>>>> requesting 3 to 12 cores using two different parallel environments.
>>>>> All these jobs are requesting -R y. They are being ignored and
>>>>> overrun by the myriad of 1-slot requesting jobs behind them in the
>>>>> waiting queue.
>>>>>
>>>>> I have enabled the scheduler logging. During the last 4 hours, it
>>>>> has logged 724 new jobs starting, in all the 27 nodes. Not a single
>>>>> job on the system is requesting -l h_rt, but single-core jobs keep
>>>>> being scheduled and all the parallel jobs are starving.
>>>>>
>>>>> As far as I understand, the backfilling is killing my reservations,
>>>>> even if no one is requesting any kind of time, but if I set the
>>>>> "default_duration" to INFINITY, all the RESERVING log messages
>>>>> disappear.
>>>>>
>>>>> Additionally, for some odd reason, I only receive RESERVING messages
>>>>> from the jobs requesting a given number of slots (-pe whatever N).
>>>>> The jobs requesting a slot-range (-pe threaded 4-10) seem to reserve
>>>>> nothing.
>>>>>
>>>>> My scheduler configuration is as follows:
>>>>>
>>>>> # qconf -ssconf
>>>>> algorithm                         default
>>>>> schedule_interval                 0:0:5
>>>>> maxujobs                          0
>>>>> queue_sort_method                 load
>>>>> job_load_adjustments              np_load_avg=0.50
>>>>> load_adjustment_decay_time        0:7:30
>>>>> load_formula                      np_load_avg
>>>>> schedd_job_info                   true
>>>>> flush_submit_sec                  0
>>>>> flush_finish_sec                  0
>>>>> params                            MONITOR=1
>>>>> reprioritize_interval             0:0:0
>>>>> halftime                          168
>>>>> usage_weight_list                 cpu=0.187000,mem=0.116000,io=0.697000
>>>>> compensation_factor               5.000000
>>>>> weight_user                       0.250000
>>>>> weight_project                    0.250000
>>>>> weight_department                 0.250000
>>>>> weight_job                        0.250000
>>>>> weight_tickets_functional         1000000000
>>>>> weight_tickets_share              1000000000
>>>>> share_override_tickets            TRUE
>>>>> share_functional_shares           TRUE
>>>>> max_functional_jobs_to_schedule   200
>>>>> report_pjob_tickets               TRUE
>>>>> max_pending_tasks_per_job         50
>>>>> halflife_decay_list               none
>>>>> policy_hierarchy                  OSF
>>>>> weight_ticket                     0.010000
>>>>> weight_waiting_time               0.000000
>>>>> weight_deadline                   3600000.000000
>>>>> weight_urgency                    0.100000
>>>>> weight_priority                   1.000000
>>>>> max_reservation                   50
>>>>> default_duration                  24:00:00
>>>>>
>>>>>
>>>>> I have also tested it with params PROFILE=1 and default_duration
>>>>> INFINITY. But, when I set it, not a single reservation is logged in
>>>>> /opt/gridengine/default/common/schedule and new jobs keep starting.
>>>>>
>>>>>
>>>>> What am I missing? Is it possible to kill the backfilling? Are my
>>>>> reservations really working?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Txema
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
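
The slot arithmetic discussed in this thread (190 running + 36 reserved = 226 locked, 94 left of 320) can be reproduced from the schedule records themselves. The following is only a rough Python sketch, not an official Grid Engine tool: the field labels (`job`, `task`, `state`, `start`, `duration`, `level`, `obj`, `resource`, `amount`) are my own names read off the colon-separated records quoted above, and `all.q@nodeX` is a placeholder because the real queue instance names are obfuscated in the archive.

```python
from collections import defaultdict

# Field labels are assumptions inferred from the records quoted in the thread.
FIELDS = ("job", "task", "state", "start", "duration",
          "level", "obj", "resource", "amount")


def parse_record(line):
    """Split one schedule record; return None for separators and other lines."""
    parts = line.strip().split(":")
    if len(parts) != len(FIELDS) or not parts[0].isdigit():
        return None  # e.g. the "::::::::" line that separates scheduler runs
    rec = dict(zip(FIELDS, parts))
    rec["start"] = int(rec["start"])
    rec["duration"] = int(rec["duration"])
    rec["amount"] = float(rec["amount"])
    return rec


def slots_by_job(lines, state):
    """Sum queue-level slot amounts per job id for records in the given state."""
    totals = defaultdict(float)
    starts = {}
    for line in lines:
        rec = parse_record(line)
        if rec and rec["state"] == state and rec["level"] == "Q" \
                and rec["resource"] == "slots":
            totals[rec["job"]] += rec["amount"]
            starts[rec["job"]] = rec["start"]
    return dict(totals), starts


# Sample rebuilt from the values in the thread (12 records per reserving job).
sample = ["::::::::"]
for job, start in (("2734185", 1381142325),
                   ("2734186", 1381228785),
                   ("2734187", 1381315245)):
    sample += [f"{job}:1:RESERVING:{start}:86460:Q:all.q@nodeX:slots:1.000000"] * 12

totals, starts = slots_by_job(sample, "RESERVING")
reserved = sum(totals.values())
running, total_slots = 190, 320  # figures stated in the thread
free = total_slots - running - reserved

ordered = sorted(starts.values())
gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]

print(totals)
print("reserved:", reserved, "free after reservations:", free)
print("seconds between reservation start times:", gaps)
```

With the quoted values, the start times of the three reservations are each 86460 seconds apart, which is the logged duration (default_duration 24:00:00 plus 60 seconds). That suggests the three reservations are planned back to back rather than side by side, although the thread itself does not confirm how the scheduler arrives at them.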

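
Christian's pseudo JSV code can be spelled out as plain decision logic. This is a minimal sketch in Python and deliberately not a working JSV: it does not speak the JSV protocol or use the helper libraries shipped with Grid Engine. The dictionary keys (`h_rt`, `slots`, `reservation`), the `verify` function and the threshold value of 4 are all hypothetical, since the thread leaves the threshold open.

```python
# Hypothetical value; the thread only writes SLOT_RESERVATION_THRESHOLD=...
SLOT_RESERVATION_THRESHOLD = 4


def verify(job, threshold=SLOT_RESERVATION_THRESHOLD):
    """Return (action, message, changes) for a job described as a plain dict.

    Assumed keys (illustrative only, not real JSV parameter names):
      'h_rt'         requested runtime string, or None if not requested
      'slots'        number of slots requested (1 for serial jobs)
      'reservation'  True if the job was submitted with -R y
    """
    # Rule 1: jobs without a runtime request are rejected, because with a
    # FORCED h_rt complex they would otherwise stay pending.
    if not job.get("h_rt"):
        return ("reject", "please request a runtime with -l h_rt", {})

    slots = job.get("slots", 1)
    wants_reservation = bool(job.get("reservation"))

    # Rule 2: small jobs may not hold reservations.
    if slots < threshold:
        if wants_reservation:
            return ("correct", "reservation disabled for small job",
                    {"reservation": False})
        return ("accept", "", {})

    # Rule 3: large jobs get reservation switched on.
    if not wants_reservation:
        return ("correct", "reservation enabled for large job",
                {"reservation": True})
    return ("accept", "", {})


if __name__ == "__main__":
    print(verify({"slots": 12, "reservation": True}))
    print(verify({"h_rt": "1:00:00", "slots": 1, "reservation": True}))
    print(verify({"h_rt": "24:00:00", "slots": 12, "reservation": True}))
```

In a real deployment the same three rules would be written against the JSV interface of the installed Grid Engine version, and the "reject" branch for small jobs mentioned in the pseudocode could replace the "correct" branch if silently changing a submission is not wanted.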