Hey Michael,
A commit in 14.03.1 that may be related to what you are seeing is e94f10b8a2f85936e487358a0da001a271898d4f. It is a partial revert of commit 9b1dadea4eb823b5ef29d8b4ee56cb6b7c3be22f which first appeared in 2.6.8. Try applying that (or upgrading to 14.03) and see if it fixes your issue.
I think Paul is correct, though: this doesn't look like the backfill loop, but the normal scheduler.
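In case it helps to double-check which values are actually set on that SchedulerParameters line, here is a minimal Python sketch that splits it into individual options. The helper name parse_scheduler_parameters is hypothetical and not part of Slurm; it only splits the string from your message and says nothing about how slurmctld itself interprets the options.

```python
# Hypothetical helper, not part of Slurm: split a SchedulerParameters value
# into a dict so individual options are easy to inspect.
def parse_scheduler_parameters(value):
    params = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, val = item.partition("=")
        if not sep:
            # Bare flags such as bf_continue carry no value; store True.
            params[key] = True
        elif val.isdigit():
            params[key] = int(val)
        else:
            params[key] = val
    return params


# The value quoted in the original message.
line = (
    "default_queue_depth=10000,bf_continue,bf_interval=120,"
    "bf_max_job_user=10000,bf_resolution=600,bf_window=4320,"
    "bf_max_job_part=10000"
)
params = parse_scheduler_parameters(line)
print(params["default_queue_depth"])
```

Comparing this against the value the running controller reports would at least rule out a stale or mistyped setting.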
Danny

On 06/05/2014 11:16 AM, Michael Gutteridge wrote:
I'm running slurm 2.6.9. I've got the backfill scheduler set up with some pretty ridiculous parameters, as we have a large number of queued jobs of various dimensions:

SchedulerParameters=default_queue_depth=10000,bf_continue,bf_interval=120,bf_max_job_user=10000,bf_resolution=600,bf_window=4320,bf_max_job_part=10000

This has been working fine, with backfill effectively going through the full queue, but today it appears to have stopped: jobs which should be backfilled onto idle resources aren't being run. The scheduler log shows:

[2014-06-04T13:16:10.107] sched: Running job scheduler
[2014-06-04T13:16:10.111] sched: JobId=7060218. State=PENDING. Reason=Resources. Priority=10850. Partition=campus.
[2014-06-04T13:16:10.111] sched: JobId=7060219. State=PENDING. Reason=Priority(Priority), Priority=10850, Partition=campus.
[2014-06-04T13:16:10.111] sched: already tested 3 jobs, breaking out

My understanding is that it shouldn't hit that limit until default_queue_depth. Has my controller lost its mind? I've got a nearly identical test setup where this is working as I'd expect.

Any hints appreciated... thanks much

Michael
