Odds are the backfill loop is not penetrating far enough into the queue. Recall that Slurm has two scheduling loops. The primary loop is the faster one, and it only penetrates the queue as far as it can schedule. In this case the primary loop would stop immediately on the GPU jobs that it can't schedule, so it is up to the backfill loop to fill in the gaps. I would make sure that your backfill loop is actually doing that. Either the backfill loop isn't going deep enough into the queue to pick up the CPU jobs that can run, or it has decided that those jobs can't run due to some vagary in its logic (typically because it thinks they won't fit due to time constraints).

Anyways that's where I would start.
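
For reference, backfill depth is controlled through SchedulerParameters in slurm.conf. The fragment below is only a sketch of the relevant knobs: the values are illustrative assumptions, not settings taken from this thread, and they would need tuning for the actual queue size and longest time limit. Running sdiag before and after a change should show how deep the backfill cycle actually gets.

```
# slurm.conf fragment (illustrative values only, not a recommendation)
SchedulerType=sched/backfill
# bf_continue      keep going after the backfill loop periodically releases locks
# bf_max_job_test  maximum number of pending jobs the backfill loop will consider
# bf_window        how far into the future to plan, in minutes
# bf_resolution    time resolution of the backfill plan, in seconds
# bf_interval      seconds between backfill cycles
SchedulerParameters=bf_continue,bf_max_job_test=5000,bf_window=2880,bf_resolution=300,bf_interval=60
```

With hundreds of GPU jobs sitting at the top of the queue, bf_max_job_test is the first one I would look at, since the CPU jobs will never be examined if the limit is reached before the loop gets to them.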

-Paul Edmon-


On 7/3/2018 5:22 PM, Christopher Benjamin Coffey wrote:
Hello!

We are having an issue with high priority gpu jobs blocking low priority cpu 
only jobs.

Our cluster is set up with one partition, "all". All nodes reside in this 
cluster. In this all partition we have four generations of compute nodes, including gpu 
nodes. We do this to make use of those unused cores on the gpu nodes for compute only 
jobs. We schedule the various different generations, and gpu nodes by the user specifying 
a constraint (if they care), and a --qos=gpu / --gres=gpu:tesla:1 for gpu nodes. The gpu 
qos will give the jobs the highest priority in the queue, so that they can get scheduled 
sooner onto the limited resource that we have in gpu's. So this has worked out real nice 
for quite some time. But lately we've noticed that the gpu jobs are blocking the cpu only 
jobs. Yes, the gpu jobs have higher priority, yet, the gpu jobs can only run on a very 
small subset of nodes compared to the cpu only jobs. But it appears that slurm isn't 
taking into consideration the limited set of nodes the gpu job can run on. That’s the 
only possibility that I see to the gpu jobs blocking the cpu only jobs. I'm not sure if 
this is due to a recent slurm change, or if we just never noticed, but it's definitely 
happening.

For example, the behavior happens in the following scenario:

- 15 compute nodes (no gpus) are idle
- All of the gpus are occupied
- 1000's of low priority compute only jobs in the pending queue
- 100's of highest priority gpu jobs in the pending queue

In this scenario, the low priority jobs are neither backfilled nor started, 
yet the compute only nodes remain idle. If I hold the gpu jobs, the lower 
priority compute only jobs are then started.

Anyone seen this? Am I thinking about this wrong? I would think that slurm 
should not be considering the nodes with no gpus to fulfill the gpu jobs.

I have an idea how to fix this scenario, but I think our current config should 
work. The fix I am mulling over is to create a gpu partition, and place the gpu 
nodes into that partition. Then, use the all_partitions job submit plugin to 
schedule compute only jobs into both partitions. The gpu jobs would then only 
land in the gpu partition. I'd think that would definitely fix the issue, but 
maybe there is a down side. Yet, I think how we have it should be working!?
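
As a rough sketch, that layout might look something like the following in slurm.conf. The node names and ranges are hypothetical placeholders, not the actual configuration, and the gres and node definitions are omitted.

```
# slurm.conf fragment (hypothetical node names, for illustration only)
# Jobs submitted without a partition are placed in every partition
JobSubmitPlugins=all_partitions
# Compute only nodes
PartitionName=all Nodes=cn[001-100] Default=YES State=UP
# GPU nodes, which compute only jobs could also use through all_partitions
PartitionName=gpu Nodes=gpu[01-04] State=UP
```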

Thanks for your advice!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167


