Creating a job step modifies the job record, and therefore updates last_job_update. That variable is used for several purposes besides scheduling (state save, updates to squeue, etc.). One way to address this might be to add a new timestamp that records only those changes to job records that might affect scheduling (e.g. a new job being submitted or a job terminating).
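
To make the idea of a separate timestamp concrete, here is a minimal sketch in C. It is not taken from the Slurm source: last_sched_update and all three helper functions are hypothetical names chosen for illustration, and only last_job_update is a name from this thread. Times are passed in explicitly so the behaviour is easy to follow.

```c
#include <stdbool.h>
#include <time.h>

/* Global named in this thread: stamped on any change to a job record. */
static time_t last_job_update;
/* Hypothetical new global: stamped only on changes that can affect scheduling. */
static time_t last_sched_update;

/* Hypothetical helper: a job step was created or finished.
 * State save, squeue, etc. still see the change; the scheduler does not. */
void note_step_change(time_t now)
{
    last_job_update = now;
}

/* Hypothetical helper: a job was submitted or terminated. */
void note_sched_change(time_t now)
{
    last_job_update = now;
    last_sched_update = now;
}

/* Hypothetical check for the backfill loop: restart only if a
 * scheduling-relevant change happened after the loop started. */
bool backfill_must_restart(time_t loop_start)
{
    return last_sched_update > loop_start;
}
```

With this split, a stream of short job steps keeps moving last_job_update but leaves last_sched_update alone, so a check like backfill_must_restart() would only fire on job submission or termination.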

There have been some bugs related to the bf_continue configuration parameter in recent months. I would suggest that you reconsider that.


Quoting Magnus Jonsson:

Hi!

While investigating another matter, I found that lots of running jobs with short job steps kill the backfill scheduler very effectively.

Since every action on a job step modifies the last_job_update global variable, this effectively stops the backfill loop.
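
The effect can be illustrated with a toy model in C. This is only a sketch of the behaviour described above, not Slurm's backfill code: the function name and the pass_len and step_period parameters are assumptions made for the example, and time is counted in whole seconds.

```c
#include <stdbool.h>
#include <time.h>

/* Global named in this thread; stamped whenever a job record changes. */
static time_t last_job_update;

/* Toy model: a backfill pass needs pass_len uninterrupted seconds and
 * gives up as soon as last_job_update differs from the value it saw
 * when it started. A job step is created every step_period seconds
 * (0 means no job steps at all).
 * Returns true if the pass reaches the end of the queue. */
bool backfill_pass_completes(int pass_len, int step_period)
{
    time_t start_stamp = last_job_update = 0;

    for (int t = 1; t <= pass_len; t++) {
        if (step_period > 0 && t % step_period == 0)
            last_job_update = t;    /* step created: job record touched */
        if (last_job_update != start_stamp)
            return false;           /* pass abandoned before the queue is done */
    }
    return true;
}
```

In this model a 30-second pass completes with no job steps, but never completes when a step starts every second, which matches what the "srun sleep 1" loop produces.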

This can be demonstrated very simply with the following batch script on a system with some jobs in the queue.

----8<----
#!/bin/bash

# Run 120 short job steps back to back; each srun creates a new job step.
for n in $(seq 120); do
        srun sleep 1
done
----8<----

In version 2.6.7 I can find only a few places where last_job_update is used, and only one that is directly related to job steps.

Does the code need to update last_job_update for every action on a job step?

Should there also be a last_job_step_update? Are there any job step actions that affect the queue?

Could another variable be used to trigger a reschedule of the queue, based only on events that actually affect scheduling?

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
