Creating a job step modifies the job record, and hence updates
last_job_update. last_job_update is used in several ways besides
scheduling (state save, updates to squeue, etc.). One way to address
this might be to add a new time stamp that records only those changes
to the job records that might impact scheduling (e.g. a new job is
submitted or a job terminates).
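A minimal sketch of that idea in C, not actual Slurm code. Only
last_job_update is a real name here; last_sched_update, job_event_t,
record_job_event() and backfill_must_restart() are hypothetical names
made up for illustration, and the choice of which events count as
scheduling-relevant is an assumption taken from the examples above.

```c
#include <stdbool.h>
#include <time.h>

/* Updated on every change to a job record (state save, squeue, etc.). */
static time_t last_job_update = 0;

/* Hypothetical: updated only by changes that might impact scheduling. */
static time_t last_sched_update = 0;

/* Hypothetical event types, for illustration only. */
typedef enum {
	EV_JOB_SUBMIT,
	EV_JOB_END,
	EV_STEP_CREATE,
	EV_STEP_END
} job_event_t;

/* Assumption: only job submission and job termination affect the queue. */
static bool event_affects_scheduling(job_event_t ev)
{
	return (ev == EV_JOB_SUBMIT) || (ev == EV_JOB_END);
}

/* Record an event: always touch last_job_update, and touch
 * last_sched_update only for scheduling-relevant events. */
static void record_job_event(job_event_t ev, time_t now)
{
	last_job_update = now;
	if (event_affects_scheduling(ev))
		last_sched_update = now;
}

/* The backfill loop would save last_sched_update when it starts a pass
 * and call this after releasing and re-acquiring its locks. */
static bool backfill_must_restart(time_t sched_snapshot)
{
	return last_sched_update != sched_snapshot;
}
```

With this split, job step creation would still trigger state saves and
squeue updates through last_job_update, but would not by itself
interrupt a backfill pass. Time stamps with one-second resolution can
miss two events in the same second, so a counter might be a better
choice in practice.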
There have been some bugs related to the bf_continue configuration
parameter in recent months. I would suggest that you reconsider using it.
Quoting Magnus Jonsson <[email protected]>:
Hi!
While investigating another matter, I found that lots of running jobs
with short job steps kill the backfill very effectively.
Since every action on a job step modifies the global variable
last_job_update, this effectively stops the backfill loop.
This can be demonstrated very simply with the following batch script
on a system with some jobs in the queue.
----8<----
#!/bin/bash
# Run 120 short job steps back to back
for n in $(seq 120); do
    srun sleep 1
done
----8<----
In version 2.6.7 I can find only a few places where last_job_update
is used, and only one that is directly related to job steps.
Is there a need for the code to update last_job_update for every
action of a job step?
Should there also be a last_job_step_update? Are there actions of a
job step that affect the queue?
Could another variable be used to trigger a reschedule of the queue,
based on events that actually affect the scheduling of the queue?
Best regards,
Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet