I ran into an issue where one of my nodes semi-crashed, remounted
its root volume read-only, and started causing strange problems:
backfill scheduling would only try the highest-priority job. I'm not
sure I could reproduce this or give you enough information to figure
out what happened. However, the main difficulty I had in tracking
down the problem was that the debug output which would have shown
jobs being tested against the bad node occurs after an early return
statement that was taken with my configuration.
The details are: in plugins/select/cons_res/job_test.c, in
_can_job_run_on_node(), the block guarded by

    if (select_debug_flags & DEBUG_FLAG_CPU_BIND)

comes after the early return

    if (!(cr_type & CR_MEMORY))
        return cpus;

so with a configuration that does not select CR_MEMORY, the debug
output is never printed.
Thanks,
Phil