I ran into an issue where one of my nodes semi-crashed: its root volume was remounted read-only, and backfill scheduling started behaving strangely, only ever trying the highest-priority job. I'm not sure I could reproduce this or give you enough information to figure out what happened. The main problem I had in tracking it down, though, was that the debug output which would have shown jobs being tested against the bad node comes after a return statement that fired with my configuration, so nothing was logged.

The details are:

In plugins/select/cons_res/job_test.c, in _can_job_run_on_node(), the debug block

    if (select_debug_flags & DEBUG_FLAG_CPU_BIND)

comes after the early return

    if (!(cr_type & CR_MEMORY))
        return cpus;

so when memory is not configured as a consumable resource, the function returns before the debug output is ever reached.
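To illustrate the ordering problem (a minimal standalone sketch, not the actual Slurm code; the flag values and function signatures here are invented stand-ins for CR_MEMORY, DEBUG_FLAG_CPU_BIND, and _can_job_run_on_node()):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for the real Slurm flag bits. */
    #define CR_MEMORY            0x0002u
    #define DEBUG_FLAG_CPU_BIND  0x0004u

    /* Current ordering: the early return fires before the debug
     * block, so nothing is printed when CR_MEMORY is not set. */
    static int can_run_current(uint32_t cr_type, uint64_t dbg, int cpus)
    {
        if (!(cr_type & CR_MEMORY))
            return cpus;                 /* debug block never reached */

        if (dbg & DEBUG_FLAG_CPU_BIND)
            printf("testing job on node: %d cpus\n", cpus);
        return cpus;
    }

    /* One possible fix: emit the debug line before any early return. */
    static int can_run_fixed(uint32_t cr_type, uint64_t dbg, int cpus)
    {
        if (dbg & DEBUG_FLAG_CPU_BIND)
            printf("testing job on node: %d cpus\n", cpus);

        if (!(cr_type & CR_MEMORY))
            return cpus;
        return cpus;
    }

    int main(void)
    {
        /* cr_type without CR_MEMORY, debug flag set: only the
         * reordered version produces any output. */
        can_run_current(0, DEBUG_FLAG_CPU_BIND, 4);
        can_run_fixed(0, DEBUG_FLAG_CPU_BIND, 4);
        return 0;
    }

Running this prints the debug line exactly once, from the reordered version, which is the behavior I would have wanted while diagnosing the bad node.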

Thanks,
Phil
