Phil,

The problem that you describe is a new one for me, and we have multiple hardware failures daily here.
I'll change the code with respect to this logging in version 2.3, but it will only change the behavior when DebugFlags=CPU_Bind is configured. There is another DebugFlag, Backfill, that provides detailed logging of what the backfill scheduler is doing. You can also change the debug level of the slurmctld daemon in real time with "scontrol setdebug 7" (or some other number); see "man scontrol" for details.

________________________________________
From: [email protected] [[email protected]] On Behalf Of Phil Sharfstein [[email protected]]
Sent: Tuesday, March 29, 2011 1:55 PM
To: [email protected]
Subject: [slurm-dev] debug output problem in cons_res/job_test.c

I ran into an issue where one of my nodes semi-crashed: it remounted its root volume read-only and started causing strange problems, with backfill scheduling only trying the highest-priority job. I'm not sure I could reproduce this or get you enough information to figure out what happened.

However, the main issue I had in tracking down the problem was that the debug output which would have shown jobs being tested to run on the bad node occurs after a return statement that executed with my configuration. The details: in plugins/select/cons_res/job_test.c, in _can_job_run_on_node(), the block guarded by

    if (select_debug_flags & DEBUG_FLAG_CPU_BIND)

comes after

    if (!(cr_type & CR_MEMORY))
        return cpus;

Thanks,
Phil
