Phil,

The problem that you describe is a new one for me and we
have multiple hardware failures daily here.

I'll change the code with respect to this logging in version 2.3,
but it will only change the behavior if DebugFlags=CPU_Bind
is configured. There is another DebugFlag, Backfill, that provides
detailed logging of what the backfill scheduler is doing. You can
also change the debug level of the slurmctld daemon in real time
with "scontrol setdebug 7" (or some other level); see "man scontrol"
for details.
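
For reference, here is how those knobs might look in practice. This is
a sketch, not site-specific advice: the flag combination and debug
level shown are illustrative, and DebugFlags belongs in your
slurm.conf.

```
# slurm.conf: enable the extra logging (CPU_Bind for the logging
# change in 2.3, Backfill for backfill-scheduler detail)
DebugFlags=CPU_Bind,Backfill

# At runtime, raise slurmctld's debug level without a restart:
#   scontrol setdebug 7
# See "man scontrol" for the valid levels.
```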
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Phil Sharfstein [[email protected]]
Sent: Tuesday, March 29, 2011 1:55 PM
To: [email protected]
Subject: [slurm-dev] debug output problem in cons_res/job_test.c

I ran into an issue where one of my nodes semi-crashed: it remounted
its root volume read-only and started causing strange problems, with
backfill scheduling only trying the highest-priority job.  I'm not sure
I could reproduce this or get you enough information to figure out what
happened.  However, the main issue I had in tracking down the problem
was that the debug output which would have shown jobs getting tested to
run on the bad node occurs after a return statement that executed with
my configuration.

The details are:

in plugins/select/cons_res/job_test.c

in _can_job_run_on_node()

the if (select_debug_flags & DEBUG_FLAG_CPU_BIND)

is after

if (!(cr_type & CR_MEMORY))
     return cpus;

Thanks,
Phil

