We are pleased to announce the formal release of SLURM 2.4.0, as well as the
first development release of SLURM 2.5.
Both are available now for download at http://www.schedmd.com/#repos.
If you are developing new code, please develop against the master git repository
at https://github.com/SchedMD/slurm; it is updated constantly, so working against
it will minimize conflicts.
*Note to BGQ early adopters:* Recent changes require the runjob_mux to run as
your SLURM user, and the plugin_flags setting must also be updated to avoid a
possible runjob_mux crash if a job is being started while the slurmctld is
being shut down at the same time. Please see the updated bluegene web page at
http://schedmd.com/slurmdocs/bluegene.html, under "System Administration for
BlueGene/Q only", for full instructions.
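
As a rough sketch, the relevant settings live in the runjob_mux section of
bg.properties; the section name, plugin path, and flag value shown below are
placeholders only, so take the actual values from the bluegene page above:

    # Placeholder sketch of the runjob_mux plugin settings in bg.properties;
    # see "System Administration for BlueGene/Q only" for the real values.
    [runjob.mux]
        plugin       = /path/to/slurm/runjob_plugin.so
        plugin_flags = <value given in the SLURM bluegene documentation>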
Thanks for all your help and support. Among other things, 2.4 brings
substantial performance enhancements and many other improvements, which are
summarized in the RELEASE_NOTES file included with the code.
As always, if you find any bugs, please let us know through
http://bugs.schedmd.com or the slurm-dev list.
Below are changes for 2.4.0 and 2.5.0-pre1 since the last tag.
* Changes in SLURM 2.4.0
========================
-- Cray - Improve support for zero compute node resource allocations.
   The partition used can now be configured with no nodes.
-- BGQ - make it so srun -i<taskid> works correctly.
-- Fix parse_uint32/16 to complain if a non-digit is given.
-- Add SUBMITHOST to job state passed to Moab via sched/wiki2. Patch by
   Jon Bringhurst (LANL).
-- BGQ - Fix issue when running with AllowSubBlockAllocations=Yes without
   compiling with --enable-debug.
-- Modify scontrol to require the "-dd" option to report a batch job's script
   (see the example after this list). Patch from Don Albert, Bull.
-- Modify SchedulerParameters option to match documentation: "bf_res=" changed
   to "bf_resolution=" (see the example after this list). Patch from
   Rod Schultz, Bull.
-- Fix bug that clears job pending reason field. Patch from Don Lipari, LLNL.
-- In etc/init.d/slurm move check for scontrol after sourcing
/etc/sysconfig/slurm. Patch from Andy Wettstein, University of Chicago.
-- Fix scheduling logic that could delay jobs with min/max node counts.
-- BGQ - Fix issue so that if one step uses the entire allocation and the next
   step uses only part of the allocation, that next step gets the correct
   cnodes.
-- BGQ - Fix checking for IO on a block with the new IBM driver V1R1M1; the
   previous function didn't always work correctly.
-- BGQ - Fix issue when a nodeboard goes down and you want to combine blocks
   to make a larger small block and are running with sub-blocks.
-- BLUEGENE - Better logic for making small blocks around bad
nodeboard/card.
-- BGQ - When using an old IBM driver cnodes that go into error because of
a job kill timeout aren't always reported to the system. This is now
handled by the runjob_mux plugin.
-- BGQ - Added information on how to setup the runjob_mux to run as
SlurmUser.
-- Reduce memory consumption of step layouts with a high task count.
-- BGQ - Quieter debug output when the real time server comes back but there
   are still messages found while polling that haven't yet been given back to
   the real time server.
-- BGQ - Fix for the case where a request comes in smaller than the smallest
   block and a small block must be used instead of a shared midplane block.
-- Fix issue on large jobs (>64k tasks) to use the correct counter type when
   packing the step layout structure.
-- BGQ - Fix issue so that if a user asks for a task count and ntasks-per-node
   but not a node count, the node count is correctly figured out.
-- Move logic to always use the first node in alphanumeric order as the batch
   host for batch jobs.
-- BLUEGENE - Fix race condition that can occur if a nodeboard/card goes down
   at the same time a block is destroyed, and that block happens to be the
   smallest overlapping block over the bad hardware.
-- Fix bug when querying accounting for a job's node size.
-- BLUEGENE - fix possible race condition if cleaning up a block and the
removal of the job on the block failed.
-- BLUEGENE - Fix issue so that if a cable is in an error state, we can still
   check whether a block could be created if the cable were not in error.
-- Put node names in alphabetic order in the node table.
-- If a preempted job should have a grace time and the preempt mode is not
   cancel, but the job is going to be canceled anyway (because it is
   interactive or for some other reason), it now receives the grace time.
-- BGQ - Modified documents to explain the new plugin_flags needed in
   bg.properties in order for the runjob_mux to run correctly.
-- BGQ - change linking from libslurm.o to libslurmhelper.la to avoid
warning.
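
The scontrol and SchedulerParameters changes noted above can be exercised
roughly as follows; the job id and resolution value here are illustrative only:

    # Display a batch job's script (now requires the -dd option):
    scontrol -dd show job 1234

    # slurm.conf: the backfill resolution parameter is now spelled out in full:
    SchedulerParameters=bf_resolution=60
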
* Changes in SLURM 2.5.0.pre1
=============================
-- Add new output to "scontrol show configuration" of LicensesUsed. Output is
   in the form "name:used/total" (see the example after this list).
-- Changed jobacct_gather plugin infrastructure to be cleaner and easier to
   maintain.
-- Change license option count separator from "*" to ":" for consistency with
   the gres option (e.g. "--licenses=foo:2 --gres=gpu:2"); see the example
   after this list. The "*" will still be accepted, but is no longer
   documented.
-- Permit more than 100 jobs to be scheduled per node (new limit is 10,000
jobs).
-- Restructure srun code to allow outside programs to utilize the existing
   logic.
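
To illustrate the new license syntax and the LicensesUsed output, a
hypothetical example (the license name, counts, and script name are made up):

    # Request two "foo" licenses using the new ":" separator:
    sbatch --licenses=foo:2 job.sh

    # "scontrol show configuration" then reports usage as name:used/total, e.g.:
    LicensesUsed = foo:2/10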