Slurm versions 14.11.2 and 15.08.0-pre1 are now available. Version
14.11.2 includes quite a few relatively minor bug fixes. Version
15.08.0 is under active development, with release planned for August
2015. While this is the first pre-release, it already contains quite a
bit of new functionality.
Both versions can be downloaded from http://schedmd.com/#repos
Highlights of the two versions follow.
* Changes in Slurm 14.11.2
==========================
-- Fix Centos5 compile errors.
-- Fix issue with association hash not getting the correct index, which
   could result in a segfault.
-- Fix salloc/sbatch -B segfault.
-- Avoid huge malloc if GRES configured with "Type" and huge "Count".
-- Prevent jobs from starting in overlapping reservations that will not
   finish before a "maint" reservation begins.
-- When a node is drained while in the mixed state, display its status as
   "draining" in sinfo output.
-- Allow priority/multifactor to work with sched/wiki(2) if all priorities
have no weight. This allows for association and QOS decay limits
to work.
-- Fix "squeue --start" to override SQUEUE_FORMAT env variable.
-- Fix scancel to be able to cancel multiple jobs that are space
delimited.
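For example, the fix restores the ability to pass several space-delimited
job IDs in one invocation (the job IDs below are illustrative):

```shell
scancel 1001 1002 1003
```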
-- Log a Cray MPI job calling exit() without mpi_fini(), but do not treat
   it as a fatal error. This partially reverts logic added in version
   14.03.9.
-- sview - Fix displaying of suspended steps elapsed times.
-- Increase number of messages that get cached before throwing them away
when the DBD is down.
-- Restore GRES functionality with select/linear plugin. It was broken in
version 14.03.10.
-- Fix bug with GRES having multiple types that can cause slurmctld abort.
-- Fix squeue issue with not recognizing "localhost" in --nodelist option.
-- Make sure the bitstrings for a partition's AllowQOS/DenyQOS are up to
   date when running from cache.
-- Add smap support for job arrays and larger job ID values.
-- Fix possible race condition when attempting to use QOS on a system
   running accounting_storage/filetxt.
-- Fix issue with accounting_storage/filetxt and job arrays not being
   printed correctly.
-- In proctrack/linuxproc and proctrack/pgid, check the result of strtol()
   for error condition rather than errno, which might have a vestigial
   error code.
-- Improve information recording for jobs deferred due to advanced
reservation.
-- Export eio_new_initial_obj to the plugins and initialize kvs_seq on
   mpi/pmi2 setup to support launching.
* Changes in Slurm 15.08.0pre1
==============================
-- Add sbcast support for file transfer to resources allocated to a job
   step rather than a job allocation.
-- Shorten "association" to "assoc" in structure names to save space.
-- Add support for job dependencies joined with the OR operator (e.g.
"--depend=afterok:123?afternotok:124").
-- Add "--bb" (burst buffer specification) option to salloc, sbatch,
and srun.
-- Added configuration parameters BurstBufferParameters and
BurstBufferType.
-- Added burst_buffer plugin infrastructure (needs many more functions).
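A minimal slurm.conf sketch showing where the new parameter would live;
the plugin name is a placeholder, not a value taken from this announcement,
and the plugin infrastructure is still incomplete per the entry above:

```conf
# slurm.conf (illustrative)
BurstBufferType=burst_buffer/generic
```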
-- Make the fanout logic abandon the tree when it comes across a node
   that is down, avoiding worst-case scenarios in which an entire branch
   is down and each node must be tried serially.
-- Add better error reporting of invalid partitions at submission time.
-- Move the will-run test for multiple clusters from the sbatch code into
   the API so that it can be used with DRMAA.
-- If a non-exclusive allocation requests --hint=nomultithread on a
   CR_CORE/SOCKET system, lay out tasks correctly.
-- Avoid including unused CPUs in a job's allocation when cores or
   sockets are allocated.
-- Added a new job state of STOPPED indicating that processes have been
   stopped with a SIGSTOP (using scancel or sview) while the job retains
   its allocated CPUs. The job state returns to RUNNING when SIGCONT is
   sent (also using scancel or sview).
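The stop/continue cycle described above might look like this (the job ID
is illustrative):

```shell
scancel --signal=STOP 1234   # job enters STOPPED; its CPUs stay allocated
scancel --signal=CONT 1234   # job returns to RUNNING
```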
-- Added EioTimeout parameter to slurm.conf. It is the number of seconds
   srun waits for slurmstepd to close the TCP/IP connection used to relay
   data between the user application and srun when the user application
   terminates.
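An illustrative slurm.conf fragment (the value shown is arbitrary):

```conf
# Wait up to 60 seconds for slurmstepd to close the I/O connection
EioTimeout=60
```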
-- Remove slurmctld/dynalloc plugin as the work was never completed, so
   it is not worth the effort of continued support at this time.
-- Remove DynAllocPort configuration parameter.
-- Add an advance reservation flag of "replace" that causes allocated
   resources to be replaced with idle resources. This maintains a pool of
   available resources of constant size (to the extent possible).
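Creating such a self-replenishing reservation might look like the
following; every name and value here is illustrative:

```shell
scontrol create reservation ReservationName=pool StartTime=now \
    Duration=UNLIMITED NodeCnt=4 Users=root Flags=replace
```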
-- Added SchedulerParameters option of "bf_busy_nodes". When selecting
   resources for pending jobs to reserve for future execution (i.e. the
   job can not be started immediately), preferentially select nodes that
   are in use. This will tend to leave currently idle resources available
   for backfilling longer running jobs, but may result in allocations
   having less than optimal network topology. This option is currently
   only supported by the select/cons_res plugin.
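A slurm.conf sketch enabling the option (assumes the cons_res plugin, per
the note above):

```conf
SelectType=select/cons_res
SchedulerParameters=bf_busy_nodes
```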
-- Permit "SuspendTime=NONE" as slurm.conf value rather than only a
numeric
value to match "scontrol show config" output.
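For example, both of these forms are now accepted in slurm.conf:

```conf
SuspendTime=-1     # numeric form: node suspension disabled
SuspendTime=NONE   # newly accepted form, matching "scontrol show config"
```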
-- Add the 'scontrol show cache' command which displays the associations
in slurmctld.
-- Test more frequently for node boot completion before starting a job.
Provides better responsiveness.
-- Fix PMI2 singleton initialization.
-- Permit PreemptType=qos and PreemptMode=suspend,gang to be used
   together. A high-priority QOS job will now oversubscribe resources and
   gang schedule, but only if there are insufficient resources for the
   job to be started without preemption. NOTE: With PreemptType=qos, the
   partition's Shared=FORCE:# configuration option will permit one more
   job per resource to be run than specified, but only if started by
   preemption.
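An illustrative slurm.conf combination that is now permitted (the full
plugin-style spelling of PreemptType is shown):

```conf
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
```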
-- Remove the CR_ALLOCATE_FULL_SOCKET configuration option. It is now the
default.
-- Fix a race condition in PMI2 when fencing counters can be out of sync.
-- Increase the MAX_PACK_MEM_LEN define to avoid PMI2 failure when fencing
with large amount of ranks.
-- Add a QOS option to a partition. This allows a partition to have all
   the limits a QOS has. If a limit is set in both QOSes, the partition
   QOS will override the job's QOS unless the job's QOS has the
   PartitionQOS flag set.
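Attaching a QOS to a partition might look like this in slurm.conf; the
partition, node, and QOS names are invented for illustration:

```conf
PartitionName=batch Nodes=tux[1-8] QOS=part_qos
```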
-- The task_dist_states variable has been split into "flags" and "base"
   components. Added SLURM_DIST_PACK_NODES and SLURM_DIST_NO_PACK_NODES
   values to give users greater control over task distribution. The srun
   --dist option has been modified to accept "Pack" and "NoPack" options.
   These options can be used to override the CR_PACK_NODE configuration
   option.
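Illustrative srun invocations; the exact --dist value syntax should be
checked against the srun documentation for this release:

```shell
srun --dist=block,Pack -n 8 ./app     # pack tasks onto as few nodes as possible
srun --dist=block,NoPack -n 8 ./app   # avoid packing, overriding CR_PACK_NODE
```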