SLURM versions 2.3.1 and 2.4.0-pre1 are now available from
http://www.schedmd.com/#repos. Version 2.3.1 contains many bug fixes
from version 2.3.0 as described below. Version 2.4.0-pre1 contains new
development work and is intended for developer use and IBM BlueGene/Q
systems rather than use on production machines.
* Changes in SLURM 2.3.1
========================
-- Do not remove the backup slurmctld's pid file when it assumes control, only
when it actually shuts down. Patch from Andriy Grytsenko (Massive Solutions
Limited).
-- Avoid clearing a job's reason from JobHeldAdmin or JobHeldUser when it is
otherwise updated using scontrol or sview commands. Patch based upon work
by Phil Eckert (LLNL).
-- BLUEGENE - Fix for changing the defined blocks in the bluegene.conf while
jobs happen to be running on blocks not in the new config.
-- Many cosmetic modifications to eliminate warning messages from GCC version
4.6 compiler.
-- Fix for sview reservation tab when finding correct reservation.
-- Fix for handling QOS limits per user on a reconfig of the slurmctld.
-- Do not treat the absence of a gres.conf file as a fatal error on systems
configured with GRES, but set GRES counts to zero.
-- BLUEGENE - Correctly update the state in the reason of a block if an
admin sets the state to error.
-- BLUEGENE - handle reason of blocks in error more correctly between
restarts of the slurmctld.
-- BLUEGENE - Fix minor potential memory leak when setting block
error reason.
-- BLUEGENE - Fix so that, when running in Static/Overlap mode with the full
system block in an error state, jobs are not denied.
-- Fix for accounting where your cluster isn't numbered in counting order
(i.e. 1-9,0 instead of 0-9). The bug would cause 'sacct -N nodename' to
not give correct results on these systems.
-- Fix to GRES allocation logic when resources are associated with specific
CPUs on a node. Patch from Steve Trofinoff, CSCS.
-- Fix bugs in sched/backfill with respect to QOS reservation support and job
time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputing
Center).
-- BGQ - fix to set up corner correctly for sub block jobs.
-- Major re-write of the CPU Management User and Administrator Guide (web
page) by Martin Perry, Bull.
-- BLUEGENE - When removing blocks that once existed from the system, cleanup
of the old blocks now happens correctly.
-- Prevent slurmctld crashing with configuration of MaxMemPerCPU=0.
-- Prevent job hold by operator or account coordinator of his own job from
being an Administrator Hold rather than User Hold by default.
-- Cray - Fix for srun.pl parsing to avoid adding spaces between option and
argument (e.g. "-N2" parsed properly without changing to "-N 2").
-- Major updates to cgroup support by Mark Grondona (LLNL) and Matthieu
Hautreux (CEA) and Sam Lang. Fixes timing problems with respect to the
task_epilog. Allows cgroup mount point to be configurable. Added new
configuration parameters MaxRAMPercent and MaxSwapPercent. Allow cgroup
configuration parameters that are percentages to be floating point.
-- Fixed issue where sview wasn't displaying correct nice value for jobs.
-- Fixed issue where sview wasn't displaying correct min memory per node/cpu
value for jobs.
-- Disable some SelectTypeParameters for select/linear that aren't
compatible.
-- Move slurm_select_init to proper place to avoid loading multiple select
plugins in the slurmd.
-- BGQ - Include runjob_plugin.so in the bluegene rpm.
-- Report correct job "Reason" if needed nodes are DOWN, DRAINED, or
NOT_RESPONDING, "Resources" rather than "PartitionNodeLimit".
-- BLUEGENE - Fixed issues with running on a sub-midplane system.
-- Added some missing calls to allow older versions of SLURM to talk
to newer.
-- BGQ - allow steps to be run.
-- Do not attempt to run HealthCheckProgram on powered down nodes. Patch from
Ramiro Alba, Centre Tecnològic de Transferència de Calor, Spain.
* Changes in SLURM 2.4.0.pre1
=============================
-- BGQ - use the ba_geo_tables to figure out the blocks instead of the old
algorithm. This improves timing in the worst cases and simplifies the code
greatly.
-- BLUEGENE - Change to output tools labels from BP to Midplane
(i.e. BP List -> MidplaneList).
-- BLUEGENE - read MPs and BPs from the bluegene.conf
-- Modify srun's SIGINT handling logic timer (two SIGINTs within one second) to
be based on a microsecond rather than a second timer.
-- Modify advance reservation to accept multiple specific block sizes rather
than a single node count.
-- Permit administrator to change a job's QOS to any value without validating
the job's owner has permission to use that QOS. Based upon patch by Phil
Eckert (LLNL).
-- Add trigger flag for a permanent trigger. The trigger will NOT be purged
after an event occurs, but only when explicitly deleted.
-- Interpret a reservation with Nodes=ALL and a Partition specification as
reserving all nodes within the specified partition rather than all nodes
on the system. Based upon patch by Phil Eckert (LLNL).
-- Add the ability to reboot all compute nodes after they become idle. The
RebootProgram configuration parameter must be set and an authorized user
must execute the command "scontrol reboot_nodes". Patch from Andriy
Grytsenko (Massive Solutions Limited).
-- Modify slurmdbd.conf parsing to accept DebugLevel strings (quiet, fatal,
info, etc.) in addition to numeric values. The parsing of slurm.conf was
modified in the same fashion for SlurmctldDebug and SlurmdDebug values.
The output of sview and "scontrol show config" was also modified to report
those values as strings rather than numeric values.
-- Changed default value of StateSaveLocation configuration parameter from
/tmp to /var/spool.
-- Prevent an association from being deleted if it has any jobs in running,
pending or suspended state. Previous code prevented this only for running
jobs.
-- If a job can not run due to QOS or association limits, then do not cancel
the job, but leave it pending in a system held state (priority = 1). The
job will run when its limits or the QOS/association limits change. Based
upon a patch by Phil Eckert (LLNL).
-- BGQ - Added logic to keep track of cnodes in an error state inside of a
booted block.
-- Added the ability to update a node's NodeAddr and NodeHostName with
scontrol. Also enable setting a node's state to "future" using scontrol.
-- Add a node state flag of CLOUD and save/restore NodeAddr and NodeHostName
information for nodes with a flag of CLOUD.
-- Cray: Add support for job reservations with node IDs that are not in
numeric order. Fix for Bugzilla #5.
-- BGQ - Fix issue with smap -R
-- Fix association limit support for jobs queued for multiple partitions.
-- BLUEGENE - fix issue for sub-midplane systems to create a full system
block correctly.
-- BLUEGENE - Added option to the bluegene.conf to indicate that you are
running on a sub-midplane system.
-- Added the UserID of the user issuing the RPC to the job_submit/lua
functions.
-- Fixed issue where, if a job ended with ESLURMD_UID_NOT_FOUND and
ESLURMD_GID_NOT_FOUND, slurm would be a little overzealous in treating a
missing GID or UID as a fatal error.
-- If job time limit exceeds partition maximum, but job's minimum time limit
does not, set job's time limit to partition maximum at allocation time.
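
The time limit rule in the last item above can be sketched as follows. This is
an illustrative Python sketch only, not SLURM source code: the function name,
the use of minutes, and the None return value for a job whose minimum time
limit also exceeds the partition maximum are all assumptions made for the
example.

```python
from typing import Optional


def effective_time_limit(requested: int,
                         minimum: Optional[int],
                         partition_max: int) -> Optional[int]:
    """Pick the time limit (in minutes) to apply at allocation time.

    Hypothetical helper, not part of SLURM. Returns None when the job
    does not fit the partition under the rule described in the changelog.
    """
    # The requested limit already fits within the partition maximum.
    if requested <= partition_max:
        return requested
    # The requested limit is too large, but the job's minimum time limit
    # fits: use the partition maximum as the job's time limit.
    if minimum is not None and minimum <= partition_max:
        return partition_max
    # Neither the requested nor the minimum limit fits (assumed outcome).
    return None
```

For example, a job requesting 120 minutes with a minimum of 30 minutes in a
partition whose maximum is 60 minutes would be given a 60 minute limit.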