We are pleased to announce the release of versions 17.02.1 and 16.05.10.
Version 17.02.1 contains 19 bug fixes discovered over the past week, including
a deadlock in the slurmctld daemon. Version 16.05.10 contains 30
relatively minor bug fixes discovered over the past 5 weeks. Future
changes to version 16.05 will be limited to more significant bugs, as
our focus shifts to version 17.02.
Both versions can be downloaded from here:
https://www.schedmd.com/downloads.php
* Changes in Slurm 17.02.1
==========================
-- Modify pam module to work when configured NodeName and
NodeHostname differ.
-- Update sbatch/srun man pages to explain the "filename pattern" more clearly.
-- Add %x to sbatch/srun filename pattern to represent the job name.
-- job_submit/lua - Add job "bitflags" field.
-- Update slurm.spec file to note obsolete RPMs.
-- Fix deadlock scenario when dumping configuration in the slurmctld.
-- Remove unneeded job lock when running assoc_mgr cache. This lock could
cause potential deadlock when/if TRES changed in the database and the
slurmctld wasn't made aware of the change. This would be very rare.
-- Fix missing locks in gres logic to avoid potential memory race.
-- If gres is NULL on a job don't try to process it when returning detailed
information about a job to scontrol.
-- Fix print of consumed energy in sstat when no energy is being collected.
-- Print formatted tres string when creating/updating a reservation.
-- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
-- Prevent manipulation of the cpu frequency and governor for batch or
extern steps. This addresses an issue where the batch step would
inadvertently set the cpu frequency maximum to the minimum value
supported on the node.
-- Convert a slurmctld power management data structure from array to list in
order to eliminate the possibility of zombie child suspend/resume
processes.
-- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
fails. Now job will be held. Update job update time when held.
-- Refactor slurmctld agent logic to eliminate some pthreads.
-- Added "SyscfgTimeout" parameter to knl.conf configuration file.
-- Fix for CPU binding for job steps run under a batch job.
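
As an illustration of the new %x filename pattern, here is a minimal Python
sketch of how a few sbatch/srun specifiers expand. It is not Slurm's
implementation: the function name expand_pattern and its arguments are
hypothetical, only %j (job ID), %x (job name), %u (user name) and %% (literal
percent) are handled, and padding forms such as %4j are ignored.

```python
import re


def expand_pattern(pattern, job_id, job_name, user):
    """Expand a small subset of sbatch/srun filename specifiers.

    Hypothetical helper for illustration only; Slurm does this in C.
    """
    values = {
        "j": str(job_id),  # %j -> job ID
        "x": job_name,     # %x -> job name (new in 17.02.1)
        "u": user,         # %u -> user name
        "%": "%",          # %% -> literal percent sign
    }

    def replace(match):
        # Leave specifiers this sketch does not know about untouched.
        return values.get(match.group(1), match.group(0))

    return re.sub(r"%(.)", replace, pattern)


if __name__ == "__main__":
    print(expand_pattern("%x-%j.out", 1234, "align", "alice"))
```

In a batch script the same pattern would be written as
"#SBATCH --output=%x-%j.out", so each job's output file carries its name.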
* Changes in Slurm 16.05.10
===========================
-- Record job state as PREEMPTED instead of TIMEOUT when GraceTime
is reached.
-- task/cgroup - print warnings to stderr when --cpu_bind=verbose is enabled
and the requested processor affinity cannot be set.
-- power/cray - Disable power cap get and set operations on DOWN nodes.
-- Jobs preempted with PreemptMode=REQUEUE were incorrectly recorded as
REQUEUED in the accounting.
-- PMIX - Use volatile specifier to avoid flag caching and lock the flag to
make sure it is protected.
-- PMIX/PMI2 - Make it possible to use %n or %h in a spool dir.
-- burst_buffer/cray - Support default pool which is not the first pool
reported by DataWarp and log in Slurm when pools are added or removed
from DataWarp.
-- Ensure job does not start running before PrologSlurmctld is complete and
node is booted (all nodes for interactive job, at least first node for batch
job without burst buffers).
-- Fix minor memory leak in the slurmctld when removing a QOS.
-- burst_buffer/cray - Do not execute "pre_run" operation until after all nodes
are booted and ready for use.
-- scontrol - return an error when attempting to use the +=/-= syntax to
update a field where this is not appropriate.
-- Fix task/affinity to work correctly with --ntasks-per-socket.
-- Honor --ntasks-per-node and --ntasks option when used with job constraints
that contain node counts.
-- Prevent deadlocked slurmstepd processes due to unsafe use of regcomp with
older glibc versions.
-- Fix squeue when SLURM_BITSTR_LEN=0 is set in the user environment.
-- Fix comments in acct_policy.c to reflect actual variables instead of
old ones.
-- Use correct variables when validating GrpTresMins on a QOS.
-- Better debug output when a job is being held because of a GrpTRES[Run]Min
limit.
-- Set correct state reason when job can't run 'safely' because of an
association GrpWall limit.
-- Squeue always loads new data if user_id option is specified.
-- Fix for possible job ID parsing failure and abort.
-- If node boot in progress when slurmctld daemon is restarted, then allow
sufficient time for reboot to complete and not prematurely DOWN the node as
"Not responding".
-- For job resize, correct logic to build "resize" script with new values.
Previously the scripts were based upon the original job size.
-- Fix squeue to not limit the size of partition, burst_buffer, exec_host, or
reason to 32 chars.
-- Fix potential packing error when packing a NULL slurmdb_clus_res_rec_t.
-- Fix potential packing errors when packing a NULL
slurmdb_reservation_cond_t.
-- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
fails. Now job will be held. Update job update time when held.
-- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
-- Increase number of ResumePrograms that can be managed without leaving
zombie/orphan processes from 10 to 100.
-- Refactor slurmctld agent logic to eliminate some pthreads.
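
The ResumeProgram entry above concerns zombie/orphan child processes. As a
rough illustration of the general technique (not Slurm's C code), the Python
sketch below tracks child processes in a growable list and reaps each one once
it exits. The names launch and reap_finished are hypothetical, and the sketch
assumes that polling is an acceptable way to collect exit statuses.

```python
import subprocess
import sys
import time


def launch(children, argv):
    """Start a child process and remember it so it can be reaped later."""
    children.append(subprocess.Popen(argv))


def reap_finished(children):
    """Drop finished children from the list and return how many were reaped.

    Popen.poll() collects the exit status of a finished child, so it does
    not linger as a zombie.
    """
    still_running = [child for child in children if child.poll() is None]
    reaped = len(children) - len(still_running)
    children[:] = still_running
    return reaped


if __name__ == "__main__":
    children = []
    for _ in range(3):
        launch(children, [sys.executable, "-c", "pass"])
    while children:
        reap_finished(children)
        time.sleep(0.05)
```

Because the list grows as needed, the number of children that can be tracked
is not capped by a fixed-size table.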