Slurm version 14.11.7 is now available with quite a few bug fixes as
listed below.
A development tag for 15.08 (pre5) has also been made. It represents
the current state of Slurm development for the release planned in August
2015 and is intended for development and test purposes only. One
notable enhancement is Trackable Resources (TRES), which provide
accounting for CPU, memory, energy, GRES, licenses, etc.
Both are available for download at
http://slurm.schedmd.com/download.html
Notable changes in these versions are listed below.
* Changes in Slurm 14.11.7
==========================
-- Initialize some variables used with the srun --no-alloc option. Left
uninitialized, they could cause random failures.
-- Add SchedulerParameters option of sched_min_interval that controls the
minimum time interval between any job scheduling action. The default value
is zero (disabled).
-- Change default SchedulerParameters=max_sched_time from 4 seconds to 2.
-- Refactor scancel so that all pending jobs are cancelled before starting
cancellation of running jobs. Otherwise they happen in parallel and the
pending jobs can be scheduled on resources as the running jobs are being
cancelled.
-- ALPS - Add new cray.conf variable NoAPIDSignalOnKill. When set to yes, the
slurmctld will not signal the apids in a batch job. Instead it relies on
the RPC coming from the slurmctld to kill the job to end things correctly.
-- ALPS - Have the slurmstepd running a batch job wait for an ALPS release
before ending the job.
-- Initialize variables in consumable resource plugin to prevent core
dump.
-- Fix scancel bug which could return an error on attempt to signal a
job step.
-- In slurmctld communication agent, make the thread timeout be the
configured value of MessageTimeout rather than 30 seconds.
-- sshare - Fix the -U/--Users (users only) flag being used uninitialized.
-- Cray systems, add "plugstack.conf.template" sample SPANK
configuration file.
-- BLUEGENE - Set DB2NOEXITLIST when starting the slurmctld daemon to avoid
random crashing in db2 when the slurmctld is exiting.
-- Make full node reservations correctly display the core count instead of
the cpu count.
-- Preserve original errno on execve() failure in task plugin.
-- Add SLURM_JOB_NAME env variable to an salloc's environment.
-- Overwrite SLURM_JOB_NAME in an srun when it gets an allocation.
-- Make sure each job has a wckey if that is something that is tracked.
-- Make sure old step data is cleared when job is requeued.
-- Load libtinfo as needed when building ncurses tools.
-- Fix small memory leak in backup controller.
-- Fix segfault when backup controller takes control for second time.
-- Cray - Fix backup controller running native Slurm.
-- Provide prototypes for init_setproctitle()/fini_setproctitle() on NetBSD.
-- Add configuration test to find out the full path to su command.
-- preempt/job_prio plugin: Fix for possible infinite loop when identifying
preemptable jobs.
-- preempt/job_prio plugin: Implement the concept of Warm-up Time here. Use
the QoS GraceTime as the amount of time to wait before preempting.
Basically, skip preemption if your time is not up.
-- Make srun wait KillWait time when a task is cancelled.
-- switch/cray: Revert logic added to 14.11.6 that set "PMI_CRAY_NO_SMP_ENV=1"
if CR_PACK_NODES is configured.
-- Prevent users from setting job's partition to an invalid partition.
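
As an illustration of the new 14.11.7 configuration options listed above, a
minimal sketch might look like the following. The values are arbitrary
assumptions chosen for the example, not recommendations, and the unit noted
for sched_min_interval is an assumption that should be checked against the
slurm.conf man page for your version.

```
# slurm.conf (illustrative values only)
# sched_min_interval: minimum interval between job scheduling actions
#   (assumed here to be in microseconds; 0, the default, disables it)
# max_sched_time: shown at its new default of 2 seconds (previously 4)
SchedulerParameters=sched_min_interval=1000000,max_sched_time=2
# The slurmctld communication agent thread timeout now follows this value
MessageTimeout=10

# cray.conf (ALPS systems only)
NoAPIDSignalOnKill=yes
```
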
* Changes in Slurm 15.08.0pre5
==============================
-- Add jobcomp/elasticsearch plugin. Libcurl is required for build. Configure
the server as follows: "JobCompLoc=http://YOUR_ELASTICSEARCH_SERVER:9200".
-- Scancel logic largely rewritten to better support job arrays.
-- Added a slurm.conf parameter PrologEpilogTimeout to control how long
prolog/epilog can run.
-- Added TRES (Trackable Resources) to track Mem, GRES, license, etc.
utilization.
-- Add re-entrant versions of glibc time functions (e.g. localtime) to Slurm
in order to eliminate rare deadlock of slurmstepd fork and exec calls.
-- Constrain kernel memory (if available) in cgroups.
-- Add PrologFlags option of "Contain" to create a proctrack container at
job resource allocation time.
-- Disable the OOM Killer in slurmd and slurmstepd's memory cgroup when using
MemSpecLimit.
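
For anyone testing the 15.08.0pre5 tag, the new options listed above might be
combined roughly as follows. This is only a sketch: the JobCompType line
assumes the plugin is selected in the usual way for job completion plugins,
the server name is the placeholder from the entry above, and the timeout is
an arbitrary example value whose unit is assumed to be seconds.

```
# slurm.conf (illustrative values only)
JobCompType=jobcomp/elasticsearch
JobCompLoc=http://YOUR_ELASTICSEARCH_SERVER:9200
# Assumed to be in seconds; 120 is an arbitrary example value
PrologEpilogTimeout=120
# Create a proctrack container at job resource allocation time
PrologFlags=Contain
```
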