We are pleased to announce the immediate availability of Slurm 16.05.7. It contains about 40 relatively minor bug fixes.

Slurm downloads are available from https://www.schedmd.com/downloads.php. You may notice this is a change in location. The old address, https://www.schedmd.com/#repos, will still work for the time being, but it is a good idea to update your links sooner rather than later.

Changes are listed below or available as always in the NEWS file.

* Changes in Slurm 16.05.7
==========================
 -- Fix issue in the priority/multifactor plugin where, on a slurmctld restart,
    more time is accounted for than should be allowed.
 -- cray/burst_buffer - If total_space in a pool decreases, reset used_space
    rather than trying to account for buffer allocations in progress.
 -- cray/burst_buffer - Fix for double counting of used_space at slurmctld
    startup.
 -- Fix regression in 16.05.6 where requesting multiple CPUs per task (-c2)
    with --ntasks-per-core=1 and only 1 task on the node would cause slurmd
    to abort with an infinite loop fatal error.
 -- cray/burst_buffer - Internally track both allocated and unusable space.
    The reported UsedSpace in a pool is now the allocated space (previously was
    unusable space). Base available space on whichever value leaves least free
    space.
 -- cray/burst_buffer - Preserve job ID and don't translate to job array ID.
 -- cray/burst_buffer - Update "instance" parsing to match updated dw_wlm_cli
    output.
 -- sched/backfill - Ensure we don't try to start a job that was already started
    and requeued by the main scheduling logic.
 -- job_submit/lua - add access to the job features field in job_record.
 -- select/linear plugin modified to better support heterogeneous clusters when
    topology/none is also configured.
 -- Permit cancellation of jobs in configuring state.
 -- acct_gather_energy/rapl - prevent segfault in slurmd from race to gather
    data at slurmd startup.
 -- Integrate node_feature/knl_generic with "hbm" GRES information.
 -- Fix output routines to prevent rounding the TRES values for memory or BB.
 -- switch/cray plugin - fix use after free error.
 -- docs - elaborate on how to clear TRES limits in sacctmgr.
 -- knl_cray plugin - Avoid abort from backup slurmctld at start time.
 -- cgroup plugins - fix two minor memory leaks.
 -- If a node is booting for some job, don't allocate additional jobs to the
    node until the boot completes.
 -- testsuite - fix job id output in test17.39.
 -- Modify backfill algorithm to improve performance with large numbers of
    running jobs. Group running jobs that end in a "similar" time frame using
    a time window that grows exponentially rather than linearly. After one
    second of wall time, simulate the termination of all remaining running
    jobs in order to respond in a reasonable time frame.
 -- Fix slurm_job_cpus_allocated_str_on_node_id() API call.
 -- sched/backfill plugin: Make malloc match data type (defined as uint32_t and
    allocated as int).
 -- srun - prevent segfault when terminating job step before step has launched.
 -- sacctmgr - prevent segfault when trying to reset usage for an invalid
    account name.
 -- Make the openssl crypto plugin compile with openssl >= 1.1.
 -- Fix SuspendExcNodes and SuspendExcParts on slurmctld reconfiguration.
 -- sbcast - prevent segfault in slurmd due to race condition between file
    transfers from separate jobs using zlib compression.
 -- cray/burst_buffer - Increase time to synchronize operations between threads
    from 5 to 60 seconds ("setup" operation time observed over 17 seconds).
 -- node_features/knl_cray - Fix possible race condition when changing node
    state that could result in old KNL mode as an active feature.
 -- If a job can't run because of resources, also check accounting limits
    after node selection. If the job would violate those limits, change its
    reason for waiting so we don't reserve resources for jobs violating
    accounting limits.
 -- NRT - Make it so a system running against IBM's PE will work with PE
    version 1.3.
 -- NRT - Make it so protocols pgas and test are allowed to be used.
 -- NRT - Make it so you can have more than 1 protocol listed in MP_MSG_API.
 -- cray/burst_buffer - If slurmctld daemon restarts with pending job and burst
    buffer having unknown file stage-in status, teardown the buffer, defer the
    job, and start stage-in over again.
 -- On state restore in the slurmctld don't overwrite the mem_spec_limit given
    from the slurm.conf when using FastSchedule=0.
 -- Recognize a KNL's proper NUMA count (rather than setting it to the value
    in slurm.conf) when using FastSchedule=0.
 -- Fix parsing in regression test1.92 for some prompts.
 -- sbcast - use slurmd's gid cache rather than a separate lookup.
 -- slurmd - return error if setgroups() call fails in _drop_privileges().
 -- Remove error messages about gres counts changing when a job is resized on
    a slurmctld restart or reconfig, as they aren't really error messages.
 -- Fix possible memory corruption if a job is using GRES and changing size.
 -- jobcomp/elasticsearch - fix printf format for a value on 32-bit builds.
 -- task/cgroup - Change error message if CPU binding can not take place to
    better identify the root cause of the problem.
 -- Fix issue where task/cgroup would not always honor --cpu_bind=threads.
 -- Fix race condition with getgrouplist() in slurmd that can lead to
    user accounts being granted access to incorrect group memberships during
    job launch.
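
To illustrate the backfill entry above ("a time window that grows exponentially rather than linearly"), here is a minimal Python sketch of the grouping idea. It is not Slurm's implementation: the function name group_end_times, the 60-second first window, and the doubling growth factor are all assumptions chosen for illustration, since the entry does not state the actual window sizes.

```python
from typing import List, Tuple


def group_end_times(end_offsets: List[int],
                    first_window: int = 60,   # assumed width of the first window, in seconds
                    growth: int = 2) -> List[Tuple[int, int]]:
    """Bucket job end times into windows whose width grows geometrically.

    end_offsets: expected end of each running job, in seconds from now.
    Returns (window_end, job_count) for every non-empty window, so that
    all jobs in one window can be treated as ending together.
    """
    groups: List[Tuple[int, int]] = []
    width = first_window
    window_end = first_window
    count = 0
    for t in sorted(end_offsets):
        # Advance to the window containing t, widening each new window.
        while t > window_end:
            if count:
                groups.append((window_end, count))
                count = 0
            width *= growth
            window_end += width
        count += 1
    if count:
        groups.append((window_end, count))
    return groups


# Windows end at 60, 180, 420, 900, 1860, 3780 seconds.
print(group_end_times([10, 50, 70, 100, 200, 500, 3000]))
# → [(60, 2), (180, 2), (420, 1), (900, 1), (3780, 1)]
```

Setting growth=1 in this sketch gives fixed-width (linear) windows, which need many more windows to reach jobs that end far in the future.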
