[slurm-dev] slurm versions 17.02.1 and 16.05.10 released

jette Thu, 02 Mar 2017 14:44:49 -0800


We are pleased to announce the release of versions 17.02.1 and 16.05.10.
Version 17.02.1 contains 19 bug fixes discovered over the past week, including
a deadlock in the slurmctld daemon. Version 16.05.10 contains 30 relatively
minor bug fixes discovered over the past 5 weeks. Future changes to version
16.05 will be limited to more significant bugs, with our focus shifting to
version 17.02.


Both versions can be downloaded from here:
https://www.schedmd.com/downloads.php

* Changes in Slurm 17.02.1
==========================

 -- Modify pam module to work when configured NodeName and NodeHostname
    differ.
 -- Update sbatch/srun man pages to explain the "filename pattern" more
    clearly.
 -- Add %x to sbatch/srun filename pattern to represent the job name.
 -- job_submit/lua - Add job "bitflags" field.
 -- Update slurm.spec file to note obsolete RPMs.
 -- Fix deadlock scenario when dumping configuration in the slurmctld.
 -- Remove unneeded job lock when running assoc_mgr cache.  This lock could
    cause potential deadlock when/if TRES changed in the database and the
    slurmctld wasn't made aware of the change.  This would be very rare.
 -- Fix missing locks in gres logic to avoid potential memory race.
 -- If gres is NULL on a job don't try to process it when returning detailed
    information about a job to scontrol.
 -- Fix print of consumed energy in sstat when no energy is being collected.
 -- Print formatted tres string when creating/updating a reservation.
 -- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
 -- Prevent manipulation of the cpu frequency and governor for batch or
    extern steps. This addresses an issue where the batch step would
    inadvertently set the cpu frequency maximum to the minimum value
    supported on the node.
 -- Convert a slurmctld power management data structure from array to list in
    order to eliminate the possibility of zombie child suspend/resume
    processes.
 -- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
    fails. Now job will be held. Update job update time when held.
 -- Refactor slurmctld agent logic to eliminate some pthreads.
 -- Added "SyscfgTimeout" parameter to knl.conf configuration file.
 -- Fix for CPU binding for job steps run under a batch job.
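
As an illustration of the new %x specifier listed above, the following is a
minimal Python sketch of how a filename pattern might be expanded. It is not
Slurm's implementation: the function name expand_filename_pattern is
hypothetical, and only a small subset of specifiers (%j job ID, %x job name,
%u user name, %% literal percent) is handled. Zero-padding and the other
specifiers documented in the sbatch/srun man pages are deliberately omitted.

```python
import re


def expand_filename_pattern(pattern, job_id, job_name, user):
    """Expand a small subset of sbatch/srun filename-pattern specifiers.

    Hypothetical helper for illustration only; Slurm's own expansion
    supports more specifiers and zero-padding widths.
    """
    values = {
        "j": str(job_id),  # %j -> job ID
        "x": job_name,     # %x -> job name (added in 17.02.1)
        "u": user,         # %u -> user name
        "%": "%",          # %% -> literal percent sign
    }
    # Replace each "%<char>" pair; unknown specifiers are left untouched.
    return re.sub(
        r"%(.)",
        lambda m: values.get(m.group(1), m.group(0)),
        pattern,
    )


print(expand_filename_pattern("%x-%j.out", 1234, "align_reads", "alice"))
```

With this sketch, a pattern such as "%x-%j.out" (as might be passed to
sbatch --output) would yield a file name built from the job name followed by
the job ID.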

* Changes in Slurm 16.05.10
===========================

 -- Record job state as PREEMPTED instead of TIMEOUT when GraceTime is
    reached.
 -- task/cgroup - print warnings to stderr when --cpu_bind=verbose is enabled
    and the requested processor affinity cannot be set.
 -- power/cray - Disable power cap get and set operations on DOWN nodes.
 -- Jobs preempted with PreemptMode=REQUEUE were incorrectly recorded as
    REQUEUED in the accounting.
 -- PMIX - Use volatile specifier to avoid flag caching and lock the flag to
    make sure it is protected.
 -- PMIX/PMI2 - Make it possible to use %n or %h in a spool dir.
 -- burst_buffer/cray - Support default pool which is not the first pool
    reported by DataWarp and log in Slurm when pools are added to or removed
    from DataWarp.
 -- Ensure job does not start running before PrologSlurmctld is complete and
    node is booted (all nodes for interactive job, at least first node for
    batch job without burst buffers).
 -- Fix minor memory leak in the slurmctld when removing a QOS.
 -- burst_buffer/cray - Do not execute "pre_run" operation until after all
    nodes are booted and ready for use.
 -- scontrol - return an error when attempting to use the +=/-= syntax to
    update a field where this is not appropriate.
 -- Fix task/affinity to work correctly with --ntasks-per-socket.
 -- Honor --ntasks-per-node and --ntasks option when used with job constraints
    that contain node counts.
 -- Prevent deadlocked slurmstepd processes due to unsafe use of regcomp with
    older glibc versions.
 -- Fix squeue when SLURM_BITSTR_LEN=0 is set in the user environment.
 -- Fix comments in acct_policy.c to reflect actual variables instead of
    old ones.
 -- Use correct variables when validating GrpTresMins on a QOS.
 -- Better debug output when a job is being held because of a GrpTRES[Run]Min
    limit.
 -- Set correct state reason when job can't run 'safely' because of an
    association GrpWall limit.
 -- squeue always loads new data if user_id option specified.
 -- Fix for possible job ID parsing failure and abort.
 -- If node boot in progress when slurmctld daemon is restarted, then allow
    sufficient time for reboot to complete and not prematurely DOWN the node
    as "Not responding".
 -- For job resize, correct logic to build "resize" script with new values.
    Previously the scripts were based upon the original job size.
 -- Fix squeue to not limit the size of partition, burst_buffer, exec_host, or
    reason to 32 chars.
 -- Fix potential packing error when packing a NULL slurmdb_clus_res_rec_t.
 -- Fix potential packing errors when packing a NULL
    slurmdb_reservation_cond_t.
 -- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
    fails. Now job will be held. Update job update time when held.
 -- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
 -- Increase number of ResumePrograms that can be managed without leaving
    zombie/orphan processes from 10 to 100.
 -- Refactor slurmctld agent logic to eliminate some pthreads.
