Slurm version 15.08.9 is now available and includes about 40 bug fixes developed over the past six weeks as listed below.

Slurm version 16.05.0-pre2 is also available and includes new development for the next major release in May.

Slurm downloads are available from:
http://www.schedmd.com/#repos

* Changes in Slurm 15.08.9
==========================
 -- BurstBuffer/cray - Defer job cancellation or time limit while "pre-run"
    operation in progress to avoid inconsistent state due to multiple calls
    to job termination functions.
 -- Fix issue with resizing jobs where limits were not tracked correctly.
 -- BGQ - Remove redeclaration of job_read_lock.
-- BGQ - Tighter locks around structures when nodes/cables change state.
 -- Make it possible to change CPUsPerTask with scontrol.
-- Make it so scontrol update part qos= will take away a partition QOS from
    a partition.
-- Fix issue where SocketsPerBoard didn't translate to Sockets when CPUS=
    was also given.
-- Add note to slurm.conf man page about setting "--cpu_bind=no" as part
    of SallocDefaultCommand if a TaskPlugin is in use.
 -- Set correct reason when a QOS' MaxTresMins is violated.
 -- Ensure that a job is completely launched before trying to suspend it.
 -- Remove historical presentations and design notes. Only distribute
    maintained doc/html and doc/man directories.
 -- Remove duplicate xmalloc() in task/cgroup plugin.
-- Backfill scheduler to validate correct job partition for job submitted to
    multiple partitions.
 -- Force close on exec on first 256 file descriptors when launching a
    slurmstepd to close potential open ones.
-- Step GRES value changed from type "int" to "int64_t" to support larger
    values.
 -- Fix getting reservations to database when database is down.
 -- Fix issue with sbcast not doing a correct fanout.
 -- Fix issue where steps weren't always getting the gres/tres involved.
 -- Fixed double read lock on getting job's gres/tres.
 -- Fix display for RoutePlugin parameter to display the correct value.
 -- Fix route/topology plugin to prevent segfault in sbcast when in use.
 -- Fix Cray slurmconfgen_smw.py script to use nid as nid, not nic.
-- Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
    allocated to a requeued job as non-usable on job termination.
 -- burst_buffer/cray plugin: Prevent a requeued job from being restarted while
    file stage-out is still in progress. Previous logic could restart the job
    and not perform a new stage-in.
 -- Fix job array formatting to display arrays with step functions as
    [0-100:2] rather than [0,2,4,6,8,...].
 -- FreeBSD - replace Linux-specific set_oom_adj to avoid errors in slurmd log.
 -- Add option for TopologyParam=NoInAddrAnyCtld to make the slurmctld listen
    on only one port like TopologyParam=NoInAddrAny does for everything else.
 -- Fix burst buffer plugin to prevent corruption of the CPU TRES data when bb
    is not set as an AccountingStorageTRES type.
 -- Suppress error messages in acct_gather_energy/ipmi plugin after repeated
    failures.
 -- Change burst buffer use completion email message from
    "SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to
    "SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"
 -- Generate burst buffer use completion email immediately after teardown
    completes rather than at job purge time (likely minutes later).
-- Fix issue when adding a new TRES to AccountingStorageTRES for the first
    time.
-- Update gang scheduling tables when job manually suspended or resumed. Prior
    logic could mess up job suspend/resume sequencing.
 -- Update gang scheduling data structures when job changes in size.
-- Associations - prevent hash table corruption if uid initially unset for
    a user, which can cause slurmctld to crash if that user is deleted.
-- Avoid possibly aborting srun on SIGSTOP while creating the job step due to
    threading bug.
 -- Fix deadlock issue with burst_buffer/cray when a newly created burst
    buffer is found.
-- burst_buffer/cray: Set environment variables just before starting job rather
    than at job submission time to reflect persistent buffers created or
    modified while the job is pending.
 -- Fix check of per-user qos limits on the initial run by a user.
 -- Fix gang scheduling resource selection bug which could prevent multiple
    jobs from being allocated the same resources. Bug was introduced in
    15.08.6.
 -- Don't print the Rgt value of an association from the cache as it isn't
    kept up to date.
 -- burst_buffer/cray - If the pre-run operation fails then don't issue
    duplicate job cancel/requeue unless the job is still in run state. Prevents
    jobs hung in COMPLETING state.
 -- task/cgroup - Fix bug in task binding to CPUs.
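
The job array formatting fix listed above (the [0-100:2] display) can be
illustrated with a minimal Python sketch. This is not Slurm's implementation
(which is written in C); the function name format_task_ids is hypothetical,
and the sketch only handles the case where the whole array shares one step.

```python
def format_task_ids(ids):
    """Render task IDs as a compact bracketed expression.

    Hypothetical helper for illustration only: if all IDs are evenly
    spaced, emit [first-last] or [first-last:step]; otherwise fall back
    to a plain comma-separated list.
    """
    ids = sorted(set(ids))
    if len(ids) >= 3:
        step = ids[1] - ids[0]
        # Every consecutive pair must be separated by the same step.
        if all(b - a == step for a, b in zip(ids, ids[1:])):
            if step == 1:
                return f"[{ids[0]}-{ids[-1]}]"
            return f"[{ids[0]}-{ids[-1]}:{step}]"
    return "[" + ",".join(str(i) for i in ids) + "]"


print(format_task_ids(range(0, 101, 2)))  # [0-100:2]
```

Irregular arrays such as 0,2,5 fall through to the comma-separated form.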

* Changes in Slurm 16.05.0pre2
==============================
 -- Split partition's "Priority" field into "PriorityTier" (used to order
    partitions for scheduling and preemption) plus "PriorityJobFactor" (used by
    priority/multifactor plugin in calculating job priority, which is used to
    order jobs within a partition for scheduling).
 -- Revert call to getaddrinfo (introduced in Slurm 16.05.0pre1), which was
    failing on some systems, restoring gethostbyaddr.
 -- knl_cray.conf - Added AllowMCDRAM, AllowNUMA and AllowUserBoot
    configuration options.
 -- Add node_features_p_user_update() function to node_features plugin.
 -- Don't print Weight=1 lines in 'scontrol write config' (it's the default).
 -- Remove PARAMS macro from slurm.h.
 -- Remove BEGIN_C_DECLS and END_C_DECLS macros.
-- Check that PowerSave mode configured for node_features/knl_cray plugin.
    It is required to reconfigure and reboot nodes.
-- Update documentation to reflect new cgroup default location change from
    /cgroup to /sys/fs/cgroup.
 -- If NodeHealthCheckProgram is configured and HealthCheckInterval is
    non-zero, then modify slurmd to run it before registering with slurmctld.
 -- Fix for tasks being packed onto cores when the requested --cpus-per-task is
    greater than the number of threads on a core and --ntasks-per-core is 1.
 -- Make it so jobs/steps track ':' named gres/tres. Beforehand gres/gpu:tesla
    would only track gres/gpu; now it will track both gres/gpu and
    gres/gpu:tesla as separate gres if configured like
    AccountingStorageTRES=gres/gpu,gres/gpu:tesla
 -- Added new job dependency type of "aftercorr" which will start a task of a
    job array after the corresponding task of another job array completes.
 -- Increase default MaxTasksPerNode configuration parameter from 128 to 512.
 -- Enable sbcast data compression logic (compress option previously ignored).
 -- Add --compress option to srun command for use with --bcast option.
-- Add TCPTimeout option to slurm[dbd].conf. Decouples MessageTimeout from TCP
    connections.
 -- Don't call primary controller for every RPC when backup is in control.
 -- Add --gres-flags=enforce-binding option to salloc, sbatch and srun commands.
    If set, the only CPUs available to the job will be those bound to the
    selected GRES (i.e. the CPUs identified in the gres.conf file will be
    strictly enforced rather than advisory).
-- Change how a node's allocated CPU count is calculated to avoid double
    counting CPUs allocated to multiple jobs at the same time.
 -- Added SchedulerParameters option of "bf_min_prio_reserve". Jobs below
    the specified threshold will not have resources reserved for them.
-- Added "sacctmgr show lostjobs" to report any orphaned jobs in the database.
 -- When a stepd is about to shut down and send its response to srun,
    make the wait to return data only hit after 500 nodes and configurable
    based on the TCPTimeout value.
-- Add functionality to reset the lft and rgt values of the association table
    with the slurmdbd.
 -- Add SchedulerParameters option no_env_cache. If set, no env cache will be
    used when launching a job; instead the job will fail and drain the node if
    the env isn't loaded normally.
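
The PriorityTier / PriorityJobFactor split described at the top of this list
can be pictured with a small Python sketch. It only models the ordering stated
in the note (partitions ordered by tier, jobs ordered by priority within a
partition); the Job class, the schedule_order function, and the sample numbers
are all hypothetical, and the real scheduler considers far more than this.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    partition: str
    priority: int  # hypothetical, already-computed job priority


def schedule_order(jobs, priority_tier):
    """Order jobs by partition tier first, then by job priority.

    priority_tier maps a partition name to its PriorityTier value;
    higher values are considered first in this sketch.
    """
    return sorted(
        jobs,
        key=lambda j: (-priority_tier[j.partition], -j.priority),
    )


# Sample data (made up for illustration).
tiers = {"debug": 10, "batch": 1}
jobs = [Job(1, "debug", 50), Job(2, "batch", 900), Job(3, "debug", 70)]
print([j.job_id for j in schedule_order(jobs, tiers)])  # [3, 1, 2]
```

Job 2 has the highest job priority but sorts last because its partition sits
in a lower tier.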
