[slurm-dev] Slurm versions 15.08.9 and 16.05.0-pre2 now available

jette Tue, 29 Mar 2016 15:26:59 -0700

Slurm version 15.08.9 is now available and includes about 40 bug fixes developed over the past six weeks, as listed below.

Slurm version 16.05.0-pre2 is also available and includes new development for the next major release in May.


Slurm downloads are available from:
http://www.schedmd.com/#repos

* Changes in Slurm 15.08.9
==========================

 -- BurstBuffer/cray - Defer job cancellation or time limit while "pre-run"
    operation in progress to avoid inconsistent state due to multiple calls
    to job termination functions.

 -- Fix issue with resizing jobs and limits not being kept track of correctly.

 -- BGQ - Remove redeclaration of job_read_lock.

 -- BGQ - Tighter locks around structures when nodes/cables change state.

 -- Make it possible to change CPUsPerTask with scontrol.

 -- Make it so scontrol update part qos= will take away a partition QOS from
    a partition.

 -- Fix issue where SocketsPerBoard didn't translate to Sockets when CPUS=
    was also given.

 -- Add note to slurm.conf man page about setting "--cpu_bind=no" as part
    of SallocDefaultCommand if a TaskPlugin is in use.
 -- Set correct reason when a QOS' MaxTresMins is violated.

 -- Ensure that a job is completely launched before trying to suspend it.

 -- Remove historical presentations and design notes. Only distribute
    maintained doc/html and doc/man directories.
 -- Remove duplicate xmalloc() in task/cgroup plugin.

 -- Backfill scheduler to validate correct job partition for a job submitted to
    multiple partitions.
 -- Force close on exec on first 256 file descriptors when launching a
    slurmstepd to close potential open ones.

 -- Step GRES value changed from type "int" to "int64_t" to support larger
    values.
 -- Fix getting reservations to database when database is down.
 -- Fix issue with sbcast not doing a correct fanout.
 -- Fix issue where steps weren't always getting the gres/tres involved.
 -- Fixed double read lock on getting job's gres/tres.
 -- Fix display for RoutePlugin parameter to display the correct value.
 -- Fix route/topology plugin to prevent segfault in sbcast when in use.
 -- Fix Cray slurmconfgen_smw.py script to use nid as nid, not nic.

 -- Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
    allocated to a requeued job as non-usable on job termination.

 -- burst_buffer/cray plugin: Prevent a requeued job from being restarted while
    file stage-out is still in progress. Previous logic could restart the job
    and not perform a new stage-in.

 -- Fix job array formatting to return a [0-100:2] display for arrays with
    step functions rather than [0,2,4,6,8,...].

 -- FreeBSD - replace Linux-specific set_oom_adj to avoid errors in slurmd log.
 -- Add option for TopologyParam=NoInAddrAnyCtld to make the slurmctld listen
    on only one port like TopologyParam=NoInAddrAny does for everything else.
 -- Fix burst buffer plugin to prevent corruption of the CPU TRES data when bb
    is not set as an AccountingStorageTRES type.

 -- Suppress error messages in acct_gather_energy/ipmi plugin after repeated
    failures.
 -- Change burst buffer use completion email message from
    "SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to
    "SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"
 -- Generate burst buffer use completion email immediately after teardown
    completes rather than at job purge time (likely minutes later).

 -- Fix issue when adding a new TRES to AccountingStorageTRES for the first
    time.

 -- Update gang scheduling tables when job manually suspended or resumed. Prior
    logic could mess up job suspend/resume sequencing.
 -- Update gang scheduling data structures when job changes in size.

 -- Associations - prevent hash table corruption if uid initially unset for
    a user, which can cause slurmctld to crash if that user is deleted.

 -- Avoid possibly aborting srun on SIGSTOP while creating the job step due to
    threading bug.
 -- Fix deadlock issue with burst_buffer/cray when a newly created burst
    buffer is found.

 -- burst_buffer/cray: Set environment variables just before starting job rather
    than at job submission time to reflect persistent buffers created or
    modified while the job is pending.
 -- Fix check of per-user qos limits on the initial run by a user.

 -- Fix gang scheduling resource selection bug which could prevent multiple jobs
    from being allocated the same resources. Bug was introduced in 15.08.6.
 -- Don't print the Rgt value of an association from the cache as it isn't
    kept up to date.
 -- burst_buffer/cray - If the pre-run operation fails then don't issue
    duplicate job cancel/requeue unless the job is still in run state. Prevents
    jobs hung in COMPLETING state.
 -- task/cgroup - Fix bug in task binding to CPUs.
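
The job array formatting fix above (showing [0-100:2] instead of a long
comma-separated list) can be illustrated with a small sketch. This is not
Slurm's implementation; the function name format_task_ids is hypothetical and
the logic only assumes that a list of task IDs with a constant step should
collapse into a first-last:step expression.

```python
def format_task_ids(task_ids):
    """Format task IDs as a compact bracketed expression (illustrative only)."""
    ids = sorted(set(task_ids))
    if not ids:
        return "[]"
    if len(ids) == 1:
        return f"[{ids[0]}]"
    step = ids[1] - ids[0]
    # A sequence is collapsible when every neighbouring pair differs by `step`.
    is_arithmetic = all(b - a == step for a, b in zip(ids, ids[1:]))
    if is_arithmetic and len(ids) > 2:
        if step == 1:
            return f"[{ids[0]}-{ids[-1]}]"
        return f"[{ids[0]}-{ids[-1]}:{step}]"
    # Irregular sequences fall back to an explicit list.
    return "[" + ",".join(str(i) for i in ids) + "]"


print(format_task_ids(range(0, 101, 2)))  # prints [0-100:2]
```

Irregular ID sets such as 1, 4, 9 are left as an explicit list in this sketch.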

* Changes in Slurm 16.05.0pre2
==============================

 -- Split partition's "Priority" field into "PriorityTier" (used to order
    partitions for scheduling and preemption) plus "PriorityJobFactor" (used by
    priority/multifactor plugin in calculating job priority, which is used to
    order jobs within a partition for scheduling).

 -- Revert call to getaddrinfo, restoring gethostbyaddr (introduced in Slurm
    16.05.0pre1) which was failing on some systems.
 -- knl_cray.conf - Added AllowMCDRAM, AllowNUMA and AllowUserBoot
    configuration options.
 -- Add node_features_p_user_update() function to node_features plugin.

 -- Don't print Weight=1 lines in 'scontrol write config' (it's the default).

 -- Remove PARAMS macro from slurm.h.
 -- Remove BEGIN_C_DECLS and END_C_DECLS macros.

 -- Check that PowerSave mode configured for node_features/knl_cray plugin.
    It is required to reconfigure and reboot nodes.

 -- Update documentation to reflect new cgroup default location change from
    /cgroup to /sys/fs/cgroup.

 -- If NodeHealthCheckProgram is configured and HealthCheckInterval is non-zero,
    then modify slurmd to run it before registering with slurmctld.

 -- Fix for tasks being packed onto cores when the requested --cpus-per-task is
    greater than the number of threads on a core and --ntasks-per-core is 1.
 -- Make it so jobs/steps track ':' named gres/tres. Beforehand, gres/gpu:tesla
    would only track gres/gpu; now it will track both gres/gpu and
    gres/gpu:tesla as separate gres if configured like
    AccountingStorageTRES=gres/gpu,gres/gpu:tesla

 -- Added new job dependency type of "aftercorr" which will start a task of a
    job array after the corresponding task of another job array completes.
 -- Increase default MaxTasksPerNode configuration parameter from 128 to 512.
 -- Enable sbcast data compression logic (compress option previously ignored).

 -- Add --compress option to srun command for use with --bcast option.

 -- Add TCPTimeout option to slurm[dbd].conf. Decouples MessageTimeout from TCP
    connections.

 -- Don't call primary controller for every RPC when backup is in control.
 -- Add --gres-flags=enforce-binding option to salloc, sbatch and srun commands.
    If set, the only CPUs available to the job will be those bound to the
    selected GRES (i.e. the CPUs identified in the gres.conf file will be
    strictly enforced rather than advisory).

 -- Change how a node's allocated CPU count is calculated to avoid double
    counting CPUs allocated to multiple jobs at the same time.

 -- Added SchedulerParameters option of "bf_min_prio_reserve". Jobs below
    the specified threshold will not have resources reserved for them.

 -- Added "sacctmgr show lostjobs" to report any orphaned jobs in the database.

 -- When a stepd is about to shut down and send its response to srun,
    make the wait to return data only hit after 500 nodes and configurable
    based on the TCPTimeout value.

 -- Add functionality to reset the lft and rgt values of the association table
    with the slurmdbd.

 -- Add SchedulerParameters option no_env_cache. If set, no env cache will be
    used when launching a job; instead the job will fail and drain the node if
    the env isn't loaded normally.
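
The change to a node's allocated CPU count noted above (avoiding double
counting of CPUs allocated to multiple jobs at the same time) comes down to
counting distinct CPUs rather than summing per-job totals. The following is
only a sketch of that idea, not the slurmctld code; the job names, CPU IDs and
the function allocated_cpu_count are made up for illustration.

```python
def allocated_cpu_count(job_cpu_sets):
    """Count distinct CPUs in use on one node (illustrative only)."""
    in_use = set()
    for cpus in job_cpu_sets.values():
        # The set union ignores CPUs already counted for another job.
        in_use |= set(cpus)
    return len(in_use)


# Hypothetical node where two jobs share CPUs 2 and 3.
jobs_on_node = {
    "job_a": {0, 1, 2, 3},
    "job_b": {2, 3, 4, 5},
}

naive = sum(len(cpus) for cpus in jobs_on_node.values())
distinct = allocated_cpu_count(jobs_on_node)
print(naive, distinct)  # prints 8 6
```

Summing per-job counts reports 8 CPUs here, while only 6 distinct CPUs are
actually in use.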
