[slurm-dev] Slurm versions 2.6.5 and 14.03.0-pre5 are now available

Moe Jette Mon, 23 Dec 2013 12:44:58 -0800

Slurm version 2.6.5 with a multitude of bug fixes is now available. We are also making available version 14.03.0-pre5 with more development work for the next major release. A summary of changes is listed below. Downloads are available from http://www.schedmd.com/#repos.



* Changes in Slurm 2.6.5
========================
 -- Correction to hostlist parsing bug introduced in v2.6.4 for hostlists with
    more than one numeric range in brackets (e.g. "rack[0-3]_blade[0-63]").
 -- Add notification if using proctrack/cgroup and task/cgroup when oom hits.
 -- Corrections to advanced reservation logic with overlapping jobs.
 -- job_submit/lua - add cpus_per_task field to those available.
 -- Add cpu_load to the node information available using the Perl API.
 -- Correct a job's GRES allocation data in accounting records for non-Cray
    systems.
 -- Substantial performance improvement for systems with Shared=YES or FORCE
    and large numbers of running jobs (replace bubble sort with quick sort).
 -- proctrack/cgroup - Add locking to prevent race condition where one job step
    is ending for a user or job at the same time another job step is starting
    and the user or job container is deleted from under the starting job step.
 -- Fixed sh5util loop when there are no node-step files.
 -- Fix race condition on batch job termination that could result in a job exit
    code of 0xfffffffe if the slurmd on node zero registers its active jobs at
    the same time that slurmstepd is recording the job's exit code.
 -- Correct logic returning remaining job dependencies in job information
    reported by scontrol and squeue. Eliminates vestigial descriptors with
    no job ID values (e.g. "afterany").
 -- Improve performance of REQUEST_JOB_INFO_SINGLE RPC by removing unnecessary
    locks and use hash function to find the desired job.
 -- jobcomp/filetxt - Reopen the file when slurmctld daemon is reconfigured
    or gets SIGHUP.
 -- Remove notice of CVE with very old/deprecated versions of Slurm in
    news.html.
 -- Fix if hwloc_get_nbobjs_by_type() returns zero core count (set to 1).
 -- Added ApbasilTimeout parameter to the cray.conf configuration file.
 -- Handle in the API if parts of the node structure are NULL.
 -- Fix srun hang when IO fails to start at launch.
 -- Fix for GRES bitmap not matching the GRES count resulting in abort
    (requires manual resetting of GRES count, changes to gres.conf file,
    and slurmd restarts).
 -- Modify sview to better support job arrays.
 -- Modify squeue to support longer job ID values (for many job array tasks).
 -- Fix race condition in authentication credential creation that could corrupt
    memory. (NOTE: This race condition has existed since 2003 and would be
    exceedingly rare.)
 -- HDF5 - Fix minor memory leak.
 -- Slurmstepd variable initialization - Without this patch, free() is called
    on a random memory location (i.e. whatever is on the stack), which can
    result in slurmstepd dying and a completed job not being purged in a
    timely fashion.
 -- Fix slurmstepd race condition when separate threads are reading and
    modifying the job's environment, which can result in the slurmstepd failing
    with an invalid memory reference.
 -- Fix erroneous error messages when running gang scheduling.
 -- Fix minor memory leak.
 -- scontrol modified to suspend, resume, hold, uhold, or release multiple
    jobs in a space separated list.
 -- Minor debug error when a connection goes away at the end of a job.
 -- Validate return code from calls to slurm_get_peer_addr
 -- BGQ - Fix issues with making sure all cnodes are accounted for when multiple
    steps cause multiple cnodes in one allocation to go into error at the
    same time.
 -- scontrol show job - Correct NumNodes value calculated based upon job
    specifications.
 -- BGQ - Fix issue if user runs multiple sub-block jobs inside a multiple
    midplane block that starts on a higher coordinate than it ends (i.e. if a
    block has midplanes [0010,0013] 0013 is the start even though it is
    listed second in the hostlist).
 -- BGQ - Add midplane to the total_cnodes used in the runjob_mux plugin
    for better debug.
 -- Update AllocNodes paragraph in slurm.conf.5.
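
The hostlist fix at the top of this list concerns expressions with more than
one bracketed numeric range, such as "rack[0-3]_blade[0-63]". As a rough
illustration of what such an expression denotes, here is a minimal Python
sketch. It is not Slurm's parser: the function name expand_hostlist is
hypothetical, only the simple [lo-hi] form is handled, and comma-separated
lists inside brackets and zero-padded ranges are deliberately ignored.

```python
import itertools
import re

# Matches one bracketed numeric range such as "[0-3]" (simple lo-hi form only).
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hostlist(expr):
    """Expand every [lo-hi] range in expr; the rightmost range varies fastest.

    Illustrative only (hypothetical helper, not part of any Slurm API).
    """
    # With two capture groups, re.split yields:
    #   literal, lo, hi, literal, lo, hi, ..., literal
    parts = _RANGE.split(expr)
    literals = parts[0::3]
    ranges = [
        [str(n) for n in range(int(lo), int(hi) + 1)]
        for lo, hi in zip(parts[1::3], parts[2::3])
    ]
    hosts = []
    # The Cartesian product walks every combination of range values.
    for combo in itertools.product(*ranges):
        name = literals[0]
        for value, literal in zip(combo, literals[1:]):
            name += value + literal
        hosts.append(name)
    return hosts
```

Under these assumptions, expand_hostlist("rack[0-3]_blade[0-63]") yields 256
names, running from rack0_blade0 to rack3_blade63.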


* Changes in Slurm 14.03.0pre5
==============================
 -- Added SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID to the EpilogSlurmctld
    environment.
 -- Fix bug in job step allocation failing due to memory limit.
 -- Modify the pbsnodes script to reflect its output on a TORQUE system.
 -- Add ability to clear a node's DRAIN flag using scontrol or sview by setting
    its state to "UNDRAIN". The node's base state (e.g. "DOWN" or "IDLE") will
    not be changed.
 -- Modify the output of 'scontrol show partition' by displaying
    DefMemPerCPU=UNLIMITED and MaxMemPerCPU=UNLIMITED when these limits are
    configured as 0.
 -- mpirun-mic - Major re-write of the command wrapper for Xeon Phi use.
 -- Add new configuration parameter of AuthInfo to specify port used by
    authentication plugin.
 -- Fixed conditional rpm compiling.
 -- Corrected slurmstepd ident name when logging to syslog.
 -- Fixed sh5util loop when there are no node-step files.
 -- Add SLURM_CLUSTER_NAME to environment variables passed to PrologSlurmctld,
    Prolog, EpilogSlurmctld, and Epilog
 -- Add the idea of running a prolog right when an allocation happens
    instead of when running on the node for the first time.
 -- If user runs 'scontrol reconfig' but hostnames or the host count changes
    the slurmctld throws a fatal error.
 -- gres.conf - Add "NodeName" specification so that a single gres.conf file
    can be used for a heterogeneous cluster.
 -- Add flag to accounting RPC to indicate if job data is packed or not.
 -- After all srun tasks have terminated on a node close the stdout/stderr
    channel with the slurmstepd on that node.
 -- In case of i/o error with slurmstepd, log an error message and abort the
    job.
 -- Add --test-only option to sbatch command to validate the script and options.
    The response includes expected start time and resources to be allocated.
