[slurm-dev] SLURM version 2.3.0-rc1 now available

jette Thu, 28 Jul 2011 15:03:08 -0700

SLURM version 2.3.0-rc1 is now available. "rc1" indicates this is a release candidate for version 2.3.0. We do not recommend its use for production computers yet, but it can be considered close to the final release of version 2.3.0 for testing purposes. Our intent is to perform testing and debugging of this code over the coming weeks and release version 2.3.0 when stable. We do not intend to perform major development work in version 2.3, but target future development work for version 2.4. Major changes to the code since the last pre-release of version 2.3 are noted below. A summary of the major changes in version 2.3 from version 2.2 can be found in the file "RELEASE_NOTES" with the distributed files.


The code is available here: http://www.schedmd.com/#repos


* Changes in SLURM 2.3.0.rc1
============================
 -- NOTE THERE HAVE BEEN NEW FIELDS ADDED TO THE JOB AND PARTITION STATE SAVE
    FILES AND RPCS. PENDING AND RUNNING JOBS WILL BE LOST WHEN UPGRADING FROM
    EARLIER VERSION 2.3 PRE-RELEASES AND RPCS WILL NOT WORK WITH EARLIER
    VERSIONS.
 -- select/cray: Add support for Accelerator information including model and
    memory options.
 -- Cray systems: Add support to suspend/resume salloc command to ensure that
    aprun does not get initiated when the job is suspended. Processes suspended
    and resumed are determined by using process group ID and parent process ID,
    so some processes may be missed. Since salloc runs as a normal user, its
    ability to identify processes associated with a job is limited.
 -- Cray systems: Modify smap and sview to display all nodes even if multiple
    nodes exist at each coordinate.
 -- Improve efficiency of select/linear plugin with topology/tree plugin
    configured. Patch by Andriy Grytsenko (Massive Solutions Limited).
 -- For front-end architectures on which job steps are run (emulated Cray and
    BlueGene systems only), fix bug that would free memory still in use.
 -- Add squeue support to display a job's license information. Patch by Andy
    Roosen (University of Delaware).
 -- Add flag to the select APIs for job suspend/resume indicating if the action
    is for gang scheduling or an explicit job suspend/resume by the user. Only
    an explicit job suspend/resume will reset the job's priority and make
    resources exclusively held by the job available to other jobs.
 -- Fix possible invalid memory reference in sched/backfill. Patch by Andriy
    Grytsenko (Massive Solutions Limited).
 -- Add select_jobinfo to the task launch RPC. Based upon patch by Andriy
    Grytsenko (Massive Solutions Limited).
 -- Add DefMemPerCPU/Node and MaxMemPerCPU/Node to partition configuration.
    This improves flexibility when gang scheduling only specific partitions.
 -- Added new enums to print out when a job is held by a QOS instead of an
    association limit.
 -- Enhancements to sched/backfill performance with select/cons_res plugin.
    Patch from Bjørn-Helge Mevik, University of Oslo.
 -- Correct job run time reported by smap for suspended jobs.
 -- Improve job preemption logic to avoid preempting more jobs than needed.
 -- Add contribs/arrayrun tool providing support for job arrays. Contributed by
    Bjørn-Helge Mevik, University of Oslo. NOTE: Not currently packaged as RPM
    and manual file editing is required.
 -- When suspending a job, wait 2 seconds instead of 1 second between sending
    SIGTSTP and SIGSTOP. Some MPI implementations were not stopping within the
    1 second delay.
 -- Add support for managing devices based upon Linux cgroup container. Based
    upon patch by Yiannis Georgiou, Bull.
 -- Fix memory buffering bug if an AllowGroups parameter of a partition has 100
    or more users. Patch by Andriy Grytsenko (Massive Solutions Limited).
 -- Fix bug in generic resource tracking of gres associated with specific CPUs.
    Resources were being over-allocated.
 -- On systems with front-end nodes (IBM BlueGene and Cray) limit batch jobs to
    only one CPU of these shared resources.
 -- Set SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables for both
    interactive (salloc) and batch jobs if the job has a memory limit. For Cray
    systems also set CRAY_AUTO_APRUN_OPTIONS environment variable with the
    memory limit.
 -- Fix bug in select/cons_res task distribution logic when tasks-per-node=0.
    Patch from Rod Schultz, Bull.
 -- Restore node configuration information (CPUs, memory, etc.) for powered
    down nodes when slurmctld daemon restarts rather than waiting for the node
    to be restored to service and getting the information from the node (NOTE:
    Only relevant if FastSchedule=0).
 -- For Cray systems with the srun2aprun wrapper, rebuild the srun man page
    identifying the srun options which are valid on that system.
 -- BlueGene: Permit users to specify a separate connection type for each
    dimension (e.g. "--conn-type=torus,mesh,torus").
 -- Add the ability for a user to limit the number of leaf switches in a job's
    allocation using the --switch option of salloc, sbatch and srun. There is
    also a new SchedulerParameters value of max_switch_wait, which a SLURM
    administrator can use to set a maximum job delay and prevent a user job
    from blocking lower priority jobs for too long. Based on work by Rod
    Schultz, Bull.
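
For readers unfamiliar with the two-stage stop described in the job suspend
entry above (SIGTSTP, a pause, then SIGSTOP), here is a minimal Python sketch of
the general pattern. It is an illustration only, not SLURM's implementation:
the function names and the injectable `kill`/`sleep` parameters are
hypothetical, and the 2-second default simply mirrors the delay named in the
entry.

```python
import os
import signal
import time


def suspend_process(pid, grace_seconds=2.0, kill=os.kill, sleep=time.sleep):
    """Request a stop with SIGTSTP, wait, then force it with SIGSTOP.

    SIGTSTP can be caught, so a program gets a chance to quiesce first;
    SIGSTOP cannot be caught or ignored.
    """
    kill(pid, signal.SIGTSTP)   # catchable stop request
    sleep(grace_seconds)        # grace period before the forced stop
    kill(pid, signal.SIGSTOP)   # unconditional stop


def resume_process(pid, kill=os.kill):
    """Resume a process previously stopped by suspend_process."""
    kill(pid, signal.SIGCONT)
```

A longer grace period gives slow-to-react programs, such as the MPI
implementations mentioned in the entry, more time to act on SIGTSTP before
SIGSTOP arrives.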

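
As a usage illustration for the SLURM_MEM_PER_CPU / SLURM_MEM_PER_NODE entry
above, a job script could read whichever variable is set. This is a sketch
under assumptions: the helper name `job_memory_limit` is hypothetical, and the
value is assumed to be a plain integer (SLURM memory limits are normally
expressed in megabytes).

```python
import os


def job_memory_limit(env=None):
    """Return (scope, value) for the job's memory limit, or None if unset.

    scope is "per_cpu" or "per_node", depending on which variable is present.
    """
    env = os.environ if env is None else env
    for name, scope in (("SLURM_MEM_PER_CPU", "per_cpu"),
                        ("SLURM_MEM_PER_NODE", "per_node")):
        value = env.get(name)
        if value:
            # Assumption: the variable holds a plain integer.
            return scope, int(value)
    return None
```

Calling `job_memory_limit()` with no argument inspects the real environment;
passing a dict makes the helper easy to try outside a job.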