[slurm-dev] Re: Slurm versions 14.03.4 and 14.11.0-pre1 are now available

jette Tue, 17 Jun 2014 10:05:48 -0700


Sorry for the delay, please check again:
http://www.schedmd.com/#repos



Quoting Uwe Sauter <[email protected]>:

Hi,

http://www.schedmd.com/#repos still links to 14.03.3-2 and http://www.schedmd.com/download/latest/ doesn't show anything. Can you please provide a link for the new version?


Thank you,

    Uwe

On 17.06.2014 00:01, [email protected] wrote:


Slurm versions 14.03.4 and 14.11.0-pre1 are now available.

Version 14.03.4 includes about 40 relatively minor bug fixes and enhancements
as described below. Of particular note, there are several enhancements to
control layout of tasks across resources and significant performance
improvements for backfill scheduling.

Version 14.11.0-pre1 is the first pre-release of the next major release of
Slurm scheduled for November 2014. This is very much a work in progress and not
intended for production use.

Slurm downloads are available from
http://www.schedmd.com/#repos.


Highlights of changes in Slurm version 14.03.4 include:

-- Fix issue where QOS is not being enforced but a partition either allows or
   denies specific QOSs.
-- CRAY - Make switch/cray default when running on a Cray natively.
-- CRAY - Make job_container/cncu default when running on a Cray natively.
-- Disable job time limit change if its preemption is in progress.
-- Correct logic to properly enforce job preemption GraceTime.
-- Fix sinfo -R to print each down/drained node once, rather than once per
   partition.
-- If a job has a non-responding node, retry job step create rather than
   returning with DOWN node error.
-- Support SLURM_CONF path which does not have "slurm.conf" as the file name.
-- Fix issue where batch cpuset wasn't looked at correctly in
   jobacct_gather/cgroup.
-- Correct squeue's job node and CPU counts for requeued jobs.
-- Correct SelectTypeParameters=CR_LLN with job selection of specific nodes.
-- Only if ALL of their partitions are hidden will a job be hidden by default.
-- Run EpilogSlurmctld for a job that is killed during slurmctld
   reconfiguration.
-- Close window in srun where receiving a signal while waiting for an
   allocation and printing something would produce a deadlock.
-- Add SelectTypeParameters option of CR_PACK_NODES to pack a job's tasks
   tightly on its allocated nodes rather than distributing them evenly across
   the allocated nodes.
-- cpus-per-task support: Try to pack all CPUs of each task onto one socket.
   Previous logic could spread the task's CPUs across multiple sockets.
-- Add new distribution method fcyclic so when a task is using multiple cpus
   it can bind cyclically across sockets.
-- task/affinity - When using --hint=nomultithread only bind to the first
   thread in a core.
-- Make cgroup task layout (block | cyclic) method mirror that of
   task/affinity.
-- If TaskProlog sets SLURM_PROLOG_CPU_MASK reset affinity for that task
   based on the mask given.
-- Keep supporting 'srun -N x --pty bash' for historical reasons.
-- If EnforcePartLimits=Yes and the QOS the job is using can override limits,
   allow it.
-- Fix issues if partition allows or denies accounts or QOSs and either are
   not set.
-- If a job requests a partition and it doesn't allow a QOS or account the
   job is requesting, the job pends unless EnforcePartLimits=Yes.  Before it
   would always kill the job at submit.
-- Fix format output of scontrol command when printing node state.
-- Improve the clean up of cgroup hierarchy when using the
   jobacct_gather/cgroup plugin.
-- Added SchedulerParameters value of Ignore_NUMA.
-- Fix issues with code when using automake 1.14.1.
-- select/cons_res plugin: Fix memory leak related to job preemption.
-- After reconfig rebuild the job node counters only for jobs that have
   not finished yet, otherwise if requeued the job may enter an invalid
   COMPLETING state.
-- Do not purge the script and environment files for completed jobs on
   slurmctld reconfiguration or restart (they might be later requeued).
-- scontrol now accepts the option job=xxx or jobid=xxx for the requeue,
   requeuehold and release operations.
-- task/cgroup - fix to bind batch job in the proper CPUs.
-- Added strigger option of -N, --noheader to not print the header when
   displaying a list of triggers.
-- Modify strigger to accept arguments to the program to execute when an
   event trigger occurs.
-- Attempt to create duplicate event trigger now generates ESLURM_TRIGGER_DUP
   ("Duplicate event trigger").
-- Treat special characters like %A, %s etc. literally in the file names
   when specified escaped e.g. sbatch -o /home/zebra\\%s will not expand
   %s as the stepid of the running job.
-- CRAYALPS - Add better support for CLE 5.2 when running Slurm over ALPS.
-- Test time when job_state file was written to detect multiple primary
   slurmctld daemons (e.g. both backup and primary are functioning as
   primary and there is a split brain problem).
-- Fix scontrol to accept update jobid=# numtasks=#
-- If the backup slurmctld assumes primary status, then do NOT purge any
   job state files (batch script and environment files) and do not re-use them.
   This may indicate that multiple primary slurmctld daemons are active (e.g.
   both backup and primary are functioning as primary and there is a split
   brain problem).
-- Set correct error code when requeuing a completing/pending job.
-- When checking dependencies of type afterany, afterok and afternotok,
   don't clear the dependency if the job is completing.
-- Cleanup the JOB_COMPLETING flag and eventually requeue the job when the
   last epilog completes, either slurmd epilog or slurmctld epilog, whichever
   comes last.
-- When attempting to requeue a job distinguish the case in which the job is
   JOB_COMPLETING or already pending.
-- When reconfiguring the controller don't restart the slurmctld epilog if it
   is already running.
-- Email messages for job array events now print the job ID using the
   format "#_# (#)" rather than just the internal job ID.
-- Set the number of free licenses to be 0 if the global license count decreases
   and total is less than in use.
-- Add DebugFlag of BackfillMap. Previously a DebugFlag value of Backfill
   logged information about what it was doing plus a map of expected resource
   use in the future. Now that very verbose resource use map is only logged
   with a DebugFlag value of BackfillMap.
-- Fix slurmstepd core dump.
-- Modify the description of -E and -S option of sacct command as point in time
   'before' or 'after' the database records are returned.
-- Correct support for partition with Shared=YES configuration.
-- If job requests --exclusive then do not use nodes which have any cores in an
   advanced reservation. Also prevents case where nodes can be shared by other
   jobs.
-- For "scontrol --details show job" report the correct CPU_IDs when there are
   multiple threads per core (we are translating a core bitmap to CPU IDs).
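
As a rough illustration of the CR_PACK_NODES entry in the list above, the
sketch below contrasts packing tasks tightly onto the allocated nodes with
spreading them evenly. It is not Slurm's code: the function name
distribute_tasks, the one-CPU-per-task model and the node sizes are all
assumptions made for the example.

```python
def distribute_tasks(cpus_per_node, ntasks, pack=False):
    """Return how many tasks land on each node, assuming one CPU per task.

    cpus_per_node: CPUs available on each allocated node (hypothetical input).
    pack=True  fills each node before moving on (CR_PACK_NODES-like behaviour).
    pack=False hands out one task per node in turn (even distribution).
    """
    if ntasks > sum(cpus_per_node):
        raise ValueError("not enough CPUs for the requested tasks")

    placed = [0] * len(cpus_per_node)
    remaining = ntasks

    if pack:
        # Fill the first node completely, then the next, and so on.
        for i, cpus in enumerate(cpus_per_node):
            placed[i] = min(cpus, remaining)
            remaining -= placed[i]
        return placed

    # Even spread: visit the nodes round-robin, one task at a time.
    while remaining > 0:
        for i, cpus in enumerate(cpus_per_node):
            if remaining > 0 and placed[i] < cpus:
                placed[i] += 1
                remaining -= 1
    return placed


print(distribute_tasks([4, 4, 4], 6, pack=True))
print(distribute_tasks([4, 4, 4], 6))
```

With three 4-CPU nodes and six tasks, the packed layout uses only the first
two nodes, while the even layout places two tasks on every node.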


Highlights of changes in Slurm version 14.11.0-pre1 include:

-- Modify sdiag to report Slurm RPC traffic by user, type, count and time
   consumed.
-- Add support for allocation of GRES by model type for heterogeneous systems
   (e.g. request a Kepler GPU, a Tesla GPU, or a GPU of any type).
-- Modify squeue --start option to print the nodes expected to be used for
   pending job (in addition to expected start time, etc.).
-- Add support for non-consumable generic resources for resources that are
   limited, but can be shared between jobs.
-- Introduce automatic job requeue policy based on exit value. See RequeueExit
   and RequeueExitHold descriptions in slurm.conf man page.
-- Modify slurmd to cache launched job IDs for more responsive job suspend and
   gang scheduling.
-- Add srun --cpu-freq options to set the CPU governor (OnDemand, Performance,
   PowerSave or UserSpace).
-- Add support for a job step's CPU governor and/or frequency to be reset on
   suspend/resume (or gang scheduling). The default for an idle CPU will now
   be "ondemand" rather than "userspace" with the lowest frequency (to recover
   from hard slurmd failures and support gang scheduling).
-- Replace round-robin front-end node selection with least-loaded algorithm.
-- Add new node configuration parameters CoreSpecCount, CPUSpecList and
   MemSpecLimit which support the reservation of resources for system use
   with Linux cgroup.
-- Cray/ALPS system - Enable backup controller to run outside of the Cray to
   accept new job submissions and most other operations on the pending jobs.
-- sview - Better job_array support.
-- Provide more precise error message when job allocation can not be satisfied
   (e.g. memory, disk, cpu count, etc. rather than just "node configuration
   not available").
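
The switch from round-robin to least-loaded front-end node selection listed
above can be pictured with the following sketch. It is only an illustration:
the announcement does not say how load is measured, so the use of a per-node
job count, the node names and both function names are assumptions.

```python
from itertools import cycle


def round_robin(nodes):
    """Return an iterator that hands out nodes in a fixed rotation,
    regardless of how busy each node is."""
    return cycle(nodes)


def least_loaded(job_counts):
    """Return the node with the fewest running jobs.

    job_counts maps node name -> number of jobs (an assumed load metric).
    Ties go to the node listed first.
    """
    return min(job_counts, key=job_counts.get)


# Hypothetical front-end nodes and their current job counts.
load = {"fe1": 3, "fe2": 1, "fe3": 2}
chosen = least_loaded(load)
load[chosen] += 1  # account for the newly placed job
print(chosen, load)
```

A rotation keeps sending work to a busy node whenever its turn comes up,
whereas the least-loaded choice steers each new job to whichever node
currently has the most headroom.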
