Version 14.11.5 contains quite a few bug fixes made over the past five weeks, including fixes for two high-impact bugs. One fixes the slurmdbd daemon aborting if a node is set to a DOWN state and its "reason" field is NULL. The other prevents someone from being able to kill a job array belonging to another user. Details about all of the changes are appended.

Version 15.08.0-pre3 represents the current state of Slurm development for the release planned in August 2015 and is intended for development and test purposes only. Notable enhancements include power capping support for Cray systems and the ability for a compute node to be allocated to multiple jobs, but restricted to one user at a time.

Both versions can be downloaded from
http://www.schedmd.com/#repos


* Changes in Slurm 14.11.5
==========================
 -- Correct the squeue command to take into account that a node can
    have a NULL name if it is not in DNS but is still in slurm.conf.
 -- Fix slurmdbd regression which would cause a segfault when a node is set
    down with no reason.
 -- BGQ - Fix issue with job arrays not being handled correctly
    in the runjob_mux plugin.
 -- Print FAIR_TREE, if configured, in "scontrol show config" output for
    PriorityFlags.
 -- Add SLURM_JOB_GPUS environment variable to those available in the Prolog.
 -- Load lua-5.2 library if using lua5.2 for lua job submit plugin.
 -- GRES logic: Prevent bad node_offset due to not preserving no_consume flag.
 -- Fix wrong variables used in the wrapper functions needed for systems that
    don't support strong_alias.
 -- Fix code for Apple computers, where SOL_TCP is not defined.
 -- Cray/BASIL - Check for mysql credentials in /root/.my.cnf.
 -- Fix sprio showing wrong priority for job arrays until priority is
    recalculated.
 -- Account to the batch step all CPUs that are allocated to a job, not
    just one, since the batch step has access to all CPUs like other steps.
 -- Fix job getting EligibleTime set before meeting dependency requirements.
 -- Correct the initialization of QOS MinCPUs per job limit.
 -- Set the debug level of information messages in cgroup plugin to debug2.
 -- For a job running under a debugger, if the exec of the task fails, then
    cancel its I/O and abort immediately rather than waiting 60 seconds for
    I/O timeout.
 -- Fix associations not getting default qos set until after a restart.
 -- Set the value of total_cpus not to be zero before invoking
    acct_policy_job_runnable_post_select.
 -- MySQL - When requesting cluster resources, only return resources for the
    cluster(s) requested.
 -- Add TaskPluginParam=autobind=threads option to set a default binding in the
    case that "auto binding" doesn't find a match.
 -- Introduce a new SchedulerParameters variable nohold_on_prolog_fail.
    If configured, don't requeue jobs on hold if a Prolog fails.
 -- Make it so sched_params isn't read over and over when an epilog complete
    message comes in.
 -- Fix squeue -L <licenses> not filtering out jobs with licenses.
 -- Changed the implementation of xcpuinfo_abs_to_mac() to be identical to
    _abs_to_mac() to fix CPU allocation using cpuset cgroup.
 -- Improve the explanation of the unbuffered feature in the
    srun man page.
 -- Make taskplugin=cgroup work for core spec. Previously this required
    task/cgroup.
 -- Fix reports not using the month usage table.
 -- BGQ - Sanity check given for translating small blocks into slurm bg_records.
 -- Fix bug preventing the requeue/hold or requeue/special_exit of job from the
    completing state.
 -- Cray - Fix for launching batch step within an existing job allocation.
 -- Cray - Add ALPS_APP_ID_ENV environment variable.
 -- Increase maximum MaxArraySize configuration parameter value from 1,000,001
    to 4,000,001.
 -- Added new SchedulerParameters value of bf_min_age_reserve. The backfill
    scheduler will not reserve resources for pending jobs until they have
    been pending for at least the specified number of seconds. This can be
    valuable if jobs lack time limits or all time limits have the same value.
 -- Fix support for --mem=0 (all memory of a node) with select/cons_res plugin.
 -- Fix bug that can permit someone to kill job array belonging to another user.
 -- Don't set the default partition on a license only reservation.
 -- Show a NodeCnt=0, instead of NO_VAL, in "scontrol show res" for a license
    only reservation.
 -- BGQ - When using static small blocks, make sure when clearing the job the
    block is set up to its original state.
 -- Start job allocation using lowest numbered sockets for block task
    distribution for consistency with cyclic distribution.
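
The bf_min_age_reserve rule described above can be pictured with a small
Python sketch. This is only an illustration of the stated behavior, not
Slurm's implementation: the function name may_reserve and the default of 0
(meaning "no minimum age") are assumptions made for this example.

```python
def may_reserve(pending_seconds: int, bf_min_age_reserve: int = 0) -> bool:
    """Illustrative check: may the backfill scheduler reserve resources?

    pending_seconds    -- how long the job has been pending
    bf_min_age_reserve -- configured minimum age in seconds
                          (default of 0 is an assumption for this sketch)
    """
    # Resources are reserved only once the job has been pending for
    # at least the configured number of seconds.
    return pending_seconds >= bf_min_age_reserve


if __name__ == "__main__":
    # With a 600 second threshold, a job pending 599 seconds gets no
    # reservation yet; at 600 seconds it becomes eligible.
    print(may_reserve(599, 600))
    print(may_reserve(600, 600))
```

Under this reading, recently submitted jobs can still be started by the
backfill scheduler; they just do not hold back other jobs through a
reservation until they have aged past the threshold.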


* Changes in Slurm 15.08.0pre3
==============================
-- CRAY - addition of acct_gather_energy/cray plugin.
-- Add job credential to "Run Prolog" RPC used with a configuration of
   PrologFlags=alloc. This allows the Prolog to be passed identification of
   GPUs allocated to the job.
-- Add SLURM_JOB_CONSTRAINTS to environment variables available to the Prolog.
-- Added "--mail=stage_out" option to job submission commands to notify user
   when burst buffer stage out is complete.
-- Require a "Reason" when using scontrol to set a node state to DOWN.
-- Mail notifications on job BEGIN, END and FAIL now apply to a job array as a
   whole rather than generating individual email messages for each task in the
   job array.
-- task/affinity - Fix memory binding to NUMA with cpusets.
-- Display job's estimated NodeCount based off of partition's configured
   resources rather than the whole system's.
-- Add AuthInfo option of "cred_expire=#" to specify the lifetime of a job
   step credential. The default value was changed from 1200 to 120 seconds.
-- Set the delay time for job requeue to the job credential lifetime (120
   seconds by default). This ensures that the prolog runs on every node when a
   job is requeued. (This change will slow down launch of requeued jobs.)
-- Remove srun --max-launch-time option. The option has not been functional
   since Slurm version 2.0.
-- Add sockets and cores to TaskPluginParams' autobind option.
-- Added LaunchParameters configuration parameter. Have srun command test
   locally for the executable file if LaunchParameters=test_exec or the
   environment variable SLURM_TEST_EXEC is set. Without this an invalid
   command will generate one error message per task launched.
-- Fix the slurm /etc/init.d script to return 0 upon stopping the
   daemons and return 1 in case of failure.
-- Add the ability for a compute node to be allocated to multiple jobs, but
   restricted to a single user. Added "--exclusive=user" option to salloc,
   sbatch and srun commands. Added "owner" field to node record, visible using
   the scontrol and sview commands. Added new partition configuration parameter
   "ExclusiveUser=yes|no".
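
The single-user node sharing described in the last item can be sketched in
Python as follows. This is a simplified model of the stated policy, not
Slurm's code: the Node class, its fields, and the try_allocate and release
helpers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a node record with an "owner" field."""
    name: str
    owner: Optional[str] = None  # user currently owning the node, if any
    jobs: int = 0                # number of jobs allocated on the node


def try_allocate(node: Node, user: str) -> bool:
    """Allocate the node to one more job if the user-exclusive policy allows."""
    # A node owned by a different user cannot be shared.
    if node.owner is not None and node.owner != user:
        return False
    node.owner = user
    node.jobs += 1
    return True


def release(node: Node) -> None:
    """Release one job; clear the owner once the node is idle."""
    if node.jobs > 0:
        node.jobs -= 1
    if node.jobs == 0:
        node.owner = None


if __name__ == "__main__":
    n = Node("nid00001")
    print(try_allocate(n, "alice"))  # first job sets the owner
    print(try_allocate(n, "alice"))  # same user may share the node
    print(try_allocate(n, "bob"))    # a different user is refused
```

In this model the node becomes available to other users only after the
owner's last job has been released.
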
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
