Slurm version 14.11.6 is now available with quite a few bug fixes as listed below.

Slurm downloads are available from
http://slurm.schedmd.com/download.html

* Changes in Slurm 14.11.6
==========================
 -- If SchedulerParameters value of bf_min_age_reserve is configured, then
    a newly submitted job can start immediately even if there is a higher
    priority non-runnable job which has been waiting for less time than
    bf_min_age_reserve.
 -- qsub wrapper modified to export "all" with -V option.
 -- RequeueExit and RequeueExitHold configuration parameters modified to accept
    numeric ranges. For example "RequeueExit=1,2,3,4" and "RequeueExit=1-4" are
    equivalent.
 -- Correct the job array specification parser to accept brackets in job array
    expression (e.g. "123_[4,7-9]").
 -- Fix for misleading job submit failure errors sent to users. Previous error
    could indicate why specific nodes could not be used (e.g. too small memory)
    when other nodes could be used, but were not for another reason.
 -- Fix squeue --array to correctly display the array elements when the
    % separator is specified at array submission time.
 -- Fix priority not being calculated correctly due to memory issues.
 -- Fix a transient pending reason 'JobId=job_id has invalid QOS'.
 -- A non-administrator change to job priority will not be persistent except
    for holding the job. Users wanting to change a job priority on a persistent
    basis should reset its "nice" value.
 -- Print buffer sizes as unsigned values when failed to pack messages.
 -- Fix race condition where sprio would print factors without weights applied.
 -- Document the sacct option JobIDRaw which for arrays prints the jobid instead
    of the arrayTaskId.
 -- Allow users to modify MinCPUsNode, MinMemoryNode and MinTmpDiskNode of
    their own jobs.
 -- Increase the jobid print field in SQUEUE_FORMAT in
    opt_modulefiles_slurm.in.
 -- Enable compiling without optimizations and with debugging symbols by
    default. Disable this by configuring with --disable-debug.
 -- job_submit/lua plugin: Add mail_type and mail_user fields.
 -- Correct output message from sshare.
 -- Use standard statvfs(2) syscall if available, in preference to
    non-standard statfs.
 -- Add a new option -U/--Users to sshare to display only user information;
    parent and ancestors are not printed.
 -- Purge 50000 records at a time so that locks can be released periodically.
 -- Fix potentially uninitialized variables.
 -- ALPS - Fix issue where a frontend node could become unresponsive and never
    be added back into the system.
 -- Gate epilog complete messages as done with other messages.
 -- If we have more than a certain number of agents (50), wait longer when
    gating RPCs.
 -- FrontEnd - ping non-responding or down nodes.
 -- switch/cray: If CR_PACK_NODES is configured, then set the environment
    variable "PMI_CRAY_NO_SMP_ENV=1"
 -- Fix invalid memory reference in SlurmDBD when putting a node up.
 -- Allow opening of plugstack.conf even when a symlink.
 -- Fix scontrol reboot so that rebooted nodes will not be set down with reason
    'Node xyz unexpectedly rebooted' but will be correctly put back to service.
 -- CRAY - Throttle the post NHC operations so as not to hog the job write lock
    if many steps/jobs finish at once.
 -- Disable changes to GRES count while jobs are running on the node.
 -- CRAY - Fix issue with scontrol reconfig.
 -- slurmd: Remove wrong reporting of "Error reading step  ... memory limit".
    The logic was treating success as an error.
 -- Eliminate "Node ping apparently hung" error messages.
 -- Fix average CPU frequency calculation.
 -- When allocating resources with resolution of sockets, charge the job for all
    CPUs on allocated sockets rather than just the CPUs on used cores.
 -- Prevent slurmdbd error if a cluster is added or removed while a rollup is in
    progress. Removing a cluster can cause slurmdbd to abort. Adding a cluster
    can cause the slurmdbd rollup to hang.
 -- sview - When right clicking on a tab make sure we don't display the page
    list, but only the column list.
 -- FRONTEND - If doing a clean start make sure the nodes are brought up in the
    database.
 -- MySQL - Fix double billing of down time when TrackSlurmctldDown is used and
    nodes are down at the same time.
 -- MySQL - Various memory leak fixes.
 -- sreport - Fix Energy displays.
 -- Fix node manager logic to keep unexpectedly rebooted node in state
    NODE_STATE_DOWN even if already down when rebooted.
 -- Fix for array jobs submitted to multiple partitions not starting.
 -- CRAY - Enable ALPS mpp compatibility code in sbatch for native Slurm.
 -- ALPS - Move basil_inventory to less confusing function.
 -- Add SchedulerParameters option of "sched_max_job_start=" to limit the
    number of jobs that can be started in any single execution of the main
    scheduling logic.
 -- Fixed compiler warnings generated by gcc version >= 4.6.
 -- sbatch to stop parsing script for "#SBATCH" directives after first command,
    which matches the documentation.
 -- Overwrite SLURM_JOB_NAME in sbatch if it already exists in the environment
    and use the name specified on the command line with --job-name.
 -- Remove xmalloc_nz from unpack functions.  If the unpack ever failed the
    free afterwards would not have zeroed out memory on the variables that
    didn't get unpacked.
 -- Improve database interaction from controller.
 -- Fix for data shift when loading job archives.
 -- ALPS - Added new SchedulerParameters=inventory_interval to specify how
    often an inventory request is handled.
 -- ALPS - Don't run a release on a reservation on the slurmctld for a batch
    job.  This is already handled on the stepd when the script finishes.
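
Two of the syntax changes listed above (numeric ranges for RequeueExit and
RequeueExitHold, and brackets in job array expressions) can be illustrated
with a short Python sketch. This is not Slurm's parser: the function names
expand_ranges and parse_array_expression are hypothetical, and the sketch
ignores the "%" separator and any other syntax Slurm may accept. It only
shows why "1,2,3,4" and "1-4" describe the same set of values.

```python
def expand_ranges(spec):
    """Expand a comma-separated list such as "1-4,7" into sorted integers.

    Illustrative only; assumes non-negative integers and "low-high" ranges.
    """
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            values.update(range(int(low), int(high) + 1))
        else:
            values.add(int(part))
    return sorted(values)


def parse_array_expression(expr):
    """Split "jobid_[tasks]" or "jobid_tasks" into (job id, task id list)."""
    job_id, _, tasks = expr.partition("_")
    # The brackets are optional in this sketch, so strip them if present.
    return int(job_id), expand_ranges(tasks.strip("[]"))


# The two RequeueExit forms from the changelog expand to the same values.
print(expand_ranges("1,2,3,4") == expand_ranges("1-4"))  # True
print(parse_array_expression("123_[4,7-9]"))  # (123, [4, 7, 8, 9])
```
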
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
