We are pleased to announce the availability of Slurm version 23.02.2
and Slurm version 22.05.9.
The 23.02.2 release includes a number of fixes to Slurm stability,
including a fix for a regression in 23.02 that caused Open MPI's
mpirun to fail to launch tasks. It also includes two functional changes:
scrontab no longer updates the cron job tasks if the whole crontab file
is left untouched after opening it with "scrontab -e", and dynamic nodes
are now sorted and included in topology after an "scontrol reconfigure"
or a slurmctld restart.
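For reference, scrontab entries look like ordinary crontab lines,
optionally preceded by #SCRON lines carrying sbatch-style options. A
minimal sketch (the options and script path here are only illustrative):

    #SCRON --partition=debug --time=00:05:00
    */15 * * * * /home/user/periodic-task.sh

With this change, exiting "scrontab -e" without modifying the file
leaves the existing cron jobs untouched.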
The 22.05.9 release includes a fix for a regression in 22.05.7 that
prevented slurmctld from connecting to an srun running outside a
compute node, and a fix to the upgrade process from 21.08 or 20.11 to
22.05, where pending jobs that had requested --mem-per-cpu could be
killed due to incorrect memory limit enforcement.
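The upgrade fix concerns pending jobs submitted with a per-CPU memory
request, for example (the task count, size, and script name are only
illustrative):

    sbatch --ntasks=16 --mem-per-cpu=2G job.sh

Jobs like this that were pending across the upgrade could previously be
killed due to incorrect memory limit enforcement.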
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
- Marshall
--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
* Changes in Slurm 23.02.2
==========================
-- Fix regression introduced with the migration to interfaces which caused
sshare to core dump. sshare now initializes the priority context correctly
when calculating with PriorityFlags=NO_FAIR_TREE.
-- Fix IPMI DCMI sensor initialization.
-- For the select/cons_tres plugin, improve the best-effort GPU to core
binding for requests with a per-job task count (-n) and GPU (--gpus)
specification.
-- scrontab - don't update the cron job tasks if the whole crontab file is
left untouched after opening it with "scrontab -e".
-- mpi/pmix - avoid crashing when running PMIx v5.0 branch with shmem support.
-- Fix building the switch topology with the correct nodes after a reconfig.
-- Allow a dynamic node to register with a reason, using --conf, when the
state is DOWN or DRAIN.
-- Fix slurmd running tasks before RPC Prolog is run.
-- Fix slurmd deadlock if the controller gives a bad alias_list.
-- slurmrestd - correctly process job submission field "exclusive" with boolean
True or False.
-- slurmrestd - correctly process job submission field "exclusive" with strings
"true" or "false".
-- slurmctld/step_mgr - prevent non-allocatable steps from decrementing values
that weren't previously incremented when trying to allocate them.
-- auth/jwt - Fix memory leak in slurmctld with 'scontrol token'.
-- Fix shared gres (shard/mps) leak when using --tres-per-task.
-- Fix sacctmgr segfault when listing accounts with coordinators.
-- slurmrestd - improve error logging when client connections experience
polling errors.
-- slurmrestd - improve handling of sockets in different states of shutdown to
avoid infinite poll() loop causing a thread to max CPU usage until process
is killed.
-- slurmrestd - avoid possible segfault caused by race condition of already
completed connections.
-- mpi/cray_shasta - Fix PMI shared secret for hetjobs.
-- gpu/oneapi - Fix CPU affinity handling.
-- Fix already-up dynamic nodes being powered up after adding/deleting nodes
when using power_save logic.
-- slurmrestd - Add support for setting max connections.
-- data_parser/v0.0.39 - fix sacct --json matching associations from a
different cluster.
-- Fix segfault when clearing reqnodelist of a pending job.
-- Fix memory leak of argv when submitting jobs via slurmrestd or CLI commands.
-- slurmrestd - correct miscalculation of job argument count that could cause
memory leak when job submission fails.
-- slurmdbd - add warning on startup if max_allowed_packet is too small
(see the my.cnf sketch after this list).
-- gpu/nvml - Remove E-cores from NVML's cpu affinity bitmap when
"allow_ecores" is not set in SlurmdParameters.
-- Fix regression from 23.02.0rc1 causing a FrontEnd slurmd to assert fail on
startup and not be configured with the appropriate port.
-- Fix dynamic nodes not being sorted and not being included in topology,
which resulted in suboptimal dynamic node selection for jobs.
-- Fix slurmstepd crash due to potential division by zero (SIGFPE) in certain
edge-cases using the PMIx plugin.
-- Fix issue with PMIx HetJob requests where certain use-cases would end up
with communication errors due to incorrect PMIx hostname info setup.
-- openapi/v0.0.39 - revert regression in job update requests to accept job
description for changes instead of requiring job description in "job" field.
-- Fix regression in 23.02.0rc1 that caused a step to crash with a bad
--gpu-bind=single request.
-- job_container/tmpfs - skip more in-depth attempt to clean up the base path
when not required. This prevents unhelpful, and possibly misleading, debug2
messages when not using the new "shared" mode.
-- gpu/nvml - Fix gpu usage when graphics processes are running on the gpu.
-- slurmrestd - fix regression where "exclusive" field was removed from job
descriptions and submissions.
-- Fix issue where requeued jobs had bad gres allocations leading to gres not
being deallocated at the end of the job, preventing other jobs from using
those resources.
-- Fix regression in 23.02.0rc1 which caused incorrect values for
SLURM_TASKS_PER_NODE when the job requests --ntasks-per-node and --exclusive
or --ntasks-per-core=1 (or CR_ONE_TASK_PER_CORE) and without requesting
--ntasks. SLURM_TASKS_PER_NODE is used by mpirun, so this regression
caused mpirun to launch the wrong number of tasks and to sometimes fail to
launch tasks.
-- Prevent jobs running on shards from being canceled on slurmctld restart.
-- Fix SPANK prolog and epilog hooks that rely on slurm_init() for access to
internal Slurm API calls.
-- oci.conf - Populate %m pattern with ContainerPath or SlurmdSpoolDir if
ContainerPath is not configured.
-- Removed zero padding for numeric values in container spool directory names.
-- Avoid creating an unused task-4294967295 directory in container spooldir.
-- Cleanup container step directories at step completion.
-- sacctmgr - Fix segfault when printing empty tres.
-- srun - fix communication issue that prevented slurmctld from connecting to
an srun running outside of a compute node.
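Regarding the max_allowed_packet warning above: this refers to the
MySQL/MariaDB server variable, not a Slurm option. A minimal my.cnf
sketch (the 16M value is only an example; size it for your site):

    [mysqld]
    max_allowed_packet=16M

The database server must be restarted (or the variable set dynamically
with SET GLOBAL) for the change to take effect.
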
* Changes in Slurm 22.05.9
==========================
-- Allocate correct number of sockets when requesting gres and running with
CR_SOCKET*.
-- Fix handling of --prefer for job arrays.
-- Fix regression in 22.05.5 that caused some jobs that request
--ntasks-per-node to be incorrectly rejected.
-- Fix slurmctld crash when a step requests fewer tasks than nodes.
-- Fix incorrect task count in steps that request --ntasks-per-node and a node
count with a range (e.g. -N1-2).
-- Fix some valid step requests hanging instead of running.
-- slurmrestd - avoid possible race condition which would cause slurmrestd to
silently no longer accept new client connections.
-- Fix GPU setup on CRAY systems when using the CRAY_CUDA_MPS environment
variable. GPUs are now correctly detected in such scenarios.
-- Fix the job prolog not running for jobs with the interactive step
(salloc jobs with LaunchParameters=use_interactive_step set in slurm.conf)
that were scheduled on powered down nodes. The prolog not running also
broke job_container/tmpfs, pam_slurm_adopt, and x11 forwarding.
-- task/affinity - fix slurmd segfault when launch task requests of type
"--cpu-bind=[map,mask]_cpu:<list>" have no <list> provided (see the
example after this list).
-- sched/backfill - fix segfault when removing a PLANNED node from system.
-- sched/backfill - fix deleted planned node staying in planned node bitmap.
-- Fix nodes remaining as PLANNED after slurmctld save state recovery.
-- Fix regression in 22.05.0rc1 that broke Nodes=ALL in a NodeSet.
-- Fix incorrect memory constraint when receiving a job from 20.11 that uses
cpu count for memory calculation.
-- openapi/v0.0.[36-38] - avoid possible crash from jobs submitted with argv.
-- openapi/v0.0.[36-38] - avoid possible crash from rejected jobs submitted
with batch_features.
-- srun - fix regression in 22.05.7 that prevented slurmctld from connecting
to an srun running outside of a compute node.
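To illustrate the task/affinity fix above, a well-formed map_cpu binding
always includes a CPU ID list, for example (the task count, CPU IDs, and
binary are only illustrative):

    srun --ntasks=4 --cpu-bind=map_cpu:0,2,4,6 ./app

A request such as "--cpu-bind=map_cpu:" with no list previously crashed
slurmd; it no longer does.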