Public bug reported:

We're using the Slurm workload manager in a cluster with Ubuntu 22.04 and
the linux-generic kernel (amd64). We use cgroups (cgroup2) for resource
allocation with Slurm. With kernel version

linux-image-5.15.0-91-generic         5.15.0-91.101
amd64

I'm seeing a new issue. This must have been introduced recently; I can
confirm that with kernel 5.15.0-88-generic the issue does not exist.
When I request a single GPU on a node with kernel 5.15.0-88-generic all
is well:

$ srun -G 1 -w gpu59 nvidia-smi -L
GPU 0: NVIDIA [...]


Instead with kernel 5.15.0-91-generic:

$ srun -G 1 -w gpu59 nvidia-smi -L
slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). 
Please check your system limits (MEMLOCK).
GPU 0: NVIDIA [...]
GPU 1: NVIDIA [...]
GPU 2: NVIDIA [...]
GPU 3: NVIDIA [...]
GPU 4: NVIDIA [...]
GPU 5: NVIDIA [...]
GPU 6: NVIDIA [...]
GPU 7: NVIDIA [...]

So I get this error regarding the MEMLOCK limit, and I see all GPUs in
the system instead of only the one requested. Hence I assume the problem
is related to cgroups: the eBPF device-confinement program apparently
fails to load, so the GPU restriction is not enforced.

$ cat /proc/version_signature 
Ubuntu 5.15.0-91.101-generic 5.15.131

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Attachment added: "ubuntu-bug linux output"
   
https://bugs.launchpad.net/bugs/2050098/+attachment/5741416/+files/apport.linux-image-5.15.0-91-generic.l9ripain.apport

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2050098

Title:
  cgroup2 appears to be broken

Status in linux package in Ubuntu:
  New



