Public bug reported:

I would like to report a regression in the Ubuntu Noble (24.04) Azure-
specific kernel. Performance degraded significantly after updating to
6.17.0-1008.8. We then tested 1009 to see if it makes a difference, but
to no avail.

# Regression Environment:

- OS: Ubuntu 24.04 LTS (Noble Numbat)

- Working Kernel: 6.14.0-1017-azure

- Broken Kernel: 6.17.0-1008-azure

- Workload: heavy network I/O and extensive mmap usage on the Azure-
specific kernel.

# The Core Technical Conflict:

- The Memory Paradox: The system reports a large amount of "Available"
memory, yet the kernel is failing to reclaim it automatically.

- Thrashing: Despite reported RAM availability, sar indicates high
pgscand and pgsteal activity, showing the kernel is struggling to manage
the page lifecycle.

- The "Manual Valve": Using drop_caches provides temporary relief,
confirming the memory is reclaimable. However, the kernel's internal
reclaim logic is failing to trigger on its own.
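
The paradox above can be made visible in one log with a small sampling
loop (our own sketch, not part of the gist; it assumes the
pgscan_direct/pgsteal_direct counter names exposed by recent kernels in
/proc/vmstat):

```shell
#!/bin/sh
# Sketch: sample MemAvailable next to the direct-reclaim counters so
# "plenty of available memory" and "heavy reclaim scanning" show up
# side by side on the same timestamped line.
for i in 1 2 3; do
    avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    scand=$(awk '/^pgscan_direct / {print $2}' /proc/vmstat)
    steal=$(awk '/^pgsteal_direct / {print $2}' /proc/vmstat)
    echo "$(date +%s) available_kb=$avail pgscan_direct=$scand pgsteal_direct=$steal"
    sleep 1
done
```

On a healthy node the reclaim counters stay nearly flat while
available_kb is high; on the impacted node both counters climb steeply
despite the high available_kb.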

# System Impact:

- Performance Collapse: Severe I/O degradation.

- Resource Pressure: Load averages spike into the hundreds; iowait
approaches 100%.
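
That pressure can be quantified with a minimal sampler (again our own
sketch, assuming the standard /proc/loadavg and /proc/stat layouts on
Linux):

```shell
#!/bin/sh
# Sketch: print the 1-minute load average and the cumulative iowait
# counter. The aggregate "cpu" line in /proc/stat is laid out as:
#   cpu user nice system idle iowait irq softirq ...
# so iowait is the 5th value after the "cpu" label (awk field $6).
read load1 rest < /proc/loadavg
iowait=$(awk '/^cpu / {print $6}' /proc/stat)
echo "load1=$load1 iowait_jiffies=$iowait"
```

Sampling this every few seconds during the collapse shows load1 in the
hundreds while iowait_jiffies grows almost as fast as wall-clock time.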

In the linked gist I attach logs and outputs from the following
commands (some of them captured after drop_caches execution):

uname -r
top
free -h
smem -wp
sar -B 1
cat /proc/vmstat | grep -E "compact_stall|compact_fail|compact_success"
cat /proc/vmstat | grep -i "slab"
cat /proc/buddyinfo
sudo cat /sys/kernel/debug/lru_gen | grep -A5 ravendb
sudo cat /sys/kernel/debug/lru_gen_full | grep -A25 ravendb
sudo sysctl -w vm.drop_caches=3
smem -wp
sar -B 1
cat /sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/min_ttl_ms
cat /proc/vmstat
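
For anyone reproducing this, the read-only diagnostics above can be
bundled into one timestamped file. This is a hypothetical helper of our
own, not part of the gist; missing files (e.g. the lru_gen sysfs nodes
on kernels without MGLRU) are simply recorded as errors in the output:

```shell
#!/bin/sh
# Sketch: collect the same read-only diagnostics into one file named
# after the running kernel, suitable for attaching to the bug report.
out="vm-diag-$(uname -r)-$(date +%Y%m%d%H%M%S).txt"
{
    echo "### uname -r";         uname -r
    echo "### meminfo";          cat /proc/meminfo
    echo "### compaction stats"; grep -E "compact_(stall|fail|success)" /proc/vmstat
    echo "### slab stats";       grep -i slab /proc/vmstat
    echo "### buddyinfo";        cat /proc/buddyinfo
    echo "### lru_gen enabled";  cat /sys/kernel/mm/lru_gen/enabled
    echo "### lru_gen min_ttl";  cat /sys/kernel/mm/lru_gen/min_ttl_ms
} > "$out" 2>&1
echo "wrote $out"
```

Running it once before and once after drop_caches gives a pair of files
that can be diffed directly.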

The gist contains two files: one with outputs from the impacted "Node
C" (6.17 IMPACTED) and one with outputs from the healthy "Node A" (6.14
HEALTHY):

https://gist.github.com/gregolsky/a9748a200f0ee7f4cd3ad4131f6f5c4d

Could you please help us understand this regression? Is there a patch
that 6.17.0-1008 is lacking?

Current status:

For now, as a mitigation, we have reverted the impacted environments to
kernel 6.14.
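
For reference, keeping a node pinned to the known-good kernel looks
roughly like this. This is a sketch only: the GRUB menu entry title is
an assumption and must be verified against /boot/grub/grub.cfg, and
grub-set-default requires GRUB_DEFAULT=saved in /etc/default/grub.

```shell
# Hold the meta package so 6.17 is not pulled back in on upgrade:
sudo apt-mark hold linux-azure

# Boot the previous kernel by default (entry title is an assumption;
# verify with: grep menuentry /boot/grub/grub.cfg):
sudo grub-set-default "Advanced options for Ubuntu>Ubuntu, with Linux 6.14.0-1017-azure"
sudo update-grub
```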

We also tested the generic 6.17 kernel and were able to reproduce the
issue on Azure hardware.

We were *not* able to reproduce this on AWS virtual machines.

We have a reproduction in a test environment and can provide more
information if needed.

We reported the issue to Microsoft Azure and are working with their
team in addition to this bug report.

** Affects: ubuntu
     Importance: Undecided
         Status: New

** Affects: linux-azure (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: 6.17 azure kernel noble regression


** Also affects: linux-azure (Ubuntu)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2143713

Title:
  Performance regression between 6.14.0-1017 and 6.17.0-1008.8

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+bug/2143713/+subscriptions

