Public bug reported:

I would like to report a regression in the Ubuntu Noble (24.04) Azure-specific kernel. Performance has degraded significantly after updating to 6.17.0-1008.8. We then tested 1009 to see if it made a difference, but to no avail.
# Regression Environment:
- OS: Ubuntu 24.04 LTS (Noble Numbat)
- Working kernel: 6.14.0-1017-azure
- Broken kernel: 6.17.0-1008-azure
- Workload: heavy network I/O and extensive mmap usage on the Azure-specific kernel.

# The Core Technical Conflict:
- The memory paradox: the system reports a large amount of "available" memory, yet the kernel fails to reclaim it automatically.
- Thrashing: despite the reported RAM availability, sar shows high pgscand and pgsteal activity, indicating the kernel is struggling to manage the page lifecycle.
- The "manual valve": running drop_caches provides temporary relief, confirming the memory is reclaimable. The kernel's internal reclaim logic, however, fails to trigger on its own.

# System Impact:
- Performance collapse: severe I/O degradation.
- Resource pressure: load averages spike into the hundreds; iowait is close to 100%.

In the linked gist I attach logs and the output of the following commands (some of them captured after running drop_caches):

uname -r
top
free -h
smem -wp
sar -B 1
cat /proc/vmstat | grep -E "compact_stall|compact_fail|compact_success"
cat /proc/vmstat | grep -i "slab"
cat /proc/buddyinfo
sudo cat /sys/kernel/debug/lru_gen | grep -A5 ravendb
sudo cat /sys/kernel/debug/lru_gen_full | grep -A25 ravendb
sudo sysctl -w vm.drop_caches=3
smem -wp
sar -B 1
cat /sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/min_ttl_ms
cat /proc/vmstat

The gist contains two files: one for the impacted "Node C" (6.17 IMPACTED) and one for the healthy "Node A" (6.14 HEALTHY):
https://gist.github.com/gregolsky/a9748a200f0ee7f4cd3ad4131f6f5c4d

Could you please help us understand this regression? Is there a patch that 6.17.0-1008 is lacking?

Current status:
For now, as a mitigation, we reverted the impacted environments to kernel 6.14. We also tested the generic 6.17 kernel and were able to reproduce the issue on Azure hardware. We were *not* able to reproduce it on AWS virtual machines. We have a repro in a test environment and can provide more information if needed.
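For anyone reproducing this, the read-only diagnostics listed above can be bundled into a single snapshot script. This is only a minimal sketch: it assumes a Linux system with /proc mounted, tolerates the MGLRU sysfs files being absent on kernels without lru_gen, and deliberately does NOT run the destructive drop_caches step.

```shell
#!/bin/sh
# Sketch: collect the read-only memory diagnostics referenced in this report.
# Each probe falls back to a message instead of failing, since some files
# (e.g. /sys/kernel/mm/lru_gen/*) may not exist on every kernel.
snapshot() {
    echo "== kernel =="
    uname -r

    echo "== memory overview =="
    free -h 2>/dev/null || echo "free not available"

    echo "== compaction counters =="
    grep -E "compact_stall|compact_fail|compact_success" /proc/vmstat \
        || echo "no compaction counters found"

    echo "== slab counters =="
    grep -i "slab" /proc/vmstat || echo "no slab counters found"

    echo "== buddy allocator =="
    cat /proc/buddyinfo 2>/dev/null || echo "buddyinfo not readable"

    echo "== MGLRU state =="
    cat /sys/kernel/mm/lru_gen/enabled 2>/dev/null || echo "lru_gen not present"
    cat /sys/kernel/mm/lru_gen/min_ttl_ms 2>/dev/null || true
}

snapshot
```

Running this before and after the workload degrades (and again after a manual drop_caches) on both a 6.14 and a 6.17 node should make the counter deltas directly comparable.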
We reported the issue to Microsoft Azure and are working with their team in addition to filing this bug report.

** Affects: ubuntu
   Importance: Undecided
   Status: New

** Affects: linux-azure (Ubuntu)
   Importance: Undecided
   Status: New

** Tags: 6.17 azure kernel noble regression
** Also affects: linux-azure (Ubuntu)
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2143713

Title:
  Performance regression between 6.14.0-1017 and 6.17.0-1008.8

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+bug/2143713/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
