Re: Regression found in memory stress test with stress-ng
Chia-Lin Kao (AceLan) wrote on Fri, Mar 21, 2025 at 3:47 PM:
>
> On Thu, Mar 20, 2025 at 02:32:55PM +0800, Baokun Li wrote:
> > On 2025/3/20 13:23, Chia-Lin Kao (AceLan) wrote:
> > > On Thu, Mar 20, 2025 at 11:52:20AM +0800, Baokun Li wrote:
> > > > On 2025/3/20 10:49, AceLan Kao wrote:
> > > > > Hi all,
> > > > >
> > > > > We have found a regression while doing a memory stress test using
> > > > > stress-ng with the following command
> > > > > sudo stress-ng --aggressive --verify --timeout 300 --mmapmany 0
> > > > >
> > > > > This issue occurs on recent kernel versions, and we have found that
> > > > > the following commit leads to the issue
> > > > > 4e63aeb5d010 ("blk-wbt: don't throttle swap writes in direct
> > > > > reclaim")
> > > > >
> > > > > Before reverting the commit directly, I wonder if we can identify the
> > > > > issue and implement a solution quickly.
> > > > > Currently, I'm unable to provide logs, as the system becomes
> > > > > unresponsive during testing. If you have any idea to capture logs,
> > > > > please let me know, I'm willing to help.
> > > > Hi AceLan,
> > > >
> > > > I cannot reproduce this issue. The above command will trigger OOM.
> > > > Have you enabled panic_on_oom? (You can check by sysctl
> > > > vm.panic_on_oom).
> > > > Or are there more kernel Oops reports in dmesg?
> > > Actually, there is no kernel panic during the testing.
> > > I tried using kernel magic key to trigger crash and this is what I
> > > got.
> > > It repeats the "Purging GPU memory" message over and over again.
> > >
> > > [ 3605.341706] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> >
> > The messages are coming from i915_gem_shrinker_oom(), so it looks like
> > it's still an OOM issue. I'm just not sure why the OOM is happening so
> > often, like every 0.05 seconds.
> >
> > I'm not familiar with gpu/drm/i915/gem, so I CCed the relevant maintainers
> > to see if they have any thoughts.
> Hi Baokun,
>
> Right, how the i915 shrinks its memory may need some tweak to check if
> it can really shrink the memory.
> But this issue is more likely from the swap.
>
> We found the issue can't be reproduced after reverting that commit, and
> the issue can't be reproduced if we run swapoff to disable swap.
> I'm worried that there might be a bug in the swap code, such that it
> can't handle the OOM situation well.
>
> Do you think we should try adding some debug messages to the block driver
> to see if we can find any clues?

A gentle ping. Does anyone have ideas on how to debug this issue?

> >
> > > [ 3605.346295] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.350815] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.355463] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.360105] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.364743] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.369426] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.374044] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.378467] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.382958] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.387534] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.392130] [ T5739] Purging GPU memory, 0 pages freed, 0 pages
> > > still pinned, 2787 pages left available.
> > > [ 3605.394571] [ C11] sysrq: Trigger a crash
> > > [ 3605.394575] [ C11] Kernel panic - not syncing: sysrq triggered
> > > crash
> > > [ 3605.394580] [ C11] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Kdump:
> > > loaded Not tainted 6.11.0-1016-oem #16-Ubuntu
> > > [ 3605.394586] [ C11] Hardware name: HP HP ZBook Fury 16 G11 Mobile
> > > Workstation PC/8CA7, BIOS W98 Ver. 01.01.12 11/25/2024
> > > [ 3605.394588] [ C11] Call Trace:
> > > [ 3605.394591] [ C11]
> > > [ 3605.394596] [ C11] dump_stack_lvl+0x27/0xa0
> > > [ 3605.394605] [ C11] dump_stack+0x10/0x20
> > > [ 3605.394608] [ C11] panic+0x352/0x3e0
> > > [ 3605.394613] [ C11] sysrq_handle_crash+0x1a/0x20
> > > [ 3605.394618] [ C11] __handle_sysrq+0xf0/0x290
> > > [ 3605.394623] [ C11] sysrq_handle_keypress+0x2f4/0x550
> > > [ 3605.394627] [ C11] sysrq_filter+0x45/0xa0
> > > [ 3605.394631] [ C11] ? sched_balance_find_src_group+0x51/0x280
> > > [ 3605.394637] [ C11] input_handle_events_filter+0x46/0xb0
> > > [ 3605.39
Re: Regression found in memory stress test with stress-ng
On 2025/3/20 13:23, Chia-Lin Kao (AceLan) wrote:
> On Thu, Mar 20, 2025 at 11:52:20AM +0800, Baokun Li wrote:
> > On 2025/3/20 10:49, AceLan Kao wrote:
> > > Hi all,
> > >
> > > We have found a regression while doing a memory stress test using
> > > stress-ng with the following command
> > > sudo stress-ng --aggressive --verify --timeout 300 --mmapmany 0
> > >
> > > This issue occurs on recent kernel versions, and we have found that
> > > the following commit leads to the issue
> > > 4e63aeb5d010 ("blk-wbt: don't throttle swap writes in direct reclaim")
> > >
> > > Before reverting the commit directly, I wonder if we can identify the
> > > issue and implement a solution quickly.
> > > Currently, I'm unable to provide logs, as the system becomes
> > > unresponsive during testing. If you have any idea to capture logs,
> > > please let me know, I'm willing to help.
> > Hi AceLan,
> >
> > I cannot reproduce this issue. The above command will trigger OOM.
> > Have you enabled panic_on_oom? (You can check by sysctl vm.panic_on_oom).
> > Or are there more kernel Oops reports in dmesg?
> Actually, there is no kernel panic during the testing.
> I tried using kernel magic key to trigger crash and this is what I got.
> It repeats the "Purging GPU memory" message over and over again.
>
> [ 3605.341706] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.

The messages are coming from i915_gem_shrinker_oom(), so it looks like
it's still an OOM issue. I'm just not sure why the OOM is happening so
often, like every 0.05 seconds.

I'm not familiar with gpu/drm/i915/gem, so I CCed the relevant maintainers
to see if they have any thoughts.

> [ 3605.346295] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.350815] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.355463] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.360105] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.364743] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.369426] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.374044] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.378467] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.382958] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.387534] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.392130] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still pinned, 2787 pages left available.
> [ 3605.394571] [ C11] sysrq: Trigger a crash
> [ 3605.394575] [ C11] Kernel panic - not syncing: sysrq triggered crash
> [ 3605.394580] [ C11] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Kdump: loaded Not tainted 6.11.0-1016-oem #16-Ubuntu
> [ 3605.394586] [ C11] Hardware name: HP HP ZBook Fury 16 G11 Mobile Workstation PC/8CA7, BIOS W98 Ver. 01.01.12 11/25/2024
> [ 3605.394588] [ C11] Call Trace:
> [ 3605.394591] [ C11]
> [ 3605.394596] [ C11] dump_stack_lvl+0x27/0xa0
> [ 3605.394605] [ C11] dump_stack+0x10/0x20
> [ 3605.394608] [ C11] panic+0x352/0x3e0
> [ 3605.394613] [ C11] sysrq_handle_crash+0x1a/0x20
> [ 3605.394618] [ C11] __handle_sysrq+0xf0/0x290
> [ 3605.394623] [ C11] sysrq_handle_keypress+0x2f4/0x550
> [ 3605.394627] [ C11] sysrq_filter+0x45/0xa0
> [ 3605.394631] [ C11] ? sched_balance_find_src_group+0x51/0x280
> [ 3605.394637] [ C11] input_handle_events_filter+0x46/0xb0
> [ 3605.394643] [ C11] input_pass_values+0x142/0x170
> [ 3605.394647] [ C11] input_event_dispose+0x167/0x170
> [ 3605.394651] [ C11] input_handle_event+0x41/0x80
> [ 3605.394656] [ C11] input_event+0x51/0x80
> [ 3605.394659] [ C11] atkbd_receive_byte+0x805/0x8f0
> [ 3605.394664] [ C11] ps2_interrupt+0xb4/0x1b0
> [ 3605.394668] [ C11] serio_interrupt+0x49/0xa0
> [ 3605.394673] [ C11] i8042_interrupt+0x196/0x4c0
> [ 3605.394677] [ C11] ? enqueue_hrtimer+0x4d/0xc0
> [ 3605.394682] [ C11] ? ktime_get+0x3f/0xf0
> [ 3605.394686] [ C11] ? lapic_next_deadline+0x2c/0x50
> [ 3605.394691] [ C11] __handle_irq_event_percpu+0x4c/0x1b0
> [ 3605.394696] [ C11] ? sched_clock_noinstr+0x9/0x10
> [ 3605.394700] [ C11] handle_irq_event+0x39/0x80
> [ 3605.394706] [ C11] handle_edge_irq+0x8c/0x250
> [ 3605.394710] [ C11] __common_interrupt+0x4e/0x110
> [ 3605.394715] [ C11] common_interrupt+0xb1/0xe0
> [ 3605.394718] [ C11]
> [ 3605.394720] [ C11]
> [ 3605.394721] [ C11] asm_common_interrupt+0x27/0x40
> [ 3605.394726] [ C11] RIP: 0010:poll_idle+0x4f/0xac
> [ 3605.394731] [ C11] Code: 00 00 65 4c 8b 3d a1 78 7b 63 f0 41 80 4f 02 20 49 8b 07 a8 08 75 32 4c 89 ef 48 89 de e8 d9 fe ff ff 49
Re: Regression found in memory stress test with stress-ng
On Thu, Mar 20, 2025 at 02:32:55PM +0800, Baokun Li wrote:
> On 2025/3/20 13:23, Chia-Lin Kao (AceLan) wrote:
> > On Thu, Mar 20, 2025 at 11:52:20AM +0800, Baokun Li wrote:
> > > On 2025/3/20 10:49, AceLan Kao wrote:
> > > > Hi all,
> > > >
> > > > We have found a regression while doing a memory stress test using
> > > > stress-ng with the following command
> > > > sudo stress-ng --aggressive --verify --timeout 300 --mmapmany 0
> > > >
> > > > This issue occurs on recent kernel versions, and we have found that
> > > > the following commit leads to the issue
> > > > 4e63aeb5d010 ("blk-wbt: don't throttle swap writes in direct
> > > > reclaim")
> > > >
> > > > Before reverting the commit directly, I wonder if we can identify the
> > > > issue and implement a solution quickly.
> > > > Currently, I'm unable to provide logs, as the system becomes
> > > > unresponsive during testing. If you have any idea to capture logs,
> > > > please let me know, I'm willing to help.
> > > Hi AceLan,
> > >
> > > I cannot reproduce this issue. The above command will trigger OOM.
> > > Have you enabled panic_on_oom? (You can check by sysctl vm.panic_on_oom).
> > > Or are there more kernel Oops reports in dmesg?
> > Actually, there is no kernel panic during the testing.
> > I tried using kernel magic key to trigger crash and this is what I
> > got.
> > It repeats the "Purging GPU memory" message over and over again.
> >
> > [ 3605.341706] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
>
> The messages are coming from i915_gem_shrinker_oom(), so it looks like
> it's still an OOM issue. I'm just not sure why the OOM is happening so
> often, like every 0.05 seconds.
>
> I'm not familiar with gpu/drm/i915/gem, so I CCed the relevant maintainers
> to see if they have any thoughts.

Hi Baokun,

Right, how the i915 shrinks its memory may need some tweak to check if
it can really shrink the memory.
But this issue is more likely from the swap.

We found the issue can't be reproduced after reverting that commit, and
the issue can't be reproduced if we run swapoff to disable swap.
I'm worried that there might be a bug in the swap code, such that it
can't handle the OOM situation well.

Do you think we should try adding some debug messages to the block driver
to see if we can find any clues?

>
> > [ 3605.346295] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.350815] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.355463] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.360105] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.364743] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.369426] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.374044] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.378467] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.382958] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.387534] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.392130] [ T5739] Purging GPU memory, 0 pages freed, 0 pages still
> > pinned, 2787 pages left available.
> > [ 3605.394571] [ C11] sysrq: Trigger a crash
> > [ 3605.394575] [ C11] Kernel panic - not syncing: sysrq triggered crash
> > [ 3605.394580] [ C11] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Kdump:
> > loaded Not tainted 6.11.0-1016-oem #16-Ubuntu
> > [ 3605.394586] [ C11] Hardware name: HP HP ZBook Fury 16 G11 Mobile
> > Workstation PC/8CA7, BIOS W98 Ver. 01.01.12 11/25/2024
> > [ 3605.394588] [ C11] Call Trace:
> > [ 3605.394591] [ C11]
> > [ 3605.394596] [ C11] dump_stack_lvl+0x27/0xa0
> > [ 3605.394605] [ C11] dump_stack+0x10/0x20
> > [ 3605.394608] [ C11] panic+0x352/0x3e0
> > [ 3605.394613] [ C11] sysrq_handle_crash+0x1a/0x20
> > [ 3605.394618] [ C11] __handle_sysrq+0xf0/0x290
> > [ 3605.394623] [ C11] sysrq_handle_keypress+0x2f4/0x550
> > [ 3605.394627] [ C11] sysrq_filter+0x45/0xa0
> > [ 3605.394631] [ C11] ? sched_balance_find_src_group+0x51/0x280
> > [ 3605.394637] [ C11] input_handle_events_filter+0x46/0xb0
> > [ 3605.394643] [ C11] input_pass_values+0x142/0x170
> > [ 3605.394647] [ C11] input_event_dispose+0x167/0x170
> > [ 3605.394651] [ C11] input_handle_event+0x41/0x80
> > [ 3605.394656] [ C11] input_event+0x51/0x80
> > [ 3605.394659] [ C11] atkbd_receive_byte+0x805/0x8f0
> > [ 3605.394664] [