Re: [PATCH] mm/memory hotplog: postpone the reset of obsolete pgdat
On 2015/3/12 13:10, David Rientjes wrote:
> On Thu, 12 Mar 2015, Gu Zheng wrote:
>
>> Qiu Xishi reported the following BUG when testing hot-add/hot-remove node
>> under stress condition.
>> [ 1422.011064] BUG: unable to handle kernel paging request at 00025f60
>> [ 1422.011086] IP: [] next_online_pgdat+0x1/0x50
>> [ 1422.011178] PGD 0
>> [ 1422.011180] Oops: [#1] SMP
>> [ 1422.011409] ACPI: Device does not support D3cold
>> [ 1422.011961] Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop
>> dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel
>> ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb
>> dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core
>> iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad
>> rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac
>> scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod
>> [last unloaded: rasf]
>> [ 1422.012006] CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
>> [ 1422.012024] Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
>> [ 1422.012065] Workqueue: events vmstat_update
>> [ 1422.012084] task: a800d32c ti: a800d32ae000 task.ti: a800d32ae000
>> [ 1422.012165] RIP: 0010:[] [] next_online_pgdat+0x1/0x50
>> [ 1422.012205] RSP: 0018:a800d32afce8 EFLAGS: 00010286
>> [ 1422.012225] RAX: 1440 RBX: 81da53b8 RCX: 0082
>> [ 1422.012226] RDX: RSI: 0082 RDI:
>> [ 1422.012254] RBP: a800d32afd28 R08: 81c93bfc R09: 81cbdc96
>> [ 1422.012272] R10: 40ec R11: 00a0 R12: a800fffb3440
>> [ 1422.012290] R13: a800d32afd38 R14: 0017 R15: a800e6616800
>> [ 1422.012292] FS: () GS:a800e660() knlGS:
>> [ 1422.012314] CS: 0010 DS: ES: CR0: 80050033
>> [ 1422.012328] CR2: 00025f60 CR3: 01a0b000 CR4: 001407e0
>> [ 1422.012328] DR0: DR1: DR2:
>> [ 1422.012328] DR3: DR6: fffe0ff0 DR7: 0400
>> [ 1422.012328] Stack:
>> [ 1422.012328] a800d32afd28 81126ca5 a800 814b4314
>> [ 1422.012328] a800d32ae010 a800e6616180 a800fffb3440
>> [ 1422.012328] a800d32afde8 81128220 0013 0038
>> [ 1422.012328] Call Trace:
>> [ 1422.012328] [] ? next_zone+0xc5/0x150
>> [ 1422.012328] [] ? __schedule+0x544/0x780
>> [ 1422.012328] [] refresh_cpu_vm_stats+0xd0/0x140
>
> So refresh_cpu_vm_stats() is doing for_each_populated_zone(), which calls
> next_zone(), and we've iterated over all zones for a particular node. We
> call next_online_pgdat() with the pgdat of the previous zone's
> zone->zone_pgdat, and that explodes on dereference, right?
>
> I have to ask because 3.10 is an ancient kernel, a more recent example for
> the changelog would be helpful if it's reproducible.
>
>> [ 1422.012328] [] vmstat_update+0x11/0x50
>> [ 1422.012328] [] process_one_work+0x194/0x3d0
>> [ 1422.012328] [] worker_thread+0x12b/0x410
>> [ 1422.012328] [] ? manage_workers+0x1a0/0x1a0
>> [ 1422.012328] [] kthread+0xc6/0xd0
>> [ 1422.012328] [] ? kthread_freezable_should_stop+0x70/0x70
>> [ 1422.012328] [] ret_from_fork+0x7c/0xb0
>> [ 1422.012328] [] ? kthread_freezable_should_stop+0x70/0x70
>>
>> The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
>> try_offline_node, which will reset all the content of pgdat to 0, as the
>> pgdat is accessed lock-less, so that the users still using the pgdat will
>> panic, such as the vmstat_update routine.
>>
>
> Correct me if I'm wrong, but it's not accessing pgdat at all, it's
> accessing zone->zone_pgdat->node_id and zone->zone_pgdat is invalid. I
> don't _think_ there's anything different with 3.10, but I'd be happy to be
> shown wrong.
>
>> So the solution here is postponing the reset of obsolete pgdat from
>> try_offline_node() to hotadd_new_pgdat(), and just resetting
>> pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset 0
>> to avoid breaking pointer information in pgdat.
>>
>
> I don't see how memset(pgdat, 0, sizeof(*pgdat)) can cause the error
> above, can you be more specific?
>

Hi David,

process A:                          offline node XX:

vmstat_update()
  refresh_cpu_vm_stats()
    for_each_populated_zone()
      find online node XX
      cond_resched()
                                    offline cpu and memory, then try_offline_node()
                                    node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
      zone = next_zone(zone)
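The interleaving above can be replayed in user space as a rough, single-threaded sketch. Everything in it is an assumption made for illustration: `struct zone` and `struct pgdat` are cut-down stand-ins for the kernel structures, and `node_init()`, `offline_with_memset()` and `walker_can_continue()` are hypothetical helpers, not kernel functions. The point is only that a walker still holding a zone pointer finds `zone->zone_pgdat` zeroed once the memset has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NR_ZONES 3

struct pgdat;

/* Cut-down stand-in for struct zone: only the back-pointer matters here. */
struct zone {
	struct pgdat *zone_pgdat;
};

/* Cut-down stand-in for pg_data_t. */
struct pgdat {
	struct zone node_zones[MAX_NR_ZONES];
	int nr_zones;
	int node_id;
};

/* Hypothetical helper: bring a node online and wire up the back-pointers. */
void node_init(struct pgdat *pgdat, int nid)
{
	assert(pgdat != NULL);
	pgdat->node_id = nid;
	pgdat->nr_zones = MAX_NR_ZONES;
	for (int i = 0; i < MAX_NR_ZONES; i++)
		pgdat->node_zones[i].zone_pgdat = pgdat;
}

/* The step at the end of try_offline_node() that the patch removes. */
void offline_with_memset(struct pgdat *pgdat)
{
	/* Zeroes the embedded zones too, so every zone_pgdat reads as NULL
	 * (assuming the usual all-zero-bits representation of NULL). */
	memset(pgdat, 0, sizeof(*pgdat));
}

/*
 * Hypothetical check modelling the first thing a zone walker does with the
 * zone it is holding: read zone->zone_pgdat. Returns true when that pointer
 * can still be dereferenced.
 */
bool walker_can_continue(const struct zone *zone)
{
	return zone->zone_pgdat != NULL;
}
```

Calling `node_init()`, keeping a pointer to `node_zones[0]` as process A would across `cond_resched()`, then calling `offline_with_memset()` makes `walker_can_continue()` flip from true to false for that same pointer, which corresponds to the window described in the two-column trace.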
Re: [PATCH] mm/memory hotplog: postpone the reset of obsolete pgdat
On Thu, 12 Mar 2015, Gu Zheng wrote:

> Qiu Xishi reported the following BUG when testing hot-add/hot-remove node
> under stress condition.
> [ 1422.011064] BUG: unable to handle kernel paging request at 00025f60
> [ 1422.011086] IP: [] next_online_pgdat+0x1/0x50
> [ 1422.011178] PGD 0
> [ 1422.011180] Oops: [#1] SMP
> [ 1422.011409] ACPI: Device does not support D3cold
> [ 1422.011961] Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop
> dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel
> ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb
> dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core
> iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad
> rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac
> scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod
> [last unloaded: rasf]
> [ 1422.012006] CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
> [ 1422.012024] Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
> [ 1422.012065] Workqueue: events vmstat_update
> [ 1422.012084] task: a800d32c ti: a800d32ae000 task.ti: a800d32ae000
> [ 1422.012165] RIP: 0010:[] [] next_online_pgdat+0x1/0x50
> [ 1422.012205] RSP: 0018:a800d32afce8 EFLAGS: 00010286
> [ 1422.012225] RAX: 1440 RBX: 81da53b8 RCX: 0082
> [ 1422.012226] RDX: RSI: 0082 RDI:
> [ 1422.012254] RBP: a800d32afd28 R08: 81c93bfc R09: 81cbdc96
> [ 1422.012272] R10: 40ec R11: 00a0 R12: a800fffb3440
> [ 1422.012290] R13: a800d32afd38 R14: 0017 R15: a800e6616800
> [ 1422.012292] FS: () GS:a800e660() knlGS:
> [ 1422.012314] CS: 0010 DS: ES: CR0: 80050033
> [ 1422.012328] CR2: 00025f60 CR3: 01a0b000 CR4: 001407e0
> [ 1422.012328] DR0: DR1: DR2:
> [ 1422.012328] DR3: DR6: fffe0ff0 DR7: 0400
> [ 1422.012328] Stack:
> [ 1422.012328] a800d32afd28 81126ca5 a800 814b4314
> [ 1422.012328] a800d32ae010 a800e6616180 a800fffb3440
> [ 1422.012328] a800d32afde8 81128220 0013 0038
> [ 1422.012328] Call Trace:
> [ 1422.012328] [] ? next_zone+0xc5/0x150
> [ 1422.012328] [] ? __schedule+0x544/0x780
> [ 1422.012328] [] refresh_cpu_vm_stats+0xd0/0x140

So refresh_cpu_vm_stats() is doing for_each_populated_zone(), which calls
next_zone(), and we've iterated over all zones for a particular node. We
call next_online_pgdat() with the pgdat of the previous zone's
zone->zone_pgdat, and that explodes on dereference, right?

I have to ask because 3.10 is an ancient kernel, a more recent example for
the changelog would be helpful if it's reproducible.

> [ 1422.012328] [] vmstat_update+0x11/0x50
> [ 1422.012328] [] process_one_work+0x194/0x3d0
> [ 1422.012328] [] worker_thread+0x12b/0x410
> [ 1422.012328] [] ? manage_workers+0x1a0/0x1a0
> [ 1422.012328] [] kthread+0xc6/0xd0
> [ 1422.012328] [] ? kthread_freezable_should_stop+0x70/0x70
> [ 1422.012328] [] ret_from_fork+0x7c/0xb0
> [ 1422.012328] [] ? kthread_freezable_should_stop+0x70/0x70
>
> The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
> try_offline_node, which will reset all the content of pgdat to 0, as the
> pgdat is accessed lock-less, so that the users still using the pgdat will
> panic, such as the vmstat_update routine.
>

Correct me if I'm wrong, but it's not accessing pgdat at all, it's
accessing zone->zone_pgdat->node_id and zone->zone_pgdat is invalid. I
don't _think_ there's anything different with 3.10, but I'd be happy to be
shown wrong.

> So the solution here is postponing the reset of obsolete pgdat from
> try_offline_node() to hotadd_new_pgdat(), and just resetting
> pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset 0
> to avoid breaking pointer information in pgdat.
>

I don't see how memset(pgdat, 0, sizeof(*pgdat)) can cause the error
above, can you be more specific?

> Reported-by: Xishi Qiu
> Suggested-by: KAMEZAWA Hiroyuki
> Cc:
> Signed-off-by: Gu Zheng
> ---
>  mm/memory_hotplug.c | 13 -
>  1 files changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9fab107..65842d6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1092,6 +1092,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
>  			return NULL;
>
>  		arch_refresh_nodedata(nid, pgdat);
> +	} else {
> +		/* Reset the nr_zones and
[PATCH] mm/memory hotplog: postpone the reset of obsolete pgdat
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node
under stress condition.
[ 1422.011064] BUG: unable to handle kernel paging request at 00025f60
[ 1422.011086] IP: [] next_online_pgdat+0x1/0x50
[ 1422.011178] PGD 0
[ 1422.011180] Oops: [#1] SMP
[ 1422.011409] ACPI: Device does not support D3cold
[ 1422.011961] Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop
dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel
ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb
dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core
iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad
rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac
scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod
[last unloaded: rasf]
[ 1422.012006] CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
[ 1422.012024] Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
[ 1422.012065] Workqueue: events vmstat_update
[ 1422.012084] task: a800d32c ti: a800d32ae000 task.ti: a800d32ae000
[ 1422.012165] RIP: 0010:[] [] next_online_pgdat+0x1/0x50
[ 1422.012205] RSP: 0018:a800d32afce8 EFLAGS: 00010286
[ 1422.012225] RAX: 1440 RBX: 81da53b8 RCX: 0082
[ 1422.012226] RDX: RSI: 0082 RDI:
[ 1422.012254] RBP: a800d32afd28 R08: 81c93bfc R09: 81cbdc96
[ 1422.012272] R10: 40ec R11: 00a0 R12: a800fffb3440
[ 1422.012290] R13: a800d32afd38 R14: 0017 R15: a800e6616800
[ 1422.012292] FS: () GS:a800e660() knlGS:
[ 1422.012314] CS: 0010 DS: ES: CR0: 80050033
[ 1422.012328] CR2: 00025f60 CR3: 01a0b000 CR4: 001407e0
[ 1422.012328] DR0: DR1: DR2:
[ 1422.012328] DR3: DR6: fffe0ff0 DR7: 0400
[ 1422.012328] Stack:
[ 1422.012328] a800d32afd28 81126ca5 a800 814b4314
[ 1422.012328] a800d32ae010 a800e6616180 a800fffb3440
[ 1422.012328] a800d32afde8 81128220 0013 0038
[ 1422.012328] Call Trace:
[ 1422.012328] [] ? next_zone+0xc5/0x150
[ 1422.012328] [] ? __schedule+0x544/0x780
[ 1422.012328] [] refresh_cpu_vm_stats+0xd0/0x140
[ 1422.012328] [] vmstat_update+0x11/0x50
[ 1422.012328] [] process_one_work+0x194/0x3d0
[ 1422.012328] [] worker_thread+0x12b/0x410
[ 1422.012328] [] ? manage_workers+0x1a0/0x1a0
[ 1422.012328] [] kthread+0xc6/0xd0
[ 1422.012328] [] ? kthread_freezable_should_stop+0x70/0x70
[ 1422.012328] [] ret_from_fork+0x7c/0xb0
[ 1422.012328] [] ? kthread_freezable_should_stop+0x70/0x70

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-less, so that the users still using the pgdat will
panic, such as the vmstat_update routine.

So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset 0
to avoid breaking pointer information in pgdat.

Reported-by: Xishi Qiu
Suggested-by: KAMEZAWA Hiroyuki
Cc:
Signed-off-by: Gu Zheng
---
 mm/memory_hotplug.c | 13 -
 1 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fab107..65842d6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1092,6 +1092,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 			return NULL;

 		arch_refresh_nodedata(nid, pgdat);
+	} else {
+		/* Reset the nr_zones and classzone_idx to 0 before reuse */
+		pgdat->nr_zones = 0;
+		pgdat->classzone_idx = 0;
 	}

 	/* we can use NODE_DATA(nid) from here */
@@ -1977,15 +1981,6 @@ void try_offline_node(int nid)
 		if (is_vmalloc_addr(zone->wait_table))
 			vfree(zone->wait_table);
 	}
-
-	/*
-	 * Since there is no way to guarentee the address of pgdat/zone is not
-	 * on stack of any kernel threads or used by other kernel objects
-	 * without reference counting or other symchronizing method, do not
-	 * reset node_data and free pgdat here. Just reset it to 0 and reuse
-	 * the memory when the node is online again.
-	 */
-	memset(pgdat, 0, sizeof(*pgdat));
 }
 EXPORT_SYMBOL(try_offline_node);

-- 
1.7.7
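As a rough user-space illustration of what the new else-branch preserves, the sketch below resets only the two counters and leaves the back-pointers alone. It assumes cut-down `struct zone` and `struct pgdat` stand-ins with plain `int` fields (the kernel types differ), and `node_init()` and `reset_for_reuse()` are hypothetical names chosen for this example, not functions from mm/memory_hotplug.c.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_ZONES 3

struct pgdat;

/* Cut-down stand-in for struct zone. */
struct zone {
	struct pgdat *zone_pgdat;
};

/* Cut-down stand-in for pg_data_t; field types are simplified to int. */
struct pgdat {
	struct zone node_zones[MAX_NR_ZONES];
	int nr_zones;
	int classzone_idx;
	int node_id;
};

/* Hypothetical helper: state of a node that has been online once. */
void node_init(struct pgdat *pgdat, int nid)
{
	assert(pgdat != NULL);
	pgdat->node_id = nid;
	pgdat->nr_zones = MAX_NR_ZONES;
	pgdat->classzone_idx = MAX_NR_ZONES - 1;
	for (int i = 0; i < MAX_NR_ZONES; i++)
		pgdat->node_zones[i].zone_pgdat = pgdat;
}

/*
 * Mirrors the two assignments the patch adds for a pgdat that is being
 * reused: only the counters are cleared, nothing else is touched.
 */
void reset_for_reuse(struct pgdat *pgdat)
{
	pgdat->nr_zones = 0;
	pgdat->classzone_idx = 0;
}
```

After `node_init()` followed by `reset_for_reuse()`, `nr_zones` and `classzone_idx` are 0 while each `node_zones[i].zone_pgdat` still points at the node, so a walker that kept a zone pointer from before the offline can still follow it. Whether that is sufficient for every lock-less reader in the kernel is a question for the thread, not something this sketch shows.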