Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
On 11/08/2013 01:21 PM, HATAYAMA Daisuke wrote:
> (2013/11/08 14:12), Atsushi Kumagai wrote:
>> Hello Jingbai,
>>
>> (2013/11/07 17:58), Jingbai Ma wrote:
>>> On 11/06/2013 10:23 PM, Vivek Goyal wrote:
>>>> On Wed, Nov 06, 2013 at 02:21:39AM +, Atsushi Kumagai wrote:
>>>>> (2013/11/06 5:27), Vivek Goyal wrote:
>>>>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>>>>>> This patch set intends to exclude unnecessary hugepages from vmcore dump
>>>>>>> file.
>>>>>>>
>>>>>>> This patch requires the kernel patch to export necessary data
>>>>>>> structures into
>>>>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>>>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>>>>>
>>>>>>> This patch introduces two new dump levels 32 and 64 to exclude all
>>>>>>> unused and
>>>>>>> active hugepages. The level to exclude all unnecessary pages will be
>>>>>>> 127 now.
>>>>>>
>>>>>> Interesting. Why should hugepages be treated any differently than normal
>>>>>> pages?
>>>>>>
>>>>>> If user asked to filter out free page, then it should be filtered and
>>>>>> it should not matter whether it is a huge page or not?
>>>>>
>>>>> I'm making a RFC patch of hugepages filtering based on such policy.
>>>>>
>>>>> I attach the prototype version.
>>>>> It's able to filter out also THPs, and suitable for cyclic processing
>>>>> because it depends on mem_map and looking up it can be divided into
>>>>> cycles. This is the same idea as page_is_buddy().
>>>>>
>>>>> So I think it's better.
>>>>
>>>> Agreed. Being able to treat hugepages in same manner as other pages
>>>> sounds good.
>>>>
>>>> Jingbai, looks good to you?
>>>
>>> It looks good to me.
>>>
>>> My only concern is by this way, we only can exclude all hugepage together,
>>> but can't exclude the free hugepages only. I'm not sure if user need to
>>> dump out the activated hugepage only.
>>>
>>> Kumagai-san, please correct me, if I'm wrong.
>>
>> Yes, my patch treats all allocated hugetlbfs pages as user pages,
>> doesn't distinguish whether the pages are actually used or not.
>> I made so because I guess it's enough for almost all users.
>>
>> We can introduce new dump level after it's needed actually,
>> but I don't think now is the time. To introduce it without
>> demand will make this tool just more complex.
>>
>
> Typically, users would allocate huge pages as much as actually they use only,
> in order not to waste system memory. So, this design seems reasonable.
>

OK, It looks reasonable.

Thanks!

--
Thanks,
Jingbai Ma
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
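The policy agreed on in this thread (no new dump levels; allocated hugetlbfs pages and THPs are handled like user data) can be pictured with a small sketch. This is not code from Kumagai-san's prototype, which is not shown here. The enum, the function names and the page kinds are hypothetical; only the bit values 1, 8 and 16 come from the existing dump-level table. Cache pages are left out to keep it short.

```c
#include <assert.h>

/* Hypothetical page kinds, for illustration only. */
enum page_kind {
    PK_ZERO,
    PK_USER,
    PK_FREE,
    PK_HUGETLB,   /* allocated hugetlbfs page, in use or not */
    PK_THP,       /* transparent huge page */
    PK_KIND_COUNT
};

/* Map a page kind to an existing dump-level bit.
 * Assumption from the thread: hugetlbfs pages and THPs count as user data. */
int level_bit_for(enum page_kind kind)
{
    assert((int)kind >= 0 && kind < PK_KIND_COUNT);
    switch (kind) {
    case PK_ZERO:
        return 1;
    case PK_FREE:
        return 16;
    case PK_USER:
    case PK_HUGETLB:
    case PK_THP:
        return 8;
    default:
        return 0;
    }
}

/* Nonzero if a page of this kind is dropped at the given dump level. */
int is_excluded(enum page_kind kind, int dump_level)
{
    return (dump_level & level_bit_for(kind)) != 0;
}
```

With this mapping, is_excluded(PK_HUGETLB, 8) is nonzero and is_excluded(PK_HUGETLB, 16) is zero, which is the trade-off raised above: hugepages go out together with user data and cannot be selected separately.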
Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
On 11/06/2013 10:23 PM, Vivek Goyal wrote:
> On Wed, Nov 06, 2013 at 02:21:39AM +, Atsushi Kumagai wrote:
>> (2013/11/06 5:27), Vivek Goyal wrote:
>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>>> This patch set intends to exclude unnecessary hugepages from vmcore dump
>>>> file.
>>>>
>>>> This patch requires the kernel patch to export necessary data
>>>> structures into
>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>>
>>>> This patch introduces two new dump levels 32 and 64 to exclude all
>>>> unused and
>>>> active hugepages. The level to exclude all unnecessary pages will be
>>>> 127 now.
>>>
>>> Interesting. Why should hugepages be treated any differently than normal
>>> pages?
>>>
>>> If user asked to filter out free page, then it should be filtered and
>>> it should not matter whether it is a huge page or not?
>>
>> I'm making a RFC patch of hugepages filtering based on such policy.
>>
>> I attach the prototype version.
>> It's able to filter out also THPs, and suitable for cyclic processing
>> because it depends on mem_map and looking up it can be divided into
>> cycles. This is the same idea as page_is_buddy().
>>
>> So I think it's better.
>
> Agreed. Being able to treat hugepages in same manner as other pages
> sounds good.
>
> Jingbai, looks good to you?

It looks good to me.

My only concern is by this way, we only can exclude all hugepage together,
but can't exclude the free hugepages only. I'm not sure if user need to
dump out the activated hugepage only.

Kumagai-san, please correct me, if I'm wrong.

> Thanks
> Vivek
>
>> --
>> Thanks
>> Atsushi Kumagai

--
Thanks,
Jingbai Ma
Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
On 11/06/2013 04:26 AM, Vivek Goyal wrote:
> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>> This patch set intends to exclude unnecessary hugepages from vmcore dump file.
>>
>> This patch requires the kernel patch to export necessary data structures into
>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>
>> This patch introduces two new dump levels 32 and 64 to exclude all unused and
>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
>
> Interesting. Why should hugepages be treated any differently than normal
> pages?
>
> If user asked to filter out free page, then it should be filtered and
> it should not matter whether it is a huge page or not?

Yes, free hugepages should be filtered out with other free pages. It sounds
reasonable.

But for active hugepages, I would offer user more choices/flexibility.
(maybe bad).
I'm OK to filter active hugepages with other user data page.

Any other comments?

> Thanks
> Vivek

--
Thanks,
Jingbai Ma
[PATCH 2/3] makedumpfile: hugepage filtering: add excluding hugepage messages
Add messages for print_info.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 print_info.c |   12 +++++++-----
 print_info.h |    2 ++
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/print_info.c b/print_info.c
index 06939e0..978d9fb 100644
--- a/print_info.c
+++ b/print_info.c
@@ -103,17 +103,19 @@ print_usage(void)
 	MSG(" The maximum of Dump_Level is 31.\n");
 	MSG(" Note that Dump_Level for Xen dump filtering is 0 or 1.\n");
 	MSG("\n");
-	MSG("       |         cache    cache\n");
-	MSG(" Dump  |  zero   without  with     user    free\n");
-	MSG(" Level |  page   private  private  data    page\n");
-	MSG("-------+---------------------------------------\n");
+	MSG("       |         cache    cache                    free    active\n");
+	MSG(" Dump  |  zero   without  with     user    free    huge    huge\n");
+	MSG(" Level |  page   private  private  data    page    page    page\n");
+	MSG("-------+----------------------------------------------------------\n");
 	MSG("    0  |\n");
 	MSG("    1  |  X\n");
 	MSG("    2  |         X\n");
 	MSG("    4  |         X        X\n");
 	MSG("    8  |                           X\n");
 	MSG("   16  |                                   X\n");
-	MSG("   31  |  X      X        X        X       X\n");
+	MSG("   32  |                                           X\n");
+	MSG("   64  |                                           X       X\n");
+	MSG("  127  |  X      X        X        X       X       X       X\n");
 	MSG("\n");
 	MSG(" [-E]:\n");
 	MSG(" Create DUMPFILE in the ELF format.\n");
diff --git a/print_info.h b/print_info.h
index 01e3706..8461df6 100644
--- a/print_info.h
+++ b/print_info.h
@@ -35,6 +35,8 @@ void print_execution_time(char *step_name, struct timeval *tv_start);
 #define PROGRESS_HOLES		"Checking for memory holes  "
 #define PROGRESS_UNN_PAGES	"Excluding unnecessary pages"
 #define PROGRESS_FREE_PAGES	"Excluding free pages       "
+#define PROGRESS_FREE_HUGE	"Excluding free huge pages  "
+#define PROGRESS_ACTIVE_HUGE	"Excluding active huge pages"
 #define PROGRESS_ZERO_PAGES	"Excluding zero pages       "
 #define PROGRESS_XEN_DOMAIN	"Excluding xen user domain  "
[PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
This patch set intends to exclude unnecessary hugepages from the vmcore dump file.

This patch requires the kernel patch to export necessary data structures into
vmcore: "kexec: export hugepage data structure into vmcoreinfo"
http://lists.infradead.org/pipermail/kexec/2013-November/009997.html

This patch introduces two new dump levels, 32 and 64, to exclude all unused and
active hugepages. The level to exclude all unnecessary pages will be 127 now.

       |         cache    cache                    free    active
 Dump  |  zero   without  with     user    free    huge    huge
 Level |  page   private  private  data    page    page    page
-------+----------------------------------------------------------
    0  |
    1  |  X
    2  |         X
    4  |         X        X
    8  |                           X
   16  |                                   X
   32  |                                           X
   64  |                                           X       X
  127  |  X      X        X        X       X       X       X

example:
To exclude all unnecessary pages:
makedumpfile -c --message-level 23 -d 127 /proc/vmcore /var/crash/kdump

To exclude all unnecessary pages but keep active hugepages:
makedumpfile -c --message-level 23 -d 63 /proc/vmcore /var/crash/kdump

---

Jingbai Ma (3):
      makedumpfile: hugepage filtering: add hugepage filtering functions
      makedumpfile: hugepage filtering: add excluding hugepage messages
      makedumpfile: hugepage filtering: add new dump levels for manual page

 makedumpfile.8 |  170
 makedumpfile.c |  272
 makedumpfile.h |   19
 print_info.c   |   12
 print_info.h   |    2
 5 files changed, 431 insertions(+), 44 deletions(-)

--
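As a reading aid for the table above, the dump level can be decoded as a bitmask. The sketch below is not taken from the patch. The identifier names are made up for illustration; only the numeric values come from the table. It also assumes, as the table appears to show, that level 4 implies the "cache without private" column and that level 64 implies the "free huge page" column.

```c
#include <assert.h>

/* Bit values from the table in the cover letter. The identifier names
 * are illustrative; only the numbers come from the posting. */
enum dump_level_bit {
    DL_ZERO          = 1,
    DL_CACHE         = 2,
    DL_CACHE_PRIVATE = 4,
    DL_USER_DATA     = 8,
    DL_FREE          = 16,
    DL_FREE_HUGE     = 32,   /* proposed by this series */
    DL_ACTIVE_HUGE   = 64,   /* proposed by this series */
    DL_ALL           = 127
};

/* Nonzero if pages of class `cls` are excluded at dump level `level`. */
int dl_excludes(int level, enum dump_level_bit cls)
{
    assert(level >= 0 && level <= DL_ALL);
    if (level & cls)
        return 1;
    /* Level 4 row: an X in both cache columns. */
    if (cls == DL_CACHE && (level & DL_CACHE_PRIVATE))
        return 1;
    /* Level 64 row: an X in both huge page columns. */
    if (cls == DL_FREE_HUGE && (level & DL_ACTIVE_HUGE))
        return 1;
    return 0;
}
```

For the two example commands: with -d 127 every class is excluded, and with -d 63 dl_excludes(63, DL_ACTIVE_HUGE) is zero, so active hugepages stay in the dump.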
[PATCH 3/3] makedumpfile: hugepage filtering: add new dump levels for manual page
Add new dump levels for makedumpfile manual page.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 makedumpfile.8 |  170
 1 files changed, 133 insertions(+), 37 deletions(-)

diff --git a/makedumpfile.8 b/makedumpfile.8
index adeb811..70e8732 100644
--- a/makedumpfile.8
+++ b/makedumpfile.8
@@ -164,43 +164,139 @@ by dump_level 11, makedumpfile retries it by dump_level 31.
 .br
 # makedumpfile \-d 11,31 \-x vmlinux /proc/vmcore dumpfile
 
-       |      |cache  |cache  |      |
-  dump | zero |without|with   | user | free
- level | page |private|private| data | page
-.br
-\-\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-
-     0 |      |       |       |      |
-     1 |  X   |       |       |      |
-     2 |      |   X   |       |      |
-     3 |  X   |   X   |       |      |
-     4 |      |   X   |   X   |      |
-     5 |  X   |   X   |   X   |      |
-     6 |      |   X   |   X   |      |
-     7 |  X   |   X   |   X   |      |
-     8 |      |       |       |  X   |
-     9 |  X   |       |       |  X   |
-    10 |      |   X   |       |  X   |
-    11 |  X   |   X   |       |  X   |
-    12 |      |   X   |   X   |  X   |
-    13 |  X   |   X   |   X   |  X   |
-    14 |      |   X   |   X   |  X   |
-    15 |  X   |   X   |   X   |  X   |
-    16 |      |       |       |      |  X
-    17 |  X   |       |       |      |  X
-    18 |      |   X   |       |      |  X
-    19 |  X   |   X   |       |      |  X
-    20 |      |   X   |   X   |      |  X
-    21 |  X   |   X   |   X   |      |  X
-    22 |      |   X   |   X   |      |  X
-    23 |  X   |   X   |   X   |      |  X
-    24 |      |       |       |  X   |  X
-    25 |  X   |       |       |  X   |  X
-    26 |      |   X   |       |  X   |  X
-    27 |  X   |   X   |       |  X   |  X
-    28 |      |   X   |   X   |  X   |  X
-    29 |  X   |   X   |   X   |  X   |  X
-    30 |      |   X   |   X   |  X   |  X
-    31 |  X   |   X   |   X   |  X   |  X
+       |      |cache  |cache  |      |      | free | active
+  dump | zero |without|with   | user | free | huge | huge
+ level | page |private|private| data | page | page | page
+.br
+\-\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-\-\-
+     0 |      |       |       |      |      |      |
+     1 |  X   |       |       |      |      |      |
+     2 |      |   X   |       |      |      |      |
+     3 |  X   |   X   |       |      |      |      |
+     4 |      |   X   |   X   |      |      |      |
+     5 |  X   |   X   |   X   |      |      |      |
+     6 |      |   X   |   X   |      |      |      |
+     7 |  X   |   X   |   X   |      |      |      |
+     8 |      |       |       |  X   |      |      |
+     9 |  X   |       |       |  X   |      |      |
+    10 |      |   X   |       |  X   |      |      |
+    11 |  X   |   X   |       |  X   |      |      |
+    12 |      |   X   |   X   |  X   |      |      |
+    13 |  X   |   X   |   X   |  X   |      |      |
+    14 |      |   X   |   X   |  X   |      |      |
+    15 |  X   |   X   |   X   |  X   |      |      |
+    16 |      |       |       |      |  X   |      |
+    17 |  X   |       |       |      |  X   |      |
+    18 |      |   X   |       |      |  X   |      |
+    19 |  X   |   X   |       |      |  X   |      |
+    20 |      |   X   |   X   |      |  X   |      |
+    21 |  X   |   X   |   X   |      |  X   |      |
+    22 |      |   X   |   X   |      |  X   |      |
+    23 |  X   |   X   |   X   |      |  X   |      |
+    24 |      |       |       |  X   |  X   |      |
+    25 |  X   |       |       |  X   |  X   |      |
+    26 |      |   X   |       |  X   |  X   |      |
+    27 |  X   |   X   |       |  X   |  X   |      |
+    28 |      |   X   |   X   |  X   |  X   |      |
+    29 |  X   |   X   |   X   |  X   |  X   |      |
+    30 |      |   X   |   X   |  X   |  X   |      |
+    31 |  X   |   X   |   X   |  X   |  X   |      |
+    32 |      |       |       |      |      |  X   |
+    33 |  X   |       |       |      |      |  X   |
+    34 |      |   X   |       |      |      |  X   |
+    35 |  X   |   X   |       |      |      |  X   |
+    36 |      |   X   |   X   |      |      |  X   |
+    37 |  X   |   X   |   X   |      |      |  X   |
+    38 |      |   X   |   X   |      |      |  X   |
+    39 |  X   |   X   |   X   |      |      |  X   |
+    40 |      |       |       |  X   |      |  X   |
+    41 |  X   |       |       |  X   |      |  X   |
+    42 |      |   X   |       |  X   |      |  X   |
+    43 |  X   |   X   |       |  X   |      |  X   |
+    44 |      |   X   |   X   |  X   |      |  X   |
+    45 |  X   |   X   |   X   |  X   |      |  X   |
+    46 |      |   X   |   X   |  X   |      |  X   |
+    47 |  X   |   X   |   X   |  X   |      |  X   |
+    48 |      |       |       |      |  X   |  X   |
+    49 |  X
[PATCH 1/3] makedumpfile: hugepage filtering: add hugepage filtering functions
Add functions to exclude hugepage from vmcore dump.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 makedumpfile.c |  272
 makedumpfile.h |   19
 2 files changed, 289 insertions(+), 2 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index b42565c..f0b2531 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -46,6 +46,8 @@ unsigned long long pfn_cache_private;
 unsigned long long pfn_user;
 unsigned long long pfn_free;
 unsigned long long pfn_hwpoison;
+unsigned long long pfn_free_huge;
+unsigned long long pfn_active_huge;
 
 unsigned long long num_dumped;
 
@@ -1038,6 +1040,7 @@ get_symbol_info(void)
 	SYMBOL_INIT(mem_map, "mem_map");
 	SYMBOL_INIT(vmem_map, "vmem_map");
 	SYMBOL_INIT(mem_section, "mem_section");
+	SYMBOL_INIT(hstates, "hstates");
 	SYMBOL_INIT(pkmap_count, "pkmap_count");
 	SYMBOL_INIT_NEXT(pkmap_count_next, "pkmap_count");
 	SYMBOL_INIT(system_utsname, "system_utsname");
@@ -1174,6 +1177,19 @@ get_structure_info(void)
 	OFFSET_INIT(list_head.prev, "list_head", "prev");
 
 	/*
+	 * Get offsets of the hstate's members.
+	 */
+	SIZE_INIT(hstate, "hstate");
+	OFFSET_INIT(hstate.order, "hstate", "order");
+	OFFSET_INIT(hstate.nr_huge_pages, "hstate", "nr_huge_pages");
+	OFFSET_INIT(hstate.free_huge_pages, "hstate", "free_huge_pages");
+	OFFSET_INIT(hstate.hugepage_activelist, "hstate",
+		"hugepage_activelist");
+	OFFSET_INIT(hstate.hugepage_freelists, "hstate", "hugepage_freelists");
+	MEMBER_ARRAY_LENGTH_INIT(hstate.hugepage_freelists, "hstate",
+		"hugepage_freelists");
+
+	/*
 	 * Get offsets of the node_memblk_s's members.
 	 */
 	SIZE_INIT(node_memblk_s, "node_memblk_s");
@@ -1555,6 +1571,7 @@ write_vmcoreinfo_data(void)
 	WRITE_SYMBOL("mem_map", mem_map);
 	WRITE_SYMBOL("vmem_map", vmem_map);
 	WRITE_SYMBOL("mem_section", mem_section);
+	WRITE_SYMBOL("hstates", hstates);
 	WRITE_SYMBOL("pkmap_count", pkmap_count);
 	WRITE_SYMBOL("pkmap_count_next", pkmap_count_next);
 	WRITE_SYMBOL("system_utsname", system_utsname);
@@ -1590,6 +1607,7 @@ write_vmcoreinfo_data(void)
 	WRITE_STRUCTURE_SIZE("zone", zone);
 	WRITE_STRUCTURE_SIZE("free_area", free_area);
 	WRITE_STRUCTURE_SIZE("list_head", list_head);
+	WRITE_STRUCTURE_SIZE("hstate", hstate);
 	WRITE_STRUCTURE_SIZE("node_memblk_s", node_memblk_s);
 	WRITE_STRUCTURE_SIZE("nodemask_t", nodemask_t);
 	WRITE_STRUCTURE_SIZE("pageflags", pageflags);
@@ -1628,6 +1646,13 @@ write_vmcoreinfo_data(void)
 	WRITE_MEMBER_OFFSET("vm_struct.addr", vm_struct.addr);
 	WRITE_MEMBER_OFFSET("vmap_area.va_start", vmap_area.va_start);
 	WRITE_MEMBER_OFFSET("vmap_area.list", vmap_area.list);
+	WRITE_MEMBER_OFFSET("hstate.order", hstate.order);
+	WRITE_MEMBER_OFFSET("hstate.nr_huge_pages", hstate.nr_huge_pages);
+	WRITE_MEMBER_OFFSET("hstate.free_huge_pages", hstate.free_huge_pages);
+	WRITE_MEMBER_OFFSET("hstate.hugepage_activelist",
+		hstate.hugepage_activelist);
+	WRITE_MEMBER_OFFSET("hstate.hugepage_freelists",
+		hstate.hugepage_freelists);
 	WRITE_MEMBER_OFFSET("log.ts_nsec", log.ts_nsec);
 	WRITE_MEMBER_OFFSET("log.len", log.len);
 	WRITE_MEMBER_OFFSET("log.text_len", log.text_len);
@@ -1647,6 +1672,9 @@ write_vmcoreinfo_data(void)
 	WRITE_ARRAY_LENGTH("zone.free_area", zone.free_area);
 	WRITE_ARRAY_LENGTH("free_area.free_list", free_area.free_list);
 
+	WRITE_ARRAY_LENGTH("hstate.hugepage_freelists",
+		hstate.hugepage_freelists);
+
 	WRITE_NUMBER("NR_FREE_PAGES", NR_FREE_PAGES);
 	WRITE_NUMBER("N_ONLINE", N_ONLINE);
 
@@ -1659,6 +1687,8 @@ write_vmcoreinfo_data(void)
 	WRITE_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
 
+	WRITE_NUMBER("HUGE_MAX_HSTATE", HUGE_MAX_HSTATE);
+
 	/*
 	 * write the source file of 1st kernel
 	 */
@@ -1874,6 +1904,7 @@ read_vmcoreinfo(void)
 	READ_SYMBOL("mem_map", mem_map);
 	READ_SYMBOL("vmem_map", vmem_map);
 	READ_SYMBOL("mem_section", mem_section);
+	READ_SYMBOL("hstates", hstates);
 	READ_SYMBOL("pkmap_count", pkmap_count);
 	READ_SYMBOL("pkmap_count_next", pkmap_count_next);
 	READ_SYMBOL("system_utsname", system_utsname);
@@ -1906,6 +1937,7 @@ read_vmcoreinfo(void)
 	READ_STRUCTURE_SIZE("zone", zone);
 	READ_STRUCTURE_SIZE("free_area", free_area);
 	READ_STRUCTURE_SIZE("list_head", list_head);
+	READ_STRUCTURE_SIZE("hstate", hstate);
 	READ_STRUCTURE_SIZE("node_memblk_s", node_memblk_s);
 	READ_STRUCTURE_SIZE("nodemask_t", nodemask_t);
 	READ_STRUCTURE_SIZE("pageflags", pageflags);
@@ -1940,6 +1972,13 @@ read_vmcoreinfo(void)
 	READ_MEMBER_OFFSET("vm_struct.addr", vm_struct.addr);
 	READ_MEMBER_OFFSET("vmap_area.va_start", vmap_area.va_start);
 	READ_MEMBER_OFFSET
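The OFFSET_INIT and SIZE_INIT lines above resolve struct hstate's layout at run time, because makedumpfile cannot include the crashed kernel's headers. How such an offset might then be used can be sketched as follows. The helper names are hypothetical and this is not the filtering code from the truncated part of the patch. The sketch assumes the dump and the host share endianness and that hstate.order is an unsigned int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy an unsigned int member out of a raw image of a kernel structure.
 * `offset` stands for a value such as OFFSET(hstate.order) resolved at run time.
 * Returns 1 on success, 0 if the member would fall outside the image. */
int read_uint_field(const unsigned char *image, size_t image_size,
                    size_t offset, unsigned int *out)
{
    assert(image != NULL && out != NULL);
    if (offset > image_size || image_size - offset < sizeof(*out))
        return 0;
    /* memcpy avoids unaligned access on strict-alignment hosts. */
    memcpy(out, image + offset, sizeof(*out));
    return 1;
}

/* A hugepage of order n spans 2^n base pages. */
unsigned long pages_per_huge_page(unsigned int order)
{
    assert(order < sizeof(unsigned long) * 8);
    return 1UL << order;
}
```

For a 2 MiB hugepage on a system with 4 KiB base pages the order is 9, so pages_per_huge_page(9) gives 512 page frames to mark per hugepage.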
[PATCH] kexec: export hugepage data structure into vmcoreinfo
This patch exports the hstates data structure into vmcoreinfo when
CONFIG_HUGETLB_PAGE is defined. makedumpfile needs to read information
from the hugepage-related data structures.

We introduce a function into "makedumpfile" to exclude hugepages from the
vmcore dump. In order to introduce this function, the hstates data structure
has to be exported into vmcoreinfo.

This patch is based on Linux 3.12.

The patch set for makedumpfile to filter hugepages will be sent separately.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 kernel/kexec.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 2a74f30..766c7c8 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -38,6 +38,9 @@
 #include <asm/io.h>
 #include <asm/sections.h>
 
+#include <linux/hugetlb.h>
+
+
 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
 
@@ -1578,11 +1581,17 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_STRUCT_SIZE(mem_section);
 	VMCOREINFO_OFFSET(mem_section, section_mem_map);
 #endif
+#ifdef CONFIG_HUGETLB_PAGE
+	VMCOREINFO_SYMBOL(hstates);
+#endif
 	VMCOREINFO_STRUCT_SIZE(page);
 	VMCOREINFO_STRUCT_SIZE(pglist_data);
 	VMCOREINFO_STRUCT_SIZE(zone);
 	VMCOREINFO_STRUCT_SIZE(free_area);
 	VMCOREINFO_STRUCT_SIZE(list_head);
+#ifdef CONFIG_HUGETLB_PAGE
+	VMCOREINFO_STRUCT_SIZE(hstate);
+#endif
 	VMCOREINFO_SIZE(nodemask_t);
 	VMCOREINFO_OFFSET(page, flags);
 	VMCOREINFO_OFFSET(page, _count);
@@ -1606,9 +1615,19 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(list_head, prev);
 	VMCOREINFO_OFFSET(vmap_area, va_start);
 	VMCOREINFO_OFFSET(vmap_area, list);
+#ifdef CONFIG_HUGETLB_PAGE
+	VMCOREINFO_OFFSET(hstate, order);
+	VMCOREINFO_OFFSET(hstate, nr_huge_pages);
+	VMCOREINFO_OFFSET(hstate, free_huge_pages);
+	VMCOREINFO_OFFSET(hstate, hugepage_activelist);
+	VMCOREINFO_OFFSET(hstate, hugepage_freelists);
+#endif
 	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
 	log_buf_kexec_setup();
 	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
+#ifdef CONFIG_HUGETLB_PAGE
+	VMCOREINFO_LENGTH(hstate.hugepage_freelists, MAX_NUMNODES);
+#endif
 	VMCOREINFO_NUMBER(NR_FREE_PAGES);
 	VMCOREINFO_NUMBER(PG_lru);
 	VMCOREINFO_NUMBER(PG_private);
@@ -1618,6 +1637,9 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_NUMBER(PG_hwpoison);
 #endif
 	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
+#ifdef CONFIG_HUGETLB_PAGE
+	VMCOREINFO_NUMBER(HUGE_MAX_HSTATE);
+#endif
 
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();
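Each VMCOREINFO_* macro above emits one text line into the vmcoreinfo note, in forms such as SYMBOL(hstates)=<hex address>, SIZE(hstate)=<decimal>, OFFSET(hstate.order)=<decimal>, LENGTH(hstate.hugepage_freelists)=<decimal> and NUMBER(HUGE_MAX_HSTATE)=<decimal>. A consumer could pick such lines apart roughly as sketched below. This is not makedumpfile's own parser. The struct and function names are hypothetical, and the sample values mentioned in the note after the code are invented.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One parsed vmcoreinfo entry; a hypothetical type for illustration. */
struct vmci_entry {
    char kind[16];              /* SYMBOL, SIZE, OFFSET, LENGTH or NUMBER */
    char name[64];              /* e.g. "hstate.order" */
    unsigned long long value;   /* negative NUMBER() values wrap; cast back to long long */
};

/* Parse a line of the form KIND(name)=value.
 * Returns 1 on success, 0 if the line has a different shape. */
int vmci_parse(const char *line, struct vmci_entry *e)
{
    char raw[32];
    char *end;

    assert(line != NULL && e != NULL);
    if (sscanf(line, "%15[A-Z](%63[^)])=%31s", e->kind, e->name, raw) != 3)
        return 0;
    if (strcmp(e->kind, "SYMBOL") == 0)
        e->value = strtoull(raw, &end, 16);   /* addresses are hex without 0x */
    else
        e->value = (unsigned long long)strtoll(raw, &end, 10);
    return *end == '\0';
}
```

For example, a line such as "OFFSET(hstate.order)=8" would yield kind "OFFSET", name "hstate.order" and value 8, while a line like "OSRELEASE=3.12.0" is rejected because it has no parenthesised name.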
[PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
This patch set intends to exclude unnecessary hugepages from the vmcore dump
file.

This patch requires the kernel patch that exports the necessary data
structures into vmcore: "kexec: export hugepage data structure into vmcoreinfo"
http://lists.infradead.org/pipermail/kexec/2013-November/009997.html

This patch introduces two new dump levels, 32 and 64, to exclude all unused and
active hugepages. The level to exclude all unnecessary pages will be 127 now.

       |         cache    cache                    free    active
 Dump  |  zero   without  with     user    free    huge    huge
 Level |  page   private  private  data    page    page    page
-------+---------------------------------------------------------
     0 |
     1 |   X
     2 |           X
     4 |           X        X
     8 |                            X
    16 |                                    X
    32 |                                            X
    64 |                                            X       X
   127 |   X       X        X       X       X       X       X

example:
To exclude all unnecessary pages:
makedumpfile -c --message-level 23 -d 127 /proc/vmcore /var/crash/kdump

To exclude all unnecessary pages but keep active hugepages:
makedumpfile -c --message-level 23 -d 63 /proc/vmcore /var/crash/kdump

---

Jingbai Ma (3):
      makedumpfile: hugepage filtering: add hugepage filtering functions
      makedumpfile: hugepage filtering: add excluding hugepage messages
      makedumpfile: hugepage filtering: add new dump levels for manual page

 makedumpfile.8 | 170
 makedumpfile.c | 272
 makedumpfile.h |  19
 print_info.c   |  12 +-
 print_info.h   |   2
 5 files changed, 431 insertions(+), 44 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
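The dump level in the cover letter is a bitmask, so the table can also be read programmatically. The following is only a minimal sketch and not code from the patch set: the two `*_HUGE` constant names and both helper functions are hypothetical, and the implied bits (level 4 also covers 2, level 64 also covers 32) are read off the rows of the table.

```c
#include <stdio.h>

/* Bit values taken from the dump level table. The two *_HUGE names are
 * hypothetical and only illustrate the proposed levels 32 and 64. */
enum {
	DL_EXCLUDE_ZERO        = 0x01,
	DL_EXCLUDE_CACHE       = 0x02,
	DL_EXCLUDE_CACHE_PRI   = 0x04,
	DL_EXCLUDE_USER_DATA   = 0x08,
	DL_EXCLUDE_FREE        = 0x10,
	DL_EXCLUDE_FREE_HUGE   = 0x20,
	DL_EXCLUDE_ACTIVE_HUGE = 0x40,
};

/* Expand a dump level into the set of page types that end up excluded,
 * following the table rows: 4 also covers 2, and 64 also covers 32. */
unsigned int effective_exclusions(unsigned int dump_level)
{
	unsigned int mask = dump_level & 0x7f;

	if (mask & DL_EXCLUDE_CACHE_PRI)
		mask |= DL_EXCLUDE_CACHE;
	if (mask & DL_EXCLUDE_ACTIVE_HUGE)
		mask |= DL_EXCLUDE_FREE_HUGE;
	return mask;
}

/* Print one line per page type excluded by the given dump level. */
void print_exclusions(unsigned int dump_level)
{
	static const char *const names[] = {
		"zero pages", "cache pages without private",
		"cache pages with private", "user data pages", "free pages",
		"free huge pages", "active huge pages",
	};
	unsigned int mask = effective_exclusions(dump_level);
	unsigned int bit;

	for (bit = 0; bit < 7; bit++)
		if (mask & (1U << bit))
			printf("-d %u excludes %s\n", dump_level, names[bit]);
}
```

Under these assumptions, `-d 63` sets every bit except the active hugepage bit, which is why the second example in the cover letter keeps active hugepages.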
[PATCH 2/3] makedumpfile: hugepage filtering: add excluding hugepage messages
Add messages for print_info.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 print_info.c | 12 +++-
 print_info.h |  2 ++
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/print_info.c b/print_info.c
index 06939e0..978d9fb 100644
--- a/print_info.c
+++ b/print_info.c
@@ -103,17 +103,19 @@ print_usage(void)
 	MSG("      The maximum of Dump_Level is 31.\n");
 	MSG("      Note that Dump_Level for Xen dump filtering is 0 or 1.\n");
 	MSG("\n");
-	MSG("            |         cache    cache\n");
-	MSG("      Dump  |  zero   without  with     user    free\n");
-	MSG("      Level |  page   private  private  data    page\n");
-	MSG("     -------+---------------------------------------\n");
+	MSG("            |         cache    cache                    free    active\n");
+	MSG("      Dump  |  zero   without  with     user    free    huge    huge\n");
+	MSG("      Level |  page   private  private  data    page    page    page\n");
+	MSG("     -------+---------------------------------------------------------\n");
 	MSG("         0  |\n");
 	MSG("         1  |   X\n");
 	MSG("         2  |           X\n");
 	MSG("         4  |           X        X\n");
 	MSG("         8  |                            X\n");
 	MSG("        16  |                                    X\n");
-	MSG("        31  |   X       X        X       X       X\n");
+	MSG("        32  |                                            X\n");
+	MSG("        64  |                                            X       X\n");
+	MSG("       127  |   X       X        X       X       X       X       X\n");
 	MSG("\n");
 	MSG("  [-E]:\n");
 	MSG("      Create DUMPFILE in the ELF format.\n");
diff --git a/print_info.h b/print_info.h
index 01e3706..8461df6 100644
--- a/print_info.h
+++ b/print_info.h
@@ -35,6 +35,8 @@ void print_execution_time(char *step_name, struct timeval *tv_start);
 #define PROGRESS_HOLES		"Checking for memory holes"
 #define PROGRESS_UNN_PAGES	"Excluding unnecessary pages"
 #define PROGRESS_FREE_PAGES	"Excluding free pages"
+#define PROGRESS_FREE_HUGE	"Excluding free huge pages"
+#define PROGRESS_ACTIVE_HUGE	"Excluding active huge pages"
 #define PROGRESS_ZERO_PAGES	"Excluding zero pages"
 #define PROGRESS_XEN_DOMAIN	"Excluding xen user domain"
 
Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
On 11/06/2013 04:26 AM, Vivek Goyal wrote:
> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>> This patch set intends to exclude unnecessary hugepages from the vmcore dump
>> file.
>>
>> This patch requires the kernel patch that exports the necessary data
>> structures into vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>
>> This patch introduces two new dump levels, 32 and 64, to exclude all unused and
>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
>
> Interesting. Why should hugepages be treated any differently than normal
> pages?
>
> If the user asked to filter out free pages, then they should be filtered, and
> it should not matter whether a page is a huge page or not?

Yes, free hugepages should be filtered out with other free pages. That sounds
reasonable.

But for active hugepages, I would offer the user more choices/flexibility
(maybe a bad idea).
I'm OK with filtering active hugepages together with other user data pages.

Any other comments?

> Thanks
> Vivek

--
Thanks,
Jingbai Ma
Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag
On 08/13/2013 06:55 PM, Jingbai Ma wrote:
> On 08/06/2013 05:19 PM, HATAYAMA Daisuke wrote:
>> Hello,
>>
>> I've been addressing the kdump restriction that there's only one cpu
>> available on the kdump 2nd kernel. Now I need to check if the following
>> CPU0 SMI corruption issue, fixed in the following commit, can again be
>> reproduced by unsetting the BSP flag of the boot cpu:
>>
>> commit 74b5820808215f65b70b05a099d6d3c969b82689
>> Author: Bjorn Helgaas <bjorn.helg...@hp.com>
>> Date: Wed Jul 29 15:54:25 2009 -0600
>>
>>     ACPI: bind workqueues to CPU 0 to avoid SMI corruption
>>
>>     On some machines, a software-initiated SMI causes corruption unless the
>>     SMI runs on CPU 0. An SMI can be initiated by any AML, but typically it's
>>     done in GPE-related methods that are run via workqueues, so we can avoid
>>     the known corruption cases by binding the workqueues to CPU 0.
>>
>>     References:
>>         http://bugzilla.kernel.org/show_bug.cgi?id=13751
>>         https://bugs.launchpad.net/bugs/157171
>>         https://bugs.launchpad.net/bugs/157691
>>
>>     Signed-off-by: Bjorn Helgaas <bjorn.helg...@hp.com>
>>     Signed-off-by: Len Brown <len.br...@intel.com>
>>
>> The reason is that, in the current situation, I have two ideas to deal
>> with the above kdump restriction:
>>
>>   1) Disable BSP at the 2nd kernel, posted at:
>>      [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>>      https://lkml.org/lkml/2012/10/16/15
>>
>>   2) Unset BSP flag at the 1st kernel, suggested by Eric Biederman
>>      during the discussion of the idea 1).
>>
>> On the idea 1), BSP is disabled on the kdump 2nd kernel. My conclusion
>> is that we have no method to reset BSP, i.e. recover BSP's healthy
>> state, while we can recover AP by means of INIT as described in the MP
>> specification.
>>
>> The idea 2) is simpler. We unset the BSP flag of the boot cpu at the 1st
>> kernel. The behaviour when receiving INIT depends on whether or not the
>> BSP flag is set in its MSR; we can set and unset the BSP flag of the
>> MSR freely at runtime. (I don't mean we should.)
>>
>> So, the next thing I should do is to evaluate the risk of the idea 2). In
>> fact, during the discussion of the idea 1), HPA pointed out that some kind
>> of firmware is affected if the BSP flag is unset. Also, maybe for the same
>> reason, the recently introduced cpu0 hot-plugging feature by Fenghua Yu
>> doesn't appear to unset the BSP flag.
>>
>> The biggest problem next is that I don't have any of the machines reported
>> in the bugzilla articles; this issue inherently depends on firmware.
>>
>> So, could anyone help test the idea 2) above if you have any of
>> the following machines? (or other ones that can lead to the same bug)
>>
>> - HP Compaq 6910p
>> - HP Compaq 6710b
>> - HP Compaq 6710s
>> - HP Compaq 6510b
>> - HP Compaq 2510p
>>
>> I prepared some small programs for this test. See the attached file.
>> The steps to try to reproduce the bug are as follows:
>>
>>   1. $ tar xf bsp_flag_modules.tar.gz; cd bsp_flag_modules
>>   2. $ make # to build these programs
>>   3. $ insmod unsetbspflag.ko # to unset BSP flag of the boot cpu
>>   4. $ insmod getcpuinfo.ko # to confirm if BSP flag of the boot cpu has
>>                             # been unset.
>>      $ dmesg | tail
>>   5. Close the lid of the machine.
>>   6. Wait some minutes if necessary.
>>   7. Open the lid and you can see oops on the screen if the bug has
>>      successfully been reproduced.
>>
>
> I couldn't find any of the models listed above, but found one HP EliteBook
> 6930p. I tested this machine with kernel 2.6.30 first. After resuming from
> suspend, the system hung.
>
> Then I tested with kernel 3.11.0-rc5; it worked well and could resume from
> suspend without any problem.
>
> Next, I tested your program to clear the BSP flag. I found that
> unsetbspflag.ko didn't work every time; sometimes I had to execute
> insmod/rmmod several times to clear the BSP flag. (I used your
> getcpuinfo.ko to check the BSP flag)
>
> cpu: 0 bios_apic: 0 apic: 0 AP
> cpu: 1 bios_apic: 1 apic: 1 AP
>
> I suspended it, and then resumed it. This machine resumed from suspend
> successfully, but the BSP flag had been set back:
>
> cpu: 0 bios_apic: 0 apic: 0 BSP
> cpu: 1 bios_apic: 1 apic: 1 AP
>
> That's all my observation. Hope it's helpful.
>

I found a side effect of unsetting the BSP flag. It affects system rebooting:
once the BSP flag has been removed and a reboot command is issued, the system
hangs after the message:

Restarting system.

A hardware reset is then needed to recover it.
I have reproduced this problem on the following systems:
HP EliteBook 6930p
HP Compaq DC7700
HP ProLiant DL980 (4 sockets, 40 cores)

I have an idea: to avoid this kind of issue, we can unset the BSP flag in the
first kernel during crash processing, and restore it in the second kernel
while initializing the APs.

--
Thanks,
Jingbai Ma
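For readers following the BSP flag discussion: on x86 the flag is bit 8 of the IA32_APIC_BASE MSR (address 0x1B), as documented in the Intel SDM. The modules attached to the thread are not shown here, so the following is only a user-space sketch of the bit manipulation with hypothetical helper names. Actually reading or writing the MSR needs kernel code and is deliberately left out.

```c
#include <stdint.h>

/* IA32_APIC_BASE layout (Intel SDM): bit 8 is the BSP flag,
 * bit 11 is the APIC global enable bit. */
#define IA32_APIC_BASE_MSR  0x1bU
#define APIC_BASE_BSP       (1ULL << 8)
#define APIC_BASE_ENABLE    (1ULL << 11)

/* Hypothetical helpers operating on an already-read MSR value. */
int is_bsp(uint64_t apic_base)
{
	return (apic_base & APIC_BASE_BSP) != 0;
}

/* Return the MSR value with the BSP flag cleared (idea 2 in the thread). */
uint64_t clear_bsp(uint64_t apic_base)
{
	return apic_base & ~APIC_BASE_BSP;
}

/* Return the MSR value with the BSP flag set again, as the proposal to
 * restore the flag in the second kernel would require. */
uint64_t set_bsp(uint64_t apic_base)
{
	return apic_base | APIC_BASE_BSP;
}

/* Label matching the "BSP" / "AP" column in the getcpuinfo.ko output. */
const char *cpu_role(uint64_t apic_base)
{
	return is_bsp(apic_base) ? "BSP" : "AP";
}
```

With an example value of `0xfee00900` (APIC base `0xfee00000`, enable and BSP bits set), `clear_bsp()` yields `0xfee00800` and `cpu_role()` changes from "BSP" to "AP".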
Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag
On 08/06/2013 05:19 PM, HATAYAMA Daisuke wrote:
> Hello,
>
> I've been addressing the kdump restriction that there's only one cpu
> available on the kdump 2nd kernel. Now I need to check if the following
> CPU0 SMI corruption issue, fixed in the following commit, can again be
> reproduced by unsetting the BSP flag of the boot cpu:
>
> commit 74b5820808215f65b70b05a099d6d3c969b82689
> Author: Bjorn Helgaas
> Date: Wed Jul 29 15:54:25 2009 -0600
>
>     ACPI: bind workqueues to CPU 0 to avoid SMI corruption
>
>     On some machines, a software-initiated SMI causes corruption unless the
>     SMI runs on CPU 0. An SMI can be initiated by any AML, but typically it's
>     done in GPE-related methods that are run via workqueues, so we can avoid
>     the known corruption cases by binding the workqueues to CPU 0.
>
>     References:
>         http://bugzilla.kernel.org/show_bug.cgi?id=13751
>         https://bugs.launchpad.net/bugs/157171
>         https://bugs.launchpad.net/bugs/157691
>
>     Signed-off-by: Bjorn Helgaas
>     Signed-off-by: Len Brown
>
> The reason is that, in the current situation, I have two ideas to deal
> with the above kdump restriction:
>
>   1) Disable BSP at the 2nd kernel, posted at:
>      [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>      https://lkml.org/lkml/2012/10/16/15
>
>   2) Unset BSP flag at the 1st kernel, suggested by Eric Biederman
>      during the discussion of the idea 1).
>
> On the idea 1), BSP is disabled on the kdump 2nd kernel. My conclusion
> is that we have no method to reset BSP, i.e. recover BSP's healthy
> state, while we can recover AP by means of INIT as described in the MP
> specification.
>
> The idea 2) is simpler. We unset the BSP flag of the boot cpu at the 1st
> kernel. The behaviour when receiving INIT depends on whether or not the
> BSP flag is set in its MSR; we can set and unset the BSP flag of the
> MSR freely at runtime. (I don't mean we should.)
>
> So, the next thing I should do is to evaluate the risk of the idea 2). In
> fact, during the discussion of the idea 1), HPA pointed out that some kind
> of firmware is affected if the BSP flag is unset. Also, maybe for the same
> reason, the recently introduced cpu0 hot-plugging feature by Fenghua Yu
> doesn't appear to unset the BSP flag.
>
> The biggest problem next is that I don't have any of the machines reported
> in the bugzilla articles; this issue inherently depends on firmware.
>
> So, could anyone help test the idea 2) above if you have any of
> the following machines? (or other ones that can lead to the same bug)
>
> - HP Compaq 6910p
> - HP Compaq 6710b
> - HP Compaq 6710s
> - HP Compaq 6510b
> - HP Compaq 2510p
>
> I prepared some small programs for this test. See the attached file.
> The steps to try to reproduce the bug are as follows:
>
>   1. $ tar xf bsp_flag_modules.tar.gz; cd bsp_flag_modules
>   2. $ make # to build these programs
>   3. $ insmod unsetbspflag.ko # to unset BSP flag of the boot cpu
>   4. $ insmod getcpuinfo.ko # to confirm if BSP flag of the boot cpu has
>                             # been unset.
>      $ dmesg | tail
>   5. Close the lid of the machine.
>   6. Wait some minutes if necessary.
>   7. Open the lid and you can see oops on the screen if the bug has
>      successfully been reproduced.
>

I couldn't find any of the models listed above, but found one HP EliteBook
6930p. I tested this machine with kernel 2.6.30 first. After resuming from
suspend, the system hung.

Then I tested with kernel 3.11.0-rc5; it worked well and could resume from
suspend without any problem.

Next, I tested your program to clear the BSP flag. I found that
unsetbspflag.ko didn't work every time; sometimes I had to execute
insmod/rmmod several times to clear the BSP flag. (I used your
getcpuinfo.ko to check the BSP flag)

cpu: 0 bios_apic: 0 apic: 0 AP
cpu: 1 bios_apic: 1 apic: 1 AP

I suspended it, and then resumed it. This machine resumed from suspend
successfully, but the BSP flag had been set back:

cpu: 0 bios_apic: 0 apic: 0 BSP
cpu: 1 bios_apic: 1 apic: 1 AP

That's all my observation. Hope it's helpful.

--
Thanks,
Jingbai Ma
makedumpfile 1.5.4 + kernel 3.11-rc2+ 4TB tests
Hi,

I have run some tests with makedumpfile 1.5.4 and upstream kernel 3.11-rc2+
on a machine with 4TB memory. Here are the test results:

Test environment:
Machine: HP ProLiant DL980 G7 with 4TB RAM.
CPU: Intel(R) Xeon(R) CPU E7- 2860 @ 2.27GHz (8 sockets, 10 cores)
(Only 1 CPU was enabled in the 2nd kernel)
Kernel: 3.11.0-rc2+ (at patch b3a3a9c441e2c8f6b6760de9331023a7906a4ac6)
crashkernel=384MB
vmcore size: 4.0TB
Dump file size: 15GB
All times were measured from the debug messages of makedumpfile.

As a comparison, I have also tested makedumpfile 1.5.3.

(all time in seconds)
                    Excluding pages   Copy data   Total
makedumpfile 1.5.3              468        1182    1650
makedumpfile 1.5.4               93         518     611

So it seems there is a great performance improvement by the mmap mechanism.

--
Thanks,
Jingbai Ma
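To put numbers on the improvement, the ratios can be worked out directly from the table in this message. This is a small illustrative sketch; the helper name `speedup` is hypothetical and not part of makedumpfile.

```c
/* Ratio of old to new elapsed time. Returns 0.0 when new_seconds is not
 * positive, so callers never divide by zero. */
double speedup(double old_seconds, double new_seconds)
{
	if (new_seconds <= 0.0)
		return 0.0;
	return old_seconds / new_seconds;
}
```

With the reported figures, `speedup(468, 93)` is about 5.0 for excluding pages, `speedup(1182, 518)` is about 2.3 for copying data, and `speedup(1650, 611)` is about 2.7 overall.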
makedumpfile parallel dumping test
Hi all,

I have done some experiments on parallel kernel dumping. I would like to
share the test results with you. Hope it helps.

Test environment:
Machine: HP ProLiant DL980 G7 with 4TB RAM.
CPU: Intel(R) Xeon(R) CPU E7- 2860 @ 2.27GHz (8 sockets, 10 cores)
(4 CPUs were enabled in the 2nd kernel by nr_cpus=4)
Kernel 3.9.0-rc7
kexec-tools 2.0.4
makedumpfile v1.5.3 with lzo library
crashkernel=4096M (I have tested with 2048M, but it failed with OOM on 3 or 4
parallel dumps in cyclic mode)

I didn't get a real multipath storage device, so I just put the dump files on
4 different disks via 3 HP Smart Array controllers.
(mounted on /0, /1, /2 and /3 in the capture kernel)

Time was measured like this (for example: lzo compression, non-cyclic, 4
parallels):
time makedumpfile -l --non-cyclic --split --message-level 23 -d 31 /proc/vmcore /0/vmcore_0 /1/vmcore_1 /2/vmcore_2 /3/vmcore_3

I ran several tests with different options, with 1 to 4 parallels, combined
with zlib and lzo compression.

Test result:
-----------------------------------------------------------------
|               |Parallels 1|Parallels 2|Parallels 3|Parallels 4|
-----------------------------------------------------------------
|zlib cyclic    | 42m25.321s|  34m0.168s| 29m44.908s| 28m50.387s|
-----------------------------------------------------------------
|zlib non-cyclic|  42m7.842s| 28m28.275s| 23m25.750s|  21m6.476s|
-----------------------------------------------------------------
|lzo cyclic     | 23m40.010s| 18m19.932s| 21m47.903s| 22m47.605s|
-----------------------------------------------------------------
|lzo non-cyclic | 20m45.749s| 16m42.045s| 15m41.070s| 15m18.605s|
-----------------------------------------------------------------

--
Thanks,
Jingbai Ma
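The table entries are in the `NmS.SSSs` form printed by `time`, which makes them awkward to compare by eye. A minimal sketch for converting them to seconds follows; the helper name `to_seconds` is hypothetical.

```c
#include <stdio.h>

/* Convert a time string such as "42m25.321s" into seconds.
 * Returns a negative value if the string does not have that form. */
double to_seconds(const char *s)
{
	int minutes;
	double seconds;

	if (sscanf(s, "%dm%lfs", &minutes, &seconds) != 2)
		return -1.0;
	return minutes * 60.0 + seconds;
}
```

For example, zlib non-cyclic goes from 2527.842 s with one process to 1266.476 s with four, roughly a 2.0x speedup. In the lzo cyclic row the three- and four-process runs are slower than the two-process run.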
Re: makedumpfile mmap() benchmark
On 03/27/2013 02:23 PM, HATAYAMA Daisuke wrote:
> From: Jingbai Ma <jingbai...@hp.com>
> Subject: makedumpfile mmap() benchmark
> Date: Wed, 27 Mar 2013 13:51:37 +0800
>
>> Hi,
>>
>> I have tested the makedumpfile mmap patch on a machine with 2TB memory.
>> Here are the test results:
>
> Thanks for your benchmark. It's very helpful to see the benchmark on
> different environments.

Thanks for your patch, there is a great performance improvement, very
impressive!

>> Test environment:
>> Machine: HP ProLiant DL980 G7 with 2TB RAM.
>> CPU: Intel(R) Xeon(R) CPU E7- 2860 @ 2.27GHz (8 sockets, 10 cores)
>> (Only 1 cpu was enabled in the 2nd kernel)
>> Kernel: 3.9.0-rc3+ with mmap kernel patch v3
>> vmcore size: 2.0TB
>> Dump file size: 3.6GB
>> makedumpfile mmap branch with parameters: -c --message-level 23 -d 31
>> --map-size <map-size>
>
> To reduce the benchmark time, I recommend LZO or snappy compression
> rather than zlib. zlib is used when the -c option is specified, and it's
> too slow for use in crash dumps.

That's a very helpful suggestion, I will try it again with the LZO/snappy
libraries.

> To build makedumpfile with support for each compression format, do
> USELZO=on or USESNAPPY=on after installing the necessary libraries.
>
>> All times were measured from the debug messages of makedumpfile.
>> As a comparison, I have also tested with the original kernel and original
>> makedumpfile 1.5.1 and 1.5.3.
>> I added all [Excluding unnecessary pages] and [Excluding free pages]
>> times together as "Filter pages", and [Copying data] as "Copy data" here.
>>
>> makedumpfile  Kernel                 map-size (KB)  Filter pages (s)  Copy data (s)  Total (s)
>> 1.5.1         3.7.0-0.36.el7.x86_64  N/A                      940.28        1269.25    2209.53
>> 1.5.3         3.7.0-0.36.el7.x86_64  N/A                      380.09         992.77    1372.86
>> 1.5.3         v3.9-rc3               N/A                      197.77         892.27    1090.04
>> 1.5.3+mmap    v3.9-rc3+mmap          0                        164.87         606.06     770.93
>> 1.5.3+mmap    v3.9-rc3+mmap          4                         88.62         576.07     664.69
>> 1.5.3+mmap    v3.9-rc3+mmap          1024                      83.66         477.23     560.89
>> 1.5.3+mmap    v3.9-rc3+mmap          2048                      83.44         477.21     560.65
>> 1.5.3+mmap    v3.9-rc3+mmap          10240                     83.84         476.56     560.4
>
> Did you calculate "Filter pages" by adding two [Excluding unnecessary
> pages] lines? The first one of the two lines is displayed by
> get_num_dumpable_cyclic() during the calculation of the total number of
> dumpable pages, which is later used to print the progress of writing pages
> in percentage.
>
> For example, here is the log, where the number of cycles is 3, and
>
> mem_map (16399)
>   mem_map    : ea0801e0
>   pfn_start  : 20078000
>   pfn_end    : 2008
> read /proc/vmcore with mmap()
> STEP [Excluding unnecessary pages] : 13.703842 seconds <-- this part is by get_num_dumpable_cyclic()
> STEP [Excluding unnecessary pages] : 13.842656 seconds
> STEP [Excluding unnecessary pages] : 6.857910 seconds
> STEP [Excluding unnecessary pages] : 13.554281 seconds <-- this part is by the main filtering processing.
> STEP [Excluding unnecessary pages] : 14.103593 seconds
> STEP [Excluding unnecessary pages] : 7.114239 seconds
> STEP [Copying data               ] : 138.442116 seconds
> Writing erase info...
> offset_eraseinfo: 1f4680e40, size_eraseinfo: 0
>
> Original pages : 0x1ffc28a4
>
> So, get_num_dumpable_cyclic() actually does a filtering operation, but it
> should not be included here.
>
> If so, I guess each measured time would be about 42 seconds, right?
> Then, it's almost the same as the result I posted today: 35 seconds.

Yes, I added them together. The following is one dump message log:

makedumpfile -c --message-level 23 -d 31 --map-size 10240 /proc/vmcore /sysroot/var/crash/vmcore_10240

cyclic buffer size has been changed: 77661798 => 77661184
Excluding unnecessary pages : [100 %]
STEP [Excluding unnecessary pages] : 24.17 seconds
Excluding unnecessary pages : [100 %]
STEP [Excluding unnecessary pages] : 17.291935 seconds
Excluding unnecessary pages : [100 %]
STEP [Excluding unnecessary pages] : 24.498559 seconds
Excluding unnecessary pages : [100 %]
STEP [Excluding unnecessary pages] : 17.278414 seconds
Copying data : [100 %]
STEP [Copying data ] : 476.563428 seconds

Original pages : 0x1ffe874d
  Excluded pages : 0x1f79429e
    Pages filled with zero : 0x002b4c9c
    Cache pages : 0x000493bc
    Cache pages + private : 0x11f3
    User process data pages : 0x5c55
    Free pages : 0x1f48f3fe
    Hwpoison pages : 0x
  Remaining pages : 0x008544af
  (The number of pages is reduced to 1%.)
Memory Hole : 0x1c0178b3
--
Total pages : 0x3c00

> Thanks.
> HATAYAMA, Daisuke

--
Thanks,
Jingbai Ma
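HATAYAMA's point is that in cyclic mode the first half of the `STEP [Excluding unnecessary pages]` lines comes from the counting pass (`get_num_dumpable_cyclic()`), and only the second half belongs to the main filtering pass. A minimal sketch of that accounting is shown here; the helper name is hypothetical, and it assumes the log contains an even number of STEP lines split evenly between the two passes.

```c
#include <stddef.h>

/* steps[] holds every "STEP [Excluding unnecessary pages]" time in log
 * order. Assuming the first n/2 entries come from the counting pass and
 * the last n/2 from the main filtering pass, sum only the latter. */
double main_filter_seconds(const double *steps, size_t n)
{
	double total = 0.0;
	size_t i;

	for (i = n / 2; i < n; i++)
		total += steps[i];
	return total;
}
```

Applied to the three-cycle log quoted in this message, the last three lines add up to about 34.8 seconds, in line with the 35 seconds HATAYAMA mentions. Applied to the four lines of the `--map-size 10240` log, the last two add up to about 41.8 seconds, which matches his guess of about 42 seconds.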
makedumpfile mmap() benchmark
Hi,

I have tested the makedumpfile mmap patch on a machine with 2TB memory. Here are the test results:

Test environment:
Machine: HP ProLiant DL980 G7 with 2TB RAM.
CPU: Intel(R) Xeon(R) CPU E7- 2860 @ 2.27GHz (8 sockets, 10 cores)
(Only 1 CPU was enabled in the 2nd kernel)
Kernel: 3.9.0-rc3+ with mmap kernel patch v3
vmcore size: 2.0TB
Dump file size: 3.6GB
makedumpfile mmap branch with parameters: -c --message-level 23 -d 31 --map-size <map-size>

All measured times are taken from the debug messages of makedumpfile.

As a comparison, I have also tested with the original kernel and the original makedumpfile 1.5.1 and 1.5.3.

I added all [Excluding unnecessary pages] and [Excluding free pages] times together as "Filter pages", and [Copying data] as "Copy data" here.

makedumpfile  Kernel                 map-size (KB)  Filter pages (s)  Copy data (s)  Total (s)
1.5.1         3.7.0-0.36.el7.x86_64  N/A            940.28            1269.25        2209.53
1.5.3         3.7.0-0.36.el7.x86_64  N/A            380.09            992.77         1372.86
1.5.3         v3.9-rc3               N/A            197.77            892.27         1090.04
1.5.3+mmap    v3.9-rc3+mmap          0              164.87            606.06         770.93
1.5.3+mmap    v3.9-rc3+mmap          4              88.62             576.07         664.69
1.5.3+mmap    v3.9-rc3+mmap          1024           83.66             477.23         560.89
1.5.3+mmap    v3.9-rc3+mmap          2048           83.44             477.21         560.65
1.5.3+mmap    v3.9-rc3+mmap          10240          83.84             476.56         560.4

Thanks,
Jingbai Ma
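To make the size of the improvement easier to read off, the figures in the benchmark table can be turned into speedup factors. This is a small sketch using only the numbers reported in the message. The choice of the first row (makedumpfile 1.5.1 on the 3.7.0 kernel) as the baseline is an assumption made for illustration, and the helper names are hypothetical.

```python
# (makedumpfile, kernel, map-size in KB or None, filter pages (s), copy data (s))
ROWS = [
    ("1.5.1", "3.7.0-0.36.el7.x86_64", None, 940.28, 1269.25),
    ("1.5.3", "3.7.0-0.36.el7.x86_64", None, 380.09, 992.77),
    ("1.5.3", "v3.9-rc3", None, 197.77, 892.27),
    ("1.5.3+mmap", "v3.9-rc3+mmap", 0, 164.87, 606.06),
    ("1.5.3+mmap", "v3.9-rc3+mmap", 4, 88.62, 576.07),
    ("1.5.3+mmap", "v3.9-rc3+mmap", 1024, 83.66, 477.23),
    ("1.5.3+mmap", "v3.9-rc3+mmap", 2048, 83.44, 477.21),
    ("1.5.3+mmap", "v3.9-rc3+mmap", 10240, 83.84, 476.56),
]


def total_seconds(row):
    """Total time is the sum of the filter and copy columns."""
    return row[3] + row[4]


def speedups(rows):
    """Speedup of each row's total time relative to the first row."""
    baseline = total_seconds(rows[0])
    return [baseline / total_seconds(row) for row in rows]


for row, factor in zip(ROWS, speedups(ROWS)):
    map_size = "N/A" if row[2] is None else str(row[2])
    print(f"{row[0]:<11} {row[1]:<22} {map_size:>6} "
          f"{total_seconds(row):8.2f} s  {factor:5.2f}x")
```

On these numbers the mmap runs with a map-size of 1024 KB or more are all close to 3.9x faster than the 1.5.1 baseline, and most of the remaining time is spent copying data rather than filtering.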
Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
On 03/11/2013 05:42 PM, Eric W. Biederman wrote:
> Jingbai Ma writes:
>> On 03/08/2013 11:52 PM, Vivek Goyal wrote:
>>> On Thu, Mar 07, 2013 at 01:54:45PM -0800, Eric W. Biederman wrote:
>>>> Vivek Goyal writes:
>>>>> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>>>>>> This patch intends to speed up the memory pages scanning process in selective dump mode.
>>>>>>
>>>>>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile v1.5.3):
>>>>>>
>>>>>>                                                         Total scan Time
>>>>>> Original kernel + makedumpfile v1.5.3 cyclic mode       1958.05 seconds
>>>>>> Original kernel + makedumpfile v1.5.3 non-cyclic mode   1151.50 seconds
>>>>>> Patched kernel + patched makedumpfile v1.5.3            17.50 seconds
>>>>>>
>>>>>> Traditionally, to reduce the size of dump file, dumper scans all memory pages to exclude the unnecessary memory pages after capture kernel booted, and scan it in userspace code (makedumpfile).
>>>>>
>>>>> I think this is not a good idea. It has several issues.
>>>>
>>>> Actually it does not appear to be doing any work in the first kernel.
>>>
>>> Looks like patch3 in series is doing that.
>>>
>>>     machine_crash_shutdown(&fixed_regs);
>>> +   generate_crash_dump_bitmap();
>>>     machine_kexec(kexec_crash_image);
>>>
>>> So this bitmap seems to be being set just before transitioning into second kernel.
>>>
>>> I am sure you would not like this extra code in this path. :-)
>>
>> I thought this function's code is pretty simple and could be called here safely.
>> If it's not proper here, how about before the function machine_crash_shutdown(&fixed_regs)?
>> Furthermore, could you explain the real risks of executing more code here?
>
> The kernel is known bad. What is bad is unclear. Executing any extra code is a bad idea.
>
> The history here is that before kexec-on-panic there were lots of dump routines that did all of the crashdump logic in the kernel before they shutdown. They all worked beautifully during development, and on developers test machines and were absolutely worthless in real world situations.

I have also learned something from the old style kernel dump. Yes, they do have some problems in real world situations. The primary problems come from I/O operations (disk writing/network sending) and invalid page tables.

> A piece of code that walks all of the page tables is most definitely opening itself up to all kinds of failure situations I can't even imagine.

Agree, an invalid page table will cause disaster.
But even in the capture kernel with a user space program, it may only cause a core dump; the user still has a chance to dump the crashed system themselves with some special tools. It's possible, but should be very rare in the real world.
I doubt how many users would be able to handle such situations.
So in most cases, if the page tables have been corrupted and the system cannot be dumped normally, the user would rather reboot the system directly.

> The only way that it would be ok to do this would be to maintain the bitmap in real time with the existing page table maintenance code, and that would only be ok if it did not add a performance penalty.

I also have a prototype that can trace the page table changes in real time, but I still haven't tested the performance penalty. I will test it again if I have time.

> Every once in a great while there is a new cpu architecture feature we need to deal with, but otherwise the only thing that is ok to do on that code path is to reduce it until it much more closely resembles the glorified jump instruction that it really is.

Agree. But if we can find some solution that can be proved as robust as a jump, that may apply.

> Speaking of which, have you given this code any coverage testing with lkdtm?

Not yet, but I will test it with lkdtm.
Before that, I would like to test the mmap() solution first.

Thanks for your very valuable comments, that helped me a lot!

> Eric

--
Jingbai Ma (jingbai...@hp.com)
Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
On 03/09/2013 12:31 PM, HATAYAMA Daisuke wrote:
> From: Jingbai Ma
> Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
> Date: Fri, 8 Mar 2013 18:06:31 +0800
>
>> On 03/07/2013 11:21 PM, Vivek Goyal wrote:
>>> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
> ...
>>> First of all 64MB per TB should not be a huge deal. And makedumpfile also has this cyclic mode where you process a map, discard it and then move on to next section. So memory usage remains constant at the expense of processing time.
>>
>> Yes, that's true. But in cyclic mode, makedumpfile will have to write/read the bitmap from storage, and that will also impact the performance.
>> I have measured the penalty for cyclic mode to be about a 70% slowdown. Maybe it could be faster after mmap is implemented.
>
> I guess the slowdown came from the issue that enough VMCOREINFO was not provided from the kernel, and unnecessary filtering processing for free pages is done multiple times.

Thanks for your comments! It would be very helpful.
I will test it on the machine again.

--
Jingbai Ma (jingbai...@hp.com)
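The "64MB per TB" figure and the memory/time trade-off of cyclic mode can be reproduced with a few lines of arithmetic. The sketch below assumes the numbers quoted elsewhere in this thread (4 KiB pages, 2 bits per page for makedumpfile in non-cyclic mode, 1 bit per page for the proposed kernel bitmap). The cycle count is a deliberately simplified model in which each cycle handles as many pages as fit in a fixed bitmap buffer; it is not a description of how makedumpfile sizes its cyclic buffer, and the function names are hypothetical.

```python
PAGE_SIZE = 4096   # 4 KiB pages, as assumed in the thread
MIB = 1 << 20
TIB = 1 << 40


def bitmap_bytes(mem_bytes, bits_per_page, page_size=PAGE_SIZE):
    """Bytes needed for a bitmap with bits_per_page bits for every page."""
    pages = mem_bytes // page_size
    return (pages * bits_per_page + 7) // 8


def cycles_needed(mem_bytes, buffer_bytes, bits_per_page, page_size=PAGE_SIZE):
    """Passes needed when only buffer_bytes of bitmap fit in memory at once.

    Simplified model: memory use stays at buffer_bytes, and the price is
    one scan cycle per buffer-sized slice of the full bitmap.
    """
    total = bitmap_bytes(mem_bytes, bits_per_page, page_size)
    return -(-total // buffer_bytes)  # ceiling division


print(bitmap_bytes(TIB, 2) // MIB, "MiB per TiB at 2 bits per page")
print(bitmap_bytes(TIB, 1) // MIB, "MiB per TiB at 1 bit per page")
print(cycles_needed(2 * TIB, 16 * MIB, 2), "cycles for 2 TiB with a 16 MiB buffer")
```

The first two results are 64 and 32 MiB, matching the per-terabyte figures discussed in the thread; the 16 MiB buffer in the last line is an arbitrary example value.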
Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
On 03/09/2013 12:19 AM, Vivek Goyal wrote:
> On Fri, Mar 08, 2013 at 06:06:31PM +0800, Jingbai Ma wrote:
>
> [..]
>>> - First of all it is doing more stuff in first kernel. And that runs contrary to kdump design where we want to do stuff in second kernel. After a kernel crash, you can't trust running kernel's data structures. So to improve reliability just do minimal stuff in crashed kernel and get out quickly.
>>
>> I agree with you, the first kernel should do as little as possible.
>> Intuitively, filtering memory pages in the first kernel will harm the reliability of kernel dump, but let's think it through thoroughly:
>>
>> 1. It only relies on the memory management data structures that makedumpfile also relies on, so there is no reliability degradation at this point.
>
> It's not the same. If there is something wrong with memory management data structures, you can panic() again and self lock yourself and never even transition to the second kernel.
>
> With makedumpfile, if something is wrong, either we will save wrong bits or get segmentation fault. But one can still try to be careful or save whole dump and try to get specific pieces out.
>
> So it is not an apples to apples comparison.

Understood, the double panic() does harm the reliability.
But considering the chance of a panic in the memory filtering code, it shouldn't increase the risk very much.
If the filtering code panicked, I doubt that even without it the second kernel could be booted up normally.

> [..]
>>> Looks like now hpa and yinghai have done the work to be able to load kdump kernel above 4GB. I am assuming this also removes the restriction that we can only reserve 512MB or 896MB in second kernel. If that's the case, then I don't see why people can't get away with reserving 64MB per TB.
>>
>> That's true. With kernel 3.9-rc1 and kexec-tools 2.0.4, the capture kernel will have enough memory to run. And makedumpfile could always be run in non-cyclic mode, but we are still concerned about the kernel dump performance on systems with huge memory (above 4TB).
>
> I would think that lets first try to make mmap() on /proc/vmcore work and optimize makedumpfile to make use of it and then see if performance is acceptable or not on large machines. And then take it from there.

Sure, you are right. I'm going to test the mmap() solution first. If it doesn't meet the performance requirement on large machines, we still need a solution here.

Thanks!

> Thanks
> Vivek

--
Jingbai Ma (jingbai...@hp.com)
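For readers unfamiliar with the idea behind reading /proc/vmcore through mmap(), the sketch below shows the general pattern of walking a file through a sliding mapped window instead of issuing many small read() calls. It is only an illustration with Python's standard mmap module on an ordinary temporary file. It is not the makedumpfile implementation, and the window size here is given in bytes, whereas the --map-size values in the benchmark are in KB. The function name is hypothetical.

```python
import mmap
import os
import tempfile


def read_with_mmap(path, map_size):
    """Yield the contents of a file window by window using mmap()."""
    gran = mmap.ALLOCATIONGRANULARITY
    # Mapping offsets must be multiples of the allocation granularity,
    # so round the requested window down to a multiple (at least one unit).
    window = max(gran, (map_size // gran) * gran)
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                yield m[:]  # copy the mapped bytes out of the window
            offset += length


# Demonstration on a temporary file with random contents.
data = os.urandom(3 * mmap.ALLOCATIONGRANULARITY + 123)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name
try:
    rebuilt = b"".join(read_with_mmap(path, 4 * 1024))
finally:
    os.remove(path)

print("contents match:", rebuilt == data)
```

A larger window means fewer mapping operations for the same amount of data, which is the knob the map-size column in the benchmark varies.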
Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
On 03/09/2013 12:13 AM, Eric W. Biederman wrote:
> "Ma, Jingbai (Kingboard)" writes:
>
>> On 3/8/13 6:33 PM, "H. Peter Anvin" wrote:
>>
>>> On 03/08/2013 02:06 AM, Jingbai Ma wrote:
>>>>
>>>> The kernel does have some abilities that user space hasn't. It's possible to map the whole memory space of the first kernel into user space on the second kernel. But the user space code has to re-implement some parts of the kernel memory management system again. And worse, it's architecture dependent: the more architectures are supported, the more code has to be implemented. All implementations in user space must be kept in sync with the kernel implementation. It may be called "flexibility", but it's painful to maintain the code.
>>>
>>> What? You are basically talking about /dev/mem... there is nothing particularly magic about it at all.
>>
>> What we are talking about is filtering memory pages (AKA memory page classification).
>> makedumpfile (or any other dumper in user space) has to know the exact memory layout of the memory management data structures. It is not only architecture dependent, but may also vary between kernel releases.
>> At this point, /dev/mem doesn't give any help.
>> So IMHO, I would like to do it in the kernel, rather than keep tracking changes in user space code.
>
> But the fact is there is no requirement that the crash dump capture kernel is the same version as the kernel that crashed. In fact it has been common at some points in time to use slightly different build options, or slightly different kernels. Say a 32bit PAE kernel to capture a 64bit x86_64 kernel.

The filtering code will be executed in the first kernel, so this problem will not exist.

> So in fact performing this work in the kernel is actively harmful to reliability and maintenance because it adds an incorrect assumption.
>
> If you do want the benefit of shared maintenance with the kernel one solution that has been suggested several times is to put code into tools/makedumpfile (probably a library) that encapsulates the kernel specific knowledge that can be loaded into the ramdisk when the crashdump kernel is being loaded.
>
> That would allow shared maintenance along without breaking the possibility of supporting kernel versions.

Yes, you are right. But it requires significant changes to makedumpfile, and if we also want to share the code with the kernel memory management subsystem, I believe that's not an easy job (at least with my limited kernel knowledge).

> Eric

--
Jingbai Ma (jingbai...@hp.com)
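Part of the layout knowledge discussed here is already passed from the crashed kernel to the dump tool as VMCOREINFO, a block of plain KEY=VALUE text that other messages in this thread refer to. As a rough illustration of how a user space tool can consume such data instead of hard-coding structure layouts, here is a minimal parser sketch. The sample text and every value in it are made up for the example, the entries shown are not a complete or authoritative list, and the function names are hypothetical.

```python
def parse_vmcoreinfo(text):
    """Parse KEY=VALUE lines into a dict, keeping the values as strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip blank lines and anything without '='
            info[key] = value
    return info


def struct_layout(info, struct_name):
    """Collect the SIZE(...) and OFFSET(...) entries of one structure."""
    layout = {"size": int(info[f"SIZE({struct_name})"]), "offsets": {}}
    prefix = f"OFFSET({struct_name}."
    for key, value in info.items():
        if key.startswith(prefix) and key.endswith(")"):
            member = key[len(prefix):-1]
            layout["offsets"][member] = int(value)
    return layout


# Hypothetical sample; real values depend on the kernel build.
SAMPLE = """\
OSRELEASE=3.9.0-rc3+
PAGESIZE=4096
SIZE(page)=56
OFFSET(page.flags)=0
OFFSET(page.lru)=32
"""

info = parse_vmcoreinfo(SAMPLE)
layout = struct_layout(info, "page")
print(info["OSRELEASE"], int(info["PAGESIZE"]), layout)
```

Because the sizes and offsets come from the crashed kernel itself, a tool built this way does not need to match the capture kernel's version, which is the property Eric's argument depends on.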
Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
On 03/08/2013 11:52 PM, Vivek Goyal wrote:
> On Thu, Mar 07, 2013 at 01:54:45PM -0800, Eric W. Biederman wrote:
>> Vivek Goyal writes:
>>
>>> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>>>> This patch intends to speed up the memory pages scanning process in selective dump mode.
>>>>
>>>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile v1.5.3):
>>>>
>>>>                                                         Total scan Time
>>>> Original kernel + makedumpfile v1.5.3 cyclic mode       1958.05 seconds
>>>> Original kernel + makedumpfile v1.5.3 non-cyclic mode   1151.50 seconds
>>>> Patched kernel + patched makedumpfile v1.5.3            17.50 seconds
>>>>
>>>> Traditionally, to reduce the size of dump file, dumper scans all memory pages to exclude the unnecessary memory pages after capture kernel booted, and scan it in userspace code (makedumpfile).
>>>
>>> I think this is not a good idea. It has several issues.
>>
>> Actually it does not appear to be doing any work in the first kernel.
>
> Looks like patch3 in series is doing that.
>
>     machine_crash_shutdown(&fixed_regs);
> +   generate_crash_dump_bitmap();
>     machine_kexec(kexec_crash_image);
>
> So this bitmap seems to be being set just before transitioning into second kernel.
>
> I am sure you would not like this extra code in this path. :-)

I thought this function's code is pretty simple and could be called here safely.
If it's not proper here, how about before the function machine_crash_shutdown(&fixed_regs)?
Furthermore, could you explain the real risks of executing more code here?

Thanks!

> Thanks
> Vivek

--
Jingbai Ma (jingbai...@hp.com)
Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
On 03/07/2013 11:21 PM, Vivek Goyal wrote:
> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>> This patch intend to speedup the memory pages scanning process in
>> selective dump mode.
>>
>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
>> v1.5.3):
>>
>>                                                         Total scan time
>> Original kernel + makedumpfile v1.5.3 cyclic mode       1958.05 seconds
>> Original kernel + makedumpfile v1.5.3 non-cyclic mode   1151.50 seconds
>> Patched kernel + patched makedumpfile v1.5.3              17.50 seconds
>>
>> Traditionally, to reduce the size of dump file, dumper scans all memory
>> pages to exclude the unnecessary memory pages after capture kernel
>> booted, and scan it in userspace code (makedumpfile).
>
> I think this is not a good idea. It has several issues.
>
> - First of all it is doing more stuff in first kernel. And that runs
>   contrary to kdump design where we want to do stuff in second kernel.
>   After a kernel crash, you can't trust running kernel's data structures.
>   So to improve reliability just do minimal stuff in crashed kernel and
>   get out quickly.

I agree with you, the first kernel should do as little as possible.
Intuitively, filtering memory pages in the first kernel will harm the reliability of the kernel dump, but let's think it through:

1. It only relies on the memory management data structures that makedumpfile also relies on, so there is no reliability degradation at this point.

2. The filtering code itself is very simple and straightforward, and doesn't depend on kernel functions too much. The current code calls pgdat_resize_lock() and spin_lock_irqsave() for testing purposes in non-crash situations; these calls can be removed safely in crash processing. It may affect reliability, but only to a very limited degree.

3. Before the filtering code is called, machine_crash_shutdown() has been executed, so all IRQs have been disabled and all other CPUs have been halted. We only need to make sure the NMI from the watchdog has been disabled here.

At this point we stay on a separate stack, with no potential interrupts, and only execute a small piece of code that uses very few system functions. Compared to the complicated functions executed previously, the risk from the filtering code should be acceptable.

> - Secondly, it moves filtering policy in kernel. I think keeping it in
>   user space gives us the extra flexibility.

It doesn't take flexibility away from the user, it just adds another possibility.
I have added a flag in makedumpfile; the user can decide to filter memory pages with makedumpfile itself or just use the bitmap that came from the first kernel.

>> It introduces several problems:
>>
>> 1. Requires more memory to store memory bitmap on systems with large
>> amount of memory installed. And in capture kernel there is only a few
>> free memory available, it will cause an out of memory error and fail.
>> (Non-cyclic mode)
>
> makedumpfile requires 2bits per 4K page. That is 64MB per TB. In your
> patches also you are reserving 1bit per page and that is 32MB per TB
> in first kernel.
>
> So memory is anyway being reserved, just that makedumpfile seems to be
> needing this extra bit. Not sure if that can be optimized or not.

Yes, you are right. It's only a POC (proof of concept) implementation currently. I can add an mmap interface to allow makedumpfile to access the bitmap memory directly without reserving memory for it again.

> First of all 64MB per TB should not be a huge deal. And makedumpfile
> also has this cyclic mode where you process a map, discard it and then
> move on to next section. So memory usage remains constant at the expense
> of processing time.

Yes, that's true. But in cyclic mode, makedumpfile has to write/read the bitmap from storage, which also impacts performance.
I have measured the penalty for cyclic mode at about a 70% slowdown. It may be faster after mmap is implemented.

> Looks like now hpa and yinghai have done the work to be able to load
> kdump kernel above 4GB. I am assuming this also removes the restriction
> that we can only reserve 512MB or 896MB in second kernel. If that's
> the case, then I don't see why people can't get away with reserving
> 64MB per TB.

That's true. With kernel 3.9-rc1 and kexec-tools 2.0.4, the capture kernel will have enough memory to run, and makedumpfile could always be run in non-cyclic mode, but we are still concerned about kernel dump performance on systems with huge memory (above 4TB).

>> 2. Scans all memory pages in makedumpfile is a very slow process. On
>> system with 1TB or more memory installed, the scanning process is very
>> long. Typically on 1TB idle system, it takes about 19 minutes. On system
>> with 4TB or more memory installed, it even doesn't work. To address the
>> out of memory issue on system with big memory (4TB or more memory
>> installed), makedumpfile v1.5.1 introduces a new cyclic mode. It only
>> scans a piece of memory pages each time, and do it cyclically to scan
>> all memory pages. But it runs more slowly, on 1TB system, takes about 33
[RFC PATCH 5/5] crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue
Linux kernel 3.9-rc1 allows crashkernel above 4GB, but current
kexec-tools doesn't support it yet.
This patch is only a workaround to make kdump work again.
This patch should be removed after kexec-tools 2.0.4 release.

Signed-off-by: Jingbai Ma
---
 arch/x86/kernel/setup.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 165c831..15321d6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -506,7 +506,8 @@ static void __init memblock_x86_reserve_range_setup_data(void)
 #ifdef CONFIG_X86_32
 # define CRASH_KERNEL_ADDR_MAX	(512 << 20)
 #else
-# define CRASH_KERNEL_ADDR_MAX	MAXMEM
+/* # define CRASH_KERNEL_ADDR_MAX	MAXMEM */
+# define CRASH_KERNEL_ADDR_MAX	(896 << 20)
 #endif
 
 static void __init reserve_crashkernel_low(void)
[RFC PATCH 4/5] crash dump bitmap: add a proc interface for crash dump bitmap
Add a procfs driver for selecting exclude pages in userspace.
/proc/crash_dump_bitmap/

Signed-off-by: Jingbai Ma
---
 fs/proc/Makefile            |    1
 fs/proc/crash_dump_bitmap.c |  221 +++
 2 files changed, 222 insertions(+), 0 deletions(-)
 create mode 100644 fs/proc/crash_dump_bitmap.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 712f24d..2dfcff1 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -27,6 +27,7 @@ proc-$(CONFIG_PROC_SYSCTL)	+= proc_sysctl.o
 proc-$(CONFIG_NET)		+= proc_net.o
 proc-$(CONFIG_PROC_KCORE)	+= kcore.o
 proc-$(CONFIG_PROC_VMCORE)	+= vmcore.o
+proc-$(CONFIG_CRASH_DUMP_BITMAP)	+= crash_dump_bitmap.o
 proc-$(CONFIG_PROC_DEVICETREE)	+= proc_devtree.o
 proc-$(CONFIG_PRINTK)	+= kmsg.o
 proc-$(CONFIG_PROC_PAGE_MONITOR)	+= page.o
diff --git a/fs/proc/crash_dump_bitmap.c b/fs/proc/crash_dump_bitmap.c
new file mode 100644
index 000..77ecaae
--- /dev/null
+++ b/fs/proc/crash_dump_bitmap.c
@@ -0,0 +1,221 @@
+/*
+ *	fs/proc/crash_dump_bitmap.c
+ *	Interface for controlling the crash dump bitmap from user space.
+ *
+ *	(C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *	Author: Jingbai Ma
+ *
+ *	This program is free software; you can redistribute it and/or modify
+ *	it under the terms of version 2 of the GNU General Public License as
+ *	published by the Free Software Foundation.
+ *
+ *	This program is distributed in the hope that it will be useful,
+ *	but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *	MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ *	General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jingbai Ma ");
+MODULE_DESCRIPTION("Crash dump bitmap support driver");
+
+static const char *proc_dir_name = "crash_dump_bitmap";
+static const char *proc_page_status_name = "page_status";
+static const char *proc_dump_level_name = "dump_level";
+
+static struct proc_dir_entry *proc_dir, *proc_page_status, *proc_dump_level;
+
+static unsigned int get_dump_level(void)
+{
+	unsigned int dump_level;
+
+	dump_level = crash_dump_bitmap_ctrl.exclude_zero_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_cache_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_cache_private_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_user_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_free_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES : 0;
+
+	return dump_level;
+}
+
+static void set_dump_level(unsigned int dump_level)
+{
+	crash_dump_bitmap_ctrl.exclude_zero_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_cache_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_cache_private_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES)
+		? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_user_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_free_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES) ? 1 : 0;
+}
+
+static int proc_page_status_show(struct seq_file *m, void *v)
+{
+	u64 start, duration;
+
+	if (!crash_dump_bitmap_mem) {
+		seq_printf(m,
+			"crash_dump_bitmap: crash_dump_bitmap_mem not found!\n");
+
+		return -EINVAL;
+	}
+
+	seq_printf(m, "Exclude page flag status:\n");
+	seq_printf(m, "exclude_dump_bitmap_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages);
+	seq_printf(m, "exclude_zero_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_zero_pages);
+	seq_printf(m, "exclude_cache_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_cache_pages);
+	seq_printf(m, "exclude_cache_private_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_cache_private_pages);
+	seq_printf(m, "exclude_user_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_user_pages);
+	seq_printf(m, "exclude_free_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_free_pages);
+
+	seq_printf(m, "Scanning all memory pages:\n"
[RFC PATCH 2/5] crash dump bitmap: init crash dump bitmap in kernel booting process
Reserve a memory block for crash_dump_bitmap in kernel booting process.

Signed-off-by: Jingbai Ma
---
 arch/x86/kernel/setup.c           |   59 +
 include/linux/crash_dump_bitmap.h |   59 +
 kernel/Makefile                   |    1 +
 kernel/crash_dump_bitmap.c        |   45
 4 files changed, 164 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/crash_dump_bitmap.h
 create mode 100644 kernel/crash_dump_bitmap.c

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 84d3285..165c831 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -67,6 +67,7 @@
 #include
 #include
+#include
 #include
 #include
@@ -601,6 +602,62 @@ static void __init reserve_crashkernel(void)
 }
 #endif
 
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+static void __init crash_dump_bitmap_init(void)
+{
+	static unsigned long BITSPERBYTE = 8;
+
+	unsigned long long mem_start;
+	unsigned long long mem_size;
+
+	if (is_kdump_kernel())
+		return;
+
+	mem_start = (1ULL << 24); /* 16MB */
+	mem_size = roundup((roundup(max_pfn, BITSPERBYTE) / BITSPERBYTE),
+		PAGE_SIZE);
+
+	crash_dump_bitmap_mem = memblock_find_in_range(mem_start,
+		MEMBLOCK_ALLOC_ACCESSIBLE, mem_size, PAGE_SIZE);
+
+	if (!crash_dump_bitmap_mem) {
+		pr_err(
+		"crash_dump_bitmap: allocate error! size=%lldkB, from=%lldMB\n",
+			mem_size >> 10, mem_start >> 20);
+
+		return;
+	}
+
+	crash_dump_bitmap_mem_size = mem_size;
+	memblock_reserve(crash_dump_bitmap_mem, crash_dump_bitmap_mem_size);
+	pr_info("crash_dump_bitmap: bitmap_mem=%lldMB. size=%lldkB\n",
+		(unsigned long long)crash_dump_bitmap_mem >> 20,
+		mem_size >> 10);
+
+	crash_dump_bitmap_res.start = crash_dump_bitmap_mem;
+	crash_dump_bitmap_res.end   = crash_dump_bitmap_mem + mem_size - 1;
+	insert_resource(&iomem_resource, &crash_dump_bitmap_res);
+
+	crash_dump_bitmap_info.version = CRASH_DUMP_BITMAP_VERSION;
+
+	crash_dump_bitmap_info.bitmap = crash_dump_bitmap_mem;
+	crash_dump_bitmap_info.bitmap_size = crash_dump_bitmap_mem_size;
+
+	crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_zero_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_cache_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_cache_private_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_user_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_free_pages = 1;
+
+	pr_info("crash_dump_bitmap: Initialized!\n");
+}
+#else
+static void __init crash_dump_bitmap_init(void)
+{
+}
+#endif
+
 static struct resource standard_io_resources[] = {
 	{ .name = "dma1", .start = 0x00, .end = 0x1f,
 		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
@@ -1094,6 +1151,8 @@ void __init setup_arch(char **cmdline_p)
 
 	reserve_crashkernel();
 
+	crash_dump_bitmap_init();
+
 	vsmp_init();
 
 	io_delay_init();
diff --git a/include/linux/crash_dump_bitmap.h b/include/linux/crash_dump_bitmap.h
new file mode 100644
index 000..63b1264
--- /dev/null
+++ b/include/linux/crash_dump_bitmap.h
@@ -0,0 +1,59 @@
+/*
+ *	include/linux/crash_dump_bitmap.h
+ *	Declaration of crash dump bitmap functions and data structures.
+ *
+ *	(C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *	Author: Jingbai Ma
+ *
+ *	This program is free software; you can redistribute it and/or modify
+ *	it under the terms of version 2 of the GNU General Public License as
+ *	published by the Free Software Foundation.
+ *
+ *	This program is distributed in the hope that it will be useful,
+ *	but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *	MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ *	General Public License for more details.
+ */
+
+#ifndef _LINUX_CRASH_DUMP_BITMAP_H
+#define _LINUX_CRASH_DUMP_BITMAP_H
+
+#define CRASH_DUMP_BITMAP_VERSION 1;
+
+enum {
+	CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES = 1,
+	CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES = 2,
+	CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES = 4,
+	CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES = 8,
+	CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES = 16
+};
+
+struct crash_dump_bitmap_ctrl {
+	char exclude_crash_dump_bitmap_pages;
+	char exclude_zero_pages;	/* only for tracking dump level */
+	char exclude_cache_pages;
+	char exclude_cache_private_pages;
+	char exclude_user_pages;
+	char exclude_free_pages;
+};
+
+struct crash_dump_bitmap_info {
+	unsigned int version;
+	phys_addr_t bitmap;
+	phys_addr_t bitmap_size;
+	unsigned long cache_pages;
+	unsigned long cache_
[RFC PATCH 1/5] crash dump bitmap: add a kernel config and help document
Add a kernel config and help document for CRASH_DUMP_BITMAP.

Signed-off-by: Jingbai Ma
---
 Documentation/kdump/crash_dump_bitmap.txt | 378 +
 arch/x86/Kconfig                          |  16 +
 2 files changed, 394 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/kdump/crash_dump_bitmap.txt

diff --git a/Documentation/kdump/crash_dump_bitmap.txt b/Documentation/kdump/crash_dump_bitmap.txt
new file mode 100644
index 000..468cdf2
--- /dev/null
+++ b/Documentation/kdump/crash_dump_bitmap.txt
@@ -0,0 +1,378 @@
+===================================
+Documentation for Crash Dump Bitmap
+===================================
+
+This document includes overview, setup and installation, and analysis
+information.
+
+Overview
+========
+
+Traditionally, to reduce the size of the dump file, the dumper scans all
+memory pages after the capture kernel has booted and excludes the
+unnecessary ones. The scan runs in userspace code (makedumpfile).
+
+This introduces several problems:
+
+1. It requires more memory to store the memory bitmap on systems with a
+large amount of memory installed. The capture kernel has only a little
+free memory available, so the scan can hit an out of memory error and
+fail. (Non-cyclic mode)
+
+2. Scanning all memory pages in makedumpfile is a very slow process. On
+a system with 1TB or more memory installed, the scan takes a long time:
+typically about 19 minutes on an idle 1TB system. On a system with 4TB
+or more memory installed, it does not work at all. To address the out of
+memory issue on systems with big memory (4TB or more memory installed),
+makedumpfile v1.5.1 introduces a new cyclic mode. It scans only a piece
+of the memory pages each time, and repeats cyclically until all memory
+pages are scanned. But it runs more slowly: on a 1TB system it takes
+about 33 minutes.
+
+3. The page scanning code in makedumpfile is very complicated. Without
+the kernel's memory management data structures, makedumpfile has to
+build up its own data structures, cannot use some macros that are only
+available in the kernel (e.g. page_to_pfn), and has to use slower lookup
+algorithms instead.
+
+This patch introduces a new way to scan memory pages. It reserves a piece of
+memory (1 bit for each page, 32MB per TB of memory on x86 systems) in the
+first kernel. During the kernel panic process, it scans all memory pages and
+clears the bit for every excluded memory page in the reserved memory.
+
+This new approach has several benefits:
+
+1. It is extremely fast: on a 1TB system it takes only about 17.5 seconds
+to scan all memory pages.
+
+2. It reduces the memory requirement of makedumpfile by putting the
+reserved memory in the first kernel's memory space.
+
+3. It simplifies the existing page scanning code in userspace.
+
+
+Usage
+=====
+
+1) Enable "kernel crash dump bitmap" in "Processor type and features", under
+"kernel crash dumps".
+
+CONFIG_CRASH_DUMP_BITMAP=y
+
+It depends on "kexec system call" and "kernel crash dumps", so these features
+must also be enabled.
+
+CONFIG_KEXEC=y
+CONFIG_CRASH_DUMP=y
+
+2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo filesystems".
+
+   CONFIG_SYSFS=y
+
+3) Compile and install the new kernel.
+
+4) Check the new kernel.
+Once the new kernel has booted, there will be a new folder
+/proc/crash_dump_bitmap.
+Check the current dump level:
+cat /proc/crash_dump_bitmap/dump_level
+
+Set the dump level:
+echo "dump level" > /proc/crash_dump_bitmap/dump_level
+
+The dump level is the same as the dump_level parameter of makedumpfile -d.
+
+Run the page scan and check the page status:
+cat /proc/crash_dump_bitmap/page_status
+
+5) Download makedumpfile v1.5.3 or later from sourceforge:
+http://sourceforge.net/projects/makedumpfile/
+
+6) Patch it with the patch at the end of this file.
+
+7) Compile it and copy the patched makedumpfile into the right folder
+(/sbin or /usr/sbin).
+
+8) Edit /etc/kdump.conf and add "-q" to the makedumpfile parameter
+line. It tells makedumpfile to use the crash dump bitmap in the kernel.
+core_collector makedumpfile --non-cyclic -q -c -d 31 --message-level 23
+
+9) Regenerate the initramfs to make sure the patched makedumpfile and the
+config have been included in it.
+
+
+To Do
+=====
+
+Only the x86-64 architecture is supported currently; support for other
+architectures needs to be added.
+
+
+Contact
+=======
+
+Jingbai Ma (jingbai...@hp.com)
+
+
+Patch (for makedumpfile v1.5.3)
+===============================
+
+Please forgive me: because of some format issues in the makedumpfile
+source, I had to wrap this patch with '#'. Please use this sed command to
+extract the patch for makedumpfile:
+
+sed -n -e "s/^#\(.*\)#$/\1/p" crash_dump_bitmap.txt > makedumpfile.patch
+
+===============================
+#diff --git a/makedumpfile.c b/makedumpfile.c#
+#index acb1b21..f29b6a5 100644#
+#--- a/makedumpfile.c#
+#+++ b/makedumpfile.c#
+#@@ -34,6 +34,10 @@ struct srcfile_table srcfile_table;#
+# struct vm_table vt = { 0 };#
+# struct
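The sed command in the help document keeps only lines of the form "#...#" and strips the two wrapping '#' characters. A minimal C sketch of that per-line rule follows, for readers who want to see exactly what is matched. The function name unwrap_hash_line() is hypothetical and is not part of the patch or of makedumpfile.

```c
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the rule behind: sed -n -e "s/^#\(.*\)#$/\1/p"
 *
 * unwrap_hash_line() is a hypothetical helper name.
 * line must not contain the trailing newline.
 *
 * Returns  1 and copies the inner text to out if line looks like "#...#",
 *          0 if the line does not match (sed -n would print nothing),
 *         -1 if out is too small to hold the result.
 */
int unwrap_hash_line(const char *line, char *out, size_t out_size)
{
	size_t len = strlen(line);

	/* The pattern needs a leading and a trailing '#', so at least "##". */
	if (len < 2 || line[0] != '#' || line[len - 1] != '#')
		return 0;

	/* Inner text is len - 2 bytes, plus one byte for the terminator. */
	if (len - 2 >= out_size)
		return -1;

	memcpy(out, line + 1, len - 2);
	out[len - 2] = '\0';
	return 1;
}
```

Lines such as the section headings or the "Contact" entry do not match and are dropped, which is why the extracted file contains only the makedumpfile patch.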
[RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
This patch set intends to speed up the memory page scanning process in
selective dump mode.

Test result (on HP ProLiant DL980 G7 with 1TB RAM, makedumpfile v1.5.3):

                                                        Total scan time
Original kernel + makedumpfile v1.5.3 cyclic mode       1958.05 seconds
Original kernel + makedumpfile v1.5.3 non-cyclic mode   1151.50 seconds
Patched kernel + patched makedumpfile v1.5.3              17.50 seconds

Traditionally, to reduce the size of the dump file, the dumper scans all
memory pages after the capture kernel has booted and excludes the
unnecessary ones. The scan runs in userspace code (makedumpfile).

This introduces several problems:

1. It requires more memory to store the memory bitmap on systems with a
large amount of memory installed. The capture kernel has only a little
free memory available, so the scan can hit an out of memory error and
fail. (Non-cyclic mode)

2. Scanning all memory pages in makedumpfile is a very slow process. On
a system with 1TB or more memory installed, the scan takes a long time:
typically about 19 minutes on an idle 1TB system. On a system with 4TB
or more memory installed, it does not work at all. To address the out of
memory issue on systems with big memory (4TB or more memory installed),
makedumpfile v1.5.1 introduces a new cyclic mode. It scans only a piece
of the memory pages each time, and repeats cyclically until all memory
pages are scanned. But it runs more slowly: on a 1TB system it takes
about 33 minutes.

3. The page scanning code in makedumpfile is very complicated. Without
the kernel's memory management data structures, makedumpfile has to
build up its own data structures, cannot use some macros that are only
available in the kernel (e.g. page_to_pfn), and has to use slower lookup
algorithms instead.

This patch set introduces a new way to scan memory pages. It reserves a
piece of memory (1 bit for each page, 32MB per TB of memory on x86
systems) in the first kernel. During the kernel crash process, it scans
all memory pages and clears the bit for every excluded memory page in the
reserved memory.

This new approach has several benefits:

1. It is extremely fast: on a 1TB system it takes only about 17.5 seconds
to scan all memory pages.

2. It reduces the memory requirement of makedumpfile by putting the
reserved memory in the first kernel's memory space.

3. It simplifies the existing page scanning code in userspace.

To do:
1. It has only been verified on the x86 64-bit platform and needs to be
modified for other platforms (ARM, XEN, PPC, etc.).

---

Jingbai Ma (5):
      crash dump bitmap: add a kernel config and help document
      crash dump bitmap: init crash dump bitmap in kernel booting process
      crash dump bitmap: scan memory pages in kernel crash process
      crash dump bitmap: add a proc interface for crash dump bitmap
      crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue

 Documentation/kdump/crash_dump_bitmap.txt | 378 +
 arch/x86/Kconfig                          |  16 +
 arch/x86/kernel/setup.c                   |  62 +
 fs/proc/Makefile                          |   1
 fs/proc/crash_dump_bitmap.c               | 221 +
 include/linux/crash_dump_bitmap.h         |  59 +
 kernel/Makefile                           |   1
 kernel/crash_dump_bitmap.c                | 201 +++
 kernel/kexec.c                            |   5
 9 files changed, 943 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/kdump/crash_dump_bitmap.txt
 create mode 100644 fs/proc/crash_dump_bitmap.c
 create mode 100644 include/linux/crash_dump_bitmap.h
 create mode 100644 kernel/crash_dump_bitmap.c

--
Jingbai Ma
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 4/5] crash dump bitmap: add a proc interface for crash dump bitmap
Add a procfs driver that lets userspace select which pages to exclude:
/proc/crash_dump_bitmap/

Signed-off-by: Jingbai Ma
---
 fs/proc/Makefile            |   1
 fs/proc/crash_dump_bitmap.c | 221 +++
 2 files changed, 222 insertions(+), 0 deletions(-)
 create mode 100644 fs/proc/crash_dump_bitmap.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 712f24d..2dfcff1 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -27,6 +27,7 @@ proc-$(CONFIG_PROC_SYSCTL)	+= proc_sysctl.o
 proc-$(CONFIG_NET)		+= proc_net.o
 proc-$(CONFIG_PROC_KCORE)	+= kcore.o
 proc-$(CONFIG_PROC_VMCORE)	+= vmcore.o
+proc-$(CONFIG_CRASH_DUMP_BITMAP) += crash_dump_bitmap.o
 proc-$(CONFIG_PROC_DEVICETREE)	+= proc_devtree.o
 proc-$(CONFIG_PRINTK)	+= kmsg.o
 proc-$(CONFIG_PROC_PAGE_MONITOR)	+= page.o
diff --git a/fs/proc/crash_dump_bitmap.c b/fs/proc/crash_dump_bitmap.c
new file mode 100644
index 000..77ecaae
--- /dev/null
+++ b/fs/proc/crash_dump_bitmap.c
@@ -0,0 +1,221 @@
+/*
+ *	fs/proc/crash_dump_bitmap.c
+ *	Interface for controlling the crash dump bitmap from user space.
+ *
+ *	(C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *	Author: Jingbai Ma
+ *
+ *	This program is free software; you can redistribute it and/or modify
+ *	it under the terms of version 2 of the GNU General Public License as
+ *	published by the Free Software Foundation.
+ *
+ *	This program is distributed in the hope that it will be useful,
+ *	but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *	MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ *	General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jingbai Ma ");
+MODULE_DESCRIPTION("Crash dump bitmap support driver");
+
+static const char *proc_dir_name = "crash_dump_bitmap";
+static const char *proc_page_status_name = "page_status";
+static const char *proc_dump_level_name = "dump_level";
+
+static struct proc_dir_entry *proc_dir, *proc_page_status, *proc_dump_level;
+
+static unsigned int get_dump_level(void)
+{
+	unsigned int dump_level;
+
+	dump_level = crash_dump_bitmap_ctrl.exclude_zero_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_cache_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_cache_private_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_user_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_free_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES : 0;
+
+	return dump_level;
+}
+
+static void set_dump_level(unsigned int dump_level)
+{
+	crash_dump_bitmap_ctrl.exclude_zero_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_cache_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_cache_private_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES)
+		? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_user_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_free_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES) ? 1 : 0;
+}
+
+static int proc_page_status_show(struct seq_file *m, void *v)
+{
+	u64 start, duration;
+
+	if (!crash_dump_bitmap_mem) {
+		seq_printf(m,
+			"crash_dump_bitmap: crash_dump_bitmap_mem not found!\n");
+
+		return -EINVAL;
+	}
+
+	seq_printf(m, "Exclude page flag status:\n");
+	seq_printf(m, "exclude_dump_bitmap_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages);
+	seq_printf(m, "exclude_zero_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_zero_pages);
+	seq_printf(m, "exclude_cache_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_cache_pages);
+	seq_printf(m, "exclude_cache_private_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_cache_private_pages);
+	seq_printf(m, "exclude_user_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_user_pages);
+	seq_printf(m, "exclude_free_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_free_pages);
+
+	seq_printf(m, "Scanning all memory pages:\n"
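The dump_level value exchanged through /proc/crash_dump_bitmap/dump_level is a bit mask, and get_dump_level() and set_dump_level() only translate between that mask and the per-category flags. The following is a minimal userspace sketch of that translation. It assumes the flag values 1, 2, 4, 8 and 16 from the CRASH_DUMP_LEVEL_* enum in patch 2/5; struct dump_flags, decode_dump_level() and encode_dump_level() are hypothetical names that do not appear in the patch.

```c
#include <stdbool.h>

/* Bit values assumed to mirror the CRASH_DUMP_LEVEL_* enum in patch 2/5. */
enum {
	EXCLUDE_ZERO_PAGES          = 1,
	EXCLUDE_CACHE_PAGES         = 2,
	EXCLUDE_CACHE_PRIVATE_PAGES = 4,
	EXCLUDE_USER_PAGES          = 8,
	EXCLUDE_FREE_PAGES          = 16
};

/* Hypothetical userspace stand-in for struct crash_dump_bitmap_ctrl. */
struct dump_flags {
	bool zero;
	bool cache;
	bool cache_private;
	bool user;
	bool free_pages;
};

/* Split a dump level bit mask into one flag per page category. */
struct dump_flags decode_dump_level(unsigned int level)
{
	struct dump_flags f;

	f.zero          = (level & EXCLUDE_ZERO_PAGES) != 0;
	f.cache         = (level & EXCLUDE_CACHE_PAGES) != 0;
	f.cache_private = (level & EXCLUDE_CACHE_PRIVATE_PAGES) != 0;
	f.user          = (level & EXCLUDE_USER_PAGES) != 0;
	f.free_pages    = (level & EXCLUDE_FREE_PAGES) != 0;
	return f;
}

/* Rebuild the dump level bit mask from the per-category flags. */
unsigned int encode_dump_level(struct dump_flags f)
{
	unsigned int level = 0;

	if (f.zero)
		level |= EXCLUDE_ZERO_PAGES;
	if (f.cache)
		level |= EXCLUDE_CACHE_PAGES;
	if (f.cache_private)
		level |= EXCLUDE_CACHE_PRIVATE_PAGES;
	if (f.user)
		level |= EXCLUDE_USER_PAGES;
	if (f.free_pages)
		level |= EXCLUDE_FREE_PAGES;
	return level;
}
```

With these values, the level 31 used in the core_collector example sets all five flags, and a level of 17 would exclude only zero pages and free pages.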
[RFC PATCH 5/5] crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue
Linux kernel 3.9-rc1 allows crashkernel above 4GB, but current kexec-tools
does not support it yet. This patch is only a workaround to make kdump work
again. It should be removed once kexec-tools 2.0.4 is released.

Signed-off-by: Jingbai Ma
---
 arch/x86/kernel/setup.c |   3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 165c831..15321d6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -506,7 +506,8 @@ static void __init memblock_x86_reserve_range_setup_data(void)
 #ifdef CONFIG_X86_32
 # define CRASH_KERNEL_ADDR_MAX	(512 << 20)
 #else
-# define CRASH_KERNEL_ADDR_MAX	MAXMEM
+/* # define CRASH_KERNEL_ADDR_MAX	MAXMEM */
+# define CRASH_KERNEL_ADDR_MAX	(896 << 20)
 #endif

 static void __init reserve_crashkernel_low(void)
[RFC PATCH 3/5] crash dump bitmap: scan memory pages in kernel crash process
In the kernel crash process, call generate_crash_dump_bitmap() to scan all
memory pages and clear the bit for every excluded memory page in the
reserved memory.

Signed-off-by: Jingbai Ma
---
 kernel/crash_dump_bitmap.c | 156
 kernel/kexec.c             |   5 +
 2 files changed, 161 insertions(+), 0 deletions(-)

diff --git a/kernel/crash_dump_bitmap.c b/kernel/crash_dump_bitmap.c
index e743cdd..eed13ca 100644
--- a/kernel/crash_dump_bitmap.c
+++ b/kernel/crash_dump_bitmap.c
@@ -23,6 +23,8 @@
 #ifdef CONFIG_CRASH_DUMP_BITMAP

+#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
+
 phys_addr_t crash_dump_bitmap_mem;
 EXPORT_SYMBOL(crash_dump_bitmap_mem);

@@ -35,6 +37,7 @@ EXPORT_SYMBOL(crash_dump_bitmap_ctrl);
 struct crash_dump_bitmap_info crash_dump_bitmap_info;
 EXPORT_SYMBOL(crash_dump_bitmap_info);

+
 /* Location of the reserved area for the crash_dump_bitmap */
 struct resource crash_dump_bitmap_res = {
 	.name  = "Crash dump bitmap",
@@ -42,4 +45,157 @@ struct resource crash_dump_bitmap_res = {
 	.end   = 0,
 	.flags = IORESOURCE_BUSY | IORESOURCE_MEM
 };
+
+inline void set_crash_dump_bitmap(unsigned long pfn, int val)
+{
+	phys_addr_t paddr = crash_dump_bitmap_info.bitmap + (pfn >> 3);
+	unsigned char *vaddr;
+	unsigned char bit = (pfn & 7);
+
+	if (unlikely(paddr > (crash_dump_bitmap_mem
+		+ crash_dump_bitmap_mem_size))) {
+		pr_err(
+		"crash_dump_bitmap: pfn exceed limit. pfn=%ld, addr=0x%llX\n",
+			pfn, paddr);
+		return;
+	}
+
+	vaddr = (unsigned char *)__va(paddr);
+
+	if (val)
+		*vaddr |= (1U << bit);
+	else
+		*vaddr &= (~(1U << bit));
+}
+
+void generate_crash_dump_bitmap(void)
+{
+	pg_data_t *pgdat;
+	struct zone *zone;
+	unsigned long flags;
+	int order, t;
+	struct list_head *curr;
+	unsigned long zone_free_pages;
+	phys_addr_t addr;
+
+	if (!crash_dump_bitmap_mem) {
+		pr_info("crash_dump_bitmap: no crash_dump_bitmap memory.\n");
+		return;
+	}
+
+	pr_info(
+	"Excluding pages: bitmap=%d, cache=%d, private=%d, user=%d, free=%d\n",
+		crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages,
+		crash_dump_bitmap_ctrl.exclude_cache_pages,
+		crash_dump_bitmap_ctrl.exclude_cache_private_pages,
+		crash_dump_bitmap_ctrl.exclude_user_pages,
+		crash_dump_bitmap_ctrl.exclude_free_pages);
+
+	crash_dump_bitmap_info.free_pages = 0;
+	crash_dump_bitmap_info.cache_pages = 0;
+	crash_dump_bitmap_info.cache_private_pages = 0;
+	crash_dump_bitmap_info.user_pages = 0;
+	crash_dump_bitmap_info.hwpoison_pages = 0;
+
+	/* Set all bits on bitmap */
+	memset(__va(crash_dump_bitmap_info.bitmap), 0xff,
+		crash_dump_bitmap_info.bitmap_size);
+
+	/* Exclude all crash_dump_bitmap pages */
+	if (crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages) {
+		for (addr = crash_dump_bitmap_mem; addr <
+			crash_dump_bitmap_mem + crash_dump_bitmap_mem_size;
+			addr += PAGE_SIZE)
+			set_crash_dump_bitmap(
+				virt_to_pfn(__va(addr)), 0);
+	}
+
+	/* Exclude unnecessary pages */
+	for_each_online_pgdat(pgdat) {
+		unsigned long i;
+		unsigned long flags;
+
+		pgdat_resize_lock(pgdat, &flags);
+		for (i = 0; i < pgdat->node_spanned_pages; i++) {
+			struct page *page;
+			unsigned long pfn = pgdat->node_start_pfn + i;
+
+			if (!pfn_valid(pfn))
+				continue;
+
+			page = pfn_to_page(pfn);
+
+			/* Exclude the cache pages without the private page */
+			if (crash_dump_bitmap_ctrl.exclude_cache_pages
+				&& (PageLRU(page) || PageSwapCache(page))
+				&& !page_has_private(page) && !PageAnon(page)) {
+					set_crash_dump_bitmap(pfn, 0);
+					crash_dump_bitmap_info.cache_pages++;
+			}
+			/* Exclude the cache pages with private page */
+			else if (
+				crash_dump_bitmap_ctrl.exclude_cache_private_pages
+				&& (PageLRU(page) || PageSwapCache(page))
+				&& !PageAnon(page)) {
+					set_crash_dump_bitmap(pfn, 0);
+					crash_dump
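The core of set_crash_dump_bitmap() is plain bit addressing: page frame number pfn lives in byte pfn >> 3 at bit position pfn & 7, and a cleared bit means "exclude this page". The sketch below shows the same addressing on an ordinary userspace buffer. The names struct page_bitmap, page_bitmap_fill(), page_bitmap_set() and page_bitmap_test() are hypothetical, and the buffer stands in for the reserved physical memory that the patch reaches through __va().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace bitmap; the patch uses reserved physical memory. */
struct page_bitmap {
	unsigned char *bits;	/* one bit per page frame */
	size_t size;		/* size of bits[] in bytes */
};

/* Mark every page as "dump", like the memset(..., 0xff, ...) in the patch. */
void page_bitmap_fill(struct page_bitmap *bm)
{
	memset(bm->bits, 0xff, bm->size);
}

/*
 * Set (val != 0) or clear (val == 0) the bit for one page frame.
 * Returns 0 on success, -1 if pfn lies outside the bitmap.
 */
int page_bitmap_set(struct page_bitmap *bm, unsigned long pfn, int val)
{
	size_t byte = pfn >> 3;				/* 8 pages per byte */
	unsigned char mask = (unsigned char)(1U << (pfn & 7));

	if (byte >= bm->size)
		return -1;

	if (val)
		bm->bits[byte] |= mask;
	else
		bm->bits[byte] &= (unsigned char)~mask;
	return 0;
}

/* Returns 1 if the page is kept, 0 if excluded, -1 if pfn is out of range. */
int page_bitmap_test(const struct page_bitmap *bm, unsigned long pfn)
{
	size_t byte = pfn >> 3;

	if (byte >= bm->size)
		return -1;
	return (bm->bits[byte] >> (pfn & 7)) & 1;
}
```

One deliberate difference: the sketch rejects byte == size, while the range check in the posted code compares with ">" against the end of the reserved area, which appears to let the first byte past the end through. That reading may be worth confirming in review.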
[RFC PATCH 2/5] crash dump bitmap: init crash dump bitmap in kernel booting process
Reserve a memory block for crash_dump_bitmap during the kernel boot
process.

Signed-off-by: Jingbai Ma
---
 arch/x86/kernel/setup.c           |  59 +
 include/linux/crash_dump_bitmap.h |  59 +
 kernel/Makefile                   |   1 +
 kernel/crash_dump_bitmap.c        |  45
 4 files changed, 164 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/crash_dump_bitmap.h
 create mode 100644 kernel/crash_dump_bitmap.c

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 84d3285..165c831 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -67,6 +67,7 @@
 #include
 #include
+#include
 #include
 #include

@@ -601,6 +602,62 @@ static void __init reserve_crashkernel(void)
 }
 #endif

+#ifdef CONFIG_CRASH_DUMP_BITMAP
+static void __init crash_dump_bitmap_init(void)
+{
+	static unsigned long BITSPERBYTE = 8;
+
+	unsigned long long mem_start;
+	unsigned long long mem_size;
+
+	if (is_kdump_kernel())
+		return;
+
+	mem_start = (1ULL << 24); /* 16MB */
+	mem_size = roundup((roundup(max_pfn, BITSPERBYTE) / BITSPERBYTE),
+		PAGE_SIZE);
+
+	crash_dump_bitmap_mem = memblock_find_in_range(mem_start,
+		MEMBLOCK_ALLOC_ACCESSIBLE, mem_size, PAGE_SIZE);
+
+	if (!crash_dump_bitmap_mem) {
+		pr_err(
+		"crash_dump_bitmap: allocate error! size=%lldkB, from=%lldMB\n",
+			mem_size >> 10, mem_start >> 20);
+
+		return;
+	}
+
+	crash_dump_bitmap_mem_size = mem_size;
+	memblock_reserve(crash_dump_bitmap_mem, crash_dump_bitmap_mem_size);
+	pr_info("crash_dump_bitmap: bitmap_mem=%lldMB. size=%lldkB\n",
+		(unsigned long long)crash_dump_bitmap_mem >> 20,
+		mem_size >> 10);
+
+	crash_dump_bitmap_res.start = crash_dump_bitmap_mem;
+	crash_dump_bitmap_res.end   = crash_dump_bitmap_mem + mem_size - 1;
+	insert_resource(&iomem_resource, &crash_dump_bitmap_res);
+
+	crash_dump_bitmap_info.version = CRASH_DUMP_BITMAP_VERSION;
+
+	crash_dump_bitmap_info.bitmap = crash_dump_bitmap_mem;
+	crash_dump_bitmap_info.bitmap_size = crash_dump_bitmap_mem_size;
+
+	crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_zero_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_cache_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_cache_private_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_user_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_free_pages = 1;
+
+	pr_info("crash_dump_bitmap: Initialized!\n");
+}
+#else
+static void __init crash_dump_bitmap_init(void)
+{
+}
+#endif
+
 static struct resource standard_io_resources[] = {
 	{ .name = "dma1", .start = 0x00, .end = 0x1f,
 		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
@@ -1094,6 +1151,8 @@ void __init setup_arch(char **cmdline_p)

 	reserve_crashkernel();

+	crash_dump_bitmap_init();
+
 	vsmp_init();

 	io_delay_init();
diff --git a/include/linux/crash_dump_bitmap.h b/include/linux/crash_dump_bitmap.h
new file mode 100644
index 000..63b1264
--- /dev/null
+++ b/include/linux/crash_dump_bitmap.h
@@ -0,0 +1,59 @@
+/*
+ *	include/linux/crash_dump_bitmap.h
+ *	Declaration of crash dump bitmap functions and data structures.
+ *
+ *	(C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *	Author: Jingbai Ma
+ *
+ *	This program is free software; you can redistribute it and/or modify
+ *	it under the terms of version 2 of the GNU General Public License as
+ *	published by the Free Software Foundation.
+ *
+ *	This program is distributed in the hope that it will be useful,
+ *	but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *	MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ *	General Public License for more details.
+ */
+
+#ifndef _LINUX_CRASH_DUMP_BITMAP_H
+#define _LINUX_CRASH_DUMP_BITMAP_H
+
+#define CRASH_DUMP_BITMAP_VERSION 1
+
+enum {
+	CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES = 1,
+	CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES = 2,
+	CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES = 4,
+	CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES = 8,
+	CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES = 16
+};
+
+struct crash_dump_bitmap_ctrl {
+	char exclude_crash_dump_bitmap_pages;
+	char exclude_zero_pages;	/* only for tracking dump level */
+	char exclude_cache_pages;
+	char exclude_cache_private_pages;
+	char exclude_user_pages;
+	char exclude_free_pages;
+};
+
+struct crash_dump_bitmap_info {
+	unsigned int version;
+	phys_addr_t bitmap;
+	phys_addr_t bitmap_size;
+	unsigned long cache_pages;
+	unsigned long cache_
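The size computed in crash_dump_bitmap_init() is one bit per page frame, rounded up to whole bytes and then to whole pages. A small userspace sketch of that arithmetic follows, to make the "32MB per TB" figure from the cover letter checkable. It assumes 4KB base pages, as on x86; round_up_to() and bitmap_bytes() are hypothetical names standing in for the kernel's roundup() expression.

```c
#include <stdint.h>

/* Round x up to the next multiple of align (align must be non-zero). */
uint64_t round_up_to(uint64_t x, uint64_t align)
{
	return ((x + align - 1) / align) * align;
}

/*
 * Bytes needed for a bitmap with one bit per page frame, covering
 * max_pfn page frames, rounded up to a whole number of pages.
 * Mirrors: roundup(roundup(max_pfn, 8) / 8, PAGE_SIZE)
 */
uint64_t bitmap_bytes(uint64_t max_pfn, uint64_t page_size)
{
	uint64_t bytes = round_up_to(max_pfn, 8) / 8;

	return round_up_to(bytes, page_size);
}
```

For 1TB of RAM with 4KB pages there are 2^40 / 2^12 = 2^28 page frames, so the bitmap needs 2^28 / 8 = 2^25 bytes, which is 32MB and matches the figure quoted in the series.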
[RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
This patch intend to speedup the memory pages scanning process in selective dump mode. Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile v1.5.3): Total scan Time Original kernel + makedumpfile v1.5.3 cyclic mode 1958.05 seconds Original kernel + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds Patched kernel + patched makedumpfile v1.5.3 17.50 seconds Traditionally, to reduce the size of dump file, dumper scans all memory pages to exclude the unnecessary memory pages after capture kernel booted, and scan it in userspace code (makedumpfile). It introduces several problems: 1. Requires more memory to store memory bitmap on systems with large amount of memory installed. And in capture kernel there is only a few free memory available, it will cause an out of memory error and fail. (Non-cyclic mode) 2. Scans all memory pages in makedumpfile is a very slow process. On system with 1TB or more memory installed, the scanning process is very long. Typically on 1TB idle system, it takes about 19 minutes. On system with 4TB or more memory installed, it even doesn't work. To address the out of memory issue on system with big memory (4TB or more memory installed), makedumpfile v1.5.1 introduces a new cyclic mode. It only scans a piece of memory pages each time, and do it cyclically to scan all memory pages. But it runs more slowly, on 1TB system, takes about 33 minutes. 3. Scans memory pages code in makedumpfile is very complicated, without kernel memory management related data structure, makedumpfile has to build up its own data structure, and will not able to use some macros that only be available in kernel (e.g. page_to_pfn), and has to use some slow lookup algorithm instead. This patch introduces a new way to scan memory pages. It reserves a piece of memory (1 bit for each page, 32MB per TB memory on x86 systems) in the first kernel. 
During the kernel crash process, it scans all memory pages, clear the bit for all excluded memory pages in the reserved memory. We have several benefits by this new approach: 1. It's extremely fast, on 1TB system only takes about 17.5 seconds to scan all memory pages! 2. Reduces the memory requirement of makedumpfile by putting the reserved memory in the first kernel memory space. 3. Simplifies the complexity of existing memory pages scanning code in userspace. To do: 1. It only has been verified on x86 64bit platform, needs to be modified for other platforms. (ARM, XEN, PPC, etc...) --- Jingbai Ma (5): crash dump bitmap: add a kernel config and help document crash dump bitmap: init crash dump bitmap in kernel booting process crash dump bitmap: scan memory pages in kernel crash process crash dump bitmap: add a proc interface for crash dump bitmap crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue Documentation/kdump/crash_dump_bitmap.txt | 378 + arch/x86/Kconfig | 16 + arch/x86/kernel/setup.c | 62 + fs/proc/Makefile |1 fs/proc/crash_dump_bitmap.c | 221 + include/linux/crash_dump_bitmap.h | 59 + kernel/Makefile |1 kernel/crash_dump_bitmap.c| 201 +++ kernel/kexec.c|5 9 files changed, 943 insertions(+), 1 deletions(-) create mode 100644 Documentation/kdump/crash_dump_bitmap.txt create mode 100644 fs/proc/crash_dump_bitmap.c create mode 100644 include/linux/crash_dump_bitmap.h create mode 100644 kernel/crash_dump_bitmap.c -- Jingbai Ma jingbai...@hp.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/5] crash dump bitmap: add a kernel config and help document
Add a kernel config and help document for CRASH_DUMP_BITMAP.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 Documentation/kdump/crash_dump_bitmap.txt |  378 +
 arch/x86/Kconfig                          |   16 +
 2 files changed, 394 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/kdump/crash_dump_bitmap.txt

diff --git a/Documentation/kdump/crash_dump_bitmap.txt b/Documentation/kdump/crash_dump_bitmap.txt
new file mode 100644
index 000..468cdf2
--- /dev/null
+++ b/Documentation/kdump/crash_dump_bitmap.txt
@@ -0,0 +1,378 @@
+===================================
+Documentation for Crash Dump Bitmap
+===================================
+
+This document includes overview, setup and installation, and analysis
+information.
+
+Overview
+========
+
+Traditionally, to reduce the size of the dump file, the dumper scans all
+memory pages and excludes the unnecessary ones after the capture kernel
+has booted, and the scan runs in userspace code (makedumpfile).
+
+This introduces several problems:
+
+1. It requires more memory to store the memory bitmap on systems with a
+large amount of memory installed. The capture kernel has only a little
+free memory available, so the scan can hit an out-of-memory error and
+fail. (Non-cyclic mode)
+
+2. Scanning all memory pages in makedumpfile is a very slow process. On
+a system with 1TB or more memory installed, the scanning process is very
+long. Typically on an idle 1TB system, it takes about 19 minutes. On a
+system with 4TB or more memory installed, it doesn't work at all. To
+address the out-of-memory issue on systems with large memory (4TB or
+more installed), makedumpfile v1.5.1 introduces a new cyclic mode. It
+scans only a portion of the memory pages each time, and repeats this
+cyclically until all memory pages are scanned. But it runs more slowly:
+on a 1TB system it takes about 33 minutes.
+
+3. The page-scanning code in makedumpfile is very complicated. Without
+the kernel's memory management data structures, makedumpfile has to
+build up its own data structures, cannot use some macros that are only
+available in the kernel (e.g. page_to_pfn), and has to use slower
+lookup algorithms instead.
+
+This patch introduces a new way to scan memory pages. It reserves a piece of
+memory (1 bit for each page, 32MB per TB of memory on x86 systems) in the first
+kernel. During the kernel panic process, it scans all memory pages and clears
+the bit for every excluded memory page in the reserved memory.
+
+This new approach has several benefits:
+
+1. It's extremely fast: on a 1TB system it takes only about 17.5 seconds
+to scan all memory pages!
+
+2. It reduces the memory requirement of makedumpfile by putting the
+reserved memory in the first kernel's memory space.
+
+3. It simplifies the existing memory page scanning code in
+userspace.
+
+
+Usage
+=====
+
+1) Enable kernel crash dump bitmap in Processor type and features, under
+kernel crash dumps.
+
+CONFIG_CRASH_DUMP_BITMAP=y
+
+It depends on the kexec system call and kernel crash dumps, so these features
+must also be enabled.
+
+CONFIG_KEXEC=y
+CONFIG_CRASH_DUMP=y
+
+2) Enable sysfs file system support in Filesystem -> Pseudo filesystems.
+
+   CONFIG_SYSFS=y
+
+3) Compile and install the new kernel.
+
+4) Check the new kernel.
+Once the new kernel has booted, there will be a new folder
+/proc/crash_dump_bitmap.
+Check the current dump level:
+cat /proc/crash_dump_bitmap/dump_level
+
+Set the dump level:
+echo <dump level> > /proc/crash_dump_bitmap/dump_level
+
+The dump level is the same as the parameter of makedumpfile -d dump_level.
+
+Run the page scan and check page status:
+cat /proc/crash_dump_bitmap/page_status
+
+5) Download makedumpfile v1.5.3 or later from sourceforge:
+http://sourceforge.net/projects/makedumpfile/
+
+6) Patch it with the patch at the end of this file.
+
+7) Compile it and copy the patched makedumpfile into the right folder
+(/sbin or /usr/sbin)
+
+8) Change /etc/kdump.conf and add -q to the makedumpfile parameter
+line. This tells makedumpfile to use the crash dump bitmap in the kernel.
+core_collector makedumpfile --non-cyclic -q -c -d 31 --message-level 23
+
+9) Regenerate the initramfs to make sure the patched makedumpfile and config
+have been included in it.
+
+
+To Do
+=====
+
+It currently supports only the x86-64 architecture; support for other
+architectures needs to be added.
+
+
+Contact
+=======
+
+Jingbai Ma (jingbai...@hp.com)
+
+
+Patch (for makedumpfile v1.5.3)
+===============================
+Please forgive me: because of some format issues in the makedumpfile source,
+I had to wrap this patch with '#'. Please use this sed command to extract the
+patch for makedumpfile:
+
+sed -n -e 's/^#\(.*\)#$/\1/p' crash_dump_bitmap.txt > makedumpfile.patch
+
+===============================
+#diff --git a/makedumpfile.c b/makedumpfile.c#
+#index acb1b21..f29b6a5 100644#
+#--- a/makedumpfile.c#
+#+++ b/makedumpfile.c#
+#@@ -34,6 +34,10 @@ struct srcfile_table srcfile_table;#
+# struct vm_table vt = { 0 };#
+# struct
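The "1 bit for each page, 32MB per TB" figure in the documentation above can be checked with a small userspace sketch. This is not part of the patch: bitmap_bytes() and its parameters are names made up for illustration, and it assumes 4KB pages and a bitmap sized from the amount of RAM (the kernel code sizes it from max_pfn, which can be larger when the physical address space has holes).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): bytes needed for a bitmap
 * holding one bit per page frame, with the bitmap itself rounded up to
 * a whole number of pages.
 */
static uint64_t bitmap_bytes(uint64_t ram_bytes, uint64_t page_size)
{
	assert(page_size > 0);

	uint64_t pages = (ram_bytes + page_size - 1) / page_size; /* page frames */
	uint64_t bytes = (pages + 7) / 8;                          /* 1 bit each  */

	/* Round the bitmap size up to a page boundary. */
	return (bytes + page_size - 1) / page_size * page_size;
}
```

For 1TB of RAM and 4096-byte pages this gives 2^28 page frames, hence 2^28 bits, hence 2^25 bytes, which is the 32MB quoted in the text.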
[RFC PATCH 2/5] crash dump bitmap: init crash dump bitmap in kernel booting process
Reserve a memory block for crash_dump_bitmap during the kernel boot process.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 arch/x86/kernel/setup.c           |   59 +
 include/linux/crash_dump_bitmap.h |   59 +
 kernel/Makefile                   |    1 +
 kernel/crash_dump_bitmap.c        |   45
 4 files changed, 164 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/crash_dump_bitmap.h
 create mode 100644 kernel/crash_dump_bitmap.c

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 84d3285..165c831 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -67,6 +67,7 @@
 
 #include <linux/percpu.h>
 #include <linux/crash_dump.h>
+#include <linux/crash_dump_bitmap.h>
 #include <linux/tboot.h>
 #include <linux/jiffies.h>
 
@@ -601,6 +602,62 @@ static void __init reserve_crashkernel(void)
 }
 #endif
 
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+static void __init crash_dump_bitmap_init(void)
+{
+	static unsigned long BITSPERBYTE = 8;
+
+	unsigned long long mem_start;
+	unsigned long long mem_size;
+
+	if (is_kdump_kernel())
+		return;
+
+	mem_start = (1ULL << 24); /* 16MB */
+	mem_size = roundup((roundup(max_pfn, BITSPERBYTE) / BITSPERBYTE),
+		PAGE_SIZE);
+
+	crash_dump_bitmap_mem = memblock_find_in_range(mem_start,
+		MEMBLOCK_ALLOC_ACCESSIBLE, mem_size, PAGE_SIZE);
+
+	if (!crash_dump_bitmap_mem) {
+		pr_err(
+		"crash_dump_bitmap: allocate error! size=%lldkB, from=%lldMB\n",
+			mem_size >> 10, mem_start >> 20);
+
+		return;
+	}
+
+	crash_dump_bitmap_mem_size = mem_size;
+	memblock_reserve(crash_dump_bitmap_mem, crash_dump_bitmap_mem_size);
+	pr_info("crash_dump_bitmap: bitmap_mem=%lldMB. size=%lldkB\n",
+		(unsigned long long)crash_dump_bitmap_mem >> 20,
+		mem_size >> 10);
+
+	crash_dump_bitmap_res.start = crash_dump_bitmap_mem;
+	crash_dump_bitmap_res.end = crash_dump_bitmap_mem + mem_size - 1;
+	insert_resource(&iomem_resource, &crash_dump_bitmap_res);
+
+	crash_dump_bitmap_info.version = CRASH_DUMP_BITMAP_VERSION;
+
+	crash_dump_bitmap_info.bitmap = crash_dump_bitmap_mem;
+	crash_dump_bitmap_info.bitmap_size = crash_dump_bitmap_mem_size;
+
+	crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_zero_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_cache_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_cache_private_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_user_pages = 1;
+	crash_dump_bitmap_ctrl.exclude_free_pages = 1;
+
+	pr_info("crash_dump_bitmap: Initialized!\n");
+}
+#else
+static void __init crash_dump_bitmap_init(void)
+{
+}
+#endif
+
 static struct resource standard_io_resources[] = {
 	{ .name = "dma1", .start = 0x00, .end = 0x1f,
 		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
@@ -1094,6 +1151,8 @@ void __init setup_arch(char **cmdline_p)
 
 	reserve_crashkernel();
 
+	crash_dump_bitmap_init();
+
 	vsmp_init();
 
 	io_delay_init();
diff --git a/include/linux/crash_dump_bitmap.h b/include/linux/crash_dump_bitmap.h
new file mode 100644
index 000..63b1264
--- /dev/null
+++ b/include/linux/crash_dump_bitmap.h
@@ -0,0 +1,59 @@
+/*
+ * include/linux/crash_dump_bitmap.h
+ * Declaration of crash dump bitmap functions and data structures.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ * Author: Jingbai Ma <jingbai...@hp.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef _LINUX_CRASH_DUMP_BITMAP_H
+#define _LINUX_CRASH_DUMP_BITMAP_H
+
+#define CRASH_DUMP_BITMAP_VERSION 1;
+
+enum {
+	CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES = 1,
+	CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES = 2,
+	CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES = 4,
+	CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES = 8,
+	CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES = 16
+};
+
+struct crash_dump_bitmap_ctrl {
+	char exclude_crash_dump_bitmap_pages;
+	char exclude_zero_pages;	/* only for tracking dump level */
+	char exclude_cache_pages;
+	char exclude_cache_private_pages;
+	char exclude_user_pages;
+	char exclude_free_pages;
+};
+
+struct crash_dump_bitmap_info {
+	unsigned int version;
+	phys_addr_t bitmap;
+	phys_addr_t bitmap_size;
+	unsigned long
[RFC PATCH 3/5] crash dump bitmap: scan memory pages in kernel crash process
During the kernel crash process, call generate_crash_dump_bitmap() to scan
all memory pages and clear the bit for every excluded memory page in the
reserved memory.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 kernel/crash_dump_bitmap.c |  156
 kernel/kexec.c             |    5 +
 2 files changed, 161 insertions(+), 0 deletions(-)

diff --git a/kernel/crash_dump_bitmap.c b/kernel/crash_dump_bitmap.c
index e743cdd..eed13ca 100644
--- a/kernel/crash_dump_bitmap.c
+++ b/kernel/crash_dump_bitmap.c
@@ -23,6 +23,8 @@
 
 #ifdef CONFIG_CRASH_DUMP_BITMAP
 
+#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
+
 phys_addr_t crash_dump_bitmap_mem;
 EXPORT_SYMBOL(crash_dump_bitmap_mem);
 
@@ -35,6 +37,7 @@ EXPORT_SYMBOL(crash_dump_bitmap_ctrl);
 
 struct crash_dump_bitmap_info crash_dump_bitmap_info;
 EXPORT_SYMBOL(crash_dump_bitmap_info);
+
 /* Location of the reserved area for the crash_dump_bitmap */
 struct resource crash_dump_bitmap_res = {
 	.name = "Crash dump bitmap",
@@ -42,4 +45,157 @@ struct resource crash_dump_bitmap_res = {
 	.end = 0,
 	.flags = IORESOURCE_BUSY | IORESOURCE_MEM
 };
+
+inline void set_crash_dump_bitmap(unsigned long pfn, int val)
+{
+	phys_addr_t paddr = crash_dump_bitmap_info.bitmap + (pfn >> 3);
+	unsigned char *vaddr;
+	unsigned char bit = (pfn & 7);
+
+	if (unlikely(paddr > (crash_dump_bitmap_mem
+		+ crash_dump_bitmap_mem_size))) {
+		pr_err(
+		"crash_dump_bitmap: pfn exceed limit. pfn=%ld, addr=0x%llX\n",
+			pfn, paddr);
+		return;
+	}
+
+	vaddr = (unsigned char *)__va(paddr);
+
+	if (val)
+		*vaddr |= (1U << bit);
+	else
+		*vaddr &= (~(1U << bit));
+}
+
+void generate_crash_dump_bitmap(void)
+{
+	pg_data_t *pgdat;
+	struct zone *zone;
+	unsigned long flags;
+	int order, t;
+	struct list_head *curr;
+	unsigned long zone_free_pages;
+	phys_addr_t addr;
+
+	if (!crash_dump_bitmap_mem) {
+		pr_info("crash_dump_bitmap: no crash_dump_bitmap memory.\n");
+		return;
+	}
+
+	pr_info(
+	"Excluding pages: bitmap=%d, cache=%d, private=%d, user=%d, free=%d\n",
+		crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages,
+		crash_dump_bitmap_ctrl.exclude_cache_pages,
+		crash_dump_bitmap_ctrl.exclude_cache_private_pages,
+		crash_dump_bitmap_ctrl.exclude_user_pages,
+		crash_dump_bitmap_ctrl.exclude_free_pages);
+
+	crash_dump_bitmap_info.free_pages = 0;
+	crash_dump_bitmap_info.cache_pages = 0;
+	crash_dump_bitmap_info.cache_private_pages = 0;
+	crash_dump_bitmap_info.user_pages = 0;
+	crash_dump_bitmap_info.hwpoison_pages = 0;
+
+	/* Set all bits on bitmap */
+	memset(__va(crash_dump_bitmap_info.bitmap), 0xff,
+		crash_dump_bitmap_info.bitmap_size);
+
+	/* Exclude all crash_dump_bitmap pages */
+	if (crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages) {
+		for (addr = crash_dump_bitmap_mem; addr <
+			crash_dump_bitmap_mem + crash_dump_bitmap_mem_size;
+			addr += PAGE_SIZE)
+			set_crash_dump_bitmap(
+				virt_to_pfn(__va(addr)), 0);
+	}
+
+	/* Exclude unnecessary pages */
+	for_each_online_pgdat(pgdat) {
+		unsigned long i;
+		unsigned long flags;
+
+		pgdat_resize_lock(pgdat, &flags);
+		for (i = 0; i < pgdat->node_spanned_pages; i++) {
+			struct page *page;
+			unsigned long pfn = pgdat->node_start_pfn + i;
+
+			if (!pfn_valid(pfn))
+				continue;
+
+			page = pfn_to_page(pfn);
+
+			/* Exclude the cache pages without the private page */
+			if (crash_dump_bitmap_ctrl.exclude_cache_pages
+				&& (PageLRU(page) || PageSwapCache(page))
+				&& !page_has_private(page) && !PageAnon(page)) {
+				set_crash_dump_bitmap(pfn, 0);
+				crash_dump_bitmap_info.cache_pages++;
+			}
+			/* Exclude the cache pages with private page */
+			else if (
+			crash_dump_bitmap_ctrl.exclude_cache_private_pages
+				&& (PageLRU(page) || PageSwapCache(page))
+				&& !PageAnon(page)) {
+				set_crash_dump_bitmap(pfn, 0);
+				crash_dump_bitmap_info.cache_private_pages++;
+			}
+			/* Exclude the pages used by user process
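The bit addressing used by set_crash_dump_bitmap() (byte index pfn >> 3, bit index pfn & 7, all bits set first and then cleared for excluded pages) can be tried out in userspace with a sketch like the following. The names bitmap_mark_all(), bitmap_set_pfn() and bitmap_test_pfn() are hypothetical, the buffer is an ordinary array rather than reserved physical memory, and the strict bounds check is this sketch's own choice rather than a copy of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mark every page as "to be dumped", like the memset(..., 0xff, ...) step. */
static void bitmap_mark_all(unsigned char *bitmap, size_t bitmap_size)
{
	assert(bitmap != NULL);
	memset(bitmap, 0xff, bitmap_size);
}

/*
 * Set (val != 0) or clear (val == 0) the bit for one page frame number.
 * Returns 0 on success, -1 if pfn lies outside the bitmap (a hypothetical
 * error convention for this sketch).
 */
static int bitmap_set_pfn(unsigned char *bitmap, size_t bitmap_size,
			  unsigned long pfn, int val)
{
	size_t byte = pfn >> 3;      /* 8 page frames per byte */
	unsigned int bit = pfn & 7;  /* position inside that byte */

	if (byte >= bitmap_size)
		return -1;

	if (val)
		bitmap[byte] |= (unsigned char)(1U << bit);
	else
		bitmap[byte] &= (unsigned char)~(1U << bit);
	return 0;
}

/* Read one bit back: 1 means "dump this page", 0 means "excluded". */
static int bitmap_test_pfn(const unsigned char *bitmap, unsigned long pfn)
{
	return (bitmap[pfn >> 3] >> (pfn & 7)) & 1;
}
```

Clearing pfn 10 in a freshly marked bitmap touches byte 1, bit 2, and leaves its neighbours set.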
[RFC PATCH 4/5] crash dump bitmap: add a proc interface for crash dump bitmap
Add a procfs driver for selecting, from userspace, which pages to exclude:
/proc/crash_dump_bitmap/

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 fs/proc/Makefile            |    1
 fs/proc/crash_dump_bitmap.c |  221 +++
 2 files changed, 222 insertions(+), 0 deletions(-)
 create mode 100644 fs/proc/crash_dump_bitmap.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 712f24d..2dfcff1 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -27,6 +27,7 @@ proc-$(CONFIG_PROC_SYSCTL)	+= proc_sysctl.o
 proc-$(CONFIG_NET)		+= proc_net.o
 proc-$(CONFIG_PROC_KCORE)	+= kcore.o
 proc-$(CONFIG_PROC_VMCORE)	+= vmcore.o
+proc-$(CONFIG_CRASH_DUMP_BITMAP)	+= crash_dump_bitmap.o
 proc-$(CONFIG_PROC_DEVICETREE)	+= proc_devtree.o
 proc-$(CONFIG_PRINTK)	+= kmsg.o
 proc-$(CONFIG_PROC_PAGE_MONITOR)	+= page.o
diff --git a/fs/proc/crash_dump_bitmap.c b/fs/proc/crash_dump_bitmap.c
new file mode 100644
index 000..77ecaae
--- /dev/null
+++ b/fs/proc/crash_dump_bitmap.c
@@ -0,0 +1,221 @@
+/*
+ * fs/proc/crash_dump_bitmap.c
+ * Interface for controlling the crash dump bitmap from user space.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ * Author: Jingbai Ma <jingbai...@hp.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/jiffies.h>
+#include <linux/crash_dump.h>
+#include <linux/crash_dump_bitmap.h>
+
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jingbai Ma <jingbai...@hp.com>");
+MODULE_DESCRIPTION("Crash dump bitmap support driver");
+
+static const char *proc_dir_name = "crash_dump_bitmap";
+static const char *proc_page_status_name = "page_status";
+static const char *proc_dump_level_name = "dump_level";
+
+static struct proc_dir_entry *proc_dir, *proc_page_status, *proc_dump_level;
+
+static unsigned int get_dump_level(void)
+{
+	unsigned int dump_level;
+
+	dump_level = crash_dump_bitmap_ctrl.exclude_zero_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_cache_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_cache_private_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_user_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_free_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES : 0;
+
+	return dump_level;
+}
+
+static void set_dump_level(unsigned int dump_level)
+{
+	crash_dump_bitmap_ctrl.exclude_zero_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_cache_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_cache_private_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES)
+		? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_user_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_free_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES) ? 1 : 0;
+}
+
+static int proc_page_status_show(struct seq_file *m, void *v)
+{
+	u64 start, duration;
+
+	if (!crash_dump_bitmap_mem) {
+		seq_printf(m,
+			"crash_dump_bitmap: crash_dump_bitmap_mem not found!\n");
+
+		return -EINVAL;
+	}
+
+	seq_printf(m, "Exclude page flag status:\n");
+	seq_printf(m, "exclude_dump_bitmap_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages);
+	seq_printf(m, "exclude_zero_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_zero_pages);
+	seq_printf(m, "exclude_cache_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_cache_pages);
+	seq_printf(m, "exclude_cache_private_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_cache_private_pages);
+	seq_printf(m, "exclude_user_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_user_pages);
+	seq_printf(m, "exclude_free_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_free_pages);
+
+	seq_printf(m, "Scanning all memory pages:\n");
+	start = get_jiffies_64
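The mapping between the numeric dump level written to /proc/crash_dump_bitmap/dump_level and the individual exclude flags is a plain bitmask. A userspace sketch of that round trip is below; the flag values are the ones from the enum in include/linux/crash_dump_bitmap.h, while struct dump_ctrl, level_to_ctrl() and ctrl_to_level() are hypothetical stand-ins for the kernel structure and the get/set helpers in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as listed in the patch's enum. */
enum {
	CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES = 1,
	CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES = 2,
	CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES = 4,
	CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES = 8,
	CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES = 16
};

/* Hypothetical userspace stand-in for the exclude_* fields. */
struct dump_ctrl {
	char zero, cache, cache_private, user, free_pages;
};

/* Decode a dump level (0..31) into one flag per page class. */
static void level_to_ctrl(unsigned int level, struct dump_ctrl *c)
{
	assert(c != NULL);
	c->zero = (level & CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES) ? 1 : 0;
	c->cache = (level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES) ? 1 : 0;
	c->cache_private =
		(level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES) ? 1 : 0;
	c->user = (level & CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES) ? 1 : 0;
	c->free_pages = (level & CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES) ? 1 : 0;
}

/* Encode the flags back into a single dump level. */
static unsigned int ctrl_to_level(const struct dump_ctrl *c)
{
	unsigned int level = 0;

	assert(c != NULL);
	if (c->zero)
		level |= CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES;
	if (c->cache)
		level |= CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES;
	if (c->cache_private)
		level |= CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES;
	if (c->user)
		level |= CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES;
	if (c->free_pages)
		level |= CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES;
	return level;
}
```

Level 31 (1+2+4+8+16) turns on every exclusion, which matches the "-d 31" used in the core_collector line of the documentation patch.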
[RFC PATCH 5/5] crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue
Linux kernel 3.9-rc1 allows crashkernel above 4GB, but the current kexec-tools
doesn't support that yet.
This patch is only a workaround to make kdump work again.
It should be removed after the kexec-tools 2.0.4 release.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 arch/x86/kernel/setup.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 165c831..15321d6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -506,7 +506,8 @@ static void __init memblock_x86_reserve_range_setup_data(void)
 #ifdef CONFIG_X86_32
 # define CRASH_KERNEL_ADDR_MAX	(512 << 20)
 #else
-# define CRASH_KERNEL_ADDR_MAX	MAXMEM
+/* # define CRASH_KERNEL_ADDR_MAX	MAXMEM */
+# define CRASH_KERNEL_ADDR_MAX	(896 << 20)
 #endif
 
 static void __init reserve_crashkernel_low(void)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
This patch set intends to speed up the memory page scanning process in
selective dump mode.

Test result (on an HP ProLiant DL980 G7 with 1TB RAM, makedumpfile v1.5.3):

                                                        Total scan time
Original kernel + makedumpfile v1.5.3 cyclic mode       1958.05 seconds
Original kernel + makedumpfile v1.5.3 non-cyclic mode   1151.50 seconds
Patched kernel + patched makedumpfile v1.5.3              17.50 seconds

Traditionally, to reduce the size of the dump file, the dumper scans all
memory pages and excludes the unnecessary ones after the capture kernel
has booted, and the scan runs in userspace code (makedumpfile).

This introduces several problems:

1. It requires more memory to store the memory bitmap on systems with a
large amount of memory installed. The capture kernel has only a little
free memory available, so the scan can hit an out-of-memory error and
fail. (Non-cyclic mode)

2. Scanning all memory pages in makedumpfile is a very slow process. On
a system with 1TB or more memory installed, the scanning process is very
long. Typically on an idle 1TB system, it takes about 19 minutes. On a
system with 4TB or more memory installed, it doesn't work at all. To
address the out-of-memory issue on systems with large memory (4TB or
more installed), makedumpfile v1.5.1 introduces a new cyclic mode. It
scans only a portion of the memory pages each time, and repeats this
cyclically until all memory pages are scanned. But it runs more slowly:
on a 1TB system it takes about 33 minutes.

3. The page-scanning code in makedumpfile is very complicated. Without
the kernel's memory management data structures, makedumpfile has to
build up its own data structures, cannot use some macros that are only
available in the kernel (e.g. page_to_pfn), and has to use slower
lookup algorithms instead.

This patch introduces a new way to scan memory pages. It reserves a piece of
memory (1 bit for each page, 32MB per TB of memory on x86 systems) in the first
kernel. During the kernel crash process, it scans all memory pages and clears
the bit for every excluded memory page in the reserved memory.

This new approach has several benefits:

1. It's extremely fast: on a 1TB system it takes only about 17.5 seconds
to scan all memory pages!

2. It reduces the memory requirement of makedumpfile by putting the
reserved memory in the first kernel's memory space.

3. It simplifies the existing memory page scanning code in
userspace.

To do:
1. It has only been verified on the x86 64-bit platform and needs to be
modified for other platforms (ARM, XEN, PPC, etc.).

---

Jingbai Ma (5):
      crash dump bitmap: add a kernel config and help document
      crash dump bitmap: init crash dump bitmap in kernel booting process
      crash dump bitmap: scan memory pages in kernel crash process
      crash dump bitmap: add a proc interface for crash dump bitmap
      crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue

 Documentation/kdump/crash_dump_bitmap.txt |  378 +
 arch/x86/Kconfig                          |   16 +
 arch/x86/kernel/setup.c                   |   62 +
 fs/proc/Makefile                          |    1
 fs/proc/crash_dump_bitmap.c               |  221 +
 include/linux/crash_dump_bitmap.h         |   59 +
 kernel/Makefile                           |    1
 kernel/crash_dump_bitmap.c                |  201 +++
 kernel/kexec.c                            |    5
 9 files changed, 943 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/kdump/crash_dump_bitmap.txt
 create mode 100644 fs/proc/crash_dump_bitmap.c
 create mode 100644 include/linux/crash_dump_bitmap.h
 create mode 100644 kernel/crash_dump_bitmap.c

--
Jingbai Ma <jingbai...@hp.com>
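As a rough cross-check of the timings in the cover letter, the reported scan times can be turned into minutes and into speedup ratios. The helpers below are a hypothetical sketch for that arithmetic only; the input numbers are the ones from the test result table, and nothing here measures anything.

```c
#include <assert.h>

/* Ratio of a baseline scan time to the patched scan time. */
static double speedup(double baseline_sec, double patched_sec)
{
	assert(patched_sec > 0.0);
	return baseline_sec / patched_sec;
}

/* Convert seconds to minutes, to compare the table with the prose. */
static double to_minutes(double seconds)
{
	return seconds / 60.0;
}
```

With the table's values, 1151.50 seconds is a little over 19 minutes and 1958.05 seconds is a little under 33 minutes, consistent with the figures quoted in the problem list; relative to the 17.50 second patched scan that is roughly a 66x and a 112x reduction respectively.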
[RFC PATCH 1/5] crash dump bitmap: add a kernel config and help document
Add a kernel config and help document for CRASH_DUMP_BITMAP. Signed-off-by: Jingbai Ma jingbai...@hp.com --- Documentation/kdump/crash_dump_bitmap.txt | 378 + arch/x86/Kconfig | 16 + 2 files changed, 394 insertions(+), 0 deletions(-) create mode 100644 Documentation/kdump/crash_dump_bitmap.txt diff --git a/Documentation/kdump/crash_dump_bitmap.txt b/Documentation/kdump/crash_dump_bitmap.txt new file mode 100644 index 000..468cdf2 --- /dev/null +++ b/Documentation/kdump/crash_dump_bitmap.txt @@ -0,0 +1,378 @@ + +Documentation for Crash Dump Bitmap + + +This document includes overview, setup and installation, and analysis +information. + +Overview + + +Traditionally, to reduce the size of dump file, dumper scans all memory +pages to exclude the unnecessary memory pages after capture kernel +booted, and scan it in userspace code (makedumpfile). + +It introduces several problems: + +1. Requires more memory to store memory bitmap on systems with large +amount of memory installed. And in capture kernel there is only a few +free memory available, it will cause an out of memory error and fail. +(Non-cyclic mode) + +2. Scans all memory pages in makedumpfile is a very slow process. On +system with 1TB or more memory installed, the scanning process is very +long. Typically on 1TB idle system, it takes about 19 minutes. On system +with 4TB or more memory installed, it even doesn't work. To address the +out of memory issue on system with big memory (4TB or more memory +installed), makedumpfile v1.5.1 introduces a new cyclic mode. It only +scans a piece of memory pages each time, and do it cyclically to scan +all memory pages. But it runs more slowly, on 1TB system, takes about 33 +minutes. + +3. Scans memory pages code in makedumpfile is very complicated, without +kernel memory management related data structure, makedumpfile has to +build up its on data structure, and will not able to use some macros +that only be available in kernel (e.g. 
page_to_pfn), and has to use some +slow lookup algorithm instead. + +This patch introduces a new way to scan memory pages. It reserves a piece of +memory (1 bit for each page, 32MB per TB memory on x86 systems) in the first +kernel. During the kernel panic process, it scans all memory pages, clear the +bit for all excluded memory pages in the reserved memory. + +We have several benefits by this new approach: + +1. It's extremely fast, on 1TB system only takes about 17.5 seconds to +scan all memory pages! + +2. Reduces the memory requirement of makedumpfile by putting the +reserved memory in the first kernel memory space. + +3. Simplifies the complexity of existing memory pages scanning code in +userspace. + + +Usage += + +1) Enable kernel crash dump bitmap in Processor type and features, under +kernel crash dumps. + +CONFIG_CRASH_DUMP_BITMAP=y + +it depends on kexec system call and kernel crash dumps, so there features +must be enabled also. + +CONFIG_KEXEC=y +CONFIG_CRASH_DUMP=y + +2) Enable sysfs file system support in Filesystem - Pseudo filesystems.. + + CONFIG_SYSFS=y + +3) Compile and install the new kernel. + +4) Check the new kernel. +Once new kernel has booted, there will be a new foler +/proc/crash_dump_bitmap. +Check current dump level: +cat /proc/crash_dump_bitmap/dump_level + +Set dump level: +echo dump level /proc/crash_dump_bitmap/dump_level + +The dump level is as same as the parameter of makedumpfile -d dump_level. + +Run page scan and check page status: +cat /proc/crash_dump_bitmap/page_status + +5) Download makedumpfile v1.5.3 or later from sourceforge: +http://sourceforge.net/projects/makedumpfile/ + +6) Patch it with the patch at the end of this file. + +7) Compile it and copy the patched makedumpfile into the right folder +(/sbin or /usr/sbin) + +8) Change the /etc/kdump.conf, and a -q in the makedumpfile parameter +line. It will tell makedumpfile to use the crash dump bitmap in kernel. 
+core_collector makedumpfile --non-cyclic -q -c -d 31 --message-level 23 + +9) Regenerate initramfs to make sure the patched makedumpfile and config +has been included in it. + + +To Do += + +It only supports x86-64 architecture currently, need to add supports for +other architectures. + + +Contact +=== + +Jingbai Ma (jingbai...@hp.com) + + +Patch (for makedumpfile v1.5.3) + +Please forgive me, for some format issues of makedumpfile source, I have +to wrap this patch with '#'. Please use this sed command to get the +patch for makedumpfile: + +sed -n -e s/^#\(.*\)#$/\1/p crash_dump_bitmap.txt makedumpfile.patch + += +#diff --git a/makedumpfile.c b/makedumpfile.c# +#index acb1b21..f29b6a5 100644# +#--- a/makedumpfile.c# +#+++ b/makedumpfile.c# +#@@ -34,6 +34,10 @@ struct srcfile_table srcfile_table;# +# struct vm_table vt = { 0 };# +# struct
[RFC PATCH 2/5] crash dump bitmap: init crash dump bitmap in kernel booting process
Reserve a memory block for crash_dump_bitmap in kernel booting process. Signed-off-by: Jingbai Ma jingbai...@hp.com --- arch/x86/kernel/setup.c | 59 + include/linux/crash_dump_bitmap.h | 59 + kernel/Makefile |1 + kernel/crash_dump_bitmap.c| 45 4 files changed, 164 insertions(+), 0 deletions(-) create mode 100644 include/linux/crash_dump_bitmap.h create mode 100644 kernel/crash_dump_bitmap.c diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 84d3285..165c831 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -67,6 +67,7 @@ #include linux/percpu.h #include linux/crash_dump.h +#include linux/crash_dump_bitmap.h #include linux/tboot.h #include linux/jiffies.h @@ -601,6 +602,62 @@ static void __init reserve_crashkernel(void) } #endif +#ifdef CONFIG_CRASH_DUMP_BITMAP +static void __init crash_dump_bitmap_init(void) +{ + static unsigned long BITSPERBYTE = 8; + + unsigned long long mem_start; + unsigned long long mem_size; + + if (is_kdump_kernel()) + return; + + mem_start = (1ULL 24); /* 16MB */ + mem_size = roundup((roundup(max_pfn, BITSPERBYTE) / BITSPERBYTE), + PAGE_SIZE); + + crash_dump_bitmap_mem = memblock_find_in_range(mem_start, + MEMBLOCK_ALLOC_ACCESSIBLE, mem_size, PAGE_SIZE); + + if (!crash_dump_bitmap_mem) { + pr_err( + crash_dump_bitmap: allocate error! size=%lldkB, from=%lldMB\n, + mem_size 10, mem_start 20); + + return; + } + + crash_dump_bitmap_mem_size = mem_size; + memblock_reserve(crash_dump_bitmap_mem, crash_dump_bitmap_mem_size); + pr_info(crash_dump_bitmap: bitmap_mem=%lldMB. 
size=%lldkB\n, + (unsigned long long)crash_dump_bitmap_mem 20, + mem_size 10); + + crash_dump_bitmap_res.start = crash_dump_bitmap_mem; + crash_dump_bitmap_res.end = crash_dump_bitmap_mem + mem_size - 1; + insert_resource(iomem_resource, crash_dump_bitmap_res); + + crash_dump_bitmap_info.version = CRASH_DUMP_BITMAP_VERSION; + + crash_dump_bitmap_info.bitmap = crash_dump_bitmap_mem; + crash_dump_bitmap_info.bitmap_size = crash_dump_bitmap_mem_size; + + crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages = 1; + crash_dump_bitmap_ctrl.exclude_zero_pages = 1; + crash_dump_bitmap_ctrl.exclude_cache_pages = 1; + crash_dump_bitmap_ctrl.exclude_cache_private_pages = 1; + crash_dump_bitmap_ctrl.exclude_user_pages = 1; + crash_dump_bitmap_ctrl.exclude_free_pages = 1; + + pr_info(crash_dump_bitmap: Initialized!\n); +} +#else +static void __init crash_dump_bitmap_init(void) +{ +} +#endif + static struct resource standard_io_resources[] = { { .name = dma1, .start = 0x00, .end = 0x1f, .flags = IORESOURCE_BUSY | IORESOURCE_IO }, @@ -1094,6 +1151,8 @@ void __init setup_arch(char **cmdline_p) reserve_crashkernel(); + crash_dump_bitmap_init(); + vsmp_init(); io_delay_init(); diff --git a/include/linux/crash_dump_bitmap.h b/include/linux/crash_dump_bitmap.h new file mode 100644 index 000..63b1264 --- /dev/null +++ b/include/linux/crash_dump_bitmap.h @@ -0,0 +1,59 @@ +/* + *include/linux/crash_dump_bitmap.h + *Declaration of crash dump bitmap functions and data structures. + * + *(C) Copyright 2013 Hewlett-Packard Development Company, L.P. + *Author: Jingbai Ma jingbai...@hp.com + * + *This program is free software; you can redistribute it and/or modify + *it under the terms of version 2 of the GNU General Public License as + *published by the Free Software Foundation. + * + *This program is distributed in the hope that it will be useful, + *but WITHOUT ANY WARRANTY; without even the implied warranty of + *MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU + *General Public License for more details. + */ + +#ifndef _LINUX_CRASH_DUMP_BITMAP_H +#define _LINUX_CRASH_DUMP_BITMAP_H + +#define CRASH_DUMP_BITMAP_VERSION 1; + +enum { + CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES = 1, + CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES = 2, + CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES = 4, + CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES = 8, + CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES = 16 +}; + +struct crash_dump_bitmap_ctrl { + char exclude_crash_dump_bitmap_pages; + char exclude_zero_pages;/* only for tracking dump level */ + char exclude_cache_pages; + char exclude_cache_private_pages; + char exclude_user_pages; + char exclude_free_pages; +}; + +struct crash_dump_bitmap_info { + unsigned int version; + phys_addr_t bitmap; + phys_addr_t bitmap_size; + unsigned long
[RFC PATCH 4/5] crash dump bitmap: add a proc interface for crash dump bitmap
Add a procfs driver for selecting exclude pages in userspace. /proc/crash_dump_bitmap/ Signed-off-by: Jingbai Ma jingbai...@hp.com --- fs/proc/Makefile|1 fs/proc/crash_dump_bitmap.c | 221 +++ 2 files changed, 222 insertions(+), 0 deletions(-) create mode 100644 fs/proc/crash_dump_bitmap.c diff --git a/fs/proc/Makefile b/fs/proc/Makefile index 712f24d..2dfcff1 100644 --- a/fs/proc/Makefile +++ b/fs/proc/Makefile @@ -27,6 +27,7 @@ proc-$(CONFIG_PROC_SYSCTL)+= proc_sysctl.o proc-$(CONFIG_NET) += proc_net.o proc-$(CONFIG_PROC_KCORE) += kcore.o proc-$(CONFIG_PROC_VMCORE) += vmcore.o +proc-$(CONFIG_CRASH_DUMP_BITMAP) += crash_dump_bitmap.o proc-$(CONFIG_PROC_DEVICETREE) += proc_devtree.o proc-$(CONFIG_PRINTK) += kmsg.o proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o diff --git a/fs/proc/crash_dump_bitmap.c b/fs/proc/crash_dump_bitmap.c new file mode 100644 index 000..77ecaae --- /dev/null +++ b/fs/proc/crash_dump_bitmap.c @@ -0,0 +1,221 @@ +/* + *fs/proc/crash_dump_bitmap.c + *Interface for controlling the crash dump bitmap from user space. + * + *(C) Copyright 2013 Hewlett-Packard Development Company, L.P. + *Author: Jingbai Ma jingbai...@hp.com + * + *This program is free software; you can redistribute it and/or modify + *it under the terms of version 2 of the GNU General Public License as + *published by the Free Software Foundation. + * + *This program is distributed in the hope that it will be useful, + *but WITHOUT ANY WARRANTY; without even the implied warranty of + *MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU + *General Public License for more details. 
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/jiffies.h>
+#include <linux/crash_dump.h>
+#include <linux/crash_dump_bitmap.h>
+
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jingbai Ma <jingbai...@hp.com>");
+MODULE_DESCRIPTION("Crash dump bitmap support driver");
+
+static const char *proc_dir_name = "crash_dump_bitmap";
+static const char *proc_page_status_name = "page_status";
+static const char *proc_dump_level_name = "dump_level";
+
+static struct proc_dir_entry *proc_dir, *proc_page_status, *proc_dump_level;
+
+static unsigned int get_dump_level(void)
+{
+	unsigned int dump_level;
+
+	dump_level = crash_dump_bitmap_ctrl.exclude_zero_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_cache_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_cache_private_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_user_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES : 0;
+	dump_level |= crash_dump_bitmap_ctrl.exclude_free_pages
+		? CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES : 0;
+
+	return dump_level;
+}
+
+static void set_dump_level(unsigned int dump_level)
+{
+	crash_dump_bitmap_ctrl.exclude_zero_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_cache_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_cache_private_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES)
+		? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_user_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES) ? 1 : 0;
+	crash_dump_bitmap_ctrl.exclude_free_pages =
+		(dump_level & CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES) ? 1 : 0;
+}
+
+static int proc_page_status_show(struct seq_file *m, void *v)
+{
+	u64 start, duration;
+
+	if (!crash_dump_bitmap_mem) {
+		seq_printf(m,
+			"crash_dump_bitmap: crash_dump_bitmap_mem not found!\n");
+
+		return -EINVAL;
+	}
+
+	seq_printf(m, "Exclude page flag status:\n");
+	seq_printf(m, "exclude_dump_bitmap_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages);
+	seq_printf(m, "exclude_zero_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_zero_pages);
+	seq_printf(m, "exclude_cache_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_cache_pages);
+	seq_printf(m, "exclude_cache_private_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_cache_private_pages);
+	seq_printf(m, "exclude_user_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_user_pages);
+	seq_printf(m, "exclude_free_pages=%d\n",
+		crash_dump_bitmap_ctrl.exclude_free_pages);
+
+	seq_printf(m, "Scanning all memory pages:\n");
+	start = get_jiffies_64
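For readers following the get_dump_level()/set_dump_level() pair above, here is a
minimal userspace sketch of the same pack/unpack idea. It is not the patch's code.
The bit values are an assumption that mirrors makedumpfile's -d dump level bits
(1, 2, 4, 8, 16); the patch's real CRASH_DUMP_LEVEL_* definitions live in
linux/crash_dump_bitmap.h, which is not shown in this message. struct dump_ctrl,
get_level() and set_level() are hypothetical names used only for illustration.

```c
/*
 * Assumed bit values, mirroring makedumpfile's -d levels. The real
 * CRASH_DUMP_LEVEL_* constants are defined elsewhere in the patch set.
 */
enum {
	LEVEL_EXCLUDE_ZERO_PAGES          = 1,
	LEVEL_EXCLUDE_CACHE_PAGES         = 2,
	LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES = 4,
	LEVEL_EXCLUDE_USER_PAGES          = 8,
	LEVEL_EXCLUDE_FREE_PAGES          = 16
};

/* Hypothetical stand-in for the exclude_* fields of crash_dump_bitmap_ctrl. */
struct dump_ctrl {
	unsigned int exclude_zero_pages;
	unsigned int exclude_cache_pages;
	unsigned int exclude_cache_private_pages;
	unsigned int exclude_user_pages;
	unsigned int exclude_free_pages;
};

/* Pack the five per-type flags into one dump level value. */
static unsigned int get_level(const struct dump_ctrl *c)
{
	unsigned int level = 0;

	if (c->exclude_zero_pages)
		level |= LEVEL_EXCLUDE_ZERO_PAGES;
	if (c->exclude_cache_pages)
		level |= LEVEL_EXCLUDE_CACHE_PAGES;
	if (c->exclude_cache_private_pages)
		level |= LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES;
	if (c->exclude_user_pages)
		level |= LEVEL_EXCLUDE_USER_PAGES;
	if (c->exclude_free_pages)
		level |= LEVEL_EXCLUDE_FREE_PAGES;

	return level;
}

/* Unpack a dump level value into the five flags; each flag ends up 0 or 1. */
static void set_level(struct dump_ctrl *c, unsigned int level)
{
	c->exclude_zero_pages          = !!(level & LEVEL_EXCLUDE_ZERO_PAGES);
	c->exclude_cache_pages         = !!(level & LEVEL_EXCLUDE_CACHE_PAGES);
	c->exclude_cache_private_pages = !!(level & LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES);
	c->exclude_user_pages          = !!(level & LEVEL_EXCLUDE_USER_PAGES);
	c->exclude_free_pages          = !!(level & LEVEL_EXCLUDE_FREE_PAGES);
}
```

Under these assumed values, calling set_level() with 17 would turn on only the
zero-page and free-page flags, and get_level() on the result would give 17 back.
Bits outside the five known flags are silently dropped by the round trip.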
[RFC PATCH 5/5] crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue
Linux kernel 3.9-rc1 allows crashkernel above 4GB, but current kexec-tools
doesn't support it yet. This patch is only a workaround to make kdump work
again, and it should be removed once kexec-tools 2.0.4 is released.

Signed-off-by: Jingbai Ma <jingbai...@hp.com>
---
 arch/x86/kernel/setup.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 165c831..15321d6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -506,7 +506,8 @@ static void __init memblock_x86_reserve_range_setup_data(void)
 #ifdef CONFIG_X86_32
 # define CRASH_KERNEL_ADDR_MAX	(512 << 20)
 #else
-# define CRASH_KERNEL_ADDR_MAX	MAXMEM
+/* # define CRASH_KERNEL_ADDR_MAX	MAXMEM */
+# define CRASH_KERNEL_ADDR_MAX	(896 << 20)
 #endif
 
 static void __init reserve_crashkernel_low(void)
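As a quick check of the arithmetic behind the limits in this workaround, the
sketch below spells out what a "count << 20" expression evaluates to: a MiB count
converted to bytes. mib_to_bytes() and fits_below() are hypothetical helpers
written for this illustration only. They are not kernel functions, and the
kernel's actual crashkernel reservation logic is not reproduced here.

```c
#include <stdint.h>

/* Convert a MiB count to bytes the same way the macro does: n << 20. */
static uint64_t mib_to_bytes(uint64_t mib)
{
	return mib << 20;
}

/*
 * Hypothetical check: does the region [start, start + size) end at or
 * below limit? Written to avoid overflow in start + size.
 */
static int fits_below(uint64_t start, uint64_t size, uint64_t limit)
{
	return size <= limit && start <= limit - size;
}
```

With these helpers, mib_to_bytes(896) is 939524096 bytes, so a 128 MiB region
starting at 768 MiB just fits under the 896 MiB cap, while any region starting
above 4 GiB does not.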