Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
* Borislav Petkov wrote:

> On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote:
> > How about executing a perf in usermodehelper and collecting output
> > in tmpfs? Using this approach, we can start a perf after rootfs
> > initialization,
>
> What for if we can start logging to buffers much earlier? *Reading*
> from those buffers can be done much later, at our own leisure with full
> userspace up.

Yeah, agreed. I think this needs to be more integrated into the kernel,
so that people don't have to worry about "when does userspace start up
the earliest" details.

Fundamentally, all perf really needs here is some memory to initialize
and buffer into.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Fri, Jul 19, 2013 at 04:51:49PM -0700, Yinghai Lu wrote:
> On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt wrote:
> > On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote:
> > > > With this patch, we did boot a 16TiB machine. Without the patches,
> > > > the v3.10 kernel with the same configuration took 407 seconds for
> > > > free_all_bootmem. With the patches and operating on 2MiB pages instead
> > > > of 1GiB, it took 26 seconds so performance was improved. I have no feel
> > > > for how the 1GiB chunk size will perform.
> > >
> > > How to test how much time is spent on free_all_bootmem?
> >
> > We had put a pr_emerg at the beginning and end of free_all_bootmem and
> > then used a modified version of a script which records the time in uSecs
> > at the beginning of each line of output.
>
> used two patches, found a 3TiB system will take 100s before slub is ready.
>
> about three portions:
> 1. sparse vmemmap buf allocation: it is with the bootmem wrapper, so
>    clearing those struct page areas takes about 30s.
> 2. memmap_init_zone: takes about 25s
> 3. mem_init/free_all_bootmem: about 30s
>
> so still wonder why 16TiB will need hours.

I don't know where you got the figure of hours for memory initialization.
That is likely for a 32TiB boot and includes the entire boot, not just
getting the memory allocator initialized. For a 16TiB boot:

1) 344
2) 1151
3) 407

I hope that illustrates why we chose to address memmap_init_zone first,
which had the nice side effect of also impacting the free_all_bootmem
slowdown. With these patches, those numbers are currently:

1) 344
2) 49
3) 26

> also your patches look like they only address 2 and 3.

Right, but I thought that was the normal way to do things: address one
thing at a time and work toward a better kernel. I don't see a
relationship between the work we are doing here and the sparse vmemmap
buffer allocation. Have I missed something?

Did you happen to time a boot with these patches applied to see how long
it took and how much impact they had on a smaller config?

Robin
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt wrote:
> On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote:
> > > With this patch, we did boot a 16TiB machine. Without the patches,
> > > the v3.10 kernel with the same configuration took 407 seconds for
> > > free_all_bootmem. With the patches and operating on 2MiB pages instead
> > > of 1GiB, it took 26 seconds so performance was improved. I have no feel
> > > for how the 1GiB chunk size will perform.
> >
> > How to test how much time is spent on free_all_bootmem?
>
> We had put a pr_emerg at the beginning and end of free_all_bootmem and
> then used a modified version of a script which records the time in uSecs
> at the beginning of each line of output.

used the two attached patches, and found a 3TiB system will take 100s
before slub is ready, in about three portions:

1. sparse vmemmap buf allocation: it is with the bootmem wrapper, so
   clearing those struct page areas takes about 30s.
2. memmap_init_zone: takes about 25s
3. mem_init/free_all_bootmem: about 30s

so still wonder why 16TiB will need hours.

also your patches look like they only address 2 and 3.

Yinghai

printk_time_tsc_0.patch
Description: Binary data

printk_time_tsc_1.patch
Description: Binary data
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote:
> On 07/12/2013 10:03 AM, Robin Holt wrote:
> > We have been working on this since we returned from shutdown and have
> > something to discuss now. We restricted ourselves to 2MiB initialization
> > to keep the patch set a little smaller and more clear.
> >
> > First, I think I want to propose getting rid of the page flag. If I knew
> > of a concrete way to determine that the page has not been initialized,
> > this patch series would look different. If there is no definitive
> > way to determine that the struct page has been initialized aside from
> > checking the entire page struct is zero, then I think I would suggest
> > we change the page flag to indicate the page has been initialized.
> >
> > The heart of the problem as I see it comes from expand(). We nearly
> > always see a first reference to a struct page which is in the middle
> > of the 2MiB region. Due to that access, the unlikely() check that was
> > originally proposed really ends up referencing a different page entirely.
> > We actually did not introduce an unlikely and refactor the patches to
> > make that unlikely inside a static inline function. Also, given the
> > strong warning at the head of expand(), we did not feel experienced
> > enough to refactor it to make things always reference the 2MiB page
> > first.
> >
> > With this patch, we did boot a 16TiB machine. Without the patches,
> > the v3.10 kernel with the same configuration took 407 seconds for
> > free_all_bootmem. With the patches and operating on 2MiB pages instead
> > of 1GiB, it took 26 seconds so performance was improved. I have no feel
> > for how the 1GiB chunk size will perform.
>
> How to test how much time is spent on free_all_bootmem?

We had put a pr_emerg at the beginning and end of free_all_bootmem and
then used a modified version of a script which records the time in uSecs
at the beginning of each line of output.

Robin

> > I am on vacation for the next three days so I am sorry in advance for
> > my infrequent or non-existent responses.
> >
> > Signed-off-by: Robin Holt
> > Signed-off-by: Nate Zimmer
> > To: "H. Peter Anvin"
> > To: Ingo Molnar
> > Cc: Linux Kernel
> > Cc: Linux MM
> > Cc: Rob Landley
> > Cc: Mike Travis
> > Cc: Daniel J Blueman
> > Cc: Andrew Morton
> > Cc: Greg KH
> > Cc: Yinghai Lu
> > Cc: Mel Gorman
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On 07/12/2013 10:03 AM, Robin Holt wrote:
> We have been working on this since we returned from shutdown and have
> something to discuss now. We restricted ourselves to 2MiB initialization
> to keep the patch set a little smaller and more clear.
>
> First, I think I want to propose getting rid of the page flag. If I knew
> of a concrete way to determine that the page has not been initialized,
> this patch series would look different. If there is no definitive
> way to determine that the struct page has been initialized aside from
> checking the entire page struct is zero, then I think I would suggest
> we change the page flag to indicate the page has been initialized.
>
> The heart of the problem as I see it comes from expand(). We nearly
> always see a first reference to a struct page which is in the middle
> of the 2MiB region. Due to that access, the unlikely() check that was
> originally proposed really ends up referencing a different page entirely.
> We actually did not introduce an unlikely and refactor the patches to
> make that unlikely inside a static inline function. Also, given the
> strong warning at the head of expand(), we did not feel experienced
> enough to refactor it to make things always reference the 2MiB page
> first.
>
> With this patch, we did boot a 16TiB machine. Without the patches,
> the v3.10 kernel with the same configuration took 407 seconds for
> free_all_bootmem. With the patches and operating on 2MiB pages instead
> of 1GiB, it took 26 seconds so performance was improved. I have no feel
> for how the 1GiB chunk size will perform.

How to test how much time is spent on free_all_bootmem?

> I am on vacation for the next three days so I am sorry in advance for
> my infrequent or non-existent responses.
>
> Signed-off-by: Robin Holt
> Signed-off-by: Nate Zimmer
> To: "H. Peter Anvin"
> To: Ingo Molnar
> Cc: Linux Kernel
> Cc: Linux MM
> Cc: Rob Landley
> Cc: Mike Travis
> Cc: Daniel J Blueman
> Cc: Andrew Morton
> Cc: Greg KH
> Cc: Yinghai Lu
> Cc: Mel Gorman
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote:
> How about executing a perf in usermodehelper and collecting output
> in tmpfs? Using this approach, we can start a perf after rootfs
> initialization,

What for if we can start logging to buffers much earlier? *Reading*
from those buffers can be done much later, at our own leisure with full
userspace up.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
>
> * Robin Holt wrote:
>
> > [...]
> >
> > With this patch, we did boot a 16TiB machine. Without the patches, the
> > v3.10 kernel with the same configuration took 407 seconds for
> > free_all_bootmem. With the patches and operating on 2MiB pages instead
> > of 1GiB, it took 26 seconds so performance was improved. I have no feel
> > for how the 1GiB chunk size will perform.
>
> That's pretty impressive.
>
> It's still a 15x speedup instead of a 512x speedup, so I'd say there's
> something else being the current bottleneck, besides page init
> granularity.
>
> Can you boot with just a few gigs of RAM and stuff the rest into hotplug
> memory, and then hot-add that memory? That would allow easy profiling of
> remaining overhead.
>
> Side note:
>
> Robert Richter and Boris Petkov are working on 'persistent events'
> support for perf, which will eventually allow boot time profiling - I'm
> not sure if the patches and the tooling support are ready enough yet for
> your purposes.
>
> Robert, Boris, the following workflow would be pretty intuitive:
>
>  - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB
>
>  - we'd get a single (cycles?) event running once the perf subsystem is
>    up and running, with a sampling frequency of 1 KHz, sending profiling
>    trace events to a sufficiently sized profiling buffer of 16 MB per
>    CPU.
>
>  - once the system reaches SYSTEM_RUNNING, profiling is stopped either
>    automatically - or the user stops it via a new tooling command.
>
>  - the profiling buffer is extracted into a regular perf.data via a
>    special 'perf record' call or some other, new perf tooling
>    solution/variant.
>
>    [ Alternatively the kernel could attempt to construct a 'virtual'
>      perf.data from the persistent buffer, available via /sys/debug or
>      elsewhere in /sys - just like the kernel constructs a 'virtual'
>      /proc/kcore, etc. That file could be copied or used directly. ]

Hello, Robert, Boris, Ingo.

How about executing perf in a usermodehelper and collecting the output in
tmpfs? With this approach we can only start perf after rootfs
initialization, because we need the perf binary at least, but we can then
use almost all of perf's functionality.

If anyone has interest in this approach, I will send patches implementing
this idea.

Thanks.
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
>
> * Robin Holt wrote:
>
> > [...]
> >
> > With this patch, we did boot a 16TiB machine. Without the patches, the
> > v3.10 kernel with the same configuration took 407 seconds for
> > free_all_bootmem. With the patches and operating on 2MiB pages instead
> > of 1GiB, it took 26 seconds so performance was improved. I have no feel
> > for how the 1GiB chunk size will perform.
>
> That's pretty impressive.

And WRONG! That is a 15x speedup in the freeing of memory at the
free_all_bootmem point. That is _NOT_ the speedup from memmap_init_zone.
I forgot to take that into account, as Nate pointed out this morning in a
hallway discussion.

Before, on the 16TiB machine, memmap_init_zone took 1152 seconds. After,
it took 50. If it were a straight 1/512th, we would have expected that
1152 to be something more on the line of 2-3 seconds, so there is still
significant room for improvement.

Sorry for the confusion.

> It's still a 15x speedup instead of a 512x speedup, so I'd say there's
> something else being the current bottleneck, besides page init
> granularity.
>
> Can you boot with just a few gigs of RAM and stuff the rest into hotplug
> memory, and then hot-add that memory? That would allow easy profiling of
> remaining overhead.

Nate and I will be working on other things for the next few hours, hoping
there is a better answer to the first question we asked, about there being
a way to test a page other than comparing against all zeroes to see if it
has been initialized.

Thanks,
Robin
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Thu, Jul 11, 2013 at 09:03:51PM -0500, Robin Holt wrote:
> We have been working on this since we returned from shutdown and have
> something to discuss now. We restricted ourselves to 2MiB initialization
> to keep the patch set a little smaller and more clear.
>
> First, I think I want to propose getting rid of the page flag. If I knew
> of a concrete way to determine that the page has not been initialized,
> this patch series would look different. If there is no definitive
> way to determine that the struct page has been initialized aside from
> checking the entire page struct is zero, then I think I would suggest
> we change the page flag to indicate the page has been initialized.

Ingo or HPA,

Did I implement this wrong, or is there a way to get rid of the page flag
which is not going to impact normal operation? I don't want to put too
much more effort into this until I know we are stuck going this direction.

Currently, the expand() function has a relatively expensive check against
the 2MiB-aligned pfn's struct page. I do not know of a way to eliminate
that check against the other page, as the first reference we see for a
page is in the middle of that 2MiB-aligned range.

To identify this as an area of concern, we had booted with a simulator,
setting watch points on the struct page array region once the
Uninitialized flag was set and maintaining that until it was cleared.

Thanks,
Robin
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Thu, Jul 11, 2013 at 09:03:51PM -0500, Robin Holt wrote: We have been working on this since we returned from shutdown and have something to discuss now. We restricted ourselves to 2MiB initialization to keep the patch set a little smaller and more clear. First, I think I want to propose getting rid of the page flag. If I knew of a concrete way to determine that the page has not been initialized, this patch series would look different. If there is no definitive way to determine that the struct page has been initialized aside from checking the entire page struct is zero, then I think I would suggest we change the page flag to indicate the page has been initialized. Ingo or HPA, Did I implement this wrong or is there a way to get rid of the page flag which is not going to impact normal operation? I don't want to put too much more effort into this until I know we are stuck going this direction. Currently, the expand() function has a relatively expensive checked against the 2MiB aligned pfn's struct page. I do not know of a way to eliminate that check against the other page as the first reference we see for a page is in the middle of that 2MiB aligned range. To identify this as an area of concern, we had booted with a simulator, setting watch points on the struct page array region once the Uninitialized flag was set and maintaining that until it was cleared. Thanks, Robin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > * Robin Holt wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. And WRONG! That is a 15x speedup in the freeing of memory at the free_all_bootmem point. That is _NOT_ the speedup from memmap_init_zone. I forgot to take that into account as Nate pointed out this morning in a hallway discussion. Before, on the 16TiB machine, memmap_init_zone took 1152 seconds. After, it took 50. If it were a straight 1/512th, we would have expected that 1152 to be something more on the order of 2-3 seconds, so there is still significant room for improvement. Sorry for the confusion. > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. Nate and I will be working on other things for the next few hours hoping there is a better answer to the first question we asked about there being a way to test a page other than comparing against all zeroes to see if it has been initialized. Thanks, Robin
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
On 12.07.13 10:27:56, Ingo Molnar wrote: > > * Robin Holt wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. > > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. > > Side note: > > Robert Richter and Boris Petkov are working on 'persistent events' support > for perf, which will eventually allow boot time profiling - I'm not sure > if the patches and the tooling support is ready enough yet for your > purposes. The latest patch set is still this: git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v2 It requires the perf subsystem to be initialized first which might be too late, see perf_event_init() in start_kernel(). The patch set is currently also limited to tracepoints only. If this is sufficient for you, you might register persistent events with the function perf_add_persistent_event_by_id(), see mcheck_init_tp() for how to do this. Later you can fetch all samples with: # perf record -e persistent/<tracepoint>/ sleep 1 > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU.
I am not sure about the event you want to set up here; if it is a tracepoint, the sample_period should always be 1. The buffer size parameter looks interesting; for now it is 512kB per cpu by default (as the perf tools set up the buffer). > > - once the system reaches SYSTEM_RUNNING, profiling is stopped either > automatically - or the user stops it via a new tooling command. > > - the profiling buffer is extracted into a regular perf.data via a > special 'perf record' call or some other, new perf tooling > solution/variant. See the perf-record command above... > > [ Alternatively the kernel could attempt to construct a 'virtual' > perf.data from the persistent buffer, available via /sys/debug or > elsewhere in /sys - just like the kernel constructs a 'virtual' > /proc/kcore, etc. That file could be copied or used directly. ] > > - from that point on this workflow joins the regular profiling workflow: > perf report, perf script et al can be used to analyze the resulting > boot profile. Ingo, thanks for outlining this workflow. We will look at how this could fit into the new version of persistent events we are currently working on. Thanks, -Robert
Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
* Robin Holt wrote: > [...] > > With this patch, we did boot a 16TiB machine. Without the patches, the > v3.10 kernel with the same configuration took 407 seconds for > free_all_bootmem. With the patches and operating on 2MiB pages instead > of 1GiB, it took 26 seconds so performance was improved. I have no feel > for how the 1GiB chunk size will perform. That's pretty impressive. It's still a 15x speedup instead of a 512x speedup, so I'd say there's something else being the current bottleneck, besides page init granularity. Can you boot with just a few gigs of RAM and stuff the rest into hotplug memory, and then hot-add that memory? That would allow easy profiling of remaining overhead. Side note: Robert Richter and Boris Petkov are working on 'persistent events' support for perf, which will eventually allow boot time profiling - I'm not sure if the patches and the tooling support is ready enough yet for your purposes. Robert, Boris, the following workflow would be pretty intuitive: - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB - we'd get a single (cycles?) event running once the perf subsystem is up and running, with a sampling frequency of 1 KHz, sending profiling trace events to a sufficiently sized profiling buffer of 16 MB per CPU. - once the system reaches SYSTEM_RUNNING, profiling is stopped either automatically - or the user stops it via a new tooling command. - the profiling buffer is extracted into a regular perf.data via a special 'perf record' call or some other, new perf tooling solution/variant. [ Alternatively the kernel could attempt to construct a 'virtual' perf.data from the persistent buffer, available via /sys/debug or elsewhere in /sys - just like the kernel constructs a 'virtual' /proc/kcore, etc. That file could be copied or used directly. ] - from that point on this workflow joins the regular profiling workflow: perf report, perf script et al can be used to analyze the resulting boot profile. 
Thanks, Ingo
[RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
We have been working on this since we returned from shutdown and have something to discuss now. We restricted ourselves to 2MiB initialization to keep the patch set a little smaller and more clear. First, I think I want to propose getting rid of the page flag. If I knew of a concrete way to determine that the page has not been initialized, this patch series would look different. If there is no definitive way to determine that the struct page has been initialized aside from checking the entire page struct is zero, then I think I would suggest we change the page flag to indicate the page has been initialized. The heart of the problem as I see it comes from expand(). We nearly always see a first reference to a struct page which is in the middle of the 2MiB region. Due to that access, the unlikely() check that was originally proposed really ends up referencing a different page entirely. We actually did not introduce an unlikely and refactor the patches to make that unlikely inside a static inline function. Also, given the strong warning at the head of expand(), we did not feel experienced enough to refactor it to make things always reference the 2MiB page first. With this patch, we did boot a 16TiB machine. Without the patches, the v3.10 kernel with the same configuration took 407 seconds for free_all_bootmem. With the patches and operating on 2MiB pages instead of 1GiB, it took 26 seconds so performance was improved. I have no feel for how the 1GiB chunk size will perform. I am on vacation for the next three days so I am sorry in advance for my infrequent or non-existent responses. Signed-off-by: Robin Holt Signed-off-by: Nate Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman