Re: [RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node
> On Apr 6, 2018, at 12:38 AM, Michal Hocko wrote:
>
> On Thu 05-04-18 23:25:14, Buddy Lumpkin wrote:
>>
>>> On Apr 4, 2018, at 11:10 PM, Michal Hocko wrote:
>>>
>>> On Wed 04-04-18 21:49:54, Buddy Lumpkin wrote:
>>>> v2:
>>>> - Make update_kswapd_threads_node less racy
>>>> - Handle locking for case where CONFIG_MEMORY_HOTPLUG=n
>>>
>>> Please do not repost with such small changes. It is much more important to sort out the big picture first and only then deal with minor implementation details. The more versions you post, the more fragmented and messy the discussion will become.
>>>
>>> You will have to be patient because this is a rather big change and it will take _quite_ some time to get sorted.
>>>
>>> Thanks!
>>> --
>>> Michal Hocko
>>> SUSE Labs
>>
>> Sorry about that, I actually had three people review my code internally, then I managed to send out an old version. 100% guilty of submitting code when I needed sleep. As for the change, that was in response to a request from Andrew to make the update function less racy.
>>
>> Should I resend a correct v2 now that the thread exists?
>
> Let's just discuss open questions for now. Specifics of the code are the least interesting at this stage.
>
> If you want some help with the code review, you can put it somewhere in a git tree and send a reference for those who are interested.
> --
> Michal Hocko
> SUSE Labs

Ok, I will go back through the thread and make sure all questions and concerns have been addressed.
Re: [RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node
On Thu 05-04-18 23:25:14, Buddy Lumpkin wrote:
>
> > On Apr 4, 2018, at 11:10 PM, Michal Hocko wrote:
> >
> > On Wed 04-04-18 21:49:54, Buddy Lumpkin wrote:
> >> v2:
> >> - Make update_kswapd_threads_node less racy
> >> - Handle locking for case where CONFIG_MEMORY_HOTPLUG=n
> >
> > Please do not repost with such small changes. It is much more important to sort out the big picture first and only then deal with minor implementation details. The more versions you post, the more fragmented and messy the discussion will become.
> >
> > You will have to be patient because this is a rather big change and it will take _quite_ some time to get sorted.
> >
> > Thanks!
> > --
> > Michal Hocko
> > SUSE Labs
>
> Sorry about that, I actually had three people review my code internally, then I managed to send out an old version. 100% guilty of submitting code when I needed sleep. As for the change, that was in response to a request from Andrew to make the update function less racy.
>
> Should I resend a correct v2 now that the thread exists?

Let's just discuss open questions for now. Specifics of the code are the least interesting at this stage.

If you want some help with the code review, you can put it somewhere in a git tree and send a reference for those who are interested.
--
Michal Hocko
SUSE Labs
Re: [RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node
> On Apr 4, 2018, at 11:10 PM, Michal Hocko wrote:
>
> On Wed 04-04-18 21:49:54, Buddy Lumpkin wrote:
>> v2:
>> - Make update_kswapd_threads_node less racy
>> - Handle locking for case where CONFIG_MEMORY_HOTPLUG=n
>
> Please do not repost with such small changes. It is much more important to sort out the big picture first and only then deal with minor implementation details. The more versions you post, the more fragmented and messy the discussion will become.
>
> You will have to be patient because this is a rather big change and it will take _quite_ some time to get sorted.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
>

Sorry about that, I actually had three people review my code internally, then I managed to send out an old version. 100% guilty of submitting code when I needed sleep. As for the change, that was in response to a request from Andrew to make the update function less racy.

Should I resend a correct v2 now that the thread exists?

—Buddy
Re: [RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node
On Wed 04-04-18 21:49:54, Buddy Lumpkin wrote:
> v2:
> - Make update_kswapd_threads_node less racy
> - Handle locking for case where CONFIG_MEMORY_HOTPLUG=n

Please do not repost with such small changes. It is much more important to sort out the big picture first and only then deal with minor implementation details. The more versions you post, the more fragmented and messy the discussion will become.

You will have to be patient because this is a rather big change and it will take _quite_ some time to get sorted.

Thanks!
--
Michal Hocko
SUSE Labs
[RFC PATCH 1/1 v2] vmscan: Support multiple kswapd threads per node
Page replacement is handled in the Linux kernel in one of two ways:

1) Asynchronously, via kswapd
2) Synchronously, via direct reclaim

At page allocation time, the allocating task is immediately given a page from the zone free list, allowing it to go right back to whatever it was doing, probably directly or indirectly executing business logic. Just prior to satisfying the allocation, the free page count is checked to see if it has reached the zone low watermark and, if so, kswapd is awakened. Kswapd starts scanning pages, looking for inactive pages to evict to make room for new page allocations. The work of kswapd allows tasks to continue allocating memory from their respective zone free lists without incurring any delay.

When the demand for free pages exceeds the rate at which kswapd can supply them, page allocation works differently. Once the allocating task finds that the number of free pages is at or below the zone min watermark, it no longer pulls pages from the free list. Instead, the task runs the same CPU-bound routines as kswapd to satisfy its own allocation by scanning and evicting pages. This is called a direct reclaim.

The time spent performing a direct reclaim can be substantial, often tens to hundreds of milliseconds for small order-0 allocations, and half a second or more for order-9 huge page allocations. In fact, kswapd is not actually required on a Linux system. It exists for the sole purpose of optimizing performance by preventing direct reclaims.

When the memory shortfall is sufficient to trigger direct reclaims, they can occur in any task that is running on the system. A single aggressive memory-allocating task can set the stage for collateral damage in small tasks that rarely allocate additional memory. Consider the impact of injecting an additional 100ms of latency when nscd allocates memory to cache a DNS query.

Ten years ago, the presence of direct reclaims was a fairly reliable indicator that too much was being asked of a Linux system: kswapd was likely wasting time scanning pages that were ineligible for eviction, and adding RAM or reducing the working set size would usually make the problem go away. Since then, hardware has evolved to bring a new struggle for kswapd. Storage speeds have increased by orders of magnitude while CPU clock speeds have stayed the same or even slowed down in exchange for more cores per package. This presents a throughput problem for a single-threaded kswapd that will get worse with each generation of new hardware.

Test Details

NOTE: The tests below were run with shadow entries disabled. See the associated patch and cover letter for details.

The tests below were designed with the assumption that a kswapd bottleneck is best demonstrated using filesystem reads. This way, the inactive list is full of clean pages, simplifying the analysis and allowing kswapd to achieve the highest possible steal rate. Maximum steal rates for kswapd are likely to be the same or lower for any other mix of page types on the system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores, 756GB of RAM and 8 x 3.6 TB NVMe solid state drives. Each drive has an XFS filesystem mounted separately as /d0 through /d7. SSDs require multiple concurrent streams to show their potential, so I created 11 250GB zero-filled files on each drive so that I could test with parallel reads.

The test script runs in multiple stages. At each stage, the number of dd tasks run concurrently is increased by 2.
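The zero-filled test files described above can be created with something along these lines (an illustrative sketch, not the exact commands that were used; the file names are assumptions):

#!/bin/bash
# Create 11 x 250GB zero-filled files on each of the eight NVMe
# filesystems (/d0 .. /d7) so parallel read streams can be run
# against them. One writer per drive; 4MiB x 64000 = 250GiB per file.
for d in $(seq 0 7); do
    (
        for n in $(seq 0 10); do
            dd if=/dev/zero of=/d${d}/file${n} bs=4M count=64000
        done
    ) &
done
wait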
For brevity, I did not include all of the test output.

During each stage, dd tasks are launched to read from each drive in a round robin fashion until the specified number of tasks for the stage has been reached. Then iostat, vmstat and top are started in the background with 10 second intervals. After five minutes, all of the dd tasks are killed and the iostat, vmstat and top output is parsed in order to report the following:

CPU consumption:
- sy: aggregate kernel mode CPU consumption, from the vmstat output. The value doesn't tend to fluctuate much, so I just grab the highest value. Each sample is averaged over 10 seconds.
- dd_cpu: CPU consumption of the dd tasks, averaged across the top samples since there is a lot of variation.

Throughput:
- In Kbytes, from: iostat -x -d 10 -g total

This first test performs reads using O_DIRECT in order to show the maximum throughput that can be obtained using these drives. It also demonstrates how rapidly throughput scales as the number of dd tasks is increased. The dd command for this test looks like this:

Command used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

Test #1: Direct IO

dd    sy    dd_cpu    throughput
 6     0      2.33    14726026.40
10     1      2.95    19954974.80
16     1      2.63    24419689.30
22     1      2.63    25430303.20
28     1      2.91    26026513.20
34     1      2.53    26178618.00
40     1      2.18    26239229.20
46     1      1.91    26250550.40
52     1      1.69    26251845.60
58     1      1.54    26253205.60
64     1      1.43
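For completeness, here is a rough sketch of how one stage of the procedure described above could be driven. It is reconstructed from the description, not the original test script; the file naming, output paths and cleanup commands are assumptions:

#!/bin/bash
# Run one stage: launch $1 dd readers round robin across /d0 .. /d7,
# sample iostat/vmstat/top at 10 second intervals for five minutes,
# then kill the readers. The *.out files are parsed afterwards to
# report sy, dd_cpu and throughput for the stage.
ntasks=$1

for i in $(seq 1 ${ntasks}); do
    d=$(( (i - 1) % 8 ))        # round robin over /d0 .. /d7
    n=$(( (i - 1) / 8 ))        # next unused file on that drive
    dd iflag=direct if=/d${d}/file${n} of=/dev/null bs=4M &
done

iostat -x -d 10 -g total > iostat.out &  iostat_pid=$!
vmstat 10                > vmstat.out &  vmstat_pid=$!
top -b -d 10             > top.out    &  top_pid=$!

sleep 300                   # five minute measurement window
pkill -x dd                 # stop the readers (kills all dd on the box)
kill ${iostat_pid} ${vmstat_pid} ${top_pid}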