Re: [RFC PATCH v4 0/5] DAMON based tiered memory management for CXL memory
Hi Honggyu,

On Mon, 13 May 2024 20:59:15 +0900 Honggyu Kim wrote:

> Hi SeongJae,
>
> Thanks very much for your work! It got delayed due to the priority
> changes in my workplace for building another heterogeneous memory
> allocator.
> https://github.com/skhynix/hmsdk/wiki/hmalloc

No problem at all. We all work on our own schedule and nobody can
chase/push anybody :)

> On Sun, 12 May 2024 10:54:42 -0700 SeongJae Park wrote:
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > It says no implementation of the demote/promote DAMOS actions is made.
> > This RFC is about its implementation for physical address space.
> >
> > Changes from RFC v3
> > (https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com):
>
> This link cannot be opened. I will share the link again here.
> https://lore.kernel.org/all/20240405060858.2818-1-honggyu@sk.com

Thank you for checking the link! It's weird though, since I can open the
link on my Chrome browser.

> > 0. updated from v3 and posted by SJ on behalf of Honggyu under his
> >    approval.
> > 1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
> > 2. Drop vmstat change
>
> I haven't checked whether I can collect useful information without
> vmstat, but the changes look good in general except for that.

I was thinking you could use DAMOS stat[1] for the schemes and assuming
no reply to it as an agreement, but maybe I should have made it clear.
Do you think DAMOS stat cannot be used instead? If so, what would be
the limitation of DAMOS stat for your usage?

> > 3. Drop unnecessary page reference check
>
> I will compare this patch series with my previous v3 patchset and get
> back to you later, maybe next week.

Thank you very much! Unless I get a good enough test setup and results
from it on my own or from others' help, your test result would be the
last requirement for dropping RFC from this patchset.

> Sorry, I will have another break this week.

No problem, I hope you have a nice break. Nobody can chase/push others.
We all do this work voluntarily for our own fun and profit, right? ;)

[1] https://lore.kernel.org/damon/20240405060858.2818-1-honggyu@sk.com

Thanks,
SJ

> Thanks,
> Honggyu
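For reference on the DAMOS stat suggestion above: the per-scheme
statistics can be read through the DAMON sysfs interface without any
vmstat change. Below is a minimal Python sketch; the stat file names
follow the documented DAMON sysfs layout, while the kdamond/context/
scheme indices (0/0/0) are assumptions for illustration.

    #!/usr/bin/env python3
    # Minimal sketch: read DAMOS stats for one scheme through the DAMON
    # sysfs interface.  Indices (kdamonds/0, contexts/0, schemes/0) are
    # assumptions; adjust them to the actual setup.
    import os

    SCHEME = "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0"

    def read_stat(name: str) -> int:
        with open(os.path.join(SCHEME, "stats", name)) as f:
            return int(f.read())

    # nr/sz_tried: regions the scheme tried to apply to;
    # nr/sz_applied: regions it actually applied to;
    # qt_exceeds: how many times the quota was exceeded.
    for stat in ("nr_tried", "sz_tried", "nr_applied", "sz_applied",
                 "qt_exceeds"):
        print(f"{stat}: {read_stat(stat)}")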
Re: [RFC PATCH v4 0/5] DAMON based tiered memory management for CXL memory
Hi SeongJae,

Thanks very much for your work! It got delayed due to the priority
changes in my workplace for building another heterogeneous memory
allocator.
https://github.com/skhynix/hmsdk/wiki/hmalloc

On Sun, 12 May 2024 10:54:42 -0700 SeongJae Park wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
>
> It says no implementation of the demote/promote DAMOS actions is made.
> This RFC is about its implementation for physical address space.
>
> Changes from RFC v3
> (https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com):

This link cannot be opened. I will share the link again here.
https://lore.kernel.org/all/20240405060858.2818-1-honggyu@sk.com

> 0. updated from v3 and posted by SJ on behalf of Honggyu under his
>    approval.
> 1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
> 2. Drop vmstat change

I haven't checked whether I can collect useful information without
vmstat, but the changes look good in general except for that.

> 3. Drop unnecessary page reference check

I will compare this patch series with my previous v3 patchset and get
back to you later, maybe next week.

Sorry, I will have another break this week.

Thanks,
Honggyu
[RFC PATCH v4 0/5] DAMON based tiered memory management for CXL memory
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It says no implementation of the demote/promote DAMOS actions is made.
This RFC is about its implementation for physical address space.

Changes from RFC v3
(https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com):
  0. updated from v3 and posted by SJ on behalf of Honggyu under his
     approval.
  1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
  2. Drop vmstat change
  3. Drop unnecessary page reference check

Changes from RFC v2:
  1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
  2. Create 'target_nid' to set the migration target node instead of
     depending on node distance based information.
  3. Instead of having page level access check in this patch series,
     delegate the job to a new DAMOS filter type YOUNG[2].
  4. Introduce vmstat counters "damon_migrate_{hot,cold}".
  5. Rebase from v6.7 to v6.8.

Changes from RFC:
  1. Move most of the implementation from mm/vmscan.c to mm/damon/paddr.c.
  2. Simplify some functions of vmscan.c and use them in paddr.c, but they
     need to be reviewed more in depth.
  3. Refactor most functions for common usage for both promote and demote
     actions and introduce an enum migration_mode for its control.
  4. Add a "target_nid" sysfs knob for the migration destination node for
     both promote and demote actions.
  5. Move DAMOS_PROMOTE before DAMOS_DEMOTE, and move them even above
     DAMOS_STAT.

Introduction
============

With the advent of CXL/PCIe attached DRAM, hereafter simply called CXL
memory in this cover letter, some systems are becoming more
heterogeneous, having memory systems with different latency and bandwidth
characteristics. They are usually handled as different NUMA nodes in
separate memory tiers, and CXL memory is used as a slow tier because of
its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
proper NUMA nodes based on the memory access frequency. Otherwise, some
frequently accessed pages might reside on slow tiers, which causes
unexpected performance degradation. Moreover, the memory access patterns
can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature. The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) can be useful features for monitoring and migrating
pages. DAMOS provides multiple actions based on DAMON monitoring results
and it can be used for proactive reclaim, which means swapping cold pages
out with the DAMOS_PAGEOUT action, but it doesn't support migration
actions such as demotion and promotion between tiered memory nodes.

This series supports two new DAMOS actions: DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast
tiers. This prevents hot pages from being stuck on slow tiers, which
would cause performance degradation, and cold pages can be proactively
demoted to slow tiers so that the system can increase the chance to
allocate more hot pages to fast tiers.

DAMON provides various tuning knobs, but we found that the proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 17~18% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.
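To make the two new actions concrete, a scheme's action and migration
destination could be set through the DAMON sysfs interface roughly as in
the sketch below. This is only an illustration: the directory indices
and the "migrate_hot" action string are assumptions, and 'target_nid' is
the per-scheme knob this series adds, so the authoritative names are in
the patches themselves.

    #!/usr/bin/env python3
    # Sketch: point one DAMOS scheme at DAMOS_MIGRATE_HOT with NUMA
    # node 0 as the migration destination.  Paths, indices and the
    # "migrate_hot" string are assumptions for illustration.
    SCHEME = "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0"

    def write_knob(knob: str, value: str) -> None:
        with open(f"{SCHEME}/{knob}", "w") as f:
            f.write(value)

    write_knob("action", "migrate_hot")  # promote hot regions ...
    write_knob("target_nid", "0")        # ... to the fast tier DRAM node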
DAMON configuration
===================

The specific DAMON configuration doesn't have to be in the scope of this
patch series, but some rough idea is better shared to explain the
evaluation result.

DAMON provides many knobs for fine tuning, but its configuration file is
generated by HMSDK[3]. It includes a gen_config.py script that generates
a json file with the full config of DAMON knobs, and it creates multiple
kdamonds, one for each NUMA node, when DAMON is enabled so that it can
run hot/cold based migration for tiered memory.

Evaluation Workload
===================

The performance evaluation is done with redis[4], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5]. We have measured two different workloads with zipfian and
latest distributions, but their configs are slightly modified to make
memory usage higher and execution time longer for better evaluation.

The idea of evaluation using these migrate_{hot,cold} actions covers
system-wide memory management rather than partitioning hot/cold pages of
a single workload. The default memory allocation policy allocates pages
on the fast tier DRAM node first, then places newly created pages on the
slow tier CXL node when the DRAM node has insufficient free space. Once
the page allocation is done, those pages never move between NUMA nodes.
It's not true when using numa bala
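The generated config itself is not shown in the cover letter; as a rough
illustration of the per-node kdamond idea mentioned above, a generator
in the spirit of gen_config.py might look like the following sketch.
The JSON layout here is an assumption for illustration, not HMSDK's
actual schema.

    #!/usr/bin/env python3
    # Sketch: emit one kdamond per NUMA node, demoting cold data off the
    # DRAM node and promoting hot data off the CXL node.  The JSON field
    # names are assumed, not HMSDK's real config format.
    import json

    def kdamond(node: int, action: str, target_nid: int) -> dict:
        return {
            "contexts": [{
                "ops": "paddr",  # physical address space monitoring
                "targets": [{"nid": node}],
                "schemes": [{"action": action, "target_nid": target_nid}],
            }]
        }

    config = {"kdamonds": [
        kdamond(node=0, action="migrate_cold", target_nid=1),  # DRAM -> CXL
        kdamond(node=1, action="migrate_hot",  target_nid=0),  # CXL -> DRAM
    ]}
    print(json.dumps(config, indent=2))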
Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL
On Mon, Apr 08, 2024 at 10:41:04PM +0900, Honggyu Kim wrote:

> Hi Gregory,
>
> On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price wrote:
> > Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> > but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> >
> > The question I have is exactly how often MIGRATE_HOT is actually being
> > utilized, and how much data is being moved. Testing MIGRATE_COLD only
> > would at least give a rough approximation of that.
>
> To explain this, I'd better share more test results. In the section
> "Evaluation Workload", the test sequence can be summarized as follows.
>
> *. "Turn on DAMON."
> 1. Allocate cold memory (mmap+memset) at the DRAM node, then make the
>    process sleep.
> 2. Launch redis-server and load the prebaked snapshot image, dump.rdb.
>    (85GB consumed: 52GB for anon and 33GB for file cache)

Aha! I see now, you are allocating memory to ensure the real workload
(redis-server) pressures the DRAM tier and causes "spillage" to the CXL
tier, and then measure the overhead in different scenarios.

I would still love to know what the result of a demote-only system would
produce, mostly because it would very clearly demonstrate the value of
the demote+promote system when the system is memory-pressured. Given the
additional results below, it shows a demote-only system would likely
trend toward CXL-only, and so this affirmatively supports the promotion
logic. Just another datum that is useful and paints a more complete
picture.

> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another scenario, which lazily enabled DAMON just
> before the YCSB run at step 4. I will call this test "DAMON lazy".
> This part is missing from the cover letter.
>
> 1. Allocate cold memory (mmap+memset) at the DRAM node, then make the
>    process sleep.
> 2. Launch redis-server and load the prebaked snapshot image, dump.rdb.
>    (85GB consumed: 52GB for anon and 33GB for file cache)
> *. "Turn on DAMON."
>
> In the "DAMON lazy" scenario, DAMON started monitoring late so the
> initial redis-server placement is the same as "default", but it started
> to demote cold data and promote redis data just before the YCSB run.

This is excellent and definitely demonstrates part of the picture I was
alluding to, thank you for the additional data!

> I have included the "DAMON lazy" result and also the migration size by
> the new DAMOS migrate actions. Please note that the demotion size is
> way higher than promotion because the promotion target is only the
> redis data, but the demotion target includes the huge cold memory
> allocated by mmap + memset. (there could be some ping-pong issue
> though.)
>
> As you mentioned, the "DAMON tiered" case gets more benefit because new
> redis allocations go to DRAM more than "default", but it also gets
> benefit from promotion when it is under higher memory pressure, as
> shown in the 490G and 500G cases. It promotes 22GB and 17GB of redis
> data to DRAM from CXL.

I think a better way of saying this is that "DAMON tiered" more
effectively mitigates the effect of memory pressure on the faster tier
before spillage occurs, while "DAMON lazy" demonstrates the expected
performance of the system after memory pressure outruns the demotion
logic, so you wind up with hot data stuck in the slow tier.

There are some out there that would simply say "just demote more
aggressively", so this is useful information for the discussion.

+/- ~2% despite greater memory migration is an excellent result

> > Can you also provide the DRAM-only results for each test?
> > Presumably, as workload size increases from 440G to 500G, the system
> > probably starts using some amount of swap/zswap/whatever. It would be
> > good to know how this system compares to swapping small amounts of
> > overflow.
>
> It looks like my explanation didn't inform you correctly. The size from
> 440GB to 500GB is for pre-allocated cold data to put memory pressure on
> the system, so that the redis-server cannot be fully allocated at the
> fast DRAM and is then partially allocated at CXL memory as well.

Yes, sorry for the misunderstanding. This makes it much clearer.

> I hope my explanation is helpful for you to understand. Please let me
> know if you have more questions.

Excellent work, exciting results! Thank you for the additional answers :]

~Gregory
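The migration sizes quoted above would presumably come from the
damon_migrate_{hot,cold} vmstat counters that the v3 series introduced
(and that v4 drops again), so the sketch below only applies to kernels
carrying that change. It is a minimal reader, assuming 4KiB base pages.

    #!/usr/bin/env python3
    # Sketch: report the migrated volume from /proc/vmstat.  The
    # damon_migrate_{hot,cold} counters are the ones proposed in v3;
    # whether they exist depends on the kernel in use.
    PAGE_SIZE = 4096  # assumption: 4KiB base pages

    with open("/proc/vmstat") as f:
        stats = dict(line.split() for line in f)

    for key in ("damon_migrate_hot", "damon_migrate_cold"):
        pages = int(stats.get(key, 0))
        print(f"{key}: {pages} pages (~{pages * PAGE_SIZE / 2**30:.1f} GiB)")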
Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL
On Mon, 8 Apr 2024 22:41:04 +0900 Honggyu Kim wrote:
[...]
> To explain this, I'd better share more test results. In the section
> "Evaluation Workload", the test sequence can be summarized as follows.
>
> *. "Turn on DAMON."
> 1. Allocate cold memory (mmap+memset) at the DRAM node, then make the
>    process sleep.
> 2. Launch redis-server and load the prebaked snapshot image, dump.rdb.
>    (85GB consumed: 52GB for anon and 33GB for file cache)
> 3. Run YCSB to make a zipfian distribution of memory accesses to
>    redis-server, then measure the execution time.
> 4. Repeat 4 over 50 times to measure the average execution time for
>    each run.

Sorry, "Repeat 4 over 50 times" is incorrect. This should be "Repeat 3
over 50 times".

> 5. Increase the cold memory size, then repeat from step 2.
>
> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another scenario, which lazily enabled DAMON just
> before the YCSB run at step 4. I will call this test "DAMON lazy".
> This part is missing from the cover letter.
>
> 1. Allocate cold memory (mmap+memset) at the DRAM node, then make the
>    process sleep.
> 2. Launch redis-server and load the prebaked snapshot image, dump.rdb.
>    (85GB consumed: 52GB for anon and 33GB for file cache)
> *. "Turn on DAMON."
> 4. Run YCSB to make a zipfian distribution of memory accesses to
>    redis-server, then measure the execution time.
> 5. Repeat 4 over 50 times to measure the average execution time for
>    each run.
> 6. Increase the cold memory size, then repeat from step 2.
>
> In the "DAMON lazy" scenario, DAMON started monitoring late so the
> initial redis-server placement is the same as "default", but it started
> to demote cold data and promote redis data just before the YCSB run.
[...]

Thanks,
Honggyu
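For illustration, step 1 of the sequence above (the cold-memory
allocator) boils down to something like the following sketch. The real
test presumably uses mmap+memset from C bound to the DRAM node (e.g. via
numactl); the size and chunking here are assumptions.

    #!/usr/bin/env python3
    # Sketch of step 1: fault in a large block of anonymous memory so it
    # occupies the fast tier as never-again-touched "cold" data, then
    # sleep.  Run bound to the DRAM node, e.g. `numactl -m 0 ./cold.py`.
    import mmap
    import time

    COLD_BYTES = 440 << 30       # 440 GiB, the smallest size in the tables
    CHUNK = b"\x00" * (1 << 20)  # touch pages 1 MiB at a time

    buf = mmap.mmap(-1, COLD_BYTES)  # anonymous private mapping
    for _ in range(COLD_BYTES // len(CHUNK)):
        buf.write(CHUNK)             # each write fault allocates pages

    while True:
        time.sleep(3600)             # keep the cold mapping resident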
Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL
Hi Gregory,

On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price wrote:
> On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > 1. YCSB zipfian distribution read only workload
> >    memory pressure with cold memory on node0 with 512GB of local DRAM.
> >
> >  =============+================================================+=========
> >               |        cold memory occupied by mmap and memset |
> >               |   0G  440G  450G  460G  470G  480G  490G  500G |
> >  =============+================================================+=========
> >  Execution time normalized to DRAM-only values                 | GEOMEAN
> >  -------------+------------------------------------------------+---------
> >  DRAM-only    | 1.00     -     -     -     -     -     -     - |  1.00
> >  CXL-only     | 1.22     -     -     -     -     -     -     - |  1.22
> >  default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 |  1.17
> >  DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 |  1.05
> >  =============+================================================+=========
> >  CXL usage of redis-server in GB                               | AVERAGE
> >  -------------+------------------------------------------------+---------
> >  DRAM-only    |  0.0     -     -     -     -     -     -     - |   0.0
> >  CXL-only     | 52.6     -     -     -     -     -     -     - |  52.6
> >  default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 |  38.1
> >  DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |   0.7
> >  =============+================================================+=========
> >
> > Each test result is based on the execution environment as follows.
> >
> >  DRAM-only   : redis-server uses only local DRAM memory.
> >  CXL-only    : redis-server uses only CXL memory.
> >  default     : default memory policy (MPOL_DEFAULT).
> >                numa balancing disabled.
> >  DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
> >                DAMOS_MIGRATE_HOT for CXL nodes.
> >
> > The above result shows the "default" execution time goes up as the
> > size of cold memory is increased from 440G to 500G because the more
> > cold memory is used, the more CXL memory is used for the target redis
> > workload, and this makes the execution time increase.
> >
> > However, the "DAMON tiered" result shows less slowdown because the
> > DAMOS_MIGRATE_COLD action at the DRAM node proactively demotes
> > pre-allocated cold memory to the CXL node, and this free space at DRAM
> > increases the chance to allocate hot or warm pages of redis-server to
> > the fast DRAM node. Moreover, the DAMOS_MIGRATE_HOT action at the CXL
> > node also actively promotes hot pages of redis-server to the DRAM
> > node.
> >
> > As a result, more memory of redis-server stays in the DRAM node
> > compared to the "default" memory policy, and this brings the
> > performance improvement.
> >
> > The following result of the latest distribution workload shows similar
> > data.
> >
> > 2. YCSB latest distribution read only workload
> >    memory pressure with cold memory on node0 with 512GB of local DRAM.
> >
> >  =============+================================================+=========
> >               |        cold memory occupied by mmap and memset |
> >               |   0G  440G  450G  460G  470G  480G  490G  500G |
> >  =============+================================================+=========
> >  Execution time normalized to DRAM-only values                 | GEOMEAN
> >  -------------+------------------------------------------------+---------
> >  DRAM-only    | 1.00     -     -     -     -     -     -     - |  1.00
> >  CXL-only     | 1.18     -     -     -     -     -     -     - |  1.18
> >  default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 |  1.18
> >  DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 |  1.04
> >  =============+================================================+=========
> >  CXL usage of redis-server in GB                               | AVERAGE
> >  -------------+------------------------------------------------+---------
> >  DRAM-only    |  0.0     -     -     -     -     -     -     - |   0.0
> >  CXL-only     | 52.6     -     -     -     -     -     -     - |  52.6
> >  default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 |  38.1
> >  DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |   1.2
> >  =============+================================================+=========
Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory
Hi SeongJae,

On Fri, 5 Apr 2024 12:28:00 -0700 SeongJae Park wrote:

> Hello Honggyu,
>
> On Fri, 5 Apr 2024 15:08:49 +0900 Honggyu Kim wrote:
>
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > It says no implementation of the demote/promote DAMOS actions is made.
> > This RFC is about its implementation for physical address space.
> >
> > Changes from RFC v2:
> > 1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
> > 2. Create 'target_nid' to set the migration target node instead of
> >    depending on node distance based information.
> > 3. Instead of having page level access check in this patch series,
> >    delegate the job to a new DAMOS filter type YOUNG[2].
> > 4. Introduce vmstat counters "damon_migrate_{hot,cold}".
> > 5. Rebase from v6.7 to v6.8.
>
> Thank you for patiently keeping the discussion and making this great
> version! I left comments on each patch, but found no special concerns.
> The per-page access recheck for MIGRATE_HOT and the vmstat change catch
> my eye, though. I doubt whether those are really needed. It would be
> nice if you could answer the comments.

I will answer them where you made the comments.

> Once my comments on this version are addressed, I would have no reason
> to object to dropping the RFC tag from this patchset.

Thanks. I will drop the RFC after addressing your comments.

> Nonetheless, I see some warnings and errors from checkpatch.pl. I don't
> really care about those for RFC patches, so no problem at all. But if
> you agree with my opinion about dropping the RFC tag, and therefore if
> you will send the next version without the RFC tag, please make sure
> you also run checkpatch.pl before posting.

Sure. I will run checkpatch.pl from the next revision.

Thanks,
Honggyu

>
> Thanks,
> SJ
>
> [...]
Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory
Hello Honggyu,

On Fri, 5 Apr 2024 15:08:49 +0900 Honggyu Kim wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
>
> It says no implementation of the demote/promote DAMOS actions is made.
> This RFC is about its implementation for physical address space.
>
> Changes from RFC v2:
> 1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
> 2. Create 'target_nid' to set the migration target node instead of
>    depending on node distance based information.
> 3. Instead of having page level access check in this patch series,
>    delegate the job to a new DAMOS filter type YOUNG[2].
> 4. Introduce vmstat counters "damon_migrate_{hot,cold}".
> 5. Rebase from v6.7 to v6.8.

Thank you for patiently keeping the discussion and making this great
version! I left comments on each patch, but found no special concerns.
The per-page access recheck for MIGRATE_HOT and the vmstat change catch
my eye, though. I doubt whether those are really needed. It would be
nice if you could answer the comments.

Once my comments on this version are addressed, I would have no reason
to object to dropping the RFC tag from this patchset.

Nonetheless, I see some warnings and errors from checkpatch.pl. I don't
really care about those for RFC patches, so no problem at all. But if
you agree with my opinion about dropping the RFC tag, and therefore if
you will send the next version without the RFC tag, please make sure you
also run checkpatch.pl before posting.

Thanks,
SJ

[...]
Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory
On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
>
> 1. YCSB zipfian distribution read only workload
>    memory pressure with cold memory on node0 with 512GB of local DRAM.
>
>  =============+================================================+=========
>               |        cold memory occupied by mmap and memset |
>               |   0G  440G  450G  460G  470G  480G  490G  500G |
>  =============+================================================+=========
>  Execution time normalized to DRAM-only values                 | GEOMEAN
>  -------------+------------------------------------------------+---------
>  DRAM-only    | 1.00     -     -     -     -     -     -     - |  1.00
>  CXL-only     | 1.22     -     -     -     -     -     -     - |  1.22
>  default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 |  1.17
>  DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 |  1.05
>  =============+================================================+=========
>  CXL usage of redis-server in GB                               | AVERAGE
>  -------------+------------------------------------------------+---------
>  DRAM-only    |  0.0     -     -     -     -     -     -     - |   0.0
>  CXL-only     | 52.6     -     -     -     -     -     -     - |  52.6
>  default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 |  38.1
>  DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |   0.7
>  =============+================================================+=========
>
> Each test result is based on the execution environment as follows.
>
>  DRAM-only   : redis-server uses only local DRAM memory.
>  CXL-only    : redis-server uses only CXL memory.
>  default     : default memory policy (MPOL_DEFAULT).
>                numa balancing disabled.
>  DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
>                DAMOS_MIGRATE_HOT for CXL nodes.
>
> The above result shows the "default" execution time goes up as the size
> of cold memory is increased from 440G to 500G because the more cold
> memory is used, the more CXL memory is used for the target redis
> workload, and this makes the execution time increase.
>
> However, the "DAMON tiered" result shows less slowdown because the
> DAMOS_MIGRATE_COLD action at the DRAM node proactively demotes
> pre-allocated cold memory to the CXL node, and this free space at DRAM
> increases the chance to allocate hot or warm pages of redis-server to
> the fast DRAM node. Moreover, the DAMOS_MIGRATE_HOT action at the CXL
> node also actively promotes hot pages of redis-server to the DRAM node.
>
> As a result, more memory of redis-server stays in the DRAM node compared
> to the "default" memory policy, and this brings the performance
> improvement.
>
> The following result of the latest distribution workload shows similar
> data.
>
> 2. YCSB latest distribution read only workload
>    memory pressure with cold memory on node0 with 512GB of local DRAM.
>
>  =============+================================================+=========
>               |        cold memory occupied by mmap and memset |
>               |   0G  440G  450G  460G  470G  480G  490G  500G |
>  =============+================================================+=========
>  Execution time normalized to DRAM-only values                 | GEOMEAN
>  -------------+------------------------------------------------+---------
>  DRAM-only    | 1.00     -     -     -     -     -     -     - |  1.00
>  CXL-only     | 1.18     -     -     -     -     -     -     - |  1.18
>  default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 |  1.18
>  DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 |  1.04
>  =============+================================================+=========
>  CXL usage of redis-server in GB                               | AVERAGE
>  -------------+------------------------------------------------+---------
>  DRAM-only    |  0.0     -     -     -     -     -     -     - |   0.0
>  CXL-only     | 52.6     -     -     -     -     -     -     - |  52.6
>  default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 |  38.1
>  DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |   1.2
>  =============+================================================+=========
>
> In summary of both results, our evaluation shows that "DAMON tiered"
> memory management reduces the performance slowdown compared to the
> "default" memory policy from 17~18% to 4~5% when the system runs with
> high memory pressure on its fast tier DRAM nodes.
>
> Having these DAMOS_MIGRATE_HOT and DAMOS
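As a sanity check on the GEOMEAN column in the tables above, the summary
values can be reproduced from the per-size normalized execution times
with a few lines of Python:

    #!/usr/bin/env python3
    # Recompute the GEOMEAN column of the zipfian table: the geometric
    # mean of the 440G..500G normalized execution times matches the
    # reported 1.17 ("default") and 1.05 ("DAMON tiered").
    from math import prod

    default      = [1.12, 1.13, 1.14, 1.16, 1.19, 1.21, 1.21]
    damon_tiered = [1.04, 1.03, 1.04, 1.06, 1.05, 1.05, 1.05]

    for name, xs in (("default", default), ("DAMON tiered", damon_tiered)):
        print(f"{name}: {prod(xs) ** (1 / len(xs)):.2f}")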
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae,

On Tue, 26 Mar 2024 16:03:09 -0700 SeongJae Park wrote:
> On Mon, 25 Mar 2024 15:53:03 -0700 SeongJae Park wrote:
>
> > On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim wrote:
> [...]
> > > On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park wrote:
> > > > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim wrote:
> [...]
> > > > > > I would like to hear how you think about this.
> >
> > So, to summarize my humble opinion,
> >
> > 1. I like the idea of having two actions. But I'd like to use names
> >    other than 'promote' and 'demote'.
> > 2. I still prefer having a filter for the page granularity access
> >    re-check.
> >
> [...]
> > > I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so
> > > I can talk more about this issue.
> >
> > Looking forward to chatting with you :)
>
> We met and discussed this topic in the chat series yesterday. Sharing
> the summary here to keep the discussion open.
>
> Honggyu thankfully accepted my humble suggestions on the last reply.
> Honggyu will post the third version of this patchset soon. The patchset
> will implement two new DAMOS actions, namely MIGRATE_HOT and
> MIGRATE_COLD. Those will migrate the DAMOS target regions to a
> user-specified NUMA node, but will have different prioritization score
> functions. As the names imply, they will prioritize hotter and colder
> regions, respectively.

It's a late reply, but thanks very much for the summary. It was a really
helpful discussion in the chat series. I'm leaving a message so that
anyone can follow the update. The v3 is posted at
https://lore.kernel.org/damon/20240405060858.2818-1-honggyu@sk.com.

> Honggyu, please feel free to fix if there is anything wrong or missed.

I don't have anything to fix in your summary.

> And thanks to Honggyu again for patiently keeping this productive
> discussion and their awesome work.

I appreciate your persistent support and kind reviews.

Thanks,
Honggyu

>
> Thanks,
> SJ
>
> [...]
[RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It says no implementation of the demote/promote DAMOS actions is made.
This RFC is about its implementation for physical address space.

Changes from RFC v2:
  1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
  2. Create 'target_nid' to set the migration target node instead of
     depending on node distance based information.
  3. Instead of having page level access check in this patch series,
     delegate the job to a new DAMOS filter type YOUNG[2].
  4. Introduce vmstat counters "damon_migrate_{hot,cold}".
  5. Rebase from v6.7 to v6.8.

Changes from RFC:
  1. Move most of the implementation from mm/vmscan.c to mm/damon/paddr.c.
  2. Simplify some functions of vmscan.c and use them in paddr.c, but they
     need to be reviewed more in depth.
  3. Refactor most functions for common usage for both promote and demote
     actions and introduce an enum migration_mode for its control.
  4. Add a "target_nid" sysfs knob for the migration destination node for
     both promote and demote actions.
  5. Move DAMOS_PROMOTE before DAMOS_DEMOTE, and move them even above
     DAMOS_STAT.

Introduction
============

With the advent of CXL/PCIe attached DRAM, hereafter simply called CXL
memory in this cover letter, some systems are becoming more
heterogeneous, having memory systems with different latency and bandwidth
characteristics. They are usually handled as different NUMA nodes in
separate memory tiers, and CXL memory is used as a slow tier because of
its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
proper NUMA nodes based on the memory access frequency. Otherwise, some
frequently accessed pages might reside on slow tiers, which causes
unexpected performance degradation. Moreover, the memory access patterns
can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature. The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) can be useful features for monitoring and migrating
pages. DAMOS provides multiple actions based on DAMON monitoring results
and it can be used for proactive reclaim, which means swapping cold pages
out with the DAMOS_PAGEOUT action, but it doesn't support migration
actions such as demotion and promotion between tiered memory nodes.

This series supports two new DAMOS actions: DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast
tiers. This prevents hot pages from being stuck on slow tiers, which
would cause performance degradation, and cold pages can be proactively
demoted to slow tiers so that the system can increase the chance to
allocate more hot pages to fast tiers.

DAMON provides various tuning knobs, but we found that the proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 17~18% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.

DAMON configuration
===================

The specific DAMON configuration doesn't have to be in the scope of this
patch series, but some rough idea is better shared to explain the
evaluation result.

DAMON provides many knobs for fine tuning, but its configuration file is
generated by HMSDK[3].
It includes a gen_config.py script that generates a json file with the
full config of DAMON knobs, and it creates multiple kdamonds, one for
each NUMA node, when DAMON is enabled so that it can run hot/cold based
migration for tiered memory.

Evaluation Workload
===================

The performance evaluation is done with redis[4], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5]. We have measured two different workloads with zipfian and
latest distributions, but their configs are slightly modified to make
memory usage higher and execution time longer for better evaluation.

The idea of evaluation using these migrate_{hot,cold} actions covers
system-wide memory management rather than partitioning hot/cold pages of
a single workload. The default memory allocation policy allocates pages
on the fast tier DRAM node first, then places newly created pages on the
slow tier CXL node when the DRAM node has insufficient free space. Once
the page allocation is done, those pages never move between NUMA nodes.
That's not true when using numa balancing, but that is outside the scope
of this DAMON based tiered memory management support.

If the working set of redis fits fully into the DRAM node, then redis
will access the fast DRAM only. Since DRAM-only performance is faster
than partially accessing CXL memory in slow tiers, this environ
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
On Mon, 25 Mar 2024 15:53:03 -0700 SeongJae Park wrote:

> On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim wrote:
[...]
> > On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park wrote:
> > > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim wrote:
[...]
> > > > I would like to hear how you think about this.
>
> So, to summarize my humble opinion,
>
> 1. I like the idea of having two actions. But I'd like to use names
>    other than 'promote' and 'demote'.
> 2. I still prefer having a filter for the page granularity access
>    re-check.
>
[...]
> > I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so I
> > can talk more about this issue.
>
> Looking forward to chatting with you :)

We met and discussed this topic in the chat series yesterday. Sharing
the summary here to keep the discussion open.

Honggyu thankfully accepted my humble suggestions on the last reply.
Honggyu will post the third version of this patchset soon. The patchset
will implement two new DAMOS actions, namely MIGRATE_HOT and
MIGRATE_COLD. Those will migrate the DAMOS target regions to a
user-specified NUMA node, but will have different prioritization score
functions. As the names imply, they will prioritize hotter and colder
regions, respectively.

Honggyu, please feel free to fix if there is anything wrong or missed.

And thanks to Honggyu again for patiently keeping this productive
discussion and for their awesome work.

Thanks,
SJ

[...]
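The per-action prioritization described in the summary can be pictured
with a small Python sketch. The score formulas below are simplified
stand-ins, not the kernel's actual damon_hot_score()/damon_cold_score()
implementations; only the dispatch shape, each action coupled to its own
score function, mirrors the agreed design.

    #!/usr/bin/env python3
    # Pseudocode mirroring the design: MIGRATE_HOT favors hotter
    # regions, MIGRATE_COLD colder ones.  Formulas are assumptions.

    def hot_score(region: dict) -> int:
        # more accesses over a longer age -> higher priority (0..100)
        return min(region["nr_accesses"] * region["age"], 100)

    def cold_score(region: dict) -> int:
        # the colder a region, the higher its priority
        return 100 - hot_score(region)

    def scheme_score(action: str, region: dict) -> int:
        if action == "migrate_hot":
            return hot_score(region)   # prioritize promotion of hot regions
        if action == "migrate_cold":
            return cold_score(region)  # prioritize demotion of cold regions
        return 0

    region = {"nr_accesses": 8, "age": 5}
    print(scheme_score("migrate_hot", region),   # 40
          scheme_score("migrate_cold", region))  # 60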
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim wrote:

> Hi SeongJae,
>
> On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park wrote:
> > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim wrote:
[...]
> > > > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we
> > > > discussed this patchset at a high level. Sharing the summary here
> > > > for open discussion. As also discussed on the first version of
> > > > this patchset[2], we want to make a single action for general page
> > > > migration with minimum changes, but would like to keep the page
> > > > level access re-check. We also agreed the previously proposed
> > > > DAMOS filter-based approach could make sense for the purpose.
> > >
> > > Thanks very much for the summary. I have been trying to merge
> > > promote and demote actions into a single migrate action, but I found
> > > an issue regarding damon_pa_scheme_score. It currently calls
> > > damon_cold_score() for the demote action and damon_hot_score() for
> > > the promote action, but what should we call when we use a single
> > > migrate action?
> >
> > Good point! This is what I didn't think about when suggesting that.
> > Thank you for letting me know this gap! I think there could be two
> > approaches, off the top of my head.
> >
> > The first one would be extending the interface so that the user can
> > select the score function. This would allow flexible usage, but I'm a
> > bit concerned whether this could make things unnecessarily complex,
> > and whether it would be really useful in many general use cases.
>
> I also think this looks complicated and may not be useful for general
> users.
>
> > The second approach would be letting DAMON infer the intention. In
> > this case, I think we could know the intention is demotion if the
> > scheme has a young pages exclusion filter. Then, we could use
> > cold_score(). And vice versa. To cover the case that there is no
> > filter at all, I think we could have one assumption. My humble
> > intuition says the new action (migrate) may be used more for the
> > promotion use case. So, in damon_pa_scheme_score(), if the action of
> > the given scheme is the new one (say, MIGRATE), the function will
> > further check if the scheme has a filter for excluding young pages.
> > If so, the function will use cold_score(). Otherwise, the function
> > will use hot_score().
>
> Thanks for suggesting many ideas but I'm afraid that I feel this
> doesn't look good. Thinking about it again, I think we can consider
> keeping DAMOS_PROMOTE and DAMOS_DEMOTE,

In other words, keep having a dedicated DAMOS action for an intuitive
prioritization score function, or, coupling the prioritization with each
action, right? I think this makes sense, and fits well with the
documentation.

    The prioritization mechanism should be different for each action.
    For example, rarely accessed (colder) memory regions would be
    prioritized for page-out scheme action. In contrast, the colder
    regions would be deprioritized for huge page collapse scheme action.
    Hence, the prioritization mechanisms for each action are implemented
    in each DAMON operations set, together with the actions.

In other words, each DAMOS action should let users intuitively
understand what types of regions will be prioritized. We already have
such pairs of DAMOS actions, such as DAMOS_[NO]HUGEPAGE and
DAMOS_LRU_[DE]PRIO. So adding a couple of actions for this case sounds
reasonable to me. And I think this is better and simpler than having the
inference based behavior.

That said, I'm concerned that 'PROMOTE' and 'DEMOTE' may still sound a
bit ambiguous to people who don't know 'demote_folio_list()' and its
friends. Meanwhile, the names might sound too detailed about what they
do to people who know the functions, which makes them a bit inflexible.
They might also get confused since we don't have 'promote_folio_list()'.

To my humble understanding, what you really want to do is migrating
pages to a specific address range (or node), prioritizing the pages
based on their hotness. What about, say, MIGRATE_{HOT,COLD}?

> but I can make them directly call damon_folio_young() for the access
> check instead of using the young filter.
>
> And we can internally handle the complicated combination such as the
> demote action setting a "young" filter with "matching" true and the
> promote action setting a "young" filter with "matching" false. IMHO,
> this will make the usage simpler.

I think whether to exclude young/non-young (maybe idle is better than
non-young?) pages from the action is better to be decoupled, for the
following reasons.

Firstly, we want to check the page granularity youngness mainly because
we found DAMON's monitoring result is not accurate enough for this use
case. Or, we could say that's because you cannot wait until DAMON's
monitoring result becomes accurate enough. For more detail, you could
increase the minimum age of your scheme's target access pattern. I show
you set
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae,

On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park wrote:
> On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim wrote:
>
> > Hi SeongJae,
> >
> > On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park wrote:
> > > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim wrote:
> > >
> > > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management"
> > > > previously posted at [1].
> > > >
> > > > It says no implementation of the demote/promote DAMOS actions is
> > > > made. This RFC is about its implementation for physical address
> > > > space.
> > >
[...]
> > > Thank you for running the tests again with the new version of the
> > > patches and sharing the results!
> >
> > It's a bit of a late answer, but the result was from the previous
> > evaluation. I ran it again with RFC v2, but didn't see much
> > difference, so I just pasted the same result here.
>
> No problem, thank you for clarifying :)
>
[...]
> > > > Honggyu Kim (3):
> > > >   mm/damon: refactor DAMOS_PAGEOUT with migration_mode
> > > >   mm: make alloc_demote_folio externally invokable for migration
> > > >   mm/damon: introduce DAMOS_DEMOTE action for demotion
> > > >
> > > > Hyeongtak Ji (4):
> > > >   mm/memory-tiers: add next_promotion_node to find promotion target
> > > >   mm/damon: introduce DAMOS_PROMOTE action for promotion
> > > >   mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
> > > >   mm/damon/sysfs-schemes: apply target_nid for promote and demote
> > > >     actions
> > >
> > > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we
> > > discussed this patchset at a high level. Sharing the summary here
> > > for open discussion. As also discussed on the first version of this
> > > patchset[2], we want to make a single action for general page
> > > migration with minimum changes, but would like to keep the page
> > > level access re-check. We also agreed the previously proposed DAMOS
> > > filter-based approach could make sense for the purpose.
> >
> > Thanks very much for the summary. I have been trying to merge promote
> > and demote actions into a single migrate action, but I found an issue
> > regarding damon_pa_scheme_score. It currently calls damon_cold_score()
> > for the demote action and damon_hot_score() for the promote action,
> > but what should we call when we use a single migrate action?
>
> Good point! This is what I didn't think about when suggesting that.
> Thank you for letting me know this gap! I think there could be two
> approaches, off the top of my head.
>
> The first one would be extending the interface so that the user can
> select the score function. This would allow flexible usage, but I'm a
> bit concerned whether this could make things unnecessarily complex, and
> whether it would be really useful in many general use cases.

I also think this looks complicated and may not be useful for general
users.

> The second approach would be letting DAMON infer the intention. In this
> case, I think we could know the intention is demotion if the scheme has
> a young pages exclusion filter. Then, we could use cold_score(). And
> vice versa. To cover the case that there is no filter at all, I think
> we could have one assumption. My humble intuition says the new action
> (migrate) may be used more for the promotion use case. So, in
> damon_pa_scheme_score(), if the action of the given scheme is the new
> one (say, MIGRATE), the function will further check if the scheme has a
> filter for excluding young pages. If so, the function will use
> cold_score(). Otherwise, the function will use hot_score().

Thanks for suggesting many ideas but I'm afraid that I feel this doesn't
look good. Thinking about it again, I think we can consider keeping
DAMOS_PROMOTE and DAMOS_DEMOTE, but I can make them directly call
damon_folio_young() for the access check instead of using the young
filter.

And we can internally handle the complicated combination such as the
demote action setting a "young" filter with "matching" true and the
promote action setting a "young" filter with "matching" false. IMHO,
this will make the usage simpler.

I would like to hear how you think about this.

> So I'd more prefer the second approach. I think it would not be too
> late to consider the first approach after waiting for it to turn out
> that more actions have such ambiguity and need a more general interface
> to explicitly set the score function.

I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so I
can talk more about this issue.

Thanks,
Honggyu

>
> Thanks,
> SJ
>
> [...]
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
On Fri, 22 Mar 2024 17:27:34 +0900 Honggyu Kim wrote:
[...]
> OK. It could be a matter of preference, and the current filter is
> already in the mainline, so I won't insist further.

Thank you for accepting my humble suggestion.

[...]
> > I'd prefer improving the documents or user-space tool and keeping the
> > kernel code simple.
>
> OK. I will see if there is a way to improve the damo tool for this
> instead of making changes on the kernel side.

Looking forward!

[...]
> Yeah, I made this thread too much about the filter naming discussion
> rather than tiered memory support.

No problem at all. Thank you for keeping this productive discussion.

[...]
> Thanks again for your feedback.

That's my pleasure :)

Thanks,
SJ

[...]
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim wrote:

> Hi SeongJae,
>
> On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park wrote:
> > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim wrote:
> >
> > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management"
> > > previously posted at [1].
> > >
> > > It says no implementation of the demote/promote DAMOS actions is
> > > made. This RFC is about its implementation for physical address
> > > space.
> >
[...]
> > Thank you for running the tests again with the new version of the
> > patches and sharing the results!
>
> It's a bit of a late answer, but the result was from the previous
> evaluation. I ran it again with RFC v2, but didn't see much difference,
> so I just pasted the same result here.

No problem, thank you for clarifying :)

[...]
> > > Honggyu Kim (3):
> > >   mm/damon: refactor DAMOS_PAGEOUT with migration_mode
> > >   mm: make alloc_demote_folio externally invokable for migration
> > >   mm/damon: introduce DAMOS_DEMOTE action for demotion
> > >
> > > Hyeongtak Ji (4):
> > >   mm/memory-tiers: add next_promotion_node to find promotion target
> > >   mm/damon: introduce DAMOS_PROMOTE action for promotion
> > >   mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
> > >   mm/damon/sysfs-schemes: apply target_nid for promote and demote
> > >     actions
> >
> > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we
> > discussed this patchset at a high level. Sharing the summary here for
> > open discussion. As also discussed on the first version of this
> > patchset[2], we want to make a single action for general page
> > migration with minimum changes, but would like to keep the page level
> > access re-check. We also agreed the previously proposed DAMOS
> > filter-based approach could make sense for the purpose.
>
> Thanks very much for the summary. I have been trying to merge promote
> and demote actions into a single migrate action, but I found an issue
> regarding damon_pa_scheme_score. It currently calls damon_cold_score()
> for the demote action and damon_hot_score() for the promote action, but
> what should we call when we use a single migrate action?

Good point! This is what I didn't think about when suggesting that.
Thank you for letting me know this gap! I think there could be two
approaches, off the top of my head.

The first one would be extending the interface so that the user can
select the score function. This would allow flexible usage, but I'm a
bit concerned whether this could make things unnecessarily complex, and
whether it would be really useful in many general use cases.

The second approach would be letting DAMON infer the intention. In this
case, I think we could know the intention is demotion if the scheme has
a young pages exclusion filter. Then, we could use cold_score(). And
vice versa. To cover the case that there is no filter at all, I think we
could have one assumption. My humble intuition says the new action
(migrate) may be used more for the promotion use case. So, in
damon_pa_scheme_score(), if the action of the given scheme is the new
one (say, MIGRATE), the function will further check if the scheme has a
filter for excluding young pages. If so, the function will use
cold_score(). Otherwise, the function will use hot_score().

So I'd more prefer the second approach. I think it would not be too late
to consider the first approach after waiting for it to turn out that
more actions have such ambiguity and need a more general interface to
explicitly set the score function.

Thanks,
SJ

[...]
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae,

On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park wrote:
> On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim wrote:
>
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > It says no implementation of the demote/promote DAMOS actions is made.
> > This RFC is about its implementation for physical address space.
> >
> > Introduction
> > ============
> >
> > With the advent of CXL/PCIe attached DRAM, hereafter simply called CXL
> > memory in this cover letter, some systems are becoming more
> > heterogeneous, having memory systems with different latency and
> > bandwidth characteristics. They are usually handled as different NUMA
> > nodes in separate memory tiers, and CXL memory is used as a slow tier
> > because of its protocol overhead compared to local DRAM.
> >
> > In this kind of system, we need to be careful to place memory pages on
> > proper NUMA nodes based on the memory access frequency. Otherwise,
> > some frequently accessed pages might reside on slow tiers, which
> > causes unexpected performance degradation. Moreover, the memory access
> > patterns can change at runtime.
> >
> > To handle this problem, we need a way to monitor the memory access
> > patterns and migrate pages based on their access temperature. The
> > DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
> > Operation Schemes) can be useful features for monitoring and migrating
> > pages. DAMOS provides multiple actions based on DAMON monitoring
> > results and it can be used for proactive reclaim, which means swapping
> > cold pages out with the DAMOS_PAGEOUT action, but it doesn't support
> > migration actions such as demotion and promotion between tiered memory
> > nodes.
> >
> > This series supports two new DAMOS actions: DAMOS_DEMOTE for demotion
> > from fast tiers and DAMOS_PROMOTE for promotion from slow tiers. This
> > prevents hot pages from being stuck on slow tiers, which would cause
> > performance degradation, and cold pages can be proactively demoted to
> > slow tiers so that the system can increase the chance to allocate more
> > hot pages to fast tiers.
> >
> > DAMON provides various tuning knobs, but we found that the proactive
> > demotion of cold pages is especially useful when the system is running
> > out of memory on its fast tier nodes.
> >
> > Our evaluation result shows that it reduces the performance slowdown
> > compared to the default memory policy from 15~17% to 4~5% when the
> > system runs under high memory pressure on its fast tier DRAM nodes.
> >
> > DAMON configuration
> > ===================
> >
> > The specific DAMON configuration doesn't have to be in the scope of
> > this patch series, but some rough idea is better shared to explain the
> > evaluation result.
> >
> > DAMON provides many knobs for fine tuning, but its configuration file
> > is generated by HMSDK[2]. It includes a gen_config.py script that
> > generates a json file with the full config of DAMON knobs, and it
> > creates multiple kdamonds, one for each NUMA node, when DAMON is
> > enabled so that it can run hot/cold based migration for tiered memory.
>
> I was feeling a bit confused from here since DAMON doesn't receive
> parameters via a file. To my understanding, the 'configuration file'
> means the input file for the DAMON user-space tool, damo, not DAMON.
> Just a trivial thing, but making it clear if possible could help
> readers in my opinion.
> > Evaluation Workload
> > ===================
> >
> > The performance evaluation is done with redis[3], which is a widely
> > used in-memory database, and the memory access patterns are generated
> > via YCSB[4]. We have measured two different workloads with zipfian
> > and latest distributions, but their configs are slightly modified to
> > make memory usage higher and execution time longer for better
> > evaluation.
> >
> > The idea of evaluation using these demote and promote actions covers
> > system-wide memory management rather than partitioning hot/cold pages
> > of a single workload. The default memory allocation policy allocates
> > pages on the fast tier DRAM node first, then places newly created
> > pages on the slow tier CXL node when the DRAM node has insufficient
> > free space. Once the page allocation is done, those pages never move
> > between NUMA nodes. It's not true when using numa balancing, but it
> > is not the scope
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae,

On Wed, 20 Mar 2024 09:56:19 -0700 SeongJae Park wrote:
> Hi Honggyu,
>
> On Wed, 20 Mar 2024 16:07:48 +0900 Honggyu Kim wrote:
>
> > Hi SeongJae,
> >
> > On Mon, 18 Mar 2024 12:07:21 -0700 SeongJae Park wrote:
> > > On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim wrote:
> > > >
> > > > Hi SeongJae,
> > > >
> > > > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park wrote:
> > > > > Hi Honggyu,
> > > > >
> > > > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim wrote:
> > > > >
> > > > > > Hi SeongJae,
> > > > > >
> > > > > > Thanks for the confirmation. I have a few comments on the
> > > > > > young filter, so please read the inline comments again.
> > > > > >
> > > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park wrote:
> > > > > > > Hi Honggyu,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > > From: SeongJae Park
> > > > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > > > > > To: Honggyu Kim
> > > > > > > > > Cc: SeongJae Park ; kernel_team
> > > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier
> > > > > > > > > memory management for CXL memory
> > > > > > > > >
> > > > > > > > > Hi Honggyu,
> > > > > > > > >
> > > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" wrote:
> > > > > > > > >
> > > > > > > > > > Hi SeongJae,
> > > > > > > > > >
> > > > > > > > > > I've tested it again and found that the "young" filter
> > > > > > > > > > has to be set differently as follows.
> > > > > > > > > > - demote action: set "young" filter with "matching" true
> > > > > > > > > > - promote action: set "young" filter with "matching" false
> >
> > > > Thinking about it again, I feel like "matching" true or false
> > > > looks quite vague to me as a general user.
> > > >
> > > > Instead, I would like to have more meaningful names for "matching"
> > > > as follows.
> > > >
> > > > - matching "true" can be either (filter) "out" or "skip".
> > > > - matching "false" can be either (filter) "in" or "apply".
> > >
> > > I agree the naming could be done much better. And thank you for the
> > > nice suggestions. I have a few concerns, though.
> >
> > I don't think my suggestion is best. I just would like to have more
> > discussion about it.
>
> I also understand my naming sense is far from good :) I'm grateful to
> have this constructive discussion!

Yeah, naming is always difficult. Thanks anyway :)

> > > Firstly, increasing the number of behavioral concepts. The DAMOS
> > > filter feature has only a single behavior: excluding some types of
> > > memory from the DAMOS action target. The "matching" is to provide a
> > > flexible way for further specifying the target to exclude in a bit
> > > of detail. Without it, we would need non-variant for each filter
> > > type. Compared to the current terms, the new terms feel like they
> > > imply there are two types of behaviors. I think one behavior is
> > > easier to understand than two behaviors, and better matches what the
> > > internal code is doing.
> > >
> > > Secondly, ambiguity in "in" and "apply". To me, the terms sound
> > > like _adding_ something more than _excluding_.
> >
> > I understood that young filter "matching" "false" means apply the
> > action only to young pages. Did I misunderstand something here? If
> > not,
>
> Technically speaking, having a DAMOS filter with 'matching' parameter as
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi Honggyu, On Wed, 20 Mar 2024 16:07:48 +0900 Honggyu Kim wrote: > Hi SeongJae, > > On Mon, 18 Mar 2024 12:07:21 -0700 SeongJae Park wrote: > > On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim wrote: > > > > > Hi SeongJae, > > > > > > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park wrote: > > > > Hi Honggyu, > > > > > > > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim > > > > wrote: > > > > > > > > > Hi SeongJae, > > > > > > > > > > Thanks for the confirmation. I have a few comments on young filter so > > > > > please read the inline comments again. > > > > > > > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park > > > > > wrote: > > > > > > Hi Honggyu, > > > > > > > > > > > > > > -Original Message- > > > > > > > > From: SeongJae Park > > > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM > > > > > > > > To: Honggyu Kim > > > > > > > > Cc: SeongJae Park ; kernel_team > > > > > > > > > > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory > > > > > > > > management for CXL memory > > > > > > > > > > > > > > > > Hi Honggyu, > > > > > > > > > > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi SeongJae, > > > > > > > > > > > > > > > > > > I've tested it again and found that "young" filter has to be > > > > > > > > > set > > > > > > > > > differently as follows. > > > > > > > > > - demote action: set "young" filter with "matching" true > > > > > > > > > - promote action: set "young" filter with "matching" false > > > > > > Thinking it again, I feel like "matching" true or false looks quite > > > vague to me as a general user. > > > > > > Instead, I would like to have more meaningful names for "matching" as > > > follows. > > > > > > - matching "true" can be either (filter) "out" or "skip". > > > - matching "false" can be either (filter) "in" or "apply". > > > > I agree the naming could be done much better. And thank you for the nice > > suggestions. I have a few concerns, though. > > I don't think my suggestion is best. I just would like to have more > discussion about it. I also understand my naming sense is far from good :) I'm grateful to have this constructive discussion! > > > Firstly, increasing the number of behavioral concepts. DAMOS filter feature > > has only single behavior: excluding some types of memory from DAMOS action > > target. The "matching" is to provide a flexible way for further specifying > > the > > target to exclude in a bit detail. Without it, we would need non-variant > > for > > each filter type. Comapred to the current terms, the new terms feel like > > implying there are two types of behaviors. I think one behavior is easier > > to > > understand than two behaviors, and better match what internal code is doing. > > > > Secondly, ambiguity in "in" and "apply". To me, the terms sound like > > _adding_ > > something more than _excluding_. > > I understood that young filter "matching" "false" means apply action > only to young pages. Do I misunderstood something here? If not, Technically speaking, having a DAMOS filter with 'matching' parameter as 'false' for 'young pages' type means you want DAMOS to "exclude pages that not young from the scheme's action target". That's the only thing it truly does, and what it tries to guarantee. Whether the action will be applied to young pages or not depends on more factors including additional filters and DAMOS parameter. IOW, that's not what the simple setting promises. 
Of course, I know you are assuming there is only the single filter. Hence, effectively you're correct. And the sentence may be better wording for end users. However, it took me a bit of time to understand your assumption and conclude whether
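To make the single-behavior semantics concrete, here is a minimal sketch of the filter decision discussed above, assuming a simplified 'struct damos_filter' that carries only 'type' and 'matching'. The function and helper names are illustrative, not the exact kernel symbols:

	/* Return true if the folio should be excluded from the action target. */
	static bool damos_filter_out_sketch(struct folio *folio,
			struct damos_filter *filter)
	{
		bool matches = false;

		switch (filter->type) {
		case DAMOS_FILTER_TYPE_YOUNG:
			/* hypothetical young-page check */
			matches = folio_is_young_sketch(folio);
			break;
		case DAMOS_FILTER_TYPE_MEMCG:
			/* hypothetical memcg membership check */
			matches = folio_in_memcg_sketch(folio, filter);
			break;
		default:
			break;
		}
		/*
		 * The single behavior: exclude the folio when its matching
		 * status equals the user-set 'matching' value.  'matching'
		 * true filters matched pages out; 'matching' false filters
		 * out everything except the matched pages.
		 */
		return matches == filter->matching;
	}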
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae, On Mon, 18 Mar 2024 12:07:21 -0700 SeongJae Park wrote: > On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim wrote: > > > Hi SeongJae, > > > > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park wrote: > > > Hi Honggyu, > > > > > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim wrote: > > > > > > > Hi SeongJae, > > > > > > > > Thanks for the confirmation. I have a few comments on the young filter so > > > > please read the inline comments again. > > > > > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park > > > > wrote: > > > > > Hi Honggyu, > > > > > > > > > > > > -Original Message- > > > > > > > From: SeongJae Park > > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM > > > > > > > To: Honggyu Kim > > > > > > > Cc: SeongJae Park ; kernel_team > > > > > > > > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory > > > > > > > management for CXL memory > > > > > > > > > > > > > > Hi Honggyu, > > > > > > > > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" > > > > > > > wrote: > > > > > > > > > > > > > > > Hi SeongJae, > > > > > > > > > > > > > > > > I've tested it again and found that "young" filter has to be set > > > > > > > > differently as follows. > > > > > > > > - demote action: set "young" filter with "matching" true > > > > > > > > - promote action: set "young" filter with "matching" false > > > > Thinking it again, I feel like "matching" true or false looks quite > > vague to me as a general user. > > > > Instead, I would like to have more meaningful names for "matching" as > > follows. > > > > - matching "true" can be either (filter) "out" or "skip". > > - matching "false" can be either (filter) "in" or "apply". > > I agree the naming could be done much better. And thank you for the nice > suggestions. I have a few concerns, though. I don't think my suggestion is best. I just would like to have more discussion about it. > Firstly, increasing the number of behavioral concepts. The DAMOS filter feature > has only a single behavior: excluding some types of memory from the DAMOS action > target. The "matching" is to provide a flexible way for further specifying > the > target to exclude in a bit more detail. Without it, we would need a negated variant for > each filter type. Compared to the current terms, the new terms feel like > implying there are two types of behaviors. I think one behavior is easier to > understand than two behaviors, and better matches what the internal code is doing. > > Secondly, ambiguity in "in" and "apply". To me, the terms sound like _adding_ > something more than _excluding_. I understood that young filter "matching" "false" means applying the action only to young pages. Did I misunderstand something here? If not, "apply" means _adding_ or _including_ only the matched pages for the action. It looks like you thought about _excluding_ non-matched pages here. > I think that might confuse people in some > cases. Actually, I have used the terms "filter-out" and "filter-in" on > this and several threads. When talking about the "filter-in" scenario, I had to > emphasize the fact that it is not adding something but excluding others. Excluding others also means including matched pages. I think we'd better focus on what to do only for the matched pages. > I now think that was not a good approach. > > Finally, "apply" sounds a bit deterministic. I think it could be a bit > confusing in some cases such as when using multiple filters in a combined way. 
> For example, if we have two filters for 1) "apply" a memcg and 2) skip anon > pages, the given DAMOS action will not be applied to anon pages of the memcg. > I think this might be a bit confusing. No objection on this. If so, I think "in" sounds better than "apply". > > > > Internally, the type of "matching" can be boolean, but it'd be better > > for general users to have other ways to set it such as "out"/"in" or > > "skip"/"apply" via the sysfs interface.
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim wrote: > Hi SeongJae, > > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park wrote: > > Hi Honggyu, > > > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim wrote: > > > > > Hi SeongJae, > > > > > > Thanks for the confirmation. I have a few comments on the young filter so > > > please read the inline comments again. > > > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park wrote: > > > > Hi Honggyu, > > > > > > > > > > -Original Message- > > > > > > From: SeongJae Park > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM > > > > > > To: Honggyu Kim > > > > > > Cc: SeongJae Park ; kernel_team > > > > > > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory > > > > > > management for CXL memory > > > > > > > > > > > > Hi Honggyu, > > > > > > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" > > > > > > wrote: > > > > > > > > > > > > > Hi SeongJae, > > > > > > > > > > > > > > I've tested it again and found that "young" filter has to be set > > > > > > > differently as follows. > > > > > > > - demote action: set "young" filter with "matching" true > > > > > > > - promote action: set "young" filter with "matching" false > > Thinking it again, I feel like "matching" true or false looks quite > vague to me as a general user. > > Instead, I would like to have more meaningful names for "matching" as > follows. > > - matching "true" can be either (filter) "out" or "skip". > - matching "false" can be either (filter) "in" or "apply". I agree the naming could be done much better. And thank you for the nice suggestions. I have a few concerns, though. Firstly, increasing the number of behavioral concepts. The DAMOS filter feature has only a single behavior: excluding some types of memory from the DAMOS action target. The "matching" is to provide a flexible way for further specifying the target to exclude in a bit more detail. Without it, we would need a negated variant for each filter type. Compared to the current terms, the new terms feel like implying there are two types of behaviors. I think one behavior is easier to understand than two behaviors, and better matches what the internal code is doing. Secondly, ambiguity in "in" and "apply". To me, the terms sound like _adding_ something more than _excluding_. I think that might confuse people in some cases. Actually, I have used the terms "filter-out" and "filter-in" on this and several threads. When talking about the "filter-in" scenario, I had to emphasize the fact that it is not adding something but excluding others. I now think that was not a good approach. Finally, "apply" sounds a bit deterministic. I think it could be a bit confusing in some cases such as when using multiple filters in a combined way. For example, if we have two filters for 1) "apply" a memcg and 2) skip anon pages, the given DAMOS action will not be applied to anon pages of the memcg. I think this might be a bit confusing. > > Internally, the type of "matching" can be boolean, but it'd be better > for general users to have other ways to set it such as "out"/"in" or > "skip"/"apply" via the sysfs interface. I prefer "skip" and "apply" as they look > more intuitive, but I don't have a strong objection to "out" and "in" as > well. Unfortunately, the DAMON sysfs interface is an ABI that we want to keep stable. Of course we could make some changes on it if really required. But I'm unsure if the problem of the current naming and the benefit of the suggested change are big enough to outweigh the stability risk and additional efforts. 
Also, the DAMON sysfs interface is arguably not for _very_ general users. The DAMON user-space tool is the one for _more_ general users. To quote the DAMON usage document, - *DAMON user space tool.* `This <https://github.com/awslabs/damo>`_ is for privileged people such as system administrators who want a just-working human-friendly interface. [...] - *sysfs interface.* :ref:`This ` is for privileged user space programmers who want more optimized use of DAMON. [...] If the concept is that confusing, I think we could improve the documentation.
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae, On Sun, 17 Mar 2024 12:13:57 -0700 SeongJae Park wrote: > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park wrote: > > > Hi Honggyu, > > > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim wrote: > > > > > Hi SeongJae, > > > > > > Thanks for the confirmation. I have a few comments on the young filter so > > > please read the inline comments again. > > > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park wrote: > > > > Hi Honggyu, > [...] > > > Thanks. I see that it works fine, but I would like to have more > > > discussion about the "young" filter. What I think about the filter is that if I > > > apply "young" filter "true" for demotion, then the action applies only > > > for "young" pages, but the current implementation works the opposite way. > > > > > > I understand the function name of the internal implementation is > > > "damos_pa_filter_out" so the basic action is filtering out, but the > > > cgroup filter works in the opposite way for now. > > > > Does the memcg filter work in the opposite way? I don't think so because > > __damos_pa_filter_out() sets 'matches' as 'true' only if the given > > folio is > > contained in the given memcg. The 'young' filter also simply sets 'matches' as > > 'true' only if the given folio is young. > > > > If it works in the opposite way, it's a bug that needs to be fixed. Please > > let > > me know if I'm missing something. > > I just read the DAMOS filters part of the documentation for the DAMON sysfs > interface again. I believe it is explaining the meaning of 'matching' as I > intended to, as below: > > You can write ``Y`` or ``N`` to ``matching`` file to filter out pages > that does > or does not match to the type, respectively. Then, the scheme's action > will > not be applied to the pages that specified to be filtered out. > > But, I found the following example for the memcg filter is wrong, as below: > > For example, below restricts a DAMOS action to be applied to only > non-anonymous > pages of all memory cgroups except ``/having_care_already``.:: > > # echo 2 > nr_filters > # # filter out anonymous pages > echo anon > 0/type > echo Y > 0/matching > # # further filter out all cgroups except one at > '/having_care_already' > echo memcg > 1/type > echo /having_care_already > 1/memcg_path > echo N > 1/matching > > Specifically, the last line of the commands should write 'Y' instead of 'N' to > do what is explained. Without the fix, the action will be applied to only > non-anonymous pages of the 'having_care_already' memcg. This is definitely wrong. > I will fix this soon. I'm unsure if this is what made you believe the memcg > DAMOS filter is working in the opposite way, though. I didn't get confused because of this. I just think, again, that this user interface would be better if it were more intuitive, as I mentioned in the previous thread. Thanks, Honggyu > > Thanks, > SJ > > [...]
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae, On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park wrote: > Hi Honggyu, > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim wrote: > > > Hi SeongJae, > > > > Thanks for the confirmation. I have a few comments on the young filter so > > please read the inline comments again. > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park wrote: > > > Hi Honggyu, > > > > > > > > -Original Message- > > > > > From: SeongJae Park > > > > > Sent: Tuesday, March 12, 2024 3:33 AM > > > > > To: Honggyu Kim > > > > > Cc: SeongJae Park ; kernel_team > > > > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory > > > > > management for CXL memory > > > > > > > > > > Hi Honggyu, > > > > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" > > > > > wrote: > > > > > > > > > > > Hi SeongJae, > > > > > > > > > > > > I've tested it again and found that "young" filter has to be set > > > > > > differently as follows. > > > > > > - demote action: set "young" filter with "matching" true > > > > > > - promote action: set "young" filter with "matching" false Thinking it again, I feel like "matching" true or false looks quite vague to me as a general user. Instead, I would like to have more meaningful names for "matching" as follows. - matching "true" can be either (filter) "out" or "skip". - matching "false" can be either (filter) "in" or "apply". Internally, the type of "matching" can be boolean, but it'd be better for general users to have other ways to set it, such as "out"/"in" or "skip"/"apply", via the sysfs interface. I prefer "skip" and "apply" as they look more intuitive, but I don't have a strong objection to "out" and "in" either. I also feel the filter name "young" is more for developers, not for general users. I think this can be changed to an "accessed" filter instead. The demote and promote filters can be set as follows using these. - demote action: set "accessed" filter with "matching" to "skip" - promote action: set "accessed" filter with "matching" to "apply" I also think that you may feel this is more complicated, so I would like to hear what you think about this. > > > > > > > > > > DAMOS filter is basically for filtering "out" memory regions that > > > > > match > > > > > the condition. Right. In other tools, I see filters are used more for filtering "in" rather than filtering "out". I feel this makes me more confused. > > > > > Hence in your setup, young pages are not filtered out from > > > > > the demote action target. > > > > > > > > I thought young filter true means "young pages ARE filtered out" for > > > > demotion. > > > > > > You're correct. > > > > Ack. > > > > > > > > > > > That is, you're demoting pages that are "not" young. > > > > > > > > Your explanation here looks opposite to the previous statement. > > > > > > Again, you're correct. My intention was "non-young pages are not ..." but > > > maybe I was out of my mind and mistakenly removed the "non-" part. Sorry for > > > the > > > confusion. > > > > No problem. I also think it's quite confusing. > > > > > > > > > > > And vice versa, so you're applying promote to non-non-young (young) > > > > > pages. > > > > > > > > Yes, I understand "promote non-non-young pages" means "promote young > > > > pages". > > > > This might be understood as "young pages are NOT filtered out" for > > > > promotion > > > > but it doesn't mean that "old pages are filtered out" instead. > > > > And we just rely on hot detection only via DAMOS logic such as access > > > > frequency > > > > and age. Am I correct? > > > > > > You're correct. > > > > Ack. 
But if it doesn't mean that "old pages are filtered out" instead, > > It does mean that. Here, filtering is exclusive. Hence, "filter-in a type of pages" means "filter-out pages of other types".
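Just to illustrate the naming suggestion above in code, below is a rough sketch of how the suggested "skip"/"apply" keywords could be accepted by the sysfs file while keeping the internal boolean unchanged. This is only a hypothetical interface sketch, not the actual ABI; sysfs_streq() is an existing kernel helper, the rest of the names are made up:

	/* Parse "skip"/"out" or "apply"/"in" into the existing boolean. */
	static int damos_filter_matching_parse(const char *buf, bool *matching)
	{
		if (sysfs_streq(buf, "skip") || sysfs_streq(buf, "out"))
			*matching = true;	/* exclude matched pages */
		else if (sysfs_streq(buf, "apply") || sysfs_streq(buf, "in"))
			*matching = false;	/* exclude non-matched pages */
		else
			return -EINVAL;
		return 0;
	}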
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park wrote: > Hi Honggyu, > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim wrote: > > > Hi SeongJae, > > > > Thanks for the confirmation. I have a few comments on the young filter so > > please read the inline comments again. > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park wrote: > > > Hi Honggyu, [...] > > Thanks. I see that it works fine, but I would like to have more > > discussion about the "young" filter. What I think about the filter is that if I > > apply "young" filter "true" for demotion, then the action applies only > > for "young" pages, but the current implementation works the opposite way. > > > > I understand the function name of the internal implementation is > > "damos_pa_filter_out" so the basic action is filtering out, but the > > cgroup filter works in the opposite way for now. > > Does the memcg filter work in the opposite way? I don't think so because > __damos_pa_filter_out() sets 'matches' as 'true' only if the given folio > is > contained in the given memcg. The 'young' filter also simply sets 'matches' as > 'true' only if the given folio is young. > > If it works in the opposite way, it's a bug that needs to be fixed. Please let > me know if I'm missing something. I just read the DAMOS filters part of the documentation for the DAMON sysfs interface again. I believe it is explaining the meaning of 'matching' as I intended to, as below: You can write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does not match to the type, respectively. Then, the scheme's action will not be applied to the pages that specified to be filtered out. But, I found the following example for the memcg filter is wrong, as below: For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: # echo 2 > nr_filters # # filter out anonymous pages echo anon > 0/type echo Y > 0/matching # # further filter out all cgroups except one at '/having_care_already' echo memcg > 1/type echo /having_care_already > 1/memcg_path echo Y > 1/matching Specifically, the last line of the commands should write 'Y' instead of 'N' to do what is explained. Without the fix, the action will be applied to only non-anonymous pages of the 'having_care_already' memcg. This is definitely wrong. I will fix this soon. I'm unsure if this is what made you believe the memcg DAMOS filter is working in the opposite way, though. Thanks, SJ [...]
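For reference, the combination rule behind the example above can be sketched as below: the action skips a folio as soon as any single filter excludes it, which is why the example's last 'matching' value matters. The names are illustrative, following the damos_filter_out_sketch() shown earlier in this thread rather than the exact kernel symbols:

	/* Return true if the action may be applied to the folio. */
	static bool damos_filters_allow_sketch(struct folio *folio,
			struct list_head *filters)
	{
		struct damos_filter *filter;

		list_for_each_entry(filter, filters, list)
			if (damos_filter_out_sketch(folio, filter))
				return false;	/* excluded by this filter */
		return true;
	}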
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi Honggyu, On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim wrote: > Hi SeongJae, > > Thanks for the confirmation. I have a few comments on young filter so > please read the inline comments again. > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park wrote: > > Hi Honggyu, > > > > > > -Original Message- > > > > From: SeongJae Park > > > > Sent: Tuesday, March 12, 2024 3:33 AM > > > > To: Honggyu Kim > > > > Cc: SeongJae Park ; kernel_team > > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory > > > > management for CXL memory > > > > > > > > Hi Honggyu, > > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" > > > > wrote: > > > > > > > > > Hi SeongJae, > > > > > > > > > > I've tested it again and found that "young" filter has to be set > > > > > differently as follows. > > > > > - demote action: set "young" filter with "matching" true > > > > > - promote action: set "young" filter with "matching" false > > > > > > > > DAMOS filter is basically for filtering "out" memory regions that > > > > matches to > > > > the condition. Hence in your setup, young pages are not filtered out > > > > from > > > > demote action target. > > > > > > I thought young filter true means "young pages ARE filtered out" for > > > demotion. > > > > You're correct. > > Ack. > > > > > > > > That is, you're demoting pages that "not" young. > > > > > > Your explanation here looks opposite to the previous statement. > > > > Again, you're correct. My intention was "non-young pages are not ..." but > > maybe I was out of my mind and mistakenly removed "non-" part. Sorry for > > the > > confusion. > > No problem. I also think it's quite confusing. > > > > > > > > And vice versa, so you're applying promote to non-non-young (young) > > > > pages. > > > > > > Yes, I understand "promote non-non-young pages" means "promote young > > > pages". > > > This might be understood as "young pages are NOT filtered out" for > > > promotion > > > but it doesn't mean that "old pages are filtered out" instead. > > > And we just rely hot detection only on DAMOS logics such as access > > > frequency > > > and age. Am I correct? > > > > You're correct. > > Ack. But if it doesn't mean that "old pages are filtered out" instead, It does mean that. Here, filtering is exclusive. Hence, "filter-in a type of pages" means "filter-out pages of other types". At least that's the intention. To quote the documentation (https://docs.kernel.org/mm/damon/design.html#filters), Each filter specifies the type of target memory, and whether it should exclude the memory of the type (filter-out), or all except the memory of the type (filter-in). > then do we really need this filter for promotion? If not, maybe should > we create another "old" filter for promotion? As of now, the promotion > is mostly done inaccurately, but the accurate migration is done at > demotion level. Is this based on your theory? Or, a real behavior that you're seeing from your setup? If this is a real behavior, I think that should be a bug that need to be fixed. > To avoid this issue, I feel we should promotion only "young" pages after > filtering "old" pages out. > > > > > > > > I understand this is somewhat complex, but what we have for now. > > > > > > Thanks for the explanation. I guess you mean my filter setup is correct. > > > Is it correct? > > > > Again, you're correct. Your filter setup is what I expected to :) > > Thanks. I see that it works fine, but I would like to have more > discussion about "young" filter. 
What I think about the filter is that if I > apply "young" filter "true" for demotion, then the action applies only > for "young" pages, but the current implementation works the opposite way. > > I understand the function name of the internal implementation is > "damos_pa_filter_out" so the basic action is filtering out, but the > cgroup filter works in the opposite way for now. Does the memcg filter work in the opposite way? I don't think so because __damos_pa_filter_out() sets 'matches' as 'true' only if the given folio is contained in the given memcg.
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae, Thanks for the confirmation. I have a few comments on the young filter so please read the inline comments again. On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park wrote: > Hi Honggyu, > > > > -Original Message- > > > From: SeongJae Park > > > Sent: Tuesday, March 12, 2024 3:33 AM > > > To: Honggyu Kim > > > Cc: SeongJae Park ; kernel_team > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management > > > for CXL memory > > > > > > Hi Honggyu, > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" > > > wrote: > > > > > > > Hi SeongJae, > > > > > > > > I've tested it again and found that "young" filter has to be set > > > > differently as follows. > > > > - demote action: set "young" filter with "matching" true > > > > - promote action: set "young" filter with "matching" false > > > > > > DAMOS filter is basically for filtering "out" memory regions that match > > > the condition. Hence in your setup, young pages are not filtered out from > > > the demote action target. > > > > I thought young filter true means "young pages ARE filtered out" for > > demotion. > > You're correct. Ack. > > > > That is, you're demoting pages that are "not" young. > > > > Your explanation here looks opposite to the previous statement. > > Again, you're correct. My intention was "non-young pages are not ..." but > maybe I was out of my mind and mistakenly removed the "non-" part. Sorry for the > confusion. No problem. I also think it's quite confusing. > > > > And vice versa, so you're applying promote to non-non-young (young) pages. > > > > Yes, I understand "promote non-non-young pages" means "promote young pages". > > This might be understood as "young pages are NOT filtered out" for promotion > > but it doesn't mean that "old pages are filtered out" instead. > > And we just rely on hot detection only via DAMOS logic such as access frequency > > and age. Am I correct? > > You're correct. Ack. But if it doesn't mean that "old pages are filtered out" instead, then do we really need this filter for promotion? If not, maybe we should create another "old" filter for promotion? As of now, the promotion is mostly done inaccurately, but the accurate migration is done at the demotion level. To avoid this issue, I feel we should promote only "young" pages after filtering "old" pages out. > > > > I understand this is somewhat complex, but that's what we have for now. > > > > Thanks for the explanation. I guess you mean my filter setup is correct. > > Is it correct? > > Again, you're correct. Your filter setup is what I expected :) Thanks. I see that it works fine, but I would like to have more discussion about the "young" filter. What I think about the filter is that if I apply "young" filter "true" for demotion, then the action applies only for "young" pages, but the current implementation works the opposite way. I understand the function name of the internal implementation is "damos_pa_filter_out" so the basic action is filtering out, but the cgroup filter works in the opposite way for now. I would like to hear what you think about this. > > > > > Then, I see that "hot_cold" migrates hot/cold memory correctly. > > > > > > Thank you so much for sharing this great news! My tests also show no bad > > > signal so far. > > > > > > > > > > > Could you please upload the "damon_folio_mkold" patch to LKML? > > > > Then I will rebase our changes based on it and run the redis test again. > > > > > > I will do that soon. > > > > Thanks a lot for sharing the RFC v2 for the DAMOS young filter. 
> > https://lore.kernel.org/damon/20240311204545.47097-1...@kernel.org/ > > > > I will rebase our work based on it and share the result. > > Cool, looking forward to it! Hopefully we will get it merged into the > mainline by v6.10! I hope so. Thanks for your help! Honggyu
Re: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hi SeongJae, I couldn't send email to LKML properly due to internal system issues, but it's better now so I will be able to communicate better. On Wed, 6 Mar 2024 19:05:50 -0800 SeongJae Park wrote: > > Hello, > > On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park wrote: > > > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim wrote: > > > > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > > > posted at [1]. > > > > > > It says there is no implementation of the demote/promote DAMOS action > > > are made. This RFC is about its implementation for physical address > > > space. > [...] > > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we discussed > > about > > this patchset in high level. Sharing the summary here for open discussion. > > As > > also discussed on the first version of this patchset[2], we want to make > > single > > action for general page migration with minimum changes, but would like to > > keep > > page level access re-check. We also agreed the previously proposed DAMOS > > filter-based approach could make sense for the purpose. > > > > Because I was anyway planning making such DAMOS filter for not only > > promotion/demotion but other types of DAMOS action, I will start developing > > the > > page level access re-check results based DAMOS filter. Once the > > implementation > > of the prototype is done, I will share the early implementation. Then, > > Honggyu > > will adjust their implementation based on the filter, and run their tests > > again > > and share the results. > > I just posted an RFC patchset for the page level access re-check DAMOS filter: > https://lore.kernel.org/r/20240307030013.47041-1...@kernel.org > > I hope it to help you better understanding and testing the idea. Thanks very much for your work! I will test it based on your changes. Thanks, Honggyu > > > > > [1] https://lore.kernel.org/damon/20220810225102.124459-1...@kernel.org/ > > [2] https://lore.kernel.org/damon/20240118171756.80356-1...@kernel.org > > > Thanks, > SJ > > [...] > > >
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
Hello, On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park wrote: > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim wrote: > > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > > posted at [1]. > > > > It says there is no implementation of the demote/promote DAMOS action > > are made. This RFC is about its implementation for physical address > > space. [...] > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we discussed > this patchset at a high level. Sharing the summary here for open discussion. > As > also discussed on the first version of this patchset[2], we want to make a > single > action for general page migration with minimum changes, but would like to keep > the page level access re-check. We also agreed the previously proposed DAMOS > filter-based approach could make sense for the purpose. > > Because I was anyway planning to make such a DAMOS filter for not only > promotion/demotion but also other types of DAMOS actions, I will start developing > the > page level access re-check results based DAMOS filter. Once the > implementation > of the prototype is done, I will share the early implementation. Then, > Honggyu > will adjust their implementation based on the filter, and run their tests > again > and share the results. I just posted an RFC patchset for the page level access re-check DAMOS filter: https://lore.kernel.org/r/20240307030013.47041-1...@kernel.org I hope it helps you better understand and test the idea. > > [1] https://lore.kernel.org/damon/20220810225102.124459-1...@kernel.org/ > [2] https://lore.kernel.org/damon/20240118171756.80356-1...@kernel.org Thanks, SJ [...]
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim wrote: > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > posted at [1]. > > It says there is no implementation of the demote/promote DAMOS action > are made. This RFC is about its implementation for physical address > space. > > > Introduction > > > With the advent of CXL/PCIe attached DRAM, which will be called simply > as CXL memory in this cover letter, some systems are becoming more > heterogeneous having memory systems with different latency and bandwidth > characteristics. They are usually handled as different NUMA nodes in > separate memory tiers and CXL memory is used as slow tiers because of > its protocol overhead compared to local DRAM. > > In this kind of systems, we need to be careful placing memory pages on > proper NUMA nodes based on the memory access frequency. Otherwise, some > frequently accessed pages might reside on slow tiers and it makes > performance degradation unexpectedly. Moreover, the memory access > patterns can be changed at runtime. > > To handle this problem, we need a way to monitor the memory access > patterns and migrate pages based on their access temperature. The > DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation > Schemes) can be useful features for monitoring and migrating pages. > DAMOS provides multiple actions based on DAMON monitoring results and it > can be used for proactive reclaim, which means swapping cold pages out > with DAMOS_PAGEOUT action, but it doesn't support migration actions such > as demotion and promotion between tiered memory nodes. > > This series supports two new DAMOS actions; DAMOS_DEMOTE for demotion > from fast tiers and DAMOS_PROMOTE for promotion from slow tiers. This > prevents hot pages from being stuck on slow tiers, which makes > performance degradation and cold pages can be proactively demoted to > slow tiers so that the system can increase the chance to allocate more > hot pages to fast tiers. > > The DAMON provides various tuning knobs but we found that the proactive > demotion for cold pages is especially useful when the system is running > out of memory on its fast tier nodes. > > Our evaluation result shows that it reduces the performance slowdown > compared to the default memory policy from 15~17% to 4~5% when the > system runs under high memory pressure on its fast tier DRAM nodes. > > > DAMON configuration > === > > The specific DAMON configuration doesn't have to be in the scope of this > patch series, but some rough idea is better to be shared to explain the > evaluation result. > > The DAMON provides many knobs for fine tuning but its configuration file > is generated by HMSDK[2]. It includes gen_config.py script that > generates a json file with the full config of DAMON knobs and it creates > multiple kdamonds for each NUMA node when the DAMON is enabled so that > it can run hot/cold based migration for tiered memory. I was feeling a bit confused from here since DAMON doesn't receive parameters via a file. To my understanding, the 'configuration file' means the input file for DAMON user-space tool, damo, not DAMON. Just a trivial thing, but making it clear if possible could help readers in my opinion. > > > Evaluation Workload > === > > The performance evaluation is done with redis[3], which is a widely used > in-memory database and the memory access patterns are generated via > YCSB[4]. 
We have measured two different workloads with zipfian and > latest distributions, but their configs are slightly modified to make > memory usage higher and execution time longer for better evaluation. > > The idea of evaluation using these demote and promote actions covers > system-wide memory management rather than partitioning hot/cold pages of > a single workload. The default memory allocation policy creates pages > on the fast tier DRAM node first, then allocates newly created pages on > the slow tier CXL node when the DRAM node has insufficient free space. > Once the page allocation is done, those pages never move between > NUMA nodes. It's not true when using numa balancing, but that is not in the > scope of this DAMON based 2-tier memory management support. > > If the working set of redis can fit fully into the DRAM node, then > redis will access the fast DRAM only. Since the performance of DRAM > only is faster than partially accessing CXL memory in slow tiers, this > environment is not useful to evaluate this patch series. > > To make the pages of redis distributed across the fast DRAM node and the slow > CXL node to evaluate our demote and promote actions, we pre-allocate > some cold memory externally using mmap and memset before launching redis-server.
[RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously posted at [1]. It says no implementation of the demote/promote DAMOS actions has been made. This RFC is about its implementation for physical address space. Introduction With the advent of CXL/PCIe attached DRAM, which will be simply called CXL memory in this cover letter, some systems are becoming more heterogeneous, having memory systems with different latency and bandwidth characteristics. They are usually handled as different NUMA nodes in separate memory tiers, and CXL memory is used as a slow tier because of its protocol overhead compared to local DRAM. In this kind of system, we need to be careful to place memory pages on proper NUMA nodes based on the memory access frequency. Otherwise, some frequently accessed pages might reside on slow tiers, which causes unexpected performance degradation. Moreover, the memory access patterns can be changed at runtime. To handle this problem, we need a way to monitor the memory access patterns and migrate pages based on their access temperature. The DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based Operation Schemes) can be useful features for monitoring and migrating pages. DAMOS provides multiple actions based on DAMON monitoring results and it can be used for proactive reclaim, which means swapping cold pages out with the DAMOS_PAGEOUT action, but it doesn't support migration actions such as demotion and promotion between tiered memory nodes. This series supports two new DAMOS actions: DAMOS_DEMOTE for demotion from fast tiers and DAMOS_PROMOTE for promotion from slow tiers. This prevents hot pages from being stuck on slow tiers, which causes performance degradation, and cold pages can be proactively demoted to slow tiers so that the system can increase the chance of allocating more hot pages to fast tiers. DAMON provides various tuning knobs, but we found that the proactive demotion for cold pages is especially useful when the system is running out of memory on its fast tier nodes. Our evaluation result shows that it reduces the performance slowdown compared to the default memory policy from 15~17% to 4~5% when the system runs under high memory pressure on its fast tier DRAM nodes. DAMON configuration === The specific DAMON configuration doesn't have to be in the scope of this patch series, but sharing a rough idea of it helps to explain the evaluation result. DAMON provides many knobs for fine tuning, and its configuration file is generated by HMSDK[2]. It includes a gen_config.py script that generates a json file with the full config of DAMON knobs, and it creates multiple kdamonds for each NUMA node when DAMON is enabled so that it can run hot/cold based migration for tiered memory. Evaluation Workload === The performance evaluation is done with redis[3], which is a widely used in-memory database, and the memory access patterns are generated via YCSB[4]. We have measured two different workloads with zipfian and latest distributions, but their configs are slightly modified to make memory usage higher and execution time longer for better evaluation. The idea of evaluation using these demote and promote actions covers system-wide memory management rather than partitioning hot/cold pages of a single workload. The default memory allocation policy creates pages on the fast tier DRAM node first, then allocates newly created pages on the slow tier CXL node when the DRAM node has insufficient free space. 
Once the page allocation is done, those pages never move between NUMA nodes. It's not true when using numa balancing, but that is not in the scope of this DAMON based 2-tier memory management support. If the working set of redis can fit fully into the DRAM node, then redis will access the fast DRAM only. Since the performance of DRAM only is faster than partially accessing CXL memory in slow tiers, this environment is not useful to evaluate this patch series. To make the pages of redis distributed across the fast DRAM node and the slow CXL node to evaluate our demote and promote actions, we pre-allocate some cold memory externally using mmap and memset before launching redis-server. We assumed that there is a large enough amount of cold memory in datacenters, as the TMO[5] and TPP[6] papers mentioned. The evaluation sequence is as follows. 1. Turn on DAMON with the DAMOS_DEMOTE action for the DRAM node and the DAMOS_PROMOTE action for the CXL node. It demotes cold pages on the DRAM node and promotes hot pages on the CXL node at a regular interval. 2. Allocate a huge block of cold memory by calling mmap and memset at the fast tier DRAM node, then make the process sleep so that the fast tier has insufficient memory for redis-server. 3. Launch redis-server and load a prebaked snapshot image, dump.rdb. The redis-server consumes 52GB of anon pages and 33GB of file pages, but due to the cold memory allocated at step 2
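For reference, the cold memory pre-allocation in step 2 can be done with a small program like the sketch below; the actual size and the node binding (e.g., running it under numactl bound to the DRAM node) are setup-specific assumptions, not part of the posted patches:

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(int argc, char *argv[])
	{
		/* size in GiB, given as the first argument */
		size_t size = (argc > 1 ? atoll(argv[1]) : 1) << 30;
		void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 0, size);	/* fault the pages in once */
		pause();		/* keep them resident but cold */
		return 0;
	}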
Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory
On Thu, 18 Jan 2024 19:40:16 +0900 Hyeongtak Ji wrote: > Hi SeongJae, > > On Wed, 17 Jan 2024 SeongJae Park wrote: > > [...] > >> Let's say there are 3 nodes in the system and the first node0 and node1 > >> are the first tier, and node2 is the second tier. > >> > >> $ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist > >> 0-1 > >> > >> $ cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist > >> 2 > >> > >> Here is the result of partitioning hot/cold memory and I put execution > >> command at the right side of numastat result. I initially ran each > >> hot_cold program with preferred setting so that they initially allocate > >> memory on one of node0 or node2, but they gradually migrated based on > >> their access frequencies. > >> > >> $ numastat -c -p hot_cold > >> Per-node process memory usage (in MBs) > >> PID Node 0 Node 1 Node 2 Total > >> --- -- -- -- - > >> 754 (hot_cold) 1800 0 2000 3800<- hot_cold 1800 2000 > >> 1184 (hot_cold) 300 0500 800<- hot_cold 300 500 > >> 1818 (hot_cold) 801 0 3199 4000<- hot_cold 800 3200 > >> 30289 (hot_cold) 4 0 510<- hot_cold 3 5 > >> 30325 (hot_cold) 31 0 5181<- hot_cold 30 50 > >> --- -- -- -- - > >> Total 2938 0 5756 8695 > >> > >> The final node placement result shows that DAMON accurately migrated > >> pages by their hotness for multiple processes. > > > > What was the result when the corner cases handling logics were not applied? > > This is the result of the same test that Honggyu did, but with an insufficient > corner cases handling logics. > > $ numastat -c -p hot_cold > > Per-node process memory usage (in MBs) > PID Node 0 Node 1 Node 2 Total > -- -- -- -- - > 862 (hot_cold)2256 0 1545 3801 <- hot_cold 1800 2000 > 863 (hot_cold) 403 0398 801 <- hot_cold 300 500 > 864 (hot_cold)1520 0 2482 4001 <- hot_cold 800 3200 > 865 (hot_cold) 6 0 3 9 <- hot_cold 3 5 > 866 (hot_cold) 29 0 5281 <- hot_cold 30 50 > -- -- -- -- - > Total 4215 0 4480 8695 > > As time goes by, DAMON keeps trying to split the hot/cold region, but it does > not seem to be enough. > > $ numastat -c -p hot_cold > > Per-node process memory usage (in MBs) > PID Node 0 Node 1 Node 2 Total > -- -- -- -- - > 862 (hot_cold)2022 0 1780 3801 <- hot_cold 1800 2000 > 863 (hot_cold) 351 0450 801 <- hot_cold 300 500 > 864 (hot_cold)1134 0 2868 4001 <- hot_cold 800 3200 > 865 (hot_cold) 7 0 2 9 <- hot_cold 3 5 > 866 (hot_cold) 43 0 3981 <- hot_cold 30 50 > -- -- -- -- - > Total 3557 0 5138 8695 > > > > > And, what are the corner cases handling logic that seemed essential? I show > > the page granularity active/reference check could indeed provide many > > improvements, but that's only my humble assumption. > > Yes, the page granularity active/reference check is essential. To make the > above "insufficient" result, the only thing I did was to promote > inactive/not_referenced pages. > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index f03be320f9ad..c2aefb883c54 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1127,9 +1127,7 @@ static unsigned int __promote_folio_list(struct > list_head *folio_list, > VM_BUG_ON_FOLIO(folio_test_active(folio), folio); > > references = folio_check_references(folio, sc); > - if (references == FOLIOREF_KEEP || > - references == FOLIOREF_RECLAIM || > - references == FOLIOREF_RECLAIM_CLEAN) > + if (references == FOLIOREF_KEEP ) > goto keep_locked; > > /* Relocate its contents to another node. */ Thank you for sharing the details :) I think DAMOS filters based approach could be worthy to try, then. 
> > > > > If the corner cases are indeed better to be applied in page granularity, I > > agree we need some more effort since DAMON monitoring results are not page > > granularity aware by the design. Users could increase min_nr_regions to > > make > > it more accurate, and we have a plan to support page granularity monitoring, > > though. But maybe the overhead could be unacceptable. > > > > The ideal solution would be making DAMON more accurate while keeping the current > > level > > of overhead. We indeed have TODO items for DAMON accuracy improvement, but > > this may take some time that might be unacceptable for your case. > > > > If that's the case, I think the additional corner handling (or, page gran additional access check) could be made as DAMOS filters[1].
Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory
Hi SeongJae, On Wed, 17 Jan 2024 SeongJae Park wrote: [...] >> Let's say there are 3 nodes in the system and the first node0 and node1 >> are the first tier, and node2 is the second tier. >> >> $ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist >> 0-1 >> >> $ cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist >> 2 >> >> Here is the result of partitioning hot/cold memory and I put the execution >> command at the right side of the numastat result. I initially ran each >> hot_cold program with a preferred setting so that they initially allocate >> memory on one of node0 or node2, but they gradually migrated based on >> their access frequencies. >> >> $ numastat -c -p hot_cold >> Per-node process memory usage (in MBs) >> PID Node 0 Node 1 Node 2 Total >> --- -- -- -- - >> 754 (hot_cold) 1800 0 2000 3800 <- hot_cold 1800 2000 >> 1184 (hot_cold) 300 0 500 800 <- hot_cold 300 500 >> 1818 (hot_cold) 801 0 3199 4000 <- hot_cold 800 3200 >> 30289 (hot_cold) 4 0 5 10 <- hot_cold 3 5 >> 30325 (hot_cold) 31 0 51 81 <- hot_cold 30 50 >> --- -- -- -- - >> Total 2938 0 5756 8695 >> >> The final node placement result shows that DAMON accurately migrated >> pages by their hotness for multiple processes. > > What was the result when the corner case handling logic was not applied? This is the result of the same test that Honggyu did, but with insufficient corner case handling logic. $ numastat -c -p hot_cold Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Total -- -- -- -- - 862 (hot_cold) 2256 0 1545 3801 <- hot_cold 1800 2000 863 (hot_cold) 403 0 398 801 <- hot_cold 300 500 864 (hot_cold) 1520 0 2482 4001 <- hot_cold 800 3200 865 (hot_cold) 6 0 3 9 <- hot_cold 3 5 866 (hot_cold) 29 0 52 81 <- hot_cold 30 50 -- -- -- -- - Total 4215 0 4480 8695 As time goes by, DAMON keeps trying to split the hot/cold region, but it does not seem to be enough. $ numastat -c -p hot_cold Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Total -- -- -- -- - 862 (hot_cold) 2022 0 1780 3801 <- hot_cold 1800 2000 863 (hot_cold) 351 0 450 801 <- hot_cold 300 500 864 (hot_cold) 1134 0 2868 4001 <- hot_cold 800 3200 865 (hot_cold) 7 0 2 9 <- hot_cold 3 5 866 (hot_cold) 43 0 39 81 <- hot_cold 30 50 -- -- -- -- - Total 3557 0 5138 8695 > > And, what is the corner case handling logic that seemed essential? I guess > the page granularity active/reference check could indeed provide many > improvements, but that's only my humble assumption. Yes, the page granularity active/reference check is essential. To make the above "insufficient" result, the only thing I did was to promote inactive/not_referenced pages. diff --git a/mm/vmscan.c b/mm/vmscan.c index f03be320f9ad..c2aefb883c54 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1127,9 +1127,7 @@ static unsigned int __promote_folio_list(struct list_head *folio_list, VM_BUG_ON_FOLIO(folio_test_active(folio), folio); references = folio_check_references(folio, sc); - if (references == FOLIOREF_KEEP || - references == FOLIOREF_RECLAIM || - references == FOLIOREF_RECLAIM_CLEAN) + if (references == FOLIOREF_KEEP ) goto keep_locked; /* Relocate its contents to another node. */ > > If the corner cases are indeed better to be applied in page granularity, I > agree we need some more effort since DAMON monitoring results are not page > granularity aware by the design. Users could increase min_nr_regions to make > it more accurate, and we have a plan to support page granularity monitoring, > though. But maybe the overhead could be unacceptable. 
> > The ideal solution would be making DAMON more accurate while keeping the current level > of overhead. We indeed have TODO items for DAMON accuracy improvement, but > this may take some time that might be unacceptable for your case. > > If that's the case, I think the additional corner handling (or, page gran > additional access check) could be made as DAMOS filters[1], since DAMOS > filters > can be applied in page granularity, and are designed for this kind of handling of > information that DAMON monitoring results cannot provide. More specifically, > we could have filters for promotion-qualifying pages and demotion-qualifying > pages. In this way, I think we can keep the action more flexible while the filters can be applied in creative ways.
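To recap the point with code, the page granularity re-check that the full corner case handling does before promotion looks like the sketch below. folio_check_references() and the FOLIOREF_* values are from mm/vmscan.c, while __promote_folio_list() is from the posted patches, so treat this as an approximation of the gate rather than the exact merged code:

	references = folio_check_references(folio, sc);
	if (references == FOLIOREF_KEEP ||
	    references == FOLIOREF_RECLAIM ||
	    references == FOLIOREF_RECLAIM_CLEAN)
		goto keep_locked;	/* not referenced enough; skip */
	/* only FOLIOREF_ACTIVATE folios reach the migration step */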
Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory
On Wed, 17 Jan 2024 13:11:03 -0800 SeongJae Park wrote: [...] > Hi Honggyu, > > On Wed, 17 Jan 2024 20:49:25 +0900 Honggyu Kim wrote: > > > Hi SeongJae, > > > > Thanks very much for your comments in detail. > > > > On Tue, 16 Jan 2024 12:31:59 -0800 SeongJae Park wrote: > > [...] > > > To this end, I feel the problem might be able to be simpler, because this > > > patchset is trying to provide two sophisticated operations, while I think > > > a > > > simpler approach might be possible. My humble simpler idea is adding a > > > DAMOS > > > operation for moving pages to a given node (like the sys_move_phy_pages > > > RFC[1]), > > > instead of the promote/demote. Because the general page migration can > > > handle > > > multiple cases including the promote/demote in my humble assumption. [...] > > > In more detail, users could decide which is the appropriate node for > > > promotion > > > or demotion and use the new DAMOS action to do promotion and demotion. > > > Users > > > would be requested to decide which node is the proper promotion/demotion > > > target > > > node, but that decision wouldn't be that hard in my opinion. > > > > > > For this, 'struct damos' would need to be updated for such > > > argument-dependent > > > actions, like 'struct damos_filter' is having a union. > > > > That might be a better solution. I will think about it. More specifically, I think receiving an address range as the argument might be more flexible than just a NUMA node. Maybe we can imagine proactively migrating cold movable pages from normal zones to movable zones, to avoid normal zone memory pressure. Yet another crazy idea: finding hot regions in the middle of cold regions and moving them next to other hot pages. As a result, memory is sorted by access temperature even in the same node, and the system gains more spatial locality, which benefits general locality-based algorithms including DAMON's adaptive regions adjustment. Thanks, SJ [...]
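A rough shape of the argument-dependent action idea, combining the NUMA node and the address range variants mentioned above, could be sketched as below. The field names are illustrative assumptions, simply mirroring how 'struct damos_filter' embeds a union; only 'struct damon_addr_range' and 'enum damos_action' are existing DAMON types:

	struct damos {
		enum damos_action action;
		union {
			/* for a move-pages-to-node style action */
			int target_nid;
			/* for a move-pages-to-address-range variant */
			struct damon_addr_range target_range;
		};
		/* ... quotas, watermarks, filters, stats ... */
	};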
Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory
I guess the page granularity active/reference check could indeed provide many improvements, but that's only my humble assumption. If the corner cases are indeed better to be applied in page granularity, I agree we need some more effort since DAMON monitoring results are not page granularity aware by the design. Users could increase min_nr_regions to make it more accurate, and we have a plan to support page granularity monitoring, though. But maybe the overhead could be unacceptable. The ideal solution would be making DAMON more accurate while keeping the current level of overhead. We indeed have TODO items for DAMON accuracy improvement, but this may take some time that might be unacceptable for your case. If that's the case, I think the additional corner handling (or, page gran additional access check) could be made as DAMOS filters[1], since DAMOS filters can be applied in page granularity, and are designed for this kind of handling of information that DAMON monitoring results cannot provide. More specifically, we could have filters for promotion-qualifying pages and demotion-qualifying pages. In this way, I think we can keep the action more flexible while the filters can be applied in creative ways. [1] https://git.kernel.org/sj/c/98def236f63c66629fb6b2d4b69cecffc5b46539 > > > In more detail, users could decide which is the appropriate node for > > promotion > > or demotion and use the new DAMOS action to do promotion and demotion. > > Users > > would be requested to decide which node is the proper promotion/demotion target > > node, but that decision wouldn't be that hard in my opinion. > > > > For this, 'struct damos' would need to be updated for such > > argument-dependent > > actions, like 'struct damos_filter' is having a union. > > That might be a better solution. I will think about it. More specifically, I think receiving an address range as the argument might be more flexible than just a NUMA node. Maybe we can imagine proactively migrating cold movable pages from normal zones to movable zones, to avoid normal zone memory pressure. > > > In the future, we could extend the operation to the promotion and the demotion > > after the discussion around the promotion and demotion has matured, if > > required. > > And assuming DAMON be extended for originating CPU-aware access monitoring, > > the > > new DAMOS action would also cover more use cases such as general NUMA nodes > > balancing (extending DAMON for CPU-aware monitoring would be required), and > > some > > complex configurations having both CPU affinity and tiered memory. I > > also think that may well fit with my RFC idea[2] for tiered memory > > management. > > > > Looking forward to opinions from you and others. I admit I may miss many > > things, > > and am more than happy to be enlightened. > > > > [1] https://lwn.net/Articles/944007/ > > [2] https://lore.kernel.org/damon/20231112195602.61525-1...@kernel.org/ > > Thanks very much for your comments. I will need a few more days for the > update but will try to address your concerns as much as possible. No problem, please take your time. I'm looking forward to the next version :) Thanks, SJ > > Thanks, > Honggyu
Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory
> And assuming DAMON be extended for originating CPU-aware access monitoring, > the > new DAMOS action would also cover more use cases such as general NUMA nodes > balancing (extending DAMON for CPU-aware monitoring would be required), and some > complex configurations having both CPU affinity and tiered memory. I > also think that may well fit with my RFC idea[2] for tiered memory management. > > Looking forward to opinions from you and others. I admit I may miss many things, > and am more than happy to be enlightened. > > [1] https://lwn.net/Articles/944007/ > [2] https://lore.kernel.org/damon/20231112195602.61525-1...@kernel.org/ Thanks very much for your comments. I will need a few more days for the update but will try to address your concerns as much as possible. Thanks, Honggyu
Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory
Hello, On Mon, 15 Jan 2024 13:52:48 +0900 Honggyu Kim wrote: > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > posted at [1]. > > It says there is no implementation of the demote/promote DAMOS action > are made. This RFC is about its implementation for physical address > space. > [...] > Evaluation Results > == > [...] > In summary of both results, our evaluation shows that "DAMON 2-tier" > memory management reduces the performance slowdown compared to the > "default" memory policy from 15~17% to 4~5% when the system runs with > high memory pressure on its fast tier DRAM nodes. > > The similar evaluation was done in another machine that has 256GB of > local DRAM and 96GB of CXL memory. The performance slowdown is reduced > from 20~24% for "default" to 5~7% for "DAMON 2-tier". > > Having these DAMOS_DEMOTE and DAMOS_PROMOTE actions can make 2-tier > memory systems run more efficiently under high memory pressures. Thank you so much for these great patches and the above nice test results. I believe the test setup and results make sense, and merging a revised version of this patchset would provide real benefits to the users. At a high level, I think it might be better to separate DAMON internal changes from DAMON external changes. For the DAMON part changes, I have no big concern other than trivial coding style level comments. For the DAMON-external changes that implement demote_pages() and promote_pages(), I'm unsure if the implementation is reusing appropriate functions, and if those are placed in the right source files. Especially, I'm unsure if vmscan.c is the right place for the promotion code. Also I don't know if there is a good agreement on the promotion/demotion target node decision. That should be because I'm not that familiar with the areas and the files, but I feel this might be because our discussions on the promotion and the demotion operations have room to mature. Because I'm not very familiar with the part, I'd like to hear others' comments, too. To this end, I feel the problem might be able to be simpler, because this patchset is trying to provide two sophisticated operations, while I think a simpler approach might be possible. My humble simpler idea is adding a DAMOS operation for moving pages to a given node (like the sys_move_phy_pages RFC[1]), instead of the promote/demote. Because the general page migration can handle multiple cases including the promote/demote in my humble assumption. In more detail, users could decide which is the appropriate node for promotion or demotion and use the new DAMOS action to do promotion and demotion. Users would be requested to decide which node is the proper promotion/demotion target node, but that decision wouldn't be that hard in my opinion. For this, 'struct damos' would need to be updated for such argument-dependent actions, like 'struct damos_filter' is having a union. In the future, we could extend the operation to the promotion and the demotion after the discussion around the promotion and demotion has matured, if required. And assuming DAMON be extended for originating CPU-aware access monitoring, the new DAMOS action would also cover more use cases such as general NUMA nodes balancing (extending DAMON for CPU-aware monitoring would be required), and some complex configurations having both CPU affinity and tiered memory. I also think that may well fit with my RFC idea[2] for tiered memory management. Looking forward to opinions from you and others. I admit I may miss many things, and am more than happy to be enlightened. 
[1] https://lwn.net/Articles/944007/ [2] https://lore.kernel.org/damon/20231112195602.61525-1...@kernel.org/ Thanks, SJ > > Signed-off-by: Honggyu Kim > Signed-off-by: Hyeongtak Ji > Signed-off-by: Rakie Kim > > [1] https://lore.kernel.org/damon/20231112195602.61525-1...@kernel.org > [2] https://github.com/skhynix/hmsdk > [3] https://github.com/redis/redis/tree/7.0.0 > [4] https://github.com/brianfrankcooper/YCSB/tree/0.17.0 > [5] https://dl.acm.org/doi/10.1145/3503222.3507731 > [6] https://dl.acm.org/doi/10.1145/3582016.3582063 > > Honggyu Kim (2): > mm/vmscan: refactor reclaim_pages with reclaim_or_migrate_folios > mm/damon: introduce DAMOS_DEMOTE action for demotion > > Hyeongtak Ji (2): > mm/memory-tiers: add next_promotion_node to find promotion target > mm/damon: introduce DAMOS_PROMOTE action for promotion > > include/linux/damon.h | 4 + > include/linux/memory-tiers.h | 11 ++ > include/linux/migrate_mode.h | 1 + > include/linux/vm_event_item.h | 1 + > include/trace/events/migrate.h | 3 +- > mm/damon/paddr.c | 46 ++- > mm/damon/sysfs-schemes.c | 2 + > mm/internal.h | 2 + > mm
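To make the argument-dependent action idea concrete, here is a minimal sketch of how 'struct damos' could carry a per-action argument, modeled after the union already used in 'struct damos_filter'. The DAMOS_MIGRATE name and the target_nid field are hypothetical illustrations, not part of any posted patch:

	enum damos_action {
		DAMOS_WILLNEED,
		DAMOS_COLD,
		DAMOS_PAGEOUT,
		DAMOS_HUGEPAGE,
		DAMOS_NOHUGEPAGE,
		/* ... other existing actions ... */
		DAMOS_MIGRATE,	/* hypothetical: move pages to a given node */
	};

	struct damos {
		/* ... existing fields (access pattern, quotas, etc.) elided ... */
		enum damos_action action;
		union {
			/* per-action argument, like struct damos_filter's union */
			int target_nid;	/* for DAMOS_MIGRATE */
		};
	};

With such a scheme, a "promotion" is simply DAMOS_MIGRATE with target_nid pointing at a fast-tier node applied to hot regions, and a "demotion" is the same action with a slow-tier target applied to cold regions.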
[RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously posted at [1].

It says no implementation of the demote/promote DAMOS actions has been made yet. This RFC is about their implementation for the physical address space.

Introduction
============

With the advent of CXL/PCIe-attached DRAM, which will be referred to simply as CXL memory in this cover letter, some systems are becoming more heterogeneous, with memory subsystems that have different latency and bandwidth characteristics. They are usually handled as different NUMA nodes in separate memory tiers, and CXL memory is used as a slow tier because of its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on the proper NUMA nodes based on their memory access frequency. Otherwise, some frequently accessed pages might reside on slow tiers, which causes unexpected performance degradation. Moreover, memory access patterns can change at runtime.

To handle this problem, we need a way to monitor the memory access patterns and migrate pages based on their access temperature. The DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based Operation Schemes) are useful features for monitoring and migrating pages. DAMOS provides multiple actions based on DAMON monitoring results and can be used for proactive reclaim, i.e. swapping cold pages out with the DAMOS_PAGEOUT action, but it doesn't support migration actions such as demotion and promotion between tiered memory nodes.

This series supports two new DAMOS actions: DAMOS_DEMOTE for demotion from fast tiers and DAMOS_PROMOTE for promotion from slow tiers. This prevents hot pages from being stuck on slow tiers, which would degrade performance, and lets cold pages be proactively demoted to slow tiers so that the system has a better chance of allocating more hot pages on fast tiers.

DAMON provides various tuning knobs, but we found that proactive demotion of cold pages is especially useful when the system is running out of memory on its fast tier nodes. Our evaluation shows that it reduces the performance slowdown compared to the default memory policy from 15~17% to 4~5% when the system runs under high memory pressure on its fast tier DRAM nodes.

DAMON configuration
===================

The specific DAMON configuration doesn't have to be in the scope of this patch series, but a rough idea is worth sharing to explain the evaluation results. DAMON provides many knobs for fine tuning, and our configuration file is generated by HMSDK[2]. It includes a gen_config.py script that generates a JSON file with the full configuration of DAMON knobs, and it creates multiple kdamonds, one for each NUMA node, when DAMON is enabled, so that it can run hot/cold-based migration for tiered memory.

Evaluation Workload
===================

The performance evaluation is done with redis[3], a widely used in-memory database, and the memory access patterns are generated via YCSB[4]. We have measured two different workloads with zipfian and latest distributions, but their configs are slightly modified to increase memory usage and execution time for better evaluation.

The idea of evaluating these demote and promote actions covers system-wide memory management rather than partitioning hot/cold pages of a single workload. The default memory allocation policy places pages on the fast tier DRAM node first, then allocates newly created pages on the slow tier CXL node when the DRAM node has insufficient free space.
Once the page allocation is done, those pages never move between NUMA nodes. (That's not true when NUMA balancing is in use, but that is out of scope for this DAMON-based 2-tier memory management support.)

If the working set of redis fits fully into the DRAM node, then redis will access the fast DRAM only. Since DRAM-only performance is faster than partially accessing CXL memory in slow tiers, that environment is not useful for evaluating this patch series.

To make the redis pages distribute across the fast DRAM node and the slow CXL node so that we can evaluate our demote and promote actions, we pre-allocate some cold memory externally using mmap and memset before launching redis-server. We assume there is a sufficient amount of cold memory in datacenters, as the TMO[5] and TPP[6] papers mention.

The evaluation sequence is as follows.

1. Turn on DAMON with the DAMOS_DEMOTE action for the DRAM node and the DAMOS_PROMOTE action for the CXL node. It demotes cold pages on the DRAM node and promotes hot pages on the CXL node at a regular interval.

2. Allocate a huge block of cold memory by calling mmap and memset at the fast tier DRAM node, then make the process sleep so that the fast tier has insufficient memory for redis-server.

3. Launch redis-server and load the prebaked snapshot image, dump.rdb. The redis-server consumes 52GB of anon pages and 33GB of file pages, but due to the cold memory allocated at 2
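For reference, the cold-memory pre-allocation in step 2 can be reproduced with a few lines of C. This is only a sketch of the method described above; the size argument and any node binding (e.g. running it under numactl -m 0) are assumptions:

	/* cold.c - pin a large cold anonymous region, then sleep.
	 * Example: numactl -m 0 ./cold 40   (pre-allocate 40 GiB on node 0)
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		size_t gib = argc > 1 ? strtoul(argv[1], NULL, 0) : 1;
		size_t len = gib << 30;
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(p, 0, len);	/* fault the pages in so they stay resident */
		pause();		/* keep them allocated while redis runs */
		return 0;
	}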
[PATCH v2 3/5] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT
MEMORY MANAGEMENT seems to be a good fit. Cc: Andrew Morton Cc: Michal Hocko Cc: Oscar Salvador Cc: Jason Gunthorpe Cc: Peter Xu Cc: Shuah Khan Cc: linux-kselft...@vger.kernel.org Signed-off-by: David Hildenbrand --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 9450e052f1b1..cd267d218e08 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11566,6 +11566,7 @@ F: include/linux/mm.h F: include/linux/mmzone.h F: include/linux/vmalloc.h F: mm/ +F: tools/testing/selftests/vm/ MEMORY TECHNOLOGY DEVICES (MTD) M: Miquel Raynal -- 2.30.2
Re: [PATCH 1/2] MAINTAINERS: assign pagewalk.h to MEMORY MANAGEMENT
On Mon, Mar 22, 2021 at 01:25:41PM +0100, Lukas Bulwahn wrote: > Commit a520110e4a15 ("mm: split out a new pagewalk.h header from mm.h") > adds a new file in ./include/linux, but misses to update MAINTAINERS > accordingly. Hence, ./scripts/get_maintainers.pl ./include/linux/pagewalk.h > points only to lkml as general fallback for all files, whereas the original > ./include/linux/mm.h clearly marks this file part of MEMORY MANAGEMENT. > > Assign ./include/linux/pagewalk.h to MEMORY MANAGEMENT. > > Signed-off-by: Lukas Bulwahn > --- > MAINTAINERS | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 728216e3919c..46a1eddbc3e4 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -11622,6 +11622,7 @@ F:include/linux/gfp.h > F: include/linux/memory_hotplug.h > F: include/linux/mm.h > F: include/linux/mmzone.h > +F: include/linux/pagewalk.h I'd say that we miss all include/linux/page* here, not only pagewalk.h > F: include/linux/vmalloc.h > F: mm/ > > -- > 2.17.1 > -- Sincerely yours, Mike.
[PATCH 1/2] MAINTAINERS: assign pagewalk.h to MEMORY MANAGEMENT
Commit a520110e4a15 ("mm: split out a new pagewalk.h header from mm.h") adds a new file in ./include/linux, but misses to update MAINTAINERS accordingly. Hence, ./scripts/get_maintainers.pl ./include/linux/pagewalk.h points only to lkml as general fallback for all files, whereas the original ./include/linux/mm.h clearly marks this file part of MEMORY MANAGEMENT. Assign ./include/linux/pagewalk.h to MEMORY MANAGEMENT. Signed-off-by: Lukas Bulwahn --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 728216e3919c..46a1eddbc3e4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11622,6 +11622,7 @@ F: include/linux/gfp.h F: include/linux/memory_hotplug.h F: include/linux/mm.h F: include/linux/mmzone.h +F: include/linux/pagewalk.h F: include/linux/vmalloc.h F: mm/ -- 2.17.1
[PATCH v1 3/5] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT
MEMORY MANAGEMENT seems to be a good fit. Cc: Andrew Morton Cc: Michal Hocko Cc: Oscar Salvador Cc: Jason Gunthorpe Cc: Peter Xu Cc: Shuah Khan Cc: linux-kselft...@vger.kernel.org Signed-off-by: David Hildenbrand --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index aa84121c5611..b00963f4aa09 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11560,6 +11560,7 @@ F: include/linux/mm.h F: include/linux/mmzone.h F: include/linux/vmalloc.h F: mm/ +F: tools/testing/selftests/vm/ MEMORY TECHNOLOGY DEVICES (MTD) M: Miquel Raynal -- 2.29.2
[PATCH v2 0/3] Soft limit memory management bug fixes
During testing of tiered memory management based on memory soft limit, I found three issues with memory management using cgroup based soft limit in the mainline code. Fix the issues with the three patches in this series. Also updated patch 3 per Johannes' comments on the first version of this patch. Thanks. Tim Changelog: v2 1. Do soft limit tree uncharge update in batch of the same node only for v1 cgroups that have a soft limit. Batching in nodes is only relevant for cgroup v1 that has per node soft limit tree. Tim Chen (3): mm: Fix dropped memcg from mem cgroup soft limit tree mm: Force update of mem cgroup soft limit tree on usage excess mm: Fix missing mem cgroup soft limit tree updates mm/memcontrol.c | 27 +++ 1 file changed, 23 insertions(+), 4 deletions(-) -- 2.20.1
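For context, the soft limit interface these fixes exercise is the cgroup v1 memory controller's per-group knob; under global memory pressure the kernel preferentially reclaims from groups exceeding their soft limit. A minimal usage sketch, assuming the legacy v1 hierarchy is mounted at the usual place:

	# create a v1 memcg and give it a 512M soft limit
	mkdir /sys/fs/cgroup/memory/test
	echo 512M > /sys/fs/cgroup/memory/test/memory.soft_limit_in_bytes
	# move the current shell into the group
	echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs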
[PATCH 0/3] Soft limit memory management bug fixes
During testing of tiered memory management based on memory soft limit, I found three issues with memory management using cgroup based soft limit in the mainline code. Fix the issues with the three patches in this series. Tim Chen (3): mm: Fix dropped memcg from mem cgroup soft limit tree mm: Force update of mem cgroup soft limit tree on usage excess mm: Fix missing mem cgroup soft limit tree updates mm/memcontrol.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) -- 2.20.1
[RFC V2 03/37] dmem: implement dmem memory management
From: Yulei Zhang The belowing figure shows the topology of dmem memory management, it reserves a few memory regions from the numa nodes, and in each region it leverages the bitmap to track the actual memory usage. +--+-+---+-+ | Node0| | ... | NodeN | +--/-\-+-+---+-+ / \ +---v+v-+--+ |region 0| ... | region n | +--/---\-+--+--+ / \ +-+v+--v---+-+-+-+ | | | bitmap | | | | +-+-+--+-+-+-+ It introduces the interfaces to manage dmem pages that include: - dmem_region_register(), it registers the reserved memory to the dmem management system - dmem_alloc_init(), initiate dmem allocator, note the page size the allocator used isn't the same thing with the alignment used to reserve dmem memory - dmem_alloc_pages_vma() and dmem_free_pages() are the interfaces allocating and freeing dmem memory, multiple pages can be allocated at one time, but it should be power of two Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- include/linux/dmem.h | 3 + mm/dmem.c| 674 +++ 2 files changed, 677 insertions(+) diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 5049322..476a82e 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -7,6 +7,9 @@ void dmem_init(void); int dmem_region_register(int node, phys_addr_t start, phys_addr_t end); +int dmem_alloc_init(unsigned long dpage_shift); +void dmem_alloc_uinit(void); + #else static inline int dmem_reserve_init(void) { diff --git a/mm/dmem.c b/mm/dmem.c index b5fb4f1..a77a064 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -91,11 +91,38 @@ struct dmem_pool { .lock = __MUTEX_INITIALIZER(dmem_pool.lock), }; +#define DMEM_PAGE_SIZE (1UL << dmem_pool.dpage_shift) +#define DMEM_PAGE_UP(x)phys_to_dpage(((x) + DMEM_PAGE_SIZE - 1)) +#define DMEM_PAGE_DOWN(x) phys_to_dpage(x) + +#define dpage_to_phys(_dpage) \ + ((_dpage) << dmem_pool.dpage_shift) +#define phys_to_dpage(_addr) \ + ((_addr) >> dmem_pool.dpage_shift) + +#define dpage_to_pfn(_dpage) \ + (__phys_to_pfn(dpage_to_phys(_dpage))) +#define pfn_to_dpage(_pfn) \ + (phys_to_dpage(__pfn_to_phys(_pfn))) + +#define dnode_to_nid(_dnode) \ + ((_dnode) - dmem_pool.nodes) +#define nid_to_dnode(nid) \ + (_pool.nodes[nid]) + #define for_each_dmem_node(_dnode) \ for (_dnode = dmem_pool.nodes; \ _dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes); \ _dnode++) +#define for_each_dmem_region(_dnode, _dregion) \ + list_for_each_entry(_dregion, &(_dnode)->regions, node) + +static inline int *dmem_nodelist(int nid) +{ + return nid_to_dnode(nid)->nodelist; +} + void __init dmem_init(void) { struct dmem_node *dnode; @@ -135,3 +162,650 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end) return 0; } +#define PENALTY_FOR_DMEM_SHARED_NODE (1) + +static int dmem_nodeload[MAX_NUMNODES] __initdata; + +/* Evaluate penalty for each dmem node */ +static int __init dmem_evaluate_node(int local, int node) +{ + int penalty; + + /* Use the distance array to find the distance */ + penalty = node_distance(local, node); + + /* Penalize nodes under us ("prefer the next node") */ + penalty += (node < local); + + /* Give preference to headless and unused nodes */ + if (!cpumask_empty(cpumask_of_node(node))) + penalty += PENALTY_FOR_NODE_WITH_CPUS; + + /* Penalize dmem-node shared with kernel */ + if (node_state(node, N_MEMORY)) + penalty += PENALTY_FOR_DMEM_SHARED_NODE; + + /* Slight preference for less loaded node */ + penalty *= (nr_online_nodes * MAX_NUMNODES); + + penalty += dmem_nodeload[node]; + + return penalty; +} + +static int __init find_next_dmem_node(int local, nodemask_t *used_nodes) +{ + struct 
dmem_node *dnode; + int node, best_node = NUMA_NO_NODE; + int penalty, min_penalty = INT_MAX; + + /* Invalid node is not suitable to call node_distance */ + if (!node_state(local, N_POSSIBLE)) + return NUMA_NO_NODE; + + /* Use the local node if we haven't already */ + if (!node_isset(local, *used_nodes)) { + node_set(local, *used_nodes); + return local; + } + + for_each_dmem_node(dnod
[PATCH 03/35] dmem: implement dmem memory management
From: Yulei Zhang It introduces the interfaces to manage dmem pages that include: - dmem_region_register(), it registers the reserved memory to the dmem management system, later it can be allocated out for dmemfs - dmem_alloc_init(), initiate dmem allocator, note the page size the allocator used isn't the same thing with the alignment used to reserve dmem memory - dmem_alloc_pages_from_vma() and dmem_free_pages() are the interfaces allocating and freeing dmem memory, multiple pages can be allocated at one time, but it should be power of two Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- include/linux/dmem.h | 3 + mm/dmem.c| 674 +++ 2 files changed, 677 insertions(+) diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 5049322d941c..476a82e8f252 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -7,6 +7,9 @@ int dmem_reserve_init(void); void dmem_init(void); int dmem_region_register(int node, phys_addr_t start, phys_addr_t end); +int dmem_alloc_init(unsigned long dpage_shift); +void dmem_alloc_uinit(void); + #else static inline int dmem_reserve_init(void) { diff --git a/mm/dmem.c b/mm/dmem.c index b5fb4f1b92db..a77a064c8d59 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -91,11 +91,38 @@ static struct dmem_pool dmem_pool = { .lock = __MUTEX_INITIALIZER(dmem_pool.lock), }; +#define DMEM_PAGE_SIZE (1UL << dmem_pool.dpage_shift) +#define DMEM_PAGE_UP(x)phys_to_dpage(((x) + DMEM_PAGE_SIZE - 1)) +#define DMEM_PAGE_DOWN(x) phys_to_dpage(x) + +#define dpage_to_phys(_dpage) \ + ((_dpage) << dmem_pool.dpage_shift) +#define phys_to_dpage(_addr) \ + ((_addr) >> dmem_pool.dpage_shift) + +#define dpage_to_pfn(_dpage) \ + (__phys_to_pfn(dpage_to_phys(_dpage))) +#define pfn_to_dpage(_pfn) \ + (phys_to_dpage(__pfn_to_phys(_pfn))) + +#define dnode_to_nid(_dnode) \ + ((_dnode) - dmem_pool.nodes) +#define nid_to_dnode(nid) \ + (_pool.nodes[nid]) + #define for_each_dmem_node(_dnode) \ for (_dnode = dmem_pool.nodes; \ _dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes); \ _dnode++) +#define for_each_dmem_region(_dnode, _dregion) \ + list_for_each_entry(_dregion, &(_dnode)->regions, node) + +static inline int *dmem_nodelist(int nid) +{ + return nid_to_dnode(nid)->nodelist; +} + void __init dmem_init(void) { struct dmem_node *dnode; @@ -135,3 +162,649 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end) return 0; } +#define PENALTY_FOR_DMEM_SHARED_NODE (1) + +static int dmem_nodeload[MAX_NUMNODES] __initdata; + +/* Evaluate penalty for each dmem node */ +static int __init dmem_evaluate_node(int local, int node) +{ + int penalty; + + /* Use the distance array to find the distance */ + penalty = node_distance(local, node); + + /* Penalize nodes under us ("prefer the next node") */ + penalty += (node < local); + + /* Give preference to headless and unused nodes */ + if (!cpumask_empty(cpumask_of_node(node))) + penalty += PENALTY_FOR_NODE_WITH_CPUS; + + /* Penalize dmem-node shared with kernel */ + if (node_state(node, N_MEMORY)) + penalty += PENALTY_FOR_DMEM_SHARED_NODE; + + /* Slight preference for less loaded node */ + penalty *= (nr_online_nodes * MAX_NUMNODES); + + penalty += dmem_nodeload[node]; + + return penalty; +} + +static int __init find_next_dmem_node(int local, nodemask_t *used_nodes) +{ + struct dmem_node *dnode; + int node, best_node = NUMA_NO_NODE; + int penalty, min_penalty = INT_MAX; + + /* Invalid node is not suitable to call node_distance */ + if (!node_state(local, N_POSSIBLE)) + return NUMA_NO_NODE; + + /* Use the local node if we haven't 
already */
+	if (!node_isset(local, *used_nodes)) {
+		node_set(local, *used_nodes);
+		return local;
+	}
+
+	for_each_dmem_node(dnode) {
+		if (list_empty(&dnode->regions))
+			continue;
+
+		node = dnode_to_nid(dnode);
+
+		/* Don't want a node to appear more than once */
+		if (node_isset(node, *used_nodes))
+			continue;
+
+		penalty = dmem_evaluate_node(local, node);
+
+		if (penalty < min_penalty) {
+			min_penalty = penalty;
+			best_node = node;
+		}
+	}
+
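As an aside, the dpage conversion macros in the hunks above are plain shifts by dpage_shift. A small user-space illustration of the same arithmetic, assuming 2MB dmem pages (dpage_shift = 21); the address value is only an example:

	#include <stdio.h>

	int main(void)
	{
		unsigned long dpage_shift = 21;			/* 2MB dmem pages */
		unsigned long phys = 0x180200000UL;		/* example address */
		unsigned long dpage = phys >> dpage_shift;	/* phys_to_dpage() */

		/* prints: phys 0x180200000 -> dpage 3073 -> base 0x180200000 */
		printf("phys 0x%lx -> dpage %lu -> base 0x%lx\n",
		       phys, dpage, dpage << dpage_shift);	/* dpage_to_phys() */
		return 0;
	}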
Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
On Mon, Jun 8, 2020 at 5:33 PM Rafael J. Wysocki wrote: > > On Saturday, June 6, 2020 8:56:26 AM CEST Rafael J. Wysocki wrote: > > On Fri, Jun 5, 2020 at 7:09 PM Dan Williams > > wrote: > > > > > > On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki > > > wrote: > > > > > > > > From: Rafael J. Wysocki > > > > Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory > > > > management > > > > > > > > The ACPI OS layer uses RCU to protect the list of ACPI memory > > > > mappings from being walked while it is updated. Among other > > > > situations, that list can be walked in non-NMI interrupt context, > > > > so using a sleeping lock to protect it is not an option. > > > > > > > > However, performance issues related to the RCU usage in there > > > > appear, as described by Dan Williams: > > > > > > > > "Recently a performance problem was reported for a process invoking > > > > a non-trival ASL program. The method call in this case ends up > > > > repetitively triggering a call path like: > > > > > > > > acpi_ex_store > > > > acpi_ex_store_object_to_node > > > > acpi_ex_write_data_to_field > > > > acpi_ex_insert_into_field > > > > acpi_ex_write_with_update_rule > > > > acpi_ex_field_datum_io > > > > acpi_ex_access_region > > > > acpi_ev_address_space_dispatch > > > > acpi_ex_system_memory_space_handler > > > > acpi_os_map_cleanup.part.14 > > > > _synchronize_rcu_expedited.constprop.89 > > > > schedule > > > > > > > > The end result of frequent synchronize_rcu_expedited() invocation is > > > > tiny sub-millisecond spurts of execution where the scheduler freely > > > > migrates this apparently sleepy task. The overhead of frequent > > > > scheduler invocation multiplies the execution time by a factor > > > > of 2-3X." > > > > > > > > In order to avoid these issues, replace the RCU in the ACPI OS > > > > layer by an rwlock. > > > > > > > > That rwlock should not be frequently contended, so the performance > > > > impact of it is not expected to be significant. > > > > > > > > Reported-by: Dan Williams > > > > Signed-off-by: Rafael J. Wysocki > > > > --- > > > > > > > > Hi Dan, > > > > > > > > This is a possible fix for the ACPI OSL RCU-related performance issues, > > > > but > > > > can you please arrange for the testing of it on the affected systems? > > > > > > Ugh, is it really this simple? I did not realize the read-side is NMI > > > safe. I'll take a look. > > > > But if an NMI triggers while the lock is being held for writing, it > > will deadlock, won't it? > > > > OTOH, according to the RCU documentation it is valid to call > > rcu_read_[un]lock() from an NMI handler (see Interrupts and NMIs in > > Documentation/RCU/Design/Requirements/Requirements.rst) so we are good > > from this perspective today. > > > > Unless we teach APEI to avoid mapping lookups from > > apei_{read|write}(), which wouldn't be unreasonable by itself, we need > > to hold on to the RCU in ACPI OSL, so it looks like addressing the > > problem in ACPICA is the best way to do it (and the current ACPICA > > code in question is suboptimal, so it would be good to rework it > > anyway). > > > > Cheers! > > I've sent the prototype patch below to you, Bob and Erik in private, so > here it goes to the lists for completeness. > > It introduces a "fast-path" variant of acpi_os_map_memory() that only > returns non-NULL if a matching mapping is already there in the list > and reworks acpi_ex_system_memory_space_handler() to use it. 
> > The idea is to do a fast-path lookup first for every new mapping and > only run the full acpi_os_map_memory() if that returns NULL and then > save the mapping return by it and do a fast-path lookup for it again > to bump up its reference counter in the OSL layer. That should prevent > the mappings from going away until the opregions that they belong to > go away (the opregion deactivation code is updated too to remove the > saved mappings), so in the cases when there's not too much opregion > creation and removal activity, it should make the RCU-related overhead > go away. > > Please test. > >
Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
On Saturday, June 6, 2020 8:56:26 AM CEST Rafael J. Wysocki wrote: > On Fri, Jun 5, 2020 at 7:09 PM Dan Williams wrote: > > > > On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki wrote: > > > > > > From: Rafael J. Wysocki > > > Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory > > > management > > > > > > The ACPI OS layer uses RCU to protect the list of ACPI memory > > > mappings from being walked while it is updated. Among other > > > situations, that list can be walked in non-NMI interrupt context, > > > so using a sleeping lock to protect it is not an option. > > > > > > However, performance issues related to the RCU usage in there > > > appear, as described by Dan Williams: > > > > > > "Recently a performance problem was reported for a process invoking > > > a non-trival ASL program. The method call in this case ends up > > > repetitively triggering a call path like: > > > > > > acpi_ex_store > > > acpi_ex_store_object_to_node > > > acpi_ex_write_data_to_field > > > acpi_ex_insert_into_field > > > acpi_ex_write_with_update_rule > > > acpi_ex_field_datum_io > > > acpi_ex_access_region > > > acpi_ev_address_space_dispatch > > > acpi_ex_system_memory_space_handler > > > acpi_os_map_cleanup.part.14 > > > _synchronize_rcu_expedited.constprop.89 > > > schedule > > > > > > The end result of frequent synchronize_rcu_expedited() invocation is > > > tiny sub-millisecond spurts of execution where the scheduler freely > > > migrates this apparently sleepy task. The overhead of frequent > > > scheduler invocation multiplies the execution time by a factor > > > of 2-3X." > > > > > > In order to avoid these issues, replace the RCU in the ACPI OS > > > layer by an rwlock. > > > > > > That rwlock should not be frequently contended, so the performance > > > impact of it is not expected to be significant. > > > > > > Reported-by: Dan Williams > > > Signed-off-by: Rafael J. Wysocki > > > --- > > > > > > Hi Dan, > > > > > > This is a possible fix for the ACPI OSL RCU-related performance issues, > > > but > > > can you please arrange for the testing of it on the affected systems? > > > > Ugh, is it really this simple? I did not realize the read-side is NMI > > safe. I'll take a look. > > But if an NMI triggers while the lock is being held for writing, it > will deadlock, won't it? > > OTOH, according to the RCU documentation it is valid to call > rcu_read_[un]lock() from an NMI handler (see Interrupts and NMIs in > Documentation/RCU/Design/Requirements/Requirements.rst) so we are good > from this perspective today. > > Unless we teach APEI to avoid mapping lookups from > apei_{read|write}(), which wouldn't be unreasonable by itself, we need > to hold on to the RCU in ACPI OSL, so it looks like addressing the > problem in ACPICA is the best way to do it (and the current ACPICA > code in question is suboptimal, so it would be good to rework it > anyway). > > Cheers! I've sent the prototype patch below to you, Bob and Erik in private, so here it goes to the lists for completeness. It introduces a "fast-path" variant of acpi_os_map_memory() that only returns non-NULL if a matching mapping is already there in the list and reworks acpi_ex_system_memory_space_handler() to use it. The idea is to do a fast-path lookup first for every new mapping and only run the full acpi_os_map_memory() if that returns NULL and then save the mapping return by it and do a fast-path lookup for it again to bump up its reference counter in the OSL layer. 
That should prevent the mappings from going away until the opregions that they belong to go away (the opregion deactivation code is updated too to remove the saved mappings), so in the cases when there's not too much opregion creation and removal activity, it should make the RCU-related overhead go away. Please test. Cheers! --- drivers/acpi/acpica/evrgnini.c| 14 - drivers/acpi/acpica/exregion.c| 49 +-- drivers/acpi/osl.c| 59 -- include/acpi/actypes.h|7 include/acpi/platform/aclinuxex.h |2 + 5 files changed, 112 insertions(+), 19 deletions(-) Index: linux-pm/drivers/acpi/osl.c === --- linux-pm.orig/drivers/acpi/osl.c +++ linux-pm/drivers/acpi/osl.c @@ -302,21 +302,8 @@ stat
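The lookup-first scheme described above can be summarized in a short sketch. The control flow below is an assumption reconstructed from the description: acpi_os_map_memory_fast_path() stands for the prototype's new helper, which returns non-NULL only when a matching mapping already exists (and bumps its reference counter), and the wrapper function name is illustrative:

	/* Sketch only - not the final ACPICA code. */
	void __iomem *em_get_mapping(acpi_physical_address phys, acpi_size size)
	{
		void __iomem *virt;

		virt = acpi_os_map_memory_fast_path(phys, size);
		if (virt)
			return virt;	/* cached: unmap never happens, no RCU sync */

		virt = acpi_os_map_memory(phys, size);	/* full (slow) path */
		if (!virt)
			return NULL;

		/* look it up again to take the reference the opregion keeps
		 * until it is deactivated */
		acpi_os_map_memory_fast_path(phys, size);
		return virt;
	}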
Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
On Fri, Jun 5, 2020 at 7:09 PM Dan Williams wrote: > > On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki wrote: > > > > From: Rafael J. Wysocki > > Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management > > > > The ACPI OS layer uses RCU to protect the list of ACPI memory > > mappings from being walked while it is updated. Among other > > situations, that list can be walked in non-NMI interrupt context, > > so using a sleeping lock to protect it is not an option. > > > > However, performance issues related to the RCU usage in there > > appear, as described by Dan Williams: > > > > "Recently a performance problem was reported for a process invoking > > a non-trival ASL program. The method call in this case ends up > > repetitively triggering a call path like: > > > > acpi_ex_store > > acpi_ex_store_object_to_node > > acpi_ex_write_data_to_field > > acpi_ex_insert_into_field > > acpi_ex_write_with_update_rule > > acpi_ex_field_datum_io > > acpi_ex_access_region > > acpi_ev_address_space_dispatch > > acpi_ex_system_memory_space_handler > > acpi_os_map_cleanup.part.14 > > _synchronize_rcu_expedited.constprop.89 > > schedule > > > > The end result of frequent synchronize_rcu_expedited() invocation is > > tiny sub-millisecond spurts of execution where the scheduler freely > > migrates this apparently sleepy task. The overhead of frequent > > scheduler invocation multiplies the execution time by a factor > > of 2-3X." > > > > In order to avoid these issues, replace the RCU in the ACPI OS > > layer by an rwlock. > > > > That rwlock should not be frequently contended, so the performance > > impact of it is not expected to be significant. > > > > Reported-by: Dan Williams > > Signed-off-by: Rafael J. Wysocki > > --- > > > > Hi Dan, > > > > This is a possible fix for the ACPI OSL RCU-related performance issues, but > > can you please arrange for the testing of it on the affected systems? > > Ugh, is it really this simple? I did not realize the read-side is NMI > safe. I'll take a look. But if an NMI triggers while the lock is being held for writing, it will deadlock, won't it? OTOH, according to the RCU documentation it is valid to call rcu_read_[un]lock() from an NMI handler (see Interrupts and NMIs in Documentation/RCU/Design/Requirements/Requirements.rst) so we are good from this perspective today. Unless we teach APEI to avoid mapping lookups from apei_{read|write}(), which wouldn't be unreasonable by itself, we need to hold on to the RCU in ACPI OSL, so it looks like addressing the problem in ACPICA is the best way to do it (and the current ACPICA code in question is suboptimal, so it would be good to rework it anyway). Cheers!
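The NMI hazard discussed above can be spelled out in a few commented lines; this is a sketch of the failure mode, not code from the patch:

	/* normal context */
	write_lock_irq(&acpi_ioremaps_list_lock);	/* IRQs masked, NMIs are not */
	/* ... an NMI arrives here ... */

	/* NMI handler (e.g. APEI via apei_read()) */
	read_lock(&acpi_ioremaps_list_lock);	/* spins forever: the writer it
						 * waits for is the very context
						 * it interrupted -> deadlock */

	/* RCU, by contrast, is documented as NMI-safe: */
	rcu_read_lock();	/* never spins on a lock word */
	/* ... walk the list ... */
	rcu_read_unlock();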
Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
On Fri, Jun 5, 2020 at 9:40 PM Andy Shevchenko wrote:
> On Fri, Jun 5, 2020 at 5:11 PM Rafael J. Wysocki wrote:
>
> ...
>
> > +	if (!refcount) {
> > +		write_lock_irq(&acpi_ioremaps_list_lock);
> > +
> > +		list_del(&map->list);
> > +
> > +		write_unlock_irq(&acpi_ioremaps_list_lock);
> > +	}
> > 	return refcount;
>
> It seems we can decrease indentation level at the same time:
>
> 	if (refcount)
> 		return refcount;
>
> 	...
> 	return 0;

Right, but the patch will need to be dropped anyway I think.
Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
On Fri, Jun 5, 2020 at 5:11 PM Rafael J. Wysocki wrote:

...

> +	if (!refcount) {
> +		write_lock_irq(&acpi_ioremaps_list_lock);
> +
> +		list_del(&map->list);
> +
> +		write_unlock_irq(&acpi_ioremaps_list_lock);
> +	}
> 	return refcount;

It seems we can decrease indentation level at the same time:

	if (refcount)
		return refcount;

	...
	return 0;

-- 
With Best Regards,
Andy Shevchenko
Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management > > The ACPI OS layer uses RCU to protect the list of ACPI memory > mappings from being walked while it is updated. Among other > situations, that list can be walked in non-NMI interrupt context, > so using a sleeping lock to protect it is not an option. > > However, performance issues related to the RCU usage in there > appear, as described by Dan Williams: > > "Recently a performance problem was reported for a process invoking > a non-trival ASL program. The method call in this case ends up > repetitively triggering a call path like: > > acpi_ex_store > acpi_ex_store_object_to_node > acpi_ex_write_data_to_field > acpi_ex_insert_into_field > acpi_ex_write_with_update_rule > acpi_ex_field_datum_io > acpi_ex_access_region > acpi_ev_address_space_dispatch > acpi_ex_system_memory_space_handler > acpi_os_map_cleanup.part.14 > _synchronize_rcu_expedited.constprop.89 > schedule > > The end result of frequent synchronize_rcu_expedited() invocation is > tiny sub-millisecond spurts of execution where the scheduler freely > migrates this apparently sleepy task. The overhead of frequent > scheduler invocation multiplies the execution time by a factor > of 2-3X." > > In order to avoid these issues, replace the RCU in the ACPI OS > layer by an rwlock. > > That rwlock should not be frequently contended, so the performance > impact of it is not expected to be significant. > > Reported-by: Dan Williams > Signed-off-by: Rafael J. Wysocki > --- > > Hi Dan, > > This is a possible fix for the ACPI OSL RCU-related performance issues, but > can you please arrange for the testing of it on the affected systems? Ugh, is it really this simple? I did not realize the read-side is NMI safe. I'll take a look.
[RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
From: Rafael J. Wysocki Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management The ACPI OS layer uses RCU to protect the list of ACPI memory mappings from being walked while it is updated. Among other situations, that list can be walked in non-NMI interrupt context, so using a sleeping lock to protect it is not an option. However, performance issues related to the RCU usage in there appear, as described by Dan Williams: "Recently a performance problem was reported for a process invoking a non-trival ASL program. The method call in this case ends up repetitively triggering a call path like: acpi_ex_store acpi_ex_store_object_to_node acpi_ex_write_data_to_field acpi_ex_insert_into_field acpi_ex_write_with_update_rule acpi_ex_field_datum_io acpi_ex_access_region acpi_ev_address_space_dispatch acpi_ex_system_memory_space_handler acpi_os_map_cleanup.part.14 _synchronize_rcu_expedited.constprop.89 schedule The end result of frequent synchronize_rcu_expedited() invocation is tiny sub-millisecond spurts of execution where the scheduler freely migrates this apparently sleepy task. The overhead of frequent scheduler invocation multiplies the execution time by a factor of 2-3X." In order to avoid these issues, replace the RCU in the ACPI OS layer by an rwlock. That rwlock should not be frequently contended, so the performance impact of it is not expected to be significant. Reported-by: Dan Williams Signed-off-by: Rafael J. Wysocki --- Hi Dan, This is a possible fix for the ACPI OSL RCU-related performance issues, but can you please arrange for the testing of it on the affected systems? Cheers! --- drivers/acpi/osl.c | 50 ++ 1 file changed, 34 insertions(+), 16 deletions(-) Index: linux-pm/drivers/acpi/osl.c === --- linux-pm.orig/drivers/acpi/osl.c +++ linux-pm/drivers/acpi/osl.c @@ -81,8 +81,8 @@ struct acpi_ioremap { }; static LIST_HEAD(acpi_ioremaps); +static DEFINE_RWLOCK(acpi_ioremaps_list_lock); static DEFINE_MUTEX(acpi_ioremap_lock); -#define acpi_ioremap_lock_held() lock_is_held(_ioremap_lock.dep_map) static void __init acpi_request_region (struct acpi_generic_address *gas, unsigned int length, char *desc) @@ -214,13 +214,13 @@ acpi_physical_address __init acpi_os_get return pa; } -/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */ +/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. */ static struct acpi_ioremap * acpi_map_lookup(acpi_physical_address phys, acpi_size size) { struct acpi_ioremap *map; - list_for_each_entry_rcu(map, _ioremaps, list, acpi_ioremap_lock_held()) + list_for_each_entry(map, _ioremaps, list) if (map->phys <= phys && phys + size <= map->phys + map->size) return map; @@ -228,7 +228,7 @@ acpi_map_lookup(acpi_physical_address ph return NULL; } -/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */ +/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. */ static void __iomem * acpi_map_vaddr_lookup(acpi_physical_address phys, unsigned int size) { @@ -257,13 +257,13 @@ void __iomem *acpi_os_get_iomem(acpi_phy } EXPORT_SYMBOL_GPL(acpi_os_get_iomem); -/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */ +/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. 
*/ static struct acpi_ioremap * acpi_map_lookup_virt(void __iomem *virt, acpi_size size) { struct acpi_ioremap *map; - list_for_each_entry_rcu(map, _ioremaps, list, acpi_ioremap_lock_held()) + list_for_each_entry(map, _ioremaps, list) if (map->virt <= virt && virt + size <= map->virt + map->size) return map; @@ -360,7 +360,11 @@ void __iomem __ref map->size = pg_sz; map->refcount = 1; - list_add_tail_rcu(>list, _ioremaps); + write_lock_irq(_ioremaps_list_lock); + + list_add_tail(>list, _ioremaps); + + write_unlock_irq(_ioremaps_list_lock); out: mutex_unlock(_ioremap_lock); @@ -379,14 +383,18 @@ static unsigned long acpi_os_drop_map_re { unsigned long refcount = --map->refcount; - if (!refcount) - list_del_rcu(>list); + if (!refcount) { + write_lock_irq(_ioremaps_list_lock); + + list_del(>list); + + write_unlock_irq(_ioremaps_list_lock); + } return refcount; } static void acpi_os_map_cleanup(struct acpi_ioremap *map) { - synchronize_rcu_expedited(); acpi_unmap(map->phys, map->virt); kfree(map); } @@ -704,18 +712,23 @@ acpi_status acpi_os_read_mem
Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device
On 2019-08-19 15:39, Dan Carpenter wrote:
> On Mon, Aug 19, 2019 at 01:32:54PM +0800, Zhao, Yakui wrote:
>> In fact as this driver is mainly used for embedded IoT usage, it doesn't
>> handle the complex cleanup when such error is encountered. Instead the
>> cleanup is handled in free_guest_vm.
>
> A use after free here seems like a potential security problem. Security
> matters for IoT... :(
>
> regards,
> dan carpenter

Thanks for pointing out the issue. The cleanup will be considered carefully.
Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device
On Mon, Aug 19, 2019 at 10:39:58AM +0300, Dan Carpenter wrote: > On Mon, Aug 19, 2019 at 01:32:54PM +0800, Zhao, Yakui wrote: > > In fact as this driver is mainly used for embedded IOT usage, it doesn't > > handle the complex cleanup when such error is encountered. Instead the clean > > up is handled in free_guest_vm. > > A use after free here seems like a potential security problem. Security > matters for IoT... :( Yeah, the "S" in "IoT" stands for security. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device
On Mon, Aug 19, 2019 at 01:32:54PM +0800, Zhao, Yakui wrote: > In fact as this driver is mainly used for embedded IOT usage, it doesn't > handle the complex cleanup when such error is encountered. Instead the clean > up is handled in free_guest_vm. A use after free here seems like a potential security problem. Security matters for IoT... :( regards, dan carpenter
Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device
On 2019-08-16 20:58, Dan Carpenter wrote:
> On Fri, Aug 16, 2019 at 10:25:49AM +0800, Zhao Yakui wrote:
> > +int hugepage_map_guest(struct acrn_vm *vm, struct vm_memmap *memmap)
> > +{
> > +	struct page *page = NULL, *regions_buf_pg = NULL;
> > +	unsigned long len, guest_gpa, vma;
> > +	struct vm_memory_region *region_array;
> > +	struct set_regions *regions;
> > +	int max_size = PAGE_SIZE / sizeof(struct vm_memory_region);
> > +	int ret;
> > +
> > +	if (!vm || !memmap)
> > +		return -EINVAL;
> > +
> > +	len = memmap->len;
> > +	vma = memmap->vma_base;
> > +	guest_gpa = memmap->gpa;
> > +
> > +	/* prepare set_memory_regions info */
> > +	regions_buf_pg = alloc_page(GFP_KERNEL);
> > +	if (!regions_buf_pg)
> > +		return -ENOMEM;
> > +
> > +	regions = kzalloc(sizeof(*regions), GFP_KERNEL);
> > +	if (!regions) {
> > +		__free_page(regions_buf_pg);
> > +		return -ENOMEM;
>
> It's better to do a goto err_free_regions_buf here. More comments below.
>
> > +	}
> > +	regions->mr_num = 0;
> > +	regions->vmid = vm->vmid;
> > +	regions->regions_gpa = page_to_phys(regions_buf_pg);
> > +	region_array = page_to_virt(regions_buf_pg);
> > +
> > +	while (len > 0) {
> > +		unsigned long vm0_gpa, pagesize;
> > +
> > +		ret = get_user_pages_fast(vma, 1, 1, &page);
> > +		if (unlikely(ret != 1) || (!page)) {
> > +			pr_err("failed to pin huge page!\n");
> > +			ret = -ENOMEM;
> > +			goto err;
>
> goto err is a red flag. It's better if error labels do one specific
> named thing like:
>
> err_regions:
> 	kfree(regions);
> err_free_regions_buf:
> 	__free_page(regions_buf_pg);
>
> We should unwind in the opposite/mirror order from how things were
> allocated. Then we can remove the if statements in the error handling.

Thanks for the review. Will follow your suggestion to unwind the error handling.

> In this situation, say the user triggers an -EFAULT in
> get_user_pages_fast() in the second iteration through the loop. That
> means that "page" is the non-NULL page from the previous iteration. We
> have already added it to add_guest_map(). But now we're freeing it
> without removing it from the map so probably it leads to a use after
> free. The best way to write the error handling in a loop like this is
> to clean up the partial iteration that has succeeded (nothing here),
> and then unwind all the successful iterations at the bottom of the
> function.

In theory we should clean up the previous successful iterations when an error is encountered in the current iteration, but it would be quite complex to clean up a previous iteration: call set_memory_regions with the MR_DEL op, call remove_guest_map for the added hash item, and call put_page for the page returned by get_user_pages_fast. In fact, as this driver is mainly used for embedded IoT, it doesn't handle such complex cleanup when an error is encountered. Instead the cleanup is handled in free_guest_vm.
+ } + + vm0_gpa = page_to_phys(page); + pagesize = PAGE_SIZE << compound_order(page); + + ret = add_guest_map(vm, vm0_gpa, guest_gpa, pagesize); + if (ret < 0) { + pr_err("failed to add memseg for huge page!\n"); + goto err; So then here, it would be: pr_err("failed to add memseg for huge page!\n"); put_page(page); goto unwind_loop; regards, dan carpenter + } + + /* fill each memory region into region_array */ + region_array[regions->mr_num].type = MR_ADD; + region_array[regions->mr_num].gpa = guest_gpa; + region_array[regions->mr_num].vm0_gpa = vm0_gpa; + region_array[regions->mr_num].size = pagesize; + region_array[regions->mr_num].prot = + (MEM_TYPE_WB & MEM_TYPE_MASK) | + (memmap->prot & MEM_ACCESS_RIGHT_MASK); + regions->mr_num++; + if (regions->mr_num == max_size) { + pr_debug("region buffer full, set & renew regions!\n"); + ret = set_memory_regions(regions); + if (ret < 0) { + pr_err("failed to set regions,ret=%d!\n", ret); + goto err; + } + regions->mr_num = 0; + } + + len -= pagesize; + vma += pagesize; + guest_gpa += pagesize; + } + + ret = set_memory_regions(regions); + if (ret < 0) { + pr_err("failed to set regions, ret=%d!\n", ret); + goto err; + } + + __free_page(regions_buf_pg); + kfree(regions); + + return 0; +err: + if (regions_buf_pg) +
Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device
On Fri, Aug 16, 2019 at 10:25:49AM +0800, Zhao Yakui wrote: > +int hugepage_map_guest(struct acrn_vm *vm, struct vm_memmap *memmap) > +{ > + struct page *page = NULL, *regions_buf_pg = NULL; > + unsigned long len, guest_gpa, vma; > + struct vm_memory_region *region_array; > + struct set_regions *regions; > + int max_size = PAGE_SIZE / sizeof(struct vm_memory_region); > + int ret; > + > + if (!vm || !memmap) > + return -EINVAL; > + > + len = memmap->len; > + vma = memmap->vma_base; > + guest_gpa = memmap->gpa; > + > + /* prepare set_memory_regions info */ > + regions_buf_pg = alloc_page(GFP_KERNEL); > + if (!regions_buf_pg) > + return -ENOMEM; > + > + regions = kzalloc(sizeof(*regions), GFP_KERNEL); > + if (!regions) { > + __free_page(regions_buf_pg); > + return -ENOMEM; It's better to do a goto err_free_regions_buf here. More comments below. > + } > + regions->mr_num = 0; > + regions->vmid = vm->vmid; > + regions->regions_gpa = page_to_phys(regions_buf_pg); > + region_array = page_to_virt(regions_buf_pg); > + > + while (len > 0) { > + unsigned long vm0_gpa, pagesize; > + > + ret = get_user_pages_fast(vma, 1, 1, ); > + if (unlikely(ret != 1) || (!page)) { > + pr_err("failed to pin huge page!\n"); > + ret = -ENOMEM; > + goto err; goto err is a red flag. It's better if error labels do one specific named thing like: err_regions: kfree(regions); err_free_regions_buf: __free_page(regions_buf_pg); We should unwind in the opposite/mirror order from how things were allocated. Then we can remove the if statements in the error handling. In this situation, say the user triggers an -EFAULT in get_user_pages_fast() in the second iteration through the loop. That means that "page" is the non-NULL page from the previous iteration. We have already added it to add_guest_map(). But now we're freeing it without removing it from the map so probably it leads to a use after free. The best way to write the error handling in a loop like this is to clean up the partial iteration that has succeed (nothing here), and then unwind all the successful iterations at the bottom of the function. 
"goto unwind_loop;" > + } > + > + vm0_gpa = page_to_phys(page); > + pagesize = PAGE_SIZE << compound_order(page); > + > + ret = add_guest_map(vm, vm0_gpa, guest_gpa, pagesize); > + if (ret < 0) { > + pr_err("failed to add memseg for huge page!\n"); > + goto err; So then here, it would be: pr_err("failed to add memseg for huge page!\n"); put_page(page); goto unwind_loop; regards, dan carpenter > + } > + > + /* fill each memory region into region_array */ > + region_array[regions->mr_num].type = MR_ADD; > + region_array[regions->mr_num].gpa = guest_gpa; > + region_array[regions->mr_num].vm0_gpa = vm0_gpa; > + region_array[regions->mr_num].size = pagesize; > + region_array[regions->mr_num].prot = > + (MEM_TYPE_WB & MEM_TYPE_MASK) | > + (memmap->prot & MEM_ACCESS_RIGHT_MASK); > + regions->mr_num++; > + if (regions->mr_num == max_size) { > + pr_debug("region buffer full, set & renew regions!\n"); > + ret = set_memory_regions(regions); > + if (ret < 0) { > + pr_err("failed to set regions,ret=%d!\n", ret); > + goto err; > + } > + regions->mr_num = 0; > + } > + > + len -= pagesize; > + vma += pagesize; > + guest_gpa += pagesize; > + } > + > + ret = set_memory_regions(regions); > + if (ret < 0) { > + pr_err("failed to set regions, ret=%d!\n", ret); > + goto err; > + } > + > + __free_page(regions_buf_pg); > + kfree(regions); > + > + return 0; > +err: > + if (regions_buf_pg) > + __free_page(regions_buf_pg); > + if (page) > + put_page(page); > + kfree(regions); > + return ret; > +} > +
[RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device
In order to launch the ACRN guest system, it needs to setup the mapping between GPA (guest physical address) and HPA (host physical address). This is based on memory virtualization and configured in EPT table. The ioctl related with memory management is added and then the hypercall is called so that the ACRN hypervisor can help to setup the memory mapping for ACRN guest. The 1G/2M huge page is used to optimize the EPT table for guest VM. This will simplify the memory allocation and also optimizes the TLB. For the MMIO mapping: It can support 4K/2M page. IC_SET_MEMSEG: This is used to setup the memory mapping for the memory of guest system by using hugetlb(Guest physical address and host virtual addr).It is also used to setup the device MMIO mapping for PCI device. IC_UNSET_MEMSEG: This is used to remove the device MMIO mapping for PCI device. This is used with updating the MMIO mapping together. As the acrn hypervisor is mainly used for embedded IOT device, it doesn't support the dynamica removal of memory mapping. Co-developed-by: Jason Chen CJ Signed-off-by: Jason Chen CJ Co-developed-by: Li, Fei Signed-off-by: Li, Fei Co-developed-by: Liu Shuo Signed-off-by: Liu Shuo Signed-off-by: Zhao Yakui --- drivers/staging/acrn/Makefile | 4 +- drivers/staging/acrn/acrn_dev.c | 27 +++ drivers/staging/acrn/acrn_drv_internal.h | 90 +++--- drivers/staging/acrn/acrn_mm.c| 227 drivers/staging/acrn/acrn_mm_hugetlb.c| 281 ++ drivers/staging/acrn/acrn_vm_mngt.c | 2 + include/linux/acrn/acrn_drv.h | 86 + include/uapi/linux/acrn/acrn_common_def.h | 25 +++ include/uapi/linux/acrn/acrn_ioctl_defs.h | 41 + 9 files changed, 759 insertions(+), 24 deletions(-) create mode 100644 drivers/staging/acrn/acrn_mm.c create mode 100644 drivers/staging/acrn/acrn_mm_hugetlb.c create mode 100644 include/linux/acrn/acrn_drv.h create mode 100644 include/uapi/linux/acrn/acrn_common_def.h diff --git a/drivers/staging/acrn/Makefile b/drivers/staging/acrn/Makefile index 426d6e8..ec62afe 100644 --- a/drivers/staging/acrn/Makefile +++ b/drivers/staging/acrn/Makefile @@ -1,4 +1,6 @@ obj-$(CONFIG_ACRN_HSM) := acrn.o acrn-y := acrn_dev.o \ acrn_hypercall.o \ - acrn_vm_mngt.o + acrn_vm_mngt.o \ + acrn_mm.o \ + acrn_mm_hugetlb.o diff --git a/drivers/staging/acrn/acrn_dev.c b/drivers/staging/acrn/acrn_dev.c index 7372316..cb62819 100644 --- a/drivers/staging/acrn/acrn_dev.c +++ b/drivers/staging/acrn/acrn_dev.c @@ -44,6 +44,7 @@ static int acrn_dev_open(struct inode *inodep, struct file *filep) { struct acrn_vm *vm; + int i; vm = kzalloc(sizeof(*vm), GFP_KERNEL); if (!vm) @@ -53,6 +54,10 @@ int acrn_dev_open(struct inode *inodep, struct file *filep) vm->vmid = ACRN_INVALID_VMID; vm->dev = acrn_device; + for (i = 0; i < HUGEPAGE_HLIST_ARRAY_SIZE; i++) + INIT_HLIST_HEAD(>hugepage_hlist[i]); + mutex_init(>hugepage_lock); + write_lock_bh(_vm_list_lock); vm_list_add(>list); write_unlock_bh(_vm_list_lock); @@ -212,6 +217,28 @@ long acrn_dev_ioctl(struct file *filep, return ret; } + case IC_SET_MEMSEG: { + struct vm_memmap memmap; + + if (copy_from_user(, (void *)ioctl_param, + sizeof(memmap))) + return -EFAULT; + + ret = map_guest_memseg(vm, ); + break; + } + + case IC_UNSET_MEMSEG: { + struct vm_memmap memmap; + + if (copy_from_user(, (void *)ioctl_param, + sizeof(memmap))) + return -EFAULT; + + ret = unmap_guest_memseg(vm, ); + break; + } + default: pr_warn("Unknown IOCTL 0x%x\n", ioctl_num); ret = -EFAULT; diff --git a/drivers/staging/acrn/acrn_drv_internal.h b/drivers/staging/acrn/acrn_drv_internal.h index 6758dea..5098765 100644 --- 
a/drivers/staging/acrn/acrn_drv_internal.h +++ b/drivers/staging/acrn/acrn_drv_internal.h @@ -11,6 +11,57 @@ #include #include +struct vm_memory_region { +#define MR_ADD 0 +#define MR_DEL 2 + u32 type; + + /* IN: mem attr */ + u32 prot; + + /* IN: beginning guest GPA to map */ + u64 gpa; + + /* IN: VM0's GPA which foreign gpa will be mapped to */ + u64 vm0_gpa; + + /* IN: size of the region */ + u64 size; +}; + +struct set_regions { + /*IN: vmid for this hypercall */ + u16 vmid; + + /** Reserved */ + u16 reserved[3]; + + /* IN: multi memmaps numbers */ + u32 mr_num; + + /** Reserved */ + u32 reserved1; + /* IN: +* the gpa of memmaps buffer, point to the memm
Re: [RFC KVM 01/27] kernel: Export memory-management symbols required for KVM address space isolation
> On 13 May 2019, at 18:15, Peter Zijlstra wrote:
>
> On Mon, May 13, 2019 at 04:38:09PM +0200, Alexandre Chartre wrote:
>> From: Liran Alon
>>
>> Export symbols needed to create, manage, populate and switch
>> a mm from a kernel module (kvm in this case).
>>
>> This is a hacky way for now to start.
>> This should be changed to some suitable memory-management API.
>
> This should not be exported at all, ever, end of story.
>
> Modules do not get to play with address spaces like that.

I agree… no doubt about that. This should never be merged like this. It's just meant as an initial PoC of the concept so we can:
1) Measure the performance impact of the concept.
2) Get feedback on an appropriate design and APIs from the community.

-Liran
Re: [RFC KVM 01/27] kernel: Export memory-management symbols required for KVM address space isolation
On Mon, May 13, 2019 at 04:38:09PM +0200, Alexandre Chartre wrote: > From: Liran Alon > > Export symbols needed to create, manage, populate and switch > a mm from a kernel module (kvm in this case). > > This is a hacky way for now to start. > This should be changed to some suitable memory-management API. This should not be exported at all, ever, end of story. Modules do not get to play with address spaces like that.
[RFC KVM 01/27] kernel: Export memory-management symbols required for KVM address space isolation
From: Liran Alon Export symbols needed to create, manage, populate and switch a mm from a kernel module (kvm in this case). This is a hacky way for now to start. This should be changed to some suitable memory-management API. Signed-off-by: Liran Alon Signed-off-by: Alexandre Chartre --- arch/x86/kernel/ldt.c |1 + arch/x86/mm/tlb.c |3 ++- mm/memory.c |5 + 3 files changed, 8 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index b2463fc..19a86e0 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -401,6 +401,7 @@ void destroy_context_ldt(struct mm_struct *mm) free_ldt_struct(mm->context.ldt); mm->context.ldt = NULL; } +EXPORT_SYMBOL_GPL(destroy_context_ldt); void ldt_arch_exit_mmap(struct mm_struct *mm) { diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 7f61431..a4db7f5 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -70,7 +70,7 @@ static void clear_asid_other(void) } atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1); - +EXPORT_SYMBOL_GPL(last_mm_ctx_id); static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen, u16 *new_asid, bool *need_flush) @@ -159,6 +159,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, switch_mm_irqs_off(prev, next, tsk); local_irq_restore(flags); } +EXPORT_SYMBOL_GPL(switch_mm); static void sync_current_stack_to_mm(struct mm_struct *mm) { diff --git a/mm/memory.c b/mm/memory.c index 36aac68..ede9335 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -434,6 +434,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) pte_free(mm, new); return 0; } +EXPORT_SYMBOL_GPL(__pte_alloc); int __pte_alloc_kernel(pmd_t *pmd) { @@ -453,6 +454,7 @@ int __pte_alloc_kernel(pmd_t *pmd) pte_free_kernel(_mm, new); return 0; } +EXPORT_SYMBOL_GPL(__pte_alloc_kernel); static inline void init_rss_vec(int *rss) { @@ -4007,6 +4009,7 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address) spin_unlock(>page_table_lock); return 0; } +EXPORT_SYMBOL_GPL(__p4d_alloc); #endif /* __PAGETABLE_P4D_FOLDED */ #ifndef __PAGETABLE_PUD_FOLDED @@ -4039,6 +4042,7 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address) spin_unlock(>page_table_lock); return 0; } +EXPORT_SYMBOL_GPL(__pud_alloc); #endif /* __PAGETABLE_PUD_FOLDED */ #ifndef __PAGETABLE_PMD_FOLDED @@ -4072,6 +4076,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) spin_unlock(ptl); return 0; } +EXPORT_SYMBOL_GPL(__pmd_alloc); #endif /* __PAGETABLE_PMD_FOLDED */ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, -- 1.7.1
Re: nbd, nbdkit, loopback mounts and memory management
We have a good example of a deadlock when using an XFS filesystem on an nbd loopback device, regardless of mlockall and MCL_FUTURE. Below I'll paste the hung task traces, but first I'll try to summarize the steps that caused the problem.

The setup is an nbd loopback device with an XFS filesystem that is then mounted on the local server (ubuntu 18.04 bionic, kernel 4.15). Heavy load on the server puts the kernel in a state where it has exhausted its free pages list, and new page allocation requests will first attempt to free allocated pages -- mm/vmscan.c try_to_free_pages(). XFS registers a pair of functions in the super_operations struct (nr_cached_objects and free_cached_objects) that cause try_to_free_pages() to perform a synchronous write of dirty inodes to the block device -- fs/xfs/xfs_icache.c xfs_reclaim_inodes_nr().

A resource loop deadlock can occur if the nbd userland process needs to allocate memory to make forward progress and the allocation causes the kernel to call try_to_free_pages(), which then leads to XFS issuing a blocking synchronous write of a dirty inode page destined back to the nbd loopback device. A more common deadlock we see (traces below) is when another process on the system is blocked in try_to_free_pages() waiting for the nbd device to perform the synchronous write, and then the nbd userland process also triggers try_to_free_pages() and blocks on a mutex waiting for the first process to complete its synchronous write.

From a kernel-implemented device driver, which pages should not be freed during a page allocation can be controlled using the GFP_ flags (GFP_NOIO, GFP_NOFS). From userland (as far as I can tell) we do not have a way to control which pages the kernel is allowed to free in try_to_free_pages().

Below are some traces where we have hit this deadlock.

0) A process on the server needs to allocate a page and triggers the kernel to attempt to free pages, and XFS issues a synchronous write to the nbd loopback device;

1) within the nbd loopback userland process, a memory allocation request triggers a malloc arena pool to expand its heap and call mprotect, which causes the kernel to try to allocate pages and call try_to_free_pages(), and proceed down the XFS reclaim-inodes path, where it blocks trying to acquire the mutex held by the process from trace 0, which is in turn waiting for us to make progress;

2) an example of another thread within the nbd userland process hitting the same problem when attempting to write to a tcp socket, also blocking on the XFS reclaim-inodes mutex.

So far the problem has been avoided by using ext4 instead of XFS, as ext4 does not provide the super_operations struct functions (nr_cached_objects, free_cached_objects) and does not issue writes in the try_to_free_pages() code path. Ideally, the nbd userland process could set a process flag to tell the kernel to use GFP_NOIO or GFP_NOFS for its allocations to avoid this resource deadlock.

Kernel hung tasks

0) A process somewhere on the server has triggered a syscall that causes the kernel to attempt to free pages and eventually call xfs_reclaim_inodes, acquire the xfs inode reclaim mutex, and wait for a write to the nbd loopback device:

Mar 06 04:54:14 ip-172-16-9-203 kernel: INFO: task glfs_epoll000:4535 blocked for more than 30 seconds.
Mar 06 04:54:14 ip-172-16-9-203 kernel: Not tainted 4.15.0-1032-aws #34-Ubuntu
Mar 06 04:54:14 ip-172-16-9-203 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 06 04:54:14 ip-172-16-9-203 kernel: glfs_epoll000   D    0  4535      1 0x00000000
Mar 06 04:54:14 ip-172-16-9-203 kernel: Call Trace:
Mar 06 04:54:14 ip-172-16-9-203 kernel:  __schedule+0x291/0x8a0
Mar 06 04:54:14 ip-172-16-9-203 kernel:  schedule+0x2c/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel:  schedule_timeout+0x1cf/0x350
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? _xfs_buf_ioapply+0x396/0x4e0 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? radix_tree_gang_lookup+0xc6/0x110
Mar 06 04:54:14 ip-172-16-9-203 kernel:  wait_for_completion+0xba/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? wake_up_q+0x80/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? xfs_bwrite+0x24/0x60 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_buf_submit_wait+0x81/0x210 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_bwrite+0x24/0x60 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_reclaim_inode+0x31d/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_reclaim_inodes_ag+0x1e6/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? radix_tree_gang_lookup_tag+0xd9/0x160
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? __list_lru_walk_one.isra.5+0x37/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? iput+0x220/0x220
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  super_cache_scan+0x165/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel:
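A note on the in-kernel mechanism Shaun contrasts with userland above: device drivers avoid exactly this recursion with scoped GFP_NOIO allocations. The sketch below is illustrative only (driver_alloc_buffer() is a made-up name), showing the knob that has no userland equivalent:

/*
 * Sketch only: in-kernel GFP_NOIO scoping. While the scope is active,
 * every allocation made by this task behaves as GFP_NOIO, so direct
 * reclaim may still free clean pages but will not start new I/O --
 * and therefore cannot recurse into the device the driver services.
 */
#include <linux/sched/mm.h>
#include <linux/slab.h>

static void *driver_alloc_buffer(size_t size)
{
	unsigned int noio_flags;
	void *buf;

	noio_flags = memalloc_noio_save();
	buf = kmalloc(size, GFP_KERNEL);	/* implicitly GFP_NOIO here */
	memalloc_noio_restore(noio_flags);

	return buf;
}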
Re: nbd, nbdkit, loopback mounts and memory management
On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
> But Shaun reported it happened somehow often for them, so he might
> have a practical test case... better than my theories :-).

Yes, certainly not saying this isn't a problem. I think the good news
is that the fix seems quite easy, i.e. to add mlockall and adjust the
OOM killer score, as is done currently in the client:

https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219

For now I have added a note in the TODO file to follow up in case we
get a test case or reports of a problem:

https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages. http://libguestfs.org
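For reference, the mitigation those two nbd-client links implement boils down to something like the following. This is a simplified sketch, not the nbd-client code verbatim; error handling is trimmed and details in the real client differ:

/*
 * Simplified sketch of the mlockall + OOM-score mitigation referenced
 * above: lock the daemon into RAM and make the OOM killer prefer
 * other victims.
 */
#include <stdio.h>
#include <sys/mman.h>

static void harden_io_daemon(void)
{
	/* Pin current and future allocations so servicing swap or
	 * loopback I/O never has to page the daemon itself back in. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
		perror("mlockall");

	/* -1000 (OOM_SCORE_ADJ_MIN) makes the task effectively
	 * exempt from the OOM killer. */
	FILE *f = fopen("/proc/self/oom_score_adj", "w");
	if (f) {
		fprintf(f, "%d\n", -1000);
		fclose(f);
	}
}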
Re: nbd, nbdkit, loopback mounts and memory management
Hi!

> So not to dispute that this could be a bug, but I couldn't cause a
> deadlock. I wonder if you can see something wrong with my method?
>
> *** Set up ***
>
> - kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
> - nbd-client 3.19-1.fc30
> - nbdkit 1.11.5 (git commit ef9d1978ce28)
>
> Baremetal machine was booted with mem=2G to artificially limit the
> RAM. The machine has 64G of swap.
>
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:           1806         329        1383           0          93        1357
> Swap:         65535         179       65356
>
> *** Method ***
>
> I started nbdkit as a 4G RAM disk:
>
>   ./nbdkit memory size=4G
>
> This is implemented as a sparse array with a 2 level page table, and
> should allocate (using malloc) every time a new area of the disk is
> written to. Exact implementation is here:
> https://github.com/libguestfs/nbdkit/tree/master/common/sparse
>
> I started nbd-client using the -swap option which uses
> mlockall(MCL_FUTURE) to lock the client into RAM:
>
>   nbd-client -b 512 -swap localhost /dev/nbd0
>
> I then created a filesystem on the RAM disk, mounted it, and copied a
> 3G file into it. I tried this various ways, but the variation I was
> eventually happy with was:
>
>   mke2fs /dev/nbd0
>   mount /dev/nbd0 /tmp/mnt
>
>   dd if=/dev/zero of=/tmp/big bs=1M count=3000
>   cp /tmp/big /tmp/mnt/big
>
> I couldn't get any kind of deadlock or failure in this test.
>
> (Note that if you repeat the test several times, in theory you could
> delete the file and fstrim the filesystem, but when I was testing it
> to be sure I unmounted everything and killed and restarted nbdkit
> between each test.)

This looks like quite a good try. I'd try to use mmap() to dirty memory
very quickly. But Shaun reported it happened somehow often for them, so
he might have a practical test case... better than my theories :-).

Best regards,

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
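Pavel's mmap() suggestion would look something like this. An untested sketch: the mount point and file name are placeholders matching Rich's /tmp/mnt setup, and the sizes are arbitrary:

/*
 * Untested sketch of the "dirty memory very quickly" idea: mmap a
 * large file on the nbd-backed filesystem and touch every page, so
 * dirty pagecache builds up far faster than dd through write(2).
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1UL << 30;	/* 1 GiB, near the mem=2G limit */
	int fd = open("/tmp/mnt/dirty.bin", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, len) != 0)
		return 1;

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Each store dirties one pagecache page destined for the nbd
	 * device; no syscall per page, so dirtying is very fast. */
	for (size_t off = 0; off < len; off += 4096)
		p[off] = 0xff;

	return 0;
}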
Re: nbd, nbdkit, loopback mounts and memory management
So not to dispute that this could be a bug, but I couldn't cause a
deadlock. I wonder if you can see something wrong with my method?

*** Set up ***

- kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
- nbd-client 3.19-1.fc30
- nbdkit 1.11.5 (git commit ef9d1978ce28)

Baremetal machine was booted with mem=2G to artificially limit the
RAM. The machine has 64G of swap.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           1806         329        1383           0          93        1357
Swap:         65535         179       65356

*** Method ***

I started nbdkit as a 4G RAM disk:

  ./nbdkit memory size=4G

This is implemented as a sparse array with a 2 level page table, and
should allocate (using malloc) every time a new area of the disk is
written to. Exact implementation is here:
https://github.com/libguestfs/nbdkit/tree/master/common/sparse

I started nbd-client using the -swap option which uses
mlockall(MCL_FUTURE) to lock the client into RAM:

  nbd-client -b 512 -swap localhost /dev/nbd0

I then created a filesystem on the RAM disk, mounted it, and copied a
3G file into it. I tried this various ways, but the variation I was
eventually happy with was:

  mke2fs /dev/nbd0
  mount /dev/nbd0 /tmp/mnt

  dd if=/dev/zero of=/tmp/big bs=1M count=3000
  cp /tmp/big /tmp/mnt/big

I couldn't get any kind of deadlock or failure in this test.

(Note that if you repeat the test several times, in theory you could
delete the file and fstrim the filesystem, but when I was testing it
to be sure I unmounted everything and killed and restarted nbdkit
between each test.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
Re: nbd, nbdkit, loopback mounts and memory management
Hi,

On Fri, Feb 15, 2019 at 10:53:32PM +0000, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 10:41:26PM +0000, Richard W.M. Jones wrote:
> > On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > > Hi!
> > >
> > > I watched fosdem talk about
> > > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > > of warning: I'm not sure using it read-write on localhost is safe.
> > >
> > > In particular, user application could create a lot of dirty data
> > > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > > nbd-server), you might get a deadlock.
> >
> > Thanks for the kind words about the talk. I've added Wouter Verhelst
> > & the NBD mailing list to CC. Although I did the talk because the
> > subject is interesting, how I actually use nbdkit / NBD is to talk to
> > qemu and that's where I have most experience and where we (Red Hat)
> > use it in production systems.
> >
> > However in January I spent a lot of time exercising the NBD loop-mount
> > + nbdkit case using fio in order to find contention / bottlenecks in
> > our use of threads and locks. I didn't notice any particular problems
> > then, but it's possible my testing wasn't thorough enough. Or that
> > fio only creates small numbers of dirty pages (because of locality in
> > its access patterns I guess?)
> >
> > When you say it's not safe, what could happen? What would we observe
> > if it was going wrong?
>
> Reading more carefully I see you said we'd observe a deadlock. I
> didn't see that, but again my testing of this wouldn't have been very
> thorough. When I have some time I'll try creating / spooling huge
> files into an NBD loop mount to see if I can cause a deadlock.

While it's of course impossible to fully exclude the possibility of
deadlock when clearing dirty pages to the network, since Mel Gorman's
work that resulted in commit 7f338fe4540b1d0600b02314c7d885fd358e9eca
this should be extremely unlikely, and swapping over the network (NBD
or NFS or whatnot) should be reasonably safe, as should clearing dirty
pages.

Additionally, nbd-client when called with -s calls mlockall() at an
appropriate moment, so that it should not be swapped out. That only
leaves the server side.

Personally I haven't been able to deadlock a reasonably recent machine
using NBD, but of course YMMV.

> > > Also note that nbd.txt in Documentation/blockdev/ points to
> > > sourceforge; it should probably point to
> > > https://github.com/NetworkBlockDevice/nbd ?
> >
> > Wouter should be able to say what the correct link should be.

The sourceforge project is still active, and is where I do the official
file releases. I also push the git repository there. For people who
just want a released version of the NBD utilities, pointing to
sourceforge isn't wrong, I would say. GitHub is indeed used mostly for
development, though.

It might be nice to rethink all that, now that we don't have a
mailinglist running at sourceforge anymore, but I don't think it's very
urgent.

-- 
To the thief who stole my anti-depressants: I hope you're happy

  -- seen somewhere on the Internet on a photo of a billboard
Re: nbd, nbdkit, loopback mounts and memory management
On Fri 2019-02-15 22:41:26, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> >
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
>
> Thanks for the kind words about the talk. I've added Wouter Verhelst
> & the NBD mailing list to CC. Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
>
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks. I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough. Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
>
> When you say it's not safe, what could happen? What would we observe
> if it was going wrong?

I'm not saying I've seen it happen, or have a test. But my
understanding of memory management says it could deadlock... if nbd
tried allocating memory while memory was "full" of dirty data.

Dunno, something like... take a 1GB block device on a 1GB RAM machine.
Create memory pressure so that nbdkit (etc.) is dropped from memory,
then quickly make all the data on the block device dirty. I believe
that scenario is something that can not happen on a system without NBD
in a loopback configuration.

The situation may be made worse if nbdkit needs to allocate memory for
compression buffers or something like that.

Best regards,

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: nbd, nbdkit, loopback mounts and memory management
On Fri, Feb 15, 2019 at 10:41:26PM +0000, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> >
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> >
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
>
> Thanks for the kind words about the talk. I've added Wouter Verhelst
> & the NBD mailing list to CC. Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
>
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks. I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough. Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
>
> When you say it's not safe, what could happen? What would we observe
> if it was going wrong?

Reading more carefully I see you said we'd observe a deadlock. I
didn't see that, but again my testing of this wouldn't have been very
thorough. When I have some time I'll try creating / spooling huge
files into an NBD loop mount to see if I can cause a deadlock.

Thanks,

Rich.

> > Also note that nbd.txt in Documentation/blockdev/ points to
> > sourceforge; it should probably point to
> > https://github.com/NetworkBlockDevice/nbd ?
>
> Wouter should be able to say what the correct link should be.
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines. Supports shell scripting,
> bindings from many languages. http://libguestfs.org

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Re: nbd, nbdkit, loopback mounts and memory management
On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote: > Hi! > > I watched fosdem talk about > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word > of warning: I'm not sure using it read-write on localhost is safe. > > In particular, user application could create a lot of dirty data > quickly. If there's not enough memory for nbdkit (or nbd-client or > nbd-server), you might get a deadlock. Thanks for the kind words about the talk. I've added Wouter Verhelst & the NBD mailing list to CC. Although I did the talk because the subject is interesting, how I actually use nbdkit / NBD is to talk to qemu and that's where I have most experience and where we (Red Hat) use it in production systems. However in January I spent a lot of time exercising the NBD loop-mount + nbdkit case using fio in order to find contention / bottlenecks in our use of threads and locks. I didn't notice any particular problems then, but it's possible my testing wasn't thorough enough. Or that fio only creates small numbers of dirty pages (because of locality in its access patterns I guess?) When you say it's not safe, what could happen? What would we observe if it was going wrong? > Also note that nbd.txt in Documentation/blockdev/ points to > sourceforge; it should probably point to > https://github.com/NetworkBlockDevice/nbd ? Wouter should be able to say what the correct link should be. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
nbd, nbdkit, loopback mounts and memory management
Hi!

I watched fosdem talk about
nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
of warning: I'm not sure using it read-write on localhost is safe.

In particular, user application could create a lot of dirty data
quickly. If there's not enough memory for nbdkit (or nbd-client or
nbd-server), you might get a deadlock.

Also note that nbd.txt in Documentation/blockdev/ points to
sourceforge; it should probably point to
https://github.com/NetworkBlockDevice/nbd ?

Best regards,

Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: Memory management issue in 4.18.15
On Mon, Oct 22, 2018 at 1:01 PM Michal Hocko wrote: > > On Mon 22-10-18 15:08:22, Roman Gushchin wrote: > [...] > > RE backporting: I'm slightly surprised that only one patch of the memcg > > reclaim fix series has been backported. Either all or none makes much more > > sense to me. > > Yeah, I think this is AUTOSEL trying to be clever again. I though it has > been agreed that MM is quite good at marking patches for stable and so > it was not considered by the machinery. Sasha? I've talked about it briefly with Andrew, and he suggested that I'll send him the list of AUTOSEL commits separately to avoid the noise, so we'll try that and see what happens. -- Thanks. Sasha
Re: Memory management issue in 4.18.15
> On Sat 20-10-18 14:41:40, Spock wrote:
> > Hello,
> >
> > I have a workload, which creates lots of cache pages. Before 4.18.15,
> > the behavior was very stable: pagecache is constantly growing until it
> > consumes all the free memory, and then kswapd is balancing it around
> > low watermark. After 4.18.15, once in a while khugepaged is waking up
> > and reclaims almost all the pages from pagecache, so there is always
> > around 2G of 8G unused. THP is enabled only for madvise case and are
> > not used.

Spock, can you, please, check if the following patch solves the problem
for you?

Thank you!

--

diff --git a/fs/inode.c b/fs/inode.c
index 73432e64f874..63aca301a8bc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -731,7 +731,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	}
 
 	/* recently referenced inodes get one more pass */
-	if (inode->i_state & I_REFERENCED) {
+	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;
Re: Memory management issue in 4.18.15
On Mon, Oct 22, 2018 at 10:33:22AM +0200, Michal Hocko wrote:
> Cc som more people.
>
> I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
> relatively small number of objects") has been backported to the stable
> tree when not marked that way. Put that aside it seems likely that the
> upstream kernel will have the same issue I suspect. Roman, could you
> have a look please?

So, the problem is probably caused by the unused inode eviction code:
inode_lru_isolate() invalidates all pages belonging to an unreferenced
clean inode at once, even if the goal was to scan (and potentially
free) just one inode (or any other slab object).

Spock's workload, as described, has a few large files in the pagecache,
so this becomes noticeable: a small amount of pressure applied to the
inode cache surprisingly results in cleaning up a significant
percentage of memory. It happened before my change too, but was
probably less noticeable because it usually required higher memory
pressure to trigger, so an overly aggressive reclaim was less
unexpected.

How to fix this? It seems to me that we shouldn't try invalidating
pagecache pages from the inode reclaim path at all (maybe except for
inodes with only a few pages). If an inode has a lot of attached
pagecache, let it be evicted "naturally", through the file LRU lists.
But I need to perform some real-life testing on how this will work.

Thanks!
Re: Memory management issue in 4.18.15
On Mon 22-10-18 15:08:22, Roman Gushchin wrote:
[...]
> RE backporting: I'm slightly surprised that only one patch of the memcg
> reclaim fix series has been backported. Either all or none makes much more
> sense to me.

Yeah, I think this is AUTOSEL trying to be clever again. I thought it
had been agreed that MM is quite good at marking patches for stable and
so it was not considered by the machinery. Sasha?

-- 
Michal Hocko
SUSE Labs
Re: Memory management issue in 4.18.15
On Mon, Oct 22, 2018 at 10:33:22AM +0200, Michal Hocko wrote:
> Cc som more people.
>
> I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
> relatively small number of objects") has been backported to the stable
> tree when not marked that way. Put that aside it seems likely that the
> upstream kernel will have the same issue I suspect. Roman, could you
> have a look please?

Sure, already looking... Spock provided some useful details, and I
think I know what's happening... Hope to propose a solution soon.

RE backporting: I'm slightly surprised that only one patch of the memcg
reclaim fix series has been backported. Either all or none makes much
more sense to me.

Thanks!
Re: Memory management issue in 4.18.15
Cc some more people.

I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
relatively small number of objects") has been backported to the stable
tree when not marked that way. That aside, it seems likely that the
upstream kernel will have the same issue, I suspect. Roman, could you
have a look please?

On Sat 20-10-18 14:41:40, Spock wrote:
> Hello,
>
> I have a workload, which creates lots of cache pages. Before 4.18.15,
> the behavior was very stable: pagecache is constantly growing until it
> consumes all the free memory, and then kswapd is balancing it around
> low watermark. After 4.18.15, once in a while khugepaged is waking up
> and reclaims almost all the pages from pagecache, so there is always
> around 2G of 8G unused. THP is enabled only for madvise case and are
> not used.
>
> The exact change that leads to current behavior is
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y&id=62aad93f09c1952ede86405894df1b22012fd5ab

-- 
Michal Hocko
SUSE Labs
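Since the thread keeps referring to what 172b06c32b94 changed, here is a small self-contained program re-expressing its effect. This is not the kernel source (the shrinker code in mm/vmscan.c differs in detail); it only shows why small slab caches went from "never scanned" to "always scanned a little":

/*
 * Re-expression of the behavioral change in 172b06c32b94: before,
 * the per-shrinker scan count (delta) could round down to 0 for
 * caches with few freeable objects, so they were never shrunk; the
 * commit enforces a floor of min(freeable, batch), which is what
 * exposed the inode -> pagecache interaction discussed here.
 */
#include <stdio.h>

static unsigned long long scan_delta(unsigned long long freeable,
				     int priority, int seeks,
				     unsigned long long batch,
				     int apply_floor)
{
	unsigned long long delta = (freeable >> priority) * 4 / seeks;

	if (apply_floor) {
		unsigned long long floor = freeable < batch ? freeable : batch;
		if (delta < floor)
			delta = floor;
	}
	return delta;
}

int main(void)
{
	/* A small cache: 100 freeable objects, default priority 12. */
	printf("pre-172b06c32b94:  scan %llu objects\n",
	       scan_delta(100, 12, 2, 128, 0));	/* 0  -> never shrunk */
	printf("post-172b06c32b94: scan %llu objects\n",
	       scan_delta(100, 12, 2, 128, 1));	/* 100 -> minimal pressure */
	return 0;
}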
Re: Memory management issue in 4.18.15
On Sat, Oct 20, 2018 at 08:37:28AM -0700, Randy Dunlap wrote:
> [add linux-mm mailing list + people]
>
> On 10/20/18 4:41 AM, Spock wrote:
> > Hello,
> >
> > I have a workload, which creates lots of cache pages. Before 4.18.15,
> > the behavior was very stable: pagecache is constantly growing until it
> > consumes all the free memory, and then kswapd is balancing it around
> > low watermark. After 4.18.15, once in a while khugepaged is waking up
> > and reclaims almost all the pages from pagecache, so there is always
> > around 2G of 8G unused. THP is enabled only for madvise case and are
> > not used.
> >
> > The exact change that leads to current behavior is
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y&id=62aad93f09c1952ede86405894df1b22012fd5ab

Hello!

Can you, please, describe your workload in more detail? Do you use
memory cgroups? How many of them? What's the ratio between slabs and
pagecache in the affected cgroup? Is the pagecache mmapped by some
process? Is the majority of the pagecache created by a few cached
files, or is the number of files big?

This is definitely a strange effect. The change shouldn't affect
pagecache reclaim directly, so the only possibility I see is that
because we started applying some minimal pressure on slabs, we also
started reclaiming some internal fs structures under background memory
pressure, which leads to a more aggressive pagecache reclaim.

Thanks!
Re: Memory management issue in 4.18.15
[add linux-mm mailing list + people]

On 10/20/18 4:41 AM, Spock wrote:
> Hello,
>
> I have a workload, which creates lots of cache pages. Before 4.18.15,
> the behavior was very stable: pagecache is constantly growing until it
> consumes all the free memory, and then kswapd is balancing it around
> low watermark. After 4.18.15, once in a while khugepaged is waking up
> and reclaims almost all the pages from pagecache, so there is always
> around 2G of 8G unused. THP is enabled only for madvise case and are
> not used.
>
> The exact change that leads to current behavior is
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y&id=62aad93f09c1952ede86405894df1b22012fd5ab

-- 
~Randy
Memory management issue in 4.18.15
Hello,

I have a workload which creates lots of cache pages. Before 4.18.15,
the behavior was very stable: pagecache is constantly growing until it
consumes all the free memory, and then kswapd is balancing it around
the low watermark. After 4.18.15, once in a while khugepaged wakes up
and reclaims almost all the pages from the pagecache, so there is
always around 2G of 8G unused. THP is enabled only for the madvise case
and is not used.

The exact change that leads to the current behavior is
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y&id=62aad93f09c1952ede86405894df1b22012fd5ab
[PATCH 10/10] kconfig: add a "Memory Management options" menu
This moves all the options under a proper menu. Based on a patch from
Randy Dunlap.

Signed-off-by: Christoph Hellwig
Tested-by: Randy Dunlap
Acked-by: Randy Dunlap
---
 mm/Kconfig | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ce95491abd6a..9ae1b6a8e30f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1,3 +1,6 @@
+
+menu "Memory Management options"
+
 config SELECT_MEMORY_MODEL
 	def_bool y
 	depends on ARCH_SELECT_MEMORY_MODEL
@@ -762,3 +765,5 @@ config GUP_BENCHMARK
 
 config ARCH_HAS_PTE_SPECIAL
 	bool
+
+endmenu
-- 
2.18.0
Re: [PATCH 5/7] docs/core-api: split memory management API to a separate file
On Wed, Jul 25, 2018 at 05:05:00AM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 02:26:08PM +0300, Mike Rapoport wrote:
> > +User Space Memory Access
> > +========================
> > +
> > +.. kernel-doc:: arch/x86/include/asm/uaccess.h
> > +   :internal:
> > +
> > +.. kernel-doc:: arch/x86/lib/usercopy_32.c
> > +   :export:
> > +
> > +The Slab Cache
> > +==============
> > +
> > +.. kernel-doc:: include/linux/slab.h
> > +   :internal:
> > +
> > +.. kernel-doc:: mm/slab.c
> > +   :export:
> > +
> > +.. kernel-doc:: mm/util.c
> > +   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast
>
> get_user_pages_fast would fit better in the previous 'User Space Memory
> Access' section.

Yeah, it's a somewhat "backward compatible" version :) Will fix.

-- 
Sincerely yours,
Mike.
Re: [PATCH 5/7] docs/core-api: split memory management API to a separate file
On Wed, Jul 25, 2018 at 02:26:08PM +0300, Mike Rapoport wrote:
> +User Space Memory Access
> +========================
> +
> +.. kernel-doc:: arch/x86/include/asm/uaccess.h
> +   :internal:
> +
> +.. kernel-doc:: arch/x86/lib/usercopy_32.c
> +   :export:
> +
> +The Slab Cache
> +==============
> +
> +.. kernel-doc:: include/linux/slab.h
> +   :internal:
> +
> +.. kernel-doc:: mm/slab.c
> +   :export:
> +
> +.. kernel-doc:: mm/util.c
> +   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast

get_user_pages_fast would fit better in the previous 'User Space Memory
Access' section.
[PATCH 5/7] docs/core-api: split memory management API to a separate file
Signed-off-by: Mike Rapoport
---
 Documentation/core-api/index.rst      |  1 +
 Documentation/core-api/kernel-api.rst | 54 ---
 Documentation/core-api/mm-api.rst     | 54 +++
 3 files changed, 55 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/core-api/mm-api.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index f5a66b7..b580204 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -27,6 +27,7 @@ Core utilities
    errseq
    printk-formats
    circular-buffers
+   mm-api
    gfp_mask-from-fs-io
 
 Interfaces for kernel debugging
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 39f1460..3431337 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -159,60 +159,6 @@ UUID/GUID
 .. kernel-doc:: lib/uuid.c
    :export:
 
-Memory Management in Linux
-==========================
-
-The Slab Cache
---------------
-
-.. kernel-doc:: include/linux/slab.h
-   :internal:
-
-.. kernel-doc:: mm/slab.c
-   :export:
-
-.. kernel-doc:: mm/util.c
-   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast
-
-User Space Memory Access
-------------------------
-
-.. kernel-doc:: arch/x86/include/asm/uaccess.h
-   :internal:
-
-.. kernel-doc:: arch/x86/lib/usercopy_32.c
-   :export:
-
-More Memory Management Functions
---------------------------------
-
-.. kernel-doc:: mm/readahead.c
-   :export:
-
-.. kernel-doc:: mm/filemap.c
-   :export:
-
-.. kernel-doc:: mm/memory.c
-   :export:
-
-.. kernel-doc:: mm/vmalloc.c
-   :export:
-
-.. kernel-doc:: mm/page_alloc.c
-   :internal:
-
-.. kernel-doc:: mm/mempool.c
-   :export:
-
-.. kernel-doc:: mm/dmapool.c
-   :export:
-
-.. kernel-doc:: mm/page-writeback.c
-   :export:
-
-.. kernel-doc:: mm/truncate.c
-   :export:
-
 Kernel IPC facilities
 =====================
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
new file mode 100644
index 0000000..65a8ef09
--- /dev/null
+++ b/Documentation/core-api/mm-api.rst
@@ -0,0 +1,54 @@
+======================
+Memory Management APIs
+======================
+
+User Space Memory Access
+========================
+
+.. kernel-doc:: arch/x86/include/asm/uaccess.h
+   :internal:
+
+.. kernel-doc:: arch/x86/lib/usercopy_32.c
+   :export:
+
+The Slab Cache
+==============
+
+.. kernel-doc:: include/linux/slab.h
+   :internal:
+
+.. kernel-doc:: mm/slab.c
+   :export:
+
+.. kernel-doc:: mm/util.c
+   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast
+
+More Memory Management Functions
+================================
+
+.. kernel-doc:: mm/readahead.c
+   :export:
+
+.. kernel-doc:: mm/filemap.c
+   :export:
+
+.. kernel-doc:: mm/memory.c
+   :export:
+
+.. kernel-doc:: mm/vmalloc.c
+   :export:
+
+.. kernel-doc:: mm/page_alloc.c
+   :internal:
+
+.. kernel-doc:: mm/mempool.c
+   :export:
+
+.. kernel-doc:: mm/dmapool.c
+   :export:
+
+.. kernel-doc:: mm/page-writeback.c
+   :export:
+
+.. kernel-doc:: mm/truncate.c
+   :export:
-- 
2.7.4
[PATCH 0/7] memory management documentation updates
Hi,

Here are several updates to the mm documentation. Aside from really
minor changes in the first three patches, the updates are:

* move the documentation of kstrdup and friends to the "String
  Manipulation" section
* split the memory management API into a separate .rst file
* adjust formatting of the GFP flags description and include it in the
  reference documentation.

Mike Rapoport (7):
  mm/util: make strndup_user description a kernel-doc comment
  mm/util: add kernel-doc for kvfree
  docs/core-api: kill trailing whitespace in kernel-api.rst
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: split memory management API to a separate file
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: mm-api: add section about GFP flags

 Documentation/core-api/index.rst      |   1 +
 Documentation/core-api/kernel-api.rst |  59 +--
 Documentation/core-api/mm-api.rst     |  76 +
 include/linux/gfp.h                   | 291 ++
 mm/util.c                             |   9 +-
 5 files changed, 243 insertions(+), 193 deletions(-)
 create mode 100644 Documentation/core-api/mm-api.rst

-- 
2.7.4