Re: [RFC PATCH v4 0/5] DAMON based tiered memory management for CXL memory

2024-05-15 Thread SeongJae Park
Hi Honggyu,

On Mon, 13 May 2024 20:59:15 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> Thanks very much for your work!  It got delayed due to the priority
> changes in my workplace for building another heterogeneous memory
> allocator.
> https://github.com/skhynix/hmsdk/wiki/hmalloc

No problem at all.  We all work on our own schedule and nobody can chase/push
anybody :)

> 
> On Sun, 12 May 2024 10:54:42 -0700 SeongJae Park  wrote:
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> > 
> > It says there is no implementation of the demote/promote DAMOS action
> > are made.  This RFC is about its implementation for physical address
> > space.
> > 
> > Changes from RFC v3
> > (https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com):
> 
> This link cannot be opened.  I will share the link again here.
> https://lore.kernel.org/all/20240405060858.2818-1-honggyu@sk.com

Thank you for checking the link!  It's weird though, since I can open the link
on my Chrome browser.

> 
> >   0. updated from v3 and posted by SJ on behalf of Honggyu under his
> >  approval.
> >   1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
> >   2. Drop vmstat change
> 
> I haven't checked whether I can collect useful information without
> vmstat, but the changes look good in general except for that.

I was thinking you could use DAMOS stat[1] for the schemes and was assuming no
reply to it meant agreement, but maybe I should have made it clear.  Do you think
DAMOS stat cannot be used instead?  If so, what would be the limitation of
DAMOS stat for your usage?
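(For reference, reading the DAMOS stat of a scheme through the sysfs interface
could look roughly like the sketch below.  The stat file names should be
current, but the exact paths and the kdamond/context/scheme indexes are
assumptions for illustration, not something defined by this patchset.)

  # Minimal sketch: read per-scheme DAMOS stats from the DAMON sysfs interface.
  # The kdamond/context/scheme indexes (all 0 here) are assumptions.
  import os

  STATS_DIR = "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/stats"

  def read_damos_stats(stats_dir=STATS_DIR):
      stats = {}
      for name in ("nr_tried", "sz_tried", "nr_applied", "sz_applied", "qt_exceeds"):
          with open(os.path.join(stats_dir, name)) as f:
              stats[name] = int(f.read())
      return stats

  if __name__ == "__main__":
      # For a MIGRATE_{HOT,COLD} scheme, sz_applied approximates migrated bytes.
      print(read_damos_stats())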

> 
> >   3. Drop unnecessary page reference check
> 
> I will compare this patch series with my previous v3 patchset and get
> back to you later maybe next week.

Thank you very much!  Unless I can get a good enough test setup and results on
my own or with others' help, your test result would be the last requirement for
dropping the RFC tag from this patchset.

> Sorry, I will have another break this week.

No problem, I hope you have a nice break.  Nobody can chase/push others.  We
all do this work voluntarily for our own fun and profit, right? ;)


[1] https://lore.kernel.org/damon/20240405060858.2818-1-honggyu@sk.com


Thanks,
SJ

> 
> Thanks,
> Honggyu



Re: [RFC PATCH v4 0/5] DAMON based tiered memory management for CXL memory

2024-05-13 Thread Honggyu Kim
Hi SeongJae,

Thanks very much for your work!  It got delayed due to the priority
changes in my workplace for building another heterogeneous memory
allocator.
https://github.com/skhynix/hmsdk/wiki/hmalloc

On Sun, 12 May 2024 10:54:42 -0700 SeongJae Park  wrote:
> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
> It says no implementation of the demote/promote DAMOS actions has been
> made yet.  This RFC is about their implementation for the physical address
> space.
> 
> Changes from RFC v3
> (https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com):

This link cannot be opened.  I will share the link again here.
https://lore.kernel.org/all/20240405060858.2818-1-honggyu@sk.com

>   0. updated from v3 and posted by SJ on behalf of Honggyu under his
>  approval.
>   1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
>   2. Drop vmstat change

I haven't checked whether I can collect useful information without
vmstat, but the changes look good in general except for that.

>   3. Drop unnecessary page reference check

I will compare this patch series with my previous v3 patchset and get
back to you later maybe next week.  Sorry, I will have another break
this week.

Thanks,
Honggyu



[RFC PATCH v4 0/5] DAMON based tiered memory management for CXL memory

2024-05-12 Thread SeongJae Park
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It says no implementation of the demote/promote DAMOS actions has been
made yet.  This RFC is about their implementation for the physical address
space.

Changes from RFC v3
(https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com):
  0. updated from v3 and posted by SJ on behalf of Honggyu under his
 approval.
  1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
  2. Drop vmstat change
  3. Drop unnecessary page reference check

Changes from RFC v2:
  1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
  2. Create 'target_nid' to set the migration target node instead of
 depending on node distance based information.
  3. Instead of having page level access check in this patch series,
 delegate the job to a new DAMOS filter type YOUNG[2].
  4. Introduce vmstat counters "damon_migrate_{hot,cold}".
  5. Rebase from v6.7 to v6.8.

Changes from RFC:
  1. Move most of implementation from mm/vmscan.c to mm/damon/paddr.c.
  2. Simplify some functions of vmscan.c and use them in paddr.c, but they
 need to be reviewed in more depth.
  3. Refactor most functions for common usage for both promote and
 demote actions and introduce an enum migration_mode for its control.
  4. Add "target_nid" sysfs knob for migration destination node for both
 promote and demote actions.
  5. Move DAMOS_PROMOTE before DAMOS_DEMOTE and move them even above
 DAMOS_STAT.

Introduction
============

With the advent of CXL/PCIe attached DRAM, which will be referred to
simply as CXL memory in this cover letter, some systems are becoming
more heterogeneous, having memory subsystems with different latency and
bandwidth characteristics.  They are usually handled as different NUMA
nodes in separate memory tiers, and CXL memory is used as a slow tier
because of its protocol overhead compared to local DRAM.

In these kinds of systems, we need to be careful to place memory pages
on the proper NUMA nodes based on memory access frequency.  Otherwise,
some frequently accessed pages might reside on slow tiers, causing
unexpected performance degradation.  Moreover, memory access patterns
can change at runtime.

To handle this problem, we need a way to monitor memory access patterns
and migrate pages based on their access temperature.  The DAMON (Data
Access MONitor) framework and its DAMOS (DAMON-based Operation Schemes)
are useful features for monitoring and migrating pages.  DAMOS provides
multiple actions based on DAMON monitoring results; for example, it can
be used for proactive reclaim, swapping cold pages out with the
DAMOS_PAGEOUT action.  However, it doesn't support migration actions
such as demotion and promotion between tiered memory nodes.

This series adds two new DAMOS actions: DAMOS_MIGRATE_HOT for promotion
from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast tiers.
These prevent hot pages from being stuck on slow tiers, which would
degrade performance, and allow cold pages to be proactively demoted to
slow tiers so that the system has a better chance of allocating more
hot pages to fast tiers.

DAMON provides various tuning knobs, but we found that proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 17~18% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.
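
(The 17~18% and 4~5% figures are the geometric means of the per-size
normalized execution times from the result tables quoted later in this
thread; for example, for the zipfian workload:)

  # How the summary slowdown figures relate to the per-size results:
  # geometric mean of the normalized execution times, minus 1.
  import math

  def geomean(xs):
      return math.exp(sum(math.log(x) for x in xs) / len(xs))

  default_zipfian      = [1.12, 1.13, 1.14, 1.16, 1.19, 1.21, 1.21]
  damon_tiered_zipfian = [1.04, 1.03, 1.04, 1.06, 1.05, 1.05, 1.05]

  print(f"default slowdown:      {geomean(default_zipfian) - 1:.1%}")       # ~17%
  print(f"DAMON tiered slowdown: {geomean(damon_tiered_zipfian) - 1:.1%}")  # ~5%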

DAMON configuration
===================

The specific DAMON configuration doesn't have to be within the scope of
this patch series, but a rough idea is worth sharing to explain the
evaluation result.

DAMON provides many knobs for fine tuning, and the configuration used
here is generated by HMSDK[3].  It includes a gen_config.py script that
generates a json file with the full set of DAMON knob values, and it
creates multiple kdamonds, one for each NUMA node, when DAMON is enabled
so that it can run hot/cold based migration for tiered memory.
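
As a rough illustration of that per-node idea only (the exact output
format of gen_config.py is not shown here, so treat the schema below as
an assumption), the generated configuration conceptually pairs one
kdamond per node: MIGRATE_COLD on the DRAM node with the CXL node as
target_nid, and MIGRATE_HOT on the CXL node with the DRAM node as
target_nid.

  # Illustrative sketch only; the real gen_config.py in HMSDK may use a
  # different schema.  It just shows the per-node pairing of the two schemes.
  import json

  def make_tiering_config(dram_nid=0, cxl_nid=1):
      return {
          "kdamonds": [
              {   # monitor the fast (DRAM) node and demote its cold regions
                  "monitor_nid": dram_nid,
                  "scheme": {"action": "migrate_cold", "target_nid": cxl_nid},
              },
              {   # monitor the slow (CXL) node and promote its hot regions
                  "monitor_nid": cxl_nid,
                  "scheme": {"action": "migrate_hot", "target_nid": dram_nid},
              },
          ]
      }

  if __name__ == "__main__":
      print(json.dumps(make_tiering_config(), indent=2))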

Evaluation Workload
===================

The performance evaluation is done with redis[4], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5].  We have measured two different workloads with zipfian and
latest distributions, but their configs are slightly modified to make
memory usage higher and execution time longer for better evaluation.

The idea of evaluating with these migrate_{hot,cold} actions covers
system-wide memory management rather than partitioning hot/cold pages of
a single workload.  The default memory allocation policy creates pages
on the fast tier DRAM node first, then allocates newly created pages on
the slow tier CXL node when the DRAM node has insufficient free space.
Once the page allocation is done, those pages never move between NUMA
nodes.  That is not true when using numa balancing, but that is not in
the scope of this DAMON based tiered memory management support.

Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

2024-04-09 Thread Gregory Price
On Mon, Apr 08, 2024 at 10:41:04PM +0900, Honggyu Kim wrote:
> Hi Gregory,
> 
> On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price  
> wrote:
> > Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> > but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> > 
> > The question I have is exactly how often is MIGRATE_HOT actually being
> > utilized, and how much data is being moved. Testing MIGRATE_COLD only
> > would at least give a rough approximation of that.
> 
> To explain this, I better share more test results.  In the section of
> "Evaluation Workload", the test sequence can be summarized as follows.
> 
>   *. "Turn on DAMON."
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)

Aha! I see now, you are allocating memory to ensure the real workload
(redis-server) pressures the DRAM tier and causes "spillage" to the CXL
tier, and then measure the overhead in different scenarios.

I would still love to know what the result of a demote-only system would
produce, mostly because it would very clearly demonstrate the value of
the demote+promote system when the system is memory-pressured.

Given the additional results below, it shows a demote-only system would
likely trend toward CXL-only, and so this shows an affirmative support
for the promotion logic.

Just another datum that is useful and paints a more complete picture.

> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another scenario, which lazily enabled DAMON just
> before the YCSB run at step 4.  I will call this test "DAMON lazy".  This
> part is missing from the cover letter.
> 
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   *. "Turn on DAMON."
> 
> In the "DAMON lazy" scenario, DAMON started monitoring late, so the
> initial redis-server placement is the same as "default", but it started
> to demote cold data and promote redis data just before the YCSB run.
>

This is excellent and definitely demonstrates part of the picture I was
alluding to, thank you for the additional data!

> 
> I have included the "DAMON lazy" result and also the migration sizes from
> the new DAMOS migrate actions.  Please note that the demotion size is way
> higher than the promotion size because the promotion target is only redis
> data, while the demotion target includes the huge cold memory allocated by
> mmap + memset.  (There could be some ping-pong issue, though.)
> 
> As you mentioned, "DAMON tiered" case gets more benefit because new
> redis allocations go to DRAM more than "default", but it also gets
> benefit from promotion when it is under higher memory pressure as shown
> in 490G and 500G cases.  It promotes 22GB and 17GB of redis data to DRAM
> from CXL.

I think a better way of saying this is that "DAMON tiered" more
effectively mitigates the effect of memory-pressure on faster tier
before spillage occurs, while "DAMON lazy" demonstrates the expected
performance of the system after memory pressure outruns the demotion
logic, so you wind up with hot data stuck in the slow tier.

There are some out there that would simply say "just demote more
aggressively", so this is useful information for the discussion.

+/- ~2% despite greater memory migration is an excellent result

> > Can you also provide the DRAM-only results for each test?  Presumably,
> > as workload size increases from 440G to 500G, the system probably starts
> > using some amount of swap/zswap/whatever.  It would be good to know how
> > this system compares to swap small amounts of overflow.
> 
> It looks like my explanation didn't inform you correctly.  The sizes
> from 440GB to 500GB are for pre-allocated cold data that puts memory
> pressure on the system so that redis-server cannot be fully allocated on
> fast DRAM, and is then partially allocated on CXL memory as well.
> 

Yes, sorry for the misunderstanding.  This makes it much clearer.

> 
> I hope my explanation is helpful for you to understand.  Please let me
> know if you have more questions.
>

Excellent work, exciting results! Thank you for the additional answers
:]

~Gregory



Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

2024-04-09 Thread Honggyu Kim
On Mon,  8 Apr 2024 22:41:04 +0900 Honggyu Kim  wrote:
[...]
> To explain this, I better share more test results.  In the section of
> "Evaluation Workload", the test sequence can be summarized as follows.
> 
>   *. "Turn on DAMON."
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   3. Run YCSB to make zipfian distribution of memory accesses to
>  redis-server, then measure execution time.
>   4. Repeat 4 over 50 times to measure the average execution time for
>  each run.

Sorry, "Repeat 4 over 50 times" is incorrect.  This should be "Repeat 3
over 50 times".

>   5. Increase the cold memory size then repeat goes to 2.
> 
> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another scenario, which lazily enabled DAMON just
> before the YCSB run at step 4.  I will call this test "DAMON lazy".  This
> part is missing from the cover letter.
> 
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   *. "Turn on DAMON."
>   4. Run YCSB to make zipfian distribution of memory accesses to
>  redis-server, then measure execution time.
>   5. Repeat 4 over 50 times to measure the average execution time for
>  each run.
>   6. Increase the cold memory size then repeat goes to 2.
> 
> In the "DAMON lazy" scenario, DAMON started monitoring late, so the
> initial redis-server placement is the same as "default", but it started
> to demote cold data and promote redis data just before the YCSB run.
[...]
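
(For step 1 of the test sequence quoted above, the cold-memory allocation can
be approximated with a small sketch like the one below.  The real tests used
hundreds of GB and bound the allocation to the DRAM node, e.g. via numactl;
those details are assumptions here and are not shown.)

  # Sketch of step 1: map anonymous memory and touch every page once (the
  # mmap+memset equivalent), then sleep so the pages stay resident but idle.
  # Size is deliberately tiny; the actual tests used 440GB-500GB.
  import mmap
  import time

  SIZE = 64 * 1024 * 1024  # 64 MiB for illustration

  def allocate_cold_memory(size=SIZE):
      buf = mmap.mmap(-1, size)                # anonymous mapping
      for off in range(0, size, mmap.PAGESIZE):
          buf[off] = 0                         # fault every page in once
      return buf

  if __name__ == "__main__":
      cold = allocate_cold_memory()
      time.sleep(3600)                         # keep the memory resident but idle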

Thanks,
Honggyu



Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

2024-04-08 Thread Honggyu Kim
Hi Gregory,

On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price  
wrote:
> On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> > 
> >   1. YCSB zipfian distribution read only workload
> >   memory pressure with cold memory on node0 with 512GB of local DRAM.
> >   =============+====================================================+=========
> >                |       cold memory occupied by mmap and memset      |
> >                |     0G  440G  450G  460G  470G  480G  490G  500G   |
> >   =============+====================================================+=========
> >   Execution time normalized to DRAM-only values                     | GEOMEAN
> >   -------------+----------------------------------------------------+---------
> >   DRAM-only    |  1.00     -     -     -     -     -     -     -    |   1.00
> >   CXL-only     |  1.22     -     -     -     -     -     -     -    |   1.22
> >   default      |     -  1.12  1.13  1.14  1.16  1.19  1.21  1.21    |   1.17
> >   DAMON tiered |     -  1.04  1.03  1.04  1.06  1.05  1.05  1.05    |   1.05
> >   =============+====================================================+=========
> >   CXL usage of redis-server in GB                                    | AVERAGE
> >   -------------+----------------------------------------------------+---------
> >   DRAM-only    |   0.0     -     -     -     -     -     -     -    |    0.0
> >   CXL-only     |  52.6     -     -     -     -     -     -     -    |   52.6
> >   default      |     -  20.4  27.0  33.1  39.5  45.6  50.5  50.3    |   38.1
> >   DAMON tiered |     -   0.1   0.3   0.8   0.6   0.7   1.3   0.9    |    0.7
> >   =============+====================================================+=========
> > 
> > Each test result is based on the execution environment as follows.
> > 
> >   DRAM-only   : redis-server uses only local DRAM memory.
> >   CXL-only: redis-server uses only CXL memory.
> >   default : default memory policy(MPOL_DEFAULT).
> > numa balancing disabled.
> >   DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
> > DAMOS_MIGRATE_HOT for CXL nodes.
> > 
> > The above result shows the "default" execution time goes up as the size
> > of cold memory is increased from 440G to 500G because the more cold
> > memory used, the more CXL memory is used for the target redis workload
> > and this makes the execution time increase.
> > 
> > However, "DAMON tiered" result shows less slowdown because the
> > DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated
> > cold memory to CXL node and this free space at DRAM increases more
> > chance to allocate hot or warm pages of redis-server to fast DRAM node.
> > Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages
> > of redis-server to DRAM node actively.
> > 
> > As a result, it makes more memory of redis-server stay in DRAM node
> > compared to "default" memory policy and this makes the performance
> > improvement.
> > 
> > The following result of latest distribution workload shows similar data.
> > 
> >   2. YCSB latest distribution read only workload
> >   memory pressure with cold memory on node0 with 512GB of local DRAM.
> >   =============+====================================================+=========
> >                |       cold memory occupied by mmap and memset      |
> >                |     0G  440G  450G  460G  470G  480G  490G  500G   |
> >   =============+====================================================+=========
> >   Execution time normalized to DRAM-only values                     | GEOMEAN
> >   -------------+----------------------------------------------------+---------
> >   DRAM-only    |  1.00     -     -     -     -     -     -     -    |   1.00
> >   CXL-only     |  1.18     -     -     -     -     -     -     -    |   1.18
> >   default      |     -  1.18  1.19  1.18  1.18  1.17  1.19  1.18    |   1.18
> >   DAMON tiered |     -  1.04  1.04  1.04  1.05  1.04  1.05  1.05    |   1.04
> >   =============+====================================================+=========
> >   CXL usage of redis-server in GB                                    | AVERAGE
> >   -------------+----------------------------------------------------+---------
> >   DRAM-only    |   0.0     -     -     -     -     -     -     -    |    0.0
> >   CXL-only     |  52.6     -     -     -     -     -     -     -    |   52.6
> >   default      |     -  20.5  27.1  33.2  39.5  45.5  50.4  50.5    |   38.1
> >   DAMON tiered |     -   0.2   0.4   0.7   1.6   1.2   1.1   3.4    |    1.2
> >   =============+====================================================+=========

Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

2024-04-08 Thread Honggyu Kim
Hi SeongJae,

On Fri,  5 Apr 2024 12:28:00 -0700 SeongJae Park  wrote:
> Hello Honggyu,
> 
> On Fri,  5 Apr 2024 15:08:49 +0900 Honggyu Kim  wrote:
> 
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> > 
> > It says there is no implementation of the demote/promote DAMOS action
> > are made.  This RFC is about its implementation for physical address
> > space.
> > 
> > 
> > Changes from RFC v2:
> >   1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
> >   2. Create 'target_nid' to set the migration target node instead of
> >  depending on node distance based information.
> >   3. Instead of having page level access check in this patch series,
> >  delegate the job to a new DAMOS filter type YOUNG[2].
> >   4. Introduce vmstat counters "damon_migrate_{hot,cold}".
> >   5. Rebase from v6.7 to v6.8.
> 
> Thank you for patiently keeping the discussion going and making this great
> version!  I left comments on each patch, but found no special concerns.  The
> per-page access recheck for MIGRATE_HOT and the vmstat change are catching my
> eye, though.  I doubt whether those are really needed.  It would be nice if
> you could answer the comments.

I will answer them where you made the comments.

> Once my comments on this version are addressed, I would have no reason to
> object to dropping the RFC tag from this patchset.

Thanks.  I will drop the RFC after addressing your comments.

> Nonetheless, I see some warnings and errors from checkpatch.pl.  I don't
> really care about those for RFC patches, so no problem at all.  But if you
> agree with my opinion about dropping the RFC tag, and therefore will send the
> next version without the RFC tag, please make sure you also run checkpatch.pl
> before posting.

Sure.  I will run checkpatch.pl from the next revision.

Thanks,
Honggyu

> 
> Thanks,
> SJ
> 
> [...]



Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

2024-04-05 Thread SeongJae Park
Hello Honggyu,

On Fri,  5 Apr 2024 15:08:49 +0900 Honggyu Kim  wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
> It says no implementation of the demote/promote DAMOS actions has been
> made yet.  This RFC is about their implementation for the physical address
> space.
> 
> 
> Changes from RFC v2:
>   1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
>   2. Create 'target_nid' to set the migration target node instead of
>  depending on node distance based information.
>   3. Instead of having page level access check in this patch series,
>  delegate the job to a new DAMOS filter type YOUNG[2].
>   4. Introduce vmstat counters "damon_migrate_{hot,cold}".
>   5. Rebase from v6.7 to v6.8.

Thank you for patiently keeping the discussion going and making this great
version!  I left comments on each patch, but found no special concerns.  The
per-page access recheck for MIGRATE_HOT and the vmstat change are catching my
eye, though.  I doubt whether those are really needed.  It would be nice if
you could answer the comments.

Once my comments on this version are addressed, I would have no reason to
object to dropping the RFC tag from this patchset.

Nonetheless, I see some warnings and errors from checkpatch.pl.  I don't
really care about those for RFC patches, so no problem at all.  But if you
agree with my opinion about dropping the RFC tag, and therefore will send the
next version without the RFC tag, please make sure you also run checkpatch.pl
before posting.


Thanks,
SJ

[...]



Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

2024-04-05 Thread Gregory Price
On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
>   1. YCSB zipfian distribution read only workload
>   memory pressure with cold memory on node0 with 512GB of local DRAM.
>   =============+====================================================+=========
>                |       cold memory occupied by mmap and memset      |
>                |     0G  440G  450G  460G  470G  480G  490G  500G   |
>   =============+====================================================+=========
>   Execution time normalized to DRAM-only values                     | GEOMEAN
>   -------------+----------------------------------------------------+---------
>   DRAM-only    |  1.00     -     -     -     -     -     -     -    |   1.00
>   CXL-only     |  1.22     -     -     -     -     -     -     -    |   1.22
>   default      |     -  1.12  1.13  1.14  1.16  1.19  1.21  1.21    |   1.17
>   DAMON tiered |     -  1.04  1.03  1.04  1.06  1.05  1.05  1.05    |   1.05
>   =============+====================================================+=========
>   CXL usage of redis-server in GB                                    | AVERAGE
>   -------------+----------------------------------------------------+---------
>   DRAM-only    |   0.0     -     -     -     -     -     -     -    |    0.0
>   CXL-only     |  52.6     -     -     -     -     -     -     -    |   52.6
>   default      |     -  20.4  27.0  33.1  39.5  45.6  50.5  50.3    |   38.1
>   DAMON tiered |     -   0.1   0.3   0.8   0.6   0.7   1.3   0.9    |    0.7
>   =============+====================================================+=========
> 
> Each test result is based on the execution environment as follows.
> 
>   DRAM-only   : redis-server uses only local DRAM memory.
>   CXL-only: redis-server uses only CXL memory.
>   default : default memory policy(MPOL_DEFAULT).
> numa balancing disabled.
>   DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
> DAMOS_MIGRATE_HOT for CXL nodes.
> 
> The above result shows that the "default" execution time goes up as the
> size of cold memory is increased from 440G to 500G: the more cold memory
> is used, the more CXL memory is used for the target redis workload, and
> this increases the execution time.
> 
> However, the "DAMON tiered" result shows less slowdown because the
> DAMOS_MIGRATE_COLD action on the DRAM node proactively demotes pre-allocated
> cold memory to the CXL node, and this freed space on DRAM increases the
> chance to allocate hot or warm pages of redis-server on the fast DRAM node.
> Moreover, the DAMOS_MIGRATE_HOT action on the CXL node also actively promotes
> hot pages of redis-server to the DRAM node.
> 
> As a result, more memory of redis-server stays in the DRAM node compared
> to the "default" memory policy, and this improves performance.
> 
> The following result of latest distribution workload shows similar data.
> 
>   2. YCSB latest distribution read only workload
>   memory pressure with cold memory on node0 with 512GB of local DRAM.
>   =============+====================================================+=========
>                |       cold memory occupied by mmap and memset      |
>                |     0G  440G  450G  460G  470G  480G  490G  500G   |
>   =============+====================================================+=========
>   Execution time normalized to DRAM-only values                     | GEOMEAN
>   -------------+----------------------------------------------------+---------
>   DRAM-only    |  1.00     -     -     -     -     -     -     -    |   1.00
>   CXL-only     |  1.18     -     -     -     -     -     -     -    |   1.18
>   default      |     -  1.18  1.19  1.18  1.18  1.17  1.19  1.18    |   1.18
>   DAMON tiered |     -  1.04  1.04  1.04  1.05  1.04  1.05  1.05    |   1.04
>   =============+====================================================+=========
>   CXL usage of redis-server in GB                                    | AVERAGE
>   -------------+----------------------------------------------------+---------
>   DRAM-only    |   0.0     -     -     -     -     -     -     -    |    0.0
>   CXL-only     |  52.6     -     -     -     -     -     -     -    |   52.6
>   default      |     -  20.5  27.1  33.2  39.5  45.5  50.4  50.5    |   38.1
>   DAMON tiered |     -   0.2   0.4   0.7   1.6   1.2   1.1   3.4    |    1.2
>   =============+====================================================+=========
> 
> In summary of both results, our evaluation shows that "DAMON tiered"
> memory management reduces the performance slowdown compared to the
> "default" memory policy from 17~18% to 4~5% when the system runs with
> high memory pressure on its fast tier DRAM nodes.
> 
> Having these DAMOS_MIGRATE_HOT and DAMOS

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-04-05 Thread Honggyu Kim
Hi SeongJae,

On Tue, 26 Mar 2024 16:03:09 -0700 SeongJae Park  wrote:
> On Mon, 25 Mar 2024 15:53:03 -0700 SeongJae Park  wrote:
> 
> > On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim  wrote:
> [...]
> > > On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park  wrote:
> > > > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  
> > > > wrote:
> [...]
> > >
> > > I would like to hear how you think about this.
> > 
> > So, to summarize my humble opinion,
> > 
> > 1. I like the idea of having two actions.  But I'd like to use names other 
> > than
> >'promote' and 'demote'.
> > 2. I still prefer having a filter for the page granularity access re-check.
> > 
> [...]
> > > I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so I
> > > can talk more about this issue.
> > 
> > Looking forward to chatting with you :)
> 
> We met and discussed this topic in the chat series yesterday.  Sharing the
> summary here to keep the discussion open.
> 
> Honggyu thankfully accepted my humble suggestions in the last reply.  Honggyu
> will post the third version of this patchset soon.  The patchset will
> implement two new DAMOS actions, namely MIGRATE_HOT and MIGRATE_COLD.  Those
> will migrate the DAMOS target regions to a user-specified NUMA node, but will
> have different prioritization score functions.  As the names imply, they will
> prioritize hotter regions and colder regions, respectively.

It's a late answer, but thanks very much for the summary.  It was a really
helpful discussion in the chat series.

I'm leaving a message so that anyone can follow the update.  The v3
is posted at 
https://lore.kernel.org/damon/20240405060858.2818-1-honggyu@sk.com.

> Honggyu, please feel free to fix if there is anything wrong or missed.

I don't have anything to fix from your summary.

> And thanks to Honggyu again for patiently keeping this productive discussion
> and their awesome work.

I appreciate your persistent support and kind reviews. 

Thanks,
Honggyu

> 
> Thanks,
> SJ
> 
> [...]



[RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

2024-04-05 Thread Honggyu Kim
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It says no implementation of the demote/promote DAMOS actions has been
made yet.  This RFC is about their implementation for the physical address
space.


Changes from RFC v2:
  1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
  2. Create 'target_nid' to set the migration target node instead of
 depending on node distance based information.
  3. Instead of having page level access check in this patch series,
 delegate the job to a new DAMOS filter type YOUNG[2].
  4. Introduce vmstat counters "damon_migrate_{hot,cold}".
  5. Rebase from v6.7 to v6.8.

Changes from RFC:
  1. Move most of implementation from mm/vmscan.c to mm/damon/paddr.c.
  2. Simplify some functions of vmscan.c and use them in paddr.c, but they
 need to be reviewed in more depth.
  3. Refactor most functions for common usage for both promote and
 demote actions and introduce an enum migration_mode for its control.
  4. Add "target_nid" sysfs knob for migration destination node for both
 promote and demote actions.
  5. Move DAMOS_PROMOTE before DAMOS_DEMOTE and move them even above
 DAMOS_STAT.


Introduction
============

With the advent of CXL/PCIe attached DRAM, which will be referred to
simply as CXL memory in this cover letter, some systems are becoming
more heterogeneous, having memory subsystems with different latency and
bandwidth characteristics.  They are usually handled as different NUMA
nodes in separate memory tiers, and CXL memory is used as a slow tier
because of its protocol overhead compared to local DRAM.

In these kinds of systems, we need to be careful to place memory pages
on the proper NUMA nodes based on memory access frequency.  Otherwise,
some frequently accessed pages might reside on slow tiers, causing
unexpected performance degradation.  Moreover, memory access patterns
can change at runtime.

To handle this problem, we need a way to monitor memory access patterns
and migrate pages based on their access temperature.  The DAMON (Data
Access MONitor) framework and its DAMOS (DAMON-based Operation Schemes)
are useful features for monitoring and migrating pages.  DAMOS provides
multiple actions based on DAMON monitoring results; for example, it can
be used for proactive reclaim, swapping cold pages out with the
DAMOS_PAGEOUT action.  However, it doesn't support migration actions
such as demotion and promotion between tiered memory nodes.

This series adds two new DAMOS actions: DAMOS_MIGRATE_HOT for promotion
from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast tiers.
These prevent hot pages from being stuck on slow tiers, which would
degrade performance, and allow cold pages to be proactively demoted to
slow tiers so that the system has a better chance of allocating more
hot pages to fast tiers.

DAMON provides various tuning knobs, but we found that proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 17~18% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.


DAMON configuration
===================

The specific DAMON configuration doesn't have to be within the scope of
this patch series, but a rough idea is worth sharing to explain the
evaluation result.

DAMON provides many knobs for fine tuning, and the configuration used
here is generated by HMSDK[3].  It includes a gen_config.py script that
generates a json file with the full set of DAMON knob values, and it
creates multiple kdamonds, one for each NUMA node, when DAMON is enabled
so that it can run hot/cold based migration for tiered memory.


Evaluation Workload
===================

The performance evaluation is done with redis[4], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5].  We have measured two different workloads with zipfian and
latest distributions, but their configs are slightly modified to make
memory usage higher and execution time longer for better evaluation.

The idea of evaluating with these migrate_{hot,cold} actions covers
system-wide memory management rather than partitioning hot/cold pages of
a single workload.  The default memory allocation policy creates pages
on the fast tier DRAM node first, then allocates newly created pages on
the slow tier CXL node when the DRAM node has insufficient free space.
Once the page allocation is done, those pages never move between NUMA
nodes.  That is not true when using numa balancing, but that is not in
the scope of this DAMON based tiered memory management support.

If the working set of redis fits fully into the DRAM node, then redis
will access only the fast DRAM.  Since DRAM-only performance is faster
than partially accessing CXL memory in slow tiers, this
environ

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-26 Thread SeongJae Park
On Mon, 25 Mar 2024 15:53:03 -0700 SeongJae Park  wrote:

> On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim  wrote:
[...]
> > On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park  wrote:
> > > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  wrote:
[...]
> >
> > I would like to hear how you think about this.
> 
> So, to summarize my humble opinion,
> 
> 1. I like the idea of having two actions.  But I'd like to use names other 
> than
>'promote' and 'demote'.
> 2. I still prefer having a filter for the page granularity access re-check.
> 
[...]
> > I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so I
> > can talk more about this issue.
> 
> Looking forward to chatting with you :)

We met and discussed this topic in the chat series yesterday.  Sharing the
summary here to keep the discussion open.

Honggyu thankfully accepted my humble suggestions in the last reply.  Honggyu
will post the third version of this patchset soon.  The patchset will
implement two new DAMOS actions, namely MIGRATE_HOT and MIGRATE_COLD.  Those
will migrate the DAMOS target regions to a user-specified NUMA node, but will
have different prioritization score functions.  As the names imply, they will
prioritize hotter regions and colder regions, respectively.

Honggyu, please feel free to fix if there is anything wrong or missed.

And thanks to Honggyu again for patiently keeping this productive discussion
and their awesome work.


Thanks,
SJ

[...]



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-25 Thread SeongJae Park
On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park  wrote:
> > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  wrote:
[...]
> > > > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we 
> > > > discussed about
> > > > this patchset in high level.  Sharing the summary here for open 
> > > > discussion.  As
> > > > also discussed on the first version of this patchset[2], we want to 
> > > > make single
> > > > action for general page migration with minimum changes, but would like 
> > > > to keep
> > > > page level access re-check.  We also agreed the previously proposed 
> > > > DAMOS
> > > > filter-based approach could make sense for the purpose.
> > > 
> > > Thanks very much for the summary.  I have been trying to merge promote
> > > and demote actions into a single migrate action, but I found an issue
> > > regarding damon_pa_scheme_score.  It currently calls damon_cold_score()
> > > for demote action and damon_hot_score() for promote action, but what
> > > should we call when we use a single migrate action?
> > 
> > Good point!  This is what I didn't think about when suggesting that.  Thank 
> > you
> > for letting me know this gap!  I think there could be two approach, off the 
> > top
> > of my head.
> > 
> > The first one would be extending the interface so that the user can select 
> > the
> > score function.  This would let flexible usage, but I'm bit concerned if 
> > this
> > could make things unnecessarily complex, and would really useful in many
> > general use case.
> 
> I also think this looks complicated and may not be useful for general
> users.
> 
> > The second approach would be letting DAMON infer the intention.  In this 
> > case,
> > I think we could know the intention is the demotion if the scheme has a youg
> > pages exclusion filter.  Then, we could use the cold_score().  And vice 
> > versa.
> > To cover a case that there is no filter at all, I think we could have one
> > assumption.  My humble intuition says the new action (migrate) may be used 
> > more
> > for promotion use case.  So, in damon_pa_scheme_score(), if the action of 
> > the
> > given scheme is the new one (say, MIGRATE), the function will further check 
> > if
> > the scheme has a filter for excluding young pages.  If so, the function will
> > use cold_score().  Otherwise, the function will use hot_score().
> 
> Thanks for suggesting many ideas, but I'm afraid this doesn't look good to
> me.  Thinking about it again, I think we can keep using DAMOS_PROMOTE and
> DAMOS_DEMOTE,

In other words, keep having a dedicated DAMOS action for each intuitive
prioritization score function, or, in other words, couple the prioritization
with each action, right?  I think this makes sense, and it fits well with the
documentation.

The prioritization mechanism should be different for each action.  For example,
rarely accessed (colder) memory regions would be prioritized for page-out
scheme action.  In contrast, the colder regions would be deprioritized for huge
page collapse scheme action.  Hence, the prioritization mechanisms for each
action are implemented in each DAMON operations set, together with the actions.

In other words, each DAMOS action should let users intuitively understand
what types of regions will be prioritized.  We already have such couples of
DAMOS actions, such as DAMOS_[NO]HUGEPAGE and DAMOS_LRU_[DE]PRIO.  So adding a
couple of actions for this case sounds reasonable to me.  And I think this is
better and simpler than having the inference-based behavior.

That said, I'm concerned that 'PROMOTE' and 'DEMOTE' may still sound a bit
ambiguous to people who don't know 'demote_folio_list()' and its friends.
Meanwhile, to people who do know those functions, the names might sound too
specific about what the actions do, making them a bit inflexible.  They might
also get confused since we don't have 'promote_folio_list()'.

To my humble understanding, what you really want to do is migrate pages to a
specific address range (or node), prioritizing the pages based on their
hotness.  What about, say, MIGRATE_{HOT,COLD}?

> but I can make them directly call
> damon_folio_young() for the access check instead of using the young filter.
> 
> And we can internally handle the complicated combination, such as the demote
> action setting the "young" filter with "matching" true and the promote action
> setting the "young" filter with "matching" false.  IMHO, this will make the
> usage simpler.

I think whether to exclude young/non-young (maybe idle is better than
non-young?) pages from the action is better kept decoupled, for the following
reasons.

Firstly, we want to check page-granularity youngness mainly because we found
DAMON's monitoring result is not accurate enough for this use case.  Or, we
could say that's because you cannot wait until DAMON's monitoring result
becomes accurate enough.  For more detail, you could increase the minimum age
of your scheme's target access pattern.  I show you set 

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-25 Thread Honggyu Kim
Hi SeongJae,

On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park  wrote:
> On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park  wrote:
> > > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim  wrote:
> > > 
> > > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > > > posted at [1].
> > > > 
> > > > It says there is no implementation of the demote/promote DAMOS action
> > > > are made.  This RFC is about its implementation for physical address
> > > > space.
> > > > 
> > > > 
> [...]
> > > Thank you for running the tests again with the new version of the patches 
> > > and
> > > sharing the results!
> > 
> > It's a bit late answer, but the result was from the previous evaluation.
> > I ran it again with RFC v2, but didn't see much difference so just
> > pasted the same result here.
> 
> No problem, thank you for clarifying :)
> 
> [...]
> > > > Honggyu Kim (3):
> > > >   mm/damon: refactor DAMOS_PAGEOUT with migration_mode
> > > >   mm: make alloc_demote_folio externally invokable for migration
> > > >   mm/damon: introduce DAMOS_DEMOTE action for demotion
> > > > 
> > > > Hyeongtak Ji (4):
> > > >   mm/memory-tiers: add next_promotion_node to find promotion target
> > > >   mm/damon: introduce DAMOS_PROMOTE action for promotion
> > > >   mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
> > > >   mm/damon/sysfs-schemes: apply target_nid for promote and demote
> > > > actions
> > > 
> > > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we discussed 
> > > about
> > > this patchset in high level.  Sharing the summary here for open 
> > > discussion.  As
> > > also discussed on the first version of this patchset[2], we want to make 
> > > single
> > > action for general page migration with minimum changes, but would like to 
> > > keep
> > > page level access re-check.  We also agreed the previously proposed DAMOS
> > > filter-based approach could make sense for the purpose.
> > 
> > Thanks very much for the summary.  I have been trying to merge promote
> > and demote actions into a single migrate action, but I found an issue
> > regarding damon_pa_scheme_score.  It currently calls damon_cold_score()
> > for demote action and damon_hot_score() for promote action, but what
> > should we call when we use a single migrate action?
> 
> Good point!  This is what I didn't think about when suggesting that.  Thank
> you for letting me know this gap!  I think there could be two approaches, off
> the top of my head.
> 
> The first one would be extending the interface so that the user can select the
> score function.  This would allow flexible usage, but I'm a bit concerned that
> this could make things unnecessarily complex, and whether it would really be
> useful in many general use cases.

I also think this looks complicated and may not be useful for general
users.

> The second approach would be letting DAMON infer the intention.  In this
> case, I think we could know the intention is demotion if the scheme has a
> young pages exclusion filter.  Then, we could use cold_score().  And vice
> versa.  To cover the case that there is no filter at all, I think we could
> have one assumption.  My humble intuition says the new action (migrate) may
> be used more for the promotion use case.  So, in damon_pa_scheme_score(), if
> the action of the given scheme is the new one (say, MIGRATE), the function
> will further check if the scheme has a filter for excluding young pages.  If
> so, the function will use cold_score().  Otherwise, the function will use
> hot_score().

Thanks for suggesting many ideas, but I'm afraid this doesn't look good to me.
Thinking about it again, I think we can keep using DAMOS_PROMOTE and
DAMOS_DEMOTE, but I can make them directly call damon_folio_young() for the
access check instead of using the young filter.

And we can internally handle the complicated combination, such as the demote
action setting the "young" filter with "matching" true and the promote action
setting the "young" filter with "matching" false.  IMHO, this will make the
usage simpler.
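
For concreteness, the combination that would otherwise have to be set up
manually through the current sysfs interface would look roughly like the
sketch below; the paths, file names, and accepted values are written from
memory and should be treated as assumptions rather than the exact interface.

  # Rough sketch: set a "young" type filter on a scheme via the DAMON sysfs
  # interface.  Paths, file names, and values are assumptions for illustration.
  import os

  SCHEMES = "/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes"

  def set_young_filter(scheme_idx, matching):
      filters = os.path.join(SCHEMES, str(scheme_idx), "filters")
      with open(os.path.join(filters, "nr_filters"), "w") as f:
          f.write("1")
      with open(os.path.join(filters, "0", "type"), "w") as f:
          f.write("young")
      with open(os.path.join(filters, "0", "matching"), "w") as f:
          f.write("Y" if matching else "N")

  # demote-style scheme: filter out young pages, i.e. act on non-young ones
  # set_young_filter(0, matching=True)
  # promote-style scheme: filter out non-young pages, i.e. act on young ones
  # set_young_filter(1, matching=False)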

I would like to hear how you think about this.

> So I'd prefer the second approach.  I think it would not be too late to
> consider the first approach after it turns out that more actions have such
> ambiguity and need a more general interface for explicitly setting the score
> function.

I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so I
can talk more about this issue.

Thanks,
Honggyu

> 
> Thanks,
> SJ
> 
> [...]



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-22 Thread SeongJae Park
On Fri, 22 Mar 2024 17:27:34 +0900 Honggyu Kim  wrote:

[...]
> OK. It could be a matter of preference and the current filter is already
> in the mainline so I won't insist more.

Thank you for accepting my humble suggestion.

[...]
> > I'd prefer improving the documents or
> > user-space tool and keep the kernel code simple.
> 
> OK. I will see if there is a way to improve damo tool for this instead
> of making changes on the kernel side.

Looking forward!

[...]
> Yeah, I made this thread too much about filter naming discussion rather
> than tiered memory support.

No problem at all.  Thank you for keeping this productive discussion.

[...]
> Thanks again for your feedback.

That's my pleasure :)


Thanks,
SJ

[...]



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-22 Thread SeongJae Park
On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park  wrote:
> > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim  wrote:
> > 
> > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > > posted at [1].
> > > 
> > > It says there is no implementation of the demote/promote DAMOS action
> > > are made.  This RFC is about its implementation for physical address
> > > space.
> > > 
> > > 
[...]
> > Thank you for running the tests again with the new version of the patches 
> > and
> > sharing the results!
> 
> It's a bit late answer, but the result was from the previous evaluation.
> I ran it again with RFC v2, but didn't see much difference so just
> pasted the same result here.

No problem, thank you for clarifying :)

[...]
> > > Honggyu Kim (3):
> > >   mm/damon: refactor DAMOS_PAGEOUT with migration_mode
> > >   mm: make alloc_demote_folio externally invokable for migration
> > >   mm/damon: introduce DAMOS_DEMOTE action for demotion
> > > 
> > > Hyeongtak Ji (4):
> > >   mm/memory-tiers: add next_promotion_node to find promotion target
> > >   mm/damon: introduce DAMOS_PROMOTE action for promotion
> > >   mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
> > >   mm/damon/sysfs-schemes: apply target_nid for promote and demote
> > > actions
> > 
> > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we discussed 
> > about
> > this patchset in high level.  Sharing the summary here for open discussion. 
> >  As
> > also discussed on the first version of this patchset[2], we want to make 
> > single
> > action for general page migration with minimum changes, but would like to 
> > keep
> > page level access re-check.  We also agreed the previously proposed DAMOS
> > filter-based approach could make sense for the purpose.
> 
> Thanks very much for the summary.  I have been trying to merge promote
> and demote actions into a single migrate action, but I found an issue
> regarding damon_pa_scheme_score.  It currently calls damon_cold_score()
> for demote action and damon_hot_score() for promote action, but what
> should we call when we use a single migrate action?

Good point!  This is what I didn't think about when suggesting that.  Thank you
for letting me know this gap!  I think there could be two approaches, off the
top of my head.

The first one would be extending the interface so that the user can select the
score function.  This would allow flexible usage, but I'm a bit concerned that
this could make things unnecessarily complex, and whether it would really be
useful in many general use cases.

The second approach would be letting DAMON infer the intention.  In this case,
I think we could know the intention is demotion if the scheme has a young
pages exclusion filter.  Then, we could use cold_score().  And vice versa.
To cover the case that there is no filter at all, I think we could have one
assumption.  My humble intuition says the new action (migrate) may be used more
for the promotion use case.  So, in damon_pa_scheme_score(), if the action of
the given scheme is the new one (say, MIGRATE), the function will further check
if the scheme has a filter for excluding young pages.  If so, the function will
use cold_score().  Otherwise, the function will use hot_score().

So I'd prefer the second approach.  I think it would not be too late to
consider the first approach after it turns out that more actions have such
ambiguity and need a more general interface for explicitly setting the score
function.
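
To make the inference idea concrete, a userspace toy model of the decision
(illustrative names only; this is not kernel code and not what the patchset
implements) could look like:

  # Toy model of the score-function inference idea: a MIGRATE scheme that has
  # a filter excluding young pages is treated as demotion (cold score),
  # otherwise as promotion (hot score).  Names are illustrative, not symbols.
  from dataclasses import dataclass, field

  @dataclass
  class Filter:
      ftype: str      # e.g. "young"
      matching: bool  # True: exclude pages matching this type

  @dataclass
  class Scheme:
      action: str                 # e.g. "migrate"
      filters: list = field(default_factory=list)

  def pick_score_fn(scheme):
      if scheme.action != "migrate":
          return "action_specific_score"
      excludes_young = any(f.ftype == "young" and f.matching for f in scheme.filters)
      return "cold_score" if excludes_young else "hot_score"

  print(pick_score_fn(Scheme("migrate", [Filter("young", True)])))  # cold_score
  print(pick_score_fn(Scheme("migrate", [])))                       # hot_score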


Thanks,
SJ

[...]



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-22 Thread Honggyu Kim
Hi SeongJae,

On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park  wrote:
> On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim  wrote:
> 
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> > 
> > It says there is no implementation of the demote/promote DAMOS action
> > are made.  This RFC is about its implementation for physical address
> > space.
> > 
> > 
> > Introduction
> > 
> > 
> > With the advent of CXL/PCIe attached DRAM, which will be called simply
> > as CXL memory in this cover letter, some systems are becoming more
> > heterogeneous having memory systems with different latency and bandwidth
> > characteristics.  They are usually handled as different NUMA nodes in
> > separate memory tiers and CXL memory is used as slow tiers because of
> > its protocol overhead compared to local DRAM.
> > 
> > In this kind of systems, we need to be careful placing memory pages on
> > proper NUMA nodes based on the memory access frequency.  Otherwise, some
> > frequently accessed pages might reside on slow tiers and it makes
> > performance degradation unexpectedly.  Moreover, the memory access
> > patterns can be changed at runtime.
> > 
> > To handle this problem, we need a way to monitor the memory access
> > patterns and migrate pages based on their access temperature.  The
> > DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation
> > Schemes) can be useful features for monitoring and migrating pages.
> > DAMOS provides multiple actions based on DAMON monitoring results and it
> > can be used for proactive reclaim, which means swapping cold pages out
> > with DAMOS_PAGEOUT action, but it doesn't support migration actions such
> > as demotion and promotion between tiered memory nodes.
> > 
> > This series supports two new DAMOS actions; DAMOS_DEMOTE for demotion
> > from fast tiers and DAMOS_PROMOTE for promotion from slow tiers.  This
> > prevents hot pages from being stuck on slow tiers, which makes
> > performance degradation and cold pages can be proactively demoted to
> > slow tiers so that the system can increase the chance to allocate more
> > hot pages to fast tiers.
> > 
> > The DAMON provides various tuning knobs but we found that the proactive
> > demotion for cold pages is especially useful when the system is running
> > out of memory on its fast tier nodes.
> > 
> > Our evaluation result shows that it reduces the performance slowdown
> > compared to the default memory policy from 15~17% to 4~5% when the
> > system runs under high memory pressure on its fast tier DRAM nodes.
> > 
> > 
> > DAMON configuration
> > ===
> > 
> > The specific DAMON configuration doesn't have to be in the scope of this
> > patch series, but some rough idea is better to be shared to explain the
> > evaluation result.
> > 
> > The DAMON provides many knobs for fine tuning but its configuration file
> > is generated by HMSDK[2].  It includes gen_config.py script that
> > generates a json file with the full config of DAMON knobs and it creates
> > multiple kdamonds for each NUMA node when the DAMON is enabled so that
> > it can run hot/cold based migration for tiered memory.
> 
> I was feeling a bit confused from here since DAMON doesn't receive parameters
> via a file.  To my understanding, the 'configuration file' means the input 
> file
> for DAMON user-space tool, damo, not DAMON.  Just a trivial thing, but making
> it clear if possible could help readers in my opinion.
> 
> > 
> > 
> > Evaluation Workload
> > ===
> > 
> > The performance evaluation is done with redis[3], which is a widely used
> > in-memory database and the memory access patterns are generated via
> > YCSB[4].  We have measured two different workloads with zipfian and
> > latest distributions but their configs are slightly modified to make
> > memory usage higher and execution time longer for better evaluation.
> > 
> > The idea of evaluation using these demote and promote actions covers
> > system-wide memory management rather than partitioning hot/cold pages of
> > a single workload.  The default memory allocation policy creates pages
> > to the fast tier DRAM node first, then allocates newly created pages to
> > the slow tier CXL node when the DRAM node has insufficient free space.
> > Once the page allocation is done then those pages never move between
> > NUMA nodes.  It's not true when using numa balancing, but it is not the
> > scope 

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-22 Thread Honggyu Kim
Hi SeongJae,

On Wed, 20 Mar 2024 09:56:19 -0700 SeongJae Park  wrote:
> Hi Honggyu,
> 
> On Wed, 20 Mar 2024 16:07:48 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > On Mon, 18 Mar 2024 12:07:21 -0700 SeongJae Park  wrote:
> > > On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim  wrote:
> > > 
> > > > Hi SeongJae,
> > > > 
> > > > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  
> > > > wrote:
> > > > > Hi Honggyu,
> > > > > 
> > > > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  
> > > > > wrote:
> > > > > 
> > > > > > Hi SeongJae,
> > > > > > 
> > > > > > Thanks for the confirmation.  I have a few comments on young filter 
> > > > > > so
> > > > > > please read the inline comments again.
> > > > > > 
> > > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  
> > > > > > wrote:
> > > > > > > Hi Honggyu,
> > > > > > > 
> > > > > > > > > -Original Message-
> > > > > > > > > From: SeongJae Park 
> > > > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > > > > > To: Honggyu Kim 
> > > > > > > > > Cc: SeongJae Park ; kernel_team 
> > > > > > > > > 
> > > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > > > > > management for CXL memory
> > > > > > > > >
> > > > > > > > > Hi Honggyu,
> > > > > > > > >
> > > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > > Hi SeongJae,
> > > > > > > > > >
> > > > > > > > > > I've tested it again and found that "young" filter has to 
> > > > > > > > > > be set
> > > > > > > > > > differently as follows.
> > > > > > > > > > - demote action: set "young" filter with "matching" true
> > > > > > > > > > - promote action: set "young" filter with "matching" false
> > > > 
> > > > Thinking it again, I feel like "matching" true or false looks quite
> > > > vague to me as a general user.
> > > > 
> > > > Instead, I would like to have more meaningful names for "matching" as
> > > > follows.
> > > > 
> > > > - matching "true" can be either (filter) "out" or "skip".
> > > > - matching "false" can be either (filter) "in" or "apply".
> > > 
> > > I agree the naming could be done much better.  And thank you for the nice
> > > suggestions.  I have a few concerns, though.
> > 
> > I don't think my suggestion is best.  I just would like to have more
> > discussion about it.
> 
> I also understand my naming sense is far from good :)  I'm grateful to have
> this constructive discussion!

Yeah, naming is always difficult. Thanks anyway :)

> > 
> > > Firstly, increasing the number of behavioral concepts.  DAMOS filter 
> > > feature
> > > has only single behavior: excluding some types of memory from DAMOS action
> > > target.  The "matching" is to provide a flexible way for further 
> > > specifying the
> > > target to exclude in a bit detail.  Without it, we would need non-variant 
> > > for
> > > each filter type.  Comapred to the current terms, the new terms feel like
> > > implying there are two types of behaviors.  I think one behavior is 
> > > easier to
> > > understand than two behaviors, and better match what internal code is 
> > > doing.
> > > 
> > > Secondly, ambiguity in "in" and "apply".  To me, the terms sound like 
> > > _adding_
> > > something more than _excluding_.
> > 
> > I understood that young filter "matching" "false" means apply action
> > only to young pages.  Do I misunderstood something here?  If not,
> 
> Technically speaking, having a DAMOS filter with 'matching' parameter as
> 

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-20 Thread SeongJae Park
Hi Honggyu,

On Wed, 20 Mar 2024 16:07:48 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> On Mon, 18 Mar 2024 12:07:21 -0700 SeongJae Park  wrote:
> > On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim  wrote:
> > 
> > > Hi SeongJae,
> > > 
> > > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> > > > Hi Honggyu,
> > > > 
> > > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  
> > > > wrote:
> > > > 
> > > > > Hi SeongJae,
> > > > > 
> > > > > Thanks for the confirmation.  I have a few comments on young filter so
> > > > > please read the inline comments again.
> > > > > 
> > > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  
> > > > > wrote:
> > > > > > Hi Honggyu,
> > > > > > 
> > > > > > > > -Original Message-
> > > > > > > > From: SeongJae Park 
> > > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > > > > To: Honggyu Kim 
> > > > > > > > Cc: SeongJae Park ; kernel_team 
> > > > > > > > 
> > > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > > > > management for CXL memory
> > > > > > > >
> > > > > > > > Hi Honggyu,
> > > > > > > >
> > > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > > > > >  wrote:
> > > > > > > >
> > > > > > > > > Hi SeongJae,
> > > > > > > > >
> > > > > > > > > I've tested it again and found that "young" filter has to be 
> > > > > > > > > set
> > > > > > > > > differently as follows.
> > > > > > > > > - demote action: set "young" filter with "matching" true
> > > > > > > > > - promote action: set "young" filter with "matching" false
> > > 
> > > Thinking it again, I feel like "matching" true or false looks quite
> > > vague to me as a general user.
> > > 
> > > Instead, I would like to have more meaningful names for "matching" as
> > > follows.
> > > 
> > > - matching "true" can be either (filter) "out" or "skip".
> > > - matching "false" can be either (filter) "in" or "apply".
> > 
> > I agree the naming could be done much better.  And thank you for the nice
> > suggestions.  I have a few concerns, though.
> 
> I don't think my suggestion is best.  I just would like to have more
> discussion about it.

I also understand my naming sense is far from good :)  I'm grateful to have
this constructive discussion!

> 
> > Firstly, increasing the number of behavioral concepts.  DAMOS filter feature
> > has only single behavior: excluding some types of memory from DAMOS action
> > target.  The "matching" is to provide a flexible way for further specifying 
> > the
> > target to exclude in a bit detail.  Without it, we would need non-variant 
> > for
> > each filter type.  Comapred to the current terms, the new terms feel like
> > implying there are two types of behaviors.  I think one behavior is easier 
> > to
> > understand than two behaviors, and better match what internal code is doing.
> > 
> > Secondly, ambiguity in "in" and "apply".  To me, the terms sound like 
> > _adding_
> > something more than _excluding_.
> 
> I understood that young filter "matching" "false" means apply action
> only to young pages.  Do I misunderstood something here?  If not,

Technically speaking, having a DAMOS filter with 'matching' parameter as
'false' for the 'young pages' type means you want DAMOS to "exclude pages that are
not young from the scheme's action target".  That's the only thing it truly does,
and what it tries to guarantee.  Whether the action will be applied to young
pages or not depends on more factors including additional filters and DAMOS
parameter.  IOW, that's not what the simple setting promises.

Of course, I know you are assuming there is only the single filter.  Hence,
effectively you're correct.  And the sentence may be a better wording for end
users.  However, it took me a bit of time to understand your assumption and
conclude whether 

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-20 Thread Honggyu Kim
Hi SeongJae,

On Mon, 18 Mar 2024 12:07:21 -0700 SeongJae Park  wrote:
> On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> > > Hi Honggyu,
> > > 
> > > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> > > 
> > > > Hi SeongJae,
> > > > 
> > > > Thanks for the confirmation.  I have a few comments on young filter so
> > > > please read the inline comments again.
> > > > 
> > > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  
> > > > wrote:
> > > > > Hi Honggyu,
> > > > > 
> > > > > > > -Original Message-
> > > > > > > From: SeongJae Park 
> > > > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > > > To: Honggyu Kim 
> > > > > > > Cc: SeongJae Park ; kernel_team 
> > > > > > > 
> > > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > > > management for CXL memory
> > > > > > >
> > > > > > > Hi Honggyu,
> > > > > > >
> > > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > > > >  wrote:
> > > > > > >
> > > > > > > > Hi SeongJae,
> > > > > > > >
> > > > > > > > I've tested it again and found that "young" filter has to be set
> > > > > > > > differently as follows.
> > > > > > > > - demote action: set "young" filter with "matching" true
> > > > > > > > - promote action: set "young" filter with "matching" false
> > 
> > Thinking it again, I feel like "matching" true or false looks quite
> > vague to me as a general user.
> > 
> > Instead, I would like to have more meaningful names for "matching" as
> > follows.
> > 
> > - matching "true" can be either (filter) "out" or "skip".
> > - matching "false" can be either (filter) "in" or "apply".
> 
> I agree the naming could be done much better.  And thank you for the nice
> suggestions.  I have a few concerns, though.

I don't think my suggestion is best.  I just would like to have more
discussion about it.

> Firstly, increasing the number of behavioral concepts.  DAMOS filter feature
> has only single behavior: excluding some types of memory from DAMOS action
> target.  The "matching" is to provide a flexible way for further specifying 
> the
> target to exclude in a bit detail.  Without it, we would need non-variant for
> each filter type.  Comapred to the current terms, the new terms feel like
> implying there are two types of behaviors.  I think one behavior is easier to
> understand than two behaviors, and better match what internal code is doing.
> 
> Secondly, ambiguity in "in" and "apply".  To me, the terms sound like _adding_
> something more than _excluding_.

I understood that the young filter with "matching" "false" means applying the action
only to young pages.  Did I misunderstand something here?  If not,
"apply" means _adding_ or _including_ only the matched pages for the action.
It looks like you thought about _excluding_ non-matched pages here.

> I think that might confuse people in some
> cases.  Actually, I have used the term "filter-out" and "filter-in" on
> this  and several threads.  When saying about "filter-in" scenario, I had to
> emphasize the fact that it is not adding something but excluding others.

Excluding others also means including the matched pages.  I think we had better
focus on what to do only for the matched pages.

> I now think that was not a good approach.
> 
> Finally, "apply" sounds a bit deterministic.  I think it could be a bit
> confusing in some cases such as when using multiple filters in a combined way.
> For example, if we have two filters for 1) "apply" a memcg and 2) skip anon
> pages, the given DAMOS action will not be applied to anon pages of the memcg.
> I think this might be a bit confusing.

No objection to this.  If so, I think "in" sounds better than "apply".

> > 
> > Internally, the type of "matching" can be boolean, but it'd be better
> > for general users have another ways to set it such as "out"/"in" or
> > "s

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-18 Thread SeongJae Park
On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> > Hi Honggyu,
> > 
> > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> > 
> > > Hi SeongJae,
> > > 
> > > Thanks for the confirmation.  I have a few comments on young filter so
> > > please read the inline comments again.
> > > 
> > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > > > Hi Honggyu,
> > > > 
> > > > > > -Original Message-
> > > > > > From: SeongJae Park 
> > > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > > To: Honggyu Kim 
> > > > > > Cc: SeongJae Park ; kernel_team 
> > > > > > 
> > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > > management for CXL memory
> > > > > >
> > > > > > Hi Honggyu,
> > > > > >
> > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > > >  wrote:
> > > > > >
> > > > > > > Hi SeongJae,
> > > > > > >
> > > > > > > I've tested it again and found that "young" filter has to be set
> > > > > > > differently as follows.
> > > > > > > - demote action: set "young" filter with "matching" true
> > > > > > > - promote action: set "young" filter with "matching" false
> 
> Thinking it again, I feel like "matching" true or false looks quite
> vague to me as a general user.
> 
> Instead, I would like to have more meaningful names for "matching" as
> follows.
> 
> - matching "true" can be either (filter) "out" or "skip".
> - matching "false" can be either (filter) "in" or "apply".

I agree the naming could be done much better.  And thank you for the nice
suggestions.  I have a few concerns, though.

Firstly, increasing the number of behavioral concepts.  The DAMOS filter feature
has only a single behavior: excluding some types of memory from the DAMOS action
target.  The "matching" is to provide a flexible way for further specifying the
target to exclude in a bit more detail.  Without it, we would need a negated variant
for each filter type.  Compared to the current terms, the new terms feel like
they imply there are two types of behaviors.  I think one behavior is easier to
understand than two behaviors, and better matches what the internal code is doing.

Secondly, ambiguity in "in" and "apply".  To me, the terms sound more like _adding_
something than _excluding_.  I think that might confuse people in some
cases.  Actually, I have used the terms "filter-out" and "filter-in" in
this and several other threads.  When talking about the "filter-in" scenario, I had to
emphasize the fact that it is not adding something but excluding others.  I now
think that was not a good approach.

Finally, "apply" sounds a bit deterministic.  I think it could be a bit
confusing in some cases such as when using multiple filters in a combined way.
For example, if we have two filters for 1) "apply" a memcg and 2) skip anon
pages, the given DAMOS action will not be applied to anon pages of the memcg.
I think this might be a bit confusing.

> 
> Internally, the type of "matching" can be boolean, but it'd be better
> for general users have another ways to set it such as "out"/"in" or
> "skip"/"apply" via sysfs interface.  I prefer "skip" and "apply" looks
> more intuitive, but I don't have strong objection on "out" and "in" as
> well.

Unfortunately, the DAMON sysfs interface is an ABI that we want to keep stable.  Of
course we could make some changes to it if really required.  But I'm unsure if
the problem of the current naming and the benefit of the suggested change are big
enough to outweigh the stability risk and additional effort.

Also, the DAMON sysfs interface is arguably not for _very_ general users.  The DAMON
user-space tool is the one for _more_ general users.  To quote the DAMON usage
document,

- *DAMON user space tool.*
  `This <https://github.com/awslabs/damo>`_ is for privileged people such as
  system administrators who want a just-working human-friendly interface.
  [...]
- *sysfs interface.*
  :ref:`This ` is for privileged user space programmers who
  want more optimized use of DAMON. [...]
 
If the concept is that confusing, I think we could improve the docum

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-18 Thread Honggyu Kim
Hi SeongJae,

On Sun, 17 Mar 2024 12:13:57 -0700 SeongJae Park  wrote:
> On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> 
> > Hi Honggyu,
> > 
> > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> > 
> > > Hi SeongJae,
> > > 
> > > Thanks for the confirmation.  I have a few comments on young filter so
> > > please read the inline comments again.
> > > 
> > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > > > Hi Honggyu,
> [...]
> > > Thanks.  I see that it works fine, but I would like to have more
> > > discussion about "young" filter.  What I think about filter is that if I
> > > apply "young" filter "true" for demotion, then the action applies only
> > > for "young" pages, but the current implementation works opposite.
> > > 
> > > I understand the function name of internal implementation is
> > > "damos_pa_filter_out" so the basic action is filtering out, but the
> > > cgroup filter works in the opposite way for now.
> > 
> > Does memcg filter works in the opposite way?  I don't think so because
> > __damos_pa_filter_out() sets 'matches' as 'true' only if the the given 
> > folio is
> > contained in the given memcg.  'young' filter also simply sets 'matches' as
> > 'true' only if the given folio is young.
> > 
> > If it works in the opposite way, it's a bug that need to be fixed.  Please 
> > let
> > me know if I'm missing something.
> 
> I just read the DAMOS filters part of the documentation for DAMON sysfs
> interface again.  I believe it is explaining the meaning of 'matching' as I
> intended to, as below:
> 
> You can write ``Y`` or ``N`` to ``matching`` file to filter out pages 
> that does
> or does not match to the type, respectively.  Then, the scheme's action 
> will
> not be applied to the pages that specified to be filtered out.
> 
> But, I found the following example for memcg filter is wrong, as below:
> 
> For example, below restricts a DAMOS action to be applied to only 
> non-anonymous
> pages of all memory cgroups except ``/having_care_already``.::
> 
> # echo 2 > nr_filters
> # # filter out anonymous pages
> echo anon > 0/type
> echo Y > 0/matching
> # # further filter out all cgroups except one at 
> '/having_care_already'
> echo memcg > 1/type
> echo /having_care_already > 1/memcg_path
> echo N > 1/matching
> 
> Specifically, the last line of the commands should write 'Y' instead of 'N' to
> do what explained.  Without the fix, the action will be applied to only
> non-anonymous pages of 'having_care_already' memcg.  This is definitely wrong.
> I will fix this soon.  I'm unsure if this is what made you to believe memcg
> DAMOS filter is working in the opposite way, though.

I got confused, but not because of this.  Thinking about it again, I just
feel this user interface would be better if it were more intuitive, as I
mentioned in the previous thread.

Thanks,
Honggyu

> 
> Thanks,
> SJ
> 
> [...]



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-18 Thread Honggyu Kim
Hi SeongJae,

On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> Hi Honggyu,
> 
> On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > Thanks for the confirmation.  I have a few comments on young filter so
> > please read the inline comments again.
> > 
> > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > > Hi Honggyu,
> > > 
> > > > > -Original Message-
> > > > > From: SeongJae Park 
> > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > To: Honggyu Kim 
> > > > > Cc: SeongJae Park ; kernel_team 
> > > > > 
> > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > management for CXL memory
> > > > >
> > > > > Hi Honggyu,
> > > > >
> > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > >  wrote:
> > > > >
> > > > > > Hi SeongJae,
> > > > > >
> > > > > > I've tested it again and found that "young" filter has to be set
> > > > > > differently as follows.
> > > > > > - demote action: set "young" filter with "matching" true
> > > > > > - promote action: set "young" filter with "matching" false

Thinking about it again, I feel like "matching" true or false looks quite
vague to me as a general user.

Instead, I would like to have more meaningful names for "matching" as
follows.

- matching "true" can be either (filter) "out" or "skip".
- matching "false" can be either (filter) "in" or "apply".

Internally, the type of "matching" can be boolean, but it'd be better
for general users to have other ways to set it, such as "out"/"in" or
"skip"/"apply", via the sysfs interface.  I prefer "skip" and "apply" as they look
more intuitive, but I don't have a strong objection to "out" and "in"
either.

I also feel the filter name "young" is more for developers than for
general users.  I think this can be changed to an "accessed" filter
instead.

The demote and promote filters can be set as follows using these.

- demote action: set "accessed" filter with "matching" to "skip"
- promote action: set "accessed" filter with "matching" to "apply"

I also think you may feel this is more complicated, so I would like
to hear what you think about this.
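
For reference, with the current interface the setup discussed in this thread
would look roughly like the following through the DAMON sysfs files.  This is
only a sketch: the "demote"/"promote" action names and the "young" filter type
are the ones proposed in the RFC patches under discussion, and the
kdamond/context/scheme indices are illustrative.

    # illustrative paths; assumes the RFC "young" filter type and the
    # proposed demote/promote actions are available
    cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes

    # scheme 0: demote, with young pages filtered out ("matching" Y)
    echo demote > 0/action
    echo 1 > 0/filters/nr_filters
    echo young > 0/filters/0/type
    echo Y > 0/filters/0/matching

    # scheme 1: promote, with non-young pages filtered out ("matching" N)
    echo promote > 1/action
    echo 1 > 1/filters/nr_filters
    echo young > 1/filters/0/type
    echo N > 1/filters/0/matching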

> > > > >
> > > > > DAMOS filter is basically for filtering "out" memory regions that 
> > > > > matches to
> > > > > the condition.

Right.  In other tools, I see filters used more for filtering "in"
than for filtering "out".  I feel this makes it more confusing.

> > > > > Hence in your setup, young pages are not filtered out from
> > > > > demote action target.
> > > > 
> > > > I thought young filter true means "young pages ARE filtered out" for 
> > > > demotion.
> > > 
> > > You're correct.
> > 
> > Ack.
> > 
> > > > 
> > > > > That is, you're demoting pages that "not" young.
> > > > 
> > > > Your explanation here looks opposite to the previous statement.
> > > 
> > > Again, you're correct.  My intention was "non-young pages are not ..." but
> > > maybe I was out of my mind and mistakenly removed "non-" part.  Sorry for 
> > > the
> > > confusion.
> > 
> > No problem.  I also think it's quite confusing.
> > 
> > > > 
> > > > > And vice versa, so you're applying promote to non-non-young (young) 
> > > > > pages.
> > > > 
> > > > Yes, I understand "promote non-non-young pages" means "promote young 
> > > > pages".
> > > > This might be understood as "young pages are NOT filtered out" for 
> > > > promotion
> > > > but it doesn't mean that "old pages are filtered out" instead.
> > > > And we just rely hot detection only on DAMOS logics such as access 
> > > > frequency
> > > > and age. Am I correct?
> > > 
> > > You're correct.
> > 
> > Ack.  But if it doesn't mean that "old pages are filtered out" instead,
> 
> It does mean that.  Here, f

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-17 Thread SeongJae Park
On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:

> Hi Honggyu,
> 
> On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > Thanks for the confirmation.  I have a few comments on young filter so
> > please read the inline comments again.
> > 
> > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > > Hi Honggyu,
[...]
> > Thanks.  I see that it works fine, but I would like to have more
> > discussion about "young" filter.  What I think about filter is that if I
> > apply "young" filter "true" for demotion, then the action applies only
> > for "young" pages, but the current implementation works opposite.
> > 
> > I understand the function name of internal implementation is
> > "damos_pa_filter_out" so the basic action is filtering out, but the
> > cgroup filter works in the opposite way for now.
> 
> Does memcg filter works in the opposite way?  I don't think so because
> __damos_pa_filter_out() sets 'matches' as 'true' only if the the given folio 
> is
> contained in the given memcg.  'young' filter also simply sets 'matches' as
> 'true' only if the given folio is young.
> 
> If it works in the opposite way, it's a bug that need to be fixed.  Please let
> me know if I'm missing something.

I just read the DAMOS filters part of the documentation for DAMON sysfs
interface again.  I believe it is explaining the meaning of 'matching' as I
intended to, as below:

You can write ``Y`` or ``N`` to ``matching`` file to filter out pages that 
does
or does not match to the type, respectively.  Then, the scheme's action will
not be applied to the pages that specified to be filtered out.

But, I found the following example for memcg filter is wrong, as below:

For example, below restricts a DAMOS action to be applied to only 
non-anonymous
pages of all memory cgroups except ``/having_care_already``.::

# echo 2 > nr_filters
# # filter out anonymous pages
echo anon > 0/type
echo Y > 0/matching
# # further filter out all cgroups except one at '/having_care_already'
echo memcg > 1/type
echo /having_care_already > 1/memcg_path
echo N > 1/matching

Specifically, the last line of the commands should write 'Y' instead of 'N' to
do what is explained.  Without the fix, the action will be applied only to
non-anonymous pages of the 'having_care_already' memcg.  This is definitely wrong.
I will fix this soon.  I'm unsure if this is what made you believe the memcg
DAMOS filter is working in the opposite way, though.
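
For reference, a corrected version of the example would be identical except for
that last write (this is just the documentation example above with the fix
applied):

    # echo 2 > nr_filters
    # # filter out anonymous pages
    echo anon > 0/type
    echo Y > 0/matching
    # # further filter out all cgroups except one at '/having_care_already'
    echo memcg > 1/type
    echo /having_care_already > 1/memcg_path
    echo Y > 1/matching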


Thanks,
SJ

[...]



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-17 Thread SeongJae Park
Hi Honggyu,

On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> Thanks for the confirmation.  I have a few comments on young filter so
> please read the inline comments again.
> 
> On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > Hi Honggyu,
> > 
> > > > -Original Message-
> > > > From: SeongJae Park 
> > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > To: Honggyu Kim 
> > > > Cc: SeongJae Park ; kernel_team 
> > > > 
> > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > management for CXL memory
> > > >
> > > > Hi Honggyu,
> > > >
> > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > >  wrote:
> > > >
> > > > > Hi SeongJae,
> > > > >
> > > > > I've tested it again and found that "young" filter has to be set
> > > > > differently as follows.
> > > > > - demote action: set "young" filter with "matching" true
> > > > > - promote action: set "young" filter with "matching" false
> > > >
> > > > DAMOS filter is basically for filtering "out" memory regions that 
> > > > matches to
> > > > the condition.  Hence in your setup, young pages are not filtered out 
> > > > from
> > > > demote action target.
> > > 
> > > I thought young filter true means "young pages ARE filtered out" for 
> > > demotion.
> > 
> > You're correct.
> 
> Ack.
> 
> > > 
> > > > That is, you're demoting pages that "not" young.
> > > 
> > > Your explanation here looks opposite to the previous statement.
> > 
> > Again, you're correct.  My intention was "non-young pages are not ..." but
> > maybe I was out of my mind and mistakenly removed "non-" part.  Sorry for 
> > the
> > confusion.
> 
> No problem.  I also think it's quite confusing.
> 
> > > 
> > > > And vice versa, so you're applying promote to non-non-young (young) 
> > > > pages.
> > > 
> > > Yes, I understand "promote non-non-young pages" means "promote young 
> > > pages".
> > > This might be understood as "young pages are NOT filtered out" for 
> > > promotion
> > > but it doesn't mean that "old pages are filtered out" instead.
> > > And we just rely hot detection only on DAMOS logics such as access 
> > > frequency
> > > and age. Am I correct?
> > 
> > You're correct.
> 
> Ack.  But if it doesn't mean that "old pages are filtered out" instead,

It does mean that.  Here, filtering is exclusive.  Hence, "filter-in a type of
pages" means "filter-out pages of other types".  At least that's the intention.
To quote the documentation
(https://docs.kernel.org/mm/damon/design.html#filters),

Each filter specifies the type of target memory, and whether it should
exclude the memory of the type (filter-out), or all except the memory of
the type (filter-in).
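
To put the exclusion rule in pseudo-code form (a conceptual sketch of the
semantics being discussed here, not the actual kernel implementation; the
function name is made up for illustration):

#include <stdbool.h>

/* a page is excluded (filtered out) from the action target when whether
 * it matches the filter's type equals the filter's 'matching' value */
static bool damos_filter_skips_page(bool page_matches_type, bool matching)
{
	return page_matches_type == matching;
}

/*
 * With a single 'young' filter:
 *   matching == true  -> young pages are skipped, so a demote scheme
 *                        demotes only non-young (cold) pages
 *   matching == false -> non-young pages are skipped, so a promote
 *                        scheme promotes only young (hot) pages
 */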

> then do we really need this filter for promotion?  If not, maybe should
> we create another "old" filter for promotion?  As of now, the promotion
> is mostly done inaccurately, but the accurate migration is done at
> demotion level.

Is this based on your theory?  Or is it a real behavior that you're seeing from your
setup?  If this is a real behavior, I think that should be a bug that needs to
be fixed.

> To avoid this issue, I feel we should promotion only "young" pages after
> filtering "old" pages out.
> 
> > > 
> > > > I understand this is somewhat complex, but what we have for now.
> > > 
> > > Thanks for the explanation. I guess you mean my filter setup is correct.
> > > Is it correct?
> > 
> > Again, you're correct.  Your filter setup is what I expected to :)
> 
> Thanks.  I see that it works fine, but I would like to have more
> discussion about "young" filter.  What I think about filter is that if I
> apply "young" filter "true" for demotion, then the action applies only
> for "young" pages, but the current implementation works opposite.
> 
> I understand the function name of internal implementation is
> "damos_pa_filter_out" so the basic action is filtering out, but the
> cgroup filter works in the opposite way for now.

Does memcg filter works in the opposit

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-17 Thread Honggyu Kim
Hi SeongJae,

Thanks for the confirmation.  I have a few comments on young filter so
please read the inline comments again.

On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> Hi Honggyu,
> 
> > > -Original Message-
> > > From: SeongJae Park 
> > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > To: Honggyu Kim 
> > > Cc: SeongJae Park ; kernel_team 
> > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management 
> > > for CXL memory
> > >
> > > Hi Honggyu,
> > >
> > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > >  wrote:
> > >
> > > > Hi SeongJae,
> > > >
> > > > I've tested it again and found that "young" filter has to be set
> > > > differently as follows.
> > > > - demote action: set "young" filter with "matching" true
> > > > - promote action: set "young" filter with "matching" false
> > >
> > > DAMOS filter is basically for filtering "out" memory regions that matches 
> > > to
> > > the condition.  Hence in your setup, young pages are not filtered out from
> > > demote action target.
> > 
> > I thought young filter true means "young pages ARE filtered out" for 
> > demotion.
> 
> You're correct.

Ack.

> > 
> > > That is, you're demoting pages that "not" young.
> > 
> > Your explanation here looks opposite to the previous statement.
> 
> Again, you're correct.  My intention was "non-young pages are not ..." but
> maybe I was out of my mind and mistakenly removed "non-" part.  Sorry for the
> confusion.

No problem.  I also think it's quite confusing.

> > 
> > > And vice versa, so you're applying promote to non-non-young (young) pages.
> > 
> > Yes, I understand "promote non-non-young pages" means "promote young pages".
> > This might be understood as "young pages are NOT filtered out" for promotion
> > but it doesn't mean that "old pages are filtered out" instead.
> > And we just rely hot detection only on DAMOS logics such as access frequency
> > and age. Am I correct?
> 
> You're correct.

Ack.  But if it doesn't mean that "old pages are filtered out" instead,
then do we really need this filter for promotion?  If not, maybe we should
create another "old" filter for promotion?  As of now, the promotion
is mostly done inaccurately, but the accurate migration is done at the
demotion level.  To avoid this issue, I feel we should promote only
"young" pages after filtering "old" pages out.

> > 
> > > I understand this is somewhat complex, but what we have for now.
> > 
> > Thanks for the explanation. I guess you mean my filter setup is correct.
> > Is it correct?
> 
> Again, you're correct.  Your filter setup is what I expected to :)

Thanks.  I see that it works fine, but I would like to have more
discussion about the "young" filter.  What I think about a filter is that if I
apply the "young" filter with "true" for demotion, then the action applies only
to "young" pages, but the current implementation works the opposite way.

I understand the function name of the internal implementation is
"damos_pa_filter_out", so the basic action is filtering out, but the
cgroup filter works in the opposite way for now.

I would like to hear how you think about this.

> > 
> > > > Then, I see that "hot_cold" migrates hot/cold memory correctly.
> > >
> > > Thank you so much for sharing this great news!  My tests also show no bad
> > > signal so far.
> > >
> > > >
> > > > Could you please upload the "damon_folio_mkold" patch to LKML?
> > > > Then I will rebase our changes based on it and run the redis test again.
> > >
> > > I will do that soon.
> > 
> > Thanks a lot for sharing the RFC v2 for DAMOS young filter.
> > https://lore.kernel.org/damon/20240311204545.47097-1...@kernel.org/
> > 
> > I will rebase our work based on it and share the result.
> 
> Cool, looking forward to it!  Hopefully we will make it be merged into the
> mainline by v6.10!

I hope so.  Thanks for your help!

Honggyu



Re: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-08 Thread Honggyu Kim
Hi SeongJae,

I couldn't send email to LKML properly due to internal system issues,
but it's better now so I will be able to communicate better.

On Wed,  6 Mar 2024 19:05:50 -0800 SeongJae Park  wrote:
> 
> Hello,
> 
> On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park  wrote:
> 
> > On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim  wrote:
> > 
> > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > > posted at [1].
> > > 
> > > It says there is no implementation of the demote/promote DAMOS action
> > > are made.  This RFC is about its implementation for physical address
> > > space.
> [...]
> > Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we discussed 
> > about
> > this patchset in high level.  Sharing the summary here for open discussion. 
> >  As
> > also discussed on the first version of this patchset[2], we want to make 
> > single
> > action for general page migration with minimum changes, but would like to 
> > keep
> > page level access re-check.  We also agreed the previously proposed DAMOS
> > filter-based approach could make sense for the purpose.
> > 
> > Because I was anyway planning making such DAMOS filter for not only
> > promotion/demotion but other types of DAMOS action, I will start developing 
> > the
> > page level access re-check results based DAMOS filter.  Once the 
> > implementation
> > of the prototype is done, I will share the early implementation.  Then, 
> > Honggyu
> > will adjust their implementation based on the filter, and run their tests 
> > again
> > and share the results.
> 
> I just posted an RFC patchset for the page level access re-check DAMOS filter:
> https://lore.kernel.org/r/20240307030013.47041-1...@kernel.org
> 
> I hope it to help you better understanding and testing the idea.

Thanks very much for your work! I will test it based on your changes.

Thanks,
Honggyu

> 
> > 
> > [1] https://lore.kernel.org/damon/20220810225102.124459-1...@kernel.org/
> > [2] https://lore.kernel.org/damon/20240118171756.80356-1...@kernel.org
> 
> 
> Thanks,
> SJ
> 
> [...]
> 
> 
> 



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-06 Thread SeongJae Park
Hello,

On Tue, 27 Feb 2024 15:51:20 -0800 SeongJae Park  wrote:

> On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim  wrote:
> 
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> > 
> > It says there is no implementation of the demote/promote DAMOS action
> > are made.  This RFC is about its implementation for physical address
> > space.
[...]
> Honggyu joined DAMON Beer/Coffee/Tea Chat[1] yesterday, and we discussed about
> this patchset in high level.  Sharing the summary here for open discussion.  
> As
> also discussed on the first version of this patchset[2], we want to make 
> single
> action for general page migration with minimum changes, but would like to keep
> page level access re-check.  We also agreed the previously proposed DAMOS
> filter-based approach could make sense for the purpose.
> 
> Because I was anyway planning making such DAMOS filter for not only
> promotion/demotion but other types of DAMOS action, I will start developing 
> the
> page level access re-check results based DAMOS filter.  Once the 
> implementation
> of the prototype is done, I will share the early implementation.  Then, 
> Honggyu
> will adjust their implementation based on the filter, and run their tests 
> again
> and share the results.

I just posted an RFC patchset for the page level access re-check DAMOS filter:
https://lore.kernel.org/r/20240307030013.47041-1...@kernel.org

I hope it helps you better understand and test the idea.

> 
> [1] https://lore.kernel.org/damon/20220810225102.124459-1...@kernel.org/
> [2] https://lore.kernel.org/damon/20240118171756.80356-1...@kernel.org


Thanks,
SJ

[...]



Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-02-27 Thread SeongJae Park
On Mon, 26 Feb 2024 23:05:46 +0900 Honggyu Kim  wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
> It says there is no implementation of the demote/promote DAMOS action
> are made.  This RFC is about its implementation for physical address
> space.
> 
> 
> Introduction
> 
> 
> With the advent of CXL/PCIe attached DRAM, which will be called simply
> as CXL memory in this cover letter, some systems are becoming more
> heterogeneous having memory systems with different latency and bandwidth
> characteristics.  They are usually handled as different NUMA nodes in
> separate memory tiers and CXL memory is used as slow tiers because of
> its protocol overhead compared to local DRAM.
> 
> In this kind of systems, we need to be careful placing memory pages on
> proper NUMA nodes based on the memory access frequency.  Otherwise, some
> frequently accessed pages might reside on slow tiers and it makes
> performance degradation unexpectedly.  Moreover, the memory access
> patterns can be changed at runtime.
> 
> To handle this problem, we need a way to monitor the memory access
> patterns and migrate pages based on their access temperature.  The
> DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation
> Schemes) can be useful features for monitoring and migrating pages.
> DAMOS provides multiple actions based on DAMON monitoring results and it
> can be used for proactive reclaim, which means swapping cold pages out
> with DAMOS_PAGEOUT action, but it doesn't support migration actions such
> as demotion and promotion between tiered memory nodes.
> 
> This series supports two new DAMOS actions; DAMOS_DEMOTE for demotion
> from fast tiers and DAMOS_PROMOTE for promotion from slow tiers.  This
> prevents hot pages from being stuck on slow tiers, which makes
> performance degradation and cold pages can be proactively demoted to
> slow tiers so that the system can increase the chance to allocate more
> hot pages to fast tiers.
> 
> The DAMON provides various tuning knobs but we found that the proactive
> demotion for cold pages is especially useful when the system is running
> out of memory on its fast tier nodes.
> 
> Our evaluation result shows that it reduces the performance slowdown
> compared to the default memory policy from 15~17% to 4~5% when the
> system runs under high memory pressure on its fast tier DRAM nodes.
> 
> 
> DAMON configuration
> ===
> 
> The specific DAMON configuration doesn't have to be in the scope of this
> patch series, but some rough idea is better to be shared to explain the
> evaluation result.
> 
> The DAMON provides many knobs for fine tuning but its configuration file
> is generated by HMSDK[2].  It includes gen_config.py script that
> generates a json file with the full config of DAMON knobs and it creates
> multiple kdamonds for each NUMA node when the DAMON is enabled so that
> it can run hot/cold based migration for tiered memory.

I was feeling a bit confused from here, since DAMON doesn't receive parameters
via a file.  To my understanding, the 'configuration file' means the input file
for the DAMON user-space tool, damo, not for DAMON itself.  Just a trivial thing,
but making it clear if possible could help readers, in my opinion.

> 
> 
> Evaluation Workload
> ===
> 
> The performance evaluation is done with redis[3], which is a widely used
> in-memory database and the memory access patterns are generated via
> YCSB[4].  We have measured two different workloads with zipfian and
> latest distributions but their configs are slightly modified to make
> memory usage higher and execution time longer for better evaluation.
> 
> The idea of evaluation using these demote and promote actions covers
> system-wide memory management rather than partitioning hot/cold pages of
> a single workload.  The default memory allocation policy creates pages
> to the fast tier DRAM node first, then allocates newly created pages to
> the slow tier CXL node when the DRAM node has insufficient free space.
> Once the page allocation is done then those pages never move between
> NUMA nodes.  It's not true when using numa balancing, but it is not the
> scope of this DAMON based 2-tier memory management support.
> 
> If the working set of redis can be fit fully into the DRAM node, then
> the redis will access the fast DRAM only.  Since the performance of DRAM
> only is faster than partially accessing CXL memory in slow tiers, this
> environment is not useful to evaluate this patch series.
> 
> To make pages of redis be distributed across fast DRAM node and slow
> CXL node to evaluate our demote and promote actions, we pre-allocate
> some col

[RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-02-26 Thread Honggyu Kim
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It says there is no implementation of the demote/promote DAMOS action
are made.  This RFC is about its implementation for physical address
space.


Introduction


With the advent of CXL/PCIe attached DRAM, which will be referred to simply
as CXL memory in this cover letter, some systems are becoming more
heterogeneous, having memory systems with different latency and bandwidth
characteristics.  They are usually handled as different NUMA nodes in
separate memory tiers, and CXL memory is used as a slow tier because of
its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
the proper NUMA nodes based on memory access frequency.  Otherwise, some
frequently accessed pages might reside on slow tiers, which causes
unexpected performance degradation.  Moreover, the memory access
patterns can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature.  The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based Operation
Schemes) can be useful features for monitoring and migrating pages.
DAMOS provides multiple actions based on DAMON monitoring results, and it
can be used for proactive reclaim, which means swapping cold pages out
with the DAMOS_PAGEOUT action, but it doesn't support migration actions such
as demotion and promotion between tiered memory nodes.

This series supports two new DAMOS actions: DAMOS_DEMOTE for demotion
from fast tiers and DAMOS_PROMOTE for promotion from slow tiers.  This
prevents hot pages from being stuck on slow tiers, which causes
performance degradation, and cold pages can be proactively demoted to
slow tiers so that the system has a better chance of allocating more
hot pages to fast tiers.

DAMON provides various tuning knobs, but we found that the proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 15~17% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.


DAMON configuration
===

The specific DAMON configuration doesn't have to be in the scope of this
patch series, but a rough idea is worth sharing to explain the
evaluation result.

DAMON provides many knobs for fine tuning, but its configuration file
is generated by HMSDK[2].  It includes a gen_config.py script that
generates a json file with the full config of DAMON knobs, and it creates
multiple kdamonds for each NUMA node when DAMON is enabled so that
it can run hot/cold based migration for tiered memory.


Evaluation Workload
===

The performance evaluation is done with redis[3], which is a widely used
in-memory database and the memory access patterns are generated via
YCSB[4].  We have measured two different workloads with zipfian and
latest distributions but their configs are slightly modified to make
memory usage higher and execution time longer for better evaluation.

The idea of evaluation using these demote and promote actions covers
system-wide memory management rather than partitioning hot/cold pages of
a single workload.  The default memory allocation policy allocates pages
on the fast tier DRAM node first, then allocates newly created pages on
the slow tier CXL node when the DRAM node has insufficient free space.
Once the page allocation is done, those pages never move between
NUMA nodes.  This is not true when using numa balancing, but that is not in the
scope of this DAMON based 2-tier memory management support.

If the working set of redis can fit fully into the DRAM node, then
redis will access the fast DRAM only.  Since DRAM-only performance
is faster than partially accessing CXL memory in slow tiers, this
environment is not useful for evaluating this patch series.

To make the pages of redis be distributed across the fast DRAM node and the slow
CXL node to evaluate our demote and promote actions, we pre-allocate
some cold memory externally using mmap and memset before launching
redis-server.  We assumed that there is a sufficient amount of cold memory in
datacenters, as the TMO[5] and TPP[6] papers mentioned.
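
A minimal sketch of such an external cold memory allocator is shown below; the
program name, the size handling, and the NUMA binding (e.g. running it under
numactl --membind pointing at the fast tier DRAM node) are illustrative and not
the exact tool used for this evaluation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
	/* size in GiB given on the command line, e.g. "./cold_alloc 40" */
	size_t gib = argc > 1 ? strtoul(argv[1], NULL, 0) : 1;
	size_t len = gib << 30;
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0, len);	/* touch every page so it is really allocated */
	pause();		/* never touch it again, so it stays cold */
	return 0;
}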

The evaluation sequence is as follows.

1. Turn on DAMON with DAMOS_DEMOTE action for DRAM node and
   DAMOS_PROMOTE action for CXL node.  It demotes cold pages on DRAM
   node and promotes hot pages on CXL node in a regular interval.
2. Allocate a huge block of cold memory by calling mmap and memset at
   the fast tier DRAM node, then make the process sleep to make the fast
   tier has insufficient memory for redis-server.
3. Launch redis-server and load prebaked snapshot image, dump.rdb.  The
   redis-server consumes 52GB of anon pages and 33GB of file pages, but
   due to the cold memory allocated at 2

Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory

2024-01-18 Thread SeongJae Park
On Thu, 18 Jan 2024 19:40:16 +0900 Hyeongtak Ji  wrote:

> Hi SeongJae,
> 
> On Wed, 17 Jan 2024 SeongJae Park  wrote:
> 
> [...]
> >> Let's say there are 3 nodes in the system and the first node0 and node1
> >> are the first tier, and node2 is the second tier.
> >> 
> >>   $ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist
> >>   0-1
> >> 
> >>   $ cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
> >>   2
> >> 
> >> Here is the result of partitioning hot/cold memory and I put execution
> >> command at the right side of numastat result.  I initially ran each
> >> hot_cold program with preferred setting so that they initially allocate
> >> memory on one of node0 or node2, but they gradually migrated based on
> >> their access frequencies.
> >> 
> >>   $ numastat -c -p hot_cold
> >>   Per-node process memory usage (in MBs) 
> >>   PID  Node 0 Node 1 Node 2 Total 
> >>   ---  -- -- -- - 
> >>   754 (hot_cold) 1800  0   2000  3800<- hot_cold 1800 2000 
> >>   1184 (hot_cold) 300  0500   800<- hot_cold 300 500 
> >>   1818 (hot_cold) 801  0   3199  4000<- hot_cold 800 3200 
> >>   30289 (hot_cold)  4  0  510<- hot_cold 3 5 
> >>   30325 (hot_cold) 31  0 5181<- hot_cold 30 50 
> >>   ---  -- -- -- - 
> >>   Total  2938  0   5756  8695
> >> 
> >> The final node placement result shows that DAMON accurately migrated
> >> pages by their hotness for multiple processes.
> >
> > What was the result when the corner cases handling logics were not applied?
> 
> This is the result of the same test that Honggyu did, but with an insufficient
> corner cases handling logics.
> 
>   $ numastat -c -p hot_cold
> 
>   Per-node process memory usage (in MBs)
>   PID Node 0 Node 1 Node 2 Total
>   --  -- -- -- -
>   862 (hot_cold)2256  0   1545  3801   <- hot_cold 1800 2000
>   863 (hot_cold) 403  0398   801   <- hot_cold 300 500
>   864 (hot_cold)1520  0   2482  4001   <- hot_cold 800 3200
>   865 (hot_cold)   6  0  3 9   <- hot_cold 3 5
>   866 (hot_cold)  29  0 5281   <- hot_cold 30 50
>   --  -- -- -- -
>   Total 4215  0   4480  8695
> 
> As time goes by, DAMON keeps trying to split the hot/cold region, but it does
> not seem to be enough.
> 
>   $ numastat -c -p hot_cold
> 
>   Per-node process memory usage (in MBs)
>   PID Node 0 Node 1 Node 2 Total
>   --  -- -- -- -
>   862 (hot_cold)2022  0   1780  3801   <- hot_cold 1800 2000
>   863 (hot_cold) 351  0450   801   <- hot_cold 300 500
>   864 (hot_cold)1134  0   2868  4001   <- hot_cold 800 3200
>   865 (hot_cold)   7  0  2 9   <- hot_cold 3 5
>   866 (hot_cold)  43  0 3981   <- hot_cold 30 50
>   --  -- -- -- -
>   Total 3557  0   5138  8695
> 
> >
> > And, what are the corner cases handling logic that seemed essential?  I show
> > the page granularity active/reference check could indeed provide many
> > improvements, but that's only my humble assumption.
> 
> Yes, the page granularity active/reference check is essential.  To make the
> above "insufficient" result, the only thing I did was to promote
> inactive/not_referenced pages.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f03be320f9ad..c2aefb883c54 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1127,9 +1127,7 @@ static unsigned int __promote_folio_list(struct 
> list_head *folio_list,
> VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> 
> references = folio_check_references(folio, sc);
> -   if (references == FOLIOREF_KEEP ||
> -   references == FOLIOREF_RECLAIM ||
> -   references == FOLIOREF_RECLAIM_CLEAN)
> +   if (references == FOLIOREF_KEEP )
> goto keep_locked;
> 
> /* Relocate its contents to another node. */

Thank you for sharing the details :)  I think the DAMOS filters based approach
could be worth trying, then.

> 
> >
> > If the corner cases are indeed better to be applied in page granularity, I
> > agree we need some more efforts since DAMON monitoring results are not page
> > granularity aware by the design.  Users could increase min_nr_regions to 
> > make
> > it more accurate, and we have plan to support page granularity monitoring,
> > though.  But maybe the overhead could be unacceptable.
> >
> > Ideal solution would be making DAMON more accurate while keeping current 
> > level
> > of overhead.  We indeed have TODO items for DAMON accuracy improvement, but
> > this may take some time that might unacceptable for your case.
> >
> > If that's the case, I think the additional corner handling (or, 

Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory

2024-01-18 Thread Hyeongtak Ji
Hi SeongJae,

On Wed, 17 Jan 2024 SeongJae Park  wrote:

[...]
>> Let's say there are 3 nodes in the system and the first node0 and node1
>> are the first tier, and node2 is the second tier.
>> 
>>   $ cat /sys/devices/virtual/memory_tiering/memory_tier4/nodelist
>>   0-1
>> 
>>   $ cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
>>   2
>> 
>> Here is the result of partitioning hot/cold memory and I put execution
>> command at the right side of numastat result.  I initially ran each
>> hot_cold program with preferred setting so that they initially allocate
>> memory on one of node0 or node2, but they gradually migrated based on
>> their access frequencies.
>> 
>>   $ numastat -c -p hot_cold
>>   Per-node process memory usage (in MBs) 
>>   PID  Node 0 Node 1 Node 2 Total 
>>   ---  -- -- -- - 
>>   754 (hot_cold) 1800  0   2000  3800<- hot_cold 1800 2000 
>>   1184 (hot_cold) 300  0500   800<- hot_cold 300 500 
>>   1818 (hot_cold) 801  0   3199  4000<- hot_cold 800 3200 
>>   30289 (hot_cold)  4  0  510<- hot_cold 3 5 
>>   30325 (hot_cold) 31  0 5181<- hot_cold 30 50 
>>   ---  -- -- -- - 
>>   Total  2938  0   5756  8695
>> 
>> The final node placement result shows that DAMON accurately migrated
>> pages by their hotness for multiple processes.
>
> What was the result when the corner cases handling logics were not applied?

This is the result of the same test that Honggyu did, but with insufficient
corner case handling logic.

  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID Node 0 Node 1 Node 2 Total
  --  -- -- -- -
  862 (hot_cold)2256  0   1545  3801   <- hot_cold 1800 2000
  863 (hot_cold) 403  0398   801   <- hot_cold 300 500
  864 (hot_cold)1520  0   2482  4001   <- hot_cold 800 3200
  865 (hot_cold)   6  0  3 9   <- hot_cold 3 5
  866 (hot_cold)  29  0 5281   <- hot_cold 30 50
  --  -- -- -- -
  Total 4215  0   4480  8695

As time goes by, DAMON keeps trying to split the hot/cold region, but it does
not seem to be enough.

  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID Node 0 Node 1 Node 2 Total
  --  -- -- -- -
  862 (hot_cold)2022  0   1780  3801   <- hot_cold 1800 2000
  863 (hot_cold) 351  0450   801   <- hot_cold 300 500
  864 (hot_cold)1134  0   2868  4001   <- hot_cold 800 3200
  865 (hot_cold)   7  0  2 9   <- hot_cold 3 5
  866 (hot_cold)  43  0 3981   <- hot_cold 30 50
  --  -- -- -- -
  Total 3557  0   5138  8695

>
> And, what are the corner cases handling logic that seemed essential?  I show
> the page granularity active/reference check could indeed provide many
> improvements, but that's only my humble assumption.

Yes, the page granularity active/reference check is essential.  To produce the
above "insufficient" result, the only thing I did was to promote
inactive/not_referenced pages.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f03be320f9ad..c2aefb883c54 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1127,9 +1127,7 @@ static unsigned int __promote_folio_list(struct list_head *folio_list,
 	VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
 
 	references = folio_check_references(folio, sc);
-	if (references == FOLIOREF_KEEP ||
-	    references == FOLIOREF_RECLAIM ||
-	    references == FOLIOREF_RECLAIM_CLEAN)
+	if (references == FOLIOREF_KEEP )
 		goto keep_locked;
 
 	/* Relocate its contents to another node. */

>
> If the corner cases are indeed better to be applied in page granularity, I
> agree we need some more efforts since DAMON monitoring results are not page
> granularity aware by the design.  Users could increase min_nr_regions to make
> it more accurate, and we have plan to support page granularity monitoring,
> though.  But maybe the overhead could be unacceptable.
>
> Ideal solution would be making DAMON more accurate while keeping current level
> of overhead.  We indeed have TODO items for DAMON accuracy improvement, but
> this may take some time that might unacceptable for your case.
>
> If that's the case, I think the additional corner handling (or, page gran
> additional access check) could be made as DAMOS filters[1], since DAMOS 
> filters
> can be applied in page granularity, and designed for this kind of handling of
> information that DAMON monitoring results cannot provide.  More specifically,
> we could have filters for promotion-qualifying pages and demotion-qualifying
> pages.  In this way, I think we can keep the action 

Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory

2024-01-17 Thread SeongJae Park
On Wed, 17 Jan 2024 13:11:03 -0800 SeongJae Park  wrote:

[...]
> Hi Honggyu,
> 
> On Wed, 17 Jan 2024 20:49:25 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > Thanks very much for your comments in details.
> > 
> > On Tue, 16 Jan 2024 12:31:59 -0800 SeongJae Park  wrote:
> > 
[...]
> > > To this end, I feel the problem might be able te be simpler, because this
> > > patchset is trying to provide two sophisticated operations, while I think 
> > > a
> > > simpler approach might be possible.  My humble simpler idea is adding a 
> > > DAMOS
> > > operation for moving pages to a given node (like sys_move_phy_pages 
> > > RFC[1]),
> > > instead of the promote/demote.  Because the general pages migration can 
> > > handle
> > > multiple cases including the promote/demote in my humble assumption.
[...]
> > > In more detail, users could decide which is the appropriate node for 
> > > promotion
> > > or demotion and use the new DAMOS action to do promotion and demotion.  
> > > Users
> > > would requested to decide which node is the proper promotion/demotion 
> > > target
> > > nodes, but that decision wouldn't be that hard in my opinion.
> > > 
> > > For this, 'struct damos' would need to be updated for such 
> > > argument-dependent
> > > actions, like 'struct damos_filter' is haing a union.
> > 
> > That might be a better solution.  I will think about it.
> 
More specifically, I think receiving an address range as the argument might be
more flexible than just a NUMA node.  Maybe we can imagine proactively migrating
> cold movable pages from normal zones to movable zones, to avoid normal zone
> memory pressure.

Yet another crazy idea: finding hot regions in the middle of a cold region and
moving them next to other hot pages.  As a result, memory is sorted by access
temperature even within the same node, and the system gains more spatial locality,
which benefits general locality-based algorithms, including DAMON's adaptive
regions adjustment.


Thanks,
SJ

[...]



Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory

2024-01-17 Thread SeongJae Park
ranularity active/reference check could indeed provide many
improvements, but that's only my humble assumption.

If the corner cases are indeed better to be applied in page granularity, I
agree we need some more effort, since DAMON monitoring results are not page
granularity aware by design.  Users could increase min_nr_regions to make
it more accurate, and we have a plan to support page granularity monitoring,
though.  But maybe the overhead could be unacceptable.

An ideal solution would be making DAMON more accurate while keeping the current level
of overhead.  We indeed have TODO items for DAMON accuracy improvement, but
this may take some time, which might be unacceptable for your case.

If that's the case, I think the additional corner handling (or, the page granularity
additional access check) could be made as DAMOS filters[1], since DAMOS filters
can be applied in page granularity, and are designed for this kind of handling of
information that DAMON monitoring results cannot provide.  More specifically,
we could have filters for promotion-qualifying pages and demotion-qualifying
pages.  In this way, I think we can keep the action more flexible while the
filters can be applied in creative ways.

[1] https://git.kernel.org/sj/c/98def236f63c66629fb6b2d4b69cecffc5b46539

> 
> > In more detail, users could decide which is the appropriate node for 
> > promotion
> > or demotion and use the new DAMOS action to do promotion and demotion.  
> > Users
> > would requested to decide which node is the proper promotion/demotion target
> > nodes, but that decision wouldn't be that hard in my opinion.
> > 
> > For this, 'struct damos' would need to be updated for such 
> > argument-dependent
> > actions, like 'struct damos_filter' is haing a union.
> 
> That might be a better solution.  I will think about it.

More specifically, I think receiving an address range as the argument might be
more flexible than just a NUMA node.  Maybe we can imagine proactively migrating
cold movable pages from normal zones to movable zones, to avoid normal zone
memory pressure.

> 
> > In future, we could extend the operation to the promotion and the demotion
> > after the discussion around the promotion and demotion has matured, if
> > required.  And assuming DAMON gets extended for originating CPU-aware access
> > monitoring, the new DAMOS action would also cover more use cases such as
> > general NUMA nodes balancing (extending DAMON for CPU-aware monitoring would
> > be required), and some complex configurations that need both CPU affinity
> > and tiered memory.  I also think that may well fit with my RFC idea[2] for
> > tiered memory management.
> > 
> > Looking forward to opinions from you and others.  I admit I miss many
> > things, and am more than happy to be enlightened.
> > 
> > [1] https://lwn.net/Articles/944007/
> > [2] https://lore.kernel.org/damon/20231112195602.61525-1...@kernel.org/
> 
> Thanks very much for your comments.  I will need a few more days for the
> update but will try to address your concerns as much as possible.

No problem, please take your time.  I'm looking forward to the next version :)


Thanks,
SJ

> 
> Thanks,
> Honggyu



Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory

2024-01-17 Thread Honggyu Kim
> And assuming DAMON gets extended for originating CPU-aware access monitoring,
> the new DAMOS action would also cover more use cases such as general NUMA
> nodes balancing (extending DAMON for CPU-aware monitoring would be required),
> and some complex configurations that need both CPU affinity and tiered
> memory.  I also think that may well fit with my RFC idea[2] for tiered memory
> management.
> 
> Looking forward to opinions from you and others.  I admit I miss many things,
> and am more than happy to be enlightened.
> 
> [1] https://lwn.net/Articles/944007/
> [2] https://lore.kernel.org/damon/20231112195602.61525-1...@kernel.org/

Thanks very much for your comments.  I will need a few more days for the
update but will try to address your concerns as much as possible.

Thanks,
Honggyu



Re: [RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory

2024-01-16 Thread SeongJae Park
Hello,

On Mon, 15 Jan 2024 13:52:48 +0900 Honggyu Kim  wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
> It says that no implementation of the demote/promote DAMOS actions has been
> made.  This RFC is about their implementation for the physical address
> space.
> 
[...]
> Evaluation Results
> ==================
> 
[...]
> In summary of both results, our evaluation shows that "DAMON 2-tier"
> memory management reduces the performance slowdown compared to the
> "default" memory policy from 15~17% to 4~5% when the system runs with
> high memory pressure on its fast tier DRAM nodes.
> 
> A similar evaluation was done on another machine that has 256GB of
> local DRAM and 96GB of CXL memory.  The performance slowdown is reduced
> from 20~24% for "default" to 5~7% for "DAMON 2-tier".
> 
> Having these DAMOS_DEMOTE and DAMOS_PROMOTE actions can make 2-tier
> memory systems run more efficiently under high memory pressures.


Thank you so much for these great patches and the above nice test results.  I
believe the test setup and results make sense, and merging a revised version of
this patchset would provide real benefits to the users.

At a high level, I think it might be better to separate DAMON internal changes
from DAMON-external changes.

For DAMON part changes, I have no big concern other than trivial coding style
level comments.

For the DAMON-external changes that implement demote_pages() and
promote_pages(), I'm unsure if the implementation is reusing appropriate
functions, and if those are placed in the right source files.  Especially, I'm
unsure if vmscan.c is the right place for the promotion code.  Also, I don't
know if there is a good agreement on the promotion/demotion target node
decision.  That could be because I'm not that familiar with the areas and the
files, but I feel this might also be because our discussions on the promotion
and demotion operations still have room to mature.  Because I'm not very
familiar with that part, I'd like to hear others' comments, too.

To this end, I feel the problem might be made simpler, because this patchset is
trying to provide two sophisticated operations, while I think a simpler
approach might be possible.  My humble simpler idea is adding a DAMOS operation
for moving pages to a given node (like the sys_move_phy_pages RFC[1]), instead
of the promote/demote, because general page migration can handle multiple
cases including the promote/demote, in my humble assumption.  In more detail,
users could decide which is the appropriate node for promotion or demotion and
use the new DAMOS action to do promotion and demotion.  Users would be
requested to decide which node is the proper promotion/demotion target nodes,
but that decision wouldn't be that hard in my opinion.

For this, 'struct damos' would need to be updated for such argument-dependent
actions, like 'struct damos_filter' is having a union.
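
To make the idea concrete, below is a purely illustrative sketch, not the
in-tree 'struct damos' definition; the action and field names are made up:

/* Illustrative sketch only, not the actual DAMON code. */
enum damos_action_sketch {
	DAMOS_SKETCH_PAGEOUT,	/* existing style of action, takes no argument */
	DAMOS_SKETCH_MIGRATE,	/* hypothetical: move the region's pages to arg.target_nid */
};

struct damos_sketch {
	/* ... access pattern, quotas, watermarks, filters ... */
	enum damos_action_sketch action;
	union {			/* argument-dependent, like 'struct damos_filter' */
		int target_nid;	/* used when action == DAMOS_SKETCH_MIGRATE */
	} arg;
};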

In future, we could extend the operation to the promotion and the demotion
after the discussion around the promotion and demotion has matured, if
required.  And assuming DAMON gets extended for originating CPU-aware access
monitoring, the new DAMOS action would also cover more use cases such as
general NUMA nodes balancing (extending DAMON for CPU-aware monitoring would be
required), and some complex configurations that need both CPU affinity and
tiered memory.  I also think that may well fit with my RFC idea[2] for tiered
memory management.

Looking forward to opinions from you and others.  I admit I miss many things,
and am more than happy to be enlightened.

[1] https://lwn.net/Articles/944007/
[2] https://lore.kernel.org/damon/20231112195602.61525-1...@kernel.org/


Thanks,
SJ

> 
> Signed-off-by: Honggyu Kim 
> Signed-off-by: Hyeongtak Ji 
> Signed-off-by: Rakie Kim 
> 
> [1] https://lore.kernel.org/damon/20231112195602.61525-1...@kernel.org
> [2] https://github.com/skhynix/hmsdk
> [3] https://github.com/redis/redis/tree/7.0.0
> [4] https://github.com/brianfrankcooper/YCSB/tree/0.17.0
> [5] https://dl.acm.org/doi/10.1145/3503222.3507731
> [6] https://dl.acm.org/doi/10.1145/3582016.3582063
> 
> Honggyu Kim (2):
>   mm/vmscan: refactor reclaim_pages with reclaim_or_migrate_folios
>   mm/damon: introduce DAMOS_DEMOTE action for demotion
> 
> Hyeongtak Ji (2):
>   mm/memory-tiers: add next_promotion_node to find promotion target
>   mm/damon: introduce DAMOS_PROMOTE action for promotion
> 
>  include/linux/damon.h  |   4 +
>  include/linux/memory-tiers.h   |  11 ++
>  include/linux/migrate_mode.h   |   1 +
>  include/linux/vm_event_item.h  |   1 +
>  include/trace/events/migrate.h |   3 +-
>  mm/damon/paddr.c   |  46 ++-
>  mm/damon/sysfs-schemes.c   |   2 +
>  mm/internal.h  |   2 +
>  mm

[RFC PATCH 0/4] DAMON based 2-tier memory management for CXL memory

2024-01-14 Thread Honggyu Kim
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It says that no implementation of the demote/promote DAMOS actions has been
made.  This RFC is about their implementation for the physical address
space.


Introduction
============

With the advent of CXL/PCIe attached DRAM, which will be called simply
CXL memory in this cover letter, some systems are becoming more
heterogeneous, having memory systems with different latency and
bandwidth characteristics.  They are usually handled as different NUMA
nodes in separate memory tiers, and CXL memory is used as a slow tier
because of its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
the proper NUMA nodes based on the memory access frequency.  Otherwise,
some frequently accessed pages might reside on slow tiers, which causes
unexpected performance degradation.  Moreover, the memory access
patterns can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature.  The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) can be useful features for monitoring and migrating
pages.  DAMOS provides multiple actions based on DAMON monitoring
results and can be used for proactive reclaim, which means swapping
cold pages out with the DAMOS_PAGEOUT action, but it doesn't support
migration actions such as demotion and promotion between tiered memory
nodes.

This series supports two new DAMOS actions: DAMOS_DEMOTE for demotion
from fast tiers and DAMOS_PROMOTE for promotion from slow tiers.  This
prevents hot pages from being stuck on slow tiers, which would degrade
performance, and cold pages can be proactively demoted to slow tiers so
that the system has a better chance to allocate more hot pages to fast
tiers.

DAMON provides various tuning knobs, but we found that the proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 15~17% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.


DAMON configuration
===================
The specific DAMON configuration doesn't have to be in the scope of this
patch series, but a rough idea is better shared to explain the
evaluation result.

DAMON provides many knobs for fine tuning, but its configuration file is
generated by HMSDK[2].  It includes a gen_config.py script that
generates a json file with the full config of DAMON knobs, and it
creates multiple kdamonds for each NUMA node when DAMON is enabled so
that it can run hot/cold based migration for tiered memory.


Evaluation Workload
===================

The performance evaluation is done with redis[3], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[4].  We have measured two different workloads, with zipfian and
latest distributions, but their configs are slightly modified to make
memory usage higher and execution time longer for better evaluation.

The idea of the evaluation using these demote and promote actions covers
system-wide memory management rather than partitioning hot/cold pages of
a single workload.  The default memory allocation policy creates pages
on the fast tier DRAM node first, then allocates newly created pages on
the slow tier CXL node when the DRAM node has insufficient free space.
Once the page allocation is done, those pages never move between NUMA
nodes.  That is not true when NUMA balancing is used, but that is out of
the scope of this DAMON based 2-tier memory management support.

If the working set of redis fits fully into the DRAM node, then redis
will access the fast DRAM only.  Since DRAM-only performance is better
than partially accessing CXL memory in slow tiers, this environment is
not useful for evaluating this patch series.

To make the pages of redis be distributed across the fast DRAM node and
the slow CXL node to evaluate our demote and promote actions, we
pre-allocate some cold memory externally using mmap and memset before
launching redis-server.  We assumed that there is a large enough amount
of cold memory in datacenters, as the TMO[5] and TPP[6] papers
mentioned.

The evaluation sequence is as follows.

1. Turn on DAMON with the DAMOS_DEMOTE action for the DRAM node and the
   DAMOS_PROMOTE action for the CXL node.  It demotes cold pages on the
   DRAM node and promotes hot pages on the CXL node at a regular
   interval.
2. Allocate a huge block of cold memory by calling mmap and memset at
   the fast tier DRAM node, then make the process sleep so that the fast
   tier has insufficient memory for redis-server (a minimal sketch of
   this step follows after this list).
3. Launch redis-server and load a prebaked snapshot image, dump.rdb.  The
   redis-server consumes 52GB of anon pages and 33GB of file pages, but
   due to the cold memory allocated in step 2
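
The cold memory block mentioned in step 2 can be produced with a small helper
like the one below.  This is a minimal sketch: the 64GB size is an arbitrary
example, and it assumes the default local allocation policy places the pages
on the fast tier DRAM node.

/* cold_alloc.c: pre-allocate a block of never-again-touched (cold) memory. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define COLD_BYTES	(64UL << 30)	/* example size; adjust to the test setup */

int main(void)
{
	void *p = mmap(NULL, COLD_BYTES, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, COLD_BYTES);	/* fault the pages in on the DRAM node */
	pause();			/* keep them resident without accessing them again */
	return 0;
}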

[PATCH v2 3/5] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT

2021-04-19 Thread David Hildenbrand
MEMORY MANAGEMENT seems to be a good fit.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Jason Gunthorpe 
Cc: Peter Xu 
Cc: Shuah Khan 
Cc: linux-kselft...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9450e052f1b1..cd267d218e08 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11566,6 +11566,7 @@ F:  include/linux/mm.h
 F: include/linux/mmzone.h
 F: include/linux/vmalloc.h
 F: mm/
+F: tools/testing/selftests/vm/
 
 MEMORY TECHNOLOGY DEVICES (MTD)
 M: Miquel Raynal 
-- 
2.30.2



Re: [PATCH 1/2] MAINTAINERS: assign pagewalk.h to MEMORY MANAGEMENT

2021-03-31 Thread Mike Rapoport
On Mon, Mar 22, 2021 at 01:25:41PM +0100, Lukas Bulwahn wrote:
> Commit a520110e4a15 ("mm: split out a new pagewalk.h header from mm.h")
> adds a new file in ./include/linux, but misses to update MAINTAINERS
> accordingly. Hence, ./scripts/get_maintainers.pl ./include/linux/pagewalk.h
> points only to lkml as general fallback for all files, whereas the original
> ./include/linux/mm.h clearly marks this file part of MEMORY MANAGEMENT.
> 
> Assign ./include/linux/pagewalk.h to MEMORY MANAGEMENT.
> 
> Signed-off-by: Lukas Bulwahn 
> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 728216e3919c..46a1eddbc3e4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11622,6 +11622,7 @@ F:include/linux/gfp.h
>  F:   include/linux/memory_hotplug.h
>  F:   include/linux/mm.h
>  F:   include/linux/mmzone.h
> +F:   include/linux/pagewalk.h

I'd say that we miss all include/linux/page* here, not only pagewalk.h

>  F:   include/linux/vmalloc.h
>  F:   mm/
>  
> -- 
> 2.17.1
> 

-- 
Sincerely yours,
Mike.


[PATCH 1/2] MAINTAINERS: assign pagewalk.h to MEMORY MANAGEMENT

2021-03-22 Thread Lukas Bulwahn
Commit a520110e4a15 ("mm: split out a new pagewalk.h header from mm.h")
adds a new file in ./include/linux, but misses to update MAINTAINERS
accordingly. Hence, ./scripts/get_maintainers.pl ./include/linux/pagewalk.h
points only to lkml as general fallback for all files, whereas the original
./include/linux/mm.h clearly marks this file part of MEMORY MANAGEMENT.

Assign ./include/linux/pagewalk.h to MEMORY MANAGEMENT.

Signed-off-by: Lukas Bulwahn 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 728216e3919c..46a1eddbc3e4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11622,6 +11622,7 @@ F:  include/linux/gfp.h
 F: include/linux/memory_hotplug.h
 F: include/linux/mm.h
 F: include/linux/mmzone.h
+F: include/linux/pagewalk.h
 F: include/linux/vmalloc.h
 F: mm/
 
-- 
2.17.1



[PATCH v1 3/5] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT

2021-03-17 Thread David Hildenbrand
MEMORY MANAGEMENT seems to be a good fit.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Jason Gunthorpe 
Cc: Peter Xu 
Cc: Shuah Khan 
Cc: linux-kselft...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index aa84121c5611..b00963f4aa09 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11560,6 +11560,7 @@ F:  include/linux/mm.h
 F: include/linux/mmzone.h
 F: include/linux/vmalloc.h
 F: mm/
+F: tools/testing/selftests/vm/
 
 MEMORY TECHNOLOGY DEVICES (MTD)
 M: Miquel Raynal 
-- 
2.29.2



[PATCH v2 0/3] Soft limit memory management bug fixes

2021-02-17 Thread Tim Chen
During testing of tiered memory management based on memory soft limit,
I found three issues with memory management using cgroup based soft
limit in the mainline code.  Fix the issues with the three patches in
this series.  Also updated patch 3 per Johannes' comments on the first
version of this patch.

Thanks.

Tim

Changelog:
v2
1. Do the soft limit tree uncharge update in batches of the same node only
for v1 cgroups that have a soft limit.  Batching by node is only
relevant for cgroup v1, which has a per-node soft limit tree.

Tim Chen (3):
  mm: Fix dropped memcg from mem cgroup soft limit tree
  mm: Force update of mem cgroup soft limit tree on usage excess
  mm: Fix missing mem cgroup soft limit tree updates

 mm/memcontrol.c | 27 +++
 1 file changed, 23 insertions(+), 4 deletions(-)

-- 
2.20.1



[PATCH 0/3] Soft limit memory management bug fixes

2021-02-09 Thread Tim Chen
During testing of tiered memory management based on memory soft limit, I found
three issues with memory management using cgroup based soft limit in the
mainline code.  Fix the issues with the three patches in this series.

Tim Chen (3):
  mm: Fix dropped memcg from mem cgroup soft limit tree
  mm: Force update of mem cgroup soft limit tree on usage excess
  mm: Fix missing mem cgroup soft limit tree updates

 mm/memcontrol.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

-- 
2.20.1



[RFC V2 03/37] dmem: implement dmem memory management

2020-12-07 Thread yulei . kernel
From: Yulei Zhang 

The following figure shows the topology of dmem memory
management.  It reserves a few memory regions from the
NUMA nodes, and in each region it leverages a bitmap to
track the actual memory usage.

 +--+-+---+-+
 | Node0| | ...   | NodeN   |
 +--/-\-+-+---+-+
   /   \
  +---v+v-+--+
  |region 0| ...  | region n |
  +--/---\-+--+--+
/ \
+-+v+--v---+-+-+-+
| | |   bitmap | | | |
+-+-+--+-+-+-+

It introduces the interfaces to manage dmem pages, which include:
 - dmem_region_register(), which registers the reserved memory to the
   dmem management system

 - dmem_alloc_init(), which initiates the dmem allocator; note that the
   page size the allocator uses isn't the same thing as the alignment
   used to reserve dmem memory

 - dmem_alloc_pages_vma() and dmem_free_pages(), which are the interfaces
   for allocating and freeing dmem memory; multiple pages can be
   allocated at one time, but the count should be a power of two

Signed-off-by: Xiao Guangrong 
Signed-off-by: Yulei Zhang 
---
 include/linux/dmem.h |   3 +
 mm/dmem.c| 674 +++
 2 files changed, 677 insertions(+)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 5049322..476a82e 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -7,6 +7,9 @@
 void dmem_init(void);
 int dmem_region_register(int node, phys_addr_t start, phys_addr_t end);
 
+int dmem_alloc_init(unsigned long dpage_shift);
+void dmem_alloc_uinit(void);
+
 #else
 static inline int dmem_reserve_init(void)
 {
diff --git a/mm/dmem.c b/mm/dmem.c
index b5fb4f1..a77a064 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -91,11 +91,38 @@ struct dmem_pool {
.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
 };
 
+#define DMEM_PAGE_SIZE (1UL << dmem_pool.dpage_shift)
+#define DMEM_PAGE_UP(x)        phys_to_dpage(((x) + DMEM_PAGE_SIZE - 1))
+#define DMEM_PAGE_DOWN(x)  phys_to_dpage(x)
+
+#define dpage_to_phys(_dpage)  \
+   ((_dpage) << dmem_pool.dpage_shift)
+#define phys_to_dpage(_addr)   \
+   ((_addr) >> dmem_pool.dpage_shift)
+
+#define dpage_to_pfn(_dpage)   \
+   (__phys_to_pfn(dpage_to_phys(_dpage)))
+#define pfn_to_dpage(_pfn) \
+   (phys_to_dpage(__pfn_to_phys(_pfn)))
+
+#define dnode_to_nid(_dnode)   \
+   ((_dnode) - dmem_pool.nodes)
+#define nid_to_dnode(nid)  \
+   (&dmem_pool.nodes[nid])
+
 #define for_each_dmem_node(_dnode) \
for (_dnode = dmem_pool.nodes;  \
_dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes); \
_dnode++)
 
+#define for_each_dmem_region(_dnode, _dregion) \
+   list_for_each_entry(_dregion, &(_dnode)->regions, node)
+
+static inline int *dmem_nodelist(int nid)
+{
+   return nid_to_dnode(nid)->nodelist;
+}
+
 void __init dmem_init(void)
 {
struct dmem_node *dnode;
@@ -135,3 +162,650 @@ int dmem_region_register(int node, phys_addr_t start, 
phys_addr_t end)
return 0;
 }
 
+#define PENALTY_FOR_DMEM_SHARED_NODE   (1)
+
+static int dmem_nodeload[MAX_NUMNODES] __initdata;
+
+/* Evaluate penalty for each dmem node */
+static int __init dmem_evaluate_node(int local, int node)
+{
+   int penalty;
+
+   /* Use the distance array to find the distance */
+   penalty = node_distance(local, node);
+
+   /* Penalize nodes under us ("prefer the next node") */
+   penalty += (node < local);
+
+   /* Give preference to headless and unused nodes */
+   if (!cpumask_empty(cpumask_of_node(node)))
+   penalty += PENALTY_FOR_NODE_WITH_CPUS;
+
+   /* Penalize dmem-node shared with kernel */
+   if (node_state(node, N_MEMORY))
+   penalty += PENALTY_FOR_DMEM_SHARED_NODE;
+
+   /* Slight preference for less loaded node */
+   penalty *= (nr_online_nodes * MAX_NUMNODES);
+
+   penalty += dmem_nodeload[node];
+
+   return penalty;
+}
+
+static int __init find_next_dmem_node(int local, nodemask_t *used_nodes)
+{
+   struct dmem_node *dnode;
+   int node, best_node = NUMA_NO_NODE;
+   int penalty, min_penalty = INT_MAX;
+
+   /* Invalid node is not suitable to call node_distance */
+   if (!node_state(local, N_POSSIBLE))
+   return NUMA_NO_NODE;
+
+   /* Use the local node if we haven't already */
+   if (!node_isset(local, *used_nodes)) {
+   node_set(local, *used_nodes);
+   return local;
+   }
+
+   for_each_dmem_node(dnod

[PATCH 03/35] dmem: implement dmem memory management

2020-10-08 Thread yulei . kernel
From: Yulei Zhang 

It introduces the interfaces to manage dmem pages, which include:
 - dmem_region_register(), which registers the reserved memory to the
   dmem management system; later it can be allocated out for dmemfs

 - dmem_alloc_init(), which initiates the dmem allocator; note that the
   page size the allocator uses isn't the same thing as the alignment
   used to reserve dmem memory

 - dmem_alloc_pages_from_vma() and dmem_free_pages(), which are the
   interfaces for allocating and freeing dmem memory; multiple pages can
   be allocated at one time, but the count should be a power of two

Signed-off-by: Xiao Guangrong 
Signed-off-by: Yulei Zhang 
---
 include/linux/dmem.h |   3 +
 mm/dmem.c| 674 +++
 2 files changed, 677 insertions(+)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 5049322d941c..476a82e8f252 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -7,6 +7,9 @@ int dmem_reserve_init(void);
 void dmem_init(void);
 int dmem_region_register(int node, phys_addr_t start, phys_addr_t end);
 
+int dmem_alloc_init(unsigned long dpage_shift);
+void dmem_alloc_uinit(void);
+
 #else
 static inline int dmem_reserve_init(void)
 {
diff --git a/mm/dmem.c b/mm/dmem.c
index b5fb4f1b92db..a77a064c8d59 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -91,11 +91,38 @@ static struct dmem_pool dmem_pool = {
.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
 };
 
+#define DMEM_PAGE_SIZE (1UL << dmem_pool.dpage_shift)
+#define DMEM_PAGE_UP(x)        phys_to_dpage(((x) + DMEM_PAGE_SIZE - 1))
+#define DMEM_PAGE_DOWN(x)  phys_to_dpage(x)
+
+#define dpage_to_phys(_dpage)  \
+   ((_dpage) << dmem_pool.dpage_shift)
+#define phys_to_dpage(_addr)   \
+   ((_addr) >> dmem_pool.dpage_shift)
+
+#define dpage_to_pfn(_dpage)   \
+   (__phys_to_pfn(dpage_to_phys(_dpage)))
+#define pfn_to_dpage(_pfn) \
+   (phys_to_dpage(__pfn_to_phys(_pfn)))
+
+#define dnode_to_nid(_dnode)   \
+   ((_dnode) - dmem_pool.nodes)
+#define nid_to_dnode(nid)  \
+   (&dmem_pool.nodes[nid])
+
 #define for_each_dmem_node(_dnode) \
for (_dnode = dmem_pool.nodes;  \
_dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes); \
_dnode++)
 
+#define for_each_dmem_region(_dnode, _dregion) \
+   list_for_each_entry(_dregion, &(_dnode)->regions, node)
+
+static inline int *dmem_nodelist(int nid)
+{
+   return nid_to_dnode(nid)->nodelist;
+}
+
 void __init dmem_init(void)
 {
struct dmem_node *dnode;
@@ -135,3 +162,649 @@ int dmem_region_register(int node, phys_addr_t start, 
phys_addr_t end)
return 0;
 }
 
+#define PENALTY_FOR_DMEM_SHARED_NODE   (1)
+
+static int dmem_nodeload[MAX_NUMNODES] __initdata;
+
+/* Evaluate penalty for each dmem node */
+static int __init dmem_evaluate_node(int local, int node)
+{
+   int penalty;
+
+   /* Use the distance array to find the distance */
+   penalty = node_distance(local, node);
+
+   /* Penalize nodes under us ("prefer the next node") */
+   penalty += (node < local);
+
+   /* Give preference to headless and unused nodes */
+   if (!cpumask_empty(cpumask_of_node(node)))
+   penalty += PENALTY_FOR_NODE_WITH_CPUS;
+
+   /* Penalize dmem-node shared with kernel */
+   if (node_state(node, N_MEMORY))
+   penalty += PENALTY_FOR_DMEM_SHARED_NODE;
+
+   /* Slight preference for less loaded node */
+   penalty *= (nr_online_nodes * MAX_NUMNODES);
+
+   penalty += dmem_nodeload[node];
+
+   return penalty;
+}
+
+static int __init find_next_dmem_node(int local, nodemask_t *used_nodes)
+{
+   struct dmem_node *dnode;
+   int node, best_node = NUMA_NO_NODE;
+   int penalty, min_penalty = INT_MAX;
+
+   /* Invalid node is not suitable to call node_distance */
+   if (!node_state(local, N_POSSIBLE))
+   return NUMA_NO_NODE;
+
+   /* Use the local node if we haven't already */
+   if (!node_isset(local, *used_nodes)) {
+   node_set(local, *used_nodes);
+   return local;
+   }
+
+   for_each_dmem_node(dnode) {
+   if (list_empty(&dnode->regions))
+   continue;
+
+   node = dnode_to_nid(dnode);
+
+   /* Don't want a node to appear more than once */
+   if (node_isset(node, *used_nodes))
+   continue;
+
+   penalty = dmem_evaluate_node(local, node);
+
+   if (penalty < min_penalty) {
+   min_penalty = penalty;
+   best_node = node;
+   }
+   }
+
+ 

Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management

2020-06-08 Thread Rafael J. Wysocki
On Mon, Jun 8, 2020 at 5:33 PM Rafael J. Wysocki  wrote:
>
> On Saturday, June 6, 2020 8:56:26 AM CEST Rafael J. Wysocki wrote:
> > On Fri, Jun 5, 2020 at 7:09 PM Dan Williams  
> > wrote:
> > >
> > > On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki  
> > > wrote:
> > > >
> > > > From: Rafael J. Wysocki 
> > > > Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory 
> > > > management
> > > >
> > > > The ACPI OS layer uses RCU to protect the list of ACPI memory
> > > > mappings from being walked while it is updated.  Among other
> > > > situations, that list can be walked in non-NMI interrupt context,
> > > > so using a sleeping lock to protect it is not an option.
> > > >
> > > > However, performance issues related to the RCU usage in there
> > > > appear, as described by Dan Williams:
> > > >
> > > > "Recently a performance problem was reported for a process invoking
> > > > a non-trivial ASL program. The method call in this case ends up
> > > > repetitively triggering a call path like:
> > > >
> > > > acpi_ex_store
> > > > acpi_ex_store_object_to_node
> > > > acpi_ex_write_data_to_field
> > > > acpi_ex_insert_into_field
> > > > acpi_ex_write_with_update_rule
> > > > acpi_ex_field_datum_io
> > > > acpi_ex_access_region
> > > > acpi_ev_address_space_dispatch
> > > > acpi_ex_system_memory_space_handler
> > > > acpi_os_map_cleanup.part.14
> > > > _synchronize_rcu_expedited.constprop.89
> > > > schedule
> > > >
> > > > The end result of frequent synchronize_rcu_expedited() invocation is
> > > > tiny sub-millisecond spurts of execution where the scheduler freely
> > > > migrates this apparently sleepy task. The overhead of frequent
> > > > scheduler invocation multiplies the execution time by a factor
> > > > of 2-3X."
> > > >
> > > > In order to avoid these issues, replace the RCU in the ACPI OS
> > > > layer by an rwlock.
> > > >
> > > > That rwlock should not be frequently contended, so the performance
> > > > impact of it is not expected to be significant.
> > > >
> > > > Reported-by: Dan Williams 
> > > > Signed-off-by: Rafael J. Wysocki 
> > > > ---
> > > >
> > > > Hi Dan,
> > > >
> > > > This is a possible fix for the ACPI OSL RCU-related performance issues, 
> > > > but
> > > > can you please arrange for the testing of it on the affected systems?
> > >
> > > Ugh, is it really this simple? I did not realize the read-side is NMI
> > > safe. I'll take a look.
> >
> > But if an NMI triggers while the lock is being held for writing, it
> > will deadlock, won't it?
> >
> > OTOH, according to the RCU documentation it is valid to call
> > rcu_read_[un]lock() from an NMI handler (see Interrupts and NMIs in
> > Documentation/RCU/Design/Requirements/Requirements.rst) so we are good
> > from this perspective today.
> >
> > Unless we teach APEI to avoid mapping lookups from
> > apei_{read|write}(), which wouldn't be unreasonable by itself, we need
> > to hold on to the RCU in ACPI OSL, so it looks like addressing the
> > problem in ACPICA is the best way to do it (and the current ACPICA
> > code in question is suboptimal, so it would be good to rework it
> > anyway).
> >
> > Cheers!
>
> I've sent the prototype patch below to you, Bob and Erik in private, so
> here it goes to the lists for completeness.
>
> It introduces a "fast-path" variant of acpi_os_map_memory() that only
> returns non-NULL if a matching mapping is already there in the list
> and reworks acpi_ex_system_memory_space_handler() to use it.
>
> The idea is to do a fast-path lookup first for every new mapping and
> only run the full acpi_os_map_memory() if that returns NULL and then
> save the mapping returned by it and do a fast-path lookup for it again
> to bump up its reference counter in the OSL layer.  That should prevent
> the mappings from going away until the opregions that they belong to
> go away (the opregion deactivation code is updated too to remove the
> saved mappings), so in the cases when there's not too much opregion
> creation and removal activity, it should make the RCU-related overhead
> go away.
>
> Please test.
>
>

Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management

2020-06-08 Thread Rafael J. Wysocki
On Saturday, June 6, 2020 8:56:26 AM CEST Rafael J. Wysocki wrote:
> On Fri, Jun 5, 2020 at 7:09 PM Dan Williams  wrote:
> >
> > On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki  wrote:
> > >
> > > From: Rafael J. Wysocki 
> > > Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory 
> > > management
> > >
> > > The ACPI OS layer uses RCU to protect the list of ACPI memory
> > > mappings from being walked while it is updated.  Among other
> > > situations, that list can be walked in non-NMI interrupt context,
> > > so using a sleeping lock to protect it is not an option.
> > >
> > > However, performance issues related to the RCU usage in there
> > > appear, as described by Dan Williams:
> > >
> > > "Recently a performance problem was reported for a process invoking
> > > a non-trivial ASL program. The method call in this case ends up
> > > repetitively triggering a call path like:
> > >
> > > acpi_ex_store
> > > acpi_ex_store_object_to_node
> > > acpi_ex_write_data_to_field
> > > acpi_ex_insert_into_field
> > > acpi_ex_write_with_update_rule
> > > acpi_ex_field_datum_io
> > > acpi_ex_access_region
> > > acpi_ev_address_space_dispatch
> > > acpi_ex_system_memory_space_handler
> > > acpi_os_map_cleanup.part.14
> > > _synchronize_rcu_expedited.constprop.89
> > > schedule
> > >
> > > The end result of frequent synchronize_rcu_expedited() invocation is
> > > tiny sub-millisecond spurts of execution where the scheduler freely
> > > migrates this apparently sleepy task. The overhead of frequent
> > > scheduler invocation multiplies the execution time by a factor
> > > of 2-3X."
> > >
> > > In order to avoid these issues, replace the RCU in the ACPI OS
> > > layer by an rwlock.
> > >
> > > That rwlock should not be frequently contended, so the performance
> > > impact of it is not expected to be significant.
> > >
> > > Reported-by: Dan Williams 
> > > Signed-off-by: Rafael J. Wysocki 
> > > ---
> > >
> > > Hi Dan,
> > >
> > > This is a possible fix for the ACPI OSL RCU-related performance issues, 
> > > but
> > > can you please arrange for the testing of it on the affected systems?
> >
> > Ugh, is it really this simple? I did not realize the read-side is NMI
> > safe. I'll take a look.
> 
> But if an NMI triggers while the lock is being held for writing, it
> will deadlock, won't it?
> 
> OTOH, according to the RCU documentation it is valid to call
> rcu_read_[un]lock() from an NMI handler (see Interrupts and NMIs in
> Documentation/RCU/Design/Requirements/Requirements.rst) so we are good
> from this perspective today.
> 
> Unless we teach APEI to avoid mapping lookups from
> apei_{read|write}(), which wouldn't be unreasonable by itself, we need
> to hold on to the RCU in ACPI OSL, so it looks like addressing the
> problem in ACPICA is the best way to do it (and the current ACPICA
> code in question is suboptimal, so it would be good to rework it
> anyway).
> 
> Cheers!

I've sent the prototype patch below to you, Bob and Erik in private, so
here it goes to the lists for completeness.

It introduces a "fast-path" variant of acpi_os_map_memory() that only
returns non-NULL if a matching mapping is already there in the list
and reworks acpi_ex_system_memory_space_handler() to use it.

The idea is to do a fast-path lookup first for every new mapping and
only run the full acpi_os_map_memory() if that returns NULL and then
save the mapping returned by it and do a fast-path lookup for it again
to bump up its reference counter in the OSL layer.  That should prevent
the mappings from going away until the opregions that they belong to
go away (the opregion deactivation code is updated too to remove the
saved mappings), so in the cases when there's not too much opregion
creation and removal activity, it should make the RCU-related overhead
go away.
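
In other words, the intended flow is roughly the sketch below; the fast-path
helper name is a placeholder for the lookup-only variant, and the opregion
bookkeeping is omitted:

static void __iomem *opregion_get_mapping(acpi_physical_address phys,
					  acpi_size size)
{
	void __iomem *virt;

	/* Lookup-only: returns NULL unless a matching mapping already exists. */
	virt = acpi_os_map_memory_fast_path(phys, size);
	if (virt)
		return virt;

	/* Slow path: creates a new mapping under acpi_ioremap_lock. */
	virt = acpi_os_map_memory(phys, size);
	if (!virt)
		return NULL;

	/* Look it up again to take a reference held until opregion deactivation. */
	acpi_os_map_memory_fast_path(phys, size);
	return virt;
}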

Please test.

Cheers!

---
 drivers/acpi/acpica/evrgnini.c|   14 -
 drivers/acpi/acpica/exregion.c|   49 +--
 drivers/acpi/osl.c|   59 --
 include/acpi/actypes.h|7 
 include/acpi/platform/aclinuxex.h |2 +
 5 files changed, 112 insertions(+), 19 deletions(-)

Index: linux-pm/drivers/acpi/osl.c
===
--- linux-pm.orig/drivers/acpi/osl.c
+++ linux-pm/drivers/acpi/osl.c
@@ -302,21 +302,8 @@ stat

Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management

2020-06-06 Thread Rafael J. Wysocki
On Fri, Jun 5, 2020 at 7:09 PM Dan Williams  wrote:
>
> On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki  wrote:
> >
> > From: Rafael J. Wysocki 
> > Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
> >
> > The ACPI OS layer uses RCU to protect the list of ACPI memory
> > mappings from being walked while it is updated.  Among other
> > situations, that list can be walked in non-NMI interrupt context,
> > so using a sleeping lock to protect it is not an option.
> >
> > However, performance issues related to the RCU usage in there
> > appear, as described by Dan Williams:
> >
> > "Recently a performance problem was reported for a process invoking
> > a non-trivial ASL program. The method call in this case ends up
> > repetitively triggering a call path like:
> >
> > acpi_ex_store
> > acpi_ex_store_object_to_node
> > acpi_ex_write_data_to_field
> > acpi_ex_insert_into_field
> > acpi_ex_write_with_update_rule
> > acpi_ex_field_datum_io
> > acpi_ex_access_region
> > acpi_ev_address_space_dispatch
> > acpi_ex_system_memory_space_handler
> > acpi_os_map_cleanup.part.14
> > _synchronize_rcu_expedited.constprop.89
> > schedule
> >
> > The end result of frequent synchronize_rcu_expedited() invocation is
> > tiny sub-millisecond spurts of execution where the scheduler freely
> > migrates this apparently sleepy task. The overhead of frequent
> > scheduler invocation multiplies the execution time by a factor
> > of 2-3X."
> >
> > In order to avoid these issues, replace the RCU in the ACPI OS
> > layer by an rwlock.
> >
> > That rwlock should not be frequently contended, so the performance
> > impact of it is not expected to be significant.
> >
> > Reported-by: Dan Williams 
> > Signed-off-by: Rafael J. Wysocki 
> > ---
> >
> > Hi Dan,
> >
> > This is a possible fix for the ACPI OSL RCU-related performance issues, but
> > can you please arrange for the testing of it on the affected systems?
>
> Ugh, is it really this simple? I did not realize the read-side is NMI
> safe. I'll take a look.

But if an NMI triggers while the lock is being held for writing, it
will deadlock, won't it?

OTOH, according to the RCU documentation it is valid to call
rcu_read_[un]lock() from an NMI handler (see Interrupts and NMIs in
Documentation/RCU/Design/Requirements/Requirements.rst) so we are good
from this perspective today.

Unless we teach APEI to avoid mapping lookups from
apei_{read|write}(), which wouldn't be unreasonable by itself, we need
to hold on to the RCU in ACPI OSL, so it looks like addressing the
problem in ACPICA is the best way to do it (and the current ACPICA
code in question is suboptimal, so it would be good to rework it
anyway).

Cheers!


Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management

2020-06-06 Thread Rafael J. Wysocki
On Fri, Jun 5, 2020 at 9:40 PM Andy Shevchenko
 wrote:
>
> On Fri, Jun 5, 2020 at 5:11 PM Rafael J. Wysocki  wrote:
>
> ...
>
> > +   if (!refcount) {
> > +   write_lock_irq(&acpi_ioremaps_list_lock);
> > +
> > +   list_del(&map->list);
> > +
> > +   write_unlock_irq(&acpi_ioremaps_list_lock);
> > +   }
> > return refcount;
>
> It seems we can decrease indentation level at the same time:
>
>   if (refcount)
> return refcount;
>
>  ...
>  return 0;

Right, but the patch will need to be dropped anyway I think.


Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management

2020-06-05 Thread Andy Shevchenko
On Fri, Jun 5, 2020 at 5:11 PM Rafael J. Wysocki  wrote:

...

> +   if (!refcount) {
> +   write_lock_irq(&acpi_ioremaps_list_lock);
> +
> +   list_del(&map->list);
> +
> +   write_unlock_irq(&acpi_ioremaps_list_lock);
> +   }
> return refcount;

It seems we can decrease indentation level at the same time:

  if (refcount)
return refcount;

 ...
 return 0;

-- 
With Best Regards,
Andy Shevchenko


Re: [RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management

2020-06-05 Thread Dan Williams
On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki  wrote:
>
> From: Rafael J. Wysocki 
> Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
>
> The ACPI OS layer uses RCU to protect the list of ACPI memory
> mappings from being walked while it is updated.  Among other
> situations, that list can be walked in non-NMI interrupt context,
> so using a sleeping lock to protect it is not an option.
>
> However, performance issues related to the RCU usage in there
> appear, as described by Dan Williams:
>
> "Recently a performance problem was reported for a process invoking
> a non-trivial ASL program. The method call in this case ends up
> repetitively triggering a call path like:
>
> acpi_ex_store
> acpi_ex_store_object_to_node
> acpi_ex_write_data_to_field
> acpi_ex_insert_into_field
> acpi_ex_write_with_update_rule
> acpi_ex_field_datum_io
> acpi_ex_access_region
> acpi_ev_address_space_dispatch
> acpi_ex_system_memory_space_handler
> acpi_os_map_cleanup.part.14
> _synchronize_rcu_expedited.constprop.89
> schedule
>
> The end result of frequent synchronize_rcu_expedited() invocation is
> tiny sub-millisecond spurts of execution where the scheduler freely
> migrates this apparently sleepy task. The overhead of frequent
> scheduler invocation multiplies the execution time by a factor
> of 2-3X."
>
> In order to avoid these issues, replace the RCU in the ACPI OS
> layer by an rwlock.
>
> That rwlock should not be frequently contended, so the performance
> impact of it is not expected to be significant.
>
> Reported-by: Dan Williams 
> Signed-off-by: Rafael J. Wysocki 
> ---
>
> Hi Dan,
>
> This is a possible fix for the ACPI OSL RCU-related performance issues, but
> can you please arrange for the testing of it on the affected systems?

Ugh, is it really this simple? I did not realize the read-side is NMI
safe. I'll take a look.


[RFT][PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management

2020-06-05 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 
Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management

The ACPI OS layer uses RCU to protect the list of ACPI memory
mappings from being walked while it is updated.  Among other
situations, that list can be walked in non-NMI interrupt context,
so using a sleeping lock to protect it is not an option.

However, performance issues related to the RCU usage in there
appear, as described by Dan Williams:

"Recently a performance problem was reported for a process invoking
a non-trivial ASL program. The method call in this case ends up
repetitively triggering a call path like:

acpi_ex_store
acpi_ex_store_object_to_node
acpi_ex_write_data_to_field
acpi_ex_insert_into_field
acpi_ex_write_with_update_rule
acpi_ex_field_datum_io
acpi_ex_access_region
acpi_ev_address_space_dispatch
acpi_ex_system_memory_space_handler
acpi_os_map_cleanup.part.14
_synchronize_rcu_expedited.constprop.89
schedule

The end result of frequent synchronize_rcu_expedited() invocation is
tiny sub-millisecond spurts of execution where the scheduler freely
migrates this apparently sleepy task. The overhead of frequent
scheduler invocation multiplies the execution time by a factor
of 2-3X."

In order to avoid these issues, replace the RCU in the ACPI OS
layer by an rwlock.

That rwlock should not be frequently contended, so the performance
impact of it is not expected to be significant.

Reported-by: Dan Williams 
Signed-off-by: Rafael J. Wysocki 
---

Hi Dan,

This is a possible fix for the ACPI OSL RCU-related performance issues, but
can you please arrange for the testing of it on the affected systems?

Cheers!

---
 drivers/acpi/osl.c |   50 ++
 1 file changed, 34 insertions(+), 16 deletions(-)

Index: linux-pm/drivers/acpi/osl.c
===
--- linux-pm.orig/drivers/acpi/osl.c
+++ linux-pm/drivers/acpi/osl.c
@@ -81,8 +81,8 @@ struct acpi_ioremap {
 };
 
 static LIST_HEAD(acpi_ioremaps);
+static DEFINE_RWLOCK(acpi_ioremaps_list_lock);
 static DEFINE_MUTEX(acpi_ioremap_lock);
-#define acpi_ioremap_lock_held() lock_is_held(&acpi_ioremap_lock.dep_map)
 
 static void __init acpi_request_region (struct acpi_generic_address *gas,
unsigned int length, char *desc)
@@ -214,13 +214,13 @@ acpi_physical_address __init acpi_os_get
return pa;
 }
 
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
+/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. */
 static struct acpi_ioremap *
 acpi_map_lookup(acpi_physical_address phys, acpi_size size)
 {
struct acpi_ioremap *map;
 
-   list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
+   list_for_each_entry(map, &acpi_ioremaps, list)
if (map->phys <= phys &&
phys + size <= map->phys + map->size)
return map;
@@ -228,7 +228,7 @@ acpi_map_lookup(acpi_physical_address ph
return NULL;
 }
 
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
+/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. */
 static void __iomem *
 acpi_map_vaddr_lookup(acpi_physical_address phys, unsigned int size)
 {
@@ -257,13 +257,13 @@ void __iomem *acpi_os_get_iomem(acpi_phy
 }
 EXPORT_SYMBOL_GPL(acpi_os_get_iomem);
 
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
+/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. */
 static struct acpi_ioremap *
 acpi_map_lookup_virt(void __iomem *virt, acpi_size size)
 {
struct acpi_ioremap *map;
 
-   list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
+   list_for_each_entry(map, &acpi_ioremaps, list)
if (map->virt <= virt &&
virt + size <= map->virt + map->size)
return map;
@@ -360,7 +360,11 @@ void __iomem __ref
map->size = pg_sz;
map->refcount = 1;
 
-   list_add_tail_rcu(&map->list, &acpi_ioremaps);
+   write_lock_irq(&acpi_ioremaps_list_lock);
+
+   list_add_tail(&map->list, &acpi_ioremaps);
+
+   write_unlock_irq(&acpi_ioremaps_list_lock);
 
 out:
mutex_unlock(_ioremap_lock);
@@ -379,14 +383,18 @@ static unsigned long acpi_os_drop_map_re
 {
unsigned long refcount = --map->refcount;
 
-   if (!refcount)
-   list_del_rcu(&map->list);
+   if (!refcount) {
+   write_lock_irq(&acpi_ioremaps_list_lock);
+
+   list_del(&map->list);
+
+   write_unlock_irq(&acpi_ioremaps_list_lock);
+   }
return refcount;
 }
 
 static void acpi_os_map_cleanup(struct acpi_ioremap *map)
 {
-   synchronize_rcu_expedited();
acpi_unmap(map->phys, map->virt);
kfree(map);
 }
@@ -704,18 +712,23 @@ acpi_status
 acpi_os_read_mem

Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device

2019-08-19 Thread Zhao, Yakui




On 2019-08-19 15:39, Dan Carpenter wrote:

On Mon, Aug 19, 2019 at 01:32:54PM +0800, Zhao, Yakui wrote:

In fact, as this driver is mainly used for embedded IoT usage, it doesn't
handle the complex cleanup when such an error is encountered.  Instead, the
cleanup is handled in free_guest_vm.


A use after free here seems like a potential security problem.  Security
matters for IoT...  :(


Thanks for pointing out the issue.
The cleanup will be considered carefully.



regards,
dan carpenter



Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device

2019-08-19 Thread Borislav Petkov
On Mon, Aug 19, 2019 at 10:39:58AM +0300, Dan Carpenter wrote:
> On Mon, Aug 19, 2019 at 01:32:54PM +0800, Zhao, Yakui wrote:
> > In fact, as this driver is mainly used for embedded IoT usage, it doesn't
> > handle the complex cleanup when such an error is encountered.  Instead, the
> > cleanup is handled in free_guest_vm.
> 
> A use after free here seems like a potential security problem.  Security
> matters for IoT...  :(

Yeah, the "S" in "IoT" stands for security.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device

2019-08-19 Thread Dan Carpenter
On Mon, Aug 19, 2019 at 01:32:54PM +0800, Zhao, Yakui wrote:
> In fact, as this driver is mainly used for embedded IoT usage, it doesn't
> handle the complex cleanup when such an error is encountered.  Instead, the
> cleanup is handled in free_guest_vm.

A use after free here seems like a potential security problem.  Security
matters for IoT...  :(

regards,
dan carpenter



Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device

2019-08-18 Thread Zhao, Yakui




On 2019-08-16 20:58, Dan Carpenter wrote:

On Fri, Aug 16, 2019 at 10:25:49AM +0800, Zhao Yakui wrote:

+int hugepage_map_guest(struct acrn_vm *vm, struct vm_memmap *memmap)
+{
+   struct page *page = NULL, *regions_buf_pg = NULL;
+   unsigned long len, guest_gpa, vma;
+   struct vm_memory_region *region_array;
+   struct set_regions *regions;
+   int max_size = PAGE_SIZE / sizeof(struct vm_memory_region);
+   int ret;
+
+   if (!vm || !memmap)
+   return -EINVAL;
+
+   len = memmap->len;
+   vma = memmap->vma_base;
+   guest_gpa = memmap->gpa;
+
+   /* prepare set_memory_regions info */
+   regions_buf_pg = alloc_page(GFP_KERNEL);
+   if (!regions_buf_pg)
+   return -ENOMEM;
+
+   regions = kzalloc(sizeof(*regions), GFP_KERNEL);
+   if (!regions) {
+   __free_page(regions_buf_pg);
+   return -ENOMEM;


It's better to do a goto err_free_regions_buf here.  More comments
below.


+   }
+   regions->mr_num = 0;
+   regions->vmid = vm->vmid;
+   regions->regions_gpa = page_to_phys(regions_buf_pg);
+   region_array = page_to_virt(regions_buf_pg);
+
+   while (len > 0) {
+   unsigned long vm0_gpa, pagesize;
+
+   ret = get_user_pages_fast(vma, 1, 1, &page);
+   if (unlikely(ret != 1) || (!page)) {
+   pr_err("failed to pin huge page!\n");
+   ret = -ENOMEM;
+   goto err;


goto err is a red flag.  It's better if error labels do one specific
named thing like:

err_regions:
kfree(regions);
err_free_regions_buf:
__free_page(regions_buf_pg);

We should unwind in the opposite/mirror order from how things were
allocated.  Then we can remove the if statements in the error handling.


Thanks for the review.

Will follow your suggestion to unwind the error handling.



In this situation, say the user triggers an -EFAULT in
get_user_pages_fast() in the second iteration through the loop.  That
means that "page" is the non-NULL page from the previous iteration.  We
have already added it to add_guest_map().  But now we're freeing it
without removing it from the map so probably it leads to a use after
free.

The best way to write the error handling in a loop like this is to
clean up the partial iteration that has succeeded (nothing here), and then
unwind all the successful iterations at the bottom of the function.
"goto unwind_loop;"



In theory we should clean up the previously successful iterations if an
error is encountered in the current iteration.

But it would be quite complex to clean up the previous iterations:
call set_memory_regions with the MR_DEL op,
call remove_guest_map for the added hash item,
call put_page for the page returned by get_user_pages_fast.

In fact, as this driver is mainly used for embedded IoT usage, it doesn't
handle the complex cleanup when such an error is encountered.  Instead, the
cleanup is handled in free_guest_vm.



+   }
+
+   vm0_gpa = page_to_phys(page);
+   pagesize = PAGE_SIZE << compound_order(page);
+
+   ret = add_guest_map(vm, vm0_gpa, guest_gpa, pagesize);
+   if (ret < 0) {
+   pr_err("failed to add memseg for huge page!\n");
+   goto err;


So then here, it would be:

pr_err("failed to add memseg for huge page!\n");
put_page(page);
goto unwind_loop;

regards,
dan carpenter


+   }
+
+   /* fill each memory region into region_array */
+   region_array[regions->mr_num].type = MR_ADD;
+   region_array[regions->mr_num].gpa = guest_gpa;
+   region_array[regions->mr_num].vm0_gpa = vm0_gpa;
+   region_array[regions->mr_num].size = pagesize;
+   region_array[regions->mr_num].prot =
+   (MEM_TYPE_WB & MEM_TYPE_MASK) |
+   (memmap->prot & MEM_ACCESS_RIGHT_MASK);
+   regions->mr_num++;
+   if (regions->mr_num == max_size) {
+   pr_debug("region buffer full, set & renew regions!\n");
+   ret = set_memory_regions(regions);
+   if (ret < 0) {
+   pr_err("failed to set regions,ret=%d!\n", ret);
+   goto err;
+   }
+   regions->mr_num = 0;
+   }
+
+   len -= pagesize;
+   vma += pagesize;
+   guest_gpa += pagesize;
+   }
+
+   ret = set_memory_regions(regions);
+   if (ret < 0) {
+   pr_err("failed to set regions, ret=%d!\n", ret);
+   goto err;
+   }
+
+   __free_page(regions_buf_pg);
+   kfree(regions);
+
+   return 0;
+err:
+   if (regions_buf_pg)
+   

Re: [RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device

2019-08-16 Thread Dan Carpenter
On Fri, Aug 16, 2019 at 10:25:49AM +0800, Zhao Yakui wrote:
> +int hugepage_map_guest(struct acrn_vm *vm, struct vm_memmap *memmap)
> +{
> + struct page *page = NULL, *regions_buf_pg = NULL;
> + unsigned long len, guest_gpa, vma;
> + struct vm_memory_region *region_array;
> + struct set_regions *regions;
> + int max_size = PAGE_SIZE / sizeof(struct vm_memory_region);
> + int ret;
> +
> + if (!vm || !memmap)
> + return -EINVAL;
> +
> + len = memmap->len;
> + vma = memmap->vma_base;
> + guest_gpa = memmap->gpa;
> +
> + /* prepare set_memory_regions info */
> + regions_buf_pg = alloc_page(GFP_KERNEL);
> + if (!regions_buf_pg)
> + return -ENOMEM;
> +
> + regions = kzalloc(sizeof(*regions), GFP_KERNEL);
> + if (!regions) {
> + __free_page(regions_buf_pg);
> + return -ENOMEM;

It's better to do a goto err_free_regions_buf here.  More comments
below.

> + }
> + regions->mr_num = 0;
> + regions->vmid = vm->vmid;
> + regions->regions_gpa = page_to_phys(regions_buf_pg);
> + region_array = page_to_virt(regions_buf_pg);
> +
> + while (len > 0) {
> + unsigned long vm0_gpa, pagesize;
> +
> + ret = get_user_pages_fast(vma, 1, 1, &page);
> + if (unlikely(ret != 1) || (!page)) {
> + pr_err("failed to pin huge page!\n");
> + ret = -ENOMEM;
> + goto err;

goto err is a red flag.  It's better if error labels do one specific
named thing like:

err_regions:
kfree(regions);
err_free_regions_buf:
__free_page(regions_buf_pg);

We should unwind in the opposite/mirror order from how things were
allocated.  Then we can remove the if statements in the error handling.
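
Applied to the allocation order of this function, the shape would be roughly
as below; this is an illustrative restructuring of the prologue/epilogue only,
not the revised driver code, and the loop body with its own unwinding is
omitted:

	regions_buf_pg = alloc_page(GFP_KERNEL);
	if (!regions_buf_pg)
		return -ENOMEM;

	regions = kzalloc(sizeof(*regions), GFP_KERNEL);
	if (!regions) {
		ret = -ENOMEM;
		goto err_free_regions_buf;
	}

	/* ... the while (len > 0) loop; failures do "goto err_regions" ... */

	kfree(regions);
	__free_page(regions_buf_pg);
	return 0;

err_regions:
	kfree(regions);
err_free_regions_buf:
	__free_page(regions_buf_pg);
	return ret;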

In this situation, say the user triggers an -EFAULT in
get_user_pages_fast() in the second iteration through the loop.  That
means that "page" is the non-NULL page from the previous iteration.  We
have already added it to add_guest_map().  But now we're freeing it
without removing it from the map so probably it leads to a use after
free.

The best way to write the error handling in a loop like this is to
clean up the partial iteration that has succeeded (nothing here), and then
unwind all the successful iterations at the bottom of the function.
"goto unwind_loop;"

> + }
> +
> + vm0_gpa = page_to_phys(page);
> + pagesize = PAGE_SIZE << compound_order(page);
> +
> + ret = add_guest_map(vm, vm0_gpa, guest_gpa, pagesize);
> + if (ret < 0) {
> + pr_err("failed to add memseg for huge page!\n");
> + goto err;

So then here, it would be:

pr_err("failed to add memseg for huge page!\n");
put_page(page);
goto unwind_loop;

regards,
dan carpenter

> + }
> +
> + /* fill each memory region into region_array */
> + region_array[regions->mr_num].type = MR_ADD;
> + region_array[regions->mr_num].gpa = guest_gpa;
> + region_array[regions->mr_num].vm0_gpa = vm0_gpa;
> + region_array[regions->mr_num].size = pagesize;
> + region_array[regions->mr_num].prot =
> + (MEM_TYPE_WB & MEM_TYPE_MASK) |
> + (memmap->prot & MEM_ACCESS_RIGHT_MASK);
> + regions->mr_num++;
> + if (regions->mr_num == max_size) {
> + pr_debug("region buffer full, set & renew regions!\n");
> + ret = set_memory_regions(regions);
> + if (ret < 0) {
> + pr_err("failed to set regions,ret=%d!\n", ret);
> + goto err;
> + }
> + regions->mr_num = 0;
> + }
> +
> + len -= pagesize;
> + vma += pagesize;
> + guest_gpa += pagesize;
> + }
> +
> + ret = set_memory_regions(regions);
> + if (ret < 0) {
> + pr_err("failed to set regions, ret=%d!\n", ret);
> + goto err;
> + }
> +
> + __free_page(regions_buf_pg);
> + kfree(regions);
> +
> + return 0;
> +err:
> + if (regions_buf_pg)
> + __free_page(regions_buf_pg);
> + if (page)
> + put_page(page);
> + kfree(regions);
> + return ret;
> +}
> +



[RFC PATCH 08/15] drivers/acrn: add VM memory management for ACRN char device

2019-08-15 Thread Zhao Yakui
In order to launch the ACRN guest system, it needs to set up the mapping
between GPA (guest physical address) and HPA (host physical address).
This is based on memory virtualization and is configured in the EPT table.
The ioctls related to memory management are added, and then the hypercall
is issued so that the ACRN hypervisor can help to set up the memory
mapping for the ACRN guest.
1G/2M huge pages are used to optimize the EPT table for the guest VM.
This simplifies the memory allocation and also optimizes TLB usage.
For the MMIO mapping, 4K/2M pages are supported.

IC_SET_MEMSEG: This is used to set up the memory mapping for the memory
of the guest system by using hugetlb (guest physical address and host
virtual address).  It is also used to set up the device MMIO mapping for
PCI devices.
IC_UNSET_MEMSEG: This is used to remove the device MMIO mapping for PCI
devices.  It is used together with updating the MMIO mapping.  As the
acrn hypervisor is mainly used for embedded IoT devices, it doesn't
support dynamic removal of memory mappings.
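
For illustration, a hedged userspace sketch of the IC_SET_MEMSEG flow is given
below.  Only the vm_memmap fields visible in this patch (gpa, vma_base, len)
are filled in; the uapi header path, the mmap flags of the backing memory and
the omitted prot/type setup are assumptions, not taken from the patch.

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/types.h>
#include <linux/acrn/acrn_ioctl_defs.h>

static int map_guest_ram(int hsm_fd, __u64 gpa, __u64 len)
{
	struct vm_memmap memmap;
	void *backing;

	/* Back the guest RAM range with an (ideally hugetlb) anonymous mapping. */
	backing = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (backing == MAP_FAILED)
		return -1;

	memset(&memmap, 0, sizeof(memmap));
	memmap.gpa = gpa;
	memmap.vma_base = (__u64)(unsigned long)backing;
	memmap.len = len;
	/* prot/type setup omitted; see the MEM_TYPE_ and MEM_ACCESS_ masks above. */

	return ioctl(hsm_fd, IC_SET_MEMSEG, &memmap);
}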

Co-developed-by: Jason Chen CJ 
Signed-off-by: Jason Chen CJ 
Co-developed-by: Li, Fei 
Signed-off-by: Li, Fei 
Co-developed-by: Liu Shuo 
Signed-off-by: Liu Shuo 
Signed-off-by: Zhao Yakui 
---
 drivers/staging/acrn/Makefile |   4 +-
 drivers/staging/acrn/acrn_dev.c   |  27 +++
 drivers/staging/acrn/acrn_drv_internal.h  |  90 +++---
 drivers/staging/acrn/acrn_mm.c| 227 
 drivers/staging/acrn/acrn_mm_hugetlb.c| 281 ++
 drivers/staging/acrn/acrn_vm_mngt.c   |   2 +
 include/linux/acrn/acrn_drv.h |  86 +
 include/uapi/linux/acrn/acrn_common_def.h |  25 +++
 include/uapi/linux/acrn/acrn_ioctl_defs.h |  41 +
 9 files changed, 759 insertions(+), 24 deletions(-)
 create mode 100644 drivers/staging/acrn/acrn_mm.c
 create mode 100644 drivers/staging/acrn/acrn_mm_hugetlb.c
 create mode 100644 include/linux/acrn/acrn_drv.h
 create mode 100644 include/uapi/linux/acrn/acrn_common_def.h

diff --git a/drivers/staging/acrn/Makefile b/drivers/staging/acrn/Makefile
index 426d6e8..ec62afe 100644
--- a/drivers/staging/acrn/Makefile
+++ b/drivers/staging/acrn/Makefile
@@ -1,4 +1,6 @@
 obj-$(CONFIG_ACRN_HSM) := acrn.o
 acrn-y := acrn_dev.o \
  acrn_hypercall.o \
- acrn_vm_mngt.o
+ acrn_vm_mngt.o \
+ acrn_mm.o \
+ acrn_mm_hugetlb.o
diff --git a/drivers/staging/acrn/acrn_dev.c b/drivers/staging/acrn/acrn_dev.c
index 7372316..cb62819 100644
--- a/drivers/staging/acrn/acrn_dev.c
+++ b/drivers/staging/acrn/acrn_dev.c
@@ -44,6 +44,7 @@ static
 int acrn_dev_open(struct inode *inodep, struct file *filep)
 {
struct acrn_vm *vm;
+   int i;
 
vm = kzalloc(sizeof(*vm), GFP_KERNEL);
if (!vm)
@@ -53,6 +54,10 @@ int acrn_dev_open(struct inode *inodep, struct file *filep)
vm->vmid = ACRN_INVALID_VMID;
vm->dev = acrn_device;
 
+   for (i = 0; i < HUGEPAGE_HLIST_ARRAY_SIZE; i++)
+   INIT_HLIST_HEAD(&vm->hugepage_hlist[i]);
+   mutex_init(&vm->hugepage_lock);
+
write_lock_bh(_vm_list_lock);
vm_list_add(>list);
write_unlock_bh(_vm_list_lock);
@@ -212,6 +217,28 @@ long acrn_dev_ioctl(struct file *filep,
return ret;
}
 
+   case IC_SET_MEMSEG: {
+   struct vm_memmap memmap;
+
+   if (copy_from_user(, (void *)ioctl_param,
+  sizeof(memmap)))
+   return -EFAULT;
+
+   ret = map_guest_memseg(vm, &memmap);
+   break;
+   }
+
+   case IC_UNSET_MEMSEG: {
+   struct vm_memmap memmap;
+
+   if (copy_from_user(&memmap, (void *)ioctl_param,
+  sizeof(memmap)))
+   return -EFAULT;
+
+   ret = unmap_guest_memseg(vm, &memmap);
+   break;
+   }
+
default:
pr_warn("Unknown IOCTL 0x%x\n", ioctl_num);
ret = -EFAULT;
diff --git a/drivers/staging/acrn/acrn_drv_internal.h b/drivers/staging/acrn/acrn_drv_internal.h
index 6758dea..5098765 100644
--- a/drivers/staging/acrn/acrn_drv_internal.h
+++ b/drivers/staging/acrn/acrn_drv_internal.h
@@ -11,6 +11,57 @@
 #include 
 #include 
 
+struct vm_memory_region {
+#define MR_ADD 0
+#define MR_DEL 2
+   u32 type;
+
+   /* IN: mem attr */
+   u32 prot;
+
+   /* IN: beginning guest GPA to map */
+   u64 gpa;
+
+   /* IN: VM0's GPA which foreign gpa will be mapped to */
+   u64 vm0_gpa;
+
+   /* IN: size of the region */
+   u64 size;
+};
+
+struct set_regions {
+   /*IN: vmid for this hypercall */
+   u16 vmid;
+
+   /** Reserved */
+   u16 reserved[3];
+
+   /* IN: multi memmaps numbers */
+   u32 mr_num;
+
+   /** Reserved */
+   u32 reserved1;
+   /* IN:
+* the gpa of memmaps buffer, point to the memm

Re: [RFC KVM 01/27] kernel: Export memory-management symbols required for KVM address space isolation

2019-05-13 Thread Liran Alon



> On 13 May 2019, at 18:15, Peter Zijlstra  wrote:
> 
> On Mon, May 13, 2019 at 04:38:09PM +0200, Alexandre Chartre wrote:
>> From: Liran Alon 
>> 
>> Export symbols needed to create, manage, populate and switch
>> a mm from a kernel module (kvm in this case).
>> 
>> This is a hacky way for now to start.
>> This should be changed to some suitable memory-management API.
> 
> This should not be exported at all, ever, end of story.
> 
> Modules do not get to play with address spaces like that.

I agree… No doubt about that. This should never be merged like this.
It’s just to have an initial PoC of the concept so we can:
1) Measure the performance impact of the concept.
2) Get feedback on appropriate design and APIs from community.

-Liran



Re: [RFC KVM 01/27] kernel: Export memory-management symbols required for KVM address space isolation

2019-05-13 Thread Peter Zijlstra
On Mon, May 13, 2019 at 04:38:09PM +0200, Alexandre Chartre wrote:
> From: Liran Alon 
> 
> Export symbols needed to create, manage, populate and switch
> a mm from a kernel module (kvm in this case).
> 
> This is a hacky way for now to start.
> This should be changed to some suitable memory-management API.

This should not be exported at all, ever, end of story.

Modules do not get to play with address spaces like that.


[RFC KVM 01/27] kernel: Export memory-management symbols required for KVM address space isolation

2019-05-13 Thread Alexandre Chartre
From: Liran Alon 

Export symbols needed to create, manage, populate and switch
a mm from a kernel module (kvm in this case).

This is a hacky way for now to start.
This should be changed to some suitable memory-management API.

Signed-off-by: Liran Alon 
Signed-off-by: Alexandre Chartre 
---
 arch/x86/kernel/ldt.c |1 +
 arch/x86/mm/tlb.c |3 ++-
 mm/memory.c   |5 +
 3 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index b2463fc..19a86e0 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -401,6 +401,7 @@ void destroy_context_ldt(struct mm_struct *mm)
free_ldt_struct(mm->context.ldt);
mm->context.ldt = NULL;
 }
+EXPORT_SYMBOL_GPL(destroy_context_ldt);
 
 void ldt_arch_exit_mmap(struct mm_struct *mm)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7f61431..a4db7f5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -70,7 +70,7 @@ static void clear_asid_other(void)
 }
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
-
+EXPORT_SYMBOL_GPL(last_mm_ctx_id);
 
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
u16 *new_asid, bool *need_flush)
@@ -159,6 +159,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
switch_mm_irqs_off(prev, next, tsk);
local_irq_restore(flags);
 }
+EXPORT_SYMBOL_GPL(switch_mm);
 
 static void sync_current_stack_to_mm(struct mm_struct *mm)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 36aac68..ede9335 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -434,6 +434,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
pte_free(mm, new);
return 0;
 }
+EXPORT_SYMBOL_GPL(__pte_alloc);
 
 int __pte_alloc_kernel(pmd_t *pmd)
 {
@@ -453,6 +454,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
pte_free_kernel(&init_mm, new);
return 0;
 }
+EXPORT_SYMBOL_GPL(__pte_alloc_kernel);
 
 static inline void init_rss_vec(int *rss)
 {
@@ -4007,6 +4009,7 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
spin_unlock(&mm->page_table_lock);
return 0;
 }
+EXPORT_SYMBOL_GPL(__p4d_alloc);
 #endif /* __PAGETABLE_P4D_FOLDED */
 
 #ifndef __PAGETABLE_PUD_FOLDED
@@ -4039,6 +4042,7 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
spin_unlock(&mm->page_table_lock);
return 0;
 }
+EXPORT_SYMBOL_GPL(__pud_alloc);
 #endif /* __PAGETABLE_PUD_FOLDED */
 
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -4072,6 +4076,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
spin_unlock(ptl);
return 0;
 }
+EXPORT_SYMBOL_GPL(__pmd_alloc);
 #endif /* __PAGETABLE_PMD_FOLDED */
 
 static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
-- 
1.7.1



Re: nbd, nbdkit, loopback mounts and memory management

2019-03-12 Thread Shaun McDowell
We have a good example of deadlock when using an XFS filesystem on an
nbd loopback device regardless of mlockall and MCL_FUTURE. Below I'll
paste the hung task traces but I'll try to summarize the steps that
caused the problem.

The setup is an nbd loopback device with an XFS filesystem that is
then mounted on the local server (ubuntu 18.04 bionic, kernel 4.15).
Heavy load on the server is putting the kernel in a state where it has
exhausted its free pages list and new page allocation requests will
attempt to first free allocated pages -- mm/vmscan.c
try_to_free_pages(). XFS registers a pair of functions in the
super_operations struct (nr_cached_objects and free_cached_objects)
that causes the try_to_free_pages() to perform a synchronous write of
dirty inodes to the block device -- fs/xfs/xfs_icache.c
xfs_reclaim_inodes_nr().

A resource loop deadlock can occur if the nbd userland process needs
to allocate memory to make forward progress and the allocation causes
the kernel to call try_to_free_pages() which then leads to XFS issuing
a blocking synchronous write of a dirty inode page destined back to
the nbd loopback device. A more common deadlock we see (traces below)
is when another process on the system is blocked in
try_to_free_pages() waiting for the nbd device to perform the
synchronous write and then the nbd userland process also triggers
try_to_free_pages() and blocks on a mutex waiting for the first
process to complete its synchronous write.

From a kernel-implemented device driver, which pages should not be freed
during a page allocation can be controlled using the GFP_ flags
(GFP_NOIO, GFP_NOFS).  From userland (as far as I can tell) we do not
have a way to control which pages the kernel is allowed to free in
try_to_free_pages().
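
On the kernel side, the scoped helpers in <linux/sched/mm.h> are one way
to get that effect without tagging every call site. A minimal sketch,
not tied to any particular driver:

  #include <linux/sched/mm.h>
  #include <linux/slab.h>

  static void *alloc_without_io_recursion(size_t size)
  {
          unsigned int noio;
          void *buf;

          noio = memalloc_noio_save();    /* allocations below behave as GFP_NOIO */
          buf = kmalloc(size, GFP_KERNEL);
          memalloc_noio_restore(noio);

          return buf;
  }

Nothing equivalent is exposed to userland, which is exactly the gap
described above.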

Below are some traces where we have hit this deadlock. 0) A process on
the server needs to allocate a page and triggers the kernel to attempt
to free pages and XFS issues a synchronous write to the nbd loopback
device; 1) within the nbd loopback userland process, a memory
allocation request triggers a malloc arena pool to expand its heap and
call mprotect which causes the kernel to try to allocate pages and
call try_to_free_pages(), and proceed down the XFS reclaim inodes path
where it blocks trying to acquire the mutex that is held by that
process from trace 0 waiting for us to make progress; 2) an example of
another thread within the nbd userland process hitting the same
problem when attempting to write to a tcp socket and also blocking on
the XFS reclaim inodes mutex.

So far the problem has been avoided by using ext4 instead of XFS as
ext4 does not provide the super_operations struct functions
(nr_cached_objects, free_cached_objects) and does not issue writes in
the try_to_free_pages() code path.

Ideally, the nbd userland process could set a process flag to tell the
kernel to use GFP_NOIO or GFP_NOFS for its allocations to avoid this
resource deadlock.

Kernel hung tasks
0) A process somewhere on the server has triggered a syscall that
causes the kernel to attempt to free pages and eventually call
xfs_reclaim_inodes, acquire xfs inode reclaim mutex, and wait for a
write to the nbd loopback device
Mar 06 04:54:14 ip-172-16-9-203 kernel: INFO: task glfs_epoll000:4535
blocked for more than 30 seconds.
Mar 06 04:54:14 ip-172-16-9-203 kernel:   Not tainted
4.15.0-1032-aws #34-Ubuntu
Mar 06 04:54:14 ip-172-16-9-203 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 06 04:54:14 ip-172-16-9-203 kernel: glfs_epoll000   D0  4535
   1 0x
Mar 06 04:54:14 ip-172-16-9-203 kernel: Call Trace:
Mar 06 04:54:14 ip-172-16-9-203 kernel:  __schedule+0x291/0x8a0
Mar 06 04:54:14 ip-172-16-9-203 kernel:  schedule+0x2c/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel:  schedule_timeout+0x1cf/0x350
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? _xfs_buf_ioapply+0x396/0x4e0 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? radix_tree_gang_lookup+0xc6/0x110
Mar 06 04:54:14 ip-172-16-9-203 kernel:  wait_for_completion+0xba/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? wake_up_q+0x80/0x80
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? xfs_bwrite+0x24/0x60 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_buf_submit_wait+0x81/0x210 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_bwrite+0x24/0x60 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_reclaim_inode+0x31d/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_reclaim_inodes_ag+0x1e6/0x350 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? radix_tree_gang_lookup_tag+0xd9/0x160
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? __list_lru_walk_one.isra.5+0x37/0x140
Mar 06 04:54:14 ip-172-16-9-203 kernel:  ? iput+0x220/0x220
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Mar 06 04:54:14 ip-172-16-9-203 kernel:  super_cache_scan+0x165/0x1b0
Mar 06 04:54:14 ip-172-16-9-203 kernel:  

Re: nbd, nbdkit, loopback mounts and memory management

2019-02-17 Thread Richard W.M. Jones
On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
> But Shaun reported it happened somehow often for them, so he might
> have a practical test case... better than my theories :-).

Yes, certainly not saying this isn't a problem.

I think the good news is the fix seems quite easy, ie. to add mlockall
and adjust the OOM killer score, as is done currently in the client:

https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219
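
Roughly the same two steps for a user-space server, as a hedged sketch
(this is not nbdkit code; writing -1000 needs CAP_SYS_RESOURCE):

  #include <stdio.h>
  #include <sys/mman.h>

  static void pin_and_deprioritize(void)
  {
          FILE *f;

          if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)   /* keep all pages resident */
                  perror("mlockall");

          f = fopen("/proc/self/oom_score_adj", "w");     /* make the OOM killer pick us last */
          if (f) {
                  fputs("-1000", f);
                  fclose(f);
          }
  }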

For now I have added a note in the TODO file to follow up in case we
get a test case or reports of a problem:

https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


Re: nbd, nbdkit, loopback mounts and memory management

2019-02-17 Thread Pavel Machek
Hi!

> So not to dispute that this could be a bug, but I couldn't cause a
> deadlock.  I wonder if you can see something wrong with my method?
> 
> *** Set up ***
> 
> - kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
> - nbd-client 3.19-1.fc30
> - nbdkit 1.11.5 (git commit ef9d1978ce28)
> 
> Baremetal machine was booted with mem=2G to artificially limit the
> RAM.  The machine has 64G of swap.
> 
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:           1806         329        1383           0          93        1357
> Swap:         65535         179       65356
> 
> *** Method ***
> 
> I started nbdkit as a 4G RAM disk:
> 
>   ./nbdkit memory size=4G
> 
> This is implemented as a sparse array with a 2 level page table, and
> should allocate (using malloc) every time a new area of the disk is
> written to.  Exact implementation is here:
> https://github.com/libguestfs/nbdkit/tree/master/common/sparse
> 
> I started nbd-client using the -swap option which uses
> mlockall(MCL_FUTURE) to lock the client into RAM.
> 
>   nbd-client -b 512 -swap localhost /dev/nbd0
> 
> I then created a filesystem on the RAM disk, mounted it, and copied a
> 3G file into it.  I tried this various ways, but the variation I was
> eventually happy with was:
> 
>   mke2fs /dev/nbd0
>   mount /dev/nbd0 /tmp/mnt
> 
>   dd if=/dev/zero of=/tmp/big bs=1M count=3000
>   cp /tmp/big /tmp/mnt/big
> 
> I couldn't get any kind of deadlock or failure in this test.
> 
> (Note that if you repeat the test several times, in theory you could
> delete the file and fstrim the filesystem, but when I was testing it
> to be sure I unmounted everything and killed and restarted nbdkit
> between each test.)

This looks like quite a good try. I'd try to use mmap() to dirty
memory very quickly.
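
Something along these lines, say (path and size are placeholders,
untested):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 3UL << 30;        /* ~3G file on the NBD-backed fs */
          int fd = open("/tmp/mnt/big", O_RDWR | O_CREAT, 0600);
          char *p;

          if (fd < 0 || ftruncate(fd, len) < 0)
                  return 1;
          p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED)
                  return 1;
          memset(p, 0xaa, len);          /* dirty every page as fast as possible */
          return 0;                      /* no msync(), so the pages stay dirty */
  }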

But Shaun reported it happened somehow often for them, so he might
have a practical test case... better than my theories :-).

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: nbd, nbdkit, loopback mounts and memory management

2019-02-17 Thread Richard W.M. Jones
So not to dispute that this could be a bug, but I couldn't cause a
deadlock.  I wonder if you can see something wrong with my method?

*** Set up ***

- kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
- nbd-client 3.19-1.fc30
- nbdkit 1.11.5 (git commit ef9d1978ce28)

Baremetal machine was booted with mem=2G to artificially limit the
RAM.  The machine has 64G of swap.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           1806         329        1383           0          93        1357
Swap:         65535         179       65356

*** Method ***

I started nbdkit as a 4G RAM disk:

  ./nbdkit memory size=4G

This is implemented as a sparse array with a 2 level page table, and
should allocate (using malloc) every time a new area of the disk is
written to.  Exact implementation is here:
https://github.com/libguestfs/nbdkit/tree/master/common/sparse
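
A toy sketch of that scheme (not the actual nbdkit code, just the idea):

  #include <stdint.h>
  #include <stdlib.h>

  #define PAGE_SZ  4096UL
  #define L1_SLOTS 1024UL            /* 1024 * 1024 pages of 4K = 4G of disk */
  #define L2_SLOTS 1024UL

  struct sparse_disk {
          uint8_t **l1[L1_SLOTS];    /* NULL until that range is first written */
  };

  static uint8_t *page_for_write(struct sparse_disk *d, uint64_t offset)
  {
          uint64_t pgno = offset / PAGE_SZ;
          uint64_t i1 = pgno / L2_SLOTS, i2 = pgno % L2_SLOTS;

          if (i1 >= L1_SLOTS)
                  return NULL;
          if (!d->l1[i1])            /* first write anywhere in this 4M range */
                  d->l1[i1] = calloc(L2_SLOTS, sizeof(uint8_t *));
          if (d->l1[i1] && !d->l1[i1][i2])
                  d->l1[i1][i2] = calloc(1, PAGE_SZ);   /* malloc on first write */
          return d->l1[i1] ? d->l1[i1][i2] : NULL;
  }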

I started nbd-client using the -swap option which uses
mlockall(MCL_FUTURE) to lock the client into RAM.

  nbd-client -b 512 -swap localhost /dev/nbd0

I then created a filesystem on the RAM disk, mounted it, and copied a
3G file into it.  I tried this various ways, but the variation I was
eventually happy with was:

  mke2fs /dev/nbd0
  mount /dev/nbd0 /tmp/mnt

  dd if=/dev/zero of=/tmp/big bs=1M count=3000
  cp /tmp/big /tmp/mnt/big

I couldn't get any kind of deadlock or failure in this test.

(Note that if you repeat the test several times, in theory you could
delete the file and fstrim the filesystem, but when I was testing it
to be sure I unmounted everything and killed and restarted nbdkit
between each test.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v


Re: nbd, nbdkit, loopback mounts and memory management

2019-02-16 Thread Wouter Verhelst
Hi,

On Fri, Feb 15, 2019 at 10:53:32PM +, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 10:41:26PM +, Richard W.M. Jones wrote:
> > On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > > Hi!
> > > 
> > > I watched fosdem talk about
> > > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > > of warning: I'm not sure using it read-write on localhost is safe.
> > > 
> > > In particular, user application could create a lot of dirty data
> > > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > > nbd-server), you might get a deadlock.
> > 
> > Thanks for the kind words about the talk.  I've added Wouter Verhelst
> > & the NBD mailing list to CC.  Although I did the talk because the
> > subject is interesting, how I actually use nbdkit / NBD is to talk to
> > qemu and that's where I have most experience and where we (Red Hat)
> > use it in production systems.
> > 
> > However in January I spent a lot of time exercising the NBD loop-mount
> > + nbdkit case using fio in order to find contention / bottlenecks in
> > our use of threads and locks.  I didn't notice any particular problems
> > then, but it's possible my testing wasn't thorough enough.  Or that
> > fio only creates small numbers of dirty pages (because of locality in
> > its access patterns I guess?)
> > 
> > When you say it's not safe, what could happen?  What would we observe
> > if it was going wrong?
> 
> Reading more carefully I see you said we'd observe a deadlock.  I
> didn't see that, but again my testing of this wouldn't have been very
> thorough.  When I have some time I'll try creating / spooling huge
> files into an NBD loop mount to see if I can cause a deadlock.

While it's of course impossible to fully exclude the possibility of
deadlock when clearing dirty pages to the network, since Mel Gorman's
work that resulted in commit 7f338fe4540b1d0600b02314c7d885fd358e9eca
this should be extremely unlikely, and swapping over the network (NBD or
NFS or whatnot) should be reasonably safe, as well as clearing dirty
pages etc.

Additionally, nbd-client when called with -s calls memlockall() at an
appropriate moment, so that it should not be swapped out.

That only leaves the server side. Personally I haven't been able to
deadlock a reasonably recent machine using NBD, but of course YMMV.

> > > Also note that nbd.txt in Documentation/blockdev/ points to
> > > sourceforge; it should probably point to
> > > https://github.com/NetworkBlockDevice/nbd ?
> > 
> > Wouter should be able to say what the correct link should be.

The sourceforge project is still active, and is where I do the official
file releases. I also push the git repository there. For people who just
want a released version of the NBD utilities, pointing to sourceforge
isn't wrong, I would say.

GitHub is indeed used mostly for development, though.

It might be nice to rethink all that, now that we don't have a
mailinglist running at sourceforge anymore, but I don't think it's very
urgent.

-- 
To the thief who stole my anti-depressants: I hope you're happy

  -- seen somewhere on the Internet on a photo of a billboard


Re: nbd, nbdkit, loopback mounts and memory management

2019-02-15 Thread Pavel Machek
On Fri 2019-02-15 22:41:26, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> > 
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> > 
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
> 
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
> 
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
> 
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

I'm not saying I've seen it happen, or have a test. But my
understanding of memory management says it could deadlock... if nbd
tried allocating memory while memory was "full" of dirty data.

Dunno, something like ... take 1GB block device with 1GB RAM
machine. Create memory pressure so that nbdkit (etc) is dropped from
memory. Then quickly make all the data on the block device dirty.

I believe that scenario is something that can not happen on system
without NBD in loopback configuration.

Situation may be made worse if nbdkit needs to allocate memory due for
compression buffers or something like that.

Best regards,

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: nbd, nbdkit, loopback mounts and memory management

2019-02-15 Thread Richard W.M. Jones
On Fri, Feb 15, 2019 at 10:41:26PM +, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> > 
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> > 
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
> 
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
> 
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
> 
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

Reading more carefully I see you said we'd observe a deadlock.  I
didn't see that, but again my testing of this wouldn't have been very
thorough.  When I have some time I'll try creating / spooling huge
files into an NBD loop mount to see if I can cause a deadlock.

Thanks, Rich.

> > Also note that nbd.txt in Documentation/blockdev/ points to
> > sourceforge; it should probably point to
> > https://github.com/NetworkBlockDevice/nbd ?
> 
> Wouter should be able to say what the correct link should be.
> 
> Rich.
> 
> -- 
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines.  Supports shell scripting,
> bindings from many languages.  http://libguestfs.org

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


Re: nbd, nbdkit, loopback mounts and memory management

2019-02-15 Thread Richard W.M. Jones
On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> Hi!
> 
> I watched fosdem talk about
> nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> of warning: I'm not sure using it read-write on localhost is safe.
> 
> In particular, user application could create a lot of dirty data
> quickly. If there's not enough memory for nbdkit (or nbd-client or
> nbd-server), you might get a deadlock.

Thanks for the kind words about the talk.  I've added Wouter Verhelst
& the NBD mailing list to CC.  Although I did the talk because the
subject is interesting, how I actually use nbdkit / NBD is to talk to
qemu and that's where I have most experience and where we (Red Hat)
use it in production systems.

However in January I spent a lot of time exercising the NBD loop-mount
+ nbdkit case using fio in order to find contention / bottlenecks in
our use of threads and locks.  I didn't notice any particular problems
then, but it's possible my testing wasn't thorough enough.  Or that
fio only creates small numbers of dirty pages (because of locality in
its access patterns I guess?)

When you say it's not safe, what could happen?  What would we observe
if it was going wrong?

> Also note that nbd.txt in Documentation/blockdev/ points to
> sourceforge; it should probably point to
> https://github.com/NetworkBlockDevice/nbd ?

Wouter should be able to say what the correct link should be.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


nbd, nbdkit, loopback mounts and memory management

2019-02-15 Thread Pavel Machek
Hi!

I watched fosdem talk about
nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
of warning: I'm not sure using it read-write on localhost is safe.

In particular, user application could create a lot of dirty data
quickly. If there's not enough memory for nbdkit (or nbd-client or
nbd-server), you might get a deadlock.

Also note that nbd.txt in Documentation/blockdev/ points to
sourceforge; it should probably point to
https://github.com/NetworkBlockDevice/nbd ?

Best regards,

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: Memory management issue in 4.18.15

2018-10-25 Thread Sasha Levin
On Mon, Oct 22, 2018 at 1:01 PM Michal Hocko  wrote:
>
> On Mon 22-10-18 15:08:22, Roman Gushchin wrote:
> [...]
> > RE backporting: I'm slightly surprised that only one patch of the memcg
> > reclaim fix series has been backported. Either all or none makes much more
> > sense to me.
>
> Yeah, I think this is AUTOSEL trying to be clever again. I though it has
> been agreed that MM is quite good at marking patches for stable and so
> it was not considered by the machinery. Sasha?

I've talked about it briefly with Andrew, and he suggested that I'll
send him the list of AUTOSEL commits separately to avoid the noise, so
we'll try that and see what happens.


--
Thanks.
Sasha


Re: Memory management issue in 4.18.15

2018-10-22 Thread Roman Gushchin
> On Sat 20-10-18 14:41:40, Spock wrote:
> > Hello,
> > 
> > I have a workload, which creates lots of cache pages. Before 4.18.15,
> > the behavior was very stable: pagecache is constantly growing until it
> > consumes all the free memory, and then kswapd is balancing it around
> > low watermark. After 4.18.15, once in a while khugepaged is waking up
> > and reclaims almost all the pages from pagecache, so there is always
> > around 2G of 8G unused. THP is enabled only for madvise case and are
> > not used.

Spock, can you, please, check if the following patch solves the problem
for you?

Thank you!

--

diff --git a/fs/inode.c b/fs/inode.c
index 73432e64f874..63aca301a8bc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -731,7 +731,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
}
 
/* recently referenced inodes get one more pass */
-   if (inode->i_state & I_REFERENCED) {
+   if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
inode->i_state &= ~I_REFERENCED;
spin_unlock(&inode->i_lock);
return LRU_ROTATE;



Re: Memory management issue in 4.18.15

2018-10-22 Thread Roman Gushchin
On Mon, Oct 22, 2018 at 10:33:22AM +0200, Michal Hocko wrote:
> Cc som more people.
> 
> I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
> relatively small number of objects") has been backported to the stable
> tree when not marked that way. Put that aside it seems likely that the
> upstream kernel will have the same issue I suspect. Roman, could you
> have a look please?

So, the problem is probably caused by the unused inode eviction code:
inode_lru_isolate() invalidates all pages belonging to an unreferenced
clean inode at once, even if the goal was to scan (and potentially free)
just one inode (or any other slab object).

Spock's workload, as described, has few large files in the pagecache,
so it becomes noticeable. A small pressure applied on inode cache
surprisingly results in cleaning up significant percentage of the memory.

It happened before my change too, but was probably less noticeable,
because it usually required higher memory pressure to happen. So the
overly aggressive reclaim was less unexpected.

How to fix this?

It seems to me, that we shouldn't try invalidating pagecache pages from
the inode reclaim path at all (maybe except inodes with only few pages).
If an inode has a lot of attached pagecache, let it be evicted "naturally",
through file LRU lists.
But I need to perform some real-life testing on how this will work.

Thanks!


Re: Memory management issue in 4.18.15

2018-10-22 Thread Michal Hocko
On Mon 22-10-18 15:08:22, Roman Gushchin wrote:
[...]
> RE backporting: I'm slightly surprised that only one patch of the memcg
> reclaim fix series has been backported. Either all or none makes much more
> sense to me.

Yeah, I think this is AUTOSEL trying to be clever again. I thought it had
been agreed that MM is quite good at marking patches for stable and so
it was not considered by the machinery. Sasha?
-- 
Michal Hocko
SUSE Labs


Re: Memory management issue in 4.18.15

2018-10-22 Thread Roman Gushchin
On Mon, Oct 22, 2018 at 10:33:22AM +0200, Michal Hocko wrote:
> Cc som more people.
> 
> I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
> relatively small number of objects") has been backported to the stable
> tree when not marked that way. Put that aside it seems likely that the
> upstream kernel will have the same issue I suspect. Roman, could you
> have a look please?

Sure, already looking... Spock provided some useful details, and I think,
I know what's happening... Hope to propose a solution soon.

RE backporting: I'm slightly surprised that only one patch of the memcg
reclaim fix series has been backported. Either all or none makes much more
sense to me.

Thanks!


Re: Memory management issue in 4.18.15

2018-10-22 Thread Michal Hocko
Cc some more people.

I am wondering why 172b06c32b94 ("mm: slowly shrink slabs with a
relatively small number of objects") has been backported to the stable
tree when not marked that way. Put that aside it seems likely that the
upstream kernel will have the same issue I suspect. Roman, could you
have a look please?

On Sat 20-10-18 14:41:40, Spock wrote:
> Hello,
> 
> I have a workload, which creates lots of cache pages. Before 4.18.15,
> the behavior was very stable: pagecache is constantly growing until it
> consumes all the free memory, and then kswapd is balancing it around
> low watermark. After 4.18.15, once in a while khugepaged is waking up
> and reclaims almost all the pages from pagecache, so there is always
> around 2G of 8G unused. THP is enabled only for madvise case and are
> not used.
> 
> The exact change that leads to current behavior is
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y=62aad93f09c1952ede86405894df1b22012fd5ab

-- 
Michal Hocko
SUSE Labs


Re: Memory management issue in 4.18.15

2018-10-20 Thread Roman Gushchin
On Sat, Oct 20, 2018 at 08:37:28AM -0700, Randy Dunlap wrote:
> [add linux-mm mailing list + people]
> 
> 
> On 10/20/18 4:41 AM, Spock wrote:
> > Hello,
> > 
> > I have a workload, which creates lots of cache pages. Before 4.18.15,
> > the behavior was very stable: pagecache is constantly growing until it
> > consumes all the free memory, and then kswapd is balancing it around
> > low watermark. After 4.18.15, once in a while khugepaged is waking up
> > and reclaims almost all the pages from pagecache, so there is always
> > around 2G of 8G unused. THP is enabled only for madvise case and are
> > not used.
> > 
> > The exact change that leads to current behavior is
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y=62aad93f09c1952ede86405894df1b22012fd5ab
> > 

Hello!

Can you, please, describe your workload in more details?
Do you use memory cgroups? How many of them? What's the ratio between slabs
and pagecache in the affected cgroup? Is the pagecache mmapped by some process?
Is the majority of the pagecache created by few cached files or the number
of files is big?

This is definitely a strange effect. The change shouldn't affect pagecache
reclaim directly, so the only possibility I see is that because we started
applying some minimal pressure on slabs, we also started reclaiming some internal
fs structures under background memory pressure, which leads to a more aggressive
pagecache reclaim.

Thanks!


Re: Memory management issue in 4.18.15

2018-10-20 Thread Randy Dunlap
[add linux-mm mailing list + people]


On 10/20/18 4:41 AM, Spock wrote:
> Hello,
> 
> I have a workload, which creates lots of cache pages. Before 4.18.15,
> the behavior was very stable: pagecache is constantly growing until it
> consumes all the free memory, and then kswapd is balancing it around
> low watermark. After 4.18.15, once in a while khugepaged is waking up
> and reclaims almost all the pages from pagecache, so there is always
> around 2G of 8G unused. THP is enabled only for madvise case and are
> not used.
> 
> The exact change that leads to current behavior is
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y=62aad93f09c1952ede86405894df1b22012fd5ab
> 


-- 
~Randy


Memory management issue in 4.18.15

2018-10-20 Thread Spock
Hello,

I have a workload, which creates lots of cache pages. Before 4.18.15,
the behavior was very stable: pagecache is constantly growing until it
consumes all the free memory, and then kswapd is balancing it around
low watermark. After 4.18.15, once in a while khugepaged is waking up
and reclaims almost all the pages from pagecache, so there is always
around 2G of 8G unused. THP is enabled only for madvise case and are
not used.

The exact change that leads to current behavior is
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.18.y=62aad93f09c1952ede86405894df1b22012fd5ab


[PATCH 10/10] kconfig: add a Memory Management options" menu

2018-07-31 Thread Christoph Hellwig
This moves all the options under a proper menu.

Based on a patch from Randy Dunlap.

Signed-off-by: Christoph Hellwig 
Tested-by: Randy Dunlap 
Acked-by: Randy Dunlap 
---
 mm/Kconfig | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ce95491abd6a..9ae1b6a8e30f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1,3 +1,6 @@
+
+menu "Memory Management options"
+
 config SELECT_MEMORY_MODEL
def_bool y
depends on ARCH_SELECT_MEMORY_MODEL
@@ -762,3 +765,5 @@ config GUP_BENCHMARK
 
 config ARCH_HAS_PTE_SPECIAL
bool
+
+endmenu
-- 
2.18.0



Re: [PATCH 5/7] docs/core-api: split memory management API to a separate file

2018-07-25 Thread Mike Rapoport
On Wed, Jul 25, 2018 at 05:05:00AM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 02:26:08PM +0300, Mike Rapoport wrote:
> > +User Space Memory Access
> > +
> > +
> > +.. kernel-doc:: arch/x86/include/asm/uaccess.h
> > +   :internal:
> > +
> > +.. kernel-doc:: arch/x86/lib/usercopy_32.c
> > +   :export:
> > +
> > +The Slab Cache
> > +==
> > +
> > +.. kernel-doc:: include/linux/slab.h
> > +   :internal:
> > +
> > +.. kernel-doc:: mm/slab.c
> > +   :export:
> > +
> > +.. kernel-doc:: mm/util.c
> > +   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast
> 
> get_user_pages_fast would fit better in the previous 'User Space Memory
> Access' section.

Yeah, it's somewhat "backward compatible" version :)

Will fix.
 
-- 
Sincerely yours,
Mike.



Re: [PATCH 5/7] docs/core-api: split memory management API to a separate file

2018-07-25 Thread Mike Rapoport
On Wed, Jul 25, 2018 at 05:05:00AM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 02:26:08PM +0300, Mike Rapoport wrote:
> > +User Space Memory Access
> > +
> > +
> > +.. kernel-doc:: arch/x86/include/asm/uaccess.h
> > +   :internal:
> > +
> > +.. kernel-doc:: arch/x86/lib/usercopy_32.c
> > +   :export:
> > +
> > +The Slab Cache
> > +==
> > +
> > +.. kernel-doc:: include/linux/slab.h
> > +   :internal:
> > +
> > +.. kernel-doc:: mm/slab.c
> > +   :export:
> > +
> > +.. kernel-doc:: mm/util.c
> > +   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast
> 
> get_user_pages_fast would fit better in the previous 'User Space Memory
> Access' section.

Yeah, it's somewhat "backward compatible" version :)

Will fix.
 
-- 
Sincerely yours,
Mike.



Re: [PATCH 5/7] docs/core-api: split memory management API to a separate file

2018-07-25 Thread Matthew Wilcox
On Wed, Jul 25, 2018 at 02:26:08PM +0300, Mike Rapoport wrote:
> +User Space Memory Access
> +
> +
> +.. kernel-doc:: arch/x86/include/asm/uaccess.h
> +   :internal:
> +
> +.. kernel-doc:: arch/x86/lib/usercopy_32.c
> +   :export:
> +
> +The Slab Cache
> +==
> +
> +.. kernel-doc:: include/linux/slab.h
> +   :internal:
> +
> +.. kernel-doc:: mm/slab.c
> +   :export:
> +
> +.. kernel-doc:: mm/util.c
> +   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast

get_user_pages_fast would fit better in the previous 'User Space Memory
Access' section.



Re: [PATCH 5/7] docs/core-api: split memory management API to a separate file

2018-07-25 Thread Matthew Wilcox
On Wed, Jul 25, 2018 at 02:26:08PM +0300, Mike Rapoport wrote:
> +User Space Memory Access
> +
> +
> +.. kernel-doc:: arch/x86/include/asm/uaccess.h
> +   :internal:
> +
> +.. kernel-doc:: arch/x86/lib/usercopy_32.c
> +   :export:
> +
> +The Slab Cache
> +==
> +
> +.. kernel-doc:: include/linux/slab.h
> +   :internal:
> +
> +.. kernel-doc:: mm/slab.c
> +   :export:
> +
> +.. kernel-doc:: mm/util.c
> +   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast

get_user_pages_fast would fit better in the previous 'User Space Memory
Access' section.



[PATCH 5/7] docs/core-api: split memory management API to a separate file

2018-07-25 Thread Mike Rapoport
Signed-off-by: Mike Rapoport 
---
 Documentation/core-api/index.rst  |  1 +
 Documentation/core-api/kernel-api.rst | 54 ---
 Documentation/core-api/mm-api.rst | 54 +++
 3 files changed, 55 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/core-api/mm-api.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index f5a66b7..b580204 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -27,6 +27,7 @@ Core utilities
errseq
printk-formats
circular-buffers
+   mm-api
gfp_mask-from-fs-io
 
 Interfaces for kernel debugging
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 39f1460..3431337 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -159,60 +159,6 @@ UUID/GUID
 .. kernel-doc:: lib/uuid.c
:export:
 
-Memory Management in Linux
-==
-
-The Slab Cache
---
-
-.. kernel-doc:: include/linux/slab.h
-   :internal:
-
-.. kernel-doc:: mm/slab.c
-   :export:
-
-.. kernel-doc:: mm/util.c
-   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast
-
-User Space Memory Access
-
-
-.. kernel-doc:: arch/x86/include/asm/uaccess.h
-   :internal:
-
-.. kernel-doc:: arch/x86/lib/usercopy_32.c
-   :export:
-
-More Memory Management Functions
-
-
-.. kernel-doc:: mm/readahead.c
-   :export:
-
-.. kernel-doc:: mm/filemap.c
-   :export:
-
-.. kernel-doc:: mm/memory.c
-   :export:
-
-.. kernel-doc:: mm/vmalloc.c
-   :export:
-
-.. kernel-doc:: mm/page_alloc.c
-   :internal:
-
-.. kernel-doc:: mm/mempool.c
-   :export:
-
-.. kernel-doc:: mm/dmapool.c
-   :export:
-
-.. kernel-doc:: mm/page-writeback.c
-   :export:
-
-.. kernel-doc:: mm/truncate.c
-   :export:
-
 Kernel IPC facilities
 =
 
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
new file mode 100644
index 000..65a8ef09
--- /dev/null
+++ b/Documentation/core-api/mm-api.rst
@@ -0,0 +1,54 @@
+==
+Memory Management APIs
+==
+
+User Space Memory Access
+
+
+.. kernel-doc:: arch/x86/include/asm/uaccess.h
+   :internal:
+
+.. kernel-doc:: arch/x86/lib/usercopy_32.c
+   :export:
+
+The Slab Cache
+==
+
+.. kernel-doc:: include/linux/slab.h
+   :internal:
+
+.. kernel-doc:: mm/slab.c
+   :export:
+
+.. kernel-doc:: mm/util.c
+   :functions: kfree_const kvmalloc_node kvfree get_user_pages_fast
+
+More Memory Management Functions
+
+
+.. kernel-doc:: mm/readahead.c
+   :export:
+
+.. kernel-doc:: mm/filemap.c
+   :export:
+
+.. kernel-doc:: mm/memory.c
+   :export:
+
+.. kernel-doc:: mm/vmalloc.c
+   :export:
+
+.. kernel-doc:: mm/page_alloc.c
+   :internal:
+
+.. kernel-doc:: mm/mempool.c
+   :export:
+
+.. kernel-doc:: mm/dmapool.c
+   :export:
+
+.. kernel-doc:: mm/page-writeback.c
+   :export:
+
+.. kernel-doc:: mm/truncate.c
+   :export:
-- 
2.7.4



[PATCH 0/7] memory management documentation updates

2018-07-25 Thread Mike Rapoport
Hi,

Here are several updates to the mm documentation.

Aside from really minor changes in the first three patches, the updates
are:

* move the documentation of kstrdup and friends to "String Manipulation"
  section
* split memory management API into a separate .rst file
* adjust formatting of the GFP flags description and include it in the
  reference documentation.
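
For readers unfamiliar with the format, the .. kernel-doc:: directives
used in these patches pull in comments shaped like the following (an
illustrative example, not taken from this series):

  #include <linux/slab.h>
  #include <linux/string.h>
  #include <linux/export.h>

  /**
   * example_dup() - duplicate a buffer
   * @src: source buffer
   * @len: number of bytes to copy
   *
   * Return: newly allocated copy of @src, or %NULL on allocation failure.
   */
  void *example_dup(const void *src, size_t len)
  {
          void *p = kmalloc(len, GFP_KERNEL);

          if (p)
                  memcpy(p, src, len);
          return p;
  }
  EXPORT_SYMBOL(example_dup);    /* found by ":export:" directives */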

Mike Rapoport (7):
  mm/util: make strndup_user description a kernel-doc comment
  mm/util: add kernel-doc for kvfree
  docs/core-api: kill trailing whitespace in kernel-api.rst
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: split memory management API to a separate file
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: mm-api: add section about GFP flags

 Documentation/core-api/index.rst  |   1 +
 Documentation/core-api/kernel-api.rst |  59 +--
 Documentation/core-api/mm-api.rst |  76 +
 include/linux/gfp.h   | 291 ++
 mm/util.c |   9 +-
 5 files changed, 243 insertions(+), 193 deletions(-)
 create mode 100644 Documentation/core-api/mm-api.rst

-- 
2.7.4



[PATCH 10/10] kconfig: add a Memory Management options" menu

2018-07-24 Thread Christoph Hellwig
This moves all the options under a proper menu.

Based on a patch from Randy Dunlap.

Signed-off-by: Christoph Hellwig 
Tested-by: Randy Dunlap 
Acked-by: Randy Dunlap 
---
 mm/Kconfig | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ce95491abd6a..9ae1b6a8e30f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1,3 +1,6 @@
+
+menu "Memory Management options"
+
 config SELECT_MEMORY_MODEL
def_bool y
depends on ARCH_SELECT_MEMORY_MODEL
@@ -762,3 +765,5 @@ config GUP_BENCHMARK
 
 config ARCH_HAS_PTE_SPECIAL
bool
+
+endmenu
-- 
2.18.0


