[PATCH] Documentation: Mention why %p prints ptrval
When debugging recent kernels, people will see '(ptrval)' but there
isn't much information as to what that means. Briefly describe why it's
there.

Signed-off-by: Joel Stanley
---
 Documentation/core-api/printk-formats.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index 934559b3c130..eb30efdd2e78 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -60,8 +60,8 @@ Plain Pointers
 Pointers printed without a specifier extension (i.e unadorned %p) are
 hashed to prevent leaking information about the kernel memory layout. This
 has the added benefit of providing a unique identifier. On 64-bit machines
-the first 32 bits are zeroed. If you *really* want the address see %px
-below.
+the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it
+gathers enough entropy. If you *really* want the address see %px below.

 Symbols/Function Pointers
 -
--
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
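To make the behaviour described in this patch concrete, here is a small user-space C sketch. It is a toy model under stated assumptions, not the kernel's implementation: toy_mix() is a stand-in hash chosen for brevity, and format_plain_pointer(), key_ready and key are names invented for this illustration. It shows only the two documented effects: the (ptrval) placeholder while no key is available, and zeroed upper 32 bits of a 64-bit value afterwards.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in 64-bit mixer (an assumption for this sketch, not the kernel's hash). */
static uint64_t toy_mix(uint64_t value, uint64_t key)
{
    value ^= key;
    value ^= value >> 30;
    value *= UINT64_C(0xbf58476d1ce4e5b9);
    value ^= value >> 27;
    value *= UINT64_C(0x94d049bb133111eb);
    value ^= value >> 31;
    return value;
}

/*
 * Hypothetical helper: format a pointer-sized value the way the patch
 * describes plain %p output. Returns what snprintf() returns.
 */
int format_plain_pointer(char *buf, size_t size, uint64_t ptr,
                         bool key_ready, uint64_t key)
{
    if (!key_ready) {
        /* Not enough entropy gathered yet: print the placeholder. */
        return snprintf(buf, size, "(ptrval)");
    }

    /* Hash, then keep only the low 32 bits so the first 32 bits are zero. */
    uint64_t hashed = toy_mix(ptr, key) & UINT64_C(0xffffffff);
    return snprintf(buf, size, "%016" PRIx64, hashed);
}
```

With key_ready false the helper writes the placeholder. Once a key is supplied, the same input always maps to the same 16-digit string, which is why the hashed value can still serve as an identifier.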
[PATCH] Input: alps - Update documentation for trackstick v3 format
Bits for M, R and L buttons are already processed in alps. Other newly
documented bits not yet.

Signed-off-by: Pali Rohár
---
This is based on information which Masaki Ota provided to us.
---
 Documentation/input/devices/alps.rst | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/Documentation/input/devices/alps.rst b/Documentation/input/devices/alps.rst
index 6779148e428c..b556d6bde5e1 100644
--- a/Documentation/input/devices/alps.rst
+++ b/Documentation/input/devices/alps.rst
@@ -192,10 +192,13 @@ The final v3 packet type is the trackstick packet::
  byte 0:    1    1   x7   y7    1    1    1    1
  byte 1:    0   x6   x5   x4   x3   x2   x1   x0
  byte 2:    0   y6   y5   y4   y3   y2   y1   y0
- byte 3:    0    1    0    0    1    0    0    0
- byte 4:    0   z4   z3   z2   z1   z0    ?    ?
+ byte 3:    0    1   TP   SW    1    M    R    L
+ byte 4:    0   z6   z5   z4   z3   z2   z1   z0
  byte 5:    0    0    1    1    1    1    1    1

+TP means Tap SW status when tap processing is enabled or Press status when press
+processing is enabled. SW means scroll up when 4 buttons are available.
+
 ALPS Absolute Mode - Protocol Version 4
 ---

--
2.11.0
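The updated packet layout in this patch can be decoded along the following lines. This is a user-space sketch, not the alps driver code: the struct, its field names and decode_trackstick_v3() are invented for illustration, the leftmost column of the table is taken to be bit 7, and reading x7..x0 and y7..y0 as signed 8-bit values is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields; struct and field names are illustrative only. */
struct trackstick_v3 {
    int x;              /* x7..x0, read as signed 8-bit (assumption) */
    int y;              /* y7..y0, read as signed 8-bit (assumption) */
    unsigned int z;     /* z6..z0 */
    bool tp;            /* Tap SW or Press status */
    bool sw;            /* scroll up when 4 buttons are available */
    bool middle, right, left;
};

/* Interpret an 8-bit quantity as two's complement without relying on casts. */
static int sign_extend8(unsigned int v)
{
    return v >= 128 ? (int)v - 256 : (int)v;
}

/* Returns false if the constant bits do not match the documented pattern. */
bool decode_trackstick_v3(const uint8_t p[6], struct trackstick_v3 *out)
{
    if ((p[0] & 0xcf) != 0xcf ||    /* byte 0: 1 1 x7 y7 1 1 1 1 */
        (p[1] & 0x80) != 0 ||       /* byte 1: 0 x6..x0 */
        (p[2] & 0x80) != 0 ||       /* byte 2: 0 y6..y0 */
        (p[3] & 0xc8) != 0x48 ||    /* byte 3: 0 1 TP SW 1 M R L */
        (p[4] & 0x80) != 0 ||       /* byte 4: 0 z6..z0 */
        p[5] != 0x3f)               /* byte 5: 0 0 1 1 1 1 1 1 */
        return false;

    out->x = sign_extend8(((p[0] & 0x20u) << 2) | (p[1] & 0x7fu));
    out->y = sign_extend8(((p[0] & 0x10u) << 3) | (p[2] & 0x7fu));
    out->z = p[4] & 0x7fu;
    out->tp = (p[3] & 0x20) != 0;
    out->sw = (p[3] & 0x10) != 0;
    out->middle = (p[3] & 0x04) != 0;
    out->right = (p[3] & 0x02) != 0;
    out->left = (p[3] & 0x01) != 0;
    return true;
}
```

Checking the constant bits first is a design choice of this sketch: it lets a caller reject bytes that are not a trackstick packet before trusting any of the decoded fields.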
[PATCH 02/32] docs/vm: balance: convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/balance | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index 9645954..6a1fadf 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -1,3 +1,9 @@
+.. _balance:
+
+
+Memory Balancing
+
+
 Started Jan 2000 by Kanoj Sarcar

 Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
@@ -62,11 +68,11 @@ for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
 so as to give a fighting chance for replace_with_highmem() to get a
 HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
 fall back into regular zone. This also makes sure that HIGHMEM pages
-are not leaked (for example, in situations where a HIGHMEM page is in 
+are not leaked (for example, in situations where a HIGHMEM page is in
 the swapcache but is not being used by anyone)

 kswapd also needs to know about the zones it should balance. kswapd is
-primarily needed in a situation where balancing can not be done, 
+primarily needed in a situation where balancing can not be done,
 probably because all allocation requests are coming from intr context
 and all process contexts are sleeping. For 2.3, kswapd does not really
 need to balance the highmem zone, since intr context does not request
@@ -89,7 +95,8 @@ pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.


 (Good) Ideas that I have heard:
+
 1. Dynamic experience should influence balancing: number of failed requests
-for a zone can be tracked and fed into the balancing scheme (ja...@mbay.net)
+   for a zone can be tracked and fed into the balancing scheme (ja...@mbay.net)
 2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
-dma pages. (l...@tantalophile.demon.co.uk)
+   dma pages. (l...@tantalophile.demon.co.uk)
--
2.7.4
[PATCH 01/32] docs/vm: active_mm.txt convert to ReST format
Just add a label for cross-referencing and indent the text to make it ``literal`` Signed-off-by: Mike Rapoport--- Documentation/vm/active_mm.txt | 174 + 1 file changed, 91 insertions(+), 83 deletions(-) diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt index dbf4581..c84471b 100644 --- a/Documentation/vm/active_mm.txt +++ b/Documentation/vm/active_mm.txt @@ -1,83 +1,91 @@ -List: linux-kernel -Subject:Re: active_mm -From: Linus Torvalds -Date: 1999-07-30 21:36:24 - -Cc'd to linux-kernel, because I don't write explanations all that often, -and when I do I feel better about more people reading them. - -On Fri, 30 Jul 1999, David Mosberger wrote: -> -> Is there a brief description someplace on how "mm" vs. "active_mm" in -> the task_struct are supposed to be used? (My apologies if this was -> discussed on the mailing lists---I just returned from vacation and -> wasn't able to follow linux-kernel for a while). - -Basically, the new setup is: - - - we have "real address spaces" and "anonymous address spaces". The - difference is that an anonymous address space doesn't care about the - user-level page tables at all, so when we do a context switch into an - anonymous address space we just leave the previous address space - active. - - The obvious use for a "anonymous address space" is any thread that - doesn't need any user mappings - all kernel threads basically fall into - this category, but even "real" threads can temporarily say that for - some amount of time they are not going to be interested in user space, - and that the scheduler might as well try to avoid wasting time on - switching the VM state around. Currently only the old-style bdflush - sync does that. - - - "tsk->mm" points to the "real address space". For an anonymous process, - tsk->mm will be NULL, for the logical reason that an anonymous process - really doesn't _have_ a real address space at all. 
- - - however, we obviously need to keep track of which address space we - "stole" for such an anonymous user. For that, we have "tsk->active_mm", - which shows what the currently active address space is. - - The rule is that for a process with a real address space (ie tsk->mm is - non-NULL) the active_mm obviously always has to be the same as the real - one. - - For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the - "borrowed" mm while the anonymous process is running. When the - anonymous process gets scheduled away, the borrowed address space is - returned and cleared. - -To support all that, the "struct mm_struct" now has two counters: a -"mm_users" counter that is how many "real address space users" there are, -and a "mm_count" counter that is the number of "lazy" users (ie anonymous -users) plus one if there are any real users. - -Usually there is at least one real user, but it could be that the real -user exited on another CPU while a lazy user was still active, so you do -actually get cases where you have a address space that is _only_ used by -lazy users. That is often a short-lived state, because once that thread -gets scheduled away in favour of a real thread, the "zombie" mm gets -released because "mm_users" becomes zero. - -Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any -more. "init_mm" should be considered just a "lazy context when no other -context is available", and in fact it is mainly used just at bootup when -no real VM has yet been created. So code that used to check - - if (current->mm == _mm) - -should generally just do - - if (!current->mm) - -instead (which makes more sense anyway - the test is basically one of "do -we have a user context", and is generally done by the page fault handler -and things like that). 
- -Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago, -because it slightly changes the interfaces to accommodate the alpha (who -would have thought it, but the alpha actually ends up having one of the -ugliest context switch codes - unlike the other architectures where the MM -and register state is separate, the alpha PALcode joins the two, and you -need to switch both together). - -(From http://marc.info/?l=linux-kernel=93337278602211=2) +.. _active_mm: + += +Active MM += + +:: + + List: linux-kernel + Subject:Re: active_mm + From: Linus Torvalds + Date: 1999-07-30 21:36:24 + + Cc'd to linux-kernel, because I don't write explanations all that often, + and when I do I feel better about more people reading them. + + On Fri, 30 Jul 1999, David Mosberger wrote: + > + > Is there a brief description someplace on how "mm" vs. "active_mm" in + > the task_struct are supposed to be used? (My apologies if this was + > discussed on the mailing lists---I just returned from vacation and + > wasn't able to follow linux-kernel for a while). + + Basically, the new
[PATCH 00/32] docs/vm: convert to ReST format
Hi,

These patches convert files in Documentation/vm to ReST format, add an
initial index and link it to the top level documentation.

There are no contents changes in the documentation, except for a few
spelling fixes. The relatively large diffstat stems from the indentation
and paragraph wrapping changes.

I've tried to keep the formatting as consistent as possible, but I may have
missed some places that needed markup and added some markup where it was
not necessary.

Mike Rapoport (32):
  docs/vm: active_mm.txt convert to ReST format
  docs/vm: balance: convert to ReST format
  docs/vm: cleancache.txt: convert to ReST format
  docs/vm: frontswap.txt: convert to ReST format
  docs/vm: highmem.txt: convert to ReST format
  docs/vm: hmm.txt: convert to ReST format
  docs/vm: hugetlbpage.txt: convert to ReST format
  docs/vm: hugetlbfs_reserv.txt: convert to ReST format
  docs/vm: hwpoison.txt: convert to ReST format
  docs/vm: idle_page_tracking.txt: convert to ReST format
  docs/vm: ksm.txt: convert to ReST format
  docs/vm: mmu_notifier.txt: convert to ReST format
  docs/vm: numa_memory_policy.txt: convert to ReST format
  docs/vm: overcommit-accounting: convert to ReST format
  docs/vm: page_frags convert to ReST format
  docs/vm: numa: convert to ReST format
  docs/vm: pagemap.txt: convert to ReST format
  docs/vm: page_migration: convert to ReST format
  docs/vm: page_owner: convert to ReST format
  docs/vm: remap_file_pages.txt: conert to ReST format
  docs/vm: slub.txt: convert to ReST format
  docs/vm: soft-dirty.txt: convert to ReST format
  docs/vm: split_page_table_lock: convert to ReST format
  docs/vm: swap_numa.txt: convert to ReST format
  docs/vm: transhuge.txt: convert to ReST format
  docs/vm: unevictable-lru.txt: convert to ReST format
  docs/vm: userfaultfd.txt: convert to ReST format
  docs/vm: z3fold.txt: convert to ReST format
  docs/vm: zsmalloc.txt: convert to ReST format
  docs/vm: zswap.txt: convert to ReST format
  docs/vm: rename documentation files to .rst
  docs/vm: add index.rst and link MM documentation to top level index

 Documentation/ABI/stable/sysfs-devices-node        |   2 +-
 .../ABI/testing/sysfs-kernel-mm-hugepages          |   2 +-
 Documentation/ABI/testing/sysfs-kernel-mm-ksm      |   2 +-
 Documentation/ABI/testing/sysfs-kernel-slab        |   4 +-
 Documentation/admin-guide/kernel-parameters.txt    |  12 +-
 Documentation/dev-tools/kasan.rst                  |   2 +-
 Documentation/filesystems/proc.txt                 |   4 +-
 Documentation/filesystems/tmpfs.txt                |   2 +-
 Documentation/index.rst                            |   3 +-
 Documentation/sysctl/vm.txt                        |   6 +-
 Documentation/vm/00-INDEX                          |  58 +--
 Documentation/vm/active_mm.rst                     |  91
 Documentation/vm/active_mm.txt                     |  83
 Documentation/vm/{balance => balance.rst}          |  15 +-
 .../vm/{cleancache.txt => cleancache.rst}          | 105 +++--
 Documentation/vm/conf.py                           |  10 +
 Documentation/vm/{frontswap.txt => frontswap.rst}  |  59 ++-
 Documentation/vm/{highmem.txt => highmem.rst}      |  87 ++--
 Documentation/vm/{hmm.txt => hmm.rst}              |  66 ++-
 .../{hugetlbfs_reserv.txt => hugetlbfs_reserv.rst} | 212 +
 .../vm/{hugetlbpage.txt => hugetlbpage.rst}        | 243 ++-
 Documentation/vm/{hwpoison.txt => hwpoison.rst}    | 141 +++---
 ...le_page_tracking.txt => idle_page_tracking.rst} |  55 ++-
 Documentation/vm/index.rst                         |  56 +++
 Documentation/vm/ksm.rst                           | 183
 Documentation/vm/ksm.txt                           | 178
 Documentation/vm/mmu_notifier.rst                  |  99 +
 Documentation/vm/mmu_notifier.txt                  |  93
 Documentation/vm/{numa => numa.rst}                |   6 +-
 Documentation/vm/numa_memory_policy.rst            | 485 +
 Documentation/vm/numa_memory_policy.txt            | 452 ---
 Documentation/vm/overcommit-accounting             |  80
 Documentation/vm/overcommit-accounting.rst         |  87
 Documentation/vm/{page_frags => page_frags.rst}    |   5 +-
 .../vm/{page_migration => page_migration.rst}      | 149 ---
 .../vm/{page_owner.txt => page_owner.rst}          |  34 +-
 Documentation/vm/{pagemap.txt => pagemap.rst}      | 170
 .../{remap_file_pages.txt => remap_file_pages.rst} |   6 +
 Documentation/vm/slub.rst                          | 361 +++
 Documentation/vm/slub.txt                          | 342 ---
 .../vm/{soft-dirty.txt => soft-dirty.rst}          |  20 +-
 ...t_page_table_lock => split_page_table_lock.rst} |  12 +-
 Documentation/vm/{swap_numa.txt => swap_numa.rst}  |  55 ++-
 Documentation/vm/{transhuge.txt => transhuge.rst}  | 286 +++-
 .../{unevictable-lru.txt => unevictable-lru.rst}   | 117 +++--
 .../vm/{userfaultfd.txt =>
[PATCH 04/32] docs/vm: frontswap.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/frontswap.txt | 59 ++ 1 file changed, 37 insertions(+), 22 deletions(-) diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt index c71a019..1979f43 100644 --- a/Documentation/vm/frontswap.txt +++ b/Documentation/vm/frontswap.txt @@ -1,13 +1,20 @@ +.. _frontswap: + += +Frontswap += + Frontswap provides a "transcendent memory" interface for swap pages. In some environments, dramatic performance savings may be obtained because swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. -(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" +(Note, frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends" and the only necessary changes to the core kernel for transcendent memory; all other supporting code -- the "backends" -- is implemented as drivers. -See the LWN.net article "Transcendent memory in a nutshell" for a detailed -overview of frontswap and related kernel parts: -https://lwn.net/Articles/454795/ ) +See the LWN.net article `Transcendent memory in a nutshell`_ +for a detailed overview of frontswap and related kernel parts) + +.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/ Frontswap is so named because it can be thought of as the opposite of a "backing" store for a swap device. The storage is assumed to be @@ -50,19 +57,27 @@ or the store fails AND the page is invalidated. This ensures stale data may never be obtained from frontswap. If properly configured, monitoring of frontswap is done via debugfs in -the /sys/kernel/debug/frontswap directory. The effectiveness of +the `/sys/kernel/debug/frontswap` directory. 
The effectiveness of frontswap can be measured (across all swap devices) with: -failed_stores - how many store attempts have failed -loads - how many loads were attempted (all should succeed) -succ_stores- how many store attempts have succeeded -invalidates- how many invalidates were attempted +``failed_stores`` + how many store attempts have failed + +``loads`` + how many loads were attempted (all should succeed) + +``succ_stores`` + how many store attempts have succeeded + +``invalidates`` + how many invalidates were attempted A backend implementation may provide additional metrics. FAQ +=== -1) Where's the value? +* Where's the value? When a workload starts swapping, performance falls through the floor. Frontswap significantly increases performance in many such workloads by @@ -117,8 +132,8 @@ A KVM implementation is underway and has been RFC'ed to lkml. And, using frontswap, investigation is also underway on the use of NVM as a memory extension technology. -2) Sure there may be performance advantages in some situations, but - what's the space/time overhead of frontswap? +* Sure there may be performance advantages in some situations, but + what's the space/time overhead of frontswap? If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into nothingness and the only overhead is a few extra bytes per swapon'ed @@ -148,8 +163,8 @@ pressure that can potentially outweigh the other advantages. A backend, such as zcache, must implement policies to carefully (but dynamically) manage memory limits to ensure this doesn't happen. -3) OK, how about a quick overview of what this frontswap patch does - in terms that a kernel hacker can grok? +* OK, how about a quick overview of what this frontswap patch does + in terms that a kernel hacker can grok? 
Let's assume that a frontswap "backend" has registered during kernel initialization; this registration indicates that this @@ -188,9 +203,9 @@ and (potentially) a swap device write are replaced by a "frontswap backend store" and (possibly) a "frontswap backend loads", which are presumably much faster. -4) Can't frontswap be configured as a "special" swap device that is - just higher priority than any real swap device (e.g. like zswap, - or maybe swap-over-nbd/NFS)? +* Can't frontswap be configured as a "special" swap device that is + just higher priority than any real swap device (e.g. like zswap, + or maybe swap-over-nbd/NFS)? No. First, the existing swap subsystem doesn't allow for any kind of swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, @@ -240,9 +255,9 @@ installation, frontswap is useless. Swapless portable devices can still use frontswap but a backend for such devices must configure some kind of "ghost" swap device and ensure that it is never used. -5) Why this weird definition about "duplicate stores"? If a page - has been previously successfully stored, can't it always be - successfully overwritten? +* Why this weird definition about "duplicate stores"? If a page + has been previously successfully stored, can't it always be + successfully
[PATCH 03/32] docs/vm: cleancache.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/cleancache.txt | 105 1 file changed, 62 insertions(+), 43 deletions(-) diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt index e4b49df..68cba91 100644 --- a/Documentation/vm/cleancache.txt +++ b/Documentation/vm/cleancache.txt @@ -1,4 +1,11 @@ -MOTIVATION +.. _cleancache: + +== +Cleancache +== + +Motivation +== Cleancache is a new optional feature provided by the VFS layer that potentially dramatically increases page cache effectiveness for @@ -21,9 +28,10 @@ Transcendent memory "drivers" for cleancache are currently implemented in Xen (using hypervisor memory) and zcache (using in-kernel compressed memory) and other implementations are in development. -FAQs are included below. +:ref:`FAQs ` are included below. -IMPLEMENTATION OVERVIEW +Implementation Overview +=== A cleancache "backend" that provides transcendent memory registers itself to the kernel's cleancache "frontend" by calling cleancache_register_ops, @@ -80,22 +88,33 @@ different Linux threads are simultaneously putting and invalidating a page with the same handle, the results are indeterminate. Callers must lock the page to ensure serial behavior. -CLEANCACHE PERFORMANCE METRICS +Cleancache Performance Metrics +== If properly configured, monitoring of cleancache is done via debugfs in -the /sys/kernel/debug/cleancache directory. The effectiveness of cleancache +the `/sys/kernel/debug/cleancache` directory. The effectiveness of cleancache can be measured (across all filesystems) with: -succ_gets - number of gets that were successful -failed_gets- number of gets that failed -puts - number of puts attempted (all "succeed") -invalidates- number of invalidates attempted +``succ_gets`` + number of gets that were successful + +``failed_gets`` + number of gets that failed + +``puts`` + number of puts attempted (all "succeed") + +``invalidates`` + number of invalidates attempted A backend implementation may provide additional metrics. 
+.. _faq: + FAQ +=== -1) Where's the value? (Andrew Morton) +* Where's the value? (Andrew Morton) Cleancache provides a significant performance benefit to many workloads in many environments with negligible overhead by improving the @@ -137,8 +156,8 @@ device that stores pages of data in a compressed state. And the proposed "RAMster" driver shares RAM across multiple physical systems. -2) Why does cleancache have its sticky fingers so deep inside the - filesystems and VFS? (Andrew Morton and Christoph Hellwig) +* Why does cleancache have its sticky fingers so deep inside the + filesystems and VFS? (Andrew Morton and Christoph Hellwig) The core hooks for cleancache in VFS are in most cases a single line and the minimum set are placed precisely where needed to maintain @@ -168,9 +187,9 @@ filesystems in the future. The total impact of the hooks to existing fs and mm files is only about 40 lines added (not counting comments and blank lines). -3) Why not make cleancache asynchronous and batched so it can - more easily interface with real devices with DMA instead - of copying each individual page? (Minchan Kim) +* Why not make cleancache asynchronous and batched so it can more + easily interface with real devices with DMA instead of copying each + individual page? (Minchan Kim) The one-page-at-a-time copy semantics simplifies the implementation on both the frontend and backend and also allows the backend to @@ -182,8 +201,8 @@ are avoided. While the interface seems odd for a "real device" or for real kernel-addressable RAM, it makes perfect sense for transcendent memory. -4) Why is non-shared cleancache "exclusive"? And where is the - page "invalidated" after a "get"? (Minchan Kim) +* Why is non-shared cleancache "exclusive"? And where is the + page "invalidated" after a "get"? (Minchan Kim) The main reason is to free up space in transcendent memory and to avoid unnecessary cleancache_invalidate calls. 
If you want inclusive, @@ -193,7 +212,7 @@ be easily extended to add a "get_no_invalidate" call. The invalidate is done by the cleancache backend implementation. -5) What's the performance impact? +* What's the performance impact? Performance analysis has been presented at OLS'09 and LCA'10. Briefly, performance gains can be significant on most workloads, @@ -206,7 +225,7 @@ single-core systems with slow memory-copy speeds, cleancache has little value, but in newer multicore machines, especially consolidated/virtualized machines, it has great value. -6) How do I add cleancache support for filesystem X? (Boaz Harrash) +* How do I add cleancache support for filesystem X? (Boaz Harrash) Filesystems that are well-behaved and conform to certain restrictions can utilize
[PATCH 07/32] docs/vm: hugetlbpage.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/hugetlbpage.txt | 243 ++- 1 file changed, 139 insertions(+), 104 deletions(-) diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index faf077d..3bb0d99 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -1,3 +1,11 @@ +.. _hugetlbpage: + += +HugeTLB Pages += + +Overview + The intent of this file is to give a brief summary of hugetlbpage support in the Linux kernel. This support is built on top of multiple page size support @@ -18,53 +26,59 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS automatically when CONFIG_HUGETLBFS is selected) configuration options. -The /proc/meminfo file provides information about the total number of +The ``/proc/meminfo`` file provides information about the total number of persistent hugetlb pages in the kernel's huge page pool. It also displays default huge page size and information about the number of free, reserved and surplus huge pages in the pool of huge pages of default size. The huge page size is needed for generating the proper alignment and size of the arguments to system calls that map huge page regions. -The output of "cat /proc/meminfo" will include lines like: +The output of ``cat /proc/meminfo`` will include lines like:: -. -HugePages_Total: uuu -HugePages_Free: vvv -HugePages_Rsvd: www -HugePages_Surp: xxx -Hugepagesize:yyy kB -Hugetlb: zzz kB + HugePages_Total: uuu + HugePages_Free: vvv + HugePages_Rsvd: www + HugePages_Surp: xxx + Hugepagesize:yyy kB + Hugetlb: zzz kB where: -HugePages_Total is the size of the pool of huge pages. -HugePages_Free is the number of huge pages in the pool that are not yet -allocated. -HugePages_Rsvd is short for "reserved," and is the number of huge pages for -which a commitment to allocate from the pool has been made, -but no allocation has yet been made. 
Reserved huge pages -guarantee that an application will be able to allocate a -huge page from the pool of huge pages at fault time. -HugePages_Surp is short for "surplus," and is the number of huge pages in -the pool above the value in /proc/sys/vm/nr_hugepages. The -maximum number of surplus huge pages is controlled by -/proc/sys/vm/nr_overcommit_hugepages. -Hugepagesizeis the default hugepage size (in Kb). -Hugetlb is the total amount of memory (in kB), consumed by huge -pages of all sizes. -If huge pages of different sizes are in use, this number -will exceed HugePages_Total * Hugepagesize. To get more -detailed information, please, refer to -/sys/kernel/mm/hugepages (described below). - - -/proc/filesystems should also show a filesystem of type "hugetlbfs" configured -in the kernel. - -/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge + +HugePages_Total + is the size of the pool of huge pages. +HugePages_Free + is the number of huge pages in the pool that are not yet +allocated. +HugePages_Rsvd + is short for "reserved," and is the number of huge pages for +which a commitment to allocate from the pool has been made, +but no allocation has yet been made. Reserved huge pages +guarantee that an application will be able to allocate a +huge page from the pool of huge pages at fault time. +HugePages_Surp + is short for "surplus," and is the number of huge pages in +the pool above the value in ``/proc/sys/vm/nr_hugepages``. The +maximum number of surplus huge pages is controlled by +``/proc/sys/vm/nr_overcommit_hugepages``. +Hugepagesize + is the default hugepage size (in Kb). +Hugetlb +is the total amount of memory (in kB), consumed by huge +pages of all sizes. +If huge pages of different sizes are in use, this number +will exceed HugePages_Total \* Hugepagesize. To get more +detailed information, please, refer to +``/sys/kernel/mm/hugepages`` (described below). 
+ + +``/proc/filesystems`` should also show a filesystem of type "hugetlbfs" +configured in the kernel. + +``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge pages in the kernel's huge page pool. "Persistent" huge pages will be returned to the huge page pool when freed by a task. A user with root privileges can dynamically allocate more or free some persistent huge pages -by increasing or decreasing the value of 'nr_hugepages'. +by increasing or decreasing the value of ``nr_hugepages``. Pages that are used as huge pages are reserved inside the kernel and cannot be used for other
[PATCH 08/32] docs/vm: hugetlbfs_reserv.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/hugetlbfs_reserv.txt | 212 ++ 1 file changed, 135 insertions(+), 77 deletions(-) diff --git a/Documentation/vm/hugetlbfs_reserv.txt b/Documentation/vm/hugetlbfs_reserv.txt index 9aca09a..36a87a2 100644 --- a/Documentation/vm/hugetlbfs_reserv.txt +++ b/Documentation/vm/hugetlbfs_reserv.txt @@ -1,6 +1,13 @@ -Hugetlbfs Reservation Overview --- -Huge pages as described at 'Documentation/vm/hugetlbpage.txt' are typically +.. _hugetlbfs_reserve: + += +Hugetlbfs Reservation += + +Overview + + +Huge pages as described at :ref:`hugetlbpage` are typically preallocated for application use. These huge pages are instantiated in a task's address space at page fault time if the VMA indicates huge pages are to be used. If no huge page exists at page fault time, the task is sent @@ -17,47 +24,55 @@ describe how huge page reserve processing is done in the v4.10 kernel. Audience - + This description is primarily targeted at kernel developers who are modifying hugetlbfs code. The Data Structures +=== + resv_huge_pages This is a global (per-hstate) count of reserved huge pages. Reserved huge pages are only available to the task which reserved them. Therefore, the number of huge pages generally available is computed - as (free_huge_pages - resv_huge_pages). + as (``free_huge_pages - resv_huge_pages``). Reserve Map - A reserve map is described by the structure: - struct resv_map { - struct kref refs; - spinlock_t lock; - struct list_head regions; - long adds_in_progress; - struct list_head region_cache; - long region_cache_count; - }; + A reserve map is described by the structure:: + + struct resv_map { + struct kref refs; + spinlock_t lock; + struct list_head regions; + long adds_in_progress; + struct list_head region_cache; + long region_cache_count; + }; + There is one reserve map for each huge page mapping in the system. The regions list within the resv_map describes the regions within - the mapping. 
A region is described as: - struct file_region { - struct list_head link; - long from; - long to; - }; + the mapping. A region is described as:: + + struct file_region { + struct list_head link; + long from; + long to; + }; + The 'from' and 'to' fields of the file region structure are huge page indices into the mapping. Depending on the type of mapping, a region in the reserv_map may indicate reservations exist for the range, or reservations do not exist. Flags for MAP_PRIVATE Reservations These are stored in the bottom bits of the reservation map pointer. - #define HPAGE_RESV_OWNER(1UL << 0) Indicates this task is the - owner of the reservations associated with the mapping. - #define HPAGE_RESV_UNMAPPED (1UL << 1) Indicates task originally - mapping this range (and creating reserves) has unmapped a - page from this task (the child) due to a failed COW. + + ``#define HPAGE_RESV_OWNER(1UL << 0)`` + Indicates this task is the owner of the reservations + associated with the mapping. + ``#define HPAGE_RESV_UNMAPPED (1UL << 1)`` + Indicates task originally mapping this range (and creating + reserves) has unmapped a page from this task (the child) + due to a failed COW. Page Flags The PagePrivate page flag is used to indicate that a huge page reservation must be restored when the huge page is freed. More @@ -65,12 +80,14 @@ Page Flags Reservation Map Location (Private or Shared) - + + A huge page mapping or segment is either private or shared. If private, it is typically only available to a single address space (task). If shared, it can be mapped into multiple address spaces (tasks). The location and semantics of the reservation map is significantly different for two types of mappings. Location differences are: + - For private mappings, the reservation map hangs off the the VMA structure. Specifically, vma->vm_private_data. This reserve map is created at the time the mapping (mmap(MAP_PRIVATE)) is created. @@ -82,15 +99,15 @@ of mappings. 
Location differences are: Creating Reservations
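The "Flags for MAP_PRIVATE Reservations" entry in the patch above says the two flags live in the bottom bits of the reservation map pointer. A minimal user-space sketch of that technique follows. The flag values are the ones quoted in the patch; struct resv_map_stub, HPAGE_RESV_MASK as defined here, and the three helper functions are assumptions made for this illustration and are not taken from the kernel source.

```c
#include <stdint.h>

#define HPAGE_RESV_OWNER    (1UL << 0)  /* value quoted in the patch */
#define HPAGE_RESV_UNMAPPED (1UL << 1)  /* value quoted in the patch */
/* Helper mask for this sketch: both flag bits together. */
#define HPAGE_RESV_MASK     (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

/* Stand-in for the real reserve map; only its alignment matters here. */
struct resv_map_stub {
    long region_cache_count;
};

/* The two low bits are free only if objects are at least 4-byte aligned. */
_Static_assert(_Alignof(struct resv_map_stub) >= 4,
               "need two spare low bits in the pointer");

/* Combine a map pointer and flag bits into one pointer-sized word. */
uintptr_t pack_resv(struct resv_map_stub *map, unsigned long flags)
{
    return (uintptr_t)map | (uintptr_t)(flags & HPAGE_RESV_MASK);
}

/* Recover the pointer by clearing the flag bits. */
struct resv_map_stub *resv_pointer(uintptr_t word)
{
    return (struct resv_map_stub *)(word & ~(uintptr_t)HPAGE_RESV_MASK);
}

/* Recover the flag bits by masking everything else off. */
unsigned long resv_flags(uintptr_t word)
{
    return (unsigned long)(word & HPAGE_RESV_MASK);
}
```

The trick depends on the alignment guarantee checked by the static assertion: because a suitably aligned pointer always has zero low bits, those bits can carry per-mapping state without a second field.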
[PATCH 09/32] docs/vm: hwpoison.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/hwpoison.txt | 141 +- 1 file changed, 70 insertions(+), 71 deletions(-) diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index e912d7e..b1a8c24 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -1,7 +1,14 @@ +.. hwpoison: + + +hwpoison + + What is hwpoison? += Upcoming Intel CPUs have support for recovering from some memory errors -(``MCA recovery''). This requires the OS to declare a page "poisoned", +(``MCA recovery``). This requires the OS to declare a page "poisoned", kill the processes associated with it and avoid using it in the future. This patchkit implements the necessary infrastructure in the VM. @@ -46,9 +53,10 @@ address. This in theory allows other applications to handle memory failures too. The expection is that near all applications won't do that, but some very specialized ones might. +Failure recovery modes +== -There are two (actually three) modi memory failure recovery can be in: +There are two (actually three) modes memory failure recovery can be in: vm.memory_failure_recovery sysctl set to zero: All memory failures cause a panic. Do not attempt recovery. @@ -67,9 +75,8 @@ late kill This is best for memory error unaware applications and default Note some pages are always handled as late kill. 
- -User control: +User control + vm.memory_failure_recovery See sysctl.txt @@ -79,11 +86,19 @@ vm.memory_failure_early_kill PR_MCE_KILL Set early/late kill mode/revert to system default - arg1: PR_MCE_KILL_CLEAR: Revert to system default - arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode - PR_MCE_KILL_EARLY: Early kill - PR_MCE_KILL_LATE: Late kill - PR_MCE_KILL_DEFAULT: Use system global default + + arg1: PR_MCE_KILL_CLEAR: + Revert to system default + arg1: PR_MCE_KILL_SET: + arg2 defines thread specific mode + + PR_MCE_KILL_EARLY: + Early kill + PR_MCE_KILL_LATE: + Late kill + PR_MCE_KILL_DEFAULT + Use system global default + Note that if you want to have a dedicated thread which handles the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise, @@ -92,77 +107,64 @@ PR_MCE_KILL PR_MCE_KILL_GET return current mode +Testing +=== - -Testing: - -madvise(MADV_HWPOISON, ) - (as root) - Poison a page in the process for testing - +* madvise(MADV_HWPOISON, ) (as root) - Poison a page in the + process for testing -hwpoison-inject module through debugfs +* hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/`` -/sys/kernel/debug/hwpoison/ + corrupt-pfn + Inject hwpoison fault at PFN echoed into this file. This does + some early filtering to avoid corrupted unintended pages in test suites. -corrupt-pfn + unpoison-pfn + Software-unpoison page at PFN echoed into this file. This way + a page can be reused again. This only works for Linux + injected failures, not for real memory failures. -Inject hwpoison fault at PFN echoed into this file. This does -some early filtering to avoid corrupted unintended pages in test suites. 
+ Note these injection interfaces are not stable and might change between + kernel versions -unpoison-pfn + corrupt-filter-dev-major, corrupt-filter-dev-minor + Only handle memory failures to pages associated with the file + system defined by block device major/minor. -1U is the + wildcard value. This should be only used for testing with + artificial injection. -Software-unpoison page at PFN echoed into this file. This -way a page can be reused again. -This only works for Linux injected failures, not for real -memory failures. + corrupt-filter-memcg + Limit injection to pages owned by memgroup. Specified by inode + number of the memcg. -Note these injection interfaces are not stable and might change between -kernel versions + Example:: -corrupt-filter-dev-major -corrupt-filter-dev-minor + mkdir /sys/fs/cgroup/mem/hwpoison -Only handle memory failures to pages associated with the file system defined -by block device major/minor. -1U is the wildcard value. -This should be only used for testing with artificial injection. + usemem -m 100 -s 1000 & + echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks -corrupt-filter-memcg + memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') + echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg -Limit injection
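The PR_MCE_KILL section of the patch above describes the early/late kill modes only as a list of arguments. As a hedged sketch of how a thread might use them from C, the code below wraps the prctl() calls. The `mce_*` function names are hypothetical helpers made up for this note. The five-argument call form, with unused arguments set to zero, follows the prctl(2) manual page rather than anything stated in the patch.

```c
#include <assert.h>
#include <sys/prctl.h>

/*
 * Request early kill for the calling thread.
 * Returns 0 on success, -1 with errno set on failure.
 */
int mce_request_early_kill(void)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* Revert the calling thread to the system default policy. */
int mce_revert_to_default(void)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0, 0, 0);
}

/* Query the current mode; returns a PR_MCE_KILL_* value or -1 on error. */
int mce_current_mode(void)
{
    return prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);
}

/* Translate a successful mce_current_mode() result into a label. */
const char *mce_kill_mode_name(int mode)
{
    assert(mode >= 0); /* callers should handle the -1 error case first */

    switch (mode) {
    case PR_MCE_KILL_EARLY:
        return "early";
    case PR_MCE_KILL_LATE:
        return "late";
    case PR_MCE_KILL_DEFAULT:
        return "default";
    default:
        return "unknown";
    }
}
```

A dedicated SIGBUS handling thread would call `mce_request_early_kill()` on itself, as the note about SIGBUS(BUS_MCEERR_AO) in the patch suggests. The calls may fail on kernels or sandboxes without support, so the return value should be checked.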
[PATCH 12/32] docs/vm: mmu_notifier.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/mmu_notifier.txt | 108 -- 1 file changed, 57 insertions(+), 51 deletions(-) diff --git a/Documentation/vm/mmu_notifier.txt b/Documentation/vm/mmu_notifier.txt index 23b4625..47baa1c 100644 --- a/Documentation/vm/mmu_notifier.txt +++ b/Documentation/vm/mmu_notifier.txt @@ -1,7 +1,10 @@ +.. _mmu_notifier: + When do you need to notify inside page table lock ? +=== When clearing a pte/pmd we are given a choice to notify the event through -(notify version of *_clear_flush call mmu_notifier_invalidate_range) under +(notify version of \*_clear_flush call mmu_notifier_invalidate_range) under the page table lock. But that notification is not necessary in all cases. For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use @@ -18,6 +21,7 @@ a page that might now be used by some completely different task. Case B is more subtle. For correctness it requires the following sequence to happen: + - take page table lock - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify()) - set page table entry to point to new page @@ -28,58 +32,60 @@ the device. Consider the following scenario (device use a feature similar to ATS/PASID): -Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume +Two address addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE we assume they are write protected for COW (other case of B apply too). 
-[Time N] -CPU-thread-0 {try to write to addrA} -CPU-thread-1 {try to write to addrB} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {read addrA and populate device TLB} -DEV-thread-2 {read addrB and populate device TLB} -[Time N+1] -- -CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} -CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+2] -- -CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} -CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+3] -- -CPU-thread-0 {preempted} -CPU-thread-1 {preempted} -CPU-thread-2 {write to addrA which is a write to new page} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+3] -- -CPU-thread-0 {preempted} -CPU-thread-1 {preempted} -CPU-thread-2 {} -CPU-thread-3 {write to addrB which is a write to new page} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+4] -- -CPU-thread-0 {preempted} -CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {} -DEV-thread-2 {} -[Time N+5] -- -CPU-thread-0 {preempted} -CPU-thread-1 {} -CPU-thread-2 {} -CPU-thread-3 {} -DEV-thread-0 {read addrA from old page} -DEV-thread-2 {read addrB from new page} +:: + + [Time N] + CPU-thread-0 {try to write to addrA} + CPU-thread-1 {try to write to addrB} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {read addrA and populate device TLB} + DEV-thread-2 {read addrB and populate device TLB} + [Time N+1] -- + CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} + CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+2] -- + CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} + CPU-thread-1 {COW_step1: {update 
page table to point to new page for addrB}} + CPU-thread-2 {} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+3] -- + CPU-thread-0 {preempted} + CPU-thread-1 {preempted} + CPU-thread-2 {write to addrA which is a write to new page} + CPU-thread-3 {} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+3] -- + CPU-thread-0 {preempted} + CPU-thread-1 {preempted} + CPU-thread-2 {} + CPU-thread-3 {write to addrB which is a write to new page} + DEV-thread-0 {} + DEV-thread-2 {} + [Time N+4] -- +
[PATCH 10/32] docs/vm: idle_page_tracking.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/idle_page_tracking.txt | 55 + 1 file changed, 36 insertions(+), 19 deletions(-) diff --git a/Documentation/vm/idle_page_tracking.txt b/Documentation/vm/idle_page_tracking.txt index 85dcc3b..9cbe6f8 100644 --- a/Documentation/vm/idle_page_tracking.txt +++ b/Documentation/vm/idle_page_tracking.txt @@ -1,4 +1,11 @@ -MOTIVATION +.. _idle_page_tracking: + +== +Idle Page Tracking +== + +Motivation +== The idle page tracking feature allows to track which memory pages are being accessed by a workload and which are idle. This information can be useful for @@ -8,10 +15,14 @@ or deciding where to place the workload within a compute cluster. It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. -USER API +.. _user_api: -The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently, -it consists of the only read-write file, /sys/kernel/mm/page_idle/bitmap. +User API + + +The idle page tracking API is located at ``/sys/kernel/mm/page_idle``. +Currently, it consists of the only read-write file, +``/sys/kernel/mm/page_idle/bitmap``. The file implements a bitmap where each bit corresponds to a memory page. The bitmap is represented by an array of 8-byte integers, and the page at PFN #i is @@ -19,8 +30,9 @@ mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle -(for more details on what "accessed" actually means see the IMPLEMENTATION -DETAILS section). To mark a page idle one has to set the bit corresponding to +(for more details on what "accessed" actually means see the :ref:`Implementation +Details ` section). +To mark a page idle one has to set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. @@ -30,9 +42,9 @@ page types (e.g. 
SLAB pages) an attempt to mark a page idle is silently ignored, and hence such pages are never reported idle. For huge pages the idle flag is set only on the head page, so one has to read -/proc/kpageflags in order to correctly count idle huge pages. +``/proc/kpageflags`` in order to correctly count idle huge pages. -Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return +Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return -EINVAL if you are not starting the read/write on an 8-byte boundary, or if the size of the read/write is not a multiple of 8 bytes. Writing to this file beyond max PFN will return -ENXIO. @@ -41,21 +53,25 @@ That said, in order to estimate the amount of pages that are not used by a workload one should: 1. Mark all the workload's pages as idle by setting corresponding bits in -/sys/kernel/mm/page_idle/bitmap. The pages can be found by reading -/proc/pid/pagemap if the workload is represented by a process, or by -filtering out alien pages using /proc/kpagecgroup in case the workload is -placed in a memory cgroup. +``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading +``/proc/pid/pagemap`` if the workload is represented by a process, or by +filtering out alien pages using ``/proc/kpagecgroup`` in case the workload +is placed in a memory cgroup. 2. Wait until the workload accesses its working set. - 3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If -one wants to ignore certain types of pages, e.g. mlocked pages since they -are not reclaimable, he or she can filter them out using /proc/kpageflags. + 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set. +If one wants to ignore certain types of pages, e.g. mlocked pages since they +are not reclaimable, he or she can filter them out using +``/proc/kpageflags``. + +See Documentation/vm/pagemap.txt for more information about +``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. 
-See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap, -/proc/kpageflags, and /proc/kpagecgroup. +.. _impl_details: -IMPLEMENTATION DETAILS +Implementation Details +== The kernel internally keeps track of accesses to user memory pages in order to reclaim unreferenced pages first on memory shortage conditions. A page is @@ -77,7 +93,8 @@ When a dirty page is written to swap or disk as a result of memory reclaim or exceeding the dirty memory limit, it is not marked referenced. The idle memory tracking feature adds a new page flag, the Idle flag. This flag -is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the +:ref:`User API ` section), and cleared automatically whenever a page is referenced as defined
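The User API text in the patch above states that the bitmap is an array of 8-byte integers and that the page at PFN #i maps to bit #i%64 of array element #i/64. The C sketch below spells that arithmetic out. The function names are hypothetical and exist only for this illustration. Reading the file itself is left out, since it requires root and CONFIG_IDLE_PAGE_TRACKING=y.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Byte offset within /sys/kernel/mm/page_idle/bitmap of the 8-byte
 * word that holds the idle bit for `pfn` (array element pfn / 64).
 * The result is always a multiple of 8, matching the alignment rule
 * the document gives for reads and writes.
 */
uint64_t page_idle_word_offset(uint64_t pfn)
{
    return (pfn / 64) * sizeof(uint64_t);
}

/* Mask selecting bit pfn % 64 inside that word (native byte order). */
uint64_t page_idle_bit_mask(uint64_t pfn)
{
    return UINT64_C(1) << (pfn % 64);
}

/* Count the set (idle) bits in words already read from the bitmap. */
unsigned int page_idle_count(const uint64_t *words, size_t nwords)
{
    unsigned int count = 0;
    size_t i;

    assert(words != NULL || nwords == 0);

    for (i = 0; i < nwords; i++) {
        uint64_t w = words[i];

        while (w) {
            w &= w - 1; /* clear the lowest set bit */
            count++;
        }
    }
    return count;
}
```

To mark a single page idle, one would pwrite() the 8-byte value `page_idle_bit_mask(pfn)` at `page_idle_word_offset(pfn)`. Because written values are OR-ed with the current bitmap, other bits in the word are left alone.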
[PATCH 13/32] docs/vm: numa_memory_policy.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/numa_memory_policy.txt | 533 +--- 1 file changed, 283 insertions(+), 250 deletions(-) diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index 622b927..8cd942c 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt @@ -1,5 +1,11 @@ +.. _numa_memory_policy: + +=== +Linux Memory Policy +=== What is Linux Memory Policy? + In the Linux kernel, "memory policy" determines from which node the kernel will allocate memory in a NUMA system or in an emulated NUMA system. Linux has @@ -9,35 +15,36 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy support. Memory policies should not be confused with cpusets -(Documentation/cgroup-v1/cpusets.txt) +(``Documentation/cgroup-v1/cpusets.txt``) which is an administrative mechanism for restricting the nodes from which memory may be allocated by a set of processes. Memory policies are a programming interface that a NUMA-aware application can take advantage of. When both cpusets and policies are applied to a task, the restrictions of the cpuset -takes priority. See "MEMORY POLICIES AND CPUSETS" below for more details. +takes priority. See :ref:`Memory Policies and cpusets ` +below for more details. -MEMORY POLICY CONCEPTS +Memory Policy Concepts +== Scope of Memory Policies + The Linux kernel supports _scopes_ of memory policy, described here from most general to most specific: -System Default Policy: this policy is "hard coded" into the kernel. It -is the policy that governs all page allocations that aren't controlled -by one of the more specific policy scopes discussed below. When the -system is "up and running", the system default policy will use "local -allocation" described below. 
However, during boot up, the system -default policy will be set to interleave allocations across all nodes -with "sufficient" memory, so as not to overload the initial boot node -with boot-time allocations. - -Task/Process Policy: this is an optional, per-task policy. When defined -for a specific task, this policy controls all page allocations made by or -on behalf of the task that aren't controlled by a more specific scope. -If a task does not define a task policy, then all page allocations that -would have been controlled by the task policy "fall back" to the System -Default Policy. +System Default Policy + this policy is "hard coded" into the kernel. It is the policy + that governs all page allocations that aren't controlled by + one of the more specific policy scopes discussed below. When + the system is "up and running", the system default policy will + use "local allocation" described below. However, during boot + up, the system default policy will be set to interleave + allocations across all nodes with "sufficient" memory, so as + not to overload the initial boot node with boot-time + allocations. + +Task/Process Policy + this is an optional, per-task policy. When defined for a specific task, this policy controls all page allocations made by or on behalf of the task that aren't controlled by a more specific scope. If a task does not define a task policy, then all page allocations that would have been controlled by the task policy "fall back" to the System Default Policy. The task policy applies to the entire address space of a task. Thus, it is inheritable, and indeed is inherited, across both fork() @@ -58,56 +65,66 @@ most general to most specific: changes its task policy remain where they were allocated based on the policy at the time they were allocated. -VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's -virtual address space. A task may define a specific policy for a range -of its virtual address space. 
See the MEMORY POLICIES APIS section, -below, for an overview of the mbind() system call used to set a VMA -policy. - -A VMA policy will govern the allocation of pages that back this region of -the address space. Any regions of the task's address space that don't -have an explicit VMA policy will fall back to the task policy, which may -itself fall back to the System Default Policy. - -VMA policies have a few complicating details: - - VMA policy applies ONLY to anonymous pages. These include pages - allocated for anonymous segments, such as the task stack and heap, and - any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. - If a VMA policy is applied to a file mapping, it will be ignored if - the mapping used the MAP_SHARED flag. If the file
[PATCH 17/32] docs/vm: pagemap.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/pagemap.txt | 164 +++ 1 file changed, 89 insertions(+), 75 deletions(-) diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index eafcefa..bd6d717 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -1,13 +1,16 @@ -pagemap, from the userspace perspective +.. _pagemap: + +== +pagemap from the Userspace Perspective +== pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow userspace programs to examine the page tables and related information by -reading files in /proc. +reading files in ``/proc``. There are four components to pagemap: - * /proc/pid/pagemap. This file lets a userspace process find out which + * ``/proc/pid/pagemap``. This file lets a userspace process find out which physical frame each virtual page is mapped to. It contains one 64-bit value for each virtual page, containing the following data (from fs/proc/task_mmu.c, above pagemap_read): @@ -37,24 +40,24 @@ There are four components to pagemap: determine which areas of memory are actually mapped and llseek to skip over unmapped regions. - * /proc/kpagecount. This file contains a 64-bit count of the number of + * ``/proc/kpagecount``. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. - * /proc/kpageflags. This file contains a 64-bit set of flags for each + * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each page, indexed by PFN. - The flags are (from fs/proc/page.c, above kpageflags_read): - - 0. LOCKED - 1. ERROR - 2. REFERENCED - 3. UPTODATE - 4. DIRTY - 5. LRU - 6. ACTIVE - 7. SLAB - 8. WRITEBACK - 9. RECLAIM + The flags are (from ``fs/proc/page.c``, above kpageflags_read): + +0. LOCKED +1. ERROR +2. REFERENCED +3. UPTODATE +4. DIRTY +5. LRU +6. ACTIVE +7. SLAB +8. WRITEBACK +9. RECLAIM 10. BUDDY 11. MMAP 12. ANON @@ -72,98 +75,108 @@ There are four components to pagemap: 24. ZERO_PAGE 25. IDLE - * /proc/kpagecgroup. 
This file contains a 64-bit inode number of the + * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. Short descriptions to the page flags: - - 0. LOCKED -page is being locked for exclusive access, eg. by undergoing read/write IO - - 7. SLAB -page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator -When compound page is used, SLUB/SLQB will only set this flag on the head -page; SLOB will not flag it at all. - -10. BUDDY += + +0 - LOCKED + page is being locked for exclusive access, eg. by undergoing read/write IO +7 - SLAB + page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + When compound page is used, SLUB/SLQB will only set this flag on the head + page; SLOB will not flag it at all. +10 - BUDDY a free memory block managed by the buddy system allocator The buddy system organizes free memory in blocks of various orders. An order N block has 2^N physically contiguous pages, with the BUDDY flag set for and _only_ for the first page. - -15. COMPOUND_HEAD -16. COMPOUND_TAIL +15 - COMPOUND_HEAD A compound page with order N consists of 2^N physically contiguous pages. A compound page with order 2 takes the form of "HTTT", where H donates its head page and T donates its tail page(s). The major consumers of compound pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc. memory allocators and various device drivers. However in this interface, only huge/giga pages are made visible to end users. -17. HUGE +16 - COMPOUND_TAIL +A compound page tail (see description above). +17 - HUGE this is an integral part of a HugeTLB page - -19. HWPOISON +19 - HWPOISON hardware detected memory corruption on this page: don't touch the data! - -20. NOPAGE +20 - NOPAGE no page frame exists at the requested address - -21. KSM +21 - KSM identical memory pages dynamically shared between one or more processes - -22. 
THP +22 - THP contiguous pages which construct transparent hugepages - -23. BALLOON +23 - BALLOON balloon compaction page - -24. ZERO_PAGE +24 - ZERO_PAGE zero page for pfn_zero or huge_zero page - -25. IDLE +25 - IDLE page has not been accessed since it was marked idle (see Documentation/vm/idle_page_tracking.txt). Note that this flag may be stale in case the page was accessed via a PTE. To make sure the flag -is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first. - -[IO
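The patch above lists the /proc/kpageflags bit numbers and says the file holds one 64-bit set of flags per page, indexed by PFN. A small C sketch of decoding such a value follows. It uses only bit numbers that appear in the patch text. The `KPFBIT_*` names and the helper functions are invented for this example and are not the kernel's own identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers as listed in the patch; the enum names are illustrative. */
enum kpf_bit {
    KPFBIT_LOCKED        = 0,
    KPFBIT_SLAB          = 7,
    KPFBIT_BUDDY         = 10,
    KPFBIT_COMPOUND_HEAD = 15,
    KPFBIT_COMPOUND_TAIL = 16,
    KPFBIT_HUGE          = 17,
    KPFBIT_HWPOISON      = 19,
    KPFBIT_NOPAGE        = 20,
    KPFBIT_KSM           = 21,
    KPFBIT_THP           = 22,
    KPFBIT_BALLOON       = 23,
    KPFBIT_ZERO_PAGE     = 24,
    KPFBIT_IDLE          = 25
};

/* Byte offset of the 64-bit flags word for `pfn` in /proc/kpageflags. */
uint64_t kpageflags_offset(uint64_t pfn)
{
    return pfn * sizeof(uint64_t);
}

/* Return 1 if flag number `bit` is set in `flags`, otherwise 0. */
int kpageflags_test(uint64_t flags, unsigned int bit)
{
    assert(bit < 64);
    return (int)((flags >> bit) & 1);
}
```

For example, a flags word with bits 15, 17 and 25 set describes the head page of a huge page that is currently marked idle.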
[PATCH 16/32] docs/vm: numa: convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/numa | 4
 1 file changed, 4 insertions(+)

diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index a31b85b..c81e7c5 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -1,6 +1,10 @@
+.. _numa:
+
 Started Nov 1999 by Kanoj Sarcar

+=
 What is NUMA?
+=

 This question can be answered from a couple of perspectives: the
 hardware view and the Linux software view.
--
2.7.4
[PATCH 15/32] docs/vm: page_frags convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/page_frags | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags
index a671456..637cc49 100644
--- a/Documentation/vm/page_frags
+++ b/Documentation/vm/page_frags
@@ -1,5 +1,8 @@
+.. _page_frags:
+
+==
 Page fragments
---
+==

 A page fragment is an arbitrary-length arbitrary-offset area of memory
 which resides within a 0 or higher order compound page. Multiple
--
2.7.4
[PATCH 18/32] docs/vm: page_migration: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/page_migration | 149 +--- 1 file changed, 77 insertions(+), 72 deletions(-) diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration index 0478ae2..07b67a8 100644 --- a/Documentation/vm/page_migration +++ b/Documentation/vm/page_migration @@ -1,5 +1,8 @@ +.. _page_migration: + +== Page migration --- +== Page migration allows the moving of the physical location of pages between nodes in a numa system while the process is running. This means that the @@ -20,7 +23,7 @@ Page migration functions are provided by the numactl package by Andi Kleen (a version later than 0.9.3 is required. Get it from ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma which provides an interface similar to other numa functionality for page -migration. cat /proc//numa_maps allows an easy review of where the +migration. cat ``/proc//numa_maps`` allows an easy review of where the pages of a process are located. See also the numa_maps documentation in the proc(5) man page. @@ -56,8 +59,8 @@ description for those trying to use migrate_pages() from the kernel (for userspace usage see the Andi Kleen's numactl package mentioned above) and then a low level description of how the low level details work. -A. In kernel use of migrate_pages() +In kernel use of migrate_pages() + 1. Remove pages from the LRU. @@ -78,8 +81,8 @@ A. In kernel use of migrate_pages() the new page for each page that is considered for moving. -B. How migrate_pages() works - +How migrate_pages() works += migrate_pages() does several passes over its list of pages. A page is moved if all references to a page are removable at the time. The page has @@ -142,8 +145,8 @@ Steps: 20. The new page is moved to the LRU and can be scanned by the swapper etc again. -C. 
Non-LRU page migration -- +Non-LRU page migration +== Although original migration aimed for reducing the latency of memory access for NUMA, compaction who want to create high-order page is also main customer. @@ -164,89 +167,91 @@ migration path. If a driver want to make own pages movable, it should define three functions which are function pointers of struct address_space_operations. -1. bool (*isolate_page) (struct page *page, isolate_mode_t mode); +1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);`` -What VM expects on isolate_page function of driver is to return *true* -if driver isolates page successfully. On returing true, VM marks the page -as PG_isolated so concurrent isolation in several CPUs skip the page -for isolation. If a driver cannot isolate the page, it should return *false*. + What VM expects on isolate_page function of driver is to return *true* + if driver isolates page successfully. On returing true, VM marks the page + as PG_isolated so concurrent isolation in several CPUs skip the page + for isolation. If a driver cannot isolate the page, it should return *false*. -Once page is successfully isolated, VM uses page.lru fields so driver -shouldn't expect to preserve values in that fields. + Once page is successfully isolated, VM uses page.lru fields so driver + shouldn't expect to preserve values in that fields. -2. int (*migratepage) (struct address_space *mapping, - struct page *newpage, struct page *oldpage, enum migrate_mode); +2. ``int (*migratepage) (struct address_space *mapping,`` +| ``struct page *newpage, struct page *oldpage, enum migrate_mode);`` -After isolation, VM calls migratepage of driver with isolated page. -The function of migratepage is to move content of the old page to new page -and set up fields of struct page newpage. 
Keep in mind that you should -indicate to the VM the oldpage is no longer movable via __ClearPageMovable() -under page_lock if you migrated the oldpage successfully and returns -MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver -can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time -because VM interprets -EAGAIN as "temporal migration failure". On returning -any error except -EAGAIN, VM will give up the page migration without retrying -in this time. + After isolation, VM calls migratepage of driver with isolated page. + The function of migratepage is to move content of the old page to new page + and set up fields of struct page newpage. Keep in mind that you should + indicate to the VM the oldpage is no longer movable via __ClearPageMovable() + under page_lock if you migrated the oldpage successfully and returns + MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver + can return -EAGAIN. On -EAGAIN, VM will
[PATCH 21/32] docs/vm: slub.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/slub.txt | 357 -- 1 file changed, 188 insertions(+), 169 deletions(-) diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt index 8465241..3a775fd 100644 --- a/Documentation/vm/slub.txt +++ b/Documentation/vm/slub.txt @@ -1,5 +1,8 @@ +.. _slub: + +== Short users guide for SLUB --- +== The basic philosophy of SLUB is very different from SLAB. SLAB requires rebuilding the kernel to activate debug options for all @@ -8,18 +11,19 @@ SLUB can enable debugging only for selected slabs in order to avoid an impact on overall system performance which may make a bug more difficult to find. -In order to switch debugging on one can add an option "slub_debug" +In order to switch debugging on one can add an option ``slub_debug`` to the kernel command line. That will enable full debugging for all slabs. -Typically one would then use the "slabinfo" command to get statistical -data and perform operation on the slabs. By default slabinfo only lists +Typically one would then use the ``slabinfo`` command to get statistical +data and perform operation on the slabs. By default ``slabinfo`` only lists slabs that have data in them. See "slabinfo -h" for more options when -running the command. slabinfo can be compiled with +running the command. ``slabinfo`` can be compiled with +:: -gcc -o slabinfo tools/vm/slabinfo.c + gcc -o slabinfo tools/vm/slabinfo.c -Some of the modes of operation of slabinfo require that slub debugging +Some of the modes of operation of ``slabinfo`` require that slub debugging be enabled on the command line. F.e. no tracking information will be available without debugging on and validation can only partially be performed if debugging was not switched on. @@ -27,14 +31,17 @@ be performed if debugging was not switched on. Some more sophisticated uses of slub_debug: --- -Parameters may be given to slub_debug. If none is specified then full +Parameters may be given to ``slub_debug``. 
If none is specified then full debugging is enabled. Format: -slub_debug= Enable options for all slabs +slub_debug= + Enable options for all slabs slub_debug=, - Enable options only for select slabs + Enable options only for select slabs + + +Possible debug options are:: -Possible debug options are F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS Sorry SLAB legacy issues) Z Red zoning @@ -47,18 +54,18 @@ Possible debug options are - Switch all debugging off (useful if the kernel is configured with CONFIG_SLUB_DEBUG_ON) -F.e. in order to boot just with sanity checks and red zoning one would specify: +F.e. in order to boot just with sanity checks and red zoning one would specify:: slub_debug=FZ -Trying to find an issue in the dentry cache? Try +Trying to find an issue in the dentry cache? Try:: slub_debug=,dentry to only enable debugging on the dentry cache. Red zoning and tracking may realign the slab. We can just apply sanity checks -to the dentry cache with +to the dentry cache with:: slub_debug=F,dentry @@ -66,15 +73,15 @@ Debugging options may require the minimum possible slab order to increase as a result of storing the metadata (for example, caches with PAGE_SIZE object sizes). This has a higher liklihood of resulting in slab allocation errors in low memory situations or if there's high fragmentation of memory. To -switch off debugging for such caches by default, use +switch off debugging for such caches by default, use:: slub_debug=O In case you forgot to enable debugging on the kernel command line: It is possible to enable debugging manually when the kernel is up. Look at the -contents of: +contents of:: -/sys/kernel/slab// + /sys/kernel/slab// Look at the writable files. Writing 1 to them will enable the corresponding debug option. All options can be set on a slab that does @@ -86,98 +93,103 @@ Careful with tracing: It may spew out lots of information and never stop if used on the wrong slab. 
Slab merging - + If no debug options are specified then SLUB may merge similar slabs together in order to reduce overhead and increase cache hotness of objects. -slabinfo -a displays which slabs were merged together. +``slabinfo -a`` displays which slabs were merged together. Slab validation +=== SLUB can validate all object if the kernel was booted with slub_debug. In -order to do so you must have the slabinfo tool. Then you can do +order to do so you must have the ``slabinfo`` tool. Then you can do +:: -slabinfo -v + slabinfo -v
[PATCH 20/32] docs/vm: remap_file_pages.txt: convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/remap_file_pages.txt | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
index f609142..7bef671 100644
--- a/Documentation/vm/remap_file_pages.txt
+++ b/Documentation/vm/remap_file_pages.txt
@@ -1,3 +1,9 @@
+.. _remap_file_pages:
+
+==
+remap_file_pages() system call
+==
+
 The remap_file_pages() system call is used to create a nonlinear mapping,
 that is, a mapping in which the pages of the file are mapped into a
 nonsequential order in memory. The advantage of using remap_file_pages()
--
2.7.4
[PATCH 23/32] docs/vm: split_page_table_lock: convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/split_page_table_lock | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/Documentation/vm/split_page_table_lock b/Documentation/vm/split_page_table_lock
index 62842a8..889b00b 100644
--- a/Documentation/vm/split_page_table_lock
+++ b/Documentation/vm/split_page_table_lock
@@ -1,3 +1,6 @@
+.. _split_page_table_lock:
+
+=
 Split page table lock
 =

@@ -11,6 +14,7 @@ access to the table.
 At the moment we use split lock for PTE and PMD tables. Access to higher
 level tables protected by mm->page_table_lock.

 There are helpers to lock/unlock a table and other accessor functions:
+
  - pte_offset_map_lock()
    maps pte and takes PTE table lock, returns pointer to the taken
    lock;
@@ -34,12 +38,13 @@ Split page table lock for PMD tables is enabled, if it's enabled for PTE
 tables and the architecture supports it (see below).

 Hugetlb and split page table lock
--
+=

 Hugetlb can support several page sizes. We use split lock only for PMD
 level, but not for PUD.

 Hugetlb-specific helpers:
+
  - huge_pte_lock()
    takes pmd split lock for PMD_SIZE page, mm->page_table_lock
    otherwise;
@@ -47,7 +52,7 @@ Hugetlb-specific helpers:
    returns pointer to table lock;

 Support of split page table lock by an architecture
+===

 There's no need in special enabling of PTE split page table lock:
 everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
@@ -73,7 +78,7 @@ NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must
 be handled properly.

 page->ptl
--
+=

 page->ptl is used to access split page table lock, where 'page' is struct
 page of page containing the table. It shares storage with page->private
@@ -81,6 +86,7 @@ page of page containing the table. It shares storage with page->private
 To avoid increasing size of struct page and have best performance, we use a
 trick:
+
  - if spinlock_t fits into long, we use page->ptr as spinlock, so we
    can avoid indirect access and save a cache line.
  - if size of spinlock_t is bigger then size of long, we use page->ptl as
--
2.7.4
[PATCH 22/32] docs/vm: soft-dirty.txt: convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/soft-dirty.txt | 20
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/Documentation/vm/soft-dirty.txt b/Documentation/vm/soft-dirty.txt
index 55684d1..cb0cfd6 100644
--- a/Documentation/vm/soft-dirty.txt
+++ b/Documentation/vm/soft-dirty.txt
@@ -1,34 +1,38 @@
-SOFT-DIRTY PTEs
+.. _soft_dirty:

- The soft-dirty is a bit on a PTE which helps to track which pages a task
+===
+Soft-Dirty PTEs
+===
+
+The soft-dirty is a bit on a PTE which helps to track which pages a task
 writes to. In order to do this tracking one should

   1. Clear soft-dirty bits from the task's PTEs.

-     This is done by writing "4" into the /proc/PID/clear_refs file of the
+     This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
      task in question.

   2. Wait some time.

   3. Read soft-dirty bits from the PTEs.

-     This is done by reading from the /proc/PID/pagemap. The bit 55 of the
+     This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
      64-bit qword is the soft-dirty one. If set, the respective PTE was
      written to since step 1.


- Internally, to do this tracking, the writable bit is cleared from PTEs
+Internally, to do this tracking, the writable bit is cleared from PTEs
 when the soft-dirty bit is cleared. So, after this, when the task tries to
 modify a page at some virtual address the #PF occurs and the kernel sets
 the soft-dirty bit on the respective PTE.

- Note, that although all the task's address space is marked as r/o after the
+Note, that although all the task's address space is marked as r/o after the
 soft-dirty bits clear, the #PF-s that occur after that are processed fast.
 This is so, since the pages are still mapped to physical memory, and thus all
 the kernel does is finds this fact out and puts both writable and soft-dirty
 bits on the PTE.

- While in most cases tracking memory changes by #PF-s is more than enough
+While in most cases tracking memory changes by #PF-s is more than enough
 there is still a scenario when we can lose soft dirty bits -- a task
 unmaps a previously mapped memory region and then maps a new one at exactly
 the same place. When unmap is called, the kernel internally clears PTE values
@@ -36,7 +40,7 @@ including soft dirty bits. To notify user space application about such
 memory region renewal the kernel always marks new memory regions (and
 expanded regions) as soft dirty.

- This feature is actively used by the checkpoint-restore project. You
+This feature is actively used by the checkpoint-restore project. You
 can find more details about it on http://criu.org
--
2.7.4
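For readers who want to try the three steps described in the soft-dirty text, here is a minimal userspace sketch in Python. It is not part of the patch. The helper names (`is_soft_dirty`, `soft_dirty_flags`, `clear_soft_dirty`) are made up for illustration; the only facts taken from the document are that writing "4" to ``/proc/PID/clear_refs`` clears the bits and that bit 55 of each 64-bit pagemap entry is the soft-dirty bit. The sketch assumes entries are stored in native byte order.

```python
import struct
from typing import List

SOFT_DIRTY_BIT = 55      # bit position named in the soft-dirty text
PAGEMAP_ENTRY_SIZE = 8   # one 64-bit qword per virtual page


def is_soft_dirty(entry: int) -> bool:
    """Return True if the soft-dirty bit is set in one pagemap entry."""
    return bool((entry >> SOFT_DIRTY_BIT) & 1)


def soft_dirty_flags(raw: bytes) -> List[bool]:
    """Decode consecutive pagemap entries (native byte order, assumed)
    and report the soft-dirty state of each one."""
    count = len(raw) // PAGEMAP_ENTRY_SIZE
    entries = struct.unpack(f"={count}Q", raw[: count * PAGEMAP_ENTRY_SIZE])
    return [is_soft_dirty(e) for e in entries]


def clear_soft_dirty(pid: int) -> None:
    """Step 1 from the text: write "4" into /proc/PID/clear_refs.
    (Hypothetical helper; needs permission to write that file.)"""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")
```

A caller would run `clear_soft_dirty(pid)`, wait, then read the relevant slice of ``/proc/PID/pagemap`` and pass the bytes to `soft_dirty_flags()`. Locating the right file offset for a given virtual address is left out of this sketch.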
[PATCH 25/32] docs/vm: transhuge.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/transhuge.txt | 286 - 1 file changed, 166 insertions(+), 120 deletions(-) diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt index 4dde03b..569d182 100644 --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -1,6 +1,11 @@ -= Transparent Hugepage Support = +.. _transhuge: -== Objective == + +Transparent Hugepage Support + + +Objective += Performance critical computing applications dealing with large memory working sets are already running on top of libhugetlbfs and in turn @@ -33,7 +38,8 @@ are using hugepages but a significant speedup already happens if only one of the two is using hugepages just because of the fact the TLB miss is going to run faster. -== Design == +Design +== - "graceful fallback": mm components which don't have transparent hugepage knowledge fall back to breaking huge pmd mapping into table of ptes and, @@ -88,16 +94,17 @@ Applications that gets a lot of benefit from hugepages and that don't risk to lose memory by using hugepages, should use madvise(MADV_HUGEPAGE) on their critical mmapped regions. -== sysfs == +sysfs += Transparent Hugepage Support for anonymous memory can be entirely disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled -system wide. This can be achieved with one of: +system wide. 
This can be achieved with one of:: -echo always >/sys/kernel/mm/transparent_hugepage/enabled -echo madvise >/sys/kernel/mm/transparent_hugepage/enabled -echo never >/sys/kernel/mm/transparent_hugepage/enabled + echo always >/sys/kernel/mm/transparent_hugepage/enabled + echo madvise >/sys/kernel/mm/transparent_hugepage/enabled + echo never >/sys/kernel/mm/transparent_hugepage/enabled It's also possible to limit defrag efforts in the VM to generate anonymous hugepages in case they're not immediately free to madvise @@ -108,44 +115,53 @@ use hugepages later instead of regular pages. This isn't always guaranteed, but it may be more likely in case the allocation is for a MADV_HUGEPAGE region. -echo always >/sys/kernel/mm/transparent_hugepage/defrag -echo defer >/sys/kernel/mm/transparent_hugepage/defrag -echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag -echo madvise >/sys/kernel/mm/transparent_hugepage/defrag -echo never >/sys/kernel/mm/transparent_hugepage/defrag - -"always" means that an application requesting THP will stall on allocation -failure and directly reclaim pages and compact memory in an effort to -allocate a THP immediately. This may be desirable for virtual machines -that benefit heavily from THP use and are willing to delay the VM start -to utilise them. - -"defer" means that an application will wake kswapd in the background -to reclaim pages and wake kcompactd to compact memory so that THP is -available in the near future. It's the responsibility of khugepaged -to then install the THP pages later. - -"defer+madvise" will enter direct reclaim and compaction like "always", but -only for regions that have used madvise(MADV_HUGEPAGE); all other regions -will wake kswapd in the background to reclaim pages and wake kcompactd to -compact memory so that THP is available in the near future. - -"madvise" will enter direct reclaim like "always" but only for regions -that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. 
- -"never" should be self-explanatory. +:: + + echo always >/sys/kernel/mm/transparent_hugepage/defrag + echo defer >/sys/kernel/mm/transparent_hugepage/defrag + echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo madvise >/sys/kernel/mm/transparent_hugepage/defrag + echo never >/sys/kernel/mm/transparent_hugepage/defrag + +always + means that an application requesting THP will stall on + allocation failure and directly reclaim pages and compact + memory in an effort to allocate a THP immediately. This may be + desirable for virtual machines that benefit heavily from THP + use and are willing to delay the VM start to utilise them. + +defer + means that an application will wake kswapd in the background + to reclaim pages and wake kcompactd to compact memory so that + THP is available in the near future. It's the responsibility + of khugepaged to then install the THP pages later. + +defer+madvise + will enter direct reclaim and compaction like ``always``, but + only for regions that have used madvise(MADV_HUGEPAGE); all + other regions will wake kswapd in the background to reclaim + pages and wake kcompactd to compact memory so that THP is + available in the near future. + +madvise + will enter direct reclaim like
[PATCH 24/32] docs/vm: swap_numa.txt: convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/swap_numa.txt | 55 +-
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/Documentation/vm/swap_numa.txt b/Documentation/vm/swap_numa.txt
index d5960c9..e0466f2 100644
--- a/Documentation/vm/swap_numa.txt
+++ b/Documentation/vm/swap_numa.txt
@@ -1,5 +1,8 @@
+.. _swap_numa:
+
+===
 Automatically bind swap device to numa node
 ===

 If the system has more than one swap device and swap device has the node
 information, we can make use of this information to decide which swap
@@ -7,15 +10,16 @@ device to use in get_swap_pages() to get better performance.


 How to use this feature
+===

 Swap device has priority and that decides the order of it to be used. To make
 use of automatically binding, there is no need to manipulate priority settings
 for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and
 swapB, with swapA attached to node 0 and swapB attached to node 1, are going
-to be swapped on. Simply swapping them on by doing:
-# swapon /dev/swapA
-# swapon /dev/swapB
+to be swapped on. Simply swapping them on by doing::
+
+	# swapon /dev/swapA
+	# swapon /dev/swapB

 Then node 0 will use the two swap devices in the order of swapA then swapB and
 node 1 will use the two swap devices in the order of swapB then swapA. Note
@@ -24,32 +28,39 @@ that the order of them being swapped on doesn't matter.
 A more complex example on a 4 node machine. Assume 6 swap devices are going to
 be swapped on: swapA and swapB are attached to node 0, swapC is attached to
 node 1, swapD and swapE are attached to node 2 and swapF is attached to node3.
-The way to swap them on is the same as above:
-# swapon /dev/swapA
-# swapon /dev/swapB
-# swapon /dev/swapC
-# swapon /dev/swapD
-# swapon /dev/swapE
-# swapon /dev/swapF
-
-Then node 0 will use them in the order of:
-swapA/swapB -> swapC -> swapD -> swapE -> swapF
+The way to swap them on is the same as above::
+
+	# swapon /dev/swapA
+	# swapon /dev/swapB
+	# swapon /dev/swapC
+	# swapon /dev/swapD
+	# swapon /dev/swapE
+	# swapon /dev/swapF
+
+Then node 0 will use them in the order of::
+
+	swapA/swapB -> swapC -> swapD -> swapE -> swapF
+
 swapA and swapB will be used in a round robin mode before any other swap device.

-node 1 will use them in the order of:
-swapC -> swapA -> swapB -> swapD -> swapE -> swapF
+node 1 will use them in the order of::
+
+	swapC -> swapA -> swapB -> swapD -> swapE -> swapF
+
+node 2 will use them in the order of::
+
+	swapD/swapE -> swapA -> swapB -> swapC -> swapF

-node 2 will use them in the order of:
-swapD/swapE -> swapA -> swapB -> swapC -> swapF
 Similaly, swapD and swapE will be used in a round robin mode before any
 other swap devices.

-node 3 will use them in the order of:
-swapF -> swapA -> swapB -> swapC -> swapD -> swapE
+node 3 will use them in the order of::
+
+	swapF -> swapA -> swapB -> swapC -> swapD -> swapE


 Implementation details
---
+==

 The current code uses a priority based list, swap_avail_list, to decide
 which swap device to use and if multiple swap devices share the same
--
2.7.4
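The per-node orderings listed in the swap_numa text can be reproduced with a small model. This Python sketch is not part of the patch and is not how the kernel implements it (the text says the kernel uses a priority based list, swap_avail_list). It only assumes that devices attached to the requesting node form the first group, used round robin, and that the remaining devices follow in swapon order. The function name `swap_order` and the dict layout are hypothetical.

```python
from typing import Dict, List


def swap_order(node: int, devices: Dict[str, int]) -> List[List[str]]:
    """Return groups of swap devices in the order `node` would use them.

    `devices` maps device name -> NUMA node, listed in swapon order
    (Python dicts preserve insertion order). Devices inside one group
    are assumed to be used in round robin mode.
    """
    # Devices local to this node come first, as a single round-robin group.
    local = [name for name, dev_node in devices.items() if dev_node == node]
    # Everything else follows one by one, in the order it was swapped on.
    remote = [[name] for name, dev_node in devices.items() if dev_node != node]
    return ([local] if local else []) + remote
```

With the 4 node example from the text, `swap_order(0, devs)` yields the swapA/swapB group first and then swapC, swapD, swapE, swapF, which matches the order given for node 0.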
[PATCH 30/32] docs/vm: zswap.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/zswap.txt | 71 +++--- 1 file changed, 42 insertions(+), 29 deletions(-) diff --git a/Documentation/vm/zswap.txt b/Documentation/vm/zswap.txt index 0b3a114..1444ecd 100644 --- a/Documentation/vm/zswap.txt +++ b/Documentation/vm/zswap.txt @@ -1,4 +1,11 @@ -Overview: +.. _zswap: + += +zswap += + +Overview + Zswap is a lightweight compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a @@ -7,32 +14,34 @@ for potentially reduced swap I/O. This trade-off can also result in a significant performance improvement if reads from the compressed cache are faster than reads from a swap device. -NOTE: Zswap is a new feature as of v3.11 and interacts heavily with memory -reclaim. This interaction has not been fully explored on the large set of -potential configurations and workloads that exist. For this reason, zswap -is a work in progress and should be considered experimental. +.. note:: + Zswap is a new feature as of v3.11 and interacts heavily with memory + reclaim. This interaction has not been fully explored on the large set of + potential configurations and workloads that exist. For this reason, zswap + is a work in progress and should be considered experimental. + + Some potential benefits: -Some potential benefits: * Desktop/laptop users with limited RAM capacities can mitigate the - performance impact of swapping. + performance impact of swapping. * Overcommitted guests that share a common I/O resource can - dramatically reduce their swap I/O pressure, avoiding heavy handed I/O -throttling by the hypervisor. This allows more work to get done with less -impact to the guest workload and guests sharing the I/O subsystem + dramatically reduce their swap I/O pressure, avoiding heavy handed I/O + throttling by the hypervisor. 
This allows more work to get done with less + impact to the guest workload and guests sharing the I/O subsystem * Users with SSDs as swap devices can extend the life of the device by - drastically reducing life-shortening writes. + drastically reducing life-shortening writes. Zswap evicts pages from compressed cache on an LRU basis to the backing swap device when the compressed pool reaches its size limit. This requirement had been identified in prior community discussions. Zswap is disabled by default but can be enabled at boot time by setting -the "enabled" attribute to 1 at boot time. ie: zswap.enabled=1. Zswap +the ``enabled`` attribute to 1 at boot time. ie: ``zswap.enabled=1``. Zswap can also be enabled and disabled at runtime using the sysfs interface. An example command to enable zswap at runtime, assuming sysfs is mounted -at /sys, is: +at ``/sys``, is:: -echo 1 > /sys/module/zswap/parameters/enabled + echo 1 > /sys/module/zswap/parameters/enabled When zswap is disabled at runtime it will stop storing pages that are being swapped out. However, it will _not_ immediately write out or fault @@ -43,7 +52,8 @@ pages out of the compressed pool, a swapoff on the swap device(s) will fault back into memory all swapped out pages, including those in the compressed pool. -Design: +Design +== Zswap receives pages for compression through the Frontswap API and is able to evict pages from its own compressed pool on an LRU basis and write them back to @@ -53,12 +63,12 @@ Zswap makes use of zpool for the managing the compressed memory pool. Each allocation in zpool is not directly accessible by address. Rather, a handle is returned by the allocation routine and that handle must be mapped before being accessed. The compressed memory pool grows on demand and shrinks as compressed -pages are freed. The pool is not preallocated. By default, a zpool of type -zbud is created, but it can be selected at boot time by setting the "zpool" -attribute, e.g. zswap.zpool=zbud. 
It can also be changed at runtime using the -sysfs "zpool" attribute, e.g. +pages are freed. The pool is not preallocated. By default, a zpool +of type zbud is created, but it can be selected at boot time by +setting the ``zpool`` attribute, e.g. ``zswap.zpool=zbud``. It can +also be changed at runtime using the sysfs ``zpool`` attribute, e.g.:: -echo zbud > /sys/module/zswap/parameters/zpool + echo zbud > /sys/module/zswap/parameters/zpool The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which means the compression ratio will always be 2:1 or worse (because of half-full @@ -83,14 +93,16 @@ via frontswap, to free the compressed entry. Zswap seeks to be simple in its policies. Sysfs attributes allow for one user controlled policy: + * max_pool_percent - The maximum percentage of memory that the compressed -pool can occupy. + pool can occupy. -The default compressor is lzo, but it can be
[PATCH 27/32] docs/vm: userfaultfd.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/userfaultfd.txt | 66 1 file changed, 39 insertions(+), 27 deletions(-) diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt index bb2f945..5048cf6 100644 --- a/Documentation/vm/userfaultfd.txt +++ b/Documentation/vm/userfaultfd.txt @@ -1,6 +1,11 @@ -= Userfaultfd = +.. _userfaultfd: -== Objective == +=== +Userfaultfd +=== + +Objective += Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various @@ -9,7 +14,8 @@ memory page faults, something otherwise only the kernel code could do. For example userfaults allows a proper and more optimal implementation of the PROT_NONE+SIGSEGV trick. -== Design == +Design +== Userfaults are delivered and resolved through the userfaultfd syscall. @@ -41,7 +47,8 @@ different processes without them being aware about what is going on themselves on the same region the manager is already tracking, which is a corner case that would currently return -EBUSY). -== API == +API +=== When first opened the userfaultfd must be enabled invoking the UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or @@ -101,7 +108,8 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an half copied page since it'll keep userfaulting until the copy has finished. -== QEMU/KVM == +QEMU/KVM + QEMU/KVM is using the userfaultfd syscall to implement postcopy live migration. Postcopy live migration is one form of memory @@ -163,7 +171,8 @@ sending the same page twice (in case the userfault is read by the postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration thread). -== Non-cooperative userfaultfd == +Non-cooperative userfaultfd +=== When the userfaultfd is monitored by an external manager, the manager must be able to track changes in the process virtual memory @@ -172,27 +181,30 @@ the same read(2) protocol as for the page fault notifications. 
The manager has to explicitly enable these events by setting appropriate bits in uffdio_api.features passed to UFFDIO_API ioctl: -UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When -this feature is enabled, the userfaultfd context of the parent process -is duplicated into the newly created process. The manager receives -UFFD_EVENT_FORK with file descriptor of the new userfaultfd context in -the uffd_msg.fork. - -UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap() -calls. When the non-cooperative process moves a virtual memory area to -a different location, the manager will receive UFFD_EVENT_REMAP. The -uffd_msg.remap will contain the old and new addresses of the area and -its original length. - -UFFD_FEATURE_EVENT_REMOVE - enable notifications about -madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event -UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The -uffd_msg.remove will contain start and end addresses of the removed -area. - -UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory -unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove -containing start and end addresses of the unmapped area. +UFFD_FEATURE_EVENT_FORK + enable userfaultfd hooks for fork(). When this feature is + enabled, the userfaultfd context of the parent process is + duplicated into the newly created process. The manager + receives UFFD_EVENT_FORK with file descriptor of the new + userfaultfd context in the uffd_msg.fork. + +UFFD_FEATURE_EVENT_REMAP + enable notifications about mremap() calls. When the + non-cooperative process moves a virtual memory area to a + different location, the manager will receive + UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and + new addresses of the area and its original length. + +UFFD_FEATURE_EVENT_REMOVE + enable notifications about madvise(MADV_REMOVE) and + madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will + be generated upon these calls to madvise. 
The uffd_msg.remove + will contain start and end addresses of the removed area. + +UFFD_FEATURE_EVENT_UNMAP + enable notifications about memory unmapping. The manager will + get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and + end addresses of the unmapped area. Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP are pretty similar, they quite differ in the action expected from the -- 2.7.4
[PATCH 28/32] docs/vm: z3fold.txt: convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/z3fold.txt | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/z3fold.txt b/Documentation/vm/z3fold.txt
index 38e4dac..224e3c6 100644
--- a/Documentation/vm/z3fold.txt
+++ b/Documentation/vm/z3fold.txt
@@ -1,5 +1,8 @@
+.. _z3fold:
+
+==
 z3fold
---
+==

 z3fold is a special purpose allocator for storing compressed pages.
 It is designed to store up to three compressed pages per physical page.
@@ -7,6 +10,7 @@ It is a zbud derivative which allows for higher compression
 ratio keeping the simplicity and determinism of its predecessor.

 The main differences between z3fold and zbud are:
+
 * unlike zbud, z3fold allows for up to PAGE_SIZE allocations
 * z3fold can hold up to 3 compressed pages in its page
 * z3fold doesn't export any API itself and is thus intended to be used
--
2.7.4
[PATCH 26/32] docs/vm: unevictable-lru.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/unevictable-lru.txt | 117 +++ 1 file changed, 49 insertions(+), 68 deletions(-) diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index e147185..fdd84cb 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -1,37 +1,13 @@ - == - UNEVICTABLE LRU INFRASTRUCTURE - == - - -CONTENTS - - - (*) The Unevictable LRU - - - The unevictable page list. - - Memory control group interaction. - - Marking address spaces unevictable. - - Detecting Unevictable Pages. - - vmscan's handling of unevictable pages. - - (*) mlock()'d pages. - - - History. - - Basic management. - - mlock()/mlockall() system call handling. - - Filtering special vmas. - - munlock()/munlockall() system call handling. - - Migrating mlocked pages. - - Compacting mlocked pages. - - mmap(MAP_LOCKED) system call handling. - - munmap()/exit()/exec() system call handling. - - try_to_unmap(). - - try_to_munlock() reverse map scan. - - Page reclaim in shrink_*_list(). +.. _unevictable_lru: +== +Unevictable LRU Infrastructure +== - -INTRODUCTION +.. contents:: :local: + + +Introduction This document describes the Linux memory manager's "Unevictable LRU" @@ -46,8 +22,8 @@ details - the "what does it do?" - by reading the code. One hopes that the descriptions below add value by provide the answer to "why does it do that?". -=== -THE UNEVICTABLE LRU + +The Unevictable LRU === The Unevictable LRU facility adds an additional LRU list to track unevictable @@ -66,17 +42,17 @@ completely unresponsive. The unevictable list addresses the following classes of unevictable pages: - (*) Those owned by ramfs. + * Those owned by ramfs. - (*) Those mapped into SHM_LOCK'd shared memory regions. + * Those mapped into SHM_LOCK'd shared memory regions. - (*) Those mapped into VM_LOCKED [mlock()ed] VMAs. + * Those mapped into VM_LOCKED [mlock()ed] VMAs. 
The infrastructure may also be able to handle other conditions that make pages unevictable, either by definition or by circumstance, in the future. -THE UNEVICTABLE PAGE LIST +The Unevictable Page List - The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list @@ -118,7 +94,7 @@ the unevictable list when one task has the page isolated from the LRU and other tasks are changing the "evictability" state of the page. -MEMORY CONTROL GROUP INTERACTION +Memory Control Group Interaction The unevictable LRU facility interacts with the memory control group [aka @@ -144,7 +120,9 @@ effects: the control group to thrash or to OOM-kill tasks. -MARKING ADDRESS SPACES UNEVICTABLE +.. _mark_addr_space_unevict: + +Marking Address Spaces Unevictable -- For facilities such as ramfs none of the pages attached to the address space @@ -152,15 +130,15 @@ may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE address space flag is provided, and this can be manipulated by a filesystem using a number of wrapper functions: - (*) void mapping_set_unevictable(struct address_space *mapping); + * ``void mapping_set_unevictable(struct address_space *mapping);`` Mark the address space as being completely unevictable. - (*) void mapping_clear_unevictable(struct address_space *mapping); + * ``void mapping_clear_unevictable(struct address_space *mapping);`` Mark the address space as being evictable. - (*) int mapping_unevictable(struct address_space *mapping); + * ``int mapping_unevictable(struct address_space *mapping);`` Query the address space, and return true if it is completely unevictable. @@ -177,12 +155,13 @@ These are currently used in two places in the kernel: ensure they're in memory. 
-DETECTING UNEVICTABLE PAGES +Detecting Unevictable Pages --- The function page_evictable() in vmscan.c determines whether a page is -evictable or not using the query function outlined above [see section "Marking -address spaces unevictable"] to check the AS_UNEVICTABLE flag. +evictable or not using the query function outlined above [see section +:ref:`Marking address spaces unevictable `] +to check the AS_UNEVICTABLE flag. For address spaces that are so marked after being populated (as SHM regions might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate @@ -202,7 +181,7 @@ flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is faulted into a
[PATCH 29/32] docs/vm: zsmalloc.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/zsmalloc.txt | 60 ++- 1 file changed, 36 insertions(+), 24 deletions(-) diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt index 64ed63c..6e79893 100644 --- a/Documentation/vm/zsmalloc.txt +++ b/Documentation/vm/zsmalloc.txt @@ -1,5 +1,8 @@ +.. _zsmalloc: + + zsmalloc - + This allocator is designed for use with zram. Thus, the allocator is supposed to work well under low memory conditions. In particular, it @@ -31,40 +34,49 @@ be mapped using zs_map_object() to get a usable pointer and subsequently unmapped using zs_unmap_object(). stat - + With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via -/sys/kernel/debug/zsmalloc/. Here is a sample of stat output: +``/sys/kernel/debug/zsmalloc/``. Here is a sample of stat output:: -# cat /sys/kernel/debug/zsmalloc/zram0/classes + # cat /sys/kernel/debug/zsmalloc/zram0/classes class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage -.. -.. +... +... 9 176 01 186129 8 4 10 192 10 2880 2872135 3 11 208 01 819795 42 2 12 224 01 219159 12 4 -.. -.. +... +... 
+
+class
+    index
+size
+    object size zspage stores
+almost_empty
+    the number of ZS_ALMOST_EMPTY zspages(see below)
+almost_full
+    the number of ZS_ALMOST_FULL zspages(see below)
+obj_allocated
+    the number of objects allocated
+obj_used
+    the number of objects allocated to the user
+pages_used
+    the number of pages allocated for the class
+pages_per_zspage
+    the number of 0-order pages to make a zspage

-class: index
-size: object size zspage stores
-almost_empty: the number of ZS_ALMOST_EMPTY zspages(see below)
-almost_full: the number of ZS_ALMOST_FULL zspages(see below)
-obj_allocated: the number of objects allocated
-obj_used: the number of objects allocated to the user
-pages_used: the number of pages allocated for the class
-pages_per_zspage: the number of 0-order pages to make a zspage
+We assign a zspage to ZS_ALMOST_EMPTY fullness group when n <= N / f, where

-We assign a zspage to ZS_ALMOST_EMPTY fullness group when:
-    n <= N / f, where
-n = number of allocated objects
-N = total number of objects zspage can store
-f = fullness_threshold_frac(ie, 4 at the moment)
+* n = number of allocated objects
+* N = total number of objects zspage can store
+* f = fullness_threshold_frac(ie, 4 at the moment)

 Similarly, we assign zspage to:
-    ZS_ALMOST_FULL when n > N / f
-    ZS_EMPTY when n == 0
-    ZS_FULL when n == N
+
+* ZS_ALMOST_FULL when n > N / f
+* ZS_EMPTY when n == 0
+* ZS_FULL when n == N
--
2.7.4
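The fullness group rules quoted in the zsmalloc text can be written as one small function. This Python sketch is not part of the patch and only follows the rules as the document states them; it makes no claim about what the kernel source does. The name `fullness_group` is hypothetical. Two assumptions: the empty and full cases are checked first, and `n <= N / f` is read as real-valued division, which is expressed here as `n * f <= N` to avoid rounding.

```python
def fullness_group(n: int, total: int, f: int = 4) -> str:
    """Classify a zspage by the rules given in the zsmalloc text.

    n     -- number of allocated objects
    total -- total number of objects the zspage can store (N in the text)
    f     -- fullness_threshold_frac (4 at the moment, per the text)
    """
    if n == 0:
        return "ZS_EMPTY"
    if n == total:
        return "ZS_FULL"
    # n <= N / f, rewritten without division so integer rounding cannot matter.
    if n * f <= total:
        return "ZS_ALMOST_EMPTY"
    return "ZS_ALMOST_FULL"
```

For a zspage that can hold 93 objects, 23 allocated objects fall into ZS_ALMOST_EMPTY (23 * 4 = 92 <= 93) and 24 fall into ZS_ALMOST_FULL.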
[PATCH 31/32] docs/vm: rename documentation files to .rst
Signed-off-by: Mike Rapoport--- Documentation/ABI/stable/sysfs-devices-node| 2 +- .../ABI/testing/sysfs-kernel-mm-hugepages | 2 +- Documentation/ABI/testing/sysfs-kernel-mm-ksm | 2 +- Documentation/ABI/testing/sysfs-kernel-slab| 4 +- Documentation/admin-guide/kernel-parameters.txt| 12 ++--- Documentation/dev-tools/kasan.rst | 2 +- Documentation/filesystems/proc.txt | 4 +- Documentation/filesystems/tmpfs.txt| 2 +- Documentation/sysctl/vm.txt| 6 +-- Documentation/vm/00-INDEX | 58 +++--- Documentation/vm/{active_mm.txt => active_mm.rst} | 0 Documentation/vm/{balance => balance.rst} | 0 .../vm/{cleancache.txt => cleancache.rst} | 0 Documentation/vm/{frontswap.txt => frontswap.rst} | 0 Documentation/vm/{highmem.txt => highmem.rst} | 0 Documentation/vm/{hmm.txt => hmm.rst} | 0 .../{hugetlbfs_reserv.txt => hugetlbfs_reserv.rst} | 0 .../vm/{hugetlbpage.txt => hugetlbpage.rst}| 2 +- Documentation/vm/{hwpoison.txt => hwpoison.rst}| 2 +- ...le_page_tracking.txt => idle_page_tracking.rst} | 2 +- Documentation/vm/{ksm.txt => ksm.rst} | 0 .../vm/{mmu_notifier.txt => mmu_notifier.rst} | 0 Documentation/vm/{numa => numa.rst}| 2 +- ...ma_memory_policy.txt => numa_memory_policy.rst} | 0 ...commit-accounting => overcommit-accounting.rst} | 0 Documentation/vm/{page_frags => page_frags.rst}| 0 .../vm/{page_migration => page_migration.rst} | 0 .../vm/{page_owner.txt => page_owner.rst} | 0 Documentation/vm/{pagemap.txt => pagemap.rst} | 6 +-- .../{remap_file_pages.txt => remap_file_pages.rst} | 0 Documentation/vm/{slub.txt => slub.rst}| 0 .../vm/{soft-dirty.txt => soft-dirty.rst} | 0 ...t_page_table_lock => split_page_table_lock.rst} | 0 Documentation/vm/{swap_numa.txt => swap_numa.rst} | 0 Documentation/vm/{transhuge.txt => transhuge.rst} | 0 .../{unevictable-lru.txt => unevictable-lru.rst} | 0 .../vm/{userfaultfd.txt => userfaultfd.rst}| 0 Documentation/vm/{z3fold.txt => z3fold.rst}| 0 Documentation/vm/{zsmalloc.txt => zsmalloc.rst}| 0 Documentation/vm/{zswap.txt => zswap.rst} | 0 
MAINTAINERS| 2 +- arch/alpha/Kconfig | 2 +- arch/ia64/Kconfig | 2 +- arch/mips/Kconfig | 2 +- arch/powerpc/Kconfig | 2 +- fs/Kconfig | 2 +- fs/dax.c | 2 +- fs/proc/task_mmu.c | 4 +- include/linux/hmm.h| 2 +- include/linux/memremap.h | 4 +- include/linux/mmu_notifier.h | 2 +- include/linux/sched/mm.h | 4 +- include/linux/swap.h | 2 +- mm/Kconfig | 6 +-- mm/cleancache.c| 2 +- mm/frontswap.c | 2 +- mm/hmm.c | 2 +- mm/huge_memory.c | 4 +- mm/hugetlb.c | 4 +- mm/ksm.c | 4 +- mm/mmap.c | 2 +- mm/rmap.c | 6 +-- mm/util.c | 2 +- 63 files changed, 87 insertions(+), 87 deletions(-) rename Documentation/vm/{active_mm.txt => active_mm.rst} (100%) rename Documentation/vm/{balance => balance.rst} (100%) rename Documentation/vm/{cleancache.txt => cleancache.rst} (100%) rename Documentation/vm/{frontswap.txt => frontswap.rst} (100%) rename Documentation/vm/{highmem.txt => highmem.rst} (100%) rename Documentation/vm/{hmm.txt => hmm.rst} (100%) rename Documentation/vm/{hugetlbfs_reserv.txt => hugetlbfs_reserv.rst} (100%) rename Documentation/vm/{hugetlbpage.txt => hugetlbpage.rst} (99%) rename Documentation/vm/{hwpoison.txt => hwpoison.rst} (99%) rename Documentation/vm/{idle_page_tracking.txt => idle_page_tracking.rst} (98%) rename Documentation/vm/{ksm.txt => ksm.rst} (100%) rename Documentation/vm/{mmu_notifier.txt => mmu_notifier.rst} (100%) rename Documentation/vm/{numa => numa.rst} (99%) rename Documentation/vm/{numa_memory_policy.txt => numa_memory_policy.rst} (100%) rename Documentation/vm/{overcommit-accounting => overcommit-accounting.rst} (100%) rename Documentation/vm/{page_frags => page_frags.rst} (100%) rename Documentation/vm/{page_migration => page_migration.rst} (100%) rename
[PATCH 32/32] docs/vm: add index.rst and link MM documentation to top level index
Signed-off-by: Mike Rapoport
---
 Documentation/index.rst    |  3 ++-
 Documentation/vm/conf.py   | 10 +
 Documentation/vm/index.rst | 56 ++
 3 files changed, 68 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/vm/conf.py
 create mode 100644 Documentation/vm/index.rst

diff --git a/Documentation/index.rst b/Documentation/index.rst
index ef5080c..cc4a098 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -45,7 +45,7 @@ the kernel interface as seen by application developers.
 .. toctree::
    :maxdepth: 2
 
-   userspace-api/index
+   userspace-api/index
 
 Introduction to kernel development
 ----------------------------------
@@ -88,6 +88,7 @@ needed).
    sound/index
    crypto/index
    filesystems/index
+   vm/index
 
 Architecture-specific documentation
 -----------------------------------
diff --git a/Documentation/vm/conf.py b/Documentation/vm/conf.py
new file mode 100644
index 0000000..3b0b601
--- /dev/null
+++ b/Documentation/vm/conf.py
@@ -0,0 +1,10 @@
+# -*- coding: utf-8; mode: python -*-
+
+project = "Linux Memory Management Documentation"
+
+tags.add("subproject")
+
+latex_documents = [
+    ('index', 'memory-management.tex', project,
+     'The kernel development community', 'manual'),
+]
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
new file mode 100644
index 0000000..6c45142
--- /dev/null
+++ b/Documentation/vm/index.rst
@@ -0,0 +1,56 @@
+=====================================
+Linux Memory Management Documentation
+=====================================
+
+This is a collection of documents about Linux memory management (mm) subsystem.
+
+User guides for MM features
+===========================
+
+The following documents provide guides for controlling and tuning
+various features of the Linux memory management
+
+.. toctree::
+   :maxdepth: 1
+
+   hugetlbpage
+   idle_page_tracking
+   ksm
+   numa_memory_policy
+   pagemap
+   transhuge
+   soft-dirty
+   swap_numa
+   userfaultfd
+   zswap
+
+Kernel developers MM documentation
+==================================
+
+The below documents describe MM internals with different level of
+details ranging from notes and mailing list responses to elaborate
+descriptions of data structures and algorithms.
+
+.. toctree::
+   :maxdepth: 1
+
+   active_mm
+   balance
+   cleancache
+   frontswap
+   highmem
+   hmm
+   hwpoison
+   hugetlbfs_reserv
+   mmu_notifier
+   numa
+   overcommit-accounting
+   page_migration
+   page_frags
+   page_owner
+   remap_file_pages
+   slub
+   split_page_table_lock
+   unevictable-lru
+   z3fold
+   zsmalloc
-- 
2.7.4
[PATCH 19/32] docs/vm: page_owner: convert to ReST format
Signed-off-by: Mike Rapoport
---
 Documentation/vm/page_owner.txt | 34 +-
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/Documentation/vm/page_owner.txt b/Documentation/vm/page_owner.txt
index 143..0ed5ab8 100644
--- a/Documentation/vm/page_owner.txt
+++ b/Documentation/vm/page_owner.txt
@@ -1,7 +1,11 @@
+.. _page_owner:
+
+==================================================
 page owner: Tracking about who allocated each page
---------------------------------------------------
+==================================================
 
-* Introduction
+Introduction
+============
 
 page owner is for the tracking about who allocated each page.
 It can be used to debug memory leak or to find a memory hogger.
@@ -34,13 +38,15 @@ not affect to allocation performance, especially if the static keys jump
 label patching functionality is available. Following is the kernel's code
 size change due to this facility.
 
-- Without page owner
+- Without page owner::
+
    text    data     bss     dec     hex filename
-  40662    1493     644   42799    a72f mm/page_alloc.o
+   40662   1493     644   42799    a72f mm/page_alloc.o
+
+- With page owner::
 
-- With page owner
    text    data     bss     dec     hex filename
-  40892    1493     644   43029    a815 mm/page_alloc.o
+   40892   1493     644   43029    a815 mm/page_alloc.o
    1427      24       8    1459     5b3 mm/page_ext.o
    2722      50       0    2772     ad4 mm/page_owner.o
 
@@ -62,21 +68,23 @@ are catched and marked, although they are mostly allocated from struct
 page extension feature. Anyway, after that, no page is left in
 un-tracking state.
 
-* Usage
+Usage
+=====
+
+1) Build user-space helper::
 
-1) Build user-space helper
 	cd tools/vm
 	make page_owner_sort
 
-2) Enable page owner
-	Add "page_owner=on" to boot cmdline.
+2) Enable page owner: add "page_owner=on" to boot cmdline.
 
 3) Do the job what you want to debug
 
-4) Analyze information from page owner
+4) Analyze information from page owner::
+
 	cat /sys/kernel/debug/page_owner > page_owner_full.txt
 	grep -v ^PFN page_owner_full.txt > page_owner.txt
 	./page_owner_sort page_owner.txt sorted_page_owner.txt
 
-	See the result about who allocated each page
-	in the sorted_page_owner.txt.
+   See the result about who allocated each page
+   in the ``sorted_page_owner.txt``.
-- 
2.7.4
[PATCH 14/32] docs/vm: overcommit-accounting: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/overcommit-accounting | 107 ++--- 1 file changed, 57 insertions(+), 50 deletions(-) diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting index cbfaaa6..0dd54bb 100644 --- a/Documentation/vm/overcommit-accounting +++ b/Documentation/vm/overcommit-accounting @@ -1,80 +1,87 @@ +.. _overcommit_accounting: + += +Overcommit Accounting += + The Linux kernel supports the following overcommit handling modes -0 - Heuristic overcommit handling. Obvious overcommits of - address space are refused. Used for a typical system. It - ensures a seriously wild allocation fails while allowing - overcommit to reduce swap usage. root is allowed to - allocate slightly more memory in this mode. This is the - default. +0 + Heuristic overcommit handling. Obvious overcommits of address + space are refused. Used for a typical system. It ensures a + seriously wild allocation fails while allowing overcommit to + reduce swap usage. root is allowed to allocate slightly more + memory in this mode. This is the default. -1 - Always overcommit. Appropriate for some scientific - applications. Classic example is code using sparse arrays - and just relying on the virtual memory consisting almost - entirely of zero pages. +1 + Always overcommit. Appropriate for some scientific + applications. Classic example is code using sparse arrays and + just relying on the virtual memory consisting almost entirely + of zero pages. -2 - Don't overcommit. The total address space commit - for the system is not permitted to exceed swap + a - configurable amount (default is 50%) of physical RAM. - Depending on the amount you use, in most situations - this means a process will not be killed while accessing - pages but will receive errors on memory allocation as - appropriate. +2 + Don't overcommit. The total address space commit for the + system is not permitted to exceed swap + a configurable amount + (default is 50%) of physical RAM. 
Depending on the amount you + use, in most situations this means a process will not be + killed while accessing pages but will receive errors on memory + allocation as appropriate. - Useful for applications that want to guarantee their - memory allocations will be available in the future - without having to initialize every page. + Useful for applications that want to guarantee their memory + allocations will be available in the future without having to + initialize every page. -The overcommit policy is set via the sysctl `vm.overcommit_memory'. +The overcommit policy is set via the sysctl ``vm.overcommit_memory``. -The overcommit amount can be set via `vm.overcommit_ratio' (percentage) -or `vm.overcommit_kbytes' (absolute value). +The overcommit amount can be set via ``vm.overcommit_ratio`` (percentage) +or ``vm.overcommit_kbytes`` (absolute value). The current overcommit limit and amount committed are viewable in -/proc/meminfo as CommitLimit and Committed_AS respectively. +``/proc/meminfo`` as CommitLimit and Committed_AS respectively. Gotchas +=== The C language stack growth does an implicit mremap. If you want absolute -guarantees and run close to the edge you MUST mmap your stack for the +guarantees and run close to the edge you MUST mmap your stack for the largest size you think you will need. For typical stack usage this does not matter much but it's a corner case if you really really care -In mode 2 the MAP_NORESERVE flag is ignored. +In mode 2 the MAP_NORESERVE flag is ignored. 
How It Works - + The overcommit is based on the following rules For a file backed map - SHARED or READ-only - 0 cost (the file is the map not swap) - PRIVATE WRITABLE- size of mapping per instance + | SHARED or READ-only - 0 cost (the file is the map not swap) + | PRIVATE WRITABLE - size of mapping per instance -For an anonymous or /dev/zero map - SHARED - size of mapping - PRIVATE READ-only - 0 cost (but of little use) - PRIVATE WRITABLE- size of mapping per instance +For an anonymous or ``/dev/zero`` map + | SHARED- size of mapping + | PRIVATE READ-only - 0 cost (but of little use) + | PRIVATE WRITABLE - size of mapping per instance Additional accounting - Pages made writable copies by mmap -
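A note on the mode 2 rule quoted in the patch above (commit limit = swap plus a configurable share of physical RAM): the arithmetic can be sketched in userspace as follows. This is only an illustration under stated assumptions. The function names and the sample figures are invented for the example, a non-zero overcommit_kbytes is assumed to take the place of the percentage, and any adjustment the kernel itself makes (for instance for huge pages) is ignored.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def estimate_commit_limit_kb(mem_total_kb, swap_total_kb,
                             overcommit_ratio=50, overcommit_kbytes=0):
    """Mode 2 limit: swap plus a configurable share of physical RAM.

    Assumption: a non-zero overcommit_kbytes replaces the percentage.
    Kernel-side adjustments (e.g. huge pages) are deliberately ignored.
    """
    if overcommit_kbytes:
        ram_share_kb = overcommit_kbytes
    else:
        ram_share_kb = mem_total_kb * overcommit_ratio // 100
    return swap_total_kb + ram_share_kb


# Hypothetical sample in the /proc/meminfo layout; the numbers are made up.
SAMPLE = """\
MemTotal:        8000000 kB
SwapTotal:       2000000 kB
CommitLimit:     6000000 kB
Committed_AS:    1500000 kB
"""

info = parse_meminfo(SAMPLE)
limit = estimate_commit_limit_kb(info["MemTotal"], info["SwapTotal"])
# Estimated limit, then the headroom left before allocations start failing.
print(limit, limit - info["Committed_AS"])
```

On a live system the same parser could be fed the contents of /proc/meminfo, and the estimate compared with the CommitLimit line the kernel reports.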
[PATCH 11/32] docs/vm: ksm.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/ksm.txt | 215 --- 1 file changed, 110 insertions(+), 105 deletions(-) diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt index 6686bd2..87e7eef 100644 --- a/Documentation/vm/ksm.txt +++ b/Documentation/vm/ksm.txt @@ -1,8 +1,11 @@ -How to use the Kernel Samepage Merging feature --- +.. _ksm: + +=== +Kernel Samepage Merging +=== KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, -added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, +added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation, and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ The KSM daemon ksmd periodically scans those areas of user memory which @@ -51,110 +54,112 @@ Applications should be considerate in their use of MADV_MERGEABLE, restricting its use to areas likely to benefit. KSM's scans may use a lot of processing power: some installations will disable KSM for that reason. -The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, +The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``, readable by all but writable only by root: -pages_to_scan- how many present pages to scan before ksmd goes to sleep - e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan" - Default: 100 (chosen for demonstration purposes) - -sleep_millisecs - how many milliseconds ksmd should sleep before next scan - e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" - Default: 20 (chosen for demonstration purposes) - -merge_across_nodes - specifies if pages from different numa nodes can be merged. - When set to 0, ksm merges only pages which physically - reside in the memory area of same NUMA node. That brings - lower latency to access of shared pages. Systems with more - nodes, at significant NUMA distances, are likely to benefit - from the lower latency of setting 0. 
Smaller systems, which - need to minimize memory usage, are likely to benefit from - the greater sharing of setting 1 (default). You may wish to - compare how your system performs under each setting, before - deciding on which to use. merge_across_nodes setting can be - changed only when there are no ksm shared pages in system: - set run 2 to unmerge pages first, then to 1 after changing - merge_across_nodes, to remerge according to the new setting. - Default: 1 (merging across nodes as in earlier releases) - -run - set 0 to stop ksmd from running but keep merged pages, - set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", - set 2 to stop ksmd and unmerge all pages currently merged, - but leave mergeable areas registered for next run - Default: 0 (must be changed to 1 to activate KSM, - except if CONFIG_SYSFS is disabled) - -use_zero_pages - specifies whether empty pages (i.e. allocated pages - that only contain zeroes) should be treated specially. - When set to 1, empty pages are merged with the kernel - zero page(s) instead of with each other as it would - happen normally. This can improve the performance on - architectures with coloured zero pages, depending on - the workload. Care should be taken when enabling this - setting, as it can potentially degrade the performance - of KSM for some workloads, for example if the checksums - of pages candidate for merging match the checksum of - an empty page. This setting can be changed at any time, - it is only effective for pages merged after the change. - Default: 0 (normal KSM behaviour as in earlier releases) - -max_page_sharing - Maximum sharing allowed for each KSM page. This - enforces a deduplication limit to avoid the virtual - memory rmap lists to grow too large. The minimum - value is 2 as a newly created KSM page will have at - least two sharers. The rmap walk has O(N) - complexity where N is the number of rmap_items - (i.e. 
virtual mappings) that are sharing the page, - which is in turn capped by max_page_sharing. So - this effectively spread the the linear O(N) - computational complexity from rmap walk
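As a rough illustration of the sysfs knobs listed in the patch above, the sketch below reads whichever of them exist and estimates how fast ksmd could scan with a given pages_to_scan and sleep_millisecs. The helper names are hypothetical, and the rate is only an upper bound that assumes the scan itself takes no time. The demo writes to a temporary directory so that it runs without KSM or root.

```python
import os
import tempfile

# Knob names as listed in the ksm documentation quoted above.
KSM_KNOBS = ("pages_to_scan", "sleep_millisecs", "merge_across_nodes",
             "run", "use_zero_pages", "max_page_sharing")


def read_ksm_settings(base="/sys/kernel/mm/ksm"):
    """Return {knob: int} for each listed knob file found under base."""
    settings = {}
    for name in KSM_KNOBS:
        path = os.path.join(base, name)
        try:
            with open(path) as handle:
                settings[name] = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # knob missing, unreadable, or not a plain integer
    return settings


def max_scan_rate_pages_per_sec(pages_to_scan, sleep_millisecs):
    """Upper bound on ksmd's scan rate, ignoring the time spent scanning."""
    if sleep_millisecs <= 0:
        raise ValueError("sleep_millisecs must be positive for this estimate")
    return pages_to_scan * 1000 / sleep_millisecs


# Demo against a throwaway directory holding the documented defaults.
with tempfile.TemporaryDirectory() as demo_dir:
    for knob, value in (("pages_to_scan", 100), ("sleep_millisecs", 20), ("run", 0)):
        with open(os.path.join(demo_dir, knob), "w") as handle:
            handle.write(f"{value}\n")
    demo = read_ksm_settings(demo_dir)

print(demo)
print(max_scan_rate_pages_per_sec(demo["pages_to_scan"], demo["sleep_millisecs"]))
```

With the documented defaults (100 pages, 20 ms sleep) the bound works out to 5000 pages per second, which gives a sense of why the text calls those defaults "chosen for demonstration purposes".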
[PATCH 05/32] docs/vm: highmem.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/highmem.txt | 87 ++-- 1 file changed, 36 insertions(+), 51 deletions(-) diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt index 4324d24..0f69a9f 100644 --- a/Documentation/vm/highmem.txt +++ b/Documentation/vm/highmem.txt @@ -1,25 +1,14 @@ +.. _highmem: - -HIGH MEMORY HANDLING - + +High Memory Handling + By: Peter Zijlstra -Contents: - - (*) What is high memory? - - (*) Temporary virtual mappings. - - (*) Using kmap_atomic. - - (*) Cost of temporary mappings. - - (*) i386 PAE. +.. contents:: :local: - - -WHAT IS HIGH MEMORY? +What Is High Memory? High memory (highmem) is used when the size of physical memory approaches or @@ -38,7 +27,7 @@ kernel entry/exit. This means the available virtual memory space (4GiB on i386) has to be divided between user and kernel space. The traditional split for architectures using this approach is 3:1, 3GiB for -userspace and the top 1GiB for kernel space: +userspace and the top 1GiB for kernel space:: ++ 0x | Kernel | @@ -58,40 +47,38 @@ and user maps. Some hardware (like some ARMs), however, have limited virtual space when they use mm context tags. -== -TEMPORARY VIRTUAL MAPPINGS +Temporary Virtual Mappings == The kernel contains several ways of creating temporary mappings: - (*) vmap(). This can be used to make a long duration mapping of multiple - physical pages into a contiguous virtual space. It needs global - synchronization to unmap. +* vmap(). This can be used to make a long duration mapping of multiple + physical pages into a contiguous virtual space. It needs global + synchronization to unmap. - (*) kmap(). This permits a short duration mapping of a single page. It needs - global synchronization, but is amortized somewhat. It is also prone to - deadlocks when using in a nested fashion, and so it is not recommended for - new code. +* kmap(). This permits a short duration mapping of a single page. 
It needs + global synchronization, but is amortized somewhat. It is also prone to + deadlocks when using in a nested fashion, and so it is not recommended for + new code. - (*) kmap_atomic(). This permits a very short duration mapping of a single - page. Since the mapping is restricted to the CPU that issued it, it - performs well, but the issuing task is therefore required to stay on that - CPU until it has finished, lest some other task displace its mappings. +* kmap_atomic(). This permits a very short duration mapping of a single + page. Since the mapping is restricted to the CPU that issued it, it + performs well, but the issuing task is therefore required to stay on that + CPU until it has finished, lest some other task displace its mappings. - kmap_atomic() may also be used by interrupt contexts, since it is does not - sleep and the caller may not sleep until after kunmap_atomic() is called. + kmap_atomic() may also be used by interrupt contexts, since it is does not + sleep and the caller may not sleep until after kunmap_atomic() is called. - It may be assumed that k[un]map_atomic() won't fail. + It may be assumed that k[un]map_atomic() won't fail. -= -USING KMAP_ATOMIC +Using kmap_atomic = When and where to use kmap_atomic() is straightforward. It is used when code wants to access the contents of a page that might be allocated from high memory (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two -functions, and they can be used in a manner similar to the following: +functions, and they can be used in a manner similar to the following:: /* Find the page of interest. */ struct page *page = find_get_page(mapping, offset); @@ -109,7 +96,7 @@ Note that the kunmap_atomic() call takes the result of the kmap_atomic() call not the argument. 
If you need to map two pages because you want to copy from one page to -another you need to keep the kmap_atomic calls strictly nested, like: +another you need to keep the kmap_atomic calls strictly nested, like:: vaddr1 = kmap_atomic(page1); vaddr2 = kmap_atomic(page2); @@ -120,8 +107,7 @@ another you need to keep the kmap_atomic calls strictly nested, like: kunmap_atomic(vaddr1); -== -COST OF TEMPORARY MAPPINGS +Cost of Temporary Mappings == The cost of creating temporary mappings can be quite high. The arch has to @@ -136,25 +122,24 @@ If CONFIG_MMU is not set, then there can be no temporary
[PATCH 06/32] docs/vm: hmm.txt: convert to ReST format
Signed-off-by: Mike Rapoport--- Documentation/vm/hmm.txt | 66 1 file changed, 28 insertions(+), 38 deletions(-) diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt index 4d3aac9..3fafa33 100644 --- a/Documentation/vm/hmm.txt +++ b/Documentation/vm/hmm.txt @@ -1,4 +1,8 @@ +.. hmm: + += Heterogeneous Memory Management (HMM) += Transparently allow any component of a program to use any memory region of said program with a device without using device specific memory allocator. This is @@ -14,19 +18,10 @@ deals with how device memory is represented inside the kernel. Finaly the last section present the new migration helper that allow to leverage the device DMA engine. +.. contents:: :local: -1) Problems of using device specific memory allocator: -2) System bus, device memory characteristics -3) Share address space and migration -4) Address space mirroring implementation and API -5) Represent and manage device memory from core kernel point of view -6) Migrate to and from device memory -7) Memory cgroup (memcg) and rss accounting - - - -1) Problems of using device specific memory allocator: +Problems of using device specific memory allocator +== Device with large amount of on board memory (several giga bytes) like GPU have historically manage their memory through dedicated driver specific API. This @@ -68,9 +63,8 @@ only do-able with a share address. It is as well more reasonable to use a share address space for all the other patterns. - -2) System bus, device memory characteristics +System bus, device memory characteristics += System bus cripple share address due to few limitations. Most system bus only allow basic memory access from device to main memory, even cache coherency is @@ -100,9 +94,8 @@ access any memory memory but we must also permit any memory to be migrated to device memory while device is using it (blocking CPU access while it happens). 
- -3) Share address space and migration +Share address space and migration += HMM intends to provide two main features. First one is to share the address space by duplication the CPU page table into the device page table so same @@ -140,14 +133,13 @@ leverage device memory by migrating part of data-set that is actively use by a device. - -4) Address space mirroring implementation and API +Address space mirroring implementation and API +== Address space mirroring main objective is to allow to duplicate range of CPU page table into a device page table and HMM helps keeping both synchronize. A device driver that want to mirror a process address space must start with the -registration of an hmm_mirror struct: +registration of an hmm_mirror struct:: int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm); @@ -156,7 +148,7 @@ registration of an hmm_mirror struct: The locked variant is to be use when the driver is already holding the mmap_sem of the mm in write mode. The mirror struct has a set of callback that are use -to propagate CPU page table: +to propagate CPU page table:: struct hmm_mirror_ops { /* sync_cpu_device_pagetables() - synchronize page tables @@ -187,7 +179,8 @@ be done with the update. When device driver wants to populate a range of virtual address it can use -either: +either:: + int hmm_vma_get_pfns(struct vm_area_struct *vma, struct hmm_range *range, unsigned long start, @@ -211,7 +204,7 @@ that array correspond to an address in the virtual range. HMM provide a set of flags to help driver identify special CPU page table entries. Locking with the update() callback is the most important aspect the driver must -respect in order to keep things properly synchronize. The usage pattern is : +respect in order to keep things properly synchronize. The usage pattern is:: int driver_populate_range(...) { @@ -251,9 +244,8 @@ concurrently for multiple devices. 
Waiting for each device to report commands as executed is serialize (there is no point in doing this concurrently). - -5) Represent and manage device memory from core kernel point of view +Represent and manage device memory from core kernel point of view
Re: [PATCH v4 4/6] tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP
Quoting Karthik Ramasubramanian (2018-03-20 15:53:25)
> 
> 
> On 3/20/2018 9:37 AM, Stephen Boyd wrote:
> > Quoting Karthikeyan Ramasubramanian (2018-03-14 16:58:49)
> >> diff --git a/drivers/tty/serial/qcom_geni_serial.c
> >> b/drivers/tty/serial/qcom_geni_serial.c
> >> new file mode 100644
> >> index 0000000..1442777
> >> --- /dev/null
> >> +++ b/drivers/tty/serial/qcom_geni_serial.c
> >> @@ -0,0 +1,1158 @@
> >> +
> >> +#ifdef CONFIG_SERIAL_QCOM_GENI_CONSOLE
> >> +static void qcom_geni_serial_wr_char(struct uart_port *uport, int ch)
> >> +{
> >> +	writel_relaxed(ch, uport->membase + SE_GENI_TX_FIFOn);
> > 
> > Does this expect the whole word to have data to write? Or does the FIFO
> > output a character followed by three NUL bytes each time it gets
> > written? The way that uart_console_write() works is to take each
> > character a byte at a time, put it into an int (so extend that byte with
> > zero) and then pass it to the putchar function. I would expect that at
> > this point the hardware sees the single character and then 3 NULs enter
> > the FIFO each time.
> > 
> > For previous MSM uarts I had to handle this oddity by packing the words
> > into the fifo four at a time. You may need to do the same here.
> The packing configuration 1 * 8 (done using geni_se_config_packing)
> ensures that only one byte per FIFO word needs to be transmitted. From
> that perspective, we need not have such oddity.

Ok! That's good to hear.

> > 
> > Can you also support the OF_EARLYCON_DECLARE method of console writing
> > so we can get an early printk style debug console?
> Do you prefer that as part of this patch itself or is it ok if I upload
> the earlycon support once this gets merged.

I think this already got merged? So just split it out into another patch
would be fine. I see the config is already selecting the earlycon
support so it must be planned.

> > 
> > 
> >> +
> >> +	spin_lock_irqsave(&uport->lock, flags);
> >> +	m_irq_status = readl_relaxed(uport->membase +
> >> SE_GENI_M_IRQ_STATUS);
> >> +	s_irq_status = readl_relaxed(uport->membase +
> >> SE_GENI_S_IRQ_STATUS);
> >> +	m_irq_en = readl_relaxed(uport->membase + SE_GENI_M_IRQ_EN);
> >> +	writel_relaxed(m_irq_status, uport->membase + SE_GENI_M_IRQ_CLEAR);
> >> +	writel_relaxed(s_irq_status, uport->membase + SE_GENI_S_IRQ_CLEAR);
> >> +
> >> +	if (WARN_ON(m_irq_status & M_ILLEGAL_CMD_EN))
> >> +		goto out_unlock;
> >> +
> >> +	if (s_irq_status & S_RX_FIFO_WR_ERR_EN) {
> >> +		uport->icount.overrun++;
> >> +		tty_insert_flip_char(tport, 0, TTY_OVERRUN);
> >> +	}
> >> +
> >> +	if (m_irq_status & (M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN) &&
> >> +	    m_irq_en & (M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN))
> >> +		qcom_geni_serial_handle_tx(uport);
> >> +
> >> +	if (s_irq_status & S_GP_IRQ_0_EN || s_irq_status & S_GP_IRQ_1_EN) {
> >> +		if (s_irq_status & S_GP_IRQ_0_EN)
> >> +			uport->icount.parity++;
> >> +		drop_rx = true;
> >> +	} else if (s_irq_status & S_GP_IRQ_2_EN ||
> >> +		   s_irq_status & S_GP_IRQ_3_EN) {
> >> +		uport->icount.brk++;
> > 
> > Maybe move this stat accounting to the place where brk is handled?
> Since other error accounting like overrun, parity are happening here, it
> feels logical to keep that accounting here.

Alright.

> >> +	return uart_add_one_port(&qcom_geni_console_driver, uport);
> >> +}
> >> +
> >> +static int qcom_geni_serial_remove(struct platform_device *pdev)
> >> +{
> >> +	struct qcom_geni_serial_port *port = platform_get_drvdata(pdev);
> >> +	struct uart_driver *drv = port->uport.private_data;
> >> +
> >> +	uart_remove_one_port(drv, &port->uport);
> >> +	return 0;
> >> +}
> >> +
> >> +static int __maybe_unused qcom_geni_serial_sys_suspend_noirq(struct
> >> device *dev)
> >> +{
> >> +	struct platform_device *pdev = to_platform_device(dev);
> >> +	struct qcom_geni_serial_port *port = platform_get_drvdata(pdev);
> >> +	struct uart_port *uport = &port->uport;
> >> +
> >> +	uart_suspend_port(uport->private_data, uport);
> >> +	return 0;
> >> +}
> >> +
> >> +static int __maybe_unused qcom_geni_serial_sys_resume_noirq(struct device
> >> *dev)
> >> +{
> >> +	struct platform_device *pdev = to_platform_device(dev);
> >> +	struct qcom_geni_serial_port *port = platform_get_drvdata(pdev);
> >> +	struct uart_port *uport = &port->uport;
> >> +
> >> +	if (console_suspend_enabled && uport->suspended) {
> >> +		uart_resume_port(uport->private_data, uport);
> >> +		disable_irq(uport->irq);
> > 
> > I missed the enable_irq() part. Is this still necessary?
> Suspending the uart console port invokes the uart port shutdown
> operation. The shutdown operation disables and frees the concerned IRQ.
> Resuming the uart console port
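For readers following the FIFO packing exchange earlier in this thread, here is a small model of the two layouts being compared: one byte per 32-bit FIFO word (the "1 * 8" packing) versus four bytes per word. It is a sketch only. The function name is made up, the least-significant-byte-first order is an assumption, and nothing here touches or describes the actual GENI register interface.

```python
def pack_fifo_words(data, bytes_per_word):
    """Split data into 32-bit FIFO words holding bytes_per_word bytes each.

    Bytes are placed least-significant first (an assumption for this model);
    unused upper bytes of a word stay zero.
    """
    if bytes_per_word not in (1, 2, 4):
        raise ValueError("bytes_per_word must be 1, 2 or 4")
    words = []
    for start in range(0, len(data), bytes_per_word):
        chunk = data[start:start + bytes_per_word]
        words.append(int.from_bytes(chunk, "little"))
    return words


message = b"hello"
one_per_word = pack_fifo_words(message, 1)   # one character per FIFO write
four_per_word = pack_fifo_words(message, 4)  # characters packed four at a time
print([hex(word) for word in one_per_word])
print([hex(word) for word in four_per_word])
```

In the one-byte layout each FIFO write carries a single character with zero upper bytes, which matches what a putchar-style console path hands over. Whether those zero bytes are transmitted is the hardware packing question raised in the review.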
Re: [PATCH] xfs: Change URL for the project in xfs.txt
On Sat, Mar 03, 2018 at 09:43:10AM +1100, Dave Chinner wrote:
> On Fri, Mar 02, 2018 at 04:08:24PM -0600, Eric Sandeen wrote:
> > 
> > 
> > On 3/2/18 3:57 PM, Dave Chinner wrote:
> > > On Fri, Mar 02, 2018 at 09:24:01AM -0800, Darrick J. Wong wrote:
> > >> On Fri, Mar 02, 2018 at 10:30:13PM +0900, Masanari Iida wrote:
> > >>> The oss.sgi.com doesn't exist any more.
> > >>> Change it to current project URL, https://xfs.wiki.kernel.org/
> > >>> 
> > >>> Signed-off-by: Masanari Iida
> > >>> ---
> > >>>  Documentation/filesystems/xfs.txt | 2 +-
> > >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>> 
> > >>> diff --git a/Documentation/filesystems/xfs.txt
> > >>> b/Documentation/filesystems/xfs.txt
> > >>> index 3b9b5c149f32..4d9ff0a7f8e1 100644
> > >>> --- a/Documentation/filesystems/xfs.txt
> > >>> +++ b/Documentation/filesystems/xfs.txt
> > >>> @@ -9,7 +9,7 @@ variable block sizes, is extent based, and makes
> > >>> extensive use of
> > >>>  Btrees (directories, extents, free space) to aid both performance
> > >>>  and scalability.
> > >>> 
> > >>> -Refer to the documentation at http://oss.sgi.com/projects/xfs/
> > >>> +Refer to the documentation at https://xfs.wiki.kernel.org/
> > > 
> > > Did I miss a memo?
> > 
> > About which part, the loss of oss.sgi or the addition of the kernel.org
> > wiki?
> > 
> > The kernel.org wiki is pretty bare though. OTOH xfs.org is a bit less
> > official. We really need to resolve this issue.
> 
> Moving everything to kernel.org wiki. As I mentioned on IRC, I'd
> much prefer we move away from wiki's to something we can edit
> locally, review via email, has proper revision control and a
> "publish" mechanism that pushes built documentation out to the
> public website.

Makes sense, it's sort of annoying to have to build the pdfs from
documentation repo and kup them to k.org manually. In the meantime I'd
rather have a scribble-me-elmo wiki over a dead url.

Also afaik the only people who actually have write access to that wiki
are Luis, Eric, and me, so hopefully we won't have to deal with vandalism
in the interim.

I wonder if we could just make the existing DokuWiki use
xfs-documentation.git as its backend and control the publishing that
way...?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
[PATCH v6 0/2] cpuset: Enable cpuset controller in default hierarchy
v6:
 - Hide cpuset control knobs in root cgroup.
 - Rename effective_cpus and effective_mems to cpus.effective and
   mems.effective respectively.
 - Remove cpuset.flags and add cpuset.sched_load_balance instead
   as the behavior of sched_load_balance has changed and so is
   not a simple flag.
 - Update cgroup-v2.txt accordingly.

v5:
 - Add patch 2 to provide the cpuset.flags control knob for the
   sched_load_balance flag which should be the only feature that is
   essential as a replacement of the "isolcpus" kernel boot parameter.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

The purpose of this patchset is to provide a minimal set of cpuset
features for cgroup v2. That minimal set includes the cpus, mems,
cpus.effective and mems.effective and sched_load_balance. The last one is
needed to support use cases similar to the "isolcpus" kernel parameter.

This patchset does not exclude the possibility of adding more flags
and features in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective counterparts.

Patch 2 adds sched_load_balance whose behavior changes in v2 to become
hierarchical and includes an implicit !cpu_exclusive.

Waiman Long (2):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add cpuset.sched_load_balance to v2

 Documentation/cgroup-v2.txt | 112 ++--
 kernel/cgroup/cpuset.c      | 104
 2 files changed, 201 insertions(+), 15 deletions(-)

-- 
1.8.3.1
[PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2
The sched_load_balance flag is needed to enable CPU isolation similar
to what can be done with the "isolcpus" kernel boot parameter.

The sched_load_balance flag implies an implicit !cpu_exclusive as
it doesn't make sense to have an isolated CPU being load-balanced in
another cpuset.

For v2, this flag is hierarchical and is inherited by child cpusets. It
is not allowed to have this flag turned off in a parent cpuset, but on
in a child cpuset.

This flag is set by the parent and is not delegatable.

Signed-off-by: Waiman Long
---
 Documentation/cgroup-v2.txt | 22 ++
 kernel/cgroup/cpuset.c      | 56 +++--
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index ed8ec66..c970bd7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,28 @@ Cpuset Interface Files
 	it is a subset of "cpuset.mems".  Its value will be affected
 	by memory nodes hotplug events.
 
+  cpuset.sched_load_balance
+	A read-write single value file which exists on non-root cgroups.
+	The default is "1" (on), and the other possible value is "0"
+	(off).
+
+	When it is on, tasks within this cpuset will be load-balanced
+	by the kernel scheduler.  Tasks will be moved from CPUs with
+	high load to other CPUs within the same cpuset with less load
+	periodically.
+
+	When it is off, there will be no load balancing among CPUs on
+	this cgroup.  Tasks will stay in the CPUs they are running on
+	and will not be moved to other CPUs.
+
+	This flag is hierarchical and is inherited by child cpusets. It
+	can be turned off only when the CPUs in this cpuset aren't
+	listed in the cpuset.cpus of other sibling cgroups, and all
+	the child cpusets, if present, have this flag turned off.
+
+	Once it is off, it cannot be turned back on as long as the
+	parent cgroup still has this flag in the off state.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..d675c4f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -407,15 +407,22 @@ static void cpuset_update_task_spread_flag(struct cpuset *cs,
  *
  * One cpuset is a subset of another if all its allowed CPUs and
  * Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set.  Call holding cpuset_mutex.
+ * are only set if the other's are set (on legacy hierarchy) or
+ * its sched_load_balance flag is only set if the other is set
+ * (on default hierarchy).  Caller holding cpuset_mutex.
  */
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
 {
-	return	cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
-		nodes_subset(p->mems_allowed, q->mems_allowed) &&
-		is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
-		is_mem_exclusive(p) <= is_mem_exclusive(q);
+	if (!cpumask_subset(p->cpus_allowed, q->cpus_allowed) ||
+	    !nodes_subset(p->mems_allowed, q->mems_allowed))
+		return false;
+
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys))
+		return is_sched_load_balance(p) <= is_sched_load_balance(q);
+	else
+		return is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
+		       is_mem_exclusive(p) <= is_mem_exclusive(q);
 }
 
 /**
@@ -498,7 +505,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
 	par = parent_cs(cur);
 
-	/* On legacy hiearchy, we must be a subset of our parent cpuset. */
+	/* On legacy hierarchy, we must be a subset of our parent cpuset. */
 	ret = -EACCES;
 	if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
 		goto out;
@@ -1327,6 +1334,19 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	else
 		clear_bit(bit, >flags);
 
+	/*
+	 * On default hierarchy, turning off sched_load_balance flag implies
+	 * an implicit cpu_exclusive. Turning on sched_load_balance will
+	 * clear the cpu_exclusive flag.
+	 */
+	if ((bit == CS_SCHED_LOAD_BALANCE) &&
+	    cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+		if (turning_on)
+			clear_bit(CS_CPU_EXCLUSIVE, >flags);
+		else
+			set_bit(CS_CPU_EXCLUSIVE, >flags);
+	}
+
 	err = validate_change(cs, trialcs);
 	if (err < 0)
 		goto out;
@@ -1966,6 +1986,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched_load_balance",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private =
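(The archived copy of this patch is cut off at this point.)

Editorial sketch, not part of the patch: the subset rule that the patched is_cpuset_subset() applies can be tried out in userspace roughly as follows. Everything here is an assumption made for illustration: struct toy_cpuset, mask_subset() and toy_is_subset() are hypothetical names, CPUs and memory nodes are modelled as plain 64-bit masks rather than cpumask_t/nodemask_t, and the on_dfl argument stands in for cgroup_subsys_on_dfl(cpuset_cgrp_subsys).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpuset, for illustration only. */
struct toy_cpuset {
	uint64_t cpus;			/* bit n set => CPU n allowed */
	uint64_t mems;			/* bit n set => memory node n allowed */
	bool sched_load_balance;
	bool cpu_exclusive;
	bool mem_exclusive;
};

/* True if every bit set in a is also set in b. */
static bool mask_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}

/*
 * p is a subset of q when its CPUs and memory nodes are subsets of q's
 * and, on the default hierarchy, its sched_load_balance flag is only set
 * if q's is set. On the legacy hierarchy the two exclusive flags are
 * compared instead.
 */
static bool toy_is_subset(const struct toy_cpuset *p,
			  const struct toy_cpuset *q, bool on_dfl)
{
	assert(p != NULL && q != NULL);

	if (!mask_subset(p->cpus, q->cpus) || !mask_subset(p->mems, q->mems))
		return false;

	if (on_dfl)
		return p->sched_load_balance <= q->sched_load_balance;

	return p->cpu_exclusive <= q->cpu_exclusive &&
	       p->mem_exclusive <= q->mem_exclusive;
}
```

With a parent that has load balancing off, a child with load balancing on fails the default-hierarchy check even when its CPUs and memory nodes fit inside the parent's, which is the "off in a parent, on in a child is not allowed" rule from the commit message.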
[PATCH v6 1/2] cpuset: Enable cpuset controller in default hierarchy
Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hierarchy, it may make more sense
to move some of those additional cpuset features, if desired, to
memory controller or maybe to the cpu controller instead of staying
with cpuset.

Signed-off-by: Waiman Long
---
 Documentation/cgroup-v2.txt | 90 ++---
 kernel/cgroup/cpuset.c      | 48 ++--
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..ed8ec66 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
        5-3-2. Writeback
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5-5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation of a
 new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cgroups.
+
+	It lists the CPUs allowed to be used by tasks within this
+	cgroup.  The CPU numbers are comma-separated numbers or
+	ranges.  For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cgroups.
+
+	It lists the onlined CPUs that are actually allowed to be
+	used by tasks within the current cgroup.  If "cpuset.cpus"
+	is empty, it shows all the CPUs from the parent cgroup that
+	will be available to be used by this cgroup.  Otherwise, it is
+	a subset of "cpuset.cpus".  Its value will be affected by CPU
+	hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cgroups.
+
+	It lists the memory nodes allowed to be used by tasks within
+	this cgroup.  The memory node numbers are comma-separated
+	numbers or ranges.  For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cgroups.
+
+	It lists the onlined memory nodes that are actually allowed to
+	be used by tasks within the current cgroup.  If "cpuset.mems"
+	is empty, it shows all the memory nodes from the parent cgroup
+	that will be available to be used by this cgroup.  Otherwise,
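(The archived copy of this patch is cut off at this point.)

Editorial sketch, not part of the patch: the "comma-separated numbers or ranges" format documented above for cpuset.cpus and cpuset.mems (for example "0-4,6,8-10") could be read by a userspace tool along these lines. The function name parse_cpu_list() and the 64-entry limit of the bitmask are assumptions of this sketch, not anything the kernel interface defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a list such as "0-4,6,8-10" into a 64-bit mask (bit n = CPU n).
 * Returns 0 on success and -1 on malformed input or numbers >= 64.
 * An empty string (or a lone newline) yields an empty mask, matching the
 * "empty value" case described in the documentation text.
 */
static int parse_cpu_list(const char *s, uint64_t *mask)
{
	assert(s != NULL && mask != NULL);

	*mask = 0;
	while (*s != '\0' && *s != '\n') {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;

		/* Optional "-N" turns a single number into a range. */
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= 64)
			return -1;

		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			*mask |= UINT64_C(1) << cpu;

		/* A comma must be followed by another number. */
		if (*end == ',') {
			end++;
			if (*end < '0' || *end > '9')
				return -1;
		} else if (*end != '\0' && *end != '\n') {
			return -1;
		}
		s = end;
	}
	return 0;
}
```

A trailing newline is accepted because that is what reading the file with cat-like tools typically returns; a real tool would need a wider mask than 64 bits on large systems.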
Re: [PATCH v4 6/6] arm64: dts: sdm845: Add I2C controller support
Hi Doug

On 3/20/2018 9:47 PM, Doug Anderson wrote:
> Hi,
>
> On Tue, Mar 20, 2018 at 3:16 PM, Sagar Dharia wrote:
>> +               pinconf {
>> +                       pins = "gpio55", "gpio56";
>> +                       drive-strength = <2>;
>> +                       bias-disable;
>> +               };
>> +       };
>> +
>> +       qup-i2c10-sleep {
>> +               pinconf {
>> +                       pins = "gpio55", "gpio56";
>> +                       bias-pull-up;
>
> Are you sure that you want pullups enabled for sleep here?  There are
> external pulls on this line (as there are on many i2c busses) so doing
> this will double-enable pulls.  It probably won't hurt, but I'm
> curious if there's some sort of reason here.
>
1. We need the lines to remain high to avoid slaves sensing a false
start-condition (this can happen if the SDA goes down before SCL).
2. Disclaimer: I'm not a HW expert, but we were told that
tri-state/bias-disabled lines can draw more current. I will find out
more about that.
>>>
>>> Agreed that they need to remain high, but you've got very strong
>>> pullups external to the SoC.  Those will keep it high.  You don't need
>>> the internal ones too.
>>>
>>> As extra evidence that the external pullups _must_ be present on your
>>> board: you specify bias-disable in the active state.  That can only
>>> work if there are external pullups (or if there were some special
>>> extra secret internal pullups that were part of geni).  i2c is an
>>> open-drain bus and thus there must be pullups on the bus in order to
>>> communicate.
>>>
>>
>> You are right, I followed up about the pull-up recommendation and that
>> was for a GPIO where there was no external pull-up (GPIO was not used
>> for I2C). It's safe to assume I2C will always have external pullup.
>
> It is even more safe to say that I2C will always have an external
> pullup on the SDM845-MTP.  Remember that the pullup config is in the
> board device tree file, not the SoC one.  So even if someone out there
> decides that the internal pull is somehow good enough for their own
> board and they don't stuff external ones, then it will be up to them
> to turn the pull up on in their own board file.
>
>
>> We
>> will change sleep-config of I2C GPIOs to no-pull.
>
> Even better IMHO: don't specify the bias in the sleep config.  I don't
> believe it's possible for the sleep config to take effect without the
> default config since the default config applies at probe time.  ...so
> you'll always get the default config applied at probe time and you
> don't need to touch the bias at sleep time.

Good point, we will remove the bias from the sleep config for i2c GPIOs.

Thanks
Sagar
>
>
>> +               i2c10: i2c@a88000 {
>
> Seems like it might be nice to add all the i2c busses into the main
> sdm845.dtsi file.  Sure, most won't be enabled, but it seems like it
> would avoid churn later.
>
> ...if you're sure you want to add only one i2c controller, subject of
> this patch should indicate that.
>
Yes, we typically have a "platform(sdm845 here)-qupv3.dtsi" defining
most of the serial-bus instances (i2c, spi, and uart with
status=disabled) that we include from the common header. The boards
enable instances they need. Will that be okay?
>>>
>>> Unless you really feel the need to put these in a separate file I'd
>>> just put them straight in sdm845.dtsi.  Yeah, it'll get big, but
>>> that's OK by me.  I _think_ this matches what Bjorn was suggesting on
>>> previous device tree patches, but CCing him just in case.  I'm
>>> personally OK with whatever Bjorn and other folks with more Qualcomm
>>> history would like.
>>>
>>> ...but yeah, I'm asking for them all to be listed with status="disabled".
>>>
>>
>> Sure, we will change the subject of this patch to indicate that we are
>> adding 1 controller as of now. Later we will add all I2C controllers to
>> dtsi as another patch since that will need pinctrl settings for GPIOs
>> used by those instances and the wrappers devices needed by them.
>
> Yeah, it's fine to just change the subject of this patch.  It would be
> nice to add all the other controllers in sooner rather than later, but
> it doesn't have to be today.
>
>
> -Doug
>

--
Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH] Input: trackpoint: document sysfs interface
On Fri, 2 Mar 2018 23:00:19 +0530
Aishwarya Pant wrote:

> Descriptions have been collected from git commit logs, code commits and
> the TrackPoint System Version 4.0 Engineering Specification.

Applied to the docs tree, thanks.

jon
Re: [PATCH] xfs: Change URL for the project in xfs.txt
On Fri, 2 Mar 2018 22:30:13 +0900
Masanari Iida wrote:

> The oss.sgi.com doesn't exist any more.
> Change it to current project URL, https://xfs.wiki.kernel.org/

Applied to the docs tree, thanks.

jon
Re: [PATCH] HID: ntrig: document sysfs interface
On Wed, 21 Mar 2018 09:28:05 -0600
Jonathan Corbet wrote:

> > Add sysfs documentation for N-Trig touchscreens under Documentation/ABI.
> > Descriptions have been collected from code comments.
>
> Applied to the docs tree, thanks.

Oops, I thought I'd checked to see whether Jiri had picked these up, but I
evidently haven't had enough coffee yet.  Since they're taken care of,
I'll unapply them; sorry for the noise.

jon
Re: [PATCH] HID: logitech-hidpp: document sysfs interface
On Fri, 2 Mar 2018 18:46:53 +0530
Aishwarya Pant wrote:

> Descriptions have been collected from git commit logs.

Applied to the docs tree, thanks.

jon
Re: [PATCH] HID: ntrig: document sysfs interface
On Fri, 2 Mar 2018 11:00:17 +0530
Aishwarya Pant wrote:

> Add sysfs documentation for N-Trig touchscreens under Documentation/ABI.
> Descriptions have been collected from code comments.

Applied to the docs tree, thanks.

jon
Re: [PATCH] char/bsr: add sysfs interface documentation
On Thu, 1 Mar 2018 23:55:59 +0530
Aishwarya Pant wrote:

> Descriptions have been collected from code comments and by reading
> through code.

Applied to the docs tree, thanks.

jon
Re: [trivial PATCH] Documentation/sparse: fix typo
On Tue, 13 Mar 2018 11:10:58 +
Eric Engestrom wrote:

>  If the function enters and exits without the lock held, acquiring and
>  releasing the lock inside the function in a balanced way, no
> -annotation is needed.  The tree annotations above are for cases where
> +annotation is needed.  The three annotations above are for cases where
>  sparse would otherwise report a context imbalance.

Applied to the docs tree, thanks.

jon
Re: [PATCH v2] Documentation/CodingStyle: Add an example for braces
On Thu, 15 Mar 2018 15:04:02 -0500
Gary R Hook wrote:

> +Do use braces when a body is more complex than a single simple statement:
> +
> +.. code-block:: c
> +
> +	if (condition) {
> +		if (another_condition)
> +			do_something();
> +	}

Somebody is sure to complain at some point that this should really be:

	if (condition && another_condition)
		do_something();

To head that off, I think I'll apply your first version instead, sorry
Jani.

In general I'm pretty reluctant to apply coding-style patches for the
simple reason that I sure don't want to be the arbitrator of proper kernel
style.  This one seems to fit well within the accepted norms, though.

Thanks,

jon
Re: [PATCH] docs/vm: update 00-INDEX
On Wed, 21 Mar 2018 17:05:23 +0200
Mike Rapoport wrote:

> Several files were added to Documentation/vm without updates to 00-INDEX.
> Fill in the missing documents

Applied, thanks.

jon
Re: [PATCH] kernel-doc: Remove __sched markings
On Thu, 15 Mar 2018 05:06:23 -0700
Matthew Wilcox wrote:

> I find the __sched annotations unaesthetic in the kernel-doc.  Remove
> them like we remove __inline, __weak, __init and so on.

Makes sense, applied, thanks.

jon
[PATCH] docs/vm: update 00-INDEX
Several files were added to Documentation/vm without updates to 00-INDEX.
Fill in the missing documents

Signed-off-by: Mike Rapoport
---
 Documentation/vm/00-INDEX | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 11d3d8d..0278f2c 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -10,6 +10,8 @@ frontswap.txt
 	- Outline frontswap, part of the transcendent memory frontend.
 highmem.txt
 	- Outline of highmem and common issues.
+hmm.txt
+	- Documentation of heterogeneous memory management
 hugetlbpage.txt
 	- a brief summary of hugetlbpage support in the Linux kernel.
 hugetlbfs_reserv.txt
@@ -20,25 +22,41 @@ idle_page_tracking.txt
 	- description of the idle page tracking feature.
 ksm.txt
 	- how to use the Kernel Samepage Merging feature.
+mmu_notifier.txt
+	- a note about clearing pte/pmd and mmu notifications
 numa
 	- information about NUMA specific code in the Linux vm.
 numa_memory_policy.txt
 	- documentation of concepts and APIs of the 2.6 memory policy support.
 overcommit-accounting
 	- description of the Linux kernels overcommit handling modes.
+page_frags
+	- description of page fragments allocator
 page_migration
 	- description of page migration in NUMA systems.
 pagemap.txt
 	- pagemap, from the userspace perspective
+page_owner.txt
+	- tracking about who allocated each page
+remap_file_pages.txt
+	- a note about remap_file_pages() system call
 slub.txt
 	- a short users guide for SLUB.
 soft-dirty.txt
 	- short explanation for soft-dirty PTEs
 split_page_table_lock
 	- Separate per-table lock to improve scalability of the old page_table_lock.
+swap_numa.txt
+	- automatic binding of swap device to numa node
 transhuge.txt
 	- Transparent Hugepage Support, alternative way of using hugepages.
 unevictable-lru.txt
 	- Unevictable LRU infrastructure
+userfaultfd.txt
+	- description of userfaultfd system call
+z3fold.txt
+	- outline of z3fold allocator for storing compressed pages
+zsmalloc.txt
+	- outline of zsmalloc allocator for storing compressed pages
 zswap.txt
 	- Intro to compressed cache for swap pages
--
2.7.4
Re: [PATCH] README: Improve documentation descriptions
On Fri, 16 Mar 2018 16:57:07 +0100
Martin Kepplinger wrote:

> "This file" indeed was moved once, but at some point "this file", the
> top-level README, becomes a file in itself. Now that time has come :)
>
> Let's describe how things are, and suggest reading "this file" first,
> "this file" simply being the admin-guide README file, not a file that
> was once moved.

OK, sure, applied.  Maybe I'll tack on something pointing to
https://www.kernel.org/doc/html/latest/ as well.

Thanks,

jon
Re: [PATCH 1/5] x86/smpboot: Add the missing description of possible_cpus
At 03/21/2018 05:34 PM, Dou Liyang wrote:
> Hi Peter,
>
> At 03/21/2018 05:08 PM, Peter Zijlstra wrote:
>> On Wed, Mar 21, 2018 at 01:33:24PM +0800, Dou Liyang wrote:
>>> How about:
>>>
>>> 	possible_cpus=	[s390,x86_64] Set the number of possible CPUs which
>>> 			are determined by the ACPI tables MADT or mptables by
>>> 			default. possible_cpus=n : n >= 1 enforces the possible
>>> 			number to be 'n'.
>>> 			While nr_cpus is also be set: nr_cpus=m, choice the
>>> 			minimum one for the number of possible CPUs.
>>
>> So what is the exact difference between possible_cpus and nr_cpus ? I
>
> the possible_cpus= can reset the number of possible CPUs, even bigger
> than 'num_processors + disabled_cpus', But nr_cpus= can't.
        ^^^
   the maximum number kernel gets from ACPI/mptables; no matter what
   number nr_cpus= is, the number of possible CPUs will not be bigger
   than it.

>> know maxcpus= limits the number of CPUs we bring up, and possible_cpus
>> limits the possible_map, but I'm not entirely sure what nr_cpus does
>> here.
>
> nr_cpus can limited the maximum CPUs that the kernel could support.
>
> Here is a double check in case of using them at the same time, even if
> I think just using possible_cpus= is enough. :-)
>
> Thanks,
> 	dou.
Re: [PATCH 1/5] x86/smpboot: Add the missing description of possible_cpus
On Wed, 21 Mar 2018, Peter Zijlstra wrote:
> On Wed, Mar 21, 2018 at 01:33:24PM +0800, Dou Liyang wrote:
> > How about:
> >
> > 	possible_cpus=	[s390,x86_64] Set the number of possible CPUs which
> > 			are determined by the ACPI tables MADT or mptables by
> > 			default. possible_cpus=n : n >= 1 enforces the possible
> > 			number to be 'n'.
> > 			While nr_cpus is also be set: nr_cpus=m, choice the
> > 			minimum one for the number of possible CPUs.
>
> So what is the exact difference between possible_cpus and nr_cpus ? I
> know maxcpus= limits the number of CPUs we bring up, and possible_cpus
> limits the possible_map, but I'm not entirely sure what nr_cpus does
> here.

nr_cpus limits the number of CPUs the kernel will handle. Think of it as a
boot time override of NR_CPUS.

Way too many commandline switches though.

Thanks,

	tglx
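Editorial sketch, not part of the thread: read together, the proposed documentation text ("choose the minimum one") and the description of nr_cpus= as a boot-time cap suggest roughly the following combination rule. This is a toy model of what the messages say, not the kernel's actual code. The function name effective_possible(), the firmware_cpus input and the convention that 0 means "option not given" are all assumptions made here, and the compile-time NR_CPUS limit is ignored.

```c
#include <assert.h>

/*
 * firmware_cpus: CPU count derived from ACPI MADT / mptables
 * possible_cpus: value of possible_cpus= (0 if the option is absent)
 * nr_cpus:       value of nr_cpus=       (0 if the option is absent)
 */
static unsigned int effective_possible(unsigned int firmware_cpus,
				       unsigned int possible_cpus,
				       unsigned int nr_cpus)
{
	unsigned int possible;

	/* Firmware is expected to report at least the boot CPU. */
	assert(firmware_cpus > 0);

	/* possible_cpus= overrides the firmware-derived default ... */
	possible = possible_cpus ? possible_cpus : firmware_cpus;

	/* ... and nr_cpus=, when given, caps whatever was chosen. */
	if (nr_cpus && possible > nr_cpus)
		possible = nr_cpus;

	return possible;
}
```

Under this model, possible_cpus=16 on an 8-CPU machine yields 16 possible CPUs, while adding nr_cpus=4 brings the result down to 4. maxcpus= is deliberately left out, since the thread describes it as limiting how many CPUs are brought up rather than how many are possible.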
Re: [PATCH 1/5] x86/smpboot: Add the missing description of possible_cpus
Hi Peter,

At 03/21/2018 05:08 PM, Peter Zijlstra wrote:
> On Wed, Mar 21, 2018 at 01:33:24PM +0800, Dou Liyang wrote:
>> How about:
>>
>> 	possible_cpus=	[s390,x86_64] Set the number of possible CPUs which
>> 			are determined by the ACPI tables MADT or mptables by
>> 			default. possible_cpus=n : n >= 1 enforces the possible
>> 			number to be 'n'.
>> 			While nr_cpus is also be set: nr_cpus=m, choice the
>> 			minimum one for the number of possible CPUs.
>
> So what is the exact difference between possible_cpus and nr_cpus ? I

possible_cpus= can reset the number of possible CPUs, even to a value
bigger than 'num_processors + disabled_cpus', but nr_cpus= can't.

> know maxcpus= limits the number of CPUs we bring up, and possible_cpus
> limits the possible_map, but I'm not entirely sure what nr_cpus does
> here.

nr_cpus can limit the maximum number of CPUs that the kernel could support.

Here is a double check in case of using them at the same time, even if
I think just using possible_cpus= is enough. :-)

Thanks,
	dou.
Re: [PATCH 1/5] x86/smpboot: Add the missing description of possible_cpus
On Wed, Mar 21, 2018 at 01:33:24PM +0800, Dou Liyang wrote:
> How about:
>
> 	possible_cpus=	[s390,x86_64] Set the number of possible CPUs which
> 			are determined by the ACPI tables MADT or mptables by
> 			default. possible_cpus=n : n >= 1 enforces the possible
> 			number to be 'n'.
> 			While nr_cpus is also be set: nr_cpus=m, choice the
> 			minimum one for the number of possible CPUs.

So what is the exact difference between possible_cpus and nr_cpus ? I
know maxcpus= limits the number of CPUs we bring up, and possible_cpus
limits the possible_map, but I'm not entirely sure what nr_cpus does
here.
Re: [PATCH v4 5/6] arm64: dts: sdm845: Add serial console support
On 3/21/2018 1:09 AM, Stephen Boyd wrote:
> Quoting Karthikeyan Ramasubramanian (2018-03-14 16:58:50)
>> diff --git a/arch/arm64/boot/dts/qcom/sdm845-mtp.dts b/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
>> index 979ab49..ea3efc5 100644
>> --- a/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
>> +++ b/arch/arm64/boot/dts/qcom/sdm845-mtp.dts
>> @@ -12,4 +12,43 @@
>>  / {
>>  	model = "Qualcomm Technologies, Inc. SDM845 MTP";
>>  	compatible = "qcom,sdm845-mtp";
>> +
>> +	aliases {
>> +		serial0 =
>> +	};
>> +
>> +	chosen {
>> +		stdout-path = "serial0";
>
> Also add :115200n8 ?
>
>> +	};
>> +};
>> +
>> + {
>
> I think the method is to put these inside soc node without using the
> phandle reference. So indent everything once more.

Some of this was discussed in the previous versions [1] and we arrived
at a consensus to follow this way of doing it. Bjorn also said he was
going to do a series to move all the existing dts files to follow a
similar convention so it's all consistent.

[1] https://lkml.org/lkml/2018/2/6/676

>> +	geniqup@ac {
>> +		serial@a84000 {
>> +			status = "okay";
>> +		};
>> +	};
>> +
>> +	pinctrl@340 {
>> +		qup-uart2-default {
>> +			pinconf_tx {
>> +				pins = "gpio4";
>> +				drive-strength = <2>;
>> +				bias-disable;
>> +			};
>> +
>> +			pinconf_rx {
>> +				pins = "gpio5";
>> +				drive-strength = <2>;
>> +				bias-pull-up;
>> +			};
>> +		};
>> +
>> +		qup-uart2-sleep {
>> +			pinconf {
>> +				pins = "gpio4", "gpio5";
>> +				bias-pull-down;
>> +			};
>> +		};
>> +	};
>>  };
>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> index 32f8561..59334d9 100644
>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> @@ -6,6 +6,7 @@
>>   */
>>
>>  #include
>> +#include
>>
>>  / {
>>  	interrupt-parent = <>;
>> @@ -194,6 +195,20 @@
>>  			#gpio-cells = <2>;
>>  			interrupt-controller;
>>  			#interrupt-cells = <2>;
>> +
>> +			qup_uart2_default: qup-uart2-default {
>> +				pinmux {
>> +					function = "qup9";
>> +					pins = "gpio4", "gpio5";
>> +				};
>> +			};
>> +
>> +			qup_uart2_sleep: qup-uart2-sleep {
>> +				pinmux {
>> +					function = "gpio";
>> +					pins = "gpio4", "gpio5";
>> +				};
>> +			};
>
> Are these supposed to go to the board file?

Again, this was discussed in the previous versions, and we decided it
makes sense to have the pinmux (default), which rarely changes across
boards, in the SoC file, and have boards specify the pinconf
(electrical) properties. And get rid of all the
soc-pins/board-pins/pmic-pins files.

https://lkml.org/lkml/2018/2/6/693

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH V2 7/9] hwmon: pwm-fan: add sysfs node to read rpm of fan
On 03/20/2018 09:40 PM, Rajkumar Rampelli wrote:
> Add fan device attribute fan1_input in pwm-fan driver
> to read speed of fan in rotations per minute.
>
> Signed-off-by: Rajkumar Rampelli
> ---
>
> V2: Removed generic-pwm-tachometer driver and using pwm-fan driver as per
>     suggestions to read fan speed.
>     Added fan device attribute to report speed of fan in rpms through
>     hwmon sysfs.
>
>  drivers/hwmon/pwm-fan.c | 23 +++
>  1 file changed, 23 insertions(+)
>
> diff --git a/drivers/hwmon/pwm-fan.c b/drivers/hwmon/pwm-fan.c
> index 70cc0d1..8dda209 100644
> --- a/drivers/hwmon/pwm-fan.c
> +++ b/drivers/hwmon/pwm-fan.c
> @@ -98,11 +98,34 @@ static ssize_t show_pwm(struct device *dev,
>  	return sprintf(buf, "%u\n", ctx->pwm_value);
>  }
>
> +static ssize_t show_rpm(struct device *dev, struct device_attribute *attr,
> +			char *buf)
> +{
> +	struct pwm_fan_ctx *ptt = dev_get_drvdata(dev);
> +	struct pwm_device *pwm = ptt->pwm;
> +	struct pwm_capture result;
> +	unsigned int rpm = 0;
> +	int ret;
> +
> +	ret = pwm_capture(pwm, , 0);
> +	if (ret < 0) {
> +		pr_err("Failed to capture PWM: %d\n", ret);
> +		return ret;
> +	}
> +
> +	if (result.period)
> +		rpm = DIV_ROUND_CLOSEST_ULL(60ULL * NSEC_PER_SEC,
> +					    result.period);
> +
> +	return sprintf(buf, "%u\n", rpm);
> +}
>
>  static SENSOR_DEVICE_ATTR(pwm1, S_IRUGO | S_IWUSR, show_pwm, set_pwm, 0);
> +static SENSOR_DEVICE_ATTR(fan1_input, 0444, show_rpm, NULL, 0);
>
>  static struct attribute *pwm_fan_attrs[] = {
>  	_dev_attr_pwm1.dev_attr.attr,
> +	_dev_attr_fan1_input.dev_attr.attr,

This doesn't make sense. The same pwm can not both control the fan speed
and report it.

Guenter

>  	NULL,
>  };
>
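Editorial sketch, not part of the thread: the arithmetic in the quoted show_rpm() converts a captured period in nanoseconds into revolutions per minute by dividing one minute's worth of nanoseconds by the period, rounded to the nearest integer. A userspace rendering of that formula might look like the following. The name period_ns_to_rpm() is made up here, the rounding is written out by hand instead of using the kernel's DIV_ROUND_CLOSEST_ULL(), and one captured period per revolution is assumed, as the quoted code appears to do.

```c
#include <stdint.h>

/* One minute expressed in nanoseconds. */
#define NSEC_PER_MIN (60ULL * 1000000000ULL)

/*
 * Convert the period of a tachometer signal (in ns) to RPM, rounding to
 * the nearest integer. A zero period means nothing was captured and is
 * reported as 0 RPM, mirroring the "if (result.period)" guard in the
 * quoted patch.
 */
static uint64_t period_ns_to_rpm(uint64_t period_ns)
{
	if (period_ns == 0)
		return 0;

	return (NSEC_PER_MIN + period_ns / 2) / period_ns;
}
```

For instance, a 20 ms period corresponds to 3000 RPM. The sketch says nothing about Guenter's objection, which concerns where the measurement comes from rather than how it is converted.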