[tip:perf/core] kprobes: Don't call BUG_ON() if there is a kprobe in use on free list
Commit-ID:  cbdd96f5586151e48317d90a403941ec23f12660
Gitweb:     https://git.kernel.org/tip/cbdd96f5586151e48317d90a403941ec23f12660
Author:     Masami Hiramatsu
AuthorDate: Tue, 11 Sep 2018 19:21:09 +0900
Committer:  Ingo Molnar
CommitDate: Wed, 12 Sep 2018 08:01:16 +0200

kprobes: Don't call BUG_ON() if there is a kprobe in use on free list

Instead of calling BUG_ON(), if we find a kprobe in use on the free
kprobe list, just remove it from the list and keep it on the kprobe
hash list, the same as other in-use kprobes.

Signed-off-by: Masami Hiramatsu
Cc: Anil S Keshavamurthy
Cc: David S . Miller
Cc: Linus Torvalds
Cc: Naveen N . Rao
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/153666126882.21306.10738207224288507996.stgit@devbox
Signed-off-by: Ingo Molnar
---
 kernel/kprobes.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 63c342e5e6c3..90e98e233647 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -546,8 +546,14 @@ static void do_free_cleaned_kprobes(void)
 	struct optimized_kprobe *op, *tmp;
 
 	list_for_each_entry_safe(op, tmp, &freeing_list, list) {
-		BUG_ON(!kprobe_unused(&op->kp));
 		list_del_init(&op->list);
+		if (WARN_ON_ONCE(!kprobe_unused(&op->kp))) {
+			/*
+			 * This must not happen, but if there is a kprobe
+			 * still in use, keep it on kprobes hash list.
+			 */
+			continue;
+		}
 		free_aggr_kprobe(&op->kp);
 	}
 }
[tip:perf/core] kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()
Commit-ID:  819319fc93461c07b9cdb3064f154bd8cfd48172
Gitweb:     https://git.kernel.org/tip/819319fc93461c07b9cdb3064f154bd8cfd48172
Author:     Masami Hiramatsu
AuthorDate: Tue, 11 Sep 2018 19:20:40 +0900
Committer:  Ingo Molnar
CommitDate: Wed, 12 Sep 2018 08:01:16 +0200

kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()

Make reuse_unused_kprobe() return an error code if it fails to reuse
an unused kprobe for an optprobe, instead of calling BUG_ON().

Signed-off-by: Masami Hiramatsu
Cc: Anil S Keshavamurthy
Cc: David S . Miller
Cc: Linus Torvalds
Cc: Naveen N . Rao
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/153666124040.21306.14150398706331307654.stgit@devbox
Signed-off-by: Ingo Molnar
---
 kernel/kprobes.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 277a6cbe83db..63c342e5e6c3 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -700,9 +700,10 @@ static void unoptimize_kprobe(struct kprobe *p, bool force)
 }
 
 /* Cancel unoptimizing for reusing */
-static void reuse_unused_kprobe(struct kprobe *ap)
+static int reuse_unused_kprobe(struct kprobe *ap)
 {
 	struct optimized_kprobe *op;
+	int ret;
 
 	/*
 	 * Unused kprobe MUST be on the way of delayed unoptimizing (means
@@ -713,8 +714,12 @@ static void reuse_unused_kprobe(struct kprobe *ap)
 	/* Enable the probe again */
 	ap->flags &= ~KPROBE_FLAG_DISABLED;
 	/* Optimize it again (remove from op->list) */
-	BUG_ON(!kprobe_optready(ap));
+	ret = kprobe_optready(ap);
+	if (ret)
+		return ret;
+
 	optimize_kprobe(ap);
+	return 0;
 }
 
 /* Remove optimized instructions */
@@ -939,11 +944,16 @@ static void __disarm_kprobe(struct kprobe *p, bool reopt)
 #define kprobe_disarmed(p)			kprobe_disabled(p)
 #define wait_for_kprobe_optimizer()		do {} while (0)
 
-/* There should be no unused kprobes can be reused without optimization */
-static void reuse_unused_kprobe(struct kprobe *ap)
+static int reuse_unused_kprobe(struct kprobe *ap)
 {
+	/*
+	 * If the optimized kprobe is NOT supported, the aggr kprobe is
+	 * released at the same time that the last aggregated kprobe is
+	 * unregistered.
+	 * Thus there should be no chance to reuse unused kprobe.
+	 */
 	printk(KERN_ERR "Error: There should be no unused kprobe here.\n");
-	BUG_ON(kprobe_unused(ap));
+	return -EINVAL;
 }
 
 static void free_aggr_kprobe(struct kprobe *p)
@@ -1315,9 +1325,12 @@ static int register_aggr_kprobe(struct kprobe *orig_p, struct kprobe *p)
 			goto out;
 		}
 		init_aggr_kprobe(ap, orig_p);
-	} else if (kprobe_unused(ap))
+	} else if (kprobe_unused(ap)) {
 		/* This probe is going to die. Rescue it */
-		reuse_unused_kprobe(ap);
+		ret = reuse_unused_kprobe(ap);
+		if (ret)
+			goto out;
+	}
 
 	if (kprobe_gone(ap)) {
 		/*
Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
* Mel Gorman [2018-09-10 10:41:47]:

> On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > Srikar's patch here:
> > >
> > >   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-sri...@linux.vnet.ibm.com
> > >
> > > Also frobs this condition, but in a less radical way. Does that yield
> > > similar results?
> >
> > I can check. I do wonder of course if the less radical approach just means
> > that automatic NUMA balancing and the load balancer simply disagree about
> > placement at a different time. It'll take a few days to have an answer as
> > the battery of workloads to check this take ages.
> >
> Tests completed over the weekend and I've found that the performance of
> both patches are very similar for two machines (both 2 socket) running a
> variety of workloads. Hence, I'm not worried about which patch gets picked
> up. However, I would prefer my own on the grounds that the additional
> complexity does not appear to get us anything. Of course, that changes if
> Srikar's tests on his larger ppc64 machines show the more complex approach
> is justified.

Running SPECjbb2005. Higher bops are better.

Kernel A = 4.18 + 13 sched patches that are part of v4.19-rc1.
Kernel B = Kernel A + 6 patches
	(http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-sri...@linux.vnet.ibm.com)
Kernel C = Kernel B - (Avoid task migration for small NUMA improvement), i.e.
	http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-sri...@linux.vnet.ibm.com
	+ 2 patches from Mel:
	(Do not move imbalanced load purely)
	http://lore.kernel.org/lkml/20180907101139.20760-5-mgor...@techsingularity.net
	(Stop comparing tasks for NUMA placement)
	http://lore.kernel.org/lkml/20180907101139.20760-4-mgor...@techsingularity.net

To me, Kernel B, which is the 13 patches accepted in v4.19-rc1 + the 6
patches posted for review, seems to be giving better performance.
The numbers are compared to the previous kernel, i.e. for Kernel A, v4.18
is prev; for Kernel B, Kernel A is prev; for Kernel C, B is prev.

2 node x86 Haswell
v4.18 or 94710cac0ef4
JVMS  Prev     Current  %Change
4              203769
1              316734

Kernel A
JVMS  Prev     Current  %Change
4     203769   209790   2.95482
1     316734   312377   -1.3756

Kernel B
JVMS  Prev     Current  %Change
4     209790   202059   -3.68511
1     312377   326987   4.67704

Kernel C
JVMS  Prev     Current  %Change
4     202059   200681   -0.681979
1     326987   316715   -3.14141

4 Node / 2 Socket PowerNV / Power 8
v4.18 or 94710cac0ef4
JVMS  Prev     Current  %Change
8              88411.9
1              222075

Kernel A
JVMS  Prev     Current  %Change
8     88411.9  88733.5  0.363752
1     222075   214607   -3.36283

Kernel B
JVMS  Prev     Current  %Change
8     88733.5  89952    1.37321
1     214607   217226   1.22037

Kernel C
JVMS  Prev     Current  %Change
8     89952    89912.9  -0.0434676
1     217226   219281   0.946019

2 Node / 2 Socket Power 9 / PowerNV
v4.18 or 94710cac0ef4
JVMS  Prev     Current  %Change
4              195989
1              202854

Kernel A
JVMS  Prev     Current  %Change
4     195989   193108   -1.46998
1     202854   204042   0.585643

Kernel B
JVMS  Prev     Current  %Change
4     193108   196422   1.71614
1     204042   211219   3.51741

Kernel C
JVMS  Prev     Current  %Change
4     196422   195052   -0.697478
1     211219   207854   -1.59313

4 Node / 4 Socket Power 7 PhyP LPAR
v4.18 or 94710cac0ef4
JVMS  Prev     Current  %Change
8              52826.9
1              103103

Kernel A
JVMS  Prev     Current  %Change
8     52826.9  59504.4  12.6403
1     103103   102542   -0.544116

Kernel B
JVMS  Prev     Current  %Change
8     59504.4  61674.8  3.64746
1     102542   108211   5.52847

Kernel C
JVMS  Prev     Current  %Change
8     61674.8  57946.5  -6.04509
1     108211   104533   -3.39892
Re: [RFC 0/4] perf: Per PMU access controls (paranoid setting)
Hi,

Are there any plans, or maybe even progress, on that so far?

Thanks,
Alexey

On 26.06.2018 18:36, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin
>
> For situations where sysadmins might want to allow different level of
> access control for different PMUs, we start creating per-PMU
> perf_event_paranoid controls in sysfs.
>
> These work in equivalent fashion as the existing perf_event_paranoid
> sysctl, which now becomes the parent control for each PMU.
>
> On PMU registration the global/parent value will be inherited by each PMU,
> as it will be propagated to all registered PMUs when the sysctl is
> updated.
>
> At any later point individual PMU access controls, located in
> /device//perf_event_paranoid, can be adjusted to achieve
> fine grained access control.
>
> Discussion from previous posting:
> https://lkml.org/lkml/2018/5/21/156
>
> Cc: Thomas Gleixner
> Cc: Peter Zijlstra
> Cc: Ingo Molnar
> Cc: "H. Peter Anvin"
> Cc: Arnaldo Carvalho de Melo
> Cc: Alexander Shishkin
> Cc: Jiri Olsa
> Cc: Namhyung Kim
> Cc: Madhavan Srinivasan
> Cc: Andi Kleen
> Cc: Alexey Budankov
> Cc: linux-kernel@vger.kernel.org
> Cc: x...@kernel.org
>
> Tvrtko Ursulin (4):
>   perf: Move some access checks later in perf_event_open
>   perf: Pass pmu pointer to perf_paranoid_* helpers
>   perf: Allow per PMU access control
>   perf Documentation: Document the per PMU perf_event_paranoid interface
>
>  .../sysfs-bus-event_source-devices-events |  14 +++
>  arch/powerpc/perf/core-book3s.c           |   2 +-
>  arch/x86/events/intel/bts.c               |   2 +-
>  arch/x86/events/intel/core.c              |   2 +-
>  arch/x86/events/intel/p4.c                |   2 +-
>  include/linux/perf_event.h                |  18 ++-
>  kernel/events/core.c                      | 104 +++---
>  kernel/sysctl.c                           |   4 +-
>  kernel/trace/trace_event_perf.c           |   6 +-
>  9 files changed, 123 insertions(+), 31 deletions(-)
>
Re: [PATCH] printk: inject caller information into the body of message
On (09/10/18 13:20), Alexander Potapenko wrote:
> > Awesome. If you and Fengguang can combine forces and lead the
> > whole thing towards "we couldn't care of pr_cont() less", it
> > would be really huge. Go for it!
> Sorry, folks, am I understanding right that pr_cont() and flushing the
> buffer on "\n" are two separate problems that can be handled outside
> Tetsuo's patchset, just assuming pr_cont() is unsupported?
> Or should the pr_cont() cleanup be a prerequisite for that?

Oh... Sorry. I'm quite overloaded at the moment and simply forgot about
this thread.

So what exactly is our problem with pr_cont() -- it's not SMP friendly.
And this leads to various things, the most annoying of which is a
preliminary flush. E.g. let me do a simple thing on my box:

	ps aux | grep firefox
	kill 2727
	dmesg | tail

[  554.098341] Chrome_~dThread[2823]: segfault at 0 ip 7f5df153a1f3 sp 7f5ded47ab00 error 6 in libxul.so[7f5df1531000+4b01000]
[  554.098348] Code: e7 04 48 8d 15 a6 94 ae 03 48 89 10 c7 04 25 00 00 00 00 00 00 00 00 0f 0b 48 8b 05 57 d0 e7 04 48 8d 0d b0 94 ae 03 48 89 08 04 25 00 00 00 00 00 00 00 00 0f 0b e8 4d f4 ff ff 48 8b 05 34
[  554.109418] Chrome_~dThread[3047]: segfault at 0 ip 7f3d5bdba1f3 sp 7f3d57cfab00 error 6
[  554.109421] Chrome_~dThread[3077]: segfault at 0 ip 7fe773f661f3 sp 7fe76fea6b00 error 6
[  554.109424] in libxul.so[7f3d5bdb1000+4b01000]
[  554.109426] in libxul.so[7fe773f5d000+4b01000]
[  554.109429] Code: e7 04 48 8d 15 a6 94 ae 03 48 89 10 c7 04 25 00 00 00 00 00 00 00 00 0f 0b 48 8b 05 57 d0 e7 04 48 8d 0d b0 94 ae 03 48 89 08 04 25 00 00 00 00 00 00 00 00 0f 0b e8 4d f4 ff ff 48 8b 05 34

Even such a simple thing as "printk several lines per crashed process"
is broken. Look at line #0 and lines #2-#5. And this is the only problem
we probably need to address.
Overlapping printk lines -- when several CPUs printk simultaneously, or
the same CPU printk-s from IRQ, etc. -- are here by design and it's not
going to be easy to change that (and maybe we shouldn't try).

Buffering multiple lines in a printk buffer does not look so simple, and
perhaps we should not try to do this either. Why:

- it's hard to decide what to do when the buffer overflows.
  Switching to "normal printk" defeats the reason we do buffering in the
  first place, because "normal printk" permits overlapping. So buffering
  makes little sense if we are OK with switching to a "normal printk".

- the more we buffer the more we can lose in case of panic. We can't
  flush_on_panic() printk buffers which were allocated on the stack.

- flushing multiple lines would be more complex than just a simple printk
  loop:

	while (1) {
		x = memchr(buf, '\n', sz);
		...
		print("%s", buf);
		...
	}

  because a "printk() loop" permits lines to overlap. Hence buffering
  makes little sense, once again.

So let's reduce the problem scope to "we want to have a replacement for
pr_cont()". And let's address pr_cont()'s "preliminary flush" issue only.

I scanned some of Linus' emails, and skimmed through previous discussions
on this topic. Let me quote Linus:

:
: My preference as a user is actually to just have a dynamically
: re-sizable buffer (that's pretty much what I've done in *every* single
: user space project I've had in the last decade), but because some
: users might have atomicity issues I do suspect that we should just use
: a stack buffer.
:
: And then perhaps say that the buffer size has to be capped at 80 characters.
:
: Because if you're printing more than 80 characters and expecting it
: all to fit on a line, you're doing something else wrong anyway.
:
: And hide it not as an explicit "char buffer[80]" allocation, but as a
: "struct line_buffer" or similar, so that
:
:  (a) people don't get the line size wrong
:
:  (b) the buffering code can add a few fields for length etc in there too
:
: Introduce a few helper functions for it:
:
:    init_line_buffer(&buf);
:    print_line(&buf, fmt, args);
:    vprint_line(&buf, fmt, vararg);
:    finish_line(&buf);
:

And this is, basically, what I have attached to this email. It's very
simple and very short. And I think this is what Linus wanted us to do.

- usage example

	DEFINE_PR_LINE(KERN_ERR, pl);

	pr_line(&pl, "Hello, %s!\n", "buffer");
	pr_line(&pl, "%s", "OK.\n");
	pr_line(&pl, "Goodbye, %s", "buffer");
	pr_line(&pl, "\n");

  dmesg | tail

	[   69.908542] Hello, buffer!
	[   69.908544] OK.
	[   69.908545] Goodbye, buffer

- pr_cont-like usage

	DEFINE_PR_LINE(KERN_ERR, pl);

	pr_line(&pl, "%d ", 1);
	pr_line(&pl, "%d ", 3);
	pr_line(&pl, "%d ", 5);
	pr_line(&pl, "%d ", 7);
	pr_line(&pl, "%d\n", 9);

  dmesg | tail

	[   69.908546] 1 3 5 7 9

- an explicit, aux buffer

	// output should be truncated
	char buf[16];
	DEFINE_PR_LINE_BUF(KERN_ERR, ps, buf, sizeof(buf));

	pr_line(&ps, "Test test test test test test test test test\n");
[tip:perf/core] kprobes: Remove pointless BUG_ON() from reuse_unused_kprobe()
Commit-ID:  a6d18e65dff2b73ceeb187c598b48898e36ad7b1
Gitweb:     https://git.kernel.org/tip/a6d18e65dff2b73ceeb187c598b48898e36ad7b1
Author:     Masami Hiramatsu
AuthorDate: Tue, 11 Sep 2018 19:20:11 +0900
Committer:  Ingo Molnar
CommitDate: Wed, 12 Sep 2018 08:01:16 +0200

kprobes: Remove pointless BUG_ON() from reuse_unused_kprobe()

Since reuse_unused_kprobe() is called when the given kprobe
is unused, checking it inside again with BUG_ON() is
pointless. Remove it.

Signed-off-by: Masami Hiramatsu
Cc: Anil S Keshavamurthy
Cc: David S . Miller
Cc: Linus Torvalds
Cc: Naveen N . Rao
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/153666121154.21306.17540752948574483565.stgit@devbox
Signed-off-by: Ingo Molnar
---
 kernel/kprobes.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 231569e1e2c8..277a6cbe83db 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -704,7 +704,6 @@ static void reuse_unused_kprobe(struct kprobe *ap)
 {
 	struct optimized_kprobe *op;
 
-	BUG_ON(!kprobe_unused(ap));
 	/*
 	 * Unused kprobe MUST be on the way of delayed unoptimizing (means
 	 * there is still a relative jump) and disabled.
[tip:perf/core] kprobes: Remove pointless BUG_ON() from disarming process
Commit-ID:  d0555fc78fdba5646a460e83bd2d8249c539bb89
Gitweb:     https://git.kernel.org/tip/d0555fc78fdba5646a460e83bd2d8249c539bb89
Author:     Masami Hiramatsu
AuthorDate: Tue, 11 Sep 2018 19:19:14 +0900
Committer:  Ingo Molnar
CommitDate: Wed, 12 Sep 2018 08:01:15 +0200

kprobes: Remove pointless BUG_ON() from disarming process

All aggr_probes at this line are already disarmed by
disable_kprobe() or checked by kprobe_disarmed().
So this BUG_ON() is pointless, remove it.

Signed-off-by: Masami Hiramatsu
Cc: Anil S Keshavamurthy
Cc: David S . Miller
Cc: Linus Torvalds
Cc: Naveen N . Rao
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/153666115463.21306.8799008438116029806.stgit@devbox
Signed-off-by: Ingo Molnar
---
 kernel/kprobes.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index ab257be4d924..d1edd8d5641e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1704,7 +1704,6 @@ noclean:
 	return 0;
 
 disarmed:
-	BUG_ON(!kprobe_disarmed(ap));
 	hlist_del_rcu(&ap->hlist);
 	return 0;
 }
[tip:perf/core] kprobes: Remove pointless BUG_ON() from add_new_kprobe()
Commit-ID:  c72e6742f62d7bb82a77a41ca53940cb8f73e60f
Gitweb:     https://git.kernel.org/tip/c72e6742f62d7bb82a77a41ca53940cb8f73e60f
Author:     Masami Hiramatsu
AuthorDate: Tue, 11 Sep 2018 19:19:43 +0900
Committer:  Ingo Molnar
CommitDate: Wed, 12 Sep 2018 08:01:15 +0200

kprobes: Remove pointless BUG_ON() from add_new_kprobe()

Before calling add_new_kprobe(), aggr_probe's GONE
flag and kprobe GONE flag are cleared. We don't need
to worry about that flag at this point.

Signed-off-by: Masami Hiramatsu
Cc: Anil S Keshavamurthy
Cc: David S . Miller
Cc: Linus Torvalds
Cc: Naveen N . Rao
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/153666118298.21306.4915366706875652652.stgit@devbox
Signed-off-by: Ingo Molnar
---
 kernel/kprobes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index d1edd8d5641e..231569e1e2c8 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1259,8 +1259,6 @@ NOKPROBE_SYMBOL(cleanup_rp_inst);
 
 /* Add the new probe to ap->list */
 static int add_new_kprobe(struct kprobe *ap, struct kprobe *p)
 {
-	BUG_ON(kprobe_gone(ap) || kprobe_gone(p));
-
 	if (p->post_handler)
 		unoptimize_kprobe(ap, true);	/* Fall back to normal kprobe */
[PATCH v2] mm: mprotect: check page dirty when change ptes
Add an extra check on the page dirty bit in change_pte_range(), since
there might be cases where the PTE dirty bit is unset but the page is
actually dirtied. One example is when a huge PMD is split after being
written to: the dirty bit will be set on the compound page, however we
won't have the dirty bit set on each of the small-page PTEs.

I noticed this when debugging with a customized kernel that implemented
userfaultfd write-protect. In that case, the dirty bit will be critical
since that's required for userspace to handle the write-protect page
fault (otherwise it'll get a SIGBUS with a loop of page faults). However
it should still be good even for upstream Linux to cover more scenarios
where we shouldn't need to do extra page faults on the small pages if
the previous huge page is already written, so the dirty bit optimization
path underneath can cover more.

CC: Andrew Morton
CC: Mel Gorman
CC: Khalid Aziz
CC: Thomas Gleixner
CC: "David S. Miller"
CC: Greg Kroah-Hartman
CC: Andi Kleen
CC: Henry Willard
CC: Anshuman Khandual
CC: Andrea Arcangeli
CC: Kirill A. Shutemov
CC: Jerome Glisse
CC: Zi Yan
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Peter Xu
---
v2:
- check the dirty bit when changing PTE entries rather than fixing up
  the dirty bit when splitting the huge page PMD.
- rebase to 4.19-rc3

Instead of keeping this in my local tree, I'm giving it another shot to
see whether this could be acceptable for upstream, since IMHO it should
still benefit the upstream.

Thanks,
---
 mm/mprotect.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6d331620b9e5..5fe752515161 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -115,6 +115,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (preserve_write)
 				ptent = pte_mk_savedwrite(ptent);
 
+			/*
+			 * The extra PageDirty() check will make sure
+			 * we'll capture the dirty page even if the PTE
+			 * dirty bit is unset.  One case is when the
+			 * PTE is split from a huge PMD, in that
+			 * case the dirty flag might only be set on the
+			 * compound page instead of this PTE.
+			 */
+			if (PageDirty(pte_page(ptent)))
+				ptent = pte_mkdirty(ptent);
+
 			/* Avoid taking write faults for known dirty pages */
 			if (dirty_accountable && pte_dirty(ptent) &&
 			    (pte_soft_dirty(ptent) ||
-- 
2.17.1
[PATCH] kernel: prevent submission of creds with higher privileges inside container
From: Xin Lin <18650033...@163.com>

Adversaries often attack the Linux kernel by using
commit_creds(prepare_kernel_cred(0)) to submit a ROOT credential for
the purpose of privilege escalation. For processes inside a Linux
container, the above approach also works, because the container and
the host share the same Linux kernel. Therefore, we enforce a check
in commit_creds() before updating the cred of the caller process. If
the process is inside a container (judging from the namespace IDs)
and tries to submit credentials with higher privileges than current
(judging from the uid, gid, and cap_bset in the new cred), we stop
the modification.

We consider that if a namespace ID of the process is different from
the init namespace ID (enumerated in include/linux/proc_ns.h), the
process is inside a container. And if the uid/gid in the new cred is
smaller, or the cap_bset (capability bounding set) in the new cred is
larger, it may be a privilege escalation operation.

Signed-off-by: Xin Lin <18650033...@163.com>
---
 kernel/cred.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/cred.c b/kernel/cred.c
index ecf0365..826c388 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -19,6 +19,11 @@
 #include
 #include
 #include
+#include
+#include
+#include "../fs/mount.h"
+#include
+#include
 
 #if 0
 #define kdebug(FMT, ...)						\
@@ -425,6 +430,18 @@ int commit_creds(struct cred *new)
 	struct task_struct *task = current;
 	const struct cred *old = task->real_cred;
 
+	if (task->nsproxy->uts_ns->ns.inum != PROC_UTS_INIT_INO ||
+	    task->nsproxy->ipc_ns->ns.inum != PROC_IPC_INIT_INO ||
+	    task->nsproxy->mnt_ns->ns.inum != 0xF000U ||
+	    task->nsproxy->pid_ns_for_children->ns.inum != PROC_PID_INIT_INO ||
+	    task->nsproxy->net_ns->ns.inum != 0xF098U ||
+	    old->user_ns->ns.inum != PROC_USER_INIT_INO ||
+	    task->nsproxy->cgroup_ns->ns.inum != PROC_CGROUP_INIT_INO) {
+		if (new->uid.val < old->uid.val || new->gid.val < old->gid.val
+		    || new->cap_bset.cap[0] > old->cap_bset.cap[0])
+			return 0;
+	}
+
 	kdebug("commit_creds(%p{%d,%d})", new,
 	       atomic_read(&new->usage),
 	       read_cred_subscribers(new));
-- 
2.17.1
[PATCH v3 1/2] dmaengine: doc: Add sections for per descriptor metadata support
Update the provider and client documentation with details about the
metadata support.

Signed-off-by: Peter Ujfalusi
---
 Documentation/driver-api/dmaengine/client.rst | 75 +++++++++++++++++++
 .../driver-api/dmaengine/provider.rst         | 46 ++++++++++++
 2 files changed, 121 insertions(+)

diff --git a/Documentation/driver-api/dmaengine/client.rst b/Documentation/driver-api/dmaengine/client.rst
index fbbb2831f29f..0276ddae7ea2 100644
--- a/Documentation/driver-api/dmaengine/client.rst
+++ b/Documentation/driver-api/dmaengine/client.rst
@@ -151,6 +151,81 @@ The details of these operations are:
    Note that callbacks will always be invoked from the DMA
    engines tasklet, never from interrupt context.
 
+  Optional: per descriptor metadata
+  ---------------------------------
+  DMAengine provides two ways for metadata support.
+
+  DESC_METADATA_CLIENT
+
+    The metadata buffer is allocated/provided by the client driver and it is
+    attached to the descriptor.
+
+  .. code-block:: c
+
+     int dmaengine_desc_attach_metadata(struct dma_async_tx_descriptor *desc,
+					void *data, size_t len);
+
+  DESC_METADATA_ENGINE
+
+    The metadata buffer is allocated/managed by the DMA driver. The client
+    driver can ask for the pointer, maximum size and the currently used size
+    of the metadata and can directly update or read it.
+
+  .. code-block:: c
+
+     void *dmaengine_desc_get_metadata_ptr(struct dma_async_tx_descriptor *desc,
+					   size_t *payload_len, size_t *max_len);
+
+     int dmaengine_desc_set_metadata_len(struct dma_async_tx_descriptor *desc,
+					 size_t payload_len);
+
+  Client drivers can query if a given mode is supported with:
+
+  .. code-block:: c
+
+     bool dmaengine_is_metadata_mode_supported(struct dma_chan *chan,
+					       enum dma_desc_metadata_mode mode);
+
+  Depending on the used mode, client drivers must follow different flows.
+
+  DESC_METADATA_CLIENT
+
+    - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM:
+      1. prepare the descriptor (dmaengine_prep_*)
+         construct the metadata in the client's buffer
+      2. use dmaengine_desc_attach_metadata() to attach the buffer to the
+         descriptor
+      3. submit the transfer
+    - DMA_DEV_TO_MEM:
+      1. prepare the descriptor (dmaengine_prep_*)
+      2. use dmaengine_desc_attach_metadata() to attach the buffer to the
+         descriptor
+      3. submit the transfer
+      4. when the transfer is completed, the metadata should be available in
+         the attached buffer
+
+  DESC_METADATA_ENGINE
+
+    - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM:
+      1. prepare the descriptor (dmaengine_prep_*)
+      2. use dmaengine_desc_get_metadata_ptr() to get the pointer to the
+         engine's metadata area
+      3. update the metadata at the pointer
+      4. use dmaengine_desc_set_metadata_len() to tell the DMA engine the
+         amount of data the client has placed into the metadata buffer
+      5. submit the transfer
+    - DMA_DEV_TO_MEM:
+      1. prepare the descriptor (dmaengine_prep_*)
+      2. submit the transfer
+      3. on transfer completion, use dmaengine_desc_get_metadata_ptr() to get
+         the pointer to the engine's metadata area
+      4. read out the metadata from the pointer
+
+  .. note::
+
+     Mixed use of DESC_METADATA_CLIENT / DESC_METADATA_ENGINE is not allowed,
+     client drivers must use either of the modes per descriptor.
+
 4. Submit the transaction
 
    Once the descriptor has been prepared and the callback information
diff --git a/Documentation/driver-api/dmaengine/provider.rst b/Documentation/driver-api/dmaengine/provider.rst
index dfc4486b5743..9e6d87b3c477 100644
--- a/Documentation/driver-api/dmaengine/provider.rst
+++ b/Documentation/driver-api/dmaengine/provider.rst
@@ -247,6 +247,52 @@ after each transfer. In case of a ring buffer, they may loop
 (DMA_CYCLIC). Addresses pointing to a device's register (e.g. a FIFO)
 are typically fixed.
 
+Per descriptor metadata support
+-------------------------------
+Some data movement architectures (DMA controller and peripherals) use metadata
+associated with a transaction. The DMA controller's role is to transfer the
+payload and the metadata alongside.
+The metadata is not used by the DMA engine itself, but it contains
+parameters, keys, vectors, etc. for the peripheral or from the peripheral.
+
+The DMAengine framework provides a generic way to facilitate the metadata for
+descriptors. Depending on the architecture the DMA driver can implement either
+or both of the methods and it is up to the client driver to choose which one
+to use.
+
+- DESC_METADATA_CLIENT
+
+  The metadata buffer is allocated/provided by the client driver and it is
+  attached (via the dmaengine_desc_attach_metadata() helper) to the descriptor.
+
+  From the DMA driver the following is expected for this mode:
+  - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM
+    The data from the provided metadata buffer sh
[PATCH v3 0/2] dmaengine: Add per descriptor metadata support
Hi,

Changes since v2:
- EXPORT_SYMBOL_GPL() for the metadata functions
- Added note to Documentation to not mix the two defined metadata modes
- Fixed the typos in Documentation

Changes since v1:
- Move code from header to dmaengine.c
- Fix spelling
- Use BIT() macro for bit definition
- Update both provider and client documentation

Changes since rfc:
- DESC_METADATA_EMBEDDED renamed to DESC_METADATA_ENGINE
- Use flow is added for both CLIENT and ENGINE metadata modes

Some data movement architectures (DMA controller and peripherals) use
metadata associated with a transaction. The DMA controller's role is to
transfer the payload and the metadata alongside. The metadata is not
used by the DMA engine itself, but it contains parameters, keys,
vectors, etc. for the peripheral or from the peripheral.

The DMAengine framework provides a generic way to facilitate the
metadata for descriptors. Depending on the architecture the DMA driver
can implement either or both of the methods and it is up to the client
driver to choose which one to use.

If the DMA supports per descriptor metadata it can implement the attach,
get_ptr/set_len callbacks. Client drivers must only use either attach or
get_ptr/set_len to avoid misconfiguration.

Client drivers can check if a given metadata mode is supported by the
channel during probe time with

	dmaengine_is_metadata_mode_supported(chan, DESC_METADATA_CLIENT);
	dmaengine_is_metadata_mode_supported(chan, DESC_METADATA_ENGINE);

and based on this information can use either mode.

Wrappers are also added for the metadata_ops.

To be used in DESC_METADATA_CLIENT mode:
	dmaengine_desc_attach_metadata()

To be used in DESC_METADATA_ENGINE mode:
	dmaengine_desc_get_metadata_ptr()
	dmaengine_desc_set_metadata_len()

Regards,
Peter
---
Peter Ujfalusi (2):
  dmaengine: doc: Add sections for per descriptor metadata support
  dmaengine: Add metadata_ops for dma_async_tx_descriptor

 Documentation/driver-api/dmaengine/client.rst |  75 ++++++++++++
 .../driver-api/dmaengine/provider.rst         |  46 ++++++++
 drivers/dma/dmaengine.c                       |  73 ++++++++++++
 include/linux/dmaengine.h                     | 108 ++++++++++++++++++
 4 files changed, 302 insertions(+)

-- 
Peter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki
[PATCH v3 2/2] dmaengine: Add metadata_ops for dma_async_tx_descriptor
The metadata is best described as side-band data or parameters traveling
alongside the data DMAd by the DMA engine. It is data which is understood
by the peripheral and the peripheral driver only; the DMA engine sees it
only as a data block and does not interpret it in any way.

The metadata can be different per descriptor, as it is a parameter for
the data being transferred.

If the DMA supports per descriptor metadata it can implement the attach,
get_ptr/set_len callbacks. Client drivers must only use either attach or
get_ptr/set_len to avoid misconfiguration.

Client drivers can check if a given metadata mode is supported by the
channel during probe time with

	dmaengine_is_metadata_mode_supported(chan, DESC_METADATA_CLIENT);
	dmaengine_is_metadata_mode_supported(chan, DESC_METADATA_ENGINE);

and based on this information can use either mode.

Wrappers are also added for the metadata_ops.

To be used in DESC_METADATA_CLIENT mode:
	dmaengine_desc_attach_metadata()

To be used in DESC_METADATA_ENGINE mode:
	dmaengine_desc_get_metadata_ptr()
	dmaengine_desc_set_metadata_len()

Signed-off-by: Peter Ujfalusi
---
 drivers/dma/dmaengine.c   |  73 ++++++++++++++++
 include/linux/dmaengine.h | 108 ++++++++++++++++++++++
 2 files changed, 181 insertions(+)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index f1a441ab395d..27b6d7c2d8a0 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -1306,6 +1306,79 @@ void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
 }
 EXPORT_SYMBOL(dma_async_tx_descriptor_init);
 
+static inline int desc_check_and_set_metadata_mode(
+	struct dma_async_tx_descriptor *desc, enum dma_desc_metadata_mode mode)
+{
+	/* Make sure that the metadata mode is not mixed */
+	if (!desc->desc_metadata_mode) {
+		if (dmaengine_is_metadata_mode_supported(desc->chan, mode))
+			desc->desc_metadata_mode = mode;
+		else
+			return -ENOTSUPP;
+	} else if (desc->desc_metadata_mode != mode) {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int dmaengine_desc_attach_metadata(struct dma_async_tx_descriptor *desc,
+				   void *data, size_t len)
+{
+	int ret;
+
+	if (!desc)
+		return -EINVAL;
+
+	ret = desc_check_and_set_metadata_mode(desc, DESC_METADATA_CLIENT);
+	if (ret)
+		return ret;
+
+	if (!desc->metadata_ops || !desc->metadata_ops->attach)
+		return -ENOTSUPP;
+
+	return desc->metadata_ops->attach(desc, data, len);
+}
+EXPORT_SYMBOL_GPL(dmaengine_desc_attach_metadata);
+
+void *dmaengine_desc_get_metadata_ptr(struct dma_async_tx_descriptor *desc,
+				      size_t *payload_len, size_t *max_len)
+{
+	int ret;
+
+	if (!desc)
+		return ERR_PTR(-EINVAL);
+
+	ret = desc_check_and_set_metadata_mode(desc, DESC_METADATA_ENGINE);
+	if (ret)
+		return ERR_PTR(ret);
+
+	if (!desc->metadata_ops || !desc->metadata_ops->get_ptr)
+		return ERR_PTR(-ENOTSUPP);
+
+	return desc->metadata_ops->get_ptr(desc, payload_len, max_len);
+}
+EXPORT_SYMBOL_GPL(dmaengine_desc_get_metadata_ptr);
+
+int dmaengine_desc_set_metadata_len(struct dma_async_tx_descriptor *desc,
+				    size_t payload_len)
+{
+	int ret;
+
+	if (!desc)
+		return -EINVAL;
+
+	ret = desc_check_and_set_metadata_mode(desc, DESC_METADATA_ENGINE);
+	if (ret)
+		return ret;
+
+	if (!desc->metadata_ops || !desc->metadata_ops->set_len)
+		return -ENOTSUPP;
+
+	return desc->metadata_ops->set_len(desc, payload_len);
+}
+EXPORT_SYMBOL_GPL(dmaengine_desc_set_metadata_len);
+
 /* dma_wait_for_async_tx - spin wait for a transaction to complete
  * @tx: in-flight transaction to wait on
  */
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 3db833a8c542..10ff71b13eef 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -231,6 +231,58 @@ typedef struct { DECLARE_BITMAP(bits, DMA_TX_TYPE_END); } dma_cap_mask_t;
  * @bytes_transferred: byte counter
  */
 
+/**
+ * enum dma_desc_metadata_mode - per descriptor metadata mode types supported
+ * @DESC_METADATA_CLIENT - the metadata buffer is allocated/provided by the
+ *  client driver and it is attached (via the dmaengine_desc_attach_metadata()
+ *  helper) to the descriptor.
+ *
+ * Client drivers interested to use this mode can follow:
+ * - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM:
+ *   1. prepare the descriptor (dmaengine_prep_*)
+ *      construct the metadata in the client's buffer
+ *   2. use dmaengine_desc_attach_metadata() to attach the buffer to the
+ *      descriptor
+ *   3. submit the transfer
+ * - DMA_DEV_TO_MEM:
+ *   1. prepare the descriptor (dmaengine_prep_*)
+ *   2. use dmaengine_desc_attach_
[PATCH 2/2] kbuild: remove dead code in cmd_files calculation in top Makefile
Nobody sets 'targets' in the top-level Makefile or arch/*/Makefile, hence $(targets) is empty. $(wildcard .*.cmd) will do for including the .vmlinux.cmd file. Signed-off-by: Masahiro Yamada --- Makefile | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/Makefile b/Makefile index 4b76e22..8f6dbfc 100644 --- a/Makefile +++ b/Makefile @@ -1721,8 +1721,7 @@ cmd_crmodverdir = $(Q)mkdir -p $(MODVERDIR) \ $(if $(KBUILD_MODULES),; rm -f $(MODVERDIR)/*) # read all saved command lines - -cmd_files := $(wildcard .*.cmd $(foreach f,$(sort $(targets)),$(dir $(f)).$(notdir $(f)).cmd)) +cmd_files := $(wildcard .*.cmd) ifneq ($(cmd_files),) $(cmd_files): ; # Do not try to update included dependency files -- 2.7.4
[PATCH 1/2] kbuild: hide most of targets when running config or mixed targets
When mixed/config targets are being processed, the top Makefile does not need to parse the rest of targets. Signed-off-by: Masahiro Yamada --- Makefile | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/Makefile b/Makefile index 4d5c883..4b76e22 100644 --- a/Makefile +++ b/Makefile @@ -1615,9 +1615,6 @@ namespacecheck: export_report: $(PERL) $(srctree)/scripts/export_report.pl -endif #ifeq ($(config-targets),1) -endif #ifeq ($(mixed-targets),1) - PHONY += checkstack kernelrelease kernelversion image_name # UML needs a little special treatment here. It wants to use the host @@ -1732,6 +1729,8 @@ ifneq ($(cmd_files),) include $(cmd_files) endif +endif # ifeq ($(config-targets),1) +endif # ifeq ($(mixed-targets),1) endif # skip-makefile PHONY += FORCE -- 2.7.4
Re: [REGRESSION] Errors at reboot after 722e5f2b1eec
On Tue, Sep 11, 2018 at 5:37 PM Greg Kroah-Hartman wrote: > > On Tue, Sep 11, 2018 at 10:17:44AM +0200, Takashi Iwai wrote: > > [ seems like my previous post didn't go out properly; if you have > > already received it, please discard this one ] > > Sorry, I got it, it's just in my large queue :( > > > Hi Rafael, Greg, > > > > James Wang reported on SUSE bugzilla that his machine spews many > > AMD-Vi errors at reboot like: > > > > [ 154.907879] systemd-shutdown[1]: Detaching loop devices. > > [ 154.954583] kvm: exiting hardware virtualization > > [ 154.53] usb 5-2: USB disconnect, device number 2 > > [ 155.025278] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.081360] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.136778] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.191772] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.247055] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.302614] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.358996] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.392155] usb 4-2: new full-speed USB device number 2 using ohci-pci > > [ 155.413752] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.413762] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.560307] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.616039] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > 
> [ 155.667843] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.719497] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.772697] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.823919] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.875490] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.927258] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 155.979318] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 156.031813] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 156.084293] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 > > domain=0x0006 address=0x0080 flags=0x0020] > > [ 156.272157] reboot: Restarting system > > [ 156.290316] reboot: machine restart > > [...] > > The errors are clearly related with the USB device (a KVM device, > > IIRC), and the errors are not seen if the USB device is disconnected. > > Sounds like the io pgtbl is invalidated before ohci-pci. But I can not figure out why, since it is very late to tear down of iommu, which is after device_shutdown() Cc James, could you try to enable initcall_debug, and paste the shutdown seq with 722e5f2b1eec ("driver core: Partially revert "driver core: correct device's shutdown order"") and without it? Thanks, Pingfan
Re: [PATCH 2/8] regulator: Support ROHM BD71847 power management IC
Hello Lee, Thanks again for the review! I see you did a bunch of them... I really admire your devotion. For me reviewing is hard work. I do appreciate it. So nice to see you're back in the business =) On Tue, Sep 11, 2018 at 02:48:08PM +0100, Lee Jones wrote: > On Wed, 29 Aug 2018, Matti Vaittinen wrote: > > > +static const u8 bd71837_supported_revisions[] = { 0xA2 }; > > +static const u8 bd71847_supported_revisions[] = { 0xA0 }; > > I haven't seen anything like this before. > > Is this really required? > > Especially since you only have 1 of each currently. Valid question. I did ask the same thing myself. The reason why I ended up doing this is simple though. I have no idea what may change if the "chip revision" is changed. I only know that I have chip(s) with these revisions on my table - and I have a data sheet which mentions these revisions. So I can only test that the driver works with these revisions. I however assume there will be new revisions, and I thought that with this approach marking them as supported would only require adding the revision to these arrays. But as you have said - this makes the code slightly more complex - even if I disagree with you regarding how complex is too complex =) The use case and intention of the tables is quite obvious, right? That makes following the loops in probe pretty easy after all... But I won't start arguing with you - let's assume the register interface won't get changed - and if it does, well, let's handle that then. So I'll drop the whole revision check. > > > -static const u8 supported_revisions[] = { 0xA2 /* BD71837 */ }; > > +struct known_revisions { > > + const u8 (*revisions)[]; > > + unsigned int known_revisions; > > I didn't initially know what this was at first glance. > > Please re-name it to show that it is the number of stored revisions. 
This will be fixed as I'll drop the revision check > > +static const struct known_revisions supported_revisions[BD718XX_TYPE_AMNT] > > = { > > + [BD718XX_TYPE_BD71837] = { > > + .revisions = &bd71837_supported_revisions, > > + .known_revisions = ARRAY_SIZE(bd71837_supported_revisions), > > + }, > > + [BD718XX_TYPE_BD71847] = { > > + .revisions = &bd71847_supported_revisions, > > + .known_revisions = ARRAY_SIZE(bd71847_supported_revisions), > > + }, > > +}; > > > > @@ -91,13 +104,19 @@ static int bd71837_i2c_probe(struct i2c_client *i2c, > > { > > struct bd71837 *bd71837; > > int ret, i; > > + const unsigned int *type; > > > > + type = of_device_get_match_data(&i2c->dev); > > + if (!type || *type >= BD718XX_TYPE_AMNT) { > > + dev_err(&i2c->dev, "Bad chip type\n"); > > + return -ENODEV; > > + } > > + > > + bd71837->chip_type = *type; > > + ret = regmap_read(bd71837->regmap, BD718XX_REG_REV, &bd71837->chip_rev); > > + for (i = 0; > > +i < supported_revisions[bd71837->chip_type].known_revisions; i++) > > + if ((*supported_revisions[bd71837->chip_type].revisions)[i] == > > + bd71837->chip_rev) > > break; > > > > + if (i == supported_revisions[bd71837->chip_type].known_revisions) { > > + dev_err(&i2c->dev, "Unrecognized revision 0x%02x\n", > > + bd71837->chip_rev); > > + return -EINVAL; > > } > > This has become a very (too) elaborate way to see if the current > running version is supported. Please find a way to solve this (much) > more succinctly. There are lots of examples of this. I cut out pieces of quoted patch to shorten it to relevant bits. As I said above - I think this is not as bad as you say - it is quite obvious what it does after all. And adding new revision would be just adding new entry to the revision array. But yes, I am not sure if this is needed so I'll drop this. 
Let's work the compatibility issues between revisions only if such ever emerge =) > > In fact, since you are using OF, it is not possible for this driver to > probe with an unsupported device. You can remove the whole lot. > I don't really see how the OF helps me with revisions - as chip revision is not presented in DT. The type of chip of course is. So you're right. Check for invalid chip_type can be dropped. > > +static const unsigned int chip_types[] = { > > + [BD718XX_TYPE_BD71837] = BD718XX_TYPE_BD71837, > > + [BD718XX_TYPE_BD71847] = BD718XX_TYPE_BD71847, > > +}; > > > > static const struct of_device_id bd71837_of_match[] = { > > - { .compatible = "rohm,bd71837", }, > > + { > > + .compatible = "rohm,bd71837", > > + .data = &chip_types[BD718XX_TYPE_BD71837] > > + }, > > + { > > + .compatible = "rohm,bd71847", > > + .data = &chip_types[BD718XX_TYPE_BD71847] > > Again, way too complex. Why not simply: > >.data = (void *)BD718XX_TYPE_BD71847? > Ugh. That's horrible on my eyes. I dislike delivering data in addresses. That's why I rather did array with IDs and used pointer to arra
Re: [PATCH v2 2/3] x86/mm/KASLR: Calculate the actual size of vmemmap region
* Baoquan He wrote: > On 09/11/18 at 08:08pm, Baoquan He wrote: > > On 09/11/18 at 11:28am, Ingo Molnar wrote: > > > Yeah, so proper context is still missing, this paragraph appears to > > > assume from the reader a > > > whole lot of prior knowledge, and this is one of the top comments in > > > kaslr.c so there's nowhere > > > else to go read about the background. > > > > > > For example what is the range of randomization of each region? Assuming > > > the static, > > > non-randomized description in Documentation/x86/x86_64/mm.txt is correct, > > > in what way does > > > KASLR modify that layout? > > Re-read this paragraph, found I missed saying the range for each memory > > region, and in what way KASLR modifies the layout. > > > > > > > All of this is very opaque and not explained very well anywhere that I > > > could find. We need to > > > generate a proper description ASAP. > > > > OK, let me try to give some context with my understanding. And copy the > > static layout of memory regions at below for reference. > > > Here, Documentation/x86/x86_64/mm.txt is correct, and it's the > guideline for us to manipulate the layout of kernel memory regions. > Originally the starting address of each region is aligned to 512GB > so that they are all mapped at the 0-th entry of PGD table in 4-level > page mapping. Since we are so rich to have 120 TB virtual address space, > they are aligned at 1 TB actually. So randomness comes from three parts > mainly: > > 1) The direct mapping region for physical memory. 64 TB are reserved to > cover the maximum physical memory support. However, most of systems only > have much less RAM memory than 64 TB, even much less than 1 TB most of > time. We can take the superfluous to join the randomization. This is > often the biggest part. So i.e. in the non-KASLR case we have this description (from mm.txt):

ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
... unused hole ...
ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
... unused hole ...
                                    vaddr_end for KASLR
fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
...

The problems start here, this map is already *horribly* confusing:

 - we mix size in TB with 'bits'
 - we sometimes mention a size in the description and sometimes not
 - we sometimes list holes by address, sometimes only as an 'unused hole' line

... So how about first cleaning up the memory maps in mm.txt and streamlining them, like this:

ffff880000000000 - ffffc7ffffffffff (=46 bits, 64 TB) direct mapping of all phys. memory (page_offset_base)
ffffc80000000000 - ffffc8ffffffffff (=40 bits, 1 TB) ... unused hole
ffffc90000000000 - ffffe8ffffffffff (=45 bits, 32 TB) vmalloc/ioremap space (vmalloc_base)
ffffe90000000000 - ffffe9ffffffffff (=40 bits, 1 TB) ... unused hole
ffffea0000000000 - ffffeaffffffffff (=40 bits, 1 TB) virtual memory map (vmemmap_base)
ffffeb0000000000 - ffffebffffffffff (=40 bits, 1 TB) ... unused hole
ffffec0000000000 - fffffbffffffffff (=44 bits, 16 TB) KASAN shadow memory
fffffc0000000000 - fffffdffffffffff (=41 bits, 2 TB) ... unused hole
                                    vaddr_end for KASLR
fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
...

Please double check all the calculations and ranges, and I'd suggest doing it for the whole file. Note how I added the global variables describing the base addresses - this makes it very easy to match the pointers in kaslr_regions[] to the static map, to see the intent of kaslr_regions[]. BTW., isn't that 'vaddr_end for KASLR' entry position inaccurate? In the typical case it could very well be that by chance all 3 areas end up being randomized into the first 64 TB region, right? I.e. vaddr_end could be at any 1 TB boundary in the above ranges. I'd suggest leaving out all KASLR from this static mappings table - explain it separately in this file, maybe even create its own memory map. I'll help with the wording. > 2) The hole between memory regions, even though they are only 1 TB. There's a 2 TB hole too. 
> 3) KASAN region takes up 16 TB, while it won't take effect when KASLR is > enabled. This is another big part. Ok. > As you can see, in these three memory regions, the physical memory > mapping region has variable size according to the existing system RAM. > However, the remaining two memory regions have fixed size, vmalloc is 32 > TB, vmemmap is 1 TB. > > With this superfluous address space as well as changing the starting address > of each memory region to be PUD level, namely 1 GB aligned, we can have > thousands of candidate position to locate those three
[PATCH] kbuild: remove old check for CFLAGS use
This check has been here more than a decade since commit 0c53c8e6eb45 ("kbuild: check for wrong use of CFLAGS"). Enough time for migration has passed. Signed-off-by: Masahiro Yamada --- scripts/Makefile.build | 10 -- 1 file changed, 10 deletions(-) diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 5a2d1c9..cb03774 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -36,21 +36,11 @@ subdir-ccflags-y := include scripts/Kbuild.include -# For backward compatibility check that these variables do not change -save-cflags := $(CFLAGS) - # The filename Kbuild has precedence over Makefile kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src)) kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile) include $(kbuild-file) -# If the save-* variables changed error out -ifeq ($(KBUILD_NOPEDANTIC),) -ifneq ("$(save-cflags)","$(CFLAGS)") -$(error CFLAGS was changed in "$(kbuild-file)". Fix it to use ccflags-y) -endif -endif - include scripts/Makefile.lib # Do not include host rules unless needed -- 2.7.4
[PATCH v2] perf test: Add watchpoint test
We don't have perf test available to test watchpoint functionality. Add simple set of tests: - Read only watchpoint - Write only watchpoint - Read / Write watchpoint - Runtime watchpoint modification Ex on powerpc: $ sudo ./perf test 22 22: Watchpoint: 22.1: Read Only Watchpoint: Ok 22.2: Write Only Watchpoint : Ok 22.3: Read / Write Watchpoint : Ok 22.4: Modify Watchpoint : Ok Signed-off-by: Ravi Bangoria Acked-by: Jiri Olsa --- v1 -> v2: - Fix build failure for mips and other archs. - Show debug message if subtest is not supported. tools/perf/tests/Build | 1 + tools/perf/tests/builtin-test.c | 9 ++ tools/perf/tests/tests.h| 3 + tools/perf/tests/wp.c | 229 4 files changed, 242 insertions(+) create mode 100644 tools/perf/tests/wp.c diff --git a/tools/perf/tests/Build b/tools/perf/tests/Build index 6c108fa79ae3..0b2b8305c965 100644 --- a/tools/perf/tests/Build +++ b/tools/perf/tests/Build @@ -21,6 +21,7 @@ perf-y += python-use.o perf-y += bp_signal.o perf-y += bp_signal_overflow.o perf-y += bp_account.o +perf-y += wp.o perf-y += task-exit.o perf-y += sw-clock.o perf-y += mmap-thread-lookup.o diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c index d7a5e1b9aa6f..54ca7d87236f 100644 --- a/tools/perf/tests/builtin-test.c +++ b/tools/perf/tests/builtin-test.c @@ -120,6 +120,15 @@ static struct test generic_tests[] = { .func = test__bp_accounting, .is_supported = test__bp_signal_is_supported, }, + { + .desc = "Watchpoint", + .func = test__wp, + .subtest = { + .skip_if_fail = false, + .get_nr = test__wp_subtest_get_nr, + .get_desc = test__wp_subtest_get_desc, + }, + }, { .desc = "Number of exit events of a simple workload", .func = test__task_exit, diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h index a9760e790563..8e26a4148f30 100644 --- a/tools/perf/tests/tests.h +++ b/tools/perf/tests/tests.h @@ -59,6 +59,9 @@ int test__python_use(struct test *test, int subtest); int test__bp_signal(struct test *test, int subtest); int 
test__bp_signal_overflow(struct test *test, int subtest); int test__bp_accounting(struct test *test, int subtest); +int test__wp(struct test *test, int subtest); +const char *test__wp_subtest_get_desc(int subtest); +int test__wp_subtest_get_nr(void); int test__task_exit(struct test *test, int subtest); int test__mem(struct test *test, int subtest); int test__sw_clock_freq(struct test *test, int subtest); diff --git a/tools/perf/tests/wp.c b/tools/perf/tests/wp.c new file mode 100644 index ..017a99317f94 --- /dev/null +++ b/tools/perf/tests/wp.c @@ -0,0 +1,229 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include "tests.h" +#include "debug.h" +#include "cloexec.h" + +#define WP_TEST_ASSERT_VAL(fd, text, val) \ +do {\ + long long count;\ + wp_read(fd, &count, sizeof(long long)); \ + TEST_ASSERT_VAL(text, count == val);\ +} while (0) + +volatile u64 data1; +volatile u8 data2[3]; + +static int wp_read(int fd, long long *count, int size) +{ + int ret = read(fd, count, size); + + if (ret != size) { + pr_debug("failed to read: %d\n", ret); + return -1; + } + return 0; +} + +static void get__perf_event_attr(struct perf_event_attr *attr, int wp_type, +void *wp_addr, unsigned long wp_len) +{ + memset(attr, 0, sizeof(struct perf_event_attr)); + attr->type = PERF_TYPE_BREAKPOINT; + attr->size = sizeof(struct perf_event_attr); + attr->config = 0; + attr->bp_type= wp_type; + attr->bp_addr= (unsigned long)wp_addr; + attr->bp_len = wp_len; + attr->sample_period = 1; + attr->sample_type= PERF_SAMPLE_IP; + attr->exclude_kernel = 1; + attr->exclude_hv = 1; +} + +static int __event(int wp_type, void *wp_addr, unsigned long wp_len) +{ + int fd; + struct perf_event_attr attr; + + get__perf_event_attr(&attr, wp_type, wp_addr, wp_len); + fd = sys_perf_event_open(&attr, 0, -1, -1, +perf_event_open_cloexec_flag()); + if (fd < 0) + pr_debug("failed opening event %x\n", attr.bp_type); + + return fd; +} + +static int wp_ro_test(void) +{ + int fd; + unsigned long tmp, 
tmp1 = rand(); + + fd = __event(HW_BREAKPOINT_R, (void *)&data1, sizeo
Re: [PATCH] staging: remove unneeded static set .owner field in platform_driver
On Wed, Sep 12, 2018 at 9:22 AM zhong jiang wrote: > > platform_driver_register will set the .owner field. So it is safe > to remove the redundant assignment. > > The issue is detected with the help of Coccinelle. > > Signed-off-by: zhong jiang > --- > drivers/staging/greybus/audio_codec.c| 1 - > drivers/staging/mt7621-eth/gsw_mt7621.c | 1 - > drivers/staging/mt7621-eth/mtk_eth_soc.c | 1 - > 3 files changed, 3 deletions(-) > > diff --git a/drivers/staging/greybus/audio_codec.c > b/drivers/staging/greybus/audio_codec.c > index 35acd55..08746c8 100644 > --- a/drivers/staging/greybus/audio_codec.c > +++ b/drivers/staging/greybus/audio_codec.c > @@ -1087,7 +1087,6 @@ static int gbaudio_codec_remove(struct platform_device > *pdev) > static struct platform_driver gbaudio_codec_driver = { > .driver = { > .name = "apb-dummy-codec", > - .owner = THIS_MODULE, > #ifdef CONFIG_PM > .pm = &gbaudio_codec_pm_ops, > #endif > diff --git a/drivers/staging/mt7621-eth/gsw_mt7621.c > b/drivers/staging/mt7621-eth/gsw_mt7621.c > index 2c07b55..53767b1 100644 > --- a/drivers/staging/mt7621-eth/gsw_mt7621.c > +++ b/drivers/staging/mt7621-eth/gsw_mt7621.c > @@ -286,7 +286,6 @@ static int mt7621_gsw_remove(struct platform_device *pdev) > .remove = mt7621_gsw_remove, > .driver = { > .name = "mt7621-gsw", > - .owner = THIS_MODULE, > .of_match_table = mediatek_gsw_match, > }, > }; > diff --git a/drivers/staging/mt7621-eth/mtk_eth_soc.c > b/drivers/staging/mt7621-eth/mtk_eth_soc.c > index 7135075..363d3c9 100644 > --- a/drivers/staging/mt7621-eth/mtk_eth_soc.c > +++ b/drivers/staging/mt7621-eth/mtk_eth_soc.c > @@ -2167,7 +2167,6 @@ static int mtk_remove(struct platform_device *pdev) > .remove = mtk_remove, > .driver = { > .name = "mtk_soc_eth", > - .owner = THIS_MODULE, > .of_match_table = of_mtk_match, > }, > }; > -- > 1.7.12.4 > Acked-by: Vaibhav Agarwal
Re: [PATCH V3 4/6] x86/intel_rdt: Create required perf event attributes
Hi Reinette, Thank you for the patch! Yet something to improve: [auto build test ERROR on tip/x86/core] [also build test ERROR on v4.19-rc3 next-20180911] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Reinette-Chatre/perf-core-and-x86-intel_rdt-Fix-lack-of-coordination-with-perf/20180912-101526 config: i386-randconfig-x001-201836 (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 Note: the linux-review/Reinette-Chatre/perf-core-and-x86-intel_rdt-Fix-lack-of-coordination-with-perf/20180912-101526 HEAD b684b8727deb9e3cf635badb292b3314904d17b2 builds fine. It only hurts bisectibility. All error/warnings (new ones prefixed by >>): >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:927:15: error: variable >> 'perf_miss_attr' has initializer but incomplete type static struct perf_event_attr __attribute__((unused)) perf_miss_attr = { ^~~ >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:928:3: error: 'struct >> perf_event_attr' has no member named 'type' .type = PERF_TYPE_RAW, ^~~~ >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:928:11: error: 'PERF_TYPE_RAW' >> undeclared here (not in a function); did you mean 'PIDTYPE_MAX'? 
.type = PERF_TYPE_RAW, ^ PIDTYPE_MAX >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:928:11: warning: excess >> elements in struct initializer arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:928:11: note: (near initialization for 'perf_miss_attr') >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:929:3: error: 'struct >> perf_event_attr' has no member named 'size' .size = sizeof(struct perf_event_attr), ^~~~ >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:929:18: error: invalid >> application of 'sizeof' to incomplete type 'struct perf_event_attr' .size = sizeof(struct perf_event_attr), ^~ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:929:11: warning: excess elements in struct initializer .size = sizeof(struct perf_event_attr), ^~ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:929:11: note: (near initialization for 'perf_miss_attr') >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:930:3: error: 'struct >> perf_event_attr' has no member named 'pinned' .pinned = 1, ^~ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:930:13: warning: excess elements in struct initializer .pinned = 1, ^ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:930:13: note: (near initialization for 'perf_miss_attr') >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:931:3: error: 'struct >> perf_event_attr' has no member named 'disabled' .disabled = 0, ^~~~ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:931:14: warning: excess elements in struct initializer .disabled = 0, ^ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:931:14: note: (near initialization for 'perf_miss_attr') >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:932:3: error: 'struct >> perf_event_attr' has no member named 'exclude_user' .exclude_user = 1, ^~~~ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:932:18: warning: excess elements in struct initializer .exclude_user = 1, ^ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:932:18: note: (near initialization for 'perf_miss_attr') >> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:935:15: error: variable >> 
'perf_hit_attr' has initializer but incomplete type static struct perf_event_attr __attribute__((unused)) perf_hit_attr = { ^~~ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:936:3: error: 'struct perf_event_attr' has no member named 'type' .type = PERF_TYPE_RAW, ^~~~ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:936:11: warning: excess elements in struct initializer .type = PERF_TYPE_RAW, ^ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:936:11: note: (near initialization for 'perf_hit_attr') arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:937:3: error: 'struct perf_event_attr' has no member named 'size' .size = sizeof(struct perf_event_attr), ^~~~ arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:937:18: error: invalid application of 'sizeof' to incomplete type 'struct perf_event_attr' .size =
Re: [LKP] [rcu] 02a5c550b2: BUG:kernel_reboot-without-warning_in_test_stage
On Wed, Sep 12, 2018 at 01:25:27PM +0800, kernel test robot wrote: > FYI, we noticed the following commit (built with gcc-7): > > commit: 02a5c550b2738f2bfea8e1e00aa75944d71c9e18 ("rcu: Abstract extended > quiescent state determination") > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master > > in testcase: perf_event_tests > with following parameters: > > paranoid: disallow_kernel_profiling > > test-description: The Perf Event Testsuite. > test-url: https://github.com/deater/perf_event_tests > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 2G This is a blast from the past! Almost two years ago. I will take a closer look (and also check for any later fixes), but at first glance I am not seeing anything in this commit that would actually change behavior. But is it possible that this is due to vCPU preemption on a heavily loaded test system? I have done that to myself from time to time... Thanx, Paul > caused below changes (please refer to attached dmesg/kmsg for entire > log/backtrace): > > > +-+++ > | | 2625d469ba | 02a5c550b2 | > +-+++ > | boot_successes | 18 | 0 | > | boot_failures | 0 | 17 | > | BUG:kernel_reboot-without-warning_in_test_stage | 0 | 17 | > +-+++ > > > > [ 21.715217] > [ 22.950524] perf: interrupt took too long (5334 > 5235), lowering > kernel.perf_event_max_sample_rate to 37250 > [ 22.956921] perf: interrupt took too long (6735 > 6667), lowering > kernel.perf_event_max_sample_rate to 29500 > [ 22.970150] perf: interrupt took too long (8494 > 8418), lowering > kernel.perf_event_max_sample_rate to 23500 > [ 22.976586] perf: interrupt took too long (10754 > 10617), lowering > kernel.perf_event_max_sample_rate to 18500 > BUG: kernel reboot-without-warning in test stage > > Elapsed time: 30 > > #!/bin/bash > > > > To reproduce: > > git clone https://github.com/intel/lkp-tests.git > cd lkp-tests > bin/lkp qemu -k job-script # job-script is attached in this > email > > > > Thanks, > lkp > # > # Automatically 
generated file; DO NOT EDIT. > # Linux/x86_64 4.10.0-rc3 Kernel Configuration > # > CONFIG_64BIT=y > CONFIG_X86_64=y > CONFIG_X86=y > CONFIG_INSTRUCTION_DECODER=y > CONFIG_OUTPUT_FORMAT="elf64-x86-64" > CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" > CONFIG_LOCKDEP_SUPPORT=y > CONFIG_STACKTRACE_SUPPORT=y > CONFIG_MMU=y > CONFIG_ARCH_MMAP_RND_BITS_MIN=28 > CONFIG_ARCH_MMAP_RND_BITS_MAX=32 > CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 > CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 > CONFIG_NEED_DMA_MAP_STATE=y > CONFIG_NEED_SG_DMA_LENGTH=y > CONFIG_GENERIC_ISA_DMA=y > CONFIG_GENERIC_BUG=y > CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y > CONFIG_GENERIC_HWEIGHT=y > CONFIG_ARCH_MAY_HAVE_PC_FDC=y > CONFIG_RWSEM_XCHGADD_ALGORITHM=y > CONFIG_GENERIC_CALIBRATE_DELAY=y > CONFIG_ARCH_HAS_CPU_RELAX=y > CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y > CONFIG_HAVE_SETUP_PER_CPU_AREA=y > CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y > CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y > CONFIG_ARCH_HIBERNATION_POSSIBLE=y > CONFIG_ARCH_SUSPEND_POSSIBLE=y > CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y > CONFIG_ARCH_WANT_GENERAL_HUGETLB=y > CONFIG_ZONE_DMA32=y > CONFIG_AUDIT_ARCH=y > CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y > CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y > CONFIG_X86_64_SMP=y > CONFIG_ARCH_SUPPORTS_UPROBES=y > CONFIG_FIX_EARLYCON_MEM=y > CONFIG_DEBUG_RODATA=y > CONFIG_PGTABLE_LEVELS=4 > CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" > CONFIG_IRQ_WORK=y > CONFIG_BUILDTIME_EXTABLE_SORT=y > CONFIG_THREAD_INFO_IN_TASK=y > > # > # General setup > # > CONFIG_INIT_ENV_ARG_LIMIT=32 > CONFIG_CROSS_COMPILE="" > # CONFIG_COMPILE_TEST is not set > CONFIG_LOCALVERSION="" > CONFIG_LOCALVERSION_AUTO=y > CONFIG_HAVE_KERNEL_GZIP=y > CONFIG_HAVE_KERNEL_BZIP2=y > CONFIG_HAVE_KERNEL_LZMA=y > CONFIG_HAVE_KERNEL_XZ=y > CONFIG_HAVE_KERNEL_LZO=y > CONFIG_HAVE_KERNEL_LZ4=y > CONFIG_KERNEL_GZIP=y > # CONFIG_KERNEL_BZIP2 is not set > # CONFIG_KERNEL_LZMA is not set > # CONFIG_KERNEL_XZ is not set > # CONFIG_KERNEL_LZO is not set > 
# CONFIG_KERNEL_LZ4 is not set > CONFIG_DEFAULT_HOSTNAME="(none)" > CONFIG_SWAP=y > CONFIG_SYSVIPC=y > CONFIG_SYSVIPC_SYSCTL=y > CONFIG_POSIX_MQUEUE=y > CONFIG_POSIX_MQUEUE_SYSCTL=y > CONFIG_CROSS_MEMORY_ATTACH=y > CONFIG_FHANDLE=y > CONFIG_USELIB=y > # CONFIG_AUDIT is not set > CONFIG_HAVE_ARCH_AUDITSYSCALL=y > > # > # IRQ subsystem > # > CONFIG_GENERIC_IRQ_PROBE=y > CONFIG_GENERIC_IRQ_SHOW=y > CONFIG_GENERIC_PENDING_IRQ=y > CONFIG_IRQ_DOMAIN=y > CONFIG_IRQ_DOMAIN_HIERARCHY=y >
Re: [PATCH -next] staging: mt7621-pci: Use PTR_ERR_OR_ZERO in mt7621_pcie_parse_dt()
On Wed, Sep 12, 2018 at 02:50:08AM +, YueHaibing wrote: > Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR > > Signed-off-by: YueHaibing > --- > drivers/staging/mt7621-pci/pci-mt7621.c | 5 + > 1 file changed, 1 insertion(+), 4 deletions(-) > > diff --git a/drivers/staging/mt7621-pci/pci-mt7621.c > b/drivers/staging/mt7621-pci/pci-mt7621.c > index ba1f117..d2cb910 100644 > --- a/drivers/staging/mt7621-pci/pci-mt7621.c > +++ b/drivers/staging/mt7621-pci/pci-mt7621.c > @@ -396,10 +396,7 @@ static int mt7621_pcie_parse_dt(struct mt7621_pcie *pcie) > } > > pcie->base = devm_ioremap_resource(dev, &regs); > - if (IS_ERR(pcie->base)) > - return PTR_ERR(pcie->base); > - > - return 0; > + return PTR_ERR_OR_ZERO(pcie->base); > } This patch looks good, but the 'mt7621_pcie_parse_dt' function is not complete at all. A lot is still missing: each PCI node has yet to be parsed, and the patch series doing that has not been tested yet, so those patches are not included. Please see: http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-September/125937.html Best regards, Sergio Paracuellos > > static int mt7621_pcie_request_resources(struct mt7621_pcie *pcie, > > >
Re: [PATCH 06/11] compat_ioctl: remove /dev/random commands
On Tue, 11 Sep 2018 22:26:54 +0200 Arnd Bergmann wrote: > On Sun, Sep 9, 2018 at 6:12 AM Al Viro wrote: > > > > On Sat, Sep 08, 2018 at 04:28:12PM +0200, Arnd Bergmann wrote: > > > These are all handled by the random driver, so instead of listing > > > each ioctl, we can just use the same function to deal with both > > > native and compat commands. > > > > Umm... I don't think it's right - > > > > > .unlocked_ioctl = random_ioctl, > > > + .compat_ioctl = random_ioctl, > > > > > > ->compat_ioctl() gets called in > > error = f.file->f_op->compat_ioctl(f.file, cmd, > > arg); > > so you do *NOT* get compat_ptr() for those - they have to do it on their > > own. It's not hard to provide a proper compat_ioctl() instance for that > > one, but this is not it. What you need in drivers/char/random.c part of > > that one is something like > > Looping in some s390 folks. > > As you suggested in another reply, I had a look at what other drivers > do the same thing and have only pointer arguments. I created a > patch to move them all over to using a new helper function that > adds the compat_ptr(), and arrived at > > drivers/android/binder.c| 2 +- > drivers/crypto/qat/qat_common/adf_ctl_drv.c | 2 +- > drivers/dma-buf/dma-buf.c | 4 +--- > drivers/dma-buf/sw_sync.c | 2 +- > drivers/dma-buf/sync_file.c | 2 +- > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c| 2 +- > drivers/hid/hidraw.c| 4 +--- > drivers/iio/industrialio-core.c | 2 +- > drivers/infiniband/core/uverbs_main.c | 4 ++-- > drivers/media/rc/lirc_dev.c | 4 +--- > drivers/mfd/cros_ec_dev.c | 4 +--- > drivers/misc/vmw_vmci/vmci_host.c | 2 +- > drivers/nvdimm/bus.c| 4 ++-- > drivers/nvme/host/core.c| 6 +++--- > drivers/pci/switch/switchtec.c | 2 +- > drivers/platform/x86/wmi.c | 2 +- > drivers/rpmsg/rpmsg_char.c | 4 ++-- > drivers/s390/char/sclp_ctl.c| 8 ++-- > drivers/s390/char/vmcp.c| 2 ++ > drivers/s390/cio/chsc_sch.c | 8 ++-- > drivers/sbus/char/display7seg.c | 2 +- > drivers/sbus/char/envctrl.c | 4 +--- > drivers/scsi/3w-.c | 4 
+--- > drivers/scsi/cxlflash/main.c| 2 +- > drivers/scsi/esas2r/esas2r_main.c | 2 +- > drivers/scsi/pmcraid.c | 4 +--- > drivers/staging/android/ion/ion.c | 4 +--- > drivers/staging/vme/devices/vme_user.c | 2 +- > drivers/tee/tee_core.c | 2 +- > drivers/usb/class/cdc-wdm.c | 2 +- > drivers/usb/class/usbtmc.c | 4 +--- > drivers/video/fbdev/ps3fb.c | 2 +- > drivers/video/fbdev/sis/sis_main.c | 4 +--- > drivers/virt/fsl_hypervisor.c | 2 +- > fs/btrfs/super.c| 2 +- > fs/ceph/dir.c | 2 +- > fs/ceph/file.c | 2 +- > fs/fuse/dev.c | 2 +- > fs/notify/fanotify/fanotify_user.c | 2 +- > fs/userfaultfd.c| 2 +- > net/rfkill/core.c | 2 +- > 41 files changed, 48 insertions(+), 76 deletions(-) > > Out of those, there are only a few that may get used on s390, > in particular at most infiniband/uverbs, nvme, nvdimm, > btrfs, ceph, fuse, fanotify and userfaultfd. > [Note: there are three s390 drivers in the list, which use > a different method: they check in_compat_syscall() from > a shared handler to decide whether to do compat_ptr(). Using in_compat_syscall() seems to be a good solution, no? > According to my memory from when I last worked on this, > the compat_ptr() is mainly a safeguard for legacy binaries > that got created with ancient C compilers (or compilers for > something other than C) and might leave the high bit set > in a pointer, but modern C compilers (gcc-3+) won't ever > do that. And compat_ptr clears the upper 32-bit of the register. If the register is loaded to e.g. "lr" or "l" there will be junk in the 4 upper bytes. > You are probably right about /dev/random, which could be > used in lots of weird code, but I wonder to what degree we > need to worry about it for the rest. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
Re: [PATCH v2 2/2] dmaengine: uniphier-mdmac: add UniPhier MIO DMAC driver
2018-09-12 13:35 GMT+09:00 Vinod : > On 12-09-18, 12:01, Masahiro Yamada wrote: >> Hi Vinod, >> >> >> 2018-09-11 16:00 GMT+09:00 Vinod : >> > On 24-08-18, 10:41, Masahiro Yamada wrote: >> > >> >> +/* mc->vc.lock must be held by caller */ >> >> +static u32 __uniphier_mdmac_get_residue(struct uniphier_mdmac_desc *md) >> >> +{ >> >> + u32 residue = 0; >> >> + int i; >> >> + >> >> + for (i = md->sg_cur; i < md->sg_len; i++) >> >> + residue += sg_dma_len(&md->sgl[i]); >> > >> > so if the descriptor is submitted to hardware, we return the descriptor >> > length, which is not correct. >> > >> > Two cases are required to be handled: >> > 1. Descriptor is in queue (IMO above logic is fine for that, but it can >> > be calculated at descriptor submit and looked up here) >> >> Where do you want it to be calculated? > > where is it calculated now? Please see __uniphier_mdmac_handle(). It gets the address and size by sg_dma_address(), sg_dma_len() just before setting them to the hardware registers. sg = &md->sgl[md->sg_cur]; if (md->dir == DMA_MEM_TO_DEV) { src_mode = UNIPHIER_MDMAC_CH_MODE__ADDR_INC; src_addr = sg_dma_address(sg); dest_mode = UNIPHIER_MDMAC_CH_MODE__ADDR_FIXED; dest_addr = 0; } else { src_mode = UNIPHIER_MDMAC_CH_MODE__ADDR_FIXED; src_addr = 0; dest_mode = UNIPHIER_MDMAC_CH_MODE__ADDR_INC; dest_addr = sg_dma_address(sg); } >> This hardware provides only simple registers (address and size) >> for one-shot transfer instead of descriptors. >> >> So, I used sgl as-is because I did not see a good reason >> to transform sgl to another data structure. >> >> > this seems missing stuff. Where do you do register calculation for the >> > descriptor and where is slave_config here, how do you know where to >> > send/receive data from/to (peripheral) >> >> >> This dmac is really simple, and inflexible. >> >> The peripheral address to send/receive data from/to is hard-wired. >> cfg->{src_addr,dst_addr} is not configurable. >> >> Look at __uniphier_mdmac_handle(). 
>> 'dest_addr' and 'src_addr' must be set to 0 for the peripheral. > > Fair enough, what about other values like addr_width and maxburst? None of them is configurable. -- Best Regards Masahiro Yamada
[LKP] [rcu] 02a5c550b2: BUG:kernel_reboot-without-warning_in_test_stage
FYI, we noticed the following commit (built with gcc-7): commit: 02a5c550b2738f2bfea8e1e00aa75944d71c9e18 ("rcu: Abstract extended quiescent state determination") https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master in testcase: perf_event_tests with following parameters: paranoid: disallow_kernel_profiling test-description: The Perf Event Testsuite. test-url: https://github.com/deater/perf_event_tests on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 2G caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): +-+++ | | 2625d469ba | 02a5c550b2 | +-+++ | boot_successes | 18 | 0 | | boot_failures | 0 | 17 | | BUG:kernel_reboot-without-warning_in_test_stage | 0 | 17 | +-+++ [ 21.715217] [ 22.950524] perf: interrupt took too long (5334 > 5235), lowering kernel.perf_event_max_sample_rate to 37250 [ 22.956921] perf: interrupt took too long (6735 > 6667), lowering kernel.perf_event_max_sample_rate to 29500 [ 22.970150] perf: interrupt took too long (8494 > 8418), lowering kernel.perf_event_max_sample_rate to 23500 [ 22.976586] perf: interrupt took too long (10754 > 10617), lowering kernel.perf_event_max_sample_rate to 18500 BUG: kernel reboot-without-warning in test stage Elapsed time: 30 #!/bin/bash To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k job-script # job-script is attached in this email Thanks, lkp # # Automatically generated file; DO NOT EDIT. 
# Linux/x86_64 4.10.0-rc3 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=28 CONFIG_ARCH_MMAP_RND_BITS_MAX=32 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_X86_64_SMP=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_DEBUG_RODATA=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y 
CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y CONFIG_USELIB=y # CONFIG_AUDIT is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set # CONFIG_IRQ_TIME_ACCOUNTING is not set CONFIG_BSD_PROCESS_ACCT=y
Re: [PATCH] firmware: vpd: fix spelling mistake "partion" -> "partition"
On Tue, Sep 11, 2018 at 09:58:48PM -0700, Guenter Roeck wrote: > On 09/11/2018 09:58 AM, Colin King wrote: > > From: Colin Ian King > > > > Trivial fix to spelling mistake in comment > > > > Signed-off-by: Colin Ian King > > Reviewed-by: Guenter Roeck > > Interesting - drivers/firmware/google/ does not have a maintainer. > Greg - is it correct to assume that you are the de-facto maintainer ? Yeah, I am :( I'll queue this up, thanks. greg k-h
[PATCH v2 0/1] gpio: mvebu: Add support for multiple PWM lines
Hi everyone, Helios4, an Armada 388 based NAS SBC, provides 2 (4-pins) fan connectors. The PWM pins on both connector are connected to GPIO on bank 1. Current gpio-mvebu does not allow more than one PWM on the same bank. Aditya --- Changes v1->v2: * Merge/squash "[Patch 2/2] gpio: mvebu: Allow to use non-default PWM counter" * Allow only two PWMs as suggested by Andrew Lunn and Richard Genoud --- Aditya Prayoga (1): gpio: mvebu: Add support for multiple PWM lines per GPIO chip drivers/gpio/gpio-mvebu.c | 73 ++- 1 file changed, 60 insertions(+), 13 deletions(-) -- 2.7.4
[PATCH v2 1/1] gpio: mvebu: Add support for multiple PWM lines per GPIO chip
Allow more than 1 PWM request (eg. PWM fan) on the same GPIO chip. If the other PWM counter is unused, allocate it to next PWM request. The priority would be: 1. Default counter assigned to the bank 2. Unused counter that is assigned to other bank Since there are only two PWM counters, only two PWMs supported. Signed-off-by: Aditya Prayoga --- drivers/gpio/gpio-mvebu.c | 73 ++- 1 file changed, 60 insertions(+), 13 deletions(-) diff --git a/drivers/gpio/gpio-mvebu.c b/drivers/gpio/gpio-mvebu.c index 6e02148..2d46b87 100644 --- a/drivers/gpio/gpio-mvebu.c +++ b/drivers/gpio/gpio-mvebu.c @@ -92,9 +92,16 @@ #define MVEBU_MAX_GPIO_PER_BANK32 +enum mvebu_pwm_counter { + MVEBU_PWM_CTRL_SET_A = 0, + MVEBU_PWM_CTRL_SET_B, + MVEBU_PWM_CTRL_MAX +}; + struct mvebu_pwm { void __iomem*membase; unsigned longclk_rate; + enum mvebu_pwm_counter id; struct gpio_desc*gpiod; struct pwm_chip chip; spinlock_t lock; @@ -128,6 +135,8 @@ struct mvebu_gpio_chip { u32level_mask_regs[4]; }; +static struct mvebu_pwm*mvebu_pwm_list[MVEBU_PWM_CTRL_MAX]; + /* * Functions returning addresses of individual registers for a given * GPIO controller. 
@@ -594,34 +603,59 @@ static struct mvebu_pwm *to_mvebu_pwm(struct pwm_chip *chip) return container_of(chip, struct mvebu_pwm, chip); } +static struct mvebu_pwm *mvebu_pwm_get_avail_counter(void) +{ + enum mvebu_pwm_counter i; + + for (i = MVEBU_PWM_CTRL_SET_A; i < MVEBU_PWM_CTRL_MAX; i++) { + if (!mvebu_pwm_list[i]->gpiod) + return mvebu_pwm_list[i]; + } + return NULL; +} + static int mvebu_pwm_request(struct pwm_chip *chip, struct pwm_device *pwm) { struct mvebu_pwm *mvpwm = to_mvebu_pwm(chip); struct mvebu_gpio_chip *mvchip = mvpwm->mvchip; struct gpio_desc *desc; + struct mvebu_pwm *counter; unsigned long flags; int ret = 0; spin_lock_irqsave(&mvpwm->lock, flags); - if (mvpwm->gpiod) { - ret = -EBUSY; - } else { - desc = gpiochip_request_own_desc(&mvchip->chip, -pwm->hwpwm, "mvebu-pwm"); - if (IS_ERR(desc)) { - ret = PTR_ERR(desc); + counter = mvpwm; + if (counter->gpiod) { + counter = mvebu_pwm_get_avail_counter(); + if (!counter) { + ret = -EBUSY; goto out; } - ret = gpiod_direction_output(desc, 0); - if (ret) { - gpiochip_free_own_desc(desc); - goto out; - } + pwm->chip_data = counter; + } - mvpwm->gpiod = desc; + desc = gpiochip_request_own_desc(&mvchip->chip, +pwm->hwpwm, "mvebu-pwm"); + if (IS_ERR(desc)) { + ret = PTR_ERR(desc); + goto out; } + + ret = gpiod_direction_output(desc, 0); + if (ret) { + gpiochip_free_own_desc(desc); + goto out; + } + + regmap_update_bits(mvchip->regs, GPIO_BLINK_CNT_SELECT_OFF + + mvchip->offset, BIT(pwm->hwpwm), + counter->id ? 
BIT(pwm->hwpwm) : 0); + regmap_read(mvchip->regs, GPIO_BLINK_CNT_SELECT_OFF + + mvchip->offset, &counter->blink_select); + + counter->gpiod = desc; out: spin_unlock_irqrestore(&mvpwm->lock, flags); return ret; @@ -632,6 +666,11 @@ static void mvebu_pwm_free(struct pwm_chip *chip, struct pwm_device *pwm) struct mvebu_pwm *mvpwm = to_mvebu_pwm(chip); unsigned long flags; + if (pwm->chip_data) { + mvpwm = (struct mvebu_pwm *) pwm->chip_data; + pwm->chip_data = NULL; + } + spin_lock_irqsave(&mvpwm->lock, flags); gpiochip_free_own_desc(mvpwm->gpiod); mvpwm->gpiod = NULL; @@ -648,6 +687,9 @@ static void mvebu_pwm_get_state(struct pwm_chip *chip, unsigned long flags; u32 u; + if (pwm->chip_data) + mvpwm = (struct mvebu_pwm *) pwm->chip_data; + spin_lock_irqsave(&mvpwm->lock, flags); val = (unsigned long long) @@ -695,6 +737,9 @@ static int mvebu_pwm_apply(struct pwm_chip *chip, struct pwm_device *pwm, unsigned long flags; unsigned int on, off; + if (pwm->chip_data) + mvpwm = (struct mvebu_pwm *) pwm->chip_data; + val = (unsigned long long) mvpwm->clk_rate * state->duty_cycle; do_div(val, NSEC_PER_SEC); if (val > UINT_MAX) @@ -804,6 +849,7 @@ static int mvebu_pwm_probe(struct platform_device *pdev,
Re: [PATCH] xtensa: remove unnecessary KBUILD_SRC ifeq conditional
On Tue, Sep 11, 2018 at 9:25 PM, Masahiro Yamada wrote: > You can always prefix variant/platform header search paths with > $(srctree)/ because $(srctree) is '.' for in-tree building. > > Signed-off-by: Masahiro Yamada > --- > > arch/xtensa/Makefile | 4 > 1 file changed, 4 deletions(-) Thanks, applied to my xtensa tree. -- Thanks. -- Max
Re: [LKP] 0a3856392c [ 10.513760] INFO: trying to register non-static key.
On 09/07/2018 10:19 AM, Matthew Wilcox wrote: On Fri, Sep 07, 2018 at 09:05:39AM +0800, kernel test robot wrote: Greetings, 0day kernel testing robot got the below dmesg and the first bad commit is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master commit 0a3856392cff1542170b5bc37211c9a21fd0c3f6 Author: Matthew Wilcox AuthorDate: Mon Jun 18 17:23:37 2018 -0400 Commit: Matthew Wilcox CommitDate: Tue Aug 21 23:54:20 2018 -0400 test_ida: Move ida_check_leaf Convert to new API and move to kernel space. Take the opportunity to test the situation a little more thoroughly (ie at different offsets). Signed-off-by: Matthew Wilcox Thank you test-bot. Can you check if this patch fixes the problem? Thanks, It works. Best Regards, Rong Chen diff --git a/lib/test_ida.c b/lib/test_ida.c index 2d1637d8136b..b06880625961 100644 --- a/lib/test_ida.c +++ b/lib/test_ida.c @@ -150,10 +150,10 @@ static void ida_check_conv(struct ida *ida) IDA_BUG_ON(ida, !ida_is_empty(ida)); } +static DEFINE_IDA(ida); + static int ida_checks(void) { - DEFINE_IDA(ida); - IDA_BUG_ON(&ida, !ida_is_empty(&ida)); ida_check_alloc(&ida); ida_check_destroy(&ida);
Re: [PATCH] firmware: vpd: fix spelling mistake "partion" -> "partition"
On 09/11/2018 09:58 AM, Colin King wrote: From: Colin Ian King Trivial fix to spelling mistake in comment Signed-off-by: Colin Ian King Reviewed-by: Guenter Roeck Interesting - drivers/firmware/google/ does not have a maintainer. Greg - is it correct to assume that you are the de-facto maintainer ? Guenter --- drivers/firmware/google/vpd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/firmware/google/vpd.c b/drivers/firmware/google/vpd.c index 1aa67bb5d8c0..c0c0b4e4e281 100644 --- a/drivers/firmware/google/vpd.c +++ b/drivers/firmware/google/vpd.c @@ -198,7 +198,7 @@ static int vpd_section_init(const char *name, struct vpd_section *sec, sec->name = name; - /* We want to export the raw partion with name ${name}_raw */ + /* We want to export the raw partition with name ${name}_raw */ sec->raw_name = kasprintf(GFP_KERNEL, "%s_raw", name); if (!sec->raw_name) { err = -ENOMEM;
[PATCH] kbuild: prefix Makefile.dtbinst path with $(srctree) unconditionally
$(srctree) always points to the top of the source tree whether KBUILD_SRC is set or not. Signed-off-by: Masahiro Yamada --- scripts/Kbuild.include | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include index ce53639..46cc43e 100644 --- a/scripts/Kbuild.include +++ b/scripts/Kbuild.include @@ -193,7 +193,7 @@ modbuiltin := -f $(srctree)/scripts/Makefile.modbuiltin obj # Shorthand for $(Q)$(MAKE) -f scripts/Makefile.dtbinst obj= # Usage: # $(Q)$(MAKE) $(dtbinst)=dir -dtbinst := -f $(if $(KBUILD_SRC),$(srctree)/)scripts/Makefile.dtbinst obj +dtbinst := -f $(srctree)/scripts/Makefile.dtbinst obj ### # Shorthand for $(Q)$(MAKE) -f scripts/Makefile.clean obj= -- 2.7.4
linux-next: Tree for Sep 12
Hi all, News: there will be no linux-next releases on Friday or Monday. Changes since 20180911: Dropped trees: xarray, ida (temporarily) I applied a patch for a runtime problem in the vfs tree and I still disabled building some samples. The drm-misc tree gained a conflict against the drm tree. The tty tree lost its build failure. Non-merge commits (relative to Linus' tree): 3284 3703 files changed, 109374 insertions(+), 67935 deletions(-) I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc and sparc64 defconfig. And finally, a simple boot test of the powerpc pseries_le_defconfig kernel in qemu (with and without kvm enabled). Below is a summary of the state of the merge. I am currently merging 287 trees (counting Linus' and 66 trees of bug fix patches pending for the current merge release). Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html . Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. 
Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. -- Cheers, Stephen Rothwell $ git checkout master $ git reset --hard stable Merging origin/master (11da3a7f84f1 Linux 4.19-rc3) Merging fixes/master (72358c0b59b7 linux-next: build warnings from the build of Linus' tree) Merging kbuild-current/fixes (11da3a7f84f1 Linux 4.19-rc3) Merging arc-current/for-curr (00a99339f0a3 ARCv2: build: use mcpu=hs38 iso generic mcpu=archs) Merging arm-current/fixes (afc9f65e01cd ARM: 8781/1: Fix Thumb-2 syscall return for binutils 2.29+) Merging arm64-fixes/for-next/fixes (84c57dbd3c48 arm64: kernel: arch_crash_save_vmcoreinfo() should depend on CONFIG_CRASH_CORE) Merging m68k-current/for-linus (0986b16ab49b m68k/mac: Use correct PMU response format) Merging powerpc-fixes/fixes (cca19f0b684f powerpc/64s/radix: Fix missing global invalidations when removing copro) Merging sparc/master (df2def49c57b Merge tag 'acpi-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm) Merging fscrypt-current/for-stable (ae64f9bd1d36 Linux 4.15-rc2) Merging net/master (7c5cca358854 qmi_wwan: Support dynamic config on Quectel EP06) Merging bpf/master (28619527b8a7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net) Merging ipsec/master (782710e333a5 xfrm: reset crypto_done when iterating over multiple input xfrms) Merging netfilter/master (1286df269f49 netfilter: xt_hashlimit: use s->file instead of s->private) Merging ipvs/master (feb9f55c33e5 netfilter: nft_dynset: allow dynamic updates of non-anonymous set) Merging wireless-drivers/master (5b394b2ddf03 Linux 4.19-rc1) Merging mac80211/master (c42055105785 mac80211: fix TX status reporting for ieee80211s) Merging rdma-fixes/for-rc (8f28b178f71c RDMA/mlx4: Ensure that maximal send/receive SGE less than supported by HW) Merging sound-current/for-linus (49434c6c575d ALSA: emu10k1: fix possible info leak to userspace on SNDRV_EMU10K1_IOCTL_INFO) Merging 
sound-asoc-fixes/for-linus (de7609683fef Merge branch 'asoc-4.19' into asoc-linus) Merging regmap-fixes/for-linus (57361846b52b Linux 4.19-rc2) Merging regulator-fixes/for-linus (b832dd4f2c04 Merge branch 'regulator-4.19' into regulator-linus) Merging spi-fixes/for-linus (9029858be9ef Merge branch 'spi-4.19' into spi-linus) Merging pci-current/for-linus (34fb6bf9b13a PCI: pciehp: Fix hot-add vs powerfault detection order) Merging driver-core.current/driver-core-linus (11da3a7f84f1 Linux 4.19-rc3) Merging tty.current/tty-linus (7f2bf7840b74 tty: hvc: hvc_write() fix break condition) Merging usb.current/usb-linus (df3aa13c7bbb Revert "cdc-acm: implement put_char() and flush_chars()") Merging usb-gadget-fixes/fixes (d9707490077b usb: dwc2: Fix call location of dwc2_check_core_endian
Re: [PATCH v2 2/2] dmaengine: uniphier-mdmac: add UniPhier MIO DMAC driver
On 12-09-18, 12:01, Masahiro Yamada wrote: > Hi Vinod, > > > 2018-09-11 16:00 GMT+09:00 Vinod : > > On 24-08-18, 10:41, Masahiro Yamada wrote: > > > >> +/* mc->vc.lock must be held by caller */ > >> +static u32 __uniphier_mdmac_get_residue(struct uniphier_mdmac_desc *md) > >> +{ > >> + u32 residue = 0; > >> + int i; > >> + > >> + for (i = md->sg_cur; i < md->sg_len; i++) > >> + residue += sg_dma_len(&md->sgl[i]); > > > > so if the descriptor is submitted to hardware, we return the descriptor > > length, which is not correct. > > > > Two cases are required to be handled: > > 1. Descriptor is in queue (IMO above logic is fine for that, but it can > > be calculated at descriptor submit and looked up here) > > Where do you want it to be calculated? where is it calculated now? > This hardware provides only simple registers (address and size) > for one-shot transfer instead of descriptors. > > So, I used sgl as-is because I did not see a good reason > to transform sgl to another data structure. > > this seems missing stuff. Where do you do register calculation for the > > descriptor and where is slave_config here, how do you know where to > > send/receive data from/to (peripheral) > > > This dmac is really simple, and inflexible. > > The peripheral address to send/receive data from/to is hard-wired. > cfg->{src_addr,dst_addr} is not configurable. > > Look at __uniphier_mdmac_handle(). > 'dest_addr' and 'src_addr' must be set to 0 for the peripheral. Fair enough, what about other values like addr_width and maxburst? -- ~Vinod
Re: [PATCH] RISC-V: Show IPI stats
On Mon, Sep 10, 2018 at 7:16 PM, Christoph Hellwig wrote: > On Fri, Sep 07, 2018 at 06:14:29PM +0530, Anup Patel wrote: >> This patch provides arch_show_interrupts() implementation to >> show IPI stats via /proc/interrupts. >> >> Now the contents of /proc/interrupts" will look like below: >>CPU0 CPU1 CPU2 CPU3 >> 8: 17 7 6 14 SiFive PLIC 8 virtio0 >> 10: 10 10 9 11 SiFive PLIC 10 ttyS0 >> IPI0: 170673251 79 Rescheduling interrupts >> IPI1: 1 12 27 1 Function call interrupts >> IPI2: 0 0 0 0 CPU wake-up interrupts >> >> Signed-off-by: Anup Patel > > Thanks, this looks pretty sensible to me. Maybe we want to also show > timer interrupts if we do this? Let's not include timer stats here until RISCV INTC driver is concluded. We can do it as separate patch if required. > >> --- a/arch/riscv/kernel/irq.c >> +++ b/arch/riscv/kernel/irq.c >> @@ -8,6 +8,7 @@ >> #include >> #include >> #include >> +#include >> >> /* >> * Possible interrupt causes: >> @@ -24,6 +25,14 @@ >> */ >> #define INTERRUPT_CAUSE_FLAG (1UL << (__riscv_xlen - 1)) >> >> +int arch_show_interrupts(struct seq_file *p, int prec) >> +{ >> +#ifdef CONFIG_SMP >> + show_ipi_stats(p, prec); >> +#endif >> + return 0; >> +} > > If we don't also add timer stats I'd just move arch_show_interrupts > to smp.c and make it conditional. If we don't this split might make > more sense. I understand you want to avoid #ifdef here. We can do same thing by having empty inline function show_ipi_stats() in asm/smp.h for !CONFIG_SMP case. This way we can keep arch_show_interrupts() in kernel/irq.c which is intuitively correct location for arch_show_interrupts(). > >> +static const char *ipi_names[IPI_MAX] = { >> + [IPI_RESCHEDULE] = "Rescheduling interrupts", >> + [IPI_CALL_FUNC] = "Function call interrupts", >> + [IPI_CALL_WAKEUP] = "CPU wake-up interrupts", >> +}; > > No need for the explicit array size. 
Also please use a few tabs to > align this nicely: > > static const char *ipi_names[] = { > [IPI_RESCHEDULE]= "Rescheduling interrupts", > [IPI_CALL_FUNC] = "Function call interrupts", > [IPI_CALL_WAKEUP] = "CPU wake-up interrupts", > }; Sure, will do. Regards, Anup
[PATCH] xtensa: remove unnecessary KBUILD_SRC ifeq conditional
You can always prefix variant/platform header search paths with $(srctree)/ because $(srctree) is '.' for in-tree building. Signed-off-by: Masahiro Yamada --- arch/xtensa/Makefile | 4 1 file changed, 4 deletions(-) diff --git a/arch/xtensa/Makefile b/arch/xtensa/Makefile index 295c120..d67e30fa 100644 --- a/arch/xtensa/Makefile +++ b/arch/xtensa/Makefile @@ -64,11 +64,7 @@ endif vardirs := $(patsubst %,arch/xtensa/variants/%/,$(variant-y)) plfdirs := $(patsubst %,arch/xtensa/platforms/%/,$(platform-y)) -ifeq ($(KBUILD_SRC),) -KBUILD_CPPFLAGS += $(patsubst %,-I%include,$(vardirs) $(plfdirs)) -else KBUILD_CPPFLAGS += $(patsubst %,-I$(srctree)/%include,$(vardirs) $(plfdirs)) -endif KBUILD_DEFCONFIG := iss_defconfig -- 2.7.4
[PATCH] ARM: remove unnecessary KBUILD_SRC ifeq conditional
You can always prefix machine/plat header search paths with $(srctree)/ because $(srctree) is '.' for in-tree building. Signed-off-by: Masahiro Yamada --- KernelVersion: v4.19-rc3 arch/arm/Makefile | 4 1 file changed, 4 deletions(-) diff --git a/arch/arm/Makefile b/arch/arm/Makefile index d1516f8..06ebff7 100644 --- a/arch/arm/Makefile +++ b/arch/arm/Makefile @@ -264,13 +264,9 @@ platdirs := $(patsubst %,arch/arm/plat-%/,$(sort $(plat-y))) ifneq ($(CONFIG_ARCH_MULTIPLATFORM),y) ifneq ($(CONFIG_ARM_SINGLE_ARMV7M),y) -ifeq ($(KBUILD_SRC),) -KBUILD_CPPFLAGS += $(patsubst %,-I%include,$(machdirs) $(platdirs)) -else KBUILD_CPPFLAGS += $(patsubst %,-I$(srctree)/%include,$(machdirs) $(platdirs)) endif endif -endif export TEXT_OFFSET GZFLAGS MMUEXT -- 2.7.4
Re: [PATCH] include/linux/compiler-clang.h: define __naked
Hi Arnd, Nick, Stefan, On Mon, Sep 10, 2018 at 2:14 PM, Arnd Bergmann wrote: > On Mon, Sep 10, 2018 at 8:05 AM Stefan Agner wrote: >> >> ARM32 arch code uses the __naked attribute. This has previously been >> defined in include/linux/compiler-gcc.h, which is no longer included >> for Clang. Define __naked for Clang. Conservatively add all attributes >> previously used (and supported by Clang). >> >> This fixes compile errors when building ARM32 using Clang: >> arch/arm/mach-exynos/mcpm-exynos.c:193:13: error: variable has incomplete >> type 'void' >> static void __naked exynos_pm_power_up_setup(unsigned int affinity_level) >> ^ >> >> Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually >> exclusive") >> Signed-off-by: Stefan Agner > >> +/* >> + * ARM32 is currently the only user of __naked supported by Clang. Follow >> + * gcc: Do not trace naked functions and make sure they don't get inlined. >> + */ >> +#define __naked __attribute__((naked)) noinline notrace >> + > > Please see patches 5 and 6 of the series that Miguel posted: > > https://lore.kernel.org/lkml/20180908212459.19736-6-miguel.ojeda.sando...@gmail.com/ > > I suppose we want the patch to fix clang build as soon as possible though, > and follow up with the cleanup for the next merge window, right? Not sure what the plans of Linus et. al. are, if they have any; but that would be a safe bet. In case they want to speed this up and put the entire series into v4.19 (instead of the two patches), I have done a binary & objdump diff between -rc2 and v4 (based on -rc2) on all object files (with UTS_RELEASE fixed to avoid some differences). 
In a x86_64 tinyconfig with gcc 7.3, the differences I found are: $ ./compare.py linux-rc2 linux-v4 [2018-09-12 06:16:39,483] [INFO] [arch/x86/boot/compressed/piggy.o] Binary diff (use 'bash -c "cmp linux-rc2/arch/x86/boot/compressed/piggy.o linux-v4/arch/x86/boot/compressed/piggy.o"' to replicate) [2018-09-12 06:16:39,606] [INFO] [arch/x86/boot/header.o] Binary diff (use 'bash -c "cmp linux-rc2/arch/x86/boot/header.o linux-v4/arch/x86/boot/header.o"' to replicate) [2018-09-12 06:16:39,659] [INFO] [arch/x86/boot/version.o] Binary diff (use 'bash -c "cmp linux-rc2/arch/x86/boot/version.o linux-v4/arch/x86/boot/version.o"' to replicate) [2018-09-12 06:16:40,483] [INFO] [init/version.o] Binary diff (use 'bash -c "cmp linux-rc2/init/version.o linux-v4/init/version.o"' to replicate) I will do a bigger one tomorrow or so and see if there are any important differences. Regardless of what we do, I will send the __naked patches separately as well (requested by Nick on GitHub). Cheers, Miguel
[PATCH] dt-bindings: power: Introduce suspend states supported properties
Introduce Linux generic suspend-states-supported properties. It is convenient for the generic suspend path to know which suspend states are supported, based on device tree properties, so that it can either suspend or safely bail out of suspend if none of the suspend states are supported.

Signed-off-by: Keerthy
---
 .../devicetree/bindings/power/power-states.txt | 22 ++
 1 file changed, 22 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/power/power-states.txt

diff --git a/Documentation/devicetree/bindings/power/power-states.txt b/Documentation/devicetree/bindings/power/power-states.txt
new file mode 100644
index 000..bb80b36
--- /dev/null
+++ b/Documentation/devicetree/bindings/power/power-states.txt
@@ -0,0 +1,22 @@
+* Generic system suspend states support
+
+Most platforms support multiple suspend states. Define system
+suspend states so that one can target appropriate low power
+states based on the SoC capabilities.
+
+linux,suspend-to-memory-supported
+
+Upon suspend to memory the system context is saved to primary memory.
+All the clocks for all the peripherals including CPU are gated.
+
+linux,suspend-power-off-supported
+
+In this case, in addition to the clocks, all the voltage resources are
+turned off except the ones needed to keep the primary memory
+and a wake-up source that can trigger a wakeup event.
+
+linux,suspend-to-disk-supported
+
+Upon suspend to disk the system context is saved to secondary memory.
+All the clocks for all the peripherals including CPU are gated. Even
+the primary memory is turned off.
--
1.9.1
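Purely for illustration, a board advertising two of the proposed states might look like the fragment below. The binding above does not specify which node carries these properties, so placing them under the root node here is an assumption:

```dts
/ {
	/* Hypothetical board fragment: advertise the suspend states this
	 * platform supports (property names from the binding above). */
	linux,suspend-to-memory-supported;
	linux,suspend-power-off-supported;
};
```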
Re: [PATCH 2/3] sound: enable interrupt after dma buffer initialization
On 11-09-18, 14:58, Yu Zhao wrote:
> On Tue, Sep 11, 2018 at 08:06:49AM +0200, Takashi Iwai wrote:
> > On Mon, 10 Sep 2018 23:21:50 +0200, Yu Zhao wrote:
> > > In snd_hdac_bus_init_chip(), we enable interrupt before
> > > snd_hdac_bus_init_cmd_io() initializing dma buffers. If irq has
> > > been acquired and irq handler uses the dma buffer, kernel may crash
> > > when interrupt comes in.
> > >
> > > Fix the problem by postponing enabling irq after dma buffer
> > > initialization. And warn once on null dma buffer pointer during the
> > > initialization.
> > >
> > > Signed-off-by: Yu Zhao
> >
> > Looks good to me.
> >
> > Reviewed-by: Takashi Iwai
> >
> > BTW, the reason why this hasn't been hit on the legacy HD-audio driver
> > is that we usually allocate with MSI, so the irq is isolated.
> >
> > Any reason that the Intel SKL driver doesn't use MSI?
>
> This I'm not sure about. Vinod might have an answer to it, according to
> https://patchwork.kernel.org/patch/6375831/#13796611

IIRC (it was quite some time back) we faced issues with using MSI on SKL
and didn't try again afterwards. Perhaps the Intel folks can try it and
check. Pierre is out, maybe Liam can help..?

--
~Vinod
Re: [Question] Are the trace APIs declared by "TRACE_EVENT(irq_handler_entry" allowed to be used in Ko?
On Wed, 12 Sep 2018 10:08:37 +0800 "Leizhen (ThunderTown)" wrote: > After patch 7e066fb870fc ("tracepoints: add DECLARE_TRACE() and > DEFINE_TRACE()"), > the trace APIs declared by "TRACE_EVENT(irq_handler_entry" can not be > directly used > by ko, because it's not explicitly exported by EXPORT_TRACEPOINT_SYMBOL_GPL or > EXPORT_TRACEPOINT_SYMBOL. > > Did we miss it? or it's not recommended to be used in ko? > Why do you need it. This patch is almost 10 years old, and you are just now finding an issue with it? -- Steve > > - > > commit 7e066fb870fcd1025ec3ba7bbde5d541094f4ce1 > Author: Mathieu Desnoyers > Date: Fri Nov 14 17:47:47 2008 -0500 > > tracepoints: add DECLARE_TRACE() and DEFINE_TRACE() > > Impact: API *CHANGE*. Must update all tracepoint users. > > Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint > structure in a single spot for all the kernel. It helps reducing memory > consumption, especially when declaring a lot of tracepoints, e.g. for > kmalloc tracing. > > *API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers for > tracepoint declarations rather than DEFINE_TRACE(). This is the sane way > to do it. The name previously used was misleading. > > Updates scheduler instrumentation to follow this API change. > >
[PATCH] ASoC: remove unneeded static set .owner field in platform_driver
platform_driver_register will set the .owner field. So it is safe to remove the redundant assignment. The issue is detected with the help of Coccinelle. Signed-off-by: zhong jiang --- sound/soc/mediatek/mt2701/mt2701-wm8960.c | 1 - sound/soc/mediatek/mt6797/mt6797-mt6351.c | 1 - sound/soc/rockchip/rk3288_hdmi_analog.c | 1 - 3 files changed, 3 deletions(-) diff --git a/sound/soc/mediatek/mt2701/mt2701-wm8960.c b/sound/soc/mediatek/mt2701/mt2701-wm8960.c index 89f34ef..e5d49e6 100644 --- a/sound/soc/mediatek/mt2701/mt2701-wm8960.c +++ b/sound/soc/mediatek/mt2701/mt2701-wm8960.c @@ -150,7 +150,6 @@ static int mt2701_wm8960_machine_probe(struct platform_device *pdev) static struct platform_driver mt2701_wm8960_machine = { .driver = { .name = "mt2701-wm8960", - .owner = THIS_MODULE, #ifdef CONFIG_OF .of_match_table = mt2701_wm8960_machine_dt_match, #endif diff --git a/sound/soc/mediatek/mt6797/mt6797-mt6351.c b/sound/soc/mediatek/mt6797/mt6797-mt6351.c index b1558c5..6e578e8 100644 --- a/sound/soc/mediatek/mt6797/mt6797-mt6351.c +++ b/sound/soc/mediatek/mt6797/mt6797-mt6351.c @@ -205,7 +205,6 @@ static int mt6797_mt6351_dev_probe(struct platform_device *pdev) static struct platform_driver mt6797_mt6351_driver = { .driver = { .name = "mt6797-mt6351", - .owner = THIS_MODULE, #ifdef CONFIG_OF .of_match_table = mt6797_mt6351_dt_match, #endif diff --git a/sound/soc/rockchip/rk3288_hdmi_analog.c b/sound/soc/rockchip/rk3288_hdmi_analog.c index 929b3fe..a472d5e 100644 --- a/sound/soc/rockchip/rk3288_hdmi_analog.c +++ b/sound/soc/rockchip/rk3288_hdmi_analog.c @@ -286,7 +286,6 @@ static int snd_rk_mc_probe(struct platform_device *pdev) .probe = snd_rk_mc_probe, .driver = { .name = DRV_NAME, - .owner = THIS_MODULE, .pm = &snd_soc_pm_ops, .of_match_table = rockchip_sound_of_match, }, -- 1.7.12.4
[PATCH TRIVIAL] Punctuation fixes
Signed-off-by: Diego Viola --- CREDITS | 2 +- MAINTAINERS | 2 +- Makefile| 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/CREDITS b/CREDITS index 5befd2d71..b82efb36d 100644 --- a/CREDITS +++ b/CREDITS @@ -1473,7 +1473,7 @@ W: http://www.linux-ide.org/ W: http://www.linuxdiskcert.org/ D: Random SMP kernel hacker... D: Uniform Multi-Platform E-IDE driver -D: Active-ATA-Chipset maddness.. +D: Active-ATA-Chipset maddness... D: Ultra DMA 133/100/66/33 w/48-bit Addressing D: ATA-Disconnect, ATA-TCQ D: ATA-Smart Kernel Daemon diff --git a/MAINTAINERS b/MAINTAINERS index d870cb57c..6567bf245 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -93,7 +93,7 @@ Descriptions of section entries: Supported: Someone is actually paid to look after this. Maintained: Someone actually looks after it. Odd Fixes: It has a maintainer but they don't have time to do - much other than throw the odd patch in. See below.. + much other than throw the odd patch in. See below. Orphan: No current maintainer [but maybe you could take the role as you write your new code]. Obsolete:Old code. Something tagged obsolete generally means diff --git a/Makefile b/Makefile index 4d5c883a9..7b5c5d634 100644 --- a/Makefile +++ b/Makefile @@ -1109,7 +1109,7 @@ archprepare: archheaders archscripts prepare1 scripts_basic prepare0: archprepare gcc-plugins $(Q)$(MAKE) $(build)=. -# All the preparing.. +# All the preparing... prepare: prepare0 prepare-objtool # Support for using generic headers in asm-generic -- 2.19.0
[PATCH] staging: remove unneeded static set .owner field in platform_driver
platform_driver_register will set the .owner field. So it is safe to remove the redundant assignment. The issue is detected with the help of Coccinelle. Signed-off-by: zhong jiang --- drivers/staging/greybus/audio_codec.c| 1 - drivers/staging/mt7621-eth/gsw_mt7621.c | 1 - drivers/staging/mt7621-eth/mtk_eth_soc.c | 1 - 3 files changed, 3 deletions(-) diff --git a/drivers/staging/greybus/audio_codec.c b/drivers/staging/greybus/audio_codec.c index 35acd55..08746c8 100644 --- a/drivers/staging/greybus/audio_codec.c +++ b/drivers/staging/greybus/audio_codec.c @@ -1087,7 +1087,6 @@ static int gbaudio_codec_remove(struct platform_device *pdev) static struct platform_driver gbaudio_codec_driver = { .driver = { .name = "apb-dummy-codec", - .owner = THIS_MODULE, #ifdef CONFIG_PM .pm = &gbaudio_codec_pm_ops, #endif diff --git a/drivers/staging/mt7621-eth/gsw_mt7621.c b/drivers/staging/mt7621-eth/gsw_mt7621.c index 2c07b55..53767b1 100644 --- a/drivers/staging/mt7621-eth/gsw_mt7621.c +++ b/drivers/staging/mt7621-eth/gsw_mt7621.c @@ -286,7 +286,6 @@ static int mt7621_gsw_remove(struct platform_device *pdev) .remove = mt7621_gsw_remove, .driver = { .name = "mt7621-gsw", - .owner = THIS_MODULE, .of_match_table = mediatek_gsw_match, }, }; diff --git a/drivers/staging/mt7621-eth/mtk_eth_soc.c b/drivers/staging/mt7621-eth/mtk_eth_soc.c index 7135075..363d3c9 100644 --- a/drivers/staging/mt7621-eth/mtk_eth_soc.c +++ b/drivers/staging/mt7621-eth/mtk_eth_soc.c @@ -2167,7 +2167,6 @@ static int mtk_remove(struct platform_device *pdev) .remove = mtk_remove, .driver = { .name = "mtk_soc_eth", - .owner = THIS_MODULE, .of_match_table = of_mtk_match, }, }; -- 1.7.12.4
[PATCH] sparc: vdso: clean-up vdso Makefile
arch/sparc/vdso/Makefile is a replica of arch/x86/entry/vdso/Makefile. Clean-up the Makefile in the same way as I did for x86: - Remove unnecessary export - Put the generated linker script to $(obj)/ instead of $(src)/ - Simplify cmd_vdso2c The corresponding x86 commits are: - 61615faf0a89 ("x86/build/vdso: Remove unnecessary export in Makefile") - 1742ed2088cc ("x86/build/vdso: Put generated linker scripts to $(obj)/") - c5fcdbf15523 ("x86/build/vdso: Simplify 'cmd_vdso2c'") Signed-off-by: Masahiro Yamada --- arch/sparc/vdso/Makefile | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/sparc/vdso/Makefile b/arch/sparc/vdso/Makefile index dd0b5a9..dc85570 100644 --- a/arch/sparc/vdso/Makefile +++ b/arch/sparc/vdso/Makefile @@ -31,23 +31,21 @@ obj-y += $(vdso_img_objs) targets += $(vdso_img_cfiles) targets += $(vdso_img_sodbg) $(vdso_img-y:%=vdso%.so) -export CPPFLAGS_vdso.lds += -P -C +CPPFLAGS_vdso.lds += -P -C VDSO_LDFLAGS_vdso.lds = -m64 -Wl,-soname=linux-vdso.so.1 \ -Wl,--no-undefined \ -Wl,-z,max-page-size=8192 -Wl,-z,common-page-size=8192 \ $(DISABLE_LTO) -$(obj)/vdso64.so.dbg: $(src)/vdso.lds $(vobjs) FORCE +$(obj)/vdso64.so.dbg: $(obj)/vdso.lds $(vobjs) FORCE $(call if_changed,vdso) HOST_EXTRACFLAGS += -I$(srctree)/tools/include hostprogs-y+= vdso2c quiet_cmd_vdso2c = VDSO2C $@ -define cmd_vdso2c - $(obj)/vdso2c $< $(<:%.dbg=%) $@ -endef + cmd_vdso2c = $(obj)/vdso2c $< $(<:%.dbg=%) $@ $(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE $(call if_changed,vdso2c) -- 2.7.4
[PATCH] pstore: fix incorrect persistent ram buffer mapping
persistent_ram_vmap() returns the page start vaddr. persistent_ram_iomap() supports non-page-aligned mapping. persistent_ram_buffer_map() always adds offset-in-page to the vaddr returned from these two functions, which causes incorrect mapping of non-page-aligned persistent ram buffer. Signed-off-by: Bin Yang --- fs/pstore/ram_core.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c index 951a14e..7c05fdd 100644 --- a/fs/pstore/ram_core.c +++ b/fs/pstore/ram_core.c @@ -429,7 +429,7 @@ static void *persistent_ram_vmap(phys_addr_t start, size_t size, vaddr = vmap(pages, page_count, VM_MAP, prot); kfree(pages); - return vaddr; + return vaddr + offset_in_page(start); } static void *persistent_ram_iomap(phys_addr_t start, size_t size, @@ -468,7 +468,7 @@ static int persistent_ram_buffer_map(phys_addr_t start, phys_addr_t size, return -ENOMEM; } - prz->buffer = prz->vaddr + offset_in_page(start); + prz->buffer = prz->vaddr; prz->buffer_size = size - sizeof(struct persistent_ram_buffer); return 0; @@ -515,7 +515,7 @@ void persistent_ram_free(struct persistent_ram_zone *prz) if (prz->vaddr) { if (pfn_valid(prz->paddr >> PAGE_SHIFT)) { - vunmap(prz->vaddr); + vunmap(prz->vaddr - offset_in_page(prz->paddr)); } else { iounmap(prz->vaddr); release_mem_region(prz->paddr, prz->size); -- 2.7.4
Re: [PATCH v2 2/9] nios2: build .dtb files in dts directory
On Fri, 2018-09-07 at 13:09 -0500, Rob Herring wrote: > On Thu, Sep 6, 2018 at 9:21 PM Ley Foon Tan > wrote: > > > > > > On Wed, 2018-09-05 at 18:53 -0500, Rob Herring wrote: > > > > > > Align nios2 with other architectures which build the dtb files in > > > the > > > same directory as the dts files. This is also in line with most > > > other > > > build targets which are located in the same directory as the > > > source. > > > This move will help enable the 'dtbs' target which builds all the > > > dtbs > > > regardless of kernel config. > > > > > > This transition could break some scripts if they expect dtb files > > > in > > > the old location. > > > > > > Cc: Ley Foon Tan > > > Cc: nios2-...@lists.rocketboards.org > > > Signed-off-by: Rob Herring > > > --- > > > Please ack so I can take the whole series via the DT tree. > > > > > > arch/nios2/Makefile | 4 ++-- > > > arch/nios2/boot/Makefile | 4 > > > arch/nios2/boot/dts/Makefile | 1 + > > > 3 files changed, 3 insertions(+), 6 deletions(-) > > > create mode 100644 arch/nios2/boot/dts/Makefile > > > > > > diff --git a/arch/nios2/Makefile b/arch/nios2/Makefile > > > index 8673a79dca9c..50eece1c6adb 100644 > > > --- a/arch/nios2/Makefile > > > +++ b/arch/nios2/Makefile > > > @@ -59,10 +59,10 @@ archclean: > > > $(Q)$(MAKE) $(clean)=$(nios2-boot) > > > > > > %.dtb: | scripts > > > - $(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@ > > > + $(Q)$(MAKE) $(build)=$(nios2-boot)/dts $(nios2- > > > boot)/dts/$@ > > > > > > dtbs: > > > - $(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@ > > > + $(Q)$(MAKE) $(build)=$(nios2-boot)/dts > > > > > > $(BOOT_TARGETS): vmlinux > > > $(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@ > > > diff --git a/arch/nios2/boot/Makefile b/arch/nios2/boot/Makefile > > > index 2ba23a679732..007586094dde 100644 > > > --- a/arch/nios2/boot/Makefile > > > +++ b/arch/nios2/boot/Makefile > > > @@ -47,10 +47,6 @@ obj-$(CONFIG_NIOS2_DTB_SOURCE_BOOL) += > > > linked_dtb.o > > > > > > targets += 
$(dtb-y) > > > > > > -# Rule to build device tree blobs with make command > > > -$(obj)/%.dtb: $(src)/dts/%.dts FORCE > > > - $(call if_changed_dep,dtc) > > > - > > > $(obj)/dtbs: $(addprefix $(obj)/, $(dtb-y)) > > > > > > install: > > > diff --git a/arch/nios2/boot/dts/Makefile > > > b/arch/nios2/boot/dts/Makefile > > > new file mode 100644 > > > index ..f66554cd5c45 > > > --- /dev/null > > > +++ b/arch/nios2/boot/dts/Makefile > > > @@ -0,0 +1 @@ > > > +# SPDX-License-Identifier: GPL-2.0 > > > -- > > > 2.17.1 > > > > > Hi Rob > > > > I have synced your all-dtbs branch from here: https://git.kernel.or > > g/pu > > b/scm/linux/kernel/git/robh/linux.git/log/?h=all-dtbs > > > > It shows error when compile kernel image and also when "make > > dtbs_install". > Can you fetch the branch again and try it. I fixed a few dependency > issues. > > > > > make dtbs_install > > make[1]: *** No rule to make target > > 'arch/nios2/boot/dts/arch/nios2/boot/dts/10m50_devboard.dtb', > > needed by > > 'arch/nios2/boot/dts/arch/nios2/boot/dts/10m50_devboard.dtb.S'. St > > op. > What is the value of CONFIG_NIOS2_DTB_SOURCE? As patch 3 notes, it > now > should not have any path. > > If that's a problem, I could take the basename to strip the path, but > then sub directories wouldn't work either. > > BTW, next up, I want to consolidate the config variables for built-in > dtbs. > Hi Rob CONFIG_NIOS2_DTB_SOURCE has the relative path to dts file, arch/nios2/boot/dts/arch/nios2/boot/dts/10m50_devboard.dts Change CONFIG_NIOS2_DTB_SOURCE=10m50_devboard.dtb.S fix the dtb build issue. Regards Ley Foon
Re: [PATCH v2 2/3] x86/mm/KASLR: Calculate the actual size of vmemmap region
On 09/11/18 at 08:08pm, Baoquan He wrote: > On 09/11/18 at 11:28am, Ingo Molnar wrote: > > Yeah, so proper context is still missing, this paragraph appears to assume > > from the reader a > > whole lot of prior knowledge, and this is one of the top comments in > > kaslr.c so there's nowhere > > else to go read about the background. > > > > For example what is the range of randomization of each region? Assuming the > > static, > > non-randomized description in Documentation/x86/x86_64/mm.txt is correct, > > in what way does > > KASLR modify that layout? Re-read this paragraph, found I missed saying the range for each memory region, and in what way KASLR modify the layout. > > > > All of this is very opaque and not explained very well anywhere that I > > could find. We need to > > generate a proper description ASAP. > > OK, let me try to give an context with my understanding. And copy the > static layout of memory regions at below for reference. > Here, Documentation/x86/x86_64/mm.txt is correct, and it's the guideline for us to manipulate the layout of kernel memory regions. Originally the starting address of each region is aligned to 512GB so that they are all mapped at the 0-th entry of PGD table in 4-level page mapping. Since we are so rich to have 120 TB virtual address space, they are aligned at 1 TB actually. So randomness comes from three parts mainly: 1) The direct mapping region for physical memory. 64 TB are reserved to cover the maximum physical memory support. However, most of systems only have much less RAM memory than 64 TB, even much less than 1 TB most of time. We can take the superfluous to join the randomization. This is often the biggest part. 2) The hole between memory regions, even though they are only 1 TB. 3) KASAN region takes up 16 TB, while it won't take effect when KASLR is enabled. This is another big part. 
As you can see, of these three memory regions, the physical memory mapping region has a variable size according to the existing system RAM, while the remaining two memory regions have fixed sizes: vmalloc is 32 TB and vmemmap is 1 TB. With this superfluous address space, and with the starting address of each memory region aligned at PUD level, namely 1 GB, we can have thousands of candidate positions at which to locate those three memory regions.

The above is for 4-level paging mode. As for 5-level, since the virtual address space is much bigger, Kirill makes the starting address of the regions P4D aligned, namely 512 GB.

When randomizing the layout, the order of the regions is kept: the physical memory mapping region is handled first, then vmalloc and vmemmap. Take the physical memory mapping region as an example: we limit its starting address to the first 1/3 of the whole available virtual address space, which runs from 0xffff880000000000 to 0xfffffe0000000000, namely from the original starting address of the physical memory mapping region to the starting address of the cpu_entry_area mapping region. Once a random address is chosen for the physical memory mapping, we jump over the region, add 1 GB, and handle the next region with the remaining available space.

~~
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory   136T - 200T = 64TB
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole                               200T - 201T = 1TB
ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space              201T - 233T = 32TB
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole                               233T - 234T = 1TB
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)           234T - 235T = 1TB
... unused hole ...
ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)         236T - 252T = 16TB
... unused hole ...                                                               vaddr_end for KASLR
fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping             254T - 254T+512G

Thanks
Baoquan
Re: [PATCH] perf test: Add watchpoint test
> While testing, I got curious, as a 'perf test' user, why one of the > tests had the "Skip" result: > > [root@seventh ~]# perf test watchpoint > 22: Watchpoint: > 22.1: Read Only Watchpoint: Skip > 22.2: Write Only Watchpoint : Ok > 22.3: Read / Write Watchpoint : Ok > 22.4: Modify Watchpoint : Ok > [root@seventh ~]# > > I tried with 'perf test -v watchpoint' but that didn't help, perhaps you > could add some message after the "Skip" telling why it skipped that > test? I.e. hardware doesn't have that capability, kernel driver not yet > supporting that, something else? Sure will add a message: pr_debug("Hardware does not support read only watchpoints."); Ravi
Re: [PATCH v2 1/4] arm64: dts: rockchip: Split out common nodes for Rock960 based boards
Hi Ezequiel, On Tue, Sep 11, 2018 at 04:40:29PM -0300, Ezequiel Garcia wrote: > On Tue, 2018-09-11 at 08:00 +0530, Manivannan Sadhasivam wrote: > > Since the same family members of Rock960 boards (Rock960 and Ficus) > > share the same configuration, split out the common nodes into a common > > dtsi file for reducing code duplication. The board specific nodes for > > Ficus boards are then placed in corresponding board DTS file. > > > > I think it should be possible to move the common USB nodes to the dtsi file, > and keep the board-specific (phy-supply property) in the dts files: > > &u2phy0_host { > phy-supply = <&vcc5v0_host>; > }; > > &u2phy1_host { > phy-supply = <&vcc5v0_host>; > }; > We can do that but my intention was to entirely partition the nodes which are not common. So that it would be less confusing when someone looks at it (please correct me if I'm wrong). > Also, I believe it would be good to have some more details > in this commit log. The information on the cover letter is great, > so I'd just repeat some of that here. > Sure, will add it in next iteration. > Other than that, for the ficus bits: > > Reviewed-by: Ezequiel Garcia > Thanks a lot for the review! Regards, Mani > Thanks very much for this work! 
> Ezequiel > > > > Signed-off-by: Manivannan Sadhasivam > > --- > > arch/arm64/boot/dts/rockchip/rk3399-ficus.dts | 429 + > > .../boot/dts/rockchip/rk3399-rock960.dtsi | 439 ++ > > 2 files changed, 440 insertions(+), 428 deletions(-) > > create mode 100644 arch/arm64/boot/dts/rockchip/rk3399-rock960.dtsi > > > > diff --git a/arch/arm64/boot/dts/rockchip/rk3399-ficus.dts > > b/arch/arm64/boot/dts/rockchip/rk3399-ficus.dts > > index 8978d924eb83..7f6ec37d5a69 100644 > > --- a/arch/arm64/boot/dts/rockchip/rk3399-ficus.dts > > +++ b/arch/arm64/boot/dts/rockchip/rk3399-ficus.dts > > @@ -7,8 +7,7 @@ > > */ > > > > /dts-v1/; > > -#include "rk3399.dtsi" > > -#include "rk3399-opp.dtsi" > > +#include "rk3399-rock960.dtsi" > > > > / { > > model = "96boards RK3399 Ficus"; > > @@ -25,31 +24,6 @@ > > #clock-cells = <0>; > > }; > > > > - vcc1v8_s0: vcc1v8-s0 { > > - compatible = "regulator-fixed"; > > - regulator-name = "vcc1v8_s0"; > > - regulator-min-microvolt = <180>; > > - regulator-max-microvolt = <180>; > > - regulator-always-on; > > - }; > > - > > - vcc_sys: vcc-sys { > > - compatible = "regulator-fixed"; > > - regulator-name = "vcc_sys"; > > - regulator-min-microvolt = <500>; > > - regulator-max-microvolt = <500>; > > - regulator-always-on; > > - }; > > - > > - vcc3v3_sys: vcc3v3-sys { > > - compatible = "regulator-fixed"; > > - regulator-name = "vcc3v3_sys"; > > - regulator-min-microvolt = <330>; > > - regulator-max-microvolt = <330>; > > - regulator-always-on; > > - vin-supply = <&vcc_sys>; > > - }; > > - > > vcc3v3_pcie: vcc3v3-pcie-regulator { > > compatible = "regulator-fixed"; > > enable-active-high; > > @@ -75,46 +49,6 @@ > > regulator-always-on; > > vin-supply = <&vcc_sys>; > > }; > > - > > - vdd_log: vdd-log { > > - compatible = "pwm-regulator"; > > - pwms = <&pwm2 0 25000 0>; > > - regulator-name = "vdd_log"; > > - regulator-min-microvolt = <80>; > > - regulator-max-microvolt = <140>; > > - regulator-always-on; > > - regulator-boot-on; > > - vin-supply = 
<&vcc_sys>; > > - }; > > - > > -}; > > - > > -&cpu_l0 { > > - cpu-supply = <&vdd_cpu_l>; > > -}; > > - > > -&cpu_l1 { > > - cpu-supply = <&vdd_cpu_l>; > > -}; > > - > > -&cpu_l2 { > > - cpu-supply = <&vdd_cpu_l>; > > -}; > > - > > -&cpu_l3 { > > - cpu-supply = <&vdd_cpu_l>; > > -}; > > - > > -&cpu_b0 { > > - cpu-supply = <&vdd_cpu_b>; > > -}; > > - > > -&cpu_b1 { > > - cpu-supply = <&vdd_cpu_b>; > > -}; > > - > > -&emmc_phy { > > - status = "okay"; > > }; > > > > &gmac { > > @@ -133,263 +67,6 @@ > > status = "okay"; > > }; > > > > -&hdmi { > > - ddc-i2c-bus = <&i2c3>; > > - pinctrl-names = "default"; > > - pinctrl-0 = <&hdmi_cec>; > > - status = "okay"; > > -}; > > - > > -&i2c0 { > > - clock-frequency = <40>; > > - i2c-scl-rising-time-ns = <168>; > > - i2c-scl-falling-time-ns = <4>; > > - status = "okay"; > > - > > - vdd_cpu_b: regulator@40 { > > - compatible = "silergy,syr827"; > > - reg = <0x40>; > > - fcs,suspend-voltage-selector = <1>; > > - regulator-name = "vdd_cpu_b"; > > - regulator-min-microvolt = <712500>; > > - regulator-max-microvolt = <150>; > > - regulator-ramp-delay = <1000>; > > - regulator-always-on; > > - regulator-boot-on; > > -
Re: [PATCH] perf test: Add watchpoint test
On 09/10/2018 11:01 PM, Arnaldo Carvalho de Melo wrote: > Em Mon, Sep 10, 2018 at 11:18:30AM -0300, Arnaldo Carvalho de Melo escreveu: >> Em Mon, Sep 10, 2018 at 10:47:54AM -0300, Arnaldo Carvalho de Melo escreveu: >>> Em Mon, Sep 10, 2018 at 12:31:54PM +0200, Jiri Olsa escreveu: On Mon, Sep 10, 2018 at 03:28:11PM +0530, Ravi Bangoria wrote: > Ex on powerpc: > $ sudo ./perf test 22 > 22: Watchpoint: > 22.1: Read Only Watchpoint: Ok > 22.2: Write Only Watchpoint : Ok > 22.3: Read / Write Watchpoint : Ok > 22.4: Modify Watchpoint : Ok > cool, thanks! > Acked-by: Jiri Olsa > >>> Thanks, applied. > > Oops, fails when cross-building it to mips, I'll try to fix after lunch: Sorry for bit late reply. Will send v2 with the fix. Thanks Ravi > > 18 109.48 debian:experimental : Ok gcc (Debian 8.2.0-4) 8.2.0 > 1942.66 debian:experimental-x-arm64 : Ok aarch64-linux-gnu-gcc > (Debian 8.1.0-12) 8.1.0 > 2022.33 debian:experimental-x-mips: FAIL mips-linux-gnu-gcc (Debian > 8.1.0-12) 8.1.0 > 2120.05 debian:experimental-x-mips64 : FAIL mips64-linux-gnuabi64-gcc > (Debian 8.1.0-12) 8.1.0 > 2222.85 debian:experimental-x-mipsel : FAIL mipsel-linux-gnu-gcc > (Debian 8.1.0-12) 8.1.0 > > CC /tmp/build/perf/tests/bp_account.o > CC /tmp/build/perf/tests/wp.o > tests/wp.c:5:10: fatal error: arch-tests.h: No such file or directory > #include "arch-tests.h" > ^~ > compilation terminated. > mv: cannot stat '/tmp/build/perf/tests/.wp.o.tmp': No such file or directory > make[4]: *** [/git/linux/tools/build/Makefile.build:97: > /tmp/build/perf/tests/wp.o] Error 1 > make[4]: *** Waiting for unfinished jobs > CC /tmp/build/perf/util/record.o > CC /tmp/build/perf/util/srcline.o > make[3]: *** [/git/linux/tools/build/Makefile.build:139: tests] Error 2 > make[2]: *** [Makefile.perf:507: /tmp/build/perf/perf-in.o] Error 2 > make[2]: *** Waiting for unfinished jobs >
Re: [PATCH v2 2/2] dmaengine: uniphier-mdmac: add UniPhier MIO DMAC driver
Hi Vinod, 2018-09-11 16:00 GMT+09:00 Vinod : > On 24-08-18, 10:41, Masahiro Yamada wrote: > >> +/* mc->vc.lock must be held by caller */ >> +static u32 __uniphier_mdmac_get_residue(struct uniphier_mdmac_desc *md) >> +{ >> + u32 residue = 0; >> + int i; >> + >> + for (i = md->sg_cur; i < md->sg_len; i++) >> + residue += sg_dma_len(&md->sgl[i]); > > so if the descriptor is submitted to hardware, we return the descriptor > length, which is not correct. > > Two cases are required to be handled: > 1. Descriptor is in queue (IMO above logic is fine for that, but it can > be calculated at descriptor submit and looked up here) Where do you want it to be calculated? This hardware provides only simple registers (address and size) for one-shot transfer instead of descriptors. So, I used sgl as-is because I did not see a good reason to transform sgl to another data structure. > 2. Descriptor is running (interesting case), you need to read current > register and offset that from descriptor length and return OK, I will read out the register value to retrieve the residue from the on-flight transfer. >> +static struct dma_async_tx_descriptor *uniphier_mdmac_prep_slave_sg( >> + struct dma_chan *chan, >> + struct scatterlist *sgl, >> + unsigned int sg_len, >> + enum dma_transfer_direction direction, >> + unsigned long flags, void *context) >> +{ >> + struct virt_dma_chan *vc = to_virt_chan(chan); >> + struct uniphier_mdmac_desc *md; >> + >> + if (!is_slave_direction(direction)) >> + return NULL; >> + >> + md = kzalloc(sizeof(*md), GFP_KERNEL); > > _prep calls can be invoked from atomic context, so this should be > GFP_NOWAIT, see Documentation/driver-api/dmaengine/provider.rst Will fix. >> + if (!md) >> + return NULL; >> + >> + md->sgl = sgl; >> + md->sg_len = sg_len; >> + md->dir = direction; >> + >> + return vchan_tx_prep(vc, &md->vd, flags); > > this seems missing stuff. 
Where do you do register calculation for the
> descriptor and where is slave_config here, how do you know where to
> send/receive data from/to (peripheral)?

This DMAC is really simple and inflexible. The peripheral address to
send/receive data from/to is hard-wired; cfg->{src_addr,dst_addr} is not
configurable. Look at __uniphier_mdmac_handle(): 'dest_addr' and
'src_addr' must be set to 0 for the peripheral.

>> +static enum dma_status uniphier_mdmac_tx_status(struct dma_chan *chan,
>> +                                                dma_cookie_t cookie,
>> +                                                struct dma_tx_state *txstate)
>> +{
>> +        struct virt_dma_chan *vc;
>> +        struct virt_dma_desc *vd;
>> +        struct uniphier_mdmac_chan *mc;
>> +        struct uniphier_mdmac_desc *md = NULL;
>> +        enum dma_status stat;
>> +        unsigned long flags;
>> +
>> +        stat = dma_cookie_status(chan, cookie, txstate);
>> +        if (stat == DMA_COMPLETE)
>> +                return stat;
>> +
>> +        vc = to_virt_chan(chan);
>> +
>> +        spin_lock_irqsave(&vc->lock, flags);
>> +
>> +        mc = to_uniphier_mdmac_chan(vc);
>> +
>> +        if (mc->md && mc->md->vd.tx.cookie == cookie)
>> +                md = mc->md;
>> +
>> +        if (!md) {
>> +                vd = vchan_find_desc(vc, cookie);
>> +                if (vd)
>> +                        md = to_uniphier_mdmac_desc(vd);
>> +        }
>> +
>> +        if (md)
>> +                txstate->residue = __uniphier_mdmac_get_residue(md);
>
> txstate can be NULL and should be checked...

Will fix.

>> +static int uniphier_mdmac_probe(struct platform_device *pdev)
>> +{
>> +        struct device *dev = &pdev->dev;
>> +        struct uniphier_mdmac_device *mdev;
>> +        struct dma_device *ddev;
>> +        struct resource *res;
>> +        int nr_chans, ret, i;
>> +
>> +        nr_chans = platform_irq_count(pdev);
>> +        if (nr_chans < 0)
>> +                return nr_chans;
>> +
>> +        ret = dma_set_mask(dev, DMA_BIT_MASK(32));
>> +        if (ret)
>> +                return ret;
>> +
>> +        mdev = devm_kzalloc(dev, struct_size(mdev, channels, nr_chans),
>> +                            GFP_KERNEL);
>
> kcalloc variant?

No. I allocate here

    sizeof(*mdev) + nr_chans * sizeof(struct uniphier_mdmac_chan)

and kcalloc does not cater to that. You should check the struct_size()
helper macro.
>> + if (!mdev) >> + return -ENOMEM; >> + >> + res = platform_get_resource(pdev, IORESOURCE_MEM, 0); >> + mdev->reg_base = devm_ioremap_resource(dev, res); >> + if (IS_ERR(mdev->reg_base)) >> + return PTR_ERR(mdev->reg_base); >> + >> + mdev->clk = devm_clk_get(dev, NULL); >> + if (IS_ERR(mdev->clk)) { >> + dev_err(dev, "failed to get clock\n"); >> + return PTR_ERR(mdev->clk); >> + } >> + >> + ret = clk_prepare_enable(m
[PATCH -next] staging: mt7621-pci: Use PTR_ERR_OR_ZERO in mt7621_pcie_parse_dt()
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Signed-off-by: YueHaibing --- drivers/staging/mt7621-pci/pci-mt7621.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/staging/mt7621-pci/pci-mt7621.c b/drivers/staging/mt7621-pci/pci-mt7621.c index ba1f117..d2cb910 100644 --- a/drivers/staging/mt7621-pci/pci-mt7621.c +++ b/drivers/staging/mt7621-pci/pci-mt7621.c @@ -396,10 +396,7 @@ static int mt7621_pcie_parse_dt(struct mt7621_pcie *pcie) } pcie->base = devm_ioremap_resource(dev, ®s); - if (IS_ERR(pcie->base)) - return PTR_ERR(pcie->base); - - return 0; + return PTR_ERR_OR_ZERO(pcie->base); } static int mt7621_pcie_request_resources(struct mt7621_pcie *pcie,
[PATCH] drivers: pci: remove set but unused variable
This patch removes a set but unused variable in quirks.c. Fixes warning: variable ‘mmio_sys_info’ set but not used [-Wunused-but-set-variable] Signed-off-by: Joshua Abraham --- drivers/pci/quirks.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index ef7143a274e0..690a3b71aa1f 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -4993,7 +4993,6 @@ static void quirk_switchtec_ntb_dma_alias(struct pci_dev *pdev) void __iomem *mmio; struct ntb_info_regs __iomem *mmio_ntb; struct ntb_ctrl_regs __iomem *mmio_ctrl; - struct sys_info_regs __iomem *mmio_sys_info; u64 partition_map; u8 partition; int pp; @@ -5014,7 +5013,6 @@ static void quirk_switchtec_ntb_dma_alias(struct pci_dev *pdev) mmio_ntb = mmio + SWITCHTEC_GAS_NTB_OFFSET; mmio_ctrl = (void __iomem *) mmio_ntb + SWITCHTEC_NTB_REG_CTRL_OFFSET; - mmio_sys_info = mmio + SWITCHTEC_GAS_SYS_INFO_OFFSET; partition = ioread8(&mmio_ntb->partition_id); -- 2.17.1
Re: [WTF?] extremely old dead code
On Mon, Sep 10, 2018 at 1:55 PM Al Viro wrote: > > Hadn't that sucker been dead code since 0.98.2? What am I missing here? > Note that this thing had quite a few functionality changes over those > years; had they even been tested? Looks about right to me. The only point that actually acts on FIONBIO is the fs/ioctl.c code. Impressively, the dead tty code looks perfectly correct to me too, despite not ever being triggered. Linus
Re: [RFC v9 PATCH 2/4] mm: mmap: zap pages with read mmap_sem in munmap
On Tue, Sep 11, 2018 at 04:35:03PM -0700, Yang Shi wrote: > On 9/11/18 2:16 PM, Matthew Wilcox wrote: > > On Wed, Sep 12, 2018 at 04:58:11AM +0800, Yang Shi wrote: > > > mm/mmap.c | 97 > > > +-- > > I really think you're going about this the wrong way by duplicating > > vm_munmap(). > > If we don't duplicate vm_munmap() or do_munmap(), we need pass an extra > parameter to them to tell when it is fine to downgrade write lock or if the > lock has been acquired outside it (i.e. in mmap()/mremap()), right? But, > vm_munmap() or do_munmap() is called not only by mmap-related, but also some > other places, like arch-specific places, which don't need downgrade write > lock or are not safe to do so. > > Actually, I did this way in the v1 patches, but it got pushed back by tglx > who suggested duplicate the code so that the change could be done in mm only > without touching other files, i.e. arch-specific stuff. I didn't have strong > argument to convince him. With my patch, there is nothing to change in arch-specific code. Here it is again ... diff --git a/mm/mmap.c b/mm/mmap.c index de699523c0b7..06dc31d1da8c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2798,11 +2798,11 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, * work. This now handles partial unmappings. * Jeremy Fitzhardinge */ -int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, - struct list_head *uf) +static int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, + struct list_head *uf, bool downgrade) { unsigned long end; - struct vm_area_struct *vma, *prev, *last; + struct vm_area_struct *vma, *prev, *last, *tmp; if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start) return -EINVAL; @@ -2816,7 +2816,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, if (!vma) return 0; prev = vma->vm_prev; - /* we have start < vma->vm_end */ + /* we have start < vma->vm_end */ /* if it doesn't overlap, we have nothing.. 
*/ end = start + len; @@ -2873,18 +2873,22 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, /* * unlock any mlock()ed ranges before detaching vmas +* and check to see if there's any reason we might have to hold +* the mmap_sem write-locked while unmapping regions. */ - if (mm->locked_vm) { - struct vm_area_struct *tmp = vma; - while (tmp && tmp->vm_start < end) { - if (tmp->vm_flags & VM_LOCKED) { - mm->locked_vm -= vma_pages(tmp); - munlock_vma_pages_all(tmp); - } - tmp = tmp->vm_next; + for (tmp = vma; tmp && tmp->vm_start < end; tmp = tmp->vm_next) { + if (tmp->vm_flags & VM_LOCKED) { + mm->locked_vm -= vma_pages(tmp); + munlock_vma_pages_all(tmp); } + if (tmp->vm_file && + has_uprobes(tmp, tmp->vm_start, tmp->vm_end)) + downgrade = false; } + if (downgrade) + downgrade_write(&mm->mmap_sem); + /* * Remove the vma's, and unmap the actual pages */ @@ -2896,7 +2900,13 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, /* Fix up all other VM information */ remove_vma_list(mm, vma); - return 0; + return downgrade ? 1 : 0; +} + +int do_unmap(struct mm_struct *mm, unsigned long start, size_t len, + struct list_head *uf) +{ + return __do_munmap(mm, start, len, uf, false); } int vm_munmap(unsigned long start, size_t len) @@ -2905,11 +2915,12 @@ int vm_munmap(unsigned long start, size_t len) struct mm_struct *mm = current->mm; LIST_HEAD(uf); - if (down_write_killable(&mm->mmap_sem)) - return -EINTR; - - ret = do_munmap(mm, start, len, &uf); - up_write(&mm->mmap_sem); + down_write(&mm->mmap_sem); + ret = __do_munmap(mm, start, len, &uf, true); + if (ret == 1) + up_read(&mm->mmap_sem); + else + up_write(&mm->mmap_sem); userfaultfd_unmap_complete(mm, &uf); return ret; } Anybody calling do_munmap() will not get the lock dropped. > And, Michal prefers have VM_HUGETLB and VM_PFNMAP handled separately for > safe and bisectable sake, which needs call the regular do_munmap(). That can be introduced and then taken out ... 
indeed, you can split this into many patches, starting with this:

+	if (tmp->vm_file)
+		downgrade = false;

to only allow this optimisation for anonymous mappings at first.

> In addition to this,
Re: [PATCH 06/11] dts: arm: imx7{d,s}: Update coresight binding for hardware ports
On Tue, Sep 11, 2018 at 11:17:07AM +0100, Suzuki K Poulose wrote: > Switch to the updated coresight bindings. > > Cc: Shawn Guo > Cc: Sascha Hauer > Cc: Pengutronix Kernel Team > Cc: Fabio Estevam > Cc: Mathieu Poirier > Signed-off-by: Suzuki K Poulose As per the convention we use for subject prefix, I suggest you use 'ARM: dts: imx7: ...' Shawn > --- > arch/arm/boot/dts/imx7d.dtsi | 11 --- > arch/arm/boot/dts/imx7s.dtsi | 78 > ++-- > 2 files changed, 53 insertions(+), 36 deletions(-) > > diff --git a/arch/arm/boot/dts/imx7d.dtsi b/arch/arm/boot/dts/imx7d.dtsi > index 7cbc2ff..4ced17c 100644 > --- a/arch/arm/boot/dts/imx7d.dtsi > +++ b/arch/arm/boot/dts/imx7d.dtsi > @@ -63,9 +63,11 @@ > clocks = <&clks IMX7D_MAIN_AXI_ROOT_CLK>; > clock-names = "apb_pclk"; > > - port { > - etm1_out_port: endpoint { > - remote-endpoint = <&ca_funnel_in_port1>; > + out-ports { > + port { > + etm1_out_port: endpoint { > + remote-endpoint = > <&ca_funnel_in_port1>; > + }; > }; > }; > }; > @@ -148,11 +150,10 @@ > }; > }; > > -&ca_funnel_ports { > +&ca_funnel_in_ports { > port@1 { > reg = <1>; > ca_funnel_in_port1: endpoint { > - slave-mode; > remote-endpoint = <&etm1_out_port>; > }; > }; > diff --git a/arch/arm/boot/dts/imx7s.dtsi b/arch/arm/boot/dts/imx7s.dtsi > index a052198..9176885 100644 > --- a/arch/arm/boot/dts/imx7s.dtsi > +++ b/arch/arm/boot/dts/imx7s.dtsi > @@ -106,7 +106,7 @@ >*/ > compatible = "arm,coresight-replicator"; > > - ports { > + out-ports { > #address-cells = <1>; > #size-cells = <0>; > /* replicator output ports */ > @@ -123,12 +123,15 @@ > remote-endpoint = <&etr_in_port>; > }; > }; > + }; > > - /* replicator input port */ > - port@2 { > + in-ports { > + #address-cells = <1>; > + #size-cells = <0>; > + > + port@0 { > reg = <0>; > replicator_in_port0: endpoint { > - slave-mode; > remote-endpoint = <&etf_out_port>; > }; > }; > @@ -168,28 +171,31 @@ > clocks = <&clks IMX7D_MAIN_AXI_ROOT_CLK>; > clock-names = "apb_pclk"; > > - ca_funnel_ports: ports { > + 
ca_funnel_in_ports: in-ports { > #address-cells = <1>; > #size-cells = <0>; > > - /* funnel input ports */ > port@0 { > reg = <0>; > ca_funnel_in_port0: endpoint { > - slave-mode; > remote-endpoint = > <&etm0_out_port>; > }; > }; > > - /* funnel output port */ > - port@2 { > + /* the other input ports are not connect to > anything */ > + }; > + > + out-ports { > + #address-cells = <1>; > + #size-cells = <0>; > + > + port@0 { > reg = <0>; > ca_funnel_out_port0: endpoint { > remote-endpoint = > <&hugo_funnel_in_port0>; > }; > }; > > - /* the other input ports are not connect to > anything */ > }; > }; > > @@ -200,9 +206,11 @@ > clocks = <&clks IMX7D_MAIN_AXI_ROOT_CLK>; > clock-names = "apb_pclk"; > > - port { > - etm0_out_port: endpoint { > - remote-endpoint = <&ca_funnel_in_port0>; > + out-ports { > + port { > + etm
Re: [PATCH v2 3/6] drivers: qcom: rpmh: disallow active requests in solver mode
On Tue, Sep 11 2018 at 17:02 -0600, Matthias Kaehlcke wrote: Hi Raju/Lina, On Fri, Jul 27, 2018 at 03:34:46PM +0530, Raju P L S S S N wrote: From: Lina Iyer Controllers may be in 'solver' state, where they could be in autonomous mode executing low power modes for their hardware and as such are not available for sending active votes. Device driver may notify RPMH API that the controller is in solver mode and when in such mode, disallow requests from platform drivers for state change using the RSC. Signed-off-by: Lina Iyer Signed-off-by: Raju P.L.S.S.S.N --- drivers/soc/qcom/rpmh-internal.h | 2 ++ drivers/soc/qcom/rpmh.c | 59 include/soc/qcom/rpmh.h | 5 3 files changed, 66 insertions(+) diff --git a/drivers/soc/qcom/rpmh-internal.h b/drivers/soc/qcom/rpmh-internal.h index 4ff43bf..6cd2f78 100644 --- a/drivers/soc/qcom/rpmh-internal.h +++ b/drivers/soc/qcom/rpmh-internal.h @@ -72,12 +72,14 @@ struct rpmh_request { * @cache_lock: synchronize access to the cache data * @dirty: was the cache updated since flush * @batch_cache: Cache sleep and wake requests sent as batch + * @in_solver_mode: Controller is busy in solver mode */ struct rpmh_ctrlr { struct list_head cache; spinlock_t cache_lock; bool dirty; struct list_head batch_cache; + bool in_solver_mode; }; /** diff --git a/drivers/soc/qcom/rpmh.c b/drivers/soc/qcom/rpmh.c index 2382276..0d276fd 100644 --- a/drivers/soc/qcom/rpmh.c +++ b/drivers/soc/qcom/rpmh.c @@ -5,6 +5,7 @@ #include #include +#include #include #include #include @@ -75,6 +76,50 @@ static struct rpmh_ctrlr *get_rpmh_ctrlr(const struct device *dev) return &drv->client; } +static int check_ctrlr_state(struct rpmh_ctrlr *ctrlr, enum rpmh_state state) +{ + unsigned long flags; + int ret = 0; + + /* Do not allow setting active votes when in solver mode */ + spin_lock_irqsave(&ctrlr->cache_lock, flags); + if (ctrlr->in_solver_mode && state == RPMH_ACTIVE_ONLY_STATE) + ret = -EBUSY; + spin_unlock_irqrestore(&ctrlr->cache_lock, flags); + + return ret; +} + 
+/**
+ * rpmh_mode_solver_set: Indicate that the RSC controller hardware has
+ * been configured to be in solver mode
+ *
+ * @dev: the device making the request
+ * @enable: Boolean value indicating if the controller is in solver mode.
+ *
+ * When solver mode is enabled, passthru API will not be able to send wake
+ * votes, just awake and active votes.
+ */
+int rpmh_mode_solver_set(const struct device *dev, bool enable)
+{
+	struct rpmh_ctrlr *ctrlr = get_rpmh_ctrlr(dev);
+	unsigned long flags;
+
+	for (;;) {
+		spin_lock_irqsave(&ctrlr->cache_lock, flags);
+		if (rpmh_rsc_ctrlr_is_idle(ctrlr_to_drv(ctrlr))) {
+			ctrlr->in_solver_mode = enable;

As commented on '[v2,1/6] drivers: qcom: rpmh-rsc: return if the controller
is idle', this seems potentially racy. _is_idle() could report the
controller as idle, even though some TCSes are in use (after _is_idle()
visited them). Additional locking may be needed, or a comment if this
situation should never happen on a sane system (I don't know enough about
RPMh and its clients to judge if this is the case).

Hmm.. Forgot that we call from here. Maybe a lock might be helpful.

-- Lina
[PATCH v3] iio: proximity: Add driver support for ST's VL53L0X ToF ranging sensor.
This driver was originally written by ST in 2016 as a misc input device
driver, and hasn't been maintained for a long time. I grabbed some code
from its API and reworked it into an iio proximity device driver.

This version of the driver uses the i2c bus to talk to the sensor and
polls for measurement completion, so no irq line is needed. It supports
only one-shot mode, and it can be tested by reading from
/sys/bus/iio/devices/iio:deviceX/in_distance_raw

Signed-off-by: Song Qiang
---
Changes in v2:
	- Clean up the register table.
	- Sort header file declarations.
	- Replace some bit definitions with GENMASK() and BIT().
	- Clean up some code and comments that are useless for now.
	- Change the order of the definitions of some variables to
	  reversed xmas tree order.
	- Use do...while() rather than while and check.
	- Replace pr_err() with dev_err().
	- Remove the device id declaration since we recommend using DT.
	- Remove .owner = THIS_MODULE.
	- Replace the probe() with the probe_new() hook.
	- Remove the IIO_BUFFER and IIO_TRIGGERED_BUFFER dependences.
	- Change the driver module name to vl53l0x-i2c.
	- Align all the parameters if they are in the same function with
	  open parentheses.
	- Replace iio_device_register() with devm_iio_device_register()
	  for better resource management.
	- Remove vl53l0x_remove() since it's not needed.
	- Remove dev_set_drvdata() since it's already set above.

Changes in v3:
	- Recover ST's copyright.
	- Clean up the indio_dev member in the vl53l0x_data struct since
	  it's useless now.
	- Replace __le16_to_cpu() with le16_to_cpu().
	- Remove iio_device_{claim|release}_direct_mode() since it's only
	  needed when we use buffered mode.
	- Clean up some coding style problems.
.../bindings/iio/proximity/vl53l0x.txt| 12 ++ drivers/iio/proximity/Kconfig | 11 ++ drivers/iio/proximity/Makefile| 2 + drivers/iio/proximity/vl53l0x-i2c.c | 184 ++ 4 files changed, 209 insertions(+) create mode 100644 Documentation/devicetree/bindings/iio/proximity/vl53l0x.txt create mode 100644 drivers/iio/proximity/vl53l0x-i2c.c diff --git a/Documentation/devicetree/bindings/iio/proximity/vl53l0x.txt b/Documentation/devicetree/bindings/iio/proximity/vl53l0x.txt new file mode 100644 index ..64b69442f08e --- /dev/null +++ b/Documentation/devicetree/bindings/iio/proximity/vl53l0x.txt @@ -0,0 +1,12 @@ +ST's VL53L0X ToF ranging sensor + +Required properties: + - compatible: must be "st,vl53l0x-i2c" + - reg: i2c address where to find the device + +Example: + +vl53l0x@29 { + compatible = "st,vl53l0x-i2c"; + reg = <0x29>; +}; diff --git a/drivers/iio/proximity/Kconfig b/drivers/iio/proximity/Kconfig index f726f9427602..5f421cbd37f3 100644 --- a/drivers/iio/proximity/Kconfig +++ b/drivers/iio/proximity/Kconfig @@ -79,4 +79,15 @@ config SRF08 To compile this driver as a module, choose M here: the module will be called srf08. +config VL53L0X_I2C + tristate "STMicroelectronics VL53L0X ToF ranger sensor (I2C)" + depends on I2C + help + Say Y here to build a driver for STMicroelectronics VL53L0X + ToF ranger sensors with i2c interface. + This driver can be used to measure the distance of objects. + + To compile this driver as a module, choose M here: the + module will be called vl53l0x-i2c. 
+ endmenu diff --git a/drivers/iio/proximity/Makefile b/drivers/iio/proximity/Makefile index 4f4ed45e87ef..dedfb5bf3475 100644 --- a/drivers/iio/proximity/Makefile +++ b/drivers/iio/proximity/Makefile @@ -10,3 +10,5 @@ obj-$(CONFIG_RFD77402)+= rfd77402.o obj-$(CONFIG_SRF04)+= srf04.o obj-$(CONFIG_SRF08)+= srf08.o obj-$(CONFIG_SX9500) += sx9500.o +obj-$(CONFIG_VL53L0X_I2C) += vl53l0x-i2c.o + diff --git a/drivers/iio/proximity/vl53l0x-i2c.c b/drivers/iio/proximity/vl53l0x-i2c.c new file mode 100644 index ..0f7f124a38ed --- /dev/null +++ b/drivers/iio/proximity/vl53l0x-i2c.c @@ -0,0 +1,184 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Support for ST's VL53L0X FlightSense ToF Ranger Sensor on a i2c bus. + * + * Copyright (C) 2016 STMicroelectronics Imaging Division. + * Copyright (C) 2018 Song Qiang + */ + +#include +#include +#include +#include +#include +#include + +#define VL53L0X_DRV_NAME "vl53l0x-i2c" + +#define VL_REG_SYSRANGE_MODE_MASK GENMASK(3, 0) +#define VL_REG_SYSRANGE_START 0x00 +#define VL_REG_SYSRANGE_MODE_SINGLESHOT0x00 +#define VL_REG_SYSRANGE_MODE_START_STOPBIT(0) +#define VL_REG_SYSRANGE_MODE_BACKTOBACKBIT(1) +#
Re: [PATCH v2 1/6] drivers: qcom: rpmh-rsc: return if the controller is idle
On Tue, Sep 11 2018 at 16:39 -0600, Matthias Kaehlcke wrote: Hi Raju/Lina, On Fri, Jul 27, 2018 at 03:34:44PM +0530, Raju P L S S S N wrote: From: Lina Iyer Allow the controller status be queried. The controller is busy if it is actively processing request. Signed-off-by: Lina Iyer Signed-off-by: Raju P.L.S.S.S.N --- Changes in v2: - Remove unnecessary EXPORT_SYMBOL --- drivers/soc/qcom/rpmh-internal.h | 1 + drivers/soc/qcom/rpmh-rsc.c | 20 2 files changed, 21 insertions(+) diff --git a/drivers/soc/qcom/rpmh-internal.h b/drivers/soc/qcom/rpmh-internal.h index a76..4ff43bf 100644 --- a/drivers/soc/qcom/rpmh-internal.h +++ b/drivers/soc/qcom/rpmh-internal.h @@ -108,6 +108,7 @@ struct rsc_drv { int rpmh_rsc_write_ctrl_data(struct rsc_drv *drv, const struct tcs_request *msg); int rpmh_rsc_invalidate(struct rsc_drv *drv); +bool rpmh_rsc_ctrlr_is_idle(struct rsc_drv *drv); void rpmh_tx_done(const struct tcs_request *msg, int r); diff --git a/drivers/soc/qcom/rpmh-rsc.c b/drivers/soc/qcom/rpmh-rsc.c index 33fe9f9..42d0041 100644 --- a/drivers/soc/qcom/rpmh-rsc.c +++ b/drivers/soc/qcom/rpmh-rsc.c @@ -496,6 +496,26 @@ static int tcs_ctrl_write(struct rsc_drv *drv, const struct tcs_request *msg) } /** + * rpmh_rsc_ctrlr_is_idle: Check if any of the AMCs are busy. + * + * @drv: The controller + * + * Returns true if the TCSes are engaged in handling requests. + */ +bool rpmh_rsc_ctrlr_is_idle(struct rsc_drv *drv) +{ + int m; + struct tcs_group *tcs = get_tcs_of_type(drv, ACTIVE_TCS); + + for (m = tcs->offset; m < tcs->offset + tcs->num_tcs; m++) { + if (!tcs_is_free(drv, m)) + return false; + } + + return true; +} This looks racy, tcs_write() could be running simultaneously and use TCSes that were seen as free by _is_idle(). This could be fixed by holding tcs->lock (assuming this doesn't cause lock ordering problems). However even with this tcs_write() could run right after releasing the lock, using TCSes and the caller of _is_idle() would consider the controller to be idle. 
We could run this without the lock, since we are only reading a status.
Generally, this function is called from the idle code of the last CPU, and
no CPU or active TCS request should be in progress; but if one were, then
this function would let the caller know we are not ready to enter idle. If
no requests were running at the time we read the registers, we would not
be making one after, since we are already in the idle code and no requests
are made there. I understand how it might appear racy; the context of the
calling function helps resolve that.

-- Lina
[Question] Are the trace APIs declared by "TRACE_EVENT(irq_handler_entry" allowed to be used in Ko?
After patch 7e066fb870fc ("tracepoints: add DECLARE_TRACE() and
DEFINE_TRACE()"), the trace APIs declared by "TRACE_EVENT(irq_handler_entry"
cannot be used directly by a kernel module (.ko), because they are not
explicitly exported with EXPORT_TRACEPOINT_SYMBOL_GPL or
EXPORT_TRACEPOINT_SYMBOL. Did we miss it? Or is it simply not recommended
to use them from a module?

-----------------------------------------------------------
commit 7e066fb870fcd1025ec3ba7bbde5d541094f4ce1
Author: Mathieu Desnoyers
Date:   Fri Nov 14 17:47:47 2008 -0500

    tracepoints: add DECLARE_TRACE() and DEFINE_TRACE()

    Impact: API *CHANGE*. Must update all tracepoint users.

    Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint
    structure in a single spot for all the kernel. It helps reducing
    memory consumption, especially when declaring a lot of tracepoints,
    e.g. for kmalloc tracing.

    *API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers
    for tracepoint declarations rather than DEFINE_TRACE(). This is the
    sane way to do it. The name previously used was misleading.

    Updates scheduler instrumentation to follow this API change.

--
Thanks!
Best Regards
Re: Question: How to switch a process namespace by nsfs "device" and inode number directly?
Thank you, Andi!

Yes, that's a situation, and an important one I guess. Another case is
that a process running inside a container has exited but the container is
still alive. I think this is also a common case.

The potential fix solutions I am thinking of are the following:

- Using the nsfs "device" and inum. This is why I am asking for your help,
  as we already have the nsfs "device" and inum of each thread at least.

- If the current thread has exited, it's probable that the parent thread
  and the leader thread of that container are still alive. If we could
  get those threads' pids, then we could use setns.

If the first item is not doable, I would like to try the second one.

Thanks,
Chengdong

On 2018/9/11 at 12:02 AM, Andi Kleen wrote:

On Mon, Sep 10, 2018 at 04:50:42PM +0800, Chengdong Li wrote:

Hi folks, I am getting stuck by the lack of an approach to switch process
namespace by nsfs "device" and inode number in user-space, for example
(mnt: 0xf000). From my best understanding, the normal way to do that is
the setns system call. But setns only accepts an fd that refers to an
opened namespace, and sometimes we can't get one. For example: after perf
record, perf report can't work well once the process that runs inside a
container has exited, as /proc/pid/ns doesn't exist anymore after process
exit.

The kernel name space doesn't exist anymore at this point, so there is
simply no way to reconstruct it.

Perhaps would need some higher level side band data for perf, similar as
what is done for JITed code. Somehow the container run time needs to tell
perf where to find the code.

-Andi
Re: [PATCH i2c-next v6] i2c: aspeed: Handle master/slave combined irq events properly
On Tue, Sep 11, 2018 at 04:58:44PM -0700, Jae Hyun Yoo wrote:
> On 9/11/2018 4:33 PM, Guenter Roeck wrote:
> > Looking into the patch, clearing the interrupt status at the end of an
> > interrupt handler is always suspicious and tends to result in race
> > conditions (because additional interrupts may have arrived while
> > handling the existing interrupts, or because interrupt handling itself
> > may trigger another interrupt). With that in mind, the following patch
> > fixes the problem for me.
> >
> > Guenter
> >
> > ---
> >
> > diff --git a/drivers/i2c/busses/i2c-aspeed.c b/drivers/i2c/busses/i2c-aspeed.c
> > index c258c4d9a4c0..c488e6950b7c 100644
> > --- a/drivers/i2c/busses/i2c-aspeed.c
> > +++ b/drivers/i2c/busses/i2c-aspeed.c
> > @@ -552,6 +552,8 @@ static irqreturn_t aspeed_i2c_bus_irq(int irq, void *dev_id)
> > 	spin_lock(&bus->lock);
> > 	irq_received = readl(bus->base + ASPEED_I2C_INTR_STS_REG);
> > +	/* Ack all interrupt bits. */
> > +	writel(irq_received, bus->base + ASPEED_I2C_INTR_STS_REG);
> > 	irq_remaining = irq_received;
> > #if IS_ENABLED(CONFIG_I2C_SLAVE)
> > @@ -584,8 +586,6 @@ static irqreturn_t aspeed_i2c_bus_irq(int irq, void *dev_id)
> > 		"irq handled != irq. expected 0x%08x, but was 0x%08x\n",
> > 		irq_received, irq_handled);
> > -	/* Ack all interrupt bits. */
> > -	writel(irq_received, bus->base + ASPEED_I2C_INTR_STS_REG);
> > 	spin_unlock(&bus->lock);
> > 	return irq_remaining ? IRQ_NONE : IRQ_HANDLED;
> > }
>
> My intention of putting the code at the end of the interrupt handler was
> to reduce the possibility of combined irq calls, which is explained in
> this patch. But YES, I agree with you. It could make a potential race

Hmm, yes, but that doesn't explain why it would make sense to acknowledge
the interrupt late. The interrupt ack only means "I am going to handle
these interrupts". If additional interrupts arrive while the interrupt
handler is active, those will have to be acknowledged separately.
Sure, there is a risk that an interrupt arrives while the handler is
running, and that it is handled but not acknowledged. That can happen with
pretty much all interrupt handlers, and there are mitigations to limit the
impact (for example, read the interrupt status register in a loop until no
more interrupts are pending). But acknowledging an interrupt that was
possibly not handled is always a bad idea.

Thanks,
Guenter
Re: [PATCH v3] ARM: dts: imx6ul: Add DTS for ConnectCore 6UL SBC Pro
On Mon, Sep 10, 2018 at 11:37:52AM +0200, Alex Gonzalez wrote: > The ConnectCore 6UL Single Board Computer (SBC) Pro contains the > ConnectCore 6UL System-On-Module. > > Its hardware specifications are: > > * 256MB DDR3 memory > * On module 256MB NAND flash > * Dual 10/100 Ethernet > * USB Host and USB OTG > * Parallel RGB display header > * LVDS display header > * CSI camera > * GPIO header > * I2C, SPI, CAN headers > * PCIe mini card and micro SIM slot > * MicroSD external storage > * On board 4GB eMMC flash > * Audio headphone, line in/out, microphone lines > > Signed-off-by: Alex Gonzalez Applied, thanks.
Re: [PATCH v9 6/6] ARM: dts: imx6: RIoTboard provide standby on power off option
On Thu, Aug 02, 2018 at 12:34:25PM +0200, Oleksij Rempel wrote: > This board, as well as some other boards with i.MX6 and a PMIC, uses a > "PMIC_STBY_REQ" line to notify the PMIC about a state change. > The PMIC is programmed for a specific state change before triggering the > line. > In this case, PMIC_STBY_REQ can be used for stand by, sleep > and power off modes. > > Signed-off-by: Oleksij Rempel Applied, thanks.
Re: KASAN: use-after-free Read in cma_bind_port
syzbot has found a reproducer for the following crash on: HEAD commit:11da3a7f84f1 Linux 4.19-rc3 git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=11766c6940 kernel config: https://syzkaller.appspot.com/x/.config?x=9917ff4b798e1a1e dashboard link: https://syzkaller.appspot.com/bug?extid=da2591e115d57a9cbb8b compiler: gcc (GCC) 8.0.1 20180413 (experimental) syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1686969e40 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+da2591e115d57a9cb...@syzkaller.appspotmail.com 8021q: adding VLAN 0 to HW filter on device team0 8021q: adding VLAN 0 to HW filter on device team0 8021q: adding VLAN 0 to HW filter on device team0 hrtimer: interrupt took 34369 ns == BUG: KASAN: use-after-free in cma_bind_port+0x35d/0x420 drivers/infiniband/core/cma.c:3059 Read of size 2 at addr 8801b7b056a0 by task syz-executor3/7271 CPU: 1 PID: 7271 Comm: syz-executor3 Not tainted 4.19.0-rc3+ #231 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412 __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:431 cma_bind_port+0x35d/0x420 drivers/infiniband/core/cma.c:3059 cma_alloc_port+0x115/0x180 drivers/infiniband/core/cma.c:3095 cma_alloc_any_port drivers/infiniband/core/cma.c:3160 [inline] cma_get_port drivers/infiniband/core/cma.c:3314 [inline] rdma_bind_addr+0x1765/0x23d0 drivers/infiniband/core/cma.c:3434 cma_bind_addr drivers/infiniband/core/cma.c:2963 [inline] rdma_resolve_addr+0x4e2/0x2770 drivers/infiniband/core/cma.c:2974 ucma_resolve_ip+0x242/0x2a0 drivers/infiniband/core/ucma.c:711 ucma_write+0x336/0x420 drivers/infiniband/core/ucma.c:1680 __vfs_write+0x119/0x9f0 
fs/read_write.c:485 vfs_write+0x1fc/0x560 fs/read_write.c:549 ksys_write+0x101/0x260 fs/read_write.c:598 __do_sys_write fs/read_write.c:610 [inline] __se_sys_write fs/read_write.c:607 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:607 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4572d9 Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7f71074b0c78 EFLAGS: 0246 ORIG_RAX: 0001 RAX: ffda RBX: 7f71074b16d4 RCX: 004572d9 RDX: 0048 RSI: 2240 RDI: 0008 RBP: 009300a0 R08: R09: R10: R11: 0246 R12: R13: 004d83c0 R14: 004c1e90 R15: Allocated by task 7285: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553 kmem_cache_alloc_trace+0x152/0x750 mm/slab.c:3620 kmalloc include/linux/slab.h:513 [inline] kzalloc include/linux/slab.h:707 [inline] __rdma_create_id+0xdf/0x790 drivers/infiniband/core/cma.c:782 ucma_create_id+0x39b/0x990 drivers/infiniband/core/ucma.c:502 ucma_write+0x336/0x420 drivers/infiniband/core/ucma.c:1680 __vfs_write+0x119/0x9f0 fs/read_write.c:485 vfs_write+0x1fc/0x560 fs/read_write.c:549 ksys_write+0x101/0x260 fs/read_write.c:598 __do_sys_write fs/read_write.c:610 [inline] __se_sys_write fs/read_write.c:607 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:607 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 7261: save_stack+0x43/0xd0 mm/kasan/kasan.c:448 set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528 __cache_free mm/slab.c:3498 [inline] kfree+0xcf/0x230 mm/slab.c:3813 rdma_destroy_id+0x835/0xcc0 drivers/infiniband/core/cma.c:1737 ucma_close+0x100/0x300 drivers/infiniband/core/ucma.c:1759 __fput+0x385/0xa30 
fs/file_table.c:278 fput+0x15/0x20 fs/file_table.c:309 task_work_run+0x1e8/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:193 [inline] exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166 prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline] syscall_return_slowpath arch/x86/entry/common.c:268 [inline] do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at 8801b7b05680 which belongs to the cache kmalloc-2048 of size 2048 The buggy address is located 32 bytes inside of 2048-byte region [8801b
Re: [PATCH v9 2/6] ARM: imx6: register pm_power_off handler if "fsl,pmic-stby-poweroff" is set
On Thu, Aug 02, 2018 at 12:34:21PM +0200, Oleksij Rempel wrote: > One of the Freescale recommended sequences for power off with external > PMIC is the following: > ... > 3. SoC is programming PMIC for power off when standby is asserted. > 4. In CCM STOP mode, Standby is asserted, PMIC gates SoC supplies. > > See: > http://www.nxp.com/assets/documents/data/en/reference-manuals/IMX6DQRM.pdf > page 5083 > > This patch implements step 4. of this sequence. > > Signed-off-by: Oleksij Rempel Applied, thanks.
Re: [PATCH v3 3/3] drivers: soc: xilinx: Add ZynqMP PM driver
Hi!

[ Thanks a lot for upstreaming this.. ]

On Tue, Sep 11, 2018 at 02:34:57PM -0700, Jolly Shah wrote:
> From: Rajan Vaja
>
> Add ZynqMP PM driver. PM driver provides power management
> support for ZynqMP.
>
> Signed-off-by: Rajan Vaja
> Signed-off-by: Jolly Shah
> ---

[...]

> +static irqreturn_t zynqmp_pm_isr(int irq, void *data)
> +{
> +	u32 payload[CB_PAYLOAD_SIZE];
> +
> +	zynqmp_pm_get_callback_data(payload);
> +
> +	/* First element is callback API ID, others are callback arguments */
> +	if (payload[0] == PM_INIT_SUSPEND_CB) {
> +		if (work_pending(&zynqmp_pm_init_suspend_work->callback_work))
> +			goto done;
> +
> +		/* Copy callback arguments into work's structure */
> +		memcpy(zynqmp_pm_init_suspend_work->args, &payload[1],
> +		       sizeof(zynqmp_pm_init_suspend_work->args));
> +
> +		queue_work(system_unbound_wq,
> +			   &zynqmp_pm_init_suspend_work->callback_work);

We already have devm_request_threaded_irq(), which can do this
automatically for us. Use that method to register the ISR instead; then
if there's more work to do, just do the memcpy and return IRQ_WAKE_THREAD.

> +	}
> +
> +done:
> +	return IRQ_HANDLED;
> +}
> +
> +/**
> + * zynqmp_pm_init_suspend_work_fn() - Initialize suspend
> + * @work:	Pointer to work_struct
> + *
> + * Bottom-half of PM callback IRQ handler.
> + */
> +static void zynqmp_pm_init_suspend_work_fn(struct work_struct *work)
> +{
> +	struct zynqmp_pm_work_struct *pm_work =
> +		container_of(work, struct zynqmp_pm_work_struct, callback_work);
> +
> +	if (pm_work->args[0] == ZYNQMP_PM_SUSPEND_REASON_SYSTEM_SHUTDOWN) {

we_really_seem_to_love_long_40_col_names_for_some_reason

> +		orderly_poweroff(true);
> +	} else if (pm_work->args[0] ==
> +		   ZYNQMP_PM_SUSPEND_REASON_POWER_UNIT_REQUEST) {

Ditto

[...]
> +/** > + * zynqmp_pm_sysfs_init() - Initialize PM driver sysfs interface > + * @dev: Pointer to device structure > + * > + * Return: 0 on success, negative error code otherwise > + */ > +static int zynqmp_pm_sysfs_init(struct device *dev) > +{ > + return sysfs_create_file(&dev->kobj, &dev_attr_suspend_mode.attr); > +} > + The sysfs file is created in the platform driver's probe(), but is not removed anywhere in the code. What happens if this is built as a module? Am I missing something obvious? Moreover, what's the wisdom of creating a one-liner function with a huge six-line comment that: a) _purely_ wraps sysfs_create_file(); no extra logic b) is called only once c) and is not passed as a function pointer anywhere IMO such one-liner translators obfuscate the code and review process with no apparent gain.. > +/** > + * zynqmp_pm_probe() - Probe existence of the PMU Firmware > + * and initialize debugfs interface > + * > + * @pdev:Pointer to the platform_device structure > + * > + * Return: Returns 0 on success, negative error code otherwise > + */ Again, a huge 8-line comment that provides no value. If anyone wants to know what a platform driver probe() does, he or she had better check the primary references at: - Documentation/driver-model/platform.txt - include/linux/platform_device.h and not the comment above.. 
> +static int zynqmp_pm_probe(struct platform_device *pdev) > +{ > + int ret, irq; > + u32 pm_api_version; > + > + const struct zynqmp_eemi_ops *eemi_ops = zynqmp_pm_get_eemi_ops(); > + > + if (!eemi_ops || !eemi_ops->get_api_version || !eemi_ops->init_finalize) > + return -ENXIO; > + > + eemi_ops->init_finalize(); > + eemi_ops->get_api_version(&pm_api_version); > + > + /* Check PM API version number */ > + if (pm_api_version < ZYNQMP_PM_VERSION) > + return -ENODEV; > + > + irq = platform_get_irq(pdev, 0); > + if (irq <= 0) > + return -ENXIO; > + > + ret = devm_request_irq(&pdev->dev, irq, zynqmp_pm_isr, IRQF_SHARED, > +dev_name(&pdev->dev), pdev); > + if (ret) { > + dev_err(&pdev->dev, "request_irq '%d' failed with %d\n", > + irq, ret); > + return ret; > + } > + > + zynqmp_pm_init_suspend_work = > + devm_kzalloc(&pdev->dev, sizeof(struct zynqmp_pm_work_struct), > + GFP_KERNEL); > + if (!zynqmp_pm_init_suspend_work) > + return -ENOMEM; > + > + INIT_WORK(&zynqmp_pm_init_suspend_work->callback_work, > + zynqmp_pm_init_suspend_work_fn); > + Let's use devm_request_threaded_irq(). Then we can completely remove the work_struct, INIT_WORK(), and queue_work() bits. > + ret = zynqmp_pm_sysfs_init(&pdev->dev); > + if (ret) { > + dev_err(&pdev->dev, "unable to initialize sysfs interface\n"); > + return ret; > + } > + > + return ret; Just return 0 please. BTW ret was declare
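To make the suggestion above concrete, here is a rough sketch of what registering a threaded handler with devm_request_threaded_irq() could look like. This is illustrative only, not the actual patch: the zynqmp_* identifiers come from the code under review, while the per-device `pm_work` pointer passed as dev_id and the thread function are hypothetical.

```c
/* Hard IRQ half: runs in interrupt context, copies the payload and,
 * when there is real work to do, wakes the IRQ thread. */
static irqreturn_t zynqmp_pm_isr(int irq, void *data)
{
	struct zynqmp_pm_work_struct *pm_work = data;	/* hypothetical dev_id */
	u32 payload[CB_PAYLOAD_SIZE];

	zynqmp_pm_get_callback_data(payload);
	if (payload[0] != PM_INIT_SUSPEND_CB)
		return IRQ_HANDLED;

	memcpy(pm_work->args, &payload[1], sizeof(pm_work->args));
	return IRQ_WAKE_THREAD;		/* defer to the thread function below */
}

/* Threaded half: runs in process context, replacing the work_struct
 * bottom half entirely. */
static irqreturn_t zynqmp_pm_isr_thread(int irq, void *data)
{
	struct zynqmp_pm_work_struct *pm_work = data;

	if (pm_work->args[0] == ZYNQMP_PM_SUSPEND_REASON_SYSTEM_SHUTDOWN)
		orderly_poweroff(true);
	return IRQ_HANDLED;
}

/* In probe(); IRQF_ONESHOT keeps the line masked until the thread runs. */
ret = devm_request_threaded_irq(&pdev->dev, irq, zynqmp_pm_isr,
				zynqmp_pm_isr_thread, IRQF_ONESHOT,
				dev_name(&pdev->dev), pm_work);
```

With this shape, the work_struct, INIT_WORK() and queue_work() calls in the original patch disappear, and the memcpy into shared state happens exactly once per interrupt.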
Re: [PATCH v9 1/6] ARM: imx6q: provide documentation for new fsl,pmic-stby-poweroff property
I updated the subject as below to make it clear this is a bindings change. dt-bindings: imx6q-clock: add new fsl,pmic-stby-poweroff property Patch applied, thanks. Shawn On Thu, Aug 02, 2018 at 12:34:20PM +0200, Oleksij Rempel wrote: > Signed-off-by: Oleksij Rempel > Acked-by: Rob Herring > --- > Documentation/devicetree/bindings/clock/imx6q-clock.txt | 8 > 1 file changed, 8 insertions(+) > > diff --git a/Documentation/devicetree/bindings/clock/imx6q-clock.txt > b/Documentation/devicetree/bindings/clock/imx6q-clock.txt > index a45ca67a9d5f..e1308346e00d 100644 > --- a/Documentation/devicetree/bindings/clock/imx6q-clock.txt > +++ b/Documentation/devicetree/bindings/clock/imx6q-clock.txt > @@ -6,6 +6,14 @@ Required properties: > - interrupts: Should contain CCM interrupt > - #clock-cells: Should be <1> > > +Optional properties: > +- fsl,pmic-stby-poweroff: Configure CCM to assert PMIC_STBY_REQ signal > + on power off. > + Use this property if the SoC should be powered off by external power > + management IC (PMIC) triggered via PMIC_STBY_REQ signal. > + Boards that are designed to initiate poweroff on PMIC_ON_REQ signal should > + be using "syscon-poweroff" driver instead. > + > The clock consumer should specify the desired clock by having the clock > ID in its "clocks" phandle cell. See > include/dt-bindings/clock/imx6qdl-clock.h > for the full list of i.MX6 Quad and DualLite clock IDs. > -- > 2.18.0 >
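As a usage illustration only (assuming the i.MX6 CCM node label `clks` as defined in imx6qdl.dtsi), a board DTS would opt into the new property like this:

```dts
/* Board-level opt-in: let the CCM assert PMIC_STBY_REQ on power off. */
&clks {
	fsl,pmic-stby-poweroff;
};
```

Boards wired to power off via PMIC_ON_REQ would instead use the "syscon-poweroff" binding, as the property description notes.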
Re: [PATCH 4.4 34/80] enic: handle mtu change for vf properly
On Mon, 2018-09-03 at 18:49 +0200, Greg Kroah-Hartman wrote: > 4.4-stable review patch. If anyone has any objections, please let me know. > > -- > > From: Govindarajulu Varadarajan > > [ Upstream commit ab123fe071c9aa9680ecd62eb080eb26cff4892c ] > > When driver gets notification for mtu change, driver does not handle it for > all RQs. It handles only RQ[0]. > > Fix is to use enic_change_mtu() interface to change mtu for vf. [...] This causes an assertion failure (noisy error logging, but not an oops) when the driver is probed. This was fixed upstream by: commit cb5c6568867325f9905e80c96531d963bec8e5ea Author: Govindarajulu Varadarajan Date: Mon Jul 30 09:56:54 2018 -0700 enic: do not call enic_change_mtu in enic_probe which is now needed on the 3.18, 4.4, and 4.9 stable branches. Ben. -- Ben Hutchings, Software Developer Codethink Ltd https://www.codethink.co.uk/ Dale House, 35 Dale Street Manchester, M1 2HF, United Kingdom
Re: [PATCHv3] iscsi-target: Don't use stack buffer for scatterlist
> Applied to 4.20/scsi-queue, thank you! 4.19/scsi-fixes, that is... -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCHv3] iscsi-target: Don't use stack buffer for scatterlist
Laura, > There are two cases that trigger this bug. Switch to using a > dynamically allocated buffer for one case and do not assign > a NULL buffer in another case. Applied to 4.20/scsi-queue, thank you! -- Martin K. Petersen Oracle Linux Engineering
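For background on the bug class being fixed (my sketch below, not the actual iscsi-target code): a scatterlist entry resolves its buffer through virt_to_page(), which is only valid for directly mapped (lowmem) addresses. With CONFIG_VMAP_STACK the kernel stack lives in vmalloc space, so building a scatterlist over an on-stack buffer is invalid:

```c
/* BROKEN: sg_init_one() ends up doing virt_to_page() on a stack address.
 * With CONFIG_VMAP_STACK the stack is vmalloc memory, so this trips the
 * virt_addr_valid() check under CONFIG_DEBUG_SG, or maps a bogus page. */
static void hash_with_stack_buf(void)
{
	u8 digest[20];			/* on-stack buffer */
	struct scatterlist sg;

	sg_init_one(&sg, digest, sizeof(digest));	/* invalid */
}

/* SAFE: a kmalloc'ed buffer is always directly mapped and page-backed. */
static int hash_with_heap_buf(void)
{
	struct scatterlist sg;
	u8 *digest = kmalloc(20, GFP_KERNEL);

	if (!digest)
		return -ENOMEM;
	sg_init_one(&sg, digest, 20);	/* valid: backed by a real page */
	kfree(digest);
	return 0;
}
```

The second case the changelog mentions (not assigning a NULL buffer) follows the same rule: every scatterlist entry must point at real, page-backed memory.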
[LKP] [kernel] 92114220fe: BUG:unable_to_handle_kernel
FYI, we noticed the following commit (built with gcc-6): commit: 92114220fe6a374172e99261b6451c515d29c8dc ("[PATCH] kernel: prevent submission of creds with higher privileges inside container") url: https://github.com/0day-ci/linux/commits/My-Name/kernel-prevent-submission-of-creds-with-higher-privileges-inside-container/20180911-162532 in testcase: trinity with following parameters: runtime: 300s test-description: Trinity is a linux system call fuzz tester. test-url: http://codemonkey.org.uk/projects/trinity/ on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -m 256M caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): +--+---++ | | v4.19-rc3 | 92114220fe | +--+---++ | boot_successes | 8 | 0 | | boot_failures| 0 | 6 | | BUG:unable_to_handle_kernel | 0 | 6 | | Oops:#[##] | 0 | 6 | | RIP:commit_creds | 0 | 6 | | Kernel_panic-not_syncing:Fatal_exception | 0 | 6 | +--+---++ [ 53.586547] BUG: unable to handle kernel NULL pointer dereference at 06c0 [ 53.588054] PGD 0 P4D 0 [ 53.588564] Oops: [#1] PTI [ 53.589180] CPU: 0 PID: 1 Comm: init Not tainted 4.19.0-rc3-1-g9211422 #1 [ 53.590544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 53.592139] RIP: 0010:commit_creds+0x51/0x410 [ 53.592988] Code: 08 81 ba b0 01 00 00 fe ff ff ef 74 11 8b 43 04 39 47 04 0f 83 9c 00 00 00 e9 c2 03 00 00 48 8b 50 10 48 83 05 67 82 5a 02 01 <81> ba c0 06 00 00 ff ff ff ef 75 d7 48 8b 50 18 48 83 05 57 82 5a [ 53.596525] RSP: :c900bd10 EFLAGS: 00010202 [ 53.597526] RAX: 82ca3060 RBX: 88000f02eb40 RCX: 88000f0399c8 [ 53.598883] RDX: RSI: RDI: 88000b2a53c0 [ 53.600235] RBP: 88000bd66800 R08: 88000f030740 R09: 008fb60c [ 53.601587] R10: e0098d8b R11: 10c12b46 R12: 88000f030040 [ 53.602936] R13: c9008000 R14: 88000cd07500 R15: 0001 [ 53.604285] FS: () GS:82c5b000() knlGS: [ 53.605813] CS: 0010 DS: ES: CR0: 80050033 [ 53.606906] CR2: 06c0 CR3: 0c6f6000 CR4: 000406b0 [ 53.608264] Call Trace: [ 53.608762] 
install_exec_creds+0x25/0xa0 [ 53.609544] load_elf_binary+0x544/0x1e72 [ 53.610324] ? __lock_acquire+0xdbb/0x1030 [ 53.611234] ? find_held_lock+0x35/0xd0 [ 53.611982] ? __lock_acquire+0xdbb/0x1030 [ 53.612891] ? find_held_lock+0x35/0xd0 [ 53.613639] ? search_binary_handler+0x83/0x180 [ 53.614512] search_binary_handler+0x98/0x180 [ 53.615356] load_script+0x348/0x370 [ 53.616058] search_binary_handler+0x98/0x180 [ 53.616906] __do_execve_file+0x7d3/0xaa0 [ 53.617804] do_execve+0x24/0x30 [ 53.618439] run_init_process+0x50/0x60 [ 53.619184] ? rest_init+0x1a0/0x1a0 [ 53.619885] kernel_init+0xca/0x1e0 [ 53.620573] ret_from_fork+0x35/0x40 [ 53.621264] CR2: 06c0 [ 53.621969] ---[ end trace 3c2bcf9b443a9ddd ]--- To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k job-script # job-script is attached in this email Thanks, lkp # # Automatically generated file; DO NOT EDIT. # Linux/x86_64 4.19.0-rc3 Kernel Configuration # # # Compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026 # CONFIG_CC_IS_GCC=y CONFIG_GCC_VERSION=60400 CONFIG_CLANG_VERSION=0 CONFIG_CONSTRUCTORS=y CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_BUILD_SALT="" CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y # CONFIG_KERNEL_GZIP is not set # CONFIG_KERNEL_BZIP2 is not set CONFIG_KERNEL_LZMA=y # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" # CONFIG_SYSVIPC is not set # CONFIG_POSIX_MQUEUE is not set CONFIG_CROSS_MEMORY_ATTACH=y # CONFIG_USELIB is not set # CONFIG_AUDIT is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_CHIP=y 
CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_SIM=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_IRQ_MATRIX_ALL
[PATCH -V5 RESEND 05/21] swap: Support PMD swap mapping in free_swap_and_cache()/swap_free()
When a PMD swap mapping is removed from a huge swap cluster (for example, when unmapping a memory range mapped with a PMD swap mapping), free_swap_and_cache() will be called to decrease the reference count of the huge swap cluster. free_swap_and_cache() may also free or split the huge swap cluster, and free the corresponding THP in the swap cache if necessary. swap_free() is similar and shares most of its implementation with free_swap_and_cache(). This patch revises free_swap_and_cache() and swap_free() to implement this. If the swap cluster has been split already, for example because a THP allocation failed during swapin, we just decrease the reference count of all swap slots by one. Otherwise, we decrease the reference count of all swap slots by one as well as the PMD swap mapping count in cluster_count(). When the corresponding THP isn't in the swap cache: if the PMD swap mapping count becomes 0, the huge swap cluster will be split, and if all swap counts become 0, the huge swap cluster will be freed. When the corresponding THP is in the swap cache: if every swap_map[offset] == SWAP_HAS_CACHE, we will try to delete the THP from the swap cache, which will cause the THP and the huge swap cluster to be freed. Signed-off-by: "Huang, Ying" Cc: "Kirill A. 
Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- arch/s390/mm/pgtable.c | 2 +- include/linux/swap.h | 9 +-- kernel/power/swap.c| 4 +- mm/madvise.c | 2 +- mm/memory.c| 4 +- mm/shmem.c | 6 +- mm/swapfile.c | 171 ++--- 7 files changed, 149 insertions(+), 49 deletions(-) diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index f2cc7da473e4..ffd4b68adbb3 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -675,7 +675,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) dec_mm_counter(mm, mm_counter(page)); } - free_swap_and_cache(entry); + free_swap_and_cache(entry, 1); } void ptep_zap_unused(struct mm_struct *mm, unsigned long addr, diff --git a/include/linux/swap.h b/include/linux/swap.h index 1bee8b65cb8a..db3e07a3d9bc 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -453,9 +453,9 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t *entry, int entry_size); extern int swapcache_prepare(swp_entry_t entry, int entry_size); -extern void swap_free(swp_entry_t); +extern void swap_free(swp_entry_t entry, int entry_size); extern void swapcache_free_entries(swp_entry_t *entries, int n); -extern int free_swap_and_cache(swp_entry_t); +extern int free_swap_and_cache(swp_entry_t entry, int entry_size); extern int swap_type_of(dev_t, sector_t, struct block_device **); extern unsigned int count_swap_pages(int, int); extern sector_t map_swap_page(struct page *, struct block_device **); @@ -509,7 +509,8 @@ static inline void show_swap_cache_info(void) { } -#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));}) +#define free_swap_and_cache(e, s) \ + ({(is_migration_entry(e) || is_device_private_entry(e)); }) #define swapcache_prepare(e, s) \ 
({(is_migration_entry(e) || is_device_private_entry(e)); }) @@ -527,7 +528,7 @@ static inline int swap_duplicate(swp_entry_t *swp, int entry_size) return 0; } -static inline void swap_free(swp_entry_t swp) +static inline void swap_free(swp_entry_t swp, int entry_size) { } diff --git a/kernel/power/swap.c b/kernel/power/swap.c index d7f6c1a288d3..0275df84ed3d 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -182,7 +182,7 @@ sector_t alloc_swapdev_block(int swap) offset = swp_offset(get_swap_page_of_type(swap)); if (offset) { if (swsusp_extents_insert(offset)) - swap_free(swp_entry(swap, offset)); + swap_free(swp_entry(swap, offset), 1); else return swapdev_block(swap, offset); } @@ -206,7 +206,7 @@ void free_all_swap_pages(int swap) ext = rb_entry(node, struct swsusp_extent, node); rb_erase(node, &swsusp_extents); for (offset = ext->start; offset <= ext->end; offset++) - swap_free(swp_entry(swap, offset)); + swap_free(swp_entry(swap, offset), 1); kfree(ext); } diff --git a/mm/madvise.c b/mm/madvise.c index 972a9eaa898b..6fff1c1d2009 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -349,7 +349,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
[PATCH -V5 RESEND 20/21] swap: create PMD swap mapping when unmap the THP
This is the final step of the THP swapin support. When reclaiming an anonymous THP, after allocating the huge swap cluster and adding the THP into the swap cache, the PMD page mapping is changed to a mapping into the swap space. Previously, the PMD page mapping was split before being changed. In this patch, the unmap code is enhanced not to split the PMD mapping, but to create a PMD swap mapping to replace it instead. So later, when the SWAP_HAS_CACHE flag is cleared in the last step of swapout, the huge swap cluster is kept instead of being split, and on swapin the huge swap cluster is read in one piece into a THP. That is, the THP is not split during swapout/swapin. This eliminates the overhead of splitting/collapsing and reduces the page fault count, etc. More importantly, THP utilization improves greatly: many more THPs are kept when swapping is used, so we can take full advantage of THP, including its high swapout/swapin performance. Signed-off-by: "Huang, Ying" Cc: "Kirill A. 
Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- include/linux/huge_mm.h | 11 +++ mm/huge_memory.c| 30 ++ mm/rmap.c | 43 ++- mm/vmscan.c | 6 +- 4 files changed, 84 insertions(+), 6 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 6586c1bfac21..8cbce31bc090 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -405,6 +405,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +struct page_vma_mapped_walk; + #ifdef CONFIG_THP_SWAP extern void __split_huge_swap_pmd(struct vm_area_struct *vma, unsigned long haddr, @@ -412,6 +414,8 @@ extern void __split_huge_swap_pmd(struct vm_area_struct *vma, extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address, pmd_t orig_pmd); extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd); +extern bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw, + struct page *page, unsigned long address, pmd_t pmdval); static inline bool transparent_hugepage_swapin_enabled( struct vm_area_struct *vma) @@ -453,6 +457,13 @@ static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) return 0; } +static inline bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw, + struct page *page, unsigned long address, + pmd_t pmdval) +{ + return false; +} + static inline bool transparent_hugepage_swapin_enabled( struct vm_area_struct *vma) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2aa432830a38..542af5836ca5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1889,6 +1889,36 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) count_vm_event(THP_SWPIN_FALLBACK); goto fallback; } + +bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw, struct page *page, + unsigned long address, 
pmd_t pmdval) +{ + struct vm_area_struct *vma = pvmw->vma; + struct mm_struct *mm = vma->vm_mm; + pmd_t swp_pmd; + swp_entry_t entry = { .val = page_private(page) }; + + if (swap_duplicate(&entry, HPAGE_PMD_NR) < 0) { + set_pmd_at(mm, address, pvmw->pmd, pmdval); + return false; + } + if (list_empty(&mm->mmlist)) { + spin_lock(&mmlist_lock); + if (list_empty(&mm->mmlist)) + list_add(&mm->mmlist, &init_mm.mmlist); + spin_unlock(&mmlist_lock); + } + add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR); + add_mm_counter(mm, MM_SWAPENTS, HPAGE_PMD_NR); + swp_pmd = swp_entry_to_pmd(entry); + if (pmd_soft_dirty(pmdval)) + swp_pmd = pmd_swp_mksoft_dirty(swp_pmd); + set_pmd_at(mm, address, pvmw->pmd, swp_pmd); + + page_remove_rmap(page, true); + put_page(page); + return true; +} #endif static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) diff --git a/mm/rmap.c b/mm/rmap.c index 3bb4be720bc0..a180cb1fe2db 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1413,11 +1413,52 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, continue; } + address = pvmw.address; + +#ifdef CONFIG_THP_SWAP + /* PMD-mapped THP swap entry */ + if (IS_ENABLED(CONFIG_THP_SWAP) && + !pvmw.pte && PageAnon(page)) { + pmd_t pmdv
[PATCH -V5 RESEND 15/21] swap: Support to copy PMD swap mapping when fork()
During fork, the page table need to be copied from parent to child. A PMD swap mapping need to be copied too and the swap reference count need to be increased. When the huge swap cluster has been split already, we need to split the PMD swap mapping and fallback to PTE copying. When swap count continuation failed to allocate a page with GFP_ATOMIC, we need to unlock the spinlock and try again with GFP_KERNEL. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- mm/huge_memory.c | 72 1 file changed, 57 insertions(+), 15 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f98d8a543d73..4e2230583c53 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -941,6 +941,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (unlikely(!pgtable)) goto out; +retry: dst_ptl = pmd_lock(dst_mm, dst_pmd); src_ptl = pmd_lockptr(src_mm, src_pmd); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); @@ -948,26 +949,67 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, ret = -EAGAIN; pmd = *src_pmd; -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION if (unlikely(is_swap_pmd(pmd))) { swp_entry_t entry = pmd_to_swp_entry(pmd); - VM_BUG_ON(!is_pmd_migration_entry(pmd)); - if (is_write_migration_entry(entry)) { - make_migration_entry_read(&entry); - pmd = swp_entry_to_pmd(entry); - if (pmd_swp_soft_dirty(*src_pmd)) - pmd = pmd_swp_mksoft_dirty(pmd); - set_pmd_at(src_mm, addr, src_pmd, pmd); +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION + if (is_migration_entry(entry)) { + if (is_write_migration_entry(entry)) { + make_migration_entry_read(&entry); + pmd = swp_entry_to_pmd(entry); + if (pmd_swp_soft_dirty(*src_pmd)) + pmd = pmd_swp_mksoft_dirty(pmd); + set_pmd_at(src_mm, addr, src_pmd, pmd); + } + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); + 
mm_inc_nr_ptes(dst_mm); + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); + set_pmd_at(dst_mm, addr, dst_pmd, pmd); + ret = 0; + goto out_unlock; } - add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); - mm_inc_nr_ptes(dst_mm); - pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); - set_pmd_at(dst_mm, addr, dst_pmd, pmd); - ret = 0; - goto out_unlock; - } #endif + if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry)) { + ret = swap_duplicate(&entry, HPAGE_PMD_NR); + if (!ret) { + add_mm_counter(dst_mm, MM_SWAPENTS, + HPAGE_PMD_NR); + mm_inc_nr_ptes(dst_mm); + pgtable_trans_huge_deposit(dst_mm, dst_pmd, + pgtable); + set_pmd_at(dst_mm, addr, dst_pmd, pmd); + /* make sure dst_mm is on swapoff's mmlist. */ + if (unlikely(list_empty(&dst_mm->mmlist))) { + spin_lock(&mmlist_lock); + if (list_empty(&dst_mm->mmlist)) + list_add(&dst_mm->mmlist, +&src_mm->mmlist); + spin_unlock(&mmlist_lock); + } + } else if (ret == -ENOTDIR) { + /* +* The huge swap cluster has been split, split +* the PMD swap mapping and fallback to PTE +*/ + __split_huge_swap_pmd(vma, addr, src_pmd); + pte_free(dst_mm, pgtable); + } else if (ret == -ENOMEM) { + spin_unlock(src_ptl); + spin_unlock(dst_ptl); + ret = add_swap_count_continuation(entry, + GFP_KERNEL); +
[PATCH -V5 RESEND 21/21] swap: Update help of CONFIG_THP_SWAP
The help of CONFIG_THP_SWAP is updated to reflect the latest progress of THP (Transparent Huge Page) swap optimization. Signed-off-by: "Huang, Ying" Reviewed-by: Dan Williams Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- mm/Kconfig | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 9a6e7e27e8d5..cd41bc4382bf 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -425,8 +425,6 @@ config THP_SWAP depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP help Swap transparent huge pages in one piece, without splitting. - XXX: For now, swap cluster backing transparent huge page - will be split after swapout. For selection by architectures with reasonable THP sizes. -- 2.16.4
[PATCH -V5 RESEND 19/21] swap: Support PMD swap mapping in common path
Original code is only for PMD migration entry, it is revised to support PMD swap mapping. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- fs/proc/task_mmu.c | 12 +--- mm/gup.c | 36 mm/huge_memory.c | 7 --- mm/mempolicy.c | 2 +- 4 files changed, 34 insertions(+), 23 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 5ea1d64cb0b4..2d968523c57b 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -972,7 +972,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, pmd = pmd_clear_soft_dirty(pmd); set_pmd_at(vma->vm_mm, addr, pmdp, pmd); - } else if (is_migration_entry(pmd_to_swp_entry(pmd))) { + } else if (is_swap_pmd(pmd)) { pmd = pmd_swp_clear_soft_dirty(pmd); set_pmd_at(vma->vm_mm, addr, pmdp, pmd); } @@ -1302,9 +1302,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, if (pm->show_pfn) frame = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - } -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION - else if (is_swap_pmd(pmd)) { + } else if (IS_ENABLED(CONFIG_HAVE_PMD_SWAP_ENTRY) && + is_swap_pmd(pmd)) { swp_entry_t entry = pmd_to_swp_entry(pmd); unsigned long offset; @@ -1317,10 +1316,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, flags |= PM_SWAP; if (pmd_swp_soft_dirty(pmd)) flags |= PM_SOFT_DIRTY; - VM_BUG_ON(!is_pmd_migration_entry(pmd)); - page = migration_entry_to_page(entry); + if (is_pmd_migration_entry(pmd)) + page = migration_entry_to_page(entry); } -#endif if (page && page_mapcount(page) == 1) flags |= PM_MMAP_EXCLUSIVE; diff --git a/mm/gup.c b/mm/gup.c index 1abc8b4afff6..b35b7729b1b7 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -216,6 +216,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, spinlock_t *ptl; struct page *page; struct 
mm_struct *mm = vma->vm_mm; + swp_entry_t entry; pmd = pmd_offset(pudp, address); /* @@ -243,18 +244,22 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, if (!pmd_present(pmdval)) { if (likely(!(flags & FOLL_MIGRATION))) return no_page_table(vma, flags); - VM_BUG_ON(thp_migration_supported() && - !is_pmd_migration_entry(pmdval)); - if (is_pmd_migration_entry(pmdval)) + entry = pmd_to_swp_entry(pmdval); + if (thp_migration_supported() && is_migration_entry(entry)) { pmd_migration_entry_wait(mm, pmd); - pmdval = READ_ONCE(*pmd); - /* -* MADV_DONTNEED may convert the pmd to null because -* mmap_sem is held in read mode -*/ - if (pmd_none(pmdval)) + pmdval = READ_ONCE(*pmd); + /* +* MADV_DONTNEED may convert the pmd to null because +* mmap_sem is held in read mode +*/ + if (pmd_none(pmdval)) + return no_page_table(vma, flags); + goto retry; + } + if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry)) return no_page_table(vma, flags); - goto retry; + WARN_ON(1); + return no_page_table(vma, flags); } if (pmd_devmap(pmdval)) { ptl = pmd_lock(mm, pmd); @@ -276,11 +281,18 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, return no_page_table(vma, flags); } if (unlikely(!pmd_present(*pmd))) { + entry = pmd_to_swp_entry(*pmd); spin_unlock(ptl); if (likely(!(flags & FOLL_MIGRATION))) return no_page_table(vma, flags); - pmd_migration_entry_wait(mm, pmd); - goto retry_locked; + if (thp_migration_supported() && is_migration_entry(entry)) { + pmd_migration_entry_wait(mm, pmd); + goto retry_locked; + } +
[PATCH -V5 RESEND 18/21] swap: Support PMD swap mapping in mincore()
During mincore(), for PMD swap mapping, swap cache will be looked up. If the resulting page isn't compound page, the PMD swap mapping will be split and fallback to PTE swap mapping processing. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- mm/mincore.c | 37 +++-- 1 file changed, 31 insertions(+), 6 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index a66f2052c7b1..a2a66c3c8c6a 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -48,7 +48,8 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, * and is up to date; i.e. that no page-in operation would be required * at this time if an application were to map and access this page. */ -static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) +static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff, + bool *compound) { unsigned char present = 0; struct page *page; @@ -86,6 +87,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) #endif if (page) { present = PageUptodate(page); + if (compound) + *compound = PageCompound(page); put_page(page); } @@ -103,7 +106,8 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, pgoff = linear_page_index(vma, addr); for (i = 0; i < nr; i++, pgoff++) - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); + vec[i] = mincore_page(vma->vm_file->f_mapping, + pgoff, NULL); } else { for (i = 0; i < nr; i++) vec[i] = 0; @@ -127,14 +131,36 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t *ptep; unsigned char *vec = walk->private; int nr = (end - addr) >> PAGE_SHIFT; + swp_entry_t entry; ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { - memset(vec, 1, nr); + unsigned char val = 1; + bool compound; + + if 
(IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(*pmd)) { + entry = pmd_to_swp_entry(*pmd); + if (!non_swap_entry(entry)) { + val = mincore_page(swap_address_space(entry), + swp_offset(entry), + &compound); + /* +* The huge swap cluster has been +* split under us +*/ + if (!compound) { + __split_huge_swap_pmd(vma, addr, pmd); + spin_unlock(ptl); + goto fallback; + } + } + } + memset(vec, val, nr); spin_unlock(ptl); goto out; } +fallback: if (pmd_trans_unstable(pmd)) { __mincore_unmapped_range(addr, end, vma, vec); goto out; @@ -150,8 +176,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, else if (pte_present(pte)) *vec = 1; else { /* pte is a swap entry */ - swp_entry_t entry = pte_to_swp_entry(pte); - + entry = pte_to_swp_entry(pte); if (non_swap_entry(entry)) { /* * migration or hwpoison entries are always @@ -161,7 +186,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, } else { #ifdef CONFIG_SWAP *vec = mincore_page(swap_address_space(entry), - swp_offset(entry)); + swp_offset(entry), NULL); #else WARN_ON(1); *vec = 1; -- 2.16.4
[PATCH -V5 RESEND 17/21] swap: Support PMD swap mapping for MADV_WILLNEED
During MADV_WILLNEED, for a PMD swap mapping, if THP swapin is enabled for the VMA, the whole swap cluster will be swapped in. Otherwise, the huge swap cluster and the PMD swap mapping will be split, falling back to PTE swap mappings. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- mm/madvise.c | 26 -- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 07ef599d4255..608c5ae201c6 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -196,14 +196,36 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, pte_t *orig_pte; struct vm_area_struct *vma = walk->private; unsigned long index; + swp_entry_t entry; + struct page *page; + pmd_t pmdval; + + pmdval = *pmd; + if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(pmdval) && + !is_pmd_migration_entry(pmdval)) { + entry = pmd_to_swp_entry(pmdval); + if (!transparent_hugepage_swapin_enabled(vma)) { + if (!split_swap_cluster(entry, 0)) + split_huge_swap_pmd(vma, pmd, start, pmdval); + } else { + page = read_swap_cache_async(entry, +GFP_HIGHUSER_MOVABLE, +vma, start, false); + if (page) { + /* The swap cluster has been split under us */ + if (!PageTransHuge(page)) + split_huge_swap_pmd(vma, pmd, start, + pmdval); + put_page(page); + } + } + } if (pmd_none_or_trans_huge_or_clear_bad(pmd)) return 0; for (index = start; index != end; index += PAGE_SIZE) { pte_t pte; - swp_entry_t entry; - struct page *page; spinlock_t *ptl; orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); -- 2.16.4
[PATCH -V5 RESEND 16/21] swap: Free PMD swap mapping when zap_huge_pmd()
For a PMD swap mapping, zap_huge_pmd() will clear the PMD and call free_swap_and_cache() to decrease the swap reference count and maybe free or split the huge swap cluster and the THP in swap cache. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- mm/huge_memory.c | 32 +--- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4e2230583c53..d4e8b4f80543 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2024,7 +2024,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, spin_unlock(ptl); if (is_huge_zero_pmd(orig_pmd)) tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); - } else if (is_huge_zero_pmd(orig_pmd)) { + } else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) { zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); @@ -2037,17 +2037,27 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, page_remove_rmap(page, true); VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); VM_BUG_ON_PAGE(!PageHead(page), page); - } else if (thp_migration_supported()) { - swp_entry_t entry; - - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd)); - entry = pmd_to_swp_entry(orig_pmd); - page = pfn_to_page(swp_offset(entry)); + } else { + swp_entry_t entry = pmd_to_swp_entry(orig_pmd); + + if (thp_migration_supported() && + is_migration_entry(entry)) + page = pfn_to_page(swp_offset(entry)); + else if (IS_ENABLED(CONFIG_THP_SWAP) && +!non_swap_entry(entry)) + free_swap_and_cache(entry, HPAGE_PMD_NR); + else { + WARN_ONCE(1, +"Non present huge pmd without pmd migration or swap enabled!"); + goto unlock; + } flush_needed = 0; - } else - WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!"); + } - if (PageAnon(page)) 
{ + if (!page) { + zap_deposited_table(tlb->mm, pmd); + add_mm_counter(tlb->mm, MM_SWAPENTS, -HPAGE_PMD_NR); + } else if (PageAnon(page)) { zap_deposited_table(tlb->mm, pmd); add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); } else { @@ -2055,7 +2065,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, zap_deposited_table(tlb->mm, pmd); add_mm_counter(tlb->mm, mm_counter_file(page), -HPAGE_PMD_NR); } - +unlock: spin_unlock(ptl); if (flush_needed) tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE); -- 2.16.4
[PATCH -V5 RESEND 12/21] swap: Support PMD swap mapping in swapoff
During swapoff, for a huge swap cluster, we need to allocate a THP, read the cluster's contents into the THP, and unuse the PMD and PTE swap mappings to it. If allocating a THP fails, the huge swap cluster will be split. During unuse, if the swap cluster mapped by a PMD swap mapping is found to have been split already, we will split the PMD swap mapping and unuse the PTEs. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- include/asm-generic/pgtable.h | 14 +-- include/linux/huge_mm.h | 8 mm/huge_memory.c | 4 +- mm/swapfile.c | 86 ++- 4 files changed, 97 insertions(+), 15 deletions(-) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index eb1e9d17371b..d64cef2bff04 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -931,22 +931,12 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) barrier(); #endif /* -* !pmd_present() checks for pmd migration entries -* -* The complete check uses is_pmd_migration_entry() in linux/swapops.h -* But using that requires moving current function and pmd_trans_unstable() -* to linux/swapops.h to resovle dependency, which is too much code move. -* -* !pmd_present() is equivalent to is_pmd_migration_entry() currently, -* because !pmd_present() pages can only be under migration not swapped -* out. -* -* pmd_none() is preseved for future condition checks on pmd migration +* pmd_none() is preseved for future condition checks on pmd swap * entries and not confusing with this function name, although it is * redundant with !pmd_present().
*/ if (pmd_none(pmdval) || pmd_trans_huge(pmdval) || - (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval))) + (IS_ENABLED(CONFIG_HAVE_PMD_SWAP_ENTRY) && !pmd_present(pmdval))) return 1; if (unlikely(pmd_bad(pmdval))) { pmd_clear_bad(pmd); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 9dedff974def..25ba9b5f1e60 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -406,6 +406,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #ifdef CONFIG_THP_SWAP +extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long address, pmd_t orig_pmd); extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd); static inline bool transparent_hugepage_swapin_enabled( @@ -431,6 +433,12 @@ static inline bool transparent_hugepage_swapin_enabled( return false; } #else /* CONFIG_THP_SWAP */ +static inline int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long address, pmd_t orig_pmd) +{ + return 0; +} + static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) { return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c4a766243a8f..cd353f39bed9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1671,8 +1671,8 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma, } #ifdef CONFIG_THP_SWAP -static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long address, pmd_t orig_pmd) +int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long address, pmd_t orig_pmd) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; diff --git a/mm/swapfile.c b/mm/swapfile.c index 3fe50f1da0a0..64067ee6a09c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1931,6 +1931,11 @@ static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte) return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte); } +static inline int pmd_same_as_swp(pmd_t 
pmd, pmd_t swp_pmd) +{ + return pmd_same(pmd_swp_clear_soft_dirty(pmd), swp_pmd); +} + /* * No need to decide whether this PTE shares the swap entry with others, * just let do_wp_page work it out if a write is requested later - to @@ -1992,6 +1997,53 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, return ret; } +#ifdef CONFIG_THP_SWAP +static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd, +unsigned long addr, swp_entry_t entry, struct page *page) +{ + struct mem_cgroup *memcg; + spinlock_t *ptl; + int ret = 1; + + if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, + &memcg, true)) { + ret = -ENOMEM; +
[PATCH -V5 RESEND 13/21] swap: Support PMD swap mapping in madvise_free()
When madvise_free() finds a PMD swap mapping, and only part of the huge swap cluster is operated on, the PMD swap mapping will be split and processing falls back to the PTE swap mappings. Otherwise, if the whole huge swap cluster is operated on, free_swap_and_cache() will be called to decrease the PMD swap mapping count and possibly free the swap space and the THP in the swap cache too. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- mm/huge_memory.c | 54 +++--- mm/madvise.c | 2 +- 2 files changed, 40 insertions(+), 16 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cd353f39bed9..05407832e793 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1849,6 +1849,15 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) } #endif +static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) +{ + pgtable_t pgtable; + + pgtable = pgtable_trans_huge_withdraw(mm, pmd); + pte_free(mm, pgtable); + mm_dec_nr_ptes(mm); +} + /* * Return true if we do MADV_FREE successfully on entire pmd page. * Otherwise, return false.
@@ -1869,15 +1878,39 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, goto out_unlocked; orig_pmd = *pmd; - if (is_huge_zero_pmd(orig_pmd)) - goto out; - if (unlikely(!pmd_present(orig_pmd))) { - VM_BUG_ON(thp_migration_supported() && - !is_pmd_migration_entry(orig_pmd)); - goto out; + swp_entry_t entry = pmd_to_swp_entry(orig_pmd); + + if (is_migration_entry(entry)) { + VM_BUG_ON(!thp_migration_supported()); + goto out; + } else if (IS_ENABLED(CONFIG_THP_SWAP) && + !non_swap_entry(entry)) { + /* +* If part of THP is discarded, split the PMD +* swap mapping and operate on the PTEs +*/ + if (next - addr != HPAGE_PMD_SIZE) { + unsigned long haddr = addr & HPAGE_PMD_MASK; + + __split_huge_swap_pmd(vma, haddr, pmd); + goto out; + } + free_swap_and_cache(entry, HPAGE_PMD_NR); + pmd_clear(pmd); + zap_deposited_table(mm, pmd); + if (current->mm == mm) + sync_mm_rss(mm); + add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR); + ret = true; + goto out; + } else + VM_BUG_ON(1); } + if (is_huge_zero_pmd(orig_pmd)) + goto out; + page = pmd_page(orig_pmd); /* * If other processes are mapping this page, we couldn't discard @@ -1923,15 +1956,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, return ret; } -static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) -{ - pgtable_t pgtable; - - pgtable = pgtable_trans_huge_withdraw(mm, pmd); - pte_free(mm, pgtable); - mm_dec_nr_ptes(mm); -} - int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr) { diff --git a/mm/madvise.c b/mm/madvise.c index 6fff1c1d2009..07ef599d4255 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -321,7 +321,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, unsigned long next; next = pmd_addr_end(addr, end); - if (pmd_trans_huge(*pmd)) + if (pmd_trans_huge(*pmd) || is_swap_pmd(*pmd)) if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next)) goto next; -- 2.16.4
[PATCH -V5 RESEND 14/21] swap: Support to move swap account for PMD swap mapping
Previously, the huge swap cluster was split after the THP was swapped out. Now, to support swapping the THP back in in one piece, the huge swap cluster is not split after the THP is reclaimed. So in memcg, we need to move the swap account for PMD swap mappings in the process's page table. When the page table is scanned during moving a memcg charge, PMD swap mappings are identified, and mem_cgroup_move_swap_account() and its callees are revised to move the account for the whole huge swap cluster. If the swap cluster mapped by a PMD has already been split, the PMD swap mapping will be split and processing falls back to PTEs. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- include/linux/huge_mm.h | 9 include/linux/swap.h| 6 +++ include/linux/swap_cgroup.h | 3 +- mm/huge_memory.c| 8 +-- mm/memcontrol.c | 129 ++-- mm/swap_cgroup.c| 45 +--- mm/swapfile.c | 14 + 7 files changed, 174 insertions(+), 40 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 25ba9b5f1e60..6586c1bfac21 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -406,6 +406,9 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #ifdef CONFIG_THP_SWAP +extern void __split_huge_swap_pmd(struct vm_area_struct *vma, + unsigned long haddr, + pmd_t *pmd); extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address, pmd_t orig_pmd); extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd); @@ -433,6 +436,12 @@ static inline bool transparent_hugepage_swapin_enabled( return false; } #else /* CONFIG_THP_SWAP */ +static inline void __split_huge_swap_pmd(struct vm_area_struct *vma, +unsigned long haddr, +pmd_t *pmd) +{ +} + static inline int split_huge_swap_pmd(struct
vm_area_struct *vma, pmd_t *pmd, unsigned long address, pmd_t orig_pmd) { diff --git a/include/linux/swap.h b/include/linux/swap.h index 921abd07e13f..d45c3a7746e0 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -621,6 +621,7 @@ static inline swp_entry_t get_swap_page(struct page *page) #ifdef CONFIG_THP_SWAP extern int split_swap_cluster(swp_entry_t entry, unsigned long flags); extern int split_swap_cluster_map(swp_entry_t entry); +extern int get_swap_entry_size(swp_entry_t entry); #else static inline int split_swap_cluster(swp_entry_t entry, unsigned long flags) { @@ -631,6 +632,11 @@ static inline int split_swap_cluster_map(swp_entry_t entry) { return 0; } + +static inline int get_swap_entry_size(swp_entry_t entry) +{ + return 1; +} #endif #ifdef CONFIG_MEMCG diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h index a12dd1c3966c..c40fb52b0563 100644 --- a/include/linux/swap_cgroup.h +++ b/include/linux/swap_cgroup.h @@ -7,7 +7,8 @@ #ifdef CONFIG_MEMCG_SWAP extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, - unsigned short old, unsigned short new); + unsigned short old, unsigned short new, + unsigned int nr_ents); extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id, unsigned int nr_ents); extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 05407832e793..f98d8a543d73 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1636,10 +1636,11 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) return 0; } +#ifdef CONFIG_THP_SWAP /* Convert a PMD swap mapping to a set of PTE swap mappings */ -static void __split_huge_swap_pmd(struct vm_area_struct *vma, - unsigned long haddr, - pmd_t *pmd) +void __split_huge_swap_pmd(struct vm_area_struct *vma, + unsigned long haddr, + pmd_t *pmd) { struct mm_struct *mm = vma->vm_mm; pgtable_t pgtable; @@ -1670,7 +1671,6 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma, 
pmd_populate(mm, pmd, pgtable); } -#ifdef CONFIG_THP_SWAP int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address, pmd_t orig_pmd) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fcec9b39e2a3..6c2527ffd17d 100644 --
[PATCH -V5 RESEND 06/21] swap: Support PMD swap mapping when splitting huge PMD
A huge PMD needs to be split when zapping a part of the PMD mapping, etc. If the PMD mapping is a swap mapping, we need to split it too. This patch implements the support for this. It is similar to splitting a PMD page mapping, except that we also need to decrease the PMD swap mapping count for the huge swap cluster. If the PMD swap mapping count becomes 0, the huge swap cluster will be split. Note: is_huge_zero_pmd() and pmd_page() don't work well with a swap PMD, so the pmd_present() check is called before them. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- include/linux/huge_mm.h | 4 include/linux/swap.h| 6 ++ mm/huge_memory.c| 48 +++- mm/swapfile.c | 32 4 files changed, 85 insertions(+), 5 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 99c19b06d9a4..0f3e1739986f 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -226,6 +226,10 @@ static inline bool is_huge_zero_page(struct page *page) return READ_ONCE(huge_zero_page) == page; } +/* + * is_huge_zero_pmd() must be called after checking pmd_present(), + * otherwise, it may report false positive for PMD swap entry.
+ */ static inline bool is_huge_zero_pmd(pmd_t pmd) { return is_huge_zero_page(pmd_page(pmd)); diff --git a/include/linux/swap.h b/include/linux/swap.h index db3e07a3d9bc..a2a3d85decd9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -618,11 +618,17 @@ static inline swp_entry_t get_swap_page(struct page *page) #ifdef CONFIG_THP_SWAP extern int split_swap_cluster(swp_entry_t entry); +extern int split_swap_cluster_map(swp_entry_t entry); #else static inline int split_swap_cluster(swp_entry_t entry) { return 0; } + +static inline int split_swap_cluster_map(swp_entry_t entry) +{ + return 0; +} #endif #ifdef CONFIG_MEMCG diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c235ba78de68..b8b61a0879f6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1609,6 +1609,40 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) return 0; } +/* Convert a PMD swap mapping to a set of PTE swap mappings */ +static void __split_huge_swap_pmd(struct vm_area_struct *vma, + unsigned long haddr, + pmd_t *pmd) +{ + struct mm_struct *mm = vma->vm_mm; + pgtable_t pgtable; + pmd_t _pmd; + swp_entry_t entry; + int i, soft_dirty; + + entry = pmd_to_swp_entry(*pmd); + soft_dirty = pmd_soft_dirty(*pmd); + + split_swap_cluster_map(entry); + + pgtable = pgtable_trans_huge_withdraw(mm, pmd); + pmd_populate(mm, &_pmd, pgtable); + + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE, entry.val++) { + pte_t *pte, ptent; + + pte = pte_offset_map(&_pmd, haddr); + VM_BUG_ON(!pte_none(*pte)); + ptent = swp_entry_to_pte(entry); + if (soft_dirty) + ptent = pte_swp_mksoft_dirty(ptent); + set_pte_at(mm, haddr, pte, ptent); + pte_unmap(pte); + } + smp_wmb(); /* make pte visible before pmd */ + pmd_populate(mm, pmd, pgtable); +} + /* * Return true if we do MADV_FREE successfully on entire pmd page. * Otherwise, return false. 
@@ -2075,7 +2109,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, VM_BUG_ON(haddr & ~HPAGE_PMD_MASK); VM_BUG_ON_VMA(vma->vm_start > haddr, vma); VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma); - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd) + VM_BUG_ON(!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)); count_vm_event(THP_SPLIT_PMD); @@ -2099,7 +2133,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, put_page(page); add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); return; - } else if (is_huge_zero_pmd(*pmd)) { + } else if (pmd_present(*pmd) && is_huge_zero_pmd(*pmd)) { /* * FIXME: Do we want to invalidate secondary mmu by calling * mmu_notifier_invalidate_range() see comments below inside @@ -2143,6 +2177,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, page = pfn_to_page(swp_offset(entry)); } else #endif + if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(old_pmd)) + return __split_huge_swap_pmd(vma, haddr, pmd); + else page = pmd_page(old
[PATCH -V5 RESEND 11/21] swap: Add sysfs interface to configure THP swapin
Swapping in a THP as a whole isn't desirable in some situations. For example, for a completely random access pattern, swapping in a THP in one piece will inflate the amount read greatly. So a sysfs interface: /sys/kernel/mm/transparent_hugepage/swapin_enabled is added to configure it. The following three options are provided, - always: THP swapin will always be enabled. - madvise: THP swapin will be enabled only for VMAs with the VM_HUGEPAGE flag set. - never: THP swapin will always be disabled. The default configuration is: madvise. During page fault, if a PMD swap mapping is found and THP swapin is disabled, the huge swap cluster and the PMD swap mapping will be split, and swapin falls back to normal pages. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- Documentation/admin-guide/mm/transhuge.rst | 21 +++ include/linux/huge_mm.h| 31 ++ mm/huge_memory.c | 94 -- 3 files changed, 127 insertions(+), 19 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 85e33f785fd7..23aefb17101c 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -160,6 +160,27 @@ Some userspace (such as a test program, or an optimized memory allocation cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size +Transparent hugepage may be swapout and swapin in one piece without +splitting. This will improve the utility of transparent hugepage but +may inflate the read/write too. So whether to enable swapin +transparent hugepage in one piece can be configured as follow.
+ + echo always >/sys/kernel/mm/transparent_hugepage/swapin_enabled + echo madvise >/sys/kernel/mm/transparent_hugepage/swapin_enabled + echo never >/sys/kernel/mm/transparent_hugepage/swapin_enabled + +always + Attempt to allocate a transparent huge page and read it from + swap space in one piece every time. + +never + Always split the swap space and PMD swap mapping and swapin + the fault normal page during swapin. + +madvise + Only swapin the transparent huge page in one piece for + MADV_HUGEPAGE madvise regions. + khugepaged will be automatically started when transparent_hugepage/enabled is set to "always" or "madvise, and it'll be automatically shutdown if it's set to "never". diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index c2b8ced6fc2b..9dedff974def 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -63,6 +63,8 @@ enum transparent_hugepage_flag { #ifdef CONFIG_DEBUG_VM TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG, #endif + TRANSPARENT_HUGEPAGE_SWAPIN_FLAG, + TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG, }; struct kobject; @@ -405,11 +407,40 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) #ifdef CONFIG_THP_SWAP extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd); + +static inline bool transparent_hugepage_swapin_enabled( + struct vm_area_struct *vma) +{ + if (vma->vm_flags & VM_NOHUGEPAGE) + return false; + + if (is_vma_temporary_stack(vma)) + return false; + + if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) + return false; + + if (transparent_hugepage_flags & + (1 << TRANSPARENT_HUGEPAGE_SWAPIN_FLAG)) + return true; + + if (transparent_hugepage_flags & + (1 << TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG)) + return !!(vma->vm_flags & VM_HUGEPAGE); + + return false; +} #else /* CONFIG_THP_SWAP */ static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) { return 0; } + +static inline bool transparent_hugepage_swapin_enabled( + struct vm_area_struct *vma) +{ + 
return false; +} #endif /* CONFIG_THP_SWAP */ #endif /* _LINUX_HUGE_MM_H */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1232ade5deca..c4a766243a8f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -57,7 +57,8 @@ unsigned long transparent_hugepage_flags __read_mostly = #endif (1
[PATCH -V5 RESEND 08/21] swap: Support to read a huge swap cluster for swapin a THP
To swapin a THP in one piece, we need to read a huge swap cluster from the swap device. This patch revised the __read_swap_cache_async() and its callers and callees to support this. If __read_swap_cache_async() find the swap cluster of the specified swap entry is huge, it will try to allocate a THP, add it into the swap cache. So later the contents of the huge swap cluster can be read into the THP. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- include/linux/huge_mm.h | 38 ++ include/linux/swap.h| 4 +-- mm/huge_memory.c| 26 -- mm/swap_state.c | 72 - mm/swapfile.c | 9 --- 5 files changed, 99 insertions(+), 50 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0f3e1739986f..3fdb29bc250c 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -250,6 +250,39 @@ static inline bool thp_migration_supported(void) return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION); } +/* + * always: directly stall for all thp allocations + * defer: wake kswapd and fail if not immediately available + * defer+madvise: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise + * fail if not immediately available + * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately + * available + * never: never stall for any thp allocation + */ +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) +{ + bool vma_madvised; + + if (!vma) + return GFP_TRANSHUGE_LIGHT; + vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, +&transparent_hugepage_flags)) + return GFP_TRANSHUGE | (vma_madvised ? 
0 : __GFP_NORETRY); + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, +&transparent_hugepage_flags)) + return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, +&transparent_hugepage_flags)) + return GFP_TRANSHUGE_LIGHT | + (vma_madvised ? __GFP_DIRECT_RECLAIM : + __GFP_KSWAPD_RECLAIM); + if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, +&transparent_hugepage_flags)) + return GFP_TRANSHUGE_LIGHT | + (vma_madvised ? __GFP_DIRECT_RECLAIM : 0); + return GFP_TRANSHUGE_LIGHT; +} #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) @@ -363,6 +396,11 @@ static inline bool thp_migration_supported(void) { return false; } + +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) +{ + return 0; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* _LINUX_HUGE_MM_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index c0c3b3c077d7..921abd07e13f 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -462,7 +462,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **); extern sector_t swapdev_block(int, pgoff_t); extern int page_swapcount(struct page *); extern int __swap_count(swp_entry_t entry); -extern int __swp_swapcount(swp_entry_t entry); +extern int __swp_swapcount(swp_entry_t entry, int *entry_size); extern int swp_swapcount(swp_entry_t entry); extern struct swap_info_struct *page_swap_info(struct page *); extern struct swap_info_struct *swp_swap_info(swp_entry_t entry); @@ -589,7 +589,7 @@ static inline int __swap_count(swp_entry_t entry) return 0; } -static inline int __swp_swapcount(swp_entry_t entry) +static inline int __swp_swapcount(swp_entry_t entry, int *entry_size) { return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 64123cefa978..f1358681db8f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -620,32 +620,6 @@ static vm_fault_t 
__do_huge_pmd_anonymous_page(struct vm_fault *vmf, } -/* - * always: directly stall for all thp allocations - * defer: wake kswapd and fail if not immediately available - * defer+madvise: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise - * fail if not immediately available - * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately - * available - * never: never stall for any thp allocation - */ -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) -{ - const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); - - if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
[PATCH -V5 RESEND 10/21] swap: Support to count THP swapin and its fallback
Two new /proc/vmstat fields, "thp_swpin" and "thp_swpin_fallback", are added to count THPs swapped in from the swap device in one piece, and fallbacks to normal page swapin, respectively. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- Documentation/admin-guide/mm/transhuge.rst | 8 include/linux/vm_event_item.h | 2 ++ mm/huge_memory.c | 4 +++- mm/page_io.c | 15 --- mm/vmstat.c| 2 ++ 5 files changed, 27 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 7ab93a8404b9..85e33f785fd7 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -364,6 +364,14 @@ thp_swpout_fallback Usually because failed to allocate some continuous swap space for the huge page. +thp_swpin + is incremented every time a huge page is swapin in one piece + without splitting. + +thp_swpin_fallback + is incremented if a huge page has to be split during swapin. + Usually because failed to allocate a huge page. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use.
There are some counters in ``/proc/vmstat`` to help diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 5c7f010676a7..7b438548a78e 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -88,6 +88,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_ZERO_PAGE_ALLOC_FAILED, THP_SWPOUT, THP_SWPOUT_FALLBACK, + THP_SWPIN, + THP_SWPIN_FALLBACK, #endif #ifdef CONFIG_MEMORY_BALLOON BALLOON_INFLATE, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4dbc4f933c4f..1232ade5deca 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1673,8 +1673,10 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) /* swapoff occurs under us */ } else if (ret == -EINVAL) ret = 0; - else + else { + count_vm_event(THP_SWPIN_FALLBACK); goto fallback; + } } delayacct_clear_flag(DELAYACCT_PF_SWAPIN); goto out; diff --git a/mm/page_io.c b/mm/page_io.c index aafd19ec1db4..362254b99955 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -348,6 +348,15 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, return ret; } +static inline void count_swpin_vm_event(struct page *page) +{ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (unlikely(PageTransHuge(page))) + count_vm_event(THP_SWPIN); +#endif + count_vm_events(PSWPIN, hpage_nr_pages(page)); +} + int swap_readpage(struct page *page, bool synchronous) { struct bio *bio; @@ -371,7 +380,7 @@ int swap_readpage(struct page *page, bool synchronous) ret = mapping->a_ops->readpage(swap_file, page); if (!ret) - count_vm_event(PSWPIN); + count_swpin_vm_event(page); return ret; } @@ -382,7 +391,7 @@ int swap_readpage(struct page *page, bool synchronous) unlock_page(page); } - count_vm_event(PSWPIN); + count_swpin_vm_event(page); return 0; } @@ -401,7 +410,7 @@ int swap_readpage(struct page *page, bool synchronous) get_task_struct(current); bio->bi_private = current; bio_set_op_attrs(bio, REQ_OP_READ, 0); - count_vm_event(PSWPIN); + count_swpin_vm_event(page); 
bio_get(bio); qc = submit_bio(bio); while (synchronous) { diff --git a/mm/vmstat.c b/mm/vmstat.c index 8ba0870ecddd..ac04801bb0cb 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1263,6 +1263,8 @@ const char * const vmstat_text[] = { "thp_zero_page_alloc_failed", "thp_swpout", "thp_swpout_fallback", + "thp_swpin", + "thp_swpin_fallback", #endif #ifdef CONFIG_MEMORY_BALLOON "balloon_inflate", -- 2.16.4
[PATCH -V5 RESEND 09/21] swap: Swapin a THP in one piece
With this patch, when the page fault handler finds a PMD swap mapping, it will swap in a THP in one piece. This avoids the overhead of splitting/collapsing before/after swapping the THP, and improves swap performance greatly through a reduced page fault count, etc. do_huge_pmd_swap_page() is added in this patch to implement this; it is similar to do_swap_page() for normal page swapin. If allocating a THP fails, the huge swap cluster and the PMD swap mapping will be split to fall back to normal page swapin. If the huge swap cluster has already been split, the PMD swap mapping will be split to fall back to normal page swapin. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- include/linux/huge_mm.h | 9 +++ mm/huge_memory.c| 174 mm/memory.c | 16 +++-- 3 files changed, 193 insertions(+), 6 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 3fdb29bc250c..c2b8ced6fc2b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -403,4 +403,13 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#ifdef CONFIG_THP_SWAP +extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd); +#else /* CONFIG_THP_SWAP */ +static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) +{ + return 0; +} +#endif /* CONFIG_THP_SWAP */ + #endif /* _LINUX_HUGE_MM_H */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f1358681db8f..4dbc4f933c4f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -33,6 +33,8 @@ #include #include #include +#include +#include #include #include @@ -1617,6 +1619,178 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma, pmd_populate(mm, pmd, pgtable); } +#ifdef CONFIG_THP_SWAP +static int split_huge_swap_pmd(struct
vm_area_struct *vma, pmd_t *pmd, + unsigned long address, pmd_t orig_pmd) +{ + struct mm_struct *mm = vma->vm_mm; + spinlock_t *ptl; + int ret = 0; + + ptl = pmd_lock(mm, pmd); + if (pmd_same(*pmd, orig_pmd)) + __split_huge_swap_pmd(vma, address & HPAGE_PMD_MASK, pmd); + else + ret = -ENOENT; + spin_unlock(ptl); + + return ret; +} + +int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd) +{ + struct page *page; + struct mem_cgroup *memcg; + struct vm_area_struct *vma = vmf->vma; + unsigned long haddr = vmf->address & HPAGE_PMD_MASK; + swp_entry_t entry; + pmd_t pmd; + int i, locked, exclusive = 0, ret = 0; + + entry = pmd_to_swp_entry(orig_pmd); + VM_BUG_ON(non_swap_entry(entry)); + delayacct_set_flag(DELAYACCT_PF_SWAPIN); +retry: + page = lookup_swap_cache(entry, NULL, vmf->address); + if (!page) { + page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma, +haddr, false); + if (!page) { + /* +* Back out if somebody else faulted in this pmd +* while we released the pmd lock. +*/ + if (likely(pmd_same(*vmf->pmd, orig_pmd))) { + /* +* Failed to allocate huge page, split huge swap +* cluster, and fallback to swapin normal page +*/ + ret = split_swap_cluster(entry, 0); + /* Somebody else swapin the swap entry, retry */ + if (ret == -EEXIST) { + ret = 0; + goto retry; + /* swapoff occurs under us */ + } else if (ret == -EINVAL) + ret = 0; + else + goto fallback; + } + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); + goto out; + } + + /* Had to read the page from swap area: Major fault */ + ret = VM_FAULT_MAJOR; + count_vm_event(PGMAJFAULT); + count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); + } else if (!PageTransCompound(page)) + goto fallback; + + locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags); + + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); + if (!locked) { + ret |= VM_FAULT_RETRY; + goto out_release; +
[PATCH -V5 RESEND 04/21] swap: Support PMD swap mapping in put_swap_page()
Previously, during swapout, all PMD page mappings were split and replaced with PTE swap mappings, so when clearing the SWAP_HAS_CACHE flag for the huge swap cluster in put_swap_page(), the huge swap cluster was always split. Now, during swapout, the PMD page mappings to the THP are changed to PMD swap mappings to the corresponding swap cluster. So when clearing the SWAP_HAS_CACHE flag, the huge swap cluster is only split if its PMD swap mapping count is 0. Otherwise, we keep it as a huge swap cluster, so that we can swap in the THP in one piece later.

Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan
---
 mm/swapfile.c | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 138968b79de5..553d2551b35a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1314,6 +1314,15 @@ void swap_free(swp_entry_t entry)
 
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
+ *
+ * When a THP is added into swap cache, the SWAP_HAS_CACHE flag will
+ * be set in the swap_map[] of all swap entries in the huge swap
+ * cluster backing the THP. This huge swap cluster will not be split
+ * unless the THP is split even if its PMD swap mapping count dropped
+ * to 0. Later, when the THP is removed from swap cache, the
+ * SWAP_HAS_CACHE flag will be cleared in the swap_map[] of all swap
+ * entries in the huge swap cluster. And this huge swap cluster will
+ * be split if its PMD swap mapping count is 0.
 */
 void put_swap_page(struct page *page, swp_entry_t entry)
 {
@@ -1332,15 +1341,23 @@ void put_swap_page(struct page *page, swp_entry_t entry)
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
+		VM_BUG_ON(!IS_ALIGNED(offset, size));
 		map = si->swap_map + offset;
-		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
-			val = map[i];
-			VM_BUG_ON(!(val & SWAP_HAS_CACHE));
-			if (val == SWAP_HAS_CACHE)
-				free_entries++;
+		/*
+		 * No PMD swap mapping, the swap cluster will be freed
+		 * if all swap entries becoming free, otherwise the
+		 * huge swap cluster will be split.
+		 */
+		if (!cluster_swapcount(ci)) {
+			for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+				val = map[i];
+				VM_BUG_ON(!(val & SWAP_HAS_CACHE));
+				if (val == SWAP_HAS_CACHE)
+					free_entries++;
+			}
+			if (free_entries != SWAPFILE_CLUSTER)
+				cluster_clear_huge(ci);
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
-- 
2.16.4
[PATCH -V5 RESEND 07/21] swap: Support PMD swap mapping in split_swap_cluster()
When splitting a THP in swap cache, or when failing to allocate a THP on swapin of a huge swap cluster, the huge swap cluster will be split. In addition to clearing the huge flag of the swap cluster, the PMD swap mapping count recorded in cluster_count() will be set to 0. But we will not touch the PMD swap mappings themselves, because it is sometimes hard to find them all. When the PMD swap mappings are operated on later, it will be found that the huge swap cluster has been split, and the PMD swap mappings will be split at that time.

Unless splitting a THP in swap cache (specified via the "force" parameter), split_swap_cluster() will return -EEXIST if the SWAP_HAS_CACHE flag is set in swap_map[offset], because this indicates that a THP corresponds to this huge swap cluster, and it isn't desirable to split the THP.

When splitting a THP in swap cache, the call to split_swap_cluster() is moved to before unlocking the sub-pages, so that all sub-pages are kept locked from the time the THP is split until the huge swap cluster is split. This makes the code much easier to reason about.

Signed-off-by: "Huang, Ying" Cc: "Kirill A.
Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- include/linux/swap.h | 6 -- mm/huge_memory.c | 18 ++-- mm/swapfile.c| 58 +--- 3 files changed, 57 insertions(+), 25 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index a2a3d85decd9..c0c3b3c077d7 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -616,11 +616,13 @@ static inline swp_entry_t get_swap_page(struct page *page) #endif /* CONFIG_SWAP */ +#define SSC_SPLIT_CACHED 0x1 + #ifdef CONFIG_THP_SWAP -extern int split_swap_cluster(swp_entry_t entry); +extern int split_swap_cluster(swp_entry_t entry, unsigned long flags); extern int split_swap_cluster_map(swp_entry_t entry); #else -static inline int split_swap_cluster(swp_entry_t entry) +static inline int split_swap_cluster(swp_entry_t entry, unsigned long flags) { return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b8b61a0879f6..64123cefa978 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2502,6 +2502,17 @@ static void __split_huge_page(struct page *page, struct list_head *list, unfreeze_page(head); + /* +* Split swap cluster before unlocking sub-pages. So all +* sub-pages will be kept locked from THP has been split to +* swap cluster is split. 
+*/ + if (PageSwapCache(head)) { + swp_entry_t entry = { .val = page_private(head) }; + + split_swap_cluster(entry, SSC_SPLIT_CACHED); + } + for (i = 0; i < HPAGE_PMD_NR; i++) { struct page *subpage = head + i; if (subpage == page) @@ -2728,12 +2739,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) __dec_node_page_state(page, NR_SHMEM_THPS); spin_unlock(&pgdata->split_queue_lock); __split_huge_page(page, list, flags); - if (PageSwapCache(head)) { - swp_entry_t entry = { .val = page_private(head) }; - - ret = split_swap_cluster(entry); - } else - ret = 0; + ret = 0; } else { if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) { pr_alert("total_mapcount: %u, page_count(): %u\n", diff --git a/mm/swapfile.c b/mm/swapfile.c index 16723b9d971a..ef2b42c199c0 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1469,23 +1469,6 @@ void put_swap_page(struct page *page, swp_entry_t entry) unlock_cluster_or_swap_info(si, ci); } -#ifdef CONFIG_THP_SWAP -int split_swap_cluster(swp_entry_t entry) -{ - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset = swp_offset(entry); - - si = _swap_info_get(entry); - if (!si) - return -EBUSY; - ci = lock_cluster(si, offset); - cluster_clear_huge(ci); - unlock_cluster(ci); - return 0; -} -#endif - static int swp_entry_cmp(const void *ent1, const void *ent2) { const swp_entry_t *e1 = ent1, *e2 = ent2; @@ -4064,6 +4047,47 @@ int split_swap_cluster_map(swp_entry_t entry) unlock_cluster(ci); return 0; } + +/* + * We will not try to split all PMD swap mappings to the swap cluster, + * because we haven't enough information available for that. Later, + * when the PMD swap mapping is duplicated or swapin, etc, the PMD + * swap mapping will be split and fallback to the PTE operations. + */ +int split_swap_cluster(swp_entry_t entry, unsigned long flags) +{ + struct swap_info_struct *si; + struct swap_cluster_info *ci
[PATCH -V5 RESEND 00/21] swap: Swapout/swapin THP in one piece
Hi, Andrew, could you help me to check whether the overall design is reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the swap part of the patchset? Especially [02/21], [03/21], [04/21], [05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21], [12/21], [20/21], [21/21].

Hi, Andrea and Kirill, could you help me to review the THP part of the patchset? Especially [01/21], [07/21], [09/21], [11/21], [13/21], [15/21], [16/21], [17/21], [18/21], [19/21], [20/21].

Hi, Johannes and Michal, could you help me to review the cgroup part of the patchset? Especially [14/21].

And for all, any comment is welcome!

This patchset is based on the 2018-09-04 head of mmotm/master.

This is the final step of the THP (Transparent Huge Page) swap optimization. After the first and second steps, the splitting of the huge page is delayed from almost the first step of swapout to after swapout has finished. In this step, we avoid splitting the THP for swapout and swap out/in the THP in one piece.

We tested the patchset with the vm-scalability benchmark swap-w-seq test case, with 16 processes. The test case forks 16 processes. Each process allocates a large anonymous memory range, and writes it from begin to end for 8 rounds. The first round will swap out, while the remaining rounds will swap in and swap out. The test is done on a Xeon E5 v3 system, and the swap device used is a RAM simulated PMEM (persistent memory) device. The test result is as follows:

            base                optimized
          %stddev   %change       %stddev
  1417897 ± 2%     +992.8%   15494673        vm-scalability.throughput
  1020489 ± 4%    +1091.2%   12156349        vmstat.swap.si
  1255093 ± 3%     +940.3%   13056114        vmstat.swap.so
  1259769 ± 7%    +1818.3%   24166779        meminfo.AnonHugePages
 28021761          -10.7%    25018848 ± 2%   meminfo.AnonPages
 64080064 ± 4%     -95.6%     2787565 ± 33%  interrupts.CAL:Function_call_interrupts
    13.91 ± 5%     -13.8         0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath

Where the score of the benchmark (bytes written per second) improved 992.8%.
The swapout/swapin throughput improved 1008% (from about 2.17GB/s to 24.04GB/s). The performance difference is huge. In the base kernel, for the first round of writing, the THP is swapped out and split, so in the remaining rounds there is only normal page swapin and swapout. In the optimized kernel, the THP is kept after the first swapout, so THP swapin and swapout are used in the remaining rounds. This shows the key benefit of swapping out/in the THP in one piece: the THP will be kept instead of being split. The meminfo information verified this: in the base kernel only 4.5% of anonymous pages are in THPs during the test, while in the optimized kernel 96.6% are. The TLB flushing IPI (represented as interrupts.CAL:Function_call_interrupts) was reduced 95.6%, while cycles for spinlocks were reduced from 13.9% to 0.1%. These are performance benefits of THP swapout/swapin too.

Below is the description of all steps of the THP swap optimization.

Recently, the performance of storage devices improved so fast that we cannot saturate the disk bandwidth with a single logical CPU when doing page swapping, even on a high-end server machine, because the performance of the storage devices has improved faster than that of a single logical CPU. And it seems that this trend will not change in the near future. On the other hand, THPs become more and more popular because of increased memory sizes. So it becomes necessary to optimize THP swap performance.

The advantages of swapping out/in a THP in one piece include:

- Batch various swap operations for the THP. Many operations need to be done once per THP instead of per normal page, for example, allocating/freeing the swap space, writing/reading the swap space, flushing the TLB, page faults, etc. This will improve the performance of THP swap greatly.

- The THP swap space read/write will be large sequential IO (2M on x86_64). It is particularly helpful for swapin, which is usually 4k random IO. This will improve the performance of THP swap too.
- It will help memory fragmentation, especially when the THP is heavily used by the applications. Pages of THP order will be freed up after THP swapout.

- It will improve THP utilization on systems with swap turned on, because the speed at which khugepaged collapses normal pages into a THP is quite slow. After the THP is split during swapout, it will take quite a long time for the normal pages to be collapsed back into a THP after being swapped in. High THP utilization helps the efficiency of page-based memory management too.

There are some concerns regarding THP swapin, mainly because the possibly enlarged read/write IO size (for swapout/swapin) may put more overhead on the storage device.
[PATCH -V5 RESEND 02/21] swap: Add __swap_duplicate_locked()
The part of __swap_duplicate() executed with the lock held is separated into a new function, __swap_duplicate_locked(), because we will add more logic about the PMD swap mapping into __swap_duplicate() and keep most of the PTE swap mapping related logic in __swap_duplicate_locked(). This is just mechanical code refactoring; there is no functional change in this patch.

Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan
---
 mm/swapfile.c | 63 +--
 1 file changed, 35 insertions(+), 28 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 97a1bd1a7c9a..6a570ef00fa7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3436,32 +3436,12 @@ void si_swapinfo(struct sysinfo *val)
 	spin_unlock(&swap_lock);
 }
 
-/*
- * Verify that a swap entry is valid and increment its swap map count.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swp_entry is migration entry -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count.
-> ENOMEM - */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +static int __swap_duplicate_locked(struct swap_info_struct *p, + unsigned long offset, unsigned char usage) { - struct swap_info_struct *p; - struct swap_cluster_info *ci; - unsigned long offset; unsigned char count; unsigned char has_cache; - int err = -EINVAL; - - p = get_swap_device(entry); - if (!p) - goto out; - - offset = swp_offset(entry); - ci = lock_cluster_or_swap_info(p, offset); + int err = 0; count = p->swap_map[offset]; @@ -3471,12 +3451,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage) */ if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { err = -ENOENT; - goto unlock_out; + goto out; } has_cache = count & SWAP_HAS_CACHE; count &= ~SWAP_HAS_CACHE; - err = 0; if (usage == SWAP_HAS_CACHE) { @@ -3503,11 +3482,39 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage) p->swap_map[offset] = count | has_cache; -unlock_out: +out: + return err; +} + +/* + * Verify that a swap entry is valid and increment its swap map count. + * + * Returns error code in following case. + * - success -> 0 + * - swp_entry is invalid -> EINVAL + * - swp_entry is migration entry -> EINVAL + * - swap-cache reference is requested but there is already one. -> EEXIST + * - swap-cache reference is requested but the entry is not used. -> ENOENT + * - swap-mapped reference requested but needs continued swap count. -> ENOMEM + */ +static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +{ + struct swap_info_struct *p; + struct swap_cluster_info *ci; + unsigned long offset; + int err = -EINVAL; + + p = get_swap_device(entry); + if (!p) + goto out; + + offset = swp_offset(entry); + ci = lock_cluster_or_swap_info(p, offset); + err = __swap_duplicate_locked(p, offset, usage); unlock_cluster_or_swap_info(p, ci); + + put_swap_device(p); out: - if (p) - put_swap_device(p); return err; } -- 2.16.4
[PATCH -V5 RESEND 03/21] swap: Support PMD swap mapping in swap_duplicate()
To support swapping in the THP in one piece, we need to create a PMD swap mapping during swapout and maintain the PMD swap mapping count. This patch implements the support to increase the PMD swap mapping count (for swapout, fork, etc.) and to set the SWAP_HAS_CACHE flag (for swapin, etc.) for a huge swap cluster in the swap_duplicate() function family. Although it only implements a part of the design of the swap reference count with PMD swap mapping, the whole design is described as follows to make it easy to understand the patch and the whole picture.

A huge swap cluster is used to hold the contents of a swapped-out THP. After swapout, a PMD page mapping to the THP will become a PMD swap mapping to the huge swap cluster via a swap entry in the PMD, while a PTE page mapping to a subpage of the THP will become a PTE swap mapping to a swap slot in the huge swap cluster via a swap entry in the PTE. If there is no PMD swap mapping and the corresponding THP is removed from the page cache (reclaimed), the huge swap cluster will be split and become a normal swap cluster.

The count (cluster_count()) of the huge swap cluster is SWAPFILE_CLUSTER (= HPAGE_PMD_NR) + the PMD swap mapping count. Because all swap slots in the huge swap cluster are mapped by PTE or PMD, or have the SWAP_HAS_CACHE bit set, the usage count of the swap cluster is HPAGE_PMD_NR. And the PMD swap mapping count is recorded too, to make it easy to determine whether there are remaining PMD swap mappings.

The count in swap_map[offset] is the sum of the PTE and PMD swap mapping counts. This means that when we increase the PMD swap mapping count, we need to increase swap_map[offset] for all swap slots inside the swap cluster. An alternative choice is to make swap_map[offset] record the PTE swap map count only, given that we have recorded the PMD swap mapping count in the count of the huge swap cluster. But this would need to increase swap_map[offset] when splitting the PMD swap mapping, which may fail because of the memory allocation for swap count continuation.
That is hard to deal with, so we chose the current solution.

The PMD swap mapping to a huge swap cluster may be split when unmapping part of the PMD mapping, etc. That is easy, because only the count of the huge swap cluster needs to be changed. When the last PMD swap mapping is gone and SWAP_HAS_CACHE is unset, we will split the huge swap cluster (clear the huge flag). This makes it easy to reason about the cluster state.

A huge swap cluster will be split when splitting the THP in swap cache, or when failing to allocate a THP during swapin, etc. But when splitting the huge swap cluster, we will not try to split all PMD swap mappings, because sometimes we don't have enough information available for that. Later, when the PMD swap mapping is duplicated or swapped in, etc., the PMD swap mapping will be split and fall back to the PTE operation.

When a THP is added into swap cache, the SWAP_HAS_CACHE flag will be set in the swap_map[offset] of all swap slots inside the huge swap cluster backing the THP. This huge swap cluster will not be split unless the THP is split, even if its PMD swap mapping count drops to 0. Later, when the THP is removed from swap cache, the SWAP_HAS_CACHE flag will be cleared in the swap_map[offset] of all swap slots inside the huge swap cluster. And this huge swap cluster will be split if its PMD swap mapping count is 0.

The first parameter of swap_duplicate() is changed to return the swap entry to call add_swap_count_continuation() for, because we may need to call it for a swap entry in the middle of a huge swap cluster.

Signed-off-by: "Huang, Ying" Cc: "Kirill A.
Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- include/linux/swap.h | 9 +++-- mm/memory.c | 2 +- mm/rmap.c| 2 +- mm/swap_state.c | 2 +- mm/swapfile.c| 107 ++- 5 files changed, 97 insertions(+), 25 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ca7c6307bda7..1bee8b65cb8a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -451,8 +451,8 @@ extern swp_entry_t get_swap_page_of_type(int); extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); -extern int swap_duplicate(swp_entry_t); -extern int swapcache_prepare(swp_entry_t); +extern int swap_duplicate(swp_entry_t *entry, int entry_size); +extern int swapcache_prepare(swp_entry_t entry, int entry_size); extern void swap_free(swp_entry_t); extern void swapcache_free_entries(swp_entry_t *entries, int n); extern int free_swap_and_cache(swp_entry_t); @@ -510,7 +510,8 @@ static inline void show_swap_cache_info(void) } #define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
[PATCH -V5 RESEND 01/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP
Currently, "the swap entry" in the page tables is used for a number of things outside of actual swap, like page migration, etc. We support the THP/PMD "swap entry" for page migration currently and the functions behind this are tied to page migration's config option (CONFIG_ARCH_ENABLE_THP_MIGRATION). But, we also need them for THP swap optimization. So a new config option (CONFIG_HAVE_PMD_SWAP_ENTRY) is added. It is enabled when either CONFIG_ARCH_ENABLE_THP_MIGRATION or CONFIG_THP_SWAP is enabled. And PMD swap entry functions are tied to this new config option instead. Some functions enabled by CONFIG_ARCH_ENABLE_THP_MIGRATION are for page migration only, they are still enabled only for that. Signed-off-by: "Huang, Ying" Cc: "Kirill A. Shutemov" Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Johannes Weiner Cc: Shaohua Li Cc: Hugh Dickins Cc: Minchan Kim Cc: Rik van Riel Cc: Dave Hansen Cc: Naoya Horiguchi Cc: Zi Yan Cc: Daniel Jordan --- arch/x86/include/asm/pgtable.h | 2 +- include/asm-generic/pgtable.h | 2 +- include/linux/swapops.h| 44 ++ mm/Kconfig | 8 4 files changed, 33 insertions(+), 23 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index e4ffa565a69f..194f97dc4583 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1334,7 +1334,7 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte) return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY); } -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifdef CONFIG_HAVE_PMD_SWAP_ENTRY static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY); diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 5657a20e0c59..eb1e9d17371b 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -675,7 +675,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm, #endif #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY -#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION +#ifndef 
CONFIG_HAVE_PMD_SWAP_ENTRY static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd) { return pmd; diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 22af9d8a84ae..79ccbf8789d5 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -259,17 +259,7 @@ static inline int is_write_migration_entry(swp_entry_t entry) #endif -struct page_vma_mapped_walk; - -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION -extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, - struct page *page); - -extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, - struct page *new); - -extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd); - +#ifdef CONFIG_HAVE_PMD_SWAP_ENTRY static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd) { swp_entry_t arch_entry; @@ -287,6 +277,28 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry) arch_entry = __swp_entry(swp_type(entry), swp_offset(entry)); return __swp_entry_to_pmd(arch_entry); } +#else +static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd) +{ + return swp_entry(0, 0); +} + +static inline pmd_t swp_entry_to_pmd(swp_entry_t entry) +{ + return __pmd(0); +} +#endif + +struct page_vma_mapped_walk; + +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION +extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, + struct page *page); + +extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, + struct page *new); + +extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd); static inline int is_pmd_migration_entry(pmd_t pmd) { @@ -307,16 +319,6 @@ static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { } -static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd) -{ - return swp_entry(0, 0); -} - -static inline pmd_t swp_entry_to_pmd(swp_entry_t entry) -{ - return __pmd(0); -} - static inline int is_pmd_migration_entry(pmd_t pmd) { return 0; diff --git a/mm/Kconfig 
b/mm/Kconfig index 7bf074bf79e5..9a6e7e27e8d5 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -430,6 +430,14 @@ config THP_SWAP For selection by architectures with reasonable THP sizes. +# +# "PMD swap entry" in the page table is used both for migration and +# actual swap. +# +config HAVE_PMD_SWAP_ENTRY + def_bool y + depends on THP_SWAP || ARCH_ENABLE_THP_MIGRATION + config TRANSPARENT_HUGE_PAGECACHE def_bool y depends on TRANSPARENT_HUGEPAGE -- 2.16.4
Re: [LKP] [vfs] fd0002870b: BUG:KASAN:null-ptr-deref_in_n
On 09/12/2018 04:29 AM, David Howells wrote: kernel test robot wrote: [ 18.568403] nfs_fs_mount+0x901/0x1220 I don't suppose you can tell me what file and line number this corresponds to? $ faddr2line vmlinux nfs_fs_mount+0x901 nfs_fs_mount+0x901/0x1218: nfs_parse_devname at fs/nfs/super.c:1911 (inlined by) nfs_validate_text_mount_data at fs/nfs/super.c:2187 (inlined by) nfs_fs_mount at fs/nfs/super.c:2684 Also, can you tell me what the mount parameters were? I'm not sure how to extract them from the information provided. qemu command (you could get from 'bin/lkp qemu -k job-script'): qemu-system-x86_64 -enable-kvm -fsdev local,id=test_dev,path=/home/nfs/.lkp//result/boot/1/vm-kbuild-1G/debian-x86_64-2018-04-03.cgz/x86_64-randconfig-r0-09070102/gcc-6/fd0002870b453c58d0d8c195954f5049bc6675fb/0,security_model=none -device virtio-9p-pci,fsdev=test_dev,mount_tag=9p/virtfs_mount -kernel vmlinuz-4.19.0-rc1-00104-gfd00028 -append root=/dev/ram0 user=lkp job=/lkp/jobs/scheduled/vm-kbuild-1G-11/boot-1-debian-x86_64-2018-04-03.cgz-fd0002870b453c58d0d8c195954f5049bc6675fb-20180910-6016-1hqt4et-1.yaml ARCH=x86_64 kconfig=x86_64-randconfig-r0-09070102 branch=linux-devel/devel-hourly-2018090623 commit=fd0002870b453c58d0d8c195954f5049bc6675fb BOOT_IMAGE=/pkg/linux/x86_64-randconfig-r0-09070102/gcc-6/fd0002870b453c58d0d8c195954f5049bc6675fb/vmlinuz-4.19.0-rc1-00104-gfd00028 max_uptime=600 RESULT_ROOT=/result/boot/1/vm-kbuild-1G/debian-x86_64-2018-04-03.cgz/x86_64-randconfig-r0-09070102/gcc-6/fd0002870b453c58d0d8c195954f5049bc6675fb/3 LKP_LOCAL_RUN=1 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_loglevel console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw ip=dhcp result_service=9p/virtfs_mount -initrd /home/nfs/.lkp/cache/final_initrd -smp 2 -m 1024M -no-reboot 
-watchdog i6300esb -rtc base=localtime -device e1000,netdev=net0 -netdev user,id=net0 -display none -monitor null -serial stdio -device virtio-scsi-pci,id=scsi0 -drive file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-0,if=none,id=hd0,media=disk,aio=native,cache=none -device scsi-hd,bus=scsi0.0,drive=hd0,scsi-id=1,lun=0 -drive file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-1,if=none,id=hd1,media=disk,aio=native,cache=none -device scsi-hd,bus=scsi0.0,drive=hd1,scsi-id=1,lun=1 -drive file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-2,if=none,id=hd2,media=disk,aio=native,cache=none -device scsi-hd,bus=scsi0.0,drive=hd2,scsi-id=1,lun=2 -drive file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-3,if=none,id=hd3,media=disk,aio=native,cache=none -device scsi-hd,bus=scsi0.0,drive=hd3,scsi-id=1,lun=3 -drive file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-4,if=none,id=hd4,media=disk,aio=native,cache=none -device scsi-hd,bus=scsi0.0,drive=hd4,scsi-id=1,lun=4 Best Regards, Rong Chen Thanks, David ___ LKP mailing list l...@lists.01.org https://lists.01.org/mailman/listinfo/lkp