[tip:perf/core] kprobes: Don't call BUG_ON() if there is a kprobe in use on free list

2018-09-11 Thread tip-bot for Masami Hiramatsu
Commit-ID:  cbdd96f5586151e48317d90a403941ec23f12660
Gitweb: https://git.kernel.org/tip/cbdd96f5586151e48317d90a403941ec23f12660
Author: Masami Hiramatsu 
AuthorDate: Tue, 11 Sep 2018 19:21:09 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 12 Sep 2018 08:01:16 +0200

kprobes: Don't call BUG_ON() if there is a kprobe in use on free list

Instead of calling BUG_ON(), if we find a kprobe in use on the free kprobe
list, just remove it from the list and keep it on the kprobe hash list,
the same as other in-use kprobes.

Signed-off-by: Masami Hiramatsu 
Cc: Anil S Keshavamurthy 
Cc: David S . Miller 
Cc: Linus Torvalds 
Cc: Naveen N . Rao 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/153666126882.21306.10738207224288507996.stgit@devbox
Signed-off-by: Ingo Molnar 
---
 kernel/kprobes.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 63c342e5e6c3..90e98e233647 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -546,8 +546,14 @@ static void do_free_cleaned_kprobes(void)
struct optimized_kprobe *op, *tmp;
 
list_for_each_entry_safe(op, tmp, &freeing_list, list) {
-   BUG_ON(!kprobe_unused(&op->kp));
list_del_init(&op->list);
+   if (WARN_ON_ONCE(!kprobe_unused(&op->kp))) {
+   /*
+* This must not happen, but if there is a kprobe
+* still in use, keep it on kprobes hash list.
+*/
+   continue;
+   }
free_aggr_kprobe(&op->kp);
}
 }
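The new control flow is easy to model outside the kernel. The following is a hypothetical userspace sketch (struct fake_probe and drain_free_list() are invented names, and "skip" stands in for unlinking the entry and leaving it on the hash list): an entry that is unexpectedly still in use is left alive rather than taking the whole machine down with BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace analogue of the warn-and-skip pattern above: while
 * draining a free list, an entry that is still in use is skipped
 * and kept alive instead of aborting. */
struct fake_probe {
	int in_use;              /* stand-in for !kprobe_unused() */
	int freed;               /* set when "free_aggr_kprobe" would run */
	struct fake_probe *next; /* singly linked free list */
};

/* Returns the number of entries actually freed. */
static int drain_free_list(struct fake_probe *head)
{
	int freed = 0;

	for (struct fake_probe *p = head; p; p = p->next) {
		if (p->in_use)
			continue;	/* warn-and-skip, never BUG_ON() */
		p->freed = 1;
		freed++;
	}
	return freed;
}
```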


[tip:perf/core] kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()

2018-09-11 Thread tip-bot for Masami Hiramatsu
Commit-ID:  819319fc93461c07b9cdb3064f154bd8cfd48172
Gitweb: https://git.kernel.org/tip/819319fc93461c07b9cdb3064f154bd8cfd48172
Author: Masami Hiramatsu 
AuthorDate: Tue, 11 Sep 2018 19:20:40 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 12 Sep 2018 08:01:16 +0200

kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()

Make reuse_unused_kprobe() return an error code if it
fails to reuse an unused kprobe for an optprobe, instead
of calling BUG_ON().

Signed-off-by: Masami Hiramatsu 
Cc: Anil S Keshavamurthy 
Cc: David S . Miller 
Cc: Linus Torvalds 
Cc: Naveen N . Rao 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/153666124040.21306.14150398706331307654.stgit@devbox
Signed-off-by: Ingo Molnar 
---
 kernel/kprobes.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 277a6cbe83db..63c342e5e6c3 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -700,9 +700,10 @@ static void unoptimize_kprobe(struct kprobe *p, bool force)
 }
 
 /* Cancel unoptimizing for reusing */
-static void reuse_unused_kprobe(struct kprobe *ap)
+static int reuse_unused_kprobe(struct kprobe *ap)
 {
struct optimized_kprobe *op;
+   int ret;
 
/*
 * Unused kprobe MUST be on the way of delayed unoptimizing (means
@@ -713,8 +714,12 @@ static void reuse_unused_kprobe(struct kprobe *ap)
/* Enable the probe again */
ap->flags &= ~KPROBE_FLAG_DISABLED;
/* Optimize it again (remove from op->list) */
-   BUG_ON(!kprobe_optready(ap));
+   ret = kprobe_optready(ap);
+   if (ret)
+   return ret;
+
optimize_kprobe(ap);
+   return 0;
 }
 
 /* Remove optimized instructions */
@@ -939,11 +944,16 @@ static void __disarm_kprobe(struct kprobe *p, bool reopt)
 #define kprobe_disarmed(p) kprobe_disabled(p)
 #define wait_for_kprobe_optimizer() do {} while (0)
 
-/* There should be no unused kprobes can be reused without optimization */
-static void reuse_unused_kprobe(struct kprobe *ap)
+static int reuse_unused_kprobe(struct kprobe *ap)
 {
+   /*
+* If the optimized kprobe is NOT supported, the aggr kprobe is
+* released at the same time that the last aggregated kprobe is
+* unregistered.
+* Thus there should be no chance to reuse unused kprobe.
+*/
printk(KERN_ERR "Error: There should be no unused kprobe here.\n");
-   BUG_ON(kprobe_unused(ap));
+   return -EINVAL;
 }
 
 static void free_aggr_kprobe(struct kprobe *p)
@@ -1315,9 +1325,12 @@ static int register_aggr_kprobe(struct kprobe *orig_p, 
struct kprobe *p)
goto out;
}
init_aggr_kprobe(ap, orig_p);
-   } else if (kprobe_unused(ap))
+   } else if (kprobe_unused(ap)) {
/* This probe is going to die. Rescue it */
-   reuse_unused_kprobe(ap);
+   ret = reuse_unused_kprobe(ap);
+   if (ret)
+   goto out;
+   }
 
if (kprobe_gone(ap)) {
/*
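The shape of this conversion can be modelled in userspace. One subtlety worth noting: kprobe_optready() reports readiness as a nonzero value, so when BUG_ON(!kprobe_optready(ap)) is replaced by an error return, the predicate has to stay inverted so that an error is returned only when the probe is *not* ready. The sketch below (optready(), reuse_probe(), register_probe() are all invented names) shows that shape:

```c
#include <assert.h>
#include <errno.h>

/* optready() stands in for kprobe_optready(): nonzero means "ready",
 * so the error path must trigger only when NOT ready. */
static int optready(int prepared)
{
	return prepared;
}

static int reuse_probe(int prepared)
{
	if (!optready(prepared))
		return -EINVAL;	/* was: BUG_ON(!optready(...)) */
	/* optimize_kprobe() would run here */
	return 0;
}

/* The caller propagates the error instead of crashing. */
static int register_probe(int prepared)
{
	int ret = reuse_probe(prepared);

	if (ret)
		return ret;
	return 0;
}
```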


Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU

2018-09-11 Thread Srikar Dronamraju
* Mel Gorman  [2018-09-10 10:41:47]:

> On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > Srikar's patch here:
> > > 
> > >   
> > > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-sri...@linux.vnet.ibm.com
> > > 
> > > Also frobs this condition, but in a less radical way. Does that yield
> > > similar results?
> > 
> > I can check. I do wonder of course if the less radical approach just means
> > that automatic NUMA balancing and the load balancer simply disagree about
> > placement at a different time. It'll take a few days to have an answer as
> > the battery of workloads to check this take ages.
> > 
> 
> Tests completed over the weekend and I've found that the performance of
> both patches are very similar for two machines (both 2 socket) running a
> variety of workloads. Hence, I'm not worried about which patch gets picked
> up. However, I would prefer my own on the grounds that the additional
> complexity does not appear to get us anything. Of course, that changes if
> Srikar's tests on his larger ppc64 machines show the more complex approach
> is justified.
> 

Running SPECjbb2005. Higher bops are better.

Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
Kernel B = Kernel A + 6 patches 
(http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-sri...@linux.vnet.ibm.com)
Kernel C = Kernel B - "Avoid task migration for small numa improvement", i.e.
http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-sri...@linux.vnet.ibm.com
+ 2 patches from Mel:
"Do not move imbalanced load purely on the basis of an idle CPU"
http://lore.kernel.org/lkml/20180907101139.20760-5-mgor...@techsingularity.net
"Stop comparing tasks for NUMA placement"
http://lore.kernel.org/lkml/20180907101139.20760-4-mgor...@techsingularity.net

To me, Kernel B, which is the 13 patches accepted in v4.19-rc1 plus the 6
patches posted for review, seems to give the best performance.

The numbers are compared to the previous kernel, i.e.:
for Kernel A, v4.18 is prev
for Kernel B, Kernel A is prev
for Kernel C, Kernel B is prev

2 node x86 Haswell

v4.18 or 94710cac0ef4
JVMS  Prev Current  %Change
4 203769
1 316734

Kernel A
JVMS  Prev Current  %Change
4 203769  209790   2.95482
1 316734  312377   -1.3756

Kernel B
JVMS  Prev Current  %Change
4 209790  202059   -3.68511
1 312377  326987   4.67704

Kernel C
JVMS  Prev Current  %Change
4 202059  200681   -0.681979
1 326987  316715   -3.14141




4 Node / 2 Socket PowerNV / Power 8

v4.18 or 94710cac0ef4
JVMS  Prev Current  %Change
8 88411.9
1 222075

Kernel A
JVMS  Prev Current  %Change
8 88411.9  88733.5  0.363752
1 222075   214607   -3.36283

Kernel B
JVMS  Prev Current  %Change
8 88733.5  89952    1.37321
1 214607   217226   1.22037

Kernel C
JVMS  Prev Current  %Change
8 89952   89912.9  -0.0434676
1 217226  219281   0.946019





2 Node / 2 Socket Power 9 / PowerNV

v4.18 or 94710cac0ef4
JVMS  Prev Current  %Change
4 195989
1 202854

Kernel A
JVMS  Prev Current  %Change
4 195989  193108   -1.46998
1 202854  204042   0.585643

Kernel B
JVMS  Prev Current  %Change
4 193108  196422   1.71614
1 204042  211219   3.51741

Kernel C
JVMS  Prev Current  %Change
4 196422  195052   -0.697478
1 211219  207854   -1.59313




4 Node / 4 Socket Power 7 PhyP LPAR.

v4.18 or 94710cac0ef4
JVMS  Prev Current  %Change
8 52826.9
1 103103

Kernel A
JVMS  Prev Current  %Change
8 52826.9  59504.4  12.6403
1 103103   102542   -0.544116

Kernel B
JVMS  Prev Current  %Change
8 59504.4  61674.8  3.64746
1 102542   108211   5.52847

Kernel C
JVMS  Prev Current  %Change
8 61674.8  57946.5  -6.04509
1 108211   104533   -3.39892
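For reference, the %Change column in the tables above follows the convention (Current - Prev) / Prev * 100, relative to the previous kernel in the comparison chain (v4.18 -> A -> B -> C). A quick sketch that reproduces, e.g., the 2.95482 entry from the 2-node x86 Kernel A table:

```c
#include <assert.h>

/* %Change = (Current - Prev) / Prev * 100, where Prev is the result
 * of the previous kernel in the comparison chain. */
static double pct_change(double prev, double cur)
{
	return (cur - prev) / prev * 100.0;
}
```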



Re: [RFC 0/4] perf: Per PMU access controls (paranoid setting)

2018-09-11 Thread Alexey Budankov


Hi,

Are there any plans, or maybe even some progress, on that so far?

Thanks,
Alexey

On 26.06.2018 18:36, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> For situations where sysadmins might want to allow different level of
> access control for different PMUs, we start creating per-PMU
> perf_event_paranoid controls in sysfs.
> 
> These work in an equivalent fashion to the existing perf_event_paranoid
> sysctl, which now becomes the parent control for each PMU.
> 
> On PMU registration the global/parent value will be inherited by each PMU,
> as it will be propagated to all registered PMUs when the sysctl is
> updated.
> 
> At any later point individual PMU access controls, located in
> /device//perf_event_paranoid, can be adjusted to achieve
> fine grained access control.
> 
> Discussion from previous posting:
> https://lkml.org/lkml/2018/5/21/156
> 
> Cc: Thomas Gleixner 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Alexander Shishkin 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Madhavan Srinivasan 
> Cc: Andi Kleen 
> Cc: Alexey Budankov 
> Cc: linux-kernel@vger.kernel.org
> Cc: x...@kernel.org
> 
> Tvrtko Ursulin (4):
>   perf: Move some access checks later in perf_event_open
>   perf: Pass pmu pointer to perf_paranoid_* helpers
>   perf: Allow per PMU access control
>   perf Documentation: Document the per PMU perf_event_paranoid interface
> 
>  .../sysfs-bus-event_source-devices-events |  14 +++
>  arch/powerpc/perf/core-book3s.c   |   2 +-
>  arch/x86/events/intel/bts.c   |   2 +-
>  arch/x86/events/intel/core.c  |   2 +-
>  arch/x86/events/intel/p4.c|   2 +-
>  include/linux/perf_event.h|  18 ++-
>  kernel/events/core.c  | 104 +++---
>  kernel/sysctl.c   |   4 +-
>  kernel/trace/trace_event_perf.c   |   6 +-
>  9 files changed, 123 insertions(+), 31 deletions(-)
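The proposed semantics (inherit the global value on PMU registration, propagate a sysctl update to every registered PMU, override a single PMU via its sysfs file) can be sketched in userspace. Everything below (fake_pmu, the setter names) is illustrative, not the actual kernel API:

```c
#include <assert.h>

#define MAX_PMUS 4

struct fake_pmu {
	int paranoid;
};

static int global_paranoid = 2;		/* parent perf_event_paranoid */
static struct fake_pmu *pmus[MAX_PMUS];
static int nr_pmus;

/* On registration the PMU inherits the global/parent value. */
static void register_pmu(struct fake_pmu *pmu)
{
	pmu->paranoid = global_paranoid;
	pmus[nr_pmus++] = pmu;
}

/* A sysctl write propagates to all registered PMUs. */
static void sysctl_set_paranoid(int val)
{
	global_paranoid = val;
	for (int i = 0; i < nr_pmus; i++)
		pmus[i]->paranoid = val;
}

/* A per-PMU sysfs write overrides just that PMU. */
static void sysfs_set_pmu_paranoid(struct fake_pmu *pmu, int val)
{
	pmu->paranoid = val;
}
```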
> 


Re: [PATCH] printk: inject caller information into the body of message

2018-09-11 Thread Sergey Senozhatsky
On (09/10/18 13:20), Alexander Potapenko wrote:
> > Awesome. If you and Fengguang can combine forces and lead the
> > whole thing towards "we couldn't care of pr_cont() less", it
> > would be really huge. Go for it!
> 
> Sorry, folks, am I understanding right that pr_cont() and flushing the
> buffer on "\n" are two separate problems that can be handled outside
> Tetsuo's patchset, just assuming pr_cont() is unsupported?
> Or should the pr_cont() cleanup be a prerequisite for that?

Oh... Sorry. I'm quite overloaded at the moment and simply forgot about
this thread.

So what exactly is our problem with pr_cont? It's not SMP-friendly.
And this leads to various things, the most annoying of which is a
preliminary flush.

E.g. let me do a simple thing on my box:

ps aux | grep firefox
kill 2727

dmesg | tail
[  554.098341] Chrome_~dThread[2823]: segfault at 0 ip 7f5df153a1f3 sp 
7f5ded47ab00 error 6 in libxul.so[7f5df1531000+4b01000]
[  554.098348] Code: e7 04 48 8d 15 a6 94 ae 03 48 89 10 c7 04 25 00 00 00 00 
00 00 00 00 0f 0b 48 8b 05 57 d0 e7 04 48 8d 0d b0 94 ae 03 48 89 08  04 25 
00 00 00 00 00 00 00 00 0f 0b e8 4d f4 ff ff 48 8b 05 34
[  554.109418] Chrome_~dThread[3047]: segfault at 0 ip 7f3d5bdba1f3 sp 
7f3d57cfab00 error 6
[  554.109421] Chrome_~dThread[3077]: segfault at 0 ip 7fe773f661f3 sp 
7fe76fea6b00 error 6
[  554.109424]  in libxul.so[7f3d5bdb1000+4b01000]
[  554.109426]  in libxul.so[7fe773f5d000+4b01000]
[  554.109429] Code: e7 04 48 8d 15 a6 94 ae 03 48 89 10 c7 04 25 00 00 00 00 
00 00 00 00 0f 0b 48 8b 05 57 d0 e7 04 48 8d 0d b0 94 ae 03 48 89 08  04 25 
00 00 00 00 00 00 00 00 0f 0b e8 4d f4 ff ff 48 8b 05 34


Even such a simple thing as "printk several lines per-crashed process"
is broken. Look at line #0 and lines #2-#5.

And this is the only problem we probably need to address. Overlapping
printk lines -- when several CPUs printk simultaneously, or same CPUs
printk-s from IRQ, etc -- are here by design and it's not going to be
easy to change that (and maybe we shouldn't try).


Buffering multiple lines in printk buffer does not look so simple and
perhaps we should not try to do this, as well. Why:

- it's hard to decide what to do when buffer overflows

Switching to "normal printk" defeats the reason we do buffering in the
first place, because "normal printk" permits overlapping. So buffering
makes little sense if we are OK with switching to a "normal printk".

- the more we buffer the more we can lose in case of panic.

We can't flush_on_panic() printk buffers which were allocated on stack.

- flushing multiple lines should be more complex than just a simple
  printk loop

  while (1) {
 x = memchr(buf, '\n', sz);
 ...
 print("%s", buf);
 ...
  }

Because "printk() loop" permits lines overlap. Hence buffering makes
little sense, once again.



So let's reduce the problem scope to "we want to have a replacement for
pr_cont()". And let's address pr_cont()'s "preliminary flush" issue only.


I scanned some of Linus' emails, and skimmed through previous discussions
on this topic. Let me quote Linus:

: 
: My preference as a user is actually to just have a dynamically
: re-sizable buffer (that's pretty much what I've done in *every* single
: user space project I've had in the last decade), but because some
: users might have atomicity issues I do suspect that we should just use
: a stack buffer.
: 
: And then perhaps say that the buffer size has to be capped at 80 characters.
: 
: Because if you're printing more than 80 characters and expecting it
: all to fit on a line, you're doing something else wrong anyway.
: 
: And hide it not as an explicit "char buffer[80]" allocation, but as a
: "struct line_buffer" or similar, so that
: 
:  (a) people don't get the line size wrong
: 
:  (b) the buffering code can add a few fields for length etc in there too
: 
: Introduce a few helper functions for it:
: 
:  init_line_buffer(&buf);
:  print_line(&buf, fmt, args);
:  vprint_line(&buf, fmt, vararg);
:  finish_line(&buf);
: 



And this is, basically, what I have attached to this email. It's very
simple and very short. And I think this is what Linus wanted us to do.

- usage example

        DEFINE_PR_LINE(KERN_ERR, pl);

        pr_line(&pl, "Hello, %s!\n", "buffer");
        pr_line(&pl, "%s", "OK.\n");
        pr_line(&pl, "Goodbye, %s", "buffer");
        pr_line(&pl, "\n");

dmesg | tail

[   69.908542] Hello, buffer!
[   69.908544] OK.
[   69.908545] Goodbye, buffer


- pr_cont-like usage

        DEFINE_PR_LINE(KERN_ERR, pl);

        pr_line(&pl, "%d ", 1);
        pr_line(&pl, "%d ", 3);
        pr_line(&pl, "%d ", 5);
        pr_line(&pl, "%d ", 7);
        pr_line(&pl, "%d\n", 9);

dmesg | tail

[   69.908546] 1 3 5 7 9


- An explicit, aux buffer // output should be truncated

        char buf[16];
        DEFINE_PR_LINE_BUF(KERN_ERR, ps, buf, sizeof(buf));

        pr_line(&ps, "Test test test test test test test test test\n");
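For illustration, the buffering behaviour can be mimicked in plain userspace C. This is only a sketch of the idea, under the assumption that struct pr_line carries a caller-owned, ~80-byte buffer (the flush here appends to a string instead of calling printk()): text accumulates until a '\n' arrives, so a logical line can never be split by other writers.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Userspace sketch of the pr_line proposal. */
struct pr_line {
	char buf[80];	/* capped line buffer, as Linus suggested */
	size_t len;
	char *out;	/* flushed lines land here, printk() stand-in */
};

static void pr_line(struct pr_line *pl, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	pl->len += vsnprintf(pl->buf + pl->len, sizeof(pl->buf) - pl->len,
			     fmt, ap);
	va_end(ap);

	if (pl->len >= sizeof(pl->buf))
		pl->len = sizeof(pl->buf) - 1;	/* truncated */

	if (pl->len && pl->buf[pl->len - 1] == '\n') {
		strcat(pl->out, pl->buf);	/* flush the complete line */
		pl->len = 0;
		pl->buf[0] = '\0';
	}
}
```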

[tip:perf/core] kprobes: Remove pointless BUG_ON() from reuse_unused_kprobe()

2018-09-11 Thread tip-bot for Masami Hiramatsu
Commit-ID:  a6d18e65dff2b73ceeb187c598b48898e36ad7b1
Gitweb: https://git.kernel.org/tip/a6d18e65dff2b73ceeb187c598b48898e36ad7b1
Author: Masami Hiramatsu 
AuthorDate: Tue, 11 Sep 2018 19:20:11 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 12 Sep 2018 08:01:16 +0200

kprobes: Remove pointless BUG_ON() from reuse_unused_kprobe()

Since reuse_unused_kprobe() is called when the given kprobe
is unused, checking it inside again with BUG_ON() is
pointless. Remove it.

Signed-off-by: Masami Hiramatsu 
Cc: Anil S Keshavamurthy 
Cc: David S . Miller 
Cc: Linus Torvalds 
Cc: Naveen N . Rao 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/153666121154.21306.17540752948574483565.stgit@devbox
Signed-off-by: Ingo Molnar 
---
 kernel/kprobes.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 231569e1e2c8..277a6cbe83db 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -704,7 +704,6 @@ static void reuse_unused_kprobe(struct kprobe *ap)
 {
struct optimized_kprobe *op;
 
-   BUG_ON(!kprobe_unused(ap));
/*
 * Unused kprobe MUST be on the way of delayed unoptimizing (means
 * there is still a relative jump) and disabled.


[tip:perf/core] kprobes: Remove pointless BUG_ON() from disarming process

2018-09-11 Thread tip-bot for Masami Hiramatsu
Commit-ID:  d0555fc78fdba5646a460e83bd2d8249c539bb89
Gitweb: https://git.kernel.org/tip/d0555fc78fdba5646a460e83bd2d8249c539bb89
Author: Masami Hiramatsu 
AuthorDate: Tue, 11 Sep 2018 19:19:14 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 12 Sep 2018 08:01:15 +0200

kprobes: Remove pointless BUG_ON() from disarming process

All aggr_probes at this line are already disarmed by
disable_kprobe() or checked by kprobe_disarmed().

So this BUG_ON() is pointless, remove it.

Signed-off-by: Masami Hiramatsu 
Cc: Anil S Keshavamurthy 
Cc: David S . Miller 
Cc: Linus Torvalds 
Cc: Naveen N . Rao 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/153666115463.21306.8799008438116029806.stgit@devbox
Signed-off-by: Ingo Molnar 
---
 kernel/kprobes.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index ab257be4d924..d1edd8d5641e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1704,7 +1704,6 @@ noclean:
return 0;
 
 disarmed:
-   BUG_ON(!kprobe_disarmed(ap));
hlist_del_rcu(&ap->hlist);
return 0;
 }


[tip:perf/core] kprobes: Remove pointless BUG_ON() from add_new_kprobe()

2018-09-11 Thread tip-bot for Masami Hiramatsu
Commit-ID:  c72e6742f62d7bb82a77a41ca53940cb8f73e60f
Gitweb: https://git.kernel.org/tip/c72e6742f62d7bb82a77a41ca53940cb8f73e60f
Author: Masami Hiramatsu 
AuthorDate: Tue, 11 Sep 2018 19:19:43 +0900
Committer:  Ingo Molnar 
CommitDate: Wed, 12 Sep 2018 08:01:15 +0200

kprobes: Remove pointless BUG_ON() from add_new_kprobe()

Before calling add_new_kprobe(), aggr_probe's GONE
flag and kprobe GONE flag are cleared. We don't need
to worry about that flag at this point.

Signed-off-by: Masami Hiramatsu 
Cc: Anil S Keshavamurthy 
Cc: David S . Miller 
Cc: Linus Torvalds 
Cc: Naveen N . Rao 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/153666118298.21306.4915366706875652652.stgit@devbox
Signed-off-by: Ingo Molnar 
---
 kernel/kprobes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index d1edd8d5641e..231569e1e2c8 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1259,8 +1259,6 @@ NOKPROBE_SYMBOL(cleanup_rp_inst);
 /* Add the new probe to ap->list */
 static int add_new_kprobe(struct kprobe *ap, struct kprobe *p)
 {
-   BUG_ON(kprobe_gone(ap) || kprobe_gone(p));
-
if (p->post_handler)
unoptimize_kprobe(ap, true);/* Fall back to normal kprobe */
 


[PATCH v2] mm: mprotect: check page dirty when change ptes

2018-09-11 Thread Peter Xu
Add an extra check on the page dirty bit in change_pte_range(), since
there might be cases where the PTE dirty bit is unset but the page is
actually dirtied. One example is when a huge PMD is split after being
written to: the dirty bit will be set on the compound page, but we won't
have the dirty bit set on each of the small-page PTEs.

I noticed this when debugging with a customized kernel that implemented
userfaultfd write-protect.  In that case, the dirty bit will be critical
since that's required for userspace to handle the write protect page
fault (otherwise it'll get a SIGBUS with a loop of page faults).
However, it should still be useful for upstream Linux to cover more
scenarios: we shouldn't need to take extra page faults on the small
pages if the previous huge page was already written, so the dirty-bit
optimization path underneath can cover more cases.

CC: Andrew Morton 
CC: Mel Gorman 
CC: Khalid Aziz 
CC: Thomas Gleixner 
CC: "David S. Miller" 
CC: Greg Kroah-Hartman 
CC: Andi Kleen 
CC: Henry Willard 
CC: Anshuman Khandual 
CC: Andrea Arcangeli 
CC: Kirill A. Shutemov 
CC: Jerome Glisse 
CC: Zi Yan 
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Peter Xu 
---
v2:
- checking the dirty bit when changing PTE entries rather than fixing up
  the dirty bit when splitting the huge page PMD.
- rebase to 4.19-rc3

Instead of keeping this in my local tree, I'm giving it another shot to
see whether this could be acceptable for upstream since IMHO it should
still benefit the upstream.  Thanks,
---
 mm/mprotect.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6d331620b9e5..5fe752515161 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -115,6 +115,17 @@ static unsigned long change_pte_range(struct 
vm_area_struct *vma, pmd_t *pmd,
if (preserve_write)
ptent = pte_mk_savedwrite(ptent);
 
+   /*
+* The extra PageDirty() check will make sure
+* we'll capture the dirty page even if the PTE
+* dirty bit is unset.  One case is when the
+* PTE is split from a huge PMD, in that
+* case the dirty flag might only be set on the
+* compound page instead of this PTE.
+*/
+   if (PageDirty(pte_page(ptent)))
+   ptent = pte_mkdirty(ptent);
+
/* Avoid taking write faults for known dirty pages */
if (dirty_accountable && pte_dirty(ptent) &&
(pte_soft_dirty(ptent) ||
-- 
2.17.1
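The predicate the hunk adds reduces to a simple OR of two dirty sources, shown here as a self-contained sketch (pte_effectively_dirty() is an invented name; the kernel operates on real PTE and page-flag bits, not bools):

```c
#include <assert.h>
#include <stdbool.h>

/* A PTE is treated as dirty when either its own dirty bit is set or
 * the backing page (e.g. the compound page a huge PMD was split
 * from) is marked dirty. */
static bool pte_effectively_dirty(bool pte_dirty, bool page_dirty)
{
	return pte_dirty || page_dirty;
}
```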



[PATCH] kernel: prevent submission of creds with higher privileges inside container

2018-09-11 Thread My Name
From: Xin Lin <18650033...@163.com>

Adversaries often attack the Linux kernel by using
commit_creds(prepare_kernel_cred(0)) to submit a ROOT
credential for the purpose of privilege escalation.
For processes inside a Linux container, the above
approach also works, because the container and the
host share the same Linux kernel. Therefore, we
enforce a check in commit_creds() before updating the
cred of the caller process. If the process is inside
a container (judging from the namespace IDs) and
tries to submit credentials with higher privileges
than current (judging from the uid, gid, and cap_bset
in the new cred), we stop the modification. We
consider that if a namespace ID of the process is
different from the init namespace IDs (enumerated in
include/linux/proc_ns.h), the process is inside a
container. And if the uid/gid in the new cred is
smaller, or the cap_bset (capability bounding set) in
the new cred is larger, it may be a privilege
escalation operation.

Signed-off-by: Xin Lin <18650033...@163.com>
---
 kernel/cred.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/kernel/cred.c b/kernel/cred.c
index ecf0365..826c388 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -19,6 +19,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include "../fs/mount.h"
+#include 
+#include 
 
 #if 0
 #define kdebug(FMT, ...)   \
@@ -425,6 +430,18 @@ int commit_creds(struct cred *new)
struct task_struct *task = current;
const struct cred *old = task->real_cred;
 
+   if (task->nsproxy->uts_ns->ns.inum != PROC_UTS_INIT_INO ||
+   task->nsproxy->ipc_ns->ns.inum != PROC_IPC_INIT_INO ||
+   task->nsproxy->mnt_ns->ns.inum != 0xF000U ||
+   task->nsproxy->pid_ns_for_children->ns.inum != PROC_PID_INIT_INO ||
+   task->nsproxy->net_ns->ns.inum != 0xF098U ||
+   old->user_ns->ns.inum != PROC_USER_INIT_INO ||
+   task->nsproxy->cgroup_ns->ns.inum != PROC_CGROUP_INIT_INO) {
+   if (new->uid.val < old->uid.val || new->gid.val < old->gid.val
+   || new->cap_bset.cap[0] > old->cap_bset.cap[0])
+   return 0;
+   }
+
kdebug("commit_creds(%p{%d,%d})", new,
   atomic_read(&new->usage),
   read_cred_subscribers(new));
-- 
2.17.1
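The comparison the patch performs can be isolated as a pure predicate. The sketch below is illustrative only (fake_cred and cred_update_allowed() are invented names; the real patch decides "in container" by comparing namespace inode numbers against the PROC_*_INIT_INO constants):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct fake_cred {
	uint32_t uid, gid;
	uint64_t cap_bset;	/* capability bounding set, first word */
};

/* Mirrors the patch's heuristic: inside a container, refuse an update
 * that lowers uid/gid or widens the capability bounding set. */
static bool cred_update_allowed(bool in_container,
				const struct fake_cred *old_cred,
				const struct fake_cred *new_cred)
{
	if (!in_container)
		return true;
	if (new_cred->uid < old_cred->uid ||
	    new_cred->gid < old_cred->gid ||
	    new_cred->cap_bset > old_cred->cap_bset)
		return false;	/* looks like privilege escalation */
	return true;
}
```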




[PATCH v3 1/2] dmaengine: doc: Add sections for per descriptor metadata support

2018-09-11 Thread Peter Ujfalusi
Update the provider and client documentation with details about the
metadata support.

Signed-off-by: Peter Ujfalusi 
---
 Documentation/driver-api/dmaengine/client.rst | 75 +++
 .../driver-api/dmaengine/provider.rst | 46 
 2 files changed, 121 insertions(+)

diff --git a/Documentation/driver-api/dmaengine/client.rst 
b/Documentation/driver-api/dmaengine/client.rst
index fbbb2831f29f..0276ddae7ea2 100644
--- a/Documentation/driver-api/dmaengine/client.rst
+++ b/Documentation/driver-api/dmaengine/client.rst
@@ -151,6 +151,81 @@ The details of these operations are:
  Note that callbacks will always be invoked from the DMA
  engines tasklet, never from interrupt context.
 
+  Optional: per descriptor metadata
+  -
+  DMAengine provides two ways for metadata support.
+
+  DESC_METADATA_CLIENT
+
+The metadata buffer is allocated/provided by the client driver and it is
+attached to the descriptor.
+
+  .. code-block:: c
+
+ int dmaengine_desc_attach_metadata(struct dma_async_tx_descriptor *desc,
+  void *data, size_t len);
+
+  DESC_METADATA_ENGINE
+
+The metadata buffer is allocated/managed by the DMA driver. The client
+driver can ask for the pointer, maximum size and the currently used size of
+the metadata and can directly update or read it.
+
+  .. code-block:: c
+
+ void *dmaengine_desc_get_metadata_ptr(struct dma_async_tx_descriptor 
*desc,
+   size_t *payload_len, size_t *max_len);
+
+ int dmaengine_desc_set_metadata_len(struct dma_async_tx_descriptor *desc,
+   size_t payload_len);
+
+  Client drivers can query if a given mode is supported with:
+
+  .. code-block:: c
+
+ bool dmaengine_is_metadata_mode_supported(struct dma_chan *chan,
+   enum dma_desc_metadata_mode mode);
+
+  Depending on the mode used, client drivers must follow different flows.
+
+  DESC_METADATA_CLIENT
+
+- DMA_MEM_TO_DEV / DEV_MEM_TO_MEM:
+  1. prepare the descriptor (dmaengine_prep_*)
+ construct the metadata in the client's buffer
+  2. use dmaengine_desc_attach_metadata() to attach the buffer to the
+ descriptor
+  3. submit the transfer
+- DMA_DEV_TO_MEM:
+  1. prepare the descriptor (dmaengine_prep_*)
+  2. use dmaengine_desc_attach_metadata() to attach the buffer to the
+ descriptor
+  3. submit the transfer
+  4. when the transfer is completed, the metadata should be available in 
the
+ attached buffer
+
+  DESC_METADATA_ENGINE
+
+- DMA_MEM_TO_DEV / DEV_MEM_TO_MEM:
+  1. prepare the descriptor (dmaengine_prep_*)
+  2. use dmaengine_desc_get_metadata_ptr() to get the pointer to the
+ engine's metadata area
+  3. update the metadata at the pointer
+  4. use dmaengine_desc_set_metadata_len()  to tell the DMA engine the
+ amount of data the client has placed into the metadata buffer
+  5. submit the transfer
+- DMA_DEV_TO_MEM:
+  1. prepare the descriptor (dmaengine_prep_*)
+  2. submit the transfer
+  3. on transfer completion, use dmaengine_desc_get_metadata_ptr() to get
+ the pointer to the engine's metadata area
+  4. read out the metadata from the pointer
+
+  .. note::
+
+ Mixed use of DESC_METADATA_CLIENT / DESC_METADATA_ENGINE is not allowed,
+ client drivers must use either of the modes per descriptor.
+
 4. Submit the transaction
 
Once the descriptor has been prepared and the callback information
diff --git a/Documentation/driver-api/dmaengine/provider.rst 
b/Documentation/driver-api/dmaengine/provider.rst
index dfc4486b5743..9e6d87b3c477 100644
--- a/Documentation/driver-api/dmaengine/provider.rst
+++ b/Documentation/driver-api/dmaengine/provider.rst
@@ -247,6 +247,52 @@ after each transfer. In case of a ring buffer, they may 
loop
 (DMA_CYCLIC). Addresses pointing to a device's register (e.g. a FIFO)
 are typically fixed.
 
+Per descriptor metadata support
+---
+Some data movement architectures (DMA controller and peripherals) use metadata
+associated with a transaction. The DMA controller's role is to transfer the
+payload and the metadata alongside.
+The metadata itself is not used by the DMA engine, but it contains
+parameters, keys, vectors, etc. for the peripheral, or from the peripheral.
+
+The DMAengine framework provides a generic way to facilitate metadata for
+descriptors. Depending on the architecture the DMA driver can implement either
+or both of the methods and it is up to the client driver to choose which one
+to use.
+
+- DESC_METADATA_CLIENT
+
+  The metadata buffer is allocated/provided by the client driver and it is
+  attached (via the dmaengine_desc_attach_metadata() helper to the descriptor.
+
+  From the DMA driver the following is expected for this mode:
+  - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM
+The data from the provided metadata buffer sh

[PATCH v3 0/2] dmaengine: Add per descriptor metadata support

2018-09-11 Thread Peter Ujfalusi
Hi,

Changes since v2:
- EXPORT_SYMBOL_GPL() for the metadata functions
- Added note to Documentation to not mix the two defined metadata modes
- Fixed the typos in Documentation

Changes since v1:
- Move code from header to dmaengine.c
- Fix spelling
- Use BIT() macro for bit definition
- Update both provider and client documentation

Changes since rfc:
- DESC_METADATA_EMBEDDED renamed to DESC_METADATA_ENGINE
- Use flow is added for both CLIENT and ENGINE metadata modes

Some data movement architectures (DMA controller and peripherals) use metadata
associated with a transaction. The DMA controller's role is to transfer the
payload and the metadata alongside.
The metadata itself is not used by the DMA engine, but it contains
parameters, keys, vectors, etc. for the peripheral, or from the peripheral.

The DMAengine framework provides a generic way to facilitate metadata for
descriptors. Depending on the architecture the DMA driver can implement either
or both of the methods and it is up to the client driver to choose which one
to use.

If the DMA supports per descriptor metadata it can implement the attach,
get_ptr/set_len callbacks.

Client drivers must only use either attach or get_ptr/set_len to avoid
misconfiguration.

Client driver can check if a given metadata mode is supported by the
channel during probe time with
dmaengine_is_metadata_mode_supported(chan, DESC_METADATA_CLIENT);
dmaengine_is_metadata_mode_supported(chan, DESC_METADATA_ENGINE);

and based on this information can use either mode.

Wrappers are also added for the metadata_ops.

To be used in DESC_METADATA_CLIENT mode:
dmaengine_desc_attach_metadata()

To be used in DESC_METADATA_ENGINE mode:
dmaengine_desc_get_metadata_ptr()
dmaengine_desc_set_metadata_len()

Regards,
Peter
---
Peter Ujfalusi (2):
  dmaengine: doc: Add sections for per descriptor metadata support
  dmaengine: Add metadata_ops for dma_async_tx_descriptor

 Documentation/driver-api/dmaengine/client.rst |  75 
 .../driver-api/dmaengine/provider.rst |  46 
 drivers/dma/dmaengine.c   |  73 
 include/linux/dmaengine.h | 108 ++
 4 files changed, 302 insertions(+)

-- 
Peter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki



[PATCH v3 2/2] dmaengine: Add metadata_ops for dma_async_tx_descriptor

2018-09-11 Thread Peter Ujfalusi
The metadata is best described as side-band data or parameters traveling
alongside the data DMAd by the DMA engine. It is data
which is understood only by the peripheral and the peripheral driver; the
DMA engine sees it only as a data block and does not interpret it in any
way.

The metadata can be different per descriptor as it is a parameter for the
data being transferred.

If the DMA supports per descriptor metadata it can implement the attach,
get_ptr/set_len callbacks.

Client drivers must only use either attach or get_ptr/set_len to avoid
misconfiguration.

Client driver can check if a given metadata mode is supported by the
channel during probe time with
dmaengine_is_metadata_mode_supported(chan, DESC_METADATA_CLIENT);
dmaengine_is_metadata_mode_supported(chan, DESC_METADATA_ENGINE);

and based on this information can use either mode.

Wrappers are also added for the metadata_ops.

To be used in DESC_METADATA_CLIENT mode:
dmaengine_desc_attach_metadata()

To be used in DESC_METADATA_ENGINE mode:
dmaengine_desc_get_metadata_ptr()
dmaengine_desc_set_metadata_len()

Signed-off-by: Peter Ujfalusi 
---
 drivers/dma/dmaengine.c   |  73 ++
 include/linux/dmaengine.h | 108 ++
 2 files changed, 181 insertions(+)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index f1a441ab395d..27b6d7c2d8a0 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -1306,6 +1306,79 @@ void dma_async_tx_descriptor_init(struct 
dma_async_tx_descriptor *tx,
 }
 EXPORT_SYMBOL(dma_async_tx_descriptor_init);
 
+static inline int desc_check_and_set_metadata_mode(
+   struct dma_async_tx_descriptor *desc, enum dma_desc_metadata_mode mode)
+{
+   /* Make sure that the metadata mode is not mixed */
+   if (!desc->desc_metadata_mode) {
+   if (dmaengine_is_metadata_mode_supported(desc->chan, mode))
+   desc->desc_metadata_mode = mode;
+   else
+   return -ENOTSUPP;
+   } else if (desc->desc_metadata_mode != mode) {
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+int dmaengine_desc_attach_metadata(struct dma_async_tx_descriptor *desc,
+  void *data, size_t len)
+{
+   int ret;
+
+   if (!desc)
+   return -EINVAL;
+
+   ret = desc_check_and_set_metadata_mode(desc, DESC_METADATA_CLIENT);
+   if (ret)
+   return ret;
+
+   if (!desc->metadata_ops || !desc->metadata_ops->attach)
+   return -ENOTSUPP;
+
+   return desc->metadata_ops->attach(desc, data, len);
+}
+EXPORT_SYMBOL_GPL(dmaengine_desc_attach_metadata);
+
+void *dmaengine_desc_get_metadata_ptr(struct dma_async_tx_descriptor *desc,
+ size_t *payload_len, size_t *max_len)
+{
+   int ret;
+
+   if (!desc)
+   return ERR_PTR(-EINVAL);
+
+   ret = desc_check_and_set_metadata_mode(desc, DESC_METADATA_ENGINE);
+   if (ret)
+   return ERR_PTR(ret);
+
+   if (!desc->metadata_ops || !desc->metadata_ops->get_ptr)
+   return ERR_PTR(-ENOTSUPP);
+
+   return desc->metadata_ops->get_ptr(desc, payload_len, max_len);
+}
+EXPORT_SYMBOL_GPL(dmaengine_desc_get_metadata_ptr);
+
+int dmaengine_desc_set_metadata_len(struct dma_async_tx_descriptor *desc,
+   size_t payload_len)
+{
+   int ret;
+
+   if (!desc)
+   return -EINVAL;
+
+   ret = desc_check_and_set_metadata_mode(desc, DESC_METADATA_ENGINE);
+   if (ret)
+   return ret;
+
+   if (!desc->metadata_ops || !desc->metadata_ops->set_len)
+   return -ENOTSUPP;
+
+   return desc->metadata_ops->set_len(desc, payload_len);
+}
+EXPORT_SYMBOL_GPL(dmaengine_desc_set_metadata_len);
+
 /* dma_wait_for_async_tx - spin wait for a transaction to complete
  * @tx: in-flight transaction to wait on
  */
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 3db833a8c542..10ff71b13eef 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -231,6 +231,58 @@ typedef struct { DECLARE_BITMAP(bits, DMA_TX_TYPE_END); } 
dma_cap_mask_t;
  * @bytes_transferred: byte counter
  */
 
+/**
+ * enum dma_desc_metadata_mode - per descriptor metadata mode types supported
+ * @DESC_METADATA_CLIENT - the metadata buffer is allocated/provided by the
+ *  client driver and it is attached (via the dmaengine_desc_attach_metadata()
+ *  helper) to the descriptor.
+ *
+ * Client drivers interested to use this mode can follow:
+ * - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM:
+ *   1. prepare the descriptor (dmaengine_prep_*)
+ * construct the metadata in the client's buffer
+ *   2. use dmaengine_desc_attach_metadata() to attach the buffer to the
+ * descriptor
+ *   3. submit the transfer
+ * - DMA_DEV_TO_MEM:
+ *   1. prepare the descriptor (dmaengine_prep_*)
+ *   2. use dmaengine_desc_attach_

[PATCH 2/2] kbuild: remove dead code in cmd_files calculation in top Makefile

2018-09-11 Thread Masahiro Yamada
Nobody sets 'targets' in the top-level Makefile or arch/*/Makefile,
hence $(targets) is empty.

$(wildcard .*.cmd) will do for including the .vmlinux.cmd file.

Signed-off-by: Masahiro Yamada 
---

 Makefile | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 4b76e22..8f6dbfc 100644
--- a/Makefile
+++ b/Makefile
@@ -1721,8 +1721,7 @@ cmd_crmodverdir = $(Q)mkdir -p $(MODVERDIR) \
   $(if $(KBUILD_MODULES),; rm -f $(MODVERDIR)/*)
 
 # read all saved command lines
-
-cmd_files := $(wildcard .*.cmd $(foreach f,$(sort $(targets)),$(dir 
$(f)).$(notdir $(f)).cmd))
+cmd_files := $(wildcard .*.cmd)
 
 ifneq ($(cmd_files),)
   $(cmd_files): ;  # Do not try to update included dependency files
-- 
2.7.4



[PATCH 1/2] kbuild: hide most of targets when running config or mixed targets

2018-09-11 Thread Masahiro Yamada
When mixed/config targets are being processed, the top Makefile
does not need to parse the rest of the targets.

Signed-off-by: Masahiro Yamada 
---

 Makefile | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 4d5c883..4b76e22 100644
--- a/Makefile
+++ b/Makefile
@@ -1615,9 +1615,6 @@ namespacecheck:
 export_report:
$(PERL) $(srctree)/scripts/export_report.pl
 
-endif #ifeq ($(config-targets),1)
-endif #ifeq ($(mixed-targets),1)
-
 PHONY += checkstack kernelrelease kernelversion image_name
 
 # UML needs a little special treatment here.  It wants to use the host
@@ -1732,6 +1729,8 @@ ifneq ($(cmd_files),)
   include $(cmd_files)
 endif
 
+endif   # ifeq ($(config-targets),1)
+endif   # ifeq ($(mixed-targets),1)
 endif  # skip-makefile
 
 PHONY += FORCE
-- 
2.7.4



Re: [REGRESSION] Errors at reboot after 722e5f2b1eec

2018-09-11 Thread Pingfan Liu
On Tue, Sep 11, 2018 at 5:37 PM Greg Kroah-Hartman
 wrote:
>
> On Tue, Sep 11, 2018 at 10:17:44AM +0200, Takashi Iwai wrote:
> > [ seems like my previous post didn't go out properly; if you have
> >   already received it, please discard this one ]
>
> Sorry, I got it, it's just in my large queue :(
>
> > Hi Rafael, Greg,
> >
> > James Wang reported on SUSE bugzilla that his machine spews many
> > AMD-Vi errors at reboot like:
> >
> > [  154.907879] systemd-shutdown[1]: Detaching loop devices.
> > [  154.954583] kvm: exiting hardware virtualization
> > [  154.53] usb 5-2: USB disconnect, device number 2
> > [  155.025278] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.081360] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.136778] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.191772] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.247055] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.302614] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.358996] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.392155] usb 4-2: new full-speed USB device number 2 using ohci-pci
> > [  155.413752] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.413762] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.560307] ohci-pci :00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.616039] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.667843] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.719497] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.772697] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.823919] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.875490] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.927258] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  155.979318] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  156.031813] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  156.084293] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 
> > domain=0x0006 address=0x0080 flags=0x0020]
> > [  156.272157] reboot: Restarting system
> > [  156.290316] reboot: machine restart
> >
[...]
> > The errors are clearly related with the USB device (a KVM device,
> > IIRC), and the errors are not seen if the USB device is disconnected.
> >
Sounds like the IO page table is invalidated before ohci-pci is shut
down. But I cannot figure out why, since the IOMMU is torn down very
late, after device_shutdown().

Cc'ing James: could you enable initcall_debug and paste the shutdown
sequence with 722e5f2b1eec ("driver core: Partially revert "driver
core: correct device's shutdown order"") and without it?

Thanks,
Pingfan


Re: [PATCH 2/8] regulator: Support ROHM BD71847 power management IC

2018-09-11 Thread Matti Vaittinen
Hello Lee,

Thanks again for the review! I see you did a bunch of them... I really
admire your devotion. For me reviewing is hard work, so I do appreciate
it. Nice to see you're back in the business =)

On Tue, Sep 11, 2018 at 02:48:08PM +0100, Lee Jones wrote:
> On Wed, 29 Aug 2018, Matti Vaittinen wrote:
> 
> > +static const u8 bd71837_supported_revisions[] = { 0xA2 };
> > +static const u8 bd71847_supported_revisions[] = { 0xA0 };
> 
> I haven't seen anything like this before.
> 
> Is this really required?
> 
> Especially since you only have 1 of each currently.

Valid question. I did ask the same thing myself. The reason I ended up
doing this is simple, though. I have no idea what may change if the "chip
revision" is changed. I only know that I have chip(s) with these
revisions on my table - and I have a data sheet which mentions these
revisions. So I can only test that the driver works with these revisions.
I however assume there will be new revisions, and I thought that with
this approach, marking them as supported would only require adding the
revision to these arrays.

But as you have said - this makes the code slightly more complex - even
if I disagree with you regarding how complex is too complex =) The use
case and intention of the tables is quite obvious, right? That makes
following the loops in probe pretty easy after all...

But I won't start arguing with you - let's assume the register interface
won't get changed - and if it does, well, let's handle that then. So
I'll drop the whole revision check.

> 
> > -static const u8 supported_revisions[] = { 0xA2 /* BD71837 */ };
> > +struct known_revisions {
> > +   const u8 (*revisions)[];
> > +   unsigned int known_revisions;
> 
> I didn't initially know what this was at first glance.
> 
> Please re-name it to show that it is the number of stored revisions.

This will be fixed as I'll drop the revision check

> > +static const struct known_revisions supported_revisions[BD718XX_TYPE_AMNT] 
> > = {
> > +   [BD718XX_TYPE_BD71837] = {
> > +   .revisions = &bd71837_supported_revisions,
> > +   .known_revisions = ARRAY_SIZE(bd71837_supported_revisions),
> > +   },
> > +   [BD718XX_TYPE_BD71847] = {
> > +   .revisions = &bd71847_supported_revisions,
> > +   .known_revisions = ARRAY_SIZE(bd71847_supported_revisions),
> > +   },
> > +};
> >  
> > @@ -91,13 +104,19 @@ static int bd71837_i2c_probe(struct i2c_client *i2c,
> >  {
> > struct bd71837 *bd71837;
> > int ret, i;
> > +   const unsigned int *type;
> >  
> > +   type = of_device_get_match_data(&i2c->dev);
> > +   if (!type || *type >= BD718XX_TYPE_AMNT) {
> > +   dev_err(&i2c->dev, "Bad chip type\n");
> > +   return -ENODEV;
> > +   }
> > +
> > +   bd71837->chip_type = *type;

> > +   ret = regmap_read(bd71837->regmap, BD718XX_REG_REV, &bd71837->chip_rev);

> > +   for (i = 0;
> > +i < supported_revisions[bd71837->chip_type].known_revisions; i++)
> > +   if ((*supported_revisions[bd71837->chip_type].revisions)[i] ==
> > +   bd71837->chip_rev)
> > break;
> >  
> > +   if (i == supported_revisions[bd71837->chip_type].known_revisions) {
> > +   dev_err(&i2c->dev, "Unrecognized revision 0x%02x\n",
> > +   bd71837->chip_rev);
> > +   return -EINVAL;
> > }
> 
> This has become a very (too) elaborate way to see if the current
> running version is supported.  Please find a way to solve this (much)
> more succinctly.  There are lots of examples of this.

I cut out pieces of quoted patch to shorten it to relevant bits.

As I said above - I think this is not as bad as you say - it is quite
obvious what it does after all. And adding a new revision would just mean
adding a new entry to the revision array. But yes, I am not sure this
is needed, so I'll drop it. Let's handle compatibility issues between
revisions only if such ever emerge =)
> 
> In fact, since you are using OF, it is not possible for this driver to
> probe with an unsupported device.  You can remove the whole lot.
> 
I don't really see how the OF helps me with revisions - as the chip
revision is not present in the DT. The type of chip of course is. So
you're right: the check for an invalid chip_type can be dropped.

> > +static const unsigned int chip_types[] = {
> > +   [BD718XX_TYPE_BD71837] = BD718XX_TYPE_BD71837,
> > +   [BD718XX_TYPE_BD71847] = BD718XX_TYPE_BD71847,
> > +};
> >  
> >  static const struct of_device_id bd71837_of_match[] = {
> > -   { .compatible = "rohm,bd71837", },
> > +   {
> > +   .compatible = "rohm,bd71837",
> > +   .data = &chip_types[BD718XX_TYPE_BD71837]
> > +   },
> > +   {
> > +   .compatible = "rohm,bd71847",
> > +   .data = &chip_types[BD718XX_TYPE_BD71847]
> 
> Again, way too complex.  Why not simply:
> 
>.data = (void *)BD718XX_TYPE_BD71847?
> 

Ugh. That's horrible to my eyes. I dislike delivering data in addresses.
That's why I rather did array with IDs and used pointer to arra

Re: [PATCH v2 2/3] x86/mm/KASLR: Calculate the actual size of vmemmap region

2018-09-11 Thread Ingo Molnar


* Baoquan He  wrote:

> On 09/11/18 at 08:08pm, Baoquan He wrote:
> > On 09/11/18 at 11:28am, Ingo Molnar wrote:
> > > Yeah, so proper context is still missing, this paragraph appears to 
> > > assume from the reader a 
> > > whole lot of prior knowledge, and this is one of the top comments in 
> > > kaslr.c so there's nowhere 
> > > else to go read about the background.
> > > 
> > > For example what is the range of randomization of each region? Assuming 
> > > the static, 
> > > non-randomized description in Documentation/x86/x86_64/mm.txt is correct, 
> > > in what way does 
> > > KASLR modify that layout?
> 
> Re-read this paragraph, found I missed saying the range for each memory
> region, and in what way KASLR modify the layout.
> 
> > > 
> > > All of this is very opaque and not explained very well anywhere that I 
> > > could find. We need to 
> > > generate a proper description ASAP.
> > 
> > OK, let me try to give an context with my understanding. And copy the
> > static layout of memory regions at below for reference.
> > 
> Here, Documentation/x86/x86_64/mm.txt is correct, and it's the
> guideline for us to manipulate the layout of kernel memory regions.
> Originally the starting address of each region is aligned to 512 GB
> so that they are all mapped at the 0-th entry of the PGD table in
> 4-level page mapping. Since we are rich enough to have 120 TB of virtual
> address space, they are actually aligned at 1 TB. So randomness comes
> mainly from three parts:
> 
> 1) The direct mapping region for physical memory. 64 TB are reserved to
> cover the maximum supported physical memory. However, most systems have
> much less RAM than 64 TB, often much less than 1 TB. We can let the
> superfluous space join the randomization. This is often the biggest
> part.

So i.e. in the non-KASLR case we have this description (from mm.txt):

 ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
 ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
 ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
 ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
 ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
 ... unused hole ...
 ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
                                     vaddr_end for KASLR
 fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
 ...

The problems start here, this map is already *horribly* confusing:

 - we mix size in TB with 'bits'
 - we sometimes mention a size in the description and sometimes not
 - we sometimes list holes by address, sometimes only as an 'unused hole' line 
...

So how about first cleaning up the memory maps in mm.txt and streamlining them, 
like this:

 ffff880000000000 - ffffc7ffffffffff (=46 bits, 64 TB) direct mapping of all phys. memory (page_offset_base)
 ffffc80000000000 - ffffc8ffffffffff (=40 bits,  1 TB) ... unused hole
 ffffc90000000000 - ffffe8ffffffffff (=45 bits, 32 TB) vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 - ffffe9ffffffffff (=40 bits,  1 TB) ... unused hole
 ffffea0000000000 - ffffeaffffffffff (=40 bits,  1 TB) virtual memory map (vmemmap_base)
 ffffeb0000000000 - ffffebffffffffff (=40 bits,  1 TB) ... unused hole
 ffffec0000000000 - fffffbffffffffff (=44 bits, 16 TB) KASAN shadow memory
 fffffc0000000000 - fffffdffffffffff (=41 bits,  2 TB) ... unused hole
                                                       vaddr_end for KASLR
 fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
 ...

Please double check all the calculations and ranges, and I'd suggest doing it 
for the whole 
file. Note how I added the global variables describing the base addresses - 
this makes it very 
easy to match the pointers in kaslr_regions[] to the static map, to see the 
intent of 
kaslr_regions[].

BTW., isn't that 'vaddr_end for KASLR' entry position inaccurate? In the 
typical case it could 
very well be that by chance all 3 areas end up being randomized into the first 
64 TB region, 
right?

I.e. vaddr_end could be at any 1 TB boundary in the above ranges. I'd suggest 
leaving out all 
KASLR from this static mappings table - explain it separately in this file, 
maybe even create 
its own memory map. I'll help with the wording.

> 2) The holes between memory regions, even though each is only 1 TB.

There's a 2 TB hole too.

> 3) The KASAN region takes up 16 TB, but it is unused when KASLR is
> enabled. This is another big part.

Ok.

> As you can see, of these three memory regions, the physical memory
> mapping region has a variable size according to the existing system RAM,
> while the remaining two memory regions have fixed sizes: vmalloc is
> 32 TB, vmemmap is 1 TB.
> 
> With this superfluous address space, as well as changing the starting
> address of each memory region to PUD-level (1 GB) alignment, we can have
> thousands of candidate positions at which to locate those three

RE: Charity Support

2018-09-11 Thread M. M. Fridman




--
I, Mikhail Fridman have selected you specifically as one of my  
beneficiaries for my Charitable Donation of $5 Million Dollars,


Check the link below for confirmation:

https://www.rt.com/business/343781-mikhail-fridman-will-charity/

I await your earliest response for further directives.

Best Regards,
Mikhail Fridman.



[PATCH] kbuild: remove old check for CFLAGS use

2018-09-11 Thread Masahiro Yamada
This check has been here for more than a decade, since commit
0c53c8e6eb45 ("kbuild: check for wrong use of CFLAGS").

Enough time for migration has passed.

Signed-off-by: Masahiro Yamada 
---

 scripts/Makefile.build | 10 --
 1 file changed, 10 deletions(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 5a2d1c9..cb03774 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -36,21 +36,11 @@ subdir-ccflags-y :=
 
 include scripts/Kbuild.include
 
-# For backward compatibility check that these variables do not change
-save-cflags := $(CFLAGS)
-
 # The filename Kbuild has precedence over Makefile
 kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src))
 kbuild-file := $(if $(wildcard 
$(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
 include $(kbuild-file)
 
-# If the save-* variables changed error out
-ifeq ($(KBUILD_NOPEDANTIC),)
-ifneq ("$(save-cflags)","$(CFLAGS)")
-$(error CFLAGS was changed in "$(kbuild-file)". Fix it to use 
ccflags-y)
-endif
-endif
-
 include scripts/Makefile.lib
 
 # Do not include host rules unless needed
-- 
2.7.4



[PATCH v2] perf test: Add watchpoint test

2018-09-11 Thread Ravi Bangoria
We don't have a perf test available to test watchpoint functionality.
Add a simple set of tests:
 - Read only watchpoint
 - Write only watchpoint
 - Read / Write watchpoint
 - Runtime watchpoint modification

Ex on powerpc:
  $ sudo ./perf test 22
  22: Watchpoint:
  22.1: Read Only Watchpoint    : Ok
  22.2: Write Only Watchpoint   : Ok
  22.3: Read / Write Watchpoint : Ok
  22.4: Modify Watchpoint       : Ok

Signed-off-by: Ravi Bangoria 
Acked-by: Jiri Olsa 
---
v1 -> v2:
 - Fix build failure for mips and other archs.
 - Show debug message if subtest is not supported.

 tools/perf/tests/Build  |   1 +
 tools/perf/tests/builtin-test.c |   9 ++
 tools/perf/tests/tests.h|   3 +
 tools/perf/tests/wp.c   | 229 
 4 files changed, 242 insertions(+)
 create mode 100644 tools/perf/tests/wp.c

diff --git a/tools/perf/tests/Build b/tools/perf/tests/Build
index 6c108fa79ae3..0b2b8305c965 100644
--- a/tools/perf/tests/Build
+++ b/tools/perf/tests/Build
@@ -21,6 +21,7 @@ perf-y += python-use.o
 perf-y += bp_signal.o
 perf-y += bp_signal_overflow.o
 perf-y += bp_account.o
+perf-y += wp.o
 perf-y += task-exit.o
 perf-y += sw-clock.o
 perf-y += mmap-thread-lookup.o
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index d7a5e1b9aa6f..54ca7d87236f 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -120,6 +120,15 @@ static struct test generic_tests[] = {
.func = test__bp_accounting,
.is_supported = test__bp_signal_is_supported,
},
+   {
+   .desc = "Watchpoint",
+   .func = test__wp,
+   .subtest = {
+   .skip_if_fail   = false,
+   .get_nr = test__wp_subtest_get_nr,
+   .get_desc   = test__wp_subtest_get_desc,
+   },
+   },
{
.desc = "Number of exit events of a simple workload",
.func = test__task_exit,
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index a9760e790563..8e26a4148f30 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -59,6 +59,9 @@ int test__python_use(struct test *test, int subtest);
 int test__bp_signal(struct test *test, int subtest);
 int test__bp_signal_overflow(struct test *test, int subtest);
 int test__bp_accounting(struct test *test, int subtest);
+int test__wp(struct test *test, int subtest);
+const char *test__wp_subtest_get_desc(int subtest);
+int test__wp_subtest_get_nr(void);
 int test__task_exit(struct test *test, int subtest);
 int test__mem(struct test *test, int subtest);
 int test__sw_clock_freq(struct test *test, int subtest);
diff --git a/tools/perf/tests/wp.c b/tools/perf/tests/wp.c
new file mode 100644
index ..017a99317f94
--- /dev/null
+++ b/tools/perf/tests/wp.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include "tests.h"
+#include "debug.h"
+#include "cloexec.h"
+
+#define WP_TEST_ASSERT_VAL(fd, text, val)   \
+do {\
+   long long count;\
+   wp_read(fd, &count, sizeof(long long)); \
+   TEST_ASSERT_VAL(text, count == val);\
+} while (0)
+
+volatile u64 data1;
+volatile u8 data2[3];
+
+static int wp_read(int fd, long long *count, int size)
+{
+   int ret = read(fd, count, size);
+
+   if (ret != size) {
+   pr_debug("failed to read: %d\n", ret);
+   return -1;
+   }
+   return 0;
+}
+
+static void get__perf_event_attr(struct perf_event_attr *attr, int wp_type,
+void *wp_addr, unsigned long wp_len)
+{
+   memset(attr, 0, sizeof(struct perf_event_attr));
+   attr->type   = PERF_TYPE_BREAKPOINT;
+   attr->size   = sizeof(struct perf_event_attr);
+   attr->config = 0;
+   attr->bp_type= wp_type;
+   attr->bp_addr= (unsigned long)wp_addr;
+   attr->bp_len = wp_len;
+   attr->sample_period  = 1;
+   attr->sample_type= PERF_SAMPLE_IP;
+   attr->exclude_kernel = 1;
+   attr->exclude_hv = 1;
+}
+
+static int __event(int wp_type, void *wp_addr, unsigned long wp_len)
+{
+   int fd;
+   struct perf_event_attr attr;
+
+   get__perf_event_attr(&attr, wp_type, wp_addr, wp_len);
+   fd = sys_perf_event_open(&attr, 0, -1, -1,
+perf_event_open_cloexec_flag());
+   if (fd < 0)
+   pr_debug("failed opening event %x\n", attr.bp_type);
+
+   return fd;
+}
+
+static int wp_ro_test(void)
+{
+   int fd;
+   unsigned long tmp, tmp1 = rand();
+
+   fd = __event(HW_BREAKPOINT_R, (void *)&data1, sizeo

Re: [PATCH] staging: remove unneeded static set .owner field in platform_driver

2018-09-11 Thread Vaibhav Agarwal
On Wed, Sep 12, 2018 at 9:22 AM zhong jiang  wrote:
>
> platform_driver_register will set the .owner field. So it is safe
> to remove the redundant assignment.
>
> The issue is detected with the help of Coccinelle.
>
> Signed-off-by: zhong jiang 
> ---
>  drivers/staging/greybus/audio_codec.c| 1 -
>  drivers/staging/mt7621-eth/gsw_mt7621.c  | 1 -
>  drivers/staging/mt7621-eth/mtk_eth_soc.c | 1 -
>  3 files changed, 3 deletions(-)
>
> diff --git a/drivers/staging/greybus/audio_codec.c 
> b/drivers/staging/greybus/audio_codec.c
> index 35acd55..08746c8 100644
> --- a/drivers/staging/greybus/audio_codec.c
> +++ b/drivers/staging/greybus/audio_codec.c
> @@ -1087,7 +1087,6 @@ static int gbaudio_codec_remove(struct platform_device 
> *pdev)
>  static struct platform_driver gbaudio_codec_driver = {
> .driver = {
> .name = "apb-dummy-codec",
> -   .owner = THIS_MODULE,
>  #ifdef CONFIG_PM
> .pm = &gbaudio_codec_pm_ops,
>  #endif
> diff --git a/drivers/staging/mt7621-eth/gsw_mt7621.c 
> b/drivers/staging/mt7621-eth/gsw_mt7621.c
> index 2c07b55..53767b1 100644
> --- a/drivers/staging/mt7621-eth/gsw_mt7621.c
> +++ b/drivers/staging/mt7621-eth/gsw_mt7621.c
> @@ -286,7 +286,6 @@ static int mt7621_gsw_remove(struct platform_device *pdev)
> .remove = mt7621_gsw_remove,
> .driver = {
> .name = "mt7621-gsw",
> -   .owner = THIS_MODULE,
> .of_match_table = mediatek_gsw_match,
> },
>  };
> diff --git a/drivers/staging/mt7621-eth/mtk_eth_soc.c 
> b/drivers/staging/mt7621-eth/mtk_eth_soc.c
> index 7135075..363d3c9 100644
> --- a/drivers/staging/mt7621-eth/mtk_eth_soc.c
> +++ b/drivers/staging/mt7621-eth/mtk_eth_soc.c
> @@ -2167,7 +2167,6 @@ static int mtk_remove(struct platform_device *pdev)
> .remove = mtk_remove,
> .driver = {
> .name = "mtk_soc_eth",
> -   .owner = THIS_MODULE,
> .of_match_table = of_mtk_match,
> },
>  };
> --
> 1.7.12.4
>

Acked-by: Vaibhav Agarwal 


Re: [PATCH V3 4/6] x86/intel_rdt: Create required perf event attributes

2018-09-11 Thread kbuild test robot
Hi Reinette,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on v4.19-rc3 next-20180911]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Reinette-Chatre/perf-core-and-x86-intel_rdt-Fix-lack-of-coordination-with-perf/20180912-101526
config: i386-randconfig-x001-201836 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

Note: the 
linux-review/Reinette-Chatre/perf-core-and-x86-intel_rdt-Fix-lack-of-coordination-with-perf/20180912-101526
 HEAD b684b8727deb9e3cf635badb292b3314904d17b2 builds fine.
  It only hurts bisectability.

All error/warnings (new ones prefixed by >>):

>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:927:15: error: variable 
>> 'perf_miss_attr' has initializer but incomplete type
static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
  ^~~
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:928:3: error: 'struct 
>> perf_event_attr' has no member named 'type'
 .type  = PERF_TYPE_RAW,
  ^~~~
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:928:11: error: 'PERF_TYPE_RAW' 
>> undeclared here (not in a function); did you mean 'PIDTYPE_MAX'?
 .type  = PERF_TYPE_RAW,
  ^
  PIDTYPE_MAX
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:928:11: warning: excess 
>> elements in struct initializer
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:928:11: note: (near 
initialization for 'perf_miss_attr')
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:929:3: error: 'struct 
>> perf_event_attr' has no member named 'size'
 .size  = sizeof(struct perf_event_attr),
  ^~~~
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:929:18: error: invalid 
>> application of 'sizeof' to incomplete type 'struct perf_event_attr'
 .size  = sizeof(struct perf_event_attr),
 ^~
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:929:11: warning: excess 
elements in struct initializer
 .size  = sizeof(struct perf_event_attr),
  ^~
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:929:11: note: (near 
initialization for 'perf_miss_attr')
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:930:3: error: 'struct 
>> perf_event_attr' has no member named 'pinned'
 .pinned  = 1,
  ^~
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:930:13: warning: excess 
elements in struct initializer
 .pinned  = 1,
^
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:930:13: note: (near 
initialization for 'perf_miss_attr')
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:931:3: error: 'struct 
>> perf_event_attr' has no member named 'disabled'
 .disabled = 0,
  ^~~~
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:931:14: warning: excess 
elements in struct initializer
 .disabled = 0,
 ^
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:931:14: note: (near 
initialization for 'perf_miss_attr')
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:932:3: error: 'struct 
>> perf_event_attr' has no member named 'exclude_user'
 .exclude_user = 1,
  ^~~~
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:932:18: warning: excess 
elements in struct initializer
 .exclude_user = 1,
 ^
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:932:18: note: (near 
initialization for 'perf_miss_attr')
>> arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:935:15: error: variable 
>> 'perf_hit_attr' has initializer but incomplete type
static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
  ^~~
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:936:3: error: 'struct 
perf_event_attr' has no member named 'type'
 .type  = PERF_TYPE_RAW,
  ^~~~
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:936:11: warning: excess 
elements in struct initializer
 .type  = PERF_TYPE_RAW,
  ^
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:936:11: note: (near 
initialization for 'perf_hit_attr')
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:937:3: error: 'struct 
perf_event_attr' has no member named 'size'
 .size  = sizeof(struct perf_event_attr),
  ^~~~
   arch/x86//kernel/cpu/intel_rdt_pseudo_lock.c:937:18: error: invalid 
application of 'sizeof' to incomplete type 'struct perf_event_attr'
 .size  =

Re: [LKP] [rcu] 02a5c550b2: BUG:kernel_reboot-without-warning_in_test_stage

2018-09-11 Thread Paul E. McKenney
On Wed, Sep 12, 2018 at 01:25:27PM +0800, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: 02a5c550b2738f2bfea8e1e00aa75944d71c9e18 ("rcu: Abstract extended 
> quiescent state determination")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> in testcase: perf_event_tests
> with following parameters:
> 
>   paranoid: disallow_kernel_profiling
> 
> test-description: The Perf Event Testsuite.
> test-url: https://github.com/deater/perf_event_tests
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 2G

This is a blast from the past!  Almost two years ago.

I will take a closer look (and also check for any later fixes), but at
first glance I am not seeing anything in this commit that would actually
change behavior.

But is it possible that this is due to vCPU preemption on a heavily
loaded test system?  I have done that to myself from time to time...

Thanx, Paul

> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> +-------------------------------------------------+------------+------------+
> |                                                 | 2625d469ba | 02a5c550b2 |
> +-------------------------------------------------+------------+------------+
> | boot_successes                                  | 18         | 0          |
> | boot_failures                                   | 0          | 17         |
> | BUG:kernel_reboot-without-warning_in_test_stage | 0          | 17         |
> +-------------------------------------------------+------------+------------+
> 
> 
> 
> [   21.715217] 
> [   22.950524] perf: interrupt took too long (5334 > 5235), lowering 
> kernel.perf_event_max_sample_rate to 37250
> [   22.956921] perf: interrupt took too long (6735 > 6667), lowering 
> kernel.perf_event_max_sample_rate to 29500
> [   22.970150] perf: interrupt took too long (8494 > 8418), lowering 
> kernel.perf_event_max_sample_rate to 23500
> [   22.976586] perf: interrupt took too long (10754 > 10617), lowering 
> kernel.perf_event_max_sample_rate to 18500
> BUG: kernel reboot-without-warning in test stage
> 
> Elapsed time: 30
> 
> #!/bin/bash
> 
> 
> 
> To reproduce:
> 
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k  job-script # job-script is attached in this 
> email
> 
> 
> 
> Thanks,
> lkp

> #
> # Automatically generated file; DO NOT EDIT.
> # Linux/x86_64 4.10.0-rc3 Kernel Configuration
> #
> CONFIG_64BIT=y
> CONFIG_X86_64=y
> CONFIG_X86=y
> CONFIG_INSTRUCTION_DECODER=y
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_MMU=y
> CONFIG_ARCH_MMAP_RND_BITS_MIN=28
> CONFIG_ARCH_MMAP_RND_BITS_MAX=32
> CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
> CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
> CONFIG_NEED_DMA_MAP_STATE=y
> CONFIG_NEED_SG_DMA_LENGTH=y
> CONFIG_GENERIC_ISA_DMA=y
> CONFIG_GENERIC_BUG=y
> CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
> CONFIG_GENERIC_HWEIGHT=y
> CONFIG_ARCH_MAY_HAVE_PC_FDC=y
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y
> CONFIG_GENERIC_CALIBRATE_DELAY=y
> CONFIG_ARCH_HAS_CPU_RELAX=y
> CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
> CONFIG_HAVE_SETUP_PER_CPU_AREA=y
> CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
> CONFIG_ARCH_HIBERNATION_POSSIBLE=y
> CONFIG_ARCH_SUSPEND_POSSIBLE=y
> CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
> CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
> CONFIG_ZONE_DMA32=y
> CONFIG_AUDIT_ARCH=y
> CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> CONFIG_X86_64_SMP=y
> CONFIG_ARCH_SUPPORTS_UPROBES=y
> CONFIG_FIX_EARLYCON_MEM=y
> CONFIG_DEBUG_RODATA=y
> CONFIG_PGTABLE_LEVELS=4
> CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
> CONFIG_IRQ_WORK=y
> CONFIG_BUILDTIME_EXTABLE_SORT=y
> CONFIG_THREAD_INFO_IN_TASK=y
> 
> #
> # General setup
> #
> CONFIG_INIT_ENV_ARG_LIMIT=32
> CONFIG_CROSS_COMPILE=""
> # CONFIG_COMPILE_TEST is not set
> CONFIG_LOCALVERSION=""
> CONFIG_LOCALVERSION_AUTO=y
> CONFIG_HAVE_KERNEL_GZIP=y
> CONFIG_HAVE_KERNEL_BZIP2=y
> CONFIG_HAVE_KERNEL_LZMA=y
> CONFIG_HAVE_KERNEL_XZ=y
> CONFIG_HAVE_KERNEL_LZO=y
> CONFIG_HAVE_KERNEL_LZ4=y
> CONFIG_KERNEL_GZIP=y
> # CONFIG_KERNEL_BZIP2 is not set
> # CONFIG_KERNEL_LZMA is not set
> # CONFIG_KERNEL_XZ is not set
> # CONFIG_KERNEL_LZO is not set
> # CONFIG_KERNEL_LZ4 is not set
> CONFIG_DEFAULT_HOSTNAME="(none)"
> CONFIG_SWAP=y
> CONFIG_SYSVIPC=y
> CONFIG_SYSVIPC_SYSCTL=y
> CONFIG_POSIX_MQUEUE=y
> CONFIG_POSIX_MQUEUE_SYSCTL=y
> CONFIG_CROSS_MEMORY_ATTACH=y
> CONFIG_FHANDLE=y
> CONFIG_USELIB=y
> # CONFIG_AUDIT is not set
> CONFIG_HAVE_ARCH_AUDITSYSCALL=y
> 
> #
> # IRQ subsystem
> #
> CONFIG_GENERIC_IRQ_PROBE=y
> CONFIG_GENERIC_IRQ_SHOW=y
> CONFIG_GENERIC_PENDING_IRQ=y
> CONFIG_IRQ_DOMAIN=y
> CONFIG_IRQ_DOMAIN_HIERARCHY=y
> 

Re: [PATCH -next] staging: mt7621-pci: Use PTR_ERR_OR_ZERO in mt7621_pcie_parse_dt()

2018-09-11 Thread Sergio Paracuellos
On Wed, Sep 12, 2018 at 02:50:08AM +, YueHaibing wrote:
> Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
> 
> Signed-off-by: YueHaibing 
> ---
>  drivers/staging/mt7621-pci/pci-mt7621.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/staging/mt7621-pci/pci-mt7621.c 
> b/drivers/staging/mt7621-pci/pci-mt7621.c
> index ba1f117..d2cb910 100644
> --- a/drivers/staging/mt7621-pci/pci-mt7621.c
> +++ b/drivers/staging/mt7621-pci/pci-mt7621.c
> @@ -396,10 +396,7 @@ static int mt7621_pcie_parse_dt(struct mt7621_pcie *pcie)
>   }
>  
>   pcie->base = devm_ioremap_resource(dev, ®s);
> - if (IS_ERR(pcie->base))
> - return PTR_ERR(pcie->base);
> -
> - return 0;
> + return PTR_ERR_OR_ZERO(pcie->base);
>  }

This patch looks good, but the 'mt7621_pcie_parse_dt' function is not complete
yet. A lot of the parsing for each PCI node is still missing, and the patch
series that adds it has not been tested yet, so those patches are not included.

Please see: 

http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-September/125937.html

Best regards,
Sergio Paracuellos

>  
>  static int mt7621_pcie_request_resources(struct mt7621_pcie *pcie,
> 
> 
> 


Re: [PATCH 06/11] compat_ioctl: remove /dev/random commands

2018-09-11 Thread Martin Schwidefsky
On Tue, 11 Sep 2018 22:26:54 +0200
Arnd Bergmann  wrote:

> On Sun, Sep 9, 2018 at 6:12 AM Al Viro  wrote:
> >
> > On Sat, Sep 08, 2018 at 04:28:12PM +0200, Arnd Bergmann wrote:  
> > > These are all handled by the random driver, so instead of listing
> > > each ioctl, we can just use the same function to deal with both
> > > native and compat commands.  
> >
> > Umm...  I don't think it's right -
> >  
> > >   .unlocked_ioctl = random_ioctl,
> > > + .compat_ioctl = random_ioctl,  
> >
> >  
> > ->compat_ioctl() gets called in  
> > error = f.file->f_op->compat_ioctl(f.file, cmd, 
> > arg);
> > so you do *NOT* get compat_ptr() for those - they have to do it on their
> > own.  It's not hard to provide a proper compat_ioctl() instance for that
> > one, but this is not it.  What you need in drivers/char/random.c part of
> > that one is something like  
> 
> Looping in some s390 folks.
> 
> As you suggested in another reply, I had a look at what other drivers
> do the same thing and have only pointer arguments. I created a
> patch to move them all over to using a new helper function that
> adds the compat_ptr(), and arrived at
> 
>  drivers/android/binder.c| 2 +-
>  drivers/crypto/qat/qat_common/adf_ctl_drv.c | 2 +-
>  drivers/dma-buf/dma-buf.c   | 4 +---
>  drivers/dma-buf/sw_sync.c   | 2 +-
>  drivers/dma-buf/sync_file.c | 2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c| 2 +-
>  drivers/hid/hidraw.c| 4 +---
>  drivers/iio/industrialio-core.c | 2 +-
>  drivers/infiniband/core/uverbs_main.c   | 4 ++--
>  drivers/media/rc/lirc_dev.c | 4 +---
>  drivers/mfd/cros_ec_dev.c   | 4 +---
>  drivers/misc/vmw_vmci/vmci_host.c   | 2 +-
>  drivers/nvdimm/bus.c| 4 ++--
>  drivers/nvme/host/core.c| 6 +++---
>  drivers/pci/switch/switchtec.c  | 2 +-
>  drivers/platform/x86/wmi.c  | 2 +-
>  drivers/rpmsg/rpmsg_char.c  | 4 ++--
>  drivers/s390/char/sclp_ctl.c| 8 ++--
>  drivers/s390/char/vmcp.c| 2 ++
>  drivers/s390/cio/chsc_sch.c | 8 ++--
>  drivers/sbus/char/display7seg.c | 2 +-
>  drivers/sbus/char/envctrl.c | 4 +---
>  drivers/scsi/3w-.c  | 4 +---
>  drivers/scsi/cxlflash/main.c| 2 +-
>  drivers/scsi/esas2r/esas2r_main.c   | 2 +-
>  drivers/scsi/pmcraid.c  | 4 +---
>  drivers/staging/android/ion/ion.c   | 4 +---
>  drivers/staging/vme/devices/vme_user.c  | 2 +-
>  drivers/tee/tee_core.c  | 2 +-
>  drivers/usb/class/cdc-wdm.c | 2 +-
>  drivers/usb/class/usbtmc.c  | 4 +---
>  drivers/video/fbdev/ps3fb.c | 2 +-
>  drivers/video/fbdev/sis/sis_main.c  | 4 +---
>  drivers/virt/fsl_hypervisor.c   | 2 +-
>  fs/btrfs/super.c| 2 +-
>  fs/ceph/dir.c   | 2 +-
>  fs/ceph/file.c  | 2 +-
>  fs/fuse/dev.c   | 2 +-
>  fs/notify/fanotify/fanotify_user.c  | 2 +-
>  fs/userfaultfd.c| 2 +-
>  net/rfkill/core.c   | 2 +-
>  41 files changed, 48 insertions(+), 76 deletions(-)
> 
> Out of those, there are only a few that may get used on s390,
> in particular at most infiniband/uverbs, nvme, nvdimm,
> btrfs, ceph, fuse, fanotify and userfaultfd.
> [Note: there are three s390 drivers in the list, which use
> a different method: they check in_compat_syscall() from
> a shared handler to decide whether to do compat_ptr().

Using in_compat_syscall() seems to be a good solution, no?

> According to my memory from when I last worked on this,
> the compat_ptr() is mainly a safeguard for legacy binaries
> that got created with ancient C compilers (or compilers for
> something other than C)  and might leave the high bit set
> in a pointer, but modern C compilers (gcc-3+) won't ever
> do that.

And compat_ptr() clears the upper 32 bits of the register. If
the register is loaded with e.g. "lr" or "l" there will be
junk in the upper 4 bytes.

> You are probably right about /dev/random, which could be
> used in lots of weird code, but I wonder to what degree we
> need to worry about it for the rest.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: [PATCH v2 2/2] dmaengine: uniphier-mdmac: add UniPhier MIO DMAC driver

2018-09-11 Thread Masahiro Yamada
2018-09-12 13:35 GMT+09:00 Vinod :
> On 12-09-18, 12:01, Masahiro Yamada wrote:
>> Hi Vinod,
>>
>>
>> 2018-09-11 16:00 GMT+09:00 Vinod :
>> > On 24-08-18, 10:41, Masahiro Yamada wrote:
>> >
>> >> +/* mc->vc.lock must be held by caller */
>> >> +static u32 __uniphier_mdmac_get_residue(struct uniphier_mdmac_desc *md)
>> >> +{
>> >> + u32 residue = 0;
>> >> + int i;
>> >> +
>> >> + for (i = md->sg_cur; i < md->sg_len; i++)
>> >> + residue += sg_dma_len(&md->sgl[i]);
>> >
>> > so if the descriptor is submitted to hardware, we return the descriptor
>> > length, which is not correct.
>> >
>> > Two cases are required to be handled:
>> > 1. Descriptor is in queue (IMO above logic is fine for that, but it can
>> > be calculated at descriptor submit and looked up here)
>>
>> Where do you want it to be calculated?
>
> where is it calculated now?


Please see __uniphier_mdmac_handle().


It gets the address and size with sg_dma_address() and sg_dma_len()
just before writing them to the hardware registers.


   sg = &md->sgl[md->sg_cur];

   if (md->dir == DMA_MEM_TO_DEV) {
   src_mode = UNIPHIER_MDMAC_CH_MODE__ADDR_INC;
   src_addr = sg_dma_address(sg);
   dest_mode = UNIPHIER_MDMAC_CH_MODE__ADDR_FIXED;
   dest_addr = 0;
   } else {
   src_mode = UNIPHIER_MDMAC_CH_MODE__ADDR_FIXED;
   src_addr = 0;
   dest_mode = UNIPHIER_MDMAC_CH_MODE__ADDR_INC;
   dest_addr = sg_dma_address(sg);
   }







>> This hardware provides only simple registers (address and size)
>> for one-shot transfer instead of descriptors.
>>
>> So, I used sgl as-is because I did not see a good reason
>> to transform sgl to another data structure.
>
>
>> > this seems missing stuff. Where do you do register calculation for the
>> > descriptor and where is slave_config here, how do you know where to
>> > send/receive data form/to (peripheral)
>>
>>
>> This DMAC is really simple and inflexible.
>>
>> The peripheral address to send/receive data from/to is hard-wired.
>> cfg->{src_addr,dst_addr} is not configurable.
>>
>> Look at __uniphier_mdmac_handle().
>> 'dest_addr' and 'src_addr' must be set to 0 for the peripheral.
>
> Fair enough, what about other values like addr_width and maxburst?


None of them is configurable.



-- 
Best Regards
Masahiro Yamada


[LKP] [rcu] 02a5c550b2: BUG:kernel_reboot-without-warning_in_test_stage

2018-09-11 Thread kernel test robot
FYI, we noticed the following commit (built with gcc-7):

commit: 02a5c550b2738f2bfea8e1e00aa75944d71c9e18 ("rcu: Abstract extended 
quiescent state determination")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: perf_event_tests
with following parameters:

paranoid: disallow_kernel_profiling

test-description: The Perf Event Testsuite.
test-url: https://github.com/deater/perf_event_tests


on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 2G

caused below changes (please refer to attached dmesg/kmsg for entire 
log/backtrace):


+-------------------------------------------------+------------+------------+
|                                                 | 2625d469ba | 02a5c550b2 |
+-------------------------------------------------+------------+------------+
| boot_successes                                  | 18         | 0          |
| boot_failures                                   | 0          | 17         |
| BUG:kernel_reboot-without-warning_in_test_stage | 0          | 17         |
+-------------------------------------------------+------------+------------+



[   21.715217] 
[   22.950524] perf: interrupt took too long (5334 > 5235), lowering 
kernel.perf_event_max_sample_rate to 37250
[   22.956921] perf: interrupt took too long (6735 > 6667), lowering 
kernel.perf_event_max_sample_rate to 29500
[   22.970150] perf: interrupt took too long (8494 > 8418), lowering 
kernel.perf_event_max_sample_rate to 23500
[   22.976586] perf: interrupt took too long (10754 > 10617), lowering 
kernel.perf_event_max_sample_rate to 18500
BUG: kernel reboot-without-warning in test stage

Elapsed time: 30

#!/bin/bash



To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k  job-script # job-script is attached in this 
email



Thanks,
lkp
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 4.10.0-rc3 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEBUG_RODATA=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y

Re: [PATCH] firmware: vpd: fix spelling mistake "partion" -> "partition"

2018-09-11 Thread Greg Kroah-Hartman
On Tue, Sep 11, 2018 at 09:58:48PM -0700, Guenter Roeck wrote:
> On 09/11/2018 09:58 AM, Colin King wrote:
> > From: Colin Ian King 
> > 
> > Trivial fix to spelling mistake in comment
> > 
> > Signed-off-by: Colin Ian King 
> 
> Reviewed-by: Guenter Roeck 
> 
> Interesting - drivers/firmware/google/ does not have a maintainer.
> Greg - is it correct to assume that you are the de-facto maintainer ?

Yeah, I am :(

I'll queue this up, thanks.

greg k-h


[PATCH v2 0/1] gpio: mvebu: Add support for multiple PWM lines

2018-09-11 Thread Aditya Prayoga


Hi everyone,

Helios4, an Armada 388 based NAS SBC, provides two 4-pin fan connectors.
The PWM pins on both connectors are connected to GPIOs on bank 1. The
current gpio-mvebu driver does not allow more than one PWM on the same bank.

Aditya

---

Changes v1->v2:
  * Merge/squash "[Patch 2/2] gpio: mvebu: Allow to use non-default PWM counter"
  * Allow only two PWMs as suggested by Andrew Lunn and Richard Genoud

---

Aditya Prayoga (1):
  gpio: mvebu: Add support for multiple PWM lines per GPIO chip

 drivers/gpio/gpio-mvebu.c | 73 ++-
 1 file changed, 60 insertions(+), 13 deletions(-)

-- 
2.7.4



[PATCH v2 1/1] gpio: mvebu: Add support for multiple PWM lines per GPIO chip

2018-09-11 Thread Aditya Prayoga
Allow more than one PWM request (e.g. a PWM fan) on the same GPIO chip. If
the other PWM counter is unused, allocate it to the next PWM request.
The priority is:
1. The default counter assigned to the bank
2. An unused counter assigned to the other bank

Since there are only two PWM counters, only two PWMs are supported.

Signed-off-by: Aditya Prayoga 
---
 drivers/gpio/gpio-mvebu.c | 73 ++-
 1 file changed, 60 insertions(+), 13 deletions(-)

diff --git a/drivers/gpio/gpio-mvebu.c b/drivers/gpio/gpio-mvebu.c
index 6e02148..2d46b87 100644
--- a/drivers/gpio/gpio-mvebu.c
+++ b/drivers/gpio/gpio-mvebu.c
@@ -92,9 +92,16 @@
 
 #define MVEBU_MAX_GPIO_PER_BANK32
 
+enum mvebu_pwm_counter {
+   MVEBU_PWM_CTRL_SET_A = 0,
+   MVEBU_PWM_CTRL_SET_B,
+   MVEBU_PWM_CTRL_MAX
+};
+
 struct mvebu_pwm {
void __iomem*membase;
unsigned longclk_rate;
+   enum mvebu_pwm_counter   id;
struct gpio_desc*gpiod;
struct pwm_chip  chip;
spinlock_t   lock;
@@ -128,6 +135,8 @@ struct mvebu_gpio_chip {
u32level_mask_regs[4];
 };
 
+static struct mvebu_pwm*mvebu_pwm_list[MVEBU_PWM_CTRL_MAX];
+
 /*
  * Functions returning addresses of individual registers for a given
  * GPIO controller.
@@ -594,34 +603,59 @@ static struct mvebu_pwm *to_mvebu_pwm(struct pwm_chip 
*chip)
return container_of(chip, struct mvebu_pwm, chip);
 }
 
+static struct mvebu_pwm *mvebu_pwm_get_avail_counter(void)
+{
+   enum mvebu_pwm_counter i;
+
+   for (i = MVEBU_PWM_CTRL_SET_A; i < MVEBU_PWM_CTRL_MAX; i++) {
+   if (!mvebu_pwm_list[i]->gpiod)
+   return mvebu_pwm_list[i];
+   }
+   return NULL;
+}
+
 static int mvebu_pwm_request(struct pwm_chip *chip, struct pwm_device *pwm)
 {
struct mvebu_pwm *mvpwm = to_mvebu_pwm(chip);
struct mvebu_gpio_chip *mvchip = mvpwm->mvchip;
struct gpio_desc *desc;
+   struct mvebu_pwm *counter;
unsigned long flags;
int ret = 0;
 
spin_lock_irqsave(&mvpwm->lock, flags);
 
-   if (mvpwm->gpiod) {
-   ret = -EBUSY;
-   } else {
-   desc = gpiochip_request_own_desc(&mvchip->chip,
-pwm->hwpwm, "mvebu-pwm");
-   if (IS_ERR(desc)) {
-   ret = PTR_ERR(desc);
+   counter = mvpwm;
+   if (counter->gpiod) {
+   counter = mvebu_pwm_get_avail_counter();
+   if (!counter) {
+   ret = -EBUSY;
goto out;
}
 
-   ret = gpiod_direction_output(desc, 0);
-   if (ret) {
-   gpiochip_free_own_desc(desc);
-   goto out;
-   }
+   pwm->chip_data = counter;
+   }
 
-   mvpwm->gpiod = desc;
+   desc = gpiochip_request_own_desc(&mvchip->chip,
+pwm->hwpwm, "mvebu-pwm");
+   if (IS_ERR(desc)) {
+   ret = PTR_ERR(desc);
+   goto out;
}
+
+   ret = gpiod_direction_output(desc, 0);
+   if (ret) {
+   gpiochip_free_own_desc(desc);
+   goto out;
+   }
+
+   regmap_update_bits(mvchip->regs, GPIO_BLINK_CNT_SELECT_OFF +
+  mvchip->offset, BIT(pwm->hwpwm),
+  counter->id ? BIT(pwm->hwpwm) : 0);
+   regmap_read(mvchip->regs, GPIO_BLINK_CNT_SELECT_OFF +
+   mvchip->offset, &counter->blink_select);
+
+   counter->gpiod = desc;
 out:
spin_unlock_irqrestore(&mvpwm->lock, flags);
return ret;
@@ -632,6 +666,11 @@ static void mvebu_pwm_free(struct pwm_chip *chip, struct 
pwm_device *pwm)
struct mvebu_pwm *mvpwm = to_mvebu_pwm(chip);
unsigned long flags;
 
+   if (pwm->chip_data) {
+   mvpwm = (struct mvebu_pwm *) pwm->chip_data;
+   pwm->chip_data = NULL;
+   }
+
spin_lock_irqsave(&mvpwm->lock, flags);
gpiochip_free_own_desc(mvpwm->gpiod);
mvpwm->gpiod = NULL;
@@ -648,6 +687,9 @@ static void mvebu_pwm_get_state(struct pwm_chip *chip,
unsigned long flags;
u32 u;
 
+   if (pwm->chip_data)
+   mvpwm = (struct mvebu_pwm *) pwm->chip_data;
+
spin_lock_irqsave(&mvpwm->lock, flags);
 
val = (unsigned long long)
@@ -695,6 +737,9 @@ static int mvebu_pwm_apply(struct pwm_chip *chip, struct 
pwm_device *pwm,
unsigned long flags;
unsigned int on, off;
 
+   if (pwm->chip_data)
+   mvpwm = (struct mvebu_pwm *) pwm->chip_data;
+
val = (unsigned long long) mvpwm->clk_rate * state->duty_cycle;
do_div(val, NSEC_PER_SEC);
if (val > UINT_MAX)
@@ -804,6 +849,7 @@ static int mvebu_pwm_probe(struct platform_device *pdev,

Re: [PATCH] xtensa: remove unnecessary KBUILD_SRC ifeq conditional

2018-09-11 Thread Max Filippov
On Tue, Sep 11, 2018 at 9:25 PM, Masahiro Yamada
 wrote:
> You can always prefix variant/platform header search paths with
> $(srctree)/ because $(srctree) is '.' for in-tree building.
>
> Signed-off-by: Masahiro Yamada 
> ---
>
>  arch/xtensa/Makefile | 4 
>  1 file changed, 4 deletions(-)

Thanks, applied to my xtensa tree.

-- 
Thanks.
-- Max


Re: [LKP] 0a3856392c [ 10.513760] INFO: trying to register non-static key.

2018-09-11 Thread Rong Chen




On 09/07/2018 10:19 AM, Matthew Wilcox wrote:

On Fri, Sep 07, 2018 at 09:05:39AM +0800, kernel test robot wrote:

Greetings,

0day kernel testing robot got the below dmesg and the first bad commit is

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

commit 0a3856392cff1542170b5bc37211c9a21fd0c3f6
Author: Matthew Wilcox 
AuthorDate: Mon Jun 18 17:23:37 2018 -0400
Commit: Matthew Wilcox 
CommitDate: Tue Aug 21 23:54:20 2018 -0400

 test_ida: Move ida_check_leaf
 
 Convert to new API and move to kernel space.  Take the opportunity to

 test the situation a little more thoroughly (ie at different offsets).
 
 Signed-off-by: Matthew Wilcox 

Thank you test-bot.  Can you check if this patch fixes the problem?

Thanks, it works.

Best Regards,
Rong Chen



diff --git a/lib/test_ida.c b/lib/test_ida.c
index 2d1637d8136b..b06880625961 100644
--- a/lib/test_ida.c
+++ b/lib/test_ida.c
@@ -150,10 +150,10 @@ static void ida_check_conv(struct ida *ida)
IDA_BUG_ON(ida, !ida_is_empty(ida));
  }
  
+static DEFINE_IDA(ida);
+
  static int ida_checks(void)
  {
-   DEFINE_IDA(ida);
-
IDA_BUG_ON(&ida, !ida_is_empty(&ida));
ida_check_alloc(&ida);
ida_check_destroy(&ida);





Re: [PATCH] firmware: vpd: fix spelling mistake "partion" -> "partition"

2018-09-11 Thread Guenter Roeck

On 09/11/2018 09:58 AM, Colin King wrote:

From: Colin Ian King 

Trivial fix to spelling mistake in comment

Signed-off-by: Colin Ian King 


Reviewed-by: Guenter Roeck 

Interesting - drivers/firmware/google/ does not have a maintainer.
Greg - is it correct to assume that you are the de-facto maintainer ?

Guenter


---
  drivers/firmware/google/vpd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/google/vpd.c b/drivers/firmware/google/vpd.c
index 1aa67bb5d8c0..c0c0b4e4e281 100644
--- a/drivers/firmware/google/vpd.c
+++ b/drivers/firmware/google/vpd.c
@@ -198,7 +198,7 @@ static int vpd_section_init(const char *name, struct 
vpd_section *sec,
  
  	sec->name = name;
  
-	/* We want to export the raw partion with name ${name}_raw */
+	/* We want to export the raw partition with name ${name}_raw */
sec->raw_name = kasprintf(GFP_KERNEL, "%s_raw", name);
if (!sec->raw_name) {
err = -ENOMEM;





[PATCH] kbuild: prefix Makefile.dtbinst path with $(srctree) unconditionally

2018-09-11 Thread Masahiro Yamada
$(srctree) always points to the top of the source tree whether
KBUILD_SRC is set or not.

Signed-off-by: Masahiro Yamada 
---

 scripts/Kbuild.include | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index ce53639..46cc43e 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -193,7 +193,7 @@ modbuiltin := -f $(srctree)/scripts/Makefile.modbuiltin obj
 # Shorthand for $(Q)$(MAKE) -f scripts/Makefile.dtbinst obj=
 # Usage:
 # $(Q)$(MAKE) $(dtbinst)=dir
-dtbinst := -f $(if $(KBUILD_SRC),$(srctree)/)scripts/Makefile.dtbinst obj
+dtbinst := -f $(srctree)/scripts/Makefile.dtbinst obj
 
 ###
 # Shorthand for $(Q)$(MAKE) -f scripts/Makefile.clean obj=
-- 
2.7.4



linux-next: Tree for Sep 12

2018-09-11 Thread Stephen Rothwell
Hi all,

News: there will be no linux-next releases on Friday or Monday.

Changes since 20180911:

Dropped trees: xarray, ida (temporarily)

I applied a patch for a runtime problem in the vfs tree and I still
disabled building some samples.

The drm-misc tree gained a conflict against the drm tree.

The tty tree lost its build failure.

Non-merge commits (relative to Linus' tree): 3284
 3703 files changed, 109374 insertions(+), 67935 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a
multi_v7_defconfig for arm and a native build of tools/perf. After
the final fixups (if any), I do an x86_64 modules_install followed by
builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit),
ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc
and sparc64 defconfig. And finally, a simple boot test of the powerpc
pseries_le_defconfig kernel in qemu (with and without kvm enabled).

Below is a summary of the state of the merge.

I am currently merging 287 trees (counting Linus' and 66 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (11da3a7f84f1 Linux 4.19-rc3)
Merging fixes/master (72358c0b59b7 linux-next: build warnings from the build of 
Linus' tree)
Merging kbuild-current/fixes (11da3a7f84f1 Linux 4.19-rc3)
Merging arc-current/for-curr (00a99339f0a3 ARCv2: build: use mcpu=hs38 iso 
generic mcpu=archs)
Merging arm-current/fixes (afc9f65e01cd ARM: 8781/1: Fix Thumb-2 syscall return 
for binutils 2.29+)
Merging arm64-fixes/for-next/fixes (84c57dbd3c48 arm64: kernel: 
arch_crash_save_vmcoreinfo() should depend on CONFIG_CRASH_CORE)
Merging m68k-current/for-linus (0986b16ab49b m68k/mac: Use correct PMU response 
format)
Merging powerpc-fixes/fixes (cca19f0b684f powerpc/64s/radix: Fix missing global 
invalidations when removing copro)
Merging sparc/master (df2def49c57b Merge tag 'acpi-4.19-rc1-2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm)
Merging fscrypt-current/for-stable (ae64f9bd1d36 Linux 4.15-rc2)
Merging net/master (7c5cca358854 qmi_wwan: Support dynamic config on Quectel 
EP06)
Merging bpf/master (28619527b8a7 Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net)
Merging ipsec/master (782710e333a5 xfrm: reset crypto_done when iterating over 
multiple input xfrms)
Merging netfilter/master (1286df269f49 netfilter: xt_hashlimit: use s->file 
instead of s->private)
Merging ipvs/master (feb9f55c33e5 netfilter: nft_dynset: allow dynamic updates 
of non-anonymous set)
Merging wireless-drivers/master (5b394b2ddf03 Linux 4.19-rc1)
Merging mac80211/master (c42055105785 mac80211: fix TX status reporting for 
ieee80211s)
Merging rdma-fixes/for-rc (8f28b178f71c RDMA/mlx4: Ensure that maximal 
send/receive SGE less than supported by HW)
Merging sound-current/for-linus (49434c6c575d ALSA: emu10k1: fix possible info 
leak to userspace on SNDRV_EMU10K1_IOCTL_INFO)
Merging sound-asoc-fixes/for-linus (de7609683fef Merge branch 'asoc-4.19' into 
asoc-linus)
Merging regmap-fixes/for-linus (57361846b52b Linux 4.19-rc2)
Merging regulator-fixes/for-linus (b832dd4f2c04 Merge branch 'regulator-4.19' 
into regulator-linus)
Merging spi-fixes/for-linus (9029858be9ef Merge branch 'spi-4.19' into 
spi-linus)
Merging pci-current/for-linus (34fb6bf9b13a PCI: pciehp: Fix hot-add vs 
powerfault detection order)
Merging driver-core.current/driver-core-linus (11da3a7f84f1 Linux 4.19-rc3)
Merging tty.current/tty-linus (7f2bf7840b74 tty: hvc: hvc_write() fix break 
condition)
Merging usb.current/usb-linus (df3aa13c7bbb Revert "cdc-acm: implement 
put_char() and flush_chars()")
Merging usb-gadget-fixes/fixes (d9707490077b usb: dwc2: Fix call location of 
dwc2_check_core_endian

Re: [PATCH v2 2/2] dmaengine: uniphier-mdmac: add UniPhier MIO DMAC driver

2018-09-11 Thread Vinod
On 12-09-18, 12:01, Masahiro Yamada wrote:
> Hi Vinod,
> 
> 
> 2018-09-11 16:00 GMT+09:00 Vinod :
> > On 24-08-18, 10:41, Masahiro Yamada wrote:
> >
> >> +/* mc->vc.lock must be held by caller */
> >> +static u32 __uniphier_mdmac_get_residue(struct uniphier_mdmac_desc *md)
> >> +{
> >> + u32 residue = 0;
> >> + int i;
> >> +
> >> + for (i = md->sg_cur; i < md->sg_len; i++)
> >> + residue += sg_dma_len(&md->sgl[i]);
> >
> > so if the descriptor is submitted to hardware, we return the descriptor
> > length, which is not correct.
> >
> > Two cases are required to be handled:
> > 1. Descriptor is in queue (IMO above logic is fine for that, but it can
> > be calculated at descriptor submit and looked up here)
> 
> Where do you want it to be calculated?

where is it calculated now?

> This hardware provides only simple registers (address and size)
> for one-shot transfer instead of descriptors.
> 
> So, I used sgl as-is because I did not see a good reason
> to transform sgl to another data structure.


> > this seems missing stuff. Where do you do register calculation for the
> > descriptor and where is slave_config here, how do you know where to
> > send/receive data form/to (peripheral)
> 
> 
> This DMAC is really simple and inflexible.
> 
> The peripheral address to send/receive data from/to is hard-wired.
> cfg->{src_addr,dst_addr} is not configurable.
> 
> Look at __uniphier_mdmac_handle().
> 'dest_addr' and 'src_addr' must be set to 0 for the peripheral.

Fair enough, what about other values like addr_width and maxburst?
-- 
~Vinod


Re: [PATCH] RISC-V: Show IPI stats

2018-09-11 Thread Anup Patel
On Mon, Sep 10, 2018 at 7:16 PM, Christoph Hellwig  wrote:
> On Fri, Sep 07, 2018 at 06:14:29PM +0530, Anup Patel wrote:
>> This patch provides arch_show_interrupts() implementation to
>> show IPI stats via /proc/interrupts.
>>
>> Now the contents of /proc/interrupts" will look like below:
>>CPU0   CPU1   CPU2   CPU3
>>   8: 17  7  6 14  SiFive PLIC   8  virtio0
>>  10: 10 10  9 11  SiFive PLIC  10  ttyS0
>> IPI0:   170673251 79  Rescheduling interrupts
>> IPI1: 1 12 27  1  Function call interrupts
>> IPI2: 0  0  0  0  CPU wake-up interrupts
>>
>> Signed-off-by: Anup Patel 
>
> Thanks, this looks pretty sensible to me.  Maybe we want to also show
> timer interrupts if we do this?

Let's not include timer stats here until the RISC-V INTC driver is
concluded. We can do it as a separate patch if required.

>
>> --- a/arch/riscv/kernel/irq.c
>> +++ b/arch/riscv/kernel/irq.c
>> @@ -8,6 +8,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  /*
>>   * Possible interrupt causes:
>> @@ -24,6 +25,14 @@
>>   */
>>  #define INTERRUPT_CAUSE_FLAG (1UL << (__riscv_xlen - 1))
>>
>> +int arch_show_interrupts(struct seq_file *p, int prec)
>> +{
>> +#ifdef CONFIG_SMP
>> + show_ipi_stats(p, prec);
>> +#endif
>> + return 0;
>> +}
>
> If we don't also add timer stats I'd just move arch_show_interrupts
> to smp.c and make it conditional.  If we don't this split might make
> more sense.

I understand you want to avoid the #ifdef here. We can achieve the same
thing by having an empty inline show_ipi_stats() stub in asm/smp.h for
the !CONFIG_SMP case. This way we can keep arch_show_interrupts()
in kernel/irq.c, which is intuitively the correct location for
arch_show_interrupts().

>
>> +static const char *ipi_names[IPI_MAX] = {
>> + [IPI_RESCHEDULE] = "Rescheduling interrupts",
>> + [IPI_CALL_FUNC] = "Function call interrupts",
>> + [IPI_CALL_WAKEUP] = "CPU wake-up interrupts",
>> +};
>
> No need for the explicit array size.  Also please use a few tabs to
> align this nicely:
>
> static const char *ipi_names[] = {
> [IPI_RESCHEDULE]= "Rescheduling interrupts",
> [IPI_CALL_FUNC] = "Function call interrupts",
> [IPI_CALL_WAKEUP]   = "CPU wake-up interrupts",
> };

Sure, will do.

Regards,
Anup


[PATCH] xtensa: remove unnecessary KBUILD_SRC ifeq conditional

2018-09-11 Thread Masahiro Yamada
You can always prefix variant/platform header search paths with
$(srctree)/ because $(srctree) is '.' for in-tree building.

Signed-off-by: Masahiro Yamada 
---

 arch/xtensa/Makefile | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/xtensa/Makefile b/arch/xtensa/Makefile
index 295c120..d67e30fa 100644
--- a/arch/xtensa/Makefile
+++ b/arch/xtensa/Makefile
@@ -64,11 +64,7 @@ endif
 vardirs := $(patsubst %,arch/xtensa/variants/%/,$(variant-y))
 plfdirs := $(patsubst %,arch/xtensa/platforms/%/,$(platform-y))
 
-ifeq ($(KBUILD_SRC),)
-KBUILD_CPPFLAGS += $(patsubst %,-I%include,$(vardirs) $(plfdirs))
-else
 KBUILD_CPPFLAGS += $(patsubst %,-I$(srctree)/%include,$(vardirs) $(plfdirs))
-endif
 
 KBUILD_DEFCONFIG := iss_defconfig
 
-- 
2.7.4



[PATCH] ARM: remove unnecessary KBUILD_SRC ifeq conditional

2018-09-11 Thread Masahiro Yamada
You can always prefix machine/plat header search paths with
$(srctree)/ because $(srctree) is '.' for in-tree building.

Signed-off-by: Masahiro Yamada 
---

KernelVersion: v4.19-rc3


 arch/arm/Makefile | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/arm/Makefile b/arch/arm/Makefile
index d1516f8..06ebff7 100644
--- a/arch/arm/Makefile
+++ b/arch/arm/Makefile
@@ -264,13 +264,9 @@ platdirs := $(patsubst %,arch/arm/plat-%/,$(sort 
$(plat-y)))
 
 ifneq ($(CONFIG_ARCH_MULTIPLATFORM),y)
 ifneq ($(CONFIG_ARM_SINGLE_ARMV7M),y)
-ifeq ($(KBUILD_SRC),)
-KBUILD_CPPFLAGS += $(patsubst %,-I%include,$(machdirs) $(platdirs))
-else
 KBUILD_CPPFLAGS += $(patsubst %,-I$(srctree)/%include,$(machdirs) $(platdirs))
 endif
 endif
-endif
 
 export TEXT_OFFSET GZFLAGS MMUEXT
 
-- 
2.7.4



Re: [PATCH] include/linux/compiler-clang.h: define __naked

2018-09-11 Thread Miguel Ojeda
Hi Arnd, Nick, Stefan,

On Mon, Sep 10, 2018 at 2:14 PM, Arnd Bergmann  wrote:
> On Mon, Sep 10, 2018 at 8:05 AM Stefan Agner  wrote:
>>
>> ARM32 arch code uses the __naked attribute. This has previously been
>> defined in include/linux/compiler-gcc.h, which is no longer included
>> for Clang. Define __naked for Clang. Conservatively add all attributes
>> previously used (and supported by Clang).
>>
>> This fixes compile errors when building ARM32 using Clang:
>>   arch/arm/mach-exynos/mcpm-exynos.c:193:13: error: variable has incomplete 
>> type 'void'
>>   static void __naked exynos_pm_power_up_setup(unsigned int affinity_level)
>>   ^
>>
>> Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually 
>> exclusive")
>> Signed-off-by: Stefan Agner 
>
>> +/*
>> + * ARM32 is currently the only user of __naked supported by Clang. Follow
>> + * gcc: Do not trace naked functions and make sure they don't get inlined.
>> + */
>> +#define __naked __attribute__((naked)) noinline notrace
>> +
>
> Please see patches 5 and 6 of the series that Miguel posted:
>
> https://lore.kernel.org/lkml/20180908212459.19736-6-miguel.ojeda.sando...@gmail.com/
>
> I suppose we want the patch to fix clang build as soon as possible though,
> and follow up with the cleanup for the next merge window, right?

Not sure what the plans of Linus et al. are, if they have any; but
that would be a safe bet.

In case they want to speed this up and put the entire series into
v4.19 (instead of the two patches), I have done a binary & objdump
diff between -rc2 and v4 (based on -rc2) on all object files (with
UTS_RELEASE fixed to avoid some differences).

In a x86_64 tinyconfig with gcc 7.3, the differences I found are:

$ ./compare.py linux-rc2 linux-v4
[2018-09-12 06:16:39,483] [INFO] [arch/x86/boot/compressed/piggy.o]
Binary diff (use 'bash -c "cmp
linux-rc2/arch/x86/boot/compressed/piggy.o
linux-v4/arch/x86/boot/compressed/piggy.o"' to replicate)
[2018-09-12 06:16:39,606] [INFO] [arch/x86/boot/header.o] Binary diff
(use 'bash -c "cmp linux-rc2/arch/x86/boot/header.o
linux-v4/arch/x86/boot/header.o"' to replicate)
[2018-09-12 06:16:39,659] [INFO] [arch/x86/boot/version.o] Binary diff
(use 'bash -c "cmp linux-rc2/arch/x86/boot/version.o
linux-v4/arch/x86/boot/version.o"' to replicate)
[2018-09-12 06:16:40,483] [INFO] [init/version.o] Binary diff (use
'bash -c "cmp linux-rc2/init/version.o linux-v4/init/version.o"' to
replicate)

I will do a bigger one tomorrow or so and see if there are any
important differences. Regardless of what we do, I will send the
__naked patches separately as well (requested by Nick on GitHub).

Cheers,
Miguel


[PATCH] dt-bindings: power: Introduce suspend states supported properties

2018-09-11 Thread Keerthy
Introduce generic Linux suspend-states-supported properties.
It is convenient for the generic suspend path to know which
suspend states are supported, based on the device tree
properties, so that the system can either be suspended or
safely bail out of suspend if none of the suspend states
are supported.

Signed-off-by: Keerthy 
---
 .../devicetree/bindings/power/power-states.txt | 22 ++
 1 file changed, 22 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/power/power-states.txt

diff --git a/Documentation/devicetree/bindings/power/power-states.txt 
b/Documentation/devicetree/bindings/power/power-states.txt
new file mode 100644
index 000..bb80b36
--- /dev/null
+++ b/Documentation/devicetree/bindings/power/power-states.txt
@@ -0,0 +1,22 @@
+* Generic system suspend states support
+
+Most platforms support multiple suspend states. Define system
+suspend states so that one can target appropriate low power
+states based on the SoC capabilities.
+
+linux,suspend-to-memory-supported
+
+Upon suspend to memory the system context is saved to primary memory.
+All the clocks for all the peripherals including CPU are gated.
+
+linux,suspend-power-off-supported
+
+In this case, in addition to the clocks, all the voltage resources are
+turned off except the ones needed to retain the primary memory
+and a wake-up source that can trigger a wakeup event.
+
+linux,suspend-to-disk-supported
+
+Upon suspend to disk the system context is saved to secondary memory.
+All the clocks for all the peripherals including CPU are gated. Even
+the primary memory is turned off.
-- 
1.9.1
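A board DTS opting in to two of the proposed states might look like the sketch below. Note this is a hypothetical example: the binding text above does not specify which node should carry the properties, so placing them under the root node is an assumption.

```dts
/ {
	/* Flags advertising which suspend states this SoC supports. */
	linux,suspend-to-memory-supported;
	linux,suspend-power-off-supported;
	/* linux,suspend-to-disk-supported deliberately omitted:
	 * this board cannot power off primary memory. */
};
```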



Re: [PATCH 2/3] sound: enable interrupt after dma buffer initialization

2018-09-11 Thread Vinod
On 11-09-18, 14:58, Yu Zhao wrote:
> On Tue, Sep 11, 2018 at 08:06:49AM +0200, Takashi Iwai wrote:
> > On Mon, 10 Sep 2018 23:21:50 +0200,
> > Yu Zhao wrote:
> > > 
> > > In snd_hdac_bus_init_chip(), we enable the interrupt before
> > > snd_hdac_bus_init_cmd_io() initializes the DMA buffers. If the irq
> > > has been acquired and the irq handler uses the DMA buffer, the
> > > kernel may crash when an interrupt comes in.
> > > 
> > > Fix the problem by postponing enabling the irq until after DMA
> > > buffer initialization. And warn once on a NULL DMA buffer pointer
> > > during the initialization.
> > > 
> > > Signed-off-by: Yu Zhao 
> > 
> > Looks good to me.
> > 
> > Reviewed-by: Takashi Iwai 
> > 
> > 
> > BTW, the reason why this hasn't been hit on the legacy HD-audio driver
> > is that we usually allocate the irq with MSI, so the irq is isolated.
> > 
> > Any reason that the Intel SKL driver doesn't use MSI?
> 
> This I'm not sure. Vinod might have answer to it, according to
> https://patchwork.kernel.org/patch/6375831/#13796611

IIRC (it was quite some time back) we faced issues with using MSI on SKL
and didn't try again afterwards. Perhaps the Intel folks can try it and
check. Pierre is out, maybe Liam can help..?

-- 
~Vinod


Re: [Question] Are the trace APIs declared by "TRACE_EVENT(irq_handler_entry" allowed to be used in Ko?

2018-09-11 Thread Steven Rostedt
On Wed, 12 Sep 2018 10:08:37 +0800
"Leizhen (ThunderTown)"  wrote:

> After patch 7e066fb870fc ("tracepoints: add DECLARE_TRACE() and
> DEFINE_TRACE()"), the trace APIs declared by
> "TRACE_EVENT(irq_handler_entry" cannot be directly used by a kernel
> module (.ko), because they are not explicitly exported by
> EXPORT_TRACEPOINT_SYMBOL_GPL or EXPORT_TRACEPOINT_SYMBOL.
> 
> Did we miss it? Or is it not recommended to be used in a module?
> 

Why do you need it? This patch is almost 10 years old, and you are just
now finding an issue with it?

-- Steve

> 
> -
> 
> commit 7e066fb870fcd1025ec3ba7bbde5d541094f4ce1
> Author: Mathieu Desnoyers 
> Date:   Fri Nov 14 17:47:47 2008 -0500
> 
> tracepoints: add DECLARE_TRACE() and DEFINE_TRACE()
> 
> Impact: API *CHANGE*. Must update all tracepoint users.
> 
> Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint
> structure in a single spot for all the kernel. It helps reducing memory
> consumption, especially when declaring a lot of tracepoints, e.g. for
> kmalloc tracing.
> 
> *API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers for
> tracepoint declarations rather than DEFINE_TRACE(). This is the sane way
> to do it. The name previously used was misleading.
> 
> Updates scheduler instrumentation to follow this API change.
> 
> 



[PATCH] ASoC: remove unneeded static set .owner field in platform_driver

2018-09-11 Thread zhong jiang
platform_driver_register will set the .owner field. So it is safe
to remove the redundant assignment.

The issue is detected with the help of Coccinelle.

Signed-off-by: zhong jiang 
---
 sound/soc/mediatek/mt2701/mt2701-wm8960.c | 1 -
 sound/soc/mediatek/mt6797/mt6797-mt6351.c | 1 -
 sound/soc/rockchip/rk3288_hdmi_analog.c   | 1 -
 3 files changed, 3 deletions(-)

diff --git a/sound/soc/mediatek/mt2701/mt2701-wm8960.c 
b/sound/soc/mediatek/mt2701/mt2701-wm8960.c
index 89f34ef..e5d49e6 100644
--- a/sound/soc/mediatek/mt2701/mt2701-wm8960.c
+++ b/sound/soc/mediatek/mt2701/mt2701-wm8960.c
@@ -150,7 +150,6 @@ static int mt2701_wm8960_machine_probe(struct 
platform_device *pdev)
 static struct platform_driver mt2701_wm8960_machine = {
.driver = {
.name = "mt2701-wm8960",
-   .owner = THIS_MODULE,
 #ifdef CONFIG_OF
.of_match_table = mt2701_wm8960_machine_dt_match,
 #endif
diff --git a/sound/soc/mediatek/mt6797/mt6797-mt6351.c 
b/sound/soc/mediatek/mt6797/mt6797-mt6351.c
index b1558c5..6e578e8 100644
--- a/sound/soc/mediatek/mt6797/mt6797-mt6351.c
+++ b/sound/soc/mediatek/mt6797/mt6797-mt6351.c
@@ -205,7 +205,6 @@ static int mt6797_mt6351_dev_probe(struct platform_device 
*pdev)
 static struct platform_driver mt6797_mt6351_driver = {
.driver = {
.name = "mt6797-mt6351",
-   .owner = THIS_MODULE,
 #ifdef CONFIG_OF
.of_match_table = mt6797_mt6351_dt_match,
 #endif
diff --git a/sound/soc/rockchip/rk3288_hdmi_analog.c 
b/sound/soc/rockchip/rk3288_hdmi_analog.c
index 929b3fe..a472d5e 100644
--- a/sound/soc/rockchip/rk3288_hdmi_analog.c
+++ b/sound/soc/rockchip/rk3288_hdmi_analog.c
@@ -286,7 +286,6 @@ static int snd_rk_mc_probe(struct platform_device *pdev)
.probe = snd_rk_mc_probe,
.driver = {
.name = DRV_NAME,
-   .owner = THIS_MODULE,
.pm = &snd_soc_pm_ops,
.of_match_table = rockchip_sound_of_match,
},
-- 
1.7.12.4



[PATCH TRIVIAL] Punctuation fixes

2018-09-11 Thread Diego Viola
Signed-off-by: Diego Viola 
---
 CREDITS | 2 +-
 MAINTAINERS | 2 +-
 Makefile| 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/CREDITS b/CREDITS
index 5befd2d71..b82efb36d 100644
--- a/CREDITS
+++ b/CREDITS
@@ -1473,7 +1473,7 @@ W: http://www.linux-ide.org/
 W: http://www.linuxdiskcert.org/
 D: Random SMP kernel hacker...
 D: Uniform Multi-Platform E-IDE driver
-D: Active-ATA-Chipset maddness..
+D: Active-ATA-Chipset maddness...
 D: Ultra DMA 133/100/66/33 w/48-bit Addressing
 D: ATA-Disconnect, ATA-TCQ
 D: ATA-Smart Kernel Daemon
diff --git a/MAINTAINERS b/MAINTAINERS
index d870cb57c..6567bf245 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -93,7 +93,7 @@ Descriptions of section entries:
   Supported:   Someone is actually paid to look after this.
   Maintained:  Someone actually looks after it.
   Odd Fixes:   It has a maintainer but they don't have time to do
-   much other than throw the odd patch in. See below..
+   much other than throw the odd patch in. See below.
   Orphan:  No current maintainer [but maybe you could take the
role as you write your new code].
   Obsolete:Old code. Something tagged obsolete generally means
diff --git a/Makefile b/Makefile
index 4d5c883a9..7b5c5d634 100644
--- a/Makefile
+++ b/Makefile
@@ -1109,7 +1109,7 @@ archprepare: archheaders archscripts prepare1 
scripts_basic
 prepare0: archprepare gcc-plugins
$(Q)$(MAKE) $(build)=.
 
-# All the preparing..
+# All the preparing...
 prepare: prepare0 prepare-objtool
 
 # Support for using generic headers in asm-generic
-- 
2.19.0



[PATCH] staging: remove unneeded static set .owner field in platform_driver

2018-09-11 Thread zhong jiang
platform_driver_register will set the .owner field. So it is safe
to remove the redundant assignment.

The issue is detected with the help of Coccinelle.

Signed-off-by: zhong jiang 
---
 drivers/staging/greybus/audio_codec.c| 1 -
 drivers/staging/mt7621-eth/gsw_mt7621.c  | 1 -
 drivers/staging/mt7621-eth/mtk_eth_soc.c | 1 -
 3 files changed, 3 deletions(-)

diff --git a/drivers/staging/greybus/audio_codec.c 
b/drivers/staging/greybus/audio_codec.c
index 35acd55..08746c8 100644
--- a/drivers/staging/greybus/audio_codec.c
+++ b/drivers/staging/greybus/audio_codec.c
@@ -1087,7 +1087,6 @@ static int gbaudio_codec_remove(struct platform_device 
*pdev)
 static struct platform_driver gbaudio_codec_driver = {
.driver = {
.name = "apb-dummy-codec",
-   .owner = THIS_MODULE,
 #ifdef CONFIG_PM
.pm = &gbaudio_codec_pm_ops,
 #endif
diff --git a/drivers/staging/mt7621-eth/gsw_mt7621.c 
b/drivers/staging/mt7621-eth/gsw_mt7621.c
index 2c07b55..53767b1 100644
--- a/drivers/staging/mt7621-eth/gsw_mt7621.c
+++ b/drivers/staging/mt7621-eth/gsw_mt7621.c
@@ -286,7 +286,6 @@ static int mt7621_gsw_remove(struct platform_device *pdev)
.remove = mt7621_gsw_remove,
.driver = {
.name = "mt7621-gsw",
-   .owner = THIS_MODULE,
.of_match_table = mediatek_gsw_match,
},
 };
diff --git a/drivers/staging/mt7621-eth/mtk_eth_soc.c 
b/drivers/staging/mt7621-eth/mtk_eth_soc.c
index 7135075..363d3c9 100644
--- a/drivers/staging/mt7621-eth/mtk_eth_soc.c
+++ b/drivers/staging/mt7621-eth/mtk_eth_soc.c
@@ -2167,7 +2167,6 @@ static int mtk_remove(struct platform_device *pdev)
.remove = mtk_remove,
.driver = {
.name = "mtk_soc_eth",
-   .owner = THIS_MODULE,
.of_match_table = of_mtk_match,
},
 };
-- 
1.7.12.4



[PATCH] sparc: vdso: clean-up vdso Makefile

2018-09-11 Thread Masahiro Yamada
arch/sparc/vdso/Makefile is a replica of arch/x86/entry/vdso/Makefile.

Clean-up the Makefile in the same way as I did for x86:

 - Remove unnecessary export
 - Put the generated linker script to $(obj)/ instead of $(src)/
 - Simplify cmd_vdso2c

The corresponding x86 commits are:

 - 61615faf0a89 ("x86/build/vdso: Remove unnecessary export in Makefile")
 - 1742ed2088cc ("x86/build/vdso: Put generated linker scripts to $(obj)/")
 - c5fcdbf15523 ("x86/build/vdso: Simplify 'cmd_vdso2c'")

Signed-off-by: Masahiro Yamada 
---

 arch/sparc/vdso/Makefile | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/sparc/vdso/Makefile b/arch/sparc/vdso/Makefile
index dd0b5a9..dc85570 100644
--- a/arch/sparc/vdso/Makefile
+++ b/arch/sparc/vdso/Makefile
@@ -31,23 +31,21 @@ obj-y += $(vdso_img_objs)
 targets += $(vdso_img_cfiles)
 targets += $(vdso_img_sodbg) $(vdso_img-y:%=vdso%.so)
 
-export CPPFLAGS_vdso.lds += -P -C
+CPPFLAGS_vdso.lds += -P -C
 
 VDSO_LDFLAGS_vdso.lds = -m64 -Wl,-soname=linux-vdso.so.1 \
-Wl,--no-undefined \
-Wl,-z,max-page-size=8192 -Wl,-z,common-page-size=8192 \
$(DISABLE_LTO)
 
-$(obj)/vdso64.so.dbg: $(src)/vdso.lds $(vobjs) FORCE
+$(obj)/vdso64.so.dbg: $(obj)/vdso.lds $(vobjs) FORCE
$(call if_changed,vdso)
 
 HOST_EXTRACFLAGS += -I$(srctree)/tools/include
 hostprogs-y+= vdso2c
 
 quiet_cmd_vdso2c = VDSO2C  $@
-define cmd_vdso2c
-   $(obj)/vdso2c $< $(<:%.dbg=%) $@
-endef
+  cmd_vdso2c = $(obj)/vdso2c $< $(<:%.dbg=%) $@
 
 $(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
$(call if_changed,vdso2c)
-- 
2.7.4



[PATCH] pstore: fix incorrect persistent ram buffer mapping

2018-09-11 Thread Bin Yang
persistent_ram_vmap() returns the page start vaddr.
persistent_ram_iomap() supports non-page-aligned mapping.

persistent_ram_buffer_map() always adds offset-in-page to the vaddr
returned from these two functions, which causes incorrect mapping of
non-page-aligned persistent ram buffer.

Signed-off-by: Bin Yang 
---
 fs/pstore/ram_core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index 951a14e..7c05fdd 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -429,7 +429,7 @@ static void *persistent_ram_vmap(phys_addr_t start, size_t 
size,
vaddr = vmap(pages, page_count, VM_MAP, prot);
kfree(pages);
 
-   return vaddr;
+   return vaddr + offset_in_page(start);
 }
 
 static void *persistent_ram_iomap(phys_addr_t start, size_t size,
@@ -468,7 +468,7 @@ static int persistent_ram_buffer_map(phys_addr_t start, 
phys_addr_t size,
return -ENOMEM;
}
 
-   prz->buffer = prz->vaddr + offset_in_page(start);
+   prz->buffer = prz->vaddr;
prz->buffer_size = size - sizeof(struct persistent_ram_buffer);
 
return 0;
@@ -515,7 +515,7 @@ void persistent_ram_free(struct persistent_ram_zone *prz)
 
if (prz->vaddr) {
if (pfn_valid(prz->paddr >> PAGE_SHIFT)) {
-   vunmap(prz->vaddr);
+   vunmap(prz->vaddr - offset_in_page(prz->paddr));
} else {
iounmap(prz->vaddr);
release_mem_region(prz->paddr, prz->size);
-- 
2.7.4



Re: [PATCH v2 2/9] nios2: build .dtb files in dts directory

2018-09-11 Thread Ley Foon Tan
On Fri, 2018-09-07 at 13:09 -0500, Rob Herring wrote:
> On Thu, Sep 6, 2018 at 9:21 PM Ley Foon Tan 
> wrote:
> > 
> > 
> > On Wed, 2018-09-05 at 18:53 -0500, Rob Herring wrote:
> > > 
> > > Align nios2 with other architectures which build the dtb files in
> > > the
> > > same directory as the dts files. This is also in line with most
> > > other
> > > build targets which are located in the same directory as the
> > > source.
> > > This move will help enable the 'dtbs' target which builds all the
> > > dtbs
> > > regardless of kernel config.
> > > 
> > > This transition could break some scripts if they expect dtb files
> > > in
> > > the old location.
> > > 
> > > Cc: Ley Foon Tan 
> > > Cc: nios2-...@lists.rocketboards.org
> > > Signed-off-by: Rob Herring 
> > > ---
> > > Please ack so I can take the whole series via the DT tree.
> > > 
> > >  arch/nios2/Makefile  | 4 ++--
> > >  arch/nios2/boot/Makefile | 4 
> > >  arch/nios2/boot/dts/Makefile | 1 +
> > >  3 files changed, 3 insertions(+), 6 deletions(-)
> > >  create mode 100644 arch/nios2/boot/dts/Makefile
> > > 
> > > diff --git a/arch/nios2/Makefile b/arch/nios2/Makefile
> > > index 8673a79dca9c..50eece1c6adb 100644
> > > --- a/arch/nios2/Makefile
> > > +++ b/arch/nios2/Makefile
> > > @@ -59,10 +59,10 @@ archclean:
> > > $(Q)$(MAKE) $(clean)=$(nios2-boot)
> > > 
> > >  %.dtb: | scripts
> > > -   $(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@
> > > +   $(Q)$(MAKE) $(build)=$(nios2-boot)/dts $(nios2-
> > > boot)/dts/$@
> > > 
> > >  dtbs:
> > > -   $(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@
> > > +   $(Q)$(MAKE) $(build)=$(nios2-boot)/dts
> > > 
> > >  $(BOOT_TARGETS): vmlinux
> > > $(Q)$(MAKE) $(build)=$(nios2-boot) $(nios2-boot)/$@
> > > diff --git a/arch/nios2/boot/Makefile b/arch/nios2/boot/Makefile
> > > index 2ba23a679732..007586094dde 100644
> > > --- a/arch/nios2/boot/Makefile
> > > +++ b/arch/nios2/boot/Makefile
> > > @@ -47,10 +47,6 @@ obj-$(CONFIG_NIOS2_DTB_SOURCE_BOOL) +=
> > > linked_dtb.o
> > > 
> > >  targets += $(dtb-y)
> > > 
> > > -# Rule to build device tree blobs with make command
> > > -$(obj)/%.dtb: $(src)/dts/%.dts FORCE
> > > -   $(call if_changed_dep,dtc)
> > > -
> > >  $(obj)/dtbs: $(addprefix $(obj)/, $(dtb-y))
> > > 
> > >  install:
> > > diff --git a/arch/nios2/boot/dts/Makefile
> > > b/arch/nios2/boot/dts/Makefile
> > > new file mode 100644
> > > index ..f66554cd5c45
> > > --- /dev/null
> > > +++ b/arch/nios2/boot/dts/Makefile
> > > @@ -0,0 +1 @@
> > > +# SPDX-License-Identifier: GPL-2.0
> > > --
> > > 2.17.1
> > > 
> > Hi Rob
> > 
> > I have synced your all-dtbs branch from here: 
> > https://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git/log/?h=all-dtbs
> > 
> > It shows error when compile kernel image and also when "make
> > dtbs_install".
> Can you fetch the branch again and try it. I fixed a few dependency
> issues.
> 
> > 
> > make dtbs_install
> > make[1]: *** No rule to make target
> > 'arch/nios2/boot/dts/arch/nios2/boot/dts/10m50_devboard.dtb',
> > needed by
> > 'arch/nios2/boot/dts/arch/nios2/boot/dts/10m50_devboard.dtb.S'.  Stop.
> What is the value of CONFIG_NIOS2_DTB_SOURCE? As patch 3 notes, it
> now
> should not have any path.
> 
> If that's a problem, I could take the basename to strip the path, but
> then sub directories wouldn't work either.
> 
> BTW, next up, I want to consolidate the config variables for built-in 
> dtbs.
> 

Hi Rob

CONFIG_NIOS2_DTB_SOURCE had the relative path to the dts file,
arch/nios2/boot/dts/arch/nios2/boot/dts/10m50_devboard.dts

Changing it to CONFIG_NIOS2_DTB_SOURCE=10m50_devboard.dtb.S fixes the
dtb build issue.


Regards
Ley Foon


Re: [PATCH v2 2/3] x86/mm/KASLR: Calculate the actual size of vmemmap region

2018-09-11 Thread Baoquan He
On 09/11/18 at 08:08pm, Baoquan He wrote:
> On 09/11/18 at 11:28am, Ingo Molnar wrote:
> > Yeah, so proper context is still missing, this paragraph appears to assume 
> > from the reader a 
> > whole lot of prior knowledge, and this is one of the top comments in 
> > kaslr.c so there's nowhere 
> > else to go read about the background.
> > 
> > For example what is the range of randomization of each region? Assuming the 
> > static, 
> > non-randomized description in Documentation/x86/x86_64/mm.txt is correct, 
> > in what way does 
> > KASLR modify that layout?

Re-reading this paragraph, I found I missed describing the range for each
memory region and the way KASLR modifies the layout.

> > 
> > All of this is very opaque and not explained very well anywhere that I 
> > could find. We need to 
> > generate a proper description ASAP.
> 
> OK, let me try to give an context with my understanding. And copy the
> static layout of memory regions at below for reference.
> 
Here, Documentation/x86/x86_64/mm.txt is correct, and it's the
guideline for us when manipulating the layout of kernel memory regions.
Originally the starting address of each region is aligned to 512 GB,
so that each region starts at a PGD entry boundary in 4-level
page mapping. Since we are rich enough to have 120 TB of virtual address
space, they are actually aligned at 1 TB. The randomness comes mainly
from three parts:

1) The direct mapping region for physical memory. 64 TB are reserved to
cover the maximum supported physical memory. However, most systems have
much less RAM than 64 TB, often much less than 1 TB. We can take the
surplus space and add it to the randomization. This is
often the biggest part.

2) The holes between memory regions, even though they are only 1 TB each.

3) The KASAN region takes up 16 TB, but it is unused when KASLR is
enabled. This is another big part.

As you can see, among these three memory regions, the physical memory
mapping region has a variable size according to the installed system RAM,
while the remaining two memory regions have fixed sizes: vmalloc is 32
TB and vmemmap is 1 TB.

With this surplus address space, and by changing the starting address
of each memory region to be PUD-aligned, namely 1 GB aligned, we get
thousands of candidate positions at which to locate the three memory
regions.

The above is for 4-level paging mode. As for 5-level paging, since the
virtual address space is much bigger, Kirill made the starting addresses
of the regions P4D-aligned, namely 512 GB.

When randomizing the layout, the order of the regions is kept: the
physical memory mapping region is handled first, then vmalloc and
vmemmap. Let's take the physical memory mapping region as an example.
We limit its starting address to the first 1/3 of the whole available
virtual address space, which runs from 0xffff880000000000 to
0xfffffe0000000000, namely from the original starting address of the
physical memory mapping region to the starting address of the
cpu_entry_area mapping region. Once a random address is chosen for the
physical memory mapping, we jump over the region and add 1 GB to begin
handling the next region with the remaining available space.


~~
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
136T - 200T = 64TB
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
200T - 201T = 1TB
ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
201T - 233T = 32TB
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
233T - 234T = 1TB
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
234T - 235T = 1TB
... unused hole ...
ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
236T - 252T = 16TB
... unused hole ...
vaddr_end for KASLR
fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
254T - 254T+512G

Thanks
Baoquan


Re: [PATCH] perf test: Add watchpoint test

2018-09-11 Thread Ravi Bangoria


> While testing, I got curious, as a 'perf test' user, why one of the
> tests had the "Skip" result:
> 
> [root@seventh ~]# perf test watchpoint
> 22: Watchpoint:
> 22.1: Read Only Watchpoint: Skip
> 22.2: Write Only Watchpoint   : Ok
> 22.3: Read / Write Watchpoint : Ok
> 22.4: Modify Watchpoint   : Ok
> [root@seventh ~]#
> 
> I tried with 'perf test -v watchpoint' but that didn't help, perhaps you
> could add some message after the "Skip" telling why it skipped that
> test? I.e. hardware doesn't have that capability, kernel driver not yet
> supporting that, something else?

Sure, will add a message:

pr_debug("Hardware does not support read only watchpoints.");

Ravi



Re: [PATCH v2 1/4] arm64: dts: rockchip: Split out common nodes for Rock960 based boards

2018-09-11 Thread Manivannan Sadhasivam
Hi Ezequiel,

On Tue, Sep 11, 2018 at 04:40:29PM -0300, Ezequiel Garcia wrote:
> On Tue, 2018-09-11 at 08:00 +0530, Manivannan Sadhasivam wrote:
> > Since the same family members of Rock960 boards (Rock960 and Ficus)
> > share the same configuration, split out the common nodes into a common
> > dtsi file for reducing code duplication. The board specific nodes for
> > Ficus boards are then placed in corresponding board DTS file.
> > 
> 
> I think it should be possible to move the common USB nodes to the dtsi file,
> and keep the board-specific (phy-supply property) in the dts files:
> 
> &u2phy0_host {
> phy-supply = <&vcc5v0_host>;
> };
> 
> &u2phy1_host {
> phy-supply = <&vcc5v0_host>;
> };
> 

We can do that, but my intention was to entirely partition the nodes
that are not common, so that it is less confusing when someone
looks at it (please correct me if I'm wrong).

> Also, I believe it would be good to have some more details
> in this commit log. The information on the cover letter is great,
> so I'd just repeat some of that here.
> 

Sure, will add it in next iteration.

> Other than that, for the ficus bits:
> 
> Reviewed-by: Ezequiel Garcia 
> 

Thanks a lot for the review!

Regards,
Mani

> Thanks very much for this work!
> Ezequiel
> 
> 
> > Signed-off-by: Manivannan Sadhasivam 
> > ---
> >  arch/arm64/boot/dts/rockchip/rk3399-ficus.dts | 429 +
> >  .../boot/dts/rockchip/rk3399-rock960.dtsi | 439 ++
> >  2 files changed, 440 insertions(+), 428 deletions(-)
> >  create mode 100644 arch/arm64/boot/dts/rockchip/rk3399-rock960.dtsi
> > 
> > diff --git a/arch/arm64/boot/dts/rockchip/rk3399-ficus.dts 
> > b/arch/arm64/boot/dts/rockchip/rk3399-ficus.dts
> > index 8978d924eb83..7f6ec37d5a69 100644
> > --- a/arch/arm64/boot/dts/rockchip/rk3399-ficus.dts
> > +++ b/arch/arm64/boot/dts/rockchip/rk3399-ficus.dts
> > @@ -7,8 +7,7 @@
> >   */
> >  
> >  /dts-v1/;
> > -#include "rk3399.dtsi"
> > -#include "rk3399-opp.dtsi"
> > +#include "rk3399-rock960.dtsi"
> >  
> >  / {
> > model = "96boards RK3399 Ficus";
> > @@ -25,31 +24,6 @@
> > #clock-cells = <0>;
> > };
> >  
> > -   vcc1v8_s0: vcc1v8-s0 {
> > -   compatible = "regulator-fixed";
> > -   regulator-name = "vcc1v8_s0";
> > -   regulator-min-microvolt = <180>;
> > -   regulator-max-microvolt = <180>;
> > -   regulator-always-on;
> > -   };
> > -
> > -   vcc_sys: vcc-sys {
> > -   compatible = "regulator-fixed";
> > -   regulator-name = "vcc_sys";
> > -   regulator-min-microvolt = <500>;
> > -   regulator-max-microvolt = <500>;
> > -   regulator-always-on;
> > -   };
> > -
> > -   vcc3v3_sys: vcc3v3-sys {
> > -   compatible = "regulator-fixed";
> > -   regulator-name = "vcc3v3_sys";
> > -   regulator-min-microvolt = <330>;
> > -   regulator-max-microvolt = <330>;
> > -   regulator-always-on;
> > -   vin-supply = <&vcc_sys>;
> > -   };
> > -
> > vcc3v3_pcie: vcc3v3-pcie-regulator {
> > compatible = "regulator-fixed";
> > enable-active-high;
> > @@ -75,46 +49,6 @@
> > regulator-always-on;
> > vin-supply = <&vcc_sys>;
> > };
> > -
> > -   vdd_log: vdd-log {
> > -   compatible = "pwm-regulator";
> > -   pwms = <&pwm2 0 25000 0>;
> > -   regulator-name = "vdd_log";
> > -   regulator-min-microvolt = <80>;
> > -   regulator-max-microvolt = <140>;
> > -   regulator-always-on;
> > -   regulator-boot-on;
> > -   vin-supply = <&vcc_sys>;
> > -   };
> > -
> > -};
> > -
> > -&cpu_l0 {
> > -   cpu-supply = <&vdd_cpu_l>;
> > -};
> > -
> > -&cpu_l1 {
> > -   cpu-supply = <&vdd_cpu_l>;
> > -};
> > -
> > -&cpu_l2 {
> > -   cpu-supply = <&vdd_cpu_l>;
> > -};
> > -
> > -&cpu_l3 {
> > -   cpu-supply = <&vdd_cpu_l>;
> > -};
> > -
> > -&cpu_b0 {
> > -   cpu-supply = <&vdd_cpu_b>;
> > -};
> > -
> > -&cpu_b1 {
> > -   cpu-supply = <&vdd_cpu_b>;
> > -};
> > -
> > -&emmc_phy {
> > -   status = "okay";
> >  };
> >  
> >  &gmac {
> > @@ -133,263 +67,6 @@
> > status = "okay";
> >  };
> >  
> > -&hdmi {
> > -   ddc-i2c-bus = <&i2c3>;
> > -   pinctrl-names = "default";
> > -   pinctrl-0 = <&hdmi_cec>;
> > -   status = "okay";
> > -};
> > -
> > -&i2c0 {
> > -   clock-frequency = <40>;
> > -   i2c-scl-rising-time-ns = <168>;
> > -   i2c-scl-falling-time-ns = <4>;
> > -   status = "okay";
> > -
> > -   vdd_cpu_b: regulator@40 {
> > -   compatible = "silergy,syr827";
> > -   reg = <0x40>;
> > -   fcs,suspend-voltage-selector = <1>;
> > -   regulator-name = "vdd_cpu_b";
> > -   regulator-min-microvolt = <712500>;
> > -   regulator-max-microvolt = <150>;
> > -   regulator-ramp-delay = <1000>;
> > -   regulator-always-on;
> > -   regulator-boot-on;
> > -

Re: [PATCH] perf test: Add watchpoint test

2018-09-11 Thread Ravi Bangoria



On 09/10/2018 11:01 PM, Arnaldo Carvalho de Melo wrote:
> Em Mon, Sep 10, 2018 at 11:18:30AM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Mon, Sep 10, 2018 at 10:47:54AM -0300, Arnaldo Carvalho de Melo escreveu:
>>> Em Mon, Sep 10, 2018 at 12:31:54PM +0200, Jiri Olsa escreveu:
 On Mon, Sep 10, 2018 at 03:28:11PM +0530, Ravi Bangoria wrote:
> Ex on powerpc:
>   $ sudo ./perf test 22
>   22: Watchpoint:
>   22.1: Read Only Watchpoint: Ok
>   22.2: Write Only Watchpoint   : Ok
>   22.3: Read / Write Watchpoint : Ok
>   22.4: Modify Watchpoint   : Ok
> 
 cool, thanks!
> 
 Acked-by: Jiri Olsa 
> 
>>> Thanks, applied.
> 
> Oops, fails when cross-building it to mips, I'll try to fix after lunch:


Sorry for the slightly late reply. Will send v2 with the fix.

Thanks
Ravi

> 
>   18   109.48 debian:experimental   : Ok   gcc (Debian 8.2.0-4) 8.2.0
>   19   42.66 debian:experimental-x-arm64   : Ok   aarch64-linux-gnu-gcc 
> (Debian 8.1.0-12) 8.1.0
>   20   22.33 debian:experimental-x-mips    : FAIL mips-linux-gnu-gcc (Debian 
> 8.1.0-12) 8.1.0
>   21   20.05 debian:experimental-x-mips64  : FAIL mips64-linux-gnuabi64-gcc 
> (Debian 8.1.0-12) 8.1.0
>   22   22.85 debian:experimental-x-mipsel  : FAIL mipsel-linux-gnu-gcc 
> (Debian 8.1.0-12) 8.1.0
> 
>   CC   /tmp/build/perf/tests/bp_account.o
>   CC   /tmp/build/perf/tests/wp.o
> tests/wp.c:5:10: fatal error: arch-tests.h: No such file or directory
>  #include "arch-tests.h"
>   ^~
> compilation terminated.
> mv: cannot stat '/tmp/build/perf/tests/.wp.o.tmp': No such file or directory
> make[4]: *** [/git/linux/tools/build/Makefile.build:97: 
> /tmp/build/perf/tests/wp.o] Error 1
> make[4]: *** Waiting for unfinished jobs
>   CC   /tmp/build/perf/util/record.o
>   CC   /tmp/build/perf/util/srcline.o
> make[3]: *** [/git/linux/tools/build/Makefile.build:139: tests] Error 2
> make[2]: *** [Makefile.perf:507: /tmp/build/perf/perf-in.o] Error 2
> make[2]: *** Waiting for unfinished jobs
> 



Re: [PATCH v2 2/2] dmaengine: uniphier-mdmac: add UniPhier MIO DMAC driver

2018-09-11 Thread Masahiro Yamada
Hi Vinod,


2018-09-11 16:00 GMT+09:00 Vinod :
> On 24-08-18, 10:41, Masahiro Yamada wrote:
>
>> +/* mc->vc.lock must be held by caller */
>> +static u32 __uniphier_mdmac_get_residue(struct uniphier_mdmac_desc *md)
>> +{
>> + u32 residue = 0;
>> + int i;
>> +
>> + for (i = md->sg_cur; i < md->sg_len; i++)
>> + residue += sg_dma_len(&md->sgl[i]);
>
> so if the descriptor is submitted to hardware, we return the descriptor
> length, which is not correct.
>
> Two cases are required to be handled:
> 1. Descriptor is in queue (IMO above logic is fine for that, but it can
> be calculated at descriptor submit and looked up here)

Where do you want it to be calculated?

This hardware provides only simple registers (address and size)
for one-shot transfer instead of descriptors.

So, I used sgl as-is because I did not see a good reason
to transform sgl to another data structure.




> 2. Descriptor is running (interesting case), you need to read current
> register and offset that from descriptor length and return


OK, I will read out the register value
to retrieve the residue from the in-flight transfer.


>> +static struct dma_async_tx_descriptor *uniphier_mdmac_prep_slave_sg(
>> + struct dma_chan *chan,
>> + struct scatterlist *sgl,
>> + unsigned int sg_len,
>> + enum dma_transfer_direction direction,
>> + unsigned long flags, void *context)
>> +{
>> + struct virt_dma_chan *vc = to_virt_chan(chan);
>> + struct uniphier_mdmac_desc *md;
>> +
>> + if (!is_slave_direction(direction))
>> + return NULL;
>> +
>> + md = kzalloc(sizeof(*md), GFP_KERNEL);
>
> _prep calls can be invoked from atomic context, so this should be
> GFP_NOWAIT, see Documentation/driver-api/dmaengine/provider.rst

Will fix.


>> + if (!md)
>> + return NULL;
>> +
>> + md->sgl = sgl;
>> + md->sg_len = sg_len;
>> + md->dir = direction;
>> +
>> + return vchan_tx_prep(vc, &md->vd, flags);
>
> this seems missing stuff. Where do you do register calculation for the
> descriptor and where is slave_config here, how do you know where to
> send/receive data form/to (peripheral)


This dmac is really simple and inflexible.

The peripheral address to send/receive data from/to is hard-wired.
cfg->{src_addr,dst_addr} is not configurable.

Look at __uniphier_mdmac_handle().
'dest_addr' and 'src_addr' must be set to 0 for the peripheral.




>> +static enum dma_status uniphier_mdmac_tx_status(struct dma_chan *chan,
>> + dma_cookie_t cookie,
>> + struct dma_tx_state *txstate)
>> +{
>> + struct virt_dma_chan *vc;
>> + struct virt_dma_desc *vd;
>> + struct uniphier_mdmac_chan *mc;
>> + struct uniphier_mdmac_desc *md = NULL;
>> + enum dma_status stat;
>> + unsigned long flags;
>> +
>> + stat = dma_cookie_status(chan, cookie, txstate);
>> + if (stat == DMA_COMPLETE)
>> + return stat;
>> +
>> + vc = to_virt_chan(chan);
>> +
>> + spin_lock_irqsave(&vc->lock, flags);
>> +
>> + mc = to_uniphier_mdmac_chan(vc);
>> +
>> + if (mc->md && mc->md->vd.tx.cookie == cookie)
>> + md = mc->md;
>> +
>> + if (!md) {
>> + vd = vchan_find_desc(vc, cookie);
>> + if (vd)
>> + md = to_uniphier_mdmac_desc(vd);
>> + }
>> +
>> + if (md)
>> + txstate->residue = __uniphier_mdmac_get_residue(md);
>
> txstate can be NULL and should be checked...

Will fix.


>> +static int uniphier_mdmac_probe(struct platform_device *pdev)
>> +{
>> + struct device *dev = &pdev->dev;
>> + struct uniphier_mdmac_device *mdev;
>> + struct dma_device *ddev;
>> + struct resource *res;
>> + int nr_chans, ret, i;
>> +
>> + nr_chans = platform_irq_count(pdev);
>> + if (nr_chans < 0)
>> + return nr_chans;
>> +
>> + ret = dma_set_mask(dev, DMA_BIT_MASK(32));
>> + if (ret)
>> + return ret;
>> +
>> + mdev = devm_kzalloc(dev, struct_size(mdev, channels, nr_chans),
>> + GFP_KERNEL);
>
> kcalloc variant?


No.

I allocate here
sizeof(*mdev) + nr_chans * sizeof(struct uniphier_mdmac_chan)

kcalloc does not cater to it.

You should check the struct_size() helper macro.




>> + if (!mdev)
>> + return -ENOMEM;
>> +
>> + res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
>> + mdev->reg_base = devm_ioremap_resource(dev, res);
>> + if (IS_ERR(mdev->reg_base))
>> + return PTR_ERR(mdev->reg_base);
>> +
>> + mdev->clk = devm_clk_get(dev, NULL);
>> + if (IS_ERR(mdev->clk)) {
>> + dev_err(dev, "failed to get clock\n");
>> + return PTR_ERR(mdev->clk);
>> + }
>> +
>> + ret = clk_prepare_enable(m

[PATCH -next] staging: mt7621-pci: Use PTR_ERR_OR_ZERO in mt7621_pcie_parse_dt()

2018-09-11 Thread YueHaibing
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Signed-off-by: YueHaibing 
---
 drivers/staging/mt7621-pci/pci-mt7621.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/staging/mt7621-pci/pci-mt7621.c 
b/drivers/staging/mt7621-pci/pci-mt7621.c
index ba1f117..d2cb910 100644
--- a/drivers/staging/mt7621-pci/pci-mt7621.c
+++ b/drivers/staging/mt7621-pci/pci-mt7621.c
@@ -396,10 +396,7 @@ static int mt7621_pcie_parse_dt(struct mt7621_pcie *pcie)
}
 
	pcie->base = devm_ioremap_resource(dev, &regs);
-   if (IS_ERR(pcie->base))
-   return PTR_ERR(pcie->base);
-
-   return 0;
+   return PTR_ERR_OR_ZERO(pcie->base);
 }
 
 static int mt7621_pcie_request_resources(struct mt7621_pcie *pcie,





[PATCH] drivers: pci: remove set but unused variable

2018-09-11 Thread Joshua Abraham
This patch removes a set but unused variable in quirks.c.

Fixes warning:
variable ‘mmio_sys_info’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Joshua Abraham 
---
 drivers/pci/quirks.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index ef7143a274e0..690a3b71aa1f 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4993,7 +4993,6 @@ static void quirk_switchtec_ntb_dma_alias(struct pci_dev 
*pdev)
void __iomem *mmio;
struct ntb_info_regs __iomem *mmio_ntb;
struct ntb_ctrl_regs __iomem *mmio_ctrl;
-   struct sys_info_regs __iomem *mmio_sys_info;
u64 partition_map;
u8 partition;
int pp;
@@ -5014,7 +5013,6 @@ static void quirk_switchtec_ntb_dma_alias(struct pci_dev 
*pdev)
 
mmio_ntb = mmio + SWITCHTEC_GAS_NTB_OFFSET;
mmio_ctrl = (void __iomem *) mmio_ntb + SWITCHTEC_NTB_REG_CTRL_OFFSET;
-   mmio_sys_info = mmio + SWITCHTEC_GAS_SYS_INFO_OFFSET;
 
partition = ioread8(&mmio_ntb->partition_id);
 
-- 
2.17.1



Re: [WTF?] extremely old dead code

2018-09-11 Thread Linus Torvalds
On Mon, Sep 10, 2018 at 1:55 PM Al Viro  wrote:
>
> Hadn't that sucker been dead code since 0.98.2?  What am I missing here?
> Note that this thing had quite a few functionality changes over those
> years; had they even been tested?

Looks about right to me. The only point that actually acts on FIONBIO
is the fs/ioctl.c code.

Impressively, the dead tty code looks perfectly correct to me too,
despite not ever being triggered.

  Linus


Re: [RFC v9 PATCH 2/4] mm: mmap: zap pages with read mmap_sem in munmap

2018-09-11 Thread Matthew Wilcox
On Tue, Sep 11, 2018 at 04:35:03PM -0700, Yang Shi wrote:
> On 9/11/18 2:16 PM, Matthew Wilcox wrote:
> > On Wed, Sep 12, 2018 at 04:58:11AM +0800, Yang Shi wrote:
> > >   mm/mmap.c | 97 
> > > +--
> > I really think you're going about this the wrong way by duplicating
> > vm_munmap().
> 
> If we don't duplicate vm_munmap() or do_munmap(), we need pass an extra
> parameter to them to tell when it is fine to downgrade write lock or if the
> lock has been acquired outside it (i.e. in mmap()/mremap()), right? But,
> vm_munmap() or do_munmap() is called not only by mmap-related, but also some
> other places, like arch-specific places, which don't need downgrade write
> lock or are not safe to do so.
> 
> Actually, I did this way in the v1 patches, but it got pushed back by tglx
> who suggested duplicate the code so that the change could be done in mm only
> without touching other files, i.e. arch-specific stuff. I didn't have strong
> argument to convince him.

With my patch, there is nothing to change in arch-specific code.
Here it is again ...

diff --git a/mm/mmap.c b/mm/mmap.c
index de699523c0b7..06dc31d1da8c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2798,11 +2798,11 @@ int split_vma(struct mm_struct *mm, struct 
vm_area_struct *vma,
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge 
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
- struct list_head *uf)
+static int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+ struct list_head *uf, bool downgrade)
 {
unsigned long end;
-   struct vm_area_struct *vma, *prev, *last;
+   struct vm_area_struct *vma, *prev, *last, *tmp;
 
if ((offset_in_page(start)) || start > TASK_SIZE || len > 
TASK_SIZE-start)
return -EINVAL;
@@ -2816,7 +2816,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len,
if (!vma)
return 0;
prev = vma->vm_prev;
-   /* we have  start < vma->vm_end  */
+   /* we have start < vma->vm_end  */
 
/* if it doesn't overlap, we have nothing.. */
end = start + len;
@@ -2873,18 +2873,22 @@ int do_munmap(struct mm_struct *mm, unsigned long 
start, size_t len,
 
/*
 * unlock any mlock()ed ranges before detaching vmas
+* and check to see if there's any reason we might have to hold
+* the mmap_sem write-locked while unmapping regions.
 */
-   if (mm->locked_vm) {
-   struct vm_area_struct *tmp = vma;
-   while (tmp && tmp->vm_start < end) {
-   if (tmp->vm_flags & VM_LOCKED) {
-   mm->locked_vm -= vma_pages(tmp);
-   munlock_vma_pages_all(tmp);
-   }
-   tmp = tmp->vm_next;
+   for (tmp = vma; tmp && tmp->vm_start < end; tmp = tmp->vm_next) {
+   if (tmp->vm_flags & VM_LOCKED) {
+   mm->locked_vm -= vma_pages(tmp);
+   munlock_vma_pages_all(tmp);
}
+   if (tmp->vm_file &&
+   has_uprobes(tmp, tmp->vm_start, tmp->vm_end))
+   downgrade = false;
}
 
+   if (downgrade)
+   downgrade_write(&mm->mmap_sem);
+
/*
 * Remove the vma's, and unmap the actual pages
 */
@@ -2896,7 +2900,13 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len,
/* Fix up all other VM information */
remove_vma_list(mm, vma);
 
-   return 0;
+   return downgrade ? 1 : 0;
+}
+
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+   struct list_head *uf)
+{
+   return __do_munmap(mm, start, len, uf, false);
 }
 
 int vm_munmap(unsigned long start, size_t len)
@@ -2905,11 +2915,12 @@ int vm_munmap(unsigned long start, size_t len)
struct mm_struct *mm = current->mm;
LIST_HEAD(uf);
 
-   if (down_write_killable(&mm->mmap_sem))
-   return -EINTR;
-
-   ret = do_munmap(mm, start, len, &uf);
-   up_write(&mm->mmap_sem);
+   down_write(&mm->mmap_sem);
+   ret = __do_munmap(mm, start, len, &uf, true);
+   if (ret == 1)
+   up_read(&mm->mmap_sem);
+   else
+   up_write(&mm->mmap_sem);
userfaultfd_unmap_complete(mm, &uf);
return ret;
 }

Anybody calling do_munmap() will not get the lock dropped.

> And, Michal prefers have VM_HUGETLB and VM_PFNMAP handled separately for
> safe and bisectable sake, which needs call the regular do_munmap().

That can be introduced and then taken out ... indeed, you can split this into
many patches, starting with this:

+   if (tmp->vm_file)
+   downgrade = false;

to only allow this optimisation for anonymous mappings at first.

> In addition to this, 

Re: [PATCH 06/11] dts: arm: imx7{d,s}: Update coresight binding for hardware ports

2018-09-11 Thread Shawn Guo
On Tue, Sep 11, 2018 at 11:17:07AM +0100, Suzuki K Poulose wrote:
> Switch to the updated coresight bindings.
> 
> Cc: Shawn Guo 
> Cc: Sascha Hauer 
> Cc: Pengutronix Kernel Team 
> Cc: Fabio Estevam 
> Cc: Mathieu Poirier 
> Signed-off-by: Suzuki K Poulose 

As per the convention we use for subject prefix, I suggest you use

  'ARM: dts: imx7: ...'

Shawn

> ---
>  arch/arm/boot/dts/imx7d.dtsi | 11 ---
>  arch/arm/boot/dts/imx7s.dtsi | 78 
> ++--
>  2 files changed, 53 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/arm/boot/dts/imx7d.dtsi b/arch/arm/boot/dts/imx7d.dtsi
> index 7cbc2ff..4ced17c 100644
> --- a/arch/arm/boot/dts/imx7d.dtsi
> +++ b/arch/arm/boot/dts/imx7d.dtsi
> @@ -63,9 +63,11 @@
>   clocks = <&clks IMX7D_MAIN_AXI_ROOT_CLK>;
>   clock-names = "apb_pclk";
>  
> - port {
> - etm1_out_port: endpoint {
> - remote-endpoint = <&ca_funnel_in_port1>;
> + out-ports {
> + port {
> + etm1_out_port: endpoint {
> + remote-endpoint = 
> <&ca_funnel_in_port1>;
> + };
>   };
>   };
>   };
> @@ -148,11 +150,10 @@
>   };
>  };
>  
> -&ca_funnel_ports {
> +&ca_funnel_in_ports {
>   port@1 {
>   reg = <1>;
>   ca_funnel_in_port1: endpoint {
> - slave-mode;
>   remote-endpoint = <&etm1_out_port>;
>   };
>   };
> diff --git a/arch/arm/boot/dts/imx7s.dtsi b/arch/arm/boot/dts/imx7s.dtsi
> index a052198..9176885 100644
> --- a/arch/arm/boot/dts/imx7s.dtsi
> +++ b/arch/arm/boot/dts/imx7s.dtsi
> @@ -106,7 +106,7 @@
>*/
>   compatible = "arm,coresight-replicator";
>  
> - ports {
> + out-ports {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   /* replicator output ports */
> @@ -123,12 +123,15 @@
>   remote-endpoint = <&etr_in_port>;
>   };
>   };
> + };
>  
> - /* replicator input port */
> - port@2 {
> + in-ports {
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + port@0 {
>   reg = <0>;
>   replicator_in_port0: endpoint {
> - slave-mode;
>   remote-endpoint = <&etf_out_port>;
>   };
>   };
> @@ -168,28 +171,31 @@
>   clocks = <&clks IMX7D_MAIN_AXI_ROOT_CLK>;
>   clock-names = "apb_pclk";
>  
> - ca_funnel_ports: ports {
> + ca_funnel_in_ports: in-ports {
>   #address-cells = <1>;
>   #size-cells = <0>;
>  
> - /* funnel input ports */
>   port@0 {
>   reg = <0>;
>   ca_funnel_in_port0: endpoint {
> - slave-mode;
>   remote-endpoint = 
> <&etm0_out_port>;
>   };
>   };
>  
> - /* funnel output port */
> - port@2 {
> + /* the other input ports are not connect to 
> anything */
> + };
> +
> + out-ports {
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + port@0 {
>   reg = <0>;
>   ca_funnel_out_port0: endpoint {
>   remote-endpoint = 
> <&hugo_funnel_in_port0>;
>   };
>   };
>  
> - /* the other input ports are not connect to 
> anything */
>   };
>   };
>  
> @@ -200,9 +206,11 @@
>   clocks = <&clks IMX7D_MAIN_AXI_ROOT_CLK>;
>   clock-names = "apb_pclk";
>  
> - port {
> - etm0_out_port: endpoint {
> - remote-endpoint = <&ca_funnel_in_port0>;
> + out-ports {
> + port {
> + etm

Re: [PATCH v2 3/6] drivers: qcom: rpmh: disallow active requests in solver mode

2018-09-11 Thread Lina Iyer

On Tue, Sep 11 2018 at 17:02 -0600, Matthias Kaehlcke wrote:

Hi Raju/Lina,

On Fri, Jul 27, 2018 at 03:34:46PM +0530, Raju P L S S S N wrote:

From: Lina Iyer 

Controllers may be in 'solver' state, where they could be in autonomous
mode executing low power modes for their hardware and as such are not
available for sending active votes. Device driver may notify RPMH API
that the controller is in solver mode and when in such mode, disallow
requests from platform drivers for state change using the RSC.

Signed-off-by: Lina Iyer 
Signed-off-by: Raju P.L.S.S.S.N 
---
 drivers/soc/qcom/rpmh-internal.h |  2 ++
 drivers/soc/qcom/rpmh.c  | 59 
 include/soc/qcom/rpmh.h  |  5 
 3 files changed, 66 insertions(+)

diff --git a/drivers/soc/qcom/rpmh-internal.h b/drivers/soc/qcom/rpmh-internal.h
index 4ff43bf..6cd2f78 100644
--- a/drivers/soc/qcom/rpmh-internal.h
+++ b/drivers/soc/qcom/rpmh-internal.h
@@ -72,12 +72,14 @@ struct rpmh_request {
  * @cache_lock: synchronize access to the cache data
  * @dirty: was the cache updated since flush
  * @batch_cache: Cache sleep and wake requests sent as batch
+ * @in_solver_mode: Controller is busy in solver mode
  */
 struct rpmh_ctrlr {
struct list_head cache;
spinlock_t cache_lock;
bool dirty;
struct list_head batch_cache;
+   bool in_solver_mode;
 };

 /**
diff --git a/drivers/soc/qcom/rpmh.c b/drivers/soc/qcom/rpmh.c
index 2382276..0d276fd 100644
--- a/drivers/soc/qcom/rpmh.c
+++ b/drivers/soc/qcom/rpmh.c
@@ -5,6 +5,7 @@

 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -75,6 +76,50 @@ static struct rpmh_ctrlr *get_rpmh_ctrlr(const struct device 
*dev)
return &drv->client;
 }

+static int check_ctrlr_state(struct rpmh_ctrlr *ctrlr, enum rpmh_state state)
+{
+   unsigned long flags;
+   int ret = 0;
+
+   /* Do not allow setting active votes when in solver mode */
+   spin_lock_irqsave(&ctrlr->cache_lock, flags);
+   if (ctrlr->in_solver_mode && state == RPMH_ACTIVE_ONLY_STATE)
+   ret = -EBUSY;
+   spin_unlock_irqrestore(&ctrlr->cache_lock, flags);
+
+   return ret;
+}
+
+/**
+ * rpmh_mode_solver_set: Indicate that the RSC controller hardware has
+ * been configured to be in solver mode
+ *
+ * @dev: the device making the request
+ * @enable: Boolean value indicating if the controller is in solver mode.
+ *
+ * When solver mode is enabled, passthru API will not be able to send wake
+ * votes, just awake and active votes.
+ */
+int rpmh_mode_solver_set(const struct device *dev, bool enable)
+{
+   struct rpmh_ctrlr *ctrlr = get_rpmh_ctrlr(dev);
+   unsigned long flags;
+
+   for (;;) {
+   spin_lock_irqsave(&ctrlr->cache_lock, flags);
+   if (rpmh_rsc_ctrlr_is_idle(ctrlr_to_drv(ctrlr))) {
+   ctrlr->in_solver_mode = enable;


As commented on '[v2,1/6] drivers: qcom: rpmh-rsc: return if the
controller is idle', this seems potentially
racy. _is_idle() could report the controller as idle, even though some
TCSes are in use (after _is_idle() visited them).

Additional locking may be needed or a comment if this situation should
never happen on a sane system (I don't know enough about RPMh and its
clients to judge if this is the case).

Hmm... I forgot that we call it from here. Maybe a lock would be helpful.

-- Lina


[PATCH v3] iio: proximity: Add driver support for ST's VL53L0X ToF ranging sensor.

2018-09-11 Thread Song Qiang
This driver was originally written by ST in 2016 as a misc input device
driver, and hasn't been maintained for a long time. I grabbed some code
from its API and reworked it into an IIO proximity device driver.
This version of the driver uses the i2c bus to talk to the sensor and
polls for measurement completion, so no irq line is needed.
This version of the driver supports only one-shot mode, and it can be
tested by reading from
/sys/bus/iio/devices/iio:deviceX/in_distance_raw

Signed-off-by: Song Qiang 
---
Changes in v2:
- Clean up the register table.
- Sort header files declarations.
- Replace some bit definitions with GENMASK() and BIT().
- Clean up some code and comments that's useless for now.
- Change the order of the definitions of some variables to reverse
  xmas tree order.
- Use do...while() rather than while() with a separate check.
- Replace pr_err() with dev_err().
- Remove device id declaration since we recommend using DT.
- Remove .owner = THIS_MODULE.
- Replace probe() with probe_new() hook.
- Remove IIO_BUFFER and IIO_TRIGGERED_BUFFER dependences.
- Change the driver module name to vl53l0x-i2c.
- Align all parameters with the open parenthesis when they are in the
  same function call.
- Replace iio_device_register() with devm_iio_device_register
  for better resource management.
- Remove the vl53l0x_remove() since it's not needed.
- Remove dev_set_drvdata() since it's already set above.

Changes in v3:
- Recover ST's copyright.
- Clean up indio_dev member in vl53l0x_data struct since it's
  useless now.
- Replace __le16_to_cpu() with le16_to_cpu().
- Remove the iio_device_{claim|release}_direct_mode() since it's
  only needed when we use buffered mode.
- Clean up some coding style problems.

 .../bindings/iio/proximity/vl53l0x.txt|  12 ++
 drivers/iio/proximity/Kconfig |  11 ++
 drivers/iio/proximity/Makefile|   2 +
 drivers/iio/proximity/vl53l0x-i2c.c   | 184 ++
 4 files changed, 209 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/iio/proximity/vl53l0x.txt
 create mode 100644 drivers/iio/proximity/vl53l0x-i2c.c

diff --git a/Documentation/devicetree/bindings/iio/proximity/vl53l0x.txt 
b/Documentation/devicetree/bindings/iio/proximity/vl53l0x.txt
new file mode 100644
index ..64b69442f08e
--- /dev/null
+++ b/Documentation/devicetree/bindings/iio/proximity/vl53l0x.txt
@@ -0,0 +1,12 @@
+ST's VL53L0X ToF ranging sensor
+
+Required properties:
+   - compatible: must be "st,vl53l0x-i2c"
+   - reg: i2c address where to find the device
+
+Example:
+
+vl53l0x@29 {
+   compatible = "st,vl53l0x-i2c";
+   reg = <0x29>;
+};
diff --git a/drivers/iio/proximity/Kconfig b/drivers/iio/proximity/Kconfig
index f726f9427602..5f421cbd37f3 100644
--- a/drivers/iio/proximity/Kconfig
+++ b/drivers/iio/proximity/Kconfig
@@ -79,4 +79,15 @@ config SRF08
  To compile this driver as a module, choose M here: the
  module will be called srf08.
 
+config VL53L0X_I2C
+   tristate "STMicroelectronics VL53L0X ToF ranger sensor (I2C)"
+   depends on I2C
+   help
+ Say Y here to build a driver for STMicroelectronics VL53L0X
+ ToF ranger sensors with i2c interface.
+ This driver can be used to measure the distance of objects.
+
+ To compile this driver as a module, choose M here: the
+ module will be called vl53l0x-i2c.
+
 endmenu
diff --git a/drivers/iio/proximity/Makefile b/drivers/iio/proximity/Makefile
index 4f4ed45e87ef..dedfb5bf3475 100644
--- a/drivers/iio/proximity/Makefile
+++ b/drivers/iio/proximity/Makefile
@@ -10,3 +10,5 @@ obj-$(CONFIG_RFD77402)+= rfd77402.o
 obj-$(CONFIG_SRF04)+= srf04.o
 obj-$(CONFIG_SRF08)+= srf08.o
 obj-$(CONFIG_SX9500)   += sx9500.o
+obj-$(CONFIG_VL53L0X_I2C)  += vl53l0x-i2c.o
+
diff --git a/drivers/iio/proximity/vl53l0x-i2c.c 
b/drivers/iio/proximity/vl53l0x-i2c.c
new file mode 100644
index ..0f7f124a38ed
--- /dev/null
+++ b/drivers/iio/proximity/vl53l0x-i2c.c
@@ -0,0 +1,184 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ *  Support for ST's VL53L0X FlightSense ToF Ranger Sensor on a i2c bus.
+ *
+ *  Copyright (C) 2016 STMicroelectronics Imaging Division.
+ *  Copyright (C) 2018 Song Qiang 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define VL53L0X_DRV_NAME   "vl53l0x-i2c"
+
+#define VL_REG_SYSRANGE_MODE_MASK  GENMASK(3, 0)
+#define VL_REG_SYSRANGE_START  0x00
+#define VL_REG_SYSRANGE_MODE_SINGLESHOT0x00
+#define VL_REG_SYSRANGE_MODE_START_STOPBIT(0)
+#define VL_REG_SYSRANGE_MODE_BACKTOBACKBIT(1)
+#

Re: [PATCH v2 1/6] drivers: qcom: rpmh-rsc: return if the controller is idle

2018-09-11 Thread Lina Iyer

On Tue, Sep 11 2018 at 16:39 -0600, Matthias Kaehlcke wrote:

Hi Raju/Lina,

On Fri, Jul 27, 2018 at 03:34:44PM +0530, Raju P L S S S N wrote:

From: Lina Iyer 

Allow the controller status to be queried. The controller is busy if it is
actively processing a request.

Signed-off-by: Lina Iyer 
Signed-off-by: Raju P.L.S.S.S.N 
---
Changes in v2:
 - Remove unnecessary EXPORT_SYMBOL
---
 drivers/soc/qcom/rpmh-internal.h |  1 +
 drivers/soc/qcom/rpmh-rsc.c  | 20 
 2 files changed, 21 insertions(+)

diff --git a/drivers/soc/qcom/rpmh-internal.h b/drivers/soc/qcom/rpmh-internal.h
index a76..4ff43bf 100644
--- a/drivers/soc/qcom/rpmh-internal.h
+++ b/drivers/soc/qcom/rpmh-internal.h
@@ -108,6 +108,7 @@ struct rsc_drv {
 int rpmh_rsc_write_ctrl_data(struct rsc_drv *drv,
 const struct tcs_request *msg);
 int rpmh_rsc_invalidate(struct rsc_drv *drv);
+bool rpmh_rsc_ctrlr_is_idle(struct rsc_drv *drv);

 void rpmh_tx_done(const struct tcs_request *msg, int r);

diff --git a/drivers/soc/qcom/rpmh-rsc.c b/drivers/soc/qcom/rpmh-rsc.c
index 33fe9f9..42d0041 100644
--- a/drivers/soc/qcom/rpmh-rsc.c
+++ b/drivers/soc/qcom/rpmh-rsc.c
@@ -496,6 +496,26 @@ static int tcs_ctrl_write(struct rsc_drv *drv, const 
struct tcs_request *msg)
 }

 /**
+ *  rpmh_rsc_ctrlr_is_idle: Check if any of the AMCs are busy.
+ *
+ *  @drv: The controller
+ *
+ *  Returns true if none of the TCSes are engaged in handling requests.
+ */
+bool rpmh_rsc_ctrlr_is_idle(struct rsc_drv *drv)
+{
+   int m;
+   struct tcs_group *tcs = get_tcs_of_type(drv, ACTIVE_TCS);
+
+   for (m = tcs->offset; m < tcs->offset + tcs->num_tcs; m++) {
+   if (!tcs_is_free(drv, m))
+   return false;
+   }
+
+   return true;
+}


This looks racy, tcs_write() could be running simultaneously and use
TCSes that were seen as free by _is_idle(). This could be fixed by
holding tcs->lock (assuming this doesn't cause lock ordering problems).
However even with this tcs_write() could run right after releasing the
lock, using TCSes and the caller of _is_idle() would consider the
controller to be idle.


We could run this without the lock, since we are only reading a status.
Generally, this function is called from the idle code of the last CPU
and no CPU or active TCS request should be in progress, but if it were,
then this function would let the caller know we are not ready to do
idle. If there were no requests that were running at that time we read
the registers, we would not be making one after, since we are already
in the idle code and no requests are made there.

I understand how it might appear racy; the context of the calling
function helps resolve that.

-- Lina



[Question] Are the trace APIs declared by "TRACE_EVENT(irq_handler_entry" allowed to be used in Ko?

2018-09-11 Thread Leizhen (ThunderTown)
After patch 7e066fb870fc ("tracepoints: add DECLARE_TRACE() and 
DEFINE_TRACE()"),
the trace APIs declared by "TRACE_EVENT(irq_handler_entry)" cannot be used
directly by a kernel module (.ko), because they are not explicitly exported
with EXPORT_TRACEPOINT_SYMBOL_GPL or EXPORT_TRACEPOINT_SYMBOL.

Did we miss it, or is it not recommended for use in kernel modules?
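If module use is intended, the usual remedy is an explicit export next to the tracepoint definition. A kernel-code sketch, not buildable on its own (the placement is an assumption based on where CREATE_TRACE_POINTS pulls in trace/events/irq.h):

```c
/* in the single .c file that defines the irq tracepoints: */
#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>

/* lets a module attach a probe via register_trace_irq_handler_entry() */
EXPORT_TRACEPOINT_SYMBOL_GPL(irq_handler_entry);
```

Several subsystems do export their tracepoints this way, so the absence of an export here may simply mean no in-tree module has needed it yet.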


-

commit 7e066fb870fcd1025ec3ba7bbde5d541094f4ce1
Author: Mathieu Desnoyers 
Date:   Fri Nov 14 17:47:47 2008 -0500

tracepoints: add DECLARE_TRACE() and DEFINE_TRACE()

Impact: API *CHANGE*. Must update all tracepoint users.

Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint
structure in a single spot for all the kernel. It helps reducing memory
consumption, especially when declaring a lot of tracepoints, e.g. for
kmalloc tracing.

*API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers for
tracepoint declarations rather than DEFINE_TRACE(). This is the sane way
to do it. The name previously used was misleading.

Updates scheduler instrumentation to follow this API change.


-- 
Thanks!
Best Regards



Re: Question: How to switch a process namespace by nsfs "device" and inode number directly?

2018-09-11 Thread Chengdong Li

Thank you, Andi!

Yes, that's one situation, and an important one I guess.

Another case is that a process running inside a container has exited but
the container is still alive. I think this is also a common case. The
potential solutions I am thinking of are the following:


- Using nsfs "device" and inum. This is why I am asking for your help. 
As we already have nsfs "device" and inum of each thread at least.


- If the current thread has exited, the parent thread and the leader
thread of that container are probably still alive. If we could get those
threads' pids, then we could use setns.



If the first item is not doable, I would like to try the second one.


Thanks,

Chengdong

On 2018/9/11 12:02 AM, Andi Kleen wrote:

On Mon, Sep 10, 2018 at 04:50:42PM +0800, Chengdong Li wrote:

Hi folks,

I am stuck by the lack of a way to switch a process namespace by
nsfs "device" and inode number in user-space, for example (mnt: 0xf000)

 From my best understanding, the normal way to do that is the setns system
call. But setns only accepts an fd that refers to an opened namespace, and
sometimes we can't get one.

For example:  After perf record, perf report couldn't work well once the
process that runs inside a container has exited, as the /proc/pid/ns doesn't
exist anymore after process exit.

The kernel namespace doesn't exist anymore at this point, so there is simply
no way to reconstruct it.

Perhaps would need some higher level side band data for perf, similar as what
is done for JITed code. Somehow the container run time needs to tell perf
where to find the code.

-Andi


Re: [PATCH i2c-next v6] i2c: aspeed: Handle master/slave combined irq events properly

2018-09-11 Thread Guenter Roeck
On Tue, Sep 11, 2018 at 04:58:44PM -0700, Jae Hyun Yoo wrote:
> On 9/11/2018 4:33 PM, Guenter Roeck wrote:
> >Looking into the patch, clearing the interrupt status at the end of an
> >interrupt handler is always suspicious and tends to result in race
> >conditions (because additional interrupts may have arrived while handling
> >the existing interrupts, or because interrupt handling itself may trigger
> >another interrupt). With that in mind, the following patch fixes the
> >problem for me.
> >
> >Guenter
> >
> >---
> >
> >diff --git a/drivers/i2c/busses/i2c-aspeed.c 
> >b/drivers/i2c/busses/i2c-aspeed.c
> >index c258c4d9a4c0..c488e6950b7c 100644
> >--- a/drivers/i2c/busses/i2c-aspeed.c
> >+++ b/drivers/i2c/busses/i2c-aspeed.c
> >@@ -552,6 +552,8 @@ static irqreturn_t aspeed_i2c_bus_irq(int irq, void 
> >*dev_id)
> > spin_lock(&bus->lock);
> > irq_received = readl(bus->base + ASPEED_I2C_INTR_STS_REG);
> >+/* Ack all interrupt bits. */
> >+writel(irq_received, bus->base + ASPEED_I2C_INTR_STS_REG);
> > irq_remaining = irq_received;
> >  #if IS_ENABLED(CONFIG_I2C_SLAVE)
> >@@ -584,8 +586,6 @@ static irqreturn_t aspeed_i2c_bus_irq(int irq, void 
> >*dev_id)
> > "irq handled != irq. expected 0x%08x, but was 0x%08x\n",
> > irq_received, irq_handled);
> >-/* Ack all interrupt bits. */
> >-writel(irq_received, bus->base + ASPEED_I2C_INTR_STS_REG);
> > spin_unlock(&bus->lock);
> > return irq_remaining ? IRQ_NONE : IRQ_HANDLED;
> >  }
> >
> 
> My intention in putting the code at the end of the interrupt handler was
> to reduce the possibility of combined irq calls, which is explained in this
> patch. But YES, I agree with you. It could make a potential race

Hmm, yes, but that doesn't explain why it would make sense to acknowledge
the interrupt late. The interrupt ack only means "I am going to handle these
interrupts". If additional interrupts arrive while the interrupt handler
is active, those will have to be acknowledged separately.

Sure, there is a risk that an interrupt arrives while the handler is
running, and that it is handled but not acknowledged. That can happen
with pretty much all interrupt handlers, and there are mitigations to
limit the impact (for example, read the interrupt status register in
a loop until no more interrupts are pending). But acknowledging
an interrupt that was possibly not handled is always a bad idea.

Thanks,
Guenter


Re: [PATCH v3] ARM: dts: imx6ul: Add DTS for ConnectCore 6UL SBC Pro

2018-09-11 Thread Shawn Guo
On Mon, Sep 10, 2018 at 11:37:52AM +0200, Alex Gonzalez wrote:
> The ConnectCore 6UL Single Board Computer (SBC) Pro contains the
> ConnectCore 6UL System-On-Module.
> 
> Its hardware specifications are:
> 
> * 256MB DDR3 memory
> * On module 256MB NAND flash
> * Dual 10/100 Ethernet
> * USB Host and USB OTG
> * Parallel RGB display header
> * LVDS display header
> * CSI camera
> * GPIO header
> * I2C, SPI, CAN headers
> * PCIe mini card and micro SIM slot
> * MicroSD external storage
> * On board 4GB eMMC flash
> * Audio headphone, line in/out, microphone lines
> 
> Signed-off-by: Alex Gonzalez 

Applied, thanks.


Re: [PATCH v9 6/6] ARM: dts: imx6: RIoTboard provide standby on power off option

2018-09-11 Thread Shawn Guo
On Thu, Aug 02, 2018 at 12:34:25PM +0200, Oleksij Rempel wrote:
> This board, as well as some other boards with i.MX6 and a PMIC, uses a
> "PMIC_STBY_REQ" line to notify the PMIC about a state change.
> The PMIC is programmed for a specific state change before triggering the
> line.
> In this case, PMIC_STBY_REQ can be used for stand by, sleep
> and power off modes.
> 
> Signed-off-by: Oleksij Rempel 

Applied, thanks.


Re: KASAN: use-after-free Read in cma_bind_port

2018-09-11 Thread syzbot

syzbot has found a reproducer for the following crash on:

HEAD commit:11da3a7f84f1 Linux 4.19-rc3
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11766c6940
kernel config:  https://syzkaller.appspot.com/x/.config?x=9917ff4b798e1a1e
dashboard link: https://syzkaller.appspot.com/bug?extid=da2591e115d57a9cbb8b
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1686969e40

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+da2591e115d57a9cb...@syzkaller.appspotmail.com

8021q: adding VLAN 0 to HW filter on device team0
8021q: adding VLAN 0 to HW filter on device team0
8021q: adding VLAN 0 to HW filter on device team0
hrtimer: interrupt took 34369 ns
==
BUG: KASAN: use-after-free in cma_bind_port+0x35d/0x420  
drivers/infiniband/core/cma.c:3059

Read of size 2 at addr 8801b7b056a0 by task syz-executor3/7271

CPU: 1 PID: 7271 Comm: syz-executor3 Not tainted 4.19.0-rc3+ #231
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:431
 cma_bind_port+0x35d/0x420 drivers/infiniband/core/cma.c:3059
 cma_alloc_port+0x115/0x180 drivers/infiniband/core/cma.c:3095
 cma_alloc_any_port drivers/infiniband/core/cma.c:3160 [inline]
 cma_get_port drivers/infiniband/core/cma.c:3314 [inline]
 rdma_bind_addr+0x1765/0x23d0 drivers/infiniband/core/cma.c:3434
 cma_bind_addr drivers/infiniband/core/cma.c:2963 [inline]
 rdma_resolve_addr+0x4e2/0x2770 drivers/infiniband/core/cma.c:2974
 ucma_resolve_ip+0x242/0x2a0 drivers/infiniband/core/ucma.c:711
 ucma_write+0x336/0x420 drivers/infiniband/core/ucma.c:1680
 __vfs_write+0x119/0x9f0 fs/read_write.c:485
 vfs_write+0x1fc/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4572d9
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00

RSP: 002b:7f71074b0c78 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 7f71074b16d4 RCX: 004572d9
RDX: 0048 RSI: 2240 RDI: 0008
RBP: 009300a0 R08:  R09: 
R10:  R11: 0246 R12: 
R13: 004d83c0 R14: 004c1e90 R15: 

Allocated by task 7285:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
 kmem_cache_alloc_trace+0x152/0x750 mm/slab.c:3620
 kmalloc include/linux/slab.h:513 [inline]
 kzalloc include/linux/slab.h:707 [inline]
 __rdma_create_id+0xdf/0x790 drivers/infiniband/core/cma.c:782
 ucma_create_id+0x39b/0x990 drivers/infiniband/core/ucma.c:502
 ucma_write+0x336/0x420 drivers/infiniband/core/ucma.c:1680
 __vfs_write+0x119/0x9f0 fs/read_write.c:485
 vfs_write+0x1fc/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 7261:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xcf/0x230 mm/slab.c:3813
 rdma_destroy_id+0x835/0xcc0 drivers/infiniband/core/cma.c:1737
 ucma_close+0x100/0x300 drivers/infiniband/core/ucma.c:1759
 __fput+0x385/0xa30 fs/file_table.c:278
 fput+0x15/0x20 fs/file_table.c:309
 task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:193 [inline]
 exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
 prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
 do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at 8801b7b05680
 which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 32 bytes inside of
 2048-byte region [8801b

Re: [PATCH v9 2/6] ARM: imx6: register pm_power_off handler if "fsl,pmic-stby-poweroff" is set

2018-09-11 Thread Shawn Guo
On Thu, Aug 02, 2018 at 12:34:21PM +0200, Oleksij Rempel wrote:
> One of the Freescale recommended sequences for power off with external
> PMIC is the following:
> ...
> 3.  SoC is programming PMIC for power off when standby is asserted.
> 4.  In CCM STOP mode, Standby is asserted, PMIC gates SoC supplies.
> 
> See:
> http://www.nxp.com/assets/documents/data/en/reference-manuals/IMX6DQRM.pdf
> page 5083
> 
> This patch implements step 4. of this sequence.
> 
> Signed-off-by: Oleksij Rempel 

Applied, thanks.


Re: [PATCH v3 3/3] drivers: soc: xilinx: Add ZynqMP PM driver

2018-09-11 Thread Ahmed S. Darwish
Hi!

[ Thanks a lot for upstreaming this.. ]

On Tue, Sep 11, 2018 at 02:34:57PM -0700, Jolly Shah wrote:
> From: Rajan Vaja 
>
> Add ZynqMP PM driver. PM driver provides power management
> support for ZynqMP.
>
> Signed-off-by: Rajan Vaja 
> Signed-off-by: Jolly Shah 
> ---

[...]

> +static irqreturn_t zynqmp_pm_isr(int irq, void *data)
> +{
> + u32 payload[CB_PAYLOAD_SIZE];
> +
> + zynqmp_pm_get_callback_data(payload);
> +
> + /* First element is callback API ID, others are callback arguments */
> + if (payload[0] == PM_INIT_SUSPEND_CB) {
> + if (work_pending(&zynqmp_pm_init_suspend_work->callback_work))
> + goto done;
> +
> + /* Copy callback arguments into work's structure */
> + memcpy(zynqmp_pm_init_suspend_work->args, &payload[1],
> +sizeof(zynqmp_pm_init_suspend_work->args));
> +
> + queue_work(system_unbound_wq,
> +&zynqmp_pm_init_suspend_work->callback_work);

We already have devm_request_threaded_irq(), which can do this
automatically for us.

Use that method to register the ISR instead, then if there's more
work to do, just do the memcpy and return IRQ_WAKE_THREAD.

> + }
> +
> +done:
> + return IRQ_HANDLED;
> +}
> +
> +/**
> + * zynqmp_pm_init_suspend_work_fn() - Initialize suspend
> + * @work:Pointer to work_struct
> + *
> + * Bottom-half of PM callback IRQ handler.
> + */
> +static void zynqmp_pm_init_suspend_work_fn(struct work_struct *work)
> +{
> + struct zynqmp_pm_work_struct *pm_work =
> + container_of(work, struct zynqmp_pm_work_struct, callback_work);
> +
> + if (pm_work->args[0] == ZYNQMP_PM_SUSPEND_REASON_SYSTEM_SHUTDOWN) {

we_really_seem_to_love_long_40_col_names_for_some_reason

> + orderly_poweroff(true);
> + } else if (pm_work->args[0] ==
> +ZYNQMP_PM_SUSPEND_REASON_POWER_UNIT_REQUEST) {

Ditto

[...]

> +/**
> + * zynqmp_pm_sysfs_init() - Initialize PM driver sysfs interface
> + * @dev: Pointer to device structure
> + *
> + * Return: 0 on success, negative error code otherwise
> + */
> +static int zynqmp_pm_sysfs_init(struct device *dev)
> +{
> + return sysfs_create_file(&dev->kobj, &dev_attr_suspend_mode.attr);
> +}
> +

Sysfs file is created in platform driver's probe(), but is not
removed anywhere in the code.

What happens if this is built as a module? Am I missing something
obvious?

Moreover, what's the wisdom of creating a one-liner function with
a huge six-line comment that:

a) _purely_ wraps sysfs_create_file(); no extra logic
b) is called only once
c) and not passed as a function pointer anywhere

IMO such one-liner wrappers obfuscate the code and the review
process with no apparent gain..

> +/**
> + * zynqmp_pm_probe() - Probe existence of the PMU Firmware
> + *  and initialize debugfs interface
> + *
> + * @pdev:Pointer to the platform_device structure
> + *
> + * Return: Returns 0 on success, negative error code otherwise
> + */

Again, a huge 8-line comment that provides no value.

If anyone wants to know what a platform driver probe() does, he
or she should check the primary references at:

- Documentation/driver-model/platform.txt
- include/linux/platform_device.h

and not the comment above..

> +static int zynqmp_pm_probe(struct platform_device *pdev)
> +{
> + int ret, irq;
> + u32 pm_api_version;
> +
> + const struct zynqmp_eemi_ops *eemi_ops = zynqmp_pm_get_eemi_ops();
> +
> + if (!eemi_ops || !eemi_ops->get_api_version || !eemi_ops->init_finalize)
> + return -ENXIO;
> +
> + eemi_ops->init_finalize();
> + eemi_ops->get_api_version(&pm_api_version);
> +
> + /* Check PM API version number */
> + if (pm_api_version < ZYNQMP_PM_VERSION)
> + return -ENODEV;
> +
> + irq = platform_get_irq(pdev, 0);
> + if (irq <= 0)
> + return -ENXIO;
> +
> + ret = devm_request_irq(&pdev->dev, irq, zynqmp_pm_isr, IRQF_SHARED,
> +dev_name(&pdev->dev), pdev);
> + if (ret) {
> + dev_err(&pdev->dev, "request_irq '%d' failed with %d\n",
> + irq, ret);
> + return ret;
> + }
> +
> + zynqmp_pm_init_suspend_work =
> + devm_kzalloc(&pdev->dev, sizeof(struct zynqmp_pm_work_struct),
> +  GFP_KERNEL);
> + if (!zynqmp_pm_init_suspend_work)
> + return -ENOMEM;
> +
> + INIT_WORK(&zynqmp_pm_init_suspend_work->callback_work,
> +   zynqmp_pm_init_suspend_work_fn);
> +

Let's use devm_request_threaded_irq(). Then we can completely
remove the work_struct, INIT_WORK(), and queuue_work() bits.

> + ret = zynqmp_pm_sysfs_init(&pdev->dev);
> + if (ret) {
> + dev_err(&pdev->dev, "unable to initialize sysfs interface\n");
> + return ret;
> + }
> +
> + return ret;

Just return 0 please. BTW ret was declare

Re: [PATCH v9 1/6] ARM: imx6q: provide documentation for new fsl,pmic-stby-poweroff property

2018-09-11 Thread Shawn Guo
I updated the subject as below to make it clear this is a bindings
change.

  dt-bindings: imx6q-clock: add new fsl,pmic-stby-poweroff property

Patch applied, thanks.

Shawn

On Thu, Aug 02, 2018 at 12:34:20PM +0200, Oleksij Rempel wrote:
> Signed-off-by: Oleksij Rempel 
> Acked-by: Rob Herring 
> ---
>  Documentation/devicetree/bindings/clock/imx6q-clock.txt | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/clock/imx6q-clock.txt 
> b/Documentation/devicetree/bindings/clock/imx6q-clock.txt
> index a45ca67a9d5f..e1308346e00d 100644
> --- a/Documentation/devicetree/bindings/clock/imx6q-clock.txt
> +++ b/Documentation/devicetree/bindings/clock/imx6q-clock.txt
> @@ -6,6 +6,14 @@ Required properties:
>  - interrupts: Should contain CCM interrupt
>  - #clock-cells: Should be <1>
>  
> +Optional properties:
> +- fsl,pmic-stby-poweroff: Configure CCM to assert PMIC_STBY_REQ signal
> +  on power off.
> +  Use this property if the SoC should be powered off by external power
> +  management IC (PMIC) triggered via PMIC_STBY_REQ signal.
> +  Boards that are designed to initiate poweroff on PMIC_ON_REQ signal should
> +  be using "syscon-poweroff" driver instead.
> +
>  The clock consumer should specify the desired clock by having the clock
>  ID in its "clocks" phandle cell.  See 
> include/dt-bindings/clock/imx6qdl-clock.h
>  for the full list of i.MX6 Quad and DualLite clock IDs.
> -- 
> 2.18.0
> 


Re: [PATCH 4.4 34/80] enic: handle mtu change for vf properly

2018-09-11 Thread Ben Hutchings
On Mon, 2018-09-03 at 18:49 +0200, Greg Kroah-Hartman wrote:
> 4.4-stable review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Govindarajulu Varadarajan 
> 
> [ Upstream commit ab123fe071c9aa9680ecd62eb080eb26cff4892c ]
> 
> When driver gets notification for mtu change, driver does not handle it for
> all RQs. It handles only RQ[0].
> 
> Fix is to use enic_change_mtu() interface to change mtu for vf.
[...]

This causes a assertion failure (noisy error logging, but not an oops)
when the driver is probed.  This was fixed upstream by:

commit cb5c6568867325f9905e80c96531d963bec8e5ea
Author: Govindarajulu Varadarajan 
Date:   Mon Jul 30 09:56:54 2018 -0700

enic: do not call enic_change_mtu in enic_probe

which is now needed on the 3.18, 4.4, and 4.9 stable branches.

Ben.

-- 
Ben Hutchings, Software Developer Codethink Ltd
https://www.codethink.co.uk/ Dale House, 35 Dale Street
 Manchester, M1 2HF, United Kingdom


Re: [PATCHv3] iscsi-target: Don't use stack buffer for scatterlist

2018-09-11 Thread Martin K. Petersen


> Applied to 4.20/scsi-queue, thank you!

4.19/scsi-fixes, that is...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCHv3] iscsi-target: Don't use stack buffer for scatterlist

2018-09-11 Thread Martin K. Petersen


Laura,

> There are two cases that trigger this bug. Switch to using a
> dynamically allocated buffer for one case and do not assign
> a NULL buffer in another case.

Applied to 4.20/scsi-queue, thank you!

-- 
Martin K. Petersen  Oracle Linux Engineering


[LKP] [kernel] 92114220fe: BUG:unable_to_handle_kernel

2018-09-11 Thread kernel test robot
FYI, we noticed the following commit (built with gcc-6):

commit: 92114220fe6a374172e99261b6451c515d29c8dc ("[PATCH] kernel: prevent 
submission of creds with higher privileges inside container")
url: 
https://github.com/0day-ci/linux/commits/My-Name/kernel-prevent-submission-of-creds-with-higher-privileges-inside-container/20180911-162532


in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -m 256M

caused below changes (please refer to attached dmesg/kmsg for entire 
log/backtrace):


+--+---++
|  | v4.19-rc3 | 92114220fe |
+--+---++
| boot_successes   | 8 | 0  |
| boot_failures| 0 | 6  |
| BUG:unable_to_handle_kernel  | 0 | 6  |
| Oops:#[##]   | 0 | 6  |
| RIP:commit_creds | 0 | 6  |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 6  |
+--+---++



[   53.586547] BUG: unable to handle kernel NULL pointer dereference at 
06c0
[   53.588054] PGD 0 P4D 0 
[   53.588564] Oops:  [#1] PTI
[   53.589180] CPU: 0 PID: 1 Comm: init Not tainted 4.19.0-rc3-1-g9211422 #1
[   53.590544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   53.592139] RIP: 0010:commit_creds+0x51/0x410
[   53.592988] Code: 08 81 ba b0 01 00 00 fe ff ff ef 74 11 8b 43 04 39 47 04 
0f 83 9c 00 00 00 e9 c2 03 00 00 48 8b 50 10 48 83 05 67 82 5a 02 01 <81> ba c0 
06 00 00 ff ff ff ef 75 d7 48 8b 50 18 48 83 05 57 82 5a
[   53.596525] RSP: :c900bd10 EFLAGS: 00010202
[   53.597526] RAX: 82ca3060 RBX: 88000f02eb40 RCX: 88000f0399c8
[   53.598883] RDX:  RSI:  RDI: 88000b2a53c0
[   53.600235] RBP: 88000bd66800 R08: 88000f030740 R09: 008fb60c
[   53.601587] R10: e0098d8b R11: 10c12b46 R12: 88000f030040
[   53.602936] R13: c9008000 R14: 88000cd07500 R15: 0001
[   53.604285] FS:  () GS:82c5b000() 
knlGS:
[   53.605813] CS:  0010 DS:  ES:  CR0: 80050033
[   53.606906] CR2: 06c0 CR3: 0c6f6000 CR4: 000406b0
[   53.608264] Call Trace:
[   53.608762]  install_exec_creds+0x25/0xa0
[   53.609544]  load_elf_binary+0x544/0x1e72
[   53.610324]  ? __lock_acquire+0xdbb/0x1030
[   53.611234]  ? find_held_lock+0x35/0xd0
[   53.611982]  ? __lock_acquire+0xdbb/0x1030
[   53.612891]  ? find_held_lock+0x35/0xd0
[   53.613639]  ? search_binary_handler+0x83/0x180
[   53.614512]  search_binary_handler+0x98/0x180
[   53.615356]  load_script+0x348/0x370
[   53.616058]  search_binary_handler+0x98/0x180
[   53.616906]  __do_execve_file+0x7d3/0xaa0
[   53.617804]  do_execve+0x24/0x30
[   53.618439]  run_init_process+0x50/0x60
[   53.619184]  ? rest_init+0x1a0/0x1a0
[   53.619885]  kernel_init+0xca/0x1e0
[   53.620573]  ret_from_fork+0x35/0x40
[   53.621264] CR2: 06c0
[   53.621969] ---[ end trace 3c2bcf9b443a9ddd ]---


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k  job-script # job-script is attached in this 
email



Thanks,
lkp
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 4.19.0-rc3 Kernel Configuration
#

#
# Compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
#
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=60400
CONFIG_CLANG_VERSION=0
CONFIG_CONSTRUCTORS=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
CONFIG_KERNEL_LZMA=y
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SYSVIPC is not set
# CONFIG_POSIX_MQUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_CHIP=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_SIM=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_IRQ_MATRIX_ALL

[PATCH -V5 RESEND 05/21] swap: Support PMD swap mapping in free_swap_and_cache()/swap_free()

2018-09-11 Thread Huang Ying
When a PMD swap mapping is removed from a huge swap cluster, for
example, when unmapping a memory range mapped with a PMD swap mapping,
free_swap_and_cache() will be called to decrease the reference count
of the huge swap cluster.  free_swap_and_cache() may also free or
split the huge swap cluster, and free the corresponding THP in swap
cache if necessary.  swap_free() is similar, and shares most of its
implementation with free_swap_and_cache().  This patch revises
free_swap_and_cache() and swap_free() to implement this.

If the swap cluster has been split already, for example, because of
failing to allocate a THP during swapin, we just decrease the
reference count of each swap slot by one.

Otherwise, we will decrease the reference count of each swap slot by
one, as well as the PMD swap mapping count in cluster_count().  When
the corresponding THP isn't in swap cache, if the PMD swap mapping
count becomes 0, the huge swap cluster will be split, and if all swap
counts become 0, the huge swap cluster will be freed.  When the
corresponding THP is in swap cache, if every swap_map[offset] ==
SWAP_HAS_CACHE, we will try to delete the THP from swap cache, which
will cause the THP and the huge swap cluster to be freed.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 arch/s390/mm/pgtable.c |   2 +-
 include/linux/swap.h   |   9 +--
 kernel/power/swap.c|   4 +-
 mm/madvise.c   |   2 +-
 mm/memory.c|   4 +-
 mm/shmem.c |   6 +-
 mm/swapfile.c  | 171 ++---
 7 files changed, 149 insertions(+), 49 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index f2cc7da473e4..ffd4b68adbb3 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -675,7 +675,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, 
swp_entry_t entry)
 
dec_mm_counter(mm, mm_counter(page));
}
-   free_swap_and_cache(entry);
+   free_swap_and_cache(entry, 1);
 }
 
 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1bee8b65cb8a..db3e07a3d9bc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -453,9 +453,9 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t *entry, int entry_size);
 extern int swapcache_prepare(swp_entry_t entry, int entry_size);
-extern void swap_free(swp_entry_t);
+extern void swap_free(swp_entry_t entry, int entry_size);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern int free_swap_and_cache(swp_entry_t);
+extern int free_swap_and_cache(swp_entry_t entry, int entry_size);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
@@ -509,7 +509,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(e) ({(is_migration_entry(e) || 
is_device_private_entry(e));})
+#define free_swap_and_cache(e, s)  \
+   ({(is_migration_entry(e) || is_device_private_entry(e)); })
 #define swapcache_prepare(e, s)
\
({(is_migration_entry(e) || is_device_private_entry(e)); })
 
@@ -527,7 +528,7 @@ static inline int swap_duplicate(swp_entry_t *swp, int 
entry_size)
return 0;
 }
 
-static inline void swap_free(swp_entry_t swp)
+static inline void swap_free(swp_entry_t swp, int entry_size)
 {
 }
 
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index d7f6c1a288d3..0275df84ed3d 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -182,7 +182,7 @@ sector_t alloc_swapdev_block(int swap)
offset = swp_offset(get_swap_page_of_type(swap));
if (offset) {
if (swsusp_extents_insert(offset))
-   swap_free(swp_entry(swap, offset));
+   swap_free(swp_entry(swap, offset), 1);
else
return swapdev_block(swap, offset);
}
@@ -206,7 +206,7 @@ void free_all_swap_pages(int swap)
ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, &swsusp_extents);
for (offset = ext->start; offset <= ext->end; offset++)
-   swap_free(swp_entry(swap, offset));
+   swap_free(swp_entry(swap, offset), 1);
 
kfree(ext);
}
diff --git a/mm/madvise.c b/mm/madvise.c
index 972a9eaa898b..6fff1c1d2009 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -349,7 +349,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long 
addr,
  

[PATCH -V5 RESEND 20/21] swap: create PMD swap mapping when unmap the THP

2018-09-11 Thread Huang Ying
This is the final step of the THP swapin support.  When reclaiming an
anonymous THP, after allocating the huge swap cluster and adding the THP
into swap cache, the PMD page mapping will be changed to a mapping
into the swap space.  Previously, the PMD page mapping would be split
before being changed.  In this patch, the unmap code is enhanced not
to split the PMD mapping, but to create a PMD swap mapping to replace
it instead.  So later, when the SWAP_HAS_CACHE flag is cleared in the
last step of swapout, the huge swap cluster will be kept instead of
being split, and on swapin, the huge swap cluster will be read in one
piece into a THP.  That is, the THP will not be split during
swapout/swapin.  This can eliminate the overhead of
splitting/collapsing and reduce the page fault count, etc.  But more
importantly, THP utilization is improved greatly; that is, many more
THPs will be kept when swapping is used, so that we can take full
advantage of THP, including its high performance for swapout/swapin.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h | 11 +++
 mm/huge_memory.c| 30 ++
 mm/rmap.c   | 43 ++-
 mm/vmscan.c |  6 +-
 4 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6586c1bfac21..8cbce31bc090 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -405,6 +405,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct 
vm_area_struct *vma)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+struct page_vma_mapped_walk;
+
 #ifdef CONFIG_THP_SWAP
 extern void __split_huge_swap_pmd(struct vm_area_struct *vma,
  unsigned long haddr,
@@ -412,6 +414,8 @@ extern void __split_huge_swap_pmd(struct vm_area_struct 
*vma,
 extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
   unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+extern bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+   struct page *page, unsigned long address, pmd_t pmdval);
 
 static inline bool transparent_hugepage_swapin_enabled(
struct vm_area_struct *vma)
@@ -453,6 +457,13 @@ static inline int do_huge_pmd_swap_page(struct vm_fault 
*vmf, pmd_t orig_pmd)
return 0;
 }
 
+static inline bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page, unsigned long address,
+ pmd_t pmdval)
+{
+   return false;
+}
+
 static inline bool transparent_hugepage_swapin_enabled(
struct vm_area_struct *vma)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2aa432830a38..542af5836ca5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1889,6 +1889,36 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t 
orig_pmd)
count_vm_event(THP_SWPIN_FALLBACK);
goto fallback;
 }
+
+bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw, struct page *page,
+   unsigned long address, pmd_t pmdval)
+{
+   struct vm_area_struct *vma = pvmw->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   pmd_t swp_pmd;
+   swp_entry_t entry = { .val = page_private(page) };
+
+   if (swap_duplicate(&entry, HPAGE_PMD_NR) < 0) {
+   set_pmd_at(mm, address, pvmw->pmd, pmdval);
+   return false;
+   }
+   if (list_empty(&mm->mmlist)) {
+   spin_lock(&mmlist_lock);
+   if (list_empty(&mm->mmlist))
+   list_add(&mm->mmlist, &init_mm.mmlist);
+   spin_unlock(&mmlist_lock);
+   }
+   add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+   add_mm_counter(mm, MM_SWAPENTS, HPAGE_PMD_NR);
+   swp_pmd = swp_entry_to_pmd(entry);
+   if (pmd_soft_dirty(pmdval))
+   swp_pmd = pmd_swp_mksoft_dirty(swp_pmd);
+   set_pmd_at(mm, address, pvmw->pmd, swp_pmd);
+
+   page_remove_rmap(page, true);
+   put_page(page);
+   return true;
+}
 #endif
 
 static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
diff --git a/mm/rmap.c b/mm/rmap.c
index 3bb4be720bc0..a180cb1fe2db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1413,11 +1413,52 @@ static bool try_to_unmap_one(struct page *page, struct 
vm_area_struct *vma,
continue;
}
 
+   address = pvmw.address;
+
+#ifdef CONFIG_THP_SWAP
+   /* PMD-mapped THP swap entry */
+   if (IS_ENABLED(CONFIG_THP_SWAP) &&
+   !pvmw.pte && PageAnon(page)) {
+   pmd_t pmdv

[PATCH -V5 RESEND 15/21] swap: Support to copy PMD swap mapping when fork()

2018-09-11 Thread Huang Ying
During fork, the page table needs to be copied from parent to child.  A
PMD swap mapping needs to be copied too, and the swap reference count
needs to be increased.

When the huge swap cluster has been split already, we need to split
the PMD swap mapping and fall back to PTE copying.

When swap count continuation fails to allocate a page with
GFP_ATOMIC, we need to unlock the spinlock and try again with
GFP_KERNEL.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/huge_memory.c | 72 
 1 file changed, 57 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f98d8a543d73..4e2230583c53 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -941,6 +941,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
if (unlikely(!pgtable))
goto out;
 
+retry:
dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -948,26 +949,67 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
if (unlikely(is_swap_pmd(pmd))) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
 
-   VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   if (is_write_migration_entry(entry)) {
-   make_migration_entry_read(&entry);
-   pmd = swp_entry_to_pmd(entry);
-   if (pmd_swp_soft_dirty(*src_pmd))
-   pmd = pmd_swp_mksoft_dirty(pmd);
-   set_pmd_at(src_mm, addr, src_pmd, pmd);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+   if (is_migration_entry(entry)) {
+   if (is_write_migration_entry(entry)) {
+   make_migration_entry_read(&entry);
+   pmd = swp_entry_to_pmd(entry);
+   if (pmd_swp_soft_dirty(*src_pmd))
+   pmd = pmd_swp_mksoft_dirty(pmd);
+   set_pmd_at(src_mm, addr, src_pmd, pmd);
+   }
+   add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+   mm_inc_nr_ptes(dst_mm);
+   pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+   set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+   ret = 0;
+   goto out_unlock;
}
-   add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-   mm_inc_nr_ptes(dst_mm);
-   pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-   set_pmd_at(dst_mm, addr, dst_pmd, pmd);
-   ret = 0;
-   goto out_unlock;
-   }
 #endif
+   if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry)) {
+   ret = swap_duplicate(&entry, HPAGE_PMD_NR);
+   if (!ret) {
+   add_mm_counter(dst_mm, MM_SWAPENTS,
+  HPAGE_PMD_NR);
+   mm_inc_nr_ptes(dst_mm);
+   pgtable_trans_huge_deposit(dst_mm, dst_pmd,
+  pgtable);
+   set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+   /* make sure dst_mm is on swapoff's mmlist. */
+   if (unlikely(list_empty(&dst_mm->mmlist))) {
+   spin_lock(&mmlist_lock);
+   if (list_empty(&dst_mm->mmlist))
+   list_add(&dst_mm->mmlist,
+&src_mm->mmlist);
+   spin_unlock(&mmlist_lock);
+   }
+   } else if (ret == -ENOTDIR) {
+   /*
+* The huge swap cluster has been split, split
+* the PMD swap mapping and fallback to PTE
+*/
+   __split_huge_swap_pmd(vma, addr, src_pmd);
+   pte_free(dst_mm, pgtable);
+   } else if (ret == -ENOMEM) {
+   spin_unlock(src_ptl);
+   spin_unlock(dst_ptl);
+   ret = add_swap_count_continuation(entry,
+ GFP_KERNEL);
+ 

[PATCH -V5 RESEND 21/21] swap: Update help of CONFIG_THP_SWAP

2018-09-11 Thread Huang Ying
The help of CONFIG_THP_SWAP is updated to reflect the latest progress
of THP (Transparent Huge Page) swap optimization.

Signed-off-by: "Huang, Ying" 
Reviewed-by: Dan Williams 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/Kconfig | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 9a6e7e27e8d5..cd41bc4382bf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -425,8 +425,6 @@ config THP_SWAP
depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP
help
  Swap transparent huge pages in one piece, without splitting.
- XXX: For now, swap cluster backing transparent huge page
- will be split after swapout.
 
  For selection by architectures with reasonable THP sizes.
 
-- 
2.16.4



[PATCH -V5 RESEND 19/21] swap: Support PMD swap mapping in common path

2018-09-11 Thread Huang Ying
The original code is only for PMD migration entries; it is revised to
support PMD swap mappings as well.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 fs/proc/task_mmu.c | 12 +---
 mm/gup.c   | 36 
 mm/huge_memory.c   |  7 ---
 mm/mempolicy.c |  2 +-
 4 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5ea1d64cb0b4..2d968523c57b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -972,7 +972,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
pmd = pmd_clear_soft_dirty(pmd);
 
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
-   } else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+   } else if (is_swap_pmd(pmd)) {
pmd = pmd_swp_clear_soft_dirty(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}
@@ -1302,9 +1302,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (pm->show_pfn)
frame = pmd_pfn(pmd) +
((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   }
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-   else if (is_swap_pmd(pmd)) {
+   } else if (IS_ENABLED(CONFIG_HAVE_PMD_SWAP_ENTRY) &&
+  is_swap_pmd(pmd)) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
unsigned long offset;
 
@@ -1317,10 +1316,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
flags |= PM_SWAP;
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
-   VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   page = migration_entry_to_page(entry);
+   if (is_pmd_migration_entry(pmd))
+   page = migration_entry_to_page(entry);
}
-#endif
 
if (page && page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;
diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..b35b7729b1b7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -216,6 +216,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
+   swp_entry_t entry;
 
pmd = pmd_offset(pudp, address);
/*
@@ -243,18 +244,22 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if (!pmd_present(pmdval)) {
if (likely(!(flags & FOLL_MIGRATION)))
return no_page_table(vma, flags);
-   VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(pmdval));
-   if (is_pmd_migration_entry(pmdval))
+   entry = pmd_to_swp_entry(pmdval);
+   if (thp_migration_supported() && is_migration_entry(entry)) {
pmd_migration_entry_wait(mm, pmd);
-   pmdval = READ_ONCE(*pmd);
-   /*
-* MADV_DONTNEED may convert the pmd to null because
-* mmap_sem is held in read mode
-*/
-   if (pmd_none(pmdval))
+   pmdval = READ_ONCE(*pmd);
+   /*
+* MADV_DONTNEED may convert the pmd to null because
+* mmap_sem is held in read mode
+*/
+   if (pmd_none(pmdval))
+   return no_page_table(vma, flags);
+   goto retry;
+   }
+   if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry))
return no_page_table(vma, flags);
-   goto retry;
+   WARN_ON(1);
+   return no_page_table(vma, flags);
}
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
@@ -276,11 +281,18 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return no_page_table(vma, flags);
}
if (unlikely(!pmd_present(*pmd))) {
+   entry = pmd_to_swp_entry(*pmd);
spin_unlock(ptl);
if (likely(!(flags & FOLL_MIGRATION)))
return no_page_table(vma, flags);
-   pmd_migration_entry_wait(mm, pmd);
-   goto retry_locked;
+   if (thp_migration_supported() && is_migration_entry(entry)) {
+   pmd_migration_entry_wait(mm, pmd);
+   goto retry_locked;
+   }
+   

[PATCH -V5 RESEND 18/21] swap: Support PMD swap mapping in mincore()

2018-09-11 Thread Huang Ying
During mincore(), for a PMD swap mapping, the swap cache will be
looked up.  If the resulting page isn't a compound page, the PMD swap
mapping will be split and fall back to PTE swap mapping processing.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/mincore.c | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index a66f2052c7b1..a2a66c3c8c6a 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -48,7 +48,8 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
  * and is up to date; i.e. that no page-in operation would be required
  * at this time if an application were to map and access this page.
  */
-static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
+static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff,
+ bool *compound)
 {
unsigned char present = 0;
struct page *page;
@@ -86,6 +87,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 #endif
if (page) {
present = PageUptodate(page);
+   if (compound)
+   *compound = PageCompound(page);
put_page(page);
}
 
@@ -103,7 +106,8 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
 
pgoff = linear_page_index(vma, addr);
for (i = 0; i < nr; i++, pgoff++)
-   vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+   vec[i] = mincore_page(vma->vm_file->f_mapping,
+ pgoff, NULL);
} else {
for (i = 0; i < nr; i++)
vec[i] = 0;
@@ -127,14 +131,36 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *ptep;
unsigned char *vec = walk->private;
int nr = (end - addr) >> PAGE_SHIFT;
+   swp_entry_t entry;
 
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
-   memset(vec, 1, nr);
+   unsigned char val = 1;
+   bool compound;
+
+   if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(*pmd)) {
+   entry = pmd_to_swp_entry(*pmd);
+   if (!non_swap_entry(entry)) {
+   val = mincore_page(swap_address_space(entry),
+  swp_offset(entry),
+  &compound);
+   /*
+* The huge swap cluster has been
+* split under us
+*/
+   if (!compound) {
+   __split_huge_swap_pmd(vma, addr, pmd);
+   spin_unlock(ptl);
+   goto fallback;
+   }
+   }
+   }
+   memset(vec, val, nr);
spin_unlock(ptl);
goto out;
}
 
+fallback:
if (pmd_trans_unstable(pmd)) {
__mincore_unmapped_range(addr, end, vma, vec);
goto out;
@@ -150,8 +176,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
else if (pte_present(pte))
*vec = 1;
else { /* pte is a swap entry */
-   swp_entry_t entry = pte_to_swp_entry(pte);
-
+   entry = pte_to_swp_entry(pte);
if (non_swap_entry(entry)) {
/*
 * migration or hwpoison entries are always
@@ -161,7 +186,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
} else {
 #ifdef CONFIG_SWAP
*vec = mincore_page(swap_address_space(entry),
-   swp_offset(entry));
+   swp_offset(entry), NULL);
 #else
WARN_ON(1);
*vec = 1;
-- 
2.16.4



[PATCH -V5 RESEND 17/21] swap: Support PMD swap mapping for MADV_WILLNEED

2018-09-11 Thread Huang Ying
During MADV_WILLNEED, for a PMD swap mapping, if THP swapin is enabled
for the VMA, the whole swap cluster will be swapped in.  Otherwise,
the huge swap cluster and the PMD swap mapping will be split and fall
back to PTE swap mapping processing.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/madvise.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 07ef599d4255..608c5ae201c6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -196,14 +196,36 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
pte_t *orig_pte;
struct vm_area_struct *vma = walk->private;
unsigned long index;
+   swp_entry_t entry;
+   struct page *page;
+   pmd_t pmdval;
+
+   pmdval = *pmd;
+   if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(pmdval) &&
+   !is_pmd_migration_entry(pmdval)) {
+   entry = pmd_to_swp_entry(pmdval);
+   if (!transparent_hugepage_swapin_enabled(vma)) {
+   if (!split_swap_cluster(entry, 0))
+   split_huge_swap_pmd(vma, pmd, start, pmdval);
+   } else {
+   page = read_swap_cache_async(entry,
+GFP_HIGHUSER_MOVABLE,
+vma, start, false);
+   if (page) {
+   /* The swap cluster has been split under us */
+   if (!PageTransHuge(page))
+   split_huge_swap_pmd(vma, pmd, start,
+   pmdval);
+   put_page(page);
+   }
+   }
+   }
 
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
return 0;
 
for (index = start; index != end; index += PAGE_SIZE) {
pte_t pte;
-   swp_entry_t entry;
-   struct page *page;
spinlock_t *ptl;
 
orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
-- 
2.16.4



[PATCH -V5 RESEND 16/21] swap: Free PMD swap mapping when zap_huge_pmd()

2018-09-11 Thread Huang Ying
For a PMD swap mapping, zap_huge_pmd() will clear the PMD and call
free_swap_and_cache() to decrease the swap reference count and maybe
free or split the huge swap cluster and the THP in swap cache.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/huge_memory.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4e2230583c53..d4e8b4f80543 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2024,7 +2024,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spin_unlock(ptl);
if (is_huge_zero_pmd(orig_pmd))
		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
-   } else if (is_huge_zero_pmd(orig_pmd)) {
+   } else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) {
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
@@ -2037,17 +2037,27 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
-   } else if (thp_migration_supported()) {
-   swp_entry_t entry;
-
-   VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
-   entry = pmd_to_swp_entry(orig_pmd);
-   page = pfn_to_page(swp_offset(entry));
+   } else {
+   swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
+
+   if (thp_migration_supported() &&
+   is_migration_entry(entry))
+   page = pfn_to_page(swp_offset(entry));
+   else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+!non_swap_entry(entry))
+   free_swap_and_cache(entry, HPAGE_PMD_NR);
+   else {
+   WARN_ONCE(1,
+"Non present huge pmd without pmd migration or swap enabled!");
+   goto unlock;
+   }
flush_needed = 0;
-   } else
-   WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+   }
 
-   if (PageAnon(page)) {
+   if (!page) {
+   zap_deposited_table(tlb->mm, pmd);
+   add_mm_counter(tlb->mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+   } else if (PageAnon(page)) {
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
} else {
@@ -2055,7 +2065,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
		add_mm_counter(tlb->mm, mm_counter_file(page), -HPAGE_PMD_NR);
}
-
+unlock:
spin_unlock(ptl);
if (flush_needed)
tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
-- 
2.16.4



[PATCH -V5 RESEND 12/21] swap: Support PMD swap mapping in swapoff

2018-09-11 Thread Huang Ying
During swapoff, for a huge swap cluster, we need to allocate a THP,
read its contents into the THP, and unuse the PMD and PTE swap
mappings to it.  If the THP allocation fails, the huge swap cluster
will be split.

During unuse, if it is found that the swap cluster mapped by a PMD
swap mapping has already been split, we will split the PMD swap
mapping and unuse the PTEs.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/asm-generic/pgtable.h | 14 +--
 include/linux/huge_mm.h   |  8 
 mm/huge_memory.c  |  4 +-
 mm/swapfile.c | 86 ++-
 4 files changed, 97 insertions(+), 15 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index eb1e9d17371b..d64cef2bff04 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -931,22 +931,12 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
barrier();
 #endif
/*
-* !pmd_present() checks for pmd migration entries
-*
-* The complete check uses is_pmd_migration_entry() in linux/swapops.h
-* But using that requires moving current function and pmd_trans_unstable()
-* to linux/swapops.h to resovle dependency, which is too much code move.
-*
-* !pmd_present() is equivalent to is_pmd_migration_entry() currently,
-* because !pmd_present() pages can only be under migration not swapped
-* out.
-*
-* pmd_none() is preseved for future condition checks on pmd migration
+* pmd_none() is preserved for future condition checks on pmd swap
 * entries and not confusing with this function name, although it is
 * redundant with !pmd_present().
 */
if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
-   (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
+   (IS_ENABLED(CONFIG_HAVE_PMD_SWAP_ENTRY) && !pmd_present(pmdval)))
return 1;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9dedff974def..25ba9b5f1e60 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -406,6 +406,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_THP_SWAP
+extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+  unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 static inline bool transparent_hugepage_swapin_enabled(
@@ -431,6 +433,12 @@ static inline bool transparent_hugepage_swapin_enabled(
return false;
 }
 #else /* CONFIG_THP_SWAP */
+static inline int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address, pmd_t orig_pmd)
+{
+   return 0;
+}
+
 static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 {
return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c4a766243a8f..cd353f39bed9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1671,8 +1671,8 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
 }
 
 #ifdef CONFIG_THP_SWAP
-static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-  unsigned long address, pmd_t orig_pmd)
+int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+   unsigned long address, pmd_t orig_pmd)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3fe50f1da0a0..64067ee6a09c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1931,6 +1931,11 @@ static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte);
 }
 
+static inline int pmd_same_as_swp(pmd_t pmd, pmd_t swp_pmd)
+{
+   return pmd_same(pmd_swp_clear_soft_dirty(pmd), swp_pmd);
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -1992,6 +1997,53 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
return ret;
 }
 
+#ifdef CONFIG_THP_SWAP
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+unsigned long addr, swp_entry_t entry, struct page *page)
+{
+   struct mem_cgroup *memcg;
+   spinlock_t *ptl;
+   int ret = 1;
+
+   if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+ &memcg, true)) {
+   ret = -ENOMEM;
+

[PATCH -V5 RESEND 13/21] swap: Support PMD swap mapping in madvise_free()

2018-09-11 Thread Huang Ying
When madvise_free() finds a PMD swap mapping, if only part of the huge
swap cluster is operated on, the PMD swap mapping will be split and
fall back to PTE swap mapping processing.  Otherwise, if the whole
huge swap cluster is operated on, free_swap_and_cache() will be called
to decrease the PMD swap mapping count and probably free the swap
space and the THP in swap cache too.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/huge_memory.c | 54 +++---
 mm/madvise.c |  2 +-
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cd353f39bed9..05407832e793 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1849,6 +1849,15 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 }
 #endif
 
+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+   pgtable_t pgtable;
+
+   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+   pte_free(mm, pgtable);
+   mm_dec_nr_ptes(mm);
+}
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -1869,15 +1878,39 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
goto out_unlocked;
 
orig_pmd = *pmd;
-   if (is_huge_zero_pmd(orig_pmd))
-   goto out;
-
if (unlikely(!pmd_present(orig_pmd))) {
-   VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(orig_pmd));
-   goto out;
+   swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
+
+   if (is_migration_entry(entry)) {
+   VM_BUG_ON(!thp_migration_supported());
+   goto out;
+   } else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+  !non_swap_entry(entry)) {
+   /*
+* If part of THP is discarded, split the PMD
+* swap mapping and operate on the PTEs
+*/
+   if (next - addr != HPAGE_PMD_SIZE) {
+   unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+   __split_huge_swap_pmd(vma, haddr, pmd);
+   goto out;
+   }
+   free_swap_and_cache(entry, HPAGE_PMD_NR);
+   pmd_clear(pmd);
+   zap_deposited_table(mm, pmd);
+   if (current->mm == mm)
+   sync_mm_rss(mm);
+   add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+   ret = true;
+   goto out;
+   } else
+   VM_BUG_ON(1);
}
 
+   if (is_huge_zero_pmd(orig_pmd))
+   goto out;
+
page = pmd_page(orig_pmd);
/*
 * If other processes are mapping this page, we couldn't discard
@@ -1923,15 +1956,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return ret;
 }
 
-static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
-{
-   pgtable_t pgtable;
-
-   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-   pte_free(mm, pgtable);
-   mm_dec_nr_ptes(mm);
-}
-
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/madvise.c b/mm/madvise.c
index 6fff1c1d2009..07ef599d4255 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -321,7 +321,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long next;
 
next = pmd_addr_end(addr, end);
-   if (pmd_trans_huge(*pmd))
+   if (pmd_trans_huge(*pmd) || is_swap_pmd(*pmd))
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
goto next;
 
-- 
2.16.4



[PATCH -V5 RESEND 14/21] swap: Support to move swap account for PMD swap mapping

2018-09-11 Thread Huang Ying
Previously, the huge swap cluster was split after the THP was swapped
out.  Now, to support swapping in the THP in one piece, the huge swap
cluster will not be split after the THP is reclaimed.  So in memcg, we
need to move the swap account for PMD swap mappings in the process's
page table.

When the page table is scanned while moving the memcg charge, PMD
swap mappings will be identified.  mem_cgroup_move_swap_account() and
its callees are revised to move the account for the whole huge swap
cluster.  If the swap cluster mapped by the PMD has been split, the
PMD swap mapping will be split and fall back to PTE processing.
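For context, the userspace trigger for this path is cgroup-v1 charge
moving.  A hypothetical sketch (cgroup v1, root required; the mount
point, cgroup name, and $TASK_PID are assumptions, adjust as needed):

```shell
# Assumes the v1 memory controller is mounted at /sys/fs/cgroup/memory.
mkdir /sys/fs/cgroup/memory/dst

# Bit 0 = move charges of anonymous pages (and their swap entries)
# when a task migrates into the cgroup.
echo 1 > /sys/fs/cgroup/memory/dst/memory.move_charge_at_immigrate

# Moving a task now also moves its swap accounting; with this series
# that includes whole-cluster accounting for PMD swap mappings.
echo "$TASK_PID" > /sys/fs/cgroup/memory/dst/tasks
```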

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h |   9 
 include/linux/swap.h|   6 +++
 include/linux/swap_cgroup.h |   3 +-
 mm/huge_memory.c|   8 +--
 mm/memcontrol.c | 129 ++--
 mm/swap_cgroup.c|  45 +---
 mm/swapfile.c   |  14 +
 7 files changed, 174 insertions(+), 40 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 25ba9b5f1e60..6586c1bfac21 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -406,6 +406,9 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_THP_SWAP
+extern void __split_huge_swap_pmd(struct vm_area_struct *vma,
+ unsigned long haddr,
+ pmd_t *pmd);
 extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
   unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
@@ -433,6 +436,12 @@ static inline bool transparent_hugepage_swapin_enabled(
return false;
 }
 #else /* CONFIG_THP_SWAP */
+static inline void __split_huge_swap_pmd(struct vm_area_struct *vma,
+unsigned long haddr,
+pmd_t *pmd)
+{
+}
+
 static inline int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
  unsigned long address, pmd_t orig_pmd)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 921abd07e13f..d45c3a7746e0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -621,6 +621,7 @@ static inline swp_entry_t get_swap_page(struct page *page)
 #ifdef CONFIG_THP_SWAP
 extern int split_swap_cluster(swp_entry_t entry, unsigned long flags);
 extern int split_swap_cluster_map(swp_entry_t entry);
+extern int get_swap_entry_size(swp_entry_t entry);
 #else
 static inline int split_swap_cluster(swp_entry_t entry, unsigned long flags)
 {
@@ -631,6 +632,11 @@ static inline int split_swap_cluster_map(swp_entry_t entry)
 {
return 0;
 }
+
+static inline int get_swap_entry_size(swp_entry_t entry)
+{
+   return 1;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index a12dd1c3966c..c40fb52b0563 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -7,7 +7,8 @@
 #ifdef CONFIG_MEMCG_SWAP
 
 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-   unsigned short old, unsigned short new);
+   unsigned short old, unsigned short new,
+   unsigned int nr_ents);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 05407832e793..f98d8a543d73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1636,10 +1636,11 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
return 0;
 }
 
+#ifdef CONFIG_THP_SWAP
 /* Convert a PMD swap mapping to a set of PTE swap mappings */
-static void __split_huge_swap_pmd(struct vm_area_struct *vma,
- unsigned long haddr,
- pmd_t *pmd)
+void __split_huge_swap_pmd(struct vm_area_struct *vma,
+  unsigned long haddr,
+  pmd_t *pmd)
 {
struct mm_struct *mm = vma->vm_mm;
pgtable_t pgtable;
@@ -1670,7 +1671,6 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
pmd_populate(mm, pmd, pgtable);
 }
 
-#ifdef CONFIG_THP_SWAP
 int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, pmd_t orig_pmd)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcec9b39e2a3..6c2527ffd17d 100644
--

[PATCH -V5 RESEND 06/21] swap: Support PMD swap mapping when splitting huge PMD

2018-09-11 Thread Huang Ying
A huge PMD needs to be split when zapping a part of the PMD mapping,
etc.  If the PMD mapping is a swap mapping, we need to split it too.
This patch implements the support for this.  It is similar to
splitting the PMD page mapping, except we also need to decrease the
PMD swap mapping count for the huge swap cluster.  If the PMD swap
mapping count becomes 0, the huge swap cluster will be split.

Notice: is_huge_zero_pmd() and pmd_page() don't work well with a swap
PMD, so the pmd_present() check is called before them.
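The required check ordering can be sketched as follows (pseudocode
mirroring the pattern this patch applies, e.g. in
__split_huge_pmd_locked()):

```
/* pmd_present() must come first: is_huge_zero_pmd() and pmd_page()
 * interpret the PFN bits, which in a non-present (swap) PMD hold
 * swap entry bits instead and would be misread. */
if (pmd_present(*pmd)) {
	if (is_huge_zero_pmd(*pmd))
		/* huge zero page handling */;
	else
		page = pmd_page(*pmd);
} else {
	swp_entry_t entry = pmd_to_swp_entry(*pmd);
	/* migration entry or true swap entry handling */
}
```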

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h |  4 
 include/linux/swap.h|  6 ++
 mm/huge_memory.c| 48 +++-
 mm/swapfile.c   | 32 
 4 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 99c19b06d9a4..0f3e1739986f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -226,6 +226,10 @@ static inline bool is_huge_zero_page(struct page *page)
return READ_ONCE(huge_zero_page) == page;
 }
 
+/*
+ * is_huge_zero_pmd() must be called after checking pmd_present(),
+ * otherwise, it may report false positive for PMD swap entry.
+ */
 static inline bool is_huge_zero_pmd(pmd_t pmd)
 {
return is_huge_zero_page(pmd_page(pmd));
diff --git a/include/linux/swap.h b/include/linux/swap.h
index db3e07a3d9bc..a2a3d85decd9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -618,11 +618,17 @@ static inline swp_entry_t get_swap_page(struct page *page)
 
 #ifdef CONFIG_THP_SWAP
 extern int split_swap_cluster(swp_entry_t entry);
+extern int split_swap_cluster_map(swp_entry_t entry);
 #else
 static inline int split_swap_cluster(swp_entry_t entry)
 {
return 0;
 }
+
+static inline int split_swap_cluster_map(swp_entry_t entry)
+{
+   return 0;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c235ba78de68..b8b61a0879f6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1609,6 +1609,40 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
return 0;
 }
 
+/* Convert a PMD swap mapping to a set of PTE swap mappings */
+static void __split_huge_swap_pmd(struct vm_area_struct *vma,
+ unsigned long haddr,
+ pmd_t *pmd)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgtable_t pgtable;
+   pmd_t _pmd;
+   swp_entry_t entry;
+   int i, soft_dirty;
+
+   entry = pmd_to_swp_entry(*pmd);
+   soft_dirty = pmd_soft_dirty(*pmd);
+
+   split_swap_cluster_map(entry);
+
+   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+   pmd_populate(mm, &_pmd, pgtable);
+
+   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE, entry.val++) {
+   pte_t *pte, ptent;
+
+   pte = pte_offset_map(&_pmd, haddr);
+   VM_BUG_ON(!pte_none(*pte));
+   ptent = swp_entry_to_pte(entry);
+   if (soft_dirty)
+   ptent = pte_swp_mksoft_dirty(ptent);
+   set_pte_at(mm, haddr, pte, ptent);
+   pte_unmap(pte);
+   }
+   smp_wmb(); /* make pte visible before pmd */
+   pmd_populate(mm, pmd, pgtable);
+}
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -2075,7 +2109,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-   VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+   VM_BUG_ON(!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd)
&& !pmd_devmap(*pmd));
 
count_vm_event(THP_SPLIT_PMD);
@@ -2099,7 +2133,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
put_page(page);
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
return;
-   } else if (is_huge_zero_pmd(*pmd)) {
+   } else if (pmd_present(*pmd) && is_huge_zero_pmd(*pmd)) {
/*
 * FIXME: Do we want to invalidate secondary mmu by calling
 * mmu_notifier_invalidate_range() see comments below inside
@@ -2143,6 +2177,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
page = pfn_to_page(swp_offset(entry));
} else
 #endif
+   if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(old_pmd))
+   return __split_huge_swap_pmd(vma, haddr, pmd);
+   else
page = pmd_page(old

[PATCH -V5 RESEND 11/21] swap: Add sysfs interface to configure THP swapin

2018-09-11 Thread Huang Ying
Swapping in a THP as a whole isn't desirable in some situations.  For
example, for a completely random access pattern, swapping in a THP in
one piece will greatly inflate the amount of data read.  So a sysfs
interface: /sys/kernel/mm/transparent_hugepage/swapin_enabled is added
to configure it.  The following three options are provided,

- always: THP swapin will be enabled always

- madvise: THP swapin will be enabled only for VMA with VM_HUGEPAGE
  flag set.

- never: THP swapin will be disabled always

The default configuration is: madvise.

During page fault, if a PMD swap mapping is found and THP swapin is
disabled, the huge swap cluster and the PMD swap mapping will be split
and fall back to normal page swapin.
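Once the series is applied, the knob is used like the other transhuge
sysfs files (note: swapin_enabled exists only with this patch, not in
mainline kernels):

```shell
# Show the current setting; brackets mark the active choice,
# e.g. "always [madvise] never" for the default.
cat /sys/kernel/mm/transparent_hugepage/swapin_enabled

# Enable whole-THP swapin for all VMAs:
echo always > /sys/kernel/mm/transparent_hugepage/swapin_enabled

# Restrict it to MADV_HUGEPAGE regions (the default):
echo madvise > /sys/kernel/mm/transparent_hugepage/swapin_enabled
```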

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 Documentation/admin-guide/mm/transhuge.rst | 21 +++
 include/linux/huge_mm.h| 31 ++
 mm/huge_memory.c   | 94 --
 3 files changed, 127 insertions(+), 19 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 85e33f785fd7..23aefb17101c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -160,6 +160,27 @@ Some userspace (such as a test program, or an optimized memory allocation
 
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
 
+Transparent hugepages may be swapped out and swapped in in one piece
+without splitting.  This will improve the utility of transparent
+hugepages but may inflate the read/write I/O too.  So whether to swap
+in transparent hugepages in one piece can be configured as follows.
+
+   echo always >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+   echo madvise >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+   echo never >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+
+always
+   Attempt to allocate a transparent huge page and read it from
+   swap space in one piece every time.
+
+never
+   Always split the huge swap cluster and the PMD swap mapping,
+   and swap in only the faulting normal page.
+
+madvise
+   Only swapin the transparent huge page in one piece for
+   MADV_HUGEPAGE madvise regions.
+
 khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shutdown if it's set to "never".
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c2b8ced6fc2b..9dedff974def 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -63,6 +63,8 @@ enum transparent_hugepage_flag {
 #ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
+   TRANSPARENT_HUGEPAGE_SWAPIN_FLAG,
+   TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG,
 };
 
 struct kobject;
@@ -405,11 +407,40 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 
 #ifdef CONFIG_THP_SWAP
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+
+static inline bool transparent_hugepage_swapin_enabled(
+   struct vm_area_struct *vma)
+{
+   if (vma->vm_flags & VM_NOHUGEPAGE)
+   return false;
+
+   if (is_vma_temporary_stack(vma))
+   return false;
+
+   if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+   return false;
+
+   if (transparent_hugepage_flags &
+   (1 << TRANSPARENT_HUGEPAGE_SWAPIN_FLAG))
+   return true;
+
+   if (transparent_hugepage_flags &
+   (1 << TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG))
+   return !!(vma->vm_flags & VM_HUGEPAGE);
+
+   return false;
+}
 #else /* CONFIG_THP_SWAP */
 static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 {
return 0;
 }
+
+static inline bool transparent_hugepage_swapin_enabled(
+   struct vm_area_struct *vma)
+{
+   return false;
+}
 #endif /* CONFIG_THP_SWAP */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1232ade5deca..c4a766243a8f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -57,7 +57,8 @@ unsigned long transparent_hugepage_flags __read_mostly =
 #endif
(1

[PATCH -V5 RESEND 08/21] swap: Support to read a huge swap cluster for swapin a THP

2018-09-11 Thread Huang Ying
To swap in a THP in one piece, we need to read a huge swap cluster from
the swap device.  This patch revises __read_swap_cache_async() and
its callers and callees to support this.  If __read_swap_cache_async()
finds that the swap cluster of the specified swap entry is huge, it
will try to allocate a THP and add it into the swap cache.  So later
the contents of the huge swap cluster can be read into the THP.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h | 38 ++
 include/linux/swap.h|  4 +--
 mm/huge_memory.c| 26 --
 mm/swap_state.c | 72 -
 mm/swapfile.c   |  9 ---
 5 files changed, 99 insertions(+), 50 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0f3e1739986f..3fdb29bc250c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -250,6 +250,39 @@ static inline bool thp_migration_supported(void)
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
 }
 
+/*
+ * always: directly stall for all thp allocations
+ * defer: wake kswapd and fail if not immediately available
+ * defer+madvise: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise
+ *   fail if not immediately available
+ * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately
+ * available
+ * never: never stall for any thp allocation
+ */
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+{
+   bool vma_madvised;
+
+   if (!vma)
+   return GFP_TRANSHUGE_LIGHT;
+   vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
+&transparent_hugepage_flags))
+   return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
+&transparent_hugepage_flags))
+   return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
+&transparent_hugepage_flags))
+   return GFP_TRANSHUGE_LIGHT |
+   (vma_madvised ? __GFP_DIRECT_RECLAIM :
+   __GFP_KSWAPD_RECLAIM);
+   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+&transparent_hugepage_flags))
+   return GFP_TRANSHUGE_LIGHT |
+   (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+   return GFP_TRANSHUGE_LIGHT;
+}
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -363,6 +396,11 @@ static inline bool thp_migration_supported(void)
 {
return false;
 }
+
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+{
+   return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c0c3b3c077d7..921abd07e13f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -462,7 +462,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
-extern int __swp_swapcount(swp_entry_t entry);
+extern int __swp_swapcount(swp_entry_t entry, int *entry_size);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
@@ -589,7 +589,7 @@ static inline int __swap_count(swp_entry_t entry)
return 0;
 }
 
-static inline int __swp_swapcount(swp_entry_t entry)
+static inline int __swp_swapcount(swp_entry_t entry, int *entry_size)
 {
return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 64123cefa978..f1358681db8f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -620,32 +620,6 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 
 }
 
-/*
- * always: directly stall for all thp allocations
- * defer: wake kswapd and fail if not immediately available
- * defer+madvise: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise
- *   fail if not immediately available
- * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately
- * available
- * never: never stall for any thp allocation
- */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
-{
-   const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-
-   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, 

[PATCH -V5 RESEND 10/21] swap: Support to count THP swapin and its fallback

2018-09-11 Thread Huang Ying
Two new /proc/vmstat fields, "thp_swpin" and "thp_swpin_fallback", are
added to count swapping in a THP from the swap device in one piece and
falling back to normal page swapin, respectively.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 Documentation/admin-guide/mm/transhuge.rst |  8 
 include/linux/vm_event_item.h  |  2 ++
 mm/huge_memory.c   |  4 +++-
 mm/page_io.c   | 15 ---
 mm/vmstat.c|  2 ++
 5 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 7ab93a8404b9..85e33f785fd7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -364,6 +364,14 @@ thp_swpout_fallback
Usually because failed to allocate some continuous swap space
for the huge page.
 
+thp_swpin
+   is incremented every time a huge page is swapped in in one
+   piece without splitting.
+
+thp_swpin_fallback
+   is incremented if a huge page has to be split during swapin,
+   usually because allocating a huge page failed.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 5c7f010676a7..7b438548a78e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -88,6 +88,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC_FAILED,
THP_SWPOUT,
THP_SWPOUT_FALLBACK,
+   THP_SWPIN,
+   THP_SWPIN_FALLBACK,
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4dbc4f933c4f..1232ade5deca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1673,8 +1673,10 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
/* swapoff occurs under us */
} else if (ret == -EINVAL)
ret = 0;
-   else
+   else {
+   count_vm_event(THP_SWPIN_FALLBACK);
goto fallback;
+   }
}
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto out;
diff --git a/mm/page_io.c b/mm/page_io.c
index aafd19ec1db4..362254b99955 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -348,6 +348,15 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return ret;
 }
 
+static inline void count_swpin_vm_event(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   if (unlikely(PageTransHuge(page)))
+   count_vm_event(THP_SWPIN);
+#endif
+   count_vm_events(PSWPIN, hpage_nr_pages(page));
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
struct bio *bio;
@@ -371,7 +380,7 @@ int swap_readpage(struct page *page, bool synchronous)
 
ret = mapping->a_ops->readpage(swap_file, page);
if (!ret)
-   count_vm_event(PSWPIN);
+   count_swpin_vm_event(page);
return ret;
}
 
@@ -382,7 +391,7 @@ int swap_readpage(struct page *page, bool synchronous)
unlock_page(page);
}
 
-   count_vm_event(PSWPIN);
+   count_swpin_vm_event(page);
return 0;
}
 
@@ -401,7 +410,7 @@ int swap_readpage(struct page *page, bool synchronous)
get_task_struct(current);
bio->bi_private = current;
bio_set_op_attrs(bio, REQ_OP_READ, 0);
-   count_vm_event(PSWPIN);
+   count_swpin_vm_event(page);
bio_get(bio);
qc = submit_bio(bio);
while (synchronous) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ba0870ecddd..ac04801bb0cb 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1263,6 +1263,8 @@ const char * const vmstat_text[] = {
"thp_zero_page_alloc_failed",
"thp_swpout",
"thp_swpout_fallback",
+   "thp_swpin",
+   "thp_swpin_fallback",
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
"balloon_inflate",
-- 
2.16.4



[PATCH -V5 RESEND 09/21] swap: Swapin a THP in one piece

2018-09-11 Thread Huang Ying
With this patch, when the page fault handler finds a PMD swap mapping,
it will swap in a THP in one piece.  This avoids the overhead of
splitting/collapsing before/after THP swapping, and greatly improves
swap performance thanks to the reduced page fault count, etc.

do_huge_pmd_swap_page() is added in the patch to implement this.  It
is similar to do_swap_page() for normal page swapin.

If failing to allocate a THP, the huge swap cluster and the PMD swap
mapping will be split to fallback to normal page swapin.

If the huge swap cluster has been split already, the PMD swap mapping
will be split to fallback to normal page swapin.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h |   9 +++
 mm/huge_memory.c| 174 
 mm/memory.c |  16 +++--
 3 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3fdb29bc250c..c2b8ced6fc2b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -403,4 +403,13 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_THP_SWAP
+extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+#else /* CONFIG_THP_SWAP */
+static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+   return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f1358681db8f..4dbc4f933c4f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1617,6 +1619,178 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
pmd_populate(mm, pmd, pgtable);
 }
 
+#ifdef CONFIG_THP_SWAP
+static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+  unsigned long address, pmd_t orig_pmd)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   spinlock_t *ptl;
+   int ret = 0;
+
+   ptl = pmd_lock(mm, pmd);
+   if (pmd_same(*pmd, orig_pmd))
+   __split_huge_swap_pmd(vma, address & HPAGE_PMD_MASK, pmd);
+   else
+   ret = -ENOENT;
+   spin_unlock(ptl);
+
+   return ret;
+}
+
+int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+   struct page *page;
+   struct mem_cgroup *memcg;
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+   swp_entry_t entry;
+   pmd_t pmd;
+   int i, locked, exclusive = 0, ret = 0;
+
+   entry = pmd_to_swp_entry(orig_pmd);
+   VM_BUG_ON(non_swap_entry(entry));
+   delayacct_set_flag(DELAYACCT_PF_SWAPIN);
+retry:
+   page = lookup_swap_cache(entry, NULL, vmf->address);
+   if (!page) {
+   page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma,
+haddr, false);
+   if (!page) {
+   /*
+* Back out if somebody else faulted in this pmd
+* while we released the pmd lock.
+*/
+   if (likely(pmd_same(*vmf->pmd, orig_pmd))) {
+   /*
+* Failed to allocate huge page, split huge swap
+* cluster, and fallback to swapin normal page
+*/
+   ret = split_swap_cluster(entry, 0);
+   /* Somebody else swapin the swap entry, retry */
+   if (ret == -EEXIST) {
+   ret = 0;
+   goto retry;
+   /* swapoff occurs under us */
+   } else if (ret == -EINVAL)
+   ret = 0;
+   else
+   goto fallback;
+   }
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   goto out;
+   }
+
+   /* Had to read the page from swap area: Major fault */
+   ret = VM_FAULT_MAJOR;
+   count_vm_event(PGMAJFAULT);
+   count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+   } else if (!PageTransCompound(page))
+   goto fallback;
+
+   locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   if (!locked) {
+   ret |= VM_FAULT_RETRY;
+   goto out_release;
+

[PATCH -V5 RESEND 04/21] swap: Support PMD swap mapping in put_swap_page()

2018-09-11 Thread Huang Ying
Previously, during swapout, all PMD page mappings were split and
replaced with PTE swap mappings.  And when clearing the SWAP_HAS_CACHE
flag for the huge swap cluster in put_swap_page(), the huge swap
cluster was split.  Now, during swapout, the PMD page mappings to
the THP will be changed to PMD swap mappings to the corresponding swap
cluster.  So when clearing the SWAP_HAS_CACHE flag, the huge swap
cluster will only be split if the PMD swap mapping count is 0.
Otherwise, we will keep it as a huge swap cluster, so that we can
swap in a THP in one piece later.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/swapfile.c | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 138968b79de5..553d2551b35a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1314,6 +1314,15 @@ void swap_free(swp_entry_t entry)
 
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
+ *
+ * When a THP is added into swap cache, the SWAP_HAS_CACHE flag will
+ * be set in the swap_map[] of all swap entries in the huge swap
+ * cluster backing the THP.  This huge swap cluster will not be split
+ * unless the THP is split even if its PMD swap mapping count dropped
+ * to 0.  Later, when the THP is removed from swap cache, the
+ * SWAP_HAS_CACHE flag will be cleared in the swap_map[] of all swap
+ * entries in the huge swap cluster.  And this huge swap cluster will
+ * be split if its PMD swap mapping count is 0.
  */
 void put_swap_page(struct page *page, swp_entry_t entry)
 {
@@ -1332,15 +1341,23 @@ void put_swap_page(struct page *page, swp_entry_t entry)
 
ci = lock_cluster_or_swap_info(si, offset);
if (size == SWAPFILE_CLUSTER) {
-   VM_BUG_ON(!cluster_is_huge(ci));
+   VM_BUG_ON(!IS_ALIGNED(offset, size));
map = si->swap_map + offset;
-   for (i = 0; i < SWAPFILE_CLUSTER; i++) {
-   val = map[i];
-   VM_BUG_ON(!(val & SWAP_HAS_CACHE));
-   if (val == SWAP_HAS_CACHE)
-   free_entries++;
+   /*
+* No PMD swap mapping, the swap cluster will be freed
+* if all swap entries become free, otherwise the
+* huge swap cluster will be split.
+*/
+   if (!cluster_swapcount(ci)) {
+   for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+   val = map[i];
+   VM_BUG_ON(!(val & SWAP_HAS_CACHE));
+   if (val == SWAP_HAS_CACHE)
+   free_entries++;
+   }
+   if (free_entries != SWAPFILE_CLUSTER)
+   cluster_clear_huge(ci);
}
-   cluster_clear_huge(ci);
if (free_entries == SWAPFILE_CLUSTER) {
unlock_cluster_or_swap_info(si, ci);
spin_lock(&si->lock);
-- 
2.16.4



[PATCH -V5 RESEND 07/21] swap: Support PMD swap mapping in split_swap_cluster()

2018-09-11 Thread Huang Ying
When splitting a THP in the swap cache, or when failing to allocate a
THP while swapping in a huge swap cluster, the huge swap cluster will
be split.  In addition to clearing the huge flag of the swap cluster,
the PMD swap mapping count recorded in cluster_count() will be set to
0.  But we will not touch the PMD swap mappings themselves, because it
is sometimes hard to find them all.  When the PMD swap mappings are
operated on later, it will be found that the huge swap cluster has
been split, and the PMD swap mappings will be split at that time.

Unless splitting a THP in the swap cache (specified via the "force"
parameter), split_swap_cluster() will return -EEXIST if the
SWAP_HAS_CACHE flag is set in swap_map[offset], because this indicates
there is a THP corresponding to this huge swap cluster, and it isn't
desirable to split that THP.

When splitting a THP in the swap cache, the call to
split_swap_cluster() is moved to before unlocking the sub-pages, so
that all sub-pages are kept locked from when the THP is split until
the huge swap cluster is split.  This makes the code much easier to
reason about.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/swap.h |  6 --
 mm/huge_memory.c | 18 ++--
 mm/swapfile.c| 58 +---
 3 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2a3d85decd9..c0c3b3c077d7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -616,11 +616,13 @@ static inline swp_entry_t get_swap_page(struct page *page)
 
 #endif /* CONFIG_SWAP */
 
+#define SSC_SPLIT_CACHED   0x1
+
 #ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
+extern int split_swap_cluster(swp_entry_t entry, unsigned long flags);
 extern int split_swap_cluster_map(swp_entry_t entry);
 #else
-static inline int split_swap_cluster(swp_entry_t entry)
+static inline int split_swap_cluster(swp_entry_t entry, unsigned long flags)
 {
return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b8b61a0879f6..64123cefa978 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2502,6 +2502,17 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
unfreeze_page(head);
 
+   /*
+* Split the swap cluster before unlocking sub-pages, so all
+* sub-pages are kept locked from when the THP is split until
+* the swap cluster is split.
+*/
+   if (PageSwapCache(head)) {
+   swp_entry_t entry = { .val = page_private(head) };
+
+   split_swap_cluster(entry, SSC_SPLIT_CACHED);
+   }
+
for (i = 0; i < HPAGE_PMD_NR; i++) {
struct page *subpage = head + i;
if (subpage == page)
@@ -2728,12 +2739,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
__dec_node_page_state(page, NR_SHMEM_THPS);
spin_unlock(&pgdata->split_queue_lock);
__split_huge_page(page, list, flags);
-   if (PageSwapCache(head)) {
-   swp_entry_t entry = { .val = page_private(head) };
-
-   ret = split_swap_cluster(entry);
-   } else
-   ret = 0;
+   ret = 0;
} else {
if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
pr_alert("total_mapcount: %u, page_count(): %u\n",
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 16723b9d971a..ef2b42c199c0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1469,23 +1469,6 @@ void put_swap_page(struct page *page, swp_entry_t entry)
unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-   struct swap_info_struct *si;
-   struct swap_cluster_info *ci;
-   unsigned long offset = swp_offset(entry);
-
-   si = _swap_info_get(entry);
-   if (!si)
-   return -EBUSY;
-   ci = lock_cluster(si, offset);
-   cluster_clear_huge(ci);
-   unlock_cluster(ci);
-   return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -4064,6 +4047,47 @@ int split_swap_cluster_map(swp_entry_t entry)
unlock_cluster(ci);
return 0;
 }
+
+/*
+ * We will not try to split all PMD swap mappings to the swap cluster,
+ * because we haven't enough information available for that.  Later,
+ * when the PMD swap mapping is duplicated or swapin, etc, the PMD
+ * swap mapping will be split and fallback to the PTE operations.
+ */
+int split_swap_cluster(swp_entry_t entry, unsigned long flags)
+{
+   struct swap_info_struct *si;
+   struct swap_cluster_info *ci

[PATCH -V5 RESEND 00/21] swap: Swapout/swapin THP in one piece

2018-09-11 Thread Huang Ying
Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [02/21], [03/21], [04/21],
[05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
[12/21], [20/21], [21/21].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
[15/21], [16/21], [17/21], [18/21], [19/21], [20/21].

Hi, Johannes and Michal, could you help me to review the cgroup part
of the patchset?  Especially [14/21].

And for all: any comment is welcome!

This patchset is based on the 2018-09-04 head of mmotm/master.

This is the final step of the THP (Transparent Huge Page) swap
optimization.  After the first and second steps, the splitting of the
huge page is delayed from almost the first step of swapout to after
swapout has finished.  In this step, we avoid splitting the THP for
swapout and swap out/in the THP in one piece.

We tested the patchset with vm-scalability benchmark swap-w-seq test
case, with 16 processes.  The test case forks 16 processes.  Each
process allocates large anonymous memory range, and writes it from
begin to end for 8 rounds.  The first round will swapout, while the
remaining rounds will swapin and swapout.  The test is done on a Xeon
E5 v3 system, the swap device used is a RAM simulated PMEM (persistent
memory) device.  The test result is as follow,

                base       change     optimized
    (value ± %stddev)               (value ± %stddev)
      1417897 ±  2%      +992.8%      15494673        vm-scalability.throughput
      1020489 ±  4%     +1091.2%      12156349        vmstat.swap.si
      1255093 ±  3%      +940.3%      13056114        vmstat.swap.so
      1259769 ±  7%     +1818.3%      24166779        meminfo.AnonHugePages
     28021761             -10.7%      25018848 ±  2%  meminfo.AnonPages
     64080064 ±  4%       -95.6%       2787565 ± 33%  interrupts.CAL:Function_call_interrupts
        13.91 ±  5%       -13.8           0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath

Here, the benchmark score (bytes written per second) improved by
992.8%.  The swapout/swapin throughput improved by 1008% (from about
2.17GB/s to 24.04GB/s).  The performance difference is huge.  In the
base kernel, for the first round of writing, the THP is swapped out
and split, so in the remaining rounds there is only normal page swapin
and swapout.  In the optimized kernel, the THP is kept after the first
swapout, so THP swapin and swapout are used in the remaining rounds.
This shows the key benefit of swapping out/in the THP in one piece:
the THP is kept instead of being split.  The meminfo information
verified this: in the base kernel only 4.5% of anonymous pages are
THPs during the test, while in the optimized kernel that is 96.6%.
The TLB flushing IPIs (represented as
interrupts.CAL:Function_call_interrupts) were reduced by 95.6%, while
cycles spent on spinlocks were reduced from 13.9% to 0.1%.  These are
performance benefits of THP swapout/swapin too.

Below is the description for all steps of THP swap optimization.

Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swapping, even on a high-end server machine, because
storage device performance has improved faster than that of a single
logical CPU.  And it seems that this trend will not change in the near
future.  On the other hand, THPs become more and more popular because
of increased memory sizes.  So it becomes necessary to optimize THP
swap performance.

The advantages to swapout/swapin a THP in one piece include:

- Batch various swap operations for the THP.  Many operations need to
  be done once per THP instead of per normal page, for example,
  allocating/freeing the swap space, writing/reading the swap space,
  flushing TLB, page fault, etc.  This will improve the performance of
  the THP swap greatly.

- The THP swap space read/write will be large sequential IO (2M on
  x86_64).  This is particularly helpful for swapin, which is usually
  4k random IO.  This will improve the performance of THP swap too.

- It will help with memory fragmentation, especially when THPs are
  heavily used by the applications, since THP-order pages will be
  freed up after THP swapout.

- It will improve THP utilization on systems with swap turned on,
  because the speed at which khugepaged collapses normal pages into a
  THP is quite slow.  After a THP is split during swapout, it takes
  quite a long time for the normal pages to collapse back into a THP
  after being swapped in.  High THP utilization helps the efficiency
  of page-based memory management too.

There are some concerns regarding THP swapin, mainly because the
possibly enlarged read/write IO size (for swapout/swapin) may put more
overhead on the storage device.

[PATCH -V5 RESEND 02/21] swap: Add __swap_duplicate_locked()

2018-09-11 Thread Huang Ying
The part of __swap_duplicate() that runs with the lock held is
separated into a new function, __swap_duplicate_locked(), because we
will add more logic about PMD swap mappings into __swap_duplicate()
and keep most of the PTE swap mapping related logic in
__swap_duplicate_locked().

This is just mechanical code refactoring; there is no functional
change in this patch.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/swapfile.c | 63 +--
 1 file changed, 35 insertions(+), 28 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 97a1bd1a7c9a..6a570ef00fa7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3436,32 +3436,12 @@ void si_swapinfo(struct sysinfo *val)
spin_unlock(&swap_lock);
 }
 
-/*
- * Verify that a swap entry is valid and increment its swap map count.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swp_entry is migration entry -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_locked(struct swap_info_struct *p,
+  unsigned long offset, unsigned char usage)
 {
-   struct swap_info_struct *p;
-   struct swap_cluster_info *ci;
-   unsigned long offset;
unsigned char count;
unsigned char has_cache;
-   int err = -EINVAL;
-
-   p = get_swap_device(entry);
-   if (!p)
-   goto out;
-
-   offset = swp_offset(entry);
-   ci = lock_cluster_or_swap_info(p, offset);
+   int err = 0;
 
count = p->swap_map[offset];
 
@@ -3471,12 +3451,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 */
if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
err = -ENOENT;
-   goto unlock_out;
+   goto out;
}
 
has_cache = count & SWAP_HAS_CACHE;
count &= ~SWAP_HAS_CACHE;
-   err = 0;
 
if (usage == SWAP_HAS_CACHE) {
 
@@ -3503,11 +3482,39 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 
p->swap_map[offset] = count | has_cache;
 
-unlock_out:
+out:
+   return err;
+}
+
+/*
+ * Verify that a swap entry is valid and increment its swap map count.
+ *
+ * Returns error code in following case.
+ * - success -> 0
+ * - swp_entry is invalid -> EINVAL
+ * - swp_entry is migration entry -> EINVAL
+ * - swap-cache reference is requested but there is already one. -> EEXIST
+ * - swap-cache reference is requested but the entry is not used. -> ENOENT
+ * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
+ */
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+   struct swap_info_struct *p;
+   struct swap_cluster_info *ci;
+   unsigned long offset;
+   int err = -EINVAL;
+
+   p = get_swap_device(entry);
+   if (!p)
+   goto out;
+
+   offset = swp_offset(entry);
+   ci = lock_cluster_or_swap_info(p, offset);
+   err = __swap_duplicate_locked(p, offset, usage);
unlock_cluster_or_swap_info(p, ci);
+
+   put_swap_device(p);
 out:
-   if (p)
-   put_swap_device(p);
return err;
 }
 
-- 
2.16.4



[PATCH -V5 RESEND 03/21] swap: Support PMD swap mapping in swap_duplicate()

2018-09-11 Thread Huang Ying
To support swapping in a THP in one piece, we need to create PMD swap
mappings during swapout and maintain the PMD swap mapping count.  This
patch implements the support to increase the PMD swap mapping
count (for swapout, fork, etc.) and to set the SWAP_HAS_CACHE flag (for
swapin, etc.) for a huge swap cluster in the swap_duplicate() function
family.  Although it only implements a part of the design of swap
reference counting with PMD swap mappings, the whole design is
described below to make it easy to understand the patch and the whole
picture.

A huge swap cluster is used to hold the contents of a swapouted THP.
After swapout, a PMD page mapping to the THP will become a PMD
swap mapping to the huge swap cluster via a swap entry in PMD.  While
a PTE page mapping to a subpage of the THP will become the PTE swap
mapping to a swap slot in the huge swap cluster via a swap entry in
PTE.

If there is no PMD swap mapping and the corresponding THP is removed
from the swap cache (reclaimed), the huge swap cluster will be split
and become a normal swap cluster.

The count (cluster_count()) of the huge swap cluster is
SWAPFILE_CLUSTER (= HPAGE_PMD_NR) + PMD swap mapping count.  Because
all swap slots in the huge swap cluster are mapped by PTE or PMD, or
has SWAP_HAS_CACHE bit set, the usage count of the swap cluster is
HPAGE_PMD_NR.  And the PMD swap mapping count is recorded too to make
it easy to determine whether there are remaining PMD swap mappings.

The count in swap_map[offset] is the sum of the PTE and PMD swap
mapping counts.  This means that when we increase the PMD swap mapping
count, we need to increase swap_map[offset] for all swap slots inside
the swap cluster.  An alternative choice is to make swap_map[offset]
record the PTE swap map count only, given that we have recorded the
PMD swap mapping count in the count of the huge swap cluster.  But
this would require increasing swap_map[offset] when splitting the PMD
swap mapping, which may fail because of the memory allocation for swap
count continuation.  That is hard to deal with.  So we chose the
current solution.

A PMD swap mapping to a huge swap cluster may be split, for example when
unmapping part of the PMD mapping.  That is easy because only the count
of the huge swap cluster needs to be changed.  When the last PMD swap
mapping is gone and SWAP_HAS_CACHE is unset, we split the huge swap
cluster (clear the huge flag).  This makes it easy to reason about the
cluster state.

A huge swap cluster will be split when splitting the THP in the swap
cache, when failing to allocate a THP during swapin, etc.  But when
splitting the huge swap cluster, we will not try to split all PMD swap
mappings, because sometimes we don't have enough information available
for that.  Later, when such a PMD swap mapping is duplicated, swapped
in, etc., it will be split and fall back to the PTE operations.

When a THP is added into the swap cache, the SWAP_HAS_CACHE flag will be
set in swap_map[offset] for all swap slots inside the huge swap cluster
backing the THP.  This huge swap cluster will not be split unless the
THP is split, even if its PMD swap mapping count drops to 0.  Later,
when the THP is removed from the swap cache, the SWAP_HAS_CACHE flag
will be cleared in swap_map[offset] for all swap slots inside the huge
swap cluster, and the huge swap cluster will be split if its PMD swap
mapping count is 0.

The first parameter of swap_duplicate() is changed to a pointer so that
the function can return the swap entry for which
add_swap_count_continuation() should be called, because we may need to
call it for a swap entry in the middle of a huge swap cluster.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/swap.h |   9 +++--
 mm/memory.c  |   2 +-
 mm/rmap.c|   2 +-
 mm/swap_state.c  |   2 +-
 mm/swapfile.c| 107 ++-
 5 files changed, 97 insertions(+), 25 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ca7c6307bda7..1bee8b65cb8a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -451,8 +451,8 @@ extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
-extern int swap_duplicate(swp_entry_t);
-extern int swapcache_prepare(swp_entry_t);
+extern int swap_duplicate(swp_entry_t *entry, int entry_size);
+extern int swapcache_prepare(swp_entry_t entry, int entry_size);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
@@ -510,7 +510,8 @@ static inline void show_swap_cache_info(void)
 }
 
#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})

[PATCH -V5 RESEND 01/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP

2018-09-11 Thread Huang Ying
Currently, the "swap entry" in the page tables is used for a number of
things outside of actual swap, such as page migration.  We currently
support the THP/PMD "swap entry" for page migration only, and the
functions behind it are tied to page migration's config
option (CONFIG_ARCH_ENABLE_THP_MIGRATION).

But we also need these functions for the THP swap optimization, so a new
config option (CONFIG_HAVE_PMD_SWAP_ENTRY) is added.  It is enabled when
either CONFIG_ARCH_ENABLE_THP_MIGRATION or CONFIG_THP_SWAP is enabled,
and the PMD swap entry functions are tied to this new config option
instead.  The functions that are specific to page migration remain
enabled only under CONFIG_ARCH_ENABLE_THP_MIGRATION.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 arch/x86/include/asm/pgtable.h |  2 +-
 include/asm-generic/pgtable.h  |  2 +-
 include/linux/swapops.h| 44 ++
 mm/Kconfig |  8 
 4 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e4ffa565a69f..194f97dc4583 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1334,7 +1334,7 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 5657a20e0c59..eb1e9d17371b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -675,7 +675,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
-#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifndef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
return pmd;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 22af9d8a84ae..79ccbf8789d5 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -259,17 +259,7 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
-struct page_vma_mapped_walk;
-
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
-   struct page *page);
-
-extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
-   struct page *new);
-
-extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
-
+#ifdef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
 {
swp_entry_t arch_entry;
@@ -287,6 +277,28 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
return __swp_entry_to_pmd(arch_entry);
 }
+#else
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+   return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+   return __pmd(0);
+}
+#endif
+
+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+   struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+   struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
 
 static inline int is_pmd_migration_entry(pmd_t pmd)
 {
@@ -307,16 +307,6 @@ static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
 
 static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
 
-static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
-{
-   return swp_entry(0, 0);
-}
-
-static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
-{
-   return __pmd(0);
-}
-
 static inline int is_pmd_migration_entry(pmd_t pmd)
 {
return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 7bf074bf79e5..9a6e7e27e8d5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -430,6 +430,14 @@ config THP_SWAP
 
  For selection by architectures with reasonable THP sizes.
 
+#
+# "PMD swap entry" in the page table is used both for migration and
+# actual swap.
+#
+config HAVE_PMD_SWAP_ENTRY
+   def_bool y
+   depends on THP_SWAP || ARCH_ENABLE_THP_MIGRATION
+
 config TRANSPARENT_HUGE_PAGECACHE
def_bool y
depends on TRANSPARENT_HUGEPAGE
-- 
2.16.4



Re: [LKP] [vfs] fd0002870b: BUG:KASAN:null-ptr-deref_in_n

2018-09-11 Thread Rong Chen




On 09/12/2018 04:29 AM, David Howells wrote:

kernel test robot  wrote:


[   18.568403]  nfs_fs_mount+0x901/0x1220

I don't suppose you can tell me what file and line number this corresponds to?

$ faddr2line vmlinux nfs_fs_mount+0x901
nfs_fs_mount+0x901/0x1218:
nfs_parse_devname at fs/nfs/super.c:1911
 (inlined by) nfs_validate_text_mount_data at fs/nfs/super.c:2187
 (inlined by) nfs_fs_mount at fs/nfs/super.c:2684



Also, can you tell me what the mount parameters were?  I'm not sure how to
extract them from the information provided.

qemu command (you could get from 'bin/lkp qemu -k  job-script'):

qemu-system-x86_64 -enable-kvm -fsdev 
local,id=test_dev,path=/home/nfs/.lkp//result/boot/1/vm-kbuild-1G/debian-x86_64-2018-04-03.cgz/x86_64-randconfig-r0-09070102/gcc-6/fd0002870b453c58d0d8c195954f5049bc6675fb/0,security_model=none 
-device virtio-9p-pci,fsdev=test_dev,mount_tag=9p/virtfs_mount -kernel 
vmlinuz-4.19.0-rc1-00104-gfd00028 -append root=/dev/ram0 user=lkp 
job=/lkp/jobs/scheduled/vm-kbuild-1G-11/boot-1-debian-x86_64-2018-04-03.cgz-fd0002870b453c58d0d8c195954f5049bc6675fb-20180910-6016-1hqt4et-1.yaml 
ARCH=x86_64 kconfig=x86_64-randconfig-r0-09070102 
branch=linux-devel/devel-hourly-2018090623 
commit=fd0002870b453c58d0d8c195954f5049bc6675fb 
BOOT_IMAGE=/pkg/linux/x86_64-randconfig-r0-09070102/gcc-6/fd0002870b453c58d0d8c195954f5049bc6675fb/vmlinuz-4.19.0-rc1-00104-gfd00028 
max_uptime=600 
RESULT_ROOT=/result/boot/1/vm-kbuild-1G/debian-x86_64-2018-04-03.cgz/x86_64-randconfig-r0-09070102/gcc-6/fd0002870b453c58d0d8c195954f5049bc6675fb/3 
LKP_LOCAL_RUN=1 debug apic=debug sysrq_always_enabled 
rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on 
panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 
prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err 
ignore_loglevel console=tty0 earlyprintk=ttyS0,115200 
console=ttyS0,115200 vga=normal rw  ip=dhcp 
result_service=9p/virtfs_mount -initrd /home/nfs/.lkp/cache/final_initrd 
-smp 2 -m 1024M -no-reboot -watchdog i6300esb -rtc base=localtime 
-device e1000,netdev=net0 -netdev user,id=net0 -display none -monitor 
null -serial stdio -device virtio-scsi-pci,id=scsi0 -drive 
file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-0,if=none,id=hd0,media=disk,aio=native,cache=none 
-device scsi-hd,bus=scsi0.0,drive=hd0,scsi-id=1,lun=0 -drive 
file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-1,if=none,id=hd1,media=disk,aio=native,cache=none 
-device scsi-hd,bus=scsi0.0,drive=hd1,scsi-id=1,lun=1 -drive 
file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-2,if=none,id=hd2,media=disk,aio=native,cache=none 
-device scsi-hd,bus=scsi0.0,drive=hd2,scsi-id=1,lun=2 -drive 
file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-3,if=none,id=hd3,media=disk,aio=native,cache=none 
-device scsi-hd,bus=scsi0.0,drive=hd3,scsi-id=1,lun=3 -drive 
file=/tmp/vdisk-nfs/disk-vm-kbuild-1G-11-4,if=none,id=hd4,media=disk,aio=native,cache=none 
-device scsi-hd,bus=scsi0.0,drive=hd4,scsi-id=1,lun=4


Best Regards,
Rong Chen



Thanks,
David
___
LKP mailing list
l...@lists.01.org
https://lists.01.org/mailman/listinfo/lkp




Inquiry

2018-09-11 Thread Sinara Group
Hello,

This is Daniel Murray and I am from Sinara Group Co., Ltd in Russia.
We are glad to know about your company from the web and we are interested in 
your products.
Could you kindly send us your Latest catalog and price list for our trial order.

Best Regards,

Daniel Murray
Purchasing Manager



