Re: [v2 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

2019-10-23 Thread Yang Shi




On 10/23/19 12:28 PM, Hugh Dickins wrote:

On Thu, 24 Oct 2019, Yang Shi wrote:


We have usecase to use tmpfs as QEMU memory backend and we would like to
take the advantage of THP as well.  But, our test shows the EPT is not
PMD mapped even though the underlying THP are PMD mapped on host.
The number showed by /sys/kernel/debug/kvm/largepage is much less than
the number of PMD mapped shmem pages as the below:

7f277820-7f287820 rw-s  00:14 262232 
/dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
Size:4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped:   579584 kB
[snip]
Locked:0 kB

cat /sys/kernel/debug/kvm/largepages
12

And some benchmarks do worse than with anonymous THPs.

By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when
setting up EPT map.  But, the _mapcount < 0 check doesn't fit to page
cache THP since every subpage of page cache THP would get _mapcount
inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns
false for page cache THP.  This would prevent KVM from setting up PMD
mapped EPT entry.

So we need handle page cache THP correctly.  However, when page cache
THP's PMD gets split, kernel just remove the map instead of setting up
PTE map like what anonymous THP does.  Before KVM calls get_user_pages()
the subpages may get PTE mapped even though it is still a THP since the
page cache THP may be mapped by other processes at the mean time.

Checking its _mapcount and whether the THP has PTE mapped or not.
Although this may report some false negative cases (PTE mapped by other
processes), it looks not trivial to make this accurate.

With this fix /sys/kernel/debug/kvm/largepage would show reasonable
pages are PMD mapped by EPT as the below:

7fbeaee0-7fbfaee0 rw-s  00:14 275464 
/dev/shm/qemu_back_mem.mem.SKUvat (deleted)
Size:4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped:   557056 kB
[snip]
Locked:0 kB

cat /sys/kernel/debug/kvm/largepages
271

And the benchmarks are as same as anonymous THPs.


Fixes: dd78fedde4b9 ("rmap: support file thp")


OK, though it might be the best blame target.




Signed-off-by: Yang Shi 
Reported-by: Gang Deng 
Tested-by: Gang Deng 
Suggested-by: Hugh Dickins 
Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Cc:  4.8+
---
v2: Adopted the suggestion from Hugh to use _mapcount and compound_mapcount.
 But I just open coding compound_mapcount to avoid duplicating the
 definition of compound_mapcount_ptr in two different files.  Since
 "compound_mapcount" looks self-explained so I'm supposed the open
 coding would not affect the readability.

No, relying on head[1] is not nice: Matthew's suggestion better.


Done in v3.




  include/linux/page-flags.h | 22 --
  1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb88..954a877 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -622,12 +622,30 @@ static inline int PageTransCompound(struct page *page)
   *
   * Unlike PageTransCompound, this is safe to be called only while
   * split_huge_pmd() cannot run from under us, like if protected by the
- * MMU notifier, otherwise it may result in page->_mapcount < 0 false
+ * MMU notifier, otherwise it may result in page->_mapcount check false
   * positives.
+ *
+ * We have to treat page cache THP differently since every subpage of it
+ * would get _mapcount inc'ed once it is PMD mapped.  But, it may be PTE
+ * mapped in the current process so comparing subpage's _mapcount to
+ * compound_mapcount ot filter out PTE mapped case.

s/ ot / to /


oops, will fix it.




   */
  static inline int PageTransCompoundMap(struct page *page)
  {
-   return PageTransCompound(page) && atomic_read(>_mapcount) < 0;
+   struct page *head;
+   int map_count;
+
+   if (!PageTransCompound(page))
+   return 0;
+
+   if (PageAnon(page))
+   return atomic_read(>_mapcount) < 0;
+
+   head = compound_head(page);
+   map_count = atomic_read(>_mapcount);
+   /* File THP is PMD mapped and not double mapped */

s/ double / PTE /


Will fix it.




+   return map_count >= 0 &&

You have added a map_count >= 0 test there. Okay, not wrong, but not
necessary, and not consistent with what's returned in the PageAnon
case (if this were called for an unmapped page).


I was thinking about this too. I'm wondering there might be a case that 
the PMD is split and it was the last PMD map, in this case subpage's 
_mapcount is also equal to compound_mapcount (both is -1). So, it would 
return true, then KVM may setup PMD map in EPT, but it might be PTE 
mapped later on the host. But, I'm not quite sure if this is really 
possible or if this is really a 

Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-23 Thread Alexei Starovoitov
On Wed, Oct 23, 2019 at 12:23:06PM -0400, Steven Rostedt wrote:
> On Tue, 22 Oct 2019 14:58:43 -0700
> Alexei Starovoitov  wrote:
> 
> > Neither of two statements are true. The per-function generated trampoline
> > I'm talking about is bpf specific. For a function with two arguments it's 
> > just:
> > push rbp 
> > mov rbp, rsp
> > push rdi
> > push rsi
> > lea  rdi,[rbp-0x10]
> > call jited_bpf_prog
> > pop rsi
> > pop rdi
> > leave
> > ret
> > 
> > fentry's nop is replaced with call to the above.
> > That's it.
> > kprobe and live patching has no use out of it.
> > 
> 
> Below is a patch that allows you to do this, and you don't need to
> worry about patching the nops. And it also allows to you hook directly
> to any function and still allow kprobes and tracing on those same
> functions (as long as they don't modify the ip, but in the future, we
> may be able to allow that too!). And this code does not depend on
> Peter's code either.
> 
> All you need to do is:
> 
>   register_ftrace_direct((unsigned long)func_you_want_to_trace,
>  (unsigned long)your_trampoline);
> 
> 
> I added to trace-event-samples.c:
> 
> void my_direct_func(raw_spinlock_t *lock)
> {
>   trace_printk("taking %p\n", lock);
> }
> 
> extern void my_tramp(void *);
> 
> asm (
> "   .pushsection.text, \"ax\", @progbits\n"
> "   my_tramp:"
> #if 1
> "   pushq %rbp\n"
> " movq %rsp, %rbp\n"
> " pushq %rdi\n"
> " call my_direct_func\n"
> " popq %rdi\n"
> " leave\n"
> #endif
> " ret\n"
> "   .popsection\n"
> );
> 
> 
> (the #if was for testing purposes)
> 
> And then in the module load and unload:
> 
>   ret = register_ftrace_direct((unsigned long)do_raw_spin_lock,
>(unsigned long)my_tramp);
> 
>   unregister_ftrace_direct((unsigned long)do_raw_spin_lock,
>(unsigned long)my_tramp);
> 
> respectively.
> 
> And what this does is if there's only a single callback to the
> registered function, it changes the nop in the function to call your
> trampoline directly (just like you want this to do!). But if we add
> another callback, a direct_ops ftrace_ops gets added to the list of the
> functions to go through, and this will set up the code to call your
> trampoline after it calls all the other callbacks.
> 
> The called trampoline will be called as if it was called directly from
> the nop.
> 
> OK, I wrote this up quickly, and it has some bugs, but nothing that
> can't be straighten out (specifically, the checks fail if you add a
> function trace to one of the direct callbacks, but this can be dealt
> with).
> 
> Note, the work needed to port this to other archs is rather minimal
> (just need to tweak the ftrace_regs_caller and have a way to pass back
> the call address via pt_regs that is not saved).
> 
> Alexei,
> 
> Would this work for you?

Yes!
Looks great. More comments below.

> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6adaf18b3365..de3372bd08ae 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -159,6 +159,7 @@ config X86
>   select HAVE_DYNAMIC_FTRACE
>   select HAVE_DYNAMIC_FTRACE_WITH_REGS
>   select HAVE_DYNAMIC_FTRACE_WITH_IPMODIFY
> + select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>   select HAVE_EBPF_JIT
>   select HAVE_EFFICIENT_UNALIGNED_ACCESS
>   select HAVE_EISA
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index c38a1576..34da1e424391 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -28,6 +28,12 @@ static inline unsigned long ftrace_call_adjust(unsigned 
> long addr)
>   return addr;
>  }
>  
> +static inline void ftrace_set_call_func(struct pt_regs *regs, unsigned long 
> addr)
> +{
> + /* Emulate a call */
> + regs->orig_ax = addr;

This probably needs a longer comment :)

> + .if \make_call
> + movq ORIG_RAX(%rsp), %rax
> + movq %rax, MCOUNT_REG_SIZE-8(%rsp)

reading asm helps.

> +config HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> + bool
> +
>  config HAVE_FTRACE_MCOUNT_RECORD
>   bool
>   help
> @@ -565,6 +568,11 @@ config DYNAMIC_FTRACE_WITH_IPMODIFY_ONLY
>   depends on DYNAMIC_FTRACE
>   depends on HAVE_DYNAMIC_FTRACE_WITH_IPMODIFY
>  
> +config DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> + def_bool y
> + depends on DYNAMIC_FTRACE
> + depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS

It seems to me that it's a bit of overkill to add new config knob
for every ftrace feature.
HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS (that arch defined) would
be enough to check and return error in register_ftrace_direct()
right?

> -static struct ftrace_hash *
> -__ftrace_hash_move(struct ftrace_hash *src)
> +static void transfer_hash(struct ftrace_hash *dst, struct ftrace_hash *src)
>  {
>   struct ftrace_func_entry *entry;
> - struct hlist_node *tn;
>   struct hlist_head *hhd;
> + struct hlist_node 

Re: [PATCH] [netfilter]: Fix skb->csum calculation when netfilter manipulation for NF_NAT_MANIP_SRC\DST is done on IPV6 packet.

2019-10-23 Thread Florian Westphal
Praveen Chaudhary  wrote:
> Update skb->csum, when netfilter code updates IPV6 SRC\DST address in IPV6 
> HEADER due to iptable rule.
> 
> Signed-off-by: Praveen Chaudhary 
> Signed-off-by: Zhenggen Xu 
> Signed-off-by: Andy Stracner 
> ---
>  include/net/checksum.h   |  2 ++
>  net/core/utils.c | 13 +
>  net/netfilter/nf_nat_proto.c |  2 ++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/include/net/checksum.h b/include/net/checksum.h
> index 97bf488..d7d28b7 100644
> --- a/include/net/checksum.h
> +++ b/include/net/checksum.h
> @@ -145,6 +145,8 @@ void inet_proto_csum_replace4(__sum16 *sum, struct 
> sk_buff *skb,
>  void inet_proto_csum_replace16(__sum16 *sum, struct sk_buff *skb,
>  const __be32 *from, const __be32 *to,
>  bool pseudohdr);
> +void inet_proto_skb_csum_replace16(struct sk_buff *skb,
> +const __be32 *from, const __be32 *to);
>  void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
>__wsum diff, bool pseudohdr);
>  
> diff --git a/net/core/utils.c b/net/core/utils.c
> index 6b6e51d..ab3284b 100644
> --- a/net/core/utils.c
> +++ b/net/core/utils.c
> @@ -458,6 +458,19 @@ void inet_proto_csum_replace16(__sum16 *sum, struct 
> sk_buff *skb,
>  }
>  EXPORT_SYMBOL(inet_proto_csum_replace16);
>  
> +void inet_proto_skb_csum_replace16(struct sk_buff *skb,
> +const __be32 *from, const __be32 *to)
> +{
> + __be32 diff[] = {
> + ~from[0], ~from[1], ~from[2], ~from[3],
> + to[0], to[1], to[2], to[3],
> + };
> + if (skb->ip_summed == CHECKSUM_COMPLETE)
> + skb->csum = csum_partial(diff, sizeof(diff),
> +   skb->csum);
> +}
> +EXPORT_SYMBOL(inet_proto_skb_csum_replace16);
> +
>  void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
>__wsum diff, bool pseudohdr)
>  {
> diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
> index 0a59c14..de94590 100644
> --- a/net/netfilter/nf_nat_proto.c
> +++ b/net/netfilter/nf_nat_proto.c
> @@ -467,6 +467,8 @@ static void nf_nat_ipv6_csum_update(struct sk_buff *skb,
>   }
>   inet_proto_csum_replace16(check, skb, oldip->s6_addr32,
> newip->s6_addr32, true);
> + inet_proto_skb_csum_replace16(skb, oldip->s6_addr32,
> +   newip->s6_addr32);

This is confusing.

You're saying that inet_proto_csum_replace16() is producing a wrong
skb->csum.  So why are you adding a new function to do the
csum calculation instead of fixing inet_proto_csum_replace16() to do
the right thing?


Re: [PATCH v2 4/4] jump_label,module: Fix module lifetime for __jump_label_mod_text_reserved

2019-10-23 Thread Steven Rostedt
On Mon, 07 Oct 2019 10:25:45 +0200
Peter Zijlstra  wrote:

> Nothing ensures the module exists while we're iterating
> mod->jump_entries in __jump_label_mod_text_reserved(), take a module
> reference to ensure the module sticks around.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

Reviewed-by: Steven Rostedt (VMware) 

-- Steve

> ---
>  kernel/jump_label.c |   10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> --- a/kernel/jump_label.c
> +++ b/kernel/jump_label.c
> @@ -539,19 +539,25 @@ static void static_key_set_mod(struct st
>  static int __jump_label_mod_text_reserved(void *start, void *end)
>  {
>   struct module *mod;
> + int ret;
>  
>   preempt_disable();
>   mod = __module_text_address((unsigned long)start);
>   WARN_ON_ONCE(__module_text_address((unsigned long)end) != mod);
> + if (!try_module_get(mod))
> + mod = NULL;
>   preempt_enable();
>  
>   if (!mod)
>   return 0;
>  
> -
> - return __jump_label_text_reserved(mod->jump_entries,
> + ret = __jump_label_text_reserved(mod->jump_entries,
>   mod->jump_entries + mod->num_jump_entries,
>   start, end);
> +
> + module_put(mod);
> +
> + return ret;
>  }
>  
>  static void __jump_label_mod_update(struct static_key *key)
> 



[v3 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

2019-10-23 Thread Yang Shi
We have usecase to use tmpfs as QEMU memory backend and we would like to
take the advantage of THP as well.  But, our test shows the EPT is not
PMD mapped even though the underlying THP are PMD mapped on host.
The number showed by /sys/kernel/debug/kvm/largepage is much less than
the number of PMD mapped shmem pages as the below:

7f277820-7f287820 rw-s  00:14 262232 
/dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
Size:4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped:   579584 kB
[snip]
Locked:0 kB

cat /sys/kernel/debug/kvm/largepages
12

And some benchmarks do worse than with anonymous THPs.

By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when
setting up EPT map.  But, the _mapcount < 0 check doesn't fit to page
cache THP since every subpage of page cache THP would get _mapcount
inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns
false for page cache THP.  This would prevent KVM from setting up PMD
mapped EPT entry.

So we need handle page cache THP correctly.  However, when page cache
THP's PMD gets split, kernel just remove the map instead of setting up
PTE map like what anonymous THP does.  Before KVM calls get_user_pages()
the subpages may get PTE mapped even though it is still a THP since the
page cache THP may be mapped by other processes at the mean time.

Checking its _mapcount and whether the THP has PTE mapped or not.
Although this may report some false negative cases (PTE mapped by other
processes), it looks not trivial to make this accurate.

With this fix /sys/kernel/debug/kvm/largepage would show reasonable
pages are PMD mapped by EPT as the below:

7fbeaee0-7fbfaee0 rw-s  00:14 275464 
/dev/shm/qemu_back_mem.mem.SKUvat (deleted)
Size:4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped:   557056 kB
[snip]
Locked:0 kB

cat /sys/kernel/debug/kvm/largepages
271

And the benchmarks are as same as anonymous THPs.

Signed-off-by: Yang Shi 
Reported-by: Gang Deng 
Tested-by: Gang Deng 
Suggested-by: Hugh Dickins 
Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Cc: Matthew Wilcox 
Cc:  4.8+
---
v3: Took the suggestion from Willy to move the definition of
compound_mapcount_ptr to mm_types.h in order to use it instead of
open coding it.
v2: Adopted the suggestion from Hugh to use _mapcount and compound_mapcount.
But I just open coding compound_mapcount to avoid duplicating the
definition of compound_mapcount_ptr in two different files.  Since
"compound_mapcount" looks self-explained so I'm supposed the open
coding would not affect the readability.

 include/linux/mm.h |  5 -
 include/linux/mm_types.h   |  5 +
 include/linux/page-flags.h | 22 --
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc29227..a2adf95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -695,11 +695,6 @@ static inline void *kvcalloc(size_t n, size_t size, gfp_t 
flags)
 
 extern void kvfree(const void *addr);
 
-static inline atomic_t *compound_mapcount_ptr(struct page *page)
-{
-   return [1].compound_mapcount;
-}
-
 static inline int compound_mapcount(struct page *page)
 {
VM_BUG_ON_PAGE(!PageCompound(page), page);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fa7..270aa8f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -221,6 +221,11 @@ struct page {
 #endif
 } _struct_page_alignment;
 
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+   return [1].compound_mapcount;
+}
+
 /*
  * Used for sizing the vmemmap region on some architectures
  */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb88..e0f2d53 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -622,12 +622,30 @@ static inline int PageTransCompound(struct page *page)
  *
  * Unlike PageTransCompound, this is safe to be called only while
  * split_huge_pmd() cannot run from under us, like if protected by the
- * MMU notifier, otherwise it may result in page->_mapcount < 0 false
+ * MMU notifier, otherwise it may result in page->_mapcount check false
  * positives.
+ *
+ * We have to treat page cache THP differently since every subpage of it
+ * would get _mapcount inc'ed once it is PMD mapped.  But, it may be PTE
+ * mapped in the current process so comparing subpage's _mapcount to
+ * compound_mapcount ot filter out PTE mapped case.
  */
 static inline int PageTransCompoundMap(struct page *page)
 {
-   return PageTransCompound(page) && atomic_read(>_mapcount) < 0;
+   struct page *head;
+   int map_count;
+
+   if (!PageTransCompound(page))
+   return 0;
+
+   if (PageAnon(page))
+   

Re: [v2 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

2019-10-23 Thread Hugh Dickins
On Thu, 24 Oct 2019, Yang Shi wrote:

> We have usecase to use tmpfs as QEMU memory backend and we would like to
> take the advantage of THP as well.  But, our test shows the EPT is not
> PMD mapped even though the underlying THP are PMD mapped on host.
> The number showed by /sys/kernel/debug/kvm/largepage is much less than
> the number of PMD mapped shmem pages as the below:
> 
> 7f277820-7f287820 rw-s  00:14 262232 
> /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
> Size:4194304 kB
> [snip]
> AnonHugePages: 0 kB
> ShmemPmdMapped:   579584 kB
> [snip]
> Locked:0 kB
> 
> cat /sys/kernel/debug/kvm/largepages
> 12
> 
> And some benchmarks do worse than with anonymous THPs.
> 
> By digging into the code we figured out that commit 127393fbe597 ("mm:
> thp: kvm: fix memory corruption in KVM with THP enabled") checks if
> there is a single PTE mapping on the page for anonymous THP when
> setting up EPT map.  But, the _mapcount < 0 check doesn't fit to page
> cache THP since every subpage of page cache THP would get _mapcount
> inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns
> false for page cache THP.  This would prevent KVM from setting up PMD
> mapped EPT entry.
> 
> So we need handle page cache THP correctly.  However, when page cache
> THP's PMD gets split, kernel just remove the map instead of setting up
> PTE map like what anonymous THP does.  Before KVM calls get_user_pages()
> the subpages may get PTE mapped even though it is still a THP since the
> page cache THP may be mapped by other processes at the mean time.
> 
> Checking its _mapcount and whether the THP has PTE mapped or not.
> Although this may report some false negative cases (PTE mapped by other
> processes), it looks not trivial to make this accurate.
> 
> With this fix /sys/kernel/debug/kvm/largepage would show reasonable
> pages are PMD mapped by EPT as the below:
> 
> 7fbeaee0-7fbfaee0 rw-s  00:14 275464 
> /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
> Size:4194304 kB
> [snip]
> AnonHugePages: 0 kB
> ShmemPmdMapped:   557056 kB
> [snip]
> Locked:0 kB
> 
> cat /sys/kernel/debug/kvm/largepages
> 271
> 
> And the benchmarks are as same as anonymous THPs.
> 

Fixes: dd78fedde4b9 ("rmap: support file thp")

> Signed-off-by: Yang Shi 
> Reported-by: Gang Deng 
> Tested-by: Gang Deng 
> Suggested-by: Hugh Dickins 
> Cc: Andrea Arcangeli 
> Cc: Kirill A. Shutemov 
> Cc:  4.8+
> ---
> v2: Adopted the suggestion from Hugh to use _mapcount and compound_mapcount.
> But I just open coding compound_mapcount to avoid duplicating the
> definition of compound_mapcount_ptr in two different files.  Since
> "compound_mapcount" looks self-explained so I'm supposed the open
> coding would not affect the readability.

No, relying on head[1] is not nice: Matthew's suggestion better.

> 
>  include/linux/page-flags.h | 22 --
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f91cb88..954a877 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -622,12 +622,30 @@ static inline int PageTransCompound(struct page *page)
>   *
>   * Unlike PageTransCompound, this is safe to be called only while
>   * split_huge_pmd() cannot run from under us, like if protected by the
> - * MMU notifier, otherwise it may result in page->_mapcount < 0 false
> + * MMU notifier, otherwise it may result in page->_mapcount check false
>   * positives.
> + *
> + * We have to treat page cache THP differently since every subpage of it
> + * would get _mapcount inc'ed once it is PMD mapped.  But, it may be PTE
> + * mapped in the current process so comparing subpage's _mapcount to
> + * compound_mapcount ot filter out PTE mapped case.

s/ ot / to /

>   */
>  static inline int PageTransCompoundMap(struct page *page)
>  {
> - return PageTransCompound(page) && atomic_read(>_mapcount) < 0;
> + struct page *head;
> + int map_count;
> +
> + if (!PageTransCompound(page))
> + return 0;
> +
> + if (PageAnon(page))
> + return atomic_read(>_mapcount) < 0;
> +
> + head = compound_head(page);
> + map_count = atomic_read(>_mapcount);
> + /* File THP is PMD mapped and not double mapped */

s/ double / PTE /

> + return map_count >= 0 &&

You have added a map_count >= 0 test there. Okay, not wrong, but not
necessary, and not consistent with what's returned in the PageAnon
case (if this were called for an unmapped page).

But asking for consistency in this function is asking for too much.
It is *very* specialized to the particular places from which it is
called (doesn't really belong in page-flags.h at all), and just so
long as it gives them the right answer most of the time, and errs
on the safe side the rest of the time, it'll do.

(I don't know if it's ever called on a tail page: 

Re: [PATCH] mm: gup: fix comment of __get_user_pages()

2019-10-23 Thread John Hubbard
On 10/23/19 6:51 AM, Liu Xiang wrote:
> Because nr_pages is unsigned long, it can not be negative.
> 
> Signed-off-by: Liu Xiang 
> ---
>  mm/gup.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a3..0236954 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -735,10 +735,10 @@ static int check_vma_flags(struct vm_area_struct *vma, 
> unsigned long gup_flags)
>   * @nonblocking: whether waiting for disk IO or mmap_sem contention
>   *
>   * Returns number of pages pinned. This may be fewer than the number
> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
> - * were pinned, returns -errno. Each page returned must be released
> - * with a put_page() call when it is finished with. vmas will only
> - * remain valid while mmap_sem is held.
> + * requested. If nr_pages is 0, returns 0. If no pages were pinned,
> + * returns -errno. Each page returned must be released with a
> + * put_page() call when it is finished with. vmas will only remain
> + * valid while mmap_sem is held.
>   *
>   * Must be called with mmap_sem held.  It may be released.  See below.
>   *
> 

Hi Liu,

Thanks for fixing the documentation! As long as you're there...for the actual 
wording, can we please do it as shown below? This also addresses David's 
feedback, and it makes this read a lot better overall:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..5816876fee51 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -734,11 +734,17 @@ static int check_vma_flags(struct vm_area_struct *vma, 
unsigned long gup_flags)
  * Or NULL if the caller does not require them.
  * @nonblocking: whether waiting for disk IO or mmap_sem contention
  *
- * Returns number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno. Each page returned must be released
- * with a put_page() call when it is finished with. vmas will only
- * remain valid while mmap_sem is held.
+ * Returns either number of pages pinned (which may be less than the number
+ * requested), or an error. Details about the return value:
+ *
+ *  -- If nr_pages is 0, returns 0.
+ *  -- If nr_pages is >0, but no pages were pinned, returns -errno.
+ *  -- If nr_pages is >0, and some pages were pinned, returns the number of
+ * pages pinned. Again, this may be less than nr_pages.
+ *
+ * The caller is responsible for releasing returned @pages, via put_page().
+ *
+ * @vmas are valid only as long as mmap_sem is held.
  *
  * Must be called with mmap_sem held.  It may be released.  See below.
  *


Patch ordering: I'll have to change the above as part of my upcoming series, to
make it refer to "put_page() or put_user_page(), depending on FOLL_PIN", 
approximately. (Just more of a note to self than anything else, here.)

thanks,

John Hubbard
NVIDIA


Re: [PATCH v2 2/4] module: Fix up module_notifier return values.

2019-10-23 Thread Steven Rostedt
On Mon, 07 Oct 2019 10:25:43 +0200
Peter Zijlstra  wrote:

>  kernel/trace/trace.c   |2 +-
>  kernel/trace/trace_events.c|2 +-
>  kernel/trace/trace_printk.c|4 ++--
>  kernel/tracepoint.c|2 +-

Acked-by: Steven Rostedt (VMware) 

-- Steve


[PATCH] PCI/DPC: Add pcie_ports=dpc-native parameter to bring back old behavior

2019-10-23 Thread Olof Johansson
In commit eed85ff4c0da7 ("PCI/DPC: Enable DPC only if AER is available"),
the behavior was changed such that native (kernel) handling of DPC
got tied to whether the kernel also handled AER. While this is what
the standard recommends, there are BIOSes out there that lack the DPC
handling since it was never required in the past.

To make DPC still work on said platforms the same way they did before,
add a "pcie_ports=dpc-native" kernel parameter that can be passed in
if needed, while keeping defaults unchanged.

Signed-off-by: Olof Johansson 
---
 Documentation/admin-guide/kernel-parameters.txt | 1 +
 drivers/pci/pcie/dpc.c  | 2 +-
 drivers/pci/pcie/portdrv_core.c | 7 ++-
 drivers/pci/pcie/portdrv_pci.c  | 8 
 include/linux/pci.h | 2 ++
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 5e27d74e2b74b..e0325421980aa 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3548,6 +3548,7 @@
even if the platform doesn't give the OS permission to
use them.  This may cause conflicts if the platform
also tries to use these services.
+   dpc-native  Use native PCIe service for DPC, but none other.
compat  Disable native PCIe services (PME, AER, DPC, PCIe
hotplug).
 
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index a32ec3487a8d0..e06f42f58d3d4 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -291,7 +291,7 @@ static int dpc_probe(struct pcie_device *dev)
int status;
u16 ctl, cap;
 
-   if (pcie_aer_get_firmware_first(pdev))
+   if (pcie_aer_get_firmware_first(pdev) && !pcie_ports_dpc_native)
return -ENOTSUPP;
 
dpc = devm_kzalloc(device, sizeof(*dpc), GFP_KERNEL);
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 1b330129089fe..c24bf6cac4186 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -250,8 +250,13 @@ static int get_port_device_capability(struct pci_dev *dev)
pcie_pme_interrupt_enable(dev, false);
}
 
+   /*
+* With dpc-native, we set it if AER is available, even if AER is
+* handled by firmware.
+*/
if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
-   pci_aer_available() && services & PCIE_PORT_SERVICE_AER)
+   pci_aer_available() &&
+   (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))
services |= PCIE_PORT_SERVICE_DPC;
 
if (pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 0a87091a0800e..b415656519a73 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -29,12 +29,20 @@ bool pcie_ports_disabled;
  */
 bool pcie_ports_native;
 
+/*
+ * If the user specified "pcie_ports=dpc-native", use the PCIe services
+ * for DPC, but cuse platform defaults for the others.
+ */
+bool pcie_ports_dpc_native;
+
 static int __init pcie_port_setup(char *str)
 {
if (!strncmp(str, "compat", 6))
pcie_ports_disabled = true;
else if (!strncmp(str, "native", 6))
pcie_ports_native = true;
+   else if (!strncmp(str, "dpc-native", 10))
+   pcie_ports_dpc_native = true;
 
return 1;
 }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index fc1844061e88f..603d4822757b6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1534,9 +1534,11 @@ static inline int pci_irqd_intx_xlate(struct irq_domain 
*d,
 #ifdef CONFIG_PCIEPORTBUS
 extern bool pcie_ports_disabled;
 extern bool pcie_ports_native;
+extern bool pcie_ports_dpc_native;
 #else
 #define pcie_ports_disabledtrue
 #define pcie_ports_native  false
+#define pcie_ports_dpc_native  false
 #endif
 
 #define PCIE_LINK_STATE_L0SBIT(0)
-- 
2.11.0



Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Andy Lutomirski
On Wed, Oct 23, 2019 at 12:10 PM Andrea Arcangeli  wrote:
>
> Hello,
>
> On Sat, Oct 12, 2019 at 06:14:23PM -0700, Andy Lutomirski wrote:
> > [adding more people because this is going to be an ABI break, sigh]
>
> That wouldn't break the ABI, no more than when if you boot a kernel
> built with CONFIG_USERFAULTFD=n.
>
> All non-cooperative features can be removed any time in a backwards
> compatible way, the only precaution is to mark their feature bits as
> reserved so they can't be reused for something else later.
>
> > least severely restricted.  A .read implementation MUST NOT ACT ON THE
> > CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> > as stdin to a setuid program.
>
> With UFFD_EVENT_FORK, the newly created uffd that controls the child,
> is not passed to the parent nor to the child. Instead it's passed to
> the CRIU monitor only, which has to be already running as root and is
> fully trusted and acts a hypervisor (despite there is no hypervisor).
>
> By the time execve runs and any suid bit in the execve'd inode becomes
> relevant, well before the new userland executable code can run, the
> kernel throws away the "old_mm" controlled by any uffd and all
> attached uffds are released as well.
>
> All I found is your "A .read implementation MUST NOT ACT ON THE
> CALLING TASK" as an explanation that something is broken but I need
> further clarification.

There are two things going on here.

1. Daniel wants to add LSM labels to userfaultfd objects.  This seems
reasonable to me.  The question, as I understand it, is: who is the
subject that creates a uffd referring to a forked child?  I'm sure
this is solvable in any number of straightforward ways, but I think
it's less important than:

2. The existing ABI is busted independently of #1.  Suppose you call
userfaultfd to get a userfaultfd and enable UFFD_FEATURE_EVENT_FORK.
Then you do:

$ sudo <&[userfaultfd number]

Sudo will read it and get a new fd unexpectedly added to its fd table.
It's worse if SCM_RIGHTS is involved.

So I think we either need to declare that UFFD_FEATURE_EVENT_FORK is
only usable by global root or we need to remove it and maybe re-add it
in some other form.


--Andy


Re: [PATCH] x86/dumpstack/64: Don't evaluate exception stacks before setup

2019-10-23 Thread Josh Poimboeuf
On Wed, Oct 23, 2019 at 08:05:49PM +0200, Thomas Gleixner wrote:
> Cyrill reported the following crash:
> 
>   BUG: unable to handle page fault for address: 1ff0
>   #PF: supervisor read access in kernel mode
>   RIP: 0010:get_stack_info+0xb3/0x148
> 
> It turns out that if the stack tracer is invoked before the exception stack
> mappings are initialized in_exception_stack() can erroneously classify an
> invalid address as an address inside of an exception stack:
> 
> begin = this_cpu_read(cea_exception_stacks);  <- 0
> end = begin + sizeof(exception stacks);
> 
> i.e. any address between 0 and end will be considered as exception stack
> address and the subsequent code will then try to derefence the resulting
> stack frame at a non mapped address.
> 
>  end = begin + (unsigned long)ep->size;
>  ==> end = 0x2000
> 
>  regs = (struct pt_regs *)end - 1;
>  ==> regs = 0x2000 - sizeof(struct pt_regs *) = 0x1ff0
> 
>  info->next_sp   = (unsigned long *)regs->sp;
>  ==> Crashes due to accessing 0x1ff0
> 
> Prevent this by checking the validity of the cea_exception_stack base
> address and bailing out if it is zero.
> 
> Fixes: afcd21dad88b ("x86/dumpstack/64: Use cpu_entry_area instead of 
> orig_ist")
> Reported-by: Cyrill Gorcunov 
> Signed-off-by: Thomas Gleixner 
> Tested-by: Cyrill Gorcunov 
> Cc: sta...@vger.kernel.org

Acked-by: Josh Poimboeuf 

-- 
Josh



Re: [PATCH 2/3] usb, kcov: collect coverage from hub_event

2019-10-23 Thread kbuild test robot
Hi Andrey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.4-rc4 next-20191023]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Andrey-Konovalov/kcov-collect-coverage-from-usb-and-vhost/20191023-185245
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
3b7c59a1950c75f2c0152e5a9cd77675b09233d6
config: s390-allmodconfig (attached as .config)
compiler: s390-linux-gcc (GCC) 7.4.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.4.0 make.cross ARCH=s390 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

>> ERROR: "kcov_remote_stop" [drivers/usb/core/usbcore.ko] undefined!
>> ERROR: "kcov_remote_start" [drivers/usb/core/usbcore.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Andrea Arcangeli
Hello,

On Sat, Oct 12, 2019 at 06:14:23PM -0700, Andy Lutomirski wrote:
> [adding more people because this is going to be an ABI break, sigh]

That wouldn't break the ABI, no more than when if you boot a kernel
built with CONFIG_USERFAULTFD=n.

All non-cooperative features can be removed any time in a backwards
compatible way, the only precaution is to mark their feature bits as
reserved so they can't be reused for something else later.

> least severely restricted.  A .read implementation MUST NOT ACT ON THE
> CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> as stdin to a setuid program.

With UFFD_EVENT_FORK, the newly created uffd that controls the child,
is not passed to the parent nor to the child. Instead it's passed to
the CRIU monitor only, which has to be already running as root and is
fully trusted and acts a hypervisor (despite there is no hypervisor).

By the time execve runs and any suid bit in the execve'd inode becomes
relevant, well before the new userland executable code can run, the
kernel throws away the "old_mm" controlled by any uffd and all
attached uffds are released as well.

All I found is your "A .read implementation MUST NOT ACT ON THE
CALLING TASK" as an explanation that something is broken but I need
further clarification.

Of course I can see you can always open a uffd and pass it to any task
you are going to execve on, but that simply means the suid program
will be able to control you, not the other way around. If you don't
want to be controlled by the next task, no matter if suid or not, just
don't that. What I don't see is how you're going to control the suid
binary from the outside, the suid binary at most will block in the
poll, read and write syscalls and get garbage or write some garbage
and get an error, it won't get signals and it cannot block in any page
fault either, it's not immediately clear what's out of ordinary.

On Mon, Oct 14, 2019 at 06:04:22PM +0200, Jann Horn wrote:
> FWIW, 
> 
> just shows the kernel, kernel selftests, and strace code for decoding
> syscall arguments. CRIU uses it though (probably for postcopy live
> migration / lazy migration?), I guess that code isn't in debian for
> some reason.

https://criu.org/Userfaultfd#Limitations

The CRIU developers did a truly amazing job by making container post
copy live migration work great for a subset of apps, that alone was an
amazing achievement. Is that achievement enough to use post copy live
migration of bare metal containers in production? Unfortunately
probably not and not just in debian.

If you're wrong and UFFDIO_EVENT_FORK isn't currently buggy and in
turn it isn't causing further maintenance burden, there is no hurry of
removing them, but in the long term, if none of the non-cooperative
features find its way in production (like it was reasonable to expect
initially), they must be removed from the kernel anyway, not just
UFFD_EVEN_FORK but all non-cooperative features associated with it.

In my view the kernel is complex enough that we can't keep in the
kernel anything that isn't actively used in production.

If you're right and UFFDIO_EVENT_FORK is flawed beyond repair well
then we should remove all non cooperative features right now.

Or is someone out there using CRIU --lazy-pages in production?

Virtual machine machine postcopy live migration is shipped in
production for years and it's fully reliable and by design it cannot
suffer from any of the above limitations.

In my view there's simply no justification not to use virtual machines
when the alternative requires so much more code to be written and so
much more complexity to be dealt with.

However the higher complexity happened in lots areas of the kernel
already where things got extremely complex just to avoid using virtual
machines, despite the end result is less secure, for the only sake of
slightly higher consolidation (especially if you ignore the existence
millisecond guest boot time, direct mapped pmem nvdimm, virtfs and
free page hinting).

The non-cooperative features of userfaultfd in principle aren't
fundamentally different from other cgroup vs KVM tradeoffs, so 1) it
wasn't apparent they wouldn't be used in production eventually and 2)
it didn't sound fair enough not to give a chance to bare metal
containers to achieve feature parity with VM (again with a much higher
code complexity, but that was to be expected and it has apparently
been dealt with in other areas with more or less satisfactory
results).

While at it, as far as userfaultfd is concerned I'd rather see people
spend time to write a malloc library that uses userfaultfd with the
UFFD_FEATURE_SIGBUS features and it replaces mmap with UFFDIO_ZEROPAGE
(plus adding the THP accelleration currently missing) and munmap with
MADV_DONTNEED to do allocations and freeing of memory with full
scalability without ever hitting on the map sem for writing. This is
already possible without 

Re: [PATCH] ASoC: fsl_asrc: refine the setting of internal clock divider

2019-10-23 Thread Nicolin Chen
On Wed, Oct 23, 2019 at 06:25:20AM +, S.j. Wang wrote:
> > On Thu, Oct 17, 2019 at 02:21:08PM +0800, Shengjiu Wang wrote:
> > > For P2P output, the output divider should align with the output sample
> > 
> > I think we should avoid "P2P" (or "M2M") keyword in the mainline code as
> > we know M2M will never get merged while somebody working with the
> > mainline and caring about new feature might be confused.
> 
> Ok. But we still curious that is there a way to upstream m2m?

Hmm..I would love to see that happening. Here is an old discussion
that you may want to take a look:
https://mailman.alsa-project.org/pipermail/alsa-devel/2014-May/076797.html

> > It makes sense to me, yet I feel that the delay at the beginning of the 
> > audio
> > playback might be longer as a compromise. I am okay with this decision
> > though...
> > 
> > > The maximum divider of asrc clock is 1024, but there is no judgement
> > > for this limitaion in driver, which may cause the divider setting not
> > > correct.
> > >
> > > For non-ideal ratio mode, the clock rate should divide the sample rate
> > > with no remainder, and the quotient should be less than 1024.
> > >
> > > Signed-off-by: Shengjiu Wang 

> > > @@ -351,7 +352,9 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair
> > *pair)
> > >   /* We only have output clock for ideal ratio mode */
> > >   clk = asrc_priv->asrck_clk[clk_index[ideal ? OUT : IN]];
> > >
> > > - div[IN] = clk_get_rate(clk) / inrate;
> > > + clk_rate = clk_get_rate(clk);
> > 
> > The fsl_asrc.c file has config.inclk being set to INCLK_NONE and this sets 
> > the
> > "ideal" in this function to true. So, although we tend to not use ideal 
> > ratio
> > setting for p2p cases, yet the input clock is still not physically 
> > connected, so
> > we still use output clock for div[IN] calculation?
> 
> For p2p case, it can be ideal or non-ideal.  For non-ideal, we still use
> Output clock for div calculation.
> 
> > 
> > I am thinking something simplier: if we decided not to use ideal ratio for
> > "P2P", instead of adding "bool p2p" with the confusing "ideal" in this
> > function, could we just set config.inclk to the same clock as the output one
> > for "P2P"? By doing so, "P2P" won't go through ideal ratio mode while still
> > having a clock rate from the output clock for div[IN] calculation here.
> 
> Bool p2p is to force output rate to be sample rate, no impact to ideal
> Ratio mode.

I just realized that the function has a bottom part for ideal mode
exclusively -- if we treat p2p as !ideal, those configurations will
be missing. So you're right, should have an extra boolean variable.

> > 
> > > + rem[IN] = do_div(clk_rate, inrate);
> > > + div[IN] = (u32)clk_rate;
> > >   if (div[IN] == 0) {
> > 
> > Could we check div[IN] and rem[IN] here? Like:
> > if (div[IN] == 0 || div[IN] > 1024) {
> > pair_err();
> > goto out;
> > }
> > 
> > if (!ideal && rem[IN]) {
> > pair_err();
> > goto out;
> > }
> > 
> > According to your commit log, I think the max-1024 limitation should be
> > applied to all cases, not confined to "!ideal" cases right? And we should
> > add some comments also, indicating it is limited by hardware.
> 
> For ideal mode,  my test result is  the divider not impact the output result.
> Which means it is ok for ideal mode even divider is not correct... 

OK.

> > 
> > >   pair_err("failed to support input sample rate %dHz by
> > asrck_%x\n",
> > >   inrate, clk_index[ideal ? OUT : IN]); @@
> > > -360,11 +363,20 @@ static int fsl_asrc_config_pair(struct
> > > fsl_asrc_pair *pair)
> > >
> > >   clk = asrc_priv->asrck_clk[clk_index[OUT]];
> > >
> > > - /* Use fixed output rate for Ideal Ratio mode (INCLK_NONE) */
> > > - if (ideal)
> > > - div[OUT] = clk_get_rate(clk) / IDEAL_RATIO_RATE;
> > > - else
> > > - div[OUT] = clk_get_rate(clk) / outrate;
> > > + /*
> > > +  * When P2P mode, output rate should align with the out samplerate.
> > > +  * if set too high output rate, there will be lots of Overload.
> > > +  * When M2M mode, output rate should also need to align with the
> > > + out
> > 
> > For this "should", do you actually mean "M2M could also"? Sorry, I'm just
> > trying to understand everyting here, not intentionally being picky at words.
> > My understanding is that we still keep the ideal ratio setting because
> > "M2M" still uses it.
> 
> We use IDEAL_RATIO_RATE as output rate for m2m mode, it likes a
> Tricky operation, in order to improve the performance. I think
> The correct operation is to use the real output rate, but the performance
> Is bad.  So it is a compromise.

I see.

> > 
> > > +  * samplerate, but M2M must use less time to achieve good
> > performance.
> > > +  */
> > > + clk_rate = clk_get_rate(clk);
> > > + if (p2p || !ideal) {

Re: [PATCH resend] Add touchscreen platform data for the Schneider SCT101CTM tablet

2019-10-23 Thread Andy Shevchenko
On Wed, Oct 23, 2019 at 9:53 PM Hans de Goede  wrote:
>
> From: Daniel Gorbea Ainz 
>
> Add touchscreen platform data for the Schneider SCT101CTM tablet

Thanks, now patchwork sees it.

> Signed-off-by: Daniel Gorbea 

> Reviewed-by: Hans de Goede 
> Signed-off-by: Hans de Goede 

I'm not sure you need to put Rb when you have your SoB.
Do you think it's fine if I remove Rb when applying?

> ---
> hdegoede: Resend from my email address as vger.kernel.org does not like
> Daniel's emails
> ---
>  drivers/platform/x86/touchscreen_dmi.c | 26 ++
>  1 file changed, 26 insertions(+)
>
> diff --git a/drivers/platform/x86/touchscreen_dmi.c 
> b/drivers/platform/x86/touchscreen_dmi.c
> index 8bfef880e216..ba494ace83d4 100644
> --- a/drivers/platform/x86/touchscreen_dmi.c
> +++ b/drivers/platform/x86/touchscreen_dmi.c
> @@ -549,6 +549,24 @@ static const struct ts_dmi_data 
> pov_mobii_wintab_p1006w_v10_data = {
> .properties = pov_mobii_wintab_p1006w_v10_props,
>  };
>
> +static const struct property_entry schneider_sct101ctm_props[] = {
> +   PROPERTY_ENTRY_U32("touchscreen-size-x", 1715),
> +   PROPERTY_ENTRY_U32("touchscreen-size-y", 1140),
> +   PROPERTY_ENTRY_BOOL("touchscreen-inverted-x"),
> +   PROPERTY_ENTRY_BOOL("touchscreen-inverted-y"),
> +   PROPERTY_ENTRY_BOOL("touchscreen-swapped-x-y"),
> +   PROPERTY_ENTRY_STRING("firmware-name",
> + "gsl1680-schneider-sct101ctm.fw"),
> +   PROPERTY_ENTRY_U32("silead,max-fingers", 10),
> +   PROPERTY_ENTRY_BOOL("silead,home-button"),
> +   { }
> +};
> +
> +static const struct ts_dmi_data schneider_sct101ctm_data = {
> +   .acpi_name  = "MSSL1680:00",
> +   .properties = schneider_sct101ctm_props,
> +};
> +
>  static const struct property_entry teclast_x3_plus_props[] = {
> PROPERTY_ENTRY_U32("touchscreen-size-x", 1980),
> PROPERTY_ENTRY_U32("touchscreen-size-y", 1500),
> @@ -968,6 +986,14 @@ const struct dmi_system_id touchscreen_dmi_table[] = {
> DMI_EXACT_MATCH(DMI_BOARD_NAME, "0E57"),
> },
> },
> +   {
> +   /* Schneider SCT101CTM */
> +   .driver_data = (void *)_sct101ctm_data,
> +   .matches = {
> +   DMI_MATCH(DMI_SYS_VENDOR, "Default string"),
> +   DMI_MATCH(DMI_PRODUCT_NAME, "SCT101CTM"),
> +   },
> +   },
> {
> /* Teclast X3 Plus */
> .driver_data = (void *)_x3_plus_data,
> --
> 2.23.0
>


-- 
With Best Regards,
Andy Shevchenko


Re: [PATCH] Input: hp_sdc_rtc - remove dead chardev code

2019-10-23 Thread Helge Deller

On 23.10.19 17:22, Alexandre Belloni wrote:

On 23/10/2019 16:25:02+0200, Arnd Bergmann wrote:

The driver contains half of the implementation of /dev/rtc, but this
was never completed, and it is now incompatible with the drivers/rtc
framework.

Remove the chardev completely. If anyone wants to add the functionality
later, that shoudl be done through rtc_register_device().

The remaining portions of the driver basically implement a single
procfs file that may or may not be used anywhere. Not sure why this
is in drivers/input/ though.

Signed-off-by: Arnd Bergmann 

Acked-by: Alexandre Belloni 


A year ago I did actually converted this driver to the RTC framework.
But after some testing on my physical box (a 715/64 PA-RISC machine
with HIL connector) I realized that the SDC in that machine doesn't
provide a functional RTC, and even more important, on that box we don't
need this RTC because the system provides a built-in RTC on-mainboard instead.
So, I never pushed my changes upstream, which can still be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/commit/?h=hp_sdc_rtc_conversion=0d4250dbcfa2bb8b326ce7721e19e10a66f1eb92

So, I don't think any PA-RISC machine needs this driver, and as such
I'm happy to give my:
Acked-by: Helge Deller 

I even think the whole driver can go away...

Helge

PS: Maybe some really old 68000-based HP machines needed that, but
I don't know if any recent Linux kernel runs on those old boxes any
longer...



---
  drivers/input/misc/hp_sdc_rtc.c | 342 
  1 file changed, 342 deletions(-)

diff --git a/drivers/input/misc/hp_sdc_rtc.c b/drivers/input/misc/hp_sdc_rtc.c
index abca895a6156..199bc17ddb1d 100644
--- a/drivers/input/misc/hp_sdc_rtc.c
+++ b/drivers/input/misc/hp_sdc_rtc.c
@@ -53,28 +53,10 @@ MODULE_LICENSE("Dual BSD/GPL");

  #define RTC_VERSION "1.10d"

-static DEFINE_MUTEX(hp_sdc_rtc_mutex);
  static unsigned long epoch = 2000;

  static struct semaphore i8042tregs;

-static hp_sdc_irqhook hp_sdc_rtc_isr;
-
-static struct fasync_struct *hp_sdc_rtc_async_queue;
-
-static DECLARE_WAIT_QUEUE_HEAD(hp_sdc_rtc_wait);
-
-static ssize_t hp_sdc_rtc_read(struct file *file, char __user *buf,
-  size_t count, loff_t *ppos);
-
-static long hp_sdc_rtc_unlocked_ioctl(struct file *file,
- unsigned int cmd, unsigned long arg);
-
-static unsigned int hp_sdc_rtc_poll(struct file *file, poll_table *wait);
-
-static int hp_sdc_rtc_open(struct inode *inode, struct file *file);
-static int hp_sdc_rtc_fasync (int fd, struct file *filp, int on);
-
  static void hp_sdc_rtc_isr (int irq, void *dev_id,
uint8_t status, uint8_t data)
  {
@@ -283,151 +265,6 @@ static inline int hp_sdc_rtc_read_ct(struct timespec64 
*res) {
return 0;
  }

-
-#if 0 /* not used yet */
-/* Set the i8042 real-time clock */
-static int hp_sdc_rtc_set_rt (struct timeval *setto)
-{
-   uint32_t tenms;
-   unsigned int days;
-   hp_sdc_transaction t;
-   uint8_t tseq[11] = {
-   HP_SDC_ACT_PRECMD | HP_SDC_ACT_DATAOUT,
-   HP_SDC_CMD_SET_RTMS, 3, 0, 0, 0,
-   HP_SDC_ACT_PRECMD | HP_SDC_ACT_DATAOUT,
-   HP_SDC_CMD_SET_RTD, 2, 0, 0
-   };
-
-   t.endidx = 10;
-
-   if (0x < setto->tv_sec / 86400) return -1;
-   days = setto->tv_sec / 86400;
-   if (0x < setto->tv_usec / 100 / 86400) return -1;
-   days += ((setto->tv_sec % 86400) + setto->tv_usec / 100) / 86400;
-   if (days > 0x) return -1;
-
-   if (0xff < setto->tv_sec) return -1;
-   tenms  = setto->tv_sec * 100;
-   if (0xff < setto->tv_usec / 1) return -1;
-   tenms += setto->tv_usec / 1;
-   if (tenms > 0xff) return -1;
-
-   tseq[3] = (uint8_t)(tenms & 0xff);
-   tseq[4] = (uint8_t)((tenms >> 8)  & 0xff);
-   tseq[5] = (uint8_t)((tenms >> 16) & 0xff);
-
-   tseq[9] = (uint8_t)(days & 0xff);
-   tseq[10] = (uint8_t)((days >> 8) & 0xff);
-
-   t.seq = tseq;
-
-   if (hp_sdc_enqueue_transaction()) return -1;
-   return 0;
-}
-
-/* Set the i8042 fast handshake timer */
-static int hp_sdc_rtc_set_fhs (struct timeval *setto)
-{
-   uint32_t tenms;
-   hp_sdc_transaction t;
-   uint8_t tseq[5] = {
-   HP_SDC_ACT_PRECMD | HP_SDC_ACT_DATAOUT,
-   HP_SDC_CMD_SET_FHS, 2, 0, 0
-   };
-
-   t.endidx = 4;
-
-   if (0x < setto->tv_sec) return -1;
-   tenms  = setto->tv_sec * 100;
-   if (0x < setto->tv_usec / 1) return -1;
-   tenms += setto->tv_usec / 1;
-   if (tenms > 0x) return -1;
-
-   tseq[3] = (uint8_t)(tenms & 0xff);
-   tseq[4] = (uint8_t)((tenms >> 8)  & 0xff);
-
-   t.seq = tseq;
-
-   if (hp_sdc_enqueue_transaction()) return -1;
-   return 0;
-}
-
-
-/* Set the i8042 match timer (a.k.a. alarm) */
-#define hp_sdc_rtc_set_mt 

Re: [PATCH 3/3] power: supply: bq2515x: Introduce the bq2515x family

2019-10-23 Thread Dan Murphy

Sebastian

On 10/20/19 7:15 AM, Sebastian Reichel wrote:

Hi Dan,

On Mon, Sep 30, 2019 at 09:31:37AM -0500, Dan Murphy wrote:

[...]

+
+static int bq2515x_power_supply_register(struct bq2515x_device *bq2515x)
+{
+   struct power_supply_config psy_cfg = { .drv_data = bq2515x, };
+   int ret = -EINVAL;
+
+   bq2515x->mains = devm_power_supply_register(bq2515x->dev,
+   _mains_desc,
+   _cfg);
+   if (IS_ERR(bq2515x->mains))
+   return ret;
+
+   bq2515x->battery = devm_power_supply_register(bq2515x->dev,
+ _battery_desc,
+ _cfg);
+   if (IS_ERR(bq2515x->battery)) {
+   power_supply_unregister(bq2515x->mains);

you registered the mains power-supply with devm_ prefix, so it
will be removed automatically.

Ack



+   return ret;
+   }
+
+   return 0;
+}
+
+static int bq2515x_hw_init(struct bq2515x_device *bq2515x)
+{
+   int ret = 0;
+
+   if (bq2515x->init_data.ichg)
+   ret = bq2515x_set_ilim_lvl(bq2515x, bq2515x->init_data.ichg);
+
+   if (bq2515x->init_data.vreg)
+   ret = bq2515x_set_batt_reg(bq2515x, bq2515x->init_data.vreg);

This throws away potential error code from bq2515x_set_ilim_lvl().

Ack

+   return ret;
+}
+
+static int bq2515x_read_properties(struct bq2515x_device *bq2515x)
+{
+   int ret;
+
+   ret = device_property_read_u8(bq2515x->dev, "ti,charge-current",
+ >init_data.ichg);
+   if (ret)
+   goto fail;
+
+   ret = device_property_read_u8(bq2515x->dev,
+ "ti,battery-regulation-voltage",
+ >init_data.vreg);
+   if (ret)
+   goto fail;

The above properties are marked as optional in the DT bindings
document.

ACK.  Will just set them to the data sheet default values



+   bq2515x->pg_gpio = devm_gpiod_get_optional(bq2515x->dev,
+  "pg", GPIOD_IN);
+   if (IS_ERR(bq2515x->pg_gpio))
+   dev_info(bq2515x->dev, "PG GPIO not defined");
+
+   bq2515x->reset_gpio = devm_gpiod_get_optional(bq2515x->dev,
+  "reset", GPIOD_OUT_LOW);
+   if (IS_ERR(bq2515x->reset_gpio))
+   dev_info(bq2515x->dev, "reset GPIO not defined");
+
+   bq2515x->lp_gpio = devm_gpiod_get_optional(bq2515x->dev, "low-power",
+  GPIOD_OUT_LOW);
+   if (IS_ERR(bq2515x->lp_gpio))
+   dev_info(bq2515x->dev, "LP GPIO not defined");
+
+   bq2515x->ce_gpio = devm_gpiod_get_optional(bq2515x->dev,
+  "charge-enable",
+  GPIOD_OUT_HIGH);
+   if (IS_ERR(bq2515x->ce_gpio))
+   dev_info(bq2515x->dev, "Charge enable GPIO not defined");

The GPIO errors should not be ignored, especially since you
might get a -EPROBE_DEFER.


ACK



+fail:
+   return ret;
+}
+
+static const struct regmap_config bq2515x_regmap_config = {
+   .reg_bits = 8,
+   .val_bits = 8,
+
+   .max_register = BQ2515X_DEVICE_ID,
+   .reg_defaults = bq2515x_reg_defs,
+   .num_reg_defaults = ARRAY_SIZE(bq2515x_reg_defs),
+   .cache_type   = REGCACHE_RBTREE,
+};
+
+static int bq2515x_probe(struct i2c_client *client,
+const struct i2c_device_id *id)
+{
+   struct device *dev = >dev;
+   struct bq2515x_device *bq;
+   int ret;
+
+   bq = devm_kzalloc(dev, sizeof(*bq), GFP_KERNEL);
+   if (!bq)
+   return -ENOMEM;
+
+   bq->client = client;
+   bq->dev = dev;
+
+   mutex_init(>lock);
+
+   bq->regmap = devm_regmap_init_i2c(client, _regmap_config);
+   if (IS_ERR(bq->regmap)) {
+   dev_err(dev, "failed to allocate register map\n");
+   return PTR_ERR(bq->regmap);
+   }
+
+   strncpy(bq->model_name, id->name, I2C_NAME_SIZE);
+
+   i2c_set_clientdata(client, bq);
+
+   ret = bq2515x_read_properties(bq);
+   if (ret < 0) {
+   dev_err(dev, "Failed to register power supply\n");
+   return ret;
+   }
+
+   ret = bq2515x_hw_init(bq);
+   if (ret < 0) {
+   dev_err(dev, "Cannot initialize the chip.\n");
+   return ret;
+   }
+
+   return bq2515x_power_supply_register(bq);
+}
+
+static const struct i2c_device_id bq2515x_i2c_ids[] = {
+   { "bq25150", 0 },
+   { "bq25155", 1 },
+   {},
+};
+MODULE_DEVICE_TABLE(i2c, bq2515x_i2c_ids);
+
+static const struct of_device_id bq2515x_of_match[] = {
+   { .compatible = "ti,bq25150", },
+   { .compatible = "ti,bq25155", },
+   { },
+};

[PATCH] Fix skb->csum calculation when netfilter manipulation for NF_NAT_MANIP_SRC\DST is done on IPV6 packet.

2019-10-23 Thread Praveen Chaudhary


-Issue [BUG in current code]:-

IPV6 UDP packet is dropped by kernel in function udp6_csum_init(), when 
netfilter for NF_NAT_MANIP_SRC\DST is applied.

Counter increased: Udp6InCsumErrors. Note: incoming UPD6 packet has correct UDP 
checksum. 


-Reproduction steps: (using IPV6 UDP DNS Query)--- [linux kernel 
any version > 4.9.110]-

1.) Set an SNAT entry in /etc/iptables/rules.v6 as show below.
```
-A POSTROUTING -p udp -m udp --dport 53 -o Ethernet+ -j SNAT --to-source 
2a04:::4::2 (Masked)
```
Note: Above rule will change ipv6 source address to 2a04:::4::2 for 
outgoing UDPv6 DNS packet , and will change the ipv6 destination address 
from 2a04:::4::2 to original ipv6 address for incoming DNS response.

2.) Apply the iptable changes.
```
switch$ sudo /etc/network/if-pre-up.d/iptables
```
3.) Run a DNS query using IPV6 DNS server for any site. We can observe that DNS 
query is timed out.
```
switch$ host facebook.com 2a04::xx:1::c216
;; connection timed out; no servers could be reached
```

---Details of issue

1.) On incoming path: 
Function Trace: udp_error()-->nf_checksum()-->nf_ip6_checksum():

In function nf_ip6_checksum(); checksum is verified for incoming packet. Here 
skb->data points to IPV6 HEADER and ip_summed == CHECKSUM_NONE. So after a call 
to __skb_checksum_complete, skb->csum will store the 16-bit sum of [ipv6 header 
+ UDP header + UDP data].

Checksum verification will be successful here, because csum_ipv6_magic() 
subtracts 16-bit sum of IPv6 header from 16-bit sum of Pseudo header.

Note: UDP checksum = ~[pseudo header + UDP header + UDP data].


2.) SNAT iptable rule processing, 

Function Trace: nf_nat_ipv6_manip_pkt()--> 
l4proto_manip_pkt()-->udp_manip_pkt()-->__udp_manip_pkt()-->nf_csum_update()--->
 nf_nat_ipv6_csum_update()-->inet_proto_csum_replace16():

In function inet_proto_csum_replace16(), first udp_header checksum field will 
be updated as below, because of the NF_NAT_MANIP_SRC manipulation, I.e to 
reflect IPV6 src address change in IPV6 header.
```
*sum = csum_fold(csum_partial(diff, sizeof(diff),
 ~csum_unfold(*sum)));
```
Since skb->csum includes udp header checksum field, skb->csum will also go 
through similar calculation to reflect udp header checksum field change.
```
if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
skb->csum = ~csum_partial(diff, sizeof(diff),
  ~skb->csum);
```
After this update, skb->csum = [Original IPV6 Header + Modified UDP header + 
UDP data]



3.) Then in function nf_nat_ipv6_manip_pkt(), ipv6 header will be updated to 
have target IPV6 src address:

```
ipv6h->saddr = target->src.u3.in6;
```
Here we do not update skb->csum. So skb->csum will still be equal to [Original 
IPV6 Header + Modified UDP header + UDP data]. 
**BUG BUG BUG**: This is the bug, skb->csum must be updated to reflect this 
change in IPV6 SRC Address.

Ideally change in UDP header checksum field and change is IPV6 SRC address 
cancels each other in this case, so no update was needed in skb->csum.

4.) IPV6 header processing: 

Function Trace: ip6_input() --> ip6_input_finish() --> 
ip6_protocol_deliver_rcu():

In ip6_protocol_deliver_rcu(), 16-bit sum of IPV6 header will be subtracted 
from skb->csum.

```
skb_postpull_rcsum(skb, skb_network_header(skb),
   skb_network_header_len(skb));
```

This is the sum of new IPV6 header with modified IPV6 Source address. After 
this subtraction 

skb->csum = [Original IPV6 Header + Modified UDP header + UDP data] - [Modified 
IPv6 Header]
This is wrong value of skb->csum.


5.) UDP Header Checksum init:

Function Trace: __udp6_lib_rcv() --> udp6_csum_init() -->  
__skb_checksum_validate_complete()

In  __skb_checksum_validate_complete() below condition will never met. Because 
value of skb->csum is unexpected. 
```
if (skb->ip_summed == CHECKSUM_COMPLETE) {
if (!csum_fold(csum_add(psum, skb->csum))) {
skb->csum_valid = 1;
return 0;
}
}
```

In udp6_csum_init(), below check will be true:
```
if (skb->ip_summed == CHECKSUM_COMPLETE && !skb->csum_valid) {
/* If SW calculated the value, we know it's bad */
if (skb->csum_complete_sw)
return 1;
```

and this packet will be dropped in  __udp6_lib_rcv() due to below check

```
if (udp6_csum_init(skb, uh, proto))
goto csum_error;
...
...

csum_error:
__UDP6_INC_STATS(net, UDP_MIB_CSUMERRORS, proto == IPPROTO_UDPLITE);
```



---Related JPROBE LOGS:---
```
Oct 19 05:15:01.420147 asw03 NOTICE kernel: [350574.264081]  nf_ip6_checksum: 
dataoff=40   
Oct 19 05:15:01.420180 asw03 NOTICE kernel: [350574.264084]  nf_ip6_checksum: 
skb:9faa71cbc500 ip_sum=0 csum_valid=0 csum_com_sw=0 csum=0
```
[Explanation]: dataoff = 40 for ipv6 

[PATCH] [netfilter]: Fix skb->csum calculation when netfilter manipulation for NF_NAT_MANIP_SRC\DST is done on IPV6 packet.

2019-10-23 Thread Praveen Chaudhary
Update skb->csum, when netfilter code updates IPV6 SRC\DST address in IPV6 
HEADER due to iptable rule.

Signed-off-by: Praveen Chaudhary 
Signed-off-by: Zhenggen Xu 
Signed-off-by: Andy Stracner 
---
 include/net/checksum.h   |  2 ++
 net/core/utils.c | 13 +
 net/netfilter/nf_nat_proto.c |  2 ++
 3 files changed, 17 insertions(+)

diff --git a/include/net/checksum.h b/include/net/checksum.h
index 97bf488..d7d28b7 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -145,6 +145,8 @@ void inet_proto_csum_replace4(__sum16 *sum, struct sk_buff 
*skb,
 void inet_proto_csum_replace16(__sum16 *sum, struct sk_buff *skb,
   const __be32 *from, const __be32 *to,
   bool pseudohdr);
+void inet_proto_skb_csum_replace16(struct sk_buff *skb,
+  const __be32 *from, const __be32 *to);
 void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
 __wsum diff, bool pseudohdr);
 
diff --git a/net/core/utils.c b/net/core/utils.c
index 6b6e51d..ab3284b 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -458,6 +458,19 @@ void inet_proto_csum_replace16(__sum16 *sum, struct 
sk_buff *skb,
 }
 EXPORT_SYMBOL(inet_proto_csum_replace16);
 
+void inet_proto_skb_csum_replace16(struct sk_buff *skb,
+  const __be32 *from, const __be32 *to)
+{
+   __be32 diff[] = {
+   ~from[0], ~from[1], ~from[2], ~from[3],
+   to[0], to[1], to[2], to[3],
+   };
+   if (skb->ip_summed == CHECKSUM_COMPLETE)
+   skb->csum = csum_partial(diff, sizeof(diff),
+ skb->csum);
+}
+EXPORT_SYMBOL(inet_proto_skb_csum_replace16);
+
 void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
 __wsum diff, bool pseudohdr)
 {
diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index 0a59c14..de94590 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -467,6 +467,8 @@ static void nf_nat_ipv6_csum_update(struct sk_buff *skb,
}
inet_proto_csum_replace16(check, skb, oldip->s6_addr32,
  newip->s6_addr32, true);
+   inet_proto_skb_csum_replace16(skb, oldip->s6_addr32,
+ newip->s6_addr32);
 #endif
 }
 
-- 
2.7.4



Re: [v2 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

2019-10-23 Thread Hugh Dickins
On Wed, 23 Oct 2019, Yang Shi wrote:
> On 10/23/19 10:24 AM, Matthew Wilcox wrote:
> > On Thu, Oct 24, 2019 at 01:05:04AM +0800, Yang Shi wrote:
> > > + return map_count >= 0 &&
> > > +map_count == atomic_read([1].compound_mapcount);
> > >   }
> > I didn't like Hugh's duplicate definition either.  May I suggest:
> 
> Thanks, Willy. It is fine to me. Will take it in v3.

Agreed, that will be better.

Hugh


Applied "regulator: bd70528: Add MODULE_ALIAS to allow module auto loading" to the regulator tree

2019-10-23 Thread Mark Brown
The patch

   regulator: bd70528: Add MODULE_ALIAS to allow module auto loading

has been applied to the regulator tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git for-5.4

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From 55d5f62c3fa005a6a8010363d7d1855909ceefbc Mon Sep 17 00:00:00 2001
From: Matti Vaittinen 
Date: Wed, 23 Oct 2019 15:14:52 +0300
Subject: [PATCH] regulator: bd70528: Add MODULE_ALIAS to allow module auto
 loading

The bd70528 regulator driver is probed by MFD driver. Add MODULE_ALIAS
in order to allow udev to load the module when MFD sub-device cell for
regulators is added.

Fixes: 99ea37bd1e7d7 ("regulator: bd70528: Support ROHM BD70528 regulator 
block")
Signed-off-by: Matti Vaittinen 
Link: https://lore.kernel.org/r/20191023121452.GA1812@localhost.localdomain
Signed-off-by: Mark Brown 
---
 drivers/regulator/bd70528-regulator.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/regulator/bd70528-regulator.c 
b/drivers/regulator/bd70528-regulator.c
index 0248a61f1006..ec764022621f 100644
--- a/drivers/regulator/bd70528-regulator.c
+++ b/drivers/regulator/bd70528-regulator.c
@@ -286,3 +286,4 @@ module_platform_driver(bd70528_regulator);
 MODULE_AUTHOR("Matti Vaittinen ");
 MODULE_DESCRIPTION("BD70528 voltage regulator driver");
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("platform:bd70528-pmic");
-- 
2.20.1



Applied "ASoC: rsnd: dma: fix SSI9 4/5/6/7 busif dma address" to the asoc tree

2019-10-23 Thread Mark Brown
The patch

   ASoC: rsnd: dma: fix SSI9 4/5/6/7 busif dma address

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.4

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From d10be65f87fc9d98ad3cbdc406e86745fe8c59e2 Mon Sep 17 00:00:00 2001
From: Jiada Wang 
Date: Tue, 22 Oct 2019 20:54:29 +0200
Subject: [PATCH] ASoC: rsnd: dma: fix SSI9 4/5/6/7 busif dma address

Currently each SSI unit's busif dma address is calculated by
following calculation formula:
0xec54 + 0x1000 * id + busif / 4 * 0xA000 + busif % 4 * 0x400

But according to R-Car3 HW manual 41.1.4 Register Configuration,
ssi9 4/5/6/7 busif data register address
(SSI9_4_BUSIF/SSI9_5_BUSIF/SSI9_6_BUSIF/SSI9_7_BUSIF)
are out of this rule.

This patch updates the calculation formula to correct
ssi9 4/5/6/7 busif data register address.

Fixes: 5e45a6fab3b9 ("ASoc: rsnd: dma: Calculate dma address with consider of 
BUSIF")
Signed-off-by: Jiada Wang 
Signed-off-by: Timo Wischer 
[erosca: minor improvements in commit description]
Cc: Andrew Gabbasov 
Cc: sta...@vger.kernel.org # v4.20+
Signed-off-by: Eugeniu Rosca 
Acked-by: Kuninori Morimoto 
Link: https://lore.kernel.org/r/20191022185429.12769-1-ero...@de.adit-jv.com
Signed-off-by: Mark Brown 
---
 sound/soc/sh/rcar/dma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sound/soc/sh/rcar/dma.c b/sound/soc/sh/rcar/dma.c
index 0324a5c39619..28f65eba2bb4 100644
--- a/sound/soc/sh/rcar/dma.c
+++ b/sound/soc/sh/rcar/dma.c
@@ -508,10 +508,10 @@ static struct rsnd_mod_ops rsnd_dmapp_ops = {
 #define RDMA_SSI_I_N(addr, i)  (addr ##_reg - 0x0030 + (0x40 * i) + 0x8)
 #define RDMA_SSI_O_N(addr, i)  (addr ##_reg - 0x0030 + (0x40 * i) + 0xc)
 
-#define RDMA_SSIU_I_N(addr, i, j) (addr ##_reg - 0x00441000 + (0x1000 * (i)) + 
(((j) / 4) * 0xA000) + (((j) % 4) * 0x400))
+#define RDMA_SSIU_I_N(addr, i, j) (addr ##_reg - 0x00441000 + (0x1000 * (i)) + 
(((j) / 4) * 0xA000) + (((j) % 4) * 0x400) - (0x4000 * ((i) / 9) * ((j) / 4)))
 #define RDMA_SSIU_O_N(addr, i, j) RDMA_SSIU_I_N(addr, i, j)
 
-#define RDMA_SSIU_I_P(addr, i, j) (addr ##_reg - 0x00141000 + (0x1000 * (i)) + 
(((j) / 4) * 0xA000) + (((j) % 4) * 0x400))
+#define RDMA_SSIU_I_P(addr, i, j) (addr ##_reg - 0x00141000 + (0x1000 * (i)) + 
(((j) / 4) * 0xA000) + (((j) % 4) * 0x400) - (0x4000 * ((i) / 9) * ((j) / 4)))
 #define RDMA_SSIU_O_P(addr, i, j) RDMA_SSIU_I_P(addr, i, j)
 
 #define RDMA_SRC_I_N(addr, i)  (addr ##_reg - 0x0050 + (0x400 * i))
-- 
2.20.1



Applied "ASoC: mediatek: Check SND_SOC_CROS_EC_CODEC dependency" to the asoc tree

2019-10-23 Thread Mark Brown
The patch

   ASoC: mediatek: Check SND_SOC_CROS_EC_CODEC dependency

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From ef5dee551e3e6568fb203ea57fa24f55cb64d451 Mon Sep 17 00:00:00 2001
From: Mao Wenan 
Date: Wed, 23 Oct 2019 14:31:03 +0800
Subject: [PATCH] ASoC: mediatek: Check SND_SOC_CROS_EC_CODEC dependency

If SND_SOC_MT8183_MT6358_TS3A227E_MAX98357A=y,
below errors can be seen:
sound/soc/codecs/cros_ec_codec.o: In function `send_ec_host_command':
cros_ec_codec.c:(.text+0x534): undefined reference to `cros_ec_cmd_xfer_status'
cros_ec_codec.c:(.text+0x101c): undefined reference to `cros_ec_get_host_event'

This is because it will select SND_SOC_CROS_EC_CODEC
after commit 2cc3cd5fdc8b ("ASoC: mediatek: mt8183: support WoV"),
but SND_SOC_CROS_EC_CODEC depends on CROS_EC.

Fixes: 2cc3cd5fdc8b ("ASoC: mediatek: mt8183: support WoV")
Signed-off-by: Mao Wenan 
Link: https://lore.kernel.org/r/20191023063103.44941-1-maowe...@huawei.com
Signed-off-by: Mark Brown 
---
 sound/soc/mediatek/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/mediatek/Kconfig b/sound/soc/mediatek/Kconfig
index 8b29f3979899..a656d2014127 100644
--- a/sound/soc/mediatek/Kconfig
+++ b/sound/soc/mediatek/Kconfig
@@ -125,7 +125,7 @@ config SND_SOC_MT8183_MT6358_TS3A227E_MAX98357A
select SND_SOC_MAX98357A
select SND_SOC_BT_SCO
select SND_SOC_TS3A227E
-   select SND_SOC_CROS_EC_CODEC
+   select SND_SOC_CROS_EC_CODEC if CROS_EC
help
  This adds ASoC driver for Mediatek MT8183 boards
  with the MT6358 TS3A227E MAX98357A audio codec.
-- 
2.20.1



Applied "spi: document CS setup, hold & inactive times in header" to the spi tree

2019-10-23 Thread Mark Brown
The patch

   spi: document CS setup, hold & inactive times in header

has been applied to the spi tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-5.5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From a3470c1829c0c856a19c10af58f8e7792ae27d7a Mon Sep 17 00:00:00 2001
From: Alexandru Ardelean 
Date: Wed, 23 Oct 2019 10:00:46 +0300
Subject: [PATCH] spi: document CS setup, hold & inactive times in header

This change documents the CS setup, host & inactive times. They were
omitted when the fields were added, and were caught by one of the build
bots.

Fixes: 25093bdeb6bc ("spi: implement SW control for CS times")
Reported-by: kbuild test robot 
Signed-off-by: Alexandru Ardelean 
Link: 
https://lore.kernel.org/r/20191023070046.12478-1-alexandru.ardel...@analog.com
Signed-off-by: Mark Brown 
---
 include/linux/spi/spi.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
index c40d6af2bf07..98fe8663033a 100644
--- a/include/linux/spi/spi.h
+++ b/include/linux/spi/spi.h
@@ -407,6 +407,11 @@ static inline void spi_unregister_driver(struct spi_driver 
*sdrv)
  *  controller has native support for memory like operations.
  * @unprepare_message: undo any work done by prepare_message().
  * @slave_abort: abort the ongoing transfer request on an SPI slave controller
+ * @cs_setup: delay to be introduced by the controller after CS is asserted
+ * @cs_hold: delay to be introduced by the controller before CS is deasserted
+ * @cs_inactive: delay to be introduced by the controller after CS is
+ * deasserted. If @cs_change_delay is used from @spi_transfer, then the
+ * two delays will be added up.
  * @cs_gpios: LEGACY: array of GPIO descs to use as chip select lines; one per
  * CS number. Any individual value may be -ENOENT for CS lines that
  * are not GPIOs (driven by the SPI controller itself). Use the cs_gpiods
-- 
2.20.1



Applied "ASoC: rsnd: dma: set bus width to data width for monaural data" to the asoc tree

2019-10-23 Thread Mark Brown
The patch

   ASoC: rsnd: dma: set bus width to data width for monaural data

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From d4d9360bf702890b5d3b1b62d8619a2690dd3278 Mon Sep 17 00:00:00 2001
From: Jiada Wang 
Date: Tue, 22 Oct 2019 20:55:18 +0200
Subject: [PATCH] ASoC: rsnd: dma: set bus width to data width for monaural
 data

According to R-Car3 HW manual 40.3.3 (Data Format on Audio Local Bus),
in case of monaural data writing or reading through Audio-DMAC,
it's always in Left Justified format, so both src and dst
DMA Bus width should be equal to physical data width.

Therefore set src and dst's DMA bus width to:
 - [monaural case] data width
 - [non-monaural case] 32bits (as prior applying the patch)

Cc: Andrew Gabbasov 
Cc: Timo Wischer 
Signed-off-by: Jiada Wang 
Signed-off-by: Eugeniu Rosca 
Link: https://lore.kernel.org/r/20191022185518.12838-1-ero...@de.adit-jv.com
Signed-off-by: Mark Brown 
---
 sound/soc/sh/rcar/dma.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/sound/soc/sh/rcar/dma.c b/sound/soc/sh/rcar/dma.c
index 0324a5c39619..bcb6d5960661 100644
--- a/sound/soc/sh/rcar/dma.c
+++ b/sound/soc/sh/rcar/dma.c
@@ -165,14 +165,40 @@ static int rsnd_dmaen_start(struct rsnd_mod *mod,
struct device *dev = rsnd_priv_to_dev(priv);
struct dma_async_tx_descriptor *desc;
struct dma_slave_config cfg = {};
+   enum dma_slave_buswidth buswidth = DMA_SLAVE_BUSWIDTH_4_BYTES;
int is_play = rsnd_io_is_play(io);
int ret;
 
+   /*
+* in case of monaural data writing or reading through Audio-DMAC
+* data is always in Left Justified format, so both src and dst
+* DMA Bus width need to be set equal to physical data width.
+*/
+   if (rsnd_runtime_channel_original(io) == 1) {
+   struct snd_pcm_runtime *runtime = rsnd_io_to_runtime(io);
+   int bits = snd_pcm_format_physical_width(runtime->format);
+
+   switch (bits) {
+   case 8:
+   buswidth = DMA_SLAVE_BUSWIDTH_1_BYTE;
+   break;
+   case 16:
+   buswidth = DMA_SLAVE_BUSWIDTH_2_BYTES;
+   break;
+   case 32:
+   buswidth = DMA_SLAVE_BUSWIDTH_4_BYTES;
+   break;
+   default:
+   dev_err(dev, "invalid format width %d\n", bits);
+   return -EINVAL;
+   }
+   }
+
cfg.direction   = is_play ? DMA_MEM_TO_DEV : DMA_DEV_TO_MEM;
cfg.src_addr= dma->src_addr;
cfg.dst_addr= dma->dst_addr;
-   cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
-   cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+   cfg.src_addr_width = buswidth;
+   cfg.dst_addr_width = buswidth;
 
dev_dbg(dev, "%s %pad -> %pad\n",
rsnd_mod_name(mod),
-- 
2.20.1



Applied "ASoC: hdmi-codec: drop mutex locking again" to the asoc tree

2019-10-23 Thread Mark Brown
The patch

   ASoC: hdmi-codec: drop mutex locking again

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.4

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From 901af18b6baade6a327e532427cbb233f4945f5d Mon Sep 17 00:00:00 2001
From: Jerome Brunet 
Date: Wed, 23 Oct 2019 18:12:02 +0200
Subject: [PATCH] ASoC: hdmi-codec: drop mutex locking again

This reverts commit eb1ecadb7f67dde94ef0efd3ddaed5cb6c9a65ed.

This fixes the following warning reported by lockdep and a potential
issue with hibernation


WARNING: pulseaudio/1297 still has locks held!
5.3.0+ #1826 Not tainted

1 lock held by pulseaudio/1297:
 #0: ee815308 (>lock){}, at: hdmi_codec_startup+0x20/0x130

stack backtrace:
CPU: 0 PID: 1297 Comm: pulseaudio Not tainted 5.3.0+ #1826
Hardware name: Marvell Dove (Cubox)
[] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[] (show_stack) from [] (futex_wait_queue_me+0x13c/0x19c)
[] (futex_wait_queue_me) from [] (futex_wait+0x184/0x24c)
[] (futex_wait) from [] (do_futex+0x334/0x598)
[] (do_futex) from [] (sys_futex_time32+0x118/0x180)
[] (sys_futex_time32) from [] (ret_fast_syscall+0x0/0x54)
Exception stack(0xebd31fa8 to 0xebd31ff0)
1fa0:     000c8748 0189 0001 
1fc0:    00f0    00056200
1fe0: 00f0 beac03a8 b6d6c835 b6d6f456

Fixes: eb1ecadb7f67 ("ASoC: hdmi-codec: re-introduce mutex locking")
Reported-by: Russell King 
Signed-off-by: Jerome Brunet 
Link: https://lore.kernel.org/r/20191023161203.28955-2-jbru...@baylibre.com
Signed-off-by: Mark Brown 
---
 sound/soc/codecs/hdmi-codec.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/sound/soc/codecs/hdmi-codec.c b/sound/soc/codecs/hdmi-codec.c
index b5fd8f08726e..f8b5b960e597 100644
--- a/sound/soc/codecs/hdmi-codec.c
+++ b/sound/soc/codecs/hdmi-codec.c
@@ -274,7 +274,7 @@ struct hdmi_codec_priv {
uint8_t eld[MAX_ELD_BYTES];
struct snd_pcm_chmap *chmap_info;
unsigned int chmap_idx;
-   struct mutex lock;
+   unsigned long busy;
struct snd_soc_jack *jack;
unsigned int jack_status;
 };
@@ -390,8 +390,8 @@ static int hdmi_codec_startup(struct snd_pcm_substream 
*substream,
struct hdmi_codec_priv *hcp = snd_soc_dai_get_drvdata(dai);
int ret = 0;
 
-   ret = mutex_trylock(>lock);
-   if (!ret) {
+   ret = test_and_set_bit(0, >busy);
+   if (ret) {
dev_err(dai->dev, "Only one simultaneous stream supported!\n");
return -EINVAL;
}
@@ -419,7 +419,7 @@ static int hdmi_codec_startup(struct snd_pcm_substream 
*substream,
 
 err:
/* Release the exclusive lock on error */
-   mutex_unlock(>lock);
+   clear_bit(0, >busy);
return ret;
 }
 
@@ -431,7 +431,7 @@ static void hdmi_codec_shutdown(struct snd_pcm_substream 
*substream,
hcp->chmap_idx = HDMI_CODEC_CHMAP_IDX_UNKNOWN;
hcp->hcd.ops->audio_shutdown(dai->dev->parent, hcp->hcd.data);
 
-   mutex_unlock(>lock);
+   clear_bit(0, >busy);
 }
 
 static int hdmi_codec_hw_params(struct snd_pcm_substream *substream,
@@ -811,8 +811,6 @@ static int hdmi_codec_probe(struct platform_device *pdev)
return -ENOMEM;
 
hcp->hcd = *hcd;
-   mutex_init(>lock);
-
daidrv = devm_kcalloc(dev, dai_count, sizeof(*daidrv), GFP_KERNEL);
if (!daidrv)
return -ENOMEM;
-- 
2.20.1



Re: [PATCH] irqchip: Skip contexts other supervisor in plic_init()

2019-10-23 Thread Paul Walmsley
+ hch

On Wed, 23 Oct 2019, Alan Mikhak wrote:

> From: Alan Mikhak 
> 
> Modify plic_init() to skip .dts interrupt contexts other
> than supervisor external interrupt.

Might be good to explain the motivation here.

> 
> Signed-off-by: Alan Mikhak 
> ---
>  drivers/irqchip/irq-sifive-plic.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/irqchip/irq-sifive-plic.c 
> b/drivers/irqchip/irq-sifive-plic.c
> index c72c036aea76..5f2a773d5669 100644
> --- a/drivers/irqchip/irq-sifive-plic.c
> +++ b/drivers/irqchip/irq-sifive-plic.c
> @@ -251,8 +251,8 @@ static int __init plic_init(struct device_node *node,
>   continue;
>   }
>  
> - /* skip context holes */
> - if (parent.args[0] == -1)
> + /* skip contexts other than supervisor external interrupt */
> + if (parent.args[0] != IRQ_S_EXT)
>   continue;

Will this need to change for RISC-V M-mode Linux support?  

https://lore.kernel.org/linux-riscv/20191017173743.5430-1-...@lst.de/


- Paul




[PATCH resend] Add touchscreen platform data for the Schneider SCT101CTM tablet

2019-10-23 Thread Hans de Goede
From: Daniel Gorbea Ainz 

Add touchscreen platform data for the Schneider SCT101CTM tablet

Signed-off-by: Daniel Gorbea 
Reviewed-by: Hans de Goede 
Signed-off-by: Hans de Goede 
---
hdegoede: Resend from my email address as vger.kernel.org does not like
Daniel's emails
---
 drivers/platform/x86/touchscreen_dmi.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/drivers/platform/x86/touchscreen_dmi.c 
b/drivers/platform/x86/touchscreen_dmi.c
index 8bfef880e216..ba494ace83d4 100644
--- a/drivers/platform/x86/touchscreen_dmi.c
+++ b/drivers/platform/x86/touchscreen_dmi.c
@@ -549,6 +549,24 @@ static const struct ts_dmi_data 
pov_mobii_wintab_p1006w_v10_data = {
.properties = pov_mobii_wintab_p1006w_v10_props,
 };
 
+static const struct property_entry schneider_sct101ctm_props[] = {
+   PROPERTY_ENTRY_U32("touchscreen-size-x", 1715),
+   PROPERTY_ENTRY_U32("touchscreen-size-y", 1140),
+   PROPERTY_ENTRY_BOOL("touchscreen-inverted-x"),
+   PROPERTY_ENTRY_BOOL("touchscreen-inverted-y"),
+   PROPERTY_ENTRY_BOOL("touchscreen-swapped-x-y"),
+   PROPERTY_ENTRY_STRING("firmware-name",
+ "gsl1680-schneider-sct101ctm.fw"),
+   PROPERTY_ENTRY_U32("silead,max-fingers", 10),
+   PROPERTY_ENTRY_BOOL("silead,home-button"),
+   { }
+};
+
+static const struct ts_dmi_data schneider_sct101ctm_data = {
+   .acpi_name  = "MSSL1680:00",
+   .properties = schneider_sct101ctm_props,
+};
+
 static const struct property_entry teclast_x3_plus_props[] = {
PROPERTY_ENTRY_U32("touchscreen-size-x", 1980),
PROPERTY_ENTRY_U32("touchscreen-size-y", 1500),
@@ -968,6 +986,14 @@ const struct dmi_system_id touchscreen_dmi_table[] = {
DMI_EXACT_MATCH(DMI_BOARD_NAME, "0E57"),
},
},
+   {
+   /* Schneider SCT101CTM */
+   .driver_data = (void *)_sct101ctm_data,
+   .matches = {
+   DMI_MATCH(DMI_SYS_VENDOR, "Default string"),
+   DMI_MATCH(DMI_PRODUCT_NAME, "SCT101CTM"),
+   },
+   },
{
/* Teclast X3 Plus */
.driver_data = (void *)_x3_plus_data,
-- 
2.23.0



Re: [PATCH v4 15/16] module: Move where we mark modules RO,X

2019-10-23 Thread Steven Rostedt
On Tue, 22 Oct 2019 22:24:01 +0200
Peter Zijlstra  wrote:

> On Mon, Oct 21, 2019 at 10:21:10PM -0400, Steven Rostedt wrote:
> > On Fri, 18 Oct 2019 09:35:40 +0200
> > Peter Zijlstra  wrote:
> >   
> > > Now that set_all_modules_text_*() is gone, nothing depends on the
> > > relation between ->state = COMING and the protection state anymore.
> > > This enables moving the protection changes later, such that the COMING
> > > notifier callbacks can more easily modify the text.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > Cc: Jessica Yu 
> > > ---  
> > 
> > This triggered the following bug:
> >   
> 
> > The trace_event_define_fields_() is defined in
> > include/trace/trace_events.h and is an init function called by the
> > trace_events event_create_dir() via the module notifier:
> > MODULE_STATE_COMING  
> 
> The below seems to cure it; and seems to generate identical
> events/*/format output (for my .config, with the exception of ID).
> 
> It has just one section mismatch report that I'm too tired to look at
> just now.
> 
> I'm not particularly proud of the "__function__" hack, but it works :/ I
> couldn't come up with anything else for [uk]probes which seem to have
> dynamic fields and if we're having it then syscall_enter can also make
> use of it, the syscall_metadata crud was going to be ugly otherwise.
> 
> (also, win on LOC)
> 
>

After applying this series and this patch I triggered this:

[ 1397.281889] BUG: kernel NULL pointer dereference, address: 0001
[ 1397.288896] #PF: supervisor read access in kernel mode
[ 1397.294062] #PF: error_code(0x) - not-present page
[ 1397.299192] PGD 0 P4D 0 
[ 1397.301728] Oops:  [#1] PREEMPT SMP PTI
[ 1397.305908] CPU: 7 PID: 4252 Comm: ftracetest Not tainted 5.4.0-rc3-test+ 
#132
[ 1397.313114] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS 
K01 v03.03 07/14/2016
[ 1397.322056] RIP: 0010:event_create_dir+0x26a/0x520
[ 1397.326841] Code: ff ff 5a 85 c0 75 37 44 03 7b 10 48 83 c3 20 4c 8b 13 4d 
85 d2 0f 84 66 fe ff ff b9 0d 00 00 00 4c 89 d6 4c 89 f7 48 8b 53 08  a6 0f 
97 c1 80 d9 00 84 c9 75 a5 48 89 ef e8 b2 d4 a3 00 85 c0
[ 1397.345558] RSP: 0018:c9a63d18 EFLAGS: 00010202
[ 1397.350775] RAX:  RBX: c9a63d80 RCX: 000d
[ 1397.357893] RDX: 811ccfac RSI: 0001 RDI: 8207c68a
[ 1397.365006] RBP: 888114b1c750 R08:  R09: 8881147561b0
[ 1397.372119] R10: 0001 R11: 0001 R12: 8880b1d80528
[ 1397.379243] R13: 88811189aeb0 R14: 8207c68a R15: 811d2d22
[ 1397.386365] FS:  7f2567213740() GS:88811a9c() 
knlGS:
[ 1397.394437] CS:  0010 DS:  ES:  CR0: 80050033
[ 1397.400174] CR2: 0001 CR3: b1f06005 CR4: 001606e0
[ 1397.407297] Call Trace:
[ 1397.409753]  trace_add_event_call+0x6c/0xb0
[ 1397.413938]  trace_probe_register_event_call+0x22/0x50
[ 1397.419071]  trace_kprobe_create+0x65c/0xa20
[ 1397.423340]  ? argv_split+0x99/0x130
[ 1397.426913]  ? __kmalloc+0x1d4/0x2c0
[ 1397.430485]  ? trace_kprobe_create+0xa20/0xa20
[ 1397.434922]  ? trace_kprobe_create+0xa20/0xa20
[ 1397.439361]  create_or_delete_trace_kprobe+0xd/0x30
[ 1397.444237]  trace_run_command+0x72/0x90
[ 1397.448158]  trace_parse_run_command+0xaf/0x131
[ 1397.452684]  vfs_write+0xa5/0x1a0
[ 1397.455996]  ksys_write+0x5c/0xd0
[ 1397.459312]  do_syscall_64+0x48/0x120
[ 1397.462971]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1397.468017] RIP: 0033:0x7f2567303ff8

By running tools/selftests/ftrace/ftracetest

Crashed here:

[33] Kprobe dynamic event - adding and removing [PASS]
[34] Kprobe dynamic event - busy event check[PASS]
[35] Kprobe event with comm arguments   [FAIL]
[36] Kprobe event string type argumentclient_loop:

-- Steve


Re: [PATCH v1 5/6] KEYS: measure queued keys

2019-10-23 Thread Mimi Zohar
On Wed, 2019-10-23 at 13:52 -0400, Mimi Zohar wrote:
> On Wed, 2019-10-23 at 10:34 -0700, Lakshmi Ramasubramanian wrote:
> > On 10/23/19 6:23 AM, Mimi Zohar wrote:
> > 
> > > The ordering of this patch set is awkward.  It should first introduce
> > > a generic method for measuring keys based on the keyring.  Then add
> > > the additional support needed for the specific builtin_trusted_keys
> > > keyring usecase.
> > 
> > Would the following ordering of the patch set be acceptable:
> > 
> >   => PATCH 0/5: Cover letter
> > 
> >   => PATCH 1/5: Define the enum "hook(BUILTIN_TRUSTED_KEYS)" in ima.h
> > 
> >   => PATCH 2/5: Define ima hook
> > This will initially do nothing if ima is not yet
> > initialized.
> > Call process_buffer_measurement() if ima is initialized.
> > 
> >   => PATCH 3/5: key_create_or_update change and the call to ima hook
> > 
> >   => PATCH 4/5: Queue\De-Queue of key measurement requests.
> > Enable queuing of key in the ima hook if ima is not
> > initialized.
> > 
> >   => PATCH 5/5: ima policy to enable measurement of keys which will
> > enable end-to-end working of this feature.
> 
> The first patches need to introduce the generic concept of measuring
> keys based on policy.  Only afterwards would you add any builtin
> trusted keyring specific code.

1. Extend the IMA policy language to support identifying keyrings
2. Define a new IMA hook which calls process_buffer_measurement()
3. Call the new IMA hook (eg. from post_key_create_or_update)
4. Define an early workqueue for saving keys loaded prior to IMA is
initialized.  (Remember we don't hard code policy in the kernel.)

I'll be pushing out linux-integrity shortly.  For the time being,
please base your patches on -rc3.

thanks,

Mimi



Re: [PATCH] ceph: Fix use-after-free in __ceph_remove_cap

2019-10-23 Thread Jeff Layton
On Mon, 2019-10-21 at 15:51 +0100, Luis Henriques wrote:
> Jeff Layton  writes:
> 
> > On Thu, 2019-10-17 at 15:46 +0100, Luis Henriques wrote:
> > > KASAN reports a use-after-free when running xfstest generic/531, with the
> > > following trace:
> > > 
> > > [  293.903362]  kasan_report+0xe/0x20
> > > [  293.903365]  rb_erase+0x1f/0x790
> > > [  293.903370]  __ceph_remove_cap+0x201/0x370
> > > [  293.903375]  __ceph_remove_caps+0x4b/0x70
> > > [  293.903380]  ceph_evict_inode+0x4e/0x360
> > > [  293.903386]  evict+0x169/0x290
> > > [  293.903390]  __dentry_kill+0x16f/0x250
> > > [  293.903394]  dput+0x1c6/0x440
> > > [  293.903398]  __fput+0x184/0x330
> > > [  293.903404]  task_work_run+0xb9/0xe0
> > > [  293.903410]  exit_to_usermode_loop+0xd3/0xe0
> > > [  293.903413]  do_syscall_64+0x1a0/0x1c0
> > > [  293.903417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > This happens because __ceph_remove_cap() may queue a cap release
> > > (__ceph_queue_cap_release) which can be scheduled before that cap is
> > > removed from the inode list with
> > > 
> > >   rb_erase(>ci_node, >i_caps);
> > > 
> > > And, when this finally happens, the use-after-free will occur.
> > > 
> > > This can be fixed by protecting the rb_erase with the s_cap_lock spinlock,
> > > which is used by ceph_send_cap_releases(), before the cap is freed.
> > > 
> > > Signed-off-by: Luis Henriques 
> > > ---
> > >  fs/ceph/caps.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > index d3b9c9d5c1bd..21ee38cabe98 100644
> > > --- a/fs/ceph/caps.c
> > > +++ b/fs/ceph/caps.c
> > > @@ -1089,13 +1089,13 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool 
> > > queue_release)
> > >   }
> > >   cap->cap_ino = ci->i_vino.ino;
> > > 
> > > - spin_unlock(>s_cap_lock);
> > > -
> > >   /* remove from inode list */
> > >   rb_erase(>ci_node, >i_caps);
> > >   if (ci->i_auth_cap == cap)
> > >   ci->i_auth_cap = NULL;
> > > 
> > > + spin_unlock(>s_cap_lock);
> > > +
> > >   if (removed)
> > >   ceph_put_cap(mdsc, cap);
> > > 
> > 
> > Is there any reason we need to wait until this point to remove it from
> > the rbtree? ISTM that we ought to just do that at the beginning of the
> > function, before we take the s_cap_lock.
> 
> That sounds good to me, at least at a first glace.  I spent some time
> looking for any possible issues in the code, and even run a few tests.
> 
> However, looking at git log I found commit f818a73674c5 ("ceph: fix cap
> removal races"), which moved that rb_erase from the beginning of the
> function to it's current position.  So, unless the race mentioned in
> this commit has disappeared in the meantime (which is possible, this
> commit is from 2010!), this rbtree operation shouldn't be changed.
> 
> And I now wonder if my patch isn't introducing a race too...
> __ceph_remove_cap() is supposed to always be called with the session
> mutex held, except for the ceph_evict_inode() path.  Which is where I'm
> seeing the UAF.  So, maybe what's missing here is the s_mutex.  Hmm...
> 

I don't get it. That commit log talks about needing to ensure that the
backpointer is cleared under the lock which is fine, but I don't see why
we need to keep it in the inode's rbtree until that point.

Unhashing an object before you potentially free it is just good
practice, IMO. If we need to do something different here, then I think
it'd be good to add a comment explaining why.
-- 
Jeff Layton 



Re: [PATCH 1/2] Revert "ASoC: hdmi-codec: re-introduce mutex locking"

2019-10-23 Thread Mark Brown
On Wed, Oct 23, 2019 at 07:26:18PM +0100, Russell King - ARM Linux admin wrote:

> If you look at the git log for reverted commits, the vast majority
> of them follow _this_ style.  From 5.3 back to the start of current
> git history, there are 3665 commits with "Revert" in their subject
> line, 3050 of those start with "Revert" with no subsystem prefix.

That's assuming that all reverts have Revert in their subject line of
course!

> It seems that there are a small number of subsystems that want
> something different, ASoC included.  That will be an ongoing problem,
> people won't remember which want it when the majority don't.

> Maybe the revert format should be standardised in some manner?

My general thought on this is that reverts are commits like any other
and that the documentation for how you're supposed to do commit messages
also applies to them, might be worth a note though as I do see people
not writing a commit log at all for them sometimes.


signature.asc
Description: PGP signature


Re: [PATCH 1/2] rtc/ia64: remove legacy efirtc driver

2019-10-23 Thread Luck, Tony
On Wed, Oct 23, 2019 at 05:01:58PM +0200, Arnd Bergmann wrote:
> There are two EFI RTC drivers, the original drivers/char/efirtc.c
> driver and the more modern drivers/rtc/rtc-efi.c.
> 
> Both implement the same interface, but the new one does so
> in a more portable way.
> 
> Move everything over to that one and remove the old one.

The new one is more talkative that the old one. I see this extra
line on the console when I boot ia64:

rtc-efi rtc-efi: setting system clock to 2019-10-23T18:30:23 UTC (1571855423)

That seems somewhat useful & informative (though the repeated "rtc-efi"
at the start of the line is redundant).

Acked-by: Tony Luck 

-Tony


Re: [PATCH] x86/dumpstack/64: Don't evaluate exception stacks before setup

2019-10-23 Thread Cyrill Gorcunov
On Wed, Oct 23, 2019 at 11:31:40AM -0700, Matthew Wilcox wrote:
> On Wed, Oct 23, 2019 at 08:05:49PM +0200, Thomas Gleixner wrote:
> > Prevent this by checking the validity of the cea_exception_stack base
> > address and bailing out if it is zero.
> 
> Could also initialise cea_exception_stack to -1?  That would lead to it
> being caught by ...
> 
> > end = begin + sizeof(struct cea_exception_stacks);
> > /* Bail if @stack is outside the exception stack area. */
> > if (stk < begin || stk >= end)
> 
> this existing check.

As to me this would be a hack and fragile :/ In turn the current explicit
test Thomas made is a way more readable.


Re: [PATCH v2 0/1] Add support for setting MMIO PREF hotplug bridge size

2019-10-23 Thread Bjorn Helgaas
On Wed, Oct 23, 2019 at 12:12:08PM +, Nicholas Johnson wrote:
> ...
> It turns out Outlook is causing my encoding issues with git send-email.
> 
> If I get a new email for kernel development, what should it be? Gmail
> works, but looks tackier.

I wish Documentation/process/email-clients.rst said something about
Outlook, but it doesn't and I don't know enough to add anything.

It does say gmail doesn't work for sending patches.  That's certainly
true for the web GUI, but I think it might be possible to use msmtp to
send via the gmail SMTP server, e.g., https://wiki.debian.org/msmtp

Bjorn


Re: [PATCH] x86/dumpstack/64: Don't evaluate exception stacks before setup

2019-10-23 Thread Matthew Wilcox
On Wed, Oct 23, 2019 at 08:05:49PM +0200, Thomas Gleixner wrote:
> Prevent this by checking the validity of the cea_exception_stack base
> address and bailing out if it is zero.

Could also initialise cea_exception_stack to -1?  That would lead to it
being caught by ...

>   end = begin + sizeof(struct cea_exception_stacks);
>   /* Bail if @stack is outside the exception stack area. */
>   if (stk < begin || stk >= end)

this existing check.



Re: [PATCH] ASoC: fsl_esai: Add spin lock to protect reset and stop

2019-10-23 Thread Nicolin Chen
On Wed, Oct 23, 2019 at 03:29:49PM +0800, Shengjiu Wang wrote:
> xrun may happen at the end of stream, the
> trigger->fsl_esai_trigger_stop maybe called in the middle of
> fsl_esai_hw_reset, this may cause esai in wrong state
> after stop, and there may be endless xrun interrupt.

What about fsl_esai_trigger_start? It touches ESAI_xFCR_xFEN bit
that is being checked in the beginning of fsl_esai_hw_reset.

Could the scenario below be possible also?

1) ESAI TX starts
2) Xrun happens to TX
3) Starting fsl_esai_hw_reset (enabled[TX] = true; enabled[RX] = false)
4) ESAI RX starts
5) Finishing fsl_esai_hw_reset (enabled[RX] is still false)

Thanks
Nicolin

> So Add spin lock to lock these two function.
> 
> Fixes: 7ccafa2b3879 ("ASoC: fsl_esai: recover the channel swap after xrun")
> Signed-off-by: Shengjiu Wang 
> ---
>  sound/soc/fsl/fsl_esai.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c
> index 37b14c48b537..6a797648b66d 100644
> --- a/sound/soc/fsl/fsl_esai.c
> +++ b/sound/soc/fsl/fsl_esai.c
> @@ -33,6 +33,7 @@
>   * @fsysclk: system clock source to derive HCK, SCK and FS
>   * @spbaclk: SPBA clock (optional, depending on SoC design)
>   * @task: tasklet to handle the reset operation
> + * @lock: spin lock to handle reset and stop behavior
>   * @fifo_depth: depth of tx/rx FIFO
>   * @slot_width: width of each DAI slot
>   * @slots: number of slots
> @@ -56,6 +57,7 @@ struct fsl_esai {
>   struct clk *fsysclk;
>   struct clk *spbaclk;
>   struct tasklet_struct task;
> + spinlock_t lock; /* Protect reset and stop */
>   u32 fifo_depth;
>   u32 slot_width;
>   u32 slots;
> @@ -676,8 +678,10 @@ static void fsl_esai_hw_reset(unsigned long arg)
>  {
>   struct fsl_esai *esai_priv = (struct fsl_esai *)arg;
>   bool tx = true, rx = false, enabled[2];
> + unsigned long lock_flags;
>   u32 tfcr, rfcr;
>  
> + spin_lock_irqsave(_priv->lock, lock_flags);
>   /* Save the registers */
>   regmap_read(esai_priv->regmap, REG_ESAI_TFCR, );
>   regmap_read(esai_priv->regmap, REG_ESAI_RFCR, );
> @@ -715,6 +719,8 @@ static void fsl_esai_hw_reset(unsigned long arg)
>   fsl_esai_trigger_start(esai_priv, tx);
>   if (enabled[rx])
>   fsl_esai_trigger_start(esai_priv, rx);
> +
> + spin_unlock_irqrestore(_priv->lock, lock_flags);
>  }
>  
>  static int fsl_esai_trigger(struct snd_pcm_substream *substream, int cmd,
> @@ -722,6 +728,7 @@ static int fsl_esai_trigger(struct snd_pcm_substream 
> *substream, int cmd,
>  {
>   struct fsl_esai *esai_priv = snd_soc_dai_get_drvdata(dai);
>   bool tx = substream->stream == SNDRV_PCM_STREAM_PLAYBACK;
> + unsigned long lock_flags;
>  
>   esai_priv->channels[tx] = substream->runtime->channels;
>  
> @@ -734,7 +741,9 @@ static int fsl_esai_trigger(struct snd_pcm_substream 
> *substream, int cmd,
>   case SNDRV_PCM_TRIGGER_SUSPEND:
>   case SNDRV_PCM_TRIGGER_STOP:
>   case SNDRV_PCM_TRIGGER_PAUSE_PUSH:
> + spin_lock_irqsave(_priv->lock, lock_flags);
>   fsl_esai_trigger_stop(esai_priv, tx);
> + spin_unlock_irqrestore(_priv->lock, lock_flags);
>   break;
>   default:
>   return -EINVAL;
> @@ -1002,6 +1011,7 @@ static int fsl_esai_probe(struct platform_device *pdev)
>  
>   dev_set_drvdata(>dev, esai_priv);
>  
> + spin_lock_init(_priv->lock);
>   ret = fsl_esai_hw_init(esai_priv);
>   if (ret)
>   return ret;
> -- 
> 2.21.0
> 


Re: [ipc/sem.c] 6394de3b86: BUG:kernel_NULL_pointer_dereference,address

2019-10-23 Thread Manfred Spraul

Hello,

On 10/21/19 10:35 AM, kernel test robot wrote:

FYI, we noticed the following commit (built with gcc-7):

commit: 6394de3b868537a90dd9128607192b0e97109f6b ("[PATCH 4/5] ipc/sem.c: Document 
and update memory barriers")
url: 
https://github.com/0day-ci/linux/commits/Manfred-Spraul/wake_q-Cleanup-Documentation-update/20191014-055627


Yes, known issue:

@@ -2148,9 +2176,11 @@ static long do_semtimedop(int semid, struct 
sembuf __user *tsops,

    }

    do {
-   WRITE_ONCE(queue.status, -EINTR);
+   /* memory ordering ensured by the lock in sem_lock() */
+   queue.status = EINTR;
    queue.sleeper = current;

+   /* memory ordering is ensured by the lock in sem_lock() */
    __set_current_state(TASK_INTERRUPTIBLE);
    sem_unlock(sma, locknum);
    rcu_read_unlock();

It must be "-EINTR", not "EINTR".

If there is a timeout or a spurious wakeup, then the do_semtimedop() 
returns to user space without unlinking everything properly.


I was able to reproduce the issue: V1 of the series ends up with the 
shown error.


V3 as now merged doesn't fail.

--

    Manfred




Re: [PATCH 1/2] Revert "ASoC: hdmi-codec: re-introduce mutex locking"

2019-10-23 Thread Russell King - ARM Linux admin
On Wed, Oct 23, 2019 at 05:37:16PM +0100, Mark Brown wrote:
> On Wed, Oct 23, 2019 at 06:12:02PM +0200, Jerome Brunet wrote:
> > This reverts commit eb1ecadb7f67dde94ef0efd3ddaed5cb6c9a65ed.
> > 
> > This fixes the following warning reported by lockdep and a potential
> > issue with hibernation
> 
> Please submit patches using subject lines reflecting the style for the
> subsystem, this makes it easier for people to identify relevant patches.
> Look at what existing commits in the area you're changing are doing and
> make sure your subject lines visually resemble what they're doing.
> There's no need to resubmit to fix this alone.

Hi Mark,

If you look at the git log for reverted commits, the vast majority
of them follow _this_ style.  From 5.3 back to the start of current
git history, there are 3665 commits with "Revert" in their subject
line, 3050 of those start with "Revert" with no subsystem prefix.

It seems that there are a small number of subsystems that want
something different, ASoC included.  That will be an ongoing problem,
people won't remember which want it when the majority don't.

Maybe the revert format should be standardised in some manner?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [PATCH v2 2/2] RISC-V: defconfig: Enable Goldfish RTC driver

2019-10-23 Thread Paul Walmsley
On Wed, 23 Oct 2019, Alistair Francis wrote:

> On Tue, 2019-10-22 at 18:06 -0700, Paul Walmsley wrote:
> > On Tue, 22 Oct 2019, Alistair Francis wrote:
> > 
> > > I think it makese sense for this to go into Linux first.
> > > 
> > > The QEMU patches are going to be accepted, just some nit picking to 
> > > do first :)
> > > 
> > > After that we have to wait for a PR and then a QEMU release until 
> > > most people will see the change in QEMU. In that time Linux 5.4 will 
> > > be released, if this can make it into 5.4 then everyone using 5.4 
> > > will get the new RTC as soon as they upgrade QEMU (QEMU provides the 
> > > device tree). If this has to wait until QEMU has support then it 
> > > won't be supported for users until even later.
> > > 
> > > Users are generally slow to update kernels (buildroot is still using 
> > > 5.1 by default for example) so the sooner changes like this go in 
> > > the better.
> > 
> > The defconfigs are really just for kernel developers.  We expect users 
> > to define their own Kconfigs for their own needs.
> 
> From experience most people use the defconfig, at least as a starting
> point.

We'll definitely add it to the defconfigs, but I think it makes sense to 
do that once the patches hit the QEMU master branch.  (No need to wait for 
a QEMU release.)

That roughly matches what I understand the Linux kernel's approach is to 
adding hardware support: no point in adding hardware support until it 
looks likely that it will actually exist.  Otherwise it just adds churn 
and maintenance burden.

> I was under the impression that everyone was on board with this going
> in. In QEMU land it doesn't make sense to add it if the kernel isn't
> going to, so we need to be on the same page here.

Whatever RTC gets added into QEMU, we'll take defconfig patches for.  I 
don't care which one it is.  Based on the patches that hit the kernel 
lists, it initially looked like the Goldfish RTC was more complicated than 
it needed to be; but it turned out I just didn't look deeply enough.

> From the other discussions it looks like you are happy with this change
> overall right?

Yes


- Paul


Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo

2019-10-23 Thread Waiman Long
On 10/23/19 2:01 PM, Michal Hocko wrote:
> On Wed 23-10-19 13:34:22, Waiman Long wrote:
>> With a threshold of 10, it is still possible that the zone lock
>> will be held for a very long time in the worst case scenario where all
>> the counts are just below the threshold. With up to 6 migration types
>> and 11 orders, it means up to 6.6 millions.
>>
>> Track the total number of list iterations done since the acquisition
>> of the zone lock and release it whenever 10 iterations or more have
>> been completed. This will cap the lock hold time to no more than 200,000
>> list iterations.
>>
>> Signed-off-by: Waiman Long 
>> ---
>>  mm/vmstat.c | 18 ++
>>  1 file changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 57ba091e5460..c5b82fdf54af 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct 
>> seq_file *m,
>>  pg_data_t *pgdat, struct zone *zone)
>>  {
>>  int order, mtype;
>> +unsigned long iteration_count = 0;
>>  
>>  for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>>  seq_printf(m, "Node %4d, zone %8s, type %12s ",
>> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct 
>> seq_file *m,
>>   * of pages in this order should be more than
>>   * sufficient
>>   */
>> -if (++freecount >= 10) {
>> +if (++freecount > 10) {
>>  overflow = true;
>> -spin_unlock_irq(>lock);
>> -cond_resched();
>> -spin_lock_irq(>lock);
>> +freecount--;
>>  break;
>>  }
>>  }
>>  seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
>> freecount);
>> +/*
>> + * Take a break and release the zone lock when
>> + * 10 or more entries have been iterated.
>> + */
>> +iteration_count += freecount;
>> +if (iteration_count >= 10) {
>> +iteration_count = 0;
>> +spin_unlock_irq(>lock);
>> +cond_resched();
>> +spin_lock_irq(>lock);
>> +}
> Aren't you overengineering this a bit? If you are still worried then we
> can simply cond_resched for each order
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index c156ce24a322..ddb89f4e0486 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct 
> seq_file *m,
>*/
>   if (++freecount >= 10) {
>   overflow = true;
> - spin_unlock_irq(>lock);
> - cond_resched();
> - spin_lock_irq(>lock);
>   break;
>   }
>   }
>   seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
> freecount);
> + spin_unlock_irq(>lock);
> + cond_resched();
> + spin_lock_irq(>lock);
>   }
>   seq_putc(m, '\n');
>   }
>
> I do not have a strong opinion here but I can fold this into my patch 2.

If the free list is empty or is very short, there is probably no need to
release and reacquire the lock. How about adding a check for a lower
bound like:

if (freecount > 1000) {
    spin_unlock_irq(>lock);
    cond_resched();
    spin_lock_irq(>lock);
}

Cheers,
Longman



RE: [PATCH v2 net-next] net: axienet: In kconfig add ARM64 as supported platform

2019-10-23 Thread Radhey Shyam Pandey
> -Original Message-
> From: Jakub Kicinski 
> Sent: Tuesday, October 22, 2019 11:00 PM
> To: Michal Simek 
> Cc: Radhey Shyam Pandey ; da...@davemloft.net;
> net...@vger.kernel.org; Anirudha Sarangi ; John Linn
> ; mchehab+sams...@kernel.org;
> gre...@linuxfoundation.org; nicolas.fe...@microchip.com; linux-arm-
> ker...@lists.infradead.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2 net-next] net: axienet: In kconfig add ARM64 as
> supported platform
> 
> On Mon, 21 Oct 2019 16:15:45 +0200, Michal Simek wrote:
> > On 21. 10. 19 12:18, Radhey Shyam Pandey wrote:
> > > xilinx axi_emac driver is supported on ZynqMP UltraScale platform.
> > > So enable ARCH64 in kconfig. It also removes redundant ARCH_ZYNQ
> > > dependency. Basic sanity testing is done on zu+ mpsoc zcu102
> > > evaluation board.
> > >
> > > Signed-off-by: Radhey Shyam Pandey
> 
> > > ---
> > > Changes for v2:
> > > Remove redundant ARCH_ZYNQ dependency.
> > > Modified commit description.
> > > ---
> > >  drivers/net/ethernet/xilinx/Kconfig | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/xilinx/Kconfig
> b/drivers/net/ethernet/xilinx/Kconfig
> > > index 8d994ce..da11876 100644
> > > --- a/drivers/net/ethernet/xilinx/Kconfig
> > > +++ b/drivers/net/ethernet/xilinx/Kconfig
> > > @@ -6,7 +6,7 @@
> > >  config NET_VENDOR_XILINX
> > >   bool "Xilinx devices"
> > >   default y
> > > - depends on PPC || PPC32 || MICROBLAZE || ARCH_ZYNQ || MIPS ||
> X86 || ARM || COMPILE_TEST
> > > + depends on PPC || PPC32 || MICROBLAZE || MIPS || X86 || ARM ||
> ARM64 || COMPILE_TEST
> > >   ---help---
> > > If you have a network (Ethernet) card belonging to this class, say Y.
> > >
> > > @@ -26,11 +26,11 @@ config XILINX_EMACLITE
> > >
> > >  config XILINX_AXI_EMAC
> > >   tristate "Xilinx 10/100/1000 AXI Ethernet support"
> > > - depends on MICROBLAZE || X86 || ARM || COMPILE_TEST
> > > + depends on MICROBLAZE || X86 || ARM || ARM64 || COMPILE_TEST
> > >   select PHYLINK
> > >   ---help---
> > > This driver supports the 10/100/1000 Ethernet from Xilinx for the
> > > -   AXI bus interface used in Xilinx Virtex FPGAs.
> > > +   AXI bus interface used in Xilinx Virtex FPGAs and Soc's.
> > >
> > >  config XILINX_LL_TEMAC
> > >   tristate "Xilinx LL TEMAC (LocalLink Tri-mode Ethernet MAC) driver"
> > >
> >
> > Acked-by: Michal Simek 
> >
> > But I can image that others could prefer to remove all dependencies.
> 
> Yes, we'd much rather see this litany of architectures removed.

Yes, I can build test on all mentioned architectures and see how it goes.

> Is there any reason it's there in the first place?
Looking into past few commits, this dependency list was incrementally
extended for each platform. In case there are no real dependencies
we can get rid of arch list.

> 
> Most drivers are tested on just a few architectures, but as long
> as correct APIs are used they are assumed to work across the board.
> Otherwise 75% of our drivers would be x86 only. Don't be shy.



Re: [PATCH v1 6/6] KEYS: measure keys when they are created or updated

2019-10-23 Thread Mimi Zohar
On Tue, 2019-10-22 at 17:18 -0700, Lakshmi Ramasubramanian wrote:
> diff --git a/security/security.c b/security/security.c
> index 250ee2d76406..707a9e7fa94d 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -2303,6 +2303,16 @@ int security_key_getsecurity(struct key *key, char 
> **_buffer)
>   return call_int_hook(key_getsecurity, 0, key, _buffer);
>  }
>  
> +int security_key_create_or_update(struct key *keyring,
> +   struct key *key,
> +   const struct cred *cred,
> +   unsigned long flags,
> +   bool create)
> +{
> + return ima_post_key_create_or_update(keyring, key, cred,
> +  flags, create);
> +}
> +
>  #endif   /* CONFIG_KEYS */

Either the new hook is an LSM and IMA hook, or it is just an IMA hook.
 We don't define a security_ function, if it is just an IMA hook.

Mimi



Re: [PATCH v2 4/9] perf affinity: Add infrastructure to save/restore affinity

2019-10-23 Thread Alexey Budankov
On 23.10.2019 20:19, Andi Kleen wrote:
> On Wed, Oct 23, 2019 at 07:16:13PM +0300, Alexey Budankov wrote:
>>
>> On 23.10.2019 17:52, Andi Kleen wrote:
>>> On Wed, Oct 23, 2019 at 04:30:49PM +0200, Jiri Olsa wrote:
 On Wed, Oct 23, 2019 at 06:02:35AM -0700, Andi Kleen wrote:
> On Wed, Oct 23, 2019 at 11:59:11AM +0200, Jiri Olsa wrote:
>> On Sun, Oct 20, 2019 at 10:51:57AM -0700, Andi Kleen wrote:
>>
>> SNIP
>>
>>> +}
>>> diff --git a/tools/perf/util/affinity.h b/tools/perf/util/affinity.h
>>> new file mode 100644
>>> index ..e56148607e33
>>> --- /dev/null
>>> +++ b/tools/perf/util/affinity.h
>>> @@ -0,0 +1,15 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +#ifndef AFFINITY_H
>>> +#define AFFINITY_H 1
>>> +
>>> +struct affinity {
>>> +   unsigned char *orig_cpus;
>>> +   unsigned char *sched_cpus;
>>
>> why not use cpu_set_t directly?
>
> Because it's too small in glibc (only 1024 CPUs) and perf already 
> supports more.

 nice, we're using it all over the place.. how about using bitmap_alloc?
>>>
>>> Okay.
>>>
>>> The other places is mainly perf record from Alexey's recent affinity 
>>> changes.
>>> These probably need to be fixed.
>>>
>>> +Alexey
>>
>> Despite the issue indeed looks generic for stat and record modes,
>> have you already observed record startup overhead somewhere in your setups?
>> I would, first, prefer to reproduce the overhead, to have stable use case 
>> for evaluation and then, possibly, improvement.
> 
> What I meant the cpu_set usages you added in 
> 
> commit 9d2ed64587c045304efe8872b0258c30803d370c
> Author: Alexey Budankov 
> Date:   Tue Jan 22 20:47:43 2019 +0300
> 
> perf record: Allocate affinity masks
> 
> need to be fixed to allocate dynamically, or at least use MAX_NR_CPUs to
> support systems with >1024CPUs. That's an independent functionality
> problem.

Oh, it is clear now. Thanks for pointing this out. For that to move from 
cpu_mask_t to new custom struct affinity type its API requires extension 
to provide mask operations similar to the ones that cpu_mask_t provides: 
CPU_ZERO(), CPU_SET(), CPU_EQUAL(), CPU_OR().

For example it could be like: affinity__mask_zero(), affinity__mask_set(), 
affinity__mask_equal(), affinity__mask_or() and then the collecting part 
of record could also be moved to struct affinity type and overcome >1024CPUs 
limitation.

~Alexey

> 
> I haven't seen any large enough perf record usage to run
> into the IPI problems for record.
> 
> -Andi
> 


Re: [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo

2019-10-23 Thread Waiman Long
On 10/23/19 2:02 PM, Michal Hocko wrote:
> On Wed 23-10-19 13:34:23, Waiman Long wrote:
> [...]
>> @@ -1419,6 +1419,17 @@ static void pagetypeinfo_showfree_print(struct 
>> seq_file *m,
>>  }
>>  seq_putc(m, '\n');
>>  }
>> +
>> +/*
>> + * List total free blocks per order
>> + */
>> +seq_printf(m, "Node %4d, zone %8s, total ",
>> +   pgdat->node_id, zone->name);
>> +for (order = 0; order < MAX_ORDER; ++order) {
>> +area = &(zone->free_area[order]);
>> +seq_printf(m, "%6lu ", area->nr_free);
>> +}
>> +seq_putc(m, '\n');
> This is essentially duplicating /proc/buddyinfo. Do we really need that?

Yes, you are right. As the information is available elsewhere. I am fine
with dropping this.

Cheers,
Longman



[PATCH] x86/dumpstack/64: Don't evaluate exception stacks before setup

2019-10-23 Thread Thomas Gleixner
Cyrill reported the following crash:

  BUG: unable to handle page fault for address: 1ff0
  #PF: supervisor read access in kernel mode
  RIP: 0010:get_stack_info+0xb3/0x148

It turns out that if the stack tracer is invoked before the exception stack
mappings are initialized in_exception_stack() can erroneously classify an
invalid address as an address inside of an exception stack:

begin = this_cpu_read(cea_exception_stacks);  <- 0
end = begin + sizeof(exception stacks);

i.e. any address between 0 and end will be considered as exception stack
address and the subsequent code will then try to derefence the resulting
stack frame at a non mapped address.

 end = begin + (unsigned long)ep->size;
 ==> end = 0x2000

 regs = (struct pt_regs *)end - 1;
 ==> regs = 0x2000 - sizeof(struct pt_regs *) = 0x1ff0

 info->next_sp   = (unsigned long *)regs->sp;
 ==> Crashes due to accessing 0x1ff0

Prevent this by checking the validity of the cea_exception_stack base
address and bailing out if it is zero.

Fixes: afcd21dad88b ("x86/dumpstack/64: Use cpu_entry_area instead of orig_ist")
Reported-by: Cyrill Gorcunov 
Signed-off-by: Thomas Gleixner 
Tested-by: Cyrill Gorcunov 
Cc: sta...@vger.kernel.org
---
 arch/x86/kernel/dumpstack_64.c |7 +++
 1 file changed, 7 insertions(+)

--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -94,6 +94,13 @@ static bool in_exception_stack(unsigned
BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
 
begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
+   /*
+* Handle the case where stack trace is collected _before_
+* cea_exception_stacks had been initialized.
+*/
+   if (!begin)
+   return false;
+
end = begin + sizeof(struct cea_exception_stacks);
/* Bail if @stack is outside the exception stack area. */
if (stk < begin || stk >= end)


Re: [Outreachy kernel] [PATCH v2 0/5] Remove typedef declarations in staging: octeon

2019-10-23 Thread Wambui Karuga
On Wed, Oct 23, 2019 at 08:43:04PM +0300, Aaro Koskinen wrote:
> Hi,
> 
> On Sat, Oct 12, 2019 at 08:35:19PM +0200, Julia Lawall wrote:
> > On Sat, 12 Oct 2019, Wambui Karuga wrote:
> > > This patchset removes the addition of new typedefs data types in octeon,
> > > along with replacing the previous uses with the new declaration format.
> > >
> > > v2 of the series removes the obsolete "_t" notation in the named types.
> > >
> > > Wambui Karuga (5):
> > >   staging: octeon: remove typedef declaration for cvmx_wqe
> > >   staging: octeon: remove typedef declaration for cvmx_helper_link_info
> > >   staging: octeon: remove typedef declaration for cvmx_fau_reg_32
> > >   staging: octeon: remove typedef declartion for cvmx_pko_command_word0
> > >   staging: octeon: remove typedef declaration for cvmx_fau_op_size
> > >
> > >  drivers/staging/octeon/ethernet-mdio.c   |  6 +--
> > >  drivers/staging/octeon/ethernet-rgmii.c  |  4 +-
> > >  drivers/staging/octeon/ethernet-rx.c |  6 +--
> > >  drivers/staging/octeon/ethernet-tx.c |  4 +-
> > >  drivers/staging/octeon/ethernet.c|  6 +--
> > >  drivers/staging/octeon/octeon-ethernet.h |  2 +-
> > >  drivers/staging/octeon/octeon-stubs.h| 56 
> > >  7 files changed, 43 insertions(+), 41 deletions(-)
> > 
> > For the series:
> > 
> > Acked-by: Julia Lawall 
> 
> This series breaks the build on MIPS/OCTEON (the only actual HW using this
> driver):
> 
> $ make ARCH=mips CROSS_COMPILE=mips64-linux-gnu- cavium_octeon_defconfig
> $ make ARCH=mips CROSS_COMPILE=mips64-linux-gnu-
> [...]
>   CC  drivers/staging/octeon/ethernet.o
> In file included from drivers/staging/octeon/ethernet.c:22:
> drivers/staging/octeon/octeon-ethernet.h:94:12: warning: 'union 
> cvmx_helper_link_info' declared inside parameter list will not be visible 
> outside of this definition or declaration
>   union cvmx_helper_link_info li);  
> ^
> drivers/staging/octeon/ethernet.c: In function 'cvm_oct_free_work':
> drivers/staging/octeon/ethernet.c:177:21: error: dereferencing pointer to 
> incomplete type 'struct cvmx_wqe'
>   int segments = work->word2.s.bufs;
>  ^~
> drivers/staging/octeon/ethernet.c: In function 'cvm_oct_common_open':
> drivers/staging/octeon/ethernet.c:463:30: error: storage size of 'link_info' 
> isn't known
>   union cvmx_helper_link_info link_info;
>   ^
> 
> etc.
> 
> Probably all these patches need to be reverted.
> 
> A.

Aaro, thanks for the heads up - I can try cross-compiling to see if I
can fix the errors.
wambui karuga.


[GIT PULL v3] tracing: A couple of minor fixes

2019-10-23 Thread Steven Rostedt


Linus,

[ Hopefully the change log of the last commit is good enough ]

Two minor fixes:

 - A race in perf trace initialization (missing mutexes)

 - Minor fix to represent gfp_t in synthetic events as properly signed


Please pull the latest trace-v5.4-rc3-3 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
trace-v5.4-rc3-3

Tag SHA1: abdeb1c7a64b38c5259c46978c103afe7d5b000c
Head SHA1: 6b1340cc00edeadd52ebd8a45171f38c8de2a387


Prateek Sood (1):
  tracing: Fix race in perf_trace_buf initialization

Zhengjun Xing (1):
  tracing: Fix "gfp_t" format for synthetic events


 kernel/trace/trace_event_perf.c  | 4 
 kernel/trace/trace_events_hist.c | 2 ++
 2 files changed, 6 insertions(+)
---
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 0892e38ed6fb..a9dfa04ffa44 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -272,9 +272,11 @@ int perf_kprobe_init(struct perf_event *p_event, bool 
is_retprobe)
goto out;
}
 
+   mutex_lock(_mutex);
ret = perf_trace_event_init(tp_event, p_event);
if (ret)
destroy_local_trace_kprobe(tp_event);
+   mutex_unlock(_mutex);
 out:
kfree(func);
return ret;
@@ -282,8 +284,10 @@ int perf_kprobe_init(struct perf_event *p_event, bool 
is_retprobe)
 
 void perf_kprobe_destroy(struct perf_event *p_event)
 {
+   mutex_lock(_mutex);
perf_trace_event_close(p_event);
perf_trace_event_unreg(p_event);
+   mutex_unlock(_mutex);
 
destroy_local_trace_kprobe(p_event->tp_event);
 }
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 57648c5aa679..7482a1466ebf 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -679,6 +679,8 @@ static bool synth_field_signed(char *type)
 {
if (str_has_prefix(type, "u"))
return false;
+   if (strcmp(type, "gfp_t") == 0)
+   return false;
 
return true;
 }


Re: [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 13:34:23, Waiman Long wrote:
[...]
> @@ -1419,6 +1419,17 @@ static void pagetypeinfo_showfree_print(struct 
> seq_file *m,
>   }
>   seq_putc(m, '\n');
>   }
> +
> + /*
> +  * List total free blocks per order
> +  */
> + seq_printf(m, "Node %4d, zone %8s, total ",
> +pgdat->node_id, zone->name);
> + for (order = 0; order < MAX_ORDER; ++order) {
> + area = &(zone->free_area[order]);
> + seq_printf(m, "%6lu ", area->nr_free);
> + }
> + seq_putc(m, '\n');

This is essentially duplicating /proc/buddyinfo. Do we really need that?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 13:34:22, Waiman Long wrote:
> With a threshold of 10, it is still possible that the zone lock
> will be held for a very long time in the worst case scenario where all
> the counts are just below the threshold. With up to 6 migration types
> and 11 orders, it means up to 6.6 millions.
> 
> Track the total number of list iterations done since the acquisition
> of the zone lock and release it whenever 10 iterations or more have
> been completed. This will cap the lock hold time to no more than 200,000
> list iterations.
> 
> Signed-off-by: Waiman Long 
> ---
>  mm/vmstat.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 57ba091e5460..c5b82fdf54af 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file 
> *m,
>   pg_data_t *pgdat, struct zone *zone)
>  {
>   int order, mtype;
> + unsigned long iteration_count = 0;
>  
>   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>   seq_printf(m, "Node %4d, zone %8s, type %12s ",
> @@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct 
> seq_file *m,
>* of pages in this order should be more than
>* sufficient
>*/
> - if (++freecount >= 10) {
> + if (++freecount > 10) {
>   overflow = true;
> - spin_unlock_irq(>lock);
> - cond_resched();
> - spin_lock_irq(>lock);
> + freecount--;
>   break;
>   }
>   }
>   seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
> freecount);
> + /*
> +  * Take a break and release the zone lock when
> +  * 10 or more entries have been iterated.
> +  */
> + iteration_count += freecount;
> + if (iteration_count >= 10) {
> + iteration_count = 0;
> + spin_unlock_irq(>lock);
> + cond_resched();
> + spin_lock_irq(>lock);
> + }

Aren't you overengineering this a bit? If you are still worried then we
can simply cond_resched for each order
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c156ce24a322..ddb89f4e0486 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,13 +1399,13 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
 */
if (++freecount >= 10) {
overflow = true;
-   spin_unlock_irq(>lock);
-   cond_resched();
-   spin_lock_irq(>lock);
break;
}
}
seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
freecount);
+   spin_unlock_irq(>lock);
+   cond_resched();
+   spin_lock_irq(>lock);
}
seq_putc(m, '\n');
}

I do not have a strong opinion here but I can fold this into my patch 2.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH net-next 4/4] bonding: balance ICMP echoes in layer3+4 mode

2019-10-23 Thread Simon Horman
On Wed, Oct 23, 2019 at 06:58:16PM +0200, Matteo Croce wrote:
> On Wed, Oct 23, 2019 at 12:01 PM Simon Horman
>  wrote:
> >
> > On Mon, Oct 21, 2019 at 10:09:48PM +0200, Matteo Croce wrote:
> > > The bonding uses the L4 ports to balance flows between slaves.
> > > As the ICMP protocol has no ports, those packets are sent all to the
> > > same device:
> > >
> > > # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
> > > # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
> > > # ping -qc1 192.168.0.2
> > > 1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, 
> > > length 64
> > > 1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, 
> > > length 64
> > > # ping -qc1 192.168.0.2
> > > 1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, 
> > > length 64
> > > 1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, 
> > > length 64
> > > # ping -qc1 192.168.0.2
> > > 1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, 
> > > length 64
> > > 1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, 
> > > length 64
> > >
> > > But some ICMP packets have an Identifier field which is
> > > used to match packets within sessions, let's use this value in the hash
> > > function to balance these packets between bond slaves:
> > >
> > > # ping -qc1 192.168.0.2
> > > 0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, 
> > > length 64
> > > 0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, 
> > > length 64
> > > # ping -qc1 192.168.0.2
> > > 1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, 
> > > length 64
> > > 1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, 
> > > length 64
> > >
> > > Signed-off-by: Matteo Croce 
> >
> > I see where this patch is going but it is unclear to me what problem it is
> > solving. I would expect ICMP traffic to be low volume and thus able to be
> > handled by a single lower-device of a bond.
> >
> > ...
> 
> Hi,
> 
> The problem is not balancing the volume, even if it could increase due
> to IoT devices pinging some well known DNS servers to check for
> connection.
> If a bonding slave is down, people using pings to check for
> connectivity could fail to detect a broken link if all the packets are
> sent to the alive link.
> Anyway, although I didn't measure it, the computational overhead of
> this changeset should be minimal, and only affect ICMP packets when
> the ICMP dissector is used.

So the idea is that by using different id values ping could be used
to probe all lower-devices of a bond? If so then I understand why
you want this and have no particular objection.


Re: [v2 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

2019-10-23 Thread Yang Shi




On 10/23/19 10:24 AM, Matthew Wilcox wrote:

On Thu, Oct 24, 2019 at 01:05:04AM +0800, Yang Shi wrote:

+   return map_count >= 0 &&
+  map_count == atomic_read([1].compound_mapcount);
  }

I didn't like Hugh's duplicate definition either.  May I suggest:


Thanks, Willy. It is fine to me. Will take it in v3.



diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2f2199a51941..3d0efd937d2b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -695,11 +695,6 @@ static inline void *kvcalloc(size_t n, size_t size, gfp_t 
flags)
  
  extern void kvfree(const void *addr);
  
-static inline atomic_t *compound_mapcount_ptr(struct page *page)

-{
-   return [1].compound_mapcount;
-}
-
  static inline int compound_mapcount(struct page *page)
  {
VM_BUG_ON_PAGE(!PageCompound(page), page);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fa795284..270aa8fd2800 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -221,6 +221,11 @@ struct page {
  #endif
  } _struct_page_alignment;
  
+static inline atomic_t *compound_mapcount_ptr(struct page *page)

+{
+   return [1].compound_mapcount;
+}
+
  /*
   * Used for sizing the vmemmap region on some architectures
   */




Re: [alsa-devel] [PATCH v2 4/5] soundwire: intel/cadence: add flag for interrupt enable

2019-10-23 Thread Pierre-Louis Bossart




On 10/21/19 11:55 PM, Vinod Koul wrote:

On 21-10-19, 05:26, Pierre-Louis Bossart wrote:

On 10/20/19 11:14 PM, Vinod Koul wrote:

On 16-09-19, 14:09, Pierre-Louis Bossart wrote:

Prepare for future PM support and fix error handling by disabling
interrupts as needed.

Signed-off-by: Pierre-Louis Bossart 
---
   drivers/soundwire/cadence_master.c | 18 --
   drivers/soundwire/cadence_master.h |  2 +-
   drivers/soundwire/intel.c  | 12 +++-
   3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/soundwire/cadence_master.c 
b/drivers/soundwire/cadence_master.c
index 5f900cf2acb9..a71df99ca18f 100644
--- a/drivers/soundwire/cadence_master.c
+++ b/drivers/soundwire/cadence_master.c
@@ -819,14 +819,17 @@ EXPORT_SYMBOL(sdw_cdns_exit_reset);
* sdw_cdns_enable_interrupt() - Enable SDW interrupts and update config
* @cdns: Cadence instance
*/
-int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns)
+int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state)
   {
-   u32 mask;
+   u32 slave_intmask0 = 0;
+   u32 slave_intmask1 = 0;
+   u32 mask = 0;
+
+   if (!state)
+   goto update_masks;
-   cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0,
-   CDNS_MCP_SLAVE_INTMASK0_MASK);
-   cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1,
-   CDNS_MCP_SLAVE_INTMASK1_MASK);
+   slave_intmask0 = CDNS_MCP_SLAVE_INTMASK0_MASK;
+   slave_intmask1 = CDNS_MCP_SLAVE_INTMASK1_MASK;
/* enable detection of all slave state changes */
mask = CDNS_MCP_INT_SLAVE_MASK;
@@ -849,6 +852,9 @@ int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns)
if (interrupt_mask) /* parameter override */
mask = interrupt_mask;
+update_masks:
+   cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK0, slave_intmask0);
+   cdns_writel(cdns, CDNS_MCP_SLAVE_INTMASK1, slave_intmask1);
cdns_writel(cdns, CDNS_MCP_INTMASK, mask);
/* commit changes */
diff --git a/drivers/soundwire/cadence_master.h 
b/drivers/soundwire/cadence_master.h
index 1a67728c5000..302351808098 100644
--- a/drivers/soundwire/cadence_master.h
+++ b/drivers/soundwire/cadence_master.h
@@ -162,7 +162,7 @@ int sdw_cdns_init(struct sdw_cdns *cdns);
   int sdw_cdns_pdi_init(struct sdw_cdns *cdns,
  struct sdw_cdns_stream_config config);
   int sdw_cdns_exit_reset(struct sdw_cdns *cdns);
-int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns);
+int sdw_cdns_enable_interrupt(struct sdw_cdns *cdns, bool state);
   #ifdef CONFIG_DEBUG_FS
   void sdw_cdns_debugfs_init(struct sdw_cdns *cdns, struct dentry *root);
diff --git a/drivers/soundwire/intel.c b/drivers/soundwire/intel.c
index cdb3243e8823..08530c136c5f 100644
--- a/drivers/soundwire/intel.c
+++ b/drivers/soundwire/intel.c
@@ -1036,7 +1036,7 @@ static int intel_probe(struct platform_device *pdev)
ret = sdw_add_bus_master(>cdns.bus);
if (ret) {
dev_err(>dev, "sdw_add_bus_master fail: %d\n", ret);
-   goto err_master_reg;
+   return ret;


I am not sure I like this line change, before this IIRC the function and
single place of return, so changing this doesn't seem to improve
anything here..?


Doing a goto to do a return is not very nice either.


Hrmm, isn't that what you are doing few lines below. The point here is
that this line of change here doesnt change anything, doesnt improve
anything so why change :)


I knew there was a reason.. the existing code on your soundwire/next 
branch mixes goto and return, so I chose to use a return for the first 
case as well.


ret = sdw_add_bus_master(>cdns.bus);
if (ret) {
dev_err(>dev, "sdw_add_bus_master fail: %d\n", ret);
goto err_master_reg; >>> changed to return ret;
}

if (sdw->cdns.bus.prop.hw_disabled) {
dev_info(>dev, "SoundWire master %d is disabled, 
ignoring\n",
 sdw->cdns.bus.link_id);
return 0;
}






I can change this, but it doesn't really matter: this entire code will be
removed anyways to get rid of platform_devices and the probe itself will be
split in two.




}
if (sdw->cdns.bus.prop.hw_disabled) {
@@ -1067,7 +1067,7 @@ static int intel_probe(struct platform_device *pdev)
goto err_init;
}
-   ret = sdw_cdns_enable_interrupt(>cdns);
+   ret = sdw_cdns_enable_interrupt(>cdns, true);
if (ret < 0) {
dev_err(sdw->cdns.dev, "cannot enable interrupts\n");
goto err_init;
@@ -1076,7 +1076,7 @@ static int intel_probe(struct platform_device *pdev)
ret = sdw_cdns_exit_reset(>cdns);
if (ret < 0) {
dev_err(sdw->cdns.dev, "unable to exit bus reset sequence\n");
-   goto err_init;
+   goto err_interrupt;
}
/* Register DAIs */
@@ -1084,18 +1084,19 @@ static int 

[PATCH] powerpc/fadump: Remove duplicate message.

2019-10-23 Thread Michal Suchanek
There is duplicate message about lack of support by firmware in
fadump_reserve_mem and setup_fadump. Due to different capitalization it
is clear that the one in setup_fadump is shown on boot. Remove the
duplicate that is not shown.

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/kernel/fadump.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index cdcdea6c6453..f40dc713f089 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -438,10 +438,8 @@ int __init fadump_reserve_mem(void)
if (!fw_dump.fadump_enabled)
return 0;
 
-   if (!fw_dump.fadump_supported) {
-   pr_info("Firmware-Assisted Dump is not supported on this 
hardware\n");
+   if (!fw_dump.fadump_supported)
goto error_out;
-   }
 
/*
 * Initialize boot memory size
-- 
2.23.0



Re: [PATCH 2/2] mm: memcontrol: try harder to set a new memory.high

2019-10-23 Thread Johannes Weiner
On Wed, Oct 23, 2019 at 08:59:49AM +0200, Michal Hocko wrote:
> On Tue 22-10-19 16:15:18, Johannes Weiner wrote:
> > Setting a memory.high limit below the usage makes almost no effort to
> > shrink the cgroup to the new target size.
> > 
> > While memory.high is a "soft" limit that isn't supposed to cause OOM
> > situations, we should still try harder to meet a user request through
> > persistent reclaim.
> > 
> > For example, after setting a 10M memory.high on an 800M cgroup full of
> > file cache, the usage shrinks to about 350M:
> > 
> > + cat /cgroup/workingset/memory.current
> > 841568256
> > + echo 10M
> > + cat /cgroup/workingset/memory.current
> > 355729408
> > 
> > This isn't exactly what the user would expect to happen. Setting the
> > value a few more times eventually whittles the usage down to what we
> > are asking for:
> > 
> > + echo 10M
> > + cat /cgroup/workingset/memory.current
> > 104181760
> > + echo 10M
> > + cat /cgroup/workingset/memory.current
> > 31801344
> > + echo 10M
> > + cat /cgroup/workingset/memory.current
> > 10440704
> > 
> > To improve this, add reclaim retry loops to the memory.high write()
> > callback, similar to what we do for memory.max, to make a reasonable
> > effort that the usage meets the requested size after the call returns.
> 
> That suggests that the reclaim couldn't meet the given reclaim target
> but later attempts just made it through. Is this due to amount of dirty
> pages or what prevented the reclaim to do its job?
> 
> While I am not against the reclaim retry loop I would like to understand
> the underlying issue. Because if this is really about dirty memory then
> we should probably be more pro-active in flushing it. Otherwise the
> retry might not be of any help.

All the pages in my test case are clean cache. But they are active,
and they need to go through the inactive list before reclaiming. The
inactive list size is designed to pre-age just enough pages for
regular reclaim targets, i.e. pages in the SWAP_CLUSTER_MAX ballpark,
In this case, the reclaim goal for a single invocation is 790M and the
inactive list is a small funnel to put all that through, and we need
several iterations to accomplish that.

But 790M is not a reasonable reclaim target to ask of a single reclaim
invocation. And it wouldn't be reasonable to optimize the reclaim code
for it. So asking for the full size but retrying is not a bad choice
here: we express our intent, and benefit if reclaim becomes better at
handling larger requests, but we also acknowledge that some of the
deltas we can encounter in memory_high_write() are just too
ridiculously big for a single reclaim invocation to manage.


Re: [PATCH net-next 3/4] flow_dissector: extract more ICMP information

2019-10-23 Thread Simon Horman
On Wed, Oct 23, 2019 at 12:53:37PM +0200, Matteo Croce wrote:
> On Wed, Oct 23, 2019 at 12:00 PM Simon Horman
>  wrote:
> > On Mon, Oct 21, 2019 at 10:09:47PM +0200, Matteo Croce wrote:
> > > + switch (ih->type) {
> > > + case ICMP_ECHO:
> > > + case ICMP_ECHOREPLY:
> > > + case ICMP_TIMESTAMP:
> > > + case ICMP_TIMESTAMPREPLY:
> > > + case ICMPV6_ECHO_REQUEST:
> > > + case ICMPV6_ECHO_REPLY:
> > > + /* As we use 0 to signal that the Id field is not present,
> > > +  * avoid confusion with packets without such field
> > > +  */
> > > + key_icmp->id = ih->un.echo.id ? : 1;
> >
> > Its not obvious to me why the kernel should treat id-zero as a special
> > value if it is not special on the wire.
> >
> > Perhaps a caller who needs to know if the id is present can
> > check the ICMP type as this code does, say using a helper.
> >
> 
> Hi,
> 
> The problem is that the 0-0 Type-Code pair identifies the echo replies.
> So instead of adding a bool is_present value I hardcoded the info in
> the ID field making it always non null, at the expense of a possible
> collision, which is harmless.

Sorry, I feel that I'm missing something here.

My reading of the code above is that for the cased types above
(echo, echo reply, ...) the id is present. Otherwise it is not.
My idea would be to put a check for those types in a helper.

I do agree that the override you have used is harmless enough
in the context of the only user of the id which appears in
the following patch of this series.


Some other things I noticed in this patch on a second pass:

* I think you can remove the icmp field from struct flow_dissector_key_ports

* I think that adding icmp to struct flow_keys should be accompanied by
  adding ICMP to flow_keys_dissector_symmetric_keys. But I think this is
  not desirable outside of the bonding use-case and rather
  the bonding driver should define its own structures that
  includes the keys it needs - basically copies of struct flow_keys
  and flow_keys_dissector_symmetric_keys with some modifications.

* Modifying flow_keys_have_l4 affects the behaviour of
  skb_get_hash_flowi6() but there is not a corresponding update
  to flow_keys_have_l4(). I didn't look at all the other call sites
  but it strikes me that this is a) a wide-spread behavioural change
  and b) is perhaps not required for the bond-use case.


Re: [PATCH v3 3/6] KVM: x86/vPMU: Rename pmu_ops callbacks from msr_idx to rdpmc_idx

2019-10-23 Thread Jim Mattson
On Tue, Oct 22, 2019 at 1:12 AM Like Xu  wrote:
>
> The leagcy pmu_ops->msr_idx_to_pmc is only called in kvm_pmu_rdpmc, so
> this name is restrictedly limited to rdpmc_idx which could be indexed
> exactly to a kvm_pmc. Let's restrict its semantic by renaming the
> existing msr_idx_to_pmc to rdpmc_idx_to_pmc, and is_valid_msr_idx to
> is_valid_rdpmc_idx (likewise for kvm_pmu_is_valid_msr_idx).
>
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Like Xu 
Nit: The ECX argument to RDPMC is more than just an index (in fact,
intel_is_valid_msr_idx() extracts the index from the provided ECX
value), so I'd suggest s/rdpmc_idx/rdpmc_ecx/g.

Reviewed-by: Jim Mattson 


Re: [alsa-devel] [PATCH 0/2] ASoC: hdmi-codec: fix locking issue

2019-10-23 Thread Jerome Brunet


On Wed 23 Oct 2019 at 18:23, Takashi Iwai  wrote:

> On Wed, 23 Oct 2019 18:12:01 +0200,
> Jerome Brunet wrote:
>> 
>> This patchset fixes the locking issue reported by Russell.
>> 
>> As explained a mutex was used as flag and held while returning to
>> userspace.
>> 
>> Patch 2 is entirely optional and switches from bit atomic operation
>> to mutex again. I tend to prefer bit atomic operation in this
>> particular case but either way should be fine.
>
> I fail to see why the mutex is needed there.  Could you elaborate
> about the background?

You are right, It is not required.

Just a bit of history:

A while ago the hdmi-codec was keeping track of the substream pointer.
It was not used for anything useful, other than knowing if device was
busy or not, and it was causing problem with codec-to-codec links

I removed the saved substream pointer and replaced with a simple bit to
track the busy state. Protecting a single bit with a mutex seemed a bit
overkill to me, so I thought it was a good idea to replace the whole
thing with atomic bit ops. [0]

Mark told me he preferred the mutex method as it simpler to understand.
As long as as it works it's fine for me :)

I proposed something that uses the mutex as a busy flag but it turned
out to be a bad idea.

With the revert, we are back to the bit ops. Even if it works, Mark's
original comment on the bit ops still stands I think. This is why I'm
proposing patch 2 but I don't really mind if it is applied or not.

[0] https://lkml.kernel.org/r/20190506095815.24578-3-jbru...@baylibre.com

>
> IIUC, the protection with the atomic bitmap should guarantee the
> exclusive access.  The mutex would allow the possible concurrent calls
> of multiple startup of a single instance, but is this the thing to be
> solved?
>
>
> thanks,
>
> Takashi
>
>> 
>> Jerome Brunet (2):
>>   Revert "ASoC: hdmi-codec: re-introduce mutex locking"
>>   ASoC: hdmi-codec: re-introduce mutex locking again
>> 
>>  sound/soc/codecs/hdmi-codec.c | 15 +++
>>  1 file changed, 11 insertions(+), 4 deletions(-)
>> 
>> -- 
>> 2.21.0
>> 
>> ___
>> Alsa-devel mailing list
>> alsa-de...@alsa-project.org
>> https://mailman.alsa-project.org/mailman/listinfo/alsa-devel
>> 



Re: [PATCH v1 5/6] KEYS: measure queued keys

2019-10-23 Thread Mimi Zohar
On Wed, 2019-10-23 at 10:34 -0700, Lakshmi Ramasubramanian wrote:
> On 10/23/19 6:23 AM, Mimi Zohar wrote:
> 
> > The ordering of this patch set is awkward.  It should first introduce
> > a generic method for measuring keys based on the keyring.  Then add
> > the additional support needed for the specific builtin_trusted_keys
> > keyring usecase.
> 
> Would the following ordering of the patch set be acceptable:
> 
>   => PATCH 0/5: Cover letter
> 
>   => PATCH 1/5: Define the enum "hook(BUILTIN_TRUSTED_KEYS)" in ima.h
> 
>   => PATCH 2/5: Define ima hook
> This will initially do nothing if ima is not yet
> initialized.
> Call process_buffer_measurement() if ima is initialized.
> 
>   => PATCH 3/5: key_create_or_update change and the call to ima hook
> 
>   => PATCH 4/5: Queue\De-Queue of key measurement requests.
> Enable queuing of key in the ima hook if ima is not
> initialized.
> 
>   => PATCH 5/5: ima policy to enable measurement of keys which will
> enable end-to-end working of this feature.

The first patches need to introduce the generic concept of measuring
keys based on policy.  Only afterwards would you add any builtin
trusted keyring specific code.

Mimi



[PATCH v2 2/3] watchdog: jz4740: Use regmap provided by TCU driver

2019-10-23 Thread Paul Cercueil
Since we broke the ABI by changing the clock, the driver was also
updated to use the regmap provided by the TCU driver.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
Acked-by: Guenter Roeck 
---

Notes:
v2: Rebase on top of 5.4-rc4

 drivers/watchdog/Kconfig  |  1 +
 drivers/watchdog/jz4740_wdt.c | 35 ---
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index 6421187769cf..dbef995856bf 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -1644,6 +1644,7 @@ config JZ4740_WDT
depends on MACH_JZ4740 || MACH_JZ4780
depends on COMMON_CLK
select WATCHDOG_CORE
+   select MFD_SYSCON
help
  Hardware driver for the built-in watchdog timer on Ingenic jz4740 
SoCs.
 
diff --git a/drivers/watchdog/jz4740_wdt.c b/drivers/watchdog/jz4740_wdt.c
index 72920f09f4a7..bdf9564efa29 100644
--- a/drivers/watchdog/jz4740_wdt.c
+++ b/drivers/watchdog/jz4740_wdt.c
@@ -5,6 +5,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -17,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DEFAULT_HEARTBEAT 5
 #define MAX_HEARTBEAT 2048
@@ -36,7 +38,7 @@ MODULE_PARM_DESC(heartbeat,
 
 struct jz4740_wdt_drvdata {
struct watchdog_device wdt;
-   void __iomem *base;
+   struct regmap *map;
struct clk *clk;
unsigned long clk_rate;
 };
@@ -45,7 +47,8 @@ static int jz4740_wdt_ping(struct watchdog_device *wdt_dev)
 {
struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev);
 
-   writew(0x0, drvdata->base + TCU_REG_WDT_TCNT);
+   regmap_write(drvdata->map, TCU_REG_WDT_TCNT, 0);
+
return 0;
 }
 
@@ -54,16 +57,16 @@ static int jz4740_wdt_set_timeout(struct watchdog_device 
*wdt_dev,
 {
struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev);
u16 timeout_value = (u16)(drvdata->clk_rate * new_timeout);
-   u8 tcer;
+   unsigned int tcer;
 
-   tcer = readb(drvdata->base + TCU_REG_WDT_TCER);
-   writeb(0x0, drvdata->base + TCU_REG_WDT_TCER);
+   regmap_read(drvdata->map, TCU_REG_WDT_TCER, );
+   regmap_write(drvdata->map, TCU_REG_WDT_TCER, 0);
 
-   writew((u16)timeout_value, drvdata->base + TCU_REG_WDT_TDR);
-   writew(0x0, drvdata->base + TCU_REG_WDT_TCNT);
+   regmap_write(drvdata->map, TCU_REG_WDT_TDR, timeout_value);
+   regmap_write(drvdata->map, TCU_REG_WDT_TCNT, 0);
 
if (tcer & TCU_WDT_TCER_TCEN)
-   writeb(TCU_WDT_TCER_TCEN, drvdata->base + TCU_REG_WDT_TCER);
+   regmap_write(drvdata->map, TCU_REG_WDT_TCER, TCU_WDT_TCER_TCEN);
 
wdt_dev->timeout = new_timeout;
return 0;
@@ -72,20 +75,20 @@ static int jz4740_wdt_set_timeout(struct watchdog_device 
*wdt_dev,
 static int jz4740_wdt_start(struct watchdog_device *wdt_dev)
 {
struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev);
+   unsigned int tcer;
int ret;
-   u8 tcer;
 
ret = clk_prepare_enable(drvdata->clk);
if (ret)
return ret;
 
-   tcer = readb(drvdata->base + TCU_REG_WDT_TCER);
+   regmap_read(drvdata->map, TCU_REG_WDT_TCER, );
 
jz4740_wdt_set_timeout(wdt_dev, wdt_dev->timeout);
 
/* Start watchdog if it wasn't started already */
if (!(tcer & TCU_WDT_TCER_TCEN))
-   writeb(TCU_WDT_TCER_TCEN, drvdata->base + TCU_REG_WDT_TCER);
+   regmap_write(drvdata->map, TCU_REG_WDT_TCER, TCU_WDT_TCER_TCEN);
 
return 0;
 }
@@ -94,7 +97,7 @@ static int jz4740_wdt_stop(struct watchdog_device *wdt_dev)
 {
struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev);
 
-   writeb(0x0, drvdata->base + TCU_REG_WDT_TCER);
+   regmap_write(drvdata->map, TCU_REG_WDT_TCER, 0);
clk_disable_unprepare(drvdata->clk);
 
return 0;
@@ -172,9 +175,11 @@ static int jz4740_wdt_probe(struct platform_device *pdev)
watchdog_set_nowayout(jz4740_wdt, nowayout);
watchdog_set_drvdata(jz4740_wdt, drvdata);
 
-   drvdata->base = devm_platform_ioremap_resource(pdev, 0);
-   if (IS_ERR(drvdata->base))
-   return PTR_ERR(drvdata->base);
+   drvdata->map = device_node_to_regmap(dev->parent->of_node);
+   if (!drvdata->map) {
+   dev_err(dev, "regmap not found\n");
+   return -EINVAL;
+   }
 
return devm_watchdog_register_device(dev, >wdt);
 }
-- 
2.23.0



[PATCH v2 3/3] watchdog: jz4740: Drop dependency on MACH_JZ47xx

2019-10-23 Thread Paul Cercueil
Depending on MACH_JZ47xx prevent us from creating a generic kernel that
works on more than one MIPS board. Instead, we just depend on MIPS being
set.

Signed-off-by: Paul Cercueil 
Acked-by: Guenter Roeck 
---

Notes:
v2: Rebase on top of 5.4-rc4

 drivers/watchdog/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index dbef995856bf..fd4844f0a8f3 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -1641,7 +1641,7 @@ config INDYDOG
 
 config JZ4740_WDT
tristate "Ingenic jz4740 SoC hardware watchdog"
-   depends on MACH_JZ4740 || MACH_JZ4780
+   depends on MIPS
depends on COMMON_CLK
select WATCHDOG_CORE
select MFD_SYSCON
-- 
2.23.0



[PATCH v2 1/3] watchdog: jz4740: Use WDT clock provided by TCU driver

2019-10-23 Thread Paul Cercueil
Instead of requesting the "ext" clock and handling the watchdog clock
divider and gating in the watchdog driver, we now request and use the
"wdt" clock that is supplied by the ingenic-timer "TCU" driver.

The major benefit is that the watchdog's clock rate and parent can now
be specified from within devicetree, instead of hardcoded in the driver.

Also, this driver won't poke anymore into the TCU registers to
enable/disable the clock, as this is now handled by the TCU driver.

On the bad side, we break the ABI with devicetree - as we now request a
different clock. In this very specific case it is still okay, as every
Ingenic JZ47xx-based board out there compile the devicetree within the
kernel; so it's still time to push breaking changes, in order to get a
clean devicetree that won't break once it musn't.

Signed-off-by: Paul Cercueil 
Tested-by: Mathieu Malaterre 
Tested-by: Artur Rojek 
Acked-by: Guenter Roeck 
---

Notes:
v2: Rebase on top of 5.4-rc4

 drivers/watchdog/Kconfig  |  1 +
 drivers/watchdog/jz4740_wdt.c | 75 ++-
 2 files changed, 31 insertions(+), 45 deletions(-)

diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index 58e7c100b6ad..6421187769cf 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -1642,6 +1642,7 @@ config INDYDOG
 config JZ4740_WDT
tristate "Ingenic jz4740 SoC hardware watchdog"
depends on MACH_JZ4740 || MACH_JZ4780
+   depends on COMMON_CLK
select WATCHDOG_CORE
help
  Hardware driver for the built-in watchdog timer on Ingenic jz4740 
SoCs.
diff --git a/drivers/watchdog/jz4740_wdt.c b/drivers/watchdog/jz4740_wdt.c
index c6052ae54f32..72920f09f4a7 100644
--- a/drivers/watchdog/jz4740_wdt.c
+++ b/drivers/watchdog/jz4740_wdt.c
@@ -18,19 +18,6 @@
 #include 
 #include 
 
-#include 
-
-#define JZ_WDT_CLOCK_PCLK 0x1
-#define JZ_WDT_CLOCK_RTC  0x2
-#define JZ_WDT_CLOCK_EXT  0x4
-
-#define JZ_WDT_CLOCK_DIV_1(0 << TCU_TCSR_PRESCALE_LSB)
-#define JZ_WDT_CLOCK_DIV_4(1 << TCU_TCSR_PRESCALE_LSB)
-#define JZ_WDT_CLOCK_DIV_16   (2 << TCU_TCSR_PRESCALE_LSB)
-#define JZ_WDT_CLOCK_DIV_64   (3 << TCU_TCSR_PRESCALE_LSB)
-#define JZ_WDT_CLOCK_DIV_256  (4 << TCU_TCSR_PRESCALE_LSB)
-#define JZ_WDT_CLOCK_DIV_1024 (5 << TCU_TCSR_PRESCALE_LSB)
-
 #define DEFAULT_HEARTBEAT 5
 #define MAX_HEARTBEAT 2048
 
@@ -50,7 +37,8 @@ MODULE_PARM_DESC(heartbeat,
 struct jz4740_wdt_drvdata {
struct watchdog_device wdt;
void __iomem *base;
-   struct clk *rtc_clk;
+   struct clk *clk;
+   unsigned long clk_rate;
 };
 
 static int jz4740_wdt_ping(struct watchdog_device *wdt_dev)
@@ -65,32 +53,14 @@ static int jz4740_wdt_set_timeout(struct watchdog_device 
*wdt_dev,
unsigned int new_timeout)
 {
struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev);
-   unsigned int rtc_clk_rate;
-   unsigned int timeout_value;
-   unsigned short clock_div = JZ_WDT_CLOCK_DIV_1;
+   u16 timeout_value = (u16)(drvdata->clk_rate * new_timeout);
u8 tcer;
 
-   rtc_clk_rate = clk_get_rate(drvdata->rtc_clk);
-
-   timeout_value = rtc_clk_rate * new_timeout;
-   while (timeout_value > 0x) {
-   if (clock_div == JZ_WDT_CLOCK_DIV_1024) {
-   /* Requested timeout too high;
-   * use highest possible value. */
-   timeout_value = 0x;
-   break;
-   }
-   timeout_value >>= 2;
-   clock_div += (1 << TCU_TCSR_PRESCALE_LSB);
-   }
-
tcer = readb(drvdata->base + TCU_REG_WDT_TCER);
writeb(0x0, drvdata->base + TCU_REG_WDT_TCER);
-   writew(clock_div, drvdata->base + TCU_REG_WDT_TCSR);
 
writew((u16)timeout_value, drvdata->base + TCU_REG_WDT_TDR);
writew(0x0, drvdata->base + TCU_REG_WDT_TCNT);
-   writew(clock_div | JZ_WDT_CLOCK_RTC, drvdata->base + TCU_REG_WDT_TCSR);
 
if (tcer & TCU_WDT_TCER_TCEN)
writeb(TCU_WDT_TCER_TCEN, drvdata->base + TCU_REG_WDT_TCER);
@@ -102,11 +72,15 @@ static int jz4740_wdt_set_timeout(struct watchdog_device 
*wdt_dev,
 static int jz4740_wdt_start(struct watchdog_device *wdt_dev)
 {
struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev);
+   int ret;
u8 tcer;
 
+   ret = clk_prepare_enable(drvdata->clk);
+   if (ret)
+   return ret;
+
tcer = readb(drvdata->base + TCU_REG_WDT_TCER);
 
-   jz4740_timer_enable_watchdog();
jz4740_wdt_set_timeout(wdt_dev, wdt_dev->timeout);
 
/* Start watchdog if it wasn't started already */
@@ -121,7 +95,7 @@ static int jz4740_wdt_stop(struct watchdog_device *wdt_dev)
struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev);
 
writeb(0x0, drvdata->base + TCU_REG_WDT_TCER);
-   jz4740_timer_disable_watchdog();
+   

Re: [PATCH] perf/core: fix multiplexing event scheduling issue

2019-10-23 Thread Stephane Eranian
On Wed, Oct 23, 2019 at 4:02 AM Peter Zijlstra  wrote:
>
> On Wed, Oct 23, 2019 at 12:30:03AM -0700, Stephane Eranian wrote:
> > On Mon, Oct 21, 2019 at 3:21 AM Peter Zijlstra  wrote:
> > >
> > > On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > > > This patch complements the following commit:
> > > > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> > > >
> > > > The fix from Song addresses the consequences of the problem but
> > > > not the cause. This patch fixes the causes and can sit on top of
> > > > Song's patch.
> > >
> > > I'm tempted to say the other way around.
> > >
> > > Consider the case where you claim fixed2 with a pinned event and then
> > > have another fixed2 in the flexible list. At that point you're _never_
> > > going to run any other flexible events (without Song's patch).
> > >
> > In that case, there is no deactivation or removal of events, so yes, my 
> > patch
> > will not help that case. I said his patch is still useful. You gave one 
> > example,
> > even though in this case the rotate will not yield a reschedule of that 
> > flexible
> > event because fixed2 is used by a pinned event. So checking for it, will not
> > really help.
>
> Stick 10 cycle events after the fixed2 flexible event. Without Song's
> patch you'll never see those 10 cycle events get scheduled.
>
> > > This patch isn't going to help with that. Similarly, Songs patch helps
> > > with your situation where it will allow rotation to resume after you
> > > disable/remove all active events (while you still have pending events).
> > >
> > Yes, it will unblock the case where active events are deactivated or
> > removed. But it will delay the unblocking until the next mux timer
> > expires. And I am saying this is too far away in many cases. For instance,
> > we do not run with the 1ms timer for uncore, this is way too much overhead.
> > Imagine this timer is set to 10ms or event 100ms, just with Song's patch, 
> > the
> > inactive events would have to wait for up to 100ms to be scheduled again.
> > This is not acceptable for us.
>
> Then how was it acceptible to mux in the first place? And if
> multiplexing wasn't acceptible, then why were you doing it?
>
> > > > However, the cause is not addressed. The kernel should not rely on
> > > > the multiplexing hrtimer to unblock inactive events. That timer
> > > > can have abitrary duration in the milliseconds. Until the timer
> > > > fires, counters are available, but no measurable events are using
> > > > them. We do not want to introduce blind spots of arbitrary durations.
> > >
> > > This I disagree with -- you don't get a guarantee other than
> > > timer_period/n when you multiplex, and idling the counters until the
> > > next tick doesn't violate that at all.
> >
> > My take is that if you have free counters and "idling" events, the kernel
> > should take every effort to schedule them as soon as they become available.
> > In the situation I described in the patch, once I remove the active
> > events, there
> > is no more reasons for multiplexing, all the counters are free (ignore
> > watchdog).
>
> That's fine; all I'm arguing is that the current behaviour doesn't
> violate the guarantees given. Now you want to improve counter
> utilization (at a cost) and that is fine. Just don't argue that there's
> something broken -- there is not.
>
> Your patch also does not fix something more fundamental than Song's
> patch did. Quite the reverse. Yours is purely a utilization efficiency
> thing, while Song's addressed a correctness issue.
>
Going back to Song's patch and his test case. It exposes a problem that was
introduced with the RB tree and multiple event list changes. In the event
scheduler, there was this guarantee that each event will get a chance to
be scheduled because each would eventually get to the head of the list and
thus get a chance to be scheduled as the first event of its priority
class, assuming
there was still at least one compatible counter available from higher
priority classes.
The scheduler also still stops at the first error. In Song's case
ref-cycles:D,ref-cycles,cycles,
the pinned event is commandeering fixed2. But I believe the rotation
code was not
rotating the list *even* if it could not schedule the first event.
That explained why
the cycle event could never be scheduled. That's a violation of the guarantee.
At each timer, the list must rotate. I think his patch somehow addresses this.

> > Now you may be arguing, that it may take more time to ctx_resched() then to
> > wait for the timer to expire. But I am not sure I buy that.
>
> I'm not arguing that. All I'm saying is that fairness is not affected.
>
> > Similarly, I am not sure there is code to cancel an active mux hrtimer
> > when we clear rotate_necessary.  Maybe we just let it lapse and clear
> > itself via a ctx_sched_out() in the rotation code.
>
> Yes, we let it lapse and disable itself, I don't see the problem with
> that -- also remember 

Re: [Outreachy kernel] [PATCH v2 0/5] Remove typedef declarations in staging: octeon

2019-10-23 Thread Aaro Koskinen
Hi,

On Sat, Oct 12, 2019 at 08:35:19PM +0200, Julia Lawall wrote:
> On Sat, 12 Oct 2019, Wambui Karuga wrote:
> > This patchset removes the addition of new typedefs data types in octeon,
> > along with replacing the previous uses with the new declaration format.
> >
> > v2 of the series removes the obsolete "_t" notation in the named types.
> >
> > Wambui Karuga (5):
> >   staging: octeon: remove typedef declaration for cvmx_wqe
> >   staging: octeon: remove typedef declaration for cvmx_helper_link_info
> >   staging: octeon: remove typedef declaration for cvmx_fau_reg_32
> >   staging: octeon: remove typedef declartion for cvmx_pko_command_word0
> >   staging: octeon: remove typedef declaration for cvmx_fau_op_size
> >
> >  drivers/staging/octeon/ethernet-mdio.c   |  6 +--
> >  drivers/staging/octeon/ethernet-rgmii.c  |  4 +-
> >  drivers/staging/octeon/ethernet-rx.c |  6 +--
> >  drivers/staging/octeon/ethernet-tx.c |  4 +-
> >  drivers/staging/octeon/ethernet.c|  6 +--
> >  drivers/staging/octeon/octeon-ethernet.h |  2 +-
> >  drivers/staging/octeon/octeon-stubs.h| 56 
> >  7 files changed, 43 insertions(+), 41 deletions(-)
> 
> For the series:
> 
> Acked-by: Julia Lawall 

This series breaks the build on MIPS/OCTEON (the only actual HW using this
driver):

$ make ARCH=mips CROSS_COMPILE=mips64-linux-gnu- cavium_octeon_defconfig
$ make ARCH=mips CROSS_COMPILE=mips64-linux-gnu-
[...]
  CC  drivers/staging/octeon/ethernet.o
In file included from drivers/staging/octeon/ethernet.c:22:
drivers/staging/octeon/octeon-ethernet.h:94:12: warning: 'union 
cvmx_helper_link_info' declared inside parameter list will not be visible 
outside of this definition or declaration
  union cvmx_helper_link_info li);  
^
drivers/staging/octeon/ethernet.c: In function 'cvm_oct_free_work':
drivers/staging/octeon/ethernet.c:177:21: error: dereferencing pointer to 
incomplete type 'struct cvmx_wqe'
  int segments = work->word2.s.bufs;
 ^~
drivers/staging/octeon/ethernet.c: In function 'cvm_oct_common_open':
drivers/staging/octeon/ethernet.c:463:30: error: storage size of 'link_info' 
isn't known
  union cvmx_helper_link_info link_info;
  ^

etc.

Probably all these patches need to be reverted.

A.


Re: [PATCH v2 2/2] RISC-V: defconfig: Enable Goldfish RTC driver

2019-10-23 Thread Alistair Francis
On Tue, 2019-10-22 at 18:06 -0700, Paul Walmsley wrote:
> On Tue, 22 Oct 2019, Alistair Francis wrote:
> 
> > I think it makese sense for this to go into Linux first.
> > 
> > The QEMU patches are going to be accepted, just some nit picking to
> > do
> > first :)
> > 
> > After that we have to wait for a PR and then a QEMU release until
> > most
> > people will see the change in QEMU. In that time Linux 5.4 will be
> > released, if this can make it into 5.4 then everyone using 5.4 will
> > get
> > the new RTC as soon as they upgrade QEMU (QEMU provides the device
> > tree). If this has to wait until QEMU has support then it won't be
> > supported for users until even later.
> > 
> > Users are generally slow to update kernels (buildroot is still
> > using
> > 5.1 by default for example) so the sooner changes like this go in
> > the
> > better.
> 
> The defconfigs are really just for kernel developers.  We expect
> users to 
> define their own Kconfigs for their own needs.

From experience most people use the defconfig, at least as a starting
point.

> 
> If using the Goldfish code really is what we all want to do (see
> below), 
> then the kernel patch that should go in right away -- which also has
> no 
> dependence on what QEMU does -- would be the first patch of this
> series:
> 
> https://lore.kernel.org/linux-riscv/20190925063706.56175-2-anup.pa...@wdc.com/

Ok, so it looks like this patch will be a 5.5 patch not a 5.4 patch. It
looks like that can't be helped. I just don't want the defconfig change
waiting on QEMU as I think that just slows everything down.

> 
> And that should go in via whoever is maintaining the Goldfish driver,
> not 
> the RISC-V tree.  (It looks like drivers/platform/goldfish is
> completely 
> unmaintained - a red flag! - so probably someone needs to persuade
> Greg or 
> Andrew to take it.)
> 
> Incidentally, just looking at drivers/platform/goldfish, that driver
> seems 
> to be some sort of Google-specific RPC driver.  Are you all really
> sure 
> you want to enable that just for an RTC?  Seems like overkill - there
> are 
> much simpler RTCs out there.

I was under the impression that everyone was on board with this going
in. In QEMU land it doesn't make sense to add it if the kernel isn't
going to, so we need to be on the same page here.

From the other discussions it looks like you are happy with this change
overall right?

Alistair

> 
> 
> - Paul
> 
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv


Re: [PATCH] ARM: dts: bcm: HR2: add label to sp805 watchdog

2019-10-23 Thread Florian Fainelli
On Wed, 23 Oct 2019 11:19:56 +1300, Chris Packham 
 wrote:
> This allows boards the option of adding properties or disabling the
> watchdog entirely.
> 
> Signed-off-by: Chris Packham 
> ---

Applied to devicetree/next, thanks!
--
Florian


Re: [PATCH 3/3] x86/ftrace: Use text_poke()

2019-10-23 Thread Steven Rostedt
On Wed, 23 Oct 2019 12:23:06 -0400
Steven Rostedt  wrote:

> All you need to do is:
> 
>   register_ftrace_direct((unsigned long)func_you_want_to_trace,
>  (unsigned long)your_trampoline);
> 


> 
> Alexei,
> 
> Would this work for you?

I just pushed a test branch up to:

 git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git

 branch: ftrace/direct

Again, it's mostly development code, and may be buggy. (don't try
tracing when a direct function is added yet).

If this is something that you can use, then I'll work to clean it up
and sort out all the bugs.

Thanks!

-- Steve


Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

2019-10-23 Thread Shakeel Butt
On Wed, Oct 23, 2019 at 8:46 AM Johannes Weiner  wrote:
>
> On Wed, Oct 23, 2019 at 08:40:12AM +0200, Michal Hocko wrote:
> > On Tue 22-10-19 19:37:08, Johannes Weiner wrote:
> > > While upgrading from 4.16 to 5.2, we noticed these allocation errors
> > > in the log of the new kernel:
> > >
> > > [ 8642.253395] SLUB: Unable to allocate memory on node -1, 
> > > gfp=0xa20(GFP_ATOMIC)
> > > [ 8642.269170]   cache: tw_sock_TCPv6(960:helper-logs), object size: 232, 
> > > buffer size: 240, default order: 1, min order: 0
> > > [ 8642.293009]   node 0: slabs: 5, objs: 170, free: 0
> > >
> > > slab_out_of_memory+1
> > > ___slab_alloc+969
> > > __slab_alloc+14
> > > kmem_cache_alloc+346
> > > inet_twsk_alloc+60
> > > tcp_time_wait+46
> > > tcp_fin+206
> > > tcp_data_queue+2034
> > > tcp_rcv_state_process+784
> > > tcp_v6_do_rcv+405
> > > __release_sock+118
> > > tcp_close+385
> > > inet_release+46
> > > __sock_release+55
> > > sock_close+17
> > > __fput+170
> > > task_work_run+127
> > > exit_to_usermode_loop+191
> > > do_syscall_64+212
> > > entry_SYSCALL_64_after_hwframe+68
> > >
> > > accompanied by an increase in machines going completely radio silent
> > > under memory pressure.
> >
> > This is really worrying because that suggests that something depends on
> > GFP_ATOMIC allocation which is fragile and broken.
>
> I don't think that is true. You cannot rely on a *single instance* of
> atomic allocations to succeed. But you have to be able to rely on that
> failure is temporary and there is a chance of succeeding eventually.
>
> Network is a good example. It retries transmits, but within reason. If
> you aren't able to process incoming packets for minutes, you might as
> well be dead.
>
> > > One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
> > > sock objects to kmemcg"), which made these slab caches subject to
> > > cgroup memory accounting and control.
> > >
> > > The problem with that is that cgroups, unlike the page allocator, do
> > > not maintain dedicated atomic reserves. As a cgroup's usage hovers at
> > > its limit, atomic allocations - such as done during network rx - can
> > > fail consistently for extended periods of time. The kernel is not able
> > > to operate under these conditions.
> > >
> > > We don't want to revert the culprit patch, because it indeed tracks a
> > > potentially substantial amount of memory used by a cgroup.
> > >
> > > We also don't want to implement dedicated atomic reserves for cgroups.
> > > There is no point in keeping a fixed margin of unused bytes in the
> > > cgroup's memory budget to accomodate a consumer that is impossible to
> > > predict - we'd be wasting memory and get into configuration headaches,
> > > not unlike what we have going with min_free_kbytes. We do this for
> > > physical mem because we have to, but cgroups are an accounting game.
> > >
> > > Instead, account these privileged allocations to the cgroup, but let
> > > them bypass the configured limit if they have to. This way, we get the
> > > benefits of accounting the consumed memory and have it exert pressure
> > > on the rest of the cgroup, but like with the page allocator, we shift
> > > the burden of reclaimining on behalf of atomic allocations onto the
> > > regular allocations that can block.
> >
> > On the other hand this would allow to break the isolation by an
> > unpredictable amount. Should we put a simple cap on how much we can go
> > over the limit. If the memcg limit reclaim is not able to keep up with
> > those overflows then even __GFP_ATOMIC allocations have to fail. What do
> > you think?
>
> I don't expect a big overrun in practice, and it appears that Google
> has been letting even NOWAIT allocations pass through without
> isolation issues.

We have been overcharging for __GFP_HIGH allocations for couple of
years and see no isolation issues in the production.

> Likewise, we have been force-charging the skmem for
> a while now and it hasn't been an issue for reclaim to keep up.
>
> My experience from production is that it's a whole lot easier to debug
> something like a memory.max overrun than it is to debug a machine that
> won't respond to networking. So that's the side I would err on.


Re: [PATCH] staging: octeon: Fix incorrect type in assignment

2019-10-23 Thread Aaro Koskinen
Hi,

On Thu, Oct 10, 2019 at 07:38:15AM +0300, Wambui Karuga wrote:
> Fix the following warning generated by sparse in
> drivers/staging/octeon/ethernet-tx.c:
> 
> drivers/staging/octeon/ethernet-tx.c:563:50: warning: incorrect type in 
> assignment (different base types)
> drivers/staging/octeon/ethernet-tx.c:563:50:expected unsigned short 
> [usertype] hw_chksum
> drivers/staging/octeon/ethernet-tx.c:563:50:got restricted __wsum 
> [usertype] csum
> 
> Warning generated by running:
> make C=2 CF="-D__CHECK_ENDIAN__" drivers/staging/octeon/
> 
> Signed-off-by: Wambui Karuga 
> ---
>  drivers/staging/octeon/octeon-stubs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/octeon/octeon-stubs.h 
> b/drivers/staging/octeon/octeon-stubs.h
> index 38954b6c89e1..b2e3c72205dd 100644
> --- a/drivers/staging/octeon/octeon-stubs.h
> +++ b/drivers/staging/octeon/octeon-stubs.h
> @@ -123,7 +123,7 @@ union cvmx_pip_wqe_word0 {
>   struct {
>   uint64_t next_ptr:40;
>   uint8_t unused;
> - uint16_t hw_chksum;
> + __wsum hw_chksum;

I don't think this is correct. This struct defines HW register layout,
and the field needs to be 16 bits.

A.


[PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo

2019-10-23 Thread Waiman Long
With a threshold of 10, it is still possible that the zone lock
will be held for a very long time in the worst case scenario where all
the counts are just below the threshold. With up to 6 migration types
and 11 orders, it means up to 6.6 millions.

Track the total number of list iterations done since the acquisition
of the zone lock and release it whenever 10 iterations or more have
been completed. This will cap the lock hold time to no more than 200,000
list iterations.

Signed-off-by: Waiman Long 
---
 mm/vmstat.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 57ba091e5460..c5b82fdf54af 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
pg_data_t *pgdat, struct zone *zone)
 {
int order, mtype;
+   unsigned long iteration_count = 0;
 
for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
seq_printf(m, "Node %4d, zone %8s, type %12s ",
@@ -1397,15 +1398,24 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
 * of pages in this order should be more than
 * sufficient
 */
-   if (++freecount >= 10) {
+   if (++freecount > 10) {
overflow = true;
-   spin_unlock_irq(>lock);
-   cond_resched();
-   spin_lock_irq(>lock);
+   freecount--;
break;
}
}
seq_printf(m, "%s%6lu ", overflow ? ">" : "", 
freecount);
+   /*
+* Take a break and release the zone lock when
+* 10 or more entries have been iterated.
+*/
+   iteration_count += freecount;
+   if (iteration_count >= 10) {
+   iteration_count = 0;
+   spin_unlock_irq(>lock);
+   cond_resched();
+   spin_lock_irq(>lock);
+   }
}
seq_putc(m, '\n');
}
-- 
2.18.1



Re: [PATCH] Add prctl support for controlling PF_MEMALLOC V2

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 12:27:29, Mike Christie wrote:
> On 10/23/2019 02:11 AM, Michal Hocko wrote:
> > On Wed 23-10-19 07:43:44, Dave Chinner wrote:
> >> On Tue, Oct 22, 2019 at 06:33:10PM +0200, Michal Hocko wrote:
> > 
> > Thanks for more clarifiation regarding PF_LESS_THROTTLE.
> > 
> > [...]
> >>> PF_IO_FLUSHER would mean that the user
> >>> context is a part of the IO path and therefore there are certain reclaim
> >>> recursion restrictions.
> >>
> >> If PF_IO_FLUSHER just maps to PF_LESS_THROTTLE|PF_MEMALLOC_NOIO,
> >> then I'm not sure we need a new definition. Maybe that's the ptrace
> >> flag name, but in the kernel we don't need a PF_IO_FLUSHER process
> >> flag...
> > 
> > Yes, the internal implementation would do something like that. I was
> > more interested in the user space visible API at this stage. Something
> > generic enough because exporting MEMALLOC flags is just a bad idea IMHO
> > (especially PF_MEMALLOC).
> 
> Do you mean we would do something like:
> 
> prctl()
> 
> case PF_SET_IO_FLUSHER:
> current->flags |= PF_MEMALLOC_NOIO;
> 

yes, along with PF_LESS_THROTTLE.

-- 
Michal Hocko
SUSE Labs


[PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo

2019-10-23 Thread Waiman Long
Now that the free block count for each migration types in
/proc/pagetypeinfo may not show the exact count if it excceeds
100,000. Users may not know how much more the counts will be. As the
free_area structure has already tracked the total free block count in
nr_free, we may as well print it out with no additional cost. That will
give users a rough idea of where the upper bounds will be.

If there is no overflow, the presence of the total counts will also
enable us to check if the nr_free counts match the total number of
entries in the free lists.

Signed-off-by: Waiman Long 
---
 mm/vmstat.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c5b82fdf54af..172946d8f358 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,6 +1373,7 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
pg_data_t *pgdat, struct zone *zone)
 {
int order, mtype;
+   struct free_area *area;
unsigned long iteration_count = 0;
 
for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
@@ -1382,7 +1383,6 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
migratetype_names[mtype]);
for (order = 0; order < MAX_ORDER; ++order) {
unsigned long freecount = 0;
-   struct free_area *area;
struct list_head *curr;
bool overflow = false;
 
@@ -1419,6 +1419,17 @@ static void pagetypeinfo_showfree_print(struct seq_file 
*m,
}
seq_putc(m, '\n');
}
+
+   /*
+* List total free blocks per order
+*/
+   seq_printf(m, "Node %4d, zone %8s, total ",
+  pgdat->node_id, zone->name);
+   for (order = 0; order < MAX_ORDER; ++order) {
+   area = &(zone->free_area[order]);
+   seq_printf(m, "%6lu ", area->nr_free);
+   }
+   seq_putc(m, '\n');
 }
 
 /* Print out the free pages at each order for each migatetype */
-- 
2.18.1



Re: [PATCH v1 5/6] KEYS: measure queued keys

2019-10-23 Thread Lakshmi Ramasubramanian

On 10/23/19 6:23 AM, Mimi Zohar wrote:


The ordering of this patch set is awkward.  It should first introduce
a generic method for measuring keys based on the keyring.  Then add
the additional support needed for the specific builtin_trusted_keys
keyring usecase.


Would the following ordering of the patch set be acceptable:

 => PATCH 0/5: Cover letter

 => PATCH 1/5: Define the enum "hook(BUILTIN_TRUSTED_KEYS)" in ima.h

 => PATCH 2/5: Define ima hook
   This will initially do nothing if ima is not yet
   initialized.
   Call process_buffer_measurement() if ima is initialized.

 => PATCH 3/5: key_create_or_update change and the call to ima hook

 => PATCH 4/5: Queue\De-Queue of key measurement requests.
   Enable queuing of key in the ima hook if ima is not
   initialized.

 => PATCH 5/5: ima policy to enable measurement of keys which will
   enable end-to-end working of this feature.

thanks,
 -lakshmi


[PATCH v1 0/1] rxe driver should dynamically caclculate inline data size

2019-10-23 Thread rao Shoaib
From: Rao Shoaib 

Resending because of typo in the email addresses.

Currently rxe driver has a hard coded value for inline data size, where as mlx5 
driver calculates the size of inline data and number of SGE's to use based on 
the values in the qp request. Some applications depend on this behavior. This 
patch changes rxe to dynamically calculate the values.

Rao Shoaib (1):
  rxe: calculate inline data size based on requested values

 drivers/infiniband/sw/rxe/rxe_param.h | 2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c| 4 
 2 files changed, 5 insertions(+), 1 deletion(-)

-- 
1.8.3.1



[PATCH v1 1/1] rxe: calculate inline data size based on requested values

2019-10-23 Thread rao Shoaib
From: Rao Shoaib 

rxe driver has a hard coded value for the size of inline data, where as
mlx5 driver calculates number of SGE's and inline data size based on the
values in the qp request. This patch modifies rxe driver to do the same
so that applications can work seamlessly across drivers.

Signed-off-by: Rao Shoaib 
---
 drivers/infiniband/sw/rxe/rxe_param.h | 2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c| 4 
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h 
b/drivers/infiniband/sw/rxe/rxe_param.h
index 1b596fb..657f9303 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -68,7 +68,6 @@ enum rxe_device_param {
RXE_HW_VER  = 0,
RXE_MAX_QP  = 0x1,
RXE_MAX_QP_WR   = 0x4000,
-   RXE_MAX_INLINE_DATA = 400,
RXE_DEVICE_CAP_FLAGS= IB_DEVICE_BAD_PKEY_CNTR
| IB_DEVICE_BAD_QKEY_CNTR
| IB_DEVICE_AUTO_PATH_MIG
@@ -81,6 +80,7 @@ enum rxe_device_param {
| IB_DEVICE_MEM_MGT_EXTENSIONS,
RXE_MAX_SGE = 32,
RXE_MAX_SGE_RD  = 32,
+   RXE_MAX_INLINE_DATA = RXE_MAX_SGE * sizeof(struct ib_sge),
RXE_MAX_CQ  = 16384,
RXE_MAX_LOG_CQE = 15,
RXE_MAX_MR  = 2 * 1024,
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c 
b/drivers/infiniband/sw/rxe/rxe_qp.c
index aeea994..45b5da5 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -229,6 +229,7 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct 
rxe_qp *qp,
 {
int err;
int wqe_size;
+   unsigned int inline_size;
 
err = sock_create_kern(_net, AF_INET, SOCK_DGRAM, 0, >sk);
if (err < 0)
@@ -244,6 +245,9 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct 
rxe_qp *qp,
 sizeof(struct rxe_send_wqe) +
 qp->sq.max_inline);
 
+   inline_size = wqe_size - sizeof(struct rxe_send_wqe);
+   qp->sq.max_inline = inline_size;
+   init->cap.max_inline_data = inline_size;
qp->sq.queue = rxe_queue_init(rxe,
  >sq.max_wr,
  wqe_size);
-- 
1.8.3.1



Re: [PATCH] staging: octeon: Remove typedef declaration

2019-10-23 Thread Aaro Koskinen
Hi,

On Tue, Oct 08, 2019 at 07:09:43AM +0300, Wambui Karuga wrote:
> Fixes checkpatch.pl warning: do not add new typedefs in
> drivers/staging/octeon/octeon-stubs.h:41
> 
> Signed-off-by: Wambui Karuga 
> ---
>  drivers/staging/octeon/octeon-stubs.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/staging/octeon/octeon-stubs.h 
> b/drivers/staging/octeon/octeon-stubs.h
> index a4ac3bfb62a8..773591348ef4 100644
> --- a/drivers/staging/octeon/octeon-stubs.h
> +++ b/drivers/staging/octeon/octeon-stubs.h
> @@ -38,7 +38,7 @@
>  #define CVMX_NPI_RSL_INT_BLOCKS  0
>  #define CVMX_POW_WQ_INT_PC   0
>  
> -typedef union {
> +union cvmx_pip_wqe_word2 {

The "real" definition is in arch/mips/include/asm/octeon/cvmx-wqe.h.

octeon-stubs.h is just a temporary hack to allow compile testing without
MIPS cross-compiler. Changing only the stubs file is not really an
improvement.

A.


Re: [PATCH] ARM: dts: zynq: enablement of coresight topology

2019-10-23 Thread Mathieu Poirier
Hi Michal,

I was not CC'ed on the original post so I just noticed this today,
hence the late reply.  I don't know if you were looking for feedback
or already picked up the patch but here it is anyway.

On Wed, 9 Oct 2019 at 08:07, Michal Simek  wrote:
>
> From: Zumeng Chen 
>
> This patch is to build the coresight topology structure of zynq-7000
> series according to the docs of coresight and userguide of zynq-7000.
>
> Signed-off-by: Zumeng Chen 
> Signed-off-by: Quanyang Wang 
> Signed-off-by: Michal Simek 
> ---
>
>  arch/arm/boot/dts/zynq-7000.dtsi | 158 +++
>  1 file changed, 158 insertions(+)
>
> diff --git a/arch/arm/boot/dts/zynq-7000.dtsi 
> b/arch/arm/boot/dts/zynq-7000.dtsi
> index ca6425ad794c..86430ad76fee 100644
> --- a/arch/arm/boot/dts/zynq-7000.dtsi
> +++ b/arch/arm/boot/dts/zynq-7000.dtsi
> @@ -59,6 +59,40 @@
> regulator-always-on;
> };
>
> +   replicator {
> +   compatible = "arm,coresight-static-replicator";
> +   clocks = < 27>, < 46>, < 47>;
> +   clock-names = "apb_pclk", "dbg_trc", "dbg_apb";
> +
> +   out-ports {
> +   #address-cells = <1>;
> +   #size-cells = <0>;
> +
> +   /* replicator output ports */
> +   port@0 {
> +   reg = <0>;
> +   replicator_out_port0: endpoint {
> +   remote-endpoint = <_in_port>;
> +   };
> +   };
> +   port@1 {
> +   reg = <1>;
> +   replicator_out_port1: endpoint {
> +   remote-endpoint = <_in_port>;
> +   };
> +   };
> +   };
> +   in-ports {
> +   /* replicator input port */
> +   port {
> +   replicator_in_port0: endpoint {
> +   slave-mode;

The slave-mode property is no longer required and probably an
oversight since it doesn't appear elsewhere in this patch.

> +   remote-endpoint = <_out_port>;
> +   };
> +   };
> +   };
> +   };
> +
> amba: amba {
> compatible = "simple-bus";
> #address-cells = <1>;
> @@ -365,5 +399,129 @@
> reg = <0xf8005000 0x1000>;
> timeout-sec = <10>;
> };
> +
> +   etb@f8801000 {
> +   compatible = "arm,coresight-etb10", "arm,primecell";
> +   reg = <0xf8801000 0x1000>;
> +   clocks = < 27>, < 46>, < 47>;
> +   clock-names = "apb_pclk", "dbg_trc", "dbg_apb";
> +   in-ports {
> +   port {
> +   etb_in_port: endpoint {
> +   remote-endpoint = 
> <_out_port1>;
> +   };
> +   };
> +   };
> +   };
> +
> +   tpiu@f8803000 {
> +   compatible = "arm,coresight-tpiu", "arm,primecell";
> +   reg = <0xf8803000 0x1000>;
> +   clocks = < 27>, < 46>, < 47>;
> +   clock-names = "apb_pclk", "dbg_trc", "dbg_apb";
> +   in-ports {
> +   port {
> +   tpiu_in_port: endpoint {
> +   remote-endpoint = 
> <_out_port0>;
> +   };
> +   };
> +   };
> +   };
> +
> +   funnel@f8804000 {
> +   compatible = "arm,coresight-static-funnel", 
> "arm,primecell";
> +   reg = <0xf8804000 0x1000>;
> +   clocks = < 27>, < 46>, < 47>;
> +   clock-names = "apb_pclk", "dbg_trc", "dbg_apb";
> +
> +   /* funnel output ports */
> +   out-ports {
> +   port {
> +   funnel_out_port: endpoint {
> +   remote-endpoint =
> +   
> <_in_port0>;
> +   };
> +   };
> +   };
> +
> +   in-ports {
> +   #address-cells = <1>;
> +   #size-cells = <0>;
> +
> +   /* funnel input ports */
> 

Re: [v2 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

2019-10-23 Thread Matthew Wilcox
On Thu, Oct 24, 2019 at 01:05:04AM +0800, Yang Shi wrote:
> + return map_count >= 0 &&
> +map_count == atomic_read([1].compound_mapcount);
>  }

I didn't like Hugh's duplicate definition either.  May I suggest:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2f2199a51941..3d0efd937d2b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -695,11 +695,6 @@ static inline void *kvcalloc(size_t n, size_t size, gfp_t 
flags)
 
 extern void kvfree(const void *addr);
 
-static inline atomic_t *compound_mapcount_ptr(struct page *page)
-{
-   return [1].compound_mapcount;
-}
-
 static inline int compound_mapcount(struct page *page)
 {
VM_BUG_ON_PAGE(!PageCompound(page), page);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fa795284..270aa8fd2800 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -221,6 +221,11 @@ struct page {
 #endif
 } _struct_page_alignment;
 
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+   return [1].compound_mapcount;
+}
+
 /*
  * Used for sizing the vmemmap region on some architectures
  */


Re: [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl

2019-10-23 Thread Alessio Balsini
Ops, please forgive the wrong in-reply-to messge id :)

Cheers,
Alessio

On Wed, Oct 23, 2019 at 06:17:36PM +0100, Alessio Balsini wrote:
> [ Upstream commit fdbe4eeeb1aac219b14f10c0ed31ae5d1123e9b8 ]
> 
> Enabling Direct I/O with loop devices helps reducing memory usage by
> avoiding double caching.  32 bit applications running on 64 bits systems
> are currently not able to request direct I/O because is missing from the
> lo_compat_ioctl.
> 
> This patch fixes the compatibility issue mentioned above by exporting
> LOOP_SET_DIRECT_IO as additional lo_compat_ioctl() entry.
> The input argument for this ioctl is a single long converted to a 1-bit
> boolean, so compatibility is preserved.
> 
> Cc: Jens Axboe 
> Signed-off-by: Alessio Balsini 
> Signed-off-by: Jens Axboe 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index da3902ac16c86..8aadd4d0c3a88 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1557,6 +1557,7 @@ static int lo_compat_ioctl(struct block_device *bdev, 
> fmode_t mode,
>   arg = (unsigned long) compat_ptr(arg);
>   case LOOP_SET_FD:
>   case LOOP_CHANGE_FD:
> + case LOOP_SET_DIRECT_IO:
>   err = lo_ioctl(bdev, mode, cmd, arg);
>   break;
>   default:
> -- 
> 2.23.0.866.gb869b98d4c-goog
> 
> -- 
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
> 


Re: [PATCH 1/3] kcov: remote coverage support

2019-10-23 Thread kbuild test robot
Hi Andrey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.4-rc4 next-20191023]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Andrey-Konovalov/kcov-collect-coverage-from-usb-and-vhost/20191023-185245
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
3b7c59a1950c75f2c0152e5a9cd77675b09233d6
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 7.4.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.4.0 make.cross ARCH=arm 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

   kernel/kcov.o: In function `kcov_remote_stop':
>> kcov.c:(.text+0x1094): undefined reference to `__aeabi_uldivmod'
   kcov.c:(.text+0x1144): undefined reference to `__aeabi_uldivmod'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH bpf-next v2] libbpf: Fix strncat bounds error in libbpf_prog_type_by_name

2019-10-23 Thread Alexei Starovoitov
On Wed, Oct 23, 2019 at 8:40 AM KP Singh  wrote:
>
> From: KP Singh 
>
> On compiling samples with this change, one gets an error:
>
>  error: ‘strncat’ specified bound 118 equals destination size
>   [-Werror=stringop-truncation]
>
> strncat(dst, name + section_names[i].len,
> ^
>  sizeof(raw_tp_btf_name) - (dst - raw_tp_btf_name));
>  ~~
>
> strncat requires the destination to have enough space for the
> terminating null byte.
>
> Fixes: f75a697e09137 ("libbpf: Auto-detect btf_id of BTF-based 
> raw_tracepoint")
> Signed-off-by: KP Singh 

Applied. Thanks


Re: [PATCH v2 4/9] perf affinity: Add infrastructure to save/restore affinity

2019-10-23 Thread Andi Kleen
On Wed, Oct 23, 2019 at 07:16:13PM +0300, Alexey Budankov wrote:
> 
> On 23.10.2019 17:52, Andi Kleen wrote:
> > On Wed, Oct 23, 2019 at 04:30:49PM +0200, Jiri Olsa wrote:
> >> On Wed, Oct 23, 2019 at 06:02:35AM -0700, Andi Kleen wrote:
> >>> On Wed, Oct 23, 2019 at 11:59:11AM +0200, Jiri Olsa wrote:
>  On Sun, Oct 20, 2019 at 10:51:57AM -0700, Andi Kleen wrote:
> 
>  SNIP
> 
> > +}
> > diff --git a/tools/perf/util/affinity.h b/tools/perf/util/affinity.h
> > new file mode 100644
> > index ..e56148607e33
> > --- /dev/null
> > +++ b/tools/perf/util/affinity.h
> > @@ -0,0 +1,15 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#ifndef AFFINITY_H
> > +#define AFFINITY_H 1
> > +
> > +struct affinity {
> > +   unsigned char *orig_cpus;
> > +   unsigned char *sched_cpus;
> 
>  why not use cpu_set_t directly?
> >>>
> >>> Because it's too small in glibc (only 1024 CPUs) and perf already 
> >>> supports more.
> >>
> >> nice, we're using it all over the place.. how about using bitmap_alloc?
> > 
> > Okay.
> > 
> > The other places is mainly perf record from Alexey's recent affinity 
> > changes.
> > These probably need to be fixed.
> > 
> > +Alexey
> 
> Despite the issue indeed looks generic for stat and record modes,
> have you already observed record startup overhead somewhere in your setups?
> I would, first, prefer to reproduce the overhead, to have stable use case 
> for evaluation and then, possibly, improvement.

What I meant the cpu_set usages you added in 

commit 9d2ed64587c045304efe8872b0258c30803d370c
Author: Alexey Budankov 
Date:   Tue Jan 22 20:47:43 2019 +0300

perf record: Allocate affinity masks

need to be fixed to allocate dynamically, or at least use MAX_NR_CPUs to
support systems with >1024CPUs. That's an independent functionality
problem.

I haven't seen any large enough perf record usage to run
into the IPI problems for record.

-Andi


[PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl

2019-10-23 Thread Alessio Balsini
[ Upstream commit fdbe4eeeb1aac219b14f10c0ed31ae5d1123e9b8 ]

Enabling Direct I/O with loop devices helps reducing memory usage by
avoiding double caching.  32 bit applications running on 64 bits systems
are currently not able to request direct I/O because is missing from the
lo_compat_ioctl.

This patch fixes the compatibility issue mentioned above by exporting
LOOP_SET_DIRECT_IO as additional lo_compat_ioctl() entry.
The input argument for this ioctl is a single long converted to a 1-bit
boolean, so compatibility is preserved.

Cc: Jens Axboe 
Signed-off-by: Alessio Balsini 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da3902ac16c86..8aadd4d0c3a88 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1557,6 +1557,7 @@ static int lo_compat_ioctl(struct block_device *bdev, 
fmode_t mode,
arg = (unsigned long) compat_ptr(arg);
case LOOP_SET_FD:
case LOOP_CHANGE_FD:
+   case LOOP_SET_DIRECT_IO:
err = lo_ioctl(bdev, mode, cmd, arg);
break;
default:
-- 
2.23.0.866.gb869b98d4c-goog



Re: [PATCH v4 15/16] module: Move where we mark modules RO,X

2019-10-23 Thread Josh Poimboeuf
On Wed, Oct 23, 2019 at 05:16:54PM +0200, Peter Zijlstra wrote:
> @@ -157,6 +158,14 @@ static int __apply_relocate_add(Elf64_Sh
>  
>   val = sym->st_value + rel[i].r_addend;
>  
> + /*
> +  * .klp.rela.* sections should only contain module
> +  * related RELAs. All core-kernel RELAs should be in
> +  * normal .rela.* sections and be applied when loading
> +  * the patch module itself.
> +  */
> + WARN_ON_ONCE(klp && core_kernel_text(val));
> +

This isn't quite true, we also use .klp.rela sections to access
unexported vmlinux symbols.

-- 
Josh



Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.

2019-10-23 Thread Andy Lutomirski
On Wed, Oct 23, 2019 at 5:44 AM Mike Rapoport  wrote:
>
> On Wed, Oct 23, 2019 at 10:29:20AM +0300, Cyrill Gorcunov wrote:
> > On Tue, Oct 22, 2019 at 09:11:04PM -0700, Andy Lutomirski wrote:
> > > Trying again.  It looks like I used the wrong address for Pavel.
> >
> > Thanks for CC Andy! I must confess I didn't dive into userfaultfd engine
> > personally but let me CC more people involved from criu side. (overquoting
> > left untouched for their sake).
>
> Thanks for CC Cyrill!
>
>
> > > On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski  wrote:
> > > >
> > > > [adding more people because this is going to be an ABI break, sigh]
> > > >
> > > > On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione  
> > > > wrote:
> > > > >
> > > > > On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski  
> > > > > wrote:
> > > > > >
> > > > > > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione 
> > > > > >  wrote:
> > > > > > >
> > > > > > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > > > > > file object instead of the default one, letting security modules
> > > > > > > supervise userfaultfd use.
> > > > > > >
> > > > > > > Requiring that users pass a new flag lets us avoid changing the
> > > > > > > semantics for existing callers.
> > > > > >
> > > > > > Is there any good reason not to make this be the default?
> > > > > >
> > > > > >
> > > > > > The only downside I can see is that it would increase the memory 
> > > > > > usage
> > > > > > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > > > > > lighter-weight alternative would be to have a single inode shared by
> > > > > > all userfaultfd instances, which would require a somewhat different
> > > > > > internal anon_inode API.
> > > > >
> > > > > I'd also prefer to just make SELinux use mandatory, but there's a
> > > > > nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> > > > > which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> > > > > better way to deal with it.
> > > > >
> > > > > Right now, when a process with a UFFD-managed VMA using
> > > > > UFFD_EVENT_FORK forks, we make a new userfaultfd_ctx out of thin air
> > > > > and enqueue it on the message queue for the parent process. When we
> > > > > dequeue that context, we get to resolve_userfault_fork, which makes up
> > > > > a new UFFD file object out of thin air in the context of the reading
> > > > > process. Following normal SELinux rules, the SID attached to that new
> > > > > file object would be the task SID of the process *reading* the fork
> > > > > event, not the SID of the new fork child. That seems wrong, because
> > > > > the label we give to the UFFD should correspond to the label of the
> > > > > process that UFFD controls.
>
> I must admit I have no idea about how SELinux works, but what's wrong with
> making the new UFFD object to inherit the properties of the "original" one?
>
> The new file object is created in the context of the same task that owns
> the initial userfault file descriptor and it is used by the same task. So
> if you have a process that registers some of its VMAs with userfaultfd
> and enables UFFD_EVENT_FORK, the same process controls UFFD of itself and
> its children.

I'm not actually convinced this is a problem.

What *is* a problem is touching the file descriptor table at all from
read(2).  That's a big no-no.

--Andy


Re: [RFC 1/2] vhost: IFC VF hardware operation layer

2019-10-23 Thread Simon Horman
On Wed, Oct 23, 2019 at 06:36:13PM +0800, Jason Wang wrote:
> 
> On 2019/10/23 下午6:13, Simon Horman wrote:
> > On Tue, Oct 22, 2019 at 09:32:36AM +0800, Jason Wang wrote:
> > > On 2019/10/22 上午12:31, Simon Horman wrote:
> > > > On Mon, Oct 21, 2019 at 05:55:33PM +0800, Zhu, Lingshan wrote:
> > > > > On 10/16/2019 5:53 PM, Simon Horman wrote:
> > > > > > Hi Zhu,
> > > > > > 
> > > > > > thanks for your patch.
> > > > > > 
> > > > > > On Wed, Oct 16, 2019 at 09:10:40AM +0800, Zhu Lingshan wrote:
> > > > ...
> > > > 
> > > > > > > +static void ifcvf_read_dev_config(struct ifcvf_hw *hw, u64 
> > > > > > > offset,
> > > > > > > +void *dst, int length)
> > > > > > > +{
> > > > > > > + int i;
> > > > > > > + u8 *p;
> > > > > > > + u8 old_gen, new_gen;
> > > > > > > +
> > > > > > > + do {
> > > > > > > + old_gen = ioread8(>common_cfg->config_generation);
> > > > > > > +
> > > > > > > + p = dst;
> > > > > > > + for (i = 0; i < length; i++)
> > > > > > > + *p++ = ioread8((u8 *)hw->dev_cfg + offset + i);
> > > > > > > +
> > > > > > > + new_gen = ioread8(>common_cfg->config_generation);
> > > > > > > + } while (old_gen != new_gen);
> > > > > > Would it be wise to limit the number of iterations of the loop 
> > > > > > above?
> > > > > Thanks but I don't quite get it. This is used to make sure the 
> > > > > function
> > > > > would get the latest config.
> > > > I am worried about the possibility that it will loop forever.
> > > > Could that happen?
> > > > 
> > > > ...
> > > My understanding is that the function here is similar to virtio config
> > > generation [1]. So this can only happen for a buggy hardware.
> > Ok, so this circles back to my original question.
> > Should we put a bound on the number of times the loop runs
> > or should we accept that the kernel locks up if the HW is buggy?
> > 
> 
> I'm not sure, and similar logic has been used by virtio-pci drivers for
> years. Consider this logic is pretty simple and it should not be the only
> place that virito hardware can lock kernel, we can keep it as is.

Ok, I accept that there isn't much use fixing this if its idomatic and
there are other places virtio hardware can lock up the kernel.

> Actually, there's no need for hardware to implement generation logic, it
> could be emulated by software or even ignored. In new version of
> virtio-mdev, get_generation() is optional, when it was not implemented, 0 is
> simply returned by virtio-mdev transport.
> 
> Thanks
> 


Re: [PATCH] PCI: Warn about host bridge device when its numa node is NO_NODE

2019-10-23 Thread Bjorn Helgaas
On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
> On 2019/10/23 5:04, Bjorn Helgaas wrote:
> > On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:

> > I think the underlying problem you're addressing is that:
> > 
> >   - NUMA_NO_NODE == -1,
> >   - dev_to_node(dev) may return NUMA_NO_NODE,
> >   - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
> >   - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
> > 
> > For example, on arm64, mips loongson, s390, and x86,
> > cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
> > an invalid array index.
> 
> The invalid array index of -1 is the underlying problem here when
> cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
> is not NUMA_NO_NODE aware yet.
> 
> In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
> disscusion, it is requested that it is better to warn about the pcie
> device without a node assigned by the firmware before making the
> cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
> devices of "NUMA_NO_NODE" node can be fixed by their vendor.
> 
> See: 
> https://lore.kernel.org/lkml/2019101539.gx2...@hirez.programming.kicks-ass.net/

Right.  We should warn if the NUMA node number would help us but DT or
the firmware didn't give us one.

But we can do that independently of any cpumask_of_node() changes.
There's no need to do one patch before the other.  Even if you make
cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning
because we're not actually changing any node assignments.

> So maybe change the warning to below:
> 
> if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE)
>   dev_err(>dev, FW_BUG "No node assigned on NUMA capable HW. Please 
> contact your vendor for updates.\n");

I think this is perfect and I don't see the need for the refinement
below:

> And it seems a pci device's parent will always set to the bridge
> device in pci_setup_device(), and device_add() which will set the
> node to its parent's when the child device' node is NUMA_NO_NODE,
> maybe we can add the bridge device' node checking to make sure
> the pci device really does not have a node assigned, as below:
> 
> if (nr_node_ids > 1 && pcibus_to_node(bus) == NUMA_NO_NODE &&
> dev_to_node(bus->bridge) == NUMA_NO_NODE)
>   dev_err(>dev, FW_BUG "No node assigned on NUMA capable HW. Please 
> contact your vendor for updates.\n");

Anyway, would the attached patch work for you?  I have it tentatively
queued up on pci/enumeration for v5.5.

> >> It is possible to
> >> have a PCI bridge shared between two nodes, such that the PCI
> >> devices have equidistance. But the moment you scale this out, you
> >> either get devices that are 'local' to a package while having
> >> multiple packages, or if you maintain a single bridge in a big
> >> system, things become so slow it all doesn't matter anyway.
> >> Assigning a node (one of the shared) is, in the generic ase of
> >> multiple packages, the better solution over assigning all nodes.
> >>
> >> As pci_device_add() will assign the pci device' node according to
> >> the bus the device is on, which is decided by pcibus_to_node().
> >> Currently different arch may implement the pcibus_to_node() based
> >> on bus->sysdata or bus device' node, which has the same node as
> >> the bridge device.
> >>
> >> And for devices behind another bridge case, the child bus device
> >> is setup with proper parent bus device and inherit its parent'
> >> sysdata in pci_alloc_child_bus(), so the pcie device under the
> >> child bus should have the same node as the parent bridge when
> >> device_add() is called, which will set the node to its parent's
> >> node when the child device' node is NUMA_NO_NODE.
> >>
> >> So this patch only warns about the case when a host bridge device
> >> is registered with a node of NO_NODE in pci_register_host_bridge().
> >> And it only warns about that when there are more than one numa
> >> nodes in the system.
> > 
> > 
> >> [1] 
> >> https://lore.kernel.org/lkml/1568724534-146242-1-git-send-email-linyunsh...@huawei.com/
> >>
> >> Signed-off-by: Yunsheng Lin 
> >> ---
> >>  drivers/pci/probe.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> >> index 3d5271a..22be96a 100644
> >> --- a/drivers/pci/probe.c
> >> +++ b/drivers/pci/probe.c
> >> @@ -927,6 +927,9 @@ static int pci_register_host_bridge(struct 
> >> pci_host_bridge *bridge)
> >>list_add_tail(>node, _root_buses);
> >>up_write(_bus_sem);
> >>  
> >> +  if (nr_node_ids > 1 && dev_to_node(bus->bridge) == NUMA_NO_NODE)
> >> +  dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable 
> >> HW by BIOS. Please contact your vendor for updates.\n");
> >> +
> >>return 0;
> >>  
> >>  unregister:

commit 8f8cf239c4f1
Author: Yunsheng Lin 
Date:   Sat Oct 19 14:45:43 2019 +0800

PCI: Warn if no host bridge NUMA node info

   

Re: [PATCH v2 6/6] crypto: caam - populate platform devices last

2019-10-23 Thread Horia Geanta
On 10/22/2019 6:30 PM, Andrey Smirnov wrote:
> Move the call to devm_of_platform_populate() at the end of
> caam_probe(), so we won't try to add any child devices until all of
> the initialization is finished successfully.
> 
> Signed-off-by: Andrey Smirnov 
> Cc: Chris Healy 
> Cc: Lucas Stach 
> Cc: Horia Geantă 
> Cc: Herbert Xu 
> Cc: Iuliana Prodan 
> Cc: linux-cry...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
Reviewed-by: Horia Geantă 

Thanks,
Horia


Re: [PATCH v2 3/6] crypto: caam - use devres to de-initialize the RNG

2019-10-23 Thread Horia Geanta
On 10/22/2019 6:30 PM, Andrey Smirnov wrote:
> Use devres to de-initialize the RNG and drop explicit de-initialization
> code in caam_remove().
> 
> Signed-off-by: Andrey Smirnov 
> Cc: Chris Healy 
> Cc: Lucas Stach 
> Cc: Horia Geantă 
> Cc: Herbert Xu 
> Cc: Iuliana Prodan 
> Cc: linux-cry...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
Reviewed-by: Horia Geantă 

Thanks,
Horia


[GIT PULL] VFIO fixes for v5.4-rc5

2019-10-23 Thread Alex Williamson
Hi Linus,

The following changes since commit 4f5cafb5cb8471e54afdc9054d973535614f7675:

  Linux 5.4-rc3 (2019-10-13 16:37:36 -0700)

are available in the Git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v5.4-rc5

for you to fetch changes up to 95f89e090618efca63918b658c2002e57d393036:

  vfio/type1: Initialize resv_msi_base (2019-10-15 14:07:01 -0600)


VFIO fixes for v5.4-rc5

 - Fix (false) uninitialized variable warning (Joerg Roedel)


Joerg Roedel (1):
  vfio/type1: Initialize resv_msi_base

 drivers/vfio/vfio_iommu_type1.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



Re: KCSAN: data-race in task_dump_owner / task_dump_owner

2019-10-23 Thread Dmitry Vyukov
On Thu, Oct 17, 2019 at 8:33 PM 'Marco Elver' via syzkaller-bugs
 wrote:
>
> On Thu, 17 Oct 2019 at 20:17, Alexey Dobriyan  wrote:
> >
> > On Thu, Oct 17, 2019 at 02:56:47PM +0200, Marco Elver wrote:
> > > Hi,
> > >
> > > On Thu, 17 Oct 2019 at 14:36, syzbot
> > >  wrote:
> > > >
> > > > Hello,
> > > >
> > > > syzbot found the following crash on:
> > > >
> > > > HEAD commit:d724f94f x86, kcsan: Enable KCSAN for x86
> > > > git tree:   https://github.com/google/ktsan.git kcsan
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=17884db360
> > > > kernel config:  
> > > > https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
> > > > dashboard link: 
> > > > https://syzkaller.appspot.com/bug?extid=e392f8008a294fdf8891
> > > > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > > >
> > > > Unfortunately, I don't have any reproducer for this crash yet.
> > > >
> > > > IMPORTANT: if you fix the bug, please add the following tag to the 
> > > > commit:
> > > > Reported-by: syzbot+e392f8008a294fdf8...@syzkaller.appspotmail.com
> > > >
> > > > ==
> > > > BUG: KCSAN: data-race in task_dump_owner / task_dump_owner
> > > >
> > > > write to 0x8881255bb7fc of 4 bytes by task 7804 on cpu 0:
> > > >   task_dump_owner+0xd8/0x260 fs/proc/base.c:1742
> > > >   pid_update_inode+0x3c/0x70 fs/proc/base.c:1818
> > > >   pid_revalidate+0x91/0xd0 fs/proc/base.c:1841
> > > >   d_revalidate fs/namei.c:765 [inline]
> > > >   d_revalidate fs/namei.c:762 [inline]
> > > >   lookup_fast+0x7cb/0x7e0 fs/namei.c:1613
> > > >   walk_component+0x6d/0xe80 fs/namei.c:1804
> > > >   link_path_walk.part.0+0x5d3/0xa90 fs/namei.c:2139
> > > >   link_path_walk fs/namei.c:2070 [inline]
> > > >   path_openat+0x14f/0x3530 fs/namei.c:3532
> > > >   do_filp_open+0x11e/0x1b0 fs/namei.c:3563
> > > >   do_sys_open+0x3b3/0x4f0 fs/open.c:1089
> > > >   __do_sys_open fs/open.c:1107 [inline]
> > > >   __se_sys_open fs/open.c:1102 [inline]
> > > >   __x64_sys_open+0x55/0x70 fs/open.c:1102
> > > >   do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
> > > >   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >
> > > > write to 0x8881255bb7fc of 4 bytes by task 7813 on cpu 1:
> > > >   task_dump_owner+0xd8/0x260 fs/proc/base.c:1742
> > > >   pid_update_inode+0x3c/0x70 fs/proc/base.c:1818
> > > >   pid_revalidate+0x91/0xd0 fs/proc/base.c:1841
> > > >   d_revalidate fs/namei.c:765 [inline]
> > > >   d_revalidate fs/namei.c:762 [inline]
> > > >   lookup_fast+0x7cb/0x7e0 fs/namei.c:1613
> > > >   walk_component+0x6d/0xe80 fs/namei.c:1804
> > > >   lookup_last fs/namei.c:2271 [inline]
> > > >   path_lookupat.isra.0+0x13a/0x5a0 fs/namei.c:2316
> > > >   filename_lookup+0x145/0x2d0 fs/namei.c:2346
> > > >   user_path_at_empty+0x4c/0x70 fs/namei.c:2606
> > > >   user_path_at include/linux/namei.h:60 [inline]
> > > >   vfs_statx+0xd9/0x190 fs/stat.c:187
> > > >   vfs_stat include/linux/fs.h:3188 [inline]
> > > >   __do_sys_newstat+0x51/0xb0 fs/stat.c:341
> > > >   __se_sys_newstat fs/stat.c:337 [inline]
> > > >   __x64_sys_newstat+0x3a/0x50 fs/stat.c:337
> > > >   do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
> > > >   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >
> > > > Reported by Kernel Concurrency Sanitizer on:
> > > > CPU: 1 PID: 7813 Comm: ps Not tainted 5.3.0+ #0
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > > Google 01/01/2011
> > > > ==
> > >
> > > My understanding is, that for every access to /proc/,
> > > d_revalidate is called, and /proc-fs implementation simply says that
> > > pid_revalidate always revalidates by rewriting uid/gid because "owning
> > > task may have performed a setuid(), etc." presumably so every access
> > > to a /proc/ entry always has the right uid/gid (in effect
> > > updating /proc/ lazily via d_revalidate).
> > >
> > > Is it possible that one of the tasks above could be preempted after
> > > doing its writes to *ruid/*rgid, another thread writing some other
> > > values (after setuid / seteuid), and then the preempted thread seeing
> > > the other values? Assertion here should never fail:
> > > === TASK 1 ===
> > > | seteuid(1000);
> > > | seteuid(0);
> > > | stat("/proc/", );
> > > | assert(fstat.st_uid == 0);
> > > === TASK 2 ===
> > > | stat("/proc/", ...);
> >
> > Is it the same as
> > pid_revalidate() snapshots (uid,gid) correctly
> > but writeback is done in any order?
>
> Yes, I think so. Snapshot is done in RCU reader critical section, but
> the writes can race with another thread. Is there logic that ensures
> this doesn't lead to the observable outcome above?


I found the case where this leads to an observable bug.
common_perm_cond() in security/apparmor/lsm.c reads the inode uid and
uses it for the security check:

static int common_perm_cond(const char *op, const struct path *path, u32 mask)
{
  

[v2 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

2019-10-23 Thread Yang Shi
We have usecase to use tmpfs as QEMU memory backend and we would like to
take the advantage of THP as well.  But, our test shows the EPT is not
PMD mapped even though the underlying THP are PMD mapped on host.
The number showed by /sys/kernel/debug/kvm/largepage is much less than
the number of PMD mapped shmem pages as the below:

7f277820-7f287820 rw-s  00:14 262232 
/dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
Size:4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped:   579584 kB
[snip]
Locked:0 kB

cat /sys/kernel/debug/kvm/largepages
12

And some benchmarks do worse than with anonymous THPs.

By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when
setting up EPT map.  But, the _mapcount < 0 check doesn't fit to page
cache THP since every subpage of page cache THP would get _mapcount
inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns
false for page cache THP.  This would prevent KVM from setting up PMD
mapped EPT entry.

So we need handle page cache THP correctly.  However, when page cache
THP's PMD gets split, kernel just remove the map instead of setting up
PTE map like what anonymous THP does.  Before KVM calls get_user_pages()
the subpages may get PTE mapped even though it is still a THP since the
page cache THP may be mapped by other processes at the mean time.

Checking its _mapcount and whether the THP has PTE mapped or not.
Although this may report some false negative cases (PTE mapped by other
processes), it looks not trivial to make this accurate.

With this fix /sys/kernel/debug/kvm/largepage would show reasonable
pages are PMD mapped by EPT as the below:

7fbeaee0-7fbfaee0 rw-s  00:14 275464 
/dev/shm/qemu_back_mem.mem.SKUvat (deleted)
Size:4194304 kB
[snip]
AnonHugePages: 0 kB
ShmemPmdMapped:   557056 kB
[snip]
Locked:0 kB

cat /sys/kernel/debug/kvm/largepages
271

And the benchmarks are as same as anonymous THPs.

Signed-off-by: Yang Shi 
Reported-by: Gang Deng 
Tested-by: Gang Deng 
Suggested-by: Hugh Dickins 
Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Cc:  4.8+
---
v2: Adopted the suggestion from Hugh to use _mapcount and compound_mapcount.
But I just open coding compound_mapcount to avoid duplicating the
definition of compound_mapcount_ptr in two different files.  Since
"compound_mapcount" looks self-explained so I'm supposed the open
coding would not affect the readability.

 include/linux/page-flags.h | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb88..954a877 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -622,12 +622,30 @@ static inline int PageTransCompound(struct page *page)
  *
  * Unlike PageTransCompound, this is safe to be called only while
  * split_huge_pmd() cannot run from under us, like if protected by the
- * MMU notifier, otherwise it may result in page->_mapcount < 0 false
+ * MMU notifier, otherwise it may result in page->_mapcount check false
  * positives.
+ *
+ * We have to treat page cache THP differently since every subpage of it
+ * would get _mapcount inc'ed once it is PMD mapped.  But, it may be PTE
+ * mapped in the current process so comparing subpage's _mapcount to
+ * compound_mapcount ot filter out PTE mapped case.
  */
 static inline int PageTransCompoundMap(struct page *page)
 {
-   return PageTransCompound(page) && atomic_read(>_mapcount) < 0;
+   struct page *head;
+   int map_count;
+
+   if (!PageTransCompound(page))
+   return 0;
+
+   if (PageAnon(page))
+   return atomic_read(>_mapcount) < 0;
+
+   head = compound_head(page);
+   map_count = atomic_read(>_mapcount);
+   /* File THP is PMD mapped and not double mapped */
+   return map_count >= 0 &&
+  map_count == atomic_read([1].compound_mapcount);
 }
 
 /*
-- 
1.8.3.1



Re: [PATCH v1 3/6] KEYS: ima hook to measure builtin_trusted_keys

2019-10-23 Thread Mimi Zohar
On Wed, 2019-10-23 at 07:49 -0700, Lakshmi Ramasubramanian wrote:
> On 10/23/19 6:22 AM, Mimi Zohar wrote:
> 
> Thanks for reviewing the changes Mimi.
> I'll address your comments and post an updated patch set shortly.
> 
> >> Add a new ima hook to measure keys added to builtin_trusted_keys
> >> keyring.
> > 
> > There is no IMA hook in this patch.
> > 
> 
> >> +  else if (strcmp(args[0].from,
> >> +  "BUILTIN_TRUSTED_KEYS") == 0)
> >> +  entry->func = BUILTIN_TRUSTED_KEYS;
> >>else
> >>result = -EINVAL;
> >>if (!result)
> > 
> > Any new options need to be displayed as well.
> 
> Not that I can think of. Please correct me if I am wrong.

True, since you're hard coding the policy.



Re: [PATCH RFC 2/2] irqchip/gic: Allow the use of SGI interrupts

2019-10-23 Thread Florian Fainelli
Hello marc,

On 10/23/19 6:22 AM, Marc Zyngier wrote:
> Hi Florian,
> 
> Needless to say, I mostly have questions...
> 
> On 2019-10-23 01:05, Florian Fainelli wrote:
>> SGI interrupts are a convenient way for trusted firmware to target a
>> specific set of CPUs. Update the ARM GIC code to allow the translation
>> and mapping of SGI interrupts.
>>
>> Since the kernel already uses SGIs for various inter-processor interrupt
>> activities, we specifically make sure that we do not let users of the
>> IRQ API to even try to map those.
>>
>> Internal IPIs remain dispatched through handle_IPI() while public SGIs
>> get promoted to a normal interrupt flow management.
>>
>> Signed-off-by: Florian Fainelli 
>> ---
>>  drivers/irqchip/irq-gic.c | 41 +++
>>  1 file changed, 29 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
>> index 30ab623343d3..dcfdbaacdd64 100644
>> --- a/drivers/irqchip/irq-gic.c
>> +++ b/drivers/irqchip/irq-gic.c
>> @@ -385,7 +385,10 @@ static void __exception_irq_entry
>> gic_handle_irq(struct pt_regs *regs)
>>   * Pairs with the write barrier in gic_raise_softirq
>>   */
>>  smp_rmb();
>> -    handle_IPI(irqnr, regs);
>> +    if (irqnr < NR_IPI)
>> +    handle_IPI(irqnr, regs);
>> +    else
>> +    handle_domain_irq(gic->domain, irqnr, regs);
> 
> Double EOI, UNPREDICTABLE territory, your state machine is now dead.

Oh yes, the interrupt flow now also goes through ->irq_eoi (that's the
whole point), meh.

> 
>>  #endif
>>  continue;
>>  }
>> @@ -1005,20 +1008,34 @@ static int gic_irq_domain_translate(struct
>> irq_domain *d,
>>  if (fwspec->param_count < 3)
>>  return -EINVAL;
>>
>> -    /* Get the interrupt number and add 16 to skip over SGIs */
>> -    *hwirq = fwspec->param[1] + 16;
>> -
>> -    /*
>> - * For SPIs, we need to add 16 more to get the GIC irq
>> - * ID number
>> - */
>> -    if (!fwspec->param[0])
>> +    *hwirq = fwspec->param[1];
>> +    switch (fwspec->param[0]) {
>> +    case 0:
>> +    /*
>> + * For SPIs, we need to add 16 more to get the GIC irq
>> + * ID number
>> + */
>> +    *hwirq += 16;
>> +    /* fall through */
>> +    case 1:
>> +    /* Add 16 to skip over SGIs */
>>  *hwirq += 16;
>> +    *type = fwspec->param[2] & IRQ_TYPE_SENSE_MASK;
>>
>> -    *type = fwspec->param[2] & IRQ_TYPE_SENSE_MASK;
>> +    /* Make it clear that broken DTs are... broken */
>> +    WARN_ON(*type == IRQ_TYPE_NONE);
>> +    break;
>> +    case 2:
>> +    /* Refuse to map internal IPIs */
>> +    if (*hwirq < NR_IPI)
> 
> So depending on how the kernel uses SGIs, you can or cannot use these SGIs.
> That looks like a good way to corner ourselves into not being to change
> much.

arch/arm/kernel/smp.c has a forward looking statement about SGI numbering:

/*
 * SGI8-15 can be reserved by secure firmware, and thus may
 * not be usable by the kernel. Please keep the above limited
 * to at most 8 entries.
 */

is this something that can be used as an universal and unbreakable rule
for the ARM64 kernel as well in order to ensure SGIs 8-15 can be usable
through the IRQ API or is this simply not a guarantee at all?

> 
> Also, do you expect this to work for both Group-0 and Group-1 interrupts
> (since you imply that this works as a communication medium with the secure
> side)? Given that the kernel running in NS has no way to enable/disable
> Group-0 interrupts, this looks terminally flawed. Or is that Group-1 only?

That would be Group-1 interrupts only, are you suggesting there is an
additional check being done that such SGIs are actually part of Group-1?

> 
> How do we describe which SGIs are guaranteed to be available to Linux?

In our case, the Device Tree mailbox node gets populated its interrupts
property with the SGI number(s), and that same number is also passed as
a configuration parameter to the trusted firmware. Or are you echoing
back to your earlier comment about the fact that if the kernel changes
its own definition of NR_IPI then we suddenly start breaking IRQ API
uses of SGIs in a certain range?

> 
>> +    return -EPERM;
>> +
>> +    *type = IRQ_TYPE_NONE;
> 
> Or not. SGI are edge triggered, by definition.
> 
>> +    break;
>> +    default:
>> +    break;
>> +    }
>>
>> -    /* Make it clear that broken DTs are... broken */
>> -    WARN_ON(*type == IRQ_TYPE_NONE);
> 
> Really?

Given the comment in gic_set_type() about SGIs, the WARN_ON() was moved
above to continue checking for GIC_SPI and GIC_PPI, but we should
extract the type from the Devic eTree and only permit an edge mask.
-- 

Re: [PATCH] cpufreq: s3c64xx: Remove pointless NULL check in s3c64xx_cpufreq_driver_init

2019-10-23 Thread Nathan Chancellor
On Wed, Oct 23, 2019 at 05:59:23PM +0100, Mark Brown wrote:
> On Wed, Oct 23, 2019 at 09:54:17AM -0700, Nathan Chancellor wrote:
> > On Wed, Oct 23, 2019 at 05:36:56PM +0100, Mark Brown wrote:
> > > On Wed, Oct 23, 2019 at 09:26:28AM -0700, Nathan Chancellor wrote:
> > > > On Wed, Oct 23, 2019 at 11:43:04AM +0100, Mark Brown wrote:
> 
> > > > > The driver should also have supported s3c6400 as well.
> 
> > > > Kconfig says otherwise, unless I am missing something.
> 
> > > Note the XX in the config option.
> 
> > But what about the depends and the help text?
> 
> Viresh asked why the driver was written with s3c6410 support optional.
> I explained that the reason that it was written this way was to
> accomodate s3c6400 support.

Ah understood, thanks for the clarification and sorry for the
misunderstanding!

Cheers,
Nathan


Re: [PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap

2019-10-23 Thread Yang Shi




On 10/22/19 6:31 PM, Hugh Dickins wrote:

On Tue, 22 Oct 2019, Yang Shi wrote:

On 10/22/19 3:27 PM, Hugh Dickins wrote:

I completely agree that the current PageTransCompoundMap() is wrong.

A fix for that is one of many patches I've not yet got to upstreaming.
Comparing yours and mine, I'm worried by your use of PageDoubleMap(),
because really that's a flag for anon THP, and not properly supported
on shmem (or now I suppose file) THP - I forget the details, is it
that it sometimes gets set, but never cleared?  Generally, we just
don't refer to PageDoubleMap() on shmem THPs (but there may be
exceptions: sorting out the THP mapcount maze, and eliminating
PageDoubleMap(), is one of my long-held ambitions, not yet reached).

Here's the patch I've been carrying, but it's from earlier, so I
should warn that I've done no more than build-testing it on 5.4,
and I'm too far away from these issues at the moment to be able to
make a good judgement or argue for it - I hope you and others can
decide which patch is the better.  I should also add that we're
barely using PageTransCompoundMap() at all: at best it can only
give a heuristic guess as to whether the page is pmd-mapped in
any particular case, and we preferred to take forward the KVM
patches we posted back in April 2016, plumbing hva down to where
it's needed - though of course those are somewhat different now.

Thanks for catching this. I was definitely thinking about using
compount_mapcount instead of DoubleMap flag when I was working the patch. I
just simply thought it would change less file by using DoubleMap flag but I
didn't notice it was kind of unbalanced for file THP.

With the unbalanced DoubleMap flag, it sounds better to use
compound_mapcount.

Yes: no doubt PageDoubleMap could be fixed on shmem+file, but I have no
interest in doing that, because it's just unnecessary overhead for them.
(They have their own overhead, of subpage mapcounting for pmd: which is
something to eliminate and unify with anon when I get around to it.)


It might be worth fixing the unbalance since mlock depends on this flag 
too. There should be a little bit overhead when handling PTE rmap remove 
since we have to iterate every subpage to check if _mapcount is same 
with compound_mapcount or not in order to clear DoubleMap flag. It is 
easy to handle this when the last PMD map is gone.





Thanks for sharing your patch, I'm going to rework v2 by using
compound_mapcount. Do you mind I might steal your patch?

Please do! One less for me to worry about, thanks.


I'm supposed we'd better fix this bug regardless of whether you would like to
move forward your KVM patches.

Absolutely. There remain a few other uses of PageTransCompoundMap
anyway, and I really wanted this outright mm fix to go in before
re-submitting AndresLC's KVM patch (I'll ask a KVM-savvy colleague
to take that over, Cc'ing you, once the mm end is correct).


Thanks.



Hugh




Re: [PATCH v4 15/16] module: Move where we mark modules RO,X

2019-10-23 Thread Josh Poimboeuf
On Wed, Oct 23, 2019 at 01:48:35PM +0200, Peter Zijlstra wrote:
> Now sadly that commit missed all the useful information, luckily I could
> find the patch in my LKML folder, more sad, that thread still didn't
> contain the actual useful information, for that I was directed to
> github:
> 
>   https://github.com/dynup/kpatch/issues/580
> 
> Now, someone is owning me a beer for having to look at github for this.

Deal.  And you probably deserve a few more for fixing our crap.

The github thing is supposed to be temporary, at least in theory we'll
eventually have all klp patch module building code in the kernel tree.

> That finally explained that what happens is that the RELA was trying to
> fix up the paravirt indirect call to 'local_irq_disable', which
> apply_paravirt() will have overwritten with 'CLI; NOP'. This then
> obviously goes *bang*.
> 
> This then raises a number of questions:
> 
>  1) why is that RELA (that obviously does not depend on any module)
> applied so late?

Good question.  The 'pv_ops' symbol is exported by the core kernel, so I
can't see any reason why we'd need to apply that rela late.  In theory,
kpatch-build isn't supposed to convert that to a klp rela.  Maybe
something went wrong in the patch creation code.

I'm also questioning why we even need to apply the parainstructions
section late.  Maybe we can remove that apply_paravirt() call
altogether, along with .klp.arch.parainstruction sections.

I'll need to look into it...

>  2) why can't we unconditionally skip RELA's to paravirt sites?

We could, but I don't think it's needed if we fix #1.

>  3) Is there ever a possible module-dependent RELA to a paravirt /
> alternative site?

Good question...

> Now, for 1), I would propose '.klp.rela.${mod}' sections only contain
> RELAs that depend on symbols in ${mod} (or modules in general).

That was already the goal, but we've apparently failed at that.

> We can fix up RELAs that depend on core kernel early without problems.
> Let them be in the normal .rela sections and be fixed up on loading
> the patch-module as per usual.

If such symbols aren't exported, then they still need to be in
.klp.rela.vmlinux sections, since normal relas won't work.

> This should also deal with 2, paravirt should always have RELAs into the
> core kernel.
> 
> Then for 3) we only have alternatives left, and I _think_ it unlikely to
> be the case, but I'll have to have a hard look at that.

I'm not sure about alternatives, but maybe we can enforce such
limitations with tooling and/or kernel checks.

-- 
Josh



Re: [PATCH] cpufreq: s3c64xx: Remove pointless NULL check in s3c64xx_cpufreq_driver_init

2019-10-23 Thread Mark Brown
On Wed, Oct 23, 2019 at 09:54:17AM -0700, Nathan Chancellor wrote:
> On Wed, Oct 23, 2019 at 05:36:56PM +0100, Mark Brown wrote:
> > On Wed, Oct 23, 2019 at 09:26:28AM -0700, Nathan Chancellor wrote:
> > > On Wed, Oct 23, 2019 at 11:43:04AM +0100, Mark Brown wrote:

> > > > The driver should also have supported s3c6400 as well.

> > > Kconfig says otherwise, unless I am missing something.

> > Note the XX in the config option.

> But what about the depends and the help text?

Viresh asked why the driver was written with s3c6410 support optional.
I explained that the reason that it was written this way was to
accomodate s3c6400 support.


signature.asc
Description: PGP signature


Re: [PATCH 06/18] add support for Clang's Shadow Call Stack (SCS)

2019-10-23 Thread Sami Tolvanen
On Tue, Oct 22, 2019 at 9:28 AM Mark Rutland  wrote:
> I think it would be preferable to follow the example of CC_FLAGS_FTRACE
> so that this can be filtered out, e.g.
>
> ifdef CONFIG_SHADOW_CALL_STACK
> CFLAGS_SCS := -fsanitize=shadow-call-stack
> KBUILD_CFLAGS += $(CFLAGS_SCS)
> export CC_FLAGS_SCS
> endif
>
> ... with removal being:
>
> CFLAGS_REMOVE := $(CC_FLAGS_SCS)
>
> ... or:
>
> CFLAGS_REMOVE_obj.o := $(CC_FLAGS_SCS)
>
> That way you only need to define the flags once, so the enable and
> disable falgs remain in sync by construction.

CFLAGS_REMOVE appears to be only implemented for objects, which means
there's no convenient way to filter out flags for everything in
arch/arm64/kvm/hyp, for example. I could add a CFLAGS_REMOVE
separately for each object file, or we could add something like
ccflags-remove-y to complement ccflags-y, which should be relatively
simple. Masahiro, do you have any suggestions?

Sami


RE: [PATCH v3 0/7] Nvidia Arm SMMUv2 Implementation

2019-10-23 Thread Krishna Reddy
>>https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=for-joerg/arm-smmu/updates

Thanks Will! Let me rebase my patches on top of this branch and send it out.

-KR



  1   2   3   4   5   6   7   8   >