Re: [PATCH 5.10 000/188] 5.10.30-rc1 review

2021-04-13 Thread Samuel Zou




On 2021/4/12 16:38, Greg Kroah-Hartman wrote:

This is the start of the stable review cycle for the 5.10.30 release.
There are 188 patches in this series, all will be posted as a response
to this one.  If anyone has any issues with these being applied, please
let me know.

Responses should be made by Wed, 14 Apr 2021 08:39:44 +0000.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:

https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.30-rc1.gz
or in the git tree and branch at:

git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
linux-5.10.y
and the diffstat can be found below.

thanks,

greg k-h



Tested on arm64 and x86 for 5.10.30-rc1,

Kernel repo:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Branch: linux-5.10.y
Version: 5.10.30-rc1
Commit: 8ac4b1deedaa507b5d0f46316e7f32004dd99cd1
Compiler: gcc version 7.3.0 (GCC)

arm64:

Testcase Result Summary:
total: 5264
passed: 5264
failed: 0
timeout: 0


x86:

Testcase Result Summary:
total: 5264
passed: 5264
failed: 0
timeout: 0


Tested-by: Hulk Robot 



[PATCH 0/8] MIPS: Fixes for PCI legacy drivers (rt2880, rt3883)

2021-04-13 Thread Ilya Lipnitskiy
One major fix for rt2880-pci in the first patch - fixes breakage that
existed since v4.14.

Other more minor fixes, cleanups, and improvements that either free up
memory, make dmesg messages clearer, or remove redundant dmesg output.

Ilya Lipnitskiy (8):
  MIPS: pci-rt2880: fix slot 0 configuration
  MIPS: pci-rt2880: remove unneeded locks
  MIPS: pci-rt3883: trivial: remove unused variable
  MIPS: pci-rt3883: more accurate DT error messages
  MIPS: pci-legacy: stop using of_pci_range_to_resource
  MIPS: pci-legacy: remove redundant info messages
  MIPS: pci-legacy: remove busn_resource field
  MIPS: pci-legacy: use generic pci_enable_resources

 arch/mips/include/asm/pci.h |  1 -
 arch/mips/pci/pci-legacy.c  | 57 ++---
 arch/mips/pci/pci-rt2880.c  | 63 +++--
 arch/mips/pci/pci-rt3883.c  | 10 ++
 4 files changed, 44 insertions(+), 87 deletions(-)

-- 
2.31.1



RE: [PATCH v7 1/2] platform/x86: dell-privacy: Add support for Dell hardware privacy

2021-04-13 Thread Yuan, Perry
Hi,
> -Original Message-
> From: Amadeusz Sławiński 
> Sent: April 12, 2021 18:40
> To: Yuan, Perry; po...@protonmail.com; pierre-
> louis.boss...@linux.intel.com; oder_ch...@realtek.com; pe...@perex.cz;
> ti...@suse.com; hdego...@redhat.com; mgr...@linux.intel.com
> Cc: alsa-de...@alsa-project.org; linux-kernel@vger.kernel.org;
> lgirdw...@gmail.com; platform-driver-...@vger.kernel.org;
> broo...@kernel.org; Dell Client Kernel; mario.limoncie...@outlook.com
> Subject: Re: [PATCH v7 1/2] platform/x86: dell-privacy: Add support for Dell
> hardware privacy
> 
> 
> [EXTERNAL EMAIL]
> 
> On 4/12/2021 11:19 AM, Perry Yuan wrote:
> > From: Perry Yuan 
> >
> 
> (...)
> 
> > diff --git a/drivers/platform/x86/dell/dell-laptop.c
> > b/drivers/platform/x86/dell/dell-laptop.c
> > index 70edc5bb3a14..e7ffc0b81208 100644
> > --- a/drivers/platform/x86/dell/dell-laptop.c
> > +++ b/drivers/platform/x86/dell/dell-laptop.c
> > @@ -31,6 +31,8 @@
> >   #include "dell-rbtn.h"
> >   #include "dell-smbios.h"
> >
> > +#include "dell-privacy-wmi.h"
> > +
> >   struct quirk_entry {
> > bool touchpad_led;
> > bool kbd_led_not_present;
> > @@ -90,6 +92,7 @@ static struct rfkill *wifi_rfkill;
> >   static struct rfkill *bluetooth_rfkill;
> >   static struct rfkill *wwan_rfkill;
> >   static bool force_rfkill;
> > +static bool has_privacy;
> >
> >   module_param(force_rfkill, bool, 0444);
> >   MODULE_PARM_DESC(force_rfkill, "enable rfkill on non whitelisted
> > models");
> >
> > @@ -2206,10 +2209,16 @@ static int __init dell_init(void)
> >
> > if (dell_smbios_find_token(GLOBAL_MIC_MUTE_DISABLE) &&
> > dell_smbios_find_token(GLOBAL_MIC_MUTE_ENABLE)) {
> > -   micmute_led_cdev.brightness = ledtrig_audio_get(LED_AUDIO_MICMUTE);
> > -   ret = led_classdev_register(&platform_device->dev, &micmute_led_cdev);
> > -   if (ret < 0)
> > -   goto fail_led;
> > +   if (dell_privacy_present())
> > +   has_privacy = true;
> > +   else
> > +   has_privacy = false;
> 
> Bit of nitpicking, but you can write the above shorter:
> has_privacy = dell_privacy_present();

Good point, changed the code as you suggested.
Thank you.
Perry.


Re: [PATCH v2 resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-13 Thread Michal Hocko
On Mon 12-04-21 14:40:18, Vlastimil Babka wrote:
> On 4/12/21 2:08 PM, Mel Gorman wrote:
> > zone_pcp_reset allegedly protects against a race with drain_pages
> > using local_irq_save but this is bogus. local_irq_save only operates
> > on the local CPU. If memory hotplug is running on CPU A and drain_pages
> > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> > offers no protection.
> > 
> > This patch deletes IRQ disable/enable on the grounds that IRQs protect
> > nothing and assumes the existing hotplug paths guarantees the PCP cannot be
> > used after zone_pcp_enable(). That should be the case already because all
> > the pages have been freed and there is no page to put on the PCP lists.
> > 
> > Signed-off-by: Mel Gorman 
> 
> Yeah the irq disabling here is clearly bogus, so:
> 
> Acked-by: Vlastimil Babka 
> 
> But I think Michal has a point that we might best leave the pagesets around,
> by a future change. I have some doubts that even with your reordering of the
> reset/destroy after zonelist rebuild in v1 they can't be reachable. We have no
> protection between zonelist rebuild and zonelist traversal, and that's why we
> just leave pgdats around.
> 
> So I can imagine a task racing with memory hotremove might see watermarks as ok
> in get_page_from_freelist() for the zone and proceeds to try_this_zone:, then
> gets stalled/scheduled out while hotremove rebuilds the zonelist and destroys
> the pcplists, then the first task is resumed and proceeds with rmqueue_pcplist().
> 
> So that's very rare thus not urgent, and this patch doesn't make it less rare so
> not a reason to block it.

Completely agreed here. Not an urgent thing to work on but something to
look into long term.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 3/3] sched: Use cpu_dying() to fix balance_push vs hotplug-rollback

2021-04-13 Thread Peter Zijlstra
On Mon, Apr 12, 2021 at 06:22:42PM +0100, Valentin Schneider wrote:
> On 12/04/21 14:03, Peter Zijlstra wrote:
> > On Thu, Mar 11, 2021 at 03:13:04PM +, Valentin Schneider wrote:
> >> Peter Zijlstra  writes:
> >> > @@ -7910,6 +7908,14 @@ int sched_cpu_deactivate(unsigned int cp
> >> >}
> >> >rq_unlock_irqrestore(rq, &rf);
> >> >
> >> > +/*
> >> > + * From this point forward, this CPU will refuse to run any 
> >> > task that
> >> > + * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will 
> >> > actively
> >> > + * push those tasks away until this gets cleared, see
> >> > + * sched_cpu_dying().
> >> > + */
> >> > +balance_push_set(cpu, true);
> >> > +
> >>
> >> AIUI with cpu_dying_mask being flipped before even entering
> >> sched_cpu_deactivate(), we don't need this to be before the
> >> synchronize_rcu() anymore; is there more than that to why you're punting it
> >> back this side of it?
> >
> > I think it does does need to be like this, we need to clearly separate
> > the active=true and balance_push_set(). If we were to somehow observe
> > both balance_push_set() and active==false, we'd be in trouble.
> >
> 
> I'm afraid I don't follow; we're replacing a read of rq->balance_push with
> cpu_dying(), and those are still written on the same side of the
> synchronize_rcu(). What am I missing?

Yeah, I'm not sure anymore either; I tried to work out why I'd done
that but upon closer examination everything fell flat.

Let me try again today :-)

> Oooh, I can't read, only the boot CPU gets its callback uninstalled in
> sched_init()! So secondaries keep push_callback installed up until
> sched_cpu_activate(), but as you said it's not effective unless a rollback
> happens.
> 
> Now, doesn't that mean we should *not* uninstall the callback in
> sched_cpu_dying()? AFAIK it's possible for the initial secondary CPU
> boot to go fine, but the next offline+online cycle fails while going up -
> that would need to rollback with push_callback installed.

Quite; I removed that shortly after sending this; when I tried to write
a comment and found it.


Re: [PATCH 4/7] mm: Introduce verify_page_range()

2021-04-13 Thread Peter Zijlstra
On Mon, Apr 12, 2021 at 01:05:09PM -0700, Kees Cook wrote:
> On Mon, Apr 12, 2021 at 10:00:16AM +0200, Peter Zijlstra wrote:
> > +struct vpr_data {
> > +   int (*fn)(pte_t pte, unsigned long addr, void *data);
> > +   void *data;
> > +};
> 
> Eeerg. This is likely to become an attack target itself. Stored function
> pointer with stored (3rd) argument.

You got some further reading on that? How exactly are those exploited?


[PATCH v2 00/16] Multigenerational LRU Framework

2021-04-13 Thread Yu Zhao
What's new in v2

Special thanks to Jens Axboe for reporting a regression in buffered
I/O and helping test the fix.

This version includes the support of tiers, which represent levels of
usage from file descriptors only. Pages accessed N times via file
descriptors belong to tier order_base_2(N). Each generation contains
at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
bits in page->flags. In contrast to moving across generations which
requires the lru lock, moving across tiers only involves an atomic
operation on page->flags and therefore has a negligible cost. A
feedback loop modeled after the well-known PID controller monitors the
refault rates across all tiers and decides when to activate pages from
which tiers, on the reclaim path.

This feedback model has a few advantages over the current feedforward
model:
1) It has a negligible overhead in the buffered I/O access path
   because activations are done in the reclaim path.
2) It takes mapped pages into account and avoids overprotecting pages
   accessed multiple times via file descriptors.
3) More tiers offer better protection to pages accessed more than
   twice when buffered-I/O-intensive workloads are under memory
   pressure.
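The tier math above can be sketched in user space. This hypothetical tier_from_refs() mirrors what order_base_2(N) computes (the ceiling of log2(N), with a single access mapping to tier 0):

```c
#include <assert.h>

/* Model of the tier assignment described above: a page accessed N times
 * via file descriptors belongs to tier order_base_2(N), i.e. the smallest
 * tier such that (1 << tier) >= N. Illustrative, not the patch's code. */
static int tier_from_refs(unsigned int refs)
{
	int tier = 0;

	while (refs > (1u << tier))
		tier++;
	return tier;
}
```

So one access is tier 0, two accesses tier 1, and three or four accesses both land in tier 2, which is what makes the tier count grow only logarithmically with the access count.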

The fio/io_uring benchmark shows 14% improvement in IOPS when randomly
accessing Samsung PM981a in the buffered I/O mode.

Highlights from the discussions on v1
=====================================
Thanks to Ying Huang and Dave Hansen for the comments and suggestions
on page table scanning.

A simple worst-case scenario test did not find page table scanning
underperforms the rmap because of the following optimizations:
1) It will not scan page tables from processes that have been sleeping
   since the last scan.
2) It will not scan PTE tables under non-leaf PMD entries that do not
   have the accessed bit set, when
   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
3) It will not zigzag between the PGD table and the same PMD or PTE
   table spanning multiple VMAs. In other words, it finishes all the
   VMAs with the range of the same PMD or PTE table before it returns
   to the PGD table. This optimizes workloads that have large numbers
   of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.

TLDR

The current page reclaim is too expensive in terms of CPU usage and
often making poor choices about what to evict. We would like to offer
an alternative framework that is performant, versatile and
straightforward.

Repo

git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1

Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173

Background
==========
DRAM is a major factor in total cost of ownership, and improving
memory overcommit brings a high return on investment. Over the past
decade of research and experimentation in memory overcommit, we
observed a distinct trend across millions of servers and clients: the
size of page cache has been decreasing because of the growing
popularity of cloud storage. Nowadays anon pages account for more than
90% of our memory consumption and page cache contains mostly
executable pages.

Problems

Notion of active/inactive
-------------------------
For servers equipped with hundreds of gigabytes of memory, the
granularity of the active/inactive is too coarse to be useful for job
scheduling. False active/inactive rates are relatively high, and thus
the assumed savings may not materialize.

For phones and laptops, executable pages are frequently evicted
despite the fact that there are many less recently used anon pages.
Major faults on executable pages cause "janks" (slow UI renderings)
and negatively impact user experience.

For lruvecs from different memcgs or nodes, comparisons are impossible
due to the lack of a common frame of reference.

Incremental scans via rmap
--------------------------
Each incremental scan picks up at where the last scan left off and
stops after it has found a handful of unreferenced pages. For
workloads using a large amount of anon memory, incremental scans lose
the advantage under sustained memory pressure due to high ratios of
the number of scanned pages to the number of reclaimed pages. In our
case, the average ratio of pgscan to pgsteal is above 7.

On top of that, the rmap has poor memory locality due to its complex
data structures. The combined effects typically result in a high
amount of CPU usage in the reclaim path. For example, with zram, a
typical kswapd profile on v5.11 looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

And with real swap, it looks like:
  45.16%  page_vma_mapped_walk
   7.61%  do_raw_spin_lock
   5.69%  vma_interval_tree_iter_next
   4.91%  vma_interval_tree_subtree_search
   3.71%  page_referenced_one

Solutions
=========
Notion of generation numbers

The notion of generation numbers introduces a 

[PATCH v2 01/16] include/linux/memcontrol.h: do not warn in page_memcg_rcu() if !CONFIG_MEMCG

2021-04-13 Thread Yu Zhao
page_memcg_rcu() warns on !rcu_read_lock_held() regardless of
CONFIG_MEMCG. The following code is legit, but it triggers the warning
when !CONFIG_MEMCG, since lock_page_memcg() and unlock_page_memcg()
are empty for this config.

  memcg = lock_page_memcg(page1)
(rcu_read_lock() if CONFIG_MEMCG=y)

  do something to page1

  if (page_memcg_rcu(page2) == memcg)
do something to page2 too as it cannot be migrated away from the
memcg either.

  unlock_page_memcg(page1)
(rcu_read_unlock() if CONFIG_MEMCG=y)

Locking/unlocking rcu consistently for both configs is rigorous but it
also forces unnecessary locking upon users who have no interest in
CONFIG_MEMCG.

This patch removes the assertion for !CONFIG_MEMCG, because
page_memcg_rcu() has a few callers and there are no concerns regarding
their correctness at the moment.

Signed-off-by: Yu Zhao 
---
 include/linux/memcontrol.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..f13dc02cf277 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1077,7 +1077,6 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
 
 static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
 {
-   WARN_ON_ONCE(!rcu_read_lock_held());
return NULL;
 }
 
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 04/16] include/linux/cgroup.h: export cgroup_mutex

2021-04-13 Thread Yu Zhao
cgroup_mutex is needed to synchronize with memcg creations.

Signed-off-by: Yu Zhao 
---
 include/linux/cgroup.h | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4f2f79de083e..bd5744360cfa 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
css_put(&cgrp->self);
 }
 
+extern struct mutex cgroup_mutex;
+
+static inline void cgroup_lock(void)
+{
+   mutex_lock(&cgroup_mutex);
+}
+
+static inline void cgroup_unlock(void)
+{
+   mutex_unlock(&cgroup_mutex);
+}
+
 /**
  * task_css_set_check - obtain a task's css_set with extra access conditions
  * @task: the task to obtain css_set for
@@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp)
  * as locks used during the cgroup_subsys::attach() methods.
  */
 #ifdef CONFIG_PROVE_RCU
-extern struct mutex cgroup_mutex;
 extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)  \
rcu_dereference_check((task)->cgroups,  \
@@ -704,6 +715,8 @@ struct cgroup;
 static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; }
 static inline void css_get(struct cgroup_subsys_state *css) {}
 static inline void css_put(struct cgroup_subsys_state *css) {}
+static inline void cgroup_lock(void) {}
+static inline void cgroup_unlock(void) {}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 struct task_struct *t) { return 0; }
 static inline int cgroupstats_build(struct cgroupstats *stats,
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 06/16] mm, x86: support the access bit on non-leaf PMD entries

2021-04-13 Thread Yu Zhao
Some architectures support the accessed bit on non-leaf PMD entries
(parents) in addition to leaf PTE entries (children) where pages are
mapped, e.g., x86_64 sets the accessed bit on a parent when using it
as part of linear-address translation [1]. Page table walkers who are
interested in the accessed bit on children can take advantage of this:
they do not need to search the children when the accessed bit is not
set on a parent, given that they have previously cleared the accessed
bit on this parent.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
 Volume 3 (October 2019), section 4.8
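The skip logic described above can be modeled in user space. This toy two-level table (invented names, not kernel code) shows why a clear accessed bit on the parent lets a walker bypass the child PTE table entirely, provided the walker cleared the parent's bit on its previous pass:

```c
#include <assert.h>
#include <stdbool.h>

#define PTRS_PER_PMD 8	/* toy value; x86_64 uses 512 */

/* Toy non-leaf PMD: hardware sets .accessed whenever any child is used. */
struct toy_pmd {
	bool accessed;
	bool pte_accessed[PTRS_PER_PMD];
};

/* Count and clear young child PTEs, skipping cold child tables outright. */
static int count_young_ptes(struct toy_pmd *pmd, int *tables_scanned)
{
	int i, young = 0;

	if (!pmd->accessed)
		return 0;	/* parent clear => no child can be young */

	(*tables_scanned)++;
	for (i = 0; i < PTRS_PER_PMD; i++) {
		if (pmd->pte_accessed[i]) {
			pmd->pte_accessed[i] = false;
			young++;
		}
	}
	pmd->accessed = false;	/* re-arm for the next scan */
	return young;
}
```

A walker over many such parents touches only the child tables whose parent bit is set, which is the saving the paragraph above describes.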

Signed-off-by: Yu Zhao 
---
 arch/Kconfig   | 9 +
 arch/x86/Kconfig   | 1 +
 arch/x86/include/asm/pgtable.h | 2 +-
 arch/x86/mm/pgtable.c  | 5 -
 include/linux/pgtable.h| 4 ++--
 5 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..cbd7f66734ee 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -782,6 +782,15 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
bool
 
+config HAVE_ARCH_PARENT_PMD_YOUNG
+   bool
+   depends on PGTABLE_LEVELS > 2
+   help
+ Architectures that select this are able to set the accessed bit on
+ non-leaf PMD entries in addition to leaf PTE entries where pages are
+ mapped. For them, page table walkers that clear the accessed bit may
+ stop at non-leaf PMD entries when they do not see the accessed bit.
+
 config HAVE_ARCH_HUGE_VMAP
bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..b5972eb82337 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -163,6 +163,7 @@ config X86
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
+   select HAVE_ARCH_PARENT_PMD_YOUNG   if X86_64
select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
select HAVE_ARCH_VMAP_STACK if X86_64
select HAVE_ARCH_WITHIN_STACK_FRAMES
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..a6b5cfe1fc5a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,7 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-   return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+   return ((pmd_flags(pmd) | _PAGE_ACCESSED) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f6a9e2e36642..1c27e6f43f80 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
return ret;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG)
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
  unsigned long addr, pmd_t *pmdp)
 {
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 
return ret;
 }
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int pudp_test_and_clear_young(struct vm_area_struct *vma,
  unsigned long addr, pud_t *pudp)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e772392a379..08dd9b8c055a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -193,7 +193,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG)
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
@@ -214,7 +214,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
BUILD_BUG();
return 0;
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 05/16] mm/swap.c: export activate_page()

2021-04-13 Thread Yu Zhao
activate_page() is needed to activate pages that are already on lru or
queued in lru_pvecs.lru_add. The exported function is a merger between
the existing activate_page() and __lru_cache_activate_page().

Signed-off-by: Yu Zhao 
---
 include/linux/swap.h |  1 +
 mm/swap.c| 28 +++-
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4cc6ec3bf0ab..de2bbbf181ba 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,6 +344,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_cpu_zone(struct zone *zone);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
+extern void activate_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d4ed94..f20ed56ebbbf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -334,7 +334,7 @@ static bool need_activate_page_drain(int cpu)
return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
 }
 
-static void activate_page(struct page *page)
+static void activate_page_on_lru(struct page *page)
 {
page = compound_head(page);
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
@@ -354,7 +354,7 @@ static inline void activate_page_drain(int cpu)
 {
 }
 
-static void activate_page(struct page *page)
+static void activate_page_on_lru(struct page *page)
 {
struct lruvec *lruvec;
 
@@ -368,11 +368,22 @@ static void activate_page(struct page *page)
 }
 #endif
 
-static void __lru_cache_activate_page(struct page *page)
+/*
+ * If the page is on the LRU, queue it for activation via
+ * lru_pvecs.activate_page. Otherwise, assume the page is on a
+ * pagevec, mark it active and it'll be moved to the active
+ * LRU on the next drain.
+ */
+void activate_page(struct page *page)
 {
struct pagevec *pvec;
int i;
 
+   if (PageLRU(page)) {
+   activate_page_on_lru(page);
+   return;
+   }
+
local_lock(&lru_pvecs.lock);
pvec = this_cpu_ptr(&lru_pvecs.lru_add);
 
@@ -421,16 +432,7 @@ void mark_page_accessed(struct page *page)
 * evictable page accessed has no effect.
 */
} else if (!PageActive(page)) {
-   /*
-* If the page is on the LRU, queue it for activation via
-* lru_pvecs.activate_page. Otherwise, assume the page is on a
-* pagevec, mark it active and it'll be moved to the active
-* LRU on the next drain.
-*/
-   if (PageLRU(page))
-   activate_page(page);
-   else
-   __lru_cache_activate_page(page);
+   activate_page(page);
ClearPageReferenced(page);
workingset_activation(page);
}
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 11/16] mm: multigenerational lru: aging

2021-04-13 Thread Yu Zhao
The aging produces young generations. Given an lruvec, the aging walks
the mm_struct list associated with this lruvec to scan page tables for
referenced pages. Upon finding one, the aging updates the generation
number of this page to max_seq. After each round of scan, the aging
increments max_seq. The aging is due when both values in min_seq[2] reach
max_seq-1, assuming both anon and file types are reclaimable.
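That due-ness condition can be written as a one-liner; a hypothetical helper sketching the check stated above (index 0 for anon, 1 for file, both assumed reclaimable):

```c
#include <assert.h>
#include <stdbool.h>

/* Another round of aging is needed once both oldest generations have
 * caught up to max_seq - 1. Illustrative helper, not the patch's API. */
static bool aging_is_due(const unsigned long min_seq[2], unsigned long max_seq)
{
	return min_seq[0] == max_seq - 1 && min_seq[1] == max_seq - 1;
}
```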

The aging uses the following optimizations when scanning page tables:
  1) It will not scan page tables from processes that have been
  sleeping since the last scan.
  2) It will not scan PTE tables under non-leaf PMD entries that do
  not have the accessed bit set, when
  CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
  3) It will not zigzag between the PGD table and the same PMD or PTE
  table spanning multiple VMAs. In other words, it finishes all the
  VMAs with the range of the same PMD or PTE table before it returns
  to the PGD table. This optimizes workloads that have large numbers
  of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.

Signed-off-by: Yu Zhao 
---
 mm/vmscan.c | 700 
 1 file changed, 700 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d67dfd1e3930..31e1b4155677 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -4771,6 +4772,702 @@ static bool get_next_mm(struct mm_walk_args *args, int swappiness, struct mm_str
return last;
 }
 
+/******************************************************************************
+ *                              the aging
+ ******************************************************************************/
+
+static void update_batch_size(struct page *page, int old_gen, int new_gen,
+ struct mm_walk_args *args)
+{
+   int file = page_is_file_lru(page);
+   int zone = page_zonenum(page);
+   int delta = thp_nr_pages(page);
+
+   VM_BUG_ON(old_gen >= MAX_NR_GENS);
+   VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+   args->batch_size++;
+
+   args->nr_pages[old_gen][file][zone] -= delta;
+   args->nr_pages[new_gen][file][zone] += delta;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct mm_walk_args *args)
+{
+   int gen, file, zone;
+   struct lrugen *lrugen = &lruvec->evictable;
+
+   args->batch_size = 0;
+
+   spin_lock_irq(&lruvec->lru_lock);
+
+   for_each_gen_type_zone(gen, file, zone) {
+   enum lru_list lru = LRU_FILE * file;
+   int total = args->nr_pages[gen][file][zone];
+
+   if (!total)
+   continue;
+
+   args->nr_pages[gen][file][zone] = 0;
+   WRITE_ONCE(lrugen->sizes[gen][file][zone],
+  lrugen->sizes[gen][file][zone] + total);
+
+   if (lru_gen_is_active(lruvec, gen))
+   lru += LRU_ACTIVE;
+   update_lru_size(lruvec, lru, zone, total);
+   }
+
+   spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static int page_update_gen(struct page *page, int new_gen)
+{
+   int old_gen;
+   unsigned long old_flags, new_flags;
+
+   VM_BUG_ON(new_gen >= MAX_NR_GENS);
+
+   do {
+   old_flags = READ_ONCE(page->flags);
+
+   old_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+   if (old_gen < 0)
+   new_flags = old_flags | BIT(PG_referenced);
+   else
+   new_flags = (old_flags & ~(LRU_GEN_MASK | LRU_USAGE_MASK |
+   LRU_TIER_FLAGS)) | ((new_gen + 1UL) << LRU_GEN_PGOFF);
+
+   if (old_flags == new_flags)
+   break;
+   } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
+
+   return old_gen;
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct 
mm_walk *walk)
+{
+   struct vm_area_struct *vma = walk->vma;
+   struct mm_walk_args *args = walk->private;
+
+   if (!vma_is_accessible(vma) || is_vm_hugetlb_page(vma) ||
+   (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)))
+   return true;
+
+   if (vma_is_anonymous(vma))
+   return !args->should_walk[0];
+
+   if (vma_is_shmem(vma))
+   return !args->should_walk[0] ||
+  mapping_unevictable(vma->vm_file->f_mapping);
+
+   return !args->should_walk[1] || vma_is_dax(vma) ||
+  vma == get_gate_vma(vma->vm_mm) ||
+  mapping_unevictable(vma->vm_file->f_mapping);
+}
+
+/*
+ * Some userspace memory allocators create many single-page VMAs. So instead of
+ * returning back to the PGD table for each of such VMAs, we finish at least an
+ * entire PMD table and therefore avoid many zigzags. This optimizes page table
+ * walks for workloads that have large numbers of tiny VMAs.
+ *
+ * We scan PMD tables in two passes. The first pass reaches to PTE 

[PATCH v2 08/16] mm: multigenerational lru: groundwork

2021-04-13 Thread Yu Zhao
For each lruvec, evictable pages are divided into multiple
generations. The youngest generation number is stored in max_seq for
both anon and file types as they are aged on an equal footing. The
oldest generation numbers are stored in min_seq[2] separately for anon
and file types as clean file pages can be evicted regardless of
may_swap or may_writepage. Generation numbers are truncated into
order_base_2(MAX_NR_GENS+1) bits in order to fit into page->flags. The
sliding window technique is used to prevent truncated generation
numbers from overlapping. Each truncated generation number is an index
to lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
Evictable pages are added to the per-zone lists indexed by max_seq or
min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
faulted in.
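The sliding-window truncation can be sketched as follows. Assuming an illustrative MAX_NR_GENS of 4, any window of at most MAX_NR_GENS consecutive sequence numbers maps to distinct stored indices, which is why the few bits in page->flags suffice:

```c
#include <assert.h>

#define MAX_NR_GENS 4UL	/* illustrative; the real value is config-dependent */

/* Model of how an unbounded sequence number (max_seq/min_seq) is truncated
 * into the bits available in page->flags: the stored index is
 * seq % MAX_NR_GENS, and the invariant max_seq - min_seq < MAX_NR_GENS
 * keeps truncated values from overlapping. */
static unsigned long gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}
```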

The workflow comprises two conceptually independent functions: the
aging and the eviction. The aging produces young generations. Given an
lruvec, the aging scans page tables for referenced pages of this
lruvec. Upon finding one, the aging updates its generation number to
max_seq. After each round of scan, the aging increments max_seq. The
aging is due when both values in min_seq[2] reach max_seq-1, assuming both
anon and file types are reclaimable.

The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either of min_seq[2].
It tries to select a type based on the values of min_seq[2] and
swappiness. During a scan, the eviction sorts pages according to their
generation numbers, if the aging has found them referenced. When it
finds all the per-zone lists of a selected type are empty, the
eviction increments min_seq[2] indexed by this selected type.

Signed-off-by: Yu Zhao 
---
 fs/fuse/dev.c |   3 +-
 include/linux/mm.h|   2 +
 include/linux/mm_inline.h | 193 +++
 include/linux/mmzone.h| 110 +++
 include/linux/page-flags-layout.h |  20 +-
 include/linux/page-flags.h|   4 +-
 kernel/bounds.c   |   6 +
 mm/huge_memory.c  |   3 +-
 mm/mm_init.c  |  16 +-
 mm/mmzone.c   |   2 +
 mm/swapfile.c |   4 +
 mm/vmscan.c   | 305 ++
 12 files changed, 656 insertions(+), 12 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index c0fee830a34e..27c83f557794 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -784,7 +784,8 @@ static int fuse_check_page(struct page *page)
   1 << PG_lru |
   1 << PG_active |
   1 << PG_reclaim |
-  1 << PG_waiters))) {
+  1 << PG_waiters |
+  LRU_GEN_MASK | LRU_USAGE_MASK))) {
dump_page(page, "fuse: trying to steal weird page");
return 1;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..2c8a2db78ce9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1070,6 +1070,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
 #define ZONES_PGOFF(NODES_PGOFF - ZONES_WIDTH)
 #define LAST_CPUPID_PGOFF  (ZONES_PGOFF - LAST_CPUPID_WIDTH)
 #define KASAN_TAG_PGOFF(LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
+#define LRU_GEN_PGOFF  (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
+#define LRU_USAGE_PGOFF(LRU_GEN_PGOFF - LRU_USAGE_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..2bf910eb3dd7 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -79,11 +79,198 @@ static __always_inline enum lru_list page_lru(struct page *page)
return lru;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+#ifdef CONFIG_LRU_GEN_ENABLED
+DECLARE_STATIC_KEY_TRUE(lru_gen_static_key);
+#define lru_gen_enabled() static_branch_likely(&lru_gen_static_key)
+#else
+DECLARE_STATIC_KEY_FALSE(lru_gen_static_key);
+#define lru_gen_enabled() static_branch_unlikely(&lru_gen_static_key)
+#endif
+
+/* We track at most MAX_NR_GENS generations using the sliding window technique. */
+static inline int lru_gen_from_seq(unsigned long seq)
+{
+   return seq % MAX_NR_GENS;
+}
+
+/* Return a proper index regardless whether we keep a full history of stats. */
+static inline int sid_from_seq_or_gen(int seq_or_gen)
+{
+   return seq_or_gen % NR_STAT_GENS;
+}
+
+/* The youngest and the second youngest generations are considered active. */
+static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
+{
+   unsigned long max_seq = READ_ONCE(lruvec->evictable.max_seq);
+
+   VM_BUG_ON(!max_seq);
+   VM_BUG_ON(gen >= MAX_NR_GENS);
+
+   return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
+}
+
+/* Update the sizes of the multigenerational lru. */
+static inline void lru_gen_update_size(struct page *page, 

[PATCH v2 12/16] mm: multigenerational lru: eviction

2021-04-13 Thread Yu Zhao
The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either of min_seq[2].
It first tries to select a type based on the values of min_seq[2].
When anon and file types are both available from the same generation,
it selects the one that has a lower refault rate.

During a scan, the eviction sorts pages according to their generation
numbers, if the aging has found them referenced. It also moves pages
from the tiers that have higher refault rates than tier 0 to the next
generation. When it finds all the per-zone lists of a selected type
are empty, the eviction increments min_seq[2] indexed by this selected
type.

Signed-off-by: Yu Zhao 
---
 mm/vmscan.c | 341 
 1 file changed, 341 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 31e1b4155677..6239b1acd84f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5468,6 +5468,347 @@ static bool walk_mm_list(struct lruvec *lruvec, unsigned long max_seq,
return true;
 }
 
+/******************************************************************************
+ *                          the eviction
+ ******************************************************************************/
+
+static bool sort_page(struct page *page, struct lruvec *lruvec, int tier_to_isolate)
+{
+   bool success;
+   int gen = page_lru_gen(page);
+   int file = page_is_file_lru(page);
+   int zone = page_zonenum(page);
+   int tier = lru_tier_from_usage(page_tier_usage(page));
+   struct lrugen *lrugen = &lruvec->evictable;
+
+   VM_BUG_ON_PAGE(gen == -1, page);
+   VM_BUG_ON_PAGE(tier_to_isolate < 0, page);
+
+   /* a lazy-free page that has been written into? */
+   if (file && PageDirty(page) && PageAnon(page)) {
+   success = lru_gen_deletion(page, lruvec);
+   VM_BUG_ON_PAGE(!success, page);
+   SetPageSwapBacked(page);
+   add_page_to_lru_list_tail(page, lruvec);
+   return true;
+   }
+
+   /* page_update_gen() has updated the page? */
+   if (gen != lru_gen_from_seq(lrugen->min_seq[file])) {
+   list_move(&page->lru, &lrugen->lists[gen][file][zone]);
+   return true;
+   }
+
+   /* activate the page if its tier has a higher refault rate */
+   if (tier_to_isolate < tier) {
+   int sid = sid_from_seq_or_gen(gen);
+
+   page_inc_gen(page, lruvec, false);
+   WRITE_ONCE(lrugen->activated[sid][file][tier - 1],
+  lrugen->activated[sid][file][tier - 1] + thp_nr_pages(page));
+   inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);
+   return true;
+   }
+
+   /*
+* A page can't be immediately evicted, and page_inc_gen() will mark it
+* for reclaim and hopefully writeback will write it soon if it's dirty.
+*/
+   if (PageLocked(page) || PageWriteback(page) || (file && PageDirty(page))) {
+   page_inc_gen(page, lruvec, true);
+   return true;
+   }
+
+   return false;
+}
+
+static bool should_skip_page(struct page *page, struct scan_control *sc)
+{
+   if (!sc->may_unmap && page_mapped(page))
+   return true;
+
+   if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
+   (PageDirty(page) || (PageAnon(page) && !PageSwapCache(page))))
+   return true;
+
+   if (!get_page_unless_zero(page))
+   return true;
+
+   if (!TestClearPageLRU(page)) {
+   put_page(page);
+   return true;
+   }
+
+   return false;
+}
+
+static void isolate_page(struct page *page, struct lruvec *lruvec)
+{
+   bool success;
+
+   success = lru_gen_deletion(page, lruvec);
+   VM_BUG_ON_PAGE(!success, page);
+
+   if (PageActive(page)) {
+   ClearPageActive(page);
+   /* make sure shrink_page_list() rejects this page */
+   SetPageReferenced(page);
+   return;
+   }
+
+   /* make sure shrink_page_list() doesn't try to write this page */
+   ClearPageReclaim(page);
+   /* make sure shrink_page_list() doesn't reject this page */
+   ClearPageReferenced(page);
+}
+
+static int scan_lru_gen_pages(struct lruvec *lruvec, struct scan_control *sc,
+ long *nr_to_scan, int file, int tier,
+ struct list_head *list)
+{
+   bool success;
+   int gen, zone;
+   enum vm_event_item item;
+   int sorted = 0;
+   int scanned = 0;
+   int isolated = 0;
+   int batch_size = 0;
+   struct lrugen *lrugen = &lruvec->evictable;
+
+   VM_BUG_ON(!list_empty(list));
+
+   if (get_nr_gens(lruvec, file) == MIN_NR_GENS)
+   return -ENOENT;
+
+   gen = lru_gen_from_seq(lrugen->min_seq[file]);
+
+   for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+   

[PATCH v2 10/16] mm: multigenerational lru: mm_struct list

2021-04-13 Thread Yu Zhao
In order to scan page tables, we add an infrastructure to maintain
either a system-wide mm_struct list or per-memcg mm_struct lists.
Multiple threads can concurrently work on the same mm_struct list, and
each of them will be given a different mm_struct.

This infrastructure also tracks whether an mm_struct is being used on
any CPUs or has been used since the last time a worker looked at it.
In other words, workers will not be given an mm_struct that belongs to
a process that has been sleeping.

Signed-off-by: Yu Zhao 
---
 fs/exec.c  |   2 +
 include/linux/memcontrol.h |   6 +
 include/linux/mm_types.h   | 117 ++
 include/linux/mmzone.h |   2 -
 kernel/exit.c  |   1 +
 kernel/fork.c  |  10 ++
 kernel/kthread.c   |   1 +
 kernel/sched/core.c|   2 +
 mm/memcontrol.c|  28 
 mm/vmscan.c| 316 +
 10 files changed, 483 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..c691d4d7720c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1008,6 +1008,7 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
+   lru_gen_add_mm(mm);
/*
 * This prevents preemption while active_mm is being loaded and
 * it and mm are being updated, which could cause problems for
@@ -1018,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
activate_mm(active_mm, mm);
+   lru_gen_switch_mm(active_mm, mm);
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f13dc02cf277..cff95ed1ee2b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -212,6 +212,8 @@ struct obj_cgroup {
};
 };
 
+struct lru_gen_mm_list;
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -335,6 +337,10 @@ struct mem_cgroup {
struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+   struct lru_gen_mm_list *mm_list;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
 };
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..f8a239fbb958 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -383,6 +385,8 @@ struct core_state {
struct completion startup;
 };
 
+#define ANON_AND_FILE 2
+
 struct kioctx_table;
 struct mm_struct {
struct {
@@ -561,6 +565,22 @@ struct mm_struct {
 
 #ifdef CONFIG_IOMMU_SUPPORT
u32 pasid;
+#endif
+#ifdef CONFIG_LRU_GEN
+   struct {
+   /* the node of a global or per-memcg mm_struct list */
+   struct list_head list;
+#ifdef CONFIG_MEMCG
+   /* points to memcg of the owner task above */
+   struct mem_cgroup *memcg;
+#endif
+   /* whether this mm_struct has been used since the last walk */
+   nodemask_t nodes[ANON_AND_FILE];
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+   /* the number of CPUs using this mm_struct */
+   atomic_t nr_cpus;
+#endif
+   } lrugen;
 #endif
} __randomize_layout;
 
@@ -588,6 +608,103 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+void lru_gen_init_mm(struct mm_struct *mm);
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+int lru_gen_alloc_mm_list(struct mem_cgroup *memcg);
+void lru_gen_free_mm_list(struct mem_cgroup *memcg);
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+/*
+ * Track the usage so mm_struct's that haven't been used since the last walk can
+ * be skipped. This function adds a theoretical overhead to each context switch,
+ * which hasn't been measurable.
+ */
+static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new)
+{
+   int file;
+
+   /* exclude init_mm, efi_mm, etc. */
+   if (!core_kernel_data((unsigned long)old)) {
+   VM_BUG_ON(old == &init_mm);
+
+   for (file = 0; file < ANON_AND_FILE; file++)
+   nodes_setall(old->lrugen.nodes[file]);
+
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+   atomic_dec(&old->lrugen.nr_cpus);
+   VM_BUG_ON_MM(atomic_read(&old->lrugen.nr_cpus) < 0, old);
+#endif
+   } else
+   

[PATCH v2 03/16] include/linux/huge_mm.h: define is_huge_zero_pmd() if !CONFIG_TRANSPARENT_HUGEPAGE

2021-04-13 Thread Yu Zhao
Currently is_huge_zero_pmd() only exists when
CONFIG_TRANSPARENT_HUGEPAGE=y. This patch adds the function for
!CONFIG_TRANSPARENT_HUGEPAGE.

Signed-off-by: Yu Zhao 
---
 include/linux/huge_mm.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ba973efcd369..0ba7b3f9029c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -443,6 +443,11 @@ static inline bool is_huge_zero_page(struct page *page)
return false;
 }
 
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+   return false;
+}
+
 static inline bool is_huge_zero_pud(pud_t pud)
 {
return false;
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 02/16] include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA

2021-04-13 Thread Yu Zhao
Currently next_memory_node only exists when CONFIG_NUMA=y. This patch
adds the macro for !CONFIG_NUMA.

Signed-off-by: Yu Zhao 
---
 include/linux/nodemask.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index ac398e143c9a..89fe4e3592f9 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state)
 #define first_online_node  0
 #define first_memory_node  0
 #define next_online_node(nid)  (MAX_NUMNODES)
+#define next_memory_node(nid)  (MAX_NUMNODES)
 #define nr_node_ids1U
 #define nr_online_nodes1U
 
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 07/16] mm/vmscan.c: refactor shrink_node()

2021-04-13 Thread Yu Zhao
Heuristics that determine scan balance between anon and file LRUs are
rather independent. Move them into a separate function to improve
readability.

Signed-off-by: Yu Zhao 
---
 mm/vmscan.c | 186 +++-
 1 file changed, 98 insertions(+), 88 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1..1a24d2e0a4cb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2224,6 +2224,103 @@ enum scan_balance {
SCAN_FILE,
 };
 
+static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
+{
+   unsigned long file;
+   struct lruvec *target_lruvec;
+
+   target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
+   /*
+* Determine the scan balance between anon and file LRUs.
+*/
+   spin_lock_irq(&target_lruvec->lru_lock);
+   sc->anon_cost = target_lruvec->anon_cost;
+   sc->file_cost = target_lruvec->file_cost;
+   spin_unlock_irq(&target_lruvec->lru_lock);
+
+   /*
+* Target desirable inactive:active list ratios for the anon
+* and file LRU lists.
+*/
+   if (!sc->force_deactivate) {
+   unsigned long refaults;
+
+   refaults = lruvec_page_state(target_lruvec,
+   WORKINGSET_ACTIVATE_ANON);
+   if (refaults != target_lruvec->refaults[0] ||
+   inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+   sc->may_deactivate |= DEACTIVATE_ANON;
+   else
+   sc->may_deactivate &= ~DEACTIVATE_ANON;
+
+   /*
+* When refaults are being observed, it means a new
+* workingset is being established. Deactivate to get
+* rid of any stale active pages quickly.
+*/
+   refaults = lruvec_page_state(target_lruvec,
+   WORKINGSET_ACTIVATE_FILE);
+   if (refaults != target_lruvec->refaults[1] ||
+   inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+   sc->may_deactivate |= DEACTIVATE_FILE;
+   else
+   sc->may_deactivate &= ~DEACTIVATE_FILE;
+   } else
+   sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
+
+   /*
+* If we have plenty of inactive file pages that aren't
+* thrashing, try to reclaim those first before touching
+* anonymous pages.
+*/
+   file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
+   if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+   sc->cache_trim_mode = 1;
+   else
+   sc->cache_trim_mode = 0;
+
+   /*
+* Prevent the reclaimer from falling into the cache trap: as
+* cache pages start out inactive, every cache fault will tip
+* the scan balance towards the file LRU.  And as the file LRU
+* shrinks, so does the window for rotation from references.
+* This means we have a runaway feedback loop where a tiny
+* thrashing file LRU becomes infinitely more attractive than
+* anon pages.  Try to detect this based on file LRU size.
+*/
+   if (!cgroup_reclaim(sc)) {
+   unsigned long total_high_wmark = 0;
+   unsigned long free, anon;
+   int z;
+
+   free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+   file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+  node_page_state(pgdat, NR_INACTIVE_FILE);
+
+   for (z = 0; z < MAX_NR_ZONES; z++) {
+   struct zone *zone = &pgdat->node_zones[z];
+
+   if (!managed_zone(zone))
+   continue;
+
+   total_high_wmark += high_wmark_pages(zone);
+   }
+
+   /*
+* Consider anon: if that's low too, this isn't a
+* runaway file reclaim problem, but rather just
+* extreme pressure. Reclaim as per usual then.
+*/
+   anon = node_page_state(pgdat, NR_INACTIVE_ANON);
+
+   sc->file_is_tiny =
+   file + free <= total_high_wmark &&
+   !(sc->may_deactivate & DEACTIVATE_ANON) &&
+   anon >> sc->priority;
+   }
+}
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -2669,7 +2766,6 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long nr_reclaimed, nr_scanned;
struct lruvec *target_lruvec;
bool reclaimable = false;
-   unsigned long file;
 
target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
@@ -2679,93 +2775,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)

[PATCH v2 09/16] mm: multigenerational lru: activation

2021-04-13 Thread Yu Zhao
For pages accessed multiple times via file descriptors, instead of
activating them upon the second accesses, we activate them based on
the refault rates of their tiers. Pages accessed N times via file
descriptors belong to tier order_base_2(N). Pages from tier 0, i.e.,
those read ahead, accessed once via file descriptors and accessed only
via page tables, are evicted regardless of the refault rate. Pages
from other tiers will be moved to the next generation, i.e.,
activated, if the refault rates of their tiers are higher than that of
tier 0. Each generation contains at most MAX_NR_TIERS tiers, and they
require additional MAX_NR_TIERS-2 bits in page->flags. This feedback
model has a few advantages over the current feedforward model:
  1) It has a negligible overhead in the access path because
  activations are done in the reclaim path.
  2) It takes mapped pages into account and avoids overprotecting
  pages accessed multiple times via file descriptors.
  3) More tiers offer better protection to pages accessed more than
  twice when buffered-I/O-intensive workloads are under memory
  pressure.

For pages mapped upon page faults, the accessed bit is set and they
must be properly aged. We add them to the per-zone lists indexed by
max_seq, i.e., the youngest generation. For pages not in page cache
or swap cache, this can be done easily in the page fault path: we
rename lru_cache_add_inactive_or_unevictable() to
lru_cache_add_page_vma() and add a new parameter, which is set to true
for pages mapped upon page faults. For pages in page cache or swap
cache, we cannot differentiate the page fault path from the read ahead
path at the time we call lru_cache_add() in add_to_page_cache_lru()
and __read_swap_cache_async(). So we add a new function
lru_gen_activation(), which is essentially activate_page(), to move
pages to the per-zone lists indexed by max_seq at a later time.
Hopefully we would find those pages in lru_pvecs.lru_add and simply
set PageActive() on them without having to actually move them.

Finally, we need to be compatible with the existing notion of active
and inactive. We cannot use PageActive() because it is not set on
active pages unless they are isolated, in order to spare the aging the
trouble of clearing it when an active generation becomes inactive. A
new function page_is_active() compares the generation number of a page
with max_seq and max_seq-1 (modulo MAX_NR_GENS), which are considered
active and protected from the eviction. Other generations, which may
or may not exist, are considered inactive.

Signed-off-by: Yu Zhao 
---
 fs/proc/task_mmu.c|   3 +-
 include/linux/mm_inline.h | 101 +
 include/linux/swap.h  |   4 +-
 kernel/events/uprobes.c   |   2 +-
 mm/huge_memory.c  |   2 +-
 mm/khugepaged.c   |   2 +-
 mm/memory.c   |  14 +--
 mm/migrate.c  |   2 +-
 mm/swap.c |  26 +++---
 mm/swapfile.c |   2 +-
 mm/userfaultfd.c  |   2 +-
 mm/vmscan.c   |  91 ++-
 mm/workingset.c   | 179 +++---
 13 files changed, 371 insertions(+), 59 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e862cab69583..d292f20c4e3d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1718,7 +1719,7 @@ static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty,
if (PageSwapCache(page))
md->swapcache += nr_pages;
 
-   if (PageActive(page) || PageUnevictable(page))
+   if (PageUnevictable(page) || page_is_active(compound_head(page), NULL))
md->active += nr_pages;
 
if (PageWriteback(page))
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 2bf910eb3dd7..5eb4b12972ec 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -95,6 +95,12 @@ static inline int lru_gen_from_seq(unsigned long seq)
return seq % MAX_NR_GENS;
 }
 
+/* Convert the level of usage to a tier. See the comment on MAX_NR_TIERS. */
+static inline int lru_tier_from_usage(int usage)
+{
+   return order_base_2(usage + 1);
+}
+
 /* Return a proper index regardless whether we keep a full history of stats. */
 static inline int sid_from_seq_or_gen(int seq_or_gen)
 {
@@ -238,12 +244,93 @@ static inline bool lru_gen_deletion(struct page *page, struct lruvec *lruvec)
return true;
 }
 
+/* Activate a page from page cache or swap cache after it's mapped. */
+static inline void lru_gen_activation(struct page *page, struct vm_area_struct *vma)
+{
+   if (!lru_gen_enabled())
+   return;
+
+   if (PageActive(page) || PageUnevictable(page) || vma_is_dax(vma) ||
+   (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)))
+   return;
+   /*
+* TODO: pass vm_fault to add_to_page_cache_lru() and
+* 

Re: [PATCH 2/4] dt-bindings: Add bindings for aspeed pwm

2021-04-13 Thread Billy Tsai
Hi Rob,

Best Regards,
Billy Tsai

On 2021/4/12, 9:20 PM, Rob Herring wrote:

On Mon, 12 Apr 2021 17:54:55 +0800, Billy Tsai wrote:
>> This patch adds device bindings for aspeed pwm device which should be
>> the sub-node of aspeed,ast2600-pwm-tach.
>> 
>> Signed-off-by: Billy Tsai 
>> ---
>>  .../bindings/pwm/aspeed,ast2600-pwm.yaml  | 47 +++
>>  1 file changed, 47 insertions(+)
>>  create mode 100644 
Documentation/devicetree/bindings/pwm/aspeed,ast2600-pwm.yaml
>> 

> My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check'
> on your patch (DT_CHECKER_FLAGS is new in v5.13):

> yamllint warnings/errors:

> dtschema/dtc warnings/errors:
> /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/pwm/aspeed,ast2600-pwm.yaml: Additional properties are not allowed ('pwm-cells' was unexpected)
> /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/pwm/aspeed,ast2600-pwm.yaml: Additional properties are not allowed ('pwm-cells' was unexpected)
> /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/pwm/aspeed,ast2600-pwm.yaml: ignoring, error in schema:
> warning: no schema found in file: ./Documentation/devicetree/bindings/pwm/aspeed,ast2600-pwm.yaml
> Documentation/devicetree/bindings/pwm/aspeed,ast2600-pwm.example.dt.yaml:0:0: /example-0/pwm_tach@1e61: failed to match any schema with compatible: ['aspeed,ast2600-pwm-tach', 'simple-mfd', 'syscon']
> Documentation/devicetree/bindings/pwm/aspeed,ast2600-pwm.example.dt.yaml:0:0: /example-0/pwm_tach@1e61/pwm@0: failed to match any schema with compatible: ['aspeed,ast2600-pwm']

The yaml file has dependencies on the first patch in this series. I will
squash them.

> See https://patchwork.ozlabs.org/patch/1465116

> This check can fail if there are any dependencies. The base for a patch
> series is generally the most recent rc1.

> If you already ran 'make dt_binding_check' and didn't see the above
> error(s), then make sure 'yamllint' is installed and dt-schema is up to
> date:

> pip3 install dtschema --upgrade

> Please check and re-submit.




[PATCH 3/3] rseq: optimise for 64bit arches

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
on 64bit arches is suboptimal.

We might revisit this patch once all 32bit arches support
get_user() and/or put_user() for 8-byte values.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 57344f9abb43905c7dd2b6081205ff508d963e1e..18a75a804008d2f564d1f7789f09216f1a8760bd 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -127,8 +127,13 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
u32 sig;
int ret;
 
+#ifdef CONFIG_64BIT
+   if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
+   return -EFAULT;
+#else
if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
return -EFAULT;
+#endif
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -211,9 +216,13 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
+#ifdef CONFIG_64BIT
+   return put_user(0ULL, &t->rseq->rseq_cs.ptr64);
+#else
if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
return -EFAULT;
return 0;
+#endif
 }
 
 /*
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 09/12] usb: dwc2: Allow exit clock gating in urb enqueue

2021-04-13 Thread Artur Petrosyan
When the core is in the clock gating state and an external
hub is connected, the upper layer sends a URB enqueue request,
which results in a port reset issue.

Exit the clock gating state to avoid the port reset issue
and process the upper layer's request properly.

Signed-off-by: Artur Petrosyan 
---
 Changes in v2:
 - None

 drivers/usb/dwc2/hcd.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index 8a42675ab94e..31d6a1b87228 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -4597,6 +4597,14 @@ static int _dwc2_hcd_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
"exit partial_power_down failed\n");
}
 
+   if (hsotg->params.power_down == DWC2_POWER_DOWN_PARAM_NONE &&
+   hsotg->bus_suspended) {
+   if (dwc2_is_device_mode(hsotg))
+   dwc2_gadget_exit_clock_gating(hsotg, 0);
+   else
+   dwc2_host_exit_clock_gating(hsotg, 0);
+   }
+
if (!ep)
return -EINVAL;
 
-- 
2.25.1



[PATCH v2 11/12] usb: dwc2: Add clock gating exiting flow by system resume

2021-04-13 Thread Artur Petrosyan
If neither hibernation nor partial power down is supported,
port resume is done using the clock gating programming flow.

Add a new flow for exiting clock gating when the PC is
resumed.

Signed-off-by: Artur Petrosyan 
---
 Changes in v2:
 - None

 drivers/usb/dwc2/hcd.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index 09dcd37b9ef8..04a1b53d65af 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -4445,6 +4445,28 @@ static int _dwc2_hcd_resume(struct usb_hcd *hcd)
break;
case DWC2_POWER_DOWN_PARAM_HIBERNATION:
case DWC2_POWER_DOWN_PARAM_NONE:
+   /*
+* If not hibernation nor partial power down are supported,
+* port resume is done using the clock gating programming flow.
+*/
+   spin_unlock_irqrestore(&hsotg->lock, flags);
+   dwc2_host_exit_clock_gating(hsotg, 0);
+
+   /*
+* Initialize the Core for Host mode, as after system resume
+* the global interrupts are disabled.
+*/
+   dwc2_core_init(hsotg, false);
+   dwc2_enable_global_interrupts(hsotg);
+   dwc2_hcd_reinit(hsotg);
+   spin_lock_irqsave(&hsotg->lock, flags);
+
+   /*
+* Set HW accessible bit before powering on the controller
+* since an interrupt may rise.
+*/
+   set_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags);
+   break;
default:
hsotg->lx_state = DWC2_L0;
goto unlock;
-- 
2.25.1



[PATCH v2 10/12] usb: dwc2: Add clock gating entering flow by system suspend

2021-04-13 Thread Artur Petrosyan
If neither hibernation nor partial power down is supported,
clock gating is used to save power.

Add a new flow for entering clock gating when the PC is
suspended.

Signed-off-by: Artur Petrosyan 
---
 Changes in v2:
 - None

 drivers/usb/dwc2/hcd.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index 31d6a1b87228..09dcd37b9ef8 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -4372,6 +4372,15 @@ static int _dwc2_hcd_suspend(struct usb_hcd *hcd)
break;
case DWC2_POWER_DOWN_PARAM_HIBERNATION:
case DWC2_POWER_DOWN_PARAM_NONE:
+   /*
+* If not hibernation nor partial power down are supported,
+* clock gating is used to save power.
+*/
+   dwc2_host_enter_clock_gating(hsotg);
+
+   /* After entering suspend, hardware is not accessible */
+   clear_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags);
+   break;
default:
goto skip_power_saving;
}
-- 
2.25.1



[PATCH 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index d2689ccbb132c0fc8ec0924008771e5ee1ca855e..57344f9abb43905c7dd2b6081205ff508d963e1e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 2/2] Bluetooth: Support the vendor specific debug events

2021-04-13 Thread Joseph Hwang
This patch allows a user space process to enable/disable the vendor
specific (vs) debug events dynamically through the set experimental
feature mgmt interface if CONFIG_BT_FEATURE_VS_DBG_EVT is enabled.

Since the debug event feature needs to invoke the callback function
provided by the driver, i.e., hdev->set_vs_dbg_evt, a valid controller
index is required.

For generic Linux machines, the vendor specific debug events are
disabled by default.

Reviewed-by: Chethan Tumkur Narayan 

Reviewed-by: Kiran Krishnappa 
Reviewed-by: Miao-chen Chou 
Signed-off-by: Joseph Hwang 
---

(no changes since v1)

 drivers/bluetooth/btintel.c  |  73 -
 drivers/bluetooth/btintel.h  |  13 
 drivers/bluetooth/btusb.c|  16 +
 include/net/bluetooth/hci.h  |   4 ++
 include/net/bluetooth/hci_core.h |  10 +++
 net/bluetooth/Kconfig|  10 +++
 net/bluetooth/mgmt.c | 108 ++-
 7 files changed, 232 insertions(+), 2 deletions(-)

diff --git a/drivers/bluetooth/btintel.c b/drivers/bluetooth/btintel.c
index de1dbdc01e5a..c0f81d29aa5f 100644
--- a/drivers/bluetooth/btintel.c
+++ b/drivers/bluetooth/btintel.c
@@ -1213,6 +1213,7 @@ void btintel_reset_to_bootloader(struct hci_dev *hdev)
 }
 EXPORT_SYMBOL_GPL(btintel_reset_to_bootloader);
 
+#ifdef CONFIG_BT_FEATURE_VS_DBG_EVT
 int btintel_read_debug_features(struct hci_dev *hdev,
struct intel_debug_features *features)
 {
@@ -1254,14 +1255,18 @@ int btintel_set_debug_features(struct hci_dev *hdev,
u8 trace_enable = 0x02;
struct sk_buff *skb;
 
-   if (!features)
+   if (!features) {
+   bt_dev_warn(hdev, "Debug features not read");
return -EINVAL;
+   }
 
if (!(features->page1[0] & 0x3f)) {
bt_dev_info(hdev, "Telemetry exception format not supported");
return 0;
}
 
+   bt_dev_info(hdev, "trace_enable %d mask %d", trace_enable, mask[3]);
+
skb = __hci_cmd_sync(hdev, 0xfc8b, 11, mask, HCI_INIT_TIMEOUT);
if (IS_ERR(skb)) {
bt_dev_err(hdev, "Setting Intel telemetry ddc write event mask failed (%ld)",
@@ -1290,6 +1295,72 @@ int btintel_set_debug_features(struct hci_dev *hdev,
 }
 EXPORT_SYMBOL_GPL(btintel_set_debug_features);
 
+int btintel_reset_debug_features(struct hci_dev *hdev,
+const struct intel_debug_features *features)
+{
+   u8 mask[11] = { 0x0a, 0x92, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00,
+   0x00, 0x00, 0x00 };
+   u8 trace_enable = 0x00;
+   struct sk_buff *skb;
+
+   if (!features) {
+   bt_dev_warn(hdev, "Debug features not read");
+   return -EINVAL;
+   }
+
+   if (!(features->page1[0] & 0x3f)) {
+   bt_dev_info(hdev, "Telemetry exception format not supported");
+   return 0;
+   }
+
+   bt_dev_info(hdev, "trace_enable %d mask %d", trace_enable, mask[3]);
+
+   /* Should stop the trace before writing ddc event mask. */
+   skb = __hci_cmd_sync(hdev, 0xfca1, 1, &trace_enable, HCI_INIT_TIMEOUT);
+   if (IS_ERR(skb)) {
+   bt_dev_err(hdev, "Stop tracing of link statistics events failed (%ld)",
+  PTR_ERR(skb));
+   return PTR_ERR(skb);
+   }
+   kfree_skb(skb);
+
+   skb = __hci_cmd_sync(hdev, 0xfc8b, 11, mask, HCI_INIT_TIMEOUT);
+   if (IS_ERR(skb)) {
+   bt_dev_err(hdev, "Setting Intel telemetry ddc write event mask failed (%ld)",
+  PTR_ERR(skb));
+   return PTR_ERR(skb);
+   }
+   kfree_skb(skb);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(btintel_reset_debug_features);
+
+int btintel_set_vs_dbg_evt(struct hci_dev *hdev, bool enable)
+{
+   struct intel_debug_features features;
+   int err;
+
+   bt_dev_dbg(hdev, "enable %d", enable);
+
+   /* Read the Intel supported features and if new exception formats
+* supported, need to load the additional DDC config to enable.
+*/
+   err = btintel_read_debug_features(hdev, &features);
+   if (err)
+   return err;
+
+   /* Set or reset the debug features. */
+   if (enable)
+   err = btintel_set_debug_features(hdev, &features);
+   else
+   err = btintel_reset_debug_features(hdev, &features);
+
+   return err;
+}
+EXPORT_SYMBOL_GPL(btintel_set_vs_dbg_evt);
+#endif
+
 MODULE_AUTHOR("Marcel Holtmann ");
 MODULE_DESCRIPTION("Bluetooth support for Intel devices ver " VERSION);
 MODULE_VERSION(VERSION);
diff --git a/drivers/bluetooth/btintel.h b/drivers/bluetooth/btintel.h
index d184064a5e7c..0b35b0248b91 100644
--- a/drivers/bluetooth/btintel.h
+++ b/drivers/bluetooth/btintel.h
@@ -171,10 +171,15 @@ int btintel_download_firmware_newgen(struct hci_dev *hdev,
 u32 *boot_param, u8 hw_variant,
  

Re: [PATCH net v3] net: sched: fix packet stuck problem for lockless qdisc

2021-04-13 Thread Yunsheng Lin
On 2021/4/13 15:12, Hillf Danton wrote:
> On Tue, 13 Apr 2021 11:34:27 Yunsheng Lin wrote:
>> On 2021/4/13 11:26, Hillf Danton wrote:
>>> On Tue, 13 Apr 2021 10:56:42 Yunsheng Lin wrote:
 On 2021/4/13 10:21, Hillf Danton wrote:
> On Mon, 12 Apr 2021 20:00:43  Yunsheng Lin wrote:
>>
>> Yes, the below patch seems to fix the data race described in
>> the commit log.
>> Then what is the difference between my patch and your patch below:)
>
> Hehe, this is one of the tough questions over a bunch of weeks.
>
> If a seqcount can detect the race between skb enqueue and dequeue then we
> can't see any excuse for not rolling back to the point without NOLOCK.

 I am not sure I understood what you meant above.

 As my understanding, the below patch is essentially the same as
 your previous patch, the only difference I see is it uses qdisc->pad
 instead of __QDISC_STATE_NEED_RESCHEDULE.

 So instead of proposing another patch, it would be better if you
 comment on my patch, and make improvement upon that.

>>> Happy to do that after you show how it helps revert NOLOCK.
>>
>> Actually I am not going to revert NOLOCK, but add optimization
>> to it if the patch fixes the packet stuck problem.
>>
> Fix is not optimization, right?

For this patch, it is a fix.
In case you missed it, I do have a couple of ideas to optimize the
lockless qdisc:

1. RFC patch to add lockless qdisc bypass optimization:

https://patchwork.kernel.org/project/netdevbpf/patch/1616404156-11772-1-git-send-email-linyunsh...@huawei.com/

2. implement lockless enqueuing for lockless qdisc using the idea
   from Jason and Toke. And it has a noticeable performance increase with
   1-4 threads running using the below prototype based on ptr_ring.

static inline int __ptr_ring_multi_produce(struct ptr_ring *r, void *ptr)
{
	int producer, next_producer;

	do {
		producer = READ_ONCE(r->producer);
		if (unlikely(!r->size) || r->queue[producer])
			return -ENOSPC;
		next_producer = producer + 1;
		if (unlikely(next_producer >= r->size))
			next_producer = 0;
	} while (cmpxchg_relaxed(&r->producer, producer, next_producer) !=
		 producer);

	/* Make sure the pointer we are storing points to a valid data. */
	/* Pairs with the dependency ordering in __ptr_ring_consume. */
	smp_wmb();

	WRITE_ONCE(r->queue[producer], ptr);
	return 0;
}

3. Maybe it is possible to remove the netif_tx_lock for lockless qdisc
   too, because dev_hard_start_xmit is also in the protection of
   qdisc_run_begin()/qdisc_run_end()(if there is only one qdisc using
   a netdev queue, which is true for pfifo_fast, I believe).

4. Remove the qdisc->running seqcount operation for lockless qdisc, which
   is mainly used to do heuristic locking on q->busylock for locked qdisc.
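[Editorial aside, not from the original mail: the multi-producer prototype
above can be modeled in self-contained userspace C11, with atomics standing
in for READ_ONCE/cmpxchg_relaxed and a release store standing in for the
smp_wmb() before publishing the slot. All names, the fixed ring size, and
the single-consumer side are made up for illustration; this is not the
kernel's ptr_ring.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8

struct ring {
	_Atomic int producer;
	int consumer;			/* single consumer, no atomics needed */
	void *_Atomic queue[RING_SIZE];	/* NULL means the slot is free */
};

/* Multi-producer enqueue: claim a slot index with CAS, then publish. */
static int ring_multi_produce(struct ring *r, void *ptr)
{
	int producer, next;

	do {
		producer = atomic_load_explicit(&r->producer,
						memory_order_relaxed);
		if (atomic_load_explicit(&r->queue[producer],
					 memory_order_relaxed))
			return -1;	/* ring full */
		next = (producer + 1) % RING_SIZE;
	} while (!atomic_compare_exchange_weak_explicit(&r->producer,
			&producer, next,
			memory_order_relaxed, memory_order_relaxed));

	/* Release pairs with the consumer's acquire; this plays the role
	 * of the smp_wmb() before WRITE_ONCE() in the prototype. */
	atomic_store_explicit(&r->queue[producer], ptr, memory_order_release);
	return 0;
}

/* Single-consumer dequeue. */
static void *ring_consume(struct ring *r)
{
	void *ptr = atomic_load_explicit(&r->queue[r->consumer],
					 memory_order_acquire);

	if (ptr) {
		atomic_store_explicit(&r->queue[r->consumer], NULL,
				      memory_order_relaxed);
		r->consumer = (r->consumer + 1) % RING_SIZE;
	}
	return ptr;
}
```

Note the same property as the prototype: a producer that wins the CAS but
has not yet stored its pointer leaves a transient gap the consumer sees as
"empty", which is benign but means the ring is not strictly FIFO-complete
under contention.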

> 
>> Is there any reason why you want to revert it?
>>
> I think you know Jiri's plan and it would be nice to wait a couple of
> months for it to complete.

I am not sure I am aware of Jiri's plan.
Is there any link referring to the plan?

> 
> .
> 



[syzbot] general protection fault in gadget_setup

2021-04-13 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:0f4498ce Merge tag 'for-5.12/dm-fixes-2' of git://git.kern..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=124adbf6d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=daeff30c2474a60f
dashboard link: https://syzkaller.appspot.com/bug?extid=eb4674092e6cc8d9e0bd
userspace arch: i386

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+eb4674092e6cc8d9e...@syzkaller.appspotmail.com

general protection fault, probably for non-canonical address 
0xdc04:  [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0020-0x0027]
CPU: 1 PID: 5016 Comm: systemd-udevd Not tainted 5.12.0-rc4-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
RIP: 0010:__lock_acquire+0xcfe/0x54c0 kernel/locking/lockdep.c:4770
Code: 09 0e 41 bf 01 00 00 00 0f 86 8c 00 00 00 89 05 48 69 09 0e e9 81 00 00 
00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 5b 31 
00 00 49 81 3e c0 13 38 8f 0f 84 d0 f3 ff
RSP: :c9ce77d8 EFLAGS: 00010002
RAX: dc00 RBX:  RCX: 
RDX: 0004 RSI: 19200019cf0c RDI: 0020
RBP:  R08: 0001 R09: 0001
R10: 0001 R11: 0006 R12: 88801295b880
R13:  R14: 0020 R15: 
FS:  7fcd745f98c0() GS:88802cb0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7ffe279f7d87 CR3: 1c7d4000 CR4: 00150ee0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 lock_acquire kernel/locking/lockdep.c:5510 [inline]
 lock_acquire+0x1ab/0x740 kernel/locking/lockdep.c:5475
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:159
 gadget_setup+0x4e/0x510 drivers/usb/gadget/legacy/raw_gadget.c:327
 dummy_timer+0x1615/0x32a0 drivers/usb/gadget/udc/dummy_hcd.c:1903
 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1431
 expire_timers kernel/time/timer.c:1476 [inline]
 __run_timers.part.0+0x67c/0xa50 kernel/time/timer.c:1745
 __run_timers kernel/time/timer.c:1726 [inline]
 run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1758
 __do_softirq+0x29b/0x9f6 kernel/softirq.c:345
 invoke_softirq kernel/softirq.c:221 [inline]
 __irq_exit_rcu kernel/softirq.c:422 [inline]
 irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
 sysvec_apic_timer_interrupt+0x45/0xc0 arch/x86/kernel/apic/apic.c:1100
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
RIP: 0033:0x560cfc4a02ed
Code: 4c 39 c1 48 89 42 18 4c 89 52 08 4c 89 5a 10 48 89 1a 0f 87 7b ff ff ff 
48 89 f8 48 f7 d0 48 01 c8 48 83 e0 f8 48 8d 7c 07 08 <48> 8d 0d 34 d9 02 00 48 
63 04 b1 48 01 c8 ff e0 0f 1f 00 48 8d 0d
RSP: 002b:7ffe279f9dd0 EFLAGS: 0246
RAX:  RBX: 560cfcd88e40 RCX: 560cfcd72af0
RDX: 7ffe279f9de0 RSI: 0007 RDI: 560cfcd72af0
RBP: 7ffe279f9e70 R08:  R09: 0020
R10: 560cfcd72af7 R11: 560cfcd73530 R12: 560cfcd72af0
R13:  R14: 560cfcd72b10 R15: 0001
Modules linked in:
---[ end trace ab0f6632fdd289cf ]---
RIP: 0010:__lock_acquire+0xcfe/0x54c0 kernel/locking/lockdep.c:4770
Code: 09 0e 41 bf 01 00 00 00 0f 86 8c 00 00 00 89 05 48 69 09 0e e9 81 00 00 
00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 5b 31 
00 00 49 81 3e c0 13 38 8f 0f 84 d0 f3 ff
RSP: :c9ce77d8 EFLAGS: 00010002
RAX: dc00 RBX:  RCX: 
RDX: 0004 RSI: 19200019cf0c RDI: 0020
RBP:  R08: 0001 R09: 0001
R10: 0001 R11: 0006 R12: 88801295b880
R13:  R14: 0020 R15: 
FS:  7fcd745f98c0() GS:88802cb0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7ffe279f7d87 CR3: 1c7d4000 CR4: 00150ee0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


Re: [PATCH v2 1/3] context_tracking: Split guest_enter/exit_irqoff

2021-04-13 Thread Christian Borntraeger




On 13.04.21 09:52, Wanpeng Li wrote:

Or did I miss anything.


I mean the if (!context_tracking_enabled_this_cpu()) part in the
function context_guest_enter_irqoff() ifdef
CONFIG_VIRT_CPU_ACCOUNTING_GEN. :)


Ah I missed that. Thanks.


Re: [PATCH 4.19 00/66] 4.19.187-rc1 review

2021-04-13 Thread Pavel Machek
Hi!

> This is the start of the stable review cycle for the 4.19.187 release.
> There are 66 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.

CIP testing did not find any problems here:

https://gitlab.com/cip-project/cip-testing/linux-stable-rc-ci/-/tree/linux-4.19.y

Tested-by: Pavel Machek (CIP) 

Best regards,
Pavel
-- 
DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany


signature.asc
Description: Digital signature


Re: linux-next: manual merge of the akpm-current tree with the arm64 tree

2021-04-13 Thread Catalin Marinas
On Tue, Apr 13, 2021 at 06:59:36PM +1000, Stephen Rothwell wrote:
> diff --cc lib/test_kasan.c
> index 785e724ce0d8,bf9225002a7e..
> --- a/lib/test_kasan.c
> +++ b/lib/test_kasan.c
> @@@ -78,33 -83,30 +83,35 @@@ static void kasan_test_exit(struct kuni
>* fields, it can reorder or optimize away the accesses to those fields.
>* Use READ/WRITE_ONCE() for the accesses and compiler barriers around the
>* expression to prevent that.
> +  *
> +  * In between KUNIT_EXPECT_KASAN_FAIL checks, fail_data.report_found is 
> kept as
> +  * false. This allows detecting KASAN reports that happen outside of the 
> checks
> +  * by asserting !fail_data.report_found at the start of 
> KUNIT_EXPECT_KASAN_FAIL
> +  * and in kasan_test_exit.
>*/
> - #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do {  \
> - if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
> - !kasan_async_mode_enabled())\
> - migrate_disable();  \
> - WRITE_ONCE(fail_data.report_expected, true);\
> - WRITE_ONCE(fail_data.report_found, false);  \
> - kunit_add_named_resource(test,  \
> - NULL,   \
> - NULL,   \
> - ,  \
> - "kasan_data", _data);  \
> - barrier();  \
> - expression; \
> - barrier();  \
> - if (kasan_async_mode_enabled()) \
> - kasan_force_async_fault();  \
> - barrier();  \
> - KUNIT_EXPECT_EQ(test,   \
> - READ_ONCE(fail_data.report_expected),   \
> - READ_ONCE(fail_data.report_found)); \
> - if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
> - !kasan_async_mode_enabled()) {  \
> - if (READ_ONCE(fail_data.report_found))  \
> - kasan_enable_tagging_sync();\
> - migrate_enable();   \
> - }   \
> + #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do {  
> \
>  -if (IS_ENABLED(CONFIG_KASAN_HW_TAGS))   \
> ++if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
> ++!kasan_async_mode_enabled())\
> + migrate_disable();  \
> + KUNIT_EXPECT_FALSE(test, READ_ONCE(fail_data.report_found));\
> + WRITE_ONCE(fail_data.report_expected, true);\
> + barrier();  \
> + expression; \
> + barrier();  \
> ++if (kasan_async_mode_enabled()) \
> ++kasan_force_async_fault();  \
> ++barrier();  \
> + KUNIT_EXPECT_EQ(test,   \
> + READ_ONCE(fail_data.report_expected),   \
> + READ_ONCE(fail_data.report_found)); \
>  -if (IS_ENABLED(CONFIG_KASAN_HW_TAGS)) { \
> ++if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
> ++!kasan_async_mode_enabled()) {  \
> + if (READ_ONCE(fail_data.report_found))  \
>  -kasan_enable_tagging(); \
> ++kasan_enable_tagging_sync();\
> + migrate_enable();   \
> + }   \
> + WRITE_ONCE(fail_data.report_found, false);  \
> + WRITE_ONCE(fail_data.report_expected, false);   \
>   } while (0)
>   
>   #define KASAN_TEST_NEEDS_CONFIG_ON(test, config) do {   
> \

Thanks Stephen. The resolution looks correct.

Andrew, if you'd rather I dropped the MTE async mode support from the
arm64 tree please let me know. Thanks.

https://lore.kernel.org/r/20210315132019.33202-1-vincenzo.frasc...@arm.com/

-- 
Catalin


[syzbot] KASAN: null-ptr-deref Write in rhashtable_free_and_destroy (2)

2021-04-13 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:d93a0d43 Merge tag 'block-5.12-2021-04-02' of git://git.ke..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=12d81cfcd0
kernel config:  https://syzkaller.appspot.com/x/.config?x=71a75beb62b62a34
dashboard link: https://syzkaller.appspot.com/bug?extid=860268315ba86ea6b96b
compiler:   Debian clang version 11.0.1-2

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+860268315ba86ea6b...@syzkaller.appspotmail.com

==
BUG: KASAN: null-ptr-deref in instrument_atomic_read_write 
include/linux/instrumented.h:101 [inline]
BUG: KASAN: null-ptr-deref in test_and_set_bit 
include/asm-generic/bitops/instrumented-atomic.h:70 [inline]
BUG: KASAN: null-ptr-deref in try_to_grab_pending+0xee/0xa50 
kernel/workqueue.c:1257
Write of size 8 at addr 0088 by task kworker/0:3/4787

CPU: 0 PID: 4787 Comm: kworker/0:3 Not tainted 5.12.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Workqueue: events cfg80211_destroy_iface_wk
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x176/0x24e lib/dump_stack.c:120
 __kasan_report mm/kasan/report.c:403 [inline]
 kasan_report+0x152/0x200 mm/kasan/report.c:416
 check_region_inline mm/kasan/generic.c:135 [inline]
 kasan_check_range+0x2b5/0x2f0 mm/kasan/generic.c:186
 instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
 test_and_set_bit include/asm-generic/bitops/instrumented-atomic.h:70 [inline]
 try_to_grab_pending+0xee/0xa50 kernel/workqueue.c:1257
 __cancel_work_timer+0x81/0x5b0 kernel/workqueue.c:3098
 rhashtable_free_and_destroy+0x25/0x8b0 lib/rhashtable.c:1137
 mesh_table_free net/mac80211/mesh_pathtbl.c:70 [inline]
 mesh_pathtbl_unregister+0x4b/0xa0 net/mac80211/mesh_pathtbl.c:812
 unregister_netdevice_many+0x12ea/0x18e0 net/core/dev.c:10951
 unregister_netdevice_queue+0x2a9/0x300 net/core/dev.c:10868
 unregister_netdevice include/linux/netdevice.h:2884 [inline]
 _cfg80211_unregister_wdev+0x17b/0x5b0 net/wireless/core.c:1127
 ieee80211_if_remove+0x1cc/0x250 net/mac80211/iface.c:2020
 ieee80211_del_iface+0x12/0x20 net/mac80211/cfg.c:144
 rdev_del_virtual_intf net/wireless/rdev-ops.h:57 [inline]
 cfg80211_destroy_ifaces+0x182/0x250 net/wireless/core.c:341
 cfg80211_destroy_iface_wk+0x30/0x40 net/wireless/core.c:354
 process_one_work+0x789/0xfd0 kernel/workqueue.c:2275
 worker_thread+0xac1/0x1300 kernel/workqueue.c:2421
 kthread+0x39a/0x3c0 kernel/kthread.c:292
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
==


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


/usr/bin/ld: ll_temac_main.c:undefined reference to `devm_of_iomap'

2021-04-13 Thread kernel test robot
Hi Andre,

FYI, the error/warning still remains.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
master
head:   89698becf06d341a700913c3d89ce2a914af69a2
commit: e8b6c54f6d57822e228027d41a1edb317034a08c net: xilinx: temac: Relax 
Kconfig dependencies
date:   1 year, 1 month ago
config: um-randconfig-r026-20210413 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e8b6c54f6d57822e228027d41a1edb317034a08c
git remote add linus 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git fetch --no-tags linus master
git checkout e8b6c54f6d57822e228027d41a1edb317034a08c
# save the attached .config to linux build tree
make W=1 ARCH=um 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   /usr/bin/ld: drivers/net/ethernet/xilinx/ll_temac_main.o: in function 
`temac_probe':
   ll_temac_main.c:(.text+0xe9d): undefined reference to `devm_ioremap'
>> /usr/bin/ld: ll_temac_main.c:(.text+0xf90): undefined reference to 
>> `devm_of_iomap'
   /usr/bin/ld: ll_temac_main.c:(.text+0x1159): undefined reference to 
`devm_ioremap'
   /usr/bin/ld: drivers/misc/altera-stapl/altera-lpt.o:(.altinstructions+0x8): 
undefined reference to `X86_FEATURE_XMM2'
   /usr/bin/ld: drivers/misc/altera-stapl/altera-lpt.o:(.altinstructions+0x15): 
undefined reference to `X86_FEATURE_XMM'
   /usr/bin/ld: drivers/misc/altera-stapl/altera-lpt.o:(.altinstructions+0x22): 
undefined reference to `X86_FEATURE_XMM'
   /usr/bin/ld: drivers/misc/altera-stapl/altera-lpt.o:(.altinstructions+0x2f): 
undefined reference to `X86_FEATURE_XMM2'
   /usr/bin/ld: drivers/misc/altera-stapl/altera-lpt.o:(.altinstructions+0x3c): 
undefined reference to `X86_FEATURE_XMM'
   /usr/bin/ld: drivers/misc/altera-stapl/altera-lpt.o:(.altinstructions+0x49): 
undefined reference to `X86_FEATURE_XMM'
   collect2: error: ld returned 1 exit status

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


Re: [PATCH][next] scsi: aacraid: Replace one-element array with flexible-array member

2021-04-13 Thread Gustavo A. R. Silva
Hi Martin,

On 4/12/21 23:52, Martin K. Petersen wrote:

> Silencing analyzer warnings shouldn't be done at the expense of human
> readers. If it is imperative to switch to flex_array_size() to quiesce
> checker warnings, please add a comment in the code explaining that the
> size evaluates to nseg_new-1 sge_ieee1212 structs.

Done:
https://lore.kernel.org/lkml/20210413054032.GA276102@embeddedor/

Thanks!
--
Gustavo


[PATCH] irq: Fix missing IRQF_ONESHOT as only threaded handler

2021-04-13 Thread zhuguangqing83
From: Guangqing Zhu 

Coccinelle noticed:
  kernel/irq/manage.c:2199:8-28: ERROR: Threaded IRQ with no primary
handler requested without IRQF_ONESHOT.

Signed-off-by: Guangqing Zhu 
---
 kernel/irq/manage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 4c14356543d9..222816750048 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2197,7 +2197,7 @@ int request_any_context_irq(unsigned int irq, irq_handler_t handler,
 
if (irq_settings_is_nested_thread(desc)) {
ret = request_threaded_irq(irq, NULL, handler,
-  flags, name, dev_id);
+  flags | IRQF_ONESHOT, name, dev_id);
return !ret ? IRQC_IS_NESTED : ret;
}
 
-- 
2.17.1
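[Editorial aside, not part of the patch: the Coccinelle rule fires because
__setup_irq() rejects a threaded request that has no primary handler and no
IRQF_ONESHOT — with a level-triggered line, the interrupt would re-fire
endlessly before the handler thread ever runs. A simplified, self-contained
userspace model of that check follows; the function name is made up, and the
real kernel check has additional cases this model omits.]

```c
#include <assert.h>
#include <stddef.h>

#define IRQF_ONESHOT	0x00002000	/* same value as the kernel flag */
#define EINVAL		22

typedef int (*irq_handler_t)(int irq, void *dev);

/*
 * Simplified model of the sanity check in kernel/irq/manage.c:
 * a threaded request with handler == NULL must carry IRQF_ONESHOT,
 * otherwise __setup_irq() fails with -EINVAL ("Threaded irq requested
 * with handler=NULL and !ONESHOT").
 */
static int setup_irq_check(irq_handler_t primary, irq_handler_t thread_fn,
			   unsigned long flags)
{
	if (!primary && thread_fn && !(flags & IRQF_ONESHOT))
		return -EINVAL;
	return 0;
}
```

This is why the fix above passes `flags | IRQF_ONESHOT` when
request_any_context_irq() falls back to request_threaded_irq() with a NULL
primary handler.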



[PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Kemeng Shi
I'm using AEP with the dax_kmem driver, and AEP is exported as a NUMA node
in my system. I move cold pages from the DRAM node to the AEP node with
the move_pages system call. With the old "rep movsq", it costs 2030ms to
move 1GB of pages. With "movnti", it only costs about 890ms to move 1GB of
pages. I also tested moving 1GB of pages from the AEP node to the DRAM
node, but the result is unexpected: "rep movsq" costs about 372ms while
"movnti" costs about 477ms. As the x86 manuals note, "movnti" can avoid
"polluting the caches" in this situation. I don't know if this is a
general result or just happens on my machine. Hardware information is as
follows:
CPU:
Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
DRAM:
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 64 GB
Form Factor: DIMM
Set: None
Locator: DIMM130 J40
Bank Locator: _Node1_Channel3_Dimm0
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03B71EB0
Asset Tag: 1950
Part Number: M393A8G40MB2-CVF
Rank: 2
Configured Memory Speed: 2666 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: 
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 64 GB
Cache Size: None
Logical Size: None
AEP:
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM131 J41
Bank Locator: _Node1_Channel3_Dimm1
Type: Logical non-volatile device
Type Detail: Synchronous Non-Volatile LRDIMM
Speed: 2666 MT/s
Manufacturer: Intel
Serial Number: 6803
Asset Tag: 1949
Part Number: NMA1XXD128GPS
Rank: 1
Configured Memory Speed: 2666 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: Intel persistent memory
Memory Operating Mode Capability: Volatile memory
Byte-accessible persistent memory
Firmware Version: 5355
Module Manufacturer ID: Bank 1, Hex 0x89
Module Product ID: 0x0556
Memory Subsystem Controller Manufacturer ID: Bank 1, Hex 0x89
Memory Subsystem Controller Product ID: 0x097A
Non-Volatile Size: 126 GB
Volatile Size: None
Cache Size: None
Logical Size: None
Memory dimm topology:
AEP
 |
DRAM    DRAM    DRAM
 |       |       |
 |-------|-------|
        CPU
 |-------|-------|
 |       |       |
DRAM    DRAM    DRAM

Signed-off-by: Kemeng Shi 
---
 arch/x86/lib/copy_page_64.S | 73 -
 1 file changed, 72 insertions(+), 1 deletion(-)
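[Editorial aside, not part of the patch: the effect of movnti can be
sketched in userspace C via the SSE2 `_mm_stream_si64` intrinsic, which
compiles to movnti on x86-64. The function name and plain loop below are
made up for illustration; the actual patch is hand-written assembly with
prefetch and register blocking.]

```c
#include <assert.h>
#include <emmintrin.h>	/* _mm_stream_si64, _mm_sfence (SSE2, x86-64 only) */
#include <stddef.h>

#define PAGE_SIZE 4096

/*
 * Non-temporal page copy sketch: each _mm_stream_si64 is a movnti,
 * a store that bypasses the cache hierarchy -- useful when the
 * destination (e.g. cold pages headed for AEP) will not be read
 * back soon, so caching the stores would only evict hot data.
 */
static void copy_page_nt_c(void *dst, const void *src)
{
	long long *d = dst;
	const long long *s = src;
	size_t i;

	for (i = 0; i < PAGE_SIZE / sizeof(long long); i++)
		_mm_stream_si64(d + i, s[i]);
	_mm_sfence();	/* order the NT stores before subsequent use */
}
```

The sfence at the end matters: non-temporal stores are weakly ordered, so
they must be fenced before the copied page is handed to another observer.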

diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 2402d4c489d2..69389b4aeeed 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -14,7 +14,8 @@
  */
ALIGN
 SYM_FUNC_START(copy_page)
-   ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
+   ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
+  "jmp copy_page_nt", X86_FEATURE_XMM2
	movl	$4096/8, %ecx
rep movsq
ret
@@ -87,3 +88,73 @@ SYM_FUNC_START_LOCAL(copy_page_regs)
	addq	$2*8, %rsp
ret
 SYM_FUNC_END(copy_page_regs)
+
+SYM_FUNC_START_LOCAL(copy_page_nt)
+	subq	$2*8, %rsp
+	movq	%rbx, (%rsp)
+	movq	%r12, 1*8(%rsp)
+
+	movl	$(4096/64)-5, %ecx
+	.p2align 4
+.LoopNT64:
+	decl	%ecx
+
+	movq	0x8*0(%rsi), %rax
+	movq	0x8*1(%rsi), %rbx
+	movq	0x8*2(%rsi), %rdx
+	movq	0x8*3(%rsi), %r8
+	movq	0x8*4(%rsi), %r9
+	movq	0x8*5(%rsi), %r10
+	movq	0x8*6(%rsi), %r11
+	movq	0x8*7(%rsi), %r12
+
+	prefetcht0 5*64(%rsi)
+
+	movnti	%rax, 0x8*0(%rdi)
+	movnti	%rbx, 0x8*1(%rdi)
+	movnti	%rdx, 0x8*2(%rdi)
+	movnti	%r8,  0x8*3(%rdi)
+	movnti	%r9,  0x8*4(%rdi)
+	movnti	%r10, 0x8*5(%rdi)
+	movnti	%r11, 0x8*6(%rdi)
+	movnti	%r12, 0x8*7(%rdi)
+
+	leaq	64(%rdi), %rdi
+	leaq	64(%rsi), %rsi
+	jnz	.LoopNT64
+
+	movl	$5, %ecx
+	.p2align 4
+.LoopNT2:
+	decl	%ecx
+
+	movq	0x8*0(%rsi), %rax
+	movq	0x8*1(%rsi), %rbx
+	movq	0x8*2(%rsi), %rdx
+	movq	0x8*3(%rsi), 

Re: [PATCH 6/7] i915: Convert to verify_page_range()

2021-04-13 Thread Peter Zijlstra
On Mon, Apr 12, 2021 at 01:08:38PM -0700, Kees Cook wrote:
> On Mon, Apr 12, 2021 at 10:00:18AM +0200, Peter Zijlstra wrote:
> > @@ -1249,14 +1249,14 @@ static int check_absent_pte(pte_t *pte,
> >  
> >  static int check_present(unsigned long addr, unsigned long len)
> >  {
> > -   return apply_to_page_range(current->mm, addr, len,
> > -  check_present_pte, (void *)addr);
> > +   return verify_page_range(current->mm, addr, len,
> > +check_present_pte, (void *)addr);
> 
> For example, switch to returning bad addr through verify_page_range(),
> or have a by-reference value, etc:
> 
>   unsigned long failed;
> 
>   failed = verify_page_range(current->mm, addr, len, check_present_pte);
>   if (failed) {
>   pr_err("missing PTE:%lx\n",
>  (addr - failed) >> PAGE_SHIFT);

OK, lemme try that.
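[Editorial aside, not from the thread: one hypothetical userspace shape of
the suggested interface — return 0 on success, or the first address whose
check failed, so the caller can report which page broke. All names and the
fixed 4 KiB stride are made up; this is not the actual
apply_to_page_range()/verify_page_range() API.]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

typedef int (*page_check_fn)(unsigned long addr, void *data);

/*
 * Walk [addr, addr + len) page by page, calling fn on each page
 * address.  Returns 0 if every check passed, or the first address
 * at which fn reported a failure -- e.g. a missing PTE.
 */
static unsigned long verify_range(unsigned long addr, unsigned long len,
				  page_check_fn fn, void *data)
{
	unsigned long a;

	for (a = addr; a < addr + len; a += PAGE_SIZE)
		if (fn(a, data))
			return a;	/* the failing address */
	return 0;
}
```

If address 0 could ever legitimately start the range, the by-reference
out-parameter variant Kees also mentions avoids the ambiguity of returning 0.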


Re: [Outreachy kernel] Subject: [PATCH v2] staging: media: meson: vdec: declare u32 as static const appropriately

2021-04-13 Thread Julia Lawall



On Tue, 13 Apr 2021, Mitali Borkar wrote:

> Declared 32 bit unsigned int as static constant inside a function
> appropriately.

I don't think that the description matches what is done.  Perhaps all the
meaning is intended to be in the word "appropriately", but that is not
very clear.  The message makes it look like static const is the new part,
but it is already there.

julia

>
> Reported-by: kernel test robot 
> Signed-off-by: Mitali Borkar 
> ---
>
> Changes from v1:- Rectified the mistake by declaring u32 as static const
> properly.
>
>  drivers/staging/media/meson/vdec/codec_h264.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/staging/media/meson/vdec/codec_h264.c 
> b/drivers/staging/media/meson/vdec/codec_h264.c
> index ea86e9e1c447..80141b89a9f6 100644
> --- a/drivers/staging/media/meson/vdec/codec_h264.c
> +++ b/drivers/staging/media/meson/vdec/codec_h264.c
> @@ -287,8 +287,8 @@ static void codec_h264_resume(struct amvdec_session *sess)
>   struct amvdec_core *core = sess->core;
>   struct codec_h264 *h264 = sess->priv;
>   u32 mb_width, mb_height, mb_total;
> - static const u32[] canvas3 = { ANCO_CANVAS_ADDR, 0 };
> - static const u32[] canvas4 = { 24, 0 };
> + static const u32 canvas3[] = { ANCO_CANVAS_ADDR, 0 };
> + static const u32 canvas4[] = { 24, 0 };
>
>   amvdec_set_canvases(sess, canvas3, canvas4);
>
> --
> 2.30.2
>
> --
> You received this message because you are subscribed to the Google Groups 
> "outreachy-kernel" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to outreachy-kernel+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/outreachy-kernel/YHU56OM%2BC2zY34VP%40kali.
>


Re: [PATCH 0/7] Restructure the rpmsg char and introduce the rpmsg-raw channel

2021-04-13 Thread Arnaud POULIQUEN
Hello Mathieu,

On 4/12/21 10:02 PM, Mathieu Poirier wrote:
> On Tue, Mar 23, 2021 at 01:27:30PM +0100, Arnaud Pouliquen wrote:
>> This series is the second step in the division of the series [1]: 
>> "Introducing a Generic IOCTL Interface for RPMsg Channel Management".
>>
>> The purpose of this patchset is to:
>> - split the control code related to the control
>>   and the endpoint. 
>> - define the rpmsg-raw channel, associated with the rpmsg char device to
>>   allow it to be instantiated using a name service announcement.
>> 
>> An important point to keep in mind for this patchset is that the concept of
>> channel is associated with a default endpoint. To facilitate communication
>> with the remote side, this default endpoint must have a fixed address.
>>
>> Consequently, for this series, I made a design choice to fix the endpoint
>> on the "rpmsg-raw" channel probe, and not allow to create/destroy an endpoint
>> on FS open/close.
>>
>> This is only applicable for channels probed by the rpmsg bus. The behavior,
>> using the RPMSG_CREATE_EPT_IOCTL and RPMSG_DESTROY_EPT_IOCTL controls, is
>> preserved.
>>   
>> The next steps should be to correct this:
>> Introduce the IOCTLs RPMSG_CREATE_DEV_IOCTL and RPMSG_DESTROY_DEV_IOCTL
>> to instantiate the rpmsg devices
>>
>> [1]: 
>> https://patchwork.kernel.org/project/linux-remoteproc/list/?series=435523
>>
>> Arnaud Pouliquen (7):
>>   rpmsg: char: Export eptdev create and destroy functions
>>   rpmsg: Move the rpmsg control device from rpmsg_char to rpmsg_ctrl
>>   rpmsg: Update rpmsg_chrdev_register_device function
>>   rpmsg: char: Introduce __rpmsg_chrdev_create_eptdev function
>>   rpmsg: char: Introduce a rpmsg driver for the rpmsg char device
>>   rpmsg: char: No dynamic endpoint management for the default one
>>   rpmsg: char: Return error if user try to destroy a default endpoint.
>>
> 
> I am done reviewing this set.

Thanks for the review! I will integrate all your remarks in my next revision.
Since I haven't seen any major problems, I hope to send it today or tomorrow.

Regards,
Arnaud

> 
> Thanks,
> Mathieu
>  
>>  drivers/rpmsg/Kconfig |   9 ++
>>  drivers/rpmsg/Makefile|   1 +
>>  drivers/rpmsg/qcom_glink_native.c |   2 +-
>>  drivers/rpmsg/qcom_smd.c  |   2 +-
>>  drivers/rpmsg/rpmsg_char.c| 221 +---
>>  drivers/rpmsg/rpmsg_char.h|  50 +++
>>  drivers/rpmsg/rpmsg_ctrl.c| 233 ++
>>  drivers/rpmsg/rpmsg_internal.h|   8 +-
>>  drivers/rpmsg/virtio_rpmsg_bus.c  |   2 +-
>>  9 files changed, 368 insertions(+), 160 deletions(-)
>>  create mode 100644 drivers/rpmsg/rpmsg_char.h
>>  create mode 100644 drivers/rpmsg/rpmsg_ctrl.c
>>
>> -- 
>> 2.17.1
>>


Re: [PATCH v2 1/2] perf/core: Share an event with multiple cgroups

2021-04-13 Thread kernel test robot
Hi Namhyung,

I love your patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on tip/master linux/master linus/master v5.12-rc7 
next-20210412]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Namhyung-Kim/perf-core-Sharing-events-with-multiple-cgroups/20210413-124251
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
cface0326a6c2ae5c8f47bd466f07624b3e348a7
config: arm64-randconfig-r026-20210413 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 
9829f5e6b1bca9b61efc629770d28bb9014dec45)
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm64 cross compiling tool for clang build
# apt-get install binutils-aarch64-linux-gnu
# 
https://github.com/0day-ci/linux/commit/c604a61fb3cfd58be50992c8284b13e598312794
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Namhyung-Kim/perf-core-Sharing-events-with-multiple-cgroups/20210413-124251
git checkout c604a61fb3cfd58be50992c8284b13e598312794
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

>> kernel/events/core.c:3891:32: error: use of undeclared identifier 
>> 'cgroup_ctx_list'; did you mean 'cgroup_exit'?
   if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
 ^~~
 cgroup_exit
   include/linux/percpu-defs.h:252:39: note: expanded from macro 'this_cpu_ptr'
   #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
 ^
   include/linux/percpu-defs.h:241:20: note: expanded from macro 'raw_cpu_ptr'
   __verify_pcpu_ptr(ptr); \
 ^
   include/linux/percpu-defs.h:219:47: note: expanded from macro 
'__verify_pcpu_ptr'
   const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;\
^
   include/linux/cgroup.h:130:6: note: 'cgroup_exit' declared here
   void cgroup_exit(struct task_struct *p);
^
>> kernel/events/core.c:3891:32: error: use of undeclared identifier 
>> 'cgroup_ctx_list'; did you mean 'cgroup_exit'?
   if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
 ^~~
 cgroup_exit
   include/linux/percpu-defs.h:252:39: note: expanded from macro 'this_cpu_ptr'
   #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
 ^
   include/linux/percpu-defs.h:242:19: note: expanded from macro 'raw_cpu_ptr'
   arch_raw_cpu_ptr(ptr);  \
^
   include/asm-generic/percpu.h:44:48: note: expanded from macro 
'arch_raw_cpu_ptr'
   #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
  ^
   include/linux/percpu-defs.h:231:23: note: expanded from macro 
'SHIFT_PERCPU_PTR'
   RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
^
   include/linux/compiler.h:181:31: note: expanded from macro 'RELOC_HIDE'
__ptr = (unsigned long) (ptr); \
 ^
   include/linux/cgroup.h:130:6: note: 'cgroup_exit' declared here
   void cgroup_exit(struct task_struct *p);
^
>> kernel/events/core.c:3891:32: error: use of undeclared identifier 
>> 'cgroup_ctx_list'; did you mean 'cgroup_exit'?
   if (!list_empty(this_cpu_ptr(&cgroup_ctx_list)) &&
 ^~~
 cgroup_exit
   include/linux/percpu-defs.h:252:39: note: expanded from macro 'this_cpu_ptr'
   #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
 ^
   include/linux/percpu-defs.h:242:19: note: expanded from macro 'raw_cpu_ptr'
   arch_raw_cpu_ptr(ptr);  \
^
   include/asm-generic/percpu.h:44:48: note: expanded from macro 
'arch_raw_cpu_ptr'
   #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
  ^
   include/linux/percpu-defs.h:231:49: note: expanded from macro 

[PATCH 12/12] usb: dwc2: Add exit clock gating before removing driver

2021-04-13 Thread Artur Petrosyan
When the dwc2 core is in clock gating mode, reloading the
driver fails because the registers are not accessible in
that mode.

Add a flow that exits clock gating mode to avoid the
driver reload failure.

Signed-off-by: Artur Petrosyan 
---
 drivers/usb/dwc2/platform.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/usb/dwc2/platform.c b/drivers/usb/dwc2/platform.c
index b28b8cd45799..f8b819cfa80e 100644
--- a/drivers/usb/dwc2/platform.c
+++ b/drivers/usb/dwc2/platform.c
@@ -326,6 +326,15 @@ static int dwc2_driver_remove(struct platform_device *dev)
"exit partial_power_down failed\n");
}
 
+   /* Exit clock gating when driver is removed. */
+   if (hsotg->params.power_down == DWC2_POWER_DOWN_PARAM_NONE &&
+   hsotg->bus_suspended) {
+   if (dwc2_is_device_mode(hsotg))
+   dwc2_gadget_exit_clock_gating(hsotg, 0);
+   else
+   dwc2_host_exit_clock_gating(hsotg, 0);
+   }
+
dwc2_debugfs_exit(hsotg);
if (hsotg->hcd_enabled)
dwc2_hcd_remove(hsotg);
-- 
2.25.1



Re: [PATCH v6] platform/x86: intel_pmc_core: export platform global reset bits via etr3 sysfs file

2021-04-13 Thread Hans de Goede
Hi,

On 4/11/21 4:15 PM, Tomas Winkler wrote:
> From: Tamar Mashiah 
> 
> During the PCH (platform/board) manufacturing process a global platform
> reset has to be induced in order for the configuration changes to take
> effect upon the following platform reset. This is an internal platform
> state and is not intended to be used during regular platform resets.
> The setting is exposed via ETR3 (Extended Test Mode Register 3).
> After the manufacturing process is completed the register cannot be
> written anymore and is hardware locked.
> This setting was commonly done by accessing PMC registers via /dev/mem,
> but due to security concerns /dev/mem access is now much more restricted,
> hence the reason for exposing this setting via a dedicated sysfs
> interface.
> To prevent post-manufacturing abuse the register is protected
> by hardware locking and the file is set to read-only mode via the
> is_visible handler.
> 
> The register in MMIO space is defined for Cannon Lake and newer PCHs.
> 
> Cc: Hans de Goede 
> Cc: David E Box 
> Reviewed-by: Andy Shevchenko 
> Signed-off-by: Tamar Mashiah 
> Signed-off-by: Tomas Winkler 

Thank you for your patch, I've applied this patch to my review-hans 
branch:
https://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git/log/?h=review-hans

Note it will show up in my review-hans branch once I've pushed my
local branch there, which might take a while.

Once I've run some tests on this branch the patches there will be
added to the platform-drivers-x86/for-next branch and eventually
will be included in the pdx86 pull-request to Linus for the next
merge-window.

Regards,

Hans




> ---
> V2:
> 1. Add locking for reading the ET3 register  (Andy)
> 2. Fix few style issues (Andy)
> V3:
> 1. Resend
> v4:
> 1. Fix return statement (Andy)
> 2. Specify manufacturing process (Enrico)
> V5:
> 1. Rename sysfs file to etr3 (Hans)
> 2. Make file read only when register is locked (Hans)
> 3. Add more info to sysfs ABI documentation
> V5:
> 1. Parentheses around arithmetic in operand of '|' [-Wparentheses] (lkp)
>656 |  return reg & ETR3_CF9LOCK ? attr->mode & SYSFS_PREALLOC | 0444 : attr->mode
> 
>  .../ABI/testing/sysfs-platform-intel-pmc  |  20 
>  MAINTAINERS   |   1 +
>  drivers/platform/x86/intel_pmc_core.c | 113 ++
>  drivers/platform/x86/intel_pmc_core.h |   6 +
>  4 files changed, 140 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-platform-intel-pmc
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-intel-pmc 
> b/Documentation/ABI/testing/sysfs-platform-intel-pmc
> new file mode 100644
> index ..ef199af75ab0
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-platform-intel-pmc
> @@ -0,0 +1,20 @@
> +What:/sys/devices/platform//etr3
> +Date:Apr 2021
> +KernelVersion:   5.13
> +Contact: "Tomas Winkler" 
> +Description:
> + The file exposes "Extended Test Mode Register 3" global
> + reset bits. The bits are used during an Intel platform
> + manufacturing process to indicate that consequent reset
> + of the platform is a "global reset". This type of reset
> + is required in order for manufacturing configurations
> + to take effect.
> +
> + Display global reset setting bits for PMC.
> + * bit 31 - global reset is locked
> + * bit 20 - global reset is set
> + Writing bit 20 value to the etr3 will induce
> + a platform "global reset" upon consequent platform reset,
> + in case the register is not locked.
> + The "global reset bit" should be locked on a production
> + system and the file is in read-only mode.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7dd6b67f0f51..3e898660b5b4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9145,6 +9145,7 @@ M:  Rajneesh Bhardwaj 
>  M:   David E Box 
>  L:   platform-driver-...@vger.kernel.org
>  S:   Maintained
> +F:   Documentation/ABI/testing/sysfs-platform-intel-pmc
>  F:   drivers/platform/x86/intel_pmc_core*
>  
>  INTEL PMIC GPIO DRIVERS
> diff --git a/drivers/platform/x86/intel_pmc_core.c 
> b/drivers/platform/x86/intel_pmc_core.c
> index b5888aeb4bcf..8fb4e6d1d68d 100644
> --- a/drivers/platform/x86/intel_pmc_core.c
> +++ b/drivers/platform/x86/intel_pmc_core.c
> @@ -401,6 +401,7 @@ static const struct pmc_reg_map cnp_reg_map = {
>   .pm_cfg_offset = CNP_PMC_PM_CFG_OFFSET,
>   .pm_read_disable_bit = CNP_PMC_READ_DISABLE_BIT,
>   .ltr_ignore_max = CNP_NUM_IP_IGN_ALLOWED,
> + .etr3_offset = ETR3_OFFSET,
>  };
>  
>  static const struct pmc_reg_map icl_reg_map = {
> @@ -418,6 +419,7 @@ static const struct pmc_reg_map icl_reg_map = {
>   .pm_cfg_offset = CNP_PMC_PM_CFG_OFFSET,
>   .pm_read_disable_bit = CNP_PMC_READ_DISABLE_BIT,
>   .ltr_ignore_max = 

[PATCH 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

The two put_user() calls in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() calls with appropriate surroundings.

This removes one stac/clac pair on x86 in the fast path.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..d2689ccbb132c0fc8ec0924008771e5ee1ca855e
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq *r = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(r, sizeof(*r)))
+   goto efault;
+   unsafe_put_user(cpu_id, &r->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &r->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)
-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH v2 1/3] context_tracking: Split guest_enter/exit_irqoff

2021-04-13 Thread Christian Borntraeger




On 13.04.21 09:16, Wanpeng Li wrote:
[...]


@@ -145,6 +155,13 @@ static __always_inline void guest_exit_irqoff(void)
  }

  #else
+static __always_inline void context_guest_enter_irqoff(void)
+{
+   instrumentation_begin();
+   rcu_virt_note_context_switch(smp_processor_id());
+   instrumentation_end();
+}
+
  static __always_inline void guest_enter_irqoff(void)
  {
/*
@@ -155,10 +172,13 @@ static __always_inline void guest_enter_irqoff(void)
instrumentation_begin();
vtime_account_kernel(current);
current->flags |= PF_VCPU;
-   rcu_virt_note_context_switch(smp_processor_id());
instrumentation_end();
+
+   context_guest_enter_irqoff();


So we now do instrumentation_begin 2 times?



[PATCH v2 04/12] usb: dwc2: Add exit clock gating from wakeup interrupt

2021-04-13 Thread Artur Petrosyan
Added exit from clock gating mode when a wakeup interrupt
is detected. To exit from clock gating in device mode,
"dwc2_gadget_exit_clock_gating()" is called with rem_wakeup
parameter 0. To exit clock gating in host mode,
"dwc2_host_exit_clock_gating()" is called with rem_wakeup
parameter 1.

Signed-off-by: Artur Petrosyan 
Acked-by: Minas Harutyunyan 
---
 drivers/usb/dwc2/core_intr.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/usb/dwc2/core_intr.c b/drivers/usb/dwc2/core_intr.c
index ab7fe303c0f9..c764407e7633 100644
--- a/drivers/usb/dwc2/core_intr.c
+++ b/drivers/usb/dwc2/core_intr.c
@@ -415,17 +415,24 @@ static void dwc2_handle_wakeup_detected_intr(struct 
dwc2_hsotg *hsotg)
if (dwc2_is_device_mode(hsotg)) {
dev_dbg(hsotg->dev, "DSTS=0x%0x\n",
dwc2_readl(hsotg, DSTS));
-   if (hsotg->lx_state == DWC2_L2 && hsotg->in_ppd) {
-   u32 dctl = dwc2_readl(hsotg, DCTL);
-   /* Clear Remote Wakeup Signaling */
-   dctl &= ~DCTL_RMTWKUPSIG;
-   dwc2_writel(hsotg, dctl, DCTL);
-   ret = dwc2_exit_partial_power_down(hsotg, 1,
-  true);
-   if (ret)
-   dev_err(hsotg->dev,
-   "exit partial_power_down failed\n");
-   call_gadget(hsotg, resume);
+   if (hsotg->lx_state == DWC2_L2) {
+   if (hsotg->in_ppd) {
+   u32 dctl = dwc2_readl(hsotg, DCTL);
+   /* Clear Remote Wakeup Signaling */
+   dctl &= ~DCTL_RMTWKUPSIG;
+   dwc2_writel(hsotg, dctl, DCTL);
+   ret = dwc2_exit_partial_power_down(hsotg, 1,
+  true);
+   if (ret)
+   dev_err(hsotg->dev,
+   "exit partial_power_down 
failed\n");
+   call_gadget(hsotg, resume);
+   }
+
+   /* Exit gadget mode clock gating. */
+   if (hsotg->params.power_down ==
+   DWC2_POWER_DOWN_PARAM_NONE && hsotg->bus_suspended)
+   dwc2_gadget_exit_clock_gating(hsotg, 0);
} else {
/* Change to L0 state */
hsotg->lx_state = DWC2_L0;
@@ -440,6 +447,10 @@ static void dwc2_handle_wakeup_detected_intr(struct 
dwc2_hsotg *hsotg)
"exit partial_power_down 
failed\n");
}
 
+   if (hsotg->params.power_down ==
+   DWC2_POWER_DOWN_PARAM_NONE && hsotg->bus_suspended)
+   dwc2_host_exit_clock_gating(hsotg, 1);
+
/*
 * If we've got this quirk then the PHY is stuck upon
 * wakeup.  Assert reset.  This will propagate out and
-- 
2.25.1



[PATCH v2 03/12] usb: dwc2: Allow entering clock gating from USB_SUSPEND interrupt

2021-04-13 Thread Artur Petrosyan
If the core supports neither the hibernation nor the partial
power down power saving options, power can still be saved by
clock gating all the clocks.

- Added entering the clock gating state from the USB_SUSPEND
  interrupt.

Signed-off-by: Artur Petrosyan 
Acked-by: Minas Harutyunyan 
---
 drivers/usb/dwc2/core_intr.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/usb/dwc2/core_intr.c b/drivers/usb/dwc2/core_intr.c
index 8c0152b514be..ab7fe303c0f9 100644
--- a/drivers/usb/dwc2/core_intr.c
+++ b/drivers/usb/dwc2/core_intr.c
@@ -529,14 +529,18 @@ static void dwc2_handle_usb_suspend_intr(struct 
dwc2_hsotg *hsotg)
/* Ask phy to be suspended */
if (!IS_ERR_OR_NULL(hsotg->uphy))
usb_phy_set_suspend(hsotg->uphy, true);
-   }
-
-   if (hsotg->hw_params.hibernation) {
+   } else if (hsotg->hw_params.hibernation) {
ret = dwc2_enter_hibernation(hsotg, 0);
if (ret && ret != -ENOTSUPP)
dev_err(hsotg->dev,
"%s: enter hibernation 
failed\n",
__func__);
+   } else {
+   /*
+* If not hibernation nor partial power down 
are supported,
+* clock gating is used to save power.
+*/
+   dwc2_gadget_enter_clock_gating(hsotg);
}
 skip_power_saving:
/*
-- 
2.25.1



[PATCH v2 06/12] usb: dwc2: Add exit clock gating when port reset is asserted

2021-04-13 Thread Artur Petrosyan
Adds the clock gating exit flow for when a set port feature
reset is received in the suspended state.

Signed-off-by: Artur Petrosyan 
---
 Changes in v2:
 - None

 drivers/usb/dwc2/hcd.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index f1c24c15d185..27f030d5de54 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -3712,6 +3712,10 @@ static int dwc2_hcd_hub_control(struct dwc2_hsotg 
*hsotg, u16 typereq,
"exit partial_power_down 
failed\n");
}
 
+   if (hsotg->params.power_down ==
+   DWC2_POWER_DOWN_PARAM_NONE && hsotg->bus_suspended)
+   dwc2_host_exit_clock_gating(hsotg, 0);
+
hprt0 = dwc2_read_hprt0(hsotg);
dev_dbg(hsotg->dev,
"SetPortFeature - USB_PORT_FEAT_RESET\n");
-- 
2.25.1



[PATCH v2 07/12] usb: dwc2: Update enter clock gating when port is suspended

2021-04-13 Thread Artur Petrosyan
Updates the implementation of entering clock gating mode
when the core receives a port suspend.
Instead of setting the required register bit fields inline,
call the "dwc2_host_enter_clock_gating()" function.

Signed-off-by: Artur Petrosyan 
---
 Changes in v2:
 - None

 drivers/usb/dwc2/hcd.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index 27f030d5de54..e1225fe6c61a 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -3298,7 +3298,6 @@ static int dwc2_host_is_b_hnp_enabled(struct dwc2_hsotg 
*hsotg)
 int dwc2_port_suspend(struct dwc2_hsotg *hsotg, u16 windex)
 {
unsigned long flags;
-   u32 hprt0;
u32 pcgctl;
u32 gotgctl;
int ret = 0;
@@ -3323,22 +3322,12 @@ int dwc2_port_suspend(struct dwc2_hsotg *hsotg, u16 
windex)
break;
case DWC2_POWER_DOWN_PARAM_HIBERNATION:
case DWC2_POWER_DOWN_PARAM_NONE:
-   default:
-   hprt0 = dwc2_read_hprt0(hsotg);
-   hprt0 |= HPRT0_SUSP;
-   dwc2_writel(hsotg, hprt0, HPRT0);
-   hsotg->bus_suspended = true;
/*
-* If power_down is supported, Phy clock will be suspended
-* after registers are backuped.
+* If not hibernation nor partial power down are supported,
+* clock gating is used to save power.
 */
-   if (!hsotg->params.power_down) {
-   /* Suspend the Phy Clock */
-   pcgctl = dwc2_readl(hsotg, PCGCTL);
-   pcgctl |= PCGCTL_STOPPCLK;
-   dwc2_writel(hsotg, pcgctl, PCGCTL);
-   udelay(10);
-   }
+   dwc2_host_enter_clock_gating(hsotg);
+   break;
}
 
/* For HNP the bus must be suspended for at least 200ms */
-- 
2.25.1



[PATCH v2 05/12] usb: dwc2: Add exit clock gating from session request interrupt

2021-04-13 Thread Artur Petrosyan
Added the clock gating exit flow to the session request
interrupt handler, according to the programming guide.

Signed-off-by: Artur Petrosyan 
---
 Changes in v2:
 - None

 drivers/usb/dwc2/core_intr.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/usb/dwc2/core_intr.c b/drivers/usb/dwc2/core_intr.c
index c764407e7633..550c52c1a0c7 100644
--- a/drivers/usb/dwc2/core_intr.c
+++ b/drivers/usb/dwc2/core_intr.c
@@ -316,12 +316,19 @@ static void dwc2_handle_session_req_intr(struct 
dwc2_hsotg *hsotg)
hsotg->lx_state);
 
if (dwc2_is_device_mode(hsotg)) {
-   if (hsotg->lx_state == DWC2_L2 && hsotg->in_ppd) {
-   ret = dwc2_exit_partial_power_down(hsotg, 0,
-  true);
-   if (ret)
-   dev_err(hsotg->dev,
-   "exit power_down failed\n");
+   if (hsotg->lx_state == DWC2_L2) {
+   if (hsotg->in_ppd) {
+   ret = dwc2_exit_partial_power_down(hsotg, 0,
+  true);
+   if (ret)
+   dev_err(hsotg->dev,
+   "exit power_down failed\n");
+   }
+
+   /* Exit gadget mode clock gating. */
+   if (hsotg->params.power_down ==
+   DWC2_POWER_DOWN_PARAM_NONE && hsotg->bus_suspended)
+   dwc2_gadget_exit_clock_gating(hsotg, 0);
}
 
/*
-- 
2.25.1



[PATCH v2 08/12] usb: dwc2: Update exit clock gating when port is resumed

2021-04-13 Thread Artur Petrosyan
Updates the implementation of exiting clock gating mode
when the core receives a port resume.
Instead of setting the required register bit fields inline,
call the "dwc2_host_exit_clock_gating()" function.

Signed-off-by: Artur Petrosyan 
---
 Changes in v2:
 - None

 drivers/usb/dwc2/hcd.c | 29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index e1225fe6c61a..8a42675ab94e 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -3359,8 +3359,6 @@ int dwc2_port_suspend(struct dwc2_hsotg *hsotg, u16 
windex)
 int dwc2_port_resume(struct dwc2_hsotg *hsotg)
 {
unsigned long flags;
-   u32 hprt0;
-   u32 pcgctl;
int ret = 0;
 
spin_lock_irqsave(&hsotg->lock, flags);
@@ -3374,33 +3372,14 @@ int dwc2_port_resume(struct dwc2_hsotg *hsotg)
break;
case DWC2_POWER_DOWN_PARAM_HIBERNATION:
case DWC2_POWER_DOWN_PARAM_NONE:
-   default:
/*
-* If power_down is supported, Phy clock is already resumed
-* after registers restore.
+* If not hibernation nor partial power down are supported,
+* port resume is done using the clock gating programming flow.
 */
-   if (!hsotg->params.power_down) {
-   pcgctl = dwc2_readl(hsotg, PCGCTL);
-   pcgctl &= ~PCGCTL_STOPPCLK;
-   dwc2_writel(hsotg, pcgctl, PCGCTL);
-   spin_unlock_irqrestore(&hsotg->lock, flags);
-   msleep(20);
-   spin_lock_irqsave(&hsotg->lock, flags);
-   }
-
-   hprt0 = dwc2_read_hprt0(hsotg);
-   hprt0 |= HPRT0_RES;
-   hprt0 &= ~HPRT0_SUSP;
-   dwc2_writel(hsotg, hprt0, HPRT0);
spin_unlock_irqrestore(&hsotg->lock, flags);
-
-   msleep(USB_RESUME_TIMEOUT);
-
+   dwc2_host_exit_clock_gating(hsotg, 0);
spin_lock_irqsave(&hsotg->lock, flags);
-   hprt0 = dwc2_read_hprt0(hsotg);
-   hprt0 &= ~(HPRT0_RES | HPRT0_SUSP);
-   dwc2_writel(hsotg, hprt0, HPRT0);
-   hsotg->bus_suspended = false;
+   break;
}
 
spin_unlock_irqrestore(&hsotg->lock, flags);
-- 
2.25.1



[PATCH 0/3] rseq: minor optimizations

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

rseq is a heavy user of copy to/from user data in fast paths.
This series tries to reduce the cost.

Eric Dumazet (3):
  rseq: optimize rseq_update_cpu_id()
  rseq: remove redundant access_ok()
  rseq: optimise for 64bit arches

 kernel/rseq.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH v2 1/3] context_tracking: Split guest_enter/exit_irqoff

2021-04-13 Thread Wanpeng Li
On Tue, 13 Apr 2021 at 15:48, Christian Borntraeger
 wrote:
>
>
>
> On 13.04.21 09:38, Wanpeng Li wrote:
> > On Tue, 13 Apr 2021 at 15:35, Christian Borntraeger
> >  wrote:
> >>
> >>
> >>
> >> On 13.04.21 09:16, Wanpeng Li wrote:
> >> [...]
> >>
> >>> @@ -145,6 +155,13 @@ static __always_inline void guest_exit_irqoff(void)
> >>>}
> >>>
> >>>#else
> THis is the else part
>
>
> >>> +static __always_inline void context_guest_enter_irqoff(void)
> >>> +{
> >>> + instrumentation_begin();
>
> 2nd on
> >>> + rcu_virt_note_context_switch(smp_processor_id());
> >>> + instrumentation_end();
> 2nd off
> >>> +}
> >>> +
> >>>static __always_inline void guest_enter_irqoff(void)
> >>>{
> >>>/*
> >>> @@ -155,10 +172,13 @@ static __always_inline void guest_enter_irqoff(void)
> >>>instrumentation_begin();
>
> first on
> >>>vtime_account_kernel(current);
> >>>current->flags |= PF_VCPU;
> >>> - rcu_virt_note_context_switch(smp_processor_id());
> >>>instrumentation_end();
>
> first off
> >>> +
> >>> + context_guest_enter_irqoff();
> here we call the 2nd on and off.
> >>
> >> So we now do instrumentation_begin 2 times?
> >
> > Similar to context_guest_enter_irqoff() ifdef 
> > CONFIG_VIRT_CPU_ACCOUNTING_GEN.
>
> For the
> ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN part
> context_guest_enter_irqoff()
> does not have instrumentation_begin/end.
>
> Or did I miss anything.

I mean the if (!context_tracking_enabled_this_cpu()) part in the
function context_guest_enter_irqoff() ifdef
CONFIG_VIRT_CPU_ACCOUNTING_GEN. :)

Wanpeng


Re: [PATCH v2] platform/x86: pmc_atom: Match all Beckhoff Automation baytrail boards with critclk_systems DMI table

2021-04-13 Thread Hans de Goede
Hi,

On 4/12/21 3:30 PM, Steffen Dirkwinkel wrote:
> From: Steffen Dirkwinkel 
> 
> pmc_plt_clk* clocks are used for ethernet controllers, so they need to
> stay turned on. This adds the affected board family to the critclk_systems
> DMI table, so the clocks are marked as CLK_CRITICAL and not turned off.
> 
> This replaces the previously listed boards with a match for the whole
> device family CBxx63. CBxx63 matches only baytrail devices.
> There are new affected boards that would otherwise need to be listed.
> There are unaffected boards in the family, but having the clocks
> turned on is not an issue.
> 
> Fixes: 648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL")
> Reviewed-by: Andy Shevchenko 
> Signed-off-by: Steffen Dirkwinkel 

Thank you for your patch, I've applied this patch to my review-hans 
branch:
https://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git/log/?h=review-hans

Note it will show up in my review-hans branch once I've pushed my
local branch there, which might take a while.

Once I've run some tests on this branch the patches there will be
added to the platform-drivers-x86/for-next branch and eventually
will be included in the pdx86 pull-request to Linus for the next
merge-window.

Regards,

Hans




> ---
>  drivers/platform/x86/pmc_atom.c | 28 ++--
>  1 file changed, 2 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/platform/x86/pmc_atom.c b/drivers/platform/x86/pmc_atom.c
> index ca684ed760d1..a9d2a4b98e57 100644
> --- a/drivers/platform/x86/pmc_atom.c
> +++ b/drivers/platform/x86/pmc_atom.c
> @@ -393,34 +393,10 @@ static const struct dmi_system_id critclk_systems[] = {
>   },
>   {
>   /* pmc_plt_clk* - are used for ethernet controllers */
> - .ident = "Beckhoff CB3163",
> + .ident = "Beckhoff Baytrail",
>   .matches = {
>   DMI_MATCH(DMI_SYS_VENDOR, "Beckhoff Automation"),
> - DMI_MATCH(DMI_BOARD_NAME, "CB3163"),
> - },
> - },
> - {
> - /* pmc_plt_clk* - are used for ethernet controllers */
> - .ident = "Beckhoff CB4063",
> - .matches = {
> - DMI_MATCH(DMI_SYS_VENDOR, "Beckhoff Automation"),
> - DMI_MATCH(DMI_BOARD_NAME, "CB4063"),
> - },
> - },
> - {
> - /* pmc_plt_clk* - are used for ethernet controllers */
> - .ident = "Beckhoff CB6263",
> - .matches = {
> - DMI_MATCH(DMI_SYS_VENDOR, "Beckhoff Automation"),
> - DMI_MATCH(DMI_BOARD_NAME, "CB6263"),
> - },
> - },
> - {
> - /* pmc_plt_clk* - are used for ethernet controllers */
> - .ident = "Beckhoff CB6363",
> - .matches = {
> - DMI_MATCH(DMI_SYS_VENDOR, "Beckhoff Automation"),
> - DMI_MATCH(DMI_BOARD_NAME, "CB6363"),
> + DMI_MATCH(DMI_PRODUCT_FAMILY, "CBxx63"),
>   },
>   },
>   {
> 



Re: [PATCH 7/7 v2] tracing: Do not create tracefs files if tracefs lockdown is in effect

2021-04-13 Thread Ondrej Mosnacek
On Sat, Oct 12, 2019 at 2:59 AM Steven Rostedt  wrote:
> From: "Steven Rostedt (VMware)" 
>
> If on boot up, lockdown is activated for tracefs, don't even bother creating
> the files. This can also prevent instances from being created if lockdown is
> in effect.
>
> Link: 
> http://lkml.kernel.org/r/CAHk-=whC6Ji=fwnjh2+es4b15tnbss4vpvtvbowcy1jjeg_...@mail.gmail.com
>
> Suggested-by: Linus Torvalds 
> Signed-off-by: Steven Rostedt (VMware) 
> ---
>  fs/tracefs/inode.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index eeeae0475da9..0caa151cae4e 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -390,6 +391,9 @@ struct dentry *tracefs_create_file(const char *name, 
> umode_t mode,
> struct dentry *dentry;
> struct inode *inode;
>
> +   if (security_locked_down(LOCKDOWN_TRACEFS))
> +   return NULL;
> +
> if (!(mode & S_IFMT))
> mode |= S_IFREG;
> BUG_ON(!S_ISREG(mode));
> --
> 2.23.0

Hi all,

sorry for coming back to an old thread, but it turns out that this
patch doesn't play well with SELinux's implementation of the
security_locked_down() hook, which was added a few months later (so
not your fault :) in commit 59438b46471a ("security,lockdown,selinux:
implement SELinux lockdown").

What SELinux does is it checks if the current task's creds are allowed
the lockdown::integrity or lockdown::confidentiality permission in the
policy whenever security_locked_down() is called. The idea is to be
able to control at SELinux domain level which tasks can do these
sensitive operations (when the kernel is not actually locked down by
the Lockdown LSM).

With this patch + the SELinux lockdown mechanism in use, when a
userspace task loads a module that creates some tracefs nodes in its
initcall SELinux will check if the task has the
lockdown::confidentiality permission and if not, will report denials
in audit log and prevent the tracefs entries from being created. But
that is not a very logical behavior, since the task loading the module
is itself not (explicitly) doing anything that would breach
confidentiality. It just indirectly causes some tracefs nodes to be
created, but doesn't actually use them at that point.

Since it seems the other patches also added security_locked_down()
calls to the tracefs nodes' open functions, I guess reverting this
patch could be an acceptable way to fix this problem (please correct
me if there is something that this call catches, which the other ones
don't). However, even then I can understand that you (or someone else)
might want to keep this as an optimization, in which case we could
instead do this:
1. Add a new hook security_locked_down_permanently() (the name is open
for discussion), which would be intended for situations when we want
to avoid doing some pointless work when the kernel is in a "hard"
lockdown that can't be taken back (except perhaps in some rescue
scenario...).
2. This hook would be backed by the same implementation as
security_locked_down() in the Lockdown LSM and left unimplemented by
SELinux.
3. tracefs_create_file() would call this hook instead of security_locked_down().

This way it would work as before relative to the standard lockdown via
the Lockdown LSM and would be simply ignored by SELinux. I went over
all the security_locked_down() call in the kernel and I think this
alternative hook could also fit better in arch/powerpc/xmon/xmon.c,
where it seems to be called from interrupt context (so task creds are
irrelevant, anyway...) and mainly causes some values to be redacted.
(I also found a couple minor issues with how the hook is used in other
places, for which I plan to send patches later.)

Thoughts?

--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.



Re: [PATCH v2 0/3] KVM: Properly account for guest CPU time

2021-04-13 Thread Christian Borntraeger




On 13.04.21 09:16, Wanpeng Li wrote:

The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831
reported that the guest time remains 0 when running a while true
loop in the guest.

Commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it
belongs") moves guest_exit_irqoff() close to vmexit, which breaks
tick-based time accounting: ticks that happen after IRQs are
disabled are incorrectly accounted to the host/system time, because
we exit the guest state too early.

This patchset splits both the context tracking logic and the time
accounting logic out of guest_enter/exit_irqoff(), keeps context
tracking around the actual vmentry/exit code, and provides
virt-time-specific helpers that can be placed at the proper spots
in KVM. In addition, it will not break the world outside of x86.

v1 -> v2:
  * split context_tracking from guest_enter/exit_irqoff
  * provide separate vtime accounting functions for consistency
  * place the virt-time-specific helpers at the proper spots

Suggested-by: Thomas Gleixner 
Cc: Thomas Gleixner 
Cc: Sean Christopherson 
Cc: Michael Tokarev 

Wanpeng Li (3):
   context_tracking: Split guest_enter/exit_irqoff
   context_tracking: Provide separate vtime accounting functions
   x86/kvm: Fix vtime accounting

  arch/x86/kvm/svm/svm.c   |  6 ++-
  arch/x86/kvm/vmx/vmx.c   |  6 ++-
  arch/x86/kvm/x86.c   |  1 +
  include/linux/context_tracking.h | 84 +++-
  4 files changed, 74 insertions(+), 23 deletions(-)



The non CONFIG_VIRT_CPU_ACCOUNTING_GEN look good.


Re: [PATCH] kernel:irq:manage: request threaded irq with a specified priority

2021-04-13 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 14:19, Song Chen wrote:
> In general, an irq handler thread is assigned a default priority of
> MAX_RT_PRIO/2; as a result, no irq thread can preempt another.
>
> Here is a case I found in a real project: an interrupt int_a
> arrives, wakes up its handler handler_a, and handler_a wakes up a
> userspace RT process task_a.
>
> However, if another irq handler handler_b, which has nothing to do
> with any RT tasks, is running when int_a arrives, handler_a can't
> preempt handler_b; as a result, task_a can't be woken up immediately
> as expected until handler_b gives up the CPU voluntarily. In this
> case, determinism breaks.

It breaks because the system designer failed to assign proper priorities
to the irq threads int_a, int_b and to the user space process task_a.

That's not solvable at the kernel level.

Thanks,

tglx



Re: [PATCH 5.10 000/188] 5.10.30-rc1 review

2021-04-13 Thread Pavel Machek
Hi!

> This is the start of the stable review cycle for the 5.10.30 release.
> There are 188 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.

CIP testing did not find any problems here:

https://gitlab.com/cip-project/cip-testing/linux-stable-rc-ci/-/tree/linux-5.10.y

Tested-by: Pavel Machek (CIP) 

Best regards,
Pavel
-- 
DENX Software Engineering GmbH,  Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany


signature.asc
Description: Digital signature


Re: [PATCH v2 1/2] fuse: Fix possible deadlock when writing back dirty pages

2021-04-13 Thread Miklos Szeredi
On Mon, Apr 12, 2021 at 3:23 PM Baolin Wang
 wrote:
>
> Hi Miklos,
>
> 在 2021/3/27 14:36, Baolin Wang 写道:
> > We can meet below deadlock scenario when writing back dirty pages, and
> > writing files at the same time. The deadlock scenario can be reproduced
> > by:
> >
> > - A writeback worker thread A is trying to write a bunch of dirty pages via
> > fuse_writepages(); fuse_writepages() will lock one page (named page 1),
> > add it into the rb_tree with the writeback flag set, unlock page 1,
> > and then try to lock the next page (named page 2).
> >
> > - But at the same time a file write can be triggered by another process B,
> > writing several pages via fuse_perform_write(); fuse_perform_write()
> > will lock all required pages first, then wait for writeback of all those
> > pages to complete via fuse_wait_on_page_writeback().
> >
> > - Now process B has already locked page 1 and page 2, and waits for page 1
> > writeback to complete (page 1 is under writeback, set by process A). But
> > process A can not complete the writeback of page 1, since it is still
> > waiting to lock page 2, which was already locked by process B.
> >
> > A deadlock occurs.
> >
> > To fix this issue, we should make sure each page's writeback is completed
> > after locking that page in fuse_fill_write_pages() individually, and then
> > write the pages out together once they are all stable.
> >
> > [1450578.772896] INFO: task kworker/u259:6:119885 blocked for more than 120 
> > seconds.
> > [1450578.796179] kworker/u259:6  D0 119885  2 0x0028
> > [1450578.796185] Workqueue: writeback wb_workfn (flush-0:78)
> > [1450578.796188] Call trace:
> > [1450578.798804]  __switch_to+0xd8/0x148
> > [1450578.802458]  __schedule+0x280/0x6a0
> > [1450578.806112]  schedule+0x34/0xe8
> > [1450578.809413]  io_schedule+0x20/0x40
> > [1450578.812977]  __lock_page+0x164/0x278
> > [1450578.816718]  write_cache_pages+0x2b0/0x4a8
> > [1450578.820986]  fuse_writepages+0x84/0x100 [fuse]
> > [1450578.825592]  do_writepages+0x58/0x108
> > [1450578.829412]  __writeback_single_inode+0x48/0x448
> > [1450578.834217]  writeback_sb_inodes+0x220/0x520
> > [1450578.838647]  __writeback_inodes_wb+0x50/0xe8
> > [1450578.843080]  wb_writeback+0x294/0x3b8
> > [1450578.846906]  wb_do_writeback+0x2ec/0x388
> > [1450578.850992]  wb_workfn+0x80/0x1e0
> > [1450578.854472]  process_one_work+0x1bc/0x3f0
> > [1450578.858645]  worker_thread+0x164/0x468
> > [1450578.862559]  kthread+0x108/0x138
> > [1450578.865960] INFO: task doio:207752 blocked for more than 120 seconds.
> > [1450578.888321] doioD0 207752 207740 0x
> > [1450578.888329] Call trace:
> > [1450578.890945]  __switch_to+0xd8/0x148
> > [1450578.894599]  __schedule+0x280/0x6a0
> > [1450578.898255]  schedule+0x34/0xe8
> > [1450578.901568]  fuse_wait_on_page_writeback+0x8c/0xc8 [fuse]
> > [1450578.907128]  fuse_perform_write+0x240/0x4e0 [fuse]
> > [1450578.912082]  fuse_file_write_iter+0x1dc/0x290 [fuse]
> > [1450578.917207]  do_iter_readv_writev+0x110/0x188
> > [1450578.921724]  do_iter_write+0x90/0x1c8
> > [1450578.925598]  vfs_writev+0x84/0xf8
> > [1450578.929071]  do_writev+0x70/0x110
> > [1450578.932552]  __arm64_sys_writev+0x24/0x30
> > [1450578.936727]  el0_svc_common.constprop.0+0x80/0x1f8
> > [1450578.941694]  el0_svc_handler+0x30/0x80
> > [1450578.945606]  el0_svc+0x10/0x14
> >
> > Suggested-by: Peng Tao 
> > Signed-off-by: Baolin Wang 
>
> Do you have any comments for this patch set? Thanks.

Hi,

I guess this is related:

https://lore.kernel.org/linux-fsdevel/20210209100115.gb1208...@miu.piliscsaba.redhat.com/

Can you verify that the patch at the above link fixes your issue?

Thanks,
Miklos


[PATCH v4] MIPS: Loongson64: Add kexec/kdump support

2021-04-13 Thread Youling Tang
From: Huacai Chen 

Add kexec/kdump support for Loongson64 by:
1, Provide Loongson-specific kexec functions: loongson_kexec_prepare(),
   loongson_kexec_shutdown() and loongson_crash_shutdown();
2, Provide Loongson-specific assembly code in kexec_smp_wait();

To start Loongson64, the boot CPU needs 3 parameters:
fw_arg0: the number of arguments in cmdline (i.e., argc).
fw_arg1: structure holds cmdline such as "root=/dev/sda1 console=tty"
 (i.e., argv).
fw_arg2: environment (i.e., envp, additional boot parameters from LEFI).

Non-boot CPUs need one parameter: the IPI mailbox base address. They
query their own IPI mailbox to get PC, SP and GP in a loop, until
the boot CPU brings them up.

loongson_kexec_prepare(): Setup cmdline for kexec/kdump. The kexec/kdump
cmdline comes from kexec's "append" option string. This structure will
be parsed in fw_init_cmdline() of arch/mips/fw/lib/cmdline.c. Both
image->control_code_page and the cmdline need to be in a safe memory region
(memory allocated by the old kernel may be corrupted by the new kernel).
In order to maintain compatibility for the old firmware, the low 2MB is
reserved and safe for Loongson. So let KEXEC_CTRL_CODE and KEXEC_ARGV_ADDR
be here. LEFI parameters may be corrupted at runtime, so back them up
at mips_reboot_setup(), and then restore them at loongson_kexec_shutdown()
/loongson_crash_shutdown().

loongson_kexec_shutdown(): Wake up all present CPUs and let them go to
reboot_code_buffer. Pass the kexec parameters to kexec_args.

loongson_crash_shutdown(): Pass the kdump parameters to kexec_args.

The assembly part in kexec_smp_wait provides a routine, as the BIOS does, in
order to keep secondary CPUs in a querying loop.

The layout of low 2MB memory in our design:
0x8000, the first MB, the first 64K, Exception vectors
0x8001, the first MB, the second 64K, STR (suspend) data
0x8002, the first MB, the third and fourth 64K, UEFI HOB
0x8004, the first MB, the fifth 64K, RT-Thread for SMC
0x8010, the second MB, the first 64K, KEXEC code
0x80108000, the second MB, the second 64K, KEXEC data

Cc: Eric Biederman 
Tested-by: Jinyang He 
Signed-off-by: Huacai Chen 
Signed-off-by: Jinyang He 
Signed-off-by: Youling Tang 
---
v3 -> v4:
- Use the macro kexec_smp_wait_final and move the platform-specific code
  into the kernel-entry-init.h file. This suggestion comes from Thomas.

 .../asm/mach-cavium-octeon/kernel-entry-init.h |   8 ++
 .../asm/mach-loongson64/kernel-entry-init.h|  27 +
 arch/mips/kernel/relocate_kernel.S |   9 +-
 arch/mips/loongson64/reset.c   | 113 +
 4 files changed, 152 insertions(+), 5 deletions(-)

diff --git a/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h 
b/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
index c38b38c..b071a73 100644
--- a/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
+++ b/arch/mips/include/asm/mach-cavium-octeon/kernel-entry-init.h
@@ -157,4 +157,12 @@
.macro  smp_slave_setup
.endm
 
+#define USE_KEXEC_SMP_WAIT_FINAL
+   .macro  kexec_smp_wait_final
+   .set push
+   .set noreorder
+   synci   0($0)
+   .set pop
+   .endm
+
 #endif /* __ASM_MACH_CAVIUM_OCTEON_KERNEL_ENTRY_H */
diff --git a/arch/mips/include/asm/mach-loongson64/kernel-entry-init.h 
b/arch/mips/include/asm/mach-loongson64/kernel-entry-init.h
index e4d77f4..13373c5 100644
--- a/arch/mips/include/asm/mach-loongson64/kernel-entry-init.h
+++ b/arch/mips/include/asm/mach-loongson64/kernel-entry-init.h
@@ -75,4 +75,31 @@
	.set	pop
.endm
 
+#define USE_KEXEC_SMP_WAIT_FINAL
+   .macro  kexec_smp_wait_final
+   /* s0:prid s1:initfn */
+   /* a0:base t1:cpuid t2:node t9:count */
+	mfc0	t1, CP0_EBASE
+	andi	t1, MIPS_EBASE_CPUNUM
+	dins	a0, t1, 8, 2       /* insert core id */
+	dext	t2, t1, 2, 2
+	dins	a0, t2, 44, 2      /* insert node id */
+	mfc0	s0, CP0_PRID
+	andi	s0, s0, (PRID_IMP_MASK | PRID_REV_MASK)
+	beq	s0, (PRID_IMP_LOONGSON_64C | PRID_REV_LOONGSON3B_R1), 1f
+	beq	s0, (PRID_IMP_LOONGSON_64C | PRID_REV_LOONGSON3B_R2), 1f
+	b	2f                 /* Loongson-3A1000/3A2000/3A3000/3A4000 */
+1:	dins	a0, t2, 14, 2      /* Loongson-3B1000/3B1500 need bit 15~14 */
+2:	li	t9, 0x100          /* wait for init loop */
+3:	addiu	t9, -1             /* limit mailbox access */
+	bnez	t9, 3b
+	lw	s1, 0x20(a0)       /* check PC as an indicator */
+	beqz	s1, 2b
+	ld	s1, 0x20(a0)       /* get PC via mailbox reg0 */
+	ld	sp, 0x28(a0)       /* get SP via mailbox reg1 */
+	ld	gp, 0x30(a0)       /* get GP via mailbox reg2 */
+   

Re: [PATCH RFC v2 3/4] virtio_net: move tx vq operation under tx queue lock

2021-04-13 Thread Jason Wang



在 2021/4/13 下午1:47, Michael S. Tsirkin 写道:

It's unsafe to operate a vq from multiple threads.
Unfortunately this is exactly what we do when invoking
clean tx poll from rx napi.
As a fix move everything that deals with the vq to under tx lock.

Signed-off-by: Michael S. Tsirkin 
---
  drivers/net/virtio_net.c | 22 +-
  1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 16d5abed582c..460ccdbb840e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1505,6 +1505,8 @@ static int virtnet_poll_tx(struct napi_struct *napi, int 
budget)
struct virtnet_info *vi = sq->vq->vdev->priv;
unsigned int index = vq2txq(sq->vq);
struct netdev_queue *txq;
+   int opaque;
+   bool done;
  
  	if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {

/* We don't need to enable cb for XDP */
@@ -1514,10 +1516,28 @@ static int virtnet_poll_tx(struct napi_struct *napi, 
int budget)
  
  	txq = netdev_get_tx_queue(vi->dev, index);

__netif_tx_lock(txq, raw_smp_processor_id());
+   virtqueue_disable_cb(sq->vq);
free_old_xmit_skbs(sq, true);
+
+   opaque = virtqueue_enable_cb_prepare(sq->vq);
+
+   done = napi_complete_done(napi, 0);
+
+   if (!done)
+   virtqueue_disable_cb(sq->vq);
+
__netif_tx_unlock(txq);
  
-	virtqueue_napi_complete(napi, sq->vq, 0);



So I wonder why not simply move __netif_tx_unlock() after 
virtqueue_napi_complete()?


Thanks



+   if (done) {
+   if (unlikely(virtqueue_poll(sq->vq, opaque))) {
+   if (napi_schedule_prep(napi)) {
+   __netif_tx_lock(txq, raw_smp_processor_id());
+   virtqueue_disable_cb(sq->vq);
+   __netif_tx_unlock(txq);
+   __napi_schedule(napi);
+   }
+   }
+   }
  
  	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)

netif_tx_wake_queue(txq);




Re: [PATCH 3/4] pwm: Add Aspeed ast2600 PWM support

2021-04-13 Thread Billy Tsai
Thanks for your review

Best Regards,
Billy Tsai

On 2021/4/12, 8:35 PM, Uwe Kleine-König wrote:

Hello Billy,

On Mon, Apr 12, 2021 at 05:54:56PM +0800, Billy Tsai wrote:
>> This patch adds support for the PWM controller which can be found on the
>> Aspeed ast2600 SoC. This controller supports up to 16 channels.
>> 
>> Signed-off-by: Billy Tsai 
>> ---
>>  drivers/pwm/pwm-aspeed-g6.c | 291 
>>  1 file changed, 291 insertions(+)
>>  create mode 100644 drivers/pwm/pwm-aspeed-g6.c
>> 
>> diff --git a/drivers/pwm/pwm-aspeed-g6.c b/drivers/pwm/pwm-aspeed-g6.c
>> new file mode 100644
>> index ..4bb4f97453c6
>> --- /dev/null
>> +++ b/drivers/pwm/pwm-aspeed-g6.c
>> @@ -0,0 +1,291 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) ASPEED Technology Inc.

> Don't you need to add a year here?

Got it.

>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 or 
later as
>> + * published by the Free Software Foundation.

> Hmm, the comment and the SPDX-License-Identifier contradict each other.
> The idea of the latter is that the former isn't needed.

I will use "// SPDX-License-Identifier: GPL-2.0-or-later" for the license.

>> + */

> Is there documentation available in the internet for this hardware? If
> yes, please mention a link here.

Sorry, we don't have documentation for this hardware available on the internet.

> Also describe the hardware here similar to how e.g.
> drivers/pwm/pwm-sifive.c does it. Please stick to the same format for
> easy grepping.

Got it.

>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +/* The channel number of Aspeed pwm controller */
>> +#define ASPEED_NR_PWMS 16
>> +/* PWM Control Register */
>> +#define ASPEED_PWM_CTRL_CH(ch) ((ch * 0x10) + 0x00)

> #define ASPEED_PWM_CTRL_CH(ch) (((ch) * 0x10) + 0x00)

Got it.

>> +#define PWM_LOAD_SEL_AS_WDT BIT(19)
>> +#define LOAD_SEL_FALLING 0
>> +#define LOAD_SEL_RIGING 1
>> +#define PWM_DUTY_LOAD_AS_WDT_EN BIT(18)
>> +#define PWM_DUTY_SYNC_DIS BIT(17)
>> +#define PWM_CLK_ENABLE BIT(16)
>> +#define PWM_LEVEL_OUTPUT BIT(15)
>> +#define PWM_INVERSE BIT(14)
>> +#define PWM_OPEN_DRAIN_EN BIT(13)
>> +#define PWM_PIN_EN BIT(12)
>> +#define PWM_CLK_DIV_H_SHIFT 8
>> +#define PWM_CLK_DIV_H_MASK (0xf << PWM_CLK_DIV_H_SHIFT)
>> +#define PWM_CLK_DIV_L_SHIFT 0
>> +#define PWM_CLK_DIV_L_MASK (0xff << PWM_CLK_DIV_L_SHIFT)
>> +/* PWM Duty Cycle Register */
>> +#define ASPEED_PWM_DUTY_CYCLE_CH(ch) ((ch * 0x10) + 0x04)
>> +#define PWM_PERIOD_SHIFT (24)
>> +#define PWM_PERIOD_MASK (0xff << PWM_PERIOD_SHIFT)
>> +#define PWM_POINT_AS_WDT_SHIFT (16)
>> +#define PWM_POINT_AS_WDT_MASK (0xff << PWM_POINT_AS_WDT_SHIFT)
>> +#define PWM_FALLING_POINT_SHIFT (8)
>> +#define PWM_FALLING_POINT_MASK (0x << PWM_FALLING_POINT_SHIFT)
>> +#define PWM_RISING_POINT_SHIFT (0)
>> +#define PWM_RISING_POINT_MASK (0x << PWM_RISING_POINT_SHIFT)
>> +/* PWM default value */
>> +#define DEFAULT_PWM_PERIOD 0xff
>> +#define DEFAULT_TARGET_PWM_FREQ 25000
>> +#define DEFAULT_DUTY_PT 10
>> +#define DEFAULT_WDT_RELOAD_DUTY_PT 16

> You could spend a few empty lines to make this better readable. Also
> please use a consistent driver-specific prefix for your defines and
> consider using the macros from . Also defines
> for bitfields should contain the register name.

Got it. I will use the bitfield method to write the hardware register.

>> +struct aspeed_pwm_data {
>> +struct pwm_chip chip;
>> +struct regmap *regmap;
>> +unsigned long clk_freq;
>> +struct reset_control *reset;
>> +};
>> +/**
>> + * struct aspeed_pwm - per-PWM driver data
>> + * @freq: cached pwm freq
>> + */
>> +struct aspeed_pwm {
>> +u32 freq;
>> +};

> This is actually unused, please drop it. (You save a value in it, but
> make never use of it.)

Got it.

>> +static void aspeed_set_pwm_channel_enable(struct regmap *regmap, u8 
pwm_channel,
>> +  bool enable)
>> +{
>> +regmap_update_bits(regmap, ASPEED_PWM_CTRL_CH(pwm_channel),
>> +   (PWM_CLK_ENABLE | PWM_PIN_EN),
>> +   enable ? (PWM_CLK_ENABLE | PWM_PIN_EN) : 0);

> What is the semantic of PIN_EN?

It means pin enable. I will spell the define out as PWM_PIN_ENABLE.

>> +}
>> +/*
>> + * The PWM frequency = 

Re: cocci script hints request

2021-04-13 Thread Fabio Aiuto
On Tue, Apr 13, 2021 at 11:11:38AM +0200, Greg KH wrote:
> On Tue, Apr 13, 2021 at 11:04:01AM +0200, Fabio Aiuto wrote:
> > Hi,
> > 
> > I would like to improve the following coccinelle script:
> > 
> > @@
> > expression a, fmt;
> > expression list var_args;
> > @@
> > 
> > -   DBG_871X_LEVEL(a, fmt, var_args);
> > +   printk(fmt, var_args);
> > 
> > I would replace the DBG_871X_LEVEL macro with printk,
> 
> No you really do not, you want to change that to a dev_*() call instead
> depending on the "level" of the message.
> 
> No "raw" printk() calls please, I will just reject them :)
> 
> thanks,
> 
> greg k-h

but there are very few occurrences of DBG_871X_LEVEL in module init functions:

static int __init rtw_drv_entry(void)
{
int ret;

DBG_871X_LEVEL(_drv_always_, "module init start\n");
dump_drv_version(RTW_DBGDUMP);
#ifdef BTCOEXVERSION
DBG_871X_LEVEL(_drv_always_, "rtl8723bs BT-Coex version = %s\n", 
BTCOEXVERSION);
#endif /*  BTCOEXVERSION */

sdio_drvpriv.drv_registered = true;

ret = sdio_register_driver(_drvpriv.r871xs_drv);
if (ret != 0) {
sdio_drvpriv.drv_registered = false;
rtw_ndev_notifier_unregister();
}

DBG_871X_LEVEL(_drv_always_, "module init ret =%d\n", ret);
return ret;
}

where I don't have a device available... shall I pass NULL as the
first argument?

Another question: may I use netdev_dbg in case of rtl8723bs?

thank you,

fabio


Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 10:05 PM Guenter Roeck  wrote:
>
> On 4/12/21 10:38 AM, Eric Dumazet wrote:
> [ ... ]
>
> > Yes, I think this is the real issue here. This smells like some memory
> > corruption.
> >
> > In my traces, packet is correctly received in AF_PACKET queue.
> >
> > I have checked the skb is well formed.
> >
> > But the user space seems to never call poll() and recvmsg() on this
> > af_packet socket.
> >
>
> After sprinkling the kernel with debug messages:
>
> 424   00:01:33.674181 sendto(6, 
> "E\0\1H\0\0\0\0@\21y\246\0\0\0\0\377\377\377\377\0D\0C\00148\346\1\1\6\0\246\336\333\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0RT\0\
> 424   00:01:33.693873 close(6)  = 0
> 424   00:01:33.694652 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
> 424   00:01:33.695213 clock_gettime64(CLOCK_MONOTONIC, 0x7be18a18) = -1 
> EFAULT (Bad address)
> 424   00:01:33.695889 write(2, "udhcpc: clock_gettime(MONOTONIC) failed\n", 
> 40) = -1 EFAULT (Bad address)
> 424   00:01:33.697311 exit_group(1) = ?
> 424   00:01:33.698346 +++ exited with 1 +++
>
> I only see that after adding debug messages in the kernel, so I guess there 
> must be
> a heisenbug somewhere.
>
> Anyway, indeed, I see (another kernel debug message):
>
> __do_sys_clock_gettime: Returning -EFAULT on address 0x7bacc9a8
>
> So udhcpc doesn't even try to read the reply because it crashes after sendto()
> when trying to read the current time. Unless I am missing something, that 
> means
> that the problem happens somewhere on the send side.
>
> To make things even more interesting, it looks like the failing system call
> isn't always clock_gettime().
>
> Guenter


I think the GRO fast path has never worked on SUPERH. Probably SUPERH has
never been used with a fast NIC (10Gbit+).

The following hack fixes the issue.


diff --git a/net/core/dev.c b/net/core/dev.c
index 
af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..91ba89a645ff91d4cd4f3d8dc8a009bcb67da344
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5916,13 +5916,16 @@ static struct list_head
*gro_list_prepare(struct napi_struct *napi,

 static void skb_gro_reset_offset(struct sk_buff *skb)
 {
+#if !defined(CONFIG_SUPERH)
const struct skb_shared_info *pinfo = skb_shinfo(skb);
const skb_frag_t *frag0 = >frags[0];
+#endif

NAPI_GRO_CB(skb)->data_offset = 0;
NAPI_GRO_CB(skb)->frag0 = NULL;
NAPI_GRO_CB(skb)->frag0_len = 0;

+#if !defined(CONFIG_SUPERH)
if (!skb_headlen(skb) && pinfo->nr_frags &&
!PageHighMem(skb_frag_page(frag0))) {
NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
@@ -5930,6 +5933,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
skb_frag_size(frag0),
skb->end - skb->tail);
}
+#endif
 }

 static void gro_pull_from_frag0(struct sk_buff *skb, int grow)


Re: [PATCH] riscv: locks: introduce ticket-based spinlock implementation

2021-04-13 Thread Peter Zijlstra
On Tue, Apr 13, 2021 at 11:22:40AM +0200, Christoph Müllner wrote:

> > For ticket locks you really only needs atomic_fetch_add() and
> > smp_store_release() and an architectural guarantees that the
> > atomic_fetch_add() has fwd progress under contention and that a sub-word
> > store (through smp_store_release()) will fail the SC.
> >
> > Then you can do something like:
> >
> > void lock(atomic_t *lock)
> > {
> > u32 val = atomic_fetch_add(1<<16, lock); /* SC, gives us RCsc */
> > u16 ticket = val >> 16;
> >
> > for (;;) {
> > if (ticket == (u16)val)
> > break;
> > cpu_relax();
> > val = atomic_read_acquire(lock);
> > }
> > }
> >
> > void unlock(atomic_t *lock)
> > {
> > u16 *ptr = (u16 *)lock + (!!__BIG_ENDIAN__);
> > u32 val = atomic_read(lock);
> >
> > smp_store_release(ptr, (u16)val + 1);
> > }
> >
> > That's _almost_ as simple as a test-and-set :-) It isn't quite optimal
> > on x86 for not being allowed to use a memop on unlock, since its being
> > forced into a load-store because of all the volatile, but whatever.
> 
> What about trylock()?
> I.e. one could implement trylock() without a loop, by letting
> trylock() fail if the SC fails.
> That looks safe on first view, but nobody does this right now.

Generic code has to use cmpxchg(), and then you get something like this:

bool trylock(atomic_t *lock)
{
u32 old = atomic_read(lock);

if ((old >> 16) != (old & 0x))
return false;

return atomic_try_cmpxchg(lock, , old + (1<<16)); /* SC, for RCsc */
}

That will try and do the full LL/SC loop, because it wants to complete
the cmpxchg, but in generic code we have no other option.

(Is this what C11's weak cmpxchg is for?)


URGENT RESPONSE NEEDED.

2021-04-13 Thread ibrahim musa
Dear friend, i am contacting you independently of my investigation in
my bank and no one is informed of this communication. I need your
urgent assistance in transferring the sum of $5.3million dollars to
your private account,that belongs to one of our foreign customer who
died a longtime with his supposed NEXT OF KIN since July 22, 2003. The
money has been here in our Bank lying dormant for years now without
anybody coming for the claim of it.

I want to release the money to you as the relative to our deceased
customer , the Banking laws here does not allow such money to stay
more than 18years, because the money will be recalled to the Bank
treasury account as unclaimed fund. I am ready to share with you 40%
for you and 60% will be kept for me, by indicating your interest i
will send you the full details on how the business will be executed, i
will be waiting for your urgent response.


Re: [PATCH v2 0/7] remove different PHY fixups

2021-04-13 Thread Oleksij Rempel
Hello,

On Tue, Mar 30, 2021 at 12:04:50PM -0300, Fabio Estevam wrote:
> Hi Andrew,
> 
> On Tue, Mar 30, 2021 at 11:30 AM Andrew Lunn  wrote:
> 
> > Hi Fabio
> >
> > I think it should be merged, and we fixup anything which does break.
> > We are probably at the point where more is broken by not merging it
> > than merging it.
> 
> Thanks for your feedback. I agree.
> 
> Shawn wants to collect some Acked-by for this series.
> 
> Could you please give your Acked-by for this series?

Andrew, can you please add you ACK?

Shawn will it be enough or you need more ACKs?

Regards,
Oleksij
-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |


Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the crashkernel option is specified

2021-04-13 Thread Baoquan He
On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> On Mon, Apr 12, 2021 at 2:52 AM Baoquan He  wrote:
> >
> > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> > >
> > >
> > > > On Apr 11, 2021, at 6:14 PM, Baoquan He  wrote:
> > > >
> > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> > > >> Why don't we do this unconditionally? At the very best we gain half a 
> > > >> megabyte of memory (except the trampoline, which has to live there, 
> > > >> but it is only a few kilobytes.)
> > > >
> > > > This is a great suggestion, thanks. I think we can fix it in this way to
> > > > make code simpler. Then the specific caring of real mode in
> > > > efi_free_boot_services() can be removed too.
> > > >
> > >
> > > This whole situation makes me think that the code is buggy before and 
> > > buggy after.
> > >
> > > The issue here (I think) is that various pieces of code want to reserve 
> > > specific pieces of otherwise-available low memory for their own nefarious 
> > > uses. I don’t know *why* crash kernel needs this, but that doesn’t matter 
> > > too much.
> >
> > Kdump kernel also needs to go through the real mode code path during
> > bootup. It is no different from a normal kernel except that it skips the
> > firmware resetting. So the kdump kernel needs the low 1M as system RAM just as a normal
> > kernel does. Here we reserve the whole low 1M with memblock_reserve()
> > to avoid any later kernel or driver data residing in this area. Otherwise,
> > we need to dump the content of this area to vmcore. As we know, when a
> > crash happens, the old memory of the 1st kernel should stay untouched
> > until vmcore dumping reads out its content. Meanwhile, the kdump kernel
> > needs to reuse the low 1M. In the past, we used a backup region to copy
> > out the low 1M area, and mapped the backup region into the low 1M area in
> > the vmcore elf file. In 6f599d84231fd27 ("x86/kdump: Always reserve the
> > low 1M when the crashkernel option is specified"), we changed to lock the
> > whole low 1M to avoid writing any kernel data into it; like this we can
> > skip this area when dumping vmcore.
> >
> > Above is why we try to memblock-reserve the whole low 1M. We don't want
> > to use it; we just don't want anyone to use it in the 1st kernel.
> >
> > >
> > > I propose that the right solution is to give low-memory-reserving code 
> > > paths two chances to do what they need: once at the very beginning and 
> > > once after EFI boot services are freed.
> > >
> > > Alternatively, just reserve *all* otherwise unused sub 1M memory up 
> > > front, then release it right after releasing boot services, and then 
> > > invoke the special cases exactly once.
> >
> > I am not sure I understood both suggested ways clearly. They look a little
> > complicated in our case. As I explained above, we want the whole low
> > 1M locked up, not just one piece or some pieces of it.
> 
> My second suggestion is probably the better one.  Here it is, concretely:
> 
> The early (pre-free_efi_boot_services) code just reserves all
> available sub-1M memory unconditionally, but it specially marks it as
> reserved-but-available-later.  We stop allocating the trampoline page
> at this stage.
> 
> In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> stick it in the pile of reserved memory created in the early step.
> This may involve splitting a block, kind of like the current
> trampoline late allocation works.
> 
> Then, *after* free_efi_boot_services(), we run a single block of code
> that lets everything that wants sub-1M code claim some.  This means
> that the trampoline gets allocated and, if crashkernel wants to claim
> everything else, it can.  After that, everything still unclaimed gets
> freed.

void __init setup_arch(char **cmdline_p)
{
...
efi_reserve_boot_services();
e820__memblock_alloc_reserved_mpc_new();
#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
setup_bios_corruption_check();
#endif
reserve_real_mode();
  

trim_platform_memory_ranges();
trim_low_memory_range();
...
}

After efi_reserve_boot_services(), there are several function calls that
reserve memory under the low 1M.


asmlinkage __visible void __init __no_sanitize_address start_kernel(void)   
  
{
...
setup_arch(_line);
...
mm_init();
--> mem_init();
 -->memblock_free_all();

...
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
-->efi_free_boot_services();
-->memblock_free_late();
#endif
...
}

So from the code flow, we can see that the buddy allocator is built in
mm_init(), which puts all memory from memblock.memory, excluding
memblock.reserved, into the buddy system. And much later, we call
efi_free_boot_services() to release those 

[PATCH] hwmon: (nct6683) remove useless function

2021-04-13 Thread Jiapeng Chong
Fix the following clang warning:

drivers/hwmon/nct6683.c:491:19: warning: unused function 'in_to_reg'
[-Wunused-function].

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
 drivers/hwmon/nct6683.c | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/drivers/hwmon/nct6683.c b/drivers/hwmon/nct6683.c
index a23047a..b886cf0 100644
--- a/drivers/hwmon/nct6683.c
+++ b/drivers/hwmon/nct6683.c
@@ -488,17 +488,6 @@ static inline long in_from_reg(u16 reg, u8 src)
return reg * scale;
 }
 
-static inline u16 in_to_reg(u32 val, u8 src)
-{
-   int scale = 16;
-
-   if (src == MON_SRC_VCC || src == MON_SRC_VSB || src == MON_SRC_AVSB ||
-   src == MON_SRC_VBAT)
-   scale <<= 1;
-
-   return clamp_val(DIV_ROUND_CLOSEST(val, scale), 0, 127);
-}
-
 static u16 nct6683_read(struct nct6683_data *data, u16 reg)
 {
int res;
-- 
1.8.3.1



Re: [PATCH] KVM: arm/arm64: Fix KVM_VGIC_V3_ADDR_TYPE_REDIST read

2021-04-13 Thread Keqian Zhu


On 2021/4/12 23:00, Eric Auger wrote:
> When reading the base address of the a REDIST region
> through KVM_VGIC_V3_ADDR_TYPE_REDIST we expect the
> redistributor region list to be populated with a single
> element.
> 
> However list_first_entry() expects the list to be non empty.
Indeed, list_first_entry() always returns a non-null ptr. If the list
is empty, it will mistake the list head for the first element.

> Instead we should use list_first_entry_or_null which effectively
> returns NULL if the list is empty.
> 
> Fixes: dbd9733ab674 ("KVM: arm/arm64: Replace the single rdist region by a 
> list")
> Cc:  # v4.18+
> Signed-off-by: Eric Auger 
> Reported-by: Gavin Shan 
> ---
>  arch/arm64/kvm/vgic/vgic-kvm-device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c 
> b/arch/arm64/kvm/vgic/vgic-kvm-device.c
> index 44419679f91a..5eaede3e3b5a 100644
> --- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
> +++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
> @@ -87,8 +87,8 @@ int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 
> *addr, bool write)
>   r = vgic_v3_set_redist_base(kvm, 0, *addr, 0);
>   goto out;
>   }
> - rdreg = list_first_entry(>rd_regions,
> -  struct vgic_redist_region, list);
> + rdreg = list_first_entry_or_null(>rd_regions,
> +  struct vgic_redist_region, 
> list);
>   if (!rdreg)
>   addr_ptr = _value;
>   else
> 


Re: [PATCH 1/4] dt-bindings: Add bindings for aspeed pwm-tach.

2021-04-13 Thread Billy Tsai
Hi,

Best Regards,
Billy Tsai

On 2021/4/12, 8:55 PM, Uwe Kleine-König wrote:

> Hello,

On Mon, Apr 12, 2021 at 05:54:54PM +0800, Billy Tsai wrote:
> +  - Billy Tsai 

> I object because the MTA at aspeedtech.com doesn't know this email
> address.

This is a typo; my email address is billy_t...@aspeedtech.com.
I will fix it in v2.

> Best regards
> Uwe

> -- 
> Pengutronix e.K.   | Uwe Kleine-König|
> Industrial Linux Solutions | https://www.pengutronix.de/ |



[PATCH v8] RISC-V: enable XIP

2021-04-13 Thread Alexandre Ghiti
From: Vitaly Wool 

Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address used
to link the kernel object files and for storing it has to be known
at compile time and is represented by a Kconfig option.

XIP on RISC-V will for the time being only work on MMU-enabled
kernels.

Signed-off-by: Alexandre Ghiti  [ Rebase on top of "Move
kernel mapping outside the linear mapping" ]
Signed-off-by: Vitaly Wool 
---
 arch/riscv/Kconfig  |  55 +++-
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  13 +++
 arch/riscv/include/asm/page.h   |  21 +
 arch/riscv/include/asm/pgtable.h|  25 +-
 arch/riscv/kernel/head.S|  46 +-
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |  10 ++-
 arch/riscv/kernel/vmlinux-xip.lds.S | 133 
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/init.c| 115 ++--
 11 files changed, 418 insertions(+), 17 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 8ea60a0a19ae..7c7efdd67a10 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -28,7 +28,7 @@ config RISCV
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_SET_MEMORY
-   select ARCH_HAS_STRICT_KERNEL_RWX if MMU
+   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
@@ -441,7 +441,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -465,11 +465,60 @@ config STACKPROTECTOR_PER_TASK
def_bool y
depends on STACKPROTECTOR && CC_HAVE_STACKPROTECTOR_TLS
 
+config PHYS_RAM_BASE_FIXED
+   bool "Explicitly specified physical RAM address"
+   default n
+
+config PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on PHYS_RAM_BASE_FIXED
+   default "0x8000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
+
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU && SPARSEMEM
+   select PHYS_RAM_BASE_FIXED
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will take more space to
+ store it.  The flash address used to link the kernel object files,
+ and for storing it, is configuration dependent. Therefore, if you
+ say Y here, you must know the proper physical address where to
+ store the kernel image depending on your own flash memory usage.
+
+ Also note that the make target becomes "make xipImage" rather than
+ "make zImage" or "make Image".  The final kernel binary to put in
+ ROM memory will be arch/riscv/boot/xipImage.
+
+ SPARSEMEM is required because the kernel text and rodata that are
+ flash resident are not backed by memmap, then any attempt to get
+ a struct page on those regions will trigger a fault.
+
+ If unsure, say N.
+
+config XIP_PHYS_ADDR
+   hex "XIP Kernel Physical Location"
+   depends on XIP_KERNEL
+   default "0x2100"
+   help
+ This is the physical address in your flash memory the kernel will
+ be linked for and stored to.  This address is dependent on your
+ own flash usage.
+
 endmenu
 
 config BUILTIN_DTB
-   def_bool n
+   bool
depends on OF
+   default y if XIP_KERNEL
 
 menu "Power management options"
 
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 1368d943f1f3..8fcbec03974d 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -82,7 +82,11 @@ CHECKFLAGS += -D__riscv -D__riscv_xlen=$(BITS)
 
 # Default target when executing plain make
 boot   

Re: [PATCH] MIPS: fix memory reservation for non-usermem setups

2021-04-13 Thread Ilya Lipnitskiy
On Mon, Apr 12, 2021 at 11:45 PM Ilya Lipnitskiy
 wrote:
>
> Hi Thomas,
>
> On Tue, Apr 6, 2021 at 6:18 AM Thomas Bogendoerfer
>  wrote:
> >
> > On Sat, Apr 03, 2021 at 07:02:13PM -0700, Ilya Lipnitskiy wrote:
> > > Hi Mike,
> > >
> > > On Tue, Mar 16, 2021 at 11:33 PM Mike Rapoport  wrote:
> > > >
> > > > Hi Ilya,
> > > >
> > > > On Tue, Mar 16, 2021 at 10:10:09PM -0700, Ilya Lipnitskiy wrote:
> > > > > Hi Thomas,
> > > > >
> > > > > On Fri, Mar 12, 2021 at 7:19 AM Thomas Bogendoerfer
> > > > >  wrote:
> > > > > >
> > > > > > On Sun, Mar 07, 2021 at 11:40:30AM -0800, Ilya Lipnitskiy wrote:
> > > > > > > From: Tobias Wolf 
> > > > > > >
> > > > > > > Commit 67a3ba25aa95 ("MIPS: Fix incorrect mem=X@Y handling") 
> > > > > > > introduced a new
> > > > > > > issue for rt288x where "PHYS_OFFSET" is 0x0 but the calculated 
> > > > > > > "ramstart" is
> > > > > > > not. As the prerequisite of custom memory map has been removed, 
> > > > > > > this results
> > > > > > > in the full memory range of 0x0 - 0x800 to be marked as 
> > > > > > > reserved for this
> > > > > > > platform.
> > > > > >
> > > > > > and where is the problem here ?
> > > > > Turns out this was already attempted to be upstreamed - not clear why
> > > > > it wasn't merged. Context:
> > > > > https://lore.kernel.org/linux-mips/6504517.U6H5IhoIOn@loki/
> > > > >
> > > > > I hope the thread above helps you understand the problem.
> > > >
> > > > The memory initialization was a bit different then. Do you still see the
> > > > same problem?
> > > Thanks for asking. I obtained a RT2880 device and gave it a try. It
> > > hangs at boot without this patch, however selecting
> >
> > can you provide debug logs with memblock=debug for both good and bad
> > kernels ? I'm curious what's the reason for failing allocation...
> Sorry for taking a while to respond. See attached.
> FWIW, it seems these are the lines that stand out in hang.log:
> [0.00] memblock_reserve: [0x-0x07ff] setup_arch+0x214/0x5d8
> [0.00] Wasting 1048576 bytes for tracking 32768 unused pages
> ...
> [0.00]  reserved[0x0][0x-0x087137aa], 0x087137ab bytes flags: 0x0
Just to be clear, good.log is mips-next tip (dbd815c0dcca) and
hang.log is the same with MIPS_AUTO_PFN_OFFSET _NOT_ selected.

Ilya


Re: [PATCH net-next 1/2] net: phy: marvell-88x2222: check that link is operational

2021-04-13 Thread Ivan Bornyakov
On Tue, Apr 13, 2021 at 01:40:32AM +0200, Andrew Lunn wrote:
> On Mon, Apr 12, 2021 at 03:16:59PM +0300, Ivan Bornyakov wrote:
> > Some SFP modules uses RX_LOS for link indication. In such cases link
> > will be always up, even without cable connected. RX_LOS changes will
> > trigger link_up()/link_down() upstream operations. Thus, check that SFP
> > link is operational before actual read link status.
> 
> Sorry, but this is not making much sense to me.
> 
> LOS just indicates some sort of light is coming into the device. You
> have no idea what sort of light. The transceiver might be able to
> decode that light and get sync, it might not. It is important that
> mv_read_status() returns the line side status. Has it been able to
> achieve sync? That should be independent of LOS. Or are you saying the
> transceiver is reporting sync, despite no light coming in?
> 
>   Andrew

Yes, with some SFP modules the transceiver reports sync despite no
light coming in. So the idea is to check that the link is somewhat
operational before actually reading the line-side status.



Re: [PATCH V4 2/2] net: ethernet: ravb: Enable optional refclk

2021-04-13 Thread Geert Uytterhoeven
Hi Adam,

On Mon, Apr 12, 2021 at 3:27 PM Adam Ford  wrote:
> For devices that use a programmable clock for the AVB reference clock,
> the driver may need to enable them.  Add code to find the optional clock
> and enable it when available.
>
> Signed-off-by: Adam Ford 
> Reviewed-by: Andrew Lunn 
>
> ---
> V4:  Eliminate the NULL check when disabling refclk, and add a line
>  to disable the refclk if there is a failure after it's been
>  initialized.

Thanks for the update!

> --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -2148,6 +2148,13 @@ static int ravb_probe(struct platform_device *pdev)
> goto out_release;
> }
>
> +   priv->refclk = devm_clk_get_optional(&pdev->dev, "refclk");
> +   if (IS_ERR(priv->refclk)) {
> +   error = PTR_ERR(priv->refclk);
> +   goto out_release;

Note that this will call clk_disable_unprepare() in case of failure, which is
fine, as that function is a no-op in case of a failed clock.

> +   }
> +   clk_prepare_enable(priv->refclk);
> +
> ndev->max_mtu = 2048 - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN);
> ndev->min_mtu = ETH_MIN_MTU;
>
> @@ -2244,6 +2251,7 @@ static int ravb_probe(struct platform_device *pdev)
> if (chip_id != RCAR_GEN2)
> ravb_ptp_stop(ndev);
>  out_release:
> +   clk_disable_unprepare(priv->refclk);
> free_netdev(ndev);
>
> pm_runtime_put(&pdev->dev);

Reviewed-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH v2 12/12] usb: dwc2: Add exit clock gating before removing driver

2021-04-13 Thread Artur Petrosyan
When the dwc2 core is in clock gating mode, loading the
driver again fails because in that mode the
registers are not accessible.

Add a flow of exiting clock gating mode
to avoid the driver reload failure.

Signed-off-by: Artur Petrosyan 
---
 Changes in v2:
 - None

 drivers/usb/dwc2/platform.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/usb/dwc2/platform.c b/drivers/usb/dwc2/platform.c
index b28b8cd45799..f8b819cfa80e 100644
--- a/drivers/usb/dwc2/platform.c
+++ b/drivers/usb/dwc2/platform.c
@@ -326,6 +326,15 @@ static int dwc2_driver_remove(struct platform_device *dev)
"exit partial_power_down failed\n");
}
 
+   /* Exit clock gating when driver is removed. */
+   if (hsotg->params.power_down == DWC2_POWER_DOWN_PARAM_NONE &&
+   hsotg->bus_suspended) {
+   if (dwc2_is_device_mode(hsotg))
+   dwc2_gadget_exit_clock_gating(hsotg, 0);
+   else
+   dwc2_host_exit_clock_gating(hsotg, 0);
+   }
+
dwc2_debugfs_exit(hsotg);
if (hsotg->hcd_enabled)
dwc2_hcd_remove(hsotg);
-- 
2.25.1



Re: [Outreachy kernel] [PATCH] staging: rtl8723bs: hal: Remove camelcase

2021-04-13 Thread Greg Kroah-Hartman
On Mon, Apr 12, 2021 at 11:02:58PM +0200, Fabio M. De Francesco wrote:
> Removed camelcase in (some) symbols. Further work is needed.

What symbols did you do this for?  What did you change them from and to?

Be specific, and try to do only one structure at a time at the most,
trying to review 1000 lines of changes at once is hard, would you want
to do that?  :)

thanks,

greg k-h


Re: [PATCH 4/7] mm: Introduce verify_page_range()

2021-04-13 Thread Peter Zijlstra
On Mon, Apr 12, 2021 at 01:05:09PM -0700, Kees Cook wrote:
> On Mon, Apr 12, 2021 at 10:00:16AM +0200, Peter Zijlstra wrote:
> > +struct vpr_data {
> > +   int (*fn)(pte_t pte, unsigned long addr, void *data);
> > +   void *data;
> > +};
> 
> Eeerg. This is likely to become an attack target itself. Stored function
> pointer with stored (3rd) argument.
> 
> This doesn't seem needed: only DRM uses it, and that's for error
> reporting. I'd rather plumb back errors in a way to not have to add
> another place in the kernel where we do func+arg stored calling.

Is this any better? It does have the stored pointer, but not a stored
argument, assuming you don't count returns as arguments I suppose.

The alternative is refactoring apply_to_page_range() :-/

---

struct vpr_data {
bool (*fn)(pte_t pte, unsigned long addr);
unsigned long addr;
};

static int vpr_fn(pte_t *pte, unsigned long addr, void *data)
{
struct vpr_data *vpr = data;
if (!vpr->fn(*pte, addr)) {
vpr->addr = addr;
return -EINVAL;
}
return 0;
}

/**
 * verify_page_range() - Scan (and fill) a range of virtual memory and validate PTEs
 * @mm: mm identifying the virtual memory map
 * @addr: starting virtual address of the range
 * @size: size of the range
 * @fn: function that verifies the PTEs
 *
 * Scan a region of virtual memory, filling in page tables as necessary and
 * calling a provided function on each leaf, providing a copy of the
 * page-table-entry.
 *
 * Similar to apply_to_page_range(), but does not provide direct access to the
 * page-tables.
 *
 * NOTE! this function does not work correctly vs large pages.
 *
 * Return: the address that failed verification or 0 on success.
 */
unsigned long verify_page_range(struct mm_struct *mm,
unsigned long addr, unsigned long size,
bool (*fn)(pte_t pte, unsigned long addr))
{
struct vpr_data vpr = {
.fn = fn,
.addr = 0,
};
apply_to_page_range(mm, addr, size, vpr_fn, &vpr);
return vpr.addr;
}
EXPORT_SYMBOL_GPL(verify_page_range);


Re: [PATCH v5 0/3] staging: rtl8192e: Cleanup patchset for style issues in rtl819x_HTProc.c

2021-04-13 Thread Greg KH
On Tue, Apr 13, 2021 at 08:55:03AM +0530, Mitali Borkar wrote:
> Changes from v4:-
> [PATCH v4 1/3]:- No changes.
> [PATCH v4 2/3]:- No changes.
> [PATCH V4 3/3]:- Removed casts and parentheses.

This series does not apply cleanly, please rebase and resend.

thanks,

greg k-h


Re: [PATCH] USB: Don't set USB_PORT_FEAT_SUSPEND on WD19's Realtek Hub

2021-04-13 Thread Chris Chiu
On Mon, Apr 12, 2021 at 11:12 PM Alan Stern  wrote:
>
> On Mon, Apr 12, 2021 at 11:00:06PM +0800, chris.c...@canonical.com wrote:
> > From: Chris Chiu 
> >
> > Realtek Hub (0bda:5413) in Dell Dock WD19 sometimes fails to work
> > after the system resumes from suspend with remote wakeup enabled
> > device connected:
> > [ 1947.640907] hub 5-2.3:1.0: hub_ext_port_status failed (err = -71)
> > [ 1947.641208] usb 5-2.3-port5: cannot disable (err = -71)
> > [ 1947.641401] hub 5-2.3:1.0: hub_ext_port_status failed (err = -71)
> > [ 1947.641450] usb 5-2.3-port4: cannot reset (err = -71)
> >
> > Information of this hub:
> > T:  Bus=01 Lev=02 Prnt=02 Port=02 Cnt=01 Dev#=  9 Spd=480  MxCh= 6
> > D:  Ver= 2.10 Cls=09(hub  ) Sub=00 Prot=02 MxPS=64 #Cfgs=  1
> > P:  Vendor=0bda ProdID=5413 Rev= 1.21
> > S:  Manufacturer=Dell Inc.
> > S:  Product=Dell dock
> > C:* #Ifs= 1 Cfg#= 1 Atr=a0 MxPwr=  0mA
> > I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=01 Driver=hub
> > E:  Ad=81(I) Atr=03(Int.) MxPS=   1 Ivl=256ms
> > I:* If#= 0 Alt= 1 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=02 Driver=hub
> > E:  Ad=81(I) Atr=03(Int.) MxPS=   1 Ivl=256ms
> >
> > The failure results from an occasional ETIMEDOUT when turning on
> > the suspend feature of the hub. usb_resume_device will not be
> > invoked since the device state is not set to suspended, so the
> > hub subsequently fails to activate.
> >
> > The USB_PORT_FEAT_SUSPEND is not really necessary due to the
> > "global suspend" in the USB 2.0 spec. It's only needed for hub devices
> > which don't relay wakeup requests from the devices connected to
> > downstream ports. For this Realtek hub, there's no problem waking
> > up the system from a connected keyboard.
>
> What about runtime suspend?  That _does_ require USB_PORT_FEAT_SUSPEND.

It's hard to reproduce the same thing with runtime PM. I also don't
know an aggressive way to trigger runtime suspend. So I'm assuming the
same thing will happen in the runtime PM case because they both go
through the same usb_port_resume path. Could you please suggest a
better way to verify this for runtime PM?

>
> > This commit bypasses the USB_PORT_FEAT_SUSPEND for the quirky hub.
> >
> > Signed-off-by: Chris Chiu 
> > ---
>
>
> > diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> > index 7f71218cc1e5..8478d49bba77 100644
> > --- a/drivers/usb/core/hub.c
> > +++ b/drivers/usb/core/hub.c
> > @@ -3329,8 +3329,11 @@ int usb_port_suspend(struct usb_device *udev, pm_message_t msg)
> >* descendants is enabled for remote wakeup.
> >*/
> >   else if (PMSG_IS_AUTO(msg) || usb_wakeup_enabled_descendants(udev) > 0)
> > - status = set_port_feature(hub->hdev, port1,
> > - USB_PORT_FEAT_SUSPEND);
> > + if (udev->quirks & USB_QUIRK_NO_SET_FEAT_SUSPEND)
>
> You should test hub->hdev->quirks, here, not udev->quirks.  The quirk
> belongs to the Realtek hub, not to the device that's plugged into the
> hub.
>

Thanks for pointing that out. I'll verify again and propose a V2 after
it's done.

> Alan Stern


Re: [syzbot] general protection fault in gadget_setup

2021-04-13 Thread Dmitry Vyukov
On Tue, Apr 13, 2021 at 10:08 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:0f4498ce Merge tag 'for-5.12/dm-fixes-2' of git://git.kern..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=124adbf6d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=daeff30c2474a60f
> dashboard link: https://syzkaller.appspot.com/bug?extid=eb4674092e6cc8d9e0bd
> userspace arch: i386
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+eb4674092e6cc8d9e...@syzkaller.appspotmail.com

I suspect that the raw gadget_unbind() can be called while the timer
is still active. gadget_unbind() sets gadget data to NULL.
But I am not sure which unbind call this is:
usb_gadget_remove_driver() or right in udc_bind_to_driver() due to a
start error.

Also looking at the code, gadget_bind() resets data to NULL on this error path:
https://elixir.bootlin.com/linux/v5.12-rc7/source/drivers/usb/gadget/legacy/raw_gadget.c#L283
but not on this error path:
https://elixir.bootlin.com/linux/v5.12-rc7/source/drivers/usb/gadget/legacy/raw_gadget.c#L306
Should the second one also reset data to NULL?


> general protection fault, probably for non-canonical address 
> 0xdc04:  [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x0020-0x0027]
> CPU: 1 PID: 5016 Comm: systemd-udevd Not tainted 5.12.0-rc4-syzkaller #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> RIP: 0010:__lock_acquire+0xcfe/0x54c0 kernel/locking/lockdep.c:4770
> Code: 09 0e 41 bf 01 00 00 00 0f 86 8c 00 00 00 89 05 48 69 09 0e e9 81 00 00 
> 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 5b 
> 31 00 00 49 81 3e c0 13 38 8f 0f 84 d0 f3 ff
> RSP: :c9ce77d8 EFLAGS: 00010002
> RAX: dc00 RBX:  RCX: 
> RDX: 0004 RSI: 19200019cf0c RDI: 0020
> RBP:  R08: 0001 R09: 0001
> R10: 0001 R11: 0006 R12: 88801295b880
> R13:  R14: 0020 R15: 
> FS:  7fcd745f98c0() GS:88802cb0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7ffe279f7d87 CR3: 1c7d4000 CR4: 00150ee0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  lock_acquire kernel/locking/lockdep.c:5510 [inline]
>  lock_acquire+0x1ab/0x740 kernel/locking/lockdep.c:5475
>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>  _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:159
>  gadget_setup+0x4e/0x510 drivers/usb/gadget/legacy/raw_gadget.c:327
>  dummy_timer+0x1615/0x32a0 drivers/usb/gadget/udc/dummy_hcd.c:1903
>  call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1431
>  expire_timers kernel/time/timer.c:1476 [inline]
>  __run_timers.part.0+0x67c/0xa50 kernel/time/timer.c:1745
>  __run_timers kernel/time/timer.c:1726 [inline]
>  run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1758
>  __do_softirq+0x29b/0x9f6 kernel/softirq.c:345
>  invoke_softirq kernel/softirq.c:221 [inline]
>  __irq_exit_rcu kernel/softirq.c:422 [inline]
>  irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
>  sysvec_apic_timer_interrupt+0x45/0xc0 arch/x86/kernel/apic/apic.c:1100
>  asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
> RIP: 0033:0x560cfc4a02ed
> Code: 4c 39 c1 48 89 42 18 4c 89 52 08 4c 89 5a 10 48 89 1a 0f 87 7b ff ff ff 
> 48 89 f8 48 f7 d0 48 01 c8 48 83 e0 f8 48 8d 7c 07 08 <48> 8d 0d 34 d9 02 00 
> 48 63 04 b1 48 01 c8 ff e0 0f 1f 00 48 8d 0d
> RSP: 002b:7ffe279f9dd0 EFLAGS: 0246
> RAX:  RBX: 560cfcd88e40 RCX: 560cfcd72af0
> RDX: 7ffe279f9de0 RSI: 0007 RDI: 560cfcd72af0
> RBP: 7ffe279f9e70 R08:  R09: 0020
> R10: 560cfcd72af7 R11: 560cfcd73530 R12: 560cfcd72af0
> R13:  R14: 560cfcd72b10 R15: 0001
> Modules linked in:
> ---[ end trace ab0f6632fdd289cf ]---
> RIP: 0010:__lock_acquire+0xcfe/0x54c0 kernel/locking/lockdep.c:4770
> Code: 09 0e 41 bf 01 00 00 00 0f 86 8c 00 00 00 89 05 48 69 09 0e e9 81 00 00 
> 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 5b 
> 31 00 00 49 81 3e c0 13 38 8f 0f 84 d0 f3 ff
> RSP: :c9ce77d8 EFLAGS: 00010002
> RAX: dc00 RBX:  RCX: 
> RDX: 0004 RSI: 19200019cf0c RDI: 0020
> RBP:  R08: 0001 R09: 0001
> R10: 0001 R11: 0006 R12: 88801295b880
> R13:  R14: 0020 R15: 
> FS:  

Re: [Nouveau] [PATCH v2] ALSA: hda: Continue to probe when codec probe fails

2021-04-13 Thread Roy Spliet

Op 13-04-2021 om 01:10 schreef Karol Herbst:

On Mon, Apr 12, 2021 at 9:36 PM Roy Spliet  wrote:


Hello Aaron,

Thanks for your insights. A follow-up query and some observations in-line.

Op 12-04-2021 om 20:06 schreef Aaron Plattner:

On 4/10/21 1:48 PM, Roy Spliet wrote:

Op 10-04-2021 om 20:23 schreef Lukas Wunner:

On Sat, Apr 10, 2021 at 04:51:27PM +0100, Roy Spliet wrote:

Can I ask someone with more
technical knowledge of snd_hda_intel and vgaswitcheroo to brainstorm
about
the possible challenges of nouveau taking matters into its own hand
rather
than keeping this PCI quirk around?


It sounds to me like the HDA is not powered if no cable is plugged in.
What is reponsible then for powering it up or down, firmware code on
the GPU or in the host's BIOS?


Sometimes the BIOS, but definitely unconditionally the PCI quirk code:
https://github.com/torvalds/linux/blob/master/drivers/pci/quirks.c#L5289

(CC Aaron Plattner)


My basic understanding is that the audio function stops responding
whenever the graphics function is powered off. So the requirement here
is that the audio driver can't try to talk to the audio function while
the graphics function is asleep, and must trigger a graphics function
wakeup before trying to communicate with the audio function.


I believe that vgaswitcheroo takes care of this for us.



yeah, and also: why would the driver want to do stuff? If the GPU is
turned off, there is no point in communicating with the audio device
anyway. The driver should do the initial probe and leave the device be
unless it's actively used. Also there is no such thing as "use the
audio function, but not the graphics one"


I think
there are also requirements about the audio function needing to be awake
when the graphics driver is updating the ELD, but I'm not sure.



well, it's one physical device anyway, so technically the audio
function is powered on.


This is harder on Windows because the audio driver lives in its own
little world doing its own thing but on Linux we can do better.


Ideally, we should try to find out how to control HDA power from the
operating system rather than trying to cooperate with whatever firmware
is doing.  If we have that capability, the OS should power the HDA up
and down as it sees fit.


After system boot, I don't think there's any firmware involved, but I'm
not super familiar with the low-level details and it's possible the
situation changed since I last looked at it.

I think the problem with having nouveau write this quirk is that the
kernel will need to re-probe the PCI device to notice that it has
suddenly become a multi-function device with an audio function, and
hotplug the audio driver. I originally looked into trying to do that but
it was tricky because the PCI subsystem didn't really have a mechanism
for a single-function device to become a multi-function device on the
fly and it seemed easier to enable it early on during bus enumeration.
That way the kernel sees both functions all the time without anything
else having to be special about this configuration.


Well, we do have this pci/quirk.c thing, no? Nouveau does flip the
bit, but I am actually not sure if that's even doing something
anymore. Maybe in the runtime_resume case it's still relevant but not
sure _when_ DECLARE_PCI_FIXUP_CLASS_RESUME_EARLY is triggered, it does
seem to be called even in the runtime_resume case though.



Right, so for a little more context: a while ago I noticed that my
laptop (lucky me, Asus K501UB) has a 940M with HDA but no codec. Seems
legit, given how this GPU has no displays attached; they're all hooked
up to the Intel integrated GPU. That threw off the snd_hda_intel
mid-probe, and as a result didn't permit runpm, keeping the entire GPU,
PCIe bus and thus the CPU package awake. A bit of hackery later we
decided to continue probing without a codec, and now my laptop is happy,
but...
A new problem popped up with several other NVIDIA GPUs that expose their
HDA subdevice, but somehow its inaccessible. Relevant lines from a
users' log:

[3.031222] MXM: GUID detected in BIOS
[3.031280] ACPI BIOS Error (bug): AE_AML_PACKAGE_LIMIT, Index
(0x3) is beyond end of object (length 0x0) (20200925/exoparg2-393)
[3.031352] ACPI Error: Aborting method \_SB.PCI0.GFX0._DSM due to
previous error (AE_AML_PACKAGE_LIMIT) (20200925/psparse-529)
[3.031419] ACPI: \_SB_.PCI0.GFX0: failed to evaluate _DSM (0x300b)
[3.031424] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type
mismatch - Found [Buffer], ACPI requires [Package] (20200925/nsarguments-61)
[3.031619] pci :00:02.0: optimus capabilities: enabled, status
dynamic power,
[3.031667] ACPI BIOS Error (bug): AE_AML_PACKAGE_LIMIT, Index
(0x3) is beyond end of object (length 0x0) (20200925/exoparg2-393)
[3.031731] ACPI Error: Aborting method \_SB.PCI0.GFX0._DSM due to
previous error (AE_AML_PACKAGE_LIMIT) (20200925/psparse-529)
[3.031791] ACPI Error: Aborting method 

RE: [PATCH] MIPS: Fix strnlen_user access check

2021-04-13 Thread David Laight
From: Jinyang He
> Sent: 13 April 2021 02:16
> 
> > On Mon, Apr 12, 2021 at 11:02:19AM +0800, Tiezhu Yang wrote:
> >> On 04/11/2021 07:04 PM, Jinyang He wrote:
> >>> Commit 04324f44cb69 ("MIPS: Remove get_fs/set_fs") brought a problem for
> >>> strnlen_user(). Jump out when checking access_ok() with condition that
> >>> (s + strlen(s)) < __UA_LIMIT <= (s + n). The old __strnlen_user_asm()
> >>> just checked (ua_limit & s) without checking (ua_limit & (s + n)).
> >>> Therefore, find strlen form s to __UA_LIMIT - 1 in that condition.
> >>>
> >>> Signed-off-by: Jinyang He 
> >>> ---
> >>>arch/mips/include/asm/uaccess.h | 11 +--
> >>>1 file changed, 9 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/mips/include/asm/uaccess.h b/arch/mips/include/asm/uaccess.h
> >>> index 91bc7fb..85ba0c8 100644
> >>> --- a/arch/mips/include/asm/uaccess.h
> >>> +++ b/arch/mips/include/asm/uaccess.h
> >>> @@ -630,8 +630,15 @@ static inline long strnlen_user(const char __user *s, long n)
> >>>{
> >>>   long res;
> >>> - if (!access_ok(s, n))
> >>> - return -0;
> >>> + if (unlikely(n <= 0))
> >>> + return 0;
> >>> +
> >>> + if (!access_ok(s, n)) {
> >>> + if (!access_ok(s, 0))
> >>> + return 0;
> >>> +
> >>> + n = __UA_LIMIT - (unsigned long)s - 1;
> >>> + }
> >>>   might_fault();
> >>>   __asm__ __volatile__(
> >> The following simple changes are OK to fix this issue?
> >>
> >> diff --git a/arch/mips/include/asm/uaccess.h b/arch/mips/include/asm/uaccess.h
> >> index 91bc7fb..eafc99b 100644
> >> --- a/arch/mips/include/asm/uaccess.h
> >> +++ b/arch/mips/include/asm/uaccess.h
> >> @@ -630,8 +630,8 @@ static inline long strnlen_user(const char __user *s, long n)
> >>   {
> >>  long res;
> >> -   if (!access_ok(s, n))
> >> -   return -0;
> >> +   if (!access_ok(s, 1))
> >> +   return 0;
> >>  might_fault();
> >>  __asm__ __volatile__(
> > that's the fix I'd like to apply. Could someone send it as a formal
> > patch ? Thanks.
> >
> > Thomas.
> >
> Hi, Thomas,
> 
> Thank you for bringing me more thinking.
> 
> I always think it is better to use access_ok(s, 0) on MIPS. I have been
> curious about the difference between access_ok(s, 0) and access_ok(s, 1)
> until I saw __access_ok() on RISCV at arch/riscv/include/asm/uaccess.h
> 
> The __access_ok() is noted with `Ensure that the range [addr, addr+size)
> is within the process's address space`. Is the range checked by
> __access_ok() on MIPS [addr, addr+size]? So if we want to use
> access_ok(s, 1), should we modify __access_ok()? Or is that my misunderstanding?

ISTR that access_ok(xxx, 0) is unconditionally true on some architectures.
The range checked should be [addr, addr+size).
These are needed so that write(fd, random(), 0) doesn't ever fault.

> More importantly, the implementation of strnlen_user in lib/strnlen_user.c
> is noted `we hit the address space limit, and we still had more characters
> the caller would have wanted. That's 0.` Does it make sense? It is not
> achieved on MIPS when hitting __ua_limit, if only access_ok(s, 1) is used.

There is the question of whether one call to access_ok(addr, 1) is
sufficient for any code that does sequential accesses.
It is if there is an unmapped page between the last valid user page
and the first valid kernel page.
IIRC x86 has such an unmapped page because 'horrid things' (tm) happen
if the cpu prefetches across the user-kernel boundary.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



Re: [PATCH] sched: remove the redundant comments

2021-04-13 Thread Dietmar Eggemann
On 12/04/2021 09:39, Hui Su wrote:
> Since the commit 55627e3cd22c ("sched/core: Remove rq->cpu_load[]"),
> we don't need this any more.
> 
> Signed-off-by: Hui Su 
> ---
>  kernel/sched/sched.h | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 10a1522b1e30..2232022d8561 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -897,11 +897,6 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
>  struct rq {
>   /* runqueue lock: */
>   raw_spinlock_t  lock;
> -
> - /*
> -  * nr_running and cpu_load should be in the same cacheline because
> -  * remote CPUs use both these fields when doing load calculation.
> -  */
>   unsigned intnr_running;
>  #ifdef CONFIG_NUMA_BALANCING
>   unsigned intnr_numa_running;

I forgot to remove this snippet back then. LGTM.

Add a

  Fixes: 55627e3cd22c ("sched/core: Remove rq->cpu_load[]")

line.

Reviewed-by: Dietmar Eggemann 


Re: x86: report: write to unrecognized MSR

2021-04-13 Thread Borislav Petkov
On Mon, Apr 12, 2021 at 03:09:41PM -0700, Randy Dunlap wrote:
> 
> [   27.075563] msr: Write to unrecognized MSR 0x1b0 by x86_energy_perf (pid: 1223).
> [   27.078979] msr: See https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about for details.
> 
> (aka x86_energy_perf_policy)
> 
> 
> on linux-next-20210409 / x86_64.

You're using an old x86_energy_perf tool on a new kernel. That's fixed
upstream.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH v7 1/2] platform/x86: dell-privacy: Add support for Dell hardware privacy

2021-04-13 Thread Hans de Goede
Hi,

On 4/12/21 11:19 AM, Perry Yuan wrote:
> From: Perry Yuan 
> 
> Add support for the Dell privacy driver for Dell units equipped with the
> hardware privacy design, which protects user privacy of audio and
> camera at the hardware level. Once the audio or camera privacy mode is
> activated, no application will get any audio or video stream.
> When the user presses the ctrl+F4 hotkey, audio privacy mode is
> enabled and the micmute led changes accordingly.
> The micmute led is fully controlled by hardware & EC (embedded controller)
> and the camera mute hotkey is Ctrl+F9. The current design only emits the
> SW_CAMERA_LENS_COVER event while the camera lens shutter is
> changed by EC & HW (hardware) control
> 
> *The flow is like this:
> 1) User presses key. HW does stuff with this key (timeout timer is started)
> 2) WMI event is emitted from BIOS to kernel
> 3) WMI event is received by dell-privacy
> 4) KEY_MICMUTE emitted from dell-privacy
> 5) Userland picks up key and modifies kcontrol for SW mute
> 6) Codec kernel driver catches and calls ledtrig_audio_set, like this:
>ledtrig_audio_set(LED_AUDIO_MICMUTE, rt715->micmute_led ? LED_ON :LED_OFF);
> 7) If "LED" is set to on dell-privacy notifies EC, and timeout is cancelled,
>HW mic mute activated. If EC not notified, HW mic mute will also be
>activated when timeout used up, it is just later than active ack
> 
> Signed-off-by: Perry Yuan 
> ---
> v5 -> v6:
> * addressed feedback from Hans
> * addressed feedback from Pierre
> * optimize some debug format with dev_dbg()
> * remove platform driver,combined privacy acpi driver into single wmi
>   driver file
> * optimize sysfs interface with string added to be more clearly reading
> * remove unused function and clear header file

Thank you, almost there. A few small remarks inline.

> v4 -> v5:
> * addressed feedback from Randy Dunlap
> * addressed feedback from Pierre-Louis Bossart
> * rebase to latest 5.12 rc4 upstream kernel
> * fix some space alignment problem
> v3 -> v4:
> * fix format for Kconfig
> * add sysfs document
> * add flow comments to the privacy wmi/acpi driver
> * addressed feedback from Barnabás Pőcze[Thanks very much]
> * export privacy_valid to make the global state simpler to query
> * fix one issue which will block the dell-laptop driver to load when
>   privacy driver invalid
> * addressed feedback from Pierre-Louis Bossart,remove the EC ID match
> v2 -> v3:
> * add sysfs attributes doc
> v1 -> v2:
> * query EC handle from EC driver directly.
> * fix some code style.
> * add KEY_END to keymap array.
> * clean platform device when cleanup called
> * use hexadecimal format for log print in dev_dbg
> * remove __set_bit for the report keys from probe.
> * fix keymap leak
> * add err_free_keymap in dell_privacy_wmi_probe
> * wmi driver will be unregistered if privacy_acpi_init() fails
> * add sysfs attribute files for user space query.
> * add leds micmute driver to privacy acpi
> * add more design info the commit info
> ---
> ---
>  .../testing/sysfs-platform-dell-privacy-wmi   |  55 +++
>  drivers/platform/x86/dell/Kconfig |  14 +
>  drivers/platform/x86/dell/Makefile|   1 +
>  drivers/platform/x86/dell/dell-laptop.c   |  23 +-
>  drivers/platform/x86/dell/dell-privacy-wmi.c  | 394 ++
>  drivers/platform/x86/dell/dell-privacy-wmi.h  |  23 +
>  drivers/platform/x86/dell/dell-wmi.c  |   8 +-
>  7 files changed, 511 insertions(+), 7 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi
>  create mode 100644 drivers/platform/x86/dell/dell-privacy-wmi.c
>  create mode 100644 drivers/platform/x86/dell/dell-privacy-wmi.h
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi b/Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi
> new file mode 100644
> index ..7f9e18705861
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi
> @@ -0,0 +1,55 @@
> +What:	/sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_supported_type
> +Date:Apr 2021
> +KernelVersion:   5.13
> +Contact: "perry.y...@dell.com>"
> +Description:
> + Display which dell hardware level privacy devices are supported
> + “Dell Privacy” is a set of HW, FW, and SW features to enhance
> + Dell’s commitment to platform privacy for MIC, Camera, and
> + ePrivacy screens.
> + The supported hardware privacy devices are:
> +Attributes:
> + Microphone Mute:
> + Identifies that the local microphone can be muted by hardware;
> + no application is able to capture system mic sound
> +
> + Camera Shutter:
> + Indicates that the camera shutter is controlled by
> + hardware, which is a micromechanical shutter assembly
> + that is built onto the camera 

linux-next: manual merge of the akpm-current tree with the arm64 tree

2021-04-13 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the akpm-current tree got a conflict in:

  lib/test_kasan.c

between commits:

  2603f8a78dfb ("kasan: Add KASAN mode kernel parameter")
  e80a76aa1a91 ("kasan, arm64: tests supports for HW_TAGS async mode")

from the arm64 tree and commit:

  c616ba7e0d63 ("kasan: detect false-positives in tests")

from the akpm-current tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non-trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc lib/test_kasan.c
index 785e724ce0d8,bf9225002a7e..
--- a/lib/test_kasan.c
+++ b/lib/test_kasan.c
@@@ -78,33 -83,30 +83,35 @@@ static void kasan_test_exit(struct kuni
   * fields, it can reorder or optimize away the accesses to those fields.
   * Use READ/WRITE_ONCE() for the accesses and compiler barriers around the
   * expression to prevent that.
+  *
+  * In between KUNIT_EXPECT_KASAN_FAIL checks, fail_data.report_found is kept
+  * as false. This allows detecting KASAN reports that happen outside of the
+  * checks by asserting !fail_data.report_found at the start of
+  * KUNIT_EXPECT_KASAN_FAIL and in kasan_test_exit.
   */
- #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do {\
-   if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
-   !kasan_async_mode_enabled())\
-   migrate_disable();  \
-   WRITE_ONCE(fail_data.report_expected, true);\
-   WRITE_ONCE(fail_data.report_found, false);  \
-   kunit_add_named_resource(test,  \
-   NULL,   \
-   NULL,   \
-   &resource,  \
-   "kasan_data", &fail_data);  \
-   barrier();  \
-   expression; \
-   barrier();  \
-   if (kasan_async_mode_enabled()) \
-   kasan_force_async_fault();  \
-   barrier();  \
-   KUNIT_EXPECT_EQ(test,   \
-   READ_ONCE(fail_data.report_expected),   \
-   READ_ONCE(fail_data.report_found)); \
-   if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
-   !kasan_async_mode_enabled()) {  \
-   if (READ_ONCE(fail_data.report_found))  \
-   kasan_enable_tagging_sync();\
-   migrate_enable();   \
-   }   \
+ #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do {
\
 -  if (IS_ENABLED(CONFIG_KASAN_HW_TAGS))   \
++  if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
++  !kasan_async_mode_enabled())\
+   migrate_disable();  \
+   KUNIT_EXPECT_FALSE(test, READ_ONCE(fail_data.report_found));\
+   WRITE_ONCE(fail_data.report_expected, true);\
+   barrier();  \
+   expression; \
+   barrier();  \
++  if (kasan_async_mode_enabled()) \
++  kasan_force_async_fault();  \
++  barrier();  \
+   KUNIT_EXPECT_EQ(test,   \
+   READ_ONCE(fail_data.report_expected),   \
+   READ_ONCE(fail_data.report_found)); \
 -  if (IS_ENABLED(CONFIG_KASAN_HW_TAGS)) { \
++  if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \
++  !kasan_async_mode_enabled()) {  \
+   if (READ_ONCE(fail_data.report_found))  \
 -  kasan_enable_tagging(); \
++  kasan_enable_tagging_sync();\
+   migrate_enable();   \
+   }   \
+   WRITE_ONCE(fail_data.report_found, false);  \

[Ping for Dmitry] Re: [PATCH v5 3/3] iio: adc: add ADC driver for the TI TSC2046 controller

2021-04-13 Thread Oleksij Rempel
Hi Dmitry,

probably this mail passed under your radar. Can you please add your
statement here.

On Mon, Mar 29, 2021 at 11:58:26AM +0100, Jonathan Cameron wrote:
> On Mon, 29 Mar 2021 09:31:31 +0200
> Oleksij Rempel  wrote:
> 
> > Basically the TI TSC2046 touchscreen controller is 8 channel ADC optimized 
> > for
> > the touchscreen use case. By implementing it as an IIO ADC device, we can
> > make use of resistive-adc-touch and iio-hwmon drivers.
> > 
> > Polled readings are currently not implemented to keep this patch small, so
> > iio-hwmon will not work out of the box for now.
> > 
> > So far, this driver was tested with a custom version of resistive-adc-touch 
> > driver,
> > since it needs to be extended to make use of Z1 and Z2 channels. The X/Y
> > are working without additional changes.
> > 
> > Signed-off-by: Oleksij Rempel 
> > Reviewed-by: Andy Shevchenko 
> Hi Oleksij,
> 
> Couple of things in here I missed before, but big question is still whether
> Dmitry is happy with what you mention in the cover letter:
> 
> "This driver can replace drivers/input/touchscreen/ads7846.c and has
> following advantages over it:
> - less code to maintain
> - shared code paths (resistive-adc-touch, iio-hwmon, etc)
> - can be used as plain IIO ADC to investigate signaling issues or test
>   real capacity of the plates and attached low-pass filters
>   (or use the touchscreen as a microphone if you like ;) )"
> 
> So two things that need addressing in here are
> iio_dev->name (part number, not hybrid of that an spi device name)
> Why oversampling is DT rather than userspace controllable.
> For that I'm looking for clear reasoning for the choice.
 
Regards,
Oleksij
-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |


[PATCH] f2fs: fix to avoid NULL pointer dereference

2021-04-13 Thread Chao Yu
From: Yi Chen 

Unable to handle kernel NULL pointer dereference at virtual address 

pc : f2fs_put_page+0x1c/0x26c
lr : __revoke_inmem_pages+0x544/0x75c
f2fs_put_page+0x1c/0x26c
__revoke_inmem_pages+0x544/0x75c
__f2fs_commit_inmem_pages+0x364/0x3c0
f2fs_commit_inmem_pages+0xc8/0x1a0
f2fs_ioc_commit_atomic_write+0xa4/0x15c
f2fs_ioctl+0x5b0/0x1574
file_ioctl+0x154/0x320
do_vfs_ioctl+0x164/0x740
__arm64_sys_ioctl+0x78/0xa4
el0_svc_common+0xbc/0x1d0
el0_svc_handler+0x74/0x98
el0_svc+0x8/0xc

In f2fs_put_page(), we access page->mapping, which is NULL.
The root cause is:
in some cases, the page refcount and the ATOMIC_WRITTEN_PAGE
flag are not set even though the page-private flag has been set.
We add f2fs_bug_on like this:

f2fs_register_inmem_page()
{
...
f2fs_set_page_private(page, ATOMIC_WRITTEN_PAGE);

f2fs_bug_on(F2FS_I_SB(inode), !IS_ATOMIC_WRITTEN_PAGE(page));
...
}

The f2fs_bug_on() call stack looks like this:
PC is at f2fs_register_inmem_page+0x238/0x2b4
LR is at f2fs_register_inmem_page+0x2a8/0x2b4
f2fs_register_inmem_page+0x238/0x2b4
f2fs_set_data_page_dirty+0x104/0x164
set_page_dirty+0x78/0xc8
f2fs_write_end+0x1b4/0x444
generic_perform_write+0x144/0x1cc
__generic_file_write_iter+0xc4/0x174
f2fs_file_write_iter+0x2c0/0x350
__vfs_write+0x104/0x134
vfs_write+0xe8/0x19c
SyS_pwrite64+0x78/0xb8

To fix this issue, let's take the page refcount when the
page-private flag has already been set. Why the page-private
flag is not cleared needs further analysis.

Signed-off-by: Chao Yu 
Signed-off-by: Ge Qiu 
Signed-off-by: Dehe Gu 
Signed-off-by: Yi Chen 
---
 fs/f2fs/segment.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 0cb1ca88d4aa..d6c6c13feb43 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -186,7 +186,10 @@ void f2fs_register_inmem_page(struct inode *inode, struct 
page *page)
 {
struct inmem_pages *new;
 
-   f2fs_set_page_private(page, ATOMIC_WRITTEN_PAGE);
+   if (PagePrivate(page))
+   set_page_private(page, (unsigned long)ATOMIC_WRITTEN_PAGE);
+   else
+   f2fs_set_page_private(page, ATOMIC_WRITTEN_PAGE);
 
new = f2fs_kmem_cache_alloc(inmem_entry_slab, GFP_NOFS);
 
-- 
2.29.2



Re: [PATCH v2 10/12] usb: dwc2: Add clock gating entering flow by system suspend

2021-04-13 Thread Sergei Shtylyov

On 13.04.2021 10:37, Artur Petrosyan wrote:


If not hibernation nor partial power down are supported,


   s/not/neither/?


clock gating is used to save power.

Adds a new flow of entering clock gating when PC is
suspended.

Signed-off-by: Artur Petrosyan 
---
  Changes in v2:
  - None

  drivers/usb/dwc2/hcd.c | 9 +
  1 file changed, 9 insertions(+)

diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index 31d6a1b87228..09dcd37b9ef8 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -4372,6 +4372,15 @@ static int _dwc2_hcd_suspend(struct usb_hcd *hcd)
break;
case DWC2_POWER_DOWN_PARAM_HIBERNATION:
case DWC2_POWER_DOWN_PARAM_NONE:
+   /*
+* If not hibernation nor partial power down are supported,


   s/not/neither/?


+* clock gating is used to save power.
+*/
+   dwc2_host_enter_clock_gating(hsotg);
+
+   /* After entering suspend, hardware is not accessible */
+   clear_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags);
+   break;
default:
goto skip_power_saving;
}

MBR, Sergei


Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-13 Thread Andi Kleen
Alex Kogan  writes:
>  
> + numa_spinlock_threshold=[NUMA, PV_OPS]
> + Set the time threshold in milliseconds for the
> + number of intra-node lock hand-offs before the
> + NUMA-aware spinlock is forced to be passed to
> + a thread on another NUMA node.  Valid values
> + are in the [1..100] range. Smaller values result
> + in a more fair, but less performant spinlock,
> + and vice versa. The default value is 10.

ms granularity seems very coarse grained for this. Surely
at some point of spinning you can afford a ktime_get? But ok.

Could you turn that into a moduleparm which can be changed at runtime?
Would be strange to have to reboot just to play with this parameter

This would also make the code a lot shorter I guess.

-Andi


Re: [syzbot] KASAN: slab-out-of-bounds Read in reiserfs_xattr_get

2021-04-13 Thread Dmitry Vyukov
On Tue, Apr 13, 2021 at 7:55 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:3a229812 Merge tag 'arm-fixes-5.11-2' of git://git.kernel...
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16b4d196d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=f91155ccddaf919c
> dashboard link: https://syzkaller.appspot.com/bug?extid=72ba979b6681c3369db4
> compiler:   Debian clang version 11.0.1-2
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+72ba979b6681c3369...@syzkaller.appspotmail.com

Maybe related to:
https://lore.kernel.org/lkml/5f397905ba42a...@google.com/
? there are some uninits involved in reiserfs attrs.

> loop3: detected capacity change from 0 to 65534
> ==
> BUG: KASAN: slab-out-of-bounds in reiserfs_xattr_get+0xe0/0x590 
> fs/reiserfs/xattr.c:681
> Read of size 8 at addr 888028983198 by task syz-executor.3/4211
>
> CPU: 1 PID: 4211 Comm: syz-executor.3 Not tainted 5.12.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x176/0x24e lib/dump_stack.c:120
>  print_address_description+0x5f/0x3a0 mm/kasan/report.c:232
>  __kasan_report mm/kasan/report.c:399 [inline]
>  kasan_report+0x15c/0x200 mm/kasan/report.c:416
>  reiserfs_xattr_get+0xe0/0x590 fs/reiserfs/xattr.c:681
>  reiserfs_get_acl+0x63/0x670 fs/reiserfs/xattr_acl.c:211
>  get_acl+0x152/0x2e0 fs/posix_acl.c:141
>  check_acl fs/namei.c:294 [inline]
>  acl_permission_check fs/namei.c:339 [inline]
>  generic_permission+0x2ed/0x5b0 fs/namei.c:392
>  do_inode_permission fs/namei.c:446 [inline]
>  inode_permission+0x28e/0x500 fs/namei.c:513
>  may_open+0x228/0x3e0 fs/namei.c:2985
>  do_open fs/namei.c:3365 [inline]
>  path_openat+0x2697/0x3860 fs/namei.c:3500
>  do_filp_open+0x1a3/0x3b0 fs/namei.c:3527
>  do_sys_openat2+0xba/0x380 fs/open.c:1187
>  do_sys_open fs/open.c:1203 [inline]
>  __do_sys_openat fs/open.c:1219 [inline]
>  __se_sys_openat fs/open.c:1214 [inline]
>  __x64_sys_openat+0x1c8/0x1f0 fs/open.c:1214
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x419544
> Code: 84 00 00 00 00 00 44 89 54 24 0c e8 96 f9 ff ff 44 8b 54 24 0c 44 89 e2 
> 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 
> 34 44 89 c7 89 44 24 0c e8 c8 f9 ff ff 8b 44
> RSP: 002b:7fa357a03f30 EFLAGS: 0293 ORIG_RAX: 0101
> RAX: ffda RBX: 2200 RCX: 00419544
> RDX: 0001 RSI: 2100 RDI: ff9c
> RBP: 2100 R08:  R09: 2000
> R10:  R11: 0293 R12: 0001
> R13: 2100 R14: 7fa357a04000 R15: 20065600
>
> Allocated by task 4210:
>  kasan_save_stack mm/kasan/common.c:38 [inline]
>  kasan_set_track mm/kasan/common.c:46 [inline]
>  set_alloc_info mm/kasan/common.c:427 [inline]
>  kasan_kmalloc+0xc2/0xf0 mm/kasan/common.c:506
>  kasan_kmalloc include/linux/kasan.h:233 [inline]
>  kmem_cache_alloc_trace+0x21b/0x350 mm/slub.c:2934
>  kmalloc include/linux/slab.h:554 [inline]
>  kzalloc include/linux/slab.h:684 [inline]
>  smk_fetch security/smack/smack_lsm.c:288 [inline]
>  smack_d_instantiate+0x65c/0xcc0 security/smack/smack_lsm.c:3411
>  security_d_instantiate+0xa5/0x100 security/security.c:1987
>  d_instantiate_new+0x61/0x110 fs/dcache.c:2025
>  ext4_add_nondir+0x22b/0x290 fs/ext4/namei.c:2590
>  ext4_symlink+0x8ce/0xe90 fs/ext4/namei.c:3417
>  vfs_symlink+0x3a0/0x540 fs/namei.c:4178
>  do_symlinkat+0x1c9/0x440 fs/namei.c:4208
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> Freed by task 4210:
>  kasan_save_stack mm/kasan/common.c:38 [inline]
>  kasan_set_track+0x3d/0x70 mm/kasan/common.c:46
>  kasan_set_free_info+0x1f/0x40 mm/kasan/generic.c:357
>  kasan_slab_free+0x100/0x140 mm/kasan/common.c:360
>  kasan_slab_free include/linux/kasan.h:199 [inline]
>  slab_free_hook mm/slub.c:1562 [inline]
>  slab_free_freelist_hook+0x171/0x270 mm/slub.c:1600
>  slab_free mm/slub.c:3161 [inline]
>  kfree+0xcf/0x2d0 mm/slub.c:4213
>  smk_fetch security/smack/smack_lsm.c:300 [inline]
>  smack_d_instantiate+0x6db/0xcc0 security/smack/smack_lsm.c:3411
>  security_d_instantiate+0xa5/0x100 security/security.c:1987
>  d_instantiate_new+0x61/0x110 fs/dcache.c:2025
>  ext4_add_nondir+0x22b/0x290 fs/ext4/namei.c:2590
>  ext4_symlink+0x8ce/0xe90 fs/ext4/namei.c:3417
>  vfs_symlink+0x3a0/0x540 fs/namei.c:4178
>  do_symlinkat+0x1c9/0x440 fs/namei.c:4208
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> 

Re: [PATCH 1/1] arm: topology: parse the topology from the dt

2021-04-13 Thread Ruifeng Zhang
Valentin Schneider  于2021年4月12日周一 下午11:33写道:
>
> On 12/04/21 20:20, Ruifeng Zhang wrote:
> > There is a armv8.3 cpu which should work normally both on aarch64 and 
> > aarch32.
> > The MPIDR has been written to the chip register in armv8.3 format.
> > For example,
> > core0: 8000
> > core1: 8100
> > core2: 8200
> > ...
> >
> > Its cpu topology can be parsed normally on aarch64 mode (both
> > userspace and kernel work on arm64).
> >
> > The problem is when it working on aarch32 mode (both userspace and
> > kernel work on arm 32-bit),
>
> I didn't know using aarch32 elsewhere than EL0 was something actually being
> used. Do you deploy this somewhere, or do you use it for testing purposes?

In Unisoc, the sc9863a SoC, which uses the cortex-a55, has two
software versions; in one of them the kernel runs on EL1 using
aarch32.
             user(EL0)   kernel(EL1)
sc9863a_go   aarch32     aarch32
sc9863a      aarch64     aarch64
>
> > the cpu topology
> > will parse error because of the format is different between armv7 and 
> > armv8.3.
> > The arm 32-bit driver, arch/arm/kernel/topology will parse the MPIDR
> > and store to the topology with armv7,
> > and the result is all cpu core_id is 0, the bit[1:0] of armv7 MPIDR format.
> >
>
> I'm not fluent at all in armv7 (or most aarch32 compat mode stuff), but
> I couldn't find anything about MPIDR format differences:
>
>   DDI 0487G.a G8.2.113
>   """
>   AArch32 System register MPIDR bits [31:0] are architecturally mapped to
>   AArch64 System register MPIDR_EL1[31:0].
>   """
>
> Peeking at some armv7 doc and arm/kernel/topology.c the layout really looks
> just the same, i.e. for both of them, with your example of:

The cortex-a7 spec DDI0464F 4.3.5
https://developer.arm.com/documentation/ddi0464/f/?lang=en

The current arch/arm/kernel/topology code parse the MPIDR with a armv7 format.
the parse code is:
void store_cpu_topology(unsigned int cpuid)
{
...
cpuid_topo->thread_id = -1;
cpuid_topo->core_id = MPIDR_AFFINITY_LEVEL(mpidr, 0);
cpuid_topo->package_id = MPIDR_AFFINITY_LEVEL(mpidr, 1);
...
}
>
>   core0: 8000
>   core1: 8100
>   core2: 8200
>   ...
>
> we'll get:
>
>   |   | aff2 | aff1 | aff0 |
>   |---+--+--+--|
>   | Core0 |0 |0 |0 |
>   | Core1 |0 |1 |0 |
>   | Core2 |0 |2 |0 |
>   ...
>
> Now, arm64 doesn't fallback to MPIDR for topology information anymore since
>
>   3102bc0e6ac7 ("arm64: topology: Stop using MPIDR for topology information")
>
> so without DT we would get:
>   |   | package_id | core_id |
>   |---++-|
>   | Core0 |  0 |   0 |
>   | Core1 |  0 |   1 |
>   | Core2 |  0 |   2 |
>
> Whereas with an arm kernel we'll end up parsing MPIDR as:
>   |   | package_id | core_id |
>   |---++-|
>   | Core0 |  0 |   0 |
>   | Core1 |  1 |   0 |
>   | Core2 |  2 |   0 |
>
> Did I get this right? Is this what you're observing?

Yes, this is a problem if an armv8.2 or above cpu is running a 32-bit
kernel on EL1.
>
> > In addition, I think arm should also allow customers to configure cpu
> > topologies via DT.


[PATCH 1/8] MIPS: pci-rt2880: fix slot 0 configuration

2021-04-13 Thread Ilya Lipnitskiy
pci_fixup_irqs() used to call pcibios_map_irq on every PCI device, which
for RT2880 included bus 0 slot 0. After pci_fixup_irqs() got removed,
only slots/funcs with devices attached would be called. While arguably
the right thing, that left no chance for this driver to ever initialize
slot 0, effectively bricking PCI and USB on RT2880 devices such as the
Belkin F5D8235-4 v1.

Slot 0 configuration needs to happen after PCI bus enumeration, but
before any device at slot 0x11 (func 0 or 1) is talked to. That was
determined empirically by testing on a Belkin F5D8235-4 v1 device. A
minimal BAR 0 config write followed by read, then setting slot 0
PCI_COMMAND to MASTER | IO | MEMORY is all that seems to be required for
proper functionality.

Tested by ensuring that full- and high-speed USB devices get enumerated
on the Belkin F5D8235-4 v1 (with an out of tree DTS file from OpenWrt).

Fixes: 04c81c7293df ("MIPS: PCI: Replace pci_fixup_irqs() call with host bridge 
IRQ mapping hooks")
Signed-off-by: Ilya Lipnitskiy 
Cc: Lorenzo Pieralisi 
Cc: Tobias Wolf 
Cc:  # v4.14+
---
 arch/mips/pci/pci-rt2880.c | 50 +-
 1 file changed, 33 insertions(+), 17 deletions(-)

diff --git a/arch/mips/pci/pci-rt2880.c b/arch/mips/pci/pci-rt2880.c
index e1f12e398136..19f7860fb28b 100644
--- a/arch/mips/pci/pci-rt2880.c
+++ b/arch/mips/pci/pci-rt2880.c
@@ -66,9 +66,13 @@ static int rt2880_pci_config_read(struct pci_bus *bus, 
unsigned int devfn,
unsigned long flags;
u32 address;
u32 data;
+   int busn = 0;
 
-   address = rt2880_pci_get_cfgaddr(bus->number, PCI_SLOT(devfn),
-PCI_FUNC(devfn), where);
+   if (bus)
+   busn = bus->number;
+
+   address = rt2880_pci_get_cfgaddr(busn, PCI_SLOT(devfn), PCI_FUNC(devfn),
+where);
 
 spin_lock_irqsave(&rt2880_pci_lock, flags);
rt2880_pci_reg_write(address, RT2880_PCI_REG_CONFIG_ADDR);
@@ -96,9 +100,13 @@ static int rt2880_pci_config_write(struct pci_bus *bus, 
unsigned int devfn,
unsigned long flags;
u32 address;
u32 data;
+   int busn = 0;
+
+   if (bus)
+   busn = bus->number;
 
-   address = rt2880_pci_get_cfgaddr(bus->number, PCI_SLOT(devfn),
-PCI_FUNC(devfn), where);
+   address = rt2880_pci_get_cfgaddr(busn, PCI_SLOT(devfn), PCI_FUNC(devfn),
+where);
 
 spin_lock_irqsave(&rt2880_pci_lock, flags);
rt2880_pci_reg_write(address, RT2880_PCI_REG_CONFIG_ADDR);
@@ -180,7 +188,6 @@ static inline void rt2880_pci_write_u32(unsigned long reg, 
u32 val)
 
 int pcibios_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
 {
-   u16 cmd;
int irq = -1;
 
if (dev->bus->number != 0)
@@ -188,8 +195,6 @@ int pcibios_map_irq(const struct pci_dev *dev, u8 slot, u8 
pin)
 
switch (PCI_SLOT(dev->devfn)) {
case 0x00:
-   rt2880_pci_write_u32(PCI_BASE_ADDRESS_0, 0x0800);
-   (void) rt2880_pci_read_u32(PCI_BASE_ADDRESS_0);
break;
case 0x11:
irq = RT288X_CPU_IRQ_PCI;
@@ -201,16 +206,6 @@ int pcibios_map_irq(const struct pci_dev *dev, u8 slot, u8 
pin)
break;
}
 
-   pci_write_config_byte((struct pci_dev *) dev,
-   PCI_CACHE_LINE_SIZE, 0x14);
-   pci_write_config_byte((struct pci_dev *) dev, PCI_LATENCY_TIMER, 0xFF);
-   pci_read_config_word((struct pci_dev *) dev, PCI_COMMAND, &cmd);
-   cmd |= PCI_COMMAND_MASTER | PCI_COMMAND_IO | PCI_COMMAND_MEMORY |
-   PCI_COMMAND_INVALIDATE | PCI_COMMAND_FAST_BACK |
-   PCI_COMMAND_SERR | PCI_COMMAND_WAIT | PCI_COMMAND_PARITY;
-   pci_write_config_word((struct pci_dev *) dev, PCI_COMMAND, cmd);
-   pci_write_config_byte((struct pci_dev *) dev, PCI_INTERRUPT_LINE,
- dev->irq);
return irq;
 }
 
@@ -251,6 +246,27 @@ static int rt288x_pci_probe(struct platform_device *pdev)
 
 int pcibios_plat_dev_init(struct pci_dev *dev)
 {
+   static bool slot0_init;
+
+   /*
+* Nobody seems to initialize slot 0, but this platform requires it, so
+* do it once when some other slot is being enabled. The PCI subsystem
+* should configure other slots properly, so no need to do anything
+* special for those.
+*/
+   if (!slot0_init) {
+   u32 cmd;
+
+   slot0_init = true;
+
+   rt2880_pci_write_u32(PCI_BASE_ADDRESS_0, 0x0800);
+   (void) rt2880_pci_read_u32(PCI_BASE_ADDRESS_0);
+
+   rt2880_pci_config_read(NULL, 0, PCI_COMMAND, 2, &cmd);
+   cmd |= PCI_COMMAND_MASTER | PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
+   rt2880_pci_config_write(NULL, 0, PCI_COMMAND, 2, cmd);
+   }
+
return 0;
 }
 
-- 
2.31.1



[PATCH 5/8] MIPS: pci-legacy: stop using of_pci_range_to_resource

2021-04-13 Thread Ilya Lipnitskiy
Mirror commit aeba3731b150 ("powerpc/pci: Fix IO space breakage after
of_pci_range_to_resource() change").

Most MIPS platforms do not define PCI_IOBASE, nor implement
pci_address_to_pio(). Moreover, IO_SPACE_LIMIT is 0x for most MIPS
platforms. of_pci_range_to_resource passes the _start address_ of the IO
range into pci_address_to_pio, which then checks it against
IO_SPACE_LIMIT and fails, because for MIPS platforms that use
pci-legacy (pci-lantiq, pci-rt3883, pci-mt7620), IO ranges start much
higher than 0x.

In fact, pci-mt7621 in staging already works around this problem, see
commit 09dd629eeabb ("staging: mt7621-pci: fix io space and properly set
resource limits")

So just stop using of_pci_range_to_resource, which does not work for
MIPS.

Fixes PCI errors like:
  pci_bus :00: root bus resource [io  0x]

Fixes: 0b0b0893d49b ("of/pci: Fix the conversion of IO ranges into IO 
resources")
Signed-off-by: Ilya Lipnitskiy 
Cc: Liviu Dudau 
---
 arch/mips/pci/pci-legacy.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/mips/pci/pci-legacy.c b/arch/mips/pci/pci-legacy.c
index 39052de915f3..3a909194284a 100644
--- a/arch/mips/pci/pci-legacy.c
+++ b/arch/mips/pci/pci-legacy.c
@@ -166,8 +166,13 @@ void pci_load_of_ranges(struct pci_controller *hose, 
struct device_node *node)
res = hose->mem_resource;
break;
}
-   if (res != NULL)
-   of_pci_range_to_resource(, node, res);
+   if (res != NULL) {
+   res->name = node->full_name;
+   res->flags = range.flags;
+   res->start = range.cpu_addr;
+   res->end = range.cpu_addr + range.size - 1;
+   res->parent = res->child = res->sibling = NULL;
+   }
}
 }
 
-- 
2.31.1



[PATCH 8/8] MIPS: pci-legacy: use generic pci_enable_resources

2021-04-13 Thread Ilya Lipnitskiy
Follow the reasoning from commit 842de40d93e0 ("PCI: add generic
pci_enable_resources()"):

  The only functional difference from the MIPS version is that the
  generic one uses "!r->parent" to check for resource collisions
  instead of "!r->start && r->end".

That should have no effect on any pci-legacy driver.

Suggested-by: Bjorn Helgaas 
Signed-off-by: Ilya Lipnitskiy 
---
 arch/mips/pci/pci-legacy.c | 40 ++
 1 file changed, 2 insertions(+), 38 deletions(-)

diff --git a/arch/mips/pci/pci-legacy.c b/arch/mips/pci/pci-legacy.c
index 78c22987bef0..c24226ea0a6e 100644
--- a/arch/mips/pci/pci-legacy.c
+++ b/arch/mips/pci/pci-legacy.c
@@ -241,47 +241,11 @@ static int __init pcibios_init(void)
 
 subsys_initcall(pcibios_init);
 
-static int pcibios_enable_resources(struct pci_dev *dev, int mask)
-{
-   u16 cmd, old_cmd;
-   int idx;
-   struct resource *r;
-
-   pci_read_config_word(dev, PCI_COMMAND, );
-   old_cmd = cmd;
-   for (idx=0; idx < PCI_NUM_RESOURCES; idx++) {
-   /* Only set up the requested stuff */
-   if (!(mask & (1<<idx)))
-   continue;
-   r = &dev->resource[idx];
-   if (!(r->flags & (IORESOURCE_IO | IORESOURCE_MEM)))
-   continue;
-   if ((idx == PCI_ROM_RESOURCE) &&
-   (!(r->flags & IORESOURCE_ROM_ENABLE)))
-   continue;
-   if (!r->start && r->end) {
-   pci_err(dev,
-   "can't enable device: resource collisions\n");
-   return -EINVAL;
-   }
-   if (r->flags & IORESOURCE_IO)
-   cmd |= PCI_COMMAND_IO;
-   if (r->flags & IORESOURCE_MEM)
-   cmd |= PCI_COMMAND_MEMORY;
-   }
-   if (cmd != old_cmd) {
-   pci_info(dev, "enabling device (%04x -> %04x)\n", old_cmd, cmd);
-   pci_write_config_word(dev, PCI_COMMAND, cmd);
-   }
-   return 0;
-}
-
 int pcibios_enable_device(struct pci_dev *dev, int mask)
 {
-   int err;
+   int err = pci_enable_resources(dev, mask);
 
-   if ((err = pcibios_enable_resources(dev, mask)) < 0)
+   if (err < 0)
return err;
 
return pcibios_plat_dev_init(dev);
-- 
2.31.1



Re: [PATCH RESEND v1 0/4] powerpc/vdso: Add support for time namespaces

2021-04-13 Thread Michael Ellerman
Thomas Gleixner  writes:
> On Wed, Mar 31 2021 at 16:48, Christophe Leroy wrote:
>> [Sorry, resending with complete destination list, I used the wrong script on 
>> the first delivery]
>>
>> This series adds support for time namespaces on powerpc.
>>
>> All timens selftests are successfull.
>
> If PPC people want to pick up the whole lot, no objections from my side.

Thanks, will do.

cheers


[PATCH v2 14/16] mm: multigenerational lru: user interface

2021-04-13 Thread Yu Zhao
Add a sysfs file /sys/kernel/mm/lru_gen/enabled so users can enable
and disable the multigenerational lru at runtime.

Add a sysfs file /sys/kernel/mm/lru_gen/spread so users can spread
pages out across multiple generations. More generations make the
background aging more aggressive.

Add a debugfs file /sys/kernel/debug/lru_gen so users can monitor the
multigenerational lru and trigger the aging and the eviction. This
file has the following output:
  memcg  memcg_id  memcg_path
node  node_id
  min_gen  birth_time  anon_size  file_size
  ...
  max_gen  birth_time  anon_size  file_size

Given a memcg and a node, "min_gen" is the oldest generation (number)
and "max_gen" is the youngest. Birth time is in milliseconds. The
sizes of anon and file types are in pages.

This file takes the following input:
  + memcg_id node_id gen [swappiness]
  - memcg_id node_id gen [swappiness] [nr_to_reclaim]

The first command line accounts referenced pages to generation
"max_gen" and creates the next generation "max_gen"+1. In this case,
"gen" should be equal to "max_gen". A swap file and a non-zero
"swappiness" are required to scan anon type. If swapping is not
desired, set vm.swappiness to 0. The second command line evicts
generations less than or equal to "gen". In this case, "gen" should be
less than "max_gen"-1 as "max_gen" and "max_gen"-1 are active
generations and therefore protected from the eviction. Use
"nr_to_reclaim" to limit the number of pages to be evicted. Multiple
command lines are supported, as is concatenation with the delimiters ","
and ";".

Signed-off-by: Yu Zhao 
---
 mm/vmscan.c | 405 
 1 file changed, 405 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 01c475386379..284e32d897cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -6248,6 +6250,403 @@ static int __meminit __maybe_unused 
lru_gen_online_mem(struct notifier_block *se
return NOTIFY_DONE;
 }
 
+/******************************************************************************
+ *                          sysfs interface
+ ******************************************************************************/
+
+static ssize_t show_lru_gen_spread(struct kobject *kobj, struct kobj_attribute *attr,
+  char *buf)
+{
+   return sprintf(buf, "%d\n", READ_ONCE(lru_gen_spread));
+}
+
+static ssize_t store_lru_gen_spread(struct kobject *kobj, struct kobj_attribute *attr,
+   const char *buf, size_t len)
+{
+   int spread;
+
+   if (kstrtoint(buf, 10, &spread) || spread >= MAX_NR_GENS)
+   return -EINVAL;
+
+   WRITE_ONCE(lru_gen_spread, spread);
+
+   return len;
+}
+
+static struct kobj_attribute lru_gen_spread_attr = __ATTR(
+   spread, 0644, show_lru_gen_spread, store_lru_gen_spread
+);
+
+static ssize_t show_lru_gen_enabled(struct kobject *kobj, struct kobj_attribute *attr,
+   char *buf)
+{
+   return snprintf(buf, PAGE_SIZE, "%ld\n", lru_gen_enabled());
+}
+
+static ssize_t store_lru_gen_enabled(struct kobject *kobj, struct kobj_attribute *attr,
+const char *buf, size_t len)
+{
+   int enable;
+
+   if (kstrtoint(buf, 10, &enable))
+   return -EINVAL;
+
+   lru_gen_set_state(enable, true, false);
+
+   return len;
+}
+
+static struct kobj_attribute lru_gen_enabled_attr = __ATTR(
+   enabled, 0644, show_lru_gen_enabled, store_lru_gen_enabled
+);
+
+static struct attribute *lru_gen_attrs[] = {
+   &lru_gen_spread_attr.attr,
+   &lru_gen_enabled_attr.attr,
+   NULL
+};
+
+static struct attribute_group lru_gen_attr_group = {
+   .name = "lru_gen",
+   .attrs = lru_gen_attrs,
+};
+
+/******************************************************************************
+ *                          debugfs interface
+ ******************************************************************************/
+
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
+{
+   struct mem_cgroup *memcg;
+   loff_t nr_to_skip = *pos;
+
+   m->private = kzalloc(PATH_MAX, GFP_KERNEL);
+   if (!m->private)
+   return ERR_PTR(-ENOMEM);
+
+   memcg = mem_cgroup_iter(NULL, NULL, NULL);
+   do {
+   int nid;
+
+   for_each_node_state(nid, N_MEMORY) {
+   if (!nr_to_skip--)
+   return mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+   }
+   } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+   return NULL;
+}
+
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
+{
+   if (!IS_ERR_OR_NULL(v))
+   mem_cgroup_iter_break(NULL, lruvec_memcg(v));
+
+   kfree(m->private);
+   m->private = NULL;
+}
+
+static void *lru_gen_seq_next(struct seq_file *m, 

[PATCH v2 16/16] mm: multigenerational lru: documentation

2021-04-13 Thread Yu Zhao
Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao 
---
 Documentation/vm/index.rst|   1 +
 Documentation/vm/multigen_lru.rst | 192 ++
 2 files changed, 193 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
 
swap_numa
zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==
diff --git a/Documentation/vm/multigen_lru.rst 
b/Documentation/vm/multigen_lru.rst
new file mode 100644
index ..cf772aeca317
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,192 @@
+=
+Multigenerational LRU
+=
+
+Quick Start
+===
+Build Options
+-
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support
+ a maximum of ``X`` generations.
+
+:Optional: Change ``CONFIG_TIERS_PER_GEN`` to a number ``Y`` to support
+ a maximum of ``Y`` tiers per generation.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+ default.
+
+Runtime Options
+---
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enabled`` if the
+ feature was not turned on by default.
+
+:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N``
+ to spread pages out across ``N+1`` generations. ``N`` should be less
+ than ``X``. Larger values make the background aging more aggressive.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature.
+ This file has the following output:
+
+::
+
+  memcg  memcg_id  memcg_path
+node  node_id
+  min_gen  birth_time  anon_size  file_size
+  ...
+  max_gen  birth_time  anon_size  file_size
+
+Given a memcg and a node, ``min_gen`` is the oldest generation
+(number) and ``max_gen`` is the youngest. Birth time is in
+milliseconds. The sizes of anon and file types are in pages.
+
+Recipes
+-------
+:Android on ARMv8.1+: ``X=4``, ``N=0``
+
+:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of
+ ``ARM64_HW_AFDBM``
+
+:Laptops running Chrome on x86_64: ``X=7``, ``N=2``
+
+:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]``
+ to ``/sys/kernel/debug/lru_gen`` to account referenced pages to
+ generation ``max_gen`` and create the next generation ``max_gen+1``.
+ ``gen`` should be equal to ``max_gen``. A swap file and a non-zero
+ ``swappiness`` are required to scan anon type. If swapping is not
+ desired, set ``vm.swappiness`` to ``0``.
+
+:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness]
+ [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict
+ generations less than or equal to ``gen``. ``gen`` should be less
+ than ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active
+ generations and therefore protected from the eviction. Use
+ ``nr_to_reclaim`` to limit the number of pages to be evicted.
+ Multiple command lines are supported, as is concatenation with the
+ delimiters ``,`` and ``;``.
+
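The runtime recipes above can be sketched as a short shell session against
the debugfs interface. The IDs and generation numbers below are purely
illustrative assumptions, not real values; read the first columns of
``/sys/kernel/debug/lru_gen`` on the running system to find the actual
``memcg_id``, ``node_id`` and ``max_gen``.

```shell
# Assumed for illustration: memcg_id=1, node_id=0, max_gen=7.
# Inspect the current generations first.
cat /sys/kernel/debug/lru_gen

# Working set estimation: account referenced pages to generation 7
# (the current max_gen) and create generation 8.
echo '+ 1 0 7' > /sys/kernel/debug/lru_gen

# Proactive reclaim: evict generations <= 5 with swappiness 0
# (file pages only), capped at 4096 pages.
echo '- 1 0 5 0 4096' > /sys/kernel/debug/lru_gen
```

Both commands require the feature to be enabled at runtime and root
privileges, since ``/sys/kernel/debug`` is root-only by default.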
+Framework
+=========
+For each ``lruvec``, evictable pages are divided into multiple
+generations. The youngest generation number is stored in ``max_seq``
+for both anon and file types as they are aged on an equal footing. The
+oldest generation numbers are stored in ``min_seq[2]`` separately for
+anon and file types as clean file pages can be evicted regardless of
+swap and write-back constraints. Generation numbers are truncated into
+``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists. Evictable pages are added to the per-zone lists indexed by
+``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``),
+depending on whether they are being faulted in.
+
+Each generation is then divided into multiple tiers. Tiers represent
+levels of usage from file descriptors only. Pages accessed ``N`` times via
+file descriptors belong to tier ``order_base_2(N)``. In contrast to moving
+across generations which requires the lru lock, moving across tiers
+only involves an atomic operation on ``page->flags`` and therefore has
+a negligible cost.
+
+The workflow comprises two conceptually independent functions: the
+aging and the eviction.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+scans page tables for referenced pages of this ``lruvec``. Upon
+finding one, the aging updates its generation number to ``max_seq``.
+After each round of scan, the aging increments ``max_seq``.
+
+The aging maintains either a system-wide ``mm_struct`` list or
+per-memcg ``mm_struct`` lists, and it only scans page tables of

Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-13 Thread Vincent Guittot
On Mon, 12 Apr 2021 at 17:24, Mel Gorman  wrote:
>
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, etal
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and 
> > > > the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have 
> > > > any
> > > > suggestions on how we could handle the same in the core scheduler?
> >
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
> >
>
> From topology.c
>
> SD_SHARE_PKG_RESOURCES - describes shared caches
>
> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.
>
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.

But the SMT4 level is still present here, with select_idle_core taking
care of the spreading

>
> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> >
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re-adding another layer in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_branches.
> >
>
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.

What I mean is that you should directly do for_each_sched_domain in
the fast path, because that is what you are proposing in the end. It no
longer looks like a fast path but like a traditional load balance

>
> But still, your comment reminded me that different architectures have
> different requirements
>
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
> spillover to L3 when selecting an idle CPU.
>
> X86 varies, it might want the Power10 approach for some families and prefer
> L3 spilling over to a CPU on the same node in others.
>
> S390 cares about something called books and drawers, although I've no
> idea what it means as such, or whether it has any preferences on
> search order.
>
> ARM has similar requirements again according to "scheduler: expose the
> topology of clusters and add cluster scheduler" and that one *does*
> add another domain.
>
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
>
> The only commonality I thought might exist is that architectures may
> like to define what the first domain to search for an idle CPU and a
> second domain. Alternatively, architectures could specify a domain to
> search primarily but also search the next domain in the hierarchy if
> search depth permits. The default would be the existing behaviour --
> search CPUs sharing a last-level-cache.
>
> > SD_SHARE_PKG_RESOURCES should be set to the last level where we can
> > efficiently move task between CPUs at wakeup
> >
>
> The definition of "efficiently" varies. Moving tasks between CPUs sharing
> a cache is most efficient but moving the task to a CPU that at least has
> local memory channels is a reasonable option if there are no idle CPUs
> sharing cache and preferable to stacking.

That's why setting SD_SHARE_PKG_RESOURCES for P10 looks fine to me.
This last level of SD_SHARE_PKG_RESOURCES should define the cpumask to
be considered in the fast path

>
> > > allows within the node with the LLC CPUs masked out. While there would be
> > > a latency hit because cache is not shared, it would still be a CPU local
> > > to memory that is idle. That would potentially be beneficial on Zen*
> > > as well without having to introduce new domains in the topology hierarchy.
> >
> > What is the current sched_domain topology description for zen ?
> >
>
> The cache and NUMA topologies differ slightly between each generation
> of Zen. The common pattern is that a single NUMA node can have multiple
> L3 caches and at one point I thought it might be reasonable to allow
> spillover to select a local idle CPU instead of stacking multiple tasks
> on a CPU sharing cache. I never got as far as thinking how it could be
> done in a way that multiple 

Re: [PATCH 0/1] Use of /sys/bus/pci/devices/…/index for non-SMBIOS platforms

2021-04-13 Thread Leon Romanovsky
On Tue, Apr 13, 2021 at 08:57:19AM +0200, Niklas Schnelle wrote:
> On Tue, 2021-04-13 at 08:39 +0300, Leon Romanovsky wrote:
> > On Mon, Apr 12, 2021 at 03:59:04PM +0200, Niklas Schnelle wrote:
> > > Hi Narendra, Hi All,
> > > 
> > > According to Documentation/ABI/testing/sysfs-bus-pci you are responsible
> > > for the index device attribute that is used by systemd to create network
> > > interface names.
> > > 
> > > Now we would like to reuse this attribute for firmware provided PCI
> > > device index numbers on the s390 architecture which doesn't have
> > > SMBIOS/DMI nor ACPI. All code changes are within our architecture
> > > specific code but I'd like to get some Acks for this reuse. I've sent an
> > > RFC version of this patch on 15th of March with the subject:
> > > 
> > >s390/pci: expose a PCI device's UID as its index
> > > 
> > > but got no response. Would it be okay to re-use this attribute for
> > > essentially the same purpose but with index numbers provided by
> > > a different platform mechanism? I think this would be cleaner than
> > > further proliferation of /sys/bus/pci/devices//xyz_index
> > > attributes and allows re-use of the existing userspace infrastructure.
> > 
> > I'm missing an explanation that this change is safe for systemd and
> > they don't have some hard-coded assumption about the meaning of existing
> > index on s390.
> > 
> > Thanks
> 
> 
> Sure, good point. So first off, yes, this change does create new
> index-based names on existing systemd versions as well; this is known
> and intended, and we'll certainly collaborate closely with any
> distributions wishing to backport this change.
> 
> As for being otherwise safe or having unintended consequences, Viktor
> (see R-b) and I recently got the following PR merged in that exact area
> of systemd to fix how hotplug slot derived interface names are
> generated:
> https://github.com/systemd/systemd/pull/19017
> In working on that we did also analyse the use of the index attribute
> for hidden assumptions and tested with this attribute added. Arguably,
> as the nature of that PR shows we haven't had a perfect track record of
> keeping this monitored but will in the future as PCI based NICs become
> increasingly important for our platform. We also have special NIC
> naming logic in the same area for our channel based platform specific
> NICs which was also contributed by Viktor.

Thanks, this PR is exciting to read; very warm words were said about
kernel developers :). Can you please summarize what the breakage in old
systemd would be if this index is overloaded?

Thanks

> 
> Thanks,
> Niklas
> 

