[PATCH] riscv: Fix early ftrace nop patching

2024-05-23 Thread Alexandre Ghiti
Commit c97bf629963e ("riscv: Fix text patching when IPI are used")
converted ftrace_make_nop() to use patch_insn_write(), which does not
emit any icache flush and instead relies entirely on
__ftrace_modify_code() to do that.

But we missed that ftrace_make_nop() is also called directly, very early,
when converting mcount calls into nops (on riscv it actually converts the
2B nops emitted by the compiler into 4B nops).

This caused crashes on multiple HW platforms, as reported by Conor and
Björn, since the booting core could have half-patched instructions in its
icache, which would trigger an illegal instruction trap: fix this by
emitting a local icache flush when early-patching the nops.
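
The resulting early-patching path then looks roughly like this (a sketch of
the patched function matching the hunk below; only the boot CPU runs this
early path, so a local fence.i is sufficient):

int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
{
        int out;

        mutex_lock(&text_mutex);
        out = ftrace_make_nop(mod, rec, MCOUNT_ADDR);
        mutex_unlock(&text_mutex);

        /* Early boot patching of kernel text: make the nops visible to
         * the local icache; module text is not being executed yet. */
        if (!mod)
                local_flush_icache_range(rec->ip, rec->ip + MCOUNT_INSN_SIZE);

        return out;
}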

Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used")
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/cacheflush.h | 6 ++
 arch/riscv/kernel/ftrace.c  | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/arch/riscv/include/asm/cacheflush.h 
b/arch/riscv/include/asm/cacheflush.h
index dd8d07146116..ce79c558a4c8 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -13,6 +13,12 @@ static inline void local_flush_icache_all(void)
asm volatile ("fence.i" ::: "memory");
 }
 
+static inline void local_flush_icache_range(unsigned long start,
+   unsigned long end)
+{
+   local_flush_icache_all();
+}
+
 #define PG_dcache_clean PG_arch_1
 
 static inline void flush_dcache_folio(struct folio *folio)
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index 4f4987a6d83d..32e7c401dfb4 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -120,6 +120,9 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace 
*rec)
out = ftrace_make_nop(mod, rec, MCOUNT_ADDR);
mutex_unlock(&text_mutex);
 
+   if (!mod)
+   local_flush_icache_range(rec->ip, rec->ip + MCOUNT_INSN_SIZE);
+
return out;
 }
 
-- 
2.39.2




Re: [PATCH v3 9/9] riscv: mm: Add support for ZONE_DEVICE

2024-05-21 Thread Alexandre Ghiti
On Tue, May 21, 2024 at 1:49 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> ZONE_DEVICE pages need DEVMAP PTEs support to function
> (ARCH_HAS_PTE_DEVMAP). Claim another RSW (reserved for software) bit
> in the PTE for DEVMAP mark, add the corresponding helpers, and enable
> ARCH_HAS_PTE_DEVMAP for riscv64.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/Kconfig|  1 +
>  arch/riscv/include/asm/pgtable-64.h   | 20 
>  arch/riscv/include/asm/pgtable-bits.h |  1 +
>  arch/riscv/include/asm/pgtable.h  | 17 +
>  4 files changed, 39 insertions(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 2724dc2af29f..0b74698c63c7 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -36,6 +36,7 @@ config RISCV
> select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> select ARCH_HAS_PMEM_API
> select ARCH_HAS_PREPARE_SYNC_CORE_CMD
> +   select ARCH_HAS_PTE_DEVMAP if 64BIT && MMU
> select ARCH_HAS_PTE_SPECIAL
> select ARCH_HAS_SET_DIRECT_MAP if MMU
> select ARCH_HAS_SET_MEMORY if MMU
> diff --git a/arch/riscv/include/asm/pgtable-64.h 
> b/arch/riscv/include/asm/pgtable-64.h
> index 221a5c1ee287..c67a9bbfd010 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -400,4 +400,24 @@ static inline struct page *pgd_page(pgd_t pgd)
>  #define p4d_offset p4d_offset
>  p4d_t *p4d_offset(pgd_t *pgd, unsigned long address);
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline int pte_devmap(pte_t pte);
> +static inline pte_t pmd_pte(pmd_t pmd);
> +
> +static inline int pmd_devmap(pmd_t pmd)
> +{
> +   return pte_devmap(pmd_pte(pmd));
> +}
> +
> +static inline int pud_devmap(pud_t pud)
> +{
> +   return 0;
> +}
> +
> +static inline int pgd_devmap(pgd_t pgd)
> +{
> +   return 0;
> +}
> +#endif
> +
>  #endif /* _ASM_RISCV_PGTABLE_64_H */
> diff --git a/arch/riscv/include/asm/pgtable-bits.h 
> b/arch/riscv/include/asm/pgtable-bits.h
> index 179bd4afece4..a8f5205cea54 100644
> --- a/arch/riscv/include/asm/pgtable-bits.h
> +++ b/arch/riscv/include/asm/pgtable-bits.h
> @@ -19,6 +19,7 @@
>  #define _PAGE_SOFT  (3 << 8)/* Reserved for software */
>
>  #define _PAGE_SPECIAL   (1 << 8)/* RSW: 0x1 */
> +#define _PAGE_DEVMAP(1 << 9)/* RSW, devmap */
>  #define _PAGE_TABLE _PAGE_PRESENT
>
>  /*
> diff --git a/arch/riscv/include/asm/pgtable.h 
> b/arch/riscv/include/asm/pgtable.h
> index 7933f493db71..02fadc276064 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -387,6 +387,13 @@ static inline int pte_special(pte_t pte)
> return pte_val(pte) & _PAGE_SPECIAL;
>  }
>
> +#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
> +static inline int pte_devmap(pte_t pte)
> +{
> +   return pte_val(pte) & _PAGE_DEVMAP;
> +}
> +#endif

Not sure you need the #ifdef here.

> +
>  /* static inline pte_t pte_rdprotect(pte_t pte) */
>
>  static inline pte_t pte_wrprotect(pte_t pte)
> @@ -428,6 +435,11 @@ static inline pte_t pte_mkspecial(pte_t pte)
> return __pte(pte_val(pte) | _PAGE_SPECIAL);
>  }
>
> +static inline pte_t pte_mkdevmap(pte_t pte)
> +{
> +   return __pte(pte_val(pte) | _PAGE_DEVMAP);
> +}
> +
>  static inline pte_t pte_mkhuge(pte_t pte)
>  {
> return pte;
> @@ -711,6 +723,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
> return pte_pmd(pte_mkdirty(pmd_pte(pmd)));
>  }
>
> +static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> +{
> +   return pte_pmd(pte_mkdevmap(pmd_pte(pmd)));
> +}
> +
>  static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> pmd_t *pmdp, pmd_t pmd)
>  {
> --
> 2.40.1
>

Otherwise, you can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



Re: [PATCH v3 7/9] riscv: Enable memory hotplugging for RISC-V

2024-05-21 Thread Alexandre Ghiti
On Tue, May 21, 2024 at 1:49 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> RISC-V.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index fe5281398543..2724dc2af29f 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,6 +16,8 @@ config RISCV
> select ACPI_REDUCED_HARDWARE_ONLY if ACPI
> select ARCH_DMA_DEFAULT_COHERENT
> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> +   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM_VMEMMAP && 64BIT && MMU

Not sure you need 64BIT && MMU here since ARCH_SPARSEMEM_ENABLE
depends on MMU and SPARSEMEM_VMEMMAP_ENABLE is only enabled on 64BIT.

> +   select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
> select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
> select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> select ARCH_HAS_BINFMT_FLAT
> --
> 2.40.1
>

But anyway, to me that does not require a new version so you can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



Re: [PATCH v3 5/9] riscv: mm: Add memory hotplugging support

2024-05-21 Thread Alexandre Ghiti
On Tue, May 21, 2024 at 1:49 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> For an architecture to support memory hotplugging, a couple of
> callbacks need to be implemented:
>
>  arch_add_memory()
>   This callback is responsible for adding the physical memory into the
>   direct map, and calling into the memory hotplugging generic code via
>   __add_pages() that adds the corresponding struct page entries, and
>   updates the vmemmap mapping.
>
>  arch_remove_memory()
>   This is the inverse of the callback above.
>
>  vmemmap_free()
>   This function tears down the vmemmap mappings (if
>   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>   backing vmemmap pages. Note that for persistent memory, an
>   alternative allocator for the backing pages can be used; The
>   vmem_altmap. This means that when the backing pages are cleared,
>   extra care is needed so that the correct deallocation method is
>   used.
>
>  arch_get_mappable_range()
>   This function returns the PA range that the direct map can map.
>   Used by the MHP internals for sanity checks.
>
> The page table unmap/teardown functions are heavily based on code from
> the x86 tree. The same remove_pgd_mapping() function is used in both
> vmemmap_free() and arch_remove_memory(), but in the latter function
> the backing pages are not removed.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/mm/init.c | 261 +++
>  1 file changed, 261 insertions(+)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 6f72b0b2b854..6693b742bf2f 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1493,3 +1493,264 @@ void __init pgtable_cache_init(void)
> }
>  }
>  #endif
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void __meminit free_pagetable(struct page *page, int order)
> +{
> +   unsigned int nr_pages = 1 << order;
> +
> +   /*
> +* vmemmap/direct page tables can be reserved, if added at
> +* boot.
> +*/
> +   if (PageReserved(page)) {
> +   __ClearPageReserved(page);

What's the difference between __ClearPageReserved() and
ClearPageReserved()? Because it seems like free_reserved_page() calls
the latter already, so why would you need to call
__ClearPageReserved() on the first page?
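
For reference, free_reserved_page() roughly does the following (a sketch from
memory of the include/linux/mm.h helpers, which may differ slightly between
kernel versions), i.e. it already clears PG_reserved for every page it frees:

static inline void __free_reserved_page(struct page *page)
{
        ClearPageReserved(page);
        init_page_count(page);
        __free_page(page);
}

static inline void free_reserved_page(struct page *page)
{
        __free_reserved_page(page);
        adjust_managed_page_count(page, 1);
}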

> +   while (nr_pages--)
> +   free_reserved_page(page++);
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(page), order);
> +}
> +
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +   pte_t *pte;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PTE; i++) {
> +   pte = pte_start + i;
> +   if (!pte_none(*pte))
> +   return;
> +   }
> +
> +   free_pagetable(pmd_page(*pmd), 0);
> +   pmd_clear(pmd);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +   pmd_t *pmd;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PMD; i++) {
> +   pmd = pmd_start + i;
> +   if (!pmd_none(*pmd))
> +   return;
> +   }
> +
> +   free_pagetable(pud_page(*pud), 0);
> +   pud_clear(pud);
> +}
> +
> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
> +{
> +   pud_t *pud;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PUD; i++) {
> +   pud = pud_start + i;
> +   if (!pud_none(*pud))
> +   return;
> +   }
> +
> +   free_pagetable(p4d_page(*p4d), 0);
> +   p4d_clear(p4d);
> +}
> +
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +  struct vmem_altmap *altmap)
> +{
> +   if (altmap)
> +   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> +   else
> +   free_pagetable(page, get_order(size));
> +}
> +
> +static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long 
> addr, unsigned long end,
> +bool is_vmemmap, struct vmem_altmap 
> *altmap)
> +{
> +   unsigned long next;
> +   pte_t *ptep, pte;
> +
> +   for (; addr < end; addr = next) {
> +   next = (addr + PAGE_SIZE) & PAGE_MASK;

Nit: use ALIGN() instead.
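
e.g. (a one-line sketch, which computes exactly the same value as the
expression above):

        next = ALIGN(addr + 1, PAGE_SIZE);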

> +   if (next > end)
> +   next = end;
> +
> +   ptep = pte_base + pte_index(addr);
> +   pte = READ_ONCE(*ptep);

Nit: Use ptep_get()
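
i.e. (sketch):

        pte = ptep_get(ptep);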

> +
> +   if (!pte_present(*ptep))
> +   continue;
> +
> +   pte_clear(&init_mm, addr, ptep);
> +   if (is_vmemmap)
> +   free_vmemmap_storage(pte_page(pte), PAGE_SIZE, 
> altmap);
> +   }
> +}
> +
> +static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long 
> addr, unsigned long end,
> +bool is_vmemmap, struct 

Re: [PATCH v3 3/9] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-21 Thread Alexandre Ghiti
  uintptr_t va, phys_addr_t pa,
> - phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pmd_mapping(pmd_t *pmdp,
> +uintptr_t va, phys_addr_t pa,
> +phys_addr_t sz, pgprot_t prot)
>  {
> pte_t *ptep;
> phys_addr_t pte_phys;
> @@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
> return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
>  }
>
> -static pud_t *__init get_pud_virt_late(phys_addr_t pa)
> +static pud_t *__meminit get_pud_virt_late(phys_addr_t pa)
>  {
> return (pud_t *)__va(pa);
>  }
> @@ -521,7 +520,7 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t alloc_pud_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_pud_late(uintptr_t va)
>  {
> unsigned long vaddr;
>
> @@ -541,7 +540,7 @@ static p4d_t *__init get_p4d_virt_fixmap(phys_addr_t pa)
> return (p4d_t *)set_fixmap_offset(FIX_P4D, pa);
>  }
>
> -static p4d_t *__init get_p4d_virt_late(phys_addr_t pa)
> +static p4d_t *__meminit get_p4d_virt_late(phys_addr_t pa)
>  {
> return (p4d_t *)__va(pa);
>  }
> @@ -559,7 +558,7 @@ static phys_addr_t __init alloc_p4d_fixmap(uintptr_t va)
> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t alloc_p4d_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_p4d_late(uintptr_t va)
>  {
> unsigned long vaddr;
>
> @@ -568,9 +567,8 @@ static phys_addr_t alloc_p4d_late(uintptr_t va)
> return __pa(vaddr);
>  }
>
> -static void __init create_pud_mapping(pud_t *pudp,
> - uintptr_t va, phys_addr_t pa,
> - phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pud_mapping(pud_t *pudp, uintptr_t va, 
> phys_addr_t pa, phys_addr_t sz,
> +pgprot_t prot)
>  {
> pmd_t *nextp;
> phys_addr_t next_phys;
> @@ -595,9 +593,8 @@ static void __init create_pud_mapping(pud_t *pudp,
> create_pmd_mapping(nextp, va, pa, sz, prot);
>  }
>
> -static void __init create_p4d_mapping(p4d_t *p4dp,
> - uintptr_t va, phys_addr_t pa,
> - phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_p4d_mapping(p4d_t *p4dp, uintptr_t va, 
> phys_addr_t pa, phys_addr_t sz,
> +pgprot_t prot)
>  {
> pud_t *nextp;
> phys_addr_t next_phys;
> @@ -653,9 +650,8 @@ static void __init create_p4d_mapping(p4d_t *p4dp,
>  #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot) do {} while(0)
>  #endif /* __PAGETABLE_PMD_FOLDED */
>
> -void __init create_pgd_mapping(pgd_t *pgdp,
> - uintptr_t va, phys_addr_t pa,
> - phys_addr_t sz, pgprot_t prot)
> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, 
> phys_addr_t sz,
> + pgprot_t prot)
>  {
> pgd_next_t *nextp;
> phys_addr_t next_phys;
> @@ -680,8 +676,7 @@ void __init create_pgd_mapping(pgd_t *pgdp,
> create_pgd_next_mapping(nextp, va, pa, sz, prot);
>  }
>
> -static uintptr_t __init best_map_size(phys_addr_t pa, uintptr_t va,
> - phys_addr_t size)
> +static uintptr_t __meminit best_map_size(phys_addr_t pa, uintptr_t va, 
> phys_addr_t size)
>  {
> if (pgtable_l5_enabled &&
> !(pa & (P4D_SIZE - 1)) && !(va & (P4D_SIZE - 1)) && size >= 
> P4D_SIZE)
> @@ -714,7 +709,7 @@ asmlinkage void __init __copy_data(void)
>  #endif
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -static __init pgprot_t pgprot_from_va(uintptr_t va)
> +static __meminit pgprot_t pgprot_from_va(uintptr_t va)
>  {
> if (is_va_kernel_text(va))
> return PAGE_KERNEL_READ_EXEC;
> @@ -739,7 +734,7 @@ void mark_rodata_ro(void)
>   set_memory_ro);
>  }
>  #else
> -static __init pgprot_t pgprot_from_va(uintptr_t va)
> +static __meminit pgprot_t pgprot_from_va(uintptr_t va)
>  {
> if (IS_ENABLED(CONFIG_64BIT) && !is_kernel_mapping(va))
> return PAGE_KERNEL;
> @@ -1231,9 +1226,8 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> pt_ops_set_fixmap();
>  }
>
> -static void __init create_linear_mapping_range(phys_addr_t start,
> -  phys_addr_t end,
> -  uintptr_t fixed_map_size)
> +static void __meminit create_linear_mapping_range(phys_addr_t start, 
> phys_addr_t end,
> + uintptr_t fixed_map_size)
>  {
> phys_addr_t pa;
> uintptr_t va, map_size;
> --
> 2.40.1
>

You can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



Re: [PATCH v3 1/9] riscv: mm: Properly forward vmemmap_populate() altmap parameter

2024-05-21 Thread Alexandre Ghiti
Hi Björn,

On Tue, May 21, 2024 at 1:48 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/mm/init.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 2574f6a3b0e7..b66f846e7634 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1434,7 +1434,7 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>  * memory hotplug, we are not able to update all the page tables with
>  * the new PMDs.
>  */
> -   return vmemmap_populate_hugepages(start, end, node, NULL);
> +   return vmemmap_populate_hugepages(start, end, node, altmap);
>  }
>  #endif
>
> --
> 2.40.1
>

You can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 8:17 PM Björn Töpel  wrote:
>
> Alexandre Ghiti  writes:
>
> > On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
> >>
> >> From: Björn Töpel 
> >>
> >> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> >> RISC-V.
> >>
> >> Signed-off-by: Björn Töpel 
> >> ---
> >>  arch/riscv/Kconfig | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> >> index 6bec1bce6586..b9398b64bb69 100644
> >> --- a/arch/riscv/Kconfig
> >> +++ b/arch/riscv/Kconfig
> >> @@ -16,6 +16,8 @@ config RISCV
> >> select ACPI_REDUCED_HARDWARE_ONLY if ACPI
> >> select ARCH_DMA_DEFAULT_COHERENT
> >> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> >> +   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
> >
> > I think this should be SPARSEMEM_VMEMMAP here.
>
> Hmm, care to elaborate? I thought that was optional.

My bad, I thought VMEMMAP was required in your patchset. Sorry for the noise!



Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> RISC-V.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 6bec1bce6586..b9398b64bb69 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,6 +16,8 @@ config RISCV
> select ACPI_REDUCED_HARDWARE_ONLY if ACPI
> select ARCH_DMA_DEFAULT_COHERENT
> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> +   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU

I think this should be SPARSEMEM_VMEMMAP here.

> +   select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
> select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
> select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> select ARCH_HAS_BINFMT_FLAT
> --
> 2.40.1
>



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> For an architecture to support memory hotplugging, a couple of
> callbacks need to be implemented:
>
>  arch_add_memory()
>   This callback is responsible for adding the physical memory into the
>   direct map, and calling into the memory hotplugging generic code via
>   __add_pages() that adds the corresponding struct page entries, and
>   updates the vmemmap mapping.
>
>  arch_remove_memory()
>   This is the inverse of the callback above.
>
>  vmemmap_free()
>   This function tears down the vmemmap mappings (if
>   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>   backing vmemmap pages. Note that for persistent memory, an
>   alternative allocator for the backing pages can be used; The
>   vmem_altmap. This means that when the backing pages are cleared,
>   extra care is needed so that the correct deallocation method is
>   used.
>
>  arch_get_mappable_range()
>   This function returns the PA range that the direct map can map.
>   Used by the MHP internals for sanity checks.
>
> The page table unmap/teardown functions are heavily based on code from
> the x86 tree. The same remove_pgd_mapping() function is used in both
> vmemmap_free() and arch_remove_memory(), but in the latter function
> the backing pages are not removed.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/mm/init.c | 242 +++
>  1 file changed, 242 insertions(+)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 6f72b0b2b854..7f0b921a3d3a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
> }
>  }
>  #endif
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +   pte_t *pte;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PTE; i++) {
> +   pte = pte_start + i;
> +   if (!pte_none(*pte))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
> +   pmd_clear(pmd);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +   pmd_t *pmd;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PMD; i++) {
> +   pmd = pmd_start + i;
> +   if (!pmd_none(*pmd))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
> +   pud_clear(pud);
> +}
> +
> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
> +{
> +   pud_t *pud;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PUD; i++) {
> +   pud = pud_start + i;
> +   if (!pud_none(*pud))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
> +   p4d_clear(p4d);
> +}
> +
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +  struct vmem_altmap *altmap)
> +{
> +   if (altmap)
> +   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> +   else
> +   free_pages((unsigned long)page_address(page), 
> get_order(size));
> +}
> +
> +static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long 
> addr, unsigned long end,
> +bool is_vmemmap, struct vmem_altmap 
> *altmap)
> +{
> +   unsigned long next;
> +   pte_t *ptep, pte;
> +
> +   for (; addr < end; addr = next) {
> +   next = (addr + PAGE_SIZE) & PAGE_MASK;
> +   if (next > end)
> +   next = end;
> +
> +   ptep = pte_base + pte_index(addr);
> +   pte = READ_ONCE(*ptep);
> +
> +   if (!pte_present(*ptep))
> +   continue;
> +
> +   pte_clear(&init_mm, addr, ptep);
> +   if (is_vmemmap)
> +   free_vmemmap_storage(pte_page(pte), PAGE_SIZE, 
> altmap);
> +   }
> +}
> +
> +static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long 
> addr, unsigned long end,
> +bool is_vmemmap, struct vmem_altmap 
> *altmap)
> +{
> +   unsigned long next;
> +   pte_t *pte_base;
> +   pmd_t *pmdp, pmd;
> +
> +   for (; addr < end; addr = next) {
> +   next = pmd_addr_end(addr, end);
> +   pmdp = pmd_base + pmd_index(addr);
> +   pmd = READ_ONCE(*pmdp);
> +
> +   if (!pmd_present(pmd))
> +   continue;
> +
> +   if (pmd_leaf(pmd)) {
> +   pmd_clear(pmdp);
> +   if (is_vmemmap)
> +   free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, 
> altmap);
> +   continue;
> +   }
> +
> +   pte_base = 

Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/mm/init.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index c969427eab88..6f72b0b2b854 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  }
>
>  static void __meminit create_linear_mapping_range(phys_addr_t start, 
> phys_addr_t end,
> - uintptr_t fixed_map_size)
> + uintptr_t fixed_map_size, 
> const pgprot_t *pgprot)
>  {
> phys_addr_t pa;
> uintptr_t va, map_size;
> @@ -1238,7 +1238,7 @@ static void __meminit 
> create_linear_mapping_range(phys_addr_t start, phys_addr_t
> best_map_size(pa, va, end - pa);
>
> create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
> -  pgprot_from_va(va));
> +  pgprot ? *pgprot : pgprot_from_va(va));
> }
>  }
>
> @@ -1282,22 +1282,19 @@ static void __init 
> create_linear_mapping_page_table(void)
> if (end >= __pa(PAGE_OFFSET) + memory_limit)
> end = __pa(PAGE_OFFSET) + memory_limit;
>
> -   create_linear_mapping_range(start, end, 0);
> +   create_linear_mapping_range(start, end, 0, NULL);
> }
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -   create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
> -   create_linear_mapping_range(krodata_start,
> -   krodata_start + krodata_size, 0);
> +   create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, 
> NULL);
> +   create_linear_mapping_range(krodata_start, krodata_start + 
> krodata_size, 0, NULL);
>
> memblock_clear_nomap(ktext_start,  ktext_size);
> memblock_clear_nomap(krodata_start, krodata_size);
>  #endif
>
>  #ifdef CONFIG_KFENCE
> -   create_linear_mapping_range(kfence_pool,
> -   kfence_pool + KFENCE_POOL_SIZE,
> -   PAGE_SIZE);
> +   create_linear_mapping_range(kfence_pool, kfence_pool + 
> KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
>
> memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>  #endif
> --
> 2.40.1
>
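
For context, a sketch (not part of this series; the signature comes from the
generic memory-hotplug API, details assumed) of how arch_add_memory() could
later consume the new pgprot parameter:

int arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
{
        create_linear_mapping_range(start, start + size, 0, &params->pgprot);
        return __add_pages(nid, PFN_DOWN(start), PFN_DOWN(size), params);
}

(a real implementation would likely also need error unwinding and TLB
maintenance, omitted here).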

You can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Prepare for memory hotplugging support by changing from __init to
> __meminit for the page table functions that are used by the upcoming
> architecture specific callbacks.
>
> Changing the __init attribute to __meminit avoids the functions being
> removed after init. The __meminit attribute makes sure the
> functions are kept in the kernel text post init, but only if memory
> hotplugging is enabled for the build.
>
> Also, make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/include/asm/mmu.h |  4 +--
>  arch/riscv/include/asm/pgtable.h |  2 +-
>  arch/riscv/mm/init.c | 58 ++--
>  3 files changed, 29 insertions(+), 35 deletions(-)
>
> diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
> index 60be458e94da..c09c3c79f496 100644
> --- a/arch/riscv/include/asm/mmu.h
> +++ b/arch/riscv/include/asm/mmu.h
> @@ -28,8 +28,8 @@ typedef struct {
>  #endif
>  } mm_context_t;
>
> -void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
> -  phys_addr_t sz, pgprot_t prot);
> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, 
> phys_addr_t sz,
> + pgprot_t prot);
>  #endif /* __ASSEMBLY__ */
>
>  #endif /* _ASM_RISCV_MMU_H */
> diff --git a/arch/riscv/include/asm/pgtable.h 
> b/arch/riscv/include/asm/pgtable.h
> index 58fd7b70b903..7933f493db71 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -162,7 +162,7 @@ struct pt_alloc_ops {
>  #endif
>  };
>
> -extern struct pt_alloc_ops pt_ops __initdata;
> +extern struct pt_alloc_ops pt_ops __meminitdata;
>
>  #ifdef CONFIG_MMU
>  /* Number of PGD entries that a user-mode program can use */
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 5b8cdfafb52a..c969427eab88 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -295,7 +295,7 @@ static void __init setup_bootmem(void)
>  }
>
>  #ifdef CONFIG_MMU
> -struct pt_alloc_ops pt_ops __initdata;
> +struct pt_alloc_ops pt_ops __meminitdata;
>
>  pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
>  pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> @@ -357,7 +357,7 @@ static inline pte_t *__init 
> get_pte_virt_fixmap(phys_addr_t pa)
> return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
>  }
>
> -static inline pte_t *__init get_pte_virt_late(phys_addr_t pa)
> +static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa)
>  {
> return (pte_t *) __va(pa);
>  }
> @@ -376,7 +376,7 @@ static inline phys_addr_t __init 
> alloc_pte_fixmap(uintptr_t va)
> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t __init alloc_pte_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
>  {
> struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 
> 0);
>
> @@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va)
> return __pa((pte_t *)ptdesc_address(ptdesc));
>  }
>
> -static void __init create_pte_mapping(pte_t *ptep,
> - uintptr_t va, phys_addr_t pa,
> - phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, 
> phys_addr_t pa, phys_addr_t sz,
> +pgprot_t prot)
>  {
> uintptr_t pte_idx = pte_index(va);
>
> @@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa)
> return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
>  }
>
> -static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
> +static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa)
>  {
> return (pmd_t *) __va(pa);
>  }
> @@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t __init alloc_pmd_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
>  {
> struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 
> 0);
>
> @@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va)
> return __pa((pmd_t *)ptdesc_address(ptdesc));
>  }
>
> -static void __init create_pmd_mapping(pmd_t *pmdp,
> - uintptr_t va, phys_addr_t pa,
> - phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pmd_mapping(pmd_t *pmdp,
> +uintptr_t va, phys_addr_t pa,
> +phys_addr_t sz, pgprot_t prot)
>  {
> pte_t *ptep;
> phys_addr_t pte_phys;
> @@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
>   

Re: [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries

2024-05-14 Thread Alexandre Ghiti
Hi Björn,

On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to
> all userland page tables, which means that if the PGD level table is
> changed, other page tables have to be updated as well.
>
> Instead of having the PGD changes ripple out to all tables, the
> synchronization can be avoided by pre-allocating the PGD entries/pages
> at boot, avoiding the synchronization altogether.
>
> This is currently done for the bpf/modules, and vmalloc PGD regions.
> Extend this scheme for the PGD regions touched by memory hotplugging.
>
> Prepare the RISC-V port for memory hotplug by pre-allocating
> vmemmap/direct map entries at the PGD level. This will roughly waste
> ~128 worth of 4K pages when memory hotplugging is enabled in the
> kernel configuration.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/include/asm/kasan.h | 4 ++--
>  arch/riscv/mm/init.c   | 7 +++
>  2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> index 0b85e363e778..e6a0071bdb56 100644
> --- a/arch/riscv/include/asm/kasan.h
> +++ b/arch/riscv/include/asm/kasan.h
> @@ -6,8 +6,6 @@
>
>  #ifndef __ASSEMBLY__
>
> -#ifdef CONFIG_KASAN
> -
>  /*
>   * The following comment was copied from arm64:
>   * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
> @@ -34,6 +32,8 @@
>   */
>  #define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & 
> PGDIR_MASK)
>  #define KASAN_SHADOW_END   MODULES_LOWEST_VADDR
> +
> +#ifdef CONFIG_KASAN
> +#define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>
>  void kasan_init(void);
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 2574f6a3b0e7..5b8cdfafb52a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -27,6 +27,7 @@
>
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1488,10 +1489,16 @@ static void __init 
> preallocate_pgd_pages_range(unsigned long start, unsigned lon
> panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
>  }
>
> +#define PAGE_END KASAN_SHADOW_START
> +
>  void __init pgtable_cache_init(void)
>  {
> preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
> if (IS_ENABLED(CONFIG_MODULES))
> preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, 
> "bpf/modules");
> +   if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
> +   preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, 
> "vmemmap");
> +   preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct 
> map");
> +   }
>  }
>  #endif
> --
> 2.40.1
>

As you asked, with
https://lore.kernel.org/linux-riscv/20240514133614.87813-1-alexgh...@rivosinc.com/T/#u,
you will be able to remove the usage of KASAN_SHADOW_START.

But anyhow, you can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Alexandre Ghiti

On 26/03/2024 17:49, Jarkko Sakkinen wrote:

On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote:

Hi Jarkko,

On 25/03/2024 22:55, Jarkko Sakkinen wrote:

Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v5:
- No changes, except removing the alloc_execmem() call which should have
been part of the previous patch.
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This
is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
   arch/riscv/Kconfig  |  1 +
   arch/riscv/kernel/Makefile  |  3 +++
   arch/riscv/kernel/execmem.c | 22 ++
   3 files changed, 26 insertions(+)
   create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
   
   obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o

   obj-$(CONFIG_MODULES)+= module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
   obj-$(CONFIG_MODULE_SECTIONS)+= module-sections.o
   
   obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o

diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)

Need to have the parameter name here. I guess this could just as well
pass through gfp to vmalloc from the caller as kprobes does call
module_alloc() with GFP_KERNEL set in RISC-V.


+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}


The __vmalloc_node_range() line ^^ must be from an old kernel since we
added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix
module_alloc() that did not reset the linear mapping permissions").

In addition, I guess module_alloc() should now use alloc_execmem() right?

Ack for the first comment. For the 2nd it is up to arch/ to choose
whether to have shared or separate allocators.

So if you want I can change it that way but did not want to make the
call myself.



I'd say module_alloc() should use alloc_execmem() then since there are 
no differences for now.
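
Something like this minimal sketch (assuming the allocation parameters stay
identical between the two):

void *module_alloc(unsigned long size)
{
        return alloc_execmem(size, GFP_KERNEL);
}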






+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}


I remember Mike Rapoport sent a patchset to introduce an API for
executable memory allocation
(https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/),
how does this intersect with your work? I don't know the status of his
patchset though.

Thanks,

Alex

I have also made a patch set for kprobes in the 2022:

https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/

I think Calvin's, Mike's and my early patch sets all have the same
problem: they try to choke all architectures at once. And further,
Calvin's and Mike's work also try to cover also tracing subsystems
at once.

I feel that my relatively small patch set which deals only with
trivial kprobe (which is more in the leaf than e.g. bpf which
is more like orchestrator tool) and implements one arch of which
dog food I actually eat is a better starting point.

Arch code is always something where you need to have genuine
understanding so full architecture coverage from day one is
just too risky for stabil

Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Alexandre Ghiti

Hi Jarkko,

On 25/03/2024 22:55, Jarkko Sakkinen wrote:

Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v5:
- No changes, except removing the alloc_execmem() call which should have
   been part of the previous patch.
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
   this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This
   is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
  arch/riscv/Kconfig  |  1 +
  arch/riscv/kernel/Makefile  |  3 +++
  arch/riscv/kernel/execmem.c | 22 ++
  3 files changed, 26 insertions(+)
  create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
  
  obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o

  obj-$(CONFIG_MODULES) += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
  obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o
  
  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o

diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}



The __vmalloc_node_range() line ^^ must be from an old kernel since we 
added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix 
module_alloc() that did not reset the linear mapping permissions").
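
i.e. something like the following (from memory, please double-check against
the 6.8 tree):

        return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
                                    GFP_KERNEL, PAGE_KERNEL, VM_FLUSH_RESET_PERMS,
                                    NUMA_NO_NODE, __builtin_return_address(0));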


In addition, I guess module_alloc() should now use alloc_execmem() right?



+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}



I remember Mike Rapoport sent a patchset to introduce an API for 
executable memory allocation 
(https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/), 
how does this intersect with your work? I don't know the status of his 
patchset though.


Thanks,

Alex




Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-06 Thread Alexandre Ghiti

+cc Andy and Evgenii

On 06/03/2024 21:35, Alexandre Ghiti wrote:

Hi Puranjay,

On 06/03/2024 17:59, Puranjay Mohan wrote:

This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
This allows each ftrace callsite to provide an ftrace_ops to the common
ftrace trampoline, allowing each callsite to invoke distinct tracer
functions without the need to fall back to list processing or to
allocate custom trampolines for each callsite. This significantly speeds
up cases where multiple distinct trace functions are used and callsites
are mostly traced by a single tracer.

The idea and most of the implementation is taken from the ARM64's
implementation of the same feature. The idea is to place a pointer to
the ftrace_ops as a literal at a fixed offset from the function entry
point, which can be recovered by the common ftrace trampoline.

We use -fpatchable-function-entry to reserve 8 bytes above the function
entry by emitting 2 4-byte or 4 2-byte nops depending on the presence of
CONFIG_RISCV_ISA_C. These 8 bytes are patched at runtime with a pointer
to the associated ftrace_ops for that callsite. Functions are aligned to
8 bytes to make sure that the accesses to this literal are atomic.

This approach allows for directly invoking ftrace_ops::func even for
ftrace_ops which are dynamically-allocated (or part of a module),
without going via ftrace_ops_list_func.
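
Conceptually, the per-callsite dispatch then becomes something like the
sketch below (the function name and the literal offset are illustrative
assumptions, not the actual implementation):

/*
 * Sketch: the common trampoline recovers the ftrace_ops pointer that was
 * patched into the 8 bytes reserved before the function entry and calls
 * its handler directly instead of iterating the global ops list.
 */
static void ftrace_call_via_ops_literal(unsigned long ip, unsigned long parent_ip,
                                        struct ftrace_regs *fregs)
{
        /* OPS_LITERAL_OFFSET is hypothetical: distance from the patched
         * call site back to the 8-byte ftrace_ops literal. */
        struct ftrace_ops *ops = *(struct ftrace_ops **)(ip - OPS_LITERAL_OFFSET);

        ops->func(ip, parent_ip, ops, fregs);
}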

I've benchmarked this with the ftrace_ops sample module on QEMU, with
the next version, I will provide benchmarks on real hardware:

Without this patch:

+---+-++
|  Number of tracers    | Total time (ns) | Per-call average time  |
|---+-+|
| Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) |
|--++-++---|
|    0 |  0 |    15615700 |    156 | - |
|    0 |  1 |    15917600 |    159 | - |
|    0 |  2 |    15668000 |    156 | - |
|    0 | 10 |    14971500 |    149 | - |
|    0 |    100 |    15417600 |    154 | - |
|    0 |    200 |    15387000 |    153 | - |
|--++-++---|
|    1 |  0 |   119906800 |   1199 |  1043 |
|    1 |  1 |   137428600 |   1374 |  1218 |
|    1 |  2 |   159562400 |   1374 |  1218 |
|    1 | 10 |   302099900 |   3020 |  2864 |
|    1 |    100 |  2008785500 |  20087 | 19931 |
|    1 |    200 |  3965221900 |  39652 | 39496 |
|--++-++---|
|    1 |  0 |   119166700 |   1191 |  1035 |
|    2 |  0 |   15700 |   1579 |  1423 |
|   10 |  0 |   425370100 |   4253 |  4097 |
|  100 |  0 |  3595252100 |  35952 | 35796 |
|  200 |  0 |  7023485700 |  70234 | 70078 |
+--++-++---+

Note: per-call overhead is estimated relative to the baseline case with
0 relevant tracers and 0 irrelevant tracers.

With this patch:

+---+-++
|   Number of tracers   | Total time (ns) | Per-call average time  |
|---+-+|
| Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) |
|--++-++---|
|    0 |  0 |    15254600 |    152 | - |
|    0 |  1 |    16136700 |    161 | - |
|    0 |  2 |    15329500 |    153 | - |
|    0 | 10 |    15148800 |    151 | - |
|    0 |    100 |    15746900 |    157 | - |
|    0 |    200 |    15737400 |    157 | - |
|--++-++---|
|    1 |  0 |    47909000 |    479 |   327 |
|    1 |  1 |    48297400 |    482 |   330 |
|    1 |  2 |    47314100 |    473 |   321 |
|    1 | 10 |    47844900 |    478 |   326 |
|    1 |    100 |    46591900 |    465 |   313 |
|    1 |    200 |    47178900 |    471 |   319 |
|--++-++---|
|    1 |  0 |    46715800 |    467 |   315 |
|    2 |  0 |   155134500 |   1551

Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-06 Thread Alexandre Ghiti

Hi Puranjay,

On 06/03/2024 17:59, Puranjay Mohan wrote:

This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
This allows each ftrace callsite to provide an ftrace_ops to the common
ftrace trampoline, allowing each callsite to invoke distinct tracer
functions without the need to fall back to list processing or to
allocate custom trampolines for each callsite. This significantly speeds
up cases where multiple distinct trace functions are used and callsites
are mostly traced by a single tracer.

The idea and most of the implementation is taken from the ARM64's
implementation of the same feature. The idea is to place a pointer to
the ftrace_ops as a literal at a fixed offset from the function entry
point, which can be recovered by the common ftrace trampoline.

We use -fpatchable-function-entry to reserve 8 bytes above the function
entry by emitting 2 4 byte or 4 2 byte  nops depending on the presence of
CONFIG_RISCV_ISA_C. These 8 bytes are patched at runtime with a pointer
to the associated ftrace_ops for that callsite. Functions are aligned to
8 bytes to make sure that the accesses to this literal are atomic.

This approach allows for directly invoking ftrace_ops::func even for
ftrace_ops which are dynamically-allocated (or part of a module),
without going via ftrace_ops_list_func.

I've benchmarked this with the ftrace_ops sample module on QEMU, with
the next version, I will provide benchmarks on real hardware:

Without this patch:

+---+-++
|  Number of tracers| Total time (ns) | Per-call average time  |
|---+-+|
| Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) |
|--++-++---|
|0 |  0 |15615700 |156 | - |
|0 |  1 |15917600 |159 | - |
|0 |  2 |15668000 |156 | - |
|0 | 10 |14971500 |149 | - |
|0 |100 |15417600 |154 | - |
|0 |200 |15387000 |153 | - |
|--++-++---|
|1 |  0 |   119906800 |   1199 |  1043 |
|1 |  1 |   137428600 |   1374 |  1218 |
|1 |  2 |   159562400 |   1374 |  1218 |
|1 | 10 |   302099900 |   3020 |  2864 |
|1 |100 |  2008785500 |  20087 | 19931 |
|1 |200 |  3965221900 |  39652 | 39496 |
|--++-++---|
|1 |  0 |   119166700 |   1191 |  1035 |
|2 |  0 |   15700 |   1579 |  1423 |
|   10 |  0 |   425370100 |   4253 |  4097 |
|  100 |  0 |  3595252100 |  35952 | 35796 |
|  200 |  0 |  7023485700 |  70234 | 70078 |
+--++-++---+

Note: per-call overhead is estimated relative to the baseline case with
0 relevant tracers and 0 irrelevant tracers.

With this patch:

+---+-++
|   Number of tracers   | Total time (ns) | Per-call average time  |
|---+-+|
| Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) |
|--++-++---|
|0 |  0 |15254600 |152 | - |
|0 |  1 |16136700 |161 | - |
|0 |  2 |15329500 |153 | - |
|0 | 10 |15148800 |151 | - |
|0 |100 |15746900 |157 | - |
|0 |200 |15737400 |157 | - |
|--++-++---|
|1 |  0 |47909000 |479 |   327 |
|1 |  1 |48297400 |482 |   330 |
|1 |  2 |47314100 |473 |   321 |
|1 | 10 |47844900 |478 |   326 |
|1 |100 |46591900 |465 |   313 |
|1 |200 |47178900 |471 |   319 |
|--++-++---|
|1 |  0 |46715800 |467 |   315 |
|2 |  0 |   155134500 |   1551 |  1399 |
|   10 |  0 | 

[PATCH v3 2/2] riscv: Fix text patching when IPI are used

2024-02-29 Thread Alexandre Ghiti
For now, we use stop_machine() to patch the text and when we use IPIs for
remote icache flushes (which are emitted in patch_text_nosync()), the system
hangs.

So instead, make sure every CPU executes the stop_machine() patching
function and emit a local icache flush there.

Co-developed-by: Björn Töpel 
Signed-off-by: Björn Töpel 
Signed-off-by: Alexandre Ghiti 
Reviewed-by: Andrea Parri 
---
 arch/riscv/include/asm/patch.h |  1 +
 arch/riscv/kernel/ftrace.c | 44 ++
 arch/riscv/kernel/patch.c  | 16 +
 3 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/include/asm/patch.h b/arch/riscv/include/asm/patch.h
index e88b52d39eac..9f5d6e14c405 100644
--- a/arch/riscv/include/asm/patch.h
+++ b/arch/riscv/include/asm/patch.h
@@ -6,6 +6,7 @@
 #ifndef _ASM_RISCV_PATCH_H
 #define _ASM_RISCV_PATCH_H
 
+int patch_insn_write(void *addr, const void *insn, size_t len);
 int patch_text_nosync(void *addr, const void *insns, size_t len);
 int patch_text_set_nosync(void *addr, u8 c, size_t len);
 int patch_text(void *addr, u32 *insns, int ninsns);
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index f5aa24d9e1c1..4f4987a6d83d 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -75,8 +76,7 @@ static int __ftrace_modify_call(unsigned long hook_pos, 
unsigned long target,
make_call_t0(hook_pos, target, call);
 
/* Replace the auipc-jalr pair at once. Return -EPERM on write error. */
-   if (patch_text_nosync
-   ((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)hook_pos, enable ? call : nops, 
MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -88,7 +88,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
 
make_call_t0(rec->ip, addr, call);
 
-   if (patch_text_nosync((void *)rec->ip, call, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -99,7 +99,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace 
*rec,
 {
unsigned int nops[2] = {NOP4, NOP4};
 
-   if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -134,6 +134,42 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 
return ret;
 }
+
+struct ftrace_modify_param {
+   int command;
+   atomic_t cpu_count;
+};
+
+static int __ftrace_modify_code(void *data)
+{
+   struct ftrace_modify_param *param = data;
+
+   if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
+   ftrace_modify_all_code(param->command);
+   /*
+* Make sure the patching store is effective *before* we
+* increment the counter which releases all waiting CPUs
+* by using the release variant of atomic increment. The
+* release pairs with the call to local_flush_icache_all()
+* on the waiting CPU.
+*/
+   atomic_inc_return_release(&param->cpu_count);
+   } else {
+   while (atomic_read(&param->cpu_count) <= num_online_cpus())
+   cpu_relax();
+   }
+
+   local_flush_icache_all();
+
+   return 0;
+}
+
+void arch_ftrace_update_code(int command)
+{
+   struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
+
+   stop_machine(__ftrace_modify_code, &param, cpu_online_mask);
+}
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 0b5c16dfe3f4..9a1bce1adf5a 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -188,7 +188,7 @@ int patch_text_set_nosync(void *addr, u8 c, size_t len)
 }
 NOKPROBE_SYMBOL(patch_text_set_nosync);
 
-static int patch_insn_write(void *addr, const void *insn, size_t len)
+int patch_insn_write(void *addr, const void *insn, size_t len)
 {
size_t patched = 0;
size_t size;
@@ -232,15 +232,23 @@ static int patch_text_cb(void *data)
if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
for (i = 0; ret == 0 && i < patch->ninsns; i++) {
len = GET_INSN_LENGTH(patch->insns[i]);
-   ret = patch_text_nosync(patch->addr + i * len,
-   &patch->insns[i], len);
+   ret = patch_insn_write(patch->addr + i * len, &patch->insns[i], len);
}
-   atomic_inc(&patch->cpu_count);
+   /*
+* Make sure the patching store is effective *before* we
+* increment the counter which releases

[PATCH v3 1/2] riscv: Remove superfluous smp_mb()

2024-02-29 Thread Alexandre Ghiti
This memory barrier is not needed and not documented so simply remove
it.

Suggested-by: Andrea Parri 
Signed-off-by: Alexandre Ghiti 
Reviewed-by: Andrea Parri 
---
 arch/riscv/kernel/patch.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 37e87fdcf6a0..0b5c16dfe3f4 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -239,7 +239,6 @@ static int patch_text_cb(void *data)
} else {
while (atomic_read(&patch->cpu_count) <= num_online_cpus())
cpu_relax();
-   smp_mb();
}
 
return ret;
-- 
2.39.2




[PATCH v3 0/2] riscv: fix patching with IPI

2024-02-29 Thread Alexandre Ghiti
patch 1 removes a useless memory barrier and patch 2 actually fixes the
issue with IPI in the patching code.

Changes in v3:
- Remove wrong cleanup as noted by Samuel
- Enhance comment about usage of release semantics as suggested by
  Andrea
- Add RBs from Andrea

Changes in v2:
- Add patch 1 and then remove the memory barrier from patch 2 as
  suggested by Andrea
- Convert atomic_inc into an atomic_inc with release semantics as
  suggested by Andrea

Alexandre Ghiti (2):
  riscv: Remove superfluous smp_mb()
  riscv: Fix text patching when IPI are used

 arch/riscv/include/asm/patch.h |  1 +
 arch/riscv/kernel/ftrace.c | 44 ++
 arch/riscv/kernel/patch.c  | 17 +
 3 files changed, 53 insertions(+), 9 deletions(-)

-- 
2.39.2




Re: [PATCH 2/2] riscv: Fix text patching when IPI are used

2024-02-28 Thread Alexandre Ghiti
On Wed, Feb 28, 2024 at 7:21 PM Samuel Holland
 wrote:
>
> Hi Alex,
>
> On 2024-02-28 11:51 AM, Alexandre Ghiti wrote:
> > For now, we use stop_machine() to patch the text and when we use IPIs for
> > remote icache flushes (which are emitted in patch_text_nosync()), the system
> > hangs.
> >
> > So instead, make sure every cpu executes the stop_machine() patching
> > function and emit a local icache flush there.
> >
> > Co-developed-by: Björn Töpel 
> > Signed-off-by: Björn Töpel 
> > Signed-off-by: Alexandre Ghiti 
> > ---
> >  arch/riscv/include/asm/patch.h |  1 +
> >  arch/riscv/kernel/ftrace.c | 42 ++
> >  arch/riscv/kernel/patch.c  | 18 +--
> >  3 files changed, 50 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/riscv/include/asm/patch.h b/arch/riscv/include/asm/patch.h
> > index e88b52d39eac..9f5d6e14c405 100644
> > --- a/arch/riscv/include/asm/patch.h
> > +++ b/arch/riscv/include/asm/patch.h
> > @@ -6,6 +6,7 @@
> >  #ifndef _ASM_RISCV_PATCH_H
> >  #define _ASM_RISCV_PATCH_H
> >
> > +int patch_insn_write(void *addr, const void *insn, size_t len);
> >  int patch_text_nosync(void *addr, const void *insns, size_t len);
> >  int patch_text_set_nosync(void *addr, u8 c, size_t len);
> >  int patch_text(void *addr, u32 *insns, int ninsns);
> > diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
> > index f5aa24d9e1c1..5654966c4e7d 100644
> > --- a/arch/riscv/kernel/ftrace.c
> > +++ b/arch/riscv/kernel/ftrace.c
> > @@ -8,6 +8,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >
> > @@ -75,8 +76,7 @@ static int __ftrace_modify_call(unsigned long hook_pos, 
> > unsigned long target,
> >   make_call_t0(hook_pos, target, call);
> >
> >   /* Replace the auipc-jalr pair at once. Return -EPERM on write error. 
> > */
> > - if (patch_text_nosync
> > - ((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
> > + if (patch_insn_write((void *)hook_pos, enable ? call : nops, 
> > MCOUNT_INSN_SIZE))
> >   return -EPERM;
> >
> >   return 0;
> > @@ -88,7 +88,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned 
> > long addr)
> >
> >   make_call_t0(rec->ip, addr, call);
> >
> > - if (patch_text_nosync((void *)rec->ip, call, MCOUNT_INSN_SIZE))
> > + if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
> >   return -EPERM;
> >
> >   return 0;
> > @@ -99,7 +99,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace 
> > *rec,
> >  {
> >   unsigned int nops[2] = {NOP4, NOP4};
> >
> > - if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
> > + if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
> >   return -EPERM;
> >
> >   return 0;
> > @@ -134,6 +134,40 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
> >
> >   return ret;
> >  }
> > +
> > +struct ftrace_modify_param {
> > + int command;
> > + atomic_t cpu_count;
> > +};
> > +
> > +static int __ftrace_modify_code(void *data)
> > +{
> > + struct ftrace_modify_param *param = data;
> > +
> > + if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> > + ftrace_modify_all_code(param->command);
> > + /*
> > +  * Make sure the patching store is effective *before* we
> > +  * increment the counter which releases all waiting cpus
> > +  * by using the release version of atomic increment.
> > +  */
> > + atomic_inc_return_release(&param->cpu_count);
> > + } else {
> > + while (atomic_read(&param->cpu_count) <= num_online_cpus())
> > + cpu_relax();
> > + }
> > +
> > + local_flush_icache_all();
> > +
> > + return 0;
> > +}
> > +
> > +void arch_ftrace_update_code(int command)
> > +{
> > + struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
> > +
> > + stop_machine(__ftrace_modify_code, &param, cpu_online_mask);
> > +}
> >  #endif
> >
> >  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> > diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
> > index 0b5c16dfe3f4..82d8508c765b 100644
> > --- a/arch/riscv/kernel/patch.c

[PATCH 2/2] riscv: Fix text patching when IPI are used

2024-02-28 Thread Alexandre Ghiti
For now, we use stop_machine() to patch the text and when we use IPIs for
remote icache flushes (which is emitted in patch_text_nosync()), the system
hangs.

So instead, make sure every cpu executes the stop_machine() patching
function and emit a local icache flush there.

Co-developed-by: Björn Töpel 
Signed-off-by: Björn Töpel 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/patch.h |  1 +
 arch/riscv/kernel/ftrace.c | 42 ++
 arch/riscv/kernel/patch.c  | 18 +--
 3 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/include/asm/patch.h b/arch/riscv/include/asm/patch.h
index e88b52d39eac..9f5d6e14c405 100644
--- a/arch/riscv/include/asm/patch.h
+++ b/arch/riscv/include/asm/patch.h
@@ -6,6 +6,7 @@
 #ifndef _ASM_RISCV_PATCH_H
 #define _ASM_RISCV_PATCH_H
 
+int patch_insn_write(void *addr, const void *insn, size_t len);
 int patch_text_nosync(void *addr, const void *insns, size_t len);
 int patch_text_set_nosync(void *addr, u8 c, size_t len);
 int patch_text(void *addr, u32 *insns, int ninsns);
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index f5aa24d9e1c1..5654966c4e7d 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -75,8 +76,7 @@ static int __ftrace_modify_call(unsigned long hook_pos, 
unsigned long target,
make_call_t0(hook_pos, target, call);
 
/* Replace the auipc-jalr pair at once. Return -EPERM on write error. */
-   if (patch_text_nosync
-   ((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)hook_pos, enable ? call : nops, 
MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -88,7 +88,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
 
make_call_t0(rec->ip, addr, call);
 
-   if (patch_text_nosync((void *)rec->ip, call, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -99,7 +99,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace 
*rec,
 {
unsigned int nops[2] = {NOP4, NOP4};
 
-   if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -134,6 +134,40 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 
return ret;
 }
+
+struct ftrace_modify_param {
+   int command;
+   atomic_t cpu_count;
+};
+
+static int __ftrace_modify_code(void *data)
+{
+   struct ftrace_modify_param *param = data;
+
+   if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
+   ftrace_modify_all_code(param->command);
+   /*
+* Make sure the patching store is effective *before* we
+* increment the counter which releases all waiting cpus
+* by using the release version of atomic increment.
+*/
+   atomic_inc_return_release(&param->cpu_count);
+   } else {
+   while (atomic_read(&param->cpu_count) <= num_online_cpus())
+   cpu_relax();
+   }
+
+   local_flush_icache_all();
+
+   return 0;
+}
+
+void arch_ftrace_update_code(int command)
+{
+   struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
+
+   stop_machine(__ftrace_modify_code, &param, cpu_online_mask);
+}
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 0b5c16dfe3f4..82d8508c765b 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -188,7 +188,7 @@ int patch_text_set_nosync(void *addr, u8 c, size_t len)
 }
 NOKPROBE_SYMBOL(patch_text_set_nosync);
 
-static int patch_insn_write(void *addr, const void *insn, size_t len)
+int patch_insn_write(void *addr, const void *insn, size_t len)
 {
size_t patched = 0;
size_t size;
@@ -211,11 +211,9 @@ NOKPROBE_SYMBOL(patch_insn_write);
 
 int patch_text_nosync(void *addr, const void *insns, size_t len)
 {
-   u32 *tp = addr;
int ret;
 
-   ret = patch_insn_write(tp, insns, len);
-
+   ret = patch_insn_write(addr, insns, len);
if (!ret)
flush_icache_range((uintptr_t) tp, (uintptr_t) tp + len);
 
@@ -232,15 +230,21 @@ static int patch_text_cb(void *data)
	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
for (i = 0; ret == 0 && i < patch->ninsns; i++) {
len = GET_INSN_LENGTH(patch->insns[i]);
-   ret = patch_text_nosync(patch->addr + i * len,
-   &patch->insns[i], len);
+   ret = patch_insn_write(patch->addr + i * len, 
&patch->insns[i], len);

[PATCH 1/2] riscv: Remove superfluous smp_mb()

2024-02-28 Thread Alexandre Ghiti
This memory barrier is not needed and not documented so simply remove
it.

Suggested-by: Andrea Parri 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/kernel/patch.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 37e87fdcf6a0..0b5c16dfe3f4 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -239,7 +239,6 @@ static int patch_text_cb(void *data)
} else {
	while (atomic_read(&patch->cpu_count) <= num_online_cpus())
cpu_relax();
-   smp_mb();
}
 
return ret;
-- 
2.39.2




[PATCH 0/2] riscv: fix patching with IPI

2024-02-28 Thread Alexandre Ghiti
Patch 1 removes a useless memory barrier and patch 2 fixes the issue
with IPIs in the patching code.

Changes in v2:
- Add patch 1 and then remove the memory barrier from patch 2 as
  suggested by Andrea
- Convert atomic_inc into an atomic_inc with release semantics as
  suggested by Andrea

Alexandre Ghiti (2):
  riscv: Remove superfluous smp_mb()
  riscv: Fix text patching when IPI are used

 arch/riscv/include/asm/patch.h |  1 +
 arch/riscv/kernel/ftrace.c | 42 ++
 arch/riscv/kernel/patch.c  | 19 ---
 3 files changed, 50 insertions(+), 12 deletions(-)

-- 
2.39.2




Re: [PATCH] riscv: Fix text patching when icache flushes use IPIs

2024-02-08 Thread Alexandre Ghiti
Hi Andrea,

On Thu, Feb 8, 2024 at 12:42 PM Andrea Parri  wrote:
>
> > +static int __ftrace_modify_code(void *data)
> > +{
> > + struct ftrace_modify_param *param = data;
> > +
> > + if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> > + ftrace_modify_all_code(param->command);
> > + atomic_inc(&param->cpu_count);
>
> I stared at ftrace_modify_all_code() for a bit but honestly I don't see
> what prevents the ->cpu_count increment from being reordered before the
> insn write(s) (architecturally) now that you have removed the IPI dance:
> perhaps add an smp_wmb() right before the atomic_inc() (or promote this
> latter to a (void)atomic_inc_return_release()) and/or an inline comment
> saying why such reordering is not possible?

I did not even think of that, and it actually makes sense so I'll go
with what you propose: I'll replace atomic_inc() with
atomic_inc_return_release(). And I'll add the following comment if
that's ok with you:

"Make sure the patching store is effective *before* we increment the
counter which releases all waiting cpus"

>
>
> > + } else {
> > + while (atomic_read(&param->cpu_count) <= num_online_cpus())
> > + cpu_relax();
> > + smp_mb();
>
> I see that you've lifted/copied the memory barrier from patch_text_cb():
> what's its point?  AFAIU, the barrier has no ordering effect on program
> order later insn fetches; perhaps the code was based on some legacy/old
> version of Zifencei?  IAC, comments, comments, ... or maybe just remove
> that memory barrier?

Honestly, I looked at it one minute, did not understand its purpose
and said to myself "ok that can't hurt anyway, I may be missing
something".

FWIW,  I see that arm64 uses isb() here. If you don't see its purpose,
I'll remove it (here and where I copied it).

>
>
> > + }
> > +
> > + local_flush_icache_all();
> > +
> > + return 0;
> > +}
>
> [...]
>
>
> > @@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
> >   if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
> >   for (i = 0; ret == 0 && i < patch->ninsns; i++) {
> >   len = GET_INSN_LENGTH(patch->insns[i]);
> > - ret = patch_text_nosync(patch->addr + i * len,
> > - &patch->insns[i], len);
> > + ret = patch_insn_write(patch->addr + i * len, 
> > &patch->insns[i], len);
> >   }
> >   atomic_inc(&patch->cpu_count);
> >   } else {
> > @@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
> >   smp_mb();
> >   }
> >
> > + local_flush_icache_all();
> > +
> >   return ret;
> >  }
> >  NOKPROBE_SYMBOL(patch_text_cb);
>
> My above remarks/questions also apply to this function.
>
>
> On a last topic, although somehow orthogonal to the scope of this patch,
> I'm not sure the patch_{map,unmap}() dance in our patch_insn_write() is
> correct: I can see why we may want (need to do) the local TLB flush be-
> fore returning from patch_{map,unmap}(), but does a local flush suffice?
> For comparison, arm64 seems to go through a complete dsb-tlbi-dsb(-isb)
> sequence in their unmapping stage (and apparently relying on "no caching
> of invalid ptes" in their mapping stage).  Of course, "broadcasting" our
> (riscv's) TLB invalidations will necessary introduce some complexity...
>
> Thoughts?

To avoid remote TLBI, could we simply disable the preemption before
the first patch_map()? arm64 disables the irqs, but that seems
overkill to me, but maybe I'm missing something again?
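
Just to sketch the idea (untested, and reusing the patch_map()/patch_unmap()
dance you mention; copy_to_kernel_nofault() stands in for the actual copy
loop and the page-crossing handling is omitted), the poking path would
become something like:

	/*
	 * Untested sketch: keep the task on one cpu while the temporary
	 * fixmap mapping exists, so only the local TLB ever sees it and a
	 * local flush on unmap is enough.
	 */
	preempt_disable();
	waddr = patch_map(addr, FIX_TEXT_POKE0);
	ret = copy_to_kernel_nofault(waddr, insn, len);
	patch_unmap(FIX_TEXT_POKE0);
	preempt_enable();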

Thanks for your comments Andrea,

Alex

>
>   Andrea



[PATCH] riscv: Fix text patching when icache flushes use IPIs

2024-02-06 Thread Alexandre Ghiti
For now, we use stop_machine() to patch the text and when we use IPIs for
remote icache flushes, the system hangs since the irqs are disabled on all
cpus.

So instead, make sure every cpu executes the stop_machine() patching
function which emits a local icache flush and then avoids the use of
IPIs.

Co-developed-by: Björn Töpel 
Signed-off-by: Björn Töpel 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/patch.h |  1 +
 arch/riscv/kernel/ftrace.c | 38 ++
 arch/riscv/kernel/patch.c  | 11 +-
 3 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/arch/riscv/include/asm/patch.h b/arch/riscv/include/asm/patch.h
index e88b52d39eac..9f5d6e14c405 100644
--- a/arch/riscv/include/asm/patch.h
+++ b/arch/riscv/include/asm/patch.h
@@ -6,6 +6,7 @@
 #ifndef _ASM_RISCV_PATCH_H
 #define _ASM_RISCV_PATCH_H
 
+int patch_insn_write(void *addr, const void *insn, size_t len);
 int patch_text_nosync(void *addr, const void *insns, size_t len);
 int patch_text_set_nosync(void *addr, u8 c, size_t len);
 int patch_text(void *addr, u32 *insns, int ninsns);
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index f5aa24d9e1c1..1694a1861d1e 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -75,8 +76,7 @@ static int __ftrace_modify_call(unsigned long hook_pos, 
unsigned long target,
make_call_t0(hook_pos, target, call);
 
/* Replace the auipc-jalr pair at once. Return -EPERM on write error. */
-   if (patch_text_nosync
-   ((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)hook_pos, enable ? call : nops, 
MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -88,7 +88,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
 
make_call_t0(rec->ip, addr, call);
 
-   if (patch_text_nosync((void *)rec->ip, call, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -99,7 +99,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace 
*rec,
 {
unsigned int nops[2] = {NOP4, NOP4};
 
-   if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
+   if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
return -EPERM;
 
return 0;
@@ -134,6 +134,36 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 
return ret;
 }
+
+struct ftrace_modify_param {
+   int command;
+   atomic_t cpu_count;
+};
+
+static int __ftrace_modify_code(void *data)
+{
+   struct ftrace_modify_param *param = data;
+
+   if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
+   ftrace_modify_all_code(param->command);
+   atomic_inc(&param->cpu_count);
+   } else {
+   while (atomic_read(&param->cpu_count) <= num_online_cpus())
+   cpu_relax();
+   smp_mb();
+   }
+
+   local_flush_icache_all();
+
+   return 0;
+}
+
+void arch_ftrace_update_code(int command)
+{
+   struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
+
+   stop_machine(__ftrace_modify_code, &param, cpu_online_mask);
+}
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 37e87fdcf6a0..ec7760a4d6cd 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -188,7 +188,7 @@ int patch_text_set_nosync(void *addr, u8 c, size_t len)
 }
 NOKPROBE_SYMBOL(patch_text_set_nosync);
 
-static int patch_insn_write(void *addr, const void *insn, size_t len)
+int patch_insn_write(void *addr, const void *insn, size_t len)
 {
size_t patched = 0;
size_t size;
@@ -211,11 +211,9 @@ NOKPROBE_SYMBOL(patch_insn_write);
 
 int patch_text_nosync(void *addr, const void *insns, size_t len)
 {
-   u32 *tp = addr;
int ret;
 
-   ret = patch_insn_write(tp, insns, len);
-
+   ret = patch_insn_write(addr, insns, len);
if (!ret)
flush_icache_range((uintptr_t) tp, (uintptr_t) tp + len);
 
@@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
for (i = 0; ret == 0 && i < patch->ninsns; i++) {
len = GET_INSN_LENGTH(patch->insns[i]);
-   ret = patch_text_nosync(patch->addr + i * len,
-   &patch->insns[i], len);
+   ret = patch_insn_write(patch->addr + i * len, 
&patch->insns[i], len);
}
	atomic_inc(&patch->cpu_count);
} else {
@@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
smp_mb();
}
 
+   local_flush_icache_all();
+
return ret;
 }
 NOKPROBE_SYMBOL(patch_text_cb);
-- 
2.39.2




Re: [PATCH -fixes] riscv: Fix ftrace syscall handling which are now prefixed with __riscv_

2023-10-03 Thread Alexandre Ghiti
@Conor Dooley This fails checkpatch but the documentation here states
that this is how to do it:
https://elixir.bootlin.com/linux/latest/source/Documentation/trace/ftrace-design.rst#L246

On Tue, Oct 3, 2023 at 8:24 PM Alexandre Ghiti  wrote:
>
> ftrace creates entries for each syscall in the tracefs but has failed
> since commit 08d0ce30e0e4 ("riscv: Implement syscall wrappers") which
> prefixes all riscv syscalls with __riscv_.
>
> So fix this by implementing arch_syscall_match_sym_name() which allows us
> to ignore this prefix.
>
> And also ignore compat syscalls like x86/arm64 by implementing
> arch_trace_is_compat_syscall().
>
> Fixes: 08d0ce30e0e4 ("riscv: Implement syscall wrappers")
> Signed-off-by: Alexandre Ghiti 
> ---
>  arch/riscv/include/asm/ftrace.h | 21 +
>  1 file changed, 21 insertions(+)
>
> diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h
> index 740a979171e5..2b2f5df7ef2c 100644
> --- a/arch/riscv/include/asm/ftrace.h
> +++ b/arch/riscv/include/asm/ftrace.h
> @@ -31,6 +31,27 @@ static inline unsigned long ftrace_call_adjust(unsigned 
> long addr)
> return addr;
>  }
>
> +/*
> + * Let's do like x86/arm64 and ignore the compat syscalls.
> + */
> +#define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS
> +static inline bool arch_trace_is_compat_syscall(struct pt_regs *regs)
> +{
> +   return is_compat_task();
> +}
> +
> +#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
> +static inline bool arch_syscall_match_sym_name(const char *sym,
> +  const char *name)
> +{
> +   /*
> +* Since all syscall functions have __riscv_ prefix, we must skip it.
> +* However, as we described above, we decided to ignore compat
> +* syscalls, so we don't care about __riscv_compat_ prefix here.
> +*/
> +   return !strcmp(sym + 8, name);
> +}
> +
>  struct dyn_arch_ftrace {
>  };
>  #endif
> --
> 2.39.2
>



[PATCH -fixes] riscv: Fix ftrace syscall handling which are now prefixed with __riscv_

2023-10-03 Thread Alexandre Ghiti
ftrace creates entries for each syscall in tracefs but has failed to do so
since commit 08d0ce30e0e4 ("riscv: Implement syscall wrappers"), which
prefixes all riscv syscalls with __riscv_.

So fix this by implementing arch_syscall_match_sym_name() which allows us
to ignore this prefix.

And also ignore compat syscalls like x86/arm64 by implementing
arch_trace_is_compat_syscall().
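
For illustration, the match boils down to comparing from
sym + strlen("__riscv_") (i.e. sym + 8). A tiny userspace sketch of the
intended matching, with made-up symbol names:

/* Userspace demo only, not kernel code; the symbol names are examples. */
#include <stdio.h>
#include <string.h>

static int match_sym_name(const char *sym, const char *name)
{
	/* strlen("__riscv_") == 8, hence the sym + 8 in the patch below */
	return !strcmp(sym + strlen("__riscv_"), name);
}

int main(void)
{
	printf("%d\n", match_sym_name("__riscv_sys_openat", "sys_openat"));	/* 1 */
	printf("%d\n", match_sym_name("__riscv_sys_clone", "sys_write"));	/* 0 */
	return 0;
}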

Fixes: 08d0ce30e0e4 ("riscv: Implement syscall wrappers")
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/ftrace.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h
index 740a979171e5..2b2f5df7ef2c 100644
--- a/arch/riscv/include/asm/ftrace.h
+++ b/arch/riscv/include/asm/ftrace.h
@@ -31,6 +31,27 @@ static inline unsigned long ftrace_call_adjust(unsigned long 
addr)
return addr;
 }
 
+/*
+ * Let's do like x86/arm64 and ignore the compat syscalls.
+ */
+#define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS
+static inline bool arch_trace_is_compat_syscall(struct pt_regs *regs)
+{
+   return is_compat_task();
+}
+
+#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
+static inline bool arch_syscall_match_sym_name(const char *sym,
+  const char *name)
+{
+   /*
+* Since all syscall functions have __riscv_ prefix, we must skip it.
+* However, as we described above, we decided to ignore compat
+* syscalls, so we don't care about __riscv_compat_ prefix here.
+*/
+   return !strcmp(sym + 8, name);
+}
+
 struct dyn_arch_ftrace {
 };
 #endif
-- 
2.39.2




Re: [PATCH v3 08/13] riscv: extend execmem_params for generated code allocations

2023-09-22 Thread Alexandre Ghiti

Hi Mike,

On 18/09/2023 09:29, Mike Rapoport wrote:

From: "Mike Rapoport (IBM)" 

The memory allocations for kprobes and BPF on RISC-V are not placed in
the modules area and these custom allocations are implemented with
overrides of alloc_insn_page() and  bpf_jit_alloc_exec().

Slightly reorder execmem_params initialization to support both 32 and 64
bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in
riscv::execmem_params and drop overrides of alloc_insn_page() and
bpf_jit_alloc_exec().

Signed-off-by: Mike Rapoport (IBM) 
---
  arch/riscv/kernel/module.c | 21 -
  arch/riscv/kernel/probes/kprobes.c | 10 --
  arch/riscv/net/bpf_jit_core.c  | 13 -
  3 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 343a0edfb6dd..31505ecb5c72 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -436,20 +436,39 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
*strtab,
return 0;
  }
  
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)

+#ifdef CONFIG_MMU
  static struct execmem_params execmem_params __ro_after_init = {
.ranges = {
[EXECMEM_DEFAULT] = {
.pgprot = PAGE_KERNEL,
.alignment = 1,
},
+   [EXECMEM_KPROBES] = {
+   .pgprot = PAGE_KERNEL_READ_EXEC,
+   .alignment = 1,
+   },
+   [EXECMEM_BPF] = {
+   .pgprot = PAGE_KERNEL,
+   .alignment = 1,



Not entirely sure it is the same alignment (sorry, I did not go through the
entire series), but if it is, the alignment above ^ is not the same as the
one requested by our current bpf_jit_alloc_exec() implementation, which is
PAGE_SIZE.
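
If it does need to match, I guess the BPF range would end up looking
something like this (untested, just to show what I mean):

+		[EXECMEM_BPF] = {
+			.pgprot = PAGE_KERNEL,
+			.alignment = PAGE_SIZE,
+		},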




+   },
},
  };
  
  struct execmem_params __init *execmem_arch_params(void)

  {
+#ifdef CONFIG_64BIT
execmem_params.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
execmem_params.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+#else
+   execmem_params.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
+   execmem_params.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
+#endif
+
+   execmem_params.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+   execmem_params.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+   execmem_params.ranges[EXECMEM_BPF].start = BPF_JIT_REGION_START;
+   execmem_params.ranges[EXECMEM_BPF].end = BPF_JIT_REGION_END;
  
	return &execmem_params;

  }
diff --git a/arch/riscv/kernel/probes/kprobes.c 
b/arch/riscv/kernel/probes/kprobes.c
index 2f08c14a933d..e64f2f3064eb 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
  }
  
-#ifdef CONFIG_MMU

-void *alloc_insn_page(void)
-{
-   return  __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
-VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-__builtin_return_address(0));
-}
-#endif
-
  /* install breakpoint in text */
  void __kprobes arch_arm_kprobe(struct kprobe *p)
  {
diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c
index 7b70ccb7fec3..c8a758f0882b 100644
--- a/arch/riscv/net/bpf_jit_core.c
+++ b/arch/riscv/net/bpf_jit_core.c
@@ -218,19 +218,6 @@ u64 bpf_jit_alloc_exec_limit(void)
return BPF_JIT_REGION_SIZE;
  }
  
-void *bpf_jit_alloc_exec(unsigned long size)

-{
-   return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
-   BPF_JIT_REGION_END, GFP_KERNEL,
-   PAGE_KERNEL, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
-}
-
-void bpf_jit_free_exec(void *addr)
-{
-   return vfree(addr);
-}
-
  void *bpf_arch_text_copy(void *dst, void *src, size_t len)
  {
int ret;



Otherwise, you can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex




[PATCH] riscv: Remove 32b kernel mapping from page table dump

2021-04-18 Thread Alexandre Ghiti
The 32b kernel mapping lies in the linear mapping, so there is no point in
printing its address in the page table dump: remove this leftover from
moving the kernel mapping outside the linear mapping for the 64b kernel.

Fixes: e9efb21fe352 ("riscv: Prepare ptdump for vm layout dynamic addresses")
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/ptdump.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 0aba4421115c..a4ed4bdbbfde 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -76,8 +76,8 @@ enum address_markers_idx {
PAGE_OFFSET_NR,
 #ifdef CONFIG_64BIT
MODULES_MAPPING_NR,
-#endif
KERNEL_MAPPING_NR,
+#endif
END_OF_SPACE_NR
 };
 
@@ -99,8 +99,8 @@ static struct addr_marker address_markers[] = {
{0, "Linear mapping"},
 #ifdef CONFIG_64BIT
{0, "Modules mapping"},
-#endif
{0, "Kernel mapping (kernel, BPF)"},
+#endif
{-1, NULL},
 };
 
@@ -379,8 +379,8 @@ static int ptdump_init(void)
address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
 #ifdef CONFIG_64BIT
address_markers[MODULES_MAPPING_NR].start_address = MODULES_VADDR;
-#endif
address_markers[KERNEL_MAPPING_NR].start_address = kernel_virt_addr;
+#endif
 
kernel_ptd_info.base_addr = KERN_VIRT_START;
 
-- 
2.20.1



[PATCH] riscv: Fix 32b kernel caused by 64b kernel mapping moving outside linear mapping

2021-04-17 Thread Alexandre Ghiti
Fix multiple leftovers from moving the kernel mapping outside the linear
mapping for the 64b kernel, which left the 32b kernel unusable.

Fixes: 4b67f48da707 ("riscv: Move kernel mapping outside of linear mapping")
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/page.h|  9 +
 arch/riscv/include/asm/pgtable.h | 16 
 arch/riscv/mm/init.c | 25 -
 3 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 22cfb2be60dc..f64b61296c0c 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,15 +90,20 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+#ifdef CONFIG_64BIT
 extern unsigned long va_kernel_pa_offset;
+#endif
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET(pfn_base)
 #else
 #define va_pa_offset   0
+#ifdef CONFIG_64BIT
 #define va_kernel_pa_offset0
+#endif
 #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
+#ifdef CONFIG_64BIT
 extern unsigned long kernel_virt_addr;
 
 #define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_pa_offset))
@@ -112,6 +117,10 @@ extern unsigned long kernel_virt_addr;
(_x < kernel_virt_addr) ?   
\
linear_mapping_va_to_pa(_x) : kernel_mapping_va_to_pa(_x);  
\
})
+#else
+#define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
+#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+#endif
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 80e63a93e903..5afda75cc2c3 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -16,19 +16,27 @@
 #else
 
 #define ADDRESS_SPACE_END  (UL(-1))
-/*
- * Leave 2GB for kernel and BPF at the end of the address space
- */
+
+#ifdef CONFIG_64BIT
+/* Leave 2GB for kernel and BPF at the end of the address space */
 #define KERNEL_LINK_ADDR   (ADDRESS_SPACE_END - SZ_2G + 1)
+#else
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#endif
 
 #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END  (PAGE_OFFSET - 1)
 #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
 
-/* KASLR should leave at least 128MB for BPF after the kernel */
 #define BPF_JIT_REGION_SIZE(SZ_128M)
+#ifdef CONFIG_64BIT
+/* KASLR should leave at least 128MB for BPF after the kernel */
 #define BPF_JIT_REGION_START   PFN_ALIGN((unsigned long)&_end)
 #define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+#else
+#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
+#define BPF_JIT_REGION_END (VMALLOC_END)
+#endif
 
 /* Modules always live before the kernel */
 #ifdef CONFIG_64BIT
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 093f3a96ecfc..dc9b988e0778 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -91,8 +91,10 @@ static void print_vm_layout(void)
  (unsigned long)VMALLOC_END);
print_mlm("lowmem", (unsigned long)PAGE_OFFSET,
  (unsigned long)high_memory);
+#ifdef CONFIG_64BIT
print_mlm("kernel", (unsigned long)KERNEL_LINK_ADDR,
  (unsigned long)ADDRESS_SPACE_END);
+#endif
 }
 #else
 static void print_vm_layout(void) { }
@@ -165,9 +167,11 @@ static struct pt_alloc_ops pt_ops;
 /* Offset between linear mapping virtual address and kernel load address */
 unsigned long va_pa_offset;
 EXPORT_SYMBOL(va_pa_offset);
+#ifdef CONFIG_64BIT
 /* Offset between kernel mapping virtual address and kernel load address */
 unsigned long va_kernel_pa_offset;
 EXPORT_SYMBOL(va_kernel_pa_offset);
+#endif
 unsigned long pfn_base;
 EXPORT_SYMBOL(pfn_base);
 
@@ -410,7 +414,9 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
load_sz = (uintptr_t)(&_end) - load_pa;
 
va_pa_offset = PAGE_OFFSET - load_pa;
+#ifdef CONFIG_64BIT
va_kernel_pa_offset = kernel_virt_addr - load_pa;
+#endif
 
pfn_base = PFN_DOWN(load_pa);
 
@@ -469,12 +475,16 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
   pa + PMD_SIZE, PMD_SIZE, PAGE_KERNEL);
dtb_early_va = (void *)DTB_EARLY_BASE_VA + (dtb_pa & (PMD_SIZE - 1));
 #else /* CONFIG_BUILTIN_DTB */
+#ifdef CONFIG_64BIT
/*
 * __va can't be used since it would return a linear mapping address
 * whereas dtb_early_va will be used before setup_vm_final installs
 * the linear mapping.
 */
dtb_early_va = kernel_mapping_pa_to_va(dtb_pa);
+#else
+   dtb_early_va = __va(dtb_pa);
+#endif /* CONFIG_64BIT */
 #endif /* CONFIG_BUILTIN_DTB */
 #else
 #ifndef CONFIG_BUILTIN_DTB
@@ -486,7 +4

[PATCH] riscv: Protect kernel linear mapping only if CONFIG_STRICT_KERNEL_RWX is set

2021-04-15 Thread Alexandre Ghiti
If CONFIG_STRICT_KERNEL_RWX is not set, we cannot set different permissions
to the kernel data and text sections, so make sure it is defined before
trying to protect the kernel linear mapping.

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/kernel/setup.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index 626003bb5fca..ab394d173cd4 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -264,12 +264,12 @@ void __init setup_arch(char **cmdline_p)
 
sbi_init();
 
-   if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
+   if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)) {
protect_kernel_text_data();
-
-#if defined(CONFIG_64BIT) && defined(CONFIG_MMU)
-   protect_kernel_linear_mapping_text_rodata();
+#ifdef CONFIG_64BIT
+   protect_kernel_linear_mapping_text_rodata();
 #endif
+   }
 
 #ifdef CONFIG_SWIOTLB
swiotlb_init(1);
-- 
2.20.1



[PATCH v8] RISC-V: enable XIP

2021-04-13 Thread Alexandre Ghiti
From: Vitaly Wool 

Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address used
to link the kernel object files and for storing it has to be known
at compile time and is represented by a Kconfig option.

XIP on RISC-V will for the time being only work on MMU-enabled
kernels.
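
For illustration, building an XIP kernel then looks roughly as follows;
the two addresses are board-specific example values taken from the Kconfig
defaults below, not recommendations:

  # Example .config fragment
  CONFIG_XIP_KERNEL=y
  CONFIG_XIP_PHYS_ADDR=0x21000000
  CONFIG_PHYS_RAM_BASE=0x80000000

  # The make target becomes xipImage; the result is arch/riscv/boot/xipImage
  $ make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- xipImage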

Signed-off-by: Alexandre Ghiti  [ Rebase on top of "Move
kernel mapping outside the linear mapping" ]
Signed-off-by: Vitaly Wool 
---
 arch/riscv/Kconfig  |  55 +++-
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  13 +++
 arch/riscv/include/asm/page.h   |  21 +
 arch/riscv/include/asm/pgtable.h|  25 +-
 arch/riscv/kernel/head.S|  46 +-
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |  10 ++-
 arch/riscv/kernel/vmlinux-xip.lds.S | 133 
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/init.c| 115 ++--
 11 files changed, 418 insertions(+), 17 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 8ea60a0a19ae..7c7efdd67a10 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -28,7 +28,7 @@ config RISCV
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_SET_MEMORY
-   select ARCH_HAS_STRICT_KERNEL_RWX if MMU
+   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
@@ -441,7 +441,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -465,11 +465,60 @@ config STACKPROTECTOR_PER_TASK
def_bool y
depends on STACKPROTECTOR && CC_HAVE_STACKPROTECTOR_TLS
 
+config PHYS_RAM_BASE_FIXED
+   bool "Explicitly specified physical RAM address"
+   default n
+
+config PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on PHYS_RAM_BASE_FIXED
+   default "0x8000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
+
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU && SPARSEMEM
+   select PHYS_RAM_BASE_FIXED
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will take more space to
+ store it.  The flash address used to link the kernel object files,
+ and for storing it, is configuration dependent. Therefore, if you
+ say Y here, you must know the proper physical address where to
+ store the kernel image depending on your own flash memory usage.
+
+ Also note that the make target becomes "make xipImage" rather than
+ "make zImage" or "make Image".  The final kernel binary to put in
+ ROM memory will be arch/riscv/boot/xipImage.
+
+ SPARSEMEM is required because the kernel text and rodata that are
+ flash resident are not backed by memmap, then any attempt to get
+ a struct page on those regions will trigger a fault.
+
+ If unsure, say N.
+
+config XIP_PHYS_ADDR
+   hex "XIP Kernel Physical Location"
+   depends on XIP_KERNEL
+   default "0x2100"
+   help
+ This is the physical address in your flash memory the kernel will
+ be linked for and stored to.  This address is dependent on your
+ own flash usage.
+
 endmenu
 
 config BUILTIN_DTB
-   def_bool n
+   bool
depends on OF
+   default y if XIP_KERNEL
 
 menu "Power management options"
 
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 1368d943f1f3..8fcbec03974d 100644
--- a/arch/riscv/

[PATCH v5 3/3] riscv: Prepare ptdump for vm layout dynamic addresses

2021-04-11 Thread Alexandre Ghiti
This is a preparatory patch for sv48 support that will introduce
dynamic PAGE_OFFSET.

Dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic and can't be used
to statically initialize the array used by ptdump to identify the
different zones of the vm layout.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/ptdump.c | 73 +++---
 1 file changed, 61 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index ace74dec7492..0aba4421115c 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -58,29 +58,56 @@ struct ptd_mm_info {
unsigned long end;
 };
 
+enum address_markers_idx {
+#ifdef CONFIG_KASAN
+   KASAN_SHADOW_START_NR,
+   KASAN_SHADOW_END_NR,
+#endif
+   FIXMAP_START_NR,
+   FIXMAP_END_NR,
+   PCI_IO_START_NR,
+   PCI_IO_END_NR,
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   VMEMMAP_START_NR,
+   VMEMMAP_END_NR,
+#endif
+   VMALLOC_START_NR,
+   VMALLOC_END_NR,
+   PAGE_OFFSET_NR,
+#ifdef CONFIG_64BIT
+   MODULES_MAPPING_NR,
+#endif
+   KERNEL_MAPPING_NR,
+   END_OF_SPACE_NR
+};
+
 static struct addr_marker address_markers[] = {
 #ifdef CONFIG_KASAN
-   {KASAN_SHADOW_START,"Kasan shadow start"},
-   {KASAN_SHADOW_END,  "Kasan shadow end"},
+   {0, "Kasan shadow start"},
+   {0, "Kasan shadow end"},
 #endif
-   {FIXADDR_START, "Fixmap start"},
-   {FIXADDR_TOP,   "Fixmap end"},
-   {PCI_IO_START,  "PCI I/O start"},
-   {PCI_IO_END,"PCI I/O end"},
+   {0, "Fixmap start"},
+   {0, "Fixmap end"},
+   {0, "PCI I/O start"},
+   {0, "PCI I/O end"},
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-   {VMEMMAP_START, "vmemmap start"},
-   {VMEMMAP_END,   "vmemmap end"},
+   {0, "vmemmap start"},
+   {0, "vmemmap end"},
+#endif
+   {0, "vmalloc() area"},
+   {0, "vmalloc() end"},
+   {0, "Linear mapping"},
+#ifdef CONFIG_64BIT
+   {0, "Modules mapping"},
 #endif
-   {VMALLOC_START, "vmalloc() area"},
-   {VMALLOC_END,   "vmalloc() end"},
-   {PAGE_OFFSET,   "Linear mapping"},
+   {0, "Kernel mapping (kernel, BPF)"},
{-1, NULL},
 };
 
 static struct ptd_mm_info kernel_ptd_info = {
	.mm		= &init_mm,
.markers= address_markers,
-   .base_addr  = KERN_VIRT_START,
+   .base_addr  = 0,
.end= ULONG_MAX,
 };
 
@@ -335,6 +362,28 @@ static int ptdump_init(void)
 {
unsigned int i, j;
 
+#ifdef CONFIG_KASAN
+   address_markers[KASAN_SHADOW_START_NR].start_address = 
KASAN_SHADOW_START;
+   address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
+#endif
+   address_markers[FIXMAP_START_NR].start_address = FIXADDR_START;
+   address_markers[FIXMAP_END_NR].start_address = FIXADDR_TOP;
+   address_markers[PCI_IO_START_NR].start_address = PCI_IO_START;
+   address_markers[PCI_IO_END_NR].start_address = PCI_IO_END;
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+   address_markers[VMEMMAP_END_NR].start_address = VMEMMAP_END;
+#endif
+   address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
+   address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
+   address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
+#ifdef CONFIG_64BIT
+   address_markers[MODULES_MAPPING_NR].start_address = MODULES_VADDR;
+#endif
+   address_markers[KERNEL_MAPPING_NR].start_address = kernel_virt_addr;
+
+   kernel_ptd_info.base_addr = KERN_VIRT_START;
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
-- 
2.20.1



[PATCH v5 2/3] Documentation: riscv: Add documentation that describes the VM layout

2021-04-11 Thread Alexandre Ghiti
This new document presents the RISC-V virtual memory layout and is based
on the x86 one: it describes the limits of the different regions of the
virtual address space.

Signed-off-by: Alexandre Ghiti 
---
 Documentation/riscv/index.rst |  1 +
 Documentation/riscv/vm-layout.rst | 63 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/riscv/vm-layout.rst

diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
index 6e6e39482502..ea915c196048 100644
--- a/Documentation/riscv/index.rst
+++ b/Documentation/riscv/index.rst
@@ -6,6 +6,7 @@ RISC-V architecture
 :maxdepth: 1
 
 boot-image-header
+vm-layout
 pmu
 patch-acceptance
 
diff --git a/Documentation/riscv/vm-layout.rst 
b/Documentation/riscv/vm-layout.rst
new file mode 100644
index ..329d32098af4
--- /dev/null
+++ b/Documentation/riscv/vm-layout.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=
+Virtual Memory Layout on RISC-V Linux
+=
+
+:Author: Alexandre Ghiti 
+:Date: 12 February 2021
+
+This document describes the virtual memory layout used by the RISC-V Linux
+Kernel.
+
+RISC-V Linux Kernel 32bit
+=
+
+RISC-V Linux Kernel SV32
+
+
+TODO
+
+RISC-V Linux Kernel 64bit
+=
+
+The RISC-V privileged architecture document states that the 64bit addresses
+"must have bits 63–48 all equal to bit 47, or else a page-fault exception will
+occur.": that splits the virtual address space into 2 halves separated by a 
very
+big hole, the lower half is where the userspace resides, the upper half is 
where
+the RISC-V Linux Kernel resides.
+
+RISC-V Linux Kernel SV39
+
+
+::
+
+  ========================================================================================================================
+      Start addr    |   Offset   |     End addr     |  Size   | VM area description
+  ========================================================================================================================
+                    |            |                  |         |
+   0000000000000000 |    0       | 0000003fffffffff |  256 GB | user-space virtual memory, different per mm
+  __________________|____________|__________________|_________|___________________________________________________________
+                    |            |                  |         |
+   0000004000000000 | +256    GB | ffffffbfffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+                    |            |                  |         |     virtual memory addresses up to the -256 GB
+                    |            |                  |         |     starting offset of kernel mappings.
+  __________________|____________|__________________|_________|___________________________________________________________
+                                                              |
+                                                              | Kernel-space virtual memory, shared between all processes:
+  ____________________________________________________________|___________________________________________________________
+                    |            |                  |         |
+   ffffffc000000000 | -256    GB | ffffffc7ffffffff |   32 GB | kasan
+   ffffffcefee00000 | -196    GB | ffffffcefeffffff |    2 MB | fixmap
+   ffffffceff000000 | -196    GB | ffffffceffffffff |   16 MB | PCI io
+   ffffffcf00000000 | -196    GB | ffffffcfffffffff |    4 GB | vmemmap
+   ffffffd000000000 | -192    GB | ffffffdfffffffff |   64 GB | vmalloc/ioremap space
+   ffffffe000000000 | -128    GB | ffffffff7fffffff |  124 GB | direct mapping of all physical memory
+  __________________|____________|__________________|_________|___________________________________________________________
+                                                              |
+                                                              |
+  ____________________________________________________________|___________________________________________________________
+                    |            |                  |         |
+   ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | modules
+   ffffffff80000000 |   -2    GB | ffffffffffffffff |    2 GB | kernel, BPF
+  __________________|____________|__________________|_________|___________________________________________________________
-- 
2.20.1



[PATCH v5 1/3] riscv: Move kernel mapping outside of linear mapping

2021-04-11 Thread Alexandre Ghiti
This is a preparatory patch for relocatable kernel and sv48 support.

The kernel used to be linked at PAGE_OFFSET address therefore we could use
the linear mapping for the kernel mapping. But the relocated kernel base
address will be different from PAGE_OFFSET and since in the linear mapping,
two different virtual addresses cannot point to the same physical address,
the kernel mapping needs to lie outside the linear mapping so that we don't
have to copy it at the same physical offset.

The kernel mapping is moved to the last 2GB of the address space, BPF
is now always after the kernel and modules use the 2GB memory range right
before the kernel, so BPF and modules regions do not overlap. KASLR
implementation will simply have to move the kernel in the last 2GB range
and just take care of leaving enough space for BPF.

In addition, by moving the kernel to the end of the address space, both
sv39 and sv48 kernels will be exactly the same without needing to be
relocated at runtime.
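
To illustrate the resulting translation (the constants below are example
values only: sv39 PAGE_OFFSET, kernel linked in the last 2GB of the address
space and loaded at physical 0x80200000), __pa() now picks its offset
depending on which side of kernel_virt_addr the virtual address falls:

/* Userspace illustration only; the values are not taken from a real boot. */
#include <stdio.h>

#define PAGE_OFFSET      0xffffffe000000000UL	/* start of the linear mapping  */
#define KERNEL_VIRT_ADDR 0xffffffff80000000UL	/* start of the kernel mapping  */
#define LOAD_PA          0x0000000080200000UL	/* kernel load physical address */

static const unsigned long va_pa_offset        = PAGE_OFFSET - LOAD_PA;
static const unsigned long va_kernel_pa_offset = KERNEL_VIRT_ADDR - LOAD_PA;

/* same dispatch as __va_to_pa_nodebug() in the patch below */
static unsigned long va_to_pa(unsigned long va)
{
	return va < KERNEL_VIRT_ADDR ? va - va_pa_offset
				     : va - va_kernel_pa_offset;
}

int main(void)
{
	/* the same physical page reached through both mappings */
	printf("%lx\n", va_to_pa(PAGE_OFFSET + 0x1000));	/* 80201000 */
	printf("%lx\n", va_to_pa(KERNEL_VIRT_ADDR + 0x1000));	/* 80201000 */
	return 0;
}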

Suggested-by: Arnd Bergmann 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 17 +-
 arch/riscv/include/asm/pgtable.h| 37 
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +-
 arch/riscv/kernel/setup.c   |  5 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 87 ++---
 arch/riscv/mm/kasan_init.c  |  9 +++
 arch/riscv/mm/physaddr.c|  2 +-
 12 files changed, 146 insertions(+), 40 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include 
+#include 
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-   . = PAGE_OFFSET;
+   . = KERNEL_LINK_ADDR;
 
.payload : {
*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index adc9d26f3d75..22cfb2be60dc 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,15 +90,28 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET(pfn_base)
 #else
 #define va_pa_offset   0
+#define va_kernel_pa_offset0
 #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
-#define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+extern unsigned long kernel_virt_addr;
+
+#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_pa_offset))
+#define kernel_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_kernel_pa_offset))
+#define __pa_to_va_nodebug(x)  linear_mapping_pa_to_va(x)
+
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) ((unsigned long)(x) - 
va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)  ({  
\
+   unsigned long _x = x;   
\
+   (_x < kernel_virt_addr) ?   
\
+   linear_mapping_va_to_pa(_x) : kernel_mapping_va_to_pa(_x);  
\
+   })
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ebf817c1bdf4..80e63a93e903 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,30 @@
 
 #include 
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include 
-#include 
-#include 
-#include 
+#ifndef CONFIG_MMU
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#else
 
-#ifdef CONFIG_MMU
+#define ADDRESS_SPACE_END  (UL(-1))
+/*
+ * Leave 2GB for kernel and BPF at the end of the address space
+ */
+#define KERNEL_LINK_ADDR   (ADDRESS_SPACE_END - SZ_2G + 1)
 
 #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END  (PAGE_OFFSET - 1)
 #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
 
+/* KASLR should leave at least 128MB for BPF after the kernel */
 #define BPF_JIT_REGION_SIZE(SZ_128M)
-#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START   PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+/* Modules always live before the kernel */
+#ifdef CONFIG_64BIT
+#define MODULES_VADDR  (PFN_ALIGN((

[PATCH v5 0/3] Move kernel mapping outside the linear mapping

2021-04-11 Thread Alexandre Ghiti
I decided to split sv48 support into small series to ease the review.

This patchset pushes the kernel mapping (modules and BPF too) to the last
4GB of the 64bit address space, this allows to:
- implement relocatable kernel (that will come later in another
  patchset) that requires to move the kernel mapping out of the linear
  mapping to avoid to copy the kernel at a different physical address.
- have a single kernel that is not relocatable (and then that avoids the
  performance penalty imposed by PIC kernel) for both sv39 and sv48.

The first patch implements this behaviour, the second patch introduces a
documentation that describes the virtual address space layout of the 64bit
kernel and the last patch is taken from my sv48 series where I simply added
the dump of the modules/kernel/BPF mapping.

I removed the Reviewed-by on the first patch since it changed enough from
last time and deserves a second look.

Changes in v5:
- Fix 32BIT build that failed because MODULE_VADDR does not exist as
  modules lie in the vmalloc zone in 32BIT, reported by kernel test
  robot.

Changes in v4:
- Fix BUILTIN_DTB since we used __va to obtain the virtual address of the
  builtin DTB which returns a linear mapping address, and then we use
  this address before setup_vm_final installs the linear mapping: this
  is not possible anymore since the kernel does not lie inside the
  linear mapping anymore.

Changes in v3:
- Fix broken nommu build as reported by kernel test robot by protecting
  the kernel mapping only in 64BIT and MMU configs, by reverting the
  introduction of load_sz_pmd and by not exporting load_sz/load_pa anymore
  since they were not initialized in nommu config. 

Changes in v2:
- Fix documentation about direct mapping size which is 124GB instead
  of 126GB.
- Fix SPDX missing header in documentation.
- Fix another checkpatch warning about EXPORT_SYMBOL which was not
  directly below variable declaration.
 
Alexandre Ghiti (3):
  riscv: Move kernel mapping outside of linear mapping
  Documentation: riscv: Add documentation that describes the VM layout
  riscv: Prepare ptdump for vm layout dynamic addresses

 Documentation/riscv/index.rst   |  1 +
 Documentation/riscv/vm-layout.rst   | 63 +
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 17 +-
 arch/riscv/include/asm/pgtable.h| 37 
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +-
 arch/riscv/kernel/setup.c   |  5 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 87 ++---
 arch/riscv/mm/kasan_init.c  |  9 +++
 arch/riscv/mm/physaddr.c|  2 +-
 arch/riscv/mm/ptdump.c  | 73 
 15 files changed, 271 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/riscv/vm-layout.rst

-- 
2.20.1



[PATCH v7] RISC-V: enable XIP

2021-04-09 Thread Alexandre Ghiti
From: Vitaly Wool 

Introduce XIP (eXecute In Place) support for RISC-V platforms.
It allows code to be executed directly from non-volatile storage
directly addressable by the CPU, such as QSPI NOR flash which can
be found on many RISC-V platforms. This makes way for significant
optimization of RAM footprint. The XIP kernel is not compressed
since it has to run directly from flash, so it will occupy more
space on the non-volatile storage. The physical flash address used
to link the kernel object files and for storing it has to be known
at compile time and is represented by a Kconfig option.

XIP on RISC-V will for the time being only work on MMU-enabled
kernels.

Signed-off-by: Alexandre Ghiti  [ Rebase on top of "Move
kernel mapping outside the linear mapping" ]
Signed-off-by: Vitaly Wool 
---

Changes in v2:
- dedicated macro for XIP address fixup when MMU is not enabled yet
  o both for 32-bit and 64-bit RISC-V
- SP is explicitly set to a safe place in RAM before __copy_data call
- removed redundant alignment requirements in vmlinux-xip.lds.S
- changed long -> uintptr_t typecast in __XIP_FIXUP macro.
Changes in v3:
- rebased against latest for-next
- XIP address fixup macro now takes an argument
- SMP related fixes
Changes in v4:
- rebased against the current for-next
- less #ifdef's in C/ASM code
- dedicated XIP_FIXUP_OFFSET assembler macro in head.S
- C-specific definitions moved into #ifndef __ASSEMBLY__
- Fixed multi-core boot
Changes in v5:
- fixed build error for non-XIP kernels
Changes in v6:
- XIP_PHYS_RAM_BASE config option renamed to PHYS_RAM_BASE
- added PHYS_RAM_BASE_FIXED config flag to allow usage of
  PHYS_RAM_BASE in non-XIP configurations if needed
- XIP_FIXUP macro rewritten with a tempoarary variable to avoid side
  effects
- fixed crash for non-XIP kernels that don't use built-in DTB
Changes in v7:
- Fix pfn_base that required FIXUP
- Fix copy_data which lacked + 1 in size to copy
- Fix pfn_valid for FLATMEM
- Rebased on top of "Move kernel mapping outside the linear mapping":
  this is the biggest change and affected mm/init.c,
  kernel/vmlinux-xip.lds.S and include/asm/pgtable.h: XIP kernel is now
  mapped like 'normal' kernel at the end of the address space.

 arch/riscv/Kconfig  |  51 ++-
 arch/riscv/Makefile |   8 +-
 arch/riscv/boot/Makefile|  13 +++
 arch/riscv/include/asm/page.h   |  28 ++
 arch/riscv/include/asm/pgtable.h|  25 +-
 arch/riscv/kernel/head.S|  46 +-
 arch/riscv/kernel/head.h|   3 +
 arch/riscv/kernel/setup.c   |  10 ++-
 arch/riscv/kernel/vmlinux-xip.lds.S | 133 
 arch/riscv/kernel/vmlinux.lds.S |   6 ++
 arch/riscv/mm/init.c| 118 ++--
 11 files changed, 424 insertions(+), 17 deletions(-)
 create mode 100644 arch/riscv/kernel/vmlinux-xip.lds.S

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 8ea60a0a19ae..4d0153805927 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -28,7 +28,7 @@ config RISCV
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_SET_MEMORY
-   select ARCH_HAS_STRICT_KERNEL_RWX if MMU
+   select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
@@ -441,7 +441,7 @@ config EFI_STUB
 
 config EFI
bool "UEFI runtime support"
-   depends on OF
+   depends on OF && !XIP_KERNEL
select LIBFDT
select UCS2_STRING
select EFI_PARAMS_FROM_FDT
@@ -465,11 +465,56 @@ config STACKPROTECTOR_PER_TASK
def_bool y
depends on STACKPROTECTOR && CC_HAVE_STACKPROTECTOR_TLS
 
+config PHYS_RAM_BASE_FIXED
+   bool "Explicitly specified physical RAM address"
+   default n
+
+config PHYS_RAM_BASE
+   hex "Platform Physical RAM address"
+   depends on PHYS_RAM_BASE_FIXED
+   default "0x8000"
+   help
+ This is the physical address of RAM in the system. It has to be
+ explicitly specified to run early relocations of read-write data
+ from flash to RAM.
+
+config XIP_KERNEL
+   bool "Kernel Execute-In-Place from ROM"
+   depends on MMU
+   select PHYS_RAM_BASE_FIXED
+   help
+ Execute-In-Place allows the kernel to run from non-volatile storage
+ directly addressable by the CPU, such as NOR flash. This saves RAM
+ space since the text section of the kernel is not loaded from flash
+ to RAM.  Read-write sections, such as the data section and stack,
+ are still copied to RAM.  The XIP kernel is not compressed since
+ it has to run directly from flash, so it will ta

[PATCH v4 3/3] riscv: Prepare ptdump for vm layout dynamic addresses

2021-04-09 Thread Alexandre Ghiti
This is a preparatory patch for sv48 support that will introduce
dynamic PAGE_OFFSET.

Dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic and can't be used
to statically initialize the array used by ptdump to identify the
different zones of the vm layout.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/ptdump.c | 67 ++
 1 file changed, 55 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index ace74dec7492..aa1b3bce61ab 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -58,29 +58,52 @@ struct ptd_mm_info {
unsigned long end;
 };
 
+enum address_markers_idx {
+#ifdef CONFIG_KASAN
+   KASAN_SHADOW_START_NR,
+   KASAN_SHADOW_END_NR,
+#endif
+   FIXMAP_START_NR,
+   FIXMAP_END_NR,
+   PCI_IO_START_NR,
+   PCI_IO_END_NR,
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   VMEMMAP_START_NR,
+   VMEMMAP_END_NR,
+#endif
+   VMALLOC_START_NR,
+   VMALLOC_END_NR,
+   PAGE_OFFSET_NR,
+   MODULES_MAPPING_NR,
+   KERNEL_MAPPING_NR,
+   END_OF_SPACE_NR
+};
+
 static struct addr_marker address_markers[] = {
 #ifdef CONFIG_KASAN
-   {KASAN_SHADOW_START,"Kasan shadow start"},
-   {KASAN_SHADOW_END,  "Kasan shadow end"},
+   {0, "Kasan shadow start"},
+   {0, "Kasan shadow end"},
 #endif
-   {FIXADDR_START, "Fixmap start"},
-   {FIXADDR_TOP,   "Fixmap end"},
-   {PCI_IO_START,  "PCI I/O start"},
-   {PCI_IO_END,"PCI I/O end"},
+   {0, "Fixmap start"},
+   {0, "Fixmap end"},
+   {0, "PCI I/O start"},
+   {0, "PCI I/O end"},
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-   {VMEMMAP_START, "vmemmap start"},
-   {VMEMMAP_END,   "vmemmap end"},
+   {0, "vmemmap start"},
+   {0, "vmemmap end"},
 #endif
-   {VMALLOC_START, "vmalloc() area"},
-   {VMALLOC_END,   "vmalloc() end"},
-   {PAGE_OFFSET,   "Linear mapping"},
+   {0, "vmalloc() area"},
+   {0, "vmalloc() end"},
+   {0, "Linear mapping"},
+   {0, "Modules mapping"},
+   {0, "Kernel mapping (kernel, BPF)"},
{-1, NULL},
 };
 
 static struct ptd_mm_info kernel_ptd_info = {
	.mm		= &init_mm,
.markers= address_markers,
-   .base_addr  = KERN_VIRT_START,
+   .base_addr  = 0,
.end= ULONG_MAX,
 };
 
@@ -335,6 +358,26 @@ static int ptdump_init(void)
 {
unsigned int i, j;
 
+#ifdef CONFIG_KASAN
+   address_markers[KASAN_SHADOW_START_NR].start_address = 
KASAN_SHADOW_START;
+   address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
+#endif
+   address_markers[FIXMAP_START_NR].start_address = FIXADDR_START;
+   address_markers[FIXMAP_END_NR].start_address = FIXADDR_TOP;
+   address_markers[PCI_IO_START_NR].start_address = PCI_IO_START;
+   address_markers[PCI_IO_END_NR].start_address = PCI_IO_END;
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+   address_markers[VMEMMAP_END_NR].start_address = VMEMMAP_END;
+#endif
+   address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
+   address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
+   address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
+   address_markers[MODULES_MAPPING_NR].start_address = MODULES_VADDR;
+   address_markers[KERNEL_MAPPING_NR].start_address = kernel_virt_addr;
+
+   kernel_ptd_info.base_addr = KERN_VIRT_START;
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
-- 
2.20.1



[PATCH v4 2/3] Documentation: riscv: Add documentation that describes the VM layout

2021-04-09 Thread Alexandre Ghiti
This new document presents the RISC-V virtual memory layout and is based
on the x86 one: it describes the limits of the different regions of the
virtual address space.

Signed-off-by: Alexandre Ghiti 
---
 Documentation/riscv/index.rst |  1 +
 Documentation/riscv/vm-layout.rst | 63 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/riscv/vm-layout.rst

diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
index 6e6e39482502..ea915c196048 100644
--- a/Documentation/riscv/index.rst
+++ b/Documentation/riscv/index.rst
@@ -6,6 +6,7 @@ RISC-V architecture
 :maxdepth: 1
 
 boot-image-header
+vm-layout
 pmu
 patch-acceptance
 
diff --git a/Documentation/riscv/vm-layout.rst 
b/Documentation/riscv/vm-layout.rst
new file mode 100644
index ..329d32098af4
--- /dev/null
+++ b/Documentation/riscv/vm-layout.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=
+Virtual Memory Layout on RISC-V Linux
+=
+
+:Author: Alexandre Ghiti 
+:Date: 12 February 2021
+
+This document describes the virtual memory layout used by the RISC-V Linux
+Kernel.
+
+RISC-V Linux Kernel 32bit
+=
+
+RISC-V Linux Kernel SV32
+
+
+TODO
+
+RISC-V Linux Kernel 64bit
+=
+
+The RISC-V privileged architecture document states that the 64bit addresses
+"must have bits 63–48 all equal to bit 47, or else a page-fault exception will
+occur.": that splits the virtual address space into 2 halves separated by a 
very
+big hole, the lower half is where the userspace resides, the upper half is 
where
+the RISC-V Linux Kernel resides.
+
+RISC-V Linux Kernel SV39
+
+
+::
+
+  =========================================================================================================
+      Start addr    |   Offset   |     End addr     |  Size   | VM area description
+  =========================================================================================================
+                    |            |                  |         |
+   0000000000000000 |     0      | 0000003fffffffff |  256 GB | user-space virtual memory, different per mm
+  __________________|____________|__________________|_________|____________________________________________
+                    |            |                  |         |
+   0000004000000000 |  +256 GB   | ffffffbfffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+                    |            |                  |         |     virtual memory addresses up to the -256 GB
+                    |            |                  |         |     starting offset of kernel mappings.
+  __________________|____________|__________________|_________|____________________________________________
+                                                              |
+                                                              | Kernel-space virtual memory, shared between all processes:
+  ____________________________________________________________|____________________________________________
+                    |            |                  |         |
+   ffffffc000000000 |  -256 GB   | ffffffc7ffffffff |   32 GB | kasan
+   ffffffcefee00000 |  -196 GB   | ffffffcefeffffff |    2 MB | fixmap
+   ffffffceff000000 |  -196 GB   | ffffffceffffffff |   16 MB | PCI io
+   ffffffcf00000000 |  -196 GB   | ffffffcfffffffff |    4 GB | vmemmap
+   ffffffd000000000 |  -192 GB   | ffffffdfffffffff |   64 GB | vmalloc/ioremap space
+   ffffffe000000000 |  -128 GB   | ffffffff7fffffff |  124 GB | direct mapping of all physical memory
+  __________________|____________|__________________|_________|____________________________________________
+                                                              |
+                                                              |
+  ____________________________________________________________|____________________________________________
+                    |            |                  |         |
+   ffffffff00000000 |    -4 GB   | ffffffff7fffffff |    2 GB | modules
+   ffffffff80000000 |    -2 GB   | ffffffffffffffff |    2 GB | kernel, BPF
+  __________________|____________|__________________|_________|____________________________________________
-- 
2.20.1



[PATCH v4 1/3] riscv: Move kernel mapping outside of linear mapping

2021-04-09 Thread Alexandre Ghiti
This is a preparatory patch for relocatable kernel and sv48 support.

The kernel used to be linked at the PAGE_OFFSET address, so we could use
the linear mapping for the kernel mapping. But the relocated kernel base
address will be different from PAGE_OFFSET, and since two different virtual
addresses cannot point to the same physical address in the linear mapping,
the kernel mapping needs to lie outside the linear mapping so that we don't
have to copy it at the same physical offset.

The kernel mapping is moved to the last 2GB of the address space: BPF is
now always after the kernel, and modules use the 2GB memory range right
before the kernel, so the BPF and modules regions do not overlap. A KASLR
implementation will simply have to move the kernel within the last 2GB
range and take care of leaving enough space for BPF.

In addition, by moving the kernel to the end of the address space, both
sv39 and sv48 kernels will be exactly the same without needing to be
relocated at runtime.

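For reference, the resulting va-to-pa helper simply picks an offset based on
which side of the kernel mapping the address falls. A minimal user-space
sketch of the __va_to_pa_nodebug() idea follows; the RAM base (0x80000000)
and kernel load address (0x80200000) are example values picked for the demo,
the real offsets are computed at boot.

  #include <stdio.h>
  #include <stdint.h>

  /* Example values only, not the boot-time computed ones. */
  static uint64_t kernel_virt_addr    = 0xffffffff80000000ULL;
  static uint64_t va_pa_offset        = 0xffffffe000000000ULL - 0x80000000ULL;
  static uint64_t va_kernel_pa_offset = 0xffffffff80000000ULL - 0x80200000ULL;

  /* Below the kernel mapping use the linear mapping offset, above it use
   * the kernel mapping offset. */
  static uint64_t va_to_pa(uint64_t va)
  {
          return (va < kernel_virt_addr) ? va - va_pa_offset
                                         : va - va_kernel_pa_offset;
  }

  int main(void)
  {
          printf("linear VA 0xffffffe000000000 -> PA 0x%llx\n",
                 (unsigned long long)va_to_pa(0xffffffe000000000ULL));
          printf("kernel VA 0xffffffff80000000 -> PA 0x%llx\n",
                 (unsigned long long)va_to_pa(0xffffffff80000000ULL));
          return 0;
  }
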
Suggested-by: Arnd Bergmann 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 17 +-
 arch/riscv/include/asm/pgtable.h| 37 
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +-
 arch/riscv/kernel/setup.c   |  5 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 87 ++---
 arch/riscv/mm/kasan_init.c  |  9 +++
 arch/riscv/mm/physaddr.c|  2 +-
 12 files changed, 146 insertions(+), 40 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include 
+#include 
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-   . = PAGE_OFFSET;
+   . = KERNEL_LINK_ADDR;
 
.payload : {
*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index adc9d26f3d75..22cfb2be60dc 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,15 +90,28 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET(pfn_base)
 #else
 #define va_pa_offset   0
+#define va_kernel_pa_offset0
 #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
-#define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+extern unsigned long kernel_virt_addr;
+
+#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_pa_offset))
+#define kernel_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_kernel_pa_offset))
+#define __pa_to_va_nodebug(x)  linear_mapping_pa_to_va(x)
+
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) ((unsigned long)(x) - 
va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)  ({  
\
+   unsigned long _x = x;   
\
+   (_x < kernel_virt_addr) ?   
\
+   linear_mapping_va_to_pa(_x) : kernel_mapping_va_to_pa(_x);  
\
+   })
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ebf817c1bdf4..80e63a93e903 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,30 @@
 
 #include 
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include 
-#include 
-#include 
-#include 
+#ifndef CONFIG_MMU
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#else
 
-#ifdef CONFIG_MMU
+#define ADDRESS_SPACE_END  (UL(-1))
+/*
+ * Leave 2GB for kernel and BPF at the end of the address space
+ */
+#define KERNEL_LINK_ADDR   (ADDRESS_SPACE_END - SZ_2G + 1)
 
 #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END  (PAGE_OFFSET - 1)
 #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
 
+/* KASLR should leave at least 128MB for BPF after the kernel */
 #define BPF_JIT_REGION_SIZE(SZ_128M)
-#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START   PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+/* Modules always live before the kernel */
+#ifdef CONFIG_64BIT
+#define MODULES_VADDR  (PFN_ALIGN((

[PATCH v4 0/3] Move kernel mapping outside the linear mapping

2021-04-09 Thread Alexandre Ghiti
I decided to split sv48 support into small series to ease the review.

This patchset pushes the kernel mapping (modules and BPF too) to the last
4GB of the 64bit address space, which allows us to:
- implement a relocatable kernel (that will come later in another
  patchset), which requires moving the kernel mapping out of the linear
  mapping to avoid copying the kernel to a different physical address.
- have a single kernel that is not relocatable (and thus avoids the
  performance penalty imposed by a PIC kernel) for both sv39 and sv48.

The first patch implements this behaviour, the second patch introduces
documentation that describes the virtual address space layout of the 64bit
kernel, and the last patch is taken from my sv48 series, where I simply
added the dump of the modules/kernel/BPF mappings.

I removed the Reviewed-by on the first patch since it changed enough from
last time and deserves a second look.

Changes in v4:
- Fix BUILTIN_DTB: we used __va to obtain the virtual address of the
  builtin DTB, which returns a linear mapping address, and then used
  this address before setup_vm_final installs the linear mapping; this
  no longer works since the kernel does not lie inside the linear
  mapping anymore.

Changes in v3:
- Fix broken nommu build as reported by kernel test robot by protecting
  the kernel mapping only in 64BIT and MMU configs, by reverting the
  introduction of load_sz_pmd and by not exporting load_sz/load_pa anymore
  since they were not initialized in nommu config. 

Changes in v2:
- Fix documentation about direct mapping size which is 124GB instead
  of 126GB.
- Fix SPDX missing header in documentation.
- Fix another checkpatch warning about EXPORT_SYMBOL which was not
  directly below variable declaration.

Alexandre Ghiti (3):
  riscv: Move kernel mapping outside of linear mapping
  Documentation: riscv: Add documentation that describes the VM layout
  riscv: Prepare ptdump for vm layout dynamic addresses

 Documentation/riscv/index.rst   |  1 +
 Documentation/riscv/vm-layout.rst   | 63 +
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 17 +-
 arch/riscv/include/asm/pgtable.h| 37 
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +-
 arch/riscv/kernel/setup.c   |  5 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 87 ++---
 arch/riscv/mm/kasan_init.c  |  9 +++
 arch/riscv/mm/physaddr.c|  2 +-
 arch/riscv/mm/ptdump.c  | 67 ++
 15 files changed, 265 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/riscv/vm-layout.rst

-- 
2.20.1



[PATCH] driver: of: Properly truncate command line if too long

2021-03-16 Thread Alexandre Ghiti
In case the command line given by the user is too long, warn about it
and truncate it to the last full argument.

This is what efi already does in commit 80b1bfe1cb2f ("efi/libstub:
Don't parse overlong command lines").

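The truncation policy is easy to reproduce outside the kernel. Below is a
minimal user-space sketch; the buffer size and command line are made up,
while the kernel change itself operates on COMMAND_LINE_SIZE and the
flattened DT "bootargs" property.

  #include <stdio.h>
  #include <string.h>
  #include <ctype.h>

  #define CMDLINE_MAX 32    /* stand-in for COMMAND_LINE_SIZE */

  /* Cut an already-truncated string back to the last whitespace so that
   * no parameter is left half-copied. */
  static void truncate_to_last_arg(char *buf)
  {
          char *p = buf + strlen(buf);

          while (p > buf && !isspace((unsigned char)*p))
                  p--;
          *p = '\0';
  }

  int main(void)
  {
          char buf[CMDLINE_MAX];

          /* snprintf() plays the role of the bounded copy from "bootargs". */
          snprintf(buf, sizeof(buf), "%s",
                   "console=ttyS0 loglevel=7 root=/dev/vda1");
          truncate_to_last_arg(buf);
          printf("truncated to: \"%s\"\n", buf);
          return 0;
  }
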
Reported-by: Dmitry Vyukov 
Signed-off-by: Alexandre Ghiti 
---
 drivers/of/fdt.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index dcc1dd96911a..de4c6f9bac39 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include   /* for COMMAND_LINE_SIZE */
 #include 
@@ -1050,9 +1051,27 @@ int __init early_init_dt_scan_chosen(unsigned long node, 
const char *uname,
 
/* Retrieve command line */
p = of_get_flat_dt_prop(node, "bootargs", );
-   if (p != NULL && l > 0)
+   if (p != NULL && l > 0) {
strlcpy(data, p, min(l, COMMAND_LINE_SIZE));
 
+   /*
+* If the given command line size is larger than
+* COMMAND_LINE_SIZE, truncate it to the last complete
+* parameter.
+*/
+   if (l > COMMAND_LINE_SIZE) {
+   char *cmd_p = (char *)data + COMMAND_LINE_SIZE - 1;
+
+   while (!isspace(*cmd_p))
+   cmd_p--;
+
+   *cmd_p = '\0';
+
+   pr_err("Command line is too long: truncated to %d 
bytes\n",
+  (int)(cmd_p - (char *)data + 1));
+   }
+   }
+
/*
 * CONFIG_CMDLINE is meant to be a default in case nothing else
 * managed to set the command line, unless CONFIG_CMDLINE_FORCE
-- 
2.20.1



[PATCH] riscv: Bump COMMAND_LINE_SIZE value to 1024

2021-03-16 Thread Alexandre Ghiti
Increase COMMAND_LINE_SIZE as the current default value is too low for
the syzbot kernel command line.

Reported-by: Dmitry Vyukov 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/uapi/asm/setup.h | 8 
 1 file changed, 8 insertions(+)
 create mode 100644 arch/riscv/include/uapi/asm/setup.h

diff --git a/arch/riscv/include/uapi/asm/setup.h 
b/arch/riscv/include/uapi/asm/setup.h
new file mode 100644
index ..66b13a522880
--- /dev/null
+++ b/arch/riscv/include/uapi/asm/setup.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
+
+#ifndef _UAPI_ASM_RISCV_SETUP_H
+#define _UAPI_ASM_RISCV_SETUP_H
+
+#define COMMAND_LINE_SIZE  1024
+
+#endif /* _UAPI_ASM_RISCV_SETUP_H */
-- 
2.20.1



[PATCH v3 3/3] riscv: Prepare ptdump for vm layout dynamic addresses

2021-03-14 Thread Alexandre Ghiti
This is a preparatory patch for sv48 support that will introduce
dynamic PAGE_OFFSET.

A dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic too, so those addresses
can't be used to statically initialize the array used by ptdump to identify
the different zones of the vm layout.

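The pattern is the usual one for addresses that are only known at runtime:
keep an enum of marker indices, zero-initialize the array, then fill in the
real addresses from the init function. A stand-alone user-space sketch of
that pattern (simplified struct, placeholder addresses):

  #include <stdio.h>

  struct addr_marker {
          unsigned long start_address;
          const char *name;
  };

  enum address_markers_idx {
          FIXMAP_START_NR,
          VMALLOC_START_NR,
          LINEAR_MAP_NR,
          END_OF_SPACE_NR
  };

  /* Addresses are unknown at build time, so they start out as 0... */
  static struct addr_marker address_markers[] = {
          {0, "Fixmap start"},
          {0, "vmalloc() area"},
          {0, "Linear mapping"},
          {-1UL, NULL},
  };

  /* ...and are filled in once the layout is known (placeholder values). */
  static void markers_init(void)
  {
          address_markers[FIXMAP_START_NR].start_address  = 0xffffffcefee00000UL;
          address_markers[VMALLOC_START_NR].start_address = 0xffffffd000000000UL;
          address_markers[LINEAR_MAP_NR].start_address    = 0xffffffe000000000UL;
  }

  int main(void)
  {
          int i;

          markers_init();
          for (i = 0; i < END_OF_SPACE_NR; i++)
                  printf("%-16s 0x%lx\n", address_markers[i].name,
                         address_markers[i].start_address);
          return 0;
  }
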
Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/ptdump.c | 67 ++
 1 file changed, 55 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index ace74dec7492..aa1b3bce61ab 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -58,29 +58,52 @@ struct ptd_mm_info {
unsigned long end;
 };
 
+enum address_markers_idx {
+#ifdef CONFIG_KASAN
+   KASAN_SHADOW_START_NR,
+   KASAN_SHADOW_END_NR,
+#endif
+   FIXMAP_START_NR,
+   FIXMAP_END_NR,
+   PCI_IO_START_NR,
+   PCI_IO_END_NR,
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   VMEMMAP_START_NR,
+   VMEMMAP_END_NR,
+#endif
+   VMALLOC_START_NR,
+   VMALLOC_END_NR,
+   PAGE_OFFSET_NR,
+   MODULES_MAPPING_NR,
+   KERNEL_MAPPING_NR,
+   END_OF_SPACE_NR
+};
+
 static struct addr_marker address_markers[] = {
 #ifdef CONFIG_KASAN
-   {KASAN_SHADOW_START,"Kasan shadow start"},
-   {KASAN_SHADOW_END,  "Kasan shadow end"},
+   {0, "Kasan shadow start"},
+   {0, "Kasan shadow end"},
 #endif
-   {FIXADDR_START, "Fixmap start"},
-   {FIXADDR_TOP,   "Fixmap end"},
-   {PCI_IO_START,  "PCI I/O start"},
-   {PCI_IO_END,"PCI I/O end"},
+   {0, "Fixmap start"},
+   {0, "Fixmap end"},
+   {0, "PCI I/O start"},
+   {0, "PCI I/O end"},
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-   {VMEMMAP_START, "vmemmap start"},
-   {VMEMMAP_END,   "vmemmap end"},
+   {0, "vmemmap start"},
+   {0, "vmemmap end"},
 #endif
-   {VMALLOC_START, "vmalloc() area"},
-   {VMALLOC_END,   "vmalloc() end"},
-   {PAGE_OFFSET,   "Linear mapping"},
+   {0, "vmalloc() area"},
+   {0, "vmalloc() end"},
+   {0, "Linear mapping"},
+   {0, "Modules mapping"},
+   {0, "Kernel mapping (kernel, BPF)"},
{-1, NULL},
 };
 
 static struct ptd_mm_info kernel_ptd_info = {
.mm = _mm,
.markers= address_markers,
-   .base_addr  = KERN_VIRT_START,
+   .base_addr  = 0,
.end= ULONG_MAX,
 };
 
@@ -335,6 +358,26 @@ static int ptdump_init(void)
 {
unsigned int i, j;
 
+#ifdef CONFIG_KASAN
+   address_markers[KASAN_SHADOW_START_NR].start_address = 
KASAN_SHADOW_START;
+   address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
+#endif
+   address_markers[FIXMAP_START_NR].start_address = FIXADDR_START;
+   address_markers[FIXMAP_END_NR].start_address = FIXADDR_TOP;
+   address_markers[PCI_IO_START_NR].start_address = PCI_IO_START;
+   address_markers[PCI_IO_END_NR].start_address = PCI_IO_END;
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+   address_markers[VMEMMAP_END_NR].start_address = VMEMMAP_END;
+#endif
+   address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
+   address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
+   address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
+   address_markers[MODULES_MAPPING_NR].start_address = MODULES_VADDR;
+   address_markers[KERNEL_MAPPING_NR].start_address = kernel_virt_addr;
+
+   kernel_ptd_info.base_addr = KERN_VIRT_START;
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
-- 
2.20.1



[PATCH v3 2/3] Documentation: riscv: Add documentation that describes the VM layout

2021-03-14 Thread Alexandre Ghiti
This new document presents the RISC-V virtual memory layout and is based
on the x86 one: it describes the limits of the different regions of the
virtual address space.

Signed-off-by: Alexandre Ghiti 
---
 Documentation/riscv/index.rst |  1 +
 Documentation/riscv/vm-layout.rst | 63 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/riscv/vm-layout.rst

diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
index 6e6e39482502..ea915c196048 100644
--- a/Documentation/riscv/index.rst
+++ b/Documentation/riscv/index.rst
@@ -6,6 +6,7 @@ RISC-V architecture
 :maxdepth: 1
 
 boot-image-header
+vm-layout
 pmu
 patch-acceptance
 
diff --git a/Documentation/riscv/vm-layout.rst 
b/Documentation/riscv/vm-layout.rst
new file mode 100644
index ..329d32098af4
--- /dev/null
+++ b/Documentation/riscv/vm-layout.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=
+Virtual Memory Layout on RISC-V Linux
+=
+
+:Author: Alexandre Ghiti 
+:Date: 12 February 2021
+
+This document describes the virtual memory layout used by the RISC-V Linux
+Kernel.
+
+RISC-V Linux Kernel 32bit
+=
+
+RISC-V Linux Kernel SV32
+
+
+TODO
+
+RISC-V Linux Kernel 64bit
+=
+
+The RISC-V privileged architecture document states that the 64bit addresses
+"must have bits 63–48 all equal to bit 47, or else a page-fault exception will
+occur.": that splits the virtual address space into 2 halves separated by a 
very
+big hole, the lower half is where the userspace resides, the upper half is 
where
+the RISC-V Linux Kernel resides.
+
+RISC-V Linux Kernel SV39
+
+
+::
+
+  =========================================================================================================
+      Start addr    |   Offset   |     End addr     |  Size   | VM area description
+  =========================================================================================================
+                    |            |                  |         |
+   0000000000000000 |     0      | 0000003fffffffff |  256 GB | user-space virtual memory, different per mm
+  __________________|____________|__________________|_________|____________________________________________
+                    |            |                  |         |
+   0000004000000000 |  +256 GB   | ffffffbfffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+                    |            |                  |         |     virtual memory addresses up to the -256 GB
+                    |            |                  |         |     starting offset of kernel mappings.
+  __________________|____________|__________________|_________|____________________________________________
+                                                              |
+                                                              | Kernel-space virtual memory, shared between all processes:
+  ____________________________________________________________|____________________________________________
+                    |            |                  |         |
+   ffffffc000000000 |  -256 GB   | ffffffc7ffffffff |   32 GB | kasan
+   ffffffcefee00000 |  -196 GB   | ffffffcefeffffff |    2 MB | fixmap
+   ffffffceff000000 |  -196 GB   | ffffffceffffffff |   16 MB | PCI io
+   ffffffcf00000000 |  -196 GB   | ffffffcfffffffff |    4 GB | vmemmap
+   ffffffd000000000 |  -192 GB   | ffffffdfffffffff |   64 GB | vmalloc/ioremap space
+   ffffffe000000000 |  -128 GB   | ffffffff7fffffff |  124 GB | direct mapping of all physical memory
+  __________________|____________|__________________|_________|____________________________________________
+                                                              |
+                                                              |
+  ____________________________________________________________|____________________________________________
+                    |            |                  |         |
+   ffffffff00000000 |    -4 GB   | ffffffff7fffffff |    2 GB | modules
+   ffffffff80000000 |    -2 GB   | ffffffffffffffff |    2 GB | kernel, BPF
+  __________________|____________|__________________|_________|____________________________________________
-- 
2.20.1



[PATCH v3 1/3] riscv: Move kernel mapping outside of linear mapping

2021-03-14 Thread Alexandre Ghiti
This is a preparatory patch for relocatable kernel and sv48 support.

The kernel used to be linked at the PAGE_OFFSET address, so we could use
the linear mapping for the kernel mapping. But the relocated kernel base
address will be different from PAGE_OFFSET, and since two different virtual
addresses cannot point to the same physical address in the linear mapping,
the kernel mapping needs to lie outside the linear mapping so that we don't
have to copy it at the same physical offset.

The kernel mapping is moved to the last 2GB of the address space: BPF is
now always after the kernel, and modules use the 2GB memory range right
before the kernel, so the BPF and modules regions do not overlap. A KASLR
implementation will simply have to move the kernel within the last 2GB
range and take care of leaving enough space for BPF.

In addition, by moving the kernel to the end of the address space, both
sv39 and sv48 kernels will be exactly the same without needing to be
relocated at runtime.

Suggested-by: Arnd Bergmann 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 17 ++-
 arch/riscv/include/asm/pgtable.h| 37 ++
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +--
 arch/riscv/kernel/setup.c   |  5 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 78 ++---
 arch/riscv/mm/kasan_init.c  |  9 
 arch/riscv/mm/physaddr.c|  2 +-
 12 files changed, 139 insertions(+), 38 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include 
+#include 
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-   . = PAGE_OFFSET;
+   . = KERNEL_LINK_ADDR;
 
.payload : {
*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index adc9d26f3d75..0cdd0c4db941 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,15 +90,28 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET(pfn_base)
 #else
 #define va_pa_offset   0
+#define va_kernel_pa_offset0
 #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
-#define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+extern unsigned long kernel_virt_addr;
+
+#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_pa_offset))
+#define __pa_to_va_nodebug(x)  linear_mapping_pa_to_va(x)
+
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) \
+   ((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)  ({  
\
+   unsigned long _x = x;   
\
+   (_x < kernel_virt_addr) ?   
\
+   linear_mapping_va_to_pa(_x) : kernel_mapping_va_to_pa(_x);  
\
+   })
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ebf817c1bdf4..80e63a93e903 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,30 @@
 
 #include 
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include 
-#include 
-#include 
-#include 
+#ifndef CONFIG_MMU
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#else
 
-#ifdef CONFIG_MMU
+#define ADDRESS_SPACE_END  (UL(-1))
+/*
+ * Leave 2GB for kernel and BPF at the end of the address space
+ */
+#define KERNEL_LINK_ADDR   (ADDRESS_SPACE_END - SZ_2G + 1)
 
 #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END  (PAGE_OFFSET - 1)
 #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
 
+/* KASLR should leave at least 128MB for BPF after the kernel */
 #define BPF_JIT_REGION_SIZE(SZ_128M)
-#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START   PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+/* Modules always live before the kernel */
+#ifdef CONFIG_64BIT
+#define MODULES_VADDR  (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
+#define MODULES_END(PFN_ALIGN((unsigned

[PATCH v3 0/3] Move kernel mapping outside the linear mapping

2021-03-14 Thread Alexandre Ghiti
I decided to split sv48 support into small series to ease the review.

This patchset pushes the kernel mapping (modules and BPF too) to the last
4GB of the 64bit address space, which allows us to:
- implement a relocatable kernel (that will come later in another
  patchset), which requires moving the kernel mapping out of the linear
  mapping to avoid copying the kernel to a different physical address.
- have a single kernel that is not relocatable (and thus avoids the
  performance penalty imposed by a PIC kernel) for both sv39 and sv48.

The first patch implements this behaviour, the second patch introduces
documentation that describes the virtual address space layout of the 64bit
kernel, and the last patch is taken from my sv48 series, where I simply
added the dump of the modules/kernel/BPF mappings.

I removed the Reviewed-by on the first patch since it changed enough from
last time and deserves a second look.

Changes in v3:
- Fix broken nommu build as reported by kernel test robot by protecting
  the kernel mapping only in 64BIT and MMU configs, by reverting the
  introduction of load_sz_pmd and by not exporting load_sz/load_pa anymore
  since they were not initialized in nommu config. 

Changes in v2:
- Fix documentation about direct mapping size which is 124GB instead
  of 126GB.
- Fix SPDX missing header in documentation.
- Fix another checkpatch warning about EXPORT_SYMBOL which was not
  directly below variable declaration.

Alexandre Ghiti (3):
  riscv: Move kernel mapping outside of linear mapping
  Documentation: riscv: Add documentation that describes the VM layout
  riscv: Prepare ptdump for vm layout dynamic addresses

 Documentation/riscv/index.rst   |  1 +
 Documentation/riscv/vm-layout.rst   | 63 +++
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 17 ++-
 arch/riscv/include/asm/pgtable.h| 37 ++
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +--
 arch/riscv/kernel/setup.c   |  5 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 78 ++---
 arch/riscv/mm/kasan_init.c  |  9 
 arch/riscv/mm/physaddr.c|  2 +-
 arch/riscv/mm/ptdump.c  | 67 -
 15 files changed, 258 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/riscv/vm-layout.rst

-- 
2.20.1



[PATCH v2 3/3] riscv: Prepare ptdump for vm layout dynamic addresses

2021-03-13 Thread Alexandre Ghiti
This is a preparatory patch for sv48 support that will introduce
dynamic PAGE_OFFSET.

A dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic too, so those addresses
can't be used to statically initialize the array used by ptdump to identify
the different zones of the vm layout.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/ptdump.c | 67 ++
 1 file changed, 55 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index ace74dec7492..aa1b3bce61ab 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -58,29 +58,52 @@ struct ptd_mm_info {
unsigned long end;
 };
 
+enum address_markers_idx {
+#ifdef CONFIG_KASAN
+   KASAN_SHADOW_START_NR,
+   KASAN_SHADOW_END_NR,
+#endif
+   FIXMAP_START_NR,
+   FIXMAP_END_NR,
+   PCI_IO_START_NR,
+   PCI_IO_END_NR,
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   VMEMMAP_START_NR,
+   VMEMMAP_END_NR,
+#endif
+   VMALLOC_START_NR,
+   VMALLOC_END_NR,
+   PAGE_OFFSET_NR,
+   MODULES_MAPPING_NR,
+   KERNEL_MAPPING_NR,
+   END_OF_SPACE_NR
+};
+
 static struct addr_marker address_markers[] = {
 #ifdef CONFIG_KASAN
-   {KASAN_SHADOW_START,"Kasan shadow start"},
-   {KASAN_SHADOW_END,  "Kasan shadow end"},
+   {0, "Kasan shadow start"},
+   {0, "Kasan shadow end"},
 #endif
-   {FIXADDR_START, "Fixmap start"},
-   {FIXADDR_TOP,   "Fixmap end"},
-   {PCI_IO_START,  "PCI I/O start"},
-   {PCI_IO_END,"PCI I/O end"},
+   {0, "Fixmap start"},
+   {0, "Fixmap end"},
+   {0, "PCI I/O start"},
+   {0, "PCI I/O end"},
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-   {VMEMMAP_START, "vmemmap start"},
-   {VMEMMAP_END,   "vmemmap end"},
+   {0, "vmemmap start"},
+   {0, "vmemmap end"},
 #endif
-   {VMALLOC_START, "vmalloc() area"},
-   {VMALLOC_END,   "vmalloc() end"},
-   {PAGE_OFFSET,   "Linear mapping"},
+   {0, "vmalloc() area"},
+   {0, "vmalloc() end"},
+   {0, "Linear mapping"},
+   {0, "Modules mapping"},
+   {0, "Kernel mapping (kernel, BPF)"},
{-1, NULL},
 };
 
 static struct ptd_mm_info kernel_ptd_info = {
.mm = _mm,
.markers= address_markers,
-   .base_addr  = KERN_VIRT_START,
+   .base_addr  = 0,
.end= ULONG_MAX,
 };
 
@@ -335,6 +358,26 @@ static int ptdump_init(void)
 {
unsigned int i, j;
 
+#ifdef CONFIG_KASAN
+   address_markers[KASAN_SHADOW_START_NR].start_address = 
KASAN_SHADOW_START;
+   address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
+#endif
+   address_markers[FIXMAP_START_NR].start_address = FIXADDR_START;
+   address_markers[FIXMAP_END_NR].start_address = FIXADDR_TOP;
+   address_markers[PCI_IO_START_NR].start_address = PCI_IO_START;
+   address_markers[PCI_IO_END_NR].start_address = PCI_IO_END;
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+   address_markers[VMEMMAP_END_NR].start_address = VMEMMAP_END;
+#endif
+   address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
+   address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
+   address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
+   address_markers[MODULES_MAPPING_NR].start_address = MODULES_VADDR;
+   address_markers[KERNEL_MAPPING_NR].start_address = kernel_virt_addr;
+
+   kernel_ptd_info.base_addr = KERN_VIRT_START;
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
-- 
2.20.1



[PATCH v2 2/3] Documentation: riscv: Add documentation that describes the VM layout

2021-03-13 Thread Alexandre Ghiti
This new document presents the RISC-V virtual memory layout and is based
on the x86 one: it describes the limits of the different regions of the
virtual address space.

Signed-off-by: Alexandre Ghiti 
---
 Documentation/riscv/index.rst |  1 +
 Documentation/riscv/vm-layout.rst | 63 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/riscv/vm-layout.rst

diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
index 6e6e39482502..ea915c196048 100644
--- a/Documentation/riscv/index.rst
+++ b/Documentation/riscv/index.rst
@@ -6,6 +6,7 @@ RISC-V architecture
 :maxdepth: 1
 
 boot-image-header
+vm-layout
 pmu
 patch-acceptance
 
diff --git a/Documentation/riscv/vm-layout.rst 
b/Documentation/riscv/vm-layout.rst
new file mode 100644
index ..329d32098af4
--- /dev/null
+++ b/Documentation/riscv/vm-layout.rst
@@ -0,0 +1,63 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=
+Virtual Memory Layout on RISC-V Linux
+=
+
+:Author: Alexandre Ghiti 
+:Date: 12 February 2021
+
+This document describes the virtual memory layout used by the RISC-V Linux
+Kernel.
+
+RISC-V Linux Kernel 32bit
+=
+
+RISC-V Linux Kernel SV32
+
+
+TODO
+
+RISC-V Linux Kernel 64bit
+=
+
+The RISC-V privileged architecture document states that the 64bit addresses
+"must have bits 63–48 all equal to bit 47, or else a page-fault exception will
+occur.": that splits the virtual address space into 2 halves separated by a 
very
+big hole, the lower half is where the userspace resides, the upper half is 
where
+the RISC-V Linux Kernel resides.
+
+RISC-V Linux Kernel SV39
+
+
+::
+
+  =========================================================================================================
+      Start addr    |   Offset   |     End addr     |  Size   | VM area description
+  =========================================================================================================
+                    |            |                  |         |
+   0000000000000000 |     0      | 0000003fffffffff |  256 GB | user-space virtual memory, different per mm
+  __________________|____________|__________________|_________|____________________________________________
+                    |            |                  |         |
+   0000004000000000 |  +256 GB   | ffffffbfffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+                    |            |                  |         |     virtual memory addresses up to the -256 GB
+                    |            |                  |         |     starting offset of kernel mappings.
+  __________________|____________|__________________|_________|____________________________________________
+                                                              |
+                                                              | Kernel-space virtual memory, shared between all processes:
+  ____________________________________________________________|____________________________________________
+                    |            |                  |         |
+   ffffffc000000000 |  -256 GB   | ffffffc7ffffffff |   32 GB | kasan
+   ffffffcefee00000 |  -196 GB   | ffffffcefeffffff |    2 MB | fixmap
+   ffffffceff000000 |  -196 GB   | ffffffceffffffff |   16 MB | PCI io
+   ffffffcf00000000 |  -196 GB   | ffffffcfffffffff |    4 GB | vmemmap
+   ffffffd000000000 |  -192 GB   | ffffffdfffffffff |   64 GB | vmalloc/ioremap space
+   ffffffe000000000 |  -128 GB   | ffffffff7fffffff |  124 GB | direct mapping of all physical memory
+  __________________|____________|__________________|_________|____________________________________________
+                                                              |
+                                                              |
+  ____________________________________________________________|____________________________________________
+                    |            |                  |         |
+   ffffffff00000000 |    -4 GB   | ffffffff7fffffff |    2 GB | modules
+   ffffffff80000000 |    -2 GB   | ffffffffffffffff |    2 GB | kernel, BPF
+  __________________|____________|__________________|_________|____________________________________________
-- 
2.20.1



[PATCH v2 1/3] riscv: Move kernel mapping outside of linear mapping

2021-03-13 Thread Alexandre Ghiti
This is a preparatory patch for relocatable kernel and sv48 support.

The kernel used to be linked at the PAGE_OFFSET address, so we could use
the linear mapping for the kernel mapping. But the relocated kernel base
address will be different from PAGE_OFFSET, and since two different virtual
addresses cannot point to the same physical address in the linear mapping,
the kernel mapping needs to lie outside the linear mapping so that we don't
have to copy it at the same physical offset.

The kernel mapping is moved to the last 2GB of the address space: BPF is
now always after the kernel, and modules use the 2GB memory range right
before the kernel, so the BPF and modules regions do not overlap. A KASLR
implementation will simply have to move the kernel within the last 2GB
range and take care of leaving enough space for BPF.

In addition, by moving the kernel to the end of the address space, both
sv39 and sv48 kernels will be exactly the same without needing to be
relocated at runtime.

Suggested-by: Arnd Bergmann 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 18 ++-
 arch/riscv/include/asm/pgtable.h| 37 +
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +--
 arch/riscv/kernel/setup.c   |  3 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 83 +++--
 arch/riscv/mm/kasan_init.c  |  9 
 arch/riscv/mm/physaddr.c|  2 +-
 12 files changed, 143 insertions(+), 38 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include 
+#include 
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-   . = PAGE_OFFSET;
+   . = KERNEL_LINK_ADDR;
 
.payload : {
*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index adc9d26f3d75..dd69e4a58ee0 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,15 +90,29 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET(pfn_base)
 #else
 #define va_pa_offset   0
+#define va_kernel_pa_offset0
 #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
-#define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+extern unsigned long kernel_virt_addr;
+extern uintptr_t load_pa, load_sz, load_sz_pmd;
+
+#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_pa_offset))
+#define __pa_to_va_nodebug(x)  linear_mapping_pa_to_va(x)
+
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) \
+   ((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)  ({  
\
+   unsigned long _x = x;   
\
+   (_x < kernel_virt_addr) ?   
\
+   linear_mapping_va_to_pa(_x) : kernel_mapping_va_to_pa(_x);  
\
+   })
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ebf817c1bdf4..80e63a93e903 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,30 @@
 
 #include 
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include 
-#include 
-#include 
-#include 
+#ifndef CONFIG_MMU
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#else
 
-#ifdef CONFIG_MMU
+#define ADDRESS_SPACE_END  (UL(-1))
+/*
+ * Leave 2GB for kernel and BPF at the end of the address space
+ */
+#define KERNEL_LINK_ADDR   (ADDRESS_SPACE_END - SZ_2G + 1)
 
 #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END  (PAGE_OFFSET - 1)
 #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
 
+/* KASLR should leave at least 128MB for BPF after the kernel */
 #define BPF_JIT_REGION_SIZE(SZ_128M)
-#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START   PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+/* Modules always live before the kernel */
+#ifdef CONFIG_64BIT
+#define MODULES_VADDR  (PFN_ALIGN((unsigned long)&_end) - SZ_2

[PATCH v2 0/3] Move kernel mapping outside the linear mapping

2021-03-13 Thread Alexandre Ghiti
I decided to split sv48 support into small series to ease the review.

This patchset pushes the kernel mapping (modules and BPF too) to the last
4GB of the 64bit address space, which allows us to:
- implement a relocatable kernel (that will come later in another
  patchset), which requires moving the kernel mapping out of the linear
  mapping to avoid copying the kernel to a different physical address.
- have a single kernel that is not relocatable (and thus avoids the
  performance penalty imposed by a PIC kernel) for both sv39 and sv48.

The first patch implements this behaviour, the second patch introduces
documentation that describes the virtual address space layout of the 64bit
kernel, and the last patch is taken from my sv48 series, where I simply
added the dump of the modules/kernel/BPF mappings.

I removed the Reviewed-by on the first patch since it changed enough from
last time and deserves a second look.

Changes in v2:
- Fix documentation about direct mapping size which is 124GB instead
  of 126GB.
- Fix SPDX missing header in documentation.
- Fix another checkpatch warning about EXPORT_SYMBOL which was not
  directly below variable declaration.

Alexandre Ghiti (3):
  riscv: Move kernel mapping outside of linear mapping
  Documentation: riscv: Add documentation that describes the VM layout
  riscv: Prepare ptdump for vm layout dynamic addresses

 Documentation/riscv/index.rst   |  1 +
 Documentation/riscv/vm-layout.rst   | 63 ++
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 18 ++-
 arch/riscv/include/asm/pgtable.h| 37 +
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +--
 arch/riscv/kernel/setup.c   |  3 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 83 +++--
 arch/riscv/mm/kasan_init.c  |  9 
 arch/riscv/mm/physaddr.c|  2 +-
 arch/riscv/mm/ptdump.c  | 67 ++-
 15 files changed, 262 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/riscv/vm-layout.rst

-- 
2.20.1



[PATCH v3 0/2] Improve KASAN_VMALLOC support

2021-03-13 Thread Alexandre Ghiti
This patchset improves the KASAN vmalloc implementation by fixing an
oversight where the kernel page table was not flushed (patch 1) and by
reworking the kernel page table PGD-level population (patch 2).

Changes in v3:
- Split into 2 patches
- Add reviewed-by

Changes in v2:
- Quiet kernel test robot warnings about missing prototypes by declaring
  the introduced functions as static.

Alexandre Ghiti (2):
  riscv: Ensure page table writes are flushed when initializing KASAN
vmalloc
  riscv: Cleanup KASAN_VMALLOC support

 arch/riscv/mm/kasan_init.c | 61 +-
 1 file changed, 20 insertions(+), 41 deletions(-)

-- 
2.20.1



[PATCH v3 2/2] riscv: Cleanup KASAN_VMALLOC support

2021-03-13 Thread Alexandre Ghiti
When the KASAN vmalloc region is populated, there is no userspace process
and the page table in use is swapper_pg_dir, so there is no need to read
SATP. We can then use the same scheme as the kasan_populate_p*d functions
to walk the page table, which harmonizes the code.

In addition, make use of set_pgd(), which goes through all unused page
table levels, contrary to the p*d_populate functions; this makes the
function work whatever the number of page table levels.

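The walk itself follows the usual pgd_addr_end() iteration. A stand-alone
user-space sketch of that pattern; the PGDIR_SIZE value and the address
range are made up for the demo (64-bit longs assumed):

  #include <stdio.h>

  #define PGDIR_SHIFT 30
  #define PGDIR_SIZE  (1UL << PGDIR_SHIFT)   /* made-up 1GB entries */
  #define PGDIR_MASK  (~(PGDIR_SIZE - 1))

  /* Clamp the next PGD boundary to 'end', like the kernel helper does. */
  static unsigned long pgd_addr_end(unsigned long addr, unsigned long end)
  {
          unsigned long boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

          return (boundary - 1 < end - 1) ? boundary : end;
  }

  int main(void)
  {
          unsigned long vaddr = 0x7fe0000000UL;
          unsigned long end   = 0x8070000000UL;
          unsigned long next;

          do {
                  next = pgd_addr_end(vaddr, end);
                  printf("populate PGD entry covering [0x%lx, 0x%lx)\n",
                         vaddr, next);
          } while (vaddr = next, vaddr != end);
          return 0;
  }
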
Signed-off-by: Alexandre Ghiti 
Reviewed-by: Palmer Dabbelt 
---
 arch/riscv/mm/kasan_init.c | 59 --
 1 file changed, 18 insertions(+), 41 deletions(-)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index 57bf4ae09361..c16178918239 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -11,18 +11,6 @@
 #include 
 #include 
 
-static __init void *early_alloc(size_t size, int node)
-{
-   void *ptr = memblock_alloc_try_nid(size, size,
-   __pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_ACCESSIBLE, node);
-
-   if (!ptr)
-   panic("%pS: Failed to allocate %zu bytes align=%zx nid=%d 
from=%llx\n",
-   __func__, size, size, node, (u64)__pa(MAX_DMA_ADDRESS));
-
-   return ptr;
-}
-
 extern pgd_t early_pg_dir[PTRS_PER_PGD];
 asmlinkage void __init kasan_early_init(void)
 {
@@ -155,38 +143,27 @@ static void __init kasan_populate(void *start, void *end)
memset(start, KASAN_SHADOW_INIT, end - start);
 }
 
-void __init kasan_shallow_populate(void *start, void *end)
+static void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned 
long end)
 {
-   unsigned long vaddr = (unsigned long)start & PAGE_MASK;
-   unsigned long vend = PAGE_ALIGN((unsigned long)end);
-   unsigned long pfn;
-   int index;
+   unsigned long next;
void *p;
-   pud_t *pud_dir, *pud_k;
-   pgd_t *pgd_dir, *pgd_k;
-   p4d_t *p4d_dir, *p4d_k;
-
-   while (vaddr < vend) {
-   index = pgd_index(vaddr);
-   pfn = csr_read(CSR_SATP) & SATP_PPN;
-   pgd_dir = (pgd_t *)pfn_to_virt(pfn) + index;
-   pgd_k = init_mm.pgd + index;
-   pgd_dir = pgd_offset_k(vaddr);
-   set_pgd(pgd_dir, *pgd_k);
-
-   p4d_dir = p4d_offset(pgd_dir, vaddr);
-   p4d_k  = p4d_offset(pgd_k, vaddr);
-
-   vaddr = (vaddr + PUD_SIZE) & PUD_MASK;
-   pud_dir = pud_offset(p4d_dir, vaddr);
-   pud_k = pud_offset(p4d_k, vaddr);
-
-   if (pud_present(*pud_dir)) {
-   p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
-   pud_populate(_mm, pud_dir, p);
+   pgd_t *pgd_k = pgd_offset_k(vaddr);
+
+   do {
+   next = pgd_addr_end(vaddr, end);
+   if (pgd_page_vaddr(*pgd_k) == (unsigned 
long)lm_alias(kasan_early_shadow_pmd)) {
+   p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+   set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
}
-   vaddr += PAGE_SIZE;
-   }
+   } while (pgd_k++, vaddr = next, vaddr != end);
+}
+
+static void __init kasan_shallow_populate(void *start, void *end)
+{
+   unsigned long vaddr = (unsigned long)start & PAGE_MASK;
+   unsigned long vend = PAGE_ALIGN((unsigned long)end);
+
+   kasan_shallow_populate_pgd(vaddr, vend);
 
local_flush_tlb_all();
 }
-- 
2.20.1



[PATCH v3 1/2] riscv: Ensure page table writes are flushed when initializing KASAN vmalloc

2021-03-13 Thread Alexandre Ghiti
Make sure that writes to the kernel page table during KASAN vmalloc
initialization are made visible by adding an sfence.vma.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Palmer Dabbelt 
---
 arch/riscv/mm/kasan_init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index 1b968855d389..57bf4ae09361 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -187,6 +187,8 @@ void __init kasan_shallow_populate(void *start, void *end)
}
vaddr += PAGE_SIZE;
}
+
+   local_flush_tlb_all();
 }
 
 void __init kasan_init(void)
-- 
2.20.1



[PATCH v2] riscv: Improve KASAN_VMALLOC support

2021-02-26 Thread Alexandre Ghiti
When the KASAN vmalloc region is populated, there is no userspace process
and the page table in use is swapper_pg_dir, so there is no need to read
SATP. We can then use the same scheme as the kasan_populate_p*d functions
to walk the page table, which harmonizes the code.

In addition, make use of set_pgd(), which goes through all unused page
table levels, contrary to the p*d_populate functions; this makes the
function work whatever the number of page table levels.

And finally, make sure the writes to swapper_pg_dir are visible using
an sfence.vma.

Signed-off-by: Alexandre Ghiti 
---

Changes in v2:
- Quiet kernel test robot warnings about missing prototypes by declaring
  the introduced functions as static.

 arch/riscv/mm/kasan_init.c | 61 +-
 1 file changed, 20 insertions(+), 41 deletions(-)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index e3d91f334b57..aaa3bdc0ffc0 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -11,18 +11,6 @@
 #include 
 #include 
 
-static __init void *early_alloc(size_t size, int node)
-{
-   void *ptr = memblock_alloc_try_nid(size, size,
-   __pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_ACCESSIBLE, node);
-
-   if (!ptr)
-   panic("%pS: Failed to allocate %zu bytes align=%zx nid=%d 
from=%llx\n",
-   __func__, size, size, node, (u64)__pa(MAX_DMA_ADDRESS));
-
-   return ptr;
-}
-
 extern pgd_t early_pg_dir[PTRS_PER_PGD];
 asmlinkage void __init kasan_early_init(void)
 {
@@ -155,38 +143,29 @@ static void __init kasan_populate(void *start, void *end)
memset(start, KASAN_SHADOW_INIT, end - start);
 }
 
-void __init kasan_shallow_populate(void *start, void *end)
+static void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned 
long end)
 {
-   unsigned long vaddr = (unsigned long)start & PAGE_MASK;
-   unsigned long vend = PAGE_ALIGN((unsigned long)end);
-   unsigned long pfn;
-   int index;
+   unsigned long next;
void *p;
-   pud_t *pud_dir, *pud_k;
-   pgd_t *pgd_dir, *pgd_k;
-   p4d_t *p4d_dir, *p4d_k;
-
-   while (vaddr < vend) {
-   index = pgd_index(vaddr);
-   pfn = csr_read(CSR_SATP) & SATP_PPN;
-   pgd_dir = (pgd_t *)pfn_to_virt(pfn) + index;
-   pgd_k = init_mm.pgd + index;
-   pgd_dir = pgd_offset_k(vaddr);
-   set_pgd(pgd_dir, *pgd_k);
-
-   p4d_dir = p4d_offset(pgd_dir, vaddr);
-   p4d_k  = p4d_offset(pgd_k, vaddr);
-
-   vaddr = (vaddr + PUD_SIZE) & PUD_MASK;
-   pud_dir = pud_offset(p4d_dir, vaddr);
-   pud_k = pud_offset(p4d_k, vaddr);
-
-   if (pud_present(*pud_dir)) {
-   p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
-   pud_populate(_mm, pud_dir, p);
+   pgd_t *pgd_k = pgd_offset_k(vaddr);
+
+   do {
+   next = pgd_addr_end(vaddr, end);
+   if (pgd_page_vaddr(*pgd_k) == (unsigned 
long)lm_alias(kasan_early_shadow_pmd)) {
+   p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+   set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
}
-   vaddr += PAGE_SIZE;
-   }
+   } while (pgd_k++, vaddr = next, vaddr != end);
+}
+
+static void __init kasan_shallow_populate(void *start, void *end)
+{
+   unsigned long vaddr = (unsigned long)start & PAGE_MASK;
+   unsigned long vend = PAGE_ALIGN((unsigned long)end);
+
+   kasan_shallow_populate_pgd(vaddr, vend);
+
+   local_flush_tlb_all();
 }
 
 void __init kasan_init(void)
-- 
2.20.1



[PATCH] riscv: Improve KASAN_VMALLOC support

2021-02-26 Thread Alexandre Ghiti
When the KASAN vmalloc region is populated, there is no userspace process
and the page table in use is swapper_pg_dir, so there is no need to read
SATP. We can then use the same scheme as the kasan_populate_p*d functions
to walk the page table, which harmonizes the code.

In addition, make use of set_pgd(), which goes through all unused page
table levels, contrary to the p*d_populate functions; this makes the
function work whatever the number of page table levels.

And finally, make sure the writes to swapper_pg_dir are visible using
an sfence.vma.

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/kasan_init.c | 59 --
 1 file changed, 19 insertions(+), 40 deletions(-)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index e3d91f334b57..b0cee8d35938 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -11,18 +11,6 @@
 #include 
 #include 
 
-static __init void *early_alloc(size_t size, int node)
-{
-   void *ptr = memblock_alloc_try_nid(size, size,
-   __pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_ACCESSIBLE, node);
-
-   if (!ptr)
-   panic("%pS: Failed to allocate %zu bytes align=%zx nid=%d 
from=%llx\n",
-   __func__, size, size, node, (u64)__pa(MAX_DMA_ADDRESS));
-
-   return ptr;
-}
-
 extern pgd_t early_pg_dir[PTRS_PER_PGD];
 asmlinkage void __init kasan_early_init(void)
 {
@@ -155,38 +143,29 @@ static void __init kasan_populate(void *start, void *end)
memset(start, KASAN_SHADOW_INIT, end - start);
 }
 
+void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned long end)
+{
+   unsigned long next;
+   void *p;
+   pgd_t *pgd_k = pgd_offset_k(vaddr);
+
+   do {
+   next = pgd_addr_end(vaddr, end);
+   if (pgd_page_vaddr(*pgd_k) == (unsigned 
long)lm_alias(kasan_early_shadow_pmd)) {
+   p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+   set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
+   }
+   } while (pgd_k++, vaddr = next, vaddr != end);
+}
+
 void __init kasan_shallow_populate(void *start, void *end)
 {
unsigned long vaddr = (unsigned long)start & PAGE_MASK;
unsigned long vend = PAGE_ALIGN((unsigned long)end);
-   unsigned long pfn;
-   int index;
-   void *p;
-   pud_t *pud_dir, *pud_k;
-   pgd_t *pgd_dir, *pgd_k;
-   p4d_t *p4d_dir, *p4d_k;
-
-   while (vaddr < vend) {
-   index = pgd_index(vaddr);
-   pfn = csr_read(CSR_SATP) & SATP_PPN;
-   pgd_dir = (pgd_t *)pfn_to_virt(pfn) + index;
-   pgd_k = init_mm.pgd + index;
-   pgd_dir = pgd_offset_k(vaddr);
-   set_pgd(pgd_dir, *pgd_k);
-
-   p4d_dir = p4d_offset(pgd_dir, vaddr);
-   p4d_k  = p4d_offset(pgd_k, vaddr);
-
-   vaddr = (vaddr + PUD_SIZE) & PUD_MASK;
-   pud_dir = pud_offset(p4d_dir, vaddr);
-   pud_k = pud_offset(p4d_k, vaddr);
-
-   if (pud_present(*pud_dir)) {
-   p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
-   pud_populate(_mm, pud_dir, p);
-   }
-   vaddr += PAGE_SIZE;
-   }
+
+   kasan_shallow_populate_pgd(vaddr, vend);
+
+   local_flush_tlb_all();
 }
 
 void __init kasan_init(void)
-- 
2.20.1



[PATCH 2/3] Documentation: riscv: Add documentation that describes the VM layout

2021-02-25 Thread Alexandre Ghiti
This new document presents the RISC-V virtual memory layout and is based
on the x86 one: it describes the limits of the different regions of the
virtual address space.

Signed-off-by: Alexandre Ghiti 
---
 Documentation/riscv/index.rst |  1 +
 Documentation/riscv/vm-layout.rst | 61 +++
 2 files changed, 62 insertions(+)
 create mode 100644 Documentation/riscv/vm-layout.rst

diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
index 6e6e39482502..ea915c196048 100644
--- a/Documentation/riscv/index.rst
+++ b/Documentation/riscv/index.rst
@@ -6,6 +6,7 @@ RISC-V architecture
 :maxdepth: 1
 
 boot-image-header
+vm-layout
 pmu
 patch-acceptance
 
diff --git a/Documentation/riscv/vm-layout.rst 
b/Documentation/riscv/vm-layout.rst
new file mode 100644
index ..e8e569e2686a
--- /dev/null
+++ b/Documentation/riscv/vm-layout.rst
@@ -0,0 +1,61 @@
+=
+Virtual Memory Layout on RISC-V Linux
+=
+
+:Author: Alexandre Ghiti 
+:Date: 12 February 2021
+
+This document describes the virtual memory layout used by the RISC-V Linux
+Kernel.
+
+RISC-V Linux Kernel 32bit
+=
+
+RISC-V Linux Kernel SV32
+
+
+TODO
+
+RISC-V Linux Kernel 64bit
+=
+
+The RISC-V privileged architecture document states that the 64bit addresses
+"must have bits 63–48 all equal to bit 47, or else a page-fault exception will
+occur.": that splits the virtual address space into 2 halves separated by a 
very
+big hole, the lower half is where the userspace resides, the upper half is 
where
+the RISC-V Linux Kernel resides.
+
+RISC-V Linux Kernel SV39
+
+
+::
+
+  =========================================================================================================
+      Start addr    |   Offset   |     End addr     |  Size   | VM area description
+  =========================================================================================================
+                    |            |                  |         |
+   0000000000000000 |     0      | 0000003fffffffff |  256 GB | user-space virtual memory, different per mm
+  __________________|____________|__________________|_________|____________________________________________
+                    |            |                  |         |
+   0000004000000000 |  +256 GB   | ffffffbfffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+                    |            |                  |         |     virtual memory addresses up to the -256 GB
+                    |            |                  |         |     starting offset of kernel mappings.
+  __________________|____________|__________________|_________|____________________________________________
+                                                              |
+                                                              | Kernel-space virtual memory, shared between all processes:
+  ____________________________________________________________|____________________________________________
+                    |            |                  |         |
+   ffffffc000000000 |  -256 GB   | ffffffc7ffffffff |   32 GB | kasan
+   ffffffcefee00000 |  -196 GB   | ffffffcefeffffff |    2 MB | fixmap
+   ffffffceff000000 |  -196 GB   | ffffffceffffffff |   16 MB | PCI io
+   ffffffcf00000000 |  -196 GB   | ffffffcfffffffff |    4 GB | vmemmap
+   ffffffd000000000 |  -192 GB   | ffffffdfffffffff |   64 GB | vmalloc/ioremap space
+   ffffffe000000000 |  -128 GB   | ffffffff7fffffff |  126 GB | direct mapping of all physical memory
+  __________________|____________|__________________|_________|____________________________________________
+                                                              |
+                                                              |
+  ____________________________________________________________|____________________________________________
+                    |            |                  |         |
+   ffffffff00000000 |    -4 GB   | ffffffff7fffffff |    2 GB | modules
+   ffffffff80000000 |    -2 GB   | ffffffffffffffff |    2 GB | kernel, BPF
+  __________________|____________|__________________|_________|____________________________________________
-- 
2.20.1



[PATCH 3/3] riscv: Prepare ptdump for vm layout dynamic addresses

2021-02-25 Thread Alexandre Ghiti
This is a preparatory patch for sv48 support that will introduce
dynamic PAGE_OFFSET.

A dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic too, so those addresses
can't be used to statically initialize the array used by ptdump to identify
the different zones of the vm layout.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/ptdump.c | 67 ++
 1 file changed, 55 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index ace74dec7492..aa1b3bce61ab 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -58,29 +58,52 @@ struct ptd_mm_info {
unsigned long end;
 };
 
+enum address_markers_idx {
+#ifdef CONFIG_KASAN
+   KASAN_SHADOW_START_NR,
+   KASAN_SHADOW_END_NR,
+#endif
+   FIXMAP_START_NR,
+   FIXMAP_END_NR,
+   PCI_IO_START_NR,
+   PCI_IO_END_NR,
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   VMEMMAP_START_NR,
+   VMEMMAP_END_NR,
+#endif
+   VMALLOC_START_NR,
+   VMALLOC_END_NR,
+   PAGE_OFFSET_NR,
+   MODULES_MAPPING_NR,
+   KERNEL_MAPPING_NR,
+   END_OF_SPACE_NR
+};
+
 static struct addr_marker address_markers[] = {
 #ifdef CONFIG_KASAN
-   {KASAN_SHADOW_START,"Kasan shadow start"},
-   {KASAN_SHADOW_END,  "Kasan shadow end"},
+   {0, "Kasan shadow start"},
+   {0, "Kasan shadow end"},
 #endif
-   {FIXADDR_START, "Fixmap start"},
-   {FIXADDR_TOP,   "Fixmap end"},
-   {PCI_IO_START,  "PCI I/O start"},
-   {PCI_IO_END,"PCI I/O end"},
+   {0, "Fixmap start"},
+   {0, "Fixmap end"},
+   {0, "PCI I/O start"},
+   {0, "PCI I/O end"},
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-   {VMEMMAP_START, "vmemmap start"},
-   {VMEMMAP_END,   "vmemmap end"},
+   {0, "vmemmap start"},
+   {0, "vmemmap end"},
 #endif
-   {VMALLOC_START, "vmalloc() area"},
-   {VMALLOC_END,   "vmalloc() end"},
-   {PAGE_OFFSET,   "Linear mapping"},
+   {0, "vmalloc() area"},
+   {0, "vmalloc() end"},
+   {0, "Linear mapping"},
+   {0, "Modules mapping"},
+   {0, "Kernel mapping (kernel, BPF)"},
{-1, NULL},
 };
 
 static struct ptd_mm_info kernel_ptd_info = {
.mm = _mm,
.markers= address_markers,
-   .base_addr  = KERN_VIRT_START,
+   .base_addr  = 0,
.end= ULONG_MAX,
 };
 
@@ -335,6 +358,26 @@ static int ptdump_init(void)
 {
unsigned int i, j;
 
+#ifdef CONFIG_KASAN
+   address_markers[KASAN_SHADOW_START_NR].start_address = 
KASAN_SHADOW_START;
+   address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
+#endif
+   address_markers[FIXMAP_START_NR].start_address = FIXADDR_START;
+   address_markers[FIXMAP_END_NR].start_address = FIXADDR_TOP;
+   address_markers[PCI_IO_START_NR].start_address = PCI_IO_START;
+   address_markers[PCI_IO_END_NR].start_address = PCI_IO_END;
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+   address_markers[VMEMMAP_END_NR].start_address = VMEMMAP_END;
+#endif
+   address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
+   address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
+   address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
+   address_markers[MODULES_MAPPING_NR].start_address = MODULES_VADDR;
+   address_markers[KERNEL_MAPPING_NR].start_address = kernel_virt_addr;
+
+   kernel_ptd_info.base_addr = KERN_VIRT_START;
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
-- 
2.20.1



[PATCH 1/3] riscv: Move kernel mapping outside of linear mapping

2021-02-25 Thread Alexandre Ghiti
This is a preparatory patch for relocatable kernel and sv48 support.

The kernel used to be linked at the PAGE_OFFSET address, so we could use
the linear mapping for the kernel mapping. But the relocated kernel base
address will be different from PAGE_OFFSET, and since two different virtual
addresses cannot point to the same physical address in the linear mapping,
the kernel mapping needs to lie outside the linear mapping so that we don't
have to copy it at the same physical offset.

The kernel mapping is moved to the last 2GB of the address space: BPF is
now always after the kernel, and modules use the 2GB memory range right
before the kernel, so the BPF and modules regions do not overlap. A KASLR
implementation will simply have to move the kernel within the last 2GB
range and take care of leaving enough space for BPF.

In addition, by moving the kernel to the end of the address space, both
sv39 and sv48 kernels will be exactly the same without needing to be
relocated at runtime.

Suggested-by: Arnd Bergmann 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 18 ++-
 arch/riscv/include/asm/pgtable.h| 37 +
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +--
 arch/riscv/kernel/setup.c   |  3 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 81 +++--
 arch/riscv/mm/kasan_init.c  |  9 
 arch/riscv/mm/physaddr.c|  2 +-
 12 files changed, 141 insertions(+), 38 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include 
+#include 
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-   . = PAGE_OFFSET;
+   . = KERNEL_LINK_ADDR;
 
.payload : {
*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index e49d51b97bc1..6557535dc2da 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,15 +90,29 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET(pfn_base)
 #else
 #define va_pa_offset   0
+#define va_kernel_pa_offset0
 #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
-#define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+extern unsigned long kernel_virt_addr;
+extern uintptr_t load_pa, load_sz, load_sz_pmd;
+
+#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_pa_offset))
+#define __pa_to_va_nodebug(x)  linear_mapping_pa_to_va(x)
+
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) \
+   ((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)  ({						\
+   unsigned long _x = x;							\
+   (_x < kernel_virt_addr) ?						\
+   linear_mapping_va_to_pa(_x) : kernel_mapping_va_to_pa(_x);	\
+   })
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 25e90cf0bde4..50c068993591 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,30 @@
 
 #include 
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include 
-#include 
-#include 
-#include 
+#ifndef CONFIG_MMU
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#else
 
-#ifdef CONFIG_MMU
+#define ADDRESS_SPACE_END  (UL(-1))
+/*
+ * Leave 2GB for kernel and BPF at the end of the address space
+ */
+#define KERNEL_LINK_ADDR   (ADDRESS_SPACE_END - SZ_2G + 1)
 
 #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END  (PAGE_OFFSET - 1)
 #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
 
+/* KASLR should leave at least 128MB for BPF after the kernel */
 #define BPF_JIT_REGION_SIZE(SZ_128M)
-#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START   PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+/* Modules always live before the kernel */
+#ifdef CONFIG_64BIT
+#define MODULES_VADDR  (PFN_ALIGN((unsigned long)&_end) - SZ_2

[PATCH 0/3] Move kernel mapping outside the linear mapping

2021-02-25 Thread Alexandre Ghiti
I decided to split sv48 support in small series to ease the review.

This patchset pushes the kernel mapping (modules and BPF too) to the last
4GB of the 64bit address space, this allows to:
- implement a relocatable kernel (that will come later in another
  patchset), which requires moving the kernel mapping out of the linear
  mapping to avoid having to copy the kernel to a different physical address.
- have a single kernel that is not relocatable (which avoids the
  performance penalty imposed by a PIC kernel) for both sv39 and sv48.

The first patch implements this behaviour, the second patch introduces a
documentation that describes the virtual address space layout of the 64bit
kernel and the last patch is taken from my sv48 series where I simply added
the dump of the modules/kernel/BPF mapping.

I removed the Reviewed-by on the first patch since it changed enough from
last time and deserves a second look.

Alexandre Ghiti (3):
  riscv: Move kernel mapping outside of linear mapping
  Documentation: riscv: Add documentation that describes the VM layout
  riscv: Prepare ptdump for vm layout dynamic addresses

 Documentation/riscv/index.rst   |  1 +
 Documentation/riscv/vm-layout.rst   | 61 ++
 arch/riscv/boot/loader.lds.S|  3 +-
 arch/riscv/include/asm/page.h   | 18 ++-
 arch/riscv/include/asm/pgtable.h| 37 +
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/kernel/head.S|  3 +-
 arch/riscv/kernel/module.c  |  6 +--
 arch/riscv/kernel/setup.c   |  3 ++
 arch/riscv/kernel/vmlinux.lds.S |  3 +-
 arch/riscv/mm/fault.c   | 13 +
 arch/riscv/mm/init.c| 81 +++--
 arch/riscv/mm/kasan_init.c  |  9 
 arch/riscv/mm/physaddr.c|  2 +-
 arch/riscv/mm/ptdump.c  | 67 +++-
 15 files changed, 258 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/riscv/vm-layout.rst

-- 
2.20.1



[PATCH] riscv: Add KASAN_VMALLOC support

2021-02-24 Thread Alexandre Ghiti
Populate the top-level of the kernel page table to implement KASAN_VMALLOC,
lower levels are filled dynamically upon memory allocation at runtime.

Co-developed-by: Nylon Chen 
Signed-off-by: Nylon Chen 
Co-developed-by: Nick Hu 
Signed-off-by: Nick Hu 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/Kconfig |  1 +
 arch/riscv/mm/kasan_init.c | 35 ++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 8eadd1cbd524..3832a537c5d6 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -57,6 +57,7 @@ config RISCV
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if MMU && 64BIT
+   select HAVE_ARCH_KASAN_VMALLOC if MMU && 64BIT
select HAVE_ARCH_KGDB
select HAVE_ARCH_KGDB_QXFER_PKT
select HAVE_ARCH_MMAP_RND_BITS if MMU
diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index 719b6e4d6075..171569df4334 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -142,6 +142,31 @@ static void __init kasan_populate(void *start, void *end)
memset(start, KASAN_SHADOW_INIT, end - start);
 }
 
+void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned long end)
+{
+   unsigned long next;
+   void *p;
+   pgd_t *pgd_k = pgd_offset_k(vaddr);
+
+   do {
+   next = pgd_addr_end(vaddr, end);
+   if (pgd_page_vaddr(*pgd_k) == (unsigned 
long)lm_alias(kasan_early_shadow_pmd)) {
+   p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+   set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
+   }
+   } while (pgd_k++, vaddr = next, vaddr != end);
+}
+
+void __init kasan_shallow_populate(void *start, void *end)
+{
+   unsigned long vaddr = (unsigned long)start & PAGE_MASK;
+   unsigned long vend = PAGE_ALIGN((unsigned long)end);
+
+   kasan_shallow_populate_pgd(vaddr, vend);
+
+   local_flush_tlb_all();
+}
+
 void __init kasan_init(void)
 {
phys_addr_t _start, _end;
@@ -149,7 +174,15 @@ void __init kasan_init(void)
 
kasan_populate_early_shadow((void *)KASAN_SHADOW_START,
(void *)kasan_mem_to_shadow((void *)
-   VMALLOC_END));
+   VMEMMAP_END));
+   if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
+   kasan_shallow_populate(
+   (void *)kasan_mem_to_shadow((void *)VMALLOC_START),
+   (void *)kasan_mem_to_shadow((void *)VMALLOC_END));
+   else
+   kasan_populate_early_shadow(
+   (void *)kasan_mem_to_shadow((void *)VMALLOC_START),
+   (void *)kasan_mem_to_shadow((void *)VMALLOC_END));
 
for_each_mem_range(i, &_start, &_end) {
void *start = (void *)_start;
-- 
2.20.1



[PATCH] riscv: Pass virtual addresses to kasan_mem_to_shadow

2021-02-22 Thread Alexandre Ghiti
kasan_mem_to_shadow translates virtual addresses to kasan shadow
addresses whereas for_each_mem_range returns physical addresses: it is
then required to use __va on those addresses before passing them to
kasan_mem_to_shadow.
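
For context, a minimal sketch of the conversion that is now applied
(va_pa_offset is the kernel's linear-mapping offset; this is illustrative,
not the kernel implementation):

/* Sketch: turn a memblock physical address into a linear-mapping virtual
 * address before handing it to kasan_mem_to_shadow(), which only operates
 * on virtual addresses. This mirrors what __va() does on riscv. */
static void *phys_to_linear_va(unsigned long pa, unsigned long va_pa_offset)
{
        return (void *)(pa + va_pa_offset);
}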

Fixes: b10d6bca8720 ("arch, drivers: replace for_each_membock() with 
for_each_mem_range()")
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/kasan_init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index 4b9149f963d3..6d3b88f2c566 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -148,8 +148,8 @@ void __init kasan_init(void)
(void *)kasan_mem_to_shadow((void *)VMALLOC_END));
 
for_each_mem_range(i, &_start, &_end) {
-   void *start = (void *)_start;
-   void *end = (void *)_end;
+   void *start = (void *)__va(_start);
+   void *end = (void *)__va(_end);
 
if (start >= end)
break;
-- 
2.20.1



[PATCH] riscv: Get rid of MAX_EARLY_MAPPING_SIZE

2021-02-21 Thread Alexandre Ghiti
At early boot stage, we have a whole PGDIR to map the kernel, so there
is no need to restrict the early mapping size to 128MB. Removing this
define also allows us to simplify some compile time logic.

This fixes large kernel mappings with a size greater than 128MB, as is
the case for syzbot kernels whose size was just ~130MB.

Note that on rv64, for now, we are then limited to PGDIR size for early
mapping as we can't use PGD mappings (see [1]). That should be enough
given the relative small size of syzbot kernels compared to PGDIR_SIZE
which is 1GB.
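
For reference, a rough sketch of the coverage a single early_pmd page gives on
rv64 sv39 (illustrative arithmetic only):

/* Illustrative: one early_pmd page maps one full PGDIR on sv39. */
#include <stdio.h>

int main(void)
{
        unsigned long pmd_size = 2UL << 20;     /* each PMD entry maps 2MB */
        unsigned long ptrs_per_pmd = 512;       /* entries in one PMD page */

        /* 512 * 2MB = 1024MB = PGDIR_SIZE, far more than a ~130MB kernel. */
        printf("%lu MB\n", (pmd_size * ptrs_per_pmd) >> 20);
        return 0;
}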

[1] https://lore.kernel.org/lkml/20200603153608.30056-1-a...@ghiti.fr/

Reported-by: Dmitry Vyukov 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/init.c | 21 +
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f9f9568d689e..f81f813b9603 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -226,8 +226,6 @@ pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
 pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
 pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
 
-#define MAX_EARLY_MAPPING_SIZE SZ_128M
-
 pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
 
 void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
@@ -302,13 +300,7 @@ static void __init create_pte_mapping(pte_t *ptep,
 
 pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
 pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
-
-#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
-#define NUM_EARLY_PMDS 1UL
-#else
-#define NUM_EARLY_PMDS (1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE)
-#endif
-pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE);
+pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
 pmd_t early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
 
 static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
@@ -330,11 +322,9 @@ static pmd_t *get_pmd_virt_late(phys_addr_t pa)
 
 static phys_addr_t __init alloc_pmd_early(uintptr_t va)
 {
-   uintptr_t pmd_num;
+   BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
 
-   pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
-   BUG_ON(pmd_num >= NUM_EARLY_PMDS);
-   return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
+   return (uintptr_t)early_pmd;
 }
 
 static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
@@ -452,7 +442,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
uintptr_t va, pa, end_va;
uintptr_t load_pa = (uintptr_t)(&_start);
uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
-   uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
+   uintptr_t map_size;
 #ifndef __PAGETABLE_PMD_FOLDED
pmd_t fix_bmap_spmd, fix_bmap_epmd;
 #endif
@@ -464,12 +454,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 * Enforce boot alignment requirements of RV32 and
 * RV64 by only allowing PMD or PGD mappings.
 */
-   BUG_ON(map_size == PAGE_SIZE);
+   map_size = PMD_SIZE;
 
/* Sanity check alignment and size */
BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
BUG_ON((load_pa % map_size) != 0);
-   BUG_ON(load_sz > MAX_EARLY_MAPPING_SIZE);
 
pt_ops.alloc_pte = alloc_pte_early;
pt_ops.get_pte_virt = get_pte_virt_early;
-- 
2.20.1



[PATCH 4/4] riscv: Improve kasan population by using hugepages when possible

2021-02-08 Thread Alexandre Ghiti
The kasan function that populates the shadow regions used to allocate them
page by page and did not take advantage of hugepages, so fix this by
trying to allocate 1GB hugepages and falling back to 2MB hugepages or 4K
pages in case that fails.

This reduces the page table memory consumption and improves TLB usage,
as shown below:

Before this patch:

---[ Kasan shadow start ]---
0xffc0-0xffc40x818ef00016G PTE 
. A . . . . R V
0xffc4-0xffc447fc0x0002b7f4f000   1179392K PTE 
D A . . . W R V
0xffc48000-0xffc80x818ef00014G PTE 
. A . . . . R V
---[ Kasan shadow end ]---

After this patch:

---[ Kasan shadow start ]---
0xffc0-0xffc40x818ef00016G PTE 
. A . . . . R V
0xffc4-0xffc440000x00024000 1G PGD 
D A . . . W R V
0xffc44000-0xffc447e00x0002b7e0   126M PMD 
D A . . . W R V
0xffc447e0-0xffc447fc0x0002b818f000  1792K PTE 
D A . . . W R V
0xffc48000-0xffc80x818ef00014G PTE 
. A . . . . R V
---[ Kasan shadow end ]---
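
A back-of-the-envelope view of the page-table saving visible above (sv39
geometry assumed; illustrative only):

/* Illustrative: page-table pages needed below the PGD to map 1GB of shadow. */
#include <stdio.h>

int main(void)
{
        unsigned long region = 1UL << 30;               /* 1GB of shadow */
        unsigned long pte_pages = region / (2UL << 20); /* one PTE page per 2MB */
        unsigned long pmd_pages = 1;                    /* one PMD page covers 1GB */

        /* 513 pages (~2MB) with 4K mappings versus none with a 1GB PGD leaf. */
        printf("%lu pages\n", pte_pages + pmd_pages);
        return 0;
}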

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/kasan_init.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index b7d4d9abd144..2b196f512f07 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -83,6 +83,15 @@ static void kasan_populate_pmd(pgd_t *pgd, unsigned long 
vaddr, unsigned long en
 
do {
next = pmd_addr_end(vaddr, end);
+
+   if (pmd_none(*pmdp) && IS_ALIGNED(vaddr, PMD_SIZE) && (next - 
vaddr) >= PMD_SIZE) {
+   phys_addr = memblock_phys_alloc(PMD_SIZE, PMD_SIZE);
+   if (phys_addr) {
+   set_pmd(pmdp, pfn_pmd(PFN_DOWN(phys_addr), 
PAGE_KERNEL));
+   continue;
+   }
+   }
+
kasan_populate_pte(pmdp, vaddr, next);
} while (pmdp++, vaddr = next, vaddr != end);
 
@@ -103,6 +112,21 @@ static void kasan_populate_pgd(unsigned long vaddr, 
unsigned long end)
 
do {
next = pgd_addr_end(vaddr, end);
+
+   /*
+* pgdp can't be none since kasan_early_init initialized all 
KASAN
+* shadow region with kasan_early_shadow_pmd: if this is still the case,
+* that means we can try to allocate a hugepage as a 
replacement.
+*/
+   if (pgd_page_vaddr(*pgdp) == (unsigned 
long)lm_alias(kasan_early_shadow_pmd) &&
+   IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= 
PGDIR_SIZE) {
+   phys_addr = memblock_phys_alloc(PGDIR_SIZE, PGDIR_SIZE);
+   if (phys_addr) {
+   set_pgd(pgdp, pfn_pgd(PFN_DOWN(phys_addr), 
PAGE_KERNEL));
+   continue;
+   }
+   }
+
kasan_populate_pmd(pgdp, vaddr, next);
} while (pgdp++, vaddr = next, vaddr != end);
 }
-- 
2.20.1



[PATCH 3/4] riscv: Improve kasan population function

2021-02-08 Thread Alexandre Ghiti
The current population code populates a whole page table without taking care
of what could have already been allocated and without taking into account a
possible index in the page table, assuming the virtual address to map is
always aligned on the page table size, which, for example, won't be the case
when the kernel gets pushed to the end of the address space.

Address those problems by rewriting the kasan population function,
splitting it into subfunctions for each different page table level.

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/kasan_init.c | 91 ++
 1 file changed, 63 insertions(+), 28 deletions(-)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index 7bbe09416a2e..b7d4d9abd144 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -47,37 +47,72 @@ asmlinkage void __init kasan_early_init(void)
local_flush_tlb_all();
 }
 
-static void __init populate(void *start, void *end)
+static void kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned long 
end)
+{
+   phys_addr_t phys_addr;
+   pte_t *ptep, *base_pte;
+
+   if (pmd_none(*pmd))
+   base_pte = memblock_alloc(PTRS_PER_PTE * sizeof(pte_t), 
PAGE_SIZE);
+   else
+   base_pte = (pte_t *)pmd_page_vaddr(*pmd);
+
+   ptep = base_pte + pte_index(vaddr);
+
+   do {
+   if (pte_none(*ptep)) {
+   phys_addr = memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+   set_pte(ptep, pfn_pte(PFN_DOWN(phys_addr), 
PAGE_KERNEL));
+   }
+   } while (ptep++, vaddr += PAGE_SIZE, vaddr != end);
+
+   set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa(base_pte)), PAGE_TABLE));
+}
+
+static void kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned long 
end)
+{
+   phys_addr_t phys_addr;
+   pmd_t *pmdp, *base_pmd;
+   unsigned long next;
+
+   base_pmd = (pmd_t *)pgd_page_vaddr(*pgd);
+   if (base_pmd == lm_alias(kasan_early_shadow_pmd))
+   base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), 
PAGE_SIZE);
+
+   pmdp = base_pmd + pmd_index(vaddr);
+
+   do {
+   next = pmd_addr_end(vaddr, end);
+   kasan_populate_pte(pmdp, vaddr, next);
+   } while (pmdp++, vaddr = next, vaddr != end);
+
+   /*
+* Wait for the whole PGD to be populated before setting the PGD in
+* the page table, otherwise, if we did set the PGD before populating
+* it entirely, memblock could allocate a page at a physical address
+* where KASAN is not populated yet and then we'd get a page fault.
+*/
+   set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
+}
+
+static void kasan_populate_pgd(unsigned long vaddr, unsigned long end)
+{
+   phys_addr_t phys_addr;
+   pgd_t *pgdp = pgd_offset_k(vaddr);
+   unsigned long next;
+
+   do {
+   next = pgd_addr_end(vaddr, end);
+   kasan_populate_pmd(pgdp, vaddr, next);
+   } while (pgdp++, vaddr = next, vaddr != end);
+}
+
+static void __init kasan_populate(void *start, void *end)
 {
-   unsigned long i, offset;
unsigned long vaddr = (unsigned long)start & PAGE_MASK;
unsigned long vend = PAGE_ALIGN((unsigned long)end);
-   unsigned long n_pages = (vend - vaddr) / PAGE_SIZE;
-   unsigned long n_ptes =
-   ((n_pages + PTRS_PER_PTE) & -PTRS_PER_PTE) / PTRS_PER_PTE;
-   unsigned long n_pmds =
-   ((n_ptes + PTRS_PER_PMD) & -PTRS_PER_PMD) / PTRS_PER_PMD;
-
-   pte_t *pte =
-   memblock_alloc(n_ptes * PTRS_PER_PTE * sizeof(pte_t), PAGE_SIZE);
-   pmd_t *pmd =
-   memblock_alloc(n_pmds * PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
-   pgd_t *pgd = pgd_offset_k(vaddr);
-
-   for (i = 0; i < n_pages; i++) {
-   phys_addr_t phys = memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
-   set_pte(&pte[i], pfn_pte(PHYS_PFN(phys), PAGE_KERNEL));
-   }
-
-   for (i = 0, offset = 0; i < n_ptes; i++, offset += PTRS_PER_PTE)
-   set_pmd(&pmd[i],
-   pfn_pmd(PFN_DOWN(__pa(&pte[offset])),
-   __pgprot(_PAGE_TABLE)));
 
-   for (i = 0, offset = 0; i < n_pmds; i++, offset += PTRS_PER_PMD)
-   set_pgd(&pgd[i],
-   pfn_pgd(PFN_DOWN(__pa(&pmd[offset])),
-   __pgprot(_PAGE_TABLE)));
+   kasan_populate_pgd(vaddr, vend);
 
local_flush_tlb_all();
memset(start, KASAN_SHADOW_INIT, end - start);
@@ -99,7 +134,7 @@ void __init kasan_init(void)
if (start >= end)
break;
 
-   populate(kasan_mem_to_shadow(start), kasan_mem_to_shadow(end));
+   kasan_populate(kasan_mem_to_shadow(start), 
kasan_mem_to_shadow(end));
};
 
for (i = 0; i < PTRS_PER_PTE; i++)
-- 
2.20.1



[PATCH 2/4] riscv: Use KASAN_SHADOW_INIT define for kasan memory initialization

2021-02-08 Thread Alexandre Ghiti
Instead of hardcoding memory initialization to 0, use KASAN_SHADOW_INIT.

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/kasan_init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index a8a2ffd9114a..7bbe09416a2e 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -80,7 +80,7 @@ static void __init populate(void *start, void *end)
__pgprot(_PAGE_TABLE)));
 
local_flush_tlb_all();
-   memset(start, 0, end - start);
+   memset(start, KASAN_SHADOW_INIT, end - start);
 }
 
 void __init kasan_init(void)
@@ -108,6 +108,6 @@ void __init kasan_init(void)
   __pgprot(_PAGE_PRESENT | _PAGE_READ |
_PAGE_ACCESSED)));
 
-   memset(kasan_early_shadow_page, 0, PAGE_SIZE);
+   memset(kasan_early_shadow_page, KASAN_SHADOW_INIT, PAGE_SIZE);
init_task.kasan_depth = 0;
 }
-- 
2.20.1



[PATCH 1/4] riscv: Improve kasan definitions

2021-02-08 Thread Alexandre Ghiti
There is no functional change here, only an improvement in code readability:
comments explain where the kasan constants come from and a hardcoded
numerical constant is replaced by the corresponding define.

Note that the comments come from arm64.
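
For concreteness, plugging the sv39 values into the equations documented below
yields the following (illustrative arithmetic; KERN_VIRT_START =
0xffffffc000000000 is assumed):

/* Illustrative: evaluate the KASAN shadow constants for VA_BITS = 39. */
#include <stdio.h>

int main(void)
{
        unsigned long long kern_virt_start = 0xffffffc000000000ULL;    /* assumed */
        unsigned long long shadow_size = 1ULL << (38 - 3);             /* 32GB */
        unsigned long long shadow_end = kern_virt_start + shadow_size;
        unsigned long long offset = shadow_end - (1ULL << (64 - 3));

        printf("KASAN_SHADOW_END    = %llx\n", shadow_end);    /* ffffffc800000000 */
        printf("KASAN_SHADOW_OFFSET = %llx\n", offset);        /* dfffffc800000000 */
        return 0;
}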

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/kasan.h | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index b04028c6218c..a2b3d9cdbc86 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -8,12 +8,28 @@
 
 #ifdef CONFIG_KASAN
 
+/*
+ * The following comment was copied from arm64:
+ * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
+ * KASAN_SHADOW_END: KASAN_SHADOW_START + 1/N of kernel virtual addresses,
+ * where N = (1 << KASAN_SHADOW_SCALE_SHIFT).
+ *
+ * KASAN_SHADOW_OFFSET:
+ * This value is used to map an address to the corresponding shadow
+ * address by the following formula:
+ * shadow_addr = (address >> KASAN_SHADOW_SCALE_SHIFT) + 
KASAN_SHADOW_OFFSET
+ *
+ * (1 << (64 - KASAN_SHADOW_SCALE_SHIFT)) shadow addresses that lie in range
+ * [KASAN_SHADOW_OFFSET, KASAN_SHADOW_END) cover all 64-bits of virtual
+ * addresses. So KASAN_SHADOW_OFFSET should satisfy the following equation:
+ *  KASAN_SHADOW_OFFSET = KASAN_SHADOW_END -
+ *  (1ULL << (64 - KASAN_SHADOW_SCALE_SHIFT))
+ */
 #define KASAN_SHADOW_SCALE_SHIFT   3
 
-#define KASAN_SHADOW_SIZE  (UL(1) << (38 - KASAN_SHADOW_SCALE_SHIFT))
-#define KASAN_SHADOW_START KERN_VIRT_START /* 2^64 - 2^38 */
+#define KASAN_SHADOW_SIZE  (UL(1) << ((CONFIG_VA_BITS - 1) - 
KASAN_SHADOW_SCALE_SHIFT))
+#define KASAN_SHADOW_START KERN_VIRT_START
 #define KASAN_SHADOW_END   (KASAN_SHADOW_START + KASAN_SHADOW_SIZE)
-
 #define KASAN_SHADOW_OFFSET(KASAN_SHADOW_END - (1ULL << \
(64 - KASAN_SHADOW_SCALE_SHIFT)))
 
-- 
2.20.1



[PATCH 0/4] Kasan improvements and fixes

2021-02-08 Thread Alexandre Ghiti
This small series contains some improvements for the riscv KASAN code:

- it brings better readability of the code (patches 1 and 2)
- it fixes an oversight regarding page table population which I uncovered
  while working on my sv48 patchset (patch 3)
- it helps to have better performance by using hugepages when possible
  (patch 4)

Alexandre Ghiti (4):
  riscv: Improve kasan definitions
  riscv: Use KASAN_SHADOW_INIT define for kasan memory initialization
  riscv: Improve kasan population function
  riscv: Improve kasan population by using hugepages when possible

 arch/riscv/include/asm/kasan.h |  22 +-
 arch/riscv/mm/kasan_init.c | 119 -
 2 files changed, 108 insertions(+), 33 deletions(-)

-- 
2.20.1



[PATCH] riscv: Improve kasan population by using hugepages when possible

2021-02-01 Thread Alexandre Ghiti
The kasan function that populates the shadow regions used to allocate them
page by page and did not take advantage of hugepages, so fix this by
trying to allocate 1GB hugepages and falling back to 2MB hugepages or 4K
pages in case that fails.

This reduces the page table memory consumption and improves TLB usage,
as shown below:

Before this patch:

---[ Kasan shadow start ]---
0xffc0-0xffc40x818ef00016G PTE 
. A . . . . R V
0xffc4-0xffc447fc0x0002b7f4f000   1179392K PTE 
D A . . . W R V
0xffc48000-0xffc80x818ef00014G PTE 
. A . . . . R V
---[ Kasan shadow end ]---

After this patch:

---[ Kasan shadow start ]---
0xffc0-0xffc40x818ef00016G PTE 
. A . . . . R V
0xffc4-0xffc440000x00024000 1G PGD 
D A . . . W R V
0xffc44000-0xffc447e00x0002b7e0   126M PMD 
D A . . . W R V
0xffc447e0-0xffc447fc0x0002b818f000  1792K PTE 
D A . . . W R V
0xffc48000-0xffc80x818ef00014G PTE 
. A . . . . R V
---[ Kasan shadow end ]---
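
Roughly, the TLB-reach side of the improvement shown above (illustrative
arithmetic only):

/* Illustrative: TLB entries needed to cover 1GB of shadow memory. */
#include <stdio.h>

int main(void)
{
        unsigned long region = 1UL << 30;       /* 1GB */

        printf("4K pages: %lu entries\n", region >> 12);       /* 262144 */
        printf("2M pages: %lu entries\n", region >> 21);       /* 512 */
        printf("1G pages: %lu entries\n", region >> 30);       /* 1 */
        return 0;
}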

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/kasan_init.c | 101 +++--
 1 file changed, 73 insertions(+), 28 deletions(-)

diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index a8a2ffd9114a..8f11b73018b1 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -47,37 +47,82 @@ asmlinkage void __init kasan_early_init(void)
local_flush_tlb_all();
 }
 
-static void __init populate(void *start, void *end)
+static void kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned long 
end)
+{
+   phys_addr_t phys_addr;
+   pte_t *ptep = memblock_alloc(PTRS_PER_PTE * sizeof(pte_t), PAGE_SIZE);
+
+   do {
+   phys_addr = memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+   set_pte(ptep, pfn_pte(PFN_DOWN(phys_addr), PAGE_KERNEL));
+   } while (ptep++, vaddr += PAGE_SIZE, vaddr != end);
+
+   set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa(ptep)), PAGE_TABLE));
+}
+
+static void kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned long 
end)
+{
+   phys_addr_t phys_addr;
+   pmd_t *pmdp = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
+   unsigned long next;
+
+   do {
+   next = pmd_addr_end(vaddr, end);
+
+   if (IS_ALIGNED(vaddr, PMD_SIZE) && (next - vaddr) >= PMD_SIZE) {
+   phys_addr = memblock_phys_alloc(PMD_SIZE, PMD_SIZE);
+   if (phys_addr) {
+   set_pmd(pmdp, pfn_pmd(PFN_DOWN(phys_addr), 
PAGE_KERNEL));
+   continue;
+   }
+   }
+
+   kasan_populate_pte(pmdp, vaddr, end);
+   } while (pmdp++, vaddr = next, vaddr != end);
+
+   /*
+* Wait for the whole PGD to be populated before setting the PGD in
+* the page table, otherwise, if we did set the PGD before populating
+* it entirely, memblock could allocate a page at a physical address
+* where KASAN is not populated yet and then we'd get a page fault.
+*/
+   set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(pmdp)), PAGE_TABLE));
+}
+
+static void kasan_populate_pgd(unsigned long vaddr, unsigned long end)
+{
+   phys_addr_t phys_addr;
+   pgd_t *pgdp = pgd_offset_k(vaddr);
+   unsigned long next;
+
+   do {
+   next = pgd_addr_end(vaddr, end);
+
+   if (IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= 
PGDIR_SIZE) {
+   phys_addr = memblock_phys_alloc(PGDIR_SIZE, PGDIR_SIZE);
+   if (phys_addr) {
+   set_pgd(pgdp, pfn_pgd(PFN_DOWN(phys_addr), 
PAGE_KERNEL));
+   continue;
+   }
+   }
+
+   kasan_populate_pmd(pgdp, vaddr, end);
+   } while (pgdp++, vaddr = next, vaddr != end);
+}
+
+/*
+ * This function populates KASAN shadow region focusing on hugepages in
+ * order to minimize the page table cost and TLB usage too.
+ * Note that start must be PGDIR_SIZE-aligned in SV39 which amounts to be
+ * 1G aligned (that represents a 8G alignment constraint on virtual address
+ * ranges because of KASAN_SHADOW_SCALE_SHIFT).
+ */
+static void __init kasan_populate(void *start, void *end)
 {
-   unsigned long i, offset;
unsigned long vaddr = (unsigned long)start & PAGE_MASK;
unsigned long vend = PAGE_ALIGN((unsigned long)end);
-   unsigned long n_pages = (vend - vaddr) / PAGE_SIZE;
-   unsigned long n_ptes =
-   ((n_pages + PTRS_PER_PTE) & -PTRS_PER_PTE) / PTRS_PER_PTE;
-   unsigned long n_pmds =
-   ((n_ptes + PTRS_PER_PMD) & -PTRS_PER_PMD) / 

[PATCH] riscv: virt_addr_valid must check the address belongs to linear mapping

2021-01-29 Thread Alexandre Ghiti
The virt_addr_valid macro checks that a virtual address is valid, i.e. that
the address belongs to the linear mapping and that the corresponding
physical page exists.

Add the missing check that ensures the virtual address belongs to the
linear mapping, otherwise __virt_to_phys, when compiled with
CONFIG_DEBUG_VIRTUAL enabled, raises a WARN that is interpreted as a
kernel bug by syzbot.
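
A standalone sketch of the strengthened check (plain C stand-ins for the
kernel helpers; illustrative only):

/* Stand-in for pfn_valid(virt_to_pfn(...)); unconditionally true here. */
static int pfn_valid_stub(unsigned long vaddr) { (void)vaddr; return 1; }

/* Sketch: addresses below PAGE_OFFSET are rejected before any PFN
 * translation, so CONFIG_DEBUG_VIRTUAL's WARN in __virt_to_phys() can no
 * longer fire for vmalloc or user addresses. */
static int virt_addr_valid_sketch(unsigned long vaddr, unsigned long page_offset)
{
        if (vaddr < page_offset)
                return 0;
        return pfn_valid_stub(vaddr);
}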

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/page.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 2d50f76efe48..64a675c5c30a 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -135,7 +135,10 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
 
 #endif /* __ASSEMBLY__ */
 
-#define virt_addr_valid(vaddr) (pfn_valid(virt_to_pfn(vaddr)))
+#define virt_addr_valid(vaddr) ({						\
+   unsigned long _addr = (unsigned long)vaddr;				\
+   (unsigned long)(_addr) >= PAGE_OFFSET && pfn_valid(virt_to_pfn(_addr));	\
+})
 
 #define VM_DATA_DEFAULT_FLAGS  VM_DATA_FLAGS_NON_EXEC
 
-- 
2.20.1



[RFC PATCH 02/12] riscv: Protect the kernel linear mapping

2021-01-04 Thread Alexandre Ghiti
The kernel is now mapped at the end of the address space and it should
be accessed through this mapping only: so map the whole kernel in the
linear mapping as read only.
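
A condensed, standalone view of the protection choice made in setup_vm_final()
below (kernel types replaced by a plain enum; illustrative only):

/* Sketch: the kernel image's linear-mapping alias becomes read-only,
 * the rest of the linear mapping stays read-write. */
enum prot { PROT_RW, PROT_RO };

static enum prot linear_map_prot(unsigned long pa,
                                 unsigned long kernel_start_pa,
                                 unsigned long kernel_end_pa)
{
        if (pa >= kernel_start_pa && pa < kernel_end_pa)
                return PROT_RO;
        return PROT_RW;
}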

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/page.h |  9 -
 arch/riscv/mm/init.c  | 29 +
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 98188e315e8d..a93e35aaa717 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -102,8 +102,15 @@ extern unsigned long pfn_base;
 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
 extern unsigned long kernel_virt_addr;
+extern uintptr_t load_pa, load_sz;
+
+#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_pa_offset))
+#define kernel_mapping_pa_to_va(x) \
+   ((void *)((unsigned long) (x) + va_kernel_pa_offset))
+#define __pa_to_va_nodebug(x)  \
+   ((x >= load_pa && x < load_pa + load_sz) ?  \
+   kernel_mapping_pa_to_va(x): linear_mapping_pa_to_va(x))
 
-#define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
 #define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
 #define kernel_mapping_va_to_pa(x) \
((unsigned long)(x) - va_kernel_pa_offset)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 9d06ff0e015a..7b87c14f1d24 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -159,8 +159,6 @@ void __init setup_bootmem(void)
 {
phys_addr_t mem_start = 0;
phys_addr_t start, end = 0;
-   phys_addr_t vmlinux_end = __pa_symbol(&_end);
-   phys_addr_t vmlinux_start = __pa_symbol(&_start);
u64 i;
 
/* Find the memory region containing the kernel */
@@ -168,7 +166,7 @@ void __init setup_bootmem(void)
phys_addr_t size = end - start;
if (!mem_start)
mem_start = start;
-   if (start <= vmlinux_start && vmlinux_end <= end)
+   if (start <= load_pa && (load_pa + load_sz) <= end)
BUG_ON(size == 0);
}
 
@@ -179,8 +177,13 @@ void __init setup_bootmem(void)
 */
memblock_enforce_memory_limit(mem_start - PAGE_OFFSET);
 
-   /* Reserve from the start of the kernel to the end of the kernel */
-   memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start);
+   /*
+* Reserve from the start of the kernel to the end of the kernel
+* and make sure we align the reservation on PMD_SIZE since we will
+* map the kernel in the linear mapping as read-only: we do not want
+* any allocation to happen between _end and the next pmd aligned page.
+*/
+   memblock_reserve(load_pa, (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1));
 
max_pfn = PFN_DOWN(memblock_end_of_DRAM());
max_low_pfn = max_pfn;
@@ -438,7 +441,9 @@ static uintptr_t __init best_map_size(phys_addr_t base, 
phys_addr_t size)
 #error "setup_vm() is called from head.S before relocate so it should not use 
absolute addressing."
 #endif
 
-static uintptr_t load_pa, load_sz;
+uintptr_t load_pa, load_sz;
+EXPORT_SYMBOL(load_pa);
+EXPORT_SYMBOL(load_sz);
 
 static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
 {
@@ -596,9 +601,17 @@ static void __init setup_vm_final(void)
 
map_size = best_map_size(start, end - start);
for (pa = start; pa < end; pa += map_size) {
-   va = (uintptr_t)__va(pa);
+   pgprot_t prot = PAGE_KERNEL;
+
+   /* Protect the kernel mapping that lies in the linear 
mapping */
+   if (pa >= __pa(_start) && pa < __pa(_end))
+   prot = PAGE_KERNEL_READ;
+
+   /* Make sure we get virtual addresses in the linear 
mapping */
+   va = (uintptr_t)linear_mapping_pa_to_va(pa);
+
create_pgd_mapping(swapper_pg_dir, va, pa,
-  map_size, PAGE_KERNEL);
+  map_size, prot);
}
}
 
-- 
2.20.1



[RFC PATCH 12/12] riscv: Improve virtual kernel memory layout dump

2021-01-04 Thread Alexandre Ghiti
With the arrival of sv48 and its large address space, it would be
cumbersome to statically define the unit size to use to print the different
portions of the virtual memory layout: instead, determine it dynamically.
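
The rule implemented below boils down to "use the largest unit that still
yields a value of at least 10"; a standalone sketch:

/* Illustrative: pick the print unit the way print_ml() does. */
#include <stdio.h>

static const char *pick_unit(unsigned long long diff)
{
        if ((diff >> 40) >= 10)
                return "TB";
        if ((diff >> 30) >= 10)
                return "GB";
        if ((diff >> 20) >= 10)
                return "MB";
        return "kB";
}

int main(void)
{
        /* A 128GB region is reported in GB, a 64TB one in TB. */
        printf("%s %s\n", pick_unit(128ULL << 30), pick_unit(64ULL << 40));
        return 0;
}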

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/init.c  | 46 ---
 include/linux/sizes.h |  3 ++-
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f9a99cb1870b..f06c21985274 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -80,30 +80,62 @@ static void setup_zero_page(void)
 }
 
 #if defined(CONFIG_MMU) && defined(CONFIG_DEBUG_VM)
+
+#define LOG2_SZ_1K  ilog2(SZ_1K)
+#define LOG2_SZ_1M  ilog2(SZ_1M)
+#define LOG2_SZ_1G  ilog2(SZ_1G)
+#define LOG2_SZ_1T  ilog2(SZ_1T)
+
 static inline void print_mlk(char *name, unsigned long b, unsigned long t)
 {
pr_notice("%12s : 0x%08lx - 0x%08lx   (%4ld kB)\n", name, b, t,
- (((t) - (b)) >> 10));
+ (((t) - (b)) >> LOG2_SZ_1K));
 }
 
 static inline void print_mlm(char *name, unsigned long b, unsigned long t)
 {
pr_notice("%12s : 0x%08lx - 0x%08lx   (%4ld MB)\n", name, b, t,
- (((t) - (b)) >> 20));
+ (((t) - (b)) >> LOG2_SZ_1M));
+}
+
+static inline void print_mlg(char *name, unsigned long b, unsigned long t)
+{
+   pr_notice("%12s : 0x%08lx - 0x%08lx   (%4ld GB)\n", name, b, t,
+ (((t) - (b)) >> LOG2_SZ_1G));
+}
+
+static inline void print_mlt(char *name, unsigned long b, unsigned long t)
+{
+   pr_notice("%12s : 0x%08lx - 0x%08lx   (%4ld TB)\n", name, b, t,
+ (((t) - (b)) >> LOG2_SZ_1T));
+}
+
+static inline void print_ml(char *name, unsigned long b, unsigned long t)
+{
+unsigned long diff = t - b;
+
+if ((diff >> LOG2_SZ_1T) >= 10)
+print_mlt(name, b, t);
+else if ((diff >> LOG2_SZ_1G) >= 10)
+print_mlg(name, b, t);
+else if ((diff >> LOG2_SZ_1M) >= 10)
+print_mlm(name, b, t);
+else
+print_mlk(name, b, t);
 }
 
 static void print_vm_layout(void)
 {
pr_notice("Virtual kernel memory layout:\n");
-   print_mlk("fixmap", (unsigned long)FIXADDR_START,
+   print_ml("fixmap", (unsigned long)FIXADDR_START,
  (unsigned long)FIXADDR_TOP);
-   print_mlm("pci io", (unsigned long)PCI_IO_START,
+   print_ml("pci io", (unsigned long)PCI_IO_START,
  (unsigned long)PCI_IO_END);
-   print_mlm("vmemmap", (unsigned long)VMEMMAP_START,
+   print_ml("vmemmap", (unsigned long)VMEMMAP_START,
  (unsigned long)VMEMMAP_END);
-   print_mlm("vmalloc", (unsigned long)VMALLOC_START,
+   print_ml("vmalloc", (unsigned long)VMALLOC_START,
  (unsigned long)VMALLOC_END);
-   print_mlm("lowmem", (unsigned long)PAGE_OFFSET,
+   print_ml("lowmem", (unsigned long)PAGE_OFFSET,
  (unsigned long)high_memory);
 }
 #else
diff --git a/include/linux/sizes.h b/include/linux/sizes.h
index 9874f6f67537..9528b082873b 100644
--- a/include/linux/sizes.h
+++ b/include/linux/sizes.h
@@ -42,8 +42,9 @@
 
 #define SZ_1G  0x4000
 #define SZ_2G  0x8000
-
 #define SZ_4G  _AC(0x1, ULL)
+
+#define SZ_1T  _AC(0x100, ULL)
 #define SZ_64T _AC(0x4000, ULL)
 
 #endif /* __LINUX_SIZES_H__ */
-- 
2.20.1



[RFC PATCH 11/12] riscv: Explicit comment about user virtual address space size

2021-01-04 Thread Alexandre Ghiti
Define precisely the size of the user-accessible virtual address space for
sv32/39/48 mmu types and explain why the whole virtual address space is
split into 2 equal chunks between kernel and user space.
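
For reference, evaluating TASK_SIZE = PGDIR_SIZE * PTRS_PER_PGD / 2 with the
usual rv64 constants reproduces the figures quoted in the comment added below
(illustrative arithmetic only):

/* Illustrative: half the PGD covers user space on both sv39 and sv48. */
#include <stdio.h>

int main(void)
{
        unsigned long long ptrs_per_pgd = 512;
        unsigned long long pgdir_size_sv39 = 1ULL << 30;        /* 1GB */
        unsigned long long pgdir_size_sv48 = 1ULL << 39;        /* 512GB */

        /* 0x4000000000 (256GB) and 0x800000000000 (128TB) respectively. */
        printf("%llx %llx\n",
               pgdir_size_sv39 * ptrs_per_pgd / 2,
               pgdir_size_sv48 * ptrs_per_pgd / 2);
        return 0;
}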

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
Reviewed-by: Palmer Dabbelt 
---
 arch/riscv/include/asm/pgtable.h | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 95721016049d..360858cdbfdd 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -465,8 +465,15 @@ static inline int ptep_clear_flush_young(struct 
vm_area_struct *vma,
 #endif
 
 /*
- * Task size is 0x40 for RV64 or 0x9fc0 for RV32.
- * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
+ * Task size is:
+ * - 0x9fc0 (~2.5GB) for RV32.
+ * -   0x40 ( 256GB) for RV64 using SV39 mmu
+ * - 0x8000 ( 128TB) for RV64 using SV48 mmu
+ *
+ * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
+ * Instruction Set Manual Volume II: Privileged Architecture" states that
+ * "load and store effective addresses, which are 64bits, must have bits
+ * 63–48 all equal to bit 47, or else a page-fault exception will occur."
  */
 #ifdef CONFIG_64BIT
 #define TASK_SIZE  (PGDIR_SIZE * PTRS_PER_PGD / 2)
-- 
2.20.1



[RFC PATCH 10/12] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo

2021-01-04 Thread Alexandre Ghiti
Now that the mmu type is determined at runtime using the SATP characteristic,
use the global variable pgtable_l4_enabled to output the mmu type of the
processor through /proc/cpuinfo instead of relying on device tree info.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
Reviewed-by: Palmer Dabbelt 
---
 arch/riscv/include/asm/pgtable.h |  1 +
 arch/riscv/kernel/cpu.c  | 23 ---
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index dd27d28f1d9e..95721016049d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -495,6 +495,7 @@ extern char _start[];
 extern void *dtb_early_va;
 extern uintptr_t dtb_early_pa;
 extern u64 satp_mode;
+extern bool pgtable_l4_enabled;
 void setup_bootmem(void);
 void paging_init(void);
 
diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index 6d59e6906fdd..dea9b1c31889 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Returns the hart ID of the given device tree node, or -ENODEV if the node
@@ -70,18 +71,19 @@ static void print_isa(struct seq_file *f, const char *isa)
seq_puts(f, "\n");
 }
 
-static void print_mmu(struct seq_file *f, const char *mmu_type)
+static void print_mmu(struct seq_file *f)
 {
+   char sv_type[16];
+
 #if defined(CONFIG_32BIT)
-   if (strcmp(mmu_type, "riscv,sv32") != 0)
-   return;
+   strncpy(sv_type, "sv32", 5);
 #elif defined(CONFIG_64BIT)
-   if (strcmp(mmu_type, "riscv,sv39") != 0 &&
-   strcmp(mmu_type, "riscv,sv48") != 0)
-   return;
+   if (pgtable_l4_enabled)
+   strncpy(sv_type, "sv48", 5);
+   else
+   strncpy(sv_type, "sv39", 5);
 #endif
-
-   seq_printf(f, "mmu\t\t: %s\n", mmu_type+6);
+   seq_printf(f, "mmu\t\t: %s\n", sv_type);
 }
 
 static void *c_start(struct seq_file *m, loff_t *pos)
@@ -106,14 +108,13 @@ static int c_show(struct seq_file *m, void *v)
 {
unsigned long cpu_id = (unsigned long)v - 1;
struct device_node *node = of_get_cpu_node(cpu_id, NULL);
-   const char *compat, *isa, *mmu;
+   const char *compat, *isa;
 
seq_printf(m, "processor\t: %lu\n", cpu_id);
seq_printf(m, "hart\t\t: %lu\n", cpuid_to_hartid_map(cpu_id));
	if (!of_property_read_string(node, "riscv,isa", &isa))
print_isa(m, isa);
-   if (!of_property_read_string(node, "mmu-type", ))
-   print_mmu(m, mmu);
+   print_mmu(m);
	if (!of_property_read_string(node, "compatible", &compat)
&& strcmp(compat, "riscv"))
seq_printf(m, "uarch\t\t: %s\n", compat);
-- 
2.20.1



[RFC PATCH 09/12] riscv: Allow user to downgrade to sv39 when hw supports sv48

2021-01-04 Thread Alexandre Ghiti
This is made possible by using the mmu-type property of the cpu node of
the device tree.

By default, the kernel will boot with a 4-level page table if the hw supports
it but it can be useful for the user to select a 3-level page table as it
consumes less memory and is faster since it requires fewer memory accesses on
a TLB miss.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/init.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index cb23a30d9af3..f9a99cb1870b 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -550,10 +550,32 @@ void disable_pgtable_l4(void)
  * then read SATP to see if the configuration was taken into account
  * meaning sv48 is supported.
  */
-asmlinkage __init void set_satp_mode(uintptr_t load_pa)
+asmlinkage __init void set_satp_mode(uintptr_t load_pa, uintptr_t dtb_pa)
 {
u64 identity_satp, hw_satp;
+   int cpus_node;
 
+   /* 1/ Check if the user asked for sv39 explicitly in the device tree */
+   cpus_node = fdt_path_offset((void *)dtb_pa, "/cpus");
+   if (cpus_node >= 0) {
+   int node;
+
+   fdt_for_each_subnode(node, (void *)dtb_pa, cpus_node) {
+   const char *mmu_type = fdt_getprop((void *)dtb_pa, node,
+   "mmu-type", NULL);
+   if (!mmu_type)
+   continue;
+
+   if (!strcmp(mmu_type, "riscv,sv39")) {
+   disable_pgtable_l4();
+   return;
+   }
+
+   break;
+   }
+   }
+
+   /* 2/ Determine if the HW supports sv48: if not, fallback to sv39 */
create_pgd_mapping(early_pg_dir, load_pa, (uintptr_t)early_pud,
   PGDIR_SIZE, PAGE_TABLE);
create_pud_mapping(early_pud, load_pa, (uintptr_t)early_pmd,
@@ -611,7 +633,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 #endif
 
 #if defined(CONFIG_64BIT) && !defined(CONFIG_MAXPHYSMEM_2GB)
-   set_satp_mode(load_pa);
+   set_satp_mode(load_pa, dtb_pa);
 #endif
 
kernel_virt_addr = KERNEL_VIRT_ADDR;
-- 
2.20.1



[RFC PATCH 08/12] riscv: Implement sv48 support

2021-01-04 Thread Alexandre Ghiti
By adding a new 4th level of page table, allow the 64bit kernel to address
2^48 bytes of virtual address space: in practice, that offers roughly ~160TB
of virtual address space to userspace and allows up to 64TB of physical
memory.

If the underlying hardware does not support sv48, we will automatically fall
back to a standard 3-level page table by folding the new PUD level into the
PGDIR level. In order to detect HW capabilities at runtime, we use the SATP
feature that ignores writes with an unsupported mode.
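
The probe itself amounts to writing the sv48 mode into SATP and checking
whether the write sticks; a condensed sketch of that step, assuming kernel
context with the csr_write()/csr_read() helpers from <asm/csr.h> and an
identity mapping already installed (not a drop-in implementation):

/* Sketch only: detect sv48 by writing SATP_MODE_48 and reading it back.
 * On hardware without sv48 the write is silently ignored. */
static bool sv48_supported(unsigned long root_ppn)
{
        unsigned long probe = root_ppn | SATP_MODE_48;
        unsigned long seen;

        csr_write(CSR_SATP, probe);     /* ignored if sv48 is unsupported */
        seen = csr_read(CSR_SATP);
        csr_write(CSR_SATP, 0);         /* switch translation back off */

        return seen == probe;
}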

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/Kconfig  |   6 +-
 arch/riscv/include/asm/csr.h|   3 +-
 arch/riscv/include/asm/fixmap.h |   3 +
 arch/riscv/include/asm/page.h   |  12 ++
 arch/riscv/include/asm/pgalloc.h|  40 +
 arch/riscv/include/asm/pgtable-64.h | 104 +++-
 arch/riscv/include/asm/pgtable.h|  12 +-
 arch/riscv/kernel/head.S|   3 +-
 arch/riscv/mm/context.c |   2 +-
 arch/riscv/mm/init.c| 212 +---
 drivers/firmware/efi/libstub/efi-stub.c |   2 +-
 11 files changed, 362 insertions(+), 37 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 852ab2f7a50d..03205e11f952 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -127,7 +127,7 @@ config PAGE_OFFSET
default 0xC000 if 32BIT && MAXPHYSMEM_2GB
default 0x8000 if 64BIT && !MMU
default 0x8000 if 64BIT && MAXPHYSMEM_2GB
-   default 0xffe0 if 64BIT && !MAXPHYSMEM_2GB
+   default 0xc000 if 64BIT && !MAXPHYSMEM_2GB
 
 config ARCH_FLATMEM_ENABLE
def_bool y
@@ -176,9 +176,11 @@ config GENERIC_HWEIGHT
 config FIX_EARLYCON_MEM
def_bool MMU
 
+# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
+# on a 3-level page table when sv48 is not supported.
 config PGTABLE_LEVELS
int
-   default 3 if 64BIT
+   default 4 if 64BIT
default 2
 
 config LOCKDEP_SUPPORT
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index cec462e198ce..d41536c3f8d4 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -40,11 +40,10 @@
 #ifndef CONFIG_64BIT
 #define SATP_PPN   _AC(0x003F, UL)
 #define SATP_MODE_32   _AC(0x8000, UL)
-#define SATP_MODE  SATP_MODE_32
 #else
 #define SATP_PPN   _AC(0x0FFF, UL)
 #define SATP_MODE_39   _AC(0x8000, UL)
-#define SATP_MODE  SATP_MODE_39
+#define SATP_MODE_48   _AC(0x9000, UL)
 #endif
 
 /* Exception cause high bit - is an interrupt if set */
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 54cbf07fb4e9..c4e51929773a 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -24,6 +24,9 @@ enum fixed_addresses {
FIX_HOLE,
FIX_PTE,
FIX_PMD,
+#ifdef CONFIG_64BIT
+   FIX_PUD,
+#endif
FIX_TEXT_POKE1,
FIX_TEXT_POKE0,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index a93e35aaa717..37ca192a7b80 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -31,7 +31,16 @@
  * When not using MMU this corresponds to the first free page in
  * physical memory (aligned on a page boundary).
  */
+#ifdef CONFIG_64BIT
+#define PAGE_OFFSET__page_offset
+/*
+ * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
+ * define the PAGE_OFFSET value for SV39.
+ */
+#define PAGE_OFFSET_L3 0xffe0
+#else
 #define PAGE_OFFSET_AC(CONFIG_PAGE_OFFSET, UL)
+#endif /* CONFIG_64BIT */
 
 #define KERN_VIRT_SIZE (-PAGE_OFFSET)
 
@@ -102,6 +111,9 @@ extern unsigned long pfn_base;
 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
 extern unsigned long kernel_virt_addr;
+#ifdef CONFIG_64BIT
+extern unsigned long __page_offset;
+#endif
 extern uintptr_t load_pa, load_sz;
 
 #define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + 
va_pa_offset))
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 23b1544e0ca5..2b7fb8156fc6 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -11,6 +11,8 @@
 #include 
 
 #ifdef CONFIG_MMU
+#define __HAVE_ARCH_PUD_ALLOC_ONE
+#define __HAVE_ARCH_PUD_FREE
 #include 
 
 static inline void pmd_populate_kernel(struct mm_struct *mm,
@@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t 
*pud, pmd_t *pmd)
 
set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
 }
+
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
+{
+   if (pgtable_l4_enabled) {
+   unsigned long pfn = virt_to_pfn(pud);
+
+   set_p4d(p4d, __p4d((pfn << _PAG

[RFC PATCH 07/12] asm-generic: Prepare for riscv use of pud_alloc_one and pud_free

2021-01-04 Thread Alexandre Ghiti
In the following commits, riscv will use the generic versions of
pud_alloc_one and pud_free almost as-is, but an additional check is required
since those functions are only relevant when using at least a 4-level page
table, which will be determined at runtime on riscv.

So move the content of those functions into other functions that riscv
can use without duplicating code.

Signed-off-by: Alexandre Ghiti 
---
 include/asm-generic/pgalloc.h | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 02932efad3ab..977bea16cf1b 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -147,6 +147,15 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t 
*pmd)
 
 #if CONFIG_PGTABLE_LEVELS > 3
 
+static inline pud_t *__pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+   gfp_t gfp = GFP_PGTABLE_USER;
+
+   if (mm == &init_mm)
+   gfp = GFP_PGTABLE_KERNEL;
+   return (pud_t *)get_zeroed_page(gfp);
+}
+
 #ifndef __HAVE_ARCH_PUD_ALLOC_ONE
 /**
  * pud_alloc_one - allocate a page for PUD-level page table
@@ -159,20 +168,23 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t 
*pmd)
  */
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   gfp_t gfp = GFP_PGTABLE_USER;
-
-   if (mm == &init_mm)
-   gfp = GFP_PGTABLE_KERNEL;
-   return (pud_t *)get_zeroed_page(gfp);
+   return __pud_alloc_one(mm, addr);
 }
 #endif
 
-static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+static inline void __pud_free(struct mm_struct *mm, pud_t *pud)
 {
BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
free_page((unsigned long)pud);
 }
 
+#ifndef __HAVE_ARCH_PUD_FREE
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+   __pud_free(mm, pud);
+}
+#endif
+
 #endif /* CONFIG_PGTABLE_LEVELS > 3 */
 
 #ifndef __HAVE_ARCH_PGD_FREE
-- 
2.20.1



[RFC PATCH 06/12] riscv: Prepare ptdump for vm layout dynamic addresses

2021-01-04 Thread Alexandre Ghiti
This is a preparatory patch for sv48 support that will introduce
dynamic PAGE_OFFSET.

Dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic and can't be used
to statically initialize the array used by ptdump to identify the
different zones of the vm layout.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/ptdump.c | 56 ++
 1 file changed, 46 insertions(+), 10 deletions(-)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index ace74dec7492..1be2ca81f8ad 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -58,29 +58,50 @@ struct ptd_mm_info {
unsigned long end;
 };
 
+enum address_markers_idx {
+#ifdef CONFIG_KASAN
+   KASAN_SHADOW_START_NR,
+   KASAN_SHADOW_END_NR,
+#endif
+   FIXMAP_START_NR,
+   FIXMAP_END_NR,
+   PCI_IO_START_NR,
+   PCI_IO_END_NR,
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   VMEMMAP_START_NR,
+   VMEMMAP_END_NR,
+#endif
+   VMALLOC_START_NR,
+   VMALLOC_END_NR,
+   PAGE_OFFSET_NR,
+KERNEL_MAPPING_NR,
+   END_OF_SPACE_NR
+};
+
 static struct addr_marker address_markers[] = {
 #ifdef CONFIG_KASAN
{KASAN_SHADOW_START,"Kasan shadow start"},
{KASAN_SHADOW_END,  "Kasan shadow end"},
 #endif
-   {FIXADDR_START, "Fixmap start"},
-   {FIXADDR_TOP,   "Fixmap end"},
-   {PCI_IO_START,  "PCI I/O start"},
-   {PCI_IO_END,"PCI I/O end"},
+   {0, "Fixmap start"},
+   {0, "Fixmap end"},
+   {0, "PCI I/O start"},
+   {0, "PCI I/O end"},
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-   {VMEMMAP_START, "vmemmap start"},
-   {VMEMMAP_END,   "vmemmap end"},
+   {0, "vmemmap start"},
+   {0, "vmemmap end"},
 #endif
-   {VMALLOC_START, "vmalloc() area"},
-   {VMALLOC_END,   "vmalloc() end"},
-   {PAGE_OFFSET,   "Linear mapping"},
+   {0, "vmalloc() area"},
+   {0, "vmalloc() end"},
+   {0, "Linear mapping"},
+   {0, "Kernel mapping (kernel, BPF, modules)"},
{-1, NULL},
 };
 
 static struct ptd_mm_info kernel_ptd_info = {
	.mm = &init_mm,
.markers= address_markers,
-   .base_addr  = KERN_VIRT_START,
+   .base_addr  = 0,
.end= ULONG_MAX,
 };
 
@@ -335,6 +356,21 @@ static int ptdump_init(void)
 {
unsigned int i, j;
 
+   address_markers[FIXMAP_START_NR].start_address = FIXADDR_START;
+   address_markers[FIXMAP_END_NR].start_address = FIXADDR_TOP;
+   address_markers[PCI_IO_START_NR].start_address = PCI_IO_START;
+   address_markers[PCI_IO_END_NR].start_address = PCI_IO_END;
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+   address_markers[VMEMMAP_END_NR].start_address = VMEMMAP_END;
+#endif
+   address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
+   address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
+   address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
+   address_markers[KERNEL_MAPPING_NR].start_address = KERNEL_LINK_ADDR;
+
+   kernel_ptd_info.base_addr = KERN_VIRT_START;
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
-- 
2.20.1



[RFC PATCH 05/12] riscv: Simplify MAXPHYSMEM config

2021-01-04 Thread Alexandre Ghiti
Either the user specifies a maximum physical memory size of 2GB or the user
lives with the system constraint, which is 1/4th of the maximum addressable
memory in Sv39 MMU mode (i.e. 128GB) for now.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/Kconfig | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 2979a44103be..852ab2f7a50d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -127,7 +127,7 @@ config PAGE_OFFSET
default 0xC000 if 32BIT && MAXPHYSMEM_2GB
default 0x8000 if 64BIT && !MMU
default 0x8000 if 64BIT && MAXPHYSMEM_2GB
-   default 0xffe0 if 64BIT && MAXPHYSMEM_128GB
+   default 0xffe0 if 64BIT && !MAXPHYSMEM_2GB
 
 config ARCH_FLATMEM_ENABLE
def_bool y
@@ -235,19 +235,11 @@ config MODULE_SECTIONS
bool
select HAVE_MOD_ARCH_SPECIFIC
 
-choice
-   prompt "Maximum Physical Memory"
-   default MAXPHYSMEM_2GB if 32BIT
-   default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
-   default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
-
-   config MAXPHYSMEM_2GB
-   bool "2GiB"
-   config MAXPHYSMEM_128GB
-   depends on 64BIT && CMODEL_MEDANY
-   bool "128GiB"
-endchoice
-
+config MAXPHYSMEM_2GB
+   bool "Maximum Physical Memory 2GiB"
+   default y if 32BIT
+   default y if 64BIT && CMODEL_MEDLOW
+   default n
 
 config SMP
bool "Symmetric Multi-Processing"
-- 
2.20.1



[RFC PATCH 03/12] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE

2021-01-04 Thread Alexandre Ghiti
There is no need to compare the MAX_EARLY_MAPPING_SIZE value with PGDIR_SIZE
at compile time since MAX_EARLY_MAPPING_SIZE is set to 128MB, which is less
than PGDIR_SIZE (equal to 1GB): that allows us to simplify the early_pmd
definition.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
Reviewed-by: Palmer Dabbelt 
---
 arch/riscv/mm/init.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 7b87c14f1d24..694efcc3a131 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -296,13 +296,7 @@ static void __init create_pte_mapping(pte_t *ptep,
 
 pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
 pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
-
-#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
-#define NUM_EARLY_PMDS 1UL
-#else
-#define NUM_EARLY_PMDS (1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE)
-#endif
-pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE);
+pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
 pmd_t early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
 
 static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
@@ -324,11 +318,9 @@ static pmd_t *get_pmd_virt_late(phys_addr_t pa)
 
 static phys_addr_t __init alloc_pmd_early(uintptr_t va)
 {
-   uintptr_t pmd_num;
+   BUG_ON((va - kernel_virt_addr) >> PGDIR_SHIFT);
 
-   pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
-   BUG_ON(pmd_num >= NUM_EARLY_PMDS);
-   return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
+   return (uintptr_t)early_pmd;
 }
 
 static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
-- 
2.20.1



[RFC PATCH 04/12] riscv: Allow to dynamically define VA_BITS

2021-01-04 Thread Alexandre Ghiti
With 4-level page table folding at runtime, we don't know at compile time
the size of the virtual address space so we must set VA_BITS dynamically
so that sparsemem reserves the right amount of memory for struct pages.
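
As an illustration of what this buys, the vmemmap sizing derived from VA_BITS
works out as follows (STRUCT_PAGE_MAX_SHIFT is assumed to be 6, i.e. a
64-byte struct page):

/* Illustrative: vmemmap region size as a function of VA_BITS. */
#include <stdio.h>

int main(void)
{
        int page_shift = 12, struct_page_shift = 6;     /* assumptions */
        int va_bits[] = { 39, 48 };

        for (int i = 0; i < 2; i++) {
                unsigned long long size =
                        1ULL << (va_bits[i] - page_shift - 1 + struct_page_shift);
                printf("VA_BITS=%d: vmemmap spans %llu GB\n",
                       va_bits[i], size >> 30);
        }
        return 0;
}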

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/Kconfig | 10 --
 arch/riscv/include/asm/pgtable.h   | 11 +--
 arch/riscv/include/asm/sparsemem.h |  6 +-
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 44377fd7860e..2979a44103be 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -122,16 +122,6 @@ config ZONE_DMA32
bool
default y if 64BIT
 
-config VA_BITS
-   int
-   default 32 if 32BIT
-   default 39 if 64BIT
-
-config PA_BITS
-   int
-   default 34 if 32BIT
-   default 56 if 64BIT
-
 config PAGE_OFFSET
hex
default 0xC000 if 32BIT && MAXPHYSMEM_2GB
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 102b728ca146..c7973bfd65bc 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -43,8 +43,14 @@
  * struct pages to map half the virtual address space. Then
  * position vmemmap directly below the VMALLOC region.
  */
+#ifdef CONFIG_64BIT
+#define VA_BITS39
+#else
+#define VA_BITS32
+#endif
+
 #define VMEMMAP_SHIFT \
-   (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
+   (VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
 #define VMEMMAP_SIZE   BIT(VMEMMAP_SHIFT)
 #define VMEMMAP_END(VMALLOC_START - 1)
 #define VMEMMAP_START  (VMALLOC_START - VMEMMAP_SIZE)
@@ -83,6 +89,7 @@
 #endif /* CONFIG_64BIT */
 
 #ifdef CONFIG_MMU
+
 /* Number of entries in the page global directory */
 #define PTRS_PER_PGD(PAGE_SIZE / sizeof(pgd_t))
 /* Number of entries in the page table */
@@ -453,7 +460,7 @@ static inline int ptep_clear_flush_young(struct 
vm_area_struct *vma,
  * and give the kernel the other (upper) half.
  */
 #ifdef CONFIG_64BIT
-#define KERN_VIRT_START(-(BIT(CONFIG_VA_BITS)) + TASK_SIZE)
+#define KERN_VIRT_START(-(BIT(VA_BITS)) + TASK_SIZE)
 #else
 #define KERN_VIRT_STARTFIXADDR_START
 #endif
diff --git a/arch/riscv/include/asm/sparsemem.h 
b/arch/riscv/include/asm/sparsemem.h
index 45a7018a8118..63acaecc3374 100644
--- a/arch/riscv/include/asm/sparsemem.h
+++ b/arch/riscv/include/asm/sparsemem.h
@@ -4,7 +4,11 @@
 #define _ASM_RISCV_SPARSEMEM_H
 
 #ifdef CONFIG_SPARSEMEM
-#define MAX_PHYSMEM_BITS   CONFIG_PA_BITS
+#ifdef CONFIG_64BIT
+#define MAX_PHYSMEM_BITS   56
+#else
+#define MAX_PHYSMEM_BITS   34
+#endif /* CONFIG_64BIT */
 #define SECTION_SIZE_BITS  27
 #endif /* CONFIG_SPARSEMEM */
 
-- 
2.20.1



[RFC PATCH 01/12] riscv: Move kernel mapping outside of linear mapping

2021-01-04 Thread Alexandre Ghiti
This is a preparatory patch for relocatable kernel and sv48 support.

The kernel used to be linked at PAGE_OFFSET address therefore we could use
the linear mapping for the kernel mapping. But the relocated kernel base
address will be different from PAGE_OFFSET and since in the linear mapping,
two different virtual addresses cannot point to the same physical address,
the kernel mapping needs to lie outside the linear mapping so that we don't
have to copy it at the same physical offset.

The kernel mapping is moved to the last 2GB of the address space and then
BPF and modules are also pushed to the same range since they have to lie
close to the kernel inside a 2GB window.

Note then that KASLR implementation will simply have to move the kernel in
this 2GB range and modify BPF/modules regions accordingly.

In addition, by moving the kernel to the end of the address space, both
sv39 and sv48 kernels will be exactly the same without needing to be
relocated at runtime.
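
As a quick illustration of where this puts the kernel (an informal sketch
derived from the macros added below, assuming a 64-bit address space):

	ADDRESS_SPACE_END = 0xffffffffffffffff
	KERNEL_LINK_ADDR  = ADDRESS_SPACE_END - SZ_2G + 1 = 0xffffffff80000000

so both sv39 and sv48 kernels are linked at 0xffffffff80000000, with BPF
and modules placed just below the kernel image inside the same 2GB window.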

Suggested-by: Arnd Bergmann 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/boot/loader.lds.S |  3 +-
 arch/riscv/include/asm/page.h| 10 -
 arch/riscv/include/asm/pgtable.h | 39 +--
 arch/riscv/kernel/head.S |  3 +-
 arch/riscv/kernel/module.c   |  4 +-
 arch/riscv/kernel/vmlinux.lds.S  |  3 +-
 arch/riscv/mm/init.c | 65 
 arch/riscv/mm/physaddr.c |  2 +-
 8 files changed, 94 insertions(+), 35 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include 
+#include 
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-   . = PAGE_OFFSET;
+   . = KERNEL_LINK_ADDR;
 
.payload : {
*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 2d50f76efe48..98188e315e8d 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET(pfn_base)
 #else
 #define va_pa_offset   0
+#define va_kernel_pa_offset0
 #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
+extern unsigned long kernel_virt_addr;
 
 #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) \
+   ((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)  \
+   (((x) < KERNEL_LINK_ADDR) ? \
+   linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 183f1f4b2ae6..102b728ca146 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,32 @@
 
 #include 
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include 
-#include 
-#include 
-#include 
+#ifndef CONFIG_MMU
+#define KERNEL_VIRT_ADDR   PAGE_OFFSET
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#else
 
-#ifdef CONFIG_MMU
+#define ADDRESS_SPACE_END  (UL(-1))
+/*
+ * Leave 2GB for kernel, modules and BPF at the end of the address space
+ */
+#define KERNEL_VIRT_ADDR   (ADDRESS_SPACE_END - SZ_2G + 1)
+#define KERNEL_LINK_ADDR   KERNEL_VIRT_ADDR
 
 #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END  (PAGE_OFFSET - 1)
 #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
 
+/* KASLR should leave at least 128MB for BPF after the kernel */
 #define BPF_JIT_REGION_SIZE(SZ_128M)
-#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START   PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+/* Modules always live before the kernel */
+#ifdef CONFIG_64BIT
+#define VMALLOC_MODULE_START   (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
+#define VMALLOC_MODULE_END (PFN_ALIGN((unsigned long)&_start))
+#endif
 
 /*
  * Roughly size the vmemmap space to be large enough to fit enough
@@ -57,9 +66,16 @@
 #define FIXADDR_SIZE PGDIR_SIZE
 #endif
 #define FIXADDR_START(FIXADDR_TOP - FIXADDR_SIZE)
-
 #endif
 
+#ifndef __ASSEMBLY__
+
+/* Page Upper Directory not used in RISC-V */
+#include 
+#include 
+#include 
+#include 
+
 #ifdef CON

[RFC PATCH 00/12] Introduce sv48 support without relocatable kernel

2021-01-04 Thread Alexandre Ghiti
This patchset, contrary to the previous versions, allows having a single
kernel for sv39 and sv48 without being relocatable.

The idea comes from Arnd Bergmann who suggested to do the same as x86,
that is mapping the kernel to the end of the address space, which allows
the kernel to be linked at the same address for both sv39 and sv48 and
then does not require to be relocated at runtime.

This is an RFC because I need to at least rebase a few commits and add
documentation. The most interesting patches where I expect feedback are
1/12, 2/12 and 8/12. Note that moving the kernel out of the linear
mapping and sv48 support can be separate patchsets; I share them together
today to show that it works (this patchset is rebased on top of v5.10).

If we agree about the overall idea, I'll rebase my relocatable patchset
on top of that, and then the KASLR implementation from Zong will be greatly
simplified since moving the kernel out of the linear mapping avoids
copying the kernel physically.

This implements sv48 support at runtime. The kernel will try to boot with
a 4-level page table and will fall back to 3-level if the HW does not
support it. Folding the 4th level into a 3-level page table has almost no
cost at runtime.

Finally, the user can now ask for sv39 explicitly by using the device tree,
which will reduce the memory footprint and the number of memory accesses
in case of a TLB miss.

Alexandre Ghiti (12):
  riscv: Move kernel mapping outside of linear mapping
  riscv: Protect the kernel linear mapping
  riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
  riscv: Allow to dynamically define VA_BITS
  riscv: Simplify MAXPHYSMEM config
  riscv: Prepare ptdump for vm layout dynamic addresses
  asm-generic: Prepare for riscv use of pud_alloc_one and pud_free
  riscv: Implement sv48 support
  riscv: Allow user to downgrade to sv39 when hw supports sv48
  riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
  riscv: Explicit comment about user virtual address space size
  riscv: Improve virtual kernel memory layout dump

 arch/riscv/Kconfig  |  34 +--
 arch/riscv/boot/loader.lds.S|   3 +-
 arch/riscv/include/asm/csr.h|   3 +-
 arch/riscv/include/asm/fixmap.h |   3 +
 arch/riscv/include/asm/page.h   |  33 ++-
 arch/riscv/include/asm/pgalloc.h|  40 +++
 arch/riscv/include/asm/pgtable-64.h | 104 ++-
 arch/riscv/include/asm/pgtable.h|  68 +++--
 arch/riscv/include/asm/sparsemem.h  |   6 +-
 arch/riscv/kernel/cpu.c |  23 +-
 arch/riscv/kernel/head.S|   6 +-
 arch/riscv/kernel/module.c  |   4 +-
 arch/riscv/kernel/vmlinux.lds.S |   3 +-
 arch/riscv/mm/context.c |   2 +-
 arch/riscv/mm/init.c| 376 
 arch/riscv/mm/physaddr.c|   2 +-
 arch/riscv/mm/ptdump.c  |  56 +++-
 drivers/firmware/efi/libstub/efi-stub.c |   2 +-
 include/asm-generic/pgalloc.h   |  24 +-
 include/linux/sizes.h   |   3 +-
 20 files changed, 648 insertions(+), 147 deletions(-)

-- 
2.20.1



[PATCH v5 4/4] riscv: Check relocations at compile time

2020-06-07 Thread Alexandre Ghiti
Relocating kernel at runtime is done very early in the boot process, so
it is not convenient to check for relocations there and react in case a
relocation was not expected.

There exists a script in scripts/ that extracts the relocations from
vmlinux that is then used at postlink to check the relocations.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/Makefile.postlink | 36 
 arch/riscv/tools/relocs_check.sh | 26 +++
 2 files changed, 62 insertions(+)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh

diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
new file mode 100644
index ..bf2b2bca1845
--- /dev/null
+++ b/arch/riscv/Makefile.postlink
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: GPL-2.0
+# ===
+# Post-link riscv pass
+# ===
+#
+# Check that vmlinux relocations look sane
+
+PHONY := __archpost
+__archpost:
+
+-include include/config/auto.conf
+include scripts/Kbuild.include
+
+quiet_cmd_relocs_check = CHKREL  $@
+cmd_relocs_check = \
+   $(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@"
+
+# `@true` prevents complaint when there is nothing to be done
+
+vmlinux: FORCE
+   @true
+ifdef CONFIG_RELOCATABLE
+   $(call if_changed,relocs_check)
+endif
+
+%.ko: FORCE
+   @true
+
+clean:
+   @true
+
+PHONY += FORCE clean
+
+FORCE:
+
+.PHONY: $(PHONY)
diff --git a/arch/riscv/tools/relocs_check.sh b/arch/riscv/tools/relocs_check.sh
new file mode 100755
index ..baeb2e7b2290
--- /dev/null
+++ b/arch/riscv/tools/relocs_check.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Based on powerpc relocs_check.sh
+
+# This script checks the relocations of a vmlinux for "suspicious"
+# relocations.
+
+if [ $# -lt 3 ]; then
+echo "$0 [path to objdump] [path to nm] [path to vmlinux]" 1>&2
+exit 1
+fi
+
+bad_relocs=$(
+${srctree}/scripts/relocs_check.sh "$@" |
+   # These relocations are okay
+   #   R_RISCV_RELATIVE
+   grep -F -w -v 'R_RISCV_RELATIVE'
+)
+
+if [ -z "$bad_relocs" ]; then
+   exit 0
+fi
+
+num_bad=$(echo "$bad_relocs" | wc -l)
+echo "WARNING: $num_bad bad relocations"
+echo "$bad_relocs"
-- 
2.20.1



[PATCH v5 3/4] powerpc: Move script to check relocations at compile time in scripts/

2020-06-07 Thread Alexandre Ghiti
Relocating kernel at runtime is done very early in the boot process, so
it is not convenient to check for relocations there and react in case a
relocation was not expected.

Powerpc architecture has a script that allows to check at compile time
for such unexpected relocations: extract the common logic to scripts/
so that other architectures can take advantage of it.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/powerpc/tools/relocs_check.sh | 18 ++
 scripts/relocs_check.sh| 20 
 2 files changed, 22 insertions(+), 16 deletions(-)
 create mode 100755 scripts/relocs_check.sh

diff --git a/arch/powerpc/tools/relocs_check.sh b/arch/powerpc/tools/relocs_check.sh
index 014e00e74d2b..e367895941ae 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then
exit 1
 fi
 
-# Have Kbuild supply the path to objdump and nm so we handle cross compilation.
-objdump="$1"
-nm="$2"
-vmlinux="$3"
-
-# Remove from the bad relocations those that match an undefined weak symbol
-# which will result in an absolute relocation to 0.
-# Weak unresolved symbols are of that form in nm output:
-# "  w _binary__btf_vmlinux_bin_end"
-undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
-
 bad_relocs=$(
-$objdump -R "$vmlinux" |
-   # Only look at relocation lines.
-   grep -E '\

[PATCH v5 2/4] riscv: Introduce CONFIG_RELOCATABLE

2020-06-07 Thread Alexandre Ghiti
This config allows compiling the kernel as a PIE and relocating it to
any virtual address at runtime: this paves the way to KASLR and to 4-level
page table folding at runtime. Runtime relocation is possible since the
relocation metadata are embedded into the kernel.

Note that relocating at runtime introduces an overhead even if the
kernel is loaded at the same address it was linked at. The compiler
options are those used by arm64, which uses the same RELA relocation
format.
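
For reference, applying one R_RISCV_RELATIVE entry boils down to a
load-base fixup. A minimal sketch of the arithmetic (illustrative only:
it ignores that setup_vm() runs with the MMU off and therefore patches
through the physical mapping; 'shift' and the function name are made up):

	#include <linux/elf.h>	/* Elf64_Rela */

	/* 'shift' = address the kernel runs at - address it was linked at */
	static void apply_relative_reloc(const Elf64_Rela *rela, unsigned long shift)
	{
		unsigned long *where = (unsigned long *)(rela->r_offset + shift);

		/* the relocated slot gets its link-time value moved by the same shift */
		*where = rela->r_addend + shift;
	}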

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Zong Li 
Reviewed-by: Anup Patel 
---
 arch/riscv/Kconfig  | 12 +++
 arch/riscv/Makefile |  5 ++-
 arch/riscv/kernel/vmlinux.lds.S |  6 ++--
 arch/riscv/mm/Makefile  |  4 +++
 arch/riscv/mm/init.c| 63 +
 5 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index a31e1a41913a..93127d5913fe 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -170,6 +170,18 @@ config PGTABLE_LEVELS
default 3 if 64BIT
default 2
 
+config RELOCATABLE
+   bool
+   depends on MMU
+   help
+  This builds a kernel as a Position Independent Executable (PIE),
+  which retains all relocation metadata required to relocate the
+  kernel binary at runtime to a different virtual address than the
+  address it was linked at.
+  Since RISCV uses the RELA relocation format, this requires a
+  relocation pass at runtime even if the kernel is loaded at the
+  same address it was linked at.
+
 source "arch/riscv/Kconfig.socs"
 
 menu "Platform type"
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index fb6e37db836d..1406416ea743 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -9,7 +9,10 @@
 #
 
 OBJCOPYFLAGS:= -O binary
-LDFLAGS_vmlinux :=
+ifeq ($(CONFIG_RELOCATABLE),y)
+LDFLAGS_vmlinux := -shared -Bsymbolic -z notext -z norelro
+KBUILD_CFLAGS += -fPIE
+endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
LDFLAGS_vmlinux := --no-relax
 endif
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index a9abde62909f..e8ffba8c2044 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -85,8 +85,10 @@ SECTIONS
 
BSS_SECTION(PAGE_SIZE, PAGE_SIZE, 0)
 
-   .rel.dyn : {
-   *(.rel.dyn*)
+   .rela.dyn : ALIGN(8) {
+   __rela_dyn_start = .;
+   *(.rela .rela*)
+   __rela_dyn_end = .;
}
 
_end = .;
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 363ef01c30b1..dc5cdaa80bc1 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -1,6 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 CFLAGS_init.o := -mcmodel=medany
+ifdef CONFIG_RELOCATABLE
+CFLAGS_init.o += -fno-pie
+endif
+
 ifdef CONFIG_FTRACE
 CFLAGS_REMOVE_init.o = -pg
 endif
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 71da78914645..29b33289a12f 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_RELOCATABLE
+#include 
+#endif
 
 #include 
 #include 
@@ -379,6 +382,53 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
 #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
 #endif
 
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long __rela_dyn_start, __rela_dyn_end;
+
+#ifdef CONFIG_64BIT
+#define Elf_Rela Elf64_Rela
+#define Elf_Addr Elf64_Addr
+#else
+#define Elf_Rela Elf32_Rela
+#define Elf_Addr Elf32_Addr
+#endif
+
+void __init relocate_kernel(uintptr_t load_pa)
+{
+   Elf_Rela *rela = (Elf_Rela *)&__rela_dyn_start;
+   /*
+* This holds the offset between the linked virtual address and the
+* relocated virtual address.
+*/
+   uintptr_t reloc_offset = kernel_virt_addr - KERNEL_LINK_ADDR;
+   /*
+* This holds the offset between kernel linked virtual address and
+* physical address.
+*/
+   uintptr_t va_kernel_link_pa_offset = KERNEL_LINK_ADDR - load_pa;
+
+   for ( ; rela < (Elf_Rela *)&__rela_dyn_end; rela++) {
+   Elf_Addr addr = (rela->r_offset - va_kernel_link_pa_offset);
+   Elf_Addr relocated_addr = rela->r_addend;
+
+   if (rela->r_info != R_RISCV_RELATIVE)
+   continue;
+
+   /*
+* Make sure to not relocate vdso symbols like rt_sigreturn
+* which are linked from the address 0 in vmlinux since
+* vdso symbol addresses are actually used as an offset from
+* mm->context.vdso in VDSO_OFFSET macro.
+*/
+   if (relocated_addr >= KERNEL_LINK_ADDR)
+   relocated_addr += reloc_offset;
+
+   *(

[PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-06-07 Thread Alexandre Ghiti
This is a preparatory patch for relocatable kernel.

The kernel used to be linked at PAGE_OFFSET address and used to be loaded
physically at the beginning of the main memory. Therefore, we could use
the linear mapping for the kernel mapping.

But the relocated kernel base address will be different from PAGE_OFFSET
and since in the linear mapping, two different virtual addresses cannot
point to the same physical address, the kernel mapping needs to lie outside
the linear mapping.

In addition, because modules and BPF must be close to the kernel (inside
a +-2GB window), the kernel is placed at the end of the vmalloc zone minus
2GB, which leaves room for modules and BPF. The kernel could not be
placed at the beginning of the vmalloc zone since other vmalloc
allocations from the kernel could take up the whole +-2GB window around
the kernel, which would prevent new modules and BPF programs from being
loaded.
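
For orientation, a rough sketch of the resulting sv39 layout (numbers
derived from the macros below, assuming the default PAGE_OFFSET):

	PAGE_OFFSET      = 0xffffffe000000000
	VMALLOC_END      = PAGE_OFFSET - 1         = 0xffffffdfffffffff
	KERNEL_LINK_ADDR = VMALLOC_END - SZ_2G + 1 = 0xffffffdf80000000
	BPF JIT region   : [PFN_ALIGN(_end), +128MB)
	modules          : [BPF JIT region end, _start + 2GB)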

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Zong Li 
---
 arch/riscv/boot/loader.lds.S |  3 +-
 arch/riscv/include/asm/page.h| 10 +-
 arch/riscv/include/asm/pgtable.h | 38 ++---
 arch/riscv/kernel/head.S |  3 +-
 arch/riscv/kernel/module.c   |  4 +--
 arch/riscv/kernel/vmlinux.lds.S  |  3 +-
 arch/riscv/mm/init.c | 58 +---
 arch/riscv/mm/physaddr.c |  2 +-
 8 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #include 
+#include 
 
 OUTPUT_ARCH(riscv)
 ENTRY(_start)
 
 SECTIONS
 {
-   . = PAGE_OFFSET;
+   . = KERNEL_LINK_ADDR;
 
.payload : {
*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 2d50f76efe48..48bb09b6a9b7 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
 
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET(pfn_base)
 #else
 #define va_pa_offset   0
+#define va_kernel_pa_offset0
 #define ARCH_PFN_OFFSET(PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */
 
 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
+extern unsigned long kernel_virt_addr;
 
 #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x)  ((unsigned long)(x) - va_pa_offset)
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) \
+   ((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)  \
+   (((x) >= PAGE_OFFSET) ? \
+   linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
 
 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 35b60035b6b0..94ef3b49dfb6 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,29 @@
 
 #include 
 
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include 
-#include 
-#include 
-#include 
-
-#ifdef CONFIG_MMU
+#ifndef CONFIG_MMU
+#define KERNEL_VIRT_ADDR   PAGE_OFFSET
+#define KERNEL_LINK_ADDR   PAGE_OFFSET
+#else
+/*
+ * Leave 2GB for modules and BPF that must lie within a 2GB range around
+ * the kernel.
+ */
+#define KERNEL_VIRT_ADDR   (VMALLOC_END - SZ_2G + 1)
+#define KERNEL_LINK_ADDR   KERNEL_VIRT_ADDR
 
 #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
 #define VMALLOC_END  (PAGE_OFFSET - 1)
 #define VMALLOC_START(PAGE_OFFSET - VMALLOC_SIZE)
 
 #define BPF_JIT_REGION_SIZE(SZ_128M)
-#define BPF_JIT_REGION_START   (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START   PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+#ifdef CONFIG_64BIT
+#define VMALLOC_MODULE_START   BPF_JIT_REGION_END
+#define VMALLOC_MODULE_END (((unsigned long)&_start & PAGE_MASK) + SZ_2G)
+#endif
 
 /*
  * Roughly size the vmemmap space to be large enough to fit enough
@@ -57,9 +63,16 @@
 #define FIXADDR_SIZE PGDIR_SIZE
 #endif
 #define FIXADDR_START(FIXADDR_TOP - FIXADDR_SIZE)
-
 #endif
 
+#ifndef __ASSEMBLY__
+
+/* Page Upper Directory not used in RISC-V */
+#include 
+#include 
+#include 
+#include 
+
 #ifdef CONFIG_64BIT
 #include 
 #else
@@ -483,6 +496,7 @@ static inline void __kernel_map_pages(struct page *page, int numpages, int enable)
 
 #define kern_addr_valid(addr)   (1) /* FIXME */
 
+extern char _start[];
 extern void *dtb_early_va;
 void

[PATCH v5 0/4] vmalloc kernel mapping and relocatable kernel

2020-06-07 Thread Alexandre Ghiti
This patchset originally implemented relocatable kernel support but now
also moves the kernel mapping into the vmalloc zone.

The first patch explains why we need to move the kernel into the vmalloc
zone (instead of memcpying it around). That patch should ease KASLR
implementation a lot.

The second patch allows building relocatable kernels but is not selected
by default.

The third and fourth patches take advantage of an already existing powerpc
script that checks relocations at compile time, and use it for riscv.
 
Changes in v5:
  * Add "static __init" to create_kernel_page_table function as reported by
Kbuild test robot
  * Add reviewed-by from Zong
  * Rebase onto v5.7

Changes in v4:  
 
  * Fix BPF region that overlapped with kernel's as suggested by Zong   
 
  * Fix end of module region that could be larger than 2GB as suggested by Zong 
 
  * Fix the size of the vm area reserved for the kernel as we could lose
 
PMD_SIZE if the size was already aligned on PMD_SIZE
 
  * Split compile time relocations check patch into 2 patches as suggested by 
Anup
  * Applied Reviewed-by from Zong and Anup  
 

 
Changes in v3:  
 
  * Move kernel mapping to vmalloc  
 

 
Changes in v2:  
 
  * Make RELOCATABLE depend on MMU as suggested by Anup 
 
  * Rename kernel_load_addr into kernel_virt_addr as suggested by Anup  
 
  * Use __pa_symbol instead of __pa, as suggested by Zong   
 
  * Rebased on top of v5.6-rc3  
 
  * Tested with sv48 patchset   
 
  * Add Reviewed/Tested-by from Zong and Anup 

Alexandre Ghiti (4):
  riscv: Move kernel mapping to vmalloc zone
  riscv: Introduce CONFIG_RELOCATABLE
  powerpc: Move script to check relocations at compile time in scripts/
  riscv: Check relocations at compile time

 arch/powerpc/tools/relocs_check.sh |  18 +
 arch/riscv/Kconfig |  12 +++
 arch/riscv/Makefile|   5 +-
 arch/riscv/Makefile.postlink   |  36 +
 arch/riscv/boot/loader.lds.S   |   3 +-
 arch/riscv/include/asm/page.h  |  10 ++-
 arch/riscv/include/asm/pgtable.h   |  38 ++---
 arch/riscv/kernel/head.S   |   3 +-
 arch/riscv/kernel/module.c |   4 +-
 arch/riscv/kernel/vmlinux.lds.S|   9 ++-
 arch/riscv/mm/Makefile |   4 +
 arch/riscv/mm/init.c   | 121 +
 arch/riscv/mm/physaddr.c   |   2 +-
 arch/riscv/tools/relocs_check.sh   |  26 +++
 scripts/relocs_check.sh|  20 +
 15 files changed, 259 insertions(+), 52 deletions(-)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh
 create mode 100755 scripts/relocs_check.sh

-- 
2.20.1



[PATCH 2/2] riscv: Use PUD/PGDIR entries for linear mapping when possible

2020-06-03 Thread Alexandre Ghiti
Improve best_map_size so that PUD or PGDIR entries are used for linear
mapping when possible as it allows better TLB utilization.
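
As an informal example of the effect (the numbers match the ptdump output
shown in the cover letter, assuming a bank starting at 0x80000000 mapped
at PAGE_OFFSET): with roughly 5.9GB of memory the mapping loop can use 1GB
entries for the first 5GB and 2MB PMD entries for the remaining 880MB,
instead of 2MB entries for everything.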

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/mm/init.c | 45 +---
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 9a5c97e091c1..d275f9f834cf 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -424,13 +424,29 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
create_pgd_next_mapping(nextp, va, pa, sz, prot);
 }
 
-static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
+static bool is_map_size_ok(uintptr_t map_size, phys_addr_t base,
+  uintptr_t base_virt, phys_addr_t size)
 {
-   /* Upgrade to PMD_SIZE mappings whenever possible */
-   if ((base & (PMD_SIZE - 1)) || (size & (PMD_SIZE - 1)))
-   return PAGE_SIZE;
+   return !((base & (map_size - 1)) || (base_virt & (map_size - 1)) ||
+   (size < map_size));
+}
+
+static uintptr_t __init best_map_size(phys_addr_t base, uintptr_t base_virt,
+ phys_addr_t size)
+{
+#ifndef __PAGETABLE_PMD_FOLDED
+   if (is_map_size_ok(PGDIR_SIZE, base, base_virt, size))
+   return PGDIR_SIZE;
+
+   if (pgtable_l4_enabled)
+   if (is_map_size_ok(PUD_SIZE, base, base_virt, size))
+   return PUD_SIZE;
+#endif
+
+   if (is_map_size_ok(PMD_SIZE, base, base_virt, size))
+   return PMD_SIZE;
 
-   return PMD_SIZE;
+   return PAGE_SIZE;
 }
 
 /*
@@ -576,7 +592,7 @@ void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
 asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 {
uintptr_t va, end_va;
-   uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
+   uintptr_t map_size;
 
load_pa = (uintptr_t)(&_start);
load_sz = (uintptr_t)(&_end) - load_pa;
@@ -587,6 +603,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 
kernel_virt_addr = KERNEL_VIRT_ADDR;
 
+   map_size = best_map_size(load_pa, PAGE_OFFSET, MAX_EARLY_MAPPING_SIZE);
va_pa_offset = PAGE_OFFSET - load_pa;
va_kernel_pa_offset = kernel_virt_addr - load_pa;
pfn_base = PFN_DOWN(load_pa);
@@ -700,6 +717,8 @@ static void __init setup_vm_final(void)
 
/* Map all memory banks */
for_each_memblock(memory, reg) {
+   uintptr_t remaining_size;
+
start = reg->base;
end = start + reg->size;
 
@@ -707,15 +726,19 @@ static void __init setup_vm_final(void)
break;
if (memblock_is_nomap(reg))
continue;
-   if (start <= __pa(PAGE_OFFSET) &&
-   __pa(PAGE_OFFSET) < end)
-   start = __pa(PAGE_OFFSET);
 
-   map_size = best_map_size(start, end - start);
-   for (pa = start; pa < end; pa += map_size) {
+   pa = start;
+   remaining_size = reg->size;
+
+   while (remaining_size) {
va = (uintptr_t)__va(pa);
+   map_size = best_map_size(pa, va, remaining_size);
+
create_pgd_mapping(swapper_pg_dir, va, pa,
   map_size, PAGE_KERNEL);
+
+   pa += map_size;
+   remaining_size -= map_size;
}
}
 
-- 
2.20.1



[PATCH 1/2] riscv: Get memory below load_pa while ensuring linear mapping is PMD aligned

2020-06-03 Thread Alexandre Ghiti
The early page table uses the kernel load address as the mapping for
PAGE_OFFSET: that makes memblock remove any memory below the kernel, which
results in using only PMD entries for the linear mapping.

By setting MIN_MEMBLOCK_ADDR to 0, we allow this memory to be present
when creating the kernel page table: that potentially allows using
PUD/PGDIR entries for the linear mapping.

But as the firmware might ask the kernel to remove some part of this
memory, we need to ensure that the physical address targeted by
PAGE_OFFSET is at least aligned on PMD size, since otherwise the linear
mapping would use only PTE entries.
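
For instance (illustrative numbers only): if memblock_start_of_DRAM()
returns 0x80080000 because the firmware carved out its own region, the
code below rounds the start of the linear mapping up to the next 2MB
boundary and drops the small piece in between:

	next_dram_start = (0x80080000 + PMD_SIZE - 1) & ~(PMD_SIZE - 1)
	                = 0x80200000

so virtual and physical addresses stay PMD-aligned and the linear mapping
can still use PMD (or larger) entries.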

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/page.h |  8 
 arch/riscv/mm/init.c  | 24 +++-
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 5e77fe7f0d6d..b416396fc357 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -100,6 +100,14 @@ typedef struct page *pgtable_t;
 #define PTE_FMT "%08lx"
 #endif
 
+/*
+ * Early page table maps PAGE_OFFSET to load_pa, which may not be the memory
+ * base address and by default MIN_MEMBLOCK_ADDR is equal to __pa(PAGE_OFFSET)
+ * then memblock ignores memory below load_pa: we want this memory to get mapped
+ * as it may allow to use hugepages for linear mapping.
+ */
+#define MIN_MEMBLOCK_ADDR  0
+
 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
 extern unsigned long va_kernel_pa_offset;
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 4064639b24e4..9a5c97e091c1 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -664,7 +664,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 static void __init setup_vm_final(void)
 {
uintptr_t va, map_size;
-   phys_addr_t pa, start, end;
+   phys_addr_t pa, start, end, dram_start;
struct memblock_region *reg;
static struct vm_struct vm_kernel = { 0 };
 
@@ -676,6 +676,28 @@ static void __init setup_vm_final(void)
   __pa_symbol(fixmap_pgd_next),
   PGDIR_SIZE, PAGE_TABLE);
 
+   /*
+* Make sure that virtual and physical addresses are at least aligned
+* on PMD_SIZE, even if we have to lose some memory (< PMD_SIZE)
+* otherwise the linear mapping would get mapped using PTE entries.
+*/
+   dram_start = memblock_start_of_DRAM();
+   if (dram_start & (PMD_SIZE - 1)) {
+   uintptr_t next_dram_start;
+
+   next_dram_start = (dram_start + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
+   memblock_remove(dram_start, next_dram_start - dram_start);
+   dram_start = next_dram_start;
+   }
+
+   /*
+* We started considering PAGE_OFFSET would start at load_pa because
+* it was the only piece of information we had, but now make PAGE_OFFSET
+* point to the real beginning of the memory area.
+*/
+   va_pa_offset = PAGE_OFFSET - dram_start;
+   pfn_base = PFN_DOWN(dram_start);
+
/* Map all memory banks */
for_each_memblock(memory, reg) {
start = reg->base;
-- 
2.20.1



[PATCH 0/2] PUD/PGDIR entries for linear mapping

2020-06-03 Thread Alexandre Ghiti
This small patchset intends to use PUD/PGDIR entries for linear mapping
in order to better utilize TLB.

At the moment, only PMD entries can be used since on common platforms
(qemu/unleashed), the kernel is loaded at DRAM + 2MB which dealigns virtual
and physical addresses and then prevents the use of PUD/PGDIR entries.
So the kernel must be able to get those 2MB for PAGE_OFFSET to map the
beginning of the DRAM: this is achieved in patch 1.

But furthermore, at the moment, the firmware (opensbi) explicitly asks the
kernel not to map the region it occupies, which is on those common
platforms at the very beginning of the DRAM and then it also dealigns
virtual and physical addresses. I proposed a patch here:

https://github.com/riscv/opensbi/pull/167

that removes this 'constraint' but *not* all the time as it offers some
kind of protection in case PMP is not available. So sometimes, we may
have a part of the memory below the kernel that is removed creating a
misalignment between virtual and physical addresses. So for performance
reasons, we must at least make sure that PMD entries can be used: that
is guaranteed by patch 1 too.

Finally the second patch simply improves best_map_size so that whenever
possible, PUD/PGDIR entries are used. 

Below is the kernel page table without this patch on a 6G platform:

---[ Linear mapping ]---
0xffffffc000000000-0xffffffc176e00000    0x0000000080200000    5998M PMD     D A . . . W R V

And with this patchset + opensbi patch:

---[ Linear mapping ]---
0xffffffc000000000-0xffffffc140000000    0x0000000080000000       5G PUD     D A . . . W R V
0xffffffc140000000-0xffffffc177000000    0x00000001c0000000     880M PMD     D A . . . W R V

Alexandre Ghiti (2):
  riscv: Get memory below load_pa while ensuring linear mapping is PMD
aligned
  riscv: Use PUD/PGDIR entries for linear mapping when possible

 arch/riscv/include/asm/page.h |  8 
 arch/riscv/mm/init.c  | 69 +--
 2 files changed, 65 insertions(+), 12 deletions(-)

-- 
2.20.1



[PATCH v2 8/8] riscv: Explicit comment about user virtual address space size

2020-06-03 Thread Alexandre Ghiti
Define precisely the size of the user accessible virtual space size
for sv32/39/48 mmu types and explain why the whole virtual address
space is split into 2 equal chunks between kernel and user space.
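
As a quick cross-check of the numbers below, using the PGDIR_SIZE and
PTRS_PER_PGD values of each mode (TASK_SIZE = PGDIR_SIZE * PTRS_PER_PGD / 2
on 64-bit):

	sv39: 1GB   * 512 / 2 = 256GB -> 0x0000004000000000
	sv48: 512GB * 512 / 2 = 128TB -> 0x0000800000000000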

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
Reviewed-by: Palmer Dabbelt 
---
 arch/riscv/include/asm/pgtable.h | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index cb8c6863266b..86bbc2ed1cdd 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -480,8 +480,15 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 #endif
 
 /*
- * Task size is 0x4000000000 for RV64 or 0x9fc00000 for RV32.
- * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
+ * Task size is:
+ * - 0x9fc00000       (~2.5GB) for RV32.
+ * - 0x4000000000     ( 256GB) for RV64 using SV39 mmu
+ * - 0x800000000000   ( 128TB) for RV64 using SV48 mmu
+ *
+ * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
+ * Instruction Set Manual Volume II: Privileged Architecture" states that
+ * "load and store effective addresses, which are 64bits, must have bits
+ * 63–48 all equal to bit 47, or else a page-fault exception will occur."
  */
 #ifdef CONFIG_64BIT
 #define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
-- 
2.20.1



[PATCH v2 7/8] riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo

2020-06-03 Thread Alexandre Ghiti
Now that the mmu type is determined at runtime using SATP
characteristic, use the global variable pgtable_l4_enabled to output
mmu type of the processor through /proc/cpuinfo instead of relying on
device tree infos.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
Reviewed-by: Palmer Dabbelt 
---
 arch/riscv/include/asm/pgtable.h |  1 +
 arch/riscv/kernel/cpu.c  | 23 ---
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index b4b532525fee..cb8c6863266b 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -507,6 +507,7 @@ static inline void __kernel_map_pages(struct page *page, int numpages, int enable)
 extern char _start[];
 extern void *dtb_early_va;
 extern u64 satp_mode;
+extern bool pgtable_l4_enabled;
 void setup_bootmem(void);
 void paging_init(void);
 
diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index 40a3c442ac5f..4661b6669edb 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Returns the hart ID of the given device tree node, or -ENODEV if the node
@@ -54,18 +55,19 @@ static void print_isa(struct seq_file *f, const char *isa)
seq_puts(f, "\n");
 }
 
-static void print_mmu(struct seq_file *f, const char *mmu_type)
+static void print_mmu(struct seq_file *f)
 {
+   char sv_type[16];
+
 #if defined(CONFIG_32BIT)
-   if (strcmp(mmu_type, "riscv,sv32") != 0)
-   return;
+   strncpy(sv_type, "sv32", 5);
 #elif defined(CONFIG_64BIT)
-   if (strcmp(mmu_type, "riscv,sv39") != 0 &&
-   strcmp(mmu_type, "riscv,sv48") != 0)
-   return;
+   if (pgtable_l4_enabled)
+   strncpy(sv_type, "sv48", 5);
+   else
+   strncpy(sv_type, "sv39", 5);
 #endif
-
-   seq_printf(f, "mmu\t\t: %s\n", mmu_type+6);
+   seq_printf(f, "mmu\t\t: %s\n", sv_type);
 }
 
 static void *c_start(struct seq_file *m, loff_t *pos)
@@ -90,14 +92,13 @@ static int c_show(struct seq_file *m, void *v)
 {
unsigned long cpu_id = (unsigned long)v - 1;
struct device_node *node = of_get_cpu_node(cpu_id, NULL);
-   const char *compat, *isa, *mmu;
+   const char *compat, *isa;
 
seq_printf(m, "processor\t: %lu\n", cpu_id);
seq_printf(m, "hart\t\t: %lu\n", cpuid_to_hartid_map(cpu_id));
	if (!of_property_read_string(node, "riscv,isa", &isa))
		print_isa(m, isa);
-	if (!of_property_read_string(node, "mmu-type", &mmu))
-		print_mmu(m, mmu);
+	print_mmu(m);
	if (!of_property_read_string(node, "compatible", &compat)
&& strcmp(compat, "riscv"))
seq_printf(m, "uarch\t\t: %s\n", compat);
-- 
2.20.1



[PATCH v2 6/8] riscv: Allow user to downgrade to sv39 when hw supports sv48

2020-06-03 Thread Alexandre Ghiti
This is made possible by using the mmu-type property of the cpu node of
the device tree.

By default, the kernel will boot with a 4-level page table if the hw
supports it, but it can be interesting for the user to select a 3-level
page table as it consumes less memory and is faster since it requires
fewer memory accesses in case of a TLB miss.
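
An illustrative device-tree fragment (the node layout is only an example;
the property value is the one this patch looks for) that forces sv39 on
sv48-capable hardware:

	cpus {
		cpu@0 {
			mmu-type = "riscv,sv39";
		};
	};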

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/init.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index a937173af13d..4064639b24e4 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -509,10 +509,32 @@ void disable_pgtable_l4(void)
  * then read SATP to see if the configuration was taken into account
  * meaning sv48 is supported.
  */
-asmlinkage __init void set_satp_mode(uintptr_t load_pa)
+asmlinkage __init void set_satp_mode(uintptr_t load_pa, uintptr_t dtb_pa)
 {
u64 identity_satp, hw_satp;
+   int cpus_node;
 
+   /* 1/ Check if the user asked for sv39 explicitly in the device tree */
+   cpus_node = fdt_path_offset((void *)dtb_pa, "/cpus");
+   if (cpus_node >= 0) {
+   int node;
+
+   fdt_for_each_subnode(node, (void *)dtb_pa, cpus_node) {
+   const char *mmu_type = fdt_getprop((void *)dtb_pa, node,
+   "mmu-type", NULL);
+   if (!mmu_type)
+   continue;
+
+   if (!strcmp(mmu_type, "riscv,sv39")) {
+   disable_pgtable_l4();
+   return;
+   }
+
+   break;
+   }
+   }
+
+   /* 2/ Determine if the HW supports sv48: if not, fallback to sv39 */
create_pgd_mapping(early_pg_dir, load_pa, (uintptr_t)early_pud,
   PGDIR_SIZE, PAGE_TABLE);
create_pud_mapping(early_pud, load_pa, (uintptr_t)early_pmd,
@@ -560,7 +582,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
load_sz = (uintptr_t)(&_end) - load_pa;
 
 #if defined(CONFIG_64BIT) && !defined(CONFIG_MAXPHYSMEM_2GB)
-   set_satp_mode(load_pa);
+   set_satp_mode(load_pa, dtb_pa);
 #endif
 
kernel_virt_addr = KERNEL_VIRT_ADDR;
-- 
2.20.1



[PATCH v2 5/8] riscv: Implement sv48 support

2020-06-03 Thread Alexandre Ghiti
By adding a new 4th level of page table, allow the 64bit kernel to
address 2^48 bytes of virtual address space: in practice, that roughly
offers ~160TB of virtual address space to userspace and allows up to 64TB
of physical memory.

If the underlying hardware does not support sv48, we automatically fall
back to a standard 3-level page table by folding the new PUD level into
the PGDIR level. In order to detect HW capabilities at runtime, we rely
on the SATP behaviour that ignores writes with an unsupported mode.
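
The detection idea, as a hedged sketch (identifiers follow the ones this
patch introduces; the control flow is simplified):

	/*
	 * Program SATP with an identity mapping and the SV48 mode bits, then
	 * read it back: hardware without sv48 support ignores the write, so
	 * the mode does not stick and we fold back to a 3-level page table.
	 */
	csr_write(CSR_SATP, identity_satp);	/* identity_satp has SATP_MODE_48 set */
	hw_satp = csr_swap(CSR_SATP, 0ULL);	/* read back, MMU off again */

	if (hw_satp != identity_satp)
		disable_pgtable_l4();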

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/Kconfig  |   6 +-
 arch/riscv/include/asm/csr.h|   3 +-
 arch/riscv/include/asm/fixmap.h |   1 +
 arch/riscv/include/asm/page.h   |  15 +++
 arch/riscv/include/asm/pgalloc.h|  36 +++
 arch/riscv/include/asm/pgtable-64.h |  97 -
 arch/riscv/include/asm/pgtable.h|  10 +-
 arch/riscv/kernel/head.S|   3 +-
 arch/riscv/mm/context.c |   2 +-
 arch/riscv/mm/init.c| 158 +---
 10 files changed, 307 insertions(+), 24 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e167f16131f4..3f73f60e9732 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -68,6 +68,7 @@ config RISCV
select ARCH_HAS_GCOV_PROFILE_ALL
select HAVE_COPY_THREAD_TLS
select HAVE_ARCH_KASAN if MMU && 64BIT
+   select RELOCATABLE if 64BIT
 
 config ARCH_MMAP_RND_BITS_MIN
default 18 if 64BIT
@@ -106,7 +107,7 @@ config PAGE_OFFSET
	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
	default 0x80000000 if 64BIT && !MMU
	default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
-	default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
+	default 0xffffffc000000000 if 64BIT && !MAXPHYSMEM_2GB
 
 config ARCH_FLATMEM_ENABLE
def_bool y
@@ -155,8 +156,11 @@ config GENERIC_HWEIGHT
 config FIX_EARLYCON_MEM
def_bool MMU
 
+# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
+# on a 3-level page table when sv48 is not supported.
 config PGTABLE_LEVELS
int
+   default 4 if 64BIT && RELOCATABLE
default 3 if 64BIT
default 2
 
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index cec462e198ce..d41536c3f8d4 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -40,11 +40,10 @@
 #ifndef CONFIG_64BIT
 #define SATP_PPN	_AC(0x003FFFFF, UL)
 #define SATP_MODE_32	_AC(0x80000000, UL)
-#define SATP_MODE	SATP_MODE_32
 #else
 #define SATP_PPN	_AC(0x00000FFFFFFFFFFF, UL)
 #define SATP_MODE_39	_AC(0x8000000000000000, UL)
-#define SATP_MODE	SATP_MODE_39
+#define SATP_MODE_48	_AC(0x9000000000000000, UL)
 #endif
 
 /* Exception cause high bit - is an interrupt if set */
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 2368d49eb4ef..d891cf9c73c5 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -27,6 +27,7 @@ enum fixed_addresses {
FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
FIX_PTE,
FIX_PMD,
+   FIX_PUD,
FIX_TEXT_POKE1,
FIX_TEXT_POKE0,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 48bb09b6a9b7..5e77fe7f0d6d 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -31,7 +31,19 @@
  * When not using MMU this corresponds to the first free page in
  * physical memory (aligned on a page boundary).
  */
+#ifdef CONFIG_RELOCATABLE
+#define PAGE_OFFSET		__page_offset
+
+#ifdef CONFIG_64BIT
+/*
+ * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
+ * define the PAGE_OFFSET value for SV39.
+ */
+#define PAGE_OFFSET_L3		0xffffffe000000000
+#endif /* CONFIG_64BIT */
+#else
 #define PAGE_OFFSET		_AC(CONFIG_PAGE_OFFSET, UL)
+#endif /* CONFIG_RELOCATABLE */
 
 #define KERN_VIRT_SIZE (-PAGE_OFFSET)
 
@@ -102,6 +114,9 @@ extern unsigned long pfn_base;
 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
 extern unsigned long kernel_virt_addr;
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long __page_offset;
+#endif
 
 #define __pa_to_va_nodebug(x)  ((void *)((unsigned long) (x) + va_pa_offset))
 #define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 3f601ee8233f..540eaa5a8658 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 
set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
 }
+
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
+{
+

[PATCH v2 4/8] riscv: Prepare ptdump for vm layout dynamic addresses

2020-06-03 Thread Alexandre Ghiti
This is a preparatory patch for sv48 support that will introduce
dynamic PAGE_OFFSET.

Dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic and can't be used
to statically initialize the array used by ptdump to identify the
different zones of the vm layout.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/mm/ptdump.c | 49 ++
 1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 7eab76a93106..7d9386a7f5c2 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -49,22 +49,41 @@ struct addr_marker {
const char *name;
 };
 
+enum address_markers_idx {
+#ifdef CONFIG_KASAN
+   KASAN_SHADOW_START_NR,
+   KASAN_SHADOW_END_NR,
+#endif
+   FIXMAP_START_NR,
+   FIXMAP_END_NR,
+   PCI_IO_START_NR,
+   PCI_IO_END_NR,
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   VMEMMAP_START_NR,
+   VMEMMAP_END_NR,
+#endif
+   VMALLOC_START_NR,
+   VMALLOC_END_NR,
+   PAGE_OFFSET_NR,
+   END_OF_SPACE_NR
+};
+
 static struct addr_marker address_markers[] = {
 #ifdef CONFIG_KASAN
{KASAN_SHADOW_START,"Kasan shadow start"},
{KASAN_SHADOW_END,  "Kasan shadow end"},
 #endif
-   {FIXADDR_START, "Fixmap start"},
-   {FIXADDR_TOP,   "Fixmap end"},
-   {PCI_IO_START,  "PCI I/O start"},
-   {PCI_IO_END,"PCI I/O end"},
+   {0, "Fixmap start"},
+   {0, "Fixmap end"},
+   {0, "PCI I/O start"},
+   {0, "PCI I/O end"},
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-   {VMEMMAP_START, "vmemmap start"},
-   {VMEMMAP_END,   "vmemmap end"},
+   {0, "vmemmap start"},
+   {0, "vmemmap end"},
 #endif
-   {VMALLOC_START, "vmalloc() area"},
-   {VMALLOC_END,   "vmalloc() end"},
-   {PAGE_OFFSET,   "Linear mapping"},
+   {0, "vmalloc() area"},
+   {0, "vmalloc() end"},
+   {0, "Linear mapping"},
{-1, NULL},
 };
 
@@ -304,6 +323,18 @@ static int ptdump_init(void)
 {
unsigned int i, j;
 
+   address_markers[FIXMAP_START_NR].start_address = FIXADDR_START;
+   address_markers[FIXMAP_END_NR].start_address = FIXADDR_TOP;
+   address_markers[PCI_IO_START_NR].start_address = PCI_IO_START;
+   address_markers[PCI_IO_END_NR].start_address = PCI_IO_END;
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+   address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+   address_markers[VMEMMAP_END_NR].start_address = VMEMMAP_END;
+#endif
+   address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
+   address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
+   address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
-- 
2.20.1



[PATCH v2 3/8] riscv: Simplify MAXPHYSMEM config

2020-06-03 Thread Alexandre Ghiti
Either the user specifies maximum physical memory size of 2GB or the
user lives with the system constraint which is 1/4th of maximum
addressable memory in Sv39 MMU mode (i.e. 128GB) for now.
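
(For reference: Sv39 covers 2^39 = 512GB of virtual address space, and with
the address space split used here the linear mapping gets a quarter of it,
hence the 128GB limit when MAXPHYSMEM_2GB is not selected.)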

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/Kconfig | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 64b25a90d60f..e167f16131f4 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -106,7 +106,7 @@ config PAGE_OFFSET
	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
	default 0x80000000 if 64BIT && !MMU
	default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
-	default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
+	default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
 
 config ARCH_FLATMEM_ENABLE
def_bool y
@@ -223,19 +223,11 @@ config MODULE_SECTIONS
bool
select HAVE_MOD_ARCH_SPECIFIC
 
-choice
-   prompt "Maximum Physical Memory"
-   default MAXPHYSMEM_2GB if 32BIT
-   default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
-   default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
-
-   config MAXPHYSMEM_2GB
-   bool "2GiB"
-   config MAXPHYSMEM_128GB
-   depends on 64BIT && CMODEL_MEDANY
-   bool "128GiB"
-endchoice
-
+config MAXPHYSMEM_2GB
+   bool "Maximum Physical Memory 2GiB"
+   default y if 32BIT
+   default y if 64BIT && CMODEL_MEDLOW
+   default n
 
 config SMP
bool "Symmetric Multi-Processing"
-- 
2.20.1



[PATCH v2 2/8] riscv: Allow to dynamically define VA_BITS

2020-06-03 Thread Alexandre Ghiti
With 4-level page table folding at runtime, we don't know at compile time
the size of the virtual address space so we must set VA_BITS dynamically
so that sparsemem reserves the right amount of memory for struct pages.

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/Kconfig | 10 --
 arch/riscv/include/asm/pgtable.h   | 11 +--
 arch/riscv/include/asm/sparsemem.h |  6 +-
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 93127d5913fe..64b25a90d60f 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -101,16 +101,6 @@ config ZONE_DMA32
bool
default y if 64BIT
 
-config VA_BITS
-   int
-   default 32 if 32BIT
-   default 39 if 64BIT
-
-config PA_BITS
-   int
-   default 34 if 32BIT
-   default 56 if 64BIT
-
 config PAGE_OFFSET
hex
	default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 94ef3b49dfb6..ec9694624f3c 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -40,8 +40,14 @@
  * struct pages to map half the virtual address space. Then
  * position vmemmap directly below the VMALLOC region.
  */
+#ifdef CONFIG_64BIT
+#define VA_BITS39
+#else
+#define VA_BITS32
+#endif
+
 #define VMEMMAP_SHIFT \
-   (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
+   (VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
 #define VMEMMAP_SIZE   BIT(VMEMMAP_SHIFT)
 #define VMEMMAP_END(VMALLOC_START - 1)
 #define VMEMMAP_START  (VMALLOC_START - VMEMMAP_SIZE)
@@ -80,6 +86,7 @@
 #endif /* CONFIG_64BIT */
 
 #ifdef CONFIG_MMU
+
 /* Number of entries in the page global directory */
 #define PTRS_PER_PGD(PAGE_SIZE / sizeof(pgd_t))
 /* Number of entries in the page table */
@@ -466,7 +473,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
  * and give the kernel the other (upper) half.
  */
 #ifdef CONFIG_64BIT
-#define KERN_VIRT_START(-(BIT(CONFIG_VA_BITS)) + TASK_SIZE)
+#define KERN_VIRT_START(-(BIT(VA_BITS)) + TASK_SIZE)
 #else
 #define KERN_VIRT_STARTFIXADDR_START
 #endif
diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
index 45a7018a8118..63acaecc3374 100644
--- a/arch/riscv/include/asm/sparsemem.h
+++ b/arch/riscv/include/asm/sparsemem.h
@@ -4,7 +4,11 @@
 #define _ASM_RISCV_SPARSEMEM_H
 
 #ifdef CONFIG_SPARSEMEM
-#define MAX_PHYSMEM_BITS   CONFIG_PA_BITS
+#ifdef CONFIG_64BIT
+#define MAX_PHYSMEM_BITS   56
+#else
+#define MAX_PHYSMEM_BITS   34
+#endif /* CONFIG_64BIT */
 #define SECTION_SIZE_BITS  27
 #endif /* CONFIG_SPARSEMEM */
 
-- 
2.20.1



[PATCH v2 1/8] riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE

2020-06-03 Thread Alexandre Ghiti
There is no need to compare the MAX_EARLY_MAPPING_SIZE value with
PGDIR_SIZE at compile time since MAX_EARLY_MAPPING_SIZE is set to 128MB,
which is less than PGDIR_SIZE (1GB): that allows simplifying the early_pmd
definition.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
Reviewed-by: Palmer Dabbelt 
---
 arch/riscv/mm/init.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index e63ea5b6b6cf..80fd692b72d5 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -256,13 +256,7 @@ static void __init create_pte_mapping(pte_t *ptep,
 
 pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
 pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
-
-#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
-#define NUM_EARLY_PMDS 1UL
-#else
-#define NUM_EARLY_PMDS (1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE)
-#endif
-pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE);
+pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
 
 static pmd_t *__init get_pmd_virt(phys_addr_t pa)
 {
@@ -276,14 +270,12 @@ static pmd_t *__init get_pmd_virt(phys_addr_t pa)
 
 static phys_addr_t __init alloc_pmd(uintptr_t va)
 {
-   uintptr_t pmd_num;
-
if (mmu_enabled)
return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 
-   pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
-   BUG_ON(pmd_num >= NUM_EARLY_PMDS);
-	return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
+   BUG_ON((va - kernel_virt_addr) >> PGDIR_SHIFT);
+
+   return (uintptr_t)early_pmd;
 }
 
 static void __init create_pmd_mapping(pmd_t *pmdp,
-- 
2.20.1



[PATCH v2 0/8] Introduce sv48 support

2020-06-03 Thread Alexandre Ghiti
This patchset implements sv48 support at runtime. The kernel will try to
boot with a 4-level page table and will fall back to 3-level if the HW does
not support it.

The biggest advantage is that we only have one kernel for 64bit, which
is way easier to maintain.

Folding the 4th level into a 3-level page table has almost no cost at
runtime. But as mentioned by Palmer, the relocatable code generated is less
performant.

At the moment, there is no way to build a non-relocatable 3-level page
table 64bit kernel. We agreed that distributions will use this runtime
configuration anyway, but Palmer proposed to introduce a new Kconfig, which
I will do later as sv48 support was asked for 5.8.

Finally, the user can now ask for sv39 explicitly by using the device tree,
which will reduce the memory footprint and the number of memory accesses
in case of a TLB miss.

Changes in v2:
  * Move variable declarations to pgtable.h in patch 5/7 as suggested by Anup
  * Restore mmu-type properties in patch 6 as suggested by Anup
  * Fix unused variable in patch 5 that was used in patch 6
  * Fix SPARSEMEM build (patch 2 was modified so I dropped the Reviewed-by)
  * Applied various Reviewed-by

Alexandre Ghiti (8):
  riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
  riscv: Allow to dynamically define VA_BITS
  riscv: Simplify MAXPHYSMEM config
  riscv: Prepare ptdump for vm layout dynamic addresses
  riscv: Implement sv48 support
  riscv: Allow user to downgrade to sv39 when hw supports sv48
  riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
  riscv: Explicit comment about user virtual address space size

 arch/riscv/Kconfig  |  34 ++---
 arch/riscv/include/asm/csr.h|   3 +-
 arch/riscv/include/asm/fixmap.h |   1 +
 arch/riscv/include/asm/page.h   |  15 +++
 arch/riscv/include/asm/pgalloc.h|  36 ++
 arch/riscv/include/asm/pgtable-64.h |  97 +-
 arch/riscv/include/asm/pgtable.h|  31 -
 arch/riscv/include/asm/sparsemem.h  |   6 +-
 arch/riscv/kernel/cpu.c |  23 ++--
 arch/riscv/kernel/head.S|   3 +-
 arch/riscv/mm/context.c |   2 +-
 arch/riscv/mm/init.c| 194 
 arch/riscv/mm/ptdump.c  |  49 +--
 13 files changed, 412 insertions(+), 82 deletions(-)

-- 
2.20.1


