Re: [PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.

2024-03-08 Thread Alexei Starovoitov
On Fri, Mar 8, 2024 at 9:14 AM Marek Szyprowski
 wrote:
>
> On 05.03.2024 04:05, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov 
> >
> > There are various users of get_vm_area() + ioremap_page_range() APIs.
> > Enforce that get_vm_area() was requested as VM_IOREMAP type and range
> > passed to ioremap_page_range() matches created vm_area to avoid
> > accidentally ioremap-ing into wrong address range.
> >
> > Reviewed-by: Christoph Hellwig 
> > Signed-off-by: Alexei Starovoitov 
> > ---
>
> This patch landed in today's linux-next as commit 3e49a866c9dc ("mm:
> Enforce VM_IOREMAP flag and range in ioremap_page_range.").
> Unfortunately it triggers the following warning on all my test machines
> with PCI bridges. Here is an example reproduced with QEMU and ARM64
> 'virt' machine:

Sorry about the breakage.
Here is the thread where we're discussing the fix:
https://lore.kernel.org/bpf/CAADnVQLP=dxBb+RiMGXoaCEuRrbK387J6B+pfzWKF_F=arg...@mail.gmail.com/



Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

2024-03-06 Thread Alexei Starovoitov
On Wed, Mar 6, 2024 at 2:57 PM Pasha Tatashin  wrote:
>
> On Wed, Mar 6, 2024 at 5:13 PM Alexei Starovoitov
>  wrote:
> >
> > On Wed, Mar 6, 2024 at 1:46 PM Pasha Tatashin  
> > wrote:
> > >
> > > > > This interface and in general VM_SPARSE would be useful for
> > > > > dynamically grown kernel stacks [1]. However, the might_sleep() here
> > > > > would be a problem. We would need to be able to handle
> > > > > vm_area_map_pages() from interrupt disabled context therefore no
> > > > > sleeping. The caller would need to guarantee that the page tables are
> > > > > pre-allocated before the mapping.
> > > >
> > > > Sounds like we'd need to differentiate two kinds of sparse regions.
> > > > One that is really sparse where page tables are not populated (bpf use 
> > > > case)
> > > > and another where only the pte level might be empty.
> > > > Only the latter one will be usable for such auto-grow stacks.
> > > >
> > > > Months back I played with this idea:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
> > > > that
> > > > "Make vmap_pages_range() allocate page tables down to the last (PTE) 
> > > > level."
> > > > Essentially pass NULL instead of 'pages' into vmap_pages_range()
> > > > and it will populate all levels except the last.
> > >
> > > Yes, this is what is needed, however, it can be a little simpler with
> > > kernel stacks:
> > > given that the first page in the vm_area is mapped when stack is first
> > > allocated, and that the VA range is aligned to 16K, we actually are
> > > guaranteed to have all page table levels down to pte pre-allocated
> > > during that initial mapping. Therefore, we do not need to worry about
> > > allocating them later during PFs.
> >
> > Ahh. Found:
> > stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, ...
> >
> > > > Then the page fault handler can service a fault in auto-growing stack
> > > > area if it has a page stashed in some per-cpu free list.
> > > > I suspect this is something you might need for
> > > > "16k stack that is populated on fault",
> > > > plus a free list of 3 pages per-cpu,
> > > > and set_pte_at() in pf handler.
> > >
> > > Yes, what you described is exactly what I am working on: using 3-pages
> > > per-cpu to handle kstack page faults. The only thing that is missing
> > > is that I would like to have the ability to call a non-sleeping
> > > version of vm_area_map_pages().
> >
> > vm_area_map_pages() cannot be non-sleepable, since the [start, end)
> > range will dictate whether mid level allocs and locks are needed.
> >
> > Instead in alloc_thread_stack_node() you'd need a flavor
> > of get_vm_area() that can align the range to THREAD_ALIGN.
> > Then immediately call _sleepable_ vm_area_map_pages() to populate
> > the first page and later set_pte_at() the other pages on demand
> > from the fault handler.
>
> We still need to get to PTE level to use set_pte_at(). So, either
> store it in task_struct for faster PF handling, or add another
> non-sleeping vmap function that will do something like this:
>
> vm_area_set_page_at(addr, page)
> {
>pgd = pgd_offset_k(addr)
>p4d = vunmap_p4d_range(pgd, addr)
>pud = pud_offset(p4d, addr)
>pmd = pmd_offset(pud, addr)
>pte = pte_offset_kernel(pmd, addr)
>
>   set_pte_at(init_mm, addr, pte, mk_pte(page...));
> }

Right. There are several flavors of this logic across the tree.
What you're proposing is pretty much vmalloc_to_page() that
returns pte even if !pte_present, instead of a page.
x86 is doing mostly the same in lookup_address() fwiw.
Good opportunity to clean all this up and share the code.
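To make that concrete, here is a rough, untested sketch of such a shared helper. The name vmalloc_to_pte() and its placement are made up, and huge p4d/pud/pmd mappings are deliberately ignored; it only illustrates the walk the existing call sites duplicate:

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Sketch only: walk the kernel page tables the way vmalloc_to_page()
 * does, but return the pte slot itself even when !pte_present(), so a
 * fault handler can later set_pte_at() into it.
 */
static pte_t *vmalloc_to_pte(unsigned long addr)
{
	pgd_t *pgd = pgd_offset_k(addr);
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;

	if (pgd_none(*pgd))
		return NULL;
	p4d = p4d_offset(pgd, addr);
	if (p4d_none(*p4d))
		return NULL;
	pud = pud_offset(p4d, addr);
	if (pud_none(*pud))
		return NULL;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd))
		return NULL;
	/* the pte page is assumed pre-allocated; the entry may be pte_none() */
	return pte_offset_kernel(pmd, addr);
}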



Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

2024-03-06 Thread Alexei Starovoitov
On Wed, Mar 6, 2024 at 1:46 PM Pasha Tatashin  wrote:
>
> > > This interface and in general VM_SPARSE would be useful for
> > > dynamically grown kernel stacks [1]. However, the might_sleep() here
> > > would be a problem. We would need to be able to handle
> > > vm_area_map_pages() from interrupt disabled context therefore no
> > > sleeping. The caller would need to guarantee that the page tables are
> > > pre-allocated before the mapping.
> >
> > Sounds like we'd need to differentiate two kinds of sparse regions.
> > One that is really sparse where page tables are not populated (bpf use case)
> > and another where only the pte level might be empty.
> > Only the latter one will be usable for such auto-grow stacks.
> >
> > Months back I played with this idea:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
> > that
> > "Make vmap_pages_range() allocate page tables down to the last (PTE) level."
> > Essentially pass NULL instead of 'pages' into vmap_pages_range()
> > and it will populate all levels except the last.
>
> Yes, this is what is needed, however, it can be a little simpler with
> kernel stacks:
> given that the first page in the vm_area is mapped when stack is first
> allocated, and that the VA range is aligned to 16K, we actually are
> guaranteed to have all page table levels down to pte pre-allocated
> during that initial mapping. Therefore, we do not need to worry about
> allocating them later during PFs.

Ahh. Found:
stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, ...

> > Then the page fault handler can service a fault in auto-growing stack
> > area if it has a page stashed in some per-cpu free list.
> > I suspect this is something you might need for
> > "16k stack that is populated on fault",
> > plus a free list of 3 pages per-cpu,
> > and set_pte_at() in pf handler.
>
> Yes, what you described is exactly what I am working on: using 3-pages
> per-cpu to handle kstack page faults. The only thing that is missing
> is that I would like to have the ability to call a non-sleeping
> version of vm_area_map_pages().

vm_area_map_pages() cannot be non-sleepable, since the [start, end)
range will dictate whether mid level allocs and locks are needed.

Instead in alloc_thread_stack_node() you'd need a flavor
of get_vm_area() that can align the range to THREAD_ALIGN.
Then immediately call _sleepable_ vm_area_map_pages() to populate
the first page and later set_pte_at() the other pages on demand
from the fault handler.
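Roughly, and only as an untested sketch, the allocation side could then look like the following. get_vm_area_aligned() is a placeholder for the aligned flavor described above (no such helper exists today), and error unwinding is omitted:

#include <linux/gfp.h>
#include <linux/sched/task_stack.h>
#include <linux/vmalloc.h>

/* Untested sketch; get_vm_area_aligned() is hypothetical. */
static int alloc_sparse_stack(struct task_struct *tsk)
{
	struct vm_struct *area;
	struct page *page;

	/* assumed flavor of get_vm_area() that honors THREAD_ALIGN */
	area = get_vm_area_aligned(THREAD_SIZE, THREAD_ALIGN, VM_SPARSE);
	if (!area)
		return -ENOMEM;

	/* sleepable context here: map only the first page up front */
	page = alloc_page(THREADINFO_GFP);
	if (!page)
		return -ENOMEM;		/* error unwinding omitted */
	if (vm_area_map_pages(area, (unsigned long)area->addr,
			      (unsigned long)area->addr + PAGE_SIZE, &page))
		return -ENOMEM;

	tsk->stack_vm_area = area;
	tsk->stack = area->addr;
	/* the remaining pages get set_pte_at()'d from the fault handler */
	return 0;
}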



Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

2024-03-06 Thread Alexei Starovoitov
On Wed, Mar 6, 2024 at 1:04 PM Pasha Tatashin  wrote:
>
> On Mon, Mar 4, 2024 at 10:05 PM Alexei Starovoitov
>  wrote:
> >
> > From: Alexei Starovoitov 
> >
> > vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
> > virtual space.
> >
> > get_vm_area() with appropriate flag is used to request an area of kernel
> > address range. It's used for vmalloc, vmap, ioremap, xen use cases.
> > - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
> > - the areas created by vmap() function should be tagged with VM_MAP.
> > - ioremap areas are tagged with VM_IOREMAP.
> >
> > BPF would like to extend the vmap API to implement a lazily-populated
> > sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag
> > and vm_area_map_pages(area, start_addr, count, pages) API to map a set
> > of pages within a given area.
> > It has the same sanity checks as vmap() does.
> > It also checks that get_vm_area() was created with VM_SPARSE flag
> > which identifies such areas in /proc/vmallocinfo
> > and returns zero pages on read through /proc/kcore.
> >
> > The next commits will introduce bpf_arena which is a sparsely populated
> > shared memory region between bpf program and user space process. It will
> > map privately-managed pages into a sparse vm area with the following steps:
> >
> >   // request virtual memory region during bpf prog verification
> >   area = get_vm_area(area_size, VM_SPARSE);
> >
> >   // on demand
> >   vm_area_map_pages(area, kaddr, kend, pages);
> >   vm_area_unmap_pages(area, kaddr, kend);
> >
> >   // after bpf program is detached and unloaded
> >   free_vm_area(area);
> >
> > Signed-off-by: Alexei Starovoitov 
> > ---
> >  include/linux/vmalloc.h |  5 
> >  mm/vmalloc.c| 59 +++--
> >  2 files changed, 62 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index c720be70c8dd..0f72c85a377b 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -35,6 +35,7 @@ struct iov_iter;  /* in uio.h */
> >  #else
> >  #define VM_DEFER_KMEMLEAK  0
> >  #endif
> > +#define VM_SPARSE  0x1000  /* sparse vm_area. not all 
> > pages are present. */
> >
> >  /* bits [20..32] reserved for arch specific ioremap internals */
> >
> > @@ -232,6 +233,10 @@ static inline bool is_vm_area_hugepages(const void 
> > *addr)
> >  }
> >
> >  #ifdef CONFIG_MMU
> > +int vm_area_map_pages(struct vm_struct *area, unsigned long start,
> > + unsigned long end, struct page **pages);
> > +void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
> > +unsigned long end);
> >  void vunmap_range(unsigned long addr, unsigned long end);
> >  static inline void set_vm_flush_reset_perms(void *addr)
> >  {
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index f42f98a127d5..e5b8c70950bc 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, 
> > unsigned long end,
> > return err;
> >  }
> >
> > +static int check_sparse_vm_area(struct vm_struct *area, unsigned long 
> > start,
> > +   unsigned long end)
> > +{
> > +   might_sleep();
>
> This interface and in general VM_SPARSE would be useful for
> dynamically grown kernel stacks [1]. However, the might_sleep() here
> would be a problem. We would need to be able to handle
> vm_area_map_pages() from interrupt disabled context therefore no
> sleeping. The caller would need to guarantee that the page tables are
> pre-allocated before the mapping.

Sounds like we'd need to differentiate two kinds of sparse regions.
One that is really sparse where page tables are not populated (bpf use case)
and another where only the pte level might be empty.
Only the latter one will be usable for such auto-grow stacks.

Months back I played with this idea:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
that
"Make vmap_pages_range() allocate page tables down to the last (PTE) level."
Essentially pass NULL instead of 'pages' into vmap_pages_range()
and it will populate all levels except the last.
Then the page fault handler can service a fault in auto-growing stack
area if it has a page stashed in some per-cpu free list.
I suspect this is something you might need for
"16k stack that is populated on fault",
plus a free list of 3 pages per-cpu,
and set_pte_at() in pf handler.
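A very rough sketch of that fault-handler side, with made-up names, no locking, and no refill logic, assuming the pte level was pre-allocated as described above and vmalloc_to_pte() is a helper along the lines of the sketch shown earlier in this thread:

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/pgtable.h>

/* Hypothetical per-cpu stash of 3 pages for growing a 16k stack on fault. */
struct stack_refill {
	struct page *pages[3];
	int nr;
};
static DEFINE_PER_CPU(struct stack_refill, stack_refill);

/*
 * Sketch only: called from the kernel fault path with interrupts
 * disabled, after the caller has established that @addr falls inside
 * the faulting task's stack vm_area.
 */
static bool handle_stack_fault(unsigned long addr)
{
	struct stack_refill *r = this_cpu_ptr(&stack_refill);
	pte_t *pte;

	if (!r->nr)
		return false;			/* nothing stashed, let it oops */
	pte = vmalloc_to_pte(addr & PAGE_MASK);
	if (!pte || !pte_none(*pte))
		return false;
	set_pte_at(&init_mm, addr & PAGE_MASK, pte,
		   mk_pte(r->pages[--r->nr], PAGE_KERNEL));
	return true;
}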



Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

2024-03-06 Thread Alexei Starovoitov
On Wed, Mar 6, 2024 at 6:19 AM Christoph Hellwig  wrote:
>
> I'd still prefer to hide the vm_area, but for now:
>
> Reviewed-by: Christoph Hellwig 

Thank you.
I will think of a way to move get_vm_area() to mm/internal.h and
propose a plan by lsf/mm/bpf in May.



[PATCH v4 bpf-next 1/2] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.

2024-03-04 Thread Alexei Starovoitov
From: Alexei Starovoitov 

There are various users of get_vm_area() + ioremap_page_range() APIs.
Enforce that get_vm_area() was requested as VM_IOREMAP type and range
passed to ioremap_page_range() matches created vm_area to avoid
accidentally ioremap-ing into wrong address range.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Alexei Starovoitov 
---
 mm/vmalloc.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..f42f98a127d5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -307,8 +307,21 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
 int ioremap_page_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot)
 {
+   struct vm_struct *area;
int err;
 
+   area = find_vm_area((void *)addr);
+   if (!area || !(area->flags & VM_IOREMAP)) {
+   WARN_ONCE(1, "vm_area at addr %lx is not marked as VM_IOREMAP\n", addr);
+   return -EINVAL;
+   }
+   if (addr != (unsigned long)area->addr ||
+   (void *)end != area->addr + get_vm_area_size(area)) {
+   WARN_ONCE(1, "ioremap request [%lx,%lx) doesn't match vm_area [%lx, %lx)\n",
+ addr, end, (long)area->addr,
+ (long)area->addr + get_vm_area_size(area));
+   return -ERANGE;
+   }
err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
 ioremap_max_page_shift);
flush_cache_vmap(addr, end);
-- 
2.43.0




[PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

2024-03-04 Thread Alexei Starovoitov
From: Alexei Starovoitov 

vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
virtual space.

get_vm_area() with appropriate flag is used to request an area of kernel
address range. It's used for vmalloc, vmap, ioremap, xen use cases.
- vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
- the areas created by vmap() function should be tagged with VM_MAP.
- ioremap areas are tagged with VM_IOREMAP.

BPF would like to extend the vmap API to implement a lazily-populated
sparse, yet contiguous kernel virtual space. Introduce a VM_SPARSE flag
and a vm_area_map_pages(area, start, end, pages) API to map a set
of pages within a given area.
It has the same sanity checks as vmap() does.
It also checks that the area was created with the VM_SPARSE flag,
which identifies such areas in /proc/vmallocinfo,
and such areas read as zero pages through /proc/kcore.

The next commits will introduce bpf_arena which is a sparsely populated
shared memory region between bpf program and user space process. It will
map privately-managed pages into a sparse vm area with the following steps:

  // request virtual memory region during bpf prog verification
  area = get_vm_area(area_size, VM_SPARSE);

  // on demand
  vm_area_map_pages(area, kaddr, kend, pages);
  vm_area_unmap_pages(area, kaddr, kend);

  // after bpf program is detached and unloaded
  free_vm_area(area);

Signed-off-by: Alexei Starovoitov 
---
 include/linux/vmalloc.h |  5 
 mm/vmalloc.c| 59 +++--
 2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..0f72c85a377b 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -35,6 +35,7 @@ struct iov_iter;  /* in uio.h */
 #else
 #define VM_DEFER_KMEMLEAK  0
 #endif
+#define VM_SPARSE  0x1000  /* sparse vm_area. not all pages are present. */
 
 /* bits [20..32] reserved for arch specific ioremap internals */
 
@@ -232,6 +233,10 @@ static inline bool is_vm_area_hugepages(const void *addr)
 }
 
 #ifdef CONFIG_MMU
+int vm_area_map_pages(struct vm_struct *area, unsigned long start,
+ unsigned long end, struct page **pages);
+void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
+unsigned long end);
 void vunmap_range(unsigned long addr, unsigned long end);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f42f98a127d5..e5b8c70950bc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
return err;
 }
 
+static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
+   unsigned long end)
+{
+   might_sleep();
+   if (WARN_ON_ONCE(area->flags & VM_FLUSH_RESET_PERMS))
+   return -EINVAL;
+   if (WARN_ON_ONCE(area->flags & VM_NO_GUARD))
+   return -EINVAL;
+   if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
+   return -EINVAL;
+   if ((end - start) >> PAGE_SHIFT > totalram_pages())
+   return -E2BIG;
+   if (start < (unsigned long)area->addr ||
+   (void *)end > area->addr + get_vm_area_size(area))
+   return -ERANGE;
+   return 0;
+}
+
+/**
+ * vm_area_map_pages - map pages inside given sparse vm_area
+ * @area: vm_area
+ * @start: start address inside vm_area
+ * @end: end address inside vm_area
+ * @pages: pages to map (always PAGE_SIZE pages)
+ */
+int vm_area_map_pages(struct vm_struct *area, unsigned long start,
+ unsigned long end, struct page **pages)
+{
+   int err;
+
+   err = check_sparse_vm_area(area, start, end);
+   if (err)
+   return err;
+
+   return vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
+}
+
+/**
+ * vm_area_unmap_pages - unmap pages inside given sparse vm_area
+ * @area: vm_area
+ * @start: start address inside vm_area
+ * @end: end address inside vm_area
+ */
+void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
+unsigned long end)
+{
+   if (check_sparse_vm_area(area, start, end))
+   return;
+
+   vunmap_range(start, end);
+}
+
 int is_vmalloc_or_module_addr(const void *x)
 {
/*
@@ -3822,9 +3874,9 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
if (flags & VMAP_RAM)
copied = vmap_ram_vread_iter(iter, addr, n, flags);
-   else if (!(vm && (vm->flags & VM_IOREMAP)))
+   else if (!(vm && (vm->flags & (VM_IOREMAP | VM_SPARSE))))
copied = aligned_vread_iter(iter, addr, n);
-   else /* IOREMAP area is treated as memory hole */
+   else /* IOREMAP | SPARSE area is treated as memory hole */
copied = zero_iter(iter, n);

[PATCH v4 bpf-next 0/2] mm: Enforce ioremap address space and introduce sparse vm_area

2024-03-04 Thread Alexei Starovoitov
From: Alexei Starovoitov 

v3 -> v4
- dropped VM_XEN patch for now. It will be in the follow up.
- fixed constant as pointed out by Mike

v2 -> v3
- added Christoph's reviewed-by to patch 1
- cap commit log lines to 75 chars
- factored out common checks in patch 3 into helper
- made vm_area_unmap_pages() return void

There are various users of kernel virtual address space:
vmalloc, vmap, ioremap, xen.

- vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag
and these areas are treated differently by KASAN.

- the areas created by vmap() function should be tagged with VM_MAP
(as majority of the users do).

- ioremap areas are tagged with VM_IOREMAP and vm area start is aligned
to size of the area unlike vmalloc/vmap.

- there is also xen usage that is marked as VM_IOREMAP, but it doesn't
call ioremap_page_range() unlike all other VM_IOREMAP users.

To clean this up a bit, enforce that ioremap_page_range() checks the range
and VM_IOREMAP flag.

In addition BPF would like to reserve regions of kernel virtual address
space and populate it lazily, similar to xen use cases.
For that reason, introduce VM_SPARSE flag and vm_area_[un]map_pages()
helpers to populate this sparse area.
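As an illustration only, a minimal, untested sketch of a caller of these new helpers; the SZ_4M size and the single-page population are arbitrary choices for the example:

#include <linux/gfp.h>
#include <linux/sizes.h>
#include <linux/vmalloc.h>

/* Untested sketch of a caller of the new sparse vm_area helpers. */
static int sparse_area_demo(void)
{
	struct vm_struct *area;
	struct page *page;
	unsigned long kaddr;
	int err = -ENOMEM;

	/* reserve a chunk of kernel VA; nothing is mapped yet */
	area = get_vm_area(SZ_4M, VM_SPARSE);
	if (!area)
		return -ENOMEM;

	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
	if (!page)
		goto out_area;

	/* populate just the first page of the region, on demand */
	kaddr = (unsigned long)area->addr;
	err = vm_area_map_pages(area, kaddr, kaddr + PAGE_SIZE, &page);
	if (err)
		goto out_page;

	/* ... use the mapping at kaddr ... */

	vm_area_unmap_pages(area, kaddr, kaddr + PAGE_SIZE);
	err = 0;
out_page:
	__free_page(page);
out_area:
	free_vm_area(area);
	return err;
}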

In the end the /proc/vmallocinfo will show
"vmalloc"
"vmap"
"ioremap"
"sparse"
categories for different kinds of address regions.

ioremap and sparse areas will read as zero when dumped through /proc/kcore.

Alexei Starovoitov (2):
  mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
  mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

 include/linux/vmalloc.h |  5 +++
 mm/vmalloc.c| 72 +++--
 2 files changed, 75 insertions(+), 2 deletions(-)

-- 
2.43.0




Re: [PATCH v2 bpf-next 2/3] mm, xen: Separate xen use cases from ioremap.

2024-03-04 Thread Alexei Starovoitov
On Sun, Mar 3, 2024 at 11:55 PM Mike Rapoport  wrote:
>
> On Fri, Feb 23, 2024 at 03:57:27PM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov 
> >
> > xen grant table and xenbus ring are not ioremap the way arch specific code 
> > is using it,
> > so let's add VM_XEN flag to separate them from VM_IOREMAP users.
> > xen will not and should not be calling ioremap_page_range() on that range.
> > /proc/vmallocinfo will print such region as "xen" instead of "ioremap" as 
> > well.
> >
> > Signed-off-by: Alexei Starovoitov 
> > ---
> >  arch/x86/xen/grant-table.c | 2 +-
> >  drivers/xen/xenbus/xenbus_client.c | 2 +-
> >  include/linux/vmalloc.h| 1 +
> >  mm/vmalloc.c   | 7 +--
> >  4 files changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
> > index 1e681bf62561..b816db0349c4 100644
> > --- a/arch/x86/xen/grant-table.c
> > +++ b/arch/x86/xen/grant-table.c
> > @@ -104,7 +104,7 @@ static int arch_gnttab_valloc(struct gnttab_vm_area 
> > *area, unsigned nr_frames)
> >   area->ptes = kmalloc_array(nr_frames, sizeof(*area->ptes), 
> > GFP_KERNEL);
> >   if (area->ptes == NULL)
> >   return -ENOMEM;
> > - area->area = get_vm_area(PAGE_SIZE * nr_frames, VM_IOREMAP);
> > + area->area = get_vm_area(PAGE_SIZE * nr_frames, VM_XEN);
> >   if (!area->area)
> >   goto out_free_ptes;
> >   if (apply_to_page_range(&init_mm, (unsigned long)area->area->addr,
> > diff --git a/drivers/xen/xenbus/xenbus_client.c 
> > b/drivers/xen/xenbus/xenbus_client.c
> > index 32835b4b9bc5..b9c81a2d578b 100644
> > --- a/drivers/xen/xenbus/xenbus_client.c
> > +++ b/drivers/xen/xenbus/xenbus_client.c
> > @@ -758,7 +758,7 @@ static int xenbus_map_ring_pv(struct xenbus_device *dev,
> >   bool leaked = false;
> >   int err = -ENOMEM;
> >
> > - area = get_vm_area(XEN_PAGE_SIZE * nr_grefs, VM_IOREMAP);
> > + area = get_vm_area(XEN_PAGE_SIZE * nr_grefs, VM_XEN);
> >   if (!area)
> >   return -ENOMEM;
> >   if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index c720be70c8dd..223e51c243bc 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -28,6 +28,7 @@ struct iov_iter;/* in uio.h */
> >  #define VM_FLUSH_RESET_PERMS 0x0100  /* reset direct map and flush 
> > TLB on unmap, can't be freed in atomic context */
> >  #define VM_MAP_PUT_PAGES 0x0200  /* put pages and free array 
> > in vfree */
> >  #define VM_ALLOW_HUGE_VMAP   0x0400  /* Allow for huge pages on 
> > archs with HAVE_ARCH_HUGE_VMALLOC */
> > +#define VM_XEN   0x0800  /* xen use cases */
> >
> >  #if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
> >   !defined(CONFIG_KASAN_VMALLOC)
>
> There's also VM_DEFER_KMEMLEAK a line below:

Ohh. Good catch. Will fix.

> I think it makes sense to use an enumeration for vm_flags, just like as
> Suren did for GFP
> (https://lore.kernel.org/linux-mm/20240224015800.2569851-1-sur...@google.com/)

Hmm. I'm pretty sure Christoph hates BIT macro obfuscation.
I'm not a fan of it either, though we use it in bpf in a few places.
If mm folks prefer that style they can do such conversion later.
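For reference only, the two spellings being discussed side by side; the *_ALT names are made up and nothing like this is being proposed in this series:

#include <linux/bits.h>

/* current style in include/linux/vmalloc.h (values from this series) */
#define VM_XEN		0x0800	/* xen grant table and xenbus use cases */
#define VM_SPARSE	0x1000	/* sparse vm_area, not all pages are present */

/* the same bits spelled with BIT(), roughly in the spirit of the GFP rework */
#define VM_XEN_ALT	BIT(11)	/* == 0x0800 */
#define VM_SPARSE_ALT	BIT(12)	/* == 0x1000 */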



[PATCH v3 bpf-next 3/3] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

2024-02-29 Thread Alexei Starovoitov
From: Alexei Starovoitov 

vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
virtual space.

get_vm_area() with appropriate flag is used to request an area of kernel
address range. It's used for vmalloc, vmap, ioremap, xen use cases.
- vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
- the areas created by vmap() function should be tagged with VM_MAP.
- ioremap areas are tagged with VM_IOREMAP.
- xen use cases are VM_XEN.

BPF would like to extend the vmap API to implement a lazily-populated
sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag
and vm_area_map_pages(area, start, end, pages) API to map a set
of pages within a given area.
It has the same sanity checks as vmap() does.
It also checks that get_vm_area() was created with VM_SPARSE flag
which identifies such areas in /proc/vmallocinfo
and returns zero pages on read through /proc/kcore.

The next commits will introduce bpf_arena which is a sparsely populated
shared memory region between bpf program and user space process. It will
map privately-managed pages into a sparse vm area with the following steps:

  // request virtual memory region during bpf prog verification
  area = get_vm_area(area_size, VM_SPARSE);

  // on demand
  vm_area_map_pages(area, kaddr, kend, pages);
  vm_area_unmap_pages(area, kaddr, kend);

  // after bpf program is detached and unloaded
  free_vm_area(area);

Signed-off-by: Alexei Starovoitov 
---
 include/linux/vmalloc.h |  5 
 mm/vmalloc.c| 59 +++--
 2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 71075ece0ed2..dfbcfb9f9a08 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -29,6 +29,7 @@ struct iov_iter;  /* in uio.h */
 #define VM_MAP_PUT_PAGES   0x0200  /* put pages and free array in vfree */
 #define VM_ALLOW_HUGE_VMAP 0x0400  /* Allow for huge pages on archs with HAVE_ARCH_HUGE_VMALLOC */
 #define VM_XEN 0x0800  /* xen grant table and xenbus use cases */
+#define VM_SPARSE  0x1000  /* sparse vm_area. not all pages are present. */
 
 #if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
!defined(CONFIG_KASAN_VMALLOC)
@@ -233,6 +234,10 @@ static inline bool is_vm_area_hugepages(const void *addr)
 }
 
 #ifdef CONFIG_MMU
+int vm_area_map_pages(struct vm_struct *area, unsigned long start,
+ unsigned long end, struct page **pages);
+void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
+unsigned long end);
 void vunmap_range(unsigned long addr, unsigned long end);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d53ece3f38ee..dae98b1f78a8 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
return err;
 }
 
+static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
+   unsigned long end)
+{
+   might_sleep();
+   if (WARN_ON_ONCE(area->flags & VM_FLUSH_RESET_PERMS))
+   return -EINVAL;
+   if (WARN_ON_ONCE(area->flags & VM_NO_GUARD))
+   return -EINVAL;
+   if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
+   return -EINVAL;
+   if ((end - start) >> PAGE_SHIFT > totalram_pages())
+   return -E2BIG;
+   if (start < (unsigned long)area->addr ||
+   (void *)end > area->addr + get_vm_area_size(area))
+   return -ERANGE;
+   return 0;
+}
+
+/**
+ * vm_area_map_pages - map pages inside given sparse vm_area
+ * @area: vm_area
+ * @start: start address inside vm_area
+ * @end: end address inside vm_area
+ * @pages: pages to map (always PAGE_SIZE pages)
+ */
+int vm_area_map_pages(struct vm_struct *area, unsigned long start,
+ unsigned long end, struct page **pages)
+{
+   int err;
+
+   err = check_sparse_vm_area(area, start, end);
+   if (err)
+   return err;
+
+   return vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
+}
+
+/**
+ * vm_area_unmap_pages - unmap pages inside given sparse vm_area
+ * @area: vm_area
+ * @start: start address inside vm_area
+ * @end: end address inside vm_area
+ */
+void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
+unsigned long end)
+{
+   if (check_sparse_vm_area(area, start, end))
+   return;
+
+   vunmap_range(start, end);
+}
+
 int is_vmalloc_or_module_addr(const void *x)
 {
/*
@@ -3822,9 +3874,9 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
if (flags & VMAP_RAM)
   

[PATCH v3 bpf-next 1/3] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.

2024-02-29 Thread Alexei Starovoitov
From: Alexei Starovoitov 

There are various users of get_vm_area() + ioremap_page_range() APIs.
Enforce that get_vm_area() was requested as VM_IOREMAP type and range
passed to ioremap_page_range() matches created vm_area to avoid
accidentally ioremap-ing into wrong address range.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Alexei Starovoitov 
---
 mm/vmalloc.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..f42f98a127d5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -307,8 +307,21 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
 int ioremap_page_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot)
 {
+   struct vm_struct *area;
int err;
 
+   area = find_vm_area((void *)addr);
+   if (!area || !(area->flags & VM_IOREMAP)) {
+   WARN_ONCE(1, "vm_area at addr %lx is not marked as VM_IOREMAP\n", addr);
+   return -EINVAL;
+   }
+   if (addr != (unsigned long)area->addr ||
+   (void *)end != area->addr + get_vm_area_size(area)) {
+   WARN_ONCE(1, "ioremap request [%lx,%lx) doesn't match vm_area [%lx, %lx)\n",
+ addr, end, (long)area->addr,
+ (long)area->addr + get_vm_area_size(area));
+   return -ERANGE;
+   }
err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
 ioremap_max_page_shift);
flush_cache_vmap(addr, end);
-- 
2.34.1




[PATCH v3 bpf-next 2/3] mm, xen: Separate xen use cases from ioremap.

2024-02-29 Thread Alexei Starovoitov
From: Alexei Starovoitov 

The xen grant table and xenbus ring are not ioremap in the way arch-specific
code uses it, so add a VM_XEN flag to separate these use cases from
VM_IOREMAP users. xen will not and should not call ioremap_page_range()
on that range. /proc/vmallocinfo will print such regions as "xen"
instead of "ioremap".

Signed-off-by: Alexei Starovoitov 
---
 arch/x86/xen/grant-table.c | 2 +-
 drivers/xen/xenbus/xenbus_client.c | 2 +-
 include/linux/vmalloc.h| 1 +
 mm/vmalloc.c   | 7 +--
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
index 1e681bf62561..b816db0349c4 100644
--- a/arch/x86/xen/grant-table.c
+++ b/arch/x86/xen/grant-table.c
@@ -104,7 +104,7 @@ static int arch_gnttab_valloc(struct gnttab_vm_area *area, unsigned nr_frames)
area->ptes = kmalloc_array(nr_frames, sizeof(*area->ptes), GFP_KERNEL);
if (area->ptes == NULL)
return -ENOMEM;
-   area->area = get_vm_area(PAGE_SIZE * nr_frames, VM_IOREMAP);
+   area->area = get_vm_area(PAGE_SIZE * nr_frames, VM_XEN);
if (!area->area)
goto out_free_ptes;
if (apply_to_page_range(&init_mm, (unsigned long)area->area->addr,
diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index 32835b4b9bc5..b9c81a2d578b 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -758,7 +758,7 @@ static int xenbus_map_ring_pv(struct xenbus_device *dev,
bool leaked = false;
int err = -ENOMEM;
 
-   area = get_vm_area(XEN_PAGE_SIZE * nr_grefs, VM_IOREMAP);
+   area = get_vm_area(XEN_PAGE_SIZE * nr_grefs, VM_XEN);
if (!area)
return -ENOMEM;
if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..71075ece0ed2 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -28,6 +28,7 @@ struct iov_iter;  /* in uio.h */
 #define VM_FLUSH_RESET_PERMS   0x0100  /* reset direct map and flush TLB on unmap, can't be freed in atomic context */
 #define VM_MAP_PUT_PAGES   0x0200  /* put pages and free array in vfree */
 #define VM_ALLOW_HUGE_VMAP 0x0400  /* Allow for huge pages on archs with HAVE_ARCH_HUGE_VMALLOC */
+#define VM_XEN 0x0800  /* xen grant table and xenbus use cases */
 
 #if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
!defined(CONFIG_KASAN_VMALLOC)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f42f98a127d5..d53ece3f38ee 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3822,9 +3822,9 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
if (flags & VMAP_RAM)
copied = vmap_ram_vread_iter(iter, addr, n, flags);
-   else if (!(vm && (vm->flags & VM_IOREMAP)))
+   else if (!(vm && (vm->flags & (VM_IOREMAP | VM_XEN))))
copied = aligned_vread_iter(iter, addr, n);
-   else /* IOREMAP area is treated as memory hole */
+   else /* IOREMAP | XEN area is treated as memory hole */
copied = zero_iter(iter, n);
 
addr += copied;
@@ -4415,6 +4415,9 @@ static int s_show(struct seq_file *m, void *p)
if (v->flags & VM_IOREMAP)
seq_puts(m, " ioremap");
 
+   if (v->flags & VM_XEN)
+   seq_puts(m, " xen");
+
if (v->flags & VM_ALLOC)
seq_puts(m, " vmalloc");
 
-- 
2.34.1




[PATCH v3 bpf-next 0/3] mm: Cleanup and identify various users of kernel virtual address space

2024-02-29 Thread Alexei Starovoitov
From: Alexei Starovoitov 

v2 -> v3
- added Christoph's reviewed-by to patch 1
- cap commit log lines to 75 chars
- factored out common checks in patch 3 into helper
- made vm_area_unmap_pages() return void

There are various users of kernel virtual address space:
vmalloc, vmap, ioremap, xen.

- vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag
and these areas are treated differently by KASAN.

- the areas created by vmap() function should be tagged with VM_MAP
(as majority of the users do).

- ioremap areas are tagged with VM_IOREMAP and vm area start is aligned
to size of the area unlike vmalloc/vmap.

- there is also xen usage that is marked as VM_IOREMAP, but it doesn't
call ioremap_page_range() unlike all other VM_IOREMAP users.

To clean this up:
1. Enforce that ioremap_page_range() checks the range and VM_IOREMAP flag
2. Introduce VM_XEN flag to separate xen use cases from ioremap

In addition BPF would like to reserve regions of kernel virtual address
space and populate it lazily, similar to xen use cases.
For that reason, introduce VM_SPARSE flag and vm_area_[un]map_pages()
helpers to populate this sparse area.

In the end the /proc/vmallocinfo will show
"vmalloc"
"vmap"
"ioremap"
"xen"
"sparse"
categories for different kinds of address regions.

ioremap, xen and sparse areas will read as zero when dumped through /proc/kcore.

Alexei Starovoitov (3):
  mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
  mm, xen: Separate xen use cases from ioremap.
  mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

 arch/x86/xen/grant-table.c |  2 +-
 drivers/xen/xenbus/xenbus_client.c |  2 +-
 include/linux/vmalloc.h|  6 +++
 mm/vmalloc.c   | 75 +-
 4 files changed, 81 insertions(+), 4 deletions(-)

-- 
2.34.1




Re: [PATCH v2 bpf-next 3/3] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

2024-02-27 Thread Alexei Starovoitov
On Tue, Feb 27, 2024 at 9:59 AM Christoph Hellwig  wrote:
>
> > privately-managed pages into a sparse vm area with the following steps:
> >
> >   area = get_vm_area(area_size, VM_SPARSE);  // at bpf prog verification 
> > time
> >   vm_area_map_pages(area, kaddr, 1, page);   // on demand
> > // it will return an error if kaddr is out of range
> >   vm_area_unmap_pages(area, kaddr, 1);
> >   free_vm_area(area);// after bpf prog is unloaded
>
> I'm still wondering if this should just use an opaque cookie instead
> of exposing the vm_area.  But otherwise this mostly looks fine to me.

What would it look like with a cookie?
A static inline wrapper around get_vm_area() that returns area->addr ?
And the start address of vmap range will be such a cookie?

Then vm_area_map_pages() will be doing find_vm_area() for kaddr
to check that vm_area->flag & VM_SPARSE ?
That's fine,
but what would be an equivalent of void free_vm_area(struct vm_struct *area) ?
Another static inline wrapper similar to remove_vm_area()
that also does kfree(area); ?

Fine by me, but the API isn't user-friendly with such obfuscation.

I guess I don't understand the motivation to hide 'struct vm_struct *'.
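For illustration, a rough sketch of what such cookie-style wrappers might look like; every name here is hypothetical and nothing like this is part of the posted patches:

#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Hypothetical cookie-based wrappers around the sparse vm_area API. */
static inline unsigned long vm_sparse_area_create(size_t size)
{
	struct vm_struct *area = get_vm_area(size, VM_SPARSE);

	/* the start address of the vmap range is the "cookie" */
	return area ? (unsigned long)area->addr : 0;
}

static inline int vm_sparse_map_pages(unsigned long addr, unsigned int count,
				      struct page **pages)
{
	struct vm_struct *area = find_vm_area((void *)addr);

	if (!area || !(area->flags & VM_SPARSE))
		return -EINVAL;
	return vm_area_map_pages(area, addr, count, pages);
}

static inline void vm_sparse_area_free(unsigned long cookie)
{
	/* the remove_vm_area()-plus-kfree() wrapper mentioned above */
	struct vm_struct *area = remove_vm_area((void *)cookie);

	kfree(area);
}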

> > + if (addr < (unsigned long)area->addr || (void *)end > area->addr + 
> > area->size)
> > + return -ERANGE;
>
> This check is duplicated so many times that it really begs for a helper.

ok. will do.

> > +int vm_area_unmap_pages(struct vm_struct *area, unsigned long addr, 
> > unsigned int count)
> > +{
> > + unsigned long size = ((unsigned long)count) * PAGE_SIZE;
> > + unsigned long end = addr + size;
> > +
> > + if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
> > + return -EINVAL;
> > + if (addr < (unsigned long)area->addr || (void *)end > area->addr + 
> > area->size)
> > + return -ERANGE;
> > +
> > + vunmap_range(addr, end);
> > + return 0;
>
> Does it make much sense to have an error return here vs just debug
> checks?  It's not like the caller can do much if it violates these
> basic invariants.

Ok. Will switch to void return.

Will reduce commit line logs to 75 chars in all patches as suggested.

re: VM_GRANT_TABLE or VM_XEN_GRANT_TABLE suggestion for patch 2.

I'm not sure it fits, since only one of the get_vm_area() calls in xen code
is grant-table related. The other one is for xenbus, which
creates a shared memory ring between domains.
So I'm planning to keep it as VM_XEN in the next revision unless
folks come up with a better name.

Thanks for the reviews.



[PATCH v2 bpf-next 3/3] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

2024-02-23 Thread Alexei Starovoitov
From: Alexei Starovoitov 

vmap/vmalloc APIs are used to map a set of pages into contiguous kernel virtual 
space.

get_vm_area() with appropriate flag is used to request an area of kernel 
address range.
It's used for vmalloc, vmap, ioremap, xen use cases.
- vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
- the areas created by vmap() function should be tagged with VM_MAP.
- ioremap areas are tagged with VM_IOREMAP.
- xen use cases are VM_XEN.

BPF would like to extend the vmap API to implement a lazily-populated
sparse, yet contiguous kernel virtual space.
Introduce VM_SPARSE vm_area flag and
vm_area_map_pages(area, start_addr, count, pages) API to map a set
of pages within a given area.
It has the same sanity checks as vmap() does.
It also checks that get_vm_area() was created with VM_SPARSE flag
which identifies such areas in /proc/vmallocinfo
and returns zero pages on read through /proc/kcore.

The next commits will introduce bpf_arena which is a sparsely populated shared
memory region between bpf program and user space process. It will map
privately-managed pages into a sparse vm area with the following steps:

  area = get_vm_area(area_size, VM_SPARSE);  // at bpf prog verification time
  vm_area_map_pages(area, kaddr, 1, page);   // on demand
// it will return an error if kaddr is out of range
  vm_area_unmap_pages(area, kaddr, 1);
  free_vm_area(area);// after bpf prog is unloaded

Signed-off-by: Alexei Starovoitov 
---
 include/linux/vmalloc.h |  4 +++
 mm/vmalloc.c| 55 +++--
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 223e51c243bc..416bc7b0b4db 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -29,6 +29,7 @@ struct iov_iter;  /* in uio.h */
 #define VM_MAP_PUT_PAGES   0x0200  /* put pages and free array in vfree */
 #define VM_ALLOW_HUGE_VMAP 0x0400  /* Allow for huge pages on archs with HAVE_ARCH_HUGE_VMALLOC */
 #define VM_XEN 0x0800  /* xen use cases */
+#define VM_SPARSE  0x1000  /* sparse vm_area. not all pages are present. */
 
 #if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
!defined(CONFIG_KASAN_VMALLOC)
@@ -233,6 +234,9 @@ static inline bool is_vm_area_hugepages(const void *addr)
 }
 
 #ifdef CONFIG_MMU
+int vm_area_map_pages(struct vm_struct *area, unsigned long addr, unsigned int count,
+ struct page **pages);
+int vm_area_unmap_pages(struct vm_struct *area, unsigned long addr, unsigned int count);
 void vunmap_range(unsigned long addr, unsigned long end);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d769a65bddad..a05dfbbacb78 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -648,6 +648,54 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
return err;
 }
 
+/**
+ * vm_area_map_pages - map pages inside given vm_area
+ * @area: vm_area
+ * @addr: start address inside vm_area
+ * @count: number of pages
+ * @pages: pages to map (always PAGE_SIZE pages)
+ */
+int vm_area_map_pages(struct vm_struct *area, unsigned long addr, unsigned int count,
+ struct page **pages)
+{
+   unsigned long size = ((unsigned long)count) * PAGE_SIZE;
+   unsigned long end = addr + size;
+
+   might_sleep();
+   if (WARN_ON_ONCE(area->flags & VM_FLUSH_RESET_PERMS))
+   return -EINVAL;
+   if (WARN_ON_ONCE(area->flags & VM_NO_GUARD))
+   return -EINVAL;
+   if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
+   return -EINVAL;
+   if (count > totalram_pages())
+   return -E2BIG;
+   if (addr < (unsigned long)area->addr || (void *)end > area->addr + area->size)
+   return -ERANGE;
+
+   return vmap_pages_range(addr, end, PAGE_KERNEL, pages, PAGE_SHIFT);
+}
+
+/**
+ * vm_area_unmap_pages - unmap pages inside given vm_area
+ * @area: vm_area
+ * @addr: start address inside vm_area
+ * @count: number of pages to unmap
+ */
+int vm_area_unmap_pages(struct vm_struct *area, unsigned long addr, unsigned int count)
+{
+   unsigned long size = ((unsigned long)count) * PAGE_SIZE;
+   unsigned long end = addr + size;
+
+   if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
+   return -EINVAL;
+   if (addr < (unsigned long)area->addr || (void *)end > area->addr + area->size)
+   return -ERANGE;
+
+   vunmap_range(addr, end);
+   return 0;
+}
+
 int is_vmalloc_or_module_addr(const void *x)
 {
/*
@@ -3822,9 +3870,9 @@ long vread_iter(struct iov_iter *iter, const char *addr, 
size_t count)
 
if (flags & VMAP_RAM)
   

[PATCH v2 bpf-next 1/3] mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.

2024-02-23 Thread Alexei Starovoitov
From: Alexei Starovoitov 

There are various users of get_vm_area() + ioremap_page_range() APIs.
Enforce that get_vm_area() was requested as VM_IOREMAP type and range passed to
ioremap_page_range() matches created vm_area to avoid accidentally ioremap-ing
into wrong address range.

Signed-off-by: Alexei Starovoitov 
---
 mm/vmalloc.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..f42f98a127d5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -307,8 +307,21 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
 int ioremap_page_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot)
 {
+   struct vm_struct *area;
int err;
 
+   area = find_vm_area((void *)addr);
+   if (!area || !(area->flags & VM_IOREMAP)) {
+   WARN_ONCE(1, "vm_area at addr %lx is not marked as VM_IOREMAP\n", addr);
+   return -EINVAL;
+   }
+   if (addr != (unsigned long)area->addr ||
+   (void *)end != area->addr + get_vm_area_size(area)) {
+   WARN_ONCE(1, "ioremap request [%lx,%lx) doesn't match vm_area [%lx, %lx)\n",
+ addr, end, (long)area->addr,
+ (long)area->addr + get_vm_area_size(area));
+   return -ERANGE;
+   }
err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
 ioremap_max_page_shift);
flush_cache_vmap(addr, end);
-- 
2.34.1




[PATCH v2 bpf-next 2/3] mm, xen: Separate xen use cases from ioremap.

2024-02-23 Thread Alexei Starovoitov
From: Alexei Starovoitov 

The xen grant table and xenbus ring are not ioremap in the way arch-specific
code uses it, so add a VM_XEN flag to separate them from VM_IOREMAP users.
xen will not and should not call ioremap_page_range() on that range.
/proc/vmallocinfo will print such regions as "xen" instead of "ioremap" as well.

Signed-off-by: Alexei Starovoitov 
---
 arch/x86/xen/grant-table.c | 2 +-
 drivers/xen/xenbus/xenbus_client.c | 2 +-
 include/linux/vmalloc.h| 1 +
 mm/vmalloc.c   | 7 +--
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
index 1e681bf62561..b816db0349c4 100644
--- a/arch/x86/xen/grant-table.c
+++ b/arch/x86/xen/grant-table.c
@@ -104,7 +104,7 @@ static int arch_gnttab_valloc(struct gnttab_vm_area *area, unsigned nr_frames)
area->ptes = kmalloc_array(nr_frames, sizeof(*area->ptes), GFP_KERNEL);
if (area->ptes == NULL)
return -ENOMEM;
-   area->area = get_vm_area(PAGE_SIZE * nr_frames, VM_IOREMAP);
+   area->area = get_vm_area(PAGE_SIZE * nr_frames, VM_XEN);
if (!area->area)
goto out_free_ptes;
if (apply_to_page_range(&init_mm, (unsigned long)area->area->addr,
diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index 32835b4b9bc5..b9c81a2d578b 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -758,7 +758,7 @@ static int xenbus_map_ring_pv(struct xenbus_device *dev,
bool leaked = false;
int err = -ENOMEM;
 
-   area = get_vm_area(XEN_PAGE_SIZE * nr_grefs, VM_IOREMAP);
+   area = get_vm_area(XEN_PAGE_SIZE * nr_grefs, VM_XEN);
if (!area)
return -ENOMEM;
if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..223e51c243bc 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -28,6 +28,7 @@ struct iov_iter;  /* in uio.h */
 #define VM_FLUSH_RESET_PERMS   0x0100  /* reset direct map and flush TLB on unmap, can't be freed in atomic context */
 #define VM_MAP_PUT_PAGES   0x0200  /* put pages and free array in vfree */
 #define VM_ALLOW_HUGE_VMAP 0x0400  /* Allow for huge pages on archs with HAVE_ARCH_HUGE_VMALLOC */
+#define VM_XEN 0x0800  /* xen use cases */
 
 #if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
!defined(CONFIG_KASAN_VMALLOC)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f42f98a127d5..d769a65bddad 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3822,9 +3822,9 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
if (flags & VMAP_RAM)
copied = vmap_ram_vread_iter(iter, addr, n, flags);
-   else if (!(vm && (vm->flags & VM_IOREMAP)))
+   else if (!(vm && (vm->flags & (VM_IOREMAP | VM_XEN))))
copied = aligned_vread_iter(iter, addr, n);
-   else /* IOREMAP area is treated as memory hole */
+   else /* IOREMAP|XEN area is treated as memory hole */
copied = zero_iter(iter, n);
 
addr += copied;
@@ -4415,6 +4415,9 @@ static int s_show(struct seq_file *m, void *p)
if (v->flags & VM_IOREMAP)
seq_puts(m, " ioremap");
 
+   if (v->flags & VM_XEN)
+   seq_puts(m, " xen");
+
if (v->flags & VM_ALLOC)
seq_puts(m, " vmalloc");
 
-- 
2.34.1




[PATCH v2 bpf-next 0/3] mm: Cleanup and identify various users of kernel virtual address space

2024-02-23 Thread Alexei Starovoitov
From: Alexei Starovoitov 

There are various users of kernel virtual address space: vmalloc, vmap, 
ioremap, xen.

- vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag
and these areas are treated differently by KASAN.

- the areas created by vmap() function should be tagged with VM_MAP
(as majority of the users do).

- ioremap areas are tagged with VM_IOREMAP and vm area start is aligned to size
of the area unlike vmalloc/vmap.

- there is also xen usage that is marked as VM_IOREMAP, but it doesn't
call ioremap_page_range() unlike all other VM_IOREMAP users.

To clean this up:
1. Enforce that ioremap_page_range() checks the range and VM_IOREMAP flag.
2. Introduce VM_XEN flag to separate xen use cases from ioremap.

In addition BPF would like to reserve regions of kernel virtual address
space and populate it lazily, similar to xen use cases.
For that reason, introduce VM_SPARSE flag and vm_area_[un]map_pages() helpers
to populate this sparse area.

In the end the /proc/vmallocinfo will show
"vmalloc"
"vmap"
"ioremap"
"xen"
"sparse"
categories for different kinds of address regions.

ioremap, xen and sparse areas will read as zero when dumped through /proc/kcore.

Alexei Starovoitov (3):
  mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
  mm, xen: Separate xen use cases from ioremap.
  mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

 arch/x86/xen/grant-table.c |  2 +-
 drivers/xen/xenbus/xenbus_client.c |  2 +-
 include/linux/vmalloc.h|  5 +++
 mm/vmalloc.c   | 71 +-
 4 files changed, 76 insertions(+), 4 deletions(-)

-- 
2.34.1