Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-08 Thread Andy Lutomirski


>> On May 7, 2018, at 4:31 AM, Kirill A. Shutemov  wrote:
>> 
>> On Mon, May 07, 2018 at 04:51:57AM +, Andy Lutomirski wrote:
>> On Tue, Apr 24, 2018 at 8:44 AM Kirill A. Shutemov <
>> kirill.shute...@linux.intel.com> wrote:
>> 
>>> Hi everybody,
>> 
>>> I've proposed to talk about page able manipulation API on the LSF/MM'2018,
>>> so I need something material to talk about.
>> 
>> 
>> I gave it a quick read.  I like the concept a lot, and I have a few
>> comments.
> 
> Thank you for the input.
> 
>>> +/*
>>> + * How manu bottom level we account to mm->pgtables_bytes
>>> + */
>>> +#define PT_ACCOUNT_LVLS 3
>>> +
>>> +struct pt_ptr {
>>> +   unsigned long *ptr;
>>> +   int lvl;
>>> +};
>>> +
>> 
>> I think you've inherited something that I consider to be a defect in the
>> old code: you're conflating page *tables* with page table *entries*.  Your
>> 'struct pt_ptr' sounds like a pointer to an entire page table, but AFAICT
>> you're using it to point to a specific entry within a table.  I think that
>> both the new core code and the code that uses it would be clearer and less
>> error prone if you made the distinction explicit.  I can think of two clean
>> ways to do it:
>> 
>> 1. Add a struct pt_entry_ptr, and make it so that get_ptv(), etc take a
>> pt_entry_ptr instead of a pt_ptr.  Add a helper to find a pt_entry_ptr
>> given a pt_ptr and either an index or an address.
>> 
>> 2. Don't allow pointers to page table entries at all.  Instead, get_ptv()
>> would take an address or an index parameter.
> 
> Well, I'm not sure how useful pointer to whole page tables are.
> Where do you them useful?

Hmm, that’s a fair question. I guess that, in your patch, you pass around a 
ptv_t when you want to refer to a whole page table. That seems to work okay. 
Maybe still rename ptp_t to ptep_t or similar to emphasize that it points to an 
entry, not a table.

That being said, once you implement map/unmap for real, it might be more 
natural for map to return a pointer to a table rather than a pointer to an 
entry.

> 
>  In x86-64 case I pretend that CR3 is single-entry page table. It
>  requires a special threatement in ptp_page_vaddr(), but works fine
>  otherwise.
> 

Hmm. If you stick with that, it definitely needs better comments. Why do you 
need this, though?  What’s the benefit over simply starting with a pointer to 
the root table or a pointer to the first entry in the root table?  We certainly 
don’t want anyone to do ptp_set() on the fake CR3 entry.

> 
>> 
>>> +/*
>>> + * When walking page tables, get the address of the next boundary,
>>> + * or the end address of the range if that comes earlier.  Although no
>>> + * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
>>> + */
>> 
>> I read this comment twice, and I still don't get it.  Can you clarify what
>> this function does and why you would use it?
> 
> That's basically ported variant of p?d_addr_end. It helps step address by
> right value for the page table entry and handles wrapping properly.
> 
> See example in copy_pt_range().

Ok

> 
>>> +/* Operations on page table pointers */
>>> +
>>> +/* Initialize ptp_t with pointer to top page table level. */
>>> +static inline ptp_t ptp_init(struct mm_struct *mm)
>>> +{
>>> +   struct pt_ptr ptp ={
>>> +   .ptr = (unsigned long *)mm->pgd,
>>> +   .lvl = PT_TOP_LEVEL,
>>> +   };
>>> +
>>> +   return ptp;
>>> +}
>>> +
>> 
>> On some architectures, there are multiple page table roots.  For example,
>> ARM64 has a root for the kernel half of the address space and a root for
>> the user half (at least -- I don't fully understand it).  x86 PAE sort-of
>> has four roots.  Would it make sense to expose this in the API for
>> real?
> 
> I will give it a thought.
> 
> Is there a reason not to threat it as an additional page table layer and
> deal with it in a unified way?

I was thinking that it would be more confusing to treat it as one table. After 
all, it doesn’t exist in memory. Also, if anyone ever makes the top half be 
per-cpu and the bottom half be per-mm (which would be awesome if x86 had 
hardware support, hint hint), then pretending that it’s one table gets even 
weirder.  The code that emulates it as a table would have to know what mm *and* 
what CPU it is faking.

> 
> 
>>> +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
>>> +{
>>> +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
>>> +   ptp->ptr += __pt_index(addr, --ptp->lvl);
>>> +}
>> 
>> Can you add a comment that says what this function does?
> 
> Okay, I will.
> 
>> Why does it not change the level?
> 
> It does. --ptp->lvl.

Hmm.

Maybe change this to ptp_t ptp_walk(ptp_t ptp, unsigned long addr)?  It’s less 
error prone and should generate identical code.

> 
>>> +
>>> +static void ptp_free(struct mm_struct *mm, ptv_t ptv)
>>> +{
>>> +   if (ptv.lvl < PT_SPLIT_LOCK_LVLS)
>>> +   ptlock_free(pfn_to_page(ptv_pfn(ptv)));

Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-08 Thread Andy Lutomirski


>> On May 7, 2018, at 4:31 AM, Kirill A. Shutemov  wrote:
>> 
>> On Mon, May 07, 2018 at 04:51:57AM +, Andy Lutomirski wrote:
>> On Tue, Apr 24, 2018 at 8:44 AM Kirill A. Shutemov <
>> kirill.shute...@linux.intel.com> wrote:
>> 
>>> Hi everybody,
>> 
>>> I've proposed to talk about page able manipulation API on the LSF/MM'2018,
>>> so I need something material to talk about.
>> 
>> 
>> I gave it a quick read.  I like the concept a lot, and I have a few
>> comments.
> 
> Thank you for the input.
> 
>>> +/*
>>> + * How manu bottom level we account to mm->pgtables_bytes
>>> + */
>>> +#define PT_ACCOUNT_LVLS 3
>>> +
>>> +struct pt_ptr {
>>> +   unsigned long *ptr;
>>> +   int lvl;
>>> +};
>>> +
>> 
>> I think you've inherited something that I consider to be a defect in the
>> old code: you're conflating page *tables* with page table *entries*.  Your
>> 'struct pt_ptr' sounds like a pointer to an entire page table, but AFAICT
>> you're using it to point to a specific entry within a table.  I think that
>> both the new core code and the code that uses it would be clearer and less
>> error prone if you made the distinction explicit.  I can think of two clean
>> ways to do it:
>> 
>> 1. Add a struct pt_entry_ptr, and make it so that get_ptv(), etc take a
>> pt_entry_ptr instead of a pt_ptr.  Add a helper to find a pt_entry_ptr
>> given a pt_ptr and either an index or an address.
>> 
>> 2. Don't allow pointers to page table entries at all.  Instead, get_ptv()
>> would take an address or an index parameter.
> 
> Well, I'm not sure how useful pointer to whole page tables are.
> Where do you them useful?

Hmm, that’s a fair question. I guess that, in your patch, you pass around a 
ptv_t when you want to refer to a whole page table. That seems to work okay. 
Maybe still rename ptp_t to ptep_t or similar to emphasize that it points to an 
entry, not a table.

That being said, once you implement map/unmap for real, it might be more 
natural for map to return a pointer to a table rather than a pointer to an 
entry.

> 
>  In x86-64 case I pretend that CR3 is single-entry page table. It
>  requires a special threatement in ptp_page_vaddr(), but works fine
>  otherwise.
> 

Hmm. If you stick with that, it definitely needs better comments. Why do you 
need this, though?  What’s the benefit over simply starting with a pointer to 
the root table or a pointer to the first entry in the root table?  We certainly 
don’t want anyone to do ptp_set() on the fake CR3 entry.

> 
>> 
>>> +/*
>>> + * When walking page tables, get the address of the next boundary,
>>> + * or the end address of the range if that comes earlier.  Although no
>>> + * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
>>> + */
>> 
>> I read this comment twice, and I still don't get it.  Can you clarify what
>> this function does and why you would use it?
> 
> That's basically ported variant of p?d_addr_end. It helps step address by
> right value for the page table entry and handles wrapping properly.
> 
> See example in copy_pt_range().

Ok

> 
>>> +/* Operations on page table pointers */
>>> +
>>> +/* Initialize ptp_t with pointer to top page table level. */
>>> +static inline ptp_t ptp_init(struct mm_struct *mm)
>>> +{
>>> +   struct pt_ptr ptp ={
>>> +   .ptr = (unsigned long *)mm->pgd,
>>> +   .lvl = PT_TOP_LEVEL,
>>> +   };
>>> +
>>> +   return ptp;
>>> +}
>>> +
>> 
>> On some architectures, there are multiple page table roots.  For example,
>> ARM64 has a root for the kernel half of the address space and a root for
>> the user half (at least -- I don't fully understand it).  x86 PAE sort-of
>> has four roots.  Would it make sense to expose this in the API for
>> real?
> 
> I will give it a thought.
> 
> Is there a reason not to threat it as an additional page table layer and
> deal with it in a unified way?

I was thinking that it would be more confusing to treat it as one table. After 
all, it doesn’t exist in memory. Also, if anyone ever makes the top half be 
per-cpu and the bottom half be per-mm (which would be awesome if x86 had 
hardware support, hint hint), then pretending that it’s one table gets even 
weirder.  The code that emulates it as a table would have to know what mm *and* 
what CPU it is faking.

> 
> 
>>> +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
>>> +{
>>> +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
>>> +   ptp->ptr += __pt_index(addr, --ptp->lvl);
>>> +}
>> 
>> Can you add a comment that says what this function does?
> 
> Okay, I will.
> 
>> Why does it not change the level?
> 
> It does. --ptp->lvl.

Hmm.

Maybe change this to ptp_t ptp_walk(ptp_t ptp, unsigned long addr)?  It’s less 
error prone and should generate identical code.

> 
>>> +
>>> +static void ptp_free(struct mm_struct *mm, ptv_t ptv)
>>> +{
>>> +   if (ptv.lvl < PT_SPLIT_LOCK_LVLS)
>>> +   ptlock_free(pfn_to_page(ptv_pfn(ptv)));
>>> +}
>>> +
>> 
>> 

Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-07 Thread Kirill A. Shutemov
On Mon, May 07, 2018 at 12:23:46PM +, Matthew Wilcox wrote:
> On Mon, May 07, 2018 at 02:31:25PM +0300, Kirill A. Shutemov wrote:
> > > Also, what does lvl == 0 mean?  Is it the top or the bottom?  I think a
> > > comment would be helpful.
> > 
> > It is bottom. But it should be up to architecture to decide.
> 
> That's not true because ...
> 
> > > > +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> > > > +{
> > > > +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> > > > +   ptp->ptr += __pt_index(addr, --ptp->lvl);
> > > > +}
> > > 
> > > Can you add a comment that says what this function does?
> > 
> > Okay, I will.
> > 
> > > Why does it not change the level?
> > 
> > It does. --ptp->lvl.
> 
> ... you've hardcoded that walking down decrements the level by 1.
> 
> I don't see that as a defect; it's just part of the API that needs
> documenting.

You assume that the function is a generic one. This may or may not be
true.

This is subject for refinement anyway.

-- 
 Kirill A. Shutemov


Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-07 Thread Kirill A. Shutemov
On Mon, May 07, 2018 at 12:23:46PM +, Matthew Wilcox wrote:
> On Mon, May 07, 2018 at 02:31:25PM +0300, Kirill A. Shutemov wrote:
> > > Also, what does lvl == 0 mean?  Is it the top or the bottom?  I think a
> > > comment would be helpful.
> > 
> > It is bottom. But it should be up to architecture to decide.
> 
> That's not true because ...
> 
> > > > +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> > > > +{
> > > > +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> > > > +   ptp->ptr += __pt_index(addr, --ptp->lvl);
> > > > +}
> > > 
> > > Can you add a comment that says what this function does?
> > 
> > Okay, I will.
> > 
> > > Why does it not change the level?
> > 
> > It does. --ptp->lvl.
> 
> ... you've hardcoded that walking down decrements the level by 1.
> 
> I don't see that as a defect; it's just part of the API that needs
> documenting.

You assume that the function is a generic one. This may or may not be
true.

This is subject for refinement anyway.

-- 
 Kirill A. Shutemov


Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-07 Thread Matthew Wilcox
On Mon, May 07, 2018 at 02:31:25PM +0300, Kirill A. Shutemov wrote:
> > Also, what does lvl == 0 mean?  Is it the top or the bottom?  I think a
> > comment would be helpful.
> 
> It is bottom. But it should be up to architecture to decide.

That's not true because ...

> > > +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> > > +{
> > > +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> > > +   ptp->ptr += __pt_index(addr, --ptp->lvl);
> > > +}
> > 
> > Can you add a comment that says what this function does?
> 
> Okay, I will.
> 
> > Why does it not change the level?
> 
> It does. --ptp->lvl.

... you've hardcoded that walking down decrements the level by 1.

I don't see that as a defect; it's just part of the API that needs
documenting.



Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-07 Thread Matthew Wilcox
On Mon, May 07, 2018 at 02:31:25PM +0300, Kirill A. Shutemov wrote:
> > Also, what does lvl == 0 mean?  Is it the top or the bottom?  I think a
> > comment would be helpful.
> 
> It is bottom. But it should be up to architecture to decide.

That's not true because ...

> > > +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> > > +{
> > > +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> > > +   ptp->ptr += __pt_index(addr, --ptp->lvl);
> > > +}
> > 
> > Can you add a comment that says what this function does?
> 
> Okay, I will.
> 
> > Why does it not change the level?
> 
> It does. --ptp->lvl.

... you've hardcoded that walking down decrements the level by 1.

I don't see that as a defect; it's just part of the API that needs
documenting.



Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-07 Thread Kirill A. Shutemov
On Mon, May 07, 2018 at 04:51:57AM +, Andy Lutomirski wrote:
> On Tue, Apr 24, 2018 at 8:44 AM Kirill A. Shutemov <
> kirill.shute...@linux.intel.com> wrote:
> 
> > Hi everybody,
> 
> > I've proposed to talk about page able manipulation API on the LSF/MM'2018,
> > so I need something material to talk about.
> 
> 
> I gave it a quick read.  I like the concept a lot, and I have a few
> comments.

Thank you for the input.

> > +/*
> > + * How manu bottom level we account to mm->pgtables_bytes
> > + */
> > +#define PT_ACCOUNT_LVLS 3
> > +
> > +struct pt_ptr {
> > +   unsigned long *ptr;
> > +   int lvl;
> > +};
> > +
> 
> I think you've inherited something that I consider to be a defect in the
> old code: you're conflating page *tables* with page table *entries*.  Your
> 'struct pt_ptr' sounds like a pointer to an entire page table, but AFAICT
> you're using it to point to a specific entry within a table.  I think that
> both the new core code and the code that uses it would be clearer and less
> error prone if you made the distinction explicit.  I can think of two clean
> ways to do it:
> 
> 1. Add a struct pt_entry_ptr, and make it so that get_ptv(), etc take a
> pt_entry_ptr instead of a pt_ptr.  Add a helper to find a pt_entry_ptr
> given a pt_ptr and either an index or an address.
> 
> 2. Don't allow pointers to page table entries at all.  Instead, get_ptv()
> would take an address or an index parameter.

Well, I'm not sure how useful pointer to whole page tables are.
Where do you them useful?

How I see the picture so far:

- ptp_t represent a pointer to an entry in a page table.

  In x86-64 case I pretend that CR3 is single-entry page table. It
  requires a special threatement in ptp_page_vaddr(), but works fine
  otherwise.

- ptv_t represents a value that dereferenced from ptp_t or can be set to
  ptp_t.

It's trivial to find the start of page table if we would need it by
masking out botom bits from ptp->ptr. It works on x86 and should be
possible on any architecture.

> Also, what does lvl == 0 mean?  Is it the top or the bottom?  I think a
> comment would be helpful.

It is bottom. But it should be up to architecture to decide.

> 
> > +/*
> > + * When walking page tables, get the address of the next boundary,
> > + * or the end address of the range if that comes earlier.  Although no
> > + * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
> > + */
> 
> I read this comment twice, and I still don't get it.  Can you clarify what
> this function does and why you would use it?

That's basically ported variant of p?d_addr_end. It helps step address by
right value for the page table entry and handles wrapping properly.

See example in copy_pt_range().

> > +/* Operations on page table pointers */
> > +
> > +/* Initialize ptp_t with pointer to top page table level. */
> > +static inline ptp_t ptp_init(struct mm_struct *mm)
> > +{
> > +   struct pt_ptr ptp ={
> > +   .ptr = (unsigned long *)mm->pgd,
> > +   .lvl = PT_TOP_LEVEL,
> > +   };
> > +
> > +   return ptp;
> > +}
> > +
> 
> On some architectures, there are multiple page table roots.  For example,
> ARM64 has a root for the kernel half of the address space and a root for
> the user half (at least -- I don't fully understand it).  x86 PAE sort-of
> has four roots.  Would it make sense to expose this in the API for
> real?

I will give it a thought.

Is there a reason not to threat it as an additional page table layer and
deal with it in a unified way?

> For example, ptp_init(mm) could be replaced with ptp_init(mm, addr).  This
> would make it a bit cleaner to handle an separate user and kernel tables.
>   (As it stands, what is supposed to happen on ARM if you do
> ptp_init(something that isn't init_mm) and then walk it to look for a
> kernel address?)

IIUC, we can handle it in ptp_walk() since we have all may handle root in
a special way as I do for x86-64.

> Also, ptp_init() seems oddly named for me.  ptp_get_root_for_mm(),
> perhaps?  There could also be ptp_get_kernel_root() to get the root for the
> init_mm's tables.

Yeah, sounds better.

> > +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> > +{
> > +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> > +   ptp->ptr += __pt_index(addr, --ptp->lvl);
> > +}
> 
> Can you add a comment that says what this function does?

Okay, I will.

> Why does it not change the level?

It does. --ptp->lvl.

> > +
> > +static void ptp_free(struct mm_struct *mm, ptv_t ptv)
> > +{
> > +   if (ptv.lvl < PT_SPLIT_LOCK_LVLS)
> > +   ptlock_free(pfn_to_page(ptv_pfn(ptv)));
> > +}
> > +
> 
> As it stands, this is a function that seems easy easy to misuse given the
> confusion between page tables and page table entries.

Hm. I probably have a blind spot, but I don't see it.

The function has to be named better for sure.

> Finally, a general comment.  Actually fully implementing this the way
> you've done it 

Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-07 Thread Kirill A. Shutemov
On Mon, May 07, 2018 at 04:51:57AM +, Andy Lutomirski wrote:
> On Tue, Apr 24, 2018 at 8:44 AM Kirill A. Shutemov <
> kirill.shute...@linux.intel.com> wrote:
> 
> > Hi everybody,
> 
> > I've proposed to talk about page able manipulation API on the LSF/MM'2018,
> > so I need something material to talk about.
> 
> 
> I gave it a quick read.  I like the concept a lot, and I have a few
> comments.

Thank you for the input.

> > +/*
> > + * How manu bottom level we account to mm->pgtables_bytes
> > + */
> > +#define PT_ACCOUNT_LVLS 3
> > +
> > +struct pt_ptr {
> > +   unsigned long *ptr;
> > +   int lvl;
> > +};
> > +
> 
> I think you've inherited something that I consider to be a defect in the
> old code: you're conflating page *tables* with page table *entries*.  Your
> 'struct pt_ptr' sounds like a pointer to an entire page table, but AFAICT
> you're using it to point to a specific entry within a table.  I think that
> both the new core code and the code that uses it would be clearer and less
> error prone if you made the distinction explicit.  I can think of two clean
> ways to do it:
> 
> 1. Add a struct pt_entry_ptr, and make it so that get_ptv(), etc take a
> pt_entry_ptr instead of a pt_ptr.  Add a helper to find a pt_entry_ptr
> given a pt_ptr and either an index or an address.
> 
> 2. Don't allow pointers to page table entries at all.  Instead, get_ptv()
> would take an address or an index parameter.

Well, I'm not sure how useful pointer to whole page tables are.
Where do you them useful?

How I see the picture so far:

- ptp_t represent a pointer to an entry in a page table.

  In x86-64 case I pretend that CR3 is single-entry page table. It
  requires a special threatement in ptp_page_vaddr(), but works fine
  otherwise.

- ptv_t represents a value that dereferenced from ptp_t or can be set to
  ptp_t.

It's trivial to find the start of page table if we would need it by
masking out botom bits from ptp->ptr. It works on x86 and should be
possible on any architecture.

> Also, what does lvl == 0 mean?  Is it the top or the bottom?  I think a
> comment would be helpful.

It is bottom. But it should be up to architecture to decide.

> 
> > +/*
> > + * When walking page tables, get the address of the next boundary,
> > + * or the end address of the range if that comes earlier.  Although no
> > + * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
> > + */
> 
> I read this comment twice, and I still don't get it.  Can you clarify what
> this function does and why you would use it?

That's basically ported variant of p?d_addr_end. It helps step address by
right value for the page table entry and handles wrapping properly.

See example in copy_pt_range().

> > +/* Operations on page table pointers */
> > +
> > +/* Initialize ptp_t with pointer to top page table level. */
> > +static inline ptp_t ptp_init(struct mm_struct *mm)
> > +{
> > +   struct pt_ptr ptp ={
> > +   .ptr = (unsigned long *)mm->pgd,
> > +   .lvl = PT_TOP_LEVEL,
> > +   };
> > +
> > +   return ptp;
> > +}
> > +
> 
> On some architectures, there are multiple page table roots.  For example,
> ARM64 has a root for the kernel half of the address space and a root for
> the user half (at least -- I don't fully understand it).  x86 PAE sort-of
> has four roots.  Would it make sense to expose this in the API for
> real?

I will give it a thought.

Is there a reason not to threat it as an additional page table layer and
deal with it in a unified way?

> For example, ptp_init(mm) could be replaced with ptp_init(mm, addr).  This
> would make it a bit cleaner to handle an separate user and kernel tables.
>   (As it stands, what is supposed to happen on ARM if you do
> ptp_init(something that isn't init_mm) and then walk it to look for a
> kernel address?)

IIUC, we can handle it in ptp_walk() since we have all may handle root in
a special way as I do for x86-64.

> Also, ptp_init() seems oddly named for me.  ptp_get_root_for_mm(),
> perhaps?  There could also be ptp_get_kernel_root() to get the root for the
> init_mm's tables.

Yeah, sounds better.

> > +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> > +{
> > +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> > +   ptp->ptr += __pt_index(addr, --ptp->lvl);
> > +}
> 
> Can you add a comment that says what this function does?

Okay, I will.

> Why does it not change the level?

It does. --ptp->lvl.

> > +
> > +static void ptp_free(struct mm_struct *mm, ptv_t ptv)
> > +{
> > +   if (ptv.lvl < PT_SPLIT_LOCK_LVLS)
> > +   ptlock_free(pfn_to_page(ptv_pfn(ptv)));
> > +}
> > +
> 
> As it stands, this is a function that seems easy easy to misuse given the
> confusion between page tables and page table entries.

Hm. I probably have a blind spot, but I don't see it.

The function has to be named better for sure.

> Finally, a general comment.  Actually fully implementing this the way
> you've done it 

Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-06 Thread Andy Lutomirski
On Tue, Apr 24, 2018 at 8:44 AM Kirill A. Shutemov <
kirill.shute...@linux.intel.com> wrote:

> Hi everybody,

> I've proposed to talk about page able manipulation API on the LSF/MM'2018,
> so I need something material to talk about.


I gave it a quick read.  I like the concept a lot, and I have a few
comments.

> +/*
> + * How manu bottom level we account to mm->pgtables_bytes
> + */
> +#define PT_ACCOUNT_LVLS 3
> +
> +struct pt_ptr {
> +   unsigned long *ptr;
> +   int lvl;
> +};
> +

I think you've inherited something that I consider to be a defect in the
old code: you're conflating page *tables* with page table *entries*.  Your
'struct pt_ptr' sounds like a pointer to an entire page table, but AFAICT
you're using it to point to a specific entry within a table.  I think that
both the new core code and the code that uses it would be clearer and less
error prone if you made the distinction explicit.  I can think of two clean
ways to do it:

1. Add a struct pt_entry_ptr, and make it so that get_ptv(), etc take a
pt_entry_ptr instead of a pt_ptr.  Add a helper to find a pt_entry_ptr
given a pt_ptr and either an index or an address.

2. Don't allow pointers to page table entries at all.  Instead, get_ptv()
would take an address or an index parameter.

Also, what does lvl == 0 mean?  Is it the top or the bottom?  I think a
comment would be helpful.

> +/*
> + * When walking page tables, get the address of the next boundary,
> + * or the end address of the range if that comes earlier.  Although no
> + * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
> + */

I read this comment twice, and I still don't get it.  Can you clarify what
this function does and why you would use it?

> +/* Operations on page table pointers */
> +
> +/* Initialize ptp_t with pointer to top page table level. */
> +static inline ptp_t ptp_init(struct mm_struct *mm)
> +{
> +   struct pt_ptr ptp ={
> +   .ptr = (unsigned long *)mm->pgd,
> +   .lvl = PT_TOP_LEVEL,
> +   };
> +
> +   return ptp;
> +}
> +

On some architectures, there are multiple page table roots.  For example,
ARM64 has a root for the kernel half of the address space and a root for
the user half (at least -- I don't fully understand it).  x86 PAE sort-of
has four roots.  Would it make sense to expose this in the API for real?
For example, ptp_init(mm) could be replaced with ptp_init(mm, addr).  This
would make it a bit cleaner to handle an separate user and kernel tables.
  (As it stands, what is supposed to happen on ARM if you do
ptp_init(something that isn't init_mm) and then walk it to look for a
kernel address?)

Also, ptp_init() seems oddly named for me.  ptp_get_root_for_mm(),
perhaps?  There could also be ptp_get_kernel_root() to get the root for the
init_mm's tables.

> +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> +{
> +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> +   ptp->ptr += __pt_index(addr, --ptp->lvl);
> +}

Can you add a comment that says what this function does?  Why does it not
change the level?

> +
> +static void ptp_free(struct mm_struct *mm, ptv_t ptv)
> +{
> +   if (ptv.lvl < PT_SPLIT_LOCK_LVLS)
> +   ptlock_free(pfn_to_page(ptv_pfn(ptv)));
> +}
> +

As it stands, this is a function that seems easy easy to misuse given the
confusion between page tables and page table entries.


Finally, a general comment.  Actually fully implementing this the way
you've done it seems like a giant mess given that you need to support all
architectures.  But couldn't you implement the new API as a wrapper around
the old API so you automatically get all architectures?


Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-06 Thread Andy Lutomirski
On Tue, Apr 24, 2018 at 8:44 AM Kirill A. Shutemov <
kirill.shute...@linux.intel.com> wrote:

> Hi everybody,

> I've proposed to talk about page able manipulation API on the LSF/MM'2018,
> so I need something material to talk about.


I gave it a quick read.  I like the concept a lot, and I have a few
comments.

> +/*
> + * How manu bottom level we account to mm->pgtables_bytes
> + */
> +#define PT_ACCOUNT_LVLS 3
> +
> +struct pt_ptr {
> +   unsigned long *ptr;
> +   int lvl;
> +};
> +

I think you've inherited something that I consider to be a defect in the
old code: you're conflating page *tables* with page table *entries*.  Your
'struct pt_ptr' sounds like a pointer to an entire page table, but AFAICT
you're using it to point to a specific entry within a table.  I think that
both the new core code and the code that uses it would be clearer and less
error prone if you made the distinction explicit.  I can think of two clean
ways to do it:

1. Add a struct pt_entry_ptr, and make it so that get_ptv(), etc take a
pt_entry_ptr instead of a pt_ptr.  Add a helper to find a pt_entry_ptr
given a pt_ptr and either an index or an address.

2. Don't allow pointers to page table entries at all.  Instead, get_ptv()
would take an address or an index parameter.

Also, what does lvl == 0 mean?  Is it the top or the bottom?  I think a
comment would be helpful.

> +/*
> + * When walking page tables, get the address of the next boundary,
> + * or the end address of the range if that comes earlier.  Although no
> + * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
> + */

I read this comment twice, and I still don't get it.  Can you clarify what
this function does and why you would use it?

> +/* Operations on page table pointers */
> +
> +/* Initialize ptp_t with pointer to top page table level. */
> +static inline ptp_t ptp_init(struct mm_struct *mm)
> +{
> +   struct pt_ptr ptp ={
> +   .ptr = (unsigned long *)mm->pgd,
> +   .lvl = PT_TOP_LEVEL,
> +   };
> +
> +   return ptp;
> +}
> +

On some architectures, there are multiple page table roots.  For example,
ARM64 has a root for the kernel half of the address space and a root for
the user half (at least -- I don't fully understand it).  x86 PAE sort-of
has four roots.  Would it make sense to expose this in the API for real?
For example, ptp_init(mm) could be replaced with ptp_init(mm, addr).  This
would make it a bit cleaner to handle an separate user and kernel tables.
  (As it stands, what is supposed to happen on ARM if you do
ptp_init(something that isn't init_mm) and then walk it to look for a
kernel address?)

Also, ptp_init() seems oddly named for me.  ptp_get_root_for_mm(),
perhaps?  There could also be ptp_get_kernel_root() to get the root for the
init_mm's tables.

> +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> +{
> +   ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> +   ptp->ptr += __pt_index(addr, --ptp->lvl);
> +}

Can you add a comment that says what this function does?  Why does it not
change the level?

> +
> +static void ptp_free(struct mm_struct *mm, ptv_t ptv)
> +{
> +   if (ptv.lvl < PT_SPLIT_LOCK_LVLS)
> +   ptlock_free(pfn_to_page(ptv_pfn(ptv)));
> +}
> +

As it stands, this is a function that seems easy easy to misuse given the
confusion between page tables and page table entries.


Finally, a general comment.  Actually fully implementing this the way
you've done it seems like a giant mess given that you need to support all
architectures.  But couldn't you implement the new API as a wrapper around
the old API so you automatically get all architectures?


Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-04 Thread Kirill A. Shutemov
On Fri, May 04, 2018 at 09:12:44PM +, Matthew Wilcox wrote:
> On Tue, Apr 24, 2018 at 06:43:56PM +0300, Kirill A. Shutemov wrote:
> > +struct pt_ptr {
> > +   unsigned long *ptr;
> > +   int lvl;
> > +};
> 
> On x86, you've got three kinds of paging scheme, referred to in the manual
> as 32-bit, PAE and 4-level.

You forgot 5-level :)

(although it's not in the manual yet, so fair enough)

> On 32-bit, you've got 3 levels (Directory, Table and Entry), and you can
> encode those three levels in the bottom two bits of the pointer.  With
> PAE and 4L, pointers are 64-bit aligned, so you can encode up to eight
> levels in the bottom three bits of the pointer.

I didn't thought about this. Thank you.

> > +struct pt_val {
> > +   unsigned long val;
> > +   int lvl;
> > +};
> 
> I don't think it's possible to shrink this down to a single ulong.
> _Maybe_ it is if you can squirm a single bit free from the !pte_present
> case.

I don't think it worth it. It gets tricky quickly.

> ... this is only for x86 4L and maybe 32 paging, right?  It'd need to
> use unsigned long val[2] for PAE.

I didn't look at 32-bit at all. But 4L and 5L [kinda] work.

> I'm going to think about this some more.  There's a lot of potential here.

Thanks for the input.

-- 
 Kirill A. Shutemov


Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-04 Thread Kirill A. Shutemov
On Fri, May 04, 2018 at 09:12:44PM +, Matthew Wilcox wrote:
> On Tue, Apr 24, 2018 at 06:43:56PM +0300, Kirill A. Shutemov wrote:
> > +struct pt_ptr {
> > +   unsigned long *ptr;
> > +   int lvl;
> > +};
> 
> On x86, you've got three kinds of paging scheme, referred to in the manual
> as 32-bit, PAE and 4-level.

You forgot 5-level :)

(although it's not in the manual yet, so fair enough)

> On 32-bit, you've got 3 levels (Directory, Table and Entry), and you can
> encode those three levels in the bottom two bits of the pointer.  With
> PAE and 4L, pointers are 64-bit aligned, so you can encode up to eight
> levels in the bottom three bits of the pointer.

I didn't thought about this. Thank you.

> > +struct pt_val {
> > +   unsigned long val;
> > +   int lvl;
> > +};
> 
> I don't think it's possible to shrink this down to a single ulong.
> _Maybe_ it is if you can squirm a single bit free from the !pte_present
> case.

I don't think it worth it. It gets tricky quickly.

> ... this is only for x86 4L and maybe 32 paging, right?  It'd need to
> use unsigned long val[2] for PAE.

I didn't look at 32-bit at all. But 4L and 5L [kinda] work.

> I'm going to think about this some more.  There's a lot of potential here.

Thanks for the input.

-- 
 Kirill A. Shutemov


Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-04 Thread Matthew Wilcox
On Tue, Apr 24, 2018 at 06:43:56PM +0300, Kirill A. Shutemov wrote:
> +struct pt_ptr {
> + unsigned long *ptr;
> + int lvl;
> +};

On x86, you've got three kinds of paging scheme, referred to in the manual
as 32-bit, PAE and 4-level.  On 32-bit, you've got 3 levels (Directory,
Table and Entry), and you can encode those three levels in the bottom
two bits of the pointer.  With PAE and 4L, pointers are 64-bit aligned,
so you can encode up to eight levels in the bottom three bits of the
pointer.

> +struct pt_val {
> + unsigned long val;
> + int lvl;
> +};

I don't think it's possible to shrink this down to a single ulong.
_Maybe_ it is if you can squirm a single bit free from the !pte_present
case.

... this is only for x86 4L and maybe 32 paging, right?  It'd need to
use unsigned long val[2] for PAE.

I'm going to think about this some more.  There's a lot of potential here.


Re: Proof-of-concept: better(?) page-table manipulation API

2018-05-04 Thread Matthew Wilcox
On Tue, Apr 24, 2018 at 06:43:56PM +0300, Kirill A. Shutemov wrote:
> +struct pt_ptr {
> + unsigned long *ptr;
> + int lvl;
> +};

On x86, you've got three kinds of paging scheme, referred to in the manual
as 32-bit, PAE and 4-level.  On 32-bit, you've got 3 levels (Directory,
Table and Entry), and you can encode those three levels in the bottom
two bits of the pointer.  With PAE and 4L, pointers are 64-bit aligned,
so you can encode up to eight levels in the bottom three bits of the
pointer.

> +struct pt_val {
> + unsigned long val;
> + int lvl;
> +};

I don't think it's possible to shrink this down to a single ulong.
_Maybe_ it is if you can squirm a single bit free from the !pte_present
case.

... this is only for x86 4L and maybe 32 paging, right?  It'd need to
use unsigned long val[2] for PAE.

I'm going to think about this some more.  There's a lot of potential here.


Proof-of-concept: better(?) page-table manipulation API

2018-04-24 Thread Kirill A. Shutemov
Hi everybody,

I've proposed to talk about page able manipulation API on the LSF/MM'2018,
so I need something material to talk about.

Below is how better (arguably) API may look like as implemented for x86-64
(both paging modes).

It's very incomplete (notably no PTI and paravirt support), not split
properly and probably broken in many ways. But it's good enough to
illustrate the idea.

And it doesn't crash immediately :P

So far I converted only sync_global_pgds() and part of copy_page_range()
codepath as an example.

Pros:
 - Naturally scales to different number of page table levels;
 - No weird folding semantics;
 - No need in special casing for top-level page table handling (see
   sync_global_pgds());
 - Simplifies integration of huge page into generic codepath;

Cons:
 - No clear path on how to convert everything the new API.
   Not sure if compatibility layer is possible with a reasonable
   overhead;

Any feedback is welcome.

diff --git a/arch/x86/include/asm/pt.h b/arch/x86/include/asm/pt.h
index e69de29bb2d1..faffd1a17532 100644
--- a/arch/x86/include/asm/pt.h
+++ b/arch/x86/include/asm/pt.h
@@ -0,0 +1,411 @@
+#ifndef _ASM_X86_PT_H
+#define _ASM_X86_PT_H
+
+/* Arch-private stuff */
+
+/* How many virtual address bits each page table resolve */
+#define BITS_PER_PT9
+
+/* Number of entries in each page table */
+#define PTRS_PER_PT(1 << BITS_PER_PT)
+
+/* 0 is 4k entry */
+#define PT_TOP_LEVEL   (pgtable_l5_enabled ? 5 : 4)
+
+/*
+ * How many bottom levels has per-page tables lock
+ * (instead of mm->page_table_lock).
+ */
+#define PT_SPLIT_LOCK_LVLS 2
+
+/*
+ * How manu bottom level we account to mm->pgtables_bytes
+ */
+#define PT_ACCOUNT_LVLS 3
+
+struct pt_ptr {
+   unsigned long *ptr;
+   int lvl;
+};
+
+struct pt_val {
+   unsigned long val;
+   int lvl;
+};
+
+#define PTP_INIT(addr, level)  \
+{  \
+   .ptr = (unsigned long *) (addr),\
+   .lvl = (level), \
+}
+
+static inline int __pt_shift(int lvl)
+{
+   return lvl * BITS_PER_PT + PAGE_SHIFT;
+}
+
+static inline int __pt_index(unsigned long addr, int lvl)
+{
+   return (addr >> __pt_shift(lvl)) & (PTRS_PER_PT - 1);
+}
+
+static inline unsigned long __ptv_flags(struct pt_val ptv)
+{
+   return ptv.val & PTE_FLAGS_MASK;
+}
+
+/* Public: can be used from generic code */
+
+/*
+ * Encodes information required to operate on page table entry value
+ * of *any* level.
+ */
+typedef struct pt_val ptv_t;
+
+/*
+ * Encodes information required to operate on pointer to page table entry
+ * of *any* level.
+ */
+typedef struct pt_ptr ptp_t;
+
+/* Dereference entry in page table */
+static inline ptv_t get_ptv(ptp_t *ptp)
+{
+   struct pt_val ptv = {
+   .val = *ptp->ptr,
+   .lvl = ptp->lvl,
+   };
+
+   return ptv;
+}
+
+/* Operations on page table values */
+
+static inline bool ptv_bottom(ptv_t ptv)
+{
+   return !ptv.lvl;
+}
+
+static inline unsigned long ptv_size(ptv_t ptv)
+{
+   return 1UL << __pt_shift(ptv.lvl);
+}
+
+static inline unsigned long ptv_mask(ptv_t ptv)
+{
+   return ~(ptv_size(ptv) - 1);
+}
+
+/*
+ * When walking page tables, get the address of the next boundary,
+ * or the end address of the range if that comes earlier.  Although no
+ * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
+ */
+static inline unsigned long ptv_addr_end(ptv_t ptv,
+   unsigned long addr, unsigned long end)
+{
+   unsigned long boundary = (addr + ptv_size(ptv)) & ptv_mask(ptv);
+
+   if (boundary - 1 < end - 1)
+   return boundary;
+   else
+   return end;
+}
+
+static inline bool ptv_none(ptv_t ptv)
+{
+   return !ptv.val;
+}
+static inline bool ptv_present(ptv_t ptv)
+{
+   return ptv.val & _PAGE_PRESENT;
+}
+
+static inline bool ptv_bad(ptv_t ptv)
+{
+   unsigned long ignore_flags = _PAGE_USER;
+
+   if (ptv_bottom(ptv))
+   return false;
+
+   if (IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION))
+   ignore_flags |= _PAGE_NX;
+
+   return (__ptv_flags(ptv) & ~ignore_flags) != _KERNPG_TABLE;
+}
+
+static inline bool ptv_huge(ptv_t ptv)
+{
+   return ptv.val & _PAGE_PSE;
+}
+
+static inline bool ptv_devmap(ptv_t ptv)
+{
+   return ptv.val & _PAGE_DEVMAP;
+}
+
+static inline bool ptv_swap(ptv_t ptv)
+{
+   return !ptv_none(ptv) && !ptv_present(ptv);
+}
+
+static inline bool ptv_leaf(ptv_t ptv)
+{
+   if (ptv_none(ptv))
+   return false;
+   return ptv_bottom(ptv) || ptv_huge(ptv) ||
+   ptv_devmap(ptv) || ptv_swap(ptv);
+}
+
+static inline unsigned long ptv_pfn(ptv_t ptv)
+{
+   if (ptv_leaf(ptv)) {
+   /* TODO */
+   BUG();
+   }
+
+   return (ptv.val & PTE_PFN_MASK) >> PAGE_SHIFT;
+}
+
+/* Operations on page table pointers */
+
+/* Initialize ptp_t with pointer to top page table 

Proof-of-concept: better(?) page-table manipulation API

2018-04-24 Thread Kirill A. Shutemov
Hi everybody,

I've proposed to talk about page able manipulation API on the LSF/MM'2018,
so I need something material to talk about.

Below is how better (arguably) API may look like as implemented for x86-64
(both paging modes).

It's very incomplete (notably no PTI and paravirt support), not split
properly and probably broken in many ways. But it's good enough to
illustrate the idea.

And it doesn't crash immediately :P

So far I converted only sync_global_pgds() and part of copy_page_range()
codepath as an example.

Pros:
 - Naturally scales to different number of page table levels;
 - No weird folding semantics;
 - No need in special casing for top-level page table handling (see
   sync_global_pgds());
 - Simplifies integration of huge page into generic codepath;

Cons:
 - No clear path on how to convert everything the new API.
   Not sure if compatibility layer is possible with a reasonable
   overhead;

Any feedback is welcome.

diff --git a/arch/x86/include/asm/pt.h b/arch/x86/include/asm/pt.h
index e69de29bb2d1..faffd1a17532 100644
--- a/arch/x86/include/asm/pt.h
+++ b/arch/x86/include/asm/pt.h
@@ -0,0 +1,411 @@
+#ifndef _ASM_X86_PT_H
+#define _ASM_X86_PT_H
+
+/* Arch-private stuff */
+
+/* How many virtual address bits each page table resolve */
+#define BITS_PER_PT9
+
+/* Number of entries in each page table */
+#define PTRS_PER_PT(1 << BITS_PER_PT)
+
+/* 0 is 4k entry */
+#define PT_TOP_LEVEL   (pgtable_l5_enabled ? 5 : 4)
+
+/*
+ * How many bottom levels has per-page tables lock
+ * (instead of mm->page_table_lock).
+ */
+#define PT_SPLIT_LOCK_LVLS 2
+
+/*
+ * How manu bottom level we account to mm->pgtables_bytes
+ */
+#define PT_ACCOUNT_LVLS 3
+
+struct pt_ptr {
+   unsigned long *ptr;
+   int lvl;
+};
+
+struct pt_val {
+   unsigned long val;
+   int lvl;
+};
+
+#define PTP_INIT(addr, level)  \
+{  \
+   .ptr = (unsigned long *) (addr),\
+   .lvl = (level), \
+}
+
+static inline int __pt_shift(int lvl)
+{
+   return lvl * BITS_PER_PT + PAGE_SHIFT;
+}
+
+static inline int __pt_index(unsigned long addr, int lvl)
+{
+   return (addr >> __pt_shift(lvl)) & (PTRS_PER_PT - 1);
+}
+
+static inline unsigned long __ptv_flags(struct pt_val ptv)
+{
+   return ptv.val & PTE_FLAGS_MASK;
+}
+
+/* Public: can be used from generic code */
+
+/*
+ * Encodes information required to operate on page table entry value
+ * of *any* level.
+ */
+typedef struct pt_val ptv_t;
+
+/*
+ * Encodes information required to operate on pointer to page table entry
+ * of *any* level.
+ */
+typedef struct pt_ptr ptp_t;
+
+/* Dereference entry in page table */
+static inline ptv_t get_ptv(ptp_t *ptp)
+{
+   struct pt_val ptv = {
+   .val = *ptp->ptr,
+   .lvl = ptp->lvl,
+   };
+
+   return ptv;
+}
+
+/* Operations on page table values */
+
+static inline bool ptv_bottom(ptv_t ptv)
+{
+   return !ptv.lvl;
+}
+
+static inline unsigned long ptv_size(ptv_t ptv)
+{
+   return 1UL << __pt_shift(ptv.lvl);
+}
+
+static inline unsigned long ptv_mask(ptv_t ptv)
+{
+   return ~(ptv_size(ptv) - 1);
+}
+
+/*
+ * When walking page tables, get the address of the next boundary,
+ * or the end address of the range if that comes earlier.  Although no
+ * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
+ */
+static inline unsigned long ptv_addr_end(ptv_t ptv,
+   unsigned long addr, unsigned long end)
+{
+   unsigned long boundary = (addr + ptv_size(ptv)) & ptv_mask(ptv);
+
+   if (boundary - 1 < end - 1)
+   return boundary;
+   else
+   return end;
+}
+
+static inline bool ptv_none(ptv_t ptv)
+{
+   return !ptv.val;
+}
+static inline bool ptv_present(ptv_t ptv)
+{
+   return ptv.val & _PAGE_PRESENT;
+}
+
+static inline bool ptv_bad(ptv_t ptv)
+{
+   unsigned long ignore_flags = _PAGE_USER;
+
+   if (ptv_bottom(ptv))
+   return false;
+
+   if (IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION))
+   ignore_flags |= _PAGE_NX;
+
+   return (__ptv_flags(ptv) & ~ignore_flags) != _KERNPG_TABLE;
+}
+
+static inline bool ptv_huge(ptv_t ptv)
+{
+   return ptv.val & _PAGE_PSE;
+}
+
+static inline bool ptv_devmap(ptv_t ptv)
+{
+   return ptv.val & _PAGE_DEVMAP;
+}
+
+static inline bool ptv_swap(ptv_t ptv)
+{
+   return !ptv_none(ptv) && !ptv_present(ptv);
+}
+
+static inline bool ptv_leaf(ptv_t ptv)
+{
+   if (ptv_none(ptv))
+   return false;
+   return ptv_bottom(ptv) || ptv_huge(ptv) ||
+   ptv_devmap(ptv) || ptv_swap(ptv);
+}
+
+static inline unsigned long ptv_pfn(ptv_t ptv)
+{
+   if (ptv_leaf(ptv)) {
+   /* TODO */
+   BUG();
+   }
+
+   return (ptv.val & PTE_PFN_MASK) >> PAGE_SHIFT;
+}
+
+/* Operations on page table pointers */
+
+/* Initialize ptp_t with pointer to top page table