Re: [PATCH] module: ban '.', '..' as module names, ban '/' in module names

2024-04-14 Thread Matthew Wilcox
On Sun, Apr 14, 2024 at 10:05:05PM +0300, Alexey Dobriyan wrote:
> Any other subsystem should use nice helper function aptly named
> 
>   string_is_vfs_ready()
> 
> and apply additional restrictions if necessary.
> 
> /proc/modules hints that newlines should be banned too,
> and \x1f, and whitespace, and similar looking characters 
> from different languages and emojis (except obviously).

I don't see the purpose of allowing any character in 0x01-0x1f.
How annoying to have BEL in there.  And, really, what's the value in
allowing characters after 0x7e?
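
Concretely, a sketch of such a helper (not merged code; the exact rule set
here -- reject "", ".", "..", '/', control characters and anything outside
printable ASCII -- is only an assumption drawn from this thread):

	#include <stdbool.h>
	#include <string.h>

	static bool string_is_vfs_ready(const char *s)
	{
		if (s[0] == '\0' || !strcmp(s, ".") || !strcmp(s, ".."))
			return false;

		for (; *s; s++) {
			unsigned char c = *s;

			/* '/', 0x01-0x1f, and everything past 0x7e (DEL, non-ASCII) */
			if (c == '/' || c < 0x20 || c > 0x7e)
				return false;
		}
		return true;
	}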




Re: [PATCH 7/9] mm: Free up PG_slab

2024-04-01 Thread Matthew Wilcox
On Sun, Mar 31, 2024 at 11:11:10PM +0800, kernel test robot wrote:
> kernel test robot noticed "UBSAN:shift-out-of-bounds_in_fs/proc/page.c" on:
> 
> commit: 30e5296811312a13938b83956a55839ac1e3aa40 ("[PATCH 7/9] mm: Free up 
> PG_slab")

Quite right.  Spotted another one while I was at it.  Not able to test
right now, but this should do the trick:

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 5bc82828c6aa..55b01535eb22 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -175,6 +175,8 @@ u64 stable_page_flags(const struct page *page)
u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
+   if (folio_test_slab(folio))
+   u |= 1 << KPF_SLAB;
 
 #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
u |= kpf_copy_bit(k, KPF_IDLE,  PG_idle);
@@ -184,7 +186,6 @@ u64 stable_page_flags(const struct page *page)
 #endif
 
u |= kpf_copy_bit(k, KPF_LOCKED,PG_locked);
-   u |= kpf_copy_bit(k, KPF_SLAB,  PG_slab);
u |= kpf_copy_bit(k, KPF_ERROR, PG_error);
u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty);
u |= kpf_copy_bit(k, KPF_UPTODATE,  PG_uptodate);
diff --git a/tools/cgroup/memcg_slabinfo.py b/tools/cgroup/memcg_slabinfo.py
index 1d3a90d93fe2..270c28a0d098 100644
--- a/tools/cgroup/memcg_slabinfo.py
+++ b/tools/cgroup/memcg_slabinfo.py
@@ -146,12 +146,11 @@ def detect_kernel_config():
 
 
 def for_each_slab(prog):
-PGSlab = 1 << prog.constant('PG_slab')
-PGHead = 1 << prog.constant('PG_head')
+PGSlab = ~prog.constant('PG_slab')
 
 for page in for_each_page(prog):
 try:
-if page.flags.value_() & PGSlab:
+if page.page_type.value_() == PGSlab:
 yield cast('struct slab *', page)
 except FaultError:
 pass



Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Matthew Wilcox
On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote:
> On Fri, Jan 05, 2024 at 08:21:00PM +0000, Matthew Wilcox wrote:
> > On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> > > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > > > From: Hou Tao 
> > > > 
> > > > When invoking virtio_fs_enqueue_req() through kworker, both the
> > > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > > > Considering the size of both the sg array and the bounce buffer may be
> > > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> > > > possibility of memory allocation failure.
> > > > 
> > > 
> > > What's the practical benefit of this patch. Looks like if memory
> > > allocation fails, we keep retrying at interval of 1ms and don't
> > > return error to user space.
> > 
> > You don't deplete the atomic reserves unnecessarily?
> 
> Sounds reasonable. 
> 
> With GFP_NOFS specified, can we still get -ENOMEM? Or this will block

> indefinitely till memory can be allocated. 

If you need the "loop indefinitely" behaviour, that's
GFP_NOFS | __GFP_NOFAIL.  If you're actually doing something yourself
which can free up memory, this is a bad choice.  If you're just sleeping
and retrying, you might as well have the MM do that for you.
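
Roughly, for the kworker path (sketch only; `sg` and `nents` are
placeholders, not the driver's actual variables):

	/* today: atomic allocation, and the caller requeues the work ~1ms later */
	sg = kmalloc_array(nents, sizeof(*sg), GFP_ATOMIC);

	/* the "loop indefinitely" variant: the MM sleeps and retries for us */
	sg = kmalloc_array(nents, sizeof(*sg), GFP_NOFS | __GFP_NOFAIL);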



Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Matthew Wilcox
On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > From: Hou Tao 
> > 
> > When invoking virtio_fs_enqueue_req() through kworker, both the
> > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > Considering the size of both the sg array and the bounce buffer may be
> > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> > possibility of memory allocation failure.
> > 
> 
> What's the practical benefit of this patch. Looks like if memory
> allocation fails, we keep retrying at interval of 1ms and don't
> return error to user space.

You don't deplete the atomic reserves unnecessarily?



Re: [PATCH] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-03 Thread Matthew Wilcox
On Thu, Jan 04, 2024 at 09:58:05AM +0800, Hou Tao wrote:
>  static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> -  struct fuse_req *req, bool in_flight);
> +  struct fuse_req *req, bool in_flight,
> +  bool in_atomic);

Better to pass the gfp_t directly instead of a bool.
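
i.e. something like this (sketch; only the gfp parameter is new, the rest
mirrors the quoted prototype, and which caller passes which flags is the
driver's decision):

	static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
					 struct fuse_req *req, bool in_flight,
					 gfp_t gfp);

	/* callers then state the context directly: */
	ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);	/* fast path */
	ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);		/* kworker */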




Re: [2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2024-01-02 Thread Matthew Wilcox
On Tue, Jan 02, 2024 at 11:47:38AM +0100, Markus Elfring wrote:
> > Do you consider more clarity in your argumentation?
> 
> It is probably clear that the function call “kfree(NULL)” does not perform
> data processing which is really useful for the caller.
> Such a call is kept in some cases because programmers did not like to invest
> development resources for its avoidance.

On the contrary, it is extremely useful for callers to not have to perform
the NULL check themselves.  It also mirrors userspace, where free(NULL)
is valid according to ISO/ANSI C, so it eases the transition for programmers
who are coming from userspace.  It costs nothing in the implementation,
as it is part of the check for ZERO_SIZE_PTR.

And from a practical point of view, we can't take it out now.  We can
never find all the places which assume the current behaviour.  So since
we must keep kfree(NULL) working, we should take advantage of that to
simplify users.
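
The simplification in question looks like this in a typical error path
(illustrative only; `struct thing` and example_setup() are made up, not
virtiofs code):

	struct thing {
		char *buf;
	};

	static int example_setup(struct thing **out)
	{
		struct thing *t = kzalloc(sizeof(*t), GFP_KERNEL);
		char *buf = NULL;

		if (!t)
			goto err;
		buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
		if (!buf)
			goto err;

		t->buf = buf;
		*out = t;
		return 0;
	err:
		kfree(buf);	/* may be NULL -- that is the point */
		kfree(t);
		return -ENOMEM;
	}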



Re: [2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2024-01-02 Thread Matthew Wilcox
On Tue, Jan 02, 2024 at 10:35:17AM +0100, Markus Elfring wrote:
> >>> So what?  kfree(NULL) is perfectly acceptable.
> >>
> >> I suggest to reconsider the usefulness of such a special function call.
> >
> > Can you be more explicit in your suggestion?
> 
> I hope that the change acceptance can grow for the presented transformation.
> Are you looking for an improved patch description?

Do you consider more clarity in your argumentation?



Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2023-12-29 Thread Matthew Wilcox
On Fri, Dec 29, 2023 at 10:10:08AM +0100, Markus Elfring wrote:
> >> The kfree() function was called in two cases by
> >> the virtio_fs_get_tree() function during error handling
> >> even if the passed variable contained a null pointer.
> >
> > So what?  kfree(NULL) is perfectly acceptable.
> 
> I suggest to reconsider the usefulness of such a special function call.

Can you be more explicit in your suggestion?



Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2023-12-29 Thread Matthew Wilcox
On Fri, Dec 29, 2023 at 09:38:47AM +0100, Markus Elfring wrote:
> From: Markus Elfring 
> Date: Fri, 29 Dec 2023 09:15:07 +0100
> 
> The kfree() function was called in two cases by
> the virtio_fs_get_tree() function during error handling
> even if the passed variable contained a null pointer.

So what?  kfree(NULL) is perfectly acceptable.  Are you trying to
optimise an error path?




Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-14 Thread Matthew Wilcox
On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote:
> @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
>   struct dax_region *dax_region = dev_get_drvdata(dev);
> - unsigned long long size;
>  
> - device_lock(dev);
> - size = dax_region_avail_size(dax_region);
> - device_unlock(dev);
> + guard(device)(dev);
>  
> - return sprintf(buf, "%llu\n", size);
> + return sprintf(buf, "%llu\n", dax_region_avail_size(dax_region));
>  }

Is this an appropriate use of guard()?  sprintf is not the fastest of
functions, so we will end up holding the device_lock for longer than
we used to.
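
If the answer is that the narrow lock region matters, a scoped guard keeps
it while still using the cleanup helpers -- a sketch, assuming the same
device guard class this patch relies on:

	static ssize_t available_size_show(struct device *dev,
			struct device_attribute *attr, char *buf)
	{
		struct dax_region *dax_region = dev_get_drvdata(dev);
		unsigned long long size;

		scoped_guard(device, dev)
			size = dax_region_avail_size(dax_region);

		/* lock already dropped before the (slow) formatting */
		return sprintf(buf, "%llu\n", size);
	}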

> @@ -908,9 +890,8 @@ static ssize_t size_show(struct device *dev,
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>   unsigned long long size;
>  
> - device_lock(dev);
> + guard(device)(dev);
>   size = dev_dax_size(dev_dax);
> - device_unlock(dev);
>  
>   return sprintf(buf, "%llu\n", size);
>  }

If it is appropriate, then you can do without the 'size' variable here.

> @@ -1137,21 +1117,20 @@ static ssize_t mapping_store(struct device *dev, 
> struct device_attribute *attr,
>   if (rc)
>   return rc;
>  
> - rc = -ENXIO;
> - device_lock(dax_region->dev);
> - if (!dax_region->dev->driver) {
> - device_unlock(dax_region->dev);
> - return rc;
> - }
> - device_lock(dev);
> + guard(device)(dax_region->dev);
> + if (!dax_region->dev->driver)
> + return -ENXIO;
>  
> + guard(device)(dev);
>   to_alloc = range_len(&r);
> - if (alloc_is_aligned(dev_dax, to_alloc))
> - rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
> - device_unlock(dev);
> - device_unlock(dax_region->dev);
> + if (!alloc_is_aligned(dev_dax, to_alloc))
> + return -ENXIO;
>  
> - return rc == 0 ? len : rc;
> + rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
> + if (rc)
> + return rc;
> +
> + return len;
>  }

Have I mentioned how much I hate the "rc" naming convention?  It tells
you nothing useful about the contents of the variable.  If you called it
'err', I'd know it was an error, and then the end of this function would
make sense.

	if (err)
		return err;
	return len;




Re: [v5 0/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-27 Thread Matthew Wilcox
On Tue, Jun 27, 2023 at 06:22:47PM +0200, Markus Elfring wrote:
> >> How do you think about to put additional information below triple dashes
> >> (or even into improved change descriptions)?
> >>
> >> See also:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686
> >
> > Markus,
> >
> > Please go away.  Your feedback is not helpful.
> 
> Would you insist on the usage of cover letters also for single patches?

I would neither insist on it, nor prohibit it.  It simply does not
make enough difference.



Re: [v5 0/1] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-06-27 Thread Matthew Wilcox
On Tue, Jun 27, 2023 at 08:08:19AM +0200, Markus Elfring wrote:
> > The thought was to put descriptions unsuitable for commit header in the 
> > cover letter.
> 
> How do you think about to put additional information below triple dashes
> (or even into improved change descriptions)?
> 
> See also:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686

Markus,

Please go away.  Your feedback is not helpful.  Thank you.



Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-27 Thread Matthew Wilcox
On Thu, Apr 27, 2023 at 06:35:57PM -0700, Dan Williams wrote:
> Jane Chu wrote:
> > Hi, Dan,
> > 
> > On 4/27/2023 2:36 PM, Dan Williams wrote:
> > > Jane Chu wrote:
> > >> When dax fault handler fails to provision the fault page due to
> > >> hwpoison, it returns VM_FAULT_SIGBUS which lead to a sigbus delivered
> > >> to userspace with .si_code BUS_ADRERR.  Channel dax backend driver's
> > >> detection on hwpoison to the filesystem to provide the precise reason
> > >> for the fault.
> > > 
> > > It's not yet clear to me by this description why this is an improvement
> > > or will not cause other confusion. In this case the reason for the
> > > SIGBUS is because the driver wants to prevent access to poison, not that
> > > the CPU consumed poison. Can you clarify what is lost by *not* making
> > > this change?
> > 
> > Elsewhere when hwpoison is detected by page fault handler and helpers as 
> > the direct cause to failure, VM_FAULT_HWPOISON or 
> > VM_FAULT_HWPOISON_LARGE is flagged to ensure accurate SIGBUS payload is 
> > produced, such as wp_page_copy() in COW case, do_swap_page() from 
> > handle_pte_fault(), hugetlb_fault() in hugetlb page fault case where the 
> > huge fault size would be indicated in the payload.
> > 
> > But dax fault has been an exception in that the SIGBUS payload does not 
> > indicate poison, nor fault size.  I don't see why it should be though,
> > recall an internal user expressing confusion regarding the different 
> > SIGBUS payloads.
> 
> ...but again this the typical behavior with block devices. If a block
> device has badblock that causes page cache page not to be populated
> that's a SIGBUS without hwpoison information. If the page cache is
> properly populated and then the CPU consumes poison that's a SIGBUS with
> the additional hwpoison information.

I'm not sure that's true when we mmap().  Yes, it's not consistent with
-EIO from read(), but we have additional information here, and it's worth
providing it.  You can think of it as *in this instance*, the error is
found "in the page cache", because that's effectively where the error
is from the point of view of the application?



Re: [PATCH v2] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-27 Thread Matthew Wilcox
On Thu, Apr 27, 2023 at 04:36:58PM -0700, Jane Chu wrote:
> > This change results in EHWPOISON leaking to usersapce in the case of
> > read(2), that's not a return code that block I/O applications have ever
> > had to contend with before. Just as badblocks cause EIO to be returned,
> > so should poisoned cachelines for pmem.
> 
> The read(2) man page (https://man.archlinux.org/man/read.2) says
> "On error, -1 is returned, and errno is set to indicate the error. In this
> case, it is left unspecified whether the file position (if any) changes."
> 
> If read(2) users haven't dealt with EHWPOISON before, they may discover that
> with pmem backed dax file, it's possible.

I don't think they should.  While syscalls are allowed to return errnos
other than the ones listed in POSIX, I don't think this is a worthwhile
difference.  We should be abstracting from the user that this is pmem
rather than spinning rust or nand.  So we should convert the EHWPOISON
to EIO as Dan suggests.



Re: [PATCH] dax: enable dax fault handler to report VM_FAULT_HWPOISON

2023-04-06 Thread Matthew Wilcox
On Thu, Apr 06, 2023 at 11:55:56AM -0600, Jane Chu wrote:
>  static vm_fault_t dax_fault_return(int error)
>  {
>   if (error == 0)
>   return VM_FAULT_NOPAGE;
> - return vmf_error(error);
> + else if (error == -ENOMEM)
> + return VM_FAULT_OOM;
> + else if (error == -EHWPOISON)
> + return VM_FAULT_HWPOISON;
> + return VM_FAULT_SIGBUS;
>  }

Why would we want to handle it here instead of changing vmf_error()?
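
i.e. teach the common helper about the new errno -- a sketch of what that
would look like (not the submitted patch):

	static inline vm_fault_t vmf_error(int err)
	{
		if (err == -ENOMEM)
			return VM_FAULT_OOM;
		else if (err == -EHWPOISON)
			return VM_FAULT_HWPOISON;
		return VM_FAULT_SIGBUS;
	}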



Re: [PATCH v9 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2023-02-05 Thread Matthew Wilcox
On Sat, Feb 04, 2023 at 02:58:38PM +, Shiyang Ruan wrote:
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + down_write(&mp->m_super->s_umount);
> + error = sync_filesystem(mp->m_super);
> + /* invalidate_inode_pages2() invalidates dax mapping */
> + super_drop_pagecache(mp->m_super, invalidate_inode_pages2);

OK, there's a better way to handle all of this.

First, an essentially untyped interface with a void * argument is
bad.  Second, we can do all this with invalidate_inode_pages2_range()
and invalidate_mapping_pages() without introducing a new function.

Here's the proposal:

void super_drop_pagecache(struct super_block *sb,
		int (*invalidate)(struct address_space *, pgoff_t start,
				  pgoff_t end))

In fs/drop-caches.c:

static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
super_drop_pagecache(sb, invalidate_mapping_pages);
}

In XFS:

super_drop_pagecache(mp->m_super,
invalidate_inode_pages2_range);

Much smaller change ...



Re: [PATCH v9 1/3] xfs: fix the calculation of length and end

2023-02-05 Thread Matthew Wilcox
On Sat, Feb 04, 2023 at 02:58:36PM +, Shiyang Ruan wrote:
> @@ -222,8 +222,8 @@ xfs_dax_notify_failure(
>   len -= ddev_start - offset;
>   offset = 0;
>   }
> - if (offset + len > ddev_end)
> - len -= ddev_end - offset;
> + if (offset + len - 1 > ddev_end)
> + len -= offset + len - 1 - ddev_end;

This _looks_ wrong.  Are you sure it shouldn't be:

len = ddev_end - offset + 1;
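
For what it's worth, the two forms reduce to the same value; with made-up
numbers (offset = 50, len = 100, ddev_end = 99):

	patch:      len -= offset + len - 1 - ddev_end;	/* len -= 149 - 99, so len = 50 */
	suggested:  len  = ddev_end - offset + 1;	/* 99 - 50 + 1 = 50 */

Both end up as ddev_end - offset + 1; the second just says so directly.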




Re: [PATCH v9 2/3] fs: move drop_pagecache_sb() for others to use

2023-02-05 Thread Matthew Wilcox
On Sat, Feb 04, 2023 at 02:58:37PM +, Shiyang Ruan wrote:
> @@ -678,6 +679,48 @@ void drop_super_exclusive(struct super_block *sb)
>  }
>  EXPORT_SYMBOL(drop_super_exclusive);
>  
> +/*

You've gone to the trouble of writing kernel-doc, just add the extra '*'
and make it actually part of the documentation!

> + *   super_drop_pagecache - drop all page caches of a filesystem
> + *   @sb: superblock to invalidate
> + *   @arg: invalidate method, such as invalidate_inode_pages(),
> + *   invalidate_inode_pages2()
> + *
> + *   Scans the inodes of a filesystem, drop all page caches.
> + */

> +++ b/include/linux/fs.h
> @@ -3308,6 +3308,7 @@ extern struct super_block *get_super(struct 
> block_device *);
>  extern struct super_block *get_active_super(struct block_device *bdev);
>  extern void drop_super(struct super_block *sb);
>  extern void drop_super_exclusive(struct super_block *sb);
> +void super_drop_pagecache(struct super_block *sb, void *unused);

But the arg isn't unused.  Call it 'invalidator' here.

> +/**
> + * invalidate_inode_pages - Invalidate all clean, unlocked cache of one inode
> + * @mapping: the address_space which holds the cache to invalidate
> + *
> + * This function removes all pages that are clean, unmapped and unlocked,
> + * as well as shadow entries. It will not block on IO activity.
> + */
> +int invalidate_inode_pages(struct address_space *mapping)
> +{
> + invalidate_mapping_pages(mapping, 0, -1);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(invalidate_inode_pages);

I might make this a static function in super.c but maybe you need it
in the next patch.



[PATCH v5 24/32] tools/testing/nvdimm: Convert to printbuf

2022-08-07 Thread Matthew Wilcox (Oracle)
From: Kent Overstreet 

This converts from seq_buf to printbuf. Here we're using printbuf with
an external buffer, meaning it's a direct conversion.

Signed-off-by: Kent Overstreet 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: nvd...@lists.linux.dev
---
 tools/testing/nvdimm/test/ndtest.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/tools/testing/nvdimm/test/ndtest.c 
b/tools/testing/nvdimm/test/ndtest.c
index 4d1a947367f9..a2097955dace 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -12,7 +12,7 @@
 #include 
 #include 
 #include 
-#include <linux/seq_buf.h>
+#include <linux/printbuf.h>
 
 #include "../watermark.h"
 #include "nfit_test.h"
@@ -740,32 +740,30 @@ static ssize_t flags_show(struct device *dev,
 {
struct nvdimm *nvdimm = to_nvdimm(dev);
struct ndtest_dimm *dimm = nvdimm_provider_data(nvdimm);
-   struct seq_buf s;
+   struct printbuf s = PRINTBUF_EXTERN(buf, PAGE_SIZE);
u64 flags;
 
flags = dimm->flags;
 
-   seq_buf_init(&s, buf, PAGE_SIZE);
if (flags & PAPR_PMEM_UNARMED_MASK)
-   seq_buf_printf(&s, "not_armed ");
+   prt_printf(&s, "not_armed ");
 
if (flags & PAPR_PMEM_BAD_SHUTDOWN_MASK)
-   seq_buf_printf(&s, "flush_fail ");
+   prt_printf(&s, "flush_fail ");
 
if (flags & PAPR_PMEM_BAD_RESTORE_MASK)
-   seq_buf_printf(&s, "restore_fail ");
+   prt_printf(&s, "restore_fail ");
 
if (flags & PAPR_PMEM_SAVE_MASK)
-   seq_buf_printf(&s, "save_fail ");
+   prt_printf(&s, "save_fail ");
 
if (flags & PAPR_PMEM_SMART_EVENT_MASK)
-   seq_buf_printf(&s, "smart_notify ");
+   prt_printf(&s, "smart_notify ");
 
+   if (printbuf_written(&s))
+   prt_printf(&s, "\n");
 
-   if (seq_buf_used(&s))
-   seq_buf_printf(&s, "\n");
-
-   return seq_buf_used(&s);
+   return printbuf_written(&s);
 }
 static DEVICE_ATTR_RO(flags);
 
-- 
2.35.1




Re: [PATCH v2] mm: fix missing wake-up event for FSDAX pages

2022-07-05 Thread Matthew Wilcox
On Tue, Jul 05, 2022 at 02:18:19PM -0700, Andrew Morton wrote:
> On Tue,  5 Jul 2022 20:35:32 +0800 Muchun Song  
> wrote:
> 
> > FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> > 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> > then they will be unpinned via unpin_user_page() using a folio variant
> > to put the page, however, folio variants did not consider this special
> > case, the result will be to miss a wakeup event (like the user of
> > __fuse_dax_break_layouts()).  Since FSDAX pages are only possible get
> > by GUP users, so fix GUP instead of folio_put() to lower overhead.
> > 
> 
> What are the user visible runtime effects of this bug?

"missing wake up event" seems pretty obvious to me?  Something goes to
sleep waiting for a page to become unused, and is never woken.



Re: [PATCH] mm: fix missing wake-up event for FSDAX pages

2022-07-04 Thread Matthew Wilcox
On Mon, Jul 04, 2022 at 03:40:54PM +0800, Muchun Song wrote:
> FSDAX page refcounts are 1-based, rather than 0-based: if refcount is
> 1, then the page is freed.  The FSDAX pages can be pinned through GUP,
> then they will be unpinned via unpin_user_page() using a folio variant
> to put the page, however, folio variants did not consider this special
> case, the result will be to miss a wakeup event (like the user of
> __fuse_dax_break_layouts()).

Argh, no.  The 1-based refcounts are a blight on the entire kernel.
They need to go away, not be pushed into folios as well.  I think
we're close to having that fixed, but until then, this should do
the trick?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc98ab012a9b..4cef5e0f78b6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1129,18 +1129,18 @@ static inline bool is_zone_movable_page(const struct 
page *page)
 #if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
-bool __put_devmap_managed_page(struct page *page);
-static inline bool put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page, int refs);
+static inline bool put_devmap_managed_page(struct page *page, int refs)
 {
if (!static_branch_unlikely(&devmap_managed_key))
return false;
if (!is_zone_device_page(page))
return false;
-   return __put_devmap_managed_page(page);
+   return __put_devmap_managed_page(page, refs);
 }
 
 #else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-static inline bool put_devmap_managed_page(struct page *page)
+static inline bool put_devmap_managed_page(struct page *page, int refs)
 {
return false;
 }
@@ -1246,7 +1246,7 @@ static inline void put_page(struct page *page)
 * For some devmap managed pages we need to catch refcount transition
 * from 2 to 1:
 */
-   if (put_devmap_managed_page(&folio->page))
+   if (put_devmap_managed_page(&folio->page, 1))
return;
folio_put(folio);
 }
diff --git a/mm/gup.c b/mm/gup.c
index d1132b39aa8f..28df02121c78 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -88,7 +88,8 @@ static inline struct folio *try_get_folio(struct page *page, 
int refs)
 * belongs to this folio.
 */
if (unlikely(page_folio(page) != folio)) {
-   folio_put_refs(folio, refs);
+   if (!put_devmap_managed_page(&folio->page, refs))
+   folio_put_refs(folio, refs);
goto retry;
}
 
@@ -177,6 +178,8 @@ static void gup_put_folio(struct folio *folio, int refs, 
unsigned int flags)
refs *= GUP_PIN_COUNTING_BIAS;
}
 
+   if (put_devmap_managed_page(&folio->page, refs))
+   return;
folio_put_refs(folio, refs);
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index b870a659eee6..b25e40e3a11e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -499,7 +499,7 @@ void free_zone_device_page(struct page *page)
 }
 
 #ifdef CONFIG_FS_DAX
-bool __put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page, int refs)
 {
if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
return false;
@@ -509,7 +509,7 @@ bool __put_devmap_managed_page(struct page *page)
 * refcount is 1, then the page is free and the refcount is
 * stable because nobody holds a reference on the page.
 */
-   if (page_ref_dec_return(page) == 1)
+   if (page_ref_sub_return(page, refs) == 1)
wake_up_var(&page->_refcount);
return true;
 }
diff --git a/mm/swap.c b/mm/swap.c
index c6194cfa2af6..94e42a9bab92 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -960,7 +960,7 @@ void release_pages(struct page **pages, int nr)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
-   if (put_devmap_managed_page(&folio->page))
+   if (put_devmap_managed_page(&folio->page, 1))
continue;
if (folio_put_testzero(folio))
free_zone_device_page(&folio->page);



Re: [PATCH] sparse: use force attribute for vm_fault_t casts

2022-05-14 Thread Matthew Wilcox
On Sat, May 14, 2022 at 05:26:21PM +0300, Vasily Averin wrote:
> Fixes sparse warnings:
> ./include/trace/events/fs_dax.h:10:1: sparse:
> got restricted vm_fault_t
> ./include/trace/events/fs_dax.h:153:1: sparse:
> got restricted vm_fault_t
> fs/dax.c:563:39: sparse:got restricted vm_fault_t
> fs/dax.c:565:39: sparse:got restricted vm_fault_t
> fs/dax.c:569:31: sparse:got restricted vm_fault_t
> fs/dax.c:1055:41: sparse:
> got restricted vm_fault_t [assigned] [usertype] ret
> fs/dax.c:1461:46: sparse:got restricted vm_fault_t [usertype] ret
> fs/dax.c:1477:21: sparse:
> expected restricted vm_fault_t [assigned] [usertype] ret
> fs/dax.c:1518:51: sparse:
> got restricted vm_fault_t [assigned] [usertype] ret
> fs/dax.c:1599:21: sparse:
> expected restricted vm_fault_t [assigned] [usertype] ret
> fs/dax.c:1633:62: sparse:
> got restricted vm_fault_t [assigned] [usertype] ret
> fs/dax.c:1696:55: sparse:got restricted vm_fault_t
> fs/dax.c:1711:58: sparse:
> got restricted vm_fault_t [assigned] [usertype] ret
> 
> vm_fault_t type is bitwise and requires __force attribute for any casts.

Well, this patch is all kinds of messy.  I would rather we had better
abstractions.  For example ...

> @@ -560,13 +560,13 @@ static void *grab_mapping_entry(struct xa_state *xas,
>   if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
>   goto retry;
>   if (xas->xa_node == XA_ERROR(-ENOMEM))
> - return xa_mk_internal(VM_FAULT_OOM);
> + return xa_mk_internal((__force unsigned long)VM_FAULT_OOM);
>   if (xas_error(xas))
> - return xa_mk_internal(VM_FAULT_SIGBUS);
> + return xa_mk_internal((__force unsigned long)VM_FAULT_SIGBUS);
>   return entry;
>  fallback:
>   xas_unlock_irq(xas);
> - return xa_mk_internal(VM_FAULT_FALLBACK);
> + return xa_mk_internal((__force unsigned long)VM_FAULT_FALLBACK);
>  }

return vm_fault_encode(VM_FAULT_xxx);

>  /**
> @@ -1052,7 +1052,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   DAX_ZERO_PAGE, false);
>  
>   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> - trace_dax_load_hole(inode, vmf, ret);
> + trace_dax_load_hole(inode, vmf, (__force int)ret);

Seems like trace_dax_load_hole() should take a vm_fault_t?

> - trace_dax_pte_fault(iter.inode, vmf, ret);
> + trace_dax_pte_fault(iter.inode, vmf, (__force int)ret);

Ditto.

> @@ -1474,7 +1474,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>  
>   entry = grab_mapping_entry(&xas, mapping, 0);
>   if (xa_is_internal(entry)) {
> - ret = xa_to_internal(entry);
> + ret = (__force vm_fault_t)xa_to_internal(entry);

vm_fault_decode(entry)?

... the others seem like more of the same.  So I'm in favour of what
you're doing, but would rather it were done differently.  Generally
seeing __force casts in the body of a function is a sign that things are
wrong; it's better to have them hidden in abstractions.
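
Something along these lines, as a sketch (vm_fault_encode()/vm_fault_decode()
do not exist today; the names are just from this review):

	static inline void *vm_fault_encode(vm_fault_t ret)
	{
		return xa_mk_internal((__force unsigned long)ret);
	}

	static inline vm_fault_t vm_fault_decode(void *entry)
	{
		return (__force vm_fault_t)xa_to_internal(entry);
	}

so callers become "return vm_fault_encode(VM_FAULT_OOM);" and
"ret = vm_fault_decode(entry);" with the __force casts hidden in one place.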



Re: [PATCH v2 3/6] mm: page_vma_mapped: support checking if a pfn is mapped into a vma

2022-02-02 Thread Matthew Wilcox
On Wed, Feb 02, 2022 at 10:33:04PM +0800, Muchun Song wrote:
> page_vma_mapped_walk() is supposed to check if a page is mapped into a vma.
> However, not all page frames (e.g. PFN_DEV) have a associated struct page
> with it. There is going to be some duplicate codes similar with this function
> if someone want to check if a pfn (without a struct page) is mapped into a
> vma. So add support for checking if a pfn is mapped into a vma. In the next
> patch, the dax will use this new feature.

I'm coming to more or less the same solution for fixing the bug in
page_mapped_in_vma().  If you call it with a head page, it will look
for any page in the THP instead of the precise page.  I think we can do
a fairly significant simplification though, so I'm going to go off
and work on that next ...




Re: [PATCH v10 2/9] mm: factor helpers for memory_failure_dev_pagemap

2022-02-01 Thread Matthew Wilcox
On Thu, Jan 27, 2022 at 08:40:51PM +0800, Shiyang Ruan wrote:
> +static int mf_generic_kill_procs(unsigned long long pfn, int flags,
> + struct dev_pagemap *pgmap)
> +{
> + struct page *page = pfn_to_page(pfn);
> + LIST_HEAD(to_kill);
> + dax_entry_t cookie;
> +
> + /*
> +  * Prevent the inode from being freed while we are interrogating
> +  * the address_space, typically this would be handled by
> +  * lock_page(), but dax pages do not use the page lock. This
> +  * also prevents changes to the mapping of this pfn until
> +  * poison signaling is complete.
> +  */
> + cookie = dax_lock_page(page);
> + if (!cookie)
> + return -EBUSY;
> +
> + if (hwpoison_filter(page))
> + return 0;
> +
> + if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> + /*
> +  * TODO: Handle HMM pages which may need coordination
> +  * with device-side memory.
> +  */
> + return -EBUSY;
> + }
> +
> + /*
> +  * Use this flag as an indication that the dax page has been
> +  * remapped UC to prevent speculative consumption of poison.
> +  */
> + SetPageHWPoison(page);
> +
> + /*
> +  * Unlike System-RAM there is no possibility to swap in a
> +  * different physical page at a given virtual address, so all
> +  * userspace consumption of ZONE_DEVICE memory necessitates
> +  * SIGBUS (i.e. MF_MUST_KILL)
> +  */
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> + collect_procs(page, &to_kill, true);
> +
> + unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
> + dax_unlock_page(page, cookie);
> + return 0;
> +}

There's an assumption here that fsdax only has order-0 pages.
pfn_to_page() is going to return the precise page for that pfn, but then
page->mapping and page->index are not valid for tail pages.

I'm currently trying to folio-ise memory-failure.c, and it is not
going well!  There are several places where it's hard to tell
what should happen.




[PATCH] arm64: Show three registers per line

2021-04-20 Thread Matthew Wilcox (Oracle)
Displaying two registers per line takes 15 lines.  That improves to just
10 lines if we display three registers per line, which reduces the amount
of information lost when oopses are cut off.  It stays within 80 columns
and matches x86-64.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 arch/arm64/kernel/process.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6e60aa3b5ea9..aff5a2c12297 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -294,13 +294,10 @@ void __show_regs(struct pt_regs *regs)
i = top_reg;
 
while (i >= 0) {
-   printk("x%-2d: %016llx ", i, regs->regs[i]);
-   i--;
+   printk("x%-2d: %016llx", i, regs->regs[i]);
 
-   if (i % 2 == 0) {
-   pr_cont("x%-2d: %016llx ", i, regs->regs[i]);
-   i--;
-   }
+   while (i-- % 3)
+   pr_cont(" x%-2d: %016llx", i, regs->regs[i]);
 
pr_cont("\n");
}
-- 
2.30.2



Re: [PATCH v2] docs: proc.rst: meminfo: briefly describe gaps in memory accounting

2021-04-20 Thread Matthew Wilcox
On Tue, Apr 20, 2021 at 03:13:54PM +0300, Mike Rapoport wrote:
> Add a paragraph that explains that it may happen that the counters in
> /proc/meminfo do not add up to the overall memory usage.

... that is, the sum may be lower because memory is allocated for other
purposes that is not reported here, right?

Is it ever possible for it to be higher?  Maybe due to a race when
sampling the counters?

>  Provides information about distribution and utilization of memory.  This
> -varies by architecture and compile options.  The following is from a
> -16GB PIII, which has highmem enabled.  You may not have all of these fields.
> +varies by architecture and compile options. Please note that it may happen
> +that the memory accounted here does not add up to the overall memory usage
> +and the difference for some workloads can be substantial. In many cases there
> +are other means to find out additional memory using subsystem specific
> +interfaces, for instance /proc/net/sockstat for TCP memory allocations.

How about just:

+varies by architecture and compile options.  The memory reported here
+may not add up to the overall memory usage and the difference for some
+workloads can be substantial. [...]

But I'd like to be a bit more explicit about the reason, hence my question
above to be sure I understand.


It's also not entirely clear which of the fields in meminfo can be
usefully summed.  VmallocTotal is larger than MemTotal, for example.
But I know that KernelStack is allocated through vmalloc these days,
and I don't know whether VmallocUsed includes KernelStack or whether I
can sum them.  Similarly, is Mlocked a subset of Unevictable?

There is some attempt at explaining how these numbers fit together, but
it's outdated and doesn't include Mlocked, Unevictable or KernelStack.


Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-20 Thread Matthew Wilcox
On Tue, Apr 20, 2021 at 09:39:54AM +0200, Geert Uytterhoeven wrote:
> > +++ b/include/linux/mm_types.h
> > @@ -97,10 +97,10 @@ struct page {
> > };
> > struct {/* page_pool used by netstack */
> > /**
> > -* @dma_addr: might require a 64-bit value even on
> > +* @dma_addr: might require a 64-bit value on
> >  * 32-bit architectures.
> >  */
> > -   dma_addr_t dma_addr;
> > +   unsigned long dma_addr[2];
> 
> So we get two 64-bit words on 64-bit platforms, while only one is
> needed?

Not really.  This is part of the 5-word union in struct page, so the space
ends up being reserved anyway, even if it's not "assigned" to dma_addr.

> > +   dma_addr_t ret = page->dma_addr[0];
> > +   if (sizeof(dma_addr_t) > sizeof(unsigned long))
> > +   ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
> 
> We don't seem to have a handy macro for a 32-bit left shift yet...
> 
> But you can also avoid the warning using
> 
> ret |= (u64)page->dma_addr[1] << 32;

Sure.  It doesn't really matter which way we eliminate the warning;
the code is unreachable.

> > +{
> > +   page->dma_addr[0] = addr;
> > +   if (sizeof(dma_addr_t) > sizeof(unsigned long))
> > +   page->dma_addr[1] = addr >> 16 >> 16;
> 
> ... but we do have upper_32_bits() for a 32-bit right shift.

Yep, that's what my current tree looks like.


Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-19 Thread Matthew Wilcox
On Tue, Apr 20, 2021 at 02:48:17AM +, Vineet Gupta wrote:
> > 32-bit architectures which expect 8-byte alignment for 8-byte integers
> > and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> > page inadvertently expanded in 2019.
> 
> FWIW, ARC doesn't require 8 byte alignment for 8 byte integers. This is 
> only needed for 8-byte atomics due to the requirements of LLOCKD/SCOND 
> instructions.

Ah, like x86?  OK, great, I'll drop your arch from the list of
affected.  Thanks!


Re: [PATCH 04/13] Kbuild: Rust support

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 09:58:51PM +0200, David Sterba wrote:
> On Fri, Apr 16, 2021 at 07:34:51PM +0200, Miguel Ojeda wrote:
> > something like:
> > 
> > [0.903456]  rust_begin_unwind+0x9/0x10
> > [0.903456]  ? _RNvNtCsbDqzXfLQacH_4core9panicking9panic_fmt+0x29/0x30
> > [0.903456]  ? _RNvNtCsbDqzXfLQacH_4core9panicking5panic+0x44/0x50
> > [0.903456]  ? _RNvCsbDqzXfLQacH_12rust_minimal1h+0x1c/0x20
> > [0.903456]  ? _RNvCsbDqzXfLQacH_12rust_minimal1g+0x9/0x10
> > [0.903456]  ? _RNvCsbDqzXfLQacH_12rust_minimal1f+0x9/0x10
> > [0.903456]  ?
> > _RNvXCsbDqzXfLQacH_12rust_minimalNtB2_11RustMinimalNtCsbDqzXfLQacH_6kernel12KernelModule4init+0x73/0x80
> > [0.903456]  ? 
> > _RNvXsa_NtCsbDqzXfLQacH_4core3fmtbNtB5_5Debug3fmt+0x30/0x30
> > [0.903456]  ? __rust_minimal_init+0x11/0x20
> 
> Are there plans to unmangle the symbols when printing stacks? c++filt
> says:
> 
>   rust_begin_unwind+0x9/0x10
>   ? core[8787f43e282added]::panicking::panic_fmt+0x29/0x30
>   ? core[8787f43e282added]::panicking::panic+0x44/0x50
>   ? rust_minimal[8787f43e282added]::h+0x1c/0x20
>   ? rust_minimal[8787f43e282added]::g+0x9/0x10
>   ? rust_minimal[8787f43e282added]::f+0x9/0x10
>   ?  kernel[8787f43e282added]::KernelModule>::init+0x73/0x80
>   ? ::fmt+0x30/0x30
>   ? __rust_minimal_init+0x11/0x20
> 
> for simple functions it's barely parseable but the following is hardly
> readable
> 
> > _RNvXs5_NtCsbDqzXfLQacH_11rust_binder11range_allocNtB5_15DescriptorStateNtNtCsbDqzXfLQacH_4core3fmt5Debug3fmt+0x60/0x60
> 
> translates to
> 
>core[8787f43e282added]::fmt::Debug>::fmt

Yes, I agree, we need a better story for name mangling.
My proposal is that we store a pretty name which matches the source
(eg rust_binder::range_alloc) and a sha1 of the mangled symbol
(40 bytes of uninteresting hex).  Symbol resolution is performed against
the sha1.  Printing is of the pretty name.  It should be obvious from
the stack trace which variant of a function is being called, no?
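
As a sketch, the table entry would carry something like this (invented
struct and field names, purely to make the proposal concrete):

	struct pretty_sym {
		const char	*pretty;	/* e.g. "rust_binder::range_alloc" */
		u8		sha1[20];	/* SHA-1 of the full mangled name */
	};

Resolution compares sha1[], printing uses pretty.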


Re: [PATCH v7 09/28] mm: Create FolioFlags

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 03:25:46PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 07:50:46PM +0100, Matthew Wilcox (Oracle) wrote:
> > These new functions are the folio analogues of the PageFlags functions.
> > If CONFIG_DEBUG_VM_PGFLAGS is enabled, we check the folio is not a tail
> > page at every invocation.  Note that this will also catch the PagePoisoned
> > case as a poisoned page has every bit set, which would include PageTail.
> > 
> > This saves 1727 bytes of text with the distro-derived config that
> > I'm testing due to removing a double call to compound_head() in
> > PageSwapCache().
> 
> I vote for dropping the Camels if we're going to rework all this.

I'm open to that.  It's a bit of rework now, but easier to do it as
part of this than as a separate series.

So, concretely:

PageReferenced() becomes folio_referenced()
SetPageReferenced() becomes folio_set_referenced()
ClearPageReferenced() becomes folio_clear_referenced()
__SetFolioReferenced() becomes __folio_set_referenced()
__ClearFolioReferenced() becomes __folio_clear_referenced()
TestSetPageReferenced() becomes folio_test_set_referenced()
TestClearPageReferenced() becomes folio_test_clear_referenced()

We do have some functions already like set_page_writeback(), but I
think those can become folio_set_writeback() without doing any harm.
We also have page_is_young(), page_is_pfmemalloc(), page_is_guard(),
etc.  Should it be folio_referenced()?  or folio_is_referenced()?



Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 01:22:04PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 14 Apr 2021 13:09:47 -0700
> Shakeel Butt  wrote:
> 
> > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> >  wrote:
> > >  
> > [...]
> > > > >
> > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > can not be used.  
> > > >
> > > > Yes it can, since it's going to be used as your default allocator for
> > > > payloads, which might end up on an SKB.  
> > >
> > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > zerocopy.
> > > For several reasons.
> > >
> > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > means using page->mapping and page->index members (right?).
> > >  
> > 
> > No, only page->_mapcount is used.
> 
> Good to know.
> I will admit that I don't fully understand the usage of page->mapping
> and page->index members.

That's fair.  It's not well-documented, and it's complicated.

For a page mapped into userspace, page->mapping is one of:
 - NULL
 - A pointer to a file's address_space
 - A pointer to an anonymous page's anon_vma
If a page isn't mapped into userspace, you can use the space in page->mapping
for anything you like (eg slab uses it)

page->index is only used for indicating pfmemalloc today (and I want to
move that indicator).  I think it can also be used to merge VMAs (if
some other conditions are also true), but failing to merge VMAs isn't
a big deal for this kind of situation.
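
Roughly, the cases can be told apart like this (sketch; page_mapping() has
the real logic, which also copes with swap and movable pages):

	static const char *classify_mapping(struct page *page)
	{
		unsigned long m = (unsigned long)page->mapping;

		if (!m)
			return "unused -- free for the owner (slab, drivers, ...)";
		if (m & PAGE_MAPPING_ANON)
			return "anon_vma of an anonymous page";
		return "address_space of a file";
	}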

> > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > device) and also map this page into userspace.
> > 
> > I think this is already the case i.e pages still DMA-mapped and also
> > mapped into userspace.
> 
> True, other drivers are doing the same.

And the contents of this page already came from that device ... if it
wanted to write bad data, it could already have done so.

> > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > zerocopy will bump the refcnt, which means the page_pool will not
> > > recycle the page when it see the elevated refcnt (it will instead
> > > release its DMA-mapping).  
> > 
> > Yes this is right but the userspace might have already consumed and
> > unmapped the page before the driver considers to recycle the page.
> 
> That is a good point.  So, there is a race window where it is possible
> to gain recycling.
> 
> It seems my page_pool co-maintainer Ilias is interested in taking up the
> challenge to get this working with TCP RX zerocopy.  So, lets see how
> this is doable.

You could also check page_ref_count() - page_mapcount() instead of
just checking page_ref_count().  Assuming mapping/unmapping can't
race with recycling?
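
i.e. roughly (sketch; the real recycling decision in page_pool has more
conditions than this, and recycle() is just a stand-in name):

	/* only the pool's own reference left once mappings are accounted for */
	if (page_ref_count(page) - page_mapcount(page) == 1)
		recycle(page);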



Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 02:56:17PM +0300, Mike Rapoport wrote:
> On Mon, Apr 19, 2021 at 12:23:02PM +0100, Matthew Wilcox wrote:
> > So you're calling page_is_secretmem() on a struct page without having
> > a refcount on it.  That is definitely not allowed.  secretmem seems to
> > be full of these kinds of races; I know this isn't the first one I've
> > seen in it.  I don't think this patchset is ready for this merge window.
> 
> There were races in the older version that did caching of large pages and
> those were fixed then, but this is anyway irrelevant because all that code
> was dropped in the latest respins.
> 
> I don't think that the fix of the race in gup_pte_range is that significant
> to wait 3 more months because of it.

I have no particular interest in secretmem, but it seems that every time
I come across it while looking at something else, I see these kinds of
major mistakes in it.  That says to me it's not ready and hasn't seen
enough review.

> > With that fixed, you'll have a head page that you can use for testing,
> > which means you don't need to test PageCompound() (because you know the
> > page isn't PageTail), you can just test PageHead().
> 
> I can't say I follow you here. page_is_secretmem() is intended to be a
> generic test, so I don't see how it will get PageHead() if it is called
> from other places.

static inline bool head_is_secretmem(struct page *head)
{
if (PageHead(head))
return false;
...
}

static inline bool page_is_secretmem(struct page *page)
{
if (PageTail(page))
return false;
return head_is_secretmem(page);
}

(yes, calling it head is a misnomer, because it's not necessarily a head,
it might be a base page, but until we have a name for pages which might
be a head page or a base page, it'll have to do ...)


Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 12:36:19PM +0300, Mike Rapoport wrote:
> Well, most if the -4.2% of the performance regression kbuild reported were
> due to repeated compount_head(page) in page_mapping(). So the whole point
> of this patch is to avoid calling page_mapping().

It's quite ludicrous how many times we call compound_head() in
page_mapping() today:

 page = compound_head(page);
 if (__builtin_expect(!!(PageSlab(page)), 0))
 if (__builtin_expect(!!(PageSwapCache(page)), 0)) {

TESTPAGEFLAG(Slab, slab, PF_NO_TAIL) expands to:

static __always_inline int PageSlab(struct page *page)
{
PF_POISONED_CHECK(compound_head(page));
return test_bit(PG_slab, &compound_head(page));
}

static __always_inline int PageSwapCache(struct page *page)
{
page = compound_head(page);
return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
}

but then!

TESTPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL) also expands like Slab does.

So that's six calls to compound_head(), depending what Kconfig options
you have enabled.

And folio_mapping() is one of the functions I add in the first batch of
patches, so review, etc will be helpful.


Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 11:42:18AM +0300, Mike Rapoport wrote:
> The perf profile of the test indicated that the regression is caused by
> page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range):

Uhh ... you're calling it in the wrong place!

VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);

if (page_is_secretmem(page))
goto pte_unmap;

head = try_grab_compound_head(page, 1, flags);
if (!head)
goto pte_unmap;

So you're calling page_is_secretmem() on a struct page without having
a refcount on it.  That is definitely not allowed.  secretmem seems to
be full of these kinds of races; I know this isn't the first one I've
seen in it.  I don't think this patchset is ready for this merge window.

With that fixed, you'll have a head page that you can use for testing,
which means you don't need to test PageCompound() (because you know the
page isn't PageTail), you can just test PageHead().


Re: [PATCH v2 01/12] docs: path-lookup: update follow_managed() part

2021-04-18 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 10:33:00AM +0800, Fox Chen wrote:
> On Mon, Apr 19, 2021 at 10:17 AM Matthew Wilcox  wrote:
> >
> > On Tue, Mar 16, 2021 at 01:47:16PM +0800, Fox Chen wrote:
> > > -In the absence of symbolic links, ``walk_component()`` creates a new
> > > +As the last step of ``walk_component()``, ``step_into()`` will be called 
> > > either
> >
> > You can drop ``..`` from around function names which are followed with
> > ().  d74b0d31ddde ("Docs: An initial automarkup extension for sphinx")
> > marks them up automatically.
> >
> 
> Got it, thanks for letting me know. But I will still use them in this
> patch series to keep consistency with the remaining parts of the
> document.

Well, you weren't.  For example:

+As the last step of ``walk_component()``, ``step_into()`` will be called either
+directly from walk_component() or from handle_dots().  It calls
+``handle_mount()``, to check and handle mount points, in which a new

Neither of the functions on the second line were using ``.


Re: [PATCH v2 01/12] docs: path-lookup: update follow_managed() part

2021-04-18 Thread Matthew Wilcox
On Tue, Mar 16, 2021 at 01:47:16PM +0800, Fox Chen wrote:
> -In the absence of symbolic links, ``walk_component()`` creates a new
> +As the last step of ``walk_component()``, ``step_into()`` will be called 
> either

You can drop ``..`` from around function names which are followed with
().  d74b0d31ddde ("Docs: An initial automarkup extension for sphinx")
marks them up automatically.



Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-17 Thread Matthew Wilcox
On Sat, Apr 17, 2021 at 09:18:57PM +, David Laight wrote:
> Ugly as well.

Thank you for expressing your opinion.  Again.


Re: [PATCH 2/2] mm: Indicate pfmemalloc pages in compound_head

2021-04-17 Thread Matthew Wilcox
On Sat, Apr 17, 2021 at 09:13:45PM +, David Laight wrote:
> > struct {/* page_pool used by netstack */
> > -   /**
> > -* @dma_addr: might require a 64-bit value on
> > -* 32-bit architectures.
> > -*/
> > +   unsigned long pp_magic;
> > +   unsigned long xmi;
> > +   unsigned long _pp_mapping_pad;
> > unsigned long dma_addr[2];
> > };
> 
> You've deleted the comment.

Yes.  It no longer added any value.  You can see dma_addr now occupies
two words.

> I also think there should be a comment that dma_addr[0]
> must be aliased to ->index.

That's not a requirement.  Moving the pfmemalloc indicator is a
requirement so that we _can_ use index, but there's no requirement about
how index is used.


Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-17 Thread Matthew Wilcox
On Sat, Apr 17, 2021 at 09:32:06PM +0300, Ilias Apalodimas wrote:
> > +static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t 
> > addr)
> > +{
> > +   page->dma_addr[0] = addr;
> > +   if (sizeof(dma_addr_t) > sizeof(unsigned long))
> > +   page->dma_addr[1] = addr >> 16 >> 16;
> 
> The 'error' that was reported will never trigger right?
> I assume this was compiled with dma_addr_t as 32bits (so it triggered the
> compilation error), but the if check will never allow this codepath to run.
> If so can we add a comment explaining this, since none of us will remember why
> in 6 months from now?

That's right.  I compiled it all three ways -- 32-bit, 64-bit dma, 32-bit long
and 64-bit.  The 32/64 bit case turn into:

if (0)
page->dma_addr[1] = addr >> 16 >> 16;

which gets elided.  So the only case that has to work is 64-bit dma and
32-bit long.

I can replace this with upper_32_bits().
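
i.e. (sketch of the same helper; only the second store changes):

	static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
	{
		page->dma_addr[0] = addr;
		if (sizeof(dma_addr_t) > sizeof(unsigned long))
			page->dma_addr[1] = upper_32_bits(addr);
	}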



Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-17 Thread Matthew Wilcox
On Sat, Apr 17, 2021 at 12:31:37PM +0200, Arnd Bergmann wrote:
> On Fri, Apr 16, 2021 at 5:27 PM Matthew Wilcox  wrote:
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index b5b195305346..db7c7020746a 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct 
> > page_pool *pool,
> >
> >  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >  {
> > -   return page->dma_addr;
> > +   dma_addr_t ret = page->dma_addr[0];
> > +   if (sizeof(dma_addr_t) > sizeof(unsigned long))
> > +   ret |= (dma_addr_t)page->dma_addr[1] << 32;
> > +   return ret;
> > +}
> 
> Have you considered using a PFN type address here? I suspect you
> can prove that shifting the DMA address by PAGE_BITS would
> make it fit into an 'unsigned long' on all 32-bit architectures with
> 64-bit dma_addr_t. This requires that page->dma_addr to be
> page aligned, as well as fit into 44 bits. I recently went through the
> maximum address space per architecture to define a
> MAX_POSSIBLE_PHYSMEM_BITS, and none of them have more than
> 40 here, presumably the same is true for dma address space.

I wouldn't like to make that assumption.  I've come across IOMMUs (maybe
on parisc?  powerpc?) that like to encode fun information in the top
few bits.  So we could get it down to 52 bits, but I don't think we can
get all the way down to 32 bits.  Also, we need to keep the bottom bit
clear for PageTail, so that further constrains us.

Anyway, I like the "two unsigned longs" approach I posted yesterday,
but thanks for the suggestion.


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread Matthew Wilcox
On Fri, Apr 16, 2021 at 07:08:23PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 16 Apr 2021 16:27:55 +0100
> Matthew Wilcox  wrote:
> 
> > On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > > See below patch.  Where I swap32 the dma address to satisfy
> > > page->compound having bit zero cleared. (It is the simplest fix I could
> > > come up with).  
> > 
> > I think this is slightly simpler, and as a bonus code that assumes the
> > old layout won't compile.
> 
> This is clever, I like it!  When reading the code one just have to
> remember 'unsigned long' size difference between 64-bit vs 32-bit.
> And I assume compiler can optimize the sizeof check out then doable.

I checked before/after with the replacement patch that doesn't
have compiler warnings.  On x86, there is zero codegen difference
(objdump -dr before/after matches exactly) for both x86-32 with 32-bit
dma_addr_t and x86-64.  For x86-32 with 64-bit dma_addr_t, the compiler
makes some different inlining decisions in page_pool_empty_ring(),
page_pool_put_page() and page_pool_put_page_bulk(), but it's not clear
to me that they're wrong.

Function                     old     new   delta
page_pool_empty_ring         387     307     -80
page_pool_put_page           604     516     -88
page_pool_put_page_bulk      690     517    -173



Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread Matthew Wilcox


Replacement patch to fix compiler warning.

From: "Matthew Wilcox (Oracle)" 
Date: Fri, 16 Apr 2021 16:34:55 -0400
Subject: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems
To: bro...@redhat.com
Cc: linux-kernel@vger.kernel.org,
linux...@kvack.org,
net...@vger.kernel.org,
linuxppc-...@lists.ozlabs.org,
linux-arm-ker...@lists.infradead.org,
linux-m...@vger.kernel.org,
ilias.apalodi...@linaro.org,
mcr...@linux.microsoft.com,
grygorii.stras...@ti.com,
a...@kernel.org,
h...@lst.de,
linux-snps-...@lists.infradead.org,
mho...@kernel.org,
mgor...@suse.de

32-bit architectures which expect 8-byte alignment for 8-byte integers
and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
page inadvertently expanded in 2019.  When the dma_addr_t was added,
it forced the alignment of the union to 8 bytes, which inserted a 4 byte
gap between 'flags' and the union.

Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
This restores the alignment to that of an unsigned long, and also fixes a
potential problem where (on a big endian platform), the bit used to denote
PageTail could inadvertently get set, and a racing get_user_pages_fast()
could dereference a bogus compound_head().

Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/mm_types.h |  4 ++--
 include/net/page_pool.h  | 12 +++-
 net/core/page_pool.c | 12 +++-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..5aacc1c10a45 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -97,10 +97,10 @@ struct page {
};
struct {/* page_pool used by netstack */
/**
-* @dma_addr: might require a 64-bit value even on
+* @dma_addr: might require a 64-bit value on
 * 32-bit architectures.
 */
-   dma_addr_t dma_addr;
+   unsigned long dma_addr[2];
};
struct {/* slab, slob and slub */
union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b5b195305346..ad6154dc206c 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct 
page_pool *pool,
 
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-   return page->dma_addr;
+   dma_addr_t ret = page->dma_addr[0];
+   if (sizeof(dma_addr_t) > sizeof(unsigned long))
+   ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
+   return ret;
+}
+
+static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+{
+   page->dma_addr[0] = addr;
+   if (sizeof(dma_addr_t) > sizeof(unsigned long))
+   page->dma_addr[1] = addr >> 16 >> 16;
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..f014fd8c19a6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -174,8 +174,10 @@ static void page_pool_dma_sync_for_device(struct page_pool 
*pool,
  struct page *page,
  unsigned int dma_sync_size)
 {
+   dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+
dma_sync_size = min(dma_sync_size, pool->p.max_len);
-   dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
+   dma_sync_single_range_for_device(pool->p.dev, dma_addr,
 pool->p.offset, dma_sync_size,
 pool->p.dma_dir);
 }
@@ -226,7 +228,7 @@ static struct page *__page_pool_alloc_pages_slow(struct 
page_pool *pool,
put_page(page);
return NULL;
}
-   page->dma_addr = dma;
+   page_pool_set_dma_addr(page, dma);
 
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
@@ -294,13 +296,13 @@ void page_pool_release_page(struct page_pool *pool, 
struct page *page)
 */
goto skip_dma_unmap;
 
-   dma = page->dma_addr;
+   dma = page_pool_get_dma_addr(page);
 
-   /* When page is unmapped, it cannot be returned our pool */
+   /* When page is unmapped, it cannot be returned to our pool */
dma_unmap_page_attrs(pool->p.dev, dma,
 PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 DMA_ATTR_SKIP_CPU_SYNC);
-   page->dma_addr = 0;
+   page_pool_set_dma_addr(page, 0);
 skip_dma_unmap:

[PATCH 6/6] mm: Constify page_count and page_ref_count

2021-04-16 Thread Matthew Wilcox (Oracle)
Now that compound_head() accepts a const struct page pointer, these two
functions can be marked as not modifying the page pointer they are passed.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/page_ref.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index f3318f34fc54..7ad46f45df39 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -62,12 +62,12 @@ static inline void __page_ref_unfreeze(struct page *page, 
int v)
 
 #endif
 
-static inline int page_ref_count(struct page *page)
+static inline int page_ref_count(const struct page *page)
 {
return atomic_read(&page->_refcount);
 }
 
-static inline int page_count(struct page *page)
+static inline int page_count(const struct page *page)
 {
return atomic_read(&compound_head(page)->_refcount);
 }
-- 
2.30.2



[PATCH 5/6] mm: Constify get_pfnblock_flags_mask and get_pfnblock_migratetype

2021-04-16 Thread Matthew Wilcox (Oracle)
The struct page is not modified by these routines, so it can be marked
const.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/pageblock-flags.h |  2 +-
 mm/page_alloc.c | 13 +++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index fff52ad370c1..973fd731a520 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -54,7 +54,7 @@ extern unsigned int pageblock_order;
 /* Forward declaration */
 struct page;
 
-unsigned long get_pfnblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(const struct page *page,
unsigned long pfn,
unsigned long mask);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0152670c6f04..4be2179eedd5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -474,7 +474,7 @@ static inline bool defer_init(int nid, unsigned long pfn, 
unsigned long end_pfn)
 #endif
 
 /* Return a pointer to the bitmap storing bits affecting a block of pages */
-static inline unsigned long *get_pageblock_bitmap(struct page *page,
+static inline unsigned long *get_pageblock_bitmap(const struct page *page,
unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
@@ -484,7 +484,7 @@ static inline unsigned long *get_pageblock_bitmap(struct 
page *page,
 #endif /* CONFIG_SPARSEMEM */
 }
 
-static inline int pfn_to_bitidx(struct page *page, unsigned long pfn)
+static inline int pfn_to_bitidx(const struct page *page, unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
pfn &= (PAGES_PER_SECTION-1);
@@ -495,7 +495,7 @@ static inline int pfn_to_bitidx(struct page *page, unsigned 
long pfn)
 }
 
 static __always_inline
-unsigned long __get_pfnblock_flags_mask(struct page *page,
+unsigned long __get_pfnblock_flags_mask(const struct page *page,
unsigned long pfn,
unsigned long mask)
 {
@@ -520,13 +520,14 @@ unsigned long __get_pfnblock_flags_mask(struct page *page,
  *
  * Return: pageblock_bits flags
  */
-unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
-   unsigned long mask)
+unsigned long get_pfnblock_flags_mask(const struct page *page,
+   unsigned long pfn, unsigned long mask)
 {
return __get_pfnblock_flags_mask(page, pfn, mask);
 }
 
-static __always_inline int get_pfnblock_migratetype(struct page *page, 
unsigned long pfn)
+static __always_inline int get_pfnblock_migratetype(const struct page *page,
+   unsigned long pfn)
 {
return __get_pfnblock_flags_mask(page, pfn, MIGRATETYPE_MASK);
 }
-- 
2.30.2



[PATCH 4/6] mm: Make compound_head const-preserving

2021-04-16 Thread Matthew Wilcox (Oracle)
If you pass a const pointer to compound_head(), you get a const pointer
back; if you pass a mutable pointer, you get a mutable pointer back.
Also remove an unnecessary forward definition of struct page; we're about
to dereference page->compound_head, so it must already have been defined.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/page-flags.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 04a34c08e0a6..d8e26243db25 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -177,17 +177,17 @@ enum pageflags {
 
 #ifndef __GENERATING_BOUNDS_H
 
-struct page;   /* forward declaration */
-
-static inline struct page *compound_head(struct page *page)
+static inline unsigned long _compound_head(const struct page *page)
 {
unsigned long head = READ_ONCE(page->compound_head);
 
if (unlikely(head & 1))
-   return (struct page *) (head - 1);
-   return page;
+   return head - 1;
+   return (unsigned long)page;
 }
 
+#define compound_head(page)((typeof(page))_compound_head(page))
+
 static __always_inline int PageTail(struct page *page)
 {
return READ_ONCE(page->compound_head) & 1;
-- 
2.30.2



[PATCH 3/6] mm/page_owner: Constify dump_page_owner

2021-04-16 Thread Matthew Wilcox (Oracle)
dump_page_owner() only uses struct page to find the page_ext, and
lookup_page_ext() already takes a const argument.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/page_owner.h | 6 +++---
 mm/page_owner.c| 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
index 3468794f83d2..719bfe5108c5 100644
--- a/include/linux/page_owner.h
+++ b/include/linux/page_owner.h
@@ -14,7 +14,7 @@ extern void __set_page_owner(struct page *page,
 extern void __split_page_owner(struct page *page, unsigned int nr);
 extern void __copy_page_owner(struct page *oldpage, struct page *newpage);
 extern void __set_page_owner_migrate_reason(struct page *page, int reason);
-extern void __dump_page_owner(struct page *page);
+extern void __dump_page_owner(const struct page *page);
 extern void pagetypeinfo_showmixedcount_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone);
 
@@ -46,7 +46,7 @@ static inline void set_page_owner_migrate_reason(struct page 
*page, int reason)
if (static_branch_unlikely(&page_owner_inited))
__set_page_owner_migrate_reason(page, reason);
 }
-static inline void dump_page_owner(struct page *page)
+static inline void dump_page_owner(const struct page *page)
 {
if (static_branch_unlikely(&page_owner_inited))
__dump_page_owner(page);
@@ -69,7 +69,7 @@ static inline void copy_page_owner(struct page *oldpage, 
struct page *newpage)
 static inline void set_page_owner_migrate_reason(struct page *page, int reason)
 {
 }
-static inline void dump_page_owner(struct page *page)
+static inline void dump_page_owner(const struct page *page)
 {
 }
 #endif /* CONFIG_PAGE_OWNER */
diff --git a/mm/page_owner.c b/mm/page_owner.c
index adfabb560eb9..f51a57e92aa3 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -392,7 +392,7 @@ print_page_owner(char __user *buf, size_t count, unsigned 
long pfn,
return -ENOMEM;
 }
 
-void __dump_page_owner(struct page *page)
+void __dump_page_owner(const struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);
struct page_owner *page_owner;
-- 
2.30.2



[PATCH 2/6] mm/debug: Factor PagePoisoned out of __dump_page

2021-04-16 Thread Matthew Wilcox (Oracle)
Move the PagePoisoned test into dump_page().  Skip the hex print
for poisoned pages -- we know they're full of ffffffff.  Move the
reason printing from __dump_page() to dump_page().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 mm/debug.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/mm/debug.c b/mm/debug.c
index 84cdcd0f7bd3..e73fe0a8ec3d 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -42,11 +42,10 @@ const struct trace_print_flags vmaflag_names[] = {
{0, NULL}
 };
 
-static void __dump_page(struct page *page, const char *reason)
+static void __dump_page(struct page *page)
 {
struct page *head = compound_head(page);
struct address_space *mapping;
-   bool page_poisoned = PagePoisoned(page);
bool compound = PageCompound(page);
/*
 * Accessing the pageblock without the zone lock. It could change to
@@ -58,16 +57,6 @@ static void __dump_page(struct page *page, const char 
*reason)
int mapcount;
char *type = "";
 
-   /*
-* If struct page is poisoned don't access Page*() functions as that
-* leads to recursive loop. Page*() check for poisoned pages, and calls
-* dump_page() when detected.
-*/
-   if (page_poisoned) {
-   pr_warn("page:%px is uninitialized and poisoned", page);
-   goto hex_only;
-   }
-
if (page < head || (page >= head + MAX_ORDER_NR_PAGES)) {
/*
 * Corrupt page, so we cannot call page_mapping. Instead, do a
@@ -173,8 +162,6 @@ static void __dump_page(struct page *page, const char 
*reason)
 
pr_warn("%sflags: %#lx(%pGp)%s\n", type, head->flags, >flags,
page_cma ? " CMA" : "");
-
-hex_only:
print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
sizeof(unsigned long), page,
sizeof(struct page), false);
@@ -182,14 +169,16 @@ static void __dump_page(struct page *page, const char 
*reason)
print_hex_dump(KERN_WARNING, "head: ", DUMP_PREFIX_NONE, 32,
sizeof(unsigned long), head,
sizeof(struct page), false);
-
-   if (reason)
-   pr_warn("page dumped because: %s\n", reason);
 }
 
 void dump_page(struct page *page, const char *reason)
 {
-   __dump_page(page, reason);
+   if (PagePoisoned(page))
+   pr_warn("page:%p is uninitialized and poisoned", page);
+   else
+   __dump_page(page);
+   if (reason)
+   pr_warn("page dumped because: %s\n", reason);
dump_page_owner(page);
 }
 EXPORT_SYMBOL(dump_page);
-- 
2.30.2



[PATCH 1/6] mm: Make __dump_page static

2021-04-16 Thread Matthew Wilcox (Oracle)
The only caller of __dump_page() now opencodes dump_page(), so
remove it as an externally visible symbol.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/mmdebug.h | 3 +--
 mm/debug.c  | 2 +-
 mm/page_alloc.c | 3 +--
 3 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h
index 5d0767cb424a..1935d4c72d10 100644
--- a/include/linux/mmdebug.h
+++ b/include/linux/mmdebug.h
@@ -9,8 +9,7 @@ struct page;
 struct vm_area_struct;
 struct mm_struct;
 
-extern void dump_page(struct page *page, const char *reason);
-extern void __dump_page(struct page *page, const char *reason);
+void dump_page(struct page *page, const char *reason);
 void dump_vma(const struct vm_area_struct *vma);
 void dump_mm(const struct mm_struct *mm);
 
diff --git a/mm/debug.c b/mm/debug.c
index 0bdda8407f71..84cdcd0f7bd3 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -42,7 +42,7 @@ const struct trace_print_flags vmaflag_names[] = {
{0, NULL}
 };
 
-void __dump_page(struct page *page, const char *reason)
+static void __dump_page(struct page *page, const char *reason)
 {
struct page *head = compound_head(page);
struct address_space *mapping;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5a35f21b57c6..0152670c6f04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -658,8 +658,7 @@ static void bad_page(struct page *page, const char *reason)
 
pr_alert("BUG: Bad page state in process %s  pfn:%05lx\n",
current->comm, page_to_pfn(page));
-   __dump_page(page, reason);
-   dump_page_owner(page);
+   dump_page(page, reason);
 
print_modules();
dump_stack();
-- 
2.30.2



[RESEND][PATCH 0/6] Constify struct page arguments

2021-04-16 Thread Matthew Wilcox (Oracle)
[I'm told that patches 2-6 did not make it to the list; resending and
cc'ing lkml this time]

While working on various solutions to the 32-bit struct page size
regression, one of the problems I found was the networking stack expects
to be able to pass const struct page pointers around, and the mm doesn't
provide a lot of const-friendly functions to call.  The root tangle of
problems is that a lot of functions call VM_BUG_ON_PAGE(), which calls
dump_page(), which calls a lot of functions which don't take a const
struct page (but could be const).

I have other things I need to work on, but I offer these patches as a few
steps towards being able to make dump_page() take a const page pointer.

Matthew Wilcox (Oracle) (6):
  mm: Make __dump_page static
  mm/debug: Factor PagePoisoned out of __dump_page
  mm/page_owner: Constify dump_page_owner
  mm: Make compound_head const-preserving
  mm: Constify get_pfnblock_flags_mask and get_pfnblock_migratetype
  mm: Constify page_count and page_ref_count

 include/linux/mmdebug.h |  3 +--
 include/linux/page-flags.h  | 10 +-
 include/linux/page_owner.h  |  6 +++---
 include/linux/page_ref.h|  4 ++--
 include/linux/pageblock-flags.h |  2 +-
 mm/debug.c  | 25 +++--
 mm/page_alloc.c | 16 
 mm/page_owner.c |  2 +-
 8 files changed, 28 insertions(+), 40 deletions(-)

-- 
2.30.2



[PATCH 2/2] mm: Indicate pfmemalloc pages in compound_head

2021-04-16 Thread Matthew Wilcox (Oracle)
The net page_pool wants to use a magic value to identify page pool pages.
The best place to put it is in the first word where it can be clearly a
non-pointer value.  That means shifting dma_addr up to alias with ->index,
which means we need to find another way to indicate page_is_pfmemalloc().
Since page_pool doesn't want to set its magic value on pages which are
pfmemalloc, we can use bit 1 of compound_head to indicate that the page
came from the memory reserves.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/mm.h   | 12 +++-
 include/linux/mm_types.h |  7 +++
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..44eab3f6d5ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1629,10 +1629,12 @@ struct address_space *page_mapping_file(struct page 
*page);
 static inline bool page_is_pfmemalloc(const struct page *page)
 {
/*
-* Page index cannot be this large so this must be
-* a pfmemalloc page.
+* This is not a tail page; compound_head of a head page is unused
+* at return from the page allocator, and will be overwritten
+* by callers who do not care whether the page came from the
+* reserves.
 */
-   return page->index == -1UL;
+   return page->compound_head & 2;
 }
 
 /*
@@ -1641,12 +1643,12 @@ static inline bool page_is_pfmemalloc(const struct page 
*page)
  */
 static inline void set_page_pfmemalloc(struct page *page)
 {
-   page->index = -1UL;
+   page->compound_head = 2;
 }
 
 static inline void clear_page_pfmemalloc(struct page *page)
 {
-   page->index = 0;
+   page->compound_head = 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c10a45..39f7163dcace 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -96,10 +96,9 @@ struct page {
unsigned long private;
};
struct {/* page_pool used by netstack */
-   /**
-* @dma_addr: might require a 64-bit value on
-* 32-bit architectures.
-*/
+   unsigned long pp_magic;
+   unsigned long xmi;
+   unsigned long _pp_mapping_pad;
unsigned long dma_addr[2];
};
struct {/* slab, slob and slub */
-- 
2.30.2



[PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread Matthew Wilcox (Oracle)
32-bit architectures which expect 8-byte alignment for 8-byte integers
and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
page inadvertently expanded in 2019.  When the dma_addr_t was added,
it forced the alignment of the union to 8 bytes, which inserted a 4 byte
gap between 'flags' and the union.

Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
This restores the alignment to that of an unsigned long, and also fixes a
potential problem where (on a big endian platform), the bit used to denote
PageTail could inadvertently get set, and a racing get_user_pages_fast()
could dereference a bogus compound_head().

Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/mm_types.h |  4 ++--
 include/net/page_pool.h  | 12 +++-
 net/core/page_pool.c | 12 +++-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..5aacc1c10a45 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -97,10 +97,10 @@ struct page {
};
struct {/* page_pool used by netstack */
/**
-* @dma_addr: might require a 64-bit value even on
+* @dma_addr: might require a 64-bit value on
 * 32-bit architectures.
 */
-   dma_addr_t dma_addr;
+   unsigned long dma_addr[2];
};
struct {/* slab, slob and slub */
union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b5b195305346..db7c7020746a 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct 
page_pool *pool,
 
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-   return page->dma_addr;
+   dma_addr_t ret = page->dma_addr[0];
+   if (sizeof(dma_addr_t) > sizeof(unsigned long))
+   ret |= (dma_addr_t)page->dma_addr[1] << 32;
+   return ret;
+}
+
+static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+{
+   page->dma_addr[0] = addr;
+   if (sizeof(dma_addr_t) > sizeof(unsigned long))
+   page->dma_addr[1] = addr >> 32;
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..f014fd8c19a6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -174,8 +174,10 @@ static void page_pool_dma_sync_for_device(struct page_pool 
*pool,
  struct page *page,
  unsigned int dma_sync_size)
 {
+   dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+
dma_sync_size = min(dma_sync_size, pool->p.max_len);
-   dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
+   dma_sync_single_range_for_device(pool->p.dev, dma_addr,
 pool->p.offset, dma_sync_size,
 pool->p.dma_dir);
 }
@@ -226,7 +228,7 @@ static struct page *__page_pool_alloc_pages_slow(struct 
page_pool *pool,
put_page(page);
return NULL;
}
-   page->dma_addr = dma;
+   page_pool_set_dma_addr(page, dma);
 
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
@@ -294,13 +296,13 @@ void page_pool_release_page(struct page_pool *pool, 
struct page *page)
 */
goto skip_dma_unmap;
 
-   dma = page->dma_addr;
+   dma = page_pool_get_dma_addr(page);
 
-   /* When page is unmapped, it cannot be returned our pool */
+   /* When page is unmapped, it cannot be returned to our pool */
dma_unmap_page_attrs(pool->p.dev, dma,
 PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 DMA_ATTR_SKIP_CPU_SYNC);
-   page->dma_addr = 0;
+   page_pool_set_dma_addr(page, 0);
 skip_dma_unmap:
/* This may be the last page returned, releasing the pool, so
 * it is not safe to reference pool afterwards.
-- 
2.30.2



[PATCH 0/2] Change struct page layout for page_pool

2021-04-16 Thread Matthew Wilcox (Oracle)
The first patch here fixes two bugs on ppc32, and mips32.  It fixes one
bug on arc and arm32 (in certain configurations).  It probably makes
sense to get it in ASAP through the networking tree.  I'd like to see
testing on those four architectures if possible?

The second patch enables new functionality.  It is much less urgent.
I'd really like to see Mel & Michal's thoughts on it.

I have only compile-tested these patches.

Matthew Wilcox (Oracle) (2):
  mm: Fix struct page layout on 32-bit systems
  mm: Indicate pfmemalloc pages in compound_head

 include/linux/mm.h   | 12 +++-
 include/linux/mm_types.h |  9 -
 include/net/page_pool.h  | 12 +++-
 net/core/page_pool.c | 12 +++-
 4 files changed, 29 insertions(+), 16 deletions(-)

-- 
2.30.2



Re: [PATCH 00/13] [RFC] Rust support

2021-04-16 Thread Matthew Wilcox
On Fri, Apr 16, 2021 at 07:18:48PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 16, 2021 at 07:10:17PM +0200, Miguel Ojeda wrote:
> 
> > Of course, UB is only a subset of errors, but it is a major one, and
> > particularly critical for privileged code.
> 
> I've seen relatively few UBSAN warnings that weren't due to UBSAN being
> broken.

Lucky you.

84c34df158cf215b0cd1475ab3b8e6f212f81f23

(i'd argue this is C being broken; promoting only as far as int, when
assigning to an unsigned long is Bad, but until/unless either GCC fixes
that or the language committee realises that being stuck in the 1970s
is Bad, people are going to keep making this kind of mistake)
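
To make it concrete, the pattern that commit fixed is roughly this (the
variable names are made up here, not taken from the actual code):

	unsigned long mask;
	int bit = 37;			/* anything >= 32 */

	mask = 1 << bit;		/* "1" is promoted only to int: shift-out-of-bounds */
	mask = 1UL << bit;		/* intended: the shift happens in unsigned long */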


Re: [RFCv2 00/13] TDX and guest memory unmapping

2021-04-16 Thread Matthew Wilcox
On Fri, Apr 16, 2021 at 06:40:53PM +0300, Kirill A. Shutemov wrote:
> TDX integrity check failures may lead to system shutdown host kernel must
> not allow any writes to TD-private memory. This requirment clashes with
> KVM design: KVM expects the guest memory to be mapped into host userspace
> (e.g. QEMU).
> 
> This patchset aims to start discussion on how we can approach the issue.
> 
> The core of the change is in the last patch. Please see more detailed
> description of the issue and proposoal of the solution there.

This seems to have some parallels with s390's arch_make_page_accessible().
Is there any chance to combine the two, so we don't end up with duplicated
hooks all over the MM for this kind of thing?

https://patchwork.kernel.org/project/kvm/cover/20200214222658.12946-1-borntrae...@de.ibm.com/

and recent THP/Folio-related discussion:
https://lore.kernel.org/linux-mm/20210409194059.gw2531...@casper.infradead.org/


Re: [PATCH v7 02/28] mm: Introduce struct folio

2021-04-16 Thread Matthew Wilcox
On Fri, Apr 09, 2021 at 07:50:39PM +0100, Matthew Wilcox (Oracle) wrote:
> A struct folio is a new abstraction to replace the venerable struct page.
> A function which takes a struct folio argument declares that it will
> operate on the entire (possibly compound) page, not just PAGE_SIZE bytes.
> In return, the caller guarantees that the pointer it is passing does
> not point to a tail page.
> +++ b/include/linux/mm_types.h
[...]
> +static inline struct folio *page_folio(struct page *page)
> +{
> + unsigned long head = READ_ONCE(page->compound_head);
> +
> + if (unlikely(head & 1))
> + return (struct folio *)(head - 1);
> + return (struct folio *)page;
> +}

I'm looking at changing this for the next revision, and basing it on
my recent patch to make compound_head() const-preserving:

+#define page_folio(page)   _Generic((page),\
+   const struct page *:(const struct folio *)_compound_head(page), \
+   struct page *:  (struct folio *)_compound_head(page))

I've also noticed an awkward pattern occurring that I think this makes
less awkward:

+/**
+ * folio_page - Return a page from a folio.
+ * @folio: The folio.
+ * @n: The page number to return.
+ *
+ * @n is relative to the start of the folio.  It should be between
+ * 0 and folio_nr_pages(@folio) - 1, but this is not checked for.
+ */
+#define folio_page(folio, n)   nth_page(&(folio)->page, n)

That lets me simplify folio_next():

+static inline struct folio *folio_next(struct folio *folio)
+{
+   return (struct folio *)folio_page(folio, folio_nr_pages(folio));
+}

(it occurs to me this should also be const-preserving, but it's not clear
that's needed yet)
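
For completeness, a usage sketch of the _Generic page_folio() above (two
throwaway wrappers purely to show the typing; they're not part of the series):

	static inline const struct folio *ro_folio(const struct page *page)
	{
		return page_folio(page);	/* const in, const folio out, no cast */
	}

	static inline struct folio *rw_folio(struct page *page)
	{
		return page_folio(page);	/* mutable in, mutable folio out */
	}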


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread Matthew Wilcox
On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> See below patch.  Where I swap32 the dma address to satisfy
> page->compound having bit zero cleared. (It is the simplest fix I could
> come up with).

I think this is slightly simpler, and as a bonus code that assumes the
old layout won't compile.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..5aacc1c10a45 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -97,10 +97,10 @@ struct page {
};
struct {/* page_pool used by netstack */
/**
-* @dma_addr: might require a 64-bit value even on
+* @dma_addr: might require a 64-bit value on
 * 32-bit architectures.
 */
-   dma_addr_t dma_addr;
+   unsigned long dma_addr[2];
};
struct {/* slab, slob and slub */
union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b5b195305346..db7c7020746a 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct 
page_pool *pool,
 
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-   return page->dma_addr;
+   dma_addr_t ret = page->dma_addr[0];
+   if (sizeof(dma_addr_t) > sizeof(unsigned long))
+   ret |= (dma_addr_t)page->dma_addr[1] << 32;
+   return ret;
+}
+
+static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+{
+   page->dma_addr[0] = addr;
+   if (sizeof(dma_addr_t) > sizeof(unsigned long))
+   page->dma_addr[1] = addr >> 32;
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..f014fd8c19a6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -174,8 +174,10 @@ static void page_pool_dma_sync_for_device(struct page_pool 
*pool,
  struct page *page,
  unsigned int dma_sync_size)
 {
+   dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+
dma_sync_size = min(dma_sync_size, pool->p.max_len);
-   dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
+   dma_sync_single_range_for_device(pool->p.dev, dma_addr,
 pool->p.offset, dma_sync_size,
 pool->p.dma_dir);
 }
@@ -226,7 +228,7 @@ static struct page *__page_pool_alloc_pages_slow(struct 
page_pool *pool,
put_page(page);
return NULL;
}
-   page->dma_addr = dma;
+   page_pool_set_dma_addr(page, dma);
 
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
@@ -294,13 +296,13 @@ void page_pool_release_page(struct page_pool *pool, 
struct page *page)
 */
goto skip_dma_unmap;
 
-   dma = page->dma_addr;
+   dma = page_pool_get_dma_addr(page);
 
-   /* When page is unmapped, it cannot be returned our pool */
+   /* When page is unmapped, it cannot be returned to our pool */
dma_unmap_page_attrs(pool->p.dev, dma,
 PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 DMA_ATTR_SKIP_CPU_SYNC);
-   page->dma_addr = 0;
+   page_pool_set_dma_addr(page, 0);
 skip_dma_unmap:
/* This may be the last page returned, releasing the pool, so
 * it is not safe to reference pool afterwards.


Re: [PATCH 00/13] [RFC] Rust support

2021-04-16 Thread Matthew Wilcox
On Fri, Apr 16, 2021 at 02:07:49PM +0100, Wedson Almeida Filho wrote:
> There is nothing in C forcing developers to actually use DEFINE_MUTEX_GUARD. 
> So
> someone may simply forget (or not know that they need) to lock
> current->perf_event_mutex and directly access some field protected by it. This
> is unlikely to happen when one first writes the code, but over time as 
> different
> people modify the code and invariants change, it is possible for this to 
> happen.
> 
> In Rust, this isn't possible: the data protected by a lock is only accessible
> when the lock is locked. So developers cannot accidentally make mistakes of 
> this
> kind. And since the enforcement happens at compile time, there is no runtime
> cost.

Well, we could do that in C too.

struct unlocked_inode {
spinlock_t i_lock;
};

struct locked_inode {
spinlock_t i_lock;
unsigned short i_bytes;
blkcnt_t i_blocks;
};

struct locked_inode *lock_inode(struct unlocked_inode *inode)
{
spin_lock(&inode->i_lock);
return (struct locked_inode *)inode;
}
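
A caller then looks something like this (sketch only; unlock_inode() and
inode_add_bytes_demo() are made up, the former just casting back and
dropping the spinlock):

	static void unlock_inode(struct locked_inode *inode)
	{
		spin_unlock(&inode->i_lock);
	}

	static void inode_add_bytes_demo(struct unlocked_inode *inode,
			unsigned short bytes)
	{
		struct locked_inode *locked = lock_inode(inode);

		locked->i_bytes += bytes;	/* only reachable via the locked type */
		unlock_inode(locked);
	}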

There's a combinatoric explosion when you have multiple locks in a data
structure that protect different things in it (and things in a data
structure that are protected by locks outside that data structure),
but I'm not sufficiently familiar with Rust to know if/how it solves
that problem.

Anyway, my point is that if we believe this is a sufficiently useful
feature to have, and we're willing to churn the kernel, it's less churn
to do this than it is to rewrite in Rust.

> Another scenario: suppose within perf_event_task_enable you need to call a
> function that requires the mutex to be locked and that will unlock it for you 
> on
> error (or unconditionally, doesn't matter). How would you do that in C? In 
> Rust,
> there is a clean idiomatic way of transferring ownership of a guard (or any
> other object) such that the previous owner cannot continue to use it after
> ownership is transferred. Again, this is enforced at compile time. I'm happy 
> to
> provide a small example if that would help.

I think we could do that too with an __attribute__((free)).  It isn't,
of course, actually freeing the pointer to the locked_inode, but it will
make the compiler think the pointer is invalid after the function returns.

(hm, looks like gcc doesn't actually have __attribute__((free)) yet.
that's unfortunate.  there's a potential solution in gcc-11 that might
do what we need)



Re: [PATCH] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-16 Thread Matthew Wilcox
On Fri, Apr 16, 2021 at 11:37:19AM +0200, Peter Enderborg wrote:
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 6fa761c9cc78..3c1a82b51a6f 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -16,6 +16,7 @@
>  #ifdef CONFIG_CMA
>  #include <linux/cma.h>
>  #endif
> +#include <linux/dma-buf.h>
>  #include <asm/page.h>
>  #include "internal.h"
>  
> @@ -145,6 +146,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>   show_val_kb(m, "CmaFree:",
>   global_zone_page_state(NR_FREE_CMA_PAGES));
>  #endif
> + show_val_kb(m, "DmaBufTotal:", dma_buf_get_size());
>  
>   hugetlb_report_meminfo(m);
>  

... and if CONFIG_DMA_SHARED_BUFFER is not set ...?
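
ie dma_buf_get_size() needs a stub in the header or this won't even build.
Something along these lines (a sketch; I'm guessing at the exact return type):

	#ifdef CONFIG_DMA_SHARED_BUFFER
	unsigned long dma_buf_get_size(void);
	#else
	static inline unsigned long dma_buf_get_size(void)
	{
		return 0;
	}
	#endif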


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread Matthew Wilcox
On Fri, Apr 16, 2021 at 07:32:35AM +, David Laight wrote:
> From: Matthew Wilcox 
> > Sent: 15 April 2021 23:22
> > 
> > On Thu, Apr 15, 2021 at 09:11:56PM +, David Laight wrote:
> > > Isn't it possible to move the field down one long?
> > > This might require an explicit zero - but this is not a common
> > > code path - the extra write will be noise.
> > 
> > Then it overlaps page->mapping.  See emails passim.
> 
> The rules on overlaps make be wonder if every 'long'
> should be in its own union.

That was what we used to have.  It was worse.

> The comments would need to say when each field is used.
> It would, at least, make these errors less common.
> 
> That doesn't solve the 64bit dma_addr though.
> 
> Actually rather that word-swapping dma_addr on 32bit BE
> could you swap over the two fields it overlays with.
> That might look messy in the .h, but it doesn't require
> an accessor function to do the swap - easily missed.

No.


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-15 Thread Matthew Wilcox
On Thu, Apr 15, 2021 at 09:11:56PM +, David Laight wrote:
> Isn't it possible to move the field down one long?
> This might require an explicit zero - but this is not a common
> code path - the extra write will be noise.

Then it overlaps page->mapping.  See emails passim.


Re: [Outreachy kernel] [PATCH v2] staging: media: atomisp: pci: Change line break to avoid an open parenthesis at the end of the line

2021-04-15 Thread Matthew Wilcox
On Thu, Apr 15, 2021 at 08:57:04PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 15, 2021 at 10:49:55PM +0300, Sakari Ailus wrote:
> > On Thu, Apr 15, 2021 at 06:14:09PM +0100, Matthew Wilcox wrote:
> > > On Thu, Apr 15, 2021 at 02:08:19PM -0300, Aline Santana Cordeiro wrote:
> > > > -const struct atomisp_format_bridge 
> > > > *get_atomisp_format_bridge_from_mbus(
> > > > -u32 mbus_code);
> > > > +const struct atomisp_format_bridge*
> > > > +get_atomisp_format_bridge_from_mbus(u32 mbus_code);
> > > 
> > > No, this does not match coding style.  Probably best to break the
> > > 80-column guideline in this instance.  Best would be to have a function
> > 
> > Having the return type on the previous line is perfectly fine. There should
> > be a space before the asterisk though.
> 
> No, it's not.  Linus has ranted about that before.

Found it.  
https://lore.kernel.org/lkml/1054519757.161...@palladium.transmeta.com/


Re: [Outreachy kernel] [PATCH v2] staging: media: atomisp: pci: Change line break to avoid an open parenthesis at the end of the line

2021-04-15 Thread Matthew Wilcox
On Thu, Apr 15, 2021 at 10:49:55PM +0300, Sakari Ailus wrote:
> On Thu, Apr 15, 2021 at 06:14:09PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 15, 2021 at 02:08:19PM -0300, Aline Santana Cordeiro wrote:
> > > -const struct atomisp_format_bridge *get_atomisp_format_bridge_from_mbus(
> > > -u32 mbus_code);
> > > +const struct atomisp_format_bridge*
> > > +get_atomisp_format_bridge_from_mbus(u32 mbus_code);
> > 
> > No, this does not match coding style.  Probably best to break the
> > 80-column guideline in this instance.  Best would be to have a function
> 
> Having the return type on the previous line is perfectly fine. There should
> be a space before the asterisk though.

No, it's not.  Linus has ranted about that before.


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-15 Thread Matthew Wilcox
On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> +static inline
> +dma_addr_t page_pool_dma_addr_read(dma_addr_t dma_addr)
> +{
> + /* Workaround for storing 64-bit DMA-addr on 32-bit machines in struct
> +  * page.  The page->dma_addr share area with page->compound_head which
> +  * use bit zero to mark compound pages. This is okay, as DMA-addr are
> +  * aligned pointers which have bit zero cleared.
> +  *
> +  * In the 32-bit case, page->compound_head is 32-bit.  Thus, when
> +  * dma_addr_t is 64-bit it will be located in top 32-bit.  Solve by
> +  * swapping dma_addr 32-bit segments.
> +  */
> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT

#if defined(CONFIG_ARCH_DMA_ADDR_T_64BIT) && defined(__BIG_ENDIAN)
otherwise you'll create the problem on ARM that you're avoiding on PPC ...

I think you want to delete the word '_read' from this function name because
you're using it for both read and write.



Re: [Outreachy kernel] [PATCH v2] staging: media: atomisp: pci: Change line break to avoid an open parenthesis at the end of the line

2021-04-15 Thread Matthew Wilcox
On Thu, Apr 15, 2021 at 02:08:19PM -0300, Aline Santana Cordeiro wrote:
> -const struct atomisp_format_bridge *get_atomisp_format_bridge_from_mbus(
> -u32 mbus_code);
> +const struct atomisp_format_bridge*
> +get_atomisp_format_bridge_from_mbus(u32 mbus_code);

No, this does not match coding style.  Probably best to break the
80-column guideline in this instance.  Best would be to have a function
and/or struct name that isn't so ridiculously long, but that would
require some in-depth thinking.
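
In the meantime, just keep the return type and the name together and let the
line run long, ie:

	const struct atomisp_format_bridge *get_atomisp_format_bridge_from_mbus(u32 mbus_code);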

> -void atomisp_apply_css_parameters(
> -struct atomisp_sub_device *asd,
> -struct atomisp_css_params *css_param);
> +void atomisp_apply_css_parameters(struct atomisp_sub_device *asd,
> +   struct atomisp_css_params *css_param);
> +

Good.



Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-15 Thread Matthew Wilcox
On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
> With memory accounting disable, the run time was 2.848s. With memory
> accounting enabled, the run times with the application of various
> patches in the patchset were:
> 
>   Applied patches   Run time   Accounting overhead   Overhead %age
>   ---      ---   -
>None  10.800s 7.952s  100.0%
> 1-2   9.140s 6.292s   79.1%
> 1-3   7.641s 4.793s   60.3%
> 1-5   6.801s 3.953s   49.7%

I think this is a misleading way to report the overhead.  I would have said:

	Run time   Accounting overhead   Overhead %age
	 10.800s         7.952s             279.2%
	  9.140s         6.292s             220.9%
	  7.641s         4.793s             168.3%
	  6.801s         3.953s             138.8%



Re: [PATCH 5.10 12/25] radix tree test suite: Fix compilation

2021-04-15 Thread Matthew Wilcox
On Thu, Apr 15, 2021 at 04:48:06PM +0200, Greg Kroah-Hartman wrote:
> From: Matthew Wilcox (Oracle) 
> 
> [ Upstream commit 7487de534dcbe143e6f41da751dd3ffcf93b00ee ]
> 
> Commit 4bba4c4bb09a added tools/include/linux/compiler_types.h which
> includes linux/compiler-gcc.h.  Unfortunately, we had our own (empty)
> compiler_types.h which overrode the one added by that commit, and
> so we lost the definition of __must_be_array().  Removing our empty
> compiler_types.h fixes the problem and reduces our divergence from the
> rest of the tools.

I don't see 4bba4c4bb09a backported to 5.10.y, so I think this will break
compilation of the radix tree test suite.  The corresponding commit for
5.11.y is good, since 5.11.y includes 4bba4c4bb09a.


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread Matthew Wilcox
On Wed, Apr 14, 2021 at 09:13:22PM +0200, Jesper Dangaard Brouer wrote:
> (If others want to reproduce).  First I could not reproduce on ARM32.
> Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
> cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.

hmmm ... you should be able to provoke it by enabling ARM_LPAE,
which selects PHYS_ADDR_T_64BIT, and

config ARCH_DMA_ADDR_T_64BIT
def_bool 64BIT || PHYS_ADDR_T_64BIT

>  struct page {
> long unsigned int  flags;/* 0 4 */
> 
> /* XXX 4 bytes hole, try to pack */
> 
> union {
> struct {
> struct list_head lru;/* 8 8 */
> struct address_space * mapping;  /*16 4 */
> long unsigned int index; /*20 4 */
> long unsigned int private;   /*24 4 */
> };   /* 820 */
> struct {
> dma_addr_t dma_addr; /* 8 8 */
> };   /* 8 8 */
[...]
> } __attribute__((__aligned__(8)));   /* 824 */
> union {
> atomic_t   _mapcount;/*32 4 */
> unsigned int   page_type;/*32 4 */
> unsigned int   active;   /*32 4 */
> intunits;/*32 4 */
> };   /*32 4 */
> atomic_t   _refcount;/*36 4 */
> 
> /* size: 40, cachelines: 1, members: 4 */
> /* sum members: 36, holes: 1, sum holes: 4 */
> /* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
> /* last cacheline: 40 bytes */
> } __attribute__((__aligned__(8)));

If you also enable CONFIG_MEMCG or enough options to make
LAST_CPUPID_NOT_IN_PAGE_FLAGS true, you'll end up with another 4-byte
hole at the end.


Re: [PATCH 00/13] [RFC] Rust support

2021-04-14 Thread Matthew Wilcox
On Wed, Apr 14, 2021 at 08:45:51PM +0200, oj...@kernel.org wrote:
>   - Manish Goregaokar implemented the fallible `Box`, `Arc`, and `Rc`
> allocator APIs in Rust's `alloc` standard library for us.

There's a philosophical point to be discussed here which you're skating
right over!  Should rust-in-the-linux-kernel provide the same memory
allocation APIs as the rust-standard-library, or should it provide a Rusty
API to the standard-linux-memory-allocation APIs?  You seem to be doing
both ... there was a wrapper around alloc_pages() in the Binder patches,
and then you talk about Box, Arc and Rc here.

Maybe there's some details about when one can use one kind of API and
when to use another.  But I fear that we'll have Rust code at interrupt
level trying to use allocators which assume that they can sleep, and
things will go badly wrong.

By the way, I don't think that Rust necessarily has to conform to the
current way that Linux works.  If this prompted us to track the current
context (inside spinlock, handling interrupt, performing writeback, etc)
and do away with (some) GFP flags, that's not the end of the world.
We're already moving in that direction to a certain extent with the
scoped memory allocation APIs to replace GFP_NOFS / GFP_NOIO.
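
For reference, the scoped style on the C side looks roughly like this (the
function is a made-up example; memalloc_nofs_save()/restore() are the real
APIs):

	static void *read_metadata_demo(size_t size)
	{
		unsigned int nofs = memalloc_nofs_save();
		void *buf;

		/* Any allocation in this scope implicitly behaves as GFP_NOFS. */
		buf = kmalloc(size, GFP_KERNEL);

		memalloc_nofs_restore(nofs);
		return buf;
	}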


Re: [PATCH 09/13] Samples: Rust examples

2021-04-14 Thread Matthew Wilcox
On Wed, Apr 14, 2021 at 09:42:26PM +0200, Miguel Ojeda wrote:
> On Wed, Apr 14, 2021 at 9:34 PM Linus Torvalds
>  wrote:
> >
> > Honestly, I'd like to see a real example. This is fine for testing,
> > but I'd like to see something a bit more real, and a bit less special
> > than the Android "binder" WIP that comes a few patches later.
> >
> > Would there be some kind of real driver or something that people could
> > use as a example of a real piece of code that actually does something
> > meaningful?
> 
> Yeah, we are planning to write a couple of drivers that talk to actual
> hardware. Not sure which ones we will do, but we will have them
> written.

I'd suggest NVMe as a target.  It's readily available, both as real
hardware and in (eg) qemu.  The spec is freely available, and most devices
come pretty close to conforming to the spec until you start to push hard
at the edges.  Also then you can do performance tests and see where you
might want to focus performance efforts.


Re: [PATCH 01/13] kallsyms: Support "big" kernel symbols (2-byte lengths)

2021-04-14 Thread Matthew Wilcox
On Wed, Apr 14, 2021 at 08:45:52PM +0200, oj...@kernel.org wrote:
> Increasing to 255 is not enough in some cases, and therefore
> we need to introduce 2-byte lengths to the symbol table. We call
> these "big" symbols.
> 
> In order to avoid increasing all lengths to 2 bytes (since most
> of them only require 1 byte, including many Rust ones), we use
> length zero to mark "big" symbols in the table.

How about doing something a bit more utf-8-like?

len = data[0];
if (len == 0)
error
else if (len < 128)
return len;
else if (len < 192)
return 128 + (len - 128) * 256 + data[1];
... that takes you all the way out to 16511 bytes.  You probably don't
even need the third byte option.  But if you do ...
else if (len < 224)
return 16512 + (len - 192) * 256 * 256 +
data[1] * 256 + data[2];
which takes you all the way out to 2,113,663 bytes and leaves 224-255 unused.
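
Or, written out as real code (a sketch, not the actual kallsyms parser;
two-byte variant only):

	/* Returns the symbol length; *skip is how many length bytes were used. */
	static unsigned int sym_len(const u8 *data, unsigned int *skip)
	{
		unsigned int len = data[0];

		if (len == 0) {			/* reserved */
			*skip = 1;
			return 0;
		}
		if (len < 128) {
			*skip = 1;
			return len;
		}
		/* 128..191: two-byte lengths covering 128..16511 */
		*skip = 2;
		return 128 + (len - 128) * 256 + data[1];
	}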

Alternatively, if the symbols are really this long, perhaps we should not
do string matches.  A sha-1 (... or whatever ...) hash of the function
name is 160 bits.  Expressed as hex digits, that's 40 characters.
Expressed in base-64, it's 27 characters.  We'd also want a "pretty"
name to go along with the hash, but that seems preferable to printing
out a mangled-with-types-and-who-knows-what name.

> Co-developed-by: Alex Gaynor 
> Signed-off-by: Alex Gaynor 

If you have C-d-b, you don't also need S-o-b.



Re: [PATCH] mm: Optimise nth_page for contiguous memmap

2021-04-14 Thread Matthew Wilcox
On Wed, Apr 14, 2021 at 05:24:42PM +0200, David Hildenbrand wrote:
> On 13.04.21 21:46, Matthew Wilcox (Oracle) wrote:
> > +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> >   #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
> > +#else
> > +#define nth_page(page,n) ((page) + (n))
> > +#endif
> 
> For sparsemem we could optimize within a single memory section. But not sure
> if it's worth the trouble.

Not only is it not worth the trouble, I suspect it's more expensive to
test-and-branch than just unconditionally call pfn_to_page() and
page_to_pfn().  That said, I haven't measured.

SPARSEMEM_VMEMMAP is default Y, and enabled by arm64, ia64, powerpc,
riscv, s390, sparc and x86.  I mean ... do we care any more?

> Reviewed-by: David Hildenbrand 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 


Re: [PATCH v2 10/16] mm: multigenerational lru: mm_struct list

2021-04-14 Thread Matthew Wilcox
On Tue, Apr 13, 2021 at 12:56:27AM -0600, Yu Zhao wrote:
> In order to scan page tables, we add an infrastructure to maintain
> either a system-wide mm_struct list or per-memcg mm_struct lists.
> Multiple threads can concurrently work on the same mm_struct list, and
> each of them will be given a different mm_struct.
> 
> This infrastructure also tracks whether an mm_struct is being used on
> any CPUs or has been used since the last time a worker looked at it.
> In other words, workers will not be given an mm_struct that belongs to
> a process that has been sleeping.

This seems like a great use for an allocating XArray.  You can use a
search mark to indicate whether it's been used since the last time a
worker looked at it.
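
Roughly like this (mm_table, the helpers and the mark name are invented for
the sketch; the point is that xa_alloc() plus a search mark replaces both the
hand-rolled list and the "used since last scan" tracking):

	static DEFINE_XARRAY_ALLOC(mm_table);
	#define XA_MM_USED	XA_MARK_1

	static int mm_table_add(struct mm_struct *mm, u32 *id)
	{
		return xa_alloc(&mm_table, id, mm, xa_limit_32b, GFP_KERNEL);
	}

	static void mm_table_mark_used(u32 id)
	{
		xa_set_mark(&mm_table, id, XA_MM_USED);
	}

	static void mm_table_scan(void (*walk)(struct mm_struct *))
	{
		struct mm_struct *mm;
		unsigned long index;

		xa_for_each_marked(&mm_table, index, mm, XA_MM_USED) {
			xa_clear_mark(&mm_table, index, XA_MM_USED);
			walk(mm);
		}
	}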



Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread Matthew Wilcox
On Wed, Apr 14, 2021 at 10:10:44AM +0200, Jesper Dangaard Brouer wrote:
> Yes, indeed! - And very frustrating.  It's keeping me up at night.
> I'm dreaming about 32 vs 64 bit data structures. My fitbit stats tell
> me that I don't sleep well with these kind of dreams ;-)

Then you're going to love this ... even with the latest patch, there's
still a problem.  Because dma_addr_t is still 64-bit aligned _as a type_,
that forces the union to be 64-bit aligned (as we already knew and worked
around), but what I'd forgotten is that forces the entirety of struct
page to be 64-bit aligned.  Which means ...

/* size: 40, cachelines: 1, members: 4 */
/* padding: 4 */
/* forced alignments: 1 */
/* last cacheline: 40 bytes */
} __attribute__((__aligned__(8)));

.. that we still have a hole!  It's just moved from being at offset 4
to being at offset 36.

> That said, I think we need to have a quicker fix for the immediate
> issue with 64-bit bit dma_addr on 32-bit arch and the misalignment hole
> it leaves[3] in struct page.  In[3] you mention ppc32, does it only
> happens on certain 32-bit archs?

AFAICT it happens on mips32, ppc32, arm32 and arc.  It doesn't happen
on x86-32 because dma_addr_t is 32-bit aligned.

Doing this fixes it:

+++ b/include/linux/types.h
@@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
  * so they don't care about the size of the actual bus addresses.
  */
 #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
-typedef u64 dma_addr_t;
+typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
 #else
 typedef u32 dma_addr_t;
 #endif

> I'm seriously considering removing page_pool's support for doing/keeping
> DMA-mappings on 32-bit arch's.  AFAIK only a single driver use this.

... if you're going to do that, then we don't need to do this.


Re: [Outreachy kernel] [PATCH] :staging: rtl8723bs: Remove useless led_blink_hdl()

2021-04-13 Thread Matthew Wilcox
On Tue, Apr 13, 2021 at 09:45:03PM +0200, Fabio M. De Francesco wrote:
> 1) The driver doesn't call that function from anywhere else than the macro.
> 2) You have explained that the macro add its symbol to a slot in an array 
> that would shift all the subsequent elements down if that macro is not used 
> exactly in the line where it is.
> 3) Dan Carpenter said that that array is full of null functions (or empty 
> slots?).
> 
> Unless that function is called anonymously dereferencing its address from 
> the position it occupies in the array, I'm not able to see what else means 
> can any caller use.
> 
> I know I have much less experience than you with C: what can go wrong?

Here's where the driver calls that function:

$ git grep wlancmds drivers/staging/rtl8723bs/
drivers/staging/rtl8723bs/core/rtw_cmd.c:static struct cmd_hdl wlancmds[] = {
drivers/staging/rtl8723bs/core/rtw_cmd.c:   if (pcmd->cmdcode < 
ARRAY_SIZE(wlancmds)) {
drivers/staging/rtl8723bs/core/rtw_cmd.c:   cmd_hdl = 
wlancmds[pcmd->cmdcode].h2cfuns;



[PATCH] mm: Optimise nth_page for contiguous memmap

2021-04-13 Thread Matthew Wilcox (Oracle)
If the memmap is virtually contiguous (either because we're using
a virtually mapped memmap or because we don't support a discontig
memmap at all), then we can implement nth_page() by simple addition.
Contrary to popular belief, the compiler is not able to optimise this
itself for a vmemmap configuration.  This reduces one example user (sg.c)
by four instructions:

struct page *page = nth_page(rsv_schp->pages[k], offset >> PAGE_SHIFT);

before:
   49 8b 45 70             mov    0x70(%r13),%rax
   48 63 c9                movslq %ecx,%rcx
   48 c1 eb 0c             shr    $0xc,%rbx
   48 8b 04 c8             mov    (%rax,%rcx,8),%rax
   48 2b 05 00 00 00 00    sub    0x0(%rip),%rax
                           R_X86_64_PC32   vmemmap_base-0x4
   48 c1 f8 06             sar    $0x6,%rax
   48 01 d8                add    %rbx,%rax
   48 c1 e0 06             shl    $0x6,%rax
   48 03 05 00 00 00 00    add    0x0(%rip),%rax
                           R_X86_64_PC32   vmemmap_base-0x4

after:
   49 8b 45 70             mov    0x70(%r13),%rax
   48 63 c9                movslq %ecx,%rcx
   48 c1 eb 0c             shr    $0xc,%rbx
   48 c1 e3 06             shl    $0x6,%rbx
   48 03 1c c8             add    (%rax,%rcx,8),%rbx

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 include/linux/mm.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25b9041f9925..2327f99b121f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -234,7 +234,11 @@ int overcommit_policy_handler(struct ctl_table *, int, 
void *, size_t *,
 int __add_to_page_cache_locked(struct page *page, struct address_space 
*mapping,
pgoff_t index, gfp_t gfp, void **shadowp);
 
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
+#else
+#define nth_page(page,n) ((page) + (n))
+#endif
 
 /* to align the pointer to the (next) page boundary */
 #define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
-- 
2.30.2



Re: [PATCH v7] mm: Add set/end/wait functions for PG_private_2

2021-04-13 Thread Matthew Wilcox
On Tue, Apr 13, 2021 at 04:11:41PM +0100, David Howells wrote:
> Suggested-by: Linus Torvalds 
> Signed-off-by: David Howells 
> Tested-by: Jeff Layton 
> Tested-by: Dave Wysochanski 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [PATCH v3 01/18] vfs: add fileattr ops

2021-04-13 Thread Matthew Wilcox
On Thu, Mar 25, 2021 at 08:37:38PM +0100, Miklos Szeredi wrote:
> @@ -107,6 +110,8 @@ fiemap:   no
>  update_time: no
>  atomic_open: shared (exclusive if O_CREAT is set in open flags)
>  tmpfile: no
> +fileattr_get:no or exclusive
> +fileattr_set:exclusive
>   =

This introduces a warning to `make htmldocs`:

/home/willy/kernel/folio/Documentation/filesystems/locking.rst:113: WARNING: 
Malformed table.
Text in column margin in table line 24.

You need to add an extra '=' to the first batch of '=' (on all three lines of
the table).  Like this:

@@ -87,9 +87,9 @@ prototypes::
 locking rules:
all may block
 
-   =
+=  =
 opsi_rwsem(inode)
-   =
+=  =
 lookup:shared
 create:exclusive
 link:  exclusive (both)
@@ -112,7 +112,7 @@ atomic_open:shared (exclusive if O_CREAT is set in 
open flags)
 tmpfile:   no
 fileattr_get:  no or exclusive
 fileattr_set:  exclusive
-   =
+=  =

(whitespace damaged)



Re: [Outreachy kernel] [PATCH] staging: rtl8723bs: hal: Remove camelcase

2021-04-12 Thread Matthew Wilcox
On Mon, Apr 12, 2021 at 11:02:58PM +0200, Fabio M. De Francesco wrote:
> Removed camelcase in (some) symbols. Further work is needed.

What would be more useful for this driver is making it use
drivers/net/wireless/realtek/rtlwifi/btcoexist/ which has already
graduated out of staging.  I haven't checked how closely it matches,
but this is surely a better use of time than "cleaning up" this version.


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-12 Thread Matthew Wilcox
On Sun, Apr 11, 2021 at 11:43:07AM +0200, Jesper Dangaard Brouer wrote:
> Could you explain your intent here?
> I worry about @index.
> 
> As I mentioned in other thread[1] netstack use page_is_pfmemalloc()
> (code copy-pasted below signature) which imply that the member @index
> have to be kept intact. In above, I'm unsure @index is untouched.

Well, I tried three different approaches.  Here's the one I hated the least.

From: "Matthew Wilcox (Oracle)" 
Date: Sat, 10 Apr 2021 16:12:06 -0400
Subject: [PATCH] mm: Fix struct page layout on 32-bit systems

32-bit architectures which expect 8-byte alignment for 8-byte integers
and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
page inadvertently expanded in 2019.  When the dma_addr_t was added,
it forced the alignment of the union to 8 bytes, which inserted a 4 byte
gap between 'flags' and the union.

We could fix this by telling the compiler to use a smaller alignment
for the dma_addr, but that seems a little fragile.  Instead, move the
'flags' into the union.  That causes dma_addr to shift into the same
bits as 'mapping', which causes problems with page_mapping() called from
set_page_dirty() in the munmap path.  To avoid this, insert three words
of padding and use the same bits as ->index and ->private, neither of
which have to be cleared on free.

However, page->index is currently used to indicate page_is_pfmemalloc.
Move that information to bit 1 of page->lru (aka compound_head).  This
has the same properties; it will be overwritten by callers who do
not care about pfmemalloc (as opposed to using a bit in page->flags).

Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/mm.h   | 12 +++-
 include/linux/mm_types.h | 38 ++
 2 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b58c73e50da0..23cca0eaa9da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1668,10 +1668,12 @@ struct address_space *page_mapping(struct page *page);
 static inline bool page_is_pfmemalloc(const struct page *page)
 {
/*
-* Page index cannot be this large so this must be
-* a pfmemalloc page.
+* This is not a tail page; compound_head of a head page is unused
+* at return from the page allocator, and will be overwritten
+* by callers who do not care whether the page came from the
+* reserves.
 */
-   return page->index == -1UL;
+   return page->compound_head & 2;
 }
 
 /*
@@ -1680,12 +1682,12 @@ static inline bool page_is_pfmemalloc(const struct page 
*page)
  */
 static inline void set_page_pfmemalloc(struct page *page)
 {
-   page->index = -1UL;
+   page->compound_head = 2;
 }
 
 static inline void clear_page_pfmemalloc(struct page *page)
 {
-   page->index = 0;
+   page->compound_head = 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..45c563e9b50e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -68,16 +68,22 @@ struct mem_cgroup;
 #endif
 
 struct page {
-   unsigned long flags;/* Atomic flags, some possibly
-* updated asynchronously */
/*
-* Five words (20/40 bytes) are available in this union.
-* WARNING: bit 0 of the first word is used for PageTail(). That
-* means the other users of this union MUST NOT use the bit to
+* This union is six words (24 / 48 bytes) in size.
+* The first word is reserved for atomic flags, often updated
+* asynchronously.  Use the PageFoo() macros to access it.  Some
+* of the flags can be reused for your own purposes, but the
+* word as a whole often contains other information and overwriting
+* it will cause functions like page_zone() and page_node() to stop
+* working correctly.
+*
+* Bit 0 of the second word is used for PageTail(). That
+* means the other users of this union MUST leave the bit zero to
 * avoid collision and false-positive PageTail().
 */
union {
struct {/* Page cache and anonymous pages */
+   unsigned long flags;
/**
 * @lru: Pageout list, eg. active_list protected by
 * lruvec->lru_lock.  Sometimes used as a generic list
@@ -96,13 +102,14 @@ struct page {
unsigned long private;
};
struct {/* page_pool used by netstack */
-   /**
-* @dma_addr: might require a 64-bit value even on
-* 32-bit architectures.
-*/
- 

Re: [PATCH 02/17] locking: Add split_lock

2021-04-12 Thread Matthew Wilcox
On Mon, Apr 12, 2021 at 04:29:28PM +0200, Thomas Gleixner wrote:
> On Fri, Apr 09 2021 at 03:51, Matthew Wilcox wrote:
> > Bitlocks do not currently participate in lockdep.  Conceptually, a
> > bit_spinlock is a split lock, eg across each bucket in a hash table.
> > The struct split_lock gives us somewhere to record the lockdep_map.
> 
> I like the concept, but the name is strange. The only purpose of 
> 
> > +struct split_lock {
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > +   struct lockdep_map dep_map;
> > +#endif
> > +};
> 
> is to have a place to stick the lockdep map into. So it's not a lock
> construct as the name suggests, it's just auxiliary data when lockdep is
> enabled.

That's the implementation _today_, but conceptually, it's a single lock.
I was thinking that for non-RT, we could put a qspinlock in there for a
thread to spin on if the bit is contended.  It'd need a bit of ingenuity
to make sure that a thread unlocking a bitlock made sure that a thread
spinning on the qspinlock saw the wakeup, but it should be doable.

Anyway, from the point of view of the user, they should be declaring
"this is the lock", not "this is the lock tracking structure", right?

> I know you hinted that RT could make use of that data structure and the
> fact that it's mandatory for the various lock functions, but that's not
> really feasible if this is related to a hash with a bit spinlock per
> bucket as the data structure is hash global.
> 
> Sure, we can do pointer math to find out the bucket index and do
> something from there, but I'm not sure whether that really helps. Need
> to stare at the remaining few places where bit spinlocks are an issue on
> RT.

I obviously don't understand exactly what the RT patchset does.  My
thinking was that you could handle the bit locks like rw sems, and
sacrifice the scalability of per-bucket-lock for the determinism of
a single PI lock.
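
For the hash-table case, the usage I'm picturing is roughly this (the
split_bit_lock()/split_bit_unlock() names are illustrative, not necessarily
what the patch ends up calling them):

	struct bucket {
		unsigned long flags;		/* bit 0 is the per-bucket lock bit */
		struct hlist_head chain;
	};

	static struct split_lock table_lock;	/* one lockdep_map shared by all buckets */
	static struct bucket table[1024];

	static void bucket_lock(struct bucket *b)
	{
		/* illustrative API: bit_spin_lock() on bit 0, lockdep tracked via table_lock */
		split_bit_lock(0, &b->flags, &table_lock);
	}

	static void bucket_unlock(struct bucket *b)
	{
		split_bit_unlock(0, &b->flags, &table_lock);
	}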


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-11 Thread Matthew Wilcox
On Sun, Apr 11, 2021 at 11:33:18AM +0100, Matthew Wilcox wrote:
> Basically, we have three aligned dwords here.  We can either alias with
> @flags and the first word of @lru, or the second word of @lru and @mapping,
> or @index and @private.  @flags is a non-starter.  If we use @mapping,
> then you have to set it to NULL before you free it, and I'm not sure
> how easy that will be for you.  If that's trivial, then we could use
> the layout:
> 
>   unsigned long _pp_flags;
>   unsigned long pp_magic;
>   union {
>   dma_addr_t dma_addr;/* might be one or two words */
>   unsigned long _pp_align[2];
>   };
>   unsigned long pp_pfmemalloc;
>   unsigned long xmi;

I forgot about the munmap path.  That calls zap_page_range() which calls
set_page_dirty() which calls page_mapping().  If we use page->mapping,
that's going to get interpreted as an address_space pointer.

*sigh*.  Foiled at every turn.

I'm kind of inclined towards using two (or more) bits for PageSlab as
we discussed here:

https://lore.kernel.org/linux-mm/01000163efe179fe-d6270c58-eaba-482f-a6bd-334667250ef7-000...@email.amazonses.com/

so we have PageKAlloc that's true for PageSlab, PagePool, PageDMAPool,
PageVMalloc, PageFrag and maybe a few other kernel-internal allocations.

(see also here:)
https://lore.kernel.org/linux-mm/20180518194519.3820-18-wi...@infradead.org/


Re: Bogus struct page layout on 32-bit

2021-04-11 Thread Matthew Wilcox
On Sat, Apr 10, 2021 at 09:10:47PM +0200, Arnd Bergmann wrote:
> On Sat, Apr 10, 2021 at 4:44 AM Matthew Wilcox  wrote:
> > +   dma_addr_t dma_addr __packed;
> > };
> > struct {/* slab, slob and slub */
> > union {
> >
> > but I don't know if GCC is smart enough to realise that dma_addr is now
> > on an 8 byte boundary and it can use a normal instruction to access it,
> > or whether it'll do something daft like use byte loads to access it.
> >
> > We could also do:
> >
> > +   dma_addr_t dma_addr __packed __aligned(sizeof(void 
> > *));
> >
> > and I see pahole, at least sees this correctly:
> >
> > struct {
> > long unsigned int _page_pool_pad; /* 4 4 */
> > dma_addr_t dma_addr 
> > __attribute__((__aligned__(4))); /* 8 8 */
> > } __attribute__((__packed__)) 
> > __attribute__((__aligned__(4)));
> >
> > This presumably affects any 32-bit architecture with a 64-bit phys_addr_t
> > / dma_addr_t.  Advice, please?
> 
> I've tried out what gcc would make of this:  https://godbolt.org/z/aTEbxxbG3
> 
> struct page {
> short a;
> struct {
> short b;
> long long c __attribute__((packed, aligned(2)));
> } __attribute__((packed));
> } __attribute__((aligned(8)));
> 
> In this structure, 'c' is clearly aligned to eight bytes, and gcc does
> realize that
> it is safe to use the 'ldrd' instruction for 32-bit arm, which is forbidden on
> struct members with less than 4 byte alignment. However, it also complains
> that passing a pointer to 'c' into a function that expects a 'long long' is 
> not
> allowed because alignof(c) is only '2' here.
> 
> (I used 'short' here because having a 64-bit member misaligned by four
> bytes wouldn't make a difference to the instructions on Arm, or any other
> 32-bit architecture I can think of, regardless of the ABI requirements).

So ... we could do this:

+++ b/include/linux/types.h
@@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
  * so they don't care about the size of the actual bus addresses.
  */
 #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
-typedef u64 dma_addr_t;
+typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
 #else
 typedef u32 dma_addr_t;
 #endif

but I'm a little scared that this might have unintended consequences.
And Jesper points out that a big-endian 64-bit dma_addr_t can impersonate
a PageTail page, and we should solve that problem while we're at it.
So I don't think we should do this, but thought I should mention it as
a possibility.


Re: [PATCH v2] mm: eliminate "expecting prototype" kernel-doc warnings

2021-04-11 Thread Matthew Wilcox
On Sun, Apr 11, 2021 at 02:06:42PM -0700, Randy Dunlap wrote:
> Fix stray kernel-doc warnings in mm/ due to mis-typed or missing
> function names.
> 
> Quietens these kernel-doc warnings:
> 
> ../mm/mmu_gather.c:264: warning: expecting prototype for tlb_gather_mmu(). 
> Prototype was for __tlb_gather_mmu() instead
> ../mm/oom_kill.c:180: warning: expecting prototype for Check whether 
> unreclaimable slab amount is greater than(). Prototype was for 
> should_dump_unreclaim_slab() instead
> ../mm/shuffle.c:155: warning: expecting prototype for shuffle_free_memory(). 
> Prototype was for __shuffle_free_memory() instead
> 
> Signed-off-by: Randy Dunlap 
> Cc: Andrew Morton 
> Cc: linux...@kvack.org
> Cc: Matthew Wilcox 

Reviewed-by: Matthew Wilcox (Oracle) 


Re: [PATCH] mm: eliminate "expecting prototype" kernel-doc warnings

2021-04-11 Thread Matthew Wilcox
On Sun, Apr 11, 2021 at 10:43:21AM -0700, Randy Dunlap wrote:
> +++ linux-next-20210409/mm/mmu_gather.c
> @@ -250,7 +250,7 @@ void tlb_flush_mmu(struct mmu_gather *tl
>  }
>  
>  /**
> - * tlb_gather_mmu - initialize an mmu_gather structure for page-table 
> tear-down
> + * __tlb_gather_mmu - initialize an mmu_gather structure for page-table 
> tear-down
>   * @tlb: the mmu_gather structure to initialize
>   * @mm: the mm_struct of the target address space
>   * @fullmm: @mm is without users and we're going to destroy the full address

I think this is the wrong fix.  __tlb_gather_mmu is static, so documenting
it isn't going to do much good.  Instead, this doc should be moved
down to tlb_gather_mmu().  For bonus points, add documentation for
tlb_gather_mmu_fullmm().

> --- linux-next-20210409.orig/mm/oom_kill.c
> +++ linux-next-20210409/mm/oom_kill.c
> @@ -171,10 +171,11 @@ static bool oom_unkillable_task(struct t
>  }
>  
>  /**
> - * Check whether unreclaimable slab amount is greater than
> - * all user memory(LRU pages).
> + * should_dump_unreclaim_slab - Check whether unreclaimable slab amount
> + * is greater than all user memory (LRU pages).
> + *
>   * dump_unreclaimable_slab() could help in the case that
> - * oom due to too much unreclaimable slab used by kernel.
> + * oom is due to too much unreclaimable slab used by kernel.
>  */
>  static bool should_dump_unreclaim_slab(void)

This is static.  I'd just remove the second '*' and turn it into a
non-kernel-doc comment.

>  {
> --- linux-next-20210409.orig/mm/shuffle.c
> +++ linux-next-20210409/mm/shuffle.c
> @@ -148,7 +148,7 @@ void __meminit __shuffle_zone(struct zon
>  }
>  
>  /**
> - * shuffle_free_memory - reduce the predictability of the page allocator
> + * __shuffle_free_memory - reduce the predictability of the page allocator
>   * @pgdat: node page data
>   */
>  void __meminit __shuffle_free_memory(pg_data_t *pgdat)

Nobody calls __shuffle_free_memory() directly.  If anything, the doc
should be moved to shuffle_free_memory().  But since it has precisely
one caller, and it's within mm/, I'm more inclined to leave this comment
where it is and turn it into a non-kernel-doc comment.  Thoughts?


Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-11 Thread Matthew Wilcox
On Sun, Apr 11, 2021 at 11:43:07AM +0200, Jesper Dangaard Brouer wrote:
> On Sat, 10 Apr 2021 21:52:45 +0100
> "Matthew Wilcox (Oracle)"  wrote:
> 
> > 32-bit architectures which expect 8-byte alignment for 8-byte integers
> > and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> > page inadvertently expanded in 2019.  When the dma_addr_t was added,
> > it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> > gap between 'flags' and the union.
> > 
> > We could fix this by telling the compiler to use a smaller alignment
> > for the dma_addr, but that seems a little fragile.  Instead, move the
> > 'flags' into the union.  That causes dma_addr to shift into the same
> > bits as 'mapping', so it would have to be cleared on free.  To avoid
> > this, insert three words of padding and use the same bits as ->index
> > and ->private, neither of which have to be cleared on free.
> > 
> > Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > ---
> >  include/linux/mm_types.h | 38 ++
> >  1 file changed, 26 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6613b26a8894..45c563e9b50e 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -68,16 +68,22 @@ struct mem_cgroup;
> >  #endif
> >  
> >  struct page {
> > -   unsigned long flags;/* Atomic flags, some possibly
> > -* updated asynchronously */
> > /*
> > -* Five words (20/40 bytes) are available in this union.
> > -* WARNING: bit 0 of the first word is used for PageTail(). That
> > -* means the other users of this union MUST NOT use the bit to
> > +* This union is six words (24 / 48 bytes) in size.
> > +* The first word is reserved for atomic flags, often updated
> > +* asynchronously.  Use the PageFoo() macros to access it.  Some
> > +* of the flags can be reused for your own purposes, but the
> > +* word as a whole often contains other information and overwriting
> > +* it will cause functions like page_zone() and page_node() to stop
> > +* working correctly.
> > +*
> > +* Bit 0 of the second word is used for PageTail(). That
> > +* means the other users of this union MUST leave the bit zero to
> >  * avoid collision and false-positive PageTail().
> >  */
> > union {
> > struct {/* Page cache and anonymous pages */
> > +   unsigned long flags;
> > /**
> >  * @lru: Pageout list, eg. active_list protected by
> >  * lruvec->lru_lock.  Sometimes used as a generic list
> > @@ -96,13 +102,14 @@ struct page {
> > unsigned long private;
> > };
> > struct {/* page_pool used by netstack */
> > -   /**
> > -* @dma_addr: might require a 64-bit value even on
> > -* 32-bit architectures.
> > -*/
> > -   dma_addr_t dma_addr;
> 
> The original intend of placing member @dma_addr here is that it overlap
> with @LRU (type struct list_head) which contains two pointers.  Thus, in
> case of CONFIG_ARCH_DMA_ADDR_T_64BIT=y on 32-bit architectures it would
> use both pointers.
> 
> Thinking more about this, this design is flawed as bit 0 of the first
> word is used for compound pages (see PageTail and @compound_head), is
> reserved.  We knew DMA addresses were aligned, thus we though this
> satisfied that need.  BUT for DMA_ADDR_T_64BIT=y on 32-bit arch the
> first word will contain the "upper" part of the DMA addr, which I don't
> think gives this guarantee.
> 
> I guess nobody is using this combination?!?  I thought we added this
> to satisfy the TI (Texas Instruments) driver cpsw (code in
> drivers/net/ethernet/ti/cpsw*).  Thus, I assumed it was in use?

It may be in use, but we've got away with it?  It's relatively rare
that this is going to bite us.  I think what has to happen is:

page is mapped to userspace
task calls get_user_page_fast(), loads the PTE

page is unmapped & freed
page is reallocated to the page_pool
page is DMA mapped to an address that happens to have that bit set

task looks for the compound_head() of that PTE, and attempts to bump
the refcount.  *oops*

If it has happened, would it have turned into a bug report?
If we had seen such a bug repor

[PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-10 Thread Matthew Wilcox (Oracle)
32-bit architectures which expect 8-byte alignment for 8-byte integers
and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
page inadvertently expanded in 2019.  When the dma_addr_t was added,
it forced the alignment of the union to 8 bytes, which inserted a 4 byte
gap between 'flags' and the union.

We could fix this by telling the compiler to use a smaller alignment
for the dma_addr, but that seems a little fragile.  Instead, move the
'flags' into the union.  That causes dma_addr to shift into the same
bits as 'mapping', so it would have to be cleared on free.  To avoid
this, insert three words of padding and use the same bits as ->index
and ->private, neither of which have to be cleared on free.

Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/mm_types.h | 38 ++
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..45c563e9b50e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -68,16 +68,22 @@ struct mem_cgroup;
 #endif
 
 struct page {
-   unsigned long flags;/* Atomic flags, some possibly
-* updated asynchronously */
/*
-* Five words (20/40 bytes) are available in this union.
-* WARNING: bit 0 of the first word is used for PageTail(). That
-* means the other users of this union MUST NOT use the bit to
+* This union is six words (24 / 48 bytes) in size.
+* The first word is reserved for atomic flags, often updated
+* asynchronously.  Use the PageFoo() macros to access it.  Some
+* of the flags can be reused for your own purposes, but the
+* word as a whole often contains other information and overwriting
+* it will cause functions like page_zone() and page_node() to stop
+* working correctly.
+*
+* Bit 0 of the second word is used for PageTail(). That
+* means the other users of this union MUST leave the bit zero to
 * avoid collision and false-positive PageTail().
 */
union {
struct {/* Page cache and anonymous pages */
+   unsigned long flags;
/**
 * @lru: Pageout list, eg. active_list protected by
 * lruvec->lru_lock.  Sometimes used as a generic list
@@ -96,13 +102,14 @@ struct page {
unsigned long private;
};
struct {/* page_pool used by netstack */
-   /**
-* @dma_addr: might require a 64-bit value even on
-* 32-bit architectures.
-*/
-   dma_addr_t dma_addr;
+   unsigned long _pp_flags;
+   unsigned long pp_magic;
+   unsigned long xmi;
+   unsigned long _pp_mapping_pad;
+   dma_addr_t dma_addr;/* might be one or two words */
};
struct {/* slab, slob and slub */
+   unsigned long _slab_flags;
union {
struct list_head slab_list;
struct {/* Partial pages */
@@ -130,6 +137,7 @@ struct page {
};
};
struct {/* Tail pages of compound page */
+   unsigned long _t1_flags;
unsigned long compound_head;/* Bit zero is set */
 
/* First tail page only */
@@ -139,12 +147,14 @@ struct page {
unsigned int compound_nr; /* 1 << compound_order */
};
struct {/* Second tail page of compound page */
+   unsigned long _t2_flags;
unsigned long _compound_pad_1;  /* compound_head */
atomic_t hpage_pinned_refcount;
/* For both global and memcg */
struct list_head deferred_list;
};
struct {/* Page table pages */
+   unsigned long _pt_flags;
unsigned long _pt_pad_1;/* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2;/* mapping */
@@ -159,6 +169,7 @@ struct page {
 #endif
};
struct {/* ZONE_DEVICE pages */
+   unsigned long _zd_flags;
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
  

[PATCH 0/1] Fix struct page layout on 32-bit systems

2021-04-10 Thread Matthew Wilcox (Oracle)
I'd really appreciate people testing this, particularly on
arm32/mips32/ppc32 systems with a 64-bit dma_addr_t.

Matthew Wilcox (Oracle) (1):
  mm: Fix struct page layout on 32-bit systems

 include/linux/mm_types.h | 38 ++
 1 file changed, 26 insertions(+), 12 deletions(-)

-- 
2.30.2



Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-10 Thread Matthew Wilcox
On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote:
> > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > can not be used.
> 
> Yes it can, since it's going to be used as your default allocator for
> payloads, which might end up on an SKB.
> So we have to keep the extra added field on struct page for our mark.
> Matthew had an interesting idea.  He suggested keeping it, but changing the 
> magic number, so it can't be a kernel address, but I'll let him follow 
> up on the details.

Sure!  So, given the misalignment problem I discovered yesterday [1],
we probably want a page_pool page to look like:

unsigned long   flags;
unsigned long   pp_magic;
unsigned long   xmi;
unsigned long   _pp_mapping_pad;
dma_addr_t  dma_addr;   /* might be one or two words */

The only real restriction here is that pp_magic should not be a valid
pointer, and it must have the bottom bit clear.  I'd recommend something
like:

#define PP_MAGIC (0x20 + POISON_POINTER_DELTA)

This leaves page->mapping as NULL, so you don't have to worry about
clearing it before free.
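
As a minimal sketch (not something in this thread) of what the recycle path
could then key off, assuming a pp_magic field laid out as above; the helper
name is made up for illustration:

#include <linux/poison.h>	/* POISON_POINTER_DELTA */

#define PP_MAGIC	(0x20 + POISON_POINTER_DELTA)

/* Only recycle pages that page_pool stamped itself at allocation time. */
static inline bool page_pool_page_is_ours(const struct page *page)
{
	/* bit 0 of PP_MAGIC is clear, so it can never look like a tail page */
	return page->pp_magic == PP_MAGIC;
}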

[1] 
https://lore.kernel.org/linux-mm/20210410024313.gx2531...@casper.infradead.org/



Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-10 Thread Matthew Wilcox
On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> This is needed by the page_pool to avoid recycling a page not allocated
> via page_pool.

Is the PageType mechanism more appropriate to your needs?  It wouldn't
be if you use page->_mapcount (ie mapping it to userspace).
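
For reference, a minimal sketch of what the PageType route would look like,
modelled on the existing PG_buddy/PG_offline users; the PG_pool bit value is
an assumption, not an existing definition.  Because page->page_type aliases
_mapcount, this only works while the page is never mapped to userspace.

/* include/linux/page-flags.h, sketch only */
#define PG_pool		0x00000800	/* assumed next free page_type bit */

PAGE_TYPE_OPS(Pool, pool)	/* generates PagePool(), __SetPagePool(), __ClearPagePool() */

/* page_pool would then __SetPagePool() at allocation, test PagePool()
 * before recycling, and __ClearPagePool() when releasing the page. */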

> Signed-off-by: Matteo Croce 
> ---
>  include/linux/mm_types.h | 1 +
>  include/net/page_pool.h  | 2 ++
>  net/core/page_pool.c | 4 
>  3 files changed, 7 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a8894..ef2d0d5f62e4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -101,6 +101,7 @@ struct page {
>* 32-bit architectures.
>*/
>   dma_addr_t dma_addr;
> + unsigned long signature;
>   };
>   struct {/* slab, slob and slub */
>   union {
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b5b195305346..b30405e84b5e 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -63,6 +63,8 @@
>   */
>  #define PP_ALLOC_CACHE_SIZE  128
>  #define PP_ALLOC_CACHE_REFILL64
> +#define PP_SIGNATURE 0x20210303
> +
>  struct pp_alloc_cache {
>   u32 count;
>   void *cache[PP_ALLOC_CACHE_SIZE];
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..2ae9b554ef98 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct 
> page_pool *pool,
>   page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>  
>  skip_dma_map:
> + page->signature = PP_SIGNATURE;
> +
>   /* Track how many pages are held 'in-flight' */
>   pool->pages_state_hold_cnt++;
>  
> @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, 
> struct page *page)
>DMA_ATTR_SKIP_CPU_SYNC);
>   page->dma_addr = 0;
>  skip_dma_unmap:
> + page->signature = 0;
> +
>   /* This may be the last page returned, releasing the pool, so
>* it is not safe to reference pool afterwards.
>*/
> -- 
> 2.30.2
> 


Re: Bogus struct page layout on 32-bit

2021-04-10 Thread Matthew Wilcox
How about moving the flags into the union?  A bit messy, but we don't
have to play games with __packed__.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1210a8e41fad..f374d2f06255 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -68,16 +68,22 @@ struct mem_cgroup;
 #endif
 
 struct page {
-   unsigned long flags;/* Atomic flags, some possibly
-* updated asynchronously */
/*
-* Five words (20/40 bytes) are available in this union.
-* WARNING: bit 0 of the first word is used for PageTail(). That
-* means the other users of this union MUST NOT use the bit to
+* This union is six words (24 / 48 bytes) in size.
+* The first word is reserved for atomic flags, often updated
+* asynchronously.  Use the PageFoo() macros to access it.  Some
+* of the flags can be reused for your own purposes, but the
+* word as a whole often contains other information and overwriting
+* it will cause functions like page_zone() and page_node() to stop
+* working correctly.
+*
+* Bit 0 of the second word is used for PageTail(). That
+* means the other users of this union MUST leave the bit zero to
 * avoid collision and false-positive PageTail().
 */
union {
struct {/* Page cache and anonymous pages */
+   unsigned long flags;
/**
 * @lru: Pageout list, eg. active_list protected by
 * lruvec->lru_lock.  Sometimes used as a generic list
@@ -96,6 +102,8 @@ struct page {
unsigned long private;
};
struct {/* page_pool used by netstack */
+   unsigned long _pp_flags;
+   unsigned long _pp_pad;
/**
 * @dma_addr: might require a 64-bit value even on
 * 32-bit architectures.
@@ -103,6 +111,7 @@ struct page {
dma_addr_t dma_addr;
};
struct {/* slab, slob and slub */
+   unsigned long _slab_flags;
union {
struct list_head slab_list;
struct {/* Partial pages */
@@ -130,6 +139,7 @@ struct page {
};
};
struct {/* Tail pages of compound page */
+   unsigned long _tail1_flags;
unsigned long compound_head;/* Bit zero is set */
 
/* First tail page only */
@@ -139,12 +149,14 @@ struct page {
unsigned int compound_nr; /* 1 << compound_order */
};
struct {/* Second tail page of compound page */
+   unsigned long _tail2_flags;
unsigned long _compound_pad_1;  /* compound_head */
atomic_t hpage_pinned_refcount;
/* For both global and memcg */
struct list_head deferred_list;
};
struct {/* Page table pages */
+   unsigned long _pt_flags;
unsigned long _pt_pad_1;/* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2;/* mapping */
@@ -159,6 +171,7 @@ struct page {
 #endif
};
struct {/* ZONE_DEVICE pages */
+   unsigned long _zd_flags;
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
void *zone_device_data;
@@ -174,8 +187,11 @@ struct page {
 */
};
 
-   /** @rcu_head: You can use this to free a page by RCU. */
-   struct rcu_head rcu_head;
+   struct {
+   unsigned long _rcu_flags;
+   /** @rcu_head: You can use this to free a page by RCU. 
*/
+   struct rcu_head rcu_head;
+   };
};
 
union { /* This union is 4 bytes in size. */


Bogus struct page layout on 32-bit

2021-04-09 Thread Matthew Wilcox
On Sat, Apr 10, 2021 at 06:45:35AM +0800, kernel test robot wrote:
> >> include/linux/mm_types.h:274:1: error: static_assert failed due to 
> >> requirement '__builtin_offsetof(struct page, lru) == 
> >> __builtin_offsetof(struct folio, lru)' "offsetof(struct page, lru) == 
> >> offsetof(struct folio, lru)"
>FOLIO_MATCH(lru, lru);
>include/linux/mm_types.h:272:2: note: expanded from macro 'FOLIO_MATCH'
>static_assert(offsetof(struct page, pg) == offsetof(struct folio, 
> fl))

Well, this is interesting.  pahole reports:

struct page {
long unsigned int  flags;/* 0 4 */
/* XXX 4 bytes hole, try to pack */
union {
struct {
struct list_head lru;/* 8 8 */
...
struct folio {
union {
struct {
long unsigned int flags; /* 0 4 */
struct list_head lru;/* 4 8 */

so this assert has absolutely done its job.

But why has this assert triggered?  Why is struct page layout not what
we thought it was?  Turns out it's the dma_addr added in 2019 by commit
c25fff7171be ("mm: add dma_addr_t to struct page").  On this particular
config, it's 64-bit, and ppc32 requires alignment to 64-bit.  So
the whole union gets moved out by 4 bytes.

Unfortunately, we can't just fix this by putting an 'unsigned long pad'
in front of it.  It still aligns the entire union to 8 bytes, and then
it skips another 4 bytes after the pad.

We can fix it like this ...

+++ b/include/linux/mm_types.h
@@ -96,11 +96,12 @@ struct page {
unsigned long private;
};
struct {/* page_pool used by netstack */
+   unsigned long _page_pool_pad;
/**
 * @dma_addr: might require a 64-bit value even on
 * 32-bit architectures.
 */
-   dma_addr_t dma_addr;
+   dma_addr_t dma_addr __packed;
};
struct {/* slab, slob and slub */
union {

but I don't know if GCC is smart enough to realise that dma_addr is now
on an 8 byte boundary and it can use a normal instruction to access it,
or whether it'll do something daft like use byte loads to access it.

We could also do:

+   dma_addr_t dma_addr __packed __aligned(sizeof(void *));

and I see pahole, at least sees this correctly:

struct {
long unsigned int _page_pool_pad; /* 4 4 */
dma_addr_t dma_addr __attribute__((__aligned__(4))); /* 
8 8 */
} __attribute__((__packed__)) __attribute__((__aligned__(4)));  

This presumably affects any 32-bit architecture with a 64-bit phys_addr_t
/ dma_addr_t.  Advice, please?
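
For anyone who wants to poke at this without a kernel tree, here is a
self-contained sketch (mine; the struct names are stand-ins for the relevant
corner of struct page) that prints the offsets.  It needs a 32-bit ABI that
aligns u64 to 8 bytes (Arm EABI, ppc32, mips o32); x86 -m32 aligns u64 to 4
and will not show the hole.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef uint64_t dma_addr_t;

struct page_plain {		/* current layout: dma_addr drags the union to offset 8 */
	unsigned long flags;
	union {
		struct {
			struct { void *next, *prev; } lru;	/* stand-in for struct list_head */
		};
		struct {
			dma_addr_t dma_addr;
		};
	};
};

struct page_padded {		/* pad word plus __packed __aligned(sizeof(void *)) */
	unsigned long flags;
	union {
		struct {
			struct { void *next, *prev; } lru;
		};
		struct {
			unsigned long _page_pool_pad;
			dma_addr_t dma_addr __attribute__((packed, aligned(sizeof(void *))));
		};
	};
};

int main(void)
{
	printf("plain:  lru at %zu, dma_addr at %zu, sizeof %zu\n",
	       offsetof(struct page_plain, lru),
	       offsetof(struct page_plain, dma_addr), sizeof(struct page_plain));
	printf("padded: lru at %zu, dma_addr at %zu, sizeof %zu\n",
	       offsetof(struct page_padded, lru),
	       offsetof(struct page_padded, dma_addr), sizeof(struct page_padded));
	return 0;
}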


[PATCH v7 28/28] mm/filemap: Convert page wait queues to be folios

2021-04-09 Thread Matthew Wilcox (Oracle)
Reinforce that if we're waiting for a bit in a struct page, that's
actually in the head page by changing the type from page to folio.
Increases the size of cachefiles by two bytes, but the kernel core
is unchanged in size.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 fs/cachefiles/rdwr.c| 16 
 include/linux/pagemap.h |  8 
 mm/filemap.c| 38 +++---
 3 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 8ffc40e84a59..364af267ebaa 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -25,20 +25,20 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait, 
unsigned mode,
struct cachefiles_object *object;
struct fscache_retrieval *op = monitor->op;
struct wait_page_key *key = _key;
-   struct page *page = wait->private;
+   struct folio *folio = wait->private;
 
ASSERT(key);
 
_enter("{%lu},%u,%d,{%p,%u}",
   monitor->netfs_page->index, mode, sync,
-  key->page, key->bit_nr);
+  key->folio, key->bit_nr);
 
-   if (key->page != page || key->bit_nr != PG_locked)
+   if (key->folio != folio || key->bit_nr != PG_locked)
return 0;
 
-   _debug("--- monitor %p %lx ---", page, page->flags);
+   _debug("--- monitor %p %lx ---", folio, folio->flags);
 
-   if (!PageUptodate(page) && !PageError(page)) {
+   if (!FolioUptodate(folio) && !FolioError(folio)) {
/* unlocked, not uptodate and not erronous? */
_debug("page probably truncated");
}
@@ -107,7 +107,7 @@ static int cachefiles_read_reissue(struct cachefiles_object 
*object,
put_page(backpage2);
 
INIT_LIST_HEAD(&monitor->op_link);
-   add_page_wait_queue(backpage, &monitor->monitor);
+   add_folio_wait_queue(page_folio(backpage), &monitor->monitor);
 
if (trylock_page(backpage)) {
ret = -EIO;
@@ -294,7 +294,7 @@ static int cachefiles_read_backing_file_one(struct 
cachefiles_object *object,
get_page(backpage);
monitor->back_page = backpage;
monitor->monitor.private = backpage;
-   add_page_wait_queue(backpage, &monitor->monitor);
+   add_folio_wait_queue(page_folio(backpage), &monitor->monitor);
monitor = NULL;
 
/* but the page may have been read before the monitor was installed, so
@@ -548,7 +548,7 @@ static int cachefiles_read_backing_file(struct 
cachefiles_object *object,
get_page(backpage);
monitor->back_page = backpage;
monitor->monitor.private = backpage;
-   add_page_wait_queue(backpage, &monitor->monitor);
+   add_folio_wait_queue(page_folio(backpage), &monitor->monitor);
monitor = NULL;
 
/* but the page may have been read before the monitor was
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 5bff48eb..17df86f2bcde 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -690,13 +690,13 @@ static inline pgoff_t linear_page_index(struct 
vm_area_struct *vma,
 }
 
 struct wait_page_key {
-   struct page *page;
+   struct folio *folio;
int bit_nr;
int page_match;
 };
 
 struct wait_page_queue {
-   struct page *page;
+   struct folio *folio;
int bit_nr;
wait_queue_entry_t wait;
 };
@@ -704,7 +704,7 @@ struct wait_page_queue {
 static inline bool wake_page_match(struct wait_page_queue *wait_page,
  struct wait_page_key *key)
 {
-   if (wait_page->page != key->page)
+   if (wait_page->folio != key->folio)
   return false;
key->page_match = 1;
 
@@ -859,7 +859,7 @@ int wait_on_page_private_2_killable(struct page *page);
 /*
  * Add an arbitrary waiter to a page's wait queue
  */
-extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
+void add_folio_wait_queue(struct folio *folio, wait_queue_entry_t *waiter);
 
 /*
  * Fault everything in given userspace address range in.
diff --git a/mm/filemap.c b/mm/filemap.c
index dfdc04130c5b..bc0021632c47 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1019,11 +1019,11 @@ EXPORT_SYMBOL(__page_cache_alloc);
  */
 #define PAGE_WAIT_TABLE_BITS 8
 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
-static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] 
__cacheline_aligned;
+static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] 
__cacheline_aligned;
 
-static wait_queue_head_t *page_waitqueue(struct page *page)
+static wait_queue_head_t *folio_waitqueue(struct folio *folio)
 {
-   return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
+   return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];

[PATCH v7 27/28] mm/filemap: Convert wake_up_page_bit to wake_up_folio_bit

2021-04-09 Thread Matthew Wilcox (Oracle)
All callers have a folio, so use it directly.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 mm/filemap.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8f07e21a8f29..dfdc04130c5b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1121,14 +1121,14 @@ static int wake_page_function(wait_queue_entry_t *wait, 
unsigned mode, int sync,
return (flags & WQ_FLAG_EXCLUSIVE) != 0;
 }
 
-static void wake_up_page_bit(struct page *page, int bit_nr)
+static void wake_up_folio_bit(struct folio *folio, int bit_nr)
 {
-   wait_queue_head_t *q = page_waitqueue(page);
+   wait_queue_head_t *q = page_waitqueue(&folio->page);
struct wait_page_key key;
unsigned long flags;
wait_queue_entry_t bookmark;
 
-   key.page = page;
+   key.page = &folio->page;
key.bit_nr = bit_nr;
key.page_match = 0;
 
@@ -1163,7 +1163,7 @@ static void wake_up_page_bit(struct page *page, int 
bit_nr)
 * page waiters.
 */
if (!waitqueue_active(q) || !key.page_match) {
-   ClearPageWaiters(page);
+   ClearFolioWaiters(folio);
/*
 * It's possible to miss clearing Waiters here, when we woke
 * our page waiters, but the hashed waitqueue has waiters for
@@ -1179,7 +1179,7 @@ static void wake_up_folio(struct folio *folio, int bit)
 {
if (!FolioWaiters(folio))
return;
-   wake_up_page_bit(&folio->page, bit);
+   wake_up_folio_bit(folio, bit);
 }
 
 /*
@@ -1444,7 +1444,7 @@ void unlock_folio(struct folio *folio)
BUILD_BUG_ON(PG_waiters != 7);
VM_BUG_ON_FOLIO(!FolioLocked(folio), folio);
if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0)))
-   wake_up_page_bit(&folio->page, PG_locked);
+   wake_up_folio_bit(folio, PG_locked);
 }
 EXPORT_SYMBOL(unlock_folio);
 
@@ -1461,11 +1461,12 @@ EXPORT_SYMBOL(unlock_folio);
  */
 void end_page_private_2(struct page *page)
 {
-   page = compound_head(page);
-   VM_BUG_ON_PAGE(!PagePrivate2(page), page);
-   clear_bit_unlock(PG_private_2, &page->flags);
-   wake_up_page_bit(page, PG_private_2);
-   put_page(page);
+   struct folio *folio = page_folio(page);
+
+   VM_BUG_ON_FOLIO(!FolioPrivate2(folio), folio);
+   clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
+   wake_up_folio_bit(folio, PG_private_2);
+   put_folio(folio);
 }
 EXPORT_SYMBOL(end_page_private_2);
 
-- 
2.30.2



[PATCH v7 26/28] mm/filemap: Convert wait_on_page_bit to wait_on_folio_bit

2021-04-09 Thread Matthew Wilcox (Oracle)
We must always wait on the folio, otherwise we won't be woken up.

This commit shrinks the kernel by 691 bytes, mostly due to moving
the page waitqueue lookup into wait_on_folio_bit_common().

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 include/linux/pagemap.h | 10 +++---
 mm/filemap.c| 67 -
 mm/page-writeback.c |  4 +--
 3 files changed, 39 insertions(+), 42 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d50fc5adbee1..5bff48eb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -790,11 +790,11 @@ static inline int lock_page_or_retry(struct page *page, 
struct mm_struct *mm,
 }
 
 /*
- * This is exported only for wait_on_page_locked/wait_on_page_writeback, etc.,
+ * This is exported only for wait_on_folio_locked/wait_on_folio_writeback, 
etc.,
  * and should not be used directly.
  */
-extern void wait_on_page_bit(struct page *page, int bit_nr);
-extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern void wait_on_folio_bit(struct folio *folio, int bit_nr);
+extern int wait_on_folio_bit_killable(struct folio *folio, int bit_nr);
 
 /* 
  * Wait for a folio to be unlocked.
@@ -806,14 +806,14 @@ extern int wait_on_page_bit_killable(struct page *page, 
int bit_nr);
 static inline void wait_on_folio_locked(struct folio *folio)
 {
if (FolioLocked(folio))
-   wait_on_page_bit(&folio->page, PG_locked);
+   wait_on_folio_bit(folio, PG_locked);
 }
 
 static inline int wait_on_folio_locked_killable(struct folio *folio)
 {
if (!FolioLocked(folio))
return 0;
-   return wait_on_page_bit_killable(&folio->page, PG_locked);
+   return wait_on_folio_bit_killable(folio, PG_locked);
 }
 
 static inline void wait_on_page_locked(struct page *page)
diff --git a/mm/filemap.c b/mm/filemap.c
index cdb8250af510..8f07e21a8f29 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1102,7 +1102,7 @@ static int wake_page_function(wait_queue_entry_t *wait, 
unsigned mode, int sync,
 *
 * So update the flags atomically, and wake up the waiter
 * afterwards to avoid any races. This store-release pairs
-* with the load-acquire in wait_on_page_bit_common().
+* with the load-acquire in wait_on_folio_bit_common().
 */
smp_store_release(>flags, flags | WQ_FLAG_WOKEN);
wake_up_state(wait->private, mode);
@@ -1183,7 +1183,7 @@ static void wake_up_folio(struct folio *folio, int bit)
 }
 
 /*
- * A choice of three behaviors for wait_on_page_bit_common():
+ * A choice of three behaviors for wait_on_folio_bit_common():
  */
 enum behavior {
EXCLUSIVE,  /* Hold ref to page and take the bit when woken, like
@@ -1217,9 +1217,10 @@ static inline bool trylock_page_bit_common(struct page 
*page, int bit_nr,
 /* How many times do we accept lock stealing from under a waiter? */
 int sysctl_page_lock_unfairness = 5;
 
-static inline int wait_on_page_bit_common(wait_queue_head_t *q,
-   struct page *page, int bit_nr, int state, enum behavior behavior)
+static inline int wait_on_folio_bit_common(struct folio *folio, int bit_nr,
+   int state, enum behavior behavior)
 {
+   wait_queue_head_t *q = page_waitqueue(&folio->page);
int unfairness = sysctl_page_lock_unfairness;
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = _page.wait;
@@ -1228,8 +1229,8 @@ static inline int 
wait_on_page_bit_common(wait_queue_head_t *q,
unsigned long pflags;
 
if (bit_nr == PG_locked &&
-   !PageUptodate(page) && PageWorkingset(page)) {
-   if (!PageSwapBacked(page)) {
+   !FolioUptodate(folio) && FolioWorkingset(folio)) {
+   if (!FolioSwapBacked(folio)) {
delayacct_thrashing_start();
delayacct = true;
}
@@ -1239,7 +1240,7 @@ static inline int 
wait_on_page_bit_common(wait_queue_head_t *q,
 
init_wait(wait);
wait->func = wake_page_function;
-   wait_page.page = page;
+   wait_page.page = &folio->page;
wait_page.bit_nr = bit_nr;
 
 repeat:
@@ -1254,7 +1255,7 @@ static inline int 
wait_on_page_bit_common(wait_queue_head_t *q,
 * Do one last check whether we can get the
 * page bit synchronously.
 *
-* Do the SetPageWaiters() marking before that
+* Do the SetFolioWaiters() marking before that
 * to let any waker we _just_ missed know they
 * need to wake us up (otherwise they'll never
 * even go to the slow case that looks at the
@@ -1265,8 +1266,8 @@ static inline int 
wait_on_page_bit_common(wait_queue_head_t *q,
 * lock to avoid races.
 */
spin_lock_irq(&q->lock);
-   SetPageWaiters(page);
-   if (!trylock_page_bit_common(pa

[PATCH v7 25/28] mm/writeback: Add wait_for_stable_folio

2021-04-09 Thread Matthew Wilcox (Oracle)
Move wait_for_stable_page() into the folio compatibility file.
wait_for_stable_folio() avoids a call to compound_head() and is 14 bytes
smaller than wait_for_stable_page() was.  The net text size grows by 24
bytes as a result of this patch.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 include/linux/pagemap.h |  1 +
 mm/folio-compat.c   |  6 ++
 mm/page-writeback.c | 24 ++--
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 99331c35c89c..d50fc5adbee1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -833,6 +833,7 @@ int wait_on_folio_writeback_killable(struct folio *folio);
 void end_page_writeback(struct page *page);
 void end_folio_writeback(struct folio *folio);
 void wait_for_stable_page(struct page *page);
+void wait_for_stable_folio(struct folio *folio);
 
 void page_endio(struct page *page, bool is_write, int err);
 
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 6aadecc39fba..335594fe414e 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -29,3 +29,9 @@ void wait_on_page_writeback(struct page *page)
return wait_on_folio_writeback(page_folio(page));
 }
 EXPORT_SYMBOL_GPL(wait_on_page_writeback);
+
+void wait_for_stable_page(struct page *page)
+{
+   return wait_for_stable_folio(page_folio(page));
+}
+EXPORT_SYMBOL_GPL(wait_for_stable_page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8271f9b24b69..9d55ceec05c0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2865,17 +2865,21 @@ int wait_on_folio_writeback_killable(struct folio 
*folio)
 EXPORT_SYMBOL_GPL(wait_on_folio_writeback_killable);
 
 /**
- * wait_for_stable_page() - wait for writeback to finish, if necessary.
- * @page:  The page to wait on.
+ * wait_for_stable_folio() - wait for writeback to finish, if necessary.
+ * @folio: The folio to wait on.
  *
- * This function determines if the given page is related to a backing device
- * that requires page contents to be held stable during writeback.  If so, then
- * it will wait for any pending writeback to complete.
+ * This function determines if the given folio is related to a backing
+ * device that requires folio contents to be held stable during writeback.
+ * If so, then it will wait for any pending writeback to complete.
+ *
+ * Context: Sleeps.  Must be called in process context and with
+ * no spinlocks held.  Caller should hold a reference on the folio.
+ * If the folio is not locked, writeback may start again after writeback
+ * has finished.
  */
-void wait_for_stable_page(struct page *page)
+void wait_for_stable_folio(struct folio *folio)
 {
-   page = thp_head(page);
-   if (page->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
-   wait_on_page_writeback(page);
+   if (folio->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES)
+   wait_on_folio_writeback(folio);
 }
-EXPORT_SYMBOL_GPL(wait_for_stable_page);
+EXPORT_SYMBOL_GPL(wait_for_stable_folio);
-- 
2.30.2



[PATCH v7 24/28] mm/writeback: Add wait_on_folio_writeback

2021-04-09 Thread Matthew Wilcox (Oracle)
wait_on_page_writeback_killable() only has one caller, so convert it to
call wait_on_folio_writeback_killable().  For the wait_on_page_writeback()
callers, add a compatibility wrapper around wait_on_folio_writeback().

Turning PageWriteback() into FolioWriteback() eliminates a call to
compound_head() which saves 8 bytes and 15 bytes in the two functions.
That is more than offset by adding the wait_on_page_writeback
compatibility wrapper for a net increase in text of 15 bytes.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 fs/afs/write.c  |  9 
 include/linux/pagemap.h |  3 ++-
 mm/folio-compat.c   |  6 ++
 mm/page-writeback.c | 48 -
 4 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index dc66ff15dd16..a1def42e2e45 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -831,7 +831,8 @@ int afs_fsync(struct file *file, loff_t start, loff_t end, 
int datasync)
  */
 vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
 {
-   struct page *page = thp_head(vmf->page);
+   struct folio *folio = page_folio(vmf->page);
+   struct page *page = &folio->page;
struct file *file = vmf->vma->vm_file;
struct inode *inode = file_inode(file);
struct afs_vnode *vnode = AFS_FS_I(inode);
@@ -850,7 +851,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
return VM_FAULT_RETRY;
 #endif
 
-   if (wait_on_page_writeback_killable(page))
+   if (wait_on_folio_writeback_killable(folio))
return VM_FAULT_RETRY;
 
if (lock_page_killable(page) < 0)
@@ -860,8 +861,8 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
 * details the portion of the page we need to write back and we might
 * need to redirty the page if there's a problem.
 */
-   if (wait_on_page_writeback_killable(page) < 0) {
-   unlock_page(page);
+   if (wait_on_folio_writeback_killable(folio) < 0) {
+   unlock_folio(folio);
return VM_FAULT_RETRY;
}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9bc01429dc25..99331c35c89c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -828,7 +828,8 @@ static inline int wait_on_page_locked_killable(struct page 
*page)
 
 int put_and_wait_on_page_locked(struct page *page, int state);
 void wait_on_page_writeback(struct page *page);
-int wait_on_page_writeback_killable(struct page *page);
+void wait_on_folio_writeback(struct folio *folio);
+int wait_on_folio_writeback_killable(struct folio *folio);
 void end_page_writeback(struct page *page);
 void end_folio_writeback(struct folio *folio);
 void wait_for_stable_page(struct page *page);
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index d1a1dfe52589..6aadecc39fba 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -23,3 +23,9 @@ void end_page_writeback(struct page *page)
return end_folio_writeback(page_folio(page));
 }
 EXPORT_SYMBOL(end_page_writeback);
+
+void wait_on_page_writeback(struct page *page)
+{
+   return wait_on_folio_writeback(page_folio(page));
+}
+EXPORT_SYMBOL_GPL(wait_on_page_writeback);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0062d5c57d41..8271f9b24b69 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2818,33 +2818,51 @@ int __test_set_page_writeback(struct page *page, bool 
keep_write)
 }
 EXPORT_SYMBOL(__test_set_page_writeback);
 
-/*
- * Wait for a page to complete writeback
+/**
+ * wait_on_folio_writeback - Wait for a folio to finish writeback.
+ * @folio: The folio to wait for.
+ *
+ * If the folio is currently being written back to storage, wait for the
+ * I/O to complete.
+ *
+ * Context: Sleeps.  Must be called in process context and with
+ * no spinlocks held.  Caller should hold a reference on the folio.
+ * If the folio is not locked, writeback may start again after writeback
+ * has finished.
  */
-void wait_on_page_writeback(struct page *page)
+void wait_on_folio_writeback(struct folio *folio)
 {
-   while (PageWriteback(page)) {
-   trace_wait_on_page_writeback(page, page_mapping(page));
-   wait_on_page_bit(page, PG_writeback);
+   while (FolioWriteback(folio)) {
+   trace_wait_on_page_writeback(&folio->page, 
folio_mapping(folio));
+   wait_on_page_bit(&folio->page, PG_writeback);
}
 }
-EXPORT_SYMBOL_GPL(wait_on_page_writeback);
+EXPORT_SYMBOL_GPL(wait_on_folio_writeback);
 
-/*
- * Wait for a page to complete writeback.  Returns -EINTR if we get a
- * fatal signal while waiting.
+/**
+ * wait_on_folio_writeback_killable - Wait for a folio to finish writeback.
+ * @folio: The folio to wait for.
+ *
+ * If the folio is currently being written back to storage, wait for the
+ * I/O to complete or a fatal signal to arrive.
+ *
+ * Context: Sleeps.  Mus

[PATCH v7 23/28] mm/filemap: Add end_folio_writeback

2021-04-09 Thread Matthew Wilcox (Oracle)
Add an end_page_writeback() wrapper function for users that are not yet
converted to folios.

end_folio_writeback() is less than half the size of end_page_writeback()
at just 105 bytes compared to 213 bytes, due to removing all the
compound_head() calls.  The 30 byte wrapper function makes this a net
saving of 70 bytes.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 include/linux/pagemap.h |  3 ++-
 mm/filemap.c| 38 +++---
 mm/folio-compat.c   |  6 ++
 3 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 18c4c2ae8074..9bc01429dc25 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -829,7 +829,8 @@ static inline int wait_on_page_locked_killable(struct page 
*page)
 int put_and_wait_on_page_locked(struct page *page, int state);
 void wait_on_page_writeback(struct page *page);
 int wait_on_page_writeback_killable(struct page *page);
-extern void end_page_writeback(struct page *page);
+void end_page_writeback(struct page *page);
+void end_folio_writeback(struct folio *folio);
 void wait_for_stable_page(struct page *page);
 
 void page_endio(struct page *page, bool is_write, int err);
diff --git a/mm/filemap.c b/mm/filemap.c
index d23430ad5bbc..cdb8250af510 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1175,11 +1175,11 @@ static void wake_up_page_bit(struct page *page, int 
bit_nr)
spin_unlock_irqrestore(&q->lock, flags);
 }
 
-static void wake_up_page(struct page *page, int bit)
+static void wake_up_folio(struct folio *folio, int bit)
 {
-   if (!PageWaiters(page))
+   if (!FolioWaiters(folio))
return;
-   wake_up_page_bit(page, bit);
+   wake_up_page_bit(&folio->page, bit);
 }
 
 /*
@@ -1512,38 +1512,38 @@ int wait_on_page_private_2_killable(struct page *page)
 EXPORT_SYMBOL(wait_on_page_private_2_killable);
 
 /**
- * end_page_writeback - end writeback against a page
- * @page: the page
+ * end_folio_writeback - End writeback against a folio.
+ * @folio: The folio.
  */
-void end_page_writeback(struct page *page)
+void end_folio_writeback(struct folio *folio)
 {
/*
 * TestClearPageReclaim could be used here but it is an atomic
 * operation and overkill in this particular case. Failing to
-* shuffle a page marked for immediate reclaim is too mild to
+* shuffle a folio marked for immediate reclaim is too mild to
 * justify taking an atomic operation penalty at the end of
-* ever page writeback.
+* every folio writeback.
 */
-   if (PageReclaim(page)) {
-   ClearPageReclaim(page);
-   rotate_reclaimable_page(page);
+   if (FolioReclaim(folio)) {
+   ClearFolioReclaim(folio);
+   rotate_reclaimable_page(&folio->page);
}
 
/*
-* Writeback does not hold a page reference of its own, relying
+* Writeback does not hold a folio reference of its own, relying
 * on truncation to wait for the clearing of PG_writeback.
-* But here we must make sure that the page is not freed and
-* reused before the wake_up_page().
+* But here we must make sure that the folio is not freed and
+* reused before the wake_up_folio().
 */
-   get_page(page);
-   if (!test_clear_page_writeback(page))
+   get_folio(folio);
+   if (!test_clear_page_writeback(&folio->page))
BUG();
 
smp_mb__after_atomic();
-   wake_up_page(page, PG_writeback);
-   put_page(page);
+   wake_up_folio(folio, PG_writeback);
+   put_folio(folio);
 }
-EXPORT_SYMBOL(end_page_writeback);
+EXPORT_SYMBOL(end_folio_writeback);
 
 /*
  * After completing I/O on a page, call this routine to update the page
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 02798abf19a1..d1a1dfe52589 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -17,3 +17,9 @@ void unlock_page(struct page *page)
return unlock_folio(page_folio(page));
 }
 EXPORT_SYMBOL(unlock_page);
+
+void end_page_writeback(struct page *page)
+{
+   return end_folio_writeback(page_folio(page));
+}
+EXPORT_SYMBOL(end_page_writeback);
-- 
2.30.2



[PATCH v7 22/28] mm/filemap: Add wait_on_folio_locked

2021-04-09 Thread Matthew Wilcox (Oracle)
Also add wait_on_folio_locked_killable().  Turn wait_on_page_locked()
and wait_on_page_locked_killable() into wrappers.  This eliminates a
call to compound_head() from each call-site, reducing text size by 200
bytes for me.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 include/linux/pagemap.h | 26 ++
 mm/filemap.c|  4 ++--
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 00864e098693..18c4c2ae8074 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -797,23 +797,33 @@ extern void wait_on_page_bit(struct page *page, int 
bit_nr);
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
 
 /* 
- * Wait for a page to be unlocked.
+ * Wait for a folio to be unlocked.
  *
- * This must be called with the caller "holding" the page,
- * ie with increased "page->count" so that the page won't
+ * This must be called with the caller "holding" the folio,
+ * ie with increased "page->count" so that the folio won't
  * go away during the wait..
  */
+static inline void wait_on_folio_locked(struct folio *folio)
+{
+   if (FolioLocked(folio))
+   wait_on_page_bit(&folio->page, PG_locked);
+}
+
+static inline int wait_on_folio_locked_killable(struct folio *folio)
+{
+   if (!FolioLocked(folio))
+   return 0;
+   return wait_on_page_bit_killable(&folio->page, PG_locked);
+}
+
 static inline void wait_on_page_locked(struct page *page)
 {
-   if (PageLocked(page))
-   wait_on_page_bit(compound_head(page), PG_locked);
+   wait_on_folio_locked(page_folio(page));
 }
 
 static inline int wait_on_page_locked_killable(struct page *page)
 {
-   if (!PageLocked(page))
-   return 0;
-   return wait_on_page_bit_killable(compound_head(page), PG_locked);
+   return wait_on_folio_locked_killable(page_folio(page));
 }
 
 int put_and_wait_on_page_locked(struct page *page, int state);
diff --git a/mm/filemap.c b/mm/filemap.c
index 73c8d7102157..d23430ad5bbc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1643,9 +1643,9 @@ int __lock_folio_or_retry(struct folio *folio, struct 
mm_struct *mm,
 
mmap_read_unlock(mm);
if (flags & FAULT_FLAG_KILLABLE)
-   wait_on_page_locked_killable(page);
+   wait_on_folio_locked_killable(folio);
else
-   wait_on_page_locked(page);
+   wait_on_folio_locked(folio);
return 0;
}
if (flags & FAULT_FLAG_KILLABLE) {
-- 
2.30.2



[PATCH v7 21/28] mm/filemap: Add __lock_folio_or_retry

2021-04-09 Thread Matthew Wilcox (Oracle)
Convert __lock_page_or_retry() to __lock_folio_or_retry().  This actually
saves 4 bytes in the only caller of lock_page_or_retry() (due to better
register allocation) and saves the 20 byte cost of calling page_folio()
in __lock_folio_or_retry() for a total saving of 24 bytes.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 include/linux/pagemap.h |  9 ++---
 mm/filemap.c| 10 --
 mm/memory.c |  8 
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c63f59d4ae60..00864e098693 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -716,7 +716,7 @@ static inline bool wake_page_match(struct wait_page_queue 
*wait_page,
 
 void __lock_folio(struct folio *folio);
 int __lock_folio_killable(struct folio *folio);
-extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+int __lock_folio_or_retry(struct folio *folio, struct mm_struct *mm,
unsigned int flags);
 void unlock_page(struct page *page);
 void unlock_folio(struct folio *folio);
@@ -777,13 +777,16 @@ static inline int lock_page_killable(struct page *page)
  * caller indicated that it can handle a retry.
  *
  * Return value and mmap_lock implications depend on flags; see
- * __lock_page_or_retry().
+ * __lock_folio_or_retry().
  */
 static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
 unsigned int flags)
 {
+   struct folio *folio;
might_sleep();
-   return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+
+   folio = page_folio(page);
+   return trylock_folio(folio) || __lock_folio_or_retry(folio, mm, flags);
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 9485ce2a4820..73c8d7102157 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1621,20 +1621,18 @@ static int __lock_folio_async(struct folio *folio, 
struct wait_page_queue *wait)
 
 /*
  * Return values:
- * 1 - page is locked; mmap_lock is still held.
- * 0 - page is not locked.
+ * 1 - folio is locked; mmap_lock is still held.
+ * 0 - folio is not locked.
  * mmap_lock has been released (mmap_read_unlock(), unless flags had both
  * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
  * which case mmap_lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return 1
- * with the page locked and the mmap_lock unperturbed.
+ * with the folio locked and the mmap_lock unperturbed.
  */
-int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
+int __lock_folio_or_retry(struct folio *folio, struct mm_struct *mm,
 unsigned int flags)
 {
-   struct folio *folio = page_folio(page);
-
if (fault_flag_allow_retry_first(flags)) {
/*
 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index cc71a445c76c..a2768ca793e5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4067,7 +4067,7 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults).
  * The mmap_lock may have been released depending on flags and our
- * return value.  See filemap_fault() and __lock_page_or_retry().
+ * return value.  See filemap_fault() and __lock_folio_or_retry().
  * If mmap_lock is released, vma may become invalid (for example
  * by other thread calling munmap()).
  */
@@ -4299,7 +4299,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t 
orig_pud)
  * concurrent faults).
  *
  * The mmap_lock may have been released depending on flags and our return 
value.
- * See filemap_fault() and __lock_page_or_retry().
+ * See filemap_fault() and __lock_folio_or_retry().
  */
 static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
@@ -4403,7 +4403,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
  * By the time we get here, we already hold the mm semaphore
  *
  * The mmap_lock may have been released depending on flags and our
- * return value.  See filemap_fault() and __lock_page_or_retry().
+ * return value.  See filemap_fault() and __lock_folio_or_retry().
  */
 static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
@@ -4559,7 +4559,7 @@ static inline void mm_account_fault(struct pt_regs *regs,
  * By the time we get here, we already hold the mm semaphore
  *
  * The mmap_lock may have been released depending on flags and our
- * return value.  See filemap_fault() and __lock_page_or_retry().
+ * return value.  See filemap_fault() and __lock_folio_or_retry().
  */
 vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
   unsigned int flags, struct pt_regs *regs)
-- 
2.30.2



[PATCH v7 20/28] mm/filemap: Add __lock_folio_async

2021-04-09 Thread Matthew Wilcox (Oracle)
There aren't any actual callers of lock_page_async(), so remove it.
Convert filemap_update_page() to call __lock_folio_async().

__lock_folio_async() is 21 bytes smaller than __lock_page_async(),
but the real savings come from using a folio in filemap_update_page(),
shrinking it from 514 bytes to 403 bytes, saving 111 bytes.  The text
shrinks by 132 bytes in total.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Acked-by: Jeff Layton 
---
 fs/io_uring.c   |  2 +-
 include/linux/pagemap.h | 17 -
 mm/filemap.c| 31 ---
 3 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 81e5d156af1c..154fd92ab8a8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3142,7 +3142,7 @@ static int io_read_prep(struct io_kiocb *req, const 
struct io_uring_sqe *sqe)
 }
 
 /*
- * This is our waitqueue callback handler, registered through lock_page_async()
+ * This is our waitqueue callback handler, registered through 
lock_folio_async()
  * when we initially tried to do the IO with the iocb armed our waitqueue.
  * This gets called when the page is unlocked, and we generally expect that to
  * happen when the page IO is completed and the page is now uptodate. This will
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b23b95f771f7..c63f59d4ae60 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -716,7 +716,6 @@ static inline bool wake_page_match(struct wait_page_queue 
*wait_page,
 
 void __lock_folio(struct folio *folio);
 int __lock_folio_killable(struct folio *folio);
-extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
 void unlock_page(struct page *page);
@@ -773,22 +772,6 @@ static inline int lock_page_killable(struct page *page)
return lock_folio_killable(page_folio(page));
 }
 
-/*
- * lock_page_async - Lock the page, unless this would block. If the page
- * is already locked, then queue a callback when the page becomes unlocked.
- * This callback can then retry the operation.
- *
- * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page
- * was already locked and the callback defined in 'wait' was queued.
- */
-static inline int lock_page_async(struct page *page,
- struct wait_page_queue *wait)
-{
-   if (!trylock_page(page))
-   return __lock_page_async(page, wait);
-   return 0;
-}
-
 /*
  * lock_page_or_retry - Lock the page, unless this would block and the
  * caller indicated that it can handle a retry.
diff --git a/mm/filemap.c b/mm/filemap.c
index 2a4fa0b5fa88..9485ce2a4820 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1593,18 +1593,18 @@ int __lock_folio_killable(struct folio *folio)
 }
 EXPORT_SYMBOL_GPL(__lock_folio_killable);
 
-int __lock_page_async(struct page *page, struct wait_page_queue *wait)
+static int __lock_folio_async(struct folio *folio, struct wait_page_queue 
*wait)
 {
-   struct wait_queue_head *q = page_waitqueue(page);
+   struct wait_queue_head *q = page_waitqueue(&folio->page);
int ret = 0;
 
-   wait->page = page;
+   wait->page = &folio->page;
wait->bit_nr = PG_locked;
 
spin_lock_irq(&q->lock);
__add_wait_queue_entry_tail(q, &wait->wait);
-   SetPageWaiters(page);
-   ret = !trylock_page(page);
+   SetFolioWaiters(folio);
+   ret = !trylock_folio(folio);
/*
 * If we were successful now, we know we're still on the
 * waitqueue as we're still under the lock. This means it's
@@ -2351,41 +2351,42 @@ static int filemap_update_page(struct kiocb *iocb,
struct address_space *mapping, struct iov_iter *iter,
struct page *page)
 {
+   struct folio *folio = page_folio(page);
int error;
 
-   if (!trylock_page(page)) {
+   if (!trylock_folio(folio)) {
if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
return -EAGAIN;
if (!(iocb->ki_flags & IOCB_WAITQ)) {
-   put_and_wait_on_page_locked(page, TASK_KILLABLE);
+   put_and_wait_on_page_locked(&folio->page, 
TASK_KILLABLE);
return AOP_TRUNCATED_PAGE;
}
-   error = __lock_page_async(page, iocb->ki_waitq);
+   error = __lock_folio_async(folio, iocb->ki_waitq);
if (error)
return error;
}
 
-   if (!page->mapping)
+   if (!folio->mapping)
goto truncated;
 
error = 0;
-   if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
+   if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, &folio->page))
goto unlock;
 
error = -EA
