Re: [PATCH v3 2/2] secretmem: optimize page_is_secretmem()

2021-05-07 Thread Matthew Wilcox
On Tue, Apr 20, 2021 at 06:00:49PM +0300, Mike Rapoport wrote:
> + mapping = (struct address_space *)
> + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
> +
> + if (mapping != page->mapping)
> + return false;
> +
> + return page->mapping->a_ops == &secretmem_aops;

... why do you go back to page->mapping here?

return mapping->a_ops == &secretmem_aops;
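
For context, a sketch of what the whole helper looks like with that change folded
in (the PageCompound()/PageLRU() guards are assumed from the rest of the patch,
not shown in the quote above):

static inline bool page_is_secretmem(struct page *page)
{
	struct address_space *mapping;

	/* secretmem pages are order-0 and on the LRU; anything else bails early */
	if (PageCompound(page) || !PageLRU(page))
		return false;

	/* mask off PAGE_MAPPING_ANON / PAGE_MAPPING_MOVABLE bits */
	mapping = (struct address_space *)
		((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);

	/* any flag bits set means this is not a plain page cache mapping */
	if (mapping != page->mapping)
		return false;

	/* reuse the masked local rather than dereferencing page->mapping again */
	return mapping->a_ops == &secretmem_aops;
}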


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-06 Thread Matthew Wilcox
On Thu, May 06, 2021 at 11:23:25AM +0100, Joao Martins wrote:
> >> I think it is ok for dax/nvdimm to continue to maintain their align
> >> value because it should be ok to have 4MB align if the device really
> >> wanted. However, when it goes to map that alignment with
> >> memremap_pages() it can pick a mode. For example, it's already the
> >> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> >> they're already separate concepts that can stay separate.
> > 
> > devdax namespace with align of 1G implies we expect to map them with 1G
> > pte entries? I didn't follow when you say we map them today with
> > DEVMAP_PTE entries.
> 
> This sort of confusion is largely why Dan is suggesting a @geometry for 
> naming rather
> than @align (which traditionally refers to page table entry sizes in 
> pagemap-related stuff).
> 
> DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages 
> (DEVMAP_PTE),
> compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order 
> (DEVMAP_PUD).
> 
> So, today:
> 
> * namespaces with align of 1G would use *struct pages of order-0* 
> (DEVMAP_PTE) backed with
> PMD entries in the direct map.
> * namespaces with align of 2M would use *struct pages of order-0* 
> (DEVMAP_PTE) backed with
> PMD entries in the direct map.
> 
> After this series:
> 
> * namespaces with align of 1G would use *compound struct pages of order-30* 
> (DEVMAP_PUD)
> backed with PMD entries in the direct map.

order-18

> * namespaces with align of 1G would use *compound struct pages of order-21* 
> (DEVMAP_PMD)
> backed with PTE entries in the direct map.

I think you mean "align of 2M", and order-9.

(note that these numbers are/can be different for powerpc since it
supports different TLB page sizes.  please don't get locked into x86
sizes, and don't ignore the benefits *to software* of managing memory
in sizes other than just those supported by the hardware).


Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages

2021-05-05 Thread Matthew Wilcox
On Wed, May 05, 2021 at 11:44:29AM -0700, Dan Williams wrote:
> > @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> > unsigned long pfn, end_pfn = start_pfn + nr_pages;
> > struct pglist_data *pgdat = zone->zone_pgdat;
> > struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> > +   unsigned int pfn_align = pgmap_pfn_align(pgmap);
> > +   unsigned int order_align = order_base_2(pfn_align);
> > unsigned long zone_idx = zone_idx(zone);
> > unsigned long start = jiffies;
> > int nid = pgdat->node_id;
> > @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone 
> > *zone,
> > nr_pages = end_pfn - start_pfn;
> > }
> >
> > -   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > +   for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> 
> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> PAGE_SHIFT" I missed somewhere?

If something is measured in bytes, I like to use size_t (if it's
in memory) and loff_t (if it's on storage).  The compiler doesn't do
anything useful to warn you, but it's a nice indication to humans about
what's going on.  And it removes the temptation to do 'pfn_align >>=
PAGE_SHIFT' and suddenly take pfn_align from being measured in bytes to
being measured in pages.
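
To make the unit confusion hard to repeat, something along these lines keeps the
unit in the name and converts exactly once (a sketch only; pgmap_align_bytes()
is a made-up helper, not from the patch):

	size_t align_bytes = pgmap_align_bytes(pgmap);		/* hypothetical helper, bytes in the name */
	unsigned long pfn_step = align_bytes >> PAGE_SHIFT;	/* convert to pages exactly once */

	for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_step) {
		/* initialise one aligned block's worth of struct pages */
	}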


Re: [PATCH v4 1/3] dax: Add an enum for specifying dax wakup mode

2021-04-26 Thread Matthew Wilcox
On Mon, Apr 26, 2021 at 01:52:17PM -0400, Vivek Goyal wrote:
> On Mon, Apr 26, 2021 at 02:46:32PM +0100, Matthew Wilcox wrote:
> > On Fri, Apr 23, 2021 at 09:07:21AM -0400, Vivek Goyal wrote:
> > > +enum dax_wake_mode {
> > > + WAKE_NEXT,
> > > + WAKE_ALL,
> > > +};
> > 
> > Why define them in this order when ...
> > 
> > > @@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
> > > *entry, bool wake_all)
> > >* must be in the waitqueue and the following check will see them.
> > >*/
> > >   if (waitqueue_active(wq))
> > > - __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
> > > + __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);
> > 
> > ... they're used like this?  This is almost as bad as
> > 
> > enum bool {
> > true,
> > false,
> > };
> 
> Hi Matthew,
> 
> So you prefer that I should switch order of WAKE_NEXT and WAKE_ALL? 
> 
> enum dax_wake_mode {
>   WAKE_ALL,
>   WAKE_NEXT,
> };

That, yes.

> And then do following to wake task.
> 
> if (waitqueue_active(wq))
>   __wake_up(wq, TASK_NORMAL, mode, &key);

No, the third argument to __wake_up() is a count, not an enum.  It just so
happens that '0' means 'all' and we only ever wake up 1 and not, say, 5.
So the logical way to define the enum is ALL, NEXT which _just happens
to match_ the usage of __wake_up().
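
Putting the two together, the shape being suggested is roughly this (sketch; the
ternary stays because the third argument is a wake count, even though the enum
values happen to line up):

enum dax_wake_mode {
	WAKE_ALL,	/* wake all waiters in the waitqueue */
	WAKE_NEXT,	/* wake only the next waiter */
};

	if (waitqueue_active(wq))
		__wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);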


Re: [PATCH v4 1/3] dax: Add an enum for specifying dax wakup mode

2021-04-26 Thread Matthew Wilcox
On Fri, Apr 23, 2021 at 09:07:21AM -0400, Vivek Goyal wrote:
> +enum dax_wake_mode {
> + WAKE_NEXT,
> + WAKE_ALL,
> +};

Why define them in this order when ...

> @@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
> *entry, bool wake_all)
>* must be in the waitqueue and the following check will see them.
>*/
>   if (waitqueue_active(wq))
> - __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
> + __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);

... they're used like this?  This is almost as bad as

enum bool {
true,
false,
};


Re: [PATCH v3 1/3] dax: Add an enum for specifying dax wakup mode

2021-04-21 Thread Matthew Wilcox
On Wed, Apr 21, 2021 at 11:56:31AM -0400, Vivek Goyal wrote:
> +/**
> + * enum dax_entry_wake_mode: waitqueue wakeup toggle

s/toggle/behaviour/ ?

> + * @WAKE_NEXT: wake only the first waiter in the waitqueue
> + * @WAKE_ALL: wake all waiters in the waitqueue
> + */
> +enum dax_entry_wake_mode {
> + WAKE_NEXT,
> + WAKE_ALL,
> +};
> +
>  static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
>   void *entry, struct exceptional_entry_key *key)
>  {
> @@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(w
>   * The important information it's conveying is whether the entry at
>   * this index used to be a PMD entry.
>   */
> -static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
> +static void dax_wake_entry(struct xa_state *xas, void *entry,
> +enum dax_entry_wake_mode mode)

It's an awfully verbose name.  'dax_wake_mode'?


Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 02:56:17PM +0300, Mike Rapoport wrote:
> On Mon, Apr 19, 2021 at 12:23:02PM +0100, Matthew Wilcox wrote:
> > So you're calling page_is_secretmem() on a struct page without having
> > a refcount on it.  That is definitely not allowed.  secretmem seems to
> > be full of these kinds of races; I know this isn't the first one I've
> > seen in it.  I don't think this patchset is ready for this merge window.
> 
> There were races in the older version that did caching of large pages and
> those were fixed then, but this is anyway irrelevant because all that code
> was dropped in the latest respins.
> 
> I don't think that the fix of the race in gup_pte_range is that significant
> to wait 3 more months because of it.

I have no particular interest in secretmem, but it seems that every time
I come across it while looking at something else, I see these kinds of
major mistakes in it.  That says to me it's not ready and hasn't seen
enough review.

> > With that fixed, you'll have a head page that you can use for testing,
> > which means you don't need to test PageCompound() (because you know the
> > page isn't PageTail), you can just test PageHead().
> 
> I can't say I follow you here. page_is_secretmem() is intended to be a
> generic test, so I don't see how it will get PageHead() if it is called
> from other places.

static inline bool head_is_secretmem(struct page *head)
{
if (PageHead(head))
return false;
...
}

static inline bool page_is_secretmem(struct page *page)
{
if (PageTail(page))
return false;
return head_is_secretmem(page);
}

(yes, calling it head is a misnomer, because it's not necessarily a head,
it might be a base page, but until we have a name for pages which might
be a head page or a base page, it'll have to do ...)


Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 12:36:19PM +0300, Mike Rapoport wrote:
> Well, most if the -4.2% of the performance regression kbuild reported were
> due to repeated compount_head(page) in page_mapping(). So the whole point
> of this patch is to avoid calling page_mapping().

It's quite ludicrous how many times we call compound_head() in
page_mapping() today:

 page = compound_head(page);
 if (__builtin_expect(!!(PageSlab(page)), 0))
 if (__builtin_expect(!!(PageSwapCache(page)), 0)) {

TESTPAGEFLAG(Slab, slab, PF_NO_TAIL) expands to:

static __always_inline int PageSlab(struct page *page)
{
PF_POISONED_CHECK(compound_head(page));
return test_bit(PG_slab, &compound_head(page)->flags);
}

static __always_inline int PageSwapCache(struct page *page)
{
page = compound_head(page);
return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
}

but then!

TESTPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL) also expands like Slab does.

So that's six calls to compound_head(), depending what Kconfig options
you have enabled.

And folio_mapping() is one of the functions I add in the first batch of
patches, so review, etc will be helpful.
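
The shape of the fix, independent of the naming, is to resolve the head page
once and test the flag bits on it directly, so none of the helpers repeat the
lookup.  A rough sketch of the idea (not the actual folio_mapping() from that
series):

static struct address_space *page_mapping_once(struct page *page)
{
	struct page *head = compound_head(page);	/* the only head lookup */
	unsigned long mapping = (unsigned long)head->mapping;

	if (unlikely(test_bit(PG_slab, &head->flags)))
		return NULL;
	if (unlikely(test_bit(PG_swapbacked, &head->flags) &&
		     test_bit(PG_swapcache, &head->flags)))
		return swap_address_space((swp_entry_t){ .val = page_private(head) });
	if (mapping & PAGE_MAPPING_ANON)
		return NULL;
	return (struct address_space *)(mapping & ~PAGE_MAPPING_FLAGS);
}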


Re: [PATCH] secretmem: optimize page_is_secretmem()

2021-04-19 Thread Matthew Wilcox
On Mon, Apr 19, 2021 at 11:42:18AM +0300, Mike Rapoport wrote:
> The perf profile of the test indicated that the regression is caused by
> page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range):

Uhh ... you're calling it in the wrong place!

VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);

if (page_is_secretmem(page))
goto pte_unmap;

head = try_grab_compound_head(page, 1, flags);
if (!head)
goto pte_unmap;

So you're calling page_is_secretmem() on a struct page without having
a refcount on it.  That is definitely not allowed.  secretmem seems to
be full of these kinds of races; I know this isn't the first one I've
seen in it.  I don't think this patchset is ready for this merge window.

With that fixed, you'll have a head page that you can use for testing,
which means you don't need to test PageCompound() (because you know the
page isn't PageTail), you can just test PageHead().
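
In other words, roughly this ordering (a sketch only; it assumes gup.c's
put_compound_head() helper is the right way to drop the reference on the
bail-out path):

	head = try_grab_compound_head(page, 1, flags);
	if (!head)
		goto pte_unmap;

	/* we hold a reference now, so the page can't be freed or retyped under us */
	if (page_is_secretmem(head)) {
		put_compound_head(head, 1, flags);
		goto pte_unmap;
	}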


Re: [PATCH 1/3] fsdax: Factor helpers to simplify dax fault code

2021-04-07 Thread Matthew Wilcox
On Wed, Apr 07, 2021 at 02:32:05PM +0800, Shiyang Ruan wrote:
> +static int dax_fault_cow_page(struct vm_fault *vmf, struct iomap *iomap,
> + loff_t pos, vm_fault_t *ret)
> +{
> + int error = 0;
> + unsigned long vaddr = vmf->address;
> + sector_t sector = dax_iomap_sector(iomap, pos);
> +
> + switch (iomap->type) {
> + case IOMAP_HOLE:
> + case IOMAP_UNWRITTEN:
> + clear_user_highpage(vmf->cow_page, vaddr);
> + break;
> + case IOMAP_MAPPED:
> + error = copy_cow_page_dax(iomap->bdev, iomap->dax_dev,
> + sector, vmf->cow_page, vaddr);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + error = -EIO;
> + break;
> + }
> +
> + if (error)
> + return error;
> +
> + __SetPageUptodate(vmf->cow_page);
> + *ret = finish_fault(vmf);
> + if (!*ret)
> + *ret = VM_FAULT_DONE_COW;
> + return 0;
> +}
...

> + error = dax_fault_cow_page(vmf, &iomap, pos, &ret);
>   if (error)
> + ret = dax_fault_return(error);
>   goto finish_iomap;

This seems unnecessarily complex.  Why not return the vm_fault_t instead of
returning the errno and then converting it?
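
For instance, the helper could fold the conversion in and hand back a vm_fault_t
directly.  One possible shape, based on the code quoted above:

static vm_fault_t dax_fault_cow_page(struct vm_fault *vmf, struct iomap *iomap,
		loff_t pos)
{
	unsigned long vaddr = vmf->address;
	sector_t sector = dax_iomap_sector(iomap, pos);
	vm_fault_t ret;
	int error = 0;

	switch (iomap->type) {
	case IOMAP_HOLE:
	case IOMAP_UNWRITTEN:
		clear_user_highpage(vmf->cow_page, vaddr);
		break;
	case IOMAP_MAPPED:
		error = copy_cow_page_dax(iomap->bdev, iomap->dax_dev,
					  sector, vmf->cow_page, vaddr);
		break;
	default:
		WARN_ON_ONCE(1);
		error = -EIO;
		break;
	}

	if (error)
		return dax_fault_return(error);	/* errno -> vm_fault_t in one place */

	__SetPageUptodate(vmf->cow_page);
	ret = finish_fault(vmf);
	if (!ret)
		ret = VM_FAULT_DONE_COW;
	return ret;
}

The caller side then collapses to "ret = dax_fault_cow_page(vmf, &iomap, pos);
goto finish_iomap;" with no intermediate errno.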


Re: BUG_ON(!mapping_empty(&inode->i_data))

2021-04-02 Thread Matthew Wilcox
OK, more competent testing, and that previous bug now detected and fixed.
I have a reasonable amount of confidence this will solve your problem.
If you do apply this patch, don't enable CONFIG_TEST_XARRAY as the new
tests assume that attempting to allocate with a GFP flags of 0 will
definitely fail, which is true for my userspace allocator, but not true
inside the kernel.  I'll add some ifdeffery to skip these tests inside
the kernel, as without a way to deterministically fail allocation,
there's no way to test this code properly.

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 8b1c318189ce..14cbc12bfeca 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -321,12 +321,10 @@ static noinline void check_xa_mark(struct xarray *xa)
check_xa_mark_3(xa);
 }
 
-static noinline void check_xa_shrink(struct xarray *xa)
+static noinline void check_xa_shrink_1(struct xarray *xa)
 {
XA_STATE(xas, xa, 1);
struct xa_node *node;
-   unsigned int order;
-   unsigned int max_order = IS_ENABLED(CONFIG_XARRAY_MULTI) ? 15 : 1;
 
XA_BUG_ON(xa, !xa_empty(xa));
XA_BUG_ON(xa, xa_store_index(xa, 0, GFP_KERNEL) != NULL);
@@ -349,6 +347,13 @@ static noinline void check_xa_shrink(struct xarray *xa)
XA_BUG_ON(xa, xa_load(xa, 0) != xa_mk_value(0));
xa_erase_index(xa, 0);
XA_BUG_ON(xa, !xa_empty(xa));
+}
+
+static noinline void check_xa_shrink_2(struct xarray *xa)
+{
+   unsigned int order;
+   unsigned int max_order = IS_ENABLED(CONFIG_XARRAY_MULTI) ? 15 : 1;
+   struct xa_node *node;
 
for (order = 0; order < max_order; order++) {
unsigned long max = (1UL << order) - 1;
@@ -370,6 +375,34 @@ static noinline void check_xa_shrink(struct xarray *xa)
}
 }
 
+static noinline void check_xa_shrink_3(struct xarray *xa, int nr,
+   unsigned long anchor, unsigned long newbie)
+{
+   XA_STATE(xas, xa, newbie);
+   int i;
+
+   xa_store_index(xa, anchor, GFP_KERNEL);
+
+   for (i = 0; i < nr; i++) {
+   xas_create(&xas, true);
+   xas_nomem(&xas, GFP_KERNEL);
+   }
+   xas_create(&xas, true);
+   xas_nomem(&xas, 0);
+   XA_BUG_ON(xa, xas_error(&xas) == 0);
+
+   xa_erase_index(xa, anchor);
+   XA_BUG_ON(xa, !xa_empty(xa));
+}
+
+static noinline void check_xa_shrink(struct xarray *xa)
+{
+   check_xa_shrink_1(xa);
+   check_xa_shrink_2(xa);
+   check_xa_shrink_3(xa, 8, 0, 1UL << 31);
+   check_xa_shrink_3(xa, 4, 1UL << 31, 0);
+}
+
 static noinline void check_insert(struct xarray *xa)
 {
unsigned long i;
@@ -1463,6 +1496,36 @@ static noinline void check_create_range_4(struct xarray 
*xa,
XA_BUG_ON(xa, !xa_empty(xa));
 }
 
+static noinline void check_create_range_5(struct xarray *xa, void *entry,
+   unsigned long index, unsigned order)
+{
+   XA_STATE_ORDER(xas, xa, index, order);
+   int i = 0;
+   gfp_t gfp = GFP_KERNEL;
+
+   XA_BUG_ON(xa, !xa_empty(xa));
+
+   if (entry)
+   xa_store(xa, xa_to_value(entry), entry, GFP_KERNEL);
+
+   do {
+   xas_lock(&xas);
+   xas_create_range(&xas);
+   xas_unlock(&xas);
+   if (++i == 4)
+   gfp = GFP_NOWAIT;
+   } while (xas_nomem(&xas, gfp));
+
+   if (entry)
+   xa_erase(xa, xa_to_value(entry));
+
+   if (!xas_error(&xas))
+   xa_destroy(xa);
+
+   XA_BUG_ON(xa, xas.xa_alloc);
+   XA_BUG_ON(xa, !xa_empty(xa));
+}
+
 static noinline void check_create_range(struct xarray *xa)
 {
unsigned int order;
@@ -1490,6 +1553,24 @@ static noinline void check_create_range(struct xarray 
*xa)
check_create_range_4(xa, (3U << order) + 1, order);
check_create_range_4(xa, (3U << order) - 1, order);
check_create_range_4(xa, (1U << 24) + 1, order);
+
+   check_create_range_5(xa, NULL, 0, order);
+   check_create_range_5(xa, NULL, (1U << order), order);
+   check_create_range_5(xa, NULL, (2U << order), order);
+   check_create_range_5(xa, NULL, (3U << order), order);
+   check_create_range_5(xa, NULL, (1U << (2 * order)), order);
+
+   check_create_range_5(xa, xa_mk_value(0), 0, order);
+   check_create_range_5(xa, xa_mk_value(0), (1U << order), order);
+   check_create_range_5(xa, xa_mk_value(0), (2U << order), order);
+   check_create_range_5(xa, xa_mk_value(0), (3U << order), order);
+   check_create_range_5(xa, xa_mk_value(0), (1U << (2 * order)), 
order);
+
+   check_create_range_5(xa, xa_mk_value(1U << order), 0, order);
+   check_create_range_5(xa, xa_mk_value(1U << order), (1U << 
order), order);
+   check_create_range_5(xa, xa_mk_value(1U << order), (2U << 
order), order);
+   check_create_range_5(xa, xa_mk_value(1U << order), (3U << 
order), 

Re: BUG_ON(!mapping_empty(&inode->i_data))

2021-04-02 Thread Matthew Wilcox
On Fri, Apr 02, 2021 at 04:13:05AM +0100, Matthew Wilcox wrote:
> + for (;;) {
> + xas_load(xas);
> + if (!xas_is_node(xas))
> + break;
> + xas_delete_node(xas);
> + xas->xa_index -= XA_CHUNK_SIZE;
> + if (xas->xa_index < index)
> + break;

That's a bug.  index can be 0, so the condition would never be true.
It should be:

if (xas->xa_index <= (index | XA_CHUNK_MASK))
break;
xas->xa_index -= XA_CHUNK_SIZE;

The test doesn't notice this bug because the tree is otherwise empty,
and the !xas_is_node(xas) condition is hit first.  The next test will
notice this.
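
So the loop from the previous mail, with that exit test folded in, reads:

	for (;;) {
		xas_load(xas);
		if (!xas_is_node(xas))
			break;
		xas_delete_node(xas);
		if (xas->xa_index <= (index | XA_CHUNK_MASK))
			break;
		xas->xa_index -= XA_CHUNK_SIZE;
	}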


Re: BUG_ON(!mapping_empty(&inode->i_data))

2021-04-01 Thread Matthew Wilcox
On Thu, Apr 01, 2021 at 06:06:15PM +0100, Matthew Wilcox wrote:
> On Wed, Mar 31, 2021 at 02:58:12PM -0700, Hugh Dickins wrote:
> > I suspect there's a bug in the XArray handling in collapse_file(),
> > which sometimes leaves empty nodes behind.
> 
> Urp, yes, that can easily happen.
> 
> /* This will be less messy when we use multi-index entries */
> do {
> xas_lock_irq(&xas);
> xas_create_range(&xas);
> if (!xas_error(&xas))
> break;
> if (!xas_nomem(&xas, GFP_KERNEL)) {
> result = SCAN_FAIL;
> goto out;
> }
> 
> xas_create_range() can absolutely create nodes with zero entries.
> So if we create m/n nodes and then it runs out of memory (or cgroup
> denies it), we can leave nodes in the tree with zero entries.
> 
> There are three options for fixing it ...
>  - Switch to using multi-index entries.  We need to do this anyway, but
>I don't yet have a handle on the bugs that you found last time I
>pushed this into linux-next.  At -rc5 seems like a late stage to be
>trying this solution.
>  - Add an xas_prune_range() that gets called on failure.  Should be
>straightforward to write, but will be obsolete as soon as we do the
>above and it's a pain for the callers.
>  - Change how xas_create_range() works to merely preallocate the xa_nodes
>and not insert them into the tree until we're trying to insert data into
>them.  I favour this option, and this scenario is amenable to writing
>a test that will simulate failure halfway through.
> 
> I'm going to start on option 3 now.

option 3 didn't work out terribly well.  So here's option 4; if we fail
to allocate memory when creating a node, prune the tree.  This fixes
(I think) the problem inherited from the radix tree, although the test
case is only for xas_create_range().  I should add a couple of test cases
for xas_create() failing, but I just got this to pass and I wanted to
send it out as soon as possible.

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 8b1c318189ce..84c6057932f3 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1463,6 +1463,30 @@ static noinline void check_create_range_4(struct xarray 
*xa,
XA_BUG_ON(xa, !xa_empty(xa));
 }
 
+static noinline void check_create_range_5(struct xarray *xa,
+   unsigned long index, unsigned order)
+{
+   XA_STATE_ORDER(xas, xa, index, order);
+   int i = 0;
+   gfp_t gfp = GFP_KERNEL;
+
+   XA_BUG_ON(xa, !xa_empty(xa));
+
+   do {
+   xas_lock(&xas);
+   xas_create_range(&xas);
+   xas_unlock(&xas);
+   if (++i == 4)
+   gfp = GFP_NOWAIT;
+   } while (xas_nomem(&xas, gfp));
+
+   if (!xas_error(&xas))
+   xa_destroy(xa);
+
+   XA_BUG_ON(xa, xas.xa_alloc);
+   XA_BUG_ON(xa, !xa_empty(xa));
+}
+
 static noinline void check_create_range(struct xarray *xa)
 {
unsigned int order;
@@ -1490,6 +1514,12 @@ static noinline void check_create_range(struct xarray 
*xa)
check_create_range_4(xa, (3U << order) + 1, order);
check_create_range_4(xa, (3U << order) - 1, order);
check_create_range_4(xa, (1U << 24) + 1, order);
+
+   check_create_range_5(xa, 0, order);
+   check_create_range_5(xa, (1U << order), order);
+   check_create_range_5(xa, (2U << order), order);
+   check_create_range_5(xa, (3U << order), order);
+   check_create_range_5(xa, (1U << (2 * order)), order);
}
 
check_create_range_3();
diff --git a/lib/xarray.c b/lib/xarray.c
index f5d8f54907b4..923ccde6379e 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -276,77 +276,6 @@ static void xas_destroy(struct xa_state *xas)
}
 }
 
-/**
- * xas_nomem() - Allocate memory if needed.
- * @xas: XArray operation state.
- * @gfp: Memory allocation flags.
- *
- * If we need to add new nodes to the XArray, we try to allocate memory
- * with GFP_NOWAIT while holding the lock, which will usually succeed.
- * If it fails, @xas is flagged as needing memory to continue.  The caller
- * should drop the lock and call xas_nomem().  If xas_nomem() succeeds,
- * the caller should retry the operation.
- *
- * Forward progress is guaranteed as one node is allocated here and
- * stored in the xa_state where it will be found by xas_alloc().  More
- * nodes will likely be found in the slab allocator, but we do not tie
- * them up here.
- *
- * Return: true if memory was needed, and was successfully allocated.
- */
-bool xas_nomem(struct xa_state *xas, gfp_t gfp)
-{
-   if (xas->xa_node != XA_ERROR(-ENOMEM)) {
-   xas_destroy(xas);
-   return false;
-   }

Re: BUG_ON(!mapping_empty(&inode->i_data))

2021-04-01 Thread Matthew Wilcox
On Wed, Mar 31, 2021 at 02:58:12PM -0700, Hugh Dickins wrote:
> I suspect there's a bug in the XArray handling in collapse_file(),
> which sometimes leaves empty nodes behind.

Urp, yes, that can easily happen.

/* This will be less messy when we use multi-index entries */
do {
xas_lock_irq(&xas);
xas_create_range(&xas);
if (!xas_error(&xas))
break;
if (!xas_nomem(&xas, GFP_KERNEL)) {
result = SCAN_FAIL;
goto out;
}

xas_create_range() can absolutely create nodes with zero entries.
So if we create m/n nodes and then it runs out of memory (or cgroup
denies it), we can leave nodes in the tree with zero entries.

There are three options for fixing it ...
 - Switch to using multi-index entries.  We need to do this anyway, but
   I don't yet have a handle on the bugs that you found last time I
   pushed this into linux-next.  At -rc5 seems like a late stage to be
   trying this solution.
 - Add an xas_prune_range() that gets called on failure.  Should be
   straightforward to write, but will be obsolete as soon as we do the
   above and it's a pain for the callers.
 - Change how xas_create_range() works to merely preallocate the xa_nodes
   and not insert them into the tree until we're trying to insert data into
   them.  I favour this option, and this scenario is amenable to writing
   a test that will simulate failure halfway through.

I'm going to start on option 3 now.


Re: BUG_ON(!mapping_empty(&inode->i_data))

2021-03-30 Thread Matthew Wilcox
On Tue, Mar 30, 2021 at 06:30:22PM -0700, Hugh Dickins wrote:
> Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
> mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
> I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
> machines, within an hour or few, repeatably though not to order.
> 
> The stack backtrace has always been clear_inode < ext4_clear_inode <
> ext4_evict_inode < evict < dispose_list < prune_icache_sb <
> super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
> shrink_node_memgs < shrink_node < balance_pgdat < kswapd.
> 
> ext4 is the disk filesystem I read the source to build from, and also
> the filesystem I use on a loop device on a tmpfs file: I have not tried
> with other filesystems, nor checked whether perhaps it happens always on
> the loop one or always on the disk one.  I have not seen it happen with
> tmpfs - probably because its inodes cannot be evicted by the shrinker
> anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
> inodes (but suspect that may be down to timing, or less pressure).
> I doubt it's a matter of filesystem: think it's an XArray thing.
> 
> Whenever I've looked at the XArray nodes involved, the root node
> (shift 6) contained one or three (adjacent) pointers to empty shift
> 0 nodes, which each had offset and parent and array correctly set.
> Is there some way in which empty nodes can get left behind, and so
> fail eviction's mapping_empty() check?

There isn't _supposed_ to be.  The XArray is supposed to delete nodes
whenever the ->count reaches zero.  It might give me a clue if you could
share a dump of the tree, if you still have that handy.

> I did wonder whether some might get left behind if xas_alloc() fails
> (though probably the tree here is too shallow to show that).  Printks
> showed that occasionally xas_alloc() did fail while testing (maybe at
> memcg limit), but there was no correlation with the BUG_ONs.

This is a problem inherited from the radix tree, and I really want to
justify fixing it ... I think I may have enough infrastructure in place
to do it now (as part of the xas_split() commit we can now allocate
multiple xa_nodes in xas->xa_alloc).  But you're right; if we allocated
all the way down to an order-0 node, then this isn't the bug.

Were you using the ALLOW_ERROR_INJECTION feature on
__add_to_page_cache_locked()?  I haven't looked into how that works,
and maybe that could leave us in an inconsistent state.

> I did wonder whether this is a long-standing issue, which your new
> BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
> a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
> nrexceptional BUG_ONs.  The result there surprised me: I expected
> it to behave the same way, but it hits that BUG_ON in a minute or
> so, instead of an hour or so.  Was there a fix you made somewhere,
> to avoid the BUG_ON(!mapping_empty) most of the time? but needs
> more work. I looked around a little, but didn't find any.

I didn't make a fix for this issue; indeed I haven't observed it myself.
It seems like cgroups are a good way to induce allocation failures, so
I should play around with that a bit.  The userspace test-suite has a
relatively malicious allocator that will fail every allocation not marked
as GFP_KERNEL, so it always exercises the fallback path for GFP_NOWAIT,
but then it will always succeed eventually.

> I had hoped to work this out myself, and save us both some writing:
> but better hand over to you, in the hope that you'll quickly guess
> what's up, then I can try patches. I do like the no-nrexceptionals
> series, but there's something still to be fixed.

Agreed.  It seems like it's unmasking a bug that already existed, so
it's not an argument for dropping the series, but we should fix the bug
so we don't crash people's machines.

Arguably, the condition being checked for is not serious enough for a
BUG_ON.  A WARN_ON, yes, and dump the tree for later perusal, but it's
just a memory leak, and not (I think?) likely to lead to later memory
corruption.  The nodes don't contain any pages, so there's nothing to
point to the mapping.
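
Concretely, something like this in clear_inode() would flag the leak without
taking the machine down (sketch only):

	/* Leftover empty nodes are a leak, not corruption: warn and carry on. */
	WARN_ON_ONCE(!mapping_empty(&inode->i_data));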


Re: [PATCH v2 0/4] Remove nrexceptional tracking

2021-03-12 Thread Matthew Wilcox
Ping?

On Thu, Jan 21, 2021 at 06:43:34PM +0000, Matthew Wilcox wrote:
> Ping?  These patches still apply to next-20210121.
> 
> On Mon, Oct 26, 2020 at 03:18:45PM +0000, Matthew Wilcox (Oracle) wrote:
> > We actually use nrexceptional for very little these days.  It's a minor
> > pain to keep in sync with nrpages, but the pain becomes much bigger
> > with the THP patches because we don't know how many indices a shadow
> > entry occupies.  It's easier to just remove it than keep it accurate.
> > 
> > Also, we save 8 bytes per inode which is nothing to sneeze at; on my
> > laptop, it would improve shmem_inode_cache from 22 to 23 objects per
> > 16kB, and inode_cache from 26 to 27 objects.  Combined, that saves
> > a megabyte of memory from a combined usage of 25MB for both caches.
> > Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save
> > any memory for ext4.
> > 
> > Matthew Wilcox (Oracle) (4):
> >   mm: Introduce and use mapping_empty
> >   mm: Stop accounting shadow entries
> >   dax: Account DAX entries as nrpages
> >   mm: Remove nrexceptional from inode
> > 
> >  fs/block_dev.c  |  2 +-
> >  fs/dax.c|  8 
> >  fs/gfs2/glock.c |  3 +--
> >  fs/inode.c  |  2 +-
> >  include/linux/fs.h  |  2 --
> >  include/linux/pagemap.h |  5 +
> >  mm/filemap.c| 16 
> >  mm/swap_state.c |  4 
> >  mm/truncate.c   | 19 +++
> >  mm/workingset.c |  1 -
> >  10 files changed, 15 insertions(+), 47 deletions(-)
> > 
> > -- 
> > 2.28.0
> > 
> 


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink support for fsdax

2021-03-10 Thread Matthew Wilcox
On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> On 13:02 10/03, Matthew Wilcox wrote:
> > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > Btrfs at the same time? It seems weird to me that adding a feature
> > 
> > btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.
> > 
> > If you think about it, btrfs and DAX are diametrically opposite things.
> > DAX is about giving raw access to the hardware.  btrfs is about offering
> > extra value (RAID, checksums, ...), none of which can be done if the
> > filesystem isn't in the read/write path.
> > 
> > That's why there's no DAX support in btrfs.  If you want DAX, you have
> > to give up all the features you like in btrfs.  So you may as well use
> > a different filesystem.
> 
> DAX on btrfs has been attempted[1]. Of course, we could not

But why?  A completeness fetish?  I don't understand why you decided
to do this work.

> have checksums or multi-device with it. However, got stuck on
> associating a shared extent on the same page mapping: basically the
> TODO above dax_associate_entry().
> 
> Shiyang has proposed a way to disassociate existing mapping, but I
> don't think that is the best solution. DAX for CoW will not work until
> we have a way of mapping a page to multiple inodes (page->mapping),
> which will convert a 1-N inode-page mapping to M-N inode-page mapping.

If you're still thinking in terms of pages, you're doing DAX wrong.
DAX should work without a struct page.


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink support for fsdax

2021-03-10 Thread Matthew Wilcox
On Wed, Mar 10, 2021 at 08:36:06AM -0500, Neal Gompa wrote:
> On Wed, Mar 10, 2021 at 8:02 AM Matthew Wilcox  wrote:
> >
> > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > Btrfs at the same time? It seems weird to me that adding a feature
> >
> > btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.
> >
> > If you think about it, btrfs and DAX are diametrically opposite things.
> > DAX is about giving raw access to the hardware.  btrfs is about offering
> > extra value (RAID, checksums, ...), none of which can be done if the
> > filesystem isn't in the read/write path.
> >
> > That's why there's no DAX support in btrfs.  If you want DAX, you have
> > to give up all the features you like in btrfs.  So you may as well use
> > a different filesystem.
> 
> So does that mean that DAX is incompatible with those filesystems when
> layered on DM (e.g. through LVM)?

Yes.  It might be possible to work through RAID-0 or read-only through
RAID-1, but I'm not sure anybody's bothered to do that work.

> Also, based on what you're saying, that means that DAX'd resources
> would not be able to use reflinks on XFS, right? That'd put it in
> similar territory as swap files on Btrfs, I would think.

You can use DAX with reflinks because the CPU can do read-only mmaps.
On a write fault, we break the reflink, copy the data and put in a
writable PTE.


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink support for fsdax

2021-03-10 Thread Matthew Wilcox
On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> Forgive my ignorance, but is there a reason why this isn't wired up to
> Btrfs at the same time? It seems weird to me that adding a feature

btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.

If you think about it, btrfs and DAX are diametrically opposite things.
DAX is about giving raw access to the hardware.  btrfs is about offering
extra value (RAID, checksums, ...), none of which can be done if the
filesystem isn't in the read/write path.

That's why there's no DAX support in btrfs.  If you want DAX, you have
to give up all the features you like in btrfs.  So you may as well use
a different filesystem.


[PATCH v2] include: Remove pagemap.h from blkdev.h

2021-03-09 Thread Matthew Wilcox (Oracle)
My UEK-derived config has 1030 files depending on pagemap.h before
this change.  Afterwards, just 326 files need to be rebuilt when I
touch pagemap.h.  I think blkdev.h is probably included too widely,
but untangling that dependency is harder and this solves my problem.
x86 allmodconfig builds, but there may be implicit include problems
on other architectures.

Signed-off-by: Matthew Wilcox (Oracle) 
---
v2: Fix CONFIG_SWAP=n implicit use of pagemap.h by swap.h.  Increases
the number of files from 240, but that's still a big win -- 68%
reduction instead of 77%.

 block/blk-settings.c  | 1 +
 drivers/block/brd.c   | 1 +
 drivers/block/loop.c  | 1 +
 drivers/md/bcache/super.c | 1 +
 drivers/nvdimm/btt.c  | 1 +
 drivers/nvdimm/pmem.c | 1 +
 drivers/scsi/scsicam.c| 1 +
 include/linux/blkdev.h| 1 -
 include/linux/swap.h  | 1 +
 9 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index b4aa2f37fab6..976085a44fb8 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include /* for max_pfn/max_low_pfn */
 #include 
 #include 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 18bf99906662..2a5a1933826b 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a370cde3ddd4..d58d68f3c7cd 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 71691f32959b..f154c89d1326 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -16,6 +16,7 @@
 #include "features.h"
 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 41aa1f01fc07..18a267d5073f 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b8a85bfb2e95..16760b237229 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -8,6 +8,7 @@
  */
 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/scsi/scsicam.c b/drivers/scsi/scsicam.c
index f1553a453616..0ffdb8f2995f 100644
--- a/drivers/scsi/scsicam.c
+++ b/drivers/scsi/scsicam.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c032cfe133c7..1e2a95599390 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -11,7 +11,6 @@
 #include 
 #include 
 #include 
-#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4cc6ec3bf0ab..ae194bb7ddb4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
-- 
2.30.0


[PATCH] include: Remove pagemap.h from blkdev.h

2021-03-09 Thread Matthew Wilcox (Oracle)
My UEK-derived config has 1030 files depending on pagemap.h before
this change.  Afterwards, just 240 files need to be rebuilt when I
touch pagemap.h.  I think blkdev.h is probably included too widely,
but untangling that dependency is harder and this solves my problem.
x86 allmodconfig builds, but there may be implicit include problems
on other architectures.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 block/blk-settings.c  | 1 +
 drivers/block/brd.c   | 1 +
 drivers/block/loop.c  | 1 +
 drivers/md/bcache/super.c | 1 +
 drivers/nvdimm/btt.c  | 1 +
 drivers/nvdimm/pmem.c | 1 +
 drivers/scsi/scsicam.c| 1 +
 include/linux/blkdev.h| 1 -
 8 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index b4aa2f37fab6..976085a44fb8 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include /* for max_pfn/max_low_pfn */
 #include 
 #include 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 18bf99906662..2a5a1933826b 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a370cde3ddd4..d58d68f3c7cd 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 71691f32959b..f154c89d1326 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -16,6 +16,7 @@
 #include "features.h"
 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 41aa1f01fc07..18a267d5073f 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b8a85bfb2e95..16760b237229 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -8,6 +8,7 @@
  */
 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/scsi/scsicam.c b/drivers/scsi/scsicam.c
index f1553a453616..0ffdb8f2995f 100644
--- a/drivers/scsi/scsicam.c
+++ b/drivers/scsi/scsicam.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/pagemap.h>
 #include 
 #include 
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c032cfe133c7..1e2a95599390 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -11,7 +11,6 @@
 #include 
 #include 
 #include 
-#include <linux/pagemap.h>
 #include 
 #include 
 #include 
-- 
2.30.0


Re: [PATCH v16 08/11] secretmem: add memcg accounting

2021-01-26 Thread Matthew Wilcox
On Mon, Jan 25, 2021 at 11:38:17PM +0200, Mike Rapoport wrote:
> I cannot use __GFP_ACCOUNT because cma_alloc() does not use gfp.
> Besides, kmem accounting with __GFP_ACCOUNT does not seem
> to update stats and there was an explicit request for statistics:
>  
> https://lore.kernel.org/lkml/CALo0P13aq3GsONnZrksZNU9RtfhMsZXGWhK1n=xyjwqizcd...@mail.gmail.com/
> 
> As for (ab)using NR_SLAB_UNRECLAIMABLE_B, as it was already discussed here:
> 
> https://lore.kernel.org/lkml/20201129172625.gd557...@kernel.org/
> 
> I think that a dedicated stats counter would be too much at the moment and
> NR_SLAB_UNRECLAIMABLE_B is the only explicit stat for unreclaimable memory.

That's not true -- Mlocked is also unreclaimable.  And doesn't this
feel more like mlocked memory than unreclaimable slab?  It's also
Unevictable, so could be counted there instead.
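
For illustration, accounting the pool against the mlock counter instead would
be about this much code (a sketch only; which counter is right is exactly the
open question here):

static void secretmem_account_unevictable(struct page *page, int order, long sign)
{
	long nr_pages = sign * (1L << order);

	/* NR_MLOCK shows up as "Mlocked:" in /proc/meminfo */
	mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
}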


Re: [PATCH v16 08/11] secretmem: add memcg accounting

2021-01-25 Thread Matthew Wilcox
On Thu, Jan 21, 2021 at 02:27:20PM +0200, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Account memory consumed by secretmem to memcg. The accounting is updated
> when the memory is actually allocated and freed.

I think this is wrong.  It fails to account subsequent allocators from
the same PMD.  If you want to track like this, you need separate pools
per memcg.

I think you shouldn't try to track like this; better to just track on
a per-page basis.  After all, the page allocator doesn't track order-10
pages to the memcg that initially caused them to be split.

> Signed-off-by: Mike Rapoport 
> Acked-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 
> Cc: Alexander Viro 
> Cc: Andy Lutomirski 
> Cc: Arnd Bergmann 
> Cc: Borislav Petkov 
> Cc: Catalin Marinas 
> Cc: Christopher Lameter 
> Cc: Dan Williams 
> Cc: Dave Hansen 
> Cc: David Hildenbrand 
> Cc: Elena Reshetova 
> Cc: Hagen Paul Pfeifer 
> Cc: "H. Peter Anvin" 
> Cc: Ingo Molnar 
> Cc: James Bottomley 
> Cc: "Kirill A. Shutemov" 
> Cc: Mark Rutland 
> Cc: Matthew Wilcox 
> Cc: Michael Kerrisk 
> Cc: Palmer Dabbelt 
> Cc: Palmer Dabbelt 
> Cc: Paul Walmsley 
> Cc: Peter Zijlstra 
> Cc: Rick Edgecombe 
> Cc: Shuah Khan 
> Cc: Thomas Gleixner 
> Cc: Tycho Andersen 
> Cc: Will Deacon 
> ---
>  mm/filemap.c   |  3 ++-
>  mm/secretmem.c | 36 +++-
>  2 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2d0c6721879d..bb28dd6d9e22 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -839,7 +840,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
>   page->mapping = mapping;
>   page->index = offset;
>  
> - if (!huge) {
> + if (!huge && !page_is_secretmem(page)) {
>   error = mem_cgroup_charge(page, current->mm, gfp);
>   if (error)
>   goto error;
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 469211c7cc3a..05026460e2ee 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -44,6 +45,32 @@ struct secretmem_ctx {
>  
>  static struct cma *secretmem_cma;
>  
> +static int secretmem_account_pages(struct page *page, gfp_t gfp, int order)
> +{
> + int err;
> +
> + err = memcg_kmem_charge_page(page, gfp, order);
> + if (err)
> + return err;
> +
> + /*
> +  * seceremem caches are unreclaimable kernel allocations, so treat
> +  * them as unreclaimable slab memory for VM statistics purposes
> +  */
> + mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
> +   PAGE_SIZE << order);
> +
> + return 0;
> +}
> +
> +static void secretmem_unaccount_pages(struct page *page, int order)
> +{
> +
> + mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
> +   -PAGE_SIZE << order);
> + memcg_kmem_uncharge_page(page, order);
> +}
> +
>  static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
>  {
>   unsigned long nr_pages = (1 << PMD_PAGE_ORDER);
> @@ -56,6 +83,10 @@ static int secretmem_pool_increase(struct secretmem_ctx 
> *ctx, gfp_t gfp)
>   if (!page)
>   return -ENOMEM;
>  
> + err = secretmem_account_pages(page, gfp, PMD_PAGE_ORDER);
> + if (err)
> + goto err_cma_release;
> +
>   /*
>* clear the data left from the prevoius user before dropping the
>* pages from the direct map
> @@ -65,7 +96,7 @@ static int secretmem_pool_increase(struct secretmem_ctx 
> *ctx, gfp_t gfp)
>  
>   err = set_direct_map_invalid_noflush(page, nr_pages);
>   if (err)
> - goto err_cma_release;
> + goto err_memcg_uncharge;
>  
>   addr = (unsigned long)page_address(page);
>   err = gen_pool_add(pool, addr, PMD_SIZE, NUMA_NO_NODE);
> @@ -83,6 +114,8 @@ static int secretmem_pool_increase(struct secretmem_ctx 
> *ctx, gfp_t gfp)
>* won't fail
>*/
>   set_direct_map_default_noflush(page, nr_pages);
> +err_memcg_uncharge:
> + secretmem_unaccount_pages(page, PMD_PAGE_ORDER);
>  err_cma_release:
>   cma_release(secretmem_cma, page, nr_pages);
>   return err;
> @@ -314,6 +347,7 @@ static void secretmem_cleanup_chunk(struct gen_pool *pool,
>   int i;
>  
>   set_direct_map_default_noflush(page, nr_pages);
> + secretmem_unaccount_pages(page, PMD_PAGE_ORDER);
>  
>   for (i = 0; i < nr_pages; i++)
>   clear_highpage(page + i);
> -- 
> 2.28.0
> 


Re: [PATCH v2 1/4] mm: Introduce and use mapping_empty

2021-01-21 Thread Matthew Wilcox
On Thu, Jan 21, 2021 at 03:42:31PM -0500, Johannes Weiner wrote:
> On Mon, Oct 26, 2020 at 03:18:46PM +0000, Matthew Wilcox (Oracle) wrote:
> > Instead of checking the two counters (nrpages and nrexceptional), we
> > can just check whether i_pages is empty.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > Tested-by: Vishal Verma 
> 
> Acked-by: Johannes Weiner 
> 
> Heh, I was looking for the fs/inode.c hunk here, because I remember
> those BUG_ONs in the inode free path. Found it in the last patch - I
> guess they escaped grep but the compiler let you know? :-)

Heh, I forget now!  I think I did it that way on purpose, but now I
forget what that purpose was!


Re: Expense of read_iter

2021-01-21 Thread Matthew Wilcox
On Wed, Jan 20, 2021 at 10:12:01AM -0500, Mikulas Patocka wrote:
> Do you have some idea how to optimize the generic code that calls 
> ->read_iter?

Yes.

> It might be better to maintain an f_iocb_flags in the
> struct file and just copy that unconditionally.  We'd need to remember
> to update it in fcntl(F_SETFL), but I think that's the only place.

Want to give that a try?


Re: [PATCH v15 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-01-20 Thread Matthew Wilcox
On Wed, Jan 20, 2021 at 08:06:08PM +0200, Mike Rapoport wrote:
> +static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
>  {
> + unsigned long nr_pages = (1 << PMD_PAGE_ORDER);
> + struct gen_pool *pool = ctx->pool;
> + unsigned long addr;
> + struct page *page;
> + int err;
> +
> + page = cma_alloc(secretmem_cma, nr_pages, PMD_SIZE, gfp & __GFP_NOWARN);
> + if (!page)
> + return -ENOMEM;

Does cma_alloc() zero the pages it allocates?  If not, where do we avoid
leaking kernel memory to userspace?
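
(For reference, the sort of clearing the series does elsewhere, which would be
needed here too if cma_alloc() does not zero -- sketch only:)

	int i;

	for (i = 0; i < nr_pages; i++)
		clear_highpage(page + i);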


Re: [PATCH v15 06/11] mm: introduce memfd_secret system call to create "secret" memory areas

2021-01-20 Thread Matthew Wilcox
On Wed, Jan 20, 2021 at 08:06:07PM +0200, Mike Rapoport wrote:
> +static struct page *secretmem_alloc_page(gfp_t gfp)
> +{
> + /*
> +  * FIXME: use a cache of large pages to reduce the direct map
> +  * fragmentation
> +  */
> + return alloc_page(gfp);
> +}
> +
> +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + struct inode *inode = file_inode(vmf->vma->vm_file);
> + pgoff_t offset = vmf->pgoff;
> + unsigned long addr;
> + struct page *page;
> + int err;
> +
> + if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> + return vmf_error(-EINVAL);
> +
> +retry:
> + page = find_lock_page(mapping, offset);
> + if (!page) {
> + page = secretmem_alloc_page(vmf->gfp_mask);
> + if (!page)
> + return VM_FAULT_OOM;
> +
> + err = set_direct_map_invalid_noflush(page, 1);
> + if (err)
> + return vmf_error(err);

Haven't we leaked the page at this point?

> + __SetPageUptodate(page);
> + err = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);

At this point, doesn't the page contain data from the last person to use
the page?  ie we've leaked data to this process?  I don't see anywhere
that we write data to the page.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v14 05/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-01-20 Thread Matthew Wilcox
On Wed, Jan 20, 2021 at 05:05:10PM +0200, Mike Rapoport wrote:
> On Tue, Jan 19, 2021 at 08:22:13PM +0000, Matthew Wilcox wrote:
> > On Thu, Dec 03, 2020 at 08:29:44AM +0200, Mike Rapoport wrote:
> > > +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > +{
> > > + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > > + struct inode *inode = file_inode(vmf->vma->vm_file);
> > > + pgoff_t offset = vmf->pgoff;
> > > + vm_fault_t ret = 0;
> > > + unsigned long addr;
> > > + struct page *page;
> > > + int err;
> > > +
> > > + if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > > + return vmf_error(-EINVAL);
> > > +
> > > + page = find_get_page(mapping, offset);
> > > + if (!page) {
> > > +
> > > + page = secretmem_alloc_page(vmf->gfp_mask);
> > > + if (!page)
> > > + return vmf_error(-ENOMEM);
> > 
> > Just use VM_FAULT_OOM directly.
>  
> Ok.
> 
> > > + err = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
> > > + if (unlikely(err))
> > > + goto err_put_page;
> > 
> > What if the error is EEXIST because somebody else raced with you to add
> > a new page to the page cache?
> 
> Right, for -EEXIST I need a retry here, thanks.
> 
> > > + err = set_direct_map_invalid_noflush(page, 1);
> > > + if (err)
> > > + goto err_del_page_cache;
> > 
> > Does this work correctly if somebody else has a reference to the page
> > in the meantime?
> 
> Yes, it does. If somebody else won the race that page was dropped from the
> direct map and this call would be essentially a nop. And anyway, the very
> next patch changes the way pages are removed from the direct map ;-)

What I'm thinking is:

thread A page faults
doesn't find page
allocates page
adds page to page cache
thread B page faults
does find page in page cache
set direct map invalid fails
deletes from page cache
... ?


Re: [PATCH v14 05/10] mm: introduce memfd_secret system call to create "secret" memory areas

2021-01-19 Thread Matthew Wilcox
On Thu, Dec 03, 2020 at 08:29:44AM +0200, Mike Rapoport wrote:
> +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + struct inode *inode = file_inode(vmf->vma->vm_file);
> + pgoff_t offset = vmf->pgoff;
> + vm_fault_t ret = 0;
> + unsigned long addr;
> + struct page *page;
> + int err;
> +
> + if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> + return vmf_error(-EINVAL);
> +
> + page = find_get_page(mapping, offset);
> + if (!page) {
> +
> + page = secretmem_alloc_page(vmf->gfp_mask);
> + if (!page)
> + return vmf_error(-ENOMEM);

Just use VM_FAULT_OOM directly.

> + err = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
> + if (unlikely(err))
> + goto err_put_page;

What if the error is EEXIST because somebody else raced with you to add
a new page to the page cache?

> + err = set_direct_map_invalid_noflush(page, 1);
> + if (err)
> + goto err_del_page_cache;

Does this work correctly if somebody else has a reference to the page
in the meantime?

> + addr = (unsigned long)page_address(page);
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +
> + __SetPageUptodate(page);

Once you've added it to the cache, somebody else can come along and try
to lock it.  They will set PageWaiter.  Now you call __SetPageUptodate
and wipe out their PageWaiter bit.  So you won't wake them up when you
unlock.

You can call __SetPageUptodate before adding it to the page cache,
but once it's visible to another thread, you can't do that.

> + ret = VM_FAULT_LOCKED;
> + }
> +
> + vmf->page = page;

You're supposed to return the page locked, so use find_lock_page() instead
of find_get_page().

> + return ret;
> +
> +err_del_page_cache:
> + delete_from_page_cache(page);
> +err_put_page:
> + put_page(page);
> + return vmf_error(err);
> +}
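
Putting those comments together, the fault path might be restructured along
these lines (a sketch, not a tested patch; the retry label is assumed, and
whether the direct-map removal should happen before or after publishing the
page is still the open question above):

retry:
	page = find_lock_page(mapping, offset);
	if (!page) {
		page = secretmem_alloc_page(vmf->gfp_mask);
		if (!page)
			return VM_FAULT_OOM;

		/* Only we can see the page here, so the non-atomic set is fine. */
		__SetPageUptodate(page);

		err = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
		if (unlikely(err)) {
			put_page(page);
			if (err == -EEXIST)	/* lost the race: use the other page */
				goto retry;
			return vmf_error(err);
		}

		err = set_direct_map_invalid_noflush(page, 1);
		if (err) {
			delete_from_page_cache(page);
			unlock_page(page);
			put_page(page);
			return vmf_error(err);
		}
	}

	vmf->page = page;	/* locked, as the caller expects */
	return VM_FAULT_LOCKED;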


Re: Expense of read_iter

2021-01-10 Thread Matthew Wilcox
On Sun, Jan 10, 2021 at 04:19:15PM -0500, Mikulas Patocka wrote:
> I put counters into vfs_read and vfs_readv.
> 
> After a fresh boot of the virtual machine, the counters show "13385 4". 
> After a kernel compilation they show "4475220 8".
> 
> So, the readv path is almost unused.
> 
> My reasoning was that we should optimize for the "read" path and glue the 
> "readv" path on the top of that. Currently, the kernel is doing the 
> opposite - optimizing for "readv" and glueing "read" on the top of it.

But it's not about optimising for read vs readv.  read_iter handles
a host of other cases, such as pread(), preadv(), AIO reads, splice,
and reads to in-kernel buffers.

Some device drivers abused read() vs readv() to actually return different
information, depending which you called.  That's why there's now a
prohibition against both.

So let's figure out how to make iter_read() perform well for sys_read().
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: Expense of read_iter

2021-01-09 Thread Matthew Wilcox
On Thu, Jan 07, 2021 at 01:59:01PM -0500, Mikulas Patocka wrote:
> On Thu, 7 Jan 2021, Matthew Wilcox wrote:
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > I'd like to ask about this piece of code in __kernel_read:
> > >   if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > >   return warn_unsupported...
> > > and __kernel_write:
> > >   if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > >   return warn_unsupported...
> > > 
> > > - It exits with an error if both read_iter and read or write_iter and 
> > > write are present.
> > > 
> > > I found out that on NVFS, reading a file with the read method has 10% 
> > > better performance than the read_iter method. The benchmark just reads 
> > > the 
> > > same 4k page over and over again - and the cost of creating and parsing 
> > > the kiocb and iov_iter structures is just that high.
> > 
> > Which part of it is so expensive?
> 
> The read_iter path is much bigger:
> vfs_read  - 0x160 bytes
> new_sync_read - 0x160 bytes
> nvfs_rw_iter  - 0x100 bytes
> nvfs_rw_iter_locked   - 0x4a0 bytes
> iov_iter_advance  - 0x300 bytes

Number of bytes in a function isn't really correlated with how expensive
a particular function is.  That said, looking at new_sync_read() shows
one part that's particularly bad, init_sync_kiocb():

static inline int iocb_flags(struct file *file)
{
int res = 0;
if (file->f_flags & O_APPEND)
res |= IOCB_APPEND;
 7ec:   8b 57 40mov0x40(%rdi),%edx
 7ef:   48 89 75 80 mov%rsi,-0x80(%rbp)
if (file->f_flags & O_DIRECT)
 7f3:   89 d0   mov%edx,%eax
 7f5:   c1 e8 06shr$0x6,%eax
 7f8:   83 e0 10and$0x10,%eax
res |= IOCB_DIRECT;
if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
 7fb:   89 c1   mov%eax,%ecx
 7fd:   81 c9 00 00 02 00   or $0x2,%ecx
 803:   f6 c6 40test   $0x40,%dh
 806:   0f 45 c1cmovne %ecx,%eax
res |= IOCB_DSYNC;
 809:   f6 c6 10test   $0x10,%dh
 80c:   75 18   jne826 
 80e:   48 8b 8f d8 00 00 00mov0xd8(%rdi),%rcx
 815:   48 8b 09mov(%rcx),%rcx
 818:   48 8b 71 28 mov0x28(%rcx),%rsi
 81c:   f6 46 50 10 testb  $0x10,0x50(%rsi)
 820:   0f 84 e2 00 00 00   je 908 
if (file->f_flags & __O_SYNC)
 826:   83 c8 02or $0x2,%eax
res |= IOCB_SYNC;
return res;
 829:   89 c1   mov%eax,%ecx
 82b:   83 c9 04or $0x4,%ecx
 82e:   81 e2 00 00 10 00   and$0x10,%edx

We could optimise this by, eg, checking for (__O_SYNC | O_DIRECT |
O_APPEND) and returning 0 if none of them are set, since they're all
pretty rare.  It might be better to maintain an f_iocb_flags in the
struct file and just copy that unconditionally.  We'd need to remember
to update it in fcntl(F_SETFL), but I think that's the only place.
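
For illustration, the first option might look something like this (a
sketch only; note the early return also skips the IS_SYNC() lookup,
which is part of the trade-off being proposed):

static inline int iocb_flags(struct file *file)
{
	int res = 0;

	if (likely(!(file->f_flags &
		     (O_APPEND | O_DIRECT | O_DSYNC | __O_SYNC))))
		return 0;	/* caveat: misses IS_SYNC() inodes */

	if (file->f_flags & O_APPEND)
		res |= IOCB_APPEND;
	if (file->f_flags & O_DIRECT)
		res |= IOCB_DIRECT;
	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
		res |= IOCB_DSYNC;
	if (file->f_flags & __O_SYNC)
		res |= IOCB_SYNC;
	return res;
}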


> If we go with the "read" method, there's just:
> vfs_read  - 0x160 bytes
> nvfs_read - 0x200 bytes
> 
> > Is it worth, eg adding an iov_iter
> > type that points to a single buffer instead of a single-member iov?

>  6.57%  pread[nvfs][k] nvfs_rw_iter_locked
>  2.31%  pread[kernel.vmlinux]  [k] new_sync_read
>  1.89%  pread[kernel.vmlinux]  [k] iov_iter_advance
>  1.24%  pread[nvfs][k] nvfs_rw_iter
>  0.29%  pread[kernel.vmlinux]  [k] iov_iter_init

>  2.71%  pread[nvfs][k] nvfs_read

> Note that if we sum the percentage of nvfs_iter_locked, new_sync_read, 
> iov_iter_advance, nvfs_rw_iter, we get 12.01%. On the other hand, in the 
> second trace, nvfs_read consumes just 2.71% - and it replaces 
> functionality of all these functions.
> 
> That is the reason for that 10% degradation with read_iter.

You seem to be focusing on your argument for "let's just permit
filesystems to implement both ->read and ->read_iter".  My suggestion
is that we need to optimise the ->read_iter path, but to do that we need
to know what's expensive.

nvfs_rw_iter_locked() looks very complicated.  I suspect it can
be simplified.  Of course new_sync_read() needs to be improved too,
as do the other functions here, but fully a third of the difference
between read() and read_iter() is the difference between nvfs_read()
and nvfs_rw_iter_locked().


Expense of read_iter

2021-01-07 Thread Matthew Wilcox
On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> I'd like to ask about this piece of code in __kernel_read:
>   if (unlikely(!file->f_op->read_iter || file->f_op->read))
>   return warn_unsupported...
> and __kernel_write:
>   if (unlikely(!file->f_op->write_iter || file->f_op->write))
>   return warn_unsupported...
> 
> - It exits with an error if both read_iter and read or write_iter and 
> write are present.
> 
> I found out that on NVFS, reading a file with the read method has 10% 
> better performance than the read_iter method. The benchmark just reads the 
> same 4k page over and over again - and the cost of creating and parsing 
> the kiocb and iov_iter structures is just that high.

Which part of it is so expensive?  Is it worth, eg adding an iov_iter
type that points to a single buffer instead of a single-member iov?

+++ b/include/linux/uio.h
@@ -19,6 +19,7 @@ struct kvec {
 
 enum iter_type {
/* iter types */
+   ITER_UBUF = 2,
ITER_IOVEC = 4,
ITER_KVEC = 8,
ITER_BVEC = 16,
@@ -36,6 +36,7 @@ struct iov_iter {
size_t iov_offset;
size_t count;
union {
+   void __user *buf;
const struct iovec *iov;
const struct kvec *kvec;
const struct bio_vec *bvec;

and then doing all the appropriate changes to make that work.
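
For illustration, a hypothetical constructor for the new type could look
like this (helper and field names assumed, matching the fragment above):

static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
				 void __user *buf, size_t count)
{
	*i = (struct iov_iter) {
		.type = ITER_UBUF | direction,
		.buf = buf,
		.count = count,
	};
}

new_sync_read() would then build its iterator with
iov_iter_ubuf(&iter, READ, buf, len) instead of a one-element iovec, and
the iterate_* machinery would grow an ITER_UBUF case.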


Re: [PATCH v2] fs/dax: include to fix build error on ARC

2021-01-04 Thread Matthew Wilcox
On Mon, Jan 04, 2021 at 12:13:02PM -0800, Dan Williams wrote:
> On Thu, Dec 31, 2020 at 8:29 PM Randy Dunlap  wrote:
> > +++ lnx-511-rc1/fs/dax.c
> > @@ -25,6 +25,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> 
> I would expect this to come from one of the linux/ includes like
> linux/mm.h. asm/ headers are implementation linux/ headers are api.

It does indeed come from linux/mm.h already.  And a number of
other random places, including linux/serial.h.  Our headers are a mess,
but they shouldn't be made worse by adding _this_ include.  So I
second Dan's objection here.

> Once you drop that then the subject of this patch can just be "arc:
> add a copy_user_page() implementation", and handled by the arc
> maintainer (or I can take it with Vineet's ack).
> 
> >  #include 
> 
> Yes, this one should have a linux/ api header to front it, but that's
> a cleanup for another day.

Definitely more involved.


Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages

2020-12-09 Thread Matthew Wilcox
On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +, Joao Martins wrote:
> 
> > Today (without the series) struct pages are not represented the way they
> > are expressed in the page tables, which is what I am hoping to fix in this
> > series thus initializing these as compound pages of a given order. But me
> > introducing PGMAP_COMPOUND was to conservatively keep both old 
> > (non-compound)
> > and new (compound pages) co-exist.
> 
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.

Uh, yes.  I'm surprised it hasn't caused more problems.

> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.

But we also see good performance improvements from doing all reference
counts on the head page instead of spread throughout the pages, so we
really want compound pages.
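
Roughly, the saving comes from taking one reference on the head for the
whole range instead of one per subpage, along the lines of (sketch only;
'refs' is the number of subpages covered):

	head = compound_head(page);
	if (unlikely(!page_ref_add_unless(head, refs, 0)))
		return 0;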


Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages

2020-12-08 Thread Matthew Wilcox
On Tue, Dec 08, 2020 at 09:59:19PM -0800, John Hubbard wrote:
> On 12/8/20 9:28 AM, Joao Martins wrote:
> > Add a new flag for struct dev_pagemap which designates that a a pagemap
> 
> a a
> 
> > is described as a set of compound pages or in other words, that how
> > pages are grouped together in the page tables are reflected in how we
> > describe struct pages. This means that rather than initializing
> > individual struct pages, we also initialize these struct pages, as
> 
> Let's not say "rather than x, we also do y", because it's self-contradictory.
> I think you want to just leave out the "also", like this:
> 
> "This means that rather than initializing> individual struct pages, we
> initialize these struct pages ..."
> 
> Is that right?

I'd phrase it as:

Add a new flag for struct dev_pagemap which specifies that a pagemap is
composed of a set of compound pages instead of individual pages.  When
these pages are initialised, most are initialised as tail pages
instead of order-0 pages.

> > For certain ZONE_DEVICE users, like device-dax, which have a fixed page
> > size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> > thus playing the same tricks as hugetlb pages.

Rather than "playing the same tricks", how about "are treated the same
way as THP or hugetlb pages"?

> > +   if (pgmap->flags & PGMAP_COMPOUND)
> > +   percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> > +   - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
> 
> Is there some reason that we cannot use range_len(), instead of pfn_end() 
> minus
> pfn_first()? (Yes, this more about the pre-existing code than about your 
> change.)
> 
> And if not, then why are the nearby range_len() uses OK? I realize that 
> range_len()
> is simpler and skips a case, but it's not clear that it's required here. But 
> I'm
> new to this area so be warned. :)
> 
> Also, dividing by PHYS_PFN() feels quite misleading: that function does what 
> you
> happen to want, but is not named accordingly. Can you use or create something
> more accurately named? Like "number of pages in this large page"?

We have compound_nr(), but that takes a struct page as an argument.
We also have HPAGE_NR_PAGES.  I'm not quite clear what you want.


Re: PATCH] fs/dax: fix compile problem on parisc and mips

2020-12-04 Thread Matthew Wilcox
On Fri, Dec 04, 2020 at 08:28:47AM -0500, John David Anglin wrote:
> (.mlocate): page allocation failure: order:5, 
> mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
>  [<4035416c>] __kmalloc+0x5e4/0x740
>  [<040ddbe8>] nfsd_reply_cache_init+0x1d0/0x360 [nfsd]

Oof, order 5.  Fortunately, that one was already fixed by commit
8c38b705b4f4ca4e7f9cc116141bc38391917c30.


Re: PATCH] fs/dax: fix compile problem on parisc and mips

2020-12-04 Thread Matthew Wilcox
On Fri, Dec 04, 2020 at 08:57:37AM +0100, Helge Deller wrote:
> On 12/4/20 4:48 AM, Matthew Wilcox wrote:
> > On Thu, Dec 03, 2020 at 04:33:10PM -0800, James Bottomley wrote:
> >> These platforms define PMD_ORDER in asm/pgtable.h
> >
> > I think that's the real problem, though.
> >
> > #define PGD_ORDER   1 /* Number of pages per pgd */
> > #define PMD_ORDER   1 /* Number of pages per pmd */
> > #define PGD_ALLOC_ORDER (2 + 1) /* first pgd contains pmd */
> > #else
> > #define PGD_ORDER   1 /* Number of pages per pgd */
> > #define PGD_ALLOC_ORDER (PGD_ORDER + 1)
> >
> > That should clearly be PMD_ALLOC_ORDER, not PMD_ORDER.  Or even
> > PAGES_PER_PMD like the comment calls it, because I really think
> > that doing an order-3 (8 pages) allocation for the PGD is wrong.
> 
> We need a spinlock to protect parallel accesses to the PGD,
> search for pgd_spinlock().
> This spinlock is stored behind the memory used for the PGD, which
> is why we allocate more memory (and waste 3 pages).

There are better places to store it than that!  For example, storing it
in the struct page, like many architectures do for split ptlocks.
You'll have to skip the _pt_pad_1 so it doesn't get confused with
being a compound_head, but something like this:

struct {/* PA-RISC PGD */
unsigned long _pa_pad_1;/* compound_head */
#if ALLOC_SPLIT_PA_PTLOCKS
spinlock_t *pa_ptl;
#else
spinlock_t pa_ptl;
#endif
};

inside struct page (linux/mm_types.h) should do the trick.  You'll
still need to allocate them separately if various debugging options
are enabled (see the ALLOC_SPLIT_PTLOCKS for details), but usually
this will save you a lot of memory.
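
For illustration, the lookup would then be something like this (a sketch
that ignores the split-ptlock case):

static inline spinlock_t *pgd_spinlock(pgd_t *pgd)
{
	return &virt_to_page(pgd)->pa_ptl;
}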

You could also fill in pt_mm like x86 does for its pgds, and then use
mm->page_table_lock to protect whatever the PGD lock currently protects.
Maybe page_table_lock is sometimes held when calling ptep_set_wrprotect()
and sometimes isn't; then this wouldn't work.

Also, could you fix the comments?  They don't match the code:

 #define PGD_ORDER  1 /* Number of pages per pgd */

should be

 #define PGD_ALLOC_ORDER  1 /* 2 pages (8KiB) per pgd */


Re: PATCH] fs/dax: fix compile problem on parisc and mips

2020-12-03 Thread Matthew Wilcox
On Thu, Dec 03, 2020 at 04:33:10PM -0800, James Bottomley wrote:
> These platforms define PMD_ORDER in asm/pgtable.h

I think that's the real problem, though.

#define PGD_ORDER   1 /* Number of pages per pgd */
#define PMD_ORDER   1 /* Number of pages per pmd */
#define PGD_ALLOC_ORDER (2 + 1) /* first pgd contains pmd */
#else
#define PGD_ORDER   1 /* Number of pages per pgd */
#define PGD_ALLOC_ORDER (PGD_ORDER + 1)

That should clearly be PMD_ALLOC_ORDER, not PMD_ORDER.  Or even
PAGES_PER_PMD like the comment calls it, because I really think
that doing an order-3 (8 pages) allocation for the PGD is wrong.


Re: mapcount corruption regression

2020-12-01 Thread Matthew Wilcox
On Tue, Dec 01, 2020 at 06:28:45PM -0800, Dan Williams wrote:
> On Tue, Dec 1, 2020 at 12:49 PM Matthew Wilcox  wrote:
> >
> > On Tue, Dec 01, 2020 at 12:42:39PM -0800, Dan Williams wrote:
> > > On Mon, Nov 30, 2020 at 6:24 PM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Mon, Nov 30, 2020 at 05:20:25PM -0800, Dan Williams wrote:
> > > > > Kirill, Willy, compound page experts,
> > > > >
> > > > > I am seeking some debug ideas about the following splat:
> > > > >
> > > > > BUG: Bad page state in process lt-pmem-ns  pfn:121a12
> > > > > page:51ef73f7 refcount:0 mapcount:-1024
> > > > > mapping: index:0x0 pfn:0x121a12
> > > >
> > > > Mapcount of -1024 is the signature of:
> > > >
> > > > #define PG_guard0x0400
> > >
> > > Oh, thanks for that. I overlooked how mapcount is overloaded. Although
> > > in v5.10-rc4 that value is:
> > >
> > > #define PG_table0x0400
> >
> > Ah, I was looking at -next, where Roman renumbered it.
> >
> > I know UML had a problem where it was not clearing PG_table, but you
> > seem to be running on bare metal.  SuperH did too, but again, you're
> > not using SuperH.
> >
> > > >
> > > > (the bits are inverted, so this turns into 0xfbff which is reported
> > > > as -1024)
> > > >
> > > > I assume you have debug_pagealloc enabled?
> > >
> > > Added it, but no extra spew. I'll dig a bit more on how PG_table is
> > > not being cleared in this case.
> >
> > I only asked about debug_pagealloc because that sets PG_guard.  Since
> > the problem is actually PG_table, it's not relevant.
> 
> As a shot in the dark I reverted:
> 
> b2b29d6d0119 mm: account PMD tables like PTE tables
> 
> ...and the test passed.

That's not really surprising ... you're still freeing PMD tables without
calling the destructor, which means that you're leaking ptlocks on
configs that can't embed the ptlock in the struct page.

I suppose it shows that you're leaking a PMD table rather than a PTE
table, so that might help track it down.  Checking for PG_table in
free_unref_page() and calling show_stack() will probably help more.
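
Something along these lines dropped into free_unref_page() would do it
(debugging sketch only, not for merging; dump_stack() because it's the
current task doing the freeing):

	if (unlikely(PageTable(page))) {
		pr_warn("freeing PG_table page, pfn %#lx\n", page_to_pfn(page));
		dump_stack();
	}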


Re: mapcount corruption regression

2020-12-01 Thread Matthew Wilcox
On Tue, Dec 01, 2020 at 12:42:39PM -0800, Dan Williams wrote:
> On Mon, Nov 30, 2020 at 6:24 PM Matthew Wilcox  wrote:
> >
> > On Mon, Nov 30, 2020 at 05:20:25PM -0800, Dan Williams wrote:
> > > Kirill, Willy, compound page experts,
> > >
> > > I am seeking some debug ideas about the following splat:
> > >
> > > BUG: Bad page state in process lt-pmem-ns  pfn:121a12
> > > page:51ef73f7 refcount:0 mapcount:-1024
> > > mapping: index:0x0 pfn:0x121a12
> >
> > Mapcount of -1024 is the signature of:
> >
> > #define PG_guard0x0400
> 
> Oh, thanks for that. I overlooked how mapcount is overloaded. Although
> in v5.10-rc4 that value is:
> 
> #define PG_table0x0400

Ah, I was looking at -next, where Roman renumbered it.

I know UML had a problem where it was not clearing PG_table, but you
seem to be running on bare metal.  SuperH did too, but again, you're
not using SuperH.

> >
> > (the bits are inverted, so this turns into 0xfbff which is reported
> > as -1024)
> >
> > I assume you have debug_pagealloc enabled?
> 
> Added it, but no extra spew. I'll dig a bit more on how PG_table is
> not being cleared in this case.

I only asked about debug_pagealloc because that sets PG_guard.  Since
the problem is actually PG_table, it's not relevant.


Re: mapcount corruption regression

2020-11-30 Thread Matthew Wilcox
On Mon, Nov 30, 2020 at 05:20:25PM -0800, Dan Williams wrote:
> Kirill, Willy, compound page experts,
> 
> I am seeking some debug ideas about the following splat:
> 
> BUG: Bad page state in process lt-pmem-ns  pfn:121a12
> page:51ef73f7 refcount:0 mapcount:-1024
> mapping: index:0x0 pfn:0x121a12

Mapcount of -1024 is the signature of:

#define PG_guard0x0400

(the bits are inverted, so this turns into 0xfbff which is reported
as -1024)

I assume you have debug_pagealloc enabled?

> flags: 0x28()
> raw: 0028 dead0100  
> raw:  8a6914886b48 fbff 
> page dumped because: nonzero mapcount
> [..]
> CPU: 26 PID: 6127 Comm: lt-pmem-ns Tainted: G   OE 5.10.0-rc4+ 
> #450
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Call Trace:
>  dump_stack+0x8b/0xb0
>  bad_page.cold+0x63/0x94
>  free_pcp_prepare+0x224/0x270
>  free_unref_page+0x18/0xd0
>  pud_free_pmd_page+0x146/0x160
>  ioremap_pud_range+0xe3/0x350
>  ioremap_page_range+0x108/0x160
>  __ioremap_caller.constprop.0+0x174/0x2b0
>  ? memremap+0x7a/0x110
>  memremap+0x7a/0x110
>  devm_memremap+0x53/0xa0
>  pmem_attach_disk+0x4ed/0x530 [nd_pmem]
> 
> It triggers on v5.10-rc4 not on v5.9, but the bisect comes up with an
> ambiguous result. I've run the bisect 3 times and landed on:
> 
> 032c7ed95817 Merge tag 'arm64-upstream' of
> git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> 
> ...which does not touch anything near _mapcount. I suspect there is
> something unique about the build that lines up the corruption to
> happen or not happen.
> 
> The test is a simple namespace creation test that results in an
> memremap() / ioremap() over several gigabytes of memory capacity. The
> -1024 was interesting because that's the GUP_PIN_COUNTING_BIAS, but
> that's the _refcount, I did not see any questionable changes to how
> _mapcount is manipulated post v5.9. Problem should be reproducible by
> running:
> 
> make -j TESTS="pmem-ns" check
> 
> ...in qemu-kvm with some virtual pmem defined:
> 
> -object 
> memory-backend-file,id=mem1,share,mem-path=${mem}1,size=$((mem_size+label_size))
> -device nvdimm,memdev=mem1,id=nv1,label-size=${label_size}
> 
> ...where ${mem}1 is a 128GB sparse file $mem_size is 127GB and
> $label_size is 128KB.
> 


Re: [RFC PATCH 1/3] fs: dax.c: move fs hole signifier from DAX_ZERO_PAGE to XA_ZERO_ENTRY

2020-11-30 Thread Matthew Wilcox
On Mon, Nov 30, 2020 at 04:09:23PM +0100, Jan Kara wrote:
> On Mon 30-11-20 06:22:42, Amy Parker wrote:
> > > > +/*
> > > > + * A zero entry, XA_ZERO_ENTRY, is used to represent a zero page. This
> > > > + * definition helps with checking if an entry is a PMD size.
> > > > + */
> > > > +#define XA_ZERO_PMD_ENTRY DAX_PMD | (unsigned long)XA_ZERO_ENTRY
> > > > +
> > >
> > > Firstly, if you define a macro, we usually wrap it inside braces like:
> > >
> > > #define XA_ZERO_PMD_ENTRY (DAX_PMD | (unsigned long)XA_ZERO_ENTRY)
> > >
> > > to avoid unexpected issues when macro expands and surrounding operators
> > > have higher priority.
> > 
> > Oops! Must've missed that - I'll make sure to get on that when
> > revising this patch.
> > 
> > > Secondly, I don't think you can combine XA_ZERO_ENTRY with DAX_PMD (or any
> > > other bits for that matter). XA_ZERO_ENTRY is defined as
> > > xa_mk_internal(257) which is ((257 << 2) | 2) - DAX bits will overlap with
> > > the bits xarray internal entries are using and things will break.
> > 
> > Could you provide an example of this overlap? I can't seem to find any.
> 
> Well XA_ZERO_ENTRY | DAX_PMD == ((257 << 2) | 2) | (1 << 1). So the way
> you've defined XA_ZERO_PMD_ENTRY the DAX_PMD will just get lost. AFAIU (but
> Matthew might correct me here), for internal entries (and XA_ZERO_ENTRY is
> one instance of such entry) low 10 bits of the entry values are
> reserved for internal xarray usage so DAX could use only higher bits. For
> classical value entries, only the lowest bit is reserved for xarray usage,
> all the rest is available for the user (and so DAX uses it).

The XArray entries are pretty inventive in how they are used ...

1. If bit 0 is set, it's a value entry.  That is, it encodes an integer
between 0 and LONG_MAX.
2. If bits 0 & 1 are clear, it's a pointer.
3. If bit 0 is clear and bit 1 is set, it's _either_ an internal entry,
_or_ it's a pointer that's only 2-byte aligned.  These can exist on m68k,
alas.

Internal entries above -MAX_ERRNO are used for returning errors.
Internal entries below 1024 (256 << 2) are used for sibling entries.
Internal entry 256 is the retry entry.
Internal entry 257 is the zero entry.
Internal entries 258-1023 are not currently used.
Internal entries between 4096 and -MAX_ERRNO are pointers to the next
level of the tree.

The m68k pointer problem is "solved" by only allowing them to be in a
node which is the bottom of the tree.  This means that the optimisation
of placing a single pointer at index 0 in the root of the tree has to be
disabled for these pointers.  That's unfortunate, but there's no other
way to solve it, given the need for RCU readers.  You also can't use
an m68k pointer for a multi-index entry.

There's also support for pointers tagged in their lower bits.  Those are
incompatible with value entries.  And you can't use pointer tag 2 ...
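
For reference, the three cases above decode roughly like this (sketch;
the helper name is made up, the xa_* accessors are the real ones from
include/linux/xarray.h):

static void xa_classify(void *entry)
{
	if (xa_is_value(entry))			/* bit 0 set */
		pr_info("value %lu\n", xa_to_value(entry));
	else if (xa_is_internal(entry))		/* bits 1:0 == 10 */
		pr_info("internal %lu\n", xa_to_internal(entry));
	else					/* an aligned pointer */
		pr_info("pointer %px\n", entry);
}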


Re: [PATCH v8 4/9] mm: introduce memfd_secret system call to create "secret" memory areas

2020-11-13 Thread Matthew Wilcox
On Tue, Nov 10, 2020 at 05:14:39PM +0200, Mike Rapoport wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c89c5444924b..d8d170fa5210 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -884,4 +884,7 @@ config ARCH_HAS_HUGEPD
>  config MAPPING_DIRTY_HELPERS
>  bool
>  
> +config SECRETMEM
> + def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED

So I now have to build this in, whether I want it or not?


Re: [PATCH v8 4/9] mm: introduce memfd_secret system call to create "secret" memory areas

2020-11-13 Thread Matthew Wilcox
On Tue, Nov 10, 2020 at 05:14:39PM +0200, Mike Rapoport wrote:
> +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + struct inode *inode = file_inode(vmf->vma->vm_file);
> + pgoff_t offset = vmf->pgoff;
> + unsigned long addr;
> + struct page *page;
> + int ret = 0;
> +
> + if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> + return vmf_error(-EINVAL);
> +
> + page = find_get_entry(mapping, offset);

Why did you decide to use find_get_entry() here?  You don't handle
swap or shadow entries.

> + if (!page) {
> + page = secretmem_alloc_page(vmf->gfp_mask);
> + if (!page)
> + return vmf_error(-EINVAL);

Why is this EINVAL and not ENOMEM?

> + ret = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
> + if (unlikely(ret))
> + goto err_put_page;
> +
> + ret = set_direct_map_invalid_noflush(page, 1);
> + if (ret)
> + goto err_del_page_cache;
> +
> + addr = (unsigned long)page_address(page);
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +
> + __SetPageUptodate(page);
> +
> + ret = VM_FAULT_LOCKED;
> + }
> +
> + vmf->page = page;
> + return ret;

Does sparse not warn you about this abuse of vm_fault_t?  Separate out
'ret' and 'err'.


Andrew, please fold in this fix.  I suspect Mike will want to fix
the other things I mention above.

diff --git a/mm/secretmem.c b/mm/secretmem.c
index 3dfdbd85ba00..09ca27f21661 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -172,7 +172,7 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
return vmf_error(-EINVAL);
 
-   page = find_get_entry(mapping, offset);
+   page = find_get_page(mapping, offset);
if (!page) {
page = secretmem_alloc_page(ctx, vmf->gfp_mask);
if (!page)


Re: [PATCH 2/2] mm: simplify follow_pte{,pmd}

2020-11-10 Thread Matthew Wilcox
On Thu, Oct 29, 2020 at 11:14:32AM +0100, Christoph Hellwig wrote:
> Merge __follow_pte_pmd, follow_pte_pmd and follow_pte into a single
> follow_pte function and just pass two additional NULL arguments for
> the two previous follow_pte callers.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 

I'm not entirely convinced this is the right interface, but your patch
makes things better, so I approve.


Re: [PATCH 1/2] mm: unexport follow_pte_pmd

2020-11-10 Thread Matthew Wilcox
On Thu, Oct 29, 2020 at 11:14:31AM +0100, Christoph Hellwig wrote:
> follow_pte_pmd is only used by the DAX code, which can't be modular.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 


Re: Best solution for shifting DAX_ZERO_PAGE to XA_ZERO_ENTRY

2020-11-08 Thread Matthew Wilcox
On Sun, Nov 08, 2020 at 05:54:14PM -0800, Amy Parker wrote:
> On Sun, Nov 8, 2020 at 5:50 PM Matthew Wilcox  wrote:
> >
> > On Sun, Nov 08, 2020 at 05:33:22PM -0800, Darrick J. Wong wrote:
> > > On Sun, Nov 08, 2020 at 05:15:55PM -0800, Amy Parker wrote:
> > > > XA_ZERO_ENTRY
> > > > is defined in include/linux/xarray.h, where it's defined using
> > > > xa_mk_internal(257). This function returns a void pointer, which
> > > > is incompatible with the bitwise arithmetic it is performed on with.
> >
> > We don't really perform bitwise arithmetic on it, outside of:
> >
> > static int dax_is_zero_entry(void *entry)
> > {
> > return xa_to_value(entry) & DAX_ZERO_PAGE;
> > }
> 
> We also have:
> 
> if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
>unsigned long index = xas->xa_index;
>/* we are replacing a zero page with block mapping */
>if (dax_is_pmd_entry(entry))
>   unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
> PG_PMD_NR, false);
>else /* pte entry */
>   unmap_mapping_pages(mapping, index, 1, false);
> }
> 
> and:
> 
> *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
>   DAX_PMD | DAX_ZERO_PAGE, false);

Right.  We need to be able to distinguish whether an entry represents
a PMD size.  So maybe we need XA_ZERO_PMD_ENTRY ... ?  Or we could use
the recently-added xa_get_order().
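
A sketch of the xa_get_order() option (helper name hypothetical;
PMD_ORDER as defined in fs/dax.c):

static bool dax_zero_entry_is_pmd(struct address_space *mapping,
				  pgoff_t index)
{
	return xa_get_order(&mapping->i_pages, index) == PMD_ORDER;
}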

> > > > Should we go the route of adding a new definition, we might as
> > > > well just change the definition of DAX_ZERO_PAGE. This would
> > > > break the simplicity of the current DAX bit definitions:
> > > >
> > > > #define DAX_LOCKED  (1UL << 0)
> > > > #define DAX_PMD   (1UL << 1)
> > > > #define DAX_ZERO_PAGE  (1UL << 2)
> > > > #define DAX_EMPTY  (1UL << 3)
> >
> > I was proposing deleting the entire bit and shifting DAX_EMPTY down.
> 
> That'd probably be a better idea - so what should we do about the type
> issue? Not typecasting it causes it not to compile.

I don't think you'll need to do any casting once the bit operations go
away ...


Re: Best solution for shifting DAX_ZERO_PAGE to XA_ZERO_ENTRY

2020-11-08 Thread Matthew Wilcox
On Sun, Nov 08, 2020 at 05:33:22PM -0800, Darrick J. Wong wrote:
> On Sun, Nov 08, 2020 at 05:15:55PM -0800, Amy Parker wrote:
> > I've been writing a patch to migrate the defined DAX_ZERO_PAGE
> > to XA_ZERO_ENTRY for representing holes in files.
> 
> Why?  IIRC XA_ZERO_ENTRY ("no mapping in the address space") isn't the
> same as DAX_ZERO_PAGE ("the zero page is mapped into the address space
> because we took a read fault on a sparse file hole").

There's no current user of XA_ZERO_ENTRY in i_pages, whether it be
DAX or non-DAX.

> > XA_ZERO_ENTRY
> > is defined in include/linux/xarray.h, where it's defined using
> > xa_mk_internal(257). This function returns a void pointer, which
> > is incompatible with the bitwise arithmetic it is performed on with.

We don't really perform bitwise arithmetic on it, outside of:

static int dax_is_zero_entry(void *entry)
{
return xa_to_value(entry) & DAX_ZERO_PAGE;
}

> > Currently, DAX_ZERO_PAGE is defined as an unsigned long,
> > so I considered typecasting it. Typecasting every time would be
> > repetitive and inefficient. I thought about making a new definition
> > for it which has the typecast, but this breaks the original point of
> > using already defined terms.
> > 
> > Should we go the route of adding a new definition, we might as
> > well just change the definition of DAX_ZERO_PAGE. This would
> > break the simplicity of the current DAX bit definitions:
> > 
> > #define DAX_LOCKED  (1UL << 0)
> > #define DAX_PMD   (1UL << 1)
> > #define DAX_ZERO_PAGE  (1UL << 2)
> > #define DAX_EMPTY  (1UL << 3)

I was proposing deleting the entire bit and shifting DAX_EMPTY down.
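
i.e. something like this as the end state (sketch):

#define DAX_LOCKED	(1UL << 0)
#define DAX_PMD		(1UL << 1)
#define DAX_EMPTY	(1UL << 2)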


[PATCH v2 0/4] Remove nrexceptional tracking

2020-10-26 Thread Matthew Wilcox (Oracle)
We actually use nrexceptional for very little these days.  It's a minor
pain to keep in sync with nrpages, but the pain becomes much bigger
with the THP patches because we don't know how many indices a shadow
entry occupies.  It's easier to just remove it than keep it accurate.

Also, we save 8 bytes per inode which is nothing to sneeze at; on my
laptop, it would improve shmem_inode_cache from 22 to 23 objects per
16kB, and inode_cache from 26 to 27 objects.  Combined, that saves
a megabyte of memory from a combined usage of 25MB for both caches.
Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save
any memory for ext4.

Matthew Wilcox (Oracle) (4):
  mm: Introduce and use mapping_empty
  mm: Stop accounting shadow entries
  dax: Account DAX entries as nrpages
  mm: Remove nrexceptional from inode

 fs/block_dev.c  |  2 +-
 fs/dax.c|  8 
 fs/gfs2/glock.c |  3 +--
 fs/inode.c  |  2 +-
 include/linux/fs.h  |  2 --
 include/linux/pagemap.h |  5 +
 mm/filemap.c| 16 
 mm/swap_state.c |  4 
 mm/truncate.c   | 19 +++
 mm/workingset.c |  1 -
 10 files changed, 15 insertions(+), 47 deletions(-)

-- 
2.28.0


[PATCH v2 2/4] mm: Stop accounting shadow entries

2020-10-26 Thread Matthew Wilcox (Oracle)
We no longer need to keep track of how many shadow entries are
present in a mapping.  This saves a few writes to the inode and
memory barriers.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Vishal Verma 
---
 mm/filemap.c| 13 -
 mm/swap_state.c |  4 
 mm/truncate.c   |  1 -
 mm/workingset.c |  1 -
 4 files changed, 19 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bd116f63263e..2e68116be4b0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -140,17 +140,6 @@ static void page_cache_delete(struct address_space 
*mapping,
 
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
-
-   if (shadow) {
-   mapping->nrexceptional += nr;
-   /*
-* Make sure the nrexceptional update is committed before
-* the nrpages update so that final truncate racing
-* with reclaim does not see both counters 0 at the
-* same time and miss a shadow entry.
-*/
-   smp_wmb();
-   }
mapping->nrpages -= nr;
 }
 
@@ -883,8 +872,6 @@ noinline int __add_to_page_cache_locked(struct page *page,
if (xas_error())
goto unlock;
 
-   if (old)
-   mapping->nrexceptional--;
mapping->nrpages++;
 
/* hugetlb pages do not participate in page cache accounting */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ee465827420e..85aca8d63aeb 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -160,7 +160,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
xas_store(, page);
xas_next();
}
-   address_space->nrexceptional -= nr_shadows;
address_space->nrpages += nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
ADD_CACHE_INFO(add_total, nr);
@@ -199,8 +198,6 @@ void __delete_from_swap_cache(struct page *page,
xas_next();
}
ClearPageSwapCache(page);
-   if (shadow)
-   address_space->nrexceptional += nr;
address_space->nrpages -= nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
ADD_CACHE_INFO(del_total, nr);
@@ -301,7 +298,6 @@ void clear_shadow_from_swap_cache(int type, unsigned long 
begin,
xas_store(, NULL);
nr_shadows++;
}
-   address_space->nrexceptional -= nr_shadows;
xa_unlock_irq(_space->i_pages);
 
/* search the next swapcache until we meet end */
diff --git a/mm/truncate.c b/mm/truncate.c
index 58524aaf67e2..27cf411ae51f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -40,7 +40,6 @@ static inline void __clear_shadow_entry(struct address_space 
*mapping,
if (xas_load() != entry)
return;
xas_store(, NULL);
-   mapping->nrexceptional--;
 }
 
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
diff --git a/mm/workingset.c b/mm/workingset.c
index 975a4d2dd02e..74d5f460e446 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -557,7 +557,6 @@ static enum lru_status shadow_lru_isolate(struct list_head 
*item,
goto out_invalid;
if (WARN_ON_ONCE(node->count != node->nr_values))
goto out_invalid;
-   mapping->nrexceptional -= node->nr_values;
xa_delete_node(node, workingset_update_node);
__inc_lruvec_slab_state(node, WORKINGSET_NODERECLAIM);
 
-- 
2.28.0


[PATCH v2 4/4] mm: Remove nrexceptional from inode

2020-10-26 Thread Matthew Wilcox (Oracle)
We no longer track anything in nrexceptional, so remove it, saving 8
bytes per inode.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Vishal Verma 
---
 fs/inode.c | 2 +-
 include/linux/fs.h | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9d78c37b00b8..4531358ae97b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -530,7 +530,7 @@ void clear_inode(struct inode *inode)
 */
xa_lock_irq(>i_data.i_pages);
BUG_ON(inode->i_data.nrpages);
-   BUG_ON(inode->i_data.nrexceptional);
+   BUG_ON(!mapping_empty(>i_data));
xa_unlock_irq(>i_data.i_pages);
BUG_ON(!list_empty(>i_data.private_list));
BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0bd126418bb6..a5d801430040 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -439,7 +439,6 @@ int pagecache_write_end(struct file *, struct address_space 
*mapping,
  * @i_mmap: Tree of private and shared mappings.
  * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
  * @nrpages: Number of page entries, protected by the i_pages lock.
- * @nrexceptional: Shadow or DAX entries, protected by the i_pages lock.
  * @writeback_index: Writeback starts here.
  * @a_ops: Methods.
  * @flags: Error bits and flags (AS_*).
@@ -460,7 +459,6 @@ struct address_space {
struct rb_root_cached   i_mmap;
struct rw_semaphore i_mmap_rwsem;
unsigned long   nrpages;
-   unsigned long   nrexceptional;
pgoff_t writeback_index;
const struct address_space_operations *a_ops;
unsigned long   flags;
-- 
2.28.0


[PATCH v2 1/4] mm: Introduce and use mapping_empty

2020-10-26 Thread Matthew Wilcox (Oracle)
Instead of checking the two counters (nrpages and nrexceptional), we
can just check whether i_pages is empty.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Vishal Verma 
---
 fs/block_dev.c  |  2 +-
 fs/dax.c|  2 +-
 fs/gfs2/glock.c |  3 +--
 include/linux/pagemap.h |  5 +
 mm/truncate.c   | 18 +++---
 5 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e84b1928b94..34105f66e12f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -79,7 +79,7 @@ static void kill_bdev(struct block_device *bdev)
 {
struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-   if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+   if (mapping_empty(mapping))
return;
 
invalidate_bh_lrus();
diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..53ed0ab8c958 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -965,7 +965,7 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
return -EIO;
 
-   if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
+   if (mapping_empty(mapping) || wbc->sync_mode != WB_SYNC_ALL)
return 0;
 
trace_dax_writeback_range(inode, xas.xa_index, end_index);
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 5441c17562c5..bfad01ce096d 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -273,8 +273,7 @@ static void __gfs2_glock_put(struct gfs2_glock *gl)
if (mapping) {
truncate_inode_pages_final(mapping);
if (!gfs2_withdrawn(sdp))
-   GLOCK_BUG_ON(gl, mapping->nrpages ||
-mapping->nrexceptional);
+   GLOCK_BUG_ON(gl, !mapping_empty(mapping));
}
trace_gfs2_glock_put(gl);
sdp->sd_lockstruct.ls_ops->lm_put_lock(gl);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index dc3390e6ee3e..86143d36d028 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -18,6 +18,11 @@
 
 struct pagevec;
 
+static inline bool mapping_empty(struct address_space *mapping)
+{
+   return xa_empty(>i_pages);
+}
+
 /*
  * Bits in mapping->flags.
  */
diff --git a/mm/truncate.c b/mm/truncate.c
index 11ef90d7e3af..58524aaf67e2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -342,7 +342,7 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
struct page *   page;
bool partial_end;
 
-   if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+   if (mapping_empty(mapping))
goto out;
 
/*
@@ -470,9 +470,6 @@ EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
-   unsigned long nrexceptional;
-   unsigned long nrpages;
-
/*
 * Page reclaim can not participate in regular inode lifetime
 * management (can't call iput()) and thus can race with the
@@ -482,16 +479,7 @@ void truncate_inode_pages_final(struct address_space 
*mapping)
 */
mapping_set_exiting(mapping);
 
-   /*
-* When reclaim installs eviction entries, it increases
-* nrexceptional first, then decreases nrpages.  Make sure we see
-* this in the right order or we might miss an entry.
-*/
-   nrpages = mapping->nrpages;
-   smp_rmb();
-   nrexceptional = mapping->nrexceptional;
-
-   if (nrpages || nrexceptional) {
+   if (!mapping_empty(mapping)) {
/*
 * As truncation uses a lockless tree lookup, cycle
 * the tree lock to make sure any ongoing tree
@@ -657,7 +645,7 @@ int invalidate_inode_pages2_range(struct address_space 
*mapping,
int ret2 = 0;
int did_range_unmap = 0;
 
-   if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+   if (mapping_empty(mapping))
goto out;
 
pagevec_init();
-- 
2.28.0


[PATCH v2 3/4] dax: Account DAX entries as nrpages

2020-10-26 Thread Matthew Wilcox (Oracle)
Simplify mapping_needs_writeback() by accounting DAX entries as
pages instead of exceptional entries.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Vishal Verma 
---
 fs/dax.c | 6 +++---
 mm/filemap.c | 3 ---
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 53ed0ab8c958..a20f2342a9e4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -525,7 +525,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
dax_wake_entry(xas, entry, true);
-   mapping->nrexceptional--;
+   mapping->nrpages -= PG_PMD_NR;
entry = NULL;
xas_set(xas, index);
}
@@ -541,7 +541,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
dax_lock_entry(xas, entry);
if (xas_error(xas))
goto out_unlock;
-   mapping->nrexceptional++;
+   mapping->nrpages += 1UL << order;
}
 
 out_unlock:
@@ -661,7 +661,7 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
goto out;
dax_disassociate_entry(entry, mapping, trunc);
xas_store(, NULL);
-   mapping->nrexceptional--;
+   mapping->nrpages -= 1UL << dax_entry_order(entry);
ret = 1;
 out:
put_unlocked_entry(, entry);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2e68116be4b0..2214a2c48dd1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -616,9 +616,6 @@ EXPORT_SYMBOL(filemap_fdatawait_keep_errors);
 /* Returns true if writeback might be needed or already in progress. */
 static bool mapping_needs_writeback(struct address_space *mapping)
 {
-   if (dax_mapping(mapping))
-   return mapping->nrexceptional;
-
return mapping->nrpages;
 }
 
-- 
2.28.0


Re: [Ocfs2-devel] [RFC] treewide: cleanup unreachable breaks

2020-10-18 Thread Matthew Wilcox
On Sun, Oct 18, 2020 at 12:13:35PM -0700, James Bottomley wrote:
> On Sun, 2020-10-18 at 19:59 +0100, Matthew Wilcox wrote:
> > On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > > clang has a number of useful, new warnings see
> > > https://urldefense.com/v3/__https://clang.llvm.org/docs/DiagnosticsReference.html__;!!GqivPVa7Brio!Krxz78O3RKcB9JBMVo_F98FupVhj_jxX60ddN6tKGEbv_cnooXc1nnBmchm-e_O9ieGnyQ$
> > >  
> > 
> > Please get your IT department to remove that stupidity.  If you
> > can't, please send email from a non-Red Hat email address.
> 
> Actually, the problem is at Oracle's end somewhere in the ocfs2 list
> ... if you could fix it, that would be great.  The usual real mailing
> lists didn't get this transformation
> 
> https://lore.kernel.org/bpf/20201017160928.12698-1-t...@redhat.com/
> 
> but the ocfs2 list archive did:
> 
> https://oss.oracle.com/pipermail/ocfs2-devel/2020-October/015330.html
> 
> I bet Oracle IT has put some spam filter on the list that mangles URLs
> this way.

*sigh*.  I'm sure there's a way.  I've raised it with someone who should
be able to fix it.


Re: [Ocfs2-devel] [RFC] treewide: cleanup unreachable breaks

2020-10-18 Thread Matthew Wilcox
On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> clang has a number of useful, new warnings see
> https://urldefense.com/v3/__https://clang.llvm.org/docs/DiagnosticsReference.html__;!!GqivPVa7Brio!Krxz78O3RKcB9JBMVo_F98FupVhj_jxX60ddN6tKGEbv_cnooXc1nnBmchm-e_O9ieGnyQ$
>  

Please get your IT department to remove that stupidity.  If you can't,
please send email from a non-Red Hat email address.

I don't understand why this is a useful warning to fix.  What actual
problem is caused by the code below?

> return and break
> 
>   switch (c->x86_vendor) {
>   case X86_VENDOR_INTEL:
>   intel_p5_mcheck_init(c);
>   return 1;
> - break;

Sure, it's unnecessary, but it's not masking a bug.  It's not unclear.
Why do we want to enable this warning?


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Matthew Wilcox
On Tue, Oct 13, 2020 at 11:44:29AM -0700, Dan Williams wrote:
> On Fri, Oct 9, 2020 at 12:52 PM  wrote:
> >
> > From: Ira Weiny 
> >
> > The kmap() calls in this FS are localized to a single thread.  To avoid
> > the over head of global PKRS updates use the new kmap_thread() call.
> >
> > Cc: Nicolas Pitre 
> > Signed-off-by: Ira Weiny 
> > ---
> >  fs/cramfs/inode.c | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > index 912308600d39..003c014a42ed 100644
> > --- a/fs/cramfs/inode.c
> > +++ b/fs/cramfs/inode.c
> > @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block *sb, 
> > unsigned int offset,
> > struct page *page = pages[i];
> >
> > if (page) {
> > -   memcpy(data, kmap(page), PAGE_SIZE);
> > -   kunmap(page);
> > +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> > +   kunmap_thread(page);
> 
> Why does this need a sleepable kmap? This looks like a textbook
> kmap_atomic() use case.

There's a lot of code of this form.  Could we perhaps have:

static inline void copy_to_highpage(struct page *to, void *vfrom,
				    unsigned int size)
{
char *vto = kmap_atomic(to);

memcpy(vto, vfrom, size);
kunmap_atomic(vto);
}

in linux/highmem.h ?
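
(The cramfs hunk quoted above copies _from_ the page, so it would want
the mirror-image helper as well -- same idea, sketch only:)

static inline void copy_from_highpage(void *vto, struct page *from,
				      unsigned int size)
{
	char *vfrom = kmap_atomic(from);

	memcpy(vto, vfrom, size);
	kunmap_atomic(vfrom);
}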


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Matthew Wilcox
On Mon, Oct 12, 2020 at 12:53:54PM -0700, Ira Weiny wrote:
> On Mon, Oct 12, 2020 at 05:44:38PM +0100, Matthew Wilcox wrote:
> > On Mon, Oct 12, 2020 at 09:28:29AM -0700, Dave Hansen wrote:
> > > kmap_atomic() is always preferred over kmap()/kmap_thread().
> > > kmap_atomic() is _much_ more lightweight since its TLB invalidation is
> > > always CPU-local and never broadcast.
> > > 
> > > So, basically, unless you *must* sleep while the mapping is in place,
> > > kmap_atomic() is preferred.
> > 
> > But kmap_atomic() disables preemption, so the _ideal_ interface would map
> > it only locally, then on preemption make it global.  I don't even know
> > if that _can_ be done.  But this email makes it seem like kmap_atomic()
> > has no downsides.
> 
> And that is IIUC what Thomas was trying to solve.
> 
> Also, Linus brought up that kmap_atomic() has quirks in nesting.[1]
> 
> From what I can see all of these discussions support the need to have
> something
> between kmap() and kmap_atomic().
> 
> However, the reason behind converting call sites to kmap_thread() are 
> different
> between Thomas' patch set and mine.  Both require more kmap granularity.
> However, they do so with different reasons and underlying implementations but
> with the _same_ resulting semantics; a thread local mapping which is
> preemptable.[2]  Therefore they each focus on changing different call sites.
> 
> While this patch set is huge I think it serves a valuable purpose to identify 
> a
> large number of call sites which are candidates for this new semantic.

Yes, I agree.  My problem with this patch-set is that it ties it to
some Intel feature that almost nobody cares about.  Maybe we should
care about it, but you didn't try very hard to make anyone care about
it in the cover letter.

For a future patch-set, I'd like to see you just introduce the new
API.  Then you can optimise the Intel implementation of it afterwards.
Those patch-sets have entirely different reviewers.


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Matthew Wilcox
On Mon, Oct 12, 2020 at 09:28:29AM -0700, Dave Hansen wrote:
> kmap_atomic() is always preferred over kmap()/kmap_thread().
> kmap_atomic() is _much_ more lightweight since its TLB invalidation is
> always CPU-local and never broadcast.
> 
> So, basically, unless you *must* sleep while the mapping is in place,
> kmap_atomic() is preferred.

But kmap_atomic() disables preemption, so the _ideal_ interface would map
it only locally, then on preemption make it global.  I don't even know
if that _can_ be done.  But this email makes it seem like kmap_atomic()
has no downsides.


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-09 Thread Matthew Wilcox
On Fri, Oct 09, 2020 at 02:34:34PM -0700, Eric Biggers wrote:
> On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.we...@intel.com wrote:
> > The kmap() calls in this FS are localized to a single thread.  To avoid
> > the over head of global PKRS updates use the new kmap_thread() call.
> >
> > @@ -2410,12 +2410,12 @@ static inline struct page *f2fs_pagecache_get_page(
> >  
> >  static inline void f2fs_copy_page(struct page *src, struct page *dst)
> >  {
> > -   char *src_kaddr = kmap(src);
> > -   char *dst_kaddr = kmap(dst);
> > +   char *src_kaddr = kmap_thread(src);
> > +   char *dst_kaddr = kmap_thread(dst);
> >  
> > memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> > -   kunmap(dst);
> > -   kunmap(src);
> > +   kunmap_thread(dst);
> > +   kunmap_thread(src);
> >  }
> 
> Wouldn't it make more sense to switch cases like this to kmap_atomic()?
> The pages are only mapped to do a memcpy(), then they're immediately unmapped.

Maybe you missed the earlier thread from Thomas trying to do something
similar for rather different reasons ...

https://lore.kernel.org/lkml/20200919091751.06...@linutronix.de/


Re: [PATCH 0/4] Remove nrexceptional tracking

2020-10-08 Thread Matthew Wilcox
On Thu, Aug 06, 2020 at 08:16:02PM +, Verma, Vishal L wrote:
> On Thu, 2020-08-06 at 19:44 +, Verma, Vishal L wrote:
> > > 
> > > I'm running xfstests on this patchset right now.  If one of the DAX
> > > people could try it out, that'd be fantastic.
> > > 
> > > Matthew Wilcox (Oracle) (4):
> > >   mm: Introduce and use page_cache_empty
> > >   mm: Stop accounting shadow entries
> > >   dax: Account DAX entries as nrpages
> > >   mm: Remove nrexceptional from inode
> > 
> > Hi Matthew,
> > 
> > I applied these on top of 5.8 and ran them through the nvdimm unit test
> > suite, and saw some test failures. The first failing test signature is:
> > 
> >   + umount test_dax_mnt
> >   ./dax-ext4.sh: line 62: 15749 Segmentation fault  umount $MNT
> >   FAIL dax-ext4.sh (exit status: 139)

Thanks.  Fixed:

+++ b/fs/dax.c
@@ -644,7 +644,7 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
goto out;
dax_disassociate_entry(entry, mapping, trunc);
xas_store(, NULL);
-   mapping->nrpages -= dax_entry_order(entry);
+   mapping->nrpages -= 1UL << dax_entry_order(entry);
ret = 1;
 out:
put_unlocked_entry(, entry);

Updated git tree at
https://git.infradead.org/users/willy/pagecache.git/

It survives an xfstests run on an fsdax namespace which supports 2MB pages.


Re: [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation

2020-09-30 Thread Matthew Wilcox
On Wed, Sep 30, 2020 at 01:27:45PM +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 05:15:52PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 29, 2020 at 05:58:13PM +0300, Mike Rapoport wrote:
> > > On Tue, Sep 29, 2020 at 04:12:16PM +0200, Peter Zijlstra wrote:
> > 
> > > > It will drop them down to 4k pages. Given enough inodes, and allocating
> > > > only a single sekrit page per pmd, we'll shatter the directmap into 4k.
> > > 
> > > Why? Secretmem allocates PMD-size page per inode and uses it as a pool
> > > of 4K pages for that inode. This way it ensures that
> > > __kernel_map_pages() is always called on PMD boundaries.
> > 
> > Oh, you unmap the 2m page upfront? I read it like you did the unmap at
> > the sekrit page alloc, not the pool alloc side of things.
> > 
> > Then yes, but then you're wasting gobs of memory. Basically you can pin
> > 2M per inode while only accounting a single page.
> 
> Right, quite like THP :)

Huh?  THP accounts every page it allocates.  If you allocate 2MB,
it accounts 512 pages.  And THPs are reclaimable by vmscan; this is
obviously not.


Re: [PATCH v2 5/9] iomap: Support arbitrarily many blocks per page

2020-09-23 Thread Matthew Wilcox
On Tue, Sep 22, 2020 at 06:05:26PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 22, 2020 at 12:23:45PM -0400, Qian Cai wrote:
> > On Fri, 2020-09-11 at 00:47 +0100, Matthew Wilcox (Oracle) wrote:
> > > Size the uptodate array dynamically to support larger pages in the
> > > page cache.  With a 64kB page, we're only saving 8 bytes per page today,
> > > but with a 2MB maximum page size, we'd have to allocate more than 4kB
> > > per page.  Add a few debugging assertions.
> > > 
> > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > Reviewed-by: Dave Chinner 
> > 
> > Some syscall fuzzing will trigger this on powerpc:
> > 
> > .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
> > 
> > [ 8805.895344][T445431] WARNING: CPU: 61 PID: 445431 at 
> > fs/iomap/buffered-io.c:78 iomap_page_release+0x250/0x270
> 
> Well, I'm glad it triggered.  That warning is:
> WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> PageUptodate(page));
> so there was definitely a problem of some kind.

OK, I'm pretty sure the problem predated this commit, and it's simply
that I added a warning (looking for something else) that caught this.

I have a tree completely gunked up with debugging code now to try to
understand the problem better, but my guess is that if you put this
warning into a previous version, you'd see the same problem occurring
(and it is a real problem, because we skip writeback of parts of the
page which are !uptodate).
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 5/9] iomap: Support arbitrarily many blocks per page

2020-09-23 Thread Matthew Wilcox
On Tue, Sep 22, 2020 at 10:00:01PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 23, 2020 at 03:48:59AM +0100, Matthew Wilcox wrote:
> > On Tue, Sep 22, 2020 at 09:06:03PM -0400, Qian Cai wrote:
> > > On Tue, 2020-09-22 at 18:05 +0100, Matthew Wilcox wrote:
> > > > On Tue, Sep 22, 2020 at 12:23:45PM -0400, Qian Cai wrote:
> > > > > On Fri, 2020-09-11 at 00:47 +0100, Matthew Wilcox (Oracle) wrote:
> > > > > > Size the uptodate array dynamically to support larger pages in the
> > > > > > page cache.  With a 64kB page, we're only saving 8 bytes per page 
> > > > > > today,
> > > > > > but with a 2MB maximum page size, we'd have to allocate more than 
> > > > > > 4kB
> > > > > > per page.  Add a few debugging assertions.
> > > > > > 
> > > > > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > > > > Reviewed-by: Dave Chinner 
> > > > > 
> > > > > Some syscall fuzzing will trigger this on powerpc:
> > > > > 
> > > > > .config: 
> > > > > https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
> > > > > 
> > > > > [ 8805.895344][T445431] WARNING: CPU: 61 PID: 445431 at 
> > > > > fs/iomap/buffered-
> > > > > io.c:78 iomap_page_release+0x250/0x270
> > > > 
> > > > Well, I'm glad it triggered.  That warning is:
> > > > WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> > > > PageUptodate(page));
> > > > so there was definitely a problem of some kind.
> > > > 
> > > > truncate_cleanup_page() calls
> > > > do_invalidatepage() calls
> > > > iomap_invalidatepage() calls
> > > > iomap_page_release()
> > > > 
> > > > Is this the first warning?  I'm wondering if maybe there was an I/O 
> > > > error
> > > > earlier which caused PageUptodate to get cleared again.  If it's easy to
> > > > reproduce, perhaps you could try something like this?
> > > > 
> > > > +void dump_iomap_page(struct page *page, const char *reason)
> > > > +{
> > > > +   struct iomap_page *iop = to_iomap_page(page);
> > > > +   unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, 
> > > > page);
> > > > +
> > > > +   dump_page(page, reason);
> > > > +   if (iop)
> > > > +   printk("iop:reads %d writes %d uptodate %*pb\n",
> > > > +   atomic_read(>read_bytes_pending),
> > > > +   atomic_read(>write_bytes_pending),
> > > > +   nr_blocks, iop->uptodate);
> > > > +   else
> > > > +   printk("iop:none\n");
> > > > +}
> > > > 
> > > > and then do something like:
> > > > 
> > > > if (bitmap_full(iop->uptodate, nr_blocks) != PageUptodate(page))
> > > > dump_iomap_page(page, NULL);
> > > 
> > > This:
> > > 
> > > [ 1683.158254][T164965] page:4a6c16cd refcount:2 mapcount:0 
> > > mapping:ea017dc5 index:0x2 pfn:0xc365c
> > > [ 1683.158311][T164965] aops:xfs_address_space_operations ino:417b7e7 
> > > dentry name:"trinity-testfile2"
> > > [ 1683.158354][T164965] flags: 0x7fff800015(locked|uptodate|lru)
> > > [ 1683.158392][T164965] raw: 007fff800015 c00c019c4b08 
> > > c00c019a53c8 c000201c8362c1e8
> > > [ 1683.158430][T164965] raw: 0002  
> > > 0002 c000201c54db4000
> > > [ 1683.158470][T164965] page->mem_cgroup:c000201c54db4000
> > > [ 1683.158506][T164965] iop:none
> > 
> > Oh, I'm a fool.  This is after the call to detach_page_private() so
> > page->private is NULL and we don't get the iop dumped.
> > 
> > Nevertheless, this is interesting.  Somehow, the page is marked Uptodate,
> > but the bitmap is deemed not full.  There are three places where we set
> > an iomap page Uptodate:
> > 
> > 1.  if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
> > SetPageUptodate(page);
> > 
> > 2.  if (page_has_private(page))
> > iomap_iop_set_range_uptodate(page, off, len);
> > else
> > SetPage

Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-23 Thread Matthew Wilcox
On Wed, Sep 23, 2020 at 09:11:43AM -0400, Mikulas Patocka wrote:
> I also don't know how to implement journling on persistent memory :) On 
> EXT4 or XFS you can pin dirty buffers in memory until the journal is 
> flushed. This is obviously impossible on persistent memory. So, I'm 
> considering implementing only some lightweight journaling that will 
> guarantee atomicity between just a few writes.

That's a bit disappointing considering people have been publishing
papers on how to do umpteen different variations on persistent memory
journalling for the last five years.

https://www.google.com/search?q=intel+persistent+memory+atomic+updates

for example
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 5/9] iomap: Support arbitrarily many blocks per page

2020-09-22 Thread Matthew Wilcox
On Tue, Sep 22, 2020 at 09:06:03PM -0400, Qian Cai wrote:
> On Tue, 2020-09-22 at 18:05 +0100, Matthew Wilcox wrote:
> > On Tue, Sep 22, 2020 at 12:23:45PM -0400, Qian Cai wrote:
> > > On Fri, 2020-09-11 at 00:47 +0100, Matthew Wilcox (Oracle) wrote:
> > > > Size the uptodate array dynamically to support larger pages in the
> > > > page cache.  With a 64kB page, we're only saving 8 bytes per page today,
> > > > but with a 2MB maximum page size, we'd have to allocate more than 4kB
> > > > per page.  Add a few debugging assertions.
> > > > 
> > > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > > Reviewed-by: Dave Chinner 
> > > 
> > > Some syscall fuzzing will trigger this on powerpc:
> > > 
> > > .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
> > > 
> > > [ 8805.895344][T445431] WARNING: CPU: 61 PID: 445431 at fs/iomap/buffered-
> > > io.c:78 iomap_page_release+0x250/0x270
> > 
> > Well, I'm glad it triggered.  That warning is:
> > WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> > PageUptodate(page));
> > so there was definitely a problem of some kind.
> > 
> > truncate_cleanup_page() calls
> > do_invalidatepage() calls
> > iomap_invalidatepage() calls
> > iomap_page_release()
> > 
> > Is this the first warning?  I'm wondering if maybe there was an I/O error
> > earlier which caused PageUptodate to get cleared again.  If it's easy to
> > reproduce, perhaps you could try something like this?
> > 
> > +void dump_iomap_page(struct page *page, const char *reason)
> > +{
> > +   struct iomap_page *iop = to_iomap_page(page);
> > +   unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, 
> > page);
> > +
> > +   dump_page(page, reason);
> > +   if (iop)
> > +   printk("iop:reads %d writes %d uptodate %*pb\n",
> > +   atomic_read(>read_bytes_pending),
> > +   atomic_read(>write_bytes_pending),
> > +   nr_blocks, iop->uptodate);
> > +   else
> > +   printk("iop:none\n");
> > +}
> > 
> > and then do something like:
> > 
> > if (bitmap_full(iop->uptodate, nr_blocks) != PageUptodate(page))
> > dump_iomap_page(page, NULL);
> 
> This:
> 
> [ 1683.158254][T164965] page:4a6c16cd refcount:2 mapcount:0 
> mapping:ea017dc5 index:0x2 pfn:0xc365c
> [ 1683.158311][T164965] aops:xfs_address_space_operations ino:417b7e7 dentry 
> name:"trinity-testfile2"
> [ 1683.158354][T164965] flags: 0x7fff800015(locked|uptodate|lru)
> [ 1683.158392][T164965] raw: 007fff800015 c00c019c4b08 
> c00c019a53c8 c000201c8362c1e8
> [ 1683.158430][T164965] raw: 0002  
> 0002 c000201c54db4000
> [ 1683.158470][T164965] page->mem_cgroup:c000201c54db4000
> [ 1683.158506][T164965] iop:none

Oh, I'm a fool.  This is after the call to detach_page_private() so
page->private is NULL and we don't get the iop dumped.

Nevertheless, this is interesting.  Somehow, the page is marked Uptodate,
but the bitmap is deemed not full.  There are three places where we set
an iomap page Uptodate:

1.  if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
SetPageUptodate(page);

2.  if (page_has_private(page))
iomap_iop_set_range_uptodate(page, off, len);
else
SetPageUptodate(page);

3.  BUG_ON(page->index);
...
SetPageUptodate(page);

It can't be #2 because the page has an iop.  It can't be #3 because the
page->index is not 0.  So at some point in the past, the bitmap was full.

I don't think it's possible for inode->i_blksize to change, and you
aren't running with THPs, so it's definitely not possible for thp_size()
to change.  So i_blocks_per_page() isn't going to change.

We seem to have allocated enough memory for ->iop because that's also
based on i_blocks_per_page().

I'm out of ideas.  Maybe I'll wake up with a better idea in the morning.
I've been trying to reproduce this on x86 with a 1kB block size
filesystem, and haven't been able to yet.  Maybe I'll try to setup a
powerpc cross-compilation environment tomorrow.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-22 Thread Matthew Wilcox
On Tue, Sep 22, 2020 at 12:46:05PM -0400, Mikulas Patocka wrote:
> I agree that the b+tree were a good choice for XFS.
> 
> In RAM-based maps, red-black trees or avl trees are used often. In 
> disk-based maps, btrees or b+trees are used. That's because in RAM, you 
> are optimizing for the number of cache lines accessed, and on the disk, 
> you are optimizing for the number of blocks accessed.

That's because software hasn't yet caught up to modern hardware.  An
in-memory btree outperforms an rbtree for several reasons:

 - rbtrees are taller; the average height of a b-tree with even just
   8 pointers per level is about half that of an rbtree.
 - Not all cachelines are created equal.  The cacheline following this
   one is probably already in cache.  If not, it's on its
   way and will be here shortly.  So a forward scan of this node will be
   quicker than finding another level of tree.  Our experiments have found
   a performance improvement with 256-byte B-tree nodes over an rbtree.
 - btrees are (or can be) RCU-safe.  It's very hard to make an rbtree
   RCU safe.  You can do a seqlock to get most of the benefit, but btrees
   let you allocate a new node and copy into it, while rbtrees embed the
   node in the object.

The downside, of course, is that b-trees require external allocations
while rbtrees embed the node in the object.  So you may need to do
more allocations.  But filesystems love doing additional allocations;
they make journalling so much easier.
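
To put rough numbers on the height argument, here's a standalone sketch
(not kernel code) using only the textbook bounds: an rbtree's height lies
between log2(n) and 2*log2(n+1), while a B-tree with fanout f needs about
log_f(n) levels.

/* Illustrative arithmetic only; compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double n;

	for (n = 1e3; n <= 1e9; n *= 1e3)
		printf("n=%.0e  rbtree %4.1f..%4.1f levels  8-way btree %4.1f levels\n",
		       n, log2(n), 2 * log2(n + 1), log(n) / log(8));
	return 0;
}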

> BTW. How does XFS "predict" the file size? - so that it allocates extent 
> of proper size without knowing how big the file will be?

XFS does delayed allocation.  The page cache absorbs the writes and then
at writeback time, XFS chooses where on storage to put it.  Some things
break this like O_SYNC and, er, DAX, but it's very effective.

> > The NVFS indirect block tree has a fan-out of 16,
> 
> No. The top level in the inode contains 16 blocks (11 direct and 5 
> indirect). And each indirect block can have 512 pointers (4096/8). You can 
> format the device with larger block size and this increases the fanout 
> (the NVFS block size must be greater than or equal to the system page size).
> 
> 2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can 
> map 256 TiB and 5 levels can map 128 PiB.

But compare to an unfragmented file ... you can map the entire thing with
a single entry.  Even if you have to use a leaf node, you can get four
extents in a single cacheline (and that's a fairly naive leaf node layout;
I don't know exactly what XFS uses)
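
The capacities quoted above fall straight out of the fan-out; a quick
standalone check, assuming a 4096-byte block and 8-byte pointers:

/* 4096/8 = 512 pointers per indirect block, so k levels of indirection
 * map 4096 * 512^k bytes: 2 levels = 1 GiB, ..., 5 levels = 128 PiB.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long bytes = 4096;	/* block size */
	int k;

	for (k = 1; k <= 5; k++) {
		bytes *= 512;			/* pointers per indirect block */
		printf("%d level(s) of indirection: %llu bytes\n", k, bytes);
	}
	return 0;
}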

> > Rename is another operation that has specific "operation has atomic
> > behaviour" expectations. I haven't looked at how you've
> > implementated that yet, but I suspect it also is extremely difficult
> > to implement in an atomic manner using direct pmem updates to the
> > directory structures.
> 
> There is a small window when renamed inode is neither in source nor in 
> target directory. Fsck will reclaim such inode and add it to lost+found - 
> just like on EXT2.

... ouch.  If you have to choose, it'd be better to link it to the second
directory then unlink it from the first one.  Then your fsck can detect
it has the wrong count and fix up the count (ie link it into both
directories rather than neither).
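
To make that ordering concrete, here's a toy standalone model (not NVFS
code, just the failure-mode argument): add the target entry and bump the
link count before removing the source entry.

#include <stdbool.h>
#include <stdio.h>

struct toy_inode { int nlink; };
struct toy_dirent { struct toy_inode *ino; };

static void pmem_barrier(void) { /* stand-in for flush + fence on pmem */ }

static void toy_rename(struct toy_dirent *src, struct toy_dirent *dst,
		       bool crash_in_window)
{
	struct toy_inode *ino = src->ino;

	dst->ino = ino;			/* step 1: link into target directory */
	ino->nlink++;
	pmem_barrier();

	if (crash_in_window)		/* simulate power loss between steps */
		return;

	src->ino = NULL;		/* step 2: unlink from source directory */
	ino->nlink--;
	pmem_barrier();
}

int main(void)
{
	struct toy_inode ino = { .nlink = 1 };
	struct toy_dirent src = { .ino = &ino }, dst = { NULL };

	toy_rename(&src, &dst, true);
	printf("after crash: in src=%d in dst=%d nlink=%d\n",
	       src.ino != NULL, dst.ino != NULL, ino.nlink);
	return 0;
}

A crash in the window leaves an over-counted nlink and an inode reachable
from both directories, which fsck can repair, instead of an inode
reachable from neither.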

> If you think that the lack of journaling is show-stopper, I can implement 
> it. But then, I'll have something that has complexity of EXT4 and 
> performance of EXT4. So that there will no longer be any reason why to use 
> NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> attract some users who want good performance and who don't care about GID 
> and UID being updated atomically, etc.

Well, what's your intent with nvfs?  Do you already have customers in mind
who want to use this in production, or is this somewhere to play with and
develop concepts that might make it into one of the longer-established
filesystems?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 5/9] iomap: Support arbitrarily many blocks per page

2020-09-22 Thread Matthew Wilcox
On Tue, Sep 22, 2020 at 12:23:45PM -0400, Qian Cai wrote:
> On Fri, 2020-09-11 at 00:47 +0100, Matthew Wilcox (Oracle) wrote:
> > Size the uptodate array dynamically to support larger pages in the
> > page cache.  With a 64kB page, we're only saving 8 bytes per page today,
> > but with a 2MB maximum page size, we'd have to allocate more than 4kB
> > per page.  Add a few debugging assertions.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > Reviewed-by: Dave Chinner 
> 
> Some syscall fuzzing will trigger this on powerpc:
> 
> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
> 
> [ 8805.895344][T445431] WARNING: CPU: 61 PID: 445431 at 
> fs/iomap/buffered-io.c:78 iomap_page_release+0x250/0x270

Well, I'm glad it triggered.  That warning is:
WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
PageUptodate(page));
so there was definitely a problem of some kind.

truncate_cleanup_page() calls
do_invalidatepage() calls
iomap_invalidatepage() calls
iomap_page_release()

Is this the first warning?  I'm wondering if maybe there was an I/O error
earlier which caused PageUptodate to get cleared again.  If it's easy to
reproduce, perhaps you could try something like this?

+void dump_iomap_page(struct page *page, const char *reason)
+{
+   struct iomap_page *iop = to_iomap_page(page);
+   unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
+
+   dump_page(page, reason);
+   if (iop)
+   printk("iop:reads %d writes %d uptodate %*pb\n",
+   atomic_read(>read_bytes_pending),
+   atomic_read(>write_bytes_pending),
+   nr_blocks, iop->uptodate);
+   else
+   printk("iop:none\n");
+}

and then do something like:

if (bitmap_full(iop->uptodate, nr_blocks) != PageUptodate(page))
dump_iomap_page(page, NULL);
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-22 Thread Matthew Wilcox
On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote:
> The same for directories - NVFS hashes the file name and uses radix-tree 
> to locate a directory page where the directory entry is located. XFS 
> b+trees would result in much more accesses than the radix-tree.

What?  Radix trees behave _horribly_ badly when indexed by a hash.
If you have a 64-bit hash and use 8 bits per level of the tree, you have
to traverse 8 pointers to get to your destination.  You might as well
use a linked list!
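
Just to spell out the arithmetic (standalone illustration): a radix tree
that consumes k bits of a 64-bit hash per level does 64/k pointer
dereferences on every lookup, no matter how few entries the directory has.

#include <stdio.h>

int main(void)
{
	int bits;

	for (bits = 4; bits <= 16; bits *= 2)
		printf("%2d bits per level -> %2d levels per 64-bit hash lookup\n",
		       bits, 64 / bits);
	return 0;
}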
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 9/9] iomap: Change calling convention for zeroing

2020-09-17 Thread Matthew Wilcox
On Thu, Sep 17, 2020 at 03:05:00PM -0700, Darrick J. Wong wrote:
> > -static loff_t
> > -iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> > -   void *data, struct iomap *iomap, struct iomap *srcmap)
> > +static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
> > +   loff_t length, void *data, struct iomap *iomap,
> 
> Any reason not to change @length and the return value to s64?

Because it's an actor, passed to iomap_apply, so its types have to match.
I can change that, but it'll be a separate patch series.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 2/9] fs: Introduce i_blocks_per_page

2020-09-15 Thread Matthew Wilcox
On Tue, Sep 15, 2020 at 03:40:52PM +, David Laight wrote:
> > @@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, 
> > unsigned off, unsigned len)
> > unsigned int i;
> > 
> > spin_lock_irqsave(>uptodate_lock, flags);
> > -   for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
> > +   for (i = 0; i < i_blocks_per_page(inode, page); i++) {
> 
> You probably don't want to call the helper every time
> around the loop.

This is a classic example of focusing on the details and missing the
larger picture.  We don't want the loop at all, and if you'd kept reading
the patch series, you'd see it disappear later.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Matthew Wilcox
On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote:
> - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
> buffer cache for the mapping. The buffer cache slows down fsck by a factor 
> of 5 to 10. Could it be possible to change the kernel so that it maps DAX 
> based block devices directly?

Oh, because fs/block_dev.c has:
.mmap   = generic_file_mmap,

I don't see why we shouldn't have a blkdev_mmap modelled after
ext2_file_mmap() with the corresponding blkdev_dax_vm_ops.
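
Roughly this shape, copied from ext2 (untested sketch; blkdev_dax_vm_ops
doesn't exist yet and would still need its fault handlers wired up to the
DAX fault paths, and whether IS_DAX() is the right test on a bdev inode is
part of the question):

static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
{
	if (!IS_DAX(file_inode(file)))
		return generic_file_mmap(file, vma);

	file_accessed(file);
	vma->vm_ops = &blkdev_dax_vm_ops;	/* hypothetical vm_ops */
	return 0;
}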
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 2/9] fs: Introduce i_blocks_per_page

2020-09-10 Thread Matthew Wilcox (Oracle)
This helper is useful for both THPs and for supporting block size larger
than page size.  Convert all users that I could find (we have a few
different ways of writing this idiom, and I may have missed some).

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dave Chinner 
Reviewed-by: Darrick J. Wong 
---
 fs/iomap/buffered-io.c  |  8 
 fs/jfs/jfs_metapage.c   |  2 +-
 fs/xfs/xfs_aops.c   |  2 +-
 include/linux/pagemap.h | 16 
 4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d81a9a86c5aa..330f86b825d7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 {
struct iomap_page *iop = to_iomap_page(page);
 
-   if (iop || i_blocksize(inode) == PAGE_SIZE)
+   if (iop || i_blocks_per_page(inode, page) <= 1)
return iop;
 
iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
@@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned 
off, unsigned len)
unsigned int i;
 
spin_lock_irqsave(>uptodate_lock, flags);
-   for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+   for (i = 0; i < i_blocks_per_page(inode, page); i++) {
if (i >= first && i <= last)
set_bit(i, iop->uptodate);
else if (!test_bit(i, iop->uptodate))
@@ -1077,7 +1077,7 @@ iomap_finish_page_writeback(struct inode *inode, struct 
page *page,
mapping_set_error(inode->i_mapping, -EIO);
}
 
-   WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+   WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(>write_count) <= 0);
 
if (!iop || atomic_dec_and_test(>write_count))
@@ -1373,7 +1373,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
int error = 0, count = 0, i;
LIST_HEAD(submit_list);
 
-   WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+   WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(>write_count) != 0);
 
/*
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index a2f5338a5ea1..176580f54af9 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page 
*page)
struct inode *inode = page->mapping->host;
struct bio *bio = NULL;
int block_offset;
-   int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
+   int blocks_per_page = i_blocks_per_page(inode, page);
sector_t page_start;/* address of page in fs blocks */
sector_t pblock;
int xlen;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..55d126d4e096 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -544,7 +544,7 @@ xfs_discard_page(
page, ip->i_ino, offset);
 
error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-   PAGE_SIZE / i_blocksize(inode));
+   i_blocks_per_page(inode, page));
if (error && !XFS_FORCED_SHUTDOWN(mp))
xfs_alert(mp, "page discard unable to remove delalloc 
mapping.");
 out_invalidate:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 50d2c39b47ab..f7f602040913 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -975,4 +975,20 @@ static inline int page_mkwrite_check_truncate(struct page 
*page,
return offset;
 }
 
+/**
+ * i_blocks_per_page - How many blocks fit in this page.
+ * @inode: The inode which contains the blocks.
+ * @page: The page (head page if the page is a THP).
+ *
+ * If the block size is larger than the size of this page, return zero.
+ *
+ * Context: The caller should hold a refcount on the page to prevent it
+ * from being split.
+ * Return: The number of filesystem blocks covered by this page.
+ */
+static inline
+unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
+{
+   return thp_size(page) >> inode->i_blkbits;
+}
 #endif /* _LINUX_PAGEMAP_H */
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 7/9] iomap: Convert write_count to write_bytes_pending

2020-09-10 Thread Matthew Wilcox (Oracle)
Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same page'
is, which is not necessarily clear once THPs are involved.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 fs/iomap/buffered-io.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1cf976a8e55c..64a5cb383f30 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -27,7 +27,7 @@
  */
 struct iomap_page {
atomic_tread_bytes_pending;
-   atomic_twrite_count;
+   atomic_twrite_bytes_pending;
spinlock_t  uptodate_lock;
unsigned long   uptodate[];
 };
@@ -73,7 +73,7 @@ iomap_page_release(struct page *page)
if (!iop)
return;
WARN_ON_ONCE(atomic_read(>read_bytes_pending));
-   WARN_ON_ONCE(atomic_read(>write_count));
+   WARN_ON_ONCE(atomic_read(>write_bytes_pending));
WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
PageUptodate(page));
kfree(iop);
@@ -1047,7 +1047,7 @@ EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
 
 static void
 iomap_finish_page_writeback(struct inode *inode, struct page *page,
-   int error)
+   int error, unsigned int len)
 {
struct iomap_page *iop = to_iomap_page(page);
 
@@ -1057,9 +1057,9 @@ iomap_finish_page_writeback(struct inode *inode, struct 
page *page,
}
 
WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
-   WARN_ON_ONCE(iop && atomic_read(>write_count) <= 0);
+   WARN_ON_ONCE(iop && atomic_read(>write_bytes_pending) <= 0);
 
-   if (!iop || atomic_dec_and_test(>write_count))
+   if (!iop || atomic_sub_and_test(len, >write_bytes_pending))
end_page_writeback(page);
 }
 
@@ -1093,7 +1093,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 
/* walk each page on bio, ending page IO on them */
bio_for_each_segment_all(bv, bio, iter_all)
-   iomap_finish_page_writeback(inode, bv->bv_page, error);
+   iomap_finish_page_writeback(inode, bv->bv_page, error,
+   bv->bv_len);
bio_put(bio);
}
/* The ioend has been freed by bio_put() */
@@ -1309,8 +1310,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, 
struct page *page,
 
merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
_page);
-   if (iop && !same_page)
-   atomic_inc(>write_count);
+   if (iop)
+   atomic_add(len, >write_bytes_pending);
 
if (!merged) {
if (bio_full(wpc->ioend->io_bio, len)) {
@@ -1353,7 +1354,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
LIST_HEAD(submit_list);
 
WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
-   WARN_ON_ONCE(iop && atomic_read(>write_count) != 0);
+   WARN_ON_ONCE(iop && atomic_read(>write_bytes_pending) != 0);
 
/*
 * Walk through the page to find areas to write back. If we run off the
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 5/9] iomap: Support arbitrarily many blocks per page

2020-09-10 Thread Matthew Wilcox (Oracle)
Size the uptodate array dynamically to support larger pages in the
page cache.  With a 64kB page, we're only saving 8 bytes per page today,
but with a 2MB maximum page size, we'd have to allocate more than 4kB
per page.  Add a few debugging assertions.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Dave Chinner 
---
 fs/iomap/buffered-io.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7fc0e02d27b0..9670c096b83e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -22,18 +22,25 @@
 #include "../internal.h"
 
 /*
- * Structure allocated for each page when block size < PAGE_SIZE to track
- * sub-page uptodate status and I/O completions.
+ * Structure allocated for each page or THP when block size < page size
+ * to track sub-page uptodate status and I/O completions.
  */
 struct iomap_page {
atomic_tread_count;
atomic_twrite_count;
spinlock_t  uptodate_lock;
-   DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+   unsigned long   uptodate[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
 {
+   /*
+* per-block data is stored in the head page.  Callers should
+* not be dealing with tail pages (and if they are, they can
+* call thp_head() first.
+*/
+   VM_BUG_ON_PGFLAGS(PageTail(page), page);
+
if (page_has_private(page))
return (struct iomap_page *)page_private(page);
return NULL;
@@ -45,11 +52,13 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
struct iomap_page *iop = to_iomap_page(page);
+   unsigned int nr_blocks = i_blocks_per_page(inode, page);
 
-   if (iop || i_blocks_per_page(inode, page) <= 1)
+   if (iop || nr_blocks <= 1)
return iop;
 
-   iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
+   iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+   GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(>uptodate_lock);
attach_page_private(page, iop);
return iop;
@@ -59,11 +68,14 @@ static void
 iomap_page_release(struct page *page)
 {
struct iomap_page *iop = detach_page_private(page);
+   unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
 
if (!iop)
return;
WARN_ON_ONCE(atomic_read(>read_count));
WARN_ON_ONCE(atomic_read(>write_count));
+   WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+   PageUptodate(page));
kfree(iop);
 }
 
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 4/9] iomap: Use bitmap ops to set uptodate bits

2020-09-10 Thread Matthew Wilcox (Oracle)
Now that the bitmap is protected by a spinlock, we can use the
more efficient bitmap ops instead of individual test/set bit ops.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dave Chinner 
Reviewed-by: Darrick J. Wong 
---
 fs/iomap/buffered-io.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 58a1fd83f2a4..7fc0e02d27b0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -134,19 +134,11 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned 
off, unsigned len)
struct inode *inode = page->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
-   bool uptodate = true;
unsigned long flags;
-   unsigned int i;
 
spin_lock_irqsave(>uptodate_lock, flags);
-   for (i = 0; i < i_blocks_per_page(inode, page); i++) {
-   if (i >= first && i <= last)
-   set_bit(i, iop->uptodate);
-   else if (!test_bit(i, iop->uptodate))
-   uptodate = false;
-   }
-
-   if (uptodate)
+   bitmap_set(iop->uptodate, first, last - first + 1);
+   if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
SetPageUptodate(page);
spin_unlock_irqrestore(>uptodate_lock, flags);
 }
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 0/9] THP iomap patches for 5.10

2020-09-10 Thread Matthew Wilcox (Oracle)
These patches are carefully plucked from the THP series.  I would like
them to hit 5.10 to make the THP patchset merge easier.  Some of these
are just generic improvements that make sense on their own terms, but
the overall intent is to support THPs in iomap.

v2:
 - Move the call to flush_dcache_page (Christoph)
 - Clarify comments (Darrick)
 - Rename read_count to read_bytes_pending (Christoph)
 - Rename write_count to write_bytes_pending (Christoph)
 - Restructure iomap_readpage_actor() (Christoph)
 - Change return type of the zeroing functions from loff_t to s64

Matthew Wilcox (Oracle) (9):
  iomap: Fix misplaced page flushing
  fs: Introduce i_blocks_per_page
  iomap: Use kzalloc to allocate iomap_page
  iomap: Use bitmap ops to set uptodate bits
  iomap: Support arbitrarily many blocks per page
  iomap: Convert read_count to read_bytes_pending
  iomap: Convert write_count to write_bytes_pending
  iomap: Convert iomap_write_end types
  iomap: Change calling convention for zeroing

 fs/dax.c|  13 ++-
 fs/iomap/buffered-io.c  | 173 +---
 fs/jfs/jfs_metapage.c   |   2 +-
 fs/xfs/xfs_aops.c   |   2 +-
 include/linux/dax.h |   3 +-
 include/linux/pagemap.h |  16 
 6 files changed, 96 insertions(+), 113 deletions(-)

-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 9/9] iomap: Change calling convention for zeroing

2020-09-10 Thread Matthew Wilcox (Oracle)
Pass the full length to iomap_zero() and dax_iomap_zero(), and have
them return how many bytes they actually handled.  This is preparatory
work for handling THP, although it looks like DAX could actually take
advantage of it if there's a larger contiguous area.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/dax.c   | 13 ++---
 fs/iomap/buffered-io.c | 33 +++--
 include/linux/dax.h|  3 +--
 3 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 994ab66a9907..6ad346352a8c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1037,18 +1037,18 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
return ret;
 }
 
-int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
-  struct iomap *iomap)
+s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 {
sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
pgoff_t pgoff;
long rc, id;
void *kaddr;
bool page_aligned = false;
-
+   unsigned offset = offset_in_page(pos);
+   unsigned size = min_t(u64, PAGE_SIZE - offset, length);
 
if (IS_ALIGNED(sector << SECTOR_SHIFT, PAGE_SIZE) &&
-   IS_ALIGNED(size, PAGE_SIZE))
+   (size == PAGE_SIZE))
page_aligned = true;
 
rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, );
@@ -1058,8 +1058,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, unsigned 
size,
id = dax_read_lock();
 
if (page_aligned)
-   rc = dax_zero_page_range(iomap->dax_dev, pgoff,
-size >> PAGE_SHIFT);
+   rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
else
rc = dax_direct_access(iomap->dax_dev, pgoff, 1, , NULL);
if (rc < 0) {
@@ -1072,7 +1071,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, unsigned 
size,
dax_flush(iomap->dax_dev, kaddr + offset, size);
}
dax_read_unlock(id);
-   return 0;
+   return size;
 }
 
 static loff_t
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index cb25a7b70401..3e1eb40a73fd 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -898,11 +898,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, 
loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
-   unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
+static s64 iomap_zero(struct inode *inode, loff_t pos, u64 length,
+   struct iomap *iomap, struct iomap *srcmap)
 {
struct page *page;
int status;
+   unsigned offset = offset_in_page(pos);
+   unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
 
status = iomap_write_begin(inode, pos, bytes, 0, , iomap, srcmap);
if (status)
@@ -914,38 +916,33 @@ static int iomap_zero(struct inode *inode, loff_t pos, 
unsigned offset,
return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
-static loff_t
-iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-   void *data, struct iomap *iomap, struct iomap *srcmap)
+static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
+   loff_t length, void *data, struct iomap *iomap,
+   struct iomap *srcmap)
 {
bool *did_zero = data;
loff_t written = 0;
-   int status;
 
/* already zeroed?  we're done. */
if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
-   return count;
+   return length;
 
do {
-   unsigned offset, bytes;
-
-   offset = offset_in_page(pos);
-   bytes = min_t(loff_t, PAGE_SIZE - offset, count);
+   s64 bytes;
 
if (IS_DAX(inode))
-   status = dax_iomap_zero(pos, offset, bytes, iomap);
+   bytes = dax_iomap_zero(pos, length, iomap);
else
-   status = iomap_zero(inode, pos, offset, bytes, iomap,
-   srcmap);
-   if (status < 0)
-   return status;
+   bytes = iomap_zero(inode, pos, length, iomap, srcmap);
+   if (bytes < 0)
+   return bytes;
 
pos += bytes;
-   count -= bytes;
+   length -= bytes;
written += bytes;
if (did_zero)
*did_zero = true;
-   } while (count > 0);
+   } while (length > 0);
 
return written;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..951a851a0481 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -214,8 +214,7 @@ vm_fault_t dax_finish_sync_fault(struct vm_fa

[PATCH v2 6/9] iomap: Convert read_count to read_bytes_pending

2020-09-10 Thread Matthew Wilcox (Oracle)
Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same page'
is, which is not necessarily clear once THPs are involved.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/iomap/buffered-io.c | 41 -
 1 file changed, 12 insertions(+), 29 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9670c096b83e..1cf976a8e55c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -26,7 +26,7 @@
  * to track sub-page uptodate status and I/O completions.
  */
 struct iomap_page {
-   atomic_tread_count;
+   atomic_tread_bytes_pending;
atomic_twrite_count;
spinlock_t  uptodate_lock;
unsigned long   uptodate[];
@@ -72,7 +72,7 @@ iomap_page_release(struct page *page)
 
if (!iop)
return;
-   WARN_ON_ONCE(atomic_read(>read_count));
+   WARN_ON_ONCE(atomic_read(>read_bytes_pending));
WARN_ON_ONCE(atomic_read(>write_count));
WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
PageUptodate(page));
@@ -167,13 +167,6 @@ iomap_set_range_uptodate(struct page *page, unsigned off, 
unsigned len)
SetPageUptodate(page);
 }
 
-static void
-iomap_read_finish(struct iomap_page *iop, struct page *page)
-{
-   if (!iop || atomic_dec_and_test(>read_count))
-   unlock_page(page);
-}
-
 static void
 iomap_read_page_end_io(struct bio_vec *bvec, int error)
 {
@@ -187,7 +180,8 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
}
 
-   iomap_read_finish(iop, page);
+   if (!iop || atomic_sub_and_test(bvec->bv_len, >read_bytes_pending))
+   unlock_page(page);
 }
 
 static void
@@ -267,30 +261,19 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
loff_t length, void *data,
}
 
ctx->cur_page_in_bio = true;
+   if (iop)
+   atomic_add(plen, >read_bytes_pending);
 
-   /*
-* Try to merge into a previous segment if we can.
-*/
+   /* Try to merge into a previous segment if we can */
sector = iomap_sector(iomap, pos);
-   if (ctx->bio && bio_end_sector(ctx->bio) == sector)
+   if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
+   if (__bio_try_merge_page(ctx->bio, page, plen, poff,
+   _page))
+   goto done;
is_contig = true;
-
-   if (is_contig &&
-   __bio_try_merge_page(ctx->bio, page, plen, poff, _page)) {
-   if (!same_page && iop)
-   atomic_inc(>read_count);
-   goto done;
}
 
-   /*
-* If we start a new segment we need to increase the read count, and we
-* need to do so before submitting any previous full bio to make sure
-* that we don't prematurely unlock the page.
-*/
-   if (iop)
-   atomic_inc(>read_count);
-
-   if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
+   if (!is_contig || bio_full(ctx->bio, plen)) {
gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
gfp_t orig_gfp = gfp;
int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 3/9] iomap: Use kzalloc to allocate iomap_page

2020-09-10 Thread Matthew Wilcox (Oracle)
We can skip most of the initialisation, although spinlocks still
need explicit initialisation as architectures may use a non-zero
value to indicate unlocked.  The comment is no longer useful as
attach_page_private() handles the refcount now.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dave Chinner 
Reviewed-by: Darrick J. Wong 
---
 fs/iomap/buffered-io.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 330f86b825d7..58a1fd83f2a4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -49,16 +49,8 @@ iomap_page_create(struct inode *inode, struct page *page)
if (iop || i_blocks_per_page(inode, page) <= 1)
return iop;
 
-   iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
-   atomic_set(>read_count, 0);
-   atomic_set(>write_count, 0);
+   iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(>uptodate_lock);
-   bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
-
-   /*
-* migrate_page_move_mapping() assumes that pages with private data have
-* their count elevated by 1.
-*/
attach_page_private(page, iop);
return iop;
 }
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 8/9] iomap: Convert iomap_write_end types

2020-09-10 Thread Matthew Wilcox (Oracle)
iomap_write_end cannot return an error, so switch it to return
size_t instead of int and remove the error checking from the callers.
Also convert the arguments to size_t from unsigned int, in case anyone
ever wants to support a page size larger than 2GB.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 fs/iomap/buffered-io.c | 31 ---
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 64a5cb383f30..cb25a7b70401 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -663,9 +663,8 @@ iomap_set_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
 
-static int
-__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
-   unsigned copied, struct page *page)
+static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+   size_t copied, struct page *page)
 {
flush_dcache_page(page);
 
@@ -687,9 +686,8 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned 
len,
return copied;
 }
 
-static int
-iomap_write_end_inline(struct inode *inode, struct page *page,
-   struct iomap *iomap, loff_t pos, unsigned copied)
+static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
+   struct iomap *iomap, loff_t pos, size_t copied)
 {
void *addr;
 
@@ -705,13 +703,14 @@ iomap_write_end_inline(struct inode *inode, struct page 
*page,
return copied;
 }
 
-static int
-iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
-   struct page *page, struct iomap *iomap, struct iomap *srcmap)
+/* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
+static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+   size_t copied, struct page *page, struct iomap *iomap,
+   struct iomap *srcmap)
 {
const struct iomap_page_ops *page_ops = iomap->page_ops;
loff_t old_size = inode->i_size;
-   int ret;
+   size_t ret;
 
if (srcmap->type == IOMAP_INLINE) {
ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
@@ -790,11 +789,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
 
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 
-   status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
+   copied = iomap_write_end(inode, pos, bytes, copied, page, iomap,
srcmap);
-   if (unlikely(status < 0))
-   break;
-   copied = status;
 
cond_resched();
 
@@ -868,11 +864,8 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, 
loff_t length, void *data,
 
status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
srcmap);
-   if (unlikely(status <= 0)) {
-   if (WARN_ON_ONCE(status == 0))
-   return -EIO;
-   return status;
-   }
+   if (WARN_ON_ONCE(status == 0))
+   return -EIO;
 
cond_resched();
 
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH v2 1/9] iomap: Fix misplaced page flushing

2020-09-10 Thread Matthew Wilcox (Oracle)
If iomap_unshare_actor() unshares to an inline iomap, the page was
not being flushed.  block_write_end() and __iomap_write_end() already
contain flushes, so adding it to iomap_write_end_inline() seems like
the best place.  That means we can remove it from iomap_write_actor().

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Dave Chinner 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
 fs/iomap/buffered-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 897ab9a26a74..d81a9a86c5aa 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -717,6 +717,7 @@ iomap_write_end_inline(struct inode *inode, struct page 
*page,
WARN_ON_ONCE(!PageUptodate(page));
BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
+   flush_dcache_page(page);
addr = kmap_atomic(page);
memcpy(iomap->inline_data + pos, addr + pos, copied);
kunmap_atomic(addr);
@@ -810,8 +811,6 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
 
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 
-   flush_dcache_page(page);
-
status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
srcmap);
if (unlikely(status < 0))
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 5/9] iomap: Support arbitrarily many blocks per page

2020-08-25 Thread Matthew Wilcox
On Tue, Aug 25, 2020 at 02:02:03PM -0700, Darrick J. Wong wrote:
> >  /*
> > - * Structure allocated for each page when block size < PAGE_SIZE to track
> > + * Structure allocated for each page when block size < page size to track
> >   * sub-page uptodate status and I/O completions.
> 
> "for each regular page or head page of a huge page"?  Or whatever we're
> calling them nowadays?

Well, that's what I'm calling a "page" ;-)

How about "for each page or THP"?  The fact that it's stored in the
head page is incidental -- it's allocated for the THP.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-25 Thread Matthew Wilcox
On Tue, Aug 25, 2020 at 02:27:11PM +1000, Dave Chinner wrote:
> On Mon, Aug 24, 2020 at 09:35:59PM -0600, Andreas Dilger wrote:
> > On Aug 24, 2020, at 9:26 PM, Matthew Wilcox  wrote:
> > > 
> > > On Tue, Aug 25, 2020 at 10:27:35AM +1000, Dave Chinner wrote:
> > >>> do {
> > >>> -   unsigned offset, bytes;
> > >>> -
> > >>> -   offset = offset_in_page(pos);
> > >>> -   bytes = min_t(loff_t, PAGE_SIZE - offset, count);
> > >>> +   loff_t bytes;
> > >>> 
> > >>> if (IS_DAX(inode))
> > >>> -   status = dax_iomap_zero(pos, offset, bytes, 
> > >>> iomap);
> > >>> +   bytes = dax_iomap_zero(pos, length, iomap);
> > >> 
> > >> Hmmm. everything is loff_t here, but the callers are defining length
> > >> as u64, not loff_t. Is there a potential sign conversion problem
> > >> here? (sure 64 bit is way beyond anything we'll pass here, but...)
> > > 
> > > I've gone back and forth on the correct type for 'length' a few times.
> > > size_t is too small (not for zeroing, but for seek()).  An unsigned type
> > > seems right -- a length can't be negative, and we don't want to give
> > > the impression that it can.  But the return value from these functions
> > > definitely needs to be signed so we can represent an error.  So a u64
> > > length with an loff_t return type feels like the best solution.  And
> > > the upper layers have to promise not to pass in a length that's more
> > > than 2^63-1.
> > 
> > The problem with allowing a u64 as the length is that it leads to the
> > possibility of an argument value that cannot be returned.  Checking
> > length < 0 is not worse than checking length > 0x7fffffffffffffff,
> > and has the benefit of consistency with the other argument types and
> > signs...

The callee should just trust that the caller isn't going to do that.
File sizes can't be more than 2^63-1 bytes, so an extent of a file also
can't be more than 2^63-1 bytes.

> I think the problem here is that we have no guaranteed 64 bit size
> type. when that was the case with off_t, we created loff_t to always
> represent a 64 bit offset value. However, we never created one for
> the count/size that is passed alongside loff_t in many places - it
> was said that "syscalls are limited to 32 bit sizes" and
> "size_t is 64 bit on 64 bit platforms" and so on and so we still
> don't have a clean way to pass 64 bit sizes through the IO path.
> 
> We've been living with this shitty situation for a long time now, so
> perhaps it's time for us to define lsize_t for 64 bit lengths and
> start using that everywhere that needs a 64 bit clean path
> through the code, regardless of whether the arch is 32 or 64 bit...
> 
> Thoughts?

I don't think the THP patches should be blocked on this expedition.

We have a guaranteed 64 bit type -- it's u64.  I don't think defining
lsize_t is going to fix anything.  The next big problem to fix will be
supporting storage >16EiB, but I think that's a project that can start
in ten years and still be here before anyone but the TLAs have that much
storage in a single device.

Any objection to leaving this patch as-is with a u64 length?
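
For clarity, the convention under discussion is just this (illustrative
sketch, not a real actor):

/* Unsigned 64-bit length in, signed 64-bit byte count or negative errno
 * out; the caller promises the length was derived from file offsets and
 * therefore fits in 2^63-1.
 */
static s64 example_actor(loff_t pos, u64 length)
{
	if (WARN_ON_ONCE(length > S64_MAX))
		return -EFBIG;

	/* ... handle up to 'length' bytes from 'pos', return bytes done ... */
	return length;
}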
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-24 Thread Matthew Wilcox
On Tue, Aug 25, 2020 at 10:27:35AM +1000, Dave Chinner wrote:
> > do {
> > -   unsigned offset, bytes;
> > -
> > -   offset = offset_in_page(pos);
> > -   bytes = min_t(loff_t, PAGE_SIZE - offset, count);
> > +   loff_t bytes;
> >  
> > if (IS_DAX(inode))
> > -   status = dax_iomap_zero(pos, offset, bytes, iomap);
> > +   bytes = dax_iomap_zero(pos, length, iomap);
> 
> Hmmm. everything is loff_t here, but the callers are defining length
> as u64, not loff_t. Is there a potential sign conversion problem
> here? (sure 64 bit is way beyond anything we'll pass here, but...)

I've gone back and forth on the correct type for 'length' a few times.
size_t is too small (not for zeroing, but for seek()).  An unsigned type
seems right -- a length can't be negative, and we don't want to give
the impression that it can.  But the return value from these functions
definitely needs to be signed so we can represent an error.  So a u64
length with an loff_t return type feels like the best solution.  And
the upper layers have to promise not to pass in a length that's more
than 2^63-1.

I have a patch set which starts the conversion process for all the actors
over to using u64 for the length.  Obviously, that's not mixed in with
the THP patchset.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 8/9] iomap: Convert iomap_write_end types

2020-08-24 Thread Matthew Wilcox
On Tue, Aug 25, 2020 at 10:12:23AM +1000, Dave Chinner wrote:
> > -static int
> > -__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> > -   unsigned copied, struct page *page)
> > +static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t 
> > len,
> > +   size_t copied, struct page *page)
> >  {
> 
> Please leave the function declarations formatted the same way as
> they currently are. They are done that way intentionally so it is
> easy to grep for function definitions. Not to mention is't much
> easier to read than when the function name is commingled into the
> multiline paramener list like...

I understand that's true for XFS, but it's not true throughout the
rest of the kernel.  This file isn't even consistent:

buffered-io.c:static inline struct iomap_page *to_iomap_page(struct page *page)
buffered-io.c:static inline bool iomap_block_needs_zeroing(struct inode
buffered-io.c:static int iomap_zero(struct inode *inode, loff_t pos, unsigned 
offset,
buffered-io.c:static void iomap_writepage_end_bio(struct bio *bio)
buffered-io.c:static int __init iomap_init(void)

(i just grepped for ^static so there're other functions not covered by this)

The other fs/iomap/ files are equally inconsistent.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 5/9] iomap: Support arbitrarily many blocks per page

2020-08-24 Thread Matthew Wilcox
On Tue, Aug 25, 2020 at 09:59:18AM +1000, Dave Chinner wrote:
> On Mon, Aug 24, 2020 at 03:55:06PM +0100, Matthew Wilcox (Oracle) wrote:
> >  static inline struct iomap_page *to_iomap_page(struct page *page)
> >  {
> > +   VM_BUG_ON_PGFLAGS(PageTail(page), page);
> > if (page_has_private(page))
> > return (struct iomap_page *)page_private(page);
> > return NULL;
> 
> Just to confirm: this vm bug check is to needed becuse we only
> attach the iomap_page to the head page of a compound page?
> 
> Assuming that I've understood the above correctly:

That's correct.  If we get a tail page in this path, something's gone
wrong somewhere upstream of us, and the stack trace should tell us where.
It's PGFLAGS so it's usually compiled out by distro kernels (you need to
enable CONFIG_DEBUG_VM_PGFLAGS for it to be active).

It was definitely useful in development; probably not so useful for
a distro kernel to assert.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 0/9] THP iomap patches for 5.10

2020-08-24 Thread Matthew Wilcox (Oracle)
These patches are carefully plucked from the THP series.  I would like
them to hit 5.10 to make the THP patchset merge easier.  Some of these
are just generic improvements that make sense on their own terms, but
the overall intent is to support THPs in iomap.

I'll send another patch series later today which are the changes to
iomap which don't pay their own way until we actually have THPs in the
page cache.  I would like those to be reviewed with an eye to merging
them into 5.11.

Matthew Wilcox (Oracle) (9):
  iomap: Fix misplaced page flushing
  fs: Introduce i_blocks_per_page
  iomap: Use kzalloc to allocate iomap_page
  iomap: Use bitmap ops to set uptodate bits
  iomap: Support arbitrarily many blocks per page
  iomap: Convert read_count to byte count
  iomap: Convert write_count to byte count
  iomap: Convert iomap_write_end types
  iomap: Change calling convention for zeroing

 fs/dax.c|  13 ++--
 fs/iomap/buffered-io.c  | 145 
 fs/jfs/jfs_metapage.c   |   2 +-
 fs/xfs/xfs_aops.c   |   2 +-
 include/linux/dax.h |   3 +-
 include/linux/pagemap.h |  16 +
 6 files changed, 83 insertions(+), 98 deletions(-)

-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 1/9] iomap: Fix misplaced page flushing

2020-08-24 Thread Matthew Wilcox (Oracle)
If iomap_unshare_actor() unshares to an inline iomap, the page was
not being flushed.  block_write_end() and __iomap_write_end() already
contain flushes, so adding it to iomap_write_end_inline() seems like
the best place.  That means we can remove it from iomap_write_actor().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/iomap/buffered-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index bcfc288dba3f..cffd575e57b6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ iomap_write_end_inline(struct inode *inode, struct page 
*page,
 {
void *addr;
 
+   flush_dcache_page(page);
WARN_ON_ONCE(!PageUptodate(page));
BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
 
@@ -811,8 +812,6 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
length, void *data,
 
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 
-   flush_dcache_page(page);
-
status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
srcmap);
if (unlikely(status < 0))
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 3/9] iomap: Use kzalloc to allocate iomap_page

2020-08-24 Thread Matthew Wilcox (Oracle)
We can skip most of the initialisation, although spinlocks still
need explicit initialisation as architectures may use a non-zero
value to indicate unlocked.  The comment is no longer useful as
attach_page_private() handles the refcount now.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 fs/iomap/buffered-io.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 13d5cdab8dcd..639d54a4177e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -49,16 +49,8 @@ iomap_page_create(struct inode *inode, struct page *page)
if (iop || i_blocks_per_page(inode, page) <= 1)
return iop;
 
-   iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
-   atomic_set(>read_count, 0);
-   atomic_set(>write_count, 0);
+   iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(>uptodate_lock);
-   bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
-
-   /*
-* migrate_page_move_mapping() assumes that pages with private data have
-* their count elevated by 1.
-*/
attach_page_private(page, iop);
return iop;
 }
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-24 Thread Matthew Wilcox (Oracle)
Pass the full length to iomap_zero() and dax_iomap_zero(), and have
them return how many bytes they actually handled.  This is preparatory
work for handling THP, although it looks like DAX could actually take
advantage of it if there's a larger contiguous area.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/dax.c   | 13 ++---
 fs/iomap/buffered-io.c | 33 +++--
 include/linux/dax.h|  3 +--
 3 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 95341af1a966..f2b912cb034e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1037,18 +1037,18 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
return ret;
 }
 
-int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
-  struct iomap *iomap)
+loff_t dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 {
sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
pgoff_t pgoff;
long rc, id;
void *kaddr;
bool page_aligned = false;
-
+   unsigned offset = offset_in_page(pos);
+   unsigned size = min_t(u64, PAGE_SIZE - offset, length);
 
if (IS_ALIGNED(sector << SECTOR_SHIFT, PAGE_SIZE) &&
-   IS_ALIGNED(size, PAGE_SIZE))
+   (size == PAGE_SIZE))
page_aligned = true;
 
rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, );
@@ -1058,8 +1058,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, unsigned 
size,
id = dax_read_lock();
 
if (page_aligned)
-   rc = dax_zero_page_range(iomap->dax_dev, pgoff,
-size >> PAGE_SHIFT);
+   rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
else
rc = dax_direct_access(iomap->dax_dev, pgoff, 1, , NULL);
if (rc < 0) {
@@ -1072,7 +1071,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, unsigned 
size,
dax_flush(iomap->dax_dev, kaddr + offset, size);
}
dax_read_unlock(id);
-   return 0;
+   return size;
 }
 
 static loff_t
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7f618ab4b11e..2dba054095e8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -901,11 +901,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, 
loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
-   unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
+static loff_t iomap_zero(struct inode *inode, loff_t pos, u64 length,
+   struct iomap *iomap, struct iomap *srcmap)
 {
struct page *page;
int status;
+   unsigned offset = offset_in_page(pos);
+   unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
 
status = iomap_write_begin(inode, pos, bytes, 0, , iomap, srcmap);
if (status)
@@ -917,38 +919,33 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
 }
 
-static loff_t
-iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
-   void *data, struct iomap *iomap, struct iomap *srcmap)
+static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
+   loff_t length, void *data, struct iomap *iomap,
+   struct iomap *srcmap)
 {
bool *did_zero = data;
loff_t written = 0;
-   int status;
 
/* already zeroed?  we're done. */
if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
-   return count;
+   return length;
 
do {
-   unsigned offset, bytes;
-
-   offset = offset_in_page(pos);
-   bytes = min_t(loff_t, PAGE_SIZE - offset, count);
+   loff_t bytes;
 
if (IS_DAX(inode))
-   status = dax_iomap_zero(pos, offset, bytes, iomap);
+   bytes = dax_iomap_zero(pos, length, iomap);
else
-   status = iomap_zero(inode, pos, offset, bytes, iomap,
-   srcmap);
-   if (status < 0)
-   return status;
+   bytes = iomap_zero(inode, pos, length, iomap, srcmap);
+   if (bytes < 0)
+   return bytes;
 
pos += bytes;
-   count -= bytes;
+   length -= bytes;
written += bytes;
if (did_zero)
*did_zero = true;
-   } while (count > 0);
+   } while (length > 0);
 
return written;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..80f17946f940 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -214,8 +214,7 @@ vm_fault_t dax_finish_sync_faul

[PATCH 6/9] iomap: Convert read_count to byte count

2020-08-24 Thread Matthew Wilcox (Oracle)
Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same page'
is, which is not necessarily clear once THPs are involved.
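A minimal sketch of the byte-based accounting this switches to.  The
struct and function names here are made up for illustration; only
atomic_add() and atomic_sub_and_test() are real kernel primitives.

#include <linux/atomic.h>

struct demo_iop {
	atomic_t read_bytes;		/* bytes submitted, not yet completed */
};

/* Called for each chunk before its bio is submitted. */
static void demo_account_read(struct demo_iop *iop, unsigned int plen)
{
	atomic_add(plen, &iop->read_bytes);
}

/* Called from read completion for each bio_vec. */
static bool demo_read_done(struct demo_iop *iop, unsigned int bv_len)
{
	/*
	 * True only when the last outstanding byte completes, so the page
	 * is unlocked exactly once, however the block layer merged or
	 * split the segments.
	 */
	return atomic_sub_and_test(bv_len, &iop->read_bytes);
}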

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/iomap/buffered-io.c | 29 ++---
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 844e95cacea8..c9b44f86d166 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -161,13 +161,6 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
SetPageUptodate(page);
 }
 
-static void
-iomap_read_finish(struct iomap_page *iop, struct page *page)
-{
-   if (!iop || atomic_dec_and_test(>read_count))
-   unlock_page(page);
-}
-
 static void
 iomap_read_page_end_io(struct bio_vec *bvec, int error)
 {
@@ -181,7 +174,8 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
}
 
-   iomap_read_finish(iop, page);
+   if (!iop || atomic_sub_and_test(bvec->bv_len, >read_count))
+   unlock_page(page);
 }
 
 static void
@@ -269,20 +263,17 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (ctx->bio && bio_end_sector(ctx->bio) == sector)
is_contig = true;
 
-   if (is_contig &&
-   __bio_try_merge_page(ctx->bio, page, plen, poff, _page)) {
-   if (!same_page && iop)
-   atomic_inc(>read_count);
-   goto done;
-   }
-
/*
-* If we start a new segment we need to increase the read count, and we
-* need to do so before submitting any previous full bio to make sure
-* that we don't prematurely unlock the page.
+* We need to increase the read count before submitting any
+* previous bio to make sure that we don't prematurely unlock
+* the page.
 */
if (iop)
-   atomic_inc(>read_count);
+   atomic_add(plen, >read_count);
+
+   if (is_contig &&
+   __bio_try_merge_page(ctx->bio, page, plen, poff, _page))
+   goto done;
 
if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
-- 
2.28.0


[PATCH 4/9] iomap: Use bitmap ops to set uptodate bits

2020-08-24 Thread Matthew Wilcox (Oracle)
Now that the bitmap is protected by a spinlock, we can use the
more efficient bitmap ops instead of individual test/set bit ops.
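As a standalone sketch of the two helpers doing the work here (the
wrapper name is made up; bitmap_set() and bitmap_full() are the real
<linux/bitmap.h> API):

#include <linux/bitmap.h>

/*
 * Mark blocks 'first'..'last' uptodate and report whether every block
 * in the page is now uptodate.  The open-coded test/set loop becomes
 * two calls, and the full-page check is a single scan.
 */
static bool demo_set_range_uptodate(unsigned long *uptodate,
		unsigned int first, unsigned int last,
		unsigned int blocks_per_page)
{
	bitmap_set(uptodate, first, last - first + 1);
	return bitmap_full(uptodate, blocks_per_page);
}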

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 fs/iomap/buffered-io.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 639d54a4177e..dbf9572dabe9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -134,19 +134,11 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
struct inode *inode = page->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
-   bool uptodate = true;
unsigned long flags;
-   unsigned int i;
 
spin_lock_irqsave(>uptodate_lock, flags);
-   for (i = 0; i < i_blocks_per_page(inode, page); i++) {
-   if (i >= first && i <= last)
-   set_bit(i, iop->uptodate);
-   else if (!test_bit(i, iop->uptodate))
-   uptodate = false;
-   }
-
-   if (uptodate)
+   bitmap_set(iop->uptodate, first, last - first + 1);
+   if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
SetPageUptodate(page);
spin_unlock_irqrestore(>uptodate_lock, flags);
 }
-- 
2.28.0


[PATCH 8/9] iomap: Convert iomap_write_end types

2020-08-24 Thread Matthew Wilcox (Oracle)
iomap_write_end cannot return an error, so switch it to return
size_t instead of int and remove the error checking from the callers.
Also convert the arguments to size_t from unsigned int, in case anyone
ever wants to support a page size larger than 2GB.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/iomap/buffered-io.c | 31 ---
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8c6187a6f34e..7f618ab4b11e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -666,9 +666,8 @@ iomap_set_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
 
-static int
-__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
-   unsigned copied, struct page *page)
+static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+   size_t copied, struct page *page)
 {
flush_dcache_page(page);
 
@@ -690,9 +689,8 @@ __iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
return copied;
 }
 
-static int
-iomap_write_end_inline(struct inode *inode, struct page *page,
-   struct iomap *iomap, loff_t pos, unsigned copied)
+static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
+   struct iomap *iomap, loff_t pos, size_t copied)
 {
void *addr;
 
@@ -708,13 +706,14 @@ iomap_write_end_inline(struct inode *inode, struct page *page,
return copied;
 }
 
-static int
-iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned copied,
-   struct page *page, struct iomap *iomap, struct iomap *srcmap)
+/* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
+static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
+   size_t copied, struct page *page, struct iomap *iomap,
+   struct iomap *srcmap)
 {
const struct iomap_page_ops *page_ops = iomap->page_ops;
loff_t old_size = inode->i_size;
-   int ret;
+   size_t ret;
 
if (srcmap->type == IOMAP_INLINE) {
ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
@@ -793,11 +792,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 
-   status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
+   copied = iomap_write_end(inode, pos, bytes, copied, page, iomap,
srcmap);
-   if (unlikely(status < 0))
-   break;
-   copied = status;
 
cond_resched();
 
@@ -871,11 +867,8 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 
status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
srcmap);
-   if (unlikely(status <= 0)) {
-   if (WARN_ON_ONCE(status == 0))
-   return -EIO;
-   return status;
-   }
+   if (WARN_ON_ONCE(status == 0))
+   return -EIO;
 
cond_resched();
 
-- 
2.28.0


[PATCH 7/9] iomap: Convert write_count to byte count

2020-08-24 Thread Matthew Wilcox (Oracle)
Instead of counting bio segments, count the number of bytes submitted.
This insulates us from the block layer's definition of what a 'same page'
is, which is not necessarily clear once THPs are involved.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/iomap/buffered-io.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c9b44f86d166..8c6187a6f34e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1050,7 +1050,7 @@ EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
 
 static void
 iomap_finish_page_writeback(struct inode *inode, struct page *page,
-   int error)
+   int error, unsigned int len)
 {
struct iomap_page *iop = to_iomap_page(page);
 
@@ -1062,7 +1062,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(>write_count) <= 0);
 
-   if (!iop || atomic_dec_and_test(>write_count))
+   if (!iop || atomic_sub_and_test(len, >write_count))
end_page_writeback(page);
 }
 
@@ -1096,7 +1096,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 
/* walk each page on bio, ending page IO on them */
bio_for_each_segment_all(bv, bio, iter_all)
-   iomap_finish_page_writeback(inode, bv->bv_page, error);
+   iomap_finish_page_writeback(inode, bv->bv_page, error,
+   bv->bv_len);
bio_put(bio);
}
/* The ioend has been freed by bio_put() */
@@ -1312,8 +1313,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 
merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
_page);
-   if (iop && !same_page)
-   atomic_inc(>write_count);
+   if (iop)
+   atomic_add(len, >write_count);
 
if (!merged) {
if (bio_full(wpc->ioend->io_bio, len)) {
-- 
2.28.0


[PATCH 5/9] iomap: Support arbitrarily many blocks per page

2020-08-24 Thread Matthew Wilcox (Oracle)
Size the uptodate array dynamically to support larger pages in the
page cache.  With a 64kB page, we're only saving 8 bytes per page today,
but with a 2MB maximum page size, we'd have to allocate more than 4kB
per page.  Add a few debugging assertions.
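The allocation pattern, as a standalone sketch (the struct name is made
up for illustration; struct_size(), BITS_TO_LONGS() and kzalloc() are
the real kernel helpers):

#include <linux/bitops.h>
#include <linux/overflow.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct demo_iop {
	spinlock_t	lock;
	unsigned long	uptodate[];	/* one bit per block in the page */
};

static struct demo_iop *demo_iop_alloc(unsigned int nr_blocks)
{
	struct demo_iop *iop;

	/*
	 * Flexible array member: allocate exactly enough longs to hold
	 * one bit per block of this (possibly huge) page.  kzalloc()
	 * leaves the bitmap all-clear, and __GFP_NOFAIL means we never
	 * get NULL back.
	 */
	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
			GFP_NOFS | __GFP_NOFAIL);
	spin_lock_init(&iop->lock);
	return iop;
}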

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/iomap/buffered-io.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index dbf9572dabe9..844e95cacea8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -22,18 +22,19 @@
 #include "../internal.h"
 
 /*
- * Structure allocated for each page when block size < PAGE_SIZE to track
+ * Structure allocated for each page when block size < page size to track
  * sub-page uptodate status and I/O completions.
  */
 struct iomap_page {
atomic_tread_count;
atomic_twrite_count;
spinlock_t  uptodate_lock;
-   DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
+   unsigned long   uptodate[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct page *page)
 {
+   VM_BUG_ON_PGFLAGS(PageTail(page), page);
if (page_has_private(page))
return (struct iomap_page *)page_private(page);
return NULL;
@@ -45,11 +46,13 @@ static struct iomap_page *
 iomap_page_create(struct inode *inode, struct page *page)
 {
struct iomap_page *iop = to_iomap_page(page);
+   unsigned int nr_blocks = i_blocks_per_page(inode, page);
 
-   if (iop || i_blocks_per_page(inode, page) <= 1)
+   if (iop || nr_blocks <= 1)
return iop;
 
-   iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
+   iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+   GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(>uptodate_lock);
attach_page_private(page, iop);
return iop;
@@ -59,11 +62,14 @@ static void
 iomap_page_release(struct page *page)
 {
struct iomap_page *iop = detach_page_private(page);
+   unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
 
if (!iop)
return;
WARN_ON_ONCE(atomic_read(>read_count));
WARN_ON_ONCE(atomic_read(>write_count));
+   WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+   PageUptodate(page));
kfree(iop);
 }
 
-- 
2.28.0


[PATCH 2/9] fs: Introduce i_blocks_per_page

2020-08-24 Thread Matthew Wilcox (Oracle)
This helper is useful for both THPs and for supporting block size larger
than page size.  Convert all users that I could find (we have a few
different ways of writing this idiom, and I may have missed some).
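The idioms being unified, side by side (illustrative only; for an
order-0 page all three agree, but only the new helper also does the
right thing for a THP, where thp_size(page) is larger than PAGE_SIZE):

#include <linux/fs.h>
#include <linux/pagemap.h>

static unsigned int blocks_old_shift(struct inode *inode)
{
	return PAGE_SIZE >> inode->i_blkbits;		/* one existing idiom */
}

static unsigned int blocks_old_divide(struct inode *inode)
{
	return PAGE_SIZE / i_blocksize(inode);		/* another existing idiom */
}

static unsigned int blocks_new(struct inode *inode, struct page *page)
{
	return i_blocks_per_page(inode, page);		/* the helper added below */
}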

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 fs/iomap/buffered-io.c  |  8 
 fs/jfs/jfs_metapage.c   |  2 +-
 fs/xfs/xfs_aops.c   |  2 +-
 include/linux/pagemap.h | 16 
 4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index cffd575e57b6..13d5cdab8dcd 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
 {
struct iomap_page *iop = to_iomap_page(page);
 
-   if (iop || i_blocksize(inode) == PAGE_SIZE)
+   if (iop || i_blocks_per_page(inode, page) <= 1)
return iop;
 
iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
@@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
unsigned int i;
 
spin_lock_irqsave(>uptodate_lock, flags);
-   for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+   for (i = 0; i < i_blocks_per_page(inode, page); i++) {
if (i >= first && i <= last)
set_bit(i, iop->uptodate);
else if (!test_bit(i, iop->uptodate))
@@ -1078,7 +1078,7 @@ iomap_finish_page_writeback(struct inode *inode, struct page *page,
mapping_set_error(inode->i_mapping, -EIO);
}
 
-   WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+   WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(>write_count) <= 0);
 
if (!iop || atomic_dec_and_test(>write_count))
@@ -1374,7 +1374,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
int error = 0, count = 0, i;
LIST_HEAD(submit_list);
 
-   WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
+   WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(>write_count) != 0);
 
/*
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index a2f5338a5ea1..176580f54af9 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page *page)
struct inode *inode = page->mapping->host;
struct bio *bio = NULL;
int block_offset;
-   int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
+   int blocks_per_page = i_blocks_per_page(inode, page);
sector_t page_start;/* address of page in fs blocks */
sector_t pblock;
int xlen;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b35611882ff9..55d126d4e096 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -544,7 +544,7 @@ xfs_discard_page(
page, ip->i_ino, offset);
 
error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-   PAGE_SIZE / i_blocksize(inode));
+   i_blocks_per_page(inode, page));
if (error && !XFS_FORCED_SHUTDOWN(mp))
	xfs_alert(mp, "page discard unable to remove delalloc mapping.");
 out_invalidate:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7de11dcd534d..853733286138 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -899,4 +899,20 @@ static inline int page_mkwrite_check_truncate(struct page *page,
return offset;
 }
 
+/**
+ * i_blocks_per_page - How many blocks fit in this page.
+ * @inode: The inode which contains the blocks.
+ * @page: The page (head page if the page is a THP).
+ *
+ * If the block size is larger than the size of this page, return zero.
+ *
+ * Context: The caller should hold a refcount on the page to prevent it
+ * from being split.
+ * Return: The number of filesystem blocks covered by this page.
+ */
+static inline
+unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
+{
+   return thp_size(page) >> inode->i_blkbits;
+}
 #endif /* _LINUX_PAGEMAP_H */
-- 
2.28.0


Re: [PATCH 1/4] mm: Introduce and use page_cache_empty

2020-08-15 Thread Matthew Wilcox
On Fri, Aug 07, 2020 at 02:24:00AM +0300, Kirill A. Shutemov wrote:
> On Tue, Aug 04, 2020 at 05:17:52PM +0100, Matthew Wilcox (Oracle) wrote:
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 484a36185bb5..a474a92a2a72 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -18,6 +18,11 @@
> >  
> >  struct pagevec;
> >  
> > +static inline bool page_cache_empty(struct address_space *mapping)
> > +{
> > +   return xa_empty(>i_pages);
> 
> What about something like
> 
>   bool empty = xa_empty(>i_pages);
>   VM_BUG_ON(empty && mapping->nrpages);
>   return empty;

I tried this and it's triggered by generic/418.  The problem is that
the check is made without the pagecache lock held (by
invalidate_inode_pages2_range()), so it's possible for xa_empty() to
return true, then for a page to be added to the page cache and
mapping->nrpages to be incremented to 1.  That seems to be what
happened here:

(gdb) p/x *(struct address_space *)0x88804b21b360
$2 = {host = 0x88804b21b200, i_pages = {xa_lock = {{rlock = {raw_lock = {{
  val = {counter = 0x0}, {locked = 0x0, pending = 0x0}, {
locked_pending = 0x0, tail = 0x0}}, xa_flags = 0x21, 
*  xa_head = 0xea0001e187c0}, gfp_mask = 0x100c4a, i_mmap_writable = {
counter = 0x0}, nr_thps = {counter = 0x0}, i_mmap = {rb_root = {
  rb_node = 0x0}, rb_leftmost = 0x0}, i_mmap_rwsem = {count = {
  counter = 0x0}, owner = {counter = 0x0}, osq = {tail = {counter = 0x0}}, 
wait_lock = {raw_lock = {{val = {counter = 0x0}, {locked = 0x0, 
pending = 0x0}, {locked_pending = 0x0, tail = 0x0, 
wait_list = {next = 0x88804b21b3b0, prev = 0x88804b21b3b0}}, 
* nrpages = 0x1, writeback_index = 0x0, a_ops = 0x81c2ed60, 
  flags = 0x40, wb_err = 0x0, private_lock = {{rlock = {raw_lock = {{val = {
  counter = 0x0}, {locked = 0x0, pending = 0x0}, {
  locked_pending = 0x0, tail = 0x0}}, private_list = {
next = 0x88804b21b3e8, prev = 0x88804b21b3e8}, private_data = 0x0}

(marked the critical lines with *)
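A minimal interleaving consistent with the above (illustrative only):

	invalidate_inode_pages2_range()		another task
	-------------------------------		------------
	empty = xa_empty(&mapping->i_pages);
		/* true, cache momentarily empty */
						adds a page to the cache,
						mapping->nrpages becomes 1
	VM_BUG_ON(empty && mapping->nrpages);
		/* fires, although nothing is actually wrong */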


Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink

2020-08-10 Thread Matthew Wilcox
On Mon, Aug 10, 2020 at 04:15:50PM +0800, Ruan Shiyang wrote:
> 
> 
> On 2020/8/7 9:38 PM, Matthew Wilcox wrote:
> > On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> > > This patchset is an attempt to resolve the problem of tracking shared
> > > pages for fsdax.
> > > 
> > > Instead of a per-page tracking method, this patchset introduces a query
> > > interface: get_shared_files(), which is implemented by each FS, to
> > > obtain the owners of a shared page.  It returns an owner list for this
> > > shared page.  Then, memory-failure() iterates the list to be able
> > > to notify each process using files that share this page.
> > > 
> > > The design of the tracking method is as follows:
> > > 1. dax_associate_entry() associates the owner's info with this page
> > 
> > I think that's the first problem with this design.  dax_associate_entry is
> > a horrendous idea which needs to be ripped out, not made more important.
> > It's all part of the general problem of trying to do something on a
> > per-page basis instead of on a per-extent basis.
> > 
> 
> The memory-failure path needs to track owner info for a dax page, so I
> should associate the owner with this page.  In this version, I associate
> the block device with the dax page, so that memory-failure is able to
> iterate the owners via the query interface provided by the filesystem.

No, it doesn't need to track owner info from a DAX page.  What it needs to
do is ask the filesystem.


Re: [RFC PATCH 0/8] fsdax: introduce FS query interface to support reflink

2020-08-07 Thread Matthew Wilcox
On Fri, Aug 07, 2020 at 09:13:28PM +0800, Shiyang Ruan wrote:
> This patchset is an attempt to resolve the problem of tracking shared
> pages for fsdax.
> 
> Instead of a per-page tracking method, this patchset introduces a query
> interface: get_shared_files(), which is implemented by each FS, to
> obtain the owners of a shared page.  It returns an owner list for this
> shared page.  Then, memory-failure() iterates the list to be able
> to notify each process using files that share this page.
> 
> The design of the tracking method is as follows:
> 1. dax_associate_entry() associates the owner's info with this page

I think that's the first problem with this design.  dax_associate_entry is
a horrendous idea which needs to be ripped out, not made more important.
It's all part of the general problem of trying to do something on a
per-page basis instead of on a per-extent basis.


[PATCH 1/4] mm: Introduce and use page_cache_empty

2020-08-04 Thread Matthew Wilcox (Oracle)
Instead of checking the two counters (nrpages and nrexceptional), we
can just check whether i_pages is empty.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/block_dev.c  |  2 +-
 fs/dax.c|  2 +-
 include/linux/pagemap.h |  5 +
 mm/truncate.c   | 18 +++---
 4 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0ae656e022fd..2a77bd2c6144 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -79,7 +79,7 @@ static void kill_bdev(struct block_device *bdev)
 {
struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-   if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+   if (page_cache_empty(mapping))
return;
 
invalidate_bh_lrus();
diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..2f75ee2cd41f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -949,7 +949,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
return -EIO;
 
-   if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
+   if (page_cache_empty(mapping) || wbc->sync_mode != WB_SYNC_ALL)
return 0;
 
trace_dax_writeback_range(inode, xas.xa_index, end_index);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 484a36185bb5..a474a92a2a72 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -18,6 +18,11 @@
 
 struct pagevec;
 
+static inline bool page_cache_empty(struct address_space *mapping)
+{
+   return xa_empty(>i_pages);
+}
+
 /*
  * Bits in mapping->flags.
  */
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..7c4c8ac140be 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -300,7 +300,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
pgoff_t index;
int i;
 
-   if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+   if (page_cache_empty(mapping))
goto out;
 
/* Offsets within partial pages */
@@ -488,9 +488,6 @@ EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
-   unsigned long nrexceptional;
-   unsigned long nrpages;
-
/*
 * Page reclaim can not participate in regular inode lifetime
 * management (can't call iput()) and thus can race with the
@@ -500,16 +497,7 @@ void truncate_inode_pages_final(struct address_space *mapping)
 */
mapping_set_exiting(mapping);
 
-   /*
-* When reclaim installs eviction entries, it increases
-* nrexceptional first, then decreases nrpages.  Make sure we see
-* this in the right order or we might miss an entry.
-*/
-   nrpages = mapping->nrpages;
-   smp_rmb();
-   nrexceptional = mapping->nrexceptional;
-
-   if (nrpages || nrexceptional) {
+   if (!page_cache_empty(mapping)) {
/*
 * As truncation uses a lockless tree lookup, cycle
 * the tree lock to make sure any ongoing tree
@@ -692,7 +680,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
int ret2 = 0;
int did_range_unmap = 0;
 
-   if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+   if (page_cache_empty(mapping))
goto out;
 
pagevec_init();
-- 
2.27.0

