Re: [PATCH v2 7/7] ABI: sysfs-kernel-mm-cma: fix two cross-references

2021-04-01 Thread John Hubbard

On 3/25/21 3:38 AM, Mauro Carvalho Chehab wrote:

Change the text in order to generate cross-references for
alloc_pages_success and alloc_pages_fail symbols.

Signed-off-by: Mauro Carvalho Chehab 
---
  Documentation/ABI/testing/sysfs-kernel-mm-cma | 8 +++-
  1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma 
b/Documentation/ABI/testing/sysfs-kernel-mm-cma
index 02b2bb60c296..86e261185561 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-cma
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma
@@ -5,12 +5,10 @@ Description:
/sys/kernel/mm/cma/ contains a subdirectory for each CMA
heap name (also sometimes called CMA areas).
  
-		Each CMA heap subdirectory (that is, each

-   /sys/kernel/mm/cma/ directory) contains the
-   following items:
+   Each CMA heap subdirectory contains the following items:
  
-			alloc_pages_success

-   alloc_pages_fail
+   - /sys/kernel/mm/cma//alloc_pages_success
+   - /sys/kernel/mm/cma//alloc_pages_fail
  


I agree that this is clearer and easier on the reader, who can now see
directly what the full path to each item is.

As for calling it a "fix", that seems a bit much. It's an upgrade.
I'm not sure how many people realize that this sort of change causes
cross refs to magically start working. I certainly didn't until now.

But either way, this improvement is nice to have, so:

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


  What: /sys/kernel/mm/cma//alloc_pages_success
  Date: Feb 2021





Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 8:56 PM, John Hubbard wrote:

On 3/30/21 3:56 PM, Alistair Popple wrote:
...

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.


At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?



Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
*
* Called from munlock code.  Checks all of the VMAs mapping the page
* to make sure nobody else has this page mlocked. The page will be
* returned with PG_mlocked cleared if no other vmas have it mlocked.
*/

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

     try_to_munlock() --> try_to_mlock()
     try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.


Actually, re-reading your and Jason's earlier points in the thread, I see
that I'm *not* missing anything, and we are actually in agreement about how
the code operates. OK, good!

Also, as you point out above, maybe the "try_" prefix is not really accurate
either, given how this works. So maybe we have arrived at something like:

try_to_munlock() --> page_mlock() // or mlock_page()...
try_to_munlock_one() --> page_mlock_one()



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 3:56 PM, Alistair Popple wrote:
...

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.


At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?



Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
 * try_to_munlock - try to munlock a page
 * @page: the page to be munlocked
 *
 * Called from munlock code.  Checks all of the VMAs mapping the page
 * to make sure nobody else has this page mlocked. The page will be
 * returned with PG_mlocked cleared if no other vmas have it mlocked.
 */

...because I don't see where, in *this* routine, it clears PG_mlocked!
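
For reference, here is roughly what the TTU_MUNLOCK handling in
try_to_unmap_one() does today. This is paraphrased and trimmed from my
reading of the current rmap code, so treat it as a sketch rather than a
faithful quote:

	if (flags & TTU_MUNLOCK) {
		if (vma->vm_flags & VM_LOCKED) {
			/* PTE-mapped THP are never mlocked */
			if (!PageTransCompound(page)) {
				/*
				 * This is the interesting part: it *sets*
				 * PG_mlocked on the page (again).
				 */
				mlock_vma_page(page);
			}
			ret = false;
			page_vma_mapped_walk_done(&pvmw);
			break;
		}
		continue;
	}

So, as far as I can tell, the only thing this path ever does to the page is
mlock it. The PG_mlocked *clearing* that the kerneldoc talks about happens
back in the munlock code that calls this routine, not here.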

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

 try_to_munlock() --> try_to_mlock()
 try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.



This is actually inspired by a suggestion in Documentation/vm/unevictable-
lru.rst which warns about this problem:

try_to_munlock() Reverse Map Scan
---------------------------------

.. warning::
[!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
page_referenced() reverse map walker.



This is actually rather bad advice! page_referenced() returns an
int-that-is-really-a-boolean, whereas try_to_munlock(), at least as it
stands now, returns void. Usually when I'm writing a TODO item, I'm in a
hurry, and I think that's what probably happened here, too. :)



Although, it seems reasonable to tack such renaming patches onto the tail end
of this series. But whatever works.


Unless anyone objects strongly I will roll the rename into this patch as there
is only one caller of try_to_munlock.

  - Alistair



No objections here. :)

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread John Hubbard

On 3/30/21 3:24 PM, Jason Gunthorpe wrote:
...

As far as I can tell this has always been called try_to_munlock() even though
it appears to do the opposite.


Maybe we should change it then?


/**
  * try_to_munlock - try to munlock a page
  * @page: the page to be munlocked
  *
  * Called from munlock code.  Checks all of the VMAs mapping the page
  * to make sure nobody else has this page mlocked. The page will be
  * returned with PG_mlocked cleared if no other vmas have it mlocked.
  */


In other words it sets PG_mlocked if one or more vmas has it mlocked. So
try_to_mlock() might be a better name, except that seems to have the potential
for confusion as well because it's only called from the munlock code path and
never for mlock.


That explanation makes more sense.. This function looks like it is
'set PG_mlocked of the page if any vm->flags has VM_LOCKED'

Maybe call it check_vm_locked or something then and reword the above
comment?

(and why is it OK to read vm->flags for this without any locking?)


Something needs attention here..


I think the code is correct, but perhaps the naming could be better. Would be
interested in hearing any thoughts on renaming try_to_munlock() to try_to_mlock(),
as the current name appears to be based on the context it is called from (munlock)
rather than on what it does (mlock).


The point of this patch is to make it clearer, after all, so I'd
change something and maybe slightly clarify the comment.



I'd add that, after looking around the calling code, this is a really unhappy
pre-existing situation. Anyone reading this has to remember at which point in
the call stack the naming transitions from "do the opposite of what the name
says", to "do what the name says".

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.

Although, it seems reasonable to tack such renaming patches onto the tail end
of this series. But whatever works.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: gup: remove FOLL_SPLIT

2021-03-30 Thread John Hubbard

On 3/29/21 11:24 PM, kernel test robot wrote:

Hi Yang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-linux-mm/master]

url:
https://github.com/0day-ci/linux/commits/Yang-Shi/mm-gup-remove-FOLL_SPLIT/20210330-034042
base:   https://github.com/hnaz/linux-mm master
config: s390-randconfig-r032-20210330 (attached as .config)
compiler: s390-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # 
https://github.com/0day-ci/linux/commit/c8563a636718f98af86a3965d94e25b8f2cf2354
 git remote add linux-review https://github.com/0day-ci/linux
 git fetch --no-tags linux-review 
Yang-Shi/mm-gup-remove-FOLL_SPLIT/20210330-034042
 git checkout c8563a636718f98af86a3965d94e25b8f2cf2354
 # save the attached .config to linux build tree
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
ARCH=s390

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

arch/s390/mm/gmap.c: In function 'thp_split_mm':

arch/s390/mm/gmap.c:2498:27: error: 'FOLL_SPLIT' undeclared (first use in this 
function); did you mean 'FOLL_PIN'?

 2498 |follow_page(vma, addr, FOLL_SPLIT);
  |   ^~
  |   FOLL_PIN
arch/s390/mm/gmap.c:2498:27: note: each undeclared identifier is reported 
only once for each function it appears in


vim +2498 arch/s390/mm/gmap.c


There appears to be an imperfection in this 0day testing system, because (just
as the patch says), commit ba925fa35057a062ac98c3e8138b013ce4ce351c ("s390/gmap:
improve THP splitting"), July 29, 2020, removes the above use of FOLL_SPLIT.

And "git grep", just to be sure, shows it is not there in today's linux.git. So
I guess the https://github.com/0day-ci/linux repo needs a better way to stay in
sync?


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: gup: remove FOLL_SPLIT

2021-03-30 Thread John Hubbard

On 3/29/21 12:38 PM, Yang Shi wrote:

Since commit 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
and commit ba925fa35057 ("s390/gmap: improve THP splitting") FOLL_SPLIT
has not been used anymore.  Remove the dead code.

Signed-off-by: Yang Shi 
---
  include/linux/mm.h |  1 -
  mm/gup.c   | 28 ++--
  2 files changed, 2 insertions(+), 27 deletions(-)



Looks nice.

As long as I'm running git grep here, there is one more search hit that should
also be fixed up, as part of a "remove FOLL_SPLIT" patch:

git grep -nw FOLL_SPLIT
Documentation/vm/transhuge.rst:57:follow_page, the FOLL_SPLIT bit can be
specified as a parameter to

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..3568836841f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2780,7 +2780,6 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
  #define FOLL_NOWAIT   0x20/* if a disk transfer is needed, start the IO
 * and return without waiting upon it */
  #define FOLL_POPULATE 0x40/* fault in page */
-#define FOLL_SPLIT 0x80/* don't return transhuge pages, split them */
  #define FOLL_HWPOISON 0x100   /* check page is hwpoisoned */
  #define FOLL_NUMA 0x200   /* force NUMA hinting page fault */
  #define FOLL_MIGRATION0x400   /* wait for page to replace migration 
entry */
diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..f3d45a8f18ae 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -435,18 +435,6 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
}
}
  
-	if (flags & FOLL_SPLIT && PageTransCompound(page)) {

-   get_page(page);
-   pte_unmap_unlock(ptep, ptl);
-   lock_page(page);
-   ret = split_huge_page(page);
-   unlock_page(page);
-   put_page(page);
-   if (ret)
-   return ERR_PTR(ret);
-   goto retry;
-   }
-
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
if (unlikely(!try_grab_page(page, flags))) {
page = ERR_PTR(-ENOMEM);
@@ -591,7 +579,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
spin_unlock(ptl);
		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
-   if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
+   if (flags & FOLL_SPLIT_PMD) {
int ret;
page = pmd_page(*pmd);
if (is_huge_zero_page(page)) {
@@ -600,19 +588,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
split_huge_pmd(vma, pmd, address);
if (pmd_trans_unstable(pmd))
ret = -EBUSY;
-   } else if (flags & FOLL_SPLIT) {
-   if (unlikely(!try_get_page(page))) {
-   spin_unlock(ptl);
-   return ERR_PTR(-ENOMEM);
-   }
-   spin_unlock(ptl);
-   lock_page(page);
-   ret = split_huge_page(page);
-   unlock_page(page);
-   put_page(page);
-   if (pmd_none(*pmd))
-   return no_page_table(vma, flags);
-   } else {  /* flags & FOLL_SPLIT_PMD */
+   } else {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, address);
ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;





Re: [PATCH v3] kernel/resource: Fix locking in request_free_mem_region

2021-03-29 Thread John Hubbard

On 3/29/21 9:59 PM, Alistair Popple wrote:
...

res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
+   if (dev) {
+   dr->parent = &iomem_resource;
+   dr->start = addr;
+   dr->n = size;
+   devres_add(dev, dr);
+   }
+
+   write_unlock(&resource_lock);
+   revoke_iomem(res);


This is new, and not mentioned in the commit log, and therefore quite
surprising. It seems like the right thing to do but it also seems like a
different fix! I'm not saying that it should be a separate patch, but it
does seem worth loudly mentioning in the commit log, yes?


This isn't a different fix though, it is just about maintaining the original
behaviour which called revoke_iomem() after dropping the lock. I inadvertently
switched this around in the initial patch such that revoke_iomem() got called
with the lock, leading to deadlock on x86 with CONFIG_IO_STRICT_DEVMEM=y.

This does change the order of revoke_iomem() and devres_add() slightly, but as
far as I can tell that shouldn't be an issue. Can call that out in the commit
log.


Maybe that's why it looked like a change to me. I do think it's worth 
mentioning.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v3] kernel/resource: Fix locking in request_free_mem_region

2021-03-29 Thread John Hubbard
mp_load_acquire(_inode)->i_mapping;
  }
  
-/**

- * __request_region - create a new busy resource region
- * @parent: parent resource descriptor
- * @start: resource start address
- * @n: resource region size
- * @name: reserving caller's ID string
- * @flags: IO resource flags
- */
-struct resource * __request_region(struct resource *parent,
-  resource_size_t start, resource_size_t n,
-  const char *name, int flags)
+static bool request_region_locked(struct resource *parent,
+   struct resource *res, resource_size_t start,
+   resource_size_t n, const char *name, int 
flags)
  {
DECLARE_WAITQUEUE(wait, current);
-   struct resource *res = alloc_resource(GFP_KERNEL);
-   struct resource *orig_parent = parent;
-
-   if (!res)
-   return NULL;
  
  	res->name = name;

res->start = start;
res->end = start + n - 1;
  
-	write_lock(&resource_lock);

-
for (;;) {
struct resource *conflict;
  
@@ -1230,14 +1224,37 @@ struct resource * __request_region(struct resource *parent,

		write_lock(&resource_lock);
continue;
}
+   return false;
+   }
+
+   return true;
+}
+
+/**
+ * __request_region - create a new busy resource region
+ * @parent: parent resource descriptor
+ * @start: resource start address
+ * @n: resource region size
+ * @name: reserving caller's ID string
+ * @flags: IO resource flags
+ */
+struct resource *__request_region(struct resource *parent,
+ resource_size_t start, resource_size_t n,
+ const char *name, int flags)
+{
+   struct resource *res = alloc_resource(GFP_KERNEL);
+
+   if (!res)
+   return NULL;
+
+   write_lock(&resource_lock);
+   if (!request_region_locked(parent, res, start, n, name, flags)) {
/* Uhhuh, that didn't work out.. */
free_resource(res);
res = NULL;
-   break;
}
	write_unlock(&resource_lock);
-
-	if (res && orig_parent == &iomem_resource)
+	if (res && parent == &iomem_resource)
revoke_iomem(res);
  
  	return res;

@@ -1779,26 +1796,51 @@ static struct resource 
*__request_free_mem_region(struct device *dev,
  {
resource_size_t end, addr;
struct resource *res;
+   struct region_devres *dr = NULL;
+
+   res = alloc_resource(GFP_KERNEL);
+   if (!res)
+   return ERR_PTR(-ENOMEM);
+
+   if (dev) {
+   dr = devres_alloc(devm_region_release, sizeof(struct 
region_devres),
+ GFP_KERNEL);
+   if (!dr) {
+   free_resource(res);
+   return ERR_PTR(-ENOMEM);
+   }
+   }
  
  	size = ALIGN(size, 1UL << PA_SECTION_SHIFT);

end = min_t(unsigned long, base->end, (1UL << MAX_PHYSMEM_BITS) - 1);
addr = end - size + 1UL;
  
+	write_lock(&resource_lock);

for (; addr > size && addr >= base->start; addr -= size) {
-   if (region_intersects(addr, size, 0, IORES_DESC_NONE) !=
+   if (__region_intersects(addr, size, 0, IORES_DESC_NONE) !=
REGION_DISJOINT)
continue;
  
-		if (dev)

-   res = devm_request_mem_region(dev, addr, size, name);
-   else
-   res = request_mem_region(addr, size, name);
-   if (!res)
-   return ERR_PTR(-ENOMEM);
+   if (!request_region_locked(&iomem_resource, res, addr,
+  size, name, 0))
+   break;
+
res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
+   if (dev) {
+   dr->parent = &iomem_resource;
+   dr->start = addr;
+   dr->n = size;
+   devres_add(dev, dr);
+   }
+
+   write_unlock(&resource_lock);
+   revoke_iomem(res);


This is new, and not mentioned in the commit log, and therefore quite
surprising. It seems like the right thing to do but it also seems like a
different fix! I'm not saying that it should be a separate patch, but it
does seem worth loudly mentioning in the commit log, yes?


return res;
}
  
+	write_unlock(&resource_lock);

+   free_resource(res);
+
return ERR_PTR(-ERANGE);
  }
  



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7] mm: cma: support sysfs

2021-03-24 Thread John Hubbard

On 3/24/21 3:11 PM, Dmitry Osipenko wrote:

25.03.2021 01:01, John Hubbard wrote:

On 3/24/21 2:31 PM, Dmitry Osipenko wrote:

...

+#include 
+
+struct cma_kobject {
+    struct cma *cma;
+    struct kobject kobj;


If you'll place the kobj as the first member of the struct, then
container_of will be a no-op.



However, *this does not matter*. Let's not get carried away. If
container_of() ends up as a compile-time addition of +8, instead
of +0, there is not going to be a visible effect in the world.
Or do you have some perf data to the contrary?

Sometimes these kinds of things matter. But other times, they are
just pointless to fret about, and this is one such case.


Performance is out of question here, my main point is about maintaining


In that case, there is even less reason to harass people about the order
of members of a struct.


a good coding style. Otherwise there is no point in not embedding kobj
into cma struct as well, IMO.



We really don't need to worry about the order of members in a struct,
from a "coding style" point of view. It is a solid step too far.

Sorry if that sounds a little too direct. But this review has tended to
go quite too far into nitpicks that are normally left as-is, and I've
merely picked one that is particularly questionable. I realize that other
coding communities have their own standards. Here, I'm explaining what
I have observed about linux-mm and linux-kernel, which needs to be respected.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7] mm: cma: support sysfs

2021-03-24 Thread John Hubbard

On 3/24/21 2:31 PM, Dmitry Osipenko wrote:

...

+#include 
+
+struct cma_kobject {
+   struct cma *cma;
+   struct kobject kobj;


If you'll place the kobj as the first member of the struct, then
container_of will be a no-op.



However, *this does not matter*. Let's not get carried away. If
container_of() ends up as a compile-time addition of +8, instead
of +0, there is not going to be a visible effect in the world.
Or do you have some perf data to the contrary?

Sometimes these kinds of things matter. But other times, they are
just pointless to fret about, and this is one such case.
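
If it helps to make that concrete, here is a tiny userspace sketch (stub
types, obviously not the real kernel structs) showing that container_of()
boils down to a compile-time constant offset either way:

#include <stddef.h>
#include <stdio.h>

/* Stand-ins for the real structs, just to show the offsets involved. */
struct kobject_stub { int refcount; };

struct cma_kobject_first  { struct kobject_stub kobj; void *cma; };
struct cma_kobject_second { void *cma; struct kobject_stub kobj; };

/* The same arithmetic that the kernel's container_of() performs. */
#define container_of_stub(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
	struct cma_kobject_second s = { 0 };
	struct cma_kobject_second *back =
		container_of_stub(&s.kobj, struct cma_kobject_second, kobj);

	printf("kobj offset, kobj first:  %zu\n",
	       offsetof(struct cma_kobject_first, kobj));
	printf("kobj offset, kobj second: %zu\n",
	       offsetof(struct cma_kobject_second, kobj));
	printf("round trip ok: %d\n", back == &s);
	return 0;
}

The only difference between the two layouts is whether that constant is zero
or a few bytes; there is no runtime lookup in either case.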


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: fix corruption cma_sysfs_alloc_pages_count

2021-03-24 Thread John Hubbard

On 3/24/21 12:20 PM, Minchan Kim wrote:

struct cma_stat's lifespan for cma_sysfs is different from that of
struct cma, because a kobject for sysfs requires a dynamic object
while CMA is a static object[1]. When CMA is initialized,
it couldn't use slab to allocate cma_stat since slab was not
initialized yet. Thus, it allocates the dynamic object
in subsys_initcall.

However, a cma allocation can happen before subsys_initcall,
and then it crashes.

Dmitry reported[2]:

..
[1.226190] [] (cma_sysfs_alloc_pages_count) from [] 
(cma_alloc+0x153/0x274)
[1.226720] [] (cma_alloc) from [] 
(__alloc_from_contiguous+0x37/0x8c)
[1.227272] [] (__alloc_from_contiguous) from [] 
(atomic_pool_init+0x7b/0x126)
[1.233596] [] (atomic_pool_init) from [] 
(do_one_initcall+0x45/0x1e4)
[1.234188] [] (do_one_initcall) from [] 
(kernel_init_freeable+0x157/0x1a6)
[1.234741] [] (kernel_init_freeable) from [] 
(kernel_init+0xd/0xe0)
[1.235289] [] (kernel_init) from [] 
(ret_from_fork+0x11/0x1c)

This patch moves those statistic fields of cma_stat into struct cma
and introduces cma_kobject wrapper to follow kobject's rule.

At the same time, it fixes other routines based on suggestions[3][4].

[1] https://lore.kernel.org/linux-mm/ycoamxqt6dzkc...@kroah.com/
[2] 
https://lore.kernel.org/linux-mm/fead70a2-4330-79ff-e79a-d8511eab1...@gmail.com/
[3] 
https://lore.kernel.org/linux-mm/20210323195050.2577017-1-minc...@kernel.org/
[4] 
https://lore.kernel.org/linux-mm/20210324010547.4134370-1-minc...@kernel.org/

Reported-by: Dmitry Osipenko 
Tested-by: Dmitry Osipenko 
Suggested-by: Dmitry Osipenko 
Suggested-by: John Hubbard 
Suggested-by: Matthew Wilcox 
Signed-off-by: Minchan Kim 
---
I believe it's worth having a separate patch rather than replacing the
original patch. It will also help to merge without conflict
since we already filed another patch based on it.
Strictly speaking, separating the fix part and the readability part
in this patch would be better, but it's a gray area to separate them
since most code in this patch was done while we were fixing
the bug. Since we haven't released it yet, I hope it will work.
Otherwise, I can send a replacement patch including all of the
changes that happened until now, with the SoBs gathered.


If we still have a choice, we should not merge a patch that has a known
serious problem, such as a crash. That's only done if the broken problematic
patch has already been committed to a tree that doesn't allow rebasing,
such as of course the main linux.git.

Here, I *think* it's just in linux-next and mmotm, so we still are allowed
to fix the original patch.
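
Separately, for anyone skimming this thread later, the shape of the fix, as I
read the patch (simplified and approximate; field names here may not match the
patch exactly), ends up roughly like this:

/*
 * The counters live directly in the statically allocated struct cma, so
 * cma_alloc() can update them even before slab and sysfs are up. Only the
 * kobject wrapper is dynamic, and it is allocated later, in subsys_initcall,
 * once kmalloc() is available.
 */
struct cma_kobject {
	struct kobject kobj;
	struct cma *cma;
};

struct cma {
	/* ... existing fields (base_pfn, count, bitmap, name, ...) ... */
#ifdef CONFIG_CMA_SYSFS
	atomic64_t nr_pages_succeeded;	/* updated from cma_alloc() */
	atomic64_t nr_pages_failed;
	struct cma_kobject *cma_kobj;	/* created in cma_sysfs_init() */
#endif
};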

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v6] mm: cma: support sysfs

2021-03-24 Thread John Hubbard

On 3/23/21 11:57 PM, Minchan Kim wrote:
...

, how about approximately this:

struct cma_kobject_wrapper {
struct cma *parent;
struct kobject kobj;
};

struct cma {
...
struct cma_kobject_wrapper *cma_kobj_wrapper;
};


...thus allowing readers of cma_sysfs.c to read that file more easily.


I agree cma->kobj->kobj is awkward but personally, I don't like the
naming: cma_kobject_wrapper parent pointer. cma_kobject is already a
wrapper so it sounds redundant to me, and it's not a parent in the same
hierarchy.

Since the kobj->kobj is just one line in the code (I don't imagine
it will grow in cma_sysfs in the future), I don't think it would
be a problem. If we really want to make it more clear, maybe?

cma->cma_kobj->kobj

It would be consistent with other variables in cma_sysfs_init.



OK, that's at least better than it was.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v6] mm: cma: support sysfs

2021-03-24 Thread John Hubbard

On 3/23/21 10:44 PM, Minchan Kim wrote:

On Tue, Mar 23, 2021 at 09:47:27PM -0700, John Hubbard wrote:

On 3/23/21 8:27 PM, Minchan Kim wrote:
...

+static int __init cma_sysfs_init(void)
+{
+   unsigned int i;
+
+   cma_kobj_root = kobject_create_and_add("cma", mm_kobj);
+   if (!cma_kobj_root)
+   return -ENOMEM;
+
+   for (i = 0; i < cma_area_count; i++) {
+   int err;
+   struct cma *cma;
+   struct cma_kobject *cma_kobj;
+
+   cma_kobj = kzalloc(sizeof(*cma_kobj), GFP_KERNEL);
+   if (!cma_kobj) {
+   kobject_put(cma_kobj_root);
+   return -ENOMEM;


This leaks little cma_kobj's all over the floor. :)


I thought kobject_put(cma_kobj_root) should deal with it. No?


If this fails when i > 0, there will be cma_kobj instances that
were stashed in the cma_areas[] array. But this code only deletes
the most recently allocated cma_kobj, not anything allocated on
previous iterations of the loop.


Oh, I misunderstood; I thought destroying the root kobject would release
its children recursively. Seems that's not true. I'll go back to the old version.


index 16c81c9cb9b7..418951a3f138 100644
--- a/mm/cma_sysfs.c
+++ b/mm/cma_sysfs.c
@@ -80,20 +80,19 @@ static struct kobj_type cma_ktype = {
  static int __init cma_sysfs_init(void)
  {
 unsigned int i;
+   int err;
+   struct cma *cma;
+   struct cma_kobject *cma_kobj;

 cma_kobj_root = kobject_create_and_add("cma", mm_kobj);
 if (!cma_kobj_root)
 return -ENOMEM;

 for (i = 0; i < cma_area_count; i++) {
-   int err;
-   struct cma *cma;
-   struct cma_kobject *cma_kobj;
-
 cma_kobj = kzalloc(sizeof(*cma_kobj), GFP_KERNEL);
 if (!cma_kobj) {
-   kobject_put(cma_kobj_root);
-   return -ENOMEM;
+   err = -ENOMEM;
+   goto out;
 }

 		cma = &cma_areas[i];
@@ -103,11 +102,21 @@ static int __init cma_sysfs_init(void)
cma_kobj_root, "%s", cma->name);
 if (err) {
 			kobject_put(&cma_kobj->kobj);
-   kobject_put(cma_kobj_root);
-   return err;
+   goto out;
 }
 }

 return 0;
+out:
+   while (--i >= 0) {
+   cma = &cma_areas[i];
+
+   kobject_put(&cma->kobj->kobj);



OK. As long as you are spinning a new version, let's fix up the naming to
be a little better, too. In this case, with a mildly dizzying mix of cma's
and kobjects, it actually makes a real difference. I wouldn't have asked,
but the above cma->kobj->kobj chain really made it obvious for me just now.

So instead of this (in cma.h):

struct cma_kobject {
struct cma *cma;
struct kobject kobj;
};

struct cma {
...
struct cma_kobject *kobj;
};

, how about approximately this:

struct cma_kobject_wrapper {
struct cma *parent;
struct kobject kobj;
};

struct cma {
...
struct cma_kobject_wrapper *cma_kobj_wrapper;
};


...thus allowing readers of cma_sysfs.c to read that file more easily.



+   kfree(cma->kobj);
+   cma->kobj = NULL;
+   }
+   kobject_put(cma_kobj_root);
+
+   return err;
  }
  subsys_initcall(cma_sysfs_init);





thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v6] mm: cma: support sysfs

2021-03-23 Thread John Hubbard

On 3/23/21 8:27 PM, Minchan Kim wrote:
...

+static int __init cma_sysfs_init(void)
+{
+   unsigned int i;
+
+   cma_kobj_root = kobject_create_and_add("cma", mm_kobj);
+   if (!cma_kobj_root)
+   return -ENOMEM;
+
+   for (i = 0; i < cma_area_count; i++) {
+   int err;
+   struct cma *cma;
+   struct cma_kobject *cma_kobj;
+
+   cma_kobj = kzalloc(sizeof(*cma_kobj), GFP_KERNEL);
+   if (!cma_kobj) {
+   kobject_put(cma_kobj_root);
+   return -ENOMEM;


This leaks little cma_kobj's all over the floor. :)


I thought kobject_put(cma_kobj_root) should deal with it. No?


If this fails when i > 0, there will be cma_kobj instances that
were stashed in the cma_areas[] array. But this code only deletes
the most recently allocated cma_kobj, not anything allocated on
previous iterations of the loop.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v6] mm: cma: support sysfs

2021-03-23 Thread John Hubbard
err;


Hopefully this little bit of logic could also go into the cleanup
routine.


+   }
+   }
+
+   return 0;
+}
+subsys_initcall(cma_sysfs_init);



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: Fix potential null dereference on pointer cma

2021-03-17 Thread John Hubbard

On 3/16/21 3:04 AM, Colin King wrote:

From: Colin Ian King 

At the start of the function there is a null pointer check on cma
and then a branch to the error handling label 'out'.  The subsequent call
to cma_sysfs_fail_pages_count dereferences cma, hence there is a
potential null pointer dereference issue.  Fix this by only calling


As far as I can tell, it's not possible to actually cause that null
failure with the existing kernel code paths.  *Might* be worth mentioning
that here (unless I'm wrong), but either way, looks good, so:

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


cma_sysfs_fail_pages_count if cma is not null.

Addresses-Coverity: ("Dereference after null check")
Fixes: dc4764f7e9ac ("mm: cma: support sysfs")
Signed-off-by: Colin Ian King 
---
  mm/cma.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/cma.c b/mm/cma.c
index 6d4dbafdb318..90e27458ddb7 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -512,7 +512,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
cma_sysfs_alloc_pages_count(cma, count);
} else {
count_vm_event(CMA_ALLOC_FAIL);
-   cma_sysfs_fail_pages_count(cma, count);
+   if (cma)
+   cma_sysfs_fail_pages_count(cma, count);
}
  
  	return page;






Re: [PATCH v2] mm: vmstat: add cma statistics

2021-03-03 Thread John Hubbard

On 3/2/21 10:33, Minchan Kim wrote:

Since CMA is used more widely, it's worth having CMA
allocation statistics in vmstat. With it, we could
know how aggressively the system uses cma allocation and
how often it fails.

Signed-off-by: Minchan Kim 
---
* from v1 - 
https://lore.kernel.org/linux-mm/20210217170025.512704-1-minc...@kernel.org/
   * change alloc_attempt with alloc_success - jhubbard
   * item per line for vm_event_item - jhubbard

  include/linux/vm_event_item.h |  4 
  mm/cma.c  | 12 +---
  mm/vmstat.c   |  4 
  3 files changed, 17 insertions(+), 3 deletions(-)



Seems reasonable, and the diffs look good.

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA


diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..21d7c7f72f1c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -70,6 +70,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
  #endif
  #ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
+#endif
+#ifdef CONFIG_CMA
+   CMA_ALLOC_SUCCESS,
+   CMA_ALLOC_FAIL,
  #endif
UNEVICTABLE_PGCULLED,   /* culled to noreclaim list */
UNEVICTABLE_PGSCANNED,  /* scanned for reclaimability */
diff --git a/mm/cma.c b/mm/cma.c
index 23d4a97c834a..04ca863d1807 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -435,13 +435,13 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
int ret = -ENOMEM;
  
  	if (!cma || !cma->count || !cma->bitmap)

-   return NULL;
+   goto out;
  
  	pr_debug("%s(cma %p, count %zu, align %d)\n", __func__, (void *)cma,

 count, align);
  
  	if (!count)

-   return NULL;
+   goto out;
  
  	mask = cma_bitmap_aligned_mask(cma, align);

offset = cma_bitmap_aligned_offset(cma, align);
@@ -449,7 +449,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
  
  	if (bitmap_count > bitmap_maxno)

-   return NULL;
+   goto out;
  
  	for (;;) {

		mutex_lock(&cma->lock);
@@ -506,6 +506,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
}
  
  	pr_debug("%s(): returned %p\n", __func__, page);

+out:
+   if (page)
+   count_vm_event(CMA_ALLOC_SUCCESS);
+   else
+   count_vm_event(CMA_ALLOC_FAIL);
+
return page;
  }
  
diff --git a/mm/vmstat.c b/mm/vmstat.c

index 97fc32a53320..d8c32a33208d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1305,6 +1305,10 @@ const char * const vmstat_text[] = {
  #ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
"htlb_buddy_alloc_fail",
+#endif
+#ifdef CONFIG_CMA
+   "cma_alloc_success",
+   "cma_alloc_fail",
  #endif
"unevictable_pgs_culled",
"unevictable_pgs_scanned",



Re: [PATCH] mm: vmstat: add cma statistics

2021-02-17 Thread John Hubbard

On 2/17/21 9:00 AM, Minchan Kim wrote:

Since CMA is used more widely, it's worth having CMA
allocation statistics in vmstat. With it, we could
know how aggressively the system uses cma allocation and
how often it fails.

Signed-off-by: Minchan Kim 
---
  include/linux/vm_event_item.h |  3 +++
  mm/cma.c  | 12 +---
  mm/vmstat.c   |  4 
  3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..0c567014ce82 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -70,6 +70,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
  #endif
  #ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
+#endif
+#ifdef CONFIG_CMA
+   CMA_ALLOC, CMA_ALLOC_FAIL,


This seems wrong: here it's called "alloc", but in the output it's
called "alloc success", and in the implementation it's clearly
"alloc attempt" that is being counted.

Once these are all made consistent, then the bug should naturally
go away as part of that.

nit: I think the multiple items per line is a weak idea at best, even
though it's used here already. Each item is important and needs to be
visually compared to its output item later. So one per line might
have helped avoid mismatches, and I think we should change to that to
encourage that trend.
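
In other words, something like this (just a sketch of the direction I mean;
exact names are of course up to you, and this is written as a standalone
fragment rather than the real headers):

/* vm_event_item.h side: one item per line, names that match the output */
enum vm_event_item_sketch {
	CMA_ALLOC_SUCCESS,
	CMA_ALLOC_FAIL,
};

/* vmstat.c side: the strings line up visually with the enum above */
static const char * const vmstat_text_sketch[] = {
	"cma_alloc_success",
	"cma_alloc_fail",
};

That way a reviewer can eyeball the enum and the string table side by side
and immediately spot a mismatch.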

thanks,
--
John Hubbard
NVIDIA


  #endif
UNEVICTABLE_PGCULLED,   /* culled to noreclaim list */
UNEVICTABLE_PGSCANNED,  /* scanned for reclaimability */
diff --git a/mm/cma.c b/mm/cma.c
index 23d4a97c834a..ea1e39559526 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -434,14 +434,16 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
struct page *page = NULL;
int ret = -ENOMEM;
  
+	count_vm_event(CMA_ALLOC);

+
if (!cma || !cma->count || !cma->bitmap)
-   return NULL;
+   goto out;
  
  	pr_debug("%s(cma %p, count %zu, align %d)\n", __func__, (void *)cma,

 count, align);
  
  	if (!count)

-   return NULL;
+   goto out;
  
  	mask = cma_bitmap_aligned_mask(cma, align);

offset = cma_bitmap_aligned_offset(cma, align);
@@ -449,7 +451,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
  
  	if (bitmap_count > bitmap_maxno)

-   return NULL;
+   goto out;
  
  	for (;;) {

		mutex_lock(&cma->lock);
@@ -506,6 +508,10 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
}
  
  	pr_debug("%s(): returned %p\n", __func__, page);

+out:
+   if (!page)
+   count_vm_event(CMA_ALLOC_FAIL);
+
return page;
  }
  
diff --git a/mm/vmstat.c b/mm/vmstat.c

index 97fc32a53320..d8c32a33208d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1305,6 +1305,10 @@ const char * const vmstat_text[] = {
  #ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
"htlb_buddy_alloc_fail",
+#endif
+#ifdef CONFIG_CMA
+   "cma_alloc_success",
+   "cma_alloc_fail",
  #endif
"unevictable_pgs_culled",
"unevictable_pgs_scanned",





Re: [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-10 Thread John Hubbard

On 2/10/21 4:59 AM, Daniel Vetter wrote:
...

GPU atomic operations to sysmem are hard to categorize, because application
programmers could easily write programs that do a long series of atomic
operations. Such a program would be a little weird, but it's hard to rule out.


Yeah, but we can forcefully break this whenever we feel like by revoking
the page, moving it, and then reinstating the gpu pte again and let it
continue.


Oh yes, that's true.



If that's not possible then what we need here instead is an mlock() type of
thing I think.

No need for that, then.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-10 Thread John Hubbard

On 2/5/21 12:41 PM, Joao Martins wrote:

Add a unpin_user_page_range_dirty_lock() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.

To that end, define another iterator for_each_compound_range()
that operates in page ranges as opposed to page array.

For users (like RDMA mr_dereg) where each sg represents a
contiguous set of pages, we're able to more efficiently unpin
pages without having to supply an array of pages, which is much of what
happens today with unpin_user_pages().

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
  include/linux/mm.h |  2 ++
  mm/gup.c   | 62 ++
  2 files changed, 64 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a608feb0d42e..b76063f7f18a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
  void unpin_user_page(struct page *page);
  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 bool make_dirty);
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty);
  void unpin_user_pages(struct page **pages, unsigned long npages);
  
  /**

diff --git a/mm/gup.c b/mm/gup.c
index 467a11df216d..938964d31494 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
  }
  EXPORT_SYMBOL(unpin_user_page);
  
+static inline void compound_range_next(unsigned long i, unsigned long npages,

+  struct page **list, struct page **head,
+  unsigned int *ntails)


Yes, the new names look good, and I have failed to find any logic errors, so:

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA
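
PS: as a side note for other reviewers, a hypothetical caller sketch (not part
of this patch, and loosely modeled on the RDMA mr_dereg case that the commit
message mentions) would look something like this:

#include <linux/mm.h>
#include <linux/scatterlist.h>

/*
 * A driver that pinned a physically contiguous range of pages per
 * scatterlist entry can now release each range with a single call, instead
 * of building up an array of page pointers for unpin_user_pages().
 */
static void example_unpin_sgt(struct sg_table *sgt, bool make_dirty)
{
	struct scatterlist *sg;
	int i;

	for_each_sgtable_sg(sgt, sg, i)
		unpin_user_page_range_dirty_lock(sg_page(sg),
				DIV_ROUND_UP(sg->length, PAGE_SIZE),
				make_dirty);
}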


+{
+   struct page *next, *page;
+   unsigned int nr = 1;
+
+   if (i >= npages)
+   return;
+
+   next = *list + i;
+   page = compound_head(next);
+   if (PageCompound(page) && compound_order(page) >= 1)
+   nr = min_t(unsigned int,
+  page + compound_nr(page) - next, npages - i);
+
+   *head = page;
+   *ntails = nr;
+}
+
+#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+compound_range_next(__i, __npages, __list, &(__head), 
&(__ntails)); \
+__i < __npages; __i += __ntails, \
+compound_range_next(__i, __npages, __list, &(__head), &(__ntails)))
+
  static inline void compound_next(unsigned long i, unsigned long npages,
 struct page **list, struct page **head,
 unsigned int *ntails)
@@ -303,6 +329,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
  }
  EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
  
+/**

+ * unpin_user_page_range_dirty_lock() - release and optionally dirty
+ * gup-pinned page range
+ *
+ * @page:  the starting page of a range maybe marked dirty, and definitely 
released.
+ * @npages: number of consecutive pages to release.
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * "gup-pinned page range" refers to a range of pages that has had one of the
+ * get_user_pages() variants called on that page.
+ *
+ * For the page ranges defined by [page .. page+npages], make that range (or
+ * its head pages, if a compound page) dirty, if @make_dirty is true, and if 
the
+ * page range was previously listed as clean.
+ *
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), unpin_user_page().
+ *
+ */
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+   unsigned long index;
+   struct page *head;
+   unsigned int ntails;
+
+   for_each_compound_range(index, , npages, head, ntails) {
+   if (make_dirty && !PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
+   }
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+
  /**
   * unpin_user_pages() - release an array of gup-pinned pages.
   * @pages:  array of pages to be marked dirty and released.





Re: [PATCH v3] mm: cma: support sysfs

2021-02-10 Thread John Hubbard

On 2/9/21 11:55 PM, Minchan Kim wrote:

Since CMA is getting used more widely, it's more important to
keep monitoring CMA statistics for system health since it's
directly related to user experience.

This patch introduces sysfs statistics for CMA, in order to provide
some basic monitoring of the CMA allocator.

  * the number of CMA page allocation attempts
  * the number of CMA page allocation failures

These two values allow the user to calculate the allocation
failure rate for each CMA area.

e.g.)
   /sys/kernel/mm/cma/WIFI/cma_alloc_pages_[attempts|fails]
   /sys/kernel/mm/cma/SENSOR/cma_alloc_pages_[attempts|fails]
   /sys/kernel/mm/cma/BLUETOOTH/cma_alloc_pages_[attempts|fails]

Signed-off-by: Minchan Kim 
---


Looks good.

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA


 From v2 - 
https://lore.kernel.org/linux-mm/20210208180142.2765456-1-minc...@kernel.org/
  * sysfs doc and description modification - jhubbard

 From v1 - 
https://lore.kernel.org/linux-mm/20210203155001.4121868-1-minc...@kernel.org/
  * fix sysfs build and refactoring - willy
  * rename and drop some attributes - jhubbard
  Documentation/ABI/testing/sysfs-kernel-mm-cma |  25 
  mm/Kconfig|   7 ++
  mm/Makefile   |   1 +
  mm/cma.c  |   6 +-
  mm/cma.h  |  18 +++
  mm/cma_sysfs.c| 114 ++
  6 files changed, 170 insertions(+), 1 deletion(-)
  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-cma
  create mode 100644 mm/cma_sysfs.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma 
b/Documentation/ABI/testing/sysfs-kernel-mm-cma
new file mode 100644
index ..f518af819cee
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma
@@ -0,0 +1,25 @@
+What:  /sys/kernel/mm/cma/
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   /sys/kernel/mm/cma/ contains a subdirectory for each CMA
+   heap name (also sometimes called CMA areas).
+
+   Each CMA heap subdirectory (that is, each
+   /sys/kernel/mm/cma/ directory) contains the
+   following items:
+
+   cma_alloc_pages_attempts
+   cma_alloc_pages_fails
+
+What:  /sys/kernel/mm/cma//cma_alloc_pages_attempts
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of pages CMA API tried to allocate
+
+What:  /sys/kernel/mm/cma//cma_alloc_pages_fails
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of pages CMA API failed to allocate
diff --git a/mm/Kconfig b/mm/Kconfig
index ec35bf406439..ad7e9c065657 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -513,6 +513,13 @@ config CMA_DEBUGFS
help
  Turns on the DebugFS interface for CMA.
  
+config CMA_SYSFS

+   bool "CMA information through sysfs interface"
+   depends on CMA && SYSFS
+   help
+ This option exposes some sysfs attributes to get information
+ from CMA.
+
  config CMA_AREAS
int "Maximum count of the CMA areas"
depends on CMA
diff --git a/mm/Makefile b/mm/Makefile
index b2a564eec27f..e10c14300191 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_CMA)   += cma.o
  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
  obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
  obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
  obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
diff --git a/mm/cma.c b/mm/cma.c
index 23d4a97c834a..b42ccafc71e5 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -447,9 +447,10 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
offset = cma_bitmap_aligned_offset(cma, align);
bitmap_maxno = cma_bitmap_maxno(cma);
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
+   cma_sysfs_alloc_count(cma, count);
  
  	if (bitmap_count > bitmap_maxno)

-   return NULL;
+   goto out;
  
  	for (;;) {

		mutex_lock(&cma->lock);
@@ -504,6 +505,9 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
__func__, count, ret);
cma_debug_show_areas(cma);
}
+out:
+   if (!page)
+   cma_sysfs_fail_count(cma, count);
  
  	pr_debug("%s(): returned %p\n", __func__, page);

return page;
diff --git a/mm/cma.h b/mm/cma.h
index 42ae082cb067..24a1d61eabc7 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -3,6 +3,14 @@
  #define __MM_CMA_H__
  
  #include 

+#include 
+
+struct cma_stat {
+   spinlock_t lock;
+   unsigned long pages_attempts;   /* the number of CMA p

Re: [PATCH v2] mm: cma: support sysfs

2021-02-09 Thread John Hubbard

On 2/9/21 11:26 PM, Greg KH wrote:
...

I just am not especially happy about the inability to do natural, efficient
things here, such as use a statically allocated set of things with sysfs. And
I remain convinced that the above is not "improper"; it's a reasonable
step, given the limitations of the current sysfs design. I just wanted to say
that out loud, as my proposal sinks to the bottom of the trench here. haha :)


What is "odd" is that you are creating an object in the kernel that you
_never_ free.  That's not normal at all in the kernel, and so, your wish
to have a kobject that you never free represent this object also is not
normal :)



OK, thanks for taking the time to explain that, much appreciated!


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm: cma: support sysfs

2021-02-09 Thread John Hubbard

On 2/9/21 11:12 PM, Minchan Kim wrote:
...

Agreed. How about this for the warning part?

+
+/*
+ * note: kobj_type should provide a release function to free dynamically
+ * allocated object since kobject is responsible for controlling lifespan
+ * of the object. However, cma_area is a static object so technically, it
+ * doesn't need a release function. It's a very exceptional case, so please
+ * do not follow this model.
+ */
  static struct kobj_type cma_ktype = {
	.sysfs_ops = &cma_sysfs_ops,
	.default_groups = cma_groups,
+   .release = NULL, /* do not follow. See above */
  };



No, please no.  Just do it the correct way, what is the objection to
creating a few dynamic kobjects from the heap?  How many of these are
you going to have that it will somehow be "wasteful"?

Please do it properly.


Oh, I misunderstood your words "don't provide a release function for the
kobject", so I thought you agreed with John. If you didn't, we are stuck again:
IIUC, the objection from John was that the cma_stat lifetime should be on the parent
object, which is reasonable and makes the code simple.
Frankly speaking, I don't have a strong opinion about either approach.
John?



We should do it as Greg requests, now that it's quite clear that he's insisting
on this. Not a big deal.

I just am not especially happy about the inability to do natural, efficient
things here, such as use a statically allocated set of things with sysfs. And
I remain convinced that the above is not "improper"; it's a reasonable
step, given the limitations of the current sysfs design. I just wanted to say
that out loud, as my proposal sinks to the bottom of the trench here. haha :)


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-09 Thread John Hubbard

On 2/9/21 5:37 AM, Daniel Vetter wrote:

On Tue, Feb 9, 2021 at 1:57 PM Alistair Popple  wrote:


On Tuesday, 9 February 2021 9:27:05 PM AEDT Daniel Vetter wrote:


Recent changes to pin_user_pages() prevent the creation of pinned pages in
ZONE_MOVABLE. This series allows pinned pages to be created in

ZONE_MOVABLE

as attempts to migrate may fail which would be fatal to userspace.

In this case migration of the pinned page is unnecessary as the page can

be

unpinned at anytime by having the driver revoke atomic permission as it
does for the migrate_to_ram() callback. However a method of calling this
when memory needs to be moved has yet to be resolved so any discussion is
welcome.


Why do we need to pin for gpu atomics? You still have the callback for
cpu faults, so you
can move the page as needed, and hence a long-term pin sounds like the
wrong approach.


Technically a real long term unmoveable pin isn't required, because as you say
the page can be moved as needed at any time. However I needed some way of
stopping the CPU page from being freed once the userspace mappings for it had
been removed. Obviously I could have just used get_page() but from the
perspective of page migration the result is much the same as a pin - a page
which can't be moved because of the extra refcount.


long term pin vs short term page reference aren't fully fleshed out.
But the rule more or less is:
- short term page reference: _must_ get released in finite time for
migration and other things, either because you have a callback, or
because it's just for direct I/O, which will complete. This means
short term pins will delay migration, but not foul it completely



GPU atomic operations to sysmem are hard to categorize, because application
programmers could easily write programs that do a long series of atomic
operations. Such a program would be a little weird, but it's hard to rule out.




- long term pin: the page cannot be moved, all migration must fail.
Also this will have an impact on COW behaviour for fork (but not sure
where those patches are, John Hubbard will know).



That would be Jason's commit 57efa1fe59576 ("mm/gup: prevent gup_fast from 
racing
with COW during fork"), which is in linux-next 20201216.




So I think for your use case here you want a) short term page
reference to make sure it doesn't disappear plus b) callback to make
sure migrate isn't blocked.

Breaking ZONE_MOVEABLE with either allowing long term pins or failing
migrations because you don't release your short term page reference
isn't good.


The normal solution of registering an MMU notifier to unpin the page when it
needs to be moved also doesn't work as the CPU page tables now point to the
device-private page and hence the migration code won't call any invalidate
notifiers for the CPU page.


Yeah you need some other callback for migration on the page directly.
it's a bit awkward since there is one already for struct
address_space, but that's own by the address_space/page cache, not
HMM. So I think we need something else, maybe something for each
ZONE_DEVICE?



This direction sounds at least...possible. Using MMU notifiers instead of pins
is definitely appealing. I'm not quite clear on the callback idea above, but
overall it seems like taking advantage of the ZONE_DEVICE tracking of pages
(without having to put anything additional in each struct page), could work.

Additional notes or ideas here are definitely welcome.



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm: cma: support sysfs

2021-02-09 Thread John Hubbard

On 2/9/21 9:49 AM, Greg KH wrote:

That's fine if you want to add it to the parent.  If so, then the
kobject controls the lifetime of the structure, nothing else can.


The problem was that the parent object (i.e., struct cma cma_areas) is a
static array, so the kobj->release function will be NULL or just a
dummy. Is it okay? I thought it was one of the things you wanted
to avoid.


No, that is not ok.


Either is fine with me, what is "forbidden" is having a kobject and
somehow thinking that it does not control the lifetime of the structure.


Since the parent object is a static array, there is no need to control the
lifetime, so I am curious if the parent-object approach is okay from a kobject
handling point of view.


So the array is _NEVER_ freed?  If not, fine, don't provide a release
function for the kobject, but ick, just make a dynamic kobject I don't
see the problem for something so tiny and not very many...



Yeah, I wasn't trying to generate so much discussion. I initially thought it
would be a minor comment: "just use an embedded struct and avoid some extra
code".


I worry that any static kobject might be copied/pasted as someone might
think this is an ok thing to do.  And it's not an ok thing to do.



Overall, then, we're seeing that there is a small design hole: in order
to use sysfs most naturally, you either must provide a dynamically allocated
item for it, or you must use the static kobject, and the second one sets a
bad example.

I think we should just use a static kobject, with a cautionary comment to
would-be copy-pasters, that explains the design constraints above. That way,
we still get a nice, less-code implementation, a safe design, and it only
really costs us a single carefully written comment.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm: cma: support sysfs

2021-02-08 Thread John Hubbard

On 2/8/21 10:27 PM, John Hubbard wrote:

On 2/8/21 10:13 PM, Greg KH wrote:

On Mon, Feb 08, 2021 at 05:57:17PM -0800, John Hubbard wrote:

On 2/8/21 3:36 PM, Minchan Kim wrote:
...

    char name[CMA_MAX_NAME];
+#ifdef CONFIG_CMA_SYSFS
+    struct cma_stat    *stat;


This should not be a pointer. By making it a pointer, you've added a bunch of 
pointless
extra code to the implementation.


Originally, I went with the object lifetime with struct cma as you
suggested to make code simple. However, Greg KH wanted to have
release for kobj_type since it is consistent with other kobject
handling.


Are you talking about the kobj in your new struct cma_stat? That seems
like circular logic if so. I'm guessing Greg just wanted kobj methods
to be used *if* you are dealing with kobjects. That's a narrower point.

I can't imagine that he would have insisted on having additional
allocations just so that kobj freeing methods could be used. :)


Um, yes, I was :)

You can not add a kobject to a structure and then somehow think you can
just ignore the reference counting issues involved.  If a kobject is
part of a structure then the kobject is responsible for controlling the
lifespan of the memory, nothing else can be.

So by making the kobject dynamic, you properly handle that memory
lifespan of the object, instead of having to worry about the lifespan of
the larger object (which the original patch was not doing.)

Does that make sense?


That part makes sense, yes, thanks. The part that I'm trying to straighten
out is, why was kobject even added to the struct cma_stat in the first
place? Why not just leave .stat as a static member variable, without
a kobject in it, and done?



Sorry, I think I get it now: this is in order to allow a separate lifetime
for the .stat member. I was sort of implicitly assuming that the "right"
way to do it is just have the whole object use one lifetime management,
but as you say, there is no kobject being added to the parent.

I still feel odd about the allocation and freeing of something that seems
to be logically the same lifetime (other than perhaps a few, briefly pending
sysfs reads, at the end of life). So I'd still think that the kobject should
be added to the parent...

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm: cma: support sysfs

2021-02-08 Thread John Hubbard

On 2/8/21 10:13 PM, Greg KH wrote:

On Mon, Feb 08, 2021 at 05:57:17PM -0800, John Hubbard wrote:

On 2/8/21 3:36 PM, Minchan Kim wrote:
...

char name[CMA_MAX_NAME];
+#ifdef CONFIG_CMA_SYSFS
+   struct cma_stat *stat;


This should not be a pointer. By making it a pointer, you've added a bunch of 
pointless
extra code to the implementation.


Originally, I went with the object lifetime with struct cma as you
suggested to make code simple. However, Greg KH wanted to have
release for kobj_type since it is consistent with other kobject
handling.


Are you talking about the kobj in your new struct cma_stat? That seems
like circular logic if so. I'm guessing Greg just wanted kobj methods
to be used *if* you are dealing with kobjects. That's a narrower point.

I can't imagine that he would have insisted on having additional
allocations just so that kobj freeing methods could be used. :)


Um, yes, I was :)

You can not add a kobject to a structure and then somehow think you can
just ignore the reference counting issues involved.  If a kobject is
part of a structure then the kobject is responsible for controlling the
lifespan of the memory, nothing else can be.

So by making the kobject dynamic, you properly handle that memory
lifespan of the object, instead of having to worry about the lifespan of
the larger object (which the original patch was not doing.)

Does that make sense?


That part makes sense, yes, thanks. The part that I'm trying to straighten
out is, why was kobject even added to the struct cma_stat in the first
place? Why not just leave .stat as a static member variable, without
a kobject in it, and done?

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm: cma: support sysfs

2021-02-08 Thread John Hubbard

On 2/8/21 9:18 PM, John Hubbard wrote:

On 2/8/21 8:19 PM, Minchan Kim wrote:

On Mon, Feb 08, 2021 at 05:57:17PM -0800, John Hubbard wrote:

On 2/8/21 3:36 PM, Minchan Kim wrote:
...

    char name[CMA_MAX_NAME];
+#ifdef CONFIG_CMA_SYSFS
+    struct cma_stat    *stat;


This should not be a pointer. By making it a pointer, you've added a bunch of 
pointless

extra code to the implementation.


Originally, I went with the object lifetime with struct cma as you
suggested to make code simple. However, Greg KH wanted to have
release for kobj_type since it is consistent with other kobject
handling.


Are you talking about the kobj in your new struct cma_stat? That seems
like circular logic if so. I'm guessing Greg just wanted kobj methods
to be used *if* you are dealing with kobjects. That's a narrower point.

I can't imagine that he would have insisted on having additional
allocations just so that kobj freeing methods could be used. :)


I have no objection if Greg agree static kobject is okay in this
case. Greg?



What I meant is, no kobject at all in the struct cma_stat member
variable. The lifetime of the cma_stat member is the same as the
containing struct, so no point in putting a kobject into it.



...unless...are you actually *wanting* to keep the lifetimes separate?
Hmmm, given the short nature of sysfs reads, though, I'd be inclined
to just let the parent object own the lifetime. But maybe I'm missing
some design point here?

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm: cma: support sysfs

2021-02-08 Thread John Hubbard

On 2/8/21 8:19 PM, Minchan Kim wrote:

On Mon, Feb 08, 2021 at 05:57:17PM -0800, John Hubbard wrote:

On 2/8/21 3:36 PM, Minchan Kim wrote:
...

char name[CMA_MAX_NAME];
+#ifdef CONFIG_CMA_SYSFS
+   struct cma_stat *stat;


This should not be a pointer. By making it a pointer, you've added a bunch of 
pointless
extra code to the implementation.


Originally, I went with the object lifetime with struct cma as you
suggested to make code simple. However, Greg KH wanted to have
release for kobj_type since it is consistent with other kobject
handling.


Are you talking about the kobj in your new struct cma_stat? That seems
like circular logic if so. I'm guessing Greg just wanted kobj methods
to be used *if* you are dealing with kobjects. That's a narrower point.

I can't imagine that he would have insisted on having additional
allocations just so that kobj freeing methods could be used. :)


I have no objection if Greg agree static kobject is okay in this
case. Greg?



What I meant is, no kobject at all in the struct cma_stat member
variable. The lifetime of the cma_stat member is the same as the
containing struct, so no point in putting a kobject into it.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm: cma: support sysfs

2021-02-08 Thread John Hubbard

On 2/8/21 3:36 PM, Minchan Kim wrote:
...

char name[CMA_MAX_NAME];
+#ifdef CONFIG_CMA_SYSFS
+   struct cma_stat *stat;


This should not be a pointer. By making it a pointer, you've added a bunch of 
pointless
extra code to the implementation.


Originally, I went with the object lifetime with struct cma as you
suggested to make code simple. However, Greg KH wanted to have
release for kobj_type since it is consistent with other kobject
handling.


Are you talking about the kobj in your new struct cma_stat? That seems
like circular logic if so. I'm guessing Greg just wanted kobj methods
to be used *if* you are dealing with kobjects. That's a narrower point.

I can't imagine that he would have insisted on having additional
allocations just so that kobj freeing methods could be used. :)


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm: cma: support sysfs

2021-02-08 Thread John Hubbard
-   kobject_put(&cma->stat->kobj);
+   kobject_put(&cma->stat.kobj);
goto out;
}
-   } while (++i < cma_area_count)
+   } while (++i < cma_area_count);

return 0;
 out:
while (--i >= 0) {
cma = &cma_areas[i];
-   kobject_put(&cma->stat->kobj);
+   kobject_put(&cma->stat.kobj);
}

-   kfree(cma_stats);
kobject_put(cma_kobj);

return -ENOMEM;


+#endif
  };
  
  extern struct cma cma_areas[MAX_CMA_AREAS];

@@ -26,4 +37,11 @@ static inline unsigned long cma_bitmap_maxno(struct cma *cma)
return cma->count >> cma->order_per_bit;
  }
  
+#ifdef CONFIG_CMA_SYSFS

+void cma_sysfs_alloc_count(struct cma *cma, size_t count);
+void cma_sysfs_fail(struct cma *cma, size_t count);
+#else
+static inline void cma_sysfs_alloc_count(struct cma *cma, size_t count) {};
+static inline void cma_sysfs_fail(struct cma *cma, size_t count) {};
+#endif
  #endif
diff --git a/mm/cma_sysfs.c b/mm/cma_sysfs.c
new file mode 100644
index ..1f6b9f785825
--- /dev/null
+++ b/mm/cma_sysfs.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CMA SysFS Interface
+ *
+ * Copyright (c) 2021 Minchan Kim 
+ */
+
+#include 
+#include 
+#include 
+
+#include "cma.h"
+
+static struct cma_stat *cma_stats;


I don't know what that's for but it definitely is not needed if you make 
cma.stat
not a pointer, and not in any other case either.


+
+void cma_sysfs_alloc_count(struct cma *cma, size_t count)
+{
+   spin_lock(&cma->stat->lock);
+   cma->stat->pages_attempt += count;
+   spin_unlock(&cma->stat->lock);
+}
+
+void cma_sysfs_fail(struct cma *cma, size_t count)
+{
+   spin_lock(&cma->stat->lock);
+   cma->stat->pages_fail += count;
+   spin_unlock(&cma->stat->lock);
+}
+
+#define CMA_ATTR_RO(_name) \
+   static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+
+static struct kobject *cma_kobj;
+
+static ssize_t cma_alloc_pages_attempt_show(struct kobject *kobj,
+   struct kobj_attribute *attr, char *buf)
+{
+   struct cma_stat *stat = container_of(kobj, struct cma_stat, kobj);
+
+   return sysfs_emit(buf, "%lu\n", stat->pages_attempt);
+}
+CMA_ATTR_RO(cma_alloc_pages_attempt);
+
+static ssize_t cma_alloc_pages_fail_show(struct kobject *kobj,
+   struct kobj_attribute *attr, char *buf)
+{
+   struct cma_stat *stat = container_of(kobj, struct cma_stat, kobj);
+
+   return sysfs_emit(buf, "%lu\n", stat->pages_fail);
+}
+CMA_ATTR_RO(cma_alloc_pages_fail);
+
+static void cma_kobj_release(struct kobject *kobj)
+{
+   struct cma_stat *stat = container_of(kobj, struct cma_stat, kobj);
+
+   kfree(stat);
+}
+
+static struct attribute *cma_attrs[] = {
+   &cma_alloc_pages_attempt_attr.attr,
+   &cma_alloc_pages_fail_attr.attr,
+   NULL,
+};
+ATTRIBUTE_GROUPS(cma);
+
+static struct kobj_type cma_ktype = {
+   .release = cma_kobj_release,
+   .sysfs_ops = &kobj_sysfs_ops,
+   .default_groups = cma_groups
+};
+
+static int __init cma_sysfs_init(void)
+{
+   int i = 0;
+   struct cma *cma;
+
+   cma_kobj = kobject_create_and_add("cma", mm_kobj);
+   if (!cma_kobj) {
+   pr_err("failed to create cma kobject\n");
+   return -ENOMEM;
+   }
+
+   cma_stats = kzalloc(array_size(sizeof(struct cma_stat),
+   cma_area_count), GFP_KERNEL);
+   if (!cma_stats) {
+   pr_err("failed to create cma_stats\n");
+   goto out;
+   }
+
+   do {
+   cma = &cma_areas[i];
+   cma->stat = &cma_stats[i];
+   spin_lock_init(&cma->stat->lock);
+   if (kobject_init_and_add(&cma->stat->kobj, &cma_ktype,
+   cma_kobj, "%s", cma->name)) {
+   kobject_put(&cma->stat->kobj);
+   goto out;
+   }
+   } while (++i < cma_area_count)


This clearly is not going to compile! Don't forget to build and test the
patches.


+
+   return 0;
+out:
+   while (--i >= 0) {
+   cma = &cma_areas[i];
+   kobject_put(&cma->stat->kobj);
+   }
+
+   kfree(cma_stats);
+   kobject_put(cma_kobj);
+
+   return -ENOMEM;
+}
+subsys_initcall(cma_sysfs_init);



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-08 Thread John Hubbard

On 2/6/21 9:08 AM, Pintu Agarwal wrote:
...

# cat meminfo | grep -i cma
CmaTotal:1048576 kB
CmaFree: 1046872 kB


This CMA info was added by me way back in 2014.
At that time I even thought about adding this cma alloc/fail counter in vmstat.
That time I also had an internal patch about it but later dropped
(never released to mainline).
If required I can re-work again on this part.

And I have few more points to add here.
1) In the past I have faced this cma allocation failure (which could
be common in small embedded devices).
Earlier it was even difficult to figure if cma failure actually happened.
Thus I added this simple patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.11-rc6=5984af1082f3b115082178ed88c47033d43b924d

2) IMO just reporting CMA alloc/fail may not be enough (at times). The
main point is for which client/dev is this failure happening ?
Sometimes just tuning the size or fixing the alignment can resolve the
failure if we know the client. For global CMA it will be just NULL
(dev).

3) IMO having 2 options SYSFS and DEBUGFS may confuse the
developer/user (personal experience). So are we going to remove the
DEBUGFS or merge it ?



It is confusing to have a whole bunch of different places to find data
about a system. Here, I think the key is to add to the Documentation/
directory. So far, the documentation I see is:

admin-guide/mm/cma_debugfs.rst: only covers debugfs
admin-guide/kernel-parameters.txt:602: for CMA kernel parameters

If we add sysfs items then we will want a new .rst document that covers
that, and lists all the places to look.

So anyway, the underlying guidelines for which fs to use are approximately:

* sysfs: used for monitoring. One value per item, stable ABI, production.

* debugfs: *theoretically* not a stable ABI (we hope no one locks us in
by doing obnoxious things that break userspace if the debugfs APIs change).
The intention is that developers can put *anything* in there.

debugfs has a firm place in debugging, and is probably a keeper here.

I originally thought we should combine CMA's sysfs and debugfs items,
but Minchan listed an example that seems to show a pretty clear need
for monitoring of CMA areas, in production systems, and that's what
sysfs is for.

So it looks like we want both debugfs and sysfs for CMA, plus a new
overall CMA documentation that points to everything.


4) Sometimes CMA (or DMA) allocation failure can happen even very
early in boot time itself. At that time these are anyways not useful.
Some system may not proceed if CMA/DMA allocation is failing (again
personal experience).
==> Anyways this is altogether is different case...

5) The default max CMA areas are defined to be 7 but user can
configure it to any number, may be 20 or 50 (???)

6) Thus I would like to propose another thing here.
Just like we have /proc/vmallocinfo, /proc/slabinfo, etc., we can also
have: /proc/cmainfo
Here in /proc/cmainfo we can capture more detailed information:
Global CMA Data: Alloc/Free
Client specific data: name, size, success, fail, flags, align, etc
(just a random example).
==> This can dynamically grow as large as possible
==> It can also be enabled/disabled based on CMA config itself (no
additional config required)

Any feedback on point (6) specifically ?



Well, /proc these days is for process-specific items. And CMA areas are
system-wide. So that's an argument against it. However...if there is any
process-specific CMA allocation info to report, then /proc is just the
right place for it.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-05 Thread John Hubbard

On 2/5/21 1:28 PM, Minchan Kim wrote:

On Fri, Feb 05, 2021 at 12:25:52PM -0800, John Hubbard wrote:

On 2/5/21 8:15 AM, Minchan Kim wrote:
...
OK. But...what *is* your goal, and why is this useless (that's what
orthogonal really means here) for your goal?


As I mentioned, the goal is to monitor the failures from each CMA,
since each has its own purpose.

Let's have an example.

System has 5 CMA areas and each CMA is associated with its own
user scenario. They have exclusive CMA areas to avoid
fragmentation problems.

CMA-1 depends on bluetooth
CMA-2 depends on WIFI
CMA-3 depends on sensor-A
CMA-4 depends on sensor-B
CMA-5 depends on sensor-C



aha, finally. I had no idea that sort of use case was happening.

This would be good to put in the patch commit description.


With this, we could catch which module was affected but with global failure,
I couldn't find who was affected.



Also, would you be willing to try out something simple first,
such as providing indication that cma is active and it's overall success
rate, like this:

/proc/vmstat:

cma_alloc_success   125
cma_alloc_failure   25

...or is the only way to provide the more detailed items, complete with
per-CMA details, in a non-debugfs location?




...and then, to see if more is needed, some questions:

a)  Do you know of an upper bound on how many cma areas there can be
(I think Matthew also asked that)?


There is no upper bound since it's configurable.



OK, thanks, so that pretty much rules out putting per-cma details into
anything other than a directory or something like it.



b) Is tracking the cma area really as valuable as other possibilities? We can put
"a few" to "several" items here, so really want to get your very favorite bits of
information in. If, for example, there can be *lots* of cma areas, then maybe tracking


At this moment, I want allocation/failure counts for each CMA area, since each
has its own particular usecase, which makes it easy for me to know which module
will be affected. I think per-CMA statistics are very useful as a minimum
code change, so I want to enable it by default under CONFIG_CMA && CONFIG_SYSFS.


by a range of allocation sizes is better...


I take your suggestion to be something like this.

[alloc_range] could be order or range by interval

/sys/kernel/mm/cma/cma-A/[alloc_range]/success
/sys/kernel/mm/cma/cma-A/[alloc_range]/fail
..
..
/sys/kernel/mm/cma/cma-Z/[alloc_range]/success
/sys/kernel/mm/cma/cma-Z/[alloc_range]/fail


Actually, I meant, "ranges instead of cma areas", like this:

/

Understand your point, but it would make it hard to find who was
affected by the failure. That's why I suggested to have your
suggestion under an additional config, since a per-CMA metric with
simple success/failure is enough.





I agree it would be also useful but I'd like to enable it under
CONFIG_CMA_SYSFS_ALLOC_RANGE as separate patchset.



I will stop harassing you very soon, just want to bottom out on
understanding the real goals first. :)



I hope my example makes the goal more clear for you.



Yes it does. Based on the (rather surprising) use of cma-area-per-device,
it seems clear that you will need this, so I'll drop my objections to
putting it in sysfs.

I still think the "number of allocation failures" needs refining, probably
via a range-based thing, as we've discussed. But the number of pages
failed per cma looks OK now.



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-05 Thread John Hubbard

On 2/5/21 1:52 PM, Suren Baghdasaryan wrote:

I take your suggestion to be something like this.

[alloc_range] could be order or range by interval

/sys/kernel/mm/cma/cma-A/[alloc_range]/success
/sys/kernel/mm/cma/cma-A/[alloc_range]/fail
..
..
/sys/kernel/mm/cma/cma-Z/[alloc_range]/success
/sys/kernel/mm/cma/cma-Z/[alloc_range]/fail


The interface above seems to me the most useful actually, if by
[alloc_range] you mean the different allocation orders. This would
cover Minchan's per-CMA failure tracking and would also allow us to
understand what kind of allocations are failing and therefore if the
problem is caused by pinning/fragmentation or by over-utilization.



I agree. That seems about right, now that we've established that
cma areas are a must-have.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-05 Thread John Hubbard

On 2/5/21 8:15 AM, Minchan Kim wrote:
...

Yes, approximately. I was wondering if this would suffice at least as a 
baseline:

cma_alloc_success   125
cma_alloc_failure   25


IMO, regardless of my patch, it would be good to have such statistics,
in that CMA was born to replace carved-out memory with dynamic allocation,
ideally for memory efficiency, so failure should be regarded as critical
so the admin could notice how the system is hurt.


Right. So CMA failures are useful for the admin to see, understood.



Anyway, it's not enough for me and orthogonal with my goal.



OK. But...what *is* your goal, and why is this useless (that's what
orthogonal really means here) for your goal?

Also, would you be willing to try out something simple first,
such as providing indication that cma is active and it's overall success
rate, like this:

/proc/vmstat:

cma_alloc_success   125
cma_alloc_failure   25

...or is the only way to provide the more detailed items, complete with
per-CMA details, in a non-debugfs location?




...and then, to see if more is needed, some questions:

a)  Do you know of an upper bound on how many cma areas there can be
(I think Matthew also asked that)?


There is no upper bound since it's configurable.



OK, thanks, so that pretty much rules out putting per-cma details into
anything other than a directory or something like it.



b) Is tracking the cma area really as valuable as other possibilities? We can put
"a few" to "several" items here, so really want to get your very favorite bits of
information in. If, for example, there can be *lots* of cma areas, then maybe tracking


At this moment, I want allocation/failure counts for each CMA area, since each
has its own particular usecase, which makes it easy for me to know which module
will be affected. I think per-CMA statistics are very useful as a minimum
code change, so I want to enable it by default under CONFIG_CMA && CONFIG_SYSFS.


by a range of allocation sizes is better...


I take your suggestion to be something like this.

[alloc_range] could be order or range by interval

/sys/kernel/mm/cma/cma-A/[alloc_range]/success
/sys/kernel/mm/cma/cma-A/[alloc_range]/fail
..
..
/sys/kernel/mm/cma/cma-Z/[alloc_range]/success
/sys/kernel/mm/cma/cma-Z/[alloc_range]/fail


Actually, I meant, "ranges instead of cma areas", like this:

/

I agree it would be also useful but I'd like to enable it under
CONFIG_CMA_SYSFS_ALLOC_RANGE as separate patchset.



I will stop harassing you very soon, just want to bottom out on
understanding the real goals first. :)

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] selftests/vm: rename file run_vmtests to run_vmtests.sh

2021-02-05 Thread John Hubbard

On 2/5/21 12:55 AM, Rong Chen wrote:

Commit c2aa8afc36fa has renamed run_vmtests in Makefile,
but the file still uses the old name.

The kernel test robot reported the following issue:

  # selftests: vm: run_vmtests.sh
  # Warning: file run_vmtests.sh is missing!
  not ok 1 selftests: vm: run_vmtests.sh



I don't know exactly what is going on here, but there was originally a mistake
on my part in renaming run_vmtests to run_vmtests.sh (I was trying to set
the executable bit, which is not always supported by the patch flow), and that 
caused some churn in Andrew's tree. And so maybe that rename got lost/dropped
along the way.


Reported-by: kernel test robot 
Fixes: c2aa8afc36fa (selftests/vm: rename run_vmtests --> run_vmtests.sh)
Signed-off-by: Rong Chen 
---
  tools/testing/selftests/vm/{run_vmtests => run_vmtests.sh} | 0
  1 file changed, 0 insertions(+), 0 deletions(-)
  rename tools/testing/selftests/vm/{run_vmtests => run_vmtests.sh} (100%)

diff --git a/tools/testing/selftests/vm/run_vmtests 
b/tools/testing/selftests/vm/run_vmtests.sh
similarity index 100%
rename from tools/testing/selftests/vm/run_vmtests
rename to tools/testing/selftests/vm/run_vmtests.sh



So I guess this is OK, given that I see "run_vmtests" in both -next
and main.

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2 1/4] mm/gup: add compound page list iterator

2021-02-05 Thread John Hubbard

On 2/5/21 2:46 AM, Joao Martins wrote:
...>> If instead you keep npages constant like it naturally wants to be, you 
could

just do a "(*ntails)++" in the loop, to take care of *ntails.


I didn't do it as such as I would need to deref @ntails per iteration, so
it felt more efficient to do as above. On a second thought, I could 
alternatively do the
following below, thoughts?

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..8defe4f670d5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
  }
  EXPORT_SYMBOL(unpin_user_page);

+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   struct page *page;
+   unsigned int nr;
+
+   if (i >= npages)
+   return;
+
+   page = compound_head(list[i]);
+   for (nr = i + 1; nr < npages; nr++) {
+   if (compound_head(list[nr]) != page)
+   break;
+   }
+
+   *head = page;
+   *ntails = nr - i;
+}
+


Yes, this is cleaner and quite a bit easier to verify that it is correct.




However, given that the patch is correct and works as-is, the above is really just
an optional idea, so please feel free to add:

Reviewed-by: John Hubbard 



Thanks!

Hopefully I can retain that if the snippet above is preferred?

Joao



Yes. Still looks good.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread John Hubbard

On 2/4/21 10:24 PM, Minchan Kim wrote:

On Thu, Feb 04, 2021 at 09:49:54PM -0800, John Hubbard wrote:

On 2/4/21 9:17 PM, Minchan Kim wrote:

...

# cat vmstat | grep -i cma
nr_free_cma 261718

# cat meminfo | grep -i cma
CmaTotal:1048576 kB
CmaFree: 1046872 kB

OK, given that CMA is already in those two locations, maybe we should put
this information in one or both of those, yes?


Do you suggest something liks this, for example?


cat vmstat | grep -i cma
cma_a_success   125
cma_a_fail  25
cma_b_success   130
cma_b_fail  156
..
cma_f_fail  xxx



Yes, approximately. I was wondering if this would suffice at least as a 
baseline:

cma_alloc_success   125
cma_alloc_failure   25

...and then, to see if more is needed, some questions:

a)  Do you know of an upper bound on how many cma areas there can be
(I think Matthew also asked that)?

b) Is tracking the cma area really as valuable as other possibilities? We can put
"a few" to "several" items here, so really want to get your very favorite bits of
information in. If, for example, there can be *lots* of cma areas, then maybe
tracking by a range of allocation sizes is better...
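
To make the "range of allocation sizes" idea a bit more concrete, a rough
sketch (all of the names below are made up for illustration, nothing from the
patch):

/* Bucket CMA allocation results by allocation order. */
#define CMA_ORDER_BUCKETS	11	/* orders 0..9, plus one ">= 10" bucket */

struct cma_order_stat {
	atomic64_t success[CMA_ORDER_BUCKETS];
	atomic64_t fail[CMA_ORDER_BUCKETS];
};

static void cma_order_stat_record(struct cma_order_stat *stat,
				  unsigned int order, bool ok)
{
	unsigned int bucket = min_t(unsigned int, order, CMA_ORDER_BUCKETS - 1);

	if (ok)
		atomic64_inc(&stat->success[bucket]);
	else
		atomic64_inc(&stat->fail[bucket]);
}

Each bucket could then be one sysfs file (one value per file), which would show
whether the failures are a few huge allocations or lots of small ones.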


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread John Hubbard

On 2/4/21 9:17 PM, Minchan Kim wrote:
...

Presumably, having the source code, you can easily deduce that a bluetooth
allocation failure goes directly to a CMA allocation failure, right?


Still wondering about this...


It would work if we had full source code and the stacks were not complicated for
every usecase. Having said that, having a good central place that pops up
automatically is also beneficial, so that we don't add similar statistics at each
call site.

Why do we have so many items in slab sysfs, instead of each call
site inventing its own?



I'm not sure I understand that question fully, but I don't think we need to
invent anything unique here. So far we've discussed debugfs, sysfs, and /proc,
none of which are new mechanisms.

...


It's actually easier to monitor one or two simpler items than it is to monitor
a larger number of complicated items. And I get the impression that this is
sort of a top-level, production software indicator.


Let me clarify one more time.

What I'd like to get ultimately is per-CMA statistics instead of
global vmstat for the usecase at this moment. Global vmstat
could help the decision whether I should go deeper but it ends up
needing per-CMA statistics. And I'd like to keep them in sysfs,
not debugfs since it should be stable as a telemetric.

What points do you disagree in this view?



No huge disagreements, I just want to get us down to the true essential elements
of what is required--and find a good home for the data. Initial debugging always
has excesses, and those should not end up in the more carefully vetted 
production
code.

If I were doing this, I'd probably consider HugeTLB pages as an example to 
follow,
because they have a lot in common with CMA: it's another memory allocation 
pool, and
people also want to monitor it.

HugeTLB pages and THP pages are monitored in /proc:
/proc/meminfo and /proc/vmstat:

# cat meminfo |grep -i huge
AnonHugePages: 88064 kB
ShmemHugePages:0 kB
FileHugePages: 0 kB
HugePages_Total: 500
HugePages_Free:  500
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
Hugetlb: 1024000 kB

# cat vmstat | grep -i huge
nr_shmem_hugepages 0
nr_file_hugepages 0
nr_anon_transparent_hugepages 43
numa_huge_pte_updates 0

...aha, so is CMA:

# cat vmstat | grep -i cma
nr_free_cma 261718

# cat meminfo | grep -i cma
CmaTotal:1048576 kB
CmaFree: 1046872 kB

OK, given that CMA is already in those two locations, maybe we should put
this information in one or both of those, yes?


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-04 Thread John Hubbard
leased.



Didn't spot any actual problems with how this works.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2 1/4] mm/gup: add compound page list iterator

2021-02-04 Thread John Hubbard

On 2/4/21 12:24 PM, Joao Martins wrote:

Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head than the current. This is going to be used by
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
  mm/gup.c | 29 +
  1 file changed, 29 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..d1549c61c2f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
  }
  EXPORT_SYMBOL(unpin_user_page);
  
+static inline void compound_next(unsigned long i, unsigned long npages,

+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   struct page *page;
+   unsigned int nr;
+
+   if (i >= npages)
+   return;
+
+   list += i;
+   npages -= i;


It is worth noting that this is slightly more complex to read than it needs to be.
You are changing both endpoints of a loop at once. That's hard to read for a human.
And you're only doing it in order to gain the small benefit of being able to
use nr directly at the end of the routine.

If instead you keep npages constant like it naturally wants to be, you could
just do a "(*ntails)++" in the loop, to take care of *ntails.
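
Roughly this shape, in other words (an untested sketch, keeping your names):

static inline void compound_next(unsigned long i, unsigned long npages,
				 struct page **list, struct page **head,
				 unsigned int *ntails)
{
	struct page *page;
	unsigned long nr;

	if (i >= npages)
		return;

	page = compound_head(list[i]);
	*head = page;
	*ntails = 1;

	/* npages and list stay untouched; only the tail count accumulates. */
	for (nr = i + 1; nr < npages; nr++) {
		if (compound_head(list[nr]) != page)
			break;
		(*ntails)++;
	}
}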

However, given that the patch is correct and works as-is, the above is really just
an optional idea, so please feel free to add:

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA


+   page = compound_head(*list);
+
+   for (nr = 1; nr < npages; nr++) {
+   if (compound_head(list[nr]) != page)
+   break;
+   }
+
+   *head = page;
+   *ntails = nr;
+}
+
+#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)))
+
  /**
   * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
   * @pages:  array of pages to be maybe marked dirty, and definitely released.





Re: [PATCH] xfs: fix unused variable build warning in xfs_log.c

2021-02-04 Thread John Hubbard

On 2/4/21 7:30 PM, Darrick J. Wong wrote:

On Thu, Feb 04, 2021 at 07:18:14PM -0800, John Hubbard wrote:

Delete the unused "log" variable in xfs_log_cover().

Fixes: 303591a0a9473 ("xfs: cover the log during log quiesce")
Cc: Brian Foster 
Cc: Christoph Hellwig 
Cc: Darrick J. Wong 
Cc: Allison Henderson 
Signed-off-by: John Hubbard 
---
Hi,

I just ran into this on today's linux-next, so here you go!


Thanks for the tipoff, I just realized with horror that I got the git
push wrong and never actually updated xfs-linux.git#for-next.  This (and
all the other gcc warnings) are fixed in "xfs-for-next" which ... is not
for-next.

Sigh.  so much for trying to get things in for testing. :(


Well, if it's any consolation, this is the *only* warning that fired during
my particular build, in linux-next. :)


thanks,
--
John Hubbard
NVIDIA


[PATCH] xfs: fix unused variable build warning in xfs_log.c

2021-02-04 Thread John Hubbard
Delete the unused "log" variable in xfs_log_cover().

Fixes: 303591a0a9473 ("xfs: cover the log during log quiesce")
Cc: Brian Foster 
Cc: Christoph Hellwig 
Cc: Darrick J. Wong 
Cc: Allison Henderson 
Signed-off-by: John Hubbard 
---
Hi,

I just ran into this on today's linux-next, so here you go!

thanks,
John Hubbard
NVIDIA

 fs/xfs/xfs_log.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 58699881c100..5a9cca3f7cbf 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1108,7 +1108,6 @@ static int
 xfs_log_cover(
	struct xfs_mount	*mp)
 {
-   struct xlog *log = mp->m_log;
int error = 0;
	bool			need_covered;
 

base-commit: 0e2c50f40b7ffb73a039157f7c38495c6d99e86f
-- 
2.30.0



Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread John Hubbard

On 2/4/21 5:44 PM, Minchan Kim wrote:

On Thu, Feb 04, 2021 at 04:24:20PM -0800, John Hubbard wrote:

On 2/4/21 4:12 PM, Minchan Kim wrote:
...

Then, how to know how often CMA API failed?


Why would you even need to know that, *in addition* to knowing specific
page allocation numbers that failed? Again, there is no real-world motivation
cited yet, just "this is good data". Need more stories and support here.


Let me give an example.

Let's assume we use memory buffer allocation via CMA for bluetooth
enable of  device.
If the user clicks the bluetooth button in the phone but the allocation of
the memory from CMA fails, the user will still see the bluetooth button gray.
The user would think his touch was not powerful enough, so he tries clicking
again, and fortunately the CMA allocation is successful this time and
he will see the bluetooth button enabled and can listen to the music.

Here, product team needs to monitor how often CMA alloc failed so
if the failure ratio is steadily increased than the bar,
it means engineers need to go investigation.

Make sense?



Yes, except that it raises more questions:

1) Isn't this just standard allocation failure? Don't you already have a way
to track that?

Presumably, having the source code, you can easily deduce that a bluetooth
allocation failure goes directly to a CMA allocation failure, right?


Still wondering about this...



Anyway, even though the above is still a little murky, I expect you're right
that it's good to have *some* indication, somewhere about CMA behavior...

Thinking about this some more, I wonder if this is really /proc/vmstat sort
of data that we're talking about. It seems to fit right in there, yes?


Thing is, there are multiple CMA instances, cma-A, cma-B, cma-C, and each CMA
heap has its own specific scenario. /proc/vmstat could be bloated a lot
as the number of CMA instances increases.



Yes, that would not fit in /proc/vmstat...assuming that you really require
knowing--at this point--which CMA heap is involved. And that's worth poking
at. If you get an overall indication in vmstat that CMA is having trouble,
then maybe that's all you need to start digging further.

It's actually easier to monitor one or two simpler items than it is to monitor
a larger number of complicated items. And I get the impression that this is
sort of a top-level, production software indicator.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread John Hubbard

On 2/4/21 4:25 PM, John Hubbard wrote:

On 2/4/21 3:45 PM, Suren Baghdasaryan wrote:
...

2) The overall CMA allocation attempts/failures (first two items above) seem
an odd pair of things to track. Maybe that is what was easy to track, but I'd
vote for just omitting them.


Then, how to know how often CMA API failed?


Why would you even need to know that, *in addition* to knowing specific
page allocation numbers that failed? Again, there is no real-world motivation
cited yet, just "this is good data". Need more stories and support here.


IMHO it would be very useful to see whether there are multiple
small-order allocation failures or a few large-order ones, especially
for CMA where large allocations are not unusual. For that I believe
both alloc_pages_attempt and alloc_pages_fail would be required.


Sorry, I meant to say "both cma_alloc_fail and alloc_pages_fail would
be required".


So if you want to know that, the existing items are still a little too indirect
to really get it right. You can only know the average allocation size, by
dividing. Instead, we should provide the allocation size, for each count.

The limited interface makes this a little awkward, but using zones/ranges could
work: "for this range of allocation sizes, there were the following stats". Or,
some other technique that I haven't thought of (maybe two items per file?) would
be better.

On the other hand, there's an argument for keeping this minimal and simple. That
would probably lead us to putting in a couple of items into /proc/vmstat, as I
just mentioned in my other response, and calling it good.


...and remember: if we keep it nice and minimal and clean, we can put it into
/proc/vmstat and monitor it.

And then if a problem shows up, the more complex and advanced debugging data can
go into debugfs's CMA area. And you're all set.

If Android made up some policy not to use debugfs, then:

a) that probably won't prevent engineers from using it anyway, for advanced debugging,
and

b) If (a) somehow falls short, then we need to talk about what Android's plans are to
fill the need. And "fill up sysfs with debugfs items, possibly duplicating some of them,
and generally making an unnecessary mess, to compensate for not using debugfs" is not
my first choice. :)


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread John Hubbard

On 2/4/21 3:45 PM, Suren Baghdasaryan wrote:
...

2) The overall CMA allocation attempts/failures (first two items above) seem
an odd pair of things to track. Maybe that is what was easy to track, but I'd
vote for just omitting them.


Then, how to know how often CMA API failed?


Why would you even need to know that, *in addition* to knowing specific
page allocation numbers that failed? Again, there is no real-world motivation
cited yet, just "this is good data". Need more stories and support here.


IMHO it would be very useful to see whether there are multiple
small-order allocation failures or a few large-order ones, especially
for CMA where large allocations are not unusual. For that I believe
both alloc_pages_attempt and alloc_pages_fail would be required.


Sorry, I meant to say "both cma_alloc_fail and alloc_pages_fail would
be required".


So if you want to know that, the existing items are still a little too indirect
to really get it right. You can only know the average allocation size, by
dividing. Instead, we should provide the allocation size, for each count.

The limited interface makes this a little awkward, but using zones/ranges could
work: "for this range of allocation sizes, there were the following stats". Or,
some other technique that I haven't thought of (maybe two items per file?) would
be better.

On the other hand, there's an argument for keeping this minimal and simple. That
would probably lead us to putting in a couple of items into /proc/vmstat, as I
just mentioned in my other response, and calling it good.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread John Hubbard

On 2/4/21 4:12 PM, Minchan Kim wrote:
...

Then, how to know how often CMA API failed?


Why would you even need to know that, *in addition* to knowing specific
page allocation numbers that failed? Again, there is no real-world motivation
cited yet, just "this is good data". Need more stories and support here.


Let me give an example.

Let's assume we use memory buffer allocation via CMA for bluetooth
enable of  device.
If the user clicks the bluetooth button in the phone but the allocation of
the memory from CMA fails, the user will still see the bluetooth button gray.
The user would think his touch was not powerful enough, so he tries clicking
again, and fortunately the CMA allocation is successful this time and
he will see the bluetooth button enabled and can listen to the music.

Here, product team needs to monitor how often CMA alloc failed so
if the failure ratio is steadily increased than the bar,
it means engineers need to go investigation.

Make sense?



Yes, except that it raises more questions:

1) Isn't this just standard allocation failure? Don't you already have a way
to track that?

Presumably, having the source code, you can easily deduce that a bluetooth
allocation failure goes directly to a CMA allocation failure, right?

Anyway, even though the above is still a little murky, I expect you're right
that it's good to have *some* indication, somewhere about CMA behavior...

Thinking about this some more, I wonder if this is really /proc/vmstat sort
of data that we're talking about. It seems to fit right in there, yes?
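
A minimal sketch of what the /proc/vmstat route could look like (the event
names and hook placement here are illustrative only, not an actual patch):

/* include/linux/vm_event_item.h: add two events */
enum vm_event_item {
	/* ... existing items ... */
#ifdef CONFIG_CMA
	CMA_ALLOC_SUCCESS,
	CMA_ALLOC_FAIL,
#endif
	NR_VM_EVENT_ITEMS
};

/* mm/cma.c: bump the counters at the end of cma_alloc() */
	if (page)
		count_vm_event(CMA_ALLOC_SUCCESS);
	else
		count_vm_event(CMA_ALLOC_FAIL);

Plus the matching strings in vmstat_text[], and that would be about it for a
baseline.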


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 1/4] mm/gup: add compound page list iterator

2021-02-04 Thread John Hubbard

On 2/4/21 11:53 AM, Jason Gunthorpe wrote:

On Wed, Feb 03, 2021 at 03:00:01PM -0800, John Hubbard wrote:

+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   if (i >= npages)
+   return;
+
+   *ntails = count_ntails(list + i, npages - i);
+   *head = compound_head(list[i]);
+}
+
+#define for_each_compound_head(i, list, npages, head, ntails) \


When using macros, which are dangerous in general, you have to worry about
things like name collisions. I really dislike that C has forced this unsafe
pattern upon us, but of course we are stuck with it, for iterator helpers.

Given that we're stuck, you should probably use names such as __i, __list, etc,
in the the above #define. Otherwise you could stomp on existing variables.


Not this macro, it after cpp gets through with it all the macro names
vanish, it can't collide with variables.



Yes, I guess it does just vaporize, because it turns all the args into
their substituted values. I was just having flashbacks from similar cases
I guess.


The usual worry is you might collide with other #defines, but we don't
seem to worry about that in the kernel



Well, I worry about it a little anyway. haha :)


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread John Hubbard

On 2/4/21 12:07 PM, Minchan Kim wrote:

On Thu, Feb 04, 2021 at 12:50:58AM -0800, John Hubbard wrote:

On 2/3/21 7:50 AM, Minchan Kim wrote:

Since CMA is getting used more widely, it's more important to
keep monitoring CMA statistics for system health since it's
directly related to user experience.

This patch introduces sysfs for the CMA and exposes stats below
to keep monitor for telemetric in the system.

   * the number of CMA allocation attempts
   * the number of CMA allocation failures
   * the number of CMA page allocation attempts
   * the number of CMA page allocation failures


The desire to report CMA data is understandable, but there are a few
odd things here:

1) First of all, this has significant overlap with /sys/kernel/debug/cma
items. I suspect that all of these items could instead go into


At this moment, I don't see any overlap with item from cma_debugfs.
Could you specify what item you are mentioning?


Just the fact that there would be two systems under /sys, both of which are
doing very very similar things: providing information that is intended to
help diagnose CMA.




/sys/kernel/debug/cma, right?


Anyway, thing is I need a stable interface for that and need to use
it in the Android production build, too (unfortunately, Android deprecated
debugfs:
https://source.android.com/setup/start/android-11-release#debugfs
)


That's the closest hint to a "why this is needed" that we've seen yet.
But it's only a hint.



What should be in debugfs and in sysfs? What's the criteria?


Well, it's a gray area. "Debugging support" goes into debugfs, and
"production-level monitoring and control" goes into sysfs, roughly
speaking. And here you have items that could be classified as either.



Some statistics could be considered either a debugging aid or telemetry,
depending on viewpoint and usecase. And here, I want to use them for
telemetry, get a stable interface, and use them in the production build
of Android. In this chance, I'd like to get a concrete guideline on
what should be in sysfs and debugfs, so this thread can be pointed to
whenever folks dump their stats into sysfs, to avoid wasting the time
of others in future. :)



2) The overall CMA allocation attempts/failures (first two items above) seem
an odd pair of things to track. Maybe that is what was easy to track, but I'd
vote for just omitting them.


Then, how to know how often CMA API failed?


Why would you even need to know that, *in addition* to knowing specific
page allocation numbers that failed? Again, there is no real-world motivation
cited yet, just "this is good data". Need more stories and support here.


thanks,
--
John Hubbard
NVIDIA


There are allocation requests of various sizes for a CMA area, so the page
allocation stats alone are not enough to know that.



Signed-off-by: Minchan Kim 
---
   Documentation/ABI/testing/sysfs-kernel-mm-cma |  39 +
   include/linux/cma.h   |   1 +
   mm/Makefile   |   1 +
   mm/cma.c  |   6 +-
   mm/cma.h  |  20 +++
   mm/cma_sysfs.c| 143 ++
   6 files changed, 209 insertions(+), 1 deletion(-)
   create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-cma
   create mode 100644 mm/cma_sysfs.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma 
b/Documentation/ABI/testing/sysfs-kernel-mm-cma
new file mode 100644
index ..2a43c0aacc39
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma
@@ -0,0 +1,39 @@
+What:  /sys/kernel/mm/cma/
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   /sys/kernel/mm/cma/ contains a number of subdirectories by
+   cma-heap name. The subdirectory contains a number of files
+   to represent cma allocation statistics.


Somewhere, maybe here, there should be a mention of the closely related
/sys/kernel/debug/cma files.


+
+   There are number of files under
+   /sys/kernel/mm/cma/ directory
+
+   - cma_alloc_attempt
+   - cma_alloc_fail


Are these really useful? They are a summary of the alloc_pages items, really.


+   - alloc_pages_attempt
+   - alloc_pages_fail


This should also have "cma" in the name, really: cma_alloc_pages_*.


No problem.




+
+What:  /sys/kernel/mm/cma//cma_alloc_attempt
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of cma_alloc API attempted
+
+What:  /sys/kernel/mm/cma//cma_alloc_fail
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of CMA_alloc API failed
+
+What:  /sys/kernel/mm/cma//alloc_pages_attempt
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of page

Re: [PATCH] mm: cma: support sysfs

2021-02-04 Thread John Hubbard

On 2/3/21 7:50 AM, Minchan Kim wrote:

Since CMA is getting used more widely, it's more important to
keep monitoring CMA statistics for system health since it's
directly related to user experience.

This patch introduces sysfs for the CMA and exposes stats below
to keep monitor for telemetric in the system.

  * the number of CMA allocation attempts
  * the number of CMA allocation failures
  * the number of CMA page allocation attempts
  * the number of CMA page allocation failures


The desire to report CMA data is understandable, but there are a few
odd things here:

1) First of all, this has significant overlap with /sys/kernel/debug/cma
items. I suspect that all of these items could instead go into
/sys/kernel/debug/cma, right?

2) The overall CMA allocation attempts/failures (first two items above) seem
an odd pair of things to track. Maybe that is what was easy to track, but I'd
vote for just omitting them.


Signed-off-by: Minchan Kim 
---
  Documentation/ABI/testing/sysfs-kernel-mm-cma |  39 +
  include/linux/cma.h   |   1 +
  mm/Makefile   |   1 +
  mm/cma.c  |   6 +-
  mm/cma.h  |  20 +++
  mm/cma_sysfs.c| 143 ++
  6 files changed, 209 insertions(+), 1 deletion(-)
  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-cma
  create mode 100644 mm/cma_sysfs.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma 
b/Documentation/ABI/testing/sysfs-kernel-mm-cma
new file mode 100644
index ..2a43c0aacc39
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma
@@ -0,0 +1,39 @@
+What:  /sys/kernel/mm/cma/
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   /sys/kernel/mm/cma/ contains a number of subdirectories by
+   cma-heap name. The subdirectory contains a number of files
+   to represent cma allocation statistics.


Somewhere, maybe here, there should be a mention of the closely related
/sys/kernel/debug/cma files.


+
+   There are number of files under
+   /sys/kernel/mm/cma/ directory
+
+   - cma_alloc_attempt
+   - cma_alloc_fail


Are these really useful? They are a summary of the alloc_pages items, really.


+   - alloc_pages_attempt
+   - alloc_pages_fail


This should also have "cma" in the name, really: cma_alloc_pages_*.


+
+What:  /sys/kernel/mm/cma//cma_alloc_attempt
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of cma_alloc API attempted
+
+What:  /sys/kernel/mm/cma//cma_alloc_fail
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of CMA_alloc API failed
+
+What:  /sys/kernel/mm/cma//alloc_pages_attempt
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of pages CMA API tried to allocate
+
+What:  /sys/kernel/mm/cma//alloc_pages_fail
+Date:  Feb 2021
+Contact:   Minchan Kim 
+Description:
+   the number of pages CMA API failed to allocate
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 217999c8a762..71a28a5bb54e 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -49,4 +49,5 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
  extern bool cma_release(struct cma *cma, const struct page *pages, unsigned 
int count);
  
  extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);

+


A single additional blank line seems to be the only change to this file. :)

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()

2021-02-03 Thread John Hubbard

On 2/3/21 2:00 PM, Joao Martins wrote:

Use the newly added unpin_user_page_range_dirty_lock()
for more quickly unpinning a consecutive range of pages
represented as compound pages. This will also calculate
number of pages to unpin (for the tail pages which matching
head page) and thus batch the refcount update.

Running a test program which calls mr reg/unreg on a 1G in size
and measures cost of both operations together (in a guest using rxe)
with THP and hugetlbfs:


In the patch subject line:

   s/__ib_mem_release/__ib_umem_release/



Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round

After:
2631 rounds in 5.001 sec: 1900.618 usec / round
31625 rounds in 60.001 sec: 1897.267 usec / round

Signed-off-by: Joao Martins 
---
  drivers/infiniband/core/umem.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2dde99a9ba07..ea4ebb3261d9 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -47,17 +47,17 @@
  
  static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)

  {
-   struct sg_page_iter sg_iter;
-   struct page *page;
+   bool make_dirty = umem->writable && dirty;
+   struct scatterlist *sg;
+   int i;


Maybe unsigned int is better, so as to perfectly match the scatterlist.length.

  
  	if (umem->nmap > 0)

ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
DMA_BIDIRECTIONAL);
  
-	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
-   page = sg_page_iter_page(&sg_iter);
-   unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
-   }
+   for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)


The change from umem->sg_nents to umem->nmap looks OK, although we should get
IB people to verify that there is not some odd bug or reason to leave it as is.


+   unpin_user_page_range_dirty_lock(sg_page(sg),
+   DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);


Is it really OK to refer directly to sg->length? The scatterlist library goes
to some effort to avoid having callers directly access the struct member 
variables.

Actually, the for_each_sg() code and its behavior with sg->length and 
sg_page(sg)
confuses me because I'm new to it, and I don't quite understand how this works.
Especially with SG_CHAIN. I'm assuming that you've monitored /proc/vmstat for
nr_foll_pin* ?
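
For what it's worth, my current mental model (please correct me if it's wrong)
is that for_each_sg() is essentially:

/* From include/linux/scatterlist.h, approximately: */
#define for_each_sg(sglist, sg, nr, __i)	\
	for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))

...where sg_next() transparently follows chain entries, so the loop only ever
sees real entries, each describing one physically contiguous chunk via
sg_page(sg) and sg->length. Whether that all still holds when the count is
umem->nmap rather than umem->sg_nents is part of what I'd like confirmed.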

  
sg_free_table(&umem->sg_head);

  }



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-03 Thread John Hubbard

On 2/3/21 2:00 PM, Joao Martins wrote:
...

+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+   unsigned long index;
+   struct page *head;
+   unsigned int ntails;
+
+   for_each_compound_range(index, &page, npages, head, ntails) {
+   if (make_dirty && !PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
+   }
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+


Also, looking at this again while trying to verify the sg diffs in patch #4, I 
noticed
that the name is tricky. Usually a "range" would not have a single struct page* 
as the
argument. So in this case, perhaps a comment above the function would help, 
something
approximately like this:

/*
 * The "range" in the function name refers to the fact that the page may be a
 * compound page. If so, then that compound page will be treated as one or more
 * ranges of contiguous tail pages.
 */

...I guess. Or maybe the name is just wrong (a comment block explaining a name 
is
always a bad sign). Perhaps we should rename it to something like:

unpin_user_compound_page_dirty_lock()

?

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-03 Thread John Hubbard

On 2/3/21 2:00 PM, Joao Martins wrote:

Add a unpin_user_page_range() API which takes a starting page
and how many consecutive pages we want to dirty.

Given that we won't be iterating on a list of changes, change
compound_next() to receive a bool, whether to calculate from the starting
page, or walk the page array. Finally add a separate iterator,


A bool arg is sometimes, but not always, a hint that you really just want
a separate set of routines. Below...


for_each_compound_range() that just operate in page ranges as opposed
to page array.

For users (like RDMA mr_dereg) where each sg represents a
contiguous set of pages, we're able to more efficiently unpin
pages without having to supply an array of pages much of what
happens today with unpin_user_pages().

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
  include/linux/mm.h |  2 ++
  mm/gup.c   | 48 ++
  2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a608feb0d42e..b76063f7f18a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
  void unpin_user_page(struct page *page);
  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 bool make_dirty);
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty);
  void unpin_user_pages(struct page **pages, unsigned long npages);
  
  /**

diff --git a/mm/gup.c b/mm/gup.c
index 971a24b4b73f..1b57355d5033 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,11 +215,16 @@ void unpin_user_page(struct page *page)
  }
  EXPORT_SYMBOL(unpin_user_page);
  
-static inline unsigned int count_ntails(struct page **pages, unsigned long npages)

+static inline unsigned int count_ntails(struct page **pages,
+   unsigned long npages, bool range)
  {
-   struct page *head = compound_head(pages[0]);
+   struct page *page = pages[0], *head = compound_head(page);
unsigned int ntails;
  
+	if (range)

+   return (!PageCompound(head) || compound_order(head) <= 1) ? 1 :
+  min_t(unsigned int, (head + compound_nr(head) - page), 
npages);


Here, you clearly should use a separate set of _range routines. Because you're 
basically
creating two different routines here! Keep it simple.

Once you're in a separate routine, you might feel more comfortable expanding 
that to
a more readable form, too:

if (!PageCompound(head) || compound_order(head) <= 1)
return 1;

return min_t(unsigned int, (head + compound_nr(head) - page), npages);
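
In other words, something along these lines (a sketch only, keeping your
naming convention):

/* Range variant: @page points into one physically contiguous chunk. */
static inline unsigned int count_ntails_range(struct page *page,
					      unsigned long npages)
{
	struct page *head = compound_head(page);

	if (!PageCompound(head) || compound_order(head) <= 1)
		return 1;

	return min_t(unsigned int, (head + compound_nr(head) - page), npages);
}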


thanks,
--
John Hubbard
NVIDIA


+
for (ntails = 1; ntails < npages; ntails++) {
if (compound_head(pages[ntails]) != head)
break;
@@ -229,20 +234,32 @@ static inline unsigned int count_ntails(struct page 
**pages, unsigned long npage
  }
  
  static inline void compound_next(unsigned long i, unsigned long npages,

-struct page **list, struct page **head,
-unsigned int *ntails)
+struct page **list, bool range,
+struct page **head, unsigned int *ntails)
  {
+   struct page *p, **next = &p;
+
if (i >= npages)
return;
  
-	*ntails = count_ntails(list + i, npages - i);

-   *head = compound_head(list[i]);
+   if (range)
+   *next = *list + i;
+   else
+   next = list + i;
+
+   *ntails = count_ntails(next, npages - i, range);
+   *head = compound_head(*next);
  }
  
+#define for_each_compound_range(i, list, npages, head, ntails) \

+   for (i = 0, compound_next(i, npages, list, true, &head, &ntails); \
+i < npages; i += ntails, \
+compound_next(i, npages, list, true, &head, &ntails))
+
  #define for_each_compound_head(i, list, npages, head, ntails) \
-   for (i = 0, compound_next(i, npages, list, &head, &ntails); \
+   for (i = 0, compound_next(i, npages, list, false, &head, &ntails); \
 i < npages; i += ntails, \
-compound_next(i, npages, list, &head, &ntails))
+compound_next(i, npages, list, false, &head, &ntails))
  
  /**

   * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
@@ -306,6 +323,21 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
  }
  EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
  
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,

+ bool make_dirty)
+{
+   unsigned long index;
+   struct page *head;
+   unsigned int ntails;
+
+   for_each_compound_range(index, &page, npages, head, ntails) {
+   if (make_dirty && !PageDirty(head))
+

Re: [PATCH 2/4] mm/gup: decrement head page once for group of subpages

2021-02-03 Thread John Hubbard

On 2/3/21 2:00 PM, Joao Martins wrote:

Rather than decrementing the head page refcount one by one, we
walk the page array and checking which belong to the same
compound_head. Later on we decrement the calculated amount
of references in a single write to the head page. To that
end switch to for_each_compound_head() does most of the work.

set_page_dirty() needs no adjustment as it's a nop for
non-dirty head pages and it doesn't operate on tail pages.

This considerably improves unpinning of pages with THP and
hugetlbfs:

- THP
gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us

- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us

Signed-off-by: Joao Martins 
---
  mm/gup.c | 29 +++--
  1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 4f88dcef39f2..971a24b4b73f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -270,20 +270,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 bool make_dirty)
  {
unsigned long index;
-
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/


Great to see this TODO (and the related one below) finally done!

Everything looks correct here.

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA


+   struct page *head;
+   unsigned int ntails;
  
  	if (!make_dirty) {

unpin_user_pages(pages, npages);
return;
}
  
-	for (index = 0; index < npages; index++) {

-   struct page *page = compound_head(pages[index]);
+   for_each_compound_head(index, pages, npages, head, ntails) {
/*
 * Checking PageDirty at this point may race with
 * clear_page_dirty_for_io(), but that's OK. Two key
@@ -304,9 +299,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 * written back, so it gets written back again in the
 * next writeback cycle. This is harmless.
 */
-   if (!PageDirty(page))
-   set_page_dirty_lock(page);
-   unpin_user_page(page);
+   if (!PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
}
  }
  EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -323,6 +318,8 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
  void unpin_user_pages(struct page **pages, unsigned long npages)
  {
unsigned long index;
+   struct page *head;
+   unsigned int ntails;
  
  	/*

 * If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -331,13 +328,9 @@ void unpin_user_pages(struct page **pages, unsigned long 
npages)
 */
if (WARN_ON(IS_ERR_VALUE(npages)))
return;
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
-   for (index = 0; index < npages; index++)
-   unpin_user_page(pages[index]);
+
+   for_each_compound_head(index, pages, npages, head, ntails)
+   put_compound_head(head, ntails, FOLL_PIN);
  }
  EXPORT_SYMBOL(unpin_user_pages);
  





Re: [PATCH 1/4] mm/gup: add compound page list iterator

2021-02-03 Thread John Hubbard

On 2/3/21 2:00 PM, Joao Martins wrote:

Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head than the current one. This is going to be used by the
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
  mm/gup.c | 29 +
  1 file changed, 29 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..4f88dcef39f2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
  }
  EXPORT_SYMBOL(unpin_user_page);
  
+static inline unsigned int count_ntails(struct page **pages, unsigned long npages)


Silly naming nit: could we please name this function count_pagetails()? 
count_ntails
is a bit redundant, plus slightly less clear.


+{
+   struct page *head = compound_head(pages[0]);
+   unsigned int ntails;
+
+   for (ntails = 1; ntails < npages; ntails++) {
+   if (compound_head(pages[ntails]) != head)
+   break;
+   }
+
+   return ntails;
+}
+
+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   if (i >= npages)
+   return;
+
+   *ntails = count_ntails(list + i, npages - i);
+   *head = compound_head(list[i]);
+}
+
+#define for_each_compound_head(i, list, npages, head, ntails) \


When using macros, which are dangerous in general, you have to worry about
things like name collisions. I really dislike that C has forced this unsafe
pattern upon us, but of course we are stuck with it, for iterator helpers.

Given that we're stuck, you should probably use names such as __i, __list, etc,
in the above #define. Otherwise you could stomp on existing variables.
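For example, a minimal sketch of that (completely untested, just to
illustrate the naming; the double-underscore parameter names are the only
change):

#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
	for (__i = 0, compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
	     __i < __npages; __i += __ntails, \
	     compound_next(__i, __npages, __list, &(__head), &(__ntails)))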


+   for (i = 0, compound_next(i, npages, list, &head, &ntails); \
+i < npages; i += ntails, \
+compound_next(i, npages, list, &head, &ntails))
+
  /**
   * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
   * @pages:  array of pages to be maybe marked dirty, and definitely released.



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2 net-next 3/4] net: introduce common dev_page_is_reserved()

2021-01-30 Thread John Hubbard



On 1/30/21 11:45 AM, Alexander Lobakin wrote:

From: Jakub Kicinski 
Date: Sat, 30 Jan 2021 11:07:07 -0800


On Sat, 30 Jan 2021 15:42:29 + Alexander Lobakin wrote:

On Wed, 27 Jan 2021 20:11:23 + Alexander Lobakin wrote:

+ * dev_page_is_reserved - check whether a page can be reused for network Rx
+ * @page: the page to test
+ *
+ * A page shouldn't be considered for reusing/recycling if it was allocated
+ * under memory pressure or at a distant memory node.
+ *
+ * Returns true if this page should be returned to page allocator, false
+ * otherwise.
+ */
+static inline bool dev_page_is_reserved(const struct page *page)


Am I the only one who feels like "reusable" is a better term than
"reserved".


I thought about it, but this will need to inverse the conditions in
most of the drivers. I decided to keep it as it is.
I can redo if "reusable" is preferred.


Naming is hard. As long as the condition is not a double negative it
reads fine to me, but that's probably personal preference.
The thing that doesn't sit well is the fact that there is nothing
"reserved" about a page from another NUMA node.. But again, if nobody
+1s this it's whatever...


Agree on NUMA and naming. I'm a bit surprised that 95% of drivers
have this helper called "reserved" (one of the reasons why I finished
with this variant).
Let's say, if anybody else will vote for "reusable", I'll pick it for
v3.


Definitely "reusable" seems better to me, and especially anything *other*
than "reserved" is a good idea, IMHO.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages

2021-01-24 Thread John Hubbard

On 1/24/21 3:18 PM, John Hubbard wrote:

On 1/21/21 7:37 PM, Pavel Tatashin wrote:

When pages are pinned they can be faulted in userland and migrated, and
they can be faulted right in kernel without migration.

In either case, the pinned pages must end-up being pinnable (not movable).

Add a new test to gup_test, to help verify that the gup/pup
(get_user_pages() / pin_user_pages()) behavior with respect to pinnable
and movable pages is reasonable and correct. Specifically, provide a
way to:

1) Verify that only "pinnable" pages are pinned. This is checked
automatically for you.

2) Verify that gup/pup performance is reasonable. This requires
comparing benchmarks between doing gup/pup on pages that have been
pre-faulted in from user space, vs. doing gup/pup on pages that are not
faulted in until gup/pup time (via FOLL_TOUCH). This decision is
controlled with the new -z command line option.

Signed-off-by: Pavel Tatashin 
---
  mm/gup_test.c |  6 ++
  tools/testing/selftests/vm/gup_test.c | 23 +++
  2 files changed, 25 insertions(+), 4 deletions(-)



This also looks good. I do see the WARN_ON_ONCE firing in
internal_get_user_pages_fast(), when running with *only* the new -z
option.

I'll poke around the rest of the patchset and see if that is expected
and normal, but either way the test code itself looks correct and seems


The warning that is firing in internal_get_user_pages_fast() is:

if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
   FOLL_FORCE | FOLL_PIN | FOLL_GET |
   FOLL_FAST_ONLY)))
return -EINVAL;

...OK, so this is because "./gup_test -z" invokes get_user_pages_fast(),
which so far does not allow passing in FOLL_TOUCH. Probably because there
is nothing "fast" about touching and possibly faulting in pages. :)

So, again, the test code still looks correct, even though it's possible
to pass in options that run into things that are rejected by gup.c



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages

2021-01-24 Thread John Hubbard

On 1/21/21 7:37 PM, Pavel Tatashin wrote:

When pages are pinned they can be faulted in userland and migrated, and
they can be faulted right in kernel without migration.

In either case, the pinned pages must end-up being pinnable (not movable).

Add a new test to gup_test, to help verify that the gup/pup
(get_user_pages() / pin_user_pages()) behavior with respect to pinnable
and movable pages is reasonable and correct. Specifically, provide a
way to:

1) Verify that only "pinnable" pages are pinned. This is checked
automatically for you.

2) Verify that gup/pup performance is reasonable. This requires
comparing benchmarks between doing gup/pup on pages that have been
pre-faulted in from user space, vs. doing gup/pup on pages that are not
faulted in until gup/pup time (via FOLL_TOUCH). This decision is
controlled with the new -z command line option.

Signed-off-by: Pavel Tatashin 
---
  mm/gup_test.c |  6 ++
  tools/testing/selftests/vm/gup_test.c | 23 +++
  2 files changed, 25 insertions(+), 4 deletions(-)



This also looks good. I do see the WARN_ON_ONCE firing in
internal_get_user_pages_fast(), when running with *only* the new -z
option.

I'll poke around the rest of the patchset and see if that is expected
and normal, but either way the test code itself looks correct and seems
to be passing my set of "run a bunch of different gup_test options" here,
so feel free to add:

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


diff --git a/mm/gup_test.c b/mm/gup_test.c
index a6ed1c877679..d974dec19e1c 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page 
**pages,
  
  dump_page(page, "gup_test failure");

break;
+   } else if (cmd == PIN_LONGTERM_BENCHMARK &&
+   WARN(!is_pinnable_page(page),
+"pages[%lu] is NOT pinnable but pinned\n",
+i)) {
+   dump_page(page, "gup_test failure");
+   break;
}
}
break;
diff --git a/tools/testing/selftests/vm/gup_test.c 
b/tools/testing/selftests/vm/gup_test.c
index 943cc2608dc2..1e662d59c502 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -13,6 +13,7 @@
  
  /* Just the flags we need, copied from mm.h: */

  #define FOLL_WRITE0x01/* check pte is writable */
+#define FOLL_TOUCH 0x02/* mark page accessed */
  
  static char *cmd_to_str(unsigned long cmd)

  {
@@ -39,11 +40,11 @@ int main(int argc, char **argv)
unsigned long size = 128 * MB;
int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
unsigned long cmd = GUP_FAST_BENCHMARK;
-   int flags = MAP_PRIVATE;
+   int flags = MAP_PRIVATE, touch = 0;
char *file = "/dev/zero";
char *p;
  
-	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {

+   while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
switch (opt) {
case 'a':
cmd = PIN_FAST_BENCHMARK;
@@ -110,6 +111,10 @@ int main(int argc, char **argv)
case 'H':
flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
break;
+   case 'z':
+   /* fault pages in gup, do not fault in userland */
+   touch = 1;
+   break;
default:
return -1;
}
@@ -167,8 +172,18 @@ int main(int argc, char **argv)
else if (thp == 0)
madvise(p, size, MADV_NOHUGEPAGE);
  
-	for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)

-   p[0] = 0;
+   /*
+* FOLL_TOUCH, in gup_test, is used as an either/or case: either
+* fault pages in from the kernel via FOLL_TOUCH, or fault them
+* in here, from user space. This allows comparison of performance
+* between those two cases.
+*/
+   if (touch) {
+   gup.gup_flags |= FOLL_TOUCH;
+   } else {
+   for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
+   p[0] = 0;
+   }
  
  	/* Only report timing information on the *_BENCHMARK commands: */

if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||





Re: [PATCH v7 13/14] selftests/vm: test flag is broken

2021-01-24 Thread John Hubbard

On 1/21/21 7:37 PM, Pavel Tatashin wrote:

In gup_test both gup_flags and test_flags use the same flags field.
This is broken.

Further, in the actual gup_test.c all the passed gup_flags are erased and
unconditionally replaced with FOLL_WRITE.

This means that test_flags are ignored, and code like this always
performs the pin dump test:

155 if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
156 nr = pin_user_pages(addr, nr, gup->flags,
157 pages + i, NULL);
158 else
159 nr = get_user_pages(addr, nr, gup->flags,
160 pages + i, NULL);
161 break;

Add a new test_flags field, to allow raw gup_flags to work.
Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test
should be performed.
Remove the unconditional overwriting of gup_flags via FOLL_WRITE, but
preserve the previous behaviour where FOLL_WRITE was the default flag,
and add a new option "-W" to unset FOLL_WRITE.

Rename flags to gup_flags.


Hi Pavel,

Thanks again for fixing this up! Looks good, with a tiny point about the
subject line:

1) To follow convention, the subject line should say what you're doing,
not what the previous condition was. Also, there are several tests in that
directory, so we should say which one. So more like this:

"selftests/vm: gup_test: fix test flag"

That is just a minor documentation point, so either way, feel free to add:


Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA



With the fix, dump works like this:

root@virtme:/# gup_test  -c
 page #0, starting from user virt addr: 0x7f8acb9e4000
page:d3d2ee27 refcount:2 mapcount:1 mapping:
index:0x0 pfn:0x100bcf
anon flags: 0x3080016(referenced|uptodate|lru|swapbacked)
raw: 03080016 d0e204021608 d0e208df2e88 8ea04243ec61
raw:   0002 
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

root@virtme:/# gup_test  -c -p
 page #0, starting from user virt addr: 0x7fd19701b000
page:baed3c7d refcount:1025 mapcount:1 mapping:
index:0x0 pfn:0x108008
anon flags: 0x3080014(uptodate|lru|swapbacked)
raw: 03080014 d0e204200188 d0e205e09088 8ea04243ee71
raw:   0401 
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

Refcount shows the difference between pin vs no-pin case.
Also change type of nr from int to long, as it counts number of pages.

Signed-off-by: Pavel Tatashin 
---
  mm/gup_test.c | 23 ++-
  mm/gup_test.h |  3 ++-
  tools/testing/selftests/vm/gup_test.c | 15 +++
  3 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index e3cf78e5873e..a6ed1c877679 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
  {
ktime_t start_time, end_time;
unsigned long i, nr_pages, addr, next;
-   int nr;
+   long nr;
struct page **pages;
int ret = 0;
bool needs_mmap_lock =
@@ -126,37 +126,34 @@ static int __gup_test_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
  
-		/* Filter out most gup flags: only allow a tiny subset here: */

-   gup->flags &= FOLL_WRITE;
-
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags,
+   nr = get_user_pages_fast(addr, nr, gup->gup_flags,
 pages + i);
break;
case GUP_BASIC_TEST:
-   nr = get_user_pages(addr, nr, gup->flags, pages + i,
+   nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
NULL);
break;
case PIN_FAST_BENCHMARK:
-   nr = pin_user_pages_fast(addr, nr, gup->flags,
+   nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
 pages + i);
break;
case PIN_BASIC_TEST:
-   nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+   nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
NULL);
break;
case PIN_LONGTERM_BENCHMARK:
nr = pin_user_pages(addr, nr,
-   gup->fla

Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-15 Thread John Hubbard

On 1/15/21 11:46 AM, David Hildenbrand wrote:

7) There is no easy way to detect if a page really was pinned: we might
have false positives. Further, there is no way to distinguish if it was
pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
we most probably would need more counters, which we cannot fit into
struct page. (AFAIU, for huge pages it's easier).


I think this is the real issue. We can only store so much information,
so we have to decide which things work and which things are broken. So
far someone hasn't presented a way to record everything at least..


I do wonder how many (especially long-term) GUP readers/writers we have
to expect, and especially, support for a single base page. Do we have a
rough estimate?

With RDMA, I would assume we only need a single one (e.g., once RDMA
device; I'm pretty sure I'm wrong, sounds too easy).
With VFIO I guess we need one for each VFIO container (~ in the worst
case one for each passthrough device).
With direct I/O, vmsplice and other GUP users ?? No idea.

If we could somehow put a limit on the #GUP we support, and fail further
GUP (e.g., -EAGAIN?) once a limit is reached, we could partition the
refcount into something like (assume max #15 GUP READ and #15 GUP R/W,
which is most probably a horribly bad choice)

[ GUP READ ][ GUP R/W ] [  ordinary ]
31  ...  28 27  ...  24 23      0

But due to saturate handling in "ordinary", we would lose further 2 bits
(AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea
how many bits we actually need in practice.

Maybe we need less for GUP READ, because most users want GUP R/W? No idea.

Just wild ideas. Most probably that has already been discussed, and most
probably people figured that it's impossible :)



I proposed this exact idea a few days ago [1]. It's remarkable that we both
picked nearly identical values for the layout! :)

But as the responses show, security problems prevent pursuing that approach.


[1] https://lore.kernel.org/r/45806a5a-65c2-67ce-fc92-dc8c2144d...@nvidia.com

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-10 Thread John Hubbard

On 1/10/21 11:30 AM, Linus Torvalds wrote:

On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds
 wrote:

Just having a bit in the page flags for "I already made this
exclusive, and fork() will keep it so" is I feel the best option. In a
way, "page is writable" right now _is_ that bit. By definition, if you
have a writable page in an anonymous mapping, that is an exclusive
user.

But because "writable" has these interactions with other operations,
it would be better if it was a harder bit than that "maybe_pinned()",
though. It would be lovely if a regular non-pinning write-GUP just
always set it, for example.

"maybe_pinned()" is good enough for the fork() case, which is the one
that matters for long-term pinning. But it's admittedly not perfect.



There is at least one way to improve this part of it--maybe.

IMHO, a lot of the bits in page _refcount are still being wasted (even
after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
there are many callers of gup/pup per page. If anyone points out that
that is wrong, then the rest of this falls apart, but...if we were to
make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever
simultaneously allowed on a given page", then several things become
possible:

1) "maybe dma pinned" could become "very likely dma-pinned", by allowing
perhaps 23 bits for normal page refcounts (leaving just 8 bits for dma
pins), thus all but ensuring no aliasing between normal refcounts and
dma pins. The name change below to "likely" is not actually necessary,
I'm just using it to make the point clear:


diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..28f7f64e888f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,7 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 23)

 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
@@ -1274,7 +1274,7 @@ void unpin_user_pages(struct page **pages, unsigned long 
npages);
  * @Return:True, if it is likely that the page has been "dma-pinned".
  * False, if the page is definitely not dma-pinned.
  */
-static inline bool page_maybe_dma_pinned(struct page *page)
+static inline bool page_likely_dma_pinned(struct page *page)
 {
if (hpage_pincount_available(page))
return compound_pincount(page) > 0;


2) Doing (1) allows, arguably, failing mprotect calls if
page_likely_dma_pinned() returns true, because the level of confidence
is much higher now.

3) An additional counter, for normal gup (FOLL_GET) is possible: divide
up the top 8 bits into two separate counters of just 4 bits each. Only
allow either FOLL_GET or FOLL_PIN (and btw, I'm now mentally equating
FOLL_PIN with FOLL_LONGTERM), not both, on a given page. Like this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..ce5af27e3057 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,8 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define DMA_PIN_COUNTING_BIAS (1U << 23)
+#define GUP_PIN_COUNTING_BIAS (1U << 27)

 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

So:

FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount.
These are long term pins for dma.

FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount.
These are not long term pins.

Doing (3) would require yet another release call variant:
unpin_user_pages() would need to take a flag to say which refcount to
release: FOLL_GET or FOLL_PIN. However, I think that's relatively easy
(compared to the original effort of teasing apart generic put_page()
calls, into unpin_user_pages() calls). We already have all the
unpin_user_pages() calls in place, and so it's "merely" a matter of
adding a flag to 74 call sites.
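A rough sketch of what such a release variant might look like
(hypothetical, ignoring the hpage_pincount case for brevity, and using the
DMA_PIN_COUNTING_BIAS value from the proposed diff above):

static void unpin_user_page_flags(struct page *page, unsigned int flags)
{
	unsigned int bias = (flags & FOLL_PIN) ?
			DMA_PIN_COUNTING_BIAS : GUP_PIN_COUNTING_BIAS;

	page = compound_head(page);

	/* Drop the whole bias: (bias - 1) directly, plus one via put_page(). */
	page_ref_sub(page, bias - 1);
	put_page(page);
}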

The real question is whether we can get away with supporting only a very
low count of FOLL_PIN and FOLL_GET pages. Can we?


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-07 Thread John Hubbard

On 1/7/21 2:24 PM, Linus Torvalds wrote:

On Thu, Jan 7, 2021 at 2:20 PM Linus Torvalds
 wrote:


Hmm, add a check for the page being PageAnon(), perhaps?

If it's a shared vma, then the page can be pinned shared with multiple
mappings, I agree.


Or check the vma directly for whether it's a COW vma. That's probably
a more obvious test, but would have to be done outside of
page_maybe_dma_pinned().

For example, in copy_present_page(), we've already done that COW-vma
test, so if we want to strengthen just _that_ test, then it would be
sufficient to just add a

 /* This cannot be a pinned page if it has more than one mapping */
 if (page_mappings(page) != 1)
 return 1;

to copy_present_page() along with the existing page_maybe_dma_pinned() test.

No?

 Linus


That approach makes me a lot happier, yes. Because it doesn't add constraints
to the RDMA cases, but still does what we need here.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-07 Thread John Hubbard

On 1/7/21 2:00 PM, Linus Torvalds wrote:

On Thu, Jan 7, 2021 at 1:53 PM John Hubbard  wrote:




Now, I do agree that from a QoI standpoint, it would be really lovely
if we actually enforced it. I'm not entirely sure we can, but maybe it
would be reasonable to use that

mm->has_pinned && page_maybe_dma_pinned(page)

at least as the beginning of a heuristic.

In fact, I do think that "page_maybe_dma_pinned()" could possibly be
made stronger than it is. Because at *THAT* point, we might say "we


What exactly did you have in mind, to make it stronger? I think the
answer is in this email but I don't quite see it yet...


Literally just adding a " && page_mapcount(page) == 1" in there
(probably best done inside page_maybe_dma_pinned() itself)


Well, that means that pages that are used for pinned DMA like this, can
not be shared with other processes. Is that an acceptable limitation
for the RDMA users? It seems a bit constraining, at first glance anyway.
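For reference, the suggested strengthening would look roughly like this
(sketch only; the surrounding body is paraphrased from include/linux/mm.h):

static inline bool page_maybe_dma_pinned(struct page *page)
{
	if (hpage_pincount_available(page))
		return compound_pincount(page) > 0;

	/* New: a page with more than one mapping never counts as pinned. */
	return page_mapcount(page) == 1 &&
	       ((unsigned int)page_ref_count(compound_head(page))) >=
			GUP_PIN_COUNTING_BIAS;
}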




Direct IO pins, on the other hand, are more transient. We can probably live
without tagging Direct IO pages as FOLL_PIN. I think.


Yes. I think direct-IO writes should be able to just do a transient
GUP, and if it causes a COW fault that isn't coherent, that's the
correct semantics, I think (ie the direct-IO will see the original
data, the COW faulter will get it's own private copy to make changes
to).

I think pinning should be primarily limited to things that _require_
coherency (ie you pin because you're going to do some active two-way
communication using that page)

Does that match your thinking?



Yes, perfectly. I'm going to update Documentation/core-api/pin_user_pages.rst
accordingly, once the dust settles on these discussions.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-07 Thread John Hubbard

On 1/7/21 1:29 PM, Linus Torvalds wrote:

On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli  wrote:


The problem is it's not even possible to detect reliably if there's
really a long term GUP pin because of speculative pagecache lookups.


So none of the normal code _needs_ that any more these days, which is
what I think is so nice. Any pinning will do the COW, and then we have
the logic to make sure it stays writable, and that keeps everything
nicely coherent and is all fairly simple.

And yes, it does mean that if somebody then explicitly write-protects
a page, it may end up being COW'ed after all, but if you first pinned
it, and then started playing with the protections of that page, why
should you be surprised?

So to me, this sounds like a "don't do that then" situation.

Anybody who does page pinning and wants coherency should NOT TOUCH THE
MAPPING IT PINNED.

(And if you do touch it, it's your own fault, and you get to keep both
of the broken pieces)

Now, I do agree that from a QoI standpoint, it would be really lovely
if we actually enforced it. I'm not entirely sure we can, but maybe it
would be reasonable to use that

   mm->has_pinned && page_maybe_dma_pinned(page)

at least as the beginning of a heuristic.

In fact, I do think that "page_maybe_dma_pinned()" could possibly be
made stronger than it is. Because at *THAT* point, we might say "we


What exactly did you have in mind, to make it stronger? I think the
answer is in this email but I don't quite see it yet...

Also, now seems to be a good time to mention that I've been thinking about
a number of pup/gup pinning cases (Direct IO, GPU/NIC, NVMe/storage peer
to peer with GUP/NIC, and HMM support for atomic operations from a device).
And it seems like the following approach would help:

* Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here (thanks
to Jason for this point) means "user space owns the lifetime". We might
even end up deleting either FOLL_PIN or FOLL_LONGTERM, because this would
make them mean the same thing. The idea is that there are no "short term"
pins of this kind of memory.

* Continue to use FOLL_GET (only) for Direct IO. That's a big change of plans,
because several of us had thought that Direct IO needs FOLL_PIN. However, this
recent conversation, plus my list of cases above, seems to indicate otherwise.
That's because we only have one refcount approach for marking pages in this way,
and we should spend it on the long-term pinned pages. Those are both hard to
identify otherwise, and actionable once we identify them.

Direct IO pins, on the other hand, are more transient. We can probably live
without tagging Direct IO pages as FOLL_PIN. I think.

This is all assuming that we make progress in the area of "if it's not a
page_maybe_dma_pinned() page, then we can wait for it or otherwise do reasonable
things about the refcount". So we end up with a clear (-ish) difference between
pages that can be waited for, and pages that should not be waited for in the
kernel.

I hope this helps, but if it's too much of a side-track, please disregard.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 10/10] selftests/vm: test faulting in kernel, and verify pinnable pages

2020-12-19 Thread John Hubbard
r) mistake with -F was that I
mis-predicted what the next future feature would need: I thought we'd be
adding sub-flags for the dump pages test, but instead we are (at least
so far) adding flags for the gup/pup calls. So let's adapt to that.

Some of this contradicts what I proposed earlier, but now it forms a
consistent API as a whole. How's this sound:

* Change the meaning of the -F option, slightly: it now means "raw flags
to pass into .gup_flags and thus to gup/pup, and these will override
other options".

* Use -F 0x2 (the value of FOLL_TOUCH) instead of the -z option.

* We can remain backward compatible with -w at the command line level,
with a caveat: passing in "-F 0" or "-F 2" (anything that doesn't set
FOLL_WRITE) will override -w. The rule would be that raw -F flags
override other settings, just for a consistent way to think about it.

* No need, obviously, for any option to undo -w. -F can do that.

* Do *not* provide direct access to .test_control_flags. Just set those
according to a new command line option. Because these aren't (yet)
exploding.

If this is too much fooling around, let me know and I'll send some diffs
or even a patch. I don't want to take away too much time from the main
patch, and I feel some obligation to get gup_test straightened out. :)









   static char *cmd_to_str(unsigned long cmd)
   {
@@ -39,11 +40,11 @@ int main(int argc, char **argv)
   unsigned long size = 128 * MB;
   int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
   unsigned long cmd = GUP_FAST_BENCHMARK;
- int flags = MAP_PRIVATE;
+ int flags = MAP_PRIVATE, touch = 0;



Silly nit, can we put it on its own line? This pre-existing mess of
declarations makes it hard to read everything. One item per line is
easier on the reader, who is often just looking for a single item at a
time. Actually why not rename it slightly while we're here (see below),
maybe to this:

 int use_foll_touch = 0;


Sure, I will move it. I also prefer one initialization per-line, but
tried to keep the current style. I can also correct the other
initializations in this function.





   char *file = "/dev/zero";
   char *p;

- while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
+ while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {


Yes, this seems worth its own command line option.


   switch (opt) {
   case 'a':
   cmd = PIN_FAST_BENCHMARK;
@@ -110,6 +111,10 @@ int main(int argc, char **argv)
   case 'H':
   flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
   break;
+ case 'z':
+ /* fault pages in gup, do not fault in userland */


How about:
 /*
  * Use gup/pup(FOLL_TOUCH), *instead* of faulting
  * pages in from user space.
  */
 use_foll_touch = 1;


Sure.




+ touch = 1;
+ break;
   default:
   return -1;
   }
@@ -167,8 +172,12 @@ int main(int argc, char **argv)
   else if (thp == 0)
   madvise(p, size, MADV_NOHUGEPAGE);

- for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
- p[0] = 0;
+ if (touch) {
+ gup.flags |= FOLL_TOUCH;
+ } else {
+ for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
+ p[0] = 0;
+ }


Here, the test would change to "if (gup.gup_flags & FOLL_TOUCH)".

And we need a comment somewhere (maybe right above that) that says something
like:

/*
 * FOLL_TOUCH, in gup_test, is used as an either/or case: either
 * fault pages in from the kernel via FOLL_TOUCH, or fault them
 * in here, from user space. This allows comparison of performance
 * between those two cases.
 */

...OR, you know what? We can do even better. Because the above is too quirky
and weird. Instead, let's leave FOLL_TOUCH "pure": it's just a gup flag.

And then, add an option (maybe -z, after all) that says, "skip faulting pages
in from user space". That's a lot clearer!  And you can tell it's better,
because we don't have to write a chunk of documentation explaining the odd
quirks. Ha, it feels better!

What do you think? (Again, if you want me to send over some diffs that put this
all together, let me know. I'm kind of embarrassed at all the typing required 
here.)

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 10/10] selftests/vm: test faulting in kernel, and verify pinnable pages

2020-12-18 Thread John Hubbard
   */
use_foll_touch = 1;


+   touch = 1;
+   break;
default:
return -1;
}
@@ -167,8 +172,12 @@ int main(int argc, char **argv)
else if (thp == 0)
madvise(p, size, MADV_NOHUGEPAGE);
  
-	for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)

-   p[0] = 0;
+   if (touch) {
+   gup.flags |= FOLL_TOUCH;
+   } else {
+   for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
+   p[0] = 0;
+   }


OK.

  
  	/* Only report timing information on the *_BENCHMARK commands: */

if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 09/10] selftests/vm: test flag is broken

2020-12-18 Thread John Hubbard

On 12/18/20 1:06 AM, John Hubbard wrote:

Add a new test_flags field, to allow raw gup_flags to work.


I think .test_control_flags field would be a good name, to make it very
clear that it's not destined for gup_flags. Just .test_flags is not quite
as clear a distinction from .gup_flags, as .test_control_flags is, IMHO.



And maybe renaming .flags to .gup_flags, just to make it really clear.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v4 09/10] selftests/vm: test flag is broken

2020-12-18 Thread John Hubbard
ch (cmd) {
case GUP_FAST_BENCHMARK:
nr = get_user_pages_fast(addr, nr, gup->flags,
@@ -152,7 +149,7 @@ static int __gup_test_ioctl(unsigned int cmd,
pages + i, NULL);
break;
case DUMP_USER_PAGES_TEST:
-   if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
+   if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
nr = pin_user_pages(addr, nr, gup->flags,
pages + i, NULL);
else
@@ -187,7 +184,7 @@ static int __gup_test_ioctl(unsigned int cmd,
  
  	start_time = ktime_get();
  
-	put_back_pages(cmd, pages, nr_pages, gup->flags);

+   put_back_pages(cmd, pages, nr_pages, gup->test_flags);
  
  	end_time = ktime_get();

gup->put_delta_usec = ktime_us_delta(end_time, start_time);
diff --git a/mm/gup_test.h b/mm/gup_test.h
index 90a6713d50eb..b1b376b5d3b2 100644
--- a/mm/gup_test.h
+++ b/mm/gup_test.h
@@ -22,6 +22,7 @@ struct gup_test {
__u64 size;
__u32 nr_pages_per_call;
__u32 flags;
+   __u32 test_flags;
/*
 * Each non-zero entry is the number of the page (1-based: first page is
 * page 1, so that zero entries mean "do nothing") from the .addr base.
diff --git a/tools/testing/selftests/vm/gup_test.c 
b/tools/testing/selftests/vm/gup_test.c
index 6c6336dd3b7f..42c71483729f 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -37,13 +37,13 @@ int main(int argc, char **argv)
  {
struct gup_test gup = { 0 };
unsigned long size = 128 * MB;
-   int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
+   int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;


Let's leave that alone, as noted above.


unsigned long cmd = GUP_FAST_BENCHMARK;
int flags = MAP_PRIVATE;
char *file = "/dev/zero";
char *p;
  
-	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) {

+   while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {


And also as noted above I don't think we need this or any of the remaining
hunks below.


switch (opt) {
case 'a':
cmd = PIN_FAST_BENCHMARK;
@@ -65,6 +65,10 @@ int main(int argc, char **argv)
 */
gup.which_pages[0] = 1;
break;
+   case 'p':
+   /* works only with DUMP_USER_PAGES_TEST */
+   gup.test_flags |= GUP_TEST_FLAG_DUMP_PAGES_USE_PIN;
+   break;
case 'F':
/* strtol, so you can pass flags in hex form */
gup.flags = strtol(optarg, 0, 0);
@@ -93,6 +97,9 @@ int main(int argc, char **argv)
case 'w':
write = 1;
break;
+   case 'W':
+   write = 0;
+   break;
    case 'f':
file = optarg;
break;



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 18/25] btrfs: Use readahead_batch_length

2020-12-17 Thread John Hubbard

On 12/17/20 5:42 AM, Matthew Wilcox wrote:

On Thu, Dec 17, 2020 at 12:12:46PM +, Matthew Wilcox wrote:

ehh ... probably an off-by-one.  Does subtracting 1 from contig_end fix it?
I'll spool up a test VM shortly and try it out.


Yes, this fixed it:

-   u64 contig_end = contig_start + readahead_batch_length(rac);
+   u64 contig_end = contig_start + readahead_batch_length(rac) - 1;



Yes, confirmed on my end, too.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 18/25] btrfs: Use readahead_batch_length

2020-12-17 Thread John Hubbard

On 12/16/20 10:23 AM, Matthew Wilcox (Oracle) wrote:

Implement readahead_batch_length() to determine the number of bytes in
the current batch of readahead pages and use it in btrfs.

Signed-off-by: Matthew Wilcox (Oracle) 
---
  fs/btrfs/extent_io.c| 6 ++
  include/linux/pagemap.h | 9 +
  2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6e3b72e63e42..42936a83a91b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4436,10 +4436,8 @@ void extent_readahead(struct readahead_control *rac)
int nr;
  
  	while ((nr = readahead_page_batch(rac, pagepool))) {

-   u64 contig_start = page_offset(pagepool[0]);
-   u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1;
-
-   ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
+   u64 contig_start = readahead_pos(rac);
+   u64 contig_end = contig_start + readahead_batch_length(rac);


Something in this tiny change is breaking btrfs: it hangs my Fedora 33 test
system (which changed over to btrfs) on boot. I haven't quite figured out
what's really wrong, but git bisect lands here, *and* turning the whole
extent_readahead() function into a no-op (on top of the whole series)
allows everything to work once again.

Sorry for not actually solving the root cause, but I figured you'd be able
to jump straight to the answer, with the above information, so I'm sending
it out early.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v14 10/10] secretmem: test: add basic selftest for memfd_secret(2)

2020-12-11 Thread John Hubbard

On 12/2/20 10:29 PM, Mike Rapoport wrote:

From: Mike Rapoport 

...

+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#include 


Hi Mike,

Say, when I tried this out from today's linux-next, I had to delete the
above line. In other words, the following was required in order to build:

diff --git a/tools/testing/selftests/vm/memfd_secret.c 
b/tools/testing/selftests/vm/memfd_secret.c
index 79578dfd13e6..c878c2b841fc 100644
--- a/tools/testing/selftests/vm/memfd_secret.c
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -29,8 +29,6 @@

 #ifdef __NR_memfd_secret

-#include 
-
 #define PATTERN0x55

 static const int prot = PROT_READ | PROT_WRITE;


...and that makes sense to me, because:

a) secretmem.h is not in the uapi, which this selftests/vm build system
   expects (it runs "make headers_install" for us, which is *not* going
   to pick up items in the kernel include dirs), and

b) There is nothing in secretmem.h that this test uses, anyway! Just these:

bool vma_is_secretmem(struct vm_area_struct *vma);
bool page_is_secretmem(struct page *page);
bool secretmem_active(void);


...or am I just Doing It Wrong? :)

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v3 5/6] mm/gup: migrate pinned pages out of movable zone

2020-12-11 Thread John Hubbard

On 12/11/20 3:00 PM, Pavel Tatashin wrote:

I guess revert what we did (unpin) and return an error. The interesting 
question is what can make migration/isolation fail


OK. I will make the necessary changes. Let's handle errors properly.
Whatever the cause for the error, we will know it when it happens, and
when error is returned. I think I will add a 10-time retry instead of
the infinite retry that we currently have. The 10-times retry we
currently have during the hot-remove path.


It occurs to me that maybe the pre-existing infinite loop shouldn't be
there at all? Would it be better to let the callers retry? Obviously that
would be a separate patch and I'm not sure it's safe to make that change,
but the current loop seems buried maybe too far down.

Thoughts, anyone?

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

2020-12-03 Thread John Hubbard

On 12/3/20 7:06 AM, Pavel Tatashin wrote:
...

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 611799c72da5..7a6d86d0bc5f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t 
gfp_mask)
   return alloc_flags;
   }

-static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
- unsigned int alloc_flags)
+static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
+unsigned int alloc_flags)


Actually, maybe the original name should be left intact. This handles current 
alloc
flags, which right now happen to only cover CMA flags, so the original name 
seems
accurate, right?


The reason I re-named it is because we do not access current context
anymore, only use gfp_mask to get cma flag.

- unsigned int pflags = current->flags;


So, keeping "current" in the function name makes its intent misleading.



OK, I see. That sounds OK then.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

2020-12-03 Thread John Hubbard

On 12/1/20 9:23 PM, Pavel Tatashin wrote:

We do not allocate pinned pages in ZONE_MOVABLE, but if pages were already
allocated before pinning they need to be migrated to a different zone.
Currently, we migrate movable CMA pages only. Generalize the function
that migrates CMA pages to migrate all movable pages.

Signed-off-by: Pavel Tatashin 
---
  include/linux/migrate.h|  1 +
  include/trace/events/migrate.h |  3 +-
  mm/gup.c   | 56 +-
  3 files changed, 24 insertions(+), 36 deletions(-)



I like the cleanup so far, even at this point it's a relief to finally
see the nested ifdefs get removed.

One naming nit/idea below, but this looks fine as is, so:

Reviewed-by: John Hubbard 


diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0f8d1583fa8e..00bab23d1ee5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CONTIG_RANGE,
+   MR_LONGTERM_PIN,
MR_TYPES
  };
  
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h

index 4d434398d64d..363b54ce104c 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset")  \
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")\
EM( MR_NUMA_MISPLACED,  "numa_misplaced") \
-   EMe(MR_CONTIG_RANGE,"contig_range")
+   EM( MR_CONTIG_RANGE,"contig_range")   \
+   EMe(MR_LONGTERM_PIN,"longterm_pin")
  
  /*

   * First define the enums in the above macros to be exported to userspace
diff --git a/mm/gup.c b/mm/gup.c
index 724d8a65e1df..1d511f65f8a7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct 
**vmas, long nr_pages)
  }
  #endif
  
-#ifdef CONFIG_CMA

-static long check_and_migrate_cma_pages(struct mm_struct *mm,
-   unsigned long start,
-   unsigned long nr_pages,
-   struct page **pages,
-   struct vm_area_struct **vmas,
-   unsigned int gup_flags)
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+   unsigned long start,
+   unsigned long nr_pages,
+   struct page **pages,
+   struct vm_area_struct **vmas,
+   unsigned int gup_flags)
  {
unsigned long i;
unsigned long step;
bool drain_allow = true;
bool migrate_allow = true;
-   LIST_HEAD(cma_page_list);
+   LIST_HEAD(page_list);



Maybe naming it "movable_page_list", would be a nice touch.


thanks,
--
John Hubbard
NVIDIA


long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
@@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct 
mm_struct *mm,
 */
step = compound_nr(head) - (pages[i] - head);
/*
-* If we get a page from the CMA zone, since we are going to
-* be pinning these entries, we might as well move them out
-* of the CMA zone if possible.
+* If we get a movable page, since we are going to be pinning
+* these entries, try to move them out if possible.
 */
-   if (is_migrate_cma_page(head)) {
+   if (is_migrate_movable(get_pageblock_migratetype(head))) {
if (PageHuge(head))
-   isolate_huge_page(head, &cma_page_list);
+   isolate_huge_page(head, &page_list);
else {
if (!PageLRU(head) && drain_allow) {
lru_add_drain_all();
@@ -1637,7 +1635,7 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
}
  
  if (!isolate_lru_page(head)) {

-   list_add_tail(&head->lru, 
&cma_page_list);
+   list_add_tail(&head->lru, &page_list);
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON +

page_is_file_lru(head),
@@ -1649,7 +1647,7 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
i += step;
}
  
-	if (!list_empty(&cma_page_list)) {

+   if (!list_empty(&page_li

Re: [PATCH 5/6] mm: honor PF_MEMALLOC_NOMOVABLE for all allocations

2020-12-03 Thread John Hubbard

On 12/1/20 9:23 PM, Pavel Tatashin wrote:

PF_MEMALLOC_NOMOVABLE is only honored for CMA allocations, extend
this flag to work for any allocations by removing __GFP_MOVABLE from
gfp_mask when this flag is passed in the current context, thus
prohibiting allocations from ZONE_MOVABLE.

Signed-off-by: Pavel Tatashin 
---
  mm/hugetlb.c|  2 +-
  mm/page_alloc.c | 26 --
  2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 02213c74ed6b..00e786201d8b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1036,7 +1036,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
hstate *h, int nid)
bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE);
  
  	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {

-   if (nomovable && is_migrate_cma_page(page))
+   if (nomovable && 
is_migrate_movable(get_pageblock_migratetype(page)))



I wonder if we should add a helper, like is_migrate_cma_page(), that avoids 
having
to call get_pageblock_migratetype() at all of the callsites?
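Something like this, perhaps (the name is just a strawman):

static inline bool is_migrate_movable_page(struct page *page)
{
	return is_migrate_movable(get_pageblock_migratetype(page));
}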



continue;
  
  		if (PageHWPoison(page))

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 611799c72da5..7a6d86d0bc5f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3766,20 +3766,25 @@ alloc_flags_nofragment(struct zone *zone, gfp_t 
gfp_mask)
return alloc_flags;
  }
  
-static inline unsigned int current_alloc_flags(gfp_t gfp_mask,

-   unsigned int alloc_flags)
+static inline unsigned int cma_alloc_flags(gfp_t gfp_mask,
+  unsigned int alloc_flags)


Actually, maybe the original name should be left intact. This handles current 
alloc
flags, which right now happen to only cover CMA flags, so the original name 
seems
accurate, right?


thanks,

John Hubbard
NVIDIA


  {
  #ifdef CONFIG_CMA
-   unsigned int pflags = current->flags;
-
-   if (!(pflags & PF_MEMALLOC_NOMOVABLE) &&
-   gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+   if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
-
  #endif
return alloc_flags;
  }
  
+static inline gfp_t current_gfp_checkmovable(gfp_t gfp_mask)

+{
+   unsigned int pflags = current->flags;
+
+   if ((pflags & PF_MEMALLOC_NOMOVABLE))
+   return gfp_mask & ~__GFP_MOVABLE;
+   return gfp_mask;
+}
+
  /*
   * get_page_from_freelist goes through the zonelist trying to allocate
   * a page.
@@ -4423,7 +4428,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
} else if (unlikely(rt_task(current)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
  
-	alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);

+   alloc_flags = cma_alloc_flags(gfp_mask, alloc_flags);
  
  	return alloc_flags;

  }
@@ -4725,7 +4730,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  
  	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);

if (reserve_flags)
-   alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);
+   alloc_flags = cma_alloc_flags(gfp_mask, reserve_flags);
  
  	/*

 * Reset the nodemask and zonelist iterators if memory policies can be
@@ -4894,7 +4899,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return false;
  
-	*alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);

+   *alloc_flags = cma_alloc_flags(gfp_mask, *alloc_flags);
  
  	/* Dirty zone balancing only done in the fast path */

ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
@@ -4932,6 +4937,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
order, int preferred_nid,
}
  
  	gfp_mask &= gfp_allowed_mask;

+   gfp_mask = current_gfp_checkmovable(gfp_mask);
alloc_mask = gfp_mask;
	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, 
&alloc_mask, &alloc_flags))
return NULL;





Re: [PATCH 4/6] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_NOMOVABLE

2020-12-03 Thread John Hubbard

On 12/1/20 9:23 PM, Pavel Tatashin wrote:

PF_MEMALLOC_NOCMA is used for longterm pinning and has an effect of
clearing _GFP_MOVABLE or prohibiting allocations from ZONE_MOVABLE.

We will prohibit allocating any pages that are getting
longterm pinned from ZONE_MOVABLE, and we would want to unify and re-use
this flag. So, rename it to generic PF_MEMALLOC_NOMOVABLE.
Also re-name:
memalloc_nocma_save()/memalloc_nocma_restore
to
memalloc_nomovable_save()/memalloc_nomovable_restore()
and make the new functions common.

Signed-off-by: Pavel Tatashin 


Looks accurate, and grep didn't find any lingering rename candidates
after this series is applied. And it's a good rename.


Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA


---
  include/linux/sched.h|  2 +-
  include/linux/sched/mm.h | 21 +
  mm/gup.c |  4 ++--
  mm/hugetlb.c |  4 ++--
  mm/page_alloc.c  |  4 ++--
  5 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 76cd21fa5501..f1bf05f5f5fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1548,7 +1548,7 @@ extern struct pid *cad_pid;
  #define PF_SWAPWRITE  0x0080  /* Allowed to write to swap */
  #define PF_NO_SETAFFINITY 0x0400  /* Userland is not allowed to 
meddle with cpus_mask */
  #define PF_MCE_EARLY  0x0800  /* Early kill for mce process 
policy */
-#define PF_MEMALLOC_NOCMA  0x1000  /* All allocation request will 
have _GFP_MOVABLE cleared */
+#define PF_MEMALLOC_NOMOVABLE  0x1000  /* All allocation request will 
have _GFP_MOVABLE cleared */
  #define PF_FREEZER_SKIP   0x4000  /* Freezer should not 
count it as freezable */
  #define PF_SUSPEND_TASK   0x8000  /* This thread called 
freeze_processes() and should not be frozen */
  
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h

index d5ece7a9a403..5bb9a6b69479 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -254,29 +254,18 @@ static inline void memalloc_noreclaim_restore(unsigned 
int flags)
current->flags = (current->flags & ~PF_MEMALLOC) | flags;
  }
  
-#ifdef CONFIG_CMA

-static inline unsigned int memalloc_nocma_save(void)
+static inline unsigned int memalloc_nomovable_save(void)
  {
-   unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+   unsigned int flags = current->flags & PF_MEMALLOC_NOMOVABLE;
  
-	current->flags |= PF_MEMALLOC_NOCMA;

+   current->flags |= PF_MEMALLOC_NOMOVABLE;
return flags;
  }
  
-static inline void memalloc_nocma_restore(unsigned int flags)

+static inline void memalloc_nomovable_restore(unsigned int flags)
  {
-   current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+   current->flags = (current->flags & ~PF_MEMALLOC_NOMOVABLE) | flags;
  }
-#else
-static inline unsigned int memalloc_nocma_save(void)
-{
-   return 0;
-}
-
-static inline void memalloc_nocma_restore(unsigned int flags)
-{
-}
-#endif
  
  #ifdef CONFIG_MEMCG

  DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
diff --git a/mm/gup.c b/mm/gup.c
index 0e2de888a8b0..724d8a65e1df 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1726,7 +1726,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
if (!vmas_tmp)
return -ENOMEM;
}
-   flags = memalloc_nocma_save();
+   flags = memalloc_nomovable_save();
}
  
  	rc = __get_user_pages_locked(mm, start, nr_pages, pages,

@@ -1749,7 +1749,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
rc = check_and_migrate_cma_pages(mm, start, rc, pages,
 vmas_tmp, gup_flags);
  out:
-   memalloc_nocma_restore(flags);
+   memalloc_nomovable_restore(flags);
}
  
  	if (vmas_tmp != vmas)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 37f15c3c24dc..02213c74ed6b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1033,10 +1033,10 @@ static void enqueue_huge_page(struct hstate *h, struct 
page *page)
  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
  {
struct page *page;
-   bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
+   bool nomovable = !!(current->flags & PF_MEMALLOC_NOMOVABLE);
  
  	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {

-   if (nocma && is_migrate_cma_page(page))
+   if (nomovable && is_migrate_cma_page(page))
continue;
  
  		if (PageHWPoison(page))

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..611799c72da5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3772,8 +3772,8 @@ static inline unsigned int current_alloc_flag

Re: [PATCH 3/6] mm/gup: make __gup_longterm_locked common

2020-12-03 Thread John Hubbard

On 12/1/20 9:23 PM, Pavel Tatashin wrote:

__gup_longterm_locked() has CMA || FS_DAX version and a common stub
version. In the preparation of prohibiting longterm pinning of pages from
movable zone make the CMA || FS_DAX version common, and delete the stub
version.

Signed-off-by: Pavel Tatashin 
---
  mm/gup.c | 13 -
  1 file changed, 13 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3a76c005a3e2..0e2de888a8b0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1567,7 +1567,6 @@ struct page *get_dump_page(unsigned long addr)
  }
  #endif /* CONFIG_ELF_CORE */
  
-#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)

  #ifdef CONFIG_FS_DAX
  static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
  {
@@ -1757,18 +1756,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
kfree(vmas_tmp);
return rc;
  }
-#else /* !CONFIG_FS_DAX && !CONFIG_CMA */
-static __always_inline long __gup_longterm_locked(struct mm_struct *mm,
- unsigned long start,
- unsigned long nr_pages,
- struct page **pages,
- struct vm_area_struct **vmas,
- unsigned int flags)
-{
-   return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
-  NULL, flags);
-}
-#endif /* CONFIG_FS_DAX || CONFIG_CMA */
  
  static bool is_valid_gup_flags(unsigned int gup_flags)

  {



At last some simplification here, yea!

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 2/6] mm/gup: don't pin migrated cma pages in movable zone

2020-12-03 Thread John Hubbard

On 12/1/20 9:23 PM, Pavel Tatashin wrote:

In order not to fragment CMA the pinned pages are migrated. However,
they are migrated to ZONE_MOVABLE, which also should not have pinned pages.

Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
is allowed.

Signed-off-by: Pavel Tatashin 
---
  mm/gup.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index cdb8b9eeb016..3a76c005a3e2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1610,7 +1610,7 @@ static long check_and_migrate_cma_pages(struct mm_struct 
*mm,
long ret = nr_pages;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
-   .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
+   .gfp_mask = GFP_USER | __GFP_NOWARN,
};
  
  check_again:




Reviewed-by: John Hubbard 

...while I was here, I noticed this, which is outside the scope of your 
patchset, but
I thought I'd just mention it anyway in case anyone cares:

static inline bool is_migrate_movable(int mt)
{
return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
}

...that really should take an "enum migratetype mt" instead of an int, I think.
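i.e. just tightening the signature, with no change in behavior:

static inline bool is_migrate_movable(enum migratetype mt)
{
	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
}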

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 1/6] mm/gup: perform check_dax_vmas only when FS_DAX is enabled

2020-12-03 Thread John Hubbard

On 12/1/20 9:23 PM, Pavel Tatashin wrote:

There is no need to call check_dax_vmas() and run through the npage loop of
pinned pages if FS_DAX is not enabled.

Add a stub check_dax_vmas() function for no-FS_DAX case.

Signed-off-by: Pavel Tatashin 
---
  mm/gup.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609..cdb8b9eeb016 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1568,6 +1568,7 @@ struct page *get_dump_page(unsigned long addr)
  #endif /* CONFIG_ELF_CORE */
  
  #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)

+#ifdef CONFIG_FS_DAX
  static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
  {
long i;
@@ -1586,6 +1587,12 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, 
long nr_pages)
}
return false;
  }
+#else
+static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
+{
+   return false;
+}
+#endif
  
  #ifdef CONFIG_CMA

  static long check_and_migrate_cma_pages(struct mm_struct *mm,



Looks obviously correct, and the follow-up simplication is very nice.

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


Re: Pinning ZONE_MOVABLE pages

2020-11-23 Thread John Hubbard

On 11/20/20 12:27 PM, Pavel Tatashin wrote:

Recently, I encountered a hang that is happening during memory hot
remove operation. It turns out that the hang is caused by pinned user
pages in ZONE_MOVABLE.

Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
this is not the case if a user application, such as one using dpdk
libraries, has pinned them via vfio dma map. Kernel keeps trying to
hot-remove them, but refcnt never gets to zero, so we are looping
until the hardware watchdog kicks in.

We cannot do dma unmaps before hot-remove, because hot-remove is a
slow operation, and we have thousands of network flows handled by
dpdk that we just cannot suspend for the duration of the hot-remove
operation.

The solution is for dpdk to allocate pages from a zone below
ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
There is no user interface that we have that allows applications to
select what zone the memory should come from.

I've spoken with Stephen Hemminger, and he said that DPDK is moving in
the direction of using transparent huge pages instead of HugeTLBs,
which means that we need to allow at least anonymous pages, and anonymous
transparent huge pages, to come from non-movable zones on demand.

Here is what I am proposing:
1. Add a new flag that is passed through pin_user_pages_* down to
fault handlers, and allow the fault handler to allocate from a
non-movable zone.



I like where the discussion so far (in the other threads) has taken
this. And the current plan also implies, I think, that you can probably
avoid any new flags at all: just check that both FOLL_LONGTERM and
FOLL_PIN are set, and if they are, then make your attempt to migrate
away from ZONE_MOVABLE.
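
In other words, roughly this shape (a minimal sketch only, not the eventual
patch; the migration step is just a placeholder comment):

	if ((gup_flags & FOLL_PIN) && (gup_flags & FOLL_LONGTERM)) {
		/* try to migrate any pages that landed in ZONE_MOVABLE, then pin */
	}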





Sample function stacks through which this info needs to be passed is this:

pin_user_pages_remote(gup_flags)
  __get_user_pages_remote(gup_flags)
   __gup_longterm_locked(gup_flags)
__get_user_pages_locked(gup_flags)
 __get_user_pages(gup_flags)
  faultin_page(gup_flags)
   Convert gup_flags into fault_flags
   handle_mm_fault(fault_flags)


I'm pleased that the gup_flags have pretty much been plumbed through all the
main places that they were missing, so there shouldn't be too much required
at this point.



 From handle_mm_fault(), the stack diverges into various faults,
examples include:

Transparent Huge Page
handle_mm_fault(fault_flags)
__handle_mm_fault(fault_flags)
Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
create_huge_pmd(vmf);
do_huge_pmd_anonymous_page(vmf);
mm_get_huge_zero_page(vma->vm_mm); -> flag is lost, so flag from
vmf.gfp_mask should be passed as well.

There are several other similar paths in a transparent huge page, also
there is a named path where allocation is based on filesystems, and
the flag should be honored there as well, but it does not have to be
added at the same time.

Regular Pages
handle_mm_fault(fault_flags)
__handle_mm_fault(fault_flags)
Create: struct vm_fault vmf, use fault_flags to specify correct gfp_mask
handle_pte_fault(vmf)
do_anonymous_page(vmf);
page = alloc_zeroed_user_highpage_movable(vma, vmf->address); ->
change this call according to gfp_mask.

The above only takes care of the case where the user application faults on the
page during pinning time, but there are also cases where the pages already
exist.

2. Add an internal move_pages_zone() similar to move_pages() syscall
but instead of migrating to a different NUMA node, migrate pages from
ZONE_MOVABLE to another zone.
Call move_pages_zone() on demand prior to pinning pages from
vfio_pin_map_dma() for instance.

3. Perhaps, it also makes sense to add madvise() flag, to allocate
pages from non-movable zone. When a user application knows that it
will do DMA mapping, and pin pages for a long time, the memory that it
allocates should never be migrated or hot-removed, so make sure that
it comes from the appropriate place.
The benefit of adding madvise() flag is that we won't have to deal
with slow page migration during pin time, but the disadvantage is that
we would need to change the user interface.

Before I start working on the above approaches, I would like to get an
opinion from the community on an appropriate path forward for this
problem: whether what I described sounds reasonable, or whether there are
other ideas on how to address the problem that I am seeing.



I'm also in favor of avoiding (3) for now and maybe forever, depending on
how it goes. Good luck... :)


thanks,
--
John Hubbard
NVIDIA


Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

2020-11-18 Thread John Hubbard

On 11/18/20 10:17 AM, Dan Williams wrote:

On Wed, Nov 18, 2020 at 5:51 AM Jan Kara  wrote:


On Mon 16-11-20 19:35:31, John Hubbard wrote:


On 11/16/20 6:48 PM, kernel test robot wrote:


Greeting,

FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s 
due to commit:



That's a huge slowdown...



commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: 
page->hpage_pinned_refcount: exact pin counts for huge pages")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


...but that commit happened in April, 2020. Surely if this were a serious
issue we would have some other indication...is this worth following up
on?? I'm inclined to ignore it, honestly.


Why this was detected so late is a fair question although it doesn't quite
invalidate the report...


I don't know what specifically happened in this case, perhaps someone
from the lkp team can comment? However, the myth / contention that
"surely someone else would have noticed by now" is why the lkp project
was launched. Kernels regressed without much complaint and it wasn't
until much later in the process, around the time enterprise distros
rebased to new kernels, did end users start filing performance loss
regression reports. Given -stable kernel releases, 6-7 months is still
faster than many end user upgrade cycles to new kernel baselines.



I see, thanks for explaining. I'll take a peek, then.

thanks,
--
John Hubbard
NVIDIA


Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

2020-11-16 Thread John Hubbard



On 11/16/20 6:48 PM, kernel test robot wrote:


Greeting,

FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s 
due to commit:



That's a huge slowdown...



commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: 
page->hpage_pinned_refcount: exact pin counts for huge pages")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


...but that commit happened in April, 2020. Surely if this were a serious issue
we would have some other indication...is this worth following up on?? I'm
inclined to ignore it, honestly.

thanks,
--
John Hubbard
NVIDIA



in testcase: phoronix-test-suite
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G 
memory
with following parameters:

test: npb-1.3.1
option_a: FT.A
cpufreq_governor: performance
ucode: 0x5002f01

test-description: The Phoronix Test Suite is the most comprehensive testing and 
benchmarking platform available that provides an extensible framework for which 
new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/



If you fix the issue, kindly add following tag
Reported-by: kernel test robot 


Details are as below:
-->


To reproduce:

 git clone https://github.com/intel/lkp-tests.git
 cd lkp-tests
 bin/lkp install job.yaml  # job file is attached in this email
 bin/lkp run job.yaml

=
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase/ucode:
   
gcc-9/performance/x86_64-rhel-8.3/FT.A/debian-x86_64-phoronix/lkp-csl-2sp8/npb-1.3.1/phoronix-test-suite/0x5002f01

commit:
   3faa52c03f ("mm/gup: track FOLL_PIN pages")
   47e29d32af ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")

 3faa52c03f440d1b             47e29d32afba11b13efb51f0315
 ----------------             ---------------------------
        fail:runs   %reproduction   fail:runs
              1:4            -25%          :4    kmsg.Spurious_LAPIC_timer_interrupt_on_cpu
          %stddev         %change     %stddev
        4585 ± 2%          -45.0%        2522    phoronix-test-suite.npb.FT.A.total_mop_s
        1223 ± 4%          +40.2%        1714    phoronix-test-suite.time.percent_of_cpu_this_job_got


 
[ASCII trend plot of phoronix-test-suite.npb.FT.A.total_mop_s omitted: the
bisect-good samples ([*]) hold steady around 6000 mop/s, while the bisect-bad
samples ([O]) drop to roughly 2200-2500 mop/s.]



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Oliver Sang





Re: [PATCH v3 1/7] compiler-clang: add build check for clang 10.0.1

2020-11-16 Thread John Hubbard
Hi,

I just ran into this and it's a real pain to figure out, because even
with the very latest Fedora 33 on my test machine, which provides clang
version 11.0.0:

$ clang --version
clang version 11.0.0 (Fedora 11.0.0-2.fc33)
Target: x86_64-unknown-linux-gnu

...the bpftrace program still chokes on some, but not all, commands, in
ways that are invisible to normal debugging. For example:

$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_vmsplice { @[kstack()]
= count(); }'
/lib/modules/5.10.0-rc4-hubbard-github+/source/include/linux/compiler-clang.h:12:3:
error: Sorry, your version of Clang is too old - please use 10.0.1 or
newer.

But Jarkko's recommended fix works! In other words, applying the diff
below fixes it for me. So I'm replying in order to note that the problem
is real and hoping that the fix is applied soon.


diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
index dd7233c48bf3..c2228b957fd7 100644
--- a/include/linux/compiler-clang.h
+++ b/include/linux/compiler-clang.h
@@ -7,9 +7,11 @@
 + __clang_minor__ * 100\
 + __clang_patchlevel__)
 
+#ifndef __BPF_TRACING__
 #if CLANG_VERSION < 11
 # error Sorry, your version of Clang is too old - please use 10.0.1 or newer.
 #endif
+#endif
 
 /* Compiler specific definitions for Clang compiler */
 


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm/gup_test: GUP_TEST depends on DEBUG_FS

2020-11-08 Thread John Hubbard

On 11/8/20 12:37 AM, Barry Song wrote:

Without DEBUG_FS, all the code in gup_test becomes meaningless. For sure
kernel provides debugfs stub while DEBUG_FS is disabled, but the point
here is that GUP_TEST can do nothing without DEBUG_FS.

Cc: John Hubbard 
Cc: Ralph Campbell 
Cc: Randy Dunlap 
Suggested-by: John Garry 
Signed-off-by: Barry Song 
---
  -v2:
  add comment as a prompt to users as commented by John and Randy, thanks!

  mm/Kconfig | 4 
  1 file changed, 4 insertions(+)


Thanks for suffering through a lot of discussion about this!

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA



diff --git a/mm/Kconfig b/mm/Kconfig
index 01b0ae0cd9d3..a7ff0d31afd5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -836,6 +836,7 @@ config PERCPU_STATS
  
  config GUP_TEST

bool "Enable infrastructure for get_user_pages()-related unit tests"
+   depends on DEBUG_FS
help
  Provides /sys/kernel/debug/gup_test, which in turn provides a way
  to make ioctl calls that can launch kernel-based unit tests for
@@ -853,6 +854,9 @@ config GUP_TEST
  
  	  See tools/testing/selftests/vm/gup_test.c
  
+comment "GUP_TEST needs to have DEBUG_FS enabled"

+   depends on !GUP_TEST && !DEBUG_FS
+
  config GUP_GET_PTE_LOW_HIGH
bool
  






Re: [PATCH 2/2] tomoyo: Fixed typo in documentation

2020-11-08 Thread John Hubbard

On 11/8/20 7:41 PM, Souptick Joarder wrote:

On Sat, Nov 7, 2020 at 2:27 PM John Hubbard  wrote:


On 11/7/20 12:24 AM, Souptick Joarder wrote:

Fixed typo s/Poiner/Pointer

Fixes: 5b636857fee6 ("TOMOYO: Allow using argv[]/envp[] of execve() as 
conditions.")
Signed-off-by: Souptick Joarder 
Cc: John Hubbard 
---
   security/tomoyo/domain.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
index bd748be..7b2babe 100644
--- a/security/tomoyo/domain.c
+++ b/security/tomoyo/domain.c
@@ -891,7 +891,7 @@ int tomoyo_find_next_domain(struct linux_binprm *bprm)
*
* @bprm: Pointer to "struct linux_binprm".
* @pos:  Location to dump.
- * @dump: Poiner to "struct tomoyo_page_dump".
+ * @dump: Pointer to "struct tomoyo_page_dump".


Not worth a separate patch, especially since the original comment is merely
copying the C sources, and as such, does not add any value.

I'd either a) craft a new documentation line that adds some value, or b) just
merge this patch into the previous one, and make a note in the commit
description to the effect that you've included a trivial typo fix as long
as you're there.



John, as patch [1/2] is dropped, can we take this patch forward with some more
updates in the documentation?



That's really up to the folks who work on this code. Personally I would rarely
post a patch *just* for this, but on the other hand it is a correction. Either
way is fine with me of course.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS

2020-11-07 Thread John Hubbard

On 11/7/20 11:35 PM, Song Bao Hua (Barry Song) wrote:

Or do you want this ?

(Code B)
diff --git a/mm/Kconfig b/mm/Kconfig
index 01b0ae0cd9d3..a7ff0d31afd5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -836,6 +836,7 @@ config PERCPU_STATS

  config GUP_TEST
 bool "Enable infrastructure for get_user_pages()-related unit tests"
+   depends on DEBUG_FS
 help
   Provides /sys/kernel/debug/gup_test, which in turn provides a way
   to make ioctl calls that can launch kernel-based unit tests for
@@ -853,6 +854,9 @@ config GUP_TEST

   See tools/testing/selftests/vm/gup_test.c

+comment "GUP_TEST needs to have DEBUG_FS enabled"
+   depends on !GUP_TEST && !DEBUG_FS
+
  config GUP_GET_PTE_LOW_HIGH
 bool



The above, please. It's your original patch, plus a way to display something
if the "depends on" is not met.




To be honest, I am not a big fan of either code A or code B. I think "depends on"
already says everything the redundant comment wants to say.


It's not really just a "comment", in the sense of commenting the Kconfig
sources. It actually has an effect, which is that it displays something in
"make menuconfig". So it's not redundant, because it provides that output
*instead* of hiding the option entirely, when !DEBUG_FS.

Try it out, it's nice.




   │ Symbol: GUP_TEST [=]
   │ Type  : bool
   │ Defined at mm/Kconfig:837
   │   Prompt: Enable infrastructure for get_user_pages()-related unit tests
   │   Depends on: DEBUG_FS [=n]
   │   Location:
   │ (1) -> Memory Management option

Menuconfig shows GUP_TEST depends on DEBUG_FS and right now DEBUG_FS is
"n". so it is impossible to enable GUP_TEST.

"comment" is a good thing, but it is more likely to be used for a menu or a 
group
of configurations to extend a bundle of things.

On the other hand, if this particular case needs this comment, so do countless
other configurations in hundreds of Kconfig files.


Well, maybe, yes.

I personally find it quite difficult, having options appear and disappear on me,
in this system. If they all had this "comment" behavior by default, to show up
as a placeholder, I think it would be a better user experience.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 1/2] tomoyo: Convert get_user_pages*() to pin_user_pages*()

2020-11-07 Thread John Hubbard

On 11/7/20 8:12 PM, Tetsuo Handa wrote:

On 2020/11/08 11:17, John Hubbard wrote:

Excuse me, but Documentation/core-api/pin_user_pages.rst says
"CASE 5: Pinning in order to _write_ to the data within the page"
while tomoyo_dump_page() is for "_read_ the data within the page".
Do we want to convert to pin_user_pages_remote() or lock_page() ?



Sorry, I missed the direction here, was too focused on the Case 5
aspect. Yes. Case 5 (which, again, I think we're about to re-document)
is only about *writing* to data within the page.

So in this case, where it is just reading from the page, I think it's
already from a gup vs pup point of view.

btw, it's not clear to me whether the current code is susceptible to any
sort of problem involving something writing to the page while it
is being dumped (I am curious). But changing from gup to pup wouldn't
fix that, if it were a problem. It's a separate question from this patch.


The "struct page" tomoyo_dump_page() accesses is argv/envp arguments passed
to execve() syscall. Therefore, these pages are not visible from threads
except current thread, and thus there is no possibility that these pages
are modified by other threads while current thread is reading.



Perfect. So since I accidentally left out the word "correct" above (I meant
to write, "it's already correct"), let me be extra clear: Souptick, we
should just drop this patch.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS

2020-11-07 Thread John Hubbard

On 11/7/20 8:11 PM, Randy Dunlap wrote:
...

Look at kconfig-language.rst instead.


aha, yes.



One thing that could be done (and is done in a few places for other reasons) is 
to add
a Kconfig comment if DEBUG_FS is not enabled:

comment "GUP_TEST needs to have DEBUG_FS enabled"
depends on !GUP_TEST && !DEBUG_FS



Sweet--I just applied that here, and it does exactly what I wanted: puts a nice 
clear
message on the "make menuconfig" screen. No more hidden item. Brilliant!

Let's go with that, shall we?

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS

2020-11-07 Thread John Hubbard

On 11/7/20 7:14 PM, John Hubbard wrote:

On 11/7/20 6:58 PM, Song Bao Hua (Barry Song) wrote:

On 11/7/20 2:20 PM, Randy Dunlap wrote:

On 11/7/20 11:16 AM, John Hubbard wrote:

On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote:

From: John Hubbard [mailto:jhubb...@nvidia.com]

...

But if you really disagree, then I'd go with, just drop the patch entirely, 
because
it doesn't really make things better as written...IMHO anyway. :)


Just imagine a case, we don't enable DEBUG_FS but we enable GUP_TEST, we will
get an image with totally useless code section since GUP_TEST depends on debugfs
entry to perform any useful functionality.



Looking at the choices, from the user's (usually kernel developer's) experience:

a) The user enables GUP_TEST, then boots up, runs, and is briefly surprised by
a runtime failure. But it's a very quick diagnosis: "open: No such file or
directory", when trying to make that ioctl call. The path indicates that it's a
debug fs path, so the solution is pretty clear, at least for the main audience.

b) The other choice: the user *never even sees* GUP_TEST as a choice. This
especially bothers me because sometimes you find things by poking around in the
menu, although of course "you should already know about it"...but there's a lot
to "already know" in a large kernel.

From a user experience, it's way better to simply see what you want, and select
it in the menu. Or, at least get some prompt that you need to pre-select
something else.



...and again, this is all fallout from Kconfig. I might be missing some advanced
feature, because it seems surprising to only be allowed to choose between
missing dependencies (which this patch would correct), or a poorer user 
experience
(which I argue that this patch would also provide).

Ideally, we'd just set up the dependencies, and then have some options for
visibility, but I'm not aware of any way to do that...and after a quick peek
at Documentation/kbuild/kconfig-macro-language.rst it looks pretty bare bones.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS

2020-11-07 Thread John Hubbard

On 11/7/20 6:58 PM, Song Bao Hua (Barry Song) wrote:

On 11/7/20 2:20 PM, Randy Dunlap wrote:

On 11/7/20 11:16 AM, John Hubbard wrote:

On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote:

From: John Hubbard [mailto:jhubb...@nvidia.com]

...

But if you really disagree, then I'd go with, just drop the patch entirely, 
because
it doesn't really make things better as written...IMHO anyway. :)


Just imagine a case, we don't enable DEBUG_FS but we enable GUP_TEST, we will
get an image with totally useless code section since GUP_TEST depends on debugfs
entry to perform any useful functionality.



Looking at the choices, from the user's (usually kernel developer's) experience:

a) The user enables GUP_TEST, then boots up, runs, and is briefly surprised by
a runtime failure. But it's a very quick diagnosis: "open: No such file or
directory", when trying to make that ioctl call. The path indicates that it's a
debug fs path, so the solution is pretty clear, at least for the main audience.

b) The other choice: the user *never even sees* GUP_TEST as a choice. This
especially bothers me because sometimes you find things by poking around in the
menu, although of course "you should already know about it"...but there's a lot
to "already know" in a large kernel.

From a user experience, it's way better to simply see what you want, and select
it in the menu. Or, at least get some prompt that you need to pre-select
something else.



The difference between "depends on" and "select" for this case is like:
depends on: if we want to use GUP_TEST, we have to enable DEBUG_FS first;
select: if we enable GUP_TEST, Kconfig will enable DEBUG_FS automatically.

To me, I am 60% inclined to "depends on" as I think "DEBUG_FS" is more
of a pre-condition of GUP_TEST than an internal part of GUP_TEST. So people
should realize the pre-condition must be met before using GUP_TEST and



Right, but first of course they must read every single line of the test code
carefully. And while it is true that you often *do* end up reading most or
all of the test code, there are situations in which you don't need to. We'd
be taking away some of those situations. :)



they must manually enable it if they haven't. That's why I think this patch is
making things better.



...which makes things a little bit worse.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 1/2] tomoyo: Convert get_user_pages*() to pin_user_pages*()

2020-11-07 Thread John Hubbard

On 11/7/20 5:13 PM, Tetsuo Handa wrote:

On 2020/11/08 4:17, John Hubbard wrote:

On 11/7/20 1:04 AM, John Hubbard wrote:

On 11/7/20 12:24 AM, Souptick Joarder wrote:

In 2019, we introduced pin_user_pages*() and now we are converting
get_user_pages*() to the new API as appropriate. [1] & [2] could
be referred for more information. This is case 5 as per document [1].


It turns out that Case 5 can be implemented via a better pattern, as long
as we're just dealing with a page at a time, briefly:

lock_page()
write to page's data
unlock_page()

...which neatly synchronizes with writeback and other fs activities.


Ahem, I left out a key step: set_page_dirty()!

lock_page()
write to page's data
set_page_dirty()
unlock_page()



Excuse me, but Documentation/core-api/pin_user_pages.rst says
"CASE 5: Pinning in order to _write_ to the data within the page"
while tomoyo_dump_page() is for "_read_ the data within the page".
Do we want to convert to pin_user_pages_remote() or lock_page() ?



Sorry, I missed the direction here, was too focused on the Case 5
aspect. Yes. Case 5 (which, again, I think we're about to re-document)
is only about *writing* to data within the page.

So in this case, where it is just reading from the page, I think it's
already from a gup vs pup point of view.

btw, it's not clear to me whether the current code is susceptible to any
sort of problem involving something writing to the page while it
is being dumped (I am curious). But changing from gup to pup wouldn't
fix that, if it were a problem. It's a separate question from this patch.

(Souptick, if you're interested, the Case 5 documentation change and
callsite retrofit is all yours if you want it. Otherwise it's on
my list.)

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS

2020-11-07 Thread John Hubbard

On 11/7/20 4:24 PM, Randy Dunlap wrote:

On 11/7/20 4:03 PM, John Hubbard wrote:

On 11/7/20 2:20 PM, Randy Dunlap wrote:

On 11/7/20 11:16 AM, John Hubbard wrote:

On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote:

...

OK, thanks, I see how you get that list now.

JFTR, those are not 42 independent users of DEBUG_FS. There are lots of &&s
and a few ||s in that list.



You are right, and that means that I have to withdraw my earlier claim that
"42 committers" made such a decision.



xconfig shows this for DEBUG_FS: (13 selects for x86_64 allmodconfig)



...maybe only 13 committers and dependency chain, then. :)


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS

2020-11-07 Thread John Hubbard

On 11/7/20 2:20 PM, Randy Dunlap wrote:

On 11/7/20 11:16 AM, John Hubbard wrote:

On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote:

-Original Message-
From: John Hubbard [mailto:jhubb...@nvidia.com]

...

    config GUP_BENCHMARK
    bool "Enable infrastructure for get_user_pages() and related calls benchmarking"
+    depends on DEBUG_FS



I think "select DEBUG_FS" is better here. "depends on" has the obnoxious
behavior of hiding the choice from you, if the dependencies aren't already met.
Whereas what the developer *really* wants is a no-nonsense activation of the
choice: "enable GUP_BENCHMARK and the debug fs that it requires".



To some extent, I agree with you. But I still think here it is better to use 
"depends on".
According to
https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt

 select should be used with care. select will force
 a symbol to a value without visiting the dependencies.
 By abusing select you are able to select a symbol FOO even
 if FOO depends on BAR that is not set.
 In general use select only for non-visible symbols
 (no prompts anywhere) and for symbols with no dependencies.
 That will limit the usefulness but on the other hand avoid
 the illegal configurations all over.

On the other hand, in kernel there are 78 "depends on DEBUG_FS" and
only 14 "select DEBUG_FS".



You're not looking at the best statistics. Go look at what *already* selects
DEBUG_FS, and you'll find about 50 items.


Sorry, I'm not following you. I see the same 14 "select DEBUG_FS" as Barry.


I ran make menuconfig, and looked at it. Because I want to see the true end
result, and I didn't trust my grep use, given that the system has interlocking
dependencies, and I think one select could end up activating others (yes?).

And sure enough, there are 42 items listed, here they are, cleaned up so that
there is one per line:

ZSMALLOC_STAT [=n]
ZSMALLOC [=m]
BCACHE_CLOSURES_DEBUG [=n]
MD [=y]
BCACHE [=n]
DVB_C8SECTPFE [=n]
MEDIA_SUPPORT [=m]
MEDIA_PLATFORM_SUPPORT [=y]
DVB_PLATFORM_DRIVERS [=n]
PINCT
DRM_I915_DEBUG [=n]
HAS_IOMEM [=y]
EXPERT [=y]
DRM_I915 [=m]
EDAC_DEBUG [=n]
EDAC [=y]
SUNRPC_DEBUG [=n]
NETWORK_FILESYSTEMS [=y]
SUNRPC [=m]
SYSCTL [=y]
PAGE_OWNER [=n]
DEBUG_KERNEL [=y]
STACKTRACE_SUPPORT [=y]
DEBUG_KMEMLEAK [=n]
DEBUG_KERNEL [=y]
HAVE_DEBUG_KMEMLEAK [=y]
BLK_DEV_IO_TRACE [=n]
TRACING_SUPPORT [=y]
FTRACE [=y]
SYSFS [=y]
BLOCK [=y]
PUNIT_ATOM_DEBUG [=n]
PCI [=y]
NOTIFIER_ERROR_INJECTION [=n]
DEBUG_KERNEL [=y]
FAIL_FUTEX [=n]
FAULT_INJECTION [=n]
FUTEX [=y]
KCOV [=n]
ARCH_HAS_KCOV [=y]
CC_HAS_SANCOV_TRACE_PC [=y]
GCC_PLUGINS




In general we don't want any one large "feature" (or subsystem) to be enabled
by one driver. If someone has gone to the trouble to disable DEBUG_FS (or 
whatever),
then a different Kconfig symbol shouldn't undo that.



I agree with the "in general" point, yes. And my complaint is really 80% due to 
the
very unhappy situation with Kconfig, where we seem to get a choice between 
*hiding*
a feature, or forcing a dependency break. What we really want is a way to 
indicate
a dependency that doesn't hide entire features, unless we want that. (Maybe I 
should
attempt to get into the implementation, although I suspect it's harder than I
realize.)

But the other 20% of my complaint is, given what we have, I think the 
appropriate
adaptation for GUP_BENCHMARK's relationship to DEBUG_FS *in particular*, is: 
select.

And 42 other committers have chosen the same thing, for their relationship to
DEBUG_FS. I'm in good company.

But if you really disagree, then I'd go with, just drop the patch entirely, 
because
it doesn't really make things better as written...IMHO anyway. :)

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 1/2] tomoyo: Convert get_user_pages*() to pin_user_pages*()

2020-11-07 Thread John Hubbard

On 11/7/20 1:04 AM, John Hubbard wrote:

On 11/7/20 12:24 AM, Souptick Joarder wrote:

In 2019, we introduced pin_user_pages*() and now we are converting
get_user_pages*() to the new API as appropriate. [1] & [2] could
be referred for more information. This is case 5 as per document [1].


It turns out that Case 5 can be implemented via a better pattern, as long
as we're just dealing with a page at a time, briefly:

lock_page()
write to page's data
unlock_page()

...which neatly synchronizes with writeback and other fs activities.


Ahem, I left out a key step: set_page_dirty()!

lock_page()
write to page's data
set_page_dirty()
unlock_page()
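
For a single page obtained via get_user_pages*(), that pattern would look
roughly like this (an illustrative sketch only; "offset", "src" and "len" are
stand-ins, not taken from this thread):

	void *addr;

	lock_page(page);
	addr = kmap(page);
	memcpy(addr + offset, src, len);
	kunmap(page);
	set_page_dirty(page);
	unlock_page(page);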


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS

2020-11-07 Thread John Hubbard

On 11/7/20 11:05 AM, Song Bao Hua (Barry Song) wrote:

-Original Message-
From: John Hubbard [mailto:jhubb...@nvidia.com]

...

   config GUP_BENCHMARK
bool "Enable infrastructure for get_user_pages() and related calls

benchmarking"

+   depends on DEBUG_FS



I think "select DEBUG_FS" is better here. "depends on" has the obnoxious
behavior of hiding the choice from you, if the dependencies aren't already met.
Whereas what the developer *really* wants is a no-nonsense activation of the
choice: "enable GUP_BENCHMARK and the debug fs that it requires".



To some extent, I agree with you. But I still think here it is better to use 
"depends on".
According to
https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt

select should be used with care. select will force
a symbol to a value without visiting the dependencies.
By abusing select you are able to select a symbol FOO even
if FOO depends on BAR that is not set.
In general use select only for non-visible symbols
(no prompts anywhere) and for symbols with no dependencies.
That will limit the usefulness but on the other hand avoid
the illegal configurations all over.

On the other hand, in kernel there are 78 "depends on DEBUG_FS" and
only 14 "select DEBUG_FS".



You're not looking at the best statistics. Go look at what *already* selects
DEBUG_FS, and you'll find about 50 items.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 1/2] tomoyo: Convert get_user_pages*() to pin_user_pages*()

2020-11-07 Thread John Hubbard

On 11/7/20 12:24 AM, Souptick Joarder wrote:

In 2019, we introduced pin_user_pages*() and now we are converting
get_user_pages*() to the new API as appropriate. [1] & [2] could
be referred for more information. This is case 5 as per document [1].


It turns out that Case 5 can be implemented via a better pattern, as long
as we're just dealing with a page at a time, briefly:

lock_page()
write to page's data
unlock_page()

...which neatly synchronizes with writeback and other fs activities.

I was going to track down the Case 5's and do that [1].

+CC Jan and Matthew, to keep us on the straight and narrow, just in case
I'm misunderstanding something.

[1] https://lore.kernel.org/r/e78fb7af-627b-ce80-275e-51f97f1f3...@nvidia.com

thanks,
--
John Hubbard
NVIDIA



[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
 https://lwn.net/Articles/807108/

Signed-off-by: Souptick Joarder 
Cc: John Hubbard 
---
  security/tomoyo/domain.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
index dc4ecc0..bd748be 100644
--- a/security/tomoyo/domain.c
+++ b/security/tomoyo/domain.c
@@ -914,7 +914,7 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned 
long pos,
 * (represented by bprm).  'current' is the process doing
 * the execve().
 */
-   if (get_user_pages_remote(bprm->mm, pos, 1,
+   if (pin_user_pages_remote(bprm->mm, pos, 1,
FOLL_FORCE, &page, NULL, NULL) <= 0)
return false;
  #else
@@ -936,7 +936,7 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned 
long pos,
}
/* Same with put_arg_page(page) in fs/exec.c */
  #ifdef CONFIG_MMU
-   put_page(page);
+   unpin_user_page(page);
  #endif
return true;
  }



Re: [PATCH 2/2] tomoyo: Fixed typo in documentation

2020-11-07 Thread John Hubbard

On 11/7/20 12:24 AM, Souptick Joarder wrote:

Fixed typo s/Poiner/Pointer

Fixes: 5b636857fee6 ("TOMOYO: Allow using argv[]/envp[] of execve() as 
conditions.")
Signed-off-by: Souptick Joarder 
Cc: John Hubbard 
---
  security/tomoyo/domain.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
index bd748be..7b2babe 100644
--- a/security/tomoyo/domain.c
+++ b/security/tomoyo/domain.c
@@ -891,7 +891,7 @@ int tomoyo_find_next_domain(struct linux_binprm *bprm)
   *
   * @bprm: Pointer to "struct linux_binprm".
   * @pos:  Location to dump.
- * @dump: Poiner to "struct tomoyo_page_dump".
+ * @dump: Pointer to "struct tomoyo_page_dump".


Not worth a separate patch, especially since the original comment is merely
copying the C sources, and as such, does not add any value.

I'd either a) craft a new documentation line that adds some value, or b) just
merge this patch into the previous one, and make a note in the commit
description to the effect that you've included a trivial typo fix as long
as you're there.


thanks,
--
John Hubbard
NVIDIA


   *
   * Returns true on success, false otherwise.
   */



Re: [PATCH] mm/gup_benchmark: GUP_BENCHMARK depends on DEBUG_FS

2020-11-06 Thread John Hubbard

On 11/4/20 2:05 AM, Barry Song wrote:

Without DEBUG_FS, all the code in gup_benchmark becomes meaningless.
For sure kernel provides debugfs stub while DEBUG_FS is disabled, but
the point here is that GUP_BENCHMARK can do nothing without DEBUG_FS.

Cc: John Hubbard 
Cc: Ralph Campbell 
Inspired-by: John Garry 
Signed-off-by: Barry Song 
---
  * inspired by John's comment in this patch:
  
https://lore.kernel.org/linux-iommu/184797b8-512e-e3da-fae7-25c7d6626...@huawei.com/

  mm/Kconfig | 1 +
  1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index d42423f..91fa923 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -836,6 +836,7 @@ config PERCPU_STATS
  
  config GUP_BENCHMARK
	bool "Enable infrastructure for get_user_pages() and related calls benchmarking"
+	depends on DEBUG_FS



I think "select DEBUG_FS" is better here. "depends on" has the obnoxious 
behavior
of hiding the choice from you, if the dependencies aren't already met. Whereas 
what
the developer *really* wants is a no-nonsense activation of the choice: "enable
GUP_BENCHMARK and the debug fs that it requires".

So "depends on" really is better for things that you just can't control, such as
the cpu arch you're on, etc.
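
In other words, something like this (just a sketch of the alternative being
suggested, not an actual patch):

config GUP_BENCHMARK
	bool "Enable infrastructure for get_user_pages() and related calls benchmarking"
	select DEBUG_FS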

Also note that this will have a minor merge conflict with mmotm, due to the
renaming to GUP_TEST. No big deal though.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM

2020-11-05 Thread John Hubbard

On 11/5/20 4:49 AM, Jason Gunthorpe wrote:

On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:

/*
  * If we can't determine whether or not a pte is special, then fail immediately
  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
  * to be special.
  *
  * For a futex to be placed on a THP tail page, get_futex_key requires a
  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
  * useful to have gup_huge_pmd even if we can't operate on ptes.
  */


We support hugepage faults in gpu drivers since recently, and I'm not
seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
just me missing something again.


It means ioremap can't create an IO page PUD, it has to be broken up.

Does ioremap even create anything larger than PTEs?



From my reading, yes. See ioremap_try_huge_pmd().

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM

2020-11-04 Thread John Hubbard

On 11/4/20 10:17 AM, Jason Gunthorpe wrote:

On Wed, Nov 04, 2020 at 04:41:19PM +, Christoph Hellwig wrote:

On Wed, Nov 04, 2020 at 04:37:58PM +, Christoph Hellwig wrote:

On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:

What we're discussing is whether gup_fast and pup_fast also obey this,
or fall over and can give you the struct page that's backing the
dma_mmap_* memory. Since the _fast variant doesn't check for
vma->vm_flags, and afaict that's the only thing which closes this gap.
And like you restate, that would be a bit a problem. So where's that
check which Jason aren't spotting?


remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
errors out on pte_special.  Of course this only works for the
CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
a real problem.


Except that we don't really support pte-level gup-fast without
CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.


Mm, I thought it was probably the special flag..

Knowing that CONFIG_HAVE_FAST_GUP can't be set without
CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
the Kconfig?

config HAVE_FAST_GUP
 depends on MMU
 depends on ARCH_HAS_PTE_SPECIAL
 bool


Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
gup-fast is not *completely* unavailable there, so I don't think you want
to shut it off like that:

/*
 * If we can't determine whether or not a pte is special, then fail immediately
 * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
 * to be special.
 *
 * For a futex to be placed on a THP tail page, get_futex_key requires a
 * get_user_pages_fast_only implementation that can pin pages. Thus it's still
 * useful to have gup_huge_pmd even if we can't operate on ptes.
 */
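
For reference, the !CONFIG_ARCH_HAS_PTE_SPECIAL stub that sits under that
comment is roughly the following (paraphrased sketch, not a verbatim quote of
mm/gup.c):

	static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
				 unsigned int flags, struct page **pages, int *nr)
	{
		/* cannot tell special ptes apart here, so never fast-gup pte-mapped pages */
		return 0;
	}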


thanks,
--
John Hubbard
NVIDIA

