Re: [PATCH 01/11] mm/page_isolation: prefer the node of the source page

2020-05-20 Thread Joonsoo Kim
On Thu, May 21, 2020 at 9:37 AM, Roman Gushchin wrote:
>
> On Mon, May 18, 2020 at 10:20:47AM +0900, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > For locality, it's better to migrate the page to the same node
> > rather than the node of the current caller's cpu.
> >
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Roman Gushchin 

Thanks for review!

> > ---
> >  mm/page_isolation.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 2c11a38..7df89bd 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -300,5 +300,7 @@ int test_pages_isolated(unsigned long start_pfn, 
> > unsigned long end_pfn,
> >
> >  struct page *alloc_migrate_target(struct page *page, unsigned long private)
> >  {
> > - return new_page_nodemask(page, numa_node_id(),
> > &node_states[N_MEMORY]);
> > + int nid = page_to_nid(page);
> > +
> > + return new_page_nodemask(page, nid, &node_states[N_MEMORY]);
>
> Why not new_page_nodemask(page, page_to_nid(page), &node_states[N_MEMORY]) ?
> Still fits into 80 characters.

It's just my preference to avoid using a function call directly in an
argument list where possible.
If you prefer, I will change it as you suggested. However,
since alloc_migrate_target() will be removed in a following patch,
there will be no difference in the end.

Thanks.


Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

2020-05-10 Thread Joonsoo Kim
On Fri, May 08, 2020 at 12:01:22PM -0400, Johannes Weiner wrote:
> On Thu, Apr 23, 2020 at 02:25:06PM +0900, Joonsoo Kim wrote:
> > On Wed, Apr 22, 2020 at 08:09:46AM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 22, 2020 at 03:40:41PM +0900, Joonsoo Kim wrote:
> > > > On Mon, Apr 20, 2020 at 06:11:13PM -0400, Johannes Weiner wrote:
> > > > > @@ -1664,29 +1678,22 @@ static int shmem_swapin_page(struct inode 
> > > > > *inode, pgoff_t index,
> > > > >   goto failed;
> > > > >   }
> > > > >  
> > > > > - error = mem_cgroup_try_charge_delay(page, charge_mm, gfp,
> > > > > &memcg);
> > > > > - if (!error) {
> > > > > - error = shmem_add_to_page_cache(page, mapping, index,
> > > > > - 
> > > > > swp_to_radix_entry(swap), gfp);
> > > > > - /*
> > > > > -  * We already confirmed swap under page lock, and make
> > > > > -  * no memory allocation here, so usually no possibility
> > > > > -  * of error; but free_swap_and_cache() only trylocks a
> > > > > -  * page, so it is just possible that the entry has been
> > > > > -  * truncated or holepunched since swap was confirmed.
> > > > > -  * shmem_undo_range() will have done some of the
> > > > > -  * unaccounting, now delete_from_swap_cache() will do
> > > > > -  * the rest.
> > > > > -  */
> > > > > - if (error) {
> > > > > - mem_cgroup_cancel_charge(page, memcg);
> > > > > - delete_from_swap_cache(page);
> > > > > - }
> > > > > - }
> > > > > - if (error)
> > > > > + error = shmem_add_to_page_cache(page, mapping, index,
> > > > > + swp_to_radix_entry(swap), gfp,
> > > > > + charge_mm);
> > > > > + /*
> > > > > +  * We already confirmed swap under page lock, and make no
> > > > > +  * memory allocation here, so usually no possibility of error;
> > > > > +  * but free_swap_and_cache() only trylocks a page, so it is
> > > > > +  * just possible that the entry has been truncated or
> > > > > +  * holepunched since swap was confirmed.  shmem_undo_range()
> > > > > +  * will have done some of the unaccounting, now
> > > > > +  * delete_from_swap_cache() will do the rest.
> > > > > +  */
> > > > > + if (error) {
> > > > > + delete_from_swap_cache(page);
> > > > >   goto failed;
> > > > 
> > > > -EEXIST (from swap cache) and -ENOMEM (from memcg) should be handled
> > > > differently. delete_from_swap_cache() is for -EEXIST case.
> > > 
> > > Good catch, I accidentally changed things here.
> > > 
> > > I was just going to change it back, but now I'm trying to understand
> > > how it actually works.
> > > 
> > > Who is removing the page from swap cache if shmem_undo_range() races
> > > but we fail to charge the page?
> > > 
> > > Here is how this race is supposed to be handled: The page is in the
> > > swapcache, we have it locked and confirmed that the entry in i_pages
> > > is indeed a swap entry. We charge the page, then we try to replace the
> > > swap entry in i_pages with the actual page. If we determine, under
> > > tree lock now, that shmem_undo_range has raced with us, unaccounted
> > > the swap space, but must have failed to get the page lock, we remove
> > > the page from swap cache on our side, to free up swap slot and page.
> > > 
> > > But what if shmem_undo_range() raced with us, deleted the swap entry
> > > from i_pages while we had the page locked, but then we simply failed
> > > to charge? We unlock the page and return -EEXIST (shmem_confirm_swap
> > > at the exit). The page with its userdata is now in swapcache, but no
> > > corresponding swap entry in i_pages. shmem_getpage_gfp() sees the
> > > -EEXIST, retries, finds nothing in i_pages and allocates a new, empty
> > > page.
> > > 
> > > Aren't we leaking the swap slot and the page?

Re: [PATCH v2 03/10] kexec: separate PageHighMem() and PageHighMemZone() use case

2020-05-05 Thread Joonsoo Kim
On Mon, May 04, 2020 at 09:03:56AM -0500, Eric W. Biederman wrote:
> 
> I have added in the kexec mailling list.
> 
> Looking at the patch we are discussing it appears that the kexec code
> could be doing much better in highmem situations today but is not.

Sounds great!

> 
> 
> Joonsoo Kim  writes:
> 
> > On Fri, May 1, 2020 at 11:06 PM, Eric W. Biederman wrote:
> >>
> >> js1...@gmail.com writes:
> >>
> >> > From: Joonsoo Kim 
> >> >
> >> > Until now, PageHighMem() is used for two different cases. One is to check
> >> > if there is a direct mapping for this page or not. The other is to check
> >> > the zone of this page, that is, whether it is the highmem type zone or
> >> > not.
> >> >
> >> > Now, we have separate functions, PageHighMem() and PageHighMemZone() for
> >> > each case. Use the appropriate one.
> >> >
> >> > Note that there are some rules to determine the proper macro.
> >> >
> >> > 1. If PageHighMem() is called for checking if the direct mapping exists
> >> > or not, use PageHighMem().
> >> > 2. If PageHighMem() is used to predict the previous gfp_flags for
> >> > this page, use PageHighMemZone(). The zone of the page is related to
> >> > the gfp_flags.
> >> > 3. If the purpose of calling PageHighMem() is to count highmem pages and
> >> > to interact with the system by using this count, use PageHighMemZone().
> >> > This counter is usually used to calculate the available memory for a
> >> > kernel allocation, and pages in the highmem zone cannot be available
> >> > for a kernel allocation.
> >> > 4. Otherwise, use PageHighMemZone(). It's safe since its implementation
> >> > is just a copy of the previous PageHighMem() implementation and won't
> >> > be changed.
> >> >
> >> > I apply the rule #2 for this patch.
> >>
> >> Hmm.
> >>
> >> What happened to the notion of deprecating and reducing the usage of
> >> highmem?  I know that we have some embedded architectures where it is
> >> still important but this feels like it flies in the face of that.
> >
> > AFAIK, deprecating highmem requires some more time and, before then,
> > we need to support it.
> 
> But it at least makes sense to look at what we are doing with highmem
> and ask if it makes sense.
> 
> >> This part of kexec would be much more maintainable if it had a proper
> >> mm layer helper that tested to see if the page matched the passed in
> >> gfp flags.  That way the mm layer could keep changing and doing weird
> >> gyrations and this code would not care.
> >
> > Good idea! I will do it.
> >
> >>
> >> What would be really helpful is if there was a straight forward way to
> >> allocate memory whose physical address fits in the native word size.
> >>
> >>
> >> All I know for certain about this patch is that it takes a piece of code
> >> that looked like it made sense, and transforms it into something I can
> >> not easily verify, and can not maintain.
> >
> > Although I decided to make a helper as you described above, I don't
> > understand why you think that the new code isn't maintainable. It is just
> > the same thing with a different name. Could you elaborate more on why you
> > think so?
> 
> Because the current code is already wrong.  It does not handle
> the general case of what it claims to handle.  When the only distinction
> that needs to be drawn is highmem or not highmem that is likely fine.
> But now you are making it possible to draw more distinctions.  At which
> point I have no idea which distinction needs to be drawn.
> 
> 
> The code and the logic are about 20 years old.  When it was written I
> don't recall taking numa seriously and the kernel only had 3 zones
> as I recall (DMA aka the now deprecated GFP_DMA, NORMAL, and HIGH).
> 
> The code attempts to work around limitations of those old zones and play
> nice in a highmem world by allocating HIGH memory and not using
> it if the memory was above 4G ( on 32bit ).
> 
> Looking, the kernel now has GFP_DMA32 so on 32bit with highmem we should
> probably be using that, when allocating memory.
> 

From a quick investigation, unfortunately, ZONE_DMA32 isn't available on
x86 32bit now, so using GFP_DMA32 to allocate memory below 4G would not
work. Enabling ZONE_DMA32 on x86 32bit would not be simple, so, IMHO, it
would be better to leave the code as it is.

> 
> 
> Furth

Re: [PATCH v2 03/10] kexec: separate PageHighMem() and PageHighMemZone() use case

2020-05-03 Thread Joonsoo Kim
On Fri, May 1, 2020 at 11:06 PM, Eric W. Biederman wrote:
>
> js1...@gmail.com writes:
>
> > From: Joonsoo Kim 
> >
> > Until now, PageHighMem() is used for two different cases. One is to check
> > if there is a direct mapping for this page or not. The other is to check
> > the zone of this page, that is, whether it is the highmem type zone or not.
> >
> > Now, we have separate functions, PageHighMem() and PageHighMemZone() for
> > each case. Use the appropriate one.
> >
> > Note that there are some rules to determine the proper macro.
> >
> > 1. If PageHighMem() is called for checking if the direct mapping exists
> > or not, use PageHighMem().
> > 2. If PageHighMem() is used to predict the previous gfp_flags for
> > this page, use PageHighMemZone(). The zone of the page is related to
> > the gfp_flags.
> > 3. If the purpose of calling PageHighMem() is to count highmem pages and
> > to interact with the system by using this count, use PageHighMemZone().
> > This counter is usually used to calculate the available memory for a
> > kernel allocation, and pages in the highmem zone cannot be available
> > for a kernel allocation.
> > 4. Otherwise, use PageHighMemZone(). It's safe since its implementation
> > is just a copy of the previous PageHighMem() implementation and won't
> > be changed.
> >
> > I apply the rule #2 for this patch.
>
> Hmm.
>
> What happened to the notion of deprecating and reducing the usage of
> highmem?  I know that we have some embedded architectures where it is
> still important but this feels like it flies in the face of that.

AFAIK, deprecating highmem requires some more time and, before then,
we need to support it.

>
> This part of kexec would be much more maintainable if it had a proper
> mm layer helper that tested to see if the page matched the passed in
> gfp flags.  That way the mm layer could keep changing and doing weird
> gyrations and this code would not care.

Good idea! I will do it.
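
Something along these lines, perhaps? This is only a minimal sketch; the
helper name and the exact check are my assumption, not something from this
series:

static inline bool page_matches_gfp_zone(struct page *page, gfp_t gfp_mask)
{
	/*
	 * Does the zone this page sits in fall within what the given gfp
	 * flags could have allocated from?
	 */
	return zone_idx(page_zone(page)) <= gfp_zone(gfp_mask);
}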

>
> What would be really helpful is if there was a straight forward way to
> allocate memory whose physical address fits in the native word size.
>
>
> All I know for certain about this patch is that it takes a piece of code
> that looked like it made sense, and transforms it into something I can
> not easily verify, and can not maintain.

Although I decided to make a helper as you described above, I don't
understand why you think that the new code isn't maintainable. It is just
the same thing with a different name. Could you elaborate more on why you
think so?

Thanks.


Re: [PATCH v2 00/10] change the implementation of the PageHighMem()

2020-05-03 Thread Joonsoo Kim
On Fri, May 1, 2020 at 9:34 PM, Christoph Hellwig wrote:
>
> On Fri, May 01, 2020 at 09:15:30PM +0900, Joonsoo Kim wrote:
> > I think that PageHighMemZone() is long and complicated enough to have
> > a macro.
>
> It is.  But then again it also shouldn't really be used by anything
> but MM internals.

I'm not sure that we can make it MM internal but I will try.

> >
> > PageHighMemZone(page) = is_highmem_idx(zone_idx(page_zone(page)))
> >
> > Instead of open-coding it, how about changing the style of the macro to
> > something like page_from_highmem()? What PageHighMemZone() represents is a
> > derived attribute of the page, so the PageXXX() style may not be appropriate.
>
> Maybe page_is_highmem_zone() with a big kerneldoc comment explaining
> the use case?  Bonus points of killing enough users that it can be
> in mm/internal.h.

I will try to kill page_is_highmem_zone() as much as possible in the next version.
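
For illustration, the helper you describe might look roughly like this; the
kerneldoc wording and the mm/internal.h placement are my guesses, not part of
the series:

/**
 * page_is_highmem_zone - is this page in a highmem-type zone?
 * @page: the page to test
 *
 * Returns true if the page sits in a zone that is treated as highmem
 * (including ZONE_MOVABLE on CONFIG_HIGHMEM), regardless of whether the
 * page itself happens to have a direct mapping.
 */
static inline bool page_is_highmem_zone(struct page *page)
{
	return is_highmem_idx(page_zonenum(page));
}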

Thanks.


Re: [PATCH v2 07/10] mm: separate PageHighMem() and PageHighMemZone() use case

2020-05-03 Thread Joonsoo Kim
On Fri, May 1, 2020 at 9:30 PM, Christoph Hellwig wrote:
>
> On Wed, Apr 29, 2020 at 12:26:40PM +0900, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > Until now, PageHighMem() is used for two different cases. One is to check
> > if there is a direct mapping for this page or not. The other is to check
> > the zone of this page, that is, whether it is the highmem type zone or not.
> >
> > Now, we have separate functions, PageHighMem() and PageHighMemZone() for
> > each case. Use the appropriate one.
> >
> > Note that there are some rules to determine the proper macro.
> >
> > 1. If PageHighMem() is called for checking if the direct mapping exists
> > or not, use PageHighMem().
> > 2. If PageHighMem() is used to predict the previous gfp_flags for
> > this page, use PageHighMemZone(). The zone of the page is related to
> > the gfp_flags.
> > 3. If the purpose of calling PageHighMem() is to count highmem pages and
> > to interact with the system by using this count, use PageHighMemZone().
> > This counter is usually used to calculate the available memory for a
> > kernel allocation, and pages in the highmem zone cannot be available
> > for a kernel allocation.
> > 4. Otherwise, use PageHighMemZone(). It's safe since its implementation
> > is just a copy of the previous PageHighMem() implementation and won't
> > be changed.
> >
> > I apply the rule #3 for this patch.
> >
> > Acked-by: Roman Gushchin 
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  mm/memory_hotplug.c | 2 +-
> >  mm/page_alloc.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 555137b..891c214 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -593,7 +593,7 @@ void generic_online_page(struct page *page, unsigned 
> > int order)
> >   __free_pages_core(page, order);
> >   totalram_pages_add(1UL << order);
> >  #ifdef CONFIG_HIGHMEM
> > - if (PageHighMem(page))
> > + if (PageHighMemZone(page))
> >   totalhigh_pages_add(1UL << order);
> >  #endif
> >  }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index fc5919e..7fe5115 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -7444,7 +7444,7 @@ void adjust_managed_page_count(struct page *page, 
> > long count)
> >   atomic_long_add(count, _zone(page)->managed_pages);
> >   totalram_pages_add(count);
> >  #ifdef CONFIG_HIGHMEM
> > - if (PageHighMem(page))
> > + if (PageHighMemZone(page))
> >   totalhigh_pages_add(count);
> >  #endif
>
> This function already uses the page_zone structure above, I think
> life would be easier if you compare against that, as that makes
> the code more obvious.

If I can kill the PageHighMemZone() macro entirely, I will use page_zone()
above. However, if that's not possible, I will leave it as it is. That would
be simpler than your suggestion.

Thanks.


Re: [PATCH v2 06/10] mm/hugetlb: separate PageHighMem() and PageHighMemZone() use case

2020-05-03 Thread Joonsoo Kim
On Fri, May 1, 2020 at 9:26 PM, Christoph Hellwig wrote:
>
> On Wed, Apr 29, 2020 at 12:26:39PM +0900, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > Until now, PageHighMem() is used for two different cases. One is to check
> > if there is a direct mapping for this page or not. The other is to check
> > the zone of this page, that is, whether it is the highmem type zone or not.
> >
> > Now, we have separate functions, PageHighMem() and PageHighMemZone() for
> > each case. Use the appropriate one.
> >
> > Note that there are some rules to determine the proper macro.
> >
> > 1. If PageHighMem() is called for checking if the direct mapping exists
> > or not, use PageHighMem().
> > 2. If PageHighMem() is used to predict the previous gfp_flags for
> > this page, use PageHighMemZone(). The zone of the page is related to
> > the gfp_flags.
> > 3. If the purpose of calling PageHighMem() is to count highmem pages and
> > to interact with the system by using this count, use PageHighMemZone().
> > This counter is usually used to calculate the available memory for a
> > kernel allocation, and pages in the highmem zone cannot be available
> > for a kernel allocation.
> > 4. Otherwise, use PageHighMemZone(). It's safe since its implementation
> > is just a copy of the previous PageHighMem() implementation and won't
> > be changed.
> >
> > I apply the rule #3 for this patch.
> >
> > Acked-by: Roman Gushchin 
> > Signed-off-by: Joonsoo Kim 
>
> Why do we care about the zone here?  This only cares about having
> kernel direct mapped pages as far as I can tell.

My understanding is that what we want to do here is to first free memory
that can be used for kernel allocations. If a direct mapped page is in a zone
that cannot be used for kernel allocations, there is no point in freeing
this page first. So, we need to take care of the zone here.

Thanks.


Re: [PATCH v2 05/10] mm/gup: separate PageHighMem() and PageHighMemZone() use case

2020-05-03 Thread Joonsoo Kim
On Fri, May 1, 2020 at 9:24 PM, Christoph Hellwig wrote:
>
> On Wed, Apr 29, 2020 at 12:26:38PM +0900, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > Until now, PageHighMem() is used for two different cases. One is to check
> > if there is a direct mapping for this page or not. The other is to check
> > the zone of this page, that is, whether it is the highmem type zone or not.
> >
> > Now, we have separate functions, PageHighMem() and PageHighMemZone() for
> > each case. Use the appropriate one.
> >
> > Note that there are some rules to determine the proper macro.
> >
> > 1. If PageHighMem() is called for checking if the direct mapping exists
> > or not, use PageHighMem().
> > 2. If PageHighMem() is used to predict the previous gfp_flags for
> > this page, use PageHighMemZone(). The zone of the page is related to
> > the gfp_flags.
> > 3. If the purpose of calling PageHighMem() is to count highmem pages and
> > to interact with the system by using this count, use PageHighMemZone().
> > This counter is usually used to calculate the available memory for a
> > kernel allocation, and pages in the highmem zone cannot be available
> > for a kernel allocation.
> > 4. Otherwise, use PageHighMemZone(). It's safe since its implementation
> > is just a copy of the previous PageHighMem() implementation and won't
> > be changed.
> >
> > I apply the rule #2 for this patch.
> >
> > Acked-by: Roman Gushchin 
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  mm/gup.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 11fda53..9652eed 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1608,7 +1608,7 @@ static struct page *new_non_cma_page(struct page 
> > *page, unsigned long private)
> >*/
> >   gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
> >
> > - if (PageHighMem(page))
> > + if (PageHighMemZone(page))
> >   gfp_mask |= __GFP_HIGHMEM;
>
> I think this wants to stay PageHighMem.  This migrates CMA pages to
> other places before doing a long term pin.  Anything that didn't have
> a direct mapping before won't need one for the new page, which could
> also include non-highmem zones without a highmem mapping.

What we want to do here is to guess the allocation gfp flags of the original
page in order to allocate a new page with the most relaxed gfp flags. That
depends on the zone the page belongs to rather than on the existence of a
direct mapping. Until now, the existence of a direct mapping has implied the
type of zone, so there is no problem.

After my future CMA patchset, a direct mapped CMA page can be on ZONE_MOVABLE.
And a page on ZONE_MOVABLE should be allocated with __GFP_HIGHMEM |
__GFP_MOVABLE. So, the most relaxed gfp flags for this CMA page would include
__GFP_HIGHMEM. If PageHighMem() is used here, __GFP_HIGHMEM would be lost
since this CMA page has a direct mapping.

Therefore, PageHighMemZone() is the right one here.

Anyway, I saw Eric's comment in another e-mail that an abstraction is needed
to guess the gfp flags of the original page, and I agree with it. This site
can also benefit from such a change.
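
As a rough illustration of that zone-based guess (just a sketch reusing the
names from the quoted new_non_cma_page(); the real abstraction Eric suggested
may look different):

	gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;

	if (PageHighMemZone(page))		/* highmem-type zone, incl. ZONE_MOVABLE */
		gfp_mask |= __GFP_HIGHMEM;
	if (page_zonenum(page) == ZONE_MOVABLE)	/* keep movable pages movable */
		gfp_mask |= __GFP_MOVABLE;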

Thanks.


Re: [PATCH v2 04/10] power: separate PageHighMem() and PageHighMemZone() use case

2020-05-03 Thread Joonsoo Kim
On Fri, May 1, 2020 at 9:22 PM, Christoph Hellwig wrote:
>
> On Wed, Apr 29, 2020 at 12:26:37PM +0900, js1...@gmail.com wrote:
> > index 6598001..be759a6 100644
> > --- a/kernel/power/snapshot.c
> > +++ b/kernel/power/snapshot.c
> > @@ -1227,7 +1227,7 @@ static struct page *saveable_highmem_page(struct zone 
> > *zone, unsigned long pfn)
> >   if (!page || page_zone(page) != zone)
> >   return NULL;
> >
> > - BUG_ON(!PageHighMem(page));
> > + BUG_ON(!PageHighMemZone(page));
>
> The above check already checks for the highmem zone.  So if we want
> to keep the BUG_ON it needs to stay PageHighMem to make sense.  That being
> said I'd rather remove it entirely.

Okay.

> > - BUG_ON(PageHighMem(page));
> > + BUG_ON(PageHighMemZone(page));
>
> Same here.

Okay.

Thanks.


Re: [PATCH v2 00/10] change the implementation of the PageHighMem()

2020-05-01 Thread Joonsoo Kim
On Fri, May 1, 2020 at 7:55 PM, Christoph Hellwig wrote:
>
> On Fri, May 01, 2020 at 07:52:35PM +0900, Joonsoo Kim wrote:
> > > - New code will pop up which gets it wrong and nobody will notice for
> > >   a long time.
> >
> > Hmm... I think that it's not that hard to decide on the correct macro. If we
> > rename PageHighMem() to PageDirectMapped(), then PageDirectMapped() and
> > PageHighMemZone() are self-explanatory macros. There would be no
> > confusion in using them.
>
> What confuses me is why we even need PageHighMemZone - mostly code
> should not care about particular zones.  Maybe just open coding
> PageHighMemZone makes more sense - it is a little more cumbersome, but
> at least it makes it explicit what we check for.  I already sent you
> an incremental diff for one obvious place, but maybe we need to look
> through the remaining ones if we can kill them or open code them in an
> obvious way.

I think that PageHighMemZone() is long and complicated enough to have
a macro.

PageHighMemZone(page) = is_highmem_idx(zone_idx(page_zone(page)))

Instead of open-coding it, how about changing the style of the macro to
something like page_from_highmem()? What PageHighMemZone() represents is a
derived attribute of the page, so the PageXXX() style may not be appropriate.

Thanks.


Re: [PATCH v2 00/10] change the implementation of the PageHighMem()

2020-05-01 Thread Joonsoo Kim
On Thu, Apr 30, 2020 at 10:47 AM, Andrew Morton wrote:
>
> On Wed, 29 Apr 2020 12:26:33 +0900 js1...@gmail.com wrote:
>
> > From: Joonsoo Kim 
> >
> > Changes on v2
> > - add "acked-by", "reviewed-by" tags
> > - replace PageHighMem() with open code, instead of using the
> > new PageHighMemZone() macro. The related file is "include/linux/migrate.h"
> >
> > Hello,
> >
> > This patchset separates two use cases of PageHighMem() by introducing
> > PageHighMemZone() macro. And, it changes the implementation of
> > PageHighMem() to reflect the actual meaning of this macro. This patchset
> > is a preparation step for the patchset,
> > "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE" [1].
> >
> > PageHighMem() is used for two different cases. One is to check if there
> > is a direct mapping for this page or not. The other is to check the
> > zone of this page, that is, whether it is the highmem type zone or not.
> >
> > Until now, both cases have been exactly the same thing. So, the
> > implementation of PageHighMem() uses the check of whether the zone of the
> > page is the highmem type zone or not.
> >
> > "#define PageHighMem(__p) is_highmem_idx(page_zonenum(__p))"
> >
> > ZONE_MOVABLE is special. It is considered a normal type zone on
> > !CONFIG_HIGHMEM, but it is considered a highmem type zone
> > on CONFIG_HIGHMEM. Let's focus on the latter case. In the latter case, all
> > pages on ZONE_MOVABLE have had no direct mapping until now.
> >
> > However, the following patchset,
> > "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE",
> > which was once merged and then reverted, will be tried again and will break
> > the assumption that all pages on ZONE_MOVABLE have no direct mapping.
> > Hence, ZONE_MOVABLE, which is considered a highmem type zone, could
> > have both types of pages, direct mapped and not. Since
> > ZONE_MOVABLE could have both types of pages, __GFP_HIGHMEM is still
> > required to allocate memory from it. And we conservatively need to
> > consider ZONE_MOVABLE as a highmem type zone.
> >
> > Even in this situation, PageHighMem() for pages on ZONE_MOVABLE,
> > when it is called for checking the direct mapping, should return the
> > correct result. The current implementation of PageHighMem() just returns
> > TRUE if the zone of the page is a highmem type zone. So, it could be wrong
> > if a page on ZONE_MOVABLE is actually direct mapped.
> >
> > To solve this potential problem, this patch introduces a new
> > PageHighMemZone() macro. In following patches, two use cases of
> > PageHighMem() are separated by calling proper macro, PageHighMem() and
> > PageHighMemZone(). Then, implementation of PageHighMem() will be changed
> > as just checking if the direct mapping exists or not, regardless of
> > the zone of the page.
> >
> > Note that there are some rules to determine the proper macro.
> >
> > 1. If PageHighMem() is called for checking if the direct mapping exists
> > or not, use PageHighMem().
> > 2. If PageHighMem() is used to predict the previous gfp_flags for
> > this page, use PageHighMemZone(). The zone of the page is related to
> > the gfp_flags.
> > 3. If the purpose of calling PageHighMem() is to count highmem pages and
> > to interact with the system by using this count, use PageHighMemZone().
> > This counter is usually used to calculate the available memory for a
> > kernel allocation, and pages in the highmem zone cannot be available
> > for a kernel allocation.
> > 4. Otherwise, use PageHighMemZone(). It's safe since its implementation
> > is just a copy of the previous PageHighMem() implementation and won't
> > be changed.
>
> hm, this won't improve maintainability :(
>
> - Everyone will need to remember when to use PageHighMem() and when
>   to use PageHighMemZone().  If they get it wrong, they're unlikely to
>   notice any problem in their runtime testing, correct?
>
> - New code will pop up which gets it wrong and nobody will notice for
>   a long time.

Hmm... I think that it's not that hard to decide on the correct macro. If we
rename PageHighMem() to PageDirectMapped(), then PageDirectMapped() and
PageHighMemZone() are self-explanatory macros. There would be no
confusion in using them.

> So I guess we need to be pretty confident that the series "mm/cma:
> manage the memory of the CMA area by using the ZONE_MOVABLE" will be
> useful and merged before proceeding with this, yes?

Yes and my assumption is that we (MM) have agreed with 

Re: [PATCH 16/18] mm: memcontrol: charge swapin pages on instantiation

2020-04-28 Thread Joonsoo Kim
On Fri, Apr 24, 2020 at 11:51 AM, Johannes Weiner wrote:
>
> On Fri, Apr 24, 2020 at 09:44:42AM +0900, Joonsoo Kim wrote:
> > On Mon, Apr 20, 2020 at 06:11:24PM -0400, Johannes Weiner wrote:
> > > @@ -412,31 +407,43 @@ struct page *__read_swap_cache_async(swp_entry_t 
> > > entry, gfp_t gfp_mask,
> > >  */
> > > cond_resched();
> > > continue;
> > > -   } else if (err) /* swp entry is obsolete ? */
> > > -   break;
> > > -
> > > -   /* May fail (-ENOMEM) if XArray node allocation failed. */
> > > -   __SetPageLocked(new_page);
> > > -   __SetPageSwapBacked(new_page);
> > > -   err = add_to_swap_cache(new_page, entry, gfp_mask & 
> > > GFP_KERNEL);
> > > -   if (likely(!err)) {
> > > -   /* Initiate read into locked page */
> > > -   SetPageWorkingset(new_page);
> > > -   lru_cache_add_anon(new_page);
> > > -   *new_page_allocated = true;
> > > -   return new_page;
> > > }
> > > -   __ClearPageLocked(new_page);
> > > -   /*
> > > -* add_to_swap_cache() doesn't return -EEXIST, so we can 
> > > safely
> > > -* clear SWAP_HAS_CACHE flag.
> > > -*/
> > > -   put_swap_page(new_page, entry);
> > > -   } while (err != -ENOMEM);
> > > +   if (err)/* swp entry is obsolete ? */
> > > +   return NULL;
> >
> > "if (err)" is not needed since "!err" is already exiting the loop.
>
> But we don't want to leave the loop, we want to leave the
> function. For example, if swapcache_prepare() says the entry is gone
> (-ENOENT), we don't want to exit the loop and allocate a page for it.

Yes, so I said "if (err)" is not needed.
Just "return NULL;" would be enough.
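
In other words, the loop could end up looking roughly like this (just a sketch
of the suggested simplification, not the final patch):

		err = swapcache_prepare(entry);
		if (!err)
			break;			/* we own SWAP_HAS_CACHE, go allocate */
		if (err == -EEXIST) {
			/* someone else is adding the entry: back off and retry */
			cond_resched();
			continue;
		}
		return NULL;			/* swp entry is obsolete */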

> > > +
> > > +   /*
> > > +* The swap entry is ours to swap in. Prepare a new page.
> > > +*/
> > > +
> > > +   page = alloc_page_vma(gfp_mask, vma, addr);
> > > +   if (!page)
> > > +   goto fail_free;
> > > +
> > > +   __SetPageLocked(page);
> > > +   __SetPageSwapBacked(page);
> > > +
> > > +   /* May fail (-ENOMEM) if XArray node allocation failed. */
> > > +   if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> > > +   goto fail_unlock;
> > > +
> > > +   if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL, false))
> > > +   goto fail_delete;
> > > +
> >
> > I think that following order of operations is better than yours.
> >
> > 1. page alloc
> > 2. memcg charge
> > 3. swapcache_prepare
> > 4. add_to_swap_cache
> >
> > The reason is that page allocation and memcg charging could take a
> > long time due to reclaim, and other tasks waiting for this swapcache page
> > could be blocked in between swapcache_prepare() and add_to_swap_cache().
>
> I see how that would be preferable, but memcg charging actually needs
> the swap(cache) information to figure out the cgroup that owned it at
> swapout, then uncharge the swapcache and drop the swap cgroup record.
>
> Maybe it could be done, but I'm not sure that level of surgery would
> be worth the benefits? Whoever else would be trying to swap the page
> in at the same time is likely in the same memory situation, and would
> not necessarily be able to allocate pages any faster.

Hmm, at least some modifications are needed since the waiting task would
busy wait in the loop, and that wastes overall system CPU time.

I still think that changing the operation order is better since it's possible
that a later task allocates a page faster, though that's not the usual case.
However, I also agree with your reasoning, so I will not insist further.

Thanks.


Re: + revert-mm-cma-manage-the-memory-of-the-cma-area-by-using-the-zone_movable.patch added to -mm tree

2018-05-22 Thread Joonsoo Kim
2018-05-23 9:07 GMT+09:00  :
>
> The patch titled
>  Subject: Revert "mm/cma: manage the memory of the CMA area by using the 
> ZONE_MOVABLE"
> has been added to the -mm tree.  Its filename is
>  
> revert-mm-cma-manage-the-memory-of-the-cma-area-by-using-the-zone_movable.patch
>
> This patch should soon appear at
> 
> http://ozlabs.org/~akpm/mmots/broken-out/revert-mm-cma-manage-the-memory-of-the-cma-area-by-using-the-zone_movable.patch
> and later at
> 
> http://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-cma-manage-the-memory-of-the-cma-area-by-using-the-zone_movable.patch
>
> Before you just go and hit "reply", please:
>a) Consider who else should be cc'ed
>b) Prefer to cc a suitable mailing list as well
>c) Ideally: find the original patch on the mailing list and do a
>   reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing 
> your code ***
>
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
>
> --
> From: Ville Syrjälä 
> Subject: Revert "mm/cma: manage the memory of the CMA area by using the 
> ZONE_MOVABLE"
>
> Revert bad8c6c0b11446 ("mm/cma: manage the memory of the CMA area by using
> the ZONE_MOVABLE").

Hmm... Now that I checked the patch, it seems like the revert is imperfectly done.
I will send a reverting patch that includes the full series.

Thanks.


Re: [PATCH] Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE"

2018-05-22 Thread Joonsoo Kim
2018-05-23 9:02 GMT+09:00 Andrew Morton <a...@linux-foundation.org>:
> On Mon, 21 May 2018 15:16:33 +0900 Joonsoo Kim <js1...@gmail.com> wrote:
>
>> > (gdb) list *(dma_direct_alloc+0x22f)
>> > 0x573fbf is in dma_direct_alloc (../lib/dma-direct.c:104).
>> > 94
>> > 95  if (!page)
>> > 96  return NULL;
>> > 97  ret = page_address(page);
>> > 98  if (force_dma_unencrypted()) {
>> > 99  set_memory_decrypted((unsigned long)ret, 1 << 
>> > page_order);
>> > 100 *dma_handle = __phys_to_dma(dev, page_to_phys(page));
>> > 101 } else {
>> > 102 *dma_handle = phys_to_dma(dev, page_to_phys(page));
>> > 103 }
>> > 104 memset(ret, 0, size);
>> > 105 return ret;
>> > 106 }
>> >
>>
>> Okay. I found the reason for this error.
>
> It's getting rather late and we don't seem to have a final set of fixes
> yet.  Perhaps the best approach here is to revert and try again for
> 4.18?

Yes. Reverting seems to be the right decision at this moment.
Could you apply the original revert patch from Ville?

Thanks.


Re: [PATCH] Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE"

2018-05-21 Thread Joonsoo Kim
On Sat, May 19, 2018 at 05:46:32PM +0300, Ville Syrjälä wrote:
> On Fri, May 18, 2018 at 01:01:04PM +0900, Joonsoo Kim wrote:
> > On Thu, May 17, 2018 at 10:53:32AM -0700, Laura Abbott wrote:
> > > On 05/17/2018 10:08 AM, Michal Hocko wrote:
> > > >On Thu 17-05-18 18:49:47, Michal Hocko wrote:
> > > >>On Thu 17-05-18 16:58:32, Ville Syrjälä wrote:
> > > >>>On Thu, May 17, 2018 at 04:36:29PM +0300, Ville Syrjälä wrote:
> > > >>>>On Thu, May 17, 2018 at 03:21:09PM +0200, Michal Hocko wrote:
> > > >>>>>On Thu 17-05-18 15:59:59, Ville Syrjala wrote:
> > > >>>>>>From: Ville Syrjälä 
> > > >>>>>>
> > > >>>>>>This reverts commit bad8c6c0b1144694ecb0bc5629ede9b8b578b86e.
> > > >>>>>>
> > > >>>>>>Make x86 with HIGHMEM=y and CMA=y boot again.
> > > >>>>>
> > > >>>>>Is there any bug report with some more details? It is much more
> > > >>>>>preferable to fix the issue rather than to revert the whole thing
> > > >>>>>right away.
> > > >>>>
> > > >>>>The machine I have in front of me right now didn't give me anything.
> > > >>>>Black screen, and netconsole was silent. No serial port on this
> > > >>>>machine unfortunately.
> > > >>>
> > > >>>Booted on another machine with serial:
> > > >>
> > > >>Could you provide your .config please?
> > > >>
> > > >>[...]
> > > >>>[0.00] cma: Reserved 4 MiB at 0x3700
> > > >>[...]
> > > >>>[0.00] BUG: Bad page state in process swapper  pfn:377fe
> > > >>>[0.00] page:f53effc0 count:0 mapcount:-127 mapping: 
> > > >>>index:0x0
> > > >>
> > > >>OK, so this looks to be the source of the problem. -128 would be a
> > > >>buddy page but I do not see anything that would set the counter to -127
> > > >>and the real map count updates shouldn't really happen that early.
> > > >>
> > > >>Maybe CONFIG_DEBUG_VM and CONFIG_DEBUG_HIGHMEM will tell us more.
> > > >
> > > >Looking closer, I _think_ that the bug is in 
> > > >set_highmem_pages_init->is_highmem
> > > >and zone_movable_is_highmem might force CMA pages in the zone movable to
> > > >be initialized as highmem. And that sounds suspicious to me. Joonsoo?
> > > >
> > > 
> > > For a point of reference, arm with this configuration doesn't hit this bug
> > > because highmem pages are freed via the memblock interface only instead
> > > of iterating through each zone. It looks like the x86 highmem code
> > > assumes only a single highmem zone and/or it's disjoint?
> > 
> > Good point! The reason for the crash is that the span of ZONE_MOVABLE is
> > extended to the whole node span for future CMA initialization, and
> > normal memory is wrongly freed here.
> >
> > Here goes the fix. Ville, could you test the patch below?
> > I reproduced the issue on my side and this patch fixed it.
> 
> This gets me past the initial hurdles. But when I tried it on a machine
> with an actual 32 bit OS it oopsed again later in the boot.
> 
> [0.00] Linux version 4.17.0-rc5-mgm+ () (gcc version 6.4.0 (Gentoo 
> 6.4.0-r1 p1.3)) #57 PREEMPT Sat May 19 17:25:27 EEST 2018
> [0.00] x86/fpu: x87 FPU will use FXSAVE
> [0.00] e820: BIOS-provided physical RAM map:
> [0.00] BIOS-e820: [mem 0x-0x0009f7ff] usable
> [0.00] BIOS-e820: [mem 0x0009f800-0x0009] reserved
[snip]
> [1.487182] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
> [1.514395] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 
> 16550A
> [1.522211] serial 00:05: skipping CIR port at 0x2e8 / 0x0, IRQ 3
> [1.530173] ata_piix :00:1f.1: enabling device (0005 -> 0007)
> [1.538301] BUG: unable to handle kernel NULL pointer dereference at 
> 
> [1.540010] *pde =  
> [1.540010] Oops: 0002 [#1] PREEMPT
> [1.540010] Modules linked in:
> [1.540010] CPU: 0 PID: 1 Comm: swapper Tainted: GW 
> 4.17.0-rc5-mgm+ #57
> [1.540010] Hardware name: FUJITSU SIEMENS LIFEBOOK S6120/FJNB16C, BIOS 
> Version 1.26  05/10/2004
> [1.540010] EIP: dma_direct_alloc+0x22f/0x260
> [1.540010] EFLAGS:

Re: [PATCH] Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE"

2018-05-17 Thread Joonsoo Kim
On Thu, May 17, 2018 at 10:53:32AM -0700, Laura Abbott wrote:
> On 05/17/2018 10:08 AM, Michal Hocko wrote:
> >On Thu 17-05-18 18:49:47, Michal Hocko wrote:
> >>On Thu 17-05-18 16:58:32, Ville Syrjälä wrote:
> >>>On Thu, May 17, 2018 at 04:36:29PM +0300, Ville Syrjälä wrote:
> >>>>On Thu, May 17, 2018 at 03:21:09PM +0200, Michal Hocko wrote:
> >>>>>On Thu 17-05-18 15:59:59, Ville Syrjala wrote:
> >>>>>>From: Ville Syrjälä 
> >>>>>>
> >>>>>>This reverts commit bad8c6c0b1144694ecb0bc5629ede9b8b578b86e.
> >>>>>>
> >>>>>>Make x86 with HIGHMEM=y and CMA=y boot again.
> >>>>>
> >>>>>Is there any bug report with some more details? It is much more
> >>>>>preferable to fix the issue rather than to revert the whole thing
> >>>>>right away.
> >>>>
> >>>>The machine I have in front of me right now didn't give me anything.
> >>>>Black screen, and netconsole was silent. No serial port on this
> >>>>machine unfortunately.
> >>>
> >>>Booted on another machine with serial:
> >>
> >>Could you provide your .config please?
> >>
> >>[...]
> >>>[0.00] cma: Reserved 4 MiB at 0x3700
> >>[...]
> >>>[0.00] BUG: Bad page state in process swapper  pfn:377fe
> >>>[0.00] page:f53effc0 count:0 mapcount:-127 mapping: 
> >>>index:0x0
> >>
> >>OK, so this looks to be the source of the problem. -128 would be a
> >>buddy page but I do not see anything that would set the counter to -127
> >>and the real map count updates shouldn't really happen that early.
> >>
> >>Maybe CONFIG_DEBUG_VM and CONFIG_DEBUG_HIGHMEM will tell us more.
> >
> >Looking closer, I _think_ that the bug is in 
> >set_highmem_pages_init->is_highmem
> >and zone_movable_is_highmem might force CMA pages in the zone movable to
> >be initialized as highmem. And that sounds suspicious to me. Joonsoo?
> >
> 
> For a point of reference, arm with this configuration doesn't hit this bug
> because highmem pages are freed via the memblock interface only instead
> of iterating through each zone. It looks like the x86 highmem code
> assumes only a single highmem zone and/or it's disjoint?

Good point! The reason for the crash is that the span of ZONE_MOVABLE is
extended to the whole node span for future CMA initialization, and
normal memory is wrongly freed here.

Here goes the fix. Ville, could you test the patch below?
I reproduced the issue on my side and this patch fixed it.

Thanks.

>8-
From 569899a4dbd28cebb8d350d3d1ebb590d88b2629 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Fri, 18 May 2018 10:52:05 +0900
Subject: [PATCH] x86/32/highmem: check if the zone is matched when free
 highmem pages on init

If CONFIG_CMA is enabled, it extends the span of the MOVABLE_ZONE
to manage the CMA memory later. And, in this case, the span of the
MOVABLE_ZONE could overlap the other zone's memory. We need to
avoid freeing this overlapped memory here since it would be the
memory of the other zone. Therefore, this patch adds a check
whether the page is indeed on the requested zone or not. Skipped
page will be freed when the memory of the matched zone is freed.

Reported-by: Ville Syrjälä 
Signed-off-by: Joonsoo Kim 
---
 arch/x86/include/asm/highmem.h |  4 ++--
 arch/x86/mm/highmem_32.c   |  5 -
 arch/x86/mm/init_32.c  | 25 +
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/highmem.h b/arch/x86/include/asm/highmem.h
index a805993..e383f57 100644
--- a/arch/x86/include/asm/highmem.h
+++ b/arch/x86/include/asm/highmem.h
@@ -72,8 +72,8 @@ void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot);
 
 #define flush_cache_kmaps()do { } while (0)
 
-extern void add_highpages_with_active_regions(int nid, unsigned long start_pfn,
-   unsigned long end_pfn);
+extern void add_highpages_with_active_regions(int nid, struct zone *zone,
+   unsigned long start_pfn, unsigned long end_pfn);
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 6d18b70..bf9f5b8 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -120,6 +120,9 @@ void __init set_highmem_pages_init(void)
if (!is_highmem(zone))
continue;
 
+   if (!populated_zone(zone))
+   continue;
+
zone_start_pfn = zone->zone_start_pfn;
zone_end_pfn = zone_start

Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-05-14 Thread Joonsoo Kim
Hello, Mikulas.

On Tue, Apr 24, 2018 at 02:41:47PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Matthew Wilcox wrote:
> 
> > On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> > > 
> > > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > > Some bugs (such as buffer overflows) are better detected
> > > > > with kmalloc code, so we must test the kmalloc path too.
> > > > 
> > > > Well now, this brings up another item for the collective TODO list --
> > > > implement redzone checks for vmalloc.  Unless this is something already
> > > > taken care of by kasan or similar.
> > > 
> > > The kmalloc overflow testing is also not ideal - it rounds the size up to 
> > > the next slab size and detects buffer overflows only at this boundary.
> > > 
> > > Some time ago, I made a "kmalloc guard" patch that places a magic number
> > > immediately after the requested size - so that it can detect overflows at
> > > byte boundary
> > > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> > > 
> > > That patch found a bug in crypto code:
> > > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
> > 
> > Is it still worth doing this, now we have kasan?
> 
> The kmalloc guard has much lower overhead than kasan.

I skimmed your code and it requires rebuilding the kernel.
I think that if rebuilding is required, the same as with KASAN,
using KASAN is better since it has far better coverage for
detecting the bug.

However, I think that if the redzone can be set up tightly
without a rebuild, it would be worth implementing. I have an idea to
implement it only for SLUB. Could I try it? (I'm asking this
because I'm inspired by the above patch.) :)
Or do you want to try it?
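
For context, the guard idea from your patch boils down to something like the
toy sketch below (a userspace illustration only, not the actual patch and not
a SLUB implementation):

#include <assert.h>
#include <stdlib.h>

#define GUARD_MAGIC 0x5a

/* Allocate size bytes plus one canary byte placed right after the payload. */
static void *guarded_alloc(size_t size)
{
	unsigned char *p = malloc(size + 1);

	if (p)
		p[size] = GUARD_MAGIC;
	return p;
}

/* Check the canary on free: even a one-byte overflow is caught here,
 * instead of only at the next slab-size boundary. */
static void guarded_free(void *ptr, size_t size)
{
	unsigned char *p = ptr;

	assert(p[size] == GUARD_MAGIC);
	free(p);
}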

Thanks.


Re: [PATCH] mm/page_alloc: use ac->high_zoneidx for classzone_idx

2018-05-07 Thread Joonsoo Kim
Hello, Mel.

Thanks for precious input!

2018-05-04 19:33 GMT+09:00 Mel Gorman :
> On Fri, May 04, 2018 at 09:03:02AM +0200, Vlastimil Babka wrote:
>> > min watermark for NORMAL zone on node 0
>> > allocation initiated on node 0: 750 + 4096 = 4846
>> > allocation initiated on node 1: 750 + 0 = 750
>> >
>> > This watermark difference could cause too many numa_miss allocation
>> > in some situation and then performance could be downgraded.
>> >
>> > Recently, there was a regression report about this problem on CMA patches
>> > since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
>> > that problem is disappeared with this fix that uses high_zoneidx
>> > for classzone_idx.
>> >
>> > http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
>> >
>> > Using high_zoneidx for classzone_idx is more consistent way than previous
>> > approach because system's memory layout doesn't affect anything to it.
>>
>> So to summarize;
>> - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
>> represents the highest zone the allocation can use
>
> It's arcane but it was simply a fast-path calculation. A much older
> definition would be easier to understand but it was slower.
>
>> - classzone_idx was supposed to be the highest zone that the allocation
>> can use, that is actually available in the system. Somehow that became
>> the highest zone that is available on the preferred node (in the default
>> node-order zonelist), which causes the watermark inconsistencies you
>> mention.
>>
>
> I think it *always* was the index of the first preferred zone of a
> zonelist. The treatment of classzone has changed a lot over the years and
> I didn't do a historical check but the general intent was always "protect
> some pages in lower zones". This was particularly important for 32-bit
> and highmem albeit that is less of a concern today. When it transferred to
> NUMA, I don't think it ever was seriously considered if it should change
> as the critical node was likely to be node 0 with all the zones and the
> remote nodes all used the highest zone. CMA/MOVABLE changed that slightly
> by allowing the possibility of node0 having a "higher" zone than every

I think that this problem is related not only to protection of the
lowmem (that is, zones lower than normal) but also to node balance.

In fact, the problem reported by the zeroday-bot is caused by node1 having a
"higher" zone. In this case, node0's lowmem is protected well, but the
node balance of the allocation is broken since node1's normal memory cannot
be protected from an allocation that is initiated on a remote node.

> other node. When MOVABLE was introduced, it wasn't much of a problem as
> the purpose of MOVABLE was for systems that dynamically needed to allocate
> hugetlbfs later in the runtime but for CMA, it was a lot more critical
> for ordinary usage so this is primarily a CMA thing.

I'm not sure that it's primarily a CMA thing. There is another critical setup
for this problem, that is, memory hotplug. If someone plugs new memory into
the MOVABLE zone, a "higher" zone will be created on a specific node and
this problem happens. I have checked this with QEMU.

>> I don't see a problem with your change. I would be worried about
>> inflated reserves when e.g. ZONE_MOVABLE doesn't exist, but that doesn't
>> seem to be the case. My laptop has empty ZONE_MOVABLE and the
>> ZONE_NORMAL protection for movable is 0.
>>
>> But there had to be some reason for classzone_idx to be like this and
>> not simple high_zoneidx. Maybe Mel remembers? Maybe it was important
>> then, but is not anymore? Sigh, it seems to be pre-git.
>>
>
> classzone predates my involvement with Linux but I would be less concerneed
> about what the original intent was and instead ensure that classzone index
> is consistent, sane and potentially renamed while preserving the intent of
> "reserve pages in lower zones when an allocation request can use higher
> zones". While historically the critical intent was to preserve Normal and
> to a lesser extent DMA on 32-bit systems, there still should be some care
> of DMA32 so we should not lose that.

Agreed!

> With the patch, the allocator looks like it would be fine as just
> reservations change. I think it's unlikely that CMA usage will result
> in lowmem starvation.  Compaction becomes a bit weird as classzone index
> has no special meaning versis highmem and I think it'll be very easy to
> forget. Similarly, vmscan can reclaim pages from remote nodes and zones
> that are higher than the original request. That is not likely to be a
> problem but it's a change in behaviour and easy to miss.
>
> Fundamentally, I find it extremely weird we now have two variables that are
> essentially the same thing. They should be collapsed into one variable,
> renamed and documented on what the index means for page allocator,
> compaction, vmscan and the special casing around CMA.

Agreed!
I will update this patch to reflect your comment. If someone have an idea
on 

Re: [PATCH] mm/page_alloc: use ac->high_zoneidx for classzone_idx

2018-05-04 Thread Joonsoo Kim
2018-05-04 16:03 GMT+09:00 Vlastimil Babka <vba...@suse.cz>:
> On 05/04/2018 06:30 AM, js1...@gmail.com wrote:
>> From: Joonsoo Kim <iamjoonsoo@lge.com>
>>
>> Currently, we use the zone index of preferred_zone which represents
>> the best matching zone for allocation, as classzone_idx. It has a problem
>> on NUMA system with ZONE_MOVABLE.
>>
>> In NUMA system, it can be possible that each node has different populated
>> zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
>> node 1 could have only NORMAL zone. In this setup, allocation request
>> initiated on node 0 and the one on node 1 would have different
>> classzone_idx, 3 and 2, respectively, since their preferred_zones are
>> different. If they are handled by only their own node, there is no problem.
>> However, if they are somtimes handled by the remote node, the problem
>> would happen.
>>
>> In the following setup, allocation initiated on node 1 will have some
>> precedence than allocation initiated on node 0 when former allocation is
>> processed on node 0 due to not enough memory on node 1. They will have
>> different lowmem reserve due to their different classzone_idx thus
>> an watermark bars are also different.
>>
> ...
>
>>
>> min watermark for NORMAL zone on node 0
>> allocation initiated on node 0: 750 + 4096 = 4846
>> allocation initiated on node 1: 750 + 0 = 750
>>
>> This watermark difference could cause too many numa_miss allocation
>> in some situation and then performance could be downgraded.
>>
>> Recently, there was a regression report about this problem on CMA patches
>> since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
>> that problem is disappeared with this fix that uses high_zoneidx
>> for classzone_idx.
>>
>> http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
>>
>> Using high_zoneidx for classzone_idx is more consistent way than previous
>> approach because system's memory layout doesn't affect anything to it.
>
> So to summarize;
> - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
> represents the highest zone the allocation can use
> - classzone_idx was supposed to be the highest zone that the allocation
> can use, that is actually available in the system. Somehow that became
> the highest zone that is available on the preferred node (in the default
> node-order zonelist), which causes the watermark inconsistencies you
> mention.

Yes! Thanks for the summary!

> I don't see a problem with your change. I would be worried about
> inflated reserves when e.g. ZONE_MOVABLE doesn't exist, but that doesn't
> seem to be the case. My laptop has empty ZONE_MOVABLE and the
> ZONE_NORMAL protection for movable is 0.

Yes! The protection number is calculated using the number of managed pages
in the upper zones. If there is no memory in the upper zones, the protection will be 0.
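
To make the watermark effect concrete, the part of the check that differs
between the two nodes is essentially the lowmem reserve term added on top of
the min watermark. A simplified sketch of the __zone_watermark_ok() logic
(not the exact code):

	#include <linux/mmzone.h>

	/*
	 * classzone_idx selects which lowmem_reserve entry is added to the
	 * watermark, so two requests against the same zone can see different
	 * effective watermarks if their classzone_idx differ.
	 */
	static bool watermark_ok_sketch(struct zone *z, unsigned long mark,
					int classzone_idx,
					unsigned long free_pages)
	{
		return free_pages > mark + z->lowmem_reserve[classzone_idx];
	}

With the numbers quoted above, a request whose classzone_idx is ZONE_MOVABLE
checks the NORMAL zone on node 0 against 750 + 4096 = 4846 pages, while one
whose classzone_idx is ZONE_NORMAL checks it against 750 + 0 = 750.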

> But there had to be some reason for classzone_idx to be like this and
> not simple high_zoneidx. Maybe Mel remembers? Maybe it was important
> then, but is not anymore? Sigh, it seems to be pre-git.

Based on my code inspection, this patch changing the classzone_idx
implementation would not cause a problem. I have also tried to find the
reason for the current classzone_idx implementation by searching the git
history, but I couldn't. As you said, it seems to be pre-git. It would be
really helpful if someone who remembers the reason for the current
classzone_idx implementation could explain it.

Thanks.


Re: [PATCH] mm/thp: don't count ZONE_MOVABLE as the target for freepage reserving

2018-04-05 Thread Joonsoo Kim
On Thu, Apr 05, 2018 at 05:05:39PM +0900, Joonsoo Kim wrote:
> On Thu, Apr 05, 2018 at 09:57:53AM +0200, Michal Hocko wrote:
> > On Thu 05-04-18 16:27:16, Joonsoo Kim wrote:
> > > From: Joonsoo Kim <iamjoonsoo@lge.com>
> > > 
> > > ZONE_MOVABLE only has movable pages so we don't need to keep enough
> > > freepages to avoid or deal with fragmentation. So, don't count it.
> > > 
> > > This changes min_free_kbytes and thus min_watermark greatly
> > > if ZONE_MOVABLE is used. It will make the user uses more memory.

Oops.. s/more/less

> > 
> > OK, but why does it matter. Has anybody seen this as an issue?
> 
> There was a regression report for CMA patchset and I think that it is
> related to this problem. CMA patchset makes the system uses one more
> zone (ZONE_MOVABLE) and then increase min_free_kbytes. It reduces
> usable memory and it could cause regression.
> 
> http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

Thanks.


Re: [PATCH] mm/thp: don't count ZONE_MOVABLE as the target for freepage reserving

2018-04-05 Thread Joonsoo Kim
On Thu, Apr 05, 2018 at 09:57:53AM +0200, Michal Hocko wrote:
> On Thu 05-04-18 16:27:16, Joonsoo Kim wrote:
> > From: Joonsoo Kim <iamjoonsoo@lge.com>
> > 
> > ZONE_MOVABLE only has movable pages so we don't need to keep enough
> > freepages to avoid or deal with fragmentation. So, don't count it.
> > 
> > This changes min_free_kbytes and thus min_watermark greatly
> > if ZONE_MOVABLE is used. It will make the user uses more memory.
> 
> OK, but why does it matter. Has anybody seen this as an issue?

There was a regression report for the CMA patchset and I think that it is
related to this problem. The CMA patchset makes the system use one more
zone (ZONE_MOVABLE) and then increases min_free_kbytes. That reduces
usable memory and could cause a regression.

http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
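
The patch itself is essentially about skipping ZONE_MOVABLE when the THP code
sizes min_free_kbytes. A rough sketch of the idea (not the exact
set_recommended_min_free_kbytes() code; the helper below is illustrative
only):

	#include <linux/mmzone.h>

	static int nr_zones_needing_reserve(void)
	{
		struct zone *zone;
		int nr_zones = 0;

		for_each_populated_zone(zone) {
			/*
			 * ZONE_MOVABLE holds only movable pages, so no
			 * freepages need to be reserved there against
			 * fragmentation.
			 */
			if (zone_idx(zone) == ZONE_MOVABLE)
				continue;
			nr_zones++;
		}
		/* recommended_min, and so min_free_kbytes, scales with this. */
		return nr_zones;
	}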

Thanks.


Re: [lkp-robot] [mm/cma] 2b0f904a5a: fio.read_bw_MBps -16.1% regression

2018-04-05 Thread Joonsoo Kim
Hello,
sorry for bothering you.

2018-01-09 16:16 GMT+09:00 Joonsoo Kim :
> On Sat, Jan 06, 2018 at 05:26:31PM +0800, Ye Xiaolong wrote:
>> Hi,
>>
>> On 01/03, Joonsoo Kim wrote:
>> >Hello!
>> >
>> >On Tue, Jan 02, 2018 at 02:35:28PM +0800, kernel test robot wrote:
>> >>
>> >> Greeting,
>> >>
>> >> FYI, we noticed a -16.1% regression of fio.read_bw_MBps due to commit:
>> >>
>> >>
>> >> commit: 2b0f904a5a8781498417d67226fd12c5e56053ae ("mm/cma: manage the 
>> >> memory of the CMA area by using the ZONE_MOVABLE")
>> >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>> >>
>> >> in testcase: fio-basic
>> >> on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz 
>> >> with 256G memory
>> >> with following parameters:
>> >>
>> >>disk: 2pmem
>> >>fs: ext4
>> >>runtime: 200s
>> >>nr_task: 50%
>> >>time_based: tb
>> >>rw: randread
>> >>bs: 2M
>> >>ioengine: mmap
>> >>test_size: 200G
>> >>cpufreq_governor: performance
>> >>
>> >> test-description: Fio is a tool that will spawn a number of threads or 
>> >> processes doing a particular type of I/O action as specified by the user.
>> >> test-url: https://github.com/axboe/fio
>> >>
>> >>
>> >>
>> >> Details are as below:
>> >> -->
>> >>
>> >>
>> >> To reproduce:
>> >>
>> >> git clone https://github.com/intel/lkp-tests.git
>> >> cd lkp-tests
>> >> bin/lkp install job.yaml  # job file is attached in this email
>> >> bin/lkp run job.yaml
>> >>
>> >> =
>> >> bs/compiler/cpufreq_governor/disk/fs/ioengine/kconfig/nr_task/rootfs/runtime/rw/tbox_group/test_size/testcase/time_based:
>> >>   
>> >> 2M/gcc-7/performance/2pmem/ext4/mmap/x86_64-rhel-7.2/50%/debian-x86_64-2016-08-31.cgz/200s/randread/lkp-hsw-ep6/200G/fio-basic/tb
>> >>
>> >> commit:
>> >>   f6572f9cd2 ("mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE 
>> >> request")
>> >>   2b0f904a5a ("mm/cma: manage the memory of the CMA area by using the 
>> >> ZONE_MOVABLE")
>> >>
>> >> f6572f9cd248df2c 2b0f904a5a8781498417d67226
>> >>  --
>> >>  %stddev %change %stddev
>> >>  \  |\
>> >>  11451   -16.1%   9605fio.read_bw_MBps
>> >>   0.29 ą  5%  +0.10.40 ą  3%  fio.latency_1000us%
>> >>  19.35 ą  5%  -4.7   14.69 ą  3%  fio.latency_10ms%
>> >>   7.92 ą  3% +12.2   20.15fio.latency_20ms%
>> >>   0.05 ą 11%  +0.00.09 ą  8%  fio.latency_2ms%
>> >>  70.22-8.9   61.36fio.latency_4ms%
>> >>   0.29 ą 13%  +0.00.33 ą  3%  fio.latency_500us%
>> >>   0.45 ą 29%  +1.01.45 ą  4%  fio.latency_50ms%
>> >>   1.37+0.11.44fio.latency_750us%
>> >>   9792   +31.7%  12896fio.read_clat_90%_us
>> >>  10560   +33.0%  14048fio.read_clat_95%_us
>> >>  15376 ą 10% +46.9%  22592fio.read_clat_99%_us
>> >>   4885   +19.2%   5825fio.read_clat_mean_us
>> >>   5725   -16.1%   4802fio.read_iops
>> >>  4.598e+09   -16.4%  3.845e+09fio.time.file_system_inputs
>> >> 453153-8.4% 415215
>> >> fio.time.involuntary_context_switches
>> >>  5.748e+08   -16.4%  4.806e+08fio.time.major_page_faults
>> >>1822257   +23.7%2254706
>> >> fio.time.maximum_resident_set_size
>> >>   5089+1.6%   5172fio.time.system_time
>> >> 514.50   -16.3% 430.48fio.time.user_time
>> >
>> >System time is increased and user tim

Re: [PATCH v1] mm: help the ALLOC_HARDER allocation pass the watermarki when CMA on

2018-04-05 Thread Joonsoo Kim
On Wed, Apr 04, 2018 at 03:37:03PM -0700, Andrew Morton wrote:
> On Wed, 4 Apr 2018 09:31:10 +0900 Joonsoo Kim <iamjoonsoo@lge.com> wrote:
> 
> > On Fri, Mar 23, 2018 at 01:04:08PM -0700, Andrew Morton wrote:
> > > On Fri, 23 Mar 2018 10:33:27 +0100 Michal Hocko <mho...@kernel.org> wrote:
> > > 
> > > > On Fri 23-03-18 17:19:26, Zhaoyang Huang wrote:
> > > > > On Fri, Mar 23, 2018 at 4:38 PM, Michal Hocko <mho...@kernel.org> 
> > > > > wrote:
> > > > > > On Fri 23-03-18 15:57:32, Zhaoyang Huang wrote:
> > > > > >> For the type of 'ALLOC_HARDER' page allocation, there is an express
> > > > > >> highway for the whole process which lead the allocation reach 
> > > > > >> __rmqueue_xxx
> > > > > >> easier than other type.
> > > > > >> However, when CMA is enabled, the free_page within 
> > > > > >> zone_watermark_ok() will
> > > > > >> be deducted for number the pages in CMA type, which may cause the 
> > > > > >> watermark
> > > > > >> check fail, but there are possible enough HighAtomic or Unmovable 
> > > > > >> and
> > > > > >> Reclaimable pages in the zone. So add 'alloc_harder' here to
> > > > > >> count CMA pages in to clean the obstacles on the way to the final.
> > > > > >
> > > > > > This is no longer the case in the current mmotm tree. Have a look at
> > > > > > Joonsoo's zone movable based CMA patchset 
> > > > > > http://lkml.kernel.org/r/1512114786-5085-1-git-send-email-iamjoonsoo@lge.com
> > > > > >
> > > > > Thanks for the information. However, I can't find the commit in the
> > > > > latest mainline, is it merged?
> > > > 
> > > > Not yet. It is still sitting in the mmomt tree. I am not sure what is
> > > > the merge plan but I guess it is still waiting for some review feedback.
> > > 
> > > http://lkml.kernel.org/r/20171222001113.GA1729@js1304-P5Q-DELUXE
> > > 
> > > That patchset has been floating about since December and still has
> > > unresolved issues.
> > > 
> > > Joonsoo, can you please let us know the status?
> > 
> > Hello, Andrew.
> > Sorry for a late response.
> > 
> > Today, I finally have answered the question from Michal and it seems
> > that there is no problem at all.
> > 
> > http://lkml.kernel.org/r/CAAmzW4NGv7RyCYyokPoj4aR3ySKub4jaBZ3k=pt_yrefbby...@mail.gmail.com
> > 
> > You can merge the patch as is.
> > 
> 
> hm.
> 
> There was also a performance regression reported:
> http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

I analyzed the report and may have found the reason.

When we use more zones, min_free_kbytes is increased to avoid
fragmentation if THP is enabled. This patch uses one more zone to
manage CMA memory, so min_free_kbytes and thus the min watermark are increased.
That would reduce our usable memory and could cause a regression.

However, this reservation against fragmentation isn't needed for
ZONE_MOVABLE since it has only movable pages, so I sent a patch to fix it.

http://lkml.kernel.org/r/<1522913236-15776-1-git-send-email-iamjoonsoo@lge.com>

I'm not sure that it is the root cause of the above performance regression,
but I strongly suspect that they are related. I will ask the reporter
to test this patch on top of that.

Thanks.


Re: + mm-slabc-remove-duplicated-check-of-colour_next.patch added to -mm tree

2018-04-03 Thread Joonsoo Kim
On Mon, Mar 12, 2018 at 03:13:33PM -0700, a...@linux-foundation.org wrote:
> 
> The patch titled
>  Subject: mm/slab.c: remove duplicated check of colour_next
> has been added to the -mm tree.  Its filename is
>  mm-slabc-remove-duplicated-check-of-colour_next.patch
> 
> This patch should soon appear at
> 
> http://ozlabs.org/~akpm/mmots/broken-out/mm-slabc-remove-duplicated-check-of-colour_next.patch
> and later at
> 
> http://ozlabs.org/~akpm/mmotm/broken-out/mm-slabc-remove-duplicated-check-of-colour_next.patch
> 
> Before you just go and hit "reply", please:
>a) Consider who else should be cc'ed
>b) Prefer to cc a suitable mailing list as well
>c) Ideally: find the original patch on the mailing list and do a
>   reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/process/submit-checklist.rst when testing 
> your code ***
> 
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
> 
> --
> From: Roman Lakeev <sunnydda...@gmail.com>
> Subject: mm/slab.c: remove duplicated check of colour_next
> 
> Remove check that offset greater than cachep->colour bacause this is
> already checked in previous lines.
> 
> Link: http://lkml.kernel.org/r/877eqilr71@gmail.com
> Signed-off-by: Roman Lakeev <sunnydda...@gmail.com>
> Acked-by: Christoph Lameter <c...@linux.com>
> Acked-by: David Rientjes <rient...@google.com>
> Reviewed-by: Andrew Morton <a...@linux-foundation.org>
> Cc: Pekka Enberg <penb...@kernel.org>
> Cc: Joonsoo Kim <iamjoonsoo@lge.com>
> Signed-off-by: Andrew Morton <a...@linux-foundation.org>
> ---
> 
>  mm/slab.c |6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff -puN mm/slab.c~mm-slabc-remove-duplicated-check-of-colour_next mm/slab.c
> --- a/mm/slab.c~mm-slabc-remove-duplicated-check-of-colour_next
> +++ a/mm/slab.c
> @@ -2674,11 +2674,7 @@ static struct page *cache_grow_begin(str
>   if (n->colour_next >= cachep->colour)
>   n->colour_next = 0;
>  
> - offset = n->colour_next;
> - if (offset >= cachep->colour)
> - offset = 0;
> -
> - offset *= cachep->colour_off;
> + offset = n->colour_next * cachep->colour_off;
>  
>   /* Get slab management. */
>   freelist = alloc_slabmgmt(cachep, page, offset,

Hello, all.

This n->colour_next can be updated on another cpu since
there is no lock to protect it, so removing this second,
seemingly redundant check can make the offset overflow.
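
To illustrate, here is the code in question again with the interleaving I'm
worried about spelled out in comments (the code lines are from
cache_grow_begin(); the comments are mine):

	if (n->colour_next >= cachep->colour)	/* clamp the shared value */
		n->colour_next = 0;
	/*
	 * Another cpu can advance n->colour_next right here, without any
	 * lock, and push it back to >= cachep->colour before we read it.
	 */
	offset = n->colour_next;		/* snapshot may already be too big */
	if (offset >= cachep->colour)		/* local clamp keeps it in range */
		offset = 0;

	offset *= cachep->colour_off;

So the "duplicated" check is what keeps the colour offset from running past
the slack space of the slab when colour_next races.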

Andrew, could you drop this patch?

Thanks.


Re: [PATCH v1] mm: help the ALLOC_HARDER allocation pass the watermarki when CMA on

2018-04-03 Thread Joonsoo Kim
On Fri, Mar 23, 2018 at 01:04:08PM -0700, Andrew Morton wrote:
> On Fri, 23 Mar 2018 10:33:27 +0100 Michal Hocko  wrote:
> 
> > On Fri 23-03-18 17:19:26, Zhaoyang Huang wrote:
> > > On Fri, Mar 23, 2018 at 4:38 PM, Michal Hocko  wrote:
> > > > On Fri 23-03-18 15:57:32, Zhaoyang Huang wrote:
> > > >> For the type of 'ALLOC_HARDER' page allocation, there is an express
> > > >> highway for the whole process which lead the allocation reach 
> > > >> __rmqueue_xxx
> > > >> easier than other type.
> > > >> However, when CMA is enabled, the free_page within zone_watermark_ok() 
> > > >> will
> > > >> be deducted for number the pages in CMA type, which may cause the 
> > > >> watermark
> > > >> check fail, but there are possible enough HighAtomic or Unmovable and
> > > >> Reclaimable pages in the zone. So add 'alloc_harder' here to
> > > >> count CMA pages in to clean the obstacles on the way to the final.
> > > >
> > > > This is no longer the case in the current mmotm tree. Have a look at
> > > > Joonsoo's zone movable based CMA patchset 
> > > > http://lkml.kernel.org/r/1512114786-5085-1-git-send-email-iamjoonsoo@lge.com
> > > >
> > > Thanks for the information. However, I can't find the commit in the
> > > latest mainline, is it merged?
> > 
> > Not yet. It is still sitting in the mmomt tree. I am not sure what is
> > the merge plan but I guess it is still waiting for some review feedback.
> 
> http://lkml.kernel.org/r/20171222001113.GA1729@js1304-P5Q-DELUXE
> 
> That patchset has been floating about since December and still has
> unresolved issues.
> 
> Joonsoo, can you please let us know the status?

Hello, Andrew.
Sorry for the late response.

Today, I finally answered the question from Michal and it seems
that there is no problem at all.

http://lkml.kernel.org/r/CAAmzW4NGv7RyCYyokPoj4aR3ySKub4jaBZ3k=pt_yrefbby...@mail.gmail.com

You can merge the patch as is.

Thanks.


Re: [PATCH] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request

2018-04-03 Thread Joonsoo Kim
Hello, Michal.

Sorry for a really long delay.

2017-09-14 22:24 GMT+09:00 Michal Hocko <mho...@kernel.org>:
> [Sorry for a later reply]
>
> On Wed 06-09-17 13:35:25, Joonsoo Kim wrote:
>> From: Joonsoo Kim <iamjoonsoo@lge.com>
>>
>> Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
>> important to reserve.
>
> I am still not convinced this is a good idea. I do agree that reserving
> memory in both HIGHMEM and MOVABLE is just wasting memory but removing
> the reserve from the highmem as well will result that an oom victim will
> allocate from lower zones and that might have unexpected side effects.

Looks like you are confused.

This patch only affects the situation where ZONE_HIGHMEM and ZONE_MOVABLE are
used at the same time. In that case, before this patch, ZONE_HIGHMEM has a
reserve for GFP_HIGHMEM | GFP_MOVABLE requests, but, with this patch, there is
no reserve in ZONE_HIGHMEM for GFP_HIGHMEM | GFP_MOVABLE requests. This
perfectly matches what you hoped for. :)

> Can we simply leave HIGHMEM reserve and only remove it from the movable
> zone if both are present?

There is no higher zone than ZONE_MOVABLE, so ZONE_MOVABLE has no reserve
with or without this patch. To save memory, we need to remove the reserve in
ZONE_HIGHMEM.
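
In other words, the intended end state is roughly the following (my reading
of the patch, as a sketch of the effect, not the exact
setup_per_zone_lowmem_reserve() change):

	/*
	 * ZONE_MOVABLE is the highest zone and never carries a lowmem
	 * reserve; with this patch ZONE_HIGHMEM also stops reserving pages
	 * against GFP_HIGHMEM | GFP_MOVABLE requests, because highmem
	 * freepages can never serve kernel allocations anyway.
	 */
	if (zone_idx(zone) == ZONE_HIGHMEM)
		zone->lowmem_reserve[ZONE_MOVABLE] = 0;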

Thanks.


Re: [lkp-robot] [mm/cma] 2b0f904a5a: fio.read_bw_MBps -16.1% regression

2018-01-08 Thread Joonsoo Kim
On Sat, Jan 06, 2018 at 05:26:31PM +0800, Ye Xiaolong wrote:
> Hi,
> 
> On 01/03, Joonsoo Kim wrote:
> >Hello!
> >
> >On Tue, Jan 02, 2018 at 02:35:28PM +0800, kernel test robot wrote:
> >> 
> >> Greeting,
> >> 
> >> FYI, we noticed a -16.1% regression of fio.read_bw_MBps due to commit:
> >> 
> >> 
> >> commit: 2b0f904a5a8781498417d67226fd12c5e56053ae ("mm/cma: manage the 
> >> memory of the CMA area by using the ZONE_MOVABLE")
> >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >> 
> >> in testcase: fio-basic
> >> on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 
> >> 256G memory
> >> with following parameters:
> >> 
> >>disk: 2pmem
> >>fs: ext4
> >>runtime: 200s
> >>nr_task: 50%
> >>time_based: tb
> >>rw: randread
> >>bs: 2M
> >>ioengine: mmap
> >>test_size: 200G
> >>cpufreq_governor: performance
> >> 
> >> test-description: Fio is a tool that will spawn a number of threads or 
> >> processes doing a particular type of I/O action as specified by the user.
> >> test-url: https://github.com/axboe/fio
> >> 
> >> 
> >> 
> >> Details are as below:
> >> -->
> >> 
> >> 
> >> To reproduce:
> >> 
> >> git clone https://github.com/intel/lkp-tests.git
> >> cd lkp-tests
> >> bin/lkp install job.yaml  # job file is attached in this email
> >> bin/lkp run job.yaml
> >> 
> >> =
> >> bs/compiler/cpufreq_governor/disk/fs/ioengine/kconfig/nr_task/rootfs/runtime/rw/tbox_group/test_size/testcase/time_based:
> >>   
> >> 2M/gcc-7/performance/2pmem/ext4/mmap/x86_64-rhel-7.2/50%/debian-x86_64-2016-08-31.cgz/200s/randread/lkp-hsw-ep6/200G/fio-basic/tb
> >> 
> >> commit: 
> >>   f6572f9cd2 ("mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE 
> >> request")
> >>   2b0f904a5a ("mm/cma: manage the memory of the CMA area by using the 
> >> ZONE_MOVABLE")
> >> 
> >> f6572f9cd248df2c 2b0f904a5a8781498417d67226 
> >>  -- 
> >>  %stddev %change %stddev
> >>  \  |\  
> >>  11451   -16.1%   9605fio.read_bw_MBps
> >>   0.29 ±  5%  +0.10.40 ±  3%  fio.latency_1000us%
> >>  19.35 ±  5%  -4.7   14.69 ±  3%  fio.latency_10ms%
> >>   7.92 ±  3% +12.2   20.15fio.latency_20ms%
> >>   0.05 ± 11%  +0.00.09 ±  8%  fio.latency_2ms%
> >>  70.22-8.9   61.36fio.latency_4ms%
> >>   0.29 ± 13%  +0.00.33 ±  3%  fio.latency_500us%
> >>   0.45 ± 29%  +1.01.45 ±  4%  fio.latency_50ms%
> >>   1.37+0.11.44fio.latency_750us%
> >>   9792   +31.7%  12896fio.read_clat_90%_us
> >>  10560   +33.0%  14048fio.read_clat_95%_us
> >>  15376 ± 10% +46.9%  22592fio.read_clat_99%_us
> >>   4885   +19.2%   5825fio.read_clat_mean_us
> >>   5725   -16.1%   4802fio.read_iops
> >>  4.598e+09   -16.4%  3.845e+09fio.time.file_system_inputs
> >> 453153-8.4% 415215
> >> fio.time.involuntary_context_switches
> >>  5.748e+08   -16.4%  4.806e+08fio.time.major_page_faults
> >>1822257   +23.7%2254706
> >> fio.time.maximum_resident_set_size
> >>   5089+1.6%   5172fio.time.system_time
> >> 514.50   -16.3% 430.48fio.time.user_time
> >
> >System time is increased and user time is decreased. On the below, there is 
> >a clue.
> >
> >>  24569 ±  2%  +9.6%  26917 ±  2%  
> >> fio.time.voluntary_context_switches
> >>   54443725   -14.9%   46353339
> >> interrupts.CAL:Function_call_interrupts
> >>   0.00 ± 79%  -0.00.00 ± 17%  mpstat.cpu.iowait%
> >>   4.45   

Re: ACPI issues on cold power on [bisected]

2018-01-02 Thread Joonsoo Kim
On Tue, Jan 02, 2018 at 11:25:01AM +0100, Rafael J. Wysocki wrote:
> On Tue, Jan 2, 2018 at 3:54 AM, Joonsoo Kim <iamjoonsoo@lge.com> wrote:
> > On Fri, Dec 29, 2017 at 04:36:59PM +, Jonathan McDowell wrote:
> >> On Fri, Dec 22, 2017 at 09:21:09AM +0900, Joonsoo Kim wrote:
> >> > On Fri, Dec 08, 2017 at 03:11:59PM +, Jonathan McDowell wrote:
> >> > > I've been sitting on this for a while and should have spent time to
> >> > > investigate sooner, but it's been an odd failure mode that wasn't quite
> >> > > obvious.
> >> > >
> >> > > In 4.9 if I cold power on my laptop (Dell E7240) it fails to boot - I
> >> > > don't see anything after grub says its booting. In 4.10 onwards the
> >> > > laptop boots, but I get an Oops as part of the boot and ACPI is unhappy
> >> > > (no suspend, no clean poweroff, no ACPI buttons). The Oops is below;
> >> > > taken from 4.12 as that's the most recent error dmesg I have saved but
> >> > > also seen back in 4.10. It's always address 0x30 for the dereference.
> >> > >
> >> > > Rebooting the laptop does not lead to these problems; it's *only* from 
> >> > > a
> >> > > complete cold boot that they arise (which didn't help me in terms of
> >> > > being able to reliably bisect). Once I realised that I was able to
> >> > > bisect, but it leads me to an odd commit:
> >> > >
> >> > > 86d9f48534e800e4d62cdc1b5aaf539f4c1d47d6
> >> > > (mm/slab: fix kmemcg cache creation delayed issue)
> >> > >
> >> > > If I revert this then I can cold boot without problems.
> >> > >
> >> > > Also I don't see the problem with a stock Debian kernel, I think 
> >> > > because
> >> > > the ACPI support is modularised.
> >> >
> >> > Sorry for late response. I was on a long vacation.
> >>
> >> No problem. I've been trying to get around to diagnosing this for a
> >> while now anyway and this isn't a great time of year for fast responses.
> >>
> >> > I have tried to solve the problem however I don't find any clue yet.
> >> >
> >> > >From my analysis, oops report shows that 'struct sock *ssk' passed to
> >> > netlink_broadcast_filtered() is NULL. It means that some of
> >> > netlink_kernel_create() returns NULL. Maybe, it is due to slab
> >> > allocation failure. Could you check it by inserting some log on that
> >> > part? The issue cannot be reproducible in my side so I need your help.
> >>
> >> I've added some debug in acpi_bus_generate_netlink_event +
> >> genlmsg_multicast and the problem seems to be that genlmsg_multicast is
> >> getting called when init_net.genl_sock has not yet been initialised,
> >> leading to the NULL deference.
> >>
> >> Full dmesg output from a cold 4.14.8 boot at:
> >>
> >> https://the.earth.li/~noodles/acpi-problem/dmesg-4.14.8-broken
> >>
> >> And the same kernel after a reboot ("shutdown -r now"):
> >>
> >> https://the.earth.li/~noodles/acpi-problem/dmesg-4.14.8-working
> >>
> >> Patch that I've applied is at
> >>
> >> https://the.earth.li/~noodles/acpi-problem/debug-acpi.diff
> >>
> >
> > Thanks for testing! It's very helpful.
> >
> >> The interesting difference seems to be:
> >>
> >>  PCI: Using ACPI for IRQ routing
> >> +ACPI: Generating event type 208 (:9DBB5994-A997-11DA-B012-B622A1EF5492)
> >> +ERROR: init_net.genl_sock is NULL
> >> +BUG: unable to handle kernel NULL pointer dereference at 0030
> >> +IP: netlink_broadcast_filtered+0x20/0x3d0
> >> +PGD 0 P4D 0
> >> +Oops:  [#1] SMP
> >> +Modules linked in:
> >> +CPU: 0 PID: 29 Comm: kworker/0:1 Not tainted 4.14.8+ #1
> >> +Hardware name: Dell Inc. Latitude E7240/07RPNV, BIOS A22 10/18/2017
> >> +Workqueue: kacpi_notify acpi_os_execute_deferred
> >>
> >> 9DBB5994-A997-11DA-B012-B622A1EF5492 is the Dell WMI event GUID and
> >> there's no visible event for it on a reboot, just on a cold power on.
> >> Some sort of ordering issues such that genl_sock is being initialised
> >> later with the slab change?
> >
> > I have checked that there is an ordering issue.
> >
> > genl_init() which initializes init_net->genl_sock is called on
> > subsys_initcall().
> >
> > acpi_wmi_init() which schedules acpi_wmi_notify_handler() to the
> > workqueue is called on subsys_initcall(), too.
> > (acpi_wmi_notify_handler() -> acpi_bus_generate_netlink_event() ->
> > netlink_broadcast())
> >
> > In my system, acpi_wmi_init() is called before the genl_init().
> > Therefore, if the worker is scheduled before genl_init() is done, NULL
> > dereference would happen.
> 
> Does it help to change the subsys_initcall() in wmi.c to 
> subsys_initcall_sync()?

I guess that it would work. I cannot reproduce the issue, so it needs
to be checked by Jonathan. Jonathan, could you check whether the problem
disappears with the above change?
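
For reference, here is a minimal sketch of the suggested change (untested, and
assuming acpi_wmi_init() is currently registered with a plain subsys_initcall()
in drivers/platform/x86/wmi.c):

/*
 * Register at the _sync level so acpi_wmi_init() runs after plain
 * subsys_initcall()s such as genl_init(), which sets up
 * init_net.genl_sock used by acpi_bus_generate_netlink_event().
 */
subsys_initcall_sync(acpi_wmi_init);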

Thanks.


Re: [lkp-robot] [mm/cma] 2b0f904a5a: fio.read_bw_MBps -16.1% regression

2018-01-02 Thread Joonsoo Kim
Hello!

On Tue, Jan 02, 2018 at 02:35:28PM +0800, kernel test robot wrote:
> 
> Greeting,
> 
> FYI, we noticed a -16.1% regression of fio.read_bw_MBps due to commit:
> 
> 
> commit: 2b0f904a5a8781498417d67226fd12c5e56053ae ("mm/cma: manage the memory 
> of the CMA area by using the ZONE_MOVABLE")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> 
> in testcase: fio-basic
> on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 
> 256G memory
> with following parameters:
> 
>   disk: 2pmem
>   fs: ext4
>   runtime: 200s
>   nr_task: 50%
>   time_based: tb
>   rw: randread
>   bs: 2M
>   ioengine: mmap
>   test_size: 200G
>   cpufreq_governor: performance
> 
> test-description: Fio is a tool that will spawn a number of threads or 
> processes doing a particular type of I/O action as specified by the user.
> test-url: https://github.com/axboe/fio
> 
> 
> 
> Details are as below:
> -->
> 
> 
> To reproduce:
> 
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml  # job file is attached in this email
> bin/lkp run job.yaml
> 
> =
> bs/compiler/cpufreq_governor/disk/fs/ioengine/kconfig/nr_task/rootfs/runtime/rw/tbox_group/test_size/testcase/time_based:
>   
> 2M/gcc-7/performance/2pmem/ext4/mmap/x86_64-rhel-7.2/50%/debian-x86_64-2016-08-31.cgz/200s/randread/lkp-hsw-ep6/200G/fio-basic/tb
> 
> commit: 
>   f6572f9cd2 ("mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE 
> request")
>   2b0f904a5a ("mm/cma: manage the memory of the CMA area by using the 
> ZONE_MOVABLE")
> 
> f6572f9cd248df2c 2b0f904a5a8781498417d67226 
>  -- 
>  %stddev %change %stddev
>  \  |\  
>  11451   -16.1%   9605fio.read_bw_MBps
>   0.29 ±  5%  +0.10.40 ±  3%  fio.latency_1000us%
>  19.35 ±  5%  -4.7   14.69 ±  3%  fio.latency_10ms%
>   7.92 ±  3% +12.2   20.15fio.latency_20ms%
>   0.05 ± 11%  +0.00.09 ±  8%  fio.latency_2ms%
>  70.22-8.9   61.36fio.latency_4ms%
>   0.29 ± 13%  +0.00.33 ±  3%  fio.latency_500us%
>   0.45 ± 29%  +1.01.45 ±  4%  fio.latency_50ms%
>   1.37+0.11.44fio.latency_750us%
>   9792   +31.7%  12896fio.read_clat_90%_us
>  10560   +33.0%  14048fio.read_clat_95%_us
>  15376 ± 10% +46.9%  22592fio.read_clat_99%_us
>   4885   +19.2%   5825fio.read_clat_mean_us
>   5725   -16.1%   4802fio.read_iops
>  4.598e+09   -16.4%  3.845e+09fio.time.file_system_inputs
> 453153-8.4% 415215
> fio.time.involuntary_context_switches
>  5.748e+08   -16.4%  4.806e+08fio.time.major_page_faults
>1822257   +23.7%2254706
> fio.time.maximum_resident_set_size
>   5089+1.6%   5172fio.time.system_time
> 514.50   -16.3% 430.48fio.time.user_time

System time increased and user time decreased. There is a clue in the
data below.

>  24569 ±  2%  +9.6%  26917 ±  2%  
> fio.time.voluntary_context_switches
>   54443725   -14.9%   46353339
> interrupts.CAL:Function_call_interrupts
>   0.00 ± 79%  -0.00.00 ± 17%  mpstat.cpu.iowait%
>   4.45-0.73.71mpstat.cpu.usr%
>1467516   +21.3%1779543 ±  3%  meminfo.Active
>1276031   +23.7%1578443 ±  4%  meminfo.Active(file)
>  25789 ±  3% -76.7%   6013 ±  4%  meminfo.CmaFree
>  1.296e+08   -12.6%  1.133e+08turbostat.IRQ
>  41.89-3.4%  40.47turbostat.RAMWatt
>  17444 ±  2% -13.5%  15092 ±  3%  turbostat.SMI
>   10896428   -16.4%9111830vmstat.io.bi
>   6010-6.2%   5637vmstat.system.cs
> 317438   -12.1% 278980vmstat.system.in
>1072892 ±  3% +21.5%1303487numa-meminfo.node0.Active
> 978318   +21.6%1189809 ±  2%  numa-meminfo.node0.Active(file)
> 222968   -25.2% 166818numa-meminfo.node0.PageTables
>  47374 ±  2% +10.6%  52402 ±  7%  numa-meminfo.node0.SUnreclaim
> 165213   +31.9% 217870numa-meminfo.node1.PageTables
> 222405   +10.4% 245633 ±  2%  numa-meminfo.node1.SReclaimable
> 102992 ± 46% -80.8%  19812 ± 38%  numa-meminfo.node1.Shmem
>  2.475e+08 ±  2% -24.0%  1.881e+08

Re: ACPI issues on cold power on [bisected]

2018-01-01 Thread Joonsoo Kim
On Fri, Dec 29, 2017 at 04:36:59PM +, Jonathan McDowell wrote:
> On Fri, Dec 22, 2017 at 09:21:09AM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 08, 2017 at 03:11:59PM +, Jonathan McDowell wrote:
> > > I've been sitting on this for a while and should have spent time to
> > > investigate sooner, but it's been an odd failure mode that wasn't quite
> > > obvious.
> > > 
> > > In 4.9 if I cold power on my laptop (Dell E7240) it fails to boot - I
> > > don't see anything after grub says its booting. In 4.10 onwards the
> > > laptop boots, but I get an Oops as part of the boot and ACPI is unhappy
> > > (no suspend, no clean poweroff, no ACPI buttons). The Oops is below;
> > > taken from 4.12 as that's the most recent error dmesg I have saved but
> > > also seen back in 4.10. It's always address 0x30 for the dereference.
> > > 
> > > Rebooting the laptop does not lead to these problems; it's *only* from a
> > > complete cold boot that they arise (which didn't help me in terms of
> > > being able to reliably bisect). Once I realised that I was able to
> > > bisect, but it leads me to an odd commit:
> > > 
> > > 86d9f48534e800e4d62cdc1b5aaf539f4c1d47d6
> > > (mm/slab: fix kmemcg cache creation delayed issue)
> > > 
> > > If I revert this then I can cold boot without problems.
> > > 
> > > Also I don't see the problem with a stock Debian kernel, I think because
> > > the ACPI support is modularised.
> > 
> > Sorry for late response. I was on a long vacation.
> 
> No problem. I've been trying to get around to diagnosing this for a
> while now anyway and this isn't a great time of year for fast responses.
> 
> > I have tried to solve the problem however I don't find any clue yet.
> > 
> From my analysis, oops report shows that 'struct sock *ssk' passed to
> > netlink_broadcast_filtered() is NULL. It means that some of
> > netlink_kernel_create() returns NULL. Maybe, it is due to slab
> > allocation failure. Could you check it by inserting some log on that
> > part? The issue cannot be reproducible in my side so I need your help.
> 
> I've added some debug in acpi_bus_generate_netlink_event +
> genlmsg_multicast and the problem seems to be that genlmsg_multicast is
> getting called when init_net.genl_sock has not yet been initialised,
> leading to the NULL deference.
> 
> Full dmesg output from a cold 4.14.8 boot at:
> 
> https://the.earth.li/~noodles/acpi-problem/dmesg-4.14.8-broken
> 
> And the same kernel after a reboot ("shutdown -r now"):
> 
> https://the.earth.li/~noodles/acpi-problem/dmesg-4.14.8-working
> 
> Patch that I've applied is at
> 
> https://the.earth.li/~noodles/acpi-problem/debug-acpi.diff
> 

Thanks for testing! It's very helpful.

> The interesting difference seems to be:
> 
>  PCI: Using ACPI for IRQ routing
> +ACPI: Generating event type 208 (:9DBB5994-A997-11DA-B012-B622A1EF5492)
> +ERROR: init_net.genl_sock is NULL
> +BUG: unable to handle kernel NULL pointer dereference at 0030
> +IP: netlink_broadcast_filtered+0x20/0x3d0
> +PGD 0 P4D 0 
> +Oops:  [#1] SMP
> +Modules linked in:
> +CPU: 0 PID: 29 Comm: kworker/0:1 Not tainted 4.14.8+ #1
> +Hardware name: Dell Inc. Latitude E7240/07RPNV, BIOS A22 10/18/2017
> +Workqueue: kacpi_notify acpi_os_execute_deferred
> 
> 9DBB5994-A997-11DA-B012-B622A1EF5492 is the Dell WMI event GUID and
> there's no visible event for it on a reboot, just on a cold power on.
> Some sort of ordering issues such that genl_sock is being initialised
> later with the slab change?

I have checked that there is an ordering issue.

genl_init() which initializes init_net->genl_sock is called on
subsys_initcall().

acpi_wmi_init() which schedules acpi_wmi_notify_handler() to the
workqueue is called on subsys_initcall(), too.
(acpi_wmi_notify_handler() -> acpi_bus_generate_netlink_event() ->
netlink_broadcast())

In my system, acpi_wmi_init() is called before genl_init().
Therefore, if the worker is scheduled before genl_init() is done, a NULL
dereference would happen.

Although the slab change revealed this problem, I think the problem is on the
ACPI side and needs to be fixed there.
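
As an illustration only (not a proposed final fix), an ACPI-side guard against
this ordering problem could look roughly like this near the top of
acpi_bus_generate_netlink_event():

	/* generic netlink may not be up yet during early boot */
	if (!init_net.genl_sock) {
		pr_warn("ACPI: generic netlink not ready, dropping event\n");
		return -ENODEV;
	}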

Anyway, I'm not sure why it didn't happen before. This ACPI
initialization code doesn't look like it has changed for a long time. Could
you test this problem with SLUB?

Thanks.


Re: [PATCH 00/18] introduce a new tool, valid access checker

2017-12-21 Thread Joonsoo Kim
On Tue, Nov 28, 2017 at 04:48:35PM +0900, js1...@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo@lge.com>
> 
> Hello,
> 
> This patchset introduces a new tool, valid access checker.
> 
> Vchecker is a dynamic memory error detector. It provides a new debug feature
> that can find out an un-intended access to valid area. Valid area here means
> the memory which is allocated and allowed to be accessed by memory owner and
> un-intended access means the read/write that is initiated by non-owner.
> Usual problem of this class is memory overwritten.
> 
> Most of debug feature focused on finding out un-intended access to
> in-valid area, for example, out-of-bound access and use-after-free, and,
> there are many good tools for it. But, as far as I know, there is no good tool
> to find out un-intended access to valid area. This kind of problem is really
> hard to solve so this tool would be very useful.
> 
> This tool doesn't automatically catch a problem. Manual runtime configuration
> to specify the target object is required.
> 
> Note that there was a similar attempt for the debugging overwritten problem
> however it requires manual code modifying and recompile.
> 
> http://lkml.kernel.org/r/<20171117223043.7277-1-wen.gang.w...@oracle.com>
> 
> To get more information about vchecker, please see a documention at
> the last patch.
> 
> Patchset can also be available at
> 
> https://github.com/JoonsooKim/linux/tree/vchecker-master-v1.0-next-20171122
> 
> Enjoy it.
> 
> Thanks.

Hello, Andrew.

Before fixing some build failures in this patchset, I'd like to know
other reviewers' opinions on it, especially yours. :)

There is some interest in this patchset from other developers. Wengang
came up with a very similar change, and Andi said that it looks useful.
Do you think this tool is useful and can be merged?

Thanks.


Re: ACPI issues on cold power on [bisected]

2017-12-21 Thread Joonsoo Kim
On Fri, Dec 08, 2017 at 03:11:59PM +, Jonathan McDowell wrote:
> I've been sitting on this for a while and should have spent time to
> investigate sooner, but it's been an odd failure mode that wasn't quite
> obvious.
> 
> In 4.9 if I cold power on my laptop (Dell E7240) it fails to boot - I
> don't see anything after grub says its booting. In 4.10 onwards the
> laptop boots, but I get an Oops as part of the boot and ACPI is unhappy
> (no suspend, no clean poweroff, no ACPI buttons). The Oops is below;
> taken from 4.12 as that's the most recent error dmesg I have saved but
> also seen back in 4.10. It's always address 0x30 for the dereference.
> 
> Rebooting the laptop does not lead to these problems; it's *only* from a
> complete cold boot that they arise (which didn't help me in terms of
> being able to reliably bisect). Once I realised that I was able to
> bisect, but it leads me to an odd commit:
> 
> 86d9f48534e800e4d62cdc1b5aaf539f4c1d47d6
> (mm/slab: fix kmemcg cache creation delayed issue)
> 
> If I revert this then I can cold boot without problems.
> 
> Also I don't see the problem with a stock Debian kernel, I think because
> the ACPI support is modularised.

Hello,

Sorry for the late response. I was on a long vacation.

I have tried to solve the problem; however, I haven't found any clue yet.

From my analysis, the oops report shows that the 'struct sock *ssk' passed to
netlink_broadcast_filtered() is NULL. It means that one of the
netlink_kernel_create() calls returned NULL. Maybe it is due to a slab
allocation failure. Could you check it by inserting some logging in that
part? The issue is not reproducible on my side, so I need your help.

Thanks.



Re: [PATCH v2 0/3] mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE

2017-12-21 Thread Joonsoo Kim
On Fri, Dec 08, 2017 at 02:37:19PM -0800, Andrew Morton wrote:
> On Fri,  1 Dec 2017 16:53:03 +0900 js1...@gmail.com wrote:
> 
> > From: Joonsoo Kim <iamjoonsoo@lge.com>
> > 
> > v2
> > o previous failure in linux-next turned out that it's not the problem of
> > this patchset. It was caused by the wrong assumption by specific
> > architecture.
> > 
> > lkml.kernel.org/r/20171114173719.ga28...@atomide.com
> > 
> > o add missing cache flush to the patch "ARM: CMA: avoid double mapping
> > to the CMA area if CONFIG_HIGHMEM = y"
> > 
> > 
> > This patchset is the follow-up of the discussion about the
> > "Introduce ZONE_CMA (v7)" [1]. Please reference it if more information
> > is needed.
> > 
> > In this patchset, the memory of the CMA area is managed by using
> > the ZONE_MOVABLE. Since there is another type of the memory in this zone,
> > we need to maintain a migratetype for the CMA memory to account
> > the number of the CMA memory. So, unlike previous patchset, there is
> > less deletion of the code.
> > 
> > Otherwise, there is no big change.
> > 
> > Motivation of this patchset is described in the commit description of
> > the patch "mm/cma: manage the memory of the CMA area by using
> > the ZONE_MOVABLE". Please refer it for more information.
> > 
> > This patchset is based on linux-next-20170822 plus
> > "mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE".
> 
> mhocko had issues with that patch which weren't addressed?
> http://lkml.kernel.org/r/20170914132452.d5klyizce72rh...@dhcp22.suse.cz

Hello, Andrew.

Sorry for the late response. I was on a long vacation.

I haven't done anything on that patch yet. In fact, that patch isn't really
necessary for this patchset, so I didn't include it here.

I will re-submit that patch after fixing the issue.

Thanks.


Re: [PATCH 02/18] vchecker: introduce the valid access checker

2017-12-01 Thread Joonsoo Kim
On Fri, Dec 01, 2017 at 01:08:13PM +0800, kbuild test robot wrote:
> Hi Joonsoo,
> 
> I love your patch! Yet something to improve:

Thanks! I will fix all the errors from the kbuild bot in the next spin.

Thanks.



Re: [PATCH 00/18] introduce a new tool, valid access checker

2017-11-30 Thread Joonsoo Kim
On Wed, Nov 29, 2017 at 10:27:00AM +0100, Dmitry Vyukov wrote:
> On Tue, Nov 28, 2017 at 8:48 AM,  <js1...@gmail.com> wrote:
> > From: Joonsoo Kim <iamjoonsoo@lge.com>
> >
> > Hello,
> >
> > This patchset introduces a new tool, valid access checker.
> >
> > Vchecker is a dynamic memory error detector. It provides a new debug feature
> > that can find out an un-intended access to valid area. Valid area here means
> > the memory which is allocated and allowed to be accessed by memory owner and
> > un-intended access means the read/write that is initiated by non-owner.
> > Usual problem of this class is memory overwritten.
> >
> > Most of debug feature focused on finding out un-intended access to
> > in-valid area, for example, out-of-bound access and use-after-free, and,
> > there are many good tools for it. But, as far as I know, there is no good 
> > tool
> > to find out un-intended access to valid area. This kind of problem is really
> > hard to solve so this tool would be very useful.
> >
> > This tool doesn't automatically catch a problem. Manual runtime 
> > configuration
> > to specify the target object is required.
> >
> > Note that there was a similar attempt for the debugging overwritten problem
> > however it requires manual code modifying and recompile.
> >
> > http://lkml.kernel.org/r/<20171117223043.7277-1-wen.gang.w...@oracle.com>
> >
> > To get more information about vchecker, please see a documention at
> > the last patch.
> >
> > Patchset can also be available at
> >
> > https://github.com/JoonsooKim/linux/tree/vchecker-master-v1.0-next-20171122
> >
> > Enjoy it.
> 
> 
> Hi Joonsoo,
> 
> I skimmed through the code and this looks fine from KASAN point of
> view (minimal code changes and no perf impact).
> I don't feel like I can judge if this should go in or not. I will not
> use this, we use KASAN for large-scale testing, but vchecker is in a
> different bucket, it is meant for developers debugging hard bugs.
> Wengang come up with a very similar change, and Andi said that this
> looks useful.

Thanks for the comment.

Hello, other reviewers!
Please share your opinions about this patchset.

> 
> If the decision is that this goes in, please let me take a closer look
> before this is merged.

I will let you know when the decision is made.

Thanks.



Re: [PATCH] mm, compaction: direct freepage allocation for async direct compaction

2017-11-28 Thread Joonsoo Kim
On Wed, Nov 22, 2017 at 03:52:55PM +0100, Vlastimil Babka wrote:
> On 11/22/2017 03:33 PM, Johannes Weiner wrote:
> > From: Vlastimil Babka 
> > 
> > The goal of direct compaction is to quickly make a high-order page available
> > for the pending allocation. The free page scanner can add significant 
> > latency
> > when searching for migration targets, although to succeed the compaction, 
> > the
> > only important limit on the target free pages is that they must not come 
> > from
> > the same order-aligned block as the migrated pages.
> > 
> > This patch therefore makes direct async compaction allocate freepages 
> > directly
> > from freelists. Pages that do come from the same block (which we cannot 
> > simply
> > exclude from the freelist allocation) are put on separate list and released
> > only after migration to allow them to merge.
> > 
> > In addition to reduced stall, another advantage is that we split larger free
> > pages for migration targets only when smaller pages are depleted, while the
> > free scanner can split pages up to (order - 1) as it encouters them. 
> > However,
> > this approach likely sacrifices some of the long-term anti-fragmentation
> > features of a thorough compaction, so we limit the direct allocation 
> > approach
> > to direct async compaction.
> > 
> > For observational purposes, the patch introduces two new counters to
> > /proc/vmstat. compact_free_direct_alloc counts how many pages were allocated
> > directly without scanning, and compact_free_direct_miss counts the subset of
> > these allocations that were from the wrong range and had to be held on the
> > separate list.
> > 
> > Signed-off-by: Vlastimil Babka 
> > Signed-off-by: Johannes Weiner 
> > ---
> > 
> > Hi. I'm resending this because we've been struggling with the cost of
> > compaction in our fleet, and this patch helps substantially.
> > 
> > On 128G+ machines, we have seen isolate_freepages_block() eat up 40%
> > of the CPU cycles and scanning up to a billion PFNs per minute. Not in
> > a spike, but continuously, to service higher-order allocations from
> > the network stack, fork (non-vmap stacks), THP, etc. during regular
> > operation.
> > 
> > I've been running this patch on a handful of less-affected but still
> > pretty bad machines for a week, and the results look pretty great:
> > 
> > http://cmpxchg.org/compactdirectalloc/compactdirectalloc.png
> 
> Thanks a lot, that's very encouraging!
> 
> > 
> > Note the two different scales - otherwise the compact_free_direct
> > lines wouldn't be visible. The free scanner peaks close to 10M pages
> > checked per minute, whereas the direct allocations peak at under 180
> > per minute, direct misses at 50.
> > 
> > The work doesn't increase over this period, which is a good sign that
> > long-term we're not trending toward worse fragmentation.
> > 
> > There was an outstanding concern from Joonsoo regarding this patch -
> > https://marc.info/?l=linux-mm=146035962702122=2 - although that
> > didn't seem to affect us much in practice.
> 
> That concern would be easy to fix, but I was also concerned that if there
> are multiple direct compactions in parallel, they might keep too many free
> pages isolated away. Recently I resumed work on this and come up with a
> different approach, where I put the pages immediately back on tail of
> free lists. There might be some downside in more "direct misses".
> Also I didn't plan to restrict this to async compaction anymore, because
> if it's a better way, we should use it everywhere. So here's how it
> looks like now (only briefly tested), we could compare and pick the better
> approach, or go with the older one for now and potentially change it later.

IMHO, "good bye free scanner" is a way to go. My major concern is that
co-existence of two different compaction algorithms make the system
behaviour less predictable and make debugging hard.

And, this compaction stall is immediate and actual problem reported
many times unlike theoretical long term fragmentation which current
freepage scanner try to prevent.

Thanks.



Re: [PATCH] mm, compaction: direct freepage allocation for async direct compaction

2017-11-28 Thread Joonsoo Kim
On Thu, Nov 23, 2017 at 02:08:43PM +, Mel Gorman wrote:

> 3. Another reason a linear scanner was used was because we wanted to
>clear entire pageblocks we were migrating from and pack the target
>pageblocks as much as possible. This was to reduce the amount of
>migration required overall even though the scanning hurts. This patch
>takes MIGRATE_MOVABLE pages from anywhere that is "not this pageblock".
>Those potentially have to be moved again and again trying to randomly
>fill a MIGRATE_MOVABLE block. Have you considered using the freelists
>as a hint? i.e. take a page from the freelist, then isolate all free
>pages in the same pageblock as migration targets? That would preserve
>the "packing property" of the linear scanner.
> 
>This would increase the amount of scanning but that *might* be offset by
>the number of migrations the workload does overall. Note that migrations
>potentially are minor faults so if we do too many migrations, your
>workload may suffer.
> 
> 4. One problem the linear scanner avoids is that a migration target is
>subsequently used as a migration source and leads to a ping-pong effect.
>I don't know how bad this is in practice or even if it's a problem at
>all but it was a concern at the time

IIUC, this potential advantage of a linear scanner would not be an
actual advantage in a *running* system.

Consider the following worst-case scenario for "direct freepage
allocation", where "moved again" happens.

__M1___F1_F2__F3___

M: migration source page (the number denotes the ID of the page)
F: freepage (the number denotes the sequence in the freelist of the buddy)
_: other pages
migration scanner: move from left to right.

If migration happens with "direct freepage allocation", memory state
will be changed to:

__F?___M1_F2__F3___

And then, please assume that there is another movable allocation
before the next migration. It's a reasonable assumption since there are
really many movable allocations in a *running* system.

__F?___M1_M2__F3___

If migration happens again, memory state will be:

__F?___F?_M2__M1___

M1 is moved twice and overall number of migration is two.

Now, think about a linear scanner. With the same scenario, the
memory state will change as follows.

__M1___F1_F2__F3___
__F?___F1_F2__M1___
__F?___M2_F2__M1___
__F?___F?_M2__M1___

Although "move again" doesn't happen in a linear scanner, final memory
state is the same and the same number of migration happens.

So, I think that "direct freepage allocation" doesn't suffer from such
a ping-poing effect. Am I missing something?

Thanks.


Re: [PATCH v2] mm/cma: fix alloc_contig_range ret code/potential leak

2017-11-28 Thread Joonsoo Kim
On Wed, Nov 22, 2017 at 10:52:14AM -0800, Mike Kravetz wrote:
> If the call __alloc_contig_migrate_range() in alloc_contig_range
> returns -EBUSY, processing continues so that test_pages_isolated()
> is called where there is a tracepoint to identify the busy pages.
> However, it is possible for busy pages to become available between
> the calls to these two routines.  In this case, the range of pages
> may be allocated.   Unfortunately, the original return code (ret
> == -EBUSY) is still set and returned to the caller.  Therefore,
> the caller believes the pages were not allocated and they are leaked.
> 
> Update comment to indicate that allocation is still possible even if
> __alloc_contig_migrate_range returns -EBUSY.  Also, clear return code
> in this case so that it is not accidentally used or returned to caller.
> 
> Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation 
> failure")
> Cc: <sta...@vger.kernel.org>
> Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>

Good catch!!
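
For context, the flow being fixed is roughly the following (a sketch of
alloc_contig_range() in mm/page_alloc.c, not the exact patch):

	ret = __alloc_contig_migrate_range(&cc, start, end);
	if (ret && ret != -EBUSY)
		goto done;
	/*
	 * -EBUSY pages may have become free in the meantime, so clear ret
	 * here; test_pages_isolated() below decides the final outcome.
	 */
	ret = 0;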

Acked-by: Joonsoo Kim <iamjoonsoo@lge.com>

Thanks.


Re: [PATCH 02/18] vchecker: introduce the valid access checker

2017-11-28 Thread Joonsoo Kim
On Tue, Nov 28, 2017 at 11:41:08AM -0800, Andi Kleen wrote:
> js1...@gmail.com writes:
> 
> > From: Joonsoo Kim <iamjoonsoo@lge.com>
> 
> Looks useful. Essentially unlimited hardware break points, combined
> with slab.

Thanks!!!

> 
> Didn't do a full review, but noticed some things below.
> > +
> > +   buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> > +   if (!buf)
> > +   return -ENOMEM;
> > +
> > +   if (copy_from_user(buf, ubuf, cnt)) {
> > +   kfree(buf);
> > +   return -EFAULT;
> > +   }
> > +
> > +   if (isspace(buf[0]))
> > +   remove = true;
> 
> and that may be uninitialized.

I will add a 'cnt == 0' check above.

> and the space changes the operation? That's a strange syntax.

The intention is to clear all the previous configuration when the user
input is '\n'. I will fix it by checking for '\n' directly.

> 
> > +   buf[cnt - 1] = '\0';
> 
> That's an underflow of one byte if cnt is 0.

I will add a 'cnt == 0' check above.

The string parsing part in this patchset does not work properly when the
last input character is not '\n'. I will fix it in the next spin.
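
To make the planned fixes concrete, the parsing prologue would end up looking
something like the sketch below (the function name and surrounding details are
illustrative, not the actual next version of the patch):

static ssize_t vchecker_config_write(struct file *filp,
				     const char __user *ubuf,
				     size_t cnt, loff_t *ppos)
{
	char *buf;
	bool remove = false;

	/* reject empty and oversized writes before touching buf[cnt - 1] */
	if (!cnt || cnt > PAGE_SIZE)
		return -EINVAL;

	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	if (copy_from_user(buf, ubuf, cnt)) {
		kfree(buf);
		return -EFAULT;
	}

	buf[cnt - 1] = '\0';		/* safe: cnt >= 1 is guaranteed */
	if (buf[0] == '\n')		/* bare newline clears the config */
		remove = true;

	/* ... parse 'buf' and apply the config, or clear it if 'remove' ... */

	kfree(buf);
	return cnt;
}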

Thanks.


Re: n900 in next-20170901

2017-11-14 Thread Joonsoo Kim
On Tue, Nov 14, 2017 at 06:04:00PM -0800, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171115 00:48]:
> > On Tue, Nov 14, 2017 at 09:37:19AM -0800, Tony Lindgren wrote:
> > > * Joonsoo Kim <iamjoonsoo@lge.com> [171114 06:34]:
> > > > On Fri, Nov 10, 2017 at 07:36:20AM -0800, Tony Lindgren wrote:
> > > > > * Joonsoo Kim <iamjoonsoo@lge.com> [171110 06:34]:
> > > > > > On Thu, Nov 09, 2017 at 07:26:10PM -0800, Tony Lindgren wrote:
> > > > > > > +#define OMAP34XX_SRAM_PHYS   0x4020
> > > > > > > +#define OMAP34XX_SRAM_VIRT   0xd001
> > > > > > > +#define OMAP34XX_SRAM_SIZE   0x1
> > > > > > 
> > > > > > For my testing environment, vmalloc address space is started at
> > > > > > roughly 0xe000 so 0xd001 would not be valid.
> > > > > 
> > > > > Well we can map it anywhere we want, got any preferences?
> > > > 
> > > > My testing environment is a beagle-(xm?) for QEMU. It is configured by
> > > > CONFIG_VMSPLIT_3G=y so kernel address space is started at 0xc000.
> > > > And, it has 512 MB memory so 0xc000 ~ 0xdff0 is used for
> > > > direct mapping. See below.
> > > > 
> > > > [0.00] Memory: 429504K/522240K available (11264K kernel code,
> > > > 1562K rwdata, 4288K rodata, 2048K init, 405K bss, 27200K reserved,
> > > > 65536K cma-reserved, 0K highmem)
> > > > [0.00] Virtual kernel memory layout:
> > > > [0.00] vector  : 0x - 0x1000   (   4 kB)
> > > > [0.00] fixmap  : 0xffc0 - 0xfff0   (3072 kB)
> > > > [0.00] vmalloc : 0xe000 - 0xff80   ( 504 MB)
> > > > [0.00] lowmem  : 0xc000 - 0xdff0   ( 511 MB)
> > > > [0.00] pkmap   : 0xbfe0 - 0xc000   (   2 MB)
> > > > [0.00] modules : 0xbf00 - 0xbfe0   (  14 MB)
> > > > [0.00]   .text : 0xc0208000 - 0xc0e0   (12256 kB)
> > > > [0.00]   .init : 0xc130 - 0xc150   (2048 kB)
> > > > [0.00]   .data : 0xc150 - 0xc1686810   (1563 kB)
> > > > [0.00].bss : 0xc168fc68 - 0xc16f512c   ( 406 kB)
> > > > 
> > > > Therefore, if OMAP34XX_SRAM_VIRT is 0xd001, direct mapping is
> > > > broken and the system doesn't work. I guess that we should use more
> > > > stable address like as 0xf000.
> > > 
> > > OK. Let's forget about adding static mappings and my earlier
> > > patches to attempt to fix this. Below is what I now think we should
> > > merge as a fix before merging Joonsoo's patches. Please all review
> > > and test, adding Tero to Cc also.
> > 
> > Okay. Then, this patch will be merged by yourself as a fix? I'm okay
> > with either way. Just let me know. :)
> 
> Well what's the plan with your patches? I'd not have mainline
> kernel broken so if this was the only blocker for the v4.15,
> then you should pick it.
> 
> However, if your patches are now planned for v4.16, then I'll
> queue it for early v4.15-rc.

It was the only blocker but I think that it's too late for v4.15. I
will try again for v4.16. Please queue your fix for early v4.15-rc.

Thanks.


Re: n900 in next-20170901

2017-11-14 Thread Joonsoo Kim
On Tue, Nov 14, 2017 at 09:37:19AM -0800, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171114 06:34]:
> > On Fri, Nov 10, 2017 at 07:36:20AM -0800, Tony Lindgren wrote:
> > > * Joonsoo Kim <iamjoonsoo@lge.com> [171110 06:34]:
> > > > On Thu, Nov 09, 2017 at 07:26:10PM -0800, Tony Lindgren wrote:
> > > > > +#define OMAP34XX_SRAM_PHYS   0x4020
> > > > > +#define OMAP34XX_SRAM_VIRT   0xd001
> > > > > +#define OMAP34XX_SRAM_SIZE   0x1
> > > > 
> > > > For my testing environment, vmalloc address space is started at
> > > > roughly 0xe000 so 0xd001 would not be valid.
> > > 
> > > Well we can map it anywhere we want, got any preferences?
> > 
> > My testing environment is a beagle-(xm?) for QEMU. It is configured by
> > CONFIG_VMSPLIT_3G=y so kernel address space is started at 0xc000.
> > And, it has 512 MB memory so 0xc000 ~ 0xdff0 is used for
> > direct mapping. See below.
> > 
> > [0.00] Memory: 429504K/522240K available (11264K kernel code,
> > 1562K rwdata, 4288K rodata, 2048K init, 405K bss, 27200K reserved,
> > 65536K cma-reserved, 0K highmem)
> > [0.00] Virtual kernel memory layout:
> > [0.00] vector  : 0x - 0x1000   (   4 kB)
> > [0.00] fixmap  : 0xffc0 - 0xfff0   (3072 kB)
> > [0.00] vmalloc : 0xe000 - 0xff80   ( 504 MB)
> > [0.00] lowmem  : 0xc000 - 0xdff0   ( 511 MB)
> > [0.00] pkmap   : 0xbfe0 - 0xc000   (   2 MB)
> > [0.00] modules : 0xbf00 - 0xbfe0   (  14 MB)
> > [0.00]   .text : 0xc0208000 - 0xc0e0   (12256 kB)
> > [0.00]   .init : 0xc130 - 0xc150   (2048 kB)
> > [0.00]   .data : 0xc150 - 0xc1686810   (1563 kB)
> > [0.00].bss : 0xc168fc68 - 0xc16f512c   ( 406 kB)
> > 
> > Therefore, if OMAP34XX_SRAM_VIRT is 0xd001, direct mapping is
> > broken and the system doesn't work. I guess that we should use more
> > stable address like as 0xf000.
> 
> OK. Let's forget about adding static mappings and my earlier
> patches to attempt to fix this. Below is what I now think we should
> merge as a fix before merging Joonsoo's patches. Please all review
> and test, adding Tero to Cc also.

Okay. Then, this patch will be merged by yourself as a fix? I'm okay
with either way. Just let me know. :)

Thanks.


Re: n900 in next-20170901

2017-11-13 Thread Joonsoo Kim
On Mon, Nov 13, 2017 at 01:15:30PM -0800, Tony Lindgren wrote:
> * Tony Lindgren <t...@atomide.com> [171110 07:36]:
> > * Joonsoo Kim <iamjoonsoo@lge.com> [171110 06:34]:
> > > On Thu, Nov 09, 2017 at 07:26:10PM -0800, Tony Lindgren wrote:
> > > > +#define OMAP34XX_SRAM_PHYS 0x4020
> > > > +#define OMAP34XX_SRAM_VIRT 0xd001
> > > > +#define OMAP34XX_SRAM_SIZE 0x1
> > > 
> > > For my testing environment, vmalloc address space is started at
> > > roughly 0xe000 so 0xd001 would not be valid.
> > 
> > Well we can map it anywhere we want, got any preferences?
> 
> Hmm and I'm also wondering what you do to make vmalloc space to
> start at 0xe000 instead of 0xd000?

Please see the other reply.

> 
> The reason I'm asking is because I think we can just move all of
> save_secure_ram_context to run from DDR instead of SRAM. But I'd
> rather do a minimal patch first that fixes your series and then we
> can test the further changes with more time.

Okay. I agree to make a minimal patch first and then go to the next step.

> After moving save_secure_ram_context to DDR, we can call
> save_secure_ram_context directly with something like:
> 
>   args_pa = __pa(omap3_secure_ram_storage);
>   offset = (unsigned long)omap3_secure_ram_storage - args_pa;
>   ret = save_secure_ram_context(args_pa, offset);
> 
> > Just that the current save_secure_ram_context uses "high_mask"
> > of 0x to translate the address. To make this more flexible,
> > we need the save_secure_ram_context changes too. So we might
> > want to do the static mapping and save_secure_ram_context changes
> > as a single patch.
> > 
> > > And, PHYS can be different according to the system type. Maybe either
> > > OMAP3_SRAM_PUB_PA or OMAP3_SRAM_PA. It seems that SIZE and TYPE should
> > > be considered, too. My understanding is correct?
> > 
> > We can have a static map for the whole SRAM area, see function
> > __arm_ioremap_pfn_caller() for the comment "Try to reuse one of the
> > static mapping whenever possible". So the different public SRAM start
> > addresses and sizes don't matter there.
> 
> And then if save_secure_ram_contet runs in DDR, no static map is
> needed.

Okay.

Thanks.



Re: n900 in next-20170901

2017-11-13 Thread Joonsoo Kim
On Fri, Nov 10, 2017 at 07:36:20AM -0800, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171110 06:34]:
> > On Thu, Nov 09, 2017 at 07:26:10PM -0800, Tony Lindgren wrote:
> > > +#define OMAP34XX_SRAM_PHYS   0x4020
> > > +#define OMAP34XX_SRAM_VIRT   0xd001
> > > +#define OMAP34XX_SRAM_SIZE   0x1
> > 
> > For my testing environment, vmalloc address space is started at
> > roughly 0xe000 so 0xd001 would not be valid.
> 
> Well we can map it anywhere we want, got any preferences?

My testing environment is a beagle-(xm?) for QEMU. It is configured by
CONFIG_VMSPLIT_3G=y so kernel address space is started at 0xc000.
And, it has 512 MB memory so 0xc000 ~ 0xdff0 is used for
direct mapping. See below.

[0.00] Memory: 429504K/522240K available (11264K kernel code,
1562K rwdata, 4288K rodata, 2048K init, 405K bss, 27200K reserved,
65536K cma-reserved, 0K highmem)
[0.00] Virtual kernel memory layout:
[0.00] vector  : 0x - 0x1000   (   4 kB)
[0.00] fixmap  : 0xffc0 - 0xfff0   (3072 kB)
[0.00] vmalloc : 0xe000 - 0xff80   ( 504 MB)
[0.00] lowmem  : 0xc000 - 0xdff0   ( 511 MB)
[0.00] pkmap   : 0xbfe0 - 0xc000   (   2 MB)
[0.00] modules : 0xbf00 - 0xbfe0   (  14 MB)
[0.00]   .text : 0xc0208000 - 0xc0e0   (12256 kB)
[0.00]   .init : 0xc130 - 0xc150   (2048 kB)
[0.00]   .data : 0xc150 - 0xc1686810   (1563 kB)
[0.00].bss : 0xc168fc68 - 0xc16f512c   ( 406 kB)

Therefore, if OMAP34XX_SRAM_VIRT is 0xd001, direct mapping is
broken and the system doesn't work. I guess that we should use more
stable address like as 0xf000.

> 
> Just that the current save_secure_ram_context uses "high_mask"
> of 0x to translate the address. To make this more flexible,
> we need the save_secure_ram_context changes too. So we might
> want to do the static mapping and save_secure_ram_context changes
> as a single patch.
> 
> > And, PHYS can be different according to the system type. Maybe either
> > OMAP3_SRAM_PUB_PA or OMAP3_SRAM_PA. It seems that SIZE and TYPE should
> > be considered, too. My understanding is correct?
> 
> We can have a static map for the whole SRAM area, see function
> __arm_ioremap_pfn_caller() for the comment "Try to reuse one of the
> static mapping whenever possible". So the different public SRAM start
> addresses and sizes don't matter there.

Okay. Looks fine with respect to the SRAM start addresses and sizes.
However, we need to consider mtype, since __arm_ioremap_pfn_caller()
doesn't reuse the mapping if the mtype is different. mtype can be either
MT_MEMORY_RWX or MT_MEMORY_RWX_NONCACHED.

Thanks.


Re: n900 in next-20170901

2017-11-09 Thread Joonsoo Kim
On Thu, Nov 09, 2017 at 10:23:40PM -0800, Tony Lindgren wrote:
> * Tony Lindgren  [171109 22:19]:
> > * Tony Lindgren  [171110 03:28]:
> > > Then I'll follow up on cleaning up save_secure_ram_context later.
> > 
> > Here's a better version, the static mapping did not get used.. It
> > just moved the area so it happened to work. It needs to be set
> > up as MT_MEMORY_RWX_NONCACHED instead.
> 

I see the better version now. Hmm... I guess that it also has the
problem that I mentioned for the first version.

> And FYI, here's what I currently have for the follow-up patch,
> but that can wait a bit.

Okay. So, this patch should be applied on top of the above better version?

Thanks.



Re: n900 in next-20170901

2017-11-09 Thread Joonsoo Kim
On Thu, Nov 09, 2017 at 07:26:10PM -0800, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171110 00:10]:
> > On Thu, Nov 09, 2017 at 07:08:54AM -0800, Tony Lindgren wrote:
> > > Hmm OK. Does your first patch above now have the initcall issue too?
> > > It boots if I make that also subsys_initcall and then I get:
> > 
> > > [2.078094] vmalloc_pool_init: DMA: get vmalloc area: d001
> > 
> > Yes, first patch has the initcall issue and it's intentional in order
> > to check the theory. I checked following log for this.
> > 
> > - Boot failure
> > SRAM_ADDR: omap_map_sram: P: 0x40208000 - 0x4020f000
> > SRAM_ADDR: omap_map_sram: V: 0xd005 - 0xd0057000
> > 
> > - Boot success
> > SRAM_ADDR: omap_map_sram: P: 0x40208000 - 0x4020f000
> > SRAM_ADDR: omap_map_sram: V: 0xd0008000 - 0xd000f000
> > 
> > When failure, virtual address for sram is higher than normal one due
> > to vmalloc area allocation in __dma_alloc_remap(). If it is deferred,
> > virtual address is the same with success case and then the system work.
> > 
> > So, my next theory is that there is n900 specific assumption that sram
> > should have that address. Could you check if any working tree for n900
> > which doesn't have my CMA series work or not with adding
> > "arm/dma: vmalloc area allocation"?
> 
> Oh I see, sorry I was not following you earlier. So you mean that
> by adding the vmalloc_pool_init() initcall the va mapping for SRAM
> changes.

Exactly.

> 
> And yes, save_secure_ram_context seems to be doing some sketchy
> virt to phys calculation with sram_phy_addr_mask. Here's a small
> patch to fix that for your CMA series, maybe you can merge it
> with your series to avoid breaking booting for git bisect.
> 
> Then I'll follow up on cleaning up save_secure_ram_context later.

Thanks for the patch. However, the patch should be modified. See below.

> Regards,
> 
> Tony
> 
> 8< -
> >From tony Mon Sep 17 00:00:00 2001
> From: Tony Lindgren <t...@atomide.com>
> Date: Thu, 9 Nov 2017 17:05:34 -0800
> Subject: [PATCH] ARM: OMAP2+: Add static SRAM mapping for
>  save_secure_ram_context
> 
> With the CMA changes from Joonsoo Kim <iamjoonsoo@lge.com>, it
> was noticed that n900 stopped booting. After investigating it turned
> out that n900 save_secure_ram_context does some whacky virtual to
> physical address translation for the SRAM data address.
> 
> Let's fix this for CMA changes by adding a static mapping for SRAM
> on omap3. Then we can follow up with a patch to clean up the address
> translation in save_secure_ram_context later on.
> 
> Debugged-by: Joonsoo Kim <iamjoonsoo@lge.com>
> Signed-off-by: Tony Lindgren <t...@atomide.com>
> ---
>  arch/arm/mach-omap2/io.c| 6 ++
>  arch/arm/mach-omap2/iomap.h | 4 
>  2 files changed, 10 insertions(+)
> 
> diff --git a/arch/arm/mach-omap2/io.c b/arch/arm/mach-omap2/io.c
> --- a/arch/arm/mach-omap2/io.c
> +++ b/arch/arm/mach-omap2/io.c
> @@ -139,6 +139,12 @@ static struct map_desc omap243x_io_desc[] __initdata = {
>  
>  #ifdef   CONFIG_ARCH_OMAP3
>  static struct map_desc omap34xx_io_desc[] __initdata = {
> + {
> + .virtual= OMAP34XX_SRAM_VIRT,
> + .pfn= __phys_to_pfn(OMAP34XX_SRAM_PHYS),
> + .length = OMAP34XX_SRAM_SIZE,
> + .type   = MT_DEVICE
> + },
>   {
>   .virtual= L3_34XX_VIRT,
>   .pfn= __phys_to_pfn(L3_34XX_PHYS),
> diff --git a/arch/arm/mach-omap2/iomap.h b/arch/arm/mach-omap2/iomap.h
> --- a/arch/arm/mach-omap2/iomap.h
> +++ b/arch/arm/mach-omap2/iomap.h
> @@ -123,6 +123,10 @@
>   * VPOM3430 was not working for Int controller
>   */
>  
> +#define OMAP34XX_SRAM_PHYS   0x4020
> +#define OMAP34XX_SRAM_VIRT   0xd001
> +#define OMAP34XX_SRAM_SIZE   0x1

For my testing environment, vmalloc address space is started at
roughly 0xe000 so 0xd001 would not be valid. And, PHYS
can be different according to the system type. Maybe either
OMAP3_SRAM_PUB_PA or OMAP3_SRAM_PA. It seems that SIZE and TYPE should
be considered, too. My understanding is correct?

Thanks.


Re: n900 in next-20170901

2017-11-09 Thread Joonsoo Kim
On Thu, Nov 09, 2017 at 07:08:54AM -0800, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171109 03:47]:
> > Could you test following two commits on my updated branch?
> > 
> > "arm/dma: vmalloc area allocation"
> 
> Won't boot at this commit:
> 
> [6.747283] save_secure_sram() returns ff02
> [6.751983] save_secure_sram()'s param: 0: 0x4
> [6.756561] save_secure_sram()'s param: 1: 0x8e70
> [6.761749] save_secure_sram()'s param: 2: 0x0
> [6.766326] save_secure_sram()'s param: 3: 0x1
> [6.770904] save_secure_sram()'s param: 4: 0x1
> 
> > "arm/dma: defer atomic pool initialization"
> 
> Boots at this commit.
> 
> > I suspect that changed virtual address of the sram due to early
> > __dma_alloc_remap() call causes the problem and above two commits test
> > this theory.
> 
> Hmm OK. Does your first patch above now have the initcall issue too?
> It boots if I make that also subsys_initcall and then I get:

> [2.078094] vmalloc_pool_init: DMA: get vmalloc area: d001

Yes, first patch has the initcall issue and it's intentional in order
to check the theory. I checked following log for this.

- Boot failure
SRAM_ADDR: omap_map_sram: P: 0x40208000 - 0x4020f000
SRAM_ADDR: omap_map_sram: V: 0xd005 - 0xd0057000

- Boot success
SRAM_ADDR: omap_map_sram: P: 0x40208000 - 0x4020f000
SRAM_ADDR: omap_map_sram: V: 0xd0008000 - 0xd000f000

When failure, virtual address for sram is higher than normal one due
to vmalloc area allocation in __dma_alloc_remap(). If it is deferred,
virtual address is the same with success case and then the system work.

So, my next theory is that there is n900 specific assumption that sram
should have that address. Could you check if any working tree for n900
which doesn't have my CMA series work or not with adding
"arm/dma: vmalloc area allocation"?

Thanks.


Re: [PATCH] mm: page_ext: check if page_ext is not prepared

2017-11-08 Thread Joonsoo Kim
On Wed, Nov 08, 2017 at 03:21:06PM +0100, Michal Hocko wrote:
> On Wed 08-11-17 16:59:56, Joonsoo Kim wrote:
> > On Tue, Nov 07, 2017 at 10:47:30AM +0100, Michal Hocko wrote:
> > > [CC Joonsoo]
> > > 
> > > On Tue 07-11-17 18:41:31, Jaewon Kim wrote:
> > > > online_page_ext and page_ext_init allocate page_ext for each section, 
> > > > but
> > > > they do not allocate if the first PFN is !pfn_present(pfn) or
> > > > !pfn_valid(pfn). Then section->page_ext remains as NULL. lookup_page_ext
> > > > checks NULL only if CONFIG_DEBUG_VM is enabled. For a valid PFN,
> > > > __set_page_owner will try to get page_ext through lookup_page_ext.
> > > > Without CONFIG_DEBUG_VM lookup_page_ext will misuse NULL pointer as 
> > > > value
> > > > 0. This incurrs invalid address access.
> > > > 
> > > > This is the panic example when PFN 0x10 is not valid but PFN 
> > > > 0x13FC00
> > > > is being used for page_ext. section->page_ext is NULL, get_entry 
> > > > returned
> > > > invalid page_ext address as 0x1DFA000 for a PFN 0x13FC00.
> > > > 
> > > > To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
> > > > will be checked at all times.
> > > > 
> > > > <1>[   11.618085] Unable to handle kernel paging request at virtual 
> > > > address 01dfa014
> > > > <1>[   11.618140] pgd = ffc0c6dc9000
> > > > <1>[   11.618174] [01dfa014] *pgd=, 
> > > > *pud=
> > > > <4>[   11.618240] [ cut here ]
> > > > <2>[   11.618278] Kernel BUG at ff80082371e0 [verbose debug info 
> > > > unavailable]
> > > > <0>[   11.618338] Internal error: Oops: 9645 [#1] PREEMPT SMP
> > > > <4>[   11.618381] Modules linked in:
> > > > <4>[   11.618524] task: ffc0c6ec9180 task.stack: ffc0c6f4
> > > > <4>[   11.618569] PC is at __set_page_owner+0x48/0x78
> > > > <4>[   11.618607] LR is at __set_page_owner+0x44/0x78
> > > > <4>[   11.626025] [] __set_page_owner+0x48/0x78
> > > > <4>[   11.626071] [] 
> > > > get_page_from_freelist+0x880/0x8e8
> > > > <4>[   11.626118] [] 
> > > > __alloc_pages_nodemask+0x14c/0xc48
> > > > <4>[   11.626165] [] 
> > > > __do_page_cache_readahead+0xdc/0x264
> > > > <4>[   11.626214] [] filemap_fault+0x2ac/0x550
> > > > <4>[   11.626259] [] ext4_filemap_fault+0x3c/0x58
> > > > <4>[   11.626305] [] __do_fault+0x80/0x120
> > > > <4>[   11.626347] [] handle_mm_fault+0x704/0xbb0
> > > > <4>[   11.626393] [] do_page_fault+0x2e8/0x394
> > > > <4>[   11.626437] [] do_mem_abort+0x88/0x124
> > > > 
> > > 
> > > I suspec this goes all the way down to when page_ext has been
> > > resurrected.  It is quite interesting that nobody has noticed this in 3
> > > years but maybe the feature is not used all that much and the HW has to
> > > be quite special to trigger. Anyway the following should be added
> > > 
> > >  Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code 
> > > for debugging")
> > >  Cc: stable
> > 
> > IIRC, caller of lookup_page_ext() doesn't check 'NULL' until
> > f86e427197 ("mm: check the return value of lookup_page_ext for all
> > call sites"). So, this problem would happen old kernel even if this
> > patch is applied to old kernel.
> 
> OK, then the changelog should mention dependency on that check so that
> anybody who backports this patch to pre 4.7 kernels knows to pull that
> one as well.
> 
> > IMO, proper fix is to check all the pfn in the section. It is sent
> > from Jaewon in other mail.
> 
> I believe that this patch is valuable on its own and the other one
> should build on top of it.

Okay, agreed.

Thanks.


Re: [PATCH] mm: page_ext: allocate page extension though first PFN is invalid

2017-11-08 Thread Joonsoo Kim
On Wed, Nov 08, 2017 at 10:33:51PM +0900, Jaewon Kim wrote:
> 2017-11-08 16:52 GMT+09:00 Joonsoo Kim <iamjoonsoo@lge.com>:
> > On Tue, Nov 07, 2017 at 06:44:47PM +0900, Jaewon Kim wrote:
> >> online_page_ext and page_ext_init allocate page_ext for each section, but
> >> they do not allocate if the first PFN is !pfn_present(pfn) or
> >> !pfn_valid(pfn).
> >>
> >> Though the first page is not valid, page_ext could be useful for other
> >> pages in the section. But checking all PFNs in a section may be time
> >> consuming job. Let's check each (section count / 16) PFN, then prepare
> >> page_ext if any PFN is present or valid.
> >
> > I guess that this kind of section is not so many. And, this is for
> > debugging so completeness would be important. It's better to check
> > all pfn in the section.
> Thank you for your comment.
> 
> AFAIK physical memory address depends on HW SoC.
> Sometimes a SoC remains few GB address region hole between few GB DRAM
> and other few GB DRAM
> such as 2GB under 4GB address and 2GB beyond 4GB address and holes between 
> them.
> If SoC designs so big hole between actual mapping, I thought too much
> time will be spent on just checking all the PFNs.

I don't think that it is painful because it is done just once at the
initialization step. However, if you worry about it, we can use
pfn_present() to skip a whole section at once. !pfn_present()
guarantees that there is no valid pfn in the section. If pfn_present()
returns true, we need to search all the pages in the section in order
to find a valid pfn.

And, I think that we don't need to change online_page_ext(). AFAIK,
hotplug always adds section-aligned memory, so the pfn_present() check
should be enough.
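
Something along these lines is what I have in mind (a rough sketch with
an assumed helper name, not the final patch):

	static bool __init section_has_valid_pfn(unsigned long start_pfn)
	{
		unsigned long pfn;

		/* !pfn_present() means no pfn in this section can be valid */
		if (!pfn_present(start_pfn))
			return false;

		/* otherwise every pfn has to be checked before giving up */
		for (pfn = start_pfn; pfn < start_pfn + PAGES_PER_SECTION; pfn++) {
			if (pfn_valid(pfn))
				return true;
		}

		return false;
	}

page_ext_init() would then allocate page_ext for a section whenever this
returns true, instead of only looking at the first PFN.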

Thanks.



Re: n900 in next-20170901

2017-11-08 Thread Joonsoo Kim
On Thu, Nov 09, 2017 at 09:36:39AM +0900, Joonsoo Kim wrote:
> On Wed, Nov 08, 2017 at 04:11:13PM -0800, Tony Lindgren wrote:
> > * Joonsoo Kim <iamjoonsoo@lge.com> [171109 00:05]:
> > > On Wed, Nov 08, 2017 at 08:34:13AM -0800, Tony Lindgren wrote:
> > > > * Joonsoo Kim <iamjoonsoo@lge.com> [171108 07:43]:
> > > > > On Tue, Nov 07, 2017 at 07:48:42AM -0800, Tony Lindgren wrote:
> > > > > > So it seems the issue is currently at the atomic_pool_init()
> > > > > > related code?
> > > > > 
> > > > > Yes, your test showed it although I can't find any clue in
> > > > > atomic_pool_init().
> > > > > 
> > > > > Could you test updated branch?
> > > > > 
> > > > > https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901
> > > > > 
> > > > > There are two relevant commits.
> > > > > 
> > > > > arm/dma: stop dma allocation before __dma_alloc_remap()
> > > > > arm/dma: disable atomic pool after dma allocation
> > > > > 
> > > > > atomic pool initialization will be done partially to check
> > > > > exact point of failure. These are brain-dead commits however I have no
> > > > > idea what's going on here until now. :/
> > > > 
> > > > OK that booted, dmesg output below. Hopefully that provides
> > > > you with some more clues.
> > > 
> > > Thanks!
> > > 
> > > Could you let me know which one is booted? Both of them? or just top
> > > commit ("arm/dma: stop dma allocation before __dma_alloc_remap()")?
> > 
> > Oh OK. Only the top commit boots.
> 
> Okay! I will try to analyze!
> 

Could you test following two commits on my updated branch?

"arm/dma: vmalloc area allocation"
"arm/dma: defer atomic pool initialization"

I suspect that changed virtual address of the sram due to early
__dma_alloc_remap() call causes the problem and above two commits test
this theory.
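
To make it concrete, "defer atomic pool initialization" essentially means
moving the pool setup to a later initcall level so that its
__dma_alloc_remap() call no longer grabs vmalloc space before the SRAM
mapping is created (the exact level used in my branch may differ):

	/* mainline arch/arm/mm/dma-mapping.c sets the pool up early: */
	postcore_initcall(atomic_pool_init);

	/* the "defer" experiment moves it to a later level, e.g.: */
	late_initcall(atomic_pool_init);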

Thanks.


Re: n900 in next-20170901

2017-11-08 Thread Joonsoo Kim
On Wed, Nov 08, 2017 at 04:11:13PM -0800, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171109 00:05]:
> > On Wed, Nov 08, 2017 at 08:34:13AM -0800, Tony Lindgren wrote:
> > > * Joonsoo Kim <iamjoonsoo@lge.com> [171108 07:43]:
> > > > On Tue, Nov 07, 2017 at 07:48:42AM -0800, Tony Lindgren wrote:
> > > > > So it seems the issue is currently at the atomic_pool_init()
> > > > > related code?
> > > > 
> > > > Yes, your test showed it although I can't find any clue in
> > > > atomic_pool_init().
> > > > 
> > > > Could you test updated branch?
> > > > 
> > > > https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901
> > > > 
> > > > There are two relevant commits.
> > > > 
> > > > arm/dma: stop dma allocation before __dma_alloc_remap()
> > > > arm/dma: disable atomic pool after dma allocation
> > > > 
> > > > atomic pool initialization will be done partially to check
> > > > exact point of failure. These are brain-dead commits however I have no
> > > > idea what's going on here until now. :/
> > > 
> > > OK that booted, dmesg output below. Hopefully that provides
> > > you with some more clues.
> > 
> > Thanks!
> > 
> > Could you let me know which one is booted? Both of them? or just top
> > commit ("arm/dma: stop dma allocation before __dma_alloc_remap()")?
> 
> Oh OK. Only the top commit boots.

Okay! I will try to analyze!

Thanks.


Re: n900 in next-20170901

2017-11-08 Thread Joonsoo Kim
On Wed, Nov 08, 2017 at 08:34:13AM -0800, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171108 07:43]:
> > On Tue, Nov 07, 2017 at 07:48:42AM -0800, Tony Lindgren wrote:
> > > So it seems the issue is currently at the atomic_pool_init()
> > > related code?
> > 
> > Yes, your test showed it although I can't find any clue in
> > atomic_pool_init().
> > 
> > Could you test updated branch?
> > 
> > https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901
> > 
> > There are two relevant commits.
> > 
> > arm/dma: stop dma allocation before __dma_alloc_remap()
> > arm/dma: disable atomic pool after dma allocation
> > 
> > atomic pool initialization will be done partially to check
> > exact point of failure. These are brain-dead commits however I have no
> > idea what's going on here until now. :/
> 
> OK that booted, dmesg output below. Hopefully that provides
> you with some more clues.

Thanks!

Could you let me know which one is booted? Both of them? or just top
commit ("arm/dma: stop dma allocation before __dma_alloc_remap()")?

Thanks.



Re: [PATCH] mm: page_ext: check if page_ext is not prepared

2017-11-07 Thread Joonsoo Kim
On Tue, Nov 07, 2017 at 10:47:30AM +0100, Michal Hocko wrote:
> [CC Joonsoo]
> 
> On Tue 07-11-17 18:41:31, Jaewon Kim wrote:
> > online_page_ext and page_ext_init allocate page_ext for each section, but
> > they do not allocate if the first PFN is !pfn_present(pfn) or
> > !pfn_valid(pfn). Then section->page_ext remains as NULL. lookup_page_ext
> > checks NULL only if CONFIG_DEBUG_VM is enabled. For a valid PFN,
> > __set_page_owner will try to get page_ext through lookup_page_ext.
> > Without CONFIG_DEBUG_VM lookup_page_ext will misuse NULL pointer as value
> > 0. This incurrs invalid address access.
> > 
> > This is the panic example when PFN 0x10 is not valid but PFN 0x13FC00
> > is being used for page_ext. section->page_ext is NULL, get_entry returned
> > invalid page_ext address as 0x1DFA000 for a PFN 0x13FC00.
> > 
> > To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
> > will be checked at all times.
> > 
> > <1>[   11.618085] Unable to handle kernel paging request at virtual address 
> > 01dfa014
> > <1>[   11.618140] pgd = ffc0c6dc9000
> > <1>[   11.618174] [01dfa014] *pgd=, *pud=
> > <4>[   11.618240] [ cut here ]
> > <2>[   11.618278] Kernel BUG at ff80082371e0 [verbose debug info 
> > unavailable]
> > <0>[   11.618338] Internal error: Oops: 9645 [#1] PREEMPT SMP
> > <4>[   11.618381] Modules linked in:
> > <4>[   11.618524] task: ffc0c6ec9180 task.stack: ffc0c6f4
> > <4>[   11.618569] PC is at __set_page_owner+0x48/0x78
> > <4>[   11.618607] LR is at __set_page_owner+0x44/0x78
> > <4>[   11.626025] [] __set_page_owner+0x48/0x78
> > <4>[   11.626071] [] get_page_from_freelist+0x880/0x8e8
> > <4>[   11.626118] [] __alloc_pages_nodemask+0x14c/0xc48
> > <4>[   11.626165] [] __do_page_cache_readahead+0xdc/0x264
> > <4>[   11.626214] [] filemap_fault+0x2ac/0x550
> > <4>[   11.626259] [] ext4_filemap_fault+0x3c/0x58
> > <4>[   11.626305] [] __do_fault+0x80/0x120
> > <4>[   11.626347] [] handle_mm_fault+0x704/0xbb0
> > <4>[   11.626393] [] do_page_fault+0x2e8/0x394
> > <4>[   11.626437] [] do_mem_abort+0x88/0x124
> > 
> 
> I suspec this goes all the way down to when page_ext has been
> resurrected.  It is quite interesting that nobody has noticed this in 3
> years but maybe the feature is not used all that much and the HW has to
> be quite special to trigger. Anyway the following should be added
> 
>  Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for 
> debugging")
>  Cc: stable

IIRC, caller of lookup_page_ext() doesn't check 'NULL' until
f86e427197 ("mm: check the return value of lookup_page_ext for all
call sites"). So, this problem would happen old kernel even if this
patch is applied to old kernel.
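
For clarity, the point is that every lookup_page_ext() user needs the
NULL check unconditionally; roughly (a caller-side sketch, not the exact
page_owner code):

	static void set_page_owner_sketch(struct page *page, unsigned int order,
					  gfp_t gfp_mask)
	{
		struct page_ext *page_ext = lookup_page_ext(page);

		/*
		 * A section whose first pfn was invalid may have no page_ext
		 * allocated at all, so this must be checked even without
		 * CONFIG_DEBUG_VM.
		 */
		if (unlikely(!page_ext))
			return;

		/* ... record order/gfp_mask/stack in page_ext ... */
	}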

IMO, proper fix is to check all the pfn in the section. It is sent
from Jaewon in other mail.

Thanks.



Re: [PATCH] mm: page_ext: allocate page extension though first PFN is invalid

2017-11-07 Thread Joonsoo Kim
On Tue, Nov 07, 2017 at 06:44:47PM +0900, Jaewon Kim wrote:
> online_page_ext and page_ext_init allocate page_ext for each section, but
> they do not allocate if the first PFN is !pfn_present(pfn) or
> !pfn_valid(pfn).
> 
> Though the first page is not valid, page_ext could be useful for other
> pages in the section. But checking all PFNs in a section may be a
> time-consuming job. Let's check every (section count / 16)th PFN, then prepare
> page_ext if any PFN is present or valid.

I guess there are not many sections like this. And, since this is for
debugging, completeness is important. It's better to check all the
PFNs in the section.
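Something like the following is what I have in mind (just a rough
sketch assuming SPARSEMEM; the helper name is made up):

static bool __init section_has_usable_pfn(unsigned long sec_start_pfn)
{
	unsigned long pfn;

	/* Check every PFN in the section, not just the first one. */
	for (pfn = sec_start_pfn; pfn < sec_start_pfn + PAGES_PER_SECTION; pfn++) {
		if (pfn_present(pfn) && pfn_valid(pfn))
			return true;
	}

	return false;
}

Then page_ext would be allocated for the section whenever this returns
true, instead of sampling every (section count / 16)th PFN.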

Thanks.



Re: n900 in next-20170901

2017-11-07 Thread Joonsoo Kim
On Tue, Nov 07, 2017 at 07:48:42AM -0800, Tony Lindgren wrote:
> Hi,
> 
> * Joonsoo Kim <iamjoonsoo@lge.com> [171107 05:30]:
> > Could you test the following updated branch?
> > 
> > https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901
> > 
> > It has three relevant commits on top and enables CMA memory use.
> 
> Okie dokie.
> 
> > "arm/dma: remove i-cache flush"
> 
> Your branch at the above commit is not booting on n900.
> 
> > "arm/dma: flush i-cache and d-cache separately"
> 
> Not booting at this commit either.
> 
> > "arm/dma: call flush_cache_all() and don't do outer cache operation"
> 
> Not booting at this commit either.
> 
> Earlier commit f14f3479c0d7 ("arm/dma: re-enable to remap the CMA
> area and flush cache before remapping") in your branch boots.
> 
> Then the following commit d6512d6d0171 ("Revert "arm/mm: disable
> atomic_pool"") won't boot.
> 
> Also your tree at commit 6d0525cef962 ("arm/dma: remove i-cache
> flush") with only commit d6512d6d0171 ("Revert "arm/mm: disable
> atomic_pool"") reverted boots on n900.

Thanks a lot!

> So it seems the issue is currently at the atomic_pool_init()
> related code?

Yes, your tests point there, although I can't find any clue in
atomic_pool_init().

Could you test the updated branch?

https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901

There are two relevant commits.

arm/dma: stop dma allocation before __dma_alloc_remap()
arm/dma: disable atomic pool after dma allocation

Atomic pool initialization will be done partially to check the exact
point of failure. These are brain-dead commits, but I have no idea
what's going on here so far. :/

Thanks.


Re: n900 in next-20170901

2017-11-06 Thread Joonsoo Kim
Hello,

Sorry for the delay. I was on vacation last week.

On Thu, Oct 26, 2017 at 07:16:27AM -0700, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171025 21:45]:
> > On Wed, Oct 25, 2017 at 10:31:38AM -0700, Tony Lindgren wrote:
> > > Great, this branch boots on n900! Early parts of the dmesg attached
> > > below.
> > 
> > Sounds good. Perhaps the problem is due to cache management. With my
> > patchset, the direct mapping for the CMA area is cleared in
> > dma_contiguous_remap() if CONFIG_HIGHMEM. I guess that a proper cache
> > operation is required there.
> > 
> > Could you test my newly updated branch again? I re-enabled
> > dma_contiguous_remap() to clear the direct mapping for the CMA area and
> > added a proper(?) cache operation, although I'm not an expert on that area.
> > 
> > https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901
> 
> Yeah that one boots too, dmesg below. I wonder why flushing the range
> with dmac_flush_range() and outer_flush_range() no longer seems enough
> though?

I don't know the reason. AFAIK, it should be enough to flush the CPU
cache before clearing the page table entry; however, that doesn't work
on the n900.

flush_cache_all() does not only d-cache flushing but also i-cache
flushing. So, I'd like to test the effect of i-cache flushing.

Could you test the following updated branch?

https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901

It has three relevant commits on top and enables CMA memory use.

"arm/dma: remove i-cache flush"
"arm/dma: flush i-cache and d-cache separately"
"arm/dma: call flush_cache_all() and don't do outer cache operation"

I think that flush_cache_all() here looks fine because the size of the
CMA area is usually large enough to justify flush_cache_all(), but
understanding the root cause is important, so I'm asking for this test.
If you don't mind, please test each commit, reverting them one by one,
and let me know the result of each test.
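To make the comparison concrete, the three commits roughly correspond
to the cache operations below (a hypothetical helper just for
illustration; only one variant is enabled per test branch, and this is
not code from the branch):

#include <asm/cacheflush.h>	/* dmac_flush_range(), flush_cache_all() */
#include <asm/outercache.h>	/* outer_flush_range() */

static void __init cma_flush_before_remap(phys_addr_t start, phys_addr_t end)
{
	void *vstart = __va(start);
	void *vend = __va(end);

	/* variant 1 ("remove i-cache flush"): d-cache range flush only */
	dmac_flush_range(vstart, vend);
	outer_flush_range(start, end);

	/* variant 2: flush the i-cache as a separate, explicit step */
	__flush_icache_all();

	/* variant 3: whole d/i-cache flush, no outer cache operation */
	flush_cache_all();
}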

> Reverting the arch/arm/mm/dma-mapping.c changes in your patch
> "arm/dma: re-enable to remap the CMA area and flush cache before
> remapping" also boots, but produces the following build warnings:
> 
> arch/arm/mm/dma-mapping.c: In function 'dma_contiguous_remap':
> arch/arm/mm/dma-mapping.c:503:20: warning: passing argument 1 of
> 'cpu_cache.dma_flush_range' makes pointer from integer without a
> cast [-Wint-conversion]
> dmac_flush_range(map.virtual, map.virtual + map.length);
> ^~~
> arch/arm/mm/dma-mapping.c:503:20: note: expected 'const void *' but
> argument is of type 'long unsigned int'
> arch/arm/mm/dma-mapping.c:503:33: warning: passing argument 2 of
> 'cpu_cache.dma_flush_range' makes pointer from integer without a
> cast [-Wint-conversion]
> dmac_flush_range(map.virtual, map.virtual + map.length);

Thanks for the info.
I think the incorrect type is not related to this problem.
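For what it's worth, the warning itself should only need explicit
casts, since map.virtual is an unsigned long while dmac_flush_range()
takes pointers (a sketch of that one line only, not from the branch):

	dmac_flush_range((void *)map.virtual,
			 (void *)(map.virtual + map.length));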

Thanks.



Re: n900 in next-20170901

2017-10-25 Thread Joonsoo Kim
On Wed, Oct 25, 2017 at 10:31:38AM -0700, Tony Lindgren wrote:
> * Joonsoo Kim <iamjoonsoo@lge.com> [171022 21:51]:
> > On Fri, Oct 20, 2017 at 10:31:47AM -0700, Tony Lindgren wrote:
> > > * Joonsoo Kim <iamjoonsoo@lge.com> [171019 18:53]:
> > > > Oops... I made a mistake. Could you test with reverting commit
> > > > c977ee2803787363187d6aca9cebdabc793c6531 ("omap: forcibly call
> > > > save_secure_ram_context() for test") in that branch?
> > > > Without reverting it, it doesn't call 'smc' so it always cause a
> > > > hang.
> > > 
> > > Oops I should have noticed that one. Here you go with commit
> > > c977ee280378 reverted. Still not booting.
> > 
> > Still, many thanks to you. :)
> 
> No problem, sorry for the delay in testing this one.
> 
> > Okay. Could you test my updated branch? In there, I also disable
> > atomic_pool initialization and disable remapping of the CMA area in
> > order to make any operation on the CMA area a complete no-op.
> > 
> > And, it enables memblock_debug to check how memblock region is
> > allocated.
> > 
> > https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901
> 
> Great, this branch boots on n900! Early parts of the dmesg attached
> below.

Sounds good. Perhaps the problem is due to cache management. With my
patchset, the direct mapping for the CMA area is cleared in
dma_contiguous_remap() if CONFIG_HIGHMEM. I guess that a proper cache
operation is required there.
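The order of operations I'm aiming for is roughly this (a simplified
sketch assuming ARM with CONFIG_HIGHMEM, not the actual patch;
pmd_off_k() is the helper from arch/arm/mm/mm.h):

static void __init cma_clear_direct_mapping(phys_addr_t start, phys_addr_t end)
{
	unsigned long vstart = __phys_to_virt(start);
	unsigned long vend = __phys_to_virt(end);
	struct map_desc map;
	unsigned long addr;

	/* flush while the direct (lowmem) mapping is still live */
	flush_cache_all();

	/* tear down the section mappings covering the CMA area */
	for (addr = vstart; addr < vend; addr += PMD_SIZE)
		pmd_clear(pmd_off_k(addr));
	flush_tlb_kernel_range(vstart, vend);

	/* re-create the mapping with the DMA-ready memory type */
	map.pfn = __phys_to_pfn(start);
	map.virtual = vstart;
	map.length = end - start;
	map.type = MT_MEMORY_DMA_READY;
	iotable_init(&map, 1);
}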

Could you test my newly updated branch again? I re-enabled
dma_contiguous_remap() to clear the direct mapping for the CMA area and
added a proper(?) cache operation, although I'm not an expert on that
area.

https://github.com/JoonsooKim/linux/tree/cma-debug4-next-20180901

Thanks.

