Re: [PATCH] sched,psi: fix the 'int' underflow for psi

2021-04-15 Thread Charan Teja Kalla
Thanks Johannes!!

On 4/15/2021 8:12 PM, Johannes Weiner wrote:
> Makes sense, it's more graceful in the event of a bug.
> 
> But what motivates this change? Is it something you hit recently with
> an upstream kernel and we should investigate?

We didn't actually hit this issue with an upstream kernel. I identified it
while doing a code walkthrough.

> 
> There is already a branch on the tasks to signal the bug. How about:
> 
>   if (groupc->tasks[t]) {
>   groupc->tasks[t]--;
>   } else if (!psi_bug) {
>   printk_deferred(...
>   psi_bug = 1;
>   }

Looks cleaner. Will raise V2 with this.
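
For reference, a rough sketch of how the V2 change could look in
psi_group_change(), along the lines of the suggestion above (the exact
warning text is just an assumption, not the final patch):

	if (groupc->tasks[t]) {
		groupc->tasks[t]--;
	} else if (!psi_bug) {
		printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d\n",
				cpu, t);
		psi_bug = 1;
	}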
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH RFC 0/1] mm: balancing the node zones occupancy

2021-02-23 Thread Charan Teja Kalla
Thanks Vlastimil for the review comments!!

On 2/19/2021 4:56 PM, Vlastimil Babka wrote:
> Can you share the use case for doing this? If it's to replace a failed RAM, 
> then
> it's probably extremely rare, right.
> 
>> We have the proof-of-concept code tried on the Snapdragon systems with
>> the system configuration, single memory node of just 2 zones, 6GB normal
>> zone and 2GB movable zone. And this Movable zone is such that hot-added
>> once and there after offline/online based on the need.
> Hm, snapdragon... so is this some kind of power saving thing?
> 

You are correct. This is a power-saving use case in which the user offlines
and onlines memory blocks in the system; it is not about replacing a failed
RAM.

> Anyway, shouldn't auto NUMA balancing help here, and especially "Migrate Pages
> in lieu of discard" (CC'd Dave) as a generic mechanism,

On the Snapdragon systems we have only a single memory node with Normal and
Movable zones. And my understanding is that most embedded systems will have
just a single memory node.

My limited understanding of auto NUMA balancing is that a minimum of 2 nodes
is needed for this balancing to trigger. Please correct me if I am wrong
here. If I am correct, then this approach is not suitable for us. Moreover,
the idea I would like to convey in this RFC patch is about __balancing the
zones within a node, not across NUMA nodes__.

> so we wouldn't need to have hotplug-specific actions?
David has given a very simple view of this problem, which has nothing to do
with hotplug-specific actions.

With just 2 zones(Normal and Movable) in a single node in the system,

1. Application 1 allocates a lot of memory and gets ZONE_MOVABLE.
2. Application 2 allocates a lot of memory and gets ZONE_NORMAL.
3. Application 1 quits.

Then after step 3, we can expect a lot of free memory in the Movable zone
while the Normal zone is under pressure. Applying semantics similar to auto
NUMA balancing ("Migrate pages in lieu of swap/discard"), we could migrate
some eligible pages of Application 2 to the Movable zone and thereby relieve
some pressure in the Normal zone; a rough sketch of the idea follows.
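
A purely hypothetical illustration of that direction, reusing the existing
migration machinery; the helper names and the migrate reason below are made
up for illustration and are not from the RFC patch:

	/*
	 * Allocate replacement pages with a movable gfp mask so they can
	 * land in ZONE_MOVABLE of the same node.
	 */
	static struct page *alloc_movable_dst(struct page *page, unsigned long data)
	{
		return __alloc_pages_node(page_to_nid(page),
					  GFP_HIGHUSER_MOVABLE, 0);
	}

	/* "pagelist" holds pages already isolated from ZONE_NORMAL */
	static int migrate_to_movable_zone(struct list_head *pagelist)
	{
		return migrate_pages(pagelist, alloc_movable_dst, NULL, 0,
				     MIGRATE_SYNC, MR_NUMA_MISPLACED);
	}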

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH RFC 0/1] mm: balancing the node zones occupancy

2021-02-23 Thread Charan Teja Kalla


Thanks David for the review comments!!

On 2/18/2021 11:46 PM, David Hildenbrand wrote:
>> I would like to start discussion about  balancing the occupancy of
>> memory zones in a node in the system whose imabalance may be caused by
>> migration of pages to other zones during hotremove and then hotadding
>> same memory. In this case there is a lot of free memory in newly hotadd
>> memory which can be filled up by the previous migrated pages(as part of
>> offline/hotremove) thus may free up some pressure in other zones of the
>> node.
> 
> Why is this specific to memory hot(un)plug? I think the problem is more
> generic:
> 
> Assume
> 
> 1. Application 1 allocates a lot of memory and gets ZONE_MOVABLE.
> 2. Application 2 allocates a lot of memory and gets ZONE_NORMAL.
> 3. Application 1 quits.
> 
> Same problem, no?

Thanks for simplifying this problem. Yeah, this looks like a more generic
problem. But for these types of problems the user/system administrator has a
clear view of the state of the system and thus may need to take some
decisions to keep the node's zones balanced, e.g. with a change like this one
that migrates the eligible pages to other zones.

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] mm: page_alloc: update the COMPACT[STALL|FAIL] events properly

2021-02-05 Thread Charan Teja Kalla
On 2/6/2021 3:58 AM, David Rientjes wrote:
>> In the code, when COMPACT_SKIPPED is being returned, the page will
>> always be NULL. So, I'm not sure how much useful it is for the page ==
>> NULL check here. Or I failed to understand your point here?
>>
> Your code is short-circuiting the rest of  __alloc_pages_direct_compact() 
> where the return value is dictated by whether page is NULL or non-NULL.  
> We can't leak a captured page if we are testing for it being NULL or 
> non-NULL, which is what the rest of __alloc_pages_direct_compact() does 
> *before* your change.  So the idea was to add a check the page is actually 
> NULL here since you are now relying on the return value of 
> compact_zone_order() to be COMPACT_SKIPPED to infer page == NULL.
> 
> I agree that's currently true in the code, I was trying to catch any 
> errors where current->capture_control.page was non-NULL but 
> try_to_compact_pages() returns COMPACT_SKIPPED.  There's some complexity 
> here.

Thanks for the detailed explanation. This looks fine to me. I will send
V2 with this information in the commit log.

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] mm: page_alloc: update the COMPACT[STALL|FAIL] events properly

2021-02-02 Thread Charan Teja Kalla
Thanks David for the review!!

On 2/2/2021 2:54 AM, David Rientjes wrote:
> On Mon, 1 Feb 2021, Charan Teja Reddy wrote:
> 
>> By defination, COMPACT[STALL|FAIL] events needs to be counted when there
> 
> s/defination/definition/

Done.

> 
>> is 'At least in one zone compaction wasn't deferred or skipped from the
>> direct compaction'. And when compaction is skipped or deferred,
>> COMPACT_SKIPPED will be returned but it will still go and update these
>> compaction events which is wrong in the sense that COMPACT[STALL|FAIL]
>> is counted without even trying the compaction.
>>
>> Correct this by skipping the counting of these events when
>> COMPACT_SKIPPED is returned for compaction. This indirectly also avoid
>> the unnecessary try into the get_page_from_freelist() when compaction is
>> not even tried.
>>
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>  mm/page_alloc.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 519a60d..531f244 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4152,6 +4152,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
>> int order,
>>  memalloc_noreclaim_restore(noreclaim_flag);
>>  psi_memstall_leave();
>>  
>> +if (*compact_result == COMPACT_SKIPPED)
>> +return NULL;
>>  /*
>>   * At least in one zone compaction wasn't deferred or skipped, so let's
>>   * count a compaction stall
> 
> This makes sense, I wonder if it would also be useful to check that 
> page == NULL, either in try_to_compact_pages() or here for 
> COMPACT_SKIPPED?

In the code, when COMPACT_SKIPPED is being returned, the page will always be
NULL. So I'm not sure how useful the page == NULL check is here. Or have I
failed to understand your point?

Since, till now, COMPACTFAIL also accounts for page allocation failures that
happen because direct compaction is skipped, would it be useful to create a
separate COMPACTSKIPFAIL event indicating 'failed to get a free page because
direct compaction is skipped'? A rough illustration follows.
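
A minimal sketch of where such a hypothetical COMPACTSKIPFAIL event could be
counted in __alloc_pages_direct_compact(), building on the diff above (the
event does not exist today and would also need an enum vm_event_item entry):

	if (*compact_result == COMPACT_SKIPPED) {
		count_vm_event(COMPACTSKIPFAIL);	/* hypothetical new event */
		return NULL;
	}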
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH V3] mm/compaction: correct deferral logic for proactive compaction

2021-01-27 Thread Charan Teja Kalla



On 1/25/2021 4:24 AM, David Rientjes wrote:
> On Wed, 20 Jan 2021, Vlastimil Babka wrote:
> 
>> On 1/19/21 8:26 PM, David Rientjes wrote:
>>> On Mon, 18 Jan 2021, Charan Teja Reddy wrote:
>>>
 should_proactive_compact_node() returns true when sum of the
 weighted fragmentation score of all the zones in the node is greater
 than the wmark_high of compaction, which then triggers the proactive
 compaction that operates on the individual zones of the node. But
 proactive compaction runs on the zone only when its weighted
 fragmentation score is greater than wmark_low(=wmark_high - 10).

 This means that the sum of the weighted fragmentation scores of all the
 zones can exceed the wmark_high but individual weighted fragmentation
 zone scores can still be less than wmark_low which makes the unnecessary
 trigger of the proactive compaction only to return doing nothing.

 Issue with the return of proactive compaction with out even trying is
 its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if
 the scores across the proactive compaction is same, thinking that
 compaction didn't make any progress but in reality it didn't even try.
>>>
>>> Isn't this an issue in deferred compaction as well?  It seems like 
>>> deferred compaction should check that work was actually performed before 
>>> deferring subsequent calls to compaction.
>>
>> Direct compaction does, proactive not.
>>
>>> In other words, I don't believe deferred compaction is intended to avoid 
>>> checks to determine if compaction is worth it; it should only defer 
>>> *additional* work that was not productive.
>>
>> Yeah, that should be more optimal.
>>
> 
> Charan, is this something you'd like to follow up on, or should I take a 
> look instead?
> 

Sure David. Happy to follow up on this. Thanks!
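
To capture the idea, an illustrative sketch only (not an actual patch): if
proactive_compact_node() reported whether it really compacted at least one
zone, say via a hypothetical bool return value, kcompactd() could defer only
after unproductive work rather than after a no-op:

	if (should_proactive_compact_node(pgdat)) {
		unsigned int prev_score, score;
		bool did_work;

		prev_score = fragmentation_score_node(pgdat);
		did_work = proactive_compact_node(pgdat);
		score = fragmentation_score_node(pgdat);
		/* defer only when compaction ran but made no progress */
		proactive_defer = (did_work && score >= prev_score) ?
					1 << COMPACT_MAX_DEFER_SHIFT : 0;
	}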

> Thanks!
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH V2] mm/compaction: correct deferral logic for proactive compaction

2021-01-18 Thread Charan Teja Kalla
Thanks Vlastimil!!

On 1/18/2021 6:07 PM, Vlastimil Babka wrote:
> On 1/18/21 1:20 PM, Charan Teja Reddy wrote:
>> should_proactive_compact_node() returns true when sum of the
>> weighted fragmentation score of all the zones in the node is greater
>> than the wmark_high of compaction, which then triggers the proactive
>> compaction that operates on the individual zones of the node. But
>> proactive compaction runs on the zone only when its weighted
>> fragmentation score is greater than wmark_low(=wmark_high - 10).
>>
>> This means that the sum of the weighted fragmentation scores of all the
>> zones can exceed the wmark_high but individual weighted fragmentation
>> zone scores can still be less than wmark_low which makes the unnecessary
>> trigger of the proactive compaction only to return doing nothing.
>>
>> Issue with the return of proactive compaction with out even trying is
>> its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if
>> the scores across the proactive compaction is same, thinking that
>> compaction didn't make any progress but in reality it didn't even try.
>> With the delay between successive retries for proactive compaction is
>> 500msec, it can result into the deferral for ~30sec with out even trying
>> the proactive compaction.
>>
>> Test scenario is that: compaction_proactiveness=50 thus the wmark_low =
>> 50 and wmark_high = 60. System have 2 zones(Normal and Movable) with
>> sizes 5GB and 6GB respectively. After opening some apps on the android,
>> the weighted fragmentation scores of these zones are 47 and 49
>> respectively. Since the sum of these fragmentation scores are above the
>> wmark_high which triggers the proactive compaction and there since the
>> individual zones weighted fragmentation scores are below wmark_low, it
>> returns without trying the proactive compaction. As a result the
>> weighted fragmentation scores of the zones are still 47 and 49 which
>> makes the existing logic to defer the compaction thinking that
>> noprogress is made across the compaction.
>>
>> Fix this by checking just zone fragmentation score, not the weighted, in
>> __compact_finished() and use the zones weighted fragmentation score in
>> fragmentation_score_node(). In the test case above, If the weighted
>> average of is above wmark_high, then individual score (not adjusted) of
>> atleast one zone has to be above wmark_high. Thus it avoids the
>> unnecessary trigger and deferrals of the proactive compaction.
>>
>> Fix-suggested-by: Vlastimil Babka 
>> Signed-off-by: Charan Teja Reddy 
> 
> Acked-by: Vlastimil Babka 
> 
> But I would move fragmentation_score_zone() above
> fragmentation_score_zone_weighted(), so fragmentation_score_zone_weighted() 
> can
> call fragmentation_score_zone() instead of having two places with
> extfrag_for_order(...).

Yes, this suggestion makes the code cleaner; roughly the rearrangement shown
below. I will raise V3 for this. Thanks.
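
A sketch of that arrangement, based on the functions in the diff above (the
exact placement in mm/compaction.c may differ in V3):

	static unsigned int fragmentation_score_zone(struct zone *zone)
	{
		return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
	}

	static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
	{
		unsigned long score;

		score = zone->present_pages * fragmentation_score_zone(zone);
		return div64_ul(score,
				zone->zone_pgdat->node_present_pages + 1);
	}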

> 
> Thanks.
> 
>> ---
>>
>> Changes in V2: Addressed comments from vlastimil
>>
>> Changes in V1: https://lore.kernel.org/patchwork/patch/1364646/
>>
>>  mm/compaction.c | 19 ++-
>>  1 file changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index e5acb97..1b98427 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1924,16 +1924,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
>>  }
>>  
>>  /*
>> - * A zone's fragmentation score is the external fragmentation wrt to the
>> - * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value
>> - * in the range [0, 100].
>> + * A weighted zone's fragmentation score is the external fragmentation
>> + * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
>> + * returns a value in the range [0, 100].
>>   *
>>   * The scaling factor ensures that proactive compaction focuses on larger
>>   * zones like ZONE_NORMAL, rather than smaller, specialized zones like
>>   * ZONE_DMA32. For smaller zones, the score value remains close to zero,
>>   * and thus never exceeds the high threshold for proactive compaction.
>>   */
>> -static unsigned int fragmentation_score_zone(struct zone *zone)
>> +static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
>>  {
>>  unsigned long score;
>>  
>> @@ -1943,6 +1943,15 @@ static unsigned int fragmentation_score_zone(struct 
>> zone *zone)
>>  }
>>  
>>  /*
>> + * A zone's fragmentation score is the external fragmentation wrt to the
>> + * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
>> + */
>> +static unsigned int fragmentation_score_zone(struct zone *zone)
>> +{
>> +return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>> +}
>> +
>> +/*
>>   * The per-node proactive (background) compaction process is started by its
>>   * corresponding kcompactd thread when the node's fragmentation score
>>   * exceeds the high threshold. The compaction process remains active till
>> @@ -1958,7 +1967,7 @@ static unsigned int 

Re: [PATCH] mm/compaction: return proper state in should_proactive_compact_node

2021-01-15 Thread Charan Teja Kalla
Thank you Vlastimil!!

On 1/15/2021 6:15 PM, Vlastimil Babka wrote:
> On 1/13/21 3:03 PM, Charan Teja Reddy wrote:
>> should_proactive_compact_node() returns true when sum of the
>> fragmentation score of all the zones in the node is greater than the
>> wmark_high of compaction which then triggers the proactive compaction
>> that operates on the individual zones of the node. But proactive
>> compaction runs on the zone only when the fragmentation score of the
>> zone is greater than wmark_low(=wmark_high - 10).
>>
>> This means that the sum of the fragmentation scores of all the zones can
>> exceed the wmark_high but individual zone scores can still be less than
>> the wmark_low which makes the unnecessary trigger of the proactive
>> compaction only to return doing nothing.
>>
>> Another issue with the return of proactive compaction with out even
>> trying is its deferral. It is simply deferred for 1 <<
>> COMPACT_MAX_DEFER_SHIFT if the scores across the proactive compaction is
>> same, thinking that compaction didn't make any progress but in reality
>> it didn't even try. With the delay between successive retries for
>> proactive compaction is 500msec, it can result into the deferral for
>> ~30sec with out even trying the proactive compaction.
>>
>> Test scenario is that: compaction_proactiveness=50 thus the wmark_low =
>> 50 and wmark_high = 60. System have 2 zones(Normal and Movable) with
>> sizes 5GB and 6GB respectively. After opening some apps on the android,
>> the fragmentation scores of these zones are 47 and 49 respectively.
>> Since the sum of these fragmentation scores are above the wmark_high
>> which triggers the proactive compaction and there since the individual
>> zone scores are below wmark_low, it returns without trying the
>> compaction. As a result the fragmentation scores of the zones are still
>> 47 and 49 which makes the existing logic to defer the compaction
>> thinking that noprogress is made across the compaction.
>>
>> So, run the proactive compaction on the node zones only when atleast one
>> of the zones fragmentation score is greater than wmark_low. This avoids
>> the unnecessary deferral and retries of the compaction.
>>
>> Signed-off-by: Charan Teja Reddy 
> 
> Good catch about the problem, but I wonder if the solution could be better.
> 
> fragmentation_score_node() is a weighted average of scores of all zones, 
> that's
> why fragmentation_score_zone() adjusts the score by zone_present/node_present.
> 
> But when considering an individual zone in __compact_finished(), we shouldn't 
> be
> using fragmentation_score_zone() with the adjustment. We are not calculating 
> the
> weighted average for the whole node there, so it doesn't make sense to do the
> adjustment by size. So if it simply took extfrag_for_order(...) as the score, 
> it
> should work as expected. In your example above, the score of each zone would 
> be
> above 60. If the weighted average is above wmark_high, then individual score
> (not adjusted) of at least one zone has to be above wmark_high, and the extra
> check using max() is not necessary.
> 
> So I would split fragmentation_score_zone() to e.g. fragmentation_score_zone()
> and fragmentation_score_zone_weighted() and call the latter only from
> fragmentation_score_node(), and not from __compact_finished().

This suggestion looks good and is much cleaner. Will raise V2 in a couple of
days. Thanks.

> 
> Vlastimil
> 
>> ---
>>  mm/compaction.c | 27 +--
>>  1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index e5acb97..f7a772a 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1964,6 +1964,26 @@ static unsigned int 
>> fragmentation_score_node(pg_data_t *pgdat)
>>  return score;
>>  }
>>  
>> +/*
>> + * Returns the maximum of fragmentation scores of zones in a node. This is
>> + * used in taking the decission of whether to trigger the proactive 
>> compaction
>> + * on the zones of this node.
>> + */
>> +static unsigned int fragmentation_score_node_zones_max(pg_data_t *pgdat)
>> +{
>> +int zoneid;
>> +unsigned int max = 0;
>> +
>> +for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>> +struct zone *zone;
>> +
>> +zone = &pgdat->node_zones[zoneid];
>> +max = max_t(unsigned int, fragmentation_score_zone(zone), max);
>> +}
>> +
>> +return max;
>> +}
>> +
>>  static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
>>  {
>>  unsigned int wmark_low;
>> @@ -1979,13 +1999,16 @@ static unsigned int 
>> fragmentation_score_wmark(pg_data_t *pgdat, bool low)
>>  
>>  static bool should_proactive_compact_node(pg_data_t *pgdat)
>>  {
>> -int wmark_high;
>> +int wmark_low, wmark_high;
>>  
>>  if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat))
>>  return false;
>>  
>>  wmark_high = fragmentation_score_wmark(pgdat, false);
>> -return 

Re: [PATCH] mm: memory_hotplug: put migration failure information under DEBUG_VM

2020-11-27 Thread Charan Teja Kalla
Thanks Michal!!

On 11/26/2020 2:48 PM, Michal Hocko wrote:
> On Wed 25-11-20 16:18:06, Charan Teja Kalla wrote:
>>
>>
>> On 11/24/2020 1:11 PM, Michal Hocko wrote:
>>> On Mon 23-11-20 20:40:40, Charan Teja Kalla wrote:
>>>>
>>>> Thanks Michal!
>>>> On 11/23/2020 7:43 PM, Michal Hocko wrote:
>>>>> On Mon 23-11-20 19:33:16, Charan Teja Reddy wrote:
>>>>>> When the pages are failed to get isolate or migrate, the page owner
>>>>>> information along with page info is dumped. If there are continuous
>>>>>> failures in migration(say page is pinned) or isolation, the log buffer
>>>>>> is simply getting flooded with the page owner information. As most of
>>>>>> the times page info is sufficient to know the causes for failures of
>>>>>> migration or isolation, place the page owner information under DEBUG_VM.
>>>>>
>>>>> I do not see why this path is any different from others that call
>>>>> dump_page. Page owner can add a very valuable information to debug
>>>>> the underlying reasons for failures here. It is an opt-in debugging
>>>>> feature which needs to be enabled explicitly. So I would argue users
>>>>> are ready to accept a lot of data in the kernel log.
>>>>
>>>> Just thinking how frequently failures can happen in those paths. In the
>>>> memory hotplug path, we can flood the page owner logs just by making one
>>>> page pinned.
>>>
>>> If you are operating on a movable zone then pages shouldn't be pinned
>>> for unbound amount of time. Yeah there are some ways to break this
>>> fundamental assumption but this is a bigger problem that needs a
>>> solution.
>>>
>>>> Say If it is anonymous page, the page owner information
>>>> shows is something like below, which is not really telling anything
>>>> other than how the pinned page is allocated.
>>>
>>> Well you can tell an anonymous page from __dump_page, all right, but
>>> this is not true universally.
>>>
>>>> page last allocated via order 0, migratetype Movable, gfp_mask
>>>> 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
>>>>   prep_new_page+0x7c/0x1a4
>>>>   get_page_from_freelist+0x1ac/0x1c4
>>>>   __alloc_pages_nodemask+0x12c/0x378
>>>>   do_anonymous_page+0xac/0x3b4
>>>>   handle_pte_fault+0x2a4/0x3bc
>>>>   __handle_speculative_fault+0x208/0x3c0
>>>>   do_page_fault+0x280/0x508
>>>>   do_translation_fault+0x3c/0x54
>>>>   do_mem_abort+0x64/0xf4
>>>>   el0_da+0x1c/0x20
>>>>  page last free stack trace:
>>>>   free_pcp_prepare+0x320/0x454
>>>>   free_unref_page_list+0x9c/0x2a4
>>>>   release_pages+0x370/0x3c8
>>>>   free_pages_and_swap_cache+0xdc/0x10c
>>>>   tlb_flush_mmu+0x110/0x134
>>>>   tlb_finish_mmu+0x48/0xc0
>>>>   unmap_region+0x104/0x138
>>>>   __do_munmap+0x2ec/0x3b4
>>>>   __arm64_sys_munmap+0x80/0xd8
>>>>
>>>> I see at some places in the kernel where they put the dump_page under
>>>> DEBUG_VM, but in the end I agree that it is up to the users need. Then
>>>> there are some users who don't care for these page owner logs.
>>>
>>> Well, as I've said page_owner requires an explicit enabling and I would
>>> expect that if somebody enables this tracking then it is expected to see
>>> the information when we dump a page state.
>>>
>>>> And an issue on Embedded systems with these continuous logs being
>>>> printed to the console is the watchdog timeouts, because console logging
>>>> happens by disabling the interrupts.
>>>
>>> Are you enabling page_owner on those systems unconditionally?
>>>
>>
>> Yes, We do always enable the page owner on just the internal debug
>> builds for memory analysis, But never on the production kernels. And on
>> these builds excessive logging, at times because of a pinned page,
>> causing the watchdog timeouts, is the problem.
> 
> OK, I see but I still believe that the debugging might be useful
> especially when the owner is not really obvious from the page state.
> I also agree that if the output is swapping the logs then the situation
> is not really great either. Would something like the below work for your
> situation?
> 
> MAGIC_NUMBER would need to be somehow figured but I would start with 1

Re: [PATCH] mm: memory_hotplug: put migration failure information under DEBUG_VM

2020-11-25 Thread Charan Teja Kalla
Thanks Vlastimil!

On 11/24/2020 7:09 PM, Vlastimil Babka wrote:
> On 11/23/20 4:10 PM, Charan Teja Kalla wrote:
>>
>> Thanks Michal!
>> On 11/23/2020 7:43 PM, Michal Hocko wrote:
>>> On Mon 23-11-20 19:33:16, Charan Teja Reddy wrote:
>>>> When the pages are failed to get isolate or migrate, the page owner
>>>> information along with page info is dumped. If there are continuous
>>>> failures in migration(say page is pinned) or isolation, the log buffer
>>>> is simply getting flooded with the page owner information. As most of
>>>> the times page info is sufficient to know the causes for failures of
>>>> migration or isolation, place the page owner information under
>>>> DEBUG_VM.
>>>
>>> I do not see why this path is any different from others that call
>>> dump_page. Page owner can add a very valuable information to debug
>>> the underlying reasons for failures here. It is an opt-in debugging
>>> feature which needs to be enabled explicitly. So I would argue users
>>> are ready to accept a lot of data in the kernel log.
>>
>> Just thinking how frequently failures can happen in those paths. In the
>> memory hotplug path, we can flood the page owner logs just by making one
>> page pinned. Say If it is anonymous page, the page owner information
> 
> So you say it's flooded when page_owner info is included, but not
> flooded when only the rest of __dump_page() is printed? (which is also
> not just one or two lines). That has to be very specific rate of failures.
> Anyway I don't like the solution with arbitrary config option. To
> prevent flooding we generally have ratelimiting, how about that?
> 

I can still say the logs are flooded with just __dump_page(), but they are a
lot less compared to dump_page_owner(). The lines are something like below:
page:0b070b80 refcount:3 mapcount:1 mapping:ff80faf118e9
index:0xc0903
anon flags:
0x8008042c(uptodate|dirty|active|owner_priv_1|swapbacked)
raw: 8008042c ffc047483a58 ffc047483a58 ff80faf118e9
raw: 000c0903 000985eb 0003 ff800b5f3000
page dumped because: migration failure
page->mem_cgroup:ff800b5f3000
page_owner tracks the page as allocated

Rate limiting the logs, both from __dump_page() and dump_page_owner(), looks
nice. If it is okay for both you and Michal, I can raise the V2 here; a rough
sketch of what I have in mind follows.
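
A minimal sketch of that direction for the isolation-failure branch of
do_migrate_range(), assuming a plain __ratelimit() based throttle with the
default interval/burst values (just placeholders):

	} else {
		static DEFINE_RATELIMIT_STATE(migrate_rs,
					      DEFAULT_RATELIMIT_INTERVAL,
					      DEFAULT_RATELIMIT_BURST);

		if (__ratelimit(&migrate_rs)) {
			pr_warn("failed to isolate pfn %lx\n", pfn);
			dump_page(page, "isolation failed");
		}
	}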

> Also agreed with Michal that page_owner is explicitly enabled debugging
> option and if you use it in production, that's rather surprising to me,
> and possibly more rare than DEBUG_VM, which IIRC Fedora kernels use.

We just enable it on the internal debug systems but never on the
production kernels.

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] mm: memory_hotplug: put migration failure information under DEBUG_VM

2020-11-25 Thread Charan Teja Kalla



On 11/24/2020 1:11 PM, Michal Hocko wrote:
> On Mon 23-11-20 20:40:40, Charan Teja Kalla wrote:
>>
>> Thanks Michal!
>> On 11/23/2020 7:43 PM, Michal Hocko wrote:
>>> On Mon 23-11-20 19:33:16, Charan Teja Reddy wrote:
>>>> When the pages are failed to get isolate or migrate, the page owner
>>>> information along with page info is dumped. If there are continuous
>>>> failures in migration(say page is pinned) or isolation, the log buffer
>>>> is simply getting flooded with the page owner information. As most of
>>>> the times page info is sufficient to know the causes for failures of
>>>> migration or isolation, place the page owner information under DEBUG_VM.
>>>
>>> I do not see why this path is any different from others that call
>>> dump_page. Page owner can add a very valuable information to debug
>>> the underlying reasons for failures here. It is an opt-in debugging
>>> feature which needs to be enabled explicitly. So I would argue users
>>> are ready to accept a lot of data in the kernel log.
>>
>> Just thinking how frequently failures can happen in those paths. In the
>> memory hotplug path, we can flood the page owner logs just by making one
>> page pinned.
> 
> If you are operating on a movable zone then pages shouldn't be pinned
> for unbound amount of time. Yeah there are some ways to break this
> fundamental assumption but this is a bigger problem that needs a
> solution.
> 
>> Say If it is anonymous page, the page owner information
>> shows is something like below, which is not really telling anything
>> other than how the pinned page is allocated.
> 
> Well you can tell an anonymous page from __dump_page, all right, but
> this is not true universally.
> 
>> page last allocated via order 0, migratetype Movable, gfp_mask
>> 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
>>   prep_new_page+0x7c/0x1a4
>>   get_page_from_freelist+0x1ac/0x1c4
>>   __alloc_pages_nodemask+0x12c/0x378
>>   do_anonymous_page+0xac/0x3b4
>>   handle_pte_fault+0x2a4/0x3bc
>>   __handle_speculative_fault+0x208/0x3c0
>>   do_page_fault+0x280/0x508
>>   do_translation_fault+0x3c/0x54
>>   do_mem_abort+0x64/0xf4
>>   el0_da+0x1c/0x20
>>  page last free stack trace:
>>   free_pcp_prepare+0x320/0x454
>>   free_unref_page_list+0x9c/0x2a4
>>   release_pages+0x370/0x3c8
>>   free_pages_and_swap_cache+0xdc/0x10c
>>   tlb_flush_mmu+0x110/0x134
>>   tlb_finish_mmu+0x48/0xc0
>>   unmap_region+0x104/0x138
>>   __do_munmap+0x2ec/0x3b4
>>   __arm64_sys_munmap+0x80/0xd8
>>
>> I see at some places in the kernel where they put the dump_page under
>> DEBUG_VM, but in the end I agree that it is up to the users need. Then
>> there are some users who don't care for these page owner logs.
> 
> Well, as I've said page_owner requires an explicit enabling and I would
> expect that if somebody enables this tracking then it is expected to see
> the information when we dump a page state.
> 
>> And an issue on Embedded systems with these continuous logs being
>> printed to the console is the watchdog timeouts, because console logging
>> happens by disabling the interrupts.
> 
> Are you enabling page_owner on those systems unconditionally?
> 

Yes, we always enable page owner on just the internal debug builds for memory
analysis, but never on the production kernels. And on these builds the
excessive logging, at times because of a single pinned page, causing watchdog
timeouts is the problem.

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] mm: memory_hotplug: put migration failure information under DEBUG_VM

2020-11-23 Thread Charan Teja Kalla


Thanks Michal!
On 11/23/2020 7:43 PM, Michal Hocko wrote:
> On Mon 23-11-20 19:33:16, Charan Teja Reddy wrote:
>> When the pages are failed to get isolate or migrate, the page owner
>> information along with page info is dumped. If there are continuous
>> failures in migration(say page is pinned) or isolation, the log buffer
>> is simply getting flooded with the page owner information. As most of
>> the times page info is sufficient to know the causes for failures of
>> migration or isolation, place the page owner information under DEBUG_VM.
> 
> I do not see why this path is any different from others that call
> dump_page. Page owner can add a very valuable information to debug
> the underlying reasons for failures here. It is an opt-in debugging
> feature which needs to be enabled explicitly. So I would argue users
> are ready to accept a lot of data in the kernel log.

Just thinking about how frequently failures can happen in those paths: in the
memory hotplug path, we can flood the page owner logs just by having one page
pinned. Say it is an anonymous page; the page owner information shown is
something like below, which is not really telling anything other than how the
pinned page was allocated.

page last allocated via order 0, migratetype Movable, gfp_mask
0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO)
  prep_new_page+0x7c/0x1a4
  get_page_from_freelist+0x1ac/0x1c4
  __alloc_pages_nodemask+0x12c/0x378
  do_anonymous_page+0xac/0x3b4
  handle_pte_fault+0x2a4/0x3bc
  __handle_speculative_fault+0x208/0x3c0
  do_page_fault+0x280/0x508
  do_translation_fault+0x3c/0x54
  do_mem_abort+0x64/0xf4
  el0_da+0x1c/0x20
 page last free stack trace:
  free_pcp_prepare+0x320/0x454
  free_unref_page_list+0x9c/0x2a4
  release_pages+0x370/0x3c8
  free_pages_and_swap_cache+0xdc/0x10c
  tlb_flush_mmu+0x110/0x134
  tlb_finish_mmu+0x48/0xc0
  unmap_region+0x104/0x138
  __do_munmap+0x2ec/0x3b4
  __arm64_sys_munmap+0x80/0xd8

I see some places in the kernel where they put the dump_page() under
DEBUG_VM, but in the end I agree that it is up to the user's need. Then there
are some users who don't care about these page owner logs.

And an issue on embedded systems with these continuous logs being printed to
the console is watchdog timeouts, because console logging happens with
interrupts disabled.

>  
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>  mm/memory_hotplug.c | 10 --
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 63b2e46..f48f30d 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1326,7 +1326,10 @@ do_migrate_range(unsigned long start_pfn, unsigned 
>> long end_pfn)
>>  
>>  } else {
>>  pr_warn("failed to isolate pfn %lx\n", pfn);
>> -dump_page(page, "isolation failed");
>> +__dump_page(page, "isolation failed");
>> +#if defined(CONFIG_DEBUG_VM)
>> +dump_page_owner(page);
>> +#endif
>>  }
>>  put_page(page);
>>  }
>> @@ -1357,7 +1360,10 @@ do_migrate_range(unsigned long start_pfn, unsigned 
>> long end_pfn)
>>  list_for_each_entry(page, &source, lru) {
>>  pr_warn("migrating pfn %lx failed ret:%d ",
>> page_to_pfn(page), ret);
>> -dump_page(page, "migration failure");
>> +__dump_page(page, "migration failure");
>> +#if defined(CONFIG_DEBUG_VM)
>> +dump_page_owner(page);
>> +#endif
>>  }
>>  putback_movable_pages(&source);
>>  }
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>> member of the Code Aurora Forum, hosted by The Linux Foundation
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] iommu: of: skip iommu_device_list traversal in of_iommu_xlate()

2020-09-23 Thread Charan Teja Kalla



On 9/23/2020 9:54 PM, Robin Murphy wrote:
> On 2020-09-23 15:53, Charan Teja Reddy wrote:
>> In of_iommu_xlate(), check if iommu device is enabled before traversing
>> the iommu_device_list through iommu_ops_from_fwnode(). It is of no use
>> in traversing the iommu_device_list only to return NO_IOMMU because of
>> iommu device node is disabled.
> 
> Well, the "use" is that it keeps the code that much smaller and simpler
> to have a single path for returning this condition. This whole callstack
> isn't exactly a high-performance code path to begin with, and we've
> always assumed that IOMMUs present but disabled in DT would be a pretty
> rare exception. 

Fine. I thought that returning early when the IOMMU DT node is disabled was
more logical than keeping the code simpler. And I agree that this is not a
high-performance path.

> Do you have a system that challenges those assumptions
> and shows any benefit from this change?

No, I don't have a system that challenges these assumptions.

> 
> Robin.
> 
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>   drivers/iommu/of_iommu.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
>> index e505b91..225598c 100644
>> --- a/drivers/iommu/of_iommu.c
>> +++ b/drivers/iommu/of_iommu.c
>> @@ -94,9 +94,10 @@ static int of_iommu_xlate(struct device *dev,
>>   struct fwnode_handle *fwnode = &iommu_spec->np->fwnode;
>>   int ret;
>>   +    if (!of_device_is_available(iommu_spec->np))
>> +    return NO_IOMMU;
>>   ops = iommu_ops_from_fwnode(fwnode);
>> -    if ((ops && !ops->of_xlate) ||
>> -    !of_device_is_available(iommu_spec->np))
>> +    if (ops && !ops->of_xlate)
>>   return NO_IOMMU;
>>     ret = iommu_fwspec_init(dev, &iommu_spec->np->fwnode, ops);
>>

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-13 Thread Charan Teja Kalla
Thanks Michal.

On 8/13/2020 10:00 PM, Michal Hocko wrote:
> On Thu 13-08-20 21:51:29, Charan Teja Kalla wrote:
>> Thanks Michal for comments.
>>
>> On 8/13/2020 5:11 PM, Michal Hocko wrote:
>>> On Tue 11-08-20 18:28:23, Charan Teja Reddy wrote:
>>> [...]
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index e4896e6..839039f 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, 
>>>> int count,
>>>>struct page *page, *tmp;
>>>>LIST_HEAD(head);
>>>>  
>>>> +  /*
>>>> +   * Ensure proper count is passed which otherwise would stuck in the
>>>> +   * below while (list_empty(list)) loop.
>>>> +   */
>>>> +  count = min(pcp->count, count);
>>>>while (count) {
>>>>struct list_head *list;
>>>
>>>
>>> How does this prevent the race actually?
>>
>> This doesn't prevent the race. This only fixes the core hung(as this is
>> called with spin_lock_irq()) caused by the race condition. This core
>> hung is because of incorrect count value is passed to the
>> free_pcppages_bulk() function.
> 
> Let me ask differently. What does enforce that the count and lists do
> not get out of sync in the loop. 

The count value is updated whenever an order-0 page is added to the pcp
lists through free_unref_page_commit(), which is called with both interrupts
and preemption disabled:

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
	...
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	...
}

As these are pcp lists, they only get touched by another process when this
process is context switched, which happens only after preemption or
interrupts are enabled again. So, as long as a process is operating on these
pcp lists in free_unref_page_commit(), the count and lists are always in
sync.

However, the problem here is not that the count and lists go out of sync;
they are always in sync, as explained above. The problem is asking
free_pcppages_bulk() to free more pages than are present in the pcp lists,
which ends up in while (list_empty()).

> Your changelog says that the fix is to
> use the proper value without any specifics.
> 
Will change this to: "Ensure the count value passed is not greater than the
pcp list count." Do you have a better suggestion?

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-13 Thread Charan Teja Kalla
Thanks Michal for comments.

On 8/13/2020 5:11 PM, Michal Hocko wrote:
> On Tue 11-08-20 18:28:23, Charan Teja Reddy wrote:
> [...]
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e4896e6..839039f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int 
>> count,
>>  struct page *page, *tmp;
>>  LIST_HEAD(head);
>>  
>> +/*
>> + * Ensure proper count is passed which otherwise would stuck in the
>> + * below while (list_empty(list)) loop.
>> + */
>> +count = min(pcp->count, count);
>>  while (count) {
>>  struct list_head *list;
> 
> 
> How does this prevent the race actually?

This doesn't prevent the race. It only fixes the core hang (as this is
called with spin_lock_irq()) caused by the race condition. The core hang
happens because an incorrect count value is passed to the
free_pcppages_bulk() function.

The actual race should be fixed by David's suggestion (isolate, online and
undo the isolation).

Does something need to be improved in the commit message? Maybe like below:
s/The following race is observed with the repeated ... /A CPU core hang
is observed as a result of a race with the use case of repeated.../

> Don't we need something like
> the following instead?
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e028b87ce294..45bcc7ba37c4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1317,9 +1317,16 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>* lists
>*/
>   do {
> + bool looped = false;

IIUC, this 'looped' will always be initialized to false, thus it never jumps
to 'free' (a corrected sketch follows the quoted diff below).
But I think I got your idea: loop over the pcp lists looking for any pages,
and if none are found even after all MIGRATE_PCPTYPES lists are traversed,
just break the loop. Is this check really required? Shouldn't pcp->count tell
the same thing, i.e. whether any pages are left in the pcp lists?

> +
>   batch_free++;
> - if (++migratetype == MIGRATE_PCPTYPES)
> + if (++migratetype == MIGRATE_PCPTYPES) {
> + if (looped)
> + goto free;
> +
>   migratetype = 0;
> + looped = true;
> + }
>   list = &pcp->lists[migratetype];
>   } while (list_empty(list));
>  
> @@ -1352,6 +1359,7 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   } while (--count && --batch_free && !list_empty(list));
>   }
>  
> +free:
>   spin_lock(&zone->lock);
>   isolated_pageblocks = has_isolate_pageblock(zone);
>  
> 
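
For reference, a corrected sketch of that idea, with the flag simply moved
out of the do/while so it survives across iterations (otherwise unchanged):

	bool looped = false;

	do {
		batch_free++;
		if (++migratetype == MIGRATE_PCPTYPES) {
			if (looped)
				goto free;

			migratetype = 0;
			looped = true;
		}
		list = &pcp->lists[migratetype];
	} while (list_empty(list));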

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-12 Thread Charan Teja Kalla



On 8/12/2020 3:30 PM, David Hildenbrand wrote:
> On 12.08.20 11:46, Charan Teja Kalla wrote:
>>
>> Thanks David for the inputs.
>>
>> On 8/12/2020 2:35 AM, David Hildenbrand wrote:
>>> On 11.08.20 14:58, Charan Teja Reddy wrote:
>>>> The following race is observed with the repeated online, offline and a
>>>> delay between two successive online of memory blocks of movable zone.
>>>>
>>>> P1 P2
>>>>
>>>> Online the first memory block in
>>>> the movable zone. The pcp struct
>>>> values are initialized to default
>>>> values,i.e., pcp->high = 0 &
>>>> pcp->batch = 1.
>>>>
>>>>Allocate the pages from the
>>>>movable zone.
>>>>
>>>> Try to Online the second memory
>>>> block in the movable zone thus it
>>>> entered the online_pages() but yet
>>>> to call zone_pcp_update().
>>>>This process is entered into
>>>>the exit path thus it tries
>>>>to release the order-0 pages
>>>>to pcp lists through
>>>>free_unref_page_commit().
>>>>As pcp->high = 0, pcp->count = 1
>>>>proceed to call the function
>>>>free_pcppages_bulk().
>>>> Update the pcp values thus the
>>>> new pcp values are like, say,
>>>> pcp->high = 378, pcp->batch = 63.
>>>>Read the pcp's batch value using
>>>>READ_ONCE() and pass the same to
>>>>free_pcppages_bulk(), pcp values
>>>>passed here are, batch = 63,
>>>>count = 1.
>>>>
>>>>Since num of pages in the pcp
>>>>lists are less than ->batch,
>>>>then it will stuck in
>>>>while(list_empty(list)) loop
>>>>with interrupts disabled thus
>>>>a core hung.
>>>>
>>>> Avoid this by ensuring free_pcppages_bulk() is called with proper count
>>>> of pcp list pages.
>>>>
>>>> The mentioned race is some what easily reproducible without [1] because
>>>> pcp's are not updated for the first memory block online and thus there
>>>> is a enough race window for P2 between alloc+free and pcp struct values
>>>> update through onlining of second memory block.
>>>>
>>>> With [1], the race is still exists but it is very much narrow as we
>>>> update the pcp struct values for the first memory block online itself.
>>>>
>>>> [1]: https://patchwork.kernel.org/patch/11696389/
>>>>
>>>
>>> IIUC, this is not limited to the movable zone, it could also happen in
>>> corner cases with the normal zone (e.g., hotplug to a node that only has
>>> DMA memory, or no other memory yet).
>>
>> Yes, this is my understanding too. I explained the above race in terms
>> of just movable zone for which it is observed. We can add the below line
>> in the end in patch commit message:
>> "This is not limited to the movable zone, it could also happen in cases
>> with the normal zone (e.g., hotplug to a node that only has DMA memory,
>> or no other memory yet)."
> 
> Yeah, that makes sense!
> 
>>
>> Just curious, there exists such systems where just a dma zone present
>> and we hot add the normal zone? I am not aware such thing in the
>> embedded world.
> 
> You can easily create such setups using QEMU.
> 
> IIRC, just specify a QEMU guest with 2G initial memory and a single NUMA
> node, or 4G initial memory and two NUMA nodes. Then hotplug memory.
> 
> (IIRC kata containers always start a VM with 2G and then hotplug memory)
>
I see. Thanks for letting me know this.

>>>
>>>> Signed-off-by: Charan Teja Reddy 
>>>> ---
>>>>
>>>> v1: https://patchwork.kernel.org/pat

Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-12 Thread Charan Teja Kalla


Thanks David for the inputs.

On 8/12/2020 2:35 AM, David Hildenbrand wrote:
> On 11.08.20 14:58, Charan Teja Reddy wrote:
>> The following race is observed with the repeated online, offline and a
>> delay between two successive online of memory blocks of movable zone.
>>
>> P1   P2
>>
>> Online the first memory block in
>> the movable zone. The pcp struct
>> values are initialized to default
>> values,i.e., pcp->high = 0 &
>> pcp->batch = 1.
>>
>>  Allocate the pages from the
>>  movable zone.
>>
>> Try to Online the second memory
>> block in the movable zone thus it
>> entered the online_pages() but yet
>> to call zone_pcp_update().
>>  This process is entered into
>>  the exit path thus it tries
>>  to release the order-0 pages
>>  to pcp lists through
>>  free_unref_page_commit().
>>  As pcp->high = 0, pcp->count = 1
>>  proceed to call the function
>>  free_pcppages_bulk().
>> Update the pcp values thus the
>> new pcp values are like, say,
>> pcp->high = 378, pcp->batch = 63.
>>  Read the pcp's batch value using
>>  READ_ONCE() and pass the same to
>>  free_pcppages_bulk(), pcp values
>>  passed here are, batch = 63,
>>  count = 1.
>>
>>  Since num of pages in the pcp
>>  lists are less than ->batch,
>>  then it will stuck in
>>  while(list_empty(list)) loop
>>  with interrupts disabled thus
>>  a core hung.
>>
>> Avoid this by ensuring free_pcppages_bulk() is called with proper count
>> of pcp list pages.
>>
>> The mentioned race is some what easily reproducible without [1] because
>> pcp's are not updated for the first memory block online and thus there
>> is a enough race window for P2 between alloc+free and pcp struct values
>> update through onlining of second memory block.
>>
>> With [1], the race is still exists but it is very much narrow as we
>> update the pcp struct values for the first memory block online itself.
>>
>> [1]: https://patchwork.kernel.org/patch/11696389/
>>
> 
> IIUC, this is not limited to the movable zone, it could also happen in
> corner cases with the normal zone (e.g., hotplug to a node that only has
> DMA memory, or no other memory yet).

Yes, this is my understanding too. I explained the above race in terms of
just the movable zone, for which it was observed. We can add the below line
at the end of the patch commit message:
"This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory,
or no other memory yet)."

Just curious, do there exist systems where just a DMA zone is present and we
hot add the normal zone? I am not aware of such a thing in the embedded
world.
> 
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>
>> v1: https://patchwork.kernel.org/patch/11707637/
>>
>>  mm/page_alloc.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e4896e6..839039f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int 
>> count,
>>  struct page *page, *tmp;
>>  LIST_HEAD(head);
>>  
>> +/*
>> + * Ensure proper count is passed which otherwise would stuck in the
>> + * below while (list_empty(list)) loop.
>> + */
>> +count = min(pcp->count, count);
>>  while (count) {
>>  struct list_head *list;
>>  
>>
> 
> Fixes: and Cc: stable... tags?

Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
one-list-per-migrate-type")
Cc:  [2.6+]

I am not sure if I should raise a V3 including these?
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-11 Thread Charan Teja Kalla
Thanks David for the comments.

On 8/11/2020 1:59 PM, David Hildenbrand wrote:
> On 10.08.20 18:10, Charan Teja Reddy wrote:
>> The following race is observed with the repeated online, offline and a
>> delay between two successive online of memory blocks of movable zone.
>>
>> P1   P2
>>
>> Online the first memory block in
>> the movable zone. The pcp struct
>> values are initialized to default
>> values,i.e., pcp->high = 0 &
>> pcp->batch = 1.
>>
>>  Allocate the pages from the
>>  movable zone.
>>
>> Try to Online the second memory
>> block in the movable zone thus it
>> entered the online_pages() but yet
>> to call zone_pcp_update().
>>  This process is entered into
>>  the exit path thus it tries
>>  to release the order-0 pages
>>  to pcp lists through
>>  free_unref_page_commit().
>>  As pcp->high = 0, pcp->count = 1
>>  proceed to call the function
>>  free_pcppages_bulk().
>> Update the pcp values thus the
>> new pcp values are like, say,
>> pcp->high = 378, pcp->batch = 63.
>>  Read the pcp's batch value using
>>  READ_ONCE() and pass the same to
>>  free_pcppages_bulk(), pcp values
>>  passed here are, batch = 63,
>>  count = 1.
>>
>>  Since num of pages in the pcp
>>  lists are less than ->batch,
>>  then it will stuck in
>>  while(list_empty(list)) loop
>>  with interrupts disabled thus
>>  a core hung.
>>
>> Avoid this by ensuring free_pcppages_bulk() called with proper count of
>> pcp list pages.
>>
>> The mentioned race is some what easily reproducible without [1] because
>> pcp's are not updated for the first memory block online and thus there
>> is a enough race window for P2 between alloc+free and pcp struct values
>> update through onlining of second memory block.
>>
>> With [1], the race is still exists but it is very much narrow as we
>> update the pcp struct values for the first memory block online itself.
>>
>> [1]: https://patchwork.kernel.org/patch/11696389/
>>
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>  mm/page_alloc.c | 16 ++--
>>  1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e4896e6..25e7e12 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3106,6 +3106,7 @@ static void free_unref_page_commit(struct page *page, 
>> unsigned long pfn)
>>  struct zone *zone = page_zone(page);
>>  struct per_cpu_pages *pcp;
>>  int migratetype;
>> +int high;
>>  
>>  migratetype = get_pcppage_migratetype(page);
>>  __count_vm_event(PGFREE);
>> @@ -3128,8 +3129,19 @@ static void free_unref_page_commit(struct page *page, 
>> unsigned long pfn)
>>  pcp = &this_cpu_ptr(zone->pageset)->pcp;
>>  list_add(&page->lru, &pcp->lists[migratetype]);
>>  pcp->count++;
>> -if (pcp->count >= pcp->high) {
>> -unsigned long batch = READ_ONCE(pcp->batch);
>> +high = READ_ONCE(pcp->high);
>> +if (pcp->count >= high) {
>> +int batch;
>> +
>> +batch = READ_ONCE(pcp->batch);
>> +/*
>> + * For non-default pcp struct values, high is always
>> + * greater than the batch. If high < batch then pass
>> + * proper count to free the pcp's list pages.
>> + */
>> +if (unlikely(high < batch))
>> +batch = min(pcp->count, batch);
>> +
>>  free_pcppages_bulk(zone, batch, pcp);
>>  }
>>  }
>>
> 
> I was wondering if we should rather set all pageblocks to
> MIGRATE_ISOLATE in online_pages() before doing the online_pages_range()
> call, and do undo_isolate_page_range() after onlining is done.
> 
> move_pfn_range_to_zone()->memmap_init_zone() marks all pageblocks
> MIGRATE_MOVABLE, and as that function is used also during boot, we could
> supply a parameter to configure this.
> 
> This would prevent another race from happening: Having pages exposed to
> the buddy ready for allocation in online_pages_range() before the
> sections are marked online.

Yeah, this is another bug. And the idea of isolating first, then onlining,
and undoing the isolation only after the zonelist and pcp struct updates
should work even for the mentioned issue, as sketched below. Does this need
to go as a separate fix?
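
A rough sketch of that ordering in online_pages(); this is not David's actual
patch, and it assumes move_pfn_range_to_zone() grows a migratetype parameter
as he suggested:

	/* mark all onlined pageblocks MIGRATE_ISOLATE up front */
	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);

	/* ... expose the range to the buddy, mark sections online ... */

	if (need_zonelists_rebuild)
		build_all_zonelists(NULL);
	zone_pcp_update(zone);

	/* zonelists/pcp are consistent now; allow allocations again */
	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);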

However, IMO, issue in free_pcppages_bulk() 

Re: [PATCH] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-11 Thread Charan Teja Kalla
Thanks David.

On 8/11/2020 1:06 AM, David Rientjes wrote:
> On Mon, 10 Aug 2020, Charan Teja Reddy wrote:
> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e4896e6..25e7e12 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3106,6 +3106,7 @@ static void free_unref_page_commit(struct page *page, 
>> unsigned long pfn)
>>  struct zone *zone = page_zone(page);
>>  struct per_cpu_pages *pcp;
>>  int migratetype;
>> +int high;
>>  
>>  migratetype = get_pcppage_migratetype(page);
>>  __count_vm_event(PGFREE);
>> @@ -3128,8 +3129,19 @@ static void free_unref_page_commit(struct page *page, 
>> unsigned long pfn)
>>  pcp = &this_cpu_ptr(zone->pageset)->pcp;
>>  list_add(&page->lru, &pcp->lists[migratetype]);
>>  pcp->count++;
>> -if (pcp->count >= pcp->high) {
>> -unsigned long batch = READ_ONCE(pcp->batch);
>> +high = READ_ONCE(pcp->high);
>> +if (pcp->count >= high) {
>> +int batch;
>> +
>> +batch = READ_ONCE(pcp->batch);
>> +/*
>> + * For non-default pcp struct values, high is always
>> + * greater than the batch. If high < batch then pass
>> + * proper count to free the pcp's list pages.
>> + */
>> +if (unlikely(high < batch))
>> +batch = min(pcp->count, batch);
>> +
>>  free_pcppages_bulk(zone, batch, pcp);
>>  }
>>  }
> 
> I'm wondering if a fix to free_pcppages_bulk() is more appropriate here 
> because the count passed into it seems otherwise fragile if this results 
> in a hung core?
> 

Agreed that free_pcppages_bulk() is the appropriate place to fix this, and it
is actually much cleaner. Raised V2:
https://patchwork.kernel.org/patch/11709225/

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] mm, memory_hotplug: update pcp lists everytime onlining a memory block

2020-08-03 Thread Charan Teja Kalla
Thanks David for the comments.

On 8/3/2020 1:35 PM, David Hildenbrand wrote:
> On 02.08.20 14:54, Charan Teja Reddy wrote:
>> When onlining a first memory block in a zone, pcp lists are not updated
>> thus pcp struct will have the default setting of ->high = 0,->batch = 1.
>> This means till the second memory block in a zone(if it have) is onlined
>> the pcp lists of this zone will not contain any pages because pcp's
>> ->count is always greater than ->high thus free_pcppages_bulk() is
>> called to free batch size(=1) pages every time system wants to add a
>> page to the pcp list through free_unref_page(). To put this in a word,
>> system is not using benefits offered by the pcp lists when there is a
>> single onlineable memory block in a zone. Correct this by always
>> updating the pcp lists when memory block is onlined.
> 
> I guess such setups are rare ... but I can imagine it being the case
> with virtio-mem in the future ... not 100% if this performance
> optimization is really relevant in practice ... how did you identify this?

Even the Snapdragon hardware that I tested on contains multiple onlineable
memory blocks. But we have a use case in which we online a single memory
block and, once it is filled, online the next block. In the step where the
single block is onlined, we observed the below pageset params.
  pagesets
cpu: 0
  count: 0
  high:  0
  batch: 1
Only once the second block is onlined do we see some sane values, as below.
cpu: 0
  count: 32
  high:  378
  batch: 63

In the above case, until the second block is onlined, no page is held in the
pcp list. So updating the pcp params every time a memory block is onlined is
required, as in the use case that I mentioned as an example.

> 
>>
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>  mm/memory_hotplug.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index dcdf327..7f62d69 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -854,8 +854,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
>> nr_pages,
>>  node_states_set_node(nid, );
>>  if (need_zonelists_rebuild)
>>  build_all_zonelists(NULL);
>> -else
>> -zone_pcp_update(zone);
>> +zone_pcp_update(zone);
>>  
>>  init_per_zone_wmark_min();
>>  
>>
> 
> Does, in general, look sane to me.
> 
> Reviewed-by: David Hildenbrand 

Thanks for the ACK.

> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH v2] dmabuf: use spinlock to access dmabuf->name

2020-06-23 Thread Charan Teja Kalla
Thanks Mike for the inputs.

On 6/22/2020 5:10 PM, Ruhl, Michael J wrote:
>> -Original Message-
>> From: charante=codeaurora@mg.codeaurora.org
>>  On Behalf Of Charan Teja
>> Kalla
>> Sent: Monday, June 22, 2020 5:26 AM
>> To: Ruhl, Michael J ; Sumit Semwal
>> ; david.lai...@aculab.com; open list:DMA
>> BUFFER SHARING FRAMEWORK ; DRI mailing
>> list 
>> Cc: Linaro MM SIG ; LKML > ker...@vger.kernel.org>
>> Subject: Re: [PATCH v2] dmabuf: use spinlock to access dmabuf->name
>>
>> Hello Mike,
>>
>> On 6/19/2020 7:11 PM, Ruhl, Michael J wrote:
>>>> -----Original Message-
>>>> From: charante=codeaurora@mg.codeaurora.org
>>>>  On Behalf Of Charan
>> Teja
>>>> Kalla
>>>> Sent: Friday, June 19, 2020 7:57 AM
>>>> To: Sumit Semwal ; Ruhl, Michael J
>>>> ; david.lai...@aculab.com; open list:DMA
>>>> BUFFER SHARING FRAMEWORK ; DRI
>> mailing
>>>> list 
>>>> Cc: Linaro MM SIG ; LKML >>> ker...@vger.kernel.org>
>>>> Subject: [PATCH v2] dmabuf: use spinlock to access dmabuf->name
>>>>
>>>> There exists a sleep-while-atomic bug while accessing the dmabuf->name
>>>> under mutex in the dmabuffs_dname(). This is caused from the SELinux
>>>> permissions checks on a process where it tries to validate the inherited
>>>> files from fork() by traversing them through iterate_fd() (which
>>>> traverse files under spin_lock) and call
>>>> match_file(security/selinux/hooks.c) where the permission checks
>> happen.
>>>> This audit information is logged using dump_common_audit_data() where
>> it
>>>> calls d_path() to get the file path name. If the file check happen on
>>>> the dmabuf's fd, then it ends up in ->dmabuffs_dname() and use mutex to
>>>> access dmabuf->name. The flow will be like below:
>>>> flush_unauthorized_files()
>>>>  iterate_fd()
>>>>spin_lock() --> Start of the atomic section.
>>>>  match_file()
>>>>file_has_perm()
>>>>  avc_has_perm()
>>>>avc_audit()
>>>>  slow_avc_audit()
>>>>common_lsm_audit()
>>>>  dump_common_audit_data()
>>>>audit_log_d_path()
>>>>  d_path()
>>>>dmabuffs_dname()
>>>>  mutex_lock()--> Sleep while atomic.
>>>>
>>>> Call trace captured (on 4.19 kernels) is below:
>>>> ___might_sleep+0x204/0x208
>>>> __might_sleep+0x50/0x88
>>>> __mutex_lock_common+0x5c/0x1068
>>>> __mutex_lock_common+0x5c/0x1068
>>>> mutex_lock_nested+0x40/0x50
>>>> dmabuffs_dname+0xa0/0x170
>>>> d_path+0x84/0x290
>>>> audit_log_d_path+0x74/0x130
>>>> common_lsm_audit+0x334/0x6e8
>>>> slow_avc_audit+0xb8/0xf8
>>>> avc_has_perm+0x154/0x218
>>>> file_has_perm+0x70/0x180
>>>> match_file+0x60/0x78
>>>> iterate_fd+0x128/0x168
>>>> selinux_bprm_committing_creds+0x178/0x248
>>>> security_bprm_committing_creds+0x30/0x48
>>>> install_exec_creds+0x1c/0x68
>>>> load_elf_binary+0x3a4/0x14e0
>>>> search_binary_handler+0xb0/0x1e0
>>>>
>>>> So, use spinlock to access dmabuf->name to avoid sleep-while-atomic.
>>>>
>>>> Cc:  [5.3+]
>>>> Signed-off-by: Charan Teja Reddy 
>>>> ---
>>>>
>>>> Changes in V2: Addressed review comments from Ruhl, Michael J
>>>>
>>>> Changes in V1: https://lore.kernel.org/patchwork/patch/1255055/
>>>>
>>>> drivers/dma-buf/dma-buf.c | 11 +++
>>>> include/linux/dma-buf.h   |  1 +
>>>> 2 files changed, 8 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>>> index 01ce125..d81d298 100644
>>>> --- a/drivers/dma-buf/dma-buf.c
>>>> +++ b/drivers/dma-buf/dma-buf.c
>>>> @@ -45,10 +45,10 @@ static char *dmabuffs_dname(struct dentry
>> *dentry,
>>>> char *buffer, int buflen)
>>>>size_t ret = 0;
>>>>
>>>>dmabuf = dentry->d_fsdata;
>>>> -  dma_resv_lock(dmabuf->resv, NULL);
>>>> +  spin_lock(&dmabuf->name_lock);
>>>> if (dmabuf->name)

Re: [PATCH v2] dmabuf: use spinlock to access dmabuf->name

2020-06-22 Thread Charan Teja Kalla
Hello Mike,

On 6/19/2020 7:11 PM, Ruhl, Michael J wrote:
>> -Original Message-
>> From: charante=codeaurora@mg.codeaurora.org
>>  On Behalf Of Charan Teja
>> Kalla
>> Sent: Friday, June 19, 2020 7:57 AM
>> To: Sumit Semwal ; Ruhl, Michael J
>> ; david.lai...@aculab.com; open list:DMA
>> BUFFER SHARING FRAMEWORK ; DRI mailing
>> list 
>> Cc: Linaro MM SIG ; LKML > ker...@vger.kernel.org>
>> Subject: [PATCH v2] dmabuf: use spinlock to access dmabuf->name
>>
>> There exists a sleep-while-atomic bug while accessing the dmabuf->name
>> under mutex in the dmabuffs_dname(). This is caused from the SELinux
>> permissions checks on a process where it tries to validate the inherited
>> files from fork() by traversing them through iterate_fd() (which
>> traverse files under spin_lock) and call
>> match_file(security/selinux/hooks.c) where the permission checks happen.
>> This audit information is logged using dump_common_audit_data() where it
>> calls d_path() to get the file path name. If the file check happen on
>> the dmabuf's fd, then it ends up in ->dmabuffs_dname() and use mutex to
>> access dmabuf->name. The flow will be like below:
>> flush_unauthorized_files()
>>  iterate_fd()
>>spin_lock() --> Start of the atomic section.
>>  match_file()
>>file_has_perm()
>>  avc_has_perm()
>>avc_audit()
>>  slow_avc_audit()
>>  common_lsm_audit()
>>dump_common_audit_data()
>>  audit_log_d_path()
>>d_path()
>>dmabuffs_dname()
>>  mutex_lock()--> Sleep while atomic.
>>
>> Call trace captured (on 4.19 kernels) is below:
>> ___might_sleep+0x204/0x208
>> __might_sleep+0x50/0x88
>> __mutex_lock_common+0x5c/0x1068
>> __mutex_lock_common+0x5c/0x1068
>> mutex_lock_nested+0x40/0x50
>> dmabuffs_dname+0xa0/0x170
>> d_path+0x84/0x290
>> audit_log_d_path+0x74/0x130
>> common_lsm_audit+0x334/0x6e8
>> slow_avc_audit+0xb8/0xf8
>> avc_has_perm+0x154/0x218
>> file_has_perm+0x70/0x180
>> match_file+0x60/0x78
>> iterate_fd+0x128/0x168
>> selinux_bprm_committing_creds+0x178/0x248
>> security_bprm_committing_creds+0x30/0x48
>> install_exec_creds+0x1c/0x68
>> load_elf_binary+0x3a4/0x14e0
>> search_binary_handler+0xb0/0x1e0
>>
>> So, use spinlock to access dmabuf->name to avoid sleep-while-atomic.
>>
>> Cc:  [5.3+]
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>
>> Changes in V2: Addressed review comments from Ruhl, Michael J
>>
>> Changes in V1: https://lore.kernel.org/patchwork/patch/1255055/
>>
>> drivers/dma-buf/dma-buf.c | 11 +++
>> include/linux/dma-buf.h   |  1 +
>> 2 files changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index 01ce125..d81d298 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -45,10 +45,10 @@ static char *dmabuffs_dname(struct dentry *dentry,
>> char *buffer, int buflen)
>>  size_t ret = 0;
>>
>>  dmabuf = dentry->d_fsdata;
>> -dma_resv_lock(dmabuf->resv, NULL);
>> +spin_lock(&dmabuf->name_lock);
>>  if (dmabuf->name)
>>  ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
>> -dma_resv_unlock(dmabuf->resv);
>> +spin_unlock(&dmabuf->name_lock);
>>
>>  return dynamic_dname(dentry, buffer, buflen, "/%s:%s",
>>   dentry->d_name.name, ret > 0 ? name : "");
>> @@ -341,8 +341,10 @@ static long dma_buf_set_name(struct dma_buf
>> *dmabuf, const char __user *buf)
>>  kfree(name);
>>  goto out_unlock;
>>  }
>> +spin_lock(&dmabuf->name_lock);
>>  kfree(dmabuf->name);
>>  dmabuf->name = name;
>> +spin_unlock(&dmabuf->name_lock);
> 
> While this code path is ok, I would have separated the protection of the
> attachment list and the name manipulation.
> 
> dma_resv_lock(resv)
> if (!list_empty(attachment)
>   ret = -EBUSY
> dma_resv_unlock(resv)
> 
> if (ret) {
>   kfree(name)
>   return ret;
> }

Is it that the name should be visible before an importer attaches to the
dmabuf (using dma_buf_attach()), and that is why dma_buf_set_name() is
under the dma_resv_lock() as well?
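
For reference, the split you are suggesting would look roughly like the
below in dma_buf_set_name() (a sketch based on the pseudo-code quoted
above, not a final hunk):

	long ret = 0;

	/* Attachment check stays under the reservation lock ... */
	dma_resv_lock(dmabuf->resv, NULL);
	if (!list_empty(&dmabuf->attachments))
		ret = -EBUSY;
	dma_resv_unlock(dmabuf->resv);

	if (ret) {
		kfree(name);
		return ret;
	}

	/* ... and only the name pointer swap takes the new spinlock. */
	spin_lock(&dmabuf->name_lock);
	kfree(dmabuf->name);
	dmabuf->name = name;
	spin_unlock(&dmabuf->name_lock);

	return 0;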

> 
> spinlock(nam_lock)
> ...
> 
> Nesting locks  that don't need to be nested always makes m

[PATCH v2] dmabuf: use spinlock to access dmabuf->name

2020-06-19 Thread Charan Teja Kalla
There exists a sleep-while-atomic bug while accessing the dmabuf->name
under a mutex in dmabuffs_dname(). This is caused by the SELinux
permission checks on a process where it tries to validate the inherited
files from fork() by traversing them through iterate_fd() (which
traverses files under a spin_lock) and calling
match_file(security/selinux/hooks.c), where the permission checks happen.
This audit information is logged using dump_common_audit_data(), where it
calls d_path() to get the file path name. If the file check happens on
the dmabuf's fd, then it ends up in dmabuffs_dname() and uses a mutex to
access dmabuf->name. The flow will be like below:
flush_unauthorized_files()
  iterate_fd()
spin_lock() --> Start of the atomic section.
  match_file()
file_has_perm()
  avc_has_perm()
avc_audit()
  slow_avc_audit()
common_lsm_audit()
  dump_common_audit_data()
audit_log_d_path()
  d_path()
dmabuffs_dname()
  mutex_lock()--> Sleep while atomic.

Call trace captured (on 4.19 kernels) is below:
___might_sleep+0x204/0x208
__might_sleep+0x50/0x88
__mutex_lock_common+0x5c/0x1068
__mutex_lock_common+0x5c/0x1068
mutex_lock_nested+0x40/0x50
dmabuffs_dname+0xa0/0x170
d_path+0x84/0x290
audit_log_d_path+0x74/0x130
common_lsm_audit+0x334/0x6e8
slow_avc_audit+0xb8/0xf8
avc_has_perm+0x154/0x218
file_has_perm+0x70/0x180
match_file+0x60/0x78
iterate_fd+0x128/0x168
selinux_bprm_committing_creds+0x178/0x248
security_bprm_committing_creds+0x30/0x48
install_exec_creds+0x1c/0x68
load_elf_binary+0x3a4/0x14e0
search_binary_handler+0xb0/0x1e0

So, use spinlock to access dmabuf->name to avoid sleep-while-atomic.

Cc:  [5.3+]
Signed-off-by: Charan Teja Reddy 
---

Changes in V2: Addressed review comments from Ruhl, Michael J

Changes in V1: https://lore.kernel.org/patchwork/patch/1255055/

 drivers/dma-buf/dma-buf.c | 11 +++
 include/linux/dma-buf.h   |  1 +
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 01ce125..d81d298 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -45,10 +45,10 @@ static char *dmabuffs_dname(struct dentry *dentry, char 
*buffer, int buflen)
size_t ret = 0;
 
dmabuf = dentry->d_fsdata;
-   dma_resv_lock(dmabuf->resv, NULL);
+   spin_lock(&dmabuf->name_lock);
if (dmabuf->name)
ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
-   dma_resv_unlock(dmabuf->resv);
+   spin_unlock(&dmabuf->name_lock);
 
return dynamic_dname(dentry, buffer, buflen, "/%s:%s",
 dentry->d_name.name, ret > 0 ? name : "");
@@ -341,8 +341,10 @@ static long dma_buf_set_name(struct dma_buf *dmabuf, const 
char __user *buf)
kfree(name);
goto out_unlock;
}
+   spin_lock(&dmabuf->name_lock);
kfree(dmabuf->name);
dmabuf->name = name;
+   spin_unlock(&dmabuf->name_lock);
 
 out_unlock:
dma_resv_unlock(dmabuf->resv);
@@ -405,10 +407,10 @@ static void dma_buf_show_fdinfo(struct seq_file *m, 
struct file *file)
/* Don't count the temporary reference taken inside procfs seq_show */
seq_printf(m, "count:\t%ld\n", file_count(dmabuf->file) - 1);
seq_printf(m, "exp_name:\t%s\n", dmabuf->exp_name);
-   dma_resv_lock(dmabuf->resv, NULL);
+   spin_lock(&dmabuf->name_lock);
if (dmabuf->name)
seq_printf(m, "name:\t%s\n", dmabuf->name);
-   dma_resv_unlock(dmabuf->resv);
+   spin_unlock(&dmabuf->name_lock);
 }
 
 static const struct file_operations dma_buf_fops = {
@@ -546,6 +548,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
dmabuf->size = exp_info->size;
dmabuf->exp_name = exp_info->exp_name;
dmabuf->owner = exp_info->owner;
+   spin_lock_init(&dmabuf->name_lock);
init_waitqueue_head(&dmabuf->poll);
dmabuf->cb_excl.poll = dmabuf->cb_shared.poll = &dmabuf->poll;
dmabuf->cb_excl.active = dmabuf->cb_shared.active = 0;
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index ab0c156..93108fd 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -311,6 +311,7 @@ struct dma_buf {
void *vmap_ptr;
const char *exp_name;
const char *name;
+   spinlock_t name_lock;
struct module *owner;
struct list_head list_node;
void *priv;
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [PATCH] dmabuf: use spinlock to access dmabuf->name

2020-06-18 Thread Charan Teja Kalla



On 6/17/2020 11:13 PM, Ruhl, Michael J wrote:
>> -Original Message-
>> From: charante=codeaurora@mg.codeaurora.org
>>  On Behalf Of Charan Teja
>> Kalla
>> Sent: Wednesday, June 17, 2020 2:29 AM
>> To: Ruhl, Michael J ; Sumit Semwal
>> ; open list:DMA BUFFER SHARING FRAMEWORK
>> ; DRI mailing list > de...@lists.freedesktop.org>
>> Cc: Linaro MM SIG ;
>> vinme...@codeaurora.org; LKML ;
>> sta...@vger.kernel.org
>> Subject: Re: [PATCH] dmabuf: use spinlock to access dmabuf->name
>>
>> Thanks Michael for the comments..
>>
>> On 6/16/2020 7:29 PM, Ruhl, Michael J wrote:
>>>> -Original Message-----
>>>> From: dri-devel  On Behalf Of
>>>> Ruhl, Michael J
>>>> Sent: Tuesday, June 16, 2020 9:51 AM
>>>> To: Charan Teja Kalla ; Sumit Semwal
>>>> ; open list:DMA BUFFER SHARING
>> FRAMEWORK
>>>> ; DRI mailing list >>> de...@lists.freedesktop.org>
>>>> Cc: Linaro MM SIG ;
>>>> vinme...@codeaurora.org; LKML ;
>>>> sta...@vger.kernel.org
>>>> Subject: RE: [PATCH] dmabuf: use spinlock to access dmabuf->name
>>>>
>>>>> -Original Message-
>>>>> From: dri-devel  On Behalf Of
>>>>> Charan Teja Kalla
>>>>> Sent: Thursday, June 11, 2020 9:40 AM
>>>>> To: Sumit Semwal ; open list:DMA BUFFER
>>>>> SHARING FRAMEWORK ; DRI mailing list
>> >>>> de...@lists.freedesktop.org>
>>>>> Cc: Linaro MM SIG ;
>>>>> vinme...@codeaurora.org; LKML ;
>>>>> sta...@vger.kernel.org
>>>>> Subject: [PATCH] dmabuf: use spinlock to access dmabuf->name
>>>>>
>>>>> There exists a sleep-while-atomic bug while accessing the dmabuf->name
>>>>> under mutex in the dmabuffs_dname(). This is caused from the SELinux
>>>>> permissions checks on a process where it tries to validate the inherited
>>>>> files from fork() by traversing them through iterate_fd() (which
>>>>> traverse files under spin_lock) and call
>>>>> match_file(security/selinux/hooks.c) where the permission checks
>> happen.
>>>>> This audit information is logged using dump_common_audit_data()
>> where it
>>>>> calls d_path() to get the file path name. If the file check happen on
>>>>> the dmabuf's fd, then it ends up in ->dmabuffs_dname() and use mutex
>> to
>>>>> access dmabuf->name. The flow will be like below:
>>>>> flush_unauthorized_files()
>>>>>  iterate_fd()
>>>>>spin_lock() --> Start of the atomic section.
>>>>>  match_file()
>>>>>file_has_perm()
>>>>>  avc_has_perm()
>>>>>avc_audit()
>>>>>  slow_avc_audit()
>>>>>   common_lsm_audit()
>>>>> dump_common_audit_data()
>>>>>   audit_log_d_path()
>>>>> d_path()
>>>>>dmabuffs_dname()
>>>>>  mutex_lock()--> Sleep while atomic.
>>>>>
>>>>> Call trace captured (on 4.19 kernels) is below:
>>>>> ___might_sleep+0x204/0x208
>>>>> __might_sleep+0x50/0x88
>>>>> __mutex_lock_common+0x5c/0x1068
>>>>> __mutex_lock_common+0x5c/0x1068
>>>>> mutex_lock_nested+0x40/0x50
>>>>> dmabuffs_dname+0xa0/0x170
>>>>> d_path+0x84/0x290
>>>>> audit_log_d_path+0x74/0x130
>>>>> common_lsm_audit+0x334/0x6e8
>>>>> slow_avc_audit+0xb8/0xf8
>>>>> avc_has_perm+0x154/0x218
>>>>> file_has_perm+0x70/0x180
>>>>> match_file+0x60/0x78
>>>>> iterate_fd+0x128/0x168
>>>>> selinux_bprm_committing_creds+0x178/0x248
>>>>> security_bprm_committing_creds+0x30/0x48
>>>>> install_exec_creds+0x1c/0x68
>>>>> load_elf_binary+0x3a4/0x14e0
>>>>> search_binary_handler+0xb0/0x1e0
>>>>>
>>>>> So, use spinlock to access dmabuf->name to avoid sleep-while-atomic.
>>>>>
>>>>> Cc:  [5.3+]
>>>>> Signed-off-by: Charan Teja Reddy 
>>>>> ---
>>>>> drivers/dma-buf/dma-buf.c | 13 +++--
>>>>> include/linux/dma-buf.h   |  1 +
>>>>> 2 files changed, 8

Re: [PATCH] dmabuf: use spinlock to access dmabuf->name

2020-06-17 Thread Charan Teja Kalla



On 6/17/2020 1:51 PM, David Laight wrote:
> From: Charan Teja Kalla
>> Sent: 17 June 2020 07:29
> ...
>>>> If name is freed you will copy garbage, but the only way
>>>> for that to happen is that _set_name or _release have to be called
>>>> at just the right time.
>>>>
>>>> And the above would probably only be an issue if the set_name
>>>> was called, so you will get NULL or a real name.
>>
>> And there exists a use-after-free, and avoiding it requires the lock. Say
>> that the memcpy() in dmabuffs_dname() is in progress and, in parallel,
>> dma_buf_set_name() frees the same buffer that memcpy() is operating on.
> 
> If the name is being looked at while the item is being freed
> you almost certainly have much bigger problems that just
> the name being a 'junk' pointer.

True, thus needs the lock.
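
To spell out the window (a rough sketch of the interleaving, not a
captured trace):

	CPU0: dmabuffs_dname()               CPU1: dma_buf_set_name()
	  reads dmabuf->name pointer
	                                       kfree(dmabuf->name);
	                                       dmabuf->name = name;
	  copies from the freed pointer        /* use-after-free */

Taking dmabuf->name_lock on both sides makes the copy and the
kfree()/reassignment mutually exclusive.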

> 
>   David.
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] dmabuf: use spinlock to access dmabuf->name

2020-06-17 Thread Charan Teja Kalla
Thanks Michael for the comments..

On 6/16/2020 7:29 PM, Ruhl, Michael J wrote:
>> -Original Message-
>> From: dri-devel  On Behalf Of
>> Ruhl, Michael J
>> Sent: Tuesday, June 16, 2020 9:51 AM
>> To: Charan Teja Kalla ; Sumit Semwal
>> ; open list:DMA BUFFER SHARING FRAMEWORK
>> ; DRI mailing list > de...@lists.freedesktop.org>
>> Cc: Linaro MM SIG ;
>> vinme...@codeaurora.org; LKML ;
>> sta...@vger.kernel.org
>> Subject: RE: [PATCH] dmabuf: use spinlock to access dmabuf->name
>>
>>> -Original Message-
>>> From: dri-devel  On Behalf Of
>>> Charan Teja Kalla
>>> Sent: Thursday, June 11, 2020 9:40 AM
>>> To: Sumit Semwal ; open list:DMA BUFFER
>>> SHARING FRAMEWORK ; DRI mailing list >> de...@lists.freedesktop.org>
>>> Cc: Linaro MM SIG ;
>>> vinme...@codeaurora.org; LKML ;
>>> sta...@vger.kernel.org
>>> Subject: [PATCH] dmabuf: use spinlock to access dmabuf->name
>>>
>>> There exists a sleep-while-atomic bug while accessing the dmabuf->name
>>> under mutex in the dmabuffs_dname(). This is caused from the SELinux
>>> permissions checks on a process where it tries to validate the inherited
>>> files from fork() by traversing them through iterate_fd() (which
>>> traverse files under spin_lock) and call
>>> match_file(security/selinux/hooks.c) where the permission checks happen.
>>> This audit information is logged using dump_common_audit_data() where it
>>> calls d_path() to get the file path name. If the file check happen on
>>> the dmabuf's fd, then it ends up in ->dmabuffs_dname() and use mutex to
>>> access dmabuf->name. The flow will be like below:
>>> flush_unauthorized_files()
>>>  iterate_fd()
>>>spin_lock() --> Start of the atomic section.
>>>  match_file()
>>>file_has_perm()
>>>  avc_has_perm()
>>>avc_audit()
>>>  slow_avc_audit()
>>> common_lsm_audit()
>>>   dump_common_audit_data()
>>> audit_log_d_path()
>>>   d_path()
>>>dmabuffs_dname()
>>>  mutex_lock()--> Sleep while atomic.
>>>
>>> Call trace captured (on 4.19 kernels) is below:
>>> ___might_sleep+0x204/0x208
>>> __might_sleep+0x50/0x88
>>> __mutex_lock_common+0x5c/0x1068
>>> __mutex_lock_common+0x5c/0x1068
>>> mutex_lock_nested+0x40/0x50
>>> dmabuffs_dname+0xa0/0x170
>>> d_path+0x84/0x290
>>> audit_log_d_path+0x74/0x130
>>> common_lsm_audit+0x334/0x6e8
>>> slow_avc_audit+0xb8/0xf8
>>> avc_has_perm+0x154/0x218
>>> file_has_perm+0x70/0x180
>>> match_file+0x60/0x78
>>> iterate_fd+0x128/0x168
>>> selinux_bprm_committing_creds+0x178/0x248
>>> security_bprm_committing_creds+0x30/0x48
>>> install_exec_creds+0x1c/0x68
>>> load_elf_binary+0x3a4/0x14e0
>>> search_binary_handler+0xb0/0x1e0
>>>
>>> So, use spinlock to access dmabuf->name to avoid sleep-while-atomic.
>>>
>>> Cc:  [5.3+]
>>> Signed-off-by: Charan Teja Reddy 
>>> ---
>>> drivers/dma-buf/dma-buf.c | 13 +++--
>>> include/linux/dma-buf.h   |  1 +
>>> 2 files changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 01ce125..2e0456c 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -45,10 +45,10 @@ static char *dmabuffs_dname(struct dentry *dentry,
>>> char *buffer, int buflen)
>>> size_t ret = 0;
>>>
>>> dmabuf = dentry->d_fsdata;
>>> -   dma_resv_lock(dmabuf->resv, NULL);
>>> +   spin_lock(&dmabuf->name_lock);
>>> if (dmabuf->name)
>>> ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
>>> -   dma_resv_unlock(dmabuf->resv);
>>> +   spin_unlock(&dmabuf->name_lock);
>>
>> I am not really clear on why you need this lock.
>>
>> If name == NULL you have no issues.
>> If name is real, you have no issues.

Yeah, ideal cases...

>>
>> If name is freed you will copy garbage, but the only way
>> for that to happen is that _set_name or _release have to be called
>> at just the right time.
>>
>> And the above would probably only be an issue if the set_name
>> was called, so you will get NULL or a re

Re: [PATCH v2] dma-buf: Move dma_buf_release() from fops to dentry_ops

2020-06-16 Thread Charan Teja Kalla
Thanks Sumit for the fix.

On 6/11/2020 5:14 PM, Sumit Semwal wrote:
> Charan Teja reported a 'use-after-free' in dmabuffs_dname [1], which
> happens if the dma_buf_release() is called while the userspace is
> accessing the dma_buf pseudo fs's dmabuffs_dname() in another process,
> and dma_buf_release() releases the dmabuf object when the last reference
> to the struct file goes away.
> 
> I discussed with Arnd Bergmann, and he suggested that rather than tying
> the dma_buf_release() to the file_operations' release(), we can tie it to
> the dentry_operations' d_release(), which will be called when the last ref
> to the dentry is removed.
> 
> The path exercised by __fput() calls f_op->release() first, and then calls
> dput, which eventually calls d_op->d_release().
> 
> In the 'normal' case, when no userspace access is happening via dma_buf
> pseudo fs, there should be exactly one fd, file, dentry and inode, so
> closing the fd will kill of everything right away.
> 
> In the presented case, the dentry's d_release() will be called only when
> the dentry's last ref is released.
> 
> Therefore, lets move dma_buf_release() from fops->release() to
> d_ops->d_release()
> 
> Many thanks to Arnd for his FS insights :)
> 
> [1]: https://lore.kernel.org/patchwork/patch/1238278/
> 
> Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")
> Reported-by: syzbot+3643a18836bce555b...@syzkaller.appspotmail.com
> Cc:  [5.3+]
> Cc: Arnd Bergmann 
> Reported-by: Charan Teja Reddy 
> Reviewed-by: Arnd Bergmann 
> Signed-off-by: Sumit Semwal 
> 

Tested this patch with Android running on Snapdragon hardware and saw no
issues.
Tested-by: Charan Teja Reddy 

> ---
> v2: per Arnd: Moved dma_buf_release() above to avoid forward declaration;
>  removed dentry_ops check.
> ---
>  drivers/dma-buf/dma-buf.c | 54 ++-
>  1 file changed, 25 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 01ce125f8e8d..412629601ad3 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -54,37 +54,11 @@ static char *dmabuffs_dname(struct dentry *dentry, char 
> *buffer, int buflen)
>dentry->d_name.name, ret > 0 ? name : "");
>  }
>  
> -static const struct dentry_operations dma_buf_dentry_ops = {
> - .d_dname = dmabuffs_dname,
> -};
> -
> -static struct vfsmount *dma_buf_mnt;
> -
> -static int dma_buf_fs_init_context(struct fs_context *fc)
> -{
> - struct pseudo_fs_context *ctx;
> -
> - ctx = init_pseudo(fc, DMA_BUF_MAGIC);
> - if (!ctx)
> - return -ENOMEM;
> - ctx->dops = &dma_buf_dentry_ops;
> - return 0;
> -}
> -
> -static struct file_system_type dma_buf_fs_type = {
> - .name = "dmabuf",
> - .init_fs_context = dma_buf_fs_init_context,
> - .kill_sb = kill_anon_super,
> -};
> -
> -static int dma_buf_release(struct inode *inode, struct file *file)
> +static void dma_buf_release(struct dentry *dentry)
>  {
>   struct dma_buf *dmabuf;
>  
> - if (!is_dma_buf_file(file))
> - return -EINVAL;
> -
> - dmabuf = file->private_data;
> + dmabuf = dentry->d_fsdata;
>  
>   BUG_ON(dmabuf->vmapping_counter);
>  
> @@ -110,9 +84,32 @@ static int dma_buf_release(struct inode *inode, struct 
> file *file)
>   module_put(dmabuf->owner);
>   kfree(dmabuf->name);
>   kfree(dmabuf);
> +}
> +
> +static const struct dentry_operations dma_buf_dentry_ops = {
> + .d_dname = dmabuffs_dname,
> + .d_release = dma_buf_release,
> +};
> +
> +static struct vfsmount *dma_buf_mnt;
> +
> +static int dma_buf_fs_init_context(struct fs_context *fc)
> +{
> + struct pseudo_fs_context *ctx;
> +
> + ctx = init_pseudo(fc, DMA_BUF_MAGIC);
> + if (!ctx)
> + return -ENOMEM;
> + ctx->dops = &dma_buf_dentry_ops;
>   return 0;
>  }
>  
> +static struct file_system_type dma_buf_fs_type = {
> + .name = "dmabuf",
> + .init_fs_context = dma_buf_fs_init_context,
> + .kill_sb = kill_anon_super,
> +};
> +
>  static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct 
> *vma)
>  {
>   struct dma_buf *dmabuf;
> @@ -412,7 +409,6 @@ static void dma_buf_show_fdinfo(struct seq_file *m, 
> struct file *file)
>  }
>  
>  static const struct file_operations dma_buf_fops = {
> - .release= dma_buf_release,
>   .mmap   = dma_buf_mmap_internal,
>   .llseek = dma_buf_llseek,
>   .poll   = dma_buf_poll,
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] mm, page_alloc: skip ->waternark_boost for atomic order-0 allocations

2020-06-12 Thread Charan Teja Kalla
Thanks Mel for feedback.

On 6/9/2020 5:58 PM, Mel Gorman wrote:
> On Tue, May 19, 2020 at 03:28:04PM +0530, Charan Teja Reddy wrote:
>> When boosting is enabled, it is observed that rate of atomic order-0
>> allocation failures are high due to the fact that free levels in the
>> system are checked with ->watermark_boost offset. This is not a problem
>> for sleepable allocations but for atomic allocations which looks like
>> regression.
>>
> 
> Are high-order allocations in general of interest to this platform? If
> not then a potential option is to simply disable boosting. The patch is
> still relevant but it's worth thinking about.
> 

Yes, we do care up to order-3.

>> This problem is seen frequently on system setup of Android kernel
>> running on Snapdragon hardware with 4GB RAM size. When no extfrag event
>> occurred in the system, ->watermark_boost factor is zero, thus the
>> watermark configurations in the system are:
>>_watermark = (
>>   [WMARK_MIN] = 1272, --> ~5MB
>>   [WMARK_LOW] = 9067, --> ~36MB
>>   [WMARK_HIGH] = 9385), --> ~38MB
>>watermark_boost = 0
>>
>> After launching some memory hungry applications in Android which can
>> cause extfrag events in the system to an extent that ->watermark_boost
>> can be set to max i.e. default boost factor makes it to 150% of high
>> watermark.
>>_watermark = (
>>   [WMARK_MIN] = 1272, --> ~5MB
>>   [WMARK_LOW] = 9067, --> ~36MB
>>   [WMARK_HIGH] = 9385), --> ~38MB
>>watermark_boost = 14077, -->~57MB
>>
>> With default system configuration, for an atomic order-0 allocation to
>> succeed, having free memory of ~2MB will suffice. But boosting makes
>> the min_wmark to ~61MB thus for an atomic order-0 allocation to be
>> successful system should have minimum of ~23MB of free memory(from
>> calculations of zone_watermark_ok(), min = 3/4(min/2)). But failures are
>> observed despite system is having ~20MB of free memory. In the testing,
>> this is reproducible as early as first 300secs since boot and with
>> furtherlowram configurations(<2GB) it is observed as early as first
>> 150secs since boot.
>>
>> These failures can be avoided by excluding the ->watermark_boost in
>> watermark caluculations for atomic order-0 allocations.
>>
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>  mm/page_alloc.c | 12 
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d001d61..5193d7e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3709,6 +3709,18 @@ static bool zone_allows_reclaim(struct zone 
>> *local_zone, struct zone *zone)
>>  }
>>  
>>  mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
>> +/*
>> + * Allow GFP_ATOMIC order-0 allocations to exclude the
>> + * zone->watermark_boost in its watermark calculations.
>> + * We rely on the ALLOC_ flags set for GFP_ATOMIC
>> + * requests in gfp_to_alloc_flags() for this. Reason not to
>> + * use the GFP_ATOMIC directly is that we want to fall back
>> + * to slow path thus wake up kswapd.
>> + */
> 
> The comment is a bit difficult to parse. Maybe this.
> 
>   /*
>* Ignore watermark boosting for GFP_ATOMIC order-0 allocations
>* when checking the min watermark. The min watermark is the
>* point where boosting is ignored so that kswapd is woken up
>* when below the low watermark.
>*/
> 
> I left out the ALLOC_ part for reasons that are explained blow.
> 
>> +if (unlikely(!order && !(alloc_flags & ALLOC_WMARK_MASK) &&
>> + (alloc_flags & (ALLOC_HARDER | ALLOC_HIGH)))) {
>> +mark = zone->_watermark[WMARK_MIN];
>> +}
> 
> The second check is a bit more obscure than it needs to be and depends
> on WMARK_MIN == 0. That will probably be true forever but it's not
> obvious at a glance. I suggest something like
> ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN).
> 
> For detecting atomic alloctions, you rely on the either ALLOC_HARDER or
> ALLOC_HIGH being set. ALLOC_HIGH can be set for non-atomic allocations
> and ALLOC_HARDER can be set for RT tasks. You probably should just test
> the gfp_mask because as it stands non-atomic allocations can ignore the
> boost too.
> 
> Finally, the patch puts an unlikely check into a relatively fast path even
> though watermarks may be fine with or without boosting.  Instead you could
> put the checks in zone_watermark_fast() if and only if the watermarks
> failed the first time. If the checks pass, the watermarks get checked
> a second time. This will be fractionally slower for requests failing
> watermark checks but there is no penalty for most allocation requests.
> It would need the gfp_mask to be passed into zone_watermark_fast but
> as it's an inlined function, there should be no cost to passing in the
> arguement i.e. do something 

[PATCH] dmabuf: use spinlock to access dmabuf->name

2020-06-11 Thread Charan Teja Kalla
There exists a sleep-while-atomic bug while accessing the dmabuf->name
under a mutex in dmabuffs_dname(). This is caused by the SELinux
permission checks on a process where it tries to validate the inherited
files from fork() by traversing them through iterate_fd() (which
traverses files under a spin_lock) and calling
match_file(security/selinux/hooks.c), where the permission checks happen.
This audit information is logged using dump_common_audit_data(), where it
calls d_path() to get the file path name. If the file check happens on
the dmabuf's fd, then it ends up in dmabuffs_dname() and uses a mutex to
access dmabuf->name. The flow will be like below:
flush_unauthorized_files()
  iterate_fd()
spin_lock() --> Start of the atomic section.
  match_file()
file_has_perm()
  avc_has_perm()
avc_audit()
  slow_avc_audit()
common_lsm_audit()
  dump_common_audit_data()
audit_log_d_path()
  d_path()
dmabuffs_dname()
  mutex_lock()--> Sleep while atomic.

Call trace captured (on 4.19 kernels) is below:
___might_sleep+0x204/0x208
__might_sleep+0x50/0x88
__mutex_lock_common+0x5c/0x1068
__mutex_lock_common+0x5c/0x1068
mutex_lock_nested+0x40/0x50
dmabuffs_dname+0xa0/0x170
d_path+0x84/0x290
audit_log_d_path+0x74/0x130
common_lsm_audit+0x334/0x6e8
slow_avc_audit+0xb8/0xf8
avc_has_perm+0x154/0x218
file_has_perm+0x70/0x180
match_file+0x60/0x78
iterate_fd+0x128/0x168
selinux_bprm_committing_creds+0x178/0x248
security_bprm_committing_creds+0x30/0x48
install_exec_creds+0x1c/0x68
load_elf_binary+0x3a4/0x14e0
search_binary_handler+0xb0/0x1e0

So, use spinlock to access dmabuf->name to avoid sleep-while-atomic.

Cc:  [5.3+]
Signed-off-by: Charan Teja Reddy 
---
 drivers/dma-buf/dma-buf.c | 13 +++--
 include/linux/dma-buf.h   |  1 +
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 01ce125..2e0456c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -45,10 +45,10 @@ static char *dmabuffs_dname(struct dentry *dentry, char 
*buffer, int buflen)
size_t ret = 0;
 
dmabuf = dentry->d_fsdata;
-   dma_resv_lock(dmabuf->resv, NULL);
+   spin_lock(&dmabuf->name_lock);
if (dmabuf->name)
ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
-   dma_resv_unlock(dmabuf->resv);
+   spin_unlock(&dmabuf->name_lock);
 
return dynamic_dname(dentry, buffer, buflen, "/%s:%s",
 dentry->d_name.name, ret > 0 ? name : "");
@@ -335,7 +335,7 @@ static long dma_buf_set_name(struct dma_buf *dmabuf, const 
char __user *buf)
if (IS_ERR(name))
return PTR_ERR(name);
 
-   dma_resv_lock(dmabuf->resv, NULL);
+   spin_lock(&dmabuf->name_lock);
if (!list_empty(&dmabuf->attachments)) {
ret = -EBUSY;
kfree(name);
@@ -345,7 +345,7 @@ static long dma_buf_set_name(struct dma_buf *dmabuf, const 
char __user *buf)
dmabuf->name = name;
 
 out_unlock:
-   dma_resv_unlock(dmabuf->resv);
+   spin_unlock(&dmabuf->name_lock);
return ret;
 }
 
@@ -405,10 +405,10 @@ static void dma_buf_show_fdinfo(struct seq_file *m, 
struct file *file)
/* Don't count the temporary reference taken inside procfs seq_show */
seq_printf(m, "count:\t%ld\n", file_count(dmabuf->file) - 1);
seq_printf(m, "exp_name:\t%s\n", dmabuf->exp_name);
-   dma_resv_lock(dmabuf->resv, NULL);
+   spin_lock(&dmabuf->name_lock);
if (dmabuf->name)
seq_printf(m, "name:\t%s\n", dmabuf->name);
-   dma_resv_unlock(dmabuf->resv);
+   spin_unlock(&dmabuf->name_lock);
 }
 
 static const struct file_operations dma_buf_fops = {
@@ -546,6 +546,7 @@ struct dma_buf *dma_buf_export(const struct 
dma_buf_export_info *exp_info)
dmabuf->size = exp_info->size;
dmabuf->exp_name = exp_info->exp_name;
dmabuf->owner = exp_info->owner;
+   spin_lock_init(&dmabuf->name_lock);
init_waitqueue_head(&dmabuf->poll);
dmabuf->cb_excl.poll = dmabuf->cb_shared.poll = &dmabuf->poll;
dmabuf->cb_excl.active = dmabuf->cb_shared.active = 0;
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index ab0c156..93108fd 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -311,6 +311,7 @@ struct dma_buf {
void *vmap_ptr;
const char *exp_name;
const char *name;
+   spinlock_t name_lock;
struct module *owner;
struct list_head list_node;
void *priv;
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a 
Linux Foundation Collaborative Project


[PATCH] mm, page_alloc: skip ->watermark_boost for atomic order-0 allocations-fix

2020-06-11 Thread Charan Teja Kalla
When boosting is enabled, it is observed that the rate of atomic order-0
allocation failures is high due to the fact that free levels in the
system are checked with the ->watermark_boost offset. This is not a problem
for sleepable allocations, but for atomic allocations it looks like a
regression.

This problem is seen frequently on system setup of Android kernel
running on Snapdragon hardware with 4GB RAM size. When no extfrag event
occurred in the system, ->watermark_boost factor is zero, thus the
watermark configurations in the system are:
   _watermark = (
  [WMARK_MIN] = 1272, --> ~5MB
  [WMARK_LOW] = 9067, --> ~36MB
  [WMARK_HIGH] = 9385), --> ~38MB
   watermark_boost = 0

After launching some memory hungry applications in Android which can
cause extfrag events in the system to an extent that ->watermark_boost
can be set to max i.e. default boost factor makes it to 150% of high
watermark.
   _watermark = (
  [WMARK_MIN] = 1272, --> ~5MB
  [WMARK_LOW] = 9067, --> ~36MB
  [WMARK_HIGH] = 9385), --> ~38MB
   watermark_boost = 14077, -->~57MB

With default system configuration, for an atomic order-0 allocation to
succeed, having free memory of ~2MB will suffice. But boosting pushes
the min_wmark to ~61MB, thus for an atomic order-0 allocation to be
successful the system should have a minimum of ~23MB of free memory (from
the calculations in zone_watermark_ok(), min = 3/4(min/2)). But failures are
observed despite the system having ~20MB of free memory. In testing,
this is reproducible as early as the first 300secs since boot, and with
further lowram configurations (<2GB) it is observed as early as the first
150secs since boot.

These failures can be avoided by excluding the ->watermark_boost in
watermark calculations for atomic order-0 allocations.

Fix-suggested-by: Mel Gorman 
Signed-off-by: Charan Teja Reddy 
---

Change in linux-next: https://lore.kernel.org/patchwork/patch/1244272/ 

 mm/page_alloc.c | 36 
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c435b2..18f407e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3580,7 +3580,7 @@ bool zone_watermark_ok(struct zone *z, unsigned int 
order, unsigned long mark,
 
 static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
unsigned long mark, int highest_zoneidx,
-   unsigned int alloc_flags)
+   unsigned int alloc_flags, gfp_t gfp_mask)
 {
long free_pages = zone_page_state(z, NR_FREE_PAGES);
long cma_pages = 0;
@@ -3602,8 +3602,23 @@ static inline bool zone_watermark_fast(struct zone *z, 
unsigned int order,
mark + z->lowmem_reserve[highest_zoneidx])
return true;
 
-   return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
-   free_pages);
+   if (__zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
+   free_pages))
+   return true;
+   /*
+* Ignore watermark boosting for GFP_ATOMIC order-0 allocations
+* when checking the min watermark. The min watermark is the
+* point where boosting is ignored so that kswapd is woken up
+* when below the low watermark.
+*/
+   if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost
+   && ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
+   mark = z->_watermark[WMARK_MIN];
+   return __zone_watermark_ok(z, order, mark, highest_zoneidx,
+   alloc_flags, free_pages);
+   }
+
+   return false;
 }
 
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
@@ -3746,20 +3761,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, 
struct zone *zone)
}
 
mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
-   /*
-* Allow GFP_ATOMIC order-0 allocations to exclude the
-* zone->watermark_boost in their watermark calculations.
-* We rely on the ALLOC_ flags set for GFP_ATOMIC requests in
-* gfp_to_alloc_flags() for this.  Reason not to use the
-* GFP_ATOMIC directly is that we want to fall back to slow path
-* thus wake up kswapd.
-*/
-   if (unlikely(!order && !(alloc_flags & ALLOC_WMARK_MASK) &&
-(alloc_flags & (ALLOC_HARDER | ALLOC_HIGH)))) {
-   mark = zone->_watermark[WMARK_MIN];
-   }
if (!zone_watermark_fast(zone, order, mark,
-  ac->highest_zoneidx, alloc_flags)) {
+  ac->highest_zoneidx, alloc_flags,
+  

Re: [PATCH] mm, page_alloc: skip ->waternark_boost for atomic order-0 allocations

2020-06-09 Thread Charan Teja Kalla
Adding more people to get additional reviewer inputs.

On 6/5/2020 3:13 AM, Andrew Morton wrote:
> On Tue, 19 May 2020 15:28:04 +0530 Charan Teja Reddy 
>  wrote:
> 
>> When boosting is enabled, it is observed that rate of atomic order-0
>> allocation failures are high due to the fact that free levels in the
>> system are checked with ->watermark_boost offset. This is not a problem
>> for sleepable allocations but for atomic allocations which looks like
>> regression.
>>
>> This problem is seen frequently on system setup of Android kernel
>> running on Snapdragon hardware with 4GB RAM size. When no extfrag event
>> occurred in the system, ->watermark_boost factor is zero, thus the
>> watermark configurations in the system are:
>>_watermark = (
>>   [WMARK_MIN] = 1272, --> ~5MB
>>   [WMARK_LOW] = 9067, --> ~36MB
>>   [WMARK_HIGH] = 9385), --> ~38MB
>>watermark_boost = 0
>>
>> After launching some memory hungry applications in Android which can
>> cause extfrag events in the system to an extent that ->watermark_boost
>> can be set to max i.e. default boost factor makes it to 150% of high
>> watermark.
>>_watermark = (
>>   [WMARK_MIN] = 1272, --> ~5MB
>>   [WMARK_LOW] = 9067, --> ~36MB
>>   [WMARK_HIGH] = 9385), --> ~38MB
>>watermark_boost = 14077, -->~57MB
>>
>> With default system configuration, for an atomic order-0 allocation to
>> succeed, having free memory of ~2MB will suffice. But boosting makes
>> the min_wmark to ~61MB thus for an atomic order-0 allocation to be
>> successful system should have minimum of ~23MB of free memory(from
>> calculations of zone_watermark_ok(), min = 3/4(min/2)). But failures are
>> observed despite system is having ~20MB of free memory. In the testing,
>> this is reproducible as early as first 300secs since boot and with
>> furtherlowram configurations(<2GB) it is observed as early as first
>> 150secs since boot.
>>
>> These failures can be avoided by excluding the ->watermark_boost in
>> watermark caluculations for atomic order-0 allocations.
> 
> Do we have any additional reviewer input on this one?
> 
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3709,6 +3709,18 @@ static bool zone_allows_reclaim(struct zone 
>> *local_zone, struct zone *zone)
>>  }
>>  
>>  mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
>> +/*
>> + * Allow GFP_ATOMIC order-0 allocations to exclude the
>> + * zone->watermark_boost in its watermark calculations.
>> + * We rely on the ALLOC_ flags set for GFP_ATOMIC
>> + * requests in gfp_to_alloc_flags() for this. Reason not to
>> + * use the GFP_ATOMIC directly is that we want to fall back
>> + * to slow path thus wake up kswapd.
>> + */
>> +if (unlikely(!order && !(alloc_flags & ALLOC_WMARK_MASK) &&
>> + (alloc_flags & (ALLOC_HARDER | ALLOC_HIGH)))) {
>> +mark = zone->_watermark[WMARK_MIN];
>> +}
>>  if (!zone_watermark_fast(zone, order, mark,
>> ac->highest_zoneidx, alloc_flags)) {
>>  int ret;
> 
> It would seem smart to do
> 
> --- 
> a/mm/page_alloc.c~mm-page_alloc-skip-waternark_boost-for-atomic-order-0-allocations-fix
> +++ a/mm/page_alloc.c
> @@ -3745,7 +3745,6 @@ retry:
>   }
>   }
>  
> - mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
>   /*
>* Allow GFP_ATOMIC order-0 allocations to exclude the
>* zone->watermark_boost in their watermark calculations.
> @@ -3757,6 +3756,8 @@ retry:
>   if (unlikely(!order && !(alloc_flags & ALLOC_WMARK_MASK) &&
>(alloc_flags & (ALLOC_HARDER | ALLOC_HIGH)))) {
>   mark = zone->_watermark[WMARK_MIN];
> + } else {
> + mark = wmark_pages(zone, alloc_flags & 
> ALLOC_WMARK_MASK);
>   }
>   if (!zone_watermark_fast(zone, order, mark,
>  ac->highest_zoneidx, alloc_flags)) {
> 
> but that makes page_alloc.o 16 bytes larger, so I guess don't bother.
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH] mm, page_alloc: skip ->waternark_boost for atomic order-0 allocations

2020-05-20 Thread Charan Teja Kalla
Thank you Andrew for the comments..

On 5/20/2020 7:10 AM, Andrew Morton wrote:
> On Tue, 19 May 2020 15:28:04 +0530 Charan Teja Reddy 
>  wrote:
> 
>> When boosting is enabled, it is observed that rate of atomic order-0
>> allocation failures are high due to the fact that free levels in the
>> system are checked with ->watermark_boost offset. This is not a problem
>> for sleepable allocations but for atomic allocations which looks like
>> regression.
>>
>> This problem is seen frequently on system setup of Android kernel
>> running on Snapdragon hardware with 4GB RAM size. When no extfrag event
>> occurred in the system, ->watermark_boost factor is zero, thus the
>> watermark configurations in the system are:
>>_watermark = (
>>   [WMARK_MIN] = 1272, --> ~5MB
>>   [WMARK_LOW] = 9067, --> ~36MB
>>   [WMARK_HIGH] = 9385), --> ~38MB
>>watermark_boost = 0
>>
>> After launching some memory hungry applications in Android which can
>> cause extfrag events in the system to an extent that ->watermark_boost
>> can be set to max i.e. default boost factor makes it to 150% of high
>> watermark.
>>_watermark = (
>>   [WMARK_MIN] = 1272, --> ~5MB
>>   [WMARK_LOW] = 9067, --> ~36MB
>>   [WMARK_HIGH] = 9385), --> ~38MB
>>watermark_boost = 14077, -->~57MB
>>
>> With default system configuration, for an atomic order-0 allocation to
>> succeed, having free memory of ~2MB will suffice. But boosting makes
>> the min_wmark to ~61MB thus for an atomic order-0 allocation to be
>> successful system should have minimum of ~23MB of free memory(from
>> calculations of zone_watermark_ok(), min = 3/4(min/2)). But failures are
>> observed despite system is having ~20MB of free memory. In the testing,
>> this is reproducible as early as first 300secs since boot and with
>> furtherlowram configurations(<2GB) it is observed as early as first
>> 150secs since boot.
>>
>> These failures can be avoided by excluding the ->watermark_boost in
>> watermark caluculations for atomic order-0 allocations.
> 
> Seems sensible.
> 
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3709,6 +3709,18 @@ static bool zone_allows_reclaim(struct zone 
>> *local_zone, struct zone *zone)
>>  }
>>  
>>  mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
>> +/*
>> + * Allow GFP_ATOMIC order-0 allocations to exclude the
>> + * zone->watermark_boost in its watermark calculations.
>> + * We rely on the ALLOC_ flags set for GFP_ATOMIC
>> + * requests in gfp_to_alloc_flags() for this. Reason not to
>> + * use the GFP_ATOMIC directly is that we want to fall back
>> + * to slow path thus wake up kswapd.
> 
> Nice comment, but I don't understand it ;)
> 
> Why would testing gfp_mask prevent us from waking kswapd?

This piece of code is in the common function get_page_from_freelist(),
which is called in the below order:
1) From alloc_pages() with alloc_flags = ALLOC_WMARK_LOW. If the
watermark check fails here, then
2) the allocation request falls back to the slow path,
__alloc_pages_slowpath(), with alloc_flags = ALLOC_WMARK_MIN, **where it
wakes up kswapd**.

If I test GFP_ATOMIC directly, then even if watermarks are boosted
(and kswapd has not yet woken up), an atomic allocation can return
success from 1) without ever waking up kswapd. That doesn't seem correct.
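
In other words, the flow being relied on is roughly the below (a
simplified sketch, not the literal kernel code):

	/* __alloc_pages_nodemask(), simplified: */
	page = get_page_from_freelist(gfp_mask, order,
				      alloc_flags /* ALLOC_WMARK_LOW */, &ac);
	if (!page)
		/* wakes kswapd and retries with ALLOC_WMARK_MIN */
		page = __alloc_pages_slowpath(gfp_mask, order, &ac);

Keying off the ALLOC_ flags means the boost is only ignored on the
ALLOC_WMARK_MIN retry, so the first (ALLOC_WMARK_LOW) attempt still fails
and kswapd still gets woken up.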

> 
>> + */
>> +if (unlikely(!order && !(alloc_flags & ALLOC_WMARK_MASK) &&
>> + (alloc_flags & (ALLOC_HARDER | ALLOC_HIGH)))) {
>> +mark = zone->_watermark[WMARK_MIN];
>> +}
> 
> Why is this not implemented for higher-order allocation attempts?

I don't think higher-order allocation failures are that critical; I mean
that there should be no critical users relying on higher-order atomic
allocations to succeed. Please correct me if I am wrong here. At least
this is the case on Android systems.

> 
>>  if (!zone_watermark_fast(zone, order, mark,
>> ac->highest_zoneidx, alloc_flags)) {
>>  int ret;
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project


Re: [PATCH v2] dma-buf: fix use-after-free in dmabuffs_dname

2020-05-14 Thread Charan Teja Kalla
Thank you for the reply.

On 5/13/2020 9:33 PM, Sumit Semwal wrote:
> On Wed, 13 May 2020 at 21:16, Daniel Vetter  wrote:
>>
>> On Wed, May 13, 2020 at 02:51:12PM +0200, Greg KH wrote:
>>> On Wed, May 13, 2020 at 05:40:26PM +0530, Charan Teja Kalla wrote:
>>>>
>>>> Thank you Greg for the comments.
>>>> On 5/12/2020 2:22 PM, Greg KH wrote:
>>>>> On Fri, May 08, 2020 at 12:11:03PM +0530, Charan Teja Reddy wrote:
>>>>>> The following race occurs while accessing the dmabuf object exported as
>>>>>> file:
>>>>>> P1   P2
>>>>>> dma_buf_release()  dmabuffs_dname()
>>>>>> [say lsof reading /proc//fd/]
>>>>>>
>>>>>> read dmabuf stored in dentry->d_fsdata
>>>>>> Free the dmabuf object
>>>>>> Start accessing the dmabuf structure
>>>>>>
>>>>>> In the above description, the dmabuf object freed in P1 is being
>>>>>> accessed from P2 which is resulting into the use-after-free. Below is
>>>>>> the dump stack reported.
>>>>>>
>>>>>> We are reading the dmabuf object stored in the dentry->d_fsdata but
>>>>>> there is no binding between the dentry and the dmabuf which means that
>>>>>> the dmabuf can be freed while it is being read from ->d_fsdata and
>>>>>> inuse. Reviews on the patch V1 says that protecting the dmabuf inuse
>>>>>> with an extra refcount is not a viable solution as the exported dmabuf
>>>>>> is already under file's refcount and keeping the multiple refcounts on
>>>>>> the same object coordinated is not possible.
>>>>>>
>>>>>> As we are reading the dmabuf in ->d_fsdata just to get the user passed
>>>>>> name, we can directly store the name in d_fsdata thus can avoid the
>>>>>> reading of dmabuf altogether.
>>>>>>
>>>>>> Call Trace:
>>>>>>  kasan_report+0x12/0x20
>>>>>>  __asan_report_load8_noabort+0x14/0x20
>>>>>>  dmabuffs_dname+0x4f4/0x560
>>>>>>  tomoyo_realpath_from_path+0x165/0x660
>>>>>>  tomoyo_get_realpath
>>>>>>  tomoyo_check_open_permission+0x2a3/0x3e0
>>>>>>  tomoyo_file_open
>>>>>>  tomoyo_file_open+0xa9/0xd0
>>>>>>  security_file_open+0x71/0x300
>>>>>>  do_dentry_open+0x37a/0x1380
>>>>>>  vfs_open+0xa0/0xd0
>>>>>>  path_openat+0x12ee/0x3490
>>>>>>  do_filp_open+0x192/0x260
>>>>>>  do_sys_openat2+0x5eb/0x7e0
>>>>>>  do_sys_open+0xf2/0x180
>>>>>>
>>>>>> Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")
>>>>>> Reported-by: syzbot+3643a18836bce555b...@syzkaller.appspotmail.com
>>>>>> Cc:  [5.3+]
>>>>>> Signed-off-by: Charan Teja Reddy 
>>>>>> ---
>>>>>>
>>>>>> Changes in v2:
>>>>>>
>>>>>> - Pass the user passed name in ->d_fsdata instead of dmabuf
>>>>>> - Improve the commit message
>>>>>>
>>>>>> Changes in v1: (https://patchwork.kernel.org/patch/11514063/)
>>>>>>
>>>>>>  drivers/dma-buf/dma-buf.c | 17 ++---
>>>>>>  1 file changed, 10 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>>>>> index 01ce125..0071f7d 100644
>>>>>> --- a/drivers/dma-buf/dma-buf.c
>>>>>> +++ b/drivers/dma-buf/dma-buf.c
>>>>>> @@ -25,6 +25,7 @@
>>>>>>  #include 
>>>>>>  #include 
>>>>>>  #include 
>>>>>> +#include 
>>>>>>
>>>>>>  #include 
>>>>>>  #include 
>>>>>> @@ -40,15 +41,13 @@ struct dma_buf_list {
>>>>>>
>>>>>>  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int 
>>>>>> buflen)
>>>>>>  {
>>>>>> -struct dma_buf *dmabuf;
>>>>>>  char name[DMA_BUF_NAME_LEN];
>>>>>>  size_t ret = 0;
>>>>>

Re: [PATCH v2] dma-buf: fix use-after-free in dmabuffs_dname

2020-05-13 Thread Charan Teja Kalla


Thank you Greg for the comments.
On 5/12/2020 2:22 PM, Greg KH wrote:
> On Fri, May 08, 2020 at 12:11:03PM +0530, Charan Teja Reddy wrote:
>> The following race occurs while accessing the dmabuf object exported as
>> file:
>> P1   P2
>> dma_buf_release()  dmabuffs_dname()
>> [say lsof reading /proc//fd/]
>>
>> read dmabuf stored in dentry->d_fsdata
>> Free the dmabuf object
>> Start accessing the dmabuf structure
>>
>> In the above description, the dmabuf object freed in P1 is being
>> accessed from P2 which is resulting into the use-after-free. Below is
>> the dump stack reported.
>>
>> We are reading the dmabuf object stored in the dentry->d_fsdata but
>> there is no binding between the dentry and the dmabuf which means that
>> the dmabuf can be freed while it is being read from ->d_fsdata and
>> inuse. Reviews on the patch V1 says that protecting the dmabuf inuse
>> with an extra refcount is not a viable solution as the exported dmabuf
>> is already under file's refcount and keeping the multiple refcounts on
>> the same object coordinated is not possible.
>>
>> As we are reading the dmabuf in ->d_fsdata just to get the user passed
>> name, we can directly store the name in d_fsdata thus can avoid the
>> reading of dmabuf altogether.
>>
>> Call Trace:
>>  kasan_report+0x12/0x20
>>  __asan_report_load8_noabort+0x14/0x20
>>  dmabuffs_dname+0x4f4/0x560
>>  tomoyo_realpath_from_path+0x165/0x660
>>  tomoyo_get_realpath
>>  tomoyo_check_open_permission+0x2a3/0x3e0
>>  tomoyo_file_open
>>  tomoyo_file_open+0xa9/0xd0
>>  security_file_open+0x71/0x300
>>  do_dentry_open+0x37a/0x1380
>>  vfs_open+0xa0/0xd0
>>  path_openat+0x12ee/0x3490
>>  do_filp_open+0x192/0x260
>>  do_sys_openat2+0x5eb/0x7e0
>>  do_sys_open+0xf2/0x180
>>
>> Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")
>> Reported-by: syzbot+3643a18836bce555b...@syzkaller.appspotmail.com
>> Cc:  [5.3+]
>> Signed-off-by: Charan Teja Reddy 
>> ---
>>
>> Changes in v2: 
>>
>> - Pass the user passed name in ->d_fsdata instead of dmabuf
>> - Improve the commit message
>>
>> Changes in v1: (https://patchwork.kernel.org/patch/11514063/)
>>
>>  drivers/dma-buf/dma-buf.c | 17 ++---
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index 01ce125..0071f7d 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -25,6 +25,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -40,15 +41,13 @@ struct dma_buf_list {
>>  
>>  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
>>  {
>> -struct dma_buf *dmabuf;
>>  char name[DMA_BUF_NAME_LEN];
>>  size_t ret = 0;
>>  
>> -dmabuf = dentry->d_fsdata;
>> -dma_resv_lock(dmabuf->resv, NULL);
>> -if (dmabuf->name)
>> -ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
>> -dma_resv_unlock(dmabuf->resv);
>> +spin_lock(&dentry->d_lock);
> 
> Are you sure this lock always protects d_fsdata?

I think yes. In dma-buf.c I have to make sure that d_fsdata is always
accessed under d_lock, thus it will be protected. (In this posted patch
there is one place, in dma_buf_set_name(), that is missed; I will update
this in V3.)

> 
>> +if (dentry->d_fsdata)
>> +ret = strlcpy(name, dentry->d_fsdata, DMA_BUF_NAME_LEN);
>> +spin_unlock(&dentry->d_lock);
>>  
>>  return dynamic_dname(dentry, buffer, buflen, "/%s:%s",
>>   dentry->d_name.name, ret > 0 ? name : "");
> 
> If the above check fails the name will be what?  How could d_name.name
> be valid but d_fsdata not be valid?

If the check fails, an empty string "" is appended to the name by the
code (ret > 0 ? name : ""; ret is initialized to zero). Thus the name
string will look like "/dmabuf:".

Regarding the validity of d_fsdata, we set the dmabuf's
dentry->d_fsdata to NULL in dma_buf_release(), so it can go invalid if
that dmabuf is in the free path.


> 
> 
>> @@ -80,12 +79,16 @@ static int dma_buf_fs_init_context(struct fs_context *fc)
>>  static int dma_buf_release(struct inode *inode, struct file *file)
>>  {
>>  struct dma_buf *dmabuf;
>> +struct dentry *dentry = file->f_path.dentry;
>>  
>>  if (!is_dma_buf_file(file))
>>  return -EINVAL;
>>  
>>  dmabuf = file->private_data;
>>  
>> +spin_lock(&dentry->d_lock);
>> +dentry->d_fsdata = NULL;
>> +spin_unlock(&dentry->d_lock);
>>  BUG_ON(dmabuf->vmapping_counter);
>>  
>>  /*
>> @@ -343,6 +346,7 @@ static long dma_buf_set_name(struct dma_buf *dmabuf, 
>> const char __user *buf)
>>  }
>>  kfree(dmabuf->name);
>>  dmabuf->name = name;
>> +dmabuf->file->f_path.dentry->d_fsdata = name;
> 
> You are just changing the use of d_fsdata from being a pointer to the
> dmabuf to being a pointer to the name string?  What's to 

Re: [PATCH] mm, page_alloc: reset the zone->watermark_boost early

2020-05-13 Thread Charan Teja Kalla



On 5/12/2020 7:01 PM, Charan Teja Kalla wrote:
> 
> Thank you Andrew for the reply.
> 
> On 5/12/2020 1:41 AM, Andrew Morton wrote:
>> On Mon, 11 May 2020 19:10:08 +0530 Charan Teja Reddy 
>>  wrote:
>>
>>> Updating the zone watermarks by any means, like extra_free_kbytes,
>>> min_free_kbytes, water_mark_scale_factor e.t.c, when watermark_boost is
>>> set will result into the higher low and high watermarks than the user
>>> asks. This can be avoided by resetting the zone->watermark_boost to zero
>>> early.
>>
>> Does this solve some problem which has been observed in testing?

Sorry that I misunderstood your question. Yes, it has solved the problem of
higher watermarks being seen in the zone than what I set through
min_free_kbytes.

Below are the steps I pursued to reproduce the problem
1) My system setup of an Android kernel running on Snapdragon hardware has the
   below settings by default:
   #cat /proc/sys/vm/min_free_kbytes = 5162
   #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
Node 0, zone   Normal
min  797
low  8340
high 8539
boost    0   // This is the extra print I have added to check the boosting
2) Now I try to change the zone watermarks while ->watermark_boost
   is greater than zero. I just write back the same value of min_free_kbytes,
   in which case we should see the same watermarks as the default (I mean as
   in step 1).

   #echo 5162 > /proc/sys/vm/min_free_kbytes

   But I have seen very high values of watermarks in the system,
  # cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
Node 0, zone   Normal
  min  797
  low  21148
  high 21347
  boost   0

So, yes, this problem got fixed with the changes made in this patch. (The
boost column reads 0 here because the old code reset ->watermark_boost only
after it had already been folded into the low and high values.)

> 
> Sorry, what are those issues observed in testing? It would be helpful
> if you post them here.
> 
>>
>>> ...
>>>
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -7746,9 +7746,9 @@ static void __setup_per_zone_wmarks(void)
>>> mult_frac(zone_managed_pages(zone),
>>>   watermark_scale_factor, 1));
>>>  
>>> +   zone->watermark_boost = 0;
>>> zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>>> zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>>> -   zone->watermark_boost = 0;
>>>  
>>> spin_unlock_irqrestore(&zone->lock, flags);
>>> }
>>
>> This could only be a problem if code is accessing these things without
>> holding zone->lock.  Is that ever the case?
>>
> 
> This is a problem even when accessing these things with zone->lock
> held because we are directly using the macro min_wmark_pages(zone)
> which leads to the issue. Pasting macro here for reference.
> 
> #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
> 
> Steps that lead to the issue is like below:
> 1) On the extfrag event, we try to boost the watermark by storing the
>value in ->watermark_boost.
> 
> 2) User changes the value of extra|min_free_kbytes or watermark_scale_factor.
>   
> In __setup_per_zone_wmarks, we directly store the user-asked
> watermarks in the zone structure. In this step, the value is
> always offset by ->watermark_boost, as we use the min_wmark_pages()
> macro.
> 
> 3) Later, when kswapd wakes up, it resets the zone's watermark_boost to zero.
> 
> Step 2 above is what results in the issue.
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [PATCH] mm, page_alloc: reset the zone->watermark_boost early

2020-05-12 Thread Charan Teja Kalla


Thank you Andrew for the reply.

On 5/12/2020 1:41 AM, Andrew Morton wrote:
> On Mon, 11 May 2020 19:10:08 +0530 Charan Teja Reddy 
>  wrote:
> 
>> Updating the zone watermarks by any means, like extra_free_kbytes,
>> min_free_kbytes, watermark_scale_factor, etc., when watermark_boost is
>> set will result in higher low and high watermarks than the user
>> asks for. This can be avoided by resetting zone->watermark_boost to zero
>> early.
> 
> Does this solve some problem which has been observed in testing?

Sorry, what are those issues observed in testing? It would be helpful
if you post them here.

> 
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7746,9 +7746,9 @@ static void __setup_per_zone_wmarks(void)
>>  mult_frac(zone_managed_pages(zone),
>>watermark_scale_factor, 1));
>>  
>> +zone->watermark_boost = 0;
>>  zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>>  zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>> -zone->watermark_boost = 0;
>>  
>> +spin_unlock_irqrestore(&zone->lock, flags);
>>  }
> 
> This could only be a problem if code is accessing these things without
> holding zone->lock.  Is that ever the case?
> 

This is a problem even when accessing these things with zone->lock
held because we are directly using the macro min_wmark_pages(zone)
which leads to the issue. Pasting macro here for reference.

#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)

Steps that lead to the issue is like below:
1) On the extfrag event, we try to boost the watermark by storing the
   value in ->watermark_boost.

2) User changes the value of extra|min_free_kbytes or watermark_scale_factor.
  
   In __setup_per_zone_wmarks, we directly store the user-asked
   watermarks in the zone structure. In this step, the value is always
   offset by ->watermark_boost, as we use the min_wmark_pages() macro.

3) Later, when kswapd wakes up, it resets the zone's watermark_boost to zero.

Step 2 above is what results in the issue.
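
To make the ordering concrete, a condensed sketch of the two orderings
(illustration only, reduced from the hunk quoted above; not the full kernel
function):

	/* min_wmark_pages(z) == z->_watermark[WMARK_MIN] + z->watermark_boost */

	/* before the patch: low/high derived while the boost is still set */
	zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;	/* boost folded in */
	zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;	/* boost folded in */
	zone->watermark_boost = 0;					/* reset comes too late */

	/* with the patch: reset the boost first, then derive low/high */
	zone->watermark_boost = 0;
	zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
	zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;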

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [PATCH] dma-buf: fix use-after-free in dmabuffs_dname

2020-05-11 Thread Charan Teja Kalla



Thank you Greg for the comments.

On 5/6/2020 2:30 PM, Greg KH wrote:

On Wed, May 06, 2020 at 02:00:10PM +0530, Charan Teja Kalla wrote:

Thank you Greg for the reply.

On 5/5/2020 3:38 PM, Greg KH wrote:

On Tue, Apr 28, 2020 at 01:24:02PM +0530, Charan Teja Reddy wrote:

The following race occurs while accessing the dmabuf object exported as
file:
P1                                P2
dma_buf_release()                 dmabuffs_dname()
                                  [say lsof reading /proc/<pid>/fd/<fd>]

                                  read dmabuf stored in dentry->fsdata
Free the dmabuf object
                                  Start accessing the dmabuf structure

In the above description, the dmabuf object freed in P1 is being
accessed from P2, which results in the use-after-free. Below is
the dump stack reported.

Call Trace:
   kasan_report+0x12/0x20
   __asan_report_load8_noabort+0x14/0x20
   dmabuffs_dname+0x4f4/0x560
   tomoyo_realpath_from_path+0x165/0x660
   tomoyo_get_realpath
   tomoyo_check_open_permission+0x2a3/0x3e0
   tomoyo_file_open
   tomoyo_file_open+0xa9/0xd0
   security_file_open+0x71/0x300
   do_dentry_open+0x37a/0x1380
   vfs_open+0xa0/0xd0
   path_openat+0x12ee/0x3490
   do_filp_open+0x192/0x260
   do_sys_openat2+0x5eb/0x7e0
   do_sys_open+0xf2/0x180

Fixes: bb2bb90 ("dma-buf: add DMA_BUF_SET_NAME ioctls")

Nit, please read the documentation for how to do a Fixes: line properly,
you need more digits:
Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")


Will update the patch



Reported-by:syzbot+3643a18836bce555b...@syzkaller.appspotmail.com
Signed-off-by: Charan Teja Reddy

Also, any reason you didn't include the other mailing lists that
get_maintainer.pl said to?


Really sorry for not sending to the complete list. Added now.



And finally, no cc: stable in the s-o-b area for an issue that needs to
be backported to older kernels?


Will update the patch.



---
   drivers/dma-buf/dma-buf.c | 25 +++--
   include/linux/dma-buf.h   |  1 +
   2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 570c923..069d8f78 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -25,6 +25,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
@@ -38,18 +39,34 @@ struct dma_buf_list {
   static struct dma_buf_list db_list;
+static void dmabuf_dent_put(struct dma_buf *dmabuf)
+{
+   if (atomic_dec_and_test(&dmabuf->dent_count)) {
+   kfree(dmabuf->name);
+   kfree(dmabuf);
+   }

Why not just use a kref instead of an open-coded atomic value?


Kref approach looks cleaner. will update the patch accordingly.



+}
+
   static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
   {
struct dma_buf *dmabuf;
char name[DMA_BUF_NAME_LEN];
size_t ret = 0;
+   spin_lock(&dentry->d_lock);
dmabuf = dentry->d_fsdata;
+   if (!dmabuf || !atomic_add_unless(&dmabuf->dent_count, 1, 0)) {
+   spin_unlock(&dentry->d_lock);
+   goto out;

How can dmabuf not be valid here?

And isn't there already a usecount for the buffer?


A dmabuf exported as a file simply relies on that file's refcount; thus fput()
releases the dmabuf.

We are storing the dmabuf in dentry->d_fsdata, but there is no binding
between the dentry and the dmabuf. So, the flow will be like:

1) P1 calls fput(dmabuf_fd)

2) P2 tries to access the file information of P1.
     E.g.: the lsof command trying to list out the dmabuf_fd information using
/proc/<pid>/fd/dmabuf_fd

3) P1 calls file->f_op->release(dmabuf_fd_file) (ends up calling
dma_buf_release()), thus freeing up the dmabuf buffer.

4) P2 accesses the dmabuf stored in dentry->d_fsdata, which was freed in
step 3.

So we need to have some refcount mechanism to avoid the use-after-free in
step 4.
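
For reference, a rough sketch of what the kref variant discussed above could
look like (illustration only; the dent_ref field and dmabuf_dent_release()
helper are made-up names, and the posted V2 instead drops the extra count by
keeping only the name string in d_fsdata):

	/* needs <linux/kref.h>; dent_ref would be a new field in struct dma_buf */
	static void dmabuf_dent_release(struct kref *kref)
	{
		struct dma_buf *dmabuf = container_of(kref, struct dma_buf, dent_ref);

		kfree(dmabuf->name);
		kfree(dmabuf);
	}

	/* dmabuffs_dname(): only dereference the dmabuf while a reference is live */
	spin_lock(&dentry->d_lock);
	dmabuf = dentry->d_fsdata;
	if (!dmabuf || !kref_get_unless_zero(&dmabuf->dent_ref)) {
		spin_unlock(&dentry->d_lock);
		goto out;
	}
	spin_unlock(&dentry->d_lock);
	/* ... copy dmabuf->name under dmabuf->resv ... */
	kref_put(&dmabuf->dent_ref, dmabuf_dent_release);

	/* dma_buf_release(): drop the initial reference instead of kfree()ing directly */
	kref_put(&dmabuf->dent_ref, dmabuf_dent_release);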

Ok, but watch out, now you have 2 different reference counts for the
same structure.  Keeping them coordinated is almost always an impossible
task so you need to only rely on one.  If you can't use the file api,
just drop all of the reference counting logic in there and only use the
kref one.


I feel that changing the refcount logic of dma-buf objects now involves
changes in the core dma-buf framework, no? Instead, how about passing the
user-passed name directly in ->d_fsdata in place of the dmabuf object?
Because we just need the user-passed name in dmabuffs_dname(). With this
we can avoid the need for an extra refcount on the dmabuf.


Posted patch-V2: https://lkml.org/lkml/2020/5/8/158




good luck!

greg k-h


--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [PATCH] dma-buf: fix use-after-free in dmabuffs_dname

2020-05-06 Thread Charan Teja Kalla

Thank you Greg for the reply.

On 5/5/2020 3:38 PM, Greg KH wrote:

On Tue, Apr 28, 2020 at 01:24:02PM +0530, Charan Teja Reddy wrote:

The following race occurs while accessing the dmabuf object exported as
file:
P1                                P2
dma_buf_release()                 dmabuffs_dname()
                                  [say lsof reading /proc/<pid>/fd/<fd>]

                                  read dmabuf stored in dentry->fsdata
Free the dmabuf object
                                  Start accessing the dmabuf structure

In the above description, the dmabuf object freed in P1 is being
accessed from P2, which results in the use-after-free. Below is
the dump stack reported.

Call Trace:
  kasan_report+0x12/0x20
  __asan_report_load8_noabort+0x14/0x20
  dmabuffs_dname+0x4f4/0x560
  tomoyo_realpath_from_path+0x165/0x660
  tomoyo_get_realpath
  tomoyo_check_open_permission+0x2a3/0x3e0
  tomoyo_file_open
  tomoyo_file_open+0xa9/0xd0
  security_file_open+0x71/0x300
  do_dentry_open+0x37a/0x1380
  vfs_open+0xa0/0xd0
  path_openat+0x12ee/0x3490
  do_filp_open+0x192/0x260
  do_sys_openat2+0x5eb/0x7e0
  do_sys_open+0xf2/0x180

Fixes: bb2bb90 ("dma-buf: add DMA_BUF_SET_NAME ioctls")

Nit, please read the documentation for how to do a Fixes: line properly,
you need more digits:
Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")



Will update the patch



Reported-by:syzbot+3643a18836bce555b...@syzkaller.appspotmail.com
Signed-off-by: Charan Teja Reddy

Also, any reason you didn't include the other mailing lists that
get_maintainer.pl said to?



Really sorry for not sending to the complete list. Added now.



And finally, no cc: stable in the s-o-b area for an issue that needs to
be backported to older kernels?



Will update the patch.





---
  drivers/dma-buf/dma-buf.c | 25 +++--
  include/linux/dma-buf.h   |  1 +
  2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 570c923..069d8f78 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -25,6 +25,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -38,18 +39,34 @@ struct dma_buf_list {
  
  static struct dma_buf_list db_list;
  
+static void dmabuf_dent_put(struct dma_buf *dmabuf)

+{
+   if (atomic_dec_and_test(&dmabuf->dent_count)) {
+   kfree(dmabuf->name);
+   kfree(dmabuf);
+   }

Why not just use a kref instead of an open-coded atomic value?



Kref approach looks cleaner. will update the patch accordingly.



+}
+
  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
  {
struct dma_buf *dmabuf;
char name[DMA_BUF_NAME_LEN];
size_t ret = 0;
  
+	spin_lock(&dentry->d_lock);

dmabuf = dentry->d_fsdata;
+   if (!dmabuf || !atomic_add_unless(&dmabuf->dent_count, 1, 0)) {
+   spin_unlock(&dentry->d_lock);
+   goto out;

How can dmabuf not be valid here?

And isn't there already a usecount for the buffer?



A dmabuf exported as a file simply relies on that file's refcount; thus
fput() releases the dmabuf.

We are storing the dmabuf in dentry->d_fsdata, but there is no binding
between the dentry and the dmabuf. So, the flow will be like:

1) P1 calls fput(dmabuf_fd)

2) P2 tries to access the file information of P1.
    E.g.: the lsof command trying to list out the dmabuf_fd information using
/proc/<pid>/fd/dmabuf_fd

3) P1 calls file->f_op->release(dmabuf_fd_file) (ends up calling
dma_buf_release()), thus freeing up the dmabuf buffer.

4) P2 accesses the dmabuf stored in dentry->d_fsdata, which was freed
in step 3.

So we need to have some refcount mechanism to avoid the use-after-free
in step 4.



+   }
+   spin_unlock(&dentry->d_lock);
dma_resv_lock(dmabuf->resv, NULL);
if (dmabuf->name)
ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
dma_resv_unlock(dmabuf->resv);
+   dmabuf_dent_put(dmabuf);
  
+out:

return dynamic_dname(dentry, buffer, buflen, "/%s:%s",
 dentry->d_name.name, ret > 0 ? name : "");
  }
@@ -80,12 +97,16 @@ static int dma_buf_fs_init_context(struct fs_context *fc)
  static int dma_buf_release(struct inode *inode, struct file *file)
  {
struct dma_buf *dmabuf;
+   struct dentry *dentry = file->f_path.dentry;
  
  	if (!is_dma_buf_file(file))

return -EINVAL;
  
  	dmabuf = file->private_data;
  
+	spin_lock(&dentry->d_lock);

+   dentry->d_fsdata = NULL;
+   spin_unlock(&dentry->d_lock);
BUG_ON(dmabuf->vmapping_counter);
  
  	/*

@@ -108,8 +129,7 @@ static int dma_buf_release(struct inode *inode, struct file 
*file)
dma_resv_fini(dmabuf->resv);
  
  	module_put(dmabuf->owner);

-   kfree(dmabuf->name);
-   kfree(dmabuf);
+   dmabuf_dent_put(dmabuf);
return 0;
  }
  
@@ -548,6 +568,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)