Re: your mail

2021-04-07 Thread Christian König
Thanks Ray for pointing this out. Looks like the mail ended up in my 
spam folder otherwise.


Apart from that this patch is a really really big NAK. I can't count how 
often I had to reject stuff like this!


Using the page reference for TTM pages is illegal and can lead to struct 
page corruption.


Can you please describe why you need that?

Regards,
Christian.

Am 07.04.21 um 10:25 schrieb Huang Rui:

On Wed, Apr 07, 2021 at 09:27:46AM +0800, songqiang wrote:

Please add the description in the commit message and subject.

Thanks,
Ray


Signed-off-by: songqiang 
---
  drivers/gpu/drm/ttm/ttm_page_alloc.c | 18 ++
  1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c 
b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index 14660f723f71..f3698f0ad4d7 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -736,8 +736,16 @@ static void ttm_put_pages(struct page **pages, unsigned 
npages, int flags,
if (++p != pages[i + j])
break;
  
-if (j == HPAGE_PMD_NR)

+   if (j == HPAGE_PMD_NR) {
order = HPAGE_PMD_ORDER;
+   for (j = 1; j < HPAGE_PMD_NR; ++j)
+   page_ref_dec(pages[i+j]);
+   }
}
  #endif
  
@@ -868,10 +876,12 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,

p = alloc_pages(huge_flags, HPAGE_PMD_ORDER);
if (!p)
break;
-
-   for (j = 0; j < HPAGE_PMD_NR; ++j)
+   for (j = 0; j < HPAGE_PMD_NR; ++j) {
pages[i++] = p++;
-
+   if (j > 0)
+   page_ref_inc(pages[i-1]);
+   }
npages -= HPAGE_PMD_NR;
}
}



___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-develdata=04%7C01%7Cray.huang%40amd.com%7C4ccc617b77d746db5af108d8f98db612%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533734805563118%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=9bSP90LYdJyJYJYmuphVmqk%2B3%2FE4JPrtXkQTbxwAt68%3Dreserved=0




Re: your mail

2021-04-07 Thread Huang Rui
On Wed, Apr 07, 2021 at 09:27:46AM +0800, songqiang wrote:

Please add the description in the commit message and subject.

Thanks,
Ray

> Signed-off-by: songqiang 
> ---
>  drivers/gpu/drm/ttm/ttm_page_alloc.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c 
> b/drivers/gpu/drm/ttm/ttm_page_alloc.c
> index 14660f723f71..f3698f0ad4d7 100644
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
> @@ -736,8 +736,16 @@ static void ttm_put_pages(struct page **pages, unsigned 
> npages, int flags,
>   if (++p != pages[i + j])
>   break;
>  
> - if (j == HPAGE_PMD_NR)
> + if (j == HPAGE_PMD_NR) {
>   order = HPAGE_PMD_ORDER;
> + for (j = 1; j < HPAGE_PMD_NR; ++j)
> + page_ref_dec(pages[i+j]);
> + }
>   }
>  #endif
>  
> @@ -868,10 +876,12 @@ static int ttm_get_pages(struct page **pages, unsigned 
> npages, int flags,
>   p = alloc_pages(huge_flags, HPAGE_PMD_ORDER);
>   if (!p)
>   break;
> -
> - for (j = 0; j < HPAGE_PMD_NR; ++j)
> + for (j = 0; j < HPAGE_PMD_NR; ++j) {
>   pages[i++] = p++;
> -
> + if (j > 0)
> + page_ref_inc(pages[i-1]);
> + }
>   npages -= HPAGE_PMD_NR;
>   }
>   }
> 
> 
> 
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-develdata=04%7C01%7Cray.huang%40amd.com%7C4ccc617b77d746db5af108d8f98db612%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533734805563118%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=9bSP90LYdJyJYJYmuphVmqk%2B3%2FE4JPrtXkQTbxwAt68%3Dreserved=0


Re: your mail

2021-04-07 Thread Manivannan Sadhasivam
On Thu, Apr 01, 2021 at 02:16:09PM -0700, Bhaumik Bhatt wrote:
> Subject: [PATCH v8 0/9] Updates to MHI channel handling
> 

Subject is present in the body ;)

> MHI specification shows a state machine with support for STOP channel command
> and the validity of certain state transitions. MHI host currently does not
> provide any mechanism to stop a channel and restart it without resetting it.
> There are also times when the device moves on to a different execution
> environment while client drivers on the host are unaware of it and still
> attempt to reset the channels facing unnecessary timeouts.
> 
> This series addresses the above areas to provide support for stopping an MHI
> channel, resuming it back, improved documentation and improving upon channel
> state machine handling in general.
> 
> This set of patches was tested on arm64 and x86_64 architecture.
> 

Series applied to mhi-next!

Thanks,
Mani

> v8:
> -Split the state machine improvements patch to three patches as per review
> 
> v7:
> -Tested on x86_64 architecture
> -Drop the patch "Do not clear channel context more than once" as issue is 
> fixed
> differently using "bus: mhi: core: Fix double dma free()"
> -Update the commit text to better reflect changes on state machine 
> improvements
> 
> v6:
> -Dropped the patch which introduced start/stop transfer APIs for lack of users
> -Updated error handling and debug prints on channel handling improvements 
> patch
> -Improved commit text to better explain certain patches based on review 
> comments
> -Removed references to new APIs from the documentation improvement patch
> 
> v5:
> -Added reviewed-by tags from Hemant I missed earlier
> -Added patch to prevent kernel warnings on clearing channel context twice
> 
> v4:
> -Updated commit text/descriptions and addressed checkpatch checks
> -Added context validity check before starting/stopping channels from new API
> -Added patch to clear channel context configuration after reset/unprepare
> 
> v3:
> -Updated documentation for channel transfer APIs to highlight differences
> -Create separate patch for "allowing channel to be disabled from stopped 
> state"
> 
> v2:
> -Renamed the newly introduced APIs to mhi_start_transfer() / 
> mhi_stop_transfer()
> -Added improved documentation to avoid confusion with the new APIs
> -Removed the __ prefix from mhi_unprepare_channel() API for consistency.
> 
> Bhaumik Bhatt (9):
>   bus: mhi: core: Allow sending the STOP channel command
>   bus: mhi: core: Clear context for stopped channels from remove()
>   bus: mhi: core: Improvements to the channel handling state machine
>   bus: mhi: core: Update debug messages to use client device
>   bus: mhi: core: Hold device wake for channel update commands
>   bus: mhi: core: Clear configuration from channel context during reset
>   bus: mhi: core: Check channel execution environment before issuing
> reset
>   bus: mhi: core: Remove __ prefix for MHI channel unprepare function
>   bus: mhi: Improve documentation on channel transfer setup APIs
> 
>  drivers/bus/mhi/core/init.c |  22 -
>  drivers/bus/mhi/core/internal.h |  12 +++
>  drivers/bus/mhi/core/main.c | 190 
> 
>  include/linux/mhi.h |  18 +++-
>  4 files changed, 162 insertions(+), 80 deletions(-)
> 
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
> 


Re: your mail

2021-03-22 Thread Steven Rostedt
On Mon, Mar 22, 2021 at 05:36:44PM -0400, Steven Rostedt wrote:

$@#@#$%%%

Bah! There was another typo in the email list!

Take 3

-- Steve



Re: your mail

2021-03-22 Thread Steven Rostedt
On Mon, Mar 22, 2021 at 05:21:56PM -0400, Steven Rostedt wrote:

Bah! John 'Warthog' Hawley email had those single quotes in it that I cut and
pasted into the Cc list, causing the quilt mail parsing to fail, but as LKML
was in the "To" part, it still sent!

Take 2

-- Steve



Re: your mail

2020-12-02 Thread Andy Shevchenko
On Wed, Dec 02, 2020 at 08:51:27PM +0200, Andy Shevchenko wrote:
> On Thu, Dec 03, 2020 at 03:27:33AM +0900, Yun Levi wrote:
> > On Thu, Dec 3, 2020 at 2:36 AM Andy Shevchenko
> >  wrote:
> > > On Wed, Dec 02, 2020 at 09:26:05AM -0800, Yury Norov wrote:

...

> > > Side note: speaking of performance, any plans to fix for_each_*_bit*() for
> > > cases when the nbits is known to be <= BITS_PER_LONG?
> > >
> > > Now it makes an awful code generation (something like few hundred bytes of
> > > code).
> 
> > Frankly Speaking, I don't have an idea in now.
> > Could you share your idea or wisdom?
> 
> Something like (I may be mistaken by names, etc, I'm not a compiler expert,
> and this is in pseudo language, I don't remember all API names by hart,
> just to express the idea) as a rough first step
> 
> __builtin_constant(nbits, find_next_set_bit_long, find_next_set_bit)
> 
> find_next_set_bit_long()
> {
>   unsigned long v = BIT_LAST_WORD(i);
>   return ffs_long(v);
> }
> 
> Same for find_first_set_bit() -> map it to ffs_long().
> 
> And I believe it can be optimized more.

Btw it will also require to reconsider test cases where such constant small
nbits values are passed (forcing compiler to avoid optimization somehow, one
way is to try random nbits for some test cases).

-- 
With Best Regards,
Andy Shevchenko




Re: your mail

2020-12-02 Thread Andy Shevchenko
On Thu, Dec 03, 2020 at 03:27:33AM +0900, Yun Levi wrote:
> On Thu, Dec 3, 2020 at 2:36 AM Andy Shevchenko
>  wrote:
> > On Wed, Dec 02, 2020 at 09:26:05AM -0800, Yury Norov wrote:

...

> > Side note: speaking of performance, any plans to fix for_each_*_bit*() for
> > cases when the nbits is known to be <= BITS_PER_LONG?
> >
> > Now it makes an awful code generation (something like few hundred bytes of
> > code).

> Frankly Speaking, I don't have an idea in now.
> Could you share your idea or wisdom?

Something like (I may be mistaken by names, etc, I'm not a compiler expert,
and this is in pseudo language, I don't remember all API names by hart,
just to express the idea) as a rough first step

__builtin_constant(nbits, find_next_set_bit_long, find_next_set_bit)

find_next_set_bit_long()
{
unsigned long v = BIT_LAST_WORD(i);
return ffs_long(v);
}

Same for find_first_set_bit() -> map it to ffs_long().

And I believe it can be optimized more.

-- 
With Best Regards,
Andy Shevchenko




Re: your mail

2020-08-30 Thread Bjorn Andersson
On Mon 17 Aug 17:12 UTC 2020, Amit Pundir wrote:

> On Thu, 13 Aug 2020 at 12:38, Bjorn Andersson
>  wrote:
> >
> > On Thu 06 Aug 15:31 PDT 2020, Konrad Dybcio wrote:
> >
> > > Subject: Re: [PATCH v4] arm64: dts: qcom: Add support for Xiaomi Poco F1 
> > > (Beryllium)
> > >
> > > >// This removed_region is needed to boot the device
> > > >   // TODO: Find out the user of this reserved memory
> > > >   removed_region: memory@88f0 {
> > >
> > > This region seems to belong to the Trust Zone. When Linux tries to access 
> > > it, TZ bites and shuts the device down.
> > >
> >
> > This is in line with what the documentation indicates and then it would
> > be better to just bump _mem to a size of 0x490.
> 
> Hi, so just to be sure that I got this right, you want me to extend
> _mem to the size of 0x490 from the default size of 0x2D0 by
> including this downstream _region (of size 0x1A0) +
> previously unreserved downstream memory region (of size 0x20), to
> align with the starting address of _mem?
> 

Yes

Regards,
Bjorn

> I just gave this _mem change a spin and I do not see any obvious
> regression in my limited smoke testing (Boots AOSP to UI with
> v5.9-rc1. Touch/BT/WiFi works) so far, with 20+ out-of-tree patches.
> 
> Regards,
> Amit Pundir
> 
> >
> > Regards,
> > Bjorn


Re: your mail

2020-08-17 Thread Amit Pundir
On Thu, 13 Aug 2020 at 12:38, Bjorn Andersson
 wrote:
>
> On Thu 06 Aug 15:31 PDT 2020, Konrad Dybcio wrote:
>
> > Subject: Re: [PATCH v4] arm64: dts: qcom: Add support for Xiaomi Poco F1 
> > (Beryllium)
> >
> > >// This removed_region is needed to boot the device
> > >   // TODO: Find out the user of this reserved memory
> > >   removed_region: memory@88f0 {
> >
> > This region seems to belong to the Trust Zone. When Linux tries to access 
> > it, TZ bites and shuts the device down.
> >
>
> This is in line with what the documentation indicates and then it would
> be better to just bump _mem to a size of 0x490.

Hi, so just to be sure that I got this right, you want me to extend
_mem to the size of 0x490 from the default size of 0x2D0 by
including this downstream _region (of size 0x1A0) +
previously unreserved downstream memory region (of size 0x20), to
align with the starting address of _mem?

I just gave this _mem change a spin and I do not see any obvious
regression in my limited smoke testing (Boots AOSP to UI with
v5.9-rc1. Touch/BT/WiFi works) so far, with 20+ out-of-tree patches.

Regards,
Amit Pundir

>
> Regards,
> Bjorn


Re: your mail

2020-08-13 Thread Bjorn Andersson
On Thu 06 Aug 15:31 PDT 2020, Konrad Dybcio wrote:

> Subject: Re: [PATCH v4] arm64: dts: qcom: Add support for Xiaomi Poco F1 
> (Beryllium)
> 
> >// This removed_region is needed to boot the device
> >   // TODO: Find out the user of this reserved memory
> >   removed_region: memory@88f0 {
> 
> This region seems to belong to the Trust Zone. When Linux tries to access it, 
> TZ bites and shuts the device down.
> 

This is in line with what the documentation indicates and then it would
be better to just bump _mem to a size of 0x490.

Regards,
Bjorn


Re: your mail

2020-06-09 Thread Greg KH
On Tue, Jun 09, 2020 at 07:38:38AM -0400, Gaurav Singh wrote:
> Please find the patch below.
> 
> Thanks and regards,
> Gaurav.
> 
> >From Gaurav Singh  # This line is ignored.
> From: Gaurav Singh 
> Reply-To: 
> Subject: 
> In-Reply-To: 
> 
> 

I think something went wrong in your submission, just use 'git
send-email' to send the patch out.

thanks,

greg k-h


Re: your mail

2020-05-07 Thread Thomas Bogendoerfer
On Wed, May 06, 2020 at 01:52:45PM +0800, Jiaxun Yang wrote:
> Subject: [PATCH v6] MIPS: Truncate link address into 32bit for 32bit kernel
> In-Reply-To: <20200413062651.3992652-1-jiaxun.y...@flygoat.com>
> 
> LLD failed to link vmlinux with 64bit load address for 32bit ELF
> while bfd will strip 64bit address into 32bit silently.
> To fix LLD build, we should truncate load address provided by platform
> into 32bit for 32bit kernel.
> 
> Signed-off-by: Jiaxun Yang 
> Link: https://github.com/ClangBuiltLinux/linux/issues/786
> Link: https://sourceware.org/bugzilla/show_bug.cgi?id=25784
> Reviewed-by: Fangrui Song 
> Reviewed-by: Kees Cook 
> Tested-by: Nathan Chancellor 
> Cc: Maciej W. Rozycki 
> ---
> V2: Take MaskRay's shell magic.
> 
> V3: After spent an hour on dealing with special character issue in
> Makefile, I gave up to do shell hacks and write a util in C instead.
> Thanks Maciej for pointing out Makefile variable problem.
> 
> v4: Finally we managed to find a Makefile method to do it properly
> thanks to Kees. As it's too far from the initial version, I removed
> Review & Test tag from Nick and Fangrui and Cc instead.
> 
> v5: Care vmlinuz as well.
> 
> v6: Rename to LIKER_LOAD_ADDRESS 
> ---
>  arch/mips/Makefile | 13 -
>  arch/mips/boot/compressed/Makefile |  2 +-
>  arch/mips/kernel/vmlinux.lds.S |  2 +-
>  3 files changed, 14 insertions(+), 3 deletions(-)

applied to mips-next.

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.[ RFC1925, 2.3 ]


Re: your mail

2019-06-27 Thread Tejun Heo
On Wed, Jun 26, 2019 at 04:52:36PM +0200, Sebastian Andrzej Siewior wrote:
> A small series of tiny cleanups.

Applied 1-2 to wq/for-5.3.

Thanks.

-- 
tejun


Re: your mail

2019-04-11 Thread Nicholas Piggin
Peter Zijlstra's on April 11, 2019 8:53 pm:
> Was this supposed to be patch 6/5 of your previous series?

Dang, I screwed up the headers? Thanks for the ping, I will resend.

It is standalone. It seems more suited to the scheduler tree than the
timers one, but your call.

It is generally of more use when CPU0 is _not_ a housekeeping one,
and that's where I've done most testing, but I don't see any hard
dependency.

Thanks,
Nick

> 
> On Thu, Apr 11, 2019 at 04:05:36PM +1000, Nicholas Piggin wrote:
>> Date: Tue, 9 Apr 2019 20:23:16 +1000
>> Subject: [PATCH] kernel/sched: run nohz idle load balancer on HK_FLAG_MISC
>>  CPUs
>> 
>> The nohz idle balancer runs on the lowest idle CPU. This can
>> interfere with isolated CPUs, so confine it to HK_FLAG_MISC
>> housekeeping CPUs.
>> 
>> HK_FLAG_SCHED is not used for this because it is not set anywhere
>> at the moment. This could be folded into HK_FLAG_SCHED once that
>> option is fixed.
>> 
>> The problem was observed with increased jitter on an application
>> running on CPU0, caused by nohz idle load balancing being run on
>> CPU1 (an SMT sibling).
>> 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  kernel/sched/fair.c | 16 ++--
>>  1 file changed, 10 insertions(+), 6 deletions(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fdab7eb6f351..d29ca323214d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9522,22 +9522,26 @@ static inline int on_null_domain(struct rq *rq)
>>   * - When one of the busy CPUs notice that there may be an idle rebalancing
>>   *   needed, they will kick the idle load balancer, which then does idle
>>   *   load balancing for all the idle CPUs.
>> + * - HK_FLAG_MISC CPUs are used for this task, because HK_FLAG_SCHED not set
>> + *   anywhere yet.
>>   */
>>  
>>  static inline int find_new_ilb(void)
>>  {
>> -int ilb = cpumask_first(nohz.idle_cpus_mask);
>> +int ilb;
>>  
>> -if (ilb < nr_cpu_ids && idle_cpu(ilb))
>> -return ilb;
>> +for_each_cpu_and(ilb, nohz.idle_cpus_mask,
>> +  housekeeping_cpumask(HK_FLAG_MISC)) {
>> +if (idle_cpu(ilb))
>> +return ilb;
>> +}
>>  
>>  return nr_cpu_ids;
>>  }
>>  
>>  /*
>> - * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
>> - * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
>> - * CPU (if there is one).
>> + * Kick a CPU to do the nohz balancing, if it is time for it. We pick any
>> + * idle CPU in the HK_FLAG_MISC housekeeping set (if there is one).
>>   */
>>  static void kick_ilb(unsigned int flags)
>>  {
>> -- 
>> 2.20.1
>> 
> 


Re: your mail

2019-04-11 Thread Peter Zijlstra
Was this supposed to be patch 6/5 of your previous series?

On Thu, Apr 11, 2019 at 04:05:36PM +1000, Nicholas Piggin wrote:
> Date: Tue, 9 Apr 2019 20:23:16 +1000
> Subject: [PATCH] kernel/sched: run nohz idle load balancer on HK_FLAG_MISC
>  CPUs
> 
> The nohz idle balancer runs on the lowest idle CPU. This can
> interfere with isolated CPUs, so confine it to HK_FLAG_MISC
> housekeeping CPUs.
> 
> HK_FLAG_SCHED is not used for this because it is not set anywhere
> at the moment. This could be folded into HK_FLAG_SCHED once that
> option is fixed.
> 
> The problem was observed with increased jitter on an application
> running on CPU0, caused by nohz idle load balancing being run on
> CPU1 (an SMT sibling).
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  kernel/sched/fair.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdab7eb6f351..d29ca323214d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9522,22 +9522,26 @@ static inline int on_null_domain(struct rq *rq)
>   * - When one of the busy CPUs notice that there may be an idle rebalancing
>   *   needed, they will kick the idle load balancer, which then does idle
>   *   load balancing for all the idle CPUs.
> + * - HK_FLAG_MISC CPUs are used for this task, because HK_FLAG_SCHED not set
> + *   anywhere yet.
>   */
>  
>  static inline int find_new_ilb(void)
>  {
> - int ilb = cpumask_first(nohz.idle_cpus_mask);
> + int ilb;
>  
> - if (ilb < nr_cpu_ids && idle_cpu(ilb))
> - return ilb;
> + for_each_cpu_and(ilb, nohz.idle_cpus_mask,
> +   housekeeping_cpumask(HK_FLAG_MISC)) {
> + if (idle_cpu(ilb))
> + return ilb;
> + }
>  
>   return nr_cpu_ids;
>  }
>  
>  /*
> - * Kick a CPU to do the nohz balancing, if it is time for it. We pick the
> - * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
> - * CPU (if there is one).
> + * Kick a CPU to do the nohz balancing, if it is time for it. We pick any
> + * idle CPU in the HK_FLAG_MISC housekeeping set (if there is one).
>   */
>  static void kick_ilb(unsigned int flags)
>  {
> -- 
> 2.20.1
> 


Re: your mail

2019-04-08 Thread Maxim Levitsky
On Tue, 2019-03-19 at 09:22 -0600, Keith Busch wrote:
> On Tue, Mar 19, 2019 at 04:41:07PM +0200, Maxim Levitsky wrote:
> >   -> Share the NVMe device between host and guest. 
> >  Even in fully virtualized configurations,
> >  some partitions of nvme device could be used by guests as block
> > devices 
> >  while others passed through with nvme-mdev to achieve balance between
> >  all features of full IO stack emulation and performance.
> >   
> >   -> NVME-MDEV is a bit faster due to the fact that in-kernel driver 
> >  can send interrupts to the guest directly without a context 
> >  switch that can be expensive due to meltdown mitigation.
> > 
> >   -> Is able to utilize interrupts to get reasonable performance. 
> >  This is only implemented
> >  as a proof of concept and not included in the patches, 
> >  but interrupt driven mode shows reasonable performance
> >  
> >   -> This is a framework that later can be used to support NVMe devices 
> >  with more of the IO virtualization built-in 
> >  (IOMMU with PASID support coupled with device that supports it)
> 
> Would be very interested to see the PASID support. You wouldn't even
> need to mediate the IO doorbells or translations if assigning entire
> namespaces, and should be much faster than the shadow doorbells.
> 
> I think you should send 6/9 "nvme/pci: init shadow doorbell after each
> reset" separately for immediate inclusion.
> 
> I like the idea in principle, but it will take me a little time to get
> through reviewing your implementation. I would have guessed we could
> have leveraged something from the existing nvme/target for the mediating
> controller register access and admin commands. Maybe even start with
> implementing an nvme passthrough namespace target type (we currently
> have block and file).


Hi!

Sorry to bother you, but any update?

I was somewhat sick for the last week, now finally back in shape to continue
working on this and other tasks I have.

I am studing now the nvme target code and the io_uring to evaluate the
difficultiy of using something similiar to talk to the block device instead of /
in addtion to the  direct connection I implemented.

I would be glad to hear more feedback on this project.

I will also soon post the few fixes separately as you suggested.

Best regards,
Maxim Levitskky






Re: your mail

2019-03-26 Thread Dan Carpenter
On Sat, Mar 23, 2019 at 01:17:38PM -0400, William J. Cunningham wrote:
> >From bb04b0ca982b7042902fffe1377e0e38e83b402b Mon Sep 17 00:00:00 2001
> From: Will Cunningham 
> Date: Sat, 23 Mar 2019 12:54:34 -0400
> Subject: [PATCH] Staging: emxx_udc: emxx_udc: Fixed a coding style error
> 
> Removed unnecessary parentheses.
> 
> Signed-off-by: Will Cunningham 

Please fix up the headers and resend.

regards,
dan carpenter



Re: your mail

2019-03-25 Thread Stefan Hajnoczi
On Thu, Mar 21, 2019 at 07:07:38PM +0200, Maxim Levitsky wrote:
> On Thu, 2019-03-21 at 16:13 +, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2019 at 04:41:07PM +0200, Maxim Levitsky wrote:
> > > Date: Tue, 19 Mar 2019 14:45:45 +0200
> > > Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> > > 
> > > Hi everyone!
> > > 
> > > In this patch series, I would like to introduce my take on the problem of
> > > doing 
> > > as fast as possible virtualization of storage with emphasis on low 
> > > latency.
> > > 
> > > In this patch series I implemented a kernel vfio based, mediated device
> > > that 
> > > allows the user to pass through a partition and/or whole namespace to a
> > > guest.
> > > 
> > > The idea behind this driver is based on paper you can find at
> > > https://www.usenix.org/conference/atc18/presentation/peng,
> > > 
> > > Although note that I stared the development prior to reading this paper, 
> > > independently.
> > > 
> > > In addition to that implementation is not based on code used in the paper
> > > as 
> > > I wasn't being able at that time to make the source available to me.
> > > 
> > > ***Key points about the implementation:***
> > > 
> > > * Polling kernel thread is used. The polling is stopped after a 
> > > predefined timeout (1/2 sec by default).
> > > Support for all interrupt driven mode is planned, and it shows promising
> > > results.
> > > 
> > > * Guest sees a standard NVME device - this allows to run guest with 
> > > unmodified drivers, for example windows guests.
> > > 
> > > * The NVMe device is shared between host and guest.
> > > That means that even a single namespace can be split between host 
> > > and guest based on different partitions.
> > > 
> > > * Simple configuration
> > > 
> > > *** Performance ***
> > > 
> > > Performance was tested on Intel DC P3700, With Xeon E5-2620 v2 
> > > and both latency and throughput is very similar to SPDK.
> > > 
> > > Soon I will test this on a better server and nvme device and provide
> > > more formal performance numbers.
> > > 
> > > Latency numbers:
> > > ~80ms - spdk with fio plugin on the host.
> > > ~84ms - nvme driver on the host
> > > ~87ms - mdev-nvme + nvme driver in the guest
> > 
> > You mentioned the spdk numbers are with vhost-user-nvme.  Have you
> > measured SPDK's vhost-user-blk?
> 
> I had lot of measuments of vhost-user-blk vs vhost-user-nvme.
> vhost-user-nvme was always a bit faster but only a bit.
> Thus I don't think it makes sense to benchamrk against vhost-user-blk.

It's interesting because mdev-nvme is closest to the hardware while
vhost-user-blk is closest to software.  Doing things at the NVMe level
isn't buying much performance because it's still going through a
software path comparable to vhost-user-blk.

From what you say it sounds like there isn't much to optimize away :(.

Stefan


signature.asc
Description: PGP signature


Re: your mail

2019-03-21 Thread Maxim Levitsky
On Thu, 2019-03-21 at 16:13 +, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2019 at 04:41:07PM +0200, Maxim Levitsky wrote:
> > Date: Tue, 19 Mar 2019 14:45:45 +0200
> > Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> > 
> > Hi everyone!
> > 
> > In this patch series, I would like to introduce my take on the problem of
> > doing 
> > as fast as possible virtualization of storage with emphasis on low latency.
> > 
> > In this patch series I implemented a kernel vfio based, mediated device
> > that 
> > allows the user to pass through a partition and/or whole namespace to a
> > guest.
> > 
> > The idea behind this driver is based on paper you can find at
> > https://www.usenix.org/conference/atc18/presentation/peng,
> > 
> > Although note that I stared the development prior to reading this paper, 
> > independently.
> > 
> > In addition to that implementation is not based on code used in the paper
> > as 
> > I wasn't being able at that time to make the source available to me.
> > 
> > ***Key points about the implementation:***
> > 
> > * Polling kernel thread is used. The polling is stopped after a 
> > predefined timeout (1/2 sec by default).
> > Support for all interrupt driven mode is planned, and it shows promising
> > results.
> > 
> > * Guest sees a standard NVME device - this allows to run guest with 
> > unmodified drivers, for example windows guests.
> > 
> > * The NVMe device is shared between host and guest.
> > That means that even a single namespace can be split between host 
> > and guest based on different partitions.
> > 
> > * Simple configuration
> > 
> > *** Performance ***
> > 
> > Performance was tested on Intel DC P3700, With Xeon E5-2620 v2 
> > and both latency and throughput is very similar to SPDK.
> > 
> > Soon I will test this on a better server and nvme device and provide
> > more formal performance numbers.
> > 
> > Latency numbers:
> > ~80ms - spdk with fio plugin on the host.
> > ~84ms - nvme driver on the host
> > ~87ms - mdev-nvme + nvme driver in the guest
> 
> You mentioned the spdk numbers are with vhost-user-nvme.  Have you
> measured SPDK's vhost-user-blk?

I had lot of measuments of vhost-user-blk vs vhost-user-nvme.
vhost-user-nvme was always a bit faster but only a bit.
Thus I don't think it makes sense to benchamrk against vhost-user-blk.

Best regards,
Maxim Levitsky





Re: your mail

2019-03-21 Thread Stefan Hajnoczi
On Tue, Mar 19, 2019 at 04:41:07PM +0200, Maxim Levitsky wrote:
> Date: Tue, 19 Mar 2019 14:45:45 +0200
> Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> 
> Hi everyone!
> 
> In this patch series, I would like to introduce my take on the problem of 
> doing 
> as fast as possible virtualization of storage with emphasis on low latency.
> 
> In this patch series I implemented a kernel vfio based, mediated device that 
> allows the user to pass through a partition and/or whole namespace to a guest.
> 
> The idea behind this driver is based on paper you can find at
> https://www.usenix.org/conference/atc18/presentation/peng,
> 
> Although note that I stared the development prior to reading this paper, 
> independently.
> 
> In addition to that implementation is not based on code used in the paper as 
> I wasn't being able at that time to make the source available to me.
> 
> ***Key points about the implementation:***
> 
> * Polling kernel thread is used. The polling is stopped after a 
> predefined timeout (1/2 sec by default).
> Support for all interrupt driven mode is planned, and it shows promising 
> results.
> 
> * Guest sees a standard NVME device - this allows to run guest with 
> unmodified drivers, for example windows guests.
> 
> * The NVMe device is shared between host and guest.
> That means that even a single namespace can be split between host 
> and guest based on different partitions.
> 
> * Simple configuration
> 
> *** Performance ***
> 
> Performance was tested on Intel DC P3700, With Xeon E5-2620 v2 
> and both latency and throughput is very similar to SPDK.
> 
> Soon I will test this on a better server and nvme device and provide
> more formal performance numbers.
> 
> Latency numbers:
> ~80ms - spdk with fio plugin on the host.
> ~84ms - nvme driver on the host
> ~87ms - mdev-nvme + nvme driver in the guest

You mentioned the spdk numbers are with vhost-user-nvme.  Have you
measured SPDK's vhost-user-blk?

Stefan


signature.asc
Description: PGP signature


Re: your mail

2019-03-20 Thread Maxim Levitsky
On Wed, 2019-03-20 at 11:03 -0600, Keith Busch wrote:
> On Wed, Mar 20, 2019 at 06:30:29PM +0200, Maxim Levitsky wrote:
> > Or instead I can use the block backend, 
> > (but note that currently the block back-end doesn't support polling which is
> > critical for the performance).
> 
> Oh, I think you can do polling through there. For reference, fs/io_uring.c
> has a pretty good implementation that aligns with how you could use it.


That is exactly my thought. The polling recently got lot of improvements in the
block layer, which migh make this feasable.

I will give it a try.

Best regards,
Maxim Levitsky



Re: your mail

2019-03-20 Thread Keith Busch
On Wed, Mar 20, 2019 at 06:30:29PM +0200, Maxim Levitsky wrote:
> Or instead I can use the block backend, 
> (but note that currently the block back-end doesn't support polling which is
> critical for the performance).

Oh, I think you can do polling through there. For reference, fs/io_uring.c
has a pretty good implementation that aligns with how you could use it.


Re: your mail

2019-03-20 Thread Maxim Levitsky
On Tue, 2019-03-19 at 23:49 +, Chaitanya Kulkarni wrote:
> Hi Keith,
> On 03/19/2019 08:21 AM, Keith Busch wrote:
> > On Tue, Mar 19, 2019 at 04:41:07PM +0200, Maxim Levitsky wrote:
> > >-> Share the NVMe device between host and guest.
> > >   Even in fully virtualized configurations,
> > >   some partitions of nvme device could be used by guests as block
> > > devices
> > >   while others passed through with nvme-mdev to achieve balance
> > > between
> > >   all features of full IO stack emulation and performance.
> > > 
> > >-> NVME-MDEV is a bit faster due to the fact that in-kernel driver
> > >   can send interrupts to the guest directly without a context
> > >   switch that can be expensive due to meltdown mitigation.
> > > 
> > >-> Is able to utilize interrupts to get reasonable performance.
> > >   This is only implemented
> > >   as a proof of concept and not included in the patches,
> > >   but interrupt driven mode shows reasonable performance
> > > 
> > >-> This is a framework that later can be used to support NVMe devices
> > >   with more of the IO virtualization built-in
> > >   (IOMMU with PASID support coupled with device that supports it)
> > 
> > Would be very interested to see the PASID support. You wouldn't even
> > need to mediate the IO doorbells or translations if assigning entire
> > namespaces, and should be much faster than the shadow doorbells.
> > 
> > I think you should send 6/9 "nvme/pci: init shadow doorbell after each
> > reset" separately for immediate inclusion.
> > 
> > I like the idea in principle, but it will take me a little time to get
> > through reviewing your implementation. I would have guessed we could
> > have leveraged something from the existing nvme/target for the mediating
> > controller register access and admin commands. Maybe even start with
> > implementing an nvme passthrough namespace target type (we currently
> > have block and file).
> 
> I have the code for the NVMeOf target passthru-ctrl, I think we can use 
> that as it is if you are looking for the passthru for NVMeOF.
> 
> I'll post patch-series based on the latest code base soon.

I am very intersing in this code. 
Could you explain how your NVMeOF target passthrough works? 
Which components of the NVME stack does it involve?

Best regards,
Maxim Levitsky

> > 
> > ___
> > Linux-nvme mailing list
> > linux-n...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-nvme
> > 
> 
> 
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme




Re: your mail

2019-03-20 Thread Maxim Levitsky
On Tue, 2019-03-19 at 09:22 -0600, Keith Busch wrote:
> On Tue, Mar 19, 2019 at 04:41:07PM +0200, Maxim Levitsky wrote:
> >   -> Share the NVMe device between host and guest. 
> >  Even in fully virtualized configurations,
> >  some partitions of nvme device could be used by guests as block
> > devices 
> >  while others passed through with nvme-mdev to achieve balance between
> >  all features of full IO stack emulation and performance.
> >   
> >   -> NVME-MDEV is a bit faster due to the fact that in-kernel driver 
> >  can send interrupts to the guest directly without a context 
> >  switch that can be expensive due to meltdown mitigation.
> > 
> >   -> Is able to utilize interrupts to get reasonable performance. 
> >  This is only implemented
> >  as a proof of concept and not included in the patches, 
> >  but interrupt driven mode shows reasonable performance
> >  
> >   -> This is a framework that later can be used to support NVMe devices 
> >  with more of the IO virtualization built-in 
> >  (IOMMU with PASID support coupled with device that supports it)
> 

> Would be very interested to see the PASID support. You wouldn't even
> need to mediate the IO doorbells or translations if assigning entire
> namespaces, and should be much faster than the shadow doorbells.

I fully agree with that.
Note that to enable PASID support two things have to happen in this vendor.

1. Mature support for IOMMU with PASID support. On Intel side I know that they
only have a spec released and currently the kernel bits to support it are
placed.
I still don't know when a product actually supporting this spec is going to be
released. For other vendors (ARM/AMD/) I haven't done yet a research on state of
PASID based IOMMU support on their platforms.

2. NVMe spec has to be extended to support PASID. At minimum, we need an ability
to assign an PASID to a sq/cq queue pair and ability to relocate the doorbells,
such as each guest would get its own (hardware backed) MMIO page with its own
doorbells. Plus of course the hardware vendors have to embrace the spec. I guess
these two things will happen in collaborative manner.


> 
> I think you should send 6/9 "nvme/pci: init shadow doorbell after each
> reset" separately for immediate inclusion.
I'll do this soon. 

Also '5/9 nvme/pci: add known admin effects to augment admin effects log page'
can be considered for immediate inclusion as well, as it works around a flaw
in the NVMe controller badly done admin side effects page with no side effects
(pun intended) for spec compliant controllers (I think so). 

This can be fixed with a quirk if you prefer though.

> 
> I like the idea in principle, but it will take me a little time to get
> through reviewing your implementation. I would have guessed we could
> have leveraged something from the existing nvme/target for the mediating
> controller register access and admin commands. Maybe even start with
> implementing an nvme passthrough namespace target type (we currently
> have block and file).

I fully agree with you on that I could have used some of the nvme/target code,
and I am planning to do so eventually.

For that I would need to make my driver, to be one of the target drivers, and I
would need to add another target back end, like you said to allow my target
driver to talk directly to the nvme hardware bypassing the block layer.

Or instead I can use the block backend, 
(but note that currently the block back-end doesn't support polling which is
critical for the performance).

Switch to the target code might though have some (probably minor) performance
impact, as it would probably lengthen the critical code path a bit (I might need
for instance to translate the PRP lists I am getting from the virtual controller
to a scattergather list and back).

This is why I did this the way I did, but now knowing that probably I can afford
to loose a bit of performance, I can look at doing that.

Best regards,
Thanks in advance for the review,
Maxim Levitsky

PS:

For reference currently the IO path looks more or less like that:

My IO thread notices a doorbell write, reads a command from a submission queue,
translates it (without even looking at the data pointer) and sends it to the
nvme pci driver together with pointer to data iterator'.

The nvme pci driver calls the data iterator N times, which makes the iterator
translate and fetch the DMA addresses where the data is already mapped on the
its pci nvme device (the mdev driver maps all the guest memory to the nvme pci
device).
The nvme pci driver uses these addresses it receives, to create a prp list,
which it puts into the data pointer.

The nvme pci driver also allocates an free command id, from a list, puts it into
the command ID and sends the command to the real hardware.

Later the IO thread calls to the nvme pci driver to poll the queue. When
completions arrive, the nvme pci driver returns them back to the IO thread.

> 
> 

Re: your mail

2019-03-20 Thread Greg Kroah-Hartman
On Mon, Mar 18, 2019 at 08:20:01PM -0600, George Hilliard wrote:
> Because of this change, the driver now expects a pinctrl device
> reference in the mmc controller's device tree node; without it, it will
> bail out.  This could break existing setups that don't specify it
> because it "just worked" up until now.  So currently I just let the old
> behavior fall away because this is a staging driver.  But if this is a
> problem, the old behavior could be added back as a fallback.
> 
> This is version 2 of a patchset that I requested feedback for about a
> month ago.  Please review as if they are a new patchset; all the patches
> were rebased several times and a couple new correctness fixes added.
> 
> The TODO list is largely unchanged, aside from the couple of TODO
> comments in the code that I have addressed.  Ultimately, I think this
> driver could potentially be merged with the "real" mtk-mmc driver as the
> TODO suggests, but someone who is more familiar with the IP core will
> have to do that.  Mediatek documentation (that I can find) is very
> sparse.
> 
> This is ready to merge if there is no other feedback!
> 
> >From George Hilliard  # This line is ignored.
> From: George Hilliard 
> Reply-To: 
> Subject: [PATCH v2 00/11] mt7621-mmc: Various correctness fixes
> In-Reply-To: 
> 
> 

No subject for this email?



Re: your mail

2019-03-19 Thread Chaitanya Kulkarni
Hi Keith,
On 03/19/2019 08:21 AM, Keith Busch wrote:
> On Tue, Mar 19, 2019 at 04:41:07PM +0200, Maxim Levitsky wrote:
>>-> Share the NVMe device between host and guest.
>>   Even in fully virtualized configurations,
>>   some partitions of nvme device could be used by guests as block devices
>>   while others passed through with nvme-mdev to achieve balance between
>>   all features of full IO stack emulation and performance.
>>
>>-> NVME-MDEV is a bit faster due to the fact that in-kernel driver
>>   can send interrupts to the guest directly without a context
>>   switch that can be expensive due to meltdown mitigation.
>>
>>-> Is able to utilize interrupts to get reasonable performance.
>>   This is only implemented
>>   as a proof of concept and not included in the patches,
>>   but interrupt driven mode shows reasonable performance
>>
>>-> This is a framework that later can be used to support NVMe devices
>>   with more of the IO virtualization built-in
>>   (IOMMU with PASID support coupled with device that supports it)
>
> Would be very interested to see the PASID support. You wouldn't even
> need to mediate the IO doorbells or translations if assigning entire
> namespaces, and should be much faster than the shadow doorbells.
>
> I think you should send 6/9 "nvme/pci: init shadow doorbell after each
> reset" separately for immediate inclusion.
>
> I like the idea in principle, but it will take me a little time to get
> through reviewing your implementation. I would have guessed we could
> have leveraged something from the existing nvme/target for the mediating
> controller register access and admin commands. Maybe even start with
> implementing an nvme passthrough namespace target type (we currently
> have block and file).

I have the code for the NVMeOf target passthru-ctrl, I think we can use 
that as it is if you are looking for the passthru for NVMeOF.

I'll post patch-series based on the latest code base soon.
>
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>



Re: your mail

2019-03-19 Thread Keith Busch
On Tue, Mar 19, 2019 at 04:41:07PM +0200, Maxim Levitsky wrote:
>   -> Share the NVMe device between host and guest. 
>  Even in fully virtualized configurations,
>  some partitions of nvme device could be used by guests as block devices 
>  while others passed through with nvme-mdev to achieve balance between
>  all features of full IO stack emulation and performance.
>   
>   -> NVME-MDEV is a bit faster due to the fact that in-kernel driver 
>  can send interrupts to the guest directly without a context 
>  switch that can be expensive due to meltdown mitigation.
> 
>   -> Is able to utilize interrupts to get reasonable performance. 
>  This is only implemented
>  as a proof of concept and not included in the patches, 
>  but interrupt driven mode shows reasonable performance
>  
>   -> This is a framework that later can be used to support NVMe devices 
>  with more of the IO virtualization built-in 
>  (IOMMU with PASID support coupled with device that supports it)

Would be very interested to see the PASID support. You wouldn't even
need to mediate the IO doorbells or translations if assigning entire
namespaces, and should be much faster than the shadow doorbells.

I think you should send 6/9 "nvme/pci: init shadow doorbell after each
reset" separately for immediate inclusion.

I like the idea in principle, but it will take me a little time to get
through reviewing your implementation. I would have guessed we could
have leveraged something from the existing nvme/target for the mediating
controller register access and admin commands. Maybe even start with
implementing an nvme passthrough namespace target type (we currently
have block and file).


Re: your mail

2019-02-26 Thread Roman Gushchin
On Mon, Feb 25, 2019 at 03:16:29PM -0500, Johannes Weiner wrote:
> [resending, rebased on top of latest mmots]
> 
> The memcg LRU stats usage is currently a bit messy. Memcg has private
> per-zone counters because reclaim needs zone granularity sometimes,
> but we also have plenty of users that need to awkwardly sum them up to
> node or memcg granularity. Meanwhile the canonical per-memcg vmstats
> do not track the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect.
> 
> This series enables LRU count tracking in the per-memcg vmstats array
> such that lruvec_page_state() and memcg_page_state() work on the enum
> node_stat_item items for the LRU counters. Then it converts all the
> callers that don't specifically need per-zone numbers over to that.

The updated version looks very good* to me!
Please, feel free to use:
Reviewed-by: Roman Gushchin 

Looking through the patchset, I have a feeling that we're sometimes
gathering too much data. Perhaps we don't need the whole set
of counters to be per-cpu on both memcg- and memcg-per-node levels.
Merging them can save quite a lot of space. Anyway, it's a separate
topic.

* except "to" and "subject" of the cover letter

Thanks!


Re: your mail

2018-07-25 Thread Greg Kroah-Hartman
On Wed, Jul 25, 2018 at 01:22:07AM +0300, Georgios Tsotsos wrote:
> Date: Wed, 25 Jul 2018 01:18:58 +0300
> Subject: [PATCH 0/4] Staging: octeon-usb: Fixes and Coding style applied. 
> 
> Hello, 

Somehow your subject here got messed up and put in the bod of the email.
Not a big deal this time, but be more careful next time please.

thanks,

greg k-h


Re: your mail

2018-07-25 Thread Greg Kroah-Hartman
On Wed, Jul 25, 2018 at 01:22:07AM +0300, Georgios Tsotsos wrote:
> Date: Wed, 25 Jul 2018 01:18:58 +0300
> Subject: [PATCH 0/4] Staging: octeon-usb: Fixes and Coding style applied. 
> 
> Hello, 

Somehow your subject here got messed up and put in the bod of the email.
Not a big deal this time, but be more careful next time please.

thanks,

greg k-h


Re: your mail

2018-06-19 Thread Dan Carpenter
Thanks for this.  This is a lot of work.

On Wed, Jun 13, 2018 at 08:31:28PM +0300, Anton Vasilyev wrote:
> diff --git a/drivers/staging/rts5208/rtsx.c b/drivers/staging/rts5208/rtsx.c
> index 70e0b8623110..69e6abe14abf 100644
> --- a/drivers/staging/rts5208/rtsx.c
> +++ b/drivers/staging/rts5208/rtsx.c
> @@ -857,7 +857,7 @@ static int rtsx_probe(struct pci_dev *pci,
>   dev->chip = kzalloc(sizeof(*dev->chip), GFP_KERNEL);
>   if (!dev->chip) {
>   err = -ENOMEM;
> - goto errout;
> + goto chip_alloc_fail;

The most recent successful allocation is scsi_host_alloc().  I was
really hoping this would say something like "goto err_free_host;" or
something.  The naming style here is a "come from" label which doesn't
say if it's going to free the scsi host or not...  It turns out we don't
free the the host, but we should:

err_put_host:
scsi_host_put(host);

The kzalloc() has it's own error message built in, and all the other
error paths as well so the dev_err() is not super important to me...

Killing the threads seems actually really complicated so maybe we should
just have a separate error paths for that.  I'm not sure...

regards,
dan carpenter



Re: your mail

2018-06-19 Thread Dan Carpenter
Thanks for this.  This is a lot of work.

On Wed, Jun 13, 2018 at 08:31:28PM +0300, Anton Vasilyev wrote:
> diff --git a/drivers/staging/rts5208/rtsx.c b/drivers/staging/rts5208/rtsx.c
> index 70e0b8623110..69e6abe14abf 100644
> --- a/drivers/staging/rts5208/rtsx.c
> +++ b/drivers/staging/rts5208/rtsx.c
> @@ -857,7 +857,7 @@ static int rtsx_probe(struct pci_dev *pci,
>   dev->chip = kzalloc(sizeof(*dev->chip), GFP_KERNEL);
>   if (!dev->chip) {
>   err = -ENOMEM;
> - goto errout;
> + goto chip_alloc_fail;

The most recent successful allocation is scsi_host_alloc().  I was
really hoping this would say something like "goto err_free_host;" or
something.  The naming style here is a "come from" label which doesn't
say if it's going to free the scsi host or not...  It turns out we don't
free the the host, but we should:

err_put_host:
scsi_host_put(host);

The kzalloc() has it's own error message built in, and all the other
error paths as well so the dev_err() is not super important to me...

Killing the threads seems actually really complicated so maybe we should
just have a separate error paths for that.  I'm not sure...

regards,
dan carpenter



Re: your mail

2017-12-07 Thread Greg Kroah-Hartman
On Thu, Dec 07, 2017 at 01:26:14AM -0800, Alexander Kappner wrote:
> Date: Wed, 6 Dec 2017 15:28:37 -0800
> Subject: [PATCH] usb-core: Fix potential null pointer dereference in 
> xhci-debugfs.c

Something went wrong here, resulting in an email with no subject.

Can you fix this up and resend?

thanks,

greg k-h


Re: your mail

2017-12-07 Thread Greg Kroah-Hartman
On Thu, Dec 07, 2017 at 01:26:14AM -0800, Alexander Kappner wrote:
> Date: Wed, 6 Dec 2017 15:28:37 -0800
> Subject: [PATCH] usb-core: Fix potential null pointer dereference in 
> xhci-debugfs.c

Something went wrong here, resulting in an email with no subject.

Can you fix this up and resend?

thanks,

greg k-h


Re: your mail

2017-08-18 Thread Rajneesh Bhardwaj
On Fri, Aug 18, 2017 at 11:12:14PM +0530, Rajneesh Bhardwaj wrote:
> Bcc: 
> Subject: Re: [PATCH] platform/x86: intel_pmc_core: Add Package C-states
>  residency info
> Reply-To: 
> In-Reply-To: 
> 
>

Please ignore my previous email without subject. It was sent by mistake.

> On Fri, Aug 18, 2017 at 08:17:32PM +0300, Andy Shevchenko wrote:
> > +PeterZ (since I mentioned his name)
> > 
> > On Fri, Aug 18, 2017 at 5:58 PM, Rajneesh Bhardwaj
> >  wrote:
> > > On Fri, Aug 18, 2017 at 03:57:34PM +0300, Andy Shevchenko wrote:
> > >> On Fri, Aug 18, 2017 at 3:37 PM, Rajneesh Bhardwaj
> > >>  wrote:
> > >> > This patch introduces a new debugfs entry to read current Package 
> > >> > C-state
> > >> > residency values and, one new kernel API to read the Package C-10 
> > >> > residency
> > >> > counter.
> > >> >
> > >> > Package C-state residency MSRs provide useful debug information about 
> > >> > system
> > >> > idle states. In idle states system must enter deeper Package C-states.
> > 
> > >> Why this patch is needed?
> > >
> > > Andy, I'll try to give some background for this.
> > >
> > > This is needed to enhance the S0ix failure debug capabilities from within
> > > the kernel. On ChromeOS we have S0ix failsafe kernel framework that is 
> > > used
> > > to validate S0ix and report the blockers in case of a failure.
> > > https://patchwork.kernel.org/patch/9148999/
> > 
> > (It's not part of upstream)
> 
> Sorry i sent an older link. There are fresh attempts to get this into
> mainline kernel and looks like there is a traction for it.
> https://patchwork.kernel.org/patch/9831229/
> 
> Package C-state (PC10) validation is discussed there.
> 
> > 
> > > So far only intel_pmc_slp_s0_counter_read is called by this framework to
> > > check whether the previous attempt to enter S0ix was success or not.
> > 
> > I harder see even a single user of that API in current kernel. It
> > should be unexported and removed I think.
> > 
> > >  Having
> > > another PC10 counter related exported function enhances the S0ix debug 
> > > since
> > > PC10 state is a prerequisite to enter S0ix.
> > >
> > >> See, we have turbostat and cpupower user space tools which do this
> > >> without any additional code to be written in kernel. What prevents
> > >> your user space application do the same?
> > >>
> > >> Moreover, we have events for cstate, I assume perf or something alike
> > >> can monitor those counters as well.
> > >
> > > You're right, perhaps the debugfs is redundant when we have those user 
> > > space
> > > tools but such tools are not available readily for all platforms/distros.
> > > Interfaces like /dev/cpu/*/msr that turbostat uses are not available on 
> > > all
> > > the platforms.
> > > PMC driver is a debug driver so i thought its better to show Package 
> > > C-state
> > > related info for low power debug here.
> > >
> > >>
> > >> Sorry, NAK.
> > >
> > > This patch has two parts i.e. exported PC10 API and the debugfs. Based on
> > > the above explanation, if the patch is not good as is, please let me know 
> > > if
> > > i should drop the debugfs part and respin a v2 with just the exported API 
> > > or
> > > drop this totally.
> > >
> > > Thanks for the feedback and thanks for taking time to review!
> > 
> > Reading above makes me think that entire design of this is misguided.
> > Since the most of values are counters they better to be accessed in a
> > way how perf does.
> > 
> > In case you need *in-kernel* facility, do some APIs (if it's not done
> > yet) for events drivers first.
> > cstate event driver is already in upstream.
> > 
> > Sorry, NAK for entire patch until it would be blessed by people like Peter 
> > Z.
> > 
> > -- 
> > With Best Regards,
> > Andy Shevchenko
> 
> -- 
> Best Regards,
> Rajneesh

-- 
Best Regards,
Rajneesh


Re: your mail

2017-08-18 Thread Rajneesh Bhardwaj
On Fri, Aug 18, 2017 at 11:12:14PM +0530, Rajneesh Bhardwaj wrote:
> Bcc: 
> Subject: Re: [PATCH] platform/x86: intel_pmc_core: Add Package C-states
>  residency info
> Reply-To: 
> In-Reply-To: 
> 
>

Please ignore my previous email without subject. It was sent by mistake.

> On Fri, Aug 18, 2017 at 08:17:32PM +0300, Andy Shevchenko wrote:
> > +PeterZ (since I mentioned his name)
> > 
> > On Fri, Aug 18, 2017 at 5:58 PM, Rajneesh Bhardwaj
> >  wrote:
> > > On Fri, Aug 18, 2017 at 03:57:34PM +0300, Andy Shevchenko wrote:
> > >> On Fri, Aug 18, 2017 at 3:37 PM, Rajneesh Bhardwaj
> > >>  wrote:
> > >> > This patch introduces a new debugfs entry to read current Package 
> > >> > C-state
> > >> > residency values and, one new kernel API to read the Package C-10 
> > >> > residency
> > >> > counter.
> > >> >
> > >> > Package C-state residency MSRs provide useful debug information about 
> > >> > system
> > >> > idle states. In idle states system must enter deeper Package C-states.
> > 
> > >> Why this patch is needed?
> > >
> > > Andy, I'll try to give some background for this.
> > >
> > > This is needed to enhance the S0ix failure debug capabilities from within
> > > the kernel. On ChromeOS we have S0ix failsafe kernel framework that is 
> > > used
> > > to validate S0ix and report the blockers in case of a failure.
> > > https://patchwork.kernel.org/patch/9148999/
> > 
> > (It's not part of upstream)
> 
> Sorry i sent an older link. There are fresh attempts to get this into
> mainline kernel and looks like there is a traction for it.
> https://patchwork.kernel.org/patch/9831229/
> 
> Package C-state (PC10) validation is discussed there.
> 
> > 
> > > So far only intel_pmc_slp_s0_counter_read is called by this framework to
> > > check whether the previous attempt to enter S0ix was success or not.
> > 
> > I harder see even a single user of that API in current kernel. It
> > should be unexported and removed I think.
> > 
> > >  Having
> > > another PC10 counter related exported function enhances the S0ix debug 
> > > since
> > > PC10 state is a prerequisite to enter S0ix.
> > >
> > >> See, we have turbostat and cpupower user space tools which do this
> > >> without any additional code to be written in kernel. What prevents
> > >> your user space application do the same?
> > >>
> > >> Moreover, we have events for cstate, I assume perf or something alike
> > >> can monitor those counters as well.
> > >
> > > You're right, perhaps the debugfs is redundant when we have those user 
> > > space
> > > tools but such tools are not available readily for all platforms/distros.
> > > Interfaces like /dev/cpu/*/msr that turbostat uses are not available on 
> > > all
> > > the platforms.
> > > PMC driver is a debug driver so i thought its better to show Package 
> > > C-state
> > > related info for low power debug here.
> > >
> > >>
> > >> Sorry, NAK.
> > >
> > > This patch has two parts i.e. exported PC10 API and the debugfs. Based on
> > > the above explanation, if the patch is not good as is, please let me know 
> > > if
> > > i should drop the debugfs part and respin a v2 with just the exported API 
> > > or
> > > drop this totally.
> > >
> > > Thanks for the feedback and thanks for taking time to review!
> > 
> > Reading above makes me think that entire design of this is misguided.
> > Since the most of values are counters they better to be accessed in a
> > way how perf does.
> > 
> > In case you need *in-kernel* facility, do some APIs (if it's not done
> > yet) for events drivers first.
> > cstate event driver is already in upstream.
> > 
> > Sorry, NAK for entire patch until it would be blessed by people like Peter 
> > Z.
> > 
> > -- 
> > With Best Regards,
> > Andy Shevchenko
> 
> -- 
> Best Regards,
> Rajneesh

-- 
Best Regards,
Rajneesh


Re: your mail

2017-06-14 Thread Yury Norov
Hi Catalin, all.

Thank you for your time on reviewing the series. I really appreciate it.

This is the updated version where I tried to address all comments:
https://github.com/norov/linux/commits/ilp32-20170613.4

(3 last patches here is the Andrew Pinski's rework of vdso rebased on
ilp32 series)

If nothing will come here on review, I'll send v8 at the beginning of
the next week. Is this plan OK?

And this is the backport on the v4.11 kernel:
https://github.com/norov/linux/commits/ilp32-4.11.4

Yury

On Sun, Jun 04, 2017 at 02:59:49PM +0300, Yury Norov wrote:
> Subject: [PATCH v7 resend 2 00/20] ILP32 for ARM64
> 
> Hi Catalin,
>  
> Here is a rebase of latest kernel patchset against next-20170602. There's 
> almost
> no changes, but there are some conflicts that are not trivial, and I'd like to
> refresh the submission therefore.
> 
> How are your experiments with testing and benchmarking of ILP32 are going? In
> my current tests I see 0 failures on LTP. Benchmarking on SPEC CPU2006 and
> LMBench shows no difference for LP64 and expected performance boost for ILP32
> (compared to LP64 results).
> 
> Steve Ellcey is handling upstream submission of Glibc patches. The patches are
> ready and have been reviewed and reworked per community’s comments. There are
> no outstanding userspace ABI issues from Glibc. Glibc submission is now 
> waiting
> on ILP32 kernel submission.
> 
> Catalin, regarding rootfs, is OpenSuSe’s build sufficient for your 
> experiments?
> I’ve also seen Wookey merging patches for ILP32 triplet to binutils and 
> pushing
> them to Debian.
> 
> One last thing I wanted to check with you about is ILP32 PCS - does, in your
> view, ARM Ltd. needs to publish any additional docs for ABI to become 
> official?
> 
> Below is the regular description.
> 
> Thanks.
> Yury
> 
> 
> 
> This series enables aarch64 with ilp32 mode.
> 
> As supporting work, it introduces ARCH_32BIT_OFF_T configuration
> option that is enabled for existing 32-bit architectures but disabled
> for new arches (so 64-bit off_t is is used by new userspace). Also it
> deprecates getrlimit and setrlimit syscalls prior to prlimit64.
> 
> This version is based on linux-next from 2017-03-01. It works with
> glibc-2.25, and tested with LTP, glibc testsuite, trinity, lmbench,
> CPUSpec.
> 
> Patches 1, 2, 3 and 8 are general, and may be applied separately.
> 
> This is the rebase of v7 - still no major changes has been made.
> 
> Kernel and GLIBC trees:
> https://github.com/norov/linux/tree/ilp32-20170602
> https://github.com/norov/glibc/tree/dev9
> 
> (GLIBC patches are managed by Steve Ellcey, so my tree is only for
> reference.)
> 
> Changes:
> v3: https://lkml.org/lkml/2014/9/3/704
> v4: https://lkml.org/lkml/2015/4/13/691
> v5: https://lkml.org/lkml/2015/9/29/911
> v6: https://lkml.org/lkml/2016/5/23/661
> v7: RFC nowrap:  https://lkml.org/lkml/2016/6/17/990
> v7: RFC2 nowrap: https://lkml.org/lkml/2016/8/17/245
> v7: RFC3 nowrap: https://lkml.org/lkml/2016/10/21/883
> v7: https://lkml.org/lkml/2017/1/9/213
> v7: Resend: 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2017-March/490801.html
> v7: Resend 2:
> - vdso-ilp32 Makefile synced with lp64 Makefile (patch 19);
> - rebased on next-20170602.
> 
> Andrew Pinski (6):
>   arm64: rename COMPAT to AARCH32_EL0 in Kconfig
>   arm64: ensure the kernel is compiled for LP64
>   arm64:uapi: set __BITS_PER_LONG correctly for ILP32 and LP64
>   arm64: ilp32: add sys_ilp32.c and a separate table (in entry.S) to use
> it
>   arm64: ilp32: introduce ilp32-specific handlers for sigframe and
> ucontext
>   arm64:ilp32: add ARM64_ILP32 to Kconfig
> 
> Philipp Tomsich (1):
>   arm64:ilp32: add vdso-ilp32 and use for signal return
> 
> Yury Norov (13):
>   compat ABI: use non-compat openat and open_by_handle_at variants
>   32-bit ABI: introduce ARCH_32BIT_OFF_T config option
>   asm-generic: Drop getrlimit and setrlimit syscalls from default list
>   arm64: ilp32: add documentation on the ILP32 ABI for ARM64
>   thread: move thread bits accessors to separated file
>   arm64: introduce is_a32_task and is_a32_thread (for AArch32 compat)
>   arm64: ilp32: add is_ilp32_compat_{task,thread} and TIF_32BIT_AARCH64
>   arm64: introduce binfmt_elf32.c
>   arm64: ilp32: introduce binfmt_ilp32.c
>   arm64: ilp32: share aarch32 syscall handlers
>   arm64: signal: share lp64 signal routines to ilp32
>   arm64: signal32: move ilp32 and aarch32 common code to separated file
>   arm64: ptrace: handle ptrace_request differently for aarch32 and ilp32
> 
>  Documentation/arm64/ilp32.txt |  45 +++
>  arch/Kconfig  |   4 +
>  arch/arc/Kconfig  |   1 +
>  arch/arc/include/uapi/asm/unistd.h|   1 +
>  arch/arm/Kconfig  |   1 +
>  arch/arm64/Kconfig|  19 ++-
>  arch/arm64/Makefile   |   8 ++
>  

Re: your mail

2017-06-14 Thread Yury Norov
Hi Catalin, all.

Thank you for your time on reviewing the series. I really appreciate it.

This is the updated version where I tried to address all comments:
https://github.com/norov/linux/commits/ilp32-20170613.4

(3 last patches here is the Andrew Pinski's rework of vdso rebased on
ilp32 series)

If nothing will come here on review, I'll send v8 at the beginning of
the next week. Is this plan OK?

And this is the backport on the v4.11 kernel:
https://github.com/norov/linux/commits/ilp32-4.11.4

Yury

On Sun, Jun 04, 2017 at 02:59:49PM +0300, Yury Norov wrote:
> Subject: [PATCH v7 resend 2 00/20] ILP32 for ARM64
> 
> Hi Catalin,
>  
> Here is a rebase of latest kernel patchset against next-20170602. There's 
> almost
> no changes, but there are some conflicts that are not trivial, and I'd like to
> refresh the submission therefore.
> 
> How are your experiments with testing and benchmarking of ILP32 are going? In
> my current tests I see 0 failures on LTP. Benchmarking on SPEC CPU2006 and
> LMBench shows no difference for LP64 and expected performance boost for ILP32
> (compared to LP64 results).
> 
> Steve Ellcey is handling upstream submission of Glibc patches. The patches are
> ready and have been reviewed and reworked per community’s comments. There are
> no outstanding userspace ABI issues from Glibc. Glibc submission is now 
> waiting
> on ILP32 kernel submission.
> 
> Catalin, regarding rootfs, is OpenSuSe’s build sufficient for your 
> experiments?
> I’ve also seen Wookey merging patches for ILP32 triplet to binutils and 
> pushing
> them to Debian.
> 
> One last thing I wanted to check with you about is ILP32 PCS - does, in your
> view, ARM Ltd. needs to publish any additional docs for ABI to become 
> official?
> 
> Below is the regular description.
> 
> Thanks.
> Yury
> 
> 
> 
> This series enables aarch64 with ilp32 mode.
> 
> As supporting work, it introduces ARCH_32BIT_OFF_T configuration
> option that is enabled for existing 32-bit architectures but disabled
> for new arches (so 64-bit off_t is is used by new userspace). Also it
> deprecates getrlimit and setrlimit syscalls prior to prlimit64.
> 
> This version is based on linux-next from 2017-03-01. It works with
> glibc-2.25, and tested with LTP, glibc testsuite, trinity, lmbench,
> CPUSpec.
> 
> Patches 1, 2, 3 and 8 are general, and may be applied separately.
> 
> This is the rebase of v7 - still no major changes has been made.
> 
> Kernel and GLIBC trees:
> https://github.com/norov/linux/tree/ilp32-20170602
> https://github.com/norov/glibc/tree/dev9
> 
> (GLIBC patches are managed by Steve Ellcey, so my tree is only for
> reference.)
> 
> Changes:
> v3: https://lkml.org/lkml/2014/9/3/704
> v4: https://lkml.org/lkml/2015/4/13/691
> v5: https://lkml.org/lkml/2015/9/29/911
> v6: https://lkml.org/lkml/2016/5/23/661
> v7: RFC nowrap:  https://lkml.org/lkml/2016/6/17/990
> v7: RFC2 nowrap: https://lkml.org/lkml/2016/8/17/245
> v7: RFC3 nowrap: https://lkml.org/lkml/2016/10/21/883
> v7: https://lkml.org/lkml/2017/1/9/213
> v7: Resend: 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2017-March/490801.html
> v7: Resend 2:
> - vdso-ilp32 Makefile synced with lp64 Makefile (patch 19);
> - rebased on next-20170602.
> 
> Andrew Pinski (6):
>   arm64: rename COMPAT to AARCH32_EL0 in Kconfig
>   arm64: ensure the kernel is compiled for LP64
>   arm64:uapi: set __BITS_PER_LONG correctly for ILP32 and LP64
>   arm64: ilp32: add sys_ilp32.c and a separate table (in entry.S) to use
> it
>   arm64: ilp32: introduce ilp32-specific handlers for sigframe and
> ucontext
>   arm64:ilp32: add ARM64_ILP32 to Kconfig
> 
> Philipp Tomsich (1):
>   arm64:ilp32: add vdso-ilp32 and use for signal return
> 
> Yury Norov (13):
>   compat ABI: use non-compat openat and open_by_handle_at variants
>   32-bit ABI: introduce ARCH_32BIT_OFF_T config option
>   asm-generic: Drop getrlimit and setrlimit syscalls from default list
>   arm64: ilp32: add documentation on the ILP32 ABI for ARM64
>   thread: move thread bits accessors to separated file
>   arm64: introduce is_a32_task and is_a32_thread (for AArch32 compat)
>   arm64: ilp32: add is_ilp32_compat_{task,thread} and TIF_32BIT_AARCH64
>   arm64: introduce binfmt_elf32.c
>   arm64: ilp32: introduce binfmt_ilp32.c
>   arm64: ilp32: share aarch32 syscall handlers
>   arm64: signal: share lp64 signal routines to ilp32
>   arm64: signal32: move ilp32 and aarch32 common code to separated file
>   arm64: ptrace: handle ptrace_request differently for aarch32 and ilp32
> 
>  Documentation/arm64/ilp32.txt |  45 +++
>  arch/Kconfig  |   4 +
>  arch/arc/Kconfig  |   1 +
>  arch/arc/include/uapi/asm/unistd.h|   1 +
>  arch/arm/Kconfig  |   1 +
>  arch/arm64/Kconfig|  19 ++-
>  arch/arm64/Makefile   |   8 ++
>  

Re: your mail

2017-04-27 Thread Michal Hocko
On Thu 27-04-17 11:08:38, Joonsoo Kim wrote:
> On Wed, Apr 26, 2017 at 11:19:06AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > 
> > > > > > You are trying to change a semantic of something that has a well 
> > > > > > defined
> > > > > > meaning. I disagree that we should change it. It might sound like a
> > > > > > simpler thing to do because pfn walkers will have to be checked but 
> > > > > > what
> > > > > > you are proposing is conflating two different things together.
> > > > > 
> > > > > I don't think that *I* try to change the semantic of pfn_valid().
> > > > > It would be original semantic of pfn_valid().
> > > > > 
> > > > > "If pfn_valid() returns true, we can get proper struct page and the
> > > > > zone information,"
> > > > 
> > > > I do not see any guarantee about the zone information anywhere. In fact
> > > > this is not true with the original implementation as I've tried to
> > > > explain already. We do have new pages associated with a zone but that
> > > > association might change during the online phase. So you cannot really
> > > > rely on that information until the page is online. There is no real
> > > > change in that regards after my rework.
> > > 
> > > I know that what you did doesn't change thing much. What I try to say
> > > is that previous implementation related to pfn_valid() in hotplug is
> > > wrong. Please do not assume that hotplug implementation is correct and
> > > other pfn_valid() users are incorrect. There is no design document so
> > > I'm not sure which one is correct but assumption that pfn_valid() user
> > > can access whole the struct page information makes much sense to me.
> > 
> > Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
> > still need pfn_valid to work for those pfns. Really, pfn_valid has a
> 
> It's really contrary example to your insist. They requires not only
> struct page but also other information, especially, the zone index.
> They checks zone idx to know whether this page is for ZONE_DEVICE or not.

Yes and they guarantee this association is true. Without memory onlining
though. This memory is never online for anybody who is asking.

[...]

> I think that I did my best to explain my reasoning. It seems that we
> cannot agree with each other so it's better for some others to express
> their opinion to this problem. I will stop this discussion from now
> on.

I _do_ appreciate your feedback and if the general consensus is to
modify pfn_valid I can go that direction but my gut feeling tells me
that conflating "existing struct page" test and "fully online and
initialized" one is a wrong thing to do.
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-27 Thread Michal Hocko
On Thu 27-04-17 11:08:38, Joonsoo Kim wrote:
> On Wed, Apr 26, 2017 at 11:19:06AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > 
> > > > > > You are trying to change a semantic of something that has a well 
> > > > > > defined
> > > > > > meaning. I disagree that we should change it. It might sound like a
> > > > > > simpler thing to do because pfn walkers will have to be checked but 
> > > > > > what
> > > > > > you are proposing is conflating two different things together.
> > > > > 
> > > > > I don't think that *I* try to change the semantic of pfn_valid().
> > > > > It would be original semantic of pfn_valid().
> > > > > 
> > > > > "If pfn_valid() returns true, we can get proper struct page and the
> > > > > zone information,"
> > > > 
> > > > I do not see any guarantee about the zone information anywhere. In fact
> > > > this is not true with the original implementation as I've tried to
> > > > explain already. We do have new pages associated with a zone but that
> > > > association might change during the online phase. So you cannot really
> > > > rely on that information until the page is online. There is no real
> > > > change in that regards after my rework.
> > > 
> > > I know that what you did doesn't change thing much. What I try to say
> > > is that previous implementation related to pfn_valid() in hotplug is
> > > wrong. Please do not assume that hotplug implementation is correct and
> > > other pfn_valid() users are incorrect. There is no design document so
> > > I'm not sure which one is correct but assumption that pfn_valid() user
> > > can access whole the struct page information makes much sense to me.
> > 
> > Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
> > still need pfn_valid to work for those pfns. Really, pfn_valid has a
> 
> It's really contrary example to your insist. They requires not only
> struct page but also other information, especially, the zone index.
> They checks zone idx to know whether this page is for ZONE_DEVICE or not.

Yes and they guarantee this association is true. Without memory onlining
though. This memory is never online for anybody who is asking.

[...]

> I think that I did my best to explain my reasoning. It seems that we
> cannot agree with each other so it's better for some others to express
> their opinion to this problem. I will stop this discussion from now
> on.

I _do_ appreciate your feedback and if the general consensus is to
modify pfn_valid I can go that direction but my gut feeling tells me
that conflating "existing struct page" test and "fully online and
initialized" one is a wrong thing to do.
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-26 Thread Joonsoo Kim
On Wed, Apr 26, 2017 at 11:19:06AM +0200, Michal Hocko wrote:
> > > [...]
> > > 
> > > > > You are trying to change a semantic of something that has a well 
> > > > > defined
> > > > > meaning. I disagree that we should change it. It might sound like a
> > > > > simpler thing to do because pfn walkers will have to be checked but 
> > > > > what
> > > > > you are proposing is conflating two different things together.
> > > > 
> > > > I don't think that *I* try to change the semantic of pfn_valid().
> > > > It would be original semantic of pfn_valid().
> > > > 
> > > > "If pfn_valid() returns true, we can get proper struct page and the
> > > > zone information,"
> > > 
> > > I do not see any guarantee about the zone information anywhere. In fact
> > > this is not true with the original implementation as I've tried to
> > > explain already. We do have new pages associated with a zone but that
> > > association might change during the online phase. So you cannot really
> > > rely on that information until the page is online. There is no real
> > > change in that regards after my rework.
> > 
> > I know that what you did doesn't change thing much. What I try to say
> > is that previous implementation related to pfn_valid() in hotplug is
> > wrong. Please do not assume that hotplug implementation is correct and
> > other pfn_valid() users are incorrect. There is no design document so
> > I'm not sure which one is correct but assumption that pfn_valid() user
> > can access whole the struct page information makes much sense to me.
> 
> Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
> still need pfn_valid to work for those pfns. Really, pfn_valid has a

It's really contrary example to your insist. They requires not only
struct page but also other information, especially, the zone index.
They checks zone idx to know whether this page is for ZONE_DEVICE or not.

So, pfn_valid() for ZONE_DEVICE pages assume that struct page has all
the valid information. It's perfectly matched with my suggestion.
Online isn't important issue here. What the important point is the condition
that pfn_valid() return true. pfn_valid() for ZONE_DEVICE returns true after
arch_add_memory() since all the struct page information is fixed there.

If zone of hotplugged memory cannot be fixed at this moment, you can
defef it until all the information is fixed (onlining). That
seems to be better semantic of pfn_valid() to me.

> different meaning than you would like it to have. Who knows how many
> others like that are lurking there. I feel much more comfortable to go
> and hunt already broken code and fix it rathert than break something
> unexpectedly.

I think that I did my best to explain my reasoning. It seems that we
cannot agree with each other so it's better for some others to express
their opinion to this problem. I will stop this discussion from now
on.

Thanks.


Re: your mail

2017-04-26 Thread Joonsoo Kim
On Wed, Apr 26, 2017 at 11:19:06AM +0200, Michal Hocko wrote:
> > > [...]
> > > 
> > > > > You are trying to change a semantic of something that has a well 
> > > > > defined
> > > > > meaning. I disagree that we should change it. It might sound like a
> > > > > simpler thing to do because pfn walkers will have to be checked but 
> > > > > what
> > > > > you are proposing is conflating two different things together.
> > > > 
> > > > I don't think that *I* try to change the semantic of pfn_valid().
> > > > It would be original semantic of pfn_valid().
> > > > 
> > > > "If pfn_valid() returns true, we can get proper struct page and the
> > > > zone information,"
> > > 
> > > I do not see any guarantee about the zone information anywhere. In fact
> > > this is not true with the original implementation as I've tried to
> > > explain already. We do have new pages associated with a zone but that
> > > association might change during the online phase. So you cannot really
> > > rely on that information until the page is online. There is no real
> > > change in that regards after my rework.
> > 
> > I know that what you did doesn't change thing much. What I try to say
> > is that previous implementation related to pfn_valid() in hotplug is
> > wrong. Please do not assume that hotplug implementation is correct and
> > other pfn_valid() users are incorrect. There is no design document so
> > I'm not sure which one is correct but assumption that pfn_valid() user
> > can access whole the struct page information makes much sense to me.
> 
> Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
> still need pfn_valid to work for those pfns. Really, pfn_valid has a

It's really contrary example to your insist. They requires not only
struct page but also other information, especially, the zone index.
They checks zone idx to know whether this page is for ZONE_DEVICE or not.

So, pfn_valid() for ZONE_DEVICE pages assume that struct page has all
the valid information. It's perfectly matched with my suggestion.
Online isn't important issue here. What the important point is the condition
that pfn_valid() return true. pfn_valid() for ZONE_DEVICE returns true after
arch_add_memory() since all the struct page information is fixed there.

If zone of hotplugged memory cannot be fixed at this moment, you can
defef it until all the information is fixed (onlining). That
seems to be better semantic of pfn_valid() to me.

> different meaning than you would like it to have. Who knows how many
> others like that are lurking there. I feel much more comfortable to go
> and hunt already broken code and fix it rathert than break something
> unexpectedly.

I think that I did my best to explain my reasoning. It seems that we
cannot agree with each other so it's better for some others to express
their opinion to this problem. I will stop this discussion from now
on.

Thanks.


Re: your mail

2017-04-26 Thread Michal Hocko
On Tue 25-04-17 11:50:45, Joonsoo Kim wrote:
> On Mon, Apr 24, 2017 at 09:53:12AM +0200, Michal Hocko wrote:
> > On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> > > On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > > > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > > Which pfn walkers you have in mind?
> > > > > > > 
> > > > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > > > using pfn_valid().
> > > > > > 
> > > > > > Yeah, I've checked that one and in fact this is a good example of 
> > > > > > the
> > > > > > case where you do not really care about holes. It just checks the 
> > > > > > page
> > > > > > count which is a valid information under any circumstances.
> > > > > 
> > > > > I don't think so. First, it checks the page *map* count. Is it still 
> > > > > valid
> > > > > even if PageReserved() is set?
> > > > 
> > > > I do not know about any user which would manipulate page map count for
> > > > referenced pages. The core MM code doesn't.
> > > 
> > > That's weird that we can get *map* count without PageReserved() check,
> > > but we cannot get zone information.
> > > Zone information is more static information than map count.
> > 
> > As I've already pointed out the rework of the hotplug code is mainly
> > about postponing the zone initialization from the physical hot add to
> > the logical onlining. The zone is really not clear until that moment.
> >  
> > > It should be defined/documented in this time that what information in
> > > the struct page is valid even if PageReserved() is set. And then, we
> > > need to fix all the things based on this design decision.
> > 
> > Where would you suggest documenting this? We do have
> > Documentation/memory-hotplug.txt but it is not really specific about
> > struct page.
> 
> pfn_valid() in include/linux/mmzone.h looks proper place.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c412e6a3a1e9..443258fcac93 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1288,10 +1288,14 @@ unsigned long __init node_memmap_size_bytes(int, 
unsigned long, unsigned long);
 #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
 /*
  * pfn_valid() is meant to be able to tell if a given PFN has valid memmap
- * associated with it or not. In FLATMEM, it is expected that holes always
- * have valid memmap as long as there is valid PFNs either side of the hole.
- * In SPARSEMEM, it is assumed that a valid section has a memmap for the
- * entire section.
+ * associated with it or not. This means that a struct page exists for this
+ * pfn. The caller cannot assume the page is fully initialized though.
+ * pfn_to_online_page() should be used to make sure the struct page is fully
+ * initialized.
+ *
+ * In FLATMEM, it is expected that holes always have valid memmap as long as
+ * there is valid PFNs either side of the hole. In SPARSEMEM, it is assumed
+ * that a valid section has a memmap for the entire section.
  *
  * However, an ARM, and maybe other embedded architectures in the future
  * free memmap backing holes to save memory on the assumption the memmap is

> > [...]
> > 
> > > > You are trying to change a semantic of something that has a well defined
> > > > meaning. I disagree that we should change it. It might sound like a
> > > > simpler thing to do because pfn walkers will have to be checked but what
> > > > you are proposing is conflating two different things together.
> > > 
> > > I don't think that *I* try to change the semantic of pfn_valid().
> > > It would be original semantic of pfn_valid().
> > > 
> > > "If pfn_valid() returns true, we can get proper struct page and the
> > > zone information,"
> > 
> > I do not see any guarantee about the zone information anywhere. In fact
> > this is not true with the original implementation as I've tried to
> > explain already. We do have new pages associated with a zone but that
> > association might change during the online phase. So you cannot really
> > rely on that information until the page is online. There is no real
> > change in that regards after my rework.
> 
> I know that what you did doesn't change thing much. What I try to say
> is that previous implementation related to pfn_valid() in hotplug is
> wrong. Please do not assume that hotplug implementation is correct and
> other pfn_valid() users are incorrect. There is no design document so
> I'm not sure which one is correct but assumption that pfn_valid() user
> can access whole the struct page information makes much sense to me.

Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
still need pfn_valid to work for those pfns. Really, pfn_valid has a
different meaning than you would like it to have. Who knows how many
others like that 

Re: your mail

2017-04-26 Thread Michal Hocko
On Tue 25-04-17 11:50:45, Joonsoo Kim wrote:
> On Mon, Apr 24, 2017 at 09:53:12AM +0200, Michal Hocko wrote:
> > On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> > > On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > > > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > > Which pfn walkers you have in mind?
> > > > > > > 
> > > > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > > > using pfn_valid().
> > > > > > 
> > > > > > Yeah, I've checked that one and in fact this is a good example of 
> > > > > > the
> > > > > > case where you do not really care about holes. It just checks the 
> > > > > > page
> > > > > > count which is a valid information under any circumstances.
> > > > > 
> > > > > I don't think so. First, it checks the page *map* count. Is it still 
> > > > > valid
> > > > > even if PageReserved() is set?
> > > > 
> > > > I do not know about any user which would manipulate page map count for
> > > > referenced pages. The core MM code doesn't.
> > > 
> > > That's weird that we can get *map* count without PageReserved() check,
> > > but we cannot get zone information.
> > > Zone information is more static information than map count.
> > 
> > As I've already pointed out the rework of the hotplug code is mainly
> > about postponing the zone initialization from the physical hot add to
> > the logical onlining. The zone is really not clear until that moment.
> >  
> > > It should be defined/documented in this time that what information in
> > > the struct page is valid even if PageReserved() is set. And then, we
> > > need to fix all the things based on this design decision.
> > 
> > Where would you suggest documenting this? We do have
> > Documentation/memory-hotplug.txt but it is not really specific about
> > struct page.
> 
> pfn_valid() in include/linux/mmzone.h looks proper place.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c412e6a3a1e9..443258fcac93 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1288,10 +1288,14 @@ unsigned long __init node_memmap_size_bytes(int, 
unsigned long, unsigned long);
 #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
 /*
  * pfn_valid() is meant to be able to tell if a given PFN has valid memmap
- * associated with it or not. In FLATMEM, it is expected that holes always
- * have valid memmap as long as there is valid PFNs either side of the hole.
- * In SPARSEMEM, it is assumed that a valid section has a memmap for the
- * entire section.
+ * associated with it or not. This means that a struct page exists for this
+ * pfn. The caller cannot assume the page is fully initialized though.
+ * pfn_to_online_page() should be used to make sure the struct page is fully
+ * initialized.
+ *
+ * In FLATMEM, it is expected that holes always have valid memmap as long as
+ * there is valid PFNs either side of the hole. In SPARSEMEM, it is assumed
+ * that a valid section has a memmap for the entire section.
  *
  * However, an ARM, and maybe other embedded architectures in the future
  * free memmap backing holes to save memory on the assumption the memmap is

> > [...]
> > 
> > > > You are trying to change a semantic of something that has a well defined
> > > > meaning. I disagree that we should change it. It might sound like a
> > > > simpler thing to do because pfn walkers will have to be checked but what
> > > > you are proposing is conflating two different things together.
> > > 
> > > I don't think that *I* try to change the semantic of pfn_valid().
> > > It would be original semantic of pfn_valid().
> > > 
> > > "If pfn_valid() returns true, we can get proper struct page and the
> > > zone information,"
> > 
> > I do not see any guarantee about the zone information anywhere. In fact
> > this is not true with the original implementation as I've tried to
> > explain already. We do have new pages associated with a zone but that
> > association might change during the online phase. So you cannot really
> > rely on that information until the page is online. There is no real
> > change in that regards after my rework.
> 
> I know that what you did doesn't change thing much. What I try to say
> is that previous implementation related to pfn_valid() in hotplug is
> wrong. Please do not assume that hotplug implementation is correct and
> other pfn_valid() users are incorrect. There is no design document so
> I'm not sure which one is correct but assumption that pfn_valid() user
> can access whole the struct page information makes much sense to me.

Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
still need pfn_valid to work for those pfns. Really, pfn_valid has a
different meaning than you would like it to have. Who knows how many
others like that 

Re: your mail

2017-04-24 Thread Joonsoo Kim
On Mon, Apr 24, 2017 at 09:53:12AM +0200, Michal Hocko wrote:
> On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> > On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > Which pfn walkers you have in mind?
> > > > > > 
> > > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > > using pfn_valid().
> > > > > 
> > > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > > case where you do not really care about holes. It just checks the page
> > > > > count which is a valid information under any circumstances.
> > > > 
> > > > I don't think so. First, it checks the page *map* count. Is it still 
> > > > valid
> > > > even if PageReserved() is set?
> > > 
> > > I do not know about any user which would manipulate page map count for
> > > referenced pages. The core MM code doesn't.
> > 
> > That's weird that we can get *map* count without PageReserved() check,
> > but we cannot get zone information.
> > Zone information is more static information than map count.
> 
> As I've already pointed out the rework of the hotplug code is mainly
> about postponing the zone initialization from the physical hot add to
> the logical onlining. The zone is really not clear until that moment.
>  
> > It should be defined/documented in this time that what information in
> > the struct page is valid even if PageReserved() is set. And then, we
> > need to fix all the things based on this design decision.
> 
> Where would you suggest documenting this? We do have
> Documentation/memory-hotplug.txt but it is not really specific about
> struct page.

pfn_valid() in include/linux/mmzone.h looks proper place.

> 
> [...]
> 
> > > You are trying to change a semantic of something that has a well defined
> > > meaning. I disagree that we should change it. It might sound like a
> > > simpler thing to do because pfn walkers will have to be checked but what
> > > you are proposing is conflating two different things together.
> > 
> > I don't think that *I* try to change the semantic of pfn_valid().
> > It would be original semantic of pfn_valid().
> > 
> > "If pfn_valid() returns true, we can get proper struct page and the
> > zone information,"
> 
> I do not see any guarantee about the zone information anywhere. In fact
> this is not true with the original implementation as I've tried to
> explain already. We do have new pages associated with a zone but that
> association might change during the online phase. So you cannot really
> rely on that information until the page is online. There is no real
> change in that regards after my rework.

I know that what you did doesn't change thing much. What I try to say
is that previous implementation related to pfn_valid() in hotplug is
wrong. Please do not assume that hotplug implementation is correct and
other pfn_valid() users are incorrect. There is no design document so
I'm not sure which one is correct but assumption that pfn_valid() user
can access whole the struct page information makes much sense to me.
So, I hope that please fix hotplug implementation rather than
modifying each pfn_valid() users.

> 
> [...]
> > > So please do not conflate those two different concepts together. I
> > > believe that the most prominent pfn walkers should be covered now and
> > > others can be evaluated later.
> > 
> > Even if original pfn_valid()'s semantic is not the one that I mentioned,
> > I think that suggested semantic from me is better.
> > Only hotplug code need to be changed and others doesn't need to be changed.
> > There is no overhead for others. What's the problem about this approach?
> 
> That this would require to check _every_ single pfn_valid user in the
> kernel. That is beyond my time capacity and not really necessary because
> the current code already suffers from the same/similar class of
> problems.

I think that all the pfn_valid() user doesn't consider hole case.
Unlike your expectation, if your way is taken, it requires to check
_every_ pfn_valid() users.

Thanks.



Re: your mail

2017-04-24 Thread Joonsoo Kim
On Mon, Apr 24, 2017 at 09:53:12AM +0200, Michal Hocko wrote:
> On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> > On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > Which pfn walkers you have in mind?
> > > > > > 
> > > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > > using pfn_valid().
> > > > > 
> > > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > > case where you do not really care about holes. It just checks the page
> > > > > count which is a valid information under any circumstances.
> > > > 
> > > > I don't think so. First, it checks the page *map* count. Is it still 
> > > > valid
> > > > even if PageReserved() is set?
> > > 
> > > I do not know about any user which would manipulate page map count for
> > > referenced pages. The core MM code doesn't.
> > 
> > That's weird that we can get *map* count without PageReserved() check,
> > but we cannot get zone information.
> > Zone information is more static information than map count.
> 
> As I've already pointed out the rework of the hotplug code is mainly
> about postponing the zone initialization from the physical hot add to
> the logical onlining. The zone is really not clear until that moment.
>  
> > It should be defined/documented in this time that what information in
> > the struct page is valid even if PageReserved() is set. And then, we
> > need to fix all the things based on this design decision.
> 
> Where would you suggest documenting this? We do have
> Documentation/memory-hotplug.txt but it is not really specific about
> struct page.

pfn_valid() in include/linux/mmzone.h looks proper place.

> 
> [...]
> 
> > > You are trying to change a semantic of something that has a well defined
> > > meaning. I disagree that we should change it. It might sound like a
> > > simpler thing to do because pfn walkers will have to be checked but what
> > > you are proposing is conflating two different things together.
> > 
> > I don't think that *I* try to change the semantic of pfn_valid().
> > It would be original semantic of pfn_valid().
> > 
> > "If pfn_valid() returns true, we can get proper struct page and the
> > zone information,"
> 
> I do not see any guarantee about the zone information anywhere. In fact
> this is not true with the original implementation as I've tried to
> explain already. We do have new pages associated with a zone but that
> association might change during the online phase. So you cannot really
> rely on that information until the page is online. There is no real
> change in that regards after my rework.

I know that what you did doesn't change thing much. What I try to say
is that previous implementation related to pfn_valid() in hotplug is
wrong. Please do not assume that hotplug implementation is correct and
other pfn_valid() users are incorrect. There is no design document so
I'm not sure which one is correct but assumption that pfn_valid() user
can access whole the struct page information makes much sense to me.
So, I hope that please fix hotplug implementation rather than
modifying each pfn_valid() users.

> 
> [...]
> > > So please do not conflate those two different concepts together. I
> > > believe that the most prominent pfn walkers should be covered now and
> > > others can be evaluated later.
> > 
> > Even if original pfn_valid()'s semantic is not the one that I mentioned,
> > I think that suggested semantic from me is better.
> > Only hotplug code need to be changed and others doesn't need to be changed.
> > There is no overhead for others. What's the problem about this approach?
> 
> That this would require to check _every_ single pfn_valid user in the
> kernel. That is beyond my time capacity and not really necessary because
> the current code already suffers from the same/similar class of
> problems.

I think that all the pfn_valid() user doesn't consider hole case.
Unlike your expectation, if your way is taken, it requires to check
_every_ pfn_valid() users.

Thanks.



Re: your mail

2017-04-24 Thread Michal Hocko
On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > Which pfn walkers you have in mind?
> > > > > 
> > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > using pfn_valid().
> > > > 
> > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > case where you do not really care about holes. It just checks the page
> > > > count which is a valid information under any circumstances.
> > > 
> > > I don't think so. First, it checks the page *map* count. Is it still valid
> > > even if PageReserved() is set?
> > 
> > I do not know about any user which would manipulate page map count for
> > referenced pages. The core MM code doesn't.
> 
> That's weird that we can get *map* count without PageReserved() check,
> but we cannot get zone information.
> Zone information is more static information than map count.

As I've already pointed out the rework of the hotplug code is mainly
about postponing the zone initialization from the physical hot add to
the logical onlining. The zone is really not clear until that moment.
 
> It should be defined/documented in this time that what information in
> the struct page is valid even if PageReserved() is set. And then, we
> need to fix all the things based on this design decision.

Where would you suggest documenting this? We do have
Documentation/memory-hotplug.txt but it is not really specific about
struct page.

[...]

> > You are trying to change a semantic of something that has a well defined
> > meaning. I disagree that we should change it. It might sound like a
> > simpler thing to do because pfn walkers will have to be checked but what
> > you are proposing is conflating two different things together.
> 
> I don't think that *I* try to change the semantic of pfn_valid().
> It would be original semantic of pfn_valid().
> 
> "If pfn_valid() returns true, we can get proper struct page and the
> zone information,"

I do not see any guarantee about the zone information anywhere. In fact
this is not true with the original implementation as I've tried to
explain already. We do have new pages associated with a zone but that
association might change during the online phase. So you cannot really
rely on that information until the page is online. There is no real
change in that regards after my rework.

[...]
> > So please do not conflate those two different concepts together. I
> > believe that the most prominent pfn walkers should be covered now and
> > others can be evaluated later.
> 
> Even if original pfn_valid()'s semantic is not the one that I mentioned,
> I think that suggested semantic from me is better.
> Only hotplug code need to be changed and others doesn't need to be changed.
> There is no overhead for others. What's the problem about this approach?

That this would require to check _every_ single pfn_valid user in the
kernel. That is beyond my time capacity and not really necessary because
the current code already suffers from the same/similar class of
problems.
 
> And, I'm not sure that you covered the most prominent pfn walkers.
> Please see pagetypeinfo_showblockcount_print() in mm/vmstat.c.

I probably haven't (and will send a patch to fix this one - thanks for
pointing to it) but the point is they those are broken already and they
can be fixed in follow up patches. If you change pfn_valid you might
break an existing code in an unexpected ways.
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-24 Thread Michal Hocko
On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > Which pfn walkers you have in mind?
> > > > > 
> > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > using pfn_valid().
> > > > 
> > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > case where you do not really care about holes. It just checks the page
> > > > count which is a valid information under any circumstances.
> > > 
> > > I don't think so. First, it checks the page *map* count. Is it still valid
> > > even if PageReserved() is set?
> > 
> > I do not know about any user which would manipulate page map count for
> > referenced pages. The core MM code doesn't.
> 
> That's weird that we can get *map* count without PageReserved() check,
> but we cannot get zone information.
> Zone information is more static information than map count.

As I've already pointed out the rework of the hotplug code is mainly
about postponing the zone initialization from the physical hot add to
the logical onlining. The zone is really not clear until that moment.
 
> It should be defined/documented in this time that what information in
> the struct page is valid even if PageReserved() is set. And then, we
> need to fix all the things based on this design decision.

Where would you suggest documenting this? We do have
Documentation/memory-hotplug.txt but it is not really specific about
struct page.

[...]

> > You are trying to change a semantic of something that has a well defined
> > meaning. I disagree that we should change it. It might sound like a
> > simpler thing to do because pfn walkers will have to be checked but what
> > you are proposing is conflating two different things together.
> 
> I don't think that *I* try to change the semantic of pfn_valid().
> It would be original semantic of pfn_valid().
> 
> "If pfn_valid() returns true, we can get proper struct page and the
> zone information,"

I do not see any guarantee about the zone information anywhere. In fact
this is not true with the original implementation as I've tried to
explain already. We do have new pages associated with a zone but that
association might change during the online phase. So you cannot really
rely on that information until the page is online. There is no real
change in that regards after my rework.

[...]
> > So please do not conflate those two different concepts together. I
> > believe that the most prominent pfn walkers should be covered now and
> > others can be evaluated later.
> 
> Even if original pfn_valid()'s semantic is not the one that I mentioned,
> I think that suggested semantic from me is better.
> Only hotplug code need to be changed and others doesn't need to be changed.
> There is no overhead for others. What's the problem about this approach?

That this would require to check _every_ single pfn_valid user in the
kernel. That is beyond my time capacity and not really necessary because
the current code already suffers from the same/similar class of
problems.
 
> And, I'm not sure that you covered the most prominent pfn walkers.
> Please see pagetypeinfo_showblockcount_print() in mm/vmstat.c.

I probably haven't (and will send a patch to fix this one - thanks for
pointing to it) but the point is they those are broken already and they
can be fixed in follow up patches. If you change pfn_valid you might
break an existing code in an unexpected ways.
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-23 Thread Joonsoo Kim
On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Which pfn walkers you have in mind?
> > > > 
> > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > using pfn_valid().
> > > 
> > > Yeah, I've checked that one and in fact this is a good example of the
> > > case where you do not really care about holes. It just checks the page
> > > count which is a valid information under any circumstances.
> > 
> > I don't think so. First, it checks the page *map* count. Is it still valid
> > even if PageReserved() is set?
> 
> I do not know about any user which would manipulate page map count for
> referenced pages. The core MM code doesn't.

That's weird that we can get *map* count without PageReserved() check,
but we cannot get zone information.
Zone information is more static information than map count.

It should be defined/documented in this time that what information in
the struct page is valid even if PageReserved() is set. And then, we
need to fix all the things based on this design decision.

> 
> > What I'd like to ask in this example is
> > that what information is valid if PageReserved() is set. Is there any
> > design document on this? I think that we need to define/document it first.
> 
> NO, it is not AFAIK.
> 
> [...]
> > > OK, fair enough. I did't consider memblock allocations. I will rethink
> > > this patch but there are essentially 3 options
> > >   - use a different criterion for the offline holes dection. I
> > > have just realized we might do it by storing the online
> > > information into the mem sections
> > >   - drop this patch
> > >   - move the PageReferenced check down the chain into
> > > isolate_freepages_block resp. isolate_migratepages_block
> > > 
> > > I would prefer 3 over 2 over 1. I definitely want to make this more
> > > robust so 1 is preferable long term but I do not want this to be a
> > > roadblock to the rest of the rework. Does that sound acceptable to you?
> > 
> > I like #1 among of above options and I already see your patch for #1.
> > It's much better than your first attempt but I'm still not happy due
> > to the semantic of pfn_valid().
> 
> You are trying to change a semantic of something that has a well defined
> meaning. I disagree that we should change it. It might sound like a
> simpler thing to do because pfn walkers will have to be checked but what
> you are proposing is conflating two different things together.

I don't think that *I* try to change the semantic of pfn_valid().
It would be original semantic of pfn_valid().

"If pfn_valid() returns true, we can get proper struct page and the
zone information,"

That situation is now being changed by your patch *hotplug rework*.

"Even if pfn_valid() returns true, we cannot get the zone information
without PageReserved() check, since *zone is determined during
onlining* and pfn_valid() return true after adding the memory."

> 
> > > [..]
> > > > Let me clarify my desire(?) for this issue.
> > > > 
> > > > 1. If pfn_valid() returns true, struct page has valid information, at
> > > > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > > > without checking PageResereved().
> > > 
> > > This is no longer true after my rework. Pages are associated with the
> > > zone during _onlining_ rather than when they are physically hotpluged.
> > 
> > If your rework make information valid during _onlining_, my
> > suggestion is making pfn_valid() return false until onlining.
> > 
> > Caller of pfn_valid() expects that they can get valid information from
> > the struct page. There is no reason to access the struct page if they
> > can't get valid information from it. So, passing pfn_valid() should
> > guarantee that, at least, some kind of information is valid.
> > 
> > If pfn_valid() doesn't guarantee it, most of the pfn walker should
> > check PageResereved() to make sure that validity of information from
> > the struct page.
> 
> This is true only for those walkers which really depend on the full
> initialization. This is not the case for all of them. I do not see any
> reason to introduce another _pfn_valid to just check whether there is a
> struct page...

It's really confusing concept that only some information is valid for
*not* fully initialized struct page. Even, there is no document that
what information is valid for this half-initialized struct page.

Better design would be that we regard that every information is
invalid for half-initialized struct page. In this case, it's natural
to make pfn_valid() returns false for this half-initialized struct page.

>  
> So please do not conflate those two different concepts together. I
> believe that the most prominent pfn walkers should be covered now 

Re: your mail

2017-04-23 Thread Joonsoo Kim
On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Which pfn walkers you have in mind?
> > > > 
> > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > using pfn_valid().
> > > 
> > > Yeah, I've checked that one and in fact this is a good example of the
> > > case where you do not really care about holes. It just checks the page
> > > count which is a valid information under any circumstances.
> > 
> > I don't think so. First, it checks the page *map* count. Is it still valid
> > even if PageReserved() is set?
> 
> I do not know about any user which would manipulate page map count for
> referenced pages. The core MM code doesn't.

That's weird that we can get *map* count without PageReserved() check,
but we cannot get zone information.
Zone information is more static information than map count.

It should be defined/documented in this time that what information in
the struct page is valid even if PageReserved() is set. And then, we
need to fix all the things based on this design decision.

> 
> > What I'd like to ask in this example is
> > that what information is valid if PageReserved() is set. Is there any
> > design document on this? I think that we need to define/document it first.
> 
> NO, it is not AFAIK.
> 
> [...]
> > > OK, fair enough. I did't consider memblock allocations. I will rethink
> > > this patch but there are essentially 3 options
> > >   - use a different criterion for the offline holes dection. I
> > > have just realized we might do it by storing the online
> > > information into the mem sections
> > >   - drop this patch
> > >   - move the PageReferenced check down the chain into
> > > isolate_freepages_block resp. isolate_migratepages_block
> > > 
> > > I would prefer 3 over 2 over 1. I definitely want to make this more
> > > robust so 1 is preferable long term but I do not want this to be a
> > > roadblock to the rest of the rework. Does that sound acceptable to you?
> > 
> > I like #1 among of above options and I already see your patch for #1.
> > It's much better than your first attempt but I'm still not happy due
> > to the semantic of pfn_valid().
> 
> You are trying to change a semantic of something that has a well defined
> meaning. I disagree that we should change it. It might sound like a
> simpler thing to do because pfn walkers will have to be checked but what
> you are proposing is conflating two different things together.

I don't think that *I* try to change the semantic of pfn_valid().
It would be original semantic of pfn_valid().

"If pfn_valid() returns true, we can get proper struct page and the
zone information,"

That situation is now being changed by your patch *hotplug rework*.

"Even if pfn_valid() returns true, we cannot get the zone information
without PageReserved() check, since *zone is determined during
onlining* and pfn_valid() return true after adding the memory."

> 
> > > [..]
> > > > Let me clarify my desire(?) for this issue.
> > > > 
> > > > 1. If pfn_valid() returns true, struct page has valid information, at
> > > > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > > > without checking PageResereved().
> > > 
> > > This is no longer true after my rework. Pages are associated with the
> > > zone during _onlining_ rather than when they are physically hotpluged.
> > 
> > If your rework make information valid during _onlining_, my
> > suggestion is making pfn_valid() return false until onlining.
> > 
> > Caller of pfn_valid() expects that they can get valid information from
> > the struct page. There is no reason to access the struct page if they
> > can't get valid information from it. So, passing pfn_valid() should
> > guarantee that, at least, some kind of information is valid.
> > 
> > If pfn_valid() doesn't guarantee it, most of the pfn walker should
> > check PageResereved() to make sure that validity of information from
> > the struct page.
> 
> This is true only for those walkers which really depend on the full
> initialization. This is not the case for all of them. I do not see any
> reason to introduce another _pfn_valid to just check whether there is a
> struct page...

It's really confusing concept that only some information is valid for
*not* fully initialized struct page. Even, there is no document that
what information is valid for this half-initialized struct page.

Better design would be that we regard that every information is
invalid for half-initialized struct page. In this case, it's natural
to make pfn_valid() returns false for this half-initialized struct page.

>  
> So please do not conflate those two different concepts together. I
> believe that the most prominent pfn walkers should be covered now 

Re: your mail

2017-04-21 Thread Michal Hocko
On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > [...]
> > > > Which pfn walkers you have in mind?
> > > 
> > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > using pfn_valid().
> > 
> > Yeah, I've checked that one and in fact this is a good example of the
> > case where you do not really care about holes. It just checks the page
> > count which is a valid information under any circumstances.
> 
> I don't think so. First, it checks the page *map* count. Is it still valid
> even if PageReserved() is set?

I do not know about any user which would manipulate page map count for
referenced pages. The core MM code doesn't.

> What I'd like to ask in this example is
> that what information is valid if PageReserved() is set. Is there any
> design document on this? I think that we need to define/document it first.

NO, it is not AFAIK.

[...]
> > OK, fair enough. I did't consider memblock allocations. I will rethink
> > this patch but there are essentially 3 options
> > - use a different criterion for the offline holes dection. I
> >   have just realized we might do it by storing the online
> >   information into the mem sections
> > - drop this patch
> > - move the PageReferenced check down the chain into
> >   isolate_freepages_block resp. isolate_migratepages_block
> > 
> > I would prefer 3 over 2 over 1. I definitely want to make this more
> > robust so 1 is preferable long term but I do not want this to be a
> > roadblock to the rest of the rework. Does that sound acceptable to you?
> 
> I like #1 among of above options and I already see your patch for #1.
> It's much better than your first attempt but I'm still not happy due
> to the semantic of pfn_valid().

You are trying to change a semantic of something that has a well defined
meaning. I disagree that we should change it. It might sound like a
simpler thing to do because pfn walkers will have to be checked but what
you are proposing is conflating two different things together.

> > [..]
> > > Let me clarify my desire(?) for this issue.
> > > 
> > > 1. If pfn_valid() returns true, struct page has valid information, at
> > > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > > without checking PageResereved().
> > 
> > This is no longer true after my rework. Pages are associated with the
> > zone during _onlining_ rather than when they are physically hotpluged.
> 
> If your rework make information valid during _onlining_, my
> suggestion is making pfn_valid() return false until onlining.
> 
> Caller of pfn_valid() expects that they can get valid information from
> the struct page. There is no reason to access the struct page if they
> can't get valid information from it. So, passing pfn_valid() should
> guarantee that, at least, some kind of information is valid.
> 
> If pfn_valid() doesn't guarantee it, most of the pfn walker should
> check PageResereved() to make sure that validity of information from
> the struct page.

This is true only for those walkers which really depend on the full
initialization. This is not the case for all of them. I do not see any
reason to introduce another _pfn_valid to just check whether there is a
struct page...
 
So please do not conflate those two different concepts together. I
believe that the most prominent pfn walkers should be covered now and
others can be evaluated later.
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-21 Thread Michal Hocko
On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > [...]
> > > > Which pfn walkers you have in mind?
> > > 
> > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > using pfn_valid().
> > 
> > Yeah, I've checked that one and in fact this is a good example of the
> > case where you do not really care about holes. It just checks the page
> > count which is a valid information under any circumstances.
> 
> I don't think so. First, it checks the page *map* count. Is it still valid
> even if PageReserved() is set?

I do not know about any user which would manipulate page map count for
referenced pages. The core MM code doesn't.

> What I'd like to ask in this example is
> that what information is valid if PageReserved() is set. Is there any
> design document on this? I think that we need to define/document it first.

NO, it is not AFAIK.

[...]
> > OK, fair enough. I did't consider memblock allocations. I will rethink
> > this patch but there are essentially 3 options
> > - use a different criterion for the offline holes dection. I
> >   have just realized we might do it by storing the online
> >   information into the mem sections
> > - drop this patch
> > - move the PageReferenced check down the chain into
> >   isolate_freepages_block resp. isolate_migratepages_block
> > 
> > I would prefer 3 over 2 over 1. I definitely want to make this more
> > robust so 1 is preferable long term but I do not want this to be a
> > roadblock to the rest of the rework. Does that sound acceptable to you?
> 
> I like #1 among of above options and I already see your patch for #1.
> It's much better than your first attempt but I'm still not happy due
> to the semantic of pfn_valid().

You are trying to change a semantic of something that has a well defined
meaning. I disagree that we should change it. It might sound like a
simpler thing to do because pfn walkers will have to be checked but what
you are proposing is conflating two different things together.

> > [..]
> > > Let me clarify my desire(?) for this issue.
> > > 
> > > 1. If pfn_valid() returns true, struct page has valid information, at
> > > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > > without checking PageResereved().
> > 
> > This is no longer true after my rework. Pages are associated with the
> > zone during _onlining_ rather than when they are physically hotpluged.
> 
> If your rework make information valid during _onlining_, my
> suggestion is making pfn_valid() return false until onlining.
> 
> Caller of pfn_valid() expects that they can get valid information from
> the struct page. There is no reason to access the struct page if they
> can't get valid information from it. So, passing pfn_valid() should
> guarantee that, at least, some kind of information is valid.
> 
> If pfn_valid() doesn't guarantee it, most of the pfn walker should
> check PageResereved() to make sure that validity of information from
> the struct page.

This is true only for those walkers which really depend on the full
initialization. This is not the case for all of them. I do not see any
reason to introduce another _pfn_valid to just check whether there is a
struct page...
 
So please do not conflate those two different concepts together. I
believe that the most prominent pfn walkers should be covered now and
others can be evaluated later.
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-20 Thread Joonsoo Kim
On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> [...]
> > > Which pfn walkers you have in mind?
> > 
> > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > using pfn_valid().
> 
> Yeah, I've checked that one and in fact this is a good example of the
> case where you do not really care about holes. It just checks the page
> count which is a valid information under any circumstances.

I don't think so. First, it checks the page *map* count. Is it still valid
even if PageReserved() is set? What I'd like to ask in this example is
that what information is valid if PageReserved() is set. Is there any
design document on this? I think that we need to define/document it first.

And, I hope that all the information in flags field is valid in all
cases if pfn_valid() return true. By the design.

This makes all the exsiting pfn walkers happy since we don't need an
additional check for PageReserved().

> 
> > > > The other problem I found is that your change will makes some
> > > > contiguous zones to be considered as non-contiguous. Memory allocated
> > > > by memblock API is also marked as PageResereved. If we consider this as
> > > > a hole, we will set such a zone as non-contiguous.
> > > 
> > > Why would that be a problem? We shouldn't touch those pages anyway?
> > 
> > Skipping those pages in compaction are valid so no problem in this
> > case.
> > 
> > The problem I mentioned above is that adding PageReserved() check in
> > __pageblock_pfn_to_page() invalidates optimization by
> > set_zone_contiguous(). In compaction, we need to get a valid struct
> > page and it requires a lot of work. There is performance problem
> > report due to this so set_zone_contiguous() optimization is added. It
> > checks if the zone is contiguous or not in boot time. If zone is
> > determined as contiguous, we can easily get a valid struct page in
> > runtime without expensive checks.
> 
> OK, I see. I've had some vague understading and the clarification helps.
> 
> > Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> > woule make that zone->contiguous usually returns false since memory
> > used by memblock API is marked as PageReserved() and your patch regard
> > it as a hole. It invalidates set_zone_contiguous() optimization and I
> > worry about it.
> 
> OK, fair enough. I did't consider memblock allocations. I will rethink
> this patch but there are essentially 3 options
>   - use a different criterion for the offline holes dection. I
> have just realized we might do it by storing the online
> information into the mem sections
>   - drop this patch
>   - move the PageReferenced check down the chain into
> isolate_freepages_block resp. isolate_migratepages_block
> 
> I would prefer 3 over 2 over 1. I definitely want to make this more
> robust so 1 is preferable long term but I do not want this to be a
> roadblock to the rest of the rework. Does that sound acceptable to you?

I like #1 among of above options and I already see your patch for #1.
It's much better than your first attempt but I'm still not happy due
to the semantic of pfn_valid().

> [..]
> > Let me clarify my desire(?) for this issue.
> > 
> > 1. If pfn_valid() returns true, struct page has valid information, at
> > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > without checking PageResereved().
> 
> This is no longer true after my rework. Pages are associated with the
> zone during _onlining_ rather than when they are physically hotpluged.

If your rework make information valid during _onlining_, my
suggestion is making pfn_valid() return false until onlining.

Caller of pfn_valid() expects that they can get valid information from
the struct page. There is no reason to access the struct page if they
can't get valid information from it. So, passing pfn_valid() should
guarantee that, at least, some kind of information is valid.

If pfn_valid() doesn't guarantee it, most of the pfn walker should
check PageResereved() to make sure that validity of information from
the struct page.

> Basically only the nid is set properly. Strictly speaking this is the
> case also without my rework because the zone might change during online
> phase so you cannot assume it is correct even now. It just happens that
> it more or less works just fine.
>
> > 2. pfn_valid() for offlined holes returns false. This can be easily
> > (?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
> > guess that there is no reason that pfn_valid() returns true for
> > offlined holes. If there is, please let me know.
> 
> There is some code which really expects that pfn_valid returns true iff
> there is a struct page and it doesn't care about the online status.
> E.g. hotplug code itself so no, we cannot change pfn_valid. What we can
> do though is to add 

Re: your mail

2017-04-20 Thread Joonsoo Kim
On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> [...]
> > > Which pfn walkers you have in mind?
> > 
> > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > using pfn_valid().
> 
> Yeah, I've checked that one and in fact this is a good example of the
> case where you do not really care about holes. It just checks the page
> count which is a valid information under any circumstances.

I don't think so. First, it checks the page *map* count. Is it still valid
even if PageReserved() is set? What I'd like to ask in this example is
that what information is valid if PageReserved() is set. Is there any
design document on this? I think that we need to define/document it first.

And, I hope that all the information in flags field is valid in all
cases if pfn_valid() return true. By the design.

This makes all the exsiting pfn walkers happy since we don't need an
additional check for PageReserved().

> 
> > > > The other problem I found is that your change will makes some
> > > > contiguous zones to be considered as non-contiguous. Memory allocated
> > > > by memblock API is also marked as PageResereved. If we consider this as
> > > > a hole, we will set such a zone as non-contiguous.
> > > 
> > > Why would that be a problem? We shouldn't touch those pages anyway?
> > 
> > Skipping those pages in compaction are valid so no problem in this
> > case.
> > 
> > The problem I mentioned above is that adding PageReserved() check in
> > __pageblock_pfn_to_page() invalidates optimization by
> > set_zone_contiguous(). In compaction, we need to get a valid struct
> > page and it requires a lot of work. There is performance problem
> > report due to this so set_zone_contiguous() optimization is added. It
> > checks if the zone is contiguous or not in boot time. If zone is
> > determined as contiguous, we can easily get a valid struct page in
> > runtime without expensive checks.
> 
> OK, I see. I've had some vague understading and the clarification helps.
> 
> > Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> > woule make that zone->contiguous usually returns false since memory
> > used by memblock API is marked as PageReserved() and your patch regard
> > it as a hole. It invalidates set_zone_contiguous() optimization and I
> > worry about it.
> 
> OK, fair enough. I did't consider memblock allocations. I will rethink
> this patch but there are essentially 3 options
>   - use a different criterion for the offline holes dection. I
> have just realized we might do it by storing the online
> information into the mem sections
>   - drop this patch
>   - move the PageReferenced check down the chain into
> isolate_freepages_block resp. isolate_migratepages_block
> 
> I would prefer 3 over 2 over 1. I definitely want to make this more
> robust so 1 is preferable long term but I do not want this to be a
> roadblock to the rest of the rework. Does that sound acceptable to you?

I like #1 among of above options and I already see your patch for #1.
It's much better than your first attempt but I'm still not happy due
to the semantic of pfn_valid().

> [..]
> > Let me clarify my desire(?) for this issue.
> > 
> > 1. If pfn_valid() returns true, struct page has valid information, at
> > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > without checking PageResereved().
> 
> This is no longer true after my rework. Pages are associated with the
> zone during _onlining_ rather than when they are physically hotpluged.

If your rework make information valid during _onlining_, my
suggestion is making pfn_valid() return false until onlining.

Caller of pfn_valid() expects that they can get valid information from
the struct page. There is no reason to access the struct page if they
can't get valid information from it. So, passing pfn_valid() should
guarantee that, at least, some kind of information is valid.

If pfn_valid() doesn't guarantee it, most of the pfn walker should
check PageResereved() to make sure that validity of information from
the struct page.

> Basically only the nid is set properly. Strictly speaking this is the
> case also without my rework because the zone might change during online
> phase so you cannot assume it is correct even now. It just happens that
> it more or less works just fine.
>
> > 2. pfn_valid() for offlined holes returns false. This can be easily
> > (?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
> > guess that there is no reason that pfn_valid() returns true for
> > offlined holes. If there is, please let me know.
> 
> There is some code which really expects that pfn_valid returns true iff
> there is a struct page and it doesn't care about the online status.
> E.g. hotplug code itself so no, we cannot change pfn_valid. What we can
> do though is to add 

Re: your mail

2017-04-20 Thread Michal Hocko
On Thu 20-04-17 13:56:34, Vlastimil Babka wrote:
> On 04/20/2017 10:49 AM, Michal Hocko wrote:
> > On Thu 20-04-17 09:28:20, Michal Hocko wrote:
> >> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > [...]
> >>> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> >>> woule make that zone->contiguous usually returns false since memory
> >>> used by memblock API is marked as PageReserved() and your patch regard
> >>> it as a hole. It invalidates set_zone_contiguous() optimization and I
> >>> worry about it.
> >>
> >> OK, fair enough. I did't consider memblock allocations. I will rethink
> >> this patch but there are essentially 3 options
> >>- use a different criterion for the offline holes dection. I
> >>  have just realized we might do it by storing the online
> >>  information into the mem sections
> >>- drop this patch
> >>- move the PageReferenced check down the chain into
> >>  isolate_freepages_block resp. isolate_migratepages_block
> >>
> >> I would prefer 3 over 2 over 1. I definitely want to make this more
> >> robust so 1 is preferable long term but I do not want this to be a
> >> roadblock to the rest of the rework. Does that sound acceptable to you?
> > 
> > So I've played with all three options just to see how the outcome would
> > look like and it turned out that going with 1 will be easiest in the
> > end. What do you think about the following? It should be free of any 
> > false positives. I have only compile tested it yet.
> 
> That looks fine, can't say immediately if fully correct. I think you'll
> need to bump SECTION_NID_SHIFT as well and make sure things still fit?
> Otherwise looks like nobody needed a new section bit since 2005, so we
> should be fine.

You are absolutely right. Thanks for spotting this! I have folded this
in

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 611ff869fa4d..c412e6a3a1e9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,7 +1166,7 @@ extern unsigned long usemap_size(void);
 #define SECTION_IS_ONLINE  (1UL<<2)
 #define SECTION_MAP_LAST_BIT   (1UL<<3)
 #define SECTION_MAP_MASK   (~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT  2
+#define SECTION_NID_SHIFT  3
 
 static inline struct page *__section_mem_map_addr(struct mem_section *section)
 {
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-20 Thread Michal Hocko
On Thu 20-04-17 13:56:34, Vlastimil Babka wrote:
> On 04/20/2017 10:49 AM, Michal Hocko wrote:
> > On Thu 20-04-17 09:28:20, Michal Hocko wrote:
> >> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > [...]
> >>> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> >>> woule make that zone->contiguous usually returns false since memory
> >>> used by memblock API is marked as PageReserved() and your patch regard
> >>> it as a hole. It invalidates set_zone_contiguous() optimization and I
> >>> worry about it.
> >>
> >> OK, fair enough. I did't consider memblock allocations. I will rethink
> >> this patch but there are essentially 3 options
> >>- use a different criterion for the offline holes dection. I
> >>  have just realized we might do it by storing the online
> >>  information into the mem sections
> >>- drop this patch
> >>- move the PageReferenced check down the chain into
> >>  isolate_freepages_block resp. isolate_migratepages_block
> >>
> >> I would prefer 3 over 2 over 1. I definitely want to make this more
> >> robust so 1 is preferable long term but I do not want this to be a
> >> roadblock to the rest of the rework. Does that sound acceptable to you?
> > 
> > So I've played with all three options just to see how the outcome would
> > look like and it turned out that going with 1 will be easiest in the
> > end. What do you think about the following? It should be free of any 
> > false positives. I have only compile tested it yet.
> 
> That looks fine, can't say immediately if fully correct. I think you'll
> need to bump SECTION_NID_SHIFT as well and make sure things still fit?
> Otherwise looks like nobody needed a new section bit since 2005, so we
> should be fine.

You are absolutely right. Thanks for spotting this! I have folded this
in

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 611ff869fa4d..c412e6a3a1e9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,7 +1166,7 @@ extern unsigned long usemap_size(void);
 #define SECTION_IS_ONLINE  (1UL<<2)
 #define SECTION_MAP_LAST_BIT   (1UL<<3)
 #define SECTION_MAP_MASK   (~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT  2
+#define SECTION_NID_SHIFT  3
 
 static inline struct page *__section_mem_map_addr(struct mem_section *section)
 {
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-20 Thread Vlastimil Babka
On 04/20/2017 10:49 AM, Michal Hocko wrote:
> On Thu 20-04-17 09:28:20, Michal Hocko wrote:
>> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> [...]
>>> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
>>> woule make that zone->contiguous usually returns false since memory
>>> used by memblock API is marked as PageReserved() and your patch regard
>>> it as a hole. It invalidates set_zone_contiguous() optimization and I
>>> worry about it.
>>
>> OK, fair enough. I did't consider memblock allocations. I will rethink
>> this patch but there are essentially 3 options
>>  - use a different criterion for the offline holes dection. I
>>have just realized we might do it by storing the online
>>information into the mem sections
>>  - drop this patch
>>  - move the PageReferenced check down the chain into
>>isolate_freepages_block resp. isolate_migratepages_block
>>
>> I would prefer 3 over 2 over 1. I definitely want to make this more
>> robust so 1 is preferable long term but I do not want this to be a
>> roadblock to the rest of the rework. Does that sound acceptable to you?
> 
> So I've played with all three options just to see how the outcome would
> look like and it turned out that going with 1 will be easiest in the
> end. What do you think about the following? It should be free of any 
> false positives. I have only compile tested it yet.

That looks fine, can't say immediately if fully correct. I think you'll
need to bump SECTION_NID_SHIFT as well and make sure things still fit?
Otherwise looks like nobody needed a new section bit since 2005, so we
should be fine.

> ---
> From 747794c13c0e82b55b793a31cdbe1a84ee1c6920 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Thu, 13 Apr 2017 10:28:45 +0200
> Subject: [PATCH] mm: consider zone which is not fully populated to have holes
> 
> __pageblock_pfn_to_page has two users currently, set_zone_contiguous
> which checks whether the given zone contains holes and
> pageblock_pfn_to_page which then carefully returns a first valid
> page from the given pfn range for the given zone. This doesn't handle
> zones which are not fully populated though. Memory pageblocks can be
> offlined or might not have been onlined yet. In such a case the zone
> should be considered to have holes otherwise pfn walkers can touch
> and play with offline pages.
> 
> Current callers of pageblock_pfn_to_page in compaction seem to work
> properly right now because they only isolate PageBuddy
> (isolate_freepages_block) or PageLRU resp. __PageMovable
> (isolate_migratepages_block) which will be always false for these pages.
> It would be safer to skip these pages altogether, though.
> 
> In order to do this patch adds a new memory section state
> (SECTION_IS_ONLINE) which is set in memory_present (during boot
> time) or in online_pages_range during the memory hotplug. Similarly
> offline_mem_sections clears the bit and it is called when the memory
> range is offlined.
> 
> pfn_to_online_page helper is then added which check the mem section and
> only returns a page if it is onlined already.
> 
> Use the new helper in __pageblock_pfn_to_page and skip the whole page
> block in such a case.
> 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/memory_hotplug.h | 21 
>  include/linux/mmzone.h | 20 ++-
>  mm/memory_hotplug.c|  3 +++
>  mm/page_alloc.c|  5 -
>  mm/sparse.c| 45 
> +-
>  5 files changed, 91 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 3c8cf86201c3..fc1c873504eb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -14,6 +14,19 @@ struct memory_block;
>  struct resource;
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +#define pfn_to_online_page(pfn)  \
> +({   \
> + struct page *___page = NULL;\
> + \
> + if (online_section_nr(pfn_to_section_nr(pfn)))  \
> + ___page = pfn_to_page(pfn); \
> + ___page;\
> +})
>  
>  /*
>   * Types for free bootmem stored in page->lru.next. These have to be in
> @@ -203,6 +216,14 @@ extern void set_zone_contiguous(struct zone *zone);
>  extern void clear_zone_contiguous(struct zone *zone);
>  
>  #else /* ! CONFIG_MEMORY_HOTPLUG */
> +#define pfn_to_online_page(pfn)  \
> +({   \
> + struct page *___page = NULL;\
> +   

Re: your mail

2017-04-20 Thread Vlastimil Babka
On 04/20/2017 10:49 AM, Michal Hocko wrote:
> On Thu 20-04-17 09:28:20, Michal Hocko wrote:
>> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> [...]
>>> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
>>> woule make that zone->contiguous usually returns false since memory
>>> used by memblock API is marked as PageReserved() and your patch regard
>>> it as a hole. It invalidates set_zone_contiguous() optimization and I
>>> worry about it.
>>
>> OK, fair enough. I did't consider memblock allocations. I will rethink
>> this patch but there are essentially 3 options
>>  - use a different criterion for the offline holes dection. I
>>have just realized we might do it by storing the online
>>information into the mem sections
>>  - drop this patch
>>  - move the PageReferenced check down the chain into
>>isolate_freepages_block resp. isolate_migratepages_block
>>
>> I would prefer 3 over 2 over 1. I definitely want to make this more
>> robust so 1 is preferable long term but I do not want this to be a
>> roadblock to the rest of the rework. Does that sound acceptable to you?
> 
> So I've played with all three options just to see how the outcome would
> look like and it turned out that going with 1 will be easiest in the
> end. What do you think about the following? It should be free of any 
> false positives. I have only compile tested it yet.

That looks fine, can't say immediately if fully correct. I think you'll
need to bump SECTION_NID_SHIFT as well and make sure things still fit?
Otherwise looks like nobody needed a new section bit since 2005, so we
should be fine.

> ---
> From 747794c13c0e82b55b793a31cdbe1a84ee1c6920 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Thu, 13 Apr 2017 10:28:45 +0200
> Subject: [PATCH] mm: consider zone which is not fully populated to have holes
> 
> __pageblock_pfn_to_page has two users currently, set_zone_contiguous
> which checks whether the given zone contains holes and
> pageblock_pfn_to_page which then carefully returns a first valid
> page from the given pfn range for the given zone. This doesn't handle
> zones which are not fully populated though. Memory pageblocks can be
> offlined or might not have been onlined yet. In such a case the zone
> should be considered to have holes otherwise pfn walkers can touch
> and play with offline pages.
> 
> Current callers of pageblock_pfn_to_page in compaction seem to work
> properly right now because they only isolate PageBuddy
> (isolate_freepages_block) or PageLRU resp. __PageMovable
> (isolate_migratepages_block) which will be always false for these pages.
> It would be safer to skip these pages altogether, though.
> 
> In order to do this patch adds a new memory section state
> (SECTION_IS_ONLINE) which is set in memory_present (during boot
> time) or in online_pages_range during the memory hotplug. Similarly
> offline_mem_sections clears the bit and it is called when the memory
> range is offlined.
> 
> pfn_to_online_page helper is then added which check the mem section and
> only returns a page if it is onlined already.
> 
> Use the new helper in __pageblock_pfn_to_page and skip the whole page
> block in such a case.
> 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/memory_hotplug.h | 21 
>  include/linux/mmzone.h | 20 ++-
>  mm/memory_hotplug.c|  3 +++
>  mm/page_alloc.c|  5 -
>  mm/sparse.c| 45 
> +-
>  5 files changed, 91 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 3c8cf86201c3..fc1c873504eb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -14,6 +14,19 @@ struct memory_block;
>  struct resource;
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +#define pfn_to_online_page(pfn)  \
> +({   \
> + struct page *___page = NULL;\
> + \
> + if (online_section_nr(pfn_to_section_nr(pfn)))  \
> + ___page = pfn_to_page(pfn); \
> + ___page;\
> +})
>  
>  /*
>   * Types for free bootmem stored in page->lru.next. These have to be in
> @@ -203,6 +216,14 @@ extern void set_zone_contiguous(struct zone *zone);
>  extern void clear_zone_contiguous(struct zone *zone);
>  
>  #else /* ! CONFIG_MEMORY_HOTPLUG */
> +#define pfn_to_online_page(pfn)  \
> +({   \
> + struct page *___page = NULL;\
> + if (pfn_valid(pfn)) 

Re: your mail

2017-04-20 Thread Michal Hocko
On Thu 20-04-17 09:28:20, Michal Hocko wrote:
> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
[...]
> > Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> > woule make that zone->contiguous usually returns false since memory
> > used by memblock API is marked as PageReserved() and your patch regard
> > it as a hole. It invalidates set_zone_contiguous() optimization and I
> > worry about it.
> 
> OK, fair enough. I did't consider memblock allocations. I will rethink
> this patch but there are essentially 3 options
>   - use a different criterion for the offline holes dection. I
> have just realized we might do it by storing the online
> information into the mem sections
>   - drop this patch
>   - move the PageReferenced check down the chain into
> isolate_freepages_block resp. isolate_migratepages_block
> 
> I would prefer 3 over 2 over 1. I definitely want to make this more
> robust so 1 is preferable long term but I do not want this to be a
> roadblock to the rest of the rework. Does that sound acceptable to you?

So I've played with all three options just to see how the outcome would
look like and it turned out that going with 1 will be easiest in the
end. What do you think about the following? It should be free of any 
false positives. I have only compile tested it yet.
---
>From 747794c13c0e82b55b793a31cdbe1a84ee1c6920 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 13 Apr 2017 10:28:45 +0200
Subject: [PATCH] mm: consider zone which is not fully populated to have holes

__pageblock_pfn_to_page has two users currently, set_zone_contiguous
which checks whether the given zone contains holes and
pageblock_pfn_to_page which then carefully returns a first valid
page from the given pfn range for the given zone. This doesn't handle
zones which are not fully populated though. Memory pageblocks can be
offlined or might not have been onlined yet. In such a case the zone
should be considered to have holes otherwise pfn walkers can touch
and play with offline pages.

Current callers of pageblock_pfn_to_page in compaction seem to work
properly right now because they only isolate PageBuddy
(isolate_freepages_block) or PageLRU resp. __PageMovable
(isolate_migratepages_block) which will be always false for these pages.
It would be safer to skip these pages altogether, though.

In order to do this patch adds a new memory section state
(SECTION_IS_ONLINE) which is set in memory_present (during boot
time) or in online_pages_range during the memory hotplug. Similarly
offline_mem_sections clears the bit and it is called when the memory
range is offlined.

pfn_to_online_page helper is then added which check the mem section and
only returns a page if it is onlined already.

Use the new helper in __pageblock_pfn_to_page and skip the whole page
block in such a case.

Signed-off-by: Michal Hocko 
---
 include/linux/memory_hotplug.h | 21 
 include/linux/mmzone.h | 20 ++-
 mm/memory_hotplug.c|  3 +++
 mm/page_alloc.c|  5 -
 mm/sparse.c| 45 +-
 5 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3c8cf86201c3..fc1c873504eb 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -14,6 +14,19 @@ struct memory_block;
 struct resource;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * Return page for the valid pfn only if the page is online. All pfn
+ * walkers which rely on the fully initialized page->flags and others
+ * should use this rather than pfn_valid && pfn_to_page
+ */
+#define pfn_to_online_page(pfn)\
+({ \
+   struct page *___page = NULL;\
+   \
+   if (online_section_nr(pfn_to_section_nr(pfn)))  \
+   ___page = pfn_to_page(pfn); \
+   ___page;\
+})
 
 /*
  * Types for free bootmem stored in page->lru.next. These have to be in
@@ -203,6 +216,14 @@ extern void set_zone_contiguous(struct zone *zone);
 extern void clear_zone_contiguous(struct zone *zone);
 
 #else /* ! CONFIG_MEMORY_HOTPLUG */
+#define pfn_to_online_page(pfn)\
+({ \
+   struct page *___page = NULL;\
+   if (pfn_valid(pfn)) \
+   ___page = pfn_to_page(pfn); \
+   ___page;\
+ })
+
 /*
  * Stub functions for when hotplug is off
  */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0fc121bbf4ff..cad16ac080f5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1143,7 +1143,8 @@ extern unsigned long usemap_size(void);

Re: your mail

2017-04-20 Thread Michal Hocko
On Thu 20-04-17 09:28:20, Michal Hocko wrote:
> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
[...]
> > Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> > woule make that zone->contiguous usually returns false since memory
> > used by memblock API is marked as PageReserved() and your patch regard
> > it as a hole. It invalidates set_zone_contiguous() optimization and I
> > worry about it.
> 
> OK, fair enough. I did't consider memblock allocations. I will rethink
> this patch but there are essentially 3 options
>   - use a different criterion for the offline holes dection. I
> have just realized we might do it by storing the online
> information into the mem sections
>   - drop this patch
>   - move the PageReferenced check down the chain into
> isolate_freepages_block resp. isolate_migratepages_block
> 
> I would prefer 3 over 2 over 1. I definitely want to make this more
> robust so 1 is preferable long term but I do not want this to be a
> roadblock to the rest of the rework. Does that sound acceptable to you?

So I've played with all three options just to see how the outcome would
look like and it turned out that going with 1 will be easiest in the
end. What do you think about the following? It should be free of any 
false positives. I have only compile tested it yet.
---
>From 747794c13c0e82b55b793a31cdbe1a84ee1c6920 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 13 Apr 2017 10:28:45 +0200
Subject: [PATCH] mm: consider zone which is not fully populated to have holes

__pageblock_pfn_to_page has two users currently, set_zone_contiguous
which checks whether the given zone contains holes and
pageblock_pfn_to_page which then carefully returns a first valid
page from the given pfn range for the given zone. This doesn't handle
zones which are not fully populated though. Memory pageblocks can be
offlined or might not have been onlined yet. In such a case the zone
should be considered to have holes otherwise pfn walkers can touch
and play with offline pages.

Current callers of pageblock_pfn_to_page in compaction seem to work
properly right now because they only isolate PageBuddy
(isolate_freepages_block) or PageLRU resp. __PageMovable
(isolate_migratepages_block) which will be always false for these pages.
It would be safer to skip these pages altogether, though.

In order to do this patch adds a new memory section state
(SECTION_IS_ONLINE) which is set in memory_present (during boot
time) or in online_pages_range during the memory hotplug. Similarly
offline_mem_sections clears the bit and it is called when the memory
range is offlined.

pfn_to_online_page helper is then added which check the mem section and
only returns a page if it is onlined already.

Use the new helper in __pageblock_pfn_to_page and skip the whole page
block in such a case.

Signed-off-by: Michal Hocko 
---
 include/linux/memory_hotplug.h | 21 
 include/linux/mmzone.h | 20 ++-
 mm/memory_hotplug.c|  3 +++
 mm/page_alloc.c|  5 -
 mm/sparse.c| 45 +-
 5 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3c8cf86201c3..fc1c873504eb 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -14,6 +14,19 @@ struct memory_block;
 struct resource;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * Return page for the valid pfn only if the page is online. All pfn
+ * walkers which rely on the fully initialized page->flags and others
+ * should use this rather than pfn_valid && pfn_to_page
+ */
+#define pfn_to_online_page(pfn)\
+({ \
+   struct page *___page = NULL;\
+   \
+   if (online_section_nr(pfn_to_section_nr(pfn)))  \
+   ___page = pfn_to_page(pfn); \
+   ___page;\
+})
 
 /*
  * Types for free bootmem stored in page->lru.next. These have to be in
@@ -203,6 +216,14 @@ extern void set_zone_contiguous(struct zone *zone);
 extern void clear_zone_contiguous(struct zone *zone);
 
 #else /* ! CONFIG_MEMORY_HOTPLUG */
+#define pfn_to_online_page(pfn)\
+({ \
+   struct page *___page = NULL;\
+   if (pfn_valid(pfn)) \
+   ___page = pfn_to_page(pfn); \
+   ___page;\
+ })
+
 /*
  * Stub functions for when hotplug is off
  */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0fc121bbf4ff..cad16ac080f5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1143,7 +1143,8 @@ extern unsigned long usemap_size(void);
  */
 #define

Re: your mail

2017-04-20 Thread Michal Hocko
On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
[...]
> > Which pfn walkers you have in mind?
> 
> For example, kpagecount_read() in fs/proc/page.c. I searched it by
> using pfn_valid().

Yeah, I've checked that one and in fact this is a good example of the
case where you do not really care about holes. It just checks the page
count which is a valid information under any circumstances.

> > > The other problem I found is that your change will makes some
> > > contiguous zones to be considered as non-contiguous. Memory allocated
> > > by memblock API is also marked as PageResereved. If we consider this as
> > > a hole, we will set such a zone as non-contiguous.
> > 
> > Why would that be a problem? We shouldn't touch those pages anyway?
> 
> Skipping those pages in compaction are valid so no problem in this
> case.
> 
> The problem I mentioned above is that adding PageReserved() check in
> __pageblock_pfn_to_page() invalidates optimization by
> set_zone_contiguous(). In compaction, we need to get a valid struct
> page and it requires a lot of work. There is performance problem
> report due to this so set_zone_contiguous() optimization is added. It
> checks if the zone is contiguous or not in boot time. If zone is
> determined as contiguous, we can easily get a valid struct page in
> runtime without expensive checks.

OK, I see. I've had some vague understading and the clarification helps.

> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> woule make that zone->contiguous usually returns false since memory
> used by memblock API is marked as PageReserved() and your patch regard
> it as a hole. It invalidates set_zone_contiguous() optimization and I
> worry about it.

OK, fair enough. I did't consider memblock allocations. I will rethink
this patch but there are essentially 3 options
- use a different criterion for the offline holes dection. I
  have just realized we might do it by storing the online
  information into the mem sections
- drop this patch
- move the PageReferenced check down the chain into
  isolate_freepages_block resp. isolate_migratepages_block

I would prefer 3 over 2 over 1. I definitely want to make this more
robust so 1 is preferable long term but I do not want this to be a
roadblock to the rest of the rework. Does that sound acceptable to you?
 
[..]
> Let me clarify my desire(?) for this issue.
> 
> 1. If pfn_valid() returns true, struct page has valid information, at
> least, in flags (zone id, node id, flags, etc...). So, we can use them
> without checking PageResereved().

This is no longer true after my rework. Pages are associated with the
zone during _onlining_ rather than when they are physically hotpluged.
Basically only the nid is set properly. Strictly speaking this is the
case also without my rework because the zone might change during online
phase so you cannot assume it is correct even now. It just happens that
it more or less works just fine.

> 2. pfn_valid() for offlined holes returns false. This can be easily
> (?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
> guess that there is no reason that pfn_valid() returns true for
> offlined holes. If there is, please let me know.

There is some code which really expects that pfn_valid returns true iff
there is a struct page and it doesn't care about the online status.
E.g. hotplug code itself so no, we cannot change pfn_valid. What we can
do though is to add pfn_to_online_page which would do the proper check.
I have already sent [1]. As noted above we can (ab)use the remaining bit
in SECTION_MAP_MASK to detect offline pages more robustly.

> 3. We don't need to check PageReserved() in most of pfn walkers in
> order to check offline holes.

We still have to distinguish those who care about offline pages from
those who do not care about it.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-20 Thread Michal Hocko
On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
[...]
> > Which pfn walkers you have in mind?
> 
> For example, kpagecount_read() in fs/proc/page.c. I searched it by
> using pfn_valid().

Yeah, I've checked that one and in fact this is a good example of the
case where you do not really care about holes. It just checks the page
count which is a valid information under any circumstances.

> > > The other problem I found is that your change will makes some
> > > contiguous zones to be considered as non-contiguous. Memory allocated
> > > by memblock API is also marked as PageResereved. If we consider this as
> > > a hole, we will set such a zone as non-contiguous.
> > 
> > Why would that be a problem? We shouldn't touch those pages anyway?
> 
> Skipping those pages in compaction are valid so no problem in this
> case.
> 
> The problem I mentioned above is that adding PageReserved() check in
> __pageblock_pfn_to_page() invalidates optimization by
> set_zone_contiguous(). In compaction, we need to get a valid struct
> page and it requires a lot of work. There is performance problem
> report due to this so set_zone_contiguous() optimization is added. It
> checks if the zone is contiguous or not in boot time. If zone is
> determined as contiguous, we can easily get a valid struct page in
> runtime without expensive checks.

OK, I see. I've had some vague understading and the clarification helps.

> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> woule make that zone->contiguous usually returns false since memory
> used by memblock API is marked as PageReserved() and your patch regard
> it as a hole. It invalidates set_zone_contiguous() optimization and I
> worry about it.

OK, fair enough. I did't consider memblock allocations. I will rethink
this patch but there are essentially 3 options
- use a different criterion for the offline holes dection. I
  have just realized we might do it by storing the online
  information into the mem sections
- drop this patch
- move the PageReferenced check down the chain into
  isolate_freepages_block resp. isolate_migratepages_block

I would prefer 3 over 2 over 1. I definitely want to make this more
robust so 1 is preferable long term but I do not want this to be a
roadblock to the rest of the rework. Does that sound acceptable to you?
 
[..]
> Let me clarify my desire(?) for this issue.
> 
> 1. If pfn_valid() returns true, struct page has valid information, at
> least, in flags (zone id, node id, flags, etc...). So, we can use them
> without checking PageResereved().

This is no longer true after my rework. Pages are associated with the
zone during _onlining_ rather than when they are physically hotpluged.
Basically only the nid is set properly. Strictly speaking this is the
case also without my rework because the zone might change during online
phase so you cannot assume it is correct even now. It just happens that
it more or less works just fine.

> 2. pfn_valid() for offlined holes returns false. This can be easily
> (?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
> guess that there is no reason that pfn_valid() returns true for
> offlined holes. If there is, please let me know.

There is some code which really expects that pfn_valid returns true iff
there is a struct page and it doesn't care about the online status.
E.g. hotplug code itself so no, we cannot change pfn_valid. What we can
do though is to add pfn_to_online_page which would do the proper check.
I have already sent [1]. As noted above we can (ab)use the remaining bit
in SECTION_MAP_MASK to detect offline pages more robustly.

> 3. We don't need to check PageReserved() in most of pfn walkers in
> order to check offline holes.

We still have to distinguish those who care about offline pages from
those who do not care about it.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-19 Thread Joonsoo Kim
On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> On Mon 17-04-17 14:47:20, Joonsoo Kim wrote:
> > On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> > > Hi,
> > > here I 3 more preparatory patches which I meant to send on Thursday but
> > > forgot... After more thinking about pfn walkers I have realized that
> > > the current code doesn't check offline holes in zones. From a quick
> > > review that doesn't seem to be a problem currently. Pfn walkers can race
> > > with memory offlining and with the original hotplug impementation those
> > > offline pages can change the zone but I wasn't able to find any serious
> > > problem other than small confusion. The new hotplug code, will not have
> > > any valid zone, though so those code paths should check PageReserved
> > > to rule offline holes. I hope I have addressed all of them in these 3
> > > patches. I would appreciate if Vlastimil and Jonsoo double check after
> > > me.
> > 
> > Hello, Michal.
> > 
> > s/Jonsoo/Joonsoo. :)
> 
> ups, sorry about that.
> 
> > I'm not sure that it's a good idea to add PageResereved() check in pfn
> > walkers. First, this makes struct page validity check as two steps,
> > pfn_valid() and then PageResereved().
> 
> Yes, those are two separate checkes because semantically they are
> different. Not all pfn walkers do care about the online status.

If offlined page has no valid information, reading information
about offlined pages are just wrong. So, all pfn walkers that reads
information about the page should do care about it.

I guess that many callers for pfn_valid() is in this category.

> 
> > If we should not use struct page
> > in this case, it's better to pfn_valid() returns false rather than
> > adding a separate check. Anyway, we need to fix more places (all pfn
> > walker?) if we want to check validity by two steps.
> 
> Which pfn walkers you have in mind?

For example, kpagecount_read() in fs/proc/page.c. I searched it by
using pfn_valid().

> > The other problem I found is that your change will makes some
> > contiguous zones to be considered as non-contiguous. Memory allocated
> > by memblock API is also marked as PageResereved. If we consider this as
> > a hole, we will set such a zone as non-contiguous.
> 
> Why would that be a problem? We shouldn't touch those pages anyway?

Skipping those pages in compaction are valid so no problem in this
case.

The problem I mentioned above is that adding PageReserved() check in
__pageblock_pfn_to_page() invalidates optimization by
set_zone_contiguous(). In compaction, we need to get a valid struct
page and it requires a lot of work. There is performance problem
report due to this so set_zone_contiguous() optimization is added. It
checks if the zone is contiguous or not in boot time. If zone is
determined as contiguous, we can easily get a valid struct page in
runtime without expensive checks.

Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
woule make that zone->contiguous usually returns false since memory
used by memblock API is marked as PageReserved() and your patch regard
it as a hole. It invalidates set_zone_contiguous() optimization and I
worry about it.

>  
> > And, I guess that it's not enough to check PageResereved() in
> > pageblock_pfn_to_page() in order to skip these pages in compaction. If
> > holes are in the middle of the pageblock, pageblock_pfn_to_page()
> > cannot catch it and compaction will use struct page for this hole.
> 
> Yes pageblock_pfn_to_page cannot catch it and it wouldn't with the
> current implementation anyway. So the implementation won't be any worse
> than with the current code. On the other hand offline holes will always
> fill the whole pageblock (assuming those are not spanning multiple
> memblocks).
>  
> > Therefore, I think that making pfn_valid() return false for not
> > onlined memory is a better solution for this problem. I don't know the
> > implementation detail for hotplug and I don't see your recent change
> > but we may defer memmap initialization until the zone is determined.
> > It will make pfn_valid() return false for un-initialized range.
> 
> I am not really sure. pfn_valid is used in many context and its only
> purpose is to tell whether pfn_to_page will return a valid struct page
> AFAIU.
> 
> I agree that having more checks is more error prone and we can add a
> helper pfn_to_valid_page or something similar but I believe we can do
> that on top of the current hotplug rework. This would require a non
> trivial amount of changes and I believe that a lacking check for the
> offline holes is not critical - we would (ab)use the lowest zone which
> is similar to (ab)using ZONE_NORMAL/MOVABLE with the original code.

I'm not objecting your hotplug rework. In fact, I don't know the
relationship between this work and hotplug rework. I'm agreeing
with checking offline holes but I don't like the design and
implementation about it.

Let me clarify my desire(?) for this issue.


Re: your mail

2017-04-19 Thread Joonsoo Kim
On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> On Mon 17-04-17 14:47:20, Joonsoo Kim wrote:
> > On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> > > Hi,
> > > here I 3 more preparatory patches which I meant to send on Thursday but
> > > forgot... After more thinking about pfn walkers I have realized that
> > > the current code doesn't check offline holes in zones. From a quick
> > > review that doesn't seem to be a problem currently. Pfn walkers can race
> > > with memory offlining and with the original hotplug impementation those
> > > offline pages can change the zone but I wasn't able to find any serious
> > > problem other than small confusion. The new hotplug code, will not have
> > > any valid zone, though so those code paths should check PageReserved
> > > to rule offline holes. I hope I have addressed all of them in these 3
> > > patches. I would appreciate if Vlastimil and Jonsoo double check after
> > > me.
> > 
> > Hello, Michal.
> > 
> > s/Jonsoo/Joonsoo. :)
> 
> ups, sorry about that.
> 
> > I'm not sure that it's a good idea to add PageResereved() check in pfn
> > walkers. First, this makes struct page validity check as two steps,
> > pfn_valid() and then PageResereved().
> 
> Yes, those are two separate checkes because semantically they are
> different. Not all pfn walkers do care about the online status.

If offlined page has no valid information, reading information
about offlined pages are just wrong. So, all pfn walkers that reads
information about the page should do care about it.

I guess that many callers for pfn_valid() is in this category.

> 
> > If we should not use struct page
> > in this case, it's better to pfn_valid() returns false rather than
> > adding a separate check. Anyway, we need to fix more places (all pfn
> > walker?) if we want to check validity by two steps.
> 
> Which pfn walkers you have in mind?

For example, kpagecount_read() in fs/proc/page.c. I searched it by
using pfn_valid().

> > The other problem I found is that your change will makes some
> > contiguous zones to be considered as non-contiguous. Memory allocated
> > by memblock API is also marked as PageResereved. If we consider this as
> > a hole, we will set such a zone as non-contiguous.
> 
> Why would that be a problem? We shouldn't touch those pages anyway?

Skipping those pages in compaction are valid so no problem in this
case.

The problem I mentioned above is that adding PageReserved() check in
__pageblock_pfn_to_page() invalidates optimization by
set_zone_contiguous(). In compaction, we need to get a valid struct
page and it requires a lot of work. There is performance problem
report due to this so set_zone_contiguous() optimization is added. It
checks if the zone is contiguous or not in boot time. If zone is
determined as contiguous, we can easily get a valid struct page in
runtime without expensive checks.

Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
woule make that zone->contiguous usually returns false since memory
used by memblock API is marked as PageReserved() and your patch regard
it as a hole. It invalidates set_zone_contiguous() optimization and I
worry about it.

>  
> > And, I guess that it's not enough to check PageResereved() in
> > pageblock_pfn_to_page() in order to skip these pages in compaction. If
> > holes are in the middle of the pageblock, pageblock_pfn_to_page()
> > cannot catch it and compaction will use struct page for this hole.
> 
> Yes pageblock_pfn_to_page cannot catch it and it wouldn't with the
> current implementation anyway. So the implementation won't be any worse
> than with the current code. On the other hand offline holes will always
> fill the whole pageblock (assuming those are not spanning multiple
> memblocks).
>  
> > Therefore, I think that making pfn_valid() return false for not
> > onlined memory is a better solution for this problem. I don't know the
> > implementation detail for hotplug and I don't see your recent change
> > but we may defer memmap initialization until the zone is determined.
> > It will make pfn_valid() return false for un-initialized range.
> 
> I am not really sure. pfn_valid is used in many context and its only
> purpose is to tell whether pfn_to_page will return a valid struct page
> AFAIU.
> 
> I agree that having more checks is more error prone and we can add a
> helper pfn_to_valid_page or something similar but I believe we can do
> that on top of the current hotplug rework. This would require a non
> trivial amount of changes and I believe that a lacking check for the
> offline holes is not critical - we would (ab)use the lowest zone which
> is similar to (ab)using ZONE_NORMAL/MOVABLE with the original code.

I'm not objecting your hotplug rework. In fact, I don't know the
relationship between this work and hotplug rework. I'm agreeing
with checking offline holes but I don't like the design and
implementation about it.

Let me clarify my desire(?) for this issue.


Re: your mail

2017-04-17 Thread Michal Hocko
On Mon 17-04-17 14:47:20, Joonsoo Kim wrote:
> On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> > Hi,
> > here I 3 more preparatory patches which I meant to send on Thursday but
> > forgot... After more thinking about pfn walkers I have realized that
> > the current code doesn't check offline holes in zones. From a quick
> > review that doesn't seem to be a problem currently. Pfn walkers can race
> > with memory offlining and with the original hotplug impementation those
> > offline pages can change the zone but I wasn't able to find any serious
> > problem other than small confusion. The new hotplug code, will not have
> > any valid zone, though so those code paths should check PageReserved
> > to rule offline holes. I hope I have addressed all of them in these 3
> > patches. I would appreciate if Vlastimil and Jonsoo double check after
> > me.
> 
> Hello, Michal.
> 
> s/Jonsoo/Joonsoo. :)

ups, sorry about that.

> I'm not sure that it's a good idea to add PageResereved() check in pfn
> walkers. First, this makes struct page validity check as two steps,
> pfn_valid() and then PageResereved().

Yes, those are two separate checkes because semantically they are
different. Not all pfn walkers do care about the online status.

> If we should not use struct page
> in this case, it's better to pfn_valid() returns false rather than
> adding a separate check. Anyway, we need to fix more places (all pfn
> walker?) if we want to check validity by two steps.

Which pfn walkers you have in mind?

> The other problem I found is that your change will makes some
> contiguous zones to be considered as non-contiguous. Memory allocated
> by memblock API is also marked as PageResereved. If we consider this as
> a hole, we will set such a zone as non-contiguous.

Why would that be a problem? We shouldn't touch those pages anyway?
 
> And, I guess that it's not enough to check PageResereved() in
> pageblock_pfn_to_page() in order to skip these pages in compaction. If
> holes are in the middle of the pageblock, pageblock_pfn_to_page()
> cannot catch it and compaction will use struct page for this hole.

Yes pageblock_pfn_to_page cannot catch it and it wouldn't with the
current implementation anyway. So the implementation won't be any worse
than with the current code. On the other hand offline holes will always
fill the whole pageblock (assuming those are not spanning multiple
memblocks).
 
> Therefore, I think that making pfn_valid() return false for not
> onlined memory is a better solution for this problem. I don't know the
> implementation detail for hotplug and I don't see your recent change
> but we may defer memmap initialization until the zone is determined.
> It will make pfn_valid() return false for un-initialized range.

I am not really sure. pfn_valid is used in many context and its only
purpose is to tell whether pfn_to_page will return a valid struct page
AFAIU.

I agree that having more checks is more error prone and we can add a
helper pfn_to_valid_page or something similar but I believe we can do
that on top of the current hotplug rework. This would require a non
trivial amount of changes and I believe that a lacking check for the
offline holes is not critical - we would (ab)use the lowest zone which
is similar to (ab)using ZONE_NORMAL/MOVABLE with the original code.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-17 Thread Michal Hocko
On Mon 17-04-17 14:47:20, Joonsoo Kim wrote:
> On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> > Hi,
> > here I 3 more preparatory patches which I meant to send on Thursday but
> > forgot... After more thinking about pfn walkers I have realized that
> > the current code doesn't check offline holes in zones. From a quick
> > review that doesn't seem to be a problem currently. Pfn walkers can race
> > with memory offlining and with the original hotplug impementation those
> > offline pages can change the zone but I wasn't able to find any serious
> > problem other than small confusion. The new hotplug code, will not have
> > any valid zone, though so those code paths should check PageReserved
> > to rule offline holes. I hope I have addressed all of them in these 3
> > patches. I would appreciate if Vlastimil and Jonsoo double check after
> > me.
> 
> Hello, Michal.
> 
> s/Jonsoo/Joonsoo. :)

ups, sorry about that.

> I'm not sure that it's a good idea to add PageResereved() check in pfn
> walkers. First, this makes struct page validity check as two steps,
> pfn_valid() and then PageResereved().

Yes, those are two separate checkes because semantically they are
different. Not all pfn walkers do care about the online status.

> If we should not use struct page
> in this case, it's better to pfn_valid() returns false rather than
> adding a separate check. Anyway, we need to fix more places (all pfn
> walker?) if we want to check validity by two steps.

Which pfn walkers you have in mind?

> The other problem I found is that your change will makes some
> contiguous zones to be considered as non-contiguous. Memory allocated
> by memblock API is also marked as PageResereved. If we consider this as
> a hole, we will set such a zone as non-contiguous.

Why would that be a problem? We shouldn't touch those pages anyway?
 
> And, I guess that it's not enough to check PageResereved() in
> pageblock_pfn_to_page() in order to skip these pages in compaction. If
> holes are in the middle of the pageblock, pageblock_pfn_to_page()
> cannot catch it and compaction will use struct page for this hole.

Yes pageblock_pfn_to_page cannot catch it and it wouldn't with the
current implementation anyway. So the implementation won't be any worse
than with the current code. On the other hand offline holes will always
fill the whole pageblock (assuming those are not spanning multiple
memblocks).
 
> Therefore, I think that making pfn_valid() return false for not
> onlined memory is a better solution for this problem. I don't know the
> implementation detail for hotplug and I don't see your recent change
> but we may defer memmap initialization until the zone is determined.
> It will make pfn_valid() return false for un-initialized range.

I am not really sure. pfn_valid is used in many context and its only
purpose is to tell whether pfn_to_page will return a valid struct page
AFAIU.

I agree that having more checks is more error prone and we can add a
helper pfn_to_valid_page or something similar but I believe we can do
that on top of the current hotplug rework. This would require a non
trivial amount of changes and I believe that a lacking check for the
offline holes is not critical - we would (ab)use the lowest zone which
is similar to (ab)using ZONE_NORMAL/MOVABLE with the original code.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: your mail

2017-04-16 Thread Joonsoo Kim
On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> Hi,
> here I 3 more preparatory patches which I meant to send on Thursday but
> forgot... After more thinking about pfn walkers I have realized that
> the current code doesn't check offline holes in zones. From a quick
> review that doesn't seem to be a problem currently. Pfn walkers can race
> with memory offlining and with the original hotplug impementation those
> offline pages can change the zone but I wasn't able to find any serious
> problem other than small confusion. The new hotplug code, will not have
> any valid zone, though so those code paths should check PageReserved
> to rule offline holes. I hope I have addressed all of them in these 3
> patches. I would appreciate if Vlastimil and Jonsoo double check after
> me.

Hello, Michal.

s/Jonsoo/Joonsoo. :)

I'm not sure that it's a good idea to add PageResereved() check in pfn
walkers. First, this makes struct page validity check as two steps,
pfn_valid() and then PageResereved(). If we should not use struct page
in this case, it's better to pfn_valid() returns false rather than
adding a separate check. Anyway, we need to fix more places (all pfn
walker?) if we want to check validity by two steps.

The other problem I found is that your change will makes some
contiguous zones to be considered as non-contiguous. Memory allocated
by memblock API is also marked as PageResereved. If we consider this as
a hole, we will set such a zone as non-contiguous.

And, I guess that it's not enough to check PageResereved() in
pageblock_pfn_to_page() in order to skip these pages in compaction. If
holes are in the middle of the pageblock, pageblock_pfn_to_page()
cannot catch it and compaction will use struct page for this hole.

Therefore, I think that making pfn_valid() return false for not
onlined memory is a better solution for this problem. I don't know the
implementation detail for hotplug and I don't see your recent change
but we may defer memmap initialization until the zone is determined.
It will make pfn_valid() return false for un-initialized range.

Thanks.


Re: your mail

2017-04-16 Thread Joonsoo Kim
On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> Hi,
> here I 3 more preparatory patches which I meant to send on Thursday but
> forgot... After more thinking about pfn walkers I have realized that
> the current code doesn't check offline holes in zones. From a quick
> review that doesn't seem to be a problem currently. Pfn walkers can race
> with memory offlining and with the original hotplug impementation those
> offline pages can change the zone but I wasn't able to find any serious
> problem other than small confusion. The new hotplug code, will not have
> any valid zone, though so those code paths should check PageReserved
> to rule offline holes. I hope I have addressed all of them in these 3
> patches. I would appreciate if Vlastimil and Jonsoo double check after
> me.

Hello, Michal.

s/Jonsoo/Joonsoo. :)

I'm not sure that it's a good idea to add PageResereved() check in pfn
walkers. First, this makes struct page validity check as two steps,
pfn_valid() and then PageResereved(). If we should not use struct page
in this case, it's better to pfn_valid() returns false rather than
adding a separate check. Anyway, we need to fix more places (all pfn
walker?) if we want to check validity by two steps.

The other problem I found is that your change will makes some
contiguous zones to be considered as non-contiguous. Memory allocated
by memblock API is also marked as PageResereved. If we consider this as
a hole, we will set such a zone as non-contiguous.

And, I guess that it's not enough to check PageResereved() in
pageblock_pfn_to_page() in order to skip these pages in compaction. If
holes are in the middle of the pageblock, pageblock_pfn_to_page()
cannot catch it and compaction will use struct page for this hole.

Therefore, I think that making pfn_valid() return false for not
onlined memory is a better solution for this problem. I don't know the
implementation detail for hotplug and I don't see your recent change
but we may defer memmap initialization until the zone is determined.
It will make pfn_valid() return false for un-initialized range.

Thanks.


Re: your mail

2016-11-16 Thread Peter Zijlstra
On Wed, Nov 16, 2016 at 09:25:43AM -0500, Steven Rostedt wrote:
> On Wed, 16 Nov 2016 11:40:14 +0100
> Peter Zijlstra  wrote:
> 
> 
> > On top of which, the implementation had issues; now I know you're the
> > blinder kind of person that disregards everything not in his immediate
> > interest, but if you'd looked at the patch you'd have seen he'd added
> > code the idle entry path, which will slow down every single to-idle
> > transition.
> 
> Isn't to-idle a bit bloated anyway? Or has that been fixed. I know
> there was some issues with idle_balance() which can add latency to
> wakeups. idle_balance() is also in the to-idle path.
> 

Yes it is too heavy as is, but just stacking more crap in just because
its already expensive seems to wrong way around.




Re: your mail

2016-11-16 Thread Peter Zijlstra
On Wed, Nov 16, 2016 at 09:25:43AM -0500, Steven Rostedt wrote:
> On Wed, 16 Nov 2016 11:40:14 +0100
> Peter Zijlstra  wrote:
> 
> 
> > On top of which, the implementation had issues; now I know you're the
> > blinder kind of person that disregards everything not in his immediate
> > interest, but if you'd looked at the patch you'd have seen he'd added
> > code the idle entry path, which will slow down every single to-idle
> > transition.
> 
> Isn't to-idle a bit bloated anyway? Or has that been fixed. I know
> there was some issues with idle_balance() which can add latency to
> wakeups. idle_balance() is also in the to-idle path.
> 

Yes it is too heavy as is, but just stacking more crap in just because
its already expensive seems to wrong way around.




Re: your mail

2016-11-16 Thread Steven Rostedt
On Wed, 16 Nov 2016 11:40:14 +0100
Peter Zijlstra  wrote:


> On top of which, the implementation had issues; now I know you're the
> blinder kind of person that disregards everything not in his immediate
> interest, but if you'd looked at the patch you'd have seen he'd added
> code the idle entry path, which will slow down every single to-idle
> transition.

Isn't to-idle a bit bloated anyway? Or has that been fixed. I know
there was some issues with idle_balance() which can add latency to
wakeups. idle_balance() is also in the to-idle path.

Note, that this is a sched feature which would be a nop (jump_label)
when disabled. And I'm sure it could also be optimized to be a static
inline as well when it is enabled.

I'm not saying we need to go this approach, but I'm just saying that
the to-idle issue is a bit of a red herring.

-- Steve


Re: your mail

2016-11-16 Thread Steven Rostedt
On Wed, 16 Nov 2016 11:40:14 +0100
Peter Zijlstra  wrote:


> On top of which, the implementation had issues; now I know you're the
> blinder kind of person that disregards everything not in his immediate
> interest, but if you'd looked at the patch you'd have seen he'd added
> code the idle entry path, which will slow down every single to-idle
> transition.

Isn't to-idle a bit bloated anyway? Or has that been fixed. I know
there was some issues with idle_balance() which can add latency to
wakeups. idle_balance() is also in the to-idle path.

Note, that this is a sched feature which would be a nop (jump_label)
when disabled. And I'm sure it could also be optimized to be a static
inline as well when it is enabled.

I'm not saying we need to go this approach, but I'm just saying that
the to-idle issue is a bit of a red herring.

-- Steve


Re: your mail

2016-11-16 Thread Peter Zijlstra
On Tue, Nov 15, 2016 at 02:29:16PM -0600, Christoph Lameter wrote:
> 
> > > There is a deadlock, Peter!!!
> >
> > Describe please? Also, have you tried disabling RT_RUNTIME_SHARE ?
> >
> 
> 
> The description was given earlier in the the thread and the drawbacks of
> using RT_RUNTIME_SHARE as well.

I've not seen a deadlock described. It either was an unbounded priority
inversion or a starvation issue, both of which are 'design' features of
the !rt kernel.

Neither things are new, so its not a regression either.

And, as stated, I'm not really happy to muck with this known troublesome
code and add features for which we must then maintain feature parity
when replacing it either.

On top of which, the implementation had issues; now I know you're the
blinder kind of person that disregards everything not in his immediate
interest, but if you'd looked at the patch you'd have seen he'd added
code the idle entry path, which will slow down every single to-idle
transition.



Re: your mail

2016-11-16 Thread Peter Zijlstra
On Tue, Nov 15, 2016 at 02:29:16PM -0600, Christoph Lameter wrote:
> 
> > > There is a deadlock, Peter!!!
> >
> > Describe please? Also, have you tried disabling RT_RUNTIME_SHARE ?
> >
> 
> 
> The description was given earlier in the the thread and the drawbacks of
> using RT_RUNTIME_SHARE as well.

I've not seen a deadlock described. It either was an unbounded priority
inversion or a starvation issue, both of which are 'design' features of
the !rt kernel.

Neither things are new, so its not a regression either.

And, as stated, I'm not really happy to muck with this known troublesome
code and add features for which we must then maintain feature parity
when replacing it either.

On top of which, the implementation had issues; now I know you're the
blinder kind of person that disregards everything not in his immediate
interest, but if you'd looked at the patch you'd have seen he'd added
code the idle entry path, which will slow down every single to-idle
transition.



Re: your mail

2016-09-20 Thread andrew banman
Subject line got dropped the first time around. Will send again.

Apologies for the chatter,

Andrew

On Tue, Sep 20, 2016 at 05:21:06PM -0500, Andrew Banman wrote:
> From Andrew Banman  # This line is ignored.
> From: Andrew Banman 
> Subject: [PATCH 0/9] arch/x86/platform/uv: add UV4 support to BAU
> In-Reply-To: 
> 
> The following patch set adds support for UV4 architecture to the Broadcast
> Assist Unit (BAU). Major hardware changes to the BAU require these fixes to
> ensure correct operation and to avoid illegal MMR writes.
> 
>  arch/x86/include/asm/uv/uv_bau.h |  45 ++
>  arch/x86/platform/uv/tlb_uv.c| 114 
> ---
> -
> 
> The patch set can be thought of in three logical groups:
> 
> 1) General cleanup.
> 
>  [PATCH 1/9] arch/x86/platform/uv: BAU cleanup: update printks
>  [PATCH 2/9] arch/x86/platform/uv: BAU cleanup: pq_init
>  [PATCH 3/9] arch/x86/platform/uv: BAU replace uv_physnodeaddr
> 
>  These housekeeping patches make the subsequent UV4 patches clearer,
>  and they should be done in any case.
> 
> 
> 2) Implement a new scheme to abstract UV version-specific functions.
> 
>  [PATCH 4/9] arch/x86/platform/uv: BAU add generic function pointers
>  [PATCH 5/9] arch/x86/platform/uv: BAU use generic function pointers
> 
>  We add a struct of function pointers to define version-specific BAU
>  operations. The philosophy is to abstract functions that perform the same
>  operation on all UV versions but have different implementations. This will
>  simplify their use in the body of the driver code and greatly simplify the
>  UV4 patches to follow.
> 
> 
> 3) Add UV4 functionality.
> 
>  [PATCH 6/9] arch/x86/platform/uv: BAU UV4 populate uvhub_version
>  [PATCH 7/9] arch/x86/platform/uv: BAU UV4 disable software timeout
>  [PATCH 8/9] arch/x86/platform/uv: BAU UV4 fix payload queue setup
>  [PATCH 9/9] arch/x86/platform/uv: BAU UV4 add version-specific
> 
>  These patches feature a minimal set of changes to make the BAU on UV4
>  operational.
> 
> 
> This patch set has been tested for regressions on pre-UV4 architectures and
> for correct functionality on UV4. The patches apply cleanly to 4.8-rc7.
> Fine-tuned performance tweaking for UV4 will come in a future patch set.
> 
> 
> Thank you,
> 
> Andrew Banman


Re: your mail

2016-09-20 Thread andrew banman
Subject line got dropped the first time around. Will send again.

Apologies for the chatter,

Andrew

On Tue, Sep 20, 2016 at 05:21:06PM -0500, Andrew Banman wrote:
> From Andrew Banman  # This line is ignored.
> From: Andrew Banman 
> Subject: [PATCH 0/9] arch/x86/platform/uv: add UV4 support to BAU
> In-Reply-To: 
> 
> The following patch set adds support for UV4 architecture to the Broadcast
> Assist Unit (BAU). Major hardware changes to the BAU require these fixes to
> ensure correct operation and to avoid illegal MMR writes.
> 
>  arch/x86/include/asm/uv/uv_bau.h |  45 ++
>  arch/x86/platform/uv/tlb_uv.c| 114 
> ---
> -
> 
> The patch set can be thought of in three logical groups:
> 
> 1) General cleanup.
> 
>  [PATCH 1/9] arch/x86/platform/uv: BAU cleanup: update printks
>  [PATCH 2/9] arch/x86/platform/uv: BAU cleanup: pq_init
>  [PATCH 3/9] arch/x86/platform/uv: BAU replace uv_physnodeaddr
> 
>  These housekeeping patches make the subsequent UV4 patches clearer,
>  and they should be done in any case.
> 
> 
> 2) Implement a new scheme to abstract UV version-specific functions.
> 
>  [PATCH 4/9] arch/x86/platform/uv: BAU add generic function pointers
>  [PATCH 5/9] arch/x86/platform/uv: BAU use generic function pointers
> 
>  We add a struct of function pointers to define version-specific BAU
>  operations. The philosophy is to abstract functions that perform the same
>  operation on all UV versions but have different implementations. This will
>  simplify their use in the body of the driver code and greatly simplify the
>  UV4 patches to follow.
> 
> 
> 3) Add UV4 functionality.
> 
>  [PATCH 6/9] arch/x86/platform/uv: BAU UV4 populate uvhub_version
>  [PATCH 7/9] arch/x86/platform/uv: BAU UV4 disable software timeout
>  [PATCH 8/9] arch/x86/platform/uv: BAU UV4 fix payload queue setup
>  [PATCH 9/9] arch/x86/platform/uv: BAU UV4 add version-specific
> 
>  These patches feature a minimal set of changes to make the BAU on UV4
>  operational.
> 
> 
> This patch set has been tested for regressions on pre-UV4 architectures and
> for correct functionality on UV4. The patches apply cleanly to 4.8-rc7.
> Fine-tuned performance tweaking for UV4 will come in a future patch set.
> 
> 
> Thank you,
> 
> Andrew Banman


Re: your mail

2015-08-03 Thread Dan Carpenter
Returning EINVAL here is the wrong thing.  Just leave the code as is.

regards,
dan carpenter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2015-08-03 Thread Sudip Mukherjee
On Mon, Aug 03, 2015 at 11:48:59AM +0530, Shraddha Barke wrote:
> From b67c6c20455b04b77447ab4561e44f1a75dd978d Mon Sep 17 00:00:00 2001
> From: Shraddha Barke 
> Date: Mon, 3 Aug 2015 11:34:19 +0530
> Subject: [PATCH] Staging : lustre : Use -EINVAL instead of -ENOSYS

You do not need these in the commit message.

regards
sudip
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2015-08-03 Thread Sudip Mukherjee
On Mon, Aug 03, 2015 at 11:48:59AM +0530, Shraddha Barke wrote:
 From b67c6c20455b04b77447ab4561e44f1a75dd978d Mon Sep 17 00:00:00 2001
 From: Shraddha Barke shraddha.6...@gmail.com
 Date: Mon, 3 Aug 2015 11:34:19 +0530
 Subject: [PATCH] Staging : lustre : Use -EINVAL instead of -ENOSYS

You do not need these in the commit message.

regards
sudip
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2015-08-03 Thread Dan Carpenter
Returning EINVAL here is the wrong thing.  Just leave the code as is.

regards,
dan carpenter

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2015-01-28 Thread atull
On Wed, 21 Jan 2015, Jason Gunthorpe wrote:

> [unfutzd the cc a bit, sorry]
> 
> On Wed, Jan 21, 2015 at 04:19:17PM -0600, atull wrote:
> > > If we consider a Zynq, for instance, there are a number of clock nets
> > > that the CPU drives into the FPGA fabric. These nets are controlled by
> > > the kernel CLK framework. So, before we program the FPGA bitstream the
> > > clocks must be setup properly.
> > 
> > It's pretty normal for drivers to find out what their clocks are from
> > the DT and enable them.  
> 
> Sure, but the clocks are bitfile specific, and not related to
> programming. Some bitfiles may not require CPU clocks at all.

The bitfile specific clocks are the clocks that are turned on by the device
driver for that chunk of the bitfile.  So those clocks can be specified
the same way as clocks are specied in the DT.

> 
> > Yes the DT overlay can specify:
> >   * clock info
> >   * firmware file name if user is doing it that way
> >   * fpga manager - specific info
> > * compatiblity string specifies what type of fpga it is
> > * which fpga this image should go into
> >   * fpga/processor bridges to enable
> >   * driver(s) info that is dependent on the above
> 
> All sounds reasonable
> 
> > > Today in our Zynq systems we have the bootloader preconfigure
> > > everything for what we are trying to do - but that is specific to the
> > > particular FPGA we are expecting to run, and eg, I expect if we ran a
> > > kernel using the Zynq clk framework there would be problems with it
> > > mangling the configuration.
> > > 
> > > So there would have to be some kind of sequence where the DT is
> > > loaded, the zynq specific FPGA programmer does its pre setup, then the
> > > request_firmarw/fpga_program_fw loads the bitstream and another pass
> > > for a zynq specific post setup and completion handshake?
> > 
> > fpga-mgr.c has the concept that each different FPGA family will
> > likely need its own way of doing these 3 steps:
> >  * write_init (prepare fgpa for receiving configuration information)
> >  * write (write configuration info to the fpga)
> >  * write_complete (done writing, put fpga into running mode)
> > 
> > There are callbacks into the manufacturor/fpga family specific lower
> > level driver to do these things (as part of the "fpga_manager_ops"
> 
> I think the missing bit here is that there are bitfile specific things
> as well.
> 
> The functions above are fine for a generic manufacturer bitfile loader,
> ie Xilinx GPIO twiddling, Altera JTAG, Zynq DMA, etc.
> 
> But wrappered around that should be another set of functions that are
> bitfile specific.
> 
> Like Zynq-PL-boot-protocol-v1 - which deasserts a reset line and waits
> for the PL to signal back that it has completed reset.
> 
> Or jgg-boot-protocol-v1 which monitors the configuration GPIOs for a
> specific ready pattern..
> 
> Or ... 
> 
> All of those procedures depend on the bitfile to implement something.
> 
> > > The DT needs to specify not only the bitstream programming HW to use
> > > but this ancillary programming protocol. There are many ways to do
> > > a out of reset and completion handshake on Zynq, for instance.
> > 
> > Currently the lower level driver supports only one preferred method
> > of programming.  I guess we could add an enumerated DT property to
> > select programming protocol.  It would have to be manufacturor specific.
> > Alternatively it could be encoded into the compatibility string if that
> > makes sense.
> 
> From a DT perspective I'd expect it to look something like:
> 
> soc {
> 
>   // This is the 'how to program a bitstream'
>   fpga-bitstream0: zynq_pl_dma 
>   {
>  compatible = "xilinx,zynq,pl,dma";
>  regs = <..>
>   }
> 
>   fpga: ..
>   {
>  // This is 'what is in the bitstream'
>  boot-protocol = "xilinx,zynq,protocol1";
>  compatible = "jgg,fpga-foo-bar";
>  manager = @fpga-bitstream0
>  clocks = ..
>  clock-frequency = ...
> 
>  zynq_axi_gp0
>  {
>   // Settings for a CPU to FPGA AXI bridge
>axi setting 1 = ...
>[...]
>  }
>   }
> }

That's good.  It also needs to specify the driver(s) for the hardware that the
bitstream will instantiate.

> 
> I could also see integrating with the regulator framework as well to
> power up FPGA specific controllable power supplies.
> 
> > > And then user space would need to have control points between each of
> > > these steps.
> 
> > We could have two options, configurable from the ioctl:
> > * When the DT is loaded, do everything
> > * Even when the DT is loaded, wait for further instructions from ioctl or
> > 
> > Freewheeling flow:
> > * Tell ioctl that we are in freewheeling mode
> > * Load DT overlay
> > 
> > Tightly controlled flow:
> > * Tell ioctl that we are doing things stepwise
> > * Load DT overlay
> > * Use ioctl to step through getting the fpga loaded and known to be happy
> 
> I think you've certainly got the idea!
> 
> Thinking it through some more, if the kernel 

Re: your mail

2015-01-28 Thread atull
On Wed, 21 Jan 2015, Jason Gunthorpe wrote:

 [unfutzd the cc a bit, sorry]
 
 On Wed, Jan 21, 2015 at 04:19:17PM -0600, atull wrote:
   If we consider a Zynq, for instance, there are a number of clock nets
   that the CPU drives into the FPGA fabric. These nets are controlled by
   the kernel CLK framework. So, before we program the FPGA bitstream the
   clocks must be setup properly.
  
  It's pretty normal for drivers to find out what their clocks are from
  the DT and enable them.  
 
 Sure, but the clocks are bitfile specific, and not related to
 programming. Some bitfiles may not require CPU clocks at all.

The bitfile specific clocks are the clocks that are turned on by the device
driver for that chunk of the bitfile.  So those clocks can be specified
the same way as clocks are specied in the DT.

 
  Yes the DT overlay can specify:
* clock info
* firmware file name if user is doing it that way
* fpga manager - specific info
  * compatiblity string specifies what type of fpga it is
  * which fpga this image should go into
* fpga/processor bridges to enable
* driver(s) info that is dependent on the above
 
 All sounds reasonable
 
   Today in our Zynq systems we have the bootloader preconfigure
   everything for what we are trying to do - but that is specific to the
   particular FPGA we are expecting to run, and eg, I expect if we ran a
   kernel using the Zynq clk framework there would be problems with it
   mangling the configuration.
   
   So there would have to be some kind of sequence where the DT is
   loaded, the zynq specific FPGA programmer does its pre setup, then the
   request_firmarw/fpga_program_fw loads the bitstream and another pass
   for a zynq specific post setup and completion handshake?
  
  fpga-mgr.c has the concept that each different FPGA family will
  likely need its own way of doing these 3 steps:
   * write_init (prepare fgpa for receiving configuration information)
   * write (write configuration info to the fpga)
   * write_complete (done writing, put fpga into running mode)
  
  There are callbacks into the manufacturor/fpga family specific lower
  level driver to do these things (as part of the fpga_manager_ops
 
 I think the missing bit here is that there are bitfile specific things
 as well.
 
 The functions above are fine for a generic manufacturer bitfile loader,
 ie Xilinx GPIO twiddling, Altera JTAG, Zynq DMA, etc.
 
 But wrappered around that should be another set of functions that are
 bitfile specific.
 
 Like Zynq-PL-boot-protocol-v1 - which deasserts a reset line and waits
 for the PL to signal back that it has completed reset.
 
 Or jgg-boot-protocol-v1 which monitors the configuration GPIOs for a
 specific ready pattern..
 
 Or ... 
 
 All of those procedures depend on the bitfile to implement something.
 
   The DT needs to specify not only the bitstream programming HW to use
   but this ancillary programming protocol. There are many ways to do
   a out of reset and completion handshake on Zynq, for instance.
  
  Currently the lower level driver supports only one preferred method
  of programming.  I guess we could add an enumerated DT property to
  select programming protocol.  It would have to be manufacturor specific.
  Alternatively it could be encoded into the compatibility string if that
  makes sense.
 
 From a DT perspective I'd expect it to look something like:
 
 soc {
 
   // This is the 'how to program a bitstream'
   fpga-bitstream0: zynq_pl_dma 
   {
  compatible = xilinx,zynq,pl,dma;
  regs = ..
   }
 
   fpga: ..
   {
  // This is 'what is in the bitstream'
  boot-protocol = xilinx,zynq,protocol1;
  compatible = jgg,fpga-foo-bar;
  manager = @fpga-bitstream0
  clocks = ..
  clock-frequency = ...
 
  zynq_axi_gp0
  {
   // Settings for a CPU to FPGA AXI bridge
axi setting 1 = ...
[...]
  }
   }
 }

That's good.  It also needs to specify the driver(s) for the hardware that the
bitstream will instantiate.

 
 I could also see integrating with the regulator framework as well to
 power up FPGA specific controllable power supplies.
 
   And then user space would need to have control points between each of
   these steps.
 
  We could have two options, configurable from the ioctl:
  * When the DT is loaded, do everything
  * Even when the DT is loaded, wait for further instructions from ioctl or
  
  Freewheeling flow:
  * Tell ioctl that we are in freewheeling mode
  * Load DT overlay
  
  Tightly controlled flow:
  * Tell ioctl that we are doing things stepwise
  * Load DT overlay
  * Use ioctl to step through getting the fpga loaded and known to be happy
 
 I think you've certainly got the idea!
 
 Thinking it through some more, if the kernel DT tells the fpga-mgr
 that it is 
 
   boot-protocol = xilinx,zynq,protocol1,jgg,foo-bar;
 
 Then the kernel should refuse to start it if it doesn't know how to do
 both 'xilinx,zynq,protocol1' and 

Re: your mail

2015-01-22 Thread One Thousand Gnomes
> The functions above are fine for a generic manufacturer bitfile loader,
> ie Xilinx GPIO twiddling, Altera JTAG, Zynq DMA, etc.
> 
> But wrappered around that should be another set of functions that are
> bitfile specific.

And also a transport layer. You can have the same FPGA with the same
loader protocol off multiple different bus types (from USB to on CPU die
and all the way between).

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2015-01-22 Thread One Thousand Gnomes
 The functions above are fine for a generic manufacturer bitfile loader,
 ie Xilinx GPIO twiddling, Altera JTAG, Zynq DMA, etc.
 
 But wrappered around that should be another set of functions that are
 bitfile specific.

And also a transport layer. You can have the same FPGA with the same
loader protocol off multiple different bus types (from USB to on CPU die
and all the way between).

Alan
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2015-01-21 Thread Jason Gunthorpe
[unfutzd the cc a bit, sorry]

On Wed, Jan 21, 2015 at 04:19:17PM -0600, atull wrote:
> > If we consider a Zynq, for instance, there are a number of clock nets
> > that the CPU drives into the FPGA fabric. These nets are controlled by
> > the kernel CLK framework. So, before we program the FPGA bitstream the
> > clocks must be setup properly.
> 
> It's pretty normal for drivers to find out what their clocks are from
> the DT and enable them.  

Sure, but the clocks are bitfile specific, and not related to
programming. Some bitfiles may not require CPU clocks at all.

> Yes the DT overlay can specify:
>   * clock info
>   * firmware file name if user is doing it that way
>   * fpga manager - specific info
> * compatiblity string specifies what type of fpga it is
> * which fpga this image should go into
>   * fpga/processor bridges to enable
>   * driver(s) info that is dependent on the above

All sounds reasonable

> > Today in our Zynq systems we have the bootloader preconfigure
> > everything for what we are trying to do - but that is specific to the
> > particular FPGA we are expecting to run, and eg, I expect if we ran a
> > kernel using the Zynq clk framework there would be problems with it
> > mangling the configuration.
> > 
> > So there would have to be some kind of sequence where the DT is
> > loaded, the zynq specific FPGA programmer does its pre setup, then the
> > request_firmarw/fpga_program_fw loads the bitstream and another pass
> > for a zynq specific post setup and completion handshake?
> 
> fpga-mgr.c has the concept that each different FPGA family will
> likely need its own way of doing these 3 steps:
>  * write_init (prepare fgpa for receiving configuration information)
>  * write (write configuration info to the fpga)
>  * write_complete (done writing, put fpga into running mode)
> 
> There are callbacks into the manufacturor/fpga family specific lower
> level driver to do these things (as part of the "fpga_manager_ops"

I think the missing bit here is that there are bitfile specific things
as well.

The functions above are fine for a generic manufacturer bitfile loader,
ie Xilinx GPIO twiddling, Altera JTAG, Zynq DMA, etc.

But wrappered around that should be another set of functions that are
bitfile specific.

Like Zynq-PL-boot-protocol-v1 - which deasserts a reset line and waits
for the PL to signal back that it has completed reset.

Or jgg-boot-protocol-v1 which monitors the configuration GPIOs for a
specific ready pattern..

Or ... 

All of those procedures depend on the bitfile to implement something.

> > The DT needs to specify not only the bitstream programming HW to use
> > but this ancillary programming protocol. There are many ways to do
> > a out of reset and completion handshake on Zynq, for instance.
> 
> Currently the lower level driver supports only one preferred method
> of programming.  I guess we could add an enumerated DT property to
> select programming protocol.  It would have to be manufacturor specific.
> Alternatively it could be encoded into the compatibility string if that
> makes sense.

>From a DT perspective I'd expect it to look something like:

soc {

  // This is the 'how to program a bitstream'
  fpga-bitstream0: zynq_pl_dma 
  {
 compatible = "xilinx,zynq,pl,dma";
 regs = <..>
  }

  fpga: ..
  {
 // This is 'what is in the bitstream'
 boot-protocol = "xilinx,zynq,protocol1";
 compatible = "jgg,fpga-foo-bar";
 manager = @fpga-bitstream0
 clocks = ..
 clock-frequency = ...

 zynq_axi_gp0
 {
  // Settings for a CPU to FPGA AXI bridge
   axi setting 1 = ...
   [...]
 }
  }
}

I could also see integrating with the regulator framework as well to
power up FPGA specific controllable power supplies.

> > And then user space would need to have control points between each of
> > these steps.

> We could have two options, configurable from the ioctl:
> * When the DT is loaded, do everything
> * Even when the DT is loaded, wait for further instructions from ioctl or
> 
> Freewheeling flow:
> * Tell ioctl that we are in freewheeling mode
> * Load DT overlay
> 
> Tightly controlled flow:
> * Tell ioctl that we are doing things stepwise
> * Load DT overlay
> * Use ioctl to step through getting the fpga loaded and known to be happy

I think you've certainly got the idea!

Thinking it through some more, if the kernel DT tells the fpga-mgr
that it is 

  boot-protocol = "xilinx,zynq,protocol1","jgg,foo-bar";

Then the kernel should refuse to start it if it doesn't know how to do
both 'xilinx,zynq,protocol1' and 'jgg,foo-bar'.

Thus the user space ioctl interface becomes more of how to implement a
boot protocol helper in userspace? With the proper locking - while the
helper is working the FPGA cannot be messed with..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: your mail

2015-01-21 Thread Jason Gunthorpe
[unfutzd the cc a bit, sorry]

On Wed, Jan 21, 2015 at 04:19:17PM -0600, atull wrote:
  If we consider a Zynq, for instance, there are a number of clock nets
  that the CPU drives into the FPGA fabric. These nets are controlled by
  the kernel CLK framework. So, before we program the FPGA bitstream the
  clocks must be setup properly.
 
 It's pretty normal for drivers to find out what their clocks are from
 the DT and enable them.  

Sure, but the clocks are bitfile specific, and not related to
programming. Some bitfiles may not require CPU clocks at all.

 Yes the DT overlay can specify:
   * clock info
   * firmware file name if user is doing it that way
   * fpga manager - specific info
 * compatiblity string specifies what type of fpga it is
 * which fpga this image should go into
   * fpga/processor bridges to enable
   * driver(s) info that is dependent on the above

All sounds reasonable

  Today in our Zynq systems we have the bootloader preconfigure
  everything for what we are trying to do - but that is specific to the
  particular FPGA we are expecting to run, and eg, I expect if we ran a
  kernel using the Zynq clk framework there would be problems with it
  mangling the configuration.
  
  So there would have to be some kind of sequence where the DT is
  loaded, the zynq specific FPGA programmer does its pre setup, then the
  request_firmarw/fpga_program_fw loads the bitstream and another pass
  for a zynq specific post setup and completion handshake?
 
 fpga-mgr.c has the concept that each different FPGA family will
 likely need its own way of doing these 3 steps:
  * write_init (prepare fgpa for receiving configuration information)
  * write (write configuration info to the fpga)
  * write_complete (done writing, put fpga into running mode)
 
 There are callbacks into the manufacturor/fpga family specific lower
 level driver to do these things (as part of the fpga_manager_ops

I think the missing bit here is that there are bitfile specific things
as well.

The functions above are fine for a generic manufacturer bitfile loader,
ie Xilinx GPIO twiddling, Altera JTAG, Zynq DMA, etc.

But wrappered around that should be another set of functions that are
bitfile specific.

Like Zynq-PL-boot-protocol-v1 - which deasserts a reset line and waits
for the PL to signal back that it has completed reset.

Or jgg-boot-protocol-v1 which monitors the configuration GPIOs for a
specific ready pattern..

Or ... 

All of those procedures depend on the bitfile to implement something.

  The DT needs to specify not only the bitstream programming HW to use
  but this ancillary programming protocol. There are many ways to do
  a out of reset and completion handshake on Zynq, for instance.
 
 Currently the lower level driver supports only one preferred method
 of programming.  I guess we could add an enumerated DT property to
 select programming protocol.  It would have to be manufacturor specific.
 Alternatively it could be encoded into the compatibility string if that
 makes sense.

From a DT perspective I'd expect it to look something like:

soc {

  // This is the 'how to program a bitstream'
  fpga-bitstream0: zynq_pl_dma 
  {
 compatible = xilinx,zynq,pl,dma;
 regs = ..
  }

  fpga: ..
  {
 // This is 'what is in the bitstream'
 boot-protocol = xilinx,zynq,protocol1;
 compatible = jgg,fpga-foo-bar;
 manager = @fpga-bitstream0
 clocks = ..
 clock-frequency = ...

 zynq_axi_gp0
 {
  // Settings for a CPU to FPGA AXI bridge
   axi setting 1 = ...
   [...]
 }
  }
}

I could also see integrating with the regulator framework as well to
power up FPGA specific controllable power supplies.

  And then user space would need to have control points between each of
  these steps.

 We could have two options, configurable from the ioctl:
 * When the DT is loaded, do everything
 * Even when the DT is loaded, wait for further instructions from ioctl or
 
 Freewheeling flow:
 * Tell ioctl that we are in freewheeling mode
 * Load DT overlay
 
 Tightly controlled flow:
 * Tell ioctl that we are doing things stepwise
 * Load DT overlay
 * Use ioctl to step through getting the fpga loaded and known to be happy

I think you've certainly got the idea!

Thinking it through some more, if the kernel DT tells the fpga-mgr
that it is 

  boot-protocol = xilinx,zynq,protocol1,jgg,foo-bar;

Then the kernel should refuse to start it if it doesn't know how to do
both 'xilinx,zynq,protocol1' and 'jgg,foo-bar'.

Thus the user space ioctl interface becomes more of how to implement a
boot protocol helper in userspace? With the proper locking - while the
helper is working the FPGA cannot be messed with..

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-10-27 Thread Tejun Heo
Hello, Christoph.

On Wed, Oct 15, 2014 at 03:10:37AM -0500, Christoph Lameter wrote:
> Subject: Convert remaining __get_cpu_var uses
> 
> During the 3.18 merge period additional __get_cpu_var uses were
> added. The patch converts these to this_cpu_ptr().
> 
> [This does not address the powerpc issue where the conversion
> patches were routed directly to the powerpc maintainers but were
> not applied in the merge period. Will have to be handled separately]
> 
> Signed-off-by: Christoph Lameter 

Can you please repost with proper subject line and the subsys
maintainers cc'd?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-10-27 Thread Tejun Heo
Hello, Christoph.

On Wed, Oct 15, 2014 at 03:10:37AM -0500, Christoph Lameter wrote:
 Subject: Convert remaining __get_cpu_var uses
 
 During the 3.18 merge period additional __get_cpu_var uses were
 added. The patch converts these to this_cpu_ptr().
 
 [This does not address the powerpc issue where the conversion
 patches were routed directly to the powerpc maintainers but were
 not applied in the merge period. Will have to be handled separately]
 
 Signed-off-by: Christoph Lameter c...@linux.com

Can you please repost with proper subject line and the subsys
maintainers cc'd?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-09-01 Thread Dan Carpenter
No subject.

It should be a subject about adding spaces.

regards,
dan carpenter


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-09-01 Thread Dan Carpenter
On Mon, Sep 01, 2014 at 03:34:56PM +0800, sunwxg wrote:
> From: Sun Wang 
> 
> Subject: [PATCH] staging: unisys: visorutil: procobjecttree: fix coding style 
> issue 
> 

Your email headers are mangled.  The subject is vague.

regards,
dan carpenter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-09-01 Thread Dan Carpenter
On Mon, Sep 01, 2014 at 03:34:56PM +0800, sunwxg wrote:
 From: Sun Wang xiaoguang_wang5...@qq.com
 
 Subject: [PATCH] staging: unisys: visorutil: procobjecttree: fix coding style 
 issue 
 

Your email headers are mangled.  The subject is vague.

regards,
dan carpenter

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-09-01 Thread Dan Carpenter
No subject.

It should be a subject about adding spaces.

regards,
dan carpenter


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-07-09 Thread Mark Brown
On Wed, Jul 09, 2014 at 10:03:32AM +0900, James Ban wrote:

> > > + ret = regmap_read(chip->regmap, DA9211_REG_EVENT_B, _val);
> > > + if (ret < 0)
> > > + goto error_i2c;

> > > + if (reg_val & DA9211_E_OV_CURR_A) {

> > > + if (reg_val & DA9211_E_OV_CURR_B) {

> > > + return IRQ_HANDLED;

> > This is buggy - the driver should only return IRQ_HANDLED if it handled the
> > interrupt somehow, otherwise it should return IRQ_NONE and let the interrupt
> > core handle things.  This is especially important since the device appears 
> > to
> > require that interrupts are explicitly acknoweldged so if something is 
> > flagged
> > but not handled the interrupt will just sit constantly asserted.

> Basically all interrupts are masked when the chip wakes up. 
> Only two interrupts are unmasked at the start of driver like below.

I know that's the intention but the code should still be written
robustly - something might go wrong somewhere which causes another
interrupt to be enabled, or we might even gain support for shared
threaded interrupts in the interrupt core and someone could then
try to use that in a system.


signature.asc
Description: Digital signature


Re: your mail

2014-07-09 Thread Mark Brown
On Wed, Jul 09, 2014 at 10:03:32AM +0900, James Ban wrote:

   + ret = regmap_read(chip-regmap, DA9211_REG_EVENT_B, reg_val);
   + if (ret  0)
   + goto error_i2c;

   + if (reg_val  DA9211_E_OV_CURR_A) {

   + if (reg_val  DA9211_E_OV_CURR_B) {

   + return IRQ_HANDLED;

  This is buggy - the driver should only return IRQ_HANDLED if it handled the
  interrupt somehow, otherwise it should return IRQ_NONE and let the interrupt
  core handle things.  This is especially important since the device appears 
  to
  require that interrupts are explicitly acknoweldged so if something is 
  flagged
  but not handled the interrupt will just sit constantly asserted.

 Basically all interrupts are masked when the chip wakes up. 
 Only two interrupts are unmasked at the start of driver like below.

I know that's the intention but the code should still be written
robustly - something might go wrong somewhere which causes another
interrupt to be enabled, or we might even gain support for shared
threaded interrupts in the interrupt core and someone could then
try to use that in a system.


signature.asc
Description: Digital signature


Re: your mail

2014-02-25 Thread Will Deacon
On Tue, Feb 25, 2014 at 11:20:11AM +, srikanth TS wrote:
> 
> On Feb 25, 2014 2:28 AM, "Will Deacon" 
> mailto:will.dea...@arm.com>> wrote:
> >
> > On Mon, Feb 24, 2014 at 03:12:21PM +, srikanth TS wrote:
> > > Hi Will Deacon,
> >
> > Hello,
> >
> > > Currently SMMU driver expecting all stream ID used by respective master
> > > should be defined in the DT.
> > >
> > > We want to know how to handle in the case of virtual functions dynamically
> > > created and destroyed.
> > >
> > > Is PCI driver responsible for creating stream ID respective BDand
> > > requesting SMMU to add to the mapping table[stream Id to context mapping
> > > table]?
> > >
> > > Or is there any right way of doing it?
> >
> > Correct, the driver currently doesn't support dynamic mappings (mainly
> > because I didn't want to try and invent something that I couldn't test).
> >
> > There are a couple of ways to solve this:
> >
> >   (1) Add a way for a PCI RC to dynamically allocate StreamIDs on an SMMU
> >   within a fixed range. That would probably need some code in the bus
> >   layer, so that a bus notifier can kick and call back to the relevant
> >   SMMU.
> 
> I think first way of solving seems to be better, because we don't know how 
> many
> 
> VF are used and i feel its not good idea to keep whole list of streamID 
> [which is
> 
> equal to max num vf] in DT. Again in this method we need to generate the 
> stream ID
> 
> dynamically whenever VF is added in pci iov driver side. And then pass that
> 
> stream ID to SMMU.
> 
> Is it ok this way?  Or you prefer 2nd way which is simpler.

I'm happy either way, but I'd need to see some patches before I can merge
anything ;)

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: your mail

2014-02-25 Thread Varun Sethi


> -Original Message-
> From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
> boun...@lists.linux-foundation.org] On Behalf Of Will Deacon
> Sent: Monday, February 24, 2014 10:59 PM
> To: srikanth TS
> Cc: io...@lists.linux-foundation.org; sungjinn.ch...@samsung.com; linux-
> ker...@vger.kernel.org; ts.srika...@samsung.com
> Subject: Re: your mail
> 
> On Mon, Feb 24, 2014 at 03:12:21PM +, srikanth TS wrote:
> > Hi Will Deacon,
> 
> Hello,
> 
> > Currently SMMU driver expecting all stream ID used by respective
> > master should be defined in the DT.
> >
> > We want to know how to handle in the case of virtual functions
> > dynamically created and destroyed.
> >
> > Is PCI driver responsible for creating stream ID respective BDand
> > requesting SMMU to add to the mapping table[stream Id to context
> > mapping table]?
> >
> > Or is there any right way of doing it?
> 
> Correct, the driver currently doesn't support dynamic mappings (mainly
> because I didn't want to try and invent something that I couldn't test).
> 
> There are a couple of ways to solve this:
> 
>   (1) Add a way for a PCI RC to dynamically allocate StreamIDs on an SMMU
>   within a fixed range. That would probably need some code in the bus
>   layer, so that a bus notifier can kick and call back to the
> relevant
>   SMMU.
This could be done in add device notifier. I am working on similar(not PCI) hot 
plug device infrastructure for arm smmu driver. I will post an RFC patch by 
next week.

-Varun

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: your mail

2014-02-25 Thread Varun Sethi


 -Original Message-
 From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
 boun...@lists.linux-foundation.org] On Behalf Of Will Deacon
 Sent: Monday, February 24, 2014 10:59 PM
 To: srikanth TS
 Cc: io...@lists.linux-foundation.org; sungjinn.ch...@samsung.com; linux-
 ker...@vger.kernel.org; ts.srika...@samsung.com
 Subject: Re: your mail
 
 On Mon, Feb 24, 2014 at 03:12:21PM +, srikanth TS wrote:
  Hi Will Deacon,
 
 Hello,
 
  Currently SMMU driver expecting all stream ID used by respective
  master should be defined in the DT.
 
  We want to know how to handle in the case of virtual functions
  dynamically created and destroyed.
 
  Is PCI driver responsible for creating stream ID respective BDand
  requesting SMMU to add to the mapping table[stream Id to context
  mapping table]?
 
  Or is there any right way of doing it?
 
 Correct, the driver currently doesn't support dynamic mappings (mainly
 because I didn't want to try and invent something that I couldn't test).
 
 There are a couple of ways to solve this:
 
   (1) Add a way for a PCI RC to dynamically allocate StreamIDs on an SMMU
   within a fixed range. That would probably need some code in the bus
   layer, so that a bus notifier can kick and call back to the
 relevant
   SMMU.
This could be done in add device notifier. I am working on similar(not PCI) hot 
plug device infrastructure for arm smmu driver. I will post an RFC patch by 
next week.

-Varun

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-02-25 Thread Will Deacon
On Tue, Feb 25, 2014 at 11:20:11AM +, srikanth TS wrote:
 
 On Feb 25, 2014 2:28 AM, Will Deacon 
 will.dea...@arm.commailto:will.dea...@arm.com wrote:
 
  On Mon, Feb 24, 2014 at 03:12:21PM +, srikanth TS wrote:
   Hi Will Deacon,
 
  Hello,
 
   Currently SMMU driver expecting all stream ID used by respective master
   should be defined in the DT.
  
   We want to know how to handle in the case of virtual functions dynamically
   created and destroyed.
  
   Is PCI driver responsible for creating stream ID respective BDand
   requesting SMMU to add to the mapping table[stream Id to context mapping
   table]?
  
   Or is there any right way of doing it?
 
  Correct, the driver currently doesn't support dynamic mappings (mainly
  because I didn't want to try and invent something that I couldn't test).
 
  There are a couple of ways to solve this:
 
(1) Add a way for a PCI RC to dynamically allocate StreamIDs on an SMMU
within a fixed range. That would probably need some code in the bus
layer, so that a bus notifier can kick and call back to the relevant
SMMU.
 
 I think first way of solving seems to be better, because we don't know how 
 many
 
 VF are used and i feel its not good idea to keep whole list of streamID 
 [which is
 
 equal to max num vf] in DT. Again in this method we need to generate the 
 stream ID
 
 dynamically whenever VF is added in pci iov driver side. And then pass that
 
 stream ID to SMMU.
 
 Is it ok this way?  Or you prefer 2nd way which is simpler.

I'm happy either way, but I'd need to see some patches before I can merge
anything ;)

Will
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-02-24 Thread Will Deacon
On Mon, Feb 24, 2014 at 03:12:21PM +, srikanth TS wrote:
> Hi Will Deacon,

Hello,

> Currently SMMU driver expecting all stream ID used by respective master
> should be defined in the DT.
> 
> We want to know how to handle in the case of virtual functions dynamically
> created and destroyed.
> 
> Is PCI driver responsible for creating stream ID respective BDand
> requesting SMMU to add to the mapping table[stream Id to context mapping
> table]?
> 
> Or is there any right way of doing it?

Correct, the driver currently doesn't support dynamic mappings (mainly
because I didn't want to try and invent something that I couldn't test).

There are a couple of ways to solve this:

  (1) Add a way for a PCI RC to dynamically allocate StreamIDs on an SMMU
  within a fixed range. That would probably need some code in the bus
  layer, so that a bus notifier can kick and call back to the relevant
  SMMU.

  (2) Describe the RID -> SID mapping in the device-tree. We probably want
  to avoid an enormous table, so this would only work for simple `SID =
  RID + offset' or 'SID = RID & mask' cases.

How do your IDs map to each other?

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: your mail

2014-02-24 Thread Will Deacon
On Mon, Feb 24, 2014 at 03:12:21PM +, srikanth TS wrote:
 Hi Will Deacon,

Hello,

 Currently SMMU driver expecting all stream ID used by respective master
 should be defined in the DT.
 
 We want to know how to handle in the case of virtual functions dynamically
 created and destroyed.
 
 Is PCI driver responsible for creating stream ID respective BDand
 requesting SMMU to add to the mapping table[stream Id to context mapping
 table]?
 
 Or is there any right way of doing it?

Correct, the driver currently doesn't support dynamic mappings (mainly
because I didn't want to try and invent something that I couldn't test).

There are a couple of ways to solve this:

  (1) Add a way for a PCI RC to dynamically allocate StreamIDs on an SMMU
  within a fixed range. That would probably need some code in the bus
  layer, so that a bus notifier can kick and call back to the relevant
  SMMU.

  (2) Describe the RID - SID mapping in the device-tree. We probably want
  to avoid an enormous table, so this would only work for simple `SID =
  RID + offset' or 'SID = RID  mask' cases.

How do your IDs map to each other?

Will
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >