Re: [PATCH 1/1] rtc: fix type information of rtc-proc

2015-11-11 Thread Leizhen (ThunderTown)


On 2015/11/11 18:54, Alexandre Belloni wrote:
> On 11/11/2015 at 09:06:51 +0800, Leizhen (ThunderTown) wrote :
>> Hi, all
>>
>> I'm sorry. Maybe I didn't describe it clearly enough before. These words are
>> ultimately shown to the end user. The end user may not be a programmer, so an
>> abbreviated word is unsuitable.
>>
> 
> Yes, that is exactly my point. What if an end user currently has a
> program parsing the file and looking for alrm_time or alrm_date? After
> updating his kernel, the program won't work anymore which is something
> we don't want.

OK. I see. Thanks.

> 
>>
>> cat /proc/driver/rtc
>>
>> rtc_time: 00:47:43
>> rtc_date: 2015-11-11
>> alrm_time   : 03:27:58   //alrm_time --> alarm_time
>> alrm_date   : 2015-10-08 //alrm_date --> alarm_date
>> alarm_IRQ   : no
>> alrm_pending: no //alrm_pending --> alarm_pending
>> update IRQ enabled  : no
>>
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] rtc: fix type information of rtc-proc

2015-11-10 Thread Leizhen (ThunderTown)
Hi, all

I'm sorry. Maybe I didn't describe it clearly enough before. These words are
ultimately shown to the end user. The end user may not be a programmer, so an
abbreviated word is unsuitable.


cat /proc/driver/rtc

rtc_time: 00:47:43
rtc_date: 2015-11-11
alrm_time   : 03:27:58   //alrm_time --> alarm_time
alrm_date   : 2015-10-08 //alrm_date --> alarm_date
alarm_IRQ   : no
alrm_pending: no         //alrm_pending --> alarm_pending
update IRQ enabled  : no


On 2015/10/8 17:47, Zhen Lei wrote:
> Display the whole word of "alarm", make it look more comfortable.
> 
> Signed-off-by: Zhen Lei 
> ---
>  drivers/rtc/rtc-proc.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/rtc/rtc-proc.c b/drivers/rtc/rtc-proc.c
> index ffa69e1..ef83f34 100644
> --- a/drivers/rtc/rtc-proc.c
> +++ b/drivers/rtc/rtc-proc.c
> @@ -58,7 +58,7 @@ static int rtc_proc_show(struct seq_file *seq, void *offset)
> 
>   err = rtc_read_alarm(rtc, &alrm);
>   if (err == 0) {
> - seq_printf(seq, "alrm_time\t: ");
> + seq_printf(seq, "alarm_time\t: ");
>   if ((unsigned int)alrm.time.tm_hour <= 24)
>   seq_printf(seq, "%02d:", alrm.time.tm_hour);
>   else
> @@ -72,7 +72,7 @@ static int rtc_proc_show(struct seq_file *seq, void *offset)
>   else
>   seq_printf(seq, "**\n");
> 
> - seq_printf(seq, "alrm_date\t: ");
> + seq_printf(seq, "alarm_date\t: ");
>   if ((unsigned int)alrm.time.tm_year <= 200)
>   seq_printf(seq, "%04d-", alrm.time.tm_year + 1900);
>   else
> @@ -87,7 +87,7 @@ static int rtc_proc_show(struct seq_file *seq, void *offset)
>   seq_printf(seq, "**\n");
>   seq_printf(seq, "alarm_IRQ\t: %s\n",
>   alrm.enabled ? "yes" : "no");
> - seq_printf(seq, "alrm_pending\t: %s\n",
> + seq_printf(seq, "alarm_pending\t: %s\n",
>   alrm.pending ? "yes" : "no");
>   seq_printf(seq, "update IRQ enabled\t: %s\n",
>   (rtc->uie_rtctimer.enabled) ? "yes" : "no");
> --
> 2.5.0
> 
> 
> 
> .
> 



Re: [PATCH 2/2] arm64: to allow EFI_RTC can be selected on ARM64

2015-10-08 Thread Leizhen (ThunderTown)


On 2015/9/28 17:44, Leizhen (ThunderTown) wrote:
>> > 
>>> >> --drivers/char/Kconfig--
>>> >> if RTC_LIB=n
>>> >>
>>> >> config RTC
>>> >> tristate "Enhanced Real Time Clock Support (legacy PC RTC driver)"
>>> >>
>>> >> ...
>>> >>
>>> >> config EFI_RTC
>>> >> bool "EFI Real Time Clock Services"
>>> >> depends on IA64 || ARM64
>>> >>
>>> >> ...
>>> >>
>>> >> endif # RTC_LIB
>> > 
>> > The driver you want is RTC_DRV_EFI, not EFI_RTC.
> OK, I will try it tomorrow.

Sorry for the delay; I have been taking driving lessons these days. I enabled
RTC_DRV_EFI and the RTC worked fine, thanks a lot.

> 
>> > 
>> >Arnd
>> > 
>> > .
>> > 



Re: [PATCH 1/1] of: to support binding numa node to root subnode(non-bus)

2015-08-24 Thread Leizhen (ThunderTown)


On 2015/8/24 21:25, Rob Herring wrote:
 +benh
 
 On Mon, Aug 24, 2015 at 7:30 AM, Zhen Lei thunder.leiz...@huawei.com wrote:
 If of_platform_populate is used to scan dt-nodes and add devices, then for a
 subnode of root (such as /smmu), when it is scanned and we invoke
 
 You should have a bus as the sub-node of root rather than devices
 directly off of root. You still have a problem though...

But actually the parent of the bus is also platform_bus if we don't do any
special initialization. For example: the function of_platform_device_create_pdata
invokes of_device_alloc first, then invokes of_device_add. And in
of_device_alloc, we can find that:
	dev->dev.parent = parent ? : platform_bus;

 
 of_device_add, the ofdev->dev.parent is always equal to platform_bus. So
 the function set_dev_node will not be called. And in device_add,
 dev_to_node(parent) always returns NUMA_NO_NODE.

 Signed-off-by: Zhen Lei thunder.leiz...@huawei.com
 ---
  drivers/base/core.c | 2 +-
  drivers/of/device.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

 diff --git a/drivers/base/core.c b/drivers/base/core.c
 index dafae6d..5df4f46b 100644
 --- a/drivers/base/core.c
 +++ b/drivers/base/core.c
 @@ -1017,7 +1017,7 @@ int device_add(struct device *dev)
 dev->kobj.parent = kobj;

 /* use parent numa_node */
 -   if (parent)
 +   if (parent && (parent != platform_bus))
 
 This is only fixing one specific case, but I think things are broken
 for any case where the NUMA associativity if not set at the top level
 bus node. I think this should be something like:
 
 if (parent && (dev_to_node(dev) != NO_NUMA_NODE))

It seems there is a mistake; we should use the equality test:
if (parent && (dev_to_node(dev) == NUMA_NO_NODE))

 
 Then the OF code can set the node however it wants.

OK. I will send patch v2 based on your advice. Thank you.

 
 set_dev_node(dev, dev_to_node(parent));

 /* first, register with generic layer. */
 diff --git a/drivers/of/device.c b/drivers/of/device.c
 index 8b91ea2..96ebece 100644
 --- a/drivers/of/device.c
 +++ b/drivers/of/device.c
 @@ -63,7 +63,7 @@ int of_device_add(struct platform_device *ofdev)
 /* device_add will assume that this device is on the same node as
  * the parent. If there is no parent defined, set the node
  * explicitly */
 -   if (!ofdev->dev.parent)
 +   if (!ofdev->dev.parent || (ofdev->dev.parent == platform_bus))
 
 And then remove the if here.
 

OK. I also think removing this statement will be better. Although set_dev_node
may then be called twice, it takes very little time, and this almost always
happens during the initialization phase.

 set_dev_node(&ofdev->dev, of_node_to_nid(ofdev->dev.of_node));

 return device_add(ofdev-dev);
 --
 2.5.0


 --
 To unsubscribe from this list: send the line unsubscribe devicetree in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 .
 



Re: [PATCH v2 0/1] of: to support binding numa node to specified device

2015-09-10 Thread Leizhen (ThunderTown)
Sorry, I missed the version number in the title.


On 2015/8/25 12:08, Zhen Lei wrote:
> Changelog:
> v1 -> v2:
> In patch v1, binding a numa node to a specified device only took effect for
> dt-nodes directly under root. Patch v2 removes this limitation; we can bind a
> numa node to any specified device in the devicetree.
> 
> Zhen Lei (1):
>   of: to support binding numa node to specified device in devicetree
> 
>  drivers/base/core.c |  2 +-
>  drivers/of/device.c | 11 ++-
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 



Re: [PATCH v2 1/1] of: to support binding numa node to specified device in devicetree

2015-09-10 Thread Leizhen (ThunderTown)
Hi all,

Can somebody take a few moments to review it? This patch is
quite small; it only changes two lines.

Thanks,
Thunder.

On 2015/8/25 12:08, Zhen Lei wrote:
> For now, in function device_add, the new device will be forced to
> inherit the numa node of its parent. But this will override the device's
> numa node which is configured in the devicetree.
> 
> Signed-off-by: Zhen Lei 
> ---
>  drivers/base/core.c |  2 +-
>  drivers/of/device.c | 11 ++-
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index dafae6d..e06de82 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -1017,7 +1017,7 @@ int device_add(struct device *dev)
>   dev->kobj.parent = kobj;
> 
>   /* use parent numa_node */
> - if (parent)
> + if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
>   set_dev_node(dev, dev_to_node(parent));
> 
>   /* first, register with generic layer. */
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index 8b91ea2..e5f47ce 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -60,11 +60,12 @@ int of_device_add(struct platform_device *ofdev)
>   ofdev->name = dev_name(&ofdev->dev);
>   ofdev->id = -1;
> 
> - /* device_add will assume that this device is on the same node as
> -  * the parent. If there is no parent defined, set the node
> -  * explicitly */
> - if (!ofdev->dev.parent)
> - set_dev_node(&ofdev->dev, of_node_to_nid(ofdev->dev.of_node));
> + /*
> +  * If this device has no numa node bound in the devicetree, that is,
> +  * of_node_to_nid returns NUMA_NO_NODE, device_add will assume that this
> +  * device is on the same node as the parent.
> +  */
> + set_dev_node(&ofdev->dev, of_node_to_nid(ofdev->dev.of_node));
> 
>   return device_add(&ofdev->dev);
>  }
> --
> 2.5.0
> 
> 
> 
> .
> 



Re: [PATCH 2/2] arm64: to allow EFI_RTC can be selected on ARM64

2015-09-28 Thread Leizhen (ThunderTown)


On 2015/9/28 15:35, Arnd Bergmann wrote:
> On Monday 28 September 2015 13:34:38 Zhen Lei wrote:
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 07d1811..25cec57 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -85,7 +85,7 @@ config ARM64
>> select PERF_USE_VMALLOC
>> select POWER_RESET
>> select POWER_SUPPLY
>> -   select RTC_LIB
>> +   select RTC_LIB if !EFI
>> select SPARSE_IRQ
>> select SYSCTL_EXCEPTION_TRACE
>> select HAVE_CONTEXT_TRACKING
> 
> Sorry, we can't do that: enabling EFI has to be done in a way that it only
> adds features but not disables them.

I ran "make ARCH=arm64 menuconfig" and found that RTC_CLASS is selected by
default. Actually, RTC_LIB only controls whether some configs are displayed when
running "make menuconfig". I list all the information below:

-make ARCH=arm64 menuconfig-
  [*] Real Time Clock  --->

-drivers/rtc/Kconfig---
menuconfig RTC_CLASS
bool "Real Time Clock"
default n
depends on !S390 && !UML
select RTC_LIB

---
find . -name "*Kconfig*" | xargs grep RTC_LIB
./drivers/rtc/Kconfig:config RTC_LIB
./drivers/rtc/Kconfig:  select RTC_LIB
./drivers/char/Kconfig:if RTC_LIB=n
./drivers/char/Kconfig:endif # RTC_LIB
./arch/x86/Kconfig: select RTC_LIB
./arch/arm/Kconfig: select RTC_LIB
./arch/arm64/Kconfig:   select RTC_LIB if !EFI
./arch/sh/Kconfig:  select RTC_LIB
./arch/mips/Kconfig:select RTC_LIB if !MACH_LOONGSON64

--drivers/char/Kconfig--
if RTC_LIB=n

config RTC
tristate "Enhanced Real Time Clock Support (legacy PC RTC driver)"

...

endif # RTC_LIB


> 
> Your patch breaks RTC on all non-EFI platforms as soon as CONFIG_EFI
> is selected by the user.

No, on non-EFI platforms, they can still use the RTC as before. As I mentioned
above, RTC_LIB only controls whether some configs are displayed when running
"make menuconfig". On ARM64, (in this patch) I only allowed EFI_RTC to be shown
when RTC_LIB was not selected.

--drivers/char/Kconfig--
if RTC_LIB=n

config RTC
tristate "Enhanced Real Time Clock Support (legacy PC RTC driver)"

...

config EFI_RTC
bool "EFI Real Time Clock Services"
depends on IA64 || ARM64

...

endif # RTC_LIB

> 
>   Arnd
> 
> .
> 



Re: [PATCH 2/2] arm64: to allow EFI_RTC can be selected on ARM64

2015-09-28 Thread Leizhen (ThunderTown)


On 2015/9/28 16:42, Arnd Bergmann wrote:
> On Monday 28 September 2015 16:29:57 Leizhen wrote:
>>
>> On 2015/9/28 15:35, Arnd Bergmann wrote:
>>> On Monday 28 September 2015 13:34:38 Zhen Lei wrote:
 diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
 index 07d1811..25cec57 100644
 --- a/arch/arm64/Kconfig
 +++ b/arch/arm64/Kconfig
 @@ -85,7 +85,7 @@ config ARM64
 select PERF_USE_VMALLOC
 select POWER_RESET
 select POWER_SUPPLY
 -   select RTC_LIB
 +   select RTC_LIB if !EFI
 select SPARSE_IRQ
 select SYSCTL_EXCEPTION_TRACE
 select HAVE_CONTEXT_TRACKING
>>>
>>> Sorry, we can't do that: enabling EFI has to be done in a way that it only
>>> adds features but not disables them.
>>
>> I ran "make ARCH=arm64 menuconfig" and found that RTC_CLASS is selected by
>> default. Actually, RTC_LIB only controls whether some configs are displayed
>> when running "make menuconfig". I list all the information below:
>>
>> -make ARCH=arm64 menuconfig-
>>   [*] Real Time Clock  --->
>>
>> -drivers/rtc/Kconfig---
>> menuconfig RTC_CLASS
>> bool "Real Time Clock"
>> default n
>> depends on !S390 && !UML
>> select RTC_LIB
> 
> Ok, I see. So your patch here has no effect at all and can be dropped, or
> we can remove the 'select RTC_LIB' without the EFI dependency.

Oh, I described the reason in the reply to Ard Biesheuvel.

https://lkml.org/lkml/2015/9/28/124

> 
>> ---
>> find . -name "*Kconfig*" | xargs grep RTC_LIB
>> ./drivers/rtc/Kconfig:config RTC_LIB
>> ./drivers/rtc/Kconfig:   select RTC_LIB
>> ./drivers/char/Kconfig:if RTC_LIB=n
>> ./drivers/char/Kconfig:endif # RTC_LIB
>> ./arch/x86/Kconfig:  select RTC_LIB
>> ./arch/arm/Kconfig:  select RTC_LIB
>> ./arch/arm64/Kconfig:select RTC_LIB if !EFI
>> ./arch/sh/Kconfig:   select RTC_LIB
>> ./arch/mips/Kconfig: select RTC_LIB if !MACH_LOONGSON64
>>
>> --drivers/char/Kconfig--
>> if RTC_LIB=n
>>
>> config RTC
>> tristate "Enhanced Real Time Clock Support (legacy PC RTC driver)"
>>
>> ...
>>
>> endif # RTC_LIB
>>
>>
>>>
>>> Your patch breaks RTC on all non-EFI platforms as soon as CONFIG_EFI
>>> is selected by the user.
>>
>> No, on non-EFI platforms, they can still use the RTC as before. As I
>> mentioned above, RTC_LIB only controls whether some configs are displayed
>> when running "make menuconfig". On ARM64, (in this patch) I only allowed
>> EFI_RTC to be shown when RTC_LIB was not selected.
>>
> 
> but that is the wrong driver that uses the legacy API, we cannot have that
> on ARM because it conflicts with the normal RTC_CLASS drivers.

Yes, RTC_CLASS automatically selects RTC_LIB, and EFI_RTC will not be displayed,
because RTC_LIB=y then.

We can select EFI_RTC only when RTC_CLASS is not selected (and thus RTC_LIB=n).

> 
>> --drivers/char/Kconfig--
>> if RTC_LIB=n
>>
>> config RTC
>> tristate "Enhanced Real Time Clock Support (legacy PC RTC driver)"
>>
>> ...
>>
>> config EFI_RTC
>> bool "EFI Real Time Clock Services"
>> depends on IA64 || ARM64
>>
>> ...
>>
>> endif # RTC_LIB
> 
> The driver you want is RTC_DRV_EFI, not EFI_RTC.

OK, I will try it tomorrow.

> 
>   Arnd
> 
> .
> 



Re: [PATCH 2/2] arm64: to allow EFI_RTC can be selected on ARM64

2015-09-28 Thread Leizhen (ThunderTown)


On 2015/9/28 15:40, Ard Biesheuvel wrote:
> On 28 September 2015 at 06:34, Zhen Lei  wrote:
>> Now, ARM64 also supports EFI boot. We hope to use EFI runtime services
>> to get/set the current time and date.
>>
>> RTC_LIB only controls some configs in drivers/char/Kconfig (including
>> EFI_RTC), and it is automatically selected when RTC_CLASS is enabled. So
>> this patch has no functional change but gives an opportunity to select
>> EFI_RTC when RTC_CLASS is disabled.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/Kconfig | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 07d1811..25cec57 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -85,7 +85,7 @@ config ARM64
>> select PERF_USE_VMALLOC
>> select POWER_RESET
>> select POWER_SUPPLY
>> -   select RTC_LIB
>> +   select RTC_LIB if !EFI
>> select SPARSE_IRQ
>> select SYSCTL_EXCEPTION_TRACE
>> select HAVE_CONTEXT_TRACKING
> 
> You can currently enable EFI_RTC just fine on arm64 when EFI is enabled.
> Why exactly do you need this patch on top?

Because when we run "make ARCH=arm64 menuconfig", RTC_LIB is always selected,
and we have no opportunity to deselect it. EFI_RTC can be displayed only when
RTC_LIB=n.

-drivers/rtc/Kconfig---
config RTC_LIB
bool

menuconfig RTC_CLASS
bool "Real Time Clock"
default n
depends on !S390 && !UML
select RTC_LIB

--drivers/char/Kconfig--
if RTC_LIB=n

..

config EFI_RTC
bool "EFI Real Time Clock Services"
depends on IA64 || ARM64

...

endif # RTC_LIB

> 
> .
> 



Re: [PATCH v4 00/14] fix some type infos and bugs for arm64/of numa

2016-06-08 Thread Leizhen (ThunderTown)


On 2016/6/7 21:58, Will Deacon wrote:
> On Tue, Jun 07, 2016 at 04:08:04PM +0800, Zhen Lei wrote:
>> v3 -> v4:
>> 1. Packed three patches of Kefeng Wang, patch6-8.
>> 2. Add 6 new patches(9-15) to enhance the numa on arm64.
>>
>> v2 -> v3:
>> 1. Adjust patch2 and patch5 according to Matthias Brugger's advice, to make 
>> the
>>patches looks more well. The final code have no change. 
>>
>> v1 -> v2:
>> 1. Base on https://lkml.org/lkml/2016/5/24/679
> 
> If you want bug fixes to land in 4.7, you'll need to base them on a
> mainline kernel.

I heard that David Daney's ACPI NUMA patch series was accepted and put into the
next branch (Linux 4.8). Otherwise I would have suggested he send his patches 6-7
to mainline first, so that only a very small conflict would exist.

I also tested the following:
1. git am David Daney's patches 6-7, then git am all of my patches on a branch,
named branch A.
2. git am David Daney's patches 6-7 on another branch, named branch B.
3. When I git merge branch B into branch A, it still conflicts. So I guess git
merge works on the source code, rather than on patches.

So at present, unless the maintainers are willing to resolve the conflict,
updating my patches will not help.

Fortunately, these patches are not particularly urgent, so I think I can wait
until the Linux 4.8 cycle starts and then send these patches again. But I'm not
sure whether these patches can be merged into Linux 4.8; I really hope they can.

> 
> Will
> 
> .
> 



Re: [PATCH v4 11/14] arm64/numa: support HAVE_MEMORYLESS_NODES

2016-06-08 Thread Leizhen (ThunderTown)


On 2016/6/8 12:45, Ganapatrao Kulkarni wrote:
> On Wed, Jun 8, 2016 at 7:46 AM, Leizhen (ThunderTown)
> <thunder.leiz...@huawei.com> wrote:
>>
>>
>> On 2016/6/7 22:01, Ganapatrao Kulkarni wrote:
>>> On Tue, Jun 7, 2016 at 6:27 PM, Leizhen (ThunderTown)
>>> <thunder.leiz...@huawei.com> wrote:
>>>>
>>>>
>>>> On 2016/6/7 16:31, Ganapatrao Kulkarni wrote:
>>>>> On Tue, Jun 7, 2016 at 1:38 PM, Zhen Lei <thunder.leiz...@huawei.com> 
>>>>> wrote:
>>>>>> Some numa nodes may have no memory. For example:
>>>>>> 1. cpu0 on node0
>>>>>> 2. cpu1 on node1
>>>>>> 3. device0 accessing the memory from node0 and node1 takes the same time.
>>>>>
>>>>> i am wondering, if access to both nodes is same, then why you need numa.
>>>>> the example you are quoting is against the basic principle of "numa"
>>>>> what is device0 here? cpu?
>>>> The device0 can also be a cpu. I drew a simple diagram:
>>>>
>>>>   cpu0      cpu1       cpu2/device0
>>>>    |          |             |
>>>>    |          |             |
>>>>  DDR0       DDR1     No DIMM slots or no DIMM plugged
>>>> (node0)    (node1)       (node2)
>>>>
>>>
>>> thanks for the clarification. your example is for 3 node system, where
>>> third node is memory less node.
>>> do you see any issue in supporting this topology with existing code?
>> If HAVE_MEMORYLESS_NODES is enabled, it will pick the nearest node for the
>> cpus on a memoryless node.
> 
> i see couple of arch enabled HAVE_MEMORYLESS_NODES, but i don't see
> any code in arch specific numa code for this
> is that means the core code will take care of this?
I just spent some time reading the implementation of HAVE_MEMORYLESS_NODES on
PPC and IA64. For the NODE_DATA initialization, the IA64 approach is similar to
mine, but PPC has no special handling; it is similar to yours. I think the PPC
developers need to fix it.

I picked the code on IA64 as below:
static void __init *memory_less_node_alloc(int nid, unsigned long pernodesize)
{
	void *ptr = NULL;
	u8 best = 0xff;
	int bestnode = -1, node, anynode = 0;

	for_each_online_node(node) {
		if (node_isset(node, memory_less_mask))
			continue;
		else if (node_distance(nid, node) < best) {
			best = node_distance(nid, node);
			bestnode = node;
		}
		anynode = node;
	}

	if (bestnode == -1)
		bestnode = anynode;

	ptr = __alloc_bootmem_node(pgdat_list[bestnode], pernodesize,
				   PERCPU_PAGE_SIZE, __pa(MAX_DMA_ADDRESS));

	return ptr;
}

/**
 * memory_less_nodes - allocate and initialize CPU only nodes pernode
 *	information.
 */
static void __init memory_less_nodes(void)
{
	unsigned long pernodesize;
	void *pernode;
	int node;

	for_each_node_mask(node, memory_less_mask) {
		pernodesize = compute_pernodesize(node);
		pernode = memory_less_node_alloc(node, pernodesize);
		fill_pernode(node, __pa(pernode), pernodesize);
	}

	return;
}



> 
>>
>> For example, in include/linux/topology.h:
>> #ifdef CONFIG_HAVE_MEMORYLESS_NODES
>> ...
>> static inline int cpu_to_mem(int cpu)
>> {
>> 	return per_cpu(_numa_mem_, cpu);
>> }
>> ...
>> #else
>> ...
>> static inline int cpu_to_mem(int cpu)
>> {
>> 	return cpu_to_node(cpu);
>> }
>> ...
>> #endif
>>
>>> I think, this use case should be supported with present code.
>>>
>>>>>>
>>>>>> So, we can not simply classify device0 to node0 or node1, but we can
>>>>>> define a node2 which distances to node0 and node1 are the same.
>>>>>>
>>>>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>>>>> ---
>>>>>>  arch/arm64/Kconfig  |  4 
>>>>>>  arch/arm64/kernel/smp.c |  1 +
>>>>>>  arch/arm64/mm/numa.c| 43 +--
>>>>>>  3 files changed, 46 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>>>> index 05c1bf1..5904a62 100644
>>>>>> --- a/arch/arm64/Kconfig
>>>>>> +++ b/arch/arm64/Kconfig
>>>>>> @@ -581,6 +581,10 @@ c

Re: [PATCH v4 11/14] arm64/numa: support HAVE_MEMORYLESS_NODES

2016-06-07 Thread Leizhen (ThunderTown)


On 2016/6/7 22:01, Ganapatrao Kulkarni wrote:
> On Tue, Jun 7, 2016 at 6:27 PM, Leizhen (ThunderTown)
> <thunder.leiz...@huawei.com> wrote:
>>
>>
>> On 2016/6/7 16:31, Ganapatrao Kulkarni wrote:
>>> On Tue, Jun 7, 2016 at 1:38 PM, Zhen Lei <thunder.leiz...@huawei.com> wrote:
>>>> Some numa nodes may have no memory. For example:
>>>> 1. cpu0 on node0
>>>> 2. cpu1 on node1
>>>> 3. device0 accessing the memory from node0 and node1 takes the same time.
>>>
>>> i am wondering, if access to both nodes is same, then why you need numa.
>>> the example you are quoting is against the basic principle of "numa"
>>> what is device0 here? cpu?
>> The device0 can also be a cpu. I drew a simple diagram:
>>
>>   cpu0      cpu1       cpu2/device0
>>    |          |             |
>>    |          |             |
>>  DDR0       DDR1     No DIMM slots or no DIMM plugged
>> (node0)    (node1)       (node2)
>>
> 
> thanks for the clarification. your example is for 3 node system, where
> third node is memory less node.
> do you see any issue in supporting this topology with existing code?
If HAVE_MEMORYLESS_NODES is enabled, it will pick the nearest node for the cpus
on a memoryless node.

For example, in include/linux/topology.h:
#ifdef CONFIG_HAVE_MEMORYLESS_NODES
...
static inline int cpu_to_mem(int cpu)
{
	return per_cpu(_numa_mem_, cpu);
}
...
#else
...
static inline int cpu_to_mem(int cpu)
{
	return cpu_to_node(cpu);
}
...
#endif

> I think, this use case should be supported with present code.
> 
>>>>
>>>> So, we can not simply classify device0 to node0 or node1, but we can
>>>> define a node2 which distances to node0 and node1 are the same.
>>>>
>>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>>> ---
>>>>  arch/arm64/Kconfig  |  4 
>>>>  arch/arm64/kernel/smp.c |  1 +
>>>>  arch/arm64/mm/numa.c| 43 +--
>>>>  3 files changed, 46 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index 05c1bf1..5904a62 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -581,6 +581,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
>>>> def_bool y
>>>> depends on NUMA
>>>>
>>>> +config HAVE_MEMORYLESS_NODES
>>>> +   def_bool y
>>>> +   depends on NUMA
>>>> +
>>>>  source kernel/Kconfig.preempt
>>>>  source kernel/Kconfig.hz
>>>>
>>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>>> index d099306..9e15297 100644
>>>> --- a/arch/arm64/kernel/smp.c
>>>> +++ b/arch/arm64/kernel/smp.c
>>>> @@ -620,6 +620,7 @@ static void __init of_parse_and_init_cpus(void)
>>>> }
>>>>
>>>> bootcpu_valid = true;
>>>> +   early_map_cpu_to_node(0, of_node_to_nid(dn));
>>>>
>>>> /*
>>>>  * cpu_logical_map has already been
>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>> index df5c842..d73b0a0 100644
>>>> --- a/arch/arm64/mm/numa.c
>>>> +++ b/arch/arm64/mm/numa.c
>>>> @@ -128,6 +128,14 @@ void __init early_map_cpu_to_node(unsigned int cpu, 
>>>> int nid)
>>>> nid = 0;
>>>>
>>>> cpu_to_node_map[cpu] = nid;
>>>> +
>>>> +   /*
>>>> +* We should set the numa node of cpu0 as soon as possible, 
>>>> because it
>>>> +* has already been set up online before. cpu_to_node(0) will soon 
>>>> be
>>>> +* called.
>>>> +*/
>>>> +   if (!cpu)
>>>> +   set_cpu_numa_node(cpu, nid);
>>>>  }
>>>>
>>>>  #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
>>>> @@ -215,6 +223,35 @@ int __init numa_add_memblk(int nid, u64 start, u64 
>>>> end)
>>>> return ret;
>>>>  }
>>>>
>>>> +static u64 __init alloc_node_data_from_nearest_node(int nid, const size_t 
>>>> size)
>>>> +{
>>>> +   int i, best_nid, distance;
>>>> +   u64 pa;
>>>> +   DECLARE_BITMAP(nodes_map, MAX_NU

Re: [PATCH v4 11/14] arm64/numa: support HAVE_MEMORYLESS_NODES

2016-06-07 Thread Leizhen (ThunderTown)


On 2016/6/7 16:31, Ganapatrao Kulkarni wrote:
> On Tue, Jun 7, 2016 at 1:38 PM, Zhen Lei  wrote:
>> Some numa nodes may have no memory. For example:
>> 1. cpu0 on node0
>> 2. cpu1 on node1
>> 3. device0 accessing the memory from node0 and node1 takes the same time.
> 
> i am wondering, if access to both nodes is same, then why you need numa.
> the example you are quoting is against the basic principle of "numa"
> what is device0 here? cpu?
The device0 can also be a cpu. I drew a simple diagram:

  cpu0      cpu1       cpu2/device0
   |          |             |
   |          |             |
 DDR0       DDR1     No DIMM slots or no DIMM plugged
(node0)    (node1)       (node2)

>>
>> So, we can not simply classify device0 to node0 or node1, but we can
>> define a node2 which distances to node0 and node1 are the same.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/Kconfig  |  4 
>>  arch/arm64/kernel/smp.c |  1 +
>>  arch/arm64/mm/numa.c| 43 +--
>>  3 files changed, 46 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 05c1bf1..5904a62 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -581,6 +581,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
>> def_bool y
>> depends on NUMA
>>
>> +config HAVE_MEMORYLESS_NODES
>> +   def_bool y
>> +   depends on NUMA
>> +
>>  source kernel/Kconfig.preempt
>>  source kernel/Kconfig.hz
>>
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index d099306..9e15297 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -620,6 +620,7 @@ static void __init of_parse_and_init_cpus(void)
>> }
>>
>> bootcpu_valid = true;
>> +   early_map_cpu_to_node(0, of_node_to_nid(dn));
>>
>> /*
>>  * cpu_logical_map has already been
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index df5c842..d73b0a0 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -128,6 +128,14 @@ void __init early_map_cpu_to_node(unsigned int cpu, int 
>> nid)
>> nid = 0;
>>
>> cpu_to_node_map[cpu] = nid;
>> +
>> +   /*
>> +* We should set the numa node of cpu0 as soon as possible, because 
>> it
>> +* has already been set up online before. cpu_to_node(0) will soon be
>> +* called.
>> +*/
>> +   if (!cpu)
>> +   set_cpu_numa_node(cpu, nid);
>>  }
>>
>>  #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
>> @@ -215,6 +223,35 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
>> return ret;
>>  }
>>
>> +static u64 __init alloc_node_data_from_nearest_node(int nid, const size_t 
>> size)
>> +{
>> +   int i, best_nid, distance;
>> +   u64 pa;
>> +   DECLARE_BITMAP(nodes_map, MAX_NUMNODES);
>> +
>> +   bitmap_zero(nodes_map, MAX_NUMNODES);
>> +   bitmap_set(nodes_map, nid, 1);
>> +
>> +find_nearest_node:
>> +   best_nid = NUMA_NO_NODE;
>> +   distance = INT_MAX;
>> +
>> +   for_each_clear_bit(i, nodes_map, MAX_NUMNODES)
>> +   if (numa_distance[nid][i] < distance) {
>> +   best_nid = i;
>> +   distance = numa_distance[nid][i];
>> +   }
>> +
>> +   pa = memblock_alloc_nid(size, SMP_CACHE_BYTES, best_nid);
>> +   if (!pa) {
>> +   BUG_ON(best_nid == NUMA_NO_NODE);
>> +   bitmap_set(nodes_map, best_nid, 1);
>> +   goto find_nearest_node;
>> +   }
>> +
>> +   return pa;
>> +}
>> +
>>  /**
>>   * Initialize NODE_DATA for a node on the local memory
>>   */
>> @@ -228,7 +265,9 @@ static void __init setup_node_data(int nid, u64 
>> start_pfn, u64 end_pfn)
>> pr_info("Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
>> nid, start_pfn << PAGE_SHIFT, (end_pfn << PAGE_SHIFT) - 1);
>>
>> -   nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>> +   nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>> +   if (!nd_pa)
>> +   nd_pa = alloc_node_data_from_nearest_node(nid, nd_size);
>> nd = __va(nd_pa);
>>
>> /* report and initialize */
>> @@ -238,7 +277,7 @@ static void __init setup_node_data(int nid, u64 
>> start_pfn, u64 end_pfn)
>> if (tnid != nid)
>> pr_info("NODE_DATA(%d) on node %d\n", nid, tnid);
>>
>> -   node_data[nid] = nd;
>> +   NODE_DATA(nid) = nd;
>> memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>> NODE_DATA(nid)->node_id = nid;
>> NODE_DATA(nid)->node_start_pfn = start_pfn;
>> --
>> 2.5.0
>>
>>
> Ganapat
>>

Re: [PATCH v4 12/14] arm64/numa: remove some useless code

2016-06-07 Thread Leizhen (ThunderTown)


On 2016/6/7 16:28, Ganapatrao Kulkarni wrote:
> On Tue, Jun 7, 2016 at 1:38 PM, Zhen Lei  wrote:
>> 1. Currently only cpu0 set on cpu_possible_mask and percpu areas have not
>>been initialized.
>> 2. No reason to limit cpu0 must belongs to node0.
> 
> even smp init assumes cpu0/boot processor.
Yes, we define the boot cpu as cpu0. But we cannot force cpu0 to belong to node0.
For example, suppose we run the same Image and dtb on two boards. On the first
board, the BIOS chooses cpu-A as the boot cpu, but on the other board, the BIOS
may choose cpu-B. Although this case is unlikely to occur, we cannot be sure
that it never will.

> is this patch tested on any hardware?
Yes, I tested it on our D02 board.

> can you describe your testing hardware?
Although D02 contains only one hardware numa node, the numa software
implementation is hardware independent, so I defined some logical numa nodes
for testing. For example: treat each core as a numa node, and subdivide memory
accordingly.

>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/mm/numa.c | 8 
>>  1 file changed, 8 deletions(-)
>>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index d73b0a0..92b1692 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -93,7 +93,6 @@ void numa_clear_node(unsigned int cpu)
>>   */
>>  static void __init setup_node_to_cpumask_map(void)
>>  {
>> -   unsigned int cpu;
>> int node;
>>
>> /* setup nr_node_ids if not done yet */
>> @@ -106,9 +105,6 @@ static void __init setup_node_to_cpumask_map(void)
>> cpumask_clear(node_to_cpumask_map[node]);
>> }
>>
>> -   for_each_possible_cpu(cpu)
>> -   set_cpu_numa_node(cpu, NUMA_NO_NODE);
>> -
> 
> do you see this init of setting node id to NUMA_NO_NODE  for each cpu
> happening any where else?
I used the code below to verify my judgement; only "cpu=0" was printed.

for_each_possible_cpu(cpu)
pr_info("setup_node_to_cpumask_map: cpu=%d\n", cpu);

Actually, the execution sequence is as below:
1. setup_arch
   1) bootmem_init();   -->arm64_numa_init
   2) smp_init_cpus();  -->smp_cpu_setup --> set_cpu_possible(cpu, true);

So the code deleted above only sets cpu0 to NUMA_NO_NODE, and the code deleted
below sets cpu0 to nid 0. In fact, the default value of cpu_to_node(0) is
already zero. That is why I said this code has no effect.

> otherwise, better to have initialised node id/NUMA_NO_NODE to every
> cpu otherwise default  node id will be shown as zero
> which is not correct.
> 
>> /* cpumask_of_node() will now work */
>> pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
>>  }
>> @@ -379,10 +375,6 @@ static int __init numa_init(int (*init_func)(void))
>>
>> setup_node_to_cpumask_map();
>>
>> -   /* init boot processor */
>> -   cpu_to_node_map[0] = 0;
>> -   map_cpu_to_node(0, 0);
>> -
> 
> otherwise, how you set numa info for cpu0/boot-processor?
I have done it in the previous patch.

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index d099306..9e15297 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -620,6 +620,7 @@ static void __init of_parse_and_init_cpus(void)
}

bootcpu_valid = true;
+   early_map_cpu_to_node(0, of_node_to_nid(dn));

/*
 * cpu_logical_map has already been
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index df5c842..d73b0a0 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -128,6 +128,14 @@ void __init early_map_cpu_to_node(unsigned int cpu, int 
nid)
nid = 0;

cpu_to_node_map[cpu] = nid;
+
+   /*
+* We should set the numa node of cpu0 as soon as possible, because it
+* has already been set up online before. cpu_to_node(0) will soon be
+* called.
+*/
+   if (!cpu)
+   set_cpu_numa_node(cpu, nid);
 }

> 
> thanks
> Ganapat
>> return 0;
>>  }
>>
>> --
>> 2.5.0
>>
>>
>>
> 
> thanks
> ganapat
> 
> 
> .
> 



Re: [PATCH v2 2/5] of/numa: fix a memory@ node can only contains one memory block

2016-06-05 Thread Leizhen (ThunderTown)


On 2016/6/3 17:45, Will Deacon wrote:
> On Thu, Jun 02, 2016 at 09:36:40AM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/6/2 4:13, Rob Herring wrote:
>>> I believe you still need this and not the one above. You only need it
>>> within the loop if you return. Otherwise, the last node always need to
>>> be put.
>>
>> OK. Thanks.
>>
>> In addition, following Matthias's suggestion, I will move the "return" into
>> this patch, so that this of_node_put(np) can be safely removed.
> 
> Do you want to include Kefeng's [1] patches in your series too? We don't
> need two sets of related NUMA cleanups :)

Yes, it was originally suggested by Joe Perches.

> 
> Will
> 
> [1] 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-June/432715.html
> 
> .
> 



Re: [PATCH v3 3/5] arm64/numa: add nid check for memory block

2016-06-05 Thread Leizhen (ThunderTown)
On 2016/6/3 17:52, Will Deacon wrote:
> On Thu, Jun 02, 2016 at 10:28:09AM +0800, Zhen Lei wrote:
>> Use the same tactic to cpu and numa-distance nodes.
> 
> Sorry, I don't understand... :/

In function of_numa_parse_cpu_nodes:
for_each_child_of_node(cpus, np) {
...
r = of_property_read_u32(np, "numa-node-id", &nid);
...
if (nid >= MAX_NUMNODES)    //check nid
pr_warn("NUMA: Node id %u exceeds maximum value\n", nid);   //print warning info
...


In function numa_set_distance:
if (from >= numa_distance_cnt || to >= numa_distance_cnt ||    //check nid
from < 0 || to < 0) {
pr_warn_once("NUMA: Warning: node ids are out of bound, from=%d to=%d distance=%d\n",   //print warning info
from, to, distance);
return;
}

Both of these functions check whether the nid (configured in the dts, in the
subnodes of cpus and distance-map) is valid. So memory@ nodes should also be
checked.


memory@c0 {
device_type = "memory";
reg = <0x0 0xc0 0x0 0x8000>;
/* node 0 */
numa-node-id = <0>; //have not been checked yet
};  //suppose I configured a wrong nid; it will not print any warning info

cpus {
#address-cells = <2>;
#size-cells = <0>;

cpu@0 {
device_type = "cpu";
compatible =  "arm,armv8";
reg = <0x0 0x0>;
enable-method = "psci";
/* node 0 */
numa-node-id = <0>; //checked in of_numa_parse_cpu_nodes
};

distance-map {
compatible = "numa-distance-map-v1";
distance-matrix = <0 0 10>, //checked in of_numa_parse_distance_map_v1 --> numa_set_distance
  <0 1 20>,
  <1 1 10>;
};

> 
> Will
> 
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/mm/numa.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index c7fe3ec..2601660 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -141,6 +141,11 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
>>  {
>>  int ret;
>>
>> +if (nid >= MAX_NUMNODES) {
>> +pr_warn("NUMA: Node id %u exceeds maximum value\n", nid);
>> +return -EINVAL;
>> +}
>> +
>>  ret = memblock_set_node(start, (end - start), &memblock.memory, nid);
>>  if (ret < 0) {
>>  pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on node 
>> %d\n",
>> --
>> 2.5.0
>>
>>
> 
> .
> 



Re: [PATCH v3 5/5] arm64/numa: avoid inconsistent information to be printed

2016-06-05 Thread Leizhen (ThunderTown)


On 2016/6/3 17:55, Will Deacon wrote:
> On Thu, Jun 02, 2016 at 10:28:11AM +0800, Zhen Lei wrote:
>> numa_init(of_numa_init) may return an error because of a numa configuration
>> error. So "No NUMA configuration found" is inaccurate. In fact, specific
>> configuration error information should be immediately printed by the
>> testing branch.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/mm/numa.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Looks fine to me, but this doesn't apply against -rc1.

Oh,

These patches are based on the https://lkml.org/lkml/2016/5/24/679 series.

> 
> Will
> 
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index 2601660..1b9622c 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -338,8 +338,10 @@ static int __init numa_init(int (*init_func)(void))
>>  if (ret < 0)
>>  return ret;
>>
>> -if (nodes_empty(numa_nodes_parsed))
>> +if (nodes_empty(numa_nodes_parsed)) {
>> +pr_info("No NUMA configuration found\n");
>>  return -EINVAL;
>> +}
>>
>>  ret = numa_register_nodes();
>>  if (ret < 0)
>> @@ -370,8 +372,6 @@ static int __init dummy_numa_init(void)
>>
>>  if (numa_off)
>>  pr_info("NUMA disabled\n"); /* Forced off on command line. */
>> -else
>> -pr_info("No NUMA configuration found\n");
>>  pr_info("NUMA: Faking a node at [mem %#018Lx-%#018Lx]\n",
>> 0LLU, PFN_PHYS(max_pfn) - 1);
>>
>> --
>> 2.5.0
>>
>>
> 
> .
> 



Re: [PATCH 1/1] arm64: fix flush_cache_range

2016-05-25 Thread Leizhen (ThunderTown)


On 2016/5/25 18:50, Catalin Marinas wrote:
> On Wed, May 25, 2016 at 11:36:38AM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/5/25 9:20, Leizhen (ThunderTown) wrote:
>>> On 2016/5/24 21:02, Catalin Marinas wrote:
>>>> On Tue, May 24, 2016 at 08:19:05PM +0800, Leizhen (ThunderTown) wrote:
>>>>> On 2016/5/24 19:37, Mark Rutland wrote:
>>>>>> It looks like the test may be missing I-cache maintenance regardless of
>>>>>> the semantics of mprotect in this case.
>>>>>>
>>>>>> I have not yet devled into flush_cache_range and how it is called.
>>>>>
>>>>> SYSCALL_DEFINE3(mprotect ---> mprotect_fixup ---> change_protection ---> 
>>>>> change_protection_range --> flush_cache_range
>>>>
>>>> The change_protection() shouldn't need to flush the caches in
>>>> flush_cache_range(). The change_pte_range() function eventually ends up
>>>> calling set_pte_at() which calls __sync_icache_dcache() if the mapping
>>>> is executable.
>>>
>>> OK, I see.
>>> But I'm afraid it entered the "if (pte_present(oldpte))" branch in
>>> function change_pte_range. Because the test case called mmap to
>>> create pte first, then called pte_modify. I will check it later.
>>
>> I have checked that it entered "if (pte_present(oldpte))" branch.
> 
> This path eventually calls set_pte_at() via ptep_modify_prot_commit().
OK, I see.

> 
>> But I don't know why adding flush_icache_range works, but adding
>> __sync_icache_dcache has no effect.
> 
> Do you mean you modified set_pte_at() to use flush_icache_range()
Just about. I added it in change_pte_range after the statement below.
ptent = pte_modify(ptent, newprot);

> instead of __sync_icache_dcache() and it works?
Yes.

> 
> What happens is that __sync_icache_dcache() only takes care of the first
> time a page is mapped in user space and flushes the caches, marking it
> as "clean" (PG_dcache_clean) afterwards. Subsequent changes to this
> mapping or writes to it are entirely the responsibility of the user. So
> if the user plans to execute instructions, it better explicitly flush
> the caches (as Mark Rutland already stated in a previous reply).
> 
> I ran our internal LTP version yesterday and it was fine but didn't
> realise that we actually patched mprotect04.c to include:
> 
>   __clear_cache((char *)func, (char *)func + page_sz);
> 
> just after memcpy().
Yes, I also tried this before I sent this patch. Flushing the dcache in either
userspace or the kernel fixes this problem.

> 
> (we still need to investigate whether the I-cache invalidation is
> actually needed in flush_cache_range() or it's just something we forgot
> to remove)
> 



Re: [PATCH v2 5/5] arm64/numa: avoid inconsistent information to be printed

2016-05-31 Thread Leizhen (ThunderTown)


On 2016/5/31 19:27, Leizhen (ThunderTown) wrote:
> 
> 
> On 2016/5/31 17:07, Matthias Brugger wrote:
>>
>>
>> On 28/05/16 11:22, Zhen Lei wrote:
>>> numa_init(of_numa_init) may return an error because of a numa configuration
>>> error. So "No NUMA configuration found" is inaccurate. In fact, specific
>>> configuration error information should be immediately printed by the
>>> testing branch.
>>>
>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>> ---
>>
>> Which kernel version is this patch based on?
> 
> Based on
> mainline (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git); I
> pulled it about 3-5 days ago, and the last commit-id is dc03c0f.
> 
> And these patches are based on the https://lkml.org/lkml/2016/5/24/679 series
> (ACPI NUMA), as David Daney requested.
> 
>>
>> Regards,
>> Matthias
>>
>>>   arch/arm64/mm/numa.c | 6 +++---
>>>   drivers/of/of_numa.c | 7 +++
>>>   2 files changed, 6 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> index 2601660..1b9622c 100644
>>> --- a/arch/arm64/mm/numa.c
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -338,8 +338,10 @@ static int __init numa_init(int (*init_func)(void))
>>>   if (ret < 0)
>>>   return ret;
>>>
>>> -if (nodes_empty(numa_nodes_parsed))
>>> +if (nodes_empty(numa_nodes_parsed)) {
>>> +pr_info("No NUMA configuration found\n");
>>>   return -EINVAL;
>>> +}
>>>
>>>   ret = numa_register_nodes();
>>>   if (ret < 0)
>>> @@ -370,8 +372,6 @@ static int __init dummy_numa_init(void)
>>>
>>>   if (numa_off)
>>>   pr_info("NUMA disabled\n"); /* Forced off on command line. */
>>> -else
>>> -pr_info("No NUMA configuration found\n");
>>>   pr_info("NUMA: Faking a node at [mem %#018Lx-%#018Lx]\n",
>>>  0LLU, PFN_PHYS(max_pfn) - 1);
>>>
>>> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
>>> index fb62307..3157130 100644
>>> --- a/drivers/of/of_numa.c
>>> +++ b/drivers/of/of_numa.c
>>> @@ -63,7 +63,7 @@ static int __init of_numa_parse_memory_nodes(void)
>>>   struct device_node *np = NULL;
>>>   struct resource rsrc;
>>>   u32 nid;
>>> -int i, r = 0;
>>> +int i, r;
>>>
>>>   for_each_node_by_type(np, "memory") {
>>   r = of_property_read_u32(np, "numa-node-id", &nid);
>>> @@ -81,12 +81,11 @@ static int __init of_numa_parse_memory_nodes(void)
>>>   if (!i || r) {
>>>   of_node_put(np);
>>>   pr_err("NUMA: bad property in memory node\n");
>>> -r = r ? : -EINVAL;
>>> -break;
>>> +return r ? : -EINVAL;
>>>   }
>>>   }
>>>
>>> -return r;
>>> +return 0;
>>>   }
>>>
>>
>> Well this is fixing changes you introduced in this patch-set. Any reason 
>> this is not part of patch 2?
> 
> Because they fixed two different problems.

Hi, Matthias

I thought about it again on my way home yesterday. Yeah, you're right: moving
this part to patch 2 will make these two patches look better. I put it here
before because, in the "No NUMA configuration" case, the code originally
returned an error code, so it could not reach
"if (nodes_empty(numa_nodes_parsed))".

ret = init_func();
if (ret < 0)
return ret;

-if (nodes_empty(numa_nodes_parsed))
+if (nodes_empty(numa_nodes_parsed)) {
+pr_info("No NUMA configuration found\n");
 return -EINVAL;
+}

Regards,
Zhen Lei

> 
>>
>>>   static int __init of_numa_parse_distance_map_v1(struct device_node *map)
>>> -- 
>>> 2.5.0
>>>
>>>
>>>
>>>
>>
>> .
>>



Re: [PATCH 2/3] of/numa: fix a memory@ dt node can only contains one memory block

2016-05-26 Thread Leizhen (ThunderTown)


On 2016/5/26 21:13, Rob Herring wrote:
> On Thu, May 26, 2016 at 10:43:58AM +0800, Zhen Lei wrote:
>> For a normal memory@ devicetree node, its reg property can contain multiple
>> memory blocks.
>>
>> Because we don't know how many memory blocks may be contained, we try
>> from index=0, increasing by 1 until an error is returned (the end).
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/of/of_numa.c | 30 --
>>  1 file changed, 20 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
>> index 21d831f..2c5f249 100644
>> --- a/drivers/of/of_numa.c
>> +++ b/drivers/of/of_numa.c
>> @@ -63,7 +63,7 @@ static int __init of_numa_parse_memory_nodes(void)
>>  struct device_node *np = NULL;
>>  struct resource rsrc;
>>  u32 nid;
>> -int r = 0;
>> +int i, r = 0;
>>
>>  for (;;) {
>>  np = of_find_node_by_type(np, "memory");
>> @@ -82,17 +82,27 @@ static int __init of_numa_parse_memory_nodes(void)
>>  /* some other error */
>>  break;
>>
>> -r = of_address_to_resource(np, 0, &rsrc);
>> -if (r) {
>> -pr_err("NUMA: bad reg property in memory node\n");
>> -break;
>> +for (i = 0; ; i++) {
>> +r = of_address_to_resource(np, i, &rsrc);
>> +if (r) {
>> +/* reached the end of of_address */
>> +if (i > 0) {
>> +r = 0;
>> +break;
>> +}
>> +
>> +pr_err("NUMA: bad reg property in memory node\n");
>> +goto finished;
>> +}
>> +
>> +r = numa_add_memblk(nid, rsrc.start,
>> +rsrc.end - rsrc.start + 1);
>> +if (r)
>> +goto finished;
>>  }
>> -
>> -r = numa_add_memblk(nid, rsrc.start,
>> -rsrc.end - rsrc.start + 1);
>> -if (r)
>> -break;
>>  }
>> +
>> +finished:
>>  of_node_put(np);
> 
> This function can be simplified down to:
> 
>   for_each_node_by_type(np, "memory") {
OK, That's good.

>   r = of_property_read_u32(np, "numa-node-id", &nid);
>   if (r == -EINVAL)
>   /*
>* property doesn't exist if -EINVAL, continue
>* looking for more memory nodes with
>* "numa-node-id" property
>*/
>   continue;
Hi, everybody:
If some "memory" nodes contain "numa-node-id" but others are missing it,
can we simply ignore that?
I think we should break out too, and fall back to faking only node0.

>   else if (r)
>   /* some other error */
>   break;
> 
>   r = of_address_to_resource(np, 0, &rsrc);
>   for (i = 0; !r; i++, r = of_address_to_resource(np, i, 

But a non-zero r only breaks this loop; the original code breaks the outer
for (;;) loop.

How about as below?

for_each_node_by_type(np, "memory") {
... ...

for (i = 0; !of_address_to_resource(np, i, &rsrc); i++) {
r = numa_add_memblk(nid, rsrc.start,
rsrc.end - rsrc.start + 1);
if (r)
goto finished;
}

if (!i)
pr_err("NUMA: bad reg property in memory node\n");
}

finished:


> &rsrc)) {
>   r = numa_add_memblk(nid, rsrc.start,
>   rsrc.end - rsrc.start + 1);
>   }
>   }
>   of_node_put(np);
> 
>   return r;
> 
> 
> Perhaps with a "if (!i && r) pr_err()" for an error message at the end.
> 
> Rob
> 
> .
> 



Re: [PATCH 1/1] arm64: fix flush_cache_range

2016-05-26 Thread Leizhen (ThunderTown)


On 2016/5/25 18:50, Catalin Marinas wrote:
> On Wed, May 25, 2016 at 11:36:38AM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/5/25 9:20, Leizhen (ThunderTown) wrote:
>>> On 2016/5/24 21:02, Catalin Marinas wrote:
>>>> On Tue, May 24, 2016 at 08:19:05PM +0800, Leizhen (ThunderTown) wrote:
>>>>> On 2016/5/24 19:37, Mark Rutland wrote:
>>>>>> It looks like the test may be missing I-cache maintenance regardless of
>>>>>> the semantics of mprotect in this case.
>>>>>>
>>>>>> I have not yet devled into flush_cache_range and how it is called.
>>>>>
>>>>> SYSCALL_DEFINE3(mprotect ---> mprotect_fixup ---> change_protection ---> 
>>>>> change_protection_range --> flush_cache_range
>>>>
>>>> The change_protection() shouldn't need to flush the caches in
>>>> flush_cache_range(). The change_pte_range() function eventually ends up
>>>> calling set_pte_at() which calls __sync_icache_dcache() if the mapping
>>>> is executable.
>>>
>>> OK, I see.
>>> But I'm afraid it entered the "if (pte_present(oldpte))" branch in
>>> function change_pte_range. Because the test case called mmap to
>>> create pte first, then called pte_modify. I will check it later.
>>
>> I have checked that it entered "if (pte_present(oldpte))" branch.
> 
> This path eventually calls set_pte_at() via ptep_modify_prot_commit().
> 
>> But I don't know why adding flush_icache_range works, but adding
>> __sync_icache_dcache has no effect.
> 
> Do you mean you modified set_pte_at() to use flush_icache_range()
> instead of __sync_icache_dcache() and it works?
> 
> What happens is that __sync_icache_dcache() only takes care of the first
> time a page is mapped in user space and flushes the caches, marking it
> as "clean" (PG_dcache_clean) afterwards. Subsequent changes to this

Hi,
From my tracing, it returns at "if (!page_mapping(page))", because the "mmap"
pages are anonymous. When I commented out the code lines below, it worked well.

/* no flushing needed for anonymous pages */
if (!page_mapping(page))
return;


I printed the page information three times, as below:
page->mapping=8017baf36961, page->flags=0x10040048
page->mapping=8017b265bf51, page->flags=0x10040048
page->mapping=8017b94fc5a1, page->flags=0x10040048

PG_slab=7, PG_arch_1=9, PG_swapcache=15

> mapping or writes to it are entirely the responsibility of the user. So
> if the user plans to execute instructions, it better explicitly flush
> the caches (as Mark Rutland already stated in a previous reply).
> 
> I ran our internal LTP version yesterday and it was fine but didn't
> realise that we actually patched mprotect04.c to include:
> 
>   __clear_cache((char *)func, (char *)func + page_sz);
> 
> just after memcpy().
> 
> (we still need to investigate whether the I-cache invalidation is
> actually needed in flush_cache_range() or it's just something we forgot
> to remove)
> 



Re: [PATCH 3/3] arm64/numa: fix type info

2016-05-26 Thread Leizhen (ThunderTown)


On 2016/5/27 1:12, David Daney wrote:
> The current patch to correct this problem is here:
> 
> https://lkml.org/lkml/2016/5/24/679
> 
> Since v7 of the ACPI/NUMA patches are likely going to be added to linux-next 
> as soon as the current merge window ends, further simplifications of the 
> informational prints should probably be rebased on top of it.
> 
> David Daney
> 

>> On Thu, 2016-05-26 at 09:22 -0700, Ganapatrao Kulkarni wrote:
>>> IIRC, it should be
>>> if (!numa_off)
>>> we want to print this message when we failed to find proper numa 
>>> configuration.
>>> when numa_off is set, we will not look for any numa configuration.
>>>

 +   pr_info("%s\n", "No NUMA configuration found");
>>


OK, I think I also missed some cases.

But my problem still has not been resolved by
"https://lkml.org/lkml/2016/5/24/679", see below. I will update my patches
based on it.


[0.00] NUMA: Adding memblock [0x0 - 0x6aff] on node 0
[0.00] NUMA: parsing numa-distance-map-v1
[0.00] NUMA: Warning: invalid memblk node 4 [mem 0x6b00-0x7fbf]   //My numa configuration is incorrect, not "No ... found"
[0.00] No NUMA configuration found   //The warning above is already detailed, so this line can be removed
[0.00] NUMA: Faking a node at [mem 0x-0x0017]



Re: [PATCH 2/3] of/numa: fix a memory@ dt node can only contains one memory block

2016-05-27 Thread Leizhen (ThunderTown)


On 2016/5/27 12:20, Rob Herring wrote:
> On Thu, May 26, 2016 at 10:36 PM, Leizhen (ThunderTown)
> <thunder.leiz...@huawei.com> wrote:
>>
>>
>> On 2016/5/26 21:13, Rob Herring wrote:
>>> On Thu, May 26, 2016 at 10:43:58AM +0800, Zhen Lei wrote:
>>>> For a normal memory@ devicetree node, its reg property can contain multiple
>>>> memory blocks.
>>>>
>>>> Because we don't know how many memory blocks may be contained, we try
>>>> from index=0, increasing by 1 until an error is returned (the end).
>>>>
>>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>>> ---
>>>>  drivers/of/of_numa.c | 30 --
>>>>  1 file changed, 20 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
>>>> index 21d831f..2c5f249 100644
>>>> --- a/drivers/of/of_numa.c
>>>> +++ b/drivers/of/of_numa.c
>>>> @@ -63,7 +63,7 @@ static int __init of_numa_parse_memory_nodes(void)
>>>>  struct device_node *np = NULL;
>>>>  struct resource rsrc;
>>>>  u32 nid;
>>>> -int r = 0;
>>>> +int i, r = 0;
>>>>
>>>>  for (;;) {
>>>>  np = of_find_node_by_type(np, "memory");
>>>> @@ -82,17 +82,27 @@ static int __init of_numa_parse_memory_nodes(void)
>>>>  /* some other error */
>>>>  break;
>>>>
>>>> -r = of_address_to_resource(np, 0, &rsrc);
>>>> -if (r) {
>>>> -pr_err("NUMA: bad reg property in memory node\n");
>>>> -break;
>>>> +for (i = 0; ; i++) {
>>>> +r = of_address_to_resource(np, i, &rsrc);
>>>> +if (r) {
>>>> +/* reached the end of of_address */
>>>> +if (i > 0) {
>>>> +r = 0;
>>>> +break;
>>>> +}
>>>> +
>>>> +pr_err("NUMA: bad reg property in memory node\n");
>>>> +goto finished;
>>>> +}
>>>> +
>>>> +r = numa_add_memblk(nid, rsrc.start,
>>>> +rsrc.end - rsrc.start + 1);
>>>> +if (r)
>>>> +goto finished;
>>>>  }
>>>> -
>>>> -r = numa_add_memblk(nid, rsrc.start,
>>>> -rsrc.end - rsrc.start + 1);
>>>> -if (r)
>>>> -break;
>>>>  }
>>>> +
>>>> +finished:
>>>>  of_node_put(np);
>>>
>>> This function can be simplified down to:
>>>
>>>   for_each_node_by_type(np, "memory") {
>> OK, That's good.
>>
>>>   r = of_property_read_u32(np, "numa-node-id", &nid);
>>>   if (r == -EINVAL)
>>>   /*
>>>* property doesn't exist if -EINVAL, continue
>>>* looking for more memory nodes with
>>>* "numa-node-id" property
>>>*/
>>>   continue;
>> Hi, everybody:
>> If some "memory" nodes contain "numa-node-id" but others are missing it,
>> can we simply ignore that?
>> I think we should break out too, and fall back to faking only node0.
> 
> Continuing to work is probably better than not.
> 
>>
>>>   else if (r)
>>>   /* some other error */
>>>   break;
>>>
>>>   r = of_address_to_resource(np, 0, &rsrc);
>>>   for (i = 0; !r; i++, r = of_address_to_resource(np, i,
>>
>> But a non-zero r only breaks this loop; the original code breaks the outer
>> for (;;) loop.
> 
> It is not really the kernel's job to validate the DT. If there are
> random things in it, then the kernel's behavior is undefined.
> 
>>
>> How about as below?
>>
>> for_each_node_by_type(np, "memory"

Re: [PATCH v2 2/5] of/numa: fix a memory@ node can only contains one memory block

2016-06-01 Thread Leizhen (ThunderTown)


On 2016/6/2 4:13, Rob Herring wrote:
> On Sat, May 28, 2016 at 4:22 AM, Zhen Lei  wrote:
>> For a normal memory@ devicetree node, its reg property can contain multiple
>> memory blocks.
>>
>> Because we don't know how many memory blocks may be contained, we try
>> from index=0, increasing by 1 until an error is returned (the end).
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/of/of_numa.c | 26 +-
>>  1 file changed, 9 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
>> index fb71b4e..fa85a51 100644
>> --- a/drivers/of/of_numa.c
>> +++ b/drivers/of/of_numa.c
>> @@ -63,13 +63,9 @@ static int __init of_numa_parse_memory_nodes(void)
>> struct device_node *np = NULL;
>> struct resource rsrc;
>> u32 nid;
>> -   int r = 0;
>> -
>> -   for (;;) {
>> -   np = of_find_node_by_type(np, "memory");
>> -   if (!np)
>> -   break;
>> +   int i, r = 0;
>>
>> +   for_each_node_by_type(np, "memory") {
>> r = of_property_read_u32(np, "numa-node-id", &nid);
>> if (r == -EINVAL)
>> /*
>> @@ -78,21 +74,17 @@ static int __init of_numa_parse_memory_nodes(void)
>>  * "numa-node-id" property
>>  */
>> continue;
>> -   else if (r)
>> -   /* some other error */
>> -   break;
>>
>> -   r = of_address_to_resource(np, 0, &rsrc);
>> -   if (r) {
>> -   pr_err("NUMA: bad reg property in memory node\n");
>> -   break;
>> -   }
>> +   for (i = 0; !r && !of_address_to_resource(np, i, &rsrc); i++)
>> +   r = numa_add_memblk(nid, rsrc.start, rsrc.end + 1);
>>
>> -   r = numa_add_memblk(nid, rsrc.start, rsrc.end + 1);
>> -   if (r)
>> +   if (!i || r) {
>> +   of_node_put(np);
>> +   pr_err("NUMA: bad property in memory node\n");
>> +   r = r ? : -EINVAL;
>> break;
>> +   }
>> }
>> -   of_node_put(np);
> 
> I believe you still need this and not the one above. You only need it
> within the loop if you return. Otherwise, the last node always need to
> be put.

OK. Thanks.

In addition, following Matthias's suggestion, I will move the "return" into this
patch, so that this of_node_put(np) can be safely removed.


> 
> With that, for the series:
> 
> Acked-by: Rob Herring 
> 
> Rob
> 
> .
> 



Re: [PATCH v2 5/5] arm64/numa: avoid inconsistent information to be printed

2016-05-31 Thread Leizhen (ThunderTown)


On 2016/5/31 17:07, Matthias Brugger wrote:
> 
> 
> On 28/05/16 11:22, Zhen Lei wrote:
>> numa_init(of_numa_init) may return an error because of a numa configuration
>> error. So "No NUMA configuration found" is inaccurate. In fact, specific
>> configuration error information should be immediately printed by the
>> testing branch.
>>
>> Signed-off-by: Zhen Lei 
>> ---
> 
> Which kernel version is this patch based on?

Based on
mainline (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git); I
pulled it about 3-5 days ago, and the last commit-id is dc03c0f.

And these patches are based on the https://lkml.org/lkml/2016/5/24/679 series
(ACPI NUMA), as David Daney requested.

> 
> Regards,
> Matthias
> 
>>   arch/arm64/mm/numa.c | 6 +++---
>>   drivers/of/of_numa.c | 7 +++
>>   2 files changed, 6 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index 2601660..1b9622c 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -338,8 +338,10 @@ static int __init numa_init(int (*init_func)(void))
>>   if (ret < 0)
>>   return ret;
>>
>> -if (nodes_empty(numa_nodes_parsed))
>> +if (nodes_empty(numa_nodes_parsed)) {
>> +pr_info("No NUMA configuration found\n");
>>   return -EINVAL;
>> +}
>>
>>   ret = numa_register_nodes();
>>   if (ret < 0)
>> @@ -370,8 +372,6 @@ static int __init dummy_numa_init(void)
>>
>>   if (numa_off)
>>   pr_info("NUMA disabled\n"); /* Forced off on command line. */
>> -else
>> -pr_info("No NUMA configuration found\n");
>>   pr_info("NUMA: Faking a node at [mem %#018Lx-%#018Lx]\n",
>>  0LLU, PFN_PHYS(max_pfn) - 1);
>>
>> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
>> index fb62307..3157130 100644
>> --- a/drivers/of/of_numa.c
>> +++ b/drivers/of/of_numa.c
>> @@ -63,7 +63,7 @@ static int __init of_numa_parse_memory_nodes(void)
>>   struct device_node *np = NULL;
>>   struct resource rsrc;
>>   u32 nid;
>> -int i, r = 0;
>> +int i, r;
>>
>>   for_each_node_by_type(np, "memory") {
>>   r = of_property_read_u32(np, "numa-node-id", &nid);
>> @@ -81,12 +81,11 @@ static int __init of_numa_parse_memory_nodes(void)
>>   if (!i || r) {
>>   of_node_put(np);
>>   pr_err("NUMA: bad property in memory node\n");
>> -r = r ? : -EINVAL;
>> -break;
>> +return r ? : -EINVAL;
>>   }
>>   }
>>
>> -return r;
>> +return 0;
>>   }
>>
> 
> Well this is fixing changes you introduced in this patch-set. Any reason this 
> is not part of patch 2?

Because they fixed two different problems.

> 
>>   static int __init of_numa_parse_distance_map_v1(struct device_node *map)
>> -- 
>> 2.5.0
>>
>>
>>
>>
> 
> .
> 



Re: [PATCH v4 00/14] fix some type infos and bugs for arm64/of numa

2016-06-21 Thread Leizhen (ThunderTown)


On 2016/6/20 14:39, Leizhen (ThunderTown) wrote:
> 
> 
> On 2016/6/14 22:22, Catalin Marinas wrote:
>> On Wed, Jun 08, 2016 at 04:59:03PM +0800, Leizhen (ThunderTown) wrote:
>>> On 2016/6/7 21:58, Will Deacon wrote:
>>>> On Tue, Jun 07, 2016 at 04:08:04PM +0800, Zhen Lei wrote:
>>>>> v3 -> v4:
>>>>> 1. Packed three patches of Kefeng Wang, patch6-8.
>>>>> 2. Add 6 new patches(9-15) to enhance the numa on arm64.
>>>>>
>>>>> v2 -> v3:
>>>>> 1. Adjust patch2 and patch5 according to Matthias Brugger's advice, to 
>>>>> make the
>>>>>patches looks more well. The final code have no change. 
>>>>>
>>>>> v1 -> v2:
>>>>> 1. Base on https://lkml.org/lkml/2016/5/24/679
>>>>
>>>> If you want bug fixes to land in 4.7, you'll need to base them on a
>>>> mainline kernel.
>>>
>>> I heard that David Daney's ACPI NUMA patch series was accepted and
>>> put into the next branch (Linux 4.8).
>>> Otherwise I would suggest that he send his patches 6-7 to mainline first,
>>> so that only a very small conflict would exist.
>>>
>>> I also tested that:
>>> 1. git am David Daney's patch6-7, then git am all of my patches on a
>>> branch, named branch A.
>>> 2. git am David Daney's patch6-7 on another branch, named branch B.
>>> 3. When I git merge B into branch A, there is still a conflict. So I guess
>>> git merge is based on source code, rather than patches.
>>>
>>> So at present, unless the maintainers are willing to resolve the
>>> conflict, updating my patches will not help.
>>
>> It usually depends on how complex the conflict is and whether your
>> patches functionally depend on the other patches. I have no idea what
>> the dependency is here since I haven't tried applying them to mainline.
>>
>>> Fortunately, these patches are not particularly urgent. So I think I
>>> can wait until Linux 4.8 start, then send these patches again. But I'm
>>> not sure whether these patches can be merged into Linux 4.8, I really
>>> hope.
>>
>> If there are fixes to the arm64 ACPI NUMA patches that Rafael queued
>> into linux-next, they should be sent to him and potentially being queued
>> on top ahead of the 4.8 merging window or shortly after 4.8-rc1.
>> Non-ACPI NUMA patches (as I can see, most of these patches are DT
>> specific) could be merged independently.
>>
>> So how many patches do you have in each category below:
>>
>> 1. NUMA fixes against current mainline (4.7-rc3)
>> 2. NUMA fixes against the arm64 ACPI NUMA patches queued by Rafael
> My patches have not fixed any bugs for ACPI NUMA, but just based on it.
> There are only three related patches:
> [PATCH v7 06_15] arm64, numa  rework numa_add_memblk()
> [PATCH v7 07_15] arm64, numa  Cleanup NUMA disabled messages.
> [PATCH v7 14_15] arm64, acpi, numa  NUMA support based on SRAT and SLIT
> 
> arch/arm64/mm/numa.c  |  28 --
> drivers/of/of_numa.c  |   4 +-
> 
> My patches 1-5, 8, 11 will confict with it.
> 
>> 3. New functionality or clean-up. Are these against mainline or ACPI
>>NUMA patches?
> Hi, Catalin
> I'm sorry to reply this email too late. Because I have been thinking if
> there are any other solutions.
> 
> I try to adjust the sequence of my patches as below:
> 1. New functionality  //queued in your branch  (my patches 9-14, and 
> 6, 6 is clean-up)
> 2. 4.8-rc1//apci numa series and my new functionality had 
> been merged
> 3. bug fixes  //other 4.8-rc versions  (my patches 1-5)
> 4. clean-up (pr_fmt)  //queued in 4.9  (my patches 7-8)

Hi, Catalin
  What is your opinion? Do you agree?

> 
> And there only one confliction exist:
> ++<<<<<<< HEAD
>  +static u8 numa_distance[MAX_NUMNODES][MAX_NUMNODES];
> //choose this
>  +static int numa_off;
> ++===
> + static int numa_distance_cnt;
> + static u8 *numa_distance;
> + static bool numa_off;   
> //choose this
> ++>>>>>>> acpi
> 
>>



Re: [PATCH v4 00/14] fix some type infos and bugs for arm64/of numa

2016-06-20 Thread Leizhen (ThunderTown)


On 2016/6/14 22:22, Catalin Marinas wrote:
> On Wed, Jun 08, 2016 at 04:59:03PM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/6/7 21:58, Will Deacon wrote:
>>> On Tue, Jun 07, 2016 at 04:08:04PM +0800, Zhen Lei wrote:
>>>> v3 -> v4:
>>>> 1. Packed three patches of Kefeng Wang, patch6-8.
>>>> 2. Add 6 new patches(9-15) to enhance the numa on arm64.
>>>>
>>>> v2 -> v3:
>>>> 1. Adjust patch2 and patch5 according to Matthias Brugger's advice, to 
>>>> make the
>>>>patches looks more well. The final code have no change. 
>>>>
>>>> v1 -> v2:
>>>> 1. Base on https://lkml.org/lkml/2016/5/24/679
>>>
>>> If you want bug fixes to land in 4.7, you'll need to base them on a
>>> mainline kernel.
>>
>> I heared that David Daney's acpi numa patch series was accepted and
>> put into next branch(Linux 4.8).
>> Otherwise I will suggest him sending his patch6-7 to mainline first.
>> So that, only a very small conflict will be exist.
>>
>> I also tested that:
>> 1. git am David Daney's patch6-7, then git am all of my patches on a
>> branch, named branch A.
>> 2. git am David Daney's patch6-7 on another branch, named branch B.
>> 3. when I git merge B into branch A, it's still conflict. So I guess
>> git merge is based on source code, rather than patches.
>>
>> So at present, unless the maintainers are willing to resolve the
>> conflict, otherwise I update my patches will not work.
> 
> It usually depends on how complex the conflict is and whether your
> patches functionally depend on the other patches. I have no idea what
> the dependency is here since I haven't tried applying them to mainline.
> 
>> Fortunately, these patches are not particularly urgent. So I think I
>> can wait until Linux 4.8 start, then send these patches again. But I'm
>> not sure whether these patches can be merged into Linux 4.8, I really
>> hope.
> 
> If there are fixes to the arm64 ACPI NUMA patches that Rafael queued
> into linux-next, they should be sent to him and potentially being queued
> on top ahead of the 4.8 merging window or shortly after 4.8-rc1.
> Non-ACPI NUMA patches (as I can see, most of these patches are DT
> specific) could be merged independently.
> 
> So how many patches do you have in each category below:
> 
> 1. NUMA fixes against current mainline (4.7-rc3)
> 2. NUMA fixes against the arm64 ACPI NUMA patches queued by Rafael
My patches do not fix any bugs in ACPI NUMA; they are just based on it.
There are only three related patches:
[PATCH v7 06/15] arm64, numa: rework numa_add_memblk()
[PATCH v7 07/15] arm64, numa: Cleanup NUMA disabled messages.
[PATCH v7 14/15] arm64, acpi, numa: NUMA support based on SRAT and SLIT

arch/arm64/mm/numa.c  |  28 --
drivers/of/of_numa.c  |   4 +-

My patches 1-5, 8, 11 will conflict with it.

> 3. New functionality or clean-up. Are these against mainline or ACPI
>NUMA patches?
Hi, Catalin
I'm sorry for replying to this email so late; I have been considering whether
there are any other solutions.

I have tried to adjust the sequence of my patches as below:
1. New functionality	// queued in your branch (my patches 9-14, and 6; 6 is clean-up)
2. 4.8-rc1		// acpi numa series and my new functionality have been merged
3. bug fixes		// other 4.8-rc versions (my patches 1-5)
4. clean-up (pr_fmt)	// queued in 4.9 (my patches 7-8)

And only one conflict exists:
++<<<<<<< HEAD
 +static u8 numa_distance[MAX_NUMNODES][MAX_NUMNODES];	// choose this
 +static int numa_off;
++=======
+ static int numa_distance_cnt;
+ static u8 *numa_distance;
+ static bool numa_off;	// choose this
++>>>>>>> acpi

> 



Re: [PATCH 2/2] PCI: generic: add description of property "interrupt-skip-mask"

2016-02-25 Thread Leizhen (ThunderTown)


On 2016/2/25 20:20, Mark Rutland wrote:
> Hi,
> 
> In future, please send the binding document first in a series, per point
> 3 of Documentation/devicetree/bindings/submitting-patches.txt. It makes
> review easier/faster.
Thank you for the reminder.

> 
> On Thu, Feb 25, 2016 at 07:53:28PM +0800, Zhen Lei wrote:
>> Interrupt Pin register is read-only and optional. Some pci devices may use
>> msi/msix but leave the value of Interrupt Pin non-zero.
> 
> Is that permitted by the spec? Surely 'optional' means it must be zero
> if not implemented?

In :
Devices (or device functions) that do not use an interrupt pin must put a 0 in 
this register. This register is read-only.

So, do you think this is a hardware bug? These PCI devices are not produced by
our company.

In init_service_irqs(), MSI-X is tried first, then MSI; the Interrupt Pin is the
last attempt. But of_irq_parse_pci() happens before this.


In fact, a similar problem also exists, as below:
pci :42:00.0: BAR 7: no space for [io  size 0x1000]
pci :42:00.0: BAR 7: failed to assign [io  size 0x1000]

There is no "io space" on arm64; it may only exist on x86. And the Memory Space
Indicator is also read-only in the BAR register.

> 
>> In this case, the driver will print information as below: pci
>> :40:00.0: of_irq_parse_pci() failed with rc=-22
>>
>> It's easily lead to misinterpret.
> 
> If this is limited to a subset of devices which we know are broken in
> this regard, can we not handle these cases explicitly?
Actually, we have another way to suppress this warning: use "interrupt-map" to map
it to a pseudo IRQ. But I think that would also be misunderstood.

> 
>> Signed-off-by: Zhen Lei 
>> ---
>>  Documentation/devicetree/bindings/pci/host-generic-pci.txt | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/pci/host-generic-pci.txt 
>> b/Documentation/devicetree/bindings/pci/host-generic-pci.txt
>> index 3f1d3fc..0f10978 100644
>> --- a/Documentation/devicetree/bindings/pci/host-generic-pci.txt
>> +++ b/Documentation/devicetree/bindings/pci/host-generic-pci.txt
>> @@ -70,6 +70,8 @@ Practice: Interrupt Mapping' and requires the following 
>> properties:
>>
>>  - interrupt-map-mask : 
>>
>> +- interrupt-skip-mask: Explicitly declare which pci devices only use 
>> msi/msix
>> +but leave the value of Interrupt Pin non-zero.
> 
> Unlike the rest of the interrupt mapping properties, this is not
> described in  `Open Firmware Recommended Practice: Interrupt Mapping'.
> 
> This needs a far more complete description.
> 
> This also doesn't strike me as th right approach. The interrupt-map-mask
> property describe as relationship between the host-controller-provided
> interrupt lines and endpoints, while this seems to be a bug completely
> contained within an endpoint.

In :
// PCI_DEVICE(3)  INT#(1)  CONTROLLER(PHANDLE)  CONTROLLER_DATA(3)
interrupt-map = <  0x0 0x0 0x0  0x1  0x0 0x4 0x1

PCI_DEVICE contains 3 cells, but only the first one is used in
of_irq_parse_pci():
laddr[0] = cpu_to_be32((pdev->bus->number << 16) | (pdev->devfn << 8));
laddr[1] = laddr[2] = cpu_to_be32(0);

And for INT#, I don't think some pins will be used while others are unused on a
PCI device, so I can omit it.

So only the laddr[0] mask needs to be described.
> 
> Thanks,
> Mark.
> 
>>
>>  Example:
>>
>> --
>> 2.5.0
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe devicetree" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
> .
> 



Re: [PATCH 1/1] arm64/dma-mapping: remove an unnecessary conversion

2016-03-15 Thread Leizhen (ThunderTown)


On 2016/3/15 23:37, Catalin Marinas wrote:
> On Tue, Mar 15, 2016 at 10:12:11AM +0800, Zhen Lei wrote:
>> 1. In swiotlb_alloc_coherent, the branch of __get_free_pages. Directly
>>return vaddr on success, and pass vaddr to free_pages on failure.
>> 2. So, we can directly transparent pass vaddr from __dma_free to
>>swiotlb_free_coherent, keep consistent with swiotlb_alloc_coherent.
>>
>> This patch have no functional change,
> 
> I don't think so.
> 
>> but can obtain a bit performance improvement.
> 
> Have you actually measured it?
I have not run any performance tests, but it removes a line of code; that is why
I said "a bit".

> 
>> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
>> index a6e757c..b2f2834 100644
>> --- a/arch/arm64/mm/dma-mapping.c
>> +++ b/arch/arm64/mm/dma-mapping.c
>> @@ -187,8 +187,6 @@ static void __dma_free(struct device *dev, size_t size,
>> void *vaddr, dma_addr_t dma_handle,
>> struct dma_attrs *attrs)
>>  {
>> -void *swiotlb_addr = phys_to_virt(dma_to_phys(dev, dma_handle));
>> -
>>  size = PAGE_ALIGN(size);
>>
>>  if (!is_device_dma_coherent(dev)) {
>> @@ -196,7 +194,7 @@ static void __dma_free(struct device *dev, size_t size,
>>  return;
>>  vunmap(vaddr);
>>  }
>> -__dma_free_coherent(dev, size, swiotlb_addr, dma_handle, attrs);
>> +__dma_free_coherent(dev, size, vaddr, dma_handle, attrs);
>>  }
> 
> What happens when !is_device_dma_coherent(dev)? (hint: read two lines
> above __dma_free_coherent).
> 
The whole __dma_free() function is as below (nobody uses swiotlb_addr except
__dma_free_coherent):
static void __dma_free(struct device *dev, size_t size,
   void *vaddr, dma_addr_t dma_handle,
   struct dma_attrs *attrs)
{
void *swiotlb_addr = phys_to_virt(dma_to_phys(dev, dma_handle));

size = PAGE_ALIGN(size);

if (!is_device_dma_coherent(dev)) {
if (__free_from_pool(vaddr, size))
return;
vunmap(vaddr);
}
__dma_free_coherent(dev, size, swiotlb_addr, dma_handle, attrs);
}




Re: Suspicious error for CMA stress test

2016-03-08 Thread Leizhen (ThunderTown)


On 2016/3/8 9:54, Leizhen (ThunderTown) wrote:
> 
> 
> On 2016/3/8 2:42, Laura Abbott wrote:
>> On 03/07/2016 12:16 AM, Leizhen (ThunderTown) wrote:
>>>
>>>
>>> On 2016/3/7 12:34, Joonsoo Kim wrote:
>>>> On Fri, Mar 04, 2016 at 03:35:26PM +0800, Hanjun Guo wrote:
>>>>> On 2016/3/4 14:38, Joonsoo Kim wrote:
>>>>>> On Fri, Mar 04, 2016 at 02:05:09PM +0800, Hanjun Guo wrote:
>>>>>>> On 2016/3/4 12:32, Joonsoo Kim wrote:
>>>>>>>> On Fri, Mar 04, 2016 at 11:02:33AM +0900, Joonsoo Kim wrote:
>>>>>>>>> On Thu, Mar 03, 2016 at 08:49:01PM +0800, Hanjun Guo wrote:
>>>>>>>>>> On 2016/3/3 15:42, Joonsoo Kim wrote:
>>>>>>>>>>> 2016-03-03 10:25 GMT+09:00 Laura Abbott <labb...@redhat.com>:
>>>>>>>>>>>> (cc -mm and Joonsoo Kim)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/02/2016 05:52 AM, Hanjun Guo wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I came across a suspicious error for CMA stress test:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Before the test, I got:
>>>>>>>>>>>>> -bash-4.3# cat /proc/meminfo | grep Cma
>>>>>>>>>>>>> CmaTotal: 204800 kB
>>>>>>>>>>>>> CmaFree:  195044 kB
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> After running the test:
>>>>>>>>>>>>> -bash-4.3# cat /proc/meminfo | grep Cma
>>>>>>>>>>>>> CmaTotal: 204800 kB
>>>>>>>>>>>>> CmaFree: 6602584 kB
>>>>>>>>>>>>>
>>>>>>>>>>>>> So the freed CMA memory is more than total..
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also the the MemFree is more than mem total:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -bash-4.3# cat /proc/meminfo
>>>>>>>>>>>>> MemTotal:   16342016 kB
>>>>>>>>>>>>> MemFree:22367268 kB
>>>>>>>>>>>>> MemAvailable:   22370528 kB
>>>>>>>>>> [...]
>>>>>>>>>>>> I played with this a bit and can see the same problem. The sanity
>>>>>>>>>>>> check of CmaFree < CmaTotal generally triggers in
>>>>>>>>>>>> __move_zone_freepage_state in unset_migratetype_isolate.
>>>>>>>>>>>> This also seems to be present as far back as v4.0 which was the
>>>>>>>>>>>> first version to have the updated accounting from Joonsoo.
>>>>>>>>>>>> Were there known limitations with the new freepage accounting,
>>>>>>>>>>>> Joonsoo?
>>>>>>>>>>> I don't know. I also played with this and looks like there is
>>>>>>>>>>> accounting problem, however, for my case, number of free page is 
>>>>>>>>>>> slightly less
>>>>>>>>>>> than total. I will take a look.
>>>>>>>>>>>
>>>>>>>>>>> Hanjun, could you tell me your malloc_size? I tested with 1 and it 
>>>>>>>>>>> doesn't
>>>>>>>>>>> look like your case.
>>>>>>>>>> I tested with malloc_size with 2M, and it grows much bigger than 1M, 
>>>>>>>>>> also I
>>>>>>>>>> did some other test:
>>>>>>>>> Thanks! Now, I can re-generate erronous situation you mentioned.
>>>>>>>>>
>>>>>>>>>>   - run with single thread with 10 times, everything is fine.
>>>>>>>>>>
>>>>>>>>>>   - I hack the cam_alloc() and free as below [1] to see if it's lock 
>>>>>>>>>> issue, with
>>>>>>>

Re: [PATCH 1/1] dma-mapping: to avoid exception when cpu_addr is NULL

2016-03-07 Thread Leizhen (ThunderTown)


On 2016/3/8 6:59, Andrew Morton wrote:
> On Mon, 7 Mar 2016 18:43:47 +0800 "Leizhen (ThunderTown)" 
> <thunder.leiz...@huawei.com> wrote:
> 
>> Suppose:
>> CONFIG_SPARSEMEM is opened.
>> CONFIG_DMA_API_DEBUG or CONFIG_CMA is opened.
>>
>> Then virt_to_page or phys_to_page will be called. Finally, in __pfn_to_page, 
>> __sec = __pfn_to_section(__pfn) is NULL.
>> So access section->section_mem_map will trigger exception.
>>
>> -
>>
>> #define __pfn_to_page(pfn)   \
>> ({   unsigned long __pfn = (pfn);\
>>  struct mem_section *__sec = __pfn_to_section(__pfn);\
>>  __section_mem_map_addr(__sec) + __pfn;  \
>> })
>>
>> static inline struct page *__section_mem_map_addr(struct mem_section 
>> *section)
>> {
>>  unsigned long map = section->section_mem_map;
>>  map &= SECTION_MAP_MASK;
>>  return (struct page *)map;
>> }
> 
> I'm having a bit of trouble understanding this.
> 
> Perhaps you could explain the bug more carefully (inclusion of an oops
> output would help) then we'll be in a better position to understand the
> proposed fix(es).
> 

Unable to handle kernel paging request at virtual address ffc020d3b2b8
pgd = ffc083a61000
[ffc020d3b2b8] *pgd=, *pud=
CPU: 4 PID: 1489 Comm: malloc_dma_1 Tainted: G   O
Hardware name:
task: ffc00d7d26c0 ti: ffc0837fc000 task.ti: ffc0837fc000
PC is at __dma_free_coherent.isra.10+0x74/0xc8
LR is at __dma_free+0x9c/0xb0
pc : [] lr : [] pstate: 8145
sp : ffc0837ff700
x29: ffc0837ff700 x28: 
x27:  x26: 
x25: ffc000d1b1d0 x24: 
x23: 00a0 x22: ffbfff5f
x21: 0010 x20: ffc2e21f7010
x19:  x18: 
x17: 007f9360a2b0 x16: ffc000541040
x15:  x14: 
x13:  x12: 0001
x11: 0068 x10: 0040
x9 : ffc000214e00 x8 : ffc2e54586b0
x7 :  x6 : 0004
x5 : ffc000214d64 x4 : 
x3 : 03ff x2 : 0003
x1 : 000f x0 : ffc000d3b2c0

Process malloc_dma_1 (pid: 1489, stack limit = 0xffc0837fc020)
Stack: (0xffc0837ff700 to 0xffc08380)
f700: ffc0837ff730 ffc000214e00 0010 
f720: ffc2e21f7010 ffc0837ff7d0 ffc0837ff770 ffbffc1d6134
f740: ffc2e21f7010 01a0 0064 ffc0837ff7d0
f760: ffc000c9fa20 ffc0837ffaf0 ffc0837ffe10 ffc000239b0c
f780: ffc00d54a280 ffc000d1ef58 ffc000957163 ffc2e21f7000
f7a0: ffbffc1d6030   
f7c0:   ffc01300 ffc013c0
f7e0: ffc013f0 ffc01420 ffc01460 ffc014a0
f800: ffc014e0 ffc01520 ffc01560 ffc015a0
f820: ffc015e0 ffc01620 ffc01660 ffc016a0
f840: ffc016e0 ffc01720 ffc01760 ffc017a0
f860: ffc017e0 ffc01820 ffc01860 ffc018a0
f880: ffc018e0 ffc01920 ffc01960 ffc019a0
f8a0: ffc019e0 ffc01a20 ffc01a60 ffc01aa0
f8c0: ffc01ae0 ffc01b20 ffc01b60 ffc01ba0
f8e0: ffc01be0 ffc01c20 ffc01c60 ffc01ca0
f900: ffc01ce0 ffc01d20 ffc01d60 ffc01da0
f920: ffc01de0 ffc01e20 ffc01e40 ffc01e60
f940: ffc01e90 ffc01ea0 ffc01eb0 ffc01ec0
f960: ffc01ee0 ffc01f20  
f980:    
f9a0:    
f9c0:    
f9e0:    
fa00:    
fa20:    
fa40:    
fa60:    
fa80:    
faa0:    
fac0:    
fae0:   13a0 1460
fb00: 1490 14c0 1500 15

Re: [PATCH 1/1] arm64/dma-mapping: remove an unnecessary conversion

2016-03-19 Thread Leizhen (ThunderTown)


On 2016/3/16 9:56, Leizhen (ThunderTown) wrote:
> 
> 
> On 2016/3/15 23:37, Catalin Marinas wrote:
>> On Tue, Mar 15, 2016 at 10:12:11AM +0800, Zhen Lei wrote:
>>> 1. In swiotlb_alloc_coherent, the branch of __get_free_pages. Directly
>>>return vaddr on success, and pass vaddr to free_pages on failure.
>>> 2. So, we can directly transparent pass vaddr from __dma_free to
>>>swiotlb_free_coherent, keep consistent with swiotlb_alloc_coherent.
>>>
>>> This patch have no functional change,
>>
>> I don't think so.
>>
>>> but can obtain a bit performance improvement.
>>
>> Have you actually measured it?
> I have not run any performance testing, but reduced a line of code. So I said 
> "a bit".
> 
>>
>>> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
>>> index a6e757c..b2f2834 100644
>>> --- a/arch/arm64/mm/dma-mapping.c
>>> +++ b/arch/arm64/mm/dma-mapping.c
>>> @@ -187,8 +187,6 @@ static void __dma_free(struct device *dev, size_t size,
>>>void *vaddr, dma_addr_t dma_handle,
>>>struct dma_attrs *attrs)
>>>  {
>>> -   void *swiotlb_addr = phys_to_virt(dma_to_phys(dev, dma_handle));
>>> -
>>> size = PAGE_ALIGN(size);
>>>
>>> if (!is_device_dma_coherent(dev)) {
>>> @@ -196,7 +194,7 @@ static void __dma_free(struct device *dev, size_t size,
>>> return;
>>> vunmap(vaddr);
>>> }
>>> -   __dma_free_coherent(dev, size, swiotlb_addr, dma_handle, attrs);
>>> +   __dma_free_coherent(dev, size, vaddr, dma_handle, attrs);
>>>  }
>>
>> What happens when !is_device_dma_coherent(dev)? (hint: read two lines
>> above __dma_free_coherent).
Are you afraid that "vaddr" may be modified by these statements?
First, it cannot be modified by __free_from_pool(); otherwise, the vunmap() that
follows it could not work well.
It cannot be modified by vunmap() either, since that parameter is defined as
"const void *".

In the call chain
__dma_free-->__dma_free_coherent-->swiotlb_free_coherent, only
swiotlb_free_coherent finally uses "vaddr".

>>
> The whole function of __dma_free as below: (nobody use swiotlb_addr except 
> __dma_free_coherent)
> static void __dma_free(struct device *dev, size_t size,
>void *vaddr, dma_addr_t dma_handle,
>struct dma_attrs *attrs)
> {
> void *swiotlb_addr = phys_to_virt(dma_to_phys(dev, dma_handle));
> 
> size = PAGE_ALIGN(size);
> 
> if (!is_device_dma_coherent(dev)) {
> if (__free_from_pool(vaddr, size))
> return;
> vunmap(vaddr);
> }
> __dma_free_coherent(dev, size, swiotlb_addr, dma_handle, attrs);
> }
> 



Re: [PATCH 1/1] arm64/dma-mapping: remove an unnecessary conversion

2016-03-19 Thread Leizhen (ThunderTown)


On 2016/3/17 19:59, Catalin Marinas wrote:
> On Thu, Mar 17, 2016 at 07:06:27PM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/3/16 9:56, Leizhen (ThunderTown) wrote:
>>> On 2016/3/15 23:37, Catalin Marinas wrote:
>>>> On Tue, Mar 15, 2016 at 10:12:11AM +0800, Zhen Lei wrote:
>>>>> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
>>>>> index a6e757c..b2f2834 100644
>>>>> --- a/arch/arm64/mm/dma-mapping.c
>>>>> +++ b/arch/arm64/mm/dma-mapping.c
>>>>> @@ -187,8 +187,6 @@ static void __dma_free(struct device *dev, size_t 
>>>>> size,
>>>>>  void *vaddr, dma_addr_t dma_handle,
>>>>>  struct dma_attrs *attrs)
>>>>>  {
>>>>> - void *swiotlb_addr = phys_to_virt(dma_to_phys(dev, dma_handle));
>>>>> -
>>>>>   size = PAGE_ALIGN(size);
>>>>>
>>>>>   if (!is_device_dma_coherent(dev)) {
>>>>> @@ -196,7 +194,7 @@ static void __dma_free(struct device *dev, size_t 
>>>>> size,
>>>>>   return;
>>>>>   vunmap(vaddr);
>>>>>   }
>>>>> - __dma_free_coherent(dev, size, swiotlb_addr, dma_handle, attrs);
>>>>> + __dma_free_coherent(dev, size, vaddr, dma_handle, attrs);
>>>>>  }
>>>>
>>>> What happens when !is_device_dma_coherent(dev)? (hint: read two lines
>>>> above __dma_free_coherent).
>>
>> Do you afraid "vaddr" maybe modified by these statement?
>> First, it could not be __free_from_pool. Otherwise, the function
>> vunmap(which after it) can not work well. Then, it count not be vunmap
>> too, the parameter is defined as "const void *".
>>
>> In the call chain:
>> __dma_free_coherent-->__dma_free_coherent-->swiotlb_free_coherent,
>> only swiotlb_free_coherent finally use "vaddr".
> 
> Exactly. So you give swiotlb_free_coherent a vaddr which has been
> unmapped. It doesn't even matter whether it's still mapped since this
> address is passed further to free_pages() which performs a
> virt_to_page(). The latter is *only* valid on linear map addresses (and
> you would actually hit the VM_BUG_ON in free_pages; you can try running
> this with CONFIG_DEBUG_VM enabled and non-coherent DMA).
> 
> For non-coherent DMA, the vaddr is not part of the linear mapping as it
> has been remapped by __dma_alloc() via dma_common_contiguous_remap(),
> hence for swiotlb freeing we need the actual linear map address (the
> original "ptr" in __dma_alloc()). We can generate it by a
> phys_to_virt(dma_to_phys(dma_handle)).
> 

OK, I got it.

So actually I should move the statement into the "if (!is_device_dma_coherent(dev))"
branch. I will prepare v2.



Re: [PATCH 1/1] dma-mapping: to avoid exception when cpu_addr is NULL

2016-03-07 Thread Leizhen (ThunderTown)


On 2016/3/7 19:41, One Thousand Gnomes wrote:
> On Mon, 7 Mar 2016 17:21:25 +0800
> Zhen Lei  wrote:
> 
>> Do this to keep consistent with kfree, which tolerate ptr is NULL.
>>
>> Signed-off-by: Zhen Lei 
> 
> This is inlined code so you are adding extra logic to every single
> instance of a call to the function. What is it's total effect on kernel
> size ?

This is a simple if statement; I think it will only generate two instructions.
Maybe I should move it into dma_free_attrs(), as below:
	if (!ops->free || !cpu_addr)
		return;

That way, it only generates one instruction, and dma_free_noncoherent() is also
covered.

Or should I change it to BUG_ON(!cpu_addr)?

Otherwise, I could move it into each ops->free implementation, but that would
touch more architectures.

> 
> Alan
> 
> .
> 



Re: [PATCH 1/1] dma-mapping: to avoid exception when cpu_addr is NULL

2016-03-07 Thread Leizhen (ThunderTown)
Suppose:
CONFIG_SPARSEMEM is enabled.
CONFIG_DMA_API_DEBUG or CONFIG_CMA is enabled.

Then virt_to_page or phys_to_page will be called. Finally, in __pfn_to_page,
__sec = __pfn_to_section(__pfn) is NULL.
So accessing section->section_mem_map will trigger an exception.

-

#define __pfn_to_page(pfn)  \
({  unsigned long __pfn = (pfn);\
struct mem_section *__sec = __pfn_to_section(__pfn);\
__section_mem_map_addr(__sec) + __pfn;  \
})

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
unsigned long map = section->section_mem_map;
map &= SECTION_MAP_MASK;
return (struct page *)map;
}


On 2016/3/7 17:21, Zhen Lei wrote:
> Do this to keep consistent with kfree, which tolerate ptr is NULL.
> 
> Signed-off-by: Zhen Lei 
> ---
>  include/linux/dma-mapping.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 75857cd..fdd4294 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -402,7 +402,10 @@ static inline void *dma_alloc_coherent(struct device 
> *dev, size_t size,
>  static inline void dma_free_coherent(struct device *dev, size_t size,
>   void *cpu_addr, dma_addr_t dma_handle)
>  {
> - return dma_free_attrs(dev, size, cpu_addr, dma_handle, NULL);
> + if (unlikely(!cpu_addr))
> + return;
> +
> + dma_free_attrs(dev, size, cpu_addr, dma_handle, NULL);
>  }
> 
>  static inline void *dma_alloc_noncoherent(struct device *dev, size_t size,
> --
> 2.5.0
> 
> 
> 
> .
> 



Re: Suspicious error for CMA stress test

2016-03-07 Thread Leizhen (ThunderTown)


On 2016/3/8 2:42, Laura Abbott wrote:
> On 03/07/2016 12:16 AM, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2016/3/7 12:34, Joonsoo Kim wrote:
>>> On Fri, Mar 04, 2016 at 03:35:26PM +0800, Hanjun Guo wrote:
>>>> On 2016/3/4 14:38, Joonsoo Kim wrote:
>>>>> On Fri, Mar 04, 2016 at 02:05:09PM +0800, Hanjun Guo wrote:
>>>>>> On 2016/3/4 12:32, Joonsoo Kim wrote:
>>>>>>> On Fri, Mar 04, 2016 at 11:02:33AM +0900, Joonsoo Kim wrote:
>>>>>>>> On Thu, Mar 03, 2016 at 08:49:01PM +0800, Hanjun Guo wrote:
>>>>>>>>> On 2016/3/3 15:42, Joonsoo Kim wrote:
>>>>>>>>>> 2016-03-03 10:25 GMT+09:00 Laura Abbott <labb...@redhat.com>:
>>>>>>>>>>> (cc -mm and Joonsoo Kim)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 03/02/2016 05:52 AM, Hanjun Guo wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I came across a suspicious error for CMA stress test:
>>>>>>>>>>>>
>>>>>>>>>>>> Before the test, I got:
>>>>>>>>>>>> -bash-4.3# cat /proc/meminfo | grep Cma
>>>>>>>>>>>> CmaTotal: 204800 kB
>>>>>>>>>>>> CmaFree:  195044 kB
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> After running the test:
>>>>>>>>>>>> -bash-4.3# cat /proc/meminfo | grep Cma
>>>>>>>>>>>> CmaTotal: 204800 kB
>>>>>>>>>>>> CmaFree: 6602584 kB
>>>>>>>>>>>>
>>>>>>>>>>>> So the freed CMA memory is more than total..
>>>>>>>>>>>>
>>>>>>>>>>>> Also the the MemFree is more than mem total:
>>>>>>>>>>>>
>>>>>>>>>>>> -bash-4.3# cat /proc/meminfo
>>>>>>>>>>>> MemTotal:   16342016 kB
>>>>>>>>>>>> MemFree:22367268 kB
>>>>>>>>>>>> MemAvailable:   22370528 kB
>>>>>>>>> [...]
>>>>>>>>>>> I played with this a bit and can see the same problem. The sanity
>>>>>>>>>>> check of CmaFree < CmaTotal generally triggers in
>>>>>>>>>>> __move_zone_freepage_state in unset_migratetype_isolate.
>>>>>>>>>>> This also seems to be present as far back as v4.0 which was the
>>>>>>>>>>> first version to have the updated accounting from Joonsoo.
>>>>>>>>>>> Were there known limitations with the new freepage accounting,
>>>>>>>>>>> Joonsoo?
>>>>>>>>>> I don't know. I also played with this and looks like there is
>>>>>>>>>> accounting problem, however, for my case, number of free page is 
>>>>>>>>>> slightly less
>>>>>>>>>> than total. I will take a look.
>>>>>>>>>>
>>>>>>>>>> Hanjun, could you tell me your malloc_size? I tested with 1 and it 
>>>>>>>>>> doesn't
>>>>>>>>>> look like your case.
>>>>>>>>> I tested with malloc_size with 2M, and it grows much bigger than 1M, 
>>>>>>>>> also I
>>>>>>>>> did some other test:
>>>>>>>> Thanks! Now, I can re-generate erronous situation you mentioned.
>>>>>>>>
>>>>>>>>>   - run with single thread with 10 times, everything is fine.
>>>>>>>>>
>>>>>>>>>   - I hack the cam_alloc() and free as below [1] to see if it's lock 
>>>>>>>>> issue, with
>>>>>>>>> the same test with 100 multi-thread, then I got:
>>>>>>>> [1] would not be sufficient to close this race.
>>>>>>>>
>>>>>>>> Try following things [A]. And, for more accurate test, I changed code 
>>>>>>>> a bit more

Re: Suspicious error for CMA stress test

2016-03-07 Thread Leizhen (ThunderTown)


On 2016/3/7 12:34, Joonsoo Kim wrote:
> On Fri, Mar 04, 2016 at 03:35:26PM +0800, Hanjun Guo wrote:
>> On 2016/3/4 14:38, Joonsoo Kim wrote:
>>> On Fri, Mar 04, 2016 at 02:05:09PM +0800, Hanjun Guo wrote:
 On 2016/3/4 12:32, Joonsoo Kim wrote:
> On Fri, Mar 04, 2016 at 11:02:33AM +0900, Joonsoo Kim wrote:
>> On Thu, Mar 03, 2016 at 08:49:01PM +0800, Hanjun Guo wrote:
>>> On 2016/3/3 15:42, Joonsoo Kim wrote:
 2016-03-03 10:25 GMT+09:00 Laura Abbott :
> (cc -mm and Joonsoo Kim)
>
>
> On 03/02/2016 05:52 AM, Hanjun Guo wrote:
>> Hi,
>>
>> I came across a suspicious error for CMA stress test:
>>
>> Before the test, I got:
>> -bash-4.3# cat /proc/meminfo | grep Cma
>> CmaTotal: 204800 kB
>> CmaFree:  195044 kB
>>
>>
>> After running the test:
>> -bash-4.3# cat /proc/meminfo | grep Cma
>> CmaTotal: 204800 kB
>> CmaFree: 6602584 kB
>>
>> So the freed CMA memory is more than total..
>>
>> Also the the MemFree is more than mem total:
>>
>> -bash-4.3# cat /proc/meminfo
>> MemTotal:   16342016 kB
>> MemFree:22367268 kB
>> MemAvailable:   22370528 kB
>>> [...]
> I played with this a bit and can see the same problem. The sanity
> check of CmaFree < CmaTotal generally triggers in
> __move_zone_freepage_state in unset_migratetype_isolate.
> This also seems to be present as far back as v4.0 which was the
> first version to have the updated accounting from Joonsoo.
> Were there known limitations with the new freepage accounting,
> Joonsoo?
 I don't know. I also played with this and looks like there is
 accounting problem, however, for my case, number of free page is 
 slightly less
 than total. I will take a look.

 Hanjun, could you tell me your malloc_size? I tested with 1 and it 
 doesn't
 look like your case.
>>> I tested with malloc_size with 2M, and it grows much bigger than 1M, 
>>> also I
>>> did some other test:
>> Thanks! Now, I can re-generate erronous situation you mentioned.
>>
>>>  - run with single thread with 10 times, everything is fine.
>>>
>>>  - I hack the cam_alloc() and free as below [1] to see if it's lock 
>>> issue, with
>>>the same test with 100 multi-thread, then I got:
>> [1] would not be sufficient to close this race.
>>
>> Try following things [A]. And, for more accurate test, I changed code a 
>> bit more
>> to prevent kernel page allocation from cma area [B]. This will prevent 
>> kernel
>> page allocation from cma area completely so we can focus 
>> cma_alloc/release race.
>>
>> Although, this is not correct fix, it could help that we can guess
>> where the problem is.
> More correct fix is something like below.
> Please test it.
 Hmm, this is not working:
>>> Sad to hear that.
>>>
>>> Could you tell me your system's MAX_ORDER and pageblock_order?
>>>
>>
>> MAX_ORDER is 11, pageblock_order is 9, thanks for your help!
> 
> Hmm... that's same with me.
> 
> Below is similar fix that prevents buddy merging when one of buddy's
> migrate type, but, not both, is MIGRATE_ISOLATE. In fact, I have
> no idea why previous fix (more correct fix) doesn't work for you.
> (It works for me.) But, maybe there is a bug on the fix
> so I make new one which is more general form. Please test it.

Hi,
Hanjun Guo has gone to Thailand on business, so I helped him run this patch.
The result shows that the "CmaFree:" count is OK now, but it sometimes prints
some information as below:

alloc_contig_range: [28500, 28600) PFNs busy
alloc_contig_range: [28300, 28380) PFNs busy

> 
> Thanks.
> 
> -->8-
>>From dd41e348572948d70b935fc24f82c096ff0fb417 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim 
> Date: Fri, 4 Mar 2016 13:28:17 +0900
> Subject: [PATCH] mm/cma: fix race
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/page_alloc.c | 33 +++--
>  1 file changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c6c38ed..d80d071 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -620,8 +620,8 @@ static inline void rmv_page_order(struct page *page)
>   *
>   * For recording page's order, we use page_private(page).
>   */
> -static inline int page_is_buddy(struct page *page, struct page *buddy,
> -   unsigned int order)
> +static inline int page_is_buddy(struct zone *zone, struct page *page,
> +   struct page *buddy, unsigned int order)
>  {
> if 

Re: [PATCH 1/1] arm64: fix flush_cache_range

2016-05-24 Thread Leizhen (ThunderTown)


On 2016/5/25 9:20, Leizhen (ThunderTown) wrote:
> 
> 
> On 2016/5/24 21:02, Catalin Marinas wrote:
>> On Tue, May 24, 2016 at 08:19:05PM +0800, Leizhen (ThunderTown) wrote:
>>> On 2016/5/24 19:37, Mark Rutland wrote:
>>>> On Tue, May 24, 2016 at 07:16:37PM +0800, Zhen Lei wrote:
>>>>> When we ran mprotect04 (a test case in LTP) in an infinite loop, it would
>>>>> always fail after a few seconds. The case can be briefly described as:
>>>>> copy an empty function from the code area into a new memory area (created
>>>>> by mmap), then call mprotect to change the protection to PROT_EXEC. The
>>>>> syscall sys_mprotect will finally invoke flush_cache_range, but this
>>>>> function currently only invalidates the icache; the dcache flush is missed.
>>>>
>>>> In the LTP code I see powerpc-specific D-cache / I-cache synchronisation
>>>> (i.e. d-cache cleaning followed by I-cache invalidation), so there
>>>> appears to be some expectation of userspace maintenance. However, there
>>>> is no such ARM-specific I-cache maintenance.
>>>
>>> But I see that some other platforms do have D-cache maintenance here, e.g.
>>> arch/nios2/mm/cacheflush.c. And judging by the name of flush_cache_range,
>>> it should do this. Otherwise, mprotect04 would fail on more platforms,
>>> which would be easy to discover. Only PPC has platform-specific cache
>>> synchronization, maybe because of some hardware limitation. It's unlikely
>>> that a programmer would fix a common bug on only one platform and leave
>>> the others unchanged.
>>
>> flush_cache_range() is primarily used on VIVT caches before changing the
>> mapping and should not really be implemented on arm64. I don't recall
>> why we still have the I-cache invalidation, possibly for the ASID-tagged
>> VIVT I-cache case, though we should have a specific check for this.
>>
>> There are some other cases where flush_cache_range() is called and no
>> D-cache maintenance is necessary on arm64, so I don't want to penalise
>> them by implementing flush_cache_range().
>>
>>>> It looks like the test may be missing I-cache maintenance regardless of
>>>> the semantics of mprotect in this case.
>>>>
>>>> I have not yet delved into flush_cache_range and how it is called.
>>>
>>> SYSCALL_DEFINE3(mprotect ---> mprotect_fixup ---> change_protection ---> 
>>> change_protection_range --> flush_cache_range
>>
>> The change_protection() shouldn't need to flush the caches in
>> flush_cache_range(). The change_pte_range() function eventually ends up
>> calling set_pte_at() which calls __sync_icache_dcache() if the mapping
>> is executable.
> 
> OK, I see.
> But I'm afraid it entered the "if (pte_present(oldpte))" branch in function 
> change_pte_range.
> Because the test case called mmap to create pte first, then called pte_modify.
> I will check it later.

I have checked that it entered the "if (pte_present(oldpte))" branch.

But I don't know why adding flush_icache_range works, while adding
__sync_icache_dcache has no effect.

> 
>>
>> Can you be more specific about the kernel version you are using, its
>> configuration?
>>
> I used the latest mainline kernel, built with arch/arm64/configs/defconfig,
> and ran it on our D02 board.
> I have attached the test case; you can simply run: sh test.sh
> 



Re: [PATCH 1/1] arm64: fix flush_cache_range

2016-05-24 Thread Leizhen (ThunderTown)


On 2016/5/24 19:37, Mark Rutland wrote:
> On Tue, May 24, 2016 at 07:16:37PM +0800, Zhen Lei wrote:
>> When we ran mprotect04 (a test case in LTP) in an infinite loop, it would
>> always fail after a few seconds. The case can be briefly described as:
>> copy an empty function from the code area into a new memory area (created
>> by mmap), then call mprotect to change the protection to PROT_EXEC. The
>> syscall sys_mprotect will finally invoke flush_cache_range, but this
>> function currently only invalidates the icache; the dcache flush is missed.
> 
> In the LTP code I see powerpc-specific D-cache / I-cache synchronisation
> (i.e. d-cache cleaning followed by I-cache invalidation), so there
> appears to be some expectation of userspace maintenance. However, there
> is no such ARM-specific I-cache maintenance.
But I see that some other platforms do have D-cache maintenance here, e.g.
arch/nios2/mm/cacheflush.c. And judging by the name of flush_cache_range, it
should do this. Otherwise, mprotect04 would fail on more platforms, which
would be easy to discover. Only PPC has platform-specific cache
synchronization, maybe because of some hardware limitation. It's unlikely
that a programmer would fix a common bug on only one platform and leave the
others unchanged.

> 
> It looks like the test may be missing I-cache maintenance regardless of
> the semantics of mprotect in this case.
> 
> I have not yet delved into flush_cache_range and how it is called.

SYSCALL_DEFINE3(mprotect ---> mprotect_fixup ---> change_protection ---> 
change_protection_range --> flush_cache_range

> 
> Thanks,
> Mark.
> 
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/mm/flush.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>> index dbd12ea..eda4124 100644
>> --- a/arch/arm64/mm/flush.c
>> +++ b/arch/arm64/mm/flush.c
>> @@ -31,7 +31,7 @@ void flush_cache_range(struct vm_area_struct *vma, 
>> unsigned long start,
>> unsigned long end)
>>  {
>>  if (vma->vm_flags & VM_EXEC)
>> -__flush_icache_all();
>> +flush_icache_range(start, end);
>>  }
>>
>>  static void sync_icache_aliases(void *kaddr, unsigned long len)
>> --
>> 2.5.0
>>
>>
>>
> 
> .
> 



Re: [PATCH 1/1] tty/serial: to support 8250 earlycon can be enabled independently

2016-05-17 Thread Leizhen (ThunderTown)


On 2016/5/16 23:40, Peter Hurley wrote:
> On 05/16/2016 04:35 AM, Zhen Lei wrote:
>> Sometimes we may only use SSH to log in, and build the 8250 uart driver as
>> a module (insmod when needed). But earlycon may still be necessary, because
>> the kernel boot process may take a long time. It's not good to display
>> nothing and ask people to wait patiently.
> 
> I'm confused; you want the possibility of earlycon but _not_ a normal
> serial console?
Our downstream customers want to add some private functions to 8250.ko, so we
cannot build the 8250 driver into the kernel Image.

> 
> This configuration is unsafe because nothing prevents the 8250 driver
> and 8250 earlycon from concurrently accessing the hardware.
earlycon is a boot console; it will be disabled in printk_late_init()
(assuming we have not set keep_bootcon).
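As an illustration of the intended usage (the MMIO address below is a board-specific placeholder, not taken from the patch): with SERIAL_8250_EARLYCON=y and 8250.ko left as a module, the kernel command line would request only the boot console, e.g.:

```
earlycon=uart8250,mmio32,0x3f8e0000
```

Once printk_late_init() disables the boot console, output stops and the user continues over SSH; no regular serial console is ever registered.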

> 
> 
>> In addition, 8250.ko cannot work if we have not enabled any other serial
>> drivers, because SERIAL_CORE would not be selected.
> 
> I don't understand what this means.

Before I enabled CONFIG_SERIAL_AMBA_PL011_CONSOLE (with 8250 built only as a
module, this case cannot work):
CONFIG_SERIAL_CORE=m

After I enabled CONFIG_SERIAL_AMBA_PL011_CONSOLE:
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_AMBA_PL011=y
CONFIG_SERIAL_AMBA_PL011_CONSOLE=y
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y

> 
> Regards,
> Peter Hurley
> 
> 
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/tty/serial/8250/Kconfig  | 9 +++--
>>  drivers/tty/serial/8250/Makefile | 1 -
>>  drivers/tty/serial/Makefile  | 1 +
>>  3 files changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/tty/serial/8250/Kconfig 
>> b/drivers/tty/serial/8250/Kconfig
>> index 4d7cb9c..2992f0a 100644
>> --- a/drivers/tty/serial/8250/Kconfig
>> +++ b/drivers/tty/serial/8250/Kconfig
>> @@ -3,6 +3,12 @@
>>  # you somehow have an implicit or explicit dependency on SERIAL_8250.
>>  #
>>
>> +config SERIAL_8250_EARLYCON
>> +bool "Early console using 8250"
>> +select SERIAL_CORE
>> +select SERIAL_CORE_CONSOLE
>> +select SERIAL_EARLYCON
>> +
>>  config SERIAL_8250
>>  tristate "8250/16550 and compatible serial support"
>>  select SERIAL_CORE
>> @@ -60,8 +66,7 @@ config SERIAL_8250_PNP
>>  config SERIAL_8250_CONSOLE
>>  bool "Console on 8250/16550 and compatible serial port"
>>  depends on SERIAL_8250=y
>> -select SERIAL_CORE_CONSOLE
>> -select SERIAL_EARLYCON
>> +select SERIAL_8250_EARLYCON
>>  ---help---
>>If you say Y here, it will be possible to use a serial port as the
>>system console (the system console is the device which receives all
>> diff --git a/drivers/tty/serial/8250/Makefile 
>> b/drivers/tty/serial/8250/Makefile
>> index c9a2d6e..1f24c74 100644
>> --- a/drivers/tty/serial/8250/Makefile
>> +++ b/drivers/tty/serial/8250/Makefile
>> @@ -13,7 +13,6 @@ obj-$(CONFIG_SERIAL_8250_HP300)+= 8250_hp300.o
>>  obj-$(CONFIG_SERIAL_8250_CS)+= serial_cs.o
>>  obj-$(CONFIG_SERIAL_8250_ACORN) += 8250_acorn.o
>>  obj-$(CONFIG_SERIAL_8250_BCM2835AUX)+= 8250_bcm2835aux.o
>> -obj-$(CONFIG_SERIAL_8250_CONSOLE)   += 8250_early.o
>>  obj-$(CONFIG_SERIAL_8250_FOURPORT)  += 8250_fourport.o
>>  obj-$(CONFIG_SERIAL_8250_ACCENT)+= 8250_accent.o
>>  obj-$(CONFIG_SERIAL_8250_BOCA)  += 8250_boca.o
>> diff --git a/drivers/tty/serial/Makefile b/drivers/tty/serial/Makefile
>> index 8c261ad..cd84181 100644
>> --- a/drivers/tty/serial/Makefile
>> +++ b/drivers/tty/serial/Makefile
>> @@ -19,6 +19,7 @@ obj-$(CONFIG_SERIAL_SUNSAB) += sunsab.o
>>
>>  # Now bring in any enabled 8250/16450/16550 type drivers.
>>  obj-$(CONFIG_SERIAL_8250) += 8250/
>> +obj-$(CONFIG_SERIAL_8250_EARLYCON) += 8250/8250_early.o
>>
>>  obj-$(CONFIG_SERIAL_AMBA_PL010) += amba-pl010.o
>>  obj-$(CONFIG_SERIAL_AMBA_PL011) += amba-pl011.o
>> --
>> 2.5.0
>>
>>
> 
> 
> .
> 



Re: [PATCH v5 03/14] arm64/numa: add nid check for memory block

2016-08-09 Thread Leizhen (ThunderTown)


On 2016/8/10 10:12, Hanjun Guo wrote:
> On 2016/8/8 17:18, Zhen Lei wrote:
>> Use the same tactic as for the cpu and numa-distance nodes.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/mm/numa.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index c7fe3ec..2601660 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -141,6 +141,11 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
>>  {
>>  int ret;
>>
>> +if (nid >= MAX_NUMNODES) {
>> +pr_warn("NUMA: Node id %u exceeds maximum value\n", nid);
>> +return -EINVAL;
>> +}
> 
> I think this check should be added to of_numa_parse_memory_nodes(), which is
> called before numa_add_memblk(). It's the same logic as in
> of_numa_parse_cpu_nodes(), and the node id is checked before calling
> numa_add_memblk() in the ACPI path.

Yes, you are right. This check is arch independent.

> 
> Thanks
> Hanjun
> 
> 
> 
> .
> 



Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-07-19 Thread Leizhen (ThunderTown)


On 2016/7/12 23:35, Catalin Marinas wrote:
> On Mon, Jul 11, 2016 at 08:43:32PM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/7/9 0:13, Catalin Marinas wrote:
>>> On Fri, Jul 08, 2016 at 11:24:26PM +0800, Leizhen (ThunderTown) wrote:
>>>> On 2016/7/8 21:54, Catalin Marinas wrote:
>>>>> 8<
>>>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>>>>> index dbd12ea8ce68..c753fa804165 100644
>>>>> --- a/arch/arm64/mm/flush.c
>>>>> +++ b/arch/arm64/mm/flush.c
>>>>> @@ -75,7 +75,8 @@ void __sync_icache_dcache(pte_t pte, unsigned long addr)
>>>>>   if (!page_mapping(page))
>>>>>   return;
>>>>>  
>>>>> - if (!test_and_set_bit(PG_dcache_clean, &page->flags))
>>>>> + if (!test_and_set_bit(PG_dcache_clean, &page->flags) ||
>>>>> + PageDirty(page))
>>>>>   sync_icache_aliases(page_address(page),
>>>>>   PAGE_SIZE << compound_order(page));
>>>>>   else if (icache_is_aivivt())
>>>>> 8<-
Hi, Catalin:
  Do you plan to send this patch? My colleagues told me that if our patches
are quite different, it should carry your Signed-off-by.

  I searched the whole Linux source tree; __sync_icache_dcache is only called
by set_pte_at, and some check conditions (especially pte_exec) limit its
impact:

	if (pte_user(pte) && pte_exec(pte) && !pte_special(pte))
		__sync_icache_dcache(pte, addr);

>>>>>
>>>>> BTW, can you make your tests (source) available somewhere?
>>>>
>>>> Both cases worked well with this patch.
>>>
>>> Now I'm even more confused ;). IIUC, after an msync() in user space we
>>> should flush the pages to disk via write_cache_pages(). This function
>>> calls clear_page_dirty_for_io() after which PageDirty() is no longer
>>> true. I can't tell how a subsequent mmap() can see the written pages as
>>> dirty.
>>
>> As my tracing, both cases invoked empty function.
>>
>> int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int 
>> datasync)
>>  ..
>>  return file->f_op->fsync(file, start, end, datasync);
>> }
>>
>> const struct file_operations hugetlbfs_file_operations = {
>>  .fsync  = noop_fsync,
>>
>> static const struct file_operations shmem_file_operations = {
>>  .mmap   = shmem_mmap,
>> #ifdef CONFIG_TMPFS
>>  .fsync  = noop_fsync,
> 
> I was referring to standard filesystem (e.g. ext4) writes where, IIUC,
> the PageDirty() status is cleared after I/O but it's not necessarily
> removed from the page cache.
> 



Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-07-08 Thread Leizhen (ThunderTown)


On 2016/7/8 21:54, Catalin Marinas wrote:
> On Fri, Jul 08, 2016 at 11:36:57AM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/7/7 23:37, Catalin Marinas wrote:
>>> On Thu, Jul 07, 2016 at 08:09:04PM +0800, Zhen Lei wrote:
>>>> At present, PG_dcache_clean is only cleared when the related huge page
>>>> is about to be freed. But sometimes there may be a process in charge of
>>>> copying binary code into a shared memory area, which then notifies other
>>>> processes to execute it. The first time, there is no problem, because
>>>> the default page->flags has PG_dcache_clean cleared, so the cache
>>>> will be maintained at set_pte_at time for the other processes. But if
>>>> the content of the shared memory has been updated again, there are no
>>>> cache operations, because PG_dcache_clean is still set.
>>>>
>>>> For example:
>>>> Process A
>>>>open a hugetlbfs file
>>>>mmap it as a shared memory
>>>>copy some binary codes into it
>>>>munmap
>>>>
>>>> Process B
>>>>open the hugetlbfs file
>>>>mmap it as a shared memory, executable
>>>>invoke the functions in the shared memory
>>>>munmap
>>>>
>>>> repeat the above steps.
>>>
>>> Does this work as you would expect with small pages (and for example
>>> shared file mmap)? I don't want to have a different behaviour between
>>> small and huge pages.
>>
>> The small pages also have this problem, I will try to fix it too.
> 
> Have you run the above tests on a standard file (with small pages)? It's
> strange that we haven't hit this so far with gcc or something else
> generating code (unless they don't use mmap but just sequential writes).
The test code is randomly generated, to make sure the content in the I-cache
is always stale. I have attached the simplified test-case demo.

The main portion is excerpted below:
srand(time(NULL));
ptr = (unsigned int *)share_mem;
*ptr++ = 0xd280;//mov x0, #0
for (i = 0, total = 0; i < 100; i++) {
value = 0xfff & rand();
total += value;
*ptr++ = 0xb100 | (value << 10);//adds x0, x0, #value
}
*ptr = 0xd65f03c0;  //ret

> 
> If both cases need solving, we might better move the fix in the
> __sync_icache_dcache() function. Untested:
Yes.

At first I also wanted to fix it as below. But I'm not sure when PageDirty
will be cleared, and if two or more processes mmap it as executable, the
cache operations will be duplicated. At present, I really have not found any
good place to clear PG_dcache_clean, so the modification below may be the
best choice: concise and clear.

> 
> 8<
> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> index dbd12ea8ce68..c753fa804165 100644
> --- a/arch/arm64/mm/flush.c
> +++ b/arch/arm64/mm/flush.c
> @@ -75,7 +75,8 @@ void __sync_icache_dcache(pte_t pte, unsigned long addr)
>   if (!page_mapping(page))
>   return;
>  
> - if (!test_and_set_bit(PG_dcache_clean, &page->flags))
> + if (!test_and_set_bit(PG_dcache_clean, &page->flags) ||
> + PageDirty(page))
>   sync_icache_aliases(page_address(page),
>   PAGE_SIZE << compound_order(page));
>   else if (icache_is_aivivt())
> 8<-
> 
> BTW, can you make your tests (source) available somewhere?
Both cases worked well with this patch.

> 
> Thanks.
> 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define FILENAME	"/mnt/huge/test_file"
#define TST_MMAP_SIZE   0x20

typedef unsigned int (*TEST_FUNC_T)(void);

/*
 * mkdir -p /mnt/huge
 * echo 20 > /proc/sys/vm/nr_hugepages
 * mount none /mnt/huge -t hugetlbfs -o pagesize=2048K
 */
int main(void)
{
int i;
int fd;
int ret;
void *share_mem;
size_t size;
struct stat sb;
TEST_FUNC_T func_ptr;
unsigned int value, total;
unsigned int *ptr;

fd = open(FILENAME, O_RDWR | O_CREAT);
if (fd == -1) {
printf("Open file %s failed P1: %s\n", FILENAME, 
strerror(errno));
return 1;
}

lseek(fd, TST_MMAP_SIZE - 1, SEEK_SET);  
write(fd, "", 1);

share_mem = mmap(NULL, TST_MMAP_SIZE, PROT_READ | PROT_WRITE, 
MAP_SHARED, fd, 0);
if (share_mem == MAP_FAILED) {

Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-07-11 Thread Leizhen (ThunderTown)


On 2016/7/9 0:13, Catalin Marinas wrote:
> On Fri, Jul 08, 2016 at 11:24:26PM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/7/8 21:54, Catalin Marinas wrote:
>>> On Fri, Jul 08, 2016 at 11:36:57AM +0800, Leizhen (ThunderTown) wrote:
>>>> On 2016/7/7 23:37, Catalin Marinas wrote:
>>>>> On Thu, Jul 07, 2016 at 08:09:04PM +0800, Zhen Lei wrote:
>>>>>> At present, PG_dcache_clean is only cleared when the related huge page
>>>>>> is about to be freed. But sometimes there may be a process in charge of
>>>>>> copying binary code into a shared memory area, which then notifies other
>>>>>> processes to execute it. The first time, there is no problem, because
>>>>>> the default page->flags has PG_dcache_clean cleared, so the cache
>>>>>> will be maintained at set_pte_at time for the other processes. But if
>>>>>> the content of the shared memory has been updated again, there are no
>>>>>> cache operations, because PG_dcache_clean is still set.
>>>>>>
>>>>>> For example:
>>>>>> Process A
>>>>>>  open a hugetlbfs file
>>>>>>  mmap it as a shared memory
>>>>>>  copy some binary codes into it
>>>>>>  munmap
>>>>>>
>>>>>> Process B
>>>>>>  open the hugetlbfs file
>>>>>>  mmap it as a shared memory, executable
>>>>>>  invoke the functions in the shared memory
>>>>>>  munmap
>>>>>>
>>>>>> repeat the above steps.
>>>>>
>>>>> Does this work as you would expect with small pages (and for example
>>>>> shared file mmap)? I don't want to have a different behaviour between
>>>>> small and huge pages.
>>>>
>>>> The small pages also have this problem, I will try to fix it too.
> [...]
>>> If both cases need solving, we might better move the fix in the
>>> __sync_icache_dcache() function. Untested:
>>
>> At first I also want to fix it as below. But I'm not sure which time the 
>> PageDirty
>> will be cleared, and if two or more processes mmap it as executable, cache 
>> operations
>> will be duplicated. At present, I really have not found any good place to 
>> clear
>> PG_dcache_clean. So the below modification may be the best choice, concisely 
>> and clearly.
>>
>>> 8<
>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>>> index dbd12ea8ce68..c753fa804165 100644
>>> --- a/arch/arm64/mm/flush.c
>>> +++ b/arch/arm64/mm/flush.c
>>> @@ -75,7 +75,8 @@ void __sync_icache_dcache(pte_t pte, unsigned long addr)
>>> if (!page_mapping(page))
>>> return;
>>>  
>>> -   if (!test_and_set_bit(PG_dcache_clean, &page->flags))
>>> +   if (!test_and_set_bit(PG_dcache_clean, &page->flags) ||
>>> +   PageDirty(page))
>>> sync_icache_aliases(page_address(page),
>>> PAGE_SIZE << compound_order(page));
>>> else if (icache_is_aivivt())
>>> 8<-
>>>
>>> BTW, can you make your tests (source) available somewhere?
>>
>> Both cases worked well with this patch.
> 
> Now I'm even more confused ;). IIUC, after an msync() in user space we
> should flush the pages to disk via write_cache_pages(). This function
> calls clear_page_dirty_for_io() after which PageDirty() is no longer
> true. I can't tell how a subsequent mmap() can see the written pages as
> dirty.
> 

As my tracing, both cases invoked empty function.

int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
..
return file->f_op->fsync(file, start, end, datasync);
}

const struct file_operations hugetlbfs_file_operations = {
.fsync  = noop_fsync,

static const struct file_operations shmem_file_operations = {
.mmap   = shmem_mmap,
#ifdef CONFIG_TMPFS
.fsync  = noop_fsync,



Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-07-07 Thread Leizhen (ThunderTown)


On 2016/7/7 23:37, Catalin Marinas wrote:
> On Thu, Jul 07, 2016 at 08:09:04PM +0800, Zhen Lei wrote:
>> At present, PG_dcache_clean is only cleared when the related huge page
>> is about to be freed. But sometimes there may be a process in charge of
>> copying binary code into a shared memory area, which then notifies other
>> processes to execute it. The first time, there is no problem, because
>> the default page->flags has PG_dcache_clean cleared, so the cache
>> will be maintained at set_pte_at time for the other processes. But if
>> the content of the shared memory has been updated again, there are no
>> cache operations, because PG_dcache_clean is still set.
>>
>> For example:
>> Process A
>>  open a hugetlbfs file
>>  mmap it as a shared memory
>>  copy some binary codes into it
>>  munmap
>>
>> Process B
>>  open the hugetlbfs file
>>  mmap it as a shared memory, executable
>>  invoke the functions in the shared memory
>>  munmap
>>
>> repeat the above steps.
> 
> Does this work as you would expect with small pages (and for example
> shared file mmap)? I don't want to have a different behaviour between
> small and huge pages.

Small pages also have this problem; I will try to fix it too.

> 



Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-08-21 Thread Leizhen (ThunderTown)


On 2016/7/20 17:19, Catalin Marinas wrote:
> On Wed, Jul 20, 2016 at 10:46:27AM +0800, Leizhen (ThunderTown) wrote:
>>>>>> On 2016/7/8 21:54, Catalin Marinas wrote:
>>>>>>> 8<
>>>>>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>>>>>>> index dbd12ea8ce68..c753fa804165 100644
>>>>>>> --- a/arch/arm64/mm/flush.c
>>>>>>> +++ b/arch/arm64/mm/flush.c
>>>>>>> @@ -75,7 +75,8 @@ void __sync_icache_dcache(pte_t pte, unsigned long 
>>>>>>> addr)
>>>>>>> if (!page_mapping(page))
>>>>>>> return;
>>>>>>>  
>>>>>>> -   if (!test_and_set_bit(PG_dcache_clean, &page->flags))
>>>>>>> +   if (!test_and_set_bit(PG_dcache_clean, &page->flags) ||
>>>>>>> +   PageDirty(page))
>>>>>>> sync_icache_aliases(page_address(page),
>>>>>>> PAGE_SIZE << compound_order(page));
>>>>>>> else if (icache_is_aivivt())
>>>>>>> 8<-
>>
>> Do you plan to send this patch? My colleagues told me that if our
>> patches are quite different, it should be Signed-off-by you.
> 
> The reason I'm not sending it is that I don't fully understand how it
> solves the problem for a shared file mmap(), not just hugetlbfs. As I
> said in an earlier email: after an msync() in user space we
> should flush the pages to disk via write_cache_pages(). This function
Hi Catalin:
   I'm so sorry, my fault. The previous small-pages test was actually run on
ramfs. Today, I ran the case on a hard-disk filesystem, and it worked well
without this patch.

Summarized as follows:
small pages on ramfs: need this patch
small pages on hard-disk fs: no need for this patch
hugetlbfs: need this patch



> calls clear_page_dirty_for_io() after which PageDirty() is no longer
> true. I can't tell how a subsequent mmap() can see the written pages as
> dirty.
> 
>> I searched all Linux source code, __sync_icache_dcache is only called
>> by set_pte_at, and some check conditions(especially pte_exec) will
>> limit its impact.
>>
>>  if (pte_user(pte) && pte_exec(pte) && !pte_special(pte))
>>  __sync_icache_dcache(pte, addr);
> 
> Yes, and set_pte_at() would be called as a result of a page fault when
> accessing the mmap'ed file.
> 



Re: [PATCH v6 00/14] fix some type infos and bugs for arm64/of numa

2016-08-22 Thread Leizhen (ThunderTown)
Hi everybody:
   Can this patch series be accepted, or does it still need to be improved?
It has been a long time.

Thanks,
   Zhen Lei


On 2016/8/11 17:33, Zhen Lei wrote:
> v5 -> v6:
> Move memblk nid check from arch/arm64/mm/numa.c into drivers/of/of_numa.c,
> because this check is arch independent.
> 
> This modification is only related to patch 3, but it impacts the contents of
> patches 7 and 8; the other patches are unchanged.
> other patches have no change.
> 
> v4 -> v5:
> This version has no code changes; it just adds "Acked-by: Rob Herring 
> "
> to patches 1, 2, 4, 6, 7, 13, 14. These patches rely on some ACPI NUMA
> patches, which had not been upstreamed in 4.7 but were upstreamed in
> 4.8-rc1, so I am resending my patches.
> 
> v3 -> v4:
> 1. Packed three patches of Kefeng Wang, patch6-8.
> 2. Add 6 new patches(9-15) to enhance the numa on arm64.
> 
> v2 -> v3:
> 1. Adjust patch2 and patch5 according to Matthias Brugger's advice, to make 
> the
>patches looks more well. The final code have no change. 
> 
> v1 -> v2:
> 1. Base on https://lkml.org/lkml/2016/5/24/679
> 2. Rewrote of_numa_parse_memory_nodes according to Rob Herring's advice. So 
> that it looks more clear.
> 3. Rewrote patch 5 because some scenes were not considered before.
> 
> Kefeng Wang (3):
>   of_numa: Use of_get_next_parent to simplify code
>   of_numa: Use pr_fmt()
>   arm64: numa: Use pr_fmt()
> 
> Zhen Lei (11):
>   of/numa: remove a duplicated pr_debug information
>   of/numa: fix a memory@ node can only contains one memory block
>   arm64/numa: add nid check for memory block
>   of/numa: remove a duplicated warning
>   arm64/numa: avoid inconsistent information to be printed
>   arm64/numa: support HAVE_SETUP_PER_CPU_AREA
>   arm64/numa: define numa_distance as array to simplify code
>   arm64/numa: support HAVE_MEMORYLESS_NODES
>   arm64/numa: remove some useless code
>   of/numa: remove the constraint on the distances of node pairs
>   Documentation: remove the constraint on the distances of node pairs
> 
>  Documentation/devicetree/bindings/numa.txt |   1 -
>  arch/arm64/Kconfig |  12 ++
>  arch/arm64/include/asm/numa.h  |   1 -
>  arch/arm64/kernel/smp.c|   1 +
>  arch/arm64/mm/numa.c   | 223 
> -
>  drivers/of/of_numa.c   |  88 ++--
>  6 files changed, 178 insertions(+), 148 deletions(-)
> 
> --
> 2.5.0
> 
> 
> 
> .
> 



Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-08-24 Thread Leizhen (ThunderTown)


On 2016/8/24 1:28, Catalin Marinas wrote:
> On Mon, Aug 22, 2016 at 12:19:04PM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/7/20 17:19, Catalin Marinas wrote:
>>> On Wed, Jul 20, 2016 at 10:46:27AM +0800, Leizhen (ThunderTown) wrote:
>>>>>>>> On 2016/7/8 21:54, Catalin Marinas wrote:
>>>>>>>>> 8<
>>>>>>>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>>>>>>>>> index dbd12ea8ce68..c753fa804165 100644
>>>>>>>>> --- a/arch/arm64/mm/flush.c
>>>>>>>>> +++ b/arch/arm64/mm/flush.c
>>>>>>>>> @@ -75,7 +75,8 @@ void __sync_icache_dcache(pte_t pte, unsigned long 
>>>>>>>>> addr)
>>>>>>>>>   if (!page_mapping(page))
>>>>>>>>>   return;
>>>>>>>>>  
>>>>>>>>> - if (!test_and_set_bit(PG_dcache_clean, &page->flags))
>>>>>>>>> + if (!test_and_set_bit(PG_dcache_clean, &page->flags) ||
>>>>>>>>> + PageDirty(page))
>>>>>>>>>   sync_icache_aliases(page_address(page),
>>>>>>>>>   PAGE_SIZE << compound_order(page));
>>>>>>>>>   else if (icache_is_aivivt())
>>>>>>>>> 8<-
>>>>
>>>> Do you plan to send this patch? My colleagues told me that if our
>>>> patches are quite different, it should be Signed-off-by you.
>>>
>>> The reason I'm not sending it is that I don't fully understand how it
>>> solves the problem for a shared file mmap(), not just hugetlbfs. As I
>>> said in an earlier email: after an msync() in user space we
>>> should flush the pages to disk via write_cache_pages(). This function
>> Hi Catalin:
>>I'm so sorry for my fault. The previous small pages test result I 
>> actually ran on ramfs.
>> Today, I ran the case on harddisk fs, it worked well without this patch.
>>
>> Summarized as follows:
>> small pages on ramfs: need this patch
>> small pages on harddisk fs: no need this patch
>> hugetlbfs: need this patch
> 
> I would add:
> 
> small pages over nfs: fails with or without this patch
> 
> (tested on Juno, Cortex-A57; seems to be fixed if I remove the
> PG_dcache_clean test altogether but, well, we end up over-flushing)
> 
> I assume that when using a hard drive, it goes through the block I/O
> layer and we may have a flush_dcache_page() called when the kernel is
> about to read a page that has been mapped in user space. This would
> clear the PG_dcache_clean bit and subsequent __sync_icache_dcache()
> would perform cache maintenance.
> 
> Could you try on your system the test case without the msync() call? I'm
According to my test results, without msync the test case may fail.

10-175-112-211:~ # ./tst_small_page_no_msync
Test is Failed: The result is 0x316b9, expect = 0x365a5
10-175-112-211:~ # ./tst_small_page_no_msync
Test is Failed: The result is 0x31023, expect = 0x31efa
10-175-112-211:~ # ./tst_small_page_no_msync
Test is Passed: The result is 0x31efa, expect = 0x31efa

10-175-112-211:~ # ./tst_small_page
Test is Passed: The result is 0x31eb7, expect = 0x31eb7
10-175-112-211:~ # ./tst_small_page
Test is Passed: The result is 0x3111f, expect = 0x3111f
10-175-112-211:~ # ./tst_small_page
Test is Passed: The result is 0x3111f, expect = 0x3111f

> not sure whether munmap() would trigger an immediate write-back, in
> which case we may see the issue even with the filesystem on a hard
> drive.
> 



Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-08-24 Thread Leizhen (ThunderTown)


On 2016/8/24 18:30, Catalin Marinas wrote:
> On Wed, Aug 24, 2016 at 05:00:50PM +0800, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2016/8/24 1:28, Catalin Marinas wrote:
>>> On Mon, Aug 22, 2016 at 12:19:04PM +0800, Leizhen (ThunderTown) wrote:
>>>> On 2016/7/20 17:19, Catalin Marinas wrote:
>>>>> On Wed, Jul 20, 2016 at 10:46:27AM +0800, Leizhen (ThunderTown) wrote:
>>>>>>>>>> On 2016/7/8 21:54, Catalin Marinas wrote:
>>>>>>>>>>> 8<
>>>>>>>>>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>>>>>>>>>>> index dbd12ea8ce68..c753fa804165 100644
>>>>>>>>>>> --- a/arch/arm64/mm/flush.c
>>>>>>>>>>> +++ b/arch/arm64/mm/flush.c
>>>>>>>>>>> @@ -75,7 +75,8 @@ void __sync_icache_dcache(pte_t pte, unsigned 
>>>>>>>>>>> long addr)
>>>>>>>>>>> if (!page_mapping(page))
>>>>>>>>>>> return;
>>>>>>>>>>>  
>>>>>>>>>>> -   if (!test_and_set_bit(PG_dcache_clean, &page->flags))
>>>>>>>>>>> +   if (!test_and_set_bit(PG_dcache_clean, &page->flags) ||
>>>>>>>>>>> +   PageDirty(page))
>>>>>>>>>>> sync_icache_aliases(page_address(page),
>>>>>>>>>>> PAGE_SIZE << compound_order(page));
>>>>>>>>>>> else if (icache_is_aivivt())
>>>>>>>>>>> 8<-
>>>>>>
>>>>>> Do you plan to send this patch? My colleagues told me that if our
>>>>>> patches are quite different, it should carry your Signed-off-by.
>>>>>
>>>>> The reason I'm not sending it is that I don't fully understand how it
>>>>> solves the problem for a shared file mmap(), not just hugetlbfs. As I
>>>>> said in an earlier email: after an msync() in user space we
>>>>> should flush the pages to disk via write_cache_pages(). This function
>>>> Hi Catalin:
>>>> I'm sorry, my fault. The previous small-pages test was actually run
>>>> on ramfs.
>>>> Today I ran the case on a hard-disk filesystem, and it worked well
>>>> without this patch.
>>>>
>>>> Summarized as follows:
>>>> small pages on ramfs: need this patch
>>>> small pages on harddisk fs: no need this patch
>>>> hugetlbfs: need this patch
>>>
>>> I would add:
>>>
>>> small pages over nfs: fails with or without this patch
>>>
>>> (tested on Juno, Cortex-A57; seems to be fixed if I remove the
>>> PG_dcache_clean test altogether but, well, we end up over-flushing)
>>>
>>> I assume that when using a hard drive, it goes through the block I/O
>>> layer and we may have a flush_dcache_page() called when the kernel is
>>> about to read a page that has been mapped in user space. This would
>>> clear the PG_dcache_clean bit and subsequent __sync_icache_dcache()
>>> would perform cache maintenance.
>>>
>>> Could you try on your system the test case without the msync() call? I'm
>>
>> According to my test results, the test case may fail without msync.
> 
> Thanks. Just to be clear, does the test generate a file on on a hard
> drive?
Yes. I checked that the intermediate file had been generated.

> 
>> 10-175-112-211:~ # ./tst_small_page_no_msync
>> Test is Failed: The result is 0x316b9, expect = 0x365a5
>> 10-175-112-211:~ # ./tst_small_page_no_msync
>> Test is Failed: The result is 0x31023, expect = 0x31efa
>> 10-175-112-211:~ # ./tst_small_page_no_msync
>> Test is Passed: The result is 0x31efa, expect = 0x31efa
>>
>> 10-175-112-211:~ # ./tst_small_page
>> Test is Passed: The result is 0x31eb7, expect = 0x31eb7
>> 10-175-112-211:~ # ./tst_small_page
>> Test is Passed: The result is 0x3111f, expect = 0x3111f
>> 10-175-112-211:~ # ./tst_small_page
>> Test is Passed: The result is 0x3111f, expect = 0x3111f
> 
> How many tests did you run for the "passed" case? With NFS it may
I ran ./tst_small_page_no_msync and ./tst_small_page 10 times each.

> sometime take minutes before a failure (I use the "watch" command with a
> slightly modified test to return non-zero in case of value mismatch).
> 
> While we indeed see failures on multiple filesystem types, I wonder
> whether this test case is actually expected to work. If I modify the
> test to pass O_TRUNC to open(), I can no longer see failures. So any
> standard tool that copies/creates executable files (gcc, dpkg, cp, rsync
> etc.) wouldn't encounter such issues since they truncate the original
> file and old page cache pages would be removed.
> 
> Do you have a real use-case where a task mmap's an executable file,
> modifies it in place and expects another task to see the new
> instructions without user-space cache maintenance?
No, it's just a test case created by testers.

> 



Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-08-25 Thread Leizhen (ThunderTown)


On 2016/8/25 17:30, Catalin Marinas wrote:
> On Thu, Aug 25, 2016 at 09:42:26AM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/8/24 18:30, Catalin Marinas wrote:
>>>>>>>>>>>> On 2016/7/8 21:54, Catalin Marinas wrote:
>>>>>>>>>>>>> 8<
>>>>>>>>>>>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>>>>>>>>>>>>> index dbd12ea8ce68..c753fa804165 100644
>>>>>>>>>>>>> --- a/arch/arm64/mm/flush.c
>>>>>>>>>>>>> +++ b/arch/arm64/mm/flush.c
>>>>>>>>>>>>> @@ -75,7 +75,8 @@ void __sync_icache_dcache(pte_t pte, unsigned long addr)
>>>>>>>>>>>>>   if (!page_mapping(page))
>>>>>>>>>>>>>   return;
>>>>>>>>>>>>>  
>>>>>>>>>>>>> - if (!test_and_set_bit(PG_dcache_clean, &page->flags))
>>>>>>>>>>>>> + if (!test_and_set_bit(PG_dcache_clean, &page->flags) ||
>>>>>>>>>>>>> + PageDirty(page))
>>>>>>>>>>>>>   sync_icache_aliases(page_address(page),
>>>>>>>>>>>>>   PAGE_SIZE << compound_order(page));
>>>>>>>>>>>>>   else if (icache_is_aivivt())
>>>>>>>>>>>>> 8<-
> [...]
>>> While we indeed see failures on multiple filesystem types, I wonder
>>> whether this test case is actually expected to work. If I modify the
>>> test to pass O_TRUNC to open(), I can no longer see failures. So any
>>> standard tool that copies/creates executable files (gcc, dpkg, cp, rsync
>>> etc.) wouldn't encounter such issues since they truncate the original
>>> file and old page cache pages would be removed.
>>>
>>> Do you have a real use-case where a task mmap's an executable file,
>>> modifies it in place and expects another task to see the new
>>> instructions without user-space cache maintenance?
>>
>> No, it's just a test case created by testers.
> 
> In this case I propose we ignore this patch and you adjust the test to
> use O_TRUNC, at least until we find a real scenario where this would
> matter.
OK, thanks. We currently call __clear_cache in user space.

> 



Re: [PATCH v7 14/14] Documentation: remove the constraint on the distances of node pairs

2016-08-30 Thread Leizhen (ThunderTown)


On 2016/8/31 1:55, Will Deacon wrote:
> On Sat, Aug 27, 2016 at 06:44:39PM +0800, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2016/8/26 23:35, Will Deacon wrote:
>>> On Wed, Aug 24, 2016 at 03:44:53PM +0800, Zhen Lei wrote:
>>>> Update documentation. This limit is unnecessary.
>>>>
>>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>>> Acked-by: Rob Herring <r...@kernel.org>
>>>> ---
>>>>  Documentation/devicetree/bindings/numa.txt | 1 -
>>>>  1 file changed, 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/numa.txt 
>>>> b/Documentation/devicetree/bindings/numa.txt
>>>> index 21b3505..c0ea4a7 100644
>>>> --- a/Documentation/devicetree/bindings/numa.txt
>>>> +++ b/Documentation/devicetree/bindings/numa.txt
>>>> @@ -48,7 +48,6 @@ distance (memory latency) between all numa nodes.
>>>>
>>>>Note:
>>>>1. Each entry represents distance from first node to second node.
>>>> -  The distances are equal in either direction.
>>>
>>> Hmm, so what happens now if firmware provides a description where both
>>> distances (in either direction) are supplied, but are different?
>> I don't know of any hardware yet where the distances in the two
>> directions are different
> 
> Then let's not add support for this just yet. When we have systems that
> actually need it, we'll be in a much better position to assess the
> suitability of any patches. At the moment, the whole thing is pretty
> questionable and it adds needless complication to the code.
How about I change it to:
"To simplify the configuration, the distance in the opposite direction
defaults to the same value."
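Under that wording, a distance-map could list each pair once and let the reverse direction be implied. A hypothetical fragment in the binding's existing style (node IDs and distance values are made up):

```dts
distance-map {
	compatible = "numa-distance-map-v1";
	distance-matrix = <0 0 10>,
			  <0 1 20>,	/* 1 -> 0 then defaults to 20 */
			  <1 1 10>;
};
```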

> 
> Will
> 
> .
> 



Re: [PATCH v7 05/14] arm64/numa: avoid inconsistent information to be printed

2016-08-30 Thread Leizhen (ThunderTown)


On 2016/8/31 1:51, Will Deacon wrote:
> On Sat, Aug 27, 2016 at 04:54:56PM +0800, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2016/8/26 20:47, Will Deacon wrote:
>>> On Wed, Aug 24, 2016 at 03:44:44PM +0800, Zhen Lei wrote:
>>>> numa_init(of_numa_init) may return an error because of a numa
>>>> configuration error, so "No NUMA configuration found" is inaccurate. In
>>>> fact, the specific configuration error should already have been printed
>>>> by the branch that detected it.
>>>>
>>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>>> ---
>>>>  arch/arm64/mm/numa.c | 6 +++---
>>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>> index 5bb15ea..d97c6e2 100644
>>>> --- a/arch/arm64/mm/numa.c
>>>> +++ b/arch/arm64/mm/numa.c
>>>> @@ -335,8 +335,10 @@ static int __init numa_init(int (*init_func)(void))
>>>>if (ret < 0)
>>>>return ret;
>>>>
>>>> -  if (nodes_empty(numa_nodes_parsed))
>>>> +  if (nodes_empty(numa_nodes_parsed)) {
>>>> +  pr_info("No NUMA configuration found\n");
>>>>return -EINVAL;
>>>
>>> Hmm, but dummy_numa_init calls node_set(nid, numa_nodes_parsed) for a
>>> completely artificial setup, created by adding all memblocks to node 0,
>>> so this new message will be suppressed even though things really did go
>>> wrong.
>> It will be printed by the former: numa_init(of_numa_init)
> 
> Does that print an error for every possible failure case? What about the
> acpi path?
I think the ACPI path should print the error by itself. The reasons may be:
1. In numa_init and its sub-functions, all error paths print an error
immediately, except arm64_acpi_numa_init.
2. If numa_init returns an error, we do not print the returned error code,
so the user doesn't know what problem caused ACPI NUMA to fail.


> 
>>> In that case, don't we want to print *something* (like we do today in
>>> dummy_numa_init) but maybe not "No NUMA configuration found"? What
>>> exactly do you find inaccurate about the current message?
>> For example:
>> [0.00] NUMA: No distance-matrix property in distance-map
>> [0.00] No NUMA configuration found
>>
>> So if of_numa_init or arm64_acpi_numa_init returned an error because a
>> NUMA configuration error was found, it's no good to also print "No
>> NUMA ...".
> 
> Sure, I'm all for changing the message. I just think removing it is
> probably unhelpful. Something like:
> 
> "NUMA: Failed to initialise from firmware"
I think adding this to arm64_acpi_numa_init would be better; maybe we should
also print 'ret':

int __init arm64_acpi_numa_init(void)
{
	int ret;

	ret = acpi_numa_init();
	if (ret) {
+		pr_info("Failed to initialise from firmware\n");
		return ret;
	}

> 
> might do the trick?
> 
> Will
> 
> .
> 



Re: [PATCH v8 00/16] fix some type infos and bugs for arm64/of numa

2016-09-08 Thread Leizhen (ThunderTown)


On 2016/9/8 19:01, Will Deacon wrote:
> On Thu, Sep 01, 2016 at 02:54:51PM +0800, Zhen Lei wrote:
>> v7 -> v8:
>> Updated patches according to Will Deacon's review comments, thanks.
>>
>> The changed patches are: 3, 5, 8, 9, 10, 11, 12, 13, 15
>> Patch 3 requires an ack from Rob Herring.
>> Patch 10 requires an ack from linux-mm.
>>
>> Hi, Will:
>> Something should still be clarified:
>> Patch 5, I modified it according to my last reply. BTW, I think the last
>> statement "srat_disabled() ? -EINVAL : 0" of arm64_acpi_numa_init should
>> be moved into acpi_numa_init.
>>
>> Patch 9, I still leave the code in arch/arm64.
>>  1) the implementations of setup_per_cpu_areas on all platforms are
>> different.
>>  2) Although my implementation referred to PowerPC, it is still somewhat
>> different.
>>
>> Patch 15, I modified the description again. Can you take a look at it? If
>> this patch is dropped, patch 14 should also be dropped.
>>
>> Patch 16, how many times node_distance is called depends on the
>> application (many tasks need to be scheduled). I have not prepared the
>> measurements yet, so I give up this patch, as you advised.
> 
> Ok, I'm trying to pick the pieces out of this patch series and it's not
> especially easy. As far as I can tell:
> 
>   Patch 3 needs an ack from the device-tree folks
Rob just acked.

> 
>   Patch 10 needs an ack from the memblock folks
I'll immediately send an email to remind them.

> 
>   Patch 11 depends on patch 10
> 
>   Patches 14,15,16 can wait for the time being (I still don't see their
>   value).
OK, that's no problem. So I put them at the end in advance.

> 
> So, I could pick up patches 1-2, 4-9 and 12-13 but it's not clear whether
Now you can also add patch 3.

> that makes any sense. The whole series seems to be a mix of trivial printk
The most valuable patches are 2, 9 and 11. The others are just a programmer
wanting the code to be nice.

> cleanups, a bunch of core OF stuff, some new features and then some
> questionable changes at the end.
> 
> Please throw me a clue,
> 
> Will
> 
> .
> 



Re: [PATCH v8 10/16] mm/memblock: add a new function memblock_alloc_near_nid

2016-09-08 Thread Leizhen (ThunderTown)
Hi, linux-mm folks:
Can somebody help me review this patch?
I ran scripts/get_maintainer.pl -f mm/memblock.c and
scripts/get_maintainer.pl -f mm/, but the results showed me that there is
no maintainer.
To understand this patch, please also read patch 11.

On 2016/9/1 14:55, Zhen Lei wrote:
> If HAVE_MEMORYLESS_NODES is selected and some memoryless numa nodes
> actually exist, the percpu variable areas and numa control blocks of
> those memoryless numa nodes must be allocated from the nearest available
> node to improve performance.
> 
> Signed-off-by: Zhen Lei 
> ---
>  include/linux/memblock.h |  1 +
>  mm/memblock.c| 28 
>  2 files changed, 29 insertions(+)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 2925da2..8e866e0 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -290,6 +290,7 @@ static inline int memblock_get_region_node(const struct memblock_region *r)
> 
>  phys_addr_t memblock_alloc_nid(phys_addr_t size, phys_addr_t align, int nid);
>  phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid);
> +phys_addr_t memblock_alloc_near_nid(phys_addr_t size, phys_addr_t align, int nid);
> 
>  phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 483197e..6578fff 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1189,6 +1189,34 @@ again:
>   return ret;
>  }
> 
> +phys_addr_t __init memblock_alloc_near_nid(phys_addr_t size, phys_addr_t align, int nid)
> +{
> + int i, best_nid, distance;
> + u64 pa;
> + DECLARE_BITMAP(nodes_map, MAX_NUMNODES);
> +
> + bitmap_zero(nodes_map, MAX_NUMNODES);
> +
> +find_nearest_node:
> + best_nid = NUMA_NO_NODE;
> + distance = INT_MAX;
> +
> + for_each_clear_bit(i, nodes_map, MAX_NUMNODES)
> + if (node_distance(nid, i) < distance) {
> + best_nid = i;
> + distance = node_distance(nid, i);
> + }
> +
> + pa = memblock_alloc_nid(size, align, best_nid);
> + if (!pa) {
> + BUG_ON(best_nid == NUMA_NO_NODE);
> + bitmap_set(nodes_map, best_nid, 1);
> + goto find_nearest_node;
> + }
> +
> + return pa;
> +}
> +
>  phys_addr_t __init __memblock_alloc_base(phys_addr_t size, phys_addr_t align, phys_addr_t max_addr)
>  {
>   return memblock_alloc_base_nid(size, align, max_addr, NUMA_NO_NODE,
> --
> 2.5.0
> 
> 
> 
> .
> 



Re: [RFC] Arm64 boot fail with numa enable in BIOS

2016-09-19 Thread Leizhen (ThunderTown)


On 2016/9/19 22:45, Will Deacon wrote:
> On Mon, Sep 19, 2016 at 03:07:19PM +0100, Mark Rutland wrote:
>> [adding LAKML, arm64 maintainers]
> 
> I've also looped in Euler ThunderTown, since (a) he's at Huawei and is
> assumedly testing this stuff and (b) he has a fairly big NUMA patch
> series doing the rounds (some of which I've queued).
In my patch series, only one patch resolves a crash problem, and it is
related to device-tree.

> 
>> On Mon, Sep 19, 2016 at 09:05:26PM +0800, Yisheng Xie wrote:
>> In future, please make sure to Cc LAKML along with relevant parties when
>> sending arm64 patches/queries.
>>
>> For everyone newly Cc'd, the original message (with attachments) can be
>> found at:
>>
>> http://lkml.kernel.org/r/7618d76d-bfa8-d8aa-59aa-06f9d90c1...@huawei.com
>>
>>> When I enable NUMA in the BIOS for arm64, it fails to boot on
>>> v4.8-rc4-162-g071e31e.
>>
>> That commit ID doesn't seem to be in mainline (I can't find it in my
>> local tree). Which tree are you using? Do you have local patches
>> applied?
> 
> That commit is in mainline:
> 
>   http://git.kernel.org/linus/071e31e
> 
> It would be nice to know if the problem also exists on the arm64
> for-next/core branch.
> 
> Will
> 
> 
>> I take it that by "enable NUMA in BIOS", you mean exposing SRAT to the
>> OS?
>>
>>> For the crash log, it seems caused by error number of cpumask.
>>> Any ideas about it?
>>
>> Much earlier in your log, there was a (non-fatal) warning, as below. Do
>> you see this without NUMA/SRAT enabled in your FW? I don't see how the
>> SRAT should affect the secondaries we try to bring online.
>>
>> Given your MPIDRs have Aff2 bits set, I wonder if we've conflated a
>> logical ID with a physical ID somewhere, and it just so happens that the
>> NUMA code is more likely to poke something based on that.
>>
>> Can you modify the warning in cpumask.h to dump the bad CPU number? That
>> would make it fairly clear if that's the case.
>>
>> Thanks,
>> Mark.
>>
>>> [0.297337] Detected PIPT I-cache on CPU1
>>> [0.297347] GICv3: CPU1: found redistributor 10001 region 
>>> 1:0x4d14
>>> [0.297356] CPU1: Booted secondary processor [410fd082]
>>> [0.297375] [ cut here ]
>>> [0.320390] WARNING: CPU: 1 PID: 0 at ./include/linux/cpumask.h:121 
>>> gic_raise_softirq+0x128/0x17c
>>> [0.329356] Modules linked in:
>>> [0.332434] 
>>> [0.333932] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
>>> 4.8.0-rc4-00163-g803ea3a #21
>>> [0.341581] Hardware name: Hisilicon Hi1616 Evaluation Board (DT)
>>> [0.347735] task: 8013e9dd task.stack: 8013e9dcc000
>>> [0.353714] PC is at gic_raise_softirq+0x128/0x17c
>>> [0.358550] LR is at gic_raise_softirq+0xa0/0x17c
>>> [0.363298] pc : [] lr : [] pstate: 
>>> 21c5
>>> [0.370770] sp : 8013e9dcfde0
>>> [0.374112] x29: 8013e9dcfde0 x28:  
>>> [0.379476] x27: 0083207c x26: 08ca5d70 
>>> [0.384841] x25: 00010001 x24: 08d63ff3 
>>> [0.390205] x23:  x22: 08cb 
>>> [0.395569] x21: 0884edb0 x20: 0001 
>>> [0.400933] x19: 0001 x18:  
>>> [0.406298] x17:  x16: 03010066 
>>> [0.411661] x15: 08ca8000 x14: 0013 
>>> [0.417025] x13:  x12: 0013 
>>> [0.422389] x11: 0013 x10: 02e92aa7 
>>> [0.427754] x9 :  x8 : 8413eb6ca668 
>>> [0.433118] x7 : 8413eb6ca690 x6 :  
>>> [0.438482] x5 : fffe x4 :  
>>> [0.443845] x3 : 0040 x2 : 0041 
>>> [0.449209] x1 :  x0 : 0001 
>>> [0.454573] 
>>> [0.456069] ---[ end trace b58e70f3295a8cd7 ]---
>>> [0.460730] Call trace:
>>> [0.463193] Exception stack(0x8013e9dcfc10 to 0x8013e9dcfd40)
>>> [0.469699] fc00:   0001 
>>> 0001
>>> [0.477611] fc20: 8013e9dcfde0 0838c124 08d72228 
>>> 8013e9dcff70
>>> [0.485524] fc40: 08d72608 08ab02a4  
>>> 
>>> [0.493436] fc60:  3464313430303030  
>>> 
>>> [0.501348] fc80: 8013e9dcfc90 0836e678 8013e9dcfca0 
>>> 0836e910
>>> [0.509259] fca0: 8013e9dcfd30 0836ec10 0001 
>>> 
>>> [0.517171] fcc0: 0041 0040  
>>> fffe
>>> [0.525083] fce0:  8413eb6ca690 8413eb6ca668 
>>> 
>>> [0.532995] fd00: 02e92aa7 0013 0013 
>>> 
>>> [0.540907] fd20: 0013 08ca8000 03010066 
>>> 
>>> [

Re: [PATCH v7 03/14] arm64/numa: add nid check for memory block

2016-08-27 Thread Leizhen (ThunderTown)


On 2016/8/26 20:39, Will Deacon wrote:
> On Wed, Aug 24, 2016 at 03:44:42PM +0800, Zhen Lei wrote:
>> Use the same tactic as for the cpu and numa-distance nodes.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/of/of_numa.c | 5 +
>>  1 file changed, 5 insertions(+)
> 
> The subject has arm64/numa, but this is clearly core OF code and
I originally added the check below in arch/arm64/mm/numa.c, until Hanjun Guo
told me that it should move into drivers/of/of_numa.c.

I forgot to update this.

> requires an ack from Rob.
> 
> The commit message also doesn't make much sense to me.
> 
>> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
>> index 7b3fbdc..afaeb9c 100644
>> --- a/drivers/of/of_numa.c
>> +++ b/drivers/of/of_numa.c
>> @@ -75,6 +75,11 @@ static int __init of_numa_parse_memory_nodes(void)
>>   */
>>  continue;
>>
>> +if (nid >= MAX_NUMNODES) {
>> +pr_warn("NUMA: Node id %u exceeds maximum value\n", nid);
>> +return -EINVAL;
>> +}
> 
> Do you really want to return from the function here? Shouldn't we at least
> of_node_put(np), i.e. by using a break; ?
Thanks for pointing out this mistake. I will change it to "r = -EINVAL" in
the next version.

> 
> Will
> 
> .
> 



Re: [PATCH v7 05/14] arm64/numa: avoid inconsistent information to be printed

2016-08-27 Thread Leizhen (ThunderTown)


On 2016/8/26 20:47, Will Deacon wrote:
> On Wed, Aug 24, 2016 at 03:44:44PM +0800, Zhen Lei wrote:
>> numa_init(of_numa_init) may return an error because of a numa
>> configuration error, so "No NUMA configuration found" is inaccurate. In
>> fact, the specific configuration error should already have been printed
>> by the branch that detected it.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/mm/numa.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index 5bb15ea..d97c6e2 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -335,8 +335,10 @@ static int __init numa_init(int (*init_func)(void))
>>  if (ret < 0)
>>  return ret;
>>
>> -if (nodes_empty(numa_nodes_parsed))
>> +if (nodes_empty(numa_nodes_parsed)) {
>> +pr_info("No NUMA configuration found\n");
>>  return -EINVAL;
> 
> Hmm, but dummy_numa_init calls node_set(nid, numa_nodes_parsed) for a
> completely artificial setup, created by adding all memblocks to node 0,
> so this new message will be suppressed even though things really did go
> wrong.
It will be printed by the former: numa_init(of_numa_init)

> 
> In that case, don't we want to print *something* (like we do today in
> dummy_numa_init) but maybe not "No NUMA configuration found"? What
> exactly do you find inaccurate about the current message?
For example:
[0.00] NUMA: No distance-matrix property in distance-map
[0.00] No NUMA configuration found

So if of_numa_init or arm64_acpi_numa_init returned an error because a NUMA
configuration error was found, it's no good to also print "No NUMA ...".

> 
> Will
> 
> .
> 



Re: [PATCH v7 08/14] arm64: numa: Use pr_fmt()

2016-08-27 Thread Leizhen (ThunderTown)


On 2016/8/26 20:54, Will Deacon wrote:
> On Wed, Aug 24, 2016 at 03:44:47PM +0800, Zhen Lei wrote:
>> From: Kefeng Wang 
>>
>> Use pr_fmt to prefix kernel output, and remove duplicated msg
>> of NUMA turned off.
>>
>> Signed-off-by: Kefeng Wang 
>> ---
>>  arch/arm64/mm/numa.c | 40 
>>  1 file changed, 20 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index d97c6e2..7b73808 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -17,6 +17,8 @@
>>   * along with this program.  If not, see .
>>   */
>>
>> +#define pr_fmt(fmt) "numa: " fmt
> 
> Shouldn't this be uppercase for consistency with the existing code and
> the code in places like drivers/of/of_numa.c?
OK, I will change it to "NUMA: ".

> 
>>  #include 
>>  #include 
>>  #include 
>> @@ -38,10 +40,9 @@ static __init int numa_parse_early_param(char *opt)
>>  {
>>  if (!opt)
>>  return -EINVAL;
>> -if (!strncmp(opt, "off", 3)) {
>> -pr_info("%s\n", "NUMA turned off");
>> +if (!strncmp(opt, "off", 3))
>>  numa_off = true;
>> -}
>> +
>>  return 0;
>>  }
>>  early_param("numa", numa_parse_early_param);
>> @@ -110,7 +111,7 @@ static void __init setup_node_to_cpumask_map(void)
>>  set_cpu_numa_node(cpu, NUMA_NO_NODE);
>>
>>  /* cpumask_of_node() will now work */
>> -pr_debug("NUMA: Node to cpumask map for %d nodes\n", nr_node_ids);
>> +pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
>>  }
>>
>>  /*
>> @@ -145,13 +146,13 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
>>
>>  ret = memblock_set_node(start, (end - start), &memblock.memory, nid);
>>  if (ret < 0) {
>> -pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on node %d\n",
>> +pr_err("memblock [0x%llx - 0x%llx] failed to add on node %d\n",
>>  start, (end - 1), nid);
>>  return ret;
>>  }
>>
>>  node_set(nid, numa_nodes_parsed);
>> -pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
>> +pr_info("Adding memblock [0x%llx - 0x%llx] on node %d\n",
>>  start, (end - 1), nid);
>>  return ret;
>>  }
>> @@ -166,19 +167,18 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>>  void *nd;
>>  int tnid;
>>
>> -pr_info("NUMA: Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
>> -nid, start_pfn << PAGE_SHIFT,
>> -(end_pfn << PAGE_SHIFT) - 1);
>> +pr_info("Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
>> +nid, start_pfn << PAGE_SHIFT, (end_pfn << PAGE_SHIFT) - 1);
>>
>>  nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>>  nd = __va(nd_pa);
>>
>>  /* report and initialize */
>> -pr_info("NUMA: NODE_DATA [mem %#010Lx-%#010Lx]\n",
>> +pr_info("  NODE_DATA [mem %#010Lx-%#010Lx]\n",
> 
> Why are you adding leading whitespace?
Kefeng Wang did that just to make the final printed info look clearer.

I will remove the leading whitespace in v8.

> 
>>  nd_pa, nd_pa + nd_size - 1);
>>  tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>>  if (tnid != nid)
>> -pr_info("NUMA: NODE_DATA(%d) on node %d\n", nid, tnid);
>> +pr_info("NODE_DATA(%d) on node %d\n", nid, tnid);
> 
> 
> Same here.
> 
>>  node_data[nid] = nd;
>>  memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>> @@ -235,8 +235,7 @@ static int __init numa_alloc_distance(void)
>>  numa_distance[i * numa_distance_cnt + j] = i == j ?
>>  LOCAL_DISTANCE : REMOTE_DISTANCE;
>>
>> -pr_debug("NUMA: Initialized distance table, cnt=%d\n",
>> -numa_distance_cnt);
>> +pr_debug("Initialized distance table, cnt=%d\n", numa_distance_cnt);
>>
>>  return 0;
>>  }
>> @@ -257,20 +256,20 @@ static int __init numa_alloc_distance(void)
>>  void __init numa_set_distance(int from, int to, int distance)
>>  {
>>  if (!numa_distance) {
>> -pr_warn_once("NUMA: Warning: distance table not allocated yet\n");
>> +pr_warn_once("Warning: distance table not allocated yet\n");
>>  return;
>>  }
>>
>>  if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
>>  from < 0 || to < 0) {
>> -pr_warn_once("NUMA: Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
>> +pr_warn_once("Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
>>  from, to, distance);
>>  return;
>>  }
>>
>>  if ((u8)distance != distance ||
>>  (from == to && distance != LOCAL_DISTANCE)) {
>> -pr_warn_once("NUMA: Warning: invalid distance parameter, 
>> from=%d to=%d 

Re: [PATCH v7 09/14] arm64/numa: support HAVE_SETUP_PER_CPU_AREA

2016-08-27 Thread Leizhen (ThunderTown)


On 2016/8/26 21:28, Will Deacon wrote:
> On Wed, Aug 24, 2016 at 03:44:48PM +0800, Zhen Lei wrote:
>> Make each percpu area be allocated from its local numa node. Without this
>> patch, all percpu areas will be allocated from the node to which cpu0
>> belongs.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/Kconfig   |  8 
>>  arch/arm64/mm/numa.c | 55 
>> 
>>  2 files changed, 63 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index bc3f00f..2815af6 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -603,6 +603,14 @@ config USE_PERCPU_NUMA_NODE_ID
>>  def_bool y
>>  depends on NUMA
>>
>> +config HAVE_SETUP_PER_CPU_AREA
>> +def_bool y
>> +depends on NUMA
>> +
>> +config NEED_PER_CPU_EMBED_FIRST_CHUNK
>> +def_bool y
>> +depends on NUMA
> 
> Why do we need this? Is it purely about using block mappings for the
> pcpu area?
Without NEED_PER_CPU_EMBED_FIRST_CHUNK, a link error will be reported:

#if defined(CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK) || \
!defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
#define BUILD_EMBED_FIRST_CHUNK
#endif

#if defined(BUILD_EMBED_FIRST_CHUNK)
//pcpu_embed_first_chunk definition
#endif

setup_per_cpu_areas --> pcpu_embed_first_chunk


> 
>>  source kernel/Kconfig.preempt
>>  source kernel/Kconfig.hz
>>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index 7b73808..5e44ad1 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -26,6 +26,7 @@
>>  #include 
>>
>>  #include 
>> +#include 
>>
>>  struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>>  EXPORT_SYMBOL(node_data);
>> @@ -131,6 +132,60 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
>>  cpu_to_node_map[cpu] = nid;
>>  }
>>
>> +#ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
>> +unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
>> +EXPORT_SYMBOL(__per_cpu_offset);
>> +
>> +static int __init early_cpu_to_node(int cpu)
>> +{
>> +return cpu_to_node_map[cpu];
>> +}
>> +
>> +static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
>> +{
>> +if (early_cpu_to_node(from) == early_cpu_to_node(to))
>> +return LOCAL_DISTANCE;
>> +else
>> +return REMOTE_DISTANCE;
>> +}
> 
> Is it too early to use __node_distance here?
Good, we can directly use node_distance, thanks.

> 
>> +static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size,
>> +   size_t align)
>> +{
>> +int nid = early_cpu_to_node(cpu);
>> +
>> +return  memblock_virt_alloc_try_nid(size, align,
>> +__pa(MAX_DMA_ADDRESS), MEMBLOCK_ALLOC_ACCESSIBLE, nid);
>> +}
>> +
>> +static void __init pcpu_fc_free(void *ptr, size_t size)
>> +{
>> +memblock_free_early(__pa(ptr), size);
>> +}
>> +
>> +void __init setup_per_cpu_areas(void)
>> +{
>> +unsigned long delta;
>> +unsigned int cpu;
>> +int rc;
>> +
>> +/*
>> + * Always reserve area for module percpu variables.  That's
>> + * what the legacy allocator did.
>> + */
>> +rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
>> +PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
>> +pcpu_cpu_distance,
>> +pcpu_fc_alloc, pcpu_fc_free);
>> +if (rc < 0)
>> +panic("Failed to initialize percpu areas.");
>> +
>> +delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
>> +for_each_possible_cpu(cpu)
>> +__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
>> +}
>> +#endif
> 
> It's a pity that this is practically identical to PowerPC. Ideally, there
> would be definitions of this initialisation gunk in the core code that
> could be reused across architectures.
But these are different on the other architectures, except PPC.

I originally wanted to put it into drivers/of/of_numa.c, but now ACPI NUMA is
coming up, so I don't know where it should go.

> 
> Will
> 
> .
> 



Re: [PATCH v7 10/14] arm64/numa: define numa_distance as array to simplify code

2016-08-27 Thread Leizhen (ThunderTown)


On 2016/8/26 23:29, Will Deacon wrote:
> On Wed, Aug 24, 2016 at 03:44:49PM +0800, Zhen Lei wrote:
>> 1. MAX_NUMNODES is based on CONFIG_NODES_SHIFT, and the default value of
>>the latter is very small now.
>> 2. Suppose the default value of MAX_NUMNODES is enlarged to 64; the
>>size of numa_distance is then 4K, which is still acceptable when the
>>Image is run on other processors.
>> 3. It will make __node_distance quicker than before.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/include/asm/numa.h |  1 -
>>  arch/arm64/mm/numa.c  | 74 
>> +++
>>  2 files changed, 5 insertions(+), 70 deletions(-)
> 
> I fail to see the advantages of this patch. Do you have some compelling
> performance figures or something?

The numa_distance table can only live on one node, so CPUs on the other
nodes spend more time accessing it. I have not tested how much can be
improved yet.

I will try to get some data next week.

> 
> Will
> 
> .
> 



Re: [PATCH v7 14/14] Documentation: remove the constraint on the distances of node pairs

2016-08-27 Thread Leizhen (ThunderTown)


On 2016/8/26 23:35, Will Deacon wrote:
> On Wed, Aug 24, 2016 at 03:44:53PM +0800, Zhen Lei wrote:
>> Update documentation. This limit is unnecessary.
>>
>> Signed-off-by: Zhen Lei 
>> Acked-by: Rob Herring 
>> ---
>>  Documentation/devicetree/bindings/numa.txt | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/Documentation/devicetree/bindings/numa.txt 
>> b/Documentation/devicetree/bindings/numa.txt
>> index 21b3505..c0ea4a7 100644
>> --- a/Documentation/devicetree/bindings/numa.txt
>> +++ b/Documentation/devicetree/bindings/numa.txt
>> @@ -48,7 +48,6 @@ distance (memory latency) between all numa nodes.
>>
>>Note:
>>  1. Each entry represents distance from first node to second node.
>> -The distances are equal in either direction.
> 
> Hmm, so what happens now if firmware provides a description where both
> distances (in either direction) are supplied, but are different?
I am not aware of any hardware yet where the distances in the two directions
differ, but:
1. software has no need to require the distances in the two directions to be equal.
2. consider the following software scenario:
   1) cpu0 and cpu1 belong to the same hardware node.
   2) cpu0 is a master control CPU; many tasks and interrupts are delivered to cpu0
first, so cpu0 is often busier than cpu1.
   3) we split cpu0 and cpu1 into two logical nodes: cpu0 belongs to node0,
cpu1 belongs to node1. Now we make
   the distance from cpu0 to cpu1 larger than the distance from cpu1 to cpu0.

> 
> Will
> 
> .
> 



Re: [PATCH v7 11/14] arm64/numa: support HAVE_MEMORYLESS_NODES

2016-08-27 Thread Leizhen (ThunderTown)


On 2016/8/26 23:43, Will Deacon wrote:
> On Wed, Aug 24, 2016 at 03:44:50PM +0800, Zhen Lei wrote:
>> Some numa nodes may have no memory. For example:
>> 1. cpu0 on node0
>> 2. cpu1 on node1
>> 3. device0 access the momory from node0 and node1 take the same time.
>>
>> So, we can not simply classify device0 to node0 or node1, but we can
>> define a node2 which distances to node0 and node1 are the same.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/Kconfig  |  4 
>>  arch/arm64/kernel/smp.c |  1 +
>>  arch/arm64/mm/numa.c| 43 +--
>>  3 files changed, 46 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 2815af6..3a2b6ed 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -611,6 +611,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
>>  def_bool y
>>  depends on NUMA
>>
>> +config HAVE_MEMORYLESS_NODES
>> +def_bool y
>> +depends on NUMA
>> +
>>  source kernel/Kconfig.preempt
>>  source kernel/Kconfig.hz
>>
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index d93d433..4879085 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -619,6 +619,7 @@ static void __init of_parse_and_init_cpus(void)
>>  }
>>
>>  bootcpu_valid = true;
>> +early_map_cpu_to_node(0, of_node_to_nid(dn));
> 
> This seems unrelated?
I am about to get off work soon. Maybe I need to move it into patch 12.

> 
>>  /*
>>   * cpu_logical_map has already been
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index 6853db7..114180f 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -129,6 +129,14 @@ void __init early_map_cpu_to_node(unsigned int cpu, int 
>> nid)
>>  nid = 0;
>>
>>  cpu_to_node_map[cpu] = nid;
>> +
>> +/*
>> + * We should set the numa node of cpu0 as soon as possible, because it
>> + * has already been set up online before. cpu_to_node(0) will soon be
>> + * called.
>> + */
>> +if (!cpu)
>> +set_cpu_numa_node(cpu, nid);
> 
> Likewise.
> 
>>  }
>>
>>  #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
>> @@ -211,6 +219,35 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
>>  return ret;
>>  }
>>
>> +static u64 __init alloc_node_data_from_nearest_node(int nid, const size_t 
>> size)
>> +{
>> +int i, best_nid, distance;
>> +u64 pa;
>> +DECLARE_BITMAP(nodes_map, MAX_NUMNODES);
>> +
>> +bitmap_zero(nodes_map, MAX_NUMNODES);
>> +bitmap_set(nodes_map, nid, 1);
>> +
>> +find_nearest_node:
>> +best_nid = NUMA_NO_NODE;
>> +distance = INT_MAX;
>> +
>> +for_each_clear_bit(i, nodes_map, MAX_NUMNODES)
>> +if (numa_distance[nid][i] < distance) {
>> +best_nid = i;
>> +distance = numa_distance[nid][i];
>> +}
>> +
>> +pa = memblock_alloc_nid(size, SMP_CACHE_BYTES, best_nid);
>> +if (!pa) {
>> +BUG_ON(best_nid == NUMA_NO_NODE);
>> +bitmap_set(nodes_map, best_nid, 1);
>> +goto find_nearest_node;
>> +}
>> +
>> +return pa;
>> +}
>> +
>>  /**
>>   * Initialize NODE_DATA for a node on the local memory
>>   */
>> @@ -224,7 +261,9 @@ static void __init setup_node_data(int nid, u64 
>> start_pfn, u64 end_pfn)
>>  pr_info("Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
>>  nid, start_pfn << PAGE_SHIFT, (end_pfn << PAGE_SHIFT) - 1);
>>
>> -nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>> +nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>> +if (!nd_pa)
>> +nd_pa = alloc_node_data_from_nearest_node(nid, nd_size);
> 
> Why not add memblock_alloc_near_nid to the core code, and make it do
> what you need there?
I will think about it next week. But some arches like x86/IA64 have their own
implementations.

> 
> Will
> 
> .
> 



Re: [PATCH 1/1] arm64/hugetlb: clear PG_dcache_clean if the page is dirty when munmap

2016-08-23 Thread Leizhen (ThunderTown)


On 2016/8/24 1:28, Catalin Marinas wrote:
> On Mon, Aug 22, 2016 at 12:19:04PM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/7/20 17:19, Catalin Marinas wrote:
>>> On Wed, Jul 20, 2016 at 10:46:27AM +0800, Leizhen (ThunderTown) wrote:
>>>>>>>> On 2016/7/8 21:54, Catalin Marinas wrote:
>>>>>>>>> 8<
>>>>>>>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>>>>>>>>> index dbd12ea8ce68..c753fa804165 100644
>>>>>>>>> --- a/arch/arm64/mm/flush.c
>>>>>>>>> +++ b/arch/arm64/mm/flush.c
>>>>>>>>> @@ -75,7 +75,8 @@ void __sync_icache_dcache(pte_t pte, unsigned long 
>>>>>>>>> addr)
>>>>>>>>>   if (!page_mapping(page))
>>>>>>>>>   return;
>>>>>>>>>  
>>>>>>>>> - if (!test_and_set_bit(PG_dcache_clean, >flags))
>>>>>>>>> + if (!test_and_set_bit(PG_dcache_clean, >flags) ||
>>>>>>>>> + PageDirty(page))
>>>>>>>>>   sync_icache_aliases(page_address(page),
>>>>>>>>>   PAGE_SIZE << compound_order(page));
>>>>>>>>>   else if (icache_is_aivivt())
>>>>>>>>> 8<-
>>>>
>>>> Do you plan to send this patch? My colleagues told me that if our
>>>> patches are quite different, it should be Signed-off-by you.
>>>
>>> The reason I'm not sending it is that I don't fully understand how it
>>> solves the problem for a shared file mmap(), not just hugetlbfs. As I
>>> said in an earlier email: after an msync() in user space we
>>> should flush the pages to disk via write_cache_pages(). This function
>> Hi Catalin:
>>I'm so sorry for my fault. The previous small pages test result I 
>> actually ran on ramfs.
>> Today, I ran the case on harddisk fs, it worked well without this patch.
>>
>> Summarized as follows:
>> small pages on ramfs: need this patch
>> small pages on harddisk fs: no need this patch
>> hugetlbfs: need this patch
> 
> I would add:
> 
> small pages over nfs: fails with or without this patch
> 
> (tested on Juno, Cortex-A57; seems to be fixed if I remove the
> PG_dcache_clean test altogether but, well, we end up over-flushing)
> 
> I assume that when using a hard drive, it goes through the block I/O
> layer and we may have a flush_dcache_page() called when the kernel is
> about to read a page that has been mapped in user space. This would
> clear the PG_dcache_clean bit and subsequent __sync_icache_dcache()
> would perform cache maintenance.
> 
> Could you try on your system the test case without the msync() call? I'm
> not sure whether munmap() would trigger an immediate write-back, in
> which case we may see the issue even with the filesystem on a hard
> drive.
OK, no problem. I will do it today or tomorrow.
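As a note on reproducing it, the shape of the sequence being tested can be
sketched in userspace C as below. This is only an illustrative sketch, not the
actual test program (the real case writes instructions and executes them to
observe I-cache/D-cache coherency); the temp-file path, size, and data pattern
are placeholders, and the msync() call is the one under discussion:

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch of the mmap/msync/munmap sequence under discussion.
 * Returns 0 when the data written through the shared mapping is
 * visible via read() afterwards, negative on any failure.
 */
static int mmap_roundtrip(int use_msync)
{
	char tmpl[] = "/tmp/flushtest-XXXXXX";	/* placeholder path */
	int fd = mkstemp(tmpl);

	if (fd < 0)
		return -1;
	unlink(tmpl);
	if (ftruncate(fd, 4096) < 0)
		return -1;

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xa5, 4096);			/* dirty the page via the mapping */
	if (use_msync)
		msync(p, 4096, MS_SYNC);	/* the call Catalin suggests dropping */
	munmap(p, 4096);			/* does munmap() alone trigger write-back? */

	/* read() goes through the page cache, so the data must be visible */
	char c = 0;

	if (pread(fd, &c, 1, 0) != 1)
		return -1;
	close(fd);
	return c == (char)0xa5 ? 0 : -2;
}
```

Running this with and without use_msync, on ramfs versus a disk-backed
filesystem, is roughly what the table of results above distinguishes.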

> 



Re: [PATCH v7 11/14] arm64/numa: support HAVE_MEMORYLESS_NODES

2016-08-28 Thread Leizhen (ThunderTown)


On 2016/8/27 19:05, Leizhen (ThunderTown) wrote:
> 
> 
> On 2016/8/26 23:43, Will Deacon wrote:
>> On Wed, Aug 24, 2016 at 03:44:50PM +0800, Zhen Lei wrote:
>>> Some numa nodes may have no memory. For example:
>>> 1. cpu0 on node0
>>> 2. cpu1 on node1
>>> 3. device0 access the momory from node0 and node1 take the same time.
>>>
>>> So, we can not simply classify device0 to node0 or node1, but we can
>>> define a node2 which distances to node0 and node1 are the same.
>>>
>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>> ---
>>>  arch/arm64/Kconfig  |  4 
>>>  arch/arm64/kernel/smp.c |  1 +
>>>  arch/arm64/mm/numa.c| 43 +--
>>>  3 files changed, 46 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 2815af6..3a2b6ed 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -611,6 +611,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
>>> def_bool y
>>> depends on NUMA
>>>
>>> +config HAVE_MEMORYLESS_NODES
>>> +   def_bool y
>>> +   depends on NUMA
>>> +
>>>  source kernel/Kconfig.preempt
>>>  source kernel/Kconfig.hz
>>>
>>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>>> index d93d433..4879085 100644
>>> --- a/arch/arm64/kernel/smp.c
>>> +++ b/arch/arm64/kernel/smp.c
>>> @@ -619,6 +619,7 @@ static void __init of_parse_and_init_cpus(void)
>>> }
>>>
>>> bootcpu_valid = true;
>>> +   early_map_cpu_to_node(0, of_node_to_nid(dn));
>>
>> This seems unrelated?
> I will get off my work soon. Maybe I need put it into patch 12.
> 
>>
>>> /*
>>>  * cpu_logical_map has already been
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> index 6853db7..114180f 100644
>>> --- a/arch/arm64/mm/numa.c
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -129,6 +129,14 @@ void __init early_map_cpu_to_node(unsigned int cpu, 
>>> int nid)
>>> nid = 0;
>>>
>>> cpu_to_node_map[cpu] = nid;
>>> +
>>> +   /*
>>> +* We should set the numa node of cpu0 as soon as possible, because it
>>> +* has already been set up online before. cpu_to_node(0) will soon be
>>> +* called.
>>> +*/
>>> +   if (!cpu)
>>> +   set_cpu_numa_node(cpu, nid);
>>
>> Likewise.
>>
>>>  }
>>>
>>>  #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
>>> @@ -211,6 +219,35 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
>>> return ret;
>>>  }
>>>
>>> +static u64 __init alloc_node_data_from_nearest_node(int nid, const size_t 
>>> size)
>>> +{
>>> +   int i, best_nid, distance;
>>> +   u64 pa;
>>> +   DECLARE_BITMAP(nodes_map, MAX_NUMNODES);
>>> +
>>> +   bitmap_zero(nodes_map, MAX_NUMNODES);
>>> +   bitmap_set(nodes_map, nid, 1);
>>> +
>>> +find_nearest_node:
>>> +   best_nid = NUMA_NO_NODE;
>>> +   distance = INT_MAX;
>>> +
>>> +   for_each_clear_bit(i, nodes_map, MAX_NUMNODES)
>>> +   if (numa_distance[nid][i] < distance) {
>>> +   best_nid = i;
>>> +   distance = numa_distance[nid][i];
>>> +   }
>>> +
>>> +   pa = memblock_alloc_nid(size, SMP_CACHE_BYTES, best_nid);
>>> +   if (!pa) {
>>> +   BUG_ON(best_nid == NUMA_NO_NODE);
>>> +   bitmap_set(nodes_map, best_nid, 1);
>>> +   goto find_nearest_node;
>>> +   }
>>> +
>>> +   return pa;
>>> +}
>>> +
>>>  /**
>>>   * Initialize NODE_DATA for a node on the local memory
>>>   */
>>> @@ -224,7 +261,9 @@ static void __init setup_node_data(int nid, u64 
>>> start_pfn, u64 end_pfn)
>>> pr_info("Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
>>> nid, start_pfn << PAGE_SHIFT, (end_pfn << PAGE_SHIFT) - 1);
>>>
>>> -   nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>>> +   nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>>> +   if (!nd_pa)
>>> +   nd_pa = alloc_node_data_from_nearest_node(nid, nd_size);
>>
>> Why not add memblock_alloc_near_nid to the core code, and make it do
>> what you need there?
> I'm thinking about it next week. But some ARCHs like X86/IA64 have their own 
> implementation.

Do you mean to directly call only alloc_node_data_from_nearest_node? OK,
that's fine. Thanks.

> 
>>
>> Will
>>
>> .
>>



Re: [PATCH v7 12/14] arm64/numa: remove the limitation that cpu0 must bind to node0

2016-08-29 Thread Leizhen (ThunderTown)


On 2016/8/26 23:49, Will Deacon wrote:
> On Wed, Aug 24, 2016 at 03:44:51PM +0800, Zhen Lei wrote:
>> 1. Currently only cpu0 set on cpu_possible_mask and percpu areas have not
>>been initialized.
This description refers to the code below:
-   for_each_possible_cpu(cpu)
-   set_cpu_numa_node(cpu, NUMA_NO_NODE);

1. When the above code is executed, only cpu0's bit is set in
cpu_possible_mask,
   so only set_cpu_numa_node(0, NUMA_NO_NODE) will actually be executed.
2. set_cpu_numa_node accesses the percpu variable numa_node, but
setup_per_cpu_areas is
   called after this point. Without the first restriction, it would crash the
kernel.

I changed the title of this patch in v7; the original was "remove some useless
code".
I think I should separate this into a new patch.



>> 2. No reason to limit cpu0 must belongs to node0.
> 
> Whilst I suspect you're using enumerated lists in order to try to make
> things clearer, I'm having a really hard time understanding the commit
> messages you have in this series. It's actually much better if you
> structure them as concise paragraphs explaining:
> 
>   - What is the problem that you're fixing?
> 
>   - How does that problem manifest?
> 
>   - How does the patch fix it?
> 
> As far as I can see, this patch just removes a bunch of code with no
> explanation as to why it's not required or any problems caused by
> keeping it around.
> 
> Will
> 
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/mm/numa.c | 12 ++--
>>  1 file changed, 2 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index 114180f..07a1978 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -94,7 +94,6 @@ void numa_clear_node(unsigned int cpu)
>>   */
>>  static void __init setup_node_to_cpumask_map(void)
>>  {
>> -unsigned int cpu;
>>  int node;
>>
>>  /* setup nr_node_ids if not done yet */
>> @@ -107,9 +106,6 @@ static void __init setup_node_to_cpumask_map(void)
>>  cpumask_clear(node_to_cpumask_map[node]);
>>  }
>>
>> -for_each_possible_cpu(cpu)
>> -set_cpu_numa_node(cpu, NUMA_NO_NODE);
>> -
>>  /* cpumask_of_node() will now work */
>>  pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
>>  }
>> @@ -119,13 +115,13 @@ static void __init setup_node_to_cpumask_map(void)
>>   */
>>  void numa_store_cpu_info(unsigned int cpu)
>>  {
>> -map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
>> +map_cpu_to_node(cpu, cpu_to_node_map[cpu]);
>>  }
>>
>>  void __init early_map_cpu_to_node(unsigned int cpu, int nid)
>>  {
>>  /* fallback to node 0 */
>> -if (nid < 0 || nid >= MAX_NUMNODES)
>> +if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
>>  nid = 0;
After the code below has been removed, we need to make this corresponding
adjustment;
otherwise the kernel will crash if "numa=off" is set in bootargs.

>>
>>  cpu_to_node_map[cpu] = nid;
>> @@ -375,10 +371,6 @@ static int __init numa_init(int (*init_func)(void))
>>
>>  setup_node_to_cpumask_map();
>>
>> -/* init boot processor */
>> -cpu_to_node_map[0] = 0;
>> -map_cpu_to_node(0, 0);
This code forced cpu0 to belong to node0, but our current implementation
doesn't
have this limitation.

>> -
>>  return 0;
>>  }
>>
>> --
>> 2.5.0
>>
>>
> 
> .
> 



Re: [PATCH] mm, numa: boot cpu should bound to the node0 when node_off enable

2016-08-23 Thread Leizhen (ThunderTown)


On 2016/8/23 19:30, Will Deacon wrote:
> On Tue, Aug 23, 2016 at 07:19:01PM +0800, Leizhen (ThunderTown) wrote:
>> He applied my patches, which I mentioned these days.
> 
> [...]
> 
>> I will update my patch series and resend it again.
> 
> To be clear, you plan to send an updated version of:
Yes, but it is just Zhongjiang's patch merged into mine; only one or two lines changed.

> 
>   [PATCH v6 00/14] fix some type infos and bugs for arm64/of numa
> 
> so I can ignore v6 of that?
If you have not merged v6 into your branch, I think you can wait for my v7. I will
send v7 tomorrow.

> 
> Will
> 
> .
> 



Re: [PATCH] mm, numa: boot cpu should bound to the node0 when node_off enable

2016-08-23 Thread Leizhen (ThunderTown)

On 2016/8/22 22:28, Catalin Marinas wrote:
> On Sat, Aug 20, 2016 at 05:38:59PM +0800, zhong jiang wrote:
>> On 2016/8/19 12:11, Ganapatrao Kulkarni wrote:
>>> On Fri, Aug 19, 2016 at 9:30 AM, Ganapatrao Kulkarni
>>>  wrote:
 On Fri, Aug 19, 2016 at 7:28 AM, zhong jiang  wrote:
> On 2016/8/19 1:45, Ganapatrao Kulkarni wrote:
>> On Thu, Aug 18, 2016 at 9:34 PM, Catalin Marinas
>>  wrote:
>>> On Thu, Aug 18, 2016 at 09:09:26PM +0800, zhongjiang wrote:
 At present, boot cpu will bound to a node from device tree when 
 node_off enable.
 if the node is not initialization, it will lead to a following problem.
> [...]
 --- a/arch/arm64/mm/numa.c
 +++ b/arch/arm64/mm/numa.c
 @@ -119,7 +119,7 @@ void numa_store_cpu_info(unsigned int cpu)
  void __init early_map_cpu_to_node(unsigned int cpu, int nid)
  {
   /* fallback to node 0 */
 - if (nid < 0 || nid >= MAX_NUMNODES)
 + if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
>>
>> i  did not understood how this line change fixes the issue that you
>> have mentioned (i too not understood fully the issue description)
>> this array used while mapping node id when secondary cores comes up
>> when numa_off is set the cpu_to_node_map[cpu] is not used and set to
>> node0 always( refer function numa_store_cpu_info)..
>> please provide more details to understand the issue you are facing.
>> /*
>>  *  Set the cpu to node and mem mapping
>>  */
>> void numa_store_cpu_info(unsigned int cpu)
>> {
>> map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
>> }
>
> The issue comes up when we test the kdump. it will leads to kernel crash.
> when I debug the issue, I find boot cpu actually bound to the node1. while
> node1 is not real existence when numa_off enable.

 boot cpu is default mapped to node0
 are you running with any other patches?
He applied the patches of mine that I mentioned these days.

I chatted with ZhongJiang; this problem only exists with my patches applied, no
matter
whether kdump is used or not. Mainline doesn't have this problem.

The details of this problem are as follows (suppose numa_off is true), in code
execution sequence:

1. setup_arch-->bootmem_init-->arm64_numa_init
When numa_off is true, all memory blocks are added to node 0.

2. setup_arch-->of_smp_init_cpus
I added early_map_cpu_to_node for the boot cpu, so the nid of cpu0 changes
to the value read from the dt node.
With ZhongJiang's patch, the nid of cpu0 is corrected to zero when numa_off
is true.

3. build_all_zonelists
Because NUMA is off, only the control block of node 0 has been
initialized, so cpu0 with a non-zero nid crashes the kernel.

4. kernel_init_freeable-->smp_prepare_cpus-->smp_store_cpu_info
Sets the nid of cpu0 to zero, but it's too late.

5. secondary_start_kernel-->smp_store_cpu_info
Sets the nid of the other cpus to zero.

I will update my patch series and resend it again.

Best regards,
 Town·Thunder
 (my Chinese name, Zhen Lei, translated directly into English)

>>>
>>> if you added any patch to change this code
>>>   /* init boot processor */
>>> cpu_to_node_map[0] = 0;
>>> map_cpu_to_node(0, 0);
>>>
>>> then adding code to take-care numa_off here might solve your issue.
>>
>>  but in of_smp_init_cpus, boot cpu will call early_map_cpu_to_node[] to get
>>  the relation node. and the node is from devicetree.
>>
>>  you points to the code will be covered with another node. therefore, it is
>>  possible that cpu_to_node[cpu] will leads to the incorrect results. 
>> therefore,
>>  The crash will come up.
> 
> I think I get Ganapat's point. The cpu_to_node_map[0] may be incorrectly
> set by early_map_cpu_to_node() when called from smp_init_cpus() ->
> of_parse_and_init_cpus(). However, the cpu_to_node_map[] array is *only*
> read by numa_store_cpu_info(). This latter function calls
> map_cpu_to_node() and, if numa_off, will only ever pass 0 as the nid.
> 
> Given that the cpu_to_node_map[] array is static, I don't see how any
> non-zero value could leak outside the arch/arm64/mm/numa.c file.
> 
> So please give more details of any additional patches you have on top of
> mainline or whether you reproduced this issue with the vanilla kernel
> (since you mentioned kdump, that's not in mainline yet).
> 



Re: [PATCH 2/2] arm64/numa: support HAVE_MEMORYLESS_NODES

2016-10-26 Thread Leizhen (ThunderTown)


On 2016/10/27 2:36, Will Deacon wrote:
> On Tue, Oct 25, 2016 at 10:59:18AM +0800, Zhen Lei wrote:
>> Some numa nodes may have no memory. For example:
>> 1) a node has no memory bank plugged.
>> 2) a node has no memory bank slots.
>>
>> To ensure percpu variable areas and numa control blocks of the
>> memoryless numa nodes to be allocated from the nearest available node to
>> improve performance, defined node_distance_ready. And make its value to be
>> true immediately after node distances have been initialized.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  arch/arm64/Kconfig| 4 
>>  arch/arm64/include/asm/numa.h | 3 +++
>>  arch/arm64/mm/numa.c  | 6 +-
>>  3 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 30398db..648dd13 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -609,6 +609,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
>>  def_bool y
>>  depends on NUMA
>>
>> +config HAVE_MEMORYLESS_NODES
>> +def_bool y
>> +depends on NUMA
> 
> Given that patch 1 and the associated node_distance_ready stuff is all
> an unqualified performance optimisation, is there any merit in just
> enabling HAVE_MEMORYLESS_NODES in Kconfig and then optimising things as
> a separate series when you have numbers to back it up?
HAVE_MEMORYLESS_NODES is also a performance optimisation for the memoryless
scenario.
For example:
node0 is a memoryless node, and node1 is the nearest node to node0.
When we want to allocate memory for node0, the memory manager would normally try
node0
first, then node1.
But we already know that node0 has no memory, so we can tell the memory
manager to try
node1 directly. So HAVE_MEMORYLESS_NODES is used to skip the memoryless nodes
rather than trying them.

So I think the title of this patch is misleading; I will rewrite it in v2.

Or do you mean I should separate it into a new patch?


> 
> Will
> 
> .
> 



Re: [PATCH 1/2] mm/memblock: prepare a capability to support memblock near alloc

2016-10-26 Thread Leizhen (ThunderTown)


On 2016/10/26 17:31, Michal Hocko wrote:
> On Wed 26-10-16 11:10:44, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2016/10/25 21:23, Michal Hocko wrote:
>>> On Tue 25-10-16 10:59:17, Zhen Lei wrote:
>>>> If HAVE_MEMORYLESS_NODES is selected, and some memoryless numa nodes are
>>>> actually exist. The percpu variable areas and numa control blocks of that
>>>> memoryless numa nodes need to be allocated from the nearest available
>>>> node to improve performance.
>>>>
>>>> Although memblock_alloc_try_nid and memblock_virt_alloc_try_nid try the
>>>> specified nid at the first time, but if that allocation failed it will
>>>> directly drop to use NUMA_NO_NODE. This mean any nodes maybe possible at
>>>> the second time.
>>>>
>>>> To compatible the above old scene, I use a marco node_distance_ready to
>>>> control it. By default, the marco node_distance_ready is not defined in
>>>> any platforms, the above mentioned functions will work as normal as
>>>> before. Otherwise, they will try the nearest node first.
>>>
>>> I am sorry but it is absolutely unclear to me _what_ is the motivation
>>> of the patch. Is this a performance optimization, correctness issue or
>>> something else? Could you please restate what is the problem, why do you
>>> think it has to be fixed at memblock layer and describe what the actual
>>> fix is please?
>>
>> This is a performance optimization.
> 
> Do you have any numbers to back the improvements?
I have not collected any performance data, but at least in theory it's
beneficial and harmless,
apart from making the code look a bit ugly. All the related functions are
actually defined as __init,
for example:
phys_addr_t __init memblock_alloc_try_nid(
void * __init memblock_virt_alloc_try_nid(

And all the related memory (percpu variables and NODE_DATA) is mostly referenced at
run time.

> 
>> The problem is if some memoryless numa nodes are
>> actually exist, for example: there are total 4 nodes, 0,1,2,3, node 1 has no 
>> memory,
>> and the node distances is as below:
>> -board---
>>  |   |
>> |   |
>>  socket0 socket1
>>/ \ / \
>>   /   \   /   \
>>node0 node1 node2 node3
>> distance[1][0] is nearer than distance[1][2] and distance[1][3]. CPUs on 
>> node1 access
>> the memory of node0 is faster than node2 or node3.
>>
>> Linux defines a lot of percpu variables, each cpu has a copy of it and most 
>> of the time
>> only to access their own percpu area. In this example, we hope the percpu 
>> area of CPUs
>> on node1 allocated from node0. But without these patches, it's not sure that.
> 
> I am not familiar with the percpu allocator much so I might be
> completely missig a point but why cannot this be solved in the percpu
> allocator directly e.g. by using cpu_to_mem which should already be
> memoryless aware.
My test results told me that it cannot:
[0.00] Initmem setup node 0 [mem 0x-0x0011]
[0.00] Could not find start_pfn for node 1
[0.00] Initmem setup node 1 [mem 0x-0x]
[0.00] Initmem setup node 2 [mem 0x0012-0x0013]
[0.00] Initmem setup node 3 [mem 0x0014-0x0017]


[   14.801895] NODE_DATA(0) = 0x11e500
[   14.805749] NODE_DATA(1) = 0x11ca00  //(1), see below
[   14.809602] NODE_DATA(2) = 0x13e500
[   14.813455] NODE_DATA(3) = 0x17fffe5480
[   14.817316] cpu 0 on node0: 11fff87638
[   14.821083] cpu 1 on node0: 11fff9c638
[   14.824850] cpu 2 on node0: 11fffb1638
[   14.828616] cpu 3 on node0: 11fffc6638
[   14.832383] cpu 4 on node1: 17fff8a638   //(2), see below
[   14.836149] cpu 5 on node1: 17fff9f638
[   14.839912] cpu 6 on node1: 17fffb4638
[   14.843677] cpu 7 on node1: 17fffc9638
[   14.847444] cpu 8 on node2: 13fffa4638
[   14.851210] cpu 9 on node2: 13fffb9638
[   14.854976] cpu10 on node2: 13fffce638
[   14.858742] cpu11 on node2: 13fffe3638
[   14.862510] cpu12 on node3: 17fff36638
[   14.866276] cpu13 on node3: 17fff4b638
[   14.870042] cpu14 on node3: 17fff60638
[   14.873809] cpu15 on node3: 17fff75638

(1) with memblock_alloc_try_nid and these patches, memory was allocated from
node0
(2) with the same implementation as x86 and PowerPC, memory was allocated from
node3:
return  __alloc_bootmem_node(NODE_DATA(nid), size, align, 
__pa(MAX_DMA_ADDRESS));

I'm not sure how about on

Re: [PATCH 1/2] mm/memblock: prepare a capability to support memblock near alloc

2016-10-25 Thread Leizhen (ThunderTown)


On 2016/10/25 21:23, Michal Hocko wrote:
> On Tue 25-10-16 10:59:17, Zhen Lei wrote:
>> If HAVE_MEMORYLESS_NODES is selected, and some memoryless numa nodes are
>> actually exist. The percpu variable areas and numa control blocks of that
>> memoryless numa nodes need to be allocated from the nearest available
>> node to improve performance.
>>
>> Although memblock_alloc_try_nid and memblock_virt_alloc_try_nid try the
>> specified nid at the first time, but if that allocation failed it will
>> directly drop to use NUMA_NO_NODE. This mean any nodes maybe possible at
>> the second time.
>>
>> To compatible the above old scene, I use a marco node_distance_ready to
>> control it. By default, the marco node_distance_ready is not defined in
>> any platforms, the above mentioned functions will work as normal as
>> before. Otherwise, they will try the nearest node first.
> 
> I am sorry but it is absolutely unclear to me _what_ is the motivation
> of the patch. Is this a performance optimization, correctness issue or
> something else? Could you please restate what is the problem, why do you
> think it has to be fixed at memblock layer and describe what the actual
> fix is please?
This is a performance optimization. The problem arises if some memoryless NUMA
nodes
actually exist. For example: there are 4 nodes in total, 0, 1, 2, 3; node 1 has no
memory,
and the node distances are as below:
-board---
|   |
|   |
 socket0 socket1
   / \ / \
  /   \   /   \
   node0 node1 node2 node3
distance[1][0] is smaller than distance[1][2] and distance[1][3], so CPUs on node1
access
the memory of node0 faster than that of node2 or node3.

Linux defines a lot of percpu variables; each cpu has its own copy and most of
the time
accesses only its own percpu area. In this example, we want the percpu areas
of the CPUs
on node1 to be allocated from node0. But without these patches, that is not guaranteed.

If every node has its own memory, we can directly use the functions below to
allocate memory
from the local node:
1. memblock_alloc_nid
2. memblock_alloc_try_nid
3. memblock_virt_alloc_try_nid_nopanic
4. memblock_virt_alloc_try_nid

So these patches are only needed for the NUMA memoryless scenario.
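The nearest-node fallback these patches implement can be illustrated with a
plain C sketch. The distance values and the memory map below are invented for
the 4-node example above; the real kernel side iterates memblock regions
rather than a static array:

```c
#include <limits.h>

#define MAX_NODES 4
#define NO_NODE   (-1)

/*
 * Illustrative distance matrix for the 4-node example; node1 is
 * memoryless. The values (10/20/30) are made up, not from hardware.
 */
static const int node_distance[MAX_NODES][MAX_NODES] = {
	{ 10, 20, 30, 30 },
	{ 20, 10, 30, 30 },
	{ 30, 30, 10, 20 },
	{ 30, 30, 20, 10 },
};

static const int node_has_memory[MAX_NODES] = { 1, 0, 1, 1 };

/*
 * Pick the nearest node that can actually satisfy an allocation for
 * @nid, mirroring the "try local node, then nearest" policy.
 */
static int nearest_node_with_memory(int nid)
{
	int best = NO_NODE, best_dist = INT_MAX;

	for (int i = 0; i < MAX_NODES; i++) {
		if (!node_has_memory[i])
			continue;
		if (node_distance[nid][i] < best_dist) {
			best = i;
			best_dist = node_distance[nid][i];
		}
	}
	return best;
}
```

With this table, a percpu allocation for a CPU on node1 would be steered to
node0 directly instead of falling back to an arbitrary node.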

Another use case is the control block "extern pg_data_t *node_data[]".
Here is an example from x86 NUMA in arch/x86/mm/numa.c:
static void __init alloc_node_data(int nid)
{
... ...
/*
 * Allocate node data.  Try node-local memory and then any node.
//==>But the nearest node is the best
 * Never allocate in DMA zone.
 */
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
nd_pa = __memblock_alloc_base(nd_size, SMP_CACHE_BYTES,
  MEMBLOCK_ALLOC_ACCESSIBLE);
if (!nd_pa) {
pr_err("Cannot find %zu bytes in node %d\n",
   nd_size, nid);
return;
}
}
nd = __va(nd_pa);
... ...
node_data[nid] = nd;

> 
>>From a quick glance you are trying to bend over the memblock API for
> something that should be handled on a different layer.
> 
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  mm/memblock.c | 76 
>> ++-
>>  1 file changed, 65 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 7608bc3..556bbd2 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1213,9 +1213,71 @@ phys_addr_t __init memblock_alloc(phys_addr_t size, 
>> phys_addr_t align)
>>  return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
>>  }
>>
>> +#ifndef node_distance_ready
>> +#define node_distance_ready()   0
>> +#endif
>> +
>> +static phys_addr_t __init memblock_alloc_near_nid(phys_addr_t size,
>> +phys_addr_t align, phys_addr_t start,
>> +phys_addr_t end, int nid, ulong flags,
>> +int alloc_func_type)
>> +{
>> +int nnid, round = 0;
>> +u64 pa;
>> +DECLARE_BITMAP(nodes_map, MAX_NUMNODES);
>> +
>> +bitmap_zero(nodes_map, MAX_NUMNODES);
>> +
>> +again:
>> +/*
>> + * There are total 4 cases:
>> + * 
>> + *   1)2) node_distance_ready || !node_distance_ready
>> + *  Round 1, nnid = nid = NUMA_NO_NODE;
>> + * 
>> + *   3) !node_distance_ready
>> + *  Round 1, nnid = nid;
>> + *::Round 2, currently only applicable for alloc_func_type = <0>
>> + *  Round 2, nnid = NUMA_NO_NODE;
>> + *   4) node_distance_ready
>> + *  Round 1, LOCAL_DISTANCE, nnid = nid;
>> + *  Round ?, nnid = nearest nid;

Re: [PATCH 1/2] mm/memblock: prepare a capability to support memblock near alloc

2016-10-27 Thread Leizhen (ThunderTown)


On 2016/10/27 15:22, Michal Hocko wrote:
> On Thu 27-10-16 10:41:24, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2016/10/26 17:31, Michal Hocko wrote:
>>> On Wed 26-10-16 11:10:44, Leizhen (ThunderTown) wrote:
>>>>
>>>>
>>>> On 2016/10/25 21:23, Michal Hocko wrote:
>>>>> On Tue 25-10-16 10:59:17, Zhen Lei wrote:
>>>>>> If HAVE_MEMORYLESS_NODES is selected, and some memoryless numa nodes are
>>>>>> actually exist. The percpu variable areas and numa control blocks of that
>>>>>> memoryless numa nodes need to be allocated from the nearest available
>>>>>> node to improve performance.
>>>>>>
>>>>>> Although memblock_alloc_try_nid and memblock_virt_alloc_try_nid try the
>>>>>> specified nid at the first time, but if that allocation failed it will
>>>>>> directly drop to use NUMA_NO_NODE. This mean any nodes maybe possible at
>>>>>> the second time.
>>>>>>
>>>>>> To compatible the above old scene, I use a marco node_distance_ready to
>>>>>> control it. By default, the marco node_distance_ready is not defined in
>>>>>> any platforms, the above mentioned functions will work as normal as
>>>>>> before. Otherwise, they will try the nearest node first.
>>>>>
>>>>> I am sorry but it is absolutely unclear to me _what_ is the motivation
>>>>> of the patch. Is this a performance optimization, correctness issue or
>>>>> something else? Could you please restate what is the problem, why do you
>>>>> think it has to be fixed at memblock layer and describe what the actual
>>>>> fix is please?
>>>>
>>>> This is a performance optimization.
>>>
>>> Do you have any numbers to back the improvements?
>>
>> I have not collected any performance data, but at least in theory,
>> it's beneficial and harmless, except make code looks a bit
>> urly.
> 
> The whole memoryless area is cluttered with hacks because everybody just
> adds pieces here and there to make his particular usecase work IMHO.
> Adding more on top for performance reasons which are even not measured
OK, I will ask my colleagues for help, to see whether some applications can be used for benchmarking.

> to prove a clear win is a no go. Please step back try to think how this
> could be done with an existing infrastructure we have (some cleanups
OK, I will try to do it. But some of the infrastructure may only be evaluated through
theoretical analysis; I don't have the related test environment, so there is
no way to verify it.


> while doing that would be hugely appreciated) and if that is not
> possible then explain why and why it is not feasible to fix that before
I think it will be feasible.

> you start adding a new API.
> 
> Thanks!
> 



Re: [PATCH 1/2] of, numa: Add function to disable of_node_to_nid().

2016-10-27 Thread Leizhen (ThunderTown)


On 2016/10/27 1:00, David Daney wrote:
> On 10/26/2016 06:43 AM, Robert Richter wrote:
>> On 25.10.16 14:31:00, David Daney wrote:
>>> From: David Daney 
>>>
>>> On arm64 NUMA kernels we can pass "numa=off" on the command line to
>>> disable NUMA.  A side effect of this is that kmalloc_node() calls to
>>> non-zero nodes will crash the system with an OOPS:
>>>
>>> [0.00] [] __alloc_pages_nodemask+0xa4/0xe68
>>> [0.00] [] new_slab+0xd0/0x57c
>>> [0.00] [] ___slab_alloc+0x2e4/0x514
>>> [0.00] [] __slab_alloc+0x48/0x58
>>> [0.00] [] __kmalloc_node+0xd0/0x2e0
>>> [0.00] [] __irq_domain_add+0x7c/0x164
>>> [0.00] [] its_probe+0x784/0x81c
>>> [0.00] [] its_init+0x48/0x1b0
>>> .
>>> .
>>> .
>>>
>>> This is caused by code like this in kernel/irq/irqdomain.c
>>>
>>>  domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
>>>GFP_KERNEL, of_node_to_nid(of_node));
>>>
>>> When NUMA is disabled, the concept of a node is really undefined, so
>>> of_node_to_nid() should unconditionally return NUMA_NO_NODE.
>>>
>>> Add __of_force_no_numa() to allow of_node_to_nid() to be forced to
>>> return NUMA_NO_NODE.
>>>
>>> The follow on patch will call this new function from the arm64 numa
>>> code.
>>
>> Didn't that work before?
> 
> I am fairly certain that it used to work.
> 
>> numa=off just maps all mem to node 0.
> 
> Yes, that is the current behavior.
It only deals with the CPU nodes, but I think you have now added "numa-node-id" 
to the peripheral devices (maybe the ITS).

> 
>> If mem
>> allocation is requested for another node it should just fall back to a
>> node with mem (node 0 then).
> 
> This is the root of the problem.  The ITS code is allocating memory. It calls 
> of_node_to_nid() to determine which node it resides on.  The answer in the 
> failing case is node-1.  Since we have mapped all the memory to node-0 the  
> __kmalloc_node(..., 1) call fails with the OOPS shown.
> 
> It could be that __kmalloc_node() used to allocate memory on a node other 
> than the requested node if the request couldn't be met.  But in v4.8 and 
> later it produces that OOPS.
> 
> If you pass a node containing free memory or NUMA_NO_NODE to 
> __kmalloc_node(), the allocation succeeds.
> 
> When we first did these patches, I advocated removing the numa=off feature, 
> and requiring people to install usable firmware on their systems.  That was 
> rejected on the grounds that not everybody has the ability to change their 
> firmware and we would like to allow NUMA kernels to run on systems with 
> defective firmware by supplying this command line parameter.  Now that I have 
> seen requests from the wild for this, I think it is a good idea to allow 
> numa=off to be used to work around this bad firmware.
> 
> The change in this patch set is fairly small, and seems to get the job done.  
> An alternative would be to change __kmalloc_node() to ignore the node 
> parameter if the request cannot be made, but I assume that there were good 
> reasons to have the current behavior, so that would be a much more 
> complicated change to make.
> 
> 
> 
>> I suspect there is something wrong with
>> the page initialization, see:
>>
>>   http://www.spinics.net/lists/arm-kernel/msg535191.html
>>   https://bugzilla.redhat.com/show_bug.cgi?id=1387793
>>
>> What is the complete oops?
>>
>> So I think k*alloc_node() must be able to handle requests to
>> non-existing nodes. Otherwise your fix is incomplete, assume a failed
>> of_numa_init() causing a dummy init but still some devices reporting a
>> node.
> 
> .
> .
> .
> EFI stub: Booting Linux Kernel...
> EFI stub: Using DTB from configuration table
> EFI stub: Exiting boot services and installing virtual address map...
> [0.00] Booting Linux on physical CPU 0x0
> [0.00] Linux version 4.8.0-rc8-dd (ddaney@localhost.localdomain) (gcc 
> version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #29 SMP Tue Sep 27 15:50:35 
> PDT 2016
> [0.00] Boot CPU: AArch64 Processor [431f0a10]
> [0.00] NUMA turned off
> [0.00] earlycon: pl11 at MMIO 0x87e02400 (options '')
> [0.00] bootconsole [pl11] enabled
> [0.00] efi: Getting EFI parameters from FDT:
> [0.00] efi: EFI v2.40 by Cavium Thunder cn88xx EFI 
> jenkins_weekly_build_40-0-ga1f880f Sep 13 2016 17:05:35
> [0.00] efi:  ACPI=0xf000  ACPI 2.0=0xf014  SMBIOS 
> 3.0=0x10ffafcf000
> [0.00] cma: Reserved 512 MiB at 0xc000
> [0.00] NUMA disabled
> [0.00] NUMA: Faking a node at [mem 
> 0x-0x010f]
> [0.00] NUMA: Adding memblock [0x140 - 0xfffd] on node 0
> [0.00] NUMA: Adding memblock [0xfffe - 0x] on node 0
> [0.00] NUMA: Adding memblock [0x1 - 0xf] on node 0
> [0.00] NUMA: Adding memblock [0x140 - 0x10ffa38] on node 0
> [

Re: aarch64 ACPI boot regressed by commit 7ba5f605f3a0 ("arm64/numa: remove the limitation that cpu0 must bind to node0")

2016-10-17 Thread Leizhen (ThunderTown)


>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index d3f151cfd4a1..8507703dabe4 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -544,6 +544,7 @@ acpi_map_gic_cpu_interface(struct 
>> acpi_madt_generic_interrupt *processor)
>>  return;
>>  }
>>  bootcpu_valid = true;
>> +early_map_cpu_to_node(0, acpi_numa_get_nid(0, hwid));
>>  return;
>>  }
>>
> 
> Anyway, your patch works with both the two-node NUMA configuration Drew 
> suggested for testing, and with the single-node config that I originally used 
> for the bisection. Therefore:
> 
> Tested-by: Laszlo Ersek 
> Reported-by: Laszlo Ersek 
> 
> Thank you very much for the quick bugfix! And, I think your patch (when you 
> send it for real) should carry
I'm so sorry about this. My patch series was prepared before ACPI NUMA was 
upstreamed, and I forgot to consider it afterwards.

> 
> Fixes: 7ba5f605f3a0d9495aad539eeb8346d726dfc183
> 
> too, because it supplies the cpu#0<->node#xxx association that 7ba5f605f3a0 
> removed not just for DT, but also for ACPI.
> 
> Cheers!
> Laszlo
> 
> .
> 



Re: [PATCH v2] arm64: kernel: numa: fix ACPI boot cpu numa node mapping

2016-10-18 Thread Leizhen (ThunderTown)


On 2016/10/18 16:39, Hanjun Guo wrote:
> On 2016/10/17 22:56, Lorenzo Pieralisi wrote:
>> Commit 7ba5f605f3a0 ("arm64/numa: remove the limitation that cpu0 must
>> bind to node0") removed the numa cpu<->node mapping restriction whereby
>> logical cpu 0 always corresponds to numa node 0; removing the
>> restriction was correct, in that it does not really exist in practice
>> but the commit only updated the early mapping of logical cpu 0 to its
>> real numa node for the DT boot path, missing the ACPI one, leading to
>> boot failures on ACPI systems owing to missing cpu<->node map for
>> logical cpu 0.
>>
>> Fix the issue by updating the ACPI boot path with code that carries out
>> the early cpu<->node mapping also for the boot cpu (ie cpu 0), mirroring
>> what is currently done in the DT boot path.
>>
>> Fixes: 7ba5f605f3a0 ("arm64/numa: remove the limitation that cpu0 must bind 
>> to node0")
>> Signed-off-by: Lorenzo Pieralisi 
>> Tested-by: Laszlo Ersek 
>> Reported-by: Laszlo Ersek 
>> Cc: Will Deacon 
>> Cc: Laszlo Ersek 
>> Cc: Hanjun Guo 
> 
> Thanks for the quick response and fix,
> 
> Acked-by: Hanjun Guo 
> 
> By the way, I got another boot failure [1] when we have multi
> NUMA nodes system with some memory-less nodes (only one node
> have memory), we are looking into it now, this patch needs
> to be merged first.
You should apply my NUMA MEMORYLESS patches first, because those two patches have 
not been upstreamed yet.
I just tested this based on 4.9-rc1 for DT NUMA, and it worked well. I will contact 
you to check what's wrong with ACPI NUMA.

> 
> Thanks
> Hanjun
> 
> [1]: boot failure log:
> [0.00] NUMA: Adding memblock [0x0 - 0x3fff] on node 0
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x3fff]
> [0.00] NUMA: Adding memblock [0x14 - 0x17] on node 1
> [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x14-0x17]
> [0.00] NUMA: Adding memblock [0x10 - 0x13] on node 0
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x10-0x13]
> [0.00] NUMA: Initmem setup node 0 [mem 0x-0x13fbff]
> [0.00] NUMA: NODE_DATA [mem 0x13fbffe500-0x13fbff]
> [0.00] NUMA: Initmem setup node 1 [mem 0x14-0x17fbff]
> [0.00] NUMA: NODE_DATA [mem 0x17fbfec500-0x17fbfedfff]
> [0.00] NUMA: Initmem setup node 2 [mem 0x-0x]
> [0.00] NUMA: NODE_DATA [mem 0x17fbfeaa00-0x17fbfec4ff]
> [0.00] NUMA: NODE_DATA(2) on node 1
> [0.00] NUMA: Initmem setup node 3 [mem 0x-0x]
> [0.00] NUMA: NODE_DATA [mem 0x17fbfe8f00-0x17fbfea9ff]
> [0.00] NUMA: NODE_DATA(3) on node 1
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x-0x]
> [0.00]   Normal   [mem 0x0001-0x0017fbff]
> [0.00] Movable zone start for each node
> [0.00] Early memory node ranges
> [0.00]   node   0: [mem 0x-0x00024fff]
> [0.00]   node   0: [mem 0x00026000-0x319d]
> [0.00]   node   0: [mem 0x319e-0x31a4]
> [0.00]   node   0: [mem 0x31a5-0x31b2]
> [0.00]   node   0: [mem 0x31b3-0x31b3]
> [0.00]   node   0: [mem 0x31b4-0x39ba]
> [0.00]   node   0: [mem 0x39bb-0x3a143fff]
> [0.00]   node   0: [mem 0x3a144000-0x3f12]
> [0.00]   node   0: [mem 0x3f13-0x3f15]
> [0.00]   node   0: [mem 0x3f16-0x3fbf]
> [0.00]   node   0: [mem 0x00104000-0x0013fbff]
> [0.00]   node   1: [mem 0x0014-0x0017fbff]
> [0.00] Initmem setup node 0 [mem 
> 0x-0x0013fbff]
> [0.00] Initmem setup node 1 [mem 
> 0x0014-0x0017fbff]
> [0.00] Could not find start_pfn for node 2
> [0.00] Initmem setup node 2 [mem 
> 0x-0x]
> [0.00] Could not find start_pfn for node 3
> [0.00] Initmem setup node 3 [mem 
> 0x-0x]
> [0.00] psci: probing for conduit method from ACPI.
> [0.00] [ cut here ]
> [0.00] kernel BUG at mm/percpu.c:1916!
> [0.00] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [0.00] Modules linked in:
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 4.9.0-rc1-00083-g3dd62e5 #680
> [0.00] Hardware name: Hisilicon Hi1616 Evaluation Board (DT)
> [0.00] task: 08d5e980 task.stack: 08d5
> [0.00] PC is at 

Re: [PATCH v8 10/16] mm/memblock: add a new function memblock_alloc_near_nid

2016-10-11 Thread Leizhen (ThunderTown)


On 2016/10/11 18:16, Will Deacon wrote:
> On Tue, Oct 11, 2016 at 09:44:20AM +0800, Leizhen (ThunderTown) wrote:
>> On 2016/9/1 14:55, Zhen Lei wrote:
>>> If HAVE_MEMORYLESS_NODES is selected, and some memoryless numa nodes
>>> actually exist, the percpu variable areas and numa control blocks of those
>>> memoryless numa nodes must be allocated from the nearest available node
>>> to improve performance.
>>>
>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>> ---
>>>  include/linux/memblock.h |  1 +
>>>  mm/memblock.c| 28 
>>>  2 files changed, 29 insertions(+)
>>
>> Hi Will,
>>   It seems no one has picked this up; how about I move the function below
>> into arch/arm64/mm/numa.c again, so that it and patch 11 can be merged into
>> one?
> 
> I'd rather you reposted it after the merge window so we can see what to
> do with it then. The previous posting was really hard to figure out and
> mixed lots of different concepts into one series, so it's not completely
> surprising that it didn't all get picked up.
OK, thanks.

> 
> Will
> 
> .
> 



Re: [PATCH v8 10/16] mm/memblock: add a new function memblock_alloc_near_nid

2016-10-10 Thread Leizhen (ThunderTown)


On 2016/9/1 14:55, Zhen Lei wrote:
> If HAVE_MEMORYLESS_NODES is selected, and some memoryless numa nodes
> actually exist, the percpu variable areas and numa control blocks of those
> memoryless numa nodes must be allocated from the nearest available node
> to improve performance.
> 
> Signed-off-by: Zhen Lei 
> ---
>  include/linux/memblock.h |  1 +
>  mm/memblock.c| 28 
>  2 files changed, 29 insertions(+)

Hi Will,
  It seems no one has picked this up; how about I move the function below into 
arch/arm64/mm/numa.c again, so that it and patch 11 can be merged into one?

> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 2925da2..8e866e0 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -290,6 +290,7 @@ static inline int memblock_get_region_node(const struct 
> memblock_region *r)
> 
>  phys_addr_t memblock_alloc_nid(phys_addr_t size, phys_addr_t align, int nid);
>  phys_addr_t memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, int 
> nid);
> +phys_addr_t memblock_alloc_near_nid(phys_addr_t size, phys_addr_t align, int 
> nid);
> 
>  phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 483197e..6578fff 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1189,6 +1189,34 @@ again:
>   return ret;
>  }
> 
> +phys_addr_t __init memblock_alloc_near_nid(phys_addr_t size, phys_addr_t 
> align, int nid)
> +{
> + int i, best_nid, distance;
> + u64 pa;
> + DECLARE_BITMAP(nodes_map, MAX_NUMNODES);
> +
> + bitmap_zero(nodes_map, MAX_NUMNODES);
> +
> +find_nearest_node:
> + best_nid = NUMA_NO_NODE;
> + distance = INT_MAX;
> +
> + for_each_clear_bit(i, nodes_map, MAX_NUMNODES)
> + if (node_distance(nid, i) < distance) {
> + best_nid = i;
> + distance = node_distance(nid, i);
> + }
> +
> + pa = memblock_alloc_nid(size, align, best_nid);
> + if (!pa) {
> + BUG_ON(best_nid == NUMA_NO_NODE);
> + bitmap_set(nodes_map, best_nid, 1);
> + goto find_nearest_node;
> + }
> +
> + return pa;
> +}
> +
>  phys_addr_t __init __memblock_alloc_base(phys_addr_t size, phys_addr_t 
> align, phys_addr_t max_addr)
>  {
>   return memblock_alloc_base_nid(size, align, max_addr, NUMA_NO_NODE,
> --
> 2.5.0
> 
> 
> 
> .
> 



Re: [PATCH 1/7] iommu/iova: fix incorrect variable types

2017-03-23 Thread Leizhen (ThunderTown)


On 2017/3/23 19:42, Robin Murphy wrote:
> On 22/03/17 06:27, Zhen Lei wrote:
>> Keep these four variables' types consistent with the parameters of function
>> __alloc_and_insert_iova_range and the members of struct iova:
>>
>> 1. static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>  unsigned long size, unsigned long limit_pfn,
>>
>> 2. struct iova {
>>  unsigned long   pfn_hi;
>>  unsigned long   pfn_lo;
>>
>> In fact, limit_pfn is most likely larger than 32 bits on DMA64.
> 
> FWIW if pad_size manages to overflow an int something's probably gone
> horribly wrong, but there's no harm in making it consistent with
> everything else here. However, given that patch #6 makes this irrelevant
> anyway, do we really need to bother?

Because I'm not sure whether patch #6 can be applied or not.

> 
> Robin.
> 
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/iommu/iova.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index b7268a1..8ba8b496 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -104,8 +104,8 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, 
>> struct iova *free)
>>   * Computes the padding size required, to make the start address
>>   * naturally aligned on the power-of-two order of its size
>>   */
>> -static unsigned int
>> -iova_get_pad_size(unsigned int size, unsigned int limit_pfn)
>> +static unsigned long
>> +iova_get_pad_size(unsigned long size, unsigned long limit_pfn)
>>  {
>>  return (limit_pfn + 1 - size) & (__roundup_pow_of_two(size) - 1);
>>  }
>> @@ -117,7 +117,7 @@ static int __alloc_and_insert_iova_range(struct 
>> iova_domain *iovad,
>>  struct rb_node *prev, *curr = NULL;
>>  unsigned long flags;
>>  unsigned long saved_pfn;
>> -unsigned int pad_size = 0;
>> +unsigned long pad_size = 0;
>>  
>>  /* Walk the tree backwards */
>>  spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>>
> 
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 3/7] iommu/iova: insert start_pfn boundary of dma32

2017-03-23 Thread Leizhen (ThunderTown)


On 2017/3/23 21:01, Robin Murphy wrote:
> On 22/03/17 06:27, Zhen Lei wrote:
>> Reserve the first granule of memory (starting at start_pfn) as a boundary
>> iova, to make sure that iovad->cached32_node can never be NULL in future.
>> Meanwhile, change the assignment of iovad->cached32_node from rb_next to
>> rb_prev of &free->node in function __cached_rbnode_delete_update.
> 
> I'm not sure I follow this. It's a top-down allocator, so cached32_node
> points to the last node allocated (or the node above the last one freed)
> on the assumption that there is likely free space directly below there,
> thus it's a pretty good place for the next allocation to start searching
> from. On the other hand, start_pfn is a hard "do not go below this line"
> limit, so it doesn't seem to make any sense to ever point the former at
> the latter.
This patch just prepares for dma64. We really need to add the boundary
between dma32 and dma64, for two main purposes:
1. To make dma32 iova allocation faster: the boundary is the last node a
dma32 search can see, so a dma32 iova allocation only searches within the
dma32 iova space. Meanwhile, we want dma64 allocation to try the dma64 iova
space (iova >= 4G) first, because the dma32 iova space is at most 4GB while
the dma64 iova space is almost always far richer.

2. To prevent an allocated iova from crossing the dma32 and dma64 spaces.
Otherwise, this special case would have to be handled both when allocating
and when freeing an iova.

Once that boundary is added, it's better to add a boundary at the dma32
start_pfn as well, so that both ends are handled by the same model.

With the two boundaries in place, keeping cached32/64_node pointing at a
free iova node simplifies the code.


> 
> I could understand slightly more if we were reserving the PFN *above*
> the cached range, but as-is I don't see what we gain from the change
> here, nor what benefit the cached32_node != NULL assumption gives
> (AFAICS it would be more useful to simply restrict the cases where it
> may be NULL to when the address space is either completely full or
> completely empty, or perhaps both).
> 
> Robin.
> 
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/iommu/iova.c | 63 
>> ++--
>>  1 file changed, 37 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 1c49969..b5a148e 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -32,6 +32,17 @@ static unsigned long iova_rcache_get(struct iova_domain 
>> *iovad,
>>  static void init_iova_rcaches(struct iova_domain *iovad);
>>  static void free_iova_rcaches(struct iova_domain *iovad);
>>  
>> +static void
>> +insert_iova_boundary(struct iova_domain *iovad)
>> +{
>> +struct iova *iova;
>> +unsigned long start_pfn_32bit = iovad->start_pfn;
>> +
>> +iova = reserve_iova(iovad, start_pfn_32bit, start_pfn_32bit);
>> +BUG_ON(!iova);
>> +iovad->cached32_node = &iova->node;
>> +}
>> +
>>  void
>>  init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>>  unsigned long start_pfn, unsigned long pfn_32bit)
>> @@ -45,27 +56,38 @@ init_iova_domain(struct iova_domain *iovad, unsigned 
>> long granule,
>>  
>>  spin_lock_init(&iovad->iova_rbtree_lock);
>>  iovad->rbroot = RB_ROOT;
>> -iovad->cached32_node = NULL;
>>  iovad->granule = granule;
>>  iovad->start_pfn = start_pfn;
>>  iovad->dma_32bit_pfn = pfn_32bit;
>>  init_iova_rcaches(iovad);
>> +
>> +/*
>> + * Insert boundary nodes for dma32. So cached32_node can not be NULL in
>> + * future.
>> + */
>> +insert_iova_boundary(iovad);
>>  }
>>  EXPORT_SYMBOL_GPL(init_iova_domain);
>>  
>>  static struct rb_node *
>>  __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
>>  {
>> -if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>> -(iovad->cached32_node == NULL))
>> +struct rb_node *cached_node;
>> +struct rb_node *next_node;
>> +
>> +if (*limit_pfn > iovad->dma_32bit_pfn)
>>  return rb_last(>rbroot);
>> -else {
>> -struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>> -struct iova *curr_iova =
>> -rb_entry(iovad->cached32_node, struct iova, node);
>> -*limit_pfn = curr_iova->pfn_lo - 1;
>> -return prev_node;
>> +else
>> +cached_node = iovad->cached32_node;
>> +
>> +next_node = rb_next(cached_node);
>> +if (next_node) {
>> +struct iova *next_iova = rb_entry(next_node, struct iova, node);
>> +
>> +*limit_pfn = min(*limit_pfn, next_iova->pfn_lo - 1);
>>  }
>> +
>> +return cached_node;
>>  }
>>  
>>  static void
>> @@ -83,20 +105,13 @@ __cached_rbnode_delete_update(struct iova_domain 
>> *iovad, struct iova *free)
>>  struct iova *cached_iova;
>>  struct rb_node *curr;
>>  
>> -if (!iovad->cached32_node)
>> -return;
>>  curr = iovad->cached32_node;
>> 

Re: [PATCH 2/7] iommu/iova: cut down judgement times

2017-03-30 Thread Leizhen (ThunderTown)


On 2017/3/23 20:11, Robin Murphy wrote:
> On 22/03/17 06:27, Zhen Lei wrote:
>> The judgement below can only be satisfied on the last iteration, which
>> wastes up to 2N judgements (suppose N iterations fail and 0 or 1 succeeds):
>>
>> if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
>>  return iova;
>> }
> 
> For me, GCC (6.2.1 AArch64) seems to do a pretty good job of this
> function already, so this change only saves two instructions in total
> (pfn is compared against pfn_lo only once instead of twice), which I
> wouldn't expect to see a noticeable performance effect from.
OK, thanks for your careful analysis.

Although only two instructions are saved in each loop iteration, it's still an 
improvement and does no harm.

> 
> Given the improvement in readability, though, I don't even care about
> any codegen differences :)
> 
> Reviewed-by: Robin Murphy 
> 
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/iommu/iova.c | 9 +++--
>>  1 file changed, 3 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 8ba8b496..1c49969 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -312,15 +312,12 @@ private_find_iova(struct iova_domain *iovad, unsigned 
>> long pfn)
>>  while (node) {
>>  struct iova *iova = rb_entry(node, struct iova, node);
>>  
>> -/* If pfn falls within iova's range, return iova */
>> -if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
>> -return iova;
>> -}
>> -
>>  if (pfn < iova->pfn_lo)
>>  node = node->rb_left;
>> -else if (pfn > iova->pfn_lo)
>> +else if (pfn > iova->pfn_hi)
>>  node = node->rb_right;
>> +else
>> +return iova;/* pfn falls within iova's range */
>>  }
>>  
>>  return NULL;
>>
> 
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 3/7] iommu/iova: insert start_pfn boundary of dma32

2017-03-30 Thread Leizhen (ThunderTown)
Because of a problem with my email server, all patches sent to Joerg Roedel 
<j...@8bytes.org> failed, so I am reposting this email.


On 2017/3/24 11:43, Leizhen (ThunderTown) wrote:
> 
> 
> On 2017/3/23 21:01, Robin Murphy wrote:
>> On 22/03/17 06:27, Zhen Lei wrote:
>>> Reserve the first granule of memory (starting at start_pfn) as a boundary
>>> iova, to make sure that iovad->cached32_node can never be NULL in future.
>>> Meanwhile, change the assignment of iovad->cached32_node from rb_next to
>>> rb_prev of &free->node in function __cached_rbnode_delete_update.
>>
>> I'm not sure I follow this. It's a top-down allocator, so cached32_node
>> points to the last node allocated (or the node above the last one freed)
>> on the assumption that there is likely free space directly below there,
>> thus it's a pretty good place for the next allocation to start searching
>> from. On the other hand, start_pfn is a hard "do not go below this line"
>> limit, so it doesn't seem to make any sense to ever point the former at
>> the latter.
> This patch just prepares for dma64. We really need to add the boundary
> between dma32 and dma64, for two main purposes:
> 1. To make dma32 iova allocation faster: the boundary is the last node a
> dma32 search can see, so a dma32 iova allocation only searches within the
> dma32 iova space. Meanwhile, we want dma64 allocation to try the dma64 iova
> space (iova >= 4G) first, because the dma32 iova space is at most 4GB while
> the dma64 iova space is almost always far richer.
> 
> 2. To prevent an allocated iova from crossing the dma32 and dma64 spaces.
> Otherwise, this special case would have to be handled both when allocating
> and when freeing an iova.
> 
> Once that boundary is added, it's better to add a boundary at the dma32
> start_pfn as well, so that both ends are handled by the same model.
> 
> With the two boundaries in place, keeping cached32/64_node pointing at a
> free iova node simplifies the code.
> 
> 
>>
>> I could understand slightly more if we were reserving the PFN *above*
>> the cached range, but as-is I don't see what we gain from the change
>> here, nor what benefit the cached32_node != NULL assumption gives
>> (AFAICS it would be more useful to simply restrict the cases where it
>> may be NULL to when the address space is either completely full or
>> completely empty, or perhaps both).
>>
>> Robin.
>>
>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>> ---
>>>  drivers/iommu/iova.c | 63 
>>> ++--
>>>  1 file changed, 37 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index 1c49969..b5a148e 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -32,6 +32,17 @@ static unsigned long iova_rcache_get(struct iova_domain 
>>> *iovad,
>>>  static void init_iova_rcaches(struct iova_domain *iovad);
>>>  static void free_iova_rcaches(struct iova_domain *iovad);
>>>  
>>> +static void
>>> +insert_iova_boundary(struct iova_domain *iovad)
>>> +{
>>> +   struct iova *iova;
>>> +   unsigned long start_pfn_32bit = iovad->start_pfn;
>>> +
>>> +   iova = reserve_iova(iovad, start_pfn_32bit, start_pfn_32bit);
>>> +   BUG_ON(!iova);
>>> +   iovad->cached32_node = &iova->node;
>>> +}
>>> +
>>>  void
>>>  init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>>> unsigned long start_pfn, unsigned long pfn_32bit)
>>> @@ -45,27 +56,38 @@ init_iova_domain(struct iova_domain *iovad, unsigned 
>>> long granule,
>>>  
>>> spin_lock_init(&iovad->iova_rbtree_lock);
>>> iovad->rbroot = RB_ROOT;
>>> -   iovad->cached32_node = NULL;
>>> iovad->granule = granule;
>>> iovad->start_pfn = start_pfn;
>>> iovad->dma_32bit_pfn = pfn_32bit;
>>> init_iova_rcaches(iovad);
>>> +
>>> +   /*
>>> +* Insert boundary nodes for dma32. So cached32_node can not be NULL in
>>> +* future.
>>> +*/
>>> +   insert_iova_boundary(iovad);
>>>  }
>>>  EXPORT_SYMBOL_GPL(init_iova_domain);
>>>  
>>>  static struct rb_node *
>>>  __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
>>>  {
>>> -   if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>>> -   (iovad->cached32_node == NULL))
>>> +   st

Re: [PATCH 1/7] iommu/iova: fix incorrect variable types

2017-03-30 Thread Leizhen (ThunderTown)


On 2017/3/24 10:27, Leizhen (ThunderTown) wrote:
> 
> 
> On 2017/3/23 19:42, Robin Murphy wrote:
>> On 22/03/17 06:27, Zhen Lei wrote:
>>> Keep these four variables' types consistent with the parameters of function
>>> __alloc_and_insert_iova_range and the members of struct iova:
>>>
>>> 1. static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>> unsigned long size, unsigned long limit_pfn,
>>>
>>> 2. struct iova {
>>> unsigned long   pfn_hi;
>>> unsigned long   pfn_lo;
>>>
>>> In fact, limit_pfn is most likely larger than 32 bits on DMA64.
>>
>> FWIW if pad_size manages to overflow an int something's probably gone
>> horribly wrong, but there's no harm in making it consistent with
>> everything else here. However, given that patch #6 makes this irrelevant
>> anyway, do we really need to bother?
> 
> Because I'm not sure whether patch #6 can be applied or not.
So if Patch #6 can be applied, I can merge this patch and patch #6 into one.

> 
>>
>> Robin.
>>
>>> Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
>>> ---
>>>  drivers/iommu/iova.c | 6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index b7268a1..8ba8b496 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -104,8 +104,8 @@ __cached_rbnode_delete_update(struct iova_domain 
>>> *iovad, struct iova *free)
>>>   * Computes the padding size required, to make the start address
>>>   * naturally aligned on the power-of-two order of its size
>>>   */
>>> -static unsigned int
>>> -iova_get_pad_size(unsigned int size, unsigned int limit_pfn)
>>> +static unsigned long
>>> +iova_get_pad_size(unsigned long size, unsigned long limit_pfn)
>>>  {
>>> return (limit_pfn + 1 - size) & (__roundup_pow_of_two(size) - 1);
>>>  }
>>> @@ -117,7 +117,7 @@ static int __alloc_and_insert_iova_range(struct 
>>> iova_domain *iovad,
>>> struct rb_node *prev, *curr = NULL;
>>> unsigned long flags;
>>> unsigned long saved_pfn;
>>> -   unsigned int pad_size = 0;
>>> +   unsigned long pad_size = 0;
>>>  
>>> /* Walk the tree backwards */
>>> spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>>>
>>
>>
>> .
>>
> 

-- 
Thanks!
BestRegards



Re: [PATCH 0/4] Optimise 64-bit IOVA allocations

2017-07-21 Thread Leizhen (ThunderTown)


On 2017/7/19 18:23, Robin Murphy wrote:
> On 19/07/17 09:37, Ard Biesheuvel wrote:
>> On 18 July 2017 at 17:57, Robin Murphy  wrote:
>>> Hi all,
>>>
>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>> close up all the little gaps that lead to hitting the slowest case, it
>>> seems a much better idea to simply make said slowest case a lot less slow.
>>>
>>> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
>>> the changes rather too hard to follow, so I've taken the liberty here of
>>> picking the whole thing up and reimplementing the main part in a rather
>>> less invasive manner.
>>>
>>> Robin.
>>>
>>> [1] 
>>> https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html
>>>
>>> Robin Murphy (1):
>>>   iommu/iova: Extend rbtree node caching
>>>
>>> Zhen Lei (3):
>>>   iommu/iova: Optimise rbtree searching
>>>   iommu/iova: Optimise the padding calculation
>>>   iommu/iova: Make dma_32bit_pfn implicit
>>>
>>>  drivers/gpu/drm/tegra/drm.c  |   3 +-
>>>  drivers/gpu/host1x/dev.c |   3 +-
>>>  drivers/iommu/amd_iommu.c|   7 +--
>>>  drivers/iommu/dma-iommu.c|  18 +--
>>>  drivers/iommu/intel-iommu.c  |  11 ++--
>>>  drivers/iommu/iova.c | 112 
>>> ---
>>>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>>>  include/linux/iova.h |   8 +--
>>>  8 files changed, 60 insertions(+), 105 deletions(-)
>>>
>>
>> These patches look suspiciously like the ones I have been using over
>> the past couple of weeks (modulo the tegra and host1x changes) from
>> your git tree. They work fine on my AMD Overdrive B1, both in DT and
>> in ACPI/IORT modes, although it is difficult to quantify any
>> performance deltas on my setup.
> 
> Indeed - this is a rebase (to account for those new callers) with a
> couple of trivial tweaks to error paths and corner cases that normal
> usage shouldn't have been hitting anyway. "No longer unusably awful" is
> a good enough performance delta for me :)
> 
>> Tested-by: Ard Biesheuvel 
I got the same performance data as with my patch version. It works well.

Tested-by: Zhen Lei 

> 
> Thanks!
> 
> Robin.
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-07-26 Thread Leizhen (ThunderTown)


On 2017/7/26 19:08, Joerg Roedel wrote:
> Hi Robin.
> 
> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>> Hi all,
>>
>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>> heavily influenced by IOVA allocation performance. Separately, Ard also
>> reported massive performance drops for a graphical desktop on AMD Seattle
>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>> the overhead of the rbtree slow path. Whilst we could go around trying to
>> close up all the little gaps that lead to hitting the slowest case, it
>> seems a much better idea to simply make said slowest case a lot less slow.
> 
> Do you have some numbers here? How big was the impact before these
> patches and how is it with the patches?
Here are some numbers:

(before)$ iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
[  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
[  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
[  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec

(after)$ iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
[  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
[  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
[  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
[  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec

> 
> 
>   Joerg
> 
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 4/4] iommu/iova: Make dma_32bit_pfn implicit

2017-07-19 Thread Leizhen (ThunderTown)


On 2017/7/19 23:07, kbuild test robot wrote:
> Hi Zhen,
> 
> [auto build test WARNING on iommu/next]
> [also build test WARNING on v4.13-rc1]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Robin-Murphy/Optimise-64-bit-IOVA-allocations/20170719-060847
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
> config: arm-multi_v7_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> wget 
> https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=arm 
> 
> All warnings (new ones prefixed by >>):
> 
>drivers/iommu/iova.c: In function 'init_iova_domain':
>>> drivers/iommu/iova.c:53:41: warning: large integer implicitly truncated to 
>>> unsigned type [-Woverflow]
>  iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
OK, I see. I think the problem is that "1ULL << 32" exceeds the range of a
32-bit general register. We should replace "1ULL << 32" with DMA_BIT_MASK(32);
the latter subtracts one, so the result can be safely stored in a general
register:

	iovad->dma_32bit_pfn = iova_pfn(iovad, DMA_BIT_MASK(32)) + 1;

> ^~~~
> 
> vim +53 drivers/iommu/iova.c
> 
> 35	
> 36	void
> 37	init_iova_domain(struct iova_domain *iovad, unsigned long granule,
> 38		unsigned long start_pfn)
> 39	{
> 40		/*
> 41		 * IOVA granularity will normally be equal to the smallest
> 42		 * supported IOMMU page size; both *must* be capable of
> 43		 * representing individual CPU pages exactly.
> 44		 */
> 45		BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
> 46	
> 47		spin_lock_init(&iovad->iova_rbtree_lock);
> 48		iovad->rbroot = RB_ROOT;
> 49		iovad->cached_node = NULL;
> 50		iovad->cached32_node = NULL;
> 51		iovad->granule = granule;
> 52		iovad->start_pfn = start_pfn;
>   > 53		iovad->dma_32bit_pfn = iova_pfn(iovad, 1ULL << 32);
> 54		init_iova_rcaches(iovad);
> 55	}
> 56	EXPORT_SYMBOL_GPL(init_iova_domain);
> 57	
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> 

-- 
Thanks!
BestRegards



Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction

2017-06-28 Thread Leizhen (ThunderTown)


On 2017/6/28 17:32, Will Deacon wrote:
> Hi Zhen Lei,
> 
> Nate (CC'd), Robin and I have been working on something very similar to
> this series, but this patch is different to what we had planned. More below.
> 
> On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote:
>> Because all TLBI commands should be followed by a SYNC command, to make
>> sure that it has been completely finished. So we can just add the TLBI
>> commands into the queue, and put off the execution until meet SYNC or
>> other commands. To prevent the followed SYNC command waiting for a long
>> time because of too many commands have been delayed, restrict the max
>> delayed number.
>>
>> According to my test, I got the same performance data as I replaced writel
>> with writel_relaxed in queue_inc_prod.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/iommu/arm-smmu-v3.c | 42 +-
>>  1 file changed, 37 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 291da5f..4481123 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -337,6 +337,7 @@
>>  /* Command queue */
>>  #define CMDQ_ENT_DWORDS 2
>>  #define CMDQ_MAX_SZ_SHIFT   8
>> +#define CMDQ_MAX_DELAYED	32
>>  
>>  #define CMDQ_ERR_SHIFT  24
>>  #define CMDQ_ERR_MASK   0x7f
>> @@ -472,6 +473,7 @@ struct arm_smmu_cmdq_ent {
>>  };
>>  } cfgi;
>>  
>> +#define CMDQ_OP_TLBI_NH_ALL 0x10
>>  #define CMDQ_OP_TLBI_NH_ASID0x11
>>  #define CMDQ_OP_TLBI_NH_VA  0x12
>>  #define CMDQ_OP_TLBI_EL2_ALL0x20
>> @@ -499,6 +501,7 @@ struct arm_smmu_cmdq_ent {
>>  
>>  struct arm_smmu_queue {
>>  int irq; /* Wired interrupt */
>> +u32 nr_delay;
>>  
>>  __le64  *base;
>>  dma_addr_t  base_dma;
>> @@ -722,11 +725,16 @@ static int queue_sync_prod(struct arm_smmu_queue *q)
>>  return ret;
>>  }
>>  
>> -static void queue_inc_prod(struct arm_smmu_queue *q)
>> +static void queue_inc_swprod(struct arm_smmu_queue *q)
>>  {
>> -u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
>> +u32 prod = q->prod + 1;
>>  
>>  q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
>> +}
>> +
>> +static void queue_inc_prod(struct arm_smmu_queue *q)
>> +{
>> +queue_inc_swprod(q);
>>  writel(q->prod, q->prod_reg);
>>  }
>>  
>> @@ -761,13 +769,24 @@ static void queue_write(__le64 *dst, u64 *src, size_t 
>> n_dwords)
>>  *dst++ = cpu_to_le64(*src++);
>>  }
>>  
>> -static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent)
>> +static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int 
>> optimize)
>>  {
>>  if (queue_full(q))
>>  return -ENOSPC;
>>  
>>  queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords);
>> -queue_inc_prod(q);
>> +
>> +/*
>> + * We don't want too many commands to be delayed, this may lead the
>> + * followed sync command to wait for a long time.
>> + */
>> +if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) {
>> +queue_inc_swprod(q);
>> +} else {
>> +queue_inc_prod(q);
>> +q->nr_delay = 0;
>> +}
>> +
> 
> So here, you're effectively putting invalidation commands into the command
> queue without updating PROD. Do you actually see a performance advantage
> from doing so? Another side of the argument would be that we should be
Yes, my SAS SSD performance test showed that it can improve throughput by about
100-150K/s (the same as when I directly replaced writel with writel_relaxed).
And the average execution time of iommu_unmap (which is called by
iommu_dma_unmap_sg) dropped from 10us to 5us.

> moving PROD as soon as we can, so that the SMMU can process invalidation
> commands in the background and reduce the cost of the final SYNC operation
> when the high-level unmap operation is complete.
It may be that __iowmb() is more expensive than waiting for the TLBI to
complete. Besides the time of __iowmb() itself, it is also protected by the
spinlock, so lock contention rises rapidly under stress. __iowmb() costs
300-500ns on average (sorry, I forget the exact value).

In addition, after applying this patchset, Robin's v2, and my earlier dma64
IOVA optimization patchset, our network performance test got the same numbers
as global bypass, but the SAS SSD still dropped by more than 20%. Maybe we
should still focus on map/unmap, because the average execution time of IOVA
alloc/free is only about 400ns.

By the way, patches 2-5 are more effective than this one; they can improve
throughput by more than 350K/s. And with them, we got about a 100-150K/s
improvement from Robin's v2; otherwise, I saw no effect from Robin's v2.
Sorry, I have not tested how this patch behaves without
Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-08-08 Thread Leizhen (ThunderTown)


On 2017/8/8 20:03, Ganapatrao Kulkarni wrote:
> On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
> <thunder.leiz...@huawei.com> wrote:
>>
>>
>> On 2017/7/26 19:08, Joerg Roedel wrote:
>>> Hi Robin.
>>>
>>> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>>>> Hi all,
>>>>
>>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>>> close up all the little gaps that lead to hitting the slowest case, it
>>>> seems a much better idea to simply make said slowest case a lot less slow.
>>>
>>> Do you have some numbers here? How big was the impact before these
>>> patches and how is it with the patches?
>> Here are some numbers:
>>
>> (before)$ iperf -s
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
>> [ ID] Interval   Transfer Bandwidth
>> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
>> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
>> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>
>> (after)$ iperf -s
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
>> [ ID] Interval   Transfer Bandwidth
>> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
>> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
>> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>>
> 
> Is this testing done on Host or on Guest/VM?
Host

> 
>>>
>>>
>>>   Joerg
>>>
>>>
>>> .
>>>
>>
>> --
>> Thanks!
>> BestRegards
>>
>>
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 
> thanks
> Ganapat
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH v2 0/4] Optimise 64-bit IOVA allocations

2017-08-08 Thread Leizhen (ThunderTown)


On 2017/8/9 11:24, Ganapatrao Kulkarni wrote:
> On Wed, Aug 9, 2017 at 7:12 AM, Leizhen (ThunderTown)
> <thunder.leiz...@huawei.com> wrote:
>>
>>
>> On 2017/8/8 20:03, Ganapatrao Kulkarni wrote:
>>> On Wed, Jul 26, 2017 at 4:47 PM, Leizhen (ThunderTown)
>>> <thunder.leiz...@huawei.com> wrote:
>>>>
>>>>
>>>> On 2017/7/26 19:08, Joerg Roedel wrote:
>>>>> Hi Robin.
>>>>>
>>>>> On Fri, Jul 21, 2017 at 12:41:57PM +0100, Robin Murphy wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>>>>>> workloads (e.g. storage I/O with large scatterlists) probably remain 
>>>>>> quite
>>>>>> heavily influenced by IOVA allocation performance. Separately, Ard also
>>>>>> reported massive performance drops for a graphical desktop on AMD Seattle
>>>>>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>>>>>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>>>>>> the overhead of the rbtree slow path. Whilst we could go around trying to
>>>>>> close up all the little gaps that lead to hitting the slowest case, it
>>>>>> seems a much better idea to simply make said slowest case a lot less 
>>>>>> slow.
>>>>>
>>>>> Do you have some numbers here? How big was the impact before these
>>>>> patches and how is it with the patches?
>>>> Here are some numbers:
>>>>
>>>> (before)$ iperf -s
>>>> 
>>>> Server listening on TCP port 5001
>>>> TCP window size: 85.3 KByte (default)
>>>> 
>>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35898
>>>> [ ID] Interval   Transfer Bandwidth
>>>> [  4]  0.0-10.2 sec  7.88 MBytes  6.48 Mbits/sec
>>>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35900
>>>> [  5]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 35902
>>>> [  4]  0.0-10.3 sec  7.88 MBytes  6.43 Mbits/sec
>>>>
>>>> (after)$ iperf -s
>>>> 
>>>> Server listening on TCP port 5001
>>>> TCP window size: 85.3 KByte (default)
>>>> 
>>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36330
>>>> [ ID] Interval   Transfer Bandwidth
>>>> [  4]  0.0-10.0 sec  1.09 GBytes   933 Mbits/sec
>>>> [  5] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36332
>>>> [  5]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>>>> [  4] local 192.168.1.106 port 5001 connected with 192.168.1.198 port 36334
>>>> [  4]  0.0-10.0 sec  1.10 GBytes   938 Mbits/sec
>>>>
>>>
>>> Is this testing done on Host or on Guest/VM?
>> Host
> 
> As per your log, iperf throughput is improved to 938 Mbits/sec
> from  6.43 Mbits/sec.
> IMO, this seems to be unrealistic, some thing wrong with the testing?
For 64-bit non-PCI devices, the IOVA allocation always searches from the last
rb-tree node. When many IOVAs are allocated and held for a long time, the
search has to check many rb nodes before it finds a suitable free space. From
my tracing, the average number of checks exceeds 10K.

[free-space][free][used][...][used]
  ^           ^              ^
  |           |              |- rb_last
  |           |- maybe more than 10K allocated iova nodes
  |- for 32-bit devices, cached32_node remembers the latest freed node, which
     helps reduce the number of checks

This patch series adds a new member, "cached_node", to serve 64-bit devices,
just as cached32_node serves 32-bit devices.

> 
>>
>>>
>>>>>
>>>>>
>>>>>   Joerg
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>> --
>>>> Thanks!
>>>> BestRegards
>>>>
>>>>
>>>> ___
>>>> linux-arm-kernel mailing list
>>>> linux-arm-ker...@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>>>
>>> thanks
>>> Ganapat
>>>
>>> .
>>>
>>
>> --
>> Thanks!
>> BestRegards
>>
> 
> thanks
> Ganapat
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 0/5] arm-smmu: performance optimization

2017-08-17 Thread Leizhen (ThunderTown)


On 2017/8/17 22:36, Will Deacon wrote:
> Thunder, Nate, Robin,
> 
> On Mon, Jun 26, 2017 at 09:38:45PM +0800, Zhen Lei wrote:
>> I described the optimization more detail in patch 1 and 2, and patch 3-5 are
>> the implementation on arm-smmu/arm-smmu-v3 of patch 2.
>>
>> Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
>> queue_inc_prod. But Robin figured that it may lead SMMU consume stale
>> memory contents. I thought more than 3 whole days and got this one.
>>
>> This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock 
>> removal.
> 
> For the time being, I think we should focus on the new TLB flushing
> interface posted by Joerg:
> 
> http://lkml.kernel.org/r/1502974596-23835-1-git-send-email-j...@8bytes.org
> 
> which looks like it can give us most of the benefits of this series. Once
> we've got that, we can see what's left in the way of performance and focus
> on the cmdq batching separately (because I'm still not convinced about it).
OK, this is good news.

But I have a review comment (sorry, I have not subscribed to the list yet, so I
cannot reply directly):
I don't think we should add a TLB sync for the map operation:
1. at init time, all TLBs will be invalidated
2. when we try to map a new range, there are no related PTEs buffered in the
TLB, because of 1 above and 3 below
3. when we unmap that range, we make sure all related PTEs buffered in the TLB
are invalidated before the unmap finishes

> 
> Thanks,
> 
> Will
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction

2017-08-22 Thread Leizhen (ThunderTown)


On 2017/8/22 23:41, Joerg Roedel wrote:
> On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote:
>> -static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent)
>> +static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int 
>> optimize)
>>  {
>>  if (queue_full(q))
>>  return -ENOSPC;
>>  
>>  queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords);
>> -queue_inc_prod(q);
>> +
>> +/*
>> + * We don't want too many commands to be delayed, this may lead the
>> + * followed sync command to wait for a long time.
>> + */
>> +if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) {
>> +queue_inc_swprod(q);
>> +} else {
>> +queue_inc_prod(q);
>> +q->nr_delay = 0;
>> +}
>> +
>>  return 0;
>>  }
>>  
>> @@ -909,6 +928,7 @@ static void arm_smmu_cmdq_skip_err(struct 
>> arm_smmu_device *smmu)
>>  static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>>  struct arm_smmu_cmdq_ent *ent)
>>  {
>> +int optimize = 0;
>>  u64 cmd[CMDQ_ENT_DWORDS];
>>  unsigned long flags;
>>  bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>> @@ -920,8 +940,17 @@ static void arm_smmu_cmdq_issue_cmd(struct 
>> arm_smmu_device *smmu,
>>  return;
>>  }
>>  
>> +/*
>> + * All TLBI commands should be followed by a sync command later.
>> + * The CFGI commands is the same, but they are rarely executed.
>> + * So just optimize TLBI commands now, to reduce the "if" judgement.
>> + */
>> +if ((ent->opcode >= CMDQ_OP_TLBI_NH_ALL) &&
>> +(ent->opcode <= CMDQ_OP_TLBI_NSNH_ALL))
>> +optimize = 1;
>> +
>>  spin_lock_irqsave(&smmu->cmdq.lock, flags);
>> -while (queue_insert_raw(q, cmd) == -ENOSPC) {
>> +while (queue_insert_raw(q, cmd, optimize) == -ENOSPC) {
>>  if (queue_poll_cons(q, false, wfe))
>>  dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
>>  }
> 
> This doesn't look correct. How do you make sure that a given IOVA range
> is flushed before the addresses are reused?
Hi, Joerg:
It's actually guaranteed by the upper-layer functions, for example:

static int arm_lpae_unmap(
	...
	unmapped = __arm_lpae_unmap(data, iova, size, lvl, ptep);
	/* __arm_lpae_unmap indirectly calls arm_smmu_cmdq_issue_cmd to invalidate TLBs */
	if (unmapped)
		io_pgtable_tlb_sync(&data->iop);	/* tlb_sync waits for all TLBI operations to finish */


I also described it in the next patch (2/5), shown below:

Some people might ask: is it safe to do so? The answer is yes. The standard
processing flow is:
	alloc iova
	map
	process data
	unmap
	tlb invalidation and sync
	free iova

What must be guaranteed is that the "free iova" action happens after the
"unmap" and "TLBI" actions, which is what we are doing right now. This ensures
that all TLB entries of an iova-range have been invalidated before the iova is
reallocated.

Best regards,
LeiZhen

> 
> 
> Regards,
> 
>   Joerg
> 
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 1/1] iommu/arm-smmu-v3: replace writel with writel_relaxed in queue_inc_prod

2017-06-20 Thread Leizhen (ThunderTown)


On 2017/6/20 19:35, Robin Murphy wrote:
> On 20/06/17 12:04, Zhen Lei wrote:
>> This function is protected by spinlock, and the latter will do memory
>> barrier implicitly. So that we can safely use writel_relaxed. In fact, the
>> dmb operation will lengthen the time protected by lock, which indirectly
>> increase the locking confliction in the stress scene.
> 
> If you remove the DSB between writing the commands (to Normal memory)
> and writing the pointer (to Device memory), how can you guarantee that
> the complete command is visible to the SMMU and it isn't going to try to
> consume stale memory contents? The spinlock is irrelevant since it's
> taken *before* the command is written.
OK, I see, thanks. Let me see if there are any other methods. And I think
this should perhaps be handled by the hardware.

> 
> Robin.
> 
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/iommu/arm-smmu-v3.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index 380969a..d2fbee3 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -728,7 +728,7 @@ static void queue_inc_prod(struct arm_smmu_queue *q)
>>  u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1;
>>
>>  q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod);
>> -writel(q->prod, q->prod_reg);
>> +writel_relaxed(q->prod, q->prod_reg);
>>  }
>>
>>  /*
>> --
>> 2.5.0
>>
>>
> 
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 1/1] iommu/arm-smmu-v3: replace writel with writel_relaxed in queue_inc_prod

2017-06-26 Thread Leizhen (ThunderTown)


On 2017/6/21 17:08, Will Deacon wrote:
> On Wed, Jun 21, 2017 at 09:28:23AM +0800, Leizhen (ThunderTown) wrote:
>> On 2017/6/20 19:35, Robin Murphy wrote:
>>> On 20/06/17 12:04, Zhen Lei wrote:
>>>> This function is protected by spinlock, and the latter will do memory
>>>> barrier implicitly. So that we can safely use writel_relaxed. In fact, the
>>>> dmb operation will lengthen the time protected by lock, which indirectly
>>>> increase the locking confliction in the stress scene.
>>>
>>> If you remove the DSB between writing the commands (to Normal memory)
>>> and writing the pointer (to Device memory), how can you guarantee that
>>> the complete command is visible to the SMMU and it isn't going to try to
>>> consume stale memory contents? The spinlock is irrelevant since it's
>>> taken *before* the command is written.
>> OK, I see, thanks. Let's me see if there are any other methods. And I think
>> that this may should be done well by hardware.
> 
> FWIW, I did use the _relaxed variants wherever I could when I wrote the
> driver. There might, of course, be bugs, but it's not like the normal case
> for drivers where the author didn't consider the _relaxed accessors
> initially.
Good news. I got a new idea and I will post v2 later.

> 
> Will
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 1/1] iommu/arm-smmu-v3: replace writel with writel_relaxed in queue_inc_prod

2017-06-26 Thread Leizhen (ThunderTown)


On 2017/6/26 21:29, Leizhen (ThunderTown) wrote:
> 
> 
> On 2017/6/21 17:08, Will Deacon wrote:
>> On Wed, Jun 21, 2017 at 09:28:23AM +0800, Leizhen (ThunderTown) wrote:
>>> On 2017/6/20 19:35, Robin Murphy wrote:
>>>> On 20/06/17 12:04, Zhen Lei wrote:
>>>>> This function is protected by spinlock, and the latter will do memory
>>>>> barrier implicitly. So that we can safely use writel_relaxed. In fact, the
>>>>> dmb operation will lengthen the time protected by lock, which indirectly
>>>>> increase the locking confliction in the stress scene.
>>>>
>>>> If you remove the DSB between writing the commands (to Normal memory)
>>>> and writing the pointer (to Device memory), how can you guarantee that
>>>> the complete command is visible to the SMMU and it isn't going to try to
>>>> consume stale memory contents? The spinlock is irrelevant since it's
>>>> taken *before* the command is written.
>>> OK, I see, thanks. Let's me see if there are any other methods. And I think
>>> that this may should be done well by hardware.
>>
>> FWIW, I did use the _relaxed variants wherever I could when I wrote the
>> driver. There might, of course, be bugs, but it's not like the normal case
>> for drivers where the author didn't consider the _relaxed accessors
>> initially.
> A good news. I got a new idea and I will post v2 later.
[PATCH 0/5] arm-smmu: performance optimization
[PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock 
confliction

I just sent.

> 
>>
>> Will
>>
>> .
>>
> 

-- 
Thanks!
BestRegards



[Question or BUG] [NUMA]: I feel puzzled at the function cpumask_of_node

2017-06-07 Thread Leizhen (ThunderTown)
When I executed numactl -H (which prints cpumask_of_node for each node), I got
different results on x86 and ARM64.
For each NUMA node, the former displayed only online CPUs, while the latter
displayed all possible CPUs.
Actually, all other architectures behave the same as ARM64.

So, my question is: which set (online or possible) should the function
cpumask_of_node return? Or does it not matter?

-- 
Thanks!
BestRegards



Re: [Question or BUG] [NUMA]: I feel puzzled at the function cpumask_of_node

2017-06-14 Thread Leizhen (ThunderTown)


On 2017/6/8 22:12, Michal Hocko wrote:
> [CC linux-api]
> 
> On Wed 07-06-17 17:23:20, Leizhen (ThunderTown) wrote:
>> When I executed numactl -H(print cpumask_of_node for each node), I got
>> different result on X86 and ARM64.  For each numa node, the former
>> only displayed online CPUs, and the latter displayed all possible
>> CPUs.  Actually, all other ARCHs is the same to ARM64.
>>
>> So, my question is: Which case(online or possible) should function
>> cpumask_of_node be? Or there is no matter about it?
> 
> Unfortunatelly the documentation is quite unclear
> What: /sys/devices/system/node/nodeX/cpumap
> Date: October 2002
> Contact:  Linux Memory Management list <linux...@kvack.org>
> Description:
>   The node's cpumap.
> 
> not really helpeful, is it? Semantically I _think_ printing online cpus
> makes more sense because it doesn't really make much sense to bind
> anything on offline nodes. Generic implementtion of cpumask_of_node
> indeed provides only online cpus. I haven't checked specific
> implementations of arch specific code but listing offline cpus sounds
> confusing to me.
> 
OK, thank you very much. So, how about we directly apply cpumask_and() with
cpu_online_mask, as below:

diff --git a/drivers/base/node.c b/drivers/base/node.c
index b10479c..199723d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -28,12 +28,14 @@ static struct bus_type node_subsys = {
 static ssize_t node_read_cpumap(struct device *dev, bool list, char *buf)
 {
struct node *node_dev = to_node(dev);
-	const struct cpumask *mask = cpumask_of_node(node_dev->dev.id);
+	struct cpumask mask;
+
+	cpumask_and(&mask, cpumask_of_node(node_dev->dev.id), cpu_online_mask);
 
 	/* 2008/04/07: buf currently PAGE_SIZE, need 9 chars per 32 bits. */
 	BUILD_BUG_ON((NR_CPUS/32 * 9) > (PAGE_SIZE-1));
 
-	return cpumap_print_to_pagebuf(list, buf, mask);
+	return cpumap_print_to_pagebuf(list, buf, &mask);
 }

 static inline ssize_t node_read_cpumask(struct device *dev,


-- 
Thanks!
BestRegards



Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-09-19 Thread Leizhen (ThunderTown)


On 2017/7/31 19:42, Robin Murphy wrote:
> Hi Nate,
> 
> On 29/07/17 04:57, Nate Watterson wrote:
>> Hi Robin,
>> I am seeing a crash when performing very basic testing on this series
>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>> patch is the culprit, but this rcache business is still mostly
>> witchcraft to me.
>>
>> # ifconfig eth5 up
>> # ifconfig eth5 down
>> Unable to handle kernel NULL pointer dereference at virtual address
>> 0020
>> user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
>> [0020] *pgd=0006efab0003, *pud=0006efab0003,
>> *pmd=0007d8720003, *pte=
>> Internal error: Oops: 9607 [#1] SMP
>> Modules linked in:
>> CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>> task: 8007da1e5780 task.stack: 8007ddcb8000
>> PC is at __cached_rbnode_delete_update+0x2c/0x58
>> LR is at private_free_iova+0x2c/0x60
>> pc : [] lr : [] pstate: 204001c5
>> sp : 8007ddcbba00
>> x29: 8007ddcbba00 x28: 8007c8350210
>> x27: 8007d1a8 x26: 8007dcc20800
>> x25: 0140 x24: 8007c98f0008
>> x23: fe4e x22: 0140
>> x21: 8007c98f0008 x20: 8007c9adb240
>> x19: 8007c98f0018 x18: 0010
>> x17:  x16: 
>> x15: 4000 x14: 
>> x13:  x12: 0001
>> x11: dead0200 x10: 
>> x9 :  x8 : 8007c9adb1c0
>> x7 : 40002000 x6 : 00210d00
>> x5 :  x4 : c57e
>> x3 : ffcf x2 : ffcf
>> x1 : 8007c9adb240 x0 : 
>> [...]
>> [] __cached_rbnode_delete_update+0x2c/0x58
>> [] private_free_iova+0x2c/0x60
>> [] iova_magazine_free_pfns+0x4c/0xa0
>> [] free_iova_fast+0x1b0/0x230
>> [] iommu_dma_free_iova+0x5c/0x80
>> [] __iommu_dma_unmap+0x5c/0x98
>> [] iommu_dma_unmap_resource+0x24/0x30
>> [] iommu_dma_unmap_page+0xc/0x18
>> [] __iommu_unmap_page+0x40/0x60
>> [] mlx5e_page_release+0xbc/0x128
>> [] mlx5e_dealloc_rx_wqe+0x30/0x40
>> [] mlx5e_close_channel+0x70/0x1f8
>> [] mlx5e_close_channels+0x2c/0x50
>> [] mlx5e_close_locked+0x54/0x68
>> [] mlx5e_close+0x30/0x58
>> [...]
>>
>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>>   92|if (free->pfn_hi < iovad->dma_32bit_pfn)
>> 0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
>> 0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
>> 0852C6CC|cmp x3,x2
>> 0852C6D0|b.cs0x0852C708
>> |curr = &iovad->cached32_node;
>>   94|if (!curr)
>> 0852C6D4|addsx19,x0,#0x18 ; x19,iovad,#24
>> 0852C6D8|b.eq0x0852C708
>> |
>> |cached_iova = rb_entry(*curr, struct iova, node);
>> |
>>   99|if (free->pfn_lo >= cached_iova->pfn_lo)
>> 0852C6DC|ldr x0,[x19] ; xiovad,[curr]
>> 0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
>> 0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
>> Apparently cached_iova was NULL so the pfn_lo access faulted.
>>
>> 0852C6E8|cmp x2,x0
>> 0852C6EC|b.cc0x0852C6FC
>> 0852C6F0|mov x0,x1; x0,free
>>  100|*curr = rb_next(>node);
>> After instrumenting the code a bit, this seems to be the culprit. In the
>> previous call, free->pfn_lo was 0x_ which is actually the
>> dma_limit for the domain so rb_next() returns NULL.
>>
>> Let me know if you have any questions or would like additional tests
>> run. I also applied your "DMA domain debug info" patches and dumped the
>> contents of the domain at each of the steps above in case that would be
>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is
>> especially when using 64k pages and many CPUs.
> 
> Thanks for the report - I somehow managed to reason myself out of
> keeping the "no cached node" check in __cached_rbnode_delete_update() on
> the assumption that it must always be set by a previous allocation.
> However, there is indeed just one case case for which that fails: when
> you free any IOVA immediately after freeing the very topmost one. Which
> is something that freeing an entire magazine's worth of IOVAs back to
> the tree all at once has a very real chance of doing...
> 
> The obvious straightforward fix is inline below, but I'm now starting to
> understand the appeal of reserving a sentinel node to ensure the tree
> can never be empty, so I might have a quick go at that to see if it
> results in 

Re: [PATCH v2 0/3] arm-smmu: performance optimization

2017-09-19 Thread Leizhen (ThunderTown)


On 2017/9/19 12:31, Nate Watterson wrote:
> Hi Leizhen,
> 
> On 9/12/2017 9:00 AM, Zhen Lei wrote:
>> v1 -> v2:
>> base on (add02cfdc9bc2 "iommu: Introduce Interface for IOMMU TLB Flushing")
>>
>> Zhen Lei (3):
>>iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock
>>  confliction
>>iommu/arm-smmu-v3: add support for unmap an iova range with only one
>>  tlb sync
> 
> I tested these (2) patches on QDF2400 hardware and saw performance
> improvements in line with those I reported when testing the original
> series. I don't have any hardware close at hand to test the 3rd patch
> in the series so that will have to come from someone else.
Thanks a lot.

> 
> Tested-by: Nate Watterson 
> 
> Thanks,
> Nate
> 
>>iommu/arm-smmu: add support for unmap a memory range with only one tlb
>>  sync
>>
>>   drivers/iommu/arm-smmu-v3.c| 52 ++
>>   drivers/iommu/arm-smmu.c   | 10 
>>   drivers/iommu/io-pgtable-arm-v7s.c | 32 +++
>>   drivers/iommu/io-pgtable-arm.c | 30 ++
>>   drivers/iommu/io-pgtable.h |  1 +
>>   5 files changed, 99 insertions(+), 26 deletions(-)
>>
> 

-- 
Thanks!
Best regards



Re: [PATCH 1/1] mm: only display online cpus of the numa node

2017-09-29 Thread Leizhen (ThunderTown)


On 2017/8/28 21:13, Michal Hocko wrote:
> On Fri 25-08-17 18:34:33, Will Deacon wrote:
>> On Thu, Aug 24, 2017 at 10:32:26AM +0200, Michal Hocko wrote:
>>> It seems this has slipped through cracks. Let's CC arm64 guys
>>>
>>> On Tue 20-06-17 20:43:28, Zhen Lei wrote:
 When I executed numactl -H (which reads /sys/devices/system/node/nodeX/cpumap
 and displays cpumask_of_node for each node), I got different results on
 x86 and arm64. For each NUMA node, the former displayed only online CPUs,
 while the latter displayed all possible CPUs. Unfortunately, neither the
 Linux documentation nor the numactl manual describes this clearly.

 I sent a mail to ask for help, and Michal Hocko  replied
 that he preferred to print online cpus because it doesn't really make much
 sense to bind anything on offline nodes.
>>>
>>> Yes printing offline CPUs is just confusing and more so when the
>>> behavior is not consistent over architectures. I believe that x86
>>> behavior is the more appropriate one because it is more logical to dump
>>> the NUMA topology and use it for affinity setting than adding one
>>> additional step to check the cpu state to achieve the same.
>>>
>>> It is true that the online/offline state might change at any time so the
>>> above might be tricky on its own, but we should at least make the
>>> behavior consistent.
>>>
 Signed-off-by: Zhen Lei 
>>>
>>> Acked-by: Michal Hocko 
>>
>> The concept looks fine to me, but shouldn't we use cpumask_var_t and
>> alloc/free_cpumask_var?
> 
> This will be safer but both callers of node_read_cpumap have shallow
> stacks, so I am not sure the stack is a limiting factor here.
> 
> Zhen Lei, would you care to update that part please?
> 
Sure, I will send v2 immediately.

I'm so sorry that I missed this email until someone told me.

-- 
Thanks!
Best regards



Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

2017-08-31 Thread Leizhen (ThunderTown)


On 2017/8/4 3:41, Nate Watterson wrote:
> Hi Robin,
> 
> On 7/31/2017 7:42 AM, Robin Murphy wrote:
>> Hi Nate,
>>
>> On 29/07/17 04:57, Nate Watterson wrote:
>>> Hi Robin,
>>> I am seeing a crash when performing very basic testing on this series
>>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>>> patch is the culprit, but this rcache business is still mostly
>>> witchcraft to me.
>>>
>>> # ifconfig eth5 up
>>> # ifconfig eth5 down
>>>  Unable to handle kernel NULL pointer dereference at virtual address
>>> 0020
>>>  user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00
>>>  [0020] *pgd=0006efab0003, *pud=0006efab0003,
>>> *pmd=0007d8720003, *pte=
>>>  Internal error: Oops: 9607 [#1] SMP
>>>  Modules linked in:
>>>  CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>>>  task: 8007da1e5780 task.stack: 8007ddcb8000
>>>  PC is at __cached_rbnode_delete_update+0x2c/0x58
>>>  LR is at private_free_iova+0x2c/0x60
>>>  pc : [] lr : [] pstate: 204001c5
>>>  sp : 8007ddcbba00
>>>  x29: 8007ddcbba00 x28: 8007c8350210
>>>  x27: 8007d1a8 x26: 8007dcc20800
>>>  x25: 0140 x24: 8007c98f0008
>>>  x23: fe4e x22: 0140
>>>  x21: 8007c98f0008 x20: 8007c9adb240
>>>  x19: 8007c98f0018 x18: 0010
>>>  x17:  x16: 
>>>  x15: 4000 x14: 
>>>  x13:  x12: 0001
>>>  x11: dead0200 x10: 
>>>  x9 :  x8 : 8007c9adb1c0
>>>  x7 : 40002000 x6 : 00210d00
>>>  x5 :  x4 : c57e
>>>  x3 : ffcf x2 : ffcf
>>>  x1 : 8007c9adb240 x0 : 
>>>  [...]
>>>  [] __cached_rbnode_delete_update+0x2c/0x58
>>>  [] private_free_iova+0x2c/0x60
>>>  [] iova_magazine_free_pfns+0x4c/0xa0
>>>  [] free_iova_fast+0x1b0/0x230
>>>  [] iommu_dma_free_iova+0x5c/0x80
>>>  [] __iommu_dma_unmap+0x5c/0x98
>>>  [] iommu_dma_unmap_resource+0x24/0x30
>>>  [] iommu_dma_unmap_page+0xc/0x18
>>>  [] __iommu_unmap_page+0x40/0x60
>>>  [] mlx5e_page_release+0xbc/0x128
>>>  [] mlx5e_dealloc_rx_wqe+0x30/0x40
>>>  [] mlx5e_close_channel+0x70/0x1f8
>>>  [] mlx5e_close_channels+0x2c/0x50
>>>  [] mlx5e_close_locked+0x54/0x68
>>>  [] mlx5e_close+0x30/0x58
>>>  [...]
>>>
>>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>>>92|if (free->pfn_hi < iovad->dma_32bit_pfn)
>>> 0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24]
>>> 0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48]
>>> 0852C6CC|cmp x3,x2
>>> 0852C6D0|b.cs 0x0852C708
>>>  |curr = &iovad->cached32_node;
>>>94|if (!curr)
>>> 0852C6D4|adds x19,x0,#0x18 ; x19,iovad,#24
>>> 0852C6D8|b.eq 0x0852C708
>>>  |
>>>  |cached_iova = rb_entry(*curr, struct iova, node);
>>>  |
>>>99|if (free->pfn_lo >= cached_iova->pfn_lo)
>>> 0852C6DC|ldr x0,[x19] ; xiovad,[curr]
>>> 0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32]
>>> 0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32]
>>> Apparently cached_iova was NULL so the pfn_lo access faulted.
>>>
>>> 0852C6E8|cmp x2,x0
>>> 0852C6EC|b.cc 0x0852C6FC
>>> 0852C6F0|mov x0,x1; x0,free
>>>   100|*curr = rb_next(&free->node);
>>> After instrumenting the code a bit, this seems to be the culprit. In the
>>> previous call, free->pfn_lo was 0x_ which is actually the
>>> dma_limit for the domain so rb_next() returns NULL.
>>>
>>> Let me know if you have any questions or would like additional tests
>>> run. I also applied your "DMA domain debug info" patches and dumped the
>>> contents of the domain at each of the steps above in case that would be
>>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is
>>> especially when using 64k pages and many CPUs.
>>
>> Thanks for the report - I somehow managed to reason myself out of
>> keeping the "no cached node" check in __cached_rbnode_delete_update() on
>> the assumption that it must always be set by a previous allocation.
>> However, there is indeed just one case case for which that fails: when
>> you free any IOVA immediately after freeing the very topmost one. Which
>> is something that freeing an entire magazine's worth of IOVAs back to
>> the tree all at once has a very real chance of doing...
>>
>> The obvious straightforward fix is inline below, but I'm now starting to
>> understand the appeal of reserving a sentinel node to ensure the tree
>> can never be empty, so I might have a quick go at that to see if it
>> results in

Re: [PATCH v2 1/1] mm: only display online cpus of the numa node

2017-10-09 Thread Leizhen (ThunderTown)


On 2017/10/3 21:56, Michal Hocko wrote:
> On Tue 03-10-17 14:47:26, Will Deacon wrote:
>> On Mon, Oct 02, 2017 at 02:54:46PM -0700, Andrew Morton wrote:
>>> On Mon, 2 Oct 2017 11:38:07 +0100 Will Deacon  wrote:
>>>
> When I executed numactl -H (which reads /sys/devices/system/node/nodeX/cpumap
> and displays cpumask_of_node for each node), I got different results on
> x86 and arm64. For each NUMA node, the former displayed only online CPUs,
> while the latter displayed all possible CPUs. Unfortunately, neither the
> Linux documentation nor the numactl manual describes this clearly.
>
> I sent a mail to ask for help, and Michal Hocko  replied
> that he preferred to print online cpus because it doesn't really make much
> sense to bind anything on offline nodes.
>
> Signed-off-by: Zhen Lei 
> Acked-by: Michal Hocko 
> ---
>  drivers/base/node.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)

 Which tree is this intended to go through? I'm happy to take it via arm64,
 but I don't want to tread on anybody's toes in linux-next and it looks like
 there are already queued changes to this file via Andrew's tree.
>>>
>>> I grabbed it.  I suppose there's some small risk of userspace breakage
>>> so I suggest it be a 4.15-rc1 thing?
>>
>> To be honest, I suspect the vast majority (if not all) code that reads this
>> file was developed for x86, so having the same behaviour for arm64 sounds
>> like something we should do ASAP before people try to special case with
>> things like #ifdef __aarch64__.
>>
>> I'd rather have this in 4.14 if possible.
> 
> Agreed!
> 

+1

-- 
Thanks!
Best regards



Re: [PATCH v2 2/3] iommu/arm-smmu-v3: add support for unmap an iova range with only one tlb sync

2017-10-18 Thread Leizhen (ThunderTown)


On 2017/10/18 21:00, Will Deacon wrote:
> On Tue, Sep 12, 2017 at 09:00:37PM +0800, Zhen Lei wrote:
>> This patch is based on:
>> (add02cfdc9bc2 "iommu: Introduce Interface for IOMMU TLB Flushing")
>>
>> Because iotlb_sync is moved out of ".unmap = arm_smmu_unmap", some internal
>> ".unmap" calls should be explicitly followed by an iotlb_sync operation.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>  drivers/iommu/arm-smmu-v3.c| 10 ++
>>  drivers/iommu/io-pgtable-arm.c | 30 --
>>  drivers/iommu/io-pgtable.h |  1 +
>>  3 files changed, 31 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index ef42c4b..e92828e 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -1772,6 +1772,15 @@ arm_smmu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>>  return ops->unmap(ops, iova, size);
>>  }
>>  
>> +static void arm_smmu_iotlb_sync(struct iommu_domain *domain)
>> +{
>> +struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>> +struct io_pgtable_ops *ops = smmu_domain->pgtbl_ops;
>> +
>> +if (ops && ops->iotlb_sync)
>> +ops->iotlb_sync(ops);
>> +}
>> +
>>  static phys_addr_t
>>  arm_smmu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
>>  {
>> @@ -1991,6 +2000,7 @@ static struct iommu_ops arm_smmu_ops = {
>>  .attach_dev = arm_smmu_attach_dev,
>>  .map= arm_smmu_map,
>>  .unmap  = arm_smmu_unmap,
>> +.iotlb_sync = arm_smmu_iotlb_sync,
>>  .map_sg = default_iommu_map_sg,
>>  .iova_to_phys   = arm_smmu_iova_to_phys,
>>  .add_device = arm_smmu_add_device,
>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>> index e8018a3..805efc9 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -304,6 +304,8 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>>  WARN_ON(!selftest_running);
>>  return -EEXIST;
>>  } else if (iopte_type(pte, lvl) == ARM_LPAE_PTE_TYPE_TABLE) {
>> +size_t unmapped;
>> +
>>  /*
>>   * We need to unmap and free the old table before
>>   * overwriting it with a block entry.
>> @@ -312,7 +314,9 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>>  size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
>>  
>>  tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
>> -if (WARN_ON(__arm_lpae_unmap(data, iova, sz, lvl, tblp) != sz))
>> +unmapped = __arm_lpae_unmap(data, iova, sz, lvl, tblp);
>> +io_pgtable_tlb_sync(&data->iop);
>> +if (WARN_ON(unmapped != sz))
>>  return -EINVAL;
>>  }
>>  
>> @@ -584,7 +588,6 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>  /* Also flush any partial walks */
>>  io_pgtable_tlb_add_flush(iop, iova, size,
>>  ARM_LPAE_GRANULE(data), false);
>> -io_pgtable_tlb_sync(iop);
>>  ptep = iopte_deref(pte, data);
>>  __arm_lpae_free_pgtable(data, lvl + 1, ptep);
>>  } else {
>> @@ -609,7 +612,6 @@ static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>  static int arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
>>size_t size)
>>  {
>> -size_t unmapped;
>>  struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
>>  arm_lpae_iopte *ptep = data->pgd;
>>  int lvl = ARM_LPAE_START_LVL(data);
>> @@ -617,11 +619,14 @@ static int arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
>>  if (WARN_ON(iova >= (1ULL << data->iop.cfg.ias)))
>>  return 0;
>>  
>> -unmapped = __arm_lpae_unmap(data, iova, size, lvl, ptep);
>> -if (unmapped)
>> -io_pgtable_tlb_sync(&data->iop);
>> +return __arm_lpae_unmap(data, iova, size, lvl, ptep);
>> +}
> 
> This change is already queued in Joerg's tree, due to a patch from Robin.
Yes, I see. So this one can be skipped.

> 
> Will
> 
> .
> 

-- 
Thanks!
Best regards


