Re: ILP32 for ARM64 - testing with lmbench

2016-12-05 Thread Catalin Marinas
On Mon, Dec 05, 2016 at 06:16:09PM +0800, Zhangjian (Bamvor) wrote:
> Do you have a suggestion for the next step in upstreaming ILP32?

I mentioned the steps a few times before. I'm pasting them again here:

1. Complete the review of the Linux patches and ABI (no merge yet)
2. Review the corresponding glibc patches (no merge yet)
3. Ask (Linaro, Cavium) for toolchain + filesystem (pre-built and more
   than just busybox) to be able to reproduce the testing in ARM
4. More testing (LTP, trinity, performance regressions etc.)
5. Move the ILP32 PCS out of beta (based on the results from 4)
6. Check the market again to see if anyone still needs ILP32
7. Based on 6, decide whether to merge the kernel and glibc patches

What's not explicitly mentioned in step 4 is glibc testing. Point 5 is
ARM's responsibility (toolchain folk).

> There are already test results for lmbench and specint. Do you think they
> are OK, or is more data needed to prove there is no regression?

I would need to reproduce the tests myself, see step 3.

> I have also noticed that there are ILP32 failures in the glibc testsuite.
> Is that the only blocker for merging ILP32 (on the technical side)?

It's probably not the only blocker but I have to review the kernel
patches again to make sure. I'd also like to see whether the libc-alpha
community is ok with the glibc counterpart (but don't merge the patches
until the ABI is agreed on both sides).

On performance, I want to make sure there are no regressions on
AArch32/compat and AArch64/LP64.

-- 
Catalin


Re: ILP32 for ARM64 - testing with lmbench

2016-11-16 Thread Zhangjian (Bamvor)

Hi, Maxim

On 2016/11/17 13:02, Maxim Kuvyrkov wrote:
> Hi Bamvor,
>
> I'm surprised that you see this much difference from ILP32 patches on SPEC
> CPU2006int at all.  The SPEC CPU2006 benchmarks spend almost no time in
> kernel syscalls.  I can imagine memory, TLB, and cache handling in the kernel
> could affect CPU2006 benchmarks.  Do ILP32 patches touch code in those areas?
>
> Other than that, it would be interesting to check what the variance is between
> the 3 iterations of benchmark runs.  Could you check what the relative standard
> deviation is between the 3 iterations -- (STDEV(RUN1, RUN2, RUN3) /
> RUNselected)?
>
> For reference, in my [non-ILP32] benchmarking I see 1.1% for 401.bzip2, 0.8%
> for 429.mcf, 0.2% for 456.hmmer, and 0.1% for 462.libquantum.

Here is my result:
                   ILP32_merged   ILP32_unmerged
  401.bzip2        0.31%          0.26%
  429.mcf          1.61%          1.36%
  456.hmmer        1.37%          1.57%
  462.libquantum   0.29%          0.28%

Regards

Bamvor




Re: ILP32 for ARM64 - testing with lmbench

2016-11-16 Thread Maxim Kuvyrkov
Hi Bamvor,

I'm surprised that you see this much difference from ILP32 patches on SPEC
CPU2006int at all.  The SPEC CPU2006 benchmarks spend almost no time in
kernel syscalls.  I can imagine memory, TLB, and cache handling in the kernel
could affect CPU2006 benchmarks.  Do ILP32 patches touch code in those areas?

Other than that, it would be interesting to check what the variance is between
the 3 iterations of benchmark runs.  Could you check what the relative standard
deviation is between the 3 iterations -- (STDEV(RUN1, RUN2, RUN3) /
RUNselected)?

For reference, in my [non-ILP32] benchmarking I see 1.1% for 401.bzip2,  0.8% 
for 429.mcf, 0.2% for 456.hmmer, and 0.1% for 462.libquantum.
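
The relative standard deviation asked about above can be computed along the
lines of the minimal sketch below. It assumes STDEV means the sample standard
deviation (n - 1 denominator) and that RUNselected is the median of the three
runs; both are assumptions, and the run times are made-up placeholders, not
measurements from this thread.

```python
import statistics


def relative_stdev(runs):
    """Return STDEV(runs) / selected run, as a fraction."""
    # Assumption: the "selected" run is the median of the iterations.
    selected = statistics.median(runs)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    return statistics.stdev(runs) / selected


# Hypothetical run times in seconds, for illustration only.
example_runs = [100.0, 101.0, 102.0]
print(f"{relative_stdev(example_runs):.2%}")
```

A value well below the difference between two kernels suggests that the
difference is not just run-to-run noise.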

--
Maxim Kuvyrkov
www.linaro.org




Re: ILP32 for ARM64 - testing with lmbench

2016-11-16 Thread Zhangjian (Bamvor)

Hi, all

I tested specint for aarch64 LP64 with aarch32 EL0 disabled and enabled, and
compared the results with an ILP32-unmerged kernel (4.8-rc6) on our arm64
board. The difference (ILP32 disabled vs. ILP32 unmerged) is bigger when
aarch32 EL0 is enabled than when it is disabled. bzip2, mcf, hmmer and
libquantum show the four largest differences [1]. Note that bigger is better
in the specint tests.

To confirm the above results, I retested these four test cases in the
reportable way (see the command at the end). The results [2] show that
libquantum decreases by 2.09% with ILP32 enabled and aarch32 on. I think it is
insignificant.

The lmbench results are not stable on my board. I plan to dig into that later.

[1] The following results were obtained with --size=ref --iterations=3.
1.1 Test when aarch32_el0 is enabled.
                    ILP32 disabled   base line
  400.perlbench        100.00%         100%
  401.bzip2             99.35%         100%
  403.gcc              100.26%         100%
  429.mcf              102.75%         100%
  445.gobmk            100.00%         100%
  456.hmmer             95.66%         100%
  458.sjeng            100.00%         100%
  462.libquantum       100.00%         100%
  471.omnetpp          100.59%         100%
  473.astar             99.66%         100%
  483.xalancbmk         99.10%         100%

1.2 Test when aarch32_el0 is disabled.
                    ILP32 disabled   base line
  400.perlbench        100.22%         100%
  401.bzip2            100.95%         100%
  403.gcc              100.20%         100%
  429.mcf              100.76%         100%
  445.gobmk            100.36%         100%
  456.hmmer             97.94%         100%
  458.sjeng             99.73%         100%
  462.libquantum        98.72%         100%
  471.omnetpp          100.86%         100%
  473.astar             99.15%         100%
  483.xalancbmk        100.08%         100%

[2] The following results were obtained with: runspec --config=my.cfg
--size=test,train,ref --noreportable --tune=base,peak --iterations=3 bzip2 mcf
hmmer libquantum
2.1 Test when aarch32_el0 is enabled.
                    ILP32_enabled    base line
  401.bzip2            100.82%         100%
  429.mcf              100.18%         100%
  456.hmmer             99.64%         100%
  462.libquantum        97.91%         100%
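
The -2.09% figure for libquantum follows directly from table 2.1: a score of
97.91% of the base line is 2.09% below it. A minimal sketch that derives the
change for all four benchmarks, with the scores copied from table 2.1 (the
function name is made up for illustration):

```python
# Scores from table 2.1, as a percentage of the base line (higher is better).
ilp32_enabled = {
    "401.bzip2": 100.82,
    "429.mcf": 100.18,
    "456.hmmer": 99.64,
    "462.libquantum": 97.91,
}


def change_vs_baseline(score_pct, baseline_pct=100.0):
    """Return the change relative to the base line, in percent."""
    return round((score_pct - baseline_pct) / baseline_pct * 100, 2)


for name, score in ilp32_enabled.items():
    print(f"{name:<16}{change_vs_baseline(score):+.2f}%")
```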

Regards

Bamvor


Re: ILP32 for ARM64 - testing with lmbench

2016-10-28 Thread Yury Norov
[Add Steve Ellcey, thanks for testing on ThunderX]

Lmbench-3.0-a9 testing was performed on a ThunderX machine to check that the
ILP32 series does not add performance regressions for LP64. The test
summary is in the table below. Our measurements don't show a
significant performance regression for LP64 if the ILP32 code is merged,
whether it is enabled or disabled.

               ILP32 enabled   ILP32 disabled   Standard Kernel
null syscall   0.1066          0.1121           0.1121
               95.09%          100.00%

stat           1.3947          1.3814           1.3864
               100.60%         99.64%

fstat          0.4459          0.4344           0.4524
               98.56%          96.02%

open/close     4.0606          4.0411           4.0453
               100.38%         99.90%

read           0.4819          0.5014           0.5014
               96.11%          100.00%
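
The percentage rows in the table above appear to be each patched kernel's
latency divided by the standard kernel's latency. The sketch below recomputes
them from the table values under that assumption; since these are latencies,
a value below 100% would mean the patched kernel was faster. The function name
is made up for illustration.

```python
# Latencies copied from the table:
# (ILP32 enabled, ILP32 disabled, standard kernel).
results = {
    "null syscall": (0.1066, 0.1121, 0.1121),
    "stat":         (1.3947, 1.3814, 1.3864),
    "fstat":        (0.4459, 0.4344, 0.4524),
    "open/close":   (4.0606, 4.0411, 4.0453),
    "read":         (0.4819, 0.5014, 0.5014),
}


def relative_to_standard(enabled, disabled, standard):
    """Express both patched-kernel latencies as a percentage of the standard kernel."""
    return (
        round(100 * enabled / standard, 2),
        round(100 * disabled / standard, 2),
    )


for name, row in results.items():
    enabled_pct, disabled_pct = relative_to_standard(*row)
    print(f"{name:<14}{enabled_pct:>8.2f}%{disabled_pct:>9.2f}%")
```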

Tested with linux 4.8 because 4.9-rc1 is not fixed yet for ThunderX.
Other system details below.

Yury.

ubuntu@crb6:~$ uname -a
Linux crb6 4.8.0+ #3 SMP Thu Oct 27 11:01:32 PDT 2016 aarch64 aarch64 aarch64 GNU/Linux

ubuntu@crb6:~$ cat /proc/meminfo
MemTotal:           132011948 kB
MemFree:            131442672 kB
MemAvailable:       130695764 kB
Buffers:                15696 kB
Cached:                 88088 kB
SwapCached:                 0 kB
Active:                 82760 kB
Inactive:               41336 kB
Active(anon):           20880 kB
Inactive(anon):          8576 kB
Active(file):           61880 kB
Inactive(file):         32760 kB
Unevictable:                0 kB
Mlocked:                    0 kB
SwapTotal:          128920572 kB
SwapFree:           128920572 kB
Dirty:                      0 kB
Writeback:                  0 kB
AnonPages:              20544 kB
Mapped:                 19780 kB
Shmem:                   9060 kB
Slab:                   78804 kB
SReclaimable:           27372 kB
SUnreclaim:             51432 kB
KernelStack:             8336 kB
PageTables:               820 kB
NFS_Unstable:               0 kB
Bounce:                     0 kB
WritebackTmp:               0 kB
CommitLimit:        194926544 kB
Committed_AS:          256324 kB
VmallocTotal:    135290290112 kB
VmallocUsed:                0 kB
VmallocChunk:               0 kB
AnonHugePages:              0 kB
ShmemHugePages:             0 kB
ShmemPmdMapped:             0 kB
CmaTotal:                   0 kB
CmaFree:                    0 kB
HugePages_Total:            0
HugePages_Free:             0
HugePages_Rsvd:             0
HugePages_Surp:             0
Hugepagesize:            2048 kB

ubuntu@crb6:~$ cat /proc/cpuinfo
processor   : 0
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 1
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 2
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 3
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 4
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 5
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 6
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 7
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 8
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 9
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 10
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x0a1
CPU revision: 0

processor   : 11
BogoMIPS: 200.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
CPU implementer