Re: Is ARM64 officially supported ?

2020-03-19 Thread dormando
memtier is trash. Check the README for mc-crusher, I just updated it a bit
a day or two ago. Those numbers are incredibly low, I'd have to dig a
laptop out of the 90's to get something to perform that badly.

mc-crusher runs blindly and you use the other utilities that come with it
to find command rates and sample the latency while the benchmark runs.
Almost all 3rd party memcached benchmarks end up benchmarking the
benchmark tool, not the server. I know mc-crusher doesn't make it very
obvious how to use it, though, sorry.
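
To illustrate the "sample while the benchmark runs" idea, here is a minimal
sketch that polls memcached's own `stats` command and turns the `cmd_get` and
`cmd_set` counters into per-second rates. This is not one of the utilities
shipped with mc-crusher; the function names (`parse_stats`, `fetch_stats`,
`rate`, `sample`) are hypothetical, and the host and port are whatever your
server listens on.

```python
import socket
import time


def parse_stats(payload):
    """Parse the text reply of memcached's `stats` command into a dict."""
    stats = {}
    for line in payload.splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host, port):
    """Send `stats` over the text protocol and read until the END marker."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return parse_stats(buf.decode("ascii", "replace"))


def rate(before, after, key, seconds):
    """Counter delta per second between two stats snapshots."""
    return (int(after[key]) - int(before[key])) / seconds


def sample(host, port, interval=1.0, rounds=5):
    """Print get/set rates once per interval while a load generator runs."""
    prev = fetch_stats(host, port)
    for _ in range(rounds):
        time.sleep(interval)
        cur = fetch_stats(host, port)
        gets = rate(prev, cur, "cmd_get", interval)
        sets = rate(prev, cur, "cmd_set", interval)
        print(f"gets/sec={gets:.0f} sets/sec={sets:.0f}")
        prev = cur
```

Calling `sample("127.0.0.1", 11211)` from a second terminal while the load
runs gives server-side command rates that do not depend on what the load
generator reports about itself.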

A really quick untuned test against my Raspberry Pi 3 nets 92,000
gets/sec (with mc-crusher running on a different machine). On a Xeon machine
I can get tens of millions of ops/sec depending on the read/write ratio.
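
One way to see why the memtier totals quoted below say little about the
server: by Little's law, throughput times latency gives the average number of
requests in flight. The sketch below applies that to the four "Totals" rows.
It assumes the Latency column is in milliseconds, which the thread does not
state, and `in_flight` is a hypothetical helper name.

```python
def in_flight(ops_per_sec, latency_ms):
    """Little's law: average concurrency = throughput * latency."""
    return ops_per_sec * (latency_ms / 1000.0)


# "Totals" rows from the memtier output quoted in this thread.
totals = {
    "ARM64 text": (10827.28, 20.02000),
    "X86 text": (10231.26, 20.30200),
    "ARM64 binary": (9117.37, 23.55200),
    "X86 binary": (9113.42, 23.59100),
}

for name, (ops, latency) in totals.items():
    print(f"{name}: ~{in_flight(ops, latency):.0f} requests in flight")
```

All four runs come out at a little over 200 requests in flight at roughly
20 ms each. With concurrency fixed at that level, ops/sec is set almost
entirely by the round-trip latency, so the two servers look alike no matter
how fast either one is.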

On Thu, 19 Mar 2020, Martin Grigorov wrote:

> Hi
>
> I've done some local performance testing
>
> First I tried with https://github.com/memcached/mc-crusher but it seems it 
> doesn't calculate any statistics after the load runs.
>
> The results below are from https://github.com/RedisLabs/memtier_benchmark
>
> 1) Text
> ./memtier_benchmark --server XYZ --port 12345 -P memcache_text
>
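> ARM64 text
> ========================================================================
> Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
> ------------------------------------------------------------------------
> Sets          985.28          ---          ---     20.02700        67.22
> Gets         9842.00         0.00      9842.00     20.01900       248.83
> Waits           0.00          ---          ---      0.0          ---
> Totals      10827.28         0.00      9842.00     20.02000       316.05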
>
>
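> X86 text
> ========================================================================
> Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
> ------------------------------------------------------------------------
> Sets          931.04          ---          ---     20.06800        63.52
> Gets         9300.21         0.00      9300.21     20.32600       235.13
> Waits           0.00          ---          ---      0.0          ---
> Totals      10231.26         0.00      9300.21     20.30200       298.66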
>
>
>
> 2) Binary
> ./memtier_benchmark --server XYZ --port 12345 -P memcache_binary
>
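> ARM64 binary
> ========================================================================
> Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
> ------------------------------------------------------------------------
> Sets          829.68          ---          ---     23.46500        63.90
> Gets         8287.69         0.00      8287.69     23.56100       314.75
> Waits           0.00          ---          ---      0.0          ---
> Totals       9117.37         0.00      8287.69     23.55200       378.65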
>
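> X86 binary
> ========================================================================
> Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
> ------------------------------------------------------------------------
> Sets          829.32          ---          ---     23.63600        63.87
> Gets         8284.10         0.00      8284.10     23.58600       314.61
> Waits           0.00          ---          ---      0.0          ---
> Totals       9113.42         0.00      8284.10     23.59100       378.48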
>
>
>
> Text is faster on the ARM64. Binary is similar for both.
>
> The benchmarking tool runs on a different machine from the ones running
> Memcached.
>
> The ARM64 server has this spec:
>
> $ lscpu
> Architecture:        aarch64
> Byte Order:          Little Endian
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  1
> Core(s) per socket:  4
> Socket(s):           1
> NUMA node(s):        1
> Vendor ID:           0x48
> Model:               0
> Stepping:            0x1
> BogoMIPS:            200.00
> L1d cache:           64K
> L1i cache:           64K
> L2 cache:            512K
> L3 cache:            32768K
> NUMA node0 CPU(s):   0-3
> Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp 
> asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
>
>
> The x64 one:
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           1
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               85
> Model name:          Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
> Stepping:            7
> CPU MHz:             3000.000
> BogoMIPS:            6000.00
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            1024K
> L3 cache:            30976K
> NUMA node0 CPU(s):   0-3
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 

Re: Is ARM64 officially supported ?

2020-03-19 Thread Martin Grigorov
Hi

I've done some local performance testing

First I tried with https://github.com/memcached/mc-crusher but it seems it
doesn't calculate any statistics after the load runs.

The results below are from https://github.com/RedisLabs/memtier_benchmark

1) Text
./memtier_benchmark --server XYZ --port 12345 -P memcache_text

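ARM64 text
========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
------------------------------------------------------------------------
Sets          985.28          ---          ---     20.02700        67.22
Gets         9842.00         0.00      9842.00     20.01900       248.83
Waits           0.00          ---          ---      0.0          ---
Totals      10827.28         0.00      9842.00     20.02000       316.05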


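X86 text
========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
------------------------------------------------------------------------
Sets          931.04          ---          ---     20.06800        63.52
Gets         9300.21         0.00      9300.21     20.32600       235.13
Waits           0.00          ---          ---      0.0          ---
Totals      10231.26         0.00      9300.21     20.30200       298.66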



2) Binary
./memtier_benchmark --server XYZ --port 12345 -P memcache_binary

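ARM64 binary
========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
------------------------------------------------------------------------
Sets          829.68          ---          ---     23.46500        63.90
Gets         8287.69         0.00      8287.69     23.56100       314.75
Waits           0.00          ---          ---      0.0          ---
Totals       9117.37         0.00      8287.69     23.55200       378.65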

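X86 binary
========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
------------------------------------------------------------------------
Sets          829.32          ---          ---     23.63600        63.87
Gets         8284.10         0.00      8284.10     23.58600       314.61
Waits           0.00          ---          ---      0.0          ---
Totals       9113.42         0.00      8284.10     23.59100       378.48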



Text is faster on the ARM64. Binary is similar for both.
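
To put a number on that comparison, the relative difference between the
"Totals" ops/sec figures above can be worked out as follows. This is just
arithmetic on the reported values; `pct_diff` is a hypothetical helper name.

```python
def pct_diff(a, b):
    """Relative difference of a over b, in percent."""
    return (a - b) / b * 100.0


# Totals ops/sec from the tables above: (ARM64, X86).
text = (10827.28, 10231.26)
binary = (9117.37, 9113.42)

print(f"text:   ARM64 is {pct_diff(*text):.2f}% ahead of X86")
print(f"binary: ARM64 is {pct_diff(*binary):.2f}% ahead of X86")
```

That works out to roughly 6% for text and well under 0.1% for binary.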

The benchmarking tool runs on a different machine from the ones running
Memcached.

The ARM64 server has this spec:

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           0x48
Model:               0
Stepping:            0x1
BogoMIPS:            200.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-3
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm


The x64 one:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
Stepping:            7
CPU MHz:             3000.000
BogoMIPS:            6000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            30976K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid
tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe
popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase
tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq
rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec
xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities

Both with 16GB RAM.


Regards,
Martin

On Mon, Mar 9, 2020 at 11:23 AM Martin Grigorov wrote:

> Hi Dormando,
>
> On Mon, Mar 9, 2020 at 9:19 AM Martin Grigorov wrote:
>
>> Hi Dormando,
>>
>> On Fri, Mar 6, 2020 at 10:15 PM dormando wrote:
>>
>>> Yo,
>>>
>>> Just to add in: yes we support ARM64. Though my build test platform is a
>>> raspberry pi 3 and I haven't done any serious performance work.
>>> packet.net
>>> had an arm test