Jan Kiszka <[email protected]> writes:

> On 15.04.21 09:54, Philippe Gerum wrote:
>> 
>> Jan Kiszka <[email protected]> writes:
>> 
>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>
>>>> Jan Kiszka <[email protected]> writes:
>>>>
>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>> From: Philippe Gerum <[email protected]>
>>>>>>
>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>> a non-zero value would not even yield the proper result on some
>>>>>> architectures.
>>>>>>
>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>> csum_partial() to do the right thing.
>>>>>>
>>>>>
>>>>> If that issue only affects the ICMP code path, why not only changing
>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>> and csum in one step?
>>>>>
>>>>
>>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>>> kernel folding the copy and checksum operations anymore, so I see no way
>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>> value to be fed in?
>>>>
>>>
>>> rtskb_copy_and_csum_dev does not need that.
>>>
>> 
>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>> you want to deal with this?
>> 
>
> That is an internal API, so we don't care.
>
> But even if we converted rtskb_copy_and_csum_bits to work as it used to
> (with a csum != 0), there would be no reason to make
> rtskb_copy_and_csum_dev pay that price.
>

Well, there may be a reason to challenge the idea that a folded
copy_and_csum is better for a real-time system than a split memcpy+csum
in the first place. Out of curiosity, I recently ran a few benchmarks on
several SoCs, and it turned out that optimizing the data copy so the
buffer is quickly in place before checksumming the result may well yield
much better results with respect to jitter than what csum_and_copy
currently achieves on these SoCs.

Inline csum_and_copy did perform slightly better on average (by a couple
of hundred nanoseconds at best), but at times with pathological jitter
in the worst case. By contrast, the split memcpy+csum method did not
exhibit such jitter during these tests, not once.

- four SoCs tested (2 x x86, armv7, armv8a)
- test code ran in kernel space (real-time task context,
  out-of-band/primary context)
- csum_partial_copy_nocheck() vs memcpy()+csum_partial()
- 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
- all buffers (src & dst) aligned on L1_CACHE_BYTES
- each run performed 1,000,000 iterations of a given checksum loop, with
  no pause in between.
- no concurrent load on the board
- all results in nanoseconds

The worst results obtained are presented here for each SoC.

x86[1]
------

vendor_id       : GenuineIntel
cpu family      : 6
model           : 92
model name      : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
stepping        : 9
cpu MHz         : 1593.600
cache size      : 1024 KB
cpuid level     : 21
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology 
tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 
ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 
pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms 
mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves 
dtherm ida arat pln pts
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad 
ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid 
unrestricted_guest vapic_reg vid ple shadow_vmcs

==

CSUM_COPY 32b: min=68, max=653, avg=70
CSUM_COPY 1024b: min=248, max=373, avg=251
CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
COPY+CSUM 32b: min=101, max=790, avg=103
COPY+CSUM 1024b: min=297, max=397, avg=300
COPY+CSUM 1500b: min=402, max=490, avg=405

==

CSUM_COPY 32b: min=68, max=1420, avg=70
CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
CSUM_COPY 1500b: min=344, max=792, avg=350
COPY+CSUM 32b: min=101, max=872, avg=103
COPY+CSUM 1024b: min=297, max=776, avg=300
COPY+CSUM 1500b: min=402, max=853, avg=405

==

CSUM_COPY 32b: min=68, max=661, avg=70
CSUM_COPY 1024b: min=248, max=1714, avg=251
CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
COPY+CSUM 32b: min=101, max=610, avg=103
COPY+CSUM 1024b: min=297, max=605, avg=300
COPY+CSUM 1500b: min=402, max=712, avg=405

x86[2]
------

vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
stepping        : 6
microcode       : 0x60c
cpu MHz         : 1627.113
cache size      : 3072 KB
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 
monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm

CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
CSUM_COPY 1024b: min=558, max=2794, avg=745
CSUM_COPY 1500b: min=558, max=2794, avg=841
COPY+CSUM 32b: min=558, max=2794, avg=671
COPY+CSUM 1024b: min=558, max=2794, avg=778
COPY+CSUM 1500b: min=838, max=2794, avg=865

==

CSUM_COPY 32b: min=59, max=532, avg=62
CSUM_COPY 1024b: min=198, max=270, avg=201
CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
COPY+CSUM 32b: min=53, max=326, avg=56
COPY+CSUM 1024b: min=228, max=461, avg=232
COPY+CSUM 1500b: min=311, max=341, avg=317

==

CSUM_COPY 32b: min=59, max=382, avg=62
CSUM_COPY 1024b: min=198, max=383, avg=201
CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
COPY+CSUM 32b: min=52, max=300, avg=56
COPY+CSUM 1024b: min=228, max=348, avg=232
COPY+CSUM 1500b: min=311, max=409, avg=317

Cortex A9 quad-core 1.2 GHz (iMX6qp)
------------------------------------

model name      : ARMv7 Processor rev 10 (v7l)
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc09
CPU revision    : 10

CSUM_COPY 32b: min=333, max=1334, avg=440
CSUM_COPY 1024b: min=1000, max=2666, avg=1060
CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
COPY+CSUM 32b: min=333, max=1334, avg=476
COPY+CSUM 1024b: min=1000, max=2333, avg=1324
COPY+CSUM 1500b: min=1666, max=2334, avg=1713

==

CSUM_COPY 32b: min=333, max=1334, avg=439
CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
CSUM_COPY 1500b: min=1333, max=5000, avg=1351
COPY+CSUM 32b: min=333, max=1334, avg=476
COPY+CSUM 1024b: min=1000, max=2334, avg=1324
COPY+CSUM 1500b: min=1666, max=2667, avg=1713

==

CSUM_COPY 32b: min=333, max=1666, avg=454
CSUM_COPY 1024b: min=1000, max=2000, avg=1060
CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
COPY+CSUM 32b: min=333, max=1334, avg=454
COPY+CSUM 1024b: min=1000, max=2334, avg=1317
COPY+CSUM 1500b: min=1666, max=6000, avg=1712

Cortex A55 quad-core 2 GHz (Odroid C4)
--------------------------------------

Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp 
asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd05
CPU revision    : 0


CSUM_COPY 32b: min=125, max=833, avg=140
CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
CSUM_COPY 1500b: min=875, max=3875, avg=923
COPY+CSUM 32b: min=125, max=458, avg=140
COPY+CSUM 1024b: min=625, max=1166, avg=666
COPY+CSUM 1500b: min=875, max=1167, avg=913

==

CSUM_COPY 32b: min=125, max=1292, avg=139
CSUM_COPY 1024b: min=541, max=48333, avg=555
CSUM_COPY 1500b: min=708, max=3458, avg=740
COPY+CSUM 32b: min=125, max=292, avg=136
COPY+CSUM 1024b: min=541, max=750, avg=556
COPY+CSUM 1500b: min=708, max=834, avg=740

==

CSUM_COPY 32b: min=125, max=833, avg=140
CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
CSUM_COPY 1500b: min=875, max=4208, avg=913
COPY+CSUM 32b: min=125, max=375, avg=140
COPY+CSUM 1024b: min=666, max=916, avg=673
COPY+CSUM 1500b: min=875, max=1042, avg=913

============

A few additional observations from looking at the implementation:

For memcpy, the legacy x86[2] uses movsq, finishing with movsb to
complete buffers of unaligned length. The current x86[1] uses the
ERMS-optimized movsb, which is faster.

arm32/armv7 optimizes memcpy by loading up to 8 words with a single
instruction. csum_and_copy loads/stores at best 4 words at a time, and
only when src and dst are 32-bit aligned (which matches the test case).

arm64/armv8a uses load/store pair instructions to copy memory
blocks. It does not have asm-optimized csum_and_copy support, so it
uses the generic C version.

Differences in prefetching and speculation behavior between the two
approaches might explain some of the gaps as well.

I would be interested in any converging/diverging results from testing
the same combinations with different test code, because from my
standpoint, things do not seem as clear-cut as they are assumed to be at
the moment.

-- 
Philippe.
