On 04.05.21 16:48, Philippe Gerum wrote:
> 
> Philippe Gerum <[email protected]> writes:
> 
>> Philippe Gerum <[email protected]> writes:
>>
>>> Jan Kiszka <[email protected]> writes:
>>>
>>>> On 16.04.21 18:48, Philippe Gerum wrote:
>>>>>
>>>>> Jan Kiszka <[email protected]> writes:
>>>>>
>>>>>> On 15.04.21 09:54, Philippe Gerum wrote:
>>>>>>>
>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>
>>>>>>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>>>>>>
>>>>>>>>> Jan Kiszka <[email protected]> writes:
>>>>>>>>>
>>>>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>>>>>>> From: Philippe Gerum <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as
>>>>>>>>>>> its last argument to csum_partial(). According to #cc44c17baf7f3,
>>>>>>>>>>> passing a non-zero value would not even yield the proper result on
>>>>>>>>>>> some architectures.
>>>>>>>>>>>
>>>>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>>>>>>> to be used in the next computation, so let's wrap net_csum_copy()
>>>>>>>>>>> around csum_partial_copy_nocheck() for pre-5.9 kernels, and open
>>>>>>>>>>> code it for later kernels so that we still feed csum_partial() with
>>>>>>>>>>> the user-given csum. We still expect the x86, ARM and arm64
>>>>>>>>>>> implementations of csum_partial() to do the right thing.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If that issue only affects the ICMP code path, why not change only
>>>>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>>>>>>> and csum in one step?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As a result of #cc44c17baf7f3, I see no common helper available from
>>>>>>>>> the kernel folding the copy and checksum operations anymore, so I see
>>>>>>>>> no way to keep rtskb_copy_and_csum_bits() as is. Did you find an
>>>>>>>>> all-in-one replacement for csum_partial_copy_nocheck() which would
>>>>>>>>> allow a csum value to be fed in?
>>>>>>>>>
>>>>>>>>
>>>>>>>> rtskb_copy_and_csum_dev does not need that.
>>>>>>>>
>>>>>>>
>>>>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>>>>>>> you want to deal with this?
>>>>>>>
>>>>>>
>>>>>> That is an internal API, so we don't care.
>>>>>>
>>>>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to
>>>>>> (with a csum != 0), there would be no reason to make
>>>>>> rtskb_copy_and_csum_dev pay that price.
>>>>>>
>>>>>
>>>>> Well, there may be a reason to challenge the idea that a folded
>>>>> copy_and_csum is better for a real-time system than a split memcpy+csum
>>>>> in the first place. Out of curiosity, I ran a few benchmarks lately on a
>>>>> few SoCs regarding this, and it turned out that optimizing the data copy
>>>>> to get the buffer quickly in place before checksumming the result may
>>>>> well yield much better results with respect to jitter than what
>>>>> csum_and_copy currently achieves on these SoCs.
>>>>>
>>>>> Inline csum_and_copy did perform slightly better on average (a couple of
>>>>> hundred nanoseconds at best) but with pathological jitter in the worst
>>>>> case at times. On the contrary, the split memcpy+csum method did not
>>>>> exhibit such jitter during these tests, not once.
>>>>>
>>>>> - four SoCs tested (2 x x86, armv7, armv8a)
>>>>> - test code ran in kernel space (real-time task context,
>>>>>   out-of-band/primary context)
>>>>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial()
>>>>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
>>>>> - all buffers (src & dst) aligned on L1_CACHE_BYTES
>>>>> - each run performed 1,000,000 iterations of a given checksum loop, no
>>>>>   pause in between.
>>>>> - no concurrent load on the board
>>>>> - all results in nanoseconds
>>>>>
>>>>> The worst results obtained are presented here for each SoC.
>>>>>
>>>>> x86[1]
>>>>> ------
>>>>>
>>>>> vendor_id : GenuineIntel
>>>>> cpu family        : 6
>>>>> model             : 92
>>>>> model name        : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
>>>>> stepping  : 9
>>>>> cpu MHz           : 1593.600
>>>>> cache size        : 1024 KB
>>>>> cpuid level       : 21
>>>>> wp                : yes
>>>>> flags             : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
>>>>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
>>>>> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts 
>>>>> rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf 
>>>>> tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 
>>>>> xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave 
>>>>> rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi 
>>>>> flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a 
>>>>> rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves 
>>>>> dtherm ida arat pln pts
>>>>> vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad 
>>>>> ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid 
>>>>> unrestricted_guest vapic_reg vid ple shadow_vmcs
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=68, max=653, avg=70
>>>>> CSUM_COPY 1024b: min=248, max=373, avg=251
>>>>> CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
>>>>> COPY+CSUM 32b: min=101, max=790, avg=103
>>>>> COPY+CSUM 1024b: min=297, max=397, avg=300
>>>>> COPY+CSUM 1500b: min=402, max=490, avg=405
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=68, max=1420, avg=70
>>>>> CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
>>>>> CSUM_COPY 1500b: min=344, max=792, avg=350
>>>>> COPY+CSUM 32b: min=101, max=872, avg=103
>>>>> COPY+CSUM 1024b: min=297, max=776, avg=300
>>>>> COPY+CSUM 1500b: min=402, max=853, avg=405
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=68, max=661, avg=70
>>>>> CSUM_COPY 1024b: min=248, max=1714, avg=251
>>>>> CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
>>>>> COPY+CSUM 32b: min=101, max=610, avg=103
>>>>> COPY+CSUM 1024b: min=297, max=605, avg=300
>>>>> COPY+CSUM 1500b: min=402, max=712, avg=405
>>>>>
>>>>> x86[2]
>>>>> ------
>>>>>
>>>>> vendor_id       : GenuineIntel
>>>>> cpu family      : 6
>>>>> model           : 23
>>>>> model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
>>>>> stepping        : 6
>>>>> microcode       : 0x60c
>>>>> cpu MHz         : 1627.113
>>>>> cache size      : 3072 KB
>>>>> cpuid level     : 10
>>>>> wp              : yes
>>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
>>>>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall 
>>>>> nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf 
>>>>> pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti 
>>>>> dtherm
>>>>>
>>>>> CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
>>>>> CSUM_COPY 1024b: min=558, max=2794, avg=745
>>>>> CSUM_COPY 1500b: min=558, max=2794, avg=841
>>>>> COPY+CSUM 32b: min=558, max=2794, avg=671
>>>>> COPY+CSUM 1024b: min=558, max=2794, avg=778
>>>>> COPY+CSUM 1500b: min=838, max=2794, avg=865
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=59, max=532, avg=62
>>>>> CSUM_COPY 1024b: min=198, max=270, avg=201
>>>>> CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
>>>>> COPY+CSUM 32b: min=53, max=326, avg=56
>>>>> COPY+CSUM 1024b: min=228, max=461, avg=232
>>>>> COPY+CSUM 1500b: min=311, max=341, avg=317
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=59, max=382, avg=62
>>>>> CSUM_COPY 1024b: min=198, max=383, avg=201
>>>>> CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
>>>>> COPY+CSUM 32b: min=52, max=300, avg=56
>>>>> COPY+CSUM 1024b: min=228, max=348, avg=232
>>>>> COPY+CSUM 1500b: min=311, max=409, avg=317
>>>>>
>>>>> Cortex A9 quad-core 1.2Ghz (iMX6qp)
>>>>> -----------------------------------
>>>>>
>>>>> model name        : ARMv7 Processor rev 10 (v7l)
>>>>> Features  : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
>>>>> CPU implementer   : 0x41
>>>>> CPU architecture: 7
>>>>> CPU variant       : 0x2
>>>>> CPU part  : 0xc09
>>>>> CPU revision      : 10
>>>>>
>>>>> CSUM_COPY 32b: min=333, max=1334, avg=440
>>>>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060
>>>>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
>>>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>>>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324
>>>>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=333, max=1334, avg=439
>>>>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
>>>>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351
>>>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324
>>>>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=333, max=1666, avg=454
>>>>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060
>>>>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
>>>>> COPY+CSUM 32b: min=333, max=1334, avg=454
>>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317
>>>>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712
>>>>>
>>>>> Cortex A55 quad-core 2Ghz (Odroid C4)
>>>>> -------------------------------------
>>>>>
>>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp 
>>>>> asimdhp cpuid asimdrdm lrcpc dcpop asimddp
>>>>> CPU implementer : 0x41
>>>>> CPU architecture: 8
>>>>> CPU variant     : 0x1
>>>>> CPU part        : 0xd05
>>>>> CPU revision    : 0
>>>>>
>>>>>
>>>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>>>> CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
>>>>> CSUM_COPY 1500b: min=875, max=3875, avg=923
>>>>> COPY+CSUM 32b: min=125, max=458, avg=140
>>>>> COPY+CSUM 1024b: min=625, max=1166, avg=666
>>>>> COPY+CSUM 1500b: min=875, max=1167, avg=913
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=125, max=1292, avg=139
>>>>> CSUM_COPY 1024b: min=541, max=48333, avg=555
>>>>> CSUM_COPY 1500b: min=708, max=3458, avg=740
>>>>> COPY+CSUM 32b: min=125, max=292, avg=136
>>>>> COPY+CSUM 1024b: min=541, max=750, avg=556
>>>>> COPY+CSUM 1500b: min=708, max=834, avg=740
>>>>>
>>>>> ==
>>>>>
>>>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>>>> CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
>>>>> CSUM_COPY 1500b: min=875, max=4208, avg=913
>>>>> COPY+CSUM 32b: min=125, max=375, avg=140
>>>>> COPY+CSUM 1024b: min=666, max=916, avg=673
>>>>> COPY+CSUM 1500b: min=875, max=1042, avg=913
>>>>>
>>>>> ============
>>>>>
>>>>> A few additional observations from looking at the implementation:
>>>>>
>>>>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
>>>>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb
>>>>> which is faster.
>>>>>
>>>>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single
>>>>> instruction. csum_and_copy loads/stores at best 4 words at a time,
>>>>> only when src and dst are 32bit aligned (which matches the test case).
>>>>>
>>>>> arm64/armv8a uses load/store pair instructions to copy memory
>>>>> blocks. It does not have asm-optimized csum_and_copy support, so it
>>>>> uses the generic C version.
>>>>>
>>>>> What could be inferred in terms of prefetching and speculation might
>>>>> explain some differences between the approaches too.
>>>>>
>>>>> I would be interested in any converging / diverging results testing the
>>>>> same combo with a different test code, because from my standpoint,
>>>>> things do not seem as obvious as they are supposed to be at the moment.
>>>>>
>>>>
>>>> If copy+csum is not using any recent memcpy optimizations, that is an
>>>> argument for at least equivalent performance.
>>>>
>>>
>>> You mean the folded version, i.e. copy_and_csum? If so, I can't see any
>>> way for that one to optimize via fast string operations.
>>>
>>>> But I don't get yet where the huge jitter could be coming from. Was
>>>> the measurement loop preemptible? In that case I would expect a split
>>>
>>> Out of band stage, so only preemptible by Xenomai timer ticks, which
>>> means only the host tick emulation at this point since there was no
>>> outstanding Xenomai timers started yet when running the loops. Pretty
>>> slim chance to see these latency spots consistently reproduced, and only
>>> for the folded csum_copy version.
>>>
>>>> copy followed by another loop to csum should give much worse results as
>>>> it needs the cache to stay warm - while copy-csum only touches the data
>>>> once.
>>>>
>>>
>>> Conversely, if the copy is much faster, the odds of being preempted may
>>> decrease, yielding better results overall.
>>
>> False alarm. Preemption was the issue, by the top half of the host tick
>> handling in primary mode. The latest clock event scheduled by the kernel
>> managed to enter the pipeline at a random time, but always within the
>> execution window of the all-in-one csum_and_copy code. Although this
>> event was deferred and not immediately passed to the in-band context,
>> the time spent dealing with it was enough to show up in the results.
>>
>>> This said, I'm unsure this is
>>> related to preemption anyway; this looks like the fingerprints of minor
>>> faults with PTEs. Why this would only happen in the folded version is
>>> still a mystery to me at the moment.
>>
>> It did not actually, no minor faults.
>>
>> The results are now consistent, both implementations are comparable
>> performance-wise as the optimized memcpy tends to offset the advantage
>> of calculating the checksum on the fly, saving a read access. armv8
>> benefits more from the former, since it does not have an optimized
>> csum_and_copy but uses the generic C version instead.
>>
>> == x86[1]
>>
>> CSUM_COPY 32b: min=68, max=640, avg=70
>> CSUM_COPY 1024b: min=247, max=773, avg=252
>> CSUM_COPY 1500b: min=343, max=832, avg=350
>> COPY+CSUM 32b: min=100, max=651, avg=131
>> COPY+CSUM 1024b: min=296, max=752, avg=298
>> COPY+CSUM 1500b: min=397, max=845, avg=400
>>
>> == x86[2]
>>
>> CSUM_COPY 32b: min=63, max=267, avg=66
>> CSUM_COPY 1024b: min=198, max=300, avg=201
>> CSUM_COPY 1500b: min=288, max=611, avg=291
>> COPY+CSUM 32b: min=56, max=360, avg=56
>> COPY+CSUM 1024b: min=228, max=420, avg=231
>> COPY+CSUM 1500b: min=307, max=337, avg=318
>>
>> == armv7 (imx6qp)
>>
>> CSUM_COPY 32b: min=333, max=1334, avg=439
>> CSUM_COPY 1024b: min=1000, max=2000, avg=1045
>> CSUM_COPY 1500b: min=1000, max=2334, avg=1325
>> COPY+CSUM 32b: min=333, max=1334, avg=454
>> COPY+CSUM 1024b: min=1333, max=2334, avg=1347
>> COPY+CSUM 1500b: min=1666, max=2667, avg=1734
>>
>> == armv8a (C4)
>>
>> CSUM_COPY 32b: min=125, max=792, avg=130
>> CSUM_COPY 1024b: min=500, max=1125, avg=550
>> CSUM_COPY 1500b: min=708, max=1833, avg=726
>> COPY+CSUM 32b: min=125, max=292, avg=130
>> COPY+CSUM 1024b: min=541, max=708, avg=550
>> COPY+CSUM 1500b: min=708, max=875, avg=730
> 
> Last round of results about this issue, now measuring the csum_copy vs
> csum+copy performance in idle vs busy scenarios. Busy means a
> hackbench+dd loop streaming 128M in the background from zero -> null,
> in order to badly thrash the D-caches while the test runs. All figures
> in nanoseconds.
> 
> iMX6QP (Cortex A9)
> ------------------
> 
> === idle
> 
> CSUM_COPY 32b: min=333, max=1333, avg=439
> CSUM_COPY 1024b: min=1000, max=2000, avg=1045
> CSUM_COPY 1500b: min=1333, max=2000, avg=1333
> COPY+CSUM 32b: min=333, max=1333, avg=443
> COPY+CSUM 1024b: min=1000, max=2334, avg=1345
> COPY+CSUM 1500b: min=1666, max=2667, avg=1737
> 
> === busy
> 
> CSUM_COPY 32b: min=333, max=4333, avg=466
> CSUM_COPY 1024b: min=1000, max=5000, avg=1088
> CSUM_COPY 1500b: min=1333, max=5667, avg=1393
> COPY+CSUM 32b: min=333, max=1334, avg=454
> COPY+CSUM 1024b: min=1000, max=2000, avg=1341
> COPY+CSUM 1500b: min=1666, max=2666, avg=1745
> 
> C4 (Cortex A55)
> ---------------
> 
> === idle
> 
> CSUM_COPY 32b: min=125, max=791, avg=130
> CSUM_COPY 1024b: min=541, max=834, avg=550
> CSUM_COPY 1500b: min=708, max=1875, avg=740
> COPY+CSUM 32b: min=125, max=167, avg=133
> COPY+CSUM 1024b: min=541, max=625, avg=553
> COPY+CSUM 1500b: min=708, max=750, avg=730
> 
> === busy
> 
> CSUM_COPY 32b: min=125, max=792, avg=133
> CSUM_COPY 1024b: min=500, max=2000, avg=552
> CSUM_COPY 1500b: min=708, max=1542, avg=744
> COPY+CSUM 32b: min=125, max=375, avg=133
> COPY+CSUM 1024b: min=500, max=709, avg=553
> COPY+CSUM 1500b: min=708, max=916, avg=743
> 
> x86 (atom x5)
> -------------
> 
> === idle
> 
> CSUM_COPY 32b: min=67, max=590, avg=70
> CSUM_COPY 1024b: min=245, max=385, avg=251
> CSUM_COPY 1500b: min=343, max=521, avg=350
> COPY+CSUM 32b: min=101, max=679, avg=117
> COPY+CSUM 1024b: min=296, max=379, avg=298
> COPY+CSUM 1500b: min=399, max=502, avg=404
> 
> === busy
> 
> CSUM_COPY 32b: min=65, max=709, avg=71
> CSUM_COPY 1024b: min=243, max=702, avg=252
> CSUM_COPY 1500b: min=340, max=1055, avg=351
> COPY+CSUM 32b: min=100, max=665, avg=120
> COPY+CSUM 1024b: min=295, max=669, avg=298
> COPY+CSUM 1500b: min=399, max=686, avg=403
> 
> As expected from the code, arm64, which has no folded csum_copy
> implementation, makes the best of using the split copy+csum path. All
> architectures seem to benefit from the optimized memcpy under load when
> it comes to worst-case execution time. x86 is less prone to jitter
> under cache thrashing than the others, as usual, but even there the
> max. figures for csum+copy in the busy context look pretty much on par
> with the csum_copy version.
> 

Then let's go for your conversion - but then possibly even
unconditionally, no?

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
