Jan Kiszka <[email protected]> writes:
> On 15.04.21 09:54, Philippe Gerum wrote:
>>
>> Jan Kiszka <[email protected]> writes:
>>
>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>
>>>> Jan Kiszka <[email protected]> writes:
>>>>
>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>> From: Philippe Gerum <[email protected]>
>>>>>>
>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>> a non-zero value would not even yield the proper result on some
>>>>>> architectures.
>>>>>>
>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to
>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for
>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>> csum_partial() to do the right thing.
>>>>>>
>>>>>
>>>>> If that issue only affects the ICMP code path, why not only change
>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>> and csum in one step?
>>>>>
>>>>
>>>> As a result of #cc44c17baf7f3, I see no common helper available from the
>>>> kernel folding the copy and checksum operations anymore, so I see no way
>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>> value to be fed in?
>>>>
>>>
>>> rtskb_copy_and_csum_dev does not need that.
>>>
>>
>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>> you want to deal with this?
>>
>
> That is an internal API, so we don't care.
>
> But even if we converted rtskb_copy_and_csum_bits to work as it used to
> (with a csum != 0), there would be no reason to make
> rtskb_copy_and_csum_dev pay that price.
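To make the pre/post-5.9 split discussed above concrete, here is a rough userspace sketch of the open-coded variant for >= 5.9 kernels. This is illustration only: csum_partial() below is a minimal generic C reimplementation of the 16-bit ones'-complement sum (the kernel uses per-arch optimized versions), and the exact net_csum_copy() signature in RTnet may differ.

```c
#include <stdint.h>
#include <string.h>

/*
 * Minimal generic ones'-complement sum in the spirit of the kernel's
 * csum_partial(): fold 'len' bytes of 'buf' into the running 32-bit
 * sum 'seed'. Illustration-only C, not the kernel's optimized asm.
 */
static uint32_t csum_partial(const void *buf, size_t len, uint32_t seed)
{
    const uint8_t *p = buf;
    uint64_t sum = seed;

    while (len > 1) {
        uint16_t w;
        memcpy(&w, p, 2);   /* alignment-safe 16-bit load */
        sum += w;
        p += 2;
        len -= 2;
    }
    if (len)                /* trailing odd byte */
        sum += *p;

    while (sum >> 32)       /* fold carries back into 32 bits */
        sum = (sum & 0xffffffffu) + (sum >> 32);
    return (uint32_t)sum;
}

/*
 * Hypothetical open-coded replacement for the >= 5.9 case: since
 * csum_partial_copy_nocheck() no longer takes a seed, split the
 * operation into a plain copy followed by a seeded checksum.
 */
static uint32_t net_csum_copy(void *dst, const void *src,
                              size_t len, uint32_t seed)
{
    memcpy(dst, src, len);
    return csum_partial(dst, len, seed);
}
```

Because the ones'-complement sum is commutative over 16-bit words, seeding the second half of a buffer with the checksum of an even-length first half yields the same result as checksumming the whole buffer, which is the property the ICMP path relies on.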
Well, there may be a reason to challenge the idea that a folded
copy_and_csum is better for a real-time system than a split memcpy+csum
in the first place.

Out of curiosity, I recently ran a few benchmarks on several SoCs
regarding this, and it turned out that optimizing the data copy to get
the buffer in place quickly before checksumming the result may well
yield much better results with respect to jitter than what
csum_and_copy currently achieves on these SoCs. The inline
csum_and_copy did perform slightly better on average (a couple of
hundred nanoseconds at best), but with pathological jitter in the worst
case at times. On the contrary, the split memcpy+csum method did not
exhibit such jitter during these tests, not once.

- four SoCs tested (2 x x86, armv7, armv8a)
- test code ran in kernel space (real-time task context, out-of-band/primary context)
- csum_partial_copy_nocheck() vs memcpy()+csum_partial()
- 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
- all buffers (src & dst) aligned on L1_CACHE_BYTES
- each run performed 1,000,000 iterations of a given checksum loop, no pause in between
- no concurrent load on the board
- all results in nanoseconds

The worst results obtained are presented here for each SoC.
x86[1]
------

vendor_id : GenuineIntel
cpu family : 6
model : 92
model name : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
stepping : 9
cpu MHz : 1593.600
cache size : 1024 KB
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs

==
CSUM_COPY 32b: min=68, max=653, avg=70
CSUM_COPY 1024b: min=248, max=373, avg=251
CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
COPY+CSUM 32b: min=101, max=790, avg=103
COPY+CSUM 1024b: min=297, max=397, avg=300
COPY+CSUM 1500b: min=402, max=490, avg=405
==
CSUM_COPY 32b: min=68, max=1420, avg=70
CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
CSUM_COPY 1500b: min=344, max=792, avg=350
COPY+CSUM 32b: min=101, max=872, avg=103
COPY+CSUM 1024b: min=297, max=776, avg=300
COPY+CSUM 1500b: min=402, max=853, avg=405
==
CSUM_COPY 32b: min=68, max=661, avg=70
CSUM_COPY 1024b: min=248, max=1714, avg=251
CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
COPY+CSUM 32b: min=101, max=610, avg=103
COPY+CSUM 1024b: min=297, max=605, avg=300
COPY+CSUM 1500b: min=402, max=712, avg=405

x86[2]
------

vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz
stepping : 6
microcode : 0x60c
cpu MHz : 1627.113
cache size : 3072 KB
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm

CSUM_COPY 32b: min=558, max=31010, avg=674   <=================
CSUM_COPY 1024b: min=558, max=2794, avg=745
CSUM_COPY 1500b: min=558, max=2794, avg=841
COPY+CSUM 32b: min=558, max=2794, avg=671
COPY+CSUM 1024b: min=558, max=2794, avg=778
COPY+CSUM 1500b: min=838, max=2794, avg=865
==
CSUM_COPY 32b: min=59, max=532, avg=62
CSUM_COPY 1024b: min=198, max=270, avg=201
CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
COPY+CSUM 32b: min=53, max=326, avg=56
COPY+CSUM 1024b: min=228, max=461, avg=232
COPY+CSUM 1500b: min=311, max=341, avg=317
==
CSUM_COPY 32b: min=59, max=382, avg=62
CSUM_COPY 1024b: min=198, max=383, avg=201
CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
COPY+CSUM 32b: min=52, max=300, avg=56
COPY+CSUM 1024b: min=228, max=348, avg=232
COPY+CSUM 1500b: min=311, max=409, avg=317

Cortex A9 quad-core 1.2Ghz (iMX6qp)
-----------------------------------

model name : ARMv7 Processor rev 10 (v7l)
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10

CSUM_COPY 32b: min=333, max=1334, avg=440
CSUM_COPY 1024b: min=1000, max=2666, avg=1060
CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
COPY+CSUM 32b: min=333, max=1334, avg=476
COPY+CSUM 1024b: min=1000, max=2333, avg=1324
COPY+CSUM 1500b: min=1666, max=2334, avg=1713
==
CSUM_COPY 32b: min=333, max=1334, avg=439
CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
CSUM_COPY 1500b: min=1333, max=5000, avg=1351
COPY+CSUM 32b: min=333, max=1334, avg=476
COPY+CSUM 1024b: min=1000, max=2334, avg=1324
COPY+CSUM 1500b: min=1666, max=2667, avg=1713
==
CSUM_COPY 32b: min=333, max=1666, avg=454
CSUM_COPY 1024b: min=1000, max=2000, avg=1060
CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
COPY+CSUM 32b: min=333, max=1334, avg=454
COPY+CSUM 1024b: min=1000, max=2334, avg=1317
COPY+CSUM 1500b: min=1666, max=6000, avg=1712

Cortex A55 quad-core 2Ghz (Odroid C4)
-------------------------------------

Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd05
CPU revision : 0

CSUM_COPY 32b: min=125, max=833, avg=140
CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
CSUM_COPY 1500b: min=875, max=3875, avg=923
COPY+CSUM 32b: min=125, max=458, avg=140
COPY+CSUM 1024b: min=625, max=1166, avg=666
COPY+CSUM 1500b: min=875, max=1167, avg=913
==
CSUM_COPY 32b: min=125, max=1292, avg=139
CSUM_COPY 1024b: min=541, max=48333, avg=555
CSUM_COPY 1500b: min=708, max=3458, avg=740
COPY+CSUM 32b: min=125, max=292, avg=136
COPY+CSUM 1024b: min=541, max=750, avg=556
COPY+CSUM 1500b: min=708, max=834, avg=740
==
CSUM_COPY 32b: min=125, max=833, avg=140
CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
CSUM_COPY 1500b: min=875, max=4208, avg=913
COPY+CSUM 32b: min=125, max=375, avg=140
COPY+CSUM 1024b: min=666, max=916, avg=673
COPY+CSUM 1500b: min=875, max=1042, avg=913

============

A few additional observations from looking at the implementations:

For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb,
which is faster.

arm32/armv7 optimizes memcpy by loading up to 8 words in a single
instruction. csum_and_copy loads/stores at best 4 words at a time, and
only when src and dst are 32-bit aligned (which matches the test case).

arm64/armv8a uses load/store pair instructions to copy memory blocks.
It does not have asm-optimized csum_and_copy support, so it falls back
to the generic C version.

Differences in prefetching and speculation behavior might explain part
of the gap between the two approaches as well.

I would be interested in any converging / diverging results from
testing the same combo with different test code, because from my
standpoint, things do not seem as obvious as they are supposed to be at
the moment.

-- 
Philippe.
