Dear wsjt developers,

Out of curiosity, I've been experimenting with optimizing wsprd/wsprd_exp.c for ARMv7-A. Currently, on a dual-core ARM Cortex-A9 CPU, 'wsprd_exp -J -w' takes up to 70 seconds to decode 8 WSPR bands with several wsprd_exp processes running in parallel. That is still within the available 120 seconds, but I've been wondering by how much this time could be improved.

Basic profiling shows that most of the CPU time is spent in three functions: sync_and_demodulate, jelinek and subtract_signal2. I could think of two possible ways to optimize:

- The NEON coprocessor in the ARM Cortex-A9 has vector instructions that process 4 single-precision floating-point values simultaneously, and the code in sync_and_demodulate could be vectorized very easily.
- The NEON coprocessor supports only single-precision floating-point values, so replacing double with float in some calculation-intensive parts could improve performance.

First, I tried to manually vectorize the code in sync_and_demodulate and obtained the same results as with automatic vectorization. It looks like gcc does a very good job with the following flags:

  -O3 -march=armv7-a -mcpu=cortex-a9 -mtune=cortex-a9 -mfpu=neon -mfloat-abi=hard -ffast-math

Next, I managed to achieve about a 15% performance improvement by replacing double with float in some calculation-intensive parts of the code where precision does not seem to be critical. Basically, I replaced double with float in sync_and_demodulate, subtract_signal and subtract_signal2. With this 15% improvement, 70 seconds of CPU time could become about 60 (70*0.85).

I've tested this modified version with the set of 410 .wav files from http://physics.princeton.edu/pulsar/K1JT/wspr_data.tgz

The results obtained with the modified version and with the original version of wsprd_exp.c differ slightly. The modified version missed 4 spots and found 6 additional spots. The total number of spots is around 2800, so the effect is around 0.1-0.3% (4/2800 is about 0.14%, 6/2800 is about 0.21%).

Since this test would take hours on the ARM Cortex-A9 CPU, I ran it only on an Intel Pentium CPU. BTW, the performance improvement on the Intel Pentium CPU is ~5%. This smaller improvement is not unexpected, since SSE2 supports double-precision floating-point values.

At the moment, I'm not sure what to do with these results.
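
To illustrate the kind of double-to-float change described above, here is a minimal sketch. It is not the code from wsprd_exp.c: the function names corr_power_f64 and corr_power_f32 and their parameters are made up for illustration, and the loop is only a stand-in for a correlation-style inner loop.

```c
#include <stddef.h>

/* Hypothetical example, not taken from wsprd_exp.c.
 * Correlates I/Q samples against a reference cos/sin table and
 * returns the squared magnitude of the complex sum. */

/* Double-precision version. ARMv7 NEON has no double-precision vector
 * arithmetic, so this loop is expected to run as scalar VFP code. */
double corr_power_f64(const double *in_i, const double *in_q,
                      const double *ref_c, const double *ref_s, size_t n)
{
    double re = 0.0, im = 0.0;
    for (size_t k = 0; k < n; k++) {
        re += in_i[k] * ref_c[k] + in_q[k] * ref_s[k];
        im += in_q[k] * ref_c[k] - in_i[k] * ref_s[k];
    }
    return re * re + im * im;
}

/* Single-precision version of the same loop. With -mfpu=neon and
 * -ffast-math, gcc is allowed to auto-vectorize the two reductions,
 * processing 4 floats per NEON instruction. */
float corr_power_f32(const float *in_i, const float *in_q,
                     const float *ref_c, const float *ref_s, size_t n)
{
    float re = 0.0f, im = 0.0f;
    for (size_t k = 0; k < n; k++) {
        re += in_i[k] * ref_c[k] + in_q[k] * ref_s[k];
        im += in_q[k] * ref_c[k] - in_i[k] * ref_s[k];
    }
    return re * re + im * im;
}
```

The two functions differ only in the element type, which is why the change is mechanical. The float version accumulates rounding error faster and -ffast-math permits reordering of the sums, so small differences in borderline decodes are to be expected.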
The 15% improvement is not huge, and I don't know if it justifies the changes in the code. I'd be interested to know what you think about this.

Best regards,

Pavel

_______________________________________________
wsjt-devel mailing list
https://lists.sourceforge.net/lists/listinfo/wsjt-devel
