Dear wsjt developers,

Out of curiosity, I've been experimenting with optimizing
wsprd/wsprd_exp.c for ARMv7-A. Currently, on a dual-core ARM Cortex-A9
CPU, it takes 'wsprd_exp -J -w' up to 70 seconds to decode 8 WSPR bands
with several wsprd_exp processes running in parallel. That's still within
the available 120 seconds, but I've been wondering how much this time
could be improved.

Basic profiling shows that most of the CPU time is spent in 3 functions:
sync_and_demodulate, jelinek and subtract_signal2.

I can think of two possible optimizations:
  - The NEON coprocessor in the ARM Cortex-A9 has vector instructions that
process 4 single-precision floating-point values simultaneously, and the
code in sync_and_demodulate could be vectorized very easily.
  - The NEON coprocessor supports only single-precision floating-point
values, so replacing double with float in some calculation-intensive
parts could improve the performance.

First, I've tried to manually vectorize the code in sync_and_demodulate 
and obtained the same results as with automatic vectorization. Looks 
like gcc does a very good job with the following flags:
-O3 -march=armv7-a -mcpu=cortex-a9 -mtune=cortex-a9 -mfpu=neon 
-mfloat-abi=hard -ffast-math
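
To illustrate the kind of loop gcc can vectorize with these flags, here is
a minimal sketch. It is NOT the actual code from wsprd_exp.c: the function
name correlate_f32 and its parameters are made up for this example. It
correlates complex samples with a complex reference tone in single
precision. As far as I know, gcc needs -ffast-math (or at least
-funsafe-math-optimizations) here, both because the float reductions are
reordered and because NEON arithmetic is not fully IEEE 754 compliant.

```c
#include <stddef.h>

/* Hypothetical kernel for illustration only (not taken from wsprd_exp.c).
 * Accumulates sum((id + j*qd) * (c - j*s)) over n samples, i.e. the
 * correlation of the I/Q samples with the reference tone (c, s).
 * Everything is float, and 'restrict' tells the compiler that the arrays
 * do not overlap, which makes the loop easy to auto-vectorize. */
void correlate_f32(const float *restrict id, const float *restrict qd,
                   const float *restrict c, const float *restrict s,
                   size_t n, float *re_out, float *im_out)
{
    float re = 0.0f;
    float im = 0.0f;

    for (size_t i = 0; i < n; i++) {
        re += id[i] * c[i] + qd[i] * s[i];  /* real part      */
        im += qd[i] * c[i] - id[i] * s[i];  /* imaginary part */
    }

    *re_out = re;
    *im_out = im;
}
```

Adding -fopt-info-vec to the gcc command line should report whether the
loop was actually vectorized.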

Next, I've managed to achieve about 15% performance improvement by 
replacing double with float in some calculation intensive parts of the 
code where precision does not seem to be critical. Basically, I've 
replaced double with float in sync_and_demodulate, subtract_signal and 
subtract_signal2. With this 15% improvement, 70 seconds of CPU time
could drop to about 60 (70*0.85 = 59.5).

I've tested this modified version with the set of 410 .wav files from 
http://physics.princeton.edu/pulsar/K1JT/wspr_data.tgz

The results obtained with the modified version and with the original
version of wsprd_exp.c differ slightly. The modified version missed
4 spots and found 6 additional spots. The total number of spots is
around 2800, so the effect is roughly 0.1-0.2% in each direction
(4/2800 and 6/2800). Since this test would take hours on the ARM
Cortex-A9 CPU, I ran it only on an Intel Pentium CPU. BTW, the
performance improvement on the Intel Pentium CPU is ~5%. This smaller
improvement is not unexpected, since SSE2 supports double-precision
floating-point values.
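
For anyone who wants to check the arithmetic, here is a small sketch. The
helper names percent_of and scaled_time are made up for this example, and
2800 is only the approximate total number of spots quoted above.

```c
/* Hypothetical helpers for illustration only. */

/* 'part' expressed as a percentage of 'total'. */
double percent_of(double part, double total)
{
    return 100.0 * part / total;
}

/* Run time left after a fractional improvement (0.15 means 15% faster). */
double scaled_time(double seconds, double improvement)
{
    return seconds * (1.0 - improvement);
}
```

With the numbers from this message, scaled_time(70.0, 0.15) gives 59.5
seconds, percent_of(4.0, 2800.0) is about 0.14 (missed spots) and
percent_of(6.0, 2800.0) is about 0.21 (additional spots).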

At the moment, I'm not sure what to do with these results. The 15% 
improvement is not huge and I don't know if it could justify the changes 
in the code.

I'd be interested to know what you think about this.

Best regards,

Pavel

