Hi,

I'm currently optimizing a codebase for WebAssembly. The code includes some 
hand-written SIMD implementations for ARM NEON and Intel SIMD. It seems 
WebAssembly's own SIMD instructions are not yet stable, and are therefore 
not available in browsers today without a feature flag.

I've come across some nifty bit-twiddling methods [1] that allow some SIMD 
operations to be performed using standard non-SIMD registers and 
instructions (SIMD Within A Register, or SWAR). 64-bit architectures 
obviously provide better opportunities for speedups with these techniques 
than the 32-bit examples shown on that old page from 1997.
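
For example, here's a minimal sketch (my code, not taken from the page) of 
the kind of trick I mean: a per-byte floor average of eight greyscale 
pixels packed into a uint64_t, with a mask stopping carries from leaking 
between byte lanes:

    #include <stdint.h>

    /* Floor average of eight packed bytes, no SIMD unit needed.
       Uses the identity a + b == ((a & b) << 1) + (a ^ b), so
       (a + b) >> 1 == (a & b) + ((a ^ b) >> 1); the 0x7f mask
       clears the bit shifted in from the byte lane above. */
    static inline uint64_t avg8x8(uint64_t a, uint64_t b)
    {
        return (a & b) + (((a ^ b) >> 1) & 0x7f7f7f7f7f7f7f7fULL);
    }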

I implemented one particular function (half-sampling a greyscale image by 
averaging 2x2 blocks of input pixels) using these techniques with uint64_t 
registers. Native benchmarking on an iPhone XR (-O3, Xcode's LLVM) gives 
these timings relative to my NEON implementation (lower is better; a 
sketch of the plain C variant follows the table):

               |  -O3    |  -O3 -fno-vectorize -fno-slp-vectorize
         NEON  |  1.00x  |  1.00x
      Plain C  |  1.36x  | 10.71x
Bit twiddling  |  3.36x  |  5.19x
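
For reference, here's a minimal sketch of what I mean by the plain C 
variant (illustrative only; it assumes 8-bit greyscale, even dimensions, 
and a row stride equal to the width, and the real code differs in details):

    #include <stdint.h>
    #include <stddef.h>

    void halve_plain(const uint8_t *src, size_t w, size_t h,
                     uint8_t *dst)
    {
        for (size_t y = 0; y < h / 2; y++) {
            const uint8_t *r0 = src + (2 * y) * w;  /* top row    */
            const uint8_t *r1 = r0 + w;             /* bottom row */
            for (size_t x = 0; x < w / 2; x++) {
                /* rounded average of the 2x2 input block */
                dst[y * (w / 2) + x] = (uint8_t)
                    ((r0[2 * x] + r0[2 * x + 1] +
                      r1[2 * x] + r1[2 * x + 1] + 2) >> 2);
            }
        }
    }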

LLVM's vectorizers obviously found opportunities to vectorize both the 
plain C and the bit-twiddling implementations, as both were slower with 
those optimizations disabled. The vectorizer preferred the simplicity of 
the plain C version: it found very good vectorization opportunities there 
(almost an 8x speedup), which then easily outperformed the bit-twiddling 
approach. However, without vectorization, the bit-twiddling approach is 
roughly a 2x improvement over the plain C variant.

I will do some benchmarking of the same code through Emscripten, but given 
the additional layers involved (C -> wasm -> V8 codegen (with multi-level 
JIT?)) I thought it would also be worth asking a few direct questions 
here:

1) Does V8's codegen emit 64-bit machine instructions for wasm's 64-bit 
integer instructions on 64-bit architectures (specifically Android)? I 
imagine the speedup from SWAR techniques will be significantly reduced 
using only 32-bit registers, perhaps to the point where it's no longer 
worthwhile (see the snippet after these questions).

2) Does V8's codegen currently do any auto-vectorization? If not, are 
there plans to add it? If it does (or will), the plain C version might be 
the best one to stick with, as it would be easier for an auto-vectorizer 
to detect and optimize.

3) Can anyone provide tips or links to help with investigating and 
optimizing this kind of thing? Is there any way of flagging wasm functions 
for maximum optimization for benchmarking purposes?
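
For context on question 1, this is the width dependence I have in mind: 
the same per-byte average from the earlier sketch covers only four pixels 
per operation in a 32-bit register, so I'd expect roughly half the 
throughput if wasm i64 ops get lowered to 32-bit instructions:

    #include <stdint.h>

    /* 32-bit variant of the SWAR average: four byte lanes per
       operation instead of eight. */
    static inline uint32_t avg8x4(uint32_t a, uint32_t b)
    {
        return (a & b) + (((a ^ b) >> 1) & 0x7f7f7f7fU);
    }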

Thanks!

Simon

[1] http://aggregate.org/SWAR/over.html
