Hi,
I'm currently optimizing a codebase for WebAssembly. The code includes
some hand-written SIMD implementations for ARM NEON and Intel SIMD.
WebAssembly's own SIMD instructions do not appear to be stable yet, and
are therefore not available in browsers today without a feature flag.
I've come across some nifty bit twiddling methods [1] (SWAR, "SIMD
Within A Register") that allow some SIMD operations to be performed
using standard non-SIMD registers and instructions. 64-bit architectures
obviously provide better opportunities for speedups with these
techniques than the 32-bit examples shown on that old page from 1997.
I implemented a particular function (half-sampling a greyscale image by
averaging 2x2 blocks of input pixels) using these techniques with
uint64_t registers. Native benchmarking on an iPhone XR (-O3, Xcode's
LLVM) gives these timings relative to my NEON implementation (higher is
slower):
              | -O3   | -O3 -fno-vectorize -fno-slp-vectorize
NEON          | 1.00x | 1.00x
Plain C       | 1.36x | 10.71x
Bit twiddling | 3.36x | 5.19x
LLVM's vectorizers evidently found opportunities to vectorize both the
plain C and the bit twiddling implementations, as both slow down when
those optimizations are disabled. The vectorizer did best with the
simple plain C version, finding very good vectorization opportunities
(an almost 8x speedup) that let it easily outperform the bit-twiddling
approach. However, without vectorization the bit twiddling approach is
about a 2x improvement over the plain C variant.
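For reference, the bit-twiddling version is along these lines. This is a
minimal sketch of the technique rather than my exact code: it assumes a
little-endian target, a width that is a multiple of 8, an even height,
and it uses the classic per-lane floor average
(a & b) + (((a ^ b) >> 1) & 0x7f...7f) from the SWAR page:

```c
#include <stdint.h>
#include <string.h>

/* Floor-average each of the 8 byte lanes of a and b without overflow:
   avg = (a & b) + ((a ^ b) >> 1), with the shifted term masked so bits
   cannot leak in from the neighbouring lane. */
static inline uint64_t avg8(uint64_t a, uint64_t b) {
    return (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7F7F7F7F7FULL);
}

/* Half-sample a greyscale image by averaging 2x2 blocks, 8 input
   pixels (one uint64_t per row) at a time. */
void halve(const uint8_t *src, int width, int height, uint8_t *dst) {
    for (int y = 0; y < height; y += 2) {
        const uint8_t *r0 = src + (size_t)y * width;
        const uint8_t *r1 = r0 + width;
        uint8_t *out = dst + (size_t)(y / 2) * (width / 2);
        for (int x = 0; x < width; x += 8) {
            uint64_t a, b;
            memcpy(&a, r0 + x, 8);   /* unaligned-safe loads */
            memcpy(&b, r1 + x, 8);
            uint64_t v = avg8(a, b);       /* vertical average   */
            uint64_t h = avg8(v, v >> 8);  /* horizontal average */
            /* On little-endian, the even-indexed bytes of h hold the
               four 2x2 averages. Note the two nested floor averages
               can be off by one from floor((p0+p1+p2+p3)/4). */
            for (int i = 0; i < 4; i++)
                out[(x >> 1) + i] = (uint8_t)(h >> (16 * i));
        }
    }
}
```

Everything here stays in general-purpose 64-bit registers, which is why
I'm interested in whether wasm i64 operations actually map to 64-bit
machine instructions.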
I will do some benchmarking of the same code through Emscripten, but
given the additional layers involved (C -> wasm -> V8 codegen, with a
multi-tier JIT?) I thought it would also be worth asking a couple of
direct questions here:
1) Does V8's codegen emit 64-bit machine instructions for 64-bit wasm
instructions on 64-bit architectures (specifically Android)? I imagine
the speedup from SWAR techniques would be significantly reduced if i64
operations were lowered to pairs of 32-bit registers, perhaps to the
point where the approach isn't worthwhile.
2) Does V8's codegen currently do any vectorization? If not, are there
plans to add it? If vectorization is coming, the plain C version might
be the best one to stick with, as it would be easier for an
auto-vectorizer to detect and optimize.
3) Can anyone provide tips or links to help with investigating and
optimizing this kind of thing? Is there any way of flagging wasm
functions for maximum optimization for benchmarking purposes?
Thanks!
Simon
[1] http://aggregate.org/SWAR/over.html
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev