Thanks for sharing your analysis on webkit-dev.

There has been a lot of criticism of SIMD.js this year. It is great to read about approaches to vectorization that avoid the problems of SIMD.js.


On 9/26/14, 3:16 PM, Nadav Rotem wrote:
Recently, members of the JavaScript community at Intel and Mozilla
have suggested adding SIMD types to the JavaScript language. In this
email I would like to share my thoughts about this proposal and to
start a technical discussion about SIMD.js support in WebKit. I BCCed
some of the authors of the proposal to allow them to participate in
this discussion.

Modern processors feature SIMD (Single Instruction Multiple Data)
instructions, which perform the same
arithmetic operation on a vector of elements. SIMD instructions are used
to accelerate compute intensive code, like image processing algorithms,
because the same calculation is applied to every pixel in the image. A
single SIMD instruction can process 4 or 8 pixels at the same time.
Compilers try to make use of SIMD instructions in an optimization that
is called vectorization.

The SIMD.js proposal adds new
types, such as float32x4, and operators that map to vector instructions
on most processors. The idea behind the proposal is that manual use of
vector instructions, just like intrinsics in C, will allow developers to
accelerate common compute-intensive JavaScript applications. The idea of
using SIMD instructions to accelerate JavaScript code is compelling
because high performance applications in JavaScript are becoming very

Before I became involved with JavaScript through my work on the FTL, I
developed the LLVM vectorizer and worked on a vectorizing
compiler for a data-parallel programming language. Based on my
experience with vectorization, I believe that the current proposal to
include SIMD types in the JavaScript language is not the right approach
to utilize SIMD instructions. In this email I argue that vector types
should not be added to the JavaScript language.

Vector instruction sets are sparse, asymmetrical, and vary in size and
features from one generation to another. For example, some Intel
processors feature 512-bit wide vector instructions. This
means that they can process 16 floating point numbers with one
instruction. However, today’s high-end ARM processors feature 128-bit
wide vector instructions and can
only process 4 floating point elements. ARM processors support
byte-sized blend instructions but only recent Intel processors added
support for byte-sized blends. ARM processors support variable shifts
but only Intel processors with AVX2 support variable shifts. Different
generations of Intel processors support different instruction sets with
different features such as broadcasting from a local register, 16-bit
and 64-bit arithmetic, and varied shuffles. Modern processors even
feature predicated arithmetic and scatter/gather instructions that are
very difficult to model using target independent high-level intrinsics.
The designers of the high-level target independent API should decide if
they want to support the union of all vector instruction sets, or the
intersection. A subset of the vector instructions that represents the
intersection of all popular instruction sets is not usable for writing
non-trivial vector programs. And the superset of the vector instructions
will cause huge performance regressions on platforms that do not support
the used instructions.
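
As a rough illustration of the union-versus-intersection problem, here
is a minimal Python sketch. The target names and feature sets are made
up for illustration and are not a complete or authoritative description
of any real instruction set; only the 512-bit and 128-bit widths and
the kinds of features come from the paragraph above.

```python
FLOAT32_BITS = 32

def lanes(vector_bits, element_bits=FLOAT32_BITS):
    """Number of elements that one vector instruction processes."""
    return vector_bits // element_bits

# Hypothetical targets. The feature names are loosely based on the
# examples in the text; the exact sets are assumptions.
TARGETS = {
    "wide_512bit": {"add", "mul", "byte_blend", "variable_shift",
                    "gather", "predication"},
    "older_x86": {"add", "mul"},
    "arm_128bit": {"add", "mul", "byte_blend", "variable_shift"},
}

def portable_subset(targets):
    """Operations that every target supports natively (the intersection)."""
    return set.intersection(*targets.values())

def full_superset(targets):
    """Operations that at least one target supports (the union)."""
    return set.union(*targets.values())

def emulated_on(name, targets):
    """Superset operations that the named target would have to emulate."""
    return full_superset(targets) - targets[name]

print(lanes(512), lanes(128))
print(sorted(portable_subset(TARGETS)))
print(sorted(emulated_on("older_x86", TARGETS)))
```

In this toy table the intersection shrinks to plain arithmetic, while
an API built on the union leaves the weakest target emulating four of
the six operations.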

Code that uses SIMD.js is not performance-portable. Modern vectorizing
compilers feature complex cost models and heuristics for deciding when
to vectorize, at which vector width, and how many loop iterations to
interleave. The cost models take into account the features of the
vector instruction set, properties of the architecture such as the
number of vector registers, and properties of the current processor
generation. Making a poor selection decision on any of the vectorization
parameters can result in a major performance regression. Executing
vector intrinsics on processors that don’t support them is slower than
executing multiple scalar instructions because the compiler can’t always
generate efficient code with the same semantics.
I don’t believe that it is possible to write non-trivial vector code
that will show performance gains on processors from different families.
Executing vector code with insufficient hardware support will cause
major performance regressions. One of the motivations for SIMD.js was to
allow Emscripten to vectorize
C code and to emit JavaScript SIMD intrinsics. One problem with this
suggestion is that the Emscripten compiler should not be assuming that
the target is an x86 machine and that a specific vector width and
interleave width is the right answer. Targeting a specific processor
will surely cause regressions on other processors.
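
The width-selection part of this argument can be sketched with a toy
cost model in Python. Everything here is an assumption made for
illustration: the Target fields, the cost numbers, and the three
targets are hypothetical, and real vectorizers use far richer models.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    name: str
    vector_bits: int       # widest native vector register; 0 = scalar only
    vector_op_cost: float  # cost of one native vector op (scalar op = 1.0)
    split_penalty: float   # extra cost per piece when an op must be split

def cost_per_element(target, lanes, element_bits=32):
    """Estimated cost of one operation per element at a given width."""
    if lanes == 1:
        return 1.0
    if target.vector_bits == 0:
        # No vector unit: each lane runs as scalar code plus overhead.
        return 1.0 + target.split_penalty
    pieces = math.ceil(lanes * element_bits / target.vector_bits)
    total = pieces * target.vector_op_cost
    if pieces > 1:
        # The requested width is wider than the hardware supports.
        total += pieces * target.split_penalty
    return total / lanes

def best_width(target, candidates=(1, 4, 8, 16)):
    """Pick the lane count with the lowest estimated cost per element."""
    return min(candidates, key=lambda n: cost_per_element(target, n))

wide = Target("wide", 512, 1.0, 0.5)
narrow = Target("narrow", 128, 1.0, 0.5)
scalar_only = Target("scalar_only", 0, 1.0, 0.5)

for t in (wide, narrow, scalar_only):
    print(t.name, best_width(t))
```

Under these assumed numbers each target prefers a different width, and
code fixed at 16 lanes costs more per element on the narrow target than
code written for 4 lanes. That is the sense in which a width chosen
ahead of time for one machine is the wrong answer on another.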

SIMD.js does not make good use of modern vector instruction sets. Modern
vector processors feature large vectors (up to 512-bit), predication of
arithmetic and memory operations, scatter/gather memory operations,
advanced shuffles and broadcasts, and other features that make
vectorization efficient. The current SIMD.js proposal is limited to a
small number of arithmetic operations on 128-bit vector data types.

So far, I’ve explained why I believe SIMD.js will not be
performance-portable and why it will not utilize modern instruction
sets, but I have not made a suggestion on how to use vector instructions
to accelerate JavaScript programs. Vectorization, like instruction
scheduling and register allocation, is a code-generation problem. In
order to solve these problems, it is necessary for the compiler to have
intimate knowledge of the architecture. Forcing the compiler to use a
specific instruction or a specific data-type is the wrong answer. We can
learn a lesson from the design of compilers for data-parallel languages.
GPU programs (shaders and compute languages, such as OpenCL and GLSL)
are written using vector instructions because the domain of the problem
requires vectors (colors and coordinates). One of the first things that
data-parallel compilers do is to break vector instructions into scalars
(this process is called scalarization). After getting rid of the vectors
that resulted from the problem domain, the compiler may begin to analyze
the program, calculate profitability, and make use of the available
instruction set.
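
Scalarization as described above can be sketched in a few lines of
Python. The tuple-based instruction format, the opcode names, and the
".0" to ".3" lane naming are all invented for this example; no real
compiler IR is being modeled.

```python
# Map each hypothetical vector opcode to its scalar counterpart.
VECTOR_TO_SCALAR = {"vadd": "add", "vmul": "mul"}

def scalarize(instructions, lanes=4):
    """Rewrite each vector instruction as `lanes` scalar instructions."""
    scalar = []
    for op, dst, lhs, rhs in instructions:
        if op not in VECTOR_TO_SCALAR:
            # Already scalar: keep the instruction unchanged.
            scalar.append((op, dst, lhs, rhs))
            continue
        for i in range(lanes):
            scalar.append(
                (VECTOR_TO_SCALAR[op], f"{dst}.{i}", f"{lhs}.{i}", f"{rhs}.{i}")
            )
    return scalar

# A shader-like fragment: out = color * alpha + bias, on 4-wide vectors.
program = [
    ("vmul", "t", "color", "alpha"),
    ("vadd", "out", "t", "bias"),
]

scalar_program = scalarize(program)
for instruction in scalar_program:
    print(instruction)
```

Once the program is in this scalar form, the compiler is free to
regroup the operations into whatever vector width the target actually
supports, instead of being tied to the 4-wide shape of the source.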

I believe that it is the responsibility of JIT compilers to use vector
instructions. In the implementation of WebKit’s FTL JIT compiler, we
took one step in the direction of using vector instructions. LLVM
already vectorizes some code sequences during instruction selection, and
we started investigating the use of LLVM’s Loop and SLP vectorizers. We
found that despite nice performance gains on a number of workloads, we
experienced some performance regressions on Intel’s Sandy Bridge
processors, which are currently very popular desktop processors.
JavaScript code contains many branches (due to dynamic speculation).
Unfortunately, branches on Sandy Bridge execute on Port 5, which is also
where many vector instructions are executed. So, pressure on Port 5
prevented performance gains. The LLVM vectorizer currently does not
model execution port pressure and we had to disable vectorization in
FTL. In the future, we intend to enable more vectorization features in FTL.
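
The port-pressure effect can be illustrated with a deliberately crude
Python model. Only the fact that branches and many vector instructions
share port 5 comes from the text above. The other port numbers, the
one-port-per-operation simplification, and the instruction mixes are
assumptions made up for the example.

```python
from collections import Counter

# Assumed port assignment. Real hardware can issue many operations on
# more than one port; here each kind is pinned to a single port.
PORT_OF = {
    "branch": 5,
    "vector": 5,
    "scalar_add": 0,
    "load": 2,
}

def bottleneck_cycles(ops):
    """Lower bound on cycles if each port issues one operation per cycle."""
    pressure = Counter(PORT_OF[op] for op in ops)
    return max(pressure.values())

# Hypothetical work for four elements. The scalar version has one load,
# one add, and one speculation check (branch) per element.
scalar_body = ["load", "scalar_add", "branch"] * 4

# The vector version replaces the adds with two vector operations but,
# in this made-up example, keeps the four speculation checks.
vector_body = ["load", "vector", "vector"] + ["branch"] * 4

print(bottleneck_cycles(scalar_body))
print(bottleneck_cycles(vector_body))
```

In this model the vector version executes fewer operations in total but
is bounded by port 5, so it does not come out ahead. A cost model that
ignores execution ports cannot see this kind of regression.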

To summarize, SIMD.js will not provide a portable performance solution
because vector instruction sets are sparse and vary between
architectures and generations. Emscripten should not generate vector
instructions because it can’t model the target machine. SIMD.js will not
make use of modern SIMD features such as predication or scatter/gather.
Vectorization is a compiler code generation problem that should be
solved by JIT compilers, and not by the language itself. JIT compilers
should continue to evolve and to start vectorizing code like modern


webkit-dev mailing list
