Recently, members of the JavaScript community at Intel and Mozilla have
suggested adding SIMD types to the JavaScript language. In this email I would
like to share my thoughts about this proposal and to start a technical
discussion about SIMD.js support in WebKit. I BCCed some of the authors of the
proposal to allow them to participate in this discussion.

Modern processors feature SIMD (Single Instruction Multiple Data)
instructions, which perform the same
arithmetic operation on a vector of elements. SIMD instructions are used to 
accelerate compute-intensive code, like image processing algorithms, because
the same calculation is applied to every pixel in the image. A single SIMD 
instruction can process 4 or 8 pixels at the same time. Compilers try to make 
use of SIMD instructions in an optimization that is called vectorization. 

The SIMD.js API adds new types, such as float32x4, and operators that map to vector
instructions on most processors. The idea behind the proposal is that manual 
use of vector instructions, just like intrinsics in C, will allow developers to 
accelerate common compute-intensive JavaScript applications. The idea of using 
SIMD instructions to accelerate JavaScript code is compelling because high 
performance applications in JavaScript are becoming very popular. 
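
To make the programming style concrete, here is a minimal sketch of what
intrinsic-style code looks like. The float32x4 and add4 helpers below are
hypothetical plain-JavaScript stand-ins written only for illustration; they are
not the proposed API, and they run as ordinary scalar code. In a real
implementation each add4 call would be expected to map to one vector
instruction.

```javascript
// Hypothetical stand-in for a 4-lane float vector type (not the real SIMD.js API).
function float32x4(x, y, z, w) {
  return Float32Array.of(x, y, z, w);
}

// Lane-wise addition: one call stands in for one vector instruction.
function add4(a, b) {
  return float32x4(a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]);
}

// Add two arrays four elements at a time, the way intrinsic-style code would.
// Assumes both inputs have the same length and that it is a multiple of 4.
function addArrays(a, b) {
  const out = new Float32Array(a.length);
  for (let i = 0; i < a.length; i += 4) {
    const va = float32x4(a[i], a[i + 1], a[i + 2], a[i + 3]);
    const vb = float32x4(b[i], b[i + 1], b[i + 2], b[i + 3]);
    out.set(add4(va, vb), i);
  }
  return out;
}
```

Note that the vector width (4) is baked into the source code, which is the
property the rest of this email is concerned with.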

Before I became involved with JavaScript through my work on the FTL project,
I developed the LLVM vectorizer and worked on a
vectorizing compiler for a data-parallel programming language. Based on my 
experience with vectorization, I believe that the current proposal to include 
SIMD types in the JavaScript language is not the right approach to utilize SIMD 
instructions. In this email I argue that vector types should not be added to 
the JavaScript language.

Vector instruction sets are sparse, asymmetrical, and vary in size and features 
from one generation to another. For example, some Intel processors feature 
512-bit wide vector instructions. This means that they can process 16
floating point numbers with one instruction. However, today’s high-end ARM
processors feature 128-bit wide vector instructions and can only
process 4 floating point elements. ARM processors support byte-sized blend 
instructions but only recent Intel processors added support for byte-sized 
blends. ARM processors support variable shifts but only Intel processors with 
AVX2 support variable shifts. Different generations of Intel processors support 
different instruction sets with different features such as broadcasting from a 
local register, 16-bit and 64-bit arithmetic, and varied shuffles. Modern 
processors even feature predicated arithmetic and scatter/gather instructions 
that are very difficult to model using target independent high-level
intrinsics.

The designers of the high-level target independent API should decide if they
want to support the union of all vector instruction sets, or the intersection. 
A subset of the vector instructions that represents the intersection of all
popular instruction sets is not usable for writing non-trivial vector
programs, and the superset of the vector instructions will cause huge
performance regressions on platforms that do not support the used instructions.
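
The width figures quoted above follow from simple arithmetic, sketched below.
The function names are made up for this illustration, and the utilization
number only describes how much of a wider register a fixed 128-bit type can
occupy; it ignores every other factor that affects real performance.

```javascript
// Number of elements of a given bit size that fit in one vector register.
function lanes(vectorBits, elementBits) {
  return Math.floor(vectorBits / elementBits);
}

// Fraction of the hardware vector that a fixed-width program type can fill.
function utilization(programVectorBits, hardwareVectorBits) {
  return Math.min(1, programVectorBits / hardwareVectorBits);
}

const floatBits = 32; // single-precision float

console.log(lanes(512, floatBits)); // 16 floats per instruction
console.log(lanes(128, floatBits)); // 4 floats per instruction
console.log(utilization(128, 512)); // 0.25
```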

Code that uses SIMD.js is not performance-portable. Modern vectorizing 
compilers feature complex cost models and heuristics for deciding when to 
vectorize, at which vector width, and how many loop iterations to interleave. 
The cost models take into account the features of the vector instruction set,
properties of the architecture such as the number of vector registers, and 
properties of the current processor generation. Making a poor selection 
decision on any of the vectorization parameters can result in a major 
performance regression. Executing vector intrinsics on processors that don’t 
support them is slower than executing multiple scalar instructions because the 
compiler can’t always generate efficient code with the same semantics.

I don’t believe that it is possible to write non-trivial vector code that will
show performance gains on processors from different families. Executing vector 
code with insufficient hardware support will cause major performance 
regressions. One of the motivations for SIMD.js was to allow Emscripten to
vectorize C code and to emit JavaScript SIMD intrinsics. One problem with this 
suggestion is that the Emscripten compiler should not be assuming that the 
target is an x86 machine and that a specific vector width and interleave width 
is the right answer. Targeting a specific processor will surely cause 
regressions on other processors. 

SIMD.js does not make good use of modern vector instruction sets. Modern vector 
processors feature large vectors (up to 512-bit), predication of arithmetic and 
memory operations, scatter/gather memory operations, advanced shuffles and
broadcasts and other features that make vectorization efficient. The current 
SIMD.js proposal is limited to a small number of arithmetic operations on 
128-bit vector data types.

So far, I’ve explained why I believe SIMD.js will not be performance-portable 
and why it will not utilize modern instruction sets, but I have not made a 
suggestion on how to use vector instructions to accelerate JavaScript programs. 
Vectorization, like instruction scheduling and register allocation, is a 
code-generation problem. In order to solve these problems, it is necessary for 
the compiler to have intimate knowledge of the architecture. Forcing the 
compiler to use a specific instruction or a specific data-type is the wrong 
answer. We can learn a lesson from the design of compilers for data-parallel 
languages. GPU programs (shaders and compute languages, such as OpenCL and 
GLSL) are written using vector instructions because the domain of the problem 
requires vectors (colors and coordinates). One of the first things that
data-parallel compilers do is to break vector instructions into scalars (this 
process is called scalarization). After getting rid of the vectors that 
resulted from the problem domain, the compiler may begin to analyze the 
program, calculate profitability, and make use of the available instruction
set.
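
As a rough illustration of scalarization, the sketch below expands one vector
instruction into one scalar instruction per lane. The tiny instruction format
(op, dst, lhs, rhs, lanes) is a hypothetical mini-IR invented for this example
and does not correspond to any real compiler's representation.

```javascript
// Hypothetical mini-IR: a vector instruction names an operation, a destination,
// two sources, and the number of lanes it operates on.
function scalarize(inst) {
  const scalars = [];
  for (let lane = 0; lane < inst.lanes; lane++) {
    // Emit one scalar instruction per lane, naming each lane explicitly.
    scalars.push({
      op: inst.op,
      dst: `${inst.dst}.${lane}`,
      lhs: `${inst.lhs}.${lane}`,
      rhs: `${inst.rhs}.${lane}`,
    });
  }
  return scalars;
}

// A 4-lane add, such as adding two RGBA colors.
const vectorAdd = { op: "add", dst: "color", lhs: "a", rhs: "b", lanes: 4 };
const scalarOps = scalarize(vectorAdd);
// scalarOps[0] is { op: "add", dst: "color.0", lhs: "a.0", rhs: "b.0" }
```

Once the program is in this scalar form, the compiler is free to regroup the
operations at whatever width suits the target.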

I believe that it is the responsibility of JIT compilers to use vector 
instructions. In the implementation of WebKit’s FTL JIT compiler, we took
one step in the direction of using vector instructions. LLVM already vectorizes 
some code sequences during instruction selection, and we started investigating 
the use of LLVM’s Loop and SLP vectorizers. We found that despite nice 
performance gains on a number of workloads, we experienced some performance 
regressions on Intel’s Sandy Bridge processors, which are currently very
popular desktop processors. JavaScript code contains many branches (due to
dynamic speculation). Unfortunately, branches on Sandy Bridge execute on port 5,
which is also where many vector instructions are executed. So, pressure on
port 5 prevented performance gains. The LLVM vectorizer currently does not model
execution port pressure and we had to disable vectorization in FTL. In the 
future, we intend to enable more vectorization features in FTL.
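
For contrast with intrinsic-style code, the sketch below shows the kind of
plain scalar loop that a vectorizing JIT could, in principle, turn into vector
instructions at a width of its own choosing. The function name saxpy is just a
conventional label for this computation; whether any particular engine
vectorizes this loop is an assumption, not something this example demonstrates.

```javascript
// Plain scalar loop: y[i] = a * x[i] + y[i]. No vector types appear in the
// source, so the choice of vector width (if any) is left to the compiler.
function saxpy(a, x, y) {
  const n = Math.min(x.length, y.length);
  for (let i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
  return y;
}

const x = Float32Array.of(1, 2, 3);
const y = Float32Array.of(10, 20, 30);
saxpy(2, x, y); // y now holds 12, 24, 36
```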

To summarize, SIMD.js will not provide a portable performance solution because 
vector instruction sets are sparse and vary between architectures and 
generations. Emscripten should not generate vector instructions because it 
can’t model the target machine. SIMD.js will not make use of modern SIMD 
features such as predication or scatter/gather. Vectorization is a compiler 
code generation problem that should be solved by JIT compilers, and not by the 
language itself. JIT compilers should continue to evolve and should start
vectorizing code, as modern compilers do.


webkit-dev mailing list
