Re: [webkit-dev] SIMD support in JavaScript

2014-09-28 Thread Nadav Rotem
Hi Dan!

 On Sep 28, 2014, at 6:44 AM, Dan Gohman sunf...@mozilla.com wrote:
 
 Hi Nadav,
 
 I agree with much of your assessment of the proposed SIMD.js API.
 However, I don't believe its unsuitability for some problems
 invalidates it for solving other very important problems, which it is
 well suited for. Performance portability is actually one of SIMD.js'
 biggest strengths: it's not the kind of performance portability that
 aims for a consistent percentage of peak on every machine (which, as you
 note, of course an explicit 128-bit SIMD API won't achieve), it's the
 kind of performance portability that achieves predictable performance
 and minimizes surprises across machines (though yes, there are some
 unavoidable ones, but overall the picture is quite good).

There is a tradeoff between the performance portability of the SIMD.js ISA and
its usefulness. A small set of instructions (one that only targets 32-bit data
types, has no masks, etc.) is not useful for developing non-trivial vector
programs. You need 16-bit vector elements to support WebGL vertex indices, and
lane masking to implement predicated control flow in programs like ray tracers.
Introducing a large number of vector instructions will expose the
performance-portability problems. I don't believe that there is a sweet spot in
this tradeoff: we cannot find a small set of instructions that is useful for
writing non-trivial vector code and is also performance-portable.
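
To illustrate why lane masking matters, here is a sketch of predicated
control flow, written with a hypothetical select4 helper rather than the
proposed API (all names here are illustrative only):

    // Per-lane select: lane i takes a[i] where mask[i] is true.
    function select4(mask, a, b) {
      return mask.map(function (m, i) { return m ? a[i] : b[i]; });
    }

    var t = [0.5, -1.0, 2.0, -3.0];                   // per-lane hit distances
    var hit = t.map(function (x) { return x > 0; });  // lane mask
    var color = select4(hit, [1, 1, 1, 1], [0, 0, 0, 0]);
    // -> [1, 0, 1, 0]. Both branch results are computed and the mask
    // picks one per lane; without masks a ray tracer cannot express this.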

 
 On 09/26/2014 03:16 PM, Nadav Rotem wrote:
 So far, I’ve explained why I believe SIMD.js will not be
 performance-portable and why it will not utilize modern instruction
 sets, but I have not made a suggestion on how to use vector
 instructions to accelerate JavaScript programs. Vectorization, like
 instruction scheduling and register allocation, is a code-generation
 problem. In order to solve these problems, it is necessary for the
 compiler to have intimate knowledge of the architecture. Forcing the
 compiler to use a specific instruction or a specific data-type is the
 wrong answer. We can learn a lesson from the design of compilers for
 data-parallel languages. GPU programs (shaders and compute languages,
 such as OpenCL and GLSL) are written using vector instructions because
 the domain of the problem requires vectors (colors and coordinates).
 One of the first things that data-parallel compilers do is to break
 vector instructions into scalars (this process is called
 scalarization). After getting rid of the vectors that resulted from
 the problem domain, the compiler may begin to analyze the program,
 calculate profitability, and make use of the available instruction set.
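
To make the scalarization step concrete, here is an illustrative sketch
(not taken from any particular compiler). The shader author writes one
vec4 operation, c = a + b:

    // Source-level vector add, as written in a shading language.
    function vecAdd(a, b) {
      return { x: a.x + b.x, y: a.y + b.y,
               z: a.z + b.z, w: a.w + b.w };
    }

After scalarization, the compiler's IR holds four independent scalar
adds (the four lane expressions above); the backend re-vectorizes them
only if the target ISA makes it profitable.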
 
 I believe that it is the responsibility of JIT compilers to use vector
 instructions. In the implementation of WebKit's FTL JIT compiler,
 we took one step in the direction of using vector instructions. LLVM
 already vectorizes some code sequences during instruction selection,
 and we started investigating the use of LLVM’s Loop and SLP
 vectorizers. We found that despite nice performance gains on a number
 of workloads, we experienced some performance regressions on Intel’s
 Sandy Bridge processors, which are currently very popular desktop
 processors. JavaScript code contains many branches (due to dynamic
 speculation). Unfortunately, branches on Sandy Bridge execute on port 5,
 which is also where many vector instructions are executed. So,
 pressure on port 5 prevented performance gains. The LLVM vectorizer
 currently does not model execution port pressure and we had to disable
 vectorization in FTL. In the future, we intend to enable more
 vectorization features in FTL.
 
 This is an example of a weakness of depending on automatic vectorization
 alone. High-level language features create complications which can lead
 to surprising performance problems. Compiler transformations to target
 specialized hardware features often have widely varying applicability.
 Expensive analyses can sometimes enable more and better vectorization,
 but when a compiler has to do an expensive complex analysis in order to
 optimize, it's unlikely that a programmer can count on other compilers
 doing the exact same analysis and optimizing in all the same cases. This
 is a problem we already face in many areas of compilers, but it's more
 pronounced with vectorization than many other optimizations.

I agree with this argument. Compiler optimizations are unpredictable. You never
know when the register allocator will decide to spill a variable inside a hot
loop, or when a memory operation will confuse the alias analysis. I also agree
that loop vectorization is especially sensitive. However, the kind of
vectorization needed to replace SIMD.js is a very simple SLP vectorization
(http://llvm.org/docs/Vectorizers.html#the-slp-vectorizer), also known as
basic-block vectorization. It is really easy for a compiler such as LLVM's SLP
vectorizer to combine a few scalar arithmetic operations into a vector.
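
For example, the following straight-line code is exactly the shape that
SLP vectorization handles (an illustrative sketch):

    // Four isomorphic scalar operations on adjacent elements...
    function addQuad(a, b, c, i) {
      c[i]     = a[i]     + b[i];
      c[i + 1] = a[i + 1] + b[i + 1];
      c[i + 2] = a[i + 2] + b[i + 2];
      c[i + 3] = a[i + 3] + b[i + 3];
    }
    // ...which an SLP vectorizer can roll into a single 4-wide vector
    // add, with no loop analysis required.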

[webkit-dev] SIMD support in JavaScript

2014-09-26 Thread Nadav Rotem
Recently, members of the JavaScript community at Intel and Mozilla have
suggested adding SIMD types to the JavaScript language
(http://www.2ality.com/2013/12/simd-js.html). In this email I would like to
share my thoughts about this proposal and to start a technical discussion about
SIMD.js support in WebKit. I BCCed some of the authors of the proposal to allow
them to participate in this discussion.

Modern processors feature SIMD (Single Instruction, Multiple Data)
instructions (http://en.wikipedia.org/wiki/SIMD), which perform the same
arithmetic operation on a vector of elements. SIMD instructions are used to
accelerate compute-intensive code, like image-processing algorithms, because
the same calculation is applied to every pixel in the image. A single SIMD
instruction can process 4 or 8 pixels at the same time. Compilers try to make
use of SIMD instructions in an optimization called vectorization.
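
For example, a vectorizing compiler could transform a scalar loop like
the sketch below to process 4 (SSE, NEON) or 8 (AVX) pixels per
instruction instead of one:

    // Scalar brightness adjustment: one pixel per iteration.
    function brighten(pixels, delta) {
      for (var i = 0; i < pixels.length; i++) {
        pixels[i] = Math.min(pixels[i] + delta, 255);
      }
    }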

The SIMD.js API (http://wiki.ecmascript.org/doku.php?id=strawman:simd_number)
adds new types, such as float32x4, and operators that map to vector
instructions on most processors. The idea behind the proposal is that manual
use of vector instructions, just like intrinsics in C, will allow developers to
accelerate common compute-intensive JavaScript applications. The idea of using
SIMD instructions to accelerate JavaScript code is compelling because
high-performance applications in JavaScript are becoming very popular.
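
A sketch in the style of the strawman (the exact method names varied
across drafts of the proposal, and running it requires a SIMD.js
polyfill):

    var a = SIMD.float32x4(1.0, 2.0, 3.0, 4.0);
    var b = SIMD.float32x4(5.0, 6.0, 7.0, 8.0);
    var c = SIMD.float32x4.add(a, b);  // one 4-wide add, intended to
                                       // map to a single vector
                                       // instruction on most processors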

Before I became involved with JavaScript through my work on the FTL project
(https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-jit/), I developed
the LLVM vectorizer (http://llvm.org/docs/Vectorizers.html) and worked on a
vectorizing compiler for a data-parallel programming language. Based on my
experience with vectorization, I believe that the current proposal to include
SIMD types in the JavaScript language is not the right approach to utilize SIMD
instructions. In this email I argue that vector types should not be added to
the JavaScript language.

Vector instruction sets are sparse, asymmetrical, and vary in size and features
from one generation to another. For example, some Intel processors feature
512-bit-wide vector instructions
(https://software.intel.com/en-us/blogs/2013/avx-512-instructions), which means
that they can process 16 floating-point numbers with one instruction. However,
today's high-end ARM processors feature 128-bit-wide vector instructions
(http://www.arm.com/products/processors/technologies/neon.php) and can only
process 4 floating-point elements. ARM processors support byte-sized blend
instructions, but only recent Intel processors added support for byte-sized
blends. ARM processors support variable shifts, but only Intel processors with
AVX2 support variable shifts. Different generations of Intel processors support
different instruction sets with different features, such as broadcasting from a
local register, 16-bit and 64-bit arithmetic, and varied shuffles. Modern
processors even feature predicated arithmetic and scatter/gather instructions
that are very difficult to model using target-independent high-level
intrinsics.
The designers of a high-level target-independent API must decide whether to
support the union of all vector instruction sets, or the intersection. The
intersection of all popular instruction sets is not usable for writing
non-trivial vector programs, and the union will cause huge performance
regressions on platforms that do not support the instructions used.
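
As an illustration of the cost of the union approach, consider a
byte-sized blend on hardware that lacks one. A native blend is a single
instruction; the fallback is a bitwise and/and-not/or sequence, modeled
here one byte at a time (illustrative sketch only):

    // Each iteration stands in for one lane of the emulated blend.
    function blendBytes(mask, a, b, out) {
      for (var i = 0; i < out.length; i++) {
        out[i] = (a[i] & mask[i]) | (b[i] & ~mask[i]);
      }
    }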

Code that uses SIMD.js is not performance-portable. Modern vectorizing 
compilers feature complex cost models and heuristics for deciding when to 
vectorize, at which vector width, and how many loop iterations to interleave. 
These cost models take into account the features of the vector instruction set,
properties of the architecture such as the number of vector registers, and
properties of the current processor generation. A poor decision on any of these
vectorization parameters can result in a major performance regression.
Executing vector intrinsics on processors that don't support them is slower
than executing multiple scalar instructions, because the compiler can't always
generate an efficient instruction sequence with the same semantics.
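
For example, lowering a per-lane variable shift on hardware without one
(the pre-AVX2 Intel case mentioned above) forces the compiler to
extract, shift, and reinsert each lane, sketched here:

    // Illustrative scalar expansion of a 4-lane variable shift.
    function shiftLeftLanes(v, counts) {
      var out = new Int32Array(4);
      for (var i = 0; i < 4; i++) {
        out[i] = v[i] << counts[i];  // extract lane, shift, reinsert
      }
      return out;
    }

Four extracts, four shifts, and four inserts easily cost more than the
scalar code the programmer was trying to beat.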
I don’t believe that it is possible to write non-trivial vector code that will 
show performance gains on processors from different families. Executing vector 
code with insufficient hardware support will cause major performance 
regressions. One of the motivations for SIMD.js was to allow Emscripten
(https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Emscripten) to
vectorize C code and emit JavaScript SIMD intrinsics. One problem with this
suggestion is that the Emscripten compiler should not assume that the target is
an x86 machine, or that a specific vector width and interleave factor are the
right answer.
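
A sketch of the problem: code in the shape Emscripten would emit bakes a
4-element stride into the output, even though an AVX-512 host could
process 16 floats per instruction (illustrative only):

    // Stands in for a 128-bit SIMD.js load/add/store sequence.
    function addVec4Loop(a, b, c, n) {
      for (var i = 0; i < n; i += 4) {
        for (var j = 0; j < 4; j++) {
          c[i + j] = a[i + j] + b[i + j];
        }
      }
    }

The vector width and interleave factor were chosen when the C code was
compiled, not when the JavaScript runs on the user's machine.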