[webkit-dev] SIMD support in JavaScript

2014-09-26 Thread Nadav Rotem
Recently members of the JavaScript community at Intel and Mozilla have 
suggested http://www.2ality.com/2013/12/simd-js.html adding SIMD types to the 
JavaScript language. In this email I would like to share my thoughts about this 
proposal and to start a technical discussion about SIMD.js support in WebKit. I 
BCCed some of the authors of the proposal to allow them to participate in this 
discussion. 

Modern processors feature SIMD (Single Instruction Multiple Data) 
http://en.wikipedia.org/wiki/SIMD instructions, which perform the same 
arithmetic operation on a vector of elements. SIMD instructions are used to 
accelerate compute-intensive code, like image processing algorithms, because 
the same calculation is applied to every pixel in the image. A single SIMD 
instruction can process 4 or 8 pixels at the same time. Compilers try to make 
use of SIMD instructions in an optimization that is called vectorization. 
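
As a rough illustration of the idea (a sketch, not code from the proposal), the scalar loop below scales one pixel value per iteration, while the second version handles four values per step, which is the shape a vectorizer aims for. The function names are made up for this example. Plain TypeScript still executes the four multiplications one at a time; only a vectorizing compiler or JIT could turn the unrolled body into a single SIMD instruction.

```typescript
// Scalar version: one element per loop iteration.
function brightenScalar(pixels: Float32Array, gain: number): Float32Array {
  const out = new Float32Array(pixels.length);
  for (let i = 0; i < pixels.length; i++) {
    out[i] = pixels[i] * gain;
  }
  return out;
}

// "Vector-shaped" version: four elements per step, then a scalar tail
// for the leftover elements when the length is not a multiple of 4.
function brightenBy4(pixels: Float32Array, gain: number): Float32Array {
  const out = new Float32Array(pixels.length);
  const vectorEnd = pixels.length - (pixels.length % 4);
  let i = 0;
  for (; i < vectorEnd; i += 4) {
    // Conceptually a single 4-wide multiply on SIMD hardware.
    out[i] = pixels[i] * gain;
    out[i + 1] = pixels[i + 1] * gain;
    out[i + 2] = pixels[i + 2] * gain;
    out[i + 3] = pixels[i + 3] * gain;
  }
  for (; i < pixels.length; i++) {
    out[i] = pixels[i] * gain;
  }
  return out;
}
```

Both functions return the same values; the only difference is how the work is grouped.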

The SIMD.js API http://wiki.ecmascript.org/doku.php?id=strawman:simd_number 
adds new types, such as float32x4, and operators that map to vector 
instructions on most processors. The idea behind the proposal is that manual 
use of vector instructions, just like intrinsics in C, will allow developers to 
accelerate common compute-intensive JavaScript applications. The idea of using 
SIMD instructions to accelerate JavaScript code is compelling because high 
performance applications in JavaScript are becoming very popular. 
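
To make the programming model concrete, here is a minimal emulation of a 4-lane float type with a lane-wise add. This is only a sketch: the class name Float32x4Sketch and its add method are hypothetical stand-ins written for this email, not the API from the strawman, and the emulation is scalar code with no SIMD instructions involved.

```typescript
// Hypothetical stand-in for a 4-lane float vector; not the SIMD.js API itself.
class Float32x4Sketch {
  readonly lanes: Float32Array;

  constructor(x: number, y: number, z: number, w: number) {
    // Storing into a Float32Array rounds each lane to 32-bit precision.
    this.lanes = new Float32Array([x, y, z, w]);
  }

  // Lane-wise addition: the operation a single vector add instruction
  // would perform on hardware with 128-bit registers.
  static add(a: Float32x4Sketch, b: Float32x4Sketch): Float32x4Sketch {
    return new Float32x4Sketch(
      a.lanes[0] + b.lanes[0],
      a.lanes[1] + b.lanes[1],
      a.lanes[2] + b.lanes[2],
      a.lanes[3] + b.lanes[3]
    );
  }
}
```

The point of the intrinsic style is that the programmer, not the compiler, commits to the width of four lanes when writing the code.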

Before I became involved with JavaScript through my work on the FTL project 
https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-jit/, I developed 
the LLVM vectorizer http://llvm.org/docs/Vectorizers.html and worked on a 
vectorizing compiler for a data-parallel programming language. Based on my 
experience with vectorization, I believe that the current proposal to include 
SIMD types in the JavaScript language is not the right approach to utilize SIMD 
instructions. In this email I argue that vector types should not be added to 
the JavaScript language.

Vector instruction sets are sparse, asymmetrical, and vary in size and features 
from one generation to another. For example, some Intel processors feature 
512-bit wide vector instructions 
https://software.intel.com/en-us/blogs/2013/avx-512-instructions. This means 
that they can process 16 floating point numbers with one instruction. However, 
today’s high-end ARM processors feature 128-bit wide vector instructions 
http://www.arm.com/products/processors/technologies/neon.php and can only 
process 4 floating point elements. ARM processors support byte-sized blend 
instructions but only recent Intel processors added support for byte-sized 
blends. ARM processors support variable shifts but only Intel processors with 
AVX2 support variable shifts. Different generations of Intel processors support 
different instruction sets with different features such as broadcasting from a 
local register, 16-bit and 64-bit arithmetic, and varied shuffles. Modern 
processors even feature predicated arithmetic and scatter/gather instructions 
that are very difficult to model using target independent high-level 
intrinsics. 

The designers of the high-level target independent API must decide whether they 
want to support the union of all vector instruction sets, or the intersection. 
A subset of the vector instructions that represents the intersection of all 
popular instruction sets is not usable for writing non-trivial vector 
programs. And the superset of the vector instructions will cause huge 
performance regressions on platforms that do not support the instructions used.
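
The width arithmetic above can be sketched as follows. The helper names are invented for this example, and nativeOpsPerVectorOp is an idealized lower bound that assumes a wide operation splits cleanly into narrower ones; real lowering can be considerably worse when the instruction has no narrow equivalent.

```typescript
// Number of elements that fit in one vector register.
function lanesPerRegister(registerBits: number, elementBits: number): number {
  return Math.floor(registerBits / elementBits);
}

// Idealized count of native instructions needed to emulate one vector
// operation written for `apiLanes` lanes on hardware with `hardwareLanes`.
function nativeOpsPerVectorOp(apiLanes: number, hardwareLanes: number): number {
  return Math.ceil(apiLanes / hardwareLanes);
}

// 512-bit registers hold 16 32-bit floats; 128-bit registers hold 4.
const wideLanes = lanesPerRegister(512, 32);
const narrowLanes = lanesPerRegister(128, 32);
```

Under these assumptions, code written for 16 lanes needs at least four native operations per vector operation on a 128-bit machine, while code written for 4 lanes leaves three quarters of a 512-bit register unused.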

Code that uses SIMD.js is not performance-portable. Modern vectorizing 
compilers feature complex cost models and heuristics for deciding when to 
vectorize, at which vector width, and how many loop iterations to interleave. 
The cost models take into account the features of the vector instruction set, 
properties of the architecture such as the number of vector registers, and 
properties of the current processor generation. Making a poor selection 
decision on any of the vectorization parameters can result in a major 
performance regression. Executing vector intrinsics on processors that don’t 
support them is slower than executing multiple scalar instructions because the 
compiler can’t always generate efficient code with the same semantics.

I don’t believe that it is possible to write non-trivial vector code that will 
show performance gains on processors from different families. Executing vector 
code with insufficient hardware support will cause major performance 
regressions. One of the motivations for SIMD.js was to allow Emscripten 
https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Emscripten to 
vectorize C code and to emit JavaScript SIMD intrinsics. One problem with this 
suggestion is that the Emscripten compiler should not be assuming that the 
target is an x86 machine and that a specific vector width and interleave width 
is the right answer. 

Re: [webkit-dev] SIMD support in JavaScript

2014-09-26 Thread Benjamin Poulain

Thanks for sharing your analysis on webkit-dev.

There has been a lot of criticism of SIMD.js this year. It is great 
to read about solutions for vectorization without the problems of SIMD.js.


Benjamin
