Re: [webkit-dev] SIMD support in JavaScript

2014-09-29 Thread Dan Gohman
Hi Nadav,

- Original Message -
 Hi Dan!
 
  On Sep 28, 2014, at 6:44 AM, Dan Gohman sunf...@mozilla.com wrote:
  
  Hi Nadav,
  
  I agree with much of your assessment of the proposed SIMD.js API.
  However, I don't believe its unsuitability for some problems
  invalidates it for solving other very important problems, which it is
  well suited for. Performance portability is actually one of SIMD.js'
  biggest strengths: it's not the kind of performance portability that
  aims for a consistent percentage of peak on every machine (which, as you
  note, of course an explicit 128-bit SIMD API won't achieve), it's the
  kind of performance portability that achieves predictable performance
  and minimizes surprises across machines (though yes, there are some
  unavoidable ones, but overall the picture is quite good).
 
 There is a tradeoff between the performance portability of the SIMD.js ISA
 and its usefulness. A small number of instructions (that only targets 32-bit
 data types, with no masks, etc.) is not useful for developing non-trivial vector
 programs. You need 16-bit vector elements to support WebGL vertex indices,
 and lane-masking for implementing predicated control flow for programs like
 ray tracers. Introducing a large number of vector instructions will expose
 the performance portability problems. I don’t believe that there is a sweet
 spot in this tradeoff. I don’t think that we can find a small set of
 instructions that will be useful for writing non-trivial vector code that is
 performance portable.

My belief in the existence of a sweet spot is based on looking at other 
systems, hardware and software, that have already gone there.

For an interesting example, take a look at this page:

https://software.intel.com/en-us/articles/interactive-ray-tracing

Every SIMD operation used in that article is directly supported by a 
corresponding function in SIMD.js today. We do have an open question on whether 
we should do something different for the rsqrt instruction, since the hardware 
only provides an approximation. In this case the code requires a 
Newton-Raphson refinement, which may give us some flexibility, but several 
things are possible there. And of course, a sweet spot doesn't mean a cure-all.
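To make the rsqrt point concrete, here is a minimal sketch of the Newton-Raphson refinement Dan mentions, written in plain JavaScript since no engine ships SIMD.js. The `coarseRsqrt` helper is hypothetical: it simulates a low-precision hardware approximation (like SSE's rsqrtps) by injecting a small error into the exact value.

```javascript
// Hypothetical sketch: refining an approximate reciprocal square root
// (as hardware rsqrt instructions provide) with one Newton-Raphson step.
function coarseRsqrt(x) {
  // Simulated low-precision hardware estimate (assumption: ~1e-3 error).
  return Math.fround(1 / Math.sqrt(x)) + 1e-3;
}

function refinedRsqrt(x) {
  const y = coarseRsqrt(x);
  // One Newton-Raphson iteration for f(y) = 1/y^2 - x = 0:
  // y' = y * (1.5 - 0.5 * x * y * y), squaring the relative accuracy.
  return y * (1.5 - 0.5 * x * y * y);
}

const exact = 1 / Math.sqrt(2);
console.log(Math.abs(coarseRsqrt(2) - exact) > Math.abs(refinedRsqrt(2) - exact)); // true
```

One iteration roughly squares the relative error, which is why a coarse hardware estimate plus a fixed refinement step can be a portable lowering strategy.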

Also, I am preparing to propose that SIMD.js handle 16-bit vector elements too 
(int16x8). It fits pretty naturally into the overall model. There are some 
challenges on some architectures, but there are challenges with alternative 
approaches too, and overall the story looks good.

Other changes are being discussed as well. In general, the SIMD.js spec is 
still evolving; participation is welcome :-).

  This is an example of a weakness of depending on automatic vectorization
  alone. High-level language features create complications which can lead
  to surprising performance problems. Compiler transformations to target
  specialized hardware features often have widely varying applicability.
  Expensive analyses can sometimes enable more and better vectorization,
  but when a compiler has to do an expensive complex analysis in order to
  optimize, it's unlikely that a programmer can count on other compilers
  doing the exact same analysis and optimizing in all the same cases. This
  is a problem we already face in many areas of compilers, but it's more
  pronounced with vectorization than many other optimizations.
 
 I agree with this argument. Compiler optimizations are unpredictable. You
 never know when the register allocator will decide to spill a variable
 inside a hot loop, or when a memory operation will confuse the alias
 analysis. I also agree that loop vectorization is especially sensitive.
 However, it looks like the kind of vectorization that is needed to replace
 SIMD.js is a very simple SLP vectorization
 http://llvm.org/docs/Vectorizers.html#the-slp-vectorizer (BB
 vectorization). It is really easy for a compiler to combine a few scalar
 arithmetic operations into a vector. LLVM’s SLP-vectorizer supports
 vectorization of computations across basic blocks and succeeds in surprising
 places, like vectorization of STDLIB code where the ‘begin' and ‘end'
 iterators fit into a 128-bit register!

That's a surprising trick!

I agree that SLP vectorization doesn't have the same level of performance 
cliff as loop vectorization, and it may be a desirable thing for JS JITs to 
start doing.
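The straight-line pattern being discussed can be sketched as follows; the function name is hypothetical, and plain typed arrays stand in for vector registers:

```javascript
// Four independent scalar operations on adjacent elements: exactly the
// shape an SLP (superword-level parallelism) vectorizer can fuse into a
// single 128-bit vector load/add/store sequence.
function addXYZW(out, p, q, i) {
  out[i + 0] = p[i + 0] + q[i + 0];
  out[i + 1] = p[i + 1] + q[i + 1]; // an SLP pass can rewrite these four
  out[i + 2] = p[i + 2] + q[i + 2]; // adds as one float32x4 operation
  out[i + 3] = p[i + 3] + q[i + 3];
}

const out = new Float32Array(4);
addXYZW(out, Float32Array.of(1, 2, 3, 4), Float32Array.of(10, 20, 30, 40), 0);
console.log(Array.from(out)); // [11, 22, 33, 44]
```

The SLP question is whether a JIT will reliably *recognize* this shape; the SIMD.js question is whether the developer should be able to ask for it by name.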

Even so, there is still value in an explicit SIMD API in the present. For the 
core features, instead of giving developers sets of expression patterns to 
follow to ensure SLP recognition, we are giving names to those patterns and 
letting developers identify which patterns they wish to use by their names. We 
can coordinate, compare, and standardize them by name across browsers, and in 
the future we may make a variety of interesting extensions to the API which 
developers will be able to feature-test for.
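The feature-testing story can be sketched like this. The `SIMD.float32x4` names follow the strawman spec, but the fallback shape is my own illustration, not code from the proposal:

```javascript
// Sketch: use the named SIMD.js operation when the engine provides it,
// otherwise fall back to the equivalent scalar pattern. API names follow
// the ECMAScript SIMD strawman; no shipping engine exposes them today.
function add4(a, b) {
  if (typeof SIMD !== 'undefined' && SIMD.float32x4 && SIMD.float32x4.add) {
    return SIMD.float32x4.add(a, b); // one named vector operation
  }
  // Scalar fallback with the same semantics.
  const out = new Float32Array(4);
  for (let i = 0; i < 4; i++) out[i] = a[i] + b[i];
  return out;
}

console.log(Array.from(add4(Float32Array.of(1, 2, 3, 4),
                            Float32Array.of(4, 3, 2, 1)))); // [5, 5, 5, 5]
```

This is the sense in which the API names patterns: the developer states intent once, and engines can coordinate on what that name means.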

And if, in the future, SLP vectorization proves itself reliable enough in JS, 
then we can drop our custom JIT 

Re: [webkit-dev] SIMD support in JavaScript

2014-09-29 Thread Dan Gohman
Hi Maciej,

- Original Message -
 
 Dan, you say that SIMD.js delivers performance portability, and Nadav says it
 doesn’t.
 
 Nadav’s argument seems to come down to (as I understand it):
 - The set of vector operations supported on different CPU architectures
 varies widely.

This is true, but it's also true that there is a core set of features which is 
pretty consistent across popular SIMD architectures. This commonality exists 
because it's a very popular set. The proposed SIMD.js doesn't solve all 
problems, but it does solve a large number of important problems well, and it 
is following numerous precedents.

We are also exploring the possibility of exposing additional instructions 
outside this core set. Several creative ideas are being discussed which could 
expand the API's reach while preserving a portability story. However, 
regardless of what we do there, I expect the core set will remain a prominent 
part of the API, due to its applicability.

 - Executing vector intrinsics on processors that don’t support them is
 slower than executing multiple scalar instructions because the compiler
 can’t always generate efficient code with the same semantics.

This is also true, however the intent of SIMD.js *is* to be implementable on 
all popular architectures. The SIMD.js spec is originally derived from the Dart 
SIMD spec, which is already implemented and in use on at least x86 and ARM. We 
are also taking some ideas from OpenCL, which offers a very similar set of core 
functionality, and which is implemented on even more architectures. We have 
several reasons to expect that SIMD.js can cover enough functionality to be 
useful while still being sufficiently portable.

 - Even when vector intrinsics are supported by the CPU, whether it is
 profitable to use them may depend in non-obvious ways on exact
 characteristics of the target CPU and the surrounding code (the Port5
 example).

With SIMD.js, there are plain integer types that let developers bypass JS 
number semantics directly, so there are fewer corner cases for the compiler to 
insert extra code to check for. This means fewer branches and, among other 
things, should mean less port 5 contention overall on Sandy Bridge.
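The integer-semantics point can be made concrete. JS numbers are doubles, so an "integer" add can silently exceed 2^31 and the JIT must guard for that; a 32-bit lane (sketched here with Int32Array, standing in for an int32x4 lane) has fixed wraparound semantics and needs no guard:

```javascript
// Plain JS: integer-looking arithmetic promotes to double past INT32_MAX,
// so a JIT speculating on int32 needs an overflow check and a deopt path.
console.log(0x7fffffff + 1); // 2147483648 -- a double, not an int32

// A 32-bit lane (emulated with Int32Array) simply wraps; the semantics are
// fixed, so no guard, branch, or deopt path is required.
const lane = new Int32Array(1);
lane[0] = 0x7fffffff;
lane[0] = lane[0] + 1; // ToInt32 wraparound
console.log(lane[0]); // -2147483648
```

Every such eliminated guard is one fewer branch competing for execution resources like Sandy Bridge's port 5.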

Furthermore, automatic vectorization often requires the compiler to make 
conservative assumptions about key information like pointer aliasing, trip
counts, integer overflow, array indexing, load safety, scatter ordering, 
alignment, and more. In order to preserve observable semantics, these 
assumptions cause compilers to insert extra instructions, which are typically 
things like selects, shuffles, branches or other things, to handle all the 
possible corner cases. This is extra overhead that human programmers can often 
avoid, because they can more easily determine what corner cases are relevant in 
a given piece of code. And on Sandy Bridge in particular, these extra selects, 
shuffles, and branches hit port 5.
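The aliasing hazard behind those conservative assumptions is easy to demonstrate. In this sketch (my own example, not from the thread), the destination overlaps the source shifted by one element, so each store feeds the next load; a vectorizer that loaded four elements at once would compute a different result, which is why it must emit runtime overlap checks:

```javascript
// Serial semantics: each iteration's store is visible to the next load.
function addOne(dst, src, n) {
  for (let i = 0; i < n; i++) dst[i] = src[i] + 1;
}

const buf = Int32Array.of(0, 0, 0, 0, 0);
// dst overlaps src, shifted by one element.
addOne(buf.subarray(1), buf, 4);
console.log(Array.from(buf)); // [0, 1, 2, 3, 4] -- the stores chain

// A naive vectorization would load src[0..3] = [0,0,0,0] up front and
// store [1,1,1,1], producing [0,1,1,1,1] -- an observable difference.
```

A human using an explicit SIMD API knows whether overlap is possible in their own code and can skip the check entirely.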

 For these reasons, Nadav says that it’s better to autovectorize, and that
 this is the norm even for languages with explicit vector data. In other
 words, he’s saying that SIMD.js will result in code that is not
 performance-portable between different CPUs.

I question whether it is actually the norm. In C++, where auto-vectorization is 
available in every major compiler today, explicit SIMD APIs like xmmintrin.h 
are hugely popular. That particular header is supported by Microsoft's 
C++ compiler, Intel's C++ compiler, GCC, and Clang. I see many uses of 
xmmintrin.h in many contexts, including HPC, graphics, codecs, cryptography, 
and games. It seems many C++ developers are still willing to go through the 
pain of #ifdefs, preprocessor macros, and funny-looking syntax rather than rely 
on auto-vectorization, even with restrict and other aids.

Both auto-vectorization and SIMD.js have their strengths, and both have their 
weaknesses. I don't believe that the fact that each solves some problems the 
other doesn't rules out either of them.

 I don’t see a rebuttal to any of these points. Instead, you argue that,
 because SIMD.js does not require advanced compiler analysis, it is more
 likely to give similar results between different JITs (presumably when
 targeting the same CPU, or ones with the same supported vector operations
 and similar perf characteristics). That seems like a totally different sense
 of performance portability.

 Given these arguments, it’s possible that you and Nadav are both right[*].
 That would mean that both these statements hold:
 (a) SIMD.js is not performance-portable between different CPU architectures
 and models.
 (b) SIMD.js is performance-portable between different JITs targeting the same
 CPU model.
 
 On net, I think that combination would be a strong argument *against*
 SIMD.js. The Web aims for portability between different hardware and not
 just different software. At Apple alone we support four major CPU
 instruction sets and a considerably greater number of 

Re: [webkit-dev] SIMD support in JavaScript

2014-09-28 Thread Dan Gohman
Hi Nadav,

I agree with much of your assessment of the proposed SIMD.js API.
However, I don't believe its unsuitability for some problems
invalidates it for solving other very important problems, which it is
well suited for. Performance portability is actually one of SIMD.js'
biggest strengths: it's not the kind of performance portability that
aims for a consistent percentage of peak on every machine (which, as you
note, of course an explicit 128-bit SIMD API won't achieve), it's the
kind of performance portability that achieves predictable performance
and minimizes surprises across machines (though yes, there are some
unavoidable ones, but overall the picture is quite good).

On 09/26/2014 03:16 PM, Nadav Rotem wrote:
 So far, I’ve explained why I believe SIMD.js will not be
 performance-portable and why it will not utilize modern instruction
 sets, but I have not made a suggestion on how to use vector
 instructions to accelerate JavaScript programs. Vectorization, like
 instruction scheduling and register allocation, is a code-generation
 problem. In order to solve these problems, it is necessary for the
 compiler to have intimate knowledge of the architecture. Forcing the
 compiler to use a specific instruction or a specific data-type is the
 wrong answer. We can learn a lesson from the design of compilers for
 data-parallel languages. GPU programs (shaders and compute languages,
 such as OpenCL and GLSL) are written using vector instructions because
 the domain of the problem requires vectors (colors and coordinates).
 One of the first things that data-parallel compilers do is to break
 vector instructions into scalars (this process is called
 scalarization). After getting rid of the vectors that resulted from
 the problem domain, the compiler may begin to analyze the program,
 calculate profitability, and make use of the available instruction set.

 I believe that it is the responsibility of JIT compilers to use vector
 instructions. In the implementation of WebKit’s FTL JIT compiler,
 we took one step in the direction of using vector instructions. LLVM
 already vectorizes some code sequences during instruction selection,
 and we started investigating the use of LLVM’s Loop and SLP
 vectorizers. We found that despite nice performance gains on a number
 of workloads, we experienced some performance regressions on Intel’s
 Sandy Bridge processors, which are currently very popular desktop
 processors. JavaScript code contains many branches (due to dynamic
 speculation). Unfortunately, branches on Sandy Bridge execute on Port5,
 which is also where many vector instructions are executed. So,
 pressure on Port5 prevented performance gains. The LLVM vectorizer
 currently does not model execution port pressure and we had to disable
 vectorization in FTL. In the future, we intend to enable more
 vectorization features in FTL.

This is an example of a weakness of depending on automatic vectorization
alone. High-level language features create complications which can lead
to surprising performance problems. Compiler transformations to target
specialized hardware features often have widely varying applicability.
Expensive analyses can sometimes enable more and better vectorization,
but when a compiler has to do an expensive complex analysis in order to
optimize, it's unlikely that a programmer can count on other compilers
doing the exact same analysis and optimizing in all the same cases. This
is a problem we already face in many areas of compilers, but it's more
pronounced with vectorization than many other optimizations.

In contrast, the proposed SIMD.js has the property that code using it
will not depend on expensive compiler analysis in the JIT, and is much
more likely to deliver predictable performance in practice between
different JIT implementations and across a very practical variety of
hardware architectures.


 To summarize, SIMD.js will not provide a portable performance solution
 because vector instruction sets are sparse and vary between
 architectures and generations. Emscripten should not generate vector
 instructions because it can’t model the target machine. SIMD.js will
 not make use of modern SIMD features such as predication or
 scatter/gather. Vectorization is a compiler code generation problem
 that should be solved by JIT compilers, and not by the language
 itself. JIT compilers should continue to evolve and to start
 vectorizing code like modern compilers.

As I mentioned above, performance portability is actually one of
SIMD.js's core strengths.

I have found it useful to think of the API proposed in SIMD.js as a
short vector API. It hits a sweet spot, being a convenient size for
many XYZW and RGB/RGBA and similar algorithms, being implementable on a
wide variety of very relevant hardware architectures, being long enough
to deliver worthwhile speedups for many tasks, and being short enough to
still be convenient to manipulate.
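The XYZW/RGBA fit is worth seeing concretely. This hypothetical brightness kernel (my own illustration, written with a plain typed array to mirror what the float32x4 form would express) does conceptually one 4-lane multiply per pixel:

```javascript
// One RGBA pixel maps exactly onto a 4-lane short vector: scaling the
// color channels is conceptually a single float32x4 multiply per pixel.
function scaleRGB(pixels, factor) {
  for (let i = 0; i < pixels.length; i += 4) {
    pixels[i + 0] *= factor; // R \
    pixels[i + 1] *= factor; // G  > one vector multiply in SIMD.js form
    pixels[i + 2] *= factor; // B /
    // alpha (pixels[i + 3]) left untouched
  }
  return pixels;
}

console.log(Array.from(scaleRGB(Float32Array.of(0.5, 0.25, 1, 1), 2)));
// [1, 0.5, 2, 1]
```

No strip-mining, tail loops, or width selection is needed: the problem's natural vector length is the hardware's.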

I agree that the short vector model doesn't address all 

Re: [webkit-dev] SIMD support in JavaScript

2014-09-28 Thread Anne van Kesteren
Could this thread maybe be moved to es-discuss?
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev


Re: [webkit-dev] SIMD support in JavaScript

2014-09-28 Thread Filip Pizlo
You are free to start such a discussion on es-discuss. I think it's useful for 
the WebKit community to be able to discuss what we think of the feature. 

-Filip

 On Sep 28, 2014, at 8:39 AM, Anne van Kesteren ann...@annevk.nl wrote:
 
 Could this thread maybe be moved to es-discuss?


Re: [webkit-dev] SIMD support in JavaScript

2014-09-28 Thread Maciej Stachowiak

We will probably bring it up on es-discuss once we’ve had a chance to discuss 
it among WebKit folks.

 - Maciej

 On Sep 28, 2014, at 8:39 AM, Anne van Kesteren ann...@annevk.nl wrote:
 
 Could this thread maybe be moved to es-discuss?


Re: [webkit-dev] SIMD support in JavaScript

2014-09-28 Thread Maciej Stachowiak

Dan, you say that SIMD.js delivers performance portability, and Nadav says it 
doesn’t. 

Nadav’s argument seems to come down to (as I understand it):
- The set of vector operations supported on different CPU architectures varies 
widely.
- Executing vector intrinsics on processors that don’t support them is slower 
than executing multiple scalar instructions because the compiler can’t always 
generate efficient code with the same semantics.
- Even when vector intrinsics are supported by the CPU, whether it is 
profitable to use them may depend in non-obvious ways on exact characteristics 
of the target CPU and the surrounding code (the Port5 example).

For these reasons, Nadav says that it’s better to autovectorize, and that this 
is the norm even for languages with explicit vector data. In other words, he’s 
saying that SIMD.js will result in code that is not performance-portable 
between different CPUs.


I don’t see a rebuttal to any of these points. Instead, you argue that, because 
SIMD.js does not require advanced compiler analysis, it is more likely to give 
similar results between different JITs (presumably when targeting the same CPU, 
or ones with the same supported vector operations and similar perf 
characteristics). That seems like a totally different sense of performance 
portability.


Given these arguments, it’s possible that you and Nadav are both right[*]. That 
would mean that both these statements hold:
(a) SIMD.js is not performance-portable between different CPU architectures and 
models.
(b) SIMD.js is performance-portable between different JITs targeting the same 
CPU model.

On net, I think that combination would be a strong argument *against* SIMD.js. 
The Web aims for portability between different hardware and not just different 
software. At Apple alone we support four major CPU instruction sets and a 
considerably greater number of specific CPU models. From our point of view, 
code that is performance-portable between JITs but not between CPUs would not 
be good enough, and it might be actively bad if it results in worse performance 
on some of our CPU architectures. The WebKit community as a whole supports even 
more target CPU architectures.

Do you agree with the above assessment? Alternately, do you have an argument 
that SIMD.js is performance-portable between different CPU architectures?

Regards,
Maciej

[*] I’m not totally convinced about your argument for cross-JIT performance 
portability. It seems to me that, in the case of the Port5 problem, different 
JITs could have different levels of Port5 contention, so you would not get the 
same results. But let’s grant it for the sake of argument.


 On Sep 28, 2014, at 6:44 AM, Dan Gohman sunf...@mozilla.com wrote:
 
 Hi Nadav,
 
 I agree with much of your assessment of the proposed SIMD.js API.
 However, I don't believe its unsuitability for some problems
 invalidates it for solving other very important problems, which it is
 well suited for. Performance portability is actually one of SIMD.js'
 biggest strengths: it's not the kind of performance portability that
 aims for a consistent percentage of peak on every machine (which, as you
 note, of course an explicit 128-bit SIMD API won't achieve), it's the
 kind of performance portability that achieves predictable performance
 and minimizes surprises across machines (though yes, there are some
 unavoidable ones, but overall the picture is quite good).
 
 On 09/26/2014 03:16 PM, Nadav Rotem wrote:
 So far, I’ve explained why I believe SIMD.js will not be
 performance-portable and why it will not utilize modern instruction
 sets, but I have not made a suggestion on how to use vector
 instructions to accelerate JavaScript programs. Vectorization, like
 instruction scheduling and register allocation, is a code-generation
 problem. In order to solve these problems, it is necessary for the
 compiler to have intimate knowledge of the architecture. Forcing the
 compiler to use a specific instruction or a specific data-type is the
 wrong answer. We can learn a lesson from the design of compilers for
 data-parallel languages. GPU programs (shaders and compute languages,
 such as OpenCL and GLSL) are written using vector instructions because
 the domain of the problem requires vectors (colors and coordinates).
 One of the first things that data-parallel compilers do is to break
 vector instructions into scalars (this process is called
 scalarization). After getting rid of the vectors that resulted from
 the problem domain, the compiler may begin to analyze the program,
 calculate profitability, and make use of the available instruction set.
 
 I believe that it is the responsibility of JIT compilers to use vector
 instructions. In the implementation of WebKit’s FTL JIT compiler,
 we took one step in the direction of using vector instructions. LLVM
 already vectorizes some code sequences during instruction selection,
 and we started investigating the use of LLVM’s Loop and SLP
 

Re: [webkit-dev] SIMD support in JavaScript

2014-09-28 Thread Nadav Rotem
Hi Dan!

 On Sep 28, 2014, at 6:44 AM, Dan Gohman sunf...@mozilla.com wrote:
 
 Hi Nadav,
 
 I agree with much of your assessment of the proposed SIMD.js API.
 However, I don't believe its unsuitability for some problems
 invalidates it for solving other very important problems, which it is
 well suited for. Performance portability is actually one of SIMD.js'
 biggest strengths: it's not the kind of performance portability that
 aims for a consistent percentage of peak on every machine (which, as you
 note, of course an explicit 128-bit SIMD API won't achieve), it's the
 kind of performance portability that achieves predictable performance
 and minimizes surprises across machines (though yes, there are some
 unavoidable ones, but overall the picture is quite good).

There is a tradeoff between the performance portability of the SIMD.js ISA and 
its usefulness. A small number of instructions (that only targets 32-bit data 
types, with no masks, etc.) is not useful for developing non-trivial vector 
programs. You need 16-bit vector elements to support WebGL vertex indices, and 
lane-masking for implementing predicated control flow for programs like ray 
tracers. Introducing a large number of vector instructions will expose the 
performance portability problems. I don’t believe that there is a sweet spot in 
this tradeoff. I don’t think that we can find a small set of instructions that 
will be useful for writing non-trivial vector code that is performance portable.

 
 On 09/26/2014 03:16 PM, Nadav Rotem wrote:
 So far, I’ve explained why I believe SIMD.js will not be
 performance-portable and why it will not utilize modern instruction
 sets, but I have not made a suggestion on how to use vector
 instructions to accelerate JavaScript programs. Vectorization, like
 instruction scheduling and register allocation, is a code-generation
 problem. In order to solve these problems, it is necessary for the
 compiler to have intimate knowledge of the architecture. Forcing the
 compiler to use a specific instruction or a specific data-type is the
 wrong answer. We can learn a lesson from the design of compilers for
 data-parallel languages. GPU programs (shaders and compute languages,
 such as OpenCL and GLSL) are written using vector instructions because
 the domain of the problem requires vectors (colors and coordinates).
 One of the first things that data-parallel compilers do is to break
 vector instructions into scalars (this process is called
 scalarization). After getting rid of the vectors that resulted from
 the problem domain, the compiler may begin to analyze the program,
 calculate profitability, and make use of the available instruction set.
 
 I believe that it is the responsibility of JIT compilers to use vector
 instructions. In the implementation of WebKit’s FTL JIT compiler,
 we took one step in the direction of using vector instructions. LLVM
 already vectorizes some code sequences during instruction selection,
 and we started investigating the use of LLVM’s Loop and SLP
 vectorizers. We found that despite nice performance gains on a number
 of workloads, we experienced some performance regressions on Intel’s
 Sandy Bridge processors, which are currently very popular desktop
 processors. JavaScript code contains many branches (due to dynamic
 speculation). Unfortunately, branches on Sandy Bridge execute on Port5,
 which is also where many vector instructions are executed. So,
 pressure on Port5 prevented performance gains. The LLVM vectorizer
 currently does not model execution port pressure and we had to disable
 vectorization in FTL. In the future, we intend to enable more
 vectorization features in FTL.
 
 This is an example of a weakness of depending on automatic vectorization
 alone. High-level language features create complications which can lead
 to surprising performance problems. Compiler transformations to target
 specialized hardware features often have widely varying applicability.
 Expensive analyses can sometimes enable more and better vectorization,
 but when a compiler has to do an expensive complex analysis in order to
 optimize, it's unlikely that a programmer can count on other compilers
 doing the exact same analysis and optimizing in all the same cases. This
 is a problem we already face in many areas of compilers, but it's more
 pronounced with vectorization than many other optimizations.

I agree with this argument. Compiler optimizations are unpredictable. You never 
know when the register allocator will decide to spill a variable inside a hot 
loop, or when a memory operation will confuse the alias analysis. I also agree 
that loop vectorization is especially sensitive.
However, it looks like the kind of vectorization that is needed to replace 
SIMD.js is a very simple SLP vectorization 
http://llvm.org/docs/Vectorizers.html#the-slp-vectorizer (BB vectorization). 
It is really easy for a compiler to combine a few scalar arithmetic operations 
into a vector. LLVM’s SLP-vectorizer 

[webkit-dev] SIMD support in JavaScript

2014-09-26 Thread Nadav Rotem
Recently members of the JavaScript community at Intel and Mozilla have 
suggested http://www.2ality.com/2013/12/simd-js.html adding SIMD types to the 
JavaScript language. In this email would like to share my thoughts about this 
proposal and to start a technical discussion about SIMD.js support in Webkit. I 
BCCed some of the authors of the proposal to allow them to participate in this 
discussion. 

Modern processors feature SIMD (Single Instruction Multiple Data) 
http://en.wikipedia.org/wiki/SIMD instructions, which perform the same 
arithmetic operation on a vector of elements. SIMD instructions are used to 
accelerate compute intensive code, like image processing algorithms, because 
the same calculation is applied to every pixel in the image. A single SIMD 
instruction can process 4 or 8 pixels at the same time. Compilers try to make 
use of SIMD instructions in an optimization that is called vectorization. 

The SIMD.js API http://wiki.ecmascript.org/doku.php?id=strawman:simd_number 
adds new types, such as float32x4, and operators that map to vector 
instructions on most processors. The idea behind the proposal is that manual 
use of vector instructions, just like intrinsics in C, will allow developers to 
accelerate common compute-intensive JavaScript applications. The idea of using 
SIMD instructions to accelerate JavaScript code is compelling because high 
performance applications in JavaScript are becoming very popular. 

Before I became involved with JavaScript through my work on the FTL project 
https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-jit/, I developed 
the LLVM vectorizer http://llvm.org/docs/Vectorizers.html and worked on a 
vectorizing compiler for a data-parallel programming language. Based on my 
experience with vectorization, I believe that the current proposal to include 
SIMD types in the JavaScript language is not the right approach to utilize SIMD 
instructions. In this email I argue that vector types should not be added to 
the JavaScript language.

Vector instruction sets are sparse, asymmetrical, and vary in size and features 
from one generation to another. For example, some Intel processors feature 
512-bit wide vector instructions 
https://software.intel.com/en-us/blogs/2013/avx-512-instructions. This means 
that they can process 16 floating point numbers with one instruction. However, 
today’s high-end ARM processors feature 128-bit wide vector instructions 
http://www.arm.com/products/processors/technologies/neon.php and can only 
process 4 floating point elements. ARM processors support byte-sized blend 
instructions but only recent Intel processors added support for byte-sized 
blends. ARM processors support variable shifts but only Intel processors with 
AVX2 support variable shifts. Different generations of Intel processors support 
different instruction sets with different features such as broadcasting from a 
local register, 16-bit and 64-bit arithmetic, and varied shuffles. Modern 
processors even feature predicated arithmetic and scatter/gather instructions 
that are very difficult to model using target independent high-level 
intrinsics. 
The designers of the high-level target independent API should decide if they 
want to support the union of all vector instruction sets, or the intersection. 
A subset of the vector instructions that represents the intersection of all 
popular instruction sets is not usable for writing non-trivial vector 
programs. And the superset of the vector instructions will cause huge 
performance regressions on platforms that do not support the used instructions.

Code that uses SIMD.js is not performance-portable. Modern vectorizing 
compilers feature complex cost models and heuristics for deciding when to 
vectorize, at which vector width, and how many loop iterations to interleave. 
The cost model takes into account the features of the vector instruction set, 
properties of the architecture such as the number of vector registers, and 
properties of the current processor generation. Making a poor selection 
decision on any of the vectorization parameters can result in a major 
performance regression. Executing vector intrinsics on processors that don’t 
support them is slower than executing multiple scalar instructions because the 
compiler can’t always generate efficient code with the same semantics.
I don’t believe that it is possible to write non-trivial vector code that will 
show performance gains on processors from different families. Executing vector 
code with insufficient hardware support will cause major performance 
regressions. One of the motivations for SIMD.js was to allow Emscripten 
https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Emscripten to 
vectorize C code and to emit JavaScript SIMD intrinsics. One problem with this 
suggestion is that the Emscripten compiler should not be assuming that the 
target is an x86 machine and that a specific vector width and interleave width 
is the right answer. 

Re: [webkit-dev] SIMD support in JavaScript

2014-09-26 Thread Benjamin Poulain

Thanks for sharing your analysis on webkit-dev.

There has been a lot of criticism of SIMD.js this year. It is great 
to read about solutions for vectorization without the problems of SIMD.js.


Benjamin

On 9/26/14, 3:16 PM, Nadav Rotem wrote:

Recently, members of the JavaScript community at Intel and Mozilla
have suggested adding SIMD types to the JavaScript language
(http://www.2ality.com/2013/12/simd-js.html). In this email I would
like to share my thoughts about this proposal and to start a technical
discussion about SIMD.js support in WebKit. I BCCed some of the authors
of the proposal to allow them to participate in this discussion.

Modern processors feature SIMD (Single Instruction Multiple Data)
http://en.wikipedia.org/wiki/SIMD instructions, which perform the same
arithmetic operation on a vector of elements. SIMD instructions are used
to accelerate compute-intensive code, like image processing algorithms,
because the same calculation is applied to every pixel in the image. A
single SIMD instruction can process 4 or 8 pixels at the same time.
Compilers try to make use of SIMD instructions in an optimization that
is called vectorization.
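The idea can be sketched in plain JavaScript (no real SIMD involved; the 4-wide loop only mirrors, in scalar code, what a single 128-bit instruction would do to four pixels at once):

```javascript
// Scalar version: one pixel per iteration, one multiply per "instruction".
function brightenScalar(pixels, factor) {
  const out = new Float32Array(pixels.length);
  for (let i = 0; i < pixels.length; i++) {
    out[i] = pixels[i] * factor;
  }
  return out;
}

// "Vectorized" version: four pixels per iteration. Each group of four
// multiplies below is conceptually a single float32x4 instruction.
function brighten4Wide(pixels, factor) {
  const out = new Float32Array(pixels.length);
  let i = 0;
  for (; i + 4 <= pixels.length; i += 4) {
    out[i]     = pixels[i]     * factor;
    out[i + 1] = pixels[i + 1] * factor;
    out[i + 2] = pixels[i + 2] * factor;
    out[i + 3] = pixels[i + 3] * factor;
  }
  for (; i < pixels.length; i++) out[i] = pixels[i] * factor; // scalar tail
  return out;
}
```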

The SIMD.js API
http://wiki.ecmascript.org/doku.php?id=strawman:simd_number adds new
types, such as float32x4, and operators that map to vector instructions
on most processors. The idea behind the proposal is that manual use of
vector instructions, just like intrinsics in C, will allow developers to
accelerate common compute-intensive JavaScript applications. The idea of
using SIMD instructions to accelerate JavaScript code is compelling
because high performance applications in JavaScript are becoming very
popular.
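As a rough illustration of the API's shape (the names below loosely follow the strawman and are assumptions; the proposal's exact naming evolved across drafts, and this shim performs four scalar operations where a real engine would emit a single vector instruction):

```javascript
// Minimal illustrative shim of the proposed float32x4 type and two of its
// operators. This only models the semantics; it delivers none of the
// hardware speedup the real proposal is after.
const SIMD = {
  float32x4(x, y, z, w) {
    return Float32Array.of(x, y, z, w); // 4 lanes of 32-bit floats
  },
};
SIMD.float32x4.add = (a, b) =>
  Float32Array.of(a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]);
SIMD.float32x4.mul = (a, b) =>
  Float32Array.of(a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]);

const a = SIMD.float32x4(1, 2, 3, 4);
const b = SIMD.float32x4(10, 20, 30, 40);
const sum = SIMD.float32x4.add(a, b); // lanes: 11, 22, 33, 44
```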

Before I became involved with JavaScript through my work on the FTL
project
https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-jit/, I
developed the LLVM vectorizer
http://llvm.org/docs/Vectorizers.html and worked on a vectorizing
compiler for a data-parallel programming language. Based on my
experience with vectorization, I believe that the current proposal to
include SIMD types in the JavaScript language is not the right approach
to utilize SIMD instructions. In this email I argue that vector types
should not be added to the JavaScript language.

Vector instruction sets are sparse, asymmetrical, and vary in size and
features from one generation to another. For example, some Intel
processors feature 512-bit wide vector instructions
https://software.intel.com/en-us/blogs/2013/avx-512-instructions. This
means that they can process 16 floating point numbers with one
instruction. However, today’s high-end ARM processors feature 128-bit
wide vector instructions
http://www.arm.com/products/processors/technologies/neon.php and can
only process 4 floating-point elements. ARM processors support
byte-sized blend instructions, but only recent Intel processors added
support for byte-sized blends. ARM processors support variable shifts,
but only Intel processors with AVX2 support variable shifts. Different
generations of Intel processors support different instruction sets with
different features such as broadcasting from a local register, 16-bit
and 64-bit arithmetic, and varied shuffles. Modern processors even
feature predicated arithmetic and scatter/gather instructions that are
very difficult to model using target independent high-level intrinsics.
The designers of the high-level target independent API should decide if
they want to support the union of all vector instruction sets, or the
intersection. A subset of the vector instructions that represents the
intersection of all popular instruction sets is not usable for writing
non-trivial vector programs, while the superset of the vector instructions
will cause huge performance regressions on platforms that do not support
the used instructions.
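The cost of the superset approach can be sketched as follows: when the target lacks, say, a byte-sized blend, the engine must fall back to one scalar select per lane (this plain-JS sketch stands in for the generated fallback; `blendBytes` is an illustrative name, not part of any proposal):

```javascript
// Lane-by-lane emulation of a byte-sized blend. On ARM NEON the whole
// loop body for a 16-byte vector is one VBSL instruction; without hardware
// support it becomes one scalar conditional move per lane.
function blendBytes(mask, a, b) {
  const out = new Uint8Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = mask[i] ? a[i] : b[i]; // scalar select per lane
  }
  return out;
}

const mask = Uint8Array.of(1, 0, 1, 0);
const x = Uint8Array.of(10, 20, 30, 40);
const y = Uint8Array.of(1, 2, 3, 4);
blendBytes(mask, x, y); // lanes: 10, 2, 30, 4
```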

Code that uses SIMD.js is not performance-portable. Modern vectorizing
compilers feature complex cost models and heuristics for deciding when
to vectorize, at which vector width, and how many loop iterations to
interleave. These cost models take into account the features of the
vector instruction set, properties of the architecture such as the
number of vector registers, and properties of the current processor
generation. Making a poor selection decision on any of the vectorization
parameters can result in a major performance regression. Executing
vector intrinsics on processors that don’t support them is slower than
executing multiple scalar instructions because the compiler can’t always
generate efficient code with the same semantics.

I don’t believe that it is possible to write non-trivial vector code
that will show performance gains on processors from different families.
Executing vector code with insufficient hardware support will cause
major performance regressions. One of the motivations for SIMD.js was to
allow Emscripten
https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Emscripten to 
vectorize
C code and to emit JavaScript SIMD intrinsics. One problem