Hi Dan!

> On Sep 28, 2014, at 6:44 AM, Dan Gohman <sunf...@mozilla.com> wrote:
> Hi Nadav,
> I agree with much of your assessment of the proposed SIMD.js API.
> However, I don't believe its unsuitability for some problems
> invalidates it for solving other very important problems, which it is
> well suited for. Performance portability is actually one of SIMD.js'
> biggest strengths: it's not the kind of performance portability that
> aims for a consistent percentage of peak on every machine (which, as you
> note, of course an explicit 128-bit SIMD API won't achieve), it's the
> kind of performance portability that achieves predictable performance
> and minimizes surprises across machines (though yes, there are some
> unavoidable ones, but overall the picture is quite good).

There is a tradeoff between the performance portability of the SIMD.js ISA and 
its usefulness. A small set of instructions (one that targets only 32-bit data
types, with no masks, etc.) is not useful for developing non-trivial vector programs.
You need 16-bit vector elements to support WebGL vertex indices, and
lane-masking for implementing predicated control flow for programs like ray 
tracers. Introducing a large number of vector instructions will expose the 
performance portability problems. I don’t believe that there is a sweet spot in 
this tradeoff. I don’t think that we can find a small set of instructions that 
will be useful for writing non-trivial vector code that is performance portable.
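
To illustrate what lane-masking for predicated control flow means here, the following is a minimal sketch in plain scalar TypeScript. It emulates a 4-lane compare-and-select by hand; the names Vec4, Mask4, lessThan, select and nearestHit are hypothetical and are not SIMD.js API calls. The per-lane comparison yields a mask, and the mask picks a value per lane instead of taking a branch, which is roughly what a ray tracer needs when different rays in one vector hit different objects.

```typescript
// Hypothetical names for illustration only; this is scalar code that
// emulates a 4-lane predicated select, not an actual SIMD.js call sequence.
type Vec4 = [number, number, number, number];
type Mask4 = [boolean, boolean, boolean, boolean];

// Per-lane comparison: produces a mask instead of branching.
function lessThan(a: Vec4, b: Vec4): Mask4 {
  return [a[0] < b[0], a[1] < b[1], a[2] < b[2], a[3] < b[3]];
}

// Per-lane blend: take onTrue where the mask is set, onFalse elsewhere.
function select(mask: Mask4, onTrue: Vec4, onFalse: Vec4): Vec4 {
  return [
    mask[0] ? onTrue[0] : onFalse[0],
    mask[1] ? onTrue[1] : onFalse[1],
    mask[2] ? onTrue[2] : onFalse[2],
    mask[3] ? onTrue[3] : onFalse[3],
  ];
}

// Example use: keep the nearest hit distance for each of four rays.
function nearestHit(current: Vec4, candidate: Vec4): Vec4 {
  const closer = lessThan(candidate, current);
  return select(closer, candidate, current);
}
```

Both sides of the "branch" are computed for every lane and the mask decides which result survives, so an instruction set without masks or selects cannot express this pattern directly.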

> On 09/26/2014 03:16 PM, Nadav Rotem wrote:
>> So far, I’ve explained why I believe SIMD.js will not be
>> performance-portable and why it will not utilize modern instruction
>> sets, but I have not made a suggestion on how to use vector
>> instructions to accelerate JavaScript programs. Vectorization, like
>> instruction scheduling and register allocation, is a code-generation
>> problem. In order to solve these problems, it is necessary for the
>> compiler to have intimate knowledge of the architecture. Forcing the
>> compiler to use a specific instruction or a specific data-type is the
>> wrong answer. We can learn a lesson from the design of compilers for
>> data-parallel languages. GPU programs (shaders and compute languages,
>> such as OpenCL and GLSL) are written using vector instructions because
>> the domain of the problem requires vectors (colors and coordinates).
>> One of the first things that data-parallel compilers do is to break
>> vector instructions into scalars (this process is called
>> scalarization). After getting rid of the vectors that resulted from
>> the problem domain, the compiler may begin to analyze the program,
>> calculate profitability, and make use of the available instruction set.
>> I believe that it is the responsibility of JIT compilers to use vector
>> instructions. In the implementation of WebKit’s FTL JIT compiler,
>> we took one step in the direction of using vector instructions. LLVM
>> already vectorizes some code sequences during instruction selection,
>> and we started investigating the use of LLVM’s Loop and SLP
>> vectorizers. We found that despite nice performance gains on a number
>> of workloads, we experienced some performance regressions on Intel’s
>> Sandybridge processors, which is currently a very popular desktop
>> processor. JavaScript code contains many branches (due to dynamic
>> speculation). Unfortunately, branches on Sandybridge execute on Port5,
>> which is also where many vector instructions are executed. So,
>> pressure on Port5 prevented performance gains. The LLVM vectorizer
>> currently does not model execution port pressure and we had to disable
>> vectorization in FTL. In the future, we intend to enable more
>> vectorization features in FTL.
> This is an example of a weakness of depending on automatic vectorization
> alone. High-level language features create complications which can lead
> to surprising performance problems. Compiler transformations to target
> specialized hardware features often have widely varying applicability.
> Expensive analyses can sometimes enable more and better vectorization,
> but when a compiler has to do an expensive complex analysis in order to
> optimize, it's unlikely that a programmer can count on other compilers
> doing the exact same analysis and optimizing in all the same cases. This
> is a problem we already face in many areas of compilers, but it's more
> pronounced with vectorization than many other optimizations.

I agree with this argument. Compiler optimizations are unpredictable. You never 
know when the register allocator will decide to spill a variable inside a hot 
loop, or when a memory operation will confuse the alias analysis. I also agree that loop
vectorization is especially sensitive.
However, it looks like the kind of vectorization that is needed to replace 
SIMD.js is a very simple SLP vectorization 
<http://llvm.org/docs/Vectorizers.html#the-slp-vectorizer> (BB vectorization). 
It is really easy for a compiler to combine a few scalar arithmetic operations 
into a vector. LLVM’s SLP vectorizer supports vectorization of computations
across basic blocks and succeeds in surprising places, like vectorization of
STDLIB code where the ‘begin’ and ‘end’ iterators fit into a 128-bit register!
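
As a rough sketch of the kind of code this refers to, consider four independent, identically shaped scalar additions on adjacent array elements. The function name addVec4 is hypothetical. This is the straight-line pattern an SLP vectorizer looks for; whether a particular JIT actually merges it into one 128-bit operation is an assumption, not something this example guarantees.

```typescript
// Hypothetical example of an SLP-friendly pattern: four isomorphic scalar
// operations on consecutive elements, with no dependencies between them.
function addVec4(a: Float32Array, b: Float32Array, out: Float32Array): void {
  out[0] = a[0] + b[0];
  out[1] = a[1] + b[1];
  out[2] = a[2] + b[2];
  out[3] = a[3] + b[3];
}

// Usage: the caller writes ordinary scalar code; any vectorization is left
// to the compiler.
const lhs = new Float32Array([1, 2, 3, 4]);
const rhs = new Float32Array([10, 20, 30, 40]);
const sum = new Float32Array(4);
addVec4(lhs, rhs, sum);
```

The loads, adds and stores touch consecutive memory and have the same shape, so a compiler can pack them without any loop analysis.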

> In contrast, the proposed SIMD.js has the property that code using it
> will not depend on expensive compiler analysis in the JIT, and is much
> more likely to deliver predictable performance in practice between
> different JIT implementations and across a very practical variety of
> hardware architectures.

Performance portability across JITs should not motivate us to solve a compiler 
problem in the language itself. JITs should continue to evolve and learn new 
tricks. Introducing new language features increases the barrier to entry for
new JavaScript implementations.  

>> To summarize, SIMD.js will not provide a portable performance solution
>> because vector instruction sets are sparse and vary between
>> architectures and generations. Emscripten should not generate vector
>> instructions because it can’t model the target machine. SIMD.js will
>> not make use of modern SIMD features such as predication or
>> scatter/gather. Vectorization is a compiler code generation problem
>> that should be solved by JIT compilers, and not by the language
>> itself. JIT compilers should continue to evolve and to start
>> vectorizing code like modern compilers.
> As I mentioned above, performance portability is actually one of
> SIMD.js's core strengths.
> I have found it useful to think of the API proposed in SIMD.js as a
> "short vector" API. It hits a sweet spot, being a convenient size for
> many XYZW and RGB/RGBA and similar algorithms, being implementable on a
> wide variety of very relevant hardware architectures, being long enough
> to deliver worthwhile speedups for many tasks, and being short enough to
> still be convenient to manipulate.
> I agree that the "short vector" model doesn't address all use cases, so
> I also believe a "long vector" approach would be very desirable as well.
> Such an approach could be based on automatic loop vectorization, a SPMD
> programming model, or something else. I look forward to discussing ideas
> for this. Such approaches have the potential to be much more scalable
> and adaptable, and can be much better positioned to solve those problems
> that the presently proposed SIMD.js API doesn't attempt to solve. I
> believe there is room for both approaches to coexist, and to serve
> distinct sets of needs.
> In fact, a good example of short and long vector models coexisting is in
> these popular GPU programming models that you mentioned, where short
> vectors represent things in the problem domains like colors and
> coordinates, and are then broken down by the compiler to participate in
> the long vectors, as you described. It's very plausible that the
> proposed SIMD.js could be adapted to combine with a future long-vector
> approach in the same way.

Data-parallel languages like GLSL and OpenCL are statically typed and vector 
types are used to increase developer productivity. Using vector types in
data-parallel languages often hurts performance because it forces the memory 
layout to be AOS instead of SOA. In JavaScript, the library Three.js 
<http://threejs.org/> introduces data types such as “THREE.Vector3” that are 
used to describe the problem domain, and not to accelerate code. 
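
For readers less familiar with the AOS/SOA distinction, here is a minimal sketch in TypeScript. The type and function names (Point3, Points3Soa, sumXAos, sumXSoa) are hypothetical and are not taken from Three.js. In the array-of-structures layout each point carries its own x, y and z; in the structure-of-arrays layout each component lives in its own contiguous array.

```typescript
// AOS: one object per point, shaped by the problem domain.
// Hypothetical type, similar in spirit to a 3-component vector class.
interface Point3 {
  x: number;
  y: number;
  z: number;
}

function sumXAos(points: Point3[]): number {
  let total = 0;
  for (let i = 0; i < points.length; i++) {
    total += points[i].x; // x values are scattered across objects
  }
  return total;
}

// SOA: one contiguous typed array per component.
interface Points3Soa {
  x: Float32Array;
  y: Float32Array;
  z: Float32Array;
}

function sumXSoa(points: Points3Soa): number {
  let total = 0;
  for (let i = 0; i < points.x.length; i++) {
    total += points.x[i]; // x values are adjacent in memory
  }
  return total;
}
```

With the SOA layout, consecutive x values sit next to each other, which is the arrangement wide vector loads want; the AOS layout interleaves components and fixes the vector width at the size of the domain type.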


> Dan

webkit-dev mailing list
