Hi Dan!

> On Sep 28, 2014, at 6:44 AM, Dan Gohman <sunf...@mozilla.com> wrote:
>
> Hi Nadav,
>
> I agree with much of your assessment of the proposed SIMD.js API.
> However, I don't believe its unsuitability for some problems
> invalidates it for solving other very important problems, which it is
> well suited for. Performance portability is actually one of SIMD.js'
> biggest strengths: it's not the kind of performance portability that
> aims for a consistent percentage of peak on every machine (which, as you
> note, of course an explicit 128-bit SIMD API won't achieve), it's the
> kind of performance portability that achieves predictable performance
> and minimizes surprises across machines (though yes, there are some
> unavoidable ones, but overall the picture is quite good).
There is a tradeoff between the performance portability of the SIMD.js ISA and its usefulness. A small set of instructions (one that only targets 32-bit data types, has no masks, etc.) is not useful for developing non-trivial vector programs. You need 16-bit vector elements to support WebGL vertex indices, and lane masking to implement predicated control flow for programs like ray tracers. Introducing a large number of vector instructions will expose the performance-portability problems. I don't believe that there is a sweet spot in this tradeoff. I don't think that we can find a small set of instructions that is useful for writing non-trivial vector code and is also performance portable.

> On 09/26/2014 03:16 PM, Nadav Rotem wrote:
>> So far, I've explained why I believe SIMD.js will not be
>> performance-portable and why it will not utilize modern instruction
>> sets, but I have not made a suggestion on how to use vector
>> instructions to accelerate JavaScript programs. Vectorization, like
>> instruction scheduling and register allocation, is a code-generation
>> problem. In order to solve these problems, it is necessary for the
>> compiler to have intimate knowledge of the architecture. Forcing the
>> compiler to use a specific instruction or a specific data type is the
>> wrong answer. We can learn a lesson from the design of compilers for
>> data-parallel languages. GPU programs (shaders and compute languages,
>> such as OpenCL and GLSL) are written using vector instructions because
>> the domain of the problem requires vectors (colors and coordinates).
>> One of the first things that data-parallel compilers do is break
>> vector instructions into scalars (this process is called
>> scalarization). After getting rid of the vectors that resulted from
>> the problem domain, the compiler may begin to analyze the program,
>> calculate profitability, and make use of the available instruction set.
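Returning to the lane-masking point above: a minimal sketch in plain JavaScript (the helper names are hypothetical, not part of any proposed API) of how predicated control flow becomes mask-and-select. Both sides of the branch are computed for all lanes, and a per-lane select merges the results -- the way a ray tracer handles divergent rays.

```javascript
// Emulating 4-lane predication with plain arrays (hypothetical helpers,
// not part of the proposed SIMD.js API).

// Per-lane comparison produces a mask: true where the predicate holds.
function lessThan(a, b) {
  return a.map((x, i) => x < b[i]);
}

// select(mask, t, f): per-lane blend -- the vector form of if/else.
function select(mask, t, f) {
  return mask.map((m, i) => (m ? t[i] : f[i]));
}

// Scalar logic "if (x < 4) x * 2 else x / 2", vectorized:
// compute both branches for every lane, then blend with the mask.
const x = [1, 5, 2, 8];
const limit = [4, 4, 4, 4];
const mask = lessThan(x, limit);      // [true, false, true, false]
const doubled = x.map(v => v * 2);    // "then" side, all lanes
const halved = x.map(v => v / 2);     // "else" side, all lanes
const result = select(mask, doubled, halved);
console.log(result);                  // [2, 2.5, 4, 4]
```

Without masks and select in the instruction set, any program with per-lane control flow has to fall back to scalar code.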
>
>> I believe that it is the responsibility of JIT compilers to use vector
>> instructions. In the implementation of WebKit's FTL JIT compiler, we
>> took one step in the direction of using vector instructions. LLVM
>> already vectorizes some code sequences during instruction selection,
>> and we started investigating the use of LLVM's Loop and SLP
>> vectorizers. We found that despite nice performance gains on a number
>> of workloads, we experienced some performance regressions on Intel's
>> Sandybridge processors, which are currently very popular desktop
>> processors. JavaScript code contains many branches (due to dynamic
>> speculation). Unfortunately, branches on Sandybridge execute on Port5,
>> which is also where many vector instructions are executed. So,
>> pressure on Port5 prevented performance gains. The LLVM vectorizer
>> currently does not model execution port pressure and we had to disable
>> vectorization in FTL. In the future, we intend to enable more
>> vectorization features in FTL.
>
> This is an example of a weakness of depending on automatic vectorization
> alone. High-level language features create complications which can lead
> to surprising performance problems. Compiler transformations to target
> specialized hardware features often have widely varying applicability.
> Expensive analyses can sometimes enable more and better vectorization,
> but when a compiler has to do an expensive complex analysis in order to
> optimize, it's unlikely that a programmer can count on other compilers
> doing the exact same analysis and optimizing in all the same cases. This
> is a problem we already face in many areas of compilers, but it's more
> pronounced with vectorization than many other optimizations.

I agree with this argument. Compiler optimizations are unpredictable. You never know when the register allocator will decide to spill a variable inside a hot loop, or when a memory operation will confuse the alias analysis.
I also agree that loop vectorization is especially sensitive. However, it looks like the kind of vectorization that is needed to replace SIMD.js is a very simple SLP vectorization <http://llvm.org/docs/Vectorizers.html#the-slp-vectorizer> (BB vectorization). It is really easy for a compiler to combine a few scalar arithmetic operations into a vector. LLVM's SLP vectorizer supports vectorization of computations across basic blocks and succeeds in surprising places, like vectorizing STDLIB code where the 'begin' and 'end' iterators fit into a 128-bit register!

> In contrast, the proposed SIMD.js has the property that code using it
> will not depend on expensive compiler analysis in the JIT, and is much
> more likely to deliver predictable performance in practice between
> different JIT implementations and across a very practical variety of
> hardware architectures.

Performance portability across JITs should not motivate us to solve a compiler problem in the language itself. JITs should continue to evolve and learn new tricks. Introducing new language features raises the barrier to entry for new JavaScript implementations.

>> To summarize, SIMD.js will not provide a portable performance solution
>> because vector instruction sets are sparse and vary between
>> architectures and generations. Emscripten should not generate vector
>> instructions because it can't model the target machine. SIMD.js will
>> not make use of modern SIMD features such as predication or
>> scatter/gather. Vectorization is a compiler code-generation problem
>> that should be solved by JIT compilers, and not by the language
>> itself. JIT compilers should continue to evolve and to start
>> vectorizing code like modern compilers.
>
> As I mentioned above, performance portability is actually one of
> SIMD.js's core strengths.
>
> I have found it useful to think of the API proposed in SIMD.js as a
> "short vector" API.
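To make the SLP point above concrete: the "short vector" XYZW pattern is exactly the kind of straight-line code an SLP (basic-block) vectorizer picks up without any new API surface. A sketch in plain JavaScript (the function name is mine, for illustration); a JIT with an SLP pass could fuse the four independent, isomorphic scalar adds into a single 128-bit vector add.

```javascript
// Four independent adds on adjacent Float32Array elements -- the
// isomorphic scalar sequence an SLP vectorizer looks for and can
// combine into one <4 x float> add.
function addXYZW(out, a, b, i) {
  out[i + 0] = a[i + 0] + b[i + 0];
  out[i + 1] = a[i + 1] + b[i + 1];
  out[i + 2] = a[i + 2] + b[i + 2];
  out[i + 3] = a[i + 3] + b[i + 3];
}

const a = new Float32Array([1, 2, 3, 4]);
const b = new Float32Array([10, 20, 30, 40]);
const out = new Float32Array(4);
addXYZW(out, a, b, 0);
console.log(Array.from(out)); // [11, 22, 33, 44]
```

The programmer writes ordinary scalar code; whether it becomes a vector instruction is a per-target code-generation decision, which is the point.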
> It hits a sweet spot, being a convenient size for
> many XYZW and RGB/RGBA and similar algorithms, being implementable on a
> wide variety of very relevant hardware architectures, being long enough
> to deliver worthwhile speedups for many tasks, and being short enough to
> still be convenient to manipulate.
>
> I agree that the "short vector" model doesn't address all use cases, so
> I also believe a "long vector" approach would be very desirable as well.
> Such an approach could be based on automatic loop vectorization, a SPMD
> programming model, or something else. I look forward to discussing ideas
> for this. Such approaches have the potential to be much more scalable
> and adaptable, and can be much better positioned to solve those problems
> that the presently proposed SIMD.js API doesn't attempt to solve. I
> believe there is room for both approaches to coexist, and to serve
> distinct sets of needs.
>
> In fact, a good example of short and long vector models coexisting is in
> these popular GPU programming models that you mentioned, where short
> vectors represent things in the problem domains like colors and
> coordinates, and are then broken down by the compiler to participate in
> the long vectors, as you described. It's very plausible that the
> proposed SIMD.js could be adapted to combine with a future long-vector
> approach in the same way.

Data-parallel languages like GLSL and OpenCL are statically typed, and their vector types exist to increase developer productivity. Using vector types in data-parallel languages often hurts performance because it forces the memory layout to be AOS instead of SOA. In JavaScript, the library Three.js <http://threejs.org/> introduces data types such as "THREE.Vector3" that are used to describe the problem domain, not to accelerate code.

Thanks,
Nadav

> Dan
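P.S. A sketch of the AOS-versus-SOA point above, in plain JavaScript (the layout names and functions are mine, for illustration): the Three.js-style AOS layout mirrors the problem domain and is convenient for the programmer, while the SOA layout is what a vectorizer wants, because each component sits in contiguous memory and can be loaded with unit stride.

```javascript
// AOS (array of structures): one object per vertex, THREE.Vector3-style.
// Convenient, but x/y/z are interleaved in memory.
const aos = [
  { x: 1, y: 2, z: 3 },
  { x: 4, y: 5, z: 6 },
];

// SOA (structure of arrays): one contiguous array per component.
// Each vector lane loads consecutive elements -- vectorizer-friendly.
const soa = {
  x: new Float32Array([1, 4]),
  y: new Float32Array([2, 5]),
  z: new Float32Array([3, 6]),
};

function lengthsAOS(vs) {
  // Gathers scattered fields -- hard to turn into vector loads.
  return vs.map(v => Math.hypot(v.x, v.y, v.z));
}

function lengthsSOA(s, n) {
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    // Three unit-stride loads per iteration: trivially vectorizable.
    out[i] = Math.hypot(s.x[i], s.y[i], s.z[i]);
  }
  return out;
}

console.log(lengthsAOS(aos));                 // ~[3.742, 8.775]
console.log(Array.from(lengthsSOA(soa, 2)));  // same values
```

Both functions compute the same result; the difference is purely memory layout, which is exactly the property the source language's vector types end up dictating.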
_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev