Hi Maciej,

----- Original Message -----
> Dan, you say that SIMD.js delivers performance portability, and Nadav says it
> doesn’t.
> Nadav’s argument seems to come down to (as I understand it):
> - The set of vector operations supported on different CPU architectures
> varies widely.

This is true, but it's also true that there is a core set of features which is 
pretty consistent across popular SIMD architectures. This commonality exists 
precisely because that core set has proven broadly useful. The proposed SIMD.js 
doesn't solve all problems, but it does solve a large number of important 
problems well, and it follows numerous precedents.
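
As a rough illustration of what that core set looks like, here is a minimal 
scalar sketch (my own illustrative polyfill, not the proposed spec text) of 
lane-wise Float32x4 add and multiply:

```javascript
// Illustrative scalar polyfill of two core SIMD.js-style operations.
// Math.fround rounds each lane to float32 precision, matching what a
// hardware float32 lane would store.
function float32x4(x, y, z, w) {
  return [x, y, z, w].map(Math.fround);
}

// Lane-wise addition: each lane is computed independently of the others.
function add(a, b) {
  return float32x4(a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]);
}

// Lane-wise multiplication.
function mul(a, b) {
  return float32x4(a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]);
}

const va = float32x4(1, 2, 3, 4);
const vb = float32x4(5, 6, 7, 8);
const vsum = add(va, vb);  // [6, 8, 10, 12]
const vprod = mul(va, vb); // [5, 12, 21, 32]
```

On hardware, each of these maps to a single instruction on both x86 (addps, 
mulps) and ARM NEON, which is exactly what makes the core set portable.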

We are also exploring the possibility of exposing additional instructions 
outside this core set. Several creative ideas are being discussed which could 
expand the API's reach while preserving a portability story. However, 
regardless of what we do there, I expect the core set will remain a prominent 
part of the API, due to its applicability.

> - "Executing vector intrinsics on processors that don’t support them is
> slower than executing multiple scalar instructions because the compiler
> can’t always generate efficient code with the same semantics.”

This is also true; however, the intent of SIMD.js *is* to be implementable on 
all popular architectures. The SIMD.js spec was originally derived from the Dart 
SIMD spec, which is already implemented and in use on at least x86 and ARM. We 
are also taking some ideas from OpenCL, which offers a very similar set of core 
functionality, and which is implemented on even more architectures. We have 
several reasons to expect that SIMD.js can cover enough functionality to be 
useful while still being sufficiently portable.

> - Even when vector intrinsics are supported by the CPU, whether it is
> profitable to use them may depend in non-obvious ways on exact
> characteristics of the target CPU and the surrounding code (the Port5
> example).

SIMD.js provides plain integer types, which let developers bypass plain JS 
number semantics directly, leaving fewer corner cases for which the compiler 
must insert checking code. This means fewer branches, and among other things, 
should mean less port 5 contention overall on Sandy Bridge.
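
To illustrate the integer point with scalar JS operators (a sketch of typical 
int32 wrapping semantics, not the proposed API itself):

```javascript
// Plain JS addition silently promotes to a double when the result
// leaves int32 range, so a JIT compiling untyped code must insert an
// overflow check. An explicit int32 lane wraps by definition, which
// is what the `| 0` coercion and Math.imul model here.
const big = 0x7fffffff;            // INT32_MAX
const jsAdd = big + 1;             // 2147483648: quietly a double now
const laneAdd = (big + 1) | 0;     // -2147483648: wraps like int32 hardware
const laneMul = Math.imul(big, 2); // -2: 32-bit wrapping multiply
```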

Furthermore, automatic vectorization often requires the compiler to make 
conservative assumptions about key information like pointer aliasing, trip 
counts, integer overflow, array indexing, load safety, scatter ordering, 
alignment, and more. To preserve observable semantics under these assumptions, 
compilers insert extra instructions, typically selects, shuffles, and branches, 
to handle all the possible corner cases. This is overhead that human 
programmers can often avoid, because they can more easily determine which 
corner cases are relevant in a given piece of code. And on Sandy Bridge in 
particular, those extra selects, shuffles, and branches hit port 5.
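
As a concrete (hypothetical) example of the aliasing problem: in the loop 
below, each iteration reads an element the previous iteration wrote, so a 
vectorizer that loaded several elements at once would compute a different 
result. The compiler must either prove such overlap is impossible or emit 
runtime checks.

```javascript
// Each iteration reads arr[i], which the previous iteration wrote.
// Scalar execution therefore propagates arr[0] down the array; a
// naive 4-wide vectorization (load 4 elements, then store 4) would
// instead shift the original values, changing observable behavior.
function shiftRight(arr, n) {
  for (let i = 0; i < n; i++) {
    arr[i + 1] = arr[i];
  }
}

const data = new Float32Array([1, 2, 3, 4, 5]);
shiftRight(data, 4);
// Scalar semantics: [1, 1, 1, 1, 1]
```

A human writing explicit SIMD code knows whether this overlap can happen in 
their program and can simply skip the checks.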

> For these reasons, Nadav says that it’s better to autovectorize, and that
> this is the norm even for languages with explicit vector data. In other
> words, he’s saying that SIMD.js will result in code that is not
> performance-portable between different CPUs.

I question whether it is actually the norm. In C++, where auto-vectorization is 
available in every major compiler today, explicit SIMD APIs like <xmmintrin.h> 
are hugely popular. That particular header is supported by Microsoft's C++ 
compiler, Intel's C++ compiler, GCC, and clang. I see many uses of 
<xmmintrin.h> in many contexts, including HPC, graphics, codecs, cryptography, 
and games. It seems many C++ developers are still willing to go through the 
pain of #ifdefs, preprocessor macros, and funny-looking syntax rather than rely 
on auto-vectorization, even with "restrict" and other aids.

Both auto-vectorization and SIMD.js have their strengths, and both have their 
weaknesses. The fact that each solves some problems the other doesn't is not, I 
believe, grounds for ruling out either of them.

> I don’t see a rebuttal to any of these points. Instead, you argue that,
> because SIMD.js does not require advanced compiler analysis, it is more
> likely to give similar results between different JITs (presumably when
> targeting the same CPU, or ones with the same supported vector operations
> and similar perf characteristics). That seems like a totally different sense
> of performance portability.
> Given these arguments, it’s possible that you and Nadav are both right[*].
> That would mean that both these statements hold:
> (a) SIMD.js is not performance-portable between different CPU architectures
> and models.
> (b) SIMD.js is performance-portable between different JITs targeting the same
> CPU model.
> On net, I think that combination would be a strong argument *against*
> SIMD.js. The Web aims for portability between different hardware and not
> just different software. At Apple alone we support four major CPU
> instruction sets and a considerably greater number of specific CPU models.
> From our point of view, code that is performance-portable between JITs but
> not between CPUs would not be good enough, and it might be actively bad if
> it results in worse performance on some of our CPU architectures. The WebKit
> community as a whole supports even more target CPU architectures.
> Do you agree with the above assessment? Alternately, do you have an argument
> that SIMD.js is performance-portable between different CPU architectures?

My more specific responses are above.

SIMD.js *is* aimed at being performance-portable between different CPU 
architectures. By this, I mean that we expect that if you can use it, you will 
usually be able to get severalfold speedups over your scalar code. We predict 
this because other systems, like Dart, have already achieved it. 
"Performance portable" can have other useful meanings too, and my position is 
that the meaning I'm using here is one of the useful ones. I also acknowledge 
that there are some challenges, but overall I think the story is good.

Advanced features like vector predication and gather/scatter are less critical 
in SIMD.js' space than has been suggested. For example, NEON lacks these 
features entirely, and yet it is still one of the most popular SIMD 
architectures in the world. This is possible because there are a large number 
of problems which have no need for these features even when they are present, 
and a large number of problems where simpler alternatives are sufficient.
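
One such simpler alternative, sketched here on a single scalar int32 lane (the 
helper name is mine, for illustration): hardware without per-lane predication 
can still do branch-free conditional selection with a bitwise mask.

```javascript
// Branch-free select: mask is all-ones (-1) for "true" or all-zeros
// (0) for "false". This is the standard substitute for predication on
// SIMD ISAs like SSE and NEON, shown here on one int32 lane.
function selectLane(mask, ifTrue, ifFalse) {
  return (ifTrue & mask) | (ifFalse & ~mask);
}

const picked = selectLane(-1, 7, 9);  // 7: all-ones mask keeps ifTrue
const dropped = selectLane(0, 7, 9);  // 9: zero mask keeps ifFalse
```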

Many of my ideas here share a common theme; we can think of "short" vectors and 
"long" vectors. The proposed SIMD.js is primarily aimed at the "short" vector 
problem, and much of the discussion in this thread has focused on its 
shortcomings in the "long" vector space. At the same time, the weaknesses of 
auto-vectorization tend to be more prominent in the "short" vector space, 
because short-vector code is sometimes less structured, and because some of the 
overhead can be amortized in long loops. As further illustration of the 
difference, things like vector predication and gather/scatter have been staples 
in the "long" vector space since the Cray vector days, and yet they were 
omitted from popular "short" SIMD architectures; they're less important there. 
And there are further differences.

I would ideally like to see solutions in both spaces advance; they excel at 
distinct problems, and they can co-exist and even cooperate.

webkit-dev mailing list
