On Apr 10, 2013, at 12:37 PM, Dirk Pranke <dpra...@chromium.org> wrote:

> 
> 
> 
> On Wed, Apr 10, 2013 at 7:37 AM, Filip Pizlo <fpi...@apple.com> wrote:
> 
> 
> On Apr 10, 2013, at 2:29 AM, "Pozdnyakov, Mikhail" 
> <mikhail.pozdnya...@intel.com> wrote:
> 
> >
> >
> >> Why not infer ParallelArrays automatically?
> > Sorry, did not get it. Could you please elaborate?
> 
> Normal JS arrays. You use them with pure computation. Profiler notices this. 
> And then the same things that you have in the ParallelArrays proposal now 
> work except you don't need a new type.
> 
> 
> Compiler writers have been chasing this problem for 50 years, with (AFAIK) 
> very limited success for languages that don't give you compile-time 
> annotations to help you along.

Let's demystify this a bit.  It's easy to say "OMG they did things for 50 
years", and we're all guilty of saying things like this sometimes, but it's a 
bit more interesting to identify what these people did for those 50 years, 
where they got, and what the remaining problems are.  It is already possible to 
autoparallelize loops, but remaining challenges make it impractical for "real 
code".  The problems are with:

A) Identifying loops that are candidates for sound parallelization.  In 
formal-speak they look for the lack of "loop-carried side-effects", i.e. 
something you'd do in one iteration that would be visible in a subsequent 
iteration - this represents a dependency that inhibits parallelization.  But if 
you prove that no loop-carried side-effects exist, then you're off to a good 
start: you know that you could fork off some threads to run the loop's body in 
parallel, and the program would still be guaranteed to get identical results.  
(I'm oversimplifying a bit - if you tried to auto-parallelize a loop in a 
program that already had threads, you'd be in a world of hurt from a memory 
model standpoint - but we don't have that problem in JS because all code in JS 
is semantically serial and so the parallelization you did wouldn't interfere 
with existing uses of threads, since there aren't any.)
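To make "loop-carried side-effects" concrete, here's a sketch in plain JS (the functions and names are mine, purely for illustration):

```javascript
// No loop-carried side-effects: each iteration writes only its own slot
// of `out` and reads only its own input, so iterations could run in any
// order (or in parallel) and the results would be identical.
function scale(input, factor) {
  const out = new Array(input.length);
  for (let i = 0; i < input.length; i++) {
    out[i] = input[i] * factor;   // independent per iteration
  }
  return out;
}

// Loop-carried side-effect: iteration i reads the value that iteration
// i - 1 wrote into `sum`, so the iterations must run in order.
function prefixSums(input) {
  let sum = 0;
  const out = new Array(input.length);
  for (let i = 0; i < input.length; i++) {
    sum += input[i];              // dependency carried between iterations
    out[i] = sum;
  }
  return out;
}
```

A sound autoparallelizer could fork off threads for the first loop but must keep the second one serial.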

B) Identifying loops that are candidates for profitable parallelization.  
Spinning up a thread carries a fixed cost.  If you find a loop with no 
loop-carried side-effects, but that loop actually only executes for a short 
time, parallelizing it will be a performance regression - and likely a steep 
one at that.  Finding which loops are profitable to parallelize is thought to 
be a hard problem.  The traditional attempts to solve it centered around making 
threads cheaper (e.g. Cilk, CML, and similar), or having stronger inference of 
profitability (I think this is what http://ciao-lang.org does), or pushing 
parallelizability hints into the language (X10, OpenMP, and these 
ParallelArrays).
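As a back-of-envelope illustration of the fixed-cost problem (every constant here is invented):

```javascript
// Hypothetical cost model: parallelizing pays off only when the serial
// work exceeds the fixed cost of spinning up workers. The numbers are
// made up for illustration, not measurements of any real engine.
const WORKER_STARTUP_NS = 50000;   // fixed per-worker cost (invented)
const WORKERS = 4;

function worthParallelizing(iterations, nsPerIteration) {
  const serialNs = iterations * nsPerIteration;
  const parallelNs = WORKERS * WORKER_STARTUP_NS + serialNs / WORKERS;
  return parallelNs < serialNs;
}
```

For a short loop the startup cost dominates and parallelizing is a regression; only long-running loops amortize it.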

(A) is not a problem that is solved by ParallelArrays.  ParallelArrays 
guarantee sequential equivalence; i.e. ParallelArray.prototype.forEach() will 
behave like Array.prototype.forEach() if the closure you pass it has the 
equivalent of "loop-carried side-effects".  That is, if you believe that (A) is 
not 
a solvable problem, then this constitutes an argument against ParallelArrays, 
and not against inferring that a normal array should behave like a 
ParallelArray.  If you can do the inference that allows 
ParallelArray.prototype.forEach() to soundly parallelize, then you can do that 
inference for Array.prototype.forEach(), as well.  I happen to think that (A) 
is a big problem with ParallelArrays; I don't like it that these guys are 
proposing a language construct called "ParallelArray" but then secretly they 
will probably not parallelize a ton of code because the user didn't realize 
that storing to a global variable shafts their inference for side-effects.
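The "storing to a global variable" trap can be shown with plain closures (example mine; no real ParallelArray API is used):

```javascript
let log = [];  // global, shared state

// Innocuous-looking closure: because it stores to the global `log`,
// iteration order becomes observable, so an engine that guarantees
// sequential equivalence must fall back to serial execution.
function tainted(x) {
  log.push(x);        // a loop-carried side-effect in disguise
  return x * 2;
}

// Pure closure: reads only its argument, writes nothing shared, so
// running the iterations in parallel is sound.
function pure(x) {
  return x * 2;
}
```

The user sees "ParallelArray" in their source and assumes parallelism, but passing `tainted` instead of `pure` silently serializes everything.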

(B) is a problem that ParallelArrays attempt to solve.  But here's what we know 
about the different approaches to solving (B):

- Making threads faster hasn't really worked.  Cilk's threads are fast enough 
that if you use them yourself, manually, you'll be really happy - sort of, 
except that it sometimes doesn't work like you'd expect because you've got 
user-level scheduling.  But the point is, they're not fast enough that you 
could spin them up for _any_ loop in a program.  You'd get a regression if you 
did.

- Stronger inference of profitability is a promising area, which isn't 
adequately explored, IMO.  The approaches with which I'm most familiar focus 
too much on static inference.

- Adding hints to the language results in programming models that nobody uses.  
Have you ever used OpenMP?  Probably not, and if you have, you probably wanted 
to puke all over yourself.  It's a disgusting programming model.  There's 
something offensive about writing what looks like a serial loop, and then 
adding decorations to it to "declare" to the compiler the set of 
things it could do to the loop.  Eventually these declarations get so 
sophisticated that you realize that you made a poor life decision by using them 
- you could have been more productive if you wrote the thread code yourself.  
There is a similar, though lesser, problem with X10; in X10 at least you can do 
like in Cilk and just spin up a bunch of light-weight threads directly.  But 
even X10 has so far had only limited success, and a bunch of super smart people 
have been working on it for a decade.  In short, hardly anybody uses these 
programming hints.  I'm opposed to adding such a hint-based model to JS because 
we already know that other such similar models are used by almost nobody - 
these are niche programming models that you don't want to pick up unless you've 
literally run out of all other sensible options.

So the most promising thing appears to be strengthening automagic inference, 
which brings me to:

> I'm not aware of any work on JIT compilers doing this.

Neither am I!

> Why do you think we could do better?

Because we're doing it in a JIT compiler.  The problem with (B) is that you 
don't know, in a static, ahead-of-time compiler, how long a loop will execute.  
And even if you did have a guess, if your guess turned out to be wrong and you 
emitted code to call pthread_create or similar, you'd get really terrible 
performance.  This is why traditional autoparallelization techniques fail in 
static compilers: static compiler writers, like all other performance geeks, 
don't like to add features that allege to improve performance but in fact 
regress performance.

But in a JIT, we already have to do two things that the static people can't 
*quite* do:

1) Runtime feedback-driven decision making that is acutely aware of the running 
time of loops.  Both JSC and V8 do already know your average trip count through 
any loop; they may not know it explicitly (for example, in JSC we don't 
have a data structure that will immediately tell you this information) but we 
do know it implicitly (we happen to already have a loop trip counter for 
deciding when to tier up, even though that counter is currently not designed 
for the level of accuracy that autoparallelization would need).  Point is, if 
we wanted to have the level of granularity of understanding of loop execution 
times that is necessary for inferring when it is profitable to parallelize a 
loop that you've already proven to be sound to parallelize, then we could add 
it, at not much cost - we'd likely somehow hijack existing loop profiling 
functionality.
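A toy version of the trip-counting idea - this is not JSC's actual counter or data structure, just a sketch of the shape such a profile could take:

```javascript
// Hypothetical per-loop profile: the same style of counter that decides
// when to tier up to an optimizing JIT could feed a parallelization
// heuristic. Names and threshold are invented for illustration.
class LoopProfile {
  constructor() {
    this.entries = 0;      // times the loop was entered
    this.iterations = 0;   // total iterations across all entries
  }
  recordEntry()     { this.entries++; }
  recordIteration() { this.iterations++; }
  averageTripCount() {
    return this.entries === 0 ? 0 : this.iterations / this.entries;
  }
  // Invented threshold: a long average trip count suggests the loop is
  // hot enough to amortize the cost of going parallel.
  looksProfitable(threshold = 10000) {
    return this.averageTripCount() >= threshold;
  }
}
```

The point in the text is that JITs already pay for counters like this, so sharpening them for parallelization decisions is incremental cost, not a new subsystem.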

2) JITs are designed around sometimes making dumb decisions, bailing out when 
we detect that we've made a dumb decision, reconstituting the state of the 
world so that we're in a state "as if" we had never made that decision, 
recording that we were dumb to have made that decision, and then trying again 
without making that same dumb decision.  This closed-loop feedback is what 
allows us to infer types, reshape objects, make arithmetic fast, etc.  We - 
both us here in JSC and those V8 dudes - do it all the time, because we're 
awesome at our jobs.  So, if we had some statistical profiling that told us 
that it's likely that a loop takes a long time, we could add guards that assert 
that this forever continues to be true; if it turns out to be false then we 
deoptimize back to a serial world.  Conversely, if we had some profiling that 
told us that it is _NOT_ profitable to parallelize a loop, we could similarly 
guard this and "deoptimize" (more like *optimize*) into a parallel loop mode.
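The guard-and-deoptimize loop could be sketched like this (every function and field name here is invented; a real engine does this at the IR level, not in JS):

```javascript
// Plain serial execution of a loop body over n iterations.
function runSerial(body, n) {
  const out = [];
  for (let i = 0; i < n; i++) out.push(body(i));
  return out;
}

// Stand-in for a parallel schedule: a real engine would distribute
// iterations across threads; here we just tag the result so the chosen
// path is visible. Sequential equivalence means the values are the same.
function runParallel(body, n) {
  const out = runSerial(body, n);
  out.parallel = true;
  return out;
}

// Speculative dispatch with a deopt path: if we speculated "this loop
// runs long" and the guard fails, record the failed speculation and
// fall back to the serial world.
function executeLoop(body, n, profile) {
  if (profile.speculatedLong) {
    if (n < profile.longThreshold) {
      profile.speculatedLong = false;   // "we were dumb"; don't retry
      return runSerial(body, n);
    }
    return runParallel(body, n);
  }
  return runSerial(body, n);
}
```

The closed-loop part is the `speculatedLong = false` write: the failed guard feeds back into future compilation decisions, just like type-check failures do today.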

The irony here is that the overheads we already pay in JITs to enable even 
basic type inference already give us most of the functionality we would need to 
detect when it is profitable to autoparallelize.

This, to me, appears like a comprehensive solution to the profitability 
problem.  It still leaves the soundness problem.  Here we have some help from 
the JIT-crazy as well: for example we wouldn't have to *prove* that a loop has 
no carried side-effects; we could simply *speculate* that it doesn't and insert 
appropriate checks.  This is easier.  I don't know if it's easy and practical 
enough to put into a browser, though, which brings me to the central point:

If it is impractical to infer that a forEach closure passed to ParallelArray is 
safe-to-parallelize, then it follows that ParallelArrays are impractical, and 
we should R- this whole programming model and look for something else - like 
adding threads and locks to JS and calling it a day.  But if inferring that a 
forEach closure is safe-to-parallelize is doable, then it is _no_more_ doable 
in a world where you have the ParallelArray construct - since that construct 
only helps with the profitable-to-parallelize part of the equation.  We could 
already do it for any closure, or any loop body.  And then we could use 
existing JIT madness to take care of the profitability question.
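Coming back to the speculate-rather-than-prove idea from a couple of paragraphs up: a toy sketch of checking a "no shared writes" speculation at runtime (the Proxy mechanism is mine, not anything an engine actually does):

```javascript
// Sketch: instead of proving the body has no loop-carried side-effects,
// speculate that it doesn't and check. The body gets a write-tracking
// view of shared state; any write falsifies the speculation. Everything
// here is invented for illustration.
function runSpeculatively(input, body, shared) {
  let wroteShared = false;
  const tracked = new Proxy(shared, {
    set(target, key, value) {
      wroteShared = true;           // speculation failed
      target[key] = value;          // still perform the write, in order
      return true;
    }
  });
  // Optimistically compute results as if iterations were independent.
  const out = input.map(x => body(x, tracked));
  // Report whether a parallel schedule would have been sound.
  return { out, soundToParallelize: !wroteShared };
}
```

The check is much easier than a proof, which is exactly the trade the text describes - but, as the text also says, whether it's cheap enough to ship in a browser is an open question.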

> (Honest question, I'm not trying to be snarky, nor am I an expert).

Hope my answers help! :-)

Anyway, I was the one being snarky here. ;-)  I don't actually think that this 
is a good feature for the web.  It would be better to just give JS a more 
general-purpose concurrency construct - something that would work even if you 
*did* have side effects!  I like threads and locks, but then again, I've been 
told that I'm insane for liking them.  But I do know that to do the work to 
infer the safe-to-parallelize that ParallelArrays need would be a *huge* 
opportunity cost for the project - and if it was successful then it would be 
better to go all the way and just infer the profitable-to-parallelize bit as 
well.  It's better to infer things for the programmer than to have the 
programmer tell you things, if at all possible!

-Filip

_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev