nigelsande...@btconnect.com wrote:

From that statement, you do not appear to understand the subject matter of this thread: the Perl 6 concurrency model.

If I misunderstood then I apologize: I had thought that the subject was the underlying abstractions of parallelism and concurrency that the Perl 6 language will define in order to enable specific threading modules to be provided by implementations. If the subject is specifically contemporary [multicore] CPU threading implementations, then my comments may not be relevant.


For CPU-bound processes [...].

Sure, there is exotic hardware that has thousands of cores, but it hardly seems likely that, having spent $millions upon such hardware to run your massively parallel algorithms to solve problems in realistic time frames, you're going to use a dynamic (non-compiled) language to write your solutions in.

Any midrange GPU will support millions of thread launches per second. One frame of 720p video has nearly a million pixels (1280 x 720 = 921,600), and these days it is not uncommon to launch multiple threads per pixel.
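
To put a shape to that: the per-pixel pattern can already be written down at the language level. A sketch only, written against Rakudo's .race as it exists today (which runs on CPU threads, not a GPU; the batch size is an arbitrary choice), but the shape, one independent computation per pixel, is exactly what a GPU wants:

    # One independent computation per pixel; .race partitions the work.
    my int32 @pixels = 0 xx (1280 * 720);                  # one 720p frame
    sub brighten(int32 $p --> int32) { ($p + 16) min 255 }
    my @out = @pixels.race(batch => 4096).map(&brighten);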

True, using a dynamic programming language for the guts of these threads is probably not a good tradeoff today: I'd probably use an Inline::OpenCL module initially. But I see no reason that a strongly statically typed subset of Perl 6 could not be compiled to efficient device code.
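
To be concrete about what I mean by Inline::OpenCL: no such module exists as far as I know, so the module name, the opencl-kernel helper, and its arguments below are all invented for illustration. The intended division of labour is OpenCL C for the kernel body and Perl 6 for the orchestration:

    # Hypothetical throughout: Inline::OpenCL and opencl-kernel are invented.
    use Inline::OpenCL;

    my $src = q:to/CL/;
        __kernel void brighten(__global uchar *px) {
            size_t i = get_global_id(0);
            px[i] = min(px[i] + 16, 255);
        }
        CL

    my &brighten = opencl-kernel($src, :name<brighten>);   # hypothetical helper
    brighten(@frame, :global-size(@frame.elems));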

To use millions of threads you don't focus on what the algorithm is doing: you focus on where the data is going. If you move data unnecessarily (or fail to move it when it is necessary) then you'll burn power and lose performance.

Sorry, but I've got to call you on this.

Parallelisation (threading) is all about improving performance. And the first three rules of performance are: algorithm; algorithm; algorithm. Choose the wrong algorithm and you are wasting cycles. Parallelise that wrong algorithm, and you're just multiplying the number of cycles you're wasting.

I always thought that the first rule of performance optimization is "measure" (i.e. run a profiler). But ignoring that quibble, the reason that bad algorithms are bad is (usually) bad data management (either unnecessary movement, or unnecessary locking). If you want to understand why an algorithm is inefficient then you need to study the data accesses, not (just) the processing. A special case of bad data movement, that applies even to sequential code, is cache-thrashing.
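
Cache-thrashing is easy to demonstrate: the two loops below do identical work and differ only in which index moves fastest. A sketch assuming Rakudo's shaped native arrays (stored flat, row-major); the exact penalty varies by machine:

    my $n = 1024;
    my int32 @m[$n;$n];

    # Row-major traversal: the inner index walks contiguous memory.
    for ^$n -> $i { for ^$n -> $j { @m[$i;$j]++ } }

    # Column-major traversal: each step strides $n elements apart,
    # defeating the cache. Same work, same result, far more misses.
    for ^$n -> $j { for ^$n -> $i { @m[$i;$j]++ } }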

This is somewhat analogous to ASIC design: at around the 0.13 um process node, wire-load delays started to dominate gate delays, initially just on the longer routes, but these days you can't ignore them anywhere. This doesn't mean that you can completely ignore the logic; it just means that logic optimization is taken as a given, and that the real work is in placement and routing.

Similarly, while a bad algorithm is obviously bad, even a good algorithm will perform badly (i.e. will waste memory bandwidth and power) if you don't have a way to define how the data will move in the implementation of that algorithm. If you're not careful, you'll burn more power moving data from/to memory than processing it.

Slightly stale figures: moving a 32-bit value to local DRAM may use 1 nJ (one nanojoule), while moving that same value a millimeter on-chip may burn only 10 pJ, similar to the energy of a single-precision floating-point operation; but that op needs two source operands and must write a destination, each of which requires a data movement. You can, of course, ignore these issues if you're just managing a handful of IO-bound CPU threads.
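
Spelling that arithmetic out, with the same order-of-magnitude figures (assumptions, not measurements):

    # Back-of-envelope: how data movement dwarfs the arithmetic it feeds.
    my $e-dram = 1e-9;     # ~1 nJ per 32-bit word moved to/from local DRAM
    my $e-flop = 10e-12;   # ~10 pJ per single-precision op (~ per mm on-chip)
    my $moves  = 3;        # two source reads plus one destination write

    say "DRAM traffic / arithmetic = { $moves * $e-dram / $e-flop }x";
    # prints 300x: if the operands round-trip through DRAM, the FLOP is noise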

Feel free to tell me that Perl 6 will never be used in scenarios where such considerations are important. I'll probably disagree. But I could probably be persuaded that the current type system has sufficient mechanisms (via traits) to define data placement without any new features, and that the issue can therefore be ignored until someone actually attempts to implement OpenCL bindings.
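
For instance (a runnable toy, not a proposal: "is placed" is a name I have just made up, and a real binding would allocate device memory rather than print):

    # A custom variable trait that records intended data placement.
    # The note fires when the declaration below is compiled; an OpenCL
    # binding could use such a hook to choose the backing store instead.
    multi sub trait_mod:<is>(Variable:D $v, :$placed!) {
        note "placement requested: $placed";
    }

    my int32 @frame is placed<device>;   # notes: placement requested: device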


Dave.
