nigelsande...@btconnect.com wrote:
> From that statement, you do not appear to understand the subject matter
> of this thread: the Perl 6 concurrency model.
If I misunderstood then I apologize: I had thought that the subject was
the underlying abstractions of parallelism and concurrency that the
Perl 6 language will define in order to enable specific threading modules
to be provided by implementations. If the subject is specifically
contemporary [multicore] CPU threading implementations, then my comments
may not be relevant.
>> For CPU-bound processes [...].
> Sure, there is exotic hardware that has thousands of cores, but it
> hardly seems likely that, having spent $millions on such hardware to
> run your massively parallel algorithms to solve problems in realistic
> time frames, you're going to use a dynamic (non-compiled) language to
> write your solutions in.
Any midrange GPU will support millions of thread launches per second.
One frame of 720p video has roughly a million pixels, and these days it
is not uncommon to have multiple threads per pixel.
True, using a dynamic programming language for the guts of these threads
is probably not a good tradeoff today: I'd probably use an
Inline::OpenCL module initially. But I see no reason that a strongly
statically typed subset of Perl 6 could not be compiled to efficient
device code.
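As a back-of-envelope sanity check on those per-pixel numbers (my own
illustrative arithmetic, not figures from this thread):

```python
# Rough arithmetic for per-pixel GPU threading on 720p video.
# Frame rate and threads-per-pixel are illustrative assumptions.
width, height = 1280, 720
pixels_per_frame = width * height      # 921,600: "roughly a million"
threads_per_pixel = 2                  # assumed: e.g. two shader passes
fps = 30

threads_per_second = pixels_per_frame * threads_per_pixel * fps
print(f"{pixels_per_frame:,} pixels per frame")
print(f"{threads_per_second:,} thread launches per second")
```

Even this modest workload launches tens of millions of threads per
second, which is why per-thread launch overhead matters so much more on
a GPU than on a handful of CPU cores.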
>> To use millions of threads you don't focus on what the algorithm is
>> doing: you focus on where the data is going. If you move data
>> unnecessarily (or fail to move it when it was necessary) then you'll
>> burn power and lose performance.
> Sorry, but I've got to call you on this.
> Parallelisation (threading) is all about improving performance. And the
> first three rules of performance are: algorithm; algorithm; algorithm.
> Choose the wrong algorithm and you are wasting cycles. Parallelise that
> wrong algorithm, and you're just multiplying the number of cycles
> you're wasting.
I always thought that the first rule of performance optimization is
"measure" (i.e. run a profiler). But ignoring that quibble, the reason
that bad algorithms are bad is (usually) bad data management (either
unnecessary movement, or unnecessary locking). If you want to understand
why an algorithm is inefficient then you need to study the data
accesses, not (just) the processing. A special case of bad data
movement, that applies even to sequential code, is cache-thrashing.
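A toy model makes the cache-thrashing point concrete. The cache-line
size, matrix dimensions, and no-reuse assumption below are all
illustrative simplifications of mine, not measurements:

```python
# Toy model: memory traffic for row-major vs column-major traversal of a
# row-major matrix. Assumes a 64-byte cache line holding 8 consecutive
# 8-byte elements, and no line reuse in the strided (column) case.
LINE_ELEMS = 8            # elements per cache line (assumed)
ROWS = COLS = 1024        # matrix assumed much larger than the cache

def lines_fetched(order):
    """Estimate distinct cache-line fetches for one full traversal."""
    if order == "row":    # unit stride: one fetch serves 8 elements
        return ROWS * COLS // LINE_ELEMS
    else:                 # column stride: every access fetches a new line
        return ROWS * COLS

print(lines_fetched("row"))   # unit-stride traversal
print(lines_fetched("col"))   # strided traversal: 8x the traffic
```

The processing is identical in both orders; only the data movement
differs, yet under this model the strided traversal pulls eight times
as many cache lines from memory.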
This is somewhat analogous to ASIC design: at around the 0.13 um
process node, wire-load delays started to dominate gate delays. Wires
became a dominant source of delay for many paths, initially just the
longer routes, but these days you can't ignore them anywhere. This
doesn't mean that you can completely ignore the logic: it just means
that logic optimization is taken as a given, and that the real work is
in placement and routing.
Similarly, while a bad algorithm is obviously bad, even a good algorithm
will perform badly (i.e. will waste memory bandwidth and power) if you
don't have a way to define how the data will move in the implementation
of that algorithm. If you're not careful then you'll burn more power
moving data from/to memory than processing it. (Slightly stale data:
moving a 32-bit value to local DRAM may use 1 nJ (1 nanojoule); moving
that same value a millimeter on-chip may burn only 10 pJ, similar to the
energy of a single-precision floating-point operation. But that op needs
2 source operands and must write a destination, each of which requires a
data movement.) You can, of course, ignore these issues if you're just
managing a handful of IO-bound CPU threads.
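Putting those (admittedly stale) per-operation figures together shows
how lopsided the budget gets. The arithmetic below is my own sketch
using the numbers quoted above:

```python
# Energy sketch using the figures quoted in the text (stale, order-of-
# magnitude values): ~1 nJ per 32-bit DRAM access, ~10 pJ per ~1 mm
# on-chip move, ~10 pJ per single-precision FLOP.
NJ = 1e-9
PJ = 1e-12

dram_move   = 1 * NJ     # one 32-bit operand, off-chip
onchip_move = 10 * PJ    # one operand moved ~1 mm on-chip
flop        = 10 * PJ    # one single-precision operation

# Each FLOP implies three movements: two source reads, one result write.
flop_from_dram   = flop + 3 * dram_move
flop_from_onchip = flop + 3 * onchip_move

print(f"DRAM-fed FLOP:  {flop_from_dram / PJ:.0f} pJ")
print(f"on-chip FLOP:   {flop_from_onchip / PJ:.0f} pJ")
print(f"ratio: {flop_from_dram / flop_from_onchip:.1f}x")
```

Under these assumptions the DRAM-fed operation costs roughly two orders
of magnitude more energy than the on-chip one, and almost all of the
difference is data movement, not computation.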
Feel free to tell me that Perl 6 will never be used in scenarios where
such considerations are important. I'll probably disagree. But I could
probably be persuaded that the current type system has sufficient
mechanisms (via traits) to define data placement without any new
features, and therefore that the issue can be ignored until someone
actually attempts to implement OpenCL bindings.
Dave.