Re: The reason for scads of keyed variants

2003-09-02 Thread Dan Sugalski
At 11:17 PM +0200 9/1/03, Leopold Toetsch wrote:
Dan Sugalski [EMAIL PROTECTED] wrote:

[ heavily snipped ]

 Now, for aggregates that hold PMCs ...
  ... and on JITted cores there's no
 win at all.

 For aggregates that *don't* hold PMCs, though, that's where the win
 is.

 If we don't have direct operations on aggregate elements but instead
 have to do a fetch and perform the operation on what we fetch, it
 means we have to create a temporary PMC for each 'fake' entry, one
 potentially with a fair amount of knowledge about how the aggregate
 works, which means that every aggregate will need to provide two
 vtables, one for itself and one for an aggregate entry.
I don't see the point here, especially why we would need a temporary PMC.
If we have an array of packed ints, I just need a pointer to the element
to work on it. This is very similar to the C<key> opcode I had in some of
my proposals.
We can't do that. There's insufficient information about the 
variables at compile time to do anything other than call into the PMC 
to do the operation, so the internal representation's irrelevant. As 
far as parrot is concerned, it's a PMC and it doesn't know anything 
about the inside.

  Going with direct operations on keyed aggregates, then, makes the
 code somewhat more dense, since we only need to access one vtable per
 operand rather than two (one to fetch the data and one to operate on
 it). That's balanced somewhat by having to have two sets of PMC
 opcodes, one that operates on all parts keyed and one that doesn't.
Not when you need 64 opcodes for the keyed variants. 64:1 isn't
somewhat balanced.
Erm, you need to redo your math. Even *if* we went with a full set 
of keyed and unkeyed parameters (and I don't see a reason to do so, 
though we certainly can) it's not 64. At worst it's 16 for three-arg 
ops, and that covers the keyed int, keyed normal, and nonkeyed 
versions.

  More information's available if anyone wants it, but this *is* the
 way we're going unless someone proves that it's suboptimal in the
 common case.
Implementation details wanted ;-)
I'll go thump the ops preprocessor, then. There's no reason to 
actually write all the code for it, as it can be autogenerated.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: The reason for scads of keyed variants

2003-09-02 Thread Leopold Toetsch
Dan Sugalski [EMAIL PROTECTED] wrote:
 At 11:17 PM +0200 9/1/03, Leopold Toetsch wrote:

I don't see the point here, especially why we would need a temporary PMC.
If we have an array of packed ints, I just need a pointer to the element
to work on it. This is very similar to the C<key> opcode I had in some of
my proposals.

 We can't do that. There's insufficient information about the
 variables at compiletime to do anything other than call into the PMC
 to do the operation, so the internal representation's irrelevant. As
 far as parrot is concerned, it's a PMC and it doesn't know anything
 about the inside.

We don't lose the abstraction with a special C<key> opcode. The C<key>
opcode *is* the abstraction: get a pointer to an item in the aggregate (or
prepare one for a LHS). This is a vtable method of the aggregate. It splits
the 3-keyed operations into simpler parts, that's all - no permutations.

Not when you need 64 opcodes for the keyed variants. 64:1 isn't
somewhat balanced.

 Erm, you need to redo your math.

So I'll try. We have 4 different addressing modes of one single keyed
argument:

 set_p_k
 set_p_kc
 set_p_ki
 set_p_kic

Now when we want to support 3-keyed operations there are 4*4*4 different
opcodes of an all-3-keyed operation (+ some more if one isn't keyed).

If we only have e.g. op_p_k_p_k_p_k, we lose the nice
optimization for integer indices: we have to go through key_integer(),
have to check for NULL keys, and we need a full-fledged (albeit constant)
Key PMC per argument.
Further: you have to implement type morphing, range checking, and BIG*
promotion in the non-keyed variants and in the keyed vtable variants
too. This is error-prone and a waste of memory.

And finally, you are saying that these are most useful for aggregates
containing non-PMCs. What about

  @a[i] = @b[i] + 1;

or the equivalent Perl6 vector operations[1]? More permutations of opcodes
to implement these?

 ... Even *if* we went with a full set
 of keyed and unkeyed parameters (and I don't see a reason to do so,
 though we certainly can) it's not 64. At worst it's 16 for three-arg
 ops, and that covers the keyed int, keyed normal, and nonkeyed
 versions.

What am I missing here?

Implementation details wanted ;-)

 I'll go thump the ops preprocessor, then. There's no reason to
 actually write all the code for it, as it can be autogenerated.

But please with a Configure switch to turn it off ;-)

[1] I expect the most speed gain here, when we can optimize to
hell.

  @a = @b + 1   // @a, @b are known to be arrays of packed int

  n = @b.elements()
  c = n / CHUNK_SIZE
  r = n % CHUNK_SIZE
  for i (0..c-1)
ptra = @a.make_chunk(i)
ptrb = @b.get_chunk(i)
for (1..CHUNK_SIZE)
  *ptra++ = *ptrb++ + 1
  // do rest ...
  @a.elements = @b.elements

This would be a sequence of some specialized opcodes (eventually with a
runtime check in front) and it's fully JITtable.

leo


Re: The reason for scads of keyed variants

2003-09-02 Thread Luke Palmer
Leopold Toetsch writes:
 Dan Sugalski [EMAIL PROTECTED] wrote:
  At 11:17 PM +0200 9/1/03, Leopold Toetsch wrote:
 
 I don't see the point here, especially why we would need a temporary PMC.
 If we have an array of packed ints, I just need a pointer to the element
 to work on it. This is very similar to the C<key> opcode I had in some of
 my proposals.
 
  We can't do that. There's insufficient information about the
  variables at compiletime to do anything other than call into the PMC
  to do the operation, so the internal representation's irrelevant. As
  far as parrot is concerned, it's a PMC and it doesn't know anything
  about the inside.
 
 We don't lose the abstraction with a special C<key> opcode. The C<key>
 opcode *is* the abstraction: get a pointer to an item in the aggregate (or
 prepare one for a LHS). This is a vtable method of the aggregate. It splits
 the 3-keyed operations into simpler parts, that's all - no permutations.

And I think you're saying that it'll be illegal to use this pointer PMC
if the aggregate changes or anything like that, so the proxy can be as
dumb and fast as possible... right?  And that it wouldn't really need a
header.  So it wouldn't really be a PMC.  But that's okay, I think.

I'm wondering how the actual ops would look in this case though.

foo a[x], b[y]

turns into:

key P0, a[x]
key P1, b[y]
foo P0, P1

How does foo know that it's working on fake PMCs?  Or am I missing your
vision (probably am)?

 [1] I expect the most speed gain here, when we can optimize to
 hell.
 
   @a = @b + 1 // @a, @b are known to be arrays of packed int

FWIW, that does something entirely useless[1]: the equivalent of:

@a = (1 + @b) x @a

I think you meant to say:

@a = @b »+« 1


   n = @b.elements()
   c = n / CHUNK_SIZE
   r = n % CHUNK_SIZE
   for i (0..c-1)
 ptra = @a.make_chunk(i)
 ptrb = @b.get_chunk(i)
 for (1..CHUNK_SIZE)
   *ptra++ = *ptrb++ + 1
   // do rest ...
   @a.elements = @b.elements
 
 This would be a sequence of some specialized opcodes (eventually with a
 runtime check in front) and it's fully JITtable.

That would be, um, very cool.  PDL, eat your heart out!

Luke

[1] I'm saying that word with full awareness of its connotations on a
list like this :-)

 leo


Re: The reason for scads of keyed variants

2003-09-02 Thread Leopold Toetsch
Luke Palmer [EMAIL PROTECTED] wrote:
 Leopold Toetsch writes:

 And I think you're saying that it'll be illegal to use this pointer PMC
 if the aggregate changes or anything like that, so the proxy can be as
 dumb and fast as possible... right?  And that it wouldn't really need a
 header.  So it wouldn't really be a PMC.  But that's okay, I think.

It would probably be a Key PMC with some extra flags. I haven't thought
much about the details yet. It's basically the part inside the current
keyed set operations that gets at the value inside the aggregate.

This piece of code is currently repeated all over, basically followed by
a get_<type> or set_<type> operation.
For aggregates holding basic types a C pointer would be enough.

 I'm wondering how the actual ops would look in this case though.

 foo a[x], b[y]

 turns into:

 key P0, a[x]
 key P1, b[y]
 foo P0, P1

Yes. The first one would be C<key_lhs> or some such.

 How does foo know that it's working on fake PMCs?  Or am I missing your
 vision (probably am)?

It doesn't matter for PMCs. The C<key_lhs> for C<a> has to prepare a new
PMC slot in the aggregate, P0 points at it, and C<foo> then sets the value.

 I think you meant to say:

 @a = @b »+« 1

Oops, yes of course.

   n = @b.elements()
   c = n / CHUNK_SIZE
   r = n % CHUNK_SIZE
   for i (0..c-1)
 ptra = @a.make_chunk(i)
 ptrb = @b.get_chunk(i)
 for (1..CHUNK_SIZE)
   *ptra++ = *ptrb++ + 1
   // do rest ...
   @a.elements = @b.elements

 This would be a sequence of some specialized opcodes (eventually with a
 runtime check in front) and it's fully JITtable.

 That would be, um, very cool.  PDL, eat your heart out!

I dunno which of the many expansions of PDL you are applying here :-) But
anyway, such optimizations make vector ops fly at optimized-C-level
speed. Fat multi-keyed opcodes won't do that.

 Luke

leo


The reason for scads of keyed variants

2003-09-01 Thread Dan Sugalski
I should read the list and respond to the outstanding stuff, but I 
should also get this done, and since the former probably precludes 
the latter...

Why, exactly, have I spec'd (nay, demanded!) that every darned 
operation in a PMC's vtable have a keyed variant?

Simple. A combination of speed and space. (And yes, I know we're 
going to fight over this one)

Now, for aggregates that hold PMCs and are relatively passive 
containers (that is, they don't get in the way of anything besides 
load and store operations, if that) the keyed variants provide no 
benefit to speak of. Less opcode dispatch overhead, but there's not a 
whole lot of that in general anyway, and on JITted cores there's no 
win at all.

For aggregates that *don't* hold PMCs, though, that's where the win 
is. Those are the sorts of aggregates we're aggressively targeting as 
well. One of the big issues with perl, python, and ruby is that the 
base variable data structure is big. (And we're not making it any 
smaller with Parrot--our PMCs are pretty hefty still) Optimizing the 
size of an individual scalar isn't much of a win, but optimizing the 
size of arrays and hashes of scalars is a win. (This is one of the 
lessons learned from Chip's Topaz project) Many hashes and arrays 
don't need full-fledged PMCs in them, nor the space that full PMCs 
take. A string, integer, or bool array is sufficient for many of the 
uses that aggregates are put to.

This is a good argument for abstract aggregate element access, which 
we have now, and that's good. Saves us space, potentially. Yay us.

How this argues for keyed access to the operations on aggregate 
elements, however, is less obvious, but I think it's worth it.

If we don't have direct operations on aggregate elements but instead 
have to do a fetch and perform the operation on what we fetch, it 
means we have to create a temporary PMC for each 'fake' entry, one 
potentially with a fair amount of knowledge about how the aggregate 
works, which means that every aggregate will need to provide two 
vtables, one for itself and one for an aggregate entry.

Going with direct operations on keyed aggregates, then, makes the 
code somewhat more dense, since we only need to access one vtable per 
operand rather than two (one to fetch the data and one to operate on 
it). That's balanced somewhat by having to have two sets of PMC 
opcodes, one that operates on all parts keyed and one that doesn't.

The integrated approach *also* makes it easier to optimize the 
operation. For example, this code:

   foo[12] = foo[12] + bar[15];

or the more compact

   foo[12] += bar[15];

has the destination identical to one of the sources. In the keyed 
access scheme, that's a single operation, one where foo's vtable 
function can *easily* determine that the destination and one of the 
sources is identical. (It's a pair of comparisons) If accessing foo 
is somewhat expensive--for example requiring a DB access--having 
knowledge that the source and destination are identical can allow for 
optimizations that wouldn't otherwise be allowable. (For example, if 
foo and bar represented fields in two tables, knowing that the 
destination and one of the sources is identical can potentially cut 
out one of the DB accesses) While this is doable if we pass in plain 
scalar PMCs, it'll have to be more expensive, and as such is less 
likely to be done.

Yes, I also know that this potentially breaks down in more complex 
expressions. That's fine--it means that we can optimize some accesses 
rather than all of them. I'm OK with that, as it's better than 
optimizing none.

More information's available if anyone wants it, but this *is* the 
way we're going unless someone proves that it's suboptimal in the 
common case.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: The reason for scads of keyed variants

2003-09-01 Thread Leopold Toetsch
Dan Sugalski [EMAIL PROTECTED] wrote:

[ heavily snipped ]

 Now, for aggregates that hold PMCs ...
  ... and on JITted cores there's no
 win at all.

 For aggregates that *don't* hold PMCs, though, that's where the win
 is.

 If we don't have direct operations on aggregate elements but instead
 have to do a fetch and perform the operation on what we fetch, it
 means we have to create a temporary PMC for each 'fake' entry, one
 potentially with a fair amount of knowledge about how the aggregate
 works, which means that every aggregate will need to provide two
 vtables, one for itself and one for an aggregate entry.

I don't see the point here, especially why we would need a temporary PMC.
If we have an array of packed ints, I just need a pointer to the element
to work on it. This is very similar to the C<key> opcode I had in some of
my proposals.

 Going with direct operations on keyed aggregates, then, makes the
 code somewhat more dense, since we only need to access one vtable per
 operand rather than two (one to fetch the data and one to operate on
 it). That's balanced somewhat by having to have two sets of PMC
 opcodes, one that operates on all parts keyed and one that doesn't.

Not when you need 64 opcodes for the keyed variants. 64:1 isn't
somewhat balanced.

 More information's available if anyone wants it, but this *is* the
 way we're going unless someone proves that it's suboptimal in the
 common case.

Implementation details wanted ;-)

leo