Re: The reason for scads of keyed variants
At 11:17 PM +0200 9/1/03, Leopold Toetsch wrote:
>Dan Sugalski [EMAIL PROTECTED] wrote:
>[ heavily snipped ]
>
>> Now, for aggregates that hold PMCs ...
>> ... and on JITted cores there's no win at all. For aggregates that *don't* hold PMCs, though, that's where the win is.
>
>> If we don't have direct operations on aggregate elements but instead have to do a fetch and perform the operation on what we fetch, it means we have to create a temporary PMC for each 'fake' entry, one potentially with a fair amount of knowledge about how the aggregate works, which means that every aggregate will need to provide two vtables, one for itself and one for an aggregate entry.
>
>I don't see the point here, especially why we would need a temporary PMC. If we have an array of packed ints, I just need a pointer to the element to work on it. This is very similar to the C<key> opcode I had in some of my proposals.

We can't do that. There's insufficient information about the variables at compile time to do anything other than call into the PMC to do the operation, so the internal representation's irrelevant. As far as Parrot is concerned, it's a PMC, and it doesn't know anything about the inside.

>> Going with direct operations on keyed aggregates, then, makes the code somewhat more dense, since we only need to access one vtable per operand rather than two (one to fetch the data and one to operate on it). That's balanced somewhat by having to have two sets of PMC opcodes, one that operates on all parts keyed and one that doesn't.
>
>Not when you need 64 opcodes for the keyed variants. 64:1 isn't "somewhat balanced".

Erm... you need to redo your math. Even *if* we went with a full set of keyed and unkeyed parameters (and I don't see a reason to do so, though we certainly can), it's not 64. At worst it's 16 for three-arg ops, and that covers the keyed-int, keyed-normal, and nonkeyed versions together.
>> More information's available if anyone wants it, but this *is* the way we're going unless someone proves that it's suboptimal in the common case.
>
>Implementation details wanted ;-)

I'll go thump the ops preprocessor, then. There's no reason to actually write all the code for it, as it can be autogenerated.

--
                                        Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: The reason for scads of keyed variants
Dan Sugalski [EMAIL PROTECTED] wrote:
> At 11:17 PM +0200 9/1/03, Leopold Toetsch wrote:
>> I don't see the point here, especially why we would need a temporary PMC. If we have an array of packed ints, I just need a pointer to the element to work on it. This is very similar to the C<key> opcode I had in some of my proposals.
>
> We can't do that. There's insufficient information about the variables at compile time to do anything other than call into the PMC to do the operation, so the internal representation's irrelevant. As far as Parrot is concerned, it's a PMC and it doesn't know anything about the inside.

We don't lose the abstraction with a special C<key> opcode. This C<key> opcode is the abstraction: get a pointer to an item in the aggregate (or prepare for a LHS). This is a vtable method of the aggregate. Splitting the 3-keyed operations down into simpler parts, that's it - no permutations.

>> Not when you need 64 opcodes for the keyed variants. 64:1 isn't "somewhat balanced".
>
> Erm... you need to redo your math.

So I'll try. We have 4 different addressing modes for one single keyed argument:

   set_p_k
   set_p_kc
   set_p_ki
   set_p_kic

Now when we want to support 3-keyed operations, there are 4*4*4 different opcodes for an all-3-keyed operation (+ some more if one operand isn't keyed). If we only have e.g.

   op_p_k_p_k_p_k

we are missing the nice optimization of integer indices, we have to go through key_integer(), we have to check for NULL keys, and we need a full-fledged (albeit constant) Key PMC per argument.

Further: you have to implement type morphing, range checking, and BIG* promotion in the non-keyed variants and in the keyed vtable variants too. This is error-prone and a waste of memory.

And finally, you are saying that these are most useful for aggregates containing non-PMCs. What about

   @a[i] = @b[i] + 1;

or the equivalent Perl6 vector operations[1]? More permutations of opcodes to implement these? ...
>> Even *if* we went with a full set of keyed and unkeyed parameters (and I don't see a reason to do so, though we certainly can) it's not 64. At worst it's 16 for three-arg ops, and that's for both the keyed int, keyed normal, and nonkeyed version.

What am I missing here?

>> Implementation details wanted ;-)
>
> I'll go thump the ops preprocessor, then. There's no reason to actually write all the code for it, as it can be autogenerated.

But please with a Configure switch to turn it off ;-)

[1] I expect the most speed gain here, when we can optimize to hell.

   @a ^= @b + 1    // @a, @b are known to be arrays of packed int

   n = @b.elements()
   c = n / CHUNK_SIZE
   r = n % CHUNK_SIZE
   for i (0..c-1)
      ptra = @a.make_chunk(i)
      ptrb = @b.get_chunk(i)
      for (1..CHUNK_SIZE)
         *ptra++ = *ptrb++ + 1
   // do the rest (r items) ...
   @a.elements = @b.elements

This would be a sequence of some specialized opcodes (eventually with a runtime check in front) and it's fully JITtable.

leo
Re: The reason for scads of keyed variants
Leopold Toetsch writes:
> We don't lose the abstraction with a special C<key> opcode. This C<key> opcode is the abstraction: get a pointer to an item in the aggregate (or prepare for a LHS). This is a vtable method of the aggregate. Splitting the 3-keyed operations down into simpler parts, that's it - no permutations.

And I think you're saying that it'll be illegal to use this pointer PMC if the aggregate changes or anything like that, so the proxy can be as dumb and fast as possible... right? And that it wouldn't really need a header. So it wouldn't really be a PMC. But that's okay, I think.

I'm wondering how the actual ops would look in this case, though.

    foo a[x], b[y]

turns into:

    key P0, a[x]
    key P1, b[y]
    foo P0, P1

How does foo know that it's working on fake PMCs? Or am I missing your vision (probably am)?

> [1] I expect the most speed gain here, when we can optimize to hell.
>
>    @a ^= @b + 1    // @a, @b are known to be arrays of packed int

FWIW, that does something entirely useless[1]: the equivalent of:

    @a = (1 + @b) x @a

I think you meant to say:

    @a = @b ^+ 1

>    n = @b.elements()
>    c = n / CHUNK_SIZE
>    r = n % CHUNK_SIZE
>    for i (0..c-1)
>       ptra = @a.make_chunk(i)
>       ptrb = @b.get_chunk(i)
>       for (1..CHUNK_SIZE)
>          *ptra++ = *ptrb++ + 1
>    // do the rest ...
>    @a.elements = @b.elements
>
> This would be a sequence of some specialized opcodes (eventually with a runtime check in front) and it's fully JITtable.
That would be, um, very cool. PDL, eat your heart out!

Luke

[1] I'm saying that word with full awareness of its connotations on a list like this :-)
Re: The reason for scads of keyed variants
Luke Palmer [EMAIL PROTECTED] wrote:
> And I think you're saying that it'll be illegal to use this pointer PMC if the aggregate changes or anything like that, so the proxy can be as dumb and fast as possible... right? And that it wouldn't really need a header. So it wouldn't really be a PMC. But that's okay, I think.

It would be a Key PMC, probably with some extra flags. I didn't think much about the details yet. It's basically that part inside the current keyed set operations that gets at the value inside the aggregate. That piece of code is currently repeated all over, followed basically by some get_type or set_type operation. For aggregates holding basic types, a C pointer would be enough.

> I'm wondering how the actual ops would look in this case, though.
>
>     foo a[x], b[y]
>
> turns into:
>
>     key P0, a[x]
>     key P1, b[y]
>     foo P0, P1

Yes. The first one would be C<key_lhs> or such.

> How does foo know that it's working on fake PMCs? Or am I missing your vision (probably am)?

It doesn't matter for PMCs. The C<key_lhs> for C<a> has to prepare a new PMC slot in the aggregate, P0 points at it, and C<foo> then sets the value.

> I think you meant to say:
>
>     @a = @b ^+ 1

Oops, yes, of course.

>> This would be a sequence of some specialized opcodes (eventually with a runtime check in front) and it's fully JITtable.
>
> That would be, um, very cool. PDL, eat your heart out!

I dunno which of the many acronyms of PDL you are applying here :-) But anyway, such optimizations make vector ops fly at optimized-C speed. Fat multikeyed opcodes won't do that.

> Luke

leo
The reason for scads of keyed variants
I should read the list and respond to the outstanding stuff, but I should also get this done, and since the former probably precludes the latter...

Why, exactly, have I spec'd (nay, demanded!) that every darned operation in a PMC's vtable have a keyed variant? Simple. A combination of speed and space. (And yes, I know we're going to fight over this one)

Now, for aggregates that hold PMCs and are relatively passive containers (that is, they don't get in the way of anything besides load and store operations, if that), the keyed variants provide no benefit to speak of. There's less opcode dispatch overhead, but there's not a whole lot of that in general anyway, and on JITted cores there's no win at all.

For aggregates that *don't* hold PMCs, though, that's where the win is. Those are the sorts of aggregates we're aggressively targeting as well. One of the big issues with Perl, Python, and Ruby is that the base variable data structure is big. (And we're not making it any smaller with Parrot--our PMCs are pretty hefty still) Optimizing the size of an individual scalar isn't much of a win, but optimizing the size of arrays and hashes of scalars is a win. (This is one of the lessons learned from Chip's Topaz project) Many hashes and arrays don't need full-fledged PMCs in them, nor the space that full PMCs take. A string, integer, or bool array is sufficient for many of the uses that aggregates are put to.

This is a good argument for abstract aggregate element access, which we have now, and that's good. It saves us space, potentially. Yay us. How this argues for keyed access to the operations on aggregate elements, however, is less obvious, but I think it's worth it.
If we don't have direct operations on aggregate elements but instead have to do a fetch and perform the operation on what we fetch, it means we have to create a temporary PMC for each 'fake' entry, one potentially with a fair amount of knowledge about how the aggregate works, which means that every aggregate will need to provide two vtables: one for itself and one for an aggregate entry.

Going with direct operations on keyed aggregates, then, makes the code somewhat more dense, since we only need to access one vtable per operand rather than two (one to fetch the data and one to operate on it). That's balanced somewhat by having to have two sets of PMC opcodes, one that operates on all parts keyed and one that doesn't.

The integrated approach *also* makes it easier to optimize the operation. For example, this code:

    foo[12] = foo[12] + bar[15];

or the more compact

    foo[12] += bar[15];

has the destination identical to one of the sources. In the keyed access scheme, that's a single operation, one where foo's vtable function can *easily* determine that the destination and one of the sources are identical. (It's a pair of comparisons) If accessing foo is somewhat expensive--for example, requiring a DB access--knowing that the source and destination are identical can allow for optimizations that wouldn't otherwise be allowable. (For example, if foo and bar represented fields in two tables, knowing that the destination and one of the sources are identical can potentially cut out one of the DB accesses) While this is doable if we pass in plain scalar PMCs, it'll have to be more expensive, and as such is less likely to be done.

Yes, I also know that this potentially breaks down in more complex expressions. That's fine--it means that we can optimize some accesses rather than all accesses. I'm OK with that, as it's better than optimizing no accesses at all.
More information's available if anyone wants it, but this *is* the way we're going unless someone proves that it's suboptimal in the common case.

--
                                        Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: The reason for scads of keyed variants
Dan Sugalski [EMAIL PROTECTED] wrote:
[ heavily snipped ]

> Now, for aggregates that hold PMCs ...
> ... and on JITted cores there's no win at all. For aggregates that *don't* hold PMCs, though, that's where the win is.

> If we don't have direct operations on aggregate elements but instead have to do a fetch and perform the operation on what we fetch, it means we have to create a temporary PMC for each 'fake' entry, one potentially with a fair amount of knowledge about how the aggregate works, which means that every aggregate will need to provide two vtables, one for itself and one for an aggregate entry.

I don't see the point here, especially why we would need a temporary PMC. If we have an array of packed ints, I just need a pointer to the element to work on it. This is very similar to the C<key> opcode I had in some of my proposals.

> Going with direct operations on keyed aggregates, then, makes the code somewhat more dense, since we only need to access one vtable per operand rather than two (one to fetch the data and one to operate on it). That's balanced somewhat by having to have two sets of PMC opcodes, one that operates on all parts keyed and one that doesn't.

Not when you need 64 opcodes for the keyed variants. 64:1 isn't "somewhat balanced".

> More information's available if anyone wants it, but this *is* the way we're going unless someone proves that it's suboptimal in the common case.

Implementation details wanted ;-)

leo