Re: SAPI (Was RE: args, argv in parrot?)
On Tuesday 04 December 2001 12:21 am, David M. Lloyd wrote: He was telling me about the way PHP separates the interpreter from the OS-specific stuff via a SAPI layer. Fascinating, but Perl is, after all, a scripting language first. perl -e '' is essential; though I could see a compatibility mode being enabled by default with '-e' if necessary. The level of deviation is pretty much up to Larry, I suppose. Basically I see the environ / command args being globally shared PMCs from the beginning. They can be accessed by any language more or less the same way, whether they're functions or scalars; but they're more fluid as scalars. Plus it's trivial to lock them down once they're encapsulated inside thread-safe PMCs. -Michael
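[Editor's sketch] The "globally shared, lockable args" idea above can be illustrated with a minimal C sketch. All names here (SharedArgs, args_init, args_get) are hypothetical; a real PMC would route these accesses through its vtable, but the locking discipline is the point.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Sketch: command args exposed as one globally shared, lockable object.
   Hypothetical names; a real implementation would be a thread-safe PMC. */
typedef struct {
    pthread_mutex_t lock;
    int    argc;
    char **argv;
} SharedArgs;

static SharedArgs g_args = { PTHREAD_MUTEX_INITIALIZER, 0, NULL };

void args_init(int argc, char **argv) {
    pthread_mutex_lock(&g_args.lock);
    g_args.argc = argc;
    g_args.argv = argv;
    pthread_mutex_unlock(&g_args.lock);
}

/* Every language gets the same accessor, function-style or scalar-style. */
const char *args_get(int i) {
    const char *s = NULL;
    pthread_mutex_lock(&g_args.lock);
    if (i >= 0 && i < g_args.argc)
        s = g_args.argv[i];
    pthread_mutex_unlock(&g_args.lock);
    return s;
}
```

Because the lock lives inside the shared object, "locking them down" costs nothing to callers beyond the accessor call itself.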
Re: Thoughts on vtables...
On Sunday 02 December 2001 02:47 pm, Brent Dax wrote: Quick comment: I've been thinking about constructing an 'OpaqueHandle' PMC type. All the programmer is supposed to know about it is that it points to Something (usually a C struct, but they don't have to know that). I'm not sure what trying to access it results in--perhaps it'll look like a reference from the outside (ParrotInterp=HANDLE(0xDECAF)), or maybe it'll throw a tantrum. Perl 5 just used a string as the generic C-struct handle, as far as I know. You then use pack/unpack to access it further (assuming the programmer knew what they were doing). There are tiny gotchas (such as we'd have to have flags so the GC knew the data wasn't in relocatable memory, etc), but since we need that sort of support for Perl 5 anyway, I don't see a great advantage in creating a new datatype for Perl 6; unless our least-common denominator needs it, I guess (e.g. Lisp, Python, etc). -Michael
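[Editor's sketch] The opaque-handle idea above, including the "flag so the GC knows the data isn't relocatable" gotcha, might look like this in C. Every name here (OpaquePMC, the flag constants, the wrap/unwrap functions) is hypothetical, not Parrot API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flags: tell the GC the payload is foreign, non-relocatable
   memory that it must neither move nor scan. */
#define PMC_FLAG_OPAQUE   (1 << 0)
#define PMC_FLAG_NO_RELOC (1 << 1)

typedef struct OpaquePMC {
    unsigned flags;
    void    *data;     /* points to Something -- usually a C struct */
} OpaquePMC;

/* Wrap a raw C pointer in an opaque handle. */
OpaquePMC *opaque_wrap(void *p) {
    OpaquePMC *pmc = malloc(sizeof(OpaquePMC));
    pmc->flags = PMC_FLAG_OPAQUE | PMC_FLAG_NO_RELOC;
    pmc->data  = p;
    return pmc;
}

/* The only operation the programmer gets: recover the raw pointer. */
void *opaque_unwrap(const OpaquePMC *pmc) {
    return (pmc->flags & PMC_FLAG_OPAQUE) ? pmc->data : NULL;
}
```

This is essentially the Perl 5 string-as-handle trick with the pack/unpack step replaced by a typed accessor.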
Re: Butt-ugliness reduction
Combining these two, I have a couple comments. On Thursday 15 November 2001 02:07 pm, Simon Cozens wrote:

    if (pmc->flags & PS_INTEGER_OK) {
        return ((struct PerlScalarData *)(pmc->data))->intdata;

Why are you storing flags for PerlScalarData inside pmc->flags? If the nested type is doing the caching, then it should store the flags. Additionally, you should probably break out the variables for readability. The following assumes macros (bite me), typedefs (don't know why this isn't standardized; ever since programming in C++, I've never left C structs untypedef'd), and unions (as with Uri's suggestion):

    static INTVAL
    Parrot_scalar_get_integer(struct Parrot_Interp *interpreter, PMC *pmc)
    {
        if (isPMCScalarClass(pmc)) {
            PerlScalarData *psd = pmc->data.scalar;
            switch (psd->type) {
            case SCALAR_STR:
                psd->intdata = String2Int(psd->stringdata);
                setScalarIntOK(psd);
                return psd->intdata;
            case SCALAR_NUM:
            case SCALAR_NUM_STR:
                psd->intdata = (int)psd->numdata;
                setScalarIntOK(psd);
                /* falls through */
            case SCALAR_INT_NUM_STR:
            case SCALAR_INT_STR:
            case SCALAR_INT_NUM:
            case SCALAR_INT:
                return psd->intdata;
            default:
                /* throw exception */
            }
        }
        else {
            /* exception, since we're specifically requesting a scalar?? */
        }
    }

Nested if-statements could be used instead, but so long as type is <= num_items, this gets optimized into a fast table lookup (when -O2 is used). Though a bit longer, I think this is still readable. On Thu, Nov 15, 2001 at 02:01:53PM -0500, Dan Sugalski wrote: *) Use the cache entry in the PMC header. It'll mean for ints and floats you won't need to bother with a separate structure or dereference Currently, the cache only holds a single value. Is this for space efficiency? If we broke out the union into independent cached values, then the two layers could be collapsed into one.
(data would go unused for scalars.) Of course, the switch statement above would have to be masked, as with:

    #define getScalarID(x) (((x)->flags & SCALAR_MASK) >> SCALAR_MASK_OFFSET)

    switch (getScalarID(pmc)) { ... }

In general, I feel that OO semantics would be most readable here:

    if (pmc->isInteger()) {
        return pmc->getInt();
    } else if (pmc->isNum()) {
        return pmc->cvtNumToInt()->getInt();
    } else if (pmc->isString()) {
        return pmc->cvtStrToInt()->getInt();
    }

If this were C++, then the isXXX / cvtXXX would be inlined functions, which would have the same effect as macros but with much greater maintainability. Unfortunately we're using C, so the closest we have is macros. While speed is king, readability is VERY important (as I'm sure we all agree). The last alternative I can think of is to break PerlScalarData into its own sub-vtable. Thus:

    if (isPMCScalar(pmc)) {
        return pmc->data.scalar->vtable[SCALAR_GET_INT](pmc->data.scalar);
    }

This would have similar overhead to the table-lookup switch statement, and since these are extremely simple functions, many architectures (most notably SUN) wouldn't require a full-blown function call with frame management. As a last sort of alternative, I had envisioned PMCs as having separate vtables for each of the many different types of operations, meaning that a scalar-int would be separate from a scalar-string-num, etc. Since it seems that we're not currently doing this, it would require 8 times as many vtables, but it would avoid nearly all if-statements. I can't. :( /* It's more efficient to use our own caching than the PMC's cache. Why? Because we also need to store state on what type of thing is in the cache, and this can be one of three things (four, if you include nothing). Four things is two bits, which means two flag tests instead of one. Hence, we just use the old-style IOK/NOK/POK */ *) Swap out vtables as the pmc internals change, rather than vectoring off flags. Oops, yeah, that's what I was trying to say above.
That's probably the right way to do it. But what about overlapping types that are both integer- and string-OK? Hence the larger number of combined types, coupled with removing the cache and just storing each explicit type. On Thursday 15 November 2001 02:07 pm, Uri Guttman wrote: and the PMC would have in it:

    union {
        INTVAL *int ;
        NUMVAL *num ;
    } data ;

-Michael
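[Editor's sketch] The "swap out vtables as the pmc internals change" idea can be made concrete: one vtable per cached state, so the common-case get_integer does no flag test at all, and a state change is just a pointer assignment. All names (SPMC, ScalarVtable, etc.) are hypothetical, not actual Parrot structures.

```c
#include <assert.h>
#include <stdlib.h>

typedef long INTVAL;

struct SPMC;
typedef struct {
    INTVAL (*get_integer)(struct SPMC *);
} ScalarVtable;

typedef struct SPMC {
    const ScalarVtable *vtable;   /* encodes the cache state */
    INTVAL  intdata;
    char   *stringdata;
} SPMC;

static INTVAL get_int_cached(SPMC *s);
static INTVAL get_int_from_str(SPMC *s);

static const ScalarVtable int_ok_vtable = { get_int_cached };
static const ScalarVtable str_vtable    = { get_int_from_str };

/* State: integer cache valid -- no test, no branch. */
static INTVAL get_int_cached(SPMC *s) { return s->intdata; }

/* State: only the string is valid -- convert once, then swap vtables
   so every later call takes the fast path. */
static INTVAL get_int_from_str(SPMC *s) {
    s->intdata = strtol(s->stringdata, NULL, 10);
    s->vtable  = &int_ok_vtable;      /* state change == vtable swap */
    return s->intdata;
}

SPMC *new_str_scalar(char *str) {
    SPMC *s = malloc(sizeof(SPMC));
    s->vtable = &str_vtable;
    s->stringdata = str;
    return s;
}
```

The combinatorial cost the post worries about is real: with three cacheable representations, each combination of valid caches needs its own vtable.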
Re: memory assumptions
On Tuesday 13 November 2001 12:05 pm, Ken Fox wrote: Are _all_ handles (PMC's, Strings, or whatever) that serve as root-set elements dynamically allocated? Nope. Part of the root set includes auto C PMC handles used by extensions. You also have to think about hardware registers too if GC is running in another thread. I'm not worried about separate threads. Shared handles are already allowed to be slow and thus few and far between. Locking also alleviates most of the hangups. Most of my thinking involves running much of the GC within a worker thread (spawned by allocs), so different interpreters are, for the most part, independent of each other, even in garbage collection. My ideal is to only profile at alloc time, then hand off a job to a GC thread, but so far there are more problems than it's worth (especially with copying GCs). The alternative is to enforce read/write barriers on code and do all aspects of GCing in another thread. This avoids much pause time, but all but requires multi-threading (something I'd rather not require for one-liners). IMHO the easiest way to code this is to define an API that extension writers use to declare every PMC handle. This code looks ugly and encourages bugs, though, so a lot of people don't like it. Many systems treat the C stack as conservative roots. Those systems flush the hardware registers to the C stack before running the collector. The benefit is that handles don't need to be declared. The cost is the non-portable code used to scan the C stack and flush the registers. The various texts I'm reading suggest that memory management should be a black-box module. It's easier to debug and use. They suggest that too large a percentage of time is devoted to memory management (leak-finding), so the less XS-writers have to worry about memory, the more acceptance parrot will have IMHO. (Even if some performance loss or intermittent pausing is necessary... Though hopefully we'll be able to provide a choice between the two at run-time.)
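[Editor's sketch] The "black-box module" argument can be made concrete as a deliberately narrow interface: callers see only entry points, never the collector's internals. The names (MemManager, plain_mm, etc.) are hypothetical; the reference implementation below is just malloc with the size tracked by the manager itself, so callers never peek at allocator internals.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A deliberately narrow memory-manager interface: extension code sees
   only these entry points, never the collector's data structures.
   All names are hypothetical. */
typedef struct {
    void  *(*alloc)(size_t n);
    void   (*release)(void *p);      /* a hint, not a hard free */
    size_t (*size_of)(void *p);      /* ask the manager, don't peek */
} MemManager;

/* Reference implementation backed by plain malloc: the size is stored
   by the manager in a private header, so callers never rely on any
   particular allocator's layout. */
static void *mm_alloc(size_t n) {
    size_t *p = malloc(sizeof(size_t) + n);
    *p = n;
    return p + 1;
}
static void mm_release(void *p)  { free((size_t *)p - 1); }
static size_t mm_size_of(void *p) { return *((size_t *)p - 1); }

const MemManager plain_mm = { mm_alloc, mm_release, mm_size_of };
```

Swapping in a GC-backed implementation (e.g. one where release is a no-op and reclamation is deferred) changes nothing for callers, which is the point of the black box.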
Another simple solution is to suspend the collector while in extension code (or untrusted extension code). This works for lots of cases, but falls apart badly in callback/event-loop extensions that never return. Unfortunately, this would require hooks into the mem-mngr module, which violates its black-box nature. Not to mention the points you make. One nice thing about writing Perl extensions is that the SV * can be stuck in odd places. For example, I put them in X Toolkit and Motif widget user data slots. Once the SV * are stuck in the slots I have no idea what happens to them. This works fine because the ref count will keep the SV alive until the widget dies. A hybrid ref count GC with pinned PMCs will work exactly the same. If we get fancier though and either use movable PMCs or tracing collectors, then my user data slots become roots. That will crash and burn because I don't know how to register them. I'd have to use a separate array for holding the PMCs and then stick the array indexes into the widgets. So long as SV* equivalents are registered as foreign access, then XS modules shouldn't have to worry. Unfortunately this might cause memory leaks. Perhaps foreign access can utilize a reference count. XS modules that don't directly utilize GC-able objects register a newly allocated handle with foreign access, and are required to unregister it when they're done. When a handle is unregistered as many times as it's registered, it's removed from the foreign-access area, and thus its contents are reclaimable. The only remaining problem is in how the handle was acquired (hence my original question), since the handle itself needs to be reclaimed somehow, because the following is possible:

    PMC *ptr = malloc(sizeof(PMC));
    ptr = malloc(sizeof(PMC));   /* memory leak */

If allocations instead come from gc_malloc - or more likely from the arena newPMC() - then once the PMC is unregistered from foreign access, XS code can safely ignore it.
The only gotcha here is that XS code has to treat unregistering as if the PMC were being freed (and optimizations will probably do exactly that). Heck, I'm even thinking that there should be two forms of newPMC, one to be used within the root set, and one used externally: newPMC, and newPMCR (for registered). It's still too early to see what's a good direction. It's probably not worth making my job as an extension writer easier if it slows/complicates the rest of the GC. Most extension writers are pretty good at this sort of thing and a special utility library for pseudo-pinned PMCs wouldn't be any big deal. I don't really see a need for handles to be movable (especially if they're arena-based), but reclaiming handles might require some sort of extra record fields. Since handles are fixed-size and decently big (several integers worth), I'm wanting to prepend next/prev pointers to the structure and maintain from/to/free lists.
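[Editor's sketch] The register/unregister-with-refcount scheme described above might look like the following toy version. The table is a flat array for clarity (a real one would hash), and all names (PL_registerForeignAccess etc. included) are hypothetical as used here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct PMC { int dummy; } PMC;

/* Toy foreign-access registry: a flat array of (handle, count) pairs. */
typedef struct { PMC *pmc; int count; } FAEntry;
static FAEntry fa_table[64];

void PL_registerForeignAccess(PMC *p) {
    for (int i = 0; i < 64; i++)
        if (fa_table[i].pmc == p) { fa_table[i].count++; return; }
    for (int i = 0; i < 64; i++)
        if (fa_table[i].pmc == NULL) {
            fa_table[i].pmc = p;
            fa_table[i].count = 1;
            return;
        }
}

/* When unregistered as many times as registered, the handle leaves the
   root set and its contents become reclaimable by the GC. */
void PL_unregisterForeignAccess(PMC *p) {
    for (int i = 0; i < 64; i++)
        if (fa_table[i].pmc == p) {
            if (--fa_table[i].count == 0)
                fa_table[i].pmc = NULL;
            return;
        }
}

/* What the marker would consult: is this handle a foreign root? */
int PL_isForeignRoot(PMC *p) {
    for (int i = 0; i < 64; i++)
        if (fa_table[i].pmc == p) return 1;
    return 0;
}
```

The "unregister == free" gotcha falls out naturally: once the count hits zero, the GC may reclaim the handle on its next pass, so the XS code must not touch it afterward.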
Re: memory assumptions
On Tuesday 13 November 2001 01:20 pm, Jason Gloudon wrote: On Mon, Nov 12, 2001 at 11:59:08PM -0500, Michael L Maraist wrote: 2) Can we assume that a buffer object is ONLY accessible by a single reverse-path-to-PMC? PMCs or array buffers can point to other PMCs, so it's possible to have multiple paths from the root to an object, but I'm asking if, for example, we could use an algorithm that relied upon the fact that it only has one direct parent. That assumption would mean one could not directly share non-constant buffers between strings. There has been talk of having copy-on-write strings. That's fine. I didn't really have strings in mind, and it's possible to treat them separately. We're already having to treat handles differently than buffer objects. I'm also wanting to segregate lists of leaf buffers, arrays, hashes, etc., so as to avoid putting switch statements in the inner loop of the marker. Since each type would have a different inner loop, strings are confined to the same algorithms. So, any other ideas related specifically to PMCs? One interesting thing to note is that strings are leaf objects. Further, copy-on-write would only be efficient if set for values that have more than one parent (otherwise each modification would require a copy). I haven't seen the discussion, so I don't know where it'll wind up, but if this is the case, then it seems to me that the only way to determine if copy-on-write should be applied (in an efficient manner) is via some variation of reference counting. Even if only 1-bit ref-counting is used.
Such as:

    // In pseudo-code
    STR_REG[x] = newString("abc");   // f_copy_on_write = 0
    STR_REG[x] .= newString("foo");  // check f_c_o_w; it was 0, so we modify the string (possibly resizing)
    STR_REG[x] = PL_null_str;        // "abc" eventually reclaimed by GC
    STR_REG[x] = newString("abc");   // f_c_o_w = 0
    STR_REG[y] = STR_REG[x];         // f_c_o_w = 1
    STR_REG[x] .= newString("bar");  // check f_c_o_w; it was 1, so we copy out

At this point, a GC pass would count the instances of "abc" and notice that it only has one handle; its f_c_o_w would be reset to 0. Even if the GC didn't perform this last stage, the system would work; it would just cause a greater percentage of multi-ref'd garbage and copying. Additionally, note that such a pass would imply a pre-GC stage which resets f_c_o_w's to zero. Setting f_c_o_w to 1 all the time is slightly faster than performing an increment. Further, separate string vtables could be utilized to avoid the if-statements (being an RO string instead of an RW string). Lastly, the overhead of a bit might as well be a byte, and the garbage reduction from actually performing full reference counting (albeit with a GC fallback) might make this worthwhile. Full ref-counting on strings would be harder to work with for XS code, though. -Michael
Re: memory assumptions
On Tuesday 13 November 2001 03:57 pm, James Mastros wrote: Unfortunately, this would require hooks into the mem-mngr module, which violates its black-box nature. Not to mention the points you make. I don't really see this as onerous fiddling with the memory manager; we want to have hooks in to modify the timing of the GC anyway. All we have to do is wrap a refcount around them (and I'm not sure about that; can you really be in more than one XS at once?) and call them in the appropriate places. That isn't fiddling with the internals of the memory manager; its timing is defined to be an external (by us). I don't mean that we won't have API hooks into the mem-manager (sorry, I worded it poorly); I mean that the more visible the guts of the mem-manager are, the less likely it can be replaced by a better / more suitable algorithm, possibly in embedded situations. Extending mem-mngr data structures (like the linked list) into the definition of a PMC is what I'm worried about; not exposing a function to say GC-me-more-often. (I apologize for the LISP mnemonic :) -Michael
Re: memory assumptions
On Tuesday 13 November 2001 06:37 pm, James Mastros wrote: On Tue, Nov 13, 2001 at 06:14:30PM -0500, Michael L Maraist wrote: Extending mem-mngr data structures (like the linked list) into the definition of a PMC is what I'm worried about; not exposing a function to say GC-me-more-often. (I apologize for the LISP mnemonic :) Hm. I think that the problem is that the definition of a PMC is, I think, what we say it is in perlpmc.pod, and not what we say it is in pmc.h. This means that we need to recompile XSes to change PMCs, but I think that's possibly OK. If not, why do we need to keep the linked list inside the PMCs, instead of separately? Again, this is only speculation about possible algorithms. The only reason to expose linked-list structures is if a collection algorithm was used that assumed it contained a list of ALL PMCs. If newPMC() / newPMCext() were exclusively used, then this isn't a problem, but if PMCs could be externally allocated and then registered within foreign access, then there'd potentially be a data-type mismatch. Namely, using

    header = (gc_object_header *)((int *)pmc_ptr - 2);
    header->next = ...;
    header->prev = ...;

would be a failed assumption. I'm in favor of requiring newXX() since this is more portable between memory managers. It's equivalent to assuming that you could determine the size of a malloc'd object by doing:

    *((int *)ptr - 1)

simply because your current mem-mngr stores the value there. Obviously it's best to pass the pointer to the associated mem-mngr API and let it do things like fetching the size. Here the potential problem is static allocations generated by XS code. In theory our API would allow:

    // desirable API functions
    PMC_t *newPMC();     // assumes it'll be attached to the root set
    PMC_t *newPMCext();  // auto-attaches to the foreign-access portion of the root set
    void PL_registerForeignAccess(PMC_t *);

    // questionable API functions
    void PL_unregisterForeignAccess(PMC_t *);
    void PL_initPMC(PMC_t *);

And an XS coder could have:

    void fooExtension(..)
    {
        PMC_t my_pmc;
        PL_initPMC( &my_pmc );
        PL_registerForeignAccess( &my_pmc );
        ...
        // my_pmc falls out of scope.
        // Coder forgot to call PL_unregisterFA( &my_pmc ); memory leak.
    }

How PL_registerFA works isn't yet determined. If newPMCext() called PL_regFA(), then there'd have to be some additional data structure or allocated memory. But if newPMCext() simply used the GC-exclusive region of the linked list, then the PMC could be placed on the appropriate list, and thus avoid additional flags or memory allocations. Its state / type is implicit, based on the associated list, as far as GCing is concerned. It's just that this doesn't work with the above code. The GC-hidden data structure could also have a ref-count of the number of times it's been marked with PL_regFA; that way an initial reg-stack PMC could be grabbed by XS code and cached for what-ever evil purposes safely. Thus my current idea of PMC handles is:

    struct gc_handle_header_t {
        struct gc_handle_header_t *prev, *next;
        int fa_cnt;
        char data[0]; // gcc-specific code, for demonstration purposes
    };

Again, this is just the weeding-out stage. -Michael
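[Editor's sketch] The gc_handle_header_t idea above, fleshed out: newPMCext() hides the header behind the returned pointer and splices the handle into a ring the GC can walk, while fa_cnt implements the register/unregister count. The gcc-specific `char data[0]` is replaced with a C99 flexible array member; all names remain hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct PMC { int dummy; } PMC_t;

/* GC-private header prepended to every handle; XS code only ever sees
   the data[] portion.  Hypothetical layout. */
typedef struct gc_handle_header_t {
    struct gc_handle_header_t *prev, *next;
    int fa_cnt;                  /* foreign-access registration count */
    char data[];                 /* C99 flexible array member */
} gc_handle_header_t;

/* Doubly linked ring of foreign-access handles; the GC walks this. */
static gc_handle_header_t fa_list = { &fa_list, &fa_list, 0 };

#define HDR(p) \
    ((gc_handle_header_t *)((char *)(p) - offsetof(gc_handle_header_t, data)))

/* Handle pre-registered in the foreign-access part of the root set. */
PMC_t *newPMCext(void) {
    gc_handle_header_t *h = malloc(sizeof(*h) + sizeof(PMC_t));
    h->fa_cnt = 1;
    h->next = fa_list.next;                 /* splice into the ring */
    h->prev = &fa_list;
    fa_list.next->prev = h;
    fa_list.next = h;
    return (PMC_t *)h->data;
}

void PL_registerForeignAccess(PMC_t *p) { HDR(p)->fa_cnt++; }

void PL_unregisterForeignAccess(PMC_t *p) {
    gc_handle_header_t *h = HDR(p);
    if (--h->fa_cnt == 0) {                 /* leaves the root set */
        h->prev->next = h->next;
        h->next->prev = h->prev;
        free(h);                            /* now reclaimable */
    }
}
```

Because the header is only ever reached via HDR() inside the manager, nothing outside the memory module depends on its layout, which preserves the black-box property argued for earlier in the thread.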
Re: Lexical implementation work
On Friday 09 November 2001 03:36 pm, Dan Sugalski wrote: Do we want non-PMC lexical support? Nope, I wasn't going to bother. All variables are PMCs. The int/string/num things are for internal speed hacks. But can't a compiler generate one of these internal hacks? My thoughts are that a given scope can not return internal representations (like ints / strings), and must encapsulate them inside scalars, BUT the compiler may optimize the inside of a scope. If no evals occur inside the block (rather common), then the lexicals aren't volatile. If no divides are ever performed on a scalar, nor is it stringified, then the compiler might be able to assume that it's a raw integer (whether it was declared "my $foo as int" or not). This also assumes that the would-be integer does not interact with any other scalars. I believe there are op-codes that assign the value of primitives into a scalar (there would have to be), so scalars and primitives can still interact. Since this is an optimization, I agree that its usefulness is limited. From the above, the only uses I can see for declaring "my $foo as int" are to set flags (or utilize different vtables) to enforce integerness, and to tell the optimizer that it's OK to use a primitive integer even if the block contained a divide. Note that: $foo = int($bar / $baz) should have the same effect. End result is that this is a problem that can be put off. -Michael
memory assumptions
Dan gave a list a bit ago about some minimal assumptions about how we'll deal with memory. I've been researching various memory management techniques, and I'm noticing that various optimizations can occur if certain assumptions are made. It's still early in the game, but I thought it might be helpful to see if we could identify any other truths that we can take advantage of in selecting an algorithm.

1) Are we allowing _any_ dynamic memory to be non-GC-managed? I assume that there are core parts / naive extensions that assume a malloc/free memory model. While we could always define free as an empty macro, I think it would be to our benefit to utilize regular memory managers where possible. Further, our memory format wouldn't necessarily be compatible with external libraries anyway.

2) Can we assume that a buffer object is ONLY accessible by a single reverse-path-to-PMC? PMCs or array buffers can point to other PMCs, so it's possible to have multiple paths from the root to an object, but I'm asking if, for example, we could use an algorithm that relied upon the fact that it only has one direct parent.

3) I believe Dan said that we _could_ assume that PMCs were unmovable, but are we requiring this? This is related to the next question.

4) Are _all_ handles (PMCs, Strings, or whatever) that serve as root-set elements dynamically allocated? Or can C code declare PMC structs instead of PMC-struct pointers? That Dan defined a foreign-access portion of the root set suggests to me that C code might acquire its own handles (though we can enforce an API for their allocation). This is important because we do have to perform garbage collection on the handles. If a register stack is decremented, for example, valid pointers to PMCs / strings would become garbage that requires some sort of collection. While this is easy enough to manage, a heterogeneous pool of handles would complicate things (and disqualify many desirable algorithms).
I'm evaluating algorithms for the handles which require prepending a linked list. It's easily hidden away and thus ignored by non-memory-manager algorithms IFF we require that all handles be allocated by the API. This will be especially important since root-set navigation in its raw form will be slow due to the complex nature of some data types. That's all for now. -Michael
Re: Rules for memory allocation and pointing
On Sunday 04 November 2001 02:39 pm, Dan Sugalski wrote: At 08:32 PM 11/4/2001 +0100, Benoit Cerrina wrote: There will be a mechanism to register PMCs with the interpreter to note they're pointed to by something that the interpreter can't reach. (For example, a structure in your extension code, or via a pointer stashed in the depths of a buffer object, or referenced by another interpreter.) This foreign-access registry is considered part of an interpreter's root set. If this is the case, how do you want to move the PMC? I thought you wanted a copying collector? While the PMC structures themselves don't move (no real need--they're of fixed size so you can't fragment your allocation pool, though it makes generational collection easier to some extent) the data pointed to by the PMC can. That's the bit that moves. Ok, so far, here's what I see: There are two main memory segments. A traditional alloc/free region. Within this area are arenas (for fixed-sized memory objects). This region must be efficient under MT, and hopefully not too wasteful or fragmenting. This region is mostly for core operation and the non-arena allocations are to be minimized; memory leaks are critical as usual. The utilization patterns should be analyzable, well known and compensated for (with respect to fragmentation, thread contention, etc). My vmem derivative might still be valuable here, but I suppose we need to let the core's needs/characteristics flesh out further. Then there's a GC region which is primarily the allocation space of the interpreted app, which obviously can't be trusted with respect to memory leaks or usage (fragmentation) patterns. PMCs and strings are handles into this GC region, though the handles are to be stored in core memory (above). Question: Are there other types of references? I can't think of any.
The GC algorithm (being a separate thread or called incrementally when memory is low (but not yet starved)) needs to quickly access this GC heap's values, which I believe is Dan's reason for requiring a maximum of two levels of indirection. I suppose it just runs down the known PMC lists including the PMC and string register sets, the stacks and the stashes for each interpreter. You do, however, refer to a foreign-access region, and my first impression is that it's the wrong way of going about it. First of all, how are arrays of arrays of arrays handled?

    // create a buffer object of PMCs
    PMC_t p1 = new Array(50)
    for i in 0 .. 49:
        p1->data[ i ] = new Array(50)

In order to comply with the max depth, the array creation will have to register each sub-array entry in the foreign-access region, or am I missing something? First of all, this will mean that the foreign-access data structure will grow VERY large when PMC arrays / hashes are prevalent. What's worse, this data structure is stored within the core, which means that there is additional burden on core-memory fragmentation / contention. Additionally, what happens when an array is shared by two threads (and thus two interpreters)? Whose foreign-access region is it stored in? My guess is that to avoid premature freeing, BOTH. So now a work-queue used by a 30-thread producer/consumer app is allocating and freeing LOTS of core memory on each enqueue / dispatch. Again, with the details this fuzzy, I'm probably going off half-cocked; but I did qualify that this was my initial impression. My suggestion is to not use a foreign-references section; or if we do, not utilize it for deep data-structure nesting. And instead incorporate a doubly linked list w/in PMCs and strings... Thus whenever you allocate a PMC or string, you attach it to the chain of allocated handles. Whenever the PMC is freed, you detach it. The GC then has the laughably simple task of navigating this linked list, which spans all threads.
This can incorporate mark-and-sweep or copying or what-ever. By adding 8 or 16 bytes to the size of a PMC / string, you avoid many memory-related problems. Not to mention the fact that we are free of the concern of depth away from the interpreter root. Beyond this, I think I see some problems with not having PMCs relocatable. While compacting the object stores that are readily resized can be very valuable, the only type of memory leak this avoids is fragmentation-related. The PMCs themselves still need to be tested against memory leaks. Now I'm still in favor of some form of reference counting; I think that in the most common case, only one data structure will reference a PMC, and thus when it goes away, it should immediately cause the deallocation of the associated object space (sacrificing a pittance of run-time CPU so that the GC and free memory are relaxed). But I hear that we're not relying on an integer for reference counting (as with perl5), and instead are mostly dependent on the GC. Well, if we use a copying GC, but never move the PMC, then how are we
Re: Rounding?
On Sunday 04 November 2001 10:59 pm, Zach Lipton wrote: I'm working on learning some parrot asm, but if I write something like this:

    set N0, 2
    set N1, 2
    add N3, N0, N1
    print N3

I get: 4.00 Is there any way to round this, or at least chop the 0's off the end? Since print is for debugging purposes, that's doubtful. You could create a throw-away patch that adds:

    printf s, i|n|s

-Michael
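[Editor's note] The "4.00" comes from printing the float register with a fixed-precision format. In C terms, the difference between the debugging print and the desired output is just %.2f versus %g (the function names below are illustrative, not Parrot code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a double the two ways a debugging `print` op might:
   fixed precision (what produces "4.00") versus shortest form. */
int fmt_fixed(char *out, size_t n, double v) {
    return snprintf(out, n, "%.2f", v);   /* always two decimals */
}
int fmt_shortest(char *out, size_t n, double v) {
    return snprintf(out, n, "%g", v);     /* trailing zeros dropped */
}
```

%g prints 4.0 as "4" and 2.5 as "2.5", which is the chop-the-zeros behavior the question asks for.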
Re: vmem memory manager
My initial impression is that a page should be 1024 bytes (though I'm also looking at 512 bytes to reduce the shotgun effect of 8-byte allocs). I've actually found that q-caches were detrimental to things like consolidation, and the calling depth (overhead) of a worst-case allocation. So I'm currently leaving them out. Instead I'm using a power-of-two based dq-cache (for "differential q-cache"). Forgot to include two other points on this topic. First, if q-caches were used, the slab header would require a whole page (so as to maintain alignment). With a q-max of size 4 at most and a slab object multiple of 4, we have 16 pages as the minimum allocation actually fulfilled by vmem. However, pages are of 1k alignment, which you'll note tells vmem that it should look in its segment hashtable. Thus a q-cache in this system could not simply use multiples of a page size. On the other hand, an alternative would be to subdivide pages into two regions instead of one. If it is assumed that accesses within the q-cache region will be rare (relegated mostly to spills into vmem by the dq-cache), then the q-cache could be considered a second-tier subdivision of a page. For example, if the page size was raised to 16K (must be a power of two for efficient page-alignment calculations), the dq-caches would still utilize a slab size of 1k and define dq's of sizes 8, 16, 24, ..., 384, but the q-cache would utilize a slab size of 16K (our new page size) and define q's of sizes 1, 2, 3, 4, 5. Note that dq's use powers of 2/3 to efficiently cover a broad range while q's use multiples so as to minimize slack space. Here, however, our q-cache allocations have an incredible slack space due to the forced subdivisions of 16, as the following table shows:

    page size = 16K
    q-cache-size : num-objects : header-overhead : slack-space : % overhead (including slack)
    1 : 15 : 1 : 0 :  6.66%
    2 :  7 : 1 : 1 : 14.2%
    3 :  5 : 1 : 0 :  6.66%
    4 :  3 : 1 : 3 : 33%
    5 :  3 : 1 : 0 :  6.66%

There are several trade-offs.
The larger the slab size, the less max overhead for any q-cache (here we have 33% max overhead for optimal alignments; in the worst case, an allocation of size 3.01k, for example, would have 77% slack). However, an allocation of size 6k would still require a full 16K allocation by vmem, since we want to minimize fragmentation. That's an overhead of 166% (wasting 10k of memory). If we wanted to reduce both the percent and max physically wasted memory, we could run the q-cache all the way up until size == 1 (at which point the slab is of little or no use):

    page size = 8K
    1 : 7 : 1 : 0 : 12.5%
    2 : 3 : 1 : 1 : 25%
    3 : 2 : 1 : 1 : 25%

We round sizes 4, 5, 6 and 7 up to an allocation of size 8. Thus the max overhead is rounding 3.1k up to 8k = 166%, which wastes over 4k of physical memory. Note also that we incur at least 12.5% overhead. It's debatable whether 8K or 16K is better. Both have worst cases of 166%, but the 16K wastes more total physical memory. Compare this to the following table of power-of-two allocations for dq-caches:

    virtual page size = 1k, slab header size approx. 28 bytes
      8 : 124 : 4 :   4 :  3.2%
     12 :  83 : 2 :   0 :  2.811%
     16 :  62 : 2 :   4 :  3.2%
     24 :  41 : 1 :  12 :  4%
     32 :  31 : 1 :   4 :  3.2%
     48 :  20 : 1 :  36 :  6.6%
     64 :  15 : 1 :  36 :  6.6%
     96 :  10 : 0 :  64 :  6.6%
    128 :   7 : 1 : 100 : 14.2%
    192 :   5 : 0 :  36 :  6.6%
    256 :   3 : 1 : 228 : 33%
    384 :   2 : 0 : 228 : 33%

BUT, these first-tier dq-caches all need 1k allocations from vmem, which passes the request off to the q-caches, which incur an additional 12.5% overhead for 8K pages or 6.66% overhead for 16K pages. Additionally, note that the overhead can be reduced somewhat if we could use an external header for slabs of object-size = .5 vpage. But unfortunately, the first object of such a slab falls on a page boundary, and thus will be inappropriately interpreted by vmem_size and vmem_free. Note, once again, that explicit object caches do not use vmem directly and are thus shielded from this page / vpage limitation.
This only applies to the internal dq-cache and the debated q-cache, not the explicitly allocated object caches. More important than slack space is the shotgun effect, which is exacerbated since we now have 7 layers in which a freed memory object may reside:

    8B dq-cache magazine (1 of 20; assuming max-stack-size = 5, data/prev_data are both full, and there are two threads)
    8B dq-cache depot (1 of 50; assuming max-stacks = 5)
    8B slab (1 of 124)
    1024B q-cache magazine (1 of 8; assuming max-stack-size = 2, data/prev_data are both full and there are two threads)
    1024B q-cache depot (1 of 50)
    1024B slab (1 of 7)
    vmem segment (1 of 255) (segment size = 255 * 8 * 1k vpage = 2 Meg)

That's 347 million non-vmem hiding places for a freed object. A reclaim currently is only slated to free depots, but that reduces us to 3.47 million hiding places. If we do away with the second tier (and the vpage concept), we have
cvs tags
I just noticed that we're not using any tags in the CVS tree. Do you think that we're at a state where we could start applying labels to our code? I am considering two types of labels: one for the periodic snapshots (so as to back-trace features in those files), and one for successful smoke-tests. Thus if something is broken, developers can pull down the last known successful smoke-test and build up from there. I'm assuming of course that there isn't a more sophisticated behind-the-scenes setup (like Perforce). -Michael
vmem memory manager
This references the previously posted (by someone else) article on SUN's libumem memory architecture. If needed, I can run through the archives and find the URL. I've already written a threaded implementation which has decent responsiveness; it did as much as the SUN paper claimed it would. Recently, however, I redesigned the architecture from scratch, taking advantage of many possible optimizations. Since it's towards the end of the day, I figured I'd document my architecture here and wait for comments before beginning the rewrite / re-debug. As a refresher, SUN's design is based on 4 layers: vmem, slab, depot and magazine-cache. At the bottom is vmem, which maps pages of memory (in libumem, new pages are acquired from mmap; excess pages are readily freed via munmap). The next higher layer is the slab, which partitions a contiguous collection of pages into an array of raw memory objects. (Objects are fixed-size pieces of memory; thus each object size requires a separate slab.) The slab requests n pages from vmem to construct a new slab when it runs out of memory objects. It also frees slabs to vmem when all the individual objects have been reclaimed. Both of these occur automatically w/o external intervention. The next higher layer is the depot, which stores constructed objects, e.g. objects that are initialized and potentially contain references to other memory objects. When memory requests are made and no free objects exist, requests are passed down to the associated slab (of corresponding size) and a constructor is applied. This happens automatically. Frees to the slab are asynchronous; they occur when the entire system is memory-starved, or when a depot has too many free objects. In any case, prior to a free to the slab, the destructor is applied to each freed memory object. The highest level is the magazine-cache. The sole function of the cache is to allow multiple threads to have thread-local access to the depot.
It does this by caching up to 2*M objects in a thread-local magazine structure. On an alloc, when the magazine is empty, thread-synchronized access is performed to the underlying depot / slab / vmem layers. On a free, when the magazine is full (contains 2*M objects), thread-synchronized access to lower layers is performed. Most of the time there is a mixture of allocs and frees, which avoids lower-level accesses (and thus mutex locks). Aside from alleviating sync-contention, this minimizes the sharing of memory objects between threads, and on multi-CPU systems it avoids cache pollution (due to setting a cache line's MESI flag to S). So far, nothing revolutionary, albeit clever on SUN's part. The next attribute, however, is the use of external boundary tags. The idea is that the allocated memory should never be written to (SUN's design goals included the use of read-only allocations). Further, these external tags don't have to be contiguous. Runs of contiguous memory regions are collected together in a segment. Segments are allocation regions (such as via mmap). Within a segment, consolidation may occur. When fully consolidated, the segment is freed. Hashtables are used to associate a user's memory pointer with these external boundary tags. From this, SUN claims a constant (bounded) alloc/free time no matter what is going on. From my analysis, the only possible wrinkle in this constant time is in the reclaiming stage (which makes n reclaims, where n is the number of depots, a relatively low and constant number). At run-time, object-caches can be defined for arbitrarily sized objects. Constructors and destructors can optionally be configured. Such object references use the magazine-layer interface. For generic memory access, vmem is the API, but small memory requests are redirected to one of several predefined quantum caches (of sizes 1 page, 2 pages, ... q_max pages), which alleviates lock-contention on vmem when used generically.
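The magazine layer's alloc path, as I read the paper, can be sketched like this. The names (mag_alloc, depot_get, the cache_t fields) are mine, not SUN's API; depot_get stands in for the mutex-protected lower layers.

```c
#include <assert.h>

/* Hypothetical per-thread magazine cache: up to M objects per magazine,
   two magazines (loaded + previous) per thread.  Only when both are
   empty do we fall through to the depot -- the path that needs a lock. */
#define M 4

typedef struct magazine {
    void *objs[M];
    int   rounds;               /* number of full slots */
} magazine_t;

typedef struct cache {
    magazine_t *loaded;         /* current magazine            */
    magazine_t *previous;       /* backup magazine             */
    int         depot_hits;     /* synchronized fallbacks (for the test) */
} cache_t;

static void *depot_get(cache_t *c) {   /* stand-in for the locked layers */
    static char pool[64][8];
    static int next = 0;
    c->depot_hits++;
    return pool[next++ % 64];
}

static void *mag_alloc(cache_t *c) {
    if (c->loaded->rounds > 0)         /* fast, lock-free path */
        return c->loaded->objs[--c->loaded->rounds];
    if (c->previous->rounds > 0) {     /* swap loaded <-> previous */
        magazine_t *t = c->loaded;
        c->loaded = c->previous;
        c->previous = t;
        return c->loaded->objs[--c->loaded->rounds];
    }
    return depot_get(c);               /* slow, mutex-protected path */
}
```

The free path is symmetric: push into the loaded magazine, swap, and only return objects to the depot when both magazines are full.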
SUN addressed the memory fragmentation problem by placing a lower bound on a memory request size. Slabs allocate at least 3 times the object size; in most cases, a slab is 16 - 32 times the object size. Further, with q-caches, all allocations up to q-max have slabs of the same size. Thus slabs used by 8- and 16-byte object allocations can be trivially intermixed. The only danger is if you never free all the objects w/in each slab. I call this a shot-gun effect, since you have little bits throughout memory space, preventing the freeing of slabs, and thus preventing the interchange of slabs when needed. My design is as follows: sbrk is completely avoided, which leaves open the possibility of having two separate memory managers. The following functions are public:

  void* vm_alloc(size)
  void  vm_free(void*)
  void* vm_realloc(void*, size)
  int   vm_sizeof(void*)
        vm_dump_stats()

  // automatically allocates a slab (reuses one if possible)
  depot_t* depot_create(
Re: Improved storage-to-storage architecture performance
Absolutely. Compile-and-go is an absolute must, and one that'll end up disabling most of the potential optimizations for sheer startup-time reasons. Which is a pity, but we can always let people enable them if they want from the command line.

Or via use flags, since mod_perl and the like might like to see optimizations, even though they'll not have access to argv. Of course, mod_perl is a totally separate executable. -Michael
Re: Improved storage-to-storage architecture performance
On Tuesday 30 October 2001 01:47 am, Ken Fox wrote: Uri Guttman wrote: that is good. i wasn't disagreeing with your alternative architecture. i was just making sure that the priority was execution over compilation speed. I use a snazzy quintuple-pass object-oriented assembler written in equal parts spit and string (with a little RecDescent thrown in for good measure). A real speed demon it is... ;) The real motivation of my work is to see if a storage-to-storage machine ends up using cache better and with less compiler effort than a register machine. When I read about CRISP, the first thing that came to mind was that the top-of-stack register file could be simulated exactly with high-speed cache in a software VM. Dropping the stack-machine instructions in favor of Parrot's 3-operand ones made it sound even better.

This raises an interesting question. PMCs are mighty beasts and can do just about anything: mutate, store into stashes, perform queries of their datatype, have embedded lock variables and even embedded garbage-collecting info. They provide for safe thread-sharing, auto-variables, bla bla bla. The only memory storage for scalars that I am currently conceiving of is in name-space stashes (globals). Thus our most likely implementation of S2S would be to have 'add g_x, g_y, g_z', which performs two stash lookups, an add, then one stash write. I can't imagine that this would be faster than using a load/store architecture with respect to stash accesses; the overhead of the multi-level hash-table lookups is much greater than that of the op-code. What I think I hear you saying is that S2S would be storing in some sort of memory frame outside of the register set. But this adds a bunch of complexity. How does the GC sweep across a huge VM heap (as opposed to walking the register / stash maps)? Further, we'd have to have a notion of the relative addresses of these objects (aren't they relocatable?).
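The cost argument can be made concrete with a toy sketch; the stash here is a deliberately naive stand-in (a probed name table, not Parrot's real hash), and all the names are mine.

```c
#include <string.h>
#include <assert.h>

/* Toy illustration of the S2S cost argument: 'add g_x, g_y, g_z' must
   walk the symbol table three times per op, while the register form is
   a single array-indexed add once the values are loaded. */
typedef struct { const char *name; int val; } stash_entry;

static stash_entry stash[8];

static int *stash_slot(const char *name) {
    int i;
    for (i = 0; i < 8; i++)             /* stand-in for a hash lookup */
        if (stash[i].name && strcmp(stash[i].name, name) == 0)
            return &stash[i].val;
    for (i = 0; i < 8; i++)             /* autovivify an empty slot   */
        if (!stash[i].name) { stash[i].name = name; return &stash[i].val; }
    return 0;
}

/* add g_x, g_y, g_z -- three symbol-table walks per op */
static void s2s_add(const char *dst, const char *a, const char *b) {
    *stash_slot(dst) = *stash_slot(a) + *stash_slot(b);
}

/* add I1, I2, I3 -- one array op */
static int int_reg[32];
static void reg_add(int d, int a, int b) { int_reg[d] = int_reg[a] + int_reg[b]; }
```

Even with this flat table the S2S op does three scans; with a real multi-level stash lookup the gap only widens.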
Assuming the exclusive use of ld/st stash access for globals (which, due to possible stash-renaming/remapping, I think is a fair assumption), the only reason for a scalar scratchboard would be for lexical locals. But that could be addressed via an indexible stack, which is probably most efficient with a load/store architecture anyway (due to the multitude of stacks). I spoke exclusively on scalars (PMCs) because I had an interesting time considering static datatypes. It's obvious what happens when we declare:

for my int $idx ( 0 .. 100 ) { }

We get an integer register variable which maps directly to add / extraction calls. But what about:

our int $g_idx;

What's its default value? Zero? Where does it live? If you allow the storage of pure integers in a stash, then you'll have to have a stash structure like this:

struct stash_t {
    PMC sv_val;
    PMC av_val;
    PMC hv_val;
    PMC cv_val;
    PMC io_val;
    ...
    IV i_val;
    NV n_val;
    SV str_val;
};

But then when you try and retrieve a global stash symbol, how do you distinguish between i_val and sv_val? Or even str_val from sv_val? If perl6 is pre-compiled with access to all module code (not likely, due to late binding), then it can be optimized to make special access requests (knowing the prototype of $i), but this doesn't allow for dynamic modification of the stash-space. We are, after all, a high-level language, not Java. Further, this would be completely unused in perl5 execution. Best I can figure is that we can't store non-PMCs in stashes. Which also means that we don't store non-PMCs anywhere except the stacks; they're thus purely for lexical values. Anyone have a vision that conflicts with this? Back to the original discussion: I'm thus not seeing a benefit for S2S here. -Michael
Re: Parameter passing conventions
A split between local, marginal, and global registers would be an interesting thing to do, and I can see it making the code more elegant. I worry about it making things more complex, though, especially with us already having multiple register types. (We'd double or triple the number of register types, essentially, and to some extent blow cache even more than we do now. Might be a win in other ways, though. I'll have to ponder a bit.)

Yeah, I didn't like the idea of proliferating that more either. I still sometimes dream about a single register file of N regs into which we can put whatever we want. Each block of registers has room for the reg contents and the type info too. Seems you've got some of the support for that figured out in the stack already. Just declare that either (a) it is illegal (or behavior undefined) to do

set $2, 5
set $3, "foo bar"
add $1, $2, $3

[just because we have higher-level data types than a real machine doesn't mean we can't still have general-purpose registers, I think] or (b) that if you do something numeric with a register that is non-numeric, type mucking happens behind the scenes and throws an exception if there is a problem. Certainly this wouldn't be surprising to anyone who had been looking at what we do with PMCs and arithmetic ops. If we ever did move to such a single-register-file model, I'd support looking seriously at the calling conventions of MMIX to see if we can get the appropriate performance characteristics. And, BTW, we have 4*32 = 128 regs now. We could even match the logical register count of MMIX (256) with only a doubling of total register count. And, if we ever determined we needed another kind of register (such as one that can be used for address arithmetic, since INTVAL doesn't cut it), we wouldn't have to add a fifth file, we'd just add another type (thinking again about the stack implementation).

After reading the entire MMIX chapter, my mind went back and forth.
First of all, the only reason that 256 registers were used was the byte-aligned register arguments to the op-codes (plus the modern intuition that more registers = faster execution). Currently we have 4-byte arguments and don't perform boundary-condition checks, so this is neither here nor there. Currently we translate P1 to:

interpreter->num_reg->registers[ cur_opcode[1] ]

which requires 4 indirections. (Multiple Px instances should be optimized by gcc so that subsequent accesses only require 2 indirections.) To avoid core-dumping on invalid arguments, we could up the reg-set to 256 and convert the above to:

interpreter->num_reg->registers[ (unsigned char)cur_opcode[1] ]

and adjust the assembler accordingly. Alternatively, to work with 32 regs, we'd have:

interpreter->num_reg->registers[ cur_opcode[1] & 0x001F ]

As for MMIX: I don't see a need for globals, since we're going to have various global symbol stashes available to us. Further, I don't see value in providing special trapping code to return zero when reading from a marginal register, or extending the local variable space when writing. For writing, an explicit reserve (once (or less) per function call) shouldn't be too much bother. And if the code is silly enough to write or read from this marginal region, then we'll pretend that they're using uninitialized values. Further, the reserver must require that there is enough space to fully utilize the n-register set (currently 32), so that set $r31, 5 doesn't spill into the tail of the register set (since that was previously handled by the trapping code). This modifies the MMIX spec such that we potentially waste up to 31 register slots in the register window (which is trivial when we have a window of size = 1024). The rolling register set can be accomplished via three methods. First, realloc the register stack every time an extend exceeds the size. (This sucks for recursive functions.) Second, use paged register sets and copy values during partial spillover (very complex).
Lastly, utilize a [pow2] modulus on a fixed-size register stack. This has a very interesting implication: we completely do away with the (push|pop)[ipsn] ops and their associated data-structures. Thus the P1 translation becomes:

// (for a 1K rolling stack)
#define STACK_MASK 0x03FF
interpreter->num_reg[ ( interp->num_offset + cur_opcode[1] ) & STACK_MASK ]

This has 4 indirections and two integer ops. As above, for multiple uses of Px in an op-code, this should be optimized to 2 indirections. Since indirections are significantly slower than bitwise logical operations, this should be roughly equivalent in speed to our current interpreter. If we were hell-bent on speed, we could utilize STACK_SIZE-aligned memory regions and perform direct memory arithmetic, as with:

interp->x_reg_base = [ ... ] // 1K chunk of memory aligned to 1K boundary
interp->x_reg = interp->x_reg_base + interp->x_offset
#define P1 *(int*)(((int)( interp->x_reg
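The masked ring-buffer approach can be sketched as follows; the struct and field names are illustrative, not Parrot's.

```c
#include <assert.h>

/* Register window as a power-of-2 ring: a call advances num_offset
   instead of pushing a whole register frame, and every register index
   is wrapped with a mask.  Wrap-around silently reuses the oldest
   slots, which is why the "reserver" must guarantee free space. */
#define STACK_SIZE 1024            /* must be a power of two */
#define STACK_MASK (STACK_SIZE - 1)

typedef struct interp {
    int      num_reg[STACK_SIZE];
    unsigned num_offset;           /* base of the current window */
} interp_t;

/* plays the role of the Px macro: register n of the current window */
static int *reg(interp_t *i, unsigned n) {
    return &i->num_reg[(i->num_offset + n) & STACK_MASK];
}
```

Because the mask is a compile-time constant, the per-access cost over the current scheme is one add and one AND, which is the trade-off argued above.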
Re: Are threads what we really want ???
On Wednesday 24 October 2001 08:28 pm, you wrote: I've just spent some time looking through the various RFCs for Perl 6. IMHO, the various proposals dealing with multi-threading, synchronization and RPC can be solved in a different way. Instead of thinking about multiple threads, one could think about multiple execution contexts. Each instance of an object must belong to one and only one execution context. Each execution context has an attached security context and a security manager. When making a call from an object in one execution context to a method of an object in another execution context, the security context of the caller is passed to the receiver and validated by its security manager before being allowed to execute. Asynchronous execution could be implemented using oneway functions and future variables, and a combination of both. Future variables would allow for returning results from oneway functions, and for turning synchronous functions into oneway functions. Deferred methods would also be very nice. A deferred method is something that would be executed at some later time, when the current execution context is idle. If the following keywords were added to Perl:

1. abandon
2. asynch[ronous]
3. at
4. cancel
5. deferred
6. future
7. oneway

[snip] I'm well aware that most of this functionality can be achieved by other means - but I think a language- and runtime-level solution would be most elegant and portable. My choice of syntax and keywords are for illustration purposes only.

Well, from the sideline, I have a couple of comments. First, parrot is supposed to be language agnostic, so such an implementation would go unused in python / perl5 lingo. Perhaps there could be support modules bolted on, who knows. However, this sort of RFC is probably a bit too late (ironically) to make any headway, given that we're in the post-RFC stage. As for the technical feasibility, I'm not sure what direction you're trying to take.
You speak as if this were an alternative to a threading model, but you don't talk about fake threading or even dispatch-driven code (which I'm sure is already ruled out in the interest of main-stream performance). Thus I'm not quite sure what your vision of the implementation is. The pseudo-perl code is really only syntactic sugar. Again, parrot doesn't care what perl6 does, only what perl6 needs at a fundamental level. If you want to try again in terms of op-codes, then you might make some more headway. In general, however, IPC is not an issue parrot is concerned with. As far as I understand, it won't even be part of the core, but will instead live in dynamically loaded system modules. I believe the real push for threading is less about IPC (or ITC as the case may be) and more about providing state-of-the-art programming capability in a reliable (and performance-minded) manner. Apache 2.0 will be MT, and perl5 is not, thus there's going to be a rift if at least an MT mod_perl can't be relied upon. The MT has to co-exist with C libraries (a la DBD drivers), and it seems it will be very close to a POSIX variant (with locking on marginally shared structures). I think a litmus test is going to be MT DBD support. (Currently many CPAN modules refuse to compile alongside perl5.005 threads.) Nonetheless, separate execution contexts are exactly what seems to be going forward. Each thread will have (at least) one parrot interpreter datastructure. I'm assuming that this implies completely separate name-spaces. So long as there are well-thought-out interfaces between the threads, any such IPC that you refer to can be implemented on top. In fairness, though, I've been partial to the QNX real-time OS, which utilized at its lowest level a message-passing architecture (not too dissimilar in fundamental concept from what you're suggesting). The advantage was that these messages could be relayed/tunneled anywhere (including other processes or other networks).
But the messages were hidden behind system calls (as with TCP/IP based messages), so performance wasn't hindered too badly. That said, I'm not seeing an advantage that outweighs the specialization complexity in implementing what I do understand of your RFC. -Michael
Re: memory allocation paper (Courtesy of Alan)
Uri Guttman wrote: DS == Dan Sugalski [EMAIL PROTECTED] writes: off list DS Ask has found us a spot for the paper Alan was speaking of. DS http://dev.perl.org/perl6/talks/ very impressive paper. my recent proposal for a multi-queue malloc system has some of the ideas of this but nowhere nearly as fully developed. if we can drop in the libumem implementation directly, i think it will be a major win. it doesn't look too difficult for us to write our own version (with some parrot-centric features maybe). the thread (vs. cpu) support looks very good too. i like how they get constant or linear time results. uri

I was fascinated by the concepts and implications, so I wrote several test utilities on a 466MHz dual-processor Celeron running Linux RedHat 7.1. The paper shows a benchmark, based on a 10-CPU Solaris machine, that is a large loop of allocs and frees (of an unspecified size, assumed to be 4 bytes). The paper shows single-threaded throughput of 0.5 Mega-cycles/second growing to an upper bound of 3 Mc/second. I ran several combinations of a simple loop of:

thr_main:
  for i .. size:
    ptr = alloc(4)
    *ptr = 1
    free(ptr)

main:
  for i .. num_threads:
    pthread_create( thr_main )

The first thing to note is that using glibc's non-MT malloc, I achieved 2.67 Mega-cycles/second. Using the perl5 memory allocator (non-MT), I achieved 8.4 Mega-cycles/second. In both cases, using pthreads (via uncommenting thread-specific code and linking -lpthread), the throughput fell to 0.978M/second (for both perl5 and glibc). Perhaps I'm not successfully mapping alloc to perl - I tried various alternative names for the allocator (such as New, Perl_xxx, etc). I obviously didn't have libumem to play with, but I did want to see what the lower bound was for the various operations. I found some strange results. First I wrote an unlocked void* MyAlloc(int), void MyFree(void*) pair which simply used a stack for allocation.
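For reference, the loop above looks roughly like this in C (iteration counts shrunk here; the actual runs used millions of cycles, and the timing harness is omitted):

```c
#include <pthread.h>
#include <stdlib.h>
#include <assert.h>

/* Each thread hammers the allocator with tiny alloc/write/free cycles,
   which is the access pattern the libumem paper benchmarks. */
#define ITERS 100000L
#define MAX_THREADS 16

static void *thr_main(void *arg) {
    long *done = arg;
    long i;
    for (i = 0; i < ITERS; i++) {
        int *ptr = malloc(4);
        *ptr = 1;
        free(ptr);
    }
    *done = ITERS;
    return 0;
}

static long run_bench(int num_threads) {
    pthread_t tids[MAX_THREADS];
    long counts[MAX_THREADS] = {0};
    long total = 0;
    int i;
    for (i = 0; i < num_threads; i++)
        pthread_create(&tids[i], 0, thr_main, &counts[i]);
    for (i = 0; i < num_threads; i++) {
        pthread_join(tids[i], 0);
        total += counts[i];
    }
    return total;
}
```

Wrapping run_bench in gettimeofday calls and dividing total cycles by elapsed seconds gives the Mc/s figures quoted below.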
It wasn't a valid memory allocator, since it assumes that you free in reverse order, but it sufficed. It achieved 20 million cycles/second. Next I added mutexes, which slowed the single thread down to 1.3 Mega-cycles/second. The next thing I did was to split resources into separate arrays which are accessed via a hashing of the thread-id. This required (int)pthread_self(). The existing implementation is that pthread_t is just an int, but it doesn't give me warm fuzzies doing this (how portable could that be?). Strangely enough, Linux's thread-ids are wildly jumping numbers (starting at over a thousand, with at least a thousand numeric values separating each). The hash was a simple masking of the rightmost several bits (3 in my test case, since I wasn't going to test more than 10 simultaneous threads). Very quickly I found cache-line contention to be a problem (just as the paper suggested). Since I was using separate arrays of integers, the cache-lines were hopelessly shared. My interim solution was to multiply the thread-id by some value (8 or 16 work well), at the expense of unused array elements. The following are some benchmarks (that I can still read from my scratch-sheets):

Num-threads | multiplier = 16 (pad) | multiplier = 2 | mutexes, multiplier = 16
1           | 1.52s                 | 1.52s          | 8.05s
2           | 8.25s                 | 8.6s           | 16.14s
3           | 5.6s                  | 12.99s         | 16.18s
4           | 3.86s                 | 17.23s         | 21s
5           | 5.04s                 |                | 23.19s
6           | 6.7s                  |                |
7           | 6.64s                 |                |

One comment: the left-most data-set has some wild swings; the more common occurrence was something like:

1 | 1.52s | 6.4 Mc/s
2 | 8s    | 2.5 Mc/s
3 | 5s    | 6 Mc/s
4 | 5.5s  | 7.2 Mc/s
5 | 6s    | 8.3 Mc/s
6 | 6.5s  | ...

The above are off the top of my head, from the types of numbers I've seen after hundreds of tweaks and unrecorded runs. Two things seemed true. First, the more threads that were used, the more efficient it became. Secondly, the second thread always seemed to provide the slowest / least-efficient operation, without exception. I can't explain this.
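The padding trick is just index dilation; my scratch code is gone, but it amounted to something like this (names mine):

```c
#include <stddef.h>
#include <assert.h>

/* Per-thread counters packed tightly share cache lines, so two CPUs
   ping-pong the line (false sharing).  Multiplying the hashed thread-id
   by a pad factor puts each thread's slot on its own line, at the cost
   of unused elements.  PAD = 16 ints is 64 bytes on common 4-byte-int
   platforms, i.e. one cache line. */
#define THREADS 8
#define PAD     16

static int counters[THREADS * PAD];

/* tid_hash is e.g. (int)pthread_self() & 0x7 */
static int *slot(int tid_hash) {
    return &counters[tid_hash * PAD];
}
```

With PAD = 1 or 2, adjacent threads' slots land on the same line and contention returns, which matches the multiplier = 2 column above.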
It seemed impervious to cache-line-width padding. Note in the padding-times table that two threads took roughly the same time for both the 2x and 16x thread-id multipliers. With only a 1x or 2x multiplier, the run-time for 2 threads seems to fall in line with the other thread-values. The 1-thread operation is obviously faster because there is no cache-line contention. I'm curious to learn if this is a general problem with Linux, x86, the dual-processor Celerons, or something more subtle. At this point I was using a two-level memory manager. The first level was a thread-specific magazine allocator. In this primitive memory system, there was no need for data / previous as described in the SUN paper, and instead a single magazine array /
Re: [PATCH] print_s_v op (was: RE: variable number of arguments)
Gregor N. Purdy wrote: Michael -- I had more time to think about it, and I determined how a compute op-code could be efficient. [snip] You wicked, wicked person! :) I'd like to see some benchmarks on that one vs. the most efficient possible hand-coded separate ops for moderate to complex arithmetic... These sorts of ops should be possible without them being in the core once we have loadable oplibs, as long as the assembler knows how to lay them out... BTW, the ops should probably be:

* compute_i_s_v
* compute_i_sc_v
* compute_n_s_v
* compute_n_sc_v

/** FMT: compute v_arg_size, destination-register, fmt_const_string, ( list of reg-indexes ) The nargs operand is spliced in in place of the 'v' Regards,

I didn't have the var-args available, so I wrote a hack: functions with 4, 8 and 16 register-parameters (from the above) which were just aliases of the same function, which didn't care (because it was based on the format field). I wrote a dozen or so variations of the function 'calculate'. I determined after a lot of experimenting that there is not much value in providing a general stack-machine sub-op-code engine (with push, pop, dup, and swap). The performance trails off as the number of symbols increases (since 50% of the operations are push). My attempts at combining multiple chars in a switch statement had marginally positive effects in optimal conditions, but performed slower in sub-optimal conditions. I later wrote two variations. One was a semi-simple divide-and-conquer summer:

/* sum Pdest, arg-count, arg1, arg2, ...
*/
AUTO_OP sum_i_i_i_i_i_i {
    unsigned int cnt = P2 - 1;
    opcode_t* args_offset = cur_opcode + 3;
    int accum = INT_REG( *args_offset++ );
    while (cnt) {
        if ( cnt >= 8 ) {
            accum += INT_REG( *args_offset++ ) + INT_REG( *args_offset++ )
                   + INT_REG( *args_offset++ ) + INT_REG( *args_offset++ )
                   + INT_REG( *args_offset++ ) + INT_REG( *args_offset++ )
                   + INT_REG( *args_offset++ ) + INT_REG( *args_offset++ );
            cnt -= 8;
        } else if ( cnt >= 4 ) {
            accum += INT_REG( *args_offset++ ) + INT_REG( *args_offset++ )
                   + INT_REG( *args_offset++ ) + INT_REG( *args_offset++ );
            cnt -= 4;
        } else if ( cnt >= 2 ) {
            accum += INT_REG( *args_offset++ ) + INT_REG( *args_offset++ );
            cnt -= 2;
        } else {
            accum += INT_REG( *args_offset++ );
            cnt--;
        }
    }
    INT_REG( P1 ) = accum;
}

This performed the best in all tests. Note that I only went down to n=4; obviously there would be a performance hit for anything less. Unfortunately, without a format it's impossible to mix and match constants with registers, or even integers with numbers: you'd need a separate sum_int and sum_num. The other function is a general accumulator-based function which uses a format to figure out not only the type of operation (add, sub, mul, div, mod, and, or, xor, land, lor, lxor, not), but also the operand format (int-reg, int-const, num-reg, num-const). By using a single character, performance is maintained (though readability suffers somewhat). What follows is mostly what I used for my tests.
AUTO_OP compute_i_i_s_i_i_i_i { // note: should be I_i_s_v
    STRING *s = Parrot_string_constants[P3];
    opcode_t* args_offset = cur_opcode + 4;
    char * ops = s->bufstart; /* do bounds checking */
    // probably not always what we want, but costs too much otherwise
    int accum = INT_REG( *args_offset++ );
    while (*ops) {
        switch ( *ops ) {
        case '+': accum += INT_REG( *args_offset++ ); break;
        case '-': accum -= INT_REG( *args_offset++ ); break;
        case '*': accum *= INT_REG( *args_offset++ ); break;
        case '/': accum /= INT_REG( *args_offset++ ); break;
        case '%': accum %= INT_REG( *args_offset++ ); break;
        case 'a': accum += *args_offset++; break;
        case 's': accum -= *args_offset++; break;
        case 'm': accum *= *args_offset++; break;
        case 'd': accum /= *args_offset++; break;
        case 'A': accum += NUM_REG( *args_offset++ ); break;
        case 'S': accum -= NUM_REG( *args_offset++ ); break;
        case 'M': accum *= NUM_REG( *args_offset++ ); break;
        case 'D': accum /= NUM_REG( *args_offset++ ); break;
        default:
            printf("Unsupported calculate-opcode %c. Exiting\n", *ops);
            exit(-1);
        }
        ops++;
    }
    INT_REG( P2 ) = accum;
}

A summary of the results for 6 million iterations of each of the following on a 466MHz Celeron on Linux:

                   for n = 4               | for n = 16
sum:               21 seconds (18% faster) | 23 seconds (78% faster)
format-calculated: 22s (9% faster)         | 39s (47% faster)
manual-adds:       23s                     | 63s
null-loop:         12s                     |
Re: Strings db
Wizard wrote: Michael Maraist [mailto:[EMAIL PROTECTED]] wrote: Does your idea allow for:

int msgid = txtToMsgid( "This feels strange\n" );
char *label = msgidToRes( msgid );

I'm not sure that I understand your question. This is not my idea, but GNU's gettext tools. I, myself, am not thrilled with this implementation, and would suggest using some other implementation (either our own or someone else's). Grant M.

You quoted something similar to my text above and said you didn't like it - I believe mostly because it involved reading external files, but also because of the concept of the message-id. My statement was that there are efficiency reasons to want to use a message-id, so it should be possible to have two APIs: one with the message-id and one that's direct (which is simply a wrapper around the former). I realize that you were merely indicating that we should be able to develop something; I was just commenting on performance. -Michael
Re: [PATCH] print_s_v op (was: RE: variable number of arguments)
Michael Maraist wrote: All -- I've created a varargs-ish example by making a new op, print_s_v. This is pretty rough, and I haven't updated the assembler, but it seems to work. With var-args, we could produce highly efficient SIMD-style instructions: printf obviously, multi-push/pop, createArrays/Hashes. I toyed with the idea of a stack-based ALU to handle algebra such as:

( ( a + b*c ) / d + e ) / ( f + g + h )

// compute dest, fmt, register-names
compute Rd, ++^*+/+, f, g, h, a, b, c, d, e

Unfortunately I couldn't find a way of guaranteeing that it's efficient. The best I've come up with so far is to use a mini-switch statement. The only reason it might be faster than an optimized switch-statement over all op-codes is that the switch has fewer terms. The above case also gets complex if we don't use multiple registers for intermediate calculations. Nonetheless, the possibilities are cool.

I had more time to think about it, and I determined how a compute op-code could be efficient.

Pros:
o First of all, there is no need to assign to temporary registers (as would be necessary for the compiler).
o Secondly, I'm under the impression that compilers convert algebraic expressions to reverse polish notation anyway (to reduce the temporaries); though parallel computing is an obvious exception (when multiple ops can be executing simultaneously). This theoretically simplifies the job of the compiler.
o Third, only a single register access is required for each parameter (whereas an expression might otherwise require multiple accesses).
o Fourth, it's possible to check for a subset of two or more adjacent operations, such as one which would quickly add four numbers with n-1 physical adds ('accumulator = a + b + c + d + e'), which can be optimized for parallel op-code execution within the physical CPU. Or ^+ which normally pushes a stack-element, adds the top two, then pops; here it just adds to the top element. To prevent d^n number of operations, only common ones are defined.
The default block of the switch checks for individual operations.
o Fifth, the op-code is a lot more dense: (16 + 6 * var) versus (16 * n). In inner loops, this could add up with cache savings (don't pay retail).

Cons:
o The overhead of an op-code is probably not much greater than the total overhead of this instruction (especially with a switch-based op-dispatcher and -O2 compilation)
o Potentially long op-chains are possible, which could thwart responsiveness to synchronous event-handling. (Unless there is a max-fmt size)
o Potential portability issues if we deal with 16-bit shorts. Might have to require 32-bit ints as the smallest macro-symbol

/** FMT: compute v_arg_size, destination-register, fmt_const_string, ( list of reg-indexes )
 * contents of fmt_string:
 *  +  // add top two stack el's
 *  -  // sub ..
 *  *  // mul ..
 *  /  // divide ..
 *  %  // take modulus
 *  &  // logical and
 *  |  // logical or
 *  P  // push register onto local stack
 *  C  // push constant onto local stack
 *  D  // duplicate top stack element
 *  S  // switch top two elements
 *  \0 // end-of-fmt-string.
 *       Must be word-aligned with such eofs, e.g. "PPP++\0\0\0"
 **
 * Note that the above symbols should probably be part of a c-enum so that
 * the switch statement can be optimized
 **/
AUTO_OP computei_varg {
    static int stack[MAX_STACK]; // make 1 larger than the max-size of allowable varg_size
    unsigned int top = 0; // points to safe location
    STRING *s = Parrot_string_constants[P2];
    int args_offset = 3 + code; // pointer to the start of parameters (could have just been word_align( strlen(code[4]) ))
    short * dops = (short*)s->bufstart; // get format
    int f_first_two_op;
    char * sops; // single_char_ops

    // sanity check against varg_size
    if ( code[1] > MAX_STACK )
        die "invalid size-of operation";

    stack[ 0 ] = 0; // null-value (should never read)
    stack[ 1 ] = 0; // initially stack contains null

    // compare pairs of op-codes (theoretically could nest this into the
    // default statement of a 4-op comparison)
    while (*dops && top) {
        switch ( *dops ) {
        case PUSH_PLUS: // "P+" or ('^' << 8) + '+'
            stack[top] += INT_REG( code[ args_offset++ ] );
            break;
        case CONST_PLUS: // "C+"
            stack[top] += code[ args_offset++ ];
        case PLUS_PLUS: // "++"
            stack[top] += stack[top--] + stack[top--];
        ...
        default:
            // fall back to the single-character case for unrecognized pairs
            f_first_two_op = 0;
            sops = (char*)dops;
            while (!f_first_two_op++) {
                switch ( *((char*) sops) ) {
                case '+': ...
                case '-': ...
                ...
                default: die "invalid op-code"
                } // end 1char-switch
            } // end 1char-while
        } // end 2char-switch
    } // end 2char-while

    if ( top )
        INT_REG( P2 ) = stack[top];
    else
        die "invalid number of pushes";
} // computei_varg

-Michael
Re: question about branching/returning
The internal do-op-loop runs until it sees a return value of zero from any op-code. The RETURN statement within basic_opcodes.ops is really a keyword which gets parsed into an offset from the current PC, based on the internally calculated size of the instruction (known at Configure time from opcode_table). Of course, this is based on my reverse-engineering of the existing code; I'm sure any of this is subject to change. As a note, another poster said that end is never executed, which isn't currently true. -Michael Dave Storrs wrote: Ok, that was pretty much what I thought. But then what is the 'end' opcode for? It does a 'RETURN 0', which would increment the PC by 0 opcodes... which either counts as an infinite loop or a no-op, and we've already got a no-op op.
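A minimal sketch of that do-op-loop as described (op names and table are hypothetical; the real Parrot dispatch differs): each op returns the PC offset to advance by, and a returned zero -- which is what 'end', i.e. RETURN 0, produces -- stops the loop rather than spinning forever.

```c
#include <stddef.h>

typedef long INTVAL;

/* Each op returns the offset added to the PC; 0 halts the loop. */
typedef INTVAL (*op_func)(INTVAL *pc, void *interp);

static INTVAL op_end(INTVAL *pc, void *interp)  /* opcode 0: RETURN 0 */
{ (void)pc; (void)interp; return 0; }

static INTVAL op_noop(INTVAL *pc, void *interp) /* opcode 1: advance 1 word */
{ (void)pc; (void)interp; return 1; }

static op_func op_table[] = { op_end, op_noop };

/* Run ops until one returns 0; returns the final PC. */
INTVAL *run_ops(INTVAL *pc, void *interp) {
    INTVAL offset;
    while ((offset = op_table[*pc](pc, interp)) != 0)
        pc += offset;
    return pc;
}
```

Running { noop, noop, end } halts with the PC parked on the 'end' instruction.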
Re: Parrot multithreading?
Arthur Bergman wrote: In an effort to rest my brain from a coredumping perl5 I started to think a bit on threading under parrot? While it has been decided that perl should be using ithread-like threading, I guess that is irrelevant at the parrot level. Are you going to have one virtual cpu per thread with its own set of registers, or are you going to context switch the virtual cpu? If it was one virtual cpu per thread, then one would just create a new virtual cpu and feed it the bytecode stream? Is there anything I could help with regarding this? Arthur The context is almost identical to that of Perl5's MULTIPLICITY, which passes the perl-interpreter to each op-code. Thus there is inherent support for multiple ithread-streams. In the main-loop (between each invoked op-code) there is an event-checker (or was in older versions, at any rate). It doesn't do anything yet, but it would make sense to assume that this is where context-switches would occur, which would simply involve swapping out the current pointer to the perl-context; a trivial matter. The easiest threading model I can think of would be to have a global var called next_interpreter which is always loaded in the do-loop. An asynchronous timer (or event) could cause the value of next_interpreter to be swapped. This way no schedule function need be checked on each operation. The cost is that of an extra indirection once per op-code. True MT code simply has each thread use its own local interpreter instance. MT-code is problematic with non-MT-safe extensions (since you can't enforce that). In iThread, you don't have a problem with atomic operations, but you can't take advantage of multiple CPUs, nor can you guarantee prevention of IO-blocking (though you can get sneaky with UNIX-select). -Michael
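The next_interpreter model might look like the following sketch (names are mine, not existing Parrot API): the do-loop reloads one global pointer every iteration, so an asynchronous timer or event handler only has to repoint it, and no scheduler function is consulted per op.

```c
typedef struct Interp { int ops_run; /* ... registers, stacks ... */ } Interp;

/* A timer/event handler repoints this between op-codes; the do-loop
 * itself never calls a scheduler. */
static Interp *next_interpreter;

/* Dispatch n op-codes; the global is reloaded every iteration, so a
 * swap takes effect on the very next op-code. The whole cost is one
 * extra indirection per op. */
void run_slices(int n) {
    while (n-- > 0) {
        Interp *cur = next_interpreter;  /* reload each time through */
        cur->ops_run++;                  /* stand-in for one op-code dispatch */
    }
}
```

A swap between calls is picked up immediately, which is the point of reloading rather than caching the interpreter pointer.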
Re: Parrot multithreading?
What we're going to do is fire up a new interpreter for each thread. (We may have a pool of prebuilt interpreters hanging around for this eventuality.) Threading *is* essential at the parrot level, and there are even a few (as yet undocumented) opcodes to deal with them, and some stuff that's an integral part of the variable vtable code to deal with it. Whether it's considered ithread-like or not's up in the air--it'll probably look a lot like a mix. I'm also seriously considering throwing *all* PerlIO code into separate threads (one per file) as an aid to asynchrony. Just remember the cost in context-switching, plus the lack of scalability as the number of file-handles increases. Linux thread-context-switches are relatively brutal compared to, say, Solaris. Additionally, you're consuming a new stack area for each file-handle. That's lots of overhead. There are bound to be semi-portable methods of non-blocking IO. UNIX-select has to have an equivalent on NT. Granted, it's a lot more complicated. Basically you could have IO ops trigger an event-based iThread swap, which causes select to be invoked. I've always thought this was the most efficient model for single-CPU machines. The biggest problem was always segmenting one's code into call-backs. Well, with op-codes, we have a natural division. We have a tight inner loop that occasionally hits a dispatcher, on a complexity level similar to a GUI. You're much better positioned to handle event-based operations (which means higher-level languages can be built atop such a parrot design). Food for thought. -Michael
Re: SV: Parrot multithreading?
Arthur Bergman wrote: Arthur Bergman wrote: In an effort to rest my brain from a coredumping perl5 I started to think a bit on threading under parrot? While it has been decided that perl should be using ithread-like threading, I guess that is irrelevant at the parrot level. Are you going to have one virtual cpu per thread with its own set of registers, or are you going to context switch the virtual cpu? If it was one virtual cpu per thread, then one would just create a new virtual cpu and feed it the bytecode stream? Is there anything I could help with regarding this? Arthur The context is almost identical to that of Perl5's MULTIPLICITY, which passes the perl-interpreter to each op-code. Thus there is inherent support for multiple ithread-streams. In the main-loop (between each invoked op-code) there is an event-checker (or was in older versions, at any rate). It doesn't do anything yet, but it would make sense to assume that this is where context-switches would occur, which would simply involve swapping out the current pointer to the perl-context; a trivial matter. Uhm, are you talking perl 5 here? The event checker checks for signals, we got safe signals now. There wasn't any code for CHECK_EVENTS w/in Parrot when I first read the source-code. I merely assumed that its role was not yet determined, but considered the possible uses. CHECK_EVENTS seems to be gone at the moment, so it's a moot point. MULTIPLICITY is just allowing multiple interpreters; ithreads is letting them run at the same time and properly clone them. If you want to make it switch interpreters at runtime for fake threads, patches are welcome; send it and I will apply it. The easiest threading model I can think of would be to have a global var called next_interpreter which is always loaded in the do-loop. An asynchronous timer (or event) could cause the value of next_interpreter to be swapped. This way no schedule function need be checked on each operation.
The cost is that of an extra indirection once per op-code. True MT code simply has each thread use its own local interpreter instance. MT-code is problematic with non-MT-safe extensions (since you can't enforce that). I am sorry to say, but perl 5 is true MT. Yes, but that feature never got past being experimental. I know of a couple DBDs that would not let you compile XS code with MT enabled since they weren't MT-safe. The interpreter can be built MT-safe (java is a good example), but extensions are always going to be problematic. (Especially when many extensions are simply wrappers around existing non-MT-aware APIs.) I think a good solution would be to treat it like X does (which says you can only run X-code w/in the main-thread). An extension could say whether it was MT-safe or not, and be forced to be serialized w/in the main-physical-thread, which becomes the monitoring thread. An alternative would be to simply have XS code compile in a flag which says to throw an exception if the code is run outside of the main-thread; documentation would emphatically state that it's up to the user to design the system such that only the main-thread calls it. On the side, I never understood the full role of iThreads w/in perl 5.6. As far as I understood, it was merely used as a way of faking fork on NT by running multiple true-threads that don't share any globals. I'd be curious to learn if there were other known uses for it. In iThread, you don't have a problem with atomic operations, but you can't take advantage of multiple CPUs, nor can you guarantee prevention of IO-blocking (though you can get sneaky with UNIX-select). Where did you get this breaking info? ithread works with multiple CPUs and IO blocking is not a problem. Arthur I'm under the impression that the terminology for iThreads assumes an independence of the physical threading model. As other posters have noted, there are portability issues if we require hardware threading.
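That "throw if outside the main-thread" flag might be enforced like this sketch (names are hypothetical, using POSIX threads; the real VM would raise an exception rather than return a flag):

```c
#include <pthread.h>

static pthread_t main_thread;
static int       ext_initialized;

/* Called once, from the main (monitoring) thread, when the
 * non-MT-safe extension is loaded. */
void ext_init(void) {
    main_thread = pthread_self();
    ext_initialized = 1;
}

/* A non-MT-safe extension calls this on entry. Returns nonzero only
 * when running on the recorded main thread; the real VM would throw
 * an exception instead of returning 0. */
int ext_assert_main_thread(void) {
    return ext_initialized && pthread_equal(pthread_self(), main_thread);
}
```

The serialize-to-main-thread alternative would instead queue the call for the monitoring thread to execute, which is heavier but transparent to the user.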
Given the prospect of falling back to fake-threads, then multi-CPU and IO blocking is problematic; though the latter can be avoided / minimized if async-IO is somehow enforced. From my scarce exposure to the Linux Java movement, green-threads were considered more stable for a long time, even though the porters were just trying to get things to work on one platform. I would definitely like hardware threading to be available. If nothing else, it lets students taking Operating Systems experiment with threading w/o all the headaches of C. (Granted there's Java, but we like perl.) However, I'm not convinced that threading won't ultimately be restrictive if used for general operation (such as for the IO-subsystem). I'm inclined to believe that threading is only necessary when the user physically wants it (e.g. requests it), and that in many cases fake-threads fulfill the basic desires of everyone involved.