Re: SAPI (Was RE: args, argv in parrot?)

2001-12-04 Thread Michael L Maraist

On Tuesday 04 December 2001 12:21 am, David M. Lloyd wrote:
 He was telling me about the way PHP separates the interpreter from
 the OS-specific stuff via a SAPI layer.

Fascinating, but perl is, after all, a scripting language first.  perl -e ''
is essential; though I could see a compatibility mode being enabled by
default with '-e' if necessary.  The level of deviation is pretty much up to
Larry, I suppose.

Basically I see the environ / command args being globally shared PMCs from
the beginning.  They can be accessed by any language in more or less the
same way, whether exposed as functions or scalars; but they're more fluid as
scalars.  Plus it's trivial to lock them down once they're encapsulated
inside thread-safe PMCs.

-Michael 



Re: Thoughts on vtables...

2001-12-02 Thread Michael L Maraist

On Sunday 02 December 2001 02:47 pm, Brent Dax wrote:
 Quick comment: I've been thinking about constructing an 'OpaqueHandle'
 PMC type.  All the programmer is supposed to know about it is that it
 points to Something (usually a C struct, but they don't have to know
 that).  I'm not sure what trying to access it results in--perhaps it'll
 look like a reference from the outside (ParrotInterp=HANDLE(0xDECAF)),
 or maybe it'll throw a tantrum.

Perl5 just used a string as the generic C-struct handle, as far as I know.
Then you use pack/unpack to access it further (assuming the programmer knows
what they're doing).  There are tiny gotchas (such as needing flags so the
GC knows the data isn't in relocatable memory, etc.), but since we need that
sort of support for perl5 anyway, I don't see a great advantage in creating
a new datatype for perl6; unless our least-common denominator needs it, I
guess (e.g. lisp, python, etc).
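
For reference, a minimal sketch of the perl5-era idiom (this is the
stash-the-pointer-in-an-IV flavor rather than the pack/unpack-on-a-PV flavor;
MyStruct and the wrapper names are made up for illustration):

#include "EXTERN.h"
#include "perl.h"

typedef struct { int fd; } MyStruct;   /* hypothetical extension struct */

/* hide the C pointer in the scalar's integer slot */
static SV* wrap_handle(pTHX_ MyStruct *s) {
    return newSViv(PTR2IV(s));
}

/* recover it later, trusting the caller to know what it holds */
static MyStruct* unwrap_handle(pTHX_ SV *handle) {
    return INT2PTR(MyStruct*, SvIV(handle));
}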

-Michael



Re: Butt-ugliness reduction

2001-11-15 Thread Michael L Maraist

Combining these two, I have a couple comments.

On Thursday 15 November 2001 02:07 pm, Simon Cozens wrote:
 if (pmc->flags & PS_INTEGER_OK) {
  return ((struct PerlScalarData *)(pmc->data))->intdata;

Why are you storing flags for PerlScalarData inside pmc->flags?  If the
nested type is doing the caching, then it should store the flags.
Additionally, you should probably break out the variables for readability:


The following assumes macros (bite me), typedefs (I don't know why this isn't
standardized; ever since programming in C++, I've never left C structs
untypedef'd), and unions (as with Uri's suggestion):

static INTVAL Parrot_scalar_get_integer (struct Parrot_Interp *interpreter,
                                         PMC* pmc) {
    if ( isPMCScalarClass( pmc ) ) {
        PerlScalarData* psd = pmc->data.scalar;
        switch ( psd->type ) {
        case SCALAR_STR:
            psd->intdata = String2Int( psd->stringdata );
            setScalarIntOK( psd );
            return psd->intdata;

        case SCALAR_NUM:
        case SCALAR_NUM_STR:
            psd->intdata = (INTVAL) psd->numdata;
            setScalarIntOK( psd );
            /* falls through */

        case SCALAR_INT_NUM_STR:
        case SCALAR_INT_STR:
        case SCALAR_INT_NUM:
        case SCALAR_INT:
            return psd->intdata;

        default:
            ;  /* throw exception */
        }
    } else {
        /* exception, since we're specifically requesting a scalar?? */
    }
    return 0;  /* not reached once exceptions are in place */
}

Nested if-statements could be used instead, but so long as type is <=
num_items, this gets optimized into a fast table lookup (when -O2 is used).
Though a bit longer, I think this is still readable.

 On Thu, Nov 15, 2001 at 02:01:53PM -0500, Dan Sugalski wrote:
  *) Use the cache entry in the PMC header. It'll mean for ints and floats
  you won't need to bother with a separate structure or dereference

Currently, the cache only holds a single value.  Is this for space
efficiency?  If we broke the union out into independent cached values, then
the two layers could be collapsed into one.  (data would go unused for
scalars.)  Of course, the switch statement above would have to be masked, as
with:

#define getScalarID(x) ( ( (x)->flags & SCALAR_MASK ) >> SCALAR_MASK_OFFSET )

switch ( getScalarID(pmc) ) {
  ...
}

In general, I feel that OO semantics would be most readable here:

if ( pmc->isInteger() ) {
   return pmc->getInt();
} else if ( pmc->isNum() ) {
   return pmc->cvtNumToInt()->getInt();
} else if ( pmc->isString() ) {
   return pmc->cvtStrToInt()->getInt();
}

If this were C++, then the isXXX and cvtXXX would be inlined functions, which
would have the same effect as macros but with much greater maintainability.
Unfortunately we're using C, so the closest we have is macros.
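
For instance, hypothetical macro equivalents of the accessors above (a
sketch, not the real PMC API) might be:

#define pmcIsInteger(p)   ( ((p)->data.scalar->type & SCALAR_INT) != 0 )
#define pmcGetInt(p)      ( (p)->data.scalar->intdata )
#define pmcCvtNumToInt(p) \
    ( (p)->data.scalar->intdata = (INTVAL)(p)->data.scalar->numdata, (p) )

so the chain above becomes return pmcGetInt( pmcCvtNumToInt( pmc ) ); with
no call overhead, at the usual macro cost of multiple evaluation of p.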

While speed is king, readability is VERY important (as I'm sure we all agree).

The last alternative I can think of is to break PerlScalarData into its own
sub-vtable.  Thus:

if ( isPMCScalar( pmc ) ) {
   return pmc->data.scalar->vtable[ SCALAR_GET_INT ]( pmc->data.scalar );
}

This would have similar overhead to the table-lookup switch statement, and
since these are extremely simple functions, many architectures (most notably
SUN's) wouldn't require a full-blown function call with frame management.

As a last alternative: I had envisioned PMCs as having separate vtables for
each of the many different combinations of types.  Meaning that a scalar-int
would be separate from a scalar-string-num, etc.  Since it seems that we're
not currently doing this, it would require 8 times as many vtables, but it
would avoid nearly all of these if-statements.
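
A sketch of what I mean, with invented names (each type-state gets its own
get_integer, and conversions swap the vtable rather than setting flags):

extern VTABLE scalar_int_str_vtable;

static INTVAL scalar_int_get_integer(PMC *pmc) {
    return pmc->data.scalar->intdata;        /* already an int: no tests */
}

static INTVAL scalar_str_get_integer(PMC *pmc) {
    PerlScalarData *psd = pmc->data.scalar;
    psd->intdata = String2Int( psd->stringdata );
    pmc->vtable  = &scalar_int_str_vtable;   /* now an INT+STR scalar */
    return psd->intdata;
}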


 I can't. :(

 /* It's more efficient to use our own caching than the PMC's cache
Why? Because we also need to store state on what type of thing is
in the cache, and this can be one of three things. (four, if you
include nothing) Four things is two bits, which means two flag
tests instead of one. Hence, we just use the old-style IOK/NOK/POK */

  *) Swap out vtables as the pmc internals change, rather than vectoring
  off flags.

Oops, yeah, that's what I was trying to say above.


 That's probably the right way to do it. But what about overlapping types
 that are both integer and string OK?

Hence the larger number of combined types, coupled with removing the cache
and just storing each explicit type.

On Thursday 15 November 2001 02:07 pm, Uri Guttman wrote:
 and the PMC would have in it:

   union {

   INTVAL  *int ;
   NUMVAL  *num ;
   } data ;


-Michael




Re: memory assumptions

2001-11-13 Thread Michael L Maraist

On Tuesday 13 November 2001 12:05 pm, Ken Fox wrote:
  Are _all_ handles (PMC's, Strings, or whatever) that serve as
  root-set elements dynamically allocated?

 Nope. Part of the root set includes auto C PMC handles used by
 extensions. You also have to think about hardware registers too
 if GC is running in another thread.

I'm not worried about separate threads.  Shared handles are already allowed
to be slow and thus few and far between.  Locking also alleviates most of
the hangups.  Most of my thinking involves running much of the GC within a
worker thread (spawned by allocs), so different interpreters are, for the
most part, independent of each other, even in garbage collection.

My ideal is to profile only at alloc time, then hand off a job to a
GC thread, but so far that has more problems than it's worth (especially
with copying GCs).  The alternative is to enforce read/write barriers on
code and do all aspects of GCing in another thread.  This avoids much
pause time, but all but requires multi-threading (something I'd rather not
require for one-liners).


 IMHO the easiest way to code this is to define an API that
 extension writers use to declare every PMC handle. This code
 looks ugly and encourages bugs though so a lot of people don't
 like it. Many systems treat the C stack as conservative roots.
 Those systems flush the hardware registers to the C stack
 before running the collector. The benefit is that handles
 don't need to be declared. The cost is the non-portable code
 used to scan the C stack and flush the registers.

The various texts I'm reading suggest that memory management should be a
black-box module.  It's easier to debug and use.  They suggest that too
large a percentage of development time is devoted to memory management
(leak-finding), so the less XS-writers have to worry about memory, the more
acceptance parrot will have, IMHO.  (Even if some performance loss or
intermittent pausing is necessary... though hopefully we'll be able to
provide a choice between the two at run-time.)

 Another simple solution is to suspend the collector while in
 extension code (or untrusted extension code). This works for
 lots of cases, but falls apart badly in callback/event-loop
 extensions that never return.

Unfortunately, this would require hooks into the mem-mngr module, which
violates its black-box nature.  Not to mention the points you make.

 One nice thing about writing Perl extensions is that the SV *
 can be stuck in odd places. For example, I put them in X Toolkit
 and Motif widget user data slots. Once the SV * are stuck
 in the slots I have no idea what happens to them. This works fine
 because the ref count will keep the SV alive until the widget
 dies. A hybrid ref count GC with pinned PMCs will work exactly
 the same. If we get fancier though and either use movable
 PMCs or tracing collectors, then my user data slots become
 roots. That will crash and burn because I don't know how to
 register them. I'd have to use a separate array for holding the
 PMCs and then stick the array indexes into the widgets.

So long as SV* equivalents are registered as foreign access, XS modules
shouldn't have to worry.  Unfortunately this might cause memory leaks.
Perhaps foreign access can utilize a reference count: XS modules that don't
directly utilize GC-able objects register a newly allocated handle with
foreign access, and are required to unregister it when they're done.  When a
handle has been unregistered as many times as it was registered, it's
removed from the foreign-access area, and thus its contents are reclaimable.
The only remaining problem is in how the handle was acquired (hence my
original question), since the handle itself needs to be reclaimed somehow;
the following is possible:

PMC* ptr = malloc(sizeof(PMC)); 
ptr = malloc(sizeof(PMC)); // memory leak

If allocations instead come from gc_malloc - or more likely from the arena's
newPMC() - then once the PMC is unregistered from foreign access, XS code
can safely ignore it.  The only gotcha here is that XS code has to treat
unregistering as if the PMC were being freed (and optimizations will
probably do exactly that).

Heck, I'm even thinking that there should be two forms of newPMC: one to be
used within the root set, and one used externally.  newPMC, and newPMCR
(for registered).
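
One plausible shape for the pair (just a sketch; the registration call is
the refcounted foreign-access idea from above):

PMC* newPMC(void);               /* caller keeps it reachable from the root-set */

PMC* newPMCR(void) {             /* "registered" variant */
    PMC* p = newPMC();
    PL_registerForeignAccess(p); /* foreign-access count starts at 1 */
    return p;
}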

It's still too early to see what's a good direction.


 It's probably not worth making my job as an extension
 writer easier if it slows/complicates the rest of the GC. Most
 extension writers are pretty good at this sort of thing and
 a special utility library for pseudo-pinned PMCs wouldn't be
 any big deal.

I don't really see a need for handles to be movable (especially if they're
arena-based), but reclaiming handles might require some sort of extra
record-fields.  Since handles are fixed-size and decently big (several
integers worth), I'm wanting to prepend next/prev pointers to the structure
and maintain from/to/free lists.

Re: memory assumptions

2001-11-13 Thread Michael L Maraist

On Tuesday 13 November 2001 01:20 pm, Jason Gloudon wrote:
 On Mon, Nov 12, 2001 at 11:59:08PM -0500, Michael L Maraist wrote:
  2)
  Can we assume that a buffer object is ONLY accessible by a single
  reverse-path-to-PMC?  PMC's or array-buffers can point to other PMC's, so
  it's possible to have multiple paths from the root to an object, but I'm
  asking if, for example, we could use an algorithm that relied upon the
  fact that it only has one direct parent.

 That assumption would mean one could not directly share non-constant
 buffers between strings. There has been talk of having copy-on-write
 strings.

That's fine.  I didn't really have strings in mind, and it's possible to
treat them separately.  We're already having to treat handles differently
than buffer objects.  I'm also wanting to segregate lists of leaf-buffers,
arrays, hashes, etc., so as to avoid putting switch statements in the inner
loop of the marker.  Since each type would have a different inner loop,
strings aren't confined to the same algorithms.  So, any other ideas related
specifically to PMCs?

One interesting thing to note is that strings are leaf objects.  Further,
copy-on-write is only efficient for values that have more than one parent
(otherwise each modification would require a copy).  I haven't seen the
discussion, so I don't know where it'll wind up, but if this is the case,
then it seems to me that the only way to determine (efficiently) whether
copy-on-write should be applied is via some variation of reference
counting, even if only 1-bit ref-counting is used.  Such as:

// In pseudo-code
STR_REG[x] = newString("abc");   // f_copy_on_write = 0
STR_REG[x] _= newString("foo");  // check f_c_o_w; it was 0, so we modify the
                                 // string in place (possibly resizing)
STR_REG[x] = PL_null_str;        // "abc" eventually reclaimed by GC
STR_REG[x] = newString("abc");   // f_c_o_w = 0
STR_REG[y] = STR_REG[x];         // f_c_o_w = 1
STR_REG[x] _= newString("bar");  // check f_c_o_w; it was 1, so we copy out
// At this point, a GC pass would count the instances of "abc", notice that
// it only has one handle, and reset its f_c_o_w to 0.

Even if the GC didn't perform this last stage, the system would still work;
it would just produce a greater percentage of multi-ref'd garbage and
copying.  Additionally, note that such a pass would imply a pre-GC stage
which resets the f_c_o_w bits to zero.

Setting f_c_o_w to 1 all the time is slightly faster than performing an
increment.  Further, separate string vtables could be utilized to avoid the
if-statements (being an RO string instead of an RW string).
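
A minimal sketch of the write barrier this implies (struct layout and names
are illustrative only):

typedef struct parrot_string {
    char *buf;
    int   len;
    int   f_cow;                        /* the 1-bit "might be shared" flag */
} parrot_string;

parrot_string* string_copy(parrot_string*);    /* assumed helper */

/* called before any in-place modification */
parrot_string* string_write_barrier(parrot_string *s) {
    if (s->f_cow) {                     /* possibly shared: copy out first */
        parrot_string *fresh = string_copy(s);
        fresh->f_cow = 0;               /* sole owner until aliased again */
        return fresh;
    }
    return s;                           /* f_cow == 0: modify in place */
}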

Lastly, the overhead of a bit might as well be a byte, and the garbage
reduction from actually performing full reference counting (albeit with a GC
fallback) might make this worthwhile.  Full ref-counting on strings would be
harder for XS code to work with, though.

-Michael



Re: memory assumptions

2001-11-13 Thread Michael L Maraist

On Tuesday 13 November 2001 03:57 pm, James Mastros wrote:
  Unfortunately, this would require hooks into the mem-mngr module, which
  violates its black-box nature.  Not to mention the points you make.

 I don't really see this as onerous fiddling with the memory manager; we
 want to have hooks in to modify the timing of the GC anyway.  All we have
 to do is wrap a refcount around them (and I'm not sure about that; can you
 really be in more than one XS at once?) and call them in the appropriate
 places.  That isn't fiddling with the internals of the memory manager;
 its timing is defined to be an external (by us).

I don't mean that we won't have API hooks into the mem-manager (sorry, I
worded it poorly); I mean that the more visible the guts of the mem-manager
are, the less likely it can be replaced by a better / more suitable
algorithm, possibly in embedded situations.  Extending mem-mngr
data-structures (like the linked-list) into the definition of a PMC is what
I'm worried about; not exposing a function to say GC-me-more-often.  (I
apologize for the LISP mnemonic :)

-Michael



Re: memory assumptions

2001-11-13 Thread Michael L Maraist

On Tuesday 13 November 2001 06:37 pm, James Mastros wrote:
 On Tue, Nov 13, 2001 at 06:14:30PM -0500, Michael L Maraist wrote:
  Extending mem-mngr data-structures (like the linked-list) into the
  definition of a PMC is what I'm worried about; not exposing a function to
  say GC-me-more-often. (I apologize for the LISP mnemonic :)

 Hm.  I think that the problem is that the definition of a PMC is, I think,
 what we say it is in perlpmc.pod, and not what we say it is in pmc.h.

 This means that we need to recompile XSes to change PMCs, but I think
 that's possibly OK.  If not, why do we need to keep the linked-list inside
 the PMCs, instead of separately?

Again, this is only speculation about possible algorithms.  The only reason
to expose linked-list structures is if a collection algorithm were used that
assumed it contained a list of ALL PMCs.  If newPMC() / newPMCext() were
used exclusively, this isn't a problem; but if PMCs could be externally
allocated and then registered within foreign access, then there'd
potentially be a data-type mismatch.  Namely, using

header = (gc_object_header*)((int*)pmc_ptr - 2);
header->next = ...;
header->prev = ...;

would be a failed assumption.  I'm in favor of requiring newXX(), since this
is more portable between memory managers.

It's equivalent to assuming that you could determine the size of a malloc'd
object by doing *((int*)ptr - 1), simply because your current mem-mngr
stores the value there.  Obviously it's best to pass the pointer to the
associated mem-mngr API and let it do things like fetching the size.  Here
the potential problem is static allocations generated by XS code.  In theory
our API would allow:

// desirable API functions
PMC_t* newPMC();    // assumes it'll be attached to the root-set
PMC_t* newPMCext(); // auto-attaches to the foreign-access part of the root-set

void PL_registerForeignAccess(PMC_t*);

// questionable API functions
void PL_unregisterForeignAccess(PMC_t*);
void PL_initPMC(PMC_t*);

And an XS coder could have:

void fooExtension(..) {
  PMC_t my_pmc;
  PL_initPMC( &my_pmc );
  PL_registerForeignAccess( &my_pmc );
  ...
  // my_pmc falls out of scope.
  // coder forgot to call PL_unregisterFA( &my_pmc );  memory leak..
}

How PL_registerFA() works isn't yet determined.  If newPMCext() called
PL_regFA(), then there'd have to be some additional data structure or
allocated memory.  But if newPMCext() simply used the GC-exclusive region of
the linked list, then the PMC could be placed on the appropriate list, and
thus avoid additional flags or memory allocations.  Its state / type is
implicit in the associated list, as far as GCing is concerned.  It's just
that this doesn't work with the above code.  The GC-hidden data structure
could also keep a ref-count of the number of times it's been marked with
PL_regFA(); that way an initial reg-stack PMC could be grabbed by XS code
and safely cached for whatever evil purposes.  Thus my current idea of PMC
handles is:

struct gc_handle_header_t {
  struct gc_handle_header_t *prev, *next;
  int fa_cnt;
  char data[0]; // gcc-specific code for demonstration purposes
}; 
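
Given that header, a portable way to hop from the visible PMC pointer back
to the hidden header (instead of hard-coding "(int*)pmc_ptr - 2" as above)
would be something like:

#include <stddef.h>

#define HANDLE_HEADER(pmc) \
    ((struct gc_handle_header_t*) \
        ((char*)(pmc) - offsetof(struct gc_handle_header_t, data)))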

Again, this is just the weeding out stage.

-Michael



Re: Lexical implementation work

2001-11-12 Thread Michael L Maraist

On Friday 09 November 2001 03:36 pm, Dan Sugalski wrote:
     Do we want non-PMC lexical support?

 Nope, I wasn't going to bother. All variables are PMCs. The int/string/num
 things are for internal speed hacks.

But can't a compiler generate one of these internal hacks?  My thoughts are
that a given scope cannot return internal representations (like ints /
strings), and must encapsulate them inside scalars, BUT the compiler may
optimize the inside of a scope.  If no evals occur inside the block (rather
common), then the lexicals aren't volatile.  If no divides are ever
performed on a scalar, nor is it stringified, then the compiler might be
able to assume it's a raw integer (whether it was declared my $foo as int
or not).  This also assumes that the would-be integer does not interact with
any other scalars.  I believe there are op-codes that assign the value of
primitives into a scalar (there would have to be), so scalars and primitives
can still interact.  Since this is an optimization, I agree that its
usefulness is limited.

From the above, the only uses I can see for declaring my $foo as int are to
set flags (or utilize different vtables) to enforce integerness, and to tell
the optimizer that it's ok to use a primitive integer even if the block
contains a divide.  Note that:
$foo = int( $bar / $baz )
should have the same effect.

End result is that this is a problem that can be put off.

-Michael



memory assumptions

2001-11-12 Thread Michael L Maraist

Dan gave a list a bit ago of some minimal assumptions about how we'll deal
with memory.  I've been researching various memory management techniques,
and I'm noticing that various optimizations can occur if certain assumptions
are made.  It's still early in the game, but I thought it might be helpful
to see if we can identify any other truths that we can take advantage of in
selecting an algorithm.

1)
Are we allowing _any_ dynamic memory to be non-GC-managed?  I assume that
there are core parts / naive extensions that assume a malloc/free memory
model.  While we could always define free as an empty macro, I think it
would be to our benefit to utilize regular mem-managers where possible.
Further, our memory format wouldn't necessarily be compatible with external
libraries anyway.

2)
Can we assume that a buffer object is ONLY accessible by a single 
reverse-path-to-PMC?  PMC's or array-buffers can point to other PMC's, so 
it's possible to have multiple paths from the root to an object, but I'm 
asking if, for example, we could use an algorithm that relied upon the fact 
that it only has one direct parent.

3)
I believe Dan said that we _could_ assume that PMC's were unmovable, but are 
we requiring this?  This is related to the next question.

4)
Are _all_ handles (PMCs, Strings, or whatever) that serve as root-set
elements dynamically allocated?  Or can C code declare PMC structs instead
of PMC-struct pointers?  That Dan defined a foreign-access portion of the
root set suggests to me that C code might acquire its own handles (though we
can enforce an API for their allocation).  This is important because we do
have to perform garbage collection on the handles.  If a register stack is
decremented, for example, valid pointers to PMCs / strings would become
garbage that requires some sort of collection.  While this is easy enough to
manage, a heterogeneous pool of handles would complicate things (and
disqualify many desirable algorithms).  I'm evaluating algorithms for the
handles which require prepending a linked list.  It's easily hidden away and
thus ignored by non-memory-manager algorithms IFF we require that all
handles be allocated via the API.  This will be especially important since
root-set navigation in its raw form will be slow, due to the complex nature
of some data types.

That's all for now.

-Michael



Re: Rules for memory allocation and pointing

2001-11-04 Thread Michael L Maraist

On Sunday 04 November 2001 02:39 pm, Dan Sugalski wrote:
 At 08:32 PM 11/4/2001 +0100, Benoit Cerrina wrote:
   There will be a mechanism to register PMCs with the interpreter to note
   they're pointed to by something that the interpreter can't reach. (For
   example, a structure in your extension code, or via a pointer stashed
   in the depths of a buffer object, or referenced by another interpreter)
   This foreign access registry is considered part of an interpreter's
   root set.
 
 If this is the case, how do you want to move the PMC, I thought you wanted
  a copying collector?

 While the PMC structures themselves don't move (no real need--they're of
 fixed size so you can't fragment your allocation pool, though it makes
 generational collection easier to some extent), the data pointed to by the
 PMC can.  That's the bit that moves.


Ok, so far, here's what I see:

There are two main memory segments.  First, a traditional alloc/free region.
Within this area are arenas (for fixed-size memory objects).  This region
must be efficient under MT, and hopefully not too wasteful or fragmenting.
This region is mostly for core operation, and the non-arena allocations are
to be minimized; memory leaks are critical as usual.  The utilization
patterns should be analyzable, well known and compensated for (with respect
to fragmentation, thread contention, etc).  My vmem derivative might still
be valuable here, but I suppose we need to let the core's needs /
characteristics flesh out further.

Then there's a GC region, which is primarily the allocation space of the
interpreted app, which obviously can't be trusted with respect to memory
leaks or usage (fragmentation) patterns.  PMCs and strings are handles into
this GC region, though the handles themselves are stored in core memory
(above).

Question: Are there other types of references?  I can't think of any.

The GC algorithm (whether a separate thread or called incrementally when
memory is low (but not yet starved)) needs to quickly access this GC heap's
values, which I believe is Dan's reason for requiring a maximum of two
levels of indirection.  I suppose it just runs down the known PMC lists,
including the PMC and string register sets, the stacks and the stashes for
each interpreter.  You do, however, refer to a foreign-access region, and my
first impression is that it's the wrong way of going about it.

First of all, how are arrays of arrays of arrays handled?

// create a buffer object of PMCs
PMC_t p1 = new Array(50)

for i in 0 .. 49:
  p1->data[ i ] = new Array(50)

In order to comply with the max depth, the array creation will have to
register each sub-array entry in the foreign-access region, or am I missing
something?

For one thing, this will mean that the foreign-access data structure will
grow VERY large when PMC arrays / hashes are prevalent.  What's worse, this
data structure is stored within the core, which means an additional burden
on core-memory fragmentation / contention.

Additionally, what happens when an array is shared by two threads (and thus
two interpreters)?  Whose foreign-access region is it stored in?  My guess
is that to avoid premature freeing, BOTH.  So now a work queue used by a
30-thread producer/consumer app is allocating and freeing LOTS of core
memory on each enqueue / dispatch.  Again, with the details this fuzzy, I'm
probably going off half-cocked; but I did qualify that this was my initial
impression.

My suggestion is to not use a foreign-references section; or if we do, not
to utilize it for deep data-structure nesting.  Instead, incorporate a
doubly linked list within PMCs and strings...  Thus whenever you allocate a
PMC or string, you attach it to the chain of allocated handles.  Whenever
the PMC is freed, you detach it.  The GC then has the laughably simple task
of navigating this linked list, which spans all threads.  This can
incorporate mark-and-sweep or copying or whatever.  By adding 8 or 16 bytes
to the size of a PMC / string, you avoid many memory-related problems.  Not
to mention the fact that we are freed of the concern of depth away from the
interpreter root.
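
A minimal sketch of that chain (field names invented; this is the 8-16
bytes of overhead mentioned above):

typedef struct PMC {
    struct PMC *gc_next, *gc_prev;   /* chain of all allocated handles */
    /* ... vtable, flags, data ... */
} PMC;

static PMC *all_handles;             /* head of the chain, spans all threads */

static void gc_chain(PMC *p) {       /* called from every handle allocation */
    p->gc_prev = NULL;
    p->gc_next = all_handles;
    if (all_handles)
        all_handles->gc_prev = p;
    all_handles = p;
}

static void gc_unchain(PMC *p) {     /* called when a handle is freed */
    if (p->gc_prev) p->gc_prev->gc_next = p->gc_next;
    else            all_handles = p->gc_next;
    if (p->gc_next) p->gc_next->gc_prev = p->gc_prev;
}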

Beyond this, I think I see some problems with not having PMCs relocatable.
While compacting the object stores that are readily resized can be very
valuable, the only type of memory leak this avoids is fragmentation-related.
The PMCs themselves still need to be tested against memory leaks.  Now, I'm
still in favor of some form of reference counting; I think that in the most
common case, only one data structure will reference a PMC, and thus when it
goes away, it should immediately cause the deallocation of the associated
object-space (sacrificing a pittance of run-time CPU so that the GC and free
memory are relaxed).  But I hear that we're not relying on an integer for
reference counting (as with perl5), and instead are mostly dependent on the
GC.  Well, if we use a copying GC, but never move the PMC, then how are we
Re: Rounding?

2001-11-04 Thread Michael L Maraist

On Sunday 04 November 2001 10:59 pm, Zach Lipton wrote:
 I'm working on learning some parrot asm, but if I write something like
 this:

 set N0,2
 set N1,2
 add N3, N0, N1
 print N3


 I get:

 4.00

 Is there any way to round this, or at least chop the 0's off the end?

Since print is for debugging purposes, that's doubtful.  You could create a
throw-away patch that adds:
printf <s>, <i|n|s>
-Michael



Re: vmem memory manager

2001-11-01 Thread Michael L Maraist


 My initial impression is that a page should be 1024 bytes (though I'm also
 looking at 512 bytes to reduce the shotgun effect of 8-byte allocs).
 I've actually found that q-caches were detrimental to things like
 consolidation, and the calling depth (overhead) of a worst-case allocation.
 So I'm currently leaving them out.  Instead I'm using a power-of-two based
 dq-cache (for differential q-cache).

Forgot to include two other points on this topic.  First, if q-caches were
used, the slab header would require a whole page (so as to maintain
alignment).  With a q-max of size 4 at most and a slab object multiple of 4,
we have 16 pages as the minimum allocation actually fulfilled by vmem.
However, pages are of 1k alignment, which you'll note tells vmem that it
should look in its segment hashtable.  Thus a q-cache in this system could
not simply use multiples of a page size.  On the other hand, an alternative
would be to subdivide pages into two regions instead of one.  If it is
assumed that accesses within the q-cache region will be rare (relegated
mostly to spills into vmem by the dq-cache), then the q-cache could be
considered a second-tier subdivision of a page.  For example, if the page
size were raised to 16K (it must be a power of two for efficient
page-alignment calculations), the dq-caches would still utilize a slab size
of 1k and define dq's of sizes 8, 16, 24, ..., 384, but the q-cache would
utilize a slab size of 16K (our new page size) and define q's of sizes
1, 2, 3, 4, 5.  Note that dq's use power-of-2/3 sizes to efficiently cover a
broad range, while q's use multiples so as to minimize slack space.  Here,
however, our q-cache allocations have incredible slack space due to the
forced subdivisions of 16, as the following table shows:

page-size 16K
q-cache-size : num-objects : header-overhead : slack-space : % overhead 
(including slack )
1 : 15 : 1 : 0 : 6.66%
2 : 7   : 1 : 1 : 14.2%
3 : 5   : 1 : 0 : 6.66%
4 : 3   : 1 : 3 : 33%
5 : 3   : 1 : 0 : 6.66%

There are several trade-offs.  The larger the slab size, the less max
overhead for any q-cache (here we have 33% max overhead for optimal
alignments; in the worst case, an allocation of size 3.01k, for example,
would have 77% slack).  However, an allocation of size 6k would still
require a full 16K allocation by vmem, since we want to minimize
fragmentation.  That's an overhead of 166% (wasting 10k of memory).  If we
wanted to reduce both the percentage and the max physically wasted memory,
we could run the q-cache all the way until size == 1 (at which point the
slab is of little or no use):

page size = 8K
1 : 7 : 1 : 0 : 12.5%
2 : 3 : 1 : 1 : 25%
3 : 2 : 1 : 1 : 25%

We round sizes 4, 5, 6 and 7 up to an allocation of size 8.  Thus the max
overhead is 3.1k -> 8k = 166%, which wastes over 4k of physical memory.
Note also that we incur at least 12.5% overhead.  It's debatable whether 8K
or 16K is better.  Both have worst cases of 166%, but the 16K wastes more
total physical memory.  Compare this to the following table of power-2
allocations for dq-caches:

virtual page size = 1k, slab header size is approx 28 bytes
8  : 124 : 4 : 4 : 3.2%
12 : 83  : 2 : 0 : 2.811%
16 : 62  : 2 : 4 : 3.2%
24 : 41  : 1 : 12 : 4%
32 : 31  : 1 : 4  : 3.2%
48 : 20  : 1 : 36 : 6.6% 
64 : 15  : 1 : 36  : 6.6%
96 : 10  : 0 : 64  : 6.6%
128 : 7  : 1 : 100 : 14.2%
192 : 5  : 0 : 36  : 6.6%
256 : 3  : 1 : 228  : 33%
384 : 2  : 0 : 228 : 33%

BUT these first-tier dq-caches all need 1k allocations from vmem, which
passes the request off to the q-caches, which incur an additional 12.5%
overhead for 8K pages or 6.66% overhead for 16K pages.

Additionally, note that the overhead could be reduced somewhat if we could
use an external header for slabs of object-size >= .5 vpage.  But
unfortunately, the first object of such a slab falls on a page boundary, and
thus would be inappropriately interpreted by vmem_size and vmem_free.  Note,
once again, that explicit object caches do not use vmem directly and are
thus shielded from this page / vpage limitation.  This only applies to the
internal dq-cache and the debated q-cache, not the explicitly allocated
object-caches.

More important than slack space is the shotgun effect, which is exacerbated
since we now have 7 layers in which a freed memory object may reside:

8B dq-cache magazine (1 of 20; assuming max-stack-size = 5, data/prev_data
are both full, and there are two threads)
8B dq-cache depot (1 of 50; assuming max-stacks = 5)
8B slab (1 of 124)
1024B q-cache magazine (1 of 8; assuming max-stack-size = 2, data/prev_data
are both full and there are two threads)
1024B q-cache depot (1 of 50)
1024B slab (1 of 7)
vmem segment (1 of 255) (segment size = 255 * 8 * 1k vpage = 2 Meg)

That's 347 million (20 * 50 * 124 * 8 * 50 * 7) non-vmem hiding places for a
freed object.  A reclaim currently is only slated to free depots, which
reduces us to 3.47 million hiding places.  If we do away with the second
tier (and the vpage concept), we have

cvs tags

2001-11-01 Thread Michael L Maraist

I just noticed that we're not using any tags in the CVS tree.  Do you think
we're at a state where we could start applying labels to our code?  I am
considering two types of labels: one for the periodic snapshots (so as to
back-trace features in those files), and one for successful smoke-tests.

Thus if something is broken, developers can pull down the last known
successful smoke-test and build up from there.

I'm assuming, of course, that there isn't a more sophisticated
behind-the-scenes setup (like perforce).

-Michael



vmem memory manager

2001-10-31 Thread Michael L Maraist

This references the previously posted (by someone else) article on SUN's
libumem memory architecture.  If needed I can run through the archives and
find the URL.

I've already written a threaded implementation which has decent
responsiveness.  It did as much as the SUN libumem claimed it would do.
Recently, however, I redesigned the architecture from scratch, taking
advantage of many possible optimizations.  Since it's towards the end of the
day, I figured I'd document my architecture here and wait for comments
before beginning the rewrite / re-debug.

As a refresher, SUN's design is based on 4 layers: vmem, slab, depot and
magazine-cache.

At the bottom is vmem, which maps pages of memory (in libu, new pages are 
acquired from mmap; excess are readily freed via munmap).

The next higher layer is the slab, which partitions a contiguous collection
of pages into an array of raw memory objects.  (Objects are fixed-size
pieces of memory.)  Thus each object size requires a separate slab.  The
slab requests n pages from vmem to construct a new slab when it runs out of
memory objects.  It also frees slabs to vmem when all the individual objects
have been reclaimed.  Both of these occur automatically, w/o external
intervention.

The next higher layer is a depot, which stores constructed objects, e.g.
objects that are initialized and potentially contain references to other
memory objects.  When memory requests are made and no free objects exist,
requests are passed down to the associated slab (of corresponding size) and
a constructor is applied.  This happens automatically.  Frees to the slab
are asynchronous; they occur when the entire system is memory-starved, or
when a depot has too many free objects.  In any case, prior to a free to the
slab, the destructor is applied to each freed memory object.

The highest level is the magazine-cache.  The sole function of the cache is
to allow multiple threads to have thread-local access to the depot.  It does
this by caching up to 2*M objects in a thread-local magazine structure.  On
an alloc, when the magazine is empty, thread-synchronized access is
performed to the underlying depot / slab / vmem layers.  On a free, when the
magazine is full (contains 2*M objects), thread-synchronized access to the
lower layers is performed.  Most of the time there is a mixture of allocs
and frees, which avoids lower-level accesses (and thus mutex locks).  Aside
from alleviating sync-contention, this minimizes the sharing of memory
objects between threads, and on multi-CPU systems avoids cache pollution
(due to setting a cache-line's MESI flag to S).
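
In rough C, the fast path looks something like this (M, cache_t and all the
helper names here are placeholders, not SUN's actual API):

enum { M = 16 };
typedef struct cache cache_t;
typedef struct magazine { void *rounds[2 * M]; int nrounds; } magazine_t;

magazine_t* thread_magazine(cache_t*);   /* thread-local lookup, no lock */
void* depot_alloc(cache_t*);             /* mutex-protected slow paths */
void  depot_free(cache_t*, void*);

void* cache_alloc(cache_t *cache) {
    magazine_t *mag = thread_magazine(cache);
    if (mag->nrounds > 0)
        return mag->rounds[--mag->nrounds];  /* common case: lock-free */
    return depot_alloc(cache);               /* empty magazine: hit the depot */
}

void cache_free(cache_t *cache, void *obj) {
    magazine_t *mag = thread_magazine(cache);
    if (mag->nrounds < 2 * M) {
        mag->rounds[mag->nrounds++] = obj;   /* common case: lock-free */
        return;
    }
    depot_free(cache, obj);                  /* full magazine: hit the depot */
}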

So far, nothing revolutionary, albeit clever on SUN's part.  The next
attribute, however, is the use of external boundary tags.  The idea is that
the allocated memory should never be written to (SUN's design goals included
the use of read-only allocations).  Further, these external tags don't have
to be contiguous.  Runs of contiguous memory regions are collected together
in a segment.  Segments are allocation regions (such as via mmap).  Within a
segment, consolidation may occur.  When fully consolidated, the segment is
freed.  Hashtables are used to associate a user's memory pointer with these
external boundary tags.  From this, SUN claims a constant (bounded)
alloc/free time no matter what is going on.  From my analysis, the only
possible wrinkle in this constant time is in the reclaiming stage (which
makes n reclaims, where n is the number of depots, a relatively low and
constant number).

At run-time, object caches can be defined for arbitrarily sized objects.
Constructors and destructors can optionally be configured.  Such object
references use the magazine-layer interface.  For generic memory access,
vmem is the API, but small memory requests are redirected to one of several
predefined quantum caches (of sizes 1 page, 2 pages, ..., q_max pages),
which alleviates lock contention on vmem when it's used generically.

SUN addressed the memory fragmentation problem by placing a lower bound on a
memory request's size.  Slabs allocate at least 3 times the object size.  In
most cases, a slab is 16 - 32 times the object size.  Further, with
q-caches, all allocations up to q-max have slabs of the same size.  Thus
slabs used by 8- and 16-byte object allocations can be trivially intermixed.
The only danger is if you never free all the objects w/in each slab.  I call
this the shotgun effect, since you have little bits throughout memory space,
preventing the freeing of slabs, and thus preventing the interchange of
slabs when needed.

My design is as follows:

sbrk is completely avoided, which leaves open the possibility of having two
separate memory managers.  The following functions are public:

void* vm_alloc(size)
void  vm_free(void*)
void* vm_realloc(void*, size)
int   vm_sizeof(void*)
void  vm_dump_stats()

// automatically allocates a slab  (reuses one if possible)
depot_t* depot_create( 

Re: Improved storage-to-storage architecture performance

2001-10-30 Thread Michael L Maraist



 Absolutely. Compile-and-go is an absolute must, and one that'll end up
 disabling most of the potential optimizations for sheer startup time
 reasons. Which is a pity, but we can always let people enable them if they
 want from the command line.

Or via use flags, since mod_perl and the like might like to see
optimizations, even though they won't have access to argv.  Of course,
mod_perl is a totally separate executable.

-Michael



Re: Improved storage-to-storage architecture performance

2001-10-30 Thread Michael L Maraist

On Tuesday 30 October 2001 01:47 am, Ken Fox wrote:
 Uri Guttman wrote:
  that is good. i wasn't disagreeing with your alternative architecture.
  i was just making sure that the priority was execution over compilation
  speed.

 I use a snazzy quintuple-pass object-oriented assembler written
 in equal parts spit and string (with a little RecDescent thrown in for
 good measure). A real speed demon it is... ;)

 The real motivation of my work is to see if a storage-to-storage
 machine ends up using cache better and with less compiler effort
 than a register machine. When I read about CRISP, the first thing
 that came to mind was the top-of-stack-register-file could be
 simulated exactly with high-speed cache in a software VM. Dropping
 the stack-machine instructions in favor of Parrot's 3 operand ones
 made it sound even better.


This raises an interesting question.  PMCs are mighty beasts and can do just
about anything: mutate, store into stashes, perform queries of their
datatype, have embedded lock variables and even embedded garbage-collecting
info.  They're built for safe thread-sharing, auto-variables, bla bla bla.
The only memory storage for scalars that I'm currently conceiving of is in
name-space stashes (globals).  Thus our most likely implementation of S2S
would be to have 'add g_x, g_y, g_z', which performs two stash lookups, an
add, then one stash write.  I can't imagine that this would be faster than
using a load/store architecture with respect to stash accesses; the overhead
of the multi-level hash-table lookups is much greater than that of the
op-code.
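
Spelled out (in sketch C, with a hypothetical stash_lookup helper), each
such op would have to do something like:

PMC* stash_lookup(struct Parrot_Interp*, const char *name); /* hash lookup */

void op_add_globals(struct Parrot_Interp *interp) {
    PMC *y = stash_lookup(interp, "g_y");    /* stash lookup #1 */
    PMC *z = stash_lookup(interp, "g_z");    /* stash lookup #2 */
    PMC *x = stash_lookup(interp, "g_x");    /* lookup #3, for the write */
    x->vtable->set_integer(x, y->vtable->get_integer(y)
                            + z->vtable->get_integer(z));
}
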
What I think I hear you saying is that S2S would be storing into some sort
of memory frame outside of the register set.  But this adds a bunch of
complexity.  How does the GC sweep across a huge VM heap (as opposed to
walking the register / stash maps)?  Further, we'd have to have a notion of
the relative addresses of these objects (aren't they relocatable?).
Assuming the exclusive use of ld/st stash access for globals (which, due to
possible stash renaming/remapping, I think is a fair assumption), the only
reason for a scalar scratchboard would be for lexical locals.  But that
could be addressed via an indexable stack, which is probably most efficient
with a load/store architecture anyway (due to the multitude of stacks).

I spoke exclusively of scalars (PMCs) because I had an interesting time
considering static datatypes.  It's obvious what happens when we declare:
for my int $idx ( 0 .. 100 ) {
}
We get an integer register variable which maps directly to add / extraction
calls.  But what about:

our int $g_idx;

What's its default value?  Zero?  Where does it live?  If you allow the
storage of pure integers in a stash, then you'll have to have a stash
structure like this:

struct stash_t {
  ...
  PMC sv_val;
  PMC av_val;
  PMC hv_val;
  PMC cv_val;
  PMC io_val;
  ...
  IV i_val;
  NV  n_val;
  SV  str_val;
};

But then when you try to retrieve a global stash symbol, how do you
distinguish between i_val and sv_val?  Or even str_val from sv_val?  If
perl6 were pre-compiled with access to all module code (not likely, due to
late binding), then it could be optimized to make special access requests
(knowing the prototype of $i), but this doesn't allow for dynamic
modification of the stash-space.  We are, after all, a high-level language,
not java.  Further, this would be completely unused in perl5 execution.

Best I can figure is that we can't store non-PMCs in stashes.  Which also
means that we don't store non-PMCs anywhere except the stacks; they're thus
purely for lexical values.

Anyone have a vision that conflicts with this?

Back to the original discussion: I'm thus not seeing a benefit for S2S
here.

-Michael



Re: Parameter passing conventions

2001-10-29 Thread Michael L Maraist


  A split between local, marginal, and global registers would be an
  interesting thing to do, and I can see it making the code more elegant. I
  worry about it making things more complex, though, especially with us
  already having multiple register types. (We'd double or triple the number
  of register types essentially, and to some extent blow cache even more
  than we do now. Might be a win in other ways, though. I'll have to ponder
  a bit)

 Yeah, I didn't like the idea of proliferating that more either. I still
 sometimes dream about a single register file of N regs into which we can
 put whatever we want. Each block of registers has room for the reg
 contents and the type info too. Seems you've got some of the support for
 that figured out in the stack already. Just declare that either (a) it
 is illegal (or behavior undefined) to do

   set $2, 5
   set $3, "foo bar"
   add $1, $2, $3

 [just because we have higher-level data types than a real machine
 doesn't mean we can't still have general-purpose registers, I think]

 or (b) that if you do something numeric with a register that is
 non-numeric type mucking happens behind the scenes and throws an
 exception if there is a problem. Certainly this wouldn't be surprising
 to anyone who had been looking at what we do with PMCs and arithmetic
 ops.

 If we ever did move to such a single-register-file model, I'd support
 looking seriously at the calling conventions of MMIX to see if we can
 get the appropriate performance characteristics. And, BTW, we have
 4*32 = 128 regs now. We could even match the logical register count
 of MMIX (256) with only a doubling of total register count. And, if
 we ever determined we needed another kind of register (such as one
 that can be used for address arithmetic, since INTVAL doesn't cut it),
 we wouldn't have to add a fifth file, we'd just add another type
 (thinking again about the stack implementation).


After reading the entire MMIX chapter, my mind went back and forth.  First
of all, the only reason 256 registers were used was the byte-aligned
register arguments to the op-codes (plus the modern intuition that more
registers = faster execution).  Currently we have 4B arguments and don't
perform boundary-condition checks, so this is neither here nor there.
Currently we translate P1 to:

interpreter->num_reg->registers[ cur_opcode[1] ]

Which requires 4 indirections.  (Multiple Px instances should be optimized
by gcc so that subsequent accesses only require 2 indirections.)  To avoid
core-dumping on invalid arguments, we could up the reg-set to 256 and
convert the above to:

interpreter->num_reg->registers[ (unsigned char)cur_opcode[1] ]

and adjust the assembler accordingly.  Alternatively, to work with 32 regs,
we'd have:

interpreter->num_reg->registers[ cur_opcode[1] & 0x001F ]

As for MMIX: I don't see a need for globals, since we're going to have
various global symbol stashes available to us.  Further, I don't see value
in providing special trapping code to return zero when reading from a
marginal register, or to extend the local variable space when writing.  For
writing, an explicit reserve (once (or less) per function call) shouldn't be
too much bother.  And if the code is silly enough to write to or read from
this marginal region, then we'll pretend they're using uninitialized values.
Further, the reserve op must require that there is enough space to fully
utilize the n-register set (currently 32), so that set $r31, 5 doesn't spill
into the tail of the register set (since that was previously handled by the
trapping code).  This modifies the MMIX spec such that we potentially waste
up to 31 register slots in the register window (which is trivial when we
have a window of size = 1024).

The rolling register set can be accomplished via three methods.  First,
realloc the register stack every time an extend exceeds the size (this
sucks for recursive functions).  Second, use paged register sets and copy
values during partial spillover (very complex).  Lastly, utilize a [pow2]
modulus on a fixed-size register stack.  This has a very interesting
implication: that we completely do away with the (push|pop)[ipsn] ops and
their associated data structures.  Thus the P1 translation becomes:

// (for a 1K rolling stack)
#define STACK_MASK 0x03FF
interpreter->num_reg[ ( interp->num_offset + cur_opcode[1] ) & STACK_MASK ]

This has 4 indirections and two integer ops.  As above, for multiple uses of
Px in an op-code, this should be optimized to 2 indirections.  Since
indirections are significantly slower than bitwise logical operations, this
should be roughly equivalent in speed to our current interpreter.  If we
were hell-bent on speed, we could utilize STACK_SIZE-aligned memory regions
and perform direct memory arithmetic, as with:

interp->x_reg_base = [ ... 1K chunk of memory aligned to a 1K boundary ... ]
interp->x_reg = interp->x_reg_base + interp->x_offset

#define P1  *(int*)(((int)( interp->x_reg

Re: Are threads what we really want ???

2001-10-24 Thread Michael L Maraist

On Wednesday 24 October 2001 08:28 pm, you wrote:
 I’ve just spent some time looking through the various RFC’s for Perl 6.

 IMHO: The various proposals dealing with multi threading, synchronization
 and RPC can be solved in a different way.

 Instead of thinking about multiple threads, one could think about multiple
 execution contexts. Each instance of an object must belong to one and only
 one execution context. Each execution context has an attached security
 context and a security manager.

 When making a call from an object in one execution context to a method of
 an object in another execution context the security context of the caller
 is passed to the receiver an validated by it’s security manager before
 being allowed to execute. Asynchronous execution could be implemented using
 “oneway” functions and “future” variables and a combination of both.
 “future” variables would allow for returning results from “oneway”
 functions, and for turning synchronous functions into “oneway” functions.

 “deferred” methods would also be very nice. A deferred method is something
 that would be executed at some later time when the current execution
 context is idle.

 If the following keywords were added to Perl:
 1.abandon
 2.asynch[ronous]
 3.at
 4.cancel
 5.deferred
 6.future
 7.oneway

[snip]


 I'm well aware that most of this functionality can be achived by other
 means - but I think a language and runtime level solution would be most
 elegant and portable. My choice of syntax and keywords are for illustration
 purposes only.

Well, from the sidelines, I have a couple of comments.  First, parrot is
supposed to be language-agnostic, so such an implementation would go unused
in python / perl5 lingo.  Perhaps there could be support modules bolted on,
who knows.  However, this sort of RFC is probably a bit too late
(ironically) to make any headway, given that we're in the post-RFC stage.

As for the technical feasibility: I'm not sure what direction you're trying
to take.  You speak as if this were an alternative to a threading model, but
you don't talk about fake threading or even dispatch-driven code (which I'm
sure is already ruled out in the interest of mainstream performance).  Thus
I'm not quite sure what your vision of the implementation is.  The
pseudo-perl code is really only syntactic sugar.  Again, parrot doesn't care
what perl6 does, only what perl6 needs at a fundamental level.  If you want
to try again in terms of op-codes, then you might make some more headway.

In general, however, IPC is not an issue parrot is concerned with.  As far
as I understand, it won't even be part of the core, but will instead live in
dynamically loaded system modules.  I believe the real push for threading is
less about IPC (or ITC, as the case may be) than about providing
state-of-the-art programming capability in a reliable (and
performance-minded) manner.  Apache 2.0 will be MT, and perl5 is not, so
there's going to be a rift if at least an MT mod_perl can't be relied upon.
The MT has to co-exist with C libraries (a la DBD drivers), and it seems it
will be very close to a posix variant (with locking on marginally shared
structures).  I think a litmus test is going to be MT DBD support.
(Currently many CPAN modules refuse to compile alongside perl5.005 threads.)

Nonetheless, separate execution contexts are exactly what seems to be going
forward.  Each thread will have (at least) one parrot interpreter data
structure.  I'm assuming this implies completely separate name-spaces.  So
long as there are well-thought-out interfaces between the threads, any such
IPC that you refer to can be implemented on top.

In fairness though, I've been partial to the QNX real-time OS, which
utilizes at its lowest level a message-passing architecture (not too
dissimilar in fundamental concept from what you're suggesting).  The
advantage was that these messages could be relayed/tunneled anywhere
(including other processes or other networks).  But the messages were hidden
behind system calls (as with TCP/IP-based messages), so performance wasn't
hindered too badly.  That said, I'm not seeing an advantage that outweighs
the specialization complexity of implementing what I do understand of your
RFC.

-Michael



Re: memory allocation paper (Courtesy of Alan)

2001-10-03 Thread Michael L Maraist

Uri Guttman wrote:

  DS == Dan Sugalski [EMAIL PROTECTED] writes:

 off list

   DS Ask has found us a spot for the paper Alan was speaking of.

   DS http://dev.perl.org/perl6/talks/

 very impressive paper. my recent proposal for a multi-queue malloc
 system has some of the ideas of this but nowhere nearly as fully
 developed. if we can drop in the libumem implementation directly, i
 think it will be a major win. it doesn't look too difficult for us to
 write our own version (with some parrot centric features maybe). the
 thread (vs. cpu) support looks very good too. i like how they get
 constant or linear time results.

 uri

I was fascinated by the concepts and implications, so I wrote several test
utilities on a 466MHz dual-processor Celeron running Linux RedHat 7.1.

The paper shows a benchmark, based on a 10-CPU Solaris machine, that is a
large loop of allocs and frees (of an unspecified size, assumed to be 4
bytes).

The paper shows single-threaded throughput of 0.5 mega-cycles / second,
growing to an upper bound of 3 Mc / second.

I ran several combinations of a simple loop of:

thr_main:
  for i .. size:
ptr = alloc(4); *ptr = 1; free(ptr);

main:
  for i .. num_threads:
pthread_create( thr_main )
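
(A runnable C version of that sketch, for the curious; the thread cap and
iteration count are arbitrary:)

#include <pthread.h>
#include <stdlib.h>

#define SIZE 10000000L                    /* iterations per thread */

static void* thr_main(void *unused) {
    long i;
    for (i = 0; i < SIZE; i++) {
        int *ptr = malloc(sizeof *ptr);   /* the 4-byte alloc from the paper */
        *ptr = 1;                         /* touch it so it isn't optimized away */
        free(ptr);
    }
    return NULL;
}

int main(int argc, char **argv) {
    int i, num_threads = (argc > 1) ? atoi(argv[1]) : 1;
    pthread_t tid[64];
    if (num_threads > 64) num_threads = 64;
    for (i = 0; i < num_threads; i++)
        pthread_create(&tid[i], NULL, thr_main, NULL);
    for (i = 0; i < num_threads; i++)
        pthread_join(tid[i], NULL);
    return 0;
}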

The first thing to note is that using glibc's non-MT malloc I achieved 2.67
mega-cycles / second.  Using the perl5 memory allocator (non-MT), I achieved
8.4 mega-cycles / second.  In both cases, using pthreads (via uncommenting
thread-specific code and linking -lpthread), the throughput fell to 0.978M /
second (for both perl5 and glibc).  Perhaps I'm not successfully mapping
alloc to perl's - I tried various alternative names for the allocator (such
as New, Perl_xxx, etc).

I obviously didn't have libumem to play with, but I did want to see what the
lower-bound was for the various operations.  I found some strange results.

First I wrote an unlocked void* MyAlloc(int) / void MyFree(void*) pair which
simply used a stack for allocation.  It wasn't a valid memory allocator,
since it assumes that you free in reverse order, but it sufficed.  It
achieved 20 million cycles / second.

Next I added mutexes, which slowed the single thread down to 1.3 mega-cycles
/ second.

The next thing I did was to split resources into separate arrays which are
accessed via a hashing of the thread-id.  This required (int)pthread_self().
In the existing implementation pthread_t is just an int, but it doesn't give
me warm fuzzies doing this (how portable could that be?).  Strangely enough,
Linux's thread-ids are wildly jumping numbers (starting at over a thousand,
with at least a thousand numeric values separating each).  The hash was a
simple masking of the rightmost several bits (3 in my test case, since I
wasn't going to test more than 10 simultaneous threads).
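
The slot/padding trick, roughly (names invented; the pthread_self() cast is
the same portability sin noted above):

#include <pthread.h>

#define SLOT_BITS 3                 /* 8 slots, enough for ~10 threads */
#define PAD       16                /* the thread-id "multiplier" */

static int counters[(1 << SLOT_BITS) * PAD];

static int* my_slot(void) {
    unsigned slot = (unsigned)pthread_self() & ((1u << SLOT_BITS) - 1);
    return &counters[slot * PAD];   /* PAD ints apart: no cache-line sharing */
}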

Very quickly I found cache-line contention to be a problem (just as the
paper suggested).  Since I was using separate arrays of integers, the cache
lines were hopelessly shared.  My interim solution was to multiply the
thread-id by some value (8 or 16 work well), at the expense of unused array
elements.  The following are some benchmarks (that I can still read from my
scratch sheets):

num-threads | multiplier=16 (padded) | multiplier=2 | mutexes, multiplier=16
1 | 1.52s | 1.52s  | 8.05s
2 | 8.25s | 8.6s   | 16.14s
3 | 5.6s  | 12.99s | 16.18s
4 | 3.86s | 17.23s | 21s
5 | 5.04s |        | 23.19s
6 | 6.7s  |        |
7 | 6.64s |        |

One comment: the left-most data-set has some wild swings; the more common
occurrence was something like:

1 | 1.52s | 6.4 Mc/s
2 | 8s    | 2.5 Mc/s
3 | 5s    | 6 Mc/s
4 | 5.5s  | 7.2 Mc/s
5 | 6s    | 8.3 Mc/s
6 | 6.5s  | ...
etc

The above are from the top of my head, based on the types of numbers I've
seen after hundreds of tweaks and unrecorded runs.

Two things seemed true.  First, the more threads that were used, the more
efficient it became.  Secondly, the second thread always seemed to provide
the slowest / least-efficient operation, without exception.  I can't explain
this.  It seemed impervious to cache-line-width padding.  Note in the
padding-times table that two threads took roughly the same time for both the
2x and 16x thread-id multipliers.  With only a 1 or 2x multiplier, the
run-time for 2 threads seems to fall in line with the other thread counts.
The 1-thread operation is obviously faster because there is no cache-line
contention.  I'm curious to learn if this is a general problem with Linux,
x86, the dual-processor Celerons, or something more subtle.

At this point I was using a two-level memory manager.  The first level was a
thread-specific magazine allocator.  In this primitive memory system there
was no need for data / previous as described in the Sun paper, and instead
a single magazine array /
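For readers without the Sun paper handy, a minimal sketch of the per-thread
magazine idea being described (my own illustration; depot_refill and
depot_flush are hypothetical names standing in for the shared, locked layer):

#define MAG_SIZE 16

typedef struct {
    void *rounds[MAG_SIZE];   /* cached free objects, thread-private */
    int   count;
} Magazine;

void *depot_refill(Magazine *m);          /* slow path: takes the depot lock */
void  depot_flush(Magazine *m, void *p);  /* slow path: returns a full mag   */

void *mag_alloc(Magazine *m) {
    if (m->count > 0)
        return m->rounds[--m->count];     /* common case: no locking at all */
    return depot_refill(m);
}

void mag_free(Magazine *m, void *p) {
    if (m->count < MAG_SIZE)
        m->rounds[m->count++] = p;        /* common case: no locking at all */
    else
        depot_flush(m, p);
}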

Re: [PATCH] print_s_v op (was: RE: variable number of arguments)

2001-09-26 Thread Michael L Maraist

Gregor N. Purdy wrote:

 Michael --

  I had more time to think about it, and I determined how a compute op-code
  could be efficient.
 
  [snip]

 You wicked, wicked person! :)

 I'd like to see some benchmarks on that one vs. the most efficient
 possible hand-coded separate ops for moderate to complex arithmetic...
 These sorts of ops should be possible without them being in the core
 once we have loadable oplibs, as long as the assembler knows how to
 lay them out...

 BTW, the ops should probably be:

   * compute_i_s_v
   * compute_i_sc_v
   * compute_n_s_v
   * compute_n_sc_v

  /** FMT: compute v_arg_size, destination-register, fmt_const_string,
  ( list of reg-indexes )

 The nargs operand is spliced in in place of the 'v'

 Regards,

I didn't have the var-args available, so I wrote a hack: functions with 4, 8
and 16 register parameters (from the above) which were just aliases of the
same function, which didn't care (because it was driven by the format field).
I wrote a dozen or so variations of the function 'calculate'.  After a lot of
experimenting I determined that there is not much value in providing a
general stack-machine sub-op-code engine (with push, pop, dup, and swap).  The
performance trails off as the number of symbols increases (since 50% of the
operations are pushes).  My attempts at combining multiple chars in a switch
statement had marginally positive effects under optimal conditions, but
performed slower in sub-optimal conditions.

I later wrote two variations.  One was a semi-simple divide-and-conquer summer:

/* sum Pdest, arg-count, arg1, arg2, ... */
AUTO_OP sum_i_i_i_i_i_i {
  unsigned int cnt = P2 - 1;
  opcode_t* args_offset = cur_opcode + 3;

  int accum = INT_REG( *args_offset++ );

  while (cnt) {
    if ( cnt >= 8 ) {
      /* indexed reads: bumping args_offset several times inside one
         expression would be undefined behavior in C */
      accum += INT_REG( args_offset[0] ) +
               INT_REG( args_offset[1] ) +
               INT_REG( args_offset[2] ) +
               INT_REG( args_offset[3] ) +
               INT_REG( args_offset[4] ) +
               INT_REG( args_offset[5] ) +
               INT_REG( args_offset[6] ) +
               INT_REG( args_offset[7] );
      args_offset += 8;
      cnt -= 8;
    } else if ( cnt >= 4 ) {
      accum += INT_REG( args_offset[0] ) +
               INT_REG( args_offset[1] ) +
               INT_REG( args_offset[2] ) +
               INT_REG( args_offset[3] );
      args_offset += 4;
      cnt -= 4;
    } else if ( cnt >= 2 ) {
      accum += INT_REG( args_offset[0] ) +
               INT_REG( args_offset[1] );
      args_offset += 2;
      cnt -= 2;
    } else {
      accum += INT_REG( *args_offset++ );
      cnt--;
    }
  }
  INT_REG( P1 ) = accum;
}

This performed the best in all tests.  Note that I only went down to n=4;
obviously there would be a performance hit for anything less.  Unfortunately,
without a format it's impossible to mix and match constants with registers,
or even integers with numbers.  You'd need a separate sum_int and sum_num.

The other function is a general accumulator-based function which uses a format
to figure out not only the type of operation (add, sub, mul, div, mod, and,
or, xor, land, lor, lxor, not), but also the operand format (int-reg,
int-const, num-reg, num-const).  By using a single character, performance is
maintained (though readability suffers somewhat).  What follows is mostly what
I used for my tests.


AUTO_OP compute_i_i_s_i_i_i_i { // note: should be I_i_s_v
   STRING *s = Parrot_string_constants[P3];

   opcode_t* args_offset = cur_opcode + 4;

   char * ops = s->bufstart;

   /* do bounds checking */

   // seeding the accumulator from the first argument is probably not always
   // what we want, but anything else costs too much
   int accum = INT_REG( *args_offset++ );

   while (*ops) {
     switch( *ops ) {
       case '+':
         accum += INT_REG( *args_offset++ );
         break;
       case '-':
         accum -= INT_REG( *args_offset++ );
         break;
       case '*':
         accum *= INT_REG( *args_offset++ );
         break;
       case '/':
         accum /= INT_REG( *args_offset++ );
         break;
       case '%':
         accum %= INT_REG( *args_offset++ );
         break;

       case 'a':
         accum += *args_offset++;
         break;
       case 's':
         accum -= *args_offset++;
         break;
       case 'm':
         accum *= *args_offset++;
         break;
       case 'd':
         accum /= *args_offset++;
         break;

       case 'A':
         accum += NUM_REG( *args_offset++ );
         break;
       case 'S':
         accum -= NUM_REG( *args_offset++ );
         break;
       case 'M':
         accum *= NUM_REG( *args_offset++ );
         break;
       case 'D':
         accum /= NUM_REG( *args_offset++ );
         break;

       default:
         printf("Unsupported calculate-opcode %c. Exiting\n", *ops);
         exit(-1);
     }
     ops++;
   }
   INT_REG( P2 ) = accum;
}
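As a hedged illustration of the intended assembler usage (operand order
inferred from the code above, not tested syntax; the leading arg-count operand
is my assumption about the eventual _v form):

# computes I2 = ((I0 + I1) - I3) * I4
# operands: arg-count, destination, format constant, argument registers;
# the accumulator is seeded from the first argument (I0)
compute 4, 2, "+-*", 0, 1, 3, 4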

A summary of the results for 6 million iterations of each of the following on
a 466MHz Celeron under Linux:

                  | n = 4            | n = 16
sum               | 21s (18% faster) | 23s (78% faster)
format-calculated | 22s (9% faster)  | 39s (47% faster)
manual-adds       | 23s              | 63s
null-loop         | 12s              |

Re: Strings db

2001-09-25 Thread Michael L Maraist

Wizard wrote:

 Michael Maraist [mailto:[EMAIL PROTECTED]] wrote:
  Does you idea allow for:
  int msgid = txtToMsgid( This feels strange\n );
  char *label = msgidToRes( msgid );

 I'm not sure that I understand your question. This is not my idea, but GNU's
 gettext tools. I, myself, am not thrilled with this implementation, and
 would suggest using some other implementation (either our own or someone
 else's).
 Grant M.

You quoted something similar to my text above and said you didn't like it; I
believe mostly because it involved reading external files, but also because of
the concept of the message-id.  My point was that there are efficiency
reasons to want a message-id, so it should be possible to have two APIs: one
that goes through the id, and one that's direct (which is simply a wrapper
around the former).
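To make that two-level shape concrete, a minimal sketch (only the txtToMsgid /
msgidToRes names come from the quoted example above; the rest is
illustration):

/* the msgid-based API: resolve the id once, then do cheap lookups */
int         txtToMsgid( const char *txt );   /* hash or table lookup */
const char *msgidToRes( int msgid );         /* direct index into resources */

/* the direct API is simply a wrapper around the msgid one */
const char *txtToRes( const char *txt ) {
    return msgidToRes( txtToMsgid( txt ) );
}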

I realize that you were merely suggesting that we should be able to develop
something; I was just commenting on performance.

-Michael




Re: [PATCH] print_s_v op (was: RE: variable number of arguments)

2001-09-25 Thread Michael L Maraist

Michael Maraist wrote:

  All --
 
   I've created a varargs-ish example by making a new op, print_s_v.
   This is pretty rough, and I haven't updated the assembler, but it
   seems to work.


 With var-args, we could produce highly efficient SIMD instructions.
 printf obviously, multi-push/pop, createArrays/ Hashes.

 I toyed with the idea of a stack-based ALU to handle algebra such as:

 ( ( a + b*c ) / d + e ) / ( f + g + h )

 // compute dest, fmt, register-names
 compute Rd, ++^*+/+, f, g, h, a, b, c, d, e

 Unfortunately I couldn't find a way of garunteeing that it's efficient.
 The best I've come up with so far is to use a mini-switch statement.  The
 only reason it might be faster than an optimized switch-statement over all
 op-codes is that the switch has fewer terms.

 The above case also gets complex if we don't use multiple registers for
 intermediate calculations.

 None the less the possibilities are cool.


I had more time to think about it, and I determined how a compute op-code
could be efficient.
Pros:
o First of all, there is no need to assign to temporary registers (as would
be necessary for the compiler).
o Secondly, I'm under the impression that compilers convert algebraic
expressions to reverse polish notation anyway (to reduce the temporaries);
though parallel computing is an obvious exception (when multiple ops can be
executing simultaneously).  This theoretically simplifies the job of the
compiler.
o Third, only a single register access is required for each parameter
(where-as an expression might otherwise require multiple accesses).
o Fourth, it's possible to check for a subset of two or more adjacent
operations, such as a run of '+' ops, which would quickly add four numbers
with n-1 physical adds ('accumulator = a + b + c + d + e') which the physical
CPU can then optimize for parallel op-code execution.  Or ^+, which normally
pushes a stack element, adds the top two, then pops; here it just adds to the
top element.  To prevent d^n number of operations, only common ones are
defined.  The default block of the switch checks for individual operations.
o Fifth, the op-code is a lot denser: (16 + 6 * var) versus (16 * n).
In inner loops, this could add up with cache savings (don't pay retail).

Cons:
o The overhead of an op-code is probably not much greater than the total
overhead of this instruction (especially with a switch-based op-dispatcher
and -O2 compilation)
o Potentially long op-chains are possible which could thwart responsiveness
to synchronous event-handling. (Unless there is a max-fmt size)
o Potential portability issues if we deal with 16-bit shorts.  Might have to
require 32-bit ints as the smallest macro-symbol

/** FMT: compute v_arg_size, destination-register, fmt_const_string,
 *       ( list of reg-indexes )
 ** contents of fmt_string:
 *  +  // add top two stack el's
 *  -  // sub ..
 *  *  // mul ..
 *  /  // divide ..
 *  %  // take modulus
 *  &  // logical and
 *  |  // logical or
 *  P  // push register onto local stack
 *  C  // push constant onto local stack
 *  D  // duplicate top stack element
 *  S  // switch top two elements
 *  \0 // end-of-fmt-string; must be word-aligned with trailing eofs,
 *     // e.g. "PPP++\0\0\0"
 **
 * Note that the above symbols should probably be part of a c-enum so that
 * the switch statement can be optimized
 **/
AUTO_OP computei_varg {
  static int stack[MAX_STACK]; // one larger than the max allowable varg_size
  unsigned int top = 0; // points to safe location

  STRING *s = Parrot_string_constants[P3]; // the fmt string (P3 per the FMT
                                           // comment above; P2 is the result)
  int args_offset = 4; // index of the first parameter in the opcode stream,
                       // after the op and three fixed operands (could have
                       // just been word_align( strlen(code[4]) ))
  short * dops = (short*)s->bufstart; // get format, a pair of chars at a time

  int f_first_two_op;
  char * sops; // single_char_ops

  // sanity check against varg_size
  if ( code[1] > MAX_STACK )
    die "invalid size-of operation";

  stack[ 0 ] = 0; // null-value (should never be read)
  stack[ 1 ] = 0; // initially the stack contains null

  // compare pairs of op-codes (theoretically could nest this into the
  // default statement of a 4-op comparison)
  while (*dops && top < MAX_STACK) {
    switch( *dops ) {
      case PUSH_PLUS: // P+ or ('^' << 8) + '+'
        stack[top] += INT_REG( code[ args_offset++ ] );
        break;
      case CONST_PLUS: // C+
        stack[top] += code[ args_offset++ ];
        break;
      case PLUS_PLUS: // ++ : fold the top two elements into the one below
        stack[top-2] += stack[top-1] + stack[top];
        top -= 2;
        break;
      ...
      default: // fall back to the single-character case for unrecognized pairs
        f_first_two_op = 0;
        sops = (char*)dops;
        while (!f_first_two_op++) {
          switch( *sops ) {
            case '+': ...
            case '-': ...
            ...
            default: die "invalid op-code";
          } // end 1char-switch
        } // end 1char-while
    } // end 2char-switch
    dops++; // advance to the next op pair
  } // end 2char-while

  if ( top )
    INT_REG( P2 ) = stack[top];
  else
    die "invalid number of pushes";
} // computei_varg

-Michael




Re: question about branching/returning

2001-09-20 Thread Michael L Maraist

The internal do-op loop runs until it sees a return value of zero from an
op-code.  The RETURN statement within basic_opcodes.ops is really a keyword
which gets parsed into an offset from the current PC, based on the internally
calculated size of the instruction (known at Configure time from
opcode_table).  Of course, this is based on my reverse engineering of the
existing code.  I'm sure all of this is subject to change.
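In sketch form, the loop behaves something like this (my paraphrase of the
behavior described above, not actual Parrot source; op_table and the function
signature are assumptions):

/* runops keeps dispatching until an op returns 0; a RETURN n inside an
   op body expands to returning the PC advanced by that op's size */
opcode_t *pc = bytecode_start;
while (pc) {
    pc = op_table[*pc](pc, interpreter);  /* each op returns the next PC, or 0 */
}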

As a note, another poster said that end is never executed, which isn't
currently true.

-Michael

Dave Storrs wrote:

 Ok, that was pretty much what I thought.  But then what is the 'end'
 opcode for?  It does a 'RETURN 0', which would increment the PC by 0
 opcodes...which either counts as an infinite loop or a no-op, and we've
 already got a no-op op.





Re: Parrot multithreading?

2001-09-20 Thread Michael L Maraist

Arthur Bergman wrote:

 In an effort to rest my braine from a coredumping perl5 I started to think a bit on 
threading under parrot?

 While it has been decided that perl should be using ithread like threading, I guess 
that is irelevant at the parrot level. Are you
 going to have one virtual cpu per thread with it's own set of registers or are you 
going to context switch the virtual cpu?

 If it was one virtual cpu per thread  then one would just create a new virtual cpu 
and feed it the bytecode stream?

 Is there anything I could help with regarding this?

 Arthur

The context is almost identical to that of Perl5's MULTIPLICITY, which passes
the perl interpreter to each op-code.  Thus there is inherent support for
multiple ithread-streams.  In the main loop (between each invoked op-code)
there is an event-checker (or was, in older versions at any rate).  It doesn't
do anything yet, but it would make sense to assume that this is where context
switches would occur, which would simply involve swapping out the current
pointer to the perl context; a trivial matter.

The easiest threading model I can think of would be to have a global var called 
next_interpreter which is always loaded in the
do-loop.  An asynchronous timer (or event) could cause the value of next_interpreter 
to be swapped.  This way no schedule
function need be checked on each operation.  The cost is that of an extra indirection 
once per op-code.
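A minimal sketch of that model (my illustration; runops, do_op, and
pick_next_interpreter are assumed names, not existing Parrot API):

struct Parrot_Interp;
opcode_t *do_op(opcode_t *pc, struct Parrot_Interp *interp);
struct Parrot_Interp *pick_next_interpreter(void);

static struct Parrot_Interp *volatile next_interpreter;

void runops(opcode_t *pc) {
    while (pc) {
        /* the one extra indirection per op: reload the current interpreter */
        struct Parrot_Interp *interp = next_interpreter;
        pc = do_op(pc, interp);
    }
}

/* an async timer or event performs the "context switch" by simply swapping
   the pointer -- no schedule() call on each operation */
void on_timer_tick(void) {
    next_interpreter = pick_next_interpreter();
}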

True MT code simply has each thread use its own local interpreter instance.
MT code is problematic with non-MT-safe extensions (since you can't enforce
that).

In iThread, you don't have a problem with atomic operations, but you can't
take advantage of multiple CPUs, nor can you guarantee prevention of
IO blocking (though you can get sneaky with UNIX select).

-Michael




Re: Parrot multithreading?

2001-09-20 Thread Michael L Maraist



 What we're going to do is fire up a new interpreter for each thread. (We
 may have a pool of prebuilt interpreters hanging around for this
 eventuality) Threading *is* essential at the parrot level, and there are
 even a few (as yet undocumented) opcodes to deal with them, and some stuff
 that's an integral part of the variable vtable code to deal with it.
 Whether it's considered ithread-like or not's up in the air--it'll probably
 look a lot like a mix.

 I'm also seriously considering throwing *all* PerlIO code into separate
 threads (one per file) as an aid to asynchrony.


Just remember the cost in context switching, plus the lack of scalability as
the number of file-handles increases.  Linux thread context switches are
relatively brutal compared to, say, Solaris.  Additionally, you're consuming a
new stack area for each file-handle.  That's a lot of overhead.

There are bound to be semi-portable methods of non-blocking IO.  UNIX select
has to have an equivalent on NT; granted, it's a lot more complicated.
Basically you could have IO ops trigger an event-based ithread swap, which
causes select to be invoked.  I've always thought this was the most efficient
model for single-CPU machines.  The biggest problem was always segmenting
one's code into callbacks.  Well, with op-codes we have a natural division: a
tight inner loop that occasionally hits a dispatcher, on a complexity level
similar to a GUI.  That makes it much easier to handle event-based operations
(which means higher-level languages can be built atop such a parrot design).
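As a sketch of the select-based piece (my illustration, not proposed Parrot
code; fd_ready is an invented helper name):

#include <sys/select.h>

/* Returns nonzero if fd is readable right now; the op loop can call this
   instead of blocking, and park the waiting ithread when it returns 0. */
static int fd_ready(int fd) {
    fd_set readers;
    struct timeval tv = { 0, 0 };   /* poll: never block the op loop */

    FD_ZERO(&readers);
    FD_SET(fd, &readers);
    return select(fd + 1, &readers, NULL, NULL, &tv) > 0;
}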

Food for thought.

-Michael




Re: SV: Parrot multithreading?

2001-09-20 Thread Michael L Maraist

Arthur Bergman wrote:

  Arthur Bergman wrote:
 
   In an effort to rest my braine from a coredumping perl5 I started to think a bit 
on threading under parrot?
  
   While it has been decided that perl should be using ithread like threading, I 
guess that is irelevant at the parrot level. Are you
   going to have one virtual cpu per thread with it's own set of registers or are 
you going to context switch the virtual cpu?
  
   If it was one virtual cpu per thread  then one would just create a new virtual 
cpu and feed it the bytecode stream?
  
   Is there anything I could help with regarding this?
  
   Arthur
 
  The context is almost identical to that of Perl5's MULTIPLICITY which passes the 
perl-interpreter to each op-code.  Thus there is
  inherent support for multiple ithread-streams.  In the main-loop (between each 
invoked op-code) there is an event-checker (or was in
  older versions at any rate).  It doesn't do anything yet, but it would make sence 
to assume that this is where context-switches
  would occur, which would simply involve swapping out the current pointer to the 
perl-context; A trivial matter.

 Uhm, are you talking perl 5 here? The event checker checks for signals, we got safe 
signals now.

There wasn't any code for CHECK_EVENTS w/in Parrot when I first read the
source code.  I merely assumed that its role was not yet determined, but
considered the possible uses.  CHECK_EVENTS seems to be gone at the moment, so
it's a moot point.


 MULTIPLICITY is just allowing multiple interpreters, ithreads is letting them run at 
the same time and properly clone them. If you want to use it switch interpreters at 
runtime for fake threads, patches are welcome, send it and I will apply it.



  The easiest threading model I can think of would be to have a global var called 
next_interpreter which is always loaded in the
  do-loop.  An asynchronous timer (or event) could cause the value of 
next_interpreter to be swapped.  This way no schedule
  function need be checked on each operation.  The cost is that of an extra 
indirection once per op-code.
 
  True MT code simply has each thread use it's own local interpreter instance.  
MT-code is problematic with non MT-safe extensions
  (since you can't enforce that).

 I am sorry to say, but perl 5 is true MT.

Yes, but that feature never got past being experimental.  I know of a couple
of DBDs that would not let you compile XS code with MT enabled, since they
weren't MT-safe.  The interpreter can be built MT-safe (Java is a good
example), but extensions are always going to be problematic (especially when
many extensions are simply wrappers around existing non-MT-aware APIs).  I
think a good solution would be to treat it like X does (which says you can
only run X code w/in the main thread).  An extension could say whether it was
MT-safe or not, and be forced to be serialized w/in the main physical thread,
which becomes the monitoring thread.  An alternative would be to simply have
XS code compile in a flag which says to throw an exception if the code is run
outside of the main thread; documentation would emphatically state that it's
up to the user to design the system such that only the main thread calls it.
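In sketch form (entirely hypothetical names, just to illustrate the flag idea
above; throw_exception stands in for whatever mechanism parrot ends up with):

#include <pthread.h>

#define EXT_MT_SAFE 0x01         /* extension declares itself thread-safe */

void throw_exception(const char *msg);  /* hypothetical */

static pthread_t main_tid;       /* recorded at interpreter start-up */

/* Called on entry to every extension; non-MT-safe extensions may only
   run in the main physical thread. */
static void ext_thread_guard(unsigned ext_flags) {
    if (!(ext_flags & EXT_MT_SAFE) &&
        !pthread_equal(pthread_self(), main_tid)) {
        throw_exception("extension is not MT-safe outside the main thread");
    }
}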

On the side, I never understood the full role of iThreads w/in perl 5.6.  As far as I 
understood, it was merely used as a way of faking fork on NT by running multiple 
true-threads that don't share any globals.  I'd be curious to learn if there were 
other known uses for it.



  In iThread, you don't have a problem with atomic operations, but you can't take 
advantage of multiple CPUs nor can you garuntee
  prevention of IO-blocking (though you can get sneaky with UNIX-select).
 

 Where did you get this breaking info? ithread works with multiple CPUs and IO 
blocking is not a problem.

 Arthur

I'm under the impression that the terminology for iThreads assumes
independence from the physical threading model.  As other posters have noted,
there are portability issues if we require hardware threading.  Given the
prospect of falling back to fake threads, multi-CPU use and IO blocking are
problematic, though the latter can be avoided / minimized if async IO is
somehow enforced.  From my scarce exposure to the Linux Java movement,
green threads were considered more stable for a long time, even though the
porters were just trying to get things to work on one platform.

I would definitely like hardware threading to be available.  If nothing else,
it lets students taking Operating Systems experiment with threading w/o all
the headaches of C.  (Granted there's Java, but we like perl.)  However, I'm
not convinced that threading won't ultimately be restrictive if used for
general operation (such as for the IO subsystem).  I'm inclined to believe
that threading is only necessary when the user explicitly wants it (e.g.
requests it), and that in many cases fake threads fulfill the basic desires of
everyone involved