Re: Register VM WIP

2012-05-16 Thread Andy Wingo
On Wed 16 May 2012 02:39, Noah Lavine noah.b.lav...@gmail.com writes:

 Do you mean that the register pool will grow and shrink for each
 function call? Is that why the stack frames can be fixed-size?

The register pool is the set of locals on the stack.  Registers for one
function are stored in the stack frame.

Andy
-- 
http://wingolog.org/



Re: Register VM WIP

2012-05-16 Thread Andy Wingo
On Wed 16 May 2012 06:23, Mark H Weaver m...@netris.org writes:

 It's surprising to me for another reason: in order to make the
 instructions reasonably compact, only a limited number of bits are
 available in each instruction to specify which registers to use.

It turns out that being reasonably compact isn't terribly important --
more important is the number of opcodes it takes to get something done,
which translates to the number of dispatches.  Have you seen the direct
threading VM implementation strategy?  In that case the opcode is not
an index into a jump table, it's a word that encodes the pointer
directly.  So it's a word wide, just for the opcode.  That's what
JavaScriptCore does, for example.  The opcode is a word wide, and each
operand is a word as well.

The design of the wip-rtl VM is to allow 16M registers (24-bit
addressing).  However many instructions can just address 2**8 registers
(8-bit addressing) or 2**12 registers (12-bit addressing).  We will
reserve registers 253 to 255 as temporaries.  If you have so many
registers as to need more than that, then you have to shuffle operands
down into the temporaries.  That's the plan, anyway.

Cheers,

Andy
-- 
http://wingolog.org/



Re: Register VM WIP

2012-05-16 Thread David Kastrup
Mark H Weaver m...@netris.org writes:

 I certainly agree that we should have a generous number of registers,
 but I suspect that the sweet spot for a VM is 256, because it enables
 more compact dispatching code in the VM, and yet is more than enough to
 allow a decent register allocator to generate good code.

 That's my educated guess anyway.  Feel free to prove me wrong :)

The counterproof will usually be done by benchmarking, and will even
differ between different processors sharing the same instruction set.

I see two ways out:
a) pick the register size individually for each function, as small as
possible without spillage.  Which makes the whole indistinguishable from
a stack-based VM.
b) don't generate the final bytecode until the code is actually being run.

That means that _if_ code is precompiled, it will be precompiled into
either stack-based VM or some other representation better suited to
compile into code for a certain amount of registers.  Of course, the
threshold to picking actual registers of the available processor and
compiling native code is then not all too large.

-- 
David Kastrup




Re: Register VM WIP

2012-05-16 Thread Noah Lavine
Hi Mark,

You are thinking along very similar lines to how I used to think. But
I have a different way to think about it that might make it seem
better.

In our current VM, we have two stacks: the local-variable stack, which
has frames for different function calls and is generally what you'd
think of as a stack, and the temporary-variable stack, which is
literally a stack in the sense that you only operate on the top of it.
The temporary-variable stack makes us do a lot of unnecessary work,
because we have to load things from the local-variable stack to the
temporary-variable stack.

I think what Andy is proposing to do is to get rid of the
temporary-variable stack and operate directly on the local-variable
stack. We shouldn't think of these registers as being like machine
registers, and in fact maybe registers is not a good name for these
objects. They are really just variables in the topmost stack frame.
This should only reduce memory usage, because the local-variable stack
stays the same and the temporary-variable stack goes away (some
temporaries might move to the local-variable stack, but it can't be
more than were on the temporary-variable stack, so that's still a
win).

The machine I was initially thinking of, and I imagine you were too,
is different. I had imagined a machine where the number of registers
was limited, ideally to the length of a processor cache line, and was
separate from the local-variables stack. In such a machine, the
registers are used as a cache for the local variables, and you get to
deal with all the register allocation problems that a standard
compiler would. That would accomplish the goal of keeping more things
in cache.

The registers as cache idea may result in faster code than the
directly addressing local variables idea, but it's also more
complicated to implement. So it makes sense to me that we would try
directly addressing local variables first, and maybe later move to
using a fixed-size cache of registers. It also occurs to me that the
RTL intermediate language, which is really just a language for
directly addressing an arbitrary number of local variables, is a
standard compiler intermediate language. So it might be useful to have
that around anyway, because we could more easily feed its output into,
for instance, GCC.

Andy, is this an accurate description of the register VM? And Mark and
everyone else, does it seem better when you look at it this way?

Noah

On Wed, May 16, 2012 at 9:44 AM, Mark H Weaver m...@netris.org wrote:
 Hi Andy!

 Andy Wingo wi...@pobox.com writes:
 On Wed 16 May 2012 06:23, Mark H Weaver m...@netris.org writes:

 It's surprising to me for another reason: in order to make the
 instructions reasonably compact, only a limited number of bits are
 available in each instruction to specify which registers to use.

 It turns out that being reasonably compact isn't terribly important --
 more important is the number of opcodes it takes to get something done,
 which translates to the number of dispatches.  Have you seen the direct
 threading VM implementation strategy?  In that case the opcode is not
 an index into a jump table, it's a word that encodes the pointer
 directly.  So it's a word wide, just for the opcode.  That's what
 JavaScriptCore does, for example.  The opcode is a word wide, and each
 operand is a word as well.

 The design of the wip-rtl VM is to allow 16M registers (24-bit
 addressing).  However many instructions can just address 2**8 registers
 (8-bit addressing) or 2**12 registers (12-bit addressing).  We will
 reserve registers 253 to 255 as temporaries.  If you have so many
 registers as to need more than that, then you have to shuffle operands
 down into the temporaries.  That's the plan, anyway.

 I'm very concerned about this design, for the same reason that I was
 concerned about NaN-boxing on 32-bit platforms.  Efficient use of memory
 is extremely important on modern architectures, because of the vast (and
 increasing) disparity between cache speed and RAM speed.  If you can fit
 the active set into the cache, that often makes a profound difference in
 the speed of a program.

 I agree that with VMs, minimizing the number of dispatches is crucial,
 but beyond a certain point, having more registers is not going to save
 you any dispatches, because they will almost never be used anyway.
 2^12 registers is _far_ beyond that point.

 As I wrote before concerning NaN-boxing, I suspect that the reason these
 memory-bloated designs are so successful in the JavaScript world is that
 they are specifically optimized for use within a modern web browser,
 which is already a memory hog anyway.  Therefore, if the language
 implementation wastes yet more memory it will hardly be noticed.

 If I were designing this VM, I'd work hard to allow as many loops as
 possible to run completely in the cache.  That means that three things
 have to fit into the cache together: the VM itself, the user loop code,
 and the user data.  IMO, the sum 

Re: Register VM WIP

2012-05-16 Thread Andy Wingo
Howdy,

On Wed 16 May 2012 15:44, Mark H Weaver m...@netris.org writes:

 The design of the wip-rtl VM is to allow 16M registers (24-bit
 addressing).  However many instructions can just address 2**8 registers
 (8-bit addressing) or 2**12 registers (12-bit addressing).  We will
 reserve registers 253 to 255 as temporaries.  If you have so many
 registers as to need more than that, then you have to shuffle operands
 down into the temporaries.  That's the plan, anyway.

 I'm very concerned about this design, for the same reason that I was
 concerned about NaN-boxing on 32-bit platforms.  Efficient use of memory
 is extremely important on modern architectures, because of the vast (and
 increasing) disparity between cache speed and RAM speed.  If you can fit
 the active set into the cache, that often makes a profound difference in
 the speed of a program.

 I agree that with VMs, minimizing the number of dispatches is crucial,
 but beyond a certain point, having more registers is not going to save
 you any dispatches, because they will almost never be used anyway.
 2^12 registers is _far_ beyond that point.

I'm probably not explaining myself clearly.  Here goes.

I willingly grant that 256 registers is usually enough.  But there are
valid reasons to use 2**12 registers: for example in the mov
instruction, if you have an 8-bit opcode, you have 24 bits left.  Using
12 for each operand makes sense.  There are other cases in which you
want to reference 24-bit values, for relative jumps; and even 32-bit
values, to reference constants using relative addressing.  (64 MB is too
small a limit for one compilation unit.  16 GB is fine.)

Likewise I can imagine cases in which you might end up with more than
2**12 active locals, especially in the presence of macros.  In that case
you spill.  But where do you spill?  For Guile, this means spilling to
additional registers, and having to shuffle with long-mov.  Otherwise
you would spill to a vector or something.  The WIP-RTL strategy
adequately captures the edge case while making the normal case fast.

 If I were designing this VM, I'd work hard to allow as many loops as
 possible to run completely in the cache.  That means that three things
 have to fit into the cache together: the VM itself, the user loop code,
 and the user data.  IMO, the sum of these three things should be made as
 small as possible.

Certainly.

 I certainly agree that we should have a generous number of registers,
 but I suspect that the sweet spot for a VM is 256, because it enables
 more compact dispatching code in the VM, and yet is more than enough to
 allow a decent register allocator to generate good code.

 That's my educated guess anyway.  Feel free to prove me wrong :)

I will do better: I will prove you right and prove me right at the same
time :)  The instructions in wip-rtl try to stay in one 32-bit unit.  In
that case they have limits, usually 8 bits.  But when they need to
spill, they will do so on the stack, not on the heap.

Regards,

Andy
-- 
http://wingolog.org/



Re: Register VM WIP

2012-05-16 Thread Andy Wingo
Hi Stefan,

On Fri 11 May 2012 22:29, Stefan Israelsson Tampe stefan.ita...@gmail.com 
writes:

 1. What about growing stacks any coments if they will be easier to manage
 for this setup. Can one copy the C stack logic?

Having a fixed-size frame means that it's easier to have disjoint
stacks, since a register VM addresses operands relative to the frame
pointer and not the stack pointer.  I hope to be able to decrease our
default stack size, and allow it instead to grow dynamically.

 2. Is there an instruction that does what call does but can be used for tail 
 call's
 when it needs it e.g. the code
  for (n = 0; n  nargs; n++)
     LOCAL_SET (n, old_fp[ip[4 + n]]);
 that is missing for the tail code

This is another advantage of wip-rtl.  In it, the compiler is
responsible for shuffling tail arguments.  It can do a parallel move
possibly without shuffling args to the top of the stack.  Then tail-call
just sets a new procedure and jumps to its entry.

 3. I would appriciate if the frame is always below say 256 SCM:s of the fp 
 stack limit
 that way when preparing tail calling one doesn't usally need to check if the 
 argument fit's
 when issuing a tail call.

See above :)

 4. I think the logic code hook I recently investigated could easily fit into 
 this VM engine with
 using similar techniques as I described in previous mails.

I'm still working back through the mails; remind me again if it seems I
overlooked this mail.

Cheers,

Andy
-- 
http://wingolog.org/



Re: Register VM WIP

2012-05-16 Thread Andy Wingo
On Wed 16 May 2012 16:54, Noah Lavine noah.b.lav...@gmail.com writes:

 In our current VM, we have two stacks: the local-variable stack, which
 has frames for different function calls and is generally what you'd
 think of as a stack, and the temporary-variable stack, which is
 literally a stack in the sense that you only operate on the top of it.

Nice explanation!  Surely register vm is a poor name, if one can have
a variable number of registers -- registers are normally something of
which one has a fixed number.

Cheers,

Andy
-- 
http://wingolog.org/



Re: Register VM WIP

2012-05-16 Thread Mark H Weaver
Hi Andy,

Andy Wingo wi...@pobox.com writes:
 Likewise I can imagine cases in which you might end up with more than
 2**12 active locals, especially in the presence of macros.  In that case
 you spill.  But where do you spill?

You spill to them to stack of course, which brings me to my next point:
as discussed in chapter 4 of David Kranz's thesis on the Orbit compiler
(highly recommended reading) it is sometimes better to store a local on
the stack, even if you have registers to spare.  The reason is that
registers must be saved and restored for every procedure call, and as we
all know, Scheme has no shortage of those.

What's your plan for saving and restoring such a large register file?

Best,
 Mark



Re: our benchmark-suite

2012-05-16 Thread Andy Wingo
Howdy!

On Wed 25 Apr 2012 22:39, l...@gnu.org (Ludovic Courtès) writes:

 So, those are the problems: benchmarks running for inappropriate,
 inconsistent durations;

 I don’t really see such a problem.  It doesn’t matter to me if
 ‘arithmetic.bm’ takes 2mn while ‘vlists.bm’ takes 40s, since I’m not
 comparing them.

Running a benchmark for 2 minutes is not harmful to the results, but it
is a bit needless.  One second is enough.

However, running a benchmark for just a few milliseconds is not very
interesting:

;; (if.bm: if-bool-then: executing then 33 real 0.011994627 
real/iteration 3.63473545454545e-8 run/iteration 3.62829060606061e-8 
core/iteration 9.61427360606058e-10 gc 0.0)

That's 12 milliseconds.  The jitter there is too much.

 inappropriate benchmarks;

 I agree that things like ‘if.bm’ are not very relevant now.  But there
 are also appropriate benchmarks, and benchmarks are always better than
 wild guess.  ;-)

Agreed :-)

 and benchmarks being optimized out.

 That should be fixed.

In what way?  It would make those benchmarks different.

Thesis: anything for which you would want to turn off the optimizer is
not a good benchmark anyway.

See also: http://www.azulsystems.com/presentations/art-of-java-benchmarking

 My proposal is to rebase the iteration count in 0-reference.bm to run
 for 0.5s on some modern machine, and adjust all benchmarks to match,
 removing those benchmarks that do not measure anything useful.

 Sounds good.  However, adjusting iteration counts of the benchmarks
 themselves should be done rarely, as it breaks performance tracking like
 http://ossau.homelinux.net/~neil/bm_master_i.html.

I think we've established that this isn't the case -- modulo the effect
that such a change would have on GC (process image size, etc)

 Finally we should perhaps enable automatic scaling of the iteration
 count.  What do folks think about that?

 On the positive side, all of our benchmarks are very clear that they are
 a time per number of iterations, and so this change should not affect
 users that measure time per iteration.

 If the reported time is divided by the global iteration count, then
 automatic scaling of the global iteration count would be good, yes.

OK, will do.

Speak now or be surprised by a commit!

;-)

Andy
-- 
http://wingolog.org/



Re: Register VM WIP

2012-05-16 Thread Noah Lavine
 Perhaps it needs a different name than register virtual machine.

How about RTL VM, since it's a virtual machine that interprets RTL?

Or maybe frame-addressed VM, because the operations address objects
in the current stack frame?

Noah



bug in syntax-case in master

2012-05-16 Thread Stefan Israelsson Tampe
I'm trying to port syntax-parse to master. And get into the following
trubble


(syntax-case x (integrate) ((integrate a b) ...))
fails, but

(syntax-case x (integrate) ((_ a b) ...))
does not fail


looking at the code for syntax-case I would expect that the datum integrate
is
match against and not using syntax any syntactic information.

In psyntax.scm around 2419 we have,
((bound-id-member? p keys)
 (values (vector 'free-id p) ids))

keys are the fixed datums, and

(define bound-id-member?
  (lambda (x list)
(and (not (null? list))
 (or (bound-id=? x (car list))
 (bound-id-member? x (cdr list))

e.g. no comparisons of the datum.

Is this correct! I do understand that this can be a feature but is this
expected?
In syntax parse both options are possible.

/Regards
Stefan


Re: stack closures for guile-log

2012-05-16 Thread Stefan Israelsson Tampe
Hi,

Yes, I think that if one go for capturing the stack via copying frames then
that's the way
to go. This makes rewinding and unwinding fast. on the other hand,
currently in guile-log
we use very small linked frames so that essentially all the stored
information is lying in the
heap compressed to a tree and is therefore superior memory wise, because of
the compression and because the GC get better oppertunities to reclaim
memory.

However I need to find more bench mark more in order to understand this
better.

I also must add hooks to handle a satck with stored coses. They serve as a
derefernce and identity objects for the closures which will be reference
throughout the heap. I cannot move
the conses but I can release the references to them and allocate new conses
in essence
move a cons from the stack to the heap. This is quite neat but do decrease
the possibilities of reclaiming memory in some cases so conses on the stack
will be an expert option but is needed to get the last 2x - 3x speedup down
to what compiled prolog can do.

The bulk of the closures stack is quite simple in it's functioning and
there I think it's
a bit over kill to use a dynstack.

Thanks for the info and Cheers!


On Tue, May 15, 2012 at 10:37 PM, Andy Wingo wi...@pobox.com wrote:

 On Tue 08 May 2012 21:16, Stefan Israelsson Tampe stefan.ita...@gmail.com
 writes:

  I have three stacks
  1. a control stack that stores undo information scheme hooks aka
  dynamic-wind and stack references to the other stacks.

 Have you seen dynstack.[ch] on master?

  2. stack from which data is allocated from like the bulk of a closure
  and special pairs
 
  3. a cons stack, e.g. an array of allocated conses

 I wonder, can you implement these too using something like the dynstack?

 Andy
 --
 http://wingolog.org/



Re: syntax parse link

2012-05-16 Thread Stefan Israelsson Tampe
Thx,

I have a few things I would like to do first but maybe after this weekend I
will
make the linkage!

/Stefan

On Tue, May 15, 2012 at 10:33 PM, Andy Wingo wi...@pobox.com wrote:

 On Tue 08 May 2012 17:46, Stefan Israelsson Tampe stefan.ita...@gmail.com
 writes:

  I would like to add a link to the syntax-parse repo from guile's home
 page. I was told to try get
  the rights to do this myself. If that doesn't work I will need help from
 any of the maintainers
  to make the linkage.

 I've added you to the Guile group.  You should be able to make a
 writable CVS checkout of the web repository.  (Yes, CVS.  It's
 incredible.  I keep forgetting how to use it.)  Anyway the bulk of the
 web site is in template/ and generated.  I think there is a README.  It
 is still horribly complex.  Anyway good luck, and ask on IRC when you
 have problems!

 Andy
 --
 http://wingolog.org/



Re: Register VM WIP

2012-05-16 Thread Ludovic Courtès
Hi,

Noah Lavine noah.b.lav...@gmail.com skribis:

 I think what Andy is proposing to do is to get rid of the
 temporary-variable stack and operate directly on the local-variable
 stack. We shouldn't think of these registers as being like machine
 registers, and in fact maybe registers is not a good name for these
 objects. They are really just variables in the topmost stack frame.

Yeah, I too was confused the first time I heard about “register VMs.”

The key idea is that opcodes encode the offset of the operand they work
on, rather than working only on the last words of the stack (for example,
‘add x y’ instead of ‘local-ref x’ + ‘local-ref y’ + ‘add’ + ‘local-set x’.)

Thanks,
Ludo’.



Re: our benchmark-suite

2012-05-16 Thread Ludovic Courtès
Hi!

Andy Wingo wi...@pobox.com skribis:

 On Wed 25 Apr 2012 22:39, l...@gnu.org (Ludovic Courtès) writes:

 So, those are the problems: benchmarks running for inappropriate,
 inconsistent durations;

 I don’t really see such a problem.  It doesn’t matter to me if
 ‘arithmetic.bm’ takes 2mn while ‘vlists.bm’ takes 40s, since I’m not
 comparing them.

 Running a benchmark for 2 minutes is not harmful to the results, but it
 is a bit needless.  One second is enough.

Well, duration has to be chosen such that the jitter is small enough.
Sometimes it could be 2mn, sometimes 1s.

[...]

 and benchmarks being optimized out.

 That should be fixed.

 In what way?  It would make those benchmarks different.

 Thesis: anything for which you would want to turn off the optimizer is
 not a good benchmark anyway.

Yes, it depends on the benchmarks.  For instance, I once added
benchmarks for ‘1+’ and ‘1-’, because I wanted to see the impact of an
optimization to the corresponding VM instructions.

Nowadays peval would optimize those benchmarks out.  Yet, the fact is
that I was interested in the performance of the underlying VM
instructions, regardless of what the compiler might be doing.

Thanks,
Ludo’.



Re: syntax parse link

2012-05-16 Thread Ludovic Courtès
Hi Stefan,

Stefan Israelsson Tampe stefan.ita...@gmail.com skribis:

 I have a few things I would like to do first but maybe after this weekend I
 will
 make the linkage!

Please add it to gnu-guile-projects.html (under template/ first, as Andy
mentioned), and using the same format as other entries there.

Thanks!

Ludo’.




problems evaluating code depending on version

2012-05-16 Thread Stefan Israelsson Tampe
hi,

I'm trying to use this as a way to defined different versions of the code
depending on the
guile-version. So here it is,



(eval-when (compile load eval)
  (define (ver)
(let ((v (version)))
  (cond
   ((string-match ^2.0 v)
'v2.0)
   ((string-match ^2.1 v)
'v2.1)
   (else #f)

(define-syntax guile-stuff
  (lambda (x)
(syntax-case x ()
  (_
   (let ((q (ver)))
 (cond
  ((eq? q 'v2.0)
   #'(begin 1))
  ((eq? q 'v2.1)
   #'(begin
   (define-syntax-rule (fluid-let-syntax . l)
 (syntax-parametrize . l))
   (export fluid-let-syntax)))
  (else (error not supported version

(guile-stuff)

This does not work in master (v2.1) why?
/Stefan


bug in syntax-case in master

2012-05-16 Thread Stefan Israelsson Tampe
I have found the bug, It was because of the bug fixed in
master got a bug in my code visible!

/Stefan

-- Forwarded message --
From: Stefan Israelsson Tampe stefan.ita...@gmail.com
Date: Wed, May 16, 2012 at 8:57 PM
Subject: bug in syntax-case in master
To: guile-devel guile-devel@gnu.org


I'm trying to port syntax-parse to master. And get into the following
trubble


(syntax-case x (integrate) ((integrate a b) ...))
fails, but

(syntax-case x (integrate) ((_ a b) ...))
does not fail


looking at the code for syntax-case I would expect that the datum integrate
is
match against and not using syntax any syntactic information.

In psyntax.scm around 2419 we have,
((bound-id-member? p keys)
 (values (vector 'free-id p) ids))

keys are the fixed datums, and

(define bound-id-member?
  (lambda (x list)
(and (not (null? list))
 (or (bound-id=? x (car list))
 (bound-id-member? x (cdr list))

e.g. no comparisons of the datum.

Is this correct! I do understand that this can be a feature but is this
expected?
In syntax parse both options are possible.

/Regards
Stefan


Re: [PATCH] Fix Ecmascript's tree-il compiling

2012-05-16 Thread Nala Ginrut
OK, I received a mail just now that they have acknowledged my assignment.

On Fri, May 4, 2012 at 5:15 PM, Nala Ginrut nalagin...@gmail.com wrote:
 I've already delivered it with post. Maybe takes 1-2 weeks.


 On Thu, May 3, 2012 at 5:54 AM, Ludovic Courtès l...@gnu.org wrote:
 Hi Noah,

 Noah Lavine noah.b.lav...@gmail.com skribis:

 Andy and Ludovic, I hope this is the right thing to do. If not, please
 let me know.

 It is, thanks for helping!

 Ludo’.