Re: [racket-users] Re: Code generation performance

2020-03-18 Thread Eric Griffis
Made a tiny bit of progress today.

On a bigger machine, I was able to profile the giant unit tests module. It
has one top-level for/template that iterates over the 5 scalar types, and a
bunch of smaller ones inside that cover the multitude of operations for
each of the 4 fixed vector lengths.

profiling (lib "glm/vector/tests.rkt")
Initial code size: 5039
Final code size  : 1019095

The good news is, I'm seeing around 200x compression. I mean, who would
mind getting completely DRY source for as little as 1/200th the effort?
(Assuming, of course, that programming time is proportional to program size
in bytes.)

The bad news is, compilation takes around 40 seconds on a modern desktop
with plenty of CPU and RAM. From the rest of the profiling output, it looks
like phase-0 for/template is responsible for about 12.5% of the total size,
but phase-1 for/list contributes 57.2% and phase-0 check contributes 48.4%.

I'm not sure how to interpret these numbers yet. On one hand, for/template
is essentially a for/list loop unroller, so the stats could just mean it
did its job. On the other hand, I don't know how much of that 57.2% is
merely the cost of doing business in Racket.

When I comment out everything but the first two tests, I see this:

Initial code size: 243
Final code size  : 21725

That's a mere 89x compression, which is OK because the first two tests are
relatively simple, with phase-0 for/template accounting for 58.5% of the
total size, phase-1 for/list contributing 23.2%, and no phase-0 check.

It's starting to look like there isn't much I can do to bring down the
total size. But what about total compile time?

When I manually unroll the for/template forms, the profiler gives:

Initial code size: 1509
Final code size  : 21725

The identical final size is interesting -- it suggests the original output
sizes are what they would be if templates weren't used.
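
The three size ratios quoted so far can be checked with a few lines of
Racket. This is only a sketch of the arithmetic; the helper name
expansion-factor is made up for illustration and is not part of the
macro profiler.

```racket
#lang racket/base

;; Expansion factor = final code size / initial code size, using the
;; macro-profiler figures quoted in this message.
;; `expansion-factor` is a hypothetical helper, not a profiler API.
(define (expansion-factor initial final)
  (/ final (exact->inexact initial)))

(define full-suite (expansion-factor 5039 1019095)) ; about 202
(define two-tests  (expansion-factor 243 21725))    ; about 89
(define unrolled   (expansion-factor 1509 21725))   ; about 14

(for ([label (in-list '("full suite" "two tests" "two tests, unrolled"))]
      [ratio (in-list (list full-suite two-tests unrolled))])
  (printf "~a: ~ax\n" label ratio))
```

Read this way, the hand-unrolled source is still expanded about 14x by
the ordinary Racket macros it uses, so the templates account for roughly
a further 6x on top of that for the two-test module.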

This version takes, on average, 1.883 seconds to compile. The for/template
version takes 2.499 seconds, and an empty test suite takes 1.743 seconds.
Subtracting out the control time, the for/template version took 0.616
seconds more, or 5.4x longer (0.756 vs. 0.140 seconds), to compile a
fairly simple module than the hand-unrolled version.
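
For anyone who wants to redo the comparison with their own timings, the
calculation is sketched below. The three times are the ones reported
above; the variable names are made up for this example.

```racket
#lang racket/base

;; Average compile times in seconds, copied from the message above.
(define t-empty    1.743) ; empty test suite (the control)
(define t-unrolled 1.883) ; for/template forms unrolled by hand
(define t-template 2.499) ; for/template version

;; Time attributable to the tests themselves, after subtracting the control.
(define net-unrolled (- t-unrolled t-empty)) ; about 0.14 s
(define net-template (- t-template t-empty)) ; about 0.76 s

(define extra-seconds (- net-template net-unrolled)) ; about 0.62 s
(define slowdown (/ net-template net-unrolled))      ; about 5.4

(printf "extra: ~a s, slowdown: ~ax\n" extra-seconds slowdown)
```

Note that the 5.4x figure is a ratio of two small differences, so it is
sensitive to noise in the control measurement.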

Is the extra cost acceptable? I'm guessing that's highly context dependent.
In this case, adding half a second to compile one module wouldn't
inconvenience me terribly, but it doesn't take much imagination to find a
situation where it would, and I have no idea how any of these numbers will
scale.

Eric


On Sat, Mar 14, 2020 at 3:28 PM Eric Griffis wrote:

> Alright, I re-discovered Ryan Culpepper's talk, "The Cost of Sugar," from
> the RacketCon 2018 video stream (https://youtu.be/CLjXhr_TgP8?t=5908) and
> made some progress by following along.
>
> Here are the .zo files larger than 100K:
>
> 993K ./vector/compiled/tests_rkt.zo
> 830K ./scribblings/compiled/glm_scrbl.zo
> 328K ./vector/compiled/relational_rkt.zo
> 295K ./vec4/compiled/bool_rkt.zo
> 291K ./vec4/compiled/int_rkt.zo
> 290K ./vec4/compiled/uint_rkt.zo
> 290K ./vec4/compiled/double_rkt.zo
> 289K ./vec4/compiled/float_rkt.zo
> 280K ./vec3/compiled/bool_rkt.zo
> 276K ./vec3/compiled/int_rkt.zo
> 275K ./vec3/compiled/uint_rkt.zo
> 275K ./vec3/compiled/double_rkt.zo
> 274K ./vec3/compiled/float_rkt.zo
> 262K ./vec2/compiled/bool_rkt.zo
> 258K ./vec2/compiled/uint_rkt.zo
> 258K ./vec2/compiled/int_rkt.zo
> 258K ./vec2/compiled/double_rkt.zo
> 257K ./vec2/compiled/float_rkt.zo
> 213K ./vec1/compiled/bool_rkt.zo
> 210K ./vec1/compiled/uint_rkt.zo
> 210K ./vec1/compiled/int_rkt.zo
> 210K ./vec1/compiled/double_rkt.zo
> 209K ./vec1/compiled/float_rkt.zo
> 102K ./compiled/main_rkt.zo
> 101K ./compiled/vector_rkt.zo
>
> I'm pretty sure that's a lot of big files. It's for a port of GLM, a
> graphics math library that implements (among other things) fixed-length
> vectors of up to 4 components over 5 distinct scalar types, for a total of
> 20 distinct type-length combinations with many small variations in their
> APIs and implementations.
>
> The variations I'm targeting either require a macro or exacerbate
> developer- or run-time overhead when functions are introduced. For example,
> the base component accessors for a four-component vector of doubles are:
>
>   dvec4-x
>   dvec4-y
>   dvec4-z
>   dvec4-w
>
> Each of the "xyzw" components has two aliases -- one from "rgba" and
> another from "stpq". Each accessor also has a corresponding mutator, e.g.,
> dvec4-g and set-dvec4-g!.
>
> For another example, whereas adding two dvec4's sums four components,
>
>   (dvec4
>    (fl+ (dvec4-x v1) (dvec4-x v2))
>    (fl+ (dvec4-y v1) (dvec4-y v2))
>    (fl+ (dvec4-z v1) (dvec4-z v2))
>    (fl+ (dvec4-w v1) (dvec4-w v2)))
>
> the same operation on dvec2's sums only the first two components.
>
> Furthermore, the sheer volume of the target code base makes writing
> everything out by hand a mind-numbing exercise in frustration, and that's
> when looking at a mere 20% of the pile. It's going to get much worse

Re: [racket-users] Re: Code generation performance

2020-03-15 Thread Hendrik Boom
On Sun, Mar 15, 2020 at 10:48:48AM -0700, Eric Griffis wrote:
> On Sat, Mar 14, 2020 at 10:25 PM Hendrik Boom wrote:
> >
> > There's a port of glm in the Racket package library.
> > Is that the same one?  If not, is it also that huge?
> 
> Same repository, different branch. The master branch, which is a
> couple months old now, implements the matrix and vector types on top
> of a single, list-based, length-agnostic structure type. It's a
> snapshot of the moment I realized the volume of code and run-time
> loops were becoming a problem.
> 
> The new code is in the dev branch. It implements just the vector types
> in a manner similar to generic interfaces while also exposing a
> progression of increasingly type-specific variants. This allows me to
> prototype with a generic API, then eliminate the overhead of dynamic
> dispatch later by switching to more type-specific operations. (If it's
> not obvious, I'm working toward a pluggable type system harness, so
> the compiler can specialize and prune automatically.)
> 
> > By the way, I'm working on updating the opengl package to the current
> > OpenGL 4.6 spec.  Because of a change of format in Khronos's
> > specfiles, it appears to require a complete rewrite to its
> > specification translator.
> 
> This is good news! Through March, I'll be announcing several graphics
> packages that could benefit from this directly. Let's try to keep a
> conversation going.

I'm the guy behind Rackettown, https://github.com/hendrikboom3/rackettown
It's an experiment about attribute management in procedurally-generated 
content, although it looks like a fairly simple building-drawing 
program.

I wanted to move to doing it in 3D, and it took a little while to realise 
I needed to use OpenGL directly, rather than something like Pict3D.

I had some difficulty figuring out the various tutorials, so I went to 
the current red book (covering OpenGL 4.6) and discovered that the baby 
steps used functions that weren't in the Racket binding (though they are 
present in the OpenGL on my Linux system).

So ... I'm redoing the binding, after a fruitless attempt to find 
up-to-date versions of the old specfiles.

I'm hoping to test the new interface generator by comparing its output 
with the one in the present Racket package.  That will always be a bit 
awkward, because all the definitions are now in a new order, so a simple 
diff won't do.  Not to mention checking out the errors that Stephan 
discovered in the old specfiles.

-- hendrik

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/20200315180922.ejb3eu6bsclm4nbf%40topoi.pooq.com.


Re: [racket-users] Re: Code generation performance

2020-03-15 Thread Eric Griffis
On Sat, Mar 14, 2020 at 10:25 PM Hendrik Boom wrote:
>
> There's a port of glm in the Racket package library.
> Is that the same one?  If not, is it also that huge?

Same repository, different branch. The master branch, which is a
couple months old now, implements the matrix and vector types on top
of a single, list-based, length-agnostic structure type. It's a
snapshot of the moment I realized the volume of code and run-time
loops were becoming a problem.

The new code is in the dev branch. It implements just the vector types
in a manner similar to generic interfaces while also exposing a
progression of increasingly type-specific variants. This allows me to
prototype with a generic API, then eliminate the overhead of dynamic
dispatch later by switching to more type-specific operations. (If it's
not obvious, I'm working toward a pluggable type system harness, so
the compiler can specialize and prune automatically.)

> By the way, I'm working on updating the opengl package to the current
> OpenGL 4.6 spec.  Because of a change of format in Khronos's
> specfiles, it appears to require a complete rewrite to its
> specification translator.

This is good news! Through March, I'll be announcing several graphics
packages that could benefit from this directly. Let's try to keep a
conversation going.

Eric

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/CAORuSUzrPxHccXWCUJ9TFEoZPA9ORm%2BNuDBR5oMYOQ1NS1Kfqg%40mail.gmail.com.


Re: [racket-users] Re: Code generation performance

2020-03-14 Thread Hendrik Boom
On Sat, Mar 14, 2020 at 03:28:35PM -0700, Eric Griffis wrote:

> 
> I'm pretty sure that's a lot of big files. It's for a port of GLM, a 
> graphics math library that implements (among other things) fixed-length 
> vectors of up to 4 components over 5 distinct scalar types, for a total of 
> 20 distinct type-length combinations with many small variations in their 
> APIs and implementations.

There's a port of glm in the Racket package library.
Is that the same one?  If not, is it also that huge?

By the way, I'm working on updating the opengl package to the current
OpenGL 4.6 spec.  Because of a change of format in Khronos's
specfiles, it appears to require a complete rewrite to its 
specification translator.

-- hendrik

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/20200315052511.2aekyt6wlidmspw6%40topoi.pooq.com.


[racket-users] Re: Code generation performance

2020-03-14 Thread Eric Griffis
Alright, I re-discovered Ryan Culpepper's talk, "The Cost of Sugar," from 
the RacketCon 2018 video stream (https://youtu.be/CLjXhr_TgP8?t=5908) and 
made some progress by following along.

Here are the .zo files larger than 100K:

993K ./vector/compiled/tests_rkt.zo
830K ./scribblings/compiled/glm_scrbl.zo
328K ./vector/compiled/relational_rkt.zo
295K ./vec4/compiled/bool_rkt.zo
291K ./vec4/compiled/int_rkt.zo
290K ./vec4/compiled/uint_rkt.zo
290K ./vec4/compiled/double_rkt.zo
289K ./vec4/compiled/float_rkt.zo
280K ./vec3/compiled/bool_rkt.zo
276K ./vec3/compiled/int_rkt.zo
275K ./vec3/compiled/uint_rkt.zo
275K ./vec3/compiled/double_rkt.zo
274K ./vec3/compiled/float_rkt.zo
262K ./vec2/compiled/bool_rkt.zo
258K ./vec2/compiled/uint_rkt.zo
258K ./vec2/compiled/int_rkt.zo
258K ./vec2/compiled/double_rkt.zo
257K ./vec2/compiled/float_rkt.zo
213K ./vec1/compiled/bool_rkt.zo
210K ./vec1/compiled/uint_rkt.zo
210K ./vec1/compiled/int_rkt.zo
210K ./vec1/compiled/double_rkt.zo
209K ./vec1/compiled/float_rkt.zo
102K ./compiled/main_rkt.zo
101K ./compiled/vector_rkt.zo

I'm pretty sure that's a lot of big files. It's for a port of GLM, a 
graphics math library that implements (among other things) fixed-length 
vectors of up to 4 components over 5 distinct scalar types, for a total of 
20 distinct type-length combinations with many small variations in their 
APIs and implementations.

The variations I'm targeting either require a macro or exacerbate 
developer- or run-time overhead when functions are introduced. For example, 
the base component accessors for a four-component vector of doubles are:

  dvec4-x
  dvec4-y
  dvec4-z
  dvec4-w

Each of the "xyzw" components has two aliases -- one from "rgba" and 
another from "stpq". Each accessor also has a corresponding mutator, e.g., 
dvec4-g and set-dvec4-g!. 
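
To make the aliasing concrete, here is a minimal sketch of how such
aliases could be generated with a phase-1 comprehension. It is not the
code from the glm package: the define-aliases macro and the plain mutable
struct standing in for dvec4 are assumptions made for illustration.

```racket
#lang racket/base
(require (for-syntax racket/base racket/syntax))

;; Stand-in for the real dvec4 (an assumption): a mutable four-field struct.
(struct dvec4 (x y z w) #:mutable)

;; Hypothetical macro. (define-aliases dvec4 (x y z w) (r g b a) (s t p q))
;; binds dvec4-r to dvec4-x, set-dvec4-r! to set-dvec4-x!, and so on for
;; every alias row.
(define-syntax (define-aliases stx)
  (syntax-case stx ()
    [(_ type (base ...) (alias ...) ...)
     (with-syntax
         ([(def ...)
           ;; Phase-1 loop: pair each alias in each row with its base field.
           (for*/list ([row (in-list (syntax->list #'((alias ...) ...)))]
                       [(base-id alias-id)
                        (in-parallel (in-list (syntax->list #'(base ...)))
                                     (in-list (syntax->list row)))])
             (with-syntax
                 ([getter      (format-id #'type "~a-~a" #'type alias-id)]
                  [setter      (format-id #'type "set-~a-~a!" #'type alias-id)]
                  [base-getter (format-id #'type "~a-~a" #'type base-id)]
                  [base-setter (format-id #'type "set-~a-~a!" #'type base-id)])
               #'(begin (define getter base-getter)
                        (define setter base-setter))))])
       #'(begin def ...))]))

(define-aliases dvec4 (x y z w) (r g b a) (s t p q))

(define v (dvec4 1.0 2.0 3.0 4.0))
(set-dvec4-g! v 20.0) ; writes the same slot as set-dvec4-y!
(dvec4-y v)           ; => 20.0
```

One use of the macro expands into 16 definitions (8 aliases, each with an
accessor and a mutator), which gives a feel for how quickly the expanded
code grows relative to the source.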

For another example, whereas adding two dvec4's sums four components,

  (dvec4
   (fl+ (dvec4-x v1) (dvec4-x v2))
   (fl+ (dvec4-y v1) (dvec4-y v2))
   (fl+ (dvec4-z v1) (dvec4-z v2))
   (fl+ (dvec4-w v1) (dvec4-w v2)))

the same operation on dvec2's sums only the first two components.
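
A rough sketch of how one macro could cover both lengths follows. The
define-binop macro and the two structs are hypothetical stand-ins; the
package's own define-dvec4-binop is not shown in this thread, so nothing
here should be read as its actual definition.

```racket
#lang racket/base
(require racket/flonum
         (for-syntax racket/base racket/syntax))

;; Stand-ins for the real vector types (assumptions for this sketch).
(struct dvec2 (x y))
(struct dvec4 (x y z w))

;; Hypothetical macro. (define-binop name type op (field ...)) expands to a
;; function that applies op to each named field, with no run-time loop.
(define-syntax (define-binop stx)
  (syntax-case stx ()
    [(_ name type op (field ...))
     (with-syntax ([(acc ...)
                    ;; Build one accessor identifier per field at phase 1.
                    (for/list ([f (in-list (syntax->list #'(field ...)))])
                      (format-id #'type "~a-~a" #'type f))])
       #'(define (name v1 v2)
           (type (op (acc v1) (acc v2)) ...)))]))

(define-binop dvec2+ dvec2 fl+ (x y))
(define-binop dvec4+ dvec4 fl+ (x y z w))

(define sum4 (dvec4+ (dvec4 1.0 2.0 3.0 4.0) (dvec4 10.0 20.0 30.0 40.0)))
(dvec4-w sum4) ; => 44.0

(define sum2 (dvec2+ (dvec2 1.0 2.0) (dvec2 10.0 20.0)))
(dvec2-y sum2) ; => 22.0
```

The field list is the only thing that changes between lengths, which is
the kind of variation a length-parameterized template can supply.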

Furthermore, the sheer volume of the target code base makes writing 
everything out by hand a mind-numbing exercise in frustration, and that's 
when looking at a mere 20% of the pile. It's going to get much worse very 
quickly. To add fixed-length matrices up to shape 4x4 over the same scalar 
types, I'm looking at 16x5 = 80 more distinct type-shape combinations!

Getting back to the .zo files, I had no luck running "raco macro-profiler" 
on the top end of the list. It appears to diverge. My dev laptop probably 
doesn't have enough RAM, so I'll have to try again on a bigger machine.

Here's an excerpt from a file on the bottom end:

[eric@walden racket-glm]$ raco macro-profiler glm/vec4/double
profiling (lib "glm/vec4/double.rkt")
Initial code size: 87
Final code size  : 86531

Phase 0
the-template (defined as the-template.1 in glm/vector/template)
  total: 31536, mean: 31536
  direct: 2054, mean: 2054, count: 1, stddev: 0
define-dvec4-unop (defined in "this module")
  total: 7300, mean: 730
  direct: 7480, mean: 748, count: 10, stddev: 0
define/contract (defined in racket/contract/region)
  total: , mean: 44
  direct: 3572, mean: 23, count: 153, stddev: 1.48
define-dvec4-binop (defined in "this module")
  total: 6200, mean: 620
  direct: 6380, mean: 638, count: 10, stddev: 0
...

Phase 1
for/list (defined in racket/private/for)
  total: 6558, mean: 273
  direct: 2274, mean: 95, count: 24, stddev: 14.94
for/fold/derived/final (defined in racket/private/for)
  total: 4332, mean: 180
  direct: 336, mean: 14, count: 24, stddev: 0
for/fold/derived (defined in racket/private/for)
  total: 4284, mean: 178
  direct: 240, mean: 10, count: 24, stddev: 0
for/foldX/derived (defined in racket/private/for)
  total: 3996, mean: 24
  direct: 3164, mean: 19, count: 170, stddev: 48.16

Wow, does that look like nearly 1000x compression? Three orders of 
magnitude seems right, given what I know about how these macros interact.

The "the-template" macro is defined inside a module generated by my custom 
#%module-begin. It defines 4 type-agnostic, fixed-length module templates 
(e.g., glm/vec4/template), which are instantiated once for each of the 5 
scalar types. Those fixed-length module templates are based, in turn, on 
another module template (glm/vector/template) that takes a length argument 
and uses the other profiled macros (define-dvec4-unop, define/contract, 
define-dvec4-binop) to create 20 component-wise operations per instance. 
All together, that should inflate the size of the output to somewhere near 
the middle of the interval 4x20x5x[1,4], which is 1000.
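
Spelled out, the estimate above is just the following arithmetic. The
variable names are made up, and the [1,4] per-operation factor is taken
from the text as given rather than derived.

```racket
#lang racket/base

(define lengths 4)      ; fixed-length module templates (vec1 .. vec4)
(define ops 20)         ; component-wise operations per instance
(define scalar-types 5) ; bool, int, uint, float, double

(define base (* lengths ops scalar-types)) ; 400

;; Each operation is assumed to contribute a factor between 1 and 4.
(define low  (* base 1)) ; 400
(define high (* base 4)) ; 1600
(define midpoint (/ (+ low high) 2)) ; 1000

;; Measured expansion for glm/vec4/double, from the profiler output above.
(define measured (/ 86531.0 87)) ; about 994.6

(printf "estimate: ~a, measured: ~a\n" midpoint measured)
```

The measured factor lands within about 1% of the midpoint, which is
consistent with the explanation but does not by itself confirm it.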

At phase 1, the comprehension forms are busy churning out component aliases 
and unrolling component-wise operations at "compile" time. I'm reluctant to 
anti-inline these because they keep the written code small and the 
generated code f