Made a tiny bit of progress today.
On a bigger machine, I was able to profile the giant unit tests module. It
has one top-level for/template that iterates over the 5 scalar types, and a
bunch of smaller ones inside that cover the multitude of operations for
each of the 4 fixed vector lengths.
    profiling (lib "glm/vector/tests.rkt")
    Initial code size: 5039
    Final code size  : 1019095
The good news is, I'm seeing around 200x compression. I mean, who would
mind getting completely DRY source for as little as 1/200th the effort?
(Assuming, of course, that programming time is proportional to program size
in bytes.)
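For reference, here's the arithmetic behind the ratio as a minimal sketch.
The helper name expansion-ratio is my own invention for illustration, not
something the profiler provides. The second pair of sizes is from the
two-test subset discussed further down.

```racket
#lang racket/base

;; Hypothetical helper: ratio of expanded size to source size,
;; rounded to the nearest integer.
(define (expansion-ratio initial-size final-size)
  (round (/ final-size initial-size)))

;; Sizes as reported by the profiler in this message.
(define full-suite-ratio (expansion-ratio 5039 1019095)) ; about 202
(define two-test-ratio   (expansion-ratio 243 21725))    ; about 89
```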
The bad news is, compilation takes around 40 seconds on a modern desktop
with plenty of CPU and RAM. From the rest of the profiling output, it looks
like phase-0 for/template is responsible for about 12.5% of the total size,
but phase-1 for/list contributes 57.2% and phase-0 check contributes 48.4%.
I'm not sure how to interpret these numbers yet. On one hand, for/template
is essentially a for/list loop unroller, so the stats could just mean it
did its job. On the other hand, I don't know how much of that 57.2% is
merely the cost of doing business in Racket.
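To make the "loop unroller" idea concrete, here's a toy sketch. The macro
unroll-sum is a hypothetical stand-in I made up for this message; it is not
for/template and says nothing about how for/template is implemented. It only
shows the general shape: a for/list that runs at phase 1 (compile time) and
produces one copy of the phase-0 code per iteration, so the expanded module
grows with the iteration count.

```racket
#lang racket/base
(require (for-syntax racket/base))

;; Hypothetical macro, for illustration only.
;; (unroll-sum 4 f) expands to (+ (f 0) (f 1) (f 2) (f 3)).
(define-syntax (unroll-sum stx)
  (syntax-case stx ()
    [(_ n f)
     ;; The count must be a literal so it is known at expansion time.
     (exact-nonnegative-integer? (syntax-e #'n))
     ;; This for/list runs at phase 1; each index becomes one
     ;; unrolled call in the phase-0 output.
     (with-syntax ([(i ...) (for/list ([k (in-range (syntax-e #'n))]) k)])
       #'(+ (f i) ...))]))

(define (square x) (* x x))

;; Expands to (+ (square 0) (square 1) (square 2) (square 3)).
(define sum-of-squares (unroll-sum 4 square))
```

No loop survives into the compiled code here, which is the sense in which a
large final size could simply mean the unroller did its job.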
When I comment out everything but the first two tests, I see this:
    Initial code size: 243
    Final code size  : 21725
That's a mere 89x compression, which is OK because the first two tests are
relatively simple, with phase-0 for/template accounting for 58.5% of the
total size, phase-1 for/list contributing 23.2%, and no phase-0 check.
It's starting to look like there isn't much I can do to bring down the
total size. But what about total compile time?
When I manually unroll the for/template forms, the profiler gives:
    Initial code size: 1509
    Final code size  : 21725
The identical final size is interesting -- it suggests the original output
sizes are what they would be if templates weren't used.
This version takes, on average, 1.883 seconds to compile. The for/template
version takes 2.499 seconds, and an empty test suite takes 1.743 seconds.
Subtracting out the control time, it took 0.616 seconds more, or 5.4x
longer, to compile a fairly simple module with for/template than without.
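Here's that timing comparison spelled out as a small sketch. The variable
names are mine, and the times are the averages quoted above, converted to
whole milliseconds so the arithmetic stays exact. The extra cost comes to
roughly 0.6 seconds.

```racket
#lang racket/base

;; Average compile times from this message, in milliseconds.
(define manual-ms   1883) ; for/template forms unrolled by hand
(define template-ms 2499) ; for/template version
(define control-ms  1743) ; empty test suite

;; Time attributable to the tests themselves, control subtracted.
(define manual-overhead-ms   (- manual-ms control-ms))
(define template-overhead-ms (- template-ms control-ms))

;; Extra time spent by the for/template version, and the slowdown
;; factor relative to the hand-unrolled version (an exact rational).
(define extra-ms (- template-overhead-ms manual-overhead-ms))
(define slowdown (/ template-overhead-ms manual-overhead-ms))

(printf "extra: ~a ms, slowdown: ~ax\n"
        extra-ms
        (exact->inexact slowdown))
```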
Is the extra cost acceptable? I'm guessing that's highly context dependent.
In this case, adding half a second to compile one module wouldn't
inconvenience me terribly, but it doesn't take much imagination to find a
situation where it would, and I have no idea how any of these numbers will
scale.
Eric
On Sat, Mar 14, 2020 at 3:28 PM Eric Griffis wrote:
> Alright, I re-discovered Ryan Culpepper's talk, "The Cost of Sugar," from
> the RacketCon 2018 video stream (https://youtu.be/CLjXhr_TgP8?t=5908) and
> made some progress by following along.
>
> Here are the .zo files larger than 100K:
>
> 993K ./vector/compiled/tests_rkt.zo
> 830K ./scribblings/compiled/glm_scrbl.zo
> 328K ./vector/compiled/relational_rkt.zo
> 295K ./vec4/compiled/bool_rkt.zo
> 291K ./vec4/compiled/int_rkt.zo
> 290K ./vec4/compiled/uint_rkt.zo
> 290K ./vec4/compiled/double_rkt.zo
> 289K ./vec4/compiled/float_rkt.zo
> 280K ./vec3/compiled/bool_rkt.zo
> 276K ./vec3/compiled/int_rkt.zo
> 275K ./vec3/compiled/uint_rkt.zo
> 275K ./vec3/compiled/double_rkt.zo
> 274K ./vec3/compiled/float_rkt.zo
> 262K ./vec2/compiled/bool_rkt.zo
> 258K ./vec2/compiled/uint_rkt.zo
> 258K ./vec2/compiled/int_rkt.zo
> 258K ./vec2/compiled/double_rkt.zo
> 257K ./vec2/compiled/float_rkt.zo
> 213K ./vec1/compiled/bool_rkt.zo
> 210K ./vec1/compiled/uint_rkt.zo
> 210K ./vec1/compiled/int_rkt.zo
> 210K ./vec1/compiled/double_rkt.zo
> 209K ./vec1/compiled/float_rkt.zo
> 102K ./compiled/main_rkt.zo
> 101K ./compiled/vector_rkt.zo
>
> I'm pretty sure that's a lot of big files. It's for a port of GLM, a
> graphics math library that implements (among other things) fixed-length
> vectors of up to 4 components over 5 distinct scalar types, for a total of
> 20 distinct type-length combinations with many small variations in their
> APIs and implementations.
>
> The variations I'm targeting either require a macro or exacerbate
> developer- or run-time overhead when functions are introduced. For example,
> the base component accessors for a four-component vector of doubles are:
>
> dvec4-x
> dvec4-y
> dvec4-z
> dvec4-w
>
> Each of the "xyzw" components has two aliases -- one from "rgba" and
> another from "stpq". Each accessor also has a corresponding mutator, e.g.,
> dvec4-g and set-dvec4-g!.
>
> For another example, whereas adding two dvec4's sums four components,
>
> (dvec4
>  (fl+ (dvec4-x v1) (dvec4-x v2))
>  (fl+ (dvec4-y v1) (dvec4-y v2))
>  (fl+ (dvec4-z v1) (dvec4-z v2))
>  (fl+ (dvec4-w v1) (dvec4-w v2)))
>
> the same operation on dvec2's sums only the first two components.
>
> Furthermore, the sheer volume of the target code base makes writing
> everything out by hand a mind-numbing exercise in frustration, and that's
> when looking at a mere 20% of the pile. It's going to get much