Hi,

I've been working a bit on testing infrastructure for systematically improving 
swiftc compilation performance. Attached is a document summarizing both the 
machinery I put together, and an example (small) bug recently fixed, as a case 
study showing how to use it.

I'd like to encourage folks interested in swiftc compile times to take a read 
through; especially if you happen to have an existing bug, or pattern of bad 
performance you're seeing (or even have a hunch about), I'd be interested to 
know if a scaling test can capture it. I'll be working through some compilation 
performance bugs using this framework for the next while, so if anyone would 
like to collaborate on investigating something, let me know and I'd be happy to 
help.

Thanks,

-Graydon

Compilation performance
=======================

This document describes a systematic method for detection,
isolation, correction and prevention of regression, on matters
of compiler performance. The method focuses on the creation of
small end-to-end _scaling tests_ that characterize, and stand in
for, larger codebases on which the compiler is seen to perform
poorly. These scaling tests have a number of advantages over
"full sized" performance-testing codebases:

  - They run quickly, just long enough to measure counters
  - They use a clear criterion (super-linearity) for "bad performance"
  - They are small and comprehensible
  - They can focus on a single, isolated performance problem
  - They can be written and shared by users, characterizing
    patterns in private codebases

The method is first described in general; we then work through an
example of finding and fixing a compilation-performance bug.

(The same method can also be used for measuring
and correcting problems in generated code, insofar as the
problem you're interested manifests as a thing that can be
counted while compiling, eg. quantity of LLVM code emitted.)

General method
--------------

1. Formulate a linear (or sub-linear) relationship that _should hold_
   between some aspect of program input size `N` and some measurement 
   `W` of work done by the compiler.

2. Add a _statistic_ counter to the compiler that measures `W`.

3. Write a small _scaling testcase_: a `.gyb` file varying in `N`,
   that can be run by `utils/scale-test`.

4. Run `utils/scale-test` selecting for `W`, and see if `W` exceeds 
   `O(N^1)`.

5. If not, goto #1 and examine a new relationship.

6. If so, run the testcase at a small scale under a debugger, breaking
   on the context you count `W` in, and find the unexpected contexts /
   causes of unexpected work.

7. Fix the unexpected causes of work, rerun the test to confirm 
   (sub-)linearity.

8. Add the testcase to the `validation-tests/compiler_scale` directory
   to prevent future performance regression.


Example
-------

As an example (arising from a bug reported in the field), consider
work done typechecking the function bodies of property getters and
setters that are synthesized for nominal types. Ideally, compiling a
multi-file module, we expect that only the getters and setters in
the current file being complied are subject to function-body-level
typechecking.

That is, our independent scaling variable `N` will scale the number
of files, and our dependent work variable `W` will count the number
of calls to the function-body typechecking routine in the compiler.
We will check to see that this relationship is linear.

Formulating this relationship precisely (as a testcase), we begin
by ensuring that the compiler has a statistic to measure the
dependent variable. To do this, we select a location in the compiler
to instrument, ensure the containing file includes `swift/Basic/Statistic.h` and defines a local macro `DEBUG_TYPE` 
naming the local context. Then we add a single `SWIFT_FUNC_STAT`
macro in the function we want to count. In this case,
we place a counter immediately inside `typeCheckAbstractFunctionBody`:

~~~~c++
--- a/lib/Sema/TypeCheckStmt.cpp
+++ b/lib/Sema/TypeCheckStmt.cpp
@@ -27,2 +27,3 @@
 #include "swift/Basic/SourceManager.h"
+#include "swift/Basic/Statistic.h"
 #include "swift/Parse/Lexer.h"
@@ -40,2 +41,4 @@ using namespace swift;
 
+#define DEBUG_TYPE "TypeCheckStmt"
+
 namespace {
@@ -1250,2 +1253,4 @@ bool TypeChecker::typeCheckAbstractFunctionBody(AbstractFunctionDecl *AFD) {
 
+  SWIFT_FUNC_STAT;
+
   Optional<FunctionBodyTimer> timer;
~~~~

Recall that the relationship we're interested in testing is one
that arises when doing a _multi-file_ compilation: when `swiftc`
is invoked with multiple input files comprising a module, and
its driver subsequently runs one _frontend job_ for each file
in the module, treating one file as _primary_ and parsing (but
not translating) all the other files as additional inputs.

This type of scale test is not the default mode of the `scale-test`
script, but it is supported with the `--sum-multi` command line
flag. In this mode, `scale-test` collects and sums-together all
statistics of all the primary-file frontend jobs in a multi-file
driver job.

In order to test the relationship, we start with a single
declaration in each file, and vary the number of files. The simplest
test therefore looks like this:

~~~~swift
struct Struct${N} {
    var Field : Int?
}
~~~~

When we run this file under `scale-test`, however, we see
only a linear relationship:

~~~~sh
$ utils/scale-test --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
~~~~

To see actual counts (to be sure we're not misunderstanding)
we can pass `--values`:

~~~~sh
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
                =  [30, 60, 90, 120, 150, 180, 210, 240, 270]
~~~~

This is encouraging, but since it doesn't reproduce bad behaviour
described in the bug report, we make our testcase just a
little bit more complicated by making the structs in each file
refer to the file adjacent to them in the module:

~~~~swift
struct Struct${N} {
% if int(N) > 1:
    var Field : Struct${int(N)-1}?
% else:
    var Field : Int?
% end
}
~~~~

Now the 0th struct has an `Int?`-typed property and every other struct
has a property that refers to a definition in a neighbouring file;
a bit more pathological than in most codebases, but not too
unrealistic. When we run this case under `scale-test`, we see the
problematic behaviour very clearly:

~~~~sh
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody scale.gyb
O(n^2.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
                =  [102, 402, 902, 1602, 2502, 3602, 4902, 6402, 8102]
~~~~

Next we get to the part that's hard to describe through anything
other than debugging experience: finding and fixing the bug. The
`scale-test` script can save us some difficulty in setup by
invoking `lldb` on one of the frontend jobs if we pass it `--debug`,
but beyond that it's up to us to diagnose and fix:

~~~~sh
$ utils/scale-test --debug --sum-multi scale.gyb
(lldb) target create "swiftc"
Current executable set to 'swiftc' (x86_64).
(lldb) settings set -- target.run-args  "-frontend" "-c" "-o" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/out.o" "-primary-file" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in0.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in0.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in1.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in2.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in3.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in4.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in5.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in6.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in7.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in8.swift" "/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/in9.swift" "-Xllvm" "-stats" "-Xllvm" "-stats-json" "-Xllvm" "-info-output-file=/var/folders/1z/kg4j1y2j5hn9m94ffs7xsfk00000gn/T/tmpilFrPQ/stats.json"
(lldb) br set --name typeCheckAbstractFunctionBody
Breakpoint 1: where = swiftc`swift::TypeChecker::typeCheckAbstractFunctionBody(swift::AbstractFunctionDecl*) + 21 [inlined] swift::AbstractFunctionDecl::getBodyKind() const at Decl.h:4657, address = 0x000000010094a465
(lldb) r
...
~~~~

In this case the unwanted work can be inhibited by modifying
`addTrivialAccessorsToStorage`; the fix involves redirecting
some calls from `typeCheckDecl` to the less-involved `validateDecl`
(along with some other compensating changes omitted for brevity here)

~~~~c++
...
@@ -765,4 +765,3 @@ void swift::addTrivialAccessorsToStorage(AbstractStorageDecl *storage,
   synthesizeTrivialGetter(getter, storage, TC);
-  TC.typeCheckDecl(getter, true);
-  TC.typeCheckDecl(getter, false);
+  TC.validateDecl(getter);
 
@@ -774,4 +773,3 @@ void swift::addTrivialAccessorsToStorage(AbstractStorageDecl *storage,
     synthesizeTrivialSetter(setter, storage, setterValueParam, TC);
-    TC.typeCheckDecl(setter, true);
-    TC.typeCheckDecl(setter, false);
+    TC.validateDecl(setter);
   }
...
~~~~

With our fix applied, the `scale-test` now shows correct
behaviour:

~~~~sh
$ utils/scale-test --values --sum-multi --select typeCheckAbstractFunctionBody --values scale.gyb
O(n^1.0) : TypeCheckStmt.typeCheckAbstractFunctionBody
                =  [30, 60, 90, 120, 150, 180, 210, 240, 270]
~~~~

To prevent future regression, we put the scale test in the
testsuite. The `scale-test` driver script will consider any
test as failing if it selects a counter that scales worse
than `O(n^1.2)` (to give a little room for error).

So our investigation testcase is nearly usable as a regression
test. We make a few modifications. First, we'll pass `--parse`
since there's no need to run full code generation each time.
Second we'll pass a more limited set of scaling steps, sufficient
to show the problem, but faster to run. Finally we'll use the
`lit.py` test-execution framework to both run the
test, and make the test conditional on a release-mode compiler,
to further limit the impact on testsuite execution time. The
final regression test, which we put in
`validation-test/compiler_scale/scale_neighbouring_getset.gyb`,
looks like this:

~~~~swift
// RUN: %scale-test --sum-multi --parse --begin 5 --end 16 --step 5 --select TypeCheckStmt %s
// REQUIRES: tools-release

struct Struct${N} {
% if int(N) > 1:
    var Field : Struct${int(N)-1}?
% else:
    var Field : Int?
% end
}
~~~~
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

Reply via email to