[PATCH 00/10] RFC: Prototype of compiler-assisted performance analysis

David Malcolm Tue, 29 May 2018 13:32:47 -0700

I want to provide more "actionable" information on how GCC optimizes
code, both for us (GCC developers), and for advanced end-users.


For us, I want to make it easier to get a sense of how an optimization
can be improved (bug-fixing).
For end-users, I want to make it easier to figure out which command-line
flags to try, or how to rework their code.

In particular, given a workload (or benchmark), how does the user figure
out:
  * "what did the optimizers do to the hottest code?"
  * "what *didn't* they do, and why not?"
  * "how can I make the code faster?" (either by editing the code, or the
     gcc options)
and as GCC developers:
  * "how can we make GCC do a better job on this code?"


The current situation
=====================

I believe we currently have four ways of emitting information on what the
optimizers are doing:

(a) -fopt-info and its variants (e.g. "-fopt-info-all").

    Consider this "-fopt-info" code (from "vect_analyze_loop_form"):

      if (dump_enabled_p ())
        {
           dump_printf_loc (MSG_NOTE, vect_location,
                            "Symbolic number of iterations is ");
           dump_generic_expr (MSG_NOTE, TDF_DETAILS, number_of_iterations);
           dump_printf (MSG_NOTE, "\n");
        }

    This emits text to the destination(s) of "-fopt-info", e.g.:

       foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D)

    With "-fopt-info-all" this would be emitted to stderr.
    If "details" or "note" was added when enabling the dumpfile
    (e.g. via "-fdump-tree-all-details"), it would also be added to the
    dumpfile for the current pass.

    Passes are tagged as being members of zero or more optimization groups
    ("ipa", "loop", "inline", "omp", "vec"), and these can be used to filter
    which messages are emitted.  For example, the "increase_alignment" IPA
    pass is tagged with both "loop" and "vec".

(b) other messages emitted to dumpfiles.  We have numerous other messages
    that are written just to the dumpfiles, and don't go through -fopt-info.
    Consider e.g. this from loop-unroll.c:unroll_loop_constant_iterations:

      if (dump_file)
        fprintf (dump_file,
                 ";; Unrolled loop %d times, constant # of iterations %i 
insns\n",
                 max_unroll, num_loop_insns (loop));

(c) IR dumps emitted to dumpfiles.  execute_one_pass and
    execute_one_ipa_transform_pass have this code:

      if (dump_file)
        do_per_function (execute_function_dump, pass);

    which captures a textual dump of the current state of each function.

(d) dot-format dumps emitted to the ".dot" dumpfiles (via the "-graph") suffix
    e.g. "-fdump-tree-all-graph".

All of these options are marked as "GCC Developer Options".

How does everyone go about investigating what GCC is doing on a given
workload or benchmark?  (FWIW in the past I've been using
  "-fdump-tree-all-graph -fdump-ipa-all-graph -fdump-rtl-all-graph"
and using a custom .dot viewer that I wrote:
  https://github.com/davidmalcolm/gcc-viewer
but it's something of an ad-hoc solution).


Problems with the current situation
===================================

-fopt-info:

* There's no integration with profile data: the report has a "wall of text"
  feel, where the few hot sites in the code are hidden amongst the many
  cold sites, with all of them treated as of having equal importance.

* -fopt-info emits lines of the form:

       foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D)

  This is purely capturing the source location of the deepest point of the
  inlining chain: if there are multiple places where a function has been
  inlined, we'll get multiple reports, with no way to distinguish them.

* If a program attempts to parse this data, it can get source location
  plus a message, but there's no structure to the messages.  For example,
  we can't extract the source location of the expressions that are referred
  to in the messages.

* There are various places where we emit MSG_NOTES that look like sections
  and subsections, e.g.:

    if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
                      "=== vect_analyze_data_ref_dependences ===\n");

  This appears to be intended to express the start of a hierarchical
  section within the dump, but there's nothing to express the end of
  the section, and they're current printed in a flat way, again
  contributing to a "wall of text" feel.  In my prototype I've
  experimented with capturing the nesting, and highlighting it using
  indentation.

dumpfiles:

* These are per-pass-instance, capturing every function at each pass.
  There's no easy way to see how a particular function "evolved" through
  the passes.

* Some messages are written to dumpfiles, but don't go through -fopt-info.
  Some messages are gated with:

    (dump_flags & TDF_DETAILS)

  The divisions seems somewhat haphazard.

None of the messages are tagged with where they came from; tracking them
down requires grepping the GCC source code.


Aside: LLVM's experience with remarks
=====================================

LLVM added "remarks" as a kind of diagnostic for tracking optimization
information.  It has some similarities to our -fopt-info.

However, they ran into issues with them.  There's an excellent talk
given by Adam Nemet at the November 2016 LLVM Developers' Meeting
on "Compiler-assisted Performance Analysis":
  https://www.youtube.com/watch?v=qq0q1hfzidg
  
http://llvm.org/devmtg/2016-11/Slides/Nemet-Compiler-assistedPerformanceAnalysis.pdf
  (part of http://www.llvm.org/devmtg/2016-11/ )

He ran into the "wall of text" issue: too much data, where you
can't see what's important.

His fix was to capture the data in a file format (as "optimization records")
and to write a HTML tool for visualizing them, showing the optimizations
prioritized by code hotness.

This seems like a great idea, so I've been experimenting with it for GCC.


Prototype of a GCC-based solution
=================================

I've spent some time experimenting with this for GCC.

I'm assuming the following constraints:

* Preserve existing "-fopt-info" overall (whilst allowing for minor changes
  to formatting *within* individual messages).

* Preserve existing dumpfile output, as it's what GCC contributors are
  used to.

* No extra time/memory cost for the default "disabled" case.  I assume it's
  acceptable to spend a little more time/memory when the user opts-in by
  enabling this instrumentation.

I experimented with a few approaches, and the current prototype is
something of a hybrid of them.  The prototype aims:

* to capture "optimization records", emitting them in a format
  that retains all pertinent information that can be easily parsed.
  This separates the data from its presentation.

* to associate the optimization records with code hotness,
  and to provide a report that prioritizes the records, starting with
  the hottest code.

* other metadata to capture include(s):

  * which pass is emitting this record (and data about the pass)

  * the inlining chain (rather than just the "deepest" source location)

  * as well as source location within the user's code, I'm capturing
    source location within *GCC* for where the optimization happens.
    For any message you can immediately go to the implementation
    that emitted it.

I've implemented an HTML report generator, inspired by Adam Nemet's LLVM
work, using the existing places where we're emitting "-fopt-info"
information.

Differences from Adam Nemet's work (as I understand it):

* I've added hierarchical records, so that there can be a nesting
  structure of optimization notes (otherwise there's still too much
  of a "wall of text").

* capture of GCC source location

* LLVM is using YAML for some reason; I used JSON.  Given that I'm
  capturing some different things, I didn't attempt to use the same
  file format as LLVM.

Example 1
*********

Consider this "-fopt-info" code (from "vect_analyze_loop_form")::

  if (dump_enabled_p ())
    {
       dump_printf_loc (MSG_NOTE, vect_location,
                        "Symbolic number of iterations is ");
       dump_generic_expr (MSG_NOTE, TDF_DETAILS, number_of_iterations);
       dump_printf (MSG_NOTE, "\n");
    }

This purely emits text to any destination(s) of "-fopt-info", e.g.
by writing:

  foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D)

to stderr, and/or the dumpfile for the current pass.

This can be rewritten for my optimization records prototype as:

  if (optinfo_enabled_p ())
    {
      OPTINFO_VECT_NOTE
        << "symbolic number of iterations is "
        << details (number_of_iterations);
    }

It can then be emitted in up to three ways:

(a) as before via "-fopt-info"::

    foo.c:15:3: note: symbolic number of iterations is (unsigned int) n_9(D)

  (albeit with capitalization changes), to the "-fopt-info" destination(s).

(b) It can be emitted as a "remark" through the diagnostics subsystem,
    showing the extra metadata::

      foo.c:15:3: remark: symbolic number of iterations is ‘(unsigned int)
      n_9(D)’ [pass=vect] [count(guessed_local)=955630224]

    getting the standard diagnostics benefits:

    * announcing "In function 'blah':" etc when this changes.

    * printing the corresponding source lines (unless
      "-fno-diagnostics-show-caret"), and

    * colorizing the various parts of the message if stderr is at a tty.

    * possibly allowing for remarks to be used in DejaGnu tests to better
      associate testing of an optimization with the source line in
      question, rather than relying on a scan of the dumpfile (which is
      per-source file and thus rather "coarse-grained").

(c) It is emitted in JSON form through "-fsave-optimization-record":

      {
          "kind": "note",
          "count": {
              "quality": "guessed_local",
              "value": 9.5563e+08
          },
          "location": {
              "line": 15,
              "file": "foo.c",
              "column": 3
          },
          "pass": {
              "num": 166,
              "type": "gimple",
              "name": "vect",
              "optgroups": [
                  "loop",
                  "vec"
              ]
          },
          "message": [
              "symbolic number of iterations is ",
              {
                  "expr": "(unsigned int) n_8(D)"
              }
          ],
          "impl_location": {
              "line": 1674,
              "file": "../../src/gcc/tree-vect-loop.c",
              "function": "vect_analyze_loop_form"
          },
          "function": "compute_sum"
      },

    Note how this captures metadata such as profile count, function name,
    pass, etc.  (currently the JSON is emitted on one long line, with no
    pretty indentation, but it's trivial to run it through a tool to
    make it easy to read).

An example of a raw JSON output file can be seen here:
  
https://dmalcolm.fedorapeople.org/gcc/2018-05-29/presentation-inlined.c.opt-record.json
and a prettified version of the same (via indentation) can be seen at:
  https://dmalcolm.fedorapeople.org/gcc/2018-05-29/pretty.json


Example 2
*********

Consider this "-fopt-info" code (from "vect_create_data_ref_ptr"):

  if (dump_enabled_p ())
     {
       tree dr_base_type = TREE_TYPE (DR_BASE_OBJECT (dr));
       dump_printf_loc (MSG_NOTE, vect_location,
                        "create %s-pointer variable to type: ",
                        get_tree_code_name (TREE_CODE (aggr_type)));
       dump_generic_expr (MSG_NOTE, TDF_SLIM, aggr_type);
       if (TREE_CODE (dr_base_type) == ARRAY_TYPE)
         dump_printf (MSG_NOTE, "  vectorizing an array ref: ");
       else if (TREE_CODE (dr_base_type) == VECTOR_TYPE)
         dump_printf (MSG_NOTE, "  vectorizing a vector ref: ");
       else if (TREE_CODE (dr_base_type) == RECORD_TYPE)
         dump_printf (MSG_NOTE, "  vectorizing a record based array ref: ");
       else
         dump_printf (MSG_NOTE, "  vectorizing a pointer ref: ");
       dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_BASE_OBJECT (dr));
       dump_printf (MSG_NOTE, "\n");
     }

where the information is built up piecewise, with conditional logic.

This purely emits text to any destination(s) of "-fopt-info", e.g.:

    foo.c:12:5: note: create vector_type-pointer variable to type:
    vector(4) int  vectorizing a pointer ref: *arr_9(D)

e.g. to stderr, and/or the dumpfile for the current pass.

This can be rewritten by exposing the "pending_optinfo":

  if (optinfo_enabled_p ())
    {
      tree dr_base_type = TREE_TYPE (DR_BASE_OBJECT (dr));
      pending_optinfo info = OPTINFO_VECT_NOTE
       << optinfo_printf ("create %s-pointer variable to type: ",
                          get_tree_code_name (TREE_CODE (aggr_type)))
       << slim (aggr_type);
      if (TREE_CODE (dr_base_type) == ARRAY_TYPE)
       info << "  vectorizing an array ref: ";
      else if (TREE_CODE (dr_base_type) == VECTOR_TYPE)
       info << "  vectorizing a vector ref: ";
      else if (TREE_CODE (dr_base_type) == RECORD_TYPE)
       info << "  vectorizing a record based array ref: ";
      else
       info << "  vectorizing a pointer ref: ";
      info << slim (DR_BASE_OBJECT (dr));
    }

The "pending_optinfo" is emitted when it goes out of scope.  As before, it
can be emitted in up to three ways:

(a) as before via "-fopt-info":

    foo.c:12:5: note: create vector_type-pointer variable to type:
    vector(4) int  vectorizing a pointer ref: *arr_9(D)

  (albeit with capitalization changes), to the "-fopt-info" destination(s).

(b) It can be emitted as a "remark" through the diagnostics subsystem:

    foo.c:15:3: remark:   create vector_type-pointer variable to type:
    ‘vector(4) int’  vectorizing a pointer ref: ‘*arr_9(D)’ [pass=vect]
    [count(guessed_local)=860067200]

(c) It is emitted in JSON form through "-fsave-optimization-record":

      {
          "kind": "note",
          "count": {
              "quality": "guessed_local",
              "value": 8.60067e+08
          },
          "location": {
              "line": 15,
              "file": "foo.c",
              "column": 3
          },
          "pass": {
              "num": 166,
              "type": "gimple",
              "name": "vect",
              "optgroups": [
                  "loop",
                  "vec"
              ]
          },
          "message": [
              "create vector_type-pointer variable to type: ",
              {
                  "expr": "vector(4) int"
              },
              "  vectorizing a pointer ref: ",
              {
                  "location": {
                      "line": 16,
                      "file": "foo.c",
                      "column": 5
                  },
                  "expr": "*arr_9(D)"
              }
          ],
          "impl_location": {
              "line": 4702,
              "file": "../../src/gcc/tree-vect-data-refs.c",
              "function": "vect_create_data_ref_ptr"
          },
          "function": "compute_sum"
      },

Note how the message isn't just text: here it has captured the
source location of the pointer ref.  I've not been a fan of
the "<<" C++ idiom, but it seems to work well here for capturing
the data we need in a relative terse internal API.
(but do we need to internationalize the messages?  We're not
doing so at the moment)

As well as capturing text and numbers, there's support for capturing
trees, gimple statements, and symbol table nodes (in each case
capturing their source location, but also the metadata that the
text is talking about these things, allowing for richer views
in the presentation).  I'm thinking of adding support for
basic blocks and for loops.


Viewing the optimization records
********************************

I've written an HTML generator (in Python 3)
 https://github.com/davidmalcolm/gcc-opt-viewer
which takes a directory full of these json files and emits a directory
of HTML files, with an index.html showing the optimizations, prioritized
from hottest code downwards.

An example of HTML output can be seen here:

* https://dmalcolm.fedorapeople.org/gcc/2018-05-18/pgo-demo-test/pgo-demo-test/

based on this source:

* https://github.com/davidmalcolm/pgo-demo


Clicking on the links in the "Source Location column shows the
optimizations inline next to the source code they affect.

It uses a little JavaScript to add expand/collapse toggles when the amount
of text for an optimization gets too long.  You can see the hierarchical
structure of the notes via indentation (e.g. for the vectorization example).
An alternative presentation could add an explicit tree structure with
more expand/collapse toggles - this is just a prototype.

I've ported many messages from the existing API to the "optinfo" API, but
I deliberately didn't make significant changes to messages.  But there's
probably scope for improvements here.  For example, in the
"presentation-not-inlined.c" example at the link above, the vect pass emits:

   === analyzing loop ===
    === analyze_loop_nest ===
      === vect_analyze_loop_form ===
        === get_loop_niters ===
      symbolic number of iterations is (unsigned int) n_9(D)
      not vectorized: loop contains function calls or data references that 
cannot be analyzed

which is useful, but more useful would be to know *which* function calls
or data references couldn't be analyzed.  So that would be something
to tackle in followups (this post is more about discussing the overall
direction).

Clicking on the "Pass" links shows the GCC source that emitted the
message - or, at least, a close approximation to it on the trunk
(on the github mirror, to reduce traffic on our git server).


Caveats
=======

This is still just a rough prototype (I'm looking for feedback, especially
from loop, vectorization, and IPA maintainers).

Some known issues:

* I haven't yet attempted to bootstrap this

* there are lots of FIXMEs

* the file format ought to be versioned

* "remarks" are a bit half-baked in this patch kit, it's just on/off.
  It probably should support the optimization group tags from
  -fopt-info.  LLVM has a -f[no-]diagnostics-show-hotness; etc.

* I'm using a little C++11 (std::shared_ptr), which would need to be
  ported to C++98.

* the patch kit is relative to r259650, from 2018-04-25


Known unknowns
==============

* Should the "optinfo_enabled_p" use an "UNLIKELY" macro to
  tell a non-PGO build of gcc that all of this stuff is off by
  default and only used by a small subset of users?  We could
  do this for dump_enabled_p for that matter.

* Should the viewer generate static HTML, or serve dynamic HTML?
  When I've tried this on non-trivial examples, the index.html
  becomes unwieldy, so pagination would help a lot.  That said,
  there could be multiple different viewers, a benefit of separating
  the data out from its presentation.

* Should there be a way to pre-filter the optimization records before
  they're written out to JSON?  Or is it OK to dump everything, and
  make filtering be the viewer's responsibility?  (currently the patch
  kit dumps everything)

* optimization records can be emitted before the profile counts have
  been loaded/setup.  An example is the "einline" reports.  Maybe
  there's a way to patch the records with profile information???
  (which would mean storing them in memory before flushing them)

* Should the JSON format interact with the other uses of dumpfiles?
  (both free-form messages, and capturing the IR)  This would allow
  a viewer e.g. to follow a specific function, showing each state of
  it as it goes through the optimizers, rather than the current "show
  me every function at time T" view that we get from browsing dumpfiles.
  This could allow neat things like "show me every pass in which function
  FOO changed" etc.
  I've been experimenting with this, but it's not in the patch kit as
  is.

* Should the output format capture the passes and their hierarchy?
  (passes.def, plus any from plugins).  I'm thinking "yes", with the
  optimization records referencing back to those passes, rather than
  repeating the pass metadata in each optimization record.

* Formatting convention mismatches:

  * diagnostics have lowercase start, and no trailing period

  * -fopt-info dumps have uppercase start, and often have a trailing period

    * but some of them have multiple sentences

* Is there a natural tension here between logs (which are timed-ordered
  events) vs source markup (source-ordered)?  Consider e.g. a
  collection of related messages from one subroutine.

* Should optimization info be internationalized/localized? Currently
  "-fopt-info" isn't.  I say let's *not* i18n it.  This is straddling
  the border between implementation detail and advanced-user-facing.

* My prototype respects TDF_DETAILS for the existing "-fopt-info" output,
  but given that it appears to have been applied haphazardly, and that we
  can filter in other ways, my prototype ignores it for the other output
  formats.  It could be added.

* Which JSON library to use for output?  Note that we only need output, so
  that for the final version we can directly write the JSON, rather than the
  current approach which, inefficiently, requires first building a DOM-like
  tree in memory.  It would be fairly simple to write JSON directly to
  a pretty_printer *.

Thanks for reading this far.

Thoughts?
Dave


David Malcolm (10):
  Convert dump and optgroup flags to enums
  Add JSON implementation
  Add optinfo, remarks and optimization records
  Use indentation to show nesting for -fopt-info
  Experiment with using optinfo for vectorization
  Experiments with using optinfo for inlining
  Experiment with using optinfo for loop-handling
  Experiment with using optinfo for devirtualization
  Experiment with using optinfo in gimple-loop-interchange.cc
  Experiment with optinfo in tree-ssa-loop-im.c

 gcc/Makefile.in                              |    8 +
 gcc/c-family/c-pretty-print.c                |    2 +-
 gcc/cfg.c                                    |    4 +-
 gcc/cfghooks.c                               |    2 +-
 gcc/common.opt                               |    9 +
 gcc/coretypes.h                              |    8 +
 gcc/diagnostic-color.c                       |    2 +
 gcc/diagnostic-core.h                        |    2 +
 gcc/diagnostic.c                             |   17 +
 gcc/diagnostic.def                           |    1 +
 gcc/doc/invoke.texi                          |   34 +-
 gcc/dumpfile.c                               |   72 +-
 gcc/dumpfile.h                               |  230 +++-
 gcc/early-remat.c                            |    2 +-
 gcc/fortran/gfc-diagnostic.def               |    1 +
 gcc/gengtype.c                               |    3 +-
 gcc/gimple-fold.c                            |   29 +-
 gcc/gimple-loop-interchange.cc               |   36 +-
 gcc/gimple-pretty-print.c                    |    2 +-
 gcc/gimple-ssa-store-merging.c               |    6 +-
 gcc/gimple-ssa-strength-reduction.c          |    2 +-
 gcc/gimple.c                                 |    2 +-
 gcc/ipa-devirt.c                             |   14 +-
 gcc/ipa-inline.c                             |   25 +
 gcc/ipa.c                                    |   17 +-
 gcc/json.cc                                  | 1914 ++++++++++++++++++++++++++
 gcc/json.h                                   |  214 +++
 gcc/opt-functions.awk                        |    1 +
 gcc/optinfo-emit-diagnostics.cc              |  141 ++
 gcc/optinfo-emit-diagnostics.h               |   26 +
 gcc/optinfo-emit-fopt-info.cc                |  106 ++
 gcc/optinfo-emit-fopt-info.h                 |   26 +
 gcc/optinfo-emit-json.cc                     |  408 ++++++
 gcc/optinfo-emit-json.h                      |   37 +
 gcc/optinfo-internal.h                       |  139 ++
 gcc/optinfo.cc                               |  272 ++++
 gcc/optinfo.h                                |  389 ++++++
 gcc/opts.c                                   |    4 +
 gcc/opts.h                                   |   13 +-
 gcc/passes.c                                 |    2 +-
 gcc/print-tree.c                             |    7 +-
 gcc/profile-count.c                          |   28 +
 gcc/profile-count.h                          |    5 +
 gcc/selftest-run-tests.c                     |    1 +
 gcc/selftest.h                               |    1 +
 gcc/testsuite/gcc.dg/plugin/plugin.exp       |    2 +
 gcc/testsuite/gcc.dg/plugin/remarks-1.c      |   32 +
 gcc/testsuite/gcc.dg/plugin/remarks_plugin.c |  158 +++
 gcc/testsuite/lib/gcc-dg.exp                 |    9 +
 gcc/toplev.c                                 |    5 +
 gcc/tree-inline.c                            |   10 +
 gcc/tree-pass.h                              |    2 +-
 gcc/tree-pretty-print.c                      |    2 +-
 gcc/tree-ssa-live.c                          |    4 +-
 gcc/tree-ssa-loop-im.c                       |   13 +
 gcc/tree-ssa-loop-ivcanon.c                  |   38 +-
 gcc/tree-ssa-loop-niter.c                    |    7 +-
 gcc/tree-ssa-math-opts.c                     |    4 +-
 gcc/tree-ssa-reassoc.c                       |    2 +-
 gcc/tree-ssa-sccvn.c                         |    2 +-
 gcc/tree-vect-data-refs.c                    |  979 ++++++-------
 gcc/tree-vect-loop-manip.c                   |   98 +-
 gcc/tree-vect-loop.c                         | 1390 +++++++++----------
 gcc/tree-vect-patterns.c                     |  100 +-
 gcc/tree-vect-slp.c                          |  507 ++++---
 gcc/tree-vect-stmts.c                        |  617 ++++-----
 gcc/tree-vectorizer.c                        |   42 +-
 gcc/tree-vectorizer.h                        |   37 +
 68 files changed, 6193 insertions(+), 2131 deletions(-)
 create mode 100644 gcc/json.cc
 create mode 100644 gcc/json.h
 create mode 100644 gcc/optinfo-emit-diagnostics.cc
 create mode 100644 gcc/optinfo-emit-diagnostics.h
 create mode 100644 gcc/optinfo-emit-fopt-info.cc
 create mode 100644 gcc/optinfo-emit-fopt-info.h
 create mode 100644 gcc/optinfo-emit-json.cc
 create mode 100644 gcc/optinfo-emit-json.h
 create mode 100644 gcc/optinfo-internal.h
 create mode 100644 gcc/optinfo.cc
 create mode 100644 gcc/optinfo.h
 create mode 100644 gcc/testsuite/gcc.dg/plugin/remarks-1.c
 create mode 100644 gcc/testsuite/gcc.dg/plugin/remarks_plugin.c

-- 
1.8.5.3

[PATCH 00/10] RFC: Prototype of compiler-assisted performance analysis

Reply via email to