I want to provide more "actionable" information on how GCC optimizes code, both for us (GCC developers), and for advanced end-users.
For us, I want to make it easier to get a sense of how an optimization can be improved (bug-fixing). For end-users, I want to make it easier to figure out which command-line flags to try, or how to rework their code. In particular, given a workload (or benchmark), how does the user figure out: * "what did the optimizers do to the hottest code?" * "what *didn't* they do, and why not?" * "how can I make the code faster?" (either by editing the code, or the gcc options) and as GCC developers: * "how can we make GCC do a better job on this code?" The current situation ===================== I believe we currently have four ways of emitting information on what the optimizers are doing: (a) -fopt-info and its variants (e.g. "-fopt-info-all"). Consider this "-fopt-info" code (from "vect_analyze_loop_form"): if (dump_enabled_p ()) { dump_printf_loc (MSG_NOTE, vect_location, "Symbolic number of iterations is "); dump_generic_expr (MSG_NOTE, TDF_DETAILS, number_of_iterations); dump_printf (MSG_NOTE, "\n"); } This emits text to the destination(s) of "-fopt-info", e.g.: foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D) With "-fopt-info-all" this would be emitted to stderr. If "details" or "note" was added when enabling the dumpfile (e.g. via "-fdump-tree-all-details"), it would also be added to the dumpfile for the current pass. Passes are tagged as being members of zero or more optimization groups ("ipa", "loop", "inline", "omp", "vec"), and these can be used to filter which messages are emitted. For example, the "increase_alignment" IPA pass is tagged with both "loop" and "vec". (b) other messages emitted to dumpfiles. We have numerous other messages that are written just to the dumpfiles, and don't go through -fopt-info. Consider e.g. this from loop-unroll.c:unroll_loop_constant_iterations: if (dump_file) fprintf (dump_file, ";; Unrolled loop %d times, constant # of iterations %i insns\n", max_unroll, num_loop_insns (loop)); (c) IR dumps emitted to dumpfiles. execute_one_pass and execute_one_ipa_transform_pass have this code: if (dump_file) do_per_function (execute_function_dump, pass); which captures a textual dump of the current state of each function. (d) dot-format dumps emitted to the ".dot" dumpfiles (via the "-graph") suffix e.g. "-fdump-tree-all-graph". All of these options are marked as "GCC Developer Options". How does everyone go about investigating what GCC is doing on a given workload or benchmark? (FWIW in the past I've been using "-fdump-tree-all-graph -fdump-ipa-all-graph -fdump-rtl-all-graph" and using a custom .dot viewer that I wrote: https://github.com/davidmalcolm/gcc-viewer but it's something of an ad-hoc solution). Problems with the current situation =================================== -fopt-info: * There's no integration with profile data: the report has a "wall of text" feel, where the few hot sites in the code are hidden amongst the many cold sites, with all of them treated as of having equal importance. * -fopt-info emits lines of the form: foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D) This is purely capturing the source location of the deepest point of the inlining chain: if there are multiple places where a function has been inlined, we'll get multiple reports, with no way to distinguish them. * If a program attempts to parse this data, it can get source location plus a message, but there's no structure to the messages. For example, we can't extract the source location of the expressions that are referred to in the messages. * There are various places where we emit MSG_NOTES that look like sections and subsections, e.g.: if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "=== vect_analyze_data_ref_dependences ===\n"); This appears to be intended to express the start of a hierarchical section within the dump, but there's nothing to express the end of the section, and they're current printed in a flat way, again contributing to a "wall of text" feel. In my prototype I've experimented with capturing the nesting, and highlighting it using indentation. dumpfiles: * These are per-pass-instance, capturing every function at each pass. There's no easy way to see how a particular function "evolved" through the passes. * Some messages are written to dumpfiles, but don't go through -fopt-info. Some messages are gated with: (dump_flags & TDF_DETAILS) The divisions seems somewhat haphazard. None of the messages are tagged with where they came from; tracking them down requires grepping the GCC source code. Aside: LLVM's experience with remarks ===================================== LLVM added "remarks" as a kind of diagnostic for tracking optimization information. It has some similarities to our -fopt-info. However, they ran into issues with them. There's an excellent talk given by Adam Nemet at the November 2016 LLVM Developers' Meeting on "Compiler-assisted Performance Analysis": https://www.youtube.com/watch?v=qq0q1hfzidg http://llvm.org/devmtg/2016-11/Slides/Nemet-Compiler-assistedPerformanceAnalysis.pdf (part of http://www.llvm.org/devmtg/2016-11/ ) He ran into the "wall of text" issue: too much data, where you can't see what's important. His fix was to capture the data in a file format (as "optimization records") and to write a HTML tool for visualizing them, showing the optimizations prioritized by code hotness. This seems like a great idea, so I've been experimenting with it for GCC. Prototype of a GCC-based solution ================================= I've spent some time experimenting with this for GCC. I'm assuming the following constraints: * Preserve existing "-fopt-info" overall (whilst allowing for minor changes to formatting *within* individual messages). * Preserve existing dumpfile output, as it's what GCC contributors are used to. * No extra time/memory cost for the default "disabled" case. I assume it's acceptable to spend a little more time/memory when the user opts-in by enabling this instrumentation. I experimented with a few approaches, and the current prototype is something of a hybrid of them. The prototype aims: * to capture "optimization records", emitting them in a format that retains all pertinent information that can be easily parsed. This separates the data from its presentation. * to associate the optimization records with code hotness, and to provide a report that prioritizes the records, starting with the hottest code. * other metadata to capture include(s): * which pass is emitting this record (and data about the pass) * the inlining chain (rather than just the "deepest" source location) * as well as source location within the user's code, I'm capturing source location within *GCC* for where the optimization happens. For any message you can immediately go to the implementation that emitted it. I've implemented an HTML report generator, inspired by Adam Nemet's LLVM work, using the existing places where we're emitting "-fopt-info" information. Differences from Adam Nemet's work (as I understand it): * I've added hierarchical records, so that there can be a nesting structure of optimization notes (otherwise there's still too much of a "wall of text"). * capture of GCC source location * LLVM is using YAML for some reason; I used JSON. Given that I'm capturing some different things, I didn't attempt to use the same file format as LLVM. Example 1 ********* Consider this "-fopt-info" code (from "vect_analyze_loop_form"):: if (dump_enabled_p ()) { dump_printf_loc (MSG_NOTE, vect_location, "Symbolic number of iterations is "); dump_generic_expr (MSG_NOTE, TDF_DETAILS, number_of_iterations); dump_printf (MSG_NOTE, "\n"); } This purely emits text to any destination(s) of "-fopt-info", e.g. by writing: foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D) to stderr, and/or the dumpfile for the current pass. This can be rewritten for my optimization records prototype as: if (optinfo_enabled_p ()) { OPTINFO_VECT_NOTE << "symbolic number of iterations is " << details (number_of_iterations); } It can then be emitted in up to three ways: (a) as before via "-fopt-info":: foo.c:15:3: note: symbolic number of iterations is (unsigned int) n_9(D) (albeit with capitalization changes), to the "-fopt-info" destination(s). (b) It can be emitted as a "remark" through the diagnostics subsystem, showing the extra metadata:: foo.c:15:3: remark: symbolic number of iterations is ‘(unsigned int) n_9(D)’ [pass=vect] [count(guessed_local)=955630224] getting the standard diagnostics benefits: * announcing "In function 'blah':" etc when this changes. * printing the corresponding source lines (unless "-fno-diagnostics-show-caret"), and * colorizing the various parts of the message if stderr is at a tty. * possibly allowing for remarks to be used in DejaGnu tests to better associate testing of an optimization with the source line in question, rather than relying on a scan of the dumpfile (which is per-source file and thus rather "coarse-grained"). (c) It is emitted in JSON form through "-fsave-optimization-record": { "kind": "note", "count": { "quality": "guessed_local", "value": 9.5563e+08 }, "location": { "line": 15, "file": "foo.c", "column": 3 }, "pass": { "num": 166, "type": "gimple", "name": "vect", "optgroups": [ "loop", "vec" ] }, "message": [ "symbolic number of iterations is ", { "expr": "(unsigned int) n_8(D)" } ], "impl_location": { "line": 1674, "file": "../../src/gcc/tree-vect-loop.c", "function": "vect_analyze_loop_form" }, "function": "compute_sum" }, Note how this captures metadata such as profile count, function name, pass, etc. (currently the JSON is emitted on one long line, with no pretty indentation, but it's trivial to run it through a tool to make it easy to read). An example of a raw JSON output file can be seen here: https://dmalcolm.fedorapeople.org/gcc/2018-05-29/presentation-inlined.c.opt-record.json and a prettified version of the same (via indentation) can be seen at: https://dmalcolm.fedorapeople.org/gcc/2018-05-29/pretty.json Example 2 ********* Consider this "-fopt-info" code (from "vect_create_data_ref_ptr"): if (dump_enabled_p ()) { tree dr_base_type = TREE_TYPE (DR_BASE_OBJECT (dr)); dump_printf_loc (MSG_NOTE, vect_location, "create %s-pointer variable to type: ", get_tree_code_name (TREE_CODE (aggr_type))); dump_generic_expr (MSG_NOTE, TDF_SLIM, aggr_type); if (TREE_CODE (dr_base_type) == ARRAY_TYPE) dump_printf (MSG_NOTE, " vectorizing an array ref: "); else if (TREE_CODE (dr_base_type) == VECTOR_TYPE) dump_printf (MSG_NOTE, " vectorizing a vector ref: "); else if (TREE_CODE (dr_base_type) == RECORD_TYPE) dump_printf (MSG_NOTE, " vectorizing a record based array ref: "); else dump_printf (MSG_NOTE, " vectorizing a pointer ref: "); dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_BASE_OBJECT (dr)); dump_printf (MSG_NOTE, "\n"); } where the information is built up piecewise, with conditional logic. This purely emits text to any destination(s) of "-fopt-info", e.g.: foo.c:12:5: note: create vector_type-pointer variable to type: vector(4) int vectorizing a pointer ref: *arr_9(D) e.g. to stderr, and/or the dumpfile for the current pass. This can be rewritten by exposing the "pending_optinfo": if (optinfo_enabled_p ()) { tree dr_base_type = TREE_TYPE (DR_BASE_OBJECT (dr)); pending_optinfo info = OPTINFO_VECT_NOTE << optinfo_printf ("create %s-pointer variable to type: ", get_tree_code_name (TREE_CODE (aggr_type))) << slim (aggr_type); if (TREE_CODE (dr_base_type) == ARRAY_TYPE) info << " vectorizing an array ref: "; else if (TREE_CODE (dr_base_type) == VECTOR_TYPE) info << " vectorizing a vector ref: "; else if (TREE_CODE (dr_base_type) == RECORD_TYPE) info << " vectorizing a record based array ref: "; else info << " vectorizing a pointer ref: "; info << slim (DR_BASE_OBJECT (dr)); } The "pending_optinfo" is emitted when it goes out of scope. As before, it can be emitted in up to three ways: (a) as before via "-fopt-info": foo.c:12:5: note: create vector_type-pointer variable to type: vector(4) int vectorizing a pointer ref: *arr_9(D) (albeit with capitalization changes), to the "-fopt-info" destination(s). (b) It can be emitted as a "remark" through the diagnostics subsystem: foo.c:15:3: remark: create vector_type-pointer variable to type: ‘vector(4) int’ vectorizing a pointer ref: ‘*arr_9(D)’ [pass=vect] [count(guessed_local)=860067200] (c) It is emitted in JSON form through "-fsave-optimization-record": { "kind": "note", "count": { "quality": "guessed_local", "value": 8.60067e+08 }, "location": { "line": 15, "file": "foo.c", "column": 3 }, "pass": { "num": 166, "type": "gimple", "name": "vect", "optgroups": [ "loop", "vec" ] }, "message": [ "create vector_type-pointer variable to type: ", { "expr": "vector(4) int" }, " vectorizing a pointer ref: ", { "location": { "line": 16, "file": "foo.c", "column": 5 }, "expr": "*arr_9(D)" } ], "impl_location": { "line": 4702, "file": "../../src/gcc/tree-vect-data-refs.c", "function": "vect_create_data_ref_ptr" }, "function": "compute_sum" }, Note how the message isn't just text: here it has captured the source location of the pointer ref. I've not been a fan of the "<<" C++ idiom, but it seems to work well here for capturing the data we need in a relative terse internal API. (but do we need to internationalize the messages? We're not doing so at the moment) As well as capturing text and numbers, there's support for capturing trees, gimple statements, and symbol table nodes (in each case capturing their source location, but also the metadata that the text is talking about these things, allowing for richer views in the presentation). I'm thinking of adding support for basic blocks and for loops. Viewing the optimization records ******************************** I've written an HTML generator (in Python 3) https://github.com/davidmalcolm/gcc-opt-viewer which takes a directory full of these json files and emits a directory of HTML files, with an index.html showing the optimizations, prioritized from hottest code downwards. An example of HTML output can be seen here: * https://dmalcolm.fedorapeople.org/gcc/2018-05-18/pgo-demo-test/pgo-demo-test/ based on this source: * https://github.com/davidmalcolm/pgo-demo Clicking on the links in the "Source Location column shows the optimizations inline next to the source code they affect. It uses a little JavaScript to add expand/collapse toggles when the amount of text for an optimization gets too long. You can see the hierarchical structure of the notes via indentation (e.g. for the vectorization example). An alternative presentation could add an explicit tree structure with more expand/collapse toggles - this is just a prototype. I've ported many messages from the existing API to the "optinfo" API, but I deliberately didn't make significant changes to messages. But there's probably scope for improvements here. For example, in the "presentation-not-inlined.c" example at the link above, the vect pass emits: === analyzing loop === === analyze_loop_nest === === vect_analyze_loop_form === === get_loop_niters === symbolic number of iterations is (unsigned int) n_9(D) not vectorized: loop contains function calls or data references that cannot be analyzed which is useful, but more useful would be to know *which* function calls or data references couldn't be analyzed. So that would be something to tackle in followups (this post is more about discussing the overall direction). Clicking on the "Pass" links shows the GCC source that emitted the message - or, at least, a close approximation to it on the trunk (on the github mirror, to reduce traffic on our git server). Caveats ======= This is still just a rough prototype (I'm looking for feedback, especially from loop, vectorization, and IPA maintainers). Some known issues: * I haven't yet attempted to bootstrap this * there are lots of FIXMEs * the file format ought to be versioned * "remarks" are a bit half-baked in this patch kit, it's just on/off. It probably should support the optimization group tags from -fopt-info. LLVM has a -f[no-]diagnostics-show-hotness; etc. * I'm using a little C++11 (std::shared_ptr), which would need to be ported to C++98. * the patch kit is relative to r259650, from 2018-04-25 Known unknowns ============== * Should the "optinfo_enabled_p" use an "UNLIKELY" macro to tell a non-PGO build of gcc that all of this stuff is off by default and only used by a small subset of users? We could do this for dump_enabled_p for that matter. * Should the viewer generate static HTML, or serve dynamic HTML? When I've tried this on non-trivial examples, the index.html becomes unwieldy, so pagination would help a lot. That said, there could be multiple different viewers, a benefit of separating the data out from its presentation. * Should there be a way to pre-filter the optimization records before they're written out to JSON? Or is it OK to dump everything, and make filtering be the viewer's responsibility? (currently the patch kit dumps everything) * optimization records can be emitted before the profile counts have been loaded/setup. An example is the "einline" reports. Maybe there's a way to patch the records with profile information??? (which would mean storing them in memory before flushing them) * Should the JSON format interact with the other uses of dumpfiles? (both free-form messages, and capturing the IR) This would allow a viewer e.g. to follow a specific function, showing each state of it as it goes through the optimizers, rather than the current "show me every function at time T" view that we get from browsing dumpfiles. This could allow neat things like "show me every pass in which function FOO changed" etc. I've been experimenting with this, but it's not in the patch kit as is. * Should the output format capture the passes and their hierarchy? (passes.def, plus any from plugins). I'm thinking "yes", with the optimization records referencing back to those passes, rather than repeating the pass metadata in each optimization record. * Formatting convention mismatches: * diagnostics have lowercase start, and no trailing period * -fopt-info dumps have uppercase start, and often have a trailing period * but some of them have multiple sentences * Is there a natural tension here between logs (which are timed-ordered events) vs source markup (source-ordered)? Consider e.g. a collection of related messages from one subroutine. * Should optimization info be internationalized/localized? Currently "-fopt-info" isn't. I say let's *not* i18n it. This is straddling the border between implementation detail and advanced-user-facing. * My prototype respects TDF_DETAILS for the existing "-fopt-info" output, but given that it appears to have been applied haphazardly, and that we can filter in other ways, my prototype ignores it for the other output formats. It could be added. * Which JSON library to use for output? Note that we only need output, so that for the final version we can directly write the JSON, rather than the current approach which, inefficiently, requires first building a DOM-like tree in memory. It would be fairly simple to write JSON directly to a pretty_printer *. Thanks for reading this far. Thoughts? Dave David Malcolm (10): Convert dump and optgroup flags to enums Add JSON implementation Add optinfo, remarks and optimization records Use indentation to show nesting for -fopt-info Experiment with using optinfo for vectorization Experiments with using optinfo for inlining Experiment with using optinfo for loop-handling Experiment with using optinfo for devirtualization Experiment with using optinfo in gimple-loop-interchange.cc Experiment with optinfo in tree-ssa-loop-im.c gcc/Makefile.in | 8 + gcc/c-family/c-pretty-print.c | 2 +- gcc/cfg.c | 4 +- gcc/cfghooks.c | 2 +- gcc/common.opt | 9 + gcc/coretypes.h | 8 + gcc/diagnostic-color.c | 2 + gcc/diagnostic-core.h | 2 + gcc/diagnostic.c | 17 + gcc/diagnostic.def | 1 + gcc/doc/invoke.texi | 34 +- gcc/dumpfile.c | 72 +- gcc/dumpfile.h | 230 +++- gcc/early-remat.c | 2 +- gcc/fortran/gfc-diagnostic.def | 1 + gcc/gengtype.c | 3 +- gcc/gimple-fold.c | 29 +- gcc/gimple-loop-interchange.cc | 36 +- gcc/gimple-pretty-print.c | 2 +- gcc/gimple-ssa-store-merging.c | 6 +- gcc/gimple-ssa-strength-reduction.c | 2 +- gcc/gimple.c | 2 +- gcc/ipa-devirt.c | 14 +- gcc/ipa-inline.c | 25 + gcc/ipa.c | 17 +- gcc/json.cc | 1914 ++++++++++++++++++++++++++ gcc/json.h | 214 +++ gcc/opt-functions.awk | 1 + gcc/optinfo-emit-diagnostics.cc | 141 ++ gcc/optinfo-emit-diagnostics.h | 26 + gcc/optinfo-emit-fopt-info.cc | 106 ++ gcc/optinfo-emit-fopt-info.h | 26 + gcc/optinfo-emit-json.cc | 408 ++++++ gcc/optinfo-emit-json.h | 37 + gcc/optinfo-internal.h | 139 ++ gcc/optinfo.cc | 272 ++++ gcc/optinfo.h | 389 ++++++ gcc/opts.c | 4 + gcc/opts.h | 13 +- gcc/passes.c | 2 +- gcc/print-tree.c | 7 +- gcc/profile-count.c | 28 + gcc/profile-count.h | 5 + gcc/selftest-run-tests.c | 1 + gcc/selftest.h | 1 + gcc/testsuite/gcc.dg/plugin/plugin.exp | 2 + gcc/testsuite/gcc.dg/plugin/remarks-1.c | 32 + gcc/testsuite/gcc.dg/plugin/remarks_plugin.c | 158 +++ gcc/testsuite/lib/gcc-dg.exp | 9 + gcc/toplev.c | 5 + gcc/tree-inline.c | 10 + gcc/tree-pass.h | 2 +- gcc/tree-pretty-print.c | 2 +- gcc/tree-ssa-live.c | 4 +- gcc/tree-ssa-loop-im.c | 13 + gcc/tree-ssa-loop-ivcanon.c | 38 +- gcc/tree-ssa-loop-niter.c | 7 +- gcc/tree-ssa-math-opts.c | 4 +- gcc/tree-ssa-reassoc.c | 2 +- gcc/tree-ssa-sccvn.c | 2 +- gcc/tree-vect-data-refs.c | 979 ++++++------- gcc/tree-vect-loop-manip.c | 98 +- gcc/tree-vect-loop.c | 1390 +++++++++---------- gcc/tree-vect-patterns.c | 100 +- gcc/tree-vect-slp.c | 507 ++++--- gcc/tree-vect-stmts.c | 617 ++++----- gcc/tree-vectorizer.c | 42 +- gcc/tree-vectorizer.h | 37 + 68 files changed, 6193 insertions(+), 2131 deletions(-) create mode 100644 gcc/json.cc create mode 100644 gcc/json.h create mode 100644 gcc/optinfo-emit-diagnostics.cc create mode 100644 gcc/optinfo-emit-diagnostics.h create mode 100644 gcc/optinfo-emit-fopt-info.cc create mode 100644 gcc/optinfo-emit-fopt-info.h create mode 100644 gcc/optinfo-emit-json.cc create mode 100644 gcc/optinfo-emit-json.h create mode 100644 gcc/optinfo-internal.h create mode 100644 gcc/optinfo.cc create mode 100644 gcc/optinfo.h create mode 100644 gcc/testsuite/gcc.dg/plugin/remarks-1.c create mode 100644 gcc/testsuite/gcc.dg/plugin/remarks_plugin.c -- 1.8.5.3