[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-17 Thread rogerio.souza at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #50 from Rogério de Souza Moraes ---
Created attachment 44848
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44848=edit
GCC 6.3.0 consolidated patch based on Richard's patches

The attached patch is a backport of Richard's patches to GCC 6.3.0. If there
are any issues, please let me know.

Regards,
--
Rogerio

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #49 from Richard Biener  ---
Author: rguenth
Date: Wed Oct 17 08:49:00 2018
New Revision: 265235

URL: https://gcc.gnu.org/viewcvs?rev=265235=gcc=rev
Log:
2018-10-16  Richard Biener  

Backport from mainline
2018-10-08  Richard Biener  

PR tree-optimization/63155
* tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
(ssa_propagation_engine::ssa_propagate): Remove redundant
bitmap bit clearing.

2018-10-05  Richard Biener  

PR tree-optimization/63155
* tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
vertical space in dumpfiles.
* tree-ssa-propagate.h
(ssa_propagation_engine::process_ssa_edge_worklist): Remove.
* tree-ssa-propagate.c (cfg_blocks_back): New global.
(ssa_edge_worklist_back): Likewise.
(curr_order): Likewise.
(cfg_blocks_get): Remove abstraction.
(cfg_blocks_add): Likewise.
(cfg_blocks_empty_p): Likewise.
(add_ssa_edge): Add to current or next worklist based on
RPO index.
(add_control_edge): Likewise.
(ssa_propagation_engine::process_ssa_edge_worklist): Fold
into ...
(ssa_propagation_engine::ssa_propagate): ... here.  Unify
iteration from CFG and SSA edge worklist so we process
everything in RPO order, prioritizing forward progress
over iteration.
(ssa_prop_init): Allocate new worklists, do not dump
immediate uses.
(ssa_prop_fini): Free new worklists.

2018-09-24  Richard Biener  

PR tree-optimization/63155
* tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to
the worklist when the edge of the respective argument isn't
executable.

Modified:
branches/gcc-8-branch/gcc/tree-ssa-ccp.c
branches/gcc-8-branch/gcc/tree-ssa-propagate.h
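
The 2018-09-24 add_ssa_edge entry can be pictured with a small model. This is a sketch under stated assumptions, not GCC code: UseSite, uses_to_requeue and the flat edge_executable vector are hypothetical stand-ins. When an SSA name's lattice value changes, its uses are queued for re-simulation. A PHI argument that arrives over an edge which is not (yet) executable cannot affect the PHI result, so queuing the PHI for that use is wasted work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one immediate use of an SSA name.
struct UseSite {
  int stmt_id;        // statement containing the use
  bool is_phi;        // true if the use is a PHI argument
  int incoming_edge;  // index of the PHI argument's incoming edge (PHIs only)
};

// Collect the statements to re-simulate after the used SSA name changed.
// PHI arguments on edges not known to be executable are skipped, because the
// PHI meet ignores such arguments anyway (in a full propagator the PHI is
// simulated again once that edge becomes executable).
std::vector<int> uses_to_requeue(const std::vector<UseSite>& uses,
                                 const std::vector<bool>& edge_executable) {
  std::vector<int> worklist;
  for (const UseSite& u : uses) {
    if (u.is_phi) {
      assert(u.incoming_edge >= 0 &&
             static_cast<std::size_t>(u.incoming_edge) < edge_executable.size());
      if (!edge_executable[u.incoming_edge])
        continue;  // this argument cannot contribute yet
    }
    worklist.push_back(u.stmt_id);
  }
  return worklist;
}
```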

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-17 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #48 from Richard Biener  ---
Author: rguenth
Date: Wed Oct 17 07:01:28 2018
New Revision: 265231

URL: https://gcc.gnu.org/viewcvs?rev=265231=gcc=rev
Log:
2018-10-17  Richard Biener  

Backport from mainline
2018-10-08  Richard Sandiford  

PR middle-end/63155
* gimple-ssa-backprop.c (backprop::intersect_uses): Use
FOR_EACH_IMM_USE_FAST instead of FOR_EACH_IMM_USE_STMT.

Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/gimple-ssa-backprop.c

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #47 from Richard Biener  ---
Author: rguenth
Date: Tue Oct 16 13:23:56 2018
New Revision: 265193

URL: https://gcc.gnu.org/viewcvs?rev=265193=gcc=rev
Log:
2018-10-16  Richard Biener  

Backport from mainline
2018-10-08  Richard Biener  

PR tree-optimization/63155
* tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
(ssa_propagation_engine::ssa_propagate): Remove redundant
bitmap bit clearing.

2018-10-05  Richard Biener  

PR tree-optimization/63155
* tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
vertical space in dumpfiles.
* tree-ssa-propagate.h
(ssa_propagation_engine::process_ssa_edge_worklist): Remove.
* tree-ssa-propagate.c (cfg_blocks_back): New global.
(ssa_edge_worklist_back): Likewise.
(curr_order): Likewise.
(cfg_blocks_get): Remove abstraction.
(cfg_blocks_add): Likewise.
(cfg_blocks_empty_p): Likewise.
(add_ssa_edge): Add to current or next worklist based on
RPO index.
(add_control_edge): Likewise.
(ssa_propagation_engine::process_ssa_edge_worklist): Fold
into ...
(ssa_propagation_engine::ssa_propagate): ... here.  Unify
iteration from CFG and SSA edge worklist so we process
everything in RPO order, prioritizing forward progress
over iteration.
(ssa_prop_init): Allocate new worklists, do not dump
immediate uses.
(ssa_prop_fini): Free new worklists.

2018-09-24  Richard Biener  

PR tree-optimization/63155
* tree-ssa-propagate.c (add_ssa_edge): Avoid adding PHIs to
the worklist when the edge of the respective argument isn't
executable.

Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/tree-ssa-propagate.c

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #46 from Richard Biener  ---
Author: rguenth
Date: Tue Oct 16 11:23:22 2018
New Revision: 265189

URL: https://gcc.gnu.org/viewcvs?rev=265189=gcc=rev
Log:
2018-10-16  Richard Biener  

Backport from mainline
2018-09-18  Richard Biener  

PR middle-end/63155
* tree-ssa-coalesce.c (tree_int_map_hasher): Remove.
(compute_samebase_partition_bases): Likewise.
(coalesce_ssa_name): Always use compute_optimized_partition_bases.
(gimple_can_coalesce_p): Simplify.

Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/tree-ssa-coalesce.c

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #45 from Richard Biener  ---
Author: rguenth
Date: Tue Oct  9 11:43:46 2018
New Revision: 264956

URL: https://gcc.gnu.org/viewcvs?rev=264956=gcc=rev
Log:
2018-10-09  Richard Biener  

PR tree-optimization/63155
* tree-ssa-structalias.c: Include tree-ssa.h.
(get_constraint_for_ssa_var): For undefs return nothing_id.
(find_func_aliases): Cleanup PHI handling.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-ssa-structalias.c
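
The toy model below gives a rough picture of the get_constraint_for_ssa_var change. It uses hypothetical names and plain std::set solutions instead of GCC's constraint graph. A PHI result's points-to set is the union of its arguments' sets. Undefined SSA names contribute the empty set, which roughly corresponds to mapping them to nothing_id.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using PointsTo = std::set<std::string>;

// Toy points-to solution for a PHI result: the union of the solutions of its
// arguments. Arguments listed in 'undefined' (SSA names without a defining
// statement) contribute nothing, i.e. the empty set.
PointsTo phi_points_to(const std::vector<std::string>& args,
                       const std::map<std::string, PointsTo>& solution,
                       const std::set<std::string>& undefined) {
  PointsTo result;
  for (const std::string& arg : args) {
    if (undefined.count(arg))
      continue;  // undef: no constraint, nothing added
    auto it = solution.find(arg);
    assert(it != solution.end() && "every defined argument needs a solution");
    result.insert(it->second.begin(), it->second.end());
  }
  return result;
}
```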

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #44 from Richard Biener  ---
(In reply to Richard Biener from comment #43)
> This makes CCP the main
> offender again but as said the rectification would probably mean pulling
> back the SSA SCC discovery code from SCCVN and using that in the SSA
> propagator somehow.

I take that back.  SCC processing is quite fundamentally incompatible
with the way SSA propagation works.

But what would be possible is to add a non-optimistic mode to the SSA
propagator, removing the need to iterate at all.  That's some non-trivial
work though, possibly better spent teaching value numbering the bits
of CCP that it doesn't do (bit-value tracking, UNDEF handling) and then
killing off CCP altogether.
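
Here is a toy constant-propagation sketch that makes the optimistic/non-optimistic distinction concrete. It models one loop PHI, x_1 = PHI <init (entry), x_2 (latch)>, with x_2 = x_1 + step. The lattice, Val and solve_loop are hypothetical simplifications, not GCC's SSA propagator. In optimistic mode the not-yet-computed latch value is assumed UNDEFINED, so the PHI must be revisited until the backedge value stops changing. Assuming VARYING instead finishes in a single visit, but it cannot discover that x_1 is constant when step is 0.

```cpp
#include <cassert>

// Toy constant lattice: UNDEFINED < CONSTANT(c) < VARYING.
struct Val {
  enum Kind { Undefined, Constant, Varying };
  Kind kind;
  long c;  // meaningful only when kind == Constant
};

Val meet(Val a, Val b) {
  if (a.kind == Val::Undefined) return b;
  if (b.kind == Val::Undefined) return a;
  if (a.kind == Val::Constant && b.kind == Val::Constant && a.c == b.c) return a;
  return {Val::Varying, 0};
}

// Transfer function for x_2 = x_1 + step.
Val add(Val x, long step) {
  if (x.kind != Val::Constant) return x;
  return {Val::Constant, x.c + step};
}

// Simulates x_1 = PHI <init (entry), x_2 (latch)>; x_2 = x_1 + step.
Val solve_loop(long init, long step, bool optimistic, int& phi_visits) {
  // The latch argument is unknown when the PHI is first visited:
  // optimistic mode assumes UNDEFINED, non-optimistic mode assumes VARYING.
  Val latch = optimistic ? Val{Val::Undefined, 0} : Val{Val::Varying, 0};
  phi_visits = 0;
  for (;;) {
    ++phi_visits;
    // The latch value can only go UNDEFINED -> CONSTANT -> VARYING here,
    // so the PHI is visited at most three times.
    assert(phi_visits <= 3);
    Val x1 = meet(Val{Val::Constant, init}, latch);  // visit the PHI
    if (!optimistic)
      return x1;  // nothing assumed optimistically: no iteration needed
    Val x2 = add(x1, step);  // visit x_2 = x_1 + step
    if (x2.kind == latch.kind && x2.c == latch.c)
      return x1;  // backedge value unchanged: fixpoint reached
    latch = x2;   // backedge value changed: visit the PHI again
  }
}
```

With init = 0 and step = 0, the optimistic mode proves x_1 == 0 after two PHI visits. The non-optimistic mode returns VARYING after one visit.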

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #43 from Richard Biener  ---
We're now down to

 tree PTA               :   3.92 ( 16%)   0.12 ( 36%)   4.02 ( 16%)   12445 kB (  2%)
 tree CCP               :   7.43 ( 30%)   0.02 (  6%)   7.44 ( 29%)     646 kB (  0%)
 tree FRE               :   2.34 (  9%)   0.00 (  0%)   2.35 (  9%)     116 kB (  0%)
 tree backward propagate:   0.62 (  2%)   0.00 (  0%)   0.62 (  2%)       0 kB (  0%)
 out of ssa             :   3.01 ( 12%)   0.00 (  0%)   3.01 ( 12%)       0 kB (  0%)
 TOTAL                  :  24.91          0.33         25.26          573769 kB

Notice the tree backward propagate improvement.  This makes CCP the main
offender again, but as said, the rectification would probably mean pulling
back the SSA SCC discovery code from SCCVN and using that in the SSA
propagator somehow.

The out-of-SSA time is what this bug was originally about.

The tree PTA time is "new" and related to the number of PHI nodes
and edges.  You can disable PTA via -fno-tree-pta.

The tree FRE time is spent in PHI lookups/inserts; some refactoring can speed
this up a bit.

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-08 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #42 from rsandifo at gcc dot gnu.org ---
Author: rsandifo
Date: Mon Oct  8 18:58:59 2018
New Revision: 264941

URL: https://gcc.gnu.org/viewcvs?rev=264941=gcc=rev
Log:
Use FOR_EACH_IMM_USE_FAST in gimple-ssa-backprop.c

As pointed out by Richard in PR63155, this speeds up the testcase by a few percent.

2018-10-08  Richard Sandiford  

gcc/
PR middle-end/63155
* gimple-ssa-backprop.c (backprop::intersect_uses): Use
FOR_EACH_IMM_USE_FAST instead of FOR_EACH_IMM_USE_STMT.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/gimple-ssa-backprop.c

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-08 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #41 from rsandifo at gcc dot gnu.org ---
(In reply to Richard Biener from comment #39)
> Oh, and backprop is really intersect_uses () with
> 
>   FOR_EACH_IMM_USE_STMT (stmt, iter, var)
> {
> 
> being quadratic due to its stupid implementation (we really have many uses
> of vars).

Ouch, hadn't realised the difference between them was that severe.

> If the pass can deal with duplicate stmt uses just fine, using
> FOR_EACH_IMM_USE_FAST is going to be faster.

Yeah, should be fine here, since the function is just gathering
information.  Testing a patch...

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #40 from Richard Biener  ---
Author: rguenth
Date: Mon Oct  8 07:16:28 2018
New Revision: 264912

URL: https://gcc.gnu.org/viewcvs?rev=264912=gcc=rev
Log:
2018-10-08  Richard Biener  

PR tree-optimization/63155
* tree-ssa-propagate.c (add_ssa_edge): Do cheap check first.
(ssa_propagation_engine::ssa_propagate): Remove redundant
bitmap bit clearing.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-ssa-propagate.c

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

Richard Biener  changed:

           What    |Removed                     |Added
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #39 from Richard Biener  ---
Oh, and backprop is really intersect_uses () with

  FOR_EACH_IMM_USE_STMT (stmt, iter, var)
{

being quadratic due to its stupid implementation (we really have many uses
of vars).  If the pass can deal with duplicate stmt uses just fine, using
FOR_EACH_IMM_USE_FAST is going to be faster.
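
Here is a toy model of the cost argument. It is not GCC's iterator code; Use, Info and both functions are hypothetical. Visiting each statement exactly once means deduplicating uses that share a statement, and a naive deduplication that rescans earlier uses is quadratic in the number of uses. A pass that only intersects per-statement facts can safely see the same statement twice, since x & x == x, so a plain linear walk over the uses gives the same answer. The steps counters only make the difference visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: one immediate use of an SSA name, recording which
// statement it occurs in. A big PHI can hold many uses of the same name.
struct Use {
  int stmt_id;
};

// Hypothetical per-statement fact to be intersected (a bitmask).
using Info = std::uint32_t;

// Statement-at-a-time walk with naive deduplication: before handling a use,
// rescan the earlier uses to see whether its statement was already handled.
// Worst case O(n^2) comparisons for n uses.
Info intersect_per_stmt(const std::vector<Use>& uses,
                        const std::vector<Info>& info, long& steps) {
  Info acc = ~Info{0};
  for (std::size_t i = 0; i < uses.size(); ++i) {
    bool seen = false;
    for (std::size_t j = 0; j < i && !seen; ++j) {
      ++steps;
      seen = uses[j].stmt_id == uses[i].stmt_id;
    }
    if (!seen)
      acc &= info[uses[i].stmt_id];
  }
  return acc;
}

// Use-at-a-time walk: a statement with several uses is visited several
// times, which is harmless for an intersection. O(n) steps.
Info intersect_fast(const std::vector<Use>& uses,
                    const std::vector<Info>& info, long& steps) {
  Info acc = ~Info{0};
  for (const Use& u : uses) {
    assert(u.stmt_id >= 0 && static_cast<std::size_t>(u.stmt_id) < info.size());
    ++steps;
    acc &= info[u.stmt_id];
  }
  return acc;
}
```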

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #38 from Richard Biener  ---
For the last testcase, the compile time on trunk is now 25s at -O1:

 tree PTA               :   3.37 ( 13%)   0.10 ( 30%)   3.46 ( 13%)   12445 kB (  2%)
 tree CCP               :   4.61 ( 18%)   0.00 (  0%)   4.62 ( 18%)     646 kB (  0%)
 tree FRE               :   2.21 (  9%)   0.01 (  3%)   2.21 (  9%)     116 kB (  0%)
 tree backward propagate:   5.03 ( 20%)   0.00 (  0%)   5.04 ( 20%)       0 kB (  0%)
 out of ssa             :   3.05 ( 12%)   0.00 (  0%)   3.05 ( 12%)       0 kB (  0%)
 TOTAL                  :  25.39          0.33         25.72          573954 kB

and perf:

Samples: 9K of event 'instructions', Event count (approx.): 107285199390
Overhead   Samples  Command  Shared Object  Symbol
  18.06%      1195  cc1      cc1            [.] (anonymous namespace)::backprop::process_var
   5.58%       560  cc1      cc1            [.] visit_phi
   5.21%       476  cc1      cc1            [.] inchash::add_expr
   5.21%       671  cc1      cc1            [.] VN_INFO
   5.14%       493  cc1      cc1            [.] bitmap_set_bit
   3.13%       296  cc1      cc1            [.] hash_table::find_with_hash
   2.99%       287  cc1      cc1            [.] vn_phi_lookup
   2.39%       229  cc1      cc1            [.] bitmap_ior_into
   1.77%       165  cc1      cc1            [.] do_rpo_vn

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #37 from Richard Biener  ---
Author: rguenth
Date: Fri Oct  5 12:54:51 2018
New Revision: 264869

URL: https://gcc.gnu.org/viewcvs?rev=264869=gcc=rev
Log:
2018-10-05  Richard Biener  

PR tree-optimization/63155
* tree-ssa-ccp.c (ccp_propagate::visit_phi): Avoid excess
vertical space in dumpfiles.
* tree-ssa-propagate.h
(ssa_propagation_engine::process_ssa_edge_worklist): Remove.
* tree-ssa-propagate.c (cfg_blocks_back): New global.
(ssa_edge_worklist_back): Likewise.
(curr_order): Likewise.
(cfg_blocks_get): Remove abstraction.
(cfg_blocks_add): Likewise.
(cfg_blocks_empty_p): Likewise.
(add_ssa_edge): Add to current or next worklist based on
RPO index.
(add_control_edge): Likewise.
(ssa_propagation_engine::process_ssa_edge_worklist): Fold
into ...
(ssa_propagation_engine::ssa_propagate): ... here.  Unify
iteration from CFG and SSA edge worklist so we process
everything in RPO order, prioritizing forward progress
over iteration.
(ssa_prop_init): Allocate new worklists, do not dump
immediate uses.
(ssa_prop_fini): Free new worklists.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-ssa-ccp.c
trunk/gcc/tree-ssa-propagate.c
trunk/gcc/tree-ssa-propagate.h
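
The "current or next worklist based on RPO index" scheme can be sketched as follows. This is a simplified model under stated assumptions. It tracks only blocks, whereas the real engine also queues individual statements reached through SSA edges. It uses std::set in place of the real worklists, and RpoWorklist is a hypothetical name. Work at or behind the current RPO position is deferred to the next sweep, so each sweep only moves forward.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical model: blocks are identified by their RPO index. Two ordered
// sets stand in for the "current" and "next" worklists.
class RpoWorklist {
 public:
  // Work ahead of the current RPO position is done in this sweep (forward
  // progress first). Work at or behind it would mean going backwards over a
  // backedge, so it is deferred to the next sweep.
  void add(int rpo_index) {
    assert(rpo_index >= 0);
    if (rpo_index > curr_order_)
      current_.insert(rpo_index);
    else
      next_.insert(rpo_index);
  }

  // Pops the lowest pending RPO index; returns false once both lists are empty.
  bool pop(int& out) {
    if (current_.empty()) {
      if (next_.empty())
        return false;
      std::swap(current_, next_);  // start a new RPO sweep
      curr_order_ = -1;
    }
    out = *current_.begin();
    current_.erase(current_.begin());
    curr_order_ = out;
    return true;
  }

 private:
  std::set<int> current_, next_;
  int curr_order_ = -1;
};
```

Consider a CFG 0 -> 1 -> 2 -> 3 with a backedge 2 -> 1. The worklist pops 0, 1, 2 and 3, and only then revisits 1, so forward progress is prioritized over iteration.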

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-05 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #36 from rguenther at suse dot de  ---
On Thu, 4 Oct 2018, rogerio.souza at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155
> 
> --- Comment #35 from Rogério de Souza Moraes ---
> Created attachment 44791
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44791=edit
> Small testcase more similar to original environment
> 
> Hi Richard,
> 
> This is a new testcase, based on another file in the original environment.
> It’s quite small (7000 lines, 240 setjmp calls).
> 
> This code, with a somewhat complex but still simplified control structure,
> represents a state-machine implementation, which is very widely used by our
> customers. Another new factor is the nested setjmp calls. Of course, the
> original testcase is more complex, takes even more time, and shows an even
> larger difference.
> 
> You can run it using the following commands:
> 
> 
> time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0  -m32 -m32 -w -c -O0 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o
> 
> time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0  -m32 -m32 -w -c -O -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o
> 
> 
> Results:
> 
> GCC: 4.8.5 (from RHEL 7.5)
> 
> real    0m0.349s
> user    0m0.255s
> sys     0m0.083s
> 
> real    0m0.193s
> user    0m0.163s
> sys     0m0.023s
> 
> GCC: 6.3.0 (GCC 6.3.0 with Revision 264523 backported and applied to it)
> 
> real    0m32.235s
> user    0m30.486s
> sys     0m1.622s
> 
> real    3m34.203s
> user    3m33.726s
> sys     0m0.292s
> 
> The performance difference is significant in this test.

Thanks for the more realistic testcase.  I can confirm the above
and I also see a slowdown in GCC 9 compared to GCC 8 at -O1:

> /usr//bin/time gcc-8 -S t.c -O -fwrapv -mstackrealign -mpreferred-stack-boundary=4 -m32
157.48user 0.24system 2:37.78elapsed 99%CPU (0avgtext+0avgdata 888036maxresident)k
47704inputs+152outputs (8major+240936minor)pagefaults 0swaps

> /usr//bin/time gcc-9 -S t.c -O -fwrapv -mstackrealign -mpreferred-stack-boundary=4 -m32
197.61user 0.39system 3:18.08elapsed 99%CPU (0avgtext+0avgdata 890628maxresident)k
0inputs+184outputs (0major+259016minor)pagefaults 0swaps

Somehow it's still CCP that makes things slow:

 tree CCP               : 178.52 ( 89%)   0.01 (  2%) 178.55 ( 89%)     646 kB (  0%)

perf tells me it's

-   96.33%    29.55%  14801  cc1  cc1  [.] ccp_propagate::visit_phi
     ccp_propagate::visit_phi
   - ssa_propagation_engine::simulate_stmt
      + 49.51% ssa_propagation_engine::simulate_block
      + 46.82% ssa_propagation_engine::ssa_propagate

-   37.06%    28.98%  12421  cc1  cc1  [.] ccp_lattice_meet
   - ccp_lattice_meet
      + 37.02% ccp_propagate::visit_phi
      + 0.03% set_lattice_value

-    5.17%     5.17%   1949  cc1  cc1  [.] wi::bit_or >, generic_w
     wi::bit_or >, generic_wide_int > >
   - ccp_lattice_meet
      + 5.16% ccp_propagate::visit_phi
      + 0.01% set_lattice_value

-    4.02%     4.02%   1509  cc1  cc1  [.] canonicalize_value
   - canonicalize_value
      + 4.02% get_value_for_expr
      + 0.00% ccp_folder::get_value

-    2.90%     2.89%   1083  cc1  cc1  [.] wi::eq_p >, int>
     wi::eq_p >, int>
   - ccp_lattice_meet
      + 2.89% ccp_propagate::visit_phi
      + 0.00% set_lattice_value

As said, thanks for the testcase.
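
The profile above is dominated by ccp_lattice_meet called from ccp_propagate::visit_phi, with wi::bit_or underneath it. A simplified sketch of a CCP-style meet with bit-value tracking shows where that cost comes from. It assumes a 32-bit value/mask pair in which a set mask bit means "unknown"; LatticeValue, meet and visit_phi here are illustrative stand-ins, not GCC's definitions. Meeting two constants marks every differing bit unknown via OR/XOR. Each PHI visit performs one meet per argument on an executable edge, so a PHI with many arguments that is re-simulated many times multiplies that cost.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified CCP lattice with bit-value tracking (32-bit values only).
enum class Kind { Undefined, Constant, Varying };

struct LatticeValue {
  Kind kind;
  std::uint32_t value;  // known bits (meaningful only where mask is 0)
  std::uint32_t mask;   // 1 = this bit is unknown
};

const LatticeValue kVarying{Kind::Varying, 0u, ~0u};

LatticeValue meet(LatticeValue a, const LatticeValue& b) {
  if (a.kind == Kind::Undefined) return b;  // UNDEFINED is the identity
  if (b.kind == Kind::Undefined) return a;
  if (a.kind == Kind::Varying || b.kind == Kind::Varying) return kVarying;
  assert(a.kind == Kind::Constant && b.kind == Kind::Constant);
  // Bits unknown in either input, or differing between the known values,
  // become unknown.
  a.mask |= b.mask | (a.value ^ b.value);
  if (a.mask == ~0u) return kVarying;  // nothing is known any more
  a.value &= ~a.mask;                  // canonicalize unknown bits to 0
  return a;
}

struct PhiArg {
  LatticeValue val;
  bool edge_executable;
};

// One simulation of a PHI: meet over arguments on executable edges,
// stopping early once the result is VARYING.
LatticeValue visit_phi(const std::vector<PhiArg>& args) {
  LatticeValue result{Kind::Undefined, 0u, 0u};
  for (const PhiArg& arg : args) {
    if (!arg.edge_executable) continue;
    result = meet(result, arg.val);
    if (result.kind == Kind::Varying) break;
  }
  return result;
}
```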

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-10-04 Thread rogerio.souza at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #35 from Rogério de Souza Moraes ---
Created attachment 44791
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44791=edit
Small testcase more similar to original environment

Hi Richard,

This is a new testcase, based on another file in the original environment. It’s
quite small (7000 lines, 240 setjmp calls).

This code, with a somewhat complex but still simplified control structure,
represents a state-machine implementation, which is very widely used by our
customers. Another new factor is the nested setjmp calls. Of course, the
original testcase is more complex, takes even more time, and shows an even
larger difference.

You can run it using the following commands:


time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0  -m32 -m32 -w -c -O0 -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o

time gcc -DGCC -DLINUX_C -D_GLIBCXX_USE_CXX11_ABI=0  -m32 -m32 -w -c -O -pedantic -fwrapv -mstackrealign -mpreferred-stack-boundary=4 gcc_2nd_synth_pure_c_item.c -o gcc_2nd_synth_pure_c_item.o


Results:

GCC: 4.8.5 (from RHEL 7.5)

real    0m0.349s
user    0m0.255s
sys     0m0.083s

real    0m0.193s
user    0m0.163s
sys     0m0.023s

GCC: 6.3.0 (GCC 6.3.0 with Revision 264523 backported and applied to it)

real    0m32.235s
user    0m30.486s
sys     0m1.622s

real    3m34.203s
user    3m33.726s
sys     0m0.292s

The performance difference is significant in this test.

Regards,
--
Rogerio

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-09-26 Thread david at pgmasters dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

--- Comment #34 from David  ---
My primary concern in PR 87316 was memory usage, and this patch definitely
helps a lot with that.  Thanks!

Using -ftree-coalesce-vars helps on versions >= 4.9 and does not seem to have
an adverse effect on test coverage.

[Bug middle-end/63155] [6/7/8 Regression] memory hog

2018-09-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63155

Richard Biener  changed:

           What    |Removed                     |Added
      Known to work|                            |9.0
            Summary|[6/7/8/9 Regression] memory |[6/7/8 Regression] memory
                   |hog                         |hog

--- Comment #33 from Richard Biener  ---
So on trunk for the original testcase I now see

> /usr/bin/time ./cc1 -quiet testunity_Runner.i -std=c99
2.70user 0.16system 0:02.86elapsed 100%CPU (0avgtext+0avgdata 427672maxresident)k
0inputs+504outputs (0major+106295minor)pagefaults 0swaps

while on the same machine using GCC 4.8:

> /usr/bin/time /space/rguenther/install/gcc-4.8.5/bin/gcc -S testunity_Runner.i -std=c99
0.24user 0.01system 0:00.60elapsed 41%CPU (0avgtext+0avgdata 39424maxresident)k
30960inputs+504outputs (37major+8516minor)pagefaults 0swaps

so we've come a long way but are still regressed, which is somewhat unavoidable
because of the correctness fix that started this.

For reference GCC 8.2 numbers are

> /usr/bin/time /space/rguenther/install/gcc-8.2/bin/gcc -S testunity_Runner.i -std=c99
94.31user 2.46system 1:36.79elapsed 99%CPU (0avgtext+0avgdata 10172916maxresident)k
0inputs+504outputs (0major+2535422minor)pagefaults 0swaps

So overall I consider this issue fixed for trunk.