Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2015-02-22 Thread Tomas Vondra
On 22.2.2015 09:14, Jeff Davis wrote:
 
 On Wed, 2015-01-07 at 20:07 +0100, Tomas Vondra wrote:
 So I started digging in the code and I noticed this:

 hash_mem = MemoryContextMemAllocated(aggstate-hashcontext, true);

 which is IMHO obviously wrong, because that accounts only for the
 hashtable itself. It might be correct for aggregates with state passed
 by value, but it's incorrect for state passed by reference (e.g.
 Numeric, arrays etc.), because initialize_aggregates does this:

 oldContext = MemoryContextSwitchTo(aggstate-aggcontext);
 pergroupstate-transValue = datumCopy(peraggstate-initValue,
 peraggstate-transtypeByVal,
 peraggstate-transtypeLen);
 MemoryContextSwitchTo(oldContext);

 and it's also wrong for all the user-defined aggretates that have no
 access to hashcontext at all, and only get reference to aggcontext using
 AggCheckCallContext(). array_agg() is a prime example.
 
 Actually, I believe the first one is right, and the second one is
 wrong. If we allocate pass-by-ref transition states in aggcontext,
 that defeats the purpose of the patch, because we can't release the
 memory when we start a new batch (because we can't reset aggcontext).
 
 Instead, the transition states should be copied into hashcontext; we 
 should count the memory usage of hashcontext; AggCheckCallContext
 should return hashcontext; and after each batch we should reset
 hashcontext.

Hmmm, maybe. I'd however suggest not to stuff everything into a single
memory context, but to create a 'transvalue' context next to
hashcontext, and return that one from AggCheckCallContext(). I.e.
something like this:

aggcontext
hashcontext - hash table etc.
transcontext - new context for transition values

It's better to keep things separated, I guess.

 It might be worth refactoring a bit to call it the groupcontext or 
 something instead, and use it for both HashAgg and GroupAgg. That
 would reduce the need for conditionals.

Not sure that's worth it. The patch already seems complex enough.

 [ ... problems with array_agg subcontexts ... ]
 
 and it runs indefinitely (I gave up after a few minutes). I believe this
 renders the proposed memory accounting approach dead.
 
 ...
 
 The array_agg() patch I submitted to this CF would fix this
 particular query, as it removes the child contexts (so there's no
 need for recursion in MemoryContextMemAllocated), but it does
 nothing to the user-defined aggregates out there. And it's not
 committed yet.
 
 Now that your patch is committed, I don't think I'd call the memory 
 accounting patch I proposed dead yet. We decided that creating one 
 context per group is essentially a bug, so we don't need to optimize
 for that case.

Yes, but the question is how many other aggregates designed similarly to
array_agg are out there (in extensions atc.), and whether we actually
care about them. Because those will be equally impacted by this, and we
can't fix them.

But maybe we don't really care, and it's the author's responsibility to
test them on a new major version, and fix the performance issue just
like we did with array_agg().

 
 Your approach may be better, though.


I'm not sure - it really boils down do the goal and assumptions.

If we restrict ourselves to HashAggregate-only solution, under the
assumption that the number of child contexts is small (and leave the fix
up to the author of the extension), then your approach is probably
simpler and better than the approach I proposed.

If we're aiming for a general-purpose solution that might be used
elsewhere, then I'm afraid we can't make the 'small number of child
contexts' assumption and the simple solution may not be the right one.

Although I believe that we shouldn't be creating large numbers of
'parallel' memory contexts anyway.

 Thank you for reviewing. I'll update my patches and resubmit for the
  next CF.

Thanks for working on this.

-- 
Tomas Vondrahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2015-02-22 Thread Jeff Davis

On Wed, 2015-01-07 at 20:07 +0100, Tomas Vondra wrote:
 So I started digging in the code and I noticed this:
 
 hash_mem = MemoryContextMemAllocated(aggstate-hashcontext, true);
 
 which is IMHO obviously wrong, because that accounts only for the
 hashtable itself. It might be correct for aggregates with state passed
 by value, but it's incorrect for state passed by reference (e.g.
 Numeric, arrays etc.), because initialize_aggregates does this:
 
 oldContext = MemoryContextSwitchTo(aggstate-aggcontext);
 pergroupstate-transValue = datumCopy(peraggstate-initValue,
 peraggstate-transtypeByVal,
 peraggstate-transtypeLen);
 MemoryContextSwitchTo(oldContext);
 
 and it's also wrong for all the user-defined aggretates that have no
 access to hashcontext at all, and only get reference to aggcontext using
 AggCheckCallContext(). array_agg() is a prime example.

Actually, I believe the first one is right, and the second one is wrong.
If we allocate pass-by-ref transition states in aggcontext, that defeats
the purpose of the patch, because we can't release the memory when we
start a new batch (because we can't reset aggcontext).

Instead, the transition states should be copied into hashcontext; we
should count the memory usage of hashcontext; AggCheckCallContext should
return hashcontext; and after each batch we should reset hashcontext.

It might be worth refactoring a bit to call it the groupcontext or
something instead, and use it for both HashAgg and GroupAgg. That would
reduce the need for conditionals.



[ ... problems with array_agg subcontexts ... ]

 and it runs indefinitely (I gave up after a few minutes). I believe this
 renders the proposed memory accounting approach dead.

...

 The array_agg() patch I submitted to this CF would fix this particular
 query, as it removes the child contexts (so there's no need for
 recursion in MemoryContextMemAllocated), but it does nothing to the
 user-defined aggregates out there. And it's not committed yet.

Now that your patch is committed, I don't think I'd call the memory
accounting patch I proposed dead yet. We decided that creating one
context per group is essentially a bug, so we don't need to optimize for
that case.

Your approach may be better, though.

Thank you for reviewing. I'll update my patches and resubmit for the
next CF.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2015-01-07 Thread Tomas Vondra
Hi,

On 23.12.2014 10:16, Jeff Davis wrote:
 It seems that these two patches are being reviewed together. Should
 I just combine them into one? My understanding was that some wanted
 to review the memory accounting patch separately.
 
 On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
 That's the only conflict, and after fixing it it compiles OK.
 However, I got a segfault on the very first query I tried :-(
 
 If lookup_hash_entry doesn't find the group, and there's not enough 
 memory to create it, then it returns NULL; but the caller wasn't 
 checking for NULL. My apologies for such a trivial mistake, I was
 doing most of my testing using DISTINCT. My fix here was done
 quickly, so I'll take a closer look later to make sure I didn't miss
 something else.
 
 New patch attached (rebased, as well).

I did a review today, using these two patches:

* memory-accounting-v9.patch (submitted on December 2)
* hashagg-disk-20141222.patch

I started with some basic performance measurements comparing hashagg
queries without and with the patches (while you compared hashagg and
sort). That's IMHO an interesting comparison, especially when no
batching is necessary - in the optimal case the users should not see any
slowdown (we shouldn't make them pay for the batching unless it actually
is necessary).

So I did this:

drop table if exists test_hash_agg;

create table test_hash_agg as
select
i AS a,
mod(i,100) AS b,
mod(i,10) AS c,
mod(i,1) AS d,
mod(i,1000) AS e,
mod(i,100) AS f
from generate_series(1,1000) s(i);

vacuum (full, analyze) test_hash_agg;

i.e. a ~500MB table with 10M rows, and columns with different
cardinalities. And then queries like this:

   select count(*) from (select a, count(a) from test_hash_agg
 group by a) foo;

   -- 10M groups (OOM)
   select count(*) from (select a, array_agg(a) from test_hash_agg
 group by a) foo;

   -- 100 groups
   select count(*) from (select f, array_agg(f) from test_hash_agg
 group by f) foo;

which performed quite well, i.e. I've seen absolutely no slowdown.
Which, in the array_agg case, is quite is quite suspicious, because on
every lookup_hash_entry call, it has to do MemoryContextMemAllocated()
on 10M contexts, and I really doubt that can be done in ~0 time.

So I started digging in the code and I noticed this:

hash_mem = MemoryContextMemAllocated(aggstate-hashcontext, true);

which is IMHO obviously wrong, because that accounts only for the
hashtable itself. It might be correct for aggregates with state passed
by value, but it's incorrect for state passed by reference (e.g.
Numeric, arrays etc.), because initialize_aggregates does this:

oldContext = MemoryContextSwitchTo(aggstate-aggcontext);
pergroupstate-transValue = datumCopy(peraggstate-initValue,
peraggstate-transtypeByVal,
peraggstate-transtypeLen);
MemoryContextSwitchTo(oldContext);

and it's also wrong for all the user-defined aggretates that have no
access to hashcontext at all, and only get reference to aggcontext using
AggCheckCallContext(). array_agg() is a prime example.

In those cases the patch actually does no memory accounting and as
hashcontext has no child contexts, there's no accounting overhead.

After fixing this bug (replacing hashcontext with aggcontext in both
calls to MemoryContextMemAllocated) the performance drops dramatically.
For the query with 100 groups (not requiring any batching) I see this:

test=# explain analyze select count(x) from (select f, array_agg(1) AS x
from test_hash_agg group by f) foo;

QUERY PLAN (original patch)

 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
(actual time=2539.156..2539.156 rows=1 loops=1)
   -  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
  (actual time=2492.264..2539.012 rows=100 loops=1)
 Group Key: test_hash_agg.f
 Batches: 1  Memory Usage: 24kB  Disk Usage:0kB
 -  Seq Scan on test_hash_agg
 (cost=0.00..163693.71 rows=871 width=4)
 (actual time=0.022..547.379 rows=1000 loops=1)
 Planning time: 0.039 ms
 Execution time: 2542.932 ms
(7 rows)

QUERY PLAN (fixed patch)

 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
(actual time=5670.885..5670.885 rows=1 loops=1)
   -  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
  (actual time=5623.254..5670.803 rows=100 loops=1)
 Group Key: test_hash_agg.f
 Batches: 1  Memory Usage: 117642kB  Disk Usage:0kB
 -  Seq Scan on 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-30 Thread Jeff Janes
On Sun, Dec 28, 2014 at 12:45 PM, Peter Geoghegan p...@heroku.com wrote:

 On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis pg...@j-davis.com wrote:
  Do others have similar numbers? I'm quite surprised at how little
  work_mem seems to matter for these plans (HashJoin might be a different
  story though). I feel like I made a mistake -- can someone please do a
  sanity check on my numbers?

 I have seen external sorts that were quicker than internal sorts
 before. With my abbreviated key patch, under certain circumstances
 external sorts are faster, while presumably the same thing is true of
 int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
 that was marginally faster than a multi-gigabyte one that fit the
 entire sort in memory. It probably has something to do with caching
 effects dominating over the expense of more comparisons, since higher
 work_mem settings that still resulted in an external sort were slower
 than the 10MB setting.

 I was surprised by this too, but it has been independently reported by
 Jeff Janes.


I don't recall (at the moment) seeing our external sort actually faster
than quick-sort, but I've very reliably seen external sorts get faster with
less memory than with more.  It is almost certainly a CPU caching issue.
Very large simple binary heaps are horrible on the CPU cache. And for
sort-by-reference values, quick sort is also pretty bad.

With a slow enough data bus between the CPU and main memory, I don't doubt
that a 'tapesort' with small work_mem could actually be faster than
quicksort with large work_mem.  But I don't recall seeing it myself.  But
I'd be surprised that a tapesort as currently implemented would be faster
than a quicksort if the tapesort is using just one byte less memory than
the quicksort is.

But to Jeff Davis's question, yes, tapesort is not very sensitive to
work_mem, and to the extent it is sensitive it is in the other direction of
more memory being bad.  Once work_mem is so small that it takes multiple
passes over the data to do the merge, then small memory would really be a
problem.  But on modern hardware you have to get pretty absurd settings
before that happens.

Cheers,

Jeff


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-29 Thread Jim Nasby

On 12/28/14, 2:45 PM, Peter Geoghegan wrote:

On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis pg...@j-davis.com wrote:

Do others have similar numbers? I'm quite surprised at how little
work_mem seems to matter for these plans (HashJoin might be a different
story though). I feel like I made a mistake -- can someone please do a
sanity check on my numbers?


I have seen external sorts that were quicker than internal sorts
before. With my abbreviated key patch, under certain circumstances
external sorts are faster, while presumably the same thing is true of
int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
that was marginally faster than a multi-gigabyte one that fit the
entire sort in memory. It probably has something to do with caching
effects dominating over the expense of more comparisons, since higher
work_mem settings that still resulted in an external sort were slower
than the 10MB setting.

I was surprised by this too, but it has been independently reported by
Jeff Janes.


I haven't tested for external faster than internal in a while, but I've 
certainly seen this effect before. Generally, once you get beyond a certain 
size (maybe 100MB?) you run the risk of a tapesort being faster than an 
internal sort.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-28 Thread Jeff Davis
On Tue, 2014-12-23 at 01:16 -0800, Jeff Davis wrote:
 New patch attached (rebased, as well).
 
 I also see your other message about adding regression testing. I'm
 hesitant to slow down the tests for everyone to run through this code
 path though. Should I add regression tests, and then remove them later
 after we're more comfortable that it works?

Attached are some tests I ran. First, generate the data sets with
hashagg_test_data.sql. Then, do (I used work_mem at default of 4MB):

  set enable_hashagg=false;
  \o /tmp/sort.out
  \i /tmp/hashagg_test.sql
  \o
  set enable_hashagg=true;
  \o /tmp/hash.out
  \i /tmp/hashagg_test.sql

and then diff'd the output to make sure the results are the same (except
the plans, of course). The script loads the results into a temp table,
then sorts it before outputting, to make the test order-independent. I
didn't just add an ORDER BY, because that would change the plan and it
would never use hashagg.

I think that has fairly good coverage of the hashagg code. I used 3
different input data sets, byval and byref types (for group key and
args), and a group aggregate query as well as DISTINCT. Let me know if I
missed something.

I also did some performance comparisons between disk-based sort+group
and disk-based hashagg. The results are quite favorable for hashagg
given the data sets I provided. Simply create the data using
hashagg_test_data.sql (if not already done), set the work_mem to the
value you want to test, and run hashagg_test_perf.sql. It uses EXPLAIN
ANALYZE for the timings.

singleton: 10M groups of 1
even: 1M groups of 10
skew: wildly different group sizes; see data script

q1: group aggregate query
q2: distinct query

The total memory requirements for the test to run without going to disk
ranges from about 100MB (for even) to about 1GB (for singleton).
Regardless of work_mem, these all fit in memory on my machine, so they
aren't *really* going to disk. Also note that, because of how the memory
blocks are allocated, and that hashagg waits until memory is exceeded,
then hashagg might use about double work_mem when work_mem is small (the
effect is not important at higher values).

work_mem='1MB':
  sort+group (s)hashagg (s)
   singleton q1   12 10
   singleton q28  7
   even q114  7
   even q210  5
   skew q122  6
   skew q216  4

work_mem='4MB':
  sort+group (s)hashagg (s)
   singleton q1   12 11
   singleton q28  6
   even q112  7
   even q2 9  5
   skew q119  6
   skew q213  3

work_mem='16MB':
  sort+group (s)hashagg (s)
   singleton q1   12 11
   singleton q28  7
   even q114  7
   even q210  5
   skew q115  6
   skew q212  4

work_mem='64MB':
  sort+group (s)hashagg (s)
   singleton q1   13 12
   singleton q29  8
   even q114  8
   even q210  5
   skew q117  6
   skew q213  4

work_mem='256MB':
  sort+group (s)hashagg (s)
   singleton q1   12 12
   singleton q29  8
   even q114  7
   even q211  4
   skew q116  6
   skew q213  4

work_mem='512MB':
  sort+group (s)hashagg (s)
   singleton q1   12 12
   singleton q29  7
   even q114  7
   even q210  4
   skew q116  6
   skew q212  4

work_mem='2GB':
  sort+group (s)hashagg (s)
   singleton q19 12
   singleton q26  6
   even q1 8  7
   even q2 6  4
   skew q1 7  6
   skew q2 5  4


These numbers are great news for disk-based hashagg. It seems to be the
same or better than sort+group in nearly all cases (again, this example
doesn't actually go to disk, so those numbers may come out differently).
Also, the numbers are remarkably stable for varying work_mem for both
plans. That means that it doesn't cost much to keep a lower work_mem as
long as your system has plenty of memory.

Do others have similar numbers? I'm quite surprised at how little
work_mem seems to matter for these plans (HashJoin might 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-28 Thread Peter Geoghegan
On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis pg...@j-davis.com wrote:
 Do others have similar numbers? I'm quite surprised at how little
 work_mem seems to matter for these plans (HashJoin might be a different
 story though). I feel like I made a mistake -- can someone please do a
 sanity check on my numbers?

I have seen external sorts that were quicker than internal sorts
before. With my abbreviated key patch, under certain circumstances
external sorts are faster, while presumably the same thing is true of
int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
that was marginally faster than a multi-gigabyte one that fit the
entire sort in memory. It probably has something to do with caching
effects dominating over the expense of more comparisons, since higher
work_mem settings that still resulted in an external sort were slower
than the 10MB setting.

I was surprised by this too, but it has been independently reported by
Jeff Janes.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-28 Thread Jeff Davis
On Sun, 2014-12-28 at 12:37 -0800, Jeff Davis wrote:
 I feel like I made a mistake -- can someone please do a
 sanity check on my numbers?

I forgot to randomize the inputs, which doesn't matter much for hashagg
but does matter for sort. New data script attached. The results are even
*better* for disk-based hashagg than the previous numbers suggest. Here
are some new numbers:

work_mem='1MB':
  sort+group (s)hashagg (s)
   singleton q1   21 10
   singleton q2   12  8
   even q120  7
   even q213  5
   skew q122  6
   skew q216  4

work_mem='4MB':
  sort+group (s)hashagg (s)
   singleton q1   17 10
   singleton q2   11  6
   even q116  7
   even q211  5
   skew q119  6
   skew q213  4

work_mem='16MB':
  sort+group (s)hashagg (s)
   singleton q1   16 11
   singleton q2   11  7
   even q115  8
   even q212  6
   skew q115  6
   skew q212  4

work_mem='64MB':
  sort+group (s)hashagg (s)
   singleton q1   18 12
   singleton q2   13  8
   even q117 10
   even q213  6
   skew q117  6
   skew q214  4

work_mem='256MB':
  sort+group (s)hashagg (s)
   singleton q1   18 12
   singleton q2   14  7
   even q116  9
   even q214  5
   skew q118  6
   skew q213  4

work_mem='512MB':
  sort+group (s)hashagg (s)
   singleton q1   18 12
   singleton q2   14  7
   even q117  9
   even q214  5
   skew q117  6
   skew q213  4

work_mem='2GB':
  sort+group (s)hashagg (s)
   singleton q1   11 11
   singleton q27  6
   even q110  9
   even q2 7  5
   skew q1 7  6
   skew q2 4  4


Regards,
Jeff Davis



hashagg_test_data.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-23 Thread Jeff Davis
It seems that these two patches are being reviewed together. Should I
just combine them into one? My understanding was that some wanted to
review the memory accounting patch separately.

On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
 That's the only conflict, and after fixing it it compiles OK. However, I
 got a segfault on the very first query I tried :-(

If lookup_hash_entry doesn't find the group, and there's not enough
memory to create it, then it returns NULL; but the caller wasn't
checking for NULL. My apologies for such a trivial mistake, I was doing
most of my testing using DISTINCT. My fix here was done quickly, so I'll
take a closer look later to make sure I didn't miss something else.

New patch attached (rebased, as well).

I also see your other message about adding regression testing. I'm
hesitant to slow down the tests for everyone to run through this code
path though. Should I add regression tests, and then remove them later
after we're more comfortable that it works?

Regards
Jeff Davis

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 3045,3050  include_dir 'conf.d'
--- 3045,3065 
/listitem
   /varlistentry
  
+  varlistentry id=guc-enable-hashagg-disk xreflabel=enable_hashagg_disk
+   termvarnameenable_hashagg_disk/varname (typeboolean/type)
+   indexterm
+primaryvarnameenable_hashagg_disk/ configuration parameter/primary
+   /indexterm
+   /term
+   listitem
+para
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the planner expects the hash table size to exceed
+ varnamework_mem/varname. The default is literalon/.
+/para
+   /listitem
+  /varlistentry
+ 
   varlistentry id=guc-enable-hashjoin xreflabel=enable_hashjoin
termvarnameenable_hashjoin/varname (typeboolean/type)
indexterm
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 86,91  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
--- 86,92 
  	 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
+ static void show_hashagg_info(AggState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  	ExplainState *es);
  static void show_instrumentation_count(const char *qlabel, int which,
***
*** 1422,1427  ExplainNode(PlanState *planstate, List *ancestors,
--- 1423,1429 
  		case T_Agg:
  			show_agg_keys((AggState *) planstate, ancestors, es);
  			show_upper_qual(plan-qual, Filter, planstate, ancestors, es);
+ 			show_hashagg_info((AggState *) planstate, es);
  			if (plan-qual)
  show_instrumentation_count(Rows Removed by Filter, 1,
  		   planstate, es);
***
*** 1912,1917  show_sort_info(SortState *sortstate, ExplainState *es)
--- 1914,1955 
  }
  
  /*
+  * Show information on hash aggregate buckets and batches
+  */
+ static void
+ show_hashagg_info(AggState *aggstate, ExplainState *es)
+ {
+ 	Agg *agg = (Agg *)aggstate-ss.ps.plan;
+ 
+ 	Assert(IsA(aggstate, AggState));
+ 
+ 	if (agg-aggstrategy != AGG_HASHED)
+ 		return;
+ 
+ 	if (!aggstate-hash_init_state)
+ 	{
+ 		long	memPeakKb = (aggstate-hash_mem_peak + 1023) / 1024;
+ 		long	diskKb	  = (aggstate-hash_disk + 1023) / 1024;
+ 
+ 		if (es-format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es-str, es-indent * 2);
+ 			appendStringInfo(
+ es-str,
+ Batches: %d  Memory Usage: %ldkB  Disk Usage:%ldkB\n,
+ aggstate-hash_num_batches, memPeakKb, diskKb);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong(HashAgg Batches,
+ aggstate-hash_num_batches, es);
+ 			ExplainPropertyLong(Peak Memory Usage, memPeakKb, es);
+ 			ExplainPropertyLong(Disk Usage, diskKb, es);
+ 		}
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
*** a/src/backend/executor/execGrouping.c
--- b/src/backend/executor/execGrouping.c
***
*** 310,316  BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
  	hash_ctl.hcxt = tablecxt;
  	hashtable-hashtab = hash_create(TupleHashTable, nbuckets,
  	 hash_ctl,
! 	HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
  
  	return hashtable;
  }
--- 310,317 
  	hash_ctl.hcxt = tablecxt;
  	hashtable-hashtab = hash_create(TupleHashTable, nbuckets,
  	 hash_ctl,
! 	 HASH_ELEM | HASH_FUNCTION | HASH_COMPARE |
! 	 HASH_CONTEXT | HASH_NOCHILDCXT);
  
  	return hashtable;
  }
***
*** 331,336  TupleHashEntry
--- 332,386 
  LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
  	 bool *isnew)
  {
+ 	uint32 hashvalue;
+ 
+ 	hashvalue = TupleHashEntryHash(hashtable, slot);
+ 	return LookupTupleHashEntryHash(hashtable, 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-23 Thread Tomas Vondra
On 23.12.2014 10:16, Jeff Davis wrote:
 It seems that these two patches are being reviewed together. Should
 I just combine them into one? My understanding was that some wanted
 to review the memory accounting patch separately.

I think we should keep the patches separate. Applying two patches is
trivial, splitting them not so much.

 On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
 That's the only conflict, and after fixing it it compiles OK.
 However, I got a segfault on the very first query I tried :-(
 
 If lookup_hash_entry doesn't find the group, and there's not enough 
 memory to create it, then it returns NULL; but the caller wasn't 
 checking for NULL. My apologies for such a trivial mistake, I was
 doing most of my testing using DISTINCT. My fix here was done
 quickly, so I'll take a closer look later to make sure I didn't miss
 something else.
 
 New patch attached (rebased, as well).
 
 I also see your other message about adding regression testing. I'm 
 hesitant to slow down the tests for everyone to run through this
 code path though. Should I add regression tests, and then remove them
 later after we're more comfortable that it works?

I think when done right, the additional time will be negligible. By
setting the work_mem low, we don't need that much data.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-21 Thread Tomas Vondra
On 2.12.2014 06:14, Jeff Davis wrote:
 On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
 On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis pg...@j-davis.com wrote:
 I can also just move isReset there, and keep mem_allocated as a uint64.
 That way, if I find later that I want to track the aggregated value for
 the child contexts as well, I can split it into two uint32s. I'll hold
 off any any such optimizations until I see some numbers from HashAgg
 though.

 I took a quick look at memory-accounting-v8.patch.

 Is there some reason why mem_allocated is a uint64? All other things
 being equal, I'd follow the example of tuplesort.c's
 MemoryContextAllocHuge() API, which (following bugfix commit
 79e0f87a1) uses int64 variables to track available memory and so on.
 
 No reason. New version attached; that's the only change.

I've started reviewing this today. It does not apply cleanly on current
head, because of 4a14f13a committed a few days ago. That commit changed
the constants in  src/include/utils/hsearch.h, so the patch needs to use
this:

#define HASH_NOCHILDCXT 0x4000 /* ... */

That's the only conflict, and after fixing it it compiles OK. However, I
got a segfault on the very first query I tried :-(

  create table test_hash_agg as select i AS a, i AS b, i AS c, i AS d
from generate_series(1,1000) s(i);

  analyze test_hash_agg;

  select a, count(*) from test_hash_agg where a  10 group by a;

With a fresh cluster (default config), I get this:

  server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
  The connection to the server was lost. Attempting reset: Failed.

The backtrace looks like this (full attached):

Program received signal SIGSEGV, Segmentation fault.
advance_transition_function (aggstate=aggstate@entry=0x2b5c5f0,
peraggstate=peraggstate@entry=0x2b5efd0,
pergroupstate=pergroupstate@entry=0x8) at nodeAgg.c:468
468 if (pergroupstate-noTransValue)
(gdb) bt
#0  advance_transition_function at nodeAgg.c:468
#1  0x005c3494 in advance_aggregates at nodeAgg.c:624
#2  0x005c3dc2 in agg_fill_hash_table at nodeAgg.c:1640
#3  ExecAgg (node=node@entry=0x2b5c5f0) at nodeAgg.c:1338

(gdb) print pergroupstate-noTransValue
Cannot access memory at address 0x11
(gdb) print pergroupstate
$1 = (AggStatePerGroup) 0x8

My guess is something is scribbling over the pergroup memory, or maybe
the memory context gets released, or something. In any case, it's easily
reproducible, and apparently deterministic (always exactly the same
values, no randomness).

regards
Tomas
Program received signal SIGSEGV, Segmentation fault.
advance_transition_function (aggstate=aggstate@entry=0x2b5c5f0, 
peraggstate=peraggstate@entry=0x2b5efd0, pergroupstate=pergroupstate@entry=0x8) 
at nodeAgg.c:468
468 if (pergroupstate-noTransValue)
(gdb) bt
#0  advance_transition_function (aggstate=aggstate@entry=0x2b5c5f0, 
peraggstate=peraggstate@entry=0x2b5efd0, pergroupstate=pergroupstate@entry=0x8) 
at nodeAgg.c:468
#1  0x005c3494 in advance_aggregates 
(aggstate=aggstate@entry=0x2b5c5f0, pergroup=pergroup@entry=0x8) at 
nodeAgg.c:624
#2  0x005c3dc2 in agg_fill_hash_table (aggstate=0x2b5c5f0) at 
nodeAgg.c:1640
#3  ExecAgg (node=node@entry=0x2b5c5f0) at nodeAgg.c:1338
#4  0x005b6c68 in ExecProcNode (node=node@entry=0x2b5c5f0) at 
execProcnode.c:486
#5  0x005b3e90 in ExecutePlan (dest=0xbffa40 donothingDR, 
direction=optimized out, numberTuples=0, sendTuples=1 '\001', 
operation=CMD_SELECT, planstate=0x2b5c5f0, estate=0x2b5c4d8) at execMain.c:1480
#6  standard_ExecutorRun (queryDesc=0x2b58d78, direction=optimized out, 
count=0) at execMain.c:308
#7  0x0056df73 in ExplainOnePlan 
(plannedstmt=plannedstmt@entry=0x2b58ce0, into=into@entry=0x0, 
es=es@entry=0x7fff6fa066c0, queryString=optimized out, params=0x0, 
planduration=planduration@entry=0x7fff6fa06650) at explain.c:487
#8  0x0056e22d in ExplainOneQuery (query=optimized out, into=0x0, 
es=0x7fff6fa066c0, queryString=0x2b0bf68 explain analyze select a, count(*) 
from test_hash_agg where a  10 group by a;, params=0x0) at explain.c:341
#9  0x0056e670 in ExplainQuery (stmt=stmt@entry=0x2b0d248, 
queryString=0x2b0bf68 explain analyze select a, count(*) from test_hash_agg 
where a  10 group by a;, params=params@entry=0x0, dest=0x2b1b490) at 
explain.c:232
#10 0x006b5af6 in standard_ProcessUtility (parsetree=0x2b0d248, 
queryString=optimized out, context=optimized out, params=0x0, 
dest=optimized out, completionTag=0x7fff6fa067d0 ) at utility.c:648
#11 0x006b2c61 in PortalRunUtility (portal=portal@entry=0x2b4c3b8, 
utilityStmt=0x2b0d248, isTopLevel=optimized out, dest=dest@entry=0x2b1b490, 
completionTag=completionTag@entry=0x7fff6fa067d0 ) at pquery.c:1187
#12 0x006b3add in FillPortalStore (portal=portal@entry=0x2b4c3b8, 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-21 Thread Tomas Vondra
On 21.12.2014 20:19, Tomas Vondra wrote:

 However, I got a segfault on the very first query I tried :-(
 
   create table test_hash_agg as select i AS a, i AS b, i AS c, i AS d
 from generate_series(1,1000) s(i);
 
   analyze test_hash_agg;
 
   select a, count(*) from test_hash_agg where a  10 group by a;
 
 With a fresh cluster (default config), I get this:
 
   server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
   The connection to the server was lost. Attempting reset: Failed.
 
 The backtrace looks like this (full attached):
 
 Program received signal SIGSEGV, Segmentation fault.
 advance_transition_function (aggstate=aggstate@entry=0x2b5c5f0,
 peraggstate=peraggstate@entry=0x2b5efd0,
 pergroupstate=pergroupstate@entry=0x8) at nodeAgg.c:468
 468 if (pergroupstate-noTransValue)
 (gdb) bt
 #0  advance_transition_function at nodeAgg.c:468
 #1  0x005c3494 in advance_aggregates at nodeAgg.c:624
 #2  0x005c3dc2 in agg_fill_hash_table at nodeAgg.c:1640
 #3  ExecAgg (node=node@entry=0x2b5c5f0) at nodeAgg.c:1338
 
 (gdb) print pergroupstate-noTransValue
 Cannot access memory at address 0x11
 (gdb) print pergroupstate
 $1 = (AggStatePerGroup) 0x8
 
 My guess is something is scribbling over the pergroup memory, or maybe
 the memory context gets released, or something. In any case, it's easily
 reproducible, and apparently deterministic (always exactly the same
 values, no randomness).

BTW the fact that 'make installcheck' and 'make check' both pass yet a
simple query crashes, suggests there are no regression tests testing the
batching properly. Which should be part of the patch, I believe.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-07 Thread Michael Paquier
On Tue, Dec 2, 2014 at 2:14 PM, Jeff Davis pg...@j-davis.com wrote:
 On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
 On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis pg...@j-davis.com wrote:
  I can also just move isReset there, and keep mem_allocated as a uint64.
  That way, if I find later that I want to track the aggregated value for
  the child contexts as well, I can split it into two uint32s. I'll hold
  off any any such optimizations until I see some numbers from HashAgg
  though.

 I took a quick look at memory-accounting-v8.patch.

 Is there some reason why mem_allocated is a uint64? All other things
 being equal, I'd follow the example of tuplesort.c's
 MemoryContextAllocHuge() API, which (following bugfix commit
 79e0f87a1) uses int64 variables to track available memory and so on.

 No reason. New version attached; that's the only change.
Note that I am marking this patch back to Needs Review state. I
doesn't seem that this patch has been reviewed completely.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-01 Thread Jeff Davis
On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
 On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis pg...@j-davis.com wrote:
  I can also just move isReset there, and keep mem_allocated as a uint64.
  That way, if I find later that I want to track the aggregated value for
  the child contexts as well, I can split it into two uint32s. I'll hold
  off any any such optimizations until I see some numbers from HashAgg
  though.
 
 I took a quick look at memory-accounting-v8.patch.
 
 Is there some reason why mem_allocated is a uint64? All other things
 being equal, I'd follow the example of tuplesort.c's
 MemoryContextAllocHuge() API, which (following bugfix commit
 79e0f87a1) uses int64 variables to track available memory and so on.

No reason. New version attached; that's the only change.

Regards,
Jeff Davis


*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,454 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   AllocSetMethods,
!   parent,
!   name);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize  initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context-initBlockSize = initBlockSize;
! 	context-maxBlockSize = maxBlockSize;
! 	context-nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 462,470 
  	if (maxBlockSize  initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set-initBlockSize = initBlockSize;
! 	set-maxBlockSize = maxBlockSize;
! 	set-nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	context-allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context-allocChunkLimit + ALLOC_CHUNKHDRSZ) 
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		context-allocChunkLimit = 1;
  
  	/*
  	 * Grab always-allocated space, if requested
--- 480,489 
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	set-allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (set-allocChunkLimit + ALLOC_CHUNKHDRSZ) 
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		set-allocChunkLimit = 1;
  
  	/*
  	 * Grab always-allocated space, if requested
***
*** 500,519  AllocSetContextCreate(MemoryContext parent,
  	 errdetail(Failed while creating memory context \%s\.,
  			   name)));
  		}
! 		block-aset = context;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
! 		block-next = context-blocks;
! 		context-blocks = block;
  		/* Mark block as not to be released at reset time */
! 		context-keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block-freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return (MemoryContext) context;
  }
  
  /*
--- 503,525 
  	 errdetail(Failed while creating memory context \%s\.,
  			   name)));
  		}
! 
! 		context-mem_allocated += blksize;
! 
! 		block-aset = set;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
! 		block-next = set-blocks;
! 		set-blocks = block;
  		/* Mark block as not to be released at reset time */
! 		set-keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block-freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return context;
  }
  
  /*
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 596,603 
  		else
  		{
  			/* Normal case, release the block */
+ 			context-mem_allocated 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-30 Thread Peter Geoghegan
On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis pg...@j-davis.com wrote:
 I can also just move isReset there, and keep mem_allocated as a uint64.
 That way, if I find later that I want to track the aggregated value for
 the child contexts as well, I can split it into two uint32s. I'll hold
 off any any such optimizations until I see some numbers from HashAgg
 though.

I took a quick look at memory-accounting-v8.patch.

Is there some reason why mem_allocated is a uint64? All other things
being equal, I'd follow the example of tuplesort.c's
MemoryContextAllocHuge() API, which (following bugfix commit
79e0f87a1) uses int64 variables to track available memory and so on.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-21 Thread Tomas Vondra
On 21.11.2014 00:03, Andres Freund wrote:
 On 2014-11-17 21:03:07 +0100, Tomas Vondra wrote:
 On 17.11.2014 19:46, Andres Freund wrote:

 The MemoryContextData struct is embedded into AllocSetContext.

 Oh, right. That makes is slightly more complicated, though, because
 AllocSetContext adds 6 x 8B fields plus an in-line array of 
 freelist pointers. Which is 11x8 bytes. So in total 56+56+88=200B, 
 without the additional field. There might be some difference 
 because of alignment, but I still don't see how that one
 additional field might impact cachelines?
 
 It's actually 196 bytes:

U, I think the pahole output shows 192, not 196? Otherwise it
wouldn't be exactly 3 cachelines anyway.

But yeah - my math-foo was weak for a moment, because 6x8 != 56. Which
is the 8B difference :-/

 struct AllocSetContext {
 MemoryContextData  header;   /* 056 */
 AllocBlock blocks;   /*56 8 */
 /* --- cacheline 1 boundary (64 bytes) --- */
 AllocChunk freelist[11]; /*6488 */
 /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
 Size   initBlockSize;/*   152 8 */
 Size   maxBlockSize; /*   160 8 */
 Size   nextBlockSize;/*   168 8 */
 Size   allocChunkLimit;  /*   176 8 */
 AllocBlock keeper;   /*   184 8 */
 /* --- cacheline 3 boundary (192 bytes) --- */
 
 /* size: 192, cachelines: 3, members: 8 */
 };
 
 And thus one additional field tipps it over the edge.
 
 pahole is a very useful utility.

Indeed.

 But if we separated the freelist, that might actually make it
 faster, at least for calls that don't touch the freelist at all,
 no? Because most of the palloc() calls will be handled from the
 current block.
 
 I seriously doubt it. The additional indirection + additional
 branches are likely to make it worse.

That's possible, although I tried it on my version of the accounting
patch, and it showed slight improvement (lower overhead) on Robert's
reindex benchmark.

The question is how would that work with regular workload, because
moving the freelist out of the structure makes it smaller (2 cachelines
instead of 3), and it can only impact workloads working with the
freelists (i.e. either by calling free, or realloc, or whatever).
Although palloc() checks the freelist too ...

Also, those pieces may be allocated together (next to each other), which
might keep locality.

But I haven't tested any of this, and my knowledge of this low-level
stuff is poor, so I might be completely wrong.

regard
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-20 Thread Andres Freund
On 2014-11-17 21:03:07 +0100, Tomas Vondra wrote:
 On 17.11.2014 19:46, Andres Freund wrote:
  On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
  On 17.11.2014 18:04, Andres Freund wrote:
  Hi,
 
  On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
  *** a/src/include/nodes/memnodes.h
  --- b/src/include/nodes/memnodes.h
  ***
  *** 60,65  typedef struct MemoryContextData
  --- 60,66 
   MemoryContext nextchild;/* next child of same parent */
   char   *name;   /* context name (just 
  for debugging) */
   boolisReset;/* T = no space alloced 
  since last reset */
  +uint64  mem_allocated;  /* track memory allocated for 
  this context */
#ifdef USE_ASSERT_CHECKING
   boolallowInCritSection; /* allow palloc in 
  critical section */
#endif
 
  That's quite possibly one culprit for the slowdown. Right now one 
  AllocSetContext struct fits precisely into three cachelines. After
  your change it doesn't anymore.
 
  I'm no PPC64 expert, but I thought the cache lines are 128 B on that
  platform, since at least Power6?
  
  Hm, might be true.
  
  Also, if I'm counting right, the MemoryContextData structure is 56B
  without the 'mem_allocated' field (and without the allowInCritSection),
  and 64B with it (at that particular place). sizeof() seems to confirm
  that. (But I'm on x86, so maybe the alignment on PPC64 is different?).
  
  The MemoryContextData struct is embedded into AllocSetContext.
 
 Oh, right. That makes is slightly more complicated, though, because
 AllocSetContext adds 6 x 8B fields plus an in-line array of freelist
 pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the
 additional field. There might be some difference because of alignment,
 but I still don't see how that one additional field might impact cachelines?

It's actually 196 bytes:

struct AllocSetContext {
MemoryContextData  header;   /* 056 */
AllocBlock blocks;   /*56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
AllocChunk freelist[11]; /*6488 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
Size   initBlockSize;/*   152 8 */
Size   maxBlockSize; /*   160 8 */
Size   nextBlockSize;/*   168 8 */
Size   allocChunkLimit;  /*   176 8 */
AllocBlock keeper;   /*   184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */

/* size: 192, cachelines: 3, members: 8 */
};

And thus one additional field tipps it over the edge.

pahole is a very useful utility.

 But if we separated the freelist, that might actually make it faster, at
 least for calls that don't touch the freelist at all, no? Because most
 of the palloc() calls will be handled from the current block.

I seriously doubt it. The additional indirection + additional branches
are likely to make it worse.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-18 Thread Jim Nasby

On 11/17/14, 10:50 AM, Tomas Vondra wrote:

Either of these restrictions would prevent a situation where a context
has to update accounting for two parent contexts. That'd allow updating
a single place (because each context has a single parent context with
tracking requested).


Another option might be to be lazy on updating parent contexts. I'm thinking 
something like keep a boolean (or make a size of 0 magic) that indicates 
whether a context has up-to-date info from it's children or not. That would 
allow you to only update the complete size when you need it, but if your 
children haven't been touched since the last time you calculated then you're 
don't need to recalc.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Simon Riggs
On 17 November 2014 07:31, Jeff Davis pg...@j-davis.com wrote:
 On Sat, 2014-11-15 at 21:36 +, Simon Riggs wrote:
 Do I understand correctly that we are trying to account for exact
 memory usage at palloc/pfree time? Why??

 Not palloc chunks, only tracking at the level of allocated blocks (that
 we allocate with malloc).

 It was a surprise to me that accounting at that level would have any
 measurable impact, but Robert found a reasonable case on a POWER machine
 that degraded a couple percent. I wasn't able to reproduce it
 consistently on x86.

Surprise to me also.

Robert's tests showed a deviation of 0.4 sec after a restart. ISTM
that we wouldn't see that every time.

AFAIK the whole purpose of the memory allocator is to reduce the
number of system calls, so if we are doing so many malloc() calls as
to be noticeable just for accounting then something is wrong.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Simon Riggs
On 17 November 2014 07:31, Jeff Davis pg...@j-davis.com wrote:

 To calculate the total memory used, I included a function
 MemoryContextMemAllocated() that walks the memory context and its
 children recursively.

If we assume that we want the results accurate to within 5%, then we
can work out a good sampling rate to achieve that accuracy.

We would probably need a counting mechanism that didn't slow down O(N)
as our allocations get bigger.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Tomas Vondra
On 15.11.2014 22:36, Simon Riggs wrote:
 On 16 October 2014 02:26, Jeff Davis pg...@j-davis.com wrote:
 
 The inheritance is awkward anyway, though. If you create a tracked 
 context as a child of an already-tracked context, allocations in 
 the newer one won't count against the original. I don't see a way 
 around that without introducing even more performance problems.
 
 This seems to have reached impasse, which is a shame. Let me throw 
 out a few questions to see if that might help.
 
 Do I understand correctly that we are trying to account for exact 
 memory usage at palloc/pfree time? Why??
 
 Surely we just want to keep track of the big chunks? Does this need 
 to be exact? How exact does it need to be?

What do you mean by big chunks? Blocks, or really over-sized chunks
(over 8kB)? I assume blocks, and that's what the proposed patches are doing.

 Or alternatively, can't we just sample the allocations to reduce
 the overhead?

That might work, but I'm afraid it's trickier than it might seem.

 Are there some assumptions we can make about micro-contexts never 
 needing to be tracked at all?

What is a micro-context?

 Jeff seems not too bothered by inheritance, whereas Tomas thinks its
 essential. Perhaps there is a middle ground, depending upon the role
 of the context?

Having a hierarchy of memory contexts and an accounting completely
ignoring that hierarchy seems a bit strange to me.

Maybe we can impose some restrictions, for example:

  (a) a single context with tracking explicitly requested (in the whole
  hierarchy)

  (b) when tracking is requested for a context, the parent context must
  not have tracking enabled

Either of these restrictions would prevent a situation where a context
has to update accounting for two parent contexts. That'd allow updating
a single place (because each context has a single parent context with
tracking requested).

But I'm not sure that those restrictions are acceptable, that we can
enforce them reliably, or how frequently that'd break external code
(e.g. extensions with custom aggregates etc.). Because this is part of
the public API, including the option to enable accounting for a memory
context.

Imagine you write an extension with a custom aggregate (or whatever) and
it works just fine. Then we tweak something in the executor, requesting
accounting for another context, and the extension suddenly breaks.

Or maybe we don't change anything, and the aggregate works fine with
Group Aggregate, but breaks whenever the plan switches to Hash Aggregate
(because that's using accounting).

Or maybe the user just redirects the tracking somewhere else (e.g. by
supplying a pointer to the v6 of the patch), excluding the context from
the accounting entirely.

So yeah, I think the inheritance is an important aspect, at least for a
generic accounting infrastructure. I'd bet that if we choose to do that,
there'll be a good deal of various WTF moments and foot-shooting in the
near future ...

On 23/8 I proposed to use a separate hierarchy of accounting info,
parallel to the memory context hierarchy (but only for contexts with
tracking explicitly requested)

   http://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz

but no response so far. I do see ~1% regression on Robert's benchmark
(on x86), but it's quite noisy. Not sure about PPC, which is what Robert
used. I don't see how to make it faster without sacrificing full
inheritance support ...

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Andres Freund
Hi,

On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
 *** a/src/include/nodes/memnodes.h
 --- b/src/include/nodes/memnodes.h
 ***
 *** 60,65  typedef struct MemoryContextData
 --- 60,66 
   MemoryContext nextchild;/* next child of same parent */
   char   *name;   /* context name (just for 
 debugging) */
   boolisReset;/* T = no space alloced since 
 last reset */
 + uint64  mem_allocated;  /* track memory allocated for this 
 context */
   #ifdef USE_ASSERT_CHECKING
   boolallowInCritSection; /* allow palloc in critical 
 section */
   #endif

That's quite possibly one culprit for the slowdown. Right now one
AllocSetContext struct fits precisely into three cachelines. After your
change it doesn't anymore.

Consider not counting memory in bytes but blocks and adding it directly
after the NodeTag. That'd leave the size the same as right now.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Tomas Vondra
On 17.11.2014 08:31, Jeff Davis wrote:
 On Sat, 2014-11-15 at 21:36 +, Simon Riggs wrote:
 Do I understand correctly that we are trying to account for exact
 memory usage at palloc/pfree time? Why??
 
 Not palloc chunks, only tracking at the level of allocated blocks
 (that we allocate with malloc).
 
 It was a surprise to me that accounting at that level would have any 
 measurable impact, but Robert found a reasonable case on a POWER
 machine that degraded a couple percent. I wasn't able to reproduce
 it consistently on x86.
 
 Or alternatively, can't we just sample the allocations to reduce
 the overhead?
 
 Not sure quite what you mean by sample, but it sounds like
 something along those lines would work.

Maybe. It might also cause unexpected volatility / nondeterminism in the
query execution.

Imagine a Hash Aggregate, where a small number of the requests is
considerably larger than the rest. If you happen to sample only a few of
those large requests, the whole hash aggregate happens in-memory,
otherwise it starts batching. The users will observe this as random
variations to the runtimes - same query / parameters, idle machine,
sudden changes to the plan ...

Maybe I'm too wary, but I guess there are use cases where the latency
uniformity is a concern.

 Attached is a patch that does something very simple: only tracks 
 blocks held in the current context, with no inheritance or anything 
 like it. This reduces it to a couple arithmetic instructions added
 to the alloc/dealloc path, so hopefully that removes the general 
 performance concern raised by Robert[1].
 
 To calculate the total memory used, I included a function
 MemoryContextMemAllocated() that walks the memory context and its
 children recursively.
 
 Of course, I was originally trying to avoid that, because it moves 
 the problem to HashAgg. For each group, it will need to execute 
 MemoryContextMemAllocated() to see if work_mem has been exceeded. It
  will have to visit a couple contexts, or perhaps many (in the case 
 of array_agg, which creates one per group), which could be a 
 measurable regression for HashAgg.

:-(

 But if that does turn out to be a problem, I think it's solvable. 
 First, I could micro-optimize the function by making it iterative 
 rather than recursive, to save on function call overhead. Second, I 
 could offer a way to prevent the HTAB from creating its own context, 
 which would be one less context to visit. And if those don't work, 
 perhaps I could resort to a sampling method of some kind, as you 
 allude to above.

Do you plan to try this approach with your hashagg patch, and do some
measurements?

I'd expect the performance hit to be significant, but I'd like to see
some numbers.

kind regards
Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Tomas Vondra
On 17.11.2014 18:04, Andres Freund wrote:
 Hi,
 
 On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
 *** a/src/include/nodes/memnodes.h
 --- b/src/include/nodes/memnodes.h
 ***
 *** 60,65  typedef struct MemoryContextData
 --- 60,66 
  MemoryContext nextchild;/* next child of same parent */
  char   *name;   /* context name (just for 
 debugging) */
  boolisReset;/* T = no space alloced since 
 last reset */
 +uint64  mem_allocated;  /* track memory allocated for this 
 context */
   #ifdef USE_ASSERT_CHECKING
  boolallowInCritSection; /* allow palloc in critical 
 section */
   #endif
 
 That's quite possibly one culprit for the slowdown. Right now one 
 AllocSetContext struct fits precisely into three cachelines. After
 your change it doesn't anymore.

I'm no PPC64 expert, but I thought the cache lines are 128 B on that
platform, since at least Power6?

Also, if I'm counting right, the MemoryContextData structure is 56B
without the 'mem_allocated' field (and without the allowInCritSection),
and 64B with it (at that particular place). sizeof() seems to confirm
that. (But I'm on x86, so maybe the alignment on PPC64 is different?).

 Consider not counting memory in bytes but blocks and adding it
 directly after the NodeTag. That'd leave the size the same as right
 now.

I suppose you mean kbytes, because the block size is not fixed.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Andres Freund
On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
 On 17.11.2014 18:04, Andres Freund wrote:
  Hi,
  
  On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
  *** a/src/include/nodes/memnodes.h
  --- b/src/include/nodes/memnodes.h
  ***
  *** 60,65  typedef struct MemoryContextData
  --- 60,66 
 MemoryContext nextchild;/* next child of same parent */
 char   *name;   /* context name (just for 
  debugging) */
 boolisReset;/* T = no space alloced since 
  last reset */
  +  uint64  mem_allocated;  /* track memory allocated for this 
  context */
#ifdef USE_ASSERT_CHECKING
 boolallowInCritSection; /* allow palloc in critical 
  section */
#endif
  
  That's quite possibly one culprit for the slowdown. Right now one 
  AllocSetContext struct fits precisely into three cachelines. After
  your change it doesn't anymore.
 
 I'm no PPC64 expert, but I thought the cache lines are 128 B on that
 platform, since at least Power6?

Hm, might be true.

 Also, if I'm counting right, the MemoryContextData structure is 56B
 without the 'mem_allocated' field (and without the allowInCritSection),
 and 64B with it (at that particular place). sizeof() seems to confirm
 that. (But I'm on x86, so maybe the alignment on PPC64 is different?).

The MemoryContextData struct is embedded into AllocSetContext.

  Consider not counting memory in bytes but blocks and adding it
  directly after the NodeTag. That'd leave the size the same as right
  now.
 
 I suppose you mean kbytes, because the block size is not fixed.

Some coarser size than bytes. Don't really care about the granularity.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Tomas Vondra
On 17.11.2014 19:46, Andres Freund wrote:
 On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
 On 17.11.2014 18:04, Andres Freund wrote:
 Hi,

 On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
 *** a/src/include/nodes/memnodes.h
 --- b/src/include/nodes/memnodes.h
 ***
 *** 60,65  typedef struct MemoryContextData
 --- 60,66 
MemoryContext nextchild;/* next child of same parent */
char   *name;   /* context name (just for 
 debugging) */
boolisReset;/* T = no space alloced since 
 last reset */
 +  uint64  mem_allocated;  /* track memory allocated for this 
 context */
   #ifdef USE_ASSERT_CHECKING
boolallowInCritSection; /* allow palloc in critical 
 section */
   #endif

 That's quite possibly one culprit for the slowdown. Right now one 
 AllocSetContext struct fits precisely into three cachelines. After
 your change it doesn't anymore.

 I'm no PPC64 expert, but I thought the cache lines are 128 B on that
 platform, since at least Power6?
 
 Hm, might be true.
 
 Also, if I'm counting right, the MemoryContextData structure is 56B
 without the 'mem_allocated' field (and without the allowInCritSection),
 and 64B with it (at that particular place). sizeof() seems to confirm
 that. (But I'm on x86, so maybe the alignment on PPC64 is different?).
 
 The MemoryContextData struct is embedded into AllocSetContext.

Oh, right. That makes is slightly more complicated, though, because
AllocSetContext adds 6 x 8B fields plus an in-line array of freelist
pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the
additional field. There might be some difference because of alignment,
but I still don't see how that one additional field might impact cachelines?

But if we separated the freelist, that might actually make it faster, at
least for calls that don't touch the freelist at all, no? Because most
of the palloc() calls will be handled from the current block.

I tried this on the patch I sent on 23/8 (with the MemoryAccounting
hierarchy parallel to the contexts), and I got this (running 10x the
reindex, without restarts or such):

without the patch
-
  ... 1723933 KB used: CPU 1.35s/4.33u sec elapsed 10.38 sec
  ... 1723933 KB used: CPU 1.39s/4.19u sec elapsed 9.89 sec
  ... 1723933 KB used: CPU 1.39s/4.26u sec elapsed 9.90 sec
  ... 1723933 KB used: CPU 1.45s/4.22u sec elapsed 10.15 sec
  ... 1723933 KB used: CPU 1.36s/4.20u sec elapsed 9.84 sec
  ... 1723933 KB used: CPU 1.32s/4.20u sec elapsed 9.96 sec
  ... 1723933 KB used: CPU 1.50s/4.23u sec elapsed 9.91 sec
  ... 1723933 KB used: CPU 1.47s/4.24u sec elapsed 9.90 sec
  ... 1723933 KB used: CPU 1.38s/4.23u sec elapsed 10.09 sec
  ... 1723933 KB used: CPU 1.37s/4.19u sec elapsed 9.84 sec

with the patch (no tracking enabled)

  ... 1723933 KB used: CPU 1.40s/4.28u sec elapsed 10.79 sec
  ... 1723933 KB used: CPU 1.39s/4.32u sec elapsed 10.14 sec
  ... 1723933 KB used: CPU 1.40s/4.16u sec elapsed 9.99 sec
  ... 1723933 KB used: CPU 1.32s/4.21u sec elapsed 10.16 sec
  ... 1723933 KB used: CPU 1.37s/4.34u sec elapsed 10.03 sec
  ... 1723933 KB used: CPU 1.35s/4.22u sec elapsed 9.97 sec
  ... 1723933 KB used: CPU 1.43s/4.20u sec elapsed 9.99 sec
  ... 1723933 KB used: CPU 1.39s/4.17u sec elapsed 9.71 sec
  ... 1723933 KB used: CPU 1.44s/4.16u sec elapsed 9.91 sec
  ... 1723933 KB used: CPU 1.40s/4.25u sec elapsed 9.99 sec

So it's about 9,986 vs 10,068 for averages, and 9,905 vs 9,99 for a
median. I.e. ~0.8% difference. That's on x86, though. I wonder what'd be
the effect on the PPC machine.

Patch attached - it's not perfect, though. For example it should free
the freelists at some point, but I feel a bit lazy today.

Tomas
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 743455e..d2ecb5b 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -175,7 +175,7 @@ typedef struct AllocSetContext
 	MemoryContextData header;	/* Standard memory-context fields */
 	/* Info about storage allocated in this context: */
 	AllocBlock	blocks;			/* head of list of blocks in this set */
-	AllocChunk	freelist[ALLOCSET_NUM_FREELISTS];		/* free chunk lists */
+	AllocChunk *freelist;		/* free chunk lists */
 	/* Allocation parameters for this context: */
 	Size		initBlockSize;	/* initial block size */
 	Size		maxBlockSize;	/* maximum block size */
@@ -242,6 +242,8 @@ typedef struct AllocChunkData
 #define AllocChunkGetPointer(chk)	\
 	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
 
+static void update_allocation(MemoryContext context, int64 size);
+
 /*
  * These functions implement the MemoryContext API for AllocSet contexts.
  */
@@ -250,7 +252,7 @@ static void AllocSetFree(MemoryContext context, void *pointer);
 static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
 static void AllocSetInit(MemoryContext 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Jeff Davis
On Mon, 2014-11-17 at 18:04 +0100, Andres Freund wrote:
 That's quite possibly one culprit for the slowdown. Right now one
 AllocSetContext struct fits precisely into three cachelines. After your
 change it doesn't anymore.

Thank you! I made the same mistake as Tomas: I saw that
MemoryContextData went to 64 bytes and thought that should be fine
while ignoring AllocSetContext.

I did some tests in the past between a version of my patch that made
MemoryContextData 72 bytes and one that made it 64 bytes, but I may have
neglected to test against the original 56 byte version.

I'm still not able to test to see if this is the real problem, but it
seems best to eliminate it anyway.

 Consider not counting memory in bytes but blocks and adding it directly
 after the NodeTag. That'd leave the size the same as right now.

I can also just move isReset there, and keep mem_allocated as a uint64.
That way, if I find later that I want to track the aggregated value for
the child contexts as well, I can split it into two uint32s. I'll hold
off any any such optimizations until I see some numbers from HashAgg
though.

Attached new version.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,454 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   AllocSetMethods,
!   parent,
!   name);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize  initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context-initBlockSize = initBlockSize;
! 	context-maxBlockSize = maxBlockSize;
! 	context-nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 462,470 
  	if (maxBlockSize  initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set-initBlockSize = initBlockSize;
! 	set-maxBlockSize = maxBlockSize;
! 	set-nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	context-allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context-allocChunkLimit + ALLOC_CHUNKHDRSZ) 
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		context-allocChunkLimit = 1;
  
  	/*
  	 * Grab always-allocated space, if requested
--- 480,489 
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	set-allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (set-allocChunkLimit + ALLOC_CHUNKHDRSZ) 
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		set-allocChunkLimit = 1;
  
  	/*
  	 * Grab always-allocated space, if requested
***
*** 500,519  AllocSetContextCreate(MemoryContext parent,
  	 errdetail(Failed while creating memory context \%s\.,
  			   name)));
  		}
! 		block-aset = context;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
! 		block-next = context-blocks;
! 		context-blocks = block;
  		/* Mark block as not to be released at reset time */
! 		context-keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block-freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return (MemoryContext) context;
  }
  
  /*
--- 503,525 
  	 errdetail(Failed while creating memory context \%s\.,
  			   name)));
  		}
! 
! 		context-mem_allocated += blksize;
! 
! 		block-aset = set;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
! 		block-next = set-blocks;
! 		set-blocks = block;
  		/* Mark block as not to be released at reset time */
! 		set-keeper = block;
  
  		/* Mark unallocated space 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-15 Thread Simon Riggs
On 16 October 2014 02:26, Jeff Davis pg...@j-davis.com wrote:

 The inheritance is awkward anyway, though. If you create a tracked
 context as a child of an already-tracked context, allocations in the
 newer one won't count against the original. I don't see a way around
 that without introducing even more performance problems.

This seems to have reached impasse, which is a shame. Let me throw out
a few questions to see if that might help.

Do I understand correctly that we are trying to account for exact
memory usage at palloc/pfree time? Why??

Surely we just want to keep track of the big chunks? Does this need to
be exact? How exact does it need to be?

Or alternatively, can't we just sample the allocations to reduce the overhead?

Are there some assumptions we can make about micro-contexts never
needing to be tracked at all? Jeff seems not too bothered by
inheritance, whereas Tomas thinks its essential. Perhaps there is a
middle ground, depending upon the role of the context?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-10-18 Thread Tomas Vondra
Hi,

On 16.10.2014 03:26, Jeff Davis wrote:
 On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote:
 What about removing the callback per se and just keeping the
 argument, as it were. That is, a MemoryContext has a field of type
 size_t * that is normally NULL, but if set to non-NULL, then we
 increment the pointed-to value for pallocs and decrement for
 pfrees.
 
 I like that idea; patch attached. Two differences:
   * At block level, not chunk level.
   * I use a uint64, because a size_t is only guaranteed to hold one
 allocation, and we need to sum up many allocations.

 When unused, it does still appear to have a little overhead in 
 Robert's test on his power machine. It seems to be between a 0.5-1.0%
 regression. There's not much extra code on that path, just a branch,
 pointer dereference, and an addition; so I don't think it will get
 much cheaper than it is.
 
 This regression seems harder to reproduce on my workstation, or
 perhaps it's just noisier.

I assume it's the reindex pgbench_accounts_pkey test, correct?

I did the same test on my workstation (with i7-4770R CPU), and ISTM
there's a measurable difference, although in surprising directions. For
binaries compiled with gcc 4.7.3 and clang 3.3 and 3.5, the results look
like this (5 runs for each combination):

gcc 4.7.3 / master
--
LOG:  internal sort ended, ... CPU 1.24s/4.32u sec elapsed 9.78 sec
LOG:  internal sort ended, ... CPU 1.29s/4.31u sec elapsed 9.80 sec
LOG:  internal sort ended, ... CPU 1.26s/4.31u sec elapsed 9.80 sec
LOG:  internal sort ended, ... CPU 1.26s/4.37u sec elapsed 9.78 sec
LOG:  internal sort ended, ... CPU 1.33s/4.29u sec elapsed 9.78 sec

gcc 4.7.3 / patched
---
LOG:  internal sort ended, ... CPU 1.35s/4.27u sec elapsed 9.78 sec
LOG:  internal sort ended, ... CPU 1.33s/4.30u sec elapsed 9.74 sec
LOG:  internal sort ended, ... CPU 1.34s/4.27u sec elapsed 9.75 sec
LOG:  internal sort ended, ... CPU 1.27s/4.26u sec elapsed 9.78 sec
LOG:  internal sort ended, ... CPU 1.35s/4.26u sec elapsed 9.78 sec

clang 3.3 / master
--
LOG:  internal sort ended, ... CPU 1.32s/4.61u sec elapsed 10.00 sec
LOG:  internal sort ended, ... CPU 1.27s/4.48u sec elapsed 9.95 sec
LOG:  internal sort ended, ... CPU 1.35s/4.48u sec elapsed 9.99 sec
LOG:  internal sort ended, ... CPU 1.32s/4.49u sec elapsed 9.97 sec
LOG:  internal sort ended, ... CPU 1.32s/4.47u sec elapsed 10.04 sec

clang 3.3 / patched
---
LOG:  internal sort ended, ... CPU 1.35s/4.59u sec elapsed 10.13 sec
LOG:  internal sort ended, ... CPU 1.31s/4.61u sec elapsed 10.06 sec
LOG:  internal sort ended, ... CPU 1.28s/4.63u sec elapsed 10.10 sec
LOG:  internal sort ended, ... CPU 1.27s/4.58u sec elapsed 10.01 sec
LOG:  internal sort ended, ... CPU 1.29s/4.60u sec elapsed 10.05 sec

clang 3.5 / master
--
LOG:  internal sort ended, ... CPU 1.30s/4.46u sec elapsed 9.96 sec
LOG:  internal sort ended, ... CPU 1.30s/4.49u sec elapsed 9.96 sec
LOG:  internal sort ended, ... CPU 1.30s/4.53u sec elapsed 9.95 sec
LOG:  internal sort ended, ... CPU 1.25s/4.51u sec elapsed 9.95 sec
LOG:  internal sort ended, ... CPU 1.30s/4.50u sec elapsed 9.95 sec

clang 3.5 / patched
---
LOG:  internal sort ended, ... CPU 1.32s/4.59u sec elapsed 9.97 sec
LOG:  internal sort ended, ... CPU 1.27s/4.49u sec elapsed 9.91 sec
LOG:  internal sort ended, ... CPU 1.29s/4.48u sec elapsed 9.88 sec
LOG:  internal sort ended, ... CPU 1.31s/4.49u sec elapsed 9.93 sec
LOG:  internal sort ended, ... CPU 1.23s/4.56u sec elapsed 9.90 sec

So while on clang 3.3 I really see about ~1% slowdown, on the other two
compilers it seems a tad faster (but it might easily be a noise).

It'd be interesting to know what compiler/version Robert used ...

 One thought in either case is that we don't have to touch the API
 for MemoryContextCreate: rather, the feature can be enabled in a
 separate call after making the context.
 
 That seems fairly awkward to me because the pointer needs to be 
 inherited from the parent context when not specified. I left the
 extra API call in.

I don't see a big difference between adding a new API create method
and a adding a separate enable tracking method.

What however seems awkward is using the 'uint64*' directly as a
parameter, instead of using 'enable_tracking' flag as in the previous
versions. Is there a reason for that?

This allows supplying a pointer to a uint64 variable, i.e. something
like this:

   {
   uint64 mem_used = 0;
   MemoryContext context
   = MemoryContextCreateTracked(..., mem_used);

   ...

   if (mem_used  work_mem * 1024L)
   {
   ... do something ...
   }
   }

Thanks to exposing the total_mem like this, it's possible to get the
current value directly like this (without the API methods available in
the previous versions). It also makes it possible to detach the
accounting from the parent (i.e. not count it against the parent).

But is this 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-10-15 Thread Jeff Davis
On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote:
 What about removing the callback per se and just keeping the argument,
 as it were.  That is, a MemoryContext has a field of type size_t *
 that is normally NULL, but if set to non-NULL, then we increment the
 pointed-to value for pallocs and decrement for pfrees.

I like that idea; patch attached. Two differences:
  * At block level, not chunk level.
  * I use a uint64, because a size_t is only guaranteed to hold one
allocation, and we need to sum up many allocations.

When unused, it does still appear to have a little overhead in Robert's
test on his power machine. It seems to be between a 0.5-1.0% regression.
There's not much extra code on that path, just a branch, pointer
dereference, and an addition; so I don't think it will get much cheaper
than it is.

This regression seems harder to reproduce on my workstation, or perhaps
it's just noisier.

 One thought in either case is that we don't have to touch the API for
 MemoryContextCreate: rather, the feature can be enabled in a separate
 call after making the context.

That seems fairly awkward to me because the pointer needs to be
inherited from the parent context when not specified. I left the extra
API call in.

The inheritance is awkward anyway, though. If you create a tracked
context as a child of an already-tracked context, allocations in the
newer one won't count against the original. I don't see a way around
that without introducing even more performance problems.

If this patch is unacceptable, my only remaining idea is to add the
memory only for the current context with no inheritance (thereby
eliminating the branch, also). That will require HashAgg to traverse all
the child contexts to check whether the memory limit has been exceeded.
As Tomas pointed out, that could be a lot of work in the case of
array_agg with many groups.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,482 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	/* inherit 'track_mem' from parent context, if available */
! 	uint64 *track_mem = (parent == NULL) ? NULL : parent-track_mem;
! 
! 	return AllocSetContextCreateTracked(parent, name, minContextSize,
! 		initBlockSize, maxBlockSize,
! 		track_mem);
! }
! 
! /*
!  * AllocSetContextCreateTracked
!  *		Create a new AllocSet context, tracking total memory usage.
!  *
!  * parent: parent context, or NULL if top-level context
!  * name: name of context (for debugging --- string will be copied)
!  * minContextSize: minimum context size
!  * initBlockSize: initial allocation block size
!  * maxBlockSize: maximum allocation block size
!  * track_mem: caller-supplied variable to use for memory tracking
!  */
! MemoryContext
! AllocSetContextCreateTracked(MemoryContext parent,
! 			 const char *name,
! 			 Size minContextSize,
! 			 Size initBlockSize,
! 			 Size maxBlockSize,
! 			 uint64 *track_mem)
! {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   AllocSetMethods,
!   parent,
!   name,
!   track_mem);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize  initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context-initBlockSize = initBlockSize;
! 	context-maxBlockSize = maxBlockSize;
! 	context-nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 490,498 
  	if (maxBlockSize  initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set-initBlockSize = initBlockSize;
! 	set-maxBlockSize = maxBlockSize;
! 	set-nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	context-allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context-allocChunkLimit + ALLOC_CHUNKHDRSZ) 
  		 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-29 Thread Jeff Davis
On Fri, 2014-08-22 at 12:34 -0400, Robert Haas wrote:
 Still doesn't look good here.  On the same PPC64 machine I've been
 using for testing:

I have a new approach to the patch which is to call a callback at each
block allocation and child contexts inherit the callback from their
parents. The callback could be defined to simply dereference and
increment its argument, which would mean block allocations anywhere in
the hierarchy are incrementing the same variable, avoiding the need to
traverse the hierarchy.

I've made some progress I think (thanks to both Robert and Tomas), but I
have a bit more microoptimization and testing to do. I'll mark it
returned with feedback for now, though if I find the time I'll do more
testing to see if the performance concerns are fully addressed.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-29 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 I have a new approach to the patch which is to call a callback at each
 block allocation and child contexts inherit the callback from their
 parents. The callback could be defined to simply dereference and
 increment its argument, which would mean block allocations anywhere in
 the hierarchy are incrementing the same variable, avoiding the need to
 traverse the hierarchy.

Cute, but it's impossible to believe that a function call isn't going
to be *even more* overhead than what you've got now.  This could only
measure out as a win when the feature is going unused.

What about removing the callback per se and just keeping the argument,
as it were.  That is, a MemoryContext has a field of type size_t *
that is normally NULL, but if set to non-NULL, then we increment the
pointed-to value for pallocs and decrement for pfrees.

One thought in either case is that we don't have to touch the API for
MemoryContextCreate: rather, the feature can be enabled in a separate
call after making the context.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-29 Thread Tomas Vondra
On 29.8.2014 16:12, Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
 I have a new approach to the patch which is to call a callback at each
 block allocation and child contexts inherit the callback from their
 parents. The callback could be defined to simply dereference and
 increment its argument, which would mean block allocations anywhere in
 the hierarchy are incrementing the same variable, avoiding the need to
 traverse the hierarchy.

 Cute, but it's impossible to believe that a function call isn't going
 to be *even more* overhead than what you've got now.  This could only
 measure out as a win when the feature is going unused.
 
 What about removing the callback per se and just keeping the argument,
 as it were.  That is, a MemoryContext has a field of type size_t *
 that is normally NULL, but if set to non-NULL, then we increment the
 pointed-to value for pallocs and decrement for pfrees.

How is that going to work with two memory contexts (parent/child), both
requesting accounting? If I understand the proposal correctly, the child
will either inherit pointer to the field in parent (and then won't get
accounting for it's own memory), or wil get a private field (and thus
won't update the accounting of the parent context).

Or am I missing something?

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-22 Thread Robert Haas
On Wed, Aug 20, 2014 at 2:11 AM, Jeff Davis pg...@j-davis.com wrote:
 I attached a patch that uses two uint32 fields so that it doesn't
 increase the size of MemoryContextData, and it tracks memory usage for
 all contexts. I was unable to detect any performance regression vs.
 master, but on my machine the results are noisy.

Still doesn't look good here.  On the same PPC64 machine I've been
using for testing:

master:
2014-08-22 16:26:42 UTC [13153] LOG:  internal sort ended, 1723933 KB
used: CPU 2.19s/11.41u sec elapsed 16.40 sec
2014-08-22 16:28:18 UTC [13173] LOG:  internal sort ended, 1723933 KB
used: CPU 2.56s/11.37u sec elapsed 16.64 sec
2014-08-22 16:29:37 UTC [13194] LOG:  internal sort ended, 1723933 KB
used: CPU 2.57s/11.48u sec elapsed 16.75 sec

memory-accounting-v4:
2014-08-22 16:27:35 UTC [13163] LOG:  internal sort ended, 1723933 KB
used: CPU 2.72s/11.91u sec elapsed 17.43 sec
2014-08-22 16:29:10 UTC [13185] LOG:  internal sort ended, 1723933 KB
used: CPU 2.67s/12.03u sec elapsed 17.40 sec
2014-08-22 16:30:01 UTC [13203] LOG:  internal sort ended, 1723933 KB
used: CPU 2.65s/11.97u sec elapsed 17.45 sec

That's a 4.7% regression on the median results, worse than before.

 It would be easier to resolve the performance concern if I could
 reliably get the results Robert is getting. I think I was able to
 reproduce the regression with the old patch, but the results were still
 noisy.

If you send me an SSH key by private email I can set you an account up
on this machine, if that's helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-22 Thread Tomas Vondra
On 20.8.2014 08:11, Jeff Davis wrote:
 On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:
 The use-case for this is tracking a chosen subtree of contexts - e.g.
 aggcontext and below, so I'd expect the tracked subtrees to be relatively
 shallow. Am I right?
 
 Right.
 
 My fear is that by removing the inheritance bit, we'll hurt cases with a
 lot of child contexts. For example, array_agg currently creates a separate
 context for each group - what happens if you have 100k groups and do
 MemoryContextGetAllocated? I guess iterating over 100k groups is not free.
 
 Good point. We don't want to make checking the allocated space into an
 expensive operation, because then we will be forced to call it less
 frequently, which implies that we'd be sloppier about actually fitting
 in work_mem.
 
 Wouldn't the solution with inheritance and propagating the accounting info
 to the parent actually better? Or maybe allowing both, having two flags
 when creating a context instead of one?
 
 That seems overly complicated. We only have one driving use case, so two
 orthogonal options sounds excessive. Perhaps one option if we can't
 solve the performance problem and we need to isolate the changes to
 hashagg.
 
 Also, do we really need to track allocated bytes - couldn't we track
 kilobytes or something and use smaller data types to get below the 64B?
 
 Good idea.
 
 I attached a patch that uses two uint32 fields so that it doesn't
 increase the size of MemoryContextData, and it tracks memory usage for
 all contexts. I was unable to detect any performance regression vs.
 master, but on my machine the results are noisy.

I don't think we really need to abandon the 'tracked' flag (or that we
should). I think it was useful, and removing it might be one of the
reasons why Robert now sees worse impact than before.

I assume you did that to get below 64B, right? What about to changing
'isReset' in MemoryContextData to a 'flags' field instead? It's a bool
now, i.e. 1B - 8 bits.

I don't think it makes sense to abandon this only to get = 64B, if it
forces you to touch multiple 64B structures unnecessarily.

Also, you're doing this for every block in AllocSetReset:

update_allocation(context, -(block-endptr - ((char*) block)));

So if there are 1000 blocks, you'll call that 1000x (and it will
propagate up to TopMemoryContext). Instead keep a local sum and only do
the call once at the end. Again, this might be one of the reasons why
Robet sees ~4% overhead.

BTW what happens when AllocSetContext is created with initBlockSize that
is not a multiple of 1kB? I see it's enforced to be at least 1024, but I
don't see anything forcing it to be a multiple of 1024?

What if I create a context with initBlockSize=1500? ISTM it'll break the
accounting, right? (Not that it would make sense to create such
contexts, but it's possible.)

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-22 Thread Robert Haas
On Fri, Aug 22, 2014 at 2:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I don't think we really need to abandon the 'tracked' flag (or that we
 should). I think it was useful, and removing it might be one of the
 reasons why Robert now sees worse impact than before.

The version that introduced that flag had the same overhead as the
previous version that didn't, at least in my testing, so I'm not sure
it's at all useful.  Part of the problem here is that the number of
instructions that you can afford to take to complete a memory
allocation is just really small, and so every one hurts.  Memory
latency may be an issue as well, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-22 Thread Tomas Vondra
On 20.8.2014 08:11, Jeff Davis wrote:
 On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:

 It would be easier to resolve the performance concern if I could
 reliably get the results Robert is getting. I think I was able to
 reproduce the regression with the old patch, but the results were still
 noisy.

I created a small extension for this purpose, it's available here:

   https://github.com/tvondra/palloc_bench

In short, it creates a chain of memory contexts, and then repeats
palloc/pfree a given number of times (with a chosen request size).

It either calls AllocSetContextCreate or AllocSetContextCreateTracked,
depending on whether there's a

#define TRACKING_FLAG

so either leave it there or comment it out. The time is printed as a
warhing.

I did a number of tests to get an idea of the overhead, using this call

  select palloc_bench(10,1,32768);

which means 10 memory contexts, 1 palloc/free cycles with 32kB
requests.

And I got these results:

master
---
3246.03 ms
3125.24 ms
3247.86 ms
3251.85 ms
3113.03 ms
3140.35 ms

v3 patch (AllocSetContextCreate = no tracing)
3303.64 ms
3278.57 ms
3295.11 ms
3325.63 ms
3329.84 ms
3439.27 ms

-- v3 patch (AllocSetContextCreateTracked = tracing)
6296.43 ms
5228.83 ms
5271.43 ms
5158.60 ms
5404.57 ms
5240.40 ms

-- v4 (tracing all the time)
6728.84 ms
6478.88 ms
6478.82 ms
6473.57 ms
6554.96 ms
6528.66 ms

I think this makes the overhead clearly visible. I also worth mentioning
that this does nothing else except for palloc/free, which is not really
what a real workload does. And 1 palloc/free of 32kB blocks
means ~3TB of RAM, unless my math is broken.

So looking at the numbers and saying 7 seconds  3 seconds, all is
lost is not really appropriate IMHO.

Anyway, ISTM that v4 is actually a bitm ore expensive than v3 for some
reason. I'm not entirely sure why, but I suspect it's because of
updating the few additional memory contexts up to TopMemoryContext.
That's something v3 didn't do.

I tried to hack a bit on the idea of using a single byte for the flags
(isReset and track_mem) - patch attached. That got me pretty much to v3
performance (or maybe slightly better):

-- v4 + flag (tracing all the time)
5222.38 ms
4958.37 ms
5072.21 ms
5100.43 ms
5059.65 ms
4995.52 ms

But nothing that'd magically save the day ... and with disabled tracing
we get pretty much to v3 numbers (with trace_mem=false). So this
gymnastics gave us pretty much nothing ...

But I have realized that maybe the problem is that we're handling memory
contexts and accounting as 1:1. But that's not really the case - most of
the context is not really interested in this. They don't use accounting
now, and it won't change. So only small fraction of memory contexts will
ask for acounting. Yet all the contexts are forced to pass accounting
info from their children to their parent, if there happens to be a
context with track_mem=true somewhere above them.

And that's exactly the problem, because most of the time is spent in
update_accounting, in the loop over parents.

So my proposal is to separate those two things into two hierarchies,
that are somehow parallel, but not exactly.

That means:

(1) creating a structure with the accouting info

typedef struct MemoryAccountingData {

Size  total_allocated;
Size  self_allocated;

struct MemoryAccountingData * parent;

} MemoryAccountingData;

(2) adding a pointer to MemoryAccountingData to MemoryContextData, and
a 'track_mem' flag for contexts that requested tracking

typedef struct MemoryContextData
{
...
MemoryAccounting  accounting;
...
} MemoryContextData;

(3) when a context does not request accounting, it just uses the
accounting pointer from the parent context, and track_mem=false

(4) when the context requests accounting, it allocates it's own
accounting structure, sets accounting-parent to the accounting
from parent, and sets track_mem=true

Now all the contexts have a direct pointer to the accounting of the
nearest parent context that explicitly requested accounting, and don't
need to walk through all the parents.

Contexts that did not request tracking have track_mem=false, and their
accounting points to the parent with explicit accounting, or is NULL if
there's no such parent. For these contexts, GetAllocated always returns
0, but that's OK because they haven't requested accounting anyway.

Contexts that requested tracking have have track_mem=true, and have
their own specific accounting instance. The accounting-parent plays
pretty much the same role as 'accounting' with 'track_mem=false' (see
previous paragraph). These contexts return GetAllocated properly.

Now, I did a quick with the palloc_bench - 1 context with tracking
enabled, and a chain of 10 contexts without tracking (but updating the
accounting for the first context).

And I see this - with tracking enabled:

3235.57 ms
3240.09 ms
3225.47 ms
3306.95 ms
3249.14 ms
3225.56 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-20 Thread Jeff Davis
On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:
 The use-case for this is tracking a chosen subtree of contexts - e.g.
 aggcontext and below, so I'd expect the tracked subtrees to be relatively
 shallow. Am I right?

Right.

 My fear is that by removing the inheritance bit, we'll hurt cases with a
 lot of child contexts. For example, array_agg currently creates a separate
 context for each group - what happens if you have 100k groups and do
 MemoryContextGetAllocated? I guess iterating over 100k groups is not free.

Good point. We don't want to make checking the allocated space into an
expensive operation, because then we will be forced to call it less
frequently, which implies that we'd be sloppier about actually fitting
in work_mem.

 Wouldn't the solution with inheritance and propagating the accounting info
 to the parent actually better? Or maybe allowing both, having two flags
 when creating a context instead of one?

That seems overly complicated. We only have one driving use case, so two
orthogonal options sounds excessive. Perhaps one option if we can't
solve the performance problem and we need to isolate the changes to
hashagg.

 Also, do we really need to track allocated bytes - couldn't we track
 kilobytes or something and use smaller data types to get below the 64B?

Good idea.

I attached a patch that uses two uint32 fields so that it doesn't
increase the size of MemoryContextData, and it tracks memory usage for
all contexts. I was unable to detect any performance regression vs.
master, but on my machine the results are noisy.

It would be easier to resolve the performance concern if I could
reliably get the results Robert is getting. I think I was able to
reproduce the regression with the old patch, but the results were still
noisy.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_allocation(MemoryContext context, size_t size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 250,256  static void AllocSetFree(MemoryContext context, void *pointer);
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
  static void AllocSetStats(MemoryContext context, int level);
--- 252,258 
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context, MemoryContext parent);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
  static void AllocSetStats(MemoryContext context, int level);
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 502,510 
  	 errdetail(Failed while creating memory context \%s\.,
  			   name)));
  		}
+ 
+ 		update_allocation((MemoryContext) context, blksize);
+ 
  		block-aset = context;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 595,601 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_allocation(context, -(block-endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block-freeptr - ((char *) block));
  #endif
***
*** 611,617  AllocSetReset(MemoryContext context)
   * But note we are not responsible for deleting the context node itself.
   */
  static void
! AllocSetDelete(MemoryContext context)
  {
  	AllocSet	set = (AllocSet) context;
  	AllocBlock	block = set-blocks;
--- 617,623 
   * But note we are not responsible for deleting the context node itself.
   */
  static void
! AllocSetDelete(MemoryContext context, MemoryContext parent)
  {
  	AllocSet	set = (AllocSet) context;
  	AllocBlock	block = set-blocks;
***
*** 623,628  AllocSetDelete(MemoryContext context)
--- 629,644 
  	AllocSetCheck(context);
  #endif
  
+ 	/*
+ 	 * Parent is already unlinked from context, so can't use
+ 	 * update_allocation().
+ 	 */
+ 	while (parent != NULL)
+ 	{
+ 		parent-total_allocated -= context-total_allocated;
+ 		parent = parent-parent;
+ 	}
+ 
  	/* Make it look empty, just in case... */
  	MemSetAligned(set-freelist, 0, sizeof(set-freelist));
  	set-blocks = NULL;
***
*** 678,683  AllocSetAlloc(MemoryContext context, Size size)
--- 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-19 Thread Jeff Davis
On Thu, 2014-08-14 at 12:52 -0400, Robert Haas wrote:
 It appears to me that the performance characteristics for this version
 are not significantly different from version 1.  I have not looked at
 the code.

While trying to reproduce your results, I noticed what might be around a
1% regression just from adding the 3 fields to MemoryContextData. If I
cut it down to adding just one field, the regression disappears.

The results are fairly noisy, so I could be chasing the wrong thing. But
one reason to believe it is that I pushed the size of MemoryContextData
above 64, which sounds like it might be an important threshold.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-19 Thread Jeff Davis
On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote:
 But maybe the inheritance really is not necessary - maybe it would be
 enough to track this per-context, and then just walk through the
 contexts and collect this. Because my observation is that most of the
 time is actually spent in walking through blocks and freelists.

That makes a lot of sense to me.

Another approach is to pass a flag to hash_create that tells it not to
create a subcontext. Then we don't need to traverse anything; we already
know which context we want to look at. Perhaps I was being too clever
with the idea of tracking space for an entire hierarchy.

Also, as I pointed out in my reply to Robert, adding too many fields to
MemoryContextData may be the cause of the regression. Your idea requires
only one field, which doesn't show the same regression in my tests.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-19 Thread Tomas Vondra
On 19 Srpen 2014, 10:26, Jeff Davis wrote:
 On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote:
 But maybe the inheritance really is not necessary - maybe it would be
 enough to track this per-context, and then just walk through the
 contexts and collect this. Because my observation is that most of the
 time is actually spent in walking through blocks and freelists.

 That makes a lot of sense to me.

 Another approach is to pass a flag to hash_create that tells it not to
 create a subcontext. Then we don't need to traverse anything; we already
 know which context we want to look at. Perhaps I was being too clever
 with the idea of tracking space for an entire hierarchy.

 Also, as I pointed out in my reply to Robert, adding too many fields to
 MemoryContextData may be the cause of the regression. Your idea requires
 only one field, which doesn't show the same regression in my tests.

Yeah, keeping the structure size below 64B seems like a good idea.

The use-case for this is tracking a chosen subtree of contexts - e.g.
aggcontext and below, so I'd expect the tracked subtrees to be relatively
shallow. Am I right?

My fear is that by removing the inheritance bit, we'll hurt cases with a
lot of child contexts. For example, array_agg currently creates a separate
context for each group - what happens if you have 100k groups and do
MemoryContextGetAllocated? I guess iterating over 100k groups is not free.

Wouldn't the solution with inheritance and propagating the accounting info
to the parent actually better? Or maybe allowing both, having two flags
when creating a context instead of one?

  AllocSetCreateTracked( , bool track, bool propagate_immediately)

By squashing both flags into a single mask you wouldn't increase the size.
Also, do we really need to track allocated bytes - couldn't we track
kilobytes or something and use smaller data types to get below the 64B?

regards
Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-16 Thread Tomas Vondra
On 10.8.2014 22:50, Jeff Davis wrote:
 On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:
 Either way, it's better to be conservative. Attached is a version
 of the patch with opt-in memory usage tracking. Child contexts
 inherit the setting from their parent.
 
 There was a problem with the previous patch: the parent was unlinked 
 before the delete_context method was called, which meant that the 
 parent's memory was not being decremented.
 
 Attached is a fix. It would be simpler to just reorder the operations
 in MemoryContextDelete, but there is a comment warning against that.
 So, I pass the old parent as an argument to the delete_context
 method.

Hi,

I did a quick check of the patch, mostly because I wasn't sure whether
it allows accounting for a selected subtree only (e.g. aggcontext and
it's children). And yes, it does exactly that, which is great.

Also, the code seems fine to me, except maybe for this piece in
AllocSetDelete:

/*
 * Parent is already unlinked from context, so can't use
 * update_allocation().
 */
while (parent != NULL)
{
parent-total_allocated -= context-total_allocated;
parent = parent-parent;
}

I believe this should check parent-track_mem, just like
update_allocation, because this way it walks all the parent context up
to the TopMemoryContext.

It does not really break anything (because parents without enabled
tracking don't report allocated space), but for plans
creating/destroying a lot of contexts (say, GroupAggregate with a lot of
groups) this might unnecessarily add overhead.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-16 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 I believe this should check parent-track_mem, just like
 update_allocation, because this way it walks all the parent context up
 to the TopMemoryContext.

TBH, I don't think this opt-in tracking business has been thought
through even a little bit; for example, what happens if there is a
declared-to-not-track context in between levels that are declared
to track?  And it certainly has not been proven that the extra
complexity actually saves cycles rather than the reverse.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-16 Thread Tomas Vondra
On 16.8.2014 20:00, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 I believe this should check parent-track_mem, just like
 update_allocation, because this way it walks all the parent context up
 to the TopMemoryContext.
 
 TBH, I don't think this opt-in tracking business has been thought 
 through even a little bit; for example, what happens if there is a 
 declared-to-not-track context in between levels that are declared to
 track? And it certainly has not been proven that the extra complexity
 actually saves cycles rather than the reverse.


AFAIK there's no other built-in accounting mechanism at the moment.
That's one of the reasons why some of the nodes had to invent their own
custom accounting - hashjoin is an example of that.

We do have MemoryContextStats, which essentially walks the memory
context hierarchy and accounts all the important information. While
working on the hashjoin patch, I was thinking that maybe we could create
MemoryContextStats clone, computing the accounting info on request.

So I did an experiment, writing this new MemoryContextStats clone and
rewriting the nodeHash.c to use this instead of custom tracking. The
impact was consistently ~2-3%, for the hashjoin queries I measured (e.g.
9100 ms instead of 8900 ms etc.).

I just tried the same with the hashjoin patch applied. With the built-in
accounting the queries now run ~2.5x faster - around 3500ms. With the
MemoryContextStats clone, it takes a whopping 18000 ms.

I don't know what causes such increase (I assume I use the accounting
info within a loop or something). The point is that with Jeff's patch,
it goes down to 3500 ms, with pretty much zero overhead.

I'll check why the patched hashjoin is so much slower, but even if I can
optimize it somehow I don't think I'll get it to 0 overhead. So I'd say
this accounting actally saves the cycles.


Of course, that does not prove it's well a well thought-out design,
although it seems sensible to me. It's the least complex accounting
implementation I can think of.

For example, I don't think being able to create nonsensical hierarchies
is a major flaw. We certainly should not do that in our code, and we
should take reasonable precautions to prevent this in a user code (as
for example inheriting the tracking flag in MemoryContextCreate, as Jeff
does). Can this be seen as a foot gun? Maybe, but we're handing
ammunition to users on a daily basis - for example they can switch to
TopMemoryContext, but that does not make memory contexts bad.

But maybe the inheritance really is not necessary - maybe it would be
enough to track this per-context, and then just walk through the
contexts and collect this. Because my observation is that most of the
time is actually spent in walking through blocks and freelists.

Then the issues with nonsensical hierarchies would disappear, and also
the possible overhead with updating all parent hierarchies would go away
(not sure if it's even measurable, though).

regards
Tomas





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-14 Thread Robert Haas
On Fri, Aug 8, 2014 at 4:16 AM, Jeff Davis pg...@j-davis.com wrote:
 I wasn't able to reproduce your results on my machine. At -s 300, with
 maintenance_work_mem set high enough to do internal sort, it took about
 40s and I heard some disk activity, so I didn't think it was a valid
 result. I went down to -s 150, and it took around 5.3s on both master
 and memory-accounting.

 Either way, it's better to be conservative. Attached is a version of the
 patch with opt-in memory usage tracking. Child contexts inherit the
 setting from their parent.

I repeated my tests with your v3 patch.  Here are the new results:

master, as of commit a4287a689d10bd4863e3dfbf9c4f232feeca0cdd:
2014-08-14 16:45:24 UTC [2940] LOG:  internal sort ended, 1723933 KB
used: CPU 2.44s/11.52u sec elapsed 16.68 sec
2014-08-14 16:46:34 UTC [2960] LOG:  internal sort ended, 1723933 KB
used: CPU 2.63s/11.65u sec elapsed 16.94 sec
2014-08-14 16:47:26 UTC [2979] LOG:  internal sort ended, 1723933 KB
used: CPU 2.63s/11.48u sec elapsed 16.85 sec

memory-accounting-v3, on top of the aforementioned master commit:
2014-08-14 16:46:05 UTC [2950] LOG:  internal sort ended, 1723933 KB
used: CPU 2.52s/12.16u sec elapsed 17.36 sec
2014-08-14 16:47:00 UTC [2969] LOG:  internal sort ended, 1723933 KB
used: CPU 2.52s/11.90u sec elapsed 17.11 sec
2014-08-14 16:47:51 UTC [2988] LOG:  internal sort ended, 1723933 KB
used: CPU 2.52s/11.98u sec elapsed 17.31 sec

It appears to me that the performance characteristics for this version
are not significantly different from version 1.  I have not looked at
the code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-10 Thread Jeff Davis
On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:
 Either way, it's better to be conservative. Attached is a version of the
 patch with opt-in memory usage tracking. Child contexts inherit the
 setting from their parent.

There was a problem with the previous patch: the parent was unlinked
before the delete_context method was called, which meant that the
parent's memory was not being decremented.

Attached is a fix. It would be simpler to just reorder the operations in
MemoryContextDelete, but there is a comment warning against that. So, I
pass the old parent as an argument to the delete_context method.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_allocation(MemoryContext context, int64 size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 250,256  static void AllocSetFree(MemoryContext context, void *pointer);
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
  static void AllocSetStats(MemoryContext context, int level);
--- 252,258 
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context, MemoryContext parent);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
  static void AllocSetStats(MemoryContext context, int level);
***
*** 430,435  randomize_mem(char *ptr, size_t size)
--- 432,440 
   * minContextSize: minimum context size
   * initBlockSize: initial allocation block size
   * maxBlockSize: maximum allocation block size
+  *
+  * The flag determining whether this context tracks memory usage is inherited
+  * from the parent context.
   */
  MemoryContext
  AllocSetContextCreate(MemoryContext parent,
***
*** 438,443  AllocSetContextCreate(MemoryContext parent,
--- 443,468 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
+ 	return AllocSetContextCreateTracked(
+ 		parent, name, minContextSize, initBlockSize, maxBlockSize,
+ 		(parent == NULL) ? false : parent-track_mem);
+ }
+ 
+ /*
+  * AllocSetContextCreateTracked
+  *		Create a new AllocSet context.
+  *
+  * Implementation for AllocSetContextCreate, but also allows the caller to
+  * specify whether memory usage should be tracked or not.
+  */
+ MemoryContext
+ AllocSetContextCreateTracked(MemoryContext parent,
+ 			 const char *name,
+ 			 Size minContextSize,
+ 			 Size initBlockSize,
+ 			 Size maxBlockSize,
+ 			 bool track_mem)
+ {
  	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
***
*** 445,451  AllocSetContextCreate(MemoryContext parent,
  			 sizeof(AllocSetContext),
  			 AllocSetMethods,
  			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 470,477 
  			 sizeof(AllocSetContext),
  			 AllocSetMethods,
  			 parent,
! 			 name,
! 			 track_mem);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 526,534 
  	 errdetail(Failed while creating memory context \%s\.,
  			   name)));
  		}
+ 
+ 		update_allocation((MemoryContext) context, blksize);
+ 
  		block-aset = context;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 619,625 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_allocation(context, -(block-endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block-freeptr - ((char *) block));
  #endif
***
*** 611,617  AllocSetReset(MemoryContext context)
   * But note we are not responsible for deleting the context node itself.
   */
  static void
! AllocSetDelete(MemoryContext context)
  {
  	AllocSet	set = (AllocSet) context;
  	AllocBlock	block = set-blocks;
--- 641,647 
   * But note we are not responsible for deleting the context node itself.
   */
  static void
! AllocSetDelete(MemoryContext context, MemoryContext parent)
  {
  	AllocSet	set = (AllocSet) context;
  	AllocBlock	block = 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-08 Thread Jeff Davis
On Wed, 2014-08-06 at 11:43 -0400, Robert Haas wrote:
 Comparing the median times, that's about a 3% regression.  For this
 particular case, we might be able to recapture that by replacing the
 bespoke memory-tracking logic in tuplesort.c with use of this new
 facility.  I'm not sure whether there are other cases that we might
 also want to test; I think stuff that runs all on the server side is
 likely to show up problems more clearly than pgbench.  Maybe a
 PL/pgsql loop that does something allocation-intensive on each
 iteration, for example, like parsing a big JSON document.

I wasn't able to reproduce your results on my machine. At -s 300, with
maintenance_work_mem set high enough to do internal sort, it took about
40s and I heard some disk activity, so I didn't think it was a valid
result. I went down to -s 150, and it took around 5.3s on both master
and memory-accounting.

Either way, it's better to be conservative. Attached is a version of the
patch with opt-in memory usage tracking. Child contexts inherit the
setting from their parent.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_allocation(MemoryContext context, int64 size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 430,435  randomize_mem(char *ptr, size_t size)
--- 432,440 
   * minContextSize: minimum context size
   * initBlockSize: initial allocation block size
   * maxBlockSize: maximum allocation block size
+  *
+  * The flag determining whether this context tracks memory usage is inherited
+  * from the parent context.
   */
  MemoryContext
  AllocSetContextCreate(MemoryContext parent,
***
*** 438,443  AllocSetContextCreate(MemoryContext parent,
--- 443,468 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
+ 	return AllocSetContextCreateTracked(
+ 		parent, name, minContextSize, initBlockSize, maxBlockSize,
+ 		(parent == NULL) ? false : parent-track_mem);
+ }
+ 
+ /*
+  * AllocSetContextCreateTracked
+  *		Create a new AllocSet context.
+  *
+  * Implementation for AllocSetContextCreate, but also allows the caller to
+  * specify whether memory usage should be tracked or not.
+  */
+ MemoryContext
+ AllocSetContextCreateTracked(MemoryContext parent,
+ 			 const char *name,
+ 			 Size minContextSize,
+ 			 Size initBlockSize,
+ 			 Size maxBlockSize,
+ 			 bool track_mem)
+ {
  	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
***
*** 445,451  AllocSetContextCreate(MemoryContext parent,
  			 sizeof(AllocSetContext),
  			 AllocSetMethods,
  			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 470,477 
  			 sizeof(AllocSetContext),
  			 AllocSetMethods,
  			 parent,
! 			 name,
! 			 track_mem);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 526,534 
  	 errdetail(Failed while creating memory context \%s\.,
  			   name)));
  		}
+ 
+ 		update_allocation((MemoryContext) context, blksize);
+ 
  		block-aset = context;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 619,625 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_allocation(context, -(block-endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block-freeptr - ((char *) block));
  #endif
***
*** 632,637  AllocSetDelete(MemoryContext context)
--- 662,668 
  	{
  		AllocBlock	next = block-next;
  
+ 		update_allocation(context, -(block-endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block-freeptr - ((char *) block));
  #endif
***
*** 678,683  AllocSetAlloc(MemoryContext context, Size size)
--- 709,717 
  	 errmsg(out of memory),
  	 errdetail(Failed on request of size %zu., size)));
  		}
+ 
+ 		update_allocation(context, blksize);
+ 
  		block-aset = set;
  		block-freeptr = block-endptr = ((char *) block) + blksize;
  
***
*** 873,878  AllocSetAlloc(MemoryContext context, Size size)
--- 907,914 
  	 errdetail(Failed on request of size %zu., size)));
  		}
  
+ 		update_allocation(context, blksize);
+ 
  		block-aset = set;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
***
*** 976,981  AllocSetFree(MemoryContext context, void *pointer)
--- 1012,1018 
  			

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-06 Thread Robert Haas
On Sat, Aug 2, 2014 at 4:40 PM, Jeff Davis pg...@j-davis.com wrote:
 Attached is a patch that explicitly tracks allocated memory (the blocks,
 not the chunks) for each memory context, as well as its children.

 This is a prerequisite for memory-bounded HashAgg, which I intend to
 submit for the next CF. Hashjoin tracks the tuple sizes that it adds to
 the hash table, which is a good estimate for Hashjoin. But I don't think
 it's as easy for Hashagg, for which we need to track transition values,
 etc. (also, for HashAgg, I expect that the overhead will be more
 significant than for Hashjoin). If we track the space used by the memory
 contexts directly, it's easier and more accurate.

 I did some simple pgbench select-only tests, and I didn't see any TPS
 difference.

I was curious whether a performance difference would show up when
sorting, so I tried it out.  I set up a test with pgbench -i 300.  I
then repeatedly restarted the database, and after each restart, did
this:

time psql -c 'set trace_sort=on; reindex index pgbench_accounts_pkey;'

I alternated runs between master and master with this patch, and got
the following results:

master:
LOG:  internal sort ended, 1723933 KB used: CPU 2.58s/11.54u sec
elapsed 16.88 sec
LOG:  internal sort ended, 1723933 KB used: CPU 2.50s/12.37u sec
elapsed 17.60 sec
LOG:  internal sort ended, 1723933 KB used: CPU 2.14s/11.28u sec
elapsed 16.11 sec

memory-accounting:
LOG:  internal sort ended, 1723933 KB used: CPU 2.57s/11.97u sec
elapsed 17.39 sec
LOG:  internal sort ended, 1723933 KB used: CPU 2.30s/12.57u sec
elapsed 17.68 sec
LOG:  internal sort ended, 1723933 KB used: CPU 2.54s/11.99u sec
elapsed 17.25 sec

Comparing the median times, that's about a 3% regression.  For this
particular case, we might be able to recapture that by replacing the
bespoke memory-tracking logic in tuplesort.c with use of this new
facility.  I'm not sure whether there are other cases that we might
also want to test; I think stuff that runs all on the server side is
likely to show up problems more clearly than pgbench.  Maybe a
PL/pgsql loop that does something allocation-intensive on each
iteration, for example, like parsing a big JSON document.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-06 Thread Tomas Vondra
On 2.8.2014 22:40, Jeff Davis wrote:
 Attached is a patch that explicitly tracks allocated memory (the blocks,
 not the chunks) for each memory context, as well as its children.
 
 This is a prerequisite for memory-bounded HashAgg, which I intend to

Anyway, I'm really looking forward to the memory-bounded hashagg, and
I'm willing to spend some time testing it.

 submit for the next CF. Hashjoin tracks the tuple sizes that it adds to
 the hash table, which is a good estimate for Hashjoin. But I don't think

Actually, even for HashJoin the estimate is pretty bad, and varies a lot
depending on the tuple width. With narrow tuples (e.g. ~40B of data),
which is actually quite common case, you easily get ~100% palloc overhead.

I managed to address that by using a custom allocator. See this:

   https://commitfest.postgresql.org/action/patch_view?id=1503

I wonder whether something like that would be possible for the hashagg?

That would make the memory accounting accurate with 0% overhead (because
it's not messing with the memory context at all), but it only for the
one node (but maybe that's OK?).

 it's as easy for Hashagg, for which we need to track transition values,
 etc. (also, for HashAgg, I expect that the overhead will be more
 significant than for Hashjoin). If we track the space used by the memory
 contexts directly, it's easier and more accurate.

I don't think that's comparable - I can easily think about cases leading
to extreme palloc overhead with HashAgg (think of an aggregate
implementing COUNT DISTINCT - that effectively needs to store all the
values, and with short values the palloc overhead will be terrible).

Actually, I was looking at HashAgg (it's a somehow natural direction
after messing with Hash Join), and my plan was to use a similar dense
allocation approach. The trickiest part would probably me making this
available from the custom aggregates.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-04 Thread Peter Geoghegan
On Sat, Aug 2, 2014 at 1:40 PM, Jeff Davis pg...@j-davis.com wrote:
 This is a prerequisite for memory-bounded HashAgg, which I intend to
 submit for the next CF.

FWIW, I think that's a good project. A large number of these TPC-H
queries used HashAggs when I checked, on a moderate sized sample TPC-H
database:

http://examples.citusdata.com/tpch_queries.html

(I found these queries at random from Googling, but happened to have a
~2GB TPC-H database on my laptop). I attach EXPLAIN ANALYZE ouput for
each, as shown on master. From this admittedly unscientific random
sampling, 5 out of 8 query plans have a hash aggregate node. TPC-H is
a benchmark that Postgres does not tend to do too well on [1], and I
suspect that this has something to do with it; lower work_mem settings
will spook the optimizer into using a group aggregate within
choose_hashed_grouping(). Of course, in order to get the benefit of
your patch, that will need to be adjusted. I think that part is
surprisingly straightforward, though.

[1] https://wiki.postgresql.org/wiki/TPC-H
-- 
Peter Geoghegan
QUERY PLAN  
   
---
 Sort  (cost=410317.35..410317.37 rows=6 width=25) (actual 
time=10424.815..10424.816 rows=4 loops=1)
   Sort Key: l_returnflag, l_linestatus
   Sort Method: quicksort  Memory: 25kB
   -  HashAggregate  (cost=410317.11..410317.27 rows=6 width=25) (actual 
time=10424.769..10424.775 rows=4 loops=1)
 Group Key: l_returnflag, l_linestatus
 -  Seq Scan on lineitem  (cost=0.00..173620.19 rows=5917423 width=25) 
(actual time=0.017..1179.993 rows=5916591 loops=1)
   Filter: (l_shipdate = '1998-09-02 00:00:00'::timestamp without 
time zone)
   Rows Removed by Filter: 84624
 Planning time: 0.380 ms
 Execution time: 10424.894 ms
(10 rows)

   QUERY 
PLAN   

 Limit  (cost=267628.27..267628.32 rows=20 width=24) (actual 
time=2227.928..2227.931 rows=20 loops=1)
   -  Sort  (cost=267628.27..268428.35 rows=320034 width=24) (actual 
time=2227.927..2227.929 rows=20 loops=1)
 Sort Key: (sum((lineitem.l_extendedprice * (1::numeric - 
lineitem.l_discount, orders.o_orderdate
 Sort Method: top-N heapsort  Memory: 26kB
 -  HashAggregate  (cost=255111.85..259112.28 rows=320034 width=24) 
(actual time=.023..2225.916 rows=11620 loops=1)
   Group Key: lineitem.l_orderkey, orders.o_orderdate, 
orders.o_shippriority
   -  Hash Join  (cost=61347.05..250311.34 rows=320034 width=24) 
(actual time=511.401..2197.516 rows=30519 loops=1)
 Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
 -  Seq Scan on lineitem  (cost=0.00..173620.19 
rows=3238338 width=16) (actual time=0.007..880.520 rows=3241776 loops=1)
   Filter: (l_shipdate  '1995-03-15'::date)
   Rows Removed by Filter: 2759439
 -  Hash  (cost=59494.05..59494.05 rows=148240 width=12) 
(actual time=511.151..511.151 rows=147126 loops=1)
   Buckets: 16384  Batches: 1  Memory Usage: 6322kB
   -  Hash Join  (cost=5841.56..59494.05 rows=148240 
width=12) (actual time=32.502..482.473 rows=147126 loops=1)
 Hash Cond: (orders.o_custkey = 
customer.c_custkey)
 -  Seq Scan on orders  (cost=0.00..43975.00 
rows=728452 width=16) (actual time=0.006..226.220 rows=727305 loops=1)
   Filter: (o_orderdate  
'1995-03-15'::date)
   Rows Removed by Filter: 772695
 -  Hash  (cost=5460.00..5460.00 rows=30525 
width=4) (actual time=32.464..32.464 rows=30142 loops=1)
   Buckets: 4096  Batches: 1  Memory Usage: 
1060kB
   -  Seq Scan on customer  
(cost=0.00..5460.00 rows=30525 width=4) (actual time=0.006..28.441 rows=30142 
loops=1)
 Filter: (c_mktsegment = 
'BUILDING'::bpchar)
 Rows Removed by Filter: 119858
 Planning time: 0.483 ms
 Execution time: 2230.504 ms
(25 rows)

 QUERY 
PLAN 

[HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-02 Thread Jeff Davis
Attached is a patch that explicitly tracks allocated memory (the blocks,
not the chunks) for each memory context, as well as its children.

This is a prerequisite for memory-bounded HashAgg, which I intend to
submit for the next CF. Hashjoin tracks the tuple sizes that it adds to
the hash table, which is a good estimate for Hashjoin. But I don't think
it's as easy for Hashagg, for which we need to track transition values,
etc. (also, for HashAgg, I expect that the overhead will be more
significant than for Hashjoin). If we track the space used by the memory
contexts directly, it's easier and more accurate.

I did some simple pgbench select-only tests, and I didn't see any TPS
difference.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_allocation(MemoryContext context, int64 size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 502,510 
  	 errdetail(Failed while creating memory context \%s\.,
  			   name)));
  		}
+ 
+ 		update_allocation((MemoryContext) context, blksize);
+ 
  		block-aset = context;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 595,601 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_allocation(context, -(block-endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block-freeptr - ((char *) block));
  #endif
***
*** 632,637  AllocSetDelete(MemoryContext context)
--- 638,644 
  	{
  		AllocBlock	next = block-next;
  
+ 		update_allocation(context, -(block-endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block-freeptr - ((char *) block));
  #endif
***
*** 678,683  AllocSetAlloc(MemoryContext context, Size size)
--- 685,693 
  	 errmsg(out of memory),
  	 errdetail(Failed on request of size %zu., size)));
  		}
+ 
+ 		update_allocation(context, blksize);
+ 
  		block-aset = set;
  		block-freeptr = block-endptr = ((char *) block) + blksize;
  
***
*** 873,878  AllocSetAlloc(MemoryContext context, Size size)
--- 883,890 
  	 errdetail(Failed on request of size %zu., size)));
  		}
  
+ 		update_allocation(context, blksize);
+ 
  		block-aset = set;
  		block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block-endptr = ((char *) block) + blksize;
***
*** 976,981  AllocSetFree(MemoryContext context, void *pointer)
--- 988,994 
  			set-blocks = block-next;
  		else
  			prevblock-next = block-next;
+ 		update_allocation(context, -(block-endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block-freeptr - ((char *) block));
  #endif
***
*** 1088,1093  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1101,1107 
  		AllocBlock	prevblock = NULL;
  		Size		chksize;
  		Size		blksize;
+ 		Size		oldblksize;
  
  		while (block != NULL)
  		{
***
*** 1105,1110  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1119,1126 
  		/* Do the realloc */
  		chksize = MAXALIGN(size);
  		blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ 		oldblksize = block-endptr - ((char *)block);
+ 
  		block = (AllocBlock) realloc(block, blksize);
  		if (block == NULL)
  		{
***
*** 1114,1119  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1130,1136 
  	 errmsg(out of memory),
  	 errdetail(Failed on request of size %zu., size)));
  		}
+ 		update_allocation(context, blksize - oldblksize);
  		block-freeptr = block-endptr = ((char *) block) + blksize;
  
  		/* Update pointers since block has likely been moved */
***
*** 1277,1282  AllocSetStats(MemoryContext context, int level)
--- 1294,1320 
  }
  
  
+ /*
+  * update_allocation
+  *
+  * Track newly-allocated or newly-freed memory (freed memory should be
+  * negative).
+  */
+ static void
+ update_allocation(MemoryContext context, int64 size)
+ {
+ 	MemoryContext parent;
+ 
+ 	context-self_allocated += size;
+ 
+ 	for (parent = context; parent != NULL; parent = parent-parent)
+ 	{
+ 		parent-total_allocated += size;
+ 		Assert(parent-self_allocated = 0);
+ 		Assert(parent-total_allocated = 0);
+ 	}
+ }
+ 
  #ifdef MEMORY_CONTEXT_CHECKING
  
  /*
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
***
*** 324,329  MemoryContextAllowInCriticalSection(MemoryContext context, bool allow)
--- 324,344 
  }
  
  /*
+  * MemoryContextGetAllocated
+  *
+  * Return memory