Re: Memory Accounting

2019-10-04 Thread Robert Haas
On Tue, Sep 24, 2019 at 2:47 PM Melanie Plageman wrote:
> I think it would be helpful if we could repeat the performance tests
> Robert did on that machine with the current patch (unless this version
> of the patch is exactly the same as the ones he tested previously).

I still have access to a POWER machine, but it's currently being used
by somebody else for a benchmarking project, so I can't test this
immediately.

It's probably worth noting that, in addition to whatever's changed in
this patch, tuplesort.c's memory management has been altered
significantly since 2015 - see
0011c0091e886b874e485a46ff2c94222ffbf550 and
e94568ecc10f2638e542ae34f2990b821bbf90ac. I'm not immediately sure how
that would affect the REINDEX case that I tested back then, but it
seems at least possible that they would have the effect of making
palloc overhead less significant. More generally, so much of the
sorting machinery has been overhauled by Peter Geoghegan since then
that what happens now may just not be very comparable to what happened
back then.

I do agree that this approach looks pretty lightweight. Tomas's point
about the difference between updating only the current context and
updating all parent contexts seems right on target to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Memory Accounting

2019-10-04 Thread Tom Lane
Peter Geoghegan  writes:
> On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis  wrote:
>> The patch has been floating around for a very long time, so I don't
>> remember exactly why I chose a signed value. Sorry.

> I am reminded of the fact that int64 is used to size buffers within
> tuplesort.c, because it needs to support negative availMem sizes --
> when huge allocations were first supported, tuplesort.c briefly used
> "Size", which didn't work. Perhaps it had something to do with that.

I wonder if we should make that use ssize_t instead.  Probably
not worth the trouble.

regards, tom lane




Re: Memory Accounting

2019-10-04 Thread Peter Geoghegan
On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis  wrote:
> The patch has been floating around for a very long time, so I don't
> remember exactly why I chose a signed value. Sorry.

I am reminded of the fact that int64 is used to size buffers within
tuplesort.c, because it needs to support negative availMem sizes --
when huge allocations were first supported, tuplesort.c briefly used
"Size", which didn't work. Perhaps it had something to do with that.

--
Peter Geoghegan




Re: Memory Accounting

2019-10-04 Thread Tom Lane
I wrote:
> Just to make things even more mysterious, prairiedog finally showed
> the Assert failure on its fourth run with c477f3e449 included:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41
> It's also now apparent that lapwing and locust were failing only
> sometimes, as well.  I totally don't understand why that failure
> would've been only partially reproducible.  Maybe we should dig
> a bit harder, rather than just deciding that we fixed it.

I did that digging, reproducing the problem on florican's host
(again intermittently).  Here's a stack trace from the spot where
we sometimes downsize a large chunk:

#5  0x0851c70a in AllocSetRealloc (context=0x31b35000, pointer=0x319e5020, 
size=1224) at aset.c:1158
#6  0x085232eb in repalloc (pointer=0x319e5020, size=1224) at mcxt.c:1082
#7  0x31b69591 in resize_intArrayType (num=300, a=0x319e5020)
at _int_tool.c:268
#8  resize_intArrayType (a=0x319e5020, num=300) at _int_tool.c:250
#9  0x31b6995d in _int_unique (r=0x319e5020) at _int_tool.c:329
#10 0x31b66a00 in g_int_union (fcinfo=0xffbfcc5c) at _int_gist.c:146
#11 0x084ff9c4 in FunctionCall2Coll (flinfo=0x319e2bb4, collation=100, 
arg1=834250780, arg2=4290759884) at fmgr.c:1162
#12 0x080db3a3 in gistMakeUnionItVec (giststate=0x319e2820, itvec=0x31bae4a4, 
len=15, attr=0xffbfce5c, isnull=0xffbfcedc) at gistutil.c:204
#13 0x080e410d in gistunionsubkeyvec (giststate=giststate@entry=0x319e2820, 
itvec=itvec@entry=0x31bb5ed4, gsvp=gsvp@entry=0xffbfcd4c) at gistsplit.c:64
#14 0x080e416f in gistunionsubkey (giststate=giststate@entry=0x319e2820, 
itvec=itvec@entry=0x31bb5ed4, spl=spl@entry=0xffbfce3c) at gistsplit.c:91
#15 0x080e4456 in gistSplitByKey (r=, page=, 
itup=, len=, giststate=, 
v=, attno=) at gistsplit.c:689
#16 0x080d8797 in gistSplit (r=0x31bbb424, page=0x297e0b80 "", 
itup=0x31bb5ed4, len=16, giststate=0x319e2820) at gist.c:1432
#17 0x080d8d9c in gistplacetopage (rel=, 
freespace=, giststate=, 
buffer=, itup=, ntup=, 
oldoffnum=, newblkno=, 
leftchildbuf=, splitinfo=, 
markfollowright=, heapRel=, 
is_build=) at gist.c:299

So the potential downsize is expected, triggered by _int_unique
being able to remove some duplicate entries from a GIST union key.
AFAICT the fact that it happens only intermittently must boil down
to the randomized insertion choices that gistchoose() sometimes makes.

In short: nothing to see here, move along.

regards, tom lane




Re: Memory Accounting

2019-10-04 Thread Tom Lane
Tomas Vondra  writes:
> On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
>> What I think is happening is that c477f3e449 allowed this bit in
>> AllocSetRealloc:
>> context->mem_allocated += blksize - oldblksize;
>> to be executed in situations where blksize < oldblksize, where before
>> that was not possible.
>> ...
>> (I'm not quite sure why we're not seeing this failure on *all* the
>> 32-bit machines; maybe there's some other factor involved?)

> Interesting failure mode (especially that it does *not* fail on some
> 32-bit machines).

Just to make things even more mysterious, prairiedog finally showed
the Assert failure on its fourth run with c477f3e449 included:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41

It's also now apparent that lapwing and locust were failing only
sometimes, as well.  I totally don't understand why that failure
would've been only partially reproducible.  Maybe we should dig
a bit harder, rather than just deciding that we fixed it.

regards, tom lane




Re: Memory Accounting

2019-10-04 Thread Tom Lane
Jeff Davis  writes:
> On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote:
>> Yeah, I think that's an oversight. Maybe there's a reason why Jeff
>> used int64, but I can't think of any.

> I had chosen a 64-bit value to account for the situation Tom mentioned:
> that, in theory, Size might not be large enough to represent all
> allocations in a memory context. Apparently, that theoretical situation
> is not worth being concerned about.

Well, you could also argue it the other way: maybe in our children's
time, int64 won't be as wide as Size.  (Yeah, I know that sounds
ridiculous, but needing pointers wider than 32 bits was a ridiculous
idea too when I started in this business.)

The committed fix seems OK to me except that I think you should've
also changed MemoryContextMemAllocated() to return Size.
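
(Concretely, the follow-up would be roughly this one-line signature
change -- sketched against what I assume the header currently declares:

-extern int64 MemoryContextMemAllocated(MemoryContext context, bool recurse);
+extern Size MemoryContextMemAllocated(MemoryContext context, bool recurse);
)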

regards, tom lane




Re: Memory Accounting

2019-10-04 Thread Jeff Davis
On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote:
> On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
> > So ... why exactly did this patch define
> > MemoryContextData.mem_allocated
> > as int64?  That seems to me to be doubly wrong: it is not the right
> > width
> > on 32-bit machines, and it is not the right signedness anywhere.  I
> > think
> > that field ought to be of type Size (a/k/a size_t, but memnodes.h
> > always
> > calls it Size).
> > 
> 
> Yeah, I think that's an oversight. Maybe there's a reason why Jeff
> used
> int64, but I can't think of any.

I had chosen a 64-bit value to account for the situation Tom mentioned:
that, in theory, Size might not be large enough to represent all
allocations in a memory context. Apparently, that theoretical situation
is not worth being concerned about.

The patch has been floating around for a very long time, so I don't
remember exactly why I chose a signed value. Sorry.

Regards,
Jeff Davis






Re: Memory Accounting

2019-10-04 Thread Tomas Vondra

On Fri, Oct 04, 2019 at 10:26:44AM +0200, Tomas Vondra wrote:

On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:

I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.



I think so too, but I'll take a closer look in the afternoon, unless you
beat me to it.



I've pushed a fix changing the type to Size, splitting the mem_allocated
to two separate updates (to prevent any underflows in the subtraction).
Hopefully this fixes the 32-bit machines ...
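
(Roughly, the realloc path now accounts in two steps -- a sketch of the
idea rather than the exact committed hunk:

    context->mem_allocated -= oldblksize;   /* each update stays in range */
    /* ... realloc the block, recompute blksize ... */
    context->mem_allocated += blksize;

so no unsigned subtraction can underflow, whether the block grew or
shrank.)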

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

2019-10-04 Thread Tomas Vondra

On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:

So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64?  That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere.  I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).



Yeah, I think that's an oversight. Maybe there's a reason why Jeff used
int64, but I can't think of any.


I let this pass when the patch went in, but now I'm on the warpath
about it, because since c477f3e449 went in, some of the 32-bit buildfarm
members are failing with

2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG:  statement: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533)
2019-10-04 00:42:25.505 CEST [63836:11] LOG:  server process (PID 66916) was terminated by signal 6: Abort trap
2019-10-04 00:42:25.505 CEST [63836:12] DETAIL:  Failed process was running: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );

What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:

context->mem_allocated += blksize - oldblksize;

to be executed in situations where blksize < oldblksize, where before
that was not possible.  Of course blksize and oldblksize are of type
Size, hence unsigned, so the subtraction result underflows in this
case.  If mem_allocated is of the same width as Size then this does
not matter because the final result wraps around to the proper value,
but if we're going to allow mem_allocated to be wider than Size then
we will need more logic here to add or subtract as appropriate.

(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)



Interesting failure mode (especially that it does *not* fail on some
32-bit machines).


I see no value in defining mem_allocated to be wider than Size.
Yes, the C standard contemplates the possibility that the total
available address space is larger than the largest chunk you can
ever ask malloc for; but nobody has built a platform like that in
this century, and they sucked to program on back in the dark ages
when they did exist.  (I speak from experience.)  I do not think
we need to design Postgres to allow for that.

Likewise, there's no evident value in allowing mem_allocated
to be negative.

I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.



I think so too, but I'll take a closer look in the afternoon, unless you
beat me to it.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Memory Accounting

2019-10-03 Thread Tom Lane
So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64?  That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere.  I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).

I let this pass when the patch went in, but now I'm on the warpath
about it, because since c477f3e449 went in, some of the 32-bit buildfarm
members are failing with

2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG:  statement: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533)
2019-10-04 00:42:25.505 CEST [63836:11] LOG:  server process (PID 66916) was terminated by signal 6: Abort trap
2019-10-04 00:42:25.505 CEST [63836:12] DETAIL:  Failed process was running: CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );

What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:

context->mem_allocated += blksize - oldblksize;

to be executed in situations where blksize < oldblksize, where before
that was not possible.  Of course blksize and oldblksize are of type
Size, hence unsigned, so the subtraction result underflows in this
case.  If mem_allocated is of the same width as Size then this does
not matter because the final result wraps around to the proper value,
but if we're going to allow mem_allocated to be wider than Size then
we will need more logic here to add or subtract as appropriate.
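
(A minimal demonstration of the wraparound, assuming a 32-bit Size:

    Size   blksize = 1024;
    Size   oldblksize = 4096;
    /* blksize - oldblksize evaluates in unsigned arithmetic: it yields
     * 4294964224 (i.e. 2^32 - 3072) rather than -3072 */

If mem_allocated has the same 32-bit width, adding 4294964224 is the
same as subtracting 3072 modulo 2^32, so the counter still comes out
right; if mem_allocated is a 64-bit int64, the bogus ~4GB addend sticks.)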

(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)

I see no value in defining mem_allocated to be wider than Size.
Yes, the C standard contemplates the possibility that the total
available address space is larger than the largest chunk you can
ever ask malloc for; but nobody has built a platform like that in
this century, and they sucked to program on back in the dark ages
when they did exist.  (I speak from experience.)  I do not think
we need to design Postgres to allow for that.

Likewise, there's no evident value in allowing mem_allocated
to be negative.

I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.

regards, tom lane




Re: Memory Accounting

2019-09-30 Thread Tomas Vondra

On Mon, Sep 30, 2019 at 01:34:13PM -0700, Jeff Davis wrote:

On Sun, 2019-09-29 at 00:22 +0200, Tomas Vondra wrote:

Notice that when CLOBBER_FREED_MEMORY is defined, the code first
> calls
> wipe_mem and then accesses fields of the (wiped) block.
> Interestingly
> enough, the regression tests don't seem to exercise these bits -
> I've
> tried adding elog(ERROR) and it still passes. For (2) that's not
> very
> surprising because Generation context is only really used in
> logical
> decoding (and we don't delete the context I think). Not sure about
> (1)
> but it might be because AllocSetReset does the right thing and only
> leaves behind the keeper block.
>
> I'm pretty sure a custom function calling the contexts explicitly
> would
> fall over, but I haven't tried.
>


Fixed.

I tested with some custom use of memory contexts. The reason
AllocSetDelete() didn't fail before is that most memory contexts use
the free lists (the list of free memory contexts, not the free list of
chunks), so you need to specify a non-default minsize in order to
prevent that and trigger the bug.

AllocSetReset() worked, but it was reading the header of the keeper
block after wiping the contents of the keeper block. It technically
worked, because the header of the keeper block was not wiped, but it
seems more clear to explicitly save the size of the keeper block. In
AllocSetDelete(), saving the keeper size is required, because it wipes
the block headers in addition to the contents.



OK, looks good to me now.


Oh, and one more thing - this probably needs to add at least some
basic
explanation of the accounting to src/backend/mmgr/README.


Added.



Well, I meant a couple of paragraphs explaining the motivation and the
relevant trade-offs considered. So I've written a brief summary of the
design as I understand it and pushed it like that. Of course, if you
could proof-read it, that'd be good.

I had a bit of a hard time deciding who to list as a reviewer - this
patch started sometime in ~2015, and it was initially discussed as part
of the larger hashagg effort, with plenty of people discussing various
ancient versions of the patch. In the end I've included just people from
the current thread. If that omits important past reviews, I'm sorry.

For the record, results of the benchmarks I've done over the past couple
of days are in [1]. It includes both the reindex benchmark used by
Robert in 2015, and a regular read-only pgbench. In general, the
overhead of the accounting is pretty much indistinguishable from noise
(at least on those two machines).

In any case, thanks for the perseverance working on this.


[1] https://github.com/tvondra/memory-accounting-benchmarks


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Memory Accounting

2019-09-30 Thread Jeff Davis
On Sun, 2019-09-29 at 00:22 +0200, Tomas Vondra wrote:
> Notice that when CLOBBER_FREED_MEMORY is defined, the code first
> > calls
> > wipe_mem and then accesses fields of the (wiped) block.
> > Interestingly
> > enough, the regression tests don't seem to exercise these bits -
> > I've
> > tried adding elog(ERROR) and it still passes. For (2) that's not
> > very
> > surprising because Generation context is only really used in
> > logical
> > decoding (and we don't delete the context I think). Not sure about
> > (1)
> > but it might be because AllocSetReset does the right thing and only
> > leaves behind the keeper block.
> > 
> > I'm pretty sure a custom function calling the contexts explicitly
> > would
> > fall over, but I haven't tried.
> > 

Fixed.

I tested with some custom use of memory contexts. The reason
AllocSetDelete() didn't fail before is that most memory contexts use
the free lists (the list of free memory contexts, not the free list of
chunks), so you need to specify a non-default minsize in order to
prevent that and trigger the bug.

AllocSetReset() worked, but it was reading the header of the keeper
block after wiping the contents of the keeper block. It technically
worked, because the header of the keeper block was not wiped, but it
seems more clear to explicitly save the size of the keeper block. In
AllocSetDelete(), saving the keeper size is required, because it wipes
the block headers in addition to the contents.
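
(The distilled pattern, as in the attached patch: compute the keeper
block's size up front, before any wipe_mem() can clobber what we would
need to read, and verify against the saved value afterwards.

    Size  keepersize = set->keeper->endptr - ((char *) set);
    ...
    Assert(context->mem_allocated == keepersize);
)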

> Oh, and one more thing - this probably needs to add at least some
> basic 
> explanation of the accounting to src/backend/mmgr/README.

Added.

Regards,
Jeff Davis

From 7184df8a928d6c0c6e41d171968a54d46f256a22 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 1 Jun 2018 13:35:21 -0700
Subject: [PATCH] Memory accounting at block level.

Track the memory allocated to a memory context in the form of blocks,
and introduce a new function MemoryContextMemAllocated() to report it
to the caller in bytes. This does not track individual chunks.
---
 src/backend/utils/mmgr/README   |  4 +++
 src/backend/utils/mmgr/aset.c   | 38 +
 src/backend/utils/mmgr/generation.c | 19 ---
 src/backend/utils/mmgr/mcxt.c   | 24 ++
 src/backend/utils/mmgr/slab.c   | 10 
 src/include/nodes/memnodes.h|  1 +
 src/include/utils/memutils.h|  1 +
 7 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/mmgr/README b/src/backend/utils/mmgr/README
index 7e6541d0dee..bd5fa6d8e53 100644
--- a/src/backend/utils/mmgr/README
+++ b/src/backend/utils/mmgr/README
@@ -23,6 +23,10 @@ The basic operations on a memory context are:
 * reset a context (free all memory allocated in the context, but not the
   context object itself)
 
+* inquire about the total amount of memory allocated to the context
+  (the raw memory from which the context allocates chunks; not the
+  chunks themselves)
+
 Given a chunk of memory previously allocated from a context, one can
 free it or reallocate it larger or smaller (corresponding to standard C
 library's free() and realloc() routines).  These operations return memory
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6b63d6f85d0..90f370570fe 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,9 @@ AllocSetContextCreateInternal(MemoryContext parent,
 parent,
 name);
 
+			((MemoryContext) set)->mem_allocated =
+set->keeper->endptr - ((char *) set);
+
 			return (MemoryContext) set;
 		}
 	}
@@ -546,6 +549,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
 		parent,
 		name);
 
+	((MemoryContext) set)->mem_allocated = firstBlockSize;
+
 	return (MemoryContext) set;
 }
 
@@ -566,6 +571,7 @@ AllocSetReset(MemoryContext context)
 {
 	AllocSet	set = (AllocSet) context;
 	AllocBlock	block;
+	Size		keepersize = set->keeper->endptr - ((char *) set);
 
 	AssertArg(AllocSetIsValid(set));
 
@@ -604,6 +610,8 @@ AllocSetReset(MemoryContext context)
 		else
 		{
 			/* Normal case, release the block */
+			context->mem_allocated -= block->endptr - ((char*) block);
+
 #ifdef CLOBBER_FREED_MEMORY
 			wipe_mem(block, block->freeptr - ((char *) block));
 #endif
@@ -612,6 +620,8 @@ AllocSetReset(MemoryContext context)
 		block = next;
 	}
 
+	Assert(context->mem_allocated == keepersize);
+
 	/* Reset block size allocation sequence, too */
 	set->nextBlockSize = set->initBlockSize;
 }
@@ -628,6 +638,7 @@ AllocSetDelete(MemoryContext context)
 {
 	AllocSet	set = (AllocSet) context;
 	AllocBlock	block = set->blocks;
+	Size		keepersize = set->keeper->endptr - ((char *) set);
 
 	AssertArg(AllocSetIsValid(set));
 
@@ -683,6 +694,9 @@ AllocSetDelete(MemoryContext context)
 	{
 		AllocBlock	next = block-&

Re: Memory Accounting

2019-09-28 Thread Tomas Vondra

On Sun, Sep 29, 2019 at 12:12:49AM +0200, Tomas Vondra wrote:

On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote:

On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:

It's worth mentioning that those benchmarks (I'm assuming we're talking
about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong, I'll stick to my plan to get this committed soon.


That was my conclusion, as well.



I was about to commit the patch, but during the final review I've
noticed two places that I think are bugs:

1) aset.c (AllocSetDelete)
--

#ifdef CLOBBER_FREED_MEMORY
  wipe_mem(block, block->freeptr - ((char *) block));
#endif

  if (block != set->keeper)
  {
  context->mem_allocated -= block->endptr - ((char *) block);
  free(block);
  }

2) generation.c (GenerationReset)
-

#ifdef CLOBBER_FREED_MEMORY
  wipe_mem(block, block->blksize);
#endif
  context->mem_allocated -= block->blksize;


Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because Generation context is only really used in logical
decoding (and we don't delete the context I think). Not sure about (1)
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.

I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.



Oh, and one more thing - this probably needs to add at least some basic 
explanation of the accounting to src/backend/mmgr/README.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

2019-09-28 Thread Tomas Vondra

On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote:

On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:

It's worth mentioning that those benchmarks (I'm assuming we're talking
about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong, I'll stick to my plan to get this committed soon.


That was my conclusion, as well.



I was about to commit the patch, but during the final review I've
noticed two places that I think are bugs:

1) aset.c (AllocSetDelete)
--

#ifdef CLOBBER_FREED_MEMORY
   wipe_mem(block, block->freeptr - ((char *) block));
#endif

   if (block != set->keeper)
   {
   context->mem_allocated -= block->endptr - ((char *) block);
   free(block);
   }

2) generation.c (GenerationReset)
-

#ifdef CLOBBER_FREED_MEMORY
   wipe_mem(block, block->blksize);
#endif
   context->mem_allocated -= block->blksize;


Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because Generation context is only really used in logical
decoding (and we don't delete the context I think). Not sure about (1)
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.

I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.
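
(One obvious shape for a fix -- a sketch, not necessarily the final
form -- is to capture the block size before the wipe, e.g. for the
aset.c case:

    Size  blksize = block->endptr - ((char *) block);

#ifdef CLOBBER_FREED_MEMORY
    wipe_mem(block, block->freeptr - ((char *) block));
#endif

    if (block != set->keeper)
    {
        context->mem_allocated -= blksize;
        free(block);
    }
)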

Aside from that, I've repeated the REINDEX benchmarks done by Robert in
[1] with different scales on two different machines, and I've measured
no difference. Both machines are x86_64, I don't have access to any
Power machine at the moment, unfortunately.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

2019-09-26 Thread Jeff Davis
On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:
> It's worth mentioning that those benchmarks (I'm assuming we're talking
> about the numbers Robert shared in [1]) were done on patches that used
> the eager accounting approach (i.e. walking all parent contexts and
> updating the accounting for them).
> 
> I'm pretty sure the current "lazy accounting" patches don't have that
> issue, so unless someone objects and/or can show numbers demonstrating
> I'm wrong, I'll stick to my plan to get this committed soon.

That was my conclusion, as well.

Regards,
Jeff Davis





Re: Memory Accounting

2019-09-26 Thread Tomas Vondra

On Tue, Sep 24, 2019 at 11:46:49AM -0700, Melanie Plageman wrote:

On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis  wrote:


On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
> Hi Jeff,

Hi Soumyadeep and Melanie,

Thank you for the review!

> max_stack_depth   max level   lazy (ms)   eager (ms)   (eager/lazy)
> 2MB               82          302.715     427.554      1.4123978
> 3MB               3474        567.829     896.143      1.578191674
> 7.67MB            8694        2657.972    4903.063     1.844663149

Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?



We didn't spend much time comparing performance with and without
memory accounting, as it seems like this was discussed extensively in
the previous thread.



It looks like you agree with the approach and the results. Did you find
any other issues with the patch?



We didn't observe any other problems with the patch and agree with the
approach. It is a good start.




I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.



I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).



I agree that would be nice, but I don't have access to any Power machine
suitable for this kind of benchmarking :-( Robert, any chance you still
have access to that machine?

It's worth mentioning that those benchmarks (I'm assuming we're talking
about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong, I'll stick to my plan to get this committed soon.

regards

[1] 
https://www.postgresql.org/message-id/flat/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com#3e2dc9e70a9f9eb2d695ab94a580c5a2

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

2019-09-24 Thread Michael Paquier
On Tue, Sep 24, 2019 at 02:05:51PM +0200, Tomas Vondra wrote:
> The way I see it we can do either eager or lazy accounting. Eager might
> work better for aggregates with many contexts, but it does increase the
> overhead for the "regular" aggregates with just one or two contexts.
> Considering how rare those many-context aggregates are (I'm not aware of
> any such aggregate at the moment), it seems reasonable to pick the lazy
> accounting.

Okay.

> So I think the approach Jeff ended up with is sensible - certainly as a
> first step. We may improve it in the future, of course, once we have
> more practical experience.
> 
> Barring objections, I do plan to get this committed by the end of this
> CF (i.e. sometime later this week).

Sounds good to me.  Though I have not looked at the patch in detail,
the arguments are sensible.  Thanks for confirming.
--
Michael




Re: Memory Accounting

2019-09-24 Thread Melanie Plageman
On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis  wrote:

> On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
> > Hi Jeff,
>
> Hi Soumyadeep and Melanie,
>
> Thank you for the review!
>
> > max_stack_depth   max level   lazy (ms)   eager (ms)   (eager/lazy)
> > 2MB               82          302.715     427.554      1.4123978
> > 3MB               3474        567.829     896.143      1.578191674
> > 7.67MB            8694        2657.972    4903.063     1.844663149
>
> Thank you for collecting data on this. Were you able to find any
> regression when compared to no memory accounting at all?
>
>
We didn't spend much time comparing performance with and without
memory accounting, as it seems like this was discussed extensively in
the previous thread.


> It looks like you agree with the approach and the results. Did you find
> any other issues with the patch?
>

We didn't observe any other problems with the patch and agree with the
approach. It is a good start.


>
> I am also including Robert in this thread. He had some concerns the
> last time around due to a small regression on POWER.
>

I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).

Thanks,
Soumyadeep & Melanie


Re: Memory Accounting

2019-09-24 Thread Tomas Vondra

On Tue, Sep 24, 2019 at 02:21:40PM +0900, Michael Paquier wrote:

On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote:

I think Heikki was asking about places with a lot of sub-contexts, which is a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).


The patch has been marked as ready for committer for a week or so, but
it seems to me that this comment has not been addressed, no?  Are we
sure that we want this method if it proves to be inefficient when
there are many sub-contexts and shouldn't we at least test such
scenarios with a worst-case, custom-made function?


I don't think so.

Aggregates creating many memory contexts (one context per group) were
discussed extensively in the thread about v11 [1] in 2015. And back then
the conclusion was that that's a pretty awful pattern anyway, as it uses
much more memory (no cross-context freelists), and has various other
issues. In a way, those aggregates are wrong and should be fixed just
like we fixed array_agg/string_agg (even without the memory accounting).

The way I see it we can do either eager or lazy accounting. Eager might
work better for aggregates with many contexts, but it does increase the
overhead for the "regular" aggregates with just one or two contexts.
Considering how rare those many-context aggregates are (I'm not aware of
any such aggregate at the moment), it seems reasonable to pick the lazy
accounting.

(Note: Another factor affecting the lazy vs. eager efficiency is the
number of palloc/pfree calls vs. calls to determine the amount of memory
used, but that's mostly orthogonal and we can ignore it here.)
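
(To make the trade-off concrete -- pseudocode, not the actual patch:

    /* eager: every block allocation walks up the parent chain */
    for (MemoryContext c = context; c != NULL; c = c->parent)
        c->mem_allocated += blksize;        /* O(tree depth) per malloc */

    /* lazy: allocation touches one counter; readers recurse on demand */
    context->mem_allocated += blksize;      /* O(1) per malloc */

Under the lazy scheme, reading the total means summing over the child
contexts, which is cheap as long as the context tree stays small.)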

So I think the approach Jeff ended up with is sensible - certainly as a
first step. We may improve it in the future, of course, once we have
more practical experience.

Barring objections, I do plan to get this committed by the end of this
CF (i.e. sometime later this week).

[1] 
https://www.postgresql.org/message-id/1434311039.4369.39.camel%40jeff-desktop

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Memory Accounting

2019-09-19 Thread Jeff Davis
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
> Hi Jeff,

Hi Soumyadeep and Melanie,

Thank you for the review!

> max_stack_depth   max level   lazy (ms)   eager (ms)   (eager/lazy)
> 2MB               82          302.715     427.554      1.4123978
> 3MB               3474        567.829     896.143      1.578191674
> 7.67MB            8694        2657.972    4903.063     1.844663149

Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?

It looks like you agree with the approach and the results. Did you find
any other issues with the patch?

I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.

Regards,
Jeff Davis





Re: Memory Accounting

2019-09-18 Thread Melanie Plageman
I wanted to address a couple of questions Jeff asked me off-list about
Greenplum's implementations of Memory Accounting.

Greenplum has two memory accounting sub-systems -- one is the
MemoryContext-based system proposed here.
The other memory accounting system tracks "logical" memory owners in
global accounts. For example, every node in a plan has an account,
however, there are other execution contexts, such as the parser, which
have their own logical memory accounts.
Notably, this logical memory account system also tracks chunks instead
of blocks.

The rationale for tracking memory at the logical owner level was that
a logical owner may allocate memory across multiple
contexts and a single context may contain memory belonging to several
of these logical owners.

More compellingly, many of the allocations done during execution are
done directly in the per query or per tuple context--as opposed to
being done in their own uniquely named context. Arguably, this is
because those logical owners (a Result node, for example) are not
memory-intensive and thus do not require specific memory accounting.
However, when debugging a memory leak or OOM, the specificity of
logical owner accounts was seen as desirable. A discrepancy between
memory allocated and memory freed in the per query context doesn't
provide a lot of hints as to the source of the leak.
At the least, there was no meaningful way to represent MemoryContext
account balances in EXPLAIN ANALYZE. Where would the TopMemoryContext
be represented, for example?

Also, by using logical accounts, each node in the plan could be
assigned a quota at plan time--because many memory intensive operators
will not have relinquished the memory they hold when other nodes are
executing (e.g. Materialize)--so, instead of granting each node
work_mem, work_mem is divided up into quotas for each operator in a
particular way. This was meant to pave the way for work_mem
enforcement. This is a topic that has come up in various ways in other
forums. For example, in the XPRS thread, the discussion of erroring
out for queries with no "escape mechanism" brought up by Thomas Munro
[1] is a kind of work_mem enforcement (this discussion was focused
more on a proposed session-level memory setting, but it is still
enforcement of a memory setting).
It was also discussed at PGCon this year in an unconference session on
OOM-detection and debugging, runaway query termination, and
session-level memory consumption tracking [2].

The motivation for tracking chunks instead of blocks was to understand
the "actual" memory consumption of different components in the
database. Then, eventually, memory consumption patterns would emerge
and improvements could be made to memory allocation strategies to suit
different use cases--perhaps other implementations of the
MemoryContext API similar to Slab and Generation were envisioned.
Apparently, it did lead to the discovery of some memory fragmentation
issues which were tuned.

I bring these up not just to answer Jeff's question but also to
provide some anecdotal evidence that the patch here is a good base for
other memory accounting and tracking schemes.

Even if HashAgg is the only initial consumer of the memory accounting
framework, we know that tuplesort can make use of it in its current
state as well. And, if another node or component requires
chunk-tracking, they could implement a different MemoryContext API
implementation which uses the MemoryContextData->mem_allocated field
to track chunks instead of blocks by tracking chunks in its alloc/free
functions.
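
(Purely illustrative: a hypothetical context implementation might bump
the same counter at chunk granularity from its own allocation path.

    static void *
    ChunkTrackedAlloc(MemoryContext context, Size size)
    {
        void   *chunk = malloc(size);       /* stand-in for the real allocator */

        if (chunk != NULL)
            context->mem_allocated += size; /* per chunk, not per block */
        return chunk;
    }
)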

Ideas like logical memory accounting could leverage the mem_allocated
field and build on top of it.

[1]
https://www.postgresql.org/message-id/CA%2BhUKGJEMT7SSZRqt-knu_3iLkdscBCe9M2nrhC259FdE5bX7g%40mail.gmail.com
[2] https://wiki.postgresql.org/wiki/PgCon_2019_Developer_Unconference


Re: Memory Accounting

2019-07-24 Thread Tomas Vondra

On Tue, Jul 23, 2019 at 06:18:26PM -0700, Melanie Plageman wrote:

On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis  wrote:


Previous discussion:
https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop

This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.



Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.

Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc and then later, were we to try and check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?



I think there's plenty of places where we quickly get into a state with
enough chunks in the freelist - the per-tuple contexts are a good
example of that, I think.
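
(For instance, in a warmed-up per-tuple context -- hypothetical sequence:

    void *p = palloc(64);   /* first time: may malloc a block, counter grows */
    pfree(p);               /* chunk goes to the freelist, block is kept */
    p = palloc(64);         /* served from the freelist: no malloc, and the
                             * block-level counter doesn't move */
)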


I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.



I think Heikki was asking about places with a lot of sub-contexts, which is a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).




The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.



This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?



Yes, that might be a good candidate (and it would be much simpler than
the manual accounting we use now).

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Memory Accounting

2019-07-23 Thread Melanie Plageman
On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis  wrote:

> Previous discussion:
> https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop
>
> This patch introduces a way to ask a memory context how much memory it
> currently has allocated. Each time a new block (not an individual
> chunk, but a new malloc'd block) is allocated, it increments a struct
> member; and each time a block is free'd, it decrements the struct
> member. So it tracks blocks allocated by malloc, not what is actually
> used for chunks allocated by palloc.
>
>
Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.

Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc and then later, were we to try and check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?

I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.


> The purpose is for Memory Bounded Hash Aggregation, but it may be
> useful in more cases as we try to better adapt to and control memory
> usage at execution time.
>
>
This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?

-- 
Melanie Plageman


Re: Memory Accounting

2019-07-22 Thread Jeff Davis
On Mon, 2019-07-22 at 18:16 +0200, Tomas Vondra wrote:
> * I changed it to only update mem_allocated for the current
> > > context,
> > > not recursively for all parent contexts. It's now up to the
> > > function
> > > that reports memory usage to recurse or not (as needed).
> > 
> > Is that OK for memory bounded hash aggregation? Might there be a
> > lot 
> > of sub-contexts during hash aggregation?
> > 
> 
> There shouldn't be, at least not since b419865a814abbc. There might
> be
> cases where custom aggregates still do that, but I think that's
> simply a
> design we should discourage.

Right, I don't think memory-context-per-group is something we should
optimize for.

Discussion:
https://www.postgresql.org/message-id/3839201.Nfa2RvcheX%40techfox.foxi
https://www.postgresql.org/message-id/5334D7A5.2000907%40fuzzy.cz

Commit link:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

Regards,
Jeff Davis






Re: Memory Accounting

2019-07-22 Thread Tomas Vondra

On Mon, Jul 22, 2019 at 11:30:37AM +0300, Heikki Linnakangas wrote:

On 18/07/2019 21:24, Jeff Davis wrote:

Previous discussion:
https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop

This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.

The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.


Seems handy.



Indeed.


* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).


Is that OK for memory bounded hash aggregation? Might there be a lot 
of sub-contexts during hash aggregation?




There shouldn't be, at least not since b419865a814abbc. There might be
cases where custom aggregates still do that, but I think that's simply a
design we should discourage.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Memory Accounting

2019-07-22 Thread Heikki Linnakangas

On 18/07/2019 21:24, Jeff Davis wrote:

Previous discussion:
https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop

This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.

The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.


Seems handy.


* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).


Is that OK for memory bounded hash aggregation? Might there be a lot of 
sub-contexts during hash aggregation?


- Heikki




Memory Accounting

2019-07-18 Thread Jeff Davis
Previous discussion: 
https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop

This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.

The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
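
A caller might consume it like this (hypothetical consumer; hash
aggregation is the motivating one, and I'm assuming the patch's
recursive-report flag):

    int64  allocated = MemoryContextMemAllocated(aggcontext, true);

    if (allocated > work_mem * 1024L)
        spill_to_disk();    /* hypothetical response to exceeding budget */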

I ran the same tests as Robert did before[1] on my laptop[2]. The only
difference is that I also set max_parallel_workers[_per_gather]=0 to be
sure. I did 5 runs, alternating between memory-accounting and master,
and I got the following results for "elapsed" (as reported by
trace_sort):


regression=# select version, min(s), max(s), percentile_disc(0.5)
within group (order by s) as median, avg(s)::numeric(10,2) from tmp
group by version;
      version      |  min  |  max  | median |  avg
-------------------+-------+-------+--------+-------
 master            | 13.92 | 14.40 |  14.06 | 14.12
 memory accounting | 13.43 | 14.46 |  14.11 | 14.09
(2 rows)


So, I don't see any significant performance impact for this patch in
this test. That may be because:

* It was never really significant except on PPC64.
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed).
* A lot of changes to sort have happened since that time, so perhaps
it's not a good test of memory contexts any more.

pgbench didn't show any slowdown either.

I also did another test with hash aggregation that uses significant
memory (t10m is a table with 10M distinct values and work_mem is 1GB):

postgres=# select (select (i, count(*)) from t10m group by i having
count(*) > n) from (values(1),(2),(3),(4),(5)) as s(n);

I didn't see any noticable difference there, either.

Regards,
Jeff Davis

[1] 
https://postgr.es/m/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com
[2] Linux jdavis 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
From d7c8587620aad080afc904b54a3ebd3232f39b7c Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 1 Jun 2018 13:35:21 -0700
Subject: [PATCH] Memory accounting at block level.

Track the memory allocated to a memory context in the form of blocks,
and introduce a new function MemoryContextMemAllocated() to report it
to the caller in bytes. This does not track individual chunks.
---
 src/backend/utils/mmgr/aset.c   | 36 +
 src/backend/utils/mmgr/generation.c | 18 +++
 src/backend/utils/mmgr/mcxt.c   | 24 +++
 src/backend/utils/mmgr/slab.c   | 10 
 src/include/nodes/memnodes.h|  1 +
 src/include/utils/memutils.h|  1 +
 6 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6e4a343439..6e90493810 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -458,6 +458,9 @@ AllocSetContextCreateInternal(MemoryContext parent,
 parent,
 name);
 
+			((MemoryContext) set)->mem_allocated =
+set->keeper->endptr - ((char *) set);
+
 			return (MemoryContext) set;
 		}
 	}
@@ -546,6 +549,8 @@ AllocSetContextCreateInternal(MemoryContext parent,
 		parent,
 		name);
 
+	((MemoryContext) set)->mem_allocated = firstBlockSize;
+
 	return (MemoryContext) set;
 }
 
@@ -604,6 +609,8 @@ AllocSetReset(MemoryContext context)
 		else
 		{
 			/* Normal case, release the block */
+			context->mem_allocated -= block->endptr - ((char*) block);
+
 #ifdef CLOBBER_FREED_MEMORY
 			wipe_mem(block, block->freeptr - ((char *) block));
 #endif
@@ -612,6 +619,8 @@ AllocSetReset(MemoryContext context)
 		block = next;
 	}
 
+	Assert(context->mem_allocated == set->keeper->endptr - ((char *) set));
+
 	/* Reset block size allocation sequence, too */
 	set->nextBlockSize = set->initBlockSize;
 }
@@ -688,11 +697,16 @@ AllocSetDelete(MemoryContext context)
 #endif
 
 		if (block != set->keeper)
+		{
+			context->mem_allocated -= block->endptr - ((char *) block);
 			free(block);
+		}
 
 		block = next;
 	}
 
+	Assert(context->mem_allocated == set->keeper->endptr - ((char *) set));
+
 	/* Finally, free the context header, including the keeper block */
 	free(set);
 }
@@ -733,6 +747,9 @@ AllocSetAlloc(MemoryContext context, Size size)
 		block = (AllocBlock) malloc(blksize);
 		if (block == NULL)
 			return NULL;
+
+		context->mem_allocated += blksize;
+
 		block->aset = set;
 		block->freeptr = block->endptr = ((char *) block) + blksize;
 
@