Re: [HACKERS] rbtree code breaks GIN's adherence to maintenance_work_mem

2010-07-31 Thread Robert Haas
On Sat, Jul 31, 2010 at 12:40 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Another misbehavior that I noted while playing with Artur's example is
> that, while GIN index build seems to adhere pretty well to whatever
> maintenance_work_mem limit you give it in 8.4, it blows right by that
> limit in 9.0 and HEAD --- I observed it happily eating something north
> of 128MB with a maintenance_work_mem of 70MB.  It looks to me like the
> reason is the new rbtree.c code, which adds a whole lot of data
> structure that's not being counted against the maintenance_work_mem
> limit.
>
> Now the first question that might be asked is what we'd need to do to
> rbtree.c's API to allow its memory consumption to be tracked.  However,
> I think there's a considerably more pressing question, which is what
> it is that rbtree.c is doing that justifies a 2X bloat factor in GIN
> index build --- which is pretty much what it's costing us, given the
> observation above.  We know that the amount of memory available for
> the build has an enormous effect on build time.  If we just do the
> obvious thing of including the rbtree data structure in the
> maintenance_work_mem calculation, what we're going to get is a very
> substantial slowdown in build speed for the same maintenance_work_mem
> setting, compared to the way 8.4 worked.
>
> So, I would like somebody to show cause why that whole module shouldn't
> be ripped out and the code reverted to where it was in 8.4.  My
> recollection is that the argument for adding it was to speed things up
> in corner cases, but what I think it's actually going to do is slow
> things down in every case.

I've always been a bit suspicious of this code, too, even though I
didn't think about the memory consumption issue.  But see here:

http://archives.postgresql.org/pgsql-hackers/2010-02/msg00307.php

I think there may be a better way to avoid the pathological cases, but
I'm not sure what it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] rbtree code breaks GIN's adherence to maintenance_work_mem

2010-07-31 Thread Tom Lane
Robert Haas <robertmh...@gmail.com> writes:
> On Sat, Jul 31, 2010 at 12:40 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> So, I would like somebody to show cause why that whole module shouldn't
>> be ripped out and the code reverted to where it was in 8.4.  My
>> recollection is that the argument for adding it was to speed things up
>> in corner cases, but what I think it's actually going to do is slow
>> things down in every case.
>
> I've always been a bit suspicious of this code, too, even though I
> didn't think about the memory consumption issue.  But see here:
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00307.php

I did a bit of experimentation and confirmed my fears: HEAD is willing
to eat about double the specified maintenance_work_mem.  If you cut
back the setting so that its actual memory use is no more than 8.4's,
it's about 33% slower on non-pathological data (I'm testing the dataset
from Artur Dabrowski here).  The referenced tests showing significant
speedup over 8.4 were presumably all done without controlling for memory
usage.  I'm not sure how much those results need to be discounted, but
they can't be taken at face value.

> I think there may be a better way to avoid the pathological cases, but
> I'm not sure what it is.

After a bit of further study I think maybe something could be done
towards making rbtree.c less memory-hungry: it's basically been coded
with zero attention to memory efficiency.  The RBNode structs are
individually palloc'd, which means that on a 64-bit machine they take
about 80 bytes apiece.  The EntryAccumulator structs they are frontends
for are only 32 bytes apiece (plus the probably-pass-by-reference Datum
value, which we can't easily do anything with), and they are allocated in
groups to avoid palloc overhead.  EntryAccumulator did get 16 bytes
smaller since 8.4 as a result of removing the left/right pointers that
the rbtree structure is substituting for, but that doesn't make up
for this.
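
(To spell out where the 80-byte figure likely comes from --- this is
back-of-the-envelope, assuming aset.c's usual power-of-two rounding,
not something I've instrumented:)

	/*
	 * One individually palloc'd RBNode on a 64-bit machine, roughly:
	 *
	 *    left/right/parent pointers       24 bytes
	 *    color flag, with padding          8 bytes
	 *    void *data payload pointer        8 bytes
	 *                                     --------
	 *                                    ~40 bytes
	 *
	 * palloc rounds small requests up to the next power of two
	 * (40 -> 64) and prepends a chunk header of about 16 bytes, so
	 * each node costs on the order of 80 bytes --- well over double
	 * the 32-byte EntryAccumulator it points at.
	 */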

I'm tempted to suggest that making RBNode be a hidden struct containing
a pointer to somebody else's datum is fundamentally the wrong way to
go about things, because the extra void pointer is pure overhead,
and we aren't ever going to be using these things in a context where
memory usage isn't of concern.  If we refactored the API so that RBNode
was intended to be the first field of some larger struct, as is done in
dynahash tables for instance, we could eliminate the void pointer and
the palloc inefficiency.  The added storage compared to what 8.4 used
would be a parent link and the iteratorState/color fields, which would
end up costing us 16 more bytes per EntryAccumulator rather than 64.
Still not great but at least it's not a 2X penalty, and the memory
allocation would become the caller's problem not rbtree's, so the
problem of tracking usage would be no different from before.
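
Sketching the sort of thing I have in mind (illustrative only --- the
member names and payload fields here are invented for the example, not
a worked-out API; types are the usual ones from postgres.h):

	/* rbtree.c would expose just the node header ... */
	typedef struct RBNode
	{
		char		iteratorState;	/* traversal workspace */
		char		color;			/* red or black */
		struct RBNode *left;
		struct RBNode *right;
		struct RBNode *parent;
	} RBNode;

	/*
	 * ... and the caller embeds it as the *first* member of its own
	 * struct, the same way dynahash entries embed their hash key:
	 */
	typedef struct EntryAccumulator
	{
		RBNode		rbnode;		/* must be first */
		Datum		key;		/* illustrative payload fields */
		uint32		count;
	} EntryAccumulator;

	/*
	 * Because the node sits at offset zero, any RBNode pointer that
	 * rbtree.c hands back can simply be cast to the caller's type:
	 */
	static inline EntryAccumulator *
	accumFromNode(RBNode *node)
	{
		return (EntryAccumulator *) node;
	}

rbtree.c would call back to the caller to allocate each node, so the
caller both supplies and can account for the memory.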

regards, tom lane



Re: [HACKERS] rbtree code breaks GIN's adherence to maintenance_work_mem

2010-07-31 Thread Robert Haas
On Sat, Jul 31, 2010 at 12:02 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Sat, Jul 31, 2010 at 12:40 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> So, I would like somebody to show cause why that whole module shouldn't
>>> be ripped out and the code reverted to where it was in 8.4.  My
>>> recollection is that the argument for adding it was to speed things up
>>> in corner cases, but what I think it's actually going to do is slow
>>> things down in every case.
>
>> I've always been a bit suspicious of this code, too, even though I
>> didn't think about the memory consumption issue.  But see here:
>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg00307.php
>
> I did a bit of experimentation and confirmed my fears: HEAD is willing
> to eat about double the specified maintenance_work_mem.  If you cut
> back the setting so that its actual memory use is no more than 8.4's,
> it's about 33% slower on non-pathological data (I'm testing the dataset
> from Artur Dabrowski here).

That seems like a pretty serious regression.

> I'm tempted to suggest that making RBNode be a hidden struct containing
> a pointer to somebody else's datum is fundamentally the wrong way to
> go about things, because the extra void pointer is pure overhead,
> and we aren't ever going to be using these things in a context where
> memory usage isn't of concern.  If we refactored the API so that RBNode
> was intended to be the first field of some larger struct, as is done in
> dynahash tables for instance, we could eliminate the void pointer and
> the palloc inefficiency.  The added storage compared to what 8.4 used
> would be a parent link and the iteratorState/color fields, which would
> end up costing us 16 more bytes per EntryAccumulator rather than 64.
> Still not great but at least it's not a 2X penalty, and the memory
> allocation would become the caller's problem not rbtree's, so the
> problem of tracking usage would be no different from before.

Even if we do that, is it still going to be too much of a performance
regression overall?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] rbtree code breaks GIN's adherence to maintenance_work_mem

2010-07-31 Thread Tom Lane
Robert Haas <robertmh...@gmail.com> writes:
> On Sat, Jul 31, 2010 at 12:02 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I'm tempted to suggest that making RBNode be a hidden struct containing
>> a pointer to somebody else's datum is fundamentally the wrong way to
>> go about things, because the extra void pointer is pure overhead,
>> and we aren't ever going to be using these things in a context where
>> memory usage isn't of concern.  If we refactored the API so that RBNode
>> was intended to be the first field of some larger struct, as is done in
>> dynahash tables for instance, we could eliminate the void pointer and
>> the palloc inefficiency.
>
> Even if we do that, is it still going to be too much of a performance
> regression overall?

Looking back, EntryAccumulator was 56 bytes on 64-bit machines in 8.4
(it should have been smaller, but a poor choice of field ordering left
a lot of pad space).  Right now it's 32 bytes, and if we stick an RBNode
field in the front it'd be 64.  So that'd be a 14% penalty compared to
8.4, as opposed to the present situation which is a 100% penalty (32+80
bytes per entry).  On 32-bit machines the numbers are 32 bytes (8.4)
versus 20+40 (HEAD) versus 36 bytes (my proposal), so 12.5% penalty
versus 87.5%.  (All of these numbers should be discounted by whatever
space you want to assume the pass-by-reference key datum takes.)
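
Or, laying the same numbers out side by side (bytes per entry, penalty
relative to 8.4):

	                  64-bit                   32-bit

	8.4               56                       32
	HEAD              32+80 = 112  (+100%)     20+40 = 60  (+87.5%)
	proposed          64  (+14%)               36  (+12.5%)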

So it'd definitely be a lot better than now.  Maybe there'd be some
remaining performance gap for non-pathological cases, but I think we
would be willing to pay that in order to avoid bad behavior for the
pathological cases.  It's difficult to say for sure of course
without going to the trouble of coding and testing it.

regards, tom lane



Re: [HACKERS] rbtree code breaks GIN's adherence to maintenance_work_mem

2010-07-31 Thread Robert Haas
On Sat, Jul 31, 2010 at 12:32 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Sat, Jul 31, 2010 at 12:02 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> I'm tempted to suggest that making RBNode be a hidden struct containing
>>> a pointer to somebody else's datum is fundamentally the wrong way to
>>> go about things, because the extra void pointer is pure overhead,
>>> and we aren't ever going to be using these things in a context where
>>> memory usage isn't of concern.  If we refactored the API so that RBNode
>>> was intended to be the first field of some larger struct, as is done in
>>> dynahash tables for instance, we could eliminate the void pointer and
>>> the palloc inefficiency.
>
>> Even if we do that, is it still going to be too much of a performance
>> regression overall?
>
> Looking back, EntryAccumulator was 56 bytes on 64-bit machines in 8.4
> (it should have been smaller, but a poor choice of field ordering left
> a lot of pad space).  Right now it's 32 bytes, and if we stick an RBNode
> field in the front it'd be 64.  So that'd be a 14% penalty compared to
> 8.4, as opposed to the present situation which is a 100% penalty (32+80
> bytes per entry).  On 32-bit machines the numbers are 32 bytes (8.4)
> versus 20+40 (HEAD) versus 36 bytes (my proposal), so 12.5% penalty
> versus 87.5%.  (All of these numbers should be discounted by whatever
> space you want to assume the pass-by-reference key datum takes.)
>
> So it'd definitely be a lot better than now.  Maybe there'd be some
> remaining performance gap for non-pathological cases, but I think we
> would be willing to pay that in order to avoid bad behavior for the
> pathological cases.  It's difficult to say for sure of course
> without going to the trouble of coding and testing it.

Well, it sounds like a reasonable thing to try, then.  You going to
take a crack at it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] rbtree code breaks GIN's adherence to maintenance_work_mem

2010-07-31 Thread Tom Lane
Robert Haas <robertmh...@gmail.com> writes:
> On Sat, Jul 31, 2010 at 12:32 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> So it'd definitely be a lot better than now.  Maybe there'd be some
>> remaining performance gap for non-pathological cases, but I think we
>> would be willing to pay that in order to avoid bad behavior for the
>> pathological cases.  It's difficult to say for sure of course
>> without going to the trouble of coding and testing it.
>
> Well, it sounds like a reasonable thing to try, then.  You going to
> take a crack at it?

Yeah, I'll have at it.

regards, tom lane



Re: [HACKERS] rbtree code breaks GIN's adherence to maintenance_work_mem

2010-07-31 Thread Tom Lane
Robert Haas <robertmh...@gmail.com> writes:
> On Sat, Jul 31, 2010 at 12:32 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> So it'd definitely be a lot better than now.  Maybe there'd be some
>> remaining performance gap for non-pathological cases, but I think we
>> would be willing to pay that in order to avoid bad behavior for the
>> pathological cases.  It's difficult to say for sure of course
>> without going to the trouble of coding and testing it.
>
> Well, it sounds like a reasonable thing to try, then.  You going to
> take a crack at it?

This worked out pretty well, so I've applied the attached patch.
I observed the following timings running the index build from
Artur Dabrowski's example:

                                    8.4      9.0b4    HEAD

maintenance_work_mem setting        100MB    60MB     100MB
actual peak process image size      118M     126M     112M
elapsed time                        964s     1289s    937s

so the memory bloat problem is definitely fixed and the speed
seems satisfactory.

It might be worth commenting that after I'd rewritten the rbtree code,
I spent a while pulling my hair out because the code *still* blew past
the expected memory usage.  It turned out that there were some
significant leaks in ginEntryInsert and subsidiary routines, causing
memory usage to continue to grow even once we realized we'd hit
maintenance_work_mem and started to dump data to disk.  These leaks were
there before, but were masked in 8.4 because the old ginGetEntry code
incrementally pfreed itempointer-list storage as it walked the data
structure; this caused storage to be released faster than the leaks
would consume it.  The new code doesn't release any of that storage
until the MemoryContextReset after the dump loop, so any leakage in
the dump loop becomes a visible increase in the peak space consumption.
I wasn't able to remove all of the leakage --- there's some fairly ugly
coding associated with index page splits that leaks an indextuple or so
per split, and I'm not sure where the extra tuple could safely be
pfree'd.  That seems to be small enough to live with though; it's less
than 0.5% of the total memory usage in the example I'm working with.
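
To make the mechanism concrete, the build loop's shape is roughly this
(pseudocode paraphrasing gininsert.c's build callback --- names and
signatures are approximate, not the literal code):

	/* each heap tuple's keys get accumulated into the rbtree,
	 * with everything palloc'd in the temporary build context */
	ginInsertRecordBA(&buildstate->accum, ...);

	if (buildstate->accum.allocatedMemory >= maintenance_work_mem * 1024L)
	{
		/*
		 * Dump all accumulated entries into the index.  Nothing
		 * palloc'd inside this loop is released before the context
		 * reset below, so any leakage in ginEntryInsert and friends
		 * adds directly to peak memory use.  8.4's ginGetEntry
		 * pfree'd the itempointer lists incrementally as it walked
		 * the structure, releasing storage faster than the leaks
		 * consumed it, which hid the problem.
		 */
		while (/* more accumulated entries */)
			ginEntryInsert(...);

		MemoryContextReset(buildstate->tmpCtx);
		ginInitBA(&buildstate->accum);	/* start over empty */
	}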

I also cleaned up some other stuff that would've bit us eventually,
like unnecessary use of recursion in the rbtree iteration code.

regards, tom lane

Index: src/backend/access/gin/ginbtree.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/gin/ginbtree.c,v
retrieving revision 1.15
diff -c -r1.15 ginbtree.c
*** src/backend/access/gin/ginbtree.c	2 Jan 2010 16:57:33 -0000	1.15
--- src/backend/access/gin/ginbtree.c	1 Aug 2010 01:44:08 -0000
***************
*** 267,272 ****
--- 267,274 ----
  
  /*
   * Insert value (stored in GinBtree) to tree described by stack
+  *
+  * NB: the passed-in stack is freed, as though by freeGinBtreeStack.
   */
  void
  ginInsertValue(GinBtree btree, GinBtreeStack *stack)
***************
*** 308,317 ****
  				PageSetTLI(page, ThisTimeLineID);
  			}
  
! 			UnlockReleaseBuffer(stack->buffer);
  			END_CRIT_SECTION();
  
! 			freeGinBtreeStack(stack->parent);
  			return;
  		}
  		else
--- 310,320 ----
  				PageSetTLI(page, ThisTimeLineID);
  			}
  
! 			LockBuffer(stack->buffer, GIN_UNLOCK);
  			END_CRIT_SECTION();
  
! 			freeGinBtreeStack(stack);
! 
  			return;
  		}
  		else
***************
*** 325,331 ****
  			 */
  			newlpage = btree->splitPage(btree, stack->buffer, rbuffer, stack->off, &rdata);
  
- 
  			((ginxlogSplit *) (rdata->data))->rootBlkno = rootBlkno;
  
  			parent = stack->parent;
--- 328,333 ----
***************
*** 341,347 ****
  				((ginxlogSplit *) (rdata->data))->isRootSplit = TRUE;
  				((ginxlogSplit *) (rdata->data))->rrlink = InvalidBlockNumber;
  
- 
  				page = BufferGetPage(stack->buffer);
  				lpage = BufferGetPage(lbuffer);
  				rpage = BufferGetPage(rbuffer);
--- 343,348 ----
***************
*** 375,384 ****
  
  				UnlockReleaseBuffer(rbuffer);
  				UnlockReleaseBuffer(lbuffer);
! 				UnlockReleaseBuffer(stack->buffer);
! 
  				END_CRIT_SECTION();
  
  				return;
  			}
  			else
--- 376,386 ----
  
  				UnlockReleaseBuffer(rbuffer);
  				UnlockReleaseBuffer(lbuffer);
! 				LockBuffer(stack->buffer, GIN_UNLOCK);
  				END_CRIT_SECTION();
  
+ 				freeGinBtreeStack(stack);
+ 
  				return;
  			}
  			else
Index: src/backend/access/gin/ginbulk.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/gin/ginbulk.c,v
retrieving revision 1.19
diff -c -r1.19 ginbulk.c
*** src/backend/access/gin/ginbulk.c	26 Feb 2010 02:00:33 -0000	1.19
--- src/backend/access/gin/ginbulk.c	1 Aug 2010 01:44:08 -0000
***************
*** 19,35 ****
  #include "utils/memutils.h"
  
  
! #define DEF_NENTRY	2048
! #define DEF_NPTR	4
  
- static void *
- ginAppendData(void *old, void *new, void *arg)
- {
- 	EntryAccumulator *eo = (EntryAccumulator *) old,
- 			   *en = (EntryAccumulator *) new;

[HACKERS] rbtree code breaks GIN's adherence to maintenance_work_mem

2010-07-30 Thread Tom Lane
Another misbehavior that I noted while playing with Artur's example is
that, while GIN index build seems to adhere pretty well to whatever
maintenance_work_mem limit you give it in 8.4, it blows right by that
limit in 9.0 and HEAD --- I observed it happily eating something north
of 128MB with a maintenance_work_mem of 70MB.  It looks to me like the
reason is the new rbtree.c code, which adds a whole lot of data
structure that's not being counted against the maintenance_work_mem
limit.

Now the first question that might be asked is what we'd need to do to
rbtree.c's API to allow its memory consumption to be tracked.  However,
I think there's a considerably more pressing question, which is what
it is that rbtree.c is doing that justifies a 2X bloat factor in GIN
index build --- which is pretty much what it's costing us, given the
observation above.  We know that the amount of memory available for
the build has an enormous effect on build time.  If we just do the
obvious thing of including the rbtree data structure in the
maintenance_work_mem calculation, what we're going to get is a very
substantial slowdown in build speed for the same maintenance_work_mem
setting, compared to the way 8.4 worked.

So, I would like somebody to show cause why that whole module shouldn't
be ripped out and the code reverted to where it was in 8.4.  My
recollection is that the argument for adding it was to speed things up
in corner cases, but what I think it's actually going to do is slow
things down in every case.

regards, tom lane
