Re: [HACKERS] Performance issue in pg_dump's dependency loop searching

2014-07-29 Thread Simon Riggs
On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote:

 Another idea would be to

...persist the optimal dump order in the database.

That way we can maintain the correct dump order each time we do DDL,
which is only a small incremental cost, no matter how many objects we
have.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance issue in pg_dump's dependency loop searching

2014-07-29 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote:
 Another idea would be to

 ...persist the optimal dump order in the database.

 That way we can maintain the correct dump order each time we do DDL,
 which is only a small incremental cost, no matter how many objects we
 have.

I don't see any obvious way to make it incremental; so I doubt that
it would be a small extra cost.  In any case I disagree that making DDL
slower to make pg_dump faster is a good tradeoff.  Many people seldom
or never use pg_dump.

regards, tom lane




Re: [HACKERS] Performance issue in pg_dump's dependency loop searching

2014-07-29 Thread Claudio Freire
On Tue, Jul 29, 2014 at 3:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote:
 Another idea would be to

 ...persist the optimal dump order in the database.

 That way we can maintain the correct dump order each time we do DDL,
 which is only a small incremental cost, no matter how many objects we
 have.

 I don't see any obvious way to make it incremental; so I doubt that
 it would be a small extra cost.  In any case I disagree that making DDL
 slower to make pg_dump faster is a good tradeoff.  Many people seldom
 or never use pg_dump.

 regards, tom lane


Not to mention slowing down temp tables.




[HACKERS] Performance issue in pg_dump's dependency loop searching

2014-07-25 Thread Tom Lane
I looked into the performance issue Joe Van Dyk reported in bug #11033.
It turns out this is basically a consequence of the section boundary
pseudo-objects I added in commit a1ef01fe163b304760088e3e30eb22036910a495.
(I'd worried about performance consequences at the time, but hadn't
seen any actual evidence that there would be a problem.)

In Joe's test case, there are 8119 objects that need to be sorted into a
dumpable order.  (Only about 2700 of these are user objects that actually
end up in the output file; the rest are system objects like collations and
operators, which we include in the sorting pass in case there are
dependency paths leading through them.)  5976 of these are pre-data
objects, so we end up with 5976 dependencies for the pre-data boundary
object; and then there are 386 data objects, which depend on the pre-data
boundary object and which the post-data boundary object depends on; and
then 1755 post-data objects, which all have dependencies on the post-data
boundary object.

You can probably already see where this is going :-(.  For each post-data
object that's potentially involved in a dependency loop, the findLoop()
search in pg_dump_sort.c will need to recurse not only to the object's
ordinary dependencies (of which there are probably just a few), but to the
post-data boundary object.  From there it will need to recurse to each
data object.  And from each data object, it will recurse to the
pre-data boundary object.  And from there, it'll have to recurse to each
and every pre-data object.  Not to mention recursing to the actual
dependencies of each one.  Now the recursion is pretty quick, but still
this is a significant number of cycles.
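
To make that concrete, here's a self-contained sketch of the kind of
depth-first search involved (illustrative only -- the real findLoop()
in pg_dump_sort.c also handles processed[] bookkeeping and looks
objects up by DumpId):

#include <stdbool.h>

typedef struct Obj
{
    int         id;         /* analogous to DumpId */
    int         nDeps;
    struct Obj **deps;      /* outgoing dependency edges */
} Obj;

/* Is there a dependency path from obj back to the object whose id is
 * "start"?  path[] holds the objects on the current search path and
 * must be large enough to hold every object.  To test whether the
 * start object is in a loop, call this on each of its dependencies. */
static bool
reaches_naive(Obj *obj, int start, Obj **path, int depth)
{
    int         i;

    if (obj->id == start)
        return true;        /* found a loop back to the start point */

    /* reject objects already on the current path, else we'd recurse
     * forever around any cycle we wander into */
    for (i = 0; i < depth; i++)
        if (path[i] == obj)
            return false;

    path[depth] = obj;
    for (i = 0; i < obj->nDeps; i++)
    {
        /* nothing records that this subtree already failed to reach
         * "start", so every later arrival at a hub object such as a
         * section boundary repeats all of this work */
        if (reaches_naive(obj->deps[i], start, path, depth + 1))
            return true;
    }
    return false;
}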

But wait, it gets worse.

In a data-only dump, we have even more dependencies, because we sort all
these objects (even though we won't dump most of them), and we also add
dependencies representing foreign-key constraints, in hopes of arriving at
a dump order wherein the foreign-key constraints won't be violated during
the restore.  Joe's example has about 320 FK constraints, some of them in
fairly long chains (ie, table A references table B which references table
C which references table D etc).  What this means is that there are a
multitude of dependency paths from the post-data boundary to the pre-data
boundary, for instance from boundary to table A to boundary, or from
boundary to table A to table B to boundary, or from boundary to table A to
table B to table C to boundary, etc etc etc.  After chasing down *each* of
these paths, findLoop has to visit every one of the pre-data objects.
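
To put rough numbers on the path-counting (illustrative arithmetic,
not measured from Joe's database): a single chain of n FK-linked
tables yields n + (n-1) + ... + 1 = n(n+1)/2 distinct paths from the
post-data boundary back to the pre-data boundary, since the search can
enter the chain at any of the n tables and then drop to the boundary
after following any number of FK links.  And each of those arrivals at
the pre-data boundary triggers a fresh walk of all ~5976 pre-data
objects.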

So what's surprising is not so much that the sort takes 30 seconds as that
it finishes before the heat death of the universe.

I've found a fairly simple fix for this, which depends on the observation
that once we've failed to find a dependency path from object X back to our
starting object, there's no need to search again if we arrive at X via a
different dependency path.  This is noninvasive and obviously correct, and
it seems to completely eliminate the performance issue in Joe's test case.
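
In terms of the sketch above, the fix amounts to one extra array (the
attached patch calls its equivalent searchFailed[], indexed by dump
ID; this is again an illustration, not the patch itself):

/* failed[obj->id] == start records that no dependency path leads from
 * obj back to "start", so a later arrival at obj via a different path
 * can return immediately.  Initialize failed[] to a value that is
 * never a valid id (e.g. zero, assuming ids are positive). */
static bool
reaches_memoized(Obj *obj, int start, int *failed, Obj **path, int depth)
{
    int         i;

    if (obj->id == start)
        return true;        /* found a loop back to the start point */

    if (failed[obj->id] == start)
        return false;       /* already proven fruitless; skip subtree */

    /* objects still on the current path must NOT be marked failed,
     * since their subtrees haven't been fully explored yet */
    for (i = 0; i < depth; i++)
        if (path[i] == obj)
            return false;

    path[depth] = obj;
    for (i = 0; i < obj->nDeps; i++)
        if (reaches_memoized(obj->deps[i], start, failed, path, depth + 1))
            return true;

    /* no outgoing branch reaches "start"; remember the failure */
    failed[obj->id] = start;
    return false;
}

With that, each object's subtree is searched at most once per start
point, so the cost per start point is bounded by the number of
dependency edges rather than the number of distinct paths.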

There are other things that we might consider doing, like excluding system
objects from the dependency sort; but I'm less excited about that because
it's not entirely clear that there wouldn't be unexpected behavioral
consequences.  In any case, the more user objects are in the database, the
less it would matter.  We've definitely heard of databases with tens of
thousands of user tables for instance.  Another idea would be to somehow
be smarter about quickly eliminating objects that aren't actually
members of any dependency loop but merely depend on objects that are.
But that seems like a research project; I definitely don't see a cheap
way to get TopoSort to detect such cases.

So I propose applying the attached, and back-patching to 9.1, which is
as far back as the section-boundary objects are used.

regards, tom lane

diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 1b505a0..f0caa6b 100644
*** a/src/bin/pg_dump/pg_dump_sort.c
--- b/src/bin/pg_dump/pg_dump_sort.c
*** static void findDependencyLoops(Dumpable
*** 137,142 
--- 137,143 
  static int findLoop(DumpableObject *obj,
  		 DumpId startPoint,
  		 bool *processed,
+ 		 DumpId *searchFailed,
  		 DumpableObject **workspace,
  		 int depth);
  static void repairDependencyLoop(DumpableObject **loop,
*** static void
*** 633,653 
  findDependencyLoops(DumpableObject **objs, int nObjs, int totObjs)
  {
  	/*
! 	 * We use two data structures here.  One is a bool array processed[],
! 	 * which is indexed by dump ID and marks the objects already processed
! 	 * during this invocation of findDependencyLoops().  The other is a
! 	 * workspace[] array of DumpableObject pointers, in which we try to build
! 	 * lists of objects constituting loops.  We make workspace[] large enough
! 	 * to hold all the objects, which is huge