Re: [HACKERS] Allowing parallel pg_restore from pipe

2013-12-03 Thread Timothy Garnett
On Tue, Dec 3, 2013 at 12:14 PM, Bruce Momjian br...@momjian.us wrote:

 On Wed, Apr 24, 2013 at 03:33:42PM -0400, Andrew Dunstan wrote:
 
  On 04/23/2013 07:53 PM, Timothy Garnett wrote:
 ...
   Attached are two diffs off of the REL9_2_4 tag that I've been
   using.  The first is a simple change that serially loads the data
   section before handing off the remainder of the restore to the
   existing parallelized restore code (the .ALT. diff).  The second,
   which gets more parallelization but is a bigger change, uses
   the existing dependency analysis code to allow index building etc.
   to occur in parallel with data loading. The data loading tasks are
   still performed serially in the main thread, but non-data-loading
   tasks are scheduled in parallel as their dependencies are
   satisfied (with the caveat that the main thread can only dispatch
   new tasks between data loads).
 ...
 
 
  I don't think these are bad ideas at all, and probably worth doing.
  Note that there are some fairly hefty changes affecting this code in
  master, so your rebasing could be tricky.

 Is there any progress on this:  doing parallel pg_restore from a pipe?


We're on 9.2 and make extensive use of the patch in the original
post.  I will probably forward-port it to 9.3 when we migrate to that
(probably sometime in Q1) since we pretty much require it in our setup.

Tim


Re: [HACKERS] Allowing parallel pg_restore from pipe

2013-05-16 Thread Timothy Garnett
 If you need something like this short term, we actually found a way to do it
 ourselves for a migration we performed back in October. The secret is xargs
 with the -P option:

 xargs -I{} -P 8 -a table-list.txt \
 bash -c 'pg_dump -Fc -t {} my_db | pg_restore -h remote -d my_db'

 Fill table-list.txt with as many or as few tables as you want. The above
 example would give you 8 parallel threads; well-equipped systems may be
 able to increase this.

 Admittedly it's a gross hack, but it works. :)

I think you'd have to be really careful around foreign key constraints
for that to work.
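
One way around that might be to restore the schema (FKs included) up
front and then push data-only dumps through the same xargs trick with
triggers disabled; an untested sketch, and note that --disable-triggers
needs superuser on the target:

  pg_dump -s my_db | psql -h remote -d my_db
  xargs -I{} -P 8 -a table-list.txt \
    bash -c 'pg_dump -a -Fc -t {} my_db | pg_restore --disable-triggers -h remote -d my_db'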

Tim




Re: [HACKERS] Allowing parallel pg_restore from pipe

2013-04-25 Thread Timothy Garnett
As the OP, I'll just note that my organization would definitely find use
for a parallel migrator tool, as long as it supported doing a selection of
tables (i.e. -t / -T) in addition to the whole database, and it supported
(or we were able to patch in) an option to cluster as part of the migration
(the equivalent of something like
https://github.com/tgarnett/postgres/commit/cc320a71 ).

Tim


On Wed, Apr 24, 2013 at 5:47 PM, Joachim Wieland j...@mcknight.de wrote:

 On Wed, Apr 24, 2013 at 4:05 PM, Stefan Kaltenbrunner 
 ste...@kaltenbrunner.cc wrote:

  What might make sense is something like pg_dump_restore which would have
  no intermediate storage at all, just pump the data etc from one source
  to another in parallel. But I pity the poor guy who has to write it :-)

 Hmm, pretty sure that Joachim's initial patch for parallel dump actually
 had a PoC for something very similar to that...


 That's right, I implemented that as its own output format, which I named
 "migrator" I think; rather than writing each stream to a file as the
 directory output format does, it pumps the data back into a restore
 client.

 Actually, I think the logic was even reversed: it was a parallel restore
 that got the data from internally calling pg_dump functionality instead of
 from reading files... The neat thing about this approach was that the order
 was optimized and correct, i.e. largest tables start first and dependencies
 get resolved in the right order.

 I could revisit that patch for 9.4 if enough people are interested.

 Joachim



[HACKERS] Allowing parallel pg_restore from pipe

2013-04-24 Thread Timothy Garnett
Hi All,

Currently the -j option to pg_restore, which allows for parallelization in
the restore, can only be used if the input file is a regular file and not,
for example, a pipe.  However, this is a pretty common occurrence for us
(usually in the form of pg_dump | pg_restore to copy an individual database
or some tables thereof from one machine to another).  While there's no good
way to parallelize the data load steps when reading from a pipe, the index
and constraint building can still be parallelized, and as they are generally
CPU-bound on our machines we've found quite a bit of speedup from doing so.
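
For reference, the shape of the invocation in question is roughly this
(database names are just for illustration):

  pg_dump -Fc src_db | pg_restore -j 8 -d dst_db

Today pg_restore refuses -j when its input isn't a seekable file; with the
patches below, the data section is loaded serially off the pipe while the
index and constraint builds still fan out across the -j workers.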

Attached are two diffs off of the REL9_2_4 tag that I've been using.  The
first is a simple change that serially loads the data section before
handing off the remainder of the restore to the existing parallelized
restore code (the .ALT. diff).  The second, which gets more parallelization
but is a bigger change, uses the existing dependency analysis code to allow
index building etc. to occur in parallel with data loading. The data
loading tasks are still performed serially in the main thread, but
non-data-loading tasks are scheduled in parallel as their dependencies are
satisfied (with the caveat that the main thread can only dispatch new tasks
between data loads).

Anyway, the question is whether people think this is generally useful.  If
so, I can clean up the preferred choice a bit, rebase it onto master, etc.

Tim


0003-patch-pg_restore-to-allow-parallel-restore-when-the.ALT.patch
Description: Binary data


0003-patch-pg_restore-to-allow-parallel-restore-when-the-.patch
Description: Binary data



[HACKERS] Option for pg_dump to dump tables in clustered index order

2012-02-23 Thread Timothy Garnett
Hi All,

Having pg_dump dump tables in clustered index order is something we've
found we've needed a fair number of times (for example, when copying large
logging tables, or sets of tables, out of one database where the order is
not maintained and into another for running a bunch of backend analysis),
as it saves us the clustering step, which is often longer than the copy
step itself.

I wanted to gauge the interest in adding an option for this to pg_dump.  A
(not production-ready) patch that we've been using off of the 9.1.2 tag to
implement this is attached, or can be viewed here:
https://github.com/tgarnett/postgres/commit/d4412aa4047e7a0822ee93fa47a1c0d282cb7925
It adds a --cluster-order option to pg_dump.  If people have any
suggestions on better ways of pulling out the order clause, or other
improvements, that would be great too.
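
Roughly, the option turns the data step for a table with a clustered index
into an ordered COPY; table and column names below are made up:

  pg_dump --cluster-order -t my_log my_db
  # the data section then comes out via something like:
  #   COPY (SELECT * FROM public.my_log ORDER BY logged_at) TO stdout;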

Tim
From d4412aa4047e7a0822ee93fa47a1c0d282cb7925 Mon Sep 17 00:00:00 2001
From: Timothy Garnett tgarn...@panjiva.com
Date: Fri, 10 Feb 2012 16:21:32 -0500
Subject: [PATCH] Support for pg_dump to dump tables in cluster order if a
 clustered index is defined on the table; a little hacked in
 with how the data is passed around and how the order is
 pulled out of the db. The latter is the only
 semi-problematic part, as you might be able to generate
 (very odd) table column names that would break the regex
 used there, which would cause the SQL query to be invalid
 and therefore not dump data for that table.  But as long as
 you don't name a clustered column/function something like
 "foo ) WHERE" or the like, it should be OK.

---
 src/bin/pg_dump/pg_dump.c |   48 +---
 src/bin/pg_dump/pg_dump.h |1 +
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 57f2ed3..9ef9a71 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -134,6 +134,7 @@
 static int	binary_upgrade = 0;
 static int	disable_dollar_quoting = 0;
 static int	dump_inserts = 0;
+static int	cluster_order = 0;
 static int	column_inserts = 0;
 static int	no_security_labels = 0;
 static int	no_unlogged_table_data = 0;
@@ -319,6 +320,7 @@ static void check_sql_result(PGresult *res, PGconn *conn, const char *query,
 		 */
 		{"attribute-inserts", no_argument, &column_inserts, 1},
 		{"binary-upgrade", no_argument, &binary_upgrade, 1},
+		{"cluster-order", no_argument, &cluster_order, 1},
 		{"column-inserts", no_argument, &column_inserts, 1},
 		{"disable-dollar-quoting", no_argument, &disable_dollar_quoting, 1},
 		{"disable-triggers", no_argument, &disable_triggers, 1},
@@ -849,6 +851,7 @@ static void check_sql_result(PGresult *res, PGconn *conn, const char *query,
 	printf(_("  -T, --exclude-table=TABLE   do NOT dump the named table(s)\n"));
 	printf(_("  -x, --no-privileges         do not dump privileges (grant/revoke)\n"));
 	printf(_("  --binary-upgrade            for use by upgrade utilities only\n"));
+	printf(_("  --cluster-order             dump table data in clustered index order (>= 8.2)\n"));
 	printf(_("  --column-inserts            dump data as INSERT commands with column names\n"));
 	printf(_("  --disable-dollar-quoting    disable dollar quoting, use SQL standard quoting\n"));
 	printf(_("  --disable-triggers          disable triggers during data-only restore\n"));
@@ -1245,7 +1248,7 @@ static void check_sql_result(PGresult *res, PGconn *conn, const char *query,
 		 classname),
 		  column_list);
 	}
-	else if (tdinfo->filtercond)
+	else if (tdinfo->filtercond || tbinfo->ordercond)
 	{
 		/* Note: this syntax is only supported in 8.2 and up */
 		appendPQExpBufferStr(q, "COPY (SELECT ");
@@ -1257,10 +1260,14 @@ static void check_sql_result(PGresult *res, PGconn *conn, const char *query,
 		}
 		else
 			appendPQExpBufferStr(q, "* ");
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s ",
 		  fmtQualifiedId(tbinfo->dobj.namespace->dobj.name,
-		 classname),
-		  tdinfo->filtercond);
+		 classname));
+		if (tdinfo->filtercond)
+		  appendPQExpBuffer(q, "%s ", tdinfo->filtercond);
+		if (tbinfo->ordercond)
+		  appendPQExpBuffer(q, "%s", tbinfo->ordercond);
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -1388,6 +1395,8 @@ static void check_sql_result(PGresult *res, PGconn *conn, const char *query,
 	}
 	if (tdinfo->filtercond)
 		appendPQExpBuffer(q, " %s", tdinfo->filtercond);
+	if (tbinfo->ordercond)
+		appendPQExpBuffer(q, " %s", tbinfo->ordercond);
 
 	res = PQexec(g_conn, q->data);
 	check_sql_result(res, g_conn, q->data, PGRES_COMMAND_OK);
@@ -4400,6 +4409,7 @@ static void check_sql_result(PGresult *res, PGconn *conn, const char *query,
 i_oid,
 i_indexname,
 i_indexdef,
+i_indexdeforderclause,
 i_indnkeys,
 i_indkey,
 i_indisclustered,
@@ -4451,6 +4461,14 @@ static void check_sql_result(PGresult *res, PGconn *conn, const char *query,
 			  "SELECT t.tableoid, t.oid, "
 			  "t.relname AS indexname, "
 	 "pg_catalog.pg_get_indexdef