Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-10 Thread Robert Haas
On Tue, Feb 8, 2011 at 10:54 PM, Joachim Wieland j...@mcknight.de wrote:
 On Tue, Feb 8, 2011 at 8:31 PM, Itagaki Takahiro
 itagaki.takah...@gmail.com wrote:
 On Tue, Feb 8, 2011 at 13:34, Robert Haas robertmh...@gmail.com wrote:
 So how close are we to having a committable version of this?  Should
 we push this out to 9.2?

 I think so. The feature is pretty attractive, but more work is required:
  * Re-base on the synchronized snapshots patch
  * Consider using pipes on Windows as well.
  * Research libpq + fork() issue. We have a warning in docs:
 http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
 | On Unix, forking a process with open libpq connections can lead to
 unpredictable results

 Just for the record, once the sync snapshot patch is committed, there
 is no need to do fancy libpq + fork() combinations anyway.
 Unfortunately, so far no committer has commented on the synchronized
 snapshot patch at all.

 I am not fighting for getting parallel pg_dump done in 9.1, as I don't
 really have a personal use case for the patch. However it would be the
 irony of the year if we shipped 9.1 with a synchronized snapshot patch
 but no parallel dump  :-)

True.  But it looks like there are some outstanding items from
previous reviews that you've yet to address, which makes pushing it
out seem fairly reasonable...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-08 Thread Itagaki Takahiro
On Tue, Feb 8, 2011 at 13:34, Robert Haas robertmh...@gmail.com wrote:
 So how close are we to having a committable version of this?  Should
 we push this out to 9.2?

I think so. The feature is pretty attractive, but more work is required:
 * Re-base on the synchronized snapshots patch
 * Consider using pipes on Windows as well.
 * Research libpq + fork() issue. We have a warning in docs:
http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
| On Unix, forking a process with open libpq connections can lead to
unpredictable results

-- 
Itagaki Takahiro



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-08 Thread Joachim Wieland
On Tue, Feb 8, 2011 at 8:31 PM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 On Tue, Feb 8, 2011 at 13:34, Robert Haas robertmh...@gmail.com wrote:
 So how close are we to having a committable version of this?  Should
 we push this out to 9.2?

 I think so. The feature is pretty attractive, but more work is required:
  * Re-base on the synchronized snapshots patch
  * Consider using pipes on Windows as well.
  * Research libpq + fork() issue. We have a warning in docs:
 http://developer.postgresql.org/pgdocs/postgres/libpq-connect.html
 | On Unix, forking a process with open libpq connections can lead to
 unpredictable results

Just for the record, once the sync snapshot patch is committed, there
is no need to do fancy libpq + fork() combinations anyway.
Unfortunately, so far no committer has commented on the synchronized
snapshot patch at all.

I am not fighting for getting parallel pg_dump done in 9.1, as I don't
really have a personal use case for the patch. However it would be the
irony of the year if we shipped 9.1 with a synchronized snapshot patch
but no parallel dump  :-)


Joachim



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-07 Thread Joachim Wieland
Hi Jaime,

thanks for your review!

On Sun, Feb 6, 2011 at 2:12 PM, Jaime Casanova ja...@2ndquadrant.com wrote:
 code review:

 something I found, and a very simple one at that, is this warning (there's
 a similar issue in _StartMasterParallel with the buf variable)
 
 pg_backup_directory.c: In function ‘_EndMasterParallel’:
 pg_backup_directory.c:856: warning: ‘status’ may be used uninitialized
 in this function
 

Cool. My compiler didn't tell me about this.


 I guess the huge amount of info shows that the patch is just for
 debugging and will be removed before commit, right?

That's right.


 functional review:

 it works well most of the time; just a few points:
 - if I interrupt the process the connections stay; I guess it could
 catch the signal and close the connections

Hm, well, recovering gracefully from errors could be improved. In
your example you would signal the children implicitly, because the
parent process dies and the pipes to the children get broken as
well. Of course the parent could terminate the children more actively,
but it might not be the best option to just kill them, as then there
would be a lot of unexpected-EOF messages in the server log. So if an
error condition comes up in the parent (as in your example, because
you canceled the process), then ideally the parent should signal the
children with a non-lethal signal, and the children should catch this
"please terminate" signal and exit cleanly, but as soon as possible. If
the error case comes up in a child however, then we'd need to make
sure that the user sees the error message from that child. This should
work well as-is, but currently it could happen that the parent exits
before all of the children have exited. I'll investigate this a bit...


 - if I have an exclusive lock on a table and a worker starts dumping
 it, it fails because it can't take the lock, but it just says it was
 ok; I would prefer an error

I'm getting a clear

pg_dump: [Archivierer] could not lock table "public"."c": ERROR:  could
not obtain lock on relation "c"

but I'll look into this as well.

Regarding your other post:

 - there is no docs

True...

 - pg_dump and pg_restore are inconsistent:
  pg_dump requires the directory to be provided with the -f option:
 pg_dump -Fd -f dir_dump
  pg_restore passes the directory as an argument for -Fd: pg_restore -Fd dir_dump

Well, this inconsistency is there with pg_dump and pg_restore currently as
well. -F is the switch for the format, and it just takes "d" as the format;
dir_dump is an argument without any switch.

See the output for the --help switches:

Usage:
  pg_dump [OPTION]... [DBNAME]

Usage:
  pg_restore [OPTION]... [FILE]

So in either case you don't need to give a switch for what you have.
If you run pg_dump you don't give a switch for the database, but you
need to give one for the output (-f); with pg_restore you don't give
a switch for the file that you're restoring, but you'd need to give -d
to restore into a database.


Joachim



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-07 Thread Robert Haas
On Mon, Feb 7, 2011 at 10:42 PM, Joachim Wieland j...@mcknight.de wrote:
 I guess the huge amount of info shows that the patch is just for
 debugging and will be removed before commit, right?

 That's right.

So how close are we to having a committable version of this?  Should
we push this out to 9.2?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-06 Thread Jaime Casanova
On Tue, Feb 1, 2011 at 11:32 PM, Joachim Wieland j...@mcknight.de wrote:
 On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas robertmh...@gmail.com wrote:
 The parallel pg_dump portion of this patch (i.e. the still-uncommitted
 part) no longer applies.  Please rebase.

 Here is a rebased version with some minor changes as well. I haven't
 tested it on Windows yet but will do so as soon as the Unix part has
 been reviewed.


code review:

something I found, and a very simple one at that, is this warning (there's
a similar issue in _StartMasterParallel with the buf variable)

pg_backup_directory.c: In function ‘_EndMasterParallel’:
pg_backup_directory.c:856: warning: ‘status’ may be used uninitialized
in this function


I guess the huge amount of info shows that the patch is just for
debugging and will be removed before commit, right?

functional review:

it works well most of the time; just a few points:
- if I interrupt the process the connections stay; I guess it could
catch the signal and close the connections
- if I have an exclusive lock on a table and a worker starts dumping
it, it fails because it can't take the lock, but it just says it was
ok; I would prefer an error

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-06 Thread Jaime Casanova
On Sun, Feb 6, 2011 at 2:12 PM, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Tue, Feb 1, 2011 at 11:32 PM, Joachim Wieland j...@mcknight.de wrote:
 On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas robertmh...@gmail.com wrote:
 The parallel pg_dump portion of this patch (i.e. the still-uncommitted
 part) no longer applies.  Please rebase.

 Here is a rebased version with some minor changes as well. I haven't
 tested it on Windows yet but will do so as soon as the Unix part has
 been reviewed.


 code review:


ah! two other things i forget:

- there is no docs
- pg_dump and pg_restore are inconsistent:
  pg_dump requires the directory to be provided with the -f option:
pg_dump -Fd -f dir_dump
  pg_restore passes the directory as an argument for -Fd: pg_restore -Fd dir_dump

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-04 Thread Joachim Wieland
On Thu, Feb 3, 2011 at 11:46 PM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 I think we have 2 important technical issues here:
  * The consistency is not perfect. Each transaction is started
   with small delays in step 1, but we cannot guarantee that no other
   transaction happens between them.

This is exactly where the patch for synchronized snapshot comes into
the game. See https://commitfest.postgresql.org/action/patch_view?id=480


  * Can we inherit connections to child processes with fork()?
   Moreover, we also need to pass running transactions to children.
   I wonder whether libpq is designed for such usage.

As far as I know you can pass sockets on to a child process, as long
as you make sure that after the fork only one of the two, parent or
child, uses the socket; the other one should close it. But this
wouldn't be an issue with the above-mentioned patch anyway.


 It might be better to remove the Windows-specific code from the first try.
 I doubt the Windows message queue is the best API for such a console-based
 application. I hope we can use the same implementation on all
 platforms for inter-process/thread communication.

Windows doesn't support pipes, but it offers message queues to
exchange messages. Parallel pg_dump only exchanges messages in the
form of "DUMP 39209" or "RESTORE OK 48 23 93"; it doesn't exchange any
large chunks of binary data, just these small textual messages. The
messages also stay within the same process; they are just sent between
the different threads. The Windows part worked just fine when I tested
it last time. Do you have any other technology in mind that you think
is better suited?


Joachim



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-04 Thread Magnus Hagander
On Sat, Feb 5, 2011 at 04:50, Joachim Wieland j...@mcknight.de wrote:
 On Thu, Feb 3, 2011 at 11:46 PM, Itagaki Takahiro
 itagaki.takah...@gmail.com wrote:
 It might be better to remove the Windows-specific code from the first try.
 I doubt the Windows message queue is the best API for such a console-based
 application. I hope we can use the same implementation on all
 platforms for inter-process/thread communication.

 Windows doesn't support pipes, but it offers message queues to
 exchange messages. Parallel pg_dump only exchanges messages in the
 form of "DUMP 39209" or "RESTORE OK 48 23 93"; it doesn't exchange any
 large chunks of binary data, just these small textual messages. The
 messages also stay within the same process; they are just sent between
 the different threads. The Windows part worked just fine when I tested
 it last time. Do you have any other technology in mind that you think
 is better suited?

Haven't been following this thread in detail or read the code... But
our /port directory contains a pipe() implementation for Windows
that's used for the syslogger at least. Look in the code for pgpipe().
If using that one works, then that should probably be used rather than
something completely custom.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-03 Thread Itagaki Takahiro
On Wed, Feb 2, 2011 at 13:32, Joachim Wieland j...@mcknight.de wrote:
 Here is a rebased version with some minor changes as well.

As I read it, the patch works as below. Am I understanding correctly?
  1. Open all connections in a parent process.
  2. Start transactions for each connection in the parent.
  3. Spawn child processes with fork().
  4. Each child process uses one of the inherited connections.

I think we have 2 important technical issues here:
 * The consistency is not perfect. Each transaction is started
   with small delays in step 1, but we cannot guarantee that no other
   transaction happens between them.
 * Can we inherit connections to child processes with fork()?
   Moreover, we also need to pass running transactions to children.
   I wonder whether libpq is designed for such usage.

To solve both issues, we might want a way to control visibility
in the database server instead of in client programs. Don't we need
server-side support like [1] before developing parallel dump?
 [1] 
http://wiki.postgresql.org/wiki/ClusterFeatures#Export_snapshots_to_other_sessions

 I haven't
 tested it on Windows yet but will do so as soon as the Unix part has
 been reviewed.

It might be better to remove the Windows-specific code from the first try.
I doubt the Windows message queue is the best API for such a console-based
application. I hope we can use the same implementation on all
platforms for inter-process/thread communication.

-- 
Itagaki Takahiro



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-02-01 Thread Joachim Wieland
On Sun, Jan 30, 2011 at 5:26 PM, Robert Haas robertmh...@gmail.com wrote:
 The parallel pg_dump portion of this patch (i.e. the still-uncommitted
 part) no longer applies.  Please rebase.

Here is a rebased version with some minor changes as well. I haven't
tested it on Windows yet but will do so as soon as the Unix part has
been reviewed.


Joachim


parallel_pg_dump.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-30 Thread Robert Haas
On Wed, Jan 19, 2011 at 12:45 AM, Joachim Wieland j...@mcknight.de wrote:
 On Mon, Jan 17, 2011 at 5:38 PM, Jaime Casanova ja...@2ndquadrant.com wrote:
 Is this the latest version of this patch? If so, the commitfest app
 should be updated to reflect that

 Here are the latest patches, all of them rebased to current HEAD.
 Will update the commitfest app as well.

The parallel pg_dump portion of this patch (i.e. the still-uncommitted
part) no longer applies.  Please rebase.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-23 Thread Heikki Linnakangas

On 21.01.2011 19:11, Euler Taveira de Oliveira wrote:

On 21-01-2011 12:47, Andrew Dunstan wrote:

Maybe we could change the hint to say --file=DESTINATION or
--file=FILENAME|DIRNAME ?


... --file=OUTPUT or --file=OUTPUTNAME.


Ok, works for me.

I've committed this patch now, with a whole bunch of further fixes.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-21 Thread Heikki Linnakangas

On 20.01.2011 17:22, Heikki Linnakangas wrote:

(I'm working on this, no need to submit a new patch)


Ok, here's a heavily refactored version of this (also available at 
git://git.postgresql.org/git/users/heikki/postgres.git, branch 
pg_dump_directory). The directory format is now identical to the tar 
format, except that in the directory format the files can be compressed. 
Also we don't write the restore.sql file - it would be nice to have, but 
pg_restore doesn't require it. We can leave that as a TODO.


I ended up writing another compression abstraction layer in 
compress_io.c. It wraps fopen / gzopen etc. in a common API, so that the 
caller doesn't need to care if the file is compressed or not. In 
hindsight, the compression API we put in earlier didn't suit us very 
well. But I guess it wasn't a complete waste, as it moved the gory 
details of zlib out of the custom format code.


If compression is used, the files are created with the .gz suffix, and 
include the gzip header so that you can manipulate them easily with 
gzip/gunzip utilities. When reading, we accept files with or without the 
.gz suffix, and you can have some files compressed and others uncompressed.


I haven't updated the documentation yet.

There's one UI thing that bothers me. The option to specify the target 
directory is called --file. But it's clearly not a file. OTOH, I'd hate 
to introduce a parallel --dir option just for this. Any thoughts on this?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index de4968c..5266cc8 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -194,8 +194,11 @@ PostgreSQL documentation
   <term><option>--file=<replaceable class="parameter">file</replaceable></option></term>
   <listitem>
    <para>
-    Send output to the specified file.  If this is omitted, the
-    standard output is used.
+    Send output to the specified file. This parameter can be omitted for file
+    based output formats, in which case the standard output is used. It must
+    be given for the directory output format however, where it specifies the target
+    directory instead of a file. In this case the directory is created by
+    <command>pg_dump</command> and must not exist before.
    </para>
   </listitem>
  </varlistentry>
@@ -226,9 +229,24 @@ PostgreSQL documentation
   <para>
    Output a custom-format archive suitable for input into
    <application>pg_restore</application>.
-   This is the most flexible output format in that it allows manual
-   selection and reordering of archived items during restore.
-   This format is also compressed by default.
+   Together with the directory output format, this is the most flexible
+   output format in that it allows manual selection and reordering of
+   archived items during restore. This format is also compressed by
+   default.
+  </para>
+ </listitem>
+</varlistentry>
+
+<varlistentry>
+ <term><literal>d</></term>
+ <term><literal>directory</></term>
+ <listitem>
+  <para>
+   Output a directory-format archive suitable for input into
+   <application>pg_restore</application>. This will create a directory
+   instead of a file and this directory will contain one file for each
+   table and BLOB of the database that is being dumped. This format is
+   compressed by default.
   </para>
  </listitem>
 </varlistentry>
@@ -947,6 +965,14 @@ CREATE DATABASE foo WITH TEMPLATE template0;
   </para>
 
   <para>
+   To dump a database into a directory-format archive:
+
+<screen>
+<prompt>$</prompt> <userinput>pg_dump -Fd mydb -f dumpdir</userinput>
+</screen>
+  </para>
+
+  <para>
    To reload an archive file into a (freshly created) database named
    <literal>newdb</>:

diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index db607b4..8410af1 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -20,7 +20,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
 
 OBJS=	pg_backup_archiver.o pg_backup_db.o pg_backup_custom.o \
 	pg_backup_files.o pg_backup_null.o pg_backup_tar.o \
-	dumputils.o compress_io.o $(WIN32RES)
+	pg_backup_directory.o dumputils.o compress_io.o $(WIN32RES)
 
 KEYWRDOBJS = keywords.o kwlookup.o
 
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index 8c41a69..506533a 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -7,6 +7,17 @@
  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
+ * This file includes two APIs for dealing with compressed data. The first
+ * provides more flexibility, using callbacks to read/write data from the
+ * underlying stream. The second API is a wrapper 

Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-21 Thread Robert Haas
On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 There's one UI thing that bothers me. The option to specify the target
 directory is called --file. But it's clearly not a file. OTOH, I'd hate to
 introduce a parallel --dir option just for this. Any thoughts on this?

If we were starting over, I'd probably suggest calling the option -o,
--output.  But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

  -f, --file=FILENAME      output file (or directory) name
  -F, --format=c|t|p|d     output file format (custom, tar, text, dir)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-21 Thread Heikki Linnakangas

On 21.01.2011 15:35, Robert Haas wrote:

On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd hate to
introduce a parallel --dir option just for this. Any thoughts on this?


If we were starting over, I'd probably suggest calling the option -o,
--output.  But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

   -f, --file=FILENAME      output file (or directory) name
   -F, --format=c|t|p|d     output file format (custom, tar, text, dir)


Ok, that's exactly what the patch does now. I guess it's fine then.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-21 Thread Andrew Dunstan



On 01/21/2011 10:34 AM, Heikki Linnakangas wrote:

On 21.01.2011 15:35, Robert Haas wrote:

On Fri, Jan 21, 2011 at 4:41 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

There's one UI thing that bothers me. The option to specify the target
directory is called --file. But it's clearly not a file. OTOH, I'd
hate to introduce a parallel --dir option just for this. Any thoughts
on this?


If we were starting over, I'd probably suggest calling the option -o,
--output.  But since -o is already taken (for --oids) I'd be inclined
to just make the help text read:

   -f, --file=FILENAME      output file (or directory) name
   -F, --format=c|t|p|d     output file format (custom, tar, text, dir)


Ok, that's exactly what the patch does now. I guess it's fine then.



Maybe we could change the hint to say --file=DESTINATION or 
--file=FILENAME|DIRNAME ?


Just a thought.

cheers

andrew





Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-21 Thread Euler Taveira de Oliveira

On 21-01-2011 12:47, Andrew Dunstan wrote:

Maybe we could change the hint to say --file=DESTINATION or
--file=FILENAME|DIRNAME ?


... --file=OUTPUT or --file=OUTPUTNAME.


--
  Euler Taveira de Oliveira
  http://www.timbira.com/



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-20 Thread Heikki Linnakangas

On 19.01.2011 16:01, Joachim Wieland wrote:

On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Here are the latest patches, all of them rebased to current HEAD.
Will update the commitfest app as well.


What's the idea of storing the file sizes in the toc file? It looks like
they're not used for anything.


It's part of the overall idea to make sure files are not inadvertently
exchanged between different backups and that a file is not truncated.
In the future I'd also like to add a checksum to the TOC so that a
backup can be checked for integrity. This will cost performance but
with the parallel backup it can be distributed to several processors.


Ok. I'm going to leave out the filesize. I can see some value in that, 
and the CRC, but I don't want to add stuff that's not used at this point.



It would be nice to have this format match the tar format. At the moment,
there's a couple of cosmetic differences:

* TOC file is called TOC, instead of toc.dat

* blobs TOC file is called BLOBS.TOC instead of blobs.toc

* each blob is stored as blobs/oid.dat, instead of blob_oid.dat


That can be done easily...


The only significant difference is that in the directory archive format,
each data file has a header in the beginning.



What are the benefits of the data file header? Would it be better to leave
it out, so that the format would be identical to the tar format? You could
then just tar up the directory to get a tar archive, or vice versa.


The header is there to identify a file; it contains the same header
that every other pg_dump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.


Hmm, the tar format doesn't support compression, but it looks like the file
format issue has been thought of already: there's still code there to
add a .gz suffix for compressed files. How about adopting that convention
in the directory format too? That would make an uncompressed directory
format compatible with the tar format.


That seems pretty attractive anyway, because you can then dump to a 
directory, and manually gzip the data files later.


Now that we have an API for compression in compress_io.c, it probably
wouldn't be very hard to implement the missing compression support for
the tar format either.



If you want to drop the header altogether, fine with me but if it's
just for the tar-  directory conversion, then I am failing to see
what the use case of that would be.

A tar archive has the advantage that you can postprocess the dump data
with other tools  but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...


I don't know how popular it'll be in practice, but it seems very nice to 
me if you can do things like parallel pg_dump in directory format first, 
and then tar it up to a file for archival.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-20 Thread Joachim Wieland
On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 It's part of the overall idea to make sure files are not inadvertently
 exchanged between different backups and that a file is not truncated.
 In the future I'd also like to add a checksum to the TOC so that a
 backup can be checked for integrity. This will cost performance but
 with the parallel backup it can be distributed to several processors.

 Ok. I'm going to leave out the filesize. I can see some value in that, and
 the CRC, but I don't want to add stuff that's not used at this point.

Okay.

 The header is there to identify a file; it contains the same header
 that every other pg_dump file contains, including the internal version
 number and the unique backup id.

 The tar format doesn't support compression so going from one to the
 other would only work for an uncompressed archive and special care
 must be taken to get the order of the tar file right.

 Hmm, tar format doesn't support compression, but looks like the file format
 issue has been thought of already: there's still code there to add .gz
 suffix for compressed files. How about adopting that convention in the
 directory format too? That would make an uncompressed directory format
 compatible with the tar format.

So what you could do is dump in the tar format, untar it, and restore in
the directory format. I see that this sounds nice, but still I am not
sure why someone would dump to the tar format in the first place.

But you still cannot go back from the directory archive to the tar
archive because the standard command line tar will not respect the
order of the objects that pg_restore expects in a tar format, right?


 That seems pretty attractive anyway, because you can then dump to a
 directory, and manually gzip the data files later.

The command line gzip will probably add its own header to the file
that pg_restore would need to strip off...

This is a valid use case for people who are concerned with a fast
dump, usually they would dump uncompressed and later compress the
archive. However once we have parallel pg_dump, this advantage
vanishes.


 Now that we have an API for compression in compress_io.c, it probably
 wouldn't be very hard to implement the missing compression support to tar
 format either.

True, but the question of what the advantage of the tar format would be remains :-)


 A tar archive has the advantage that you can postprocess the dump data
with other tools, but for this we could also add an option that gives
 you only the data part of a dump file (and uncompresses it at the same
 time if compressed). Once we have that however, the question is what
 anybody would then still want to use the tar format for...

 I don't know how popular it'll be in practice, but it seems very nice to me
 if you can do things like parallel pg_dump in directory format first, and
 then tar it up to a file for archival.

Yes, but you cannot pg_restore the archive then if it was created with
standard tar, right?


Joachim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-20 Thread Heikki Linnakangas

On 20.01.2011 15:46, Joachim Wieland wrote:

On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

The header is there to identify a file, it contains the header that
every other pgdump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.


Hmm, tar format doesn't support compression, but looks like the file format
issue has been thought of already: there's still code there to add .gz
suffix for compressed files. How about adopting that convention in the
directory format too? That would make an uncompressed directory format
compatible with the tar format.


So what you could do is dump in the tar format, untar and restore in
the directory format. I see that this sounds nice but still I am not
sure why someone would dump to the tar format in the first place.


I'm not sure either. Maybe you want to pipe the output of pg_dump -F t 
via an ssh tunnel to another host, where you untar it, producing a 
directory format dump. You can then edit the directory format dump, and 
restore it back to the database without having to tar it again.


It gives you a lot of flexibility if the formats are compatible, which 
is generally good.



But you still cannot go back from the directory archive to the tar
archive because the standard command line tar will not respect the
order of the objects that pg_restore expects in a tar format, right?


Hmm, I didn't realize pg_restore requires the files to be in a certain 
order in the tar file. There's no mention of that in the docs either; we 
should add that. It doesn't actually require it if you read from a 
file, but from stdin it does.


You can put files in the archive in a certain order if you list them 
explicitly in the tar command line, like tar cf backup.tar toc.dat ... 
It's hard to know the right order, though. In practice you would 
need to do tar tf backup.tar > files before untarring, and use files 
to tar them again in the right order.



That seems pretty attractive anyway, because you can then dump to a
directory, and manually gzip the data files later.


The command line gzip will probably add its own header to the file
that pg_restore would need to strip off...


Yeah, we should write the header too. That's not hard, e.g. gzopen will 
do that automatically, or you can pass a flag to deflateInit2.



A tar archive has the advantage that you can postprocess the dump data
with other tools, but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...


I don't know how popular it'll be in practice, but it seems very nice to me
if you can do things like parallel pg_dump in directory format first, and
then tar it up to a file for archival.


Yes, but you cannot pg_restore the archive then if it was created with
standard tar, right?


See above, you can unless you try to pipe it to pg_restore. In fact, 
that's listed as an advantage of the tar format over other formats in 
the pg_dump documentation.


(I'm working on this, no need to submit a new patch)

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-20 Thread Florian Pflug
On Jan20, 2011, at 16:22 , Heikki Linnakangas wrote:
 You can put files in the archive in a certain order if you list them 
 explicitly in the tar command line, like tar cf backup.tar toc.dat ... 
 It's hard to know the right order, though. In practice you would need to do 
 tar tf backup.tar > files before untarring, and use files to tar them 
 again in the right order.

Hm, could we create a file in the backup directory which lists the files in the 
right order?
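Sketching that suggestion (the file name restore_order.txt is made up here; 
pg_dump writes no such file today, and GNU tar's -T/--files-from is assumed):

```shell
# Toy directory-format dump with a hypothetical ordering file.
mkdir -p dumpdir/blobs
printf 'toc'  > dumpdir/toc.dat
printf 'blob' > dumpdir/blobs/1.dat
printf 'toc.dat\nblobs/1.dat\n' > dumpdir/restore_order.txt

# -T archives members in the listed order, so the resulting tar comes
# out in the order pg_restore expects when reading from stdin.
(cd dumpdir && tar cf ../backup.tar -T restore_order.txt)
tar tf backup.tar
```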

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-19 Thread Heikki Linnakangas

On 19.01.2011 07:45, Joachim Wieland wrote:

On Mon, Jan 17, 2011 at 5:38 PM, Jaime Casanova ja...@2ndquadrant.com wrote:

This one is the last version of this patch? if so, commitfest app
should be updated to reflect that


Here are the latest patches all of them also rebased to current HEAD.
Will update the commitfest app as well.


What's the idea of storing the file sizes in the toc file? It looks like 
it's not used for anything.


It would be nice to have this format match the tar format. At the 
moment, there's a couple of cosmetic differences:


* TOC file is called TOC, instead of toc.dat

* blobs TOC file is called BLOBS.TOC instead of blobs.toc

* each blob is stored as blobs/oid.dat, instead of blob_oid.dat

The only significant difference is that in the directory archive format, 
each data file has a header in the beginning.


What are the benefits of the data file header? Would it be better to 
leave it out, so that the format would be identical to the tar format? 
You could then just tar up the directory to get a tar archive, or vice 
versa.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-19 Thread Joachim Wieland
On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Here are the latest patches all of them also rebased to current HEAD.
 Will update the commitfest app as well.

 What's the idea of storing the file sizes in the toc file? It looks like
 it's not used for anything.

It's part of the overall idea to make sure files are not inadvertently
exchanged between different backups and that a file is not truncated.
In the future I'd also like to add a checksum to the TOC so that a
backup can be checked for integrity. This will cost performance but
with the parallel backup it can be distributed to several processors.

 It would be nice to have this format match the tar format. At the moment,
 there's a couple of cosmetic differences:

 * TOC file is called TOC, instead of toc.dat

 * blobs TOC file is called BLOBS.TOC instead of blobs.toc

 * each blob is stored as blobs/oid.dat, instead of blob_oid.dat

That can be done easily...

 The only significant difference is that in the directory archive format,
 each data file has a header in the beginning.

 What are the benefits of the data file header? Would it be better to leave
 it out, so that the format would be identical to the tar format? You could
 then just tar up the directory to get a tar archive, or vice versa.

The header is there to identify a file, it contains the header that
every other pgdump file contains, including the internal version
number and the unique backup id.

The tar format doesn't support compression so going from one to the
other would only work for an uncompressed archive and special care
must be taken to get the order of the tar file right.

If you want to drop the header altogether, fine with me, but if it's
just for the tar-to-directory conversion, then I fail to see what the
use case of that would be.

A tar archive has the advantage that you can postprocess the dump data
with other tools, but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that however, the question is what
anybody would then still want to use the tar format for...


Joachim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump directory archive format / parallel pg_dump

2011-01-17 Thread Jaime Casanova
On Fri, Jan 7, 2011 at 3:18 PM, Joachim Wieland j...@mcknight.de wrote:
 Here's a new series of patches for the parallel dump/restore. They need to be
 applied on top of each other.


This one is the last version of this patch? if so, commitfest app
should be updated to reflect that

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers