Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Joshua D. Drake

Simon Riggs wrote:

On Sat, 2008-07-26 at 11:03 -0700, Joshua D. Drake wrote:


2. We have no concurrency which means, anyone with any database over 50G
has unacceptable restore times.


Agreed.



Sounds good.

Doesn't help with the main element of dump time: one table at a time to
one output file. We need a way to dump multiple tables concurrently,
ending in multiple files/filesystems.


Agreed, but that is a problem I understand with a solution I don't. I am
all ears for a way to fix that. One thought I had (and please, be gentle
in response) was some sort of async transaction capability. I know that
libpq has the ability to send async queries. Is it possible to do this:


send async(copy table to foo)
send async(copy table to bar)
send async(copy table to baz)

Where all three copies are happening in the background?
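[Editorial sketch: since a single libpq connection can carry only one query at a time, the usual way to get "three COPYs in the background" is one session per COPY, driven concurrently. The Python below is an illustrative stand-in, not libpq code; `copy_table_to_file` and `fetch_rows` are hypothetical names simulating per-session COPY TO.]

```python
# Simulate "copy table to file" running concurrently, one worker
# (i.e. one database session) per table.
from concurrent.futures import ThreadPoolExecutor

def copy_table_to_file(table, path, fetch_rows):
    # Stand-in for "COPY table TO path" executed on its own connection.
    rows = fetch_rows(table)
    with open(path, "w") as f:
        for row in rows:
            f.write("\t".join(map(str, row)) + "\n")
    return table, len(rows)

def dump_concurrently(tables, outdir, fetch_rows, workers=3):
    # All tables are dumped simultaneously; each future is one session.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(copy_table_to_file, t, f"{outdir}/{t}.copy", fetch_rows)
            for t in tables
        ]
        return dict(f.result() for f in futures)
```

The point of the sketch is only the shape: N independent sessions, not one multiplexed async connection.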

Sincerely,

Joshua D. Drake


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Simon Riggs

On Sat, 2008-07-26 at 13:56 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I want to dump tables separately for performance reasons. There are
  documented tests showing 100% gains using this method. There is no gain
  adding this to pg_restore. There is a gain to be had - parallelising
  index creation, but this patch doesn't provide parallelisation.
 
 Right, but the parallelization is going to happen sometime, and it is
 going to happen in the context of pg_restore. 

I honestly think there is less benefit that way than if we consider
things more as a whole:

To do data dump quickly we need to dump different tables to different
disks simultaneously. By its very nature, that cannot end with just a
single file. So the starting point for any restore must be potentially
more than one file.

There are two ways of dumping: either multi-thread pg_dump, or allow
multiple pg_dumps to work together. Second option much less work, same
result. (Either way we also need a way for multiple concurrent sessions
to share a snapshot.)

When restoring, we can then just use multiple pg_restore sessions to
restore the individual data files. Or again we can write a
multi-threaded pg_restore to do the same thing - why would I bother
doing that when I already can? It gains us nothing.

Parallelising the index creation seems best done using concurrent psql.
We've agreed on some mods to psql to put multi-session support in there. If we do
that right, then we can make pg_restore generate a psql script with
multi-session commands scattered appropriately throughout.

Parallel pg_restore is a lot of work for a narrow use case. Concurrent
psql provides a much wider set of use cases.

So fully parallelising dump/restore can be achieved by

* splitting dump into pieces (this patch)
* allowing sessions to share a common snapshot
* concurrent psql
* changes to pg_restore/psql/pg_dump to allow commands to be inserted
which will use concurrent psql features

If we do things this way then we have some useful tools that can be used
in a range of use cases, not just restore.
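[Editorial sketch of the "splitting dump into pieces" idea above: balance tables across N concurrent pg_dump jobs by size and emit one command line per job. All names here are hypothetical, and snapshot sharing between the sessions is assumed to be solved separately, as the list notes.]

```python
# Greedy partitioning of tables across N dump jobs, largest table first,
# always assigning to the currently lightest job.
def partition_tables(sizes, n_jobs):
    jobs = [{"tables": [], "bytes": 0} for _ in range(n_jobs)]
    for table, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        target = min(jobs, key=lambda j: j["bytes"])
        target["tables"].append(table)
        target["bytes"] += size
    return jobs

def dump_commands(sizes, n_jobs, dbname):
    # One pg_dump invocation per job; these would be run concurrently.
    cmds = []
    for i, job in enumerate(partition_tables(sizes, n_jobs)):
        flags = " ".join(f"--table={t}" for t in job["tables"])
        cmds.append(f"pg_dump --data-only {flags} -f part{i}.sql {dbname}")
    return cmds
```

Running the emitted commands in parallel corresponds to the "multiple pg_dumps working together" option, the lower-effort of the two dump strategies described above.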

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Simon Riggs

On Sat, 2008-07-26 at 11:03 -0700, Joshua D. Drake wrote:

 2. We have no concurrency which means, anyone with any database over 50G
 has unacceptable restore times.

Agreed.

Also the core reason for wanting -w

 3. We have to continue develop hacks to define custom utilization. Why
 am I passing pre-data anything? It should be automatic. For example:
 
 pg_backup (not dump, we aren't dumping. Dumping is usually associated
 with some sort of crash or foul human behavior. We are backing up).
pg_backup -U user -D database -F -f mybackup.sqlc
 
 If I were to extract mybackup.sqlc I would get:
 
   mybackup.datatypes
   mybackup.tables
   mybackup.data
   mybackup.primary_keys
   mybackup.indexes
   mybackup.constraints
   mybackup.grants

Sounds good.

Doesn't help with the main element of dump time: one table at a time to
one output file. We need a way to dump multiple tables concurrently,
ending in multiple files/filesystems.

 Oh and pg_dumpall? It should have been removed right around the release
 of 7.2, pg_dump -A please.

Good idea

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Andrew Dunstan



Joshua D. Drake wrote:


Agreed, but that is a problem I understand with a solution I don't. I am
all ears for a way to fix that. One thought I had (and please, be gentle
in response) was some sort of async transaction capability. I know that
libpq has the ability to send async queries. Is it possible to do this:


send async(copy table to foo)
send async(copy table to bar)
send async(copy table to baz)

Where all three copies are happening in the background?




IIRC, libpq doesn't let you have more than one async query active at one 
time.


cheers

andrew



Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Joshua D. Drake

Andrew Dunstan wrote:



Joshua D. Drake wrote:


Agreed, but that is a problem I understand with a solution I don't. I am
all ears for a way to fix that. One thought I had (and please, be gentle
in response) was some sort of async transaction capability. I know that
libpq has the ability to send async queries. Is it possible to do this:


send async(copy table to foo)
send async(copy table to bar)
send async(copy table to baz)

Where all three copies are happening in the background?




IIRC, libpq doesn't let you have more than one async query active at one 
time.


Now that I think on it harder, this isn't even a libpq problem (although 
it's involved); we need the postmaster to be able to do a background 
async query. Which is (I am guessing) why libpq can only do one at a time.


Sincerely,

Joshua D. Drake





Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
 Stephen Frost [EMAIL PROTECTED] writes:
  I dislike, and doubt that I'd use, this approach.  At the end of the
  day, it ends up processing the same (very large amount of data) multiple
  times.
 
 Well, that's easily avoided: just replace the third step by restoring
 directly to the target database.
 
 pg_restore --schema-before-data whole.dump >before.sql
 edit before.sql
 pg_restore --schema-after-data whole.dump >after.sql
 edit after.sql
 psql -f before.sql target_db
 pg_restore --data-only -d target_db whole.dump
 psql -f after.sql target_db

This would depend on the dump being in the custom format, though I
suppose that ends up being true for any usage of these options.  I've
never really been a fan of the custom format, in large part because it
doesn't really buy you all that much and makes changing things more
difficult (by having to extract out what you want to change, and then
omit it from the restore).

I can see some advantage to having the entire dump contained in a single
file and still being able to pull out pieces based on before/after.
Should we get a binary format which is much faster, I could see myself
being more likely to use pg_restore.  Same for parallelization or, in my
fantasies, the ability to copy schema, tables, indexes, etc, in 'raw' PG
format between servers.  Worse than having to vi an insanely large file,
or split it up to be able to modify the pieces you want, is having to
rebuild indexes, especially GIST ones.  That's another topic though.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
 Right, but the parallelization is going to happen sometime, and it is
 going to happen in the context of pg_restore.  So I think it's pretty
 silly to argue that no one will ever want this feature to work in
 pg_restore.

I think you've about convinced me on this, and it annoys me. ;)  Worse
is that it sounds like this might cause the options to not make it in
for 8.4, which would be quite frustrating.

 To extend the example I just gave to Stephen, I think a fairly probable
 scenario is where you only need to tweak some before object
 definitions, and then you could do
 
 pg_restore --schema-before-data whole.dump >before.sql
 edit before.sql
 psql -f before.sql target_db
 pg_restore --data-only --schema-after-data -d target_db whole.dump
 
 which (given a parallelizing pg_restore) would do all the time-consuming
 steps in a fully parallelized fashion.

Alright, this has been mulling around in the back of my head a bit and
has now finally surfaced: I like having the whole dump contained in a
single file, but I hate when the dump file ends up out-dated, or wrong,
or not what was actually loaded.  It doesn't seem likely to be possible,
but it'd be neat to be able to modify objects in the dump file.

Also, something which often happens to me is that I need to change the
search_path or the role at the top of a .sql from pg_dump before
restoring it.  Seems like using the custom format would make that
difficult without some pipe/cat/sed magic.  Parallelization would make
using that kind of magic more difficult too, I would guess.  Might be
something to think about.

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Joshua D. Drake

Stephen Frost wrote:

* Tom Lane ([EMAIL PROTECTED]) wrote:

Stephen Frost [EMAIL PROTECTED] writes:

I dislike, and doubt that I'd use, this approach.  At the end of the
day, it ends up processing the same (very large amount of data) multiple
times.



This would depend on the dump being in the custom format, though I
suppose that ends up being true for any usage of these options.  I've
never really been a fan of the custom format, in large part because it
doesn't really buy you all that much and makes changing things more
difficult (by having to extract out what you want to change, and then
omit it from the restore).


Custom format rocks for partial set restores from a whole dump. See the 
TOC option :)


Joshua D. Drake



Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Stephen Frost
* Joshua D. Drake ([EMAIL PROTECTED]) wrote:
 Custom format rocks for partial set restores from a whole dump. See the  
 TOC option :)

I imagine it does, but that's very rarely what I need.  Most of the time
we're dumping out a schema to load it into a separate schema (usually on
another host).  Sometimes that can be done by simply vi'ing the file to
change the search_path and whatnot, though more often we end up pipe'ing
the whole thing through sed.  Since we don't allow regular users to do
much, and you have to 'set role postgres;' to do anything as superuser,
we also often end up adding 'set role postgres;' to the top of the .sql
files.
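[Editorial sketch of the pipe/cat/sed magic described above: stream a plain-SQL dump, rewriting the search_path and prepending a role switch. The schema and role names in the test are examples only, not anything prescribed by the thread.]

```python
import re

def retarget_dump(lines, new_schema, role=None):
    # Optionally prepend "set role ...;", then rewrite any
    # "SET search_path = ...;" line to point at the new schema.
    out = [f"set role {role};\n"] if role else []
    for line in lines:
        out.append(re.sub(r"^SET search_path = .*;",
                          f"SET search_path = {new_schema};", line))
    return out
```

A plain-text dump makes this trivial; the custom format would force the extract-edit-reassemble dance the message complains about.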

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-26 Thread Simon Riggs

On Fri, 2008-07-25 at 19:16 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  [ pg_dump_beforeafter.v6.patch ]

 Unfortunately there's still a lot of work to do, and I don't feel like
 doing it so I'm bouncing this patch back for further work.

Fair enough. Thanks for the review.

 The key problem is that pg_restore is broken: it emits nearly the same
 output for --schema-before-data and --schema-after-data, because it
 doesn't have any way to distinguish which objects in a full dump file
 belong where.  This is because the filtering logic was put in the wrong
place, namely in the ArchiveEntry creation routines in pg_dump.c, whereas
it really needs to happen while scanning the TocEntry list in
 RestoreArchive().  (Note: it is perhaps worth keeping the pg_dump.c
 filters so as to avoid doing extra server queries for objects that we
 aren't going to dump anyway, but the core logic has to be in
 RestoreArchive.)

My feeling is that this would take the patch off-track. 

The key capability here is being able to split the dump into multiple
pieces. The equivalent capability on restore is *not* required, because
once the dump has been split the restore never needs to be. It might
seem that the patch should be symmetrical with respect to pg_dump and
pg_restore, but I see no use case for the pg_restore case.

The title of this email confirms that as original intention.

 I looked over this patch a bit.  I have a proposal for a slightly
 different way of defining the new switches:
 
 * --schema-before-data, --data-only, and --schema-after-data can be
 specified in any combination to obtain any subset of the full dump.
 If none are specified (which would in itself be a useless combination)
 then the default is to dump all three sections, just as if all three
 were specified.
 
 * --schema-only is defined as equivalent to specifying both
 --schema-before-data and --schema-after-data.
 
 The patch as submitted enforces what seem largely arbitrary restrictions
 on combining these switches.  It made some sense before to treat
 specifying both --schema-only and --data-only as an error, but it's not
 clear to me why you shouldn't be able to write both --schema-before-data
 and --schema-after-data, especially when there's a switch right beside
 them that appears to be equivalent to that combination.  So let's just
 allow all the combinations.

I had it both ways at various points in development. I'm happy with what
you propose.

 The attached updated patch implements and documents this behavior,
 and gets rid of the special linkage between --disable-triggers and
 --data-only as previously discussed.

OK

 Another issue is that the rules for deciding which objects are before
 data and which are after data are wrong.  In particular ACLs are after
 data not before data, which is relatively easy to fix.  

OK

 Not so easy to fix
 is that COMMENTs might be either before or after data depending on what
 kind of object they are attached to.

Is there anything to fix? Comments are added by calls to dumpComment,
which are always made in conjunction with the dump of an object. So if
you dump the object you dump the comment. As long as objects are
correctly split out then comments will be also.

 (BTW, what about BLOB COMMENTS?  They definitely can't be before data.
 ISTM you could make a case for them being after data, if you think that
 comments are always schema.  But there is also a case for considering
 them as data, because the objects they are attached to are data.  I kind
 of like the latter approach because it would create an invariant that
 comments appear in the same dump section as the object commented on.
 Thoughts?)

Yes, data. I'll look at this.

 Implementing the filtering by examining the type of a TocEntry in 
 RestoreArchive is a bit of a PITA, but it's probably possible.  The
 main bad thing about that is the need to make an explicit list of every
 type of TocEntry that exists now or ever has been emitted by any past
 version of pg_dump.  The design concept was that the type tags are
 mainly documentation, and while we've had to bend that in places (mostly
 for backward-compatibility reasons) this would be the first place we'd
 have to throw it overboard completely.
 
 And there's yet another issue here, which is that it's not entirely clear
 that the type of an object uniquely determines whether it's before or
 after data.  This might be an emergent property of the object sorting
 rules, but there is certainly not anything positively guaranteeing that
 the dependency-driven topological sort will produce such a result, and
 especially not that that will always be true in the future.  So the
 approach seems a bit fragile.

Don't understand that. Objects are sorted in well-defined order,
specified in pg_dump_sort.c. Essentially we are saying that (according
to current numbering)

--schema-before-data priority 1-8
--data-only  priority 9-11
--schema-after-data  priority 12+
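[Editorial sketch restating the mapping just given. The priority numbers are those in pg_dump_sort.c at the time of the thread and are illustrative, not a stable interface.]

```python
# Map a pg_dump_sort.c priority number to its dump section.
def section_for(priority):
    if priority <= 8:
        return "before-data"
    elif priority <= 11:
        return "data"
    return "after-data"
```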

So the 

Re: [PATCHES] pg_dump additional options for performance

2008-07-26 Thread Stephen Frost
* Simon Riggs ([EMAIL PROTECTED]) wrote:
 The key capability here is being able to split the dump into multiple
 pieces. The equivalent capability on restore is *not* required, because
 once the dump has been split the restore never needs to be. It might
 seem that the patch should be symmetrical with respect to pg_dump and
 pg_restore, but I see no use case for the pg_restore case.

I'm inclined to agree with this.  It might have been nice to provide a
way to split out already-created dumps, but I suspect that people who
need that probably have already figured out a way to do it (I know I
have..).  We should probably ensure that pg_restore doesn't *break* when
fed a partial dump.

  The patch as submitted enforces what seem largely arbitrary restrictions
  on combining these switches.
 
 I had it both ways at various points in development. I'm happy with what
 you propose.

I agree with removing the restrictions.  I don't see much in the way of
use cases, but it reduces code and doesn't cause problems.

  Another issue is that the rules for deciding which objects are before
  data and which are after data are wrong.  In particular ACLs are after
  data not before data, which is relatively easy to fix.  
 
 OK

This was partially why I was complaining about having documentation, and
a policy for that matter, which goes into more detail about why X is before
or after the data.  I agree that they're after today, but I don't see
any particular reason why they should be one or the other.  If we
adopted a policy like I proposed (--schema-post-data is essentially that
which uses the data and is faster done in bulk) then ACLs would be
before, and I tend to feel like it makes more sense that way.

  Not so easy to fix
  is that COMMENTs might be either before or after data depending on what
  kind of object they are attached to.
 
 Is there anything to fix? Comments are added by calls to dumpComment,
 which are always made in conjunction with the dump of an object. So if
 you dump the object you dump the comment. As long as objects are
 correctly split out then comments will be also.

I agree with this, and it follows for BLOB comments- in any case, they
go with the object being dumped at the time of that object getting
dumped.  Comments make sense as an extension of the object, not as a
separate set of objects to be explicitly placed before or after the
data.

 All of the above makes me certain I want to remove these options from
 pg_restore.

I'm in agreement with this.

  BTW, another incomplete item is that pg_dumpall should probably be taught
  to accept and pass down --schema-before-data and --schema-after-data
  switches.
 
 OK

I could go either way on this.

 Can we prune down to the base use case to avoid this overhead? i.e. have
 these options on pg_dump only?

Makes sense to me.

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-26 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Fri, 2008-07-25 at 19:16 -0400, Tom Lane wrote:
 The key problem is that pg_restore is broken:

 The key capability here is being able to split the dump into multiple
 pieces. The equivalent capability on restore is *not* required, because
 once the dump has been split the restore never needs to be.

This argument is nonsense.  The typical usage of this capability, IMHO,
will be

pg_dump -Fc >whole.dump
pg_restore --schema-before-data whole.dump >before.sql
pg_restore --data-only whole.dump >data.sql
pg_restore --schema-after-data whole.dump >after.sql

followed by editing the schema pieces and then loading.  One reason
is that this gives you a consistent dump, whereas three successive
pg_dump runs could never guarantee any such thing.  Another reason
is that you may well not know when you prepare the dump that you
will need split output, because the requirement to edit the dump
is likely to be realized only when you go to load it.

In any case, why did you put the switches into pg_restore.c if you
thought it wasn't useful for pg_restore to handle them?

 Not so easy to fix
 is that COMMENTs might be either before or after data depending on what
 kind of object they are attached to.

 Is there anything to fix?

Well, yeah.  If you attach a comment to an after-data object and test
--schema-after-data, you'll notice the comment is lost.

 And there's yet another issue here, which is that it's not entirely clear
 that the type of an object uniquely determines whether it's before or
 after data.

 Don't understand that. Objects are sorted in well-defined order,
 specified in pg_dump_sort.

After which we do a topological sort that enforces dependency ordering.
The question to worry about is whether there can ever be a dependency
from a normally-before object to a normally-after object, which
would cause the dependency sort to move the latter in front of the
former (in one way or another).  I'm not saying that any such case can
occur today, but I don't think it's an impossibility for it to arise in
future.  I don't want this relatively minor feature to be putting limits
on what kinds of dependencies the system can have.

 I'm conscious that the major work proposed will take weeks to complete

I don't think that what I am proposing is that complicated; I would
anticipate it requiring somewhere on the order of two dozen lines of
code.  I was thinking of doing a preliminary loop through the TocEntry
list to identify the ordinal numbers of the first and last data items,
and then the main loop could compare a counter to those numbers to
decide which of the three sections it was in.  Plus you'd need another
ArchiveEntry call someplace to prepare the dummy data item if one was
needed.
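[Editorial sketch of the two-pass idea described above, with a TocEntry simplified to a (name, is_data) pair: one preliminary loop finds the ordinal positions of the first and last data items, then the main loop classifies each entry by comparing its position against those bounds.]

```python
def classify_toc(entries):
    # Pass 1: find the ordinal bounds of the data items.
    data_positions = [i for i, (_, is_data) in enumerate(entries) if is_data]
    if not data_positions:
        # No data items at all: everything counts as before-data.
        # (The real proposal inserts a dummy data item in this case.)
        return [(name, "before-data") for name, _ in entries]
    first, last = data_positions[0], data_positions[-1]
    # Pass 2: classify by position relative to the data bounds.
    out = []
    for i, (name, _) in enumerate(entries):
        if i < first:
            section = "before-data"
        elif i <= last:
            section = "data"
        else:
            section = "after-data"
        out.append((name, section))
    return out
```

This matches the "couple dozen lines" estimate: no per-type tag list is needed, only positions relative to the data items.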

regards, tom lane



Re: [PATCHES] pg_dump additional options for performance

2008-07-26 Thread Tom Lane
Stephen Frost [EMAIL PROTECTED] writes:
 Another issue is that the rules for deciding which objects are before
 data and which are after data are wrong.  In particular ACLs are after
 data not before data, which is relatively easy to fix.  
 
 OK

 This was partially why I was complaining about having documentation, and
 a policy for that matter, which goes into more detail about why X is before
 or after the data.  I agree that they're after today, but I don't see
 any particular reason why they should be one or the other.

If a table's ACL revokes owner insert privilege, and was placed before
the data load steps, those steps would fail.  We are relying on the
default table privileges until we are done doing everything we need to
do to the tables (and perhaps other objects, I'm not sure if there are
any other comparable problems).

regards, tom lane



Re: [PATCHES] pg_dump additional options for performance

2008-07-26 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Fri, 2008-07-25 at 19:16 -0400, Tom Lane wrote:
  The key problem is that pg_restore is broken:
 
  The key capability here is being able to split the dump into multiple
  pieces. The equivalent capability on restore is *not* required, because
  once the dump has been split the restore never needs to be.
 
 This argument is nonsense.  The typical usage of this capability, IMHO,
 will be
 
  pg_dump -Fc >whole.dump
  pg_restore --schema-before-data whole.dump >before.sql
  pg_restore --data-only whole.dump >data.sql
  pg_restore --schema-after-data whole.dump >after.sql
 
 followed by editing the schema pieces and then loading.

I dislike, and doubt that I'd use, this approach.  At the end of the
day, it ends up processing the same (very large amount of data) multiple
times.  We have 60G dump files sometimes, and there's no way I'm going
to dump that into a single file first if I can avoid it.  What we end up
doing today is --schema-only followed by vi'ing it and splitting it up
by hand, etc., then doing a separate --data-only dump.

 One reason
 is that this gives you a consistent dump, whereas three successive
 pg_dump runs could never guarantee any such thing.  

While this is technically true, in most cases people have control over
the schema bits and would likely be able to ensure that the schema
doesn't change during the time.  At that point it's only the data, which
is still done in a transactional way.

 Another reason is that you may well not know when you prepare the
 dump that you will need split output, because the requirement to edit
 the dump is likely to be realized only when you go to load it.

This is a good point.  My gut reaction is that, at least in my usage, it
would be more about if it's larger than a gig, I might as well split it
out, just in case I need to touch something.  Honestly, it's rare that
I don't have to make *some* change.  Often that's the point of dumping
it out.

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-26 Thread Simon Riggs

On Sat, 2008-07-26 at 12:20 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Fri, 2008-07-25 at 19:16 -0400, Tom Lane wrote:
  The key problem is that pg_restore is broken:
 
  The key capability here is being able to split the dump into multiple
  pieces. The equivalent capability on restore is *not* required, because
  once the dump has been split the restore never needs to be.
 
 This argument is nonsense.  
 The typical usage of this capability, IMHO, will be

Arghh! That's not my stated use case!?#*!

I want to dump tables separately for performance reasons. There are
documented tests showing 100% gains using this method. There is no gain
adding this to pg_restore. There is a gain to be had - parallelising
index creation, but this patch doesn't provide parallelisation.

Anyway, clearly time for me to stop and have a break. 

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-26 Thread daveg
On Sat, Jul 26, 2008 at 01:56:14PM -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I want to dump tables separately for performance reasons. There are
  documented tests showing 100% gains using this method. There is no gain
  adding this to pg_restore. There is a gain to be had - parallelising
  index creation, but this patch doesn't provide parallelisation.
 
 Right, but the parallelization is going to happen sometime, and it is
 going to happen in the context of pg_restore.  So I think it's pretty
 silly to argue that no one will ever want this feature to work in
 pg_restore.
 
 To extend the example I just gave to Stephen, I think a fairly probable
 scenario is where you only need to tweak some before object
 definitions, and then you could do
 
 pg_restore --schema-before-data whole.dump >before.sql
 edit before.sql
 psql -f before.sql target_db
 pg_restore --data-only --schema-after-data -d target_db whole.dump
 
 which (given a parallelizing pg_restore) would do all the time-consuming
 steps in a fully parallelized fashion.

A few thoughts about pg_restore performance:

To take advantage of non-logged copy, the create and load should be in
the same transaction.

To take advantage of file and buffer cache, it would be good to build
indexes immediately after loading table data. Many tables will be small
enough to fit in cache, and this will avoid re-reading them for index
builds. This effect becomes stronger with more indexes on one table.
There may also be some filesystem placement benefit to building the
indexes for a table immediately after loading the data.

The buffer and file cache advantage also applies to constraint creation,
but this is complicated by the need for indexes and data in the referenced
tables.

It seems that a high-performance restore will want to proceed in a
different order than the current sort order, or than the order proposed
by the before/data/after patch.

 - The simplest unit of work for parallelism may be the table and its
   decorations, eg indexes and relational constraints.

 - Sort tables by foreign key dependency so that referenced tables are
   loaded before referencing tables.

 - Do table creation and data load together in one transaction to use
   non-logged copy. Index builds, and constraint creation should follow
   immediately, either as part of the same transaction, or possibly
   parallelized themselves.

Table creation, data load, index builds, and constraint creation could
be packaged up as the unit of work to be done in a subprocess which either
completes or fails as a unit. The worker process would be called with
connection info, a file pointer to the data, and the DDL for the table.
pg_restore would keep a work queue of tables to be restored in FK dependency
order and also do the other schema operations such as functions and types.
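[Editorial sketch of the scheduling just proposed: order tables so that FK-referenced tables come first, so each table's unit of work (create + load + indexes + constraints) can be handed to a worker only after its referenced tables are done. Kahn's algorithm is used here for the dependency sort; all names are illustrative.]

```python
from collections import deque

def fk_order(tables, deps):
    """Topologically sort tables; deps maps table -> set of referenced tables."""
    pending = {t: set(deps.get(t, ())) for t in tables}
    ready = deque(t for t, d in pending.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        # Unblock any table whose last unmet FK reference was t.
        for u, d in pending.items():
            if t in d:
                d.discard(t)
                if not d and u not in order and u not in ready:
                    ready.append(u)
    if len(order) != len(tables):
        raise ValueError("FK dependency cycle")
    return order
```

The resulting order feeds the work queue: each popped table becomes one unit of work for a subprocess, which either completes or fails as a unit.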

-dg

-- 
David Gould   [EMAIL PROTECTED]   510 536 1443   510 282 0869
If simplicity worked, the world would be overrun with insects.



Re: [PATCHES] pg_dump additional options for performance

2008-07-25 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 [ pg_dump_beforeafter.v6.patch ]

I looked over this patch a bit.  I have a proposal for a slightly
different way of defining the new switches:

* --schema-before-data, --data-only, and --schema-after-data can be
specified in any combination to obtain any subset of the full dump.
If none are specified (which would in itself be a useless combination)
then the default is to dump all three sections, just as if all three
were specified.

* --schema-only is defined as equivalent to specifying both
--schema-before-data and --schema-after-data.

The patch as submitted enforces what seem largely arbitrary restrictions
on combining these switches.  It made some sense before to treat
specifying both --schema-only and --data-only as an error, but it's not
clear to me why you shouldn't be able to write both --schema-before-data
and --schema-after-data, especially when there's a switch right beside
them that appears to be equivalent to that combination.  So let's just
allow all the combinations.

The attached updated patch implements and documents this behavior,
and gets rid of the special linkage between --disable-triggers and
--data-only as previously discussed.

Unfortunately there's still a lot of work to do, and I don't feel like
doing it so I'm bouncing this patch back for further work.

The key problem is that pg_restore is broken: it emits nearly the same
output for --schema-before-data and --schema-after-data, because it
doesn't have any way to distinguish which objects in a full dump file
belong where.  This is because the filtering logic was put in the wrong
place, namely in the ArchiveEntry creation routines in pg_dump.c, whereas
it really needs to happen while scanning the TocEntry list in
RestoreArchive().  (Note: it is perhaps worth keeping the pg_dump.c
filters so as to avoid doing extra server queries for objects that we
aren't going to dump anyway, but the core logic has to be in
RestoreArchive.)

Another issue is that the rules for deciding which objects are before
data and which are after data are wrong.  In particular ACLs are after
data not before data, which is relatively easy to fix.  Not so easy to fix
is that COMMENTs might be either before or after data depending on what
kind of object they are attached to.

(BTW, what about BLOB COMMENTS?  They definitely can't be before data.
ISTM you could make a case for them being after data, if you think that
comments are always schema.  But there is also a case for considering
them as data, because the objects they are attached to are data.  I kind
of like the latter approach because it would create an invariant that
comments appear in the same dump section as the object commented on.
Thoughts?)

Implementing the filtering by examining the type of a TocEntry in 
RestoreArchive is a bit of a PITA, but it's probably possible.  The
main bad thing about that is the need to make an explicit list of every
type of TocEntry that exists now or ever has been emitted by any past
version of pg_dump.  The design concept was that the type tags are
mainly documentation, and while we've had to bend that in places (mostly
for backward-compatibility reasons) this would be the first place we'd
have to throw it overboard completely.

And there's yet another issue here, which is that it's not entirely clear
that the type of an object uniquely determines whether it's before or
after data.  This might be an emergent property of the object sorting
rules, but there is certainly not anything positively guaranteeing that
the dependency-driven topological sort will produce such a result, and
especially not that that will always be true in the future.  So the
approach seems a bit fragile.

We could perhaps get rid of that problem, as well as the need to implement
object-type-determination logic, if we were to make RestoreArchive define
the groupings according to position in the TocEntry list: everything
before the first TABLE DATA or BLOB (and BLOB COMMENT?) entry is before
data, everything after the last one is after data, everything in between
is data.  Then we only need to identify object types that are considered
data, which we already have a rule for (whether hadDumper is true).
This is pretty attractive until you stop to consider the possibility
that there aren't any data entries in an archive (ie, it was made with
--schema-only): then there's no way to identify the boundary points.

We could solve that problem by inserting a dummy data TOC entry where
the data would have appeared, but this will only work in new archive
files.  With an implementation like this, pg_restore with
--schema-before-data or --schema-after-data won't behave very nicely on a
pre-8.4 --schema-only archive file.  (Presumably it would act as though
all the objects were before data.)  Is that a small enough corner case
to live with in order to gain implementation simplicity and robustness?

BTW, another incomplete item is that pg_dumpall should probably be 

Re: [PATCHES] pg_dump additional options for performance

2008-07-25 Thread Greg Sabino Mullane


Tom Lane wrote:
 * --schema-before-data, --data-only, and --schema-after-data can be

I thought you were arguing for some better names at one point? Those seem
very confusing to me, especially --schema-after-data. I know it means
"the parts of the schema that come after the data", but it could
also be read in other ways, including "put the schema after the data", which
makes no sense; the name is not exactly intuitive. Pre and Post
at least are slightly better, IMO. How about --pre-data-schema
and --post-data-schema? Or --pre-data-section and --post-data-section?
Or (my favorites) --pre-data-commands and --post-data-commands? As the
existing docs say, commands are what we are generating.

 them as data, because the objects they are attached to are data.  I kind
 of like the latter approach because it would create an invariant that
 comments appear in the same dump section as the object commented on.
 Thoughts?)

+1 on putting them next to the object commented on.

 And there's yet another issue here, which is that it's not entirely clear
 that the type of an object uniquely determines whether it's before or
 after data.

Wouldn't that be a problem with current dumps as well then?

 We could solve that problem by inserting a dummy data TOC entry where
 the data would have appeared, but this will only work in new archive
 files.  With an implementation like this, pg_restore with
 --schema-before-data or --schema-after-data won't behave very nicely on a
 pre-8.4 --schema-only archive file.  (Presumably it would act as though
 all the objects were before data.)  Is that a small enough corner case
 to live with in order to gain implementation simplicity and robustness?

I'm not comfortable with corner cases for pg_restore backwards compatibility.
What exactly would happen (worst case) in that scenario?


- --
Greg Sabino Mullane [EMAIL PROTECTED]
PGP Key: 0x14964AC8 200807252235
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8





Re: [PATCHES] pg_dump additional options for performance

2008-07-24 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 [80k patch]

Surely there is a whole lot of unintended noise in this patch?
I certainly don't believe that you meant to change keywords.c
for instance.

regards, tom lane



Re: [PATCHES] pg_dump additional options for performance

2008-07-24 Thread Simon Riggs

On Thu, 2008-07-24 at 03:54 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  [80k patch]
 
 Surely there is a whole lot of unintended noise in this patch?
 I certainly don't believe that you meant to change keywords.c
 for instance.

Removed, thanks.

Unrelated to this patch, it seems I have some issues with my repository,
judging by this and another unrelated issue reported by Martin Zaun.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Index: doc/src/sgml/ref/pg_dump.sgml
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.103
diff -c -r1.103 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml	20 Jul 2008 18:43:30 -	1.103
--- doc/src/sgml/ref/pg_dump.sgml	24 Jul 2008 07:30:19 -
***
*** 133,139 
 para
  Include large objects in the dump.  This is the default behavior
  except when option--schema/, option--table/, or
! option--schema-only/ is specified, so the option-b/
  switch is only useful to add large objects to selective dumps.
 /para
/listitem
--- 133,140 
 para
  Include large objects in the dump.  This is the default behavior
  except when option--schema/, option--table/, or
! option--schema-only/ or option--schema-before-data/ or
! option--schema-after-data/ is specified, so the option-b/
  switch is only useful to add large objects to selective dumps.
 /para
/listitem
***
*** 426,431 
--- 427,452 
   /varlistentry
  
   varlistentry
+   termoption--schema-before-data/option/term
+   listitem
+para
+ 		Dump object definitions (schema) that occur before table data,
+ 		using the order produced by a full dump.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termoption--schema-after-data/option/term
+   listitem
+para
+ 		Dump object definitions (schema) that occur after table data,
+ 		using the order produced by a full dump.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
termoption-S replaceable class=parameterusername/replaceable/option/term
termoption--superuser=replaceable class=parameterusername/replaceable/option/term
listitem
***
*** 790,795 
--- 811,844 
/para
  
para
+The output of applicationpg_dump/application can be divided into three parts:
+itemizedlist
+ listitem
+  para
+ 	  Before Data - objects output before data, which includes
+ 	  commandCREATE TABLE/command statements and others.
+ 	  This part can be requested using option--schema-before-data/.
+  /para
+ /listitem
+ listitem
+  para
+ 	  Table Data - data can be requested using option--data-only/.
+  /para
+ /listitem
+ listitem
+  para
+ 	  After Data - objects output after data, which includes
+ 	  commandCREATE INDEX/command statements and others.
+ 	  This part can be requested using option--schema-after-data/.
+  /para
+ /listitem
+/itemizedlist
+This allows us to work more easily with large data dump files when
+there is some need to edit commands or resequence their execution for
+performance.
+   /para
+ 
+   para
 Because applicationpg_dump/application is used to transfer data
 to newer versions of productnamePostgreSQL/, the output of
 applicationpg_dump/application can be loaded into
Index: doc/src/sgml/ref/pg_restore.sgml
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/ref/pg_restore.sgml,v
retrieving revision 1.75
diff -c -r1.75 pg_restore.sgml
*** doc/src/sgml/ref/pg_restore.sgml	13 Apr 2008 03:49:21 -	1.75
--- doc/src/sgml/ref/pg_restore.sgml	24 Jul 2008 07:30:19 -
***
*** 321,326 
--- 321,346 
   /varlistentry
  
   varlistentry
+   termoption--schema-before-data/option/term
+   listitem
+para
+ 		Restore object definitions (schema) that occur before table data,
+ 		using the order produced by a full restore.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termoption--schema-after-data/option/term
+   listitem
+para
+ 		Restore object definitions (schema) that occur after table data,
+ 		using the order produced by a full restore.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
termoption-S replaceable class=parameterusername/replaceable/option/term
termoption--superuser=replaceable class=parameterusername/replaceable/option/term
listitem
***
*** 572,577 
--- 592,626 
/para
  
para
+The actions of applicationpg_restore/application can be 
+divided into three parts:
+itemizedlist
+ 

Re: [PATCHES] pg_dump additional options for performance

2008-07-23 Thread Simon Riggs

On Mon, 2008-07-21 at 07:56 -0400, Stephen Frost wrote:
 Simon,
 
 * Simon Riggs ([EMAIL PROTECTED]) wrote:
   I hadn't realized that Simon was using pre-schema and post-schema
   to name the first and third parts.  I'd agree that this is confusing
   nomenclature: it looks like it's trying to say that the data is the
   schema, and the schema is not!  How about pre-data and post-data?
  
  OK by me. Any other takers?
 
 Having the command-line options be --schema-pre-data and
 --schema-post-data is fine with me.  Leaving them the way they are is
 also fine by me.  It's the documentation (back to pg_dump.sgml,
 ~774/~797) that starts talking about Pre-Schema and Post-Schema.

OK, Mr.Reviewer, sir:

* patch redone using --schema-before-data and --schema-after-data

* docs rewritten using short clear descriptions using only the words
before and after, like in the option names

* all variable names changed

* retested

So prefixes pre and post no longer appear anywhere. No latin derived
phrases, just good ol' Anglo-Saxon words.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Index: doc/src/sgml/ref/pg_dump.sgml
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.103
diff -c -r1.103 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml	20 Jul 2008 18:43:30 -	1.103
--- doc/src/sgml/ref/pg_dump.sgml	23 Jul 2008 15:26:29 -
***
*** 133,139 
 para
  Include large objects in the dump.  This is the default behavior
  except when option--schema/, option--table/, or
! option--schema-only/ is specified, so the option-b/
  switch is only useful to add large objects to selective dumps.
 /para
/listitem
--- 133,140 
 para
  Include large objects in the dump.  This is the default behavior
  except when option--schema/, option--table/, or
! option--schema-only/ or option--schema-before-data/ or
! option--schema-after-data/ is specified, so the option-b/
  switch is only useful to add large objects to selective dumps.
 /para
/listitem
***
*** 426,431 
--- 427,452 
   /varlistentry
  
   varlistentry
+   termoption--schema-before-data/option/term
+   listitem
+para
+ 		Dump object definitions (schema) that occur before table data,
+ 		using the order produced by a full dump.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termoption--schema-after-data/option/term
+   listitem
+para
+ 		Dump object definitions (schema) that occur after table data,
+ 		using the order produced by a full dump.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
termoption-S replaceable class=parameterusername/replaceable/option/term
termoption--superuser=replaceable class=parameterusername/replaceable/option/term
listitem
***
*** 790,795 
--- 811,844 
/para
  
para
+The output of applicationpg_dump/application can be divided into three parts:
+itemizedlist
+ listitem
+  para
+ 	  Before Data - objects output before data, which includes
+ 	  commandCREATE TABLE/command statements and others.
+ 	  This part can be requested using option--schema-before-data/.
+  /para
+ /listitem
+ listitem
+  para
+ 	  Table Data - data can be requested using option--data-only/.
+  /para
+ /listitem
+ listitem
+  para
+ 	  After Data - objects output after data, which includes
+ 	  commandCREATE INDEX/command statements and others.
+ 	  This part can be requested using option--schema-after-data/.
+  /para
+ /listitem
+/itemizedlist
+This allows us to work more easily with large data dump files when
+there is some need to edit commands or resequence their execution for
+performance.
+   /para
+ 
+   para
 Because applicationpg_dump/application is used to transfer data
 to newer versions of productnamePostgreSQL/, the output of
 applicationpg_dump/application can be loaded into
Index: doc/src/sgml/ref/pg_restore.sgml
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/ref/pg_restore.sgml,v
retrieving revision 1.75
diff -c -r1.75 pg_restore.sgml
*** doc/src/sgml/ref/pg_restore.sgml	13 Apr 2008 03:49:21 -	1.75
--- doc/src/sgml/ref/pg_restore.sgml	23 Jul 2008 16:00:29 -
***
*** 321,326 
--- 321,346 
   /varlistentry
  
   varlistentry
+   termoption--schema-before-data/option/term
+   listitem
+para
+ 		Restore object definitions (schema) that occur before table data,
+ 		using the order produced by a full restore.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   

Re: [PATCHES] pg_dump additional options for performance

2008-07-23 Thread Simon Riggs

On Wed, 2008-07-23 at 17:40 +0100, Simon Riggs wrote:
 On Mon, 2008-07-21 at 07:56 -0400, Stephen Frost wrote:
  Simon,
  
  * Simon Riggs ([EMAIL PROTECTED]) wrote:
I hadn't realized that Simon was using pre-schema and post-schema
to name the first and third parts.  I'd agree that this is confusing
nomenclature: it looks like it's trying to say that the data is the
schema, and the schema is not!  How about pre-data and post-data?
   
   OK by me. Any other takers?
  
  Having the command-line options be --schema-pre-data and
  --schema-post-data is fine with me.  Leaving them the way they are is
  also fine by me.  It's the documentation (back to pg_dump.sgml,
  ~774/~797) that starts talking about Pre-Schema and Post-Schema.
 
 OK, Mr.Reviewer, sir:
 
 * patch redone using --schema-before-data and --schema-after-data
 
 * docs rewritten using short clear descriptions using only the words
 before and after, like in the option names
 
 * all variable names changed
 
 * retested
 
 So prefixes pre and post no longer appear anywhere. No latin derived
 phrases, just good ol' Anglo-Saxon words.

...and with command line help also.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Index: doc/src/sgml/ref/pg_dump.sgml
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.103
diff -c -r1.103 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml	20 Jul 2008 18:43:30 -	1.103
--- doc/src/sgml/ref/pg_dump.sgml	23 Jul 2008 16:55:05 -
***
*** 133,139 
 para
  Include large objects in the dump.  This is the default behavior
  except when option--schema/, option--table/, or
! option--schema-only/ is specified, so the option-b/
  switch is only useful to add large objects to selective dumps.
 /para
/listitem
--- 133,140 
 para
  Include large objects in the dump.  This is the default behavior
  except when option--schema/, option--table/, or
! option--schema-only/ or option--schema-before-data/ or
! option--schema-after-data/ is specified, so the option-b/
  switch is only useful to add large objects to selective dumps.
 /para
/listitem
***
*** 426,431 
--- 427,452 
   /varlistentry
  
   varlistentry
+   termoption--schema-before-data/option/term
+   listitem
+para
+ 		Dump object definitions (schema) that occur before table data,
+ 		using the order produced by a full dump.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termoption--schema-after-data/option/term
+   listitem
+para
+ 		Dump object definitions (schema) that occur after table data,
+ 		using the order produced by a full dump.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
termoption-S replaceable class=parameterusername/replaceable/option/term
termoption--superuser=replaceable class=parameterusername/replaceable/option/term
listitem
***
*** 790,795 
--- 811,844 
/para
  
para
+The output of applicationpg_dump/application can be divided into three parts:
+itemizedlist
+ listitem
+  para
+ 	  Before Data - objects output before data, which includes
+ 	  commandCREATE TABLE/command statements and others.
+ 	  This part can be requested using option--schema-before-data/.
+  /para
+ /listitem
+ listitem
+  para
+ 	  Table Data - data can be requested using option--data-only/.
+  /para
+ /listitem
+ listitem
+  para
+ 	  After Data - objects output after data, which includes
+ 	  commandCREATE INDEX/command statements and others.
+ 	  This part can be requested using option--schema-after-data/.
+  /para
+ /listitem
+/itemizedlist
+This allows us to work more easily with large data dump files when
+there is some need to edit commands or resequence their execution for
+performance.
+   /para
+ 
+   para
 Because applicationpg_dump/application is used to transfer data
 to newer versions of productnamePostgreSQL/, the output of
 applicationpg_dump/application can be loaded into
Index: doc/src/sgml/ref/pg_restore.sgml
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/ref/pg_restore.sgml,v
retrieving revision 1.75
diff -c -r1.75 pg_restore.sgml
*** doc/src/sgml/ref/pg_restore.sgml	13 Apr 2008 03:49:21 -	1.75
--- doc/src/sgml/ref/pg_restore.sgml	23 Jul 2008 16:55:05 -
***
*** 321,326 
--- 321,346 
   /varlistentry
  
   varlistentry
+   termoption--schema-before-data/option/term
+   listitem
+para
+ 		Restore object definitions (schema) that occur before table data,
+ 		using the order 

Re: [PATCHES] pg_dump additional options for performance

2008-07-23 Thread Stephen Frost
Simon,

* Simon Riggs ([EMAIL PROTECTED]) wrote:
 ...and with command line help also.

The documentation and whatnot looks good to me now.  There are a couple
of other issues I found while looking through and testing the patch
though-

Index: src/bin/pg_dump/pg_dump.c
===
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.497
diff -c -r1.497 pg_dump.c
*** src/bin/pg_dump/pg_dump.c   20 Jul 2008 18:43:30 -  1.497
--- src/bin/pg_dump/pg_dump.c   23 Jul 2008 17:04:24 -
***
*** 225,232 
RestoreOptions *ropt;
  
static int  disable_triggers = 0;
!   static int  outputNoTablespaces = 0;
static int  use_setsessauth = 0;
  
static struct option long_options[] = {
{data-only, no_argument, NULL, 'a'},
--- 229,238 
RestoreOptions *ropt;
  
static int  disable_triggers = 0;
!   static int outputNoTablespaces = 0;
static int  use_setsessauth = 0;
+   static int  use_schemaBeforeData;
+   static int  use_schemaAfterData;
  
static struct option long_options[] = {
{data-only, no_argument, NULL, 'a'},
***

This hunk appears to have a bit of gratuitous whitespace change, not a
big deal tho.

***
*** 464,474 
[...]
+   if (dataOnly)
+   dumpObjFlags = REQ_DATA;
+ 
+   if (use_schemaBeforeData == 1)
+   dumpObjFlags = REQ_SCHEMA_BEFORE_DATA;
+ 
+   if (use_schemaAfterData == 1)
+   dumpObjFlags = REQ_SCHEMA_AFTER_DATA;
+ 
+   if (schemaOnly)
+   dumpObjFlags = (REQ_SCHEMA_BEFORE_DATA | REQ_SCHEMA_AFTER_DATA);
***

It wouldn't kill to be consistent between testing for '== 1' and just
checking for non-zero.  Again, not really a big deal, and I wouldn't
mention these if there weren't other issues.

***
*** 646,652 
 * Dumping blobs is now default unless we saw an inclusion switch or -s
 * ... but even if we did see one of these, -b turns it back on.
 */
!   if (include_everything && !schemaOnly)
outputBlobs = true;
  
/*
--- 689,695 
 * Dumping blobs is now default unless we saw an inclusion switch or -s
 * ... but even if we did see one of these, -b turns it back on.
 */
!   if (include_everything && WANT_PRE_SCHEMA(dumpObjFlags))
outputBlobs = true;
  
/*
***

Shouldn't this change be to WANT_DATA(dumpObjFlags)?  That's what most
of the '!schemaOnly' get translated to.  Otherwise I think you would be
getting blobs when you've asked for just schema-before-data, which
doesn't seem like it'd make much sense.

***
*** 712,718 
dumpStdStrings(g_fout);
  
/* The database item is always next, unless we don't want it at all */
!   if (include_everything && !dataOnly)
dumpDatabase(g_fout);
  
/* Now the rearrangeable objects. */
--- 755,761 
dumpStdStrings(g_fout);
  
/* The database item is always next, unless we don't want it at all */
!   if (include_everything && WANT_DATA(dumpObjFlags))
dumpDatabase(g_fout);
  
/* Now the rearrangeable objects. */
***

Shouldn't this be 'WANT_PRE_SCHEMA(dumpObjFlags)'?

***
*** 3414,3420 
continue;
  
/* Ignore indexes of tables not to be dumped */
!   if (!tbinfo->dobj.dump)
continue;
  
if (g_verbose)
--- 3459,3465 
continue;
  
/* Ignore indexes of tables not to be dumped */
!   if (!tbinfo->dobj.dump || !WANT_POST_SCHEMA(dumpObjFlags))
continue;
  
if (g_verbose)
***

I didn't test this, but it strikes me as an unnecessary addition?  If
anything, wouldn't this check make more sense being done right after
dropping into getIndexes()?  No sense going through the loop just for
fun..  Technically, it's a behavioral change for --data-only since it
used to gather index information anyway, but it's a good optimization if
done in the right place.

Also around here, there doesn't appear to be any checking in
dumpEnumType(), which strikes me as odd.  Wouldn't that deserve a

if (!WANT_PRE_SCHEMA(dumpObjFlags))
return;

check?  If not even some kind of equivalent ->dobj.dump check..

***
*** 9803,9809 
tbinfo->dobj.catId, 0, 
tbinfo->dobj.dumpId);
}
  
!   if (!schemaOnly)
{
resetPQExpBuffer(query);
appendPQExpBuffer(query, SELECT pg_catalog.setval();
--- 9848,9854 
tbinfo->dobj.catId, 0, 
tbinfo->dobj.dumpId);
}
  
!   

Re: [PATCHES] pg_dump additional options for performance

2008-07-22 Thread Simon Riggs

On Mon, 2008-07-21 at 19:19 -0400, Tom Lane wrote:
 Stephen Frost [EMAIL PROTECTED] writes:
  Are there use cases for just --omit-post-load or --omit-pre-load?
 
 Probably not many.  The thing that's bothering me is the
 action-at-a-distance property of the positive-logic switches.
 How are we going to explain this?
 
   By default, --schema-pre-load, --data-only, --schema-post-load
   are all ON.  But if you turn one of them ON (never mind that
   it was already ON by default), that changes the defaults for
   the other two to OFF.  Then you have to turn them ON (never
   mind that the default for them is ON) if you want two out of
   the three categories.

While I accept your argument to a certain extent, --schema-only and
--data-only already behave in the manner you describe. Whether we pick
include or exclude or both, it will make more sense than these existing
options, regrettably. 

With regard to the logic, Insert and COPY also behave this way: if you
mention *any* columns then you only get the ones you mention. We manage
to describe that also. An Insert statement would be very confusing if
you had to list the columns you don't want.

So the --omit options seem OK if you assume we'll never add further
options or include additional SQL in the dump. But that seems an
unreliable prop, so I am inclined towards the inclusive approach.

 You have to bend your mind into a pretzel to wrap it around this
 behavior. 

Perhaps my mind was already toroidally challenged? :-}

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Simon Riggs

On Sun, 2008-07-20 at 21:18 -0400, Stephen Frost wrote:
 * Simon Riggs ([EMAIL PROTECTED]) wrote:
  On Sun, 2008-07-20 at 17:43 -0400, Stephen Frost wrote:
   Even this doesn't cover everything though- it's too focused on tables
   and data loading.  Where do functions go?  What about types?
  
  Yes, it is focused on tables and data loading. What about
  functions/types? No relevance here.
 
 I don't see how they're not relevant, it's not like they're being
 excluded and in fact they show up in the pre-load output.  Heck, even if
 they *were* excluded, that should be made clear in the documentation
 (either be an explicit include list, or saying they're excluded).
 
 Part of what's driving this is making sure we have a plan for future
 objects and where they'll go.  Perhaps it would be enough to just say
 pre-load is everything in the schema, except things which are faster
 done in bulk (eg: indexes, keys).  I don't think it's right to say
 pre-load is only object definitions required to load data when it
 includes functions and ACLs though.
 
 Hopefully my suggestion and these comments will get us to a happy
 middle-ground.

I don't really understand what you're saying.

The options split the dump into 3 parts that's all: before the load, the
load and after the load.

--schema-pre-load says
Dumps exactly what option--schema-only/ would dump, but only those
statements before the data load.

What is it you are suggesting? I'm unclear.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Simon Riggs

On Sun, 2008-07-20 at 23:34 -0400, Tom Lane wrote:
 Stephen Frost [EMAIL PROTECTED] writes:
  * daveg ([EMAIL PROTECTED]) wrote:
  One observation, indexes should be built right after the table data
  is loaded for each table, this way, the index build gets a hot cache
  for the table data instead of having to re-read it later as we do now.
 
  That's not how pg_dump has traditionally worked, and the point of this
  patch is to add options to easily segregate the main pieces of the
  existing pg_dump output (main schema definition, data dump, key/index
  building).  You suggestion brings up an interesting point that should
  pg_dump's traditional output structure change the --schema-post-load
  set of objects wouldn't be as clear to newcomers since the load and the
  indexes would be interleaved in the regular output.

Stephen: Agreed.

 Yeah.  Also, that is pushing into an entirely different line of
 development, which is to enable multithreaded pg_restore.  The patch
 at hand is necessarily incompatible with that type of operation, and
 wouldn't be used together with it.
 
 As far as the documentation/definition aspect goes, I think it should
 just say the parts are
   * stuff needed before you can load the data
   * the data
   * stuff needed after loading the data
 and not try to be any more specific than that.  There are corner cases
 that will turn any simple breakdown into a lie, and I doubt that it's
 worth trying to explain them all.  (Take a close look at the dependency
 loop breaking logic in pg_dump if you doubt this.)

Tom: Agreed.

 I hadn't realized that Simon was using pre-schema and post-schema
 to name the first and third parts.  I'd agree that this is confusing
 nomenclature: it looks like it's trying to say that the data is the
 schema, and the schema is not!  How about pre-data and post-data?

OK by me. Any other takers?



I also suggested having three options
--want-pre-schema
--want-data
--want-post-schema
so we could ask for any or all parts in the one dump. --data-only and
--schema-only are negative options so don't allow this.
(I don't like those names either, just thinking about capabilities)

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Stephen Frost
* Simon Riggs ([EMAIL PROTECTED]) wrote:
 The options split the dump into 3 parts that's all: before the load, the
 load and after the load.
 
 --schema-pre-load says
 Dumps exactly what option--schema-only/ would dump, but only those
 statements before the data load.
 
 What is it you are suggesting? I'm unclear.

That part is fine, the problem is that elsewhere in the documentation
(patch line starting ~774 before, ~797 after, to the pg_dump.sgml) you
change it to be objects required before data loading, which isn't the
same.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Andrew Dunstan



Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:
  

I also suggested having three options
--want-pre-schema
--want-data
--want-post-schema
so we could ask for any or all parts in the one dump. --data-only and
--schema-only are negative options so don't allow this.
(I don't like those names either, just thinking about capabilities)



Maybe invert the logic?

--omit-pre-data
--omit-data
--omit-post-data

Not wedded to these either, just tossing out an idea...


  


Please, no. Negative logic seems likely to cause endless confusion.

I'd even be happier with --schema-part-1 and --schema-part-2 if we can't 
find some more expressive way of designating them.


cheers

andrew



Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Maybe invert the logic?
 --omit-pre-data
 --omit-data
 --omit-post-data

 Please, no. Negative logic seems likely to cause endless confusion.

I think it might actually be less confusing, because with this approach,
each switch has an identifiable default (no) and setting it doesn't
cause side-effects on settings of other switches.  The interactions of
the switches as Simon presents 'em seem less than obvious.

regards, tom lane

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Simon Riggs

On Mon, 2008-07-21 at 07:46 -0400, Stephen Frost wrote:
 * Simon Riggs ([EMAIL PROTECTED]) wrote:
  The options split the dump into 3 parts that's all: before the load, the
  load and after the load.
  
  --schema-pre-load says
  Dumps exactly what option--schema-only/ would dump, but only those
  statements before the data load.
  
  What is it you are suggesting? I'm unclear.
 
 That part is fine, the problem is that elsewhere in the documentation
 (patch line starting ~774 before, ~797 after, to the pg_dump.sgml) you
 change it to be objects required before data loading, which isn't the
 same.

OK, gotcha now - will change that. I thought you might mean something
about changing the output itself.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Stephen Frost
Simon,

* Simon Riggs ([EMAIL PROTECTED]) wrote:
  I hadn't realized that Simon was using pre-schema and post-schema
  to name the first and third parts.  I'd agree that this is confusing
  nomenclature: it looks like it's trying to say that the data is the
  schema, and the schema is not!  How about pre-data and post-data?
 
 OK by me. Any other takers?

Having the command-line options be --schema-pre-data and
--schema-post-data is fine with me.  Leaving them the way they are is
also fine by me.  It's the documentation (back to pg_dump.sgml,
~774/~797) that starts talking about Pre-Schema and Post-Schema.

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Tom Lane
Stephen Frost [EMAIL PROTECTED] writes:
 * Tom Lane ([EMAIL PROTECTED]) wrote:
 As far as the documentation/definition aspect goes, I think it should
 just say the parts are
 * stuff needed before you can load the data
 * the data
 * stuff needed after loading the data

 Even that is a lie though, which I guess is what my problem is.

True; the stuff done after is done that way at least in part for
performance reasons rather than because it has to be done that way.
(I think it's not only performance issues, though --- for circular
FKs you pretty much have to load the data first.)

 I hadn't realized that Simon was using pre-schema and post-schema
 to name the first and third parts.  I'd agree that this is confusing
 nomenclature: it looks like it's trying to say that the data is the
 schema, and the schema is not!  How about pre-data and post-data?

 Argh.  The command-line options follow the 'data'/'load' line
 (--schema-pre-load and --schema-post-load), and so I think those are
 fine.  The problem was that in the documentation he switched to saying
 they were Pre-Schema and Post-Schema, which could lead to confusion.

Ah, I see.  No objection to those switch names, at least assuming we
want to stick to positive-logic switches.  What did you think of the
negative-logic suggestion (--omit-xxx)?

regards, tom lane



Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Stephen Frost
Tom, et al,

* Tom Lane ([EMAIL PROTECTED]) wrote:
 Ah, I see.  No objection to those switch names, at least assuming we
 want to stick to positive-logic switches.  What did you think of the
 negative-logic suggestion (--omit-xxx)?

My preference is for positive-logic switches in general.  The place
where I would use this patch would lend itself to being more options if
--omit- were used.  I expect that would hold true for most people.
It would be:

  --omit-data --omit-post-load
  --omit-pre-load --omit-post-load
  --omit-pre-load --omit-data

vs.

  --schema-pre-load
  --data-only
  --schema-post-load

Point being that I'd be dumping these into separate files where I could
more easily manipulate the pre-load or post-load files.  I'd still want
pre/post load to be separate though, since this would be used in cases
where there's a lot of data (hence the reason for the split), and putting
pre and post together and running them before data would slow things
down quite a bit.

Are there use cases for just --omit-post-load or --omit-pre-load?
Probably, but I just don't see any situation where I'd use them like
that.
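[Editor's note: the three-file workflow Stephen describes can be sketched as one pg_dump run per section. This is illustrative only; the section flags are the ones proposed in this thread, not options in released pg_dump, and the filename scheme is made up for the example.]

```python
def dump_commands(dbname, parts):
    """Build one pg_dump invocation per requested dump section.

    The section flags are the ones proposed in this thread
    (--schema-pre-load / --schema-post-load are not real pg_dump
    options yet); -f/--file is pg_dump's standard output-file switch.
    """
    flags = {
        "pre":  "--schema-pre-load",
        "data": "--data-only",
        "post": "--schema-post-load",
    }
    # one run per section, each writing to its own file so the
    # pre-load and post-load SQL can be edited independently
    return [["pg_dump", flags[p], "-f", f"{dbname}.{p}.sql", dbname]
            for p in parts]
```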

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-21 Thread Tom Lane
Stephen Frost [EMAIL PROTECTED] writes:
 Are there use cases for just --omit-post-load or --omit-pre-load?

Probably not many.  The thing that's bothering me is the
action-at-a-distance property of the positive-logic switches.
How are we going to explain this?

By default, --schema-pre-load, --data-only, --schema-post-load
are all ON.  But if you turn one of them ON (never mind that
it was already ON by default), that changes the defaults for
the other two to OFF.  Then you have to turn them ON (never
mind that the default for them is ON) if you want two out of
the three categories.

You have to bend your mind into a pretzel to wrap it around this
behavior.  Yeah, it might be convenient once you understand it,
but how long will it take for the savings in typing to repay the
time to understand it and the mistakes along the way?
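[Editor's note: the action-at-a-distance behavior Tom describes can be sketched as a small resolution function. This is illustrative pseudologic for the proposed positive-switch scheme, not pg_dump's actual option handling.]

```python
def effective_sections(pre=False, data=False, post=False):
    """Resolve section switches under the positive-logic scheme:
    with no section switch given, all three sections are dumped;
    naming any one switch flips the default for the unnamed ones
    to off -- the behavior Tom calls a pretzel."""
    if not (pre or data or post):
        return (True, True, True)   # default: dump everything
    return (pre, data, post)        # only what was explicitly requested
```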

regards, tom lane



Re: [PATCHES] pg_dump additional options for performance

2008-07-20 Thread Simon Riggs

On Sun, 2008-07-20 at 05:47 +0100, Simon Riggs wrote:
 On Sat, 2008-07-19 at 23:07 -0400, Stephen Frost wrote:
  Simon,
  
I agree with adding these options in general, since I find myself
frustrated by having to vi huge dumps to change simple schema things.
A couple of comments on the patch though:
  
- Conflicting option handling
  I think we are doing our users a disservice by putting it on them to
  figure out exactly what:
  multiple object groups cannot be used together
  means to them.  You and I may understand what an object group is,
  and why there can be only one, but it's a great deal less clear than
  the prior message of
  options -s/--schema-only and -a/--data-only cannot be used together
  My suggestion would be to either list out the specific options which
  can't be used together, as was done previously, or add a bit of (I
  realize, boring) code and actually tell the user which of the
  conflicting options were used.
  
- Documentation
  When writing the documentation I would stress that pre-schema and
  post-schema be defined in terms of PostgreSQL objects and why they
  are pre vs. post.
  
- Technically, the patch needs to be updated slightly since another
  pg_dump-related patch was committed recently which also added
  options and thus causes a conflict.
  
Beyond those minor points, the patch looks good to me.
 
 Thanks for the review. I'll make the changes you suggest.

Patch updated to head, plus changes/docs requested.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Index: doc/src/sgml/ref/pg_dump.sgml
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.102
diff -c -r1.102 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml	13 Apr 2008 03:49:21 -0000	1.102
--- doc/src/sgml/ref/pg_dump.sgml	20 Jul 2008 06:33:30 -0000
***************
*** 133,139 ****
       <para>
        Include large objects in the dump.  This is the default behavior
        except when <option>--schema</>, <option>--table</>, or
!       <option>--schema-only</> is specified, so the <option>-b</>
        switch is only useful to add large objects to selective dumps.
       </para>
      </listitem>
--- 133,140 ----
       <para>
        Include large objects in the dump.  This is the default behavior
        except when <option>--schema</>, <option>--table</>, or
!       <option>--schema-only</> or <option>--schema-pre-load</> or
! 		<option>--schema-post-load</> is specified, so the <option>-b</>
        switch is only useful to add large objects to selective dumps.
       </para>
      </listitem>
***************
*** 443,448 ****
--- 444,471 ----
    </varlistentry>
 
    <varlistentry>
+    <term><option>--schema-pre-load</option></term>
+    <listitem>
+     <para>
+ 		Dump only the object definitions (schema) required to load data. Dumps
+ 		exactly what <option>--schema-only</> would dump, but only those
+ 		statements before the data load.
+     </para>
+    </listitem>
+   </varlistentry>
+ 
+   <varlistentry>
+    <term><option>--schema-post-load</option></term>
+    <listitem>
+     <para>
+ 		Dump only the object definitions (schema) required after data has been
+ 		loaded. Dumps exactly what <option>--schema-only</> would dump, but 
+ 		only those statements after the data load.
+     </para>
+    </listitem>
+   </varlistentry>
+ 
+   <varlistentry>
    <term><option>-S <replaceable class="parameter">username</replaceable></option></term>
    <term><option>--superuser=<replaceable class="parameter">username</replaceable></option></term>
    <listitem>
***************
*** 774,779 ****
--- 797,830 ----
    </para>
  
    <para>
+    The output of pg_dump can be notionally divided into three parts:
+    <itemizedlist>
+     <listitem>
+      <para>
+ 	  Pre-Schema - objects required before data loading, such as 
+ 	  <command>CREATE TABLE</command>.
+ 	  This part can be requested using <option>--schema-pre-load</>.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+ 	  Table Data - data can be requested using <option>--data-only</>.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+ 	  Post-Schema - objects required after data loading, such as
+ 	  <command>ALTER TABLE</command> and <command>CREATE INDEX</command>.
+ 	  This part can be requested using <option>--schema-post-load</>.
+      </para>
+     </listitem>
+    </itemizedlist>
+    This allows us to work more easily with large data dump files when
+    there is some need to edit commands or resequence their execution for
+    performance.
+   </para>
+ 
+   <para>
     Because <application>pg_dump</application> is used to transfer data
     to newer versions of <productname>PostgreSQL</>, the output of
     <application>pg_dump</application> can be loaded into
Index: doc/src/sgml/ref/pg_restore.sgml
===
RCS file: 

Re: [PATCHES] pg_dump additional options for performance

2008-07-20 Thread Stephen Frost
Simon,

* Simon Riggs ([EMAIL PROTECTED]) wrote:
 On Sun, 2008-07-20 at 05:47 +0100, Simon Riggs wrote:
  On Sat, 2008-07-19 at 23:07 -0400, Stephen Frost wrote:
[...]
 - Conflicting option handling

Thanks for putting in the extra code to explicitly indicate which
conflicting options were used.

 - Documentation
 When writing the documentation I would stress that pre-schema and
 post-schema be defined in terms of PostgreSQL objects and why they
 are pre vs. post.

Perhaps this is up for some debate, but I find the documentation added
for these options to be lacking the definitions I was looking for, and
the explanation of why they are what they are.  I'm also not sure I
agree with the Pre-Schema and Post-Schema nomenclature as it doesn't
really fit with the option names or what they do.  Would you consider:

<term><option>--schema-pre-load</option></term>
<listitem>
 <para>
   Pre-Data Load - Minimum amount of the schema required before data
   loading can begin.  This consists mainly of creating the tables
   using the <command>CREATE TABLE</command>.
   This part can be requested using <option>--schema-pre-load</>.
 </para>
</listitem>

<term><option>--schema-post-load</option></term>
<listitem>
 <para>
   Post-Data Load - The rest of the schema definition, including keys,
   indexes, etc.  By putting keys and indexes after the data has been
   loaded the whole process of restoring data is much faster.  This is
   because it is faster to build indexes and check keys in bulk than
   piecemeal as the data is loaded.
   This part can be requested using <option>--schema-post-load</>.
 </para>
</listitem>

Even this doesn't cover everything though- it's too focused on tables
and data loading.  Where do functions go?  What about types?

A couple of additional points:

  - The command-line help hasn't been updated.  Clearly, that also needs
to be done to consider the documentation aspect complete.

  - There appears to be a bit of mistakenly included additions.  The
patch to pg_restore.sgml attempts to add in documentation for
--superuser.  I'm guessing that was unintentional, and looks like
    just a mistaken extra copy and paste.

 - Technically, the patch needs to be updated slightly since another
 pg_dump-related patch was committed recently which also added
 options and thus causes a conflict.

I think this might have just happened again, funny enough.  It's
something that a committer could perhaps fix, but if you're reworking
the patch anyway...

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-20 Thread Simon Riggs

On Sun, 2008-07-20 at 17:43 -0400, Stephen Frost wrote:

 Perhaps this is up for some debate, but I find the documentation added
 for these options to be lacking the definitions I was looking for, and
 the explanation of why they are what they are.  I'm also not sure I
 agree with the Pre-Schema and Post-Schema nomenclature as it doesn't
 really fit with the option names or what they do.  Would you consider:

Will reword.

 Even this doesn't cover everything though- it's too focused on tables
 and data loading.  Where do functions go?  What about types?

Yes, it is focused on tables and data loading. What about
functions/types? No relevance here.

   - The command-line help hasn't been updated.  Clearly, that also needs
   to be done to consider the documentation aspect complete.
 
   - There appears to be a bit of mistakenly included additions.  The
 patch to pg_restore.sgml attempts to add in documentation for
   --superuser.  I'm guessing that was unintentional, and looks like
   just a mistaken extra copypaste.

Thanks, will do.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-20 Thread Stephen Frost
* Simon Riggs ([EMAIL PROTECTED]) wrote:
 On Sun, 2008-07-20 at 17:43 -0400, Stephen Frost wrote:
  Even this doesn't cover everything though- it's too focused on tables
  and data loading.  Where do functions go?  What about types?
 
 Yes, it is focused on tables and data loading. What about
 functions/types? No relevance here.

I don't see how they're not relevant, it's not like they're being
excluded and in fact they show up in the pre-load output.  Heck, even if
they *were* excluded, that should be made clear in the documentation
(either be an explicit include list, or saying they're excluded).

Part of what's driving this is making sure we have a plan for future
objects and where they'll go.  Perhaps it would be enough to just say
pre-load is everything in the schema, except things which are faster
done in bulk (eg: indexes, keys).  I don't think it's right to say
pre-load is only object definitions required to load data when it
includes functions and ACLs though.

Hopefully my suggestion and these comments will get us to a happy
middle-ground.

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-20 Thread daveg
On Sun, Jul 20, 2008 at 09:18:29PM -0400, Stephen Frost wrote:
 * Simon Riggs ([EMAIL PROTECTED]) wrote:
  On Sun, 2008-07-20 at 17:43 -0400, Stephen Frost wrote:
   Even this doesn't cover everything though- it's too focused on tables
   and data loading.  Where do functions go?  What about types?
  
  Yes, it is focused on tables and data loading. What about
  functions/types? No relevance here.
 
 I don't see how they're not relevant, it's not like they're being
 excluded and in fact they show up in the pre-load output.  Heck, even if
 they *were* excluded, that should be made clear in the documentation
 (either be an explicit include list, or saying they're excluded).
 
 Part of what's driving this is making sure we have a plan for future
 objects and where they'll go.  Perhaps it would be enough to just say
 pre-load is everything in the schema, except things which are faster
 done in bulk (eg: indexes, keys).  I don't think it's right to say
 pre-load is only object definitions required to load data when it
 includes functions and ACLs though.
 
 Hopefully my suggestion and these comments will get us to a happy
 middle-ground.

One observation: indexes should be built right after each table's data
is loaded; that way, the index build gets a hot cache for the table
data instead of having to re-read it later as we do now.

-dg
 


-- 
David Gould   [EMAIL PROTECTED]   510 536 1443   510 282 0869
If simplicity worked, the world would be overrun with insects.



Re: [PATCHES] pg_dump additional options for performance

2008-07-20 Thread Stephen Frost
* daveg ([EMAIL PROTECTED]) wrote:
 One observation, indexes should be built right after the table data
 is loaded for each table, this way, the index build gets a hot cache
 for the table data instead of having to re-read it later as we do now.

That's not how pg_dump has traditionally worked, and the point of this
patch is to add options to easily segregate the main pieces of the
existing pg_dump output (main schema definition, data dump, key/index
building).  Your suggestion brings up an interesting point: should
pg_dump's traditional output structure change, the --schema-post-load
set of objects wouldn't be as clear to newcomers, since the load and the
indexes would be interleaved in the regular output.

I'd be curious about the performance impact this has on an actual load
too.  It would probably be more valuable on smaller loads where it would
have less of an impact anyway than on loads larger than the cache size.
Still, not an issue for this patch, imv.
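[Editor's note: the two restore orderings under discussion differ only in where the index-build steps land. An illustrative sketch, not pg_restore code:]

```python
def restore_plan(tables, interleave=False):
    """Order COPY and CREATE INDEX steps for a restore.

    With interleave=True, each table's indexes are built right after its
    data loads, while its pages are still hot in cache (daveg's
    suggestion).  Otherwise all data loads first and all indexes build
    afterwards, matching pg_dump's traditional output ordering."""
    if interleave:
        return [step for t in tables for step in (("COPY", t), ("INDEX", t))]
    return [("COPY", t) for t in tables] + [("INDEX", t) for t in tables]
```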

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-20 Thread Tom Lane
Stephen Frost [EMAIL PROTECTED] writes:
 * daveg ([EMAIL PROTECTED]) wrote:
 One observation, indexes should be built right after the table data
 is loaded for each table, this way, the index build gets a hot cache
 for the table data instead of having to re-read it later as we do now.

 That's not how pg_dump has traditionally worked, and the point of this
 patch is to add options to easily segregate the main pieces of the
 existing pg_dump output (main schema definition, data dump, key/index
 building).  You suggestion brings up an interesting point that should
 pg_dump's traditional output structure change the --schema-post-load
 set of objects wouldn't be as clear to newcomers since the load and the
 indexes would be interleaved in the regular output.

Yeah.  Also, that is pushing into an entirely different line of
development, which is to enable multithreaded pg_restore.  The patch
at hand is necessarily incompatible with that type of operation, and
wouldn't be used together with it.

As far as the documentation/definition aspect goes, I think it should
just say the parts are
* stuff needed before you can load the data
* the data
* stuff needed after loading the data
and not try to be any more specific than that.  There are corner cases
that will turn any simple breakdown into a lie, and I doubt that it's
worth trying to explain them all.  (Take a close look at the dependency
loop breaking logic in pg_dump if you doubt this.)
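[Editor's note: Tom's three-way breakdown can be sketched as a classifier over dump object types. This is illustrative only; the real grouping falls out of pg_dump's dependency sorting and has corner cases, such as circular FKs, that this ignores.]

```python
# Rough, hypothetical grouping of dump entry types into the three
# sections; the names loosely follow pg_dump's TOC entry labels.
POST_DATA = {"INDEX", "CONSTRAINT", "FK CONSTRAINT", "TRIGGER", "RULE"}

def section(entry_type):
    """Classify a dump entry as pre-data, data, or post-data."""
    if entry_type == "TABLE DATA":
        return "data"
    # everything needed before loading (tables, types, functions,
    # ACLs, ...) lands in pre-data; objects that are faster to build
    # in bulk after the load land in post-data
    return "post-data" if entry_type in POST_DATA else "pre-data"
```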

I hadn't realized that Simon was using pre-schema and post-schema
to name the first and third parts.  I'd agree that this is confusing
nomenclature: it looks like it's trying to say that the data is the
schema, and the schema is not!  How about pre-data and post-data?

regards, tom lane



Re: [PATCHES] pg_dump additional options for performance

2008-07-19 Thread Simon Riggs

On Sat, 2008-07-19 at 23:07 -0400, Stephen Frost wrote:
 Simon,
 
   I agree with adding these options in general, since I find myself
   frustrated by having to vi huge dumps to change simple schema things.
   A couple of comments on the patch though:
 
   - Conflicting option handling
 I think we are doing our users a disservice by putting it on them to
   figure out exactly what:
   multiple object groups cannot be used together
   means to them.  You and I may understand what an object group is,
   and why there can be only one, but it's a great deal less clear than
   the prior message of
   options -s/--schema-only and -a/--data-only cannot be used together
   My suggestion would be to either list out the specific options which
   can't be used together, as was done previously, or add a bit of (I
   realize, boring) code and actually tell the user which of the
   conflicting options were used.
 
   - Documentation
   When writing the documentation I would stress that pre-schema and
   post-schema be defined in terms of PostgreSQL objects and why they
   are pre vs. post.
 
   - Technically, the patch needs to be updated slightly since another
   pg_dump-related patch was committed recently which also added
   options and thus causes a conflict.
 
   Beyond those minor points, the patch looks good to me.

Thanks for the review. I'll make the changes you suggest.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

