Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Simon Riggs
On Tue, 2004-06-01 at 01:24, Tom Lane wrote:
 I was just about to commit a patch that revises the btree index build
 procedure as discussed here:
 http://archives.postgresql.org/pgsql-general/2004-05/msg00480.php
 specifically, not using shared buffers during index build and bypassing
 WAL-logging in favor of just fsyncing the index file before commit.
 
 I was actually writing the commit message when it occurred to me that
 this would seriously break PITR.  If the WAL datastream doesn't contain
 enough info to rebuild the index then rolling forward from a past backup
 isn't gonna work.
 
 I thought for a little bit about a magic reconstruct the index WAL
 entry that would invoke the index build procedure in toto, but that
 doesn't look like it will fly either.  (Two problems: during crash
 recovery, you couldn't be sure that what's on disk for the underlying
 table exactly matches the index you need to build --- it could be a
 later state of the table; and besides, the environment of the WAL replay
 process isn't capable of running user-defined functions, so it couldn't
 work for functional indexes.)
 
 So AFAICS, we've got to dump the index contents into WAL to support
 PITR.  This is a tad annoying.
 
 What I'm thinking about right now is tweaking the index-build code to
 write to WAL only if it sees that PITR is actually in use.  It would
 have to look at the GUC variables to determine whether WAL archiving
 is enabled.  If archiving isn't turned on, then we could assume that
 rollforward from a past backup isn't needed in this installation, and
 use the WAL-less index build method.
 
 Comments?

The mechanism you suggest would also break crash recovery, not just PITR
- though the avoidance of shared buffers seems like a gain either way.


..You raise the whole subject of UNRECOVERABLE data objects. Also known
as NOT LOGGED etc.

There are many significant performance gains to be had by turning off
recoverability for certain large operations. Oracle and Teradata make
extensive use of such features, i.e. especially in Data Warehousing.

Examples of such operations might be:
- index builds
- INSERT SELECTs into previously empty tables

This is an important area for performance...not just index builds.


A suggestion would be:
- add the dont send to xlog functionality as a user option on each
statement, default=LOGGING - this could be Oracle compatible, or not,
but concept is similar. Put the hooks in now and we can add this to all
appropriate statement syntax later.

e.g. 
CREATE INDEX blah ... NO LOGGING... ;
INSERT INTO blah ... NO LOGGING  SELECT... ;

- if conf file says dont use fsync, then dont write to log - clearly
they dont mind losing data in the event of a crash...
i.e. default=NOLOGGING on all statements



Best regards, Simon Riggs


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Zeugswetter Andreas SB SD

 I think an actually implementable version of this would be:
 
 1. Don't log any index operations at all in WAL.
 
 2. When recovering from WAL, restore all the table contents by WAL
 replay.  (This would of course include the system catalog contents that
 describe the indexes.)  Then sit there and do a global REINDEX to
 rebuild all the indexes.
 
 This would gain a reduction of some percentage in WAL traffic, at the
 cost of a hugely expensive recovery cycle any time you actually needed
 to use the WAL.  I guess this could be attractive to some installations,
 but I'm not sure very many people would want it ...

I think only the global part of it is not really acceptable. If we had a flag
for each index that marks it inconsistent reindexing only those that are
marked would be great.

Could we log a WAL record that basically only marks an index for deferred reindex
after WAL recovery ? During WAL replay all records for this index could be 
ignored (this is not a must because of the post update page images in WAL, 
the index would still stay inconsistent until reindex of course).

I think such a reindex step could also be responsible for those non-btree 
indexes that don't fully support WAL (gist?).

Andreas

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 The mechanism you suggest would also break crash recovery, not just PITR

No it wouldn't.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 I thought for a little bit about a magic reconstruct the index WAL
 entry that would invoke the index build procedure in toto, but that
 doesn't look like it will fly either.  (Two problems: during crash
 recovery, you couldn't be sure that what's on disk for the underlying
 table exactly matches the index you need to build --- it could be a
 later state of the table; and besides, the environment of the WAL replay
 process isn't capable of running user-defined functions, so it couldn't
 work for functional indexes.)

Could you just mark the index as unusable? Have the optimizer ignore such
indexes and PITR recovery can notify the user of these indexes and/or invoke a
rebuild automatically?

It wouldn't happen unless the user had done an index rebuild since the last
complete backup, so it wouldn't even be a performance issue. Restoring the
index from the WAL replay of an index rebuild must take a long time anyways.

-- 
greg


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Alvaro Herrera
On Tue, Jun 01, 2004 at 12:52:32PM -0400, Greg Stark wrote:
 
 Tom Lane [EMAIL PROTECTED] writes:
 
  I thought for a little bit about a magic reconstruct the index WAL
  entry that would invoke the index build procedure in toto, but that
  doesn't look like it will fly either.  (Two problems: during crash
  recovery, you couldn't be sure that what's on disk for the underlying
  table exactly matches the index you need to build --- it could be a
  later state of the table; and besides, the environment of the WAL replay
  process isn't capable of running user-defined functions, so it couldn't
  work for functional indexes.)
 
 Could you just mark the index as unusable? Have the optimizer ignore such
 indexes and PITR recovery can notify the user of these indexes and/or invoke a
 rebuild automatically?

The big problem I see with this kind of approaches is that building an
index from scratch can take a huge amount of time, because you have to
sort the data.  Building from WAL does not have this problem, so it can
be much faster.  Of course, when you are restoring using a PITR
approach you probably want it to be very fast, and have the DB running
with as little quirks as possible, as soon as possible.

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
El realista sabe lo que quiere; el idealista quiere lo que sabe (Anónimo)


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Greg Stark
Alvaro Herrera [EMAIL PROTECTED] writes:

 The big problem I see with this kind of approaches is that building an
 index from scratch can take a huge amount of time, because you have to
 sort the data.  Building from WAL does not have this problem, so it can
 be much faster.  

I'm not clear that building from WAL is really going to be that much faster.
A) algorithmically it's only the factor of log(n) that you're talking about.
and B) the WAL will have records for every write, not just the final product,
so it might potentially have a lot more writes to do.

I thought part of the original problem was specifically that going through WAL
slowed down the index rebuild much more than a factor of 2, which would tend
to imply that in fact rebuilding from WAL isn't going to be as fast as you
might expect.

Another possibility is doing the complete index build without going through
WAL and then inserting a complete copy of the index into the WAL without
syncing or activating the rebuilt index until the copy do WAL is done. That
kind of sucks since it's equivalent to just taking another backup of the data
files immediately after the rebuild, but might be a more direct solution using
the existing tools.

 Of course, when you are restoring using a PITR approach you probably want it
 to be very fast, and have the DB running with as little quirks as possible,
 as soon as possible.

This is certainly true.

-- 
greg


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 I'm not clear that building from WAL is really going to be that much faster.
 A) algorithmically it's only the factor of log(n) that you're talking about.
 and B) the WAL will have records for every write, not just the final product,
 so it might potentially have a lot more writes to do.

Wrong ... what we log in WAL for a btree index build is just the series
of completed index page images.  Recreation of the index would proceed
at whatever your disk read/write bandwidth is.

Like Alvaro, I suspect that people who are using PITR will be concerned
about recovery time, and would not be thrilled with any scenario that
involves REINDEX to get the system back on its feet.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Alvaro Herrera
On Tue, Jun 01, 2004 at 01:55:38PM -0400, Greg Stark wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
 
  The big problem I see with this kind of approaches is that building an
  index from scratch can take a huge amount of time, because you have to
  sort the data.  Building from WAL does not have this problem, so it can
  be much faster.  
 
 I'm not clear that building from WAL is really going to be that much faster.
 A) algorithmically it's only the factor of log(n) that you're talking about.
 and B) the WAL will have records for every write, not just the final product,
 so it might potentially have a lot more writes to do.

Maybe it is log(n) algorithmically, but the constants are big because
there's a lot of non-sequential I/O involved.  With the WAL approach you
only read the pages from the log and copy them somewhere else.  It's not
nearly the same amount of I/O, and it's mostly sequential.  And if I
understood correctly what Tom said, after index construction the whole
index pages are dropped, so there's no need to redo each node split
operation.


 I thought part of the original problem was specifically that going through WAL
 slowed down the index rebuild much more than a factor of 2, which would tend
 to imply that in fact rebuilding from WAL isn't going to be as fast as you
 might expect.

I think there was more than one problem in the code.  I expect at least
those not related (such as some locking issue apparently) are solved.
And I'd expect WAL construction to be heavier than recovery from WAL,
because some WAL writes have to be fsync()ed, and this is a heavy
burden, while recovery does not need fsync on the new files AFAIK (but I
could be wrong on this).

One of the things that bothered me was that for some reason you couldn't
get the whole performance benefit you would expect from simultaneous
index builds when using a multiprocessor machine.


 Another possibility is doing the complete index build without going through
 WAL and then inserting a complete copy of the index into the WAL without
 syncing or activating the rebuilt index until the copy do WAL is done. That
 kind of sucks since it's equivalent to just taking another backup of the data
 files immediately after the rebuild, but might be a more direct solution using
 the existing tools.

Apparently this is the current state of affairs, though I'm not sure.

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
La persona que no quería pecar / estaba obligada a sentarse
en duras y empinadas sillas/ desprovistas, por cierto
de blandos atenuantes  (Patricio Vogel)


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Bruce Momjian
Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  I'm not clear that building from WAL is really going to be that much faster.
  A) algorithmically it's only the factor of log(n) that you're talking about.
  and B) the WAL will have records for every write, not just the final product,
  so it might potentially have a lot more writes to do.
 
 Wrong ... what we log in WAL for a btree index build is just the series
 of completed index page images.  Recreation of the index would proceed
 at whatever your disk read/write bandwidth is.
 
 Like Alvaro, I suspect that people who are using PITR will be concerned
 about recovery time, and would not be thrilled with any scenario that
 involves REINDEX to get the system back on its feet.

Agreed.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Simon Riggs
On Tue, 2004-06-01 at 03:21, Bruce Momjian wrote:
 Tom Lane wrote:
  Bruce Momjian [EMAIL PROTECTED] writes:
   I assume if someone turns on archiving in postgresql.conf, sighups the
   postmaster, then does a tar backup, they should be able to do archiving,
   no?
  
  I would have zero problem with labeling the archive parameter as
  changeable only at postmaster start.
 
 I guess the question is whether it would be possible to start/stop it at
 other times.  And what process are we going to use to do a tar backup? 
 Do they turn off archiving before doing the tar, or tell the system to
 tar to another location?  
 

This situation has been discussed and agreed twice already. First we
discussed it was SUSET, then SIGHUP, now we talk about postmaster
startup.

I'm not sure I'm too bothered either way, but the code has now been
written to make it a SIGHUP operation. Making it SIGHUP effects the way
we invoke the archiver process at postmaster startup, so if we want to
change things again we must do so real soon.

Postmaster startup is the simplest scenario at run-time, so I'd suggest
we move to that NOW and then MAYBE back to SIGHUP at a later time, when
we are more certain everything works in production.

 And you brought up the issue of how do we feed multilple archive files
 back into the xlog directory during restore if they don't all fit on the
 disk.
 

That has already been requested by Tom and agreed as on-the-PITR feature
list as an embellishment of the general recover-to-a point scenario. It
*MIGHT* make it into this release, if we get the other stuff done first.

 I think we need to explore the procedures we are going to use for PITR.
 

Much of that has already been discussed.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Fast index build vs. PITR

2004-06-01 Thread Simon Riggs
On Tue, 2004-06-01 at 03:13, Alvaro Herrera wrote:
 A completely different idea would be to log a logical index creation,
 so that during normal recovery those entries are saved somewhere; after
 the rest of WAL recovery is done, the system is taken into a more normal
 post-recovery pre-usable state, on which those indexes are recreated
 from user data.  This would be cheapest in WAL traffic, but probably
 it'll also require more code and new hooks in the startup mechanism.
 Also, it'd require examining later WAL entries that refer to the index
 and act accordingly (e.g. ignore the entry if it modifies the index, and
 forget the creation if it's a DROP INDEX command.)
 

There will be many ways to optimise recovery once we have PITR
working...

The current code does a straight replay of all changes. We can imagine
lots of different multi-pass or lookahead strategies for replaying xlog
records, but please lets wait awhile...

 Not that I like neither of those ideas really ... issuing normal WAL
 index creation traffic if PITR is active is certainly the easiest way.

I agree, certainly for now.

Best Regards, Simon Riggs


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[HACKERS] Fast index build vs. PITR

2004-05-31 Thread Tom Lane
I was just about to commit a patch that revises the btree index build
procedure as discussed here:
http://archives.postgresql.org/pgsql-general/2004-05/msg00480.php
specifically, not using shared buffers during index build and bypassing
WAL-logging in favor of just fsyncing the index file before commit.

I was actually writing the commit message when it occurred to me that
this would seriously break PITR.  If the WAL datastream doesn't contain
enough info to rebuild the index then rolling forward from a past backup
isn't gonna work.

I thought for a little bit about a magic reconstruct the index WAL
entry that would invoke the index build procedure in toto, but that
doesn't look like it will fly either.  (Two problems: during crash
recovery, you couldn't be sure that what's on disk for the underlying
table exactly matches the index you need to build --- it could be a
later state of the table; and besides, the environment of the WAL replay
process isn't capable of running user-defined functions, so it couldn't
work for functional indexes.)

So AFAICS, we've got to dump the index contents into WAL to support
PITR.  This is a tad annoying.

What I'm thinking about right now is tweaking the index-build code to
write to WAL only if it sees that PITR is actually in use.  It would
have to look at the GUC variables to determine whether WAL archiving
is enabled.  If archiving isn't turned on, then we could assume that
rollforward from a past backup isn't needed in this installation, and
use the WAL-less index build method.

Comments?

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] Fast index build vs. PITR

2004-05-31 Thread Gavin Sherry
On Mon, 31 May 2004, Tom Lane wrote:

[snip]

 I thought for a little bit about a magic reconstruct the index WAL
 entry that would invoke the index build procedure in toto, but that
 doesn't look like it will fly either.  (Two problems: during crash
 recovery, you couldn't be sure that what's on disk for the underlying
 table exactly matches the index you need to build --- it could be a
 later state of the table; and besides, the environment of the WAL replay
 process isn't capable of running user-defined functions, so it couldn't
 work for functional indexes.)

 So AFAICS, we've got to dump the index contents into WAL to support
 PITR.  This is a tad annoying.

Is it possible in this case to dump the index block by block into the log
after it has been generated? It seems to me this should reduce the
amount of data we write to WAL and (therefore) speed up the rebuild.

Gavin

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Fast index build vs. PITR

2004-05-31 Thread Christopher Kings-Lynne
What I'm thinking about right now is tweaking the index-build code to
write to WAL only if it sees that PITR is actually in use.  It would
have to look at the GUC variables to determine whether WAL archiving
is enabled.  If archiving isn't turned on, then we could assume that
rollforward from a past backup isn't needed in this installation, and
use the WAL-less index build method.
Seems reasonable.
Chris
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?
  http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Fast index build vs. PITR

2004-05-31 Thread Bruce Momjian
Christopher Kings-Lynne wrote:
  What I'm thinking about right now is tweaking the index-build code to
  write to WAL only if it sees that PITR is actually in use.  It would
  have to look at the GUC variables to determine whether WAL archiving
  is enabled.  If archiving isn't turned on, then we could assume that
  rollforward from a past backup isn't needed in this installation, and
  use the WAL-less index build method.
 
 Seems reasonable.

What happens if someone turns on archiving while the index is being
built?  Is that possible?

I assume if someone turns on archiving in postgresql.conf, sighups the
postmaster, then does a tar backup, they should be able to do archiving,
no?


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Fast index build vs. PITR

2004-05-31 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 I assume if someone turns on archiving in postgresql.conf, sighups the
 postmaster, then does a tar backup, they should be able to do archiving,
 no?

I would have zero problem with labeling the archive parameter as
changeable only at postmaster start.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] Fast index build vs. PITR

2004-05-31 Thread Alvaro Herrera
On Tue, Jun 01, 2004 at 11:34:15AM +1000, Gavin Sherry wrote:
 On Mon, 31 May 2004, Tom Lane wrote:
 
  I thought for a little bit about a magic reconstruct the index WAL
  entry that would invoke the index build procedure in toto, but that
  doesn't look like it will fly either.  (Two problems: during crash
  recovery, you couldn't be sure that what's on disk for the underlying
  table exactly matches the index you need to build --- it could be a
  later state of the table; and besides, the environment of the WAL replay
  process isn't capable of running user-defined functions, so it couldn't
  work for functional indexes.)
 
  So AFAICS, we've got to dump the index contents into WAL to support
  PITR.  This is a tad annoying.
 
 Is it possible in this case to dump the index block by block into the log
 after it has been generated? It seems to me this should reduce the
 amount of data we write to WAL and (therefore) speed up the rebuild.

Is this less log traffic?  You save a lot of per-record overhead, but
you have to save internal pages (which are not saved with the standard
code).  Plus it would be a lot different from standard btree WAL
traffic, so it'd be more code.

Maybe there's more to be saved by logging only leaf pages.  But then
there would be even more code to be able to reconstruct the index from
only leaf pages, and there are not that many internal pages either.

A completely different idea would be to log a logical index creation,
so that during normal recovery those entries are saved somewhere; after
the rest of WAL recovery is done, the system is taken into a more normal
post-recovery pre-usable state, on which those indexes are recreated
from user data.  This would be cheapest in WAL traffic, but probably
it'll also require more code and new hooks in the startup mechanism.
Also, it'd require examining later WAL entries that refer to the index
and act accordingly (e.g. ignore the entry if it modifies the index, and
forget the creation if it's a DROP INDEX command.)

Not that I like neither of those ideas really ... issuing normal WAL
index creation traffic if PITR is active is certainly the easiest way.

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
In Europe they call me Niklaus Wirth; in the US they call me Nickel's worth.
That's because in Europe they call me by name, and in the US by value!


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Fast index build vs. PITR

2004-05-31 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  I assume if someone turns on archiving in postgresql.conf, sighups the
  postmaster, then does a tar backup, they should be able to do archiving,
  no?
 
 I would have zero problem with labeling the archive parameter as
 changeable only at postmaster start.

I guess the question is whether it would be possible to start/stop it at
other times.  And what process are we going to use to do a tar backup? 
Do they turn off archiving before doing the tar, or tell the system to
tar to another location?  

And you brought up the issue of how do we feed multilple archive files
back into the xlog directory during restore if they don't all fit on the
disk.

I think we need to explore the procedures we are going to use for PITR.

Also, do we need to do tar in a special way, like tar up a specific file
first?


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] Fast index build vs. PITR

2004-05-31 Thread Tom Lane
Gavin Sherry [EMAIL PROTECTED] writes:
 So AFAICS, we've got to dump the index contents into WAL to support
 PITR.  This is a tad annoying.

 Is it possible in this case to dump the index block by block into the log
 after it has been generated?

That's what we do now, and it more or less doubles the amount of I/O
needed to create an index ...

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] Fast index build vs. PITR

2004-05-31 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 A completely different idea would be to log a logical index creation,
 so that during normal recovery those entries are saved somewhere; after
 the rest of WAL recovery is done, the system is taken into a more normal
 post-recovery pre-usable state, on which those indexes are recreated
 from user data.

I think an actually implementable version of this would be:

1. Don't log any index operations at all in WAL.

2. When recovering from WAL, restore all the table contents by WAL
replay.  (This would of course include the system catalog contents that
describe the indexes.)  Then sit there and do a global REINDEX to
rebuild all the indexes.

This would gain a reduction of some percentage in WAL traffic, at the
cost of a hugely expensive recovery cycle any time you actually needed
to use the WAL.  I guess this could be attractive to some installations,
but I'm not sure very many people would want it ...

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html