Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-03 Thread Tom Lane
I wrote:
 * We can not change the toast rel OID of a shared catalog -- there's no
 way to propagate that into the other copies of pg_class.  So we need to
 rejigger the logic for heap rewriting a little bit.  Toast rel swapping
 has to be handled by swapping their relfilenodes not their OIDs.  This
 is no big deal as far as cluster.c itself is concerned, but the tricky
 part is that when we write new toasted values into the new toast rel,
 the TOAST pointers going into the new heap have to be written with the
 original toast-table OID value not the one that the transient target
 toast rel has got.  This is doable but it would uglify the TOAST API a
 bit I think.

I've been playing around with different alternatives for solving the
problem of toast-pointer OIDs, but I keep coming back to the above as
being the least invasive and most robust answer.  There are two basic
ways that we could do it: pass the OID to use to the toast logic, which
would require adding a parameter to heap_insert and a number of other
places; or add a field to struct Relation that says when inserting a
TOAST pointer in this relation, use this OID as the toast-table OID
value in the pointer, even if that's different from what the table's OID
appears to be.  The latter seems like less of a notational change, so
I'm leaning to that, but wanted to see if anyone prefers the other.

We could avoid this hackery if there were a way for Relation structs to
point at either the old or the new physical relation (relfilenode);
then we'd not need the transient new heap relation during CLUSTER/VF,
which would be good for reducing catalog churn.  I've concluded that
that's too large a change to undertake for 9.0, but it might be
interesting to try in the future.  So I'd prefer that what we do for
now touch as little code as possible so as to be easy to revert; hence
I'm not wanting to change heap_insert's signature.
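
For concreteness, a minimal sketch of the struct-Relation approach (the
field name and the exact test are illustrative assumptions, not settled
code):

    /* Hypothetical addition to struct RelationData in utils/rel.h */
    Oid        rd_toastoid;    /* Real TOAST rel OID to place in TOAST
                                * pointers written into this relation, or
                                * InvalidOid to use the toast rel's own OID */

    /* In the TOAST save path, when filling in a new pointer: */
    if (OidIsValid(rel->rd_toastoid))
        toast_pointer.va_toastrelid = rel->rd_toastoid;   /* rewrite case */
    else
        toast_pointer.va_toastrelid = RelationGetRelid(toastrel);

The rewrite code would set the field on the transient heap just before
copying tuples, and it would be InvalidOid everywhere else, so no other
caller of the toast logic changes.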

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-03 Thread Bruce Momjian
Tom Lane wrote:
 I've been playing around with different alternatives for solving the
 problem of toast-pointer OIDs, but I keep coming back to the above as
 being the least invasive and most robust answer.  There are two basic
 ways that we could do it: pass the OID to use to the toast logic, which
 would require adding a parameter to heap_insert and a number of other
 places; or add a field to struct Relation that says when inserting a
 TOAST pointer in this relation, use this OID as the toast-table OID
 value in the pointer, even if that's different from what the table's OID
 appears to be.  The latter seems like less of a notational change, so
 I'm leaning to that, but wanted to see if anyone prefers the other.
 
 We could avoid this hackery if there were a way for Relation structs to
 point at either the old or the new physical relation (relfilenode);
 then we'd not need the transient new heap relation during CLUSTER/VF,
 which would be good for reducing catalog churn.  I've concluded that
 that's too large a change to undertake for 9.0, but it might be
 interesting to try in the future.  So I'd prefer that what we do for
 now touch as little code as possible so as to be easy to revert; hence
 I'm not wanting to change heap_insert's signature.

I don't think any of this affects pg_migrator, but if it does, please
let me know.  When I hear TOAST and OID used in the same sentence, my
ears perk up.  :-)

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-03 Thread Simon Riggs
On Wed, 2010-02-03 at 11:50 -0500, Tom Lane wrote:

 I've concluded that that's too large a change to undertake for 9.0

The purpose of this was to make the big changes in 9.0. If we aren't
going to do that it seems like we shouldn't bother at all.

So why not flip back to the easier approach of making something work for
HS only and then do everything you want to do in the next release? The
burst radius of the half-way changes you are proposing seems high in
comparison.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-03 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, 2010-02-03 at 11:50 -0500, Tom Lane wrote:
 I've concluded that that's too large a change to undertake for 9.0

 The purpose of this was to make the big changes in 9.0. If we aren't
 going to do that it seems like we shouldn't bother at all.

No, the purpose of this was to get rid of VACUUM FULL INPLACE in 9.0.
I'm not interested in destabilizing the code (even more) just to avoid
one small internal kluge.  The proposed "magic" field in struct Relation
is the only part of this that I'd foresee reverting later.

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Heikki Linnakangas
Tom Lane wrote:
 Hm ... do we want an LWLock per map file, or is one lock to rule them all
 sufficient?  LWLock per database seems problematic.  With an HW lock there
 wouldn't be a problem with that.  HW lock would allow concurrent updates of
 the map files of different DBs, but is that worth the extra cycles?

A single LWLock should be enough.

 Once the updated map file is moved into place, the relocation is effectively
 committed even if we subsequently abort the transaction.  We can make that
 window pretty narrow but not remove it completely.

We could include the instructions to update the map file in the commit
record, instead of introducing a new record type, and update the map
file only *after* writing the commit record. The map file doesn't grow,
so we can be pretty confident that updating it doesn't fail (failure
would lead to PANIC).

I'm assuming the map file is fixed size, with a fixed location for each
relation, so that we can just overwrite the old file without the
create+rename dance, and not worry about torn-pages.
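
For illustration, the overwrite could be as simple as this sketch (the
struct and the function are assumptions, with error handling abbreviated):

    /* Assumed fixed-size map, small enough to fit in one 512-byte sector */
    typedef struct RelMapEntry
    {
        Oid     reloid;         /* relation OID */
        Oid     relfilenode;    /* its current file */
    } RelMapEntry;

    static void
    overwrite_relmap(const char *path, const void *map, size_t nbytes)
    {
        int     fd = open(path, O_RDWR | PG_BINARY);    /* never O_CREAT */

        if (fd < 0)
            elog(PANIC, "could not open relation map \"%s\": %m", path);
        /* one sub-sector write: as atomic as a pg_control update */
        if (pwrite(fd, map, nbytes, 0) != (ssize_t) nbytes)
            elog(PANIC, "could not write relation map \"%s\": %m", path);
        if (pg_fsync(fd) != 0)
            elog(PANIC, "could not fsync relation map \"%s\": %m", path);
        close(fd);
    }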

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Simon Riggs
On Sun, 2010-01-31 at 22:49 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  I'll do a little work towards step (1) just so we can take a more
  informed view once you've had a better look at just what (2) involves.
 
 I spent a couple of hours reading code and believe that I've identified
 all the pain points for allowing relocation of system catalogs (ie,
 assigning them new relfilenodes during cluster-style VACUUM FULL,
 REINDEX, etc).  It's definitely not a trivial project but it's not out
 of reach either --- I estimate I could get it done in a couple or three
 days of full-time effort.  Given the potential benefits I think it's
 worth doing.  Rough notes are attached --- comments?

Comments inline.

 Change pg_class.relfilenode to be 0 for all rels for which a map file should
 be used instead.  Define RelationGetFilenode() to look to the physaddr info
 instead, and make all internal refs go to that instead of reading
 rd_rel->relfilenode directly.

Yes

 Define pg_relation_filenode(regclass) and fix
 external refs (oid2name, pg_dump) to use that instead.

Not sure why?

 In zeroth cut, just have relcache.c substitute the OID in place of reading
 a map file.  Possibly it should always do that during bootstrap.

Yes

 How should bootstrap decide which rels to insert zero for, anyway?
 Shared definitely, pg_class, ... maybe that's enough?  Or do we need
 it for all nailed local catalogs?

Only for nailed || shared. My submitted patch covers the normal
relations.

Put 0 for rel at boot, write files at boot also.

 local state contains
   * shared map list
   * this database's map list
   * list of local overrides to each on-disk map list
 
 At commit: if modified, write out new version; do this as late as possible
 before xact commit
 At abort: just discard local-override list

Yep

 Write must include generating a WAL entry as well as sending a shared-inval
 signal
   * write temp file, fsync it
   * emit WAL record containing copy of new file contents, fsync it
   * atomically rename temp file into place
   * send sinval

Yes

 During relation open, check for sinval before relying on current cached value
 of map contents

Yes

 Hm, what about concurrent updates of map?  

Why not have one file per relation that has a map file? No concurrency
issues at all then.

 Another issue for clustering a catalog is that sinval catcache signalling
 could get confused, since that depends on catalog entry TIDs which would
 change --- we'd likely need to have relocation send an sinval signal saying
 flush *everything* from that catalog.  (relcache inval on the catalog itself
 doesn't do that.)  

Yes

 This action could/should be transactional so no
 additional code is needed to propagate the notice to HS standbys.

Agreed

 Once the updated map file is moved into place, the relocation is effectively
 committed even if we subsequently abort the transaction.  We can make that
 window pretty narrow but not remove it completely.  Risk factors:
 * abort would try to remove relation files created by xact, in particular
 the new version of the catalog.  Ooops.  Can fix this by telling smgr to
 forget about removing those files before installing the new map file ---
 better to leak some disk space than destroy catalogs.
 * on abort, we'd not send sinval notices, leaving other backends at risk
 of using stale info (maybe our own too).  We could fix this by sending
 the sinval notice BEFORE renaming into place (if we send it and then fail
 to rename, no real harm done, just useless cache flushes).  This requires
 keeping a nontransactional-inval path in inval.c, but at least it's much
 more localized than before.  No problem for HS since we have the WAL record
 for map update to drive issuing sinvals on the slave side.  Note this
 means that readers need to take the mapfile lock in shared mode, since they
 could conceivably get the sinval notice before we complete the rename.
 
 For our own backend we need cache flushes at CCI and again at xact
 commit/abort.  This could be handled by regular transactional inval
 path but we end up with a lot of redundant flushing.  Maybe not worth
 worrying about though.

!

 Disallow catalog relocation inside subtransactions, to avoid having
 to handle subxact abort effects on the local-map-changes state.
 This could be implemented if desired, but doesn't seem worth it
 at least in first pass.

Agreed, not needed for emergency maintenance actions like this.


There are issues associated with performing actions on critical tables,
since in some cases I got "relation ... is already locked" errors. ISTM we'd
need some special handling of locking in those cases.

Toast tables are also a problem area because there are dependencies
between tables that need to be mangled.

Please bear in mind that I looked at the code also and came to a similar
original assessment. It was only when I started working with it that I
came to different conclusions. 


Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Greg Stark
On Mon, Feb 1, 2010 at 8:54 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Disallow catalog relocation inside subtransactions, to avoid having
 to handle subxact abort effects on the local-map-changes state.
 This could be implemented if desired, but doesn't seem worth it
 at least in first pass.

 Agreed, not needed for emergency maintenance actions like this.

Note that this would mean it will never work if you have psql's
ON_ERROR_ROLLBACK set.


-- 
greg



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Mon, Feb 1, 2010 at 8:54 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Disallow catalog relocation inside subtransactions, to avoid having
 to handle subxact abort effects on the local-map-changes state.
 This could be implemented if desired, but doesn't seem worth it
 at least in first pass.
 
 Agreed, not needed for emergency maintenance actions like this.

 Note that this would mean it will never work if you have psql's
 ON_ERROR_ROLLBACK set.

VACUUM has always failed in such a case, so I don't see this as a
showstopper.

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Simon Riggs
On Mon, 2010-02-01 at 10:06 -0500, Tom Lane wrote:

 the assumption that the file is less than one disk block,
 it should be just as atomic as pg_control updates are.

IIRC there were 173 relations affected by this. At 4 bytes each we would
have more than 512 bytes.

ISTM you need to treat some of those system relations just as normal
relations rather than give everybody a map entry.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Mon, 2010-02-01 at 10:06 -0500, Tom Lane wrote:
 the assumption that the file is less than one disk block,
 it should be just as atomic as pg_control updates are.

 IIRC there were 173 relations affected by this. At 4 bytes each we would
 have more than 512 bytes.

Where in the world did you get that number?

There are currently 29 shared relations (counting indexes), and 13
nailed local relations, which would go into a different map file.
I'm not sure if the set of local catalogs requiring the map treatment
is exactly the same as what's presently nailed, but that's probably
a good approximation.

At 8 bytes each (OID + relfilenode), we could fit 64 map entries in
a standard disk sector --- let's say 63 so there's room for a header.
So, barring more-than-doubling of the number of shared catalogs,
there's not going to be a problem.
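
To make the arithmetic concrete, a sketch of a sector-sized layout (the
names and header fields are assumptions):

    #define MAX_MAP_ENTRIES 63    /* 63 * 8 bytes + 8-byte header = 512 */

    typedef struct RelMapping
    {
        Oid     mapoid;         /* OID of a mapped catalog or index */
        Oid     mapfilenode;    /* its current relfilenode */
    } RelMapping;

    typedef struct RelMapFile
    {
        int32       magic;          /* format identifier */
        int32       num_mappings;   /* entries currently in use */
        RelMapping  mappings[MAX_MAP_ENTRIES];
    } RelMapFile;

    /* compile-time check: must fit in one disk sector to stay atomic */
    typedef char relmap_fits_in_sector[sizeof(RelMapFile) <= 512 ? 1 : -1];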

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Heikki Linnakangas
Tom Lane wrote:
 That seems too fragile to me, as I don't find it a stretch at all to
 think that writing the map file might fail --- just think Windows
 antivirus code :-(.  Now, once we have written the WAL record for
 the mapfile change, we can't really afford a failure in my approach
 either.  But I think a rename() after successfully creating/writing/
 fsync'ing a temp file is a whole lot safer than writing from a standing
 start.

My gut feeling is exactly the opposite. Creating and renaming a file
involves operations (and permissions) on the directory, while
overwriting a small file is just a simple write(). Especially if you
open() the file before doing anything irreversible.

 The other problem with what you sketch is that it'd require holding the
 mapfile write lock across commit, because we still have to have strict
 serialization of updates.

Why is the strict serialization of updates needed? To avoid overwriting
the file with stale contents in a race condition?

I was thinking that we only store the modified part in the WAL record.
Right after writing commit record, take the lock, read() the map file,
modify it in memory, write() it back, and release lock.

That means that there are no full images of the file in WAL records, which
makes me slightly uncomfortable from a disaster recovery point-of-view,
but we could keep a backup copy of the file in the data directory or
something if that's too scary otherwise.

 Maybe we should forget the
 rename() trick and overwrite the map file in place.  I still think it
 needs to be a separate WAL record though.  I'm thinking
 
   * obtain lock
   * open file for read/write
   * read current contents
   * construct modified contents
   * write and sync WAL record
   * write back file through already-opened descriptor
   * fsync
   * release lock
 
 Not totally clear if this is more or less safe than the rename method;
 but given the assumption that the file is less than one disk block,
 it should be just as atomic as pg_control updates are.

That doesn't solve the problem I was trying to solve, which is that if
the map file is rewritten, but the transaction later aborts, the map
file update has hit the disk already. That's why I wanted to stash it
into the commit record.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Tom Lane wrote:
 The other problem with what you sketch is that it'd require holding the
 mapfile write lock across commit, because we still have to have strict
 serialization of updates.

 Why is the strict serialization of updates needed? To avoid overwriting
 the file with stale contents in a race condition?

Exactly.

 I was thinking that we only store the modified part in the WAL record.
 Right after writing commit record, take the lock, read() the map file,
 modify it in memory, write() it back, and release lock.

 That means that there are no full images of the file in WAL records, which
 makes me slightly uncomfortable from a disaster recovery point-of-view,

Yeah, me too, which is another advantage of using a separate WAL entry.

 That doesn't solve the problem I was trying to solve, which is that if
 the map file is rewritten, but the transaction later aborts, the map
 file update has hit the disk already. That's why I wanted to stash it
 into the commit record.

The design I sketched doesn't require such an assumption anyway.  Once
the map file is written, the relocation is effective, commit or no.
As long as we restrict relocations to maintenance operations such as
VACUUM FULL, which have no transactionally significant results, this
doesn't seem like a problem.  What we do need is that after a CLUSTER
or V.F., which is going to relocate not only the rel but its indexes,
the relocations of the rel and its indexes have to all commit
atomically.  But saving up the transaction's map changes and applying
them in one write takes care of that.

(What I believe this means is that pg_class and its indexes have to all
be mapped, but I'm thinking right now that no other non-shared catalogs
need the treatment.)
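
A sketch of that save-up-and-apply-once idea (the list structure and the
helper names are purely illustrative):

    /* Map changes made by this transaction, not yet written to disk */
    typedef struct PendingMapChange
    {
        Oid     mapoid;         /* catalog or index being relocated */
        Oid     newfilenode;    /* its replacement relfilenode */
        struct PendingMapChange *next;
    } PendingMapChange;

    static PendingMapChange *pending_map_changes = NULL;

    /* Called once, as late as possible before commit */
    static void
    apply_map_changes_at_commit(void)
    {
        PendingMapChange *c;

        if (pending_map_changes == NULL)
            return;             /* nothing was relocated */
        LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
        load_relmap_file();     /* merge over anyone else's committed changes */
        for (c = pending_map_changes; c != NULL; c = c->next)
            update_relmap_entry(c->mapoid, c->newfilenode);
        write_relmap_file();    /* rel and indexes become visible atomically */
        LWLockRelease(RelationMappingLock);
        pending_map_changes = NULL;
    }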

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Simon Riggs
On Mon, 2010-02-01 at 10:27 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Mon, 2010-02-01 at 10:06 -0500, Tom Lane wrote:
  the assumption that the file is less than one disk block,
  it should be just as atomic as pg_control updates are.
 
  IIRC there were 173 relations affected by this. At 4 bytes each we would
  have more than 512 bytes.
 
 Where in the world did you get that number?
 
 There are currently 29 shared relations (counting indexes), and 13
 nailed local relations, which would go into a different map file.
 I'm not sure if the set of local catalogs requiring the map treatment
 is exactly the same as what's presently nailed, but that's probably
 a good approximation.

I was suggesting that we only do shared and nailed relations. Sounds
like you agree.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby and VACUUM FULL

2010-02-01 Thread Tom Lane
I wrote:
 The design I sketched doesn't require such an assumption anyway.  Once
 the map file is written, the relocation is effective, commit or no.
 As long as we restrict relocations to maintenance operations such as
 VACUUM FULL, which have no transactionally significant results, this
 doesn't seem like a problem.  What we do need is that after a CLUSTER
 or V.F., which is going to relocate not only the rel but its indexes,
 the relocations of the rel and its indexes have to all commit
 atomically.  But saving up the transaction's map changes and applying
 them in one write takes care of that.

BTW, I noticed a couple of other issues that need to be dealt with to
make that safe.  During CLUSTER/V.F. we typically try to update the
relation's relfilenode, relpages, reltuples, relfrozenxid (in
setNewRelfilenode) as well as its toastrelid (in swap_relation_files).
These are regular transactional updates to the pg_class tuple that will
fail to commit if the outer transaction rolls back.  However:

* For a mapped relation, both the old and new relfilenode will be zero,
so it doesn't matter.

* Possibly losing the updates of relpages and reltuples is not critical.

* For relfrozenxid, we can simply force the new and old values to be the
same rather than hoping to advance the value, if we're dealing with a
mapped relation.  Or just let it be; I think that losing an advance
of relfrozenxid wouldn't be critical either.

* We can not change the toast rel OID of a shared catalog -- there's no
way to propagate that into the other copies of pg_class.  So we need to
rejigger the logic for heap rewriting a little bit.  Toast rel swapping
has to be handled by swapping their relfilenodes not their OIDs.  This
is no big deal as far as cluster.c itself is concerned, but the tricky
part is that when we write new toasted values into the new toast rel,
the TOAST pointers going into the new heap have to be written with the
original toast-table OID value not the one that the transient target
toast rel has got.  This is doable but it would uglify the TOAST API a
bit I think.  Another possibility is to treat the toast rel OID of a
catalog as something that can be supplied by the map file.  Not sure
which way to jump.

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Heikki Linnakangas
Simon Riggs wrote:
 On Sat, 2010-01-30 at 15:17 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 The last item on my list before close is making VACUUM FULL and Hot
 Standby play nicely together.
 The options to do this were and still are:
 (1) Add WAL messages for non-transactional relcache invalidations
 (2) Allow system relations to be cluster-ed/vacuum full-ed.
 (1) was how we did it originally and I believe it worked without
 problem. We saw the opportunity to do (2) and it has been worth
 exploring.
 Refresh my memory on why (1) lets us avoid fixing (2)?  
 
 (1) allows us to retain VACUUM FULL INPLACE for system relations, thus
 avoiding the need to do (2). Non-transactional invalidations need to be
 replayed in recovery for the same reason they exist on the primary.
 
 It sounds like a kluge at best ...
 
 (2) isn't a necessary change right now. It is the best design going
 forwards, but its burst radius stretches far beyond Hot Standby. There
 is enough code in HS for us to support, so adding to it makes little
 sense for me, in this release, since there is no functional benefit in
 doing so.

Well, it'll avoid having to support the kludges in HS required for
VACUUM FULL INPLACE.

I'm in favor of (2), unless some unforeseen hard problems come up with
implementing the ideas discussed earlier. Yeah, that's pretty
vague, but basically I think it's worth spending some more time doing
(2) than doing (1). For one, if the plan is to do (2) in the next release
anyway, we might as well do it now and avoid having to support the
HS+VACUUM FULL INPLACE combination in only 9.0 stable branch for years
to come.

I believe we had a pretty well-thought out plan on how to support system
catalogs with the new VACUUM FULL.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Simon Riggs
On Sun, 2010-01-31 at 21:00 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Sat, 2010-01-30 at 15:17 -0500, Tom Lane wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  The last item on my list before close is making VACUUM FULL and Hot
  Standby play nicely together.
  The options to do this were and still are:
  (1) Add WAL messages for non-transactional relcache invalidations
  (2) Allow system relations to be cluster-ed/vacuum full-ed.
  (1) was how we did it originally and I believe it worked without
  problem. We saw the opportunity to do (2) and it has been worth
  exploring.
  Refresh my memory on why (1) lets us avoid fixing (2)?  
  
  (1) allows us to retain VACUUM FULL INPLACE for system relations, thus
  avoiding the need to do (2). Non-transactional invalidations need to be
  replayed in recovery for the same reason they exist on the primary.
  
  It sounds like a kluge at best ...
  
  (2) isn't a necessary change right now. It is the best design going
  forwards, but its burst radius stretches far beyond Hot Standby. There
  is enough code in HS for us to support, so adding to it makes little
  sense for me, in this release, since there is no functional benefit in
  doing so.
 
 Well, it'll avoid having to support the kludges in HS required for
 VACUUM FULL INPLACE.
 
 I'm in favor of (2), unless some unforeseen hard problems come up with
 implementing the ideas discussed earlier. Yeah, that's pretty
 vague, but basically I think it's worth spending some more time doing
 (2) than doing (1). For one, if the plan is to do (2) in the next release
 anyway, we might as well do it now and avoid having to support the
 HS+VACUUM FULL INPLACE combination in only 9.0 stable branch for years
 to come.

That's a good argument, but with respect, it isn't you who is writing
the code, nor will it be you that supports it, AIUI. Right now, HS is
isolated in many ways. If we introduce this change it will affect
everybody and that means I'll be investigating all manner of bug reports
that have zip to do with HS for a long time to come.

What I would say is that for 9.0 we can easily remove the INPLACE option
as an explicit request.

 I believe we had a pretty well-thought out plan on how to support system
 catalogs with the new VACUUM FULL.

I think calling it a "well thought out plan" is, err, overstating
things. We had what looked like a viable sketch of how to proceed and I
agreed to investigate that. Having done so, I'm saying I don't like what
I see going further and wish to backtrack to my known safe solution.

Overall, I don't see any benefit in pursuing that course any further. I
just see risk on balance with (2), whereas (1) seems easier/faster,
less risky and more isolated.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Sat, 2010-01-30 at 15:17 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 The options to do this were and still are:
 (1) Add WAL messages for non-transactional relcache invalidations
 (2) Allow system relations to be cluster-ed/vacuum full-ed.

 Refresh my memory on why (1) lets us avoid fixing (2)?  

 (1) allows us to retain VACUUM FULL INPLACE for system relations, thus
 avoiding the need to do (2). Non-transactional invalidations need to be
 replayed in recovery for the same reason they exist on the primary.

Well, I would expect that invalidation events need to be transmitted to
hot-standby slaves no matter what --- backends running queries on an HS
slave need to hear about inval events just as much as backends on the
master do.  So my take on it is that all inval events will have to have
associated WAL records when in HS mode, independently of what we choose
to do about VACUUM.

Anyway, it's still not apparent to me exactly what the connection is
between VACUUM FULL and Hot Standby.  I remember that we said HS didn't
work with VACUUM FULL (INPLACE) but I don't recall why that is, and the
links on the open-items pages are not leading me to any useful
discussion.

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Having said that I now realise a few things I didn't before:

 * Approach (2) affects the core of Postgres, even if you don't use Hot
 Standby.
 * I've had to remove 7 sanity checks to get the first few VACUUMs
 working. ISTM that removing various basic checks in the code is not a
 good thing. 
 * There are more special cases than I realised at first: temp,
 shared, with-toast, nailed, shared-and-nailed, pg_class, normal system.

Quite honestly, these statements and the attached patch (which doesn't
even begin to touch the central issue, but does indeed break a lot of
things) demonstrate that *you* are not the guy to implement what was
being discussed.  It needs to be done by someone who understands the
core caching code, which apparently you haven't studied in any detail.

I have a feeling that I should go do this...

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Simon Riggs
On Sun, 2010-01-31 at 14:35 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Sat, 2010-01-30 at 15:17 -0500, Tom Lane wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  The options to do this were and still are:
  (1) Add WAL messages for non-transactional relcache invalidations
  (2) Allow system relations to be cluster-ed/vacuum full-ed.
 
  Refresh my memory on why (1) lets us avoid fixing (2)?  
 
  (1) allows us to retain VACUUM FULL INPLACE for system relations, thus
  avoiding the need to do (2). Non-transactional invalidations need to be
  replayed in recovery for the same reason they exist on the primary.
 
 Well, I would expect that invalidation events need to be transmitted to
 hot-standby slaves no matter what --- backends running queries on an HS
 slave need to hear about inval events just as much as backends on the
 master do.  So my take on it is that all inval events will have to have
 associated WAL records when in HS mode, independently of what we choose
 to do about VACUUM.

All transactional invalidations are already handled by HS. It is the
non-transactional invalidations in VACUUM FULL INPLACE that still
require additional handling.

 Anyway, it's still not apparent to me exactly what the connection is
 between VACUUM FULL and Hot Standby.  I remember that we said HS didn't
 work with VACUUM FULL (INPLACE) but I don't recall why that is, and the
 links on the open-items pages are not leading me to any useful
 discussion.

Very little really; not enough to force the sort of changes that I am
now seeing will be required in the way catalogs and caches operate.
There was some difficulty around the fact that VFI issues two commits
for the same transaction, but that is now correctly handled in the code
after discussion.

I'm not known as a risk-averse person but (2) is a step too far for me.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Simon Riggs
On Sun, 2010-01-31 at 14:44 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  Having said that I now realise a few things I didn't before:
 
  * Approach (2) affects the core of Postgres, even if you don't use Hot
  Standby.
  * I've had to remove 7 sanity checks to get the first few VACUUMs
  working. ISTM that removing various basic checks in the code is not a
  good thing. 
  * There are more special cases than I realised at first: temp,
  shared, with-toast, nailed, shared-and-nailed, pg_class, normal system.
 
 Quite honestly, these statements and the attached patch (which doesn't
 even begin to touch the central issue, but does indeed break a lot of
 things) demonstrate that *you* are not the guy to implement what was
 being discussed.  It needs to be done by someone who understands the
 core caching code, which apparently you haven't studied in any detail.

I didn't claim the attached patch began to touch the issues. I was very
clear that it covered only the very simplest use case; that was the
point. You may not wish to acknowledge it, but I *have* studied the core
caching code in detail for many months and that is the basis for my
comments. The way I have written the exclusions in vacuum.c shows that I
have identified each of the sub-cases we are required to handle.

There is nothing wrong with your idea of using a mapping file. That is the
relatively easy part of the problem.

 I have a feeling that I should go do this...

If you wish, but I still think it is an unnecessary change for this
release, whoever does it. We both know that once you start you won't
stop, but that doesn't make it worthwhile or less risky.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Sun, 2010-01-31 at 14:35 -0500, Tom Lane wrote:
 Anyway, it's still not apparent to me exactly what the connection is
 between VACUUM FULL and Hot Standby.  I remember that we said HS didn't
 work with VACUUM FULL (INPLACE) but I don't recall why that is, and the

[ sorry, I meant not-INPLACE ]

 links on the open-items pages are not leading me to any useful
 discussion.

 Very little really; not enough to force the sort of changes that I am
 now seeing will be required in the way catalogs and caches operate.
 There was some difficulty around the fact that VFI issues two commits
 for the same transaction, but that is now correctly handled in the code
 after discussion.

If the only benefit of getting rid of VACUUM FULL were simplifying
Hot Standby, I'd agree with you.  But there are numerous other benefits.
The double-commit hack you mention is something we need to get rid of
for general system stability (because of the risk of PANIC if the vacuum
fails after the first commit).  Getting rid of REINDEX-in-place on
shared catalog indexes is another thing that's really safety critical.
Removing V-F related hacks in other places would just be a bonus.

It's something we need to do, so if Hot Standby is forcing our hands,
then let's just do it.

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Simon Riggs
On Sun, 2010-01-31 at 15:14 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Sun, 2010-01-31 at 14:35 -0500, Tom Lane wrote:
  Anyway, it's still not apparent to me exactly what the connection is
  between VACUUM FULL and Hot Standby.  I remember that we said HS didn't
  work with VACUUM FULL (INPLACE) but I don't recall why that is, and the
 
 [ sorry, I meant not-INPLACE ]
 
  links on the open-items pages are not leading me to any useful
  discussion.
 
  Very little really; not enough to force the sort of changes that I am
  now seeing will be required in the way catalogs and caches operate.
  There was some difficulty around the fact that VFI issues two commits
  for the same transaction, but that is now correctly handled in the code
  after discussion.
 
 If the only benefit of getting rid of VACUUM FULL were simplifying
 Hot Standby, I'd agree with you.  But there are numerous other benefits.
 The double-commit hack you mention is something we need to get rid of
 for general system stability (because of the risk of PANIC if the vacuum
 fails after the first commit).  Getting rid of REINDEX-in-place on
 shared catalog indexes is another thing that's really safety critical.
 Removing V-F related hacks in other places would just be a bonus.
 
 It's something we need to do, so if Hot Standby is forcing our hands,
 then let's just do it.

That's the point: Hot Standby is *not* forcing our hand to do this.

Doing this will not simplify Hot Standby in any significant way. The
code to support VFI with Hot Standby is, after technical review, much,
much simpler than the code to remove VFI.

I'll do a little work towards step (1) just so we can take a more
informed view once you've had a better look at just what (2) involves. I
had already written the code for the Sept release of HS.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Simon Riggs
On Sun, 2010-01-31 at 15:14 -0500, Tom Lane wrote:

 If the only benefit of getting rid of VACUUM FULL were simplifying
 Hot Standby, I'd agree with you.  But there are numerous other benefits.
 The double-commit hack you mention is something we need to get rid of
 for general system stability (because of the risk of PANIC if the vacuum
 fails after the first commit).  Getting rid of REINDEX-in-place on
 shared catalog indexes is another thing that's really safety critical.
 Removing V-F related hacks in other places would just be a bonus.

I should've agreed with this in my last post, cos I do. I want very,
very much to get rid of VACUUM FULL just because it's such a sump of
ugly, complex code. But there is a limit to how and when one performs what I
now see is a major surgical operation.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 I'll do a little work towards step (1) just so we can take a more
 informed view once you've had a better look at just what (2) involves.

I spent a couple of hours reading code and believe that I've identified
all the pain points for allowing relocation of system catalogs (ie,
assigning them new relfilenodes during cluster-style VACUUM FULL,
REINDEX, etc).  It's definitely not a trivial project but it's not out
of reach either --- I estimate I could get it done in a couple or three
days of full-time effort.  Given the potential benefits I think it's
worth doing.  Rough notes are attached --- comments?

regards, tom lane


Change pg_class.relfilenode to be 0 for all rels for which a map file should
be used instead.  Define RelationGetFilenode() to look to the physaddr info
instead, and make all internal refs go to that instead of reading
rd_rel->relfilenode directly.  Define pg_relation_filenode(regclass) and fix
external refs (oid2name, pg_dump) to use that instead.
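
As a sketch, the accessor could be something like this (the cached-map
field name is an assumption):

    /* Use pg_class.relfilenode if set; else the relcache's copy of the
     * map entry, filled in from the map file at relation open */
    #define RelationGetFilenode(relation) \
        (OidIsValid((relation)->rd_rel->relfilenode) ? \
         (relation)->rd_rel->relfilenode : \
         (relation)->rd_mapped_filenode)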

In zeroth cut, just have relcache.c substitute the OID in place of reading
a map file.  Possibly it should always do that during bootstrap.

How should bootstrap decide which rels to insert zero for, anyway?
Shared definitely, pg_class, ... maybe that's enough?  Or do we need
it for all nailed local catalogs?

local state contains
* shared map list
* this database's map list
* list of local overrides to each on-disk map list

At commit: if modified, write out new version; do this as late as possible
before xact commit
At abort: just discard local-override list

Write must include generating a WAL entry as well as sending a shared-inval
signal
* write temp file, fsync it
* emit WAL record containing copy of new file contents, fsync it
* atomically rename temp file into place
* send sinval
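
As a sketch in code, with the record IDs, flags, and the sinval call all
assumed for illustration:

    static void
    perform_relmap_update(const RelMapFile *newmap, const char *mappath)
    {
        char        tmppath[MAXPGPATH];
        XLogRecData rdata;
        XLogRecPtr  lsn;
        int         fd;

        /* 1. write temp file, fsync it */
        snprintf(tmppath, sizeof(tmppath), "%s.tmp", mappath);
        fd = open(tmppath, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY, 0600);
        if (fd < 0 ||
            write(fd, newmap, sizeof(RelMapFile)) != (ssize_t) sizeof(RelMapFile))
            elog(PANIC, "could not write \"%s\": %m", tmppath);
        pg_fsync(fd);
        close(fd);

        /* 2. WAL record carrying a full copy of the new contents; fsync WAL */
        rdata.data = (char *) newmap;
        rdata.len = sizeof(RelMapFile);
        rdata.buffer = InvalidBuffer;
        rdata.buffer_std = false;
        rdata.next = NULL;
        lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE, &rdata);
        XLogFlush(lsn);

        /* 3. atomically rename temp file into place */
        if (rename(tmppath, mappath) != 0)
            elog(PANIC, "could not rename \"%s\" to \"%s\": %m",
                 tmppath, mappath);

        /* 4. tell everyone else to re-read the map */
        CacheInvalidateRelmap();    /* assumed sinval message type */
    }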

During relation open, check for sinval before relying on current cached value
of map contents

Hm, what about concurrent updates of map?  Probably instantiate a new
LWLock or maybe better a HW lock.  So write involves taking lock, check
for updates, finally write --- which means that local modifications
have to be stored in a way that allows re-reading somebody else's mods.
Hence above data structure with separate storage of local modifications.
We assume rel-level locking protects us from concurrent update attempts
on the same map entry, but we don't want to forbid concurrent relocations
of different catalogs.

LWLock would work if we do an unconditional read of the file contents after
getting lock rather than relying on sinval signaling, which seems a small
price considering map updates should be infrequent.  Without that, writers
have to hold the lock till commit which rules out using LWLock.

Hm ... do we want an LWLock per map file, or is one lock to rule them all
sufficient?  LWLock per database seems problematic.  With an HW lock there
wouldn't be a problem with that.  HW lock would allow concurrent updates of
the map files of different DBs, but is that worth the extra cycles?

In a case where a transaction relocates pg_class (or another mapped catalog)
and then makes updates in that catalog, all is well in the sense that the
updates will be preserved iff the xact commits.  HOWEVER we would have
problems during WAL replay?  No, because the WAL entries will command updates
to the catalog's new relfilenode, which will be interesting to anyone else if
and only if the xact gets to commit.  We would need to be careful about the
case of relocating pg_class itself though --- make sure local relcache
references the new pg_class before any such updates happen.  Probably a CCI
is sufficient.

Another issue for clustering a catalog is that sinval catcache signalling
could get confused, since that depends on catalog entry TIDs which would
change --- we'd likely need to have relocation send an sinval signal saying
flush *everything* from that catalog.  (relcache inval on the catalog itself
doesn't do that.)  This action could/should be transactional so no
additional code is needed to propagate the notice to HS standbys.

Once the updated map file is moved into place, the relocation is effectively
committed even if we subsequently abort the transaction.  We can make that
window pretty narrow but not remove it completely.  Risk factors:
* abort would try to remove relation files created by xact, in particular
the new version of the catalog.  Ooops.  Can fix this by telling smgr to
forget about removing those files before installing the new map file ---
better to leak some disk space than destroy catalogs.
* on abort, we'd not send sinval notices, leaving other backends at risk
of using stale info (maybe our own too).  We could fix this by sending
the sinval notice BEFORE renaming into place (if we send it and then fail
to rename, no real harm done, just useless cache flushes).  This requires
keeping a nontransactional-inval path in inval.c, but at least it's much
more localized than before.

Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-31 Thread Bruce Momjian

FYI, getting rid of VACUUM FULL in a .0 release does make more sense
than doing it in a .1 release.

---

Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  I'll do a little work towards step (1) just so we can take a more
  informed view once you've had a better look at just what (2) involves.
 
 I spent a couple of hours reading code and believe that I've identified
 all the pain points for allowing relocation of system catalogs (ie,
 assigning them new relfilenodes during cluster-style VACUUM FULL,
 REINDEX, etc).  It's definitely not a trivial project but it's not out
 of reach either --- I estimate I could get it done in a couple or three
 days of full-time effort.  Given the potential benefits I think it's
 worth doing.  Rough notes are attached --- comments?
 
   regards, tom lane
 
 
 Change pg_class.relfilenode to be 0 for all rels for which a map file should
 be used instead.  Define RelationGetFilenode() to look to the physaddr info
 instead, and make all internal refs go to that instead of reading
 rd_rel->relfilenode directly.  Define pg_relation_filenode(regclass) and fix
 external refs (oid2name, pg_dump) to use that instead.
 
 In zeroth cut, just have relcache.c substitute the OID in place of reading
 a map file.  Possibly it should always do that during bootstrap.
 
 How should bootstrap decide which rels to insert zero for, anyway?
 Shared definitely, pg_class, ... maybe that's enough?  Or do we need
 it for all nailed local catalogs?
 
 local state contains
   * shared map list
   * this database's map list
   * list of local overrides to each on-disk map list
 
 At commit: if modified, write out new version; do this as late as possible
 before xact commit
 At abort: just discard local-override list
 
 Write must include generating a WAL entry as well as sending a shared-inval
 signal
   * write temp file, fsync it
   * emit WAL record containing copy of new file contents, fsync it
   * atomically rename temp file into place
   * send sinval
 
 During relation open, check for sinval before relying on current cached value
 of map contents
 
 Hm, what about concurrent updates of map?  Probably instantiate a new
 LWLock or maybe better a HW lock.  So write involves taking lock, check
 for updates, finally write --- which means that local modifications
 have to be stored in a way that allows re-reading somebody else's mods.
 Hence above data structure with separate storage of local modifications.
 We assume rel-level locking protects us from concurrent update attempts
 on the same map entry, but we don't want to forbid concurrent relocations
 of different catalogs.
 
 LWLock would work if we do an unconditional read of the file contents after
 getting lock rather than relying on sinval signaling, which seems a small
 price considering map updates should be infrequent.  Without that, writers
 have to hold the lock till commit which rules out using LWLock.
 
 Hm ... do we want an LWLock per map file, or is one lock to rule them all
 sufficient?  LWLock per database seems problematic.  With an HW lock there
 wouldn't be a problem with that.  HW lock would allow concurrent updates of
 the map files of different DBs, but is that worth the extra cycles?
 
 In a case where a transaction relocates pg_class (or another mapped catalog)
 and then makes updates in that catalog, all is well in the sense that the
 updates will be preserved iff the xact commits.  HOWEVER we would have
 problems during WAL replay?  No, because the WAL entries will command updates
 to the catalog's new relfilenode, which will be interesting to anyone else if
 and only if the xact gets to commit.  We would need to be careful about the
 case of relocating pg_class itself though --- make sure local relcache
 references the new pg_class before any such updates happen.  Probably a CCI
 is sufficient.
 
 Another issue for clustering a catalog is that sinval catcache signalling
 could get confused, since that depends on catalog entry TIDs which would
 change --- we'd likely need to have relocation send an sinval signal saying
 flush *everything* from that catalog.  (relcache inval on the catalog itself
 doesn't do that.)  This action could/should be transactional so no
 additional code is needed to propagate the notice to HS standbys.
 
 Once the updated map file is moved into place, the relocation is effectively
 committed even if we subsequently abort the transaction.  We can make that
 window pretty narrow but not remove it completely.  Risk factors:
 * abort would try to remove relation files created by xact, in particular
 the new version of the catalog.  Ooops.  Can fix this by telling smgr to
 forget about removing those files before installing the new map file ---
 better to leak some disk space than destroy catalogs.
 * on abort, we'd not send sinval notices, leaving other backends at risk
 of using stale info (maybe our own too).

[HACKERS] Hot Standby and VACUUM FULL

2010-01-30 Thread Simon Riggs

The last item on my list before close is making VACUUM FULL and Hot
Standby play nicely together.

The options to do this were and still are:

(1) Add WAL messages for non-transactional relcache invalidations
(2) Allow system relations to be cluster-ed/vacuum full-ed.

(1) was how we did it originally and I believe it worked without
problem. We saw the opportunity to do (2) and it has been worth
exploring.

My approach to (2) was to look at this design-wise. Given my experience
with other aspects of relcache and invalidation, it all looked do-able
without problem. The design seems straightforward in a few ways, though
it had a few special cases.

I attach a WIP patch that is sufficient to do psql -c "VACUUM FULL
pg_am;" successfully, it being the easiest case out of the special
cases. It's also easy to remove the special case-avoidance code in
vacuum.c to see the various failures that occur without further work.
I've added tests to cover the new cases, which currently cause make
check to fail solely because I haven't updated the test output, since
this is a WIP.

I'm not aware of any specific technical blockers to continuing with (2).

Having said that I now realise a few things I didn't before:

* Approach (2) affects the core of Postgres, even if you don't use Hot
Standby.
* I've had to remove 7 sanity checks to get the first few VACUUMs
working. ISTM that removing various basic checks in the code is not a
good thing. 
* There are more special cases than I realised at first: temp,
shared, with-toast, nailed, shared-and-nailed, pg_class, normal system.

Taken together, ISTM that the benefits of progressing towards (2) are
not worth the potential burst radius of any problems introduced. With
the larger number of special cases and the removal of checks we may be
less able to spot or cope with failure. Given that people are getting
edgy about code instability, I would say there's likely to be some here.

My feeling is that we should return to approach (1) for Postgres 9.0,
since we have an approach that works and isolates any change to just Hot
Standby related codepaths. I'm not personally keen to introduce core
changes unrelated to the new feature (HS).

Your thoughts, please.

-- 
 Simon Riggs   www.2ndQuadrant.com
Index: src/backend/catalog/index.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/catalog/index.c,v
retrieving revision 1.331
diff -c -r1.331 index.c
*** src/backend/catalog/index.c	22 Jan 2010 16:40:18 -0000	1.331
--- src/backend/catalog/index.c	30 Jan 2010 14:59:11 -0000
***************
*** 600,614 ****
  				 errmsg_internal("concurrent index creation for exclusion constraints is not supported")));
  
  	/*
- 	 * We cannot allow indexing a shared relation after initdb (because
- 	 * there's no way to make the entry in other databases' pg_class).
- 	 */
- 	if (shared_relation && !IsBootstrapProcessingMode())
- 		ereport(ERROR,
- 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- 				 errmsg("shared indexes cannot be created after initdb")));
- 
- 	/*
  	 * Validate shared/non-shared tablespace (must check this before doing
  	 * GetNewRelFileNode, to prevent Assert therein)
  	 */
--- 600,605 ----
Index: src/backend/catalog/toasting.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/catalog/toasting.c,v
retrieving revision 1.28
diff -c -r1.28 toasting.c
*** src/backend/catalog/toasting.c	28 Jan 2010 23:21:11 -0000	1.28
--- src/backend/catalog/toasting.c	30 Jan 2010 14:59:28 -0000
***************
*** 127,141 ****
  
  	/*
  	 * Toast table is shared if and only if its parent is.
- 	 *
- 	 * We cannot allow toasting a shared relation after initdb (because
- 	 * there's no way to mark it toasted in other databases' pg_class).
  	 */
  	shared_relation = rel->rd_rel->relisshared;
- 	if (shared_relation && !IsBootstrapProcessingMode())
- 		ereport(ERROR,
- 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- 				 errmsg("shared tables cannot be toasted after initdb")));
  
  	/*
  	 * Is it already toasted?
--- 127,134 ----
Index: src/backend/commands/cluster.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.195
diff -c -r1.195 cluster.c
*** src/backend/commands/cluster.c	28 Jan 2010 23:21:11 -0000	1.195
--- src/backend/commands/cluster.c	30 Jan 2010 17:28:17 -0000
***************
*** 376,394 ****
  	Relation	OldIndex;
  
  	/*
- 	 * Disallow clustering system relations.  This will definitely NOT work
- 	 * for shared relations (we have no way to update pg_class rows in other
- 	 * databases), nor for nailed-in-cache relations (the relfilenode values
- 	 * for those are hardwired, see relcache.c).  It might work for other
- 	 * system relations, but I ain't gonna risk it.
- 	 */
- 	if (IsSystemRelation(OldHeap))
- 		

Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-30 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 The last item on my list before close is making VACUUM FULL and Hot
 Standby play nicely together.

 The options to do this were and still are:

 (1) Add WAL messages for non-transactional relcache invalidations
 (2) Allow system relations to be cluster-ed/vacuum full-ed.

 (1) was how we did it originally and I believe it worked without
 problem. We saw the opportunity to do (2) and it has been worth
 exploring.

Refresh my memory on why (1) lets us avoid fixing (2)?  It sounds
like a kluge at best ...

regards, tom lane



Re: [HACKERS] Hot Standby and VACUUM FULL

2010-01-30 Thread Simon Riggs
On Sat, 2010-01-30 at 15:17 -0500, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  The last item on my list before close is making VACUUM FULL and Hot
  Standby play nicely together.
 
  The options to do this were and still are:
 
  (1) Add WAL messages for non-transactional relcache invalidations
  (2) Allow system relations to be cluster-ed/vacuum full-ed.
 
  (1) was how we did it originally and I believe it worked without
  problem. We saw the opportunity to do (2) and it has been worth
  exploring.
 
 Refresh my memory on why (1) lets us avoid fixing (2)?  

(1) allows us to retain VACUUM FULL INPLACE for system relations, thus
avoiding the need to do (2). Non-transactional invalidations need to be
replayed in recovery for the same reason they exist on the primary.

 It sounds like a kluge at best ...

(2) isn't a necessary change right now. It is the best design going
forwards, but its burst radius stretches far beyond Hot Standby. There
is enough code in HS for us to support, so adding to it makes little
sense for me, in this release, since there is no functional benefit in
doing so.

-- 
 Simon Riggs   www.2ndQuadrant.com

