Re: [PATCHES] HOT patch - version 14

2007-09-03 Thread Heikki Linnakangas
Tom Lane wrote:
 Pavan Deolasee [EMAIL PROTECTED] writes:
 Please see the version 14 of HOT patch attached.
 
 I expected to find either a large new README, or some pretty substantial
 additions to existing README files, to document how this all works.

Here's an updated version of the README I posted earlier. It now
reflects the changes to how pruning works.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Use case
--------

The best use case for HOT is a table that's frequently UPDATEd, and is large
enough that VACUUM is painful. On small tables that fit in cache, running 
VACUUM every few minutes isn't a problem.

Heap-only tuples
----------------

When a HOT update is performed, the new tuple is placed on the same page as the 
old one and marked with the HEAP_ONLY_TUPLE flag. HEAP_ONLY_TUPLE means that 
there are no index pointers to the tuple, which allows pruning the chain in the 
future. The old tuple is marked with the HEAP_HOT_UPDATE flag, which means that 
the tuple pointed to by t_ctid is a heap-only tuple. That needs to be taken 
into account when vacuuming, so that we don't remove the root tuple of an 
update chain while there are no index pointers to the later tuples.

When doing an index scan, whenever we reach a non-visible tuple, we need to 
check whether the tuple has been HOT-updated (== HEAP_HOT_UPDATE flag is set). 
If so, we need to follow the t_ctid pointer until we reach a visible tuple, or 
one that hasn't been HOT-updated.
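
As a rough sketch of that loop, using the flag names from this README
(HeapTupleSatisfiesVisibility is the real visibility test; the fetch_tuple_at
helper and the loop shape are illustrative, not the patch's actual code):

    /* Follow a HOT chain from 'tuple' to the version visible to 'snapshot'.
     * Returns NULL if no member of the chain is visible. */
    for (;;)
    {
        if (HeapTupleSatisfiesVisibility(tuple, snapshot, buffer))
            return tuple;                    /* found the visible version */
        if (!(tuple->t_data->t_infomask & HEAP_HOT_UPDATE))
            return NULL;                     /* end of chain, nothing visible */
        /* step to the next chain member via t_ctid (always on this page) */
        offnum = ItemPointerGetOffsetNumber(&tuple->t_data->t_ctid);
        tuple = fetch_tuple_at(page, offnum);    /* hypothetical helper */
    }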

Sequential scans (and bitmap heap scans with a lossy bitmap) don't need to pay 
attention to the flags.

The prerequisite for doing a HOT update is that none of the indexed columns 
change. That's checked at execution time by comparing the binary 
representation of the old and new values. This means that dummy updates, like 
"UPDATE foo SET col1 = ?" where ? is the same as the old value, can be HOT.
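
A minimal sketch of that test (foreach and lfirst_int are the existing list
API; the two helpers are hypothetical):

    static bool
    hot_update_possible(Relation rel, HeapTuple oldtup, HeapTuple newtup)
    {
        List       *indexed_attnos = get_indexed_attnos(rel);  /* hypothetical */
        ListCell   *lc;

        foreach(lc, indexed_attnos)
        {
            int     attnum = lfirst_int(lc);

            /* compare the binary representation of old and new values */
            if (!attr_binary_equal(rel, attnum, oldtup, newtup))  /* hypothetical */
                return false;
        }
        return true;        /* no indexed column changed; HOT is allowed */
    }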

In addition to that prerequisite, there needs to be room on the page for the 
new tuple. 
If the page is full, we try to make room by pruning the page.

Pruning
-------
Pruning is a lightweight vacuum operation that can be run on a single page, 
with no need to scan indexes, but it only removes dead HOT tuples outright. 
Other dead tuples are truncated, leaving only a redirected dead line pointer. 
The removed tuples are compacted away using PageRepairFragmentation, as in 
normal vacuum. There are two reasons to prune a page: to make room on the page 
for future updates, and to shorten HOT chains to make index lookups cheaper.

When accessing a page that has HOT-updated tuples on it and less than a 
certain threshold of free space, we try to prune it. To do that, we need to 
take a vacuum-strength lock on the buffer. If that fails, we don't prune; the 
theory is that you usually do get the lock, and if you don't, you'll get to 
try again next time. It would be more logical to do the pruning in heap_update 
when the page is full, but by the time we get there we have already pinned the 
page and hold references to tuples on it, so we can't start moving tuples 
around. Also, that alone wouldn't address the desire to keep HOT chains short, 
to avoid the overhead of traversing long chains on index lookups.
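
In outline, the opportunistic prune might look like this (prune_threshold and
the heap_page_prune signature are assumptions; PageGetFreeSpace and a
conditional form of the vacuum-strength lock such as
ConditionalLockBufferForCleanup are the real page/buffer API):

    if (PageGetFreeSpace(page) < prune_threshold)   /* hypothetical threshold */
    {
        /* try for the vacuum-strength lock, but never wait for it */
        if (ConditionalLockBufferForCleanup(buffer))
        {
            heap_page_prune(relation, buffer, OldestXmin);
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        }
        /* otherwise just skip it; we'll get another chance next time */
    }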

To reclaim the index-visible (i.e. first) tuple in a HOT chain, the line 
pointer is turned into a redirecting line pointer that points to the line 
pointer of the next tuple in the chain.
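
Something along these lines, assuming an ItemIdSetRedirect-style macro over
the reworked line pointer flags:

    ItemId  rootlp = PageGetItemId(page, rootoffnum);

    /* The root line pointer no longer points at tuple storage; it now
     * redirects to the line pointer of the next tuple in the chain. */
    ItemIdSetRedirect(rootlp, nextoffnum);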

When the last live tuple in an update chain becomes dead (after a DELETE or a 
cold update), the redirecting line pointer is marked as redirected dead. That 
allows us to immediately reuse the space, sans the line pointer itself. We've 
effectively resurrected the "truncate dead tuples to just line pointers" idea 
that has been proposed and rejected before because of fear of line pointer 
bloat. To limit the damage in the worst case, and to keep the numerous arrays 
as well as the bitmaps in bitmap scans reasonably sized, the maximum number of 
line pointers (MaxHeapTuplesPerPage) is somewhat arbitrarily capped at 2 * 
what it was before.

VACUUM FULL
-----------
To make VACUUM FULL work, any DEAD tuples in the middle of an update chain 
need to be removed (see comments at the top of heap_prune_hotchain_hard for 
details). VACUUM FULL performs a more aggressive pruning that not only removes 
dead tuples at the beginning of an update chain, but also scans the whole 
chain and removes any intermediate dead tuples.

Vacuum
------
There aren't many changes to regular vacuum. It removes dead HOT tuples, like 
pruning does, and cleans up any redirected dead line pointers.

In lazy vacuum, we must not freeze a tuple that's in the middle of an update 
chain. That can happen when a tuple has xmin < xmax; it's the same scenario 
that requires hard pruning in VACUUM FULL. Freezing such tuples would break 
the check that xmin and xmax match when following the chain. It's not a 
problem without HOT, because the preceding tuple in the chain must be dead as 
well, so no one will try to follow the chain, but with HOT the 

Re: [PATCHES] HOT patch - version 14

2007-08-31 Thread Pavan Deolasee
On 8/31/07, Pavan Deolasee [EMAIL PROTECTED] wrote:



 In fact, now that I think about it there is no other
 fundamental reason to not support HOT on system tables. So we
 can very well do what you are suggesting.



On second thought, I wonder if there is really much to gain by
supporting HOT on system tables and whether it would justify all
the complexity. Initially I thought about CatalogUpdateIndexes, to
which we would need to teach HOT. Later I also got worried about
building the HOT attribute lists for system tables and handling
all the corner cases for bootstrapping and catalog REINDEX.
It might turn out to be straightforward, but I am not able to
establish that with my limited knowledge of the area.

I would still vote for disabling HOT on catalogs unless you see
strong value in it.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 14

2007-08-31 Thread Decibel!
On Fri, Aug 31, 2007 at 12:53:51PM +0530, Pavan Deolasee wrote:
 On 8/31/07, Pavan Deolasee [EMAIL PROTECTED] wrote:
 
 
 
  In fact, now that I think about it there is no other
  fundamental reason to not support HOT on system tables. So we
  can very well do what you are suggesting.
 
 
 
 On second thought, I wonder if there is really much to gain by
 supporting HOT on system tables and whether it would justify all
 the complexity. Initially I thought about CatalogUpdateIndexes to
 which we need to teach HOT. Later I also got worried about
 building the HOT attribute lists for system tables and handling
 all the corner cases for bootstrapping and catalog REINDEX.
 It might turn out to be straight forward, but I am not able to
 establish that with my limited knowledge in the area.
 
 I would still vote for disabling HOT on catalogs unless you see
 strong value in it.

What about ANALYZE? Doesn't that do a lot of updates?

BTW, I'm 100% in favor of pushing system catalog HOT until later; it'd
be silly to risk not getting HOT in 8.3 because of catalog HOT.
-- 
Decibel!, aka Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)




Re: [PATCHES] HOT patch - version 14

2007-08-31 Thread Tom Lane
Decibel! [EMAIL PROTECTED] writes:
 BTW, I'm 100% in favor of pushing system catalog HOT until later; it'd
 be silly to risk not getting HOT in 8.3 because of catalog HOT.

I see this the other way around: if it doesn't work on system catalogs,
it probably doesn't work, period.  I'm not in favor of treating the
catalogs differently.

regards, tom lane



Re: [PATCHES] HOT patch - version 14

2007-08-31 Thread Pavan Deolasee
On 9/1/07, Tom Lane [EMAIL PROTECTED] wrote:


 I see this the other way around: if it doesn't work on system catalogs,
 it probably doesn't work, period.  I'm not in favor of treating the
 catalogs differently.


Now that I hear you, I know what to do next :-)

I don't think there is any fundamental problem with system catalogs;
it's only the additional complexity that I was worried about. Anyway,
I will rework things as per your suggestion. And I take your point that
making it work on all tables will give us more confidence in the code.

Thanks,
Pavan

P.S. Next week is bad for me :-(  I am on vacation on Thursday/Friday,
and for the remaining days I may not be able to spend extra cycles
apart from regular working hours.

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Pavan Deolasee
On 8/30/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:
  Please see the version 14 of HOT patch attached.

 I expected to find either a large new README, or some pretty substantial
 additions to existing README files, to document how this all works.
 The comments included do not represent nearly enough documentation.


I shall take that up. There are a couple of documents posted by Heikki
and Greg, apart from several email threads and the original design doc
by Simon. I shall consolidate everything into a single README.


 One thing I was unable to discern from the comments is how CREATE INDEX
 can work at all.  A new index might mean that tuples that could formerly
 legally be part of the same hot-chain no longer can.  I can't find any
 code here that looks like it's addressing that.



You are right - a new index might mean that an existing HOT chain
is broken as far as the new index is concerned. The way we address
that is by indexing the root tuple of the chain, but the index key is
extracted from the last tuple in the chain. The index is marked unusable
for all those existing transactions which can potentially see any
intermediate tuples in the chain.

Please see this document written by Greg Stark. He has nicely summarized
how CREATE INDEX and CREATE INDEX CONCURRENTLY work with HOT.

http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php


 I also don't think I
 believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
 tuples: what if the index completion commits, but the concurrent delete
 rolls back?  Then you've got a valid tuple that's not in the index.



Since CREATE INDEX works with a ShareLock on the relation, we can
safely assume that we can't find any DELETE_IN_PROGRESS tuples except
those deleted by our own transaction. The only exception is system
relations, but we don't do HOT updates on system relations. Given that,
it seems OK to me to ignore these tuples, because if the transaction
aborts, CREATE INDEX is aborted as well. Am I overlooking something here?
There is a comment to this effect in the current code as well.


 I also took this opportunity to remove the modularity invasion caused
  by heap_check_idxupdate() since it was using resultRelInfo. We now
  build the list of attributes that must be checked to satisfy HOT update.
  This list includes all the index columns, columns in the partial index
  predicates and expression index expressions and is built in the
  executor.

 The executor is the wrong place for that: I'm not sure why you think
 that making heapam depend on the executor is a modularity improvement.



Earlier (in the previous version of the HOT patch) we were passing
ResultRelInfo to heap_update, which was even uglier. The comment is in
that context :-)


 Furthermore this approach requires recalculating the list during
 each query, which is wasteful when it could only change during schema
 updates.  I'd suggest making the relcache responsible for computing and
 saving this data, along the same lines as RelationGetIndexList().
 (That also means that heapam can get the data from the relcache, saving
 a lot of API-bloating from passing around Attrids explicitly.)
 Also, rather than inventing Attrids, I'd suggest just using one
 Bitmapset with the convention that its contents are offset by
 FirstLowInvalidHeapAttributeNumber.



I liked all these suggestions. I know I thought about computing the
attribute list in the relcache, but probably avoided it to keep things
simple. I shall make these changes.
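
For reference, a sketch of the convention Tom describes (bms_add_member,
bms_is_member and FirstLowInvalidHeapAttributeNumber are the existing API;
the variable names are made up). System attributes have negative attribute
numbers, so the offset makes them valid bitmap members:

    Bitmapset  *indexattrs = NULL;

    /* collect each attribute used by any index on the relation */
    indexattrs = bms_add_member(indexattrs,
                                attnum - FirstLowInvalidHeapAttributeNumber);

    /* ... later, test whether a modified column is indexed */
    if (bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber, indexattrs))
        return false;       /* column is indexed, update can't be HOT */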


 The redefinition of the value of MaxHeapTuplesPerPage seems pretty
 ugly and unprincipled.  I think it'd be better to leave it as-is,
 and just enforce that we don't allow more than that many line pointers
 on a heap page (as indeed you have to do anyway, so it's not clear
 what the change is buying).



The only reason to redefine MaxHeapTuplesPerPage on the higher side is
that HOT allows us to accommodate more tuples per page because of
redirect-dead line pointers. For a table with sufficiently large tuples
the original bound would work well, but for very small tuples we might
hit the line pointer limit even if there is free space available on the
page. Doubling the value was chosen as a balance between heap page
utilization, line pointer bloat, and the overhead for bitmap scans. But
I agree that the factor choice is rather arbitrary.



 Is it really necessary to add hot_update to xl_heap_update?  Seems the
 information should be available from the tuple header fields.



I think it's not necessary. The reason I did it that way is that the new
block might be backed up in the WAL record, in which case the new tuple's
infomask is not separately recorded. But we can surely handle that corner
case. I shall fix this.


 Have you demonstrated that the prune_hard logic is worth its weight?
 Considering that many of us want to drop VACUUM FULL, adding more
 complexity to it doesn't seem like a profitable use of time.



The prune_hard code is a lot simpler 

Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 Pavan Deolasee [EMAIL PROTECTED] writes:
 Please see the version 14 of HOT patch attached.

 I expected to find either a large new README, or some pretty substantial
 additions to existing README files, to document how this all works.
 The comments included do not represent nearly enough documentation.

The Heikki and I posted a two-part README of sorts:

http://archives.postgresql.org/pgsql-patches/2007-07/msg00142.php
http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

 One thing I was unable to discern from the comments is how CREATE INDEX
 can work at all.  A new index might mean that tuples that could formerly
 legally be part of the same hot-chain no longer can.  I can't find any
 code here that looks like it's addressing that.  

This is one of the weirdest things in HOT and one of the hardest problems it
faced. The best solution proposed was to index the head of the HOT chain and
just ban older transactions that might be able to see the older versions from
using the index.

That's the purpose of the indcreatexid column. It's set in
index_set_createxid() and checked in get_relation_info().
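
Roughly, the check in get_relation_info() could amount to this (the field
and flag names are assumptions; TransactionIdIsValid and
TransactionIdPrecedes are the real transaction API):

    /* If our snapshot might still see tuple versions that predate the
     * index, this transaction must not use the index (sketch). */
    if (TransactionIdIsValid(index->indcreatexid) &&
        TransactionIdPrecedes(snapshot->xmin, index->indcreatexid))
        info->indexUnusable = true;     /* hypothetical flag */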

 I also don't think I believe the reasoning for not indexing
 DELETE_IN_PROGRESS hot-updated tuples: what if the index completion commits,
 but the concurrent delete rolls back? Then you've got a valid tuple that's
 not in the index.

You're talking about the concurrent index build case? Wouldn't the second pass
pick up that tuple? I have to look back at it to see for sure.

 The redefinition of the value of MaxHeapTuplesPerPage seems pretty
 ugly and unprincipled.  I think it'd be better to leave it as-is,
 and just enforce that we don't allow more than that many line pointers
 on a heap page (as indeed you have to do anyway, so it's not clear
 what the change is buying).

Note that in a heavily updated table basically every tuple will have two line
pointers: the head, which just redirects to another line pointer, and the line
pointer for the actual tuple. It's hard to get rid of the head line
pointer without declaring that the tid might sometimes not change after an
update.

Not sure what this translates to for MaxHeapTuplesPerPage though.

The rest I know less about and will leave to Pavan and Heikki (or anyone else
who was following those details more closely).

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Gregory Stark

Gregory Stark [EMAIL PROTECTED] writes:

 Tom Lane [EMAIL PROTECTED] writes:

 Pavan Deolasee [EMAIL PROTECTED] writes:
 Please see the version 14 of HOT patch attached.

 I expected to find either a large new README, or some pretty substantial
 additions to existing README files, to document how this all works.
 The comments included do not represent nearly enough documentation.

 The Heikki and I posted a two-part README of sorts:

Er, editing error there. Has a ring to it though.

 http://archives.postgresql.org/pgsql-patches/2007-07/msg00142.php
 http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php
...
 I also don't think I believe the reasoning for not indexing
 DELETE_IN_PROGRESS hot-updated tuples: what if the index completion commits,
 but the concurrent delete rolls back? Then you've got a valid tuple that's
 not in the index.

 You're talking about the concurrent index build case? Wouldn't the second pass
 pick up that tuple? I have to look back at it to see for sure.

Sorry, that's misguided. The concurrent index build uses snapshots now, so it
can't see DELETE_IN_PROGRESS. And the non-concurrent index build holds a lock,
so it ought to be back to the way it was before I messed with it, where there
was an assert against finding *_IN_PROGRESS (except, as Pavan points out, in
the same transaction).

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 You are right - a new index might mean that an existing HOT chain
 is broken as far as the new index is concerned. The way we address
 that is by indexing the root tuple of the chain, but the index key is
 extracted from the last tuple in the chain. The index is marked unusable
 for all those existing transactions which can potentially see any
 intermediate tuples in the chain.

I don't think that works --- what if the last tuple in the chain isn't
committed good yet?  If its inserter ultimately rolls back, you've
indexed the wrong value.

 Please see this document written by Greg Stark. He has nicely summarized
 how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT.
 http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

Isn't the extra machination for C.I.C. just useless complication?
What's the point of avoiding creation of new broken HOT chains when
you still have to deal with existing ones?

 I also don't think I
 believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
 tuples: what if the index completion commits, but the concurrent delete
 rolls back?  Then you've got a valid tuple that's not in the index.

 Since CREATE INDEX works with ShareLock on the relation, we can
 safely assume that we can't find any DELETE_IN_PROGRESS tuples except
 those deleted by our own transaction. The only exception is system
 relations, but we don't do HOT updates on system relations.

That chain of reasoning is way too shaky for me.  Essentially what
you're saying is you'll promise not to corrupt non-system indexes.
Nor am I really thrilled about having to disable HOT for system
catalogs.

 The only reason to redefine MaxHeapTuplesPerPage on the higher side is
 that HOT allows us to accommodate more tuples per page because
 of redirect-dead line pointers.

No, it doesn't allow more tuples per page.  It might mean there can be
more line pointers than that on a page, but not more actual tuples.
The question is whether there is any real use in allowing more line
pointers than that --- the limit is already unrealistically high,
since it assumes no data content in any of the tuples.  If there is a
rationale for it then you should invent a different constant
MaxLinePointersPerPage or some such, but I really think there's no
point.

 Doubling the value is chosen as a balance between heap page
 utilization, line pointer bloating and overhead for bitmap scans.

I'd say it allows a ridiculous amount of line pointer bloat.

 Even if it's safe, ISTM what you're mostly accomplishing there is to
 expend a lot of cycles while holding exclusive lock on the page, when
 there is good reason to think that you're blocking other people who are
 interested in using the page.  Eliminating the separation between that
 and cleanup would also allow eliminating the separate PD_FRAGMENTED
 page state.

 The reason we did it that way is that repairing fragmentation seems
 much more costly than pruning. Please note that we prune a single
 chain during an index fetch. It's only for heap scans (and VACUUM) that
 we try to prune all chains on the page. So unless we are doing a heap scan,
 I am not sure if we are spending too much time holding the exclusive
 lock. I agree we don't have any specific numbers to prove that though.

If you don't have numbers proving that this extra complication has a
benefit, I'd vote for simplifying it.  The SnapshotAny case is going to
bite other people besides you in future.

 Another reason for separating these two steps is: pruning
 requires an exclusive lock whereas repairing fragmentation requires a
 cleanup lock.

This is nonsense.  Those are the same lock.  If you have the former and
not the latter, it just means that you *know* there is contention for
the page.  It seems to me that performing optional maintenance work in
such a situation is completely wrong.

regards, tom lane



Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Pavan Deolasee
On 8/30/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:
  You are right - a new index might mean that an existing HOT chain
  is broken as far as the new index is concerned. The way we address
  that is by indexing the root tuple of the chain, but the index key is
  extracted from the last tuple in the chain. The index is marked unusable
  for all those existing transactions which can potentially see any
  intermediate tuples in the chain.

 I don't think that works --- what if the last tuple in the chain isn't
 committed good yet?  If its inserter ultimately rolls back, you've
 indexed the wrong value.



I am confused. How could we get a ShareLock on the relation while
there is some open transaction which has inserted a tuple in the
table? (Of course, I am not considering the system tables here.)
If the transaction which inserted the last tuple in the chain has already
aborted, we are not indexing that tuple (even if that tuple is at the
end). We would rather index the last committed-good tuple in the
chain. Running the tuples through HeapTupleSatisfiesVacuum guarantees
that. Doesn't it?

 Please see this document written by Greg Stark. He has nicely summarized
  how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT.
  http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

 Isn't the extra machination for C.I.C. just useless complication?
 What's the point of avoiding creation of new broken HOT chains when
 you still have to deal with existing ones?



IMHO the extra step in C.I.C simplifies the index build. The
transaction-wait takes care of the existing broken chains, and the early
catalog entry for the index helps us avoid creating new broken chains. I
am not against doing it a different way, but this is the best solution we
could come up with when we worked on CIC.


 I also don't think I
  believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
  tuples: what if the index completion commits, but the concurrent delete
  rolls back?  Then you've got a valid tuple that's not in the index.

  Since CREATE INDEX works with ShareLock on the relation, we can
  safely assume that we can't find any DELETE_IN_PROGRESS tuples except
  those deleted by our own transaction. The only exception is system
  relations, but we don't do HOT updates on system relations.

 That chain of reasoning is way too shaky for me.  Essentially what
 you're saying is you'll promise not to corrupt non-system indexes.
 Nor am I really thrilled about having to disable HOT for system
 catalogs.



I am not against supporting HOT for system catalogs. But IMHO
it's not a strict requirement, because system catalogs are small, less
frequently updated, and HOT adds little value to them. If we don't have
a generic solution which works for system and non-system tables,
that's the best we can get. We can start with non-system
tables and expand the scope later.


  The only reason to redefine MaxHeapTuplesPerPage on the higher side is
  that HOT allows us to accommodate more tuples per page because
  of redirect-dead line pointers.

 No, it doesn't allow more tuples per page.  It might mean there can be
 more line pointers than that on a page, but not more actual tuples.
 The question is whether there is any real use in allowing more line
 pointers than that --- the limit is already unrealistically high,
 since it assumes no data content in any of the tuples.  If there is a
 rationale for it then you should invent a different constant
 MaxLinePointersPerPage or some such, but I really think there's no
 point.



I agree. I think the current limit on MaxHeapTuplesPerPage is sufficiently
large. Maybe we can keep it the same and tune it later if we have numbers
to prove its worthiness.


 Doubling the value is chosen as a balance between heap page
  utilization, line pointer bloating and overhead for bitmap scans.

 I'd say it allows a ridiculous amount of line pointer bloat.



OK. Let's keep MaxHeapTuplesPerPage unchanged.


 Even if it's safe, ISTM what you're mostly accomplishing there is to
  expend a lot of cycles while holding exclusive lock on the page, when
  there is good reason to think that you're blocking other people who are
  interested in using the page.  Eliminating the separation between that
  and cleanup would also allow eliminating the separate PD_FRAGMENTED
  page state.

  The reason we did it that way is that repairing fragmentation seems
  much more costly than pruning. Please note that we prune a single
  chain during an index fetch. It's only for heap scans (and VACUUM) that
  we try to prune all chains on the page. So unless we are doing a
  heap scan, I am not sure if we are spending too much time holding the
  exclusive lock. I agree we don't have any specific numbers to prove
  that though.

 If you don't have numbers proving that this extra complication has a
 benefit, I'd vote for simplifying it.  The SnapshotAny case is going to
 bite other people besides you in 

Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Isn't the extra machination for C.I.C. just useless complication?
 What's the point of avoiding creation of new broken HOT chains when
 you still have to deal with existing ones?

 IMHO the extra step in C.I.C simplifies the index build.

 If you make the change suggested above, I think you don't need to do
 things differently in C.I.C.

It seems to me that if you wait out transactions as you come across them, you
could end up waiting a whole lot longer than the way it works now, where it
waits them all out at the end of the first pass.

 OK.  So if I get you correctly, you are suggesting to acquire a cleanup lock.
 If we don't get that, we don't do any maintenance work. Otherwise, we prune
 and repair fragmentation in one go.

 Yeah, that's what I'm thinking; then there's no need to track a separate
 page state where we've pruned but not defragmented.

Note that you still need to do it in two steps; you just get to postpone the
pruning work to the same point in time as the defragmenting. You can never
acquire the cleanup lock when it comes time to insert a new updated version,
since the executor still holds references to the original tuple.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Alvaro Herrera
Tom Lane wrote:
 Pavan Deolasee [EMAIL PROTECTED] writes:
  On 8/30/07, Tom Lane [EMAIL PROTECTED] wrote:
  I don't think that works --- what if the last tuple in the chain isn't
  committed good yet?  If its inserter ultimately rolls back, you've
  indexed the wrong value.
 
  I am confused. How could we get ShareLock on the relation while
  there is some open transaction which has inserted a tuple in the
  table ? (Of course, I am not considering the system tables here)
 
 Not if someone else releases lock before committing.

FWIW, a red flag went up for me here, though maybe it is irrelevant or
unimportant.  Currently, VACUUM acquires an exclusive lock for
truncating the table.  The lock is kept till commit.  However, I am
proposing that it be released before commit.

Now, VACUUM never inserts rows.  But I don't claim I understand what's
going on here.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane escribió:
 Not if someone else releases lock before committing.

 FWIW, a red flag raised for me here, though maybe it is irrelevant or
 unimportant.  Currently, VACUUM acquires an exclusive lock for
 truncating the table.  The lock is kept till commit.  However I am
 proposing that it be released before commit.

I think that's all right, because it's dealing with a different set of
concerns.  AFAICS the only issue for truncation is to prevent physical
access to the blocks in question until we can get rid of them.  Once
they're gone, if there wasn't an active seqscan (with an
already-established notion of the max block number to scan to), there
would be no reason for anyone to try to touch them.

regards, tom lane



Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Pavan Deolasee
On 8/31/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:

 Not if someone else releases lock before committing.  Which I remind you
 is a programming technique we use quite a lot with respect to the system
 catalogs.  I'm not prepared to guarantee that there is no code path in
 core (much less in contrib or third-party addons) that doesn't do it on
 a user table; even less that no such path will ever appear in future.
 As things stand it's a pretty harmless error --- the worst consequence
 I know of is that a VACUUM FULL starting right then might bleat and
 refuse to do shrinking.  What you propose is to turn the case into a
 cause of silent index corruption.

 A more robust solution would be to wait out the source transaction if
 CREATE INDEX comes across an INSERT_IN_PROGRESS or DELETE_IN_PROGRESS
 HOT tuple.  Or you could throw an error, since it's not really likely to
 happen (except maybe in a system catalog REINDEX operation).  But the
 way the patch approaches this at the moment is far too fragile IMHO.



OK. I looked at the code again. In CVS HEAD we assume that we
should never see a DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple
unless it's deleted/inserted by our own transaction or we are indexing a
system table. Except for these cases, we throw an error.

With HOT, we keep it the same, i.e. if we see a
DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple, we throw an error,
unless it's deleted/inserted by us or this is a system table.
What we change is: if we find a DELETE_IN_PROGRESS tuple deleted
by our own transaction and it's HOT-updated, we skip that tuple. This is
fine because if the CREATE INDEX commits, the DELETE is also committed
and the tuple is not visible (I am still assuming the indcreatexid
mechanism is in place).

So if we don't do HOT updates on system tables, the current algorithm
should work fine, because we should never find a HOT-updated tuple
in a system table, and the error-out mechanism should ensure
that we don't corrupt user tables.

So I guess what you are suggesting is to turn on HOT on system tables
and then wait out any DELETING/INSERTING transaction if we find such
a tuple during CREATE INDEX. I think this would work, and since we are
talking about system tables, the wait-out business won't be harmful - I
remember there were objections when I suggested this as a general solution.

If we approach it this way, we might also be able to jettison some of
 the other complexity such as idxcreatexid.



I doubt it, unless we replace it with something better. I think
indcreatexid serves another purpose. While building an index, we skip any
RECENTLY_DEAD HOT-updated tuples. This is required to handle the existing
HOT chains which are broken with respect to the new index. Essentially what
it means is that we don't want to index anything other than the last
committed-good tuple in the HOT chain. So when we skip any intermediate
RECENTLY_DEAD tuples, which are potentially visible to the existing
running transactions, we want those transactions NOT to use this new
index.
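
In code terms, the skip could look roughly like this inside the index
build's heap scan (HeapTupleSatisfiesVacuum and HeapTupleIsHotUpdated are
the real calls; the surrounding variables are assumed):

    switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin, buffer))
    {
        case HEAPTUPLE_RECENTLY_DEAD:
            if (HeapTupleIsHotUpdated(tuple))
                indexIt = false;    /* skip: a newer chain member gets
                                     * indexed, and indcreatexid keeps
                                     * older snapshots off this index */
            else
                indexIt = true;     /* still visible to someone; index it */
            break;
        /* ... other HEAPTUPLE_* cases as in IndexBuildHeapScan ... */
    }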



  IMHO the extra step in C.I.C simplifies the index build.

 If you make the change suggested above, I think you don't need to do
 things differently in C.I.C.


I am not sure if I follow you correctly here. The issue with CIC is that
it works with a snapshot. So during the initial index build, we might index
a tuple which is in the middle of a broken HOT chain. In the second phase,
we would not just need to index the tuple at the head of the chain, but
also remove the previously inserted entry; otherwise there would be two
references from the index to the same root heap tuple.

The additional step does two things:

1) Create the catalog entry early - this stops any new broken HOT chains
being created.
2) Wait out existing transactions - this makes sure that when the index is
built, we only index the last committed-good tuple in the chain.


 
  Since CREATE INDEX works with ShareLock on the relation, we can
  safely assume that we can't find any DELETE_IN_PROGRESS tuples except
  those deleted by our own transaction. The only exception is system
  relations, but we don't do HOT updates on system relations.

 Same issue as above: this makes correctness utterly dependent on nobody
 releasing locks early.



As I said above, we in fact throw an error for user tables. But if we want
to support HOT updates on system tables, we may need to do the
wait-out business. In fact, now that I think about it, there is no other
fundamental reason not to support HOT on system tables. So we
can very well do what you are suggesting.


 OK.  So if I get you correctly, you are suggesting to acquire a cleanup
 lock. If we don't get that, we don't do any maintenance work. Otherwise,
 we prune and repair fragmentation in one go.

 Yeah, that's what I'm thinking; then there's no need to track a separate
 page state where we've pruned but not defragmented.



I agree. Let's keep it simple, and we can always improve it later. The
only thing that worries me is how 

Re: [PATCHES] HOT patch - version 14

2007-08-29 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 Please see the version 14 of HOT patch attached.

I expected to find either a large new README, or some pretty substantial
additions to existing README files, to document how this all works.
The comments included do not represent nearly enough documentation.

One thing I was unable to discern from the comments is how CREATE INDEX
can work at all.  A new index might mean that tuples that could formerly
legally be part of the same hot-chain no longer can.  I can't find any
code here that looks like it's addressing that.  I also don't think I
believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
tuples: what if the index completion commits, but the concurrent delete
rolls back?  Then you've got a valid tuple that's not in the index.

 I also took this opportunity to remove the modularity invasion caused
 by heap_check_idxupdate() since it was using resultRelInfo. We now
 build the list of attributes that must be checked to satisfy HOT update.
 This list includes all the index columns, columns in the partial index
 predicates and expression index expressions and is built in the
 executor.

The executor is the wrong place for that: I'm not sure why you think
that making heapam depend on the executor is a modularity improvement.
Furthermore this approach requires recalculating the list during
each query, which is wasteful when it could only change during schema
updates.  I'd suggest making the relcache responsible for computing and
saving this data, along the same lines as RelationGetIndexList().
(That also means that heapam can get the data from the relcache, saving
a lot of API-bloating from passing around Attrids explicitly.)
Also, rather than inventing Attrids, I'd suggest just using one
Bitmapset with the convention that its contents are offset by
FirstLowInvalidHeapAttributeNumber.

The redefinition of the value of MaxHeapTuplesPerPage seems pretty
ugly and unprincipled.  I think it'd be better to leave it as-is,
and just enforce that we don't allow more than that many line pointers
on a heap page (as indeed you have to do anyway, so it's not clear
what the change is buying).

Is it really necessary to add hot_update to xl_heap_update?  Seems the
information should be available from the tuple header fields.

Have you demonstrated that the prune_hard logic is worth its weight?
Considering that many of us want to drop VACUUM FULL, adding more
complexity to it doesn't seem like a profitable use of time.

Is it really safe, or productive, to run heap_page_prune when the buffer
is not locked for cleanup (ie, there are other people with pins on it)?
Even if it's safe, ISTM what you're mostly accomplishing there is to
expend a lot of cycles while holding exclusive lock on the page, when
there is good reason to think that you're blocking other people who are
interested in using the page.  Eliminating the separation between that
and cleanup would also allow eliminating the separate PD_FRAGMENTED
page state.

PlanSetValidForThisTransaction is completely bogus --- it's not
re-entrant, and it needs to be.  I think you need some state in
PlannerGlobal instead.

rd_avgfsm seems fairly bogus ... when does it get updated?

regards, tom lane



Re: [PATCHES] HOT patch - version 14

2007-08-20 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I see that you have a separate bitmapset to keep track of indexes on
 system attributes. But having an index on a system attribute doesn't
 make any sense, does it?

Counterexample: OID.

regards, tom lane



Re: [PATCHES] HOT patch - version 14

2007-08-20 Thread Pavan Deolasee
On 8/20/07, Tom Lane [EMAIL PROTECTED] wrote:

 Heikki Linnakangas [EMAIL PROTECTED] writes:
  I see that you have a separate bitmapset to keep track of indexes on
  system attributes. But having an index on a system attribute doesn't
  make any sense, does it?

 Counterexample: OID.



Right. Further, we allow creating indexes on system attributes. So we
must support those.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com