Re: [PATCHES] page macros cleanup

2008-07-04 Thread Pavan Deolasee
On Fri, Jul 4, 2008 at 1:01 PM, Zdenek Kotala [EMAIL PROTECTED] wrote:


 Good catch. I got lost in basic arithmetic. What I see now is that the original
 definition counts sizeof(ItemIdData) twice, and on the other side it does not
 take care of MAXALIGN correctly. I think the correct formula is:

 #define HashMaxItemSize(page) \
    (PageGetPageSize(page) - \
     (MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData)) + \
      MAXALIGN(sizeof(HashPageOpaqueData))))

 What do you think?



Yes. I think that's the correct way.


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] page macros cleanup

2008-07-04 Thread Pavan Deolasee
On Fri, Jul 4, 2008 at 2:08 PM, Zdenek Kotala [EMAIL PROTECTED] wrote:
 The patch modifies HashMaxItemSize. It would require a reindex on all hash
 indexes. Should we bump the catalog version?


Do we really need that? Even if the new HashMaxItemSize comes out
lower than the current limit (because of MAXALIGN), I don't see how
existing hash indexes could have an item larger than the new limit.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] page macros cleanup

2008-07-04 Thread Pavan Deolasee
On Fri, Jul 4, 2008 at 3:37 PM, Heikki Linnakangas
[EMAIL PROTECTED] wrote:


 I think this is the way it should be:

 #define HashMaxItemSize \
(BLCKSZ - \
 SizeOfPageHeaderData - \
 MAXALIGN(sizeof(HashPageOpaqueData)) - \
 sizeof(ItemIdData))


I am wondering if this would fail in a corner case where HashMaxItemSize
happened to be unaligned. For example, if itemsz < HashMaxItemSize <
MAXALIGN(itemsz), PageAddItem() would later fail with a not-so-obvious
error. Should we just MAXALIGN_DOWN the HashMaxItemSize?
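Something like this, say (a minimal sketch; a MAXALIGN_DOWN macro does not
exist today, so the definition below is illustrative only):

    #define MAXALIGN_DOWN(LEN) \
        (((uintptr_t) (LEN)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

    /*
     * Align the limit down, so that an item that passes the size
     * check still fits once its length has been MAXALIGN'd.
     */
    #define HashMaxItemSize \
        MAXALIGN_DOWN(BLCKSZ - \
                      SizeOfPageHeaderData - \
                      MAXALIGN(sizeof(HashPageOpaqueData)) - \
                      sizeof(ItemIdData))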

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] page macros cleanup

2008-07-04 Thread Pavan Deolasee
On Fri, Jul 4, 2008 at 4:25 PM, Heikki Linnakangas
[EMAIL PROTECTED] wrote:


 No, there's an itemsz = MAXALIGN(itemsz) call before the check against
 HashMaxItemSize.


Ah, right. Still, shouldn't we MAXALIGN_DOWN the max size to
reflect the practical limit on itemsz? It's a more academic
point though, so not a big deal.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] page macros cleanup

2008-07-04 Thread Pavan Deolasee
On Fri, Jul 4, 2008 at 4:20 PM, Zdenek Kotala [EMAIL PROTECTED] wrote:



 In my opinion, the first place where a tuple can be placed is:

 MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData))


The tuple actually starts from the other end of the block.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] page macros cleanup

2008-07-03 Thread Pavan Deolasee
On Fri, Jun 13, 2008 at 9:38 PM, Zdenek Kotala [EMAIL PROTECTED] wrote:
 I attached a code cleanup which is related to in-place upgrade. I replaced
 direct access to the PageHeader structure with already existing macros, and I
 also removed unnecessary retyping.

A quick review comment:

One thing I noticed is that the modified definition of HashMaxItemSize
no longer accounts for the size of ItemId, which may not be the right
thing. Please recheck that.

Apart from that, the patch looks good to me. As you said, we should
probably replace the other direct uses of PageHeader with appropriate
macros.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



[PATCHES] VACUUM Improvements - WIP Patch

2008-06-09 Thread Pavan Deolasee
Here is a WIP patch based on the discussions here:
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00863.php

The attached WIP patch improves LAZY VACUUM by limiting or
avoiding the second heap scan. This not only saves considerable time
in VACUUM, but also reduces the double-writes of vacuumed blocks. If
the second heap scan is considerably limited, that should also save
CPU usage and reduce WAL writing.

With HOT, the first heap scan prunes and defrags every page in the
heap. That truncates all the dead tuples to their DEAD line pointers
and releases all the free space in the page. The second scan only
removes these DEAD line pointers and records the free space in the
FSM; the free space in fact does not change from the first pass. But
to do so, it again calls PageRepairFragmentation on each page, dirties
the page and calls log_heap_clean() on it again. This clearly
looks like too much work for a small gain.

As this patch stands, the first phase of vacuum prunes the heap pages
as usual, but it also marks the DEAD line pointers as DEAD_RECLAIMED,
to signal that the index pointers to these line pointers are being
removed, provided certain conditions are satisfied. Other backends,
when they prune a page, also reclaim DEAD_RECLAIMED line pointers by
marking them UNUSED. We need some additional logic to do this in a safe way:

- An additional boolean pg_class attribute (relvac_inprogress) is used
to track the status of vacuum on a relation. If the attribute is true,
either vacuum is in progress on the relation or the last vacuum did
not complete successfully.

When VACUUM starts, it sets relvac_inprogress to true. The transaction
is committed and a new transaction is started so that all other
backends can see the change. We also note down the transactions which
may already have the table open. VACUUM then starts the first heap
scan. It prunes the page, but it can start marking the DEAD line
pointers as DEAD_RECLAIMED only after it knows that all other backends
can see that VACUUM is in progress on the target relation. Otherwise
there is a danger that backends might reclaim DEAD line pointers
before their index pointers are removed and that would lead to index
corruption. We do that by periodic conditional waits on the noted
transaction ids. Once all old transactions are gone, VACUUM sets the
second scan limit to the current block number and starts marking
subsequent DEAD line pointers as DEAD_RECLAIMED.
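
As a rough sketch of the reclaim step (ItemIdIsDeadReclaimed is an
illustrative name, not necessarily the one in the patch; ItemIdSetUnused
is the existing macro), a backend pruning a page would do:

    /*
     * VACUUM has promised that the index pointers for this line
     * pointer are being removed, so it is safe to mark the slot
     * unused and let it be reused.
     */
    if (ItemIdIsDeadReclaimed(lp))
        ItemIdSetUnused(lp);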

In most cases, where the old transactions go away quickly, and
for large tables, the second scan will be very limited. In the worst
case, we might incur the overhead of conditional waits without any
success.

TODO:

- We can potentially update the FSM at the end of the first pass. This is not
a significant issue if the second scan is very limited. But if we do
this, we need to handle the truncate case properly.

- As the patch stands, we check for old transactions at every block
iteration. This might not be acceptable in cases where there are
long-running transactions. We probably need some exponential gap here.

- As the patch stands, heap_page_prune handles reclaiming the
DEAD_RECLAIMED line pointers, since it already has the ability to WAL-log
similar changes. We don't do any extra work to trigger pruning though
(other than setting page_prune_xid). Maybe we should trigger pruning
when we see line pointer bloat in a page too.

Please let me know comments/suggestions and any other improvements.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


VACUUM_second_scan-v5.patch.gz
Description: GNU Zip compressed data



Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-12 Thread Pavan Deolasee
On Wed, Mar 12, 2008 at 7:13 PM, Heikki Linnakangas
[EMAIL PROTECTED] wrote:


  Yep, patch attached. I also changed xactGetCommittedChildren to return
  the original array instead of copying it, as Alvaro suggested.


Any comments on the flag-based approach I suggested earlier? Am I
missing some normal scenario where it won't work well?

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-12 Thread Pavan Deolasee
On Wed, Mar 12, 2008 at 9:27 PM, Tom Lane [EMAIL PROTECTED] wrote:


  I didn't like it; it seemed overly complicated (consider dealing with
  XID wraparound),

We are talking about subtransactions here. I don't think we support
subtransaction wrap-around, do we?

 and it would have problems with a slow transaction
  generating a sparse set of subtransaction XIDs.

I agree that's the worst case. But is it common? That's what I
was thinking when I proposed the alternate solution. I thought it can
happen only if most of the subtransactions abort, which again I thought
is not the normal case. But frankly, I am not sure if my assumption is correct.

 I think getting rid of
  the linear search will be enough to fix the performance problem.


I wonder if a skewed binary search would help more? For example,
if we know that the range of xids stored in the array is 1 to 1000 and
we are searching for a number closer to 1000, we can break the
array into large and small parts instead of equal parts and then
search.
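
A rough sketch of what I mean (purely illustrative, not code from any
patch; it ignores XID wraparound, which is fine for a per-transaction
child-XID array):

    static int
    search_skewed(TransactionId *xids, int nxids, TransactionId target)
    {
        int     low = 0;
        int     high = nxids - 1;

        while (low <= high)
        {
            int     probe;

            /* target outside the remaining range: cannot be present */
            if (target < xids[low] || target > xids[high])
                return -1;

            /* split proportionally to where the target should fall */
            probe = low + (int) ((double) (target - xids[low]) *
                                 (high - low) /
                                 ((double) (xids[high] - xids[low]) + 1.0));

            if (xids[probe] == target)
                return probe;
            else if (xids[probe] < target)
                low = probe + 1;
            else
                high = probe - 1;
        }
        return -1;              /* not found */
    }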

Well, maybe I am making simple things complicated ;-)

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-12 Thread Pavan Deolasee
On Wed, Mar 12, 2008 at 10:44 PM, Heikki Linnakangas
[EMAIL PROTECTED] wrote:


  Imagine that you start a transaction just before transaction
  wrap-around, so that the top level XID is 2^31-10. Then you start 20
  subtransactions. What XIDs will they get? Now how would you map those to
  a bitmap?


Wait. Subtransaction ids are local to a transaction and always start from 1.
See this:

/*
 * reinitialize within-transaction counters
 */
s->subTransactionId = TopSubTransactionId;
currentSubTransactionId = TopSubTransactionId;



  It's not that common to have hundreds of thousands of subtransactions to
  begin with..

True. But that's the case we are trying to solve here :-)


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-11 Thread Pavan Deolasee
On Tue, Mar 11, 2008 at 6:04 PM, Heikki Linnakangas
[EMAIL PROTECTED] wrote:
 (moved to pgsql-patches, as there's a patch attached)


  I couldn't let this case go, so I wrote a patch. I replaced the linked
  list with an array that's enlarged at AtSubCommit_childXids when necessary.


We could replace the array of xids with an array of flags, where the i'th
flag is set to true if the corresponding subtransaction is committed. This
would make the lookup an O(1) operation. Of course, the downside is when the
array is sparse. Or we can switch to the flag-based representation once the
array size grows beyond a limit.
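
A rough sketch of the flag-based representation (names are illustrative,
and, as discussed elsewhere in the thread, XID wraparound would need
handling):

    /*
     * Illustrative only: one bit per child XID, indexed by offset from
     * the first child XID assigned in this transaction.
     */
    typedef struct ChildXidFlags
    {
        TransactionId firstXid;     /* first child XID assigned */
        int         nxids;          /* number of bits allocated */
        uint8      *committed;      /* bit i set => firstXid + i committed */
    } ChildXidFlags;

    static inline bool
    child_xid_is_committed(ChildXidFlags *f, TransactionId xid)
    {
        uint32      i = xid - f->firstXid;  /* unsigned difference */

        if (i >= (uint32) f->nxids)
            return false;           /* not a recorded child XID */
        return (f->committed[i / 8] & (1 << (i % 8))) != 0;
    }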

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] TransactionIdIsInProgress() cache

2008-03-11 Thread Pavan Deolasee
On Tue, Mar 11, 2008 at 6:37 PM, Alvaro Herrera
[EMAIL PROTECTED] wrote:
 I didn't check whether your transformation is correct, but if so then it
  can be changed like this and save the extra XidDidCommit call:

 xvac_committed = TransactionIdDidCommit(xvac);

 if (xvac_committed)
 {
     /* committed */
 }
 else if (!TransactionIdIsInProgress(xvac))
 {
     if (xvac_committed)
     {
         /* committed */
     }
     else
     {
         /* aborted */
     }
 }
 else
 {
     /* in-progress */
 }



I doubt the transformation is correct. If xvac_committed is true, why would
one even get into the else part?
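
For reference, the existing tqual.c pattern that the transformation starts
from looks roughly like this (paraphrased, not an exact quote):

    if (!TransactionIdIsInProgress(xvac))
    {
        if (TransactionIdDidCommit(xvac))
        {
            /* committed */
        }
        else
        {
            /* aborted */
        }
    }
    else
    {
        /* in-progress */
    }

In the transformed version, the outer if has already handled the
xvac_committed == true case, so the inner committed branch is dead code.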

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [PATCHES] HOT version 18

2007-09-18 Thread Pavan Deolasee
On 9/18/07, Jaime Casanova [EMAIL PROTECTED] wrote:



 this SQL script makes current CVS + patch crash with this message
 in the logs:


Can you please check if the attached patch fixes the issue for you?
It sets t_tableOid before returning a HOT tuple to the caller.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
*** src/backend/access/index/indexam.c	2007-09-18 11:59:26.0 +0530
--- src/backend/access/index/indexam.c	2007-09-18 11:52:36.0 +0530
***
*** 531,536 
--- 531,537 
  			ItemPointerSetOffsetNumber(tid, offnum);
  			heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
  			heapTuple->t_len = ItemIdGetLength(lp);
+ 			heapTuple->t_tableOid = RelationGetRelid(scan->heapRelation);
  			ctid = heapTuple->t_data->t_ctid;
  
  			/*



Re: [PATCHES] HOT synced with HEAD

2007-09-17 Thread Pavan Deolasee
On 9/17/07, Tom Lane [EMAIL PROTECTED] wrote:


 Yeah.  As the code stands, anything that's XMIN_INVALID will be
 considered not-HotUpdated (look at the macro...).  So far I've seen no
 place where there is any value in following a HOT chain past such a
 tuple --- do you see any?


No, I don't think we would ever need to follow a HOT chain past
an aborted tuple. The only thing that worries me about this setup, though,
is the dependency on hint bits being set properly. But the places
where this would be used right now for detecting aborted dead tuples
apply HeapTupleSatisfiesVacuum on the tuple before checking
HeapTupleIsHotUpdated, so we are fine. Or should we just check
for XMIN_INVALID explicitly in those places?


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT synced with HEAD

2007-09-17 Thread Pavan Deolasee
On 9/17/07, Tom Lane [EMAIL PROTECTED] wrote:



 Meanwhile I've started looking at the vacuum code, and it seems that v16
 has made that part of the patch significantly worse.  VACUUM will fail
 to count tuples that are removed by pruning, which seems like something
 it should report somehow.


I understand. I did not give real weight to it because I thought we
remove tuples elsewhere during pruning anyway. But I agree
that heap_page_prune_defrag in the VACUUM code path
is doing so on behalf of vacuum, and hence we should credit
that to VACUUM.


And you've introduced a race condition: as
 I just mentioned, it's perfectly possible that the second call of
 HeapTupleSatisfiesVacuum gets a different answer than what the prune
 code saw, especially in lazy VACUUM (in VACUUM FULL it'd suggest that
 someone released lock early ... but we do have to cope with that).



Hmm.. you are right. Those extra notices I added are completely
unnecessary and wrong.


The comments you added seem to envision a more invasive patch that gets
 rid of the second HeapTupleSatisfiesVacuum pass altogether, but I'm not
 sure how practical that is, and am not real inclined to try to do it
 right now anyway ...


I agree. I just wanted to leave a hint there that such a possibility exists
if someone really wants to optimize, now or later.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT synced with HEAD

2007-09-16 Thread Pavan Deolasee
On 9/16/07, Tom Lane [EMAIL PROTECTED] wrote:

 Attached is as far as I've gotten with reviewing the HOT patch; I see
 that Pavan is still fooling with it so we'd better re-sync.



I am done with most of the items on my plate. I was not sure how
far you had gone with the patch, so I was trying to finish as many
items as possible myself. Now that I know you have started refactoring
the code, I will try not to change it unless it's urgent. And when it's
required, I will send an add-on patch just the way I did earlier.

Let me know if you want me to focus on something at this point.



 * Many API cleanups, in particular I didn't like what had been done
   to heap_fetch and thought it better to split out the chain-following
   logic



I liked those changes. I am wondering if we could have avoided
duplicating the chain-following code in heap_hot_search_buffer and
index_getnext. But I trust your judgment.

I also liked the way you reverted the API changes to various index build
methods.

I would test the patch tomorrow in detail.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] Latest README.HOT

2007-09-16 Thread Pavan Deolasee
On 9/16/07, Tom Lane [EMAIL PROTECTED] wrote:

 Forgot to include this in the patch ...

 I'm still unhappy about too much complexity for the VACUUM FULL case,
 but some other issues have gone away.



The entire complexity in the VACUUM FULL code is for book-keeping
purposes, so that at the end we can recheck the index and heap tuple
counts. VACUUM FULL converts HOT tuples into non-HOT tuples when it
moves them around, and that introduces the extra complexity. Otherwise
there is no significant change to the VACUUM FULL logic itself.

Good to hear that we are now comfortable with most of the other items.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT synced with HEAD

2007-09-16 Thread Pavan Deolasee
On 9/16/07, Tom Lane [EMAIL PROTECTED] wrote:



 BTW, I'm in process of taking out the separate HEAPTUPLE_DEAD_CHAIN
 return value from HeapTupleSatisfiesVacuum.


I agree. I myself suggested doing so earlier in the discussion. (I actually
even removed this before I sent out the add-on patch last night, but then
reverted it because I realized it is still required in one place.)
The place where I thought it was required is to avoid marking an index tuple
dead even though the corresponding root tuple is dead and was
HOT updated. But it seems you have already put in a different mechanism
to handle that, so we should be able to get rid of HEAPTUPLE_DEAD_CHAIN.

Thanks,
Pavan




-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT synced with HEAD

2007-09-16 Thread Pavan Deolasee
On 9/16/07, Tom Lane [EMAIL PROTECTED] wrote:


 Something else I was just looking at: in the pruning logic, SetRedirect
 and SetDead operations are done at the same time that we record them for
 the eventual WAL record creation, but SetUnused operations are
 postponed and only done at the end.  This seems a bit odd/nonorthogonal.
 Furthermore it's costing extra code complexity: if you were to SetUnused
 immediately then you wouldn't need that bitmap thingy to prevent you
 from doing it twice.  I think that the reason it's like that is probably
 because of the problem of potential multiple visits to a DEAD heap-only
 tuple, but it looks to me like that's not really an issue given the
 current form of the testing for aborted tuples, which I have as

 if (HeapTupleSatisfiesVacuum(tuple->t_data, global_xmin, buffer)
         == HEAPTUPLE_DEAD && !HeapTupleIsHotUpdated(tuple))
     heap_prune_record_unused(nowunused, nunused, unusedbitmap,
                              rootoffnum);

 ISTM that if HeapTupleIsHotUpdated is false, we could simply SetUnused
 immediately, because any potential chain members after this one must
 be dead too, and they'll get reaped by this same bit of code --- there
 is no need to preserve the chain.  (The only case we're really covering
 here is XMIN_INVALID, and any later chain members must certainly be
 XMIN_INVALID as well.)  When the HOT-chain is later followed, we'll
 detect chain break anyway, so I see no need to postpone clearing the
 item pointer.



So are you suggesting we go back to the earlier way of handling
aborted tuples separately? But then we cannot do that by simply
checking for !HeapTupleIsHotUpdated. There could be several aborted
tuples at the end of the chain, of which all but one are marked HotUpdated.
Or are you suggesting we also check for XMIN_INVALID to detect
aborted tuples?

If we handle aborted tuples differently, then we don't need that extra
bitmap and can in fact set the line pointer unused immediately. Otherwise,
as you rightly pointed out, we might break a chain before we prune
it.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT documentation README

2007-09-13 Thread Pavan Deolasee
On 9/12/07, Pavan Deolasee [EMAIL PROTECTED] wrote:


 One change that is worth mentioning
 and discussing is that we don't follow HOT chains while fetching tuples
 during autoanalyze, and autoanalyze would consider all such tuples as DEAD.
 In the worst case, when all the tuples in the table are reachable via
 redirected line pointers, this would confuse autoanalyze, since it would
 consider all tuples in the table as DEAD.



This is all crap. I was under the impression that heap_release_fetch()
would never fetch a HOT tuple directly, but that's not true. Analyze fetches
all tuples in the chosen block, so it would ultimately fetch the visible
tuple. A tuple is counted DEAD only if its t_data is set to non-NULL. For a
redirected line pointer, heap_release_fetch() will set t_data to NULL, and
hence these line pointers are (rightly) not counted as DEAD. This is the
right thing to do, because lazy vacuum cannot remove redirected line
pointers.


I think we should change this to follow HOT chain in analyze.



We need not follow the chain, but we should check for redirect-dead line
pointers and count them as dead rows.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-13 Thread Pavan Deolasee
On 9/13/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:



 Yeah, I just checked the work-in-progress patch I sent you back in July.
 I refactored it to use one-based offsets for consistency, since I
 modified log_heap_clean quite heavily anyway.

 It's possible that I broke it in the process, I was only interested in
 testing the performance characteristics of the simplified pruning scheme.


I don't think so -- at least I couldn't figure out why it's broken. But I
will take Tom's comment seriously and look more into it tomorrow.
Or we can just revert to zero-based offsets.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-13 Thread Pavan Deolasee
On 9/13/07, Tom Lane [EMAIL PROTECTED] wrote:



  Never mind ... though my
 suspicions would probably not have been aroused if anyone had bothered
 to fix the comments.


Yeah, my fault. I should have fixed that. Sorry about that.


Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-13 Thread Pavan Deolasee
On 9/14/07, Tom Lane [EMAIL PROTECTED] wrote:


 HeapTupleSatisfiesAbort is bogus because it has failed to track recent
 changes in tqual.c.



Oh. I should have been aware.


Rather than fix it, though, I question why we need it at all.  The only
 use of it is in heap_prune_tuplechain() and ISTM that code is redundant,
 wrong, or both.

 In the case of a full-page prune, it's redundant because the tuple would
 be found anyway by searching its chain from the root tuple.  Indeed I
 suspect that such a tuple is entered into the newly-dead list twice,
 perhaps risking overflow of that array.



No, we would never reach an aborted dead tuple from the chain, because
we would see a live tuple before it. Or the tuple preceding the aborted
one (the first aborted, if there are many) may have been updated again, so
the chain never reaches the aborted tuples. That's the reason we have
HeapTupleSatisfiesAbort: to collect any aborted heap-only tuples.



In the case of a single-chain prune, it still seems wrong since you'll
 eliminate only one of what might be a chain with multiple aborted
 entries.  If it's OK to leave those other entries for future collection,
 then it should be OK to leave this one too.  If it's not OK then the
 approach needs to be redesigned.



Again, since we check every heap-only tuple for the aborted-dead
case, we shall collect all such tuples. I think one place where you
are confused (or IOW, where the code might have confused you ;-))
is that we never reach aborted dead heap-only tuples by
following the HOT chain from the root tuple, and that's why they
need special treatment.

I'm fairly unclear on what the design intention is for recovering from
 the case where the last item(s) on a HOT chain are aborted.  Please
 clarify.


We don't do anything special. Such a tuple is never reached during
HOT chain following, because either we would see a visible tuple before
it, or the older tuple might have been updated again and the chain
moved in a different direction. We only need some special treatment
to prune such tuples, hence the business of HeapTupleSatisfiesAbort.

Having said that, based on our recent discussion about pruning a chain
up to and including the latest DEAD tuple in the chain, ISTM that we can
stop giving any special treatment to aborted heap-only tuples. Earlier we
could not do so for HOT-updated aborted heap-only tuples because we did
not know whether they were part of any valid HOT chain. Now (assuming we
change the pruning code to always prune everything including the latest
DEAD tuple) we guarantee that all DEAD tuples in the chain will be pruned,
and hence we can collect any DEAD tuple seen while pruning.

I am not sure if I explained this well; maybe I should post an example.
Need to run now.

Thanks,
Pavan

-- 

Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT documentation README

2007-09-12 Thread Pavan Deolasee
On 9/12/07, Tom Lane [EMAIL PROTECTED] wrote:



 VACUUM
 --

 There is little change to regular vacuum. It removes dead HOT tuples,
 like pruning does, and cleans up any redirected dead line pointers.

 In lazy vacuum, we must not freeze a tuple that is in the middle of an
 update chain. That can happen when a tuple has xmin > xmax; it is the
 same scenario that requires hard pruning in VACUUM FULL. Freezing such
 tuples will break the check that xmin and xmax matches when following
 the chain. It is not a problem without HOT, because the preceding tuple
 in the chain must be dead as well so no one will try to follow the
 chain, but with HOT the preceding tuple would be DEAD_CHAIN, and someone
 might still need to follow the chain to find the live tuple. We avoid
 that by just not freezing such tuples. They can be frozen eventually,
 when the xmax of the preceding tuple is < OldestXmin as well.

 XXX doesn't the above completely break the anti-wraparound guarantees?
 And why couldn't we avoid the problem by pruning the previous tuple,
 which is surely dead?



We preserve the anti-wraparound guarantees by not letting relfrozenxid
advance past the minimum of the cut-off xid and the xmin/xmax of
not-yet-frozen tuples. Given that this is required only to address the
corner case of a DEAD tuple following a RECENTLY_DEAD (RD) tuple, the final
relfrozenxid would be very close to the cut-off xid. Isn't it?

We could have actually pruned the preceding RD tuples (as we do in
vacuum full), but we were worried about missing some corner case
where someone may still want to follow the chain from the RD tuple.
We don't have any such concern with vacuum full, because we run
with an exclusive lock on the table. But if we agree that there is no
problem with pruning an RD tuple preceding a DEAD tuple, we can
actually prune that tuple as well.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT documentation README

2007-09-12 Thread Pavan Deolasee
On 9/12/07, Tom Lane [EMAIL PROTECTED] wrote:


 VACUUM FULL
 ---

 To make vacuum full work, any DEAD tuples in the middle of an update
 chain need to be removed (see comments at the top of
 heap_prune_hotchain_hard() for details). Vacuum full performs a more
 aggressive pruning that not only removes dead tuples at the beginning of
 an update chain, but scans the whole chain and removes any intermediate
 dead tuples as well.  It also removes redirected line pointers by making
 them directly point to the first tuple in the HOT chain. This causes
 a user-visible change in the tuple's CTID, but since VACUUM FULL has
 always moved tuple CTIDs, that should not break anything.

 XXX any extra complexity here needs justification --- a lot of it.



We hard-prune the chains and also clear up redirected line pointers
because doing so is safe within VACUUM FULL, and it reduces additional
complexity in the actual VACUUM FULL work.

When we move tuples and tuple chains, we don't try to preserve
their HOT properties. So when tuples in a HOT chain are moved,
we reset their HEAP_ONLY_TUPLE and HEAP_HOT_UPDATED flags
and each tuple gets its own index entry. This requires some more
book-keeping in terms of the number of indexed tuples expected
etc., because they are checked at the end of the index scan.


Statistics
 --

 XXX: How do HOT-updates affect statistics? How often do we need to run
 autovacuum and autoanalyze?



Autovacuum needs to run much less frequently with HOT. This is because
defragmentation reclaims dead space in a page, thus reducing the total dead
space in a table. Right now we don't update the FSM information about the
page after defragmenting it, so an UPDATE on a different page can still
cause relation extension even though there is free space in some other page.
The rationale for not updating the FSM is to let subsequent UPDATEs on the
page use the freed-up space. But one can argue that we should let the free
space be used for other UPDATEs/INSERTs after leaving fillfactor worth
of space.

Another significant change regarding autovacuum is that we now track
the total dead space in the table instead of the number of dead tuples. This
seems like a better approach because it takes varying tuple sizes into
account. The tracked dead space is increased whenever we update/delete
a tuple (or an insert is aborted) and reduced when a page is defragmented.
autovacuum_vacuum_scale_factor considers the percentage of dead space
relative to the size of the relation, whereas autovacuum_vacuum_threshold
considers the absolute amount of dead space in terms of blocks.
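
By analogy with the existing reltuples-based formula, the trigger condition
would look roughly like this (a sketch; dead_space_bytes,
relation_size_bytes and relation_needs_vacuum are illustrative names):

    /* vacuum when dead space exceeds the absolute threshold (blocks)
     * plus the configured fraction of the relation size */
    if (dead_space_bytes >
        autovacuum_vacuum_threshold * BLCKSZ +
        autovacuum_vacuum_scale_factor * relation_size_bytes)
        relation_needs_vacuum = true;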

Every UPDATE (HOT or COLD) contributes to the autoanalyze stats, and
defragmentation/pruning has no effect on autoanalyze. IOW, autoanalyze
would work just the way it does today. One change that is worth mentioning
and discussing is that we don't follow HOT chains while fetching tuples
during autoanalyze, and autoanalyze would consider all such tuples as DEAD.
In the worst case, when all the tuples in the table are reachable via
redirected line pointers, this would confuse autoanalyze, since it would
consider all tuples in the table as DEAD.

I think we should change this to follow the HOT chain in analyze. Since we
fetch using SnapshotNow, if there is a live tuple at the end of the
chain, analyze would use that. Otherwise the tuple is considered
DEAD.

Thanks,
Pavan



-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-12 Thread Pavan Deolasee
On 9/11/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:



 The crucial design
 decision for HOT is when to prune and when to defragment the page, so
 that when we're doing the UPDATE, there's room in the page.


Right.

For defragmentation, I am still inclined towards doing it when we see
that the free space is less than (or slightly more than) the average tuple
size of the table - unless we have a better solution.

For pruning, we can do the quick pruning. But I won't feel comfortable
doing it without an exclusive lock on the buffer. Also, I would avoid
line pointer redirection during a quick prune. A simpler solution would
be to flag the page whenever a HOT chain becomes longer than, say,
5 (which can easily be detected in heap_hot_fetch) and prune it in
the next lookup. Hopefully we would only rarely have long HOT chains,
and if we do have them, we will have some mechanism to recover
from them.

Any other ideas?

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT documentation README

2007-09-12 Thread Pavan Deolasee
On 9/12/07, Tom Lane [EMAIL PROTECTED] wrote:



 This seems all wrong to me.  We'd only be considering freezing a tuple
 whose xmin precedes the global xmin.  If it has a predecessor, that
 predecessor has xmax equal to this one's xmin, therefore also preceding
 global xmin, therefore it would be seen as DEAD not RECENTLY_DEAD.
 So we should never need to freeze a tuple that isn't the start of its
 HOT chain.


Hmm.. what you are saying is right. I fail to recollect any other scenario
that had forced me to think freezing is a problem with HOT.


Also, if you find a DEAD tuple after a RECENTLY_DEAD one, you can
 certainly prune both, because what this tells you is that both tuples
 are in fact dead to all observers.


I agree. I ran a long test with this change and there doesn't seem to be
any issue. So let's prune everything including the latest DEAD tuple. That
would let us take out the changes related to freezing. I also think it
should let us remove the DEAD_CHAIN concept, but let me check.

Thanks,
Pavan



-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-11 Thread Pavan Deolasee
On 9/11/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:



 We're only changing the offsetnumber part of it, which is 2 bytes. That
 shouldn't cross a hardware sector boundary on any reasonable hardware.


Not entirely true if we consider line pointer redirection, which involves
changing 4 bytes. You can argue that for quick pruning we don't do any
line pointer redirection though.

Also, my worry about dealing with unreachable heap-only dead tuples
remains. We may need to do far more work to identify that they are not
part of any reachable HOT chain and hence can be removed.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-11 Thread Pavan Deolasee
On 9/11/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:
  Pruning removes intermediate dead tuples by marking their line pointers
  ~LP_USED and redirecting the root line pointer to the first
  live/recently_dead tuple in the chain.

 It seems utterly unsafe to do this without a vacuum-grade lock on the
 page.  Someone might be examining the root tuple (eg, in a SnapshotAny
 scan) and I think they are entitled to assume that the line pointer
 stays valid while they are looking at the tuple.


As the patch stands, we prune holding a vacuum-grade lock. But ISTM
that there are only limited users of SnapshotAny, and we make them
recheck the line pointer after acquiring the buffer lock. So hopefully we
should be fine.

Anyway, it's not a concern right now because we hold a vacuum-grade lock
while pruning. But if we decide to do what Heikki/Greg suggested about quick
pruning, then we will have to make sure that your concern is addressed.


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-11 Thread Pavan Deolasee
On 9/11/07, Pavan Deolasee [EMAIL PROTECTED] wrote:


 - Track the minimum xmin in the page header to avoid repeated
 (wasted) attempts to prune a Prunable page in the presence of long running
 transactions.


I would actually think twice before even doing this, because it would lead
to a complete change in the heap page structure and stop people from
upgrading to 8.3 without a complete dump/restore. I don't remember 8.3
introducing any other significant change which already enforces that.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-11 Thread Pavan Deolasee
On 9/11/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:
  I would actually think twice before even doing this because this would
  lead to complete change in heap page structure and stop people from
  upgrading to 8.3 without a complete dump/restore. I don't remember 8.3
  introduces any other significant change which already enforces that.

 Then you don't remember far enough --- either the HeapTupleHeader change
 or the varvarlena change would be enough to prevent that.  For that
 matter, the pd_flags header field that the patch is already relying on
 is new for 8.3.


Oops! I forgot the combo command id. That went into the last major release
of EDB and that confused me. Regarding pd_flags, I thought somebody could
just fix that in-place and avoid a dump/restore (of course, it's too
invasive, but isn't it feasible with pg_migrator?)

Anyway, given this, should we go for storing xmin in the page header?
For that matter, should there be a separate heap page header?

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/10/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:



 I'm repeating myself, but I would split that second operation into two
 parts while we think about this: pruning the entire page, and
 defragmenting (PageRepairFragmentation). Tom wondered why they're
 separated in the patch. As the patch stands, there is no reason, but I
 feel that separating them and doing them at different times might be an
 important piece in the puzzle.



I think we all agree that defragmentation should be done less aggressively,
ideally when we are going to insert a tuple in the page and are running
low on free space in the page. If we could really do it in heap_update(),
after we figure out that there is not enough free space available in the
page, that would be great. We already have an exclusive lock on the page,
and hopefully we also have the vacuum lock. (This is how an earlier version
of the patch used to work - before we took out all the complexity associated
with tracking LP_DELETEd tuples and tracking fragmented tuples, and replaced
it with simple PageRepairFragmentation. Since in the earlier version we
never used to call PageRepairFragmentation, we could easily do pruning in
heap_update().)

If we can't find a way to defragment inside heap_update(), then we have
three choices:

1. Pass a hint to heap_fetch that an UPDATE may follow. If we are running
low on free space, we try to prune and defragment at that point.

2. Move some of the work to bgwriter.

3. Use some heuristic to decide whether to defragment or not.

I am not sure how easy it is to know a priori that we are fetching a tuple
for UPDATE. Assuming we can, this seems like a good idea.

Eventually we may want to move some work to bgwriter or some other
background process. But my take would be to save that for 8.4

As a starting point, we are doing 3 in the patch. We always try to
keep just one tuple's worth of space free in the page. So if an UPDATE
takes up the remaining free space in the page, the page will be
pruned/defragged in the subsequent lookup (index or seq). In a read-mostly
scenario, only the first lookup would trigger prune/defrag, but the next
many lookups would be quick. In an update-mostly scenario, most of the
heap_fetches would anyway end up doing an update, so doing
pruning/defragmentation in heap_fetch shouldn't be too bad. For a balanced
scenario, we might be moving the cost of pruning/defragmentation to the
SELECT path, but UPDATEs would be quick.


The other issue is when to prune the HOT chain. I tend to agree that
the chances of having long HOT chains are low and the cost of following the
chain won't be too significant. At the same time, we probably don't want to
live with very long chains forever. An example would be: a tuple gets
HOT-updated several times in a transaction. If the page is never updated
again, the long HOT chain would never be pruned. I like Simon's idea of
pruning the chain if it goes beyond a limit. Instead of adding any
significant complexity to track the length of each HOT chain, we can just
mark the page with a flag if heap_hot_fetch() detects a long chain. The next
lookup (index or seq) of the page will try to prune the page and clear the
flag if successful.

I also agree that we should avoid taking an exclusive lock before checking
free space in heap_page_prune_defrag(). That might be unsafe, but we
don't care if we occasionally skip the maintenance work (or do it a
little early).

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/6/07, Tom Lane [EMAIL PROTECTED] wrote:

 Heikki Linnakangas [EMAIL PROTECTED] writes:
  When I suggested that we get rid of the LP_DELETE flag for heap tuples,
  the tuple-level fragmentation and all that, and just take the vacuum
  lock and call PageRepairFragmentation, I was thinking that we'd do it in
  heap_update and only when we run out of space on the page. But as Greg
  said, it doesn't work because you're already holding a reference to at
  least one tuple on the page, the one you're updating, by the time you
  get to heap_update. That's why I put the pruning code to heap_fetch
  instead. Yes, though the amortized cost is the same, it does push the
  pruning work to the foreground query path.

 The amortized cost is only the same if every heap_fetch is associated
 with a heap update.  I feel pretty urgently unhappy about this choice.
 Have you tested the impact of the patch on read-mostly workloads?


For read-mostly workloads, only the first SELECT after an UPDATE
would trigger pruning/defragmentation. heap_page_prune_defrag()
would be a no-op for subsequent SELECTs (PageIsPrunable() would
be false until the next UPDATE).

I think Heikki's recent test confirms this.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/8/07, Bruce Momjian [EMAIL PROTECTED] wrote:



 Would someone tell us exactly when pruning and defragmenting happens in
 the current version of the patch?  If we don't nail this issue down soon
 PostgreSQL 8.3 is going to sail without this patch.




I guess you already have the answers by now, but let me repeat here:

Until patch version 14, defragmentation and pruning were separate
operations, though we used to invoke both at the same time. The only
difference was that pruning can work with a simple exclusive lock,
whereas defragmentation needs a vacuum-strength lock. Tom argued
that it's not worth the complexity, so I changed that and clubbed
both operations together. What this means: pruning now also needs a
vacuum-strength lock and is always followed by defragmentation.

So as the patch stands (version 15), we call heap_page_prune_defrag()
inside heap_release_fetch() and heapgetpage(), apart from the vacuum
loop. It checks PageIsPrunable() before proceeding. PageIsPrunable
is set when a tuple is updated/deleted, so for read-mostly workloads
heap_page_prune_defrag() will mostly be a no-op.

If PageIsPrunable is set, the cleanup lock is available (the exclusive lock
is taken conditionally, so we don't wait if there is contention) and we are
running low on free space (right now, if free space is less than 1.2 times
the average tuple size or less than BLCKSZ/8), the entire page is pruned
and fragmentation is repaired.
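
In pseudo-C, the decision inside heap_page_prune_defrag() looks roughly like
this (a sketch, not the exact patch code; avg_tuple_size and
prune_all_chains are illustrative names):

    Size    freespace = PageGetFreeSpace(page);

    if (PageIsPrunable(page) &&
        (freespace < (Size) (1.2 * avg_tuple_size) ||
         freespace < BLCKSZ / 8) &&
        ConditionalLockBufferForCleanup(buffer))    /* never wait */
    {
        prune_all_chains(page);             /* prune every HOT chain */
        PageRepairFragmentation(page);      /* then defragment */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }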

We also prune-defrag in vacuum lazy and vacuum full. But I assume we
are not worried about these maintenance activities.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/10/07, Simon Riggs [EMAIL PROTECTED] wrote:

 On Mon, 2007-09-10 at 12:17 +0100, Heikki Linnakangas wrote:
  Bruce Momjian wrote:
   (Can someone time the access time for following a chain that fills an
   entire page (the worst case) vs. having a single tuple on the page?)
 
  Here is some results, on my laptop.

                 HEAD    HOT     HOT-opt    HOT-pruned
      seqscan    19.9    21.1    20.1       11.5
      idxscan    27.8    31.4    30.4       13.7
 

  Comparing the idxscan columns, it looks like following the chain *is*
  more expensive than having to go through killed index pointers. Pruning
  clearly does help.
 
  Given that this test is pretty much the worst case scenario, I'm ok with
  not pruning for the purpose of keeping chains short.

 I wasn't expecting that result and had accepted the counter argument.



Yes, I am with Simon. I am also surprised that HOT-pruned did
so well in this setup. I always thought that HOT would do well
in update-intensive scenarios, but from the results it seems that
HOT also does well for read-mostly queries.

In this particular example, the first SELECT after the 250 UPDATEs
would have pruned all the dead tuples and reduced the HOT chain
to a single tuple. Hence the total time for subsequent SELECTs improved
tremendously.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/10/07, Florian Pflug [EMAIL PROTECTED] wrote:


 Maybe that risk could be lowered if instead of a flag, we stored the
 minimal global xmin needed to prune at least one tuple.



I like the idea. The question is whether the chances of a Prunable
page being looked up again and again in the presence of a long-running
transaction are high enough to justify adding 4 bytes
to the page header.
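
A sketch of how the check could look with such a field (pd_prune_xid is
the hypothetical page-header field being discussed here):

    /*
     * Prune only if some tuple on the page could actually be removed,
     * i.e. the recorded xid is already older than the global xmin.
     */
    #define PageIsPrunable(page, oldestxmin) \
        (TransactionIdIsValid(((PageHeader) (page))->pd_prune_xid) && \
         TransactionIdPrecedes(((PageHeader) (page))->pd_prune_xid, \
                               oldestxmin))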

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/10/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:


 Oh, one more thing occured to me. Without HOT, we not only mark index
 tuples pointing to dead tuples as killed, we remove them altogether if
 the index page gets full. If you modify the test case so that after
 doing the updates, you insert a bunch of tuples with a different key to
 fill the index page, you should see CVS HEAD winning HOT without pruning
 hands down.


Um, I am not sure I follow that. With HOT, even if the HOT chain is long,
there is still just one index entry for the entire chain. So I don't see
why CVS HEAD would do better than HOT in the above case. Net-net,
there will be an equal number of index keys after the inserts.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote:

 Heikki Linnakangas wrote:

 
  Pruning does generate a WAL record at the moment. Maybe you could do
  some kind of a quick pruning without a WAL record. Like just modify the
  ctid of the oldest dead tuple in the chain, or the redirecting line
  pointer if there is one, to point to the latest live tuple, without
  removing the dead tuples or the line pointers.

 I am wondering what you even mean by removing the dead tuples when
 pruning.  I thought only defragmentation removed tuples.


Pruning removes intermediate dead tuples by marking their line pointers
~LP_USED and redirecting the root line pointer to the first
live/recently-dead tuple in the chain. OTOH, defragmentation moves tuples
around to repair fragmentation. IOW, defragmentation is nothing but calling
PageRepairFragmentation on the page.

For example, if we have a HOT chain of 5 tuples:

1 -> 2 -> 3 -> 4 -> 5

of which 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE.

At the end of pruning:

- the line pointer of 1 is redirected to 4
- the line pointers of 2 and 3 are marked ~LP_USED
- the offsets of 4 and 5 are unchanged

At the end of defragmentation:

- 4 and 5 are moved to the end of the page to create a larger
contiguous free space in the page.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote:



 Right.  My point is that pruning only modifies item pointers.  It does
 not change the actual heap tuples.  In the quote above, how is Heikki's
 quick pruning different from the pruning you are describing?



I think the only difference is that quick pruning does not mark the
intermediate tuples ~LP_USED and hence we may avoid WAL logging.

Btw, I am not too enthusiastic about quick pruning, because it leaves
behind dead heap-only tuples which are not reachable from any root
heap tuple. Not that we can't handle them, but I am worried about
making such changes right now unless we are sure about the benefits.
We can always tune and tweak in 8.4.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-10 Thread Pavan Deolasee
On 9/11/07, Gregory Stark [EMAIL PROTECTED] wrote:




 You could mark such tuples with LP_DELETE. That would also let other
 transactions quickly tot up how much space would be available if they
 were to run PageRepairFragmentation.





IMHO we are coming full circle here. We have already tried LP_DELETE
techniques and moved away from them to simplify things. We also tried
reusing dead space without running PageRepairFragmentation. Each of these
techniques worked just fine, with slightly different performance
characteristics. What we now have is a simplified algorithm which is much
easier to follow and is safer, yet gives us a very good performance boost.
I am not sure this is the right time to throw in new ideas, because we
would never be sure that what we are doing is the most optimal solution.
Would it help if we go with some solution right now, get the rest of the
review process done, and then use the feedback during beta testing to tune
things? We would have far more data points at that time to choose one
technique over another. And we would also know what areas to focus on.

I am also worried that by focusing too much on this issue we may
overlook some other correctness issue in the patch.

From whatever we have discussed so far, IMHO we should do the following
things and let the rest of the review process proceed:

- Defragment a page only when the free space left in the page is not
enough to accommodate even a single tuple (using the average tuple length
for this decision). This means we might defragment even though
there is no immediate UPDATE to the page. But we can treat this as a
fillfactor which allows us to provision for the next UPDATE coming to
the page. Since we are defragmenting when the page is almost full,
hopefully we would reclaim a good amount of space in the page and
won't call defragmentation for the next few UPDATEs.

We already have a mechanism to track the average tuple request size in
the relcache. Maybe we can have some relcache invalidation to keep the
information in sync (send an invalidation when the average request size
changes by, say, 5%).

- Avoid pruning chains on every index or seq lookup. But if the chain
becomes longer than X tuples, mark the page to be pruned in the
next lookup. We could choose to separate pruning and defragmentation
and only do pruning in this case, but I would prefer to keep them
together for now.

- Track the minimum xmin in the page header to avoid repeated
(wasted) attempts to prune a Prunable page in the presence of long running
transactions.

We can save the rest of the techniques for the beta testing period or 8.4.


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 15

2007-09-05 Thread Pavan Deolasee
On 9/6/07, Gregory Stark [EMAIL PROTECTED] wrote:

 Tom Lane [EMAIL PROTECTED] writes:

 
  ISTM the only time we should be doing HOT cleanup work is when we are
  trying to insert a new tuple on that page --- and maybe even only when
  we try and it doesn't fit straight off.  Otherwise you're pushing
  maintenance work into foreground query pathways, which is exactly where
  I don't think we should be headed.

 Ah, as I understand it you can't actually do the pruning then because the
 executor holds references to the source tuple on the page. In other words
 you can never get the vacuum lock there because you already have the page
 pinned yourself.


I don't think the executor holding a reference is a problem, because when
we check for the vacuum lock we have already pinned the page anyway.
But moving the old tuple around deep down in the UPDATE code path
(when we realize that there is no free space) is an issue. I know Heikki
tried to do it this way - but then moved the pruning code to the lookup
code. Heikki?

Another real problem with doing pruning only in the UPDATE path is that
we may end up with long HOT chains if the page does not receive an
UPDATE after many consecutive HOT updates. Every lookup of the visible
tuple in such a chain would be CPU-expensive, since it would require
following a long chain.

The downside is of course that SELECT may occasionally do more work.
But since we ensure that we do the extra work only when there is no
contention on the page, ISTM that the downside is limited.


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] Lazy xid assignment V4

2007-09-04 Thread Pavan Deolasee
On 9/4/07, Florian G. Pflug [EMAIL PROTECTED] wrote:

 Hi

 Here is an updated patch, following the discussion.
 The patch can be found at: http://soc.phlo.org/lazyxidassign.v4.patch
 (It seems I still can't get attachments through to this list)


I haven't been able to follow the discussions here, but do I need to worry
about its interaction with HOT?

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 14

2007-08-31 Thread Pavan Deolasee
On 8/31/07, Pavan Deolasee [EMAIL PROTECTED] wrote:



 In fact, now that I think about it there is no other
 fundamental reason to not support HOT on system tables. So we
 can very well do what you are suggesting.



On second thought, I wonder if there is really much to gain by
supporting HOT on system tables and whether it would justify all
the complexity. Initially I thought about CatalogUpdateIndexes, to
which we would need to teach HOT. Later I also got worried about
building the HOT attribute lists for system tables and handling
all the corner cases for bootstrapping and catalog REINDEX.
It might turn out to be straightforward, but I am not able to
establish that with my limited knowledge in the area.

I would still vote for disabling HOT on catalogs unless you see
strong value in it.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 14

2007-08-31 Thread Pavan Deolasee
On 9/1/07, Tom Lane [EMAIL PROTECTED] wrote:


 I see this the other way around: if it doesn't work on system catalogs,
 it probably doesn't work, period.  I'm not in favor of treating the
 catalogs differently.


Now that I hear you, I know what to do next :-)

I don't think there is any fundamental problem with system catalogs;
it's only the additional complexity that I was worried about. Anyway,
I will rework things as per your suggestion. And I take your point that
making it work on all tables will give us more confidence in the code.

Thanks,
Pavan

P.S. Next week is bad for me :-(  I am on vacation on Thursday/Friday,
and for the remaining days I may not be able to spend extra cycles
apart from regular working hours.

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Pavan Deolasee
On 8/30/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:
  Please see the version 14 of HOT patch attached.

 I expected to find either a large new README, or some pretty substantial
 additions to existing README files, to document how this all works.
 The comments included do not represent nearly enough documentation.


I shall take that up. There are a couple of documents posted by Heikki and
Greg, apart from several email threads and the original design doc by Simon.
I shall consolidate everything into a single README.


One thing I was unable to discern from the comments is how CREATE INDEX
 can work at all.  A new index might mean that tuples that could formerly
 legally be part of the same hot-chain no longer can.  I can't find any
 code here that looks like it's addressing that.



You are right - a new index might mean that an existing HOT chain
is broken as far as the new index is concerned. The way we address
that is by indexing the root tuple of the chain, but the index key is
extracted from the last tuple in the chain. The index is marked unusable
for all those existing transactions which can potentially see any
intermediate
tuples in the chain.

Please see this document written by Greg Stark. He has nicely summarized
how CREATE INDEX and CREATE INDEX CONCURRENTLY work with HOT.

http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php


I also don't think I
 believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
 tuples: what if the index completion commits, but the concurrent delete
 rolls back?  Then you've got a valid tuple that's not in the index.



Since CREATE INDEX works with a ShareLock on the relation, we can
safely assume that we can't find any DELETE_IN_PROGRESS tuples except
those deleted by our own transaction. The only exception is system relations,
but we don't do HOT updates on system relations. Given that, it seems OK
to me to ignore these tuples because if the transaction aborts,
CREATE INDEX is aborted as well. Am I overlooking something here ?
There is a comment to that effect in the current code as well.


 I also took this opportunity to remove the modularity invasion caused
  by heap_check_idxupdate() since it was using resultRelInfo. We now
  build the list of attributes that must be checked to satisfy HOT update.
  This list includes all the index columns, columns in the partial index
  predicates and expression index expressions and is built in the
  executor.

 The executor is the wrong place for that: I'm not sure why you think
 that making heapam depend on the executor is a modularity improvement.



Earlier (in the previous version of the HOT patch) we were passing ResultRelInfo
to heap_update, which was even uglier. The comment is in that context :-)


Furthermore this approach requires recalculating the list during
 each query, which is wasteful when it could only change during schema
 updates.  I'd suggest making the relcache responsible for computing and
 saving this data, along the same lines as RelationGetIndexList().
 (That also means that heapam can get the data from the relcache, saving
 a lot of API-bloating from passing around Attrids explicitly.)
 Also, rather than inventing Attrids, I'd suggest just using one
 Bitmapset with the convention that its contents are offset by
 FirstLowInvalidHeapAttributeNumber.



I liked all these suggestions. I know I thought about computing
the attribute list in relcache, but probably avoided to keep things simple.
I shall make these changes.
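For the archives, a rough sketch of what the relcache computation could look
like (modeled on RelationGetIndexList(); the helper name is made up, and
expression/predicate attributes are glossed over):

    /* Sketch: union of all indexed columns, offset so that system
     * attributes (which have negative attnos) fit in a Bitmapset. */
    static Bitmapset *
    RelationGetIndexedAttrs(Relation relation)  /* hypothetical name */
    {
        Bitmapset  *attrs = NULL;
        List       *indexoidlist = RelationGetIndexList(relation);
        ListCell   *l;

        foreach(l, indexoidlist)
        {
            Relation    indexDesc = index_open(lfirst_oid(l),
                                               AccessShareLock);
            int         i;

            for (i = 0; i < indexDesc->rd_index->indnatts; i++)
            {
                int     attrnum = indexDesc->rd_index->indkey.values[i];

                /* attrnum == 0 means an expression column; the
                 * attributes it references (and any predicate
                 * attributes) would have to be pulled from the index
                 * expressions -- omitted here */
                if (attrnum != 0)
                    attrs = bms_add_member(attrs,
                            attrnum - FirstLowInvalidHeapAttributeNumber);
            }
            index_close(indexDesc, AccessShareLock);
        }
        list_free(indexoidlist);
        return attrs;
    }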


The redefinition of the value of MaxHeapTuplesPerPage seems pretty
 ugly and unprincipled.  I think it'd be better to leave it as-is,
 and just enforce that we don't allow more than that many line pointers
 on a heap page (as indeed you have to do anyway, so it's not clear
 what the change is buying).



The only reason to redefine MaxHeapTuplesPerPage on the higher side is
that HOT allows us to accommodate more tuples per page thanks to
redirect-dead line pointers. For a table with sufficiently large tuples,
the original bound would work well, but for very small tuples we might
hit the line pointer limit even if there is free space available in the
page. Doubling the value was chosen as a balance between heap page
utilization, line pointer bloat and overhead for bitmap scans. But
I agree that the factor choice is rather arbitrary.
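For reference, the stock definition already assumes the densest possible
packing of zero-data tuples - roughly this (quoting from memory, so check
htup.h):

    #define MaxHeapTuplesPerPage \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + \
                 sizeof(ItemIdData))))

A redirect-dead slot consumes only the sizeof(ItemIdData) line pointer, with
no tuple body, which is why HOT can in principle exceed that density in line
pointers alone.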



 Is it really necessary to add hot_update to xl_heap_update?  Seems the
 information should be available from the tuple header fields.



I think it's not necessary. The reason I did it that way is that the new block
might be backed up in the WAL record and hence we might have recorded
the new tuple infomasks. But we can surely handle that corner case. I shall
fix this.


Have you demonstrated that the prune_hard logic is worth its weight?
 Considering that many of us want to drop VACUUM FULL, adding more
 complexity to it doesn't seem like a profitable use of time.



The prune_hard code is a lot simpler

Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Pavan Deolasee
On 8/30/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:
  You are right - a new index might mean that an existing HOT chain
  is broken as far as the new index is concerned. The way we address
  that is by indexing the root tuple of the chain, but the index key is
  extracted from the last tuple in the chain. The index is marked
 unusable
  for all those existing transactions which can potentially see any
  intermediate tuples in the chain.

 I don't think that works --- what if the last tuple in the chain isn't
 committed good yet?  If its inserter ultimately rolls back, you've
 indexed the wrong value.



I am confused. How could we get a ShareLock on the relation while
there is some open transaction which has inserted a tuple in the
table ? (Of course, I am not considering the system tables here.)
If the transaction which inserted the last tuple in the chain has already
aborted, we are not indexing that tuple (even if that tuple is at the
end). We would rather index the last committed-good tuple in the
chain. Running the tuples through HeapTupleSatisfiesVacuum guarantees
that, doesn't it ?

 Please see this document written by Greg Stark. He has nicely summarized
  how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT.
  http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

 Isn't the extra machination for C.I.C. just useless complication?
 What's the point of avoiding creation of new broken HOT chains when
 you still have to deal with existing ones?



IMHO the extra step in C.I.C simplifies the index build. The
transaction-wait
takes care of the existing broken chains and the early catalog entry for
the index helps us avoid creating new broken chains. I am not against
doing it a different way, but this is the best solution we could come up
with when we worked on CIC.


 I also don't think I
  believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
  tuples: what if the index completion commits, but the concurrent delete
  rolls back?  Then you've got a valid tuple that's not in the index.

  Since CREATE INDEX works with ShareLock on the relation, we can
  safely assume that we can't find any DELETE_IN_PROGRESS tuples except
  those deleted by our own transaction. The only exception is system
 relation,
  but we don't do HOT updates on system relation.

 That chain of reasoning is way too shaky for me.  Essentially what
 you're saying is you'll promise not to corrupt non-system indexes.
 Nor am I really thrilled about having to disable HOT for system
 catalogs.



I am not against supporting HOT for system catalogs. But IMHO
it's not a strict requirement, because system catalogs are small, less
frequently updated, and HOT adds little value to them. If we can't have
a generic solution which works for both system and non-system tables,
that's the best we can get. We can start with non-system
tables and expand the scope later.


 The only reason to redefine MaxHeapTuplesPerPage to higher side is
  because HOT allows us to accommodate more tuples per page because
  of redirect-dead line pointers.

 No, it doesn't allow more tuples per page.  It might mean there can be
 more line pointers than that on a page, but not more actual tuples.
 The question is whether there is any real use in allowing more line
 pointers than that --- the limit is already unrealistically high,
 since it assumes no data content in any of the tuples.  If there is a
 rationale for it then you should invent a different constant
 MaxLinePointersPerPage or some such, but I really think there's no
 point.



I agree. I think the current limit on MaxHeapTuplesPerPage is sufficiently
large. Maybe we can keep it the same and tune it later if we have numbers
to prove its worth.


 Doubling the value is chosen as a balance between heap page
  utilization, line pointer bloating and overhead for bitmap scans.

 I'd say it allows a ridiculous amount of line pointer bloat.



OK. Let's keep MaxHeapTuplesPerPage unchanged.


 Even if it's safe, ISTM what you're mostly accomplishing there is to
  expend a lot of cycles while holding exclusive lock on the page, when
  there is good reason to think that you're blocking other people who are
  interested in using the page.  Eliminating the separation between that
  and cleanup would also allow eliminating the separate PD_FRAGMENTED
  page state.

  The reason we did it that way because repairing fragmentation seems
  much more costly that pruning. Please note that we prune a single
  chain during index fetch. Its only for heap-scans (and VACUUM) that
  we try to prune all chains in the page. So unless we are doing
 heap-scan,
  I am not sure if we are spending too much time holding the exclusive
  lock. I agree we don't have any specific numbers to prove that though.

 If you don't have numbers proving that this extra complication has a
 benefit, I'd vote for simplifying it.  The SnapshotAny case is going to
 bite other people besides you

Re: [PATCHES] HOT patch - version 14

2007-08-30 Thread Pavan Deolasee
On 8/31/07, Tom Lane [EMAIL PROTECTED] wrote:

 Pavan Deolasee [EMAIL PROTECTED] writes:

 Not if someone else releases lock before committing.  Which I remind you
 is a programming technique we use quite a lot with respect to the system
 catalogs.  I'm not prepared to guarantee that there is no code path in
 core (much less in contrib or third-party addons) that doesn't do it on
 a user table; even less that no such path will ever appear in future.
 As things stand it's a pretty harmless error --- the worst consequence
 I know of is that a VACUUM FULL starting right then might bleat and
 refuse to do shrinking.  What you propose is to turn the case into a
 cause of silent index corruption.

 A more robust solution would be to wait out the source transaction if
 CREATE INDEX comes across an INSERT_IN_PROGRESS or DELETE_IN_PROGRESS
 HOT tuple.  Or you could throw an error, since it's not really likely to
 happen (except maybe in a system catalog REINDEX operation).  But the
 way the patch approaches this at the moment is far too fragile IMHO.



OK. I looked at the code again. In CVS HEAD we assume that we
should never see a DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple
unless it's deleted/inserted by our own transaction or we are indexing a
system table. Except for these cases, we throw an error.

With HOT, we keep it the same, i.e. if we see a
DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple, we throw an error,
unless it's deleted/inserted by us or this is a system table.
What we change is that if we find a DELETE_IN_PROGRESS tuple deleted
by our own transaction and it's HOT-updated, we skip that tuple. This is
fine because if the CREATE INDEX commits, the DELETE is also committed
and the tuple is not visible (I am still assuming the indcreatexid mechanism
is in place).

So if we don't do HOT update on system tables, the current algorithm
should work fine because we should never find a HOT updated tuple
in the system table and the error-out mechanism should ensure
that we don't corrupt user tables.

So I guess what you are suggesting is to turn on HOT on system tables
and then wait out any DELETING/INSERTING transaction if we find such
a tuple during CREATE INDEX. I think this would work, and since we are
talking about system tables, the wait-out business won't be harmful - I
remember there were objections when I suggested this as a general solution.
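A sketch of how that wait-out could look inside the index-build heap scan
(a simplified fragment; the real switch has more cases and a recheck loop):

    switch (HeapTupleSatisfiesVacuum(heapTuple->t_data, OldestXmin, buffer))
    {
        case HEAPTUPLE_INSERT_IN_PROGRESS:
            xwait = HeapTupleHeaderGetXmin(heapTuple->t_data);
            if (!TransactionIdIsCurrentTransactionId(xwait))
            {
                /* must not index a tuple whose fate is undecided:
                 * let go of the buffer lock and wait out the inserter */
                LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                XactLockTableWait(xwait);
                goto recheck;   /* retake the lock and look again */
            }
            break;

        case HEAPTUPLE_DELETE_IN_PROGRESS:
            xwait = HeapTupleHeaderGetXmax(heapTuple->t_data);
            if (!TransactionIdIsCurrentTransactionId(xwait))
            {
                LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                XactLockTableWait(xwait);
                goto recheck;
            }
            break;

        /* ... LIVE, DEAD, RECENTLY_DEAD cases elided ... */
    }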

If we approach it this way, we might also be able to jettison some of
 the other complexity such as idxcreatexid.



I doubt it, unless we replace it with something better. I think indcreatexid
serves
another purpose. While building an index, we skip any RECENTLY_DEAD
HOT-updated tuples. This is required to handle the existing HOT chains
which are broken with respect to the new index. Essentially what it means
is we don't want to index anything other than the last committed-good tuple
in the HOT chain. So when we skip any intermediate RECENTLY_DEAD tuples,
which are potentially visible to the existing running transactions, we want
those transactions NOT to use this new index.



  IMHO the extra step in C.I.C simplifies the index build.

 If you make the change suggested above, I think you don't need to do
 things differently in C.I.C.


I am not sure if I follow you correctly here. The issue with CIC is that
it works with a snapshot. So during the initial index build, we might index
a tuple which is in the middle of a broken HOT chain. In the second phase,
we don't just need to index the tuple at the head of the chain, but also
remove the previously inserted entry; otherwise there would be two
references from the index to the same root heap tuple.

The additional step does two things:

1) Create catalog entry - stops any new broken HOT chains being created
2) Wait-out existing transactions - makes sure that when the index is built,
we only index the last committed good tuple in the chain.


 
  Since CREATE INDEX works with ShareLock on the relation, we can
  safely assume that we can't find any DELETE_IN_PROGRESS tuples except
  those deleted by our own transaction. The only exception is system
  relation, but we don't do HOT updates on system relation.

 Same issue as above: this makes correctness utterly dependent on nobody
 releasing locks early.



As I said above, we in fact throw an error for user tables. But if we want
to support HOT updates on system tables, we may need to do the
wait-out business. In fact, now that I think about it there is no other
fundamental reason to not support HOT on system tables. So we
can very well do what you are suggesting.


 OK.  So if I get you correctly, you are suggesting to acquire cleanup
 lock.
  If we don't get that, we don't to any maintenance work. Otherwise, we
 prune
  and repair fragmentation in one go.

 Yeah, that's what I'm thinking; then there's no need to track a separate
 page state where we've pruned but not defragmented.



I agree. Let's keep it simple and we can always improve it later. The
only thing that worries me is how

Re: [PATCHES] HOT patch - version 14

2007-08-20 Thread Pavan Deolasee
On 8/20/07, Tom Lane [EMAIL PROTECTED] wrote:

 Heikki Linnakangas [EMAIL PROTECTED] writes:
  I see that you have a separate bitmapset to keep track of indexes on
  system attributes. But having an index on a system attribute doesn't
  make any sense, does it?

 Counterexample: OID.



Right. Further, we allow creating indexes on system attributes. So we
must support those.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 11

2007-08-07 Thread Pavan Deolasee
On 8/2/07, Pavan Deolasee [EMAIL PROTECTED] wrote:




 . It would also be better if we didn't emit a
  separate WAL record for defraging a page, if we also prune it at the
  same time. I'm not that worried about WAL usage in general, but that
  seems simple enough to fix.



 Ah I see. I shall fix that.



When I started making this change, I realized that we need the
second WAL record because if the block is backed up in the pruning
WAL write, we may never call PageRepairFragmentation during
the redo phase. Of course, we can fix that by making
heap_xlog_clean always repair page fragmentation irrespective
of whether the block was backed up, but that doesn't seem like
a good solution.
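For context, the redo side follows the usual pattern of that era, which is
what makes the backed-up-block case awkward (a simplified sketch, not the
patch's actual redo routine):

    static void
    heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
    {
        /* If a full-page image of the block was written, redo is
         * skipped: the restored image already reflects everything up to
         * this record. Any work logged only in a *separate* record
         * (e.g. a later defrag record) must still be replayed on its
         * own, which is why the second record can't simply be dropped. */
        if (record->xl_info & XLR_BKP_BLOCK_1)
            return;

        /* ... otherwise re-apply the pruning to the page ... */
    }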



Thanks,
Pavan



-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 11

2007-08-02 Thread Pavan Deolasee
On 8/2/07, Pavan Deolasee [EMAIL PROTECTED] wrote:



 On 8/2/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:
 
  Maybe a nicer
  solution would be to have another version of ConditionalLockBuffer with
  three different return values: didn't get lock, got exclusive lock, or
  got cleanup lock.



That's a good idea. I shall do that.



On second thought, I feel it may not be such a good idea to change
the ConditionalLockBuffer return value. A boolean is the most natural
way, and we don't save anything in terms of BufHdr locking.

So maybe we should just have two different functions and do the
BufferIsLockedForCleanup check immediately after acquiring the
exclusive lock.
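Something like this, reusing the patch's existing functions (a sketch; the
exact heap_page_prune/heap_page_prune_defrag signatures are as per the patch):

    /* Sketch: keep ConditionalLockBuffer() boolean, and test for the
     * cleanup condition immediately after getting the exclusive lock,
     * before anyone else can pin the page and spoil the test. */
    if (ConditionalLockBuffer(buffer))
    {
        if (BufferIsLockedForCleanup(buffer))
            heap_page_prune_defrag(relation, buffer);   /* full cleanup */
        else
            heap_page_prune(relation, buffer);          /* prune only */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }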


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 11

2007-08-02 Thread Pavan Deolasee
On 8/2/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:

 Pavan Deolasee wrote:
  Please see the attached version 11 of HOT patch

 Thanks!

 One wrinkle in the patch is how the ResultRelInfo-struct is passed to
 heap_update, and on to heap_check_idxupdate, to check any indexed
 columns have changed. I think that's a modularity violation, heap_update
 really shouldn't have to deal with ResultRelInfo, that belongs in the
 executor. When we add support for expression and partial indexes,
 heap_check_idxupdate will need even more executor machinery to be able
 to evaluate expressions.



The reason I put it there is that we wanted to do that check
as late as possible, once we confirm that the update is possible and
there is enough space in the block to perform a HOT update. But
I agree that's a modularity violation. Any suggestion to avoid that ?


In heap_page_prune_defrag, it would be better to do the test for
 BufferIsLockedForCleanup right after acquiring the lock. The longer the
 delay between those steps, the bigger the chances that someone pins the
 page and starts to wait for the buffer lock, making us think that we
 didn't get the cleanup lock, though we actually did. Maybe a nicer
 solution would be to have another version of ConditionalLockBuffer with
 three different return values: didn't get lock, got exclusive lock, or
 got cleanup lock.



Thats a good idea. I shall do that.


It's not necessary to WAL-log the unused-array that
 PageRepairFragmentation returns. In replay, a call to
 PageRepairFragmentation will come to the same conclusion about which
 line pointers are not used. It would also be better if we didn't emit a
 separate WAL record for defraging a page, if we also prune it at the
 same time. I'm not that worried about WAL usage in general, but that
 seems simple enough to fix.



Ah I see. I shall fix that.

Thanks,
Pavan



-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patch - version 11

2007-08-01 Thread Pavan Deolasee
On 8/1/07, Simon Riggs [EMAIL PROTECTED] wrote:

 On Wed, 2007-08-01 at 14:36 +0530, Pavan Deolasee wrote:


BufferIsLockedForCleanup() should be named BufferIsAvailableForCleanup().
There is no cleanup mode; what we mean is that there is only one pin.
The comments say If we are lucky enough to have already acquired the
cleanup lock, which made me look for the point where we had executed
LockBufferForCleanup(), which we never do.



Ok. I am open to suggestions here as well as other parts of the
code about naming.


heap_page_prune() needs to comment what prune_hard means. I think you
 mean true=prune page, false=prune just this hot chain. It might be
 better to have a prune type so that the code reads as PRUNE_HOT_CHAIN or
 PRUNE_WHOLE_PAGE. Does it means the same thing as in
 heap_prune_tuplechain()?



No. prune_hard means remove all RECENTLY_DEAD tuples before
a DEFINITELY DEAD tuple. We use this only for VACUUM FULL.

The pruneoff is set to InvalidOffsetNumber to prune all tuple chains in
the page.


I'm concerned that HeapTupleSatisfiesAbort() may be very expensive and
 yet almost never return true. Do we need this code at this point?



We actually run SatisfiesVacuum on all the tuples anyway. We
should expect hint bits to be set and hence the call should not be
expensive. We need this to reclaim aborted heap-only tuples
which are otherwise not reachable.


It would be good to put some explanatory notes somewhere.
 Do all calls to PageGetFreeSpace() get replaced by
 PageHeapGetFreeSpace()?


Ok. Would do that. Right now, PageHeapGetFreeSpace is used
for heap pages.


To answer some of your XXXs:



 XXX Should we set PageSetPrunable on this page ? If the transaction
 aborts, there will be a prunable tuple in this page.
 Think you need to explain a little more...but we probably shouldn't tune
 for aborts.



If the INSERTing transaction aborts, we are left with a DEAD
tuple in the page which can be pruned and removed. So the
question is whether we should set the hint bit. If we don't set
the hint bit, the aborted DEAD tuple would stay there till the
next vacuum or some other tuple in the page is UPDATED/DELETEd
and the page is cleaned up.


XXX Should we set hint on the newbuf as well ? If the transaction
 aborts, there would be a prunable tuple in the newbuf.
 Think you need to explain a little more...but we probably shouldn't tune
 for aborts.



Same as above.


XXX Should we update the FSM information of this page ?
 (after pruning)
 No. Updating FSM would only take place if updates are taking place on
 this block. We would make this page a target for other INSERTs and
 UPDATEs, so would effectively bring more contention onto those pages.
 Bad Thing.



On the other hand, if we don't update the FSM, all COLD updates may
result in extending the relation even though there is some free space
available elsewhere in the relation. One option could be to leave
fillfactor worth of space and update the FSM accordingly.
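A sketch of that option (hypothetical code; the 8.3-era FSM call and the
reserve computation would need checking against the tree):

    /* Sketch: after pruning, advertise only the space above the
     * fillfactor reserve, so the page doesn't become an INSERT target */
    Size    freespace = PageHeapGetFreeSpace(page);
    Size    reserve   = RelationGetTargetPageFreeSpace(relation,
                                                       HEAP_DEFAULT_FILLFACTOR);

    if (freespace > reserve)
        RecordPageWithFreeSpace(&relation->rd_node, blkno,
                                freespace - reserve);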


XXX We may want to tweak the percentage value below:
 (1.2 * RelationGetAvgFSM(relation)
 Not sure what difference the 1.2 makes...



That's a quick hack. I thought of leaving some margin because
the avg FSM request size is a moving average and there is a good
chance of erring.


So that means I agree with all of your XXXs apart from the last one.

  The logic of chain pruning has been simplified a lot. In addition,
  there
  are fewer new WAL log records. We rather reuse the existing WAL
  record types to log the operations.

 I can't find any mention of XLogInserts for the new record types.

 e.g. XLOG_HEAP2_PRUNEHOT is only mentioned on one line in this patch.

 I this a mistake, or could you explain?



Sorry, that's just a leftover from the previous version. I shall clean
that up.




 It would be good to see the results...



I shall post them soon.


Thanks,
Pavan





-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT latest patch - version 8

2007-07-23 Thread Pavan Deolasee

On 7/15/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:



 Actually
storing InvalidOffsetNumber in lp_off is a bit bogus in the first place
since lp_off is unsigned, and InvalidOffsetNumber is -1, so I fixed that
as well.




I see InvalidOffsetNumber as 0 in off.h:26

#define InvalidOffsetNumber ((OffsetNumber) 0)

So I think we should be OK to use that to indicate redirect-dead
pointers.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] [pgsql-patches] Ctid chain following enhancement

2007-06-01 Thread Pavan Deolasee

On 6/2/07, Bruce Momjian [EMAIL PROTECTED] wrote:




OK, removed from 8.4 queue.




I am OK with this, though I personally never felt that it complicated
the code :-)

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] [pgsql-patches] Ctid chain following enhancement

2007-05-30 Thread Pavan Deolasee

On 5/29/07, Zdenek Kotala [EMAIL PROTECTED] wrote:



I'm looking at your patch. I have one comment:

If you have the old tid and the new tid, you can easily check whether the
new tid points to a different page. And if the page is still the same,
there is no reason to unlock it and lock it again. I think adding an inner
loop, something like:



Tom has already expressed his unwillingness to add complexity
without proven benefits. Your suggestion, though good, would
make the code less readable without much benefit, since the
function is not called very often.


Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] Maintaining cluster order on insert

2007-05-21 Thread Pavan Deolasee

On 5/19/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:



Ah, sorry about that. For some reason my source tree was checked out
from the 8.2 branch, instead of CVS HEAD.



I looked at the patch. Not that I am very comfortable with this part
of the code, but nevertheless here are my comments:


I am wondering if we can optimize the _bt_moveright() code by remembering
the PageLSN of the page before releasing the read lock. If the page
has indeed been changed since we last saw it, the PageLSN must differ
from what we remembered, which gives us a hint that we might need
to move right.
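Roughly like this (an untested sketch):

    /* Sketch: skip the move-right probe if the page provably
     * didn't change while we had it unlocked */
    XLogRecPtr  oldlsn = PageGetLSN(page);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    /* ... whatever required releasing the read lock ... */
    LockBuffer(buf, BT_READ);

    if (XLByteEQ(PageGetLSN(page), oldlsn))
    {
        /* no WAL-logged change hit this page in between, so the
         * key comparison for moving right can be skipped */
    }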


!   /* Page was full. Release lock and pin and get another block
!* as if suggested_blk was not given.
!*/
!   LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK);
!   ReleaseBuffer(suggested_buf);

Maybe you want to use UnlockReleaseBuffer() or _bt_relbuf() just as
shorthand.
Btw, I wonder why _bt_relbuf() exists, and even so, why it takes a
Relation
argument ?

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] Maintaining cluster order on insert

2007-05-21 Thread Pavan Deolasee

On 5/21/07, Pavan Deolasee [EMAIL PROTECTED] wrote:




On 5/19/07, Heikki Linnakangas [EMAIL PROTECTED]  wrote:


 Ah, sorry about that. For some reason my source tree was checked out
 from the 8.2 branch, instead of CVS HEAD.


I looked at the patch. Not that I am very comfortable with this part
of the code, but nevertheless here are my comments:



Another problem that I noticed with the patch is that it disregards
the fillfactor while inserting the tuples. ISTM that this is correct
behavior when a tuple is being inserted into a block to preserve cluster
ordering. But when the tuple is being inserted as a normal operation,
IMO we should respect the fillfactor.

The easiest way to reproduce this is to insert tuples sorted on the
cluster index key.

Thanks,
Pavan


--

EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] Concurrent psql patch

2007-05-18 Thread Pavan Deolasee

Hi Greg,

I looked at the patch briefly. I couldn't spot any issues and it looks good
to me. I've just a couple of comments here.

The mvcc regression test files are missing in the patch.

--- 1179,1189 
 dbname, user, password);

   /* We can immediately discard the password -- no longer needed */
!   if (password)
!   {
!   memset(password, '\0', strlen(password));
   free(password);
+   }


Any reason why we do this ? password is freed anyway. I think you
might have left it behind after some debugging exercise.


--- 25,37 
 #include "mb/pg_wchar.h"
 #include "mbprint.h"

+ #if 0
+ #include "libpq-int.h" /* For PG_ASYNC */
+ #endif
+

This looks redundant..

Apart from that, I really like consistent coding style. For example, to me,
for (i = 0; i < 10; i++) looks much better than for (i=0;i<10; i++)

This is not a comment on your patch, and neither am I saying
we should follow a specific coding style (though I wish we could have done
so) because we already have so many different styles. So it's best to
stick to the coding style already followed in that particular file. But a few
simple rules, like having a single space around operators like '<', '+', '='
etc., really make the code more readable. Other examples are using
parentheses in the right manner to improve code readability.

flag = (pointer == NULL); is more readable than
flag = pointer == NULL;

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT patches

2007-05-07 Thread Pavan Deolasee

On 5/7/07, Zoltan Boszormenyi [EMAIL PROTECTED] wrote:




I tried your patchset on current CVS (some minutes old) and
got this single regression below.



Thanks for testing the patch. This failure is expected and I had mentioned
it when I posted v7. With HOT, CIC is now a three-step process
and a failure after the first step will leave an invalid index behind. In
this particular case, CIC fails because of duplicate keys.

I deliberately did not fix the regression output, to highlight this change.

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT Patch - Ready for review

2007-04-20 Thread Pavan Deolasee

On 4/19/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:



What's the purpose of the HeapScanHintPagePrune mechanism in index
builds? I lost track of the discussion on create index; is it
necessary for correctness?



It's not strictly required for correctness, but it helps us prune the
HOT chains
while building the index. During the index build, if we skip a tuple which is
RECENTLY_DEAD, existing transactions cannot use the index for queries.
Pruning the HOT chains reduces the possibility of finding such tuples
while building the index.


A comment in IndexBuildHeapScan explaining

why it's done would be nice.



I would do that.


In any case a PG_TRY/CATCH block should be

used to make sure it's turned off after an unsuccessful index build.



Oh thanks. Would do that too.

I would wait for other review comments before submitting a fresh patch.
I hope that's ok.

Thanks,
Pavan
--

EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] HOT + MVCC-safe cluster conflict fix

2007-04-19 Thread Pavan Deolasee

On 4/19/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:


Hi Pavan,

Here's a little patch against CVS HEAD + NewHOT-v7.0.patch to fix the
conflict between MVCC-safe cluster and HOT.

index_getnext is modified to return all tuples in a HOT chain when
called with SnapshotAny. Cluster will insert them all as normal cold
updates.



Thanks Heikki. We might need to tweak it a bit because I think I had
made an assumption that heap_hot_fetch() should be called only on
the root tuple. Anyway, I would look at it.

Thanks

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] Dead Space Map version 3 (simplified)

2007-04-13 Thread Pavan Deolasee

On 3/30/07, ITAGAKI Takahiro [EMAIL PROTECTED] wrote:


Attached is an updated DSM patch. I've left the core function of DSM only
and dropped other complicated features in this release.



I was testing this patch when I got this server crash. The patch is applied
on the current CVS HEAD. I thought you would be interested in this.

The patch worked for smaller scaling factors, and the crash is reproducible.

Test: pgbench -s 90 -i -F 95 postgres

Stack:

(gdb) bt
#0  0x001d37a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00213955 in raise () from /lib/tls/libc.so.6
#2  0x00215319 in abort () from /lib/tls/libc.so.6
#3  0x082dc04f in ExceptionalCondition (conditionName=0x83a7ad7 !(victim),
errorType=0x83a7622 FailedAssertion,
   fileName=0x83a7487 deadspace.c, lineNumber=1080) at assert.c:51
#4  0x0821eb29 in dsm_create_chunk (dsmrel=0xb7bcd744, key=0xbff589c0) at
deadspace.c:1080
#5  0x0821d473 in dsm_record_state (rnode=0xaee02698, pageno=98304,
state=DSM_LOW) at deadspace.c:333
#6  0x0821d29e in DegradeDeadSpaceState (rel=0xaee02698, buffer=10645) at
deadspace.c:254
#7  0x0817b542 in lazy_scan_heap (onerel=0xaee02698, vacrelstats=0x9f5f2e0,
Irel=0x9f5f4dc, nindexes=1, iter=0x9f5f57c)
   at vacuumlazy.c:586
#8  0x0817a733 in lazy_vacuum_rel (onerel=0xaee02698, vacstmt=0x9f39c94) at
vacuumlazy.c:209
#9  0x08174e5c in vacuum_rel (relid=16388, vacstmt=0x9f39c94,
expected_relkind=114 'r') at vacuum.c:1107
#10 0x0817421c in vacuum (vacstmt=0x9f39c94, relids=0x0, isTopLevel=1
'\001') at vacuum.c:401
#11 0x0823d90b in ProcessUtility (parsetree=0x9f39c94, queryString=0x9f62f94
vacuum analyze, params=0x0,
   isTopLevel=1 '\001', dest=0x9f39cf0, completionTag=0xbff5a040 ) at
utility.c:929
#12 0x0823bdd6 in PortalRunUtility (portal=0x9f60f8c, utilityStmt=0x9f39c94,
isTopLevel=1 '\001', dest=0x9f39cf0,
   completionTag=0xbff5a040 ) at pquery.c:1170
#13 0x0823bf0a in PortalRunMulti (portal=0x9f60f8c, isTopLevel=1 '\001',
dest=0x9f39cf0, altdest=0x9f39cf0,
   completionTag=0xbff5a040 ) at pquery.c:1262
#14 0x0823b6df in PortalRun (portal=0x9f60f8c, count=2147483647,
isTopLevel=1 '\001', dest=0x9f39cf0, altdest=0x9f39cf0,
   completionTag=0xbff5a040 ) at pquery.c:809
#15 0x082365df in exec_simple_query (query_string=0x9f399d4 vacuum
analyze) at postgres.c:956
#16 0x08239e43 in PostgresMain (argc=4, argv=0x9ecfc94, username=0x9ecfc64
perf) at postgres.c:3503
#17 0x08204e84 in BackendRun (port=0x9ee3628) at postmaster.c:2987
#18 0x08204493 in BackendStartup (port=0x9ee3628) at postmaster.c:2614
#19 0x0820228b in ServerLoop () at postmaster.c:1214
#20 0x08201c66 in PostmasterMain (argc=3, argv=0x9ecdc50) at postmaster.c
:967
#21 0x081a9e0b in main (argc=3, argv=0x9ecdc50) at main.c:188


--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] [HACKERS] CIC and deadlocks

2007-04-11 Thread Pavan Deolasee

On 4/1/07, Tom Lane [EMAIL PROTECTED] wrote:



Good point.  I'm envisioning a procarray.c function along the
lines of
bool TransactionHasSnapshot(xid)
which returns true if the xid is currently listed in PGPROC
and has a nonzero xmin.  CIC's cleanup wait loop would check
this and ignore the xid if it returns false.  Your point means
that this function would have to take exclusive not shared lock
while scanning the procarray, which is kind of annoying, but
it seems not fatal since CIC isn't done all that frequently.



When I looked at the code, it occurred to me that possibly we are
OK with just taking a shared lock on the procarray. That means that
some other transaction can concurrently set its serializable snapshot
while we are scanning the procarray. But that should not harm us:
if we see the snapshot set, we wait for the transaction. A transaction
which is setting its serializable snapshot NOW cannot see the
tuples that we did not index, can it ?

A patch based on the discussion is attached.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


CIC_deadlock.patch
Description: Binary data



Re: [PATCHES] [HACKERS] CIC and deadlocks

2007-04-11 Thread Pavan Deolasee

On 4/11/07, Tom Lane [EMAIL PROTECTED] wrote:



[ itch... ]  The problem is with time-extended execution of
GetSnapshotData; what happens if the other guy lost the CPU for a good
long time while in the middle of GetSnapshotData?  He might set his
xmin based on info you saw as long gone.

You might be correct that it's safe, but the argument would have to
hinge on the OldestXmin process being unable to commit because of
someone holding shared ProcArrayLock; a point you are definitely not
making above.  (Study the comments in GetSnapshotData for awhile,
also those in xact.c's commit-related code.)



My argument was based on what you said above, but I obviously did not
state it well :)

Anyway, I think it's better to be safe, and we agree that it's not such a
bad thing to take an exclusive lock on the procarray because CIC is not
something that happens very often. Attached is a revised patch which takes
an exclusive lock on the procarray, the rest remaining the same.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


CIC_deadlock_v2.patch
Description: Binary data



Re: [HACKERS] [PATCHES] Optimized pgbench for 8.3

2007-04-07 Thread Pavan Deolasee

On 4/6/07, Tatsuo Ishii [EMAIL PROTECTED] wrote:



BTW, is anybody working on enabling the fill factor to the tables used
by pgbench? 8.3 will introduce HOT, and I think adding the feature
will make it easier to test HOT.



Please see if the attached patch looks good. It adds a new -F option
which can be used to set fillfactor for tellers, accounts and branches
tables. Default is 100 and anything between 10 and 100 is acceptable.
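For example, assuming the patch's option parsing, initializing the tables at
fillfactor 90 would be:

    pgbench -i -F 90 -s 10 postgres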

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


pgbench_fillfactor.patch
Description: Binary data



Re: [PATCHES] HOT WIP Patch - version 6.3

2007-04-04 Thread Pavan Deolasee

On 4/3/07, Bruce Momjian [EMAIL PROTECTED] wrote:



Your patch has been added to the PostgreSQL unapplied patches list at:



Thanks Bruce. I would like to submit at least one more revision
which would include a couple of the TODOs mentioned in my last mail.
I would also like to do some cleanup and commenting to make the
review process easier. I hope to do this before the weekend,
but if someone wants to start reviewing it before that, please let me
know.

Is there something more that I can do to help the review process ?
I've posted almost all the design ideas as and when I was posting
patch revisions. Would it help to consolidate them in a single
document ?

Thanks,
Pavan


--

EnterpriseDB http://www.enterprisedb.com


[PATCHES] HOT WIP Patch - Version 5.0

2007-03-20 Thread Pavan Deolasee



The version 5.0 of the HOT WIP patch is attached. This fixes the
VACUUM FULL issue with HOT. In all the earlier versions, I'd
disabled VACUUM FULL.

When we move a HOT chain, we move the chain but don't carry
over the HOT_UPDATED or HEAP_ONLY flags, and we insert as many index
entries as there are tuples in the chain. IOW the HOT chain
is actually turned into a COLD chain.

Apart from this, I had to make some changes to the VACUUM FULL
code so that the number of indexed tuples is counted
correctly. With HOT, whenever a HEAP_ONLY tuple is moved, an
additional index entry is generated and this needs to be
taken into account.

Please let me know comments/suggestions.


Thanks,
Pavan


--


EnterpriseDBhttp://www.enterprisedb.com



NewHOT-v5.0.patch.gz
Description: GNU Zip compressed data



[PATCHES] HOT WIP Patch - Version 4.4

2007-03-13 Thread Pavan Deolasee


Please see the attached version 4.4 of the HOT WIP patch. I have
fixed a couple of bugs in the earlier version posted. Other than
that, there are no significant changes in the patch.

The row-level fragmentation had a bug where we were
unintentionally sorting the line pointers array more than
once. Also, the defragmented lengths were computed wrongly
and were a source of many errors.

Another bug fix was in the heap_hot_fetch() code path where
we were asserting on finding an LP_DELETEd tuple in the HOT
chain. I later realized that this is not required and we
should rather just assume that the chain is broken, something
very similar to the xmax/xmin checks.


Thanks,
Pavan


--


EnterpriseDBhttp://www.enterprisedb.com



NewHOT-v4.4.patch.gz
Description: GNU Zip compressed data



[PATCHES] HOT WIP Patch - Version 4.1

2007-03-07 Thread Pavan Deolasee



Please see the attached HOT WIP patch, version 4.1. There are
no significant changes since the version 4.0 patch that
I posted a week back.

This patch includes some optimizations for efficiently looking
up LP_DELETEd tuples. I have used the recent changes made by
Tom/Heikki which give us a few bits in the page header. I use one
bit to track whether there are any LP_DELETEd tuples in the page.
The changes to this bit are not WAL-logged and hence the
information might not be accurate. But we should be OK with
that.

Another non-trivial change is the addition of logic to clean
up row-level fragmentation. I have discussed this earlier on
the list, but nevertheless will summarize it again here.

When we reuse LP_DELETEd tuples for UPDATE, we might waste
a few bytes when the original size of the reused tuple is
larger than the new tuple, since the information about the original
length is lost. When we run out of LP_DELETEd tuples of
size equal to or greater than the requested size in the UPDATE path,
we correct the row-level fragmentation, if any. Please note,
we don't move tuples around and hence don't need the
VACUUM-strength lock on the page.

We use another bit in the page header to track whether there are
any fragmented LP_DELETEd tuples in the page. We also need
one bit in the tuple header to track that the particular
tuple was fragmented while being reused. This information
is then used when the tuple is again released and marked
LP_DELETE, to update the page-level hint bit.
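To sketch the bookkeeping (bit values and macro names here are illustrative
only, not the patch's actual ones):

    /* Illustrative sketch of the two page-header hint bits */
    #define PD_HAS_DELETED      0x0001  /* page has LP_DELETEd slots */
    #define PD_HAS_FRAGMENTED   0x0002  /* reused slots carry dead space */

    #define PageHasDeleted(page) \
        (((PageHeader) (page))->pd_flags & PD_HAS_DELETED)
    #define PageSetHasDeleted(page) \
        (((PageHeader) (page))->pd_flags |= PD_HAS_DELETED)
    #define PageClearHasDeleted(page) \
        (((PageHeader) (page))->pd_flags &= ~PD_HAS_DELETED)

Since the bits are mere hints, a stale value only costs a wasted scan of the
line pointer array, never correctness.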


Comments/suggestions ?

Thanks,
Pavan


--


EnterpriseDBhttp://www.enterprisedb.com



NewHOT-v4.1.patch.gz
Description: GNU Zip compressed data



Re: [PATCHES] Heap page diagnostic/test functions (WIP)

2007-03-05 Thread Pavan Deolasee

Simon Riggs wrote:

I'll happily code it as functions or system cols or any other way, as
long as we can see everything there is to see.
  



With HOT, other useful information is about the line pointers. It would be
cool to be able to print the redirection info, details about LP_DELETEd
line pointers, etc. Also, there are some flags in t_infomask2 which HOT
uses, so that information would be useful too.

Thanks,
Pavan

EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] A little COPY speedup

2007-03-01 Thread Pavan Deolasee

Heikki Linnakangas wrote:


Attached is a fix for that. It adds a flag to each heap page that 
indicates that there aren't any free line pointers on this page, so 
don't bother trying. Heap pages haven't had any heap-specific 
per-page data before, so this patch adds a HeapPageOpaqueData-struct 
that's stored in the special space.



I would really like this change. I was thinking along similar lines to
optimize some of the HOT code paths.

Thanks,
Pavan



[PATCHES] Resending - HOT WIP Patch - version 4

2007-03-01 Thread Pavan Deolasee


Resending once again; all my previous attempts seem to have failed.
Is there a limit on patch size on -patches as well ? Attached is
a gzipped version.


Hi All,

Please see the version 4.0 of the HOT WIP patch attached with the mail.
ISTM that this version has one of the most radical changes included, i.e.
the line pointer redirection.

The patch should apply on the current CVS HEAD and pass all regression
tests. I feel that the patch is now in a stable state and I would
really appreciate it if community folks could take a look at it and
provide comments/suggestions.

VACUUM FULL is disabled right now. I am not aware of any design
issues preventing support for VACUUM FULL. I know it's broken, but I do
not want to hold back the patch while we fix it.

The most important change in this patch is the implementation of line
pointer redirection for handling dead root tuples. Please see my post
on hackers for details.

I would post some preliminary numbers in a separate email.
I would also post items where we need attention and discussion
and further work. I would welcome others to test the patch, for
correctness as well as performance.

Comments/suggestions ?

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com



NewHOT-v4.0.patch.gz
Description: GNU Zip compressed data



Re: [PATCHES] HOT WIP Patch - version 3.2

2007-02-27 Thread Pavan Deolasee

On 2/27/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:


Pavan Deolasee wrote:
 - What do we do with the LP_DELETEd tuples at the VACUUM time ?
 In this patch, we are collecting them and vacuuming like
 any other dead tuples. But is that the best thing to do ?

Since they don't need index cleanups, it's a waste of
maintenance_work_mem to keep track of them in the dead tuples array.
Let's remove them in the 1st phase. That means trading the shared lock
for a vacuum-level lock on pages with LP_DELETEd tuples. Or if we want
to get fancy, we could skip LP_DELETEd tuples in the 1st phase for pages
that had dead tuples on them, and scan and remove them in the 2nd phase
when we have to acquire the vacuum-level lock anyway.



I liked the idea of not collecting the LP_DELETEd tuples in the first
pass. We also prune the HOT-update chains in the page in the first
pass; maybe that can also be moved to the second pass. We need to
work carefully on the race conditions involved in the VACUUM, pruning
and tuple reuse, though.



- While searching for a LP_DELETEd tuple, we start from the
 first offset and return the first slot which is big enough
 to store the tuple. Is there a better search algorithm
 (sorting/randomizing) ? Should we go for best-fit instead
 of first-fit ?

Best-fit seems better to me. It's pretty cheap to scan for LP_DELETEd
line pointers, but wasting space can lead to cold updates and get much
more expensive.



Ok. I will give it a shot once the basic things are ready.


You could also prune the chains on the page to make room for the update,

and if you can get a vacuum lock you can also defrag the page.



Yes, that's a good suggestion as well. I am already doing that in the
patch I am working on right now.



- Should we have metadata on the heap page to track the
 number of LP_DELETEd tuples, number of HOT-update chains in the
 page and any other information that can help us optimize
 search/prune operations ?

I don't think the CPU overhead is that significant; we only need to do
the search/prune when a page gets full. We can add flags later if we
feel like it, but let's keep it simple for now.




I am making good progress with the line-pointer redirection stuff.
It's showing tremendous value in keeping the table and index size
under control. But we need to check the CPU overhead as well
and, if required, optimize there.



Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


[PATCHES] HOT WIP Patch - version 3.2

2007-02-24 Thread Pavan Deolasee

Please see the attached WIP HOT patch - version 3.2. It now
implements the logic for reusing heap-only dead tuples. When a
HOT-update chain is pruned, the heap-only tuples are marked
LP_DELETE. The lp_offset and lp_len fields in the line pointer are
maintained.

When a backend runs out of free space in a page while doing an
UPDATE, it searches the line pointers to find a slot
which is marked LP_DELETEd and has enough space to accommodate
the new tuple. If such a slot is found, it's reused. We might
waste some space if the slot is larger than the tuple, but
that gets reclaimed at VACUUM time.
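The search itself is a plain first-fit scan over the line pointer array;
roughly this (a sketch, helper name invented; uses the bufpage.h/itemid.h
macros):

    /* Sketch: find the first LP_DELETEd slot large enough for newlen */
    static OffsetNumber
    find_deleted_slot(Page page, Size newlen)   /* hypothetical helper */
    {
        OffsetNumber    off,
                        maxoff = PageGetMaxOffsetNumber(page);

        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            ItemId  lp = PageGetItemId(page, off);

            if (ItemIdDeleted(lp) && ItemIdGetLength(lp) >= newlen)
                return off;     /* first fit */
        }
        return InvalidOffsetNumber;     /* no reusable slot */
    }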

During VACUUM, tuple chains are pruned, one page at a time.
This is necessary so that a chain is pruned even if the tuple
is not accessed after the update (normally pruning happens
when the tuple is index-fetched).

This patch also implements the logic to convert a dead root
tuple into a stub. This is done by changing the lp_len of the
root tuple's slot to sizeof(HeapTupleHeaderData).
The space released can only be reclaimed at VACUUM time.

Open Issues:
-

There are quite a few:

- What do we do with the LP_DELETEd tuples at the VACUUM time ?
In this patch, we are collecting them and vacuuming like
any other dead tuples. But is that the best thing to do ?

- While searching for a LP_DELETEd tuple, we start from the
first offset and return the first slot which is big enough
to store the tuple. Is there a better search algorithm
(sorting/randomizing) ? Should we go for best-fit instead
of first-fit ?

- Should we have metadata on the heap page to track the
number of LP_DELETEd tuples, number of HOT-update chains in the
page and any other information that can help us optimize
search/prune operations ?

- There are some interesting issues in the VACUUMing area. How
do we count the dead tuples ? Should we count HOT-updated
tuples in the dead count ? If we do so, I noticed that
VACUUM gets triggered on very small tables like tellers
in pgbench and takes several minutes to finish because
it waits very long for a VACUUM-strength lock on the
index pages. The index is just a page or two in this case.

Whats Next ?
--

ISTM that the next important item to do is the handling of dead root
tuples.

There are several designs proposed to handle this, but I don't
think we have consensus on any one design. I am thinking of
going ahead with my proposal of line-pointer indirection.
I am not passing judgement on the other proposed designs, but for
lack of consensus I am choosing the simplest one right now,
which also seems good to me :)

Are there any serious objections to this approach ? This would waste
4 bytes per HOT-updated tuple chain for the lifetime of the
chain, but would provide an excellent opportunity to reuse the
dead root tuple just like any other heap-only dead tuple.

The stub approach would waste 4 bytes of line pointer + 24 bytes
of header, and it requires an index swing to remove the stub.

The cyclic tuple chain would temporarily waste space up to the size of the
root tuple until it gets reused by another update of the same row. Plus
there are concerns about an additional header field for the back pointer
and the complexity involved in locating the visible tuple.

Based on these observations, I am inclined towards the line-pointer
redirection approach. I haven't heard any for/against comments on
this yet. Please let me know if there are objections before I
start working on it.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


NewHOT-v3.2.patch.gz
Description: GNU Zip compressed data



Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2

2007-02-20 Thread Pavan Deolasee

On 2/20/07, Hannu Krosing [EMAIL PROTECTED] wrote:


One fine day, Tue, 2007-02-20 at 12:08, Pavan Deolasee wrote:


What do you do, if there are no live tuples on the page ? will this
un-HOTify the root and free all other tuples in HOT chain ?



Yes. The HOT-updated status of the root and all intermediate
tuples is cleared and their respective ctid pointers are made
to point to themselves. The index entry will be marked LP_DELETE
as in the normal case. VACUUM can subsequently reclaim these
tuples, along with the index entry.




 The intermediate heap-only tuples are  removed from the HOT-update
 chain.
 The HOT-updated status of these tuples is cleared and their respective
 t_ctid are made point to themselves. These tuples are not reachable
 now and ready for vacuuming.

Does this mean, that they are now indistinguishable from ordinary
tuples ?



No. The HEAP_ONLY_TUPLE flag is still set on these tuples, so you
can distinguish them.

Maybe they could be freed right away instead of changing HOT-updated

status and ctid ?



Yeah, that's a good idea. I am thinking of setting the LP_DELETE flag on them
while pruning. The tuple can then be reused for the next in-page HOT update.




 When we run out space for update-within-the-block, we traverse
 through all the line pointers looking for LP_DELETEd items. If any of
 these
 items have space large enough to store the new tuple, that item is
 reused.
 Does anyone see any issue with doing this ? Also, any suggestions
 about doing it in a better way ?

IIRC the size is determined by the next tuple pointer, so you can store
new data without changing tuple pointer only if they are exactly the
same size.



There is an lp_len field in the line pointer to store the length of the
tuple. ISTM that we can reduce that while reusing the line pointer, but
that would create a permanent hole in the page.



 we are
 more concerned about the large tables, the chances of being able to
 upgrade
 the exclusive lock to vacuum-strength lock are high. Comments ?

I'm not sure about the we are more concerned about the large tables
part. I see it more as a device for high-update tables. This may not
always be the same as large, so there should be some fallbacks for
case where you can't get the lock. Maybe just give up and move to
another page ?



Oh, yes. I agree. The fallback option of doing a COLD update always
exists.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2

2007-02-20 Thread Pavan Deolasee

On 2/20/07, Bruce Momjian [EMAIL PROTECTED] wrote:


Pavan Deolasee wrote:
 When following a HOT-update chain from the index fetch, if we notice
that
 the root tuple is dead and it is HOT-updated, we try to prune the chain
to
 the smallest possible length. To do that, the share lock is upgraded to
an
 exclusive lock and the tuple chain is followed till we find a
 live/recently-dead
 tuple. At that point, the root t_ctid is made point to that tuple. In
order

I assume you meant recently-dead here, rather than live/recently-dead,
because we aren't going to change live ctids, right?



No, I meant live or recently-dead (in fact, anything other than
HEAPTUPLE_DEAD
or HEAPTUPLE_DEAD_CHAIN).

We are not changing the tids here, but only pruning the HOT-update chain.
After pruning, the root t_ctid points to the oldest tuple that might be
visible to any backend. The live tuples are still identified by their
original tids and remain reachable from the root tuple via the index.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2

2007-02-20 Thread Pavan Deolasee

On 2/20/07, Tom Lane [EMAIL PROTECTED] wrote:


Pavan Deolasee [EMAIL PROTECTED] writes:
 ... Yes. The HOT-updated status of the root and all intermediate
 tuples is cleared and their respective ctid pointers are made
 point to themselves.

Doesn't that destroy the knowledge that they form a tuple chain?
While it might be that no one cares any longer, it would seem more
reasonable to leave 'em chained together.



I see your point, but as you mentioned, do we really care ? The chain
needs to be broken so that the intermediate DEAD tuples can be
vacuumed. We can't vacuum them normally because they could
be part of a live HOT-update chain. Resetting the HOT-updated
status of the root tuple helps mark the index entry LP_DELETE
once the entire HOT-update chain is dead.

Also, if we decide to reuse the heap-only tuples without even
vacuuming, breaking the chain is a better option since we then
guarantee no references to the heap-only DEAD tuples.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2

2007-02-20 Thread Pavan Deolasee

On 2/20/07, Bruce Momjian [EMAIL PROTECTED] wrote:


Tom Lane wrote:

 Recently dead means still live to somebody, so those tids better not
 change either.  But I don't think that's what he meant.  I'm more
 worried about the deadlock possibilities inherent in trying to upgrade a
 buffer lock.  We do not have deadlock detection for LWLocks.

I am guessing he is going to have to release the lock, then ask for an
exclusive one.



Yes, that's what is done. Since we try to prune the HOT-update chain
even in the SELECT path, we upgrade the lock only if we are sure
that there is at least one tuple that can be removed from the chain
or the root needs to be fixed (broken ctid chain for some reason).

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


[PATCHES] HOT WIP Patch - version 2

2007-02-19 Thread Pavan Deolasee

Reposting - looks like the message did not get through in the first
attempt. My apologies if multiple copies are received.


This is the next version of the HOT WIP patch. Since the last patch that
I sent out, I have implemented the HOT-update chain pruning mechanism.

When following a HOT-update chain from the index fetch, if we notice that
the root tuple is dead and it is HOT-updated, we try to prune the chain to
the smallest possible length. To do that, the share lock is upgraded to an
exclusive lock and the tuple chain is followed till we find a
live/recently-dead tuple. At that point, the root t_ctid is made to point
to that tuple. In order to preserve the xmax/xmin chain, the xmax of the
root tuple is also updated to the xmin of the found tuple. Since this xmax
is also older than RecentGlobalXmin and belongs to a committed transaction,
the visibility of the root tuple still remains the same.
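
Using mainline macro names, the root fix-up amounts to something like
this (a sketch; the WIP patch may differ in detail):

/* redirect the root's chain pointer to the first surviving tuple */
ItemPointerSet(&roottup->t_ctid, blkno, newoffnum);

/* keep the xmax/xmin pairing intact across the pruned gap */
HeapTupleHeaderSetXmax(roottup, HeapTupleHeaderGetXmin(newtup));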

The intermediate heap-only tuples are removed from the HOT-update chain.
The HOT-updated status of these tuples is cleared and their respective
t_ctids are made to point to themselves. These tuples are no longer
reachable and are ready for vacuuming. This entire action is logged in a
single WAL record.

During vacuuming, we keep track of the number of root tuples vacuumed.
If this count is zero, then the index cleanup step is skipped. This
avoids unnecessary index scans whenever possible.
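
In lazy-vacuum terms, the short-circuit is essentially this (the
counter name is my invention):

/* only root line pointers are referenced from indexes, so if none
 * were removed there can be no dangling index entries to clean up */
if (vacrelstats->root_tuples_removed == 0)  /* hypothetical counter */
    return;                                 /* skip the index passes */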

This patch should apply cleanly on current CVS head and pass all regression
tests. I am still looking for review comments on the first WIP patch. If
anyone has already looked through it and is interested in the incremental
changes, please let me know; I can post them.

What's Next?
-

ISTM that the basic HOT-updates and the ability to prune the HOT-update
chain should help us reduce the index bloat, limit the overhead of ctid
following in index fetch, and efficiently vacuum heap-only tuples. IMO the
next important but rather less troublesome thing to tackle is to reuse
space within a block without a complete vacuum of the table. This would
help us do many more HOT-updates and thus further reduce index/heap bloat.

I am thinking of reusing the DEAD heap-only tuples which get removed from
the HOT-update chain as part of the pruning operation. Since these tuples,
once removed from the chain, are neither reachable nor referenced from any
index, they could be readily used for storing newer versions of the same
or other rows in the block. How about setting LP_DELETE on these tuples as
part of the prune operation? LP_DELETE is unused for heap tuples, if I am
not mistaken. Other information, like length and offset, is maintained as
it is. When we run out of space for update-within-the-block, we traverse
all the line pointers looking for LP_DELETEd items. If any of these items
has enough space to store the new tuple, that item is reused. Does anyone
see any issue with doing this? Also, any suggestions about doing it in a
better way?
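
A minimal sketch of that lookup, using the existing line pointer macros
(assuming ItemIdDeleted() tests the LP_DELETE bit):

static OffsetNumber
find_reusable_item(Page page, Size itemsz)
{
    OffsetNumber offnum,
                 maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId      lp = PageGetItemId(page, offnum);

        /* LP_DELETE would mark slots freed by chain pruning */
        if (ItemIdDeleted(lp) && ItemIdGetLength(lp) >= itemsz)
            return offnum;
    }
    return InvalidOffsetNumber;
}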

If the page gets really fragmented, we can try to grab a VACUUM-strength
lock on the page and de-fragment it. The lock is tried conditionally to
avoid any deadlocks. This is done in the heap_update() code path, so it
would add some overhead, but may still prove better than putting the tuple
in a different block and doing the corresponding index insert(s). Also,
since we are more concerned about large tables, the chances of being able
to upgrade the exclusive lock to a vacuum-strength lock are high. Comments?
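
Something along these lines, assuming a conditional, non-waiting variant
of LockBufferForCleanup(); both helpers named here are placeholders:

/* try for a cleanup lock: exclusive lock plus sole pin on the page */
if (ConditionalLockBufferForCleanup(buffer))   /* assumed variant */
{
    defragment_page(page);       /* hypothetical compaction helper */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
else
{
    /* someone else has the page pinned; fall back to a cold update */
}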

If there are no objections, I am planning to work on the first part,
while Nikhil would take up the second task of block-level retail vacuum.
Your comments on these issues and the patch are really appreciated.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


NewHOT-v2.0.patch.gz
Description: GNU Zip compressed data



[pgsql-patches] Lock compatibility matrix

2007-01-30 Thread Pavan Deolasee

I had this in a different form, but reworked it so that it matches the doc
patch that Teodor submitted earlier. I think it would be good to have this
information in the lock.h file as well.
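
For reference, rendered as a lock.h comment it would look roughly like
this (my abbreviations: AS = AccessShare, RS = RowShare, RX =
RowExclusive, SUE = ShareUpdateExclusive, S = Share, SRE =
ShareRowExclusive, E = Exclusive, AX = AccessExclusive; X marks a
conflict):

/*
 * Table-level lock mode conflict matrix:
 *
 *        AS  RS  RX  SUE  S   SRE  E   AX
 *  AS     .   .   .   .   .   .    .   X
 *  RS     .   .   .   .   .   .    X   X
 *  RX     .   .   .   .   X   X    X   X
 *  SUE    .   .   .   X   X   X    X   X
 *  S      .   .   X   X   .   X    X   X
 *  SRE    .   .   X   X   X   X    X   X
 *  E      .   X   X   X   X   X    X   X
 *  AX     X   X   X   X   X   X    X   X
 */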

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


lock-compatibility.patch
Description: Binary data



Re: [pgsql-patches] Lock compatibility matrix

2007-01-30 Thread Pavan Deolasee

On 1/30/07, Peter Eisentraut [EMAIL PROTECTED] wrote:


Pavan Deolasee wrote:
 I had this in a different form, but reworked so that it matches the
 doc patch that Teodor submitted earlier. I think it would be good to
 have this information in the lock.h file as well.

Why would we want to have two redundant copies of the same information?



IMHO it's useful to have this information in the source code, just like
many other comments. It improves the readability of the code, while the
documentation acts as a reference.

But I am not sure what the generally accepted practice for PostgreSQL is,
so I may be wrong here.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [pgsql-patches] Ctid chain following enhancement

2007-01-28 Thread Pavan Deolasee

On 1/28/07, Tom Lane [EMAIL PROTECTED] wrote:


OTOH it might be
cleaner to refactor things that way, if we were going to apply this.



Here is a revised patch which includes refactoring of
heap_get_latest_tid(), as per Tom's suggestion.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


cf-enhance-head-v2.patch
Description: Binary data



Re: [pgsql-patches] Ctid chain following enhancement

2007-01-27 Thread Pavan Deolasee

On 1/27/07, Tom Lane [EMAIL PROTECTED] wrote:



It looks to me that you have introduced a buffer leak into
heap_get_latest_tid ...



I can't spot that. A previously pinned buffer is released at the start
of the loop if we are moving to a different block. Otherwise, the buffer
is released at all places where the for(;;) loop is terminated by a break.
Am I missing something?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


[pgsql-patches] Ctid chain following enhancement

2007-01-26 Thread Pavan Deolasee

Attached is a patch which should marginally improve the ctid chain
following code path when the current and the next tuple in the chain are
in the same block.

In the current code, we unconditionally drop the pin on the current block
without checking whether the next tuple is in the same block or not. ISTM
that this can be improved.

The attached patch replaces heap_fetch() with heap_release_fetch() in
EvalPlanQual() and thus releases the buffer iff the chain moves to a
different block. Similarly, ReadBuffer() is replaced with
ReleaseAndReadBuffer() in heap_get_latest_tid() for the same reason. The
buffer is set to InvalidBuffer to start with and is not released at the
end of the loop; heap_release_fetch()/ReleaseAndReadBuffer() would
release it if required.
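
The resulting pattern in heap_get_latest_tid() then looks roughly like
this (a sketch; the termination test is simplified):

Buffer      buffer = InvalidBuffer;

for (;;)
{
    /* drops the old pin only when moving to a different block */
    buffer = ReleaseAndReadBuffer(buffer, relation,
                                  ItemPointerGetBlockNumber(&ctid));
    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    /* ... examine the tuple, set ctid to the next version ... */

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    if (chain_ended)            /* simplified termination test */
        break;
}
ReleaseBuffer(buffer);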

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


cf-enhance-head.patch
Description: Binary data
