Re: [PATCHES] page macros cleanup
On Fri, Jul 4, 2008 at 1:01 PM, Zdenek Kotala [EMAIL PROTECTED] wrote: Good catch. I got lost in basic arithmetic. What I see now is that the original definition counts sizeof(ItemIdData) twice, and on the other hand it does not take care of MAXALIGN correctly. I think the correct formula is:

	#define HashMaxItemSize(page) \
		(PageGetPageSize(page) - \
		 ( MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData)) + \
		   MAXALIGN(sizeof(HashPageOpaqueData)) \
		 ) \
		)

What do you think? Yes. I think that's the correct way. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches
Re: [PATCHES] page macros cleanup
On Fri, Jul 4, 2008 at 2:08 PM, Zdenek Kotala [EMAIL PROTECTED] wrote: Patch modifies HashMaxItemSize. It should require reindex on all hash indexes. Should we bump catalog version? Do we really need that? Even if the new HashMaxItemSize comes out to be lower than the current limit (because of MAXALIGN), I don't see how existing hash indexes could have a larger item than the new limit. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] page macros cleanup
On Fri, Jul 4, 2008 at 3:37 PM, Heikki Linnakangas [EMAIL PROTECTED] wrote: I think this is the way it should be:

	#define HashMaxItemSize \
		(BLCKSZ - \
		 SizeOfPageHeaderData - \
		 MAXALIGN(sizeof(HashPageOpaqueData)) - \
		 sizeof(ItemIdData))

I am wondering if this would fail in a corner case if HashMaxItemSize happened to be unaligned. For example, if itemsz < HashMaxItemSize < MAXALIGN(itemsz), PageAddItem() would later fail with a not-so-obvious error. Should we just MAXALIGN_DOWN the HashMaxItemSize? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] page macros cleanup
On Fri, Jul 4, 2008 at 4:25 PM, Heikki Linnakangas [EMAIL PROTECTED] wrote: No, there's an itemsz = MAXALIGN(itemsz) call before the check against HashMaxItemSize. Ah, right. Still, shouldn't we just MAXALIGN_DOWN the max size to reflect the practical limit on itemsz? It's more academic though, so not a big deal. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] page macros cleanup
On Fri, Jul 4, 2008 at 4:20 PM, Zdenek Kotala [EMAIL PROTECTED] wrote: In my opinion, the first place where a tuple could be placed is: MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData)) Tuples actually start from the other end of the block. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] page macros cleanup
On Fri, Jun 13, 2008 at 9:38 PM, Zdenek Kotala [EMAIL PROTECTED] wrote: I attached a code cleanup which is related to in-place upgrade. I replaced direct access to the PageHeader structure with already existing macros, and I also removed unnecessary retyping (casts). A quick review comment: one thing I noticed is that the modified definition of HashMaxItemSize no longer accounts for the size of ItemId, which may not be the right thing. Please recheck that. Apart from that, the patch looks good to me. As you said, we should probably replace the other direct uses of PageHeader with the appropriate macros. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
[PATCHES] VACUUM Improvements - WIP Patch
Here is a WIP patch based on the discussions here: http://archives.postgresql.org/pgsql-hackers/2008-05/msg00863.php The attached WIP patch improves LAZY VACUUM by limiting or avoiding the second heap scan. This not only saves considerable time in VACUUM, but also reduces the double-writes of vacuumed blocks. If the second heap scan is considerably limited, that should also save CPU usage and reduce WAL writing. With HOT, the first heap scan prunes and defrags every page in the heap. That truncates all the dead tuples to their DEAD line pointers and releases all the free space in the page. The second scan only removes these DEAD line pointers and records the free space in the FSM; the free space in fact does not change from the first pass. But to do so, it again calls RepairPageFragmentation on each page, dirties the page, and calls log_heap_clean() again on the page. This clearly looks like too much work for a small gain. As this patch stands, the first phase of vacuum prunes the heap pages as usual, but, if certain conditions are satisfied, it marks the DEAD line pointers as DEAD_RECLAIMED to signal that the index pointers to these line pointers are being removed. Other backends, when they prune a page, also reclaim DEAD_RECLAIMED line pointers by marking them UNUSED. We need some additional logic to do this in a safe way: - An additional boolean pg_class attribute (relvac_inprogress) is used to track the status of vacuum on a relation. If the attribute is true, either vacuum is in progress on the relation or the last vacuum did not complete successfully. When VACUUM starts, it sets relvac_inprogress to true. The transaction is committed and a new transaction is started so that all other backends can see the change. We also note down the transactions which may already have the table open. VACUUM then starts the first heap scan.
It prunes the page, but it can start marking the DEAD line pointers as DEAD_RECLAIMED only after it knows that all other backends can see that VACUUM is in progress on the target relation. Otherwise there is a danger that backends might reclaim DEAD line pointers before their index pointers are removed, and that would lead to index corruption. We do that by periodic conditional waits on the noted transaction ids. Once all old transactions are gone, VACUUM sets the second scan limit to the current block number and starts marking subsequent DEAD line pointers as DEAD_RECLAIMED. In most cases the old transactions go away quickly, and for large tables the second scan will be very limited. In the worst case, we might incur the overhead of conditional waits without any success. TODO: - We can potentially update the FSM at the end of the first pass. This is not a significant issue if the second scan is very limited. But if we do this, we need to handle the truncate case properly. - As the patch stands, we check for old transactions at every block iteration. This might not be acceptable for cases where there are long-running transactions. We probably need some exponential gap here. - As the patch stands, heap_page_prune handles reclaiming the DEAD_RECLAIMED line pointers since it already has the ability to WAL-log similar changes. We don't do any extra work to trigger pruning though (other than setting page_prune_xid). Maybe we should trigger pruning when a page has line pointer bloat too. Please let me know comments/suggestions and any other improvements. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com VACUUM_second_scan-v5.patch.gz Description: GNU Zip compressed data
Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit
On Wed, Mar 12, 2008 at 7:13 PM, Heikki Linnakangas [EMAIL PROTECTED] wrote: Yep, patch attached. I also changed xactGetCommittedChildren to return the original array instead of copying it, as Alvaro suggested. Any comments on the flag-based approach I suggested earlier? Am I missing some normal scenario where it won't work well? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit
On Wed, Mar 12, 2008 at 9:27 PM, Tom Lane [EMAIL PROTECTED] wrote: I didn't like it; it seemed overly complicated (consider dealing with XID wraparound), We are talking about subtransactions here. I don't think we support subtransaction wraparound, do we? and it would have problems with a slow transaction generating a sparse set of subtransaction XIDs. I agree that's the worst case. But is that common? That's what I was thinking when I proposed the alternate solution. I thought that can happen only if most of the subtransactions abort, which again I thought is not the normal case. But frankly, I am not sure if my assumption is correct. I think getting rid of the linear search will be enough to fix the performance problem. I wonder if a skewed binary search would help more? For example, if we know that the range of xids stored in the array is 1 to 1000 and we are searching for a number closer to 1000, we can break the array into unequal (large and small) parts instead of equal halves and then search. Well, maybe I am making simple things complicated ;-) Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit
On Wed, Mar 12, 2008 at 10:44 PM, Heikki Linnakangas [EMAIL PROTECTED] wrote: Imagine that you start a transaction just before transaction wrap-around, so that the top level XID is 2^31-10. Then you start 20 subtransactions. What XIDs will they get? Now how would you map those to a bitmap? Wait. Subtransaction ids are local to a transaction and always start from 1. See this:

	/*
	 * reinitialize within-transaction counters
	 */
	s->subTransactionId = TopSubTransactionId;
	currentSubTransactionId = TopSubTransactionId;

It's not that common to have hundreds of thousands of subtransactions to begin with.. True. But that's the case we are trying to solve here :-) Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit
On Tue, Mar 11, 2008 at 6:04 PM, Heikki Linnakangas [EMAIL PROTECTED] wrote: (moved to pgsql-patches, as there's a patch attached) I couldn't let this case go, so I wrote a patch. I replaced the linked list with an array that's enlarged at AtSubCommit_childXids when necessary. We can replace the array of xids with an array of flags, where the i'th flag is set to true if the corresponding subtransaction is committed. This would make the lookup an O(1) operation. Of course, the downside is when the array is sparse. Or we can switch to the flag-based representation once the array size grows beyond a limit. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] TransactionIdIsInProgress() cache
On Tue, Mar 11, 2008 at 6:37 PM, Alvaro Herrera [EMAIL PROTECTED] wrote: I didn't check whether your transformation is correct, but if so then it can be changed like this and save the extra XidDidCommit call:

	xvac_committed = TransactionIdDidCommit(xvac);
	if (xvac_committed)
	{
		/* committed */
	}
	else if (!TransactionIdIsInProgress(xvac))
	{
		if (xvac_committed)
		{
			/* committed */
		}
		else
		{
			/* aborted */
		}
	}
	else
	{
		/* in-progress */
	}

I doubt the transformation is correct. If xvac_committed is true, why would one even get into the else part? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT version 18
On 9/18/07, Jaime Casanova [EMAIL PROTECTED] wrote: this sql script makes current cvs + patch crash with this message in the logs: Can you please check if the attached patch fixes the issue for you? It sets t_tableOid before returning a HOT tuple to the caller. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com

*** src/backend/access/index/indexam.c	2007-09-18 11:59:26.0 +0530
--- src/backend/access/index/indexam.c	2007-09-18 11:52:36.0 +0530
*** 531,536 ****
--- 531,537 ----
  	ItemPointerSetOffsetNumber(tid, offnum);
  	heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
  	heapTuple->t_len = ItemIdGetLength(lp);
+ 	heapTuple->t_tableOid = RelationGetRelid(scan->heapRelation);
  	ctid = &heapTuple->t_data->t_ctid;

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] HOT synced with HEAD
On 9/17/07, Tom Lane [EMAIL PROTECTED] wrote: Yeah. As the code stands, anything that's XMIN_INVALID will be considered not-HotUpdated (look at the macro...). So far I've seen no place where there is any value in following a HOT chain past such a tuple --- do you see any? No, I don't think we would ever need to follow a HOT chain past an aborted tuple. The only thing that worries me about this setup, though, is the dependency on the hint bits being set properly. But the places where this would be used right now for detecting aborted dead tuples apply HeapTupleSatisfiesVacuum on the tuple before checking HeapTupleIsHotUpdated, so we are fine. Or should we just check for XMIN_INVALID explicitly at those places? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT synced with HEAD
On 9/17/07, Tom Lane [EMAIL PROTECTED] wrote: Meanwhile I've started looking at the vacuum code, and it seems that v16 has made that part of the patch significantly worse. VACUUM will fail to count tuples that are removed by pruning, which seems like something it should report somehow. I understand. I did not give real weight to it because I thought we remove tuples during pruning elsewhere anyway. But I agree that heap_page_prune_defrag in the VACUUM code path is doing so on behalf of vacuum, and hence we should credit that to VACUUM. And you've introduced a race condition: as I just mentioned, it's perfectly possible that the second call of HeapTupleSatisfiesVacuum gets a different answer than what the prune code saw, especially in lazy VACUUM (in VACUUM FULL it'd suggest that someone released lock early ... but we do have to cope with that). Hmm.. you are right. Those extra notices I added are completely unnecessary and wrong. The comments you added seem to envision a more invasive patch that gets rid of the second HeapTupleSatisfiesVacuum pass altogether, but I'm not sure how practical that is, and am not real inclined to try to do it right now anyway ... I agree. I just wanted to leave a hint there that such a possibility exists if someone really wants to optimize, now or later. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT synced with HEAD
On 9/16/07, Tom Lane [EMAIL PROTECTED] wrote: Attached is as far as I've gotten with reviewing the HOT patch; I see that Pavan is still fooling with it so we'd better re-sync. I am done with most of the items on my plate. I was not sure how far you had gone with the patch, so I was trying to finish as many items as possible myself. Now that I know you have started refactoring the code, I will try not to change it unless it's urgent. And when it's required, I will send an add-on patch, just the way I did earlier. Let me know if you want me to focus on something at this point. * Many API cleanups, in particular I didn't like what had been done to heap_fetch and thought it better to split out the chain-following logic I liked those changes. I am wondering if we could have avoided duplicating the chain-following code in heap_hot_search_buffer and index_getnext. But I trust your judgment. I also liked the way you reverted the API changes to the various index build methods. I will test the patch in detail tomorrow. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Latest README.HOT
On 9/16/07, Tom Lane [EMAIL PROTECTED] wrote: Forgot to include this in the patch ... I'm still unhappy about too much complexity for the VACUUM FULL case, but some other issues have gone away. The entire complexity in the VACUUM FULL code is for bookkeeping purposes, so that at the end we can recheck the index and heap tuple counts. VACUUM FULL converts HOT tuples into non-HOT tuples when it moves them around, and that introduces the extra complexity. Otherwise there is no significant change to the VACUUM FULL logic itself. Good to hear that we are now comfortable with most of the other items. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT synced with HEAD
On 9/16/07, Tom Lane [EMAIL PROTECTED] wrote: BTW, I'm in process of taking out the separate HEAPTUPLE_DEAD_CHAIN return value from HeapTupleSatisfiesVacuum. I agree. I myself suggested doing so earlier in the discussion. (I actually even removed this before I sent out the add-on patch last night, but then reverted it because I realized it was still required in one place.) The place where I thought it's required is to avoid marking an index tuple dead even though the corresponding root tuple is dead, when the root tuple was HOT updated. But it seems you have already put in a different mechanism to handle that, so we should be able to get rid of HEAPTUPLE_DEAD_CHAIN. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT synced with HEAD
On 9/16/07, Tom Lane [EMAIL PROTECTED] wrote: Something else I was just looking at: in the pruning logic, SetRedirect and SetDead operations are done at the same time that we record them for the eventual WAL record creation, but SetUnused operations are postponed and only done at the end. This seems a bit odd/nonorthogonal. Furthermore it's costing extra code complexity: if you were to SetUnused immediately then you wouldn't need that bitmap thingy to prevent you from doing it twice. I think that the reason it's like that is probably because of the problem of potential multiple visits to a DEAD heap-only tuple, but it looks to me like that's not really an issue given the current form of the testing for aborted tuples, which I have as

	if (HeapTupleSatisfiesVacuum(tuple->t_data, global_xmin, buffer) == HEAPTUPLE_DEAD &&
		!HeapTupleIsHotUpdated(tuple))
		heap_prune_record_unused(nowunused, nunused, unusedbitmap, rootoffnum);

ISTM that if HeapTupleIsHotUpdated is false, we could simply SetUnused immediately, because any potential chain members after this one must be dead too, and they'll get reaped by this same bit of code --- there is no need to preserve the chain. (The only case we're really covering here is XMIN_INVALID, and any later chain members must certainly be XMIN_INVALID as well.) When the HOT chain is later followed, we'll detect the chain break anyway, so I see no need to postpone clearing the item pointer. So are you suggesting we go back to the earlier way of handling aborted tuples separately? But then we cannot do that by simply checking for !HeapTupleIsHotUpdated. There could be several aborted tuples at the end of the chain, of which all but one are marked HotUpdated. Or are you suggesting we also check for XMIN_INVALID to detect aborted tuples? If we handle aborted tuples differently, then we don't need that extra bitmap and can in fact set the line pointer unused immediately.
Otherwise, as you rightly pointed out, we might break a chain before we prune it. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT documentation README
On 9/12/07, Pavan Deolasee [EMAIL PROTECTED] wrote: One change that is worth mentioning and discussing is that we don't follow HOT chains while fetching tuples during autoanalyze, and autoanalyze would consider all such tuples as DEAD. In the worst case, when all the tuples in the table are reachable via redirected line pointers, this would confuse autoanalyze since it would consider all tuples in the table as DEAD. This is all crap. I was under the impression that heap_release_fetch() would never fetch a HOT tuple directly, but that's not true. Analyze fetches all tuples in the chosen block, so it would ultimately fetch the visible tuple. A tuple is counted DEAD only if its t_data is set to non-NULL. For a redirected line pointer, heap_release_fetch() will set t_data to NULL, and hence these line pointers are (rightly) not counted as DEAD. This is the right thing to do, because lazy vacuum cannot remove redirected line pointers. I think we should change this to follow the HOT chain in analyze. We need not follow the chain, but we should check for redirect-dead line pointers and count them as dead rows. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/13/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Yeah, I just checked the work-in-progress patch I sent you back in July. I refactored it to use one-based offsets for consistency, since I modified log_heap_clean quite heavily anyway. It's possible that I broke it in the process; I was only interested in testing the performance characteristics of the simplified pruning scheme. I don't think so -- at least I couldn't figure out why it's broken. But I will take Tom's comment seriously and look into it more tomorrow. Or we can just revert it back to zero-based offsets. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/13/07, Tom Lane [EMAIL PROTECTED] wrote: Never mind ... though my suspicions would probably not have been aroused if anyone had bothered to fix the comments. Yeah, my fault. I should have fixed that. Sorry about that. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/14/07, Tom Lane [EMAIL PROTECTED] wrote: HeapTupleSatisfiesAbort is bogus because it has failed to track recent changes in tqual.c. Oh. I should have been aware. Rather than fix it, though, I question why we need it at all. The only use of it is in heap_prune_tuplechain() and ISTM that code is redundant, wrong, or both. In the case of a full-page prune, it's redundant because the tuple would be found anyway by searching its chain from the root tuple. Indeed I suspect that such a tuple is entered into the newly-dead list twice, perhaps risking overflow of that array. No, we would never reach an aborted dead tuple from the chain, because we would see a live tuple before that. Or the tuple preceding the aborted one (the first aborted one, if there are many) may have been updated again, so that the chain never reaches the aborted tuples. That is the reason we have HeapTupleSatisfiesAbort: to collect any aborted heap-only tuples. In the case of a single-chain prune, it still seems wrong since you'll eliminate only one of what might be a chain with multiple aborted entries. If it's OK to leave those other entries for future collection, then it should be OK to leave this one too. If it's not OK then the approach needs to be redesigned. Again, since we check every heap-only tuple for the aborted-dead case, we shall collect all such tuples. I think one place where you got confused (or, IOW, where the code might have confused you ;-)) is that we never reach aborted dead heap-only tuples by following the HOT chain from the root tuple, and that's why they need special treatment. I'm fairly unclear on what the design intention is for recovering from the case where the last item(s) on a HOT chain are aborted. Please clarify. We don't do anything special. Such a tuple is never reached during HOT chain following, because either we would see a visible tuple before that, or the older tuple might have been updated again and the chain would have moved in a different direction.
We only need some special treatment to prune such tuples, and hence the business of HeapTupleSatisfiesAbort. Having said that, based on our recent discussion about pruning a chain up to and including the latest DEAD tuple in the chain, ISTM that we can get rid of giving any special treatment to aborted heap-only tuples. Earlier we could not do so for HOT-updated aborted heap-only tuples because we did not know whether they were part of any valid HOT chain. Now (assuming we change the pruning code to always prune everything including the latest DEAD tuple) we guarantee that all DEAD tuples in the chain will be pruned, and hence we can collect any DEAD tuple seen while pruning. I am not sure if I explained this well; maybe I should post an example. Need to run now. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT documentation README
On 9/12/07, Tom Lane [EMAIL PROTECTED] wrote: VACUUM -- There is little change to regular vacuum. It removes dead HOT tuples, like pruning does, and cleans up any redirected dead line pointers. In lazy vacuum, we must not freeze a tuple that is in the middle of an update chain. That can happen when a tuple has xmin > xmax; it is the same scenario that requires hard pruning in VACUUM FULL. Freezing such tuples will break the check that xmin and xmax match when following the chain. It is not a problem without HOT, because the preceding tuple in the chain must be dead as well, so no one will try to follow the chain; but with HOT the preceding tuple would be DEAD_CHAIN, and someone might still need to follow the chain to find the live tuple. We avoid that by just not freezing such tuples. They can be frozen eventually, when the xmax of the preceding tuple precedes OldestXmin as well. XXX doesn't the above completely break the anti-wraparound guarantees? And why couldn't we avoid the problem by pruning the previous tuple, which is surely dead? We preserve the anti-wraparound guarantees by not letting relfrozenxid advance past the minimum of the cut-off xid and the xmin/xmax of not-yet-frozen tuples. Given that this is required only to address the corner case of a DEAD tuple following a RECENTLY_DEAD (RD) tuple, the final relfrozenxid should be very close to the cut-off xid, shouldn't it? We could have actually pruned the preceding RD tuples (as we do in vacuum full), but we were worried about missing some corner case where someone may still want to follow the chain from the RD tuple. We don't have any such concern with vacuum full because we run with an exclusive lock on the table. But if we agree that there is no problem with pruning an RD tuple preceding a DEAD tuple, we can actually prune that tuple as well. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT documentation README
On 9/12/07, Tom Lane [EMAIL PROTECTED] wrote: VACUUM FULL --- To make vacuum full work, any DEAD tuples in the middle of an update chain need to be removed (see comments at the top of heap_prune_hotchain_hard() for details). Vacuum full performs a more aggressive pruning that not only removes dead tuples at the beginning of an update chain, but scans the whole chain and removes any intermediate dead tuples as well. It also removes redirected line pointers by making them directly point to the first tuple in the HOT chain. This causes a user-visible change in the tuple's CTID, but since VACUUM FULL has always moved tuple CTIDs, that should not break anything. XXX any extra complexity here needs justification --- a lot of it. We hard-prune the chains and also clear up redirected line pointers because doing so is safe within VACUUM FULL and reduces additional complexity in the actual VACUUM FULL work. When we move tuples and tuple chains, we don't try to preserve their HOT properties. So when tuples in a HOT chain are moved, we reset their HEAP_ONLY_TUPLE and HEAP_HOT_UPDATED flags and each tuple gets its own index entry. This requires us to do some more bookkeeping in terms of the number of indexed tuples expected, etc., because these counts are checked at the end of the index scan. Statistics -- XXX: How do HOT updates affect statistics? How often do we need to run autovacuum and autoanalyze? Autovacuum needs to run much less frequently with HOT. This is because defragmentation reclaims dead space in a page, thus reducing the total dead space in a table. Right now we don't update the FSM information about the page after defragmenting it, so an UPDATE on a different page can still cause relation extension even though there is free space in some other page. The rationale for not updating the FSM is to let subsequent UPDATEs on the page use the freed-up space. But one can argue that we should let the free space be used for other UPDATEs/INSERTs after leaving fillfactor worth of space.
Another significant change regarding autovacuum is that we now track the total dead space in the table instead of the number of dead tuples. This seems like a better approach because it takes varying tuple sizes into account. The tracked dead space is increased whenever we update/delete a tuple (or an insert is aborted) and reduced when a page is defragmented. autovacuum_vacuum_scale_factor considers the percentage of dead space relative to the size of the relation, whereas autovacuum_vacuum_threshold considers the absolute amount of dead space in terms of blocks. Every UPDATE (HOT or COLD) contributes to the autoanalyze stats, and defragmentation/pruning has no effect on autoanalyze. IOW autoanalyze would work just the way it does today. One change that is worth mentioning and discussing is that we don't follow HOT chains while fetching tuples during autoanalyze, and autoanalyze would consider all such tuples as DEAD. In the worst case, when all the tuples in the table are reachable via redirected line pointers, this would confuse autoanalyze since it would consider all tuples in the table as DEAD. I think we should change this to follow the HOT chain in analyze. Since we fetch using SnapshotNow, if there is a live tuple at the end of the chain, analyze would use that. Otherwise the tuple is considered DEAD. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: The crucial design decision for HOT is when to prune and when to defragment the page, so that when we're doing the UPDATE, there's room in the page. Right. For defragmentation, I am still inclined towards doing it when we see that the free space is less than (or only slightly more than) the average tuple size of the table -- unless we have a better solution. For pruning, we can do the quick pruning. But I won't feel comfortable doing it without an exclusive lock on the buffer. Also, I would avoid line pointer redirection during a quick prune. A simpler solution would be to flag the page whenever a HOT chain becomes longer than, say, 5 (which can easily be detected in heap_hot_fetch) and prune it on the next lookup. Hopefully we would only rarely have long HOT chains, and if we do get them, we will have some mechanism to recover. Any other ideas? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT documentation README
On 9/12/07, Tom Lane [EMAIL PROTECTED] wrote: This seems all wrong to me. We'd only be considering freezing a tuple whose xmin precedes the global xmin. If it has a predecessor, that predecessor has xmax equal to this one's xmin, therefore also preceding global xmin, therefore it would be seen as DEAD not RECENTLY_DEAD. So we should never need to freeze a tuple that isn't the start of its HOT chain. Hmm.. what you are saying is right. I fail to recollect any other scenario that had forced me to think freezing is a problem with HOT. Also, if you find a DEAD tuple after a RECENTLY_DEAD one, you can certainly prune both, because what this tells you is that both tuples are in fact dead to all observers. I agree. I ran a long test with this change and there doesn't seem to be any issue. So let's prune everything including the latest DEAD tuple. That would let us take out the changes related to freezing. I also think that should let us remove the DEAD_CHAIN concept, but let me check. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: We're only changing the offsetnumber part of it, which is 2 bytes. That shouldn't cross a hardware sector boundary on any reasonable hardware. Not entirely true if we consider line pointer redirection, which involves changing 4 bytes. You can argue that for quick pruning we don't do any line pointer redirection though. Also my worry about dealing with unreachable heap-only dead tuples remains. We may need to do far more work to identify that they are not part of any reachable HOT chain and hence can be removed. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Tom Lane [EMAIL PROTECTED] wrote: Pavan Deolasee [EMAIL PROTECTED] writes: Pruning removes intermediate dead tuples by marking their line pointers ~LP_USED and redirecting the root line pointer to the first live/recently_dead tuple in the chain. It seems utterly unsafe to do this without a vacuum-grade lock on the page. Someone might be examining the root tuple (eg, in a SnapshotAny scan) and I think they are entitled to assume that the line pointer stays valid while they are looking at the tuple. As the patch stands, we prune holding a vacuum-grade lock. But ISTM that there are only limited users of SnapshotAny and we make them recheck the line pointer after acquiring the buffer lock. So hopefully we should be fine. Anyways, it's not a concern right now because we hold a vacuum-grade lock while pruning. But if we decide to do what Heikki/Greg suggested about quick pruning, then we will have to make sure that your concern is addressed. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Pavan Deolasee [EMAIL PROTECTED] wrote: - Track the minimum xmin in the page header to avoid repeated (wasted) attempts to prune a Prunable page in the presence of long running transactions. I would actually think twice before even doing this because this would lead to a complete change in the heap page structure and stop people from upgrading to 8.3 without a complete dump/restore. I don't remember 8.3 introducing any other significant change which already enforces that. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Tom Lane [EMAIL PROTECTED] wrote: Pavan Deolasee [EMAIL PROTECTED] writes: I would actually think twice before even doing this because this would lead to a complete change in the heap page structure and stop people from upgrading to 8.3 without a complete dump/restore. I don't remember 8.3 introducing any other significant change which already enforces that. Then you don't remember far enough --- either the HeapTupleHeader change or the varvarlena change would be enough to prevent that. For that matter, the pd_flags header field that the patch is already relying on is new for 8.3. Oops! I forgot combo command id. That went into the last major release of EDB and that confused me. Regarding pd_flags, I thought somebody could just fix that in-place and avoid dump/restore (of course, it's too invasive, but isn't it feasible with pg_migrator ?) Anyways, given this, should we go for storing xmin in the page header ? For that matter, should there be a separate heap page header ? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/10/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: I'm repeating myself, but I would split that second operation into two parts while we think about this: pruning the entire page, and defragmenting (PageRepairFragmentation). Tom wondered why they're separated in the patch. As the patch stands, there is no reason, but I feel that separating them and doing them at different times might be an important piece in the puzzle. I think we all agree that defragmentation should be done less aggressively, ideally when we are going to insert a tuple in the page and are running low on free space in the page. If we could really do it in heap_update(), after we figure out that there is not enough free space available in the page, that would be great. We already have an exclusive lock on the page and hopefully we also have the vacuum-lock. (This is how an earlier version of the patch used to work - before we took out all the complexity associated with tracking LP_DELETEd tuples and tracking fragmented tuples, and replaced them with simple PageRepairFragmentation. Since in the earlier version we never used to call PageRepairFragmentation, we could easily do pruning in heap_update().) If we can't find a way to defragment inside heap_update(), then we have three choices: 1. Pass a hint to heap_fetch that an UPDATE may follow. If we are running low on free space, we try to prune and defragment at that point. 2. Move some of the work to the bgwriter. 3. Use some heuristic to decide whether to defragment or not. I am not sure how easy it is to know a priori that we are fetching a tuple for an UPDATE. Assuming we can, this seems like a good idea. Eventually we may want to move some work to the bgwriter or some other background process. But my take would be to save that for 8.4. As a starting point, we are doing 3 in the patch. We always try to keep just one tuple's worth of space free in the page. 
So if an UPDATE takes up the remaining free space in the page, the page will be pruned/defragged in the subsequent lookup (index or seq). In a read-mostly scenario, only the first lookup would trigger prune/defrag, but the next many lookups would be quick. In an update-mostly scenario, most of the heap_fetches would anyways end up doing an update, so doing pruning/defragmentation in heap_fetch shouldn't be too bad. For a balanced scenario, we might be moving the cost of pruning/defragmentation to the SELECT path, but UPDATEs would be quick. The other issue is when to prune the HOT chain. I tend to agree that the chances of having long HOT chains are less and the cost of following the chain won't be too significant. At the same time we probably don't want to live with very long chains forever. An example would be: a tuple gets HOT updated several times in a transaction. If the page is never updated again, the long HOT chain would never be pruned. I like Simon's idea of pruning the chain if it goes beyond a limit. Instead of adding any significant complexity to track the length of each HOT chain, we can just mark the page with a flag if heap_hot_fetch() detects a long chain. The next lookup (index or seq) of the page will try to prune the page and clear the flag if successful. I also agree that we should avoid taking the exclusive lock before checking free space in heap_page_prune_defrag(). That might be unsafe, but we don't care if we occasionally skip the maintenance work (or do it a little early). Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/6/07, Tom Lane [EMAIL PROTECTED] wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: When I suggested that we get rid of the LP_DELETE flag for heap tuples, the tuple-level fragmentation and all that, and just take the vacuum lock and call PageRepairFragmentation, I was thinking that we'd do it in heap_update and only when we run out of space on the page. But as Greg said, it doesn't work because you're already holding a reference to at least one tuple on the page, the one you're updating, by the time you get to heap_update. That's why I put the pruning code to heap_fetch instead. Yes, though the amortized cost is the same, it does push the pruning work to the foreground query path. The amortized cost is only the same if every heap_fetch is associated with a heap update. I feel pretty urgently unhappy about this choice. Have you tested the impact of the patch on read-mostly workloads? For read-mostly workloads, only the first SELECT after an UPDATE would trigger pruning/defragmentation. heap_page_prune_defrag() would be a no-op for subsequent SELECTs (PageIsPrunable() would be false until the next UPDATE) I think Heikki's recent test confirms this. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/8/07, Bruce Momjian [EMAIL PROTECTED] wrote: Would someone tell us exactly when pruning and defragmenting happens in the current version of the patch? If we don't nail this issue down soon PostgreSQL 8.3 is going to sail without this patch. I guess you already have the answers by now, but let me repeat here: Until patch version 14, defragmentation and pruning were separate operations, though we used to invoke both at the same time. The only difference was that pruning can work with a simple exclusive lock whereas defragmentation needs a vacuum-strength lock. Tom argued that it's not worth the complexity, so I changed that and clubbed both operations together. What it means: pruning also now needs a vacuum-strength lock and is always followed by defragmentation. So as the patch stands (version 15), we call heap_page_prune_defrag() inside heap_release_fetch() and heapgetpage(), apart from the vacuum loop. It checks for PageIsPrunable() before proceeding. PageIsPrunable is set when a tuple is updated/deleted. So for read-mostly workloads heap_page_prune_defrag will mostly be a no-op. If PageIsPrunable is set, the cleanup lock is available (the exclusive lock is tried conditionally, so we don't wait if there is contention) and we are running low on free space (right now, if free space is less than 1.2 times the average tuple size or less than BLCKSZ/8), the entire page is pruned and fragmentation is repaired. We also prune-defrag in lazy vacuum and vacuum full. But I assume we are not worried about these maintenance activities. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
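The gating conditions described in the mail above can be summarized in a rough sketch. The function name and the exact form of the comparison are assumptions reconstructed from the prose, not the patch's actual code:

```c
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Illustrative sketch: prune/defrag a page only when it is marked
 * prunable, the exclusive lock was obtained without waiting, and the
 * page is running low on free space (less than 1.2x the average tuple
 * size, or less than BLCKSZ/8, per the heuristic described above). */
bool
should_prune_defrag(bool page_is_prunable,
                    bool got_conditional_lock,
                    size_t free_space,
                    size_t avg_tuple_size)
{
    if (!page_is_prunable || !got_conditional_lock)
        return false;
    return free_space < (avg_tuple_size * 12) / 10 || free_space < BLCKSZ / 8;
}
```

So a read-mostly workload pays almost nothing: the PageIsPrunable test fails until the next UPDATE, and contention on the lock simply skips the work.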
Re: [PATCHES] HOT patch - version 15
On 9/10/07, Simon Riggs [EMAIL PROTECTED] wrote: On Mon, 2007-09-10 at 12:17 +0100, Heikki Linnakangas wrote: Bruce Momjian wrote: (Can someone time the access time for following a chain that fills an entire page (the worst case) vs. having a single tuple on the page?) Here are some results, on my laptop:

            HEAD   HOT   HOT-opt   HOT-pruned
  seqscan   19.9  21.1      20.1         11.5
  idxscan   27.8  31.4      30.4         13.7

Comparing the idxscan columns, it looks like following the chain *is* more expensive than having to go through killed index pointers. Pruning clearly does help. Given that this test is pretty much the worst case scenario, I'm ok with not pruning for the purpose of keeping chains short. I wasn't expecting that result and had accepted the counter argument. Yes, I go with Simon. I am also surprised that HOT-pruned did so well in this setup. I always thought that HOT would do well in update-intensive scenarios, but from the results it seems that HOT is also doing well for read-mostly queries. In this particular example, the first SELECT after the 250 UPDATEs would have pruned all the dead tuples and reduced the HOT chain to a single tuple. Hence the total time for subsequent SELECTs improved tremendously. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/10/07, Florian Pflug [EMAIL PROTECTED] wrote: Maybe that risk could be lowered if instead of a flag, we stored the minimal global xmin needed to prune at least one tuple. I like the idea. The question is whether the chances of a Prunable page being looked up again and again in the presence of a long-running transaction are high enough to justify adding 4 bytes to the page header. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
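Florian's idea could look roughly like this. The field name pd_prune_xmin is invented here for illustration, and transaction-id wraparound is ignored; this is a sketch of the concept, not a proposed implementation:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Sketch: instead of a boolean "prunable" flag, the page header stores
 * the smallest xmin that must fall behind the global xmin horizon before
 * pruning can reclaim anything on the page. */
bool
page_worth_pruning(TransactionId pd_prune_xmin, TransactionId global_xmin)
{
    /* Nothing recorded: no dead tuples to prune. */
    if (pd_prune_xmin == InvalidTransactionId)
        return false;
    /* A long-running transaction keeps global_xmin low; skip the page
     * until the horizon passes the recorded xmin. */
    return pd_prune_xmin < global_xmin;
}
```

This is exactly the repeated-wasted-attempt case being debated: with only a flag, every lookup retries the prune; with the stored xmin, the page is skipped cheaply until pruning can actually succeed.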
Re: [PATCHES] HOT patch - version 15
On 9/10/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Oh, one more thing occured to me. Without HOT, we not only mark index tuples pointing to dead tuples as killed, we remove them altogether if the index page gets full. If you modify the test case so that after doing the updates, you insert a bunch of tuples with a different key to fill the index page, you should see CVS HEAD winning HOT without pruning hands down. Um. I am not sure I follow that. With HOT, even if the HOT chain is long, there is still just one index entry for the entire chain. So I don't see why CVS HEAD would do better than HOT in the above case. Net-net there will be an equal number of index keys after the inserts. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote: Heikki Linnakangas wrote: Pruning does generate a WAL record at the moment. Maybe you could do some kind of a quick pruning without a WAL record. Like just modify the ctid of the oldest dead tuple in the chain, or the redirecting line pointer if there is one, to point to the latest live tuple, without removing the dead tuples or the line pointers. I am wondering what you even mean by removing the dead tuples when pruning. I thought only defragmentation removed tuples. Pruning removes intermediate dead tuples by marking their line pointers ~LP_USED and redirecting the root line pointer to the first live/recently_dead tuple in the chain. OTOH defragmentation moves tuples around to repair fragmentation. IOW defragmentation is nothing but calling PageRepairFragmentation on the page. For example, if we have a HOT chain of 5 tuples: 1 -- 2 -- 3 -- 4 -- 5 of which, 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE At the end of pruning: - line pointer of 1 is redirected to 4 - line pointers of 2 and 3 are marked ~LP_USED - the offset of 4 and 5 is unchanged At the end of defragmentation: - 4 and 5 are moved to the end of the page to create a larger contiguous free space in the page. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
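The 5-tuple example above can be modeled in a toy sketch. The states and the helper function are invented for illustration only; real pruning operates on page item pointers under a vacuum-grade lock and is WAL-logged:

```c
/* Toy model of the example chain 1 -- 2 -- 3 -- 4 -- 5, where tuples
 * 1-3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE. Pruning redirects the
 * root line pointer past the leading DEAD run and frees the intermediate
 * line pointers; tuple data is not moved (that is defragmentation). */
enum lp_state  { LP_NORMAL, LP_UNUSED, LP_REDIRECT };
enum tup_state { DEAD, RECENTLY_DEAD, LIVE };

void
prune_chain(const enum tup_state *tup, enum lp_state *lp,
            int *root_redirects_to, int n)
{
    int first_kept = 0;

    /* Skip over the leading run of DEAD chain members. */
    while (first_kept < n && tup[first_kept] == DEAD)
        first_kept++;
    /* Sketch handles only the example's shape: root DEAD, chain tail live. */
    if (first_kept == 0 || first_kept == n)
        return;
    /* Root becomes a redirect to the first surviving member... */
    lp[0] = LP_REDIRECT;
    *root_redirects_to = first_kept;
    /* ...and the skipped-over DEAD members' line pointers are freed. */
    for (int i = 1; i < first_kept; i++)
        lp[i] = LP_UNUSED;
}
```

After this step, the offsets of the surviving tuples are unchanged, matching the mail's description; only a later PageRepairFragmentation would move them to compact free space.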
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote: Right. My point is that pruning only modifies item pointers. It does not change the actual heap tuples. In the quote above, how is Heikki's quick pruning different from the pruning you are describing? I think the only difference is that the quick pruning does not mark intermediate tuples ~LP_USED and hence we may avoid WAL logging. Btw, I am not too enthusiastic about quick pruning because it leaves behind dead heap-only tuples which are not reachable from any root heap tuples. Not that we can't handle them, but I am worried about making such changes right now unless we are sure about the benefits. We can always tune and tweak in 8.4 Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Gregory Stark [EMAIL PROTECTED] wrote: You could mark such tuples with LP_DELETE. That would also let other transactions quickly tot up how much space would be available if they were to run PageRepairFragmentation. IMHO we are going in circles here. We have already tried LP_DELETE techniques and moved away to simplify things. We also tried reusing dead space without running PageRepairFragmentation. Each of these techniques worked just fine with slightly different performance characteristics. What we now have is a simplified algorithm which is much easier to follow and is safer, yet gives us a very good performance boost. I am not sure if this is the right time to throw in new ideas, because we would never be sure that what we are doing is the optimal solution. Would it help if we go with some solution right now, get the rest of the review process done and then use the feedback during beta testing to tune things ? We may have far more data points at that time to choose one technique over another. And we would also know what areas to focus on. I am also worried that by focusing too much on this issue we may overlook some other correctness issue in the patch. From whatever we have discussed so far, IMHO we should do the following things and let the rest of the review process proceed - Defragment a page only when the free space left in the page is not enough to accommodate even a single tuple (use the average tuple length for this decision). This would mean we might be defragmenting even though there is no immediate UPDATE to the page. But we can treat this as fillfactor which allows us to provision for the next UPDATE coming to the page. Since we are defragmenting when the page is almost full, hopefully we would reclaim a good amount of space in the page and won't call defragmentation for the next few UPDATEs. We already have a mechanism to track the average tuple request size in the relcache. 
Maybe we can have some relcache invalidation to keep the information in sync (send an invalidation when the average request size changes by, say, 5%) - Avoid pruning chains in every index or seq lookup. But if the chain becomes longer than X tuples, mark the page to be pruned in the next lookup. We can choose to separate pruning and defragmentation and only do pruning in this case. But I would prefer to keep them together for now. - Track the minimum xmin in the page header to avoid repeated (wasted) attempts to prune a Prunable page in the presence of long running transactions. We can save the rest of the techniques for the beta testing period or 8.4. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
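The "invalidate when the average request size drifts" rule from the first proposal above could be as simple as the following sketch. The helper is entirely hypothetical, not from the patch, and the 5% figure is the one floated in the mail:

```c
#include <stdbool.h>

/* Hypothetical helper: send a relcache invalidation only when the
 * tracked average tuple request size has moved by more than 5% in
 * either direction from the cached value. */
bool
avg_size_changed_enough(double cached_avg, double new_avg)
{
    double delta = new_avg > cached_avg ? new_avg - cached_avg
                                        : cached_avg - new_avg;
    return delta > cached_avg * 0.05;
}
```

The point of the cutoff is to keep the per-backend copies roughly in sync without generating an invalidation storm on every insert.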
Re: [PATCHES] HOT patch - version 15
On 9/6/07, Gregory Stark [EMAIL PROTECTED] wrote: Tom Lane [EMAIL PROTECTED] writes: ISTM the only time we should be doing HOT cleanup work is when we are trying to insert a new tuple on that page --- and maybe even only when we try and it doesn't fit straight off. Otherwise you're pushing maintenance work into foreground query pathways, which is exactly where I don't think we should be headed. Ah, as I understand it you can't actually do the pruning then because the executor holds references to source tuple on the page. In other words you can never get the vacuum lock there because you already have the page pinned yourself. I don't think the executor holding a reference is a problem, because when we check for the vacuum lock we have already pinned the page anyways. But moving the old tuple around deep down in the UPDATE code path (when we realize that there is no free space) is an issue. I know Heikki tried to do it this way - but then moved the pruning code to the lookup code. Heikki ? Another real problem with doing pruning only in the UPDATE path is that we may end up with long HOT chains if the page does not receive an UPDATE after many consecutive HOT updates. Every lookup of the visible tuple in this chain would be CPU expensive since it would require long chain following. The downside is of course that SELECT may occasionally do more work. But since we ensure that we do the extra work only when there is no contention on the page, ISTM that the downside is limited. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Lazy xid assignment V4
On 9/4/07, Florian G. Pflug [EMAIL PROTECTED] wrote: Hi Here is an updated patch, following the discussion. The patch can be found at: http://soc.phlo.org/lazyxidassign.v4.patch (It seems I still can't get attachments through to this list) I haven't been able to follow the discussions here, but do I need to worry about its interaction with HOT ? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 14
On 8/31/07, Pavan Deolasee [EMAIL PROTECTED] wrote: In fact, now that I think about it there is no other fundamental reason to not support HOT on system tables. So we can very well do what you are suggesting. On second thought, I wonder if there is really much to gain by supporting HOT on system tables and whether it would justify all the complexity. Initially I thought about CatalogUpdateIndexes to which we need to teach HOT. Later I also got worried about building the HOT attribute lists for system tables and handling all the corner cases for bootstrapping and catalog REINDEX. It might turn out to be straight forward, but I am not able to establish that with my limited knowledge in the area. I would still vote for disabling HOT on catalogs unless you see strong value in it. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 14
On 9/1/07, Tom Lane [EMAIL PROTECTED] wrote: I see this the other way around: if it doesn't work on system catalogs, it probably doesn't work, period. I'm not in favor of treating the catalogs differently. Now that I hear you, I know what to do next :-) I don't think there is any fundamental problem with system catalogs, its only the additional complexity that I was worried about. Anyways, I will rework things as per your suggestion. And I take your point that making it work on all tables will give us more confidence on the code. Thanks, Pavan P.S. Next week is bad for me :-( I am on vacation on Thursday/Friday and for remaining days, I may not be able to spend extra cycles, apart from regular working hours. -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 14
On 8/30/07, Tom Lane [EMAIL PROTECTED] wrote: Pavan Deolasee [EMAIL PROTECTED] writes: Please see the version 14 of HOT patch attached. I expected to find either a large new README, or some pretty substantial additions to existing README files, to document how this all works. The comments included do not represent nearly enough documentation. I shall take that up. There are couple of documents posted by Heikki and Greg, apart from several email threads and original design doc by Simon. I shall consolidate everything in a single README. One thing I was unable to discern from the comments is how CREATE INDEX can work at all. A new index might mean that tuples that could formerly legally be part of the same hot-chain no longer can. I can't find any code here that looks like it's addressing that. You are right - a new index might mean that an existing HOT chain is broken as far as the new index is concerned. The way we address that is by indexing the root tuple of the chain, but the index key is extracted from the last tuple in the chain. The index is marked unusable for all those existing transactions which can potentially see any intermediate tuples in the chain. Please see this document written by Greg Stark. He has nicely summarized how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT. http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php I also don't think I believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated tuples: what if the index completion commits, but the concurrent delete rolls back? Then you've got a valid tuple that's not in the index. Since CREATE INDEX works with ShareLock on the relation, we can safely assume that we can't find any DELETE_IN_PROGRESS tuples except those deleted by our own transaction. The only exception is system relation, but we don't do HOT updates on system relation. Given that, it seems OK to me to ignore these tuples because if the transaction aborts, CREATE INDEX is aborted as well. 
Am I overlooking something here ? There is a comment to this regard in the current code as well. I also took this opportunity to remove the modularity invasion caused by heap_check_idxupdate() since it was using resultRelInfo. We now build the list of attributes that must be checked to satisfy HOT update. This list includes all the index columns, columns in the partial index predicates and expression index expressions and is built in the executor. The executor is the wrong place for that: I'm not sure why you think that making heapam depend on the executor is a modularity improvement. Earlier (in the previous version of HOT patch) we were passing ResultRelInfo to heap_update, which was more ugly. The comment is in that context :-) Furthermore this approach requires recalculating the list during each query, which is wasteful when it could only change during schema updates. I'd suggest making the relcache responsible for computing and saving this data, along the same lines as RelationGetIndexList(). (That also means that heapam can get the data from the relcache, saving a lot of API-bloating from passing around Attrids explicitly.) Also, rather than inventing Attrids, I'd suggest just using one Bitmapset with the convention that its contents are offset by FirstLowInvalidHeapAttributeNumber. I liked all these suggestions. I know I thought about computing the attribute list in relcache, but probably avoided to keep things simple. I shall make these changes. The redefinition of the value of MaxHeapTuplesPerPage seems pretty ugly and unprincipled. I think it'd be better to leave it as-is, and just enforce that we don't allow more than that many line pointers on a heap page (as indeed you have to do anyway, so it's not clear what the change is buying). The only reason to redefine MaxHeapTuplesPerPage to higher side is because HOT allows us to accommodate more tuples per page because of redirect-dead line pointers. 
For a table with sufficiently large tuples, the original bound would work well, but for very small tuples we might hit the line pointer limit even if there is free space available in the page. Doubling the value is chosen as a balance between heap page utilization, line pointer bloating and overhead for bitmap scans. But I agree that the factor choice is rather arbitrary. Is it really necessary to add hot_update to xl_heap_update? Seems the information should be available from the tuple header fields. I think it's not necessary. The reason I did it that way is because the new block might be backed up in the WAL record and hence we might have recorded the new tuple infomasks. But we can surely handle that corner case. I shall fix this. Have you demonstrated that the prune_hard logic is worth its weight? Considering that many of us want to drop VACUUM FULL, adding more complexity to it doesn't seem like a profitable use of time. The prune_hard code is a lot simpler
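For reference, the bound under discussion amounts to dividing the usable page space by the smallest possible per-tuple footprint (a minimal tuple header plus its line pointer). The constants below are assumed round numbers for illustration, not the actual values of the header macros:

```c
/* Back-of-the-envelope sketch of MaxHeapTuplesPerPage. The struct sizes
 * are assumptions (not taken from the PostgreSQL headers): an 8K page,
 * a 24-byte page header, a 24-byte MAXALIGN'd minimal tuple header, and
 * a 4-byte line pointer. */
#define BLCKSZ               8192
#define SIZE_OF_PAGE_HEADER    24  /* assumed */
#define MIN_TUPLE_FOOTPRINT    24  /* assumed: MAXALIGN'd minimal header */
#define SIZE_OF_ITEMID          4

enum { MAX_HEAP_TUPLES_PER_PAGE_SKETCH =
           (BLCKSZ - SIZE_OF_PAGE_HEADER)
           / (MIN_TUPLE_FOOTPRINT + SIZE_OF_ITEMID) };
```

Under these assumed sizes the bound comes out to a few hundred tuples per 8K page, which is why Tom calls the limit "unrealistically high" once tuples carry any data at all: redirect-dead line pointers can exceed the tuple count, but the question is whether allowing more of them is worth anything.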
Re: [PATCHES] HOT patch - version 14
On 8/30/07, Tom Lane [EMAIL PROTECTED] wrote: Pavan Deolasee [EMAIL PROTECTED] writes: You are right - a new index might mean that an existing HOT chain is broken as far as the new index is concerned. The way we address that is by indexing the root tuple of the chain, but the index key is extracted from the last tuple in the chain. The index is marked unusable for all those existing transactions which can potentially see any intermediate tuples in the chain. I don't think that works --- what if the last tuple in the chain isn't committed good yet? If its inserter ultimately rolls back, you've indexed the wrong value. I am confused. How could we get ShareLock on the relation while there is some open transaction which has inserted a tuple in the table ? (Of course, I am not considering the system tables here) If the transaction which inserted the last tuple in the chain is already aborted, we are not indexing that tuple (even if that tuple is at the end). We would rather index the last committed-good tuple in the chain. Running the tuples with HeapTupleSatisfiesVacuum guarantees that. Isn't it ? Please see this document written by Greg Stark. He has nicely summarized how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT. http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php Isn't the extra machination for C.I.C. just useless complication? What's the point of avoiding creation of new broken HOT chains when you still have to deal with existing ones? IMHO the extra step in C.I.C simplifies the index build. The transaction waits take care of the existing broken chains, and the early catalog entry for the index helps us avoid creating new broken chains. I am not against doing it a different way, but this is the best solution we could come up with when we worked on CIC. I also don't think I believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated tuples: what if the index completion commits, but the concurrent delete rolls back? 
Then you've got a valid tuple that's not in the index. Since CREATE INDEX works with ShareLock on the relation, we can safely assume that we can't find any DELETE_IN_PROGRESS tuples except those deleted by our own transaction. The only exception is system relation, but we don't do HOT updates on system relation. That chain of reasoning is way too shaky for me. Essentially what you're saying is you'll promise not to corrupt non-system indexes. Nor am I really thrilled about having to disable HOT for system catalogs. I am not against supporting HOT for system catalogs. But IMHO it's not a strict requirement because system catalogs are small, less frequently updated and HOT adds little value to them. If we don't have a generic solution which works for system and non-system tables, that's the best we can get. We can start with non-system tables and expand the scope later. The only reason to redefine MaxHeapTuplesPerPage to higher side is because HOT allows us to accommodate more tuples per page because of redirect-dead line pointers. No, it doesn't allow more tuples per page. It might mean there can be more line pointers than that on a page, but not more actual tuples. The question is whether there is any real use in allowing more line pointers than that --- the limit is already unrealistically high, since it assumes no data content in any of the tuples. If there is a rationale for it then you should invent a different constant MaxLinePointersPerPage or some such, but I really think there's no point. I agree. I think the current limit on MaxHeapTuplesPerPage is sufficiently large. May be we can keep it the same and tune it later if we have numbers to prove its worthiness. Doubling the value is chosen as a balance between heap page utilization, line pointer bloating and overhead for bitmap scans. I'd say it allows a ridiculous amount of line pointer bloat. OK. Let's keep MaxHeapTuplesPerPage unchanged. 
Even if it's safe, ISTM what you're mostly accomplishing there is to expend a lot of cycles while holding exclusive lock on the page, when there is good reason to think that you're blocking other people who are interested in using the page. Eliminating the separation between that and cleanup would also allow eliminating the separate PD_FRAGMENTED page state. The reason we did it that way is because repairing fragmentation seems much more costly than pruning. Please note that we prune a single chain during index fetch. It's only for heap-scans (and VACUUM) that we try to prune all chains in the page. So unless we are doing a heap-scan, I am not sure if we are spending too much time holding the exclusive lock. I agree we don't have any specific numbers to prove that though. If you don't have numbers proving that this extra complication has a benefit, I'd vote for simplifying it. The SnapshotAny case is going to bite other people besides you
Re: [PATCHES] HOT patch - version 14
On 8/31/07, Tom Lane [EMAIL PROTECTED] wrote: Pavan Deolasee [EMAIL PROTECTED] writes: Not if someone else releases the lock before committing. Which I remind you is a programming technique we use quite a lot with respect to the system catalogs. I'm not prepared to guarantee that there is no code path in core (much less in contrib or third-party addons) that doesn't do it on a user table; even less that no such path will ever appear in future. As things stand it's a pretty harmless error --- the worst consequence I know of is that a VACUUM FULL starting right then might bleat and refuse to do shrinking. What you propose is to turn the case into a cause of silent index corruption. A more robust solution would be to wait out the source transaction if CREATE INDEX comes across an INSERT_IN_PROGRESS or DELETE_IN_PROGRESS HOT tuple. Or you could throw an error, since it's not really likely to happen (except maybe in a system catalog REINDEX operation). But the way the patch approaches this at the moment is far too fragile IMHO. OK. I looked at the code again. In CVS HEAD we assume that we should never see a DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple unless it's deleted/inserted by our own transaction or we are indexing a system table. Except for these cases, we throw an error. With HOT, we keep it the same, i.e. if we see a DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple, we throw an error, unless it's deleted/inserted by us or this is a system table. What we change is: if we find a DELETE_IN_PROGRESS tuple deleted by our own transaction and it's HOT-updated, we skip that tuple. This is fine because if the CREATE INDEX commits, the DELETE is also committed and the tuple is not visible (I am still assuming the indcreatexid mechanism is in place). So if we don't do HOT updates on system tables, the current algorithm should work fine, because we should never find a HOT-updated tuple in a system table, and the error-out mechanism should ensure that we don't corrupt user tables.
So I guess what you are suggesting is to turn on HOT on system tables and then wait out any DELETING/INSERTING transaction if we find such a tuple during CREATE INDEX. I think this would work, and since we are talking about system tables, the wait-out business won't be harmful - I remember there were objections when I suggested this as a general solution. If we approach it this way, we might also be able to jettison some of the other complexity such as idxcreatexid. I doubt it, unless we replace it with something better. I think indcreatexid serves another purpose. While building an index, we skip any RECENTLY_DEAD HOT-updated tuples. This is required to handle the existing HOT chains which are broken with respect to the new index. Essentially what it means is we don't want to index anything other than the last committed good tuple in the HOT chain. So when we skip any intermediate RECENTLY_DEAD tuples, which are potentially visible to the existing running transactions, we want those transactions NOT to use this new index. IMHO the extra step in C.I.C simplifies the index build. If you make the change suggested above, I think you don't need to do things differently in C.I.C. I am not sure if I follow you correctly here. The issue with CIC is that it works with a snapshot. So during the initial index build, we might index a tuple which is in the middle of a broken HOT chain. In the second phase, we not only need to index the tuple at the head of the chain, but also remove the previously inserted tuple, otherwise there would be two references from the index to the same root heap tuple. The additional step does two things: 1) Create catalog entry - stops any new broken HOT chains being created 2) Wait out existing transactions - makes sure that when the index is built, we only index the last committed good tuple in the chain.
Since CREATE INDEX works with ShareLock on the relation, we can safely assume that we can't find any DELETE_IN_PROGRESS tuples except those deleted by our own transaction. The only exception is system relations, but we don't do HOT updates on system relations. Same issue as above: this makes correctness utterly dependent on nobody releasing locks early. As I said above, we in fact throw an error for user tables. But if we want to support HOT updates on system tables, we may need to do the wait-out business. In fact, now that I think about it, there is no other fundamental reason not to support HOT on system tables. So we can very well do what you are suggesting. OK. So if I get you correctly, you are suggesting to acquire the cleanup lock. If we don't get that, we don't do any maintenance work. Otherwise, we prune and repair fragmentation in one go. Yeah, that's what I'm thinking; then there's no need to track a separate page state where we've pruned but not defragmented. I agree. Let's keep it simple and we can always improve it later. The only thing that worries me how
Re: [PATCHES] HOT patch - version 14
On 8/20/07, Tom Lane [EMAIL PROTECTED] wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: I see that you have a separate bitmapset to keep track of indexes on system attributes. But having an index on a system attribute doesn't make any sense, does it? Counterexample: OID. Right. Further, we allow creating indexes on system attributes. So we must support those. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 11
On 8/2/07, Pavan Deolasee [EMAIL PROTECTED] wrote: It would also be better if we didn't emit a separate WAL record for defragging a page, if we also prune it at the same time. I'm not that worried about WAL usage in general, but that seems simple enough to fix. Ah I see. I shall fix that. When I started making this change, I realized that we need the second WAL record because if the block is backed up in the pruning WAL write, we may never call PageRepairFragmentation during the redo phase. Of course, we can fix that by making heap_xlog_clean always repair page fragmentation irrespective of whether the block was backed up, but that doesn't seem like a good solution. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 11
On 8/2/07, Pavan Deolasee [EMAIL PROTECTED] wrote: On 8/2/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Maybe a nicer solution would be to have another version of ConditionalLockBuffer with three different return values: didn't get lock, got exclusive lock, or got cleanup lock. That's a good idea. I shall do that. On second thought, I feel it may not be such a good idea to change the ConditionalLockBuffer return value. A boolean is the most natural way. And we don't save anything in terms of BufHdr locking. So maybe we should just have two different functions and do the BufferIsLockedForCleanup check immediately after acquiring the exclusive lock. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 11
On 8/2/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Pavan Deolasee wrote: Please see the attached version 11 of the HOT patch Thanks! One wrinkle in the patch is how the ResultRelInfo struct is passed to heap_update, and on to heap_check_idxupdate, to check whether any indexed columns have changed. I think that's a modularity violation; heap_update really shouldn't have to deal with ResultRelInfo, that belongs in the executor. When we add support for expression and partial indexes, heap_check_idxupdate will need even more executor machinery to be able to evaluate expressions. The reason I put it there is that we wanted to do that check as late as possible, once we confirm that update is possible and there is enough space in the block to perform a HOT update. But I agree that's a modularity violation. Any suggestion to avoid that ? In heap_page_prune_defrag, it would be better to do the test for BufferIsLockedForCleanup right after acquiring the lock. The longer the delay between those steps, the bigger the chances that someone pins the page and starts to wait for the buffer lock, making us think that we didn't get the cleanup lock, though we actually did. Maybe a nicer solution would be to have another version of ConditionalLockBuffer with three different return values: didn't get lock, got exclusive lock, or got cleanup lock. That's a good idea. I shall do that. It's not necessary to WAL-log the unused-array that PageRepairFragmentation returns. In replay, a call to PageRepairFragmentation will come to the same conclusion about which line pointers are not used. It would also be better if we didn't emit a separate WAL record for defragging a page, if we also prune it at the same time. I'm not that worried about WAL usage in general, but that seems simple enough to fix. Ah I see. I shall fix that. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 11
On 8/1/07, Simon Riggs [EMAIL PROTECTED] wrote: On Wed, 2007-08-01 at 14:36 +0530, Pavan Deolasee wrote: BufferIsLockedForCleanup() should be named BufferIsAvailableForCleanup(). There is no cleanup mode; what we mean is that there is only one pin. The comments say If we are lucky enough to have already acquired the cleanup lock, which made me look for the point where we had executed LockBufferForCleanup(), which we never do. Ok. I am open to suggestions here, as well as for other parts of the code, about naming. heap_page_prune() needs to comment what prune_hard means. I think you mean true=prune page, false=prune just this hot chain. It might be better to have a prune type so that the code reads as PRUNE_HOT_CHAIN or PRUNE_WHOLE_PAGE. Does it mean the same thing as in heap_prune_tuplechain()? No. prune_hard means remove all RECENTLY_DEAD tuples before a DEFINITELY DEAD tuple. We use this only for VACUUM FULL. The pruneoff is set to InvalidOffsetNumber to prune all tuple chains in the page. I'm concerned that HeapTupleSatisfiesAbort() may be very expensive and yet almost never return true. Do we need this code at this point? We actually run SatisfiesVacuum anyway on all the tuples. We should expect hint bits to be set and hence the call may not be expensive. We need this to reclaim aborted heap-only tuples which otherwise are not reachable. It would be good to put some explanatory notes somewhere. Do all calls to PageGetFreeSpace() get replaced by PageHeapGetFreeSpace()? Ok. Would do that. Right now, PageHeapGetFreeSpace is used for heap pages. To answer some of your XXXs: XXX Should we set PageSetPrunable on this page ? If the transaction aborts, there will be a prunable tuple in this page. Think you need to explain a little more...but we probably shouldn't tune for aborts. If the INSERTing transaction aborts, we are left with a DEAD tuple in the page which can be pruned and removed. So the question is whether we should set the hint bit.
If we don't set the hint bit, the aborted DEAD tuple would stay there till the next vacuum, or until some other tuple in the page is UPDATEd/DELETEd and the page is cleaned up. XXX Should we set the hint on the newbuf as well ? If the transaction aborts, there would be a prunable tuple in the newbuf. Think you need to explain a little more...but we probably shouldn't tune for aborts. Same as above. XXX Should we update the FSM information of this page ? (after pruning) No. Updating FSM would only take place if updates are taking place on this block. We would make this page a target for other INSERTs and UPDATEs, and so would effectively bring more contention onto those pages. Bad Thing. On the other hand, if we don't update FSM, all COLD updates may result in extending the relation even though there is some free space available elsewhere in the relation. One option could be to leave fillfactor worth of space and update FSM accordingly. XXX We may want to tweak the percentage value below: (1.2 * RelationGetAvgFSM(relation)) Not sure what difference the 1.2 makes... That's a quick hack. I thought of leaving some margin because the average FSM request size is a moving average and there is a good chance of erring. So that means I agree with all of your XXXs apart from the last one. The logic of chain pruning has been simplified a lot. In addition, there are fewer new WAL log records. We rather reuse the existing WAL record types to log the operations. I can't find any mention of XLogInserts for the new record types. e.g. XLOG_HEAP2_PRUNEHOT is only mentioned on one line in this patch. Is this a mistake, or could you explain? Sorry, that's just a leftover from the previous version. I shall clean that up. It would be good to see the results... I shall post them soon. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT latest patch - version 8
On 7/15/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Actually storing InvalidOffsetNumber in lp_off is a bit bogus in the first place since lp_off is unsigned, and InvalidOffsetNumber is -1, so I fixed that as well. I see InvalidOffsetNumber as 0 in off.h:26 #define InvalidOffsetNumber ((OffsetNumber) 0) So I think we should be OK to use that to indicate redirect-dead pointers. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [pgsql-patches] Ctid chain following enhancement
On 6/2/07, Bruce Momjian [EMAIL PROTECTED] wrote: OK, removed from 8.4 queue. I am OK with this, though I personally never felt that it complicated the code :-) Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [pgsql-patches] Ctid chain following enhancement
On 5/29/07, Zdenek Kotala [EMAIL PROTECTED] wrote: I'm looking at your patch. I have one comment: if you have the old tid and the new tid, you can easily check whether the new tid points to a different page. And if the page is still the same, there is no reason to unlock it and lock it again. I think you could add an inner loop, something like: Tom has already expressed his unwillingness to add complexity without any proven benefits. Your suggestion, though good, would make the code more unreadable without much benefit, since the function is not called very often. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Maintaining cluster order on insert
On 5/19/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Ah, sorry about that. For some reason my source tree was checked out from the 8.2 branch, instead of CVS HEAD. I looked at the patch. Not that I am very comfortable with this part of the code, but nevertheless here are my comments: I am wondering if we can optimize the _bt_moveright() code by remembering the PageLSN of the page before releasing the READ LOCK. If the page has indeed changed since we last saw it, the PageLSN must be different from what we had remembered, and that gives us a hint that we might need to move right. ! /* Page was full. Release lock and pin and get another block ! * as if suggested_blk was not given. ! */ ! LockBuffer(suggested_buf, BUFFER_LOCK_UNLOCK); ! ReleaseBuffer(suggested_buf); Maybe you want to use UnlockReleaseBuffer() or _bt_relbuf() just as a shorthand. Btw, I wonder why _bt_relbuf() exists, and even so, why it takes a Relation argument ? Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Maintaining cluster order on insert
On 5/21/07, Pavan Deolasee [EMAIL PROTECTED] wrote: On 5/19/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Ah, sorry about that. For some reason my source tree was checked out from the 8.2 branch, instead of CVS HEAD. I looked at the patch. Not that I am very comfortable with this part of the code, but nevertheless here are my comments: Another problem that I noticed with the patch is that it disregards the fillfactor while inserting tuples. ISTM that this is the correct behavior when a tuple is being inserted into a block to preserve cluster ordering. But when the tuple is being inserted as a normal operation, IMO we should respect the fillfactor. The easiest way to reproduce this is to insert tuples sorted on the cluster index key. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Concurrent psql patch
Hi Greg, I looked at the patch briefly. I couldn't spot any issues and it looks good to me. I have just a couple of comments here. The mvcc regression test files are missing in the patch. --- 1179,1189 ---- dbname, user, password); /* We can immediately discard the password -- no longer needed */ ! if (password) ! { ! memset(password, '\0', strlen(password)); free(password); + } Any reason why we do this ? password is freed anyway. I think you might have left it behind after some debugging exercise. --- 25,37 ---- #include "mb/pg_wchar.h" #include "mbprint.h" + #if 0 + #include "libpq-int.h" /* For PG_ASYNC */ + #endif + This looks redundant. Apart from that, I really like consistent coding style. For example, to me, for (i = 0; i < 10; i++) looks much better than for (i=0;i<10; i++) This is not a comment on your patch, and neither am I saying we should follow a specific coding style (though I wish we could have done so) because we already have so many different styles. So it's best to stick to the coding style already followed in that particular file. But a few simple rules, like having a single space around operators such as '<', '+', '=' etc., really make the code more readable. Other examples are using parentheses in the right manner to improve code readability. flag = (pointer == NULL); is more readable than flag = pointer == NULL; Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patches
On 5/7/07, Zoltan Boszormenyi [EMAIL PROTECTED] wrote: I tried your patchset on current CVS (some minutes old) and got this single regression below. Thanks for testing the patch. This failure is expected and I had mentioned this when I posted v7. With HOT, CIC is now a three-step process and a failure after the first step will leave an invalid index behind. In this particular case, CIC fails because of duplicate keys. I deliberately did not fix the regression output, to highlight this change. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT Patch - Ready for review
On 4/19/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: What's the purpose of the HeapScanHintPagePrune mechanism in index builds? I lost track of the discussion on create index; is it necessary for correctness? It's not strictly required for correctness, but it helps us prune the HOT chains while building the index. During an index build, if we skip a tuple which is RECENTLY_DEAD, existing transactions cannot use the index for queries. Pruning the HOT chains reduces the possibility of finding such tuples while building the index. A comment in IndexBuildHeapScan explaining why it's done would be nice. I would do that. In any case a PG_TRY/CATCH block should be used to make sure it's turned off after an unsuccessful index build. Oh thanks. Would do that too. I would wait for other review comments before submitting a fresh patch. I hope that's ok. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT + MVCC-safe cluster conflict fix
On 4/19/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Hi Pavan, Here's a little patch against CVS HEAD + NewHOT-v7.0.patch to fix the conflict between MVCC-safe cluster and HOT. index_getnext is modified to return all tuples in a HOT chain when called with SnapshotAny. Cluster will insert them all as normal cold updates. Thanks Heikki. We might need to tweak it a bit because I think I had made an assumption that heap_hot_fetch() should be called only on the root tuple. Anyway, I would look at it. Thanks -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] Dead Space Map version 3 (simplified)
On 3/30/07, ITAGAKI Takahiro [EMAIL PROTECTED] wrote: Attached is an updated DSM patch. I've left the core function of DSM only and dropped other complicated features in this release. I was testing this patch when I got this server crash. The patch is applied on the current CVS HEAD. I thought you would be interested in this. The patch worked for smaller scaling factors and the crash is reproducible. Test: pgbench -s 90 -i -F 95 postgres Stack: (gdb) bt #0 0x001d37a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x00213955 in raise () from /lib/tls/libc.so.6 #2 0x00215319 in abort () from /lib/tls/libc.so.6 #3 0x082dc04f in ExceptionalCondition (conditionName=0x83a7ad7 "!(victim)", errorType=0x83a7622 "FailedAssertion", fileName=0x83a7487 "deadspace.c", lineNumber=1080) at assert.c:51 #4 0x0821eb29 in dsm_create_chunk (dsmrel=0xb7bcd744, key=0xbff589c0) at deadspace.c:1080 #5 0x0821d473 in dsm_record_state (rnode=0xaee02698, pageno=98304, state=DSM_LOW) at deadspace.c:333 #6 0x0821d29e in DegradeDeadSpaceState (rel=0xaee02698, buffer=10645) at deadspace.c:254 #7 0x0817b542 in lazy_scan_heap (onerel=0xaee02698, vacrelstats=0x9f5f2e0, Irel=0x9f5f4dc, nindexes=1, iter=0x9f5f57c) at vacuumlazy.c:586 #8 0x0817a733 in lazy_vacuum_rel (onerel=0xaee02698, vacstmt=0x9f39c94) at vacuumlazy.c:209 #9 0x08174e5c in vacuum_rel (relid=16388, vacstmt=0x9f39c94, expected_relkind=114 'r') at vacuum.c:1107 #10 0x0817421c in vacuum (vacstmt=0x9f39c94, relids=0x0, isTopLevel=1 '\001') at vacuum.c:401 #11 0x0823d90b in ProcessUtility (parsetree=0x9f39c94, queryString=0x9f62f94 "vacuum analyze", params=0x0, isTopLevel=1 '\001', dest=0x9f39cf0, completionTag=0xbff5a040 "") at utility.c:929 #12 0x0823bdd6 in PortalRunUtility (portal=0x9f60f8c, utilityStmt=0x9f39c94, isTopLevel=1 '\001', dest=0x9f39cf0, completionTag=0xbff5a040 "") at pquery.c:1170 #13 0x0823bf0a in PortalRunMulti (portal=0x9f60f8c, isTopLevel=1 '\001', dest=0x9f39cf0, altdest=0x9f39cf0, completionTag=0xbff5a040 "") at pquery.c:1262 #14 
0x0823b6df in PortalRun (portal=0x9f60f8c, count=2147483647, isTopLevel=1 '\001', dest=0x9f39cf0, altdest=0x9f39cf0, completionTag=0xbff5a040 ) at pquery.c:809 #15 0x082365df in exec_simple_query (query_string=0x9f399d4 vacuum analyze) at postgres.c:956 #16 0x08239e43 in PostgresMain (argc=4, argv=0x9ecfc94, username=0x9ecfc64 perf) at postgres.c:3503 #17 0x08204e84 in BackendRun (port=0x9ee3628) at postmaster.c:2987 #18 0x08204493 in BackendStartup (port=0x9ee3628) at postmaster.c:2614 #19 0x0820228b in ServerLoop () at postmaster.c:1214 #20 0x08201c66 in PostmasterMain (argc=3, argv=0x9ecdc50) at postmaster.c :967 #21 0x081a9e0b in main (argc=3, argv=0x9ecdc50) at main.c:188 -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [HACKERS] CIC and deadlocks
On 4/1/07, Tom Lane [EMAIL PROTECTED] wrote: Good point. I'm envisioning a procarray.c function along the lines of bool TransactionHasSnapshot(xid) which returns true if the xid is currently listed in PGPROC and has a nonzero xmin. CIC's cleanup wait loop would check this and ignore the xid if it returns false. Your point means that this function would have to take exclusive not shared lock while scanning the procarray, which is kind of annoying, but it seems not fatal since CIC isn't done all that frequently. When I looked at the code, it occurred to me that possibly we are OK with just taking a shared lock on the procarray. That means some other transaction can concurrently set its serializable snapshot while we are scanning the procarray. But that should not harm us: if we see the snapshot set, we wait for the transaction. A transaction which is setting its serializable snapshot NOW cannot see the tuples that we did not index, can it ? A patch based on the discussion is attached. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com CIC_deadlock.patch Description: Binary data ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] [HACKERS] CIC and deadlocks
On 4/11/07, Tom Lane [EMAIL PROTECTED] wrote: [ itch... ] The problem is with time-extended execution of GetSnapshotData; what happens if the other guy lost the CPU for a good long time while in the middle of GetSnapshotData? He might set his xmin based on info you saw as long gone. You might be correct that it's safe, but the argument would have to hinge on the OldestXmin process being unable to commit because of someone holding shared ProcArrayLock; a point you are definitely not making above. (Study the comments in GetSnapshotData for awhile, also those in xact.c's commit-related code.) My argument was based on what you said above, but I obviously did not state it well :) Anyway, I think it's better to be safe, and we agree that it's not such a bad thing to take an exclusive lock on the procarray because CIC is not something that happens very often. Attached is a revised patch which takes an exclusive lock on the procarray, the rest remaining the same. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com CIC_deadlock_v2.patch Description: Binary data
Re: [HACKERS] [PATCHES] Optimized pgbench for 8.3
On 4/6/07, Tatsuo Ishii [EMAIL PROTECTED] wrote: BTW, is anybody working on enabling the fill factor on the tables used by pgbench? 8.3 will introduce HOT, and I think adding the feature will make it easier to test HOT. Please see if the attached patch looks good. It adds a new -F option which can be used to set the fillfactor for the tellers, accounts and branches tables. The default is 100 and anything between 10 and 100 is acceptable. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com pgbench_fillfactor.patch Description: Binary data
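The effect of such a -F option is just to attach a fillfactor storage parameter to the generated tables. Roughly what an initialization run with -F 90 would issue for the accounts table (a sketch from memory of pgbench's usual schema, not the patch's exact DDL):

```sql
-- Sketch: approximately what "pgbench -i -F 90" would create
CREATE TABLE accounts (
    aid      integer NOT NULL,
    bid      integer,
    abalance integer,
    filler   char(84)
) WITH (fillfactor = 90);
```

A lower fillfactor leaves free space on each heap page at load time, which is exactly what HOT needs to place updated tuple versions on the same page.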
Re: [PATCHES] HOT WIP Patch - version 6.3
On 4/3/07, Bruce Momjian [EMAIL PROTECTED] wrote: Your patch has been added to the PostgreSQL unapplied patches list at: Thanks Bruce. I would like to submit at least one more revision, which would include a couple of TODOs mentioned in my last mail. I would also like to do some cleanup and commenting to make the review process easier. I hope to do this before the weekend, but if someone wants to start reviewing it before then, please let me know. Is there something more that I can do to help the review process ? I've posted almost all the design ideas as and when I was posting patch revisions. Would it help to consolidate them in a single document ? Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
[PATCHES] HOT WIP Patch - Version 5.0
Version 5.0 of the HOT WIP patch is attached. This fixes the VACUUM FULL issue with HOT. In all the earlier versions, I'd disabled VACUUM FULL. When we move a HOT chain, we move the tuples but don't carry the HOT_UPDATED or HEAP_ONLY flags, and we insert as many index entries as there are tuples in the chain. IOW the HOT update is actually turned into a COLD chain. Apart from this, I had to make some changes to the VACUUM FULL code so that the number of indexed tuples is counted correctly. With HOT, whenever a HEAP_ONLY tuple is moved, an additional index entry is generated and this needs to be taken into account. Please let me know comments/suggestions. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com NewHOT-v5.0.patch.gz Description: GNU Zip compressed data
[PATCHES] HOT WIP Patch - Version 4.4
Please see the attached version 4.4 of the HOT WIP patch. I have fixed a couple of bugs in the earlier version posted. Other than that, there aren't any significant changes in the patch. The row-level fragmentation had a bug where we were unintentionally sorting the line pointer array more than once. Also, the defragmented lengths were computed wrongly, which was a source of many errors. Another bug fix was in the heap_hot_fetch() code path, where we were asserting on finding an LP_DELETEd tuple in the hot chain. I later realized that this is not required and we should rather just assume that the chain is broken, something very similar to the xmax/xmin checks. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com NewHOT-v4.4.patch.gz Description: GNU Zip compressed data
[PATCHES] HOT WIP Patch - Version 4.1
Please see the attached HOT WIP patch, version 4.1. There aren't any significant changes since the version 4.0 patch that I posted a week back. This patch includes some optimizations for efficiently looking up LP_DELETEd tuples. I have used the recent changes made by Tom/Heikki which give us a few bits in the page header. I use one bit to track whether there are any LP_DELETEd tuples in the page. The changes to this bit are not WAL-logged and hence the information might not be accurate. But we should be ok with that. Another non-trivial change is the addition of logic to clean up row-level fragmentation. I have discussed this earlier on the list, but nevertheless I will summarize it again here. When we reuse LP_DELETEd tuples for UPDATE, we might waste a few bytes when the original size of the reused tuple is larger than the new tuple, since the information about the original length is lost. When we run out of LP_DELETEd tuples of size equal to or greater than the requested size in the UPDATE path, we correct the row-level fragmentation, if any. Please note, we don't move tuples around and hence don't need the VACUUM-strength lock on the page. We use another bit in the page header to track whether there are any fragmented LP_DELETEd tuples in the page. We also need one bit in the tuple header to track that the particular tuple was fragmented while being reused. This information is then used when the tuple is again released and marked LP_DELETE, to update the page-level hint bit. Comments/suggestions ? Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com NewHOT-v4.1.patch.gz Description: GNU Zip compressed data
Re: [PATCHES] Heap page diagnostic/test functions (WIP)
Simon Riggs wrote: I'll happily code it as functions or system cols or any other way, as long as we can see everything there is to see. With HOT, other useful information is about the line pointers. It would be cool to be able to print the redirection info, details about LP_DELETEd line pointers etc. Also, there are some flags in t_infomask2 which HOT uses, so that information would be useful as well. Thanks, Pavan EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] A little COPY speedup
Heikki Linnakangas wrote: Attached is a fix for that. It adds a flag to each heap page that indicates that there aren't any free line pointers on this page, so don't bother trying. Heap pages haven't had any heap-specific per-page data before, so this patch adds a HeapPageOpaqueData struct that's stored in the special space. I would really like this change. I was thinking on similar lines to optimize some of the HOT code paths. Thanks, Pavan
[PATCHES] Resending - HOT WIP Patch - version 4
Resending once again; all my previous attempts seem to have failed. Is there a limit on patch size on -patches as well ? Attached is a gzipped version. Hi All, Please see version 4.0 of the HOT WIP patch attached with the mail. ISTM that this version has one of the most radical changes included, i.e. the line pointer redirection. The patch should apply on the current CVS HEAD and pass all regression tests. I feel that the patch is now in a stable state and I would really appreciate it if community folks can take a look at it and provide comments/suggestions. VACUUM FULL is disabled right now. I am not aware of any design issues for not supporting VACUUM FULL. I know it's broken, but I do not want to hold back the patch while we fix it. The most important change in this patch is the implementation of line pointer redirection for handling dead root tuples. Please see my post on hackers for details. I would post some preliminary numbers in a separate email. I would also post items where we need attention and discussion and further work. I would welcome others to test the patch, for correctness as well as performance. Comments/suggestions ? Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com NewHOT-v4.0.patch.gz Description: GNU Zip compressed data
Re: [PATCHES] HOT WIP Patch - version 3.2
On 2/27/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Pavan Deolasee wrote: - What do we do with the LP_DELETEd tuples at VACUUM time? In this patch, we are collecting them and vacuuming like any other dead tuples. But is that the best thing to do? Since they don't need index cleanups, it's a waste of maintenance_work_mem to keep track of them in the dead tuples array. Let's remove them in the 1st phase. That means trading the shared lock for a vacuum-level lock on pages with LP_DELETEd tuples. Or if we want to get fancy, we could skip LP_DELETEd tuples in the 1st phase for pages that had dead tuples on them, and scan and remove them in the 2nd phase when we have to acquire the vacuum-level lock anyway.

I liked the idea of not collecting the LP_DELETEd tuples in the first pass. We also prune the HOT-update chains in the page in the first pass; maybe that can also be moved to the second pass. We need to carefully work on the race conditions involved in VACUUM, pruning and tuple reuse though.

- While searching for a LP_DELETEd tuple, we start from the first offset and return the first slot which is big enough to store the tuple. Is there a better search algorithm (sorting/randomizing)? Should we go for best-fit instead of first-fit? Best-fit seems better to me. It's pretty cheap to scan for LP_DELETEd line pointers, but wasting space can lead to cold updates and get much more expensive.

Ok. I will give it a shot once the basic things are ready.

You could also prune the chains on the page to make room for the update, and if you can get a vacuum lock you can also defrag the page.

Yes, that's a good suggestion as well. I am already doing that in the patch I am working on right now.

- Should we have metadata on the heap page to track the number of LP_DELETEd tuples, the number of HOT-update chains in the page and any other information that can help us optimize search/prune operations?
I don't think the CPU overhead is that significant; we only need to do the search/prune when a page gets full. We can add flags later if we feel like it, but let's keep it simple for now.

I am making good progress with the line-pointer redirection stuff. It's showing tremendous value in keeping the table and index size in control. But we need to check for the CPU overhead as well and, if required, optimize there. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
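The best-fit policy agreed on above can be sketched with a toy line-pointer array. ItemIdData here is a simplified stand-in for the real structure, and find_deleted_slot_bestfit() is a hypothetical helper, not PostgreSQL code:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of scanning a page's line pointer array for a reusable
 * LP_DELETEd slot, using the best-fit policy: among all deleted slots
 * large enough for the new tuple, pick the smallest one, so big slots
 * are saved for big tuples and space waste is minimized. */
#define LP_DELETE 0x01

typedef struct
{
    uint16_t lp_len;    /* length of the (dead) tuple the slot covers */
    uint8_t  lp_flags;
} ItemIdData;

static int
find_deleted_slot_bestfit(const ItemIdData *lp, int nslots, uint16_t needed)
{
    int best = -1;

    for (int i = 0; i < nslots; i++)
    {
        if ((lp[i].lp_flags & LP_DELETE) == 0)
            continue;           /* slot still in use */
        if (lp[i].lp_len < needed)
            continue;           /* too small for the new tuple */
        if (best < 0 || lp[i].lp_len < lp[best].lp_len)
            best = i;           /* tightest fit so far */
    }
    return best;                /* -1 if no deleted slot fits */
}
```

First-fit would simply return at the first match; best-fit pays the full scan (cheap, as noted above) to avoid leaving large holes that later force cold updates.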
[PATCHES] HOT WIP Patch - version 3.2
Please see the attached WIP HOT patch, version 3.2. It now implements the logic for reusing heap-only dead tuples. When a HOT-update chain is pruned, the heap-only tuples are marked LP_DELETE. The lp_offset and lp_len fields in the line pointer are maintained. When a backend runs out of free space in a page when doing an UPDATE, it searches the line pointers to find a slot which is marked LP_DELETEd and has enough space to accommodate the new tuple. If such a slot is found, it's reused. We might waste some space if the slot is larger than the tuple, but that gets reclaimed at VACUUM time. During VACUUM, tuple chains are pruned, one page at a time. This is necessary so that the chain is pruned even if the tuple is not accessed after the update (normally pruning happens when the tuple is index-fetched). This patch also implements the logic to convert a dead root tuple into a stub. This is done by changing the lp_len of the root tuple slot to sizeof(HeapTupleHeaderData). The space released can only be reclaimed at the time of VACUUM.

Open Issues: There are quite a few:

- What do we do with the LP_DELETEd tuples at VACUUM time? In this patch, we are collecting them and vacuuming like any other dead tuples. But is that the best thing to do?
- While searching for a LP_DELETEd tuple, we start from the first offset and return the first slot which is big enough to store the tuple. Is there a better search algorithm (sorting/randomizing)? Should we go for best-fit instead of first-fit?
- Should we have metadata on the heap page to track the number of LP_DELETEd tuples, the number of HOT-update chains in the page and any other information that can help us optimize search/prune operations?
- There are some interesting issues in the VACUUMing area. How do we count the dead tuples? Should we count HOT-updated tuples in the dead count?
If we do so, I noticed that VACUUM gets triggered on very small tables like tellers in pgbench and takes several minutes to finish because it waits a very long time for a VACUUM-strength lock on the index pages. The index is just a page or two in this case.

What's Next? -- ISTM that the next important item to do is the handling of dead root tuples. There are several designs proposed to handle this, but I don't think we have consensus on any one design. I am thinking of going ahead with my proposal of line-pointer indirection. I am not passing judgement on the other proposed designs, but for lack of consensus I am choosing the simplest one right now, and the one which seems good to me :) Are there any serious objections to this approach? It would waste 4 bytes per HOT-updated tuple chain for the lifetime of the chain, but would provide an excellent opportunity to reuse the dead root tuple just like any other heap-only dead tuple. The stub approach would waste 4 bytes of line pointer + 24 bytes of header, and it requires an index swing to remove the stub. The cyclic tuple chain would temporarily waste space up to the size of the root tuple until it gets reused by another update of the same row. Plus there are concerns about the additional header field for the back pointer and the complexity involved in locating the visible tuple. Based on these observations, I am inclined towards the line-pointer redirection approach. I haven't heard any for/against comments on this yet. Please let me know if there are objections before I start working on it. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com NewHOT-v3.2.patch.gz
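The stub conversion described in this mail (truncating the root's lp_len to just the tuple header) can be sketched in isolation. The struct and the header size below are illustrative stand-ins, not actual PostgreSQL definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Toy sketch of turning a dead root tuple into a stub: the line pointer
 * keeps its offset, so the tid the index knows about stays valid, but
 * lp_len is cut down to just the tuple header, releasing the data
 * portion for later reclamation at VACUUM time. */
typedef struct
{
    uint16_t lp_off;
    uint16_t lp_len;
} ToyItemId;

#define TOY_HEADER_SIZE 24      /* stand-in for sizeof(HeapTupleHeaderData) */

/* Returns the number of bytes the stub conversion released. */
static uint16_t
convert_root_to_stub(ToyItemId *lp)
{
    uint16_t released = lp->lp_len - TOY_HEADER_SIZE;

    lp->lp_len = TOY_HEADER_SIZE;   /* keep only the header as a stub */
    return released;                /* reclaimable at VACUUM time */
}
```

This illustrates the space accounting in the mail: the stub still costs the line pointer plus the retained header bytes until an index swing removes it.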
Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2
On 2/20/07, Hannu Krosing [EMAIL PROTECTED] wrote: On Tue, 2007-02-20 at 12:08, Pavan Deolasee wrote: ... What do you do if there are no live tuples on the page? Will this un-HOTify the root and free all other tuples in the HOT chain?

Yes. The HOT-updated status of the root and all intermediate tuples is cleared and their respective ctid pointers are made to point to themselves. The index entry will be marked LP_DELETE as in the normal case. VACUUM can subsequently reclaim these tuples, along with the index entry.

The intermediate heap-only tuples are removed from the HOT-update chain. The HOT-updated status of these tuples is cleared and their respective t_ctid are made to point to themselves. These tuples are not reachable now and are ready for vacuuming. Does this mean that they are now indistinguishable from ordinary tuples?

No. The HEAP_ONLY_TUPLE flag is still set on these tuples, so you can distinguish those tuples.

Maybe they could be freed right away instead of changing the HOT-updated status and ctid?

Yeah, that's a good idea. I am thinking of setting the LP_DELETE flag on them while pruning. The tuple can then be reused for the next in-page HOT-update.

When we run out of space for update-within-the-block, we traverse through all the line pointers looking for LP_DELETEd items. If any of these items have space large enough to store the new tuple, that item is reused. Does anyone see any issue with doing this? Also, any suggestions about doing it in a better way? IIRC the size is determined by the next tuple pointer, so you can store new data without changing the tuple pointer only if they are exactly the same size.

There is an lp_len field in the line pointer to store the length of the tuple. ISTM that we can reduce that while reusing the line pointer. But that would create a permanent hole in the page.

we are more concerned about the large tables, the chances of being able to upgrade the exclusive lock to a vacuum-strength lock are high. Comments?
I'm not sure about the "we are more concerned about the large tables" part. I see it more as a device for high-update tables. This may not always be the same as "large", so there should be some fallback for the case where you can't get the lock. Maybe just give up and move to another page?

Oh, yes. I agree. The fallback option of doing a COLD update always exists. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2
On 2/20/07, Bruce Momjian [EMAIL PROTECTED] wrote: Pavan Deolasee wrote: When following a HOT-update chain from the index fetch, if we notice that the root tuple is dead and it is HOT-updated, we try to prune the chain to the smallest possible length. To do that, the share lock is upgraded to an exclusive lock and the tuple chain is followed till we find a live/recently-dead tuple. At that point, the root t_ctid is made to point to that tuple. In order ...

I assume you meant recently-dead here, rather than live/recently-dead, because we aren't going to change live ctids, right?

No, I meant live or recently-dead (in fact, anything other than HEAPTUPLE_DEAD or HEAPTUPLE_DEAD_CHAIN). We are not changing the tids here, but only pruning the HOT-update chain. After pruning, the root t_ctid points to the oldest tuple that might be visible to any backend. The live tuples are still identified by their original tids and are index-reachable from the root tuple. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2
On 2/20/07, Tom Lane [EMAIL PROTECTED] wrote: Pavan Deolasee [EMAIL PROTECTED] writes: ... Yes. The HOT-updated status of the root and all intermediate tuples is cleared and their respective ctid pointers are made to point to themselves.

Doesn't that destroy the knowledge that they form a tuple chain? While it might be that no one cares any longer, it would seem more reasonable to leave 'em chained together.

I see your point, but as you mentioned, do we really care? The chain needs to be broken so that the intermediate DEAD tuples can be vacuumed. We can't vacuum them normally because they could be part of a live HOT-update chain. Resetting the HOT-updated status of the root tuple helps to mark the index entry LP_DELETE once the entire HOT-update chain is dead. Also, if we decide to reuse the heap-only tuples without even vacuuming, breaking the chain is a better option since we then guarantee no references to the heap-only DEAD tuples. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2
On 2/20/07, Bruce Momjian [EMAIL PROTECTED] wrote: Tom Lane wrote: Recently dead means still live to somebody, so those tids better not change either. But I don't think that's what he meant. I'm more worried about the deadlock possibilities inherent in trying to upgrade a buffer lock. We do not have deadlock detection for LWLocks. I am guessing he is going to have to release the lock, then ask for an exclusive one.

Yes, that's what is done. Since we try to prune the HOT-update chain even in the SELECT path, we upgrade the lock only if we are sure that there is at least one tuple that can be removed from the chain or the root needs to be fixed (broken ctid chain for some reason). Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
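The release-then-reacquire pattern discussed here can be sketched with a single-threaded toy lock. ToyLock, page_version, and the helpers are illustrative stand-ins, not PostgreSQL APIs; the point is that any state observed under the shared lock must be rechecked after the exclusive lock is acquired:

```c
#include <assert.h>

/* Toy model: shared locks cannot be upgraded in place (and LWLocks have
 * no deadlock detection), so the reader releases the shared lock, takes
 * the exclusive one, and then rechecks its earlier observation, since
 * another backend may have changed the page in between. */
typedef struct
{
    int shared_holders;
    int exclusive;
    int page_version;   /* bumped by every exclusive-mode modification */
} ToyLock;

static void
lock_shared(ToyLock *l)
{
    assert(!l->exclusive);
    l->shared_holders++;
}

static void
unlock_shared(ToyLock *l)
{
    l->shared_holders--;
}

static void
lock_exclusive(ToyLock *l)
{
    assert(!l->exclusive && l->shared_holders == 0);
    l->exclusive = 1;
}

static void
unlock_exclusive(ToyLock *l)
{
    l->exclusive = 0;
}

/* Release shared, acquire exclusive; report whether the observation made
 * under the shared lock is still valid.  If not, the caller must
 * re-examine the page before pruning. */
static int
upgrade_and_recheck(ToyLock *l, int observed_version)
{
    unlock_shared(l);
    lock_exclusive(l);
    return l->page_version == observed_version;
}
```

This is why the mail says the upgrade is attempted only when it looks worthwhile: the recheck may invalidate the work, so cheap pre-checks under the shared lock matter.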
[PATCHES] HOT WIP Patch - version 2
Reposting - looks like the message did not get through in the first attempt. My apologies if multiple copies are received.

This is the next version of the HOT WIP patch. Since the last patch that I sent out, I have implemented the HOT-update chain pruning mechanism. When following a HOT-update chain from the index fetch, if we notice that the root tuple is dead and it is HOT-updated, we try to prune the chain to the smallest possible length. To do that, the share lock is upgraded to an exclusive lock and the tuple chain is followed till we find a live/recently-dead tuple. At that point, the root t_ctid is made to point to that tuple. In order to preserve the xmax/xmin chain, the xmax of the root tuple is also updated to the xmin of the found tuple. Since this xmax is also older than RecentGlobalXmin and is a committed transaction, the visibility of the root tuple still remains the same. The intermediate heap-only tuples are removed from the HOT-update chain. The HOT-updated status of these tuples is cleared and their respective t_ctid are made to point to themselves. These tuples are not reachable now and are ready for vacuuming. This entire action is logged in a single WAL record.

During vacuuming, we keep track of the number of root tuples vacuumed. If this count is zero, then the index cleanup step is skipped. This avoids unnecessary index scans whenever possible.

This patch should apply cleanly on the current CVS head and pass all regression tests. I am still looking for review comments on the first WIP patch. If anyone has already looked through it and is interested in the incremental changes, please let me know; I can post that.

What's Next? - ISTM that the basic HOT-updates and the ability to prune the HOT-update chain should help us reduce the index bloat, limit the overhead of ctid following in index fetches and efficiently vacuum heap-only tuples. IMO the next important, but rather less troublesome, thing to tackle is to reuse space within a block without a complete vacuum of the table.
This would help us do many more HOT-updates and thus further reduce index/heap bloat. I am thinking of reusing the DEAD heap-only tuples which get removed from the HOT-update chain as part of the pruning operation. Since these tuples, once removed from the chain, are neither reachable nor have any index references, they could be readily used for storing newer versions of the same or other rows in the block. How about setting LP_DELETE on these tuples as part of the prune operation? LP_DELETE is unused for heap tuples, if I am not mistaken. Other information like length and offset is maintained as it is. When we run out of space for update-within-the-block, we traverse through all the line pointers looking for LP_DELETEd items. If any of these items have space large enough to store the new tuple, that item is reused. Does anyone see any issue with doing this? Also, any suggestions about doing it in a better way?

If the page gets really fragmented, we can try to grab a VACUUM-strength lock on the page and de-fragment it. The lock is tried conditionally to avoid any deadlocks. This is done in the heap_update() code path, so it would add some overhead, but may still prove better than putting the tuple in a different block and having the corresponding index insert(s). Also, since we are more concerned about the large tables, the chances of being able to upgrade the exclusive lock to a vacuum-strength lock are high. Comments?

If there are no objections, I am planning to work on the first part while Nikhil would take up the second task of block-level retail vacuum. Your comments on these issues and the patch are really appreciated. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com NewHOT-v2.0.patch.gz
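The prune-and-unlink behavior described in this thread can be modeled with a toy chain, where array indexes stand in for tids and a self-pointing t_ctid marks the end of a chain. None of this is the actual heapam code; it only illustrates the bookkeeping:

```c
#include <assert.h>

/* Toy model of HOT-chain pruning: walk from the root past dead heap-only
 * tuples, repoint the root at the first tuple that may still be visible,
 * and unlink the skipped tuples by making them point to themselves, so
 * they are unreachable and ready for reuse/vacuum. */
#define MAXTUP 8

typedef struct
{
    int t_ctid;     /* index of next version, or self at end of chain */
    int dead;       /* 1 if no snapshot can see this version anymore */
} ToyTuple;

/* Returns the new target of the root's t_ctid. */
static int
prune_chain(ToyTuple *tup, int root)
{
    int next = tup[root].t_ctid;

    /* Walk past dead intermediate tuples. */
    while (next != tup[next].t_ctid && tup[next].dead)
    {
        int after = tup[next].t_ctid;

        tup[next].t_ctid = next;    /* unlink: self-pointing, reusable */
        next = after;
    }
    tup[root].t_ctid = next;        /* root points at oldest visible */
    return next;
}
```

After pruning, the unlinked intermediates are exactly the tuples the mail proposes to mark LP_DELETE for in-page reuse.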
[pgsql-patches] Lock compatibility matrix
I had this in a different form, but reworked it so that it matches the doc patch that Teodor submitted earlier. I think it would be good to have this information in the lock.h file as well. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com lock-compatibility.patch
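The patch itself is not reproduced in the archive, but the documented table-level lock conflict matrix can be sketched in standalone C, encoded as one bitmask of conflicting modes per mode (the same shape as the conflict table in lock.c). The enum values and helper below are a self-contained illustration, not the actual lock.h contents:

```c
#include <assert.h>

/* PostgreSQL's documented table-level lock conflict matrix, encoded as
 * per-mode bitmasks.  A mode conflicts with another iff the other's bit
 * is set in its mask; the matrix is symmetric by construction. */
enum LockMode
{
    AccessShareLock = 1,
    RowShareLock,
    RowExclusiveLock,
    ShareUpdateExclusiveLock,
    ShareLock,
    ShareRowExclusiveLock,
    ExclusiveLock,
    AccessExclusiveLock
};

#define BIT(m) (1 << (m))

static const int conflicts[AccessExclusiveLock + 1] = {
    [AccessShareLock]          = BIT(AccessExclusiveLock),
    [RowShareLock]             = BIT(ExclusiveLock) | BIT(AccessExclusiveLock),
    [RowExclusiveLock]         = BIT(ShareLock) | BIT(ShareRowExclusiveLock) |
                                 BIT(ExclusiveLock) | BIT(AccessExclusiveLock),
    [ShareUpdateExclusiveLock] = BIT(ShareUpdateExclusiveLock) | BIT(ShareLock) |
                                 BIT(ShareRowExclusiveLock) | BIT(ExclusiveLock) |
                                 BIT(AccessExclusiveLock),
    [ShareLock]                = BIT(RowExclusiveLock) | BIT(ShareUpdateExclusiveLock) |
                                 BIT(ShareRowExclusiveLock) | BIT(ExclusiveLock) |
                                 BIT(AccessExclusiveLock),
    [ShareRowExclusiveLock]    = BIT(RowExclusiveLock) | BIT(ShareUpdateExclusiveLock) |
                                 BIT(ShareLock) | BIT(ShareRowExclusiveLock) |
                                 BIT(ExclusiveLock) | BIT(AccessExclusiveLock),
    [ExclusiveLock]            = BIT(RowShareLock) | BIT(RowExclusiveLock) |
                                 BIT(ShareUpdateExclusiveLock) | BIT(ShareLock) |
                                 BIT(ShareRowExclusiveLock) | BIT(ExclusiveLock) |
                                 BIT(AccessExclusiveLock),
    [AccessExclusiveLock]      = BIT(AccessShareLock) | BIT(RowShareLock) |
                                 BIT(RowExclusiveLock) | BIT(ShareUpdateExclusiveLock) |
                                 BIT(ShareLock) | BIT(ShareRowExclusiveLock) |
                                 BIT(ExclusiveLock) | BIT(AccessExclusiveLock),
};

static int
lock_conflicts(int a, int b)
{
    return (conflicts[a] & BIT(b)) != 0;
}
```

Encoding the matrix this way makes the key invariant (symmetry) mechanically checkable, which is exactly the kind of information a comment block in lock.h would convey.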
Re: [pgsql-patches] Lock compatibility matrix
On 1/30/07, Peter Eisentraut [EMAIL PROTECTED] wrote: Pavan Deolasee wrote: I had this in a different form, but reworked it so that it matches the doc patch that Teodor submitted earlier. I think it would be good to have this information in the lock.h file as well. Why would we want to have two redundant copies of the same information?

IMHO it's useful to have this information in the source code, just like many other comments. It improves the readability of the code, while the documentation acts as a reference. But I am not sure what the generally accepted practice for PostgreSQL is, so I may be wrong here. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [pgsql-patches] Ctid chain following enhancement
On 1/28/07, Tom Lane [EMAIL PROTECTED] wrote: OTOH it might be cleaner to refactor things that way, if we were going to apply this. Here is a revised patch which includes the refactoring of heap_get_latest_tid(), as per Tom's suggestion. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com cf-enhance-head-v2.patch
Re: [pgsql-patches] Ctid chain following enhancement
On 1/27/07, Tom Lane [EMAIL PROTECTED] wrote: It looks to me that you have introduced a buffer leak into heap_get_latest_tid ... I can't spot that. A previously pinned buffer is released at the start of the loop if we are moving to a different block. Otherwise, the buffer is released at all places where the for(;;) loop is terminated by a break. Am I missing something? Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
[pgsql-patches] Ctid chain following enhancement
Attached is a patch which should marginally improve the ctid chain-following code path when the current and the next tuple in the chain are in the same block. In the current code, we unconditionally drop the pin on the current block without checking whether the next tuple is in the same block or not. ISTM that this can be improved. The attached patch replaces heap_fetch() with heap_release_fetch() in EvalPlanQual() and thus releases the buffer only if the chain moves into a different block. Similarly, ReadBuffer() is replaced with ReleaseAndReadBuffer() in heap_get_latest_tid() for the same reason. The buffer is set to InvalidBuffer to start with and is not released at the end of the loop; heap_release_fetch()/ReleaseAndReadBuffer() release it if required. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com cf-enhance-head.patch
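The optimization can be sketched with a toy pin-count model: keep the pin when the requested block is the one already pinned, release and re-pin only on a block change. pin_count[] and the function names are illustrative stand-ins for the real buffer manager, not its API:

```c
#include <assert.h>

/* Toy model of the ReleaseAndReadBuffer()-style fast path: following a
 * ctid chain within one block should not unpin and immediately re-pin
 * the same buffer. */
#define NBLOCKS 4
#define InvalidBuffer (-1)

static int pin_count[NBLOCKS];

static int
read_buffer(int blkno)              /* pin the block's buffer */
{
    pin_count[blkno]++;
    return blkno;
}

static void
release_buffer(int buf)             /* drop a pin, if we hold one */
{
    if (buf != InvalidBuffer)
        pin_count[buf]--;
}

/* Keep the pin if we are staying on the same block. */
static int
release_and_read_buffer(int buf, int blkno)
{
    if (buf == blkno)
        return buf;                 /* fast path: no unpin/re-pin churn */
    release_buffer(buf);
    return read_buffer(blkno);
}
```

Starting from InvalidBuffer, as the patch description says, lets the first iteration of the loop go through the same helper with no special case.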