Re: [PATCHES] HOT patch - version 14
Tom Lane wrote:
> Pavan Deolasee <[EMAIL PROTECTED]> writes:
>> Please see the version 14 of HOT patch attached.
>
> I expected to find either a large new README, or some pretty substantial
> additions to existing README files, to document how this all works.

Here's an updated version of the README I posted earlier. It now reflects
the changes to how pruning works.

-- Heikki Linnakangas, EnterpriseDB, http://www.enterprisedb.com

Use case
--------

The best use case for HOT is a table that's frequently UPDATEd and is large
enough that VACUUM is painful. On small tables that fit in cache, running
VACUUM every few minutes isn't a problem.

Heap-only tuples
----------------

When a HOT update is performed, the new tuple is placed on the same page as
the old one and marked with the HEAP_ONLY_TUPLE flag. HEAP_ONLY_TUPLE means
that there are no index pointers to the tuple, which allows pruning the
chain in the future. The old tuple is marked with the HEAP_HOT_UPDATE flag,
which means that the tuple pointed to by t_ctid is a heap-only tuple. That
needs to be taken into account when vacuuming, so that we don't remove the
root tuple of an update chain while there are no index pointers to the later
tuples.

When doing an index scan, whenever we reach a non-visible tuple, we need to
check if the tuple has been HOT-updated (i.e. the HEAP_HOT_UPDATE flag is
set). If so, we follow the ctid pointer until we reach a visible tuple, or
one that hasn't been HOT-updated. Sequential scans (and bitmap heap scans
with a lossy bitmap) don't need to pay attention to the flags.

The prerequisite for doing a HOT update is that none of the indexed columns
change. That's checked at execution time by comparing the binary
representation of the old and new values. This means that dummy updates,
like UPDATE foo SET col1 = ?, where ? is the same as the old value, can be
HOT. In addition, there needs to be room on the page for the new tuple. If
the page is full, we try to make room by pruning the page.
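The eligibility test described above can be sketched as a toy model: an
UPDATE qualifies for HOT only if every indexed column is binary-equal
between the old and new tuple versions. The names here (hot_update_ok and
the sample columns) are illustrative, not PostgreSQL's actual identifiers.

```python
def hot_update_ok(old_tuple, new_tuple, indexed_columns):
    """Return True if none of the indexed columns changed (binary comparison)."""
    for col in indexed_columns:
        # Compare stored representations, mirroring the datum-by-datum
        # binary comparison done at execution time.
        if bytes(str(old_tuple[col]), "utf8") != bytes(str(new_tuple[col]), "utf8"):
            return False
    return True

old          = {"id": 1, "name": "alice", "visits": 10}
new_same_key = {"id": 1, "name": "alice", "visits": 11}  # only non-indexed column changed
new_diff_key = {"id": 1, "name": "bob",   "visits": 11}  # indexed column changed

indexed = {"id", "name"}
print(hot_update_ok(old, new_same_key, indexed))  # True  -> HOT update possible
print(hot_update_ok(old, new_diff_key, indexed))  # False -> cold update, new index entries
```

Note that a "dummy" update setting an indexed column to its current value
still passes the test, since the binary representations are equal.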
Pruning
-------

Pruning is a lightweight vacuum operation that can be run on a single page,
with no need to scan indexes, but it only removes dead HOT tuples. Other
dead tuples are truncated, leaving only a redirected dead line pointer. The
removed tuples are compacted away using PageRepairFragmentation, as in
normal vacuum.

There are two reasons to prune a page: to make room on the page for future
updates, and to shorten HOT chains so that index lookups are cheaper. When
accessing a page that has HOT-updated tuples on it and less than a certain
threshold of free space, we try to prune it. To do that, we need to take a
vacuum-strength lock on the buffer. If that fails, we don't prune; the
theory is that you usually do get the lock, and if you don't, you'll get to
try again next time.

It would be more logical to do the pruning in heap_update when the page is
full, but by the time we get there we have already pinned the page and hold
references to tuples on it, so we can't start moving tuples around. Also,
that alone wouldn't address the desire to keep HOT chains short, to avoid
the overhead of traversing long chains on index lookups.

To reclaim the index-visible (i.e. first) tuple in a HOT chain, its line
pointer is turned into a redirecting line pointer that points to the line
pointer of the next tuple in the chain. When the last live tuple in an
update chain becomes dead (after a DELETE or a cold update), the redirecting
line pointer is marked as redirected dead. That allows us to immediately
reuse the space, except for the line pointer itself. We've effectively
resurrected the "truncate dead tuples to just a line pointer" idea that has
been proposed and rejected before, for fear of line pointer bloat. To limit
the damage in the worst case, and to keep various arrays as well as the
bitmaps in bitmap scans reasonably sized, the maximum number of line
pointers (MaxHeapTuplesPerPage) is somewhat arbitrarily capped at twice what
it was before.
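The redirection idea above can be sketched in miniature: the index-visible
root line pointer becomes a redirect to the first surviving tuple in the
chain, and line pointers for removed dead tuples become unused
(reclaimable). This is a simplified model, not PostgreSQL's actual ItemId
machinery; the state names and prune_chain helper are invented for the
illustration.

```python
NORMAL, REDIRECT, DEAD, UNUSED = "normal", "redirect", "dead", "unused"

def prune_chain(line_pointers, root, first_live):
    """Redirect the root slot to the first live tuple's slot, freeing the rest."""
    end = first_live if first_live is not None else len(line_pointers)
    for slot in range(root + 1, end):
        line_pointers[slot] = (UNUSED, None)   # dead heap-only tuple: space reclaimed
    if first_live is not None:
        line_pointers[root] = (REDIRECT, first_live)
    else:
        line_pointers[root] = (DEAD, None)     # whole chain dead: "redirected dead"
    return line_pointers

# Slot 0 is the root (holds the index pointer); slots 1-2 are dead heap-only
# versions; slot 3 is the live tuple at the end of the chain.
page = [(NORMAL, "v1"), (NORMAL, "v2"), (NORMAL, "v3"), (NORMAL, "v4-live")]
prune_chain(page, root=0, first_live=3)
print(page)  # [('redirect', 3), ('unused', None), ('unused', None), ('normal', 'v4-live')]
```

An index scan arriving at slot 0 now follows one redirect to slot 3 instead
of walking three dead versions, which is the "keep chains short" benefit.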
VACUUM FULL
-----------

To make VACUUM FULL work, any DEAD tuples in the middle of an update chain
need to be removed (see comments at the top of heap_prune_hotchain_hard for
details). VACUUM FULL performs a more aggressive pruning that not only
removes dead tuples at the beginning of an update chain, but scans the whole
chain and removes any intermediate dead tuples as well.

Vacuum
------

There aren't many changes to regular vacuum. It removes dead HOT tuples,
like pruning does, and cleans up any redirected dead line pointers.

In lazy vacuum, we must not freeze a tuple that's in the middle of an update
chain. That can happen when a tuple has xmin == xmax; it's the same scenario
that requires hard pruning in VACUUM FULL. Freezing such tuples would break
the check that xmin and xmax match when following the chain. It's not a
problem without HOT, because the preceding tuple in the chain must be dead
as well, so no one will try to follow the chain, but with HOT the
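The chain-validity check referred to above can be sketched as follows: a
chain follower walking t_ctid verifies that the next tuple's xmin equals the
current tuple's xmax, so replacing an xmin with FrozenTransactionId breaks
the linkage. Field names are simplified; only FROZEN_XID's value of 2
matches PostgreSQL's actual FrozenTransactionId.

```python
FROZEN_XID = 2  # stand-in for PostgreSQL's FrozenTransactionId

def follow_chain(tuples, start):
    """Walk an update chain, stopping if the xmin/xmax linkage breaks."""
    chain, cur = [start], start
    while tuples[cur]["next"] is not None:
        nxt = tuples[cur]["next"]
        if tuples[nxt]["xmin"] != tuples[cur]["xmax"]:
            break  # linkage broken: slot was reused, or the tuple was frozen
        chain.append(nxt)
        cur = nxt
    return chain

tuples = {
    0: {"xmin": 100, "xmax": 200, "next": 1},
    1: {"xmin": 200, "xmax": 300, "next": 2},
    2: {"xmin": 300, "xmax": None, "next": None},
}
print(follow_chain(tuples, 0))   # [0, 1, 2] -- intact chain

tuples[1]["xmin"] = FROZEN_XID   # freeze the middle tuple
print(follow_chain(tuples, 0))   # [0] -- chain is no longer followable
```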
Re: [PATCHES] HOT patch - version 14
On 8/31/07, Pavan Deolasee <[EMAIL PROTECTED]> wrote:
> In fact, now that I think about it there is no other fundamental reason to
> not support HOT on system tables. So we can very well do what you are
> suggesting.

On second thought, I wonder if there is really much to gain by supporting
HOT on system tables, and whether it would justify all the complexity.
Initially I thought about CatalogUpdateIndexes, which we would need to teach
about HOT. Later I also got worried about building the HOT attribute lists
for system tables and handling all the corner cases for bootstrapping and
catalog REINDEX. It might turn out to be straightforward, but I am not able
to establish that with my limited knowledge in the area. I would still vote
for disabling HOT on catalogs unless you see strong value in it.

Thanks, Pavan

-- Pavan Deolasee, EnterpriseDB, http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 14
On Fri, Aug 31, 2007 at 12:53:51PM +0530, Pavan Deolasee wrote:
> On 8/31/07, Pavan Deolasee <[EMAIL PROTECTED]> wrote:
>> In fact, now that I think about it there is no other fundamental reason
>> to not support HOT on system tables. So we can very well do what you are
>> suggesting.
>
> On second thought, I wonder if there is really much to gain by supporting
> HOT on system tables and whether it would justify all the complexity.
> Initially I thought about CatalogUpdateIndexes to which we need to teach
> HOT. Later I also got worried about building the HOT attribute lists for
> system tables and handling all the corner cases for bootstrapping and
> catalog REINDEX. It might turn out to be straight forward, but I am not
> able to establish that with my limited knowledge in the area. I would
> still vote for disabling HOT on catalogs unless you see strong value in
> it.

What about ANALYZE? Doesn't that do a lot of updates?

BTW, I'm 100% in favor of pushing system catalog HOT until later; it'd be
silly to risk not getting HOT in 8.3 because of catalog HOT.

-- Decibel!, aka Jim Nasby, [EMAIL PROTECTED]
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Re: [PATCHES] HOT patch - version 14
Decibel! <[EMAIL PROTECTED]> writes:
> BTW, I'm 100% in favor of pushing system catalog HOT until later; it'd be
> silly to risk not getting HOT in 8.3 because of catalog HOT.

I see this the other way around: if it doesn't work on system catalogs, it
probably doesn't work, period. I'm not in favor of treating the catalogs
differently.

regards, tom lane
Re: [PATCHES] HOT patch - version 14
On 9/1/07, Tom Lane <[EMAIL PROTECTED]> wrote:
> I see this the other way around: if it doesn't work on system catalogs, it
> probably doesn't work, period. I'm not in favor of treating the catalogs
> differently.

Now that I hear you, I know what to do next :-) I don't think there is any
fundamental problem with system catalogs; it's only the additional
complexity that I was worried about. Anyway, I will rework things as per
your suggestion. And I take your point that making it work on all tables
will give us more confidence in the code.

Thanks, Pavan

P.S. Next week is bad for me :-( I am on vacation Thursday/Friday, and for
the remaining days I may not be able to spend extra cycles apart from
regular working hours.

-- Pavan Deolasee, EnterpriseDB, http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 14
On 8/30/07, Tom Lane <[EMAIL PROTECTED]> wrote:
> Pavan Deolasee <[EMAIL PROTECTED]> writes:
>> Please see the version 14 of HOT patch attached.
>
> I expected to find either a large new README, or some pretty substantial
> additions to existing README files, to document how this all works. The
> comments included do not represent nearly enough documentation.

I shall take that up. There are a couple of documents posted by Heikki and
Greg, apart from several email threads and the original design doc by Simon.
I shall consolidate everything in a single README.

> One thing I was unable to discern from the comments is how CREATE INDEX
> can work at all. A new index might mean that tuples that could formerly
> legally be part of the same hot-chain no longer can. I can't find any code
> here that looks like it's addressing that.

You are right - a new index might mean that an existing HOT chain is broken
as far as the new index is concerned. The way we address that is by indexing
the root tuple of the chain, while the index key is extracted from the last
tuple in the chain. The index is marked unusable for all those existing
transactions which can potentially see any intermediate tuples in the chain.
Please see this document written by Greg Stark. He has nicely summarized how
CREATE INDEX and CREATE INDEX CONCURRENTLY work with HOT:

http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

> I also don't think I believe the reasoning for not indexing
> DELETE_IN_PROGRESS hot-updated tuples: what if the index completion
> commits, but the concurrent delete rolls back? Then you've got a valid
> tuple that's not in the index.

Since CREATE INDEX works with ShareLock on the relation, we can safely
assume that we can't find any DELETE_IN_PROGRESS tuples except those deleted
by our own transaction. The only exception is system relations, but we don't
do HOT updates on system relations. Given that, it seems OK to me to ignore
these tuples, because if the transaction aborts, CREATE INDEX is aborted as
well. Am I overlooking something here? There is a comment to this effect in
the current code as well.

>> I also took this opportunity to remove the modularity invasion caused by
>> heap_check_idxupdate() since it was using resultRelInfo. We now build the
>> list of attributes that must be checked to satisfy HOT update. This list
>> includes all the index columns, columns in the partial index predicates
>> and expression index expressions and is built in the executor.
>
> The executor is the wrong place for that: I'm not sure why you think that
> making heapam depend on the executor is a modularity improvement.

Earlier (in the previous version of the HOT patch) we were passing
ResultRelInfo to heap_update, which was uglier. The comment is in that
context :-)

> Furthermore this approach requires recalculating the list during each
> query, which is wasteful when it could only change during schema updates.
> I'd suggest making the relcache responsible for computing and saving this
> data, along the same lines as RelationGetIndexList(). (That also means
> that heapam can get the data from the relcache, saving a lot of
> API-bloating from passing around Attrids explicitly.) Also, rather than
> inventing Attrids, I'd suggest just using one Bitmapset with the
> convention that its contents are offset by
> FirstLowInvalidHeapAttributeNumber.

I liked all these suggestions. I did think about computing the attribute
list in the relcache, but probably avoided it to keep things simple. I shall
make these changes.

> The redefinition of the value of MaxHeapTuplesPerPage seems pretty ugly
> and unprincipled. I think it'd be better to leave it as-is, and just
> enforce that we don't allow more than that many line pointers on a heap
> page (as indeed you have to do anyway, so it's not clear what the change
> is buying).

The only reason to redefine MaxHeapTuplesPerPage on the higher side is that
HOT allows us to accommodate more line pointers per page because of
redirect-dead line pointers. For a table with sufficiently large tuples, the
original bound works well, but for very small tuples we might hit the line
pointer limit even when there is free space available on the page. Doubling
the value was chosen as a balance between heap page utilization, line
pointer bloat, and overhead for bitmap scans. But I agree that the factor
choice is rather arbitrary.

> Is it really necessary to add hot_update to xl_heap_update? Seems the
> information should be available from the tuple header fields.

I think it's not necessary. The reason I did it that way is that the new
block might be backed up in the WAL record, and hence we might not have
recorded the new tuple infomasks. But we can surely handle that corner case.
I shall fix this.

> Have you demonstrated that the prune_hard logic is worth its weight?
> Considering that many of us want to drop VACUUM FULL, adding more
> complexity to it doesn't seem like a profitable use of time.

The prune_hard code is lot more simple
Re: [PATCHES] HOT patch - version 14
Tom Lane <[EMAIL PROTECTED]> writes:
> Pavan Deolasee <[EMAIL PROTECTED]> writes:
>> Please see the version 14 of HOT patch attached.
>
> I expected to find either a large new README, or some pretty substantial
> additions to existing README files, to document how this all works. The
> comments included do not represent nearly enough documentation.

The Heikki and I posted a two-part README of sorts:

http://archives.postgresql.org/pgsql-patches/2007-07/msg00142.php
http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

> One thing I was unable to discern from the comments is how CREATE INDEX
> can work at all. A new index might mean that tuples that could formerly
> legally be part of the same hot-chain no longer can. I can't find any code
> here that looks like it's addressing that.

This is one of the weirdest things in HOT and one of the hardest problems it
faced. The best solution proposed was to index the head of the HOT chain and
just ban older transactions, which might be able to see the older versions,
from using the index. That's the purpose of the indcreatexid column. It's
set in index_set_createxid() and checked in get_relation_info().

> I also don't think I believe the reasoning for not indexing
> DELETE_IN_PROGRESS hot-updated tuples: what if the index completion
> commits, but the concurrent delete rolls back? Then you've got a valid
> tuple that's not in the index.

You're talking about the concurrent index build case? Wouldn't the second
pass pick up that tuple? I have to look back at it to see for sure.

> The redefinition of the value of MaxHeapTuplesPerPage seems pretty ugly
> and unprincipled. I think it'd be better to leave it as-is, and just
> enforce that we don't allow more than that many line pointers on a heap
> page (as indeed you have to do anyway, so it's not clear what the change
> is buying).

Note that in a heavily updated table basically every tuple will have two
line pointers: the head, which just redirects to another line pointer, and
the line pointer for the actual tuple. It's hard to get rid of the actual
head line pointer without declaring that the tid might sometimes not change
after an update. Not sure what this translates to for MaxHeapTuplesPerPage
though.

The rest I know less about and will leave to Pavan and Heikki (or anyone
else who was following those details more closely).

-- Gregory Stark, EnterpriseDB, http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 14
Gregory Stark <[EMAIL PROTECTED]> writes:
>> Tom Lane <[EMAIL PROTECTED]> writes:
>>> Pavan Deolasee <[EMAIL PROTECTED]> writes:
>>>> Please see the version 14 of HOT patch attached.
>>>
>>> I expected to find either a large new README, or some pretty substantial
>>> additions to existing README files, to document how this all works. The
>>> comments included do not represent nearly enough documentation.
>
> The Heikki and I posted a two-part README of sorts:

Er, editing error there. Has a ring to it though.

> http://archives.postgresql.org/pgsql-patches/2007-07/msg00142.php
> http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

...

>>> I also don't think I believe the reasoning for not indexing
>>> DELETE_IN_PROGRESS hot-updated tuples: what if the index completion
>>> commits, but the concurrent delete rolls back? Then you've got a valid
>>> tuple that's not in the index.
>
> You're talking about the concurrent index build case? Wouldn't the second
> pass pick up that tuple? I have to look back at it to see for sure.

Sorry, that's misguided. The concurrent index build uses snapshots now, so
it can't see DELETE_IN_PROGRESS. And the non-concurrent index build holds a
lock, so it ought to be back to the way it was before I messed with it,
where there was an assert against finding *_IN_PROGRESS (except, as Pavan
points out, in the same transaction).

-- Gregory Stark, EnterpriseDB, http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 14
Pavan Deolasee <[EMAIL PROTECTED]> writes:
> You are right - a new index might mean that an existing HOT chain is
> broken as far as the new index is concerned. The way we address that is by
> indexing the root tuple of the chain, but the index key is extracted from
> the last tuple in the chain. The index is marked unusable for all those
> existing transactions which can potentially see any intermediate tuples in
> the chain.

I don't think that works --- what if the last tuple in the chain isn't
committed good yet? If its inserter ultimately rolls back, you've indexed
the wrong value.

> Please see this document written by Greg Stark. He has nicely summarized
> how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT.
> http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

Isn't the extra machination for C.I.C. just useless complication? What's the
point of avoiding creation of new broken HOT chains when you still have to
deal with existing ones?

>> I also don't think I believe the reasoning for not indexing
>> DELETE_IN_PROGRESS hot-updated tuples: what if the index completion
>> commits, but the concurrent delete rolls back? Then you've got a valid
>> tuple that's not in the index.
>
> Since CREATE INDEX works with ShareLock on the relation, we can safely
> assume that we can't find any DELETE_IN_PROGRESS tuples except those
> deleted by our own transaction. The only exception is system relations,
> but we don't do HOT updates on system relations.

That chain of reasoning is way too shaky for me. Essentially what you're
saying is you'll promise not to corrupt non-system indexes. Nor am I really
thrilled about having to disable HOT for system catalogs.

> The only reason to redefine MaxHeapTuplesPerPage to the higher side is
> because HOT allows us to accommodate more tuples per page because of
> redirect-dead line pointers.

No, it doesn't allow more tuples per page. It might mean there can be more
line pointers than that on a page, but not more actual tuples. The question
is whether there is any real use in allowing more line pointers than that
--- the limit is already unrealistically high, since it assumes no data
content in any of the tuples. If there is a rationale for it then you should
invent a different constant, MaxLinePointersPerPage or some such, but I
really think there's no point.

> Doubling the value is chosen as a balance between heap page utilization,
> line pointer bloating and overhead for bitmap scans.

I'd say it allows a ridiculous amount of line pointer bloat.

>> Even if it's safe, ISTM what you're mostly accomplishing there is to
>> expend a lot of cycles while holding exclusive lock on the page, when
>> there is good reason to think that you're blocking other people who are
>> interested in using the page. Eliminating the separation between that and
>> cleanup would also allow eliminating the separate PD_FRAGMENTED page
>> state.
>
> The reason we did it that way is because repairing fragmentation seems
> much more costly than pruning. Please note that we prune a single chain
> during index fetch. It's only for heap scans (and VACUUM) that we try to
> prune all chains in the page. So unless we are doing a heap scan, I am not
> sure if we are spending too much time holding the exclusive lock. I agree
> we don't have any specific numbers to prove that though.

If you don't have numbers proving that this extra complication has a
benefit, I'd vote for simplifying it. The SnapshotAny case is going to bite
other people besides you in future.

> Another reasoning behind separating these two steps is: pruning requires
> exclusive lock whereas repairing fragmentation requires cleanup lock.

This is nonsense. Those are the same lock. If you have the former and not
the latter, it just means that you *know* there is contention for the page.
It seems to me that performing optional maintenance work in such a situation
is completely wrong.

regards, tom lane
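The "unrealistically high" bound being argued over can be computed directly:
it assumes every tuple is just a header with no data. Below is a sketch of
that arithmetic, assuming 8 kB pages, a 24-byte page header, a 23-byte heap
tuple header, 8-byte MAXALIGN, and 4-byte line pointers (typical values of
that era; the actual constants live in the PostgreSQL headers).

```python
BLCKSZ = 8192
PAGE_HEADER = 24     # SizeOfPageHeaderData
TUPLE_HEADER = 23    # heap tuple header with no payload
LINE_POINTER = 4     # sizeof(ItemIdData)
ALIGN = 8

def maxalign(n, align=ALIGN):
    # Round n up to the next multiple of align.
    return (n + align - 1) // align * align

# Each "tuple" costs an aligned header plus one line pointer.
max_heap_tuples_per_page = (BLCKSZ - PAGE_HEADER) // (maxalign(TUPLE_HEADER) + LINE_POINTER)
print(max_heap_tuples_per_page)      # 291

# Doubling it, as the patch did, would permit up to 582 line pointers per
# page, even though at most 291 zero-column tuples could ever fit.
print(2 * max_heap_tuples_per_page)  # 582
```

This makes Tom's point concrete: the existing limit already exceeds what any
page of real tuples can hold, so the doubled cap mostly buys line pointer
bloat.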
Re: [PATCHES] HOT patch - version 14
On 8/30/07, Tom Lane <[EMAIL PROTECTED]> wrote:
> Pavan Deolasee <[EMAIL PROTECTED]> writes:
>> You are right - a new index might mean that an existing HOT chain is
>> broken as far as the new index is concerned. The way we address that is
>> by indexing the root tuple of the chain, but the index key is extracted
>> from the last tuple in the chain. The index is marked unusable for all
>> those existing transactions which can potentially see any intermediate
>> tuples in the chain.
>
> I don't think that works --- what if the last tuple in the chain isn't
> committed good yet? If its inserter ultimately rolls back, you've indexed
> the wrong value.

I am confused. How could we get ShareLock on the relation while there is
some open transaction which has inserted a tuple in the table? (Of course, I
am not considering the system tables here.) If the transaction which
inserted the last tuple in the chain has already aborted, we are not
indexing that tuple (even if that tuple is at the end). We would rather
index the last committed-good tuple in the chain. Running the tuples through
HeapTupleSatisfiesVacuum guarantees that. Isn't it?

>> Please see this document written by Greg Stark. He has nicely summarized
>> how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT.
>> http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php
>
> Isn't the extra machination for C.I.C. just useless complication? What's
> the point of avoiding creation of new broken HOT chains when you still
> have to deal with existing ones?

IMHO the extra step in C.I.C. simplifies the index build. The
transaction-wait takes care of the existing broken chains, and the early
catalog entry for the index helps us avoid creating new broken chains. I am
not against doing it a different way, but this is the best solution we could
come up with when we worked on CIC.

>>> I also don't think I believe the reasoning for not indexing
>>> DELETE_IN_PROGRESS hot-updated tuples: what if the index completion
>>> commits, but the concurrent delete rolls back? Then you've got a valid
>>> tuple that's not in the index.
>>
>> Since CREATE INDEX works with ShareLock on the relation, we can safely
>> assume that we can't find any DELETE_IN_PROGRESS tuples except those
>> deleted by our own transaction. The only exception is system relations,
>> but we don't do HOT updates on system relations.
>
> That chain of reasoning is way too shaky for me. Essentially what you're
> saying is you'll promise not to corrupt non-system indexes. Nor am I
> really thrilled about having to disable HOT for system catalogs.

I am not against supporting HOT for system catalogs. But IMHO it's not a
strict requirement, because system catalogs are small, less frequently
updated, and HOT adds little value to them. If we don't have a generic
solution which works for system and non-system tables, that's the best we
can get. We can start with non-system tables and expand the scope later.

>> The only reason to redefine MaxHeapTuplesPerPage to the higher side is
>> because HOT allows us to accommodate more tuples per page because of
>> redirect-dead line pointers.
>
> No, it doesn't allow more tuples per page. It might mean there can be more
> line pointers than that on a page, but not more actual tuples. The
> question is whether there is any real use in allowing more line pointers
> than that --- the limit is already unrealistically high, since it assumes
> no data content in any of the tuples. If there is a rationale for it then
> you should invent a different constant MaxLinePointersPerPage or some
> such, but I really think there's no point.

I agree. I think the current limit on MaxHeapTuplesPerPage is sufficiently
large. Maybe we can keep it the same and tune it later if we have numbers to
prove its worthiness.

>> Doubling the value is chosen as a balance between heap page utilization,
>> line pointer bloating and overhead for bitmap scans.
>
> I'd say it allows a ridiculous amount of line pointer bloat.

OK. Let's keep MaxHeapTuplesPerPage unchanged.

>>> Even if it's safe, ISTM what you're mostly accomplishing there is to
>>> expend a lot of cycles while holding exclusive lock on the page, when
>>> there is good reason to think that you're blocking other people who are
>>> interested in using the page. Eliminating the separation between that
>>> and cleanup would also allow eliminating the separate PD_FRAGMENTED page
>>> state.
>>
>> The reason we did it that way is because repairing fragmentation seems
>> much more costly than pruning. Please note that we prune a single chain
>> during index fetch. It's only for heap scans (and VACUUM) that we try to
>> prune all chains in the page. So unless we are doing a heap scan, I am
>> not sure if we are spending too much time holding the exclusive lock. I
>> agree we don't have any specific numbers to prove that though.
>
> If you don't have numbers proving that this extra complication has a
> benefit, I'd vote for simplifying it. The SnapshotAny case is going to
> bite other people besides you in
Re: [PATCHES] HOT patch - version 14
Tom Lane <[EMAIL PROTECTED]> writes:

>>> Isn't the extra machination for C.I.C. just useless complication? What's
>>> the point of avoiding creation of new broken HOT chains when you still
>>> have to deal with existing ones?
>>
>> IMHO the extra step in C.I.C simplifies the index build.
>
> If you make the change suggested above, I think you don't need to do
> things differently in C.I.C.

It seems to me if you wait out transactions as you come across them you
could end up waiting a whole lot longer than the way it works now, where it
waits them all out at the end of the first pass.

>> OK. So if I get you correctly, you are suggesting to acquire the cleanup
>> lock. If we don't get that, we don't do any maintenance work. Otherwise,
>> we prune and repair fragmentation in one go.
>
> Yeah, that's what I'm thinking; then there's no need to track a separate
> page state where we've pruned but not defragmented.

Note that you still need to do it in two steps; you just get to postpone the
pruning work to the same point in time as the defragmenting. You can never
acquire the cleanup lock when it comes time to insert a new update version,
since the executor still holds references to the original tuple.

-- Gregory Stark, EnterpriseDB, http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 14
Tom Lane wrote:
> Pavan Deolasee <[EMAIL PROTECTED]> writes:
>> On 8/30/07, Tom Lane <[EMAIL PROTECTED]> wrote:
>>> I don't think that works --- what if the last tuple in the chain isn't
>>> committed good yet? If its inserter ultimately rolls back, you've
>>> indexed the wrong value.
>>
>> I am confused. How could we get ShareLock on the relation while there is
>> some open transaction which has inserted a tuple in the table? (Of
>> course, I am not considering the system tables here)
>
> Not if someone else releases lock before committing.

FWIW, a red flag was raised for me here, though maybe it is irrelevant or
unimportant. Currently, VACUUM acquires an exclusive lock for truncating the
table. The lock is kept till commit. However, I am proposing that it be
released before commit. Now, VACUUM never inserts rows. But I don't claim I
understand what's going on here.

-- Alvaro Herrera, http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [PATCHES] HOT patch - version 14
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Tom Lane escribió:
>> Not if someone else releases lock before committing.
>
> FWIW, a red flag raised for me here, though maybe it is irrelevant or
> unimportant. Currently, VACUUM acquires an exclusive lock for truncating
> the table. The lock is kept till commit. However I am proposing that it be
> released before commit.

I think that's all right, because it's dealing with a different set of
concerns. AFAICS the only issue for truncation is to prevent physical access
to the blocks in question until we can get rid of them. Once they're gone,
if there wasn't an active seqscan (with an already-established notion of the
max block number to scan to), there would be no reason for anyone to try to
touch them.

regards, tom lane
Re: [PATCHES] HOT patch - version 14
On 8/31/07, Tom Lane [EMAIL PROTECTED] wrote: Pavan Deolasee [EMAIL PROTECTED] writes: Not if someone else releases lock before committing. Which I remind you is a programming technique we use quite a lot with respect to the system catalogs. I'm not prepared to guarantee that there is no code path in core (much less in contrib or third-party addons) that doesn't do it on a user table; even less that no such path will ever appear in future. As things stand it's a pretty harmless error --- the worst consequence I know of is that a VACUUM FULL starting right then might bleat and refuse to do shrinking. What you propose is to turn the case into a cause of silent index corruption. A more robust solution would be to wait out the source transaction if CREATE INDEX comes across an INSERT_IN_PROGRESS or DELETE_IN_PROGRESS HOT tuple. Or you could throw an error, since it's not really likely to happen (except maybe in a system catalog REINDEX operation). But the way the patch approaches this at the moment is far too fragile IMHO. OK. I looked at the code again. In CVS HEAD we assume that we should never see an DELETE_IN_PROGRESS/INSERT_IN_PROGRESS unless its deleted/inserted by our own transaction or we are indexing a system table. Except for these cases, we throw an error. With HOT, we keep it the same i.e. if we see a DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple, we throw an error, unless its deleted/inserted by us or this is a system table.. What we change is if we find a DELETE_IN_PROGRESS tuple deleted by our own transaction and its HOT-updated, we skip that tuple. This is fine because if the CREATE INDEX commits the DELETE is also committed and the tuple is not visible (I am still assuming the indcreatxid mechanism is in place) So if we don't do HOT update on system tables, the current algorithm should work fine because we should never find a HOT updated tuple in the system table and the error-out mechanism should ensure that we don't corrupt user tables. 
So I guess what you are suggesting is to turn on HOT on system tables and then wait-out any DELETING/INSETING transaction if we find such a tuple during CREATE INDEX. I think this would work and since we are talking about system tables, the wait-out business won't be harmful - I remember there were objections when I suggested this as a general solution. If we approach it this way, we might also be able to jettison some of the other complexity such as idxcreatexid. I doubt, unless we replace it with something better. I think indcreatexid serves another purpose. While building an index, we skip any RECENTLY_DEAD HOT-updated tuples. This is required to handle the existing HOT chains which are broken with respect to the new index. Essentially what it means is we don't want to index anything other than the last committed good tuple in the HOT chain. So when we skip any intermediate RECENTLY_DEAD tuples, which are potentially visible to the existing running transactions, we want those transactions NOT to use this new index. IMHO the extra step in C.I.C simplifies the index build. If you make the change suggested above, I think you don't need to do things differently in C.I.C. I am not sure if I follow you correctly here. The issue with CIC is that it works with a snapshot. So during initial index build, we might index a tuple which is in the middle of a broken HOT chain. In the second phase, we just don't need to index the tuple at the head of the chain, but also remove the previously inserted tuple, otherwise there would be two references from the index to the same root heap tuple. The additional step which does two things: 1) Create catalog entry - stops any new broken HOT chains being created 2) Wait-out existing transactions - makes sure that when the index is built, we only index the last committed good tuple in the chain. 
Since CREATE INDEX works with a ShareLock on the relation, we can safely assume that we can't find any DELETE_IN_PROGRESS tuples except those deleted by our own transaction. The only exception is system relations, but we don't do HOT updates on system relations. Same issue as above: this makes correctness utterly dependent on nobody releasing locks early. As I said above, we in fact throw an error for user tables. But if we want to support HOT updates on system tables, we may need to do the wait-out business. In fact, now that I think about it, there is no other fundamental reason not to support HOT on system tables. So we can very well do what you are suggesting. OK. So if I get you correctly, you are suggesting to acquire the cleanup lock. If we don't get that, we don't do any maintenance work. Otherwise, we prune and repair fragmentation in one go. Yeah, that's what I'm thinking; then there's no need to track a separate page state where we've pruned but not defragmented. I agree. Let's keep it simple and we can always improve it later. The only thing that worries me how
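The simplification agreed above can be sketched as a toy model; the struct and function names are stand-ins, not PostgreSQL's:

```c
#include <stdbool.h>

/* Toy model of the agreed approach: try the cleanup lock; if any other
 * backend holds a pin, skip maintenance entirely; otherwise prune dead
 * HOT chain members and repair fragmentation in a single pass, so no
 * intermediate "pruned but not defragmented" page state is needed. */
typedef struct {
    bool pinned_by_others;      /* conditional cleanup lock would fail */
    bool pruned;
    bool defragmented;
} PageModel;

bool
maybe_prune_page(PageModel *page)
{
    if (page->pinned_by_others)
        return false;           /* no cleanup lock; do nothing this time */
    page->pruned = true;        /* remove dead HOT chain members ... */
    page->defragmented = true;  /* ... and compact the page in one go */
    return true;
}
```

Because pruning and defragmenting always happen together here, a flag like PD_FRAGMENTED tracking a half-done state becomes unnecessary.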
Re: [PATCHES] HOT patch - version 14
Pavan Deolasee [EMAIL PROTECTED] writes: Please see the version 14 of HOT patch attached. I expected to find either a large new README, or some pretty substantial additions to existing README files, to document how this all works. The comments included do not represent nearly enough documentation. One thing I was unable to discern from the comments is how CREATE INDEX can work at all. A new index might mean that tuples that could formerly legally be part of the same hot-chain no longer can. I can't find any code here that looks like it's addressing that. I also don't think I believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated tuples: what if the index completion commits, but the concurrent delete rolls back? Then you've got a valid tuple that's not in the index. I also took this opportunity to remove the modularity invasion caused by heap_check_idxupdate() since it was using resultRelInfo. We now build the list of attributes that must be checked to satisfy HOT update. This list includes all the index columns, columns used in partial index predicates, and columns used in expression index expressions, and is built in the executor. The executor is the wrong place for that: I'm not sure why you think that making heapam depend on the executor is a modularity improvement. Furthermore this approach requires recalculating the list during each query, which is wasteful when it could only change during schema updates. I'd suggest making the relcache responsible for computing and saving this data, along the same lines as RelationGetIndexList(). (That also means that heapam can get the data from the relcache, saving a lot of API-bloating from passing around Attrids explicitly.) Also, rather than inventing Attrids, I'd suggest just using one Bitmapset with the convention that its contents are offset by FirstLowInvalidHeapAttributeNumber. The redefinition of the value of MaxHeapTuplesPerPage seems pretty ugly and unprincipled.
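The offset convention suggested here works because heap attribute numbers can be negative for system columns (e.g. OID), while bitmapset members must be non-negative. A hedged sketch, using a plain unsigned mask in place of a real Bitmapset and hard-coding the offset value for illustration:

```c
#include <stdbool.h>

/* Sketch of the single-Bitmapset convention: store each attribute number
 * offset by FirstLowInvalidHeapAttributeNumber so that system attributes
 * (negative attnums) map to small non-negative bit positions. The -8
 * value below is hard-coded for this example; PostgreSQL defines the
 * real constant in its headers. */
#define FIRST_LOW_INVALID_HEAP_ATTNUM (-8)

unsigned
attnum_set_add(unsigned set, int attnum)
{
    return set | (1u << (attnum - FIRST_LOW_INVALID_HEAP_ATTNUM));
}

bool
attnum_set_member(unsigned set, int attnum)
{
    return (set >> (attnum - FIRST_LOW_INVALID_HEAP_ATTNUM)) & 1u;
}
```

One set then covers regular and system columns uniformly, which is why a separate bitmapset for system attributes (discussed below) is unnecessary.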
I think it'd be better to leave it as-is, and just enforce that we don't allow more than that many line pointers on a heap page (as indeed you have to do anyway, so it's not clear what the change is buying). Is it really necessary to add hot_update to xl_heap_update? Seems the information should be available from the tuple header fields. Have you demonstrated that the prune_hard logic is worth its weight? Considering that many of us want to drop VACUUM FULL, adding more complexity to it doesn't seem like a profitable use of time. Is it really safe, or productive, to run heap_page_prune when the buffer is not locked for cleanup (ie, there are other people with pins on it)? Even if it's safe, ISTM what you're mostly accomplishing there is to expend a lot of cycles while holding exclusive lock on the page, when there is good reason to think that you're blocking other people who are interested in using the page. Eliminating the separation between that and cleanup would also allow eliminating the separate PD_FRAGMENTED page state. PlanSetValidForThisTransaction is completely bogus --- it's not re-entrant, and it needs to be. I think you need some state in PlannerGlobal instead. rd_avgfsm seems fairly bogus ... when does it get updated? regards, tom lane
Re: [PATCHES] HOT patch - version 14
Heikki Linnakangas [EMAIL PROTECTED] writes: I see that you have a separate bitmapset to keep track of indexes on system attributes. But having an index on a system attribute doesn't make any sense, does it? Counterexample: OID. regards, tom lane
Re: [PATCHES] HOT patch - version 14
On 8/20/07, Tom Lane [EMAIL PROTECTED] wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: I see that you have a separate bitmapset to keep track of indexes on system attributes. But having an index on a system attribute doesn't make any sense, does it? Counterexample: OID. Right. Further, we allow creating indexes on system attributes. So we must support those. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com