[HACKERS] Visibility map page pinned for too long ?
I was looking at the code when the following tiny bit caught my attention. In vacuumlazy.c, we release the pin on the final VM page at line number 972.

954         if (vacrelstats->num_dead_tuples > 0)
955         {
956             /* Log cleanup info before we touch indexes */
957             vacuum_log_cleanup_info(onerel, vacrelstats);
958
959             /* Remove index entries */
960             for (i = 0; i < nindexes; i++)
961                 lazy_vacuum_index(Irel[i],
962                                   &indstats[i],
963                                   vacrelstats);
964             /* Remove tuples from heap */
965             lazy_vacuum_heap(onerel, vacrelstats);
966             vacrelstats->num_index_scans++;
967         }
968
969         /* Release the pin on the visibility map page */
970         if (BufferIsValid(vmbuffer))
971         {
972             ReleaseBuffer(vmbuffer);
973             vmbuffer = InvalidBuffer;
974         }

So we are holding the pin right through the index vacuuming and the second pass over the heap; both can take a very long time. We can and should really be releasing the pin *before* those steps. In fact, it would be appropriate to do it right after the preceding big for-loop. While it may or may not matter from the performance or correctness perspective, I think we should fix that.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
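The fix Pavan suggests amounts to reordering the release so the VM pin is dropped before the long index and second-heap passes. A standalone toy sketch of that ordering — hypothetical simplified names, not PostgreSQL's actual buffer-manager API:

```c
#include <assert.h>

/* Toy stand-ins for PostgreSQL's buffer pin bookkeeping (hypothetical
 * simplifications, not the real API). */
typedef int Buffer;
#define InvalidBuffer (-1)

static int pin_count = 0;   /* pins currently held on VM pages */

static int  BufferIsValid(Buffer b) { return b != InvalidBuffer; }
static void ReleaseBuffer(Buffer b) { (void) b; pin_count--; }

/* The potentially very long phases; the asserts document that, with
 * the fixed ordering, they run with no VM page pinned. */
static void toy_vacuum_indexes(void)    { assert(pin_count == 0); }
static void toy_vacuum_heap_pass2(void) { assert(pin_count == 0); }

/* Fixed ordering: the main heap scan leaves a VM page pinned; release
 * it immediately, *before* the index and second heap passes, instead
 * of after them.  Returns the number of pins still held at the end. */
static int run_vacuum_tail(void)
{
    Buffer vmbuffer = 42;   /* pretend the scan left this page pinned */
    pin_count = 1;

    if (BufferIsValid(vmbuffer))
    {
        ReleaseBuffer(vmbuffer);
        vmbuffer = InvalidBuffer;
    }
    (void) vmbuffer;

    toy_vacuum_indexes();
    toy_vacuum_heap_pass2();
    return pin_count;
}
```

A pin is cheap to hold, but an unnecessarily long-held pin can stall operations that need to evict or clean the page, which is why the ordering matters even if no bug is observable.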
Re: [HACKERS] Visibility map page pinned for too long ?
On 3 December 2012 17:37, Pavan Deolasee pavan.deola...@gmail.com wrote:

I was looking at the code when the following tiny bit caught my attention. In vacuumlazy.c, we release the pin on the final VM page at line number 972.

954         if (vacrelstats->num_dead_tuples > 0)
955         {
956             /* Log cleanup info before we touch indexes */
957             vacuum_log_cleanup_info(onerel, vacrelstats);
958
959             /* Remove index entries */
960             for (i = 0; i < nindexes; i++)
961                 lazy_vacuum_index(Irel[i],
962                                   &indstats[i],
963                                   vacrelstats);
964             /* Remove tuples from heap */
965             lazy_vacuum_heap(onerel, vacrelstats);
966             vacrelstats->num_index_scans++;
967         }
968
969         /* Release the pin on the visibility map page */
970         if (BufferIsValid(vmbuffer))
971         {
972             ReleaseBuffer(vmbuffer);
973             vmbuffer = InvalidBuffer;
974         }

So we are holding the pin right through the index vacuuming and the second pass over the heap; both can take a very long time. We can and should really be releasing the pin *before* those steps. In fact, it would be appropriate to do it right after the preceding big for-loop. While it may or may not matter from the performance or correctness perspective, I think we should fix that.

Yes, it's a clear bug. Patched.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and hint bits
On Thu, May 5, 2011 at 2:20 PM, Merlin Moncure mmonc...@gmail.com wrote:
On Thu, May 5, 2011 at 2:00 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
Merlin Moncure mmonc...@gmail.com wrote: a small cache that remembers the commit/cancel status of recently seen transactions.
How is that different from the head of the clog SLRU?

Several things:
*) any slru access requires a lock (besides the lock itself, you are spending cycles in the critical path)
*) cache access happens at a different stage of processing in HeapTupleSatisfiesMVCC: both TransactionIdIsCurrentTransactionId and TransactionIdIsInProgress have to be checked first. Logically, it's an extension of the hint bit check itself, not an expansion of lower levels of caching
*) in tqual.c you can sneak in some small optimizations, like only caching the bit if it's known good in the WAL (XLogNeedsFlush). That way you don't need to keep checking it over and over for the same transaction
*) slru-level accesses happen too late to give much benefit: I can't stress enough how tight HeapTupleSatisfiesMVCC is. On my workstation VM, each non-inline function call shows up measurably in profiling. I think anything you do here has to be inline, hand rolled, and very tight (you can forget anything around dynahash). Delegating the cache management to the transam or (even worse) slru level penalizes some workloads non-trivially.

An updated patch is attached. It's still WIP, but I need a little guidance before going further.

What I did:
*) Added a lot of source-level comments that should explain better what's happening and why
*) Fixed a significant number of goofs in the earlier patch
*) Reorganized the interaction with HeapTupleSatisfiesMVCC. In particular, SetHintBits() now returns whether it actually set the bit, because I can use that information.

What's not done:
*) Only commit bits are cached, and caching action only happens in HeapTupleSatisfiesMVCC. I'm not sure yet if it's better to store invalid bits in the same cache or in a separate one. I'm not sure if the other satisfies routines should also be engaging the cache. Translated from nerd speak, that means I haven't yet done the research to see when they are fired and if they are bottlenecks :-).
*) I'd like to reach some sort of consensus with Tom on whether there is any point in going further in this direction. Not so much on the mechanics of how the cache works, but that it sits at the tqual.c level, with the changes to HeapTupleSatisfiesMVCC. In particular, I think caching at the transam.c level is a dead end on performance grounds regardless of how you implement the cache.

Some points of note:
*) Is it acceptable to use a static definition of memory like that? If not, should there be a more standard allocation under CacheMemoryContext?
*) Testing for the benefit is simple: just create a bunch of records and seqscan the table (select count(*)). Without the patch the first scan is slower and does a bunch of i/o. With it, it does not.
*) The cache overhead is *almost* not measurable. As best I can tell we are looking at maybe 1%-ish overhead in synthetic scan-heavy workloads (I think this is a fair price to pay for all the i/o savings). The degenerate case of repeated 'rollups' is really difficult to generate, even synthetically -- if the cache is performing lousily, the regular hint bit action tends to protect it. Performance testing under real workloads is going to give better info here.

merlin

hbache.patch
Description: Binary data
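The kind of cache Merlin describes can be sketched as a tiny, lock-free, direct-mapped array keyed by xid, consulted before falling through to the expensive clog path. This is a standalone illustration with invented names (the counter and the fake clog probe are stand-ins), not the patch's actual code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Tiny direct-mapped cache of recently seen commit status.  Keeping it
 * this simple is the point: the lookup must be cheap enough to sit in
 * the HeapTupleSatisfiesMVCC hot path -- no locks, no dynahash. */
#define XIDCACHE_SLOTS 256          /* power of two for cheap masking */

typedef struct { TransactionId xid; int committed; int valid; } XidCacheSlot;
static XidCacheSlot xid_cache[XIDCACHE_SLOTS];
static long clog_lookups = 0;       /* counts trips down the slow path */

/* Stand-in for a real clog probe: pretend even xids committed. */
static int slow_clog_is_committed(TransactionId xid)
{
    clog_lookups++;
    return xid % 2 == 0;
}

static int xid_is_committed(TransactionId xid)
{
    XidCacheSlot *slot = &xid_cache[xid & (XIDCACHE_SLOTS - 1)];
    if (slot->valid && slot->xid == xid)
        return slot->committed;     /* hit: no lock, no clog access */
    slot->xid = xid;                /* miss: probe and remember */
    slot->committed = slow_clog_is_committed(xid);
    slot->valid = 1;
    return slot->committed;
}
```

Bulk loads are the best case for this shape: millions of unhinted tuples share a handful of xids, so after one slow lookup per xid everything else is a cache hit and no page needs to be dirtied just to record a hint.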
[HACKERS] Visibility map and hint bits
There has been a lot of recent discussion about the visibility map (for index-only scans) and hint bits (trying to avoid double-writing a table). I wonder if we could fix both of these at the same time. Once the visibility map is reliable, can we use that to avoid updating the hint bits on all rows on a page? For bulk loads, all the pages are going to have the same xid and all be visible, so instead of writing the entire table, we just write the visibility map. I think the problem is that we have the PD_ALL_VISIBLE page flag, which requires a write of the page as well. Could we get by with only the visibility bits and remove PD_ALL_VISIBLE?

--
Bruce Momjian br...@momjian.us http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Visibility map and hint bits
On Thu, May 5, 2011 at 11:59 AM, Bruce Momjian br...@momjian.us wrote: There has been a lot of recent discussion about the visibility map (for index-only scans) and hint bits (trying to avoid double-writing a table).

I still think a small tqual.c-maintained cache of hint bits will effectively eliminate hint bit i/o issues surrounding bulk loads. Tom fired a shot across the bow regarding the general worthiness of that technique though (see: http://postgresql.1045698.n5.nabble.com/Process-local-hint-bit-cache-td4270229.html) :(. I can rig up a cleaned-up version of the patch pretty easily...it's a local change and fairly simple. I don't think there is any way to remove the hint bits without suffering some other problem.

merlin
Re: [HACKERS] Visibility map and hint bits
Merlin Moncure wrote:
On Thu, May 5, 2011 at 11:59 AM, Bruce Momjian br...@momjian.us wrote: There has been a lot of recent discussion about the visibility map (for index-only scans) and hint bits (trying to avoid double-writing a table).
I still think a small tqual.c-maintained cache of hint bits will effectively eliminate hint bit i/o issues surrounding bulk loads. Tom fired a shot across the bow regarding the general worthiness of that technique though (see: http://postgresql.1045698.n5.nabble.com/Process-local-hint-bit-cache-td4270229.html) :(. I can rig up a cleaned-up version of the patch pretty easily...it's a local change and fairly simple. I don't think there is any way to remove the hint bits without suffering some other problem.

Was that the idea that the pages had to fit in the cache and be updated with hint bits before being written to disk? Restricting that to the size of the buffer cache seemed very limiting. One 8k visibility map page can hold bits for 1/2 gig of heap pages, so I thought that would be a better all-visible indicator and avoid many all-visible page writes in bulk load cases.

--
Bruce Momjian br...@momjian.us http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
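Bruce's "1/2 gig" figure is easy to check: an 8 kB visibility map page holds 8192 × 8 = 65,536 bits, one bit per 8 kB heap page, so one map page covers 65,536 × 8 kB = 512 MB of heap. A quick standalone sketch of that arithmetic (ignoring the map page's small header, which costs a few of those bits in practice):

```c
#include <stdint.h>

#define BLCKSZ 8192   /* PostgreSQL's default block size, in bytes */

/* Bytes of heap covered by one visibility map page, assuming one bit
 * per heap page and ignoring the map page header for round numbers. */
static uint64_t vm_page_heap_coverage(void)
{
    uint64_t bits_per_vm_page = (uint64_t) BLCKSZ * 8;  /* 65,536 bits */
    return bits_per_vm_page * BLCKSZ;                   /* bytes of heap */
}
```

That density is why writing only the map is so attractive for bulk loads: one dirtied 8 kB map page can stand in for up to 65,536 dirtied heap pages.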
Re: [HACKERS] Visibility map and hint bits
On Thu, May 5, 2011 at 12:59 PM, Bruce Momjian br...@momjian.us wrote: I wonder if we could fix both of these at the same time. Once the visibility map is reliable, can we use that to avoid updating the hint bits on all rows on a page?

I don't think so. There are two problems:

1. If there is a long-running transaction on the system, it will not be possible to set PD_ALL_VISIBLE, but hint bits can still be set. So there could be a significant performance regression if we don't set hint bits in that case.

2. Making the visibility map crash-safe will mean making setting hint bits emit XLOG records, so it can't be done on Hot Standby servers at all, and it's much more expensive than just setting a hint bit on the master.

For bulk loads, all the pages are going to have the same xid and all be visible, so instead of writing the entire table, we just write the visibility map. I think the problem is that we have the PD_ALL_VISIBLE page flag, which requires a write of the page as well. Could we get by with only the visibility bits and remove PD_ALL_VISIBLE?

In some ways, that would make things much simpler. But to make that work, every insert/update/delete to a page would have to pin the visibility map page and clear PD_ALL_VISIBLE if appropriate, so it might not be good from a performance standpoint, especially in high-concurrency workloads. Right now, if PD_ALL_VISIBLE isn't set, we don't bother touching the visibility map page, which seems like a possibly important optimization.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Visibility map and hint bits
On Thu, May 5, 2011 at 1:34 PM, Bruce Momjian br...@momjian.us wrote:
Merlin Moncure wrote:
On Thu, May 5, 2011 at 11:59 AM, Bruce Momjian br...@momjian.us wrote: There has been a lot of recent discussion about the visibility map (for index-only scans) and hint bits (trying to avoid double-writing a table).
I still think a small tqual.c-maintained cache of hint bits will effectively eliminate hint bit i/o issues surrounding bulk loads. Tom fired a shot across the bow regarding the general worthiness of that technique though (see: http://postgresql.1045698.n5.nabble.com/Process-local-hint-bit-cache-td4270229.html) :(. I can rig up a cleaned-up version of the patch pretty easily...it's a local change and fairly simple. I don't think there is any way to remove the hint bits without suffering some other problem.
Was that the idea that the pages had to fit in the cache and be updated with hint bits before being written to disk? Restricting that to the size of the buffer cache seemed very limiting. One 8k visibility map page can hold bits for 1/2 gig of heap pages, so I thought that would be a better all-visible indicator and avoid many all-visible page writes in bulk load cases.

No, that was my first idea -- check visibility when you evict. That helps a different problem, but not bulk loads. One way it could help is for marking PD_ALL_VISIBLE. This might also be a winner, but there is some valid skepticism that adding more work for the bgwriter is really a good idea.

The tqual cache idea is that there is a small cache that remembers the commit/cancel status of recently seen transactions. If you scan a tuple and the status is known via the cache, you set the bit but don't mark the page dirty. That way, if you are scanning a lot of unhinted tuples with similar xids, you don't need to jam out i/o. I think the general concept is clean, but it might need some buy-in from Tom and some performance testing for justification. The alternate 'cleaner' approach of maintaining a larger transam.c cache had some downsides I saw no simple workaround for.

merlin
Re: [HACKERS] Visibility map and hint bits
Merlin Moncure mmonc...@gmail.com wrote: a small cache that remembers the commit/cancel status of recently seen transactions. How is that different from the head of the clog SLRU? -Kevin
Re: [HACKERS] Visibility map and hint bits
On Thu, May 5, 2011 at 2:00 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
Merlin Moncure mmonc...@gmail.com wrote: a small cache that remembers the commit/cancel status of recently seen transactions.
How is that different from the head of the clog SLRU?

Several things:
*) any slru access requires a lock (besides the lock itself, you are spending cycles in the critical path)
*) cache access happens at a different stage of processing in HeapTupleSatisfiesMVCC: both TransactionIdIsCurrentTransactionId and TransactionIdIsInProgress have to be checked first. Logically, it's an extension of the hint bit check itself, not an expansion of lower levels of caching
*) in tqual.c you can sneak in some small optimizations, like only caching the bit if it's known good in the WAL (XLogNeedsFlush). That way you don't need to keep checking it over and over for the same transaction
*) slru-level accesses happen too late to give much benefit: I can't stress enough how tight HeapTupleSatisfiesMVCC is. On my workstation VM, each non-inline function call shows up measurably in profiling. I think anything you do here has to be inline, hand rolled, and very tight (you can forget anything around dynahash). Delegating the cache management to the transam or (even worse) slru level penalizes some workloads non-trivially.

merlin
Re: [HACKERS] visibility map
Can we just log the change of VM in log_heap_clean() for redo? Thanks -- GaoZengqi pgf...@gmail.com zengqi...@gmail.com On Tue, Nov 23, 2010 at 3:24 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Jun 14, 2010 at 1:19 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I *think* that the answer to this parenthesized question is no. When we vacuum a page, we set the LSN on both the heap page and the visibility map page. Therefore, neither of them can get written to disk until the WAL record is flushed, but they could get flushed in either order. So the visibility map page could get flushed before the heap page, as the non-parenthesized portion of the comment indicates. Right. However, at least in theory, it seems like we could fix this up during redo. Setting a bit in the visibility map is currently not WAL-logged, but yes once we add WAL-logging, that's straightforward to fix. Eh, so. Suppose - for the sake of argument - we do the following: 1. Allocate an additional infomask(2) bit that means xmin is frozen, no need to call XidInMVCCSnapshot(). When we freeze a tuple, we set this bit in lieu of overwriting xmin. Note that freezing pages is already WAL-logged, so redo is possible. 2. Modify VACUUM so that, when the page is observed to be all-visible, it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the visibility map bit, writing a single XLOG record for the whole operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same vacuum already removed tuples; otherwise and/or when no tuples were removed writing XLOG_HEAP2_FREEZE or some new record type). This loses no forensic information because of (1). (If the page is NOT observed to be all-visible, we freeze individual tuples only when they hit the current age thresholds.) Setting the visibility map bit is now crash-safe. Please poke holes. 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] visibility map
On 22.11.2010 21:24, Robert Haas wrote: Eh, so. Suppose - for the sake of argument - we do the following: 1. Allocate an additional infomask(2) bit that means xmin is frozen, no need to call XidInMVCCSnapshot(). When we freeze a tuple, we set this bit in lieu of overwriting xmin. Note that freezing pages is already WAL-logged, so redo is possible. 2. Modify VACUUM so that, when the page is observed to be all-visible, it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the visibility map bit, writing a single XLOG record for the whole operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same vacuum already removed tuples; otherwise and/or when no tuples were removed writing XLOG_HEAP2_FREEZE or some new record type). This loses no forensic information because of (1). (If the page is NOT observed to be all-visible, we freeze individual tuples only when they hit the current age thresholds.) Setting the visibility map bit is now crash-safe.

That's an interesting idea. You piggyback setting the vm bit on the freeze WAL record, on the assumption that you have to write the freeze record anyway. However, if that assumption doesn't hold, because the tuples are deleted before they reach vacuum_freeze_min_age, it's no better than the naive approach of WAL-logging the vm bit set separately. Whether that's acceptable or not, I don't know.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] visibility map
On Tue, Nov 23, 2010 at 3:42 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: That's an interesting idea. You pickyback setting the vm bit on the freeze WAL record, on the assumption that you have to write the freeze record anyway. However, if that assumption doesn't hold, because the tuples are deleted before they reach vacuum_freeze_min_age, it's no better than the naive approach of WAL-logging the vm bit set separately. Whether that's acceptable or not, I don't know. I don't know, either. I was trying to think of the cases where this would generate a net increase in WAL before I sent the email, but couldn't fully wrap my brain around it at the time. Thanks for summarizing. Here's another design to poke holes in: 1. Imagine that the visibility map is divided into granules. For the sake of argument let's suppose there are 8K bits per granule; thus each granule covers 64M of the underlying heap and 1K of space in the visibility map itself. 2. In shared memory, create a new array called the visibility vacuum array (VVA), each element of which has room for a backend ID, a relfilenode, a granule number, and an LSN. Before setting bits in the visibility map, a backend is required to allocate a slot in this array, XLOG the slot allocation, and fill in its backend ID, relfilenode number, and the granule number whose bits it will be manipulating, plus the LSN of the slot allocation XLOG record. It then sets as many bits within that granule as it likes. When done, it sets the backend ID of the VVA slot to InvalidBackendId but does not remove it from the array immediately; such a slot is said to have been released. 3. When visibility map bits are set, the LSN of the page is set to the new-VVA-slot XLOG record, so that the visibility map page can't hit the disk before the new-VVA-slot XLOG record. Also, the contents of the VVA, sans backend IDs, are XLOG'd at each checkpoint. 
Thus, on redo, we can compute a list of all VVA slots for which visibility-bit changes might already be on disk; we go through and clear both the visibility map bit and the PD_ALL_VISIBLE bits on the underlying pages. 4. To free a VVA slot that has been released, we must XLogFlush as far as the record that allocated the slot and sync the visibility map and heap segments containing that granule. Thus, all slots released before a checkpoint starts can be freed after it completes. Alternatively, an individual backend can free a previously-released slot by performing the xlog flush and syncs itself. (This might require a few more bookkeeping details to be stored in the VVA, but it seems manageable.)

One problem with this design is that the visibility map bits never get set on standby servers. If we don't XLOG setting the bit then I suppose that doesn't happen now either, but it's more sucky (that's the technical term) if you're relying on it for index-only scans (which are also relevant on the standby, either during HS or if promoted) versus if you're only relying on it for vacuum (which doesn't happen on the standby anyway unless and until it's promoted).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
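Purely as an illustration of the slot lifecycle the proposal describes (allocate → release → free) — none of these structures exist in PostgreSQL, and all field and function names here are invented — a minimal sketch of the VVA bookkeeping might look like:

```c
#include <assert.h>
#include <stdint.h>

#define VVA_SLOTS 16
#define InvalidBackendId (-1)

/* One visibility vacuum array entry, per step 2 of the proposal. */
typedef struct
{
    int      backend_id;   /* InvalidBackendId once released */
    uint32_t relfilenode;
    uint32_t granule;      /* which 64 MB chunk of the heap */
    uint64_t lsn;          /* LSN of the slot-allocation WAL record */
    int      in_use;       /* allocated, or released but not yet freed */
} VVASlot;

static VVASlot vva[VVA_SLOTS];

/* Allocate a slot before setting any bits in the covered granule. */
static int vva_allocate(int backend_id, uint32_t relfilenode,
                        uint32_t granule, uint64_t lsn)
{
    for (int i = 0; i < VVA_SLOTS; i++)
        if (!vva[i].in_use)
        {
            vva[i] = (VVASlot){ backend_id, relfilenode, granule, lsn, 1 };
            return i;
        }
    return -1;             /* array full: caller must wait or flush */
}

/* Release: done setting bits, but redo must still see this slot. */
static void vva_release(int slot)
{
    vva[slot].backend_id = InvalidBackendId;
}

/* Free: legal only after the WAL flush and segment syncs have made
 * the granule's bits durable (simulated here by the precondition). */
static void vva_free(int slot)
{
    assert(vva[slot].backend_id == InvalidBackendId);
    vva[slot].in_use = 0;
}
```

The released-but-not-freed state is the interesting part: it is what lets redo find granules whose bits may be half on disk and conservatively clear them.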
Re: [HACKERS] visibility map
On Mon, Jun 14, 2010 at 1:19 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I *think* that the answer to this parenthesized question is no. When we vacuum a page, we set the LSN on both the heap page and the visibility map page. Therefore, neither of them can get written to disk until the WAL record is flushed, but they could get flushed in either order. So the visibility map page could get flushed before the heap page, as the non-parenthesized portion of the comment indicates. Right. However, at least in theory, it seems like we could fix this up during redo. Setting a bit in the visibility map is currently not WAL-logged, but yes once we add WAL-logging, that's straightforward to fix. Eh, so. Suppose - for the sake of argument - we do the following: 1. Allocate an additional infomask(2) bit that means xmin is frozen, no need to call XidInMVCCSnapshot(). When we freeze a tuple, we set this bit in lieu of overwriting xmin. Note that freezing pages is already WAL-logged, so redo is possible. 2. Modify VACUUM so that, when the page is observed to be all-visible, it will freeze all tuples on the page, set PD_ALL_VISIBLE, and set the visibility map bit, writing a single XLOG record for the whole operation (possibly piggybacking on XLOG_HEAP2_CLEAN if the same vacuum already removed tuples; otherwise and/or when no tuples were removed writing XLOG_HEAP2_FREEZE or some new record type). This loses no forensic information because of (1). (If the page is NOT observed to be all-visible, we freeze individual tuples only when they hit the current age thresholds.) Setting the visibility map bit is now crash-safe. Please poke holes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] visibility map
visibilitymap.c begins with a long and useful comment - but this part seems to have a bit of split personality disorder.

 * Currently, the visibility map is not 100% correct all the time.
 * During updates, the bit in the visibility map is cleared after releasing
 * the lock on the heap page. During the window between releasing the lock
 * and clearing the bit in the visibility map, the bit in the visibility map
 * is set, but the new insertion or deletion is not yet visible to other
 * backends.
 *
 * That might actually be OK for the index scans, though. The newly inserted
 * tuple wouldn't have an index pointer yet, so all tuples reachable from an
 * index would still be visible to all other backends, and deletions wouldn't
 * be visible to other backends yet. (But HOT breaks that argument, no?)

I believe that the answer to the parenthesized question here is yes (in which case we might want to just delete this paragraph).

 * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
 * vacuum observes that all tuples are visible to all, it sets the flag on
 * the heap page, and also sets the bit in the visibility map. If we then
 * crash, and only the visibility map page was flushed to disk, we'll have
 * a bit set in the visibility map, but the corresponding flag on the heap
 * page is not set. If the heap page is then updated, the updater won't
 * know to clear the bit in the visibility map. (Isn't that prevented by
 * the LSN interlock?)

I *think* that the answer to this parenthesized question is no. When we vacuum a page, we set the LSN on both the heap page and the visibility map page. Therefore, neither of them can get written to disk until the WAL record is flushed, but they could get flushed in either order. So the visibility map page could get flushed before the heap page, as the non-parenthesized portion of the comment indicates. However, at least in theory, it seems like we could fix this up during redo. Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: [HACKERS] visibility map
On 14/06/10 06:08, Robert Haas wrote: visibilitymap.c begins with a long and useful comment - but this part seems to have a bit of split personality disorder.

 * Currently, the visibility map is not 100% correct all the time.
 * During updates, the bit in the visibility map is cleared after releasing
 * the lock on the heap page. During the window between releasing the lock
 * and clearing the bit in the visibility map, the bit in the visibility map
 * is set, but the new insertion or deletion is not yet visible to other
 * backends.
 *
 * That might actually be OK for the index scans, though. The newly inserted
 * tuple wouldn't have an index pointer yet, so all tuples reachable from an
 * index would still be visible to all other backends, and deletions wouldn't
 * be visible to other backends yet. (But HOT breaks that argument, no?)

I believe that the answer to the parenthesized question here is yes (in which case we might want to just delete this paragraph).

A HOT update can only update non-indexed columns, so I think we're still OK with HOT. To an index-only scan, it doesn't matter which tuple in a HOT update chain you consider as live, because they all must have the same values in the indexed columns. Subtle..

 * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
 * vacuum observes that all tuples are visible to all, it sets the flag on
 * the heap page, and also sets the bit in the visibility map. If we then
 * crash, and only the visibility map page was flushed to disk, we'll have
 * a bit set in the visibility map, but the corresponding flag on the heap
 * page is not set. If the heap page is then updated, the updater won't
 * know to clear the bit in the visibility map. (Isn't that prevented by
 * the LSN interlock?)

I *think* that the answer to this parenthesized question is no. When we vacuum a page, we set the LSN on both the heap page and the visibility map page. Therefore, neither of them can get written to disk until the WAL record is flushed, but they could get flushed in either order. So the visibility map page could get flushed before the heap page, as the non-parenthesized portion of the comment indicates.

Right. However, at least in theory, it seems like we could fix this up during redo. Setting a bit in the visibility map is currently not WAL-logged, but yes, once we add WAL-logging, that's straightforward to fix.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map and freezing
Heikki Linnakangas wrote:
ITAGAKI Takahiro wrote: - What relation is there between autovacuum_freeze_max_age, vacuum_freeze_min_age and vacuum_freeze_table_age? If we increase one of them, should we also increase the others?
Yeah, that's a fair question. I'll try to work up a doc patch to explain that better.

Ok, how does this sound:

+    <para>
+     <command>VACUUM</command> normally skips pages that don't have any dead row
+     versions, but those pages might still have tuples with old XID values.
+     To replace them too, a scan of the whole table is needed every once
+     in a while. <varname>vacuum_freeze_table_age</varname> controls when
+     <command>VACUUM</command> does that: a whole table sweep is forced if
+     <structfield>relfrozenxid</structfield> is more than
+     <varname>vacuum_freeze_table_age</varname> transactions old. Setting it to 0
+     makes <command>VACUUM</command> ignore the visibility map and always scan all
+     pages. The effective maximum is 0.95 * <varname>autovacuum_freeze_max_age</varname>;
+     a setting higher than that will be capped to that maximum. A value
+     higher than <varname>autovacuum_freeze_max_age</varname> wouldn't make sense
+     because an anti-wraparound autovacuum would be triggered at that point
+     anyway, and the 0.95 multiplier leaves some breathing room to run a manual
+     <command>VACUUM</command> before that happens. As a rule of thumb,
+     <varname>vacuum_freeze_table_age</varname> should be set to a value somewhat
+     below <varname>autovacuum_freeze_max_age</varname>. Setting it too close could
+     lead to anti-wraparound autovacuums, even though the table was recently
+     manually vacuumed, whereas lower values lead to more frequent whole-table
+     scans.
+    </para>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map and freezing
Euler Taveira de Oliveira wrote:
Simon Riggs escreveu:
On Tue, 2009-01-20 at 21:51 +0200, Heikki Linnakangas wrote: Attached is a simple patch to only start skipping pages after 20 consecutive pages marked as visible in the visibility map. This doesn't do any look-ahead, so it will always scan the first 20 pages of a table before it starts to skip pages, and whenever there's even one page that needs vacuuming, the next 19 pages will also be vacuumed. We could adjust that figure 20 according to table size. Or by seq_page_cost/random_page_cost. But I'm leaning towards a simple hard-coded value for now.
Yes, sounds good. Can we stick to powers of 2, as the OS readahead does IIRC? So either 16 or 32. I'd go 32.
Agreed. And make it a constant (ALL_VISIBLE_VM_THRESHOLD?).

Okay-dokay. I committed this with the constant as a #define, at value 32.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
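The committed behavior can be sketched in standalone form: a run of all-visible pages is skipped only when it is at least 32 pages long, so short gaps of skippable pages are still read and the OS read-ahead isn't defeated. This is an illustrative simplification, not the actual code in lazy_scan_heap (and the thread leaves the constant's final name open; `SKIP_PAGES_THRESHOLD` here is one plausible spelling):

```c
#include <stdbool.h>

#define SKIP_PAGES_THRESHOLD 32   /* the committed #define's value */

/* Count how many pages vacuum would actually read, given a per-page
 * all-visible flag: a run of all-visible pages is skipped only when it
 * is at least SKIP_PAGES_THRESHOLD pages long; shorter runs are read
 * anyway so sequential i/o stays sequential. */
static int pages_scanned(const bool *all_visible, int npages)
{
    int scanned = 0;
    int blkno = 0;
    while (blkno < npages)
    {
        if (all_visible[blkno])
        {
            int run = 0;
            while (blkno + run < npages && all_visible[blkno + run])
                run++;
            if (run >= SKIP_PAGES_THRESHOLD)
            {
                blkno += run;     /* long enough run: skip it entirely */
                continue;
            }
        }
        scanned++;                /* read this page */
        blkno++;
    }
    return scanned;
}
```

For example, a 100-page table that is entirely all-visible would be skipped completely, while a single dirty page in the middle forces only that one page to be read, since the runs on either side still exceed the threshold.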
Re: [HACKERS] Visibility map and freezing
ITAGAKI Takahiro wrote: Gregory Stark st...@enterprisedb.com wrote: I don't think we can perfectly capture the meaning of these GUCs in the name. I think our goal should be to avoid confusion between them. I was thinking it would be clearer if the options which control *when* autovacuum fires off a worker consistently had some action word in them like trigger or start or launch. I think we need more explanations about those variables, not only how they work but also how to tune them. I feel they are un-tunable parameters. Our documentation says: | Larger values of these settings | preserve transactional information longer, while smaller values increase | the number of transactions that can elapse before the table must be | vacuumed again. i.e., we are explaining the variables only as "larger is better", but is it really true? Yes, that is explicitly explained in the docs: "The sole disadvantage of increasing <varname>vacuum_freeze_table_age</> and <varname>autovacuum_freeze_max_age</> is that the <filename>pg_clog</> subdirectory of the database cluster will take more space, because it must store the commit status for all transactions back to the <varname>autovacuum_freeze_max_age</> horizon." ... - What relation is there between autovacuum_freeze_max_age, vacuum_freeze_min_age and vacuum_freeze_table_age? If we increase one of them, should we also increase the others? Yeah, that's a fair question. I'll try to work up a doc patch to explain that better. - Is it ok to increase the variables to maximum values? Are there any trade-offs? Disk space. - Are there some conditions where a whole-table-scanning vacuum is more effective than a vacuum using the visibility map? If so, we should switch to a full scan *automatically*, without relying on user configuration. Hmm, the only downside I can see is that skipping a page here and there could defeat the OS read-ahead. Perhaps we should call posix_fadvise(SEQUENTIAL) to compensate. Or, we could modify the logic to only skip pages when there are at least N consecutive pages that can be skipped. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
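The posix_fadvise(SEQUENTIAL) idea mentioned above would look roughly like the generic POSIX sketch below. This is not a patch against PostgreSQL; it just shows the call that tells the kernel a file descriptor will be read sequentially, so readahead keeps running even if the reader skips the odd page.

```c
#include <assert.h>
#include <fcntl.h>

/* Generic POSIX sketch of the idea above, not PostgreSQL code: advise
 * the kernel that fd will be read sequentially so it keeps doing
 * readahead even when the reader occasionally skips a page. */
static int
advise_sequential(int fd)
{
#ifdef POSIX_FADV_SEQUENTIAL
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#else
    (void) fd;
    return 0;                   /* fadvise unavailable; nothing to do */
#endif
}
```

posix_fadvise returns 0 on success, so the call is cheap to issue unconditionally before a sequential scan.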
Re: [HACKERS] Visibility map and freezing
On Tue, 2009-01-20 at 11:03 +0200, Heikki Linnakangas wrote: - Are there some conditions where whole-table-scanning vacuum is more effective than vacuums using visibility map? If so, we should switch to full-scan *automatically*, without relying on user configurations. Hmm, the only downside I can see is that skipping a page here and there could defeat the OS read-ahead. Perhaps we should call posix_fadvise(SEQUENTIAL) to compensate. Or, we could modify the logic to only skip pages when there's at least N consecutive pages that can be skipped. I would rather we didn't skip any pages at all unless the gains are significant. Skipping the odd page makes no difference from a performance perspective but may have a robustness impact. Significant gains should take into account the size of both heap and indexes, and recognise that we still scan whole indexes in either case. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Simon Riggs wrote: On Tue, 2009-01-20 at 11:03 +0200, Heikki Linnakangas wrote: - Are there some conditions where whole-table-scanning vacuum is more effective than vacuums using visibility map? If so, we should switch to full-scan *automatically*, without relying on user configurations. Hmm, the only downside I can see is that skipping a page here and there could defeat the OS read-ahead. Perhaps we should call posix_fadvise(SEQUENTIAL) to compensate. Or, we could modify the logic to only skip pages when there's at least N consecutive pages that can be skipped. I would rather we didn't skip any pages at all unless the gains are significant. Skipping the odd page makes no difference from a performance perspective but may have a robustness impact. Significant gains should take into account the size of both heap and indexes, and recognise that we still scan whole indexes in either case. That sounds pretty complex, approaching what the planner does. I'd rather keep it simple. Attached is a simple patch to only start skipping pages after 20 consecutive pages marked as visible in the visibility map. This doesn't do any look-ahead, so it will always scan the first 20 pages of a table before it starts to skip pages, and whenever there's even one page that needs vacuuming, the next 19 pages will also be vacuumed. We could adjust that figure 20 according to table size. Or by seq_page_cost/random_page_cost. But I'm leaning towards a simple hard-coded value for now. 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

*** src/backend/commands/vacuumlazy.c
--- src/backend/commands/vacuumlazy.c
***************
*** 271,276 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 271,277 ----
  	int			i;
  	PGRUsage	ru0;
  	Buffer		vmbuffer = InvalidBuffer;
+ 	BlockNumber	all_visible_streak;
  
  	pg_rusage_init(&ru0);
***************
*** 292,297 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 293,299 ----
  	lazy_space_alloc(vacrelstats, nblocks);
  
+ 	all_visible_streak = 0;
  	for (blkno = 0; blkno < nblocks; blkno++)
  	{
  		Buffer		buf;
***************
*** 309,315 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  		/*
  		 * Skip pages that don't require vacuuming according to the
! 		 * visibility map.
  		 */
  		if (!scan_all)
  		{
--- 311,324 ----
  		/*
  		 * Skip pages that don't require vacuuming according to the
! 		 * visibility map. But only if we've seen a streak of at least
! 		 * 20 pages marked as clean. Since we're reading sequentially,
! 		 * the OS should be doing readahead for us and there's no gain
! 		 * in skipping a page now and then. You need a longer run of
! 		 * consecutive skipped pages before it's worthwhile. Also,
! 		 * skipping even a single page means that we can't update
! 		 * relfrozenxid or reltuples, so we only want to do it if
! 		 * there's a good chance to skip a goodly number of pages.
  		 */
  		if (!scan_all)
  		{
***************
*** 317,325 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
  			all_visible_according_to_vm =
  				visibilitymap_test(onerel, blkno, &vmbuffer);
  			if (all_visible_according_to_vm)
  			{
! 				vacrelstats->scanned_all = false;
! 				continue;
  			}
  		}
  
  		vacuum_delay_point();
--- 326,340 ----
  			all_visible_according_to_vm =
  				visibilitymap_test(onerel, blkno, &vmbuffer);
  			if (all_visible_according_to_vm)
  			{
! 				all_visible_streak++;
! 				if (all_visible_streak >= 20)
! 				{
! 					vacrelstats->scanned_all = false;
! 					continue;
! 				}
  			}
+ 			else
+ 				all_visible_streak = 0;
  		}
  
  		vacuum_delay_point();

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
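The skipping rule in the patch above can be restated as a small standalone function. The helper below is hypothetical, pulled out of lazy_scan_heap() purely for illustration: pages are only skipped once at least 20 consecutive all-visible pages have been seen, so a lone clean page in a dirty region is still scanned.

```c
#include <assert.h>
#include <string.h>

#define SKIP_PAGES_THRESHOLD 20     /* the patch's hard-coded streak length */

/* Hypothetical restatement of the patch's skipping rule, for
 * illustration only.  vm_bits encodes the visibility map: '1' means the
 * page is marked all-visible.  Returns how many pages a vacuum scan
 * would skip under the streak rule. */
static int
count_skipped_pages(const char *vm_bits)
{
    int streak = 0;
    int skipped = 0;
    int nblocks = (int) strlen(vm_bits);

    for (int blkno = 0; blkno < nblocks; blkno++)
    {
        if (vm_bits[blkno] == '1')
        {
            streak++;
            if (streak >= SKIP_PAGES_THRESHOLD)
                skipped++;      /* page is skipped, not scanned */
        }
        else
            streak = 0;         /* any page needing vacuum resets the streak */
    }
    return skipped;
}
```

With this rule, even a table whose visibility map is entirely set has its first 19 pages scanned before skipping begins, matching the "always scan the first 20 pages" behavior described above.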
Re: [HACKERS] Visibility map and freezing
On Tue, 2009-01-20 at 21:51 +0200, Heikki Linnakangas wrote: Attached is a simple patch to only start skipping pages after 20 consecutive pages marked as visible in the visibility map. This doesn't do any look-ahead, so it will always scan the first 20 pages of a table before it starts to skip pages, and whenever there's even one page that needs vacuuming, the next 19 pages will also be vacuumed. We could adjust that figure 20 according to table size. Or by seq_page_cost/random_page_cost. But I'm leaning towards a simple hard-coded value for now. Yes, sounds good. Can we stick to multiples of 2 as the OS readahead does IIRC? So either 16 or 32. I'd go 32. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Simon Riggs escreveu: On Tue, 2009-01-20 at 21:51 +0200, Heikki Linnakangas wrote: Attached is a simple patch to only start skipping pages after 20 consecutive pages marked as visible in the visibility map. This doesn't do any look-ahead, so it will always scan the first 20 pages of a table before it starts to skip pages, and whenever there's even one page that needs vacuuming, the next 19 pages will also be vacuumed. We could adjust that figure 20 according to table size. Or by seq_page_cost/random_page_cost. But I'm leaning towards a simple hard-coded value for now. Yes, sounds good. Can we stick to multiples of 2 as the OS readahead does IIRC? So either 16 or 32. I'd go 32. Agreed. And make it a constant (ALL_VISIBLE_VM_THRESHOLD?). -- Euler Taveira de Oliveira http://www.timbira.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Gregory Stark st...@enterprisedb.com wrote: I don't think we can perfectly capture the meaning of these GUCs in the name. I think our goal should be to avoid confusion between them. I was thinking it would be clearer if the options which control *when* autovacuum fires off a worker consistently had some action word in them like trigger or start or launch. I think we need more explanations about those variables, not only how they work but also how to tune them. I feel they are un-tunable parameters. Our documentation says: | Larger values of these settings | preserve transactional information longer, while smaller values increase | the number of transactions that can elapse before the table must be | vacuumed again. i.e., we are explaining the variables only as "larger is better", but is it really true? I think we should have answers to the following questions: - What relation is there between autovacuum_freeze_max_age, vacuum_freeze_min_age and vacuum_freeze_table_age? If we increase one of them, should we also increase the others? - Is it ok to increase the variables to maximum values? Are there any trade-offs? - Are there some conditions where a whole-table-scanning vacuum is more effective than a vacuum using the visibility map? If so, we should switch to a full scan *automatically*, without relying on user configuration. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Jeff Davis wrote: On Fri, 2009-01-09 at 13:49 +0200, Heikki Linnakangas wrote: I'm now leaning towards: autovacuum_freeze_max_age vacuum_freeze_table_age vacuum_freeze_min_age where autovacuum_freeze_max_age and vacuum_freeze_min_age are unchanged, and vacuum_freeze_table_age is the new setting that controls when VACUUM or autovacuum should perform a full scan of the table to advance relfrozenxid. I'm still bothered by the fact that max and min really mean the same thing here. Yeah. Those are existing names, though, and I don't recall any complaints from users. I don't think we can perfectly capture the meaning of these GUCs in the name. I think our goal should be to avoid confusion between them. Agreed. I was thinking it would be clearer if the options which control *when* autovacuum fires off a worker consistently had some action word in them like trigger or start or launch. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Alvaro Herrera wrote: Heikki Linnakangas escribió: This patch adds a new column to pg_autovacuum, reflecting the new vacuum_freeze_table_age GUC just like freeze_min_age column reflects vacuum_freeze_min_age. If I commit this tomorrow, will I cause you a lot of trouble with the reloptions patch? No -- go ahead, I'll merge it. Ok, committed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas wrote: Also, is anything being done about the concern about 'vacuum storm' explained below? I'm interested too. The additional vacuum_freeze_table_age (as I'm now calling it) setting I discussed in a later thread should alleviate that somewhat. When a table is autovacuumed, the whole table is scanned to freeze tuples if it's older than vacuum_freeze_table_age, and relfrozenxid is advanced. When different tables reach the autovacuum threshold at different times, they will also have their relfrozenxids set to different values. And in fact no anti-wraparound vacuum is needed. That doesn't help with read-only or insert-only tables, but that's not a new problem. OK, is this targeted for 8.4? -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Gregory Stark wrote: Bruce Momjian br...@momjian.us writes: Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M when our wraparound limit is around 2B? I suggested raising it dramatically in the post you quote and Heikki pointed it controls the maximum amount of space the clog will take. Raising it to, say, 800M will mean up to 200MB of space which might be kind of annoying for a small database. It would be nice if we could ensure the clog got trimmed frequently enough on small databases that we could raise the max_age. It's really annoying to see all these vacuums running 10x more often than necessary. I always assumed that it was our 4-byte xid that was requiring our vacuum freeze, but I now see our limiting factor is the size of clog; interesting. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Bruce Momjian wrote: Heikki Linnakangas wrote: Also, is anything being done about the concern about 'vacuum storm' explained below? I'm interested too. The additional vacuum_freeze_table_age (as I'm now calling it) setting I discussed in a later thread should alleviate that somewhat. When a table is autovacuumed, the whole table is scanned to freeze tuples if it's older than vacuum_freeze_table_age, and relfrozenxid is advanced. When different tables reach the autovacuum threshold at different times, they will also have their relfrozenxids set to different values. And in fact no anti-wraparound vacuum is needed. That doesn't help with read-only or insert-only tables, but that's not a new problem. OK, is this targeted for 8.4? Yes. It's been on my todo list for a long time, and I've also added it to the Open Items list so that we don't lose track of it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Jeff Davis wrote: On Fri, 2009-01-09 at 13:49 +0200, Heikki Linnakangas wrote: If the distinction you're making is that autovacuum_freeze_max_age affects the launching of a vacuum rather than the behavior of a vacuum, maybe we could incorporate the word launch like: autovacuum_launch_freeze_threshold Hmm, I think I'd like it in the form autovacuum_freeze_launch_threshold more. I'm now leaning towards: autovacuum_freeze_max_age vacuum_freeze_table_age vacuum_freeze_min_age where autovacuum_freeze_max_age and vacuum_freeze_min_age are unchanged, and vacuum_freeze_table_age is the new setting that controls when VACUUM or autovacuum should perform a full scan of the table to advance relfrozenxid. I'm still bothered by the fact that max and min really mean the same thing here. Yeah. Those are existing names, though, and I don't recall any complaints from users. I don't think we can perfectly capture the meaning of these GUCs in the name. I think our goal should be to avoid confusion between them. Agreed. Well, for better or worse here's a patch leaving the existing setting names alone, and the new GUC is called vacuum_freeze_table_age. I'm not opposed to changing the names of the existing GUCs. If we do change them, the columns in pg_autovacuum need to be changed too. There are documentation changes included to describe the new GUC, and the new behavior of VACUUM with the visibility map in general. Is it readable, and is it enough? Alvaro, are you getting rid of pg_autovacuum in favor of the reloptions for 8.4? This patch adds a new column to pg_autovacuum, reflecting the new vacuum_freeze_table_age GUC just like the freeze_min_age column reflects vacuum_freeze_min_age. If I commit this tomorrow, will I cause you a lot of trouble with the reloptions patch? 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

*** doc/src/sgml/catalogs.sgml
--- doc/src/sgml/catalogs.sgml
***************
*** 1361,1366 ****
--- 1361,1373 ----
        <entry></entry>
        <entry>Custom <varname>autovacuum_freeze_max_age</> parameter</entry>
       </row>
+ 
+      <row>
+       <entry><structfield>freeze_table_age</structfield></entry>
+       <entry><type>integer</type></entry>
+       <entry></entry>
+       <entry>Custom <varname>vacuum_freeze_table_age</> parameter</entry>
+      </row>
      </tbody>
     </tgroup>
    </table>

*** doc/src/sgml/config.sgml
--- doc/src/sgml/config.sgml
***************
*** 3950,3955 **** COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
--- 3950,3975 ----
       </listitem>
      </varlistentry>
+ 
+     <varlistentry id="guc-vacuum-freeze-max-age" xreflabel="vacuum_freeze_max_age">
+      <term><varname>vacuum_freeze_max_age</varname> (<type>integer</type>)</term>
+      <indexterm>
+       <primary><varname>vacuum_freeze_max_age</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        <command>VACUUM</> performs a whole-table scan if the table's
+        <structname>pg_class</>.<structfield>relfrozenxid</> field reaches the
+        age specified by this setting. The default is 150 million
+        transactions. Although users can set this value anywhere from zero to
+        one billion, <command>VACUUM</> will silently limit the effective value
+        to the value of <xref linkend="guc-autovacuum-freeze-max-age"> minus
+        1 million transactions, so that a regular manual <command>VACUUM</> has a
+        chance to run before autovacuum is launched to prevent XID wraparound.
+        For more information see <xref linkend="vacuum-for-wraparound">.
+       </para>
+      </listitem>
+     </varlistentry>
+ 
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)</term>
       <indexterm>
***************
*** 3960,3966 **** COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        Specifies the cutoff age (in transactions) that <command>VACUUM</>
        should use to decide whether to replace transaction IDs with
        <literal>FrozenXID</> while scanning a table.
!       The default is 100 million transactions.
        Although users can set this value anywhere from zero to one billion,
        <command>VACUUM</> will silently limit the effective value to half
        the value of <xref linkend="guc-autovacuum-freeze-max-age">, so
--- 3980,3986 ----
        Specifies the cutoff age (in transactions) that <command>VACUUM</>
        should use to decide whether to replace transaction IDs with
        <literal>FrozenXID</> while scanning a table.
!       The default is 50 million transactions.
        Although users can set this value anywhere from zero to one billion,
        <command>VACUUM</> will silently limit the effective value to half
        the value of <xref linkend="guc-autovacuum-freeze-max-age">, so
***************
*** 3971,3976 **** COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
--- 3991,4017 ----
       </listitem>
      </varlistentry>
+ 
+     <varlistentry
Re: [HACKERS] Visibility map and freezing
Heikki Linnakangas escribió: Alvaro, are you getting rid of pg_autovacuum in favor of the reloptions for 8.4? That's the intention, yes. This patch adds a new column to pg_autovacuum, reflecting the new vacuum_freeze_table_age GUC just like freeze_min_age column reflects vacuum_freeze_min_age. If I commit this tomorrow, will I cause you a lot of trouble with the reloptions patch? No -- go ahead, I'll merge it. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M when our wraparound limit is around 2B? Also, is anything being done about the concern about 'vacuum storm' explained below? --- Gregory Stark wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. Having seen how the anti-wraparound vacuums work in the field, I think merely replacing it with a regular vacuum which covers the whole table will not actually work well. What will happen is that, because nothing else is advancing the relfrozenxid, the age of the relfrozenxid for all tables will advance until they all hit autovacuum_max_freeze_age. Quite often all the tables were created around the same time, so they will all hit autovacuum_max_freeze_age at the same time. So a database which was operating fine and receiving regular vacuums at a reasonable pace will suddenly be hit by vacuums for every table all at the same time, 3 at a time. If you don't have vacuum_cost_delay set, that will cause a major issue. Even if you do have vacuum_cost_delay set, it will prevent the small busy tables from getting vacuumed regularly due to the backlog in anti-wraparound vacuums. Worse, vacuum will set the freeze_xid to nearly the same value for all of the tables. So it will all happen again in another 100M transactions. And again in another 100M transactions, and again... I think there are several things which need to happen here.

1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just means unnecessary full table vacuums long before they accomplish anything.

2) Include a factor which spreads out the anti-wraparound freezes in the autovacuum launcher. Some ideas:

 . we could implicitly add random(vacuum_freeze_min_age) to the autovacuum_max_freeze_age. That would spread them out evenly over 100M transactions.

 . we could check if another anti-wraparound vacuum is still running and implicitly add a vacuum_freeze_min_age penalty to the autovacuum_max_freeze_age for each running anti-wraparound vacuum. That would spread them out without introducing non-determinism, which seems better.

 . we could leave autovacuum_max_freeze_age alone and instead pick a semi-random vacuum_freeze_min_age. This would mean the first set of anti-wraparound vacuums would still be synchronized, but subsequent ones might be spread out somewhat. There's not as much room to randomize this, though, and it would affect how much i/o vacuum did, which makes it seem less palatable to me.

3) I also think we need to put a clamp on the vacuum_cost_delay. Too many people are setting it to unreasonably high values which result in their vacuums never completing. Actually, I think what we should do is junk all the existing parameters and replace them with a vacuum_nice_level or vacuum_bandwidth_cap from which we calculate the cost_limit and hide all the other parameters as internal parameters.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
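Greg's first idea above, adding random(vacuum_freeze_min_age) to autovacuum_max_freeze_age, can be sketched as follows. Everything here is illustrative: the function name is mine, and it uses a deterministic per-table seed (e.g. the table's OID) rather than random(), which keeps each table's offset stable across autovacuum runs while still spreading tables out over the freeze_min_age window.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the jitter idea discussed above, not
 * PostgreSQL code: derive a per-table anti-wraparound threshold by
 * adding an offset in [0, freeze_min_age) to max_freeze_age, so tables
 * created together stop hitting the threshold in the same transaction
 * window. */
static int64_t
effective_max_freeze_age(int64_t max_freeze_age,
                         int64_t freeze_min_age,
                         uint32_t table_seed)   /* e.g. the table's OID */
{
    int64_t jitter = (int64_t) (table_seed % (uint64_t) freeze_min_age);

    return max_freeze_age + jitter;
}
```

With the defaults discussed in the thread (max_freeze_age 200M, freeze_min_age 100M), each table's threshold lands somewhere in the 200M-300M range instead of all tables firing at exactly 200M.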
Re: [HACKERS] Visibility map, partial vacuums
Bruce Momjian wrote: Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M when our wraparound limit is around 2B? Presumably because of this (from the docs): The commit status uses two bits per transaction, so if autovacuum_freeze_max_age has its maximum allowed value of a little less than two billion, pg_clog can be expected to grow to about half a gigabyte. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
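The arithmetic behind the doc passage Andrew quotes checks out: commit status is two bits per transaction, so pg_clog needs freeze_max_age * 2 / 8 bytes. A minimal sketch (the helper name is mine, for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative check of the quoted docs: two bits of commit status per
 * transaction means pg_clog grows to freeze_max_age / 4 bytes. */
static int64_t
clog_bytes_for_age(int64_t freeze_max_age)
{
    return freeze_max_age * 2 / 8;      /* two bits = a quarter byte each */
}
```

At the maximum setting of about two billion transactions that is 500 MB, i.e. "about half a gigabyte" as the docs say; at the 200M default it is only 50 MB.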
Re: [HACKERS] Visibility map, partial vacuums
Andrew Dunstan wrote: Bruce Momjian wrote: Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M when our wraparound limit is around 2B? Presumably because of this (from the docs): The commit status uses two bits per transaction, so if autovacuum_freeze_max_age has its maximum allowed value of a little less than two billion, pg_clog can be expected to grow to about half a gigabyte. Oh, that's interesting; thanks. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Bruce Momjian br...@momjian.us writes: Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M when our wraparound limit is around 2B? I suggested raising it dramatically in the post you quote and Heikki pointed it controls the maximum amount of space the clog will take. Raising it to, say, 800M will mean up to 200MB of space which might be kind of annoying for a small database. It would be nice if we could ensure the clog got trimmed frequently enough on small databases that we could raise the max_age. It's really annoying to see all these vacuums running 10x more often than necessary. The rest of the thread is visible at the bottom of: http://article.gmane.org/gmane.comp.db.postgresql.devel.general/107525 Also, is anything being done about the concern about 'vacuum storm' explained below? I'm interested too. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map, partial vacuums
Gregory Stark wrote: Bruce Momjian br...@momjian.us writes: Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M when our wraparound limit is around 2B? I suggested raising it dramatically in the post you quote and Heikki pointed it controls the maximum amount of space the clog will take. Raising it to, say, 800M will mean up to 200MB of space which might be kind of annoying for a small database. It would be nice if we could ensure the clog got trimmed frequently enough on small databases that we could raise the max_age. It's really annoying to see all these vacuums running 10x more often than necessary. Well, if it's a small database, you might as well just vacuum it. The rest of the thread is visible at the bottom of: http://article.gmane.org/gmane.comp.db.postgresql.devel.general/107525 Also, is anything being done about the concern about 'vacuum storm' explained below? I'm interested too. The additional vacuum_freeze_table_age (as I'm now calling it) setting I discussed in a later thread should alleviate that somewhat. When a table is autovacuumed, the whole table is scanned to freeze tuples if it's older than vacuum_freeze_table_age, and relfrozenxid is advanced. When different tables reach the autovacuum threshold at different times, they will also have their relfrozenxids set to different values. And in fact no anti-wraparound vacuum is needed. That doesn't help with read-only or insert-only tables, but that's not a new problem. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Jeff Davis wrote: On Wed, 2009-01-07 at 09:34 +0200, Heikki Linnakangas wrote: autovacuum_freeze_max_age -> autovacuum_freeze_scan_age vacuum_freeze_max_age -> vacuum_freeze_scan_age vacuum_freeze_min_age -> vacuum_freeze_tuple_age *_scan_age settings control when the table is fully scanned to freeze tuples and advance relfrozenxid, and vacuum_freeze_tuple_age controls how old a tuple needs to be to be frozen. One objection is that you can read freeze_scan to mean that a scan is frozen, like a tuple is frozen. Any better ideas? I see what you mean about the possible misinterpretation, but I think it's a big improvement, and I don't have a better suggestion. Thinking about this some more, I'm not too happy with those names either. vacuum_freeze_scan_age and autovacuum_freeze_scan_age don't mean quite the same thing, like vacuum_cost_delay and autovacuum_vacuum_cost_delay do, for example. I'm now leaning towards: autovacuum_freeze_max_age vacuum_freeze_table_age vacuum_freeze_min_age where autovacuum_freeze_max_age and vacuum_freeze_min_age are unchanged, and vacuum_freeze_table_age is the new setting that controls when VACUUM or autovacuum should perform a full scan of the table to advance relfrozenxid. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
On Fri, 2009-01-09 at 13:49 +0200, Heikki Linnakangas wrote: Thinking about this some more, I'm not too happy with those names either. vacuum_freeze_scan_age and autovacuum_freeze_scan_age don't mean quite the same thing, like vacuum_cost_delay and autovacuum_vacuum_cost_delay do, for example. If the distinction you're making is that autovacuum_freeze_max_age affects the launching of a vacuum rather than the behavior of a vacuum, maybe we could incorporate the word launch like: autovacuum_launch_freeze_threshold I'm now leaning towards: autovacuum_freeze_max_age vacuum_freeze_table_age vacuum_freeze_min_age where autovacuum_freeze_max_age and vacuum_freeze_min_age are unchanged, and vacuum_freeze_table_age is the new setting that controls when VACUUM or autovacuum should perform a full scan of the table to advance relfrozenxid. I'm still bothered by the fact that max and min really mean the same thing here. I don't think we can perfectly capture the meaning of these GUCs in the name. I think our goal should be to avoid confusion between them. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
On Wed, 2009-01-07 at 09:34 +0200, Heikki Linnakangas wrote: autovacuum_freeze_max_age -> autovacuum_freeze_scan_age vacuum_freeze_max_age -> vacuum_freeze_scan_age vacuum_freeze_min_age -> vacuum_freeze_tuple_age *_scan_age settings control when the table is fully scanned to freeze tuples and advance relfrozenxid, and vacuum_freeze_tuple_age controls how old a tuple needs to be to be frozen. One objection is that you can read freeze_scan to mean that a scan is frozen, like a tuple is frozen. Any better ideas? I see what you mean about the possible misinterpretation, but I think it's a big improvement, and I don't have a better suggestion. Thanks, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Jeff Davis wrote:
> The proposal itself makes sense, but I think we need to decide on some better names. Right now the meanings of autovacuum_freeze_max_age and vacuum_freeze_min_age are incompatible, so we're not in a good position, but there has to be something less confusing. For one thing, there isn't even a common definition of min or max between them. They both trigger an event (freezing or vacuuming) when something (tuple xmin or relfrozenxid) exceeds some number. Why is one called a min and the other a max?

Yeah, you're right. Fujii's point that it's confusing that you can have a min greater than max is also valid. How about

autovacuum_freeze_max_age -> autovacuum_freeze_scan_age
vacuum_freeze_max_age -> vacuum_freeze_scan_age
vacuum_freeze_min_age -> vacuum_freeze_tuple_age

*_scan_age settings control when the table is fully scanned to freeze tuples and advance relfrozenxid, and vacuum_freeze_tuple_age controls how old a tuple needs to be to be frozen. One objection is that you can read freeze_scan to mean that a scan is frozen, like a tuple is frozen. Any better ideas?
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map and freezing
On Mon, 2008-12-22 at 21:24 +0200, Heikki Linnakangas wrote: Introduce new vacuum_freeze_max_age setting. Manual VACUUM will scan the whole table and advance relfrozenxid, if relfrozenxid is older than vacuum_freeze_max_age. It's confusing to have two GUCs named vacuum_freeze_min_age and vacuum_freeze_max_age with incompatible definitions. The former is the minimum transaction age of a tuple found during the scan of a table, while the latter is the maximum transaction age of the relfrozenxid of the table. If you set vacuum_freeze_max_age to 0, the visibility map is not used to skip pages, so you'll get the pre-8.4 old behavior. Seems like a strange way to turn off visibility maps, and the meaning doesn't seem to fit with either vacuum_freeze_min_age or autovacuum_freeze_max_age. The proposal itself makes sense, but I think we need to decide on some better names. Right now the meanings of autovacuum_freeze_max_age and vacuum_freeze_min_age are incompatible, so we're not in a good position, but there has to be something less confusing. For one thing, there isn't even a common definition of min or max between them. They both trigger an event (freezing or vacuuming) when something (tuple xmin or relfrozenxid) exceeds some number. Why is one called a min and the other a max? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
Heikki Linnakangas wrote:
> So they need to be separate settings, with some space between them by default.

Attached is a proposed patch to handle freezing. In a nutshell:

Introduce new vacuum_freeze_max_age setting. Manual VACUUM will scan the whole table and advance relfrozenxid, if relfrozenxid is older than vacuum_freeze_max_age.

If vacuum_freeze_max_age is greater than autovacuum_freeze_max_age - 1 million transactions, it's effectively capped at that value. It doesn't make sense to have vacuum_freeze_max_age > autovacuum_freeze_max_age, because the manual VACUUM wouldn't have a chance to do the full sweep before the anti-wraparound autovacuum is launched. The minus one million transactions is to give some headroom.

I'm lowering vacuum_freeze_min_age from 100 million transactions to 50 million, so that the whole-table vacuum doesn't need to run as often. Note that since VACUUM normally only scans pages that need vacuuming according to the visibility map, tuples on skipped pages are not frozen any earlier even though vacuum_freeze_min_age is lower.
To recap, here's the new defaults:

autovacuum_freeze_max_age  200000000
vacuum_freeze_max_age      150000000
vacuum_freeze_min_age       50000000

This means that with defaults, autovacuum will launch a whole-table vacuum every 150 million transactions (autovacuum_freeze_max_age - vacuum_freeze_min_age). If you run a nightly manual VACUUM, you'll get a whole-table vacuum every 100 million transactions.

vacuum_freeze_max_age also affects autovacuums. If an autovacuum is launched on a table to remove dead tuples, and vacuum_freeze_max_age has been reached (but not yet autovacuum_freeze_max_age), the autovacuum will scan the whole table. I'm not sure if this is desirable, to avoid having to launch separate anti-wraparound autovacuums even when there's not many dead tuples, or just confusing.

If you set vacuum_freeze_max_age to 0, the visibility map is not used to skip pages, so you'll get the pre-8.4 old behavior.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7493ca9..9848ce0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3925,6 +3925,26 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
      </listitem>
     </varlistentry>
 
+    <varlistentry id="guc-vacuum-freeze-max-age" xreflabel="vacuum_freeze_max_age">
+     <term><varname>vacuum_freeze_max_age</varname> (<type>integer</type>)</term>
+     <indexterm>
+      <primary><varname>vacuum_freeze_max_age</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <command>VACUUM</> performs a whole-table scan if the table's
+       <structname>pg_class</>.<structfield>relfrozenxid</> field reaches the
+       age specified by this setting. The default is 150 million
+       transactions. Although users can set this value anywhere from zero to
+       one billion, <command>VACUUM</> will silently limit the effective value
+       to the value of <xref linkend="guc-autovacuum-freeze-max-age"> minus
+       1 million transactions, so that regular manual <command>VACUUM</> has a
+       chance to run before autovacuum is launched to prevent XID wraparound.
+       For more information see <xref linkend="vacuum-for-wraparound">.
+      </para>
+     </listitem>
+    </varlistentry>
+
     <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
      <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)</term>
      <indexterm>
@@ -3935,7 +3955,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        Specifies the cutoff age (in transactions) that <command>VACUUM</>
        should use to decide whether to replace transaction IDs with
        <literal>FrozenXID</> while scanning a table.
-       The default is 100 million transactions. Although
+       The default is 50 million transactions. Although
        users can set this value anywhere from zero to one billion,
        <command>VACUUM</> will silently limit the effective value to half
        the value of <xref linkend="guc-autovacuum-freeze-max-age">, so
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 03ce2e9..c41d464 100644
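The capping rule described in the patch summary (vacuum_freeze_max_age silently limited to autovacuum_freeze_max_age minus 1 million transactions) amounts to a single min(); this is an illustrative sketch with invented names, not code from the patch:

```python
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000  # proposed default
VACUUM_FREEZE_MAX_AGE = 150_000_000      # proposed default

def effective_freeze_max_age(vacuum_freeze_max_age: int,
                             autovacuum_freeze_max_age: int) -> int:
    """Silently cap vacuum_freeze_max_age so that a regular manual VACUUM
    always gets a chance to do the full sweep before the anti-wraparound
    autovacuum is launched."""
    headroom = 1_000_000  # the "minus one million transactions" of headroom
    return min(vacuum_freeze_max_age, autovacuum_freeze_max_age - headroom)

# With the proposed defaults, the setting is left alone:
effective_freeze_max_age(VACUUM_FREEZE_MAX_AGE, AUTOVACUUM_FREEZE_MAX_AGE)  # 150_000_000
# A value above autovacuum_freeze_max_age is silently pulled down:
effective_freeze_max_age(500_000_000, AUTOVACUUM_FREEZE_MAX_AGE)  # 199_000_000
```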
Re: [HACKERS] Visibility map and freezing
On Tue, Dec 23, 2008 at 4:24 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> Introduce new vacuum_freeze_max_age setting. Manual VACUUM will scan the whole table and advance relfrozenxid, if relfrozenxid is older than vacuum_freeze_max_age.
> [...]
> If you set vacuum_freeze_max_age to 0, the visibility map is not used to skip pages, so you'll get the pre-8.4 old behavior.

It seems to be strange that max can be less than min. Is it worth dividing a parameter into two (min/max)?

Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Visibility map and freezing
Heikki Linnakangas wrote:
> I think we need a threshold similar to autovacuum_freeze_max_age for manual vacuums as well: vacuum_freeze_max_age. If you run VACUUM, and relfrozenxid is older than vacuum_freeze_max_age, the visibility map is ignored and all pages are scanned.

Would one parameter to control both suffice? (i.e., rename autovacuum_freeze_max_age to vacuum_freeze_max_age)
Re: [HACKERS] Visibility map and freezing
Peter Eisentraut wrote:
> Heikki Linnakangas wrote:
>> I think we need a threshold similar to autovacuum_freeze_max_age for manual vacuums as well: vacuum_freeze_max_age. If you run VACUUM, and relfrozenxid is older than vacuum_freeze_max_age, the visibility map is ignored and all pages are scanned.
> Would one parameter to control both suffice? (i.e., rename autovacuum_freeze_max_age to vacuum_freeze_max_age)

Imagine that you run a nightly VACUUM from cron, and have autovacuum disabled. If autovacuum_freeze_max_age is the same as vacuum_freeze_max_age, as soon as that age is reached, an anti-wraparound autovacuum is launched. What we'd want to happen is for the next nightly VACUUM to do the work instead. So they need to be separate settings, with some space between them by default.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] visibility map and reltuples
Tom Lane wrote:
> I think your previous sketch is right: suppress update of reltuples (and relpages) from a partial vacuum scan, and ensure that the analyze phase is allowed to do it instead if it happens during VACUUM ANALYZE.

We also mustn't reset n_live_tuples in pgstat in partial vacuum. Committed a patch to do that. Thanks for the report!

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
[HACKERS] Visibility map and freezing
The way VACUUM works with the visibility map is that if any pages are skipped, relfrozenxid can't be updated. That means that plain VACUUM won't advance relfrozenxid, and doesn't protect from XID wraparound. We discussed this in the context of autovacuum before, and we have that covered now. Autovacuum will launch a full-scanning vacuum that advances relfrozenxid, when autovacuum_freeze_max_age is reached, and partial vacuums otherwise. Autovacuum will launch anti-wraparound vacuum even if it's otherwise disabled. Which is good, but it'll be an unpleasant surprise for someone who performs a simple manual database-wide VACUUM, for example, every night from a cron job. You could run VACUUM FREEZE, say monthly, to force a full-scanning vacuum, but that's unnecessarily aggressive, and you need to know about the issue to set that up in the first place. I think we need a threshold similar to autovacuum_freeze_max_age for manual vacuums as well: vacuum_freeze_max_age. If you run VACUUM, and relfrozenxid is older than vacuum_freeze_max_age, the visibility map is ignored and all pages are scanned. This ensures that you don't run into forced anti-wraparound autovacuums if you do your VACUUMs manually. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
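The threshold proposed above comes down to a single comparison at the start of a manual VACUUM; a minimal sketch, with names invented for illustration:

```python
def use_visibility_map(relfrozenxid_age: int, vacuum_freeze_max_age: int) -> bool:
    """Decide whether a manual VACUUM may skip pages using the visibility
    map. Once the table's relfrozenxid is older than vacuum_freeze_max_age,
    the map is ignored and every page is scanned, so relfrozenxid can be
    advanced."""
    return relfrozenxid_age < vacuum_freeze_max_age

use_visibility_map(40_000_000, 150_000_000)   # True: partial vacuum is fine
use_visibility_map(160_000_000, 150_000_000)  # False: whole-table scan
```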
Re: [HACKERS] Visibility map and freezing
Zdenek Kotala zdenek.kot...@sun.com writes:
> What's about add second bit which mark frozen page (all tuples have freeze XID)? It should avoid full scan, but extend size of map.

That would only really work if you have a very static table where entire pages get frozen and stay frozen long before the freeze_max_age is reached. Even that wouldn't really work because the partial vacuums would never see those pages.

One option is to keep a frozenxid per page, which would allow us to visit only pages that need freezing.

A more complex scheme would be to have a bit which indicates that all non-frozen xids are > relfrozenxid+100M. When we find all the bits set we can clear them all and bump relfrozenxid by 100M. This would allow regular partial vacuums to gradually move the frozenxid forward.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
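The gradual-advancement scheme in the last paragraph can be modeled as a toy simulation; the page representation and helper names here are assumptions made up for illustration, not anything from the PostgreSQL source:

```python
STEP = 100_000_000  # the 100M increment from the sketch above

def page_bit(min_unfrozen_xid: int, relfrozenxid: int) -> bool:
    """A partial vacuum visiting a page sets its bit if every non-frozen
    xid on the page is already past relfrozenxid + STEP."""
    return min_unfrozen_xid > relfrozenxid + STEP

def try_advance(relfrozenxid: int, bits: list) -> tuple:
    """Once all bits are set, clear them and bump relfrozenxid by STEP."""
    if bits and all(bits):
        return relfrozenxid + STEP, [False] * len(bits)
    return relfrozenxid, bits

# Three pages whose oldest non-frozen xids are all past relfrozenxid + 100M:
bits = [page_bit(x, 0) for x in (150_000_000, 220_000_000, 300_000_000)]
relfrozenxid, bits = try_advance(0, bits)  # relfrozenxid moves to 100_000_000
```

Regular partial vacuums keep setting bits as they go, so relfrozenxid inches forward without ever needing one big full-table freeze pass.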
Re: [HACKERS] Visibility map and freezing
Zdenek Kotala wrote: What's about add second bit which mark frozen page (all tuples have freeze XID)? It should avoid full scan, but extend size of map. First of all, we'd still have to make the decision of when to scan pages that need freezing, and when to only scan pages that have dead tuples. Secondly, if it's just one bit, we'd have to freeze all tuples on the page to set the bit, which is a lot more aggressive than we do now. Thirdly, those frozen bits would be critical, not just hints like the visibility map is right now. Corrupt frozen bits would lead to data loss. That means we'd have to solve the problem of how to make sure the bits are always cleared when tuples are updated on a page. That's not completely crash-proof at the moment, see comments in visibilitymap.c. Even if we did solve that, I'd rather live with a hints-only visibility map for one release, before we take the plunge and make it a critical part. Just in case... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Visibility map and freezing
What's about add second bit which mark frozen page (all tuples have freeze XID)? It should avoid full scan, but extend size of map.

Zdenek

Heikki Linnakangas napsal(a):
> I think we need a threshold similar to autovacuum_freeze_max_age for manual vacuums as well: vacuum_freeze_max_age. If you run VACUUM, and relfrozenxid is older than vacuum_freeze_max_age, the visibility map is ignored and all pages are scanned. This ensures that you don't run into forced anti-wraparound autovacuums if you do your VACUUMs manually.
> [...]
Re: [HACKERS] visibility map and reltuples
Heikki Linnakangas wrote: Ned T. Crigler wrote: It appears that the visibility map patch is causing pg_class.reltuples to be set improperly after a vacuum. For example, it is set to 0 if the map indicated that no pages in the heap needed to be scanned. Perhaps reltuples should not be updated unless every page was scanned during the vacuum? Yeah, vacuum shouldn't overwrite reltuples if it hasn't scanned all pages. Because we use reltuples divided by relpages in the planner, we probably shouldn't update relpages either if we don't update reltuples. Otherwise, if the table has grown a lot since we last updated reltuples, the reltuples / relpages ratio would be less, not more, accurate, if relpages is updated to a new higher value but reltuples is not. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
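The point about the reltuples / relpages ratio can be made concrete with a toy calculation (all numbers invented for illustration):

```python
def estimated_rows(reltuples: float, relpages: int, current_pages: int) -> float:
    """Planner-style estimate: stored density (reltuples / relpages),
    scaled to the table's current physical size."""
    return reltuples / relpages * current_pages

# Stale but mutually consistent stats: 1,000,000 rows in 10,000 pages.
# The table has since doubled to 20,000 pages at the same fill factor:
estimated_rows(1_000_000, 10_000, 20_000)  # 2_000_000.0 - still accurate

# If relpages were updated to 20,000 but reltuples left alone, the implied
# density is halved and the estimate degrades:
estimated_rows(1_000_000, 20_000, 20_000)  # 1_000_000.0 - off by half
```

This is why the message above argues for updating relpages and reltuples together or not at all: a stale-but-consistent ratio beats a half-updated one.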
Re: [HACKERS] visibility map and reltuples
I wonder if we should switch to keeping reltuplesperpage instead. Then a partial vacuum could update it by taking the average number of tuples per page for the pages it saw, perhaps adjusting it to the weighted average between the old value and the new value based on how many pages were seen.

I suppose there's no reason we can't update reltuples using that same logic, though it would be a bit opaque.

--
Greg

On 15 Dec 2008, at 04:01, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> Yeah, vacuum shouldn't overwrite reltuples if it hasn't scanned all pages. Because we use reltuples divided by relpages in the planner, we probably shouldn't update relpages either if we don't update reltuples. [...]
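One possible reading of the weighted-average idea, as a hedged sketch (the function name and the exact weighting are assumptions, not something spelled out in the thread):

```python
def blended_density(old_density: float, total_pages: int,
                    scanned_pages: int, tuples_seen: int) -> float:
    """Blend the stored tuples-per-page figure with the density observed
    on the pages this partial vacuum actually visited, weighted by how
    much of the table was seen."""
    observed = tuples_seen / scanned_pages
    weight = scanned_pages / total_pages
    return old_density * (1 - weight) + observed * weight

# Saw a quarter of the table, at 80 tuples/page against a stored 100:
blended_density(100.0, 1000, 250, 20_000)  # 95.0
```

As the follow-up replies note, the pages a partial vacuum visits are not a random sample of the table, so any such blending inherits that sampling bias.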
Re: [HACKERS] visibility map and reltuples
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
> Greg Stark wrote:
>> I wonder if we should switch to keeping reltuplesperpage instead. Then a partial vacuum could update it by taking the average number of tuples per page for the pages it saw, perhaps adjusting it to the weighted average between the old value and the new value based on how many pages were seen.
> The pages scanned by a partial vacuum aren't a random sample of pages in the table. That would bias the reltuplesperpage value towards those pages that are updated more.

Yeah ... and it's highly likely that repeatedly-updated pages would have more dead space than never-updated ones, so there'd be a systematic creep towards underestimation of the total tuple count.

I think your previous sketch is right: suppress update of reltuples (and relpages) from a partial vacuum scan, and ensure that the analyze phase is allowed to do it instead if it happens during VACUUM ANALYZE.

regards, tom lane
Re: [HACKERS] visibility map and reltuples
Greg Stark wrote:
> I wonder if we should switch to keeping reltuplesperpage instead. Then a partial vacuum could update it by taking the average number of tuples per page for the pages it saw, perhaps adjusting it to the weighted average between the old value and the new value based on how many pages were seen.

The pages scanned by a partial vacuum aren't a random sample of pages in the table. That would bias the reltuplesperpage value towards those pages that are updated more.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] visibility map and reltuples
On Mon, 15 Dec 2008, Greg Stark wrote:
> I wonder if we should switch to keeping reltuplesperpage instead.

It would be preferable to not touch the user side of reltuples if possible, since it's the only instant way to get a good estimate of the number of rows in a table right now. That's been a regular application technique for at least two years now, since http://www.varlena.com/GeneralBits/120.php popularized it.

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD
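The instant-estimate technique Greg refers to is a direct read of pg_class; a sketch of how applications typically use it (the DB-API wrapper here is illustrative, not code from the cited article):

```python
# Row-count estimate without a slow COUNT(*): read the planner statistics.
ESTIMATE_SQL = (
    "SELECT reltuples::bigint AS estimated_rows "
    "FROM pg_class WHERE relname = %s"
)

def estimate_rows(cursor, table_name: str) -> int:
    """Run the pg_class lookup through any DB-API cursor (e.g. psycopg2).
    Returns the statistics-based estimate, not an exact count, and is only
    as fresh as the last VACUUM/ANALYZE that updated reltuples."""
    cursor.execute(ESTIMATE_SQL, (table_name,))
    return cursor.fetchone()[0]
```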
Re: [HACKERS] visibility map and reltuples
Ned T. Crigler wrote: It appears that the visibility map patch is causing pg_class.reltuples to be set improperly after a vacuum. For example, it is set to 0 if the map indicated that no pages in the heap needed to be scanned. Perhaps reltuples should not be updated unless every page was scanned during the vacuum? Yeah, vacuum shouldn't overwrite reltuples if it hasn't scanned all pages. The interplay of vacuum and analyze in VACUUM ANALYZE needs to be changed too. Currently, the analyze after vacuum doesn't overwrite reltuples, because the one calculated by vacuum is based on scanning all pages, and is thus more accurate than the one estimated from the sample (which is not true anymore, as you pointed out). I think the vacuum needs to somehow tell analyze whether it updated reltuples or not, so that analyze can update reltuples if the vacuum didn't scan all pages. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] visibility map and reltuples
It appears that the visibility map patch is causing pg_class.reltuples to be set improperly after a vacuum. For example, it is set to 0 if the map indicated that no pages in the heap needed to be scanned. Perhaps reltuples should not be updated unless every page was scanned during the vacuum?

--
Ned T. Crigler
[HACKERS] visibility map - what do i miss?
--- repost to hackers as suggested by RhodiumToad ---

hi,
i tried to test new visibility map feature. to do so i:
1. fetched postgresql sources from cvs
2. compiled
3. turned autovacuum off
4. started pg
5. ran this queries:
- CREATE TABLE test_1 (i INT4);
- CREATE TABLE test_2 (i INT4);
- CREATE TABLE test_3 (i INT4);
- CREATE TABLE test_4 (i INT4);
- INSERT INTO test_1 SELECT generate_series(1, 100000000);
- INSERT INTO test_2 SELECT generate_series(1, 100000000);
- INSERT INTO test_3 SELECT generate_series(1, 100000000);
- INSERT INTO test_4 SELECT generate_series(1, 100000000);
- UPDATE test_2 SET i = i + 1 WHERE i < 10000000;
- UPDATE test_3 SET i = i + 1 WHERE i < 50000000;
- UPDATE test_4 SET i = i + 1 WHERE i < 90000000;
- VACUUM test_1;
- VACUUM test_2;
- VACUUM test_3;
- VACUUM test_4;

I did it 2 times, first with sources of pg from 1st of november, and second - with head from yesterday evening (warsaw, poland time). results puzzled me.

First run - without visibility maps, timing of vacuums:
Time: 267844.822 ms
Time: 138854.592 ms
Time: 305467.950 ms
Time: 487133.179 ms

Second run - on head:
Time: 252218.609 ms
Time: 234388.763 ms
Time: 334016.413 ms
Time: 575698.750 ms

Now - as I understand the change - visibility maps should make second run much faster?

Tests were performed on laptop. During first test I used it to browse the web, read mail. During second test - nobody used the laptop.
Relation forms seem to exist: # select oid from pg_database where datname = 'depesz'; oid --- 16389 (1 row) # select relfilenode from pg_class where relname ~ 'test_'; relfilenode - 26756 26759 26762 26765 (4 rows) = ls -l {26756,26759,26762,26765}* -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:31 26756 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:33 26756.1 -rw--- 1 pgdba pgdba 1065066496 2008-12-06 01:34 26756.2 -rw--- 1 pgdba pgdba 811008 2008-12-06 01:34 26756_fsm -rw--- 1 pgdba pgdba 57344 2008-12-06 01:34 26756_vm -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:35 26759 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:36 26759.1 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:37 26759.2 -rw--- 1 pgdba pgdba 312582144 2008-12-06 01:39 26759.3 -rw--- 1 pgdba pgdba 892928 2008-12-06 01:39 26759_fsm -rw--- 1 pgdba pgdba 57344 2008-12-06 01:39 26759_vm -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:39 26762 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:49 26762.1 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:41 26762.2 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:42 26762.3 -rw--- 1 pgdba pgdba 523862016 2008-12-06 01:43 26762.4 -rw--- 1 pgdba pgdba1204224 2008-12-06 01:43 26762_fsm -rw--- 1 pgdba pgdba 81920 2008-12-06 01:53 26762_vm -rw--- 1 pgdba pgdba 1073741824 2008-12-06 02:01 26765 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 02:08 26765.1 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 02:18 26765.2 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:50 26765.3 -rw--- 1 pgdba pgdba 1073741824 2008-12-06 01:51 26765.4 -rw--- 1 pgdba pgdba 735141888 2008-12-06 02:00 26765.5 -rw--- 1 pgdba pgdba1523712 2008-12-06 02:00 26765_fsm -rw--- 1 pgdba pgdba 98304 2008-12-06 02:18 26765_vm What do I miss? 
Best regards,
depesz

-- Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/ jid/gtalk: [EMAIL PROTECTED] / aim:depeszhdl / skype:depesz_hdl / gg:6749007
Re: [HACKERS] visibility map - what do i miss?
On Sat, Dec 6, 2008 at 12:53 PM, hubert depesz lubaczewski [EMAIL PROTECTED] wrote:
> Now - as I understand the change - visibility maps should make second run much faster?

If I understand correctly what Heikki explained, not if you run VACUUM only once (and you confirmed me on IRC you run it only once). It's the VACUUM which sets the PD_ALL_VISIBLE flag on the pages, so the first VACUUM should be slower with HEAD than with 8.3 as it has far more work to do. The second VACUUM should then be faster.

--
Guillaume
Re: [HACKERS] visibility map - what do i miss?
Guillaume Smet wrote:
> It's the VACUUM which sets the PD_ALL_VISIBLE flag on the pages, so the first VACUUM should be slower with HEAD than with 8.3 as it has far more work to do. The second VACUUM should then be faster.

That diagnosis is not quite right, but the prognosis is correct. The first VACUUM after the UPDATEs doesn't set the PD_ALL_VISIBLE flags, because there are still dead tuples on the pages. The dead tuples are removed in the 2nd pass of the first vacuum, but it doesn't try to set the PD_ALL_VISIBLE flags; that's only done in the first phase. The second vacuum is just as slow as the first one, because the visibility map doesn't have any bits set yet. The second vacuum will set the bits, though, so the *third* vacuum should go faster.

So setting the PD_ALL_VISIBLE flags doesn't slow things down. That should be just a tiny bit of extra CPU work per vacuumed page, not something that would show up in performance tests.

This is the 1st issue I mentioned in this mail: http://archives.postgresql.org/message-id/[EMAIL PROTECTED] There were some suggestions in that thread, but none has been implemented.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
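The sequence Heikki describes - the first vacuum can't set bits, the second sets them, and only the third can skip pages - can be sketched as a toy model (this is not the real vacuumlazy.c logic, just an illustration of the ordering):

```python
class Page:
    def __init__(self, has_dead_tuples: bool):
        self.has_dead_tuples = has_dead_tuples
        self.all_visible = False  # the visibility map bit for this page

def vacuum(pages: list) -> int:
    """Return the number of pages scanned. Pass 1 visits every page whose
    VM bit is unset and may only set the bit when the page has no dead
    tuples. Pass 2 removes the dead tuples found, without setting bits."""
    scanned = 0
    for page in pages:
        if page.all_visible:
            continue  # skipped thanks to the visibility map
        scanned += 1
        if not page.has_dead_tuples:
            page.all_visible = True
    for page in pages:          # second pass: dead tuples are gone,
        page.has_dead_tuples = False  # but no bits are set here
    return scanned

table = [Page(has_dead_tuples=True) for _ in range(100)]
first = vacuum(table)   # scans all 100, sets no bits (dead tuples present)
second = vacuum(table)  # scans all 100 again, now sets the bits
third = vacuum(table)   # skips every page
```

This matches the observed timings in the thread: two full-cost vacuums before the visibility map starts paying off.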
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas wrote: Here's an updated version, with a lot of smaller cleanups, and using relcache invalidation to notify other backends when the visibility map fork is extended. I already committed the change to FSM to do the same. I'm feeling quite satisfied to commit this patch early next week. Committed. I haven't done any doc changes for this yet. I think a short section in the database internal storage chapter is probably in order, and the fact that plain VACUUM skips pages should be mentioned somewhere. I'll skim through references to vacuum and see what needs to be changed. Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. Having seen how the anti-wraparound vacuums work in the field I think merely replacing it with a regular vacuum which covers the whole table will not actually work well. What will happen is that, because nothing else is advancing the relfrozenxid, the age of the relfrozenxid for all tables will advance until they all hit autovacuum_max_freeze_age. Quite often all the tables were created around the same time so they will all hit autovacuum_max_freeze_age at the same time. So a database which was operating fine and receiving regular vacuums at a reasonable pace will suddenly be hit by vacuums for every table all at the same time, 3 at a time. If you don't have vacuum_cost_delay set that will cause a major issue. Even if you do have vacuum_cost_delay set it will prevent the small busy tables from getting vacuumed regularly due to the backlog in anti-wraparound vacuums. Worse, vacuum will set the freeze_xid to nearly the same value for all of the tables. So it will all happen again in another 100M transactions. And again in another 100M transactions, and again... I think there are several things which need to happen here. 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just means unnecessary full table vacuums long before they accomplish anything. 2) Include a factor which spreads out the anti-wraparound freezes in the autovacuum launcher. Some ideas: . we could implicitly add random(vacuum_freeze_min_age) to the autovacuum_max_freeze_age. That would spread them out evenly over 100M transactions. . 
we could check if another anti-wraparound vacuum is still running and implicitly add a vacuum_freeze_min_age penalty to the autovacuum_max_freeze_age for each running anti-wraparound vacuum. That would spread them out without introducing non-determinism, which seems better. . we could leave autovacuum_max_freeze_age and instead pick a semi-random vacuum_freeze_min_age. This would mean the first set of anti-wraparound vacuums would still be synchronized but subsequent ones might be spread out somewhat. There's not as much room to randomize this though, and it would affect how much i/o vacuum did, which makes it seem less palatable to me. 3) I also think we need to put a clamp on the vacuum_cost_delay. Too many people are setting it to unreasonably high values which results in their vacuums never completing. Actually I think what we should do is junk all the existing parameters and replace them with a vacuum_nice_level or vacuum_bandwidth_cap from which we calculate the cost_limit and hide all the other parameters as internal parameters. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas wrote: Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. Heh :-) Yes, this should be handled sanely, without having to invoke FREEZE. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Visibility map, partial vacuums
Gregory Stark wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. Having seen how the anti-wraparound vacuums work in the field I think merely replacing it with a regular vacuum which covers the whole table will not actually work well. What will happen is that, because nothing else is advancing the relfrozenxid, the age of the relfrozenxid for all tables will advance until they all hit autovacuum_max_freeze_age. Quite often all the tables were created around the same time so they will all hit autovacuum_max_freeze_age at the same time. So a database which was operating fine and receiving regular vacuums at a reasonable pace will suddenly be hit by vacuums for every table all at the same time, 3 at a time. If you don't have vacuum_cost_delay set that will cause a major issue. Even if you do have vacuum_cost_delay set it will prevent the small busy tables from getting vacuumed regularly due to the backlog in anti-wraparound vacuums. Worse, vacuum will set the freeze_xid to nearly the same value for all of the tables. So it will all happen again in another 100M transactions. And again in another 100M transactions, and again... I think there are several things which need to happen here. 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just means unnecessary full table vacuums long before they accomplish anything. 2) Include a factor which spreads out the anti-wraparound freezes in the autovacuum launcher. Some ideas: . we could implicitly add random(vacuum_freeze_min_age) to the autovacuum_max_freeze_age. That would spread them out evenly over 100M transactions. . 
we could check if another anti-wraparound vacuum is still running and implicitly add a vacuum_freeze_min_age penalty to the autovacuum_max_freeze_age for each running anti-wraparound vacuum. That would spread them out without introducing non-determinism, which seems better. . we could leave autovacuum_max_freeze_age and instead pick a semi-random vacuum_freeze_min_age. This would mean the first set of anti-wraparound vacuums would still be synchronized but subsequent ones might be spread out somewhat. There's not as much room to randomize this though, and it would affect how much i/o vacuum did, which makes it seem less palatable to me. How about a way to say that only one (or a config parameter for n) of the autovac workers can be used for anti-wraparound vacuum? Then the other slots would still be available for the small-but-frequently-updated tables. 3) I also think we need to put a clamp on the vacuum_cost_delay. Too many people are setting it to unreasonably high values which results in their vacuums never completing. Actually I think what we should do is junk all the existing parameters and replace them with a vacuum_nice_level or vacuum_bandwidth_cap from which we calculate the cost_limit and hide all the other parameters as internal parameters. It would certainly be helpful if it was just a single parameter - the arbitrariness of the parameters there now makes them pretty hard to set properly - or at least easy to set wrong. //Magnus
Re: [HACKERS] Visibility map, partial vacuums
Gregory Stark wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. FWIW, it seems the omission is actually the other way 'round. Autovacuum always forces a full-scanning vacuum, making the visibility map useless for autovacuum. This obviously needs to be fixed. What will happen is that, because nothing else is advancing the relfrozenxid, the age of the relfrozenxid for all tables will advance until they all hit autovacuum_max_freeze_age. Quite often all the tables were created around the same time so they will all hit autovacuum_max_freeze_age at the same time. So a database which was operating fine and receiving regular vacuums at a reasonable pace will suddenly be hit by vacuums for every table all at the same time, 3 at a time. If you don't have vacuum_cost_delay set that will cause a major issue. Even if you do have vacuum_cost_delay set it will prevent the small busy tables from getting vacuumed regularly due to the backlog in anti-wraparound vacuums. Worse, vacuum will set the freeze_xid to nearly the same value for all of the tables. So it will all happen again in another 100M transactions. And again in another 100M transactions, and again... But we already have that problem, don't we? When you initially load your database, all tuples will have the same xmin, and all tables will have more or less the same relfrozenxid. I guess you can argue that it becomes more obvious if vacuums are otherwise cheaper, but I don't think the visibility map makes that much difference to suddenly make this issue urgent. Agreed that it would be nice to do something about it, though. I think there are several things which need to happen here. 
1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just means unnecessary full table vacuums long before they accomplish anything. It allows you to truncate clog. If I did my math right, 200M transactions amounts to ~50MB of clog. Perhaps we should still raise it, disk space is cheap after all. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: Gregory Stark wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. FWIW, it seems the omission is actually the other way 'round. Autovacuum always forces a full-scanning vacuum, making the visibility map useless for autovacuum. This obviously needs to be fixed. How does it do that? Is there some option in the VacStmt to control this? Do we just need a syntax to set that option? How easy is it to tell what percentage of the table needs to be vacuumed? If it's 50% perhaps it would make sense to scan the whole table? (Hm. Not really if it's a contiguous 50% though...) Another idea: Perhaps each page of the visibility map should have a frozenxid (or multiple frozenxids?). Then if an individual page of the visibility map is old we could force scanning all the heap pages covered by that map page and update it. I'm not sure we can do that safely though without locking issues -- or is it ok because it's vacuum doing the updating? Worse, vacuum will set the freeze_xid to nearly the same value for all of the tables. So it will all happen again in another 100M transactions. And again in another 100M transactions, and again... But we already have that problem, don't we? When you initially load your database, all tuples will have the same xmin, and all tables will have more or less the same relfrozenxid. I guess you can argue that it becomes more obvious if vacuums are otherwise cheaper, but I don't think the visibility map makes that much difference to suddenly make this issue urgent. 
We already have that problem but it only bites in a specific case: if you have no other vacuums being triggered by the regular dead tuple scale factor. The normal case is intended to be that autovacuum triggers much more frequently than every 100M transactions to reduce bloat. However in practice this specific case does seem to arise alarmingly easily. Most databases do have some large tables which are never deleted from or updated. Also, the default scale factor of 20% is actually quite easy to never reach if your tables are also growing quickly -- effectively moving the goalposts further out as fast as the updates and deletes bloat the table. The visibility map essentially widens this specific use case to cover *all* tables. Since the relfrozenxid would never get advanced by regular vacuums, the only time it would get advanced is when they all hit the 200M wall simultaneously. Agreed that it would be nice to do something about it, though. I think there are several things which need to happen here. 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just means unnecessary full table vacuums long before they accomplish anything. It allows you to truncate clog. If I did my math right, 200M transactions amounts to ~50MB of clog. Perhaps we should still raise it, disk space is cheap after all. Ah. Hm. Then perhaps this belongs in the realm of the config generator people are working on. They'll need a dial to say how much disk space you expect your database to take in addition to how much memory your machine has available. 50M is nothing for a 1TB database but it's kind of silly to have to keep hundreds of megs of clogs on a 1MB database. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Re: [HACKERS] Visibility map, partial vacuums
Gregory Stark wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Gregory Stark wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. FWIW, it seems the omission is actually the other way 'round. Autovacuum always forces a full-scanning vacuum, making the visibility map useless for autovacuum. This obviously needs to be fixed. How does it do that? Is there some option in the VacStmt to control this? Do we just need a syntax to set that option? The way it works now is that if VacuumStmt->freeze_min_age is not -1 (which means use the default), the visibility map is not used and the whole table is scanned. Autovacuum always sets freeze_min_age, so it's never using the visibility map. Attached is a patch I'm considering to fix that. How easy is it to tell what percentage of the table needs to be vacuumed? If it's 50% perhaps it would make sense to scan the whole table? (Hm. Not really if it's a contiguous 50% though...) Hmm. You could scan the visibility map to see how much you could skip by using it. You could account for contiguity. Another idea: Perhaps each page of the visibility map should have a frozenxid (or multiple frozenxids?). Then if an individual page of the visibility map is old we could force scanning all the heap pages covered by that map page and update it. I'm not sure we can do that safely though without locking issues -- or is it ok because it's vacuum doing the updating? We discussed that a while ago: http://archives.postgresql.org/message-id/[EMAIL PROTECTED] Tom was concerned about making the visibility map not just a hint but critical data. Rightly so.
This is certainly 8.5 stuff; perhaps it would be more palatable after we get the index-only-scans working using the visibility map, since the map would be critical data anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index fd2429a..3e3cb9d 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -171,10 +171,7 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 	vacrelstats->hasindex = (nindexes > 0);
 
 	/* Should we use the visibility map or scan all pages? */
-	if (vacstmt->freeze_min_age != -1)
-		scan_all = true;
-	else
-		scan_all = false;
+	scan_all = vacstmt->scan_all;
 
 	/* Do the vacuuming */
 	lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eb7ab4d..2781f6e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2771,6 +2771,7 @@ _copyVacuumStmt(VacuumStmt *from)
 	COPY_SCALAR_FIELD(analyze);
 	COPY_SCALAR_FIELD(verbose);
 	COPY_SCALAR_FIELD(freeze_min_age);
+	COPY_SCALAR_FIELD(scan_all);
 	COPY_NODE_FIELD(relation);
 	COPY_NODE_FIELD(va_cols);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index d4c57bb..86a032f 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1436,6 +1436,7 @@ _equalVacuumStmt(VacuumStmt *a, VacuumStmt *b)
 	COMPARE_SCALAR_FIELD(analyze);
 	COMPARE_SCALAR_FIELD(verbose);
 	COMPARE_SCALAR_FIELD(freeze_min_age);
+	COMPARE_SCALAR_FIELD(scan_all);
 	COMPARE_NODE_FIELD(relation);
 	COMPARE_NODE_FIELD(va_cols);
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 85f4616..1aab75c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -5837,6 +5837,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
 					n->analyze = false;
 					n->full = $2;
 					n->freeze_min_age = $3 ? 0 : -1;
+					n->scan_all = $3;
 					n->verbose = $4;
 					n->relation = NULL;
 					n->va_cols = NIL;
@@ -5849,6 +5850,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
 					n->analyze = false;
 					n->full = $2;
 					n->freeze_min_age = $3 ? 0 : -1;
+					n->scan_all = $3;
 					n->verbose = $4;
 					n->relation = $5;
 					n->va_cols = NIL;
@@ -5860,6 +5862,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose
 					n->vacuum = true;
 					n->full = $2;
 					n->freeze_min_age = $3 ? 0 : -1;
+					n->scan_all = $3;
 					n->verbose |= $4;
 					$$ = (Node *)n;
 				}
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 8d8947f..2c68779 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2649,6 +2649,7 @@ autovacuum_do_vac_analyze(autovac_table *tab,
 	vacstmt.full = false;
 	vacstmt.analyze = tab->at_doanalyze;
 	vacstmt.freeze_min_age = tab->at_freeze_min_age;
+	vacstmt.scan_all = tab->at_wraparound;
 	vacstmt.verbose = false;
 	vacstmt.relation = NULL;
Re: [HACKERS] Visibility map, partial vacuums
Gregory Stark [EMAIL PROTECTED] writes: Heikki Linnakangas [EMAIL PROTECTED] writes: Gregory Stark wrote: 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just means unnecessary full table vacuums long before they accomplish anything. It allows you to truncate clog. If I did my math right, 200M transactions amounts to ~50MB of clog. Perhaps we should still raise it, disk space is cheap after all. Hm, the more I think about it the more this bothers me. It's another subtle change from the current behaviour. Currently *every* vacuum tries to truncate the clog. So you're constantly trimming off a little bit. With the visibility map (assuming you fix it not to do full scans all the time) you can never truncate the clog, just as you can never advance the relfrozenxid, unless you do a special full-table vacuum. I think in practice most people had a read-only table somewhere in their database which prevented the clog from ever being truncated anyways, so perhaps this isn't such a big deal. But the bottom line is that the anti-wraparound vacuums are going to be a lot more important and much more visible now than they were in the past. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: There is another problem, though, if the map is frequently probed for pages that don't exist in the map, or the map doesn't exist at all. Currently, the size of the map file is kept in relcache, in the rd_vm_nblocks_cache variable. Whenever a page is accessed that's >= rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, and rd_vm_nblocks_cache is updated. That means that every probe to a non-existing page causes an lseek(), which isn't free. Well, considering how seldom new pages will be added to the visibility map, it seems to me we could afford to send out a relcache inval event when that happens. Then rd_vm_nblocks_cache could be treated as trustworthy. Here's an updated version, with a lot of smaller cleanups, and using relcache invalidation to notify other backends when the visibility map fork is extended. I already committed the change to FSM to do the same. I'm feeling quite satisfied to commit this patch early next week. I modified the VACUUM VERBOSE output slightly, to print the number of pages scanned. The added part emphasized below:

postgres=# vacuum verbose foo;
INFO:  vacuuming "public.foo"
INFO:  "foo": removed 230 row versions in 10 pages
INFO:  "foo": found 230 removable, 10 nonremovable row versions in *10 out of* 43 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

That seems OK to me, but maybe others have an opinion on that? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas wrote: Here's an updated version, ... And here it is, for real... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

*** src/backend/access/heap/Makefile
--- src/backend/access/heap/Makefile
***************
*** 12,17 ****
  subdir = src/backend/access/heap
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o
  
  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
  
  include $(top_srcdir)/src/backend/common.mk
*** src/backend/access/heap/heapam.c
--- src/backend/access/heap/heapam.c
***************
*** 47,52 ****
--- 47,53 ----
  #include "access/transam.h"
  #include "access/tuptoaster.h"
  #include "access/valid.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
***************
*** 195,200 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
--- 196,202 ----
  	int			ntup;
  	OffsetNumber lineoff;
  	ItemId		lpp;
+ 	bool		all_visible;
  
  	Assert(page < scan->rs_nblocks);
***************
*** 233,252 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
  	lines = PageGetMaxOffsetNumber(dp);
  	ntup = 0;
  
  	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
  		 lineoff <= lines;
  		 lineoff++, lpp++)
  	{
  		if (ItemIdIsNormal(lpp))
  		{
- 			HeapTupleData loctup;
  			bool		valid;
  
! 			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
! 			loctup.t_len = ItemIdGetLength(lpp);
! 			ItemPointerSet(&(loctup.t_self), page, lineoff);
! 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
  			if (valid)
  				scan->rs_vistuples[ntup++] = lineoff;
  		}
--- 235,266 ----
  	lines = PageGetMaxOffsetNumber(dp);
  	ntup = 0;
  
+ 	/*
+ 	 * If the all-visible flag indicates that all tuples on the page are
+ 	 * visible to everyone, we can skip the per-tuple visibility tests.
+ 	 */
+ 	all_visible = PageIsAllVisible(dp);
+ 
  	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
  		 lineoff <= lines;
  		 lineoff++, lpp++)
  	{
  		if (ItemIdIsNormal(lpp))
  		{
  			bool		valid;
  
! 			if (all_visible)
! 				valid = true;
! 			else
! 			{
! 				HeapTupleData loctup;
! 
! 				loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
! 				loctup.t_len = ItemIdGetLength(lpp);
! 				ItemPointerSet(&(loctup.t_self), page, lineoff);
! 				valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
! 			}
  			if (valid)
  				scan->rs_vistuples[ntup++] = lineoff;
  		}
***************
*** 1860,1865 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1874,1880 ----
  	TransactionId xid = GetCurrentTransactionId();
  	HeapTuple	heaptup;
  	Buffer		buffer;
+ 	bool		all_visible_cleared = false;
  
  	if (relation->rd_rel->relhasoids)
  	{
***************
*** 1920,1925 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1935,1946 ----
  	RelationPutHeapTuple(relation, buffer, heaptup);
  
+ 	if (PageIsAllVisible(BufferGetPage(buffer)))
+ 	{
+ 		all_visible_cleared = true;
+ 		PageClearAllVisible(BufferGetPage(buffer));
+ 	}
+ 
  	/*
  	 * XXX Should we set PageSetPrunable on this page ?
  	 *
***************
*** 1943,1948 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1964,1970 ----
  		Page		page = BufferGetPage(buffer);
  		uint8		info = XLOG_HEAP_INSERT;
  
+ 		xlrec.all_visible_cleared = all_visible_cleared;
  		xlrec.target.node = relation->rd_node;
  		xlrec.target.tid = heaptup->t_self;
  		rdata[0].data = (char *) &xlrec;
***************
*** 1994,1999 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 2016,2026 ----
  	UnlockReleaseBuffer(buffer);
  
+ 	/* Clear the bit in the visibility map if necessary */
+ 	if (all_visible_cleared)
+ 		visibilitymap_clear(relation,
+ 							ItemPointerGetBlockNumber(&(heaptup->t_self)));
+ 
  	/*
  	 * If tuple is cachable, mark it for invalidation from the caches in case
  	 * we abort. Note it is OK to do this after releasing the buffer, because
***************
*** 2070,2075 **** heap_delete(Relation relation, ItemPointer tid,
--- 2097,2103 ----
  	Buffer		buffer;
  	bool		have_tuple_lock = false;
  	bool		iscombo;
+ 	bool		all_visible_cleared = false;
  
  	Assert(ItemPointerIsValid(tid));
***************
*** 2216,2221 **** l1:
--- 2244,2255 ----
  	 */
  	PageSetPrunable(page, xid);
  
+ 	if (PageIsAllVisible(page))
+ 	{
+ 		all_visible_cleared = true;
+ 		PageClearAllVisible(page);
+ 	}
+ 
  	/* store transaction information of xact deleting the tuple */
  	tp.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED |
  							   HEAP_XMAX_INVALID |
***************
*** 2237,2242 **** l1:
--- 2271,2277 ----
  		XLogRecPtr	recptr;
  		XLogRecData rdata[2];
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: The visibility map won't be inquired unless you vacuum. This is a bit tricky. In vacuum, we only know whether we can set a bit or not after we've acquired a cleanup lock on the page and scanned all the tuples. While we're holding a cleanup lock, we don't want to do I/O, which could potentially block out other processes for a long time. So it's too late to extend the visibility map at that point. This is no good; I think you've made the wrong tradeoffs. In particular, even though only vacuum *currently* uses the map, you want to extend it to be used by indexscans. So it's going to uselessly spring into being even without vacuums. I'm not convinced that I/O while holding cleanup lock is so bad that we should break other aspects of the system to avoid it. However, if you want to stick to that, how about

* vacuum page, possibly set its header bit
* release page lock (but not pin)
* if we need to set the bit, fetch the corresponding map page (I/O might happen here)
* get share lock on heap page, then recheck its header bit; if still set, set the map bit

Yeah, could do that. There is another problem, though, if the map is frequently probed for pages that don't exist in the map, or the map doesn't exist at all. Currently, the size of the map file is kept in relcache, in the rd_vm_nblocks_cache variable. Whenever a page is accessed that's >= rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, and rd_vm_nblocks_cache is updated. That means that every probe to a non-existing page causes an lseek(), which isn't free. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: There is another problem, though, if the map is frequently probed for pages that don't exist in the map, or the map doesn't exist at all. Currently, the size of the map file is kept in relcache, in the rd_vm_nblocks_cache variable. Whenever a page is accessed that's >= rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, and rd_vm_nblocks_cache is updated. That means that every probe to a non-existing page causes an lseek(), which isn't free. Well, considering how seldom new pages will be added to the visibility map, it seems to me we could afford to send out a relcache inval event when that happens. Then rd_vm_nblocks_cache could be treated as trustworthy. Maybe it'd be worth doing that for the FSM too. The frequency of invals would be higher, but then again the reference frequency is probably higher too? A relcache invalidation sounds awfully heavy-weight. Perhaps a light-weight invalidation event that doesn't flush the entry altogether, but just resets the cached sizes? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: Well, considering how seldom new pages will be added to the visibility map, it seems to me we could afford to send out a relcache inval event when that happens. Then rd_vm_nblocks_cache could be treated as trustworthy. A relcache invalidation sounds awfully heavy-weight. It really isn't. regards, tom lane
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: There is another problem, though, if the map is frequently probed for pages that don't exist in the map, or the map doesn't exist at all. Currently, the size of the map file is kept in relcache, in the rd_vm_nblocks_cache variable. Whenever a page is accessed that's >= rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, and rd_vm_nblocks_cache is updated. That means that every probe to a non-existing page causes an lseek(), which isn't free. Well, considering how seldom new pages will be added to the visibility map, it seems to me we could afford to send out a relcache inval event when that happens. Then rd_vm_nblocks_cache could be treated as trustworthy. Maybe it'd be worth doing that for the FSM too. The frequency of invals would be higher, but then again the reference frequency is probably higher too? regards, tom lane
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: Well, considering how seldom new pages will be added to the visibility map, it seems to me we could afford to send out a relcache inval event when that happens. Then rd_vm_nblocks_cache could be treated as trustworthy. A relcache invalidation sounds awfully heavy-weight. It really isn't. Okay, then. I'll use relcache invalidation for both the FSM and visibility map. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
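The trade-off being settled here can be sketched with a small self-contained model (illustrative C only, not PostgreSQL source; VMSizeCache, vm_page_exists and vm_extend are invented names): the cached fork size is trusted until an extension sends an invalidation, so only the first probe after an invalidation pays for the lseek()-equivalent, instead of every probe beyond the cached size paying for one.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a relcache-style cache of the visibility map fork's size. */
typedef struct VMSizeCache
{
    int  vm_nblocks;        /* authoritative size (found via smgrnblocks()/lseek() in reality) */
    int  vm_nblocks_cache;  /* cached copy */
    bool cache_valid;       /* false after a (relcache) invalidation */
} VMSizeCache;

/* A probe for 'blkno' costs an "lseek" only when the cache is invalid. */
static bool
vm_page_exists(VMSizeCache *c, int blkno, int *lseeks)
{
    if (!c->cache_valid)
    {
        (*lseeks)++;                     /* the expensive system call */
        c->vm_nblocks_cache = c->vm_nblocks;
        c->cache_valid = true;
    }
    return blkno < c->vm_nblocks_cache;
}

/* Extending the map invalidates the cached size (the inval event). */
static void
vm_extend(VMSizeCache *c, int newsize)
{
    c->vm_nblocks = newsize;
    c->cache_valid = false;
}
```

Repeated probes for a nonexistent page then cost one "lseek" in total rather than one per probe, which is the point of treating the cache as trustworthy between invalidations.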
Re: [HACKERS] Visibility map, partial vacuums
On Nov 23, 2008, at 3:18 PM, Tom Lane wrote: So it seems like we do indeed want to rejigger autovac's rules a bit to account for the possibility of wanting to apply vacuum to get visibility bits set. That makes the idea of not writing out hint bit updates unless the page is already dirty a lot easier to swallow, because now we'd have a mechanism in place to ensure that they were set in a reasonable timeframe by autovacuum. That actually wouldn't incur much extra overhead at all, except in the case of a table that's effectively write-only. Actually, that's not even true; you still have to eventually freeze a write-mostly table. -- Decibel!, aka Jim C. Nasby, Database Architect [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: * ISTM that the patch is designed on the plan that the PD_ALL_VISIBLE page header flag *must* be correct, but it's really okay if the backing map bit *isn't* correct --- in particular we don't trust the map bit when performing antiwraparound vacuums. This isn't well documented. Right. Will add comments. We can't use the map bit for antiwraparound vacuums, because the bit doesn't tell you when the tuples have been frozen. And we can't advance relfrozenxid if we've skipped any pages. I've been thinking that we could add one frozenxid field to each visibility map page, for the oldest xid on the heap pages covered by the visibility map page. That would allow more fine-grained anti-wraparound vacuums as well. * Also, I see that vacuum has a provision for clearing an incorrectly set PD_ALL_VISIBLE flag, but shouldn't it fix the map too? Yes, will fix. Although, as long as we don't trust the visibility map, no real damage would be done. * It would be good if the visibility map fork were never created until there is occasion to set a bit in it; this would for instance typically mean that temp tables would never have one. I think that visibilitymap.c doesn't get this quite right --- in particular vm_readbuf seems willing to create/extend the fork whether its extend argument is true or not, so it looks like an inquiry operation would cause the map fork to be created. It should be possible to act as though a nonexistent fork just means all zeroes. The visibility map won't be inquired unless you vacuum. This is a bit tricky. In vacuum, we only know whether we can set a bit or not, after we've acquired a cleanup lock on the page, and scanned all the tuples. While we're holding a cleanup lock, we don't want to do I/O, which could potentially block out other processes for a long time. So it's too late to extend the visibility map at that point. 
I agree that vm_readbuf should not create the fork if 'extend' is false, that's an oversight, but it won't change the actual behavior because visibilitymap_test calls it with 'extend' true. Because of the above. I will add comments about that, though, there's nothing describing that currently. * heap_insert's all_visible_cleared variable doesn't seem to get initialized --- didn't your compiler complain? Hmph, I must've been compiling with -O0. * You missed updating SizeOfHeapDelete and SizeOfHeapUpdate Thanks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: I've been thinking that we could add one frozenxid field to each visibility map page, for the oldest xid on the heap pages covered by the visibility map page. That would allow more fine-grained anti-wraparound vacuums as well. This doesn't strike me as a particularly good idea. Right now the map is only hints as far as vacuum is concerned --- if you do the above then the map becomes critical data. And I don't really think you'll buy much. The visibility map won't be inquired unless you vacuum. This is a bit tricky. In vacuum, we only know whether we can set a bit or not, after we've acquired a cleanup lock on the page, and scanned all the tuples. While we're holding a cleanup lock, we don't want to do I/O, which could potentially block out other processes for a long time. So it's too late to extend the visibility map at that point. This is no good; I think you've made the wrong tradeoffs. In particular, even though only vacuum *currently* uses the map, you want to extend it to be used by indexscans. So it's going to uselessly spring into being even without vacuums. I'm not convinced that I/O while holding cleanup lock is so bad that we should break other aspects of the system to avoid it. However, if you want to stick to that, how about
* vacuum page, possibly set its header bit
* release page lock (but not pin)
* if we need to set the bit, fetch the corresponding map page (I/O might happen here)
* get share lock on heap page, then recheck its header bit; if still set, set the map bit
regards, tom lane
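Tom's four-step ordering can be modeled in a few lines of stand-alone C (illustrative only; HeapPage, MapBit and vacuum_set_bits are invented names, and the 'concurrent' callback stands in for another backend modifying the page while the lock is released): because the header bit is rechecked under a fresh lock before touching the map, the map bit can never claim all-visible when the page itself does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: the heap page's all-visible header bit is authoritative; the
 * map bit may only be set after rechecking the header bit, because the page
 * was unlocked while the map page was being read in. */
typedef struct { bool all_visible; } HeapPage;
typedef struct { bool bit; }         MapBit;

static void
vacuum_set_bits(HeapPage *heap, MapBit *map, void (*concurrent)(HeapPage *))
{
    heap->all_visible = true;   /* set header bit while exclusively locked */

    /* page lock released here (pin kept); map page I/O may happen ... */
    if (concurrent)
        concurrent(heap);       /* another backend may modify the page now */

    /* ... re-lock (share) and recheck before setting the map bit */
    if (heap->all_visible)
        map->bit = true;
}

/* Example "concurrent" modification: an UPDATE clearing the header bit. */
static void
clear_all_visible(HeapPage *p)
{
    p->all_visible = false;
}
```

If the concurrent modification clears the header bit during the unlocked window, the recheck sees that and the map bit stays unset, which is exactly the invariant the protocol is after.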
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: So it seems like we do indeed want to rejigger autovac's rules a bit to account for the possibility of wanting to apply vacuum to get visibility bits set. I'm not too excited about triggering an extra vacuum. As Matthew pointed out, the point of this patch is to reduce the number of vacuums required, not increase it. If you're not going to vacuum a table, you don't care if the bits in the visibility map are set or not. But it's already the case that the bits provide a performance increase to other things besides vacuum. We could set the PD_ALL_VISIBLE flag more aggressively, outside VACUUMs, if we want to make the seqscan optimization more effective. For example, a seqscan could set the flag too, if it sees that all the tuples were visible, and had the XMIN_COMMITTED and XMAX_INVALID hint bits set. I was wondering whether we could teach heap_page_prune to set the flag without adding any extra tuple visibility checks. A seqscan per se shouldn't be doing this because it doesn't normally call HeapTupleSatisfiesVacuum. regards, tom lane
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane [EMAIL PROTECTED] writes: Heikki Linnakangas [EMAIL PROTECTED] writes: I've been thinking that we could add one frozenxid field to each visibility map page, for the oldest xid on the heap pages covered by the visibility map page. That would allow more fine-grained anti-wraparound vacuums as well. This doesn't strike me as a particularly good idea. Right now the map is only hints as far as vacuum is concerned --- if you do the above then the map becomes critical data. And I don't really think you'll buy much. Hm, that depends on how critical the critical data is. It's critical that the frozenxid that autovacuum sees is no more recent than the actual frozenxid, but not critical that it be entirely up-to-date otherwise. So if it's possible for the frozenxid in the visibility map to go backwards then it's no good, since if that update is lost we might skip a necessary vacuum freeze. But if we guarantee that we never update the frozenxid in the visibility map forward ahead of recentglobalxmin then it can't ever go backwards. (Well, not in a way that matters) However I'm a bit puzzled how you could possibly maintain this frozenxid. As soon as you freeze an xid you'll have to visit all the other pages covered by that visibility map page to see what the new value should be. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Re: [HACKERS] Visibility map, partial vacuums
Gregory Stark wrote: However I'm a bit puzzled how you could possibly maintain this frozenxid. As soon as you freeze an xid you'll have to visit all the other pages covered by that visibility map page to see what the new value should be. Right, you could only advance it when you scan all the pages covered by the visibility map page. But that's better than having to scan the whole relation. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: Gregory Stark wrote: However I'm a bit puzzled how you could possibly maintain this frozenxid. As soon as you freeze an xid you'll have to visit all the other pages covered by that visibility map page to see what the new value should be. Right, you could only advance it when you scan all the pages covered by the visibility map page. But that's better than having to scan the whole relation. Is it? It seems like that would just move around the work. You'll still have to visit every page once every 2B transactions or so. You'll just do it 64MB at a time. It's nice to smooth the work but it would be much nicer to detect that a normal vacuum has already processed all of those pages since the last insert/update/delete on those pages and so avoid the work entirely. To avoid the work entirely you need some information about the oldest xid on those pages seen by regular vacuums (and/or prunes). We would want to skip any page which:
a) Has been visited by vacuum freeze and not been updated since
b) Has been visited by a regular vacuum and the oldest xid found was more recent than freeze_threshold.
c) Has been updated frequently such that no old tuples remain
Ideally (b) should completely obviate the need for anti-wraparound freezes entirely. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Re: [HACKERS] Visibility map, partial vacuums
Gregory Stark [EMAIL PROTECTED] writes: So if it's possible for the frozenxid in the visibility map to go backwards then it's no good, since if that update is lost we might skip a necessary vacuum freeze. Seems like a lost disk write would be enough to make that happen. Now you might argue that the odds of that are no worse than the odds of losing an update to one particular heap page, but in this case the single hiccup could lead to losing half a gigabyte of data (assuming 8K page size). The leverage you get for saving vacuum freeze work is exactly equal to the magnification factor for data loss. regards, tom lane
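Tom's half-a-gigabyte figure follows directly from the map's granularity: with one bit per heap page, a single 8K map page covers 8192 * 8 = 65536 heap pages of 8 KB each, i.e. 512 MB of heap. A trivial stand-alone check of that arithmetic (ignoring the map page's own header, which slightly reduces the usable bits):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

/* Bytes of heap covered by one visibility-map page, at one bit per heap
 * page (header overhead ignored for simplicity). */
static uint64_t
heap_bytes_per_map_page(void)
{
    uint64_t heap_pages = (uint64_t) BLCKSZ * 8;    /* bits per map page */

    return heap_pages * BLCKSZ;                     /* bytes of heap covered */
}
```

So a lost write of one map page stands in for roughly half a gigabyte of heap, which is the magnification factor Tom is describing.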
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: I committed the changes to FSM truncation yesterday, that helps with the truncation of the visibility map as well. Attached is an updated visibility map patch. I looked over this patch a bit ... 1. The bits in the visibility map are set in the 1st phase of lazy vacuum. That works, but it means that after a delete or update, it takes two vacuums until the bit in the visibility map is set. The first vacuum removes the dead tuple, and only the second sees that there's no dead tuples and sets the bit. I think this is probably not a big issue really. The point of this change is to optimize things for pages that are static over the long term; one extra vacuum cycle before the page is deemed static doesn't seem like a problem. You could even argue that this saves I/O because we don't set the bit (and perhaps later have to clear it) until we know that the page has stayed static across a vacuum cycle and thus has a reasonable probability of continuing to do so. A possible problem is that if a relation is filled all in one shot, autovacuum would trigger a single vacuum cycle on it and then never have a reason to trigger another; leading to the bits never getting set (or at least not till an antiwraparound vacuum occurs). We might want to tweak autovac so that an extra vacuum cycle occurs in this case. But I'm not quite sure what a reasonable heuristic would be. Some other points: * ISTM that the patch is designed on the plan that the PD_ALL_VISIBLE page header flag *must* be correct, but it's really okay if the backing map bit *isn't* correct --- in particular we don't trust the map bit when performing antiwraparound vacuums. This isn't well documented. * Also, I see that vacuum has a provision for clearing an incorrectly set PD_ALL_VISIBLE flag, but shouldn't it fix the map too? 
* It would be good if the visibility map fork were never created until there is occasion to set a bit in it; this would for instance typically mean that temp tables would never have one. I think that visibilitymap.c doesn't get this quite right --- in particular vm_readbuf seems willing to create/extend the fork whether its extend argument is true or not, so it looks like an inquiry operation would cause the map fork to be created. It should be possible to act as though a nonexistent fork just means all zeroes. * heap_insert's all_visible_cleared variable doesn't seem to get initialized --- didn't your compiler complain? * You missed updating SizeOfHeapDelete and SizeOfHeapUpdate regards, tom lane
Re: [HACKERS] Visibility map, partial vacuums
On Sun, 2008-11-23 at 14:05 -0500, Tom Lane wrote: A possible problem is that if a relation is filled all in one shot, autovacuum would trigger a single vacuum cycle on it and then never have a reason to trigger another; leading to the bits never getting set (or at least not till an antiwraparound vacuum occurs). We might want to tweak autovac so that an extra vacuum cycle occurs in this case. But I'm not quite sure what a reasonable heuristic would be. This would only be an issue if using the visibility map for things other than partial vacuum (e.g. index-only scan), right? If we never do another VACUUM, we don't need partial vacuum. Regards, Jeff Davis
Re: [HACKERS] Visibility map, partial vacuums
Jeff Davis [EMAIL PROTECTED] writes: On Sun, 2008-11-23 at 14:05 -0500, Tom Lane wrote: A possible problem is that if a relation is filled all in one shot, autovacuum would trigger a single vacuum cycle on it and then never have a reason to trigger another; leading to the bits never getting set (or at least not till an antiwraparound vacuum occurs). This would only be an issue if using the visibility map for things other than partial vacuum (e.g. index-only scan), right? If we never do another VACUUM, we don't need partial vacuum. Well, the patch already uses the page header bits for optimization of seqscans, and could probably make good use of them for bitmap scans too. It'd be nice if the page header bits got set even if the map bits didn't. Reflecting on it though, maybe Heikki described the behavior too pessimistically anyway. If a page contains no dead tuples, it should get its bits set on first visit anyhow, no? So for the ordinary bulk load scenario where there are no failed insertions, the first vacuum pass should set all the bits ... at least, if enough time has passed for RecentXmin to be past the inserting transaction. However, my comment above was too optimistic, because in an insert-only scenario autovac would in fact not trigger VACUUM at all, only ANALYZE. So it seems like we do indeed want to rejigger autovac's rules a bit to account for the possibility of wanting to apply vacuum to get visibility bits set. regards, tom lane
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: However, my comment above was too optimistic, because in an insert-only scenario autovac would in fact not trigger VACUUM at all, only ANALYZE. So it seems like we do indeed want to rejigger autovac's rules a bit to account for the possibility of wanting to apply vacuum to get visibility bits set. I'm sure I'm missing something, but I thought the point of this was to lessen the impact of VACUUM and now you are suggesting that we have to add vacuums to tables that have never needed one before.
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: Reflecting on it though, maybe Heikki described the behavior too pessimistically anyway. If a page contains no dead tuples, it should get its bits set on first visit anyhow, no? So for the ordinary bulk load scenario where there are no failed insertions, the first vacuum pass should set all the bits ... at least, if enough time has passed for RecentXmin to be past the inserting transaction. Right. I did say ... after a delete or update, it takes two vacuums until ... in my mail. However, my comment above was too optimistic, because in an insert-only scenario autovac would in fact not trigger VACUUM at all, only ANALYZE. So it seems like we do indeed want to rejigger autovac's rules a bit to account for the possibility of wanting to apply vacuum to get visibility bits set. I'm not too excited about triggering an extra vacuum. As Matthew pointed out, the point of this patch is to reduce the number of vacuums required, not increase it. If you're not going to vacuum a table, you don't care if the bits in the visibility map are set or not. We could set the PD_ALL_VISIBLE flag more aggressively, outside VACUUMs, if we want to make the seqscan optimization more effective. For example, a seqscan could set the flag too, if it sees that all the tuples were visible, and had the XMIN_COMMITTED and XMAX_INVALID hint bits set. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
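Heikki's suggestion amounts to a purely hint-bit-driven check: no tuple needs a fresh visibility test, only already-set hint bits are consulted. A simplified stand-alone model of that rule (TupleHints, PageModel and seqscan_maybe_set_all_visible are invented names; real tuples carry these hints in t_infomask):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the hint bits a seqscan could consult. */
typedef struct
{
    bool xmin_committed;    /* models the HEAP_XMIN_COMMITTED hint */
    bool xmax_invalid;      /* models the HEAP_XMAX_INVALID hint */
} TupleHints;

typedef struct
{
    TupleHints tuples[4];
    int        ntuples;
    bool       pd_all_visible;  /* models the PD_ALL_VISIBLE page flag */
} PageModel;

/* Set the page flag only if every tuple is already hinted as committed
 * (xmin) and not deleted (xmax) -- no extra visibility checks needed. */
static void
seqscan_maybe_set_all_visible(PageModel *page)
{
    for (int i = 0; i < page->ntuples; i++)
    {
        if (!page->tuples[i].xmin_committed || !page->tuples[i].xmax_invalid)
            return;             /* some tuple not provably visible */
    }
    page->pd_all_visible = true;
}
```

A single tuple lacking either hint keeps the flag unset, matching the conservative rule Heikki describes; Tom's follow-up point is that a plain seqscan doesn't normally compute the information these hints encode, which is why heap_page_prune is a more natural place for it.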
Re: [HACKERS] Visibility map, partial vacuums
I committed the changes to FSM truncation yesterday, that helps with the truncation of the visibility map as well. Attached is an updated visibility map patch. There's two open issues: 1. The bits in the visibility map are set in the 1st phase of lazy vacuum. That works, but it means that after a delete or update, it takes two vacuums until the bit in the visibility map is set. The first vacuum removes the dead tuple, and only the second sees that there's no dead tuples and sets the bit. 2. Should modify the output of VACUUM VERBOSE to say how many pages were actually scanned. What other information is relevant, or is no longer relevant, with partial vacuums. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

*** src/backend/access/heap/Makefile
--- src/backend/access/heap/Makefile
***************
*** 12,17 ****
  subdir = src/backend/access/heap
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o
  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
  include $(top_srcdir)/src/backend/common.mk

*** src/backend/access/heap/heapam.c
--- src/backend/access/heap/heapam.c
***************
*** 47,52 ****
--- 47,53 ----
  #include "access/transam.h"
  #include "access/tuptoaster.h"
  #include "access/valid.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
***************
*** 195,200 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
--- 196,202 ----
      int          ntup;
      OffsetNumber lineoff;
      ItemId       lpp;
+     bool         all_visible;

      Assert(page < scan->rs_nblocks);
***************
*** 233,252 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;
      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
-             HeapTupleData loctup;
              bool          valid;
!             loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!             loctup.t_len = ItemIdGetLength(lpp);
!             ItemPointerSet(&(loctup.t_self), page, lineoff);
!             valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
--- 235,266 ----
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;
+     /*
+      * If the all-visible flag indicates that all tuples on the page are
+      * visible to everyone, we can skip the per-tuple visibility tests.
+      */
+     all_visible = PageIsAllVisible(dp);
+
      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
              bool          valid;
!             if (all_visible)
!                 valid = true;
!             else
!             {
!                 HeapTupleData loctup;
!
!                 loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!                 loctup.t_len = ItemIdGetLength(lpp);
!                 ItemPointerSet(&(loctup.t_self), page, lineoff);
!                 valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
!             }
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
***************
*** 1860,1865 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1874,1880 ----
      TransactionId xid = GetCurrentTransactionId();
      HeapTuple     heaptup;
      Buffer        buffer;
+     bool          all_visible_cleared;

      if (relation->rd_rel->relhasoids)
      {
***************
*** 1920,1925 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1935,1946 ----
      RelationPutHeapTuple(relation, buffer, heaptup);

+     if (PageIsAllVisible(BufferGetPage(buffer)))
+     {
+         all_visible_cleared = true;
+         PageClearAllVisible(BufferGetPage(buffer));
+     }
+
      /*
       * XXX Should we set PageSetPrunable on this page ?
       *
***************
*** 1943,1948 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1964,1970 ----
      Page          page = BufferGetPage(buffer);
      uint8         info = XLOG_HEAP_INSERT;

+     xlrec.all_visible_cleared = all_visible_cleared;
      xlrec.target.node = relation->rd_node;
      xlrec.target.tid = heaptup->t_self;
      rdata[0].data = (char *) &xlrec;
***************
*** 1994,1999 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 2016,2026 ----
      UnlockReleaseBuffer(buffer);

+     /* Clear the bit in the visibility map if necessary */
+     if (all_visible_cleared)
+         visibilitymap_clear(relation,
+                             ItemPointerGetBlockNumber(&(heaptup->t_self)));
+
      /*
       * If tuple is cachable, mark it for invalidation from the caches in case
       * we abort. Note it is OK to do this after releasing the buffer, because
***************
*** 2070,2075 **** heap_delete(Relation relation, ItemPointer tid,
--- 2097,2103 ----
      Buffer        buffer;
      bool
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas wrote: Another thing that does need to be fixed, is the way that the extension and truncation of the visibility map is handled; that's broken in the current patch. I started working on the patch a long time ago, before the FSM rewrite was finished, and haven't gotten around fixing that part yet. We already solved it for the FSM, so we could just follow that pattern. The way we solved truncation in the FSM was to write a separate WAL record with the new heap size, but perhaps we want to revisit that decision, instead of adding again new code to write a third WAL record, for truncation of the visibility map. smgrtruncate() writes a WAL record of its own, if any full blocks are truncated away of the FSM, but we needed a WAL record even if no full blocks are truncated from the FSM file, because the tail of the last remaining FSM page, representing the truncated away heap pages, still needs to cleared. Visibility map has the same problem. One proposal was to piggyback on the smgrtruncate() WAL-record, and call FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a modularity point of view; smgr.c shouldn't be calling higher-level functions. But maybe it wouldn't be that bad, after all. Or, we could remove WAL-logging from smgrtruncate() altogether, and move it to RelationTruncate() or another higher-level function, and handle the WAL-logging and replay there. In preparation for the visibility map patch, I revisited the truncation issue, and hacked together a patch to piggyback the FSM truncation to the main fork smgr truncation WAL record. I moved the WAL-logging from smgrtruncate() to RelationTruncate(). There's a new flag to RelationTruncate indicating whether the FSM should be truncated too, and only one truncation WAL record is written for the operation. That does seem cleaner than the current approach where the FSM writes a separate WAL record just to clear the bits of the last remaining FSM page. 
I had to move RelationTruncate() to smgr.c, because I don't think a function in bufmgr.c should be doing WAL-logging. However, RelationTruncate really doesn't belong in smgr.c either. Also, now that smgrtruncate doesn't write its own WAL record, it doesn't seem right for smgrcreate to be doing that either. So, I think I'll take this one step forward, and move RelationTruncate() to a new higher level file, e.g. src/backend/catalog/storage.c, and also create a new RelationCreateStorage() function that calls smgrcreate(), and move the WAL-logging from smgrcreate() to RelationCreateStorage(). So, we'll have two functions in a new file:
/* Create physical storage for a relation. If 'fsm' is true, an FSM fork is also created */
RelationCreateStorage(Relation rel, bool fsm)
/* Truncate the relation to 'nblocks' blocks. If 'fsm' is true, the FSM is also truncated */
RelationTruncate(Relation rel, BlockNumber nblocks, bool fsm)
The next question is whether the pending rel deletion stuff in smgr.c should be moved to the new file too. It seems like it would belong there better. That would leave smgr.c as a very thin wrapper around md.c -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: The next question is whether the pending rel deletion stuff in smgr.c should be moved to the new file too. It seems like it would belong there better. That would leave smgr.c as a very thin wrapper around md.c Well it's just a switch, albeit with only one case, so I wouldn't expect it to be much more than a thin wrapper. If we had more storage systems it might be clearer what features were common to all of them and could be hoisted up from md.c. I'm not clear there are any though. Actually I wonder if an entirely in-memory storage system would help with the temporary table problem on systems where the kernel is too aggressive about flushing file buffers or metadata. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
Re: [HACKERS] Visibility map, partial vacuums
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: One option would be to just ignore that problem for now, and not WAL-log. Probably worth skipping for now, since it will cause patch conflicts if you do. Are there any other interactions with Hot Standby? But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record to say page is now all visible, without too much work. Does the PD_ALL_VISIBLE flag need to be set at the same time as updating the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive to set the block flag (unlogged), but just not update VM. Separating the two concepts should allow the visibility check speed gain to be more generally available. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Visibility map, partial vacuums
Simon Riggs wrote: On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: One option would be to just ignore that problem for now, and not WAL-log. Probably worth skipping for now, since it will cause patch conflicts if you do. Are there any other interactions with Hot Standby? But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record to say page is now all visible, without too much work. Hmm. Even if a tuple is visible to everyone on the master, it's not necessarily yet visible to all the read-only transactions in the slave. Does the PD_ALL_VISIBLE flag need to be set at the same time as updating the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive to set the block flag (unlogged), but just not update VM. Separating the two concepts should allow the visibility check speed gain to be more generally available. Yes, that should be possible in theory. There's no version of ConditionalLockBuffer() for conditionally upgrading a shared lock to exclusive, but it should be possible in theory. I'm not sure if it would be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, though. If it is, then we could do just that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Heikki Linnakangas [EMAIL PROTECTED] writes: ... I'm not sure if it would be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, though. If it is, then we could do just that. Seems like it must be safe. If you have shared lock on a page then no one else could be modifying the page in a way that would falsify PD_ALL_VISIBLE. You might have several processes concurrently try to set the bit but that is safe (same situation as for hint bits). The harder part is propagating the bit to the visibility map, but I gather you intend to only allow VACUUM to do that? regards, tom lane
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: The harder part is propagating the bit to the visibility map, but I gather you intend to only allow VACUUM to do that? Yep. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Visibility map, partial vacuums
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: To modify a page: If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared first. The heap page is kept pinned, but not locked, while the visibility map is updated. We want to avoid holding a lock across I/O, even though the visibility map is likely to stay in cache. After the visibility map has been updated, the page is exclusively locked and modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing the lock. So after having determined that you will modify a page, you release the ex lock on the buffer and then try to regain it later? Seems like a really bad idea from here. What if it's no longer possible to do the modification you intended? In case of insert/update, you have to find a new target page. I put the logic in RelationGetBufferForTuple(). In case of delete and update (old page), the flag is checked and bit cleared just after pinning the buffer, before doing anything else. (I note that that's not actually what the patch is doing for heap_update, will fix..) If we give up on the strict requirement that the bit in the visibility map has to be cleared if the PD_ALL_VISIBLE flag on the page is not set, then we could just update the visibility map after releasing the locks on the heap pages. I think I'll do that for now, for simplicity. To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the page, while you observe that all tuples on the page are visible to everyone. That doesn't sound too good from a concurrency standpoint... Well, no, but it's only done in VACUUM. And pruning. I implemented it as a new loop that call HeapTupleSatisfiesVacuum on each tuple, and checking that xmin is old enough for live tuples, but come to think of it, we're already calling HeapTupleSatisfiesVacuum for every tuple on the page during VACUUM, so it should be possible to piggyback on that by restructuring the code. That's how the patch works right now. 
However, there's a small performance problem with the current approach: setting the PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen: I'm more concerned about *clearing* the bit being WAL-logged. That's necessary for correctness. Yes, clearing the PD_ALL_VISIBLE flag always needs to be WAL-logged. There's a new boolean field in xl_heap_insert/update/delete records indicating if the operation cleared the flag. On replay, if the flag was cleared, the bit in the visibility map is also cleared. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
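The clear-then-replay scheme Heikki describes can be sketched as a toy model. The structs and names below are illustrative stand-ins, not the real xl_heap_insert record or visibility map code; the model just shows the invariant that the VM bit is cleared before the page flag, and that the standby repeats both clears on replay:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one bit per heap page in the "visibility map", plus a
 * PD_ALL_VISIBLE flag on the page itself.  All names are illustrative. */

#define N_PAGES 8

typedef struct {
    bool pd_all_visible;       /* page-header flag */
} HeapPage;

typedef struct {
    bool cleared_all_visible;  /* the new boolean field described above */
    int  block;                /* which heap page the record touches */
} XlHeapInsert;

static HeapPage heap[N_PAGES];
static uint8_t  vm_bits;       /* bit i set => page i is all-visible */

/* Primary side: clear the VM bit first, then the page flag, and note
 * in the WAL record whether the flag was cleared. */
static XlHeapInsert do_insert(int block)
{
    XlHeapInsert rec = { .cleared_all_visible = false, .block = block };

    if (heap[block].pd_all_visible)
    {
        vm_bits &= (uint8_t) ~(1u << block);   /* VM bit cleared first */
        heap[block].pd_all_visible = false;
        rec.cleared_all_visible = true;
    }
    return rec;
}

/* Standby side: on replay, if the record says the flag was cleared,
 * clear both the page flag and the VM bit. */
static void redo_insert(const XlHeapInsert *rec)
{
    if (rec->cleared_all_visible)
    {
        heap[rec->block].pd_all_visible = false;
        vm_bits &= (uint8_t) ~(1u << rec->block);
    }
}
```

Because the record carries the cleared-flag boolean, the standby never has to guess whether the VM bit needs clearing.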
Re: [HACKERS] Visibility map, partial vacuums
On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: One option would be to just ignore that problem for now, and not WAL-log. Probably worth skipping for now, since it will cause patch conflicts if you do. Are there any other interactions with Hot Standby? But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record to say page is now all visible, without too much work. Hmm. Even if a tuple is visible to everyone on the master, it's not necessarily yet visible to all the read-only transactions in the slave. Never a problem. No query can ever see the rows removed by a cleanup record, enforced by the recovery system. Does the PD_ALL_VISIBLE flag need to be set at the same time as updating the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive to set the block flag (unlogged), but just not update the VM. Separating the two concepts should allow the visibility check speed gain to be more generally available. Yes, that should be possible in theory. There's no version of ConditionalLockBuffer() for conditionally upgrading a shared lock to exclusive, but one could be added. I'm not sure if it would be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, though. If it is, then we could do just that. To be honest, I'm more excited about your perf results for that than I am about speeding up some VACUUMs. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Visibility map, partial vacuums
Simon Riggs wrote: On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: One option would be to just ignore that problem for now, and not WAL-log. Probably worth skipping for now, since it will cause patch conflicts if you do. Are there any other interactions with Hot Standby? But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record to say page is now all visible, without too much work. Hmm. Even if a tuple is visible to everyone on the master, it's not necessarily yet visible to all the read-only transactions in the slave. Never a problem. No query can ever see the rows removed by a cleanup record, enforced by the recovery system. Yes, but there's a problem with recently inserted tuples: 1. A query begins in the slave, taking a snapshot with xmax = 100. So the effects of anything more recent should not be seen. 2. Transaction 100 inserts a tuple in the master, and commits 3. A vacuum comes along. There's no other transactions running in the master. Vacuum sees that all tuples on the page, including the one just inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag. 4. The change is replicated to the slave. 5. The query in the slave that began at step 1 looks at the page, sees that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility checks, and erroneously returns the inserted tuple. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
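The five-step race above is easy to reproduce in miniature. The snippet below is a deliberately simplified model (the structs and the visibility check are invented for illustration, not the real HeapTupleSatisfiesMVCC machinery): a snapshot with xmax = 100 must not see a tuple with xmin = 100, but the PD_ALL_VISIBLE shortcut set by the primary's vacuum skips exactly that check:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the standby race.  All names are illustrative. */

typedef struct { uint32_t xmin; } Tuple;
typedef struct { uint32_t xmax; } Snapshot;

/* Simplified MVCC rule: a snapshot cannot see transactions >= its xmax. */
static bool tuple_visible(const Tuple *tup, const Snapshot *snap)
{
    return tup->xmin < snap->xmax;
}

/* Scan as a standby backend would, honouring the page flag. */
static bool scan_sees_tuple(const Tuple *tup, const Snapshot *snap,
                            bool pd_all_visible)
{
    if (pd_all_visible)
        return true;             /* shortcut: visibility check skipped */
    return tuple_visible(tup, snap);
}
```

With the flag unset the scan gives the correct answer; with the flag set by the primary's vacuum (steps 3-4) the standby scan erroneously returns the tuple (step 5).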
Re: [HACKERS] Visibility map, partial vacuums
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote: Lazy VACUUM only needs to visit pages that are '0' in the visibility map. This allows partial vacuums, where we only need to scan those parts of the table that need vacuuming, plus all indexes. Just realised that this means we still have to visit each block of a btree index with a cleanup lock. That means the earlier idea of saying I don't need a cleanup lock if the page is not in memory makes a lot more sense with a partial vacuum. 1. Scan all blocks in memory for the index (and so, don't do this unless the index is larger than a certain % of shared buffers), 2. Start reading in new blocks until you've removed the correct number of tuples 3. Work through the rest of the blocks checking that they are either in shared buffers and we can get a cleanup lock, or they aren't in shared buffers and so nobody has them pinned. If you do step (2) intelligently with regard to index correlation you might not need to do much I/O at all, if any. (1) has a good hit ratio because mostly only active tables will be vacuumed so are fairly likely to be in memory. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
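The scan order Simon proposes (resident blocks first, cold blocks only as needed) can be sketched as a planning step. This is a toy model: the residency array stands in for the real shared-buffers hash probe, and the function name is invented for illustration:

```c
#include <stdbool.h>

/* Sketch of the three-step scan order: visit index blocks already in
 * shared buffers first (step 1), then fall back to reading the
 * remaining blocks (steps 2-3).  Residency is a stand-in here for a
 * real buffer-table lookup. */

#define N_BLOCKS 6

static bool in_shared_buffers[N_BLOCKS] =
    { true, false, true, false, true, false };

/* Fill 'order' with resident blocks first, then the rest; returns
 * how many blocks were resident. */
static int plan_scan_order(int order[N_BLOCKS])
{
    int n = 0;

    for (int b = 0; b < N_BLOCKS; b++)    /* step 1: in-memory pass */
        if (in_shared_buffers[b])
            order[n++] = b;

    int resident = n;

    for (int b = 0; b < N_BLOCKS; b++)    /* steps 2-3: cold blocks */
        if (!in_shared_buffers[b])
            order[n++] = b;

    return resident;
}
```

In the real design the second loop would stop early once enough dead tuples had been removed, which is where the I/O saving comes from.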
Re: [HACKERS] Visibility map, partial vacuums
On Tue, 2008-10-28 at 19:02 +0200, Heikki Linnakangas wrote: Yes, but there's a problem with recently inserted tuples: 1. A query begins in the slave, taking a snapshot with xmax = 100. So the effects of anything more recent should not be seen. 2. Transaction 100 inserts a tuple in the master, and commits 3. A vacuum comes along. There's no other transactions running in the master. Vacuum sees that all tuples on the page, including the one just inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag. 4. The change is replicated to the slave. 5. The query in the slave that began at step 1 looks at the page, sees that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility checks, and erroneously returns the inserted tuple. Yep. I was thinking about FSM and row removal. So PD_ALL_VISIBLE must be separately settable on the standby. Another reason why it should be settable without a VACUUM, since there will never be one on the standby. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support