Re: [PERFORM] I/O on select count(*)
On May 18, 2008, at 1:28 AM, Greg Smith wrote: I just collected all the good internals information included in this thread and popped it onto http://wiki.postgresql.org/wiki/ Hint_Bits where I'll continue to hack away at the text until it's readable. Thanks to everyone who answered my questions here, that's good progress toward clearing up a very underdocumented area. I note a couple of potential TODO items not on the official list yet that came up during this discussion: -Smooth latency spikes when switching commit log pages by preallocating cleared pages before they are needed -Improve bulk loading by setting frozen hint bits for tuple inserts which occur within the same database transaction as the creation of the table into which they're being inserted Did I miss anything? I think everything brought up falls either into one of those two or the existing Consider having the background writer update the transaction status hint bits... TODO. -Evaluate impact of improved caching of CLOG per Greenplum: Per Luke Longergan: I'll find out if we can extract our code that did the work. It was simple but scattered in a few routines. In concept it worked like this: 1 - Ignore if hint bits are unset, use them if set. This affects heapam and vacuum I think. 2 - implement a cache for clog lookups based on the optimistic assumption that the data was inserted in bulk. Put the cache one call away from heapgetnext() I forget the details of (2). As I recall, if we fall off of the assumption, the penalty for long scans get large-ish (maybe 2X), but since when do people full table scan when they're updates/inserts are so scattered across TIDs? It's an obvious big win for DW work. -- Decibel!, aka Jim C. Nasby, Database Architect [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 smime.p7s Description: S/MIME cryptographic signature
Re: [PERFORM] I/O on select count(*)
On May 18, 2008, at 1:28 AM, Greg Smith wrote: I just collected all the good internals information included in this thread and popped it onto http://wiki.postgresql.org/wiki/ Hint_Bits where I'll continue to hack away at the text until it's readable. Thanks to everyone who answered my questions here, that's good progress toward clearing up a very underdocumented area. I note a couple of potential TODO items not on the official list yet that came up during this discussion: -Smooth latency spikes when switching commit log pages by preallocating cleared pages before they are needed -Improve bulk loading by setting frozen hint bits for tuple inserts which occur within the same database transaction as the creation of the table into which they're being inserted Did I miss anything? I think everything brought up falls either into one of those two or the existing Consider having the background writer update the transaction status hint bits... TODO. Blah, sorry for the double-post, but I just remembered a few things... Did we completely kill the idea of the bg_writer *or some other background process* being responsible for setting all hint-bits on dirty pages before they're written out? Also, Simon and Tom had an idea at PGCon: Don't set hint-bits in the back-end if the page isn't already dirty. We'd likely need some heuristics on this... based on Luke's comments about improved CLOG caching maybe we want to set the bits anyway if the tuples without them set are from old transactions (idea being that pulling those CLOG pages would be pretty expensive). Or better yet; if we have to read a CLOG page off disk, set the bits. This could still potentially be a big disadvantage for data warehouses; though perhaps the way to fix that is recommend a backgrounded vacuum after data load. -- Decibel!, aka Jim C. Nasby, Database Architect [EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 smime.p7s Description: S/MIME cryptographic signature
Re: [PERFORM] I/O on select count(*)
Hi Hannu, Interesting suggestion on the partial index! I'll find out if we can extract our code that did the work. It was simple but scattered in a few routines. In concept it worked like this: 1 - Ignore if hint bits are unset, use them if set. This affects heapam and vacuum I think. 2 - implement a cache for clog lookups based on the optimistic assumption that the data was inserted in bulk. Put the cache one call away from heapgetnext() I forget the details of (2). As I recall, if we fall off of the assumption, the penalty for long scans get large-ish (maybe 2X), but since when do people full table scan when they're updates/inserts are so scattered across TIDs? It's an obvious big win for DW work. We also have a GUC to turn it off if needed, in which case a vacuum will write the hint bits. - Luke - Original Message - From: Hannu Krosing [EMAIL PROTECTED] To: Luke Lonergan Cc: Pavan Deolasee [EMAIL PROTECTED]; Greg Smith [EMAIL PROTECTED]; Alvaro Herrera [EMAIL PROTECTED]; pgsql-performance@postgresql.org pgsql-performance@postgresql.org Sent: Thu May 22 12:10:02 2008 Subject: Re: [PERFORM] I/O on select count(*) On Thu, 2008-05-15 at 10:52 +0800, Luke Lonergan wrote: BTW – we’ve removed HINT bit checking in Greenplum DB and improved the visibility caching which was enough to provide performance at the same level as with the HINT bit optimization, but avoids this whole “write the data, write it to the log also, then write it again just for good measure” behavior. For people doing data warehousing work like the poster, this Postgres behavior is miserable. It should be fixed for 8.4 for sure (volunteers?) I might try it. I think I have told you about my ideas ;) I plan to first do cacheing (for being able to doi index only scans among other things) and then if the cache works reliably, use the cacheing code as the main visibility / MVCC mechanism. Is Greenplums code available, or should I roll my own ? BTW – for the poster’s benefit, you should implement partitioning by date, then load each partition and VACUUM ANALYZE after each load. You probably won’t need the date index anymore – so your load times will vastly improve (no indexes), you’ll store less data (no indexes) and you’ll be able to do simpler data management with the partitions. You may also want to partition AND index if you do a lot of short range selective date predicates. Example would be: partition by day, index on date field, queries selective on date ranges by hour will then select out only the day needed, then index scan to get the hourly values. If your queries allow it, you may try indexing on int2::extract('HOUR' from date) so the index may be smaller storing the date as type abstime is another way to reduce index size. Typically time-oriented data is nearly time sorted anyway, so you’ll also get the benefit of a clustered index. Hannu
Re: [PERFORM] I/O on select count(*)
Alvaro Herrera [EMAIL PROTECTED] writes: pg_clog is allocated in pages of 8kB apiece(*). On allocation, pages are zeroed, which is the bit pattern for transaction in progress. So when a transaction starts, it only needs to ensure that the pg_clog page that corresponds to it is allocated, but it need not write anything to it. Yeah, that's pretty-much how I imagined it would be. Nice and compact. I would imagine that if there are only a few transactions, doing the pg_clog lookup would be remarkably quick. However, once there have been a bazillion transactions, with tuples pointing to the whole range of them, it would degenerate into having to perform an extra seek for each tuple, and that's why you added the hint bits. On Fri, 16 May 2008, Tom Lane wrote: One additional point: this means that one transaction in every 32K writing transactions *does* have to do extra work when it assigns itself an XID, namely create and zero out the next page of pg_clog. And that doesn't just slow down the transaction in question, but the next few guys that would like an XID but arrive on the scene while the zeroing-out is still in progress. This probably contributes to the behavior that Simon and Josh regularly complain about, that our transaction execution time is subject to unpredictable spikes. I'm not sure how to get rid of it though. Does it really take that long to zero out 8kB of RAM? I thought CPUs were really quick at doing that! Anyway, the main thing you need to avoid is all the rest of the transactions waiting for the new pg_clog page. The trick is to generate the new page early, outside any locks on existing pages. It doesn't necessarily need to be done by a daemon thread at all. Matthew -- I'm NOT paranoid! Which of my enemies told you this? -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Matthew Wakeling wrote: On Fri, 16 May 2008, Tom Lane wrote: One additional point: this means that one transaction in every 32K writing transactions *does* have to do extra work when it assigns itself an XID, namely create and zero out the next page of pg_clog. And that doesn't just slow down the transaction in question, but the next few guys that would like an XID but arrive on the scene while the zeroing-out is still in progress. This probably contributes to the behavior that Simon and Josh regularly complain about, that our transaction execution time is subject to unpredictable spikes. I'm not sure how to get rid of it though. Does it really take that long to zero out 8kB of RAM? I thought CPUs were really quick at doing that! Yea, that was my assumption too. -- Bruce Momjian [EMAIL PROTECTED]http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Bruce Momjian [EMAIL PROTECTED] writes: Matthew Wakeling wrote: Does it really take that long to zero out 8kB of RAM? I thought CPUs were really quick at doing that! Yea, that was my assumption too. You have to write the page (to be sure there is space for it on disk) not only zero it. This design is kind of a holdover, though, from back when we had one ever-growing clog file. Today I'd be inclined to think about managing it more like pg_xlog, ie, have some background process pre-create a whole segment file at a time. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Mon, 19 May 2008, Matthew Wakeling wrote: Does it really take that long to zero out 8kB of RAM? I thought CPUs were really quick at doing that! You don't get the whole CPU--you get time slices of one. Some of the cases complaints have come in about have over a thousand connections all fighting for CPU time, and making every one of them block for one guy who needs to fiddle with memory for a while can be a problem. If you're unlucky you won't even be on the same CPU you started on each time you get a little check of time, and you'll run at the speed of RAM rather than that of the CPU--again, fighting for RAM access with every other process on the server. The real question in my mind is why this turns into a bottleneck before the similar task of cleaning the 16MB XLOG segment does. I expected that one would need to be cracked before the CLOG switch time could possibly be an issue, but reports from the field seem to suggest otherwise. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Greg Smith [EMAIL PROTECTED] writes: The real question in my mind is why this turns into a bottleneck before the similar task of cleaning the 16MB XLOG segment does. Because we do the latter off-line, or at least try to. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
I just collected all the good internals information included in this thread and popped it onto http://wiki.postgresql.org/wiki/Hint_Bits where I'll continue to hack away at the text until it's readable. Thanks to everyone who answered my questions here, that's good progress toward clearing up a very underdocumented area. I note a couple of potential TODO items not on the official list yet that came up during this discussion: -Smooth latency spikes when switching commit log pages by preallocating cleared pages before they are needed -Improve bulk loading by setting frozen hint bits for tuple inserts which occur within the same database transaction as the creation of the table into which they're being inserted Did I miss anything? I think everything brought up falls either into one of those two or the existing Consider having the background writer update the transaction status hint bits... TODO. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thu, 15 May 2008, Alvaro Herrera wrote: Starting a transaction does not write anything to pg_clog. For Matt and others, some details here are in src/backend/access/transam/README: pg_clog records the commit status for each transaction that has been assigned an XID. Transactions and subtransactions are assigned permanent XIDs only when/if they first do something that requires one --- typically, insert/update/delete a tuple, though there are a few other places that need an XID assigned. After reading the code and that documentation a bit, the part I'm still not sure about is whether the CLOG entry is created when the XID is assigned and then kept current as the state changes, or whether that isn't even in CLOG until the transaction is committed. It seems like the latter, but there's some ambiguity in the wording and too many code paths for me to map right now. From there, it doesn't make its way out to disk until the internal CLOG buffers are filled, at which point the least recently used buffer there is evicted to permanent storage. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Fri, 2008-05-16 at 14:05 -0400, Greg Smith wrote: After reading the code and that documentation a bit, the part I'm still not sure about is whether the CLOG entry is created when the XID is assigned and then kept current as the state changes, or whether that isn't even in CLOG until the transaction is committed. It seems like the latter, but there's some ambiguity in the wording and too many code paths for me to map right now. Alvaro already said this, I thought? The clog is updated only at sub or main transaction end, thank goodness. When the transactionid is assigned the page of the clog that contains that transactionid is checked to see if it already exists and if not, it is initialised. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Greg Smith wrote: After reading the code and that documentation a bit, the part I'm still not sure about is whether the CLOG entry is created when the XID is assigned and then kept current as the state changes, or whether that isn't even in CLOG until the transaction is committed. It seems like the latter, but there's some ambiguity in the wording and too many code paths for me to map right now. pg_clog is allocated in pages of 8kB apiece(*). On allocation, pages are zeroed, which is the bit pattern for transaction in progress. So when a transaction starts, it only needs to ensure that the pg_clog page that corresponds to it is allocated, but it need not write anything to it. (*) Each transaction needs 2 bits, so on a 8 kB page there is space for 4 transactions/byte * 8 pages * 1kB/page = 32k transactions. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Alvaro Herrera wrote: pg_clog is allocated in pages of 8kB apiece(*). On allocation, pages are zeroed, which is the bit pattern for transaction in progress. So when a transaction starts, it only needs to ensure that the pg_clog page that corresponds to it is allocated, but it need not write anything to it. Of course, in 8.3 it's not when the transaction starts, but when the Xid is assigned (i.e. when the transaction first calls a read-write command). In previous versions it happens when the first snapshot is taken (i.e. normally on the first command of any type, with very few exceptions.) -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Alvaro Herrera [EMAIL PROTECTED] writes: Greg Smith wrote: After reading the code and that documentation a bit, the part I'm still not sure about is whether the CLOG entry is created when the XID is assigned and then kept current as the state changes, or whether that isn't even in CLOG until the transaction is committed. pg_clog is allocated in pages of 8kB apiece(*). On allocation, pages are zeroed, which is the bit pattern for transaction in progress. So when a transaction starts, it only needs to ensure that the pg_clog page that corresponds to it is allocated, but it need not write anything to it. One additional point: this means that one transaction in every 32K writing transactions *does* have to do extra work when it assigns itself an XID, namely create and zero out the next page of pg_clog. And that doesn't just slow down the transaction in question, but the next few guys that would like an XID but arrive on the scene while the zeroing-out is still in progress. This probably contributes to the behavior that Simon and Josh regularly complain about, that our transaction execution time is subject to unpredictable spikes. I'm not sure how to get rid of it though. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Hi, Luke Lonergan wrote: BTW – we’ve removed HINT bit checking in Greenplum DB and improved the visibility caching which was enough to provide performance at the same level as with the HINT bit optimization, but avoids this whole “write the data, write it to the log also, then write it again just for good measure” behavior. can you go a bit deeper into how you implemented this or is it some IP of greenplum you cannot reveal? Btw, is there something with your eyes: FONT SIZE=4FONT FACE=Verdana, Helvetica, ArialSPAN STYLE='font-size:14pt' ? :-)) Cheers Tino smime.p7s Description: S/MIME Cryptographic Signature
Re: [PERFORM] I/O on select count(*)
Jan de Visser [EMAIL PROTECTED] writes: Obviously, this issue is tied to the slow count(*) one, as I found out the hard way. Consider the following scenario: * Insert row * Update that row a couple of times * Rinse and repeat many times Now somewhere during that cycle, do a select count(*) just to see where you are. You will be appalled by how slow that is, due to not only the usual 'slow count(*)' reasons. This whole hint bit business makes it even worse, as demonstrated by the fact that running a vacuum before the count(*) makes the latter noticably faster. Uh, well, you can't blame that entirely on hint-bit updates. The vacuum has simply *removed* two-thirds of the rows in the system, resulting in a large drop in the number of rows that the select even has to look at. It's certainly true that hint-bit updates cost something, but quantifying how much isn't easy. The off-the-cuff answer is to do the select count(*) twice and see how much cheaper the second one is. But there are two big holes in that answer: the first is the possible cache effects from having already read in the pages, and the second is that the follow-up scan gets to avoid the visits to pg_clog that the first scan had to make (which after all is the point of the hint bits). I don't know any easy way to disambiguate the three effects that are at work here. But blaming it all on the costs of writing out hint-bit updates is wrong. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thu, 15 May 2008, Luke Lonergan wrote: BTW we¹ve removed HINT bit checking in Greenplum DB and improved the visibility caching which was enough to provide performance at the same level as with the HINT bit optimization, but avoids this whole ³write the data, write it to the log also, then write it again just for good measure² behavior. This sounds like a good option. I believe I suggested this a few months ago, however it was rejected because in the worst case (when the hints are not cached), if you're doing an index scan, you can do twice the number of seeks as before. http://archives.postgresql.org/pgsql-performance/2007-12/msg00217.php The hint data will be four bits per tuple plus overheads, so it could be made very compact, and therefore likely to stay in the cache fairly well. Each tuple fetched would have to be spaced really far apart in the database table in order to exhibit the worst case, because fetching a page of hint cache will cause 64kB or so of disc to appear in the disc's read-ahead buffer, which will be equivalent to 128MB worth of database table (assuming eight tuples per block and no overhead). As soon as you access another tuple in the same 128MB bracket, you'll hit the disc read-ahead buffer for the hints. On balance, to me it still seems like a good option. Matthew -- Those who do not understand Unix are condemned to reinvent it, poorly. -- Henry Spencer -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thursday 15 May 2008 03:02:19 Tom Lane wrote: Jan de Visser [EMAIL PROTECTED] writes: Obviously, this issue is tied to the slow count(*) one, as I found out the hard way. Consider the following scenario: * Insert row * Update that row a couple of times * Rinse and repeat many times Now somewhere during that cycle, do a select count(*) just to see where you are. You will be appalled by how slow that is, due to not only the usual 'slow count(*)' reasons. This whole hint bit business makes it even worse, as demonstrated by the fact that running a vacuum before the count(*) makes the latter noticably faster. Uh, well, you can't blame that entirely on hint-bit updates. The vacuum has simply *removed* two-thirds of the rows in the system, resulting in a large drop in the number of rows that the select even has to look at. It's certainly true that hint-bit updates cost something, but quantifying how much isn't easy. The off-the-cuff answer is to do the select count(*) twice and see how much cheaper the second one is. But there are two big holes in that answer: the first is the possible cache effects from having already read in the pages, and the second is that the follow-up scan gets to avoid the visits to pg_clog that the first scan had to make (which after all is the point of the hint bits). I don't know any easy way to disambiguate the three effects that are at work here. But blaming it all on the costs of writing out hint-bit updates is wrong. regards, tom lane True. But it still contributes to the fact that queries sometimes behave in a non-deterministic way, which IMHO is the major annoyance when starting to work with pgsql. And contrary to other causes (vacuum, checkpoints) this is woefully underdocumented. jan -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thu, 15 May 2008, Heikki Linnakangas wrote: Is it really safe to update the hint bits in place? If there is a power cut in the middle of writing a block, is there a guarantee from the disc that the block will never be garbled? Don't know, to be honest. We've never seen any reports of corrupted data that would suggest such a problem, but it doesn't seem impossible to me that some exotic storage system might do that. Hmm. That problem is what WAL full-page-writes is meant to handle, isn't it? So basically, if you're telling people that WAL full-page-writes is safer than partial WAL, because it avoids updating pages in-place, then you shouldn't be updating pages in-place for the hint bits either. You can't win! In fact, if the tuple's creating transaction has aborted, then the tuple can be vacuumed right there and then before it is even written. Not if you have any indexes on the table. To vacuum, you'll have to scan all indexes to remove pointers to the tuple. Ah. Well, would that be so expensive? After all, someone has to do it eventually, and these are index entries that have only just been added anyway. I can understand index updating being a bit messy in the middle of a checkpoint though, as you would have to write the update to the WAL, which you are checkpointing... So, I don't know exactly how the WAL updates to indexes work, but my guess is that it has been implemented as write the blocks that we would change to the WAL. The problem with this is that all the changes to the index are done individually, so there's no easy way to undo one of them later on when you find out that the transaction has been aborted during the checkpoint. An alternative would be to build a list of changes in the WAL without actually changing the underlying index at all. When reading the index, you would read the list first (which would be in memory, and in an efficient-to-search structure), then read the original index and add the two. Then when checkpointing, vet all the changes against known aborted transactions before making all the changes to the index together. This is likely to speed up index writes quite a bit, and also allow you to effectively vacuum aborted tuples before they get written to the disc. Matthew -- Vacuums are nothings. We only mention them to let them know we know they're there. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Matthew Wakeling wrote: On Thu, 15 May 2008, Luke Lonergan wrote: ...HINT bit optimization, but avoids this whole ³write the data, write it to the log also, then write it again just for good measure² ... The hint data will be four bits per tuple plus overheads, so it could be made very compact, and therefore likely to stay in the cache fairly well. Does it seem like these HINT bits would be good candidates to move off to map forks similar to how the visibility map stuff will be handled? Since (if I understand right) only the hint bits change during the select(*) it seems a lot less write-IO would happen if such a map were updated rather than the data pages themselves. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Greg Smith escribió: On Thu, 15 May 2008, Pavan Deolasee wrote: I had suggested in the past that whenever we set hint bits for a tuple, we should check all other tuples in the page and set their hint bits too to avoid multiple writes of the same page. I guess the idea got rejected because of lack of benchmarks to prove the benefit. From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the impression the idea was to have the background writer get involved to help with this particular situation. The problem is that the bgwriter does not understand about the content of the pages it is writing -- they're opaque pages for all it knows. So it cannot touch the hint bits. I agree with Pavan that it's likely that setting hint bits in batches instead of just for the tuple being examined is a benefit. However, it's perhaps not so good to be doing it in a foreground process, because you're imposing extra cost to the client queries which we want to be as fast as possible. Perhaps the thing to do is have a database-local bgwriter which would scan pages and do this kind of change ... a different kind of process to be launched by autovacuum perhaps. If we had the bitmask in a separate map fork, this could be cheap. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Matthew Wakeling [EMAIL PROTECTED] writes: Hmm. That problem is what WAL full-page-writes is meant to handle, isn't it? So basically, if you're telling people that WAL full-page-writes is safer than partial WAL, because it avoids updating pages in-place, then you shouldn't be updating pages in-place for the hint bits either. You can't win! This argument ignores the nature of the data change. With a hint-bit update, no data is being shuffled around, so there is no danger from a partial page write. A disk that leaves an individual sector corrupt would be a problem, but I don't think that's a huge risk. Keep in mind that disks aren't designed to just stop dead when power dies --- they are made to be able to park their heads before the juice is entirely gone. I think it's reasonable to assume they'll finish writing the sector in progress before they start parking. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Matthew Wakeling wrote: On Thu, 15 May 2008, Heikki Linnakangas wrote: Is it really safe to update the hint bits in place? If there is a power cut in the middle of writing a block, is there a guarantee from the disc that the block will never be garbled? Don't know, to be honest. We've never seen any reports of corrupted data that would suggest such a problem, but it doesn't seem impossible to me that some exotic storage system might do that. Hmm. That problem is what WAL full-page-writes is meant to handle, isn't it? So basically, if you're telling people that WAL full-page-writes is safer than partial WAL, because it avoids updating pages in-place, then you shouldn't be updating pages in-place for the hint bits either. You can't win! Full-page-writes protect from torn pages, that is, when one half of an update hits the disk but the other one doesn't. In particular, if the beginning of the page where the WAL pointer (XLogRecPtr) is flushed to disk, but the actual changes elsewhere in the page aren't, you're in trouble. WAL replay will look at the WAL pointer, and think that the page doesn't need to be replayed, while other half of the update is still missing. Hint bits are different. We're only updating a single bit, and it doesn't matter from correctness point of view whether the hint bit update hits the disk or not. But what would spell trouble is if the disk controller/whatever garbles the whole sector, IOW changes something else than the changed bit, while doing the update. In fact, if the tuple's creating transaction has aborted, then the tuple can be vacuumed right there and then before it is even written. Not if you have any indexes on the table. To vacuum, you'll have to scan all indexes to remove pointers to the tuple. Ah. Well, would that be so expensive? After all, someone has to do it eventually, and these are index entries that have only just been added anyway. Scanning all indexes? Depends on your table of course, but yes it would be expensive in general. An alternative would be to build a list of changes in the WAL without actually changing the underlying index at all. When reading the index, you would read the list first (which would be in memory, and in an efficient-to-search structure), then read the original index and add the two. Then when checkpointing, vet all the changes against known aborted transactions before making all the changes to the index together. This is likely to speed up index writes quite a bit, and also allow you to effectively vacuum aborted tuples before they get written to the disc. There's not much point optimizing something that only helps with aborted transactions. The general problem with any idea that involves keeping a list of changes made in a transaction is that that list will grow big during bulk loads, so you'll have to overflow to disk or abandon the list approach. Which means that it won't help with bulk loads. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Alvaro Herrera wrote: Greg Smith escribió: On Thu, 15 May 2008, Pavan Deolasee wrote: I had suggested in the past that whenever we set hint bits for a tuple, we should check all other tuples in the page and set their hint bits too to avoid multiple writes of the same page. I guess the idea got rejected because of lack of benchmarks to prove the benefit. From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the impression the idea was to have the background writer get involved to help with this particular situation. The problem is that the bgwriter does not understand about the content of the pages it is writing -- they're opaque pages for all it knows. So it cannot touch the hint bits. We know what kind of a relation we're dealing with in ReadBuffer, so we could add a flag to BufferDesc to mark heap pages. If we had the bitmask in a separate map fork, this could be cheap. I don't buy that. The point of a hint bit is that it's right there along with the tuple you're looking at. If you have to look at a separate buffer, you might as well just look at clog. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thu, 15 May 2008, Heikki Linnakangas wrote: There's not much point optimizing something that only helps with aborted transactions. That's fair enough, but this list method is likely to speed up index writes anyway. The general problem with any idea that involves keeping a list of changes made in a transaction is that that list will grow big during bulk loads, so you'll have to overflow to disk or abandon the list approach. Which means that it won't help with bulk loads. Yeah, it wouldn't be a list of changes for the transaction, it would be a list of changes since the last checkpoint. Keeping data in memory for the length of the transaction is doomed to failure, because there is no bound on its size, so bulk loads are still going to miss out on hint optimisation. Matthew -- for a in past present future; do for b in clients employers associates relatives neighbours pets; do echo The opinions here in no way reflect the opinions of my $a $b. done; done -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Heikki Linnakangas escribió: Alvaro Herrera wrote: The problem is that the bgwriter does not understand about the content of the pages it is writing -- they're opaque pages for all it knows. So it cannot touch the hint bits. We know what kind of a relation we're dealing with in ReadBuffer, so we could add a flag to BufferDesc to mark heap pages. Hmm, I was thinking that it would need access to the catalogs to know where the tuples are, but that's certainly not true, so perhaps it could be made to work. If we had the bitmask in a separate map fork, this could be cheap. I don't buy that. The point of a hint bit is that it's right there along with the tuple you're looking at. If you have to look at a separate buffer, you might as well just look at clog. True -- I was confusing this with the idea of having the tuple MVCC header (xmin, xmax, etc) in a separate fork, which would make the idea of index-only scans more feasible at the expense of seqscans. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Matthew Wakeling wrote: Aside from the rest of commentary, a slight clarification: So, as I understand it, Postgres works like this: 1. You begin a transaction. Postgres writes an entry into pg_clog. Starting a transaction does not write anything to pg_clog. -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Alvaro Herrera [EMAIL PROTECTED] writes: Heikki Linnakangas escribió: We know what kind of a relation we're dealing with in ReadBuffer, so we could add a flag to BufferDesc to mark heap pages. Hmm, I was thinking that it would need access to the catalogs to know where the tuples are, but that's certainly not true, so perhaps it could be made to work. The issue in my mind is not so much could bgwriter physically do it as that it's a violation of module layering. That has real consequences, like potential for deadlocks. It'll become particularly pressing if we go forward with the plans to get rid of the separate dedicated buffers for pg_clog etc and have them work in the main shared-buffer pool. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Tom Lane wrote: It's certainly true that hint-bit updates cost something, but quantifying how much isn't easy. Maybe we can instrument the code with DTrace probes to quantify the actual costs. I'm not familiar with the code, but if I know where to place the probes, I can easily do a quick test and provide the data. The off-the-cuff answer is to do the select count(*) twice and see how much cheaper the second one is. Doesn't seem the second run is cheaper as shown in the results below. The data came from the probes I've added recently. *** Run #1 ** SQL Statement : select count(*) from accounts; Execution time : 1086.58 (ms) Buffer Read Counts Tablespace Database Table Count 1663 16384 1247 1 1663 16384 2600 1 1663 16384 2703 1 1663 16384 1255 2 1663 16384 2650 2 1663 16384 2690 3 1663 16384 2691 3 1663 16384 16397 8390 Dirty Buffer Write Counts = Tablespace Database Table Count 1663 16384 16397 2865 Total buffer cache hits : 1932 Total buffer cache misses: 6471 Average read time from cache : 5638 (ns) Average read time from disk :143371 (ns) Average write time to disk : 20368 (ns) *** Run #2 ** SQL Statement : select count(*) from accounts; Execution time : 1115.94 (ms) Buffer Read Counts Tablespace Database Table Count 1663 16384 16397 8390 Dirty Buffer Write Counts = Tablespace Database Table Count 1663 16384 16397 2865 Total buffer cache hits : 1931 Total buffer cache misses: 6459 Average read time from cache : 4357 (ns) Average read time from disk :154127 (ns) Average write time to disk : 20368 (ns) -Robert -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Robert Lor [EMAIL PROTECTED] writes: Tom Lane wrote: It's certainly true that hint-bit updates cost something, but quantifying how much isn't easy. Maybe we can instrument the code with DTrace probes to quantify the actual costs. Hmm, the problem would be trying to figure out what percentage of writes could be blamed solely on hint-bit updates and not any other change to the page. I don't think that the bufmgr currently keeps enough state to know that, but you could probably modify it easily enough, since callers distinguish MarkBufferDirty from SetBufferCommitInfoNeedsSave. Define another flag bit that's set only by the first, and test it during write-out. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Tom Lane wrote: Hmm, the problem would be trying to figure out what percentage of writes could be blamed solely on hint-bit updates and not any other change to the page. I don't think that the bufmgr currently keeps enough state to know that, but you could probably modify it easily enough, since callers distinguish MarkBufferDirty from SetBufferCommitInfoNeedsSave. Define another flag bit that's set only by the first, and test it during write-out. Ok, I made a few changes to bufmgr per my understanding of your description above and with my limited understanding of the code. Patch is attached. Assuming the patch is correct, the effect of writes due to hint bits is quite significant. I collected the data below by runing pgbench in one terminal and psql on another to run the query. Is the data plausible? -Robert -- Backend PID: 16189 SQL Statement : select count(*) from accounts; Execution time : 17.33 sec Buffer Read Counts Tablespace Database Table Count 1663 16384 2600 1 1663 16384 2601 1 1663 16384 2615 1 1663 16384 1255 2 1663 16384 2602 2 1663 16384 2603 2 1663 16384 2616 2 1663 16384 2650 2 1663 16384 2678 2 1663 16384 1247 3 1663 16384 1249 3 1663 16384 2610 3 1663 16384 2655 3 1663 16384 2679 3 1663 16384 2684 3 1663 16384 2687 3 1663 16384 2690 3 1663 16384 2691 3 1663 16384 2703 4 1663 16384 1259 5 1663 16384 2653 5 1663 16384 2662 5 1663 16384 2663 5 1663 16384 2659 7 1663 16384 16397 8390 Dirty Buffer Write Counts = Tablespace Database Table Count 1663 16384 16402 2 1663 16384 16394 11 1663 16384 16397 4771 == Hint Bits Write Counts == Tablespace Database Table Count 1663 16384 16397 4508 Total buffer cache hits : 732 Total buffer cache misses: 7731 Average read time from cache : 9136 (ns) Average read time from disk :384201 (ns) Average write time to disk :210709 (ns) Backend PID: 16189 SQL Statement : select count(*) from accounts; Execution time : 12.72 sec Buffer Read Counts Tablespace Database Table Count 1663 16384 16397 8392 Dirty Buffer Write Counts = Tablespace Database Table Count 1663 16384 16394 6 1663 16384 16402 7 1663 16384 16397 2870 == Hint Bits Write Counts == Tablespace Database Table Count 1663 16384 16402 2 1663 16384 16397 2010 Total buffer cache hits : 606 Total buffer cache misses: 7786 Average read time from cache : 6949 (ns) Average read time from disk :706288 (ns) Average write time to disk : 90426 (ns) Index: bufmgr.c === RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.228 diff -u -3 -p -r1.228 bufmgr.c --- bufmgr.c 1 Jan 2008 19:45:51 - 1.228 +++ bufmgr.c 15 May 2008 20:56:38 - @@ -42,6 +42,7 @@ #include storage/smgr.h #include utils/resowner.h #include pgstat.h +#include pg_trace.h /* Note: these two macros only work on shared buffers, not local ones! */ @@ -171,6 +172,7 @@ ReadBuffer_common(Relation reln, BlockNu if (isExtend) blockNum = smgrnblocks(reln-rd_smgr); + TRACE_POSTGRESQL_BUFFER_READ_START(blockNum, reln-rd_node.spcNode, reln-rd_node.dbNode, reln-rd_node.relNode, isLocalBuf); pgstat_count_buffer_read(reln); if (isLocalBuf) @@ -200,12 +202,16 @@ ReadBuffer_common(Relation reln, BlockNu { if (!isExtend) { + TRACE_POSTGRESQL_BUFFER_HIT(); /* Just need to update stats before we exit */ pgstat_count_buffer_hit(reln); if (VacuumCostActive) VacuumCostBalance += VacuumCostPageHit; + TRACE_POSTGRESQL_BUFFER_READ_DONE(blockNum, + reln-rd_node.spcNode, reln-rd_node.dbNode, + reln-rd_node.relNode, isLocalBuf, found); return BufferDescriptorGetBuffer(bufHdr); } @@ -257,6 +263,7 @@ ReadBuffer_common(Relation reln, BlockNu } while (!StartBufferIO(bufHdr, true)); } } + TRACE_POSTGRESQL_BUFFER_MISS(); /* * if we have gotten to this
Re: [PERFORM] I/O on select count(*)
Alvaro Herrera wrote: Hint bits are used to mark tuples as created and/or deleted by transactions that are know committed or aborted. To determine the visibility of a tuple without such bits set, you need to consult pg_clog and possibly pg_subtrans, so it is an expensive check. On the other So, how come there is this outstanding work to do, which will inevitably be done, and it hasn't been done until it is 'just too late' to avoid getting in the way of the query? The OP didn't suggest that he had just loaded the data. Also - is it the case that this only affects the case where updated pages were spilled during the transaction that changed them? ie, if we commit a transaction and there are changed rows still in the cache since their pages are not evicted yet, are the hint bits set immediately so that page is written just once? Seems this would be common in most OLTP systems. Heikki points out that the list might get big and need to be abandoned, but then you fall back to scheduling a clog scan that can apply the bits, which does what you have now, though hopefully in a way that fills slack disk IO rather than waiting for the read. Matthew says: 'it would be a list of changes since the last checkpoint' but I don't see why you can't start writing hints to in-memory pages as soon as the transaction ends. You might fall behind, but I doubt it with modern CPU speeds. I can't see why Pavan's suggestion to try to update as many of the bits as possible when a dirty page is evicted would be contentious. I do think this is something of interest to users, not just developers, since it may influence the way updates are processed where it is reasonable to do so in 'bite sized chunks' as a multipart workflow. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Luke Lonergan [EMAIL PROTECTED] writes: BTW we¹ve removed HINT bit checking in Greenplum DB and improved the visibility caching which was enough to provide performance at the same level as with the HINT bit optimization, but avoids this whole ³write the data, write it to the log also, then write it again just for good measure² behavior. For people doing data warehousing work like the poster, this Postgres behavior is miserable. It should be fixed for 8.4 for sure (volunteers?) For people doing data warehousing I would think the trick would be to do something like what we do to avoid WAL logging for tables created in the same transaction. That is, if you're loading a lot of data at the same time then all of that data is going to be aborted or committed and that will happen at the same time. Ideally we would find a way to insert the data with the hint bits already set to committed and mark the section of the table as being only provisionally extended so other transactions wouldn't even look at those pages until the transaction commits. This is similar to the abortive attempt to have the abovementioned WAL logging trick insert the records pre-frozen. I recall there were problems with that idea though but I don't recall if they were insurmountable or just required more work. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thu, May 15, 2008 at 5:11 PM, in message [EMAIL PROTECTED], James Mansion [EMAIL PROTECTED] wrote: Alvaro Herrera wrote: Hint bits are used to mark tuples as created and/or deleted by transactions that are know committed or aborted. To determine the visibility of a tuple without such bits set, you need to consult pg_clog and possibly pg_subtrans, so it is an expensive check. On the other So, how come there is this outstanding work to do, which will inevitably be done, and it hasn't been done until it is 'just too late' to avoid getting in the way of the query? There has been discussion from time to time about setting the hint bits for tuple inserts which occur within the same database transaction as the creation of the table into which they're being inserted. That would allow people to cover many of the bulk load situations. I don't see it on the task list. (I would also argue that there is little information lost, even from a forensic perspective, to writing such rows as frozen.) Is this idea done, dead, or is someone working on it? If we could set hint bits on dirty buffer pages after the commit, we'd cover the OLTP situation. In many situations, there is a bigger OS cache than PostgreSQL shared memory, and an attempt to set the bits soon after the commit would coalesce the two writes into one physical write using RAM-based access, which would be almost as good. I don't know if it's feasible to try to do that after the pages have moved from the PostgreSQL cache to the OS cache, but it would likely be a performance win. If we are going to burden any requester process with the job of setting the hint bits, it would typically be better to burden the one doing the data modification rather than some random thread later trying to read data from the table. Of course, getting work off the requester processes onto some background worker process is generally even better. -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Wed, May 14, 2008 at 4:09 PM, Doug Eck [EMAIL PROTECTED] wrote: I have a large table (~ 2B rows) that contains an indexed timestamp column. I am attempting to run a query to determine the number of rows for a given day using something like select count(*) from tbl1 where ts between '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'. Explain tells me that the query will be done using an index scan (as I would expect), and I realize that it is going to take a while. My question concerns some unusual I/O activity on the box (SUSE) when I run the query. For the first couple of minutes I see reads only. After that vmstat shows mixed reads and writes in a ratio of about 1 block read to 5 blocks written. We have determined that files in our data and log partitions are being hit, but the file system itself is not growing during this time (it appears to be writing over the same chunk of space over and over again). Memory on the box is not being swapped while all of this is happening. I would have guessed that a select count(*) would not require a bunch of writes, and I can't begin to figure out why the number of blocks written are so much higher than the blocks read. If I modify the where clause to only count the rows for a given minute or two, I see the reads but I never see the unusual write behavior. Any thoughts into what could be going on? Thanks in advance for your help. can you post the exact output of explain analyze? (or, at least, explain if the query takes too long) merlin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
- Original Message From: Merlin Moncure [EMAIL PROTECTED] To: Doug Eck [EMAIL PROTECTED] Cc: pgsql-performance@postgresql.org Sent: Wednesday, May 14, 2008 3:38:23 PM Subject: Re: [PERFORM] I/O on select count(*) On Wed, May 14, 2008 at 4:09 PM, Doug Eck [EMAIL PROTECTED] wrote: I have a large table (~ 2B rows) that contains an indexed timestamp column. I am attempting to run a query to determine the number of rows for a given day using something like select count(*) from tbl1 where ts between '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'. Explain tells me that the query will be done using an index scan (as I would expect), and I realize that it is going to take a while. My question concerns some unusual I/O activity on the box (SUSE) when I run the query. For the first couple of minutes I see reads only. After that vmstat shows mixed reads and writes in a ratio of about 1 block read to 5 blocks written. We have determined that files in our data and log partitions are being hit, but the file system itself is not growing during this time (it appears to be writing over the same chunk of space over and over again). Memory on the box is not being swapped while all of this is happening. I would have guessed that a select count(*) would not require a bunch of writes, and I can't begin to figure out why the number of blocks written are so much higher than the blocks read. If I modify the where clause to only count the rows for a given minute or two, I see the reads but I never see the unusual write behavior. Any thoughts into what could be going on? Thanks in advance for your help. can you post the exact output of explain analyze? (or, at least, explain if the query takes too long) merlin The query takes a long time to run, so I'll start with the explain output. I can run explain analyze (given enough time) if you believe its output could hold some clues. db_2008= explain select count(*) from ot_2008_05 where transact_time between '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'; QUERY PLAN --- Aggregate (cost=10368613.47..10368613.48 rows=1 width=0) - Index Scan using ot_2008_05_ak2 on ot_2008_05 (cost=0.00..10011333.27 rows=142912078 width=0) Index Cond: ((transact_time = '2008-05-12 00:00:00-04'::timestamp with time zone) AND (transact_time = '2008-05-12 23:59:59.999-04'::timestamp with time zone)) (3 rows) db_2008= Doug
Re: [PERFORM] I/O on select count(*)
Doug Eck [EMAIL PROTECTED] wrote: I am attempting to run a query to determine the number of rows for a given day using something like select count(*) from tbl1 where ts between '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'. Explain tells me that the query will be done using an index scan (as I would expect), and I realize that it is going to take a while. My question concerns some unusual I/O activity on the box (SUSE) when I run the query. For the first couple of minutes I see reads only. After that vmstat shows mixed reads and writes in a ratio of about 1 block read to 5 blocks written. Any thoughts into what could be going on? Thanks in advance for your help. Odd as it may seem, a SELECT can cause a page to be rewritten. If this is the first time that the rows are being read since they were inserted (or since the database was loaded, including from backup), it may be rewriting the rows to set hint bits, which can make subsequent access faster. The best solution may be to vacuum more often. http://archives.postgresql.org/pgsql-performance/2007-12/msg00206.php -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Wed, 14 May 2008, Kevin Grittner wrote: If this is the first time that the rows are being read since they were inserted (or since the database was loaded, including from backup), it may be rewriting the rows to set hint bits, which can make subsequent access faster. This is the second time this has come up recently, and I know it used to puzzle me too. This is a particularly relevant area to document better for people doing benchmarking. As close I've found to a useful commentary on this subject is the thread at http://archives.postgresql.org/pgsql-patches/2005-07/msg00390.php I still don't completely understand this myself though, if I did I'd add a FAQ on it. Anyone want to lecture for a minute on the birth and care of hint bits? I'll make sure any comments here get onto the wiki. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
Greg Smith wrote: On Wed, 14 May 2008, Kevin Grittner wrote: If this is the first time that the rows are being read since they were inserted (or since the database was loaded, including from backup), it may be rewriting the rows to set hint bits, which can make subsequent access faster. This is the second time this has come up recently, and I know it used to puzzle me too. This is a particularly relevant area to document better for people doing benchmarking. As close I've found to a useful commentary on this subject is the thread at http://archives.postgresql.org/pgsql-patches/2005-07/msg00390.php I still don't completely understand this myself though, if I did I'd add a FAQ on it. Anyone want to lecture for a minute on the birth and care of hint bits? I'll make sure any comments here get onto the wiki. Hint bits are used to mark tuples as created and/or deleted by transactions that are know committed or aborted. To determine the visibility of a tuple without such bits set, you need to consult pg_clog and possibly pg_subtrans, so it is an expensive check. On the other hand, if the tuple has the bits set, then it's state is known (or, at worst, it can be calculated easily from your current snapshot, without looking at pg_clog.) There are four hint bits: XMIN_COMMITTED -- creating transaction is known committed XMIN_ABORTED -- creating transaction is known aborted XMAX_COMMITTED -- same, for the deleting transaction XMAX_ABORTED -- ditto If neither of the bits is set, then the transaction is either in progress (which you can check by examining the list of running transactions in shared memory) or your process is the first one to check (in which case, you need to consult pg_clog to know the status, and you can update the hint bits if you find out a permanent state). Regarding FAQs, I'm having trouble imagining putting this in the user FAQ; I think it belongs into the developer's FAQ. However, a benchmarker is not going to look there. Maybe we should start a benchmarker's FAQ? -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Wed, 14 May 2008, Alvaro Herrera wrote: If neither of the bits is set, then the transaction is either in progress (which you can check by examining the list of running transactions in shared memory) or your process is the first one to check (in which case, you need to consult pg_clog to know the status, and you can update the hint bits if you find out a permanent state). So is vacuum helpful here because it will force all that to happen in one batch? To put that another way: if I've run a manual vacuum, is it true that it will have updated all the hint bits to XMIN_COMMITTED for all the tuples that were all done when the vacuum started? Regarding FAQs, I'm having trouble imagining putting this in the user FAQ; I think it belongs into the developer's FAQ. However, a benchmarker is not going to look there. Maybe we should start a benchmarker's FAQ? On the wiki I've started adding a series of things that are performance-related FAQs. There's three of them mixed in the bottom of http://wiki.postgresql.org/wiki/Frequently_Asked_Questions right now, about slow count(*) and dealing with slow queries. Here the FAQ would be Why am I seeing all these writes when I'm just doing selects on my table?, and if it's mixed in with a lot of other performance related notes people should be able to find it. The answer and suggestions should be simple enough to be useful to a user who just noticed this behavior, while perhaps going into developer land for those who want to know more about the internals. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On 5/14/08, Greg Smith [EMAIL PROTECTED] wrote: On Wed, 14 May 2008, Alvaro Herrera wrote: If neither of the bits is set, then the transaction is either in progress (which you can check by examining the list of running transactions in shared memory) or your process is the first one to check (in which case, you need to consult pg_clog to know the status, and you can update the hint bits if you find out a permanent state). So is vacuum helpful here because it will force all that to happen in one batch? To put that another way: if I've run a manual vacuum, is it true that it will have updated all the hint bits to XMIN_COMMITTED for all the tuples that were all done when the vacuum started? From my benchmarking experience: Yes, vacuum helps. See also below. Regarding FAQs, I'm having trouble imagining putting this in the user FAQ; I think it belongs into the developer's FAQ. However, a benchmarker is not going to look there. Maybe we should start a benchmarker's FAQ? On the wiki I've started adding a series of things that are performance-related FAQs. There's three of them mixed in the bottom of http://wiki.postgresql.org/wiki/Frequently_Asked_Questions right now, about slow count(*) and dealing with slow queries. Here the FAQ would be Why am I seeing all these writes when I'm just doing selects on my table?, and if it's mixed in with a lot of other performance related notes people should be able to find it. The answer and suggestions should be simple enough to be useful to a user who just noticed this behavior, while perhaps going into developer land for those who want to know more about the internals. Obviously, this issue is tied to the slow count(*) one, as I found out the hard way. Consider the following scenario: * Insert row * Update that row a couple of times * Rinse and repeat many times Now somewhere during that cycle, do a select count(*) just to see where you are. You will be appalled by how slow that is, due to not only the usual 'slow count(*)' reasons. This whole hint bit business makes it even worse, as demonstrated by the fact that running a vacuum before the count(*) makes the latter noticably faster. jan -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thu, May 15, 2008 at 7:51 AM, Greg Smith [EMAIL PROTECTED] wrote: So is vacuum helpful here because it will force all that to happen in one batch? To put that another way: if I've run a manual vacuum, is it true that it will have updated all the hint bits to XMIN_COMMITTED for all the tuples that were all done when the vacuum started? Yes. For that matter, even a plain SELECT or count(*) on the entire table is good enough. That will check every tuple for visibility and set it's hint bits. Another point to note is that the hint bits are checked and set on a per tuple basis. So especially during index scan, the same heap page may get rewritten many times. I had suggested in the past that whenever we set hint bits for a tuple, we should check all other tuples in the page and set their hint bits too to avoid multiple writes of the same page. I guess the idea got rejected because of lack of benchmarks to prove the benefit. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
BTW we¹ve removed HINT bit checking in Greenplum DB and improved the visibility caching which was enough to provide performance at the same level as with the HINT bit optimization, but avoids this whole ³write the data, write it to the log also, then write it again just for good measure² behavior. For people doing data warehousing work like the poster, this Postgres behavior is miserable. It should be fixed for 8.4 for sure (volunteers?) BTW for the poster¹s benefit, you should implement partitioning by date, then load each partition and VACUUM ANALYZE after each load. You probably won¹t need the date index anymore so your load times will vastly improve (no indexes), you¹ll store less data (no indexes) and you¹ll be able to do simpler data management with the partitions. You may also want to partition AND index if you do a lot of short range selective date predicates. Example would be: partition by day, index on date field, queries selective on date ranges by hour will then select out only the day needed, then index scan to get the hourly values. Typically time-oriented data is nearly time sorted anyway, so you¹ll also get the benefit of a clustered index. - Luke On 5/15/08 10:40 AM, Pavan Deolasee [EMAIL PROTECTED] wrote: On Thu, May 15, 2008 at 7:51 AM, Greg Smith [EMAIL PROTECTED] wrote: So is vacuum helpful here because it will force all that to happen in one batch? To put that another way: if I've run a manual vacuum, is it true that it will have updated all the hint bits to XMIN_COMMITTED for all the tuples that were all done when the vacuum started? Yes. For that matter, even a plain SELECT or count(*) on the entire table is good enough. That will check every tuple for visibility and set it's hint bits. Another point to note is that the hint bits are checked and set on a per tuple basis. So especially during index scan, the same heap page may get rewritten many times. I had suggested in the past that whenever we set hint bits for a tuple, we should check all other tuples in the page and set their hint bits too to avoid multiple writes of the same page. I guess the idea got rejected because of lack of benchmarks to prove the benefit. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thu, 15 May 2008, Pavan Deolasee wrote: I had suggested in the past that whenever we set hint bits for a tuple, we should check all other tuples in the page and set their hint bits too to avoid multiple writes of the same page. I guess the idea got rejected because of lack of benchmarks to prove the benefit. From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the impression the idea was to have the background writer get involved to help with this particular situation. The way things are setup right now, I would guess it's impractical for an individual client to be forced to wait for all the tuples in a block to be checked just because it ran into one tuple that needed its hint bits refreshed. If the pages that had any hint bit updates since they were read/created were made easy to identify (maybe they already are), the writer could do the kind of scan you suggest anytime it was about to evict that page. That wouldn't be in the client's critical path and it would maximize the possible improvement here. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] I/O on select count(*)
On Thu, 15 May 2008 10:52:01 +0800 Luke Lonergan [EMAIL PROTECTED] wrote: BTW we¹ve removed HINT bit checking in Greenplum DB and improved the visibility caching which was enough to provide performance at the same level as with the HINT bit optimization, but avoids this whole ³write the data, write it to the log also, then write it again just for good measure² behavior. For people doing data warehousing work like the poster, this Postgres behavior is miserable. It should be fixed for 8.4 for sure (volunteers?) Donations? You have the code Luke :) Sincerely, Joshua D. Drake P.S. Sorry for the almost bad Star Wars pun. -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ United States PostgreSQL Association: http://www.postgresql.us/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate signature.asc Description: PGP signature