Re: [HACKERS] Poorly thought out code in vacuum
On Fri, Jan 6, 2012 at 12:37 AM, Tom Lane t...@sss.pgh.pa.us wrote: We could fix the direct symptom by inserting UnlockReleaseBuffer() in front of the continue, but AFAICS that just makes this into a busy-waiting equivalent of waiting unconditionally, so I don't really see why we should not revert the above fragment altogether. However, I suppose Robert had something more intelligent in mind than a tight loop when the buffer can't be exclusively locked, so maybe there is some other change that should be made here instead. Skipping the block completely isn't feasible. The only options are to wait or to do out-of-order cleaning. The conditional behaviour in lazy_scan_heap() is much more important than it is here, so just waiting for the lock wouldn't be too bad. Out-of-order cleaning could be very expensive in terms of I/O and could make that less robust. So I'd say let's just wait normally for the lock. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 10:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: I would be in favor of that, or perhaps some other formula (eg, maybe the minimum should be less than 8 for when you've got very little shmem). I have some results that show that, under the right set of circumstances, 8 -> 32 is a win, and I can quantify by how much it wins. I don't have any data at all to quantify the cost of dropping the minimum from 8 to 6, or from 8 to 4, and therefore I'm reluctant to do it. My guess is that it's a bad idea, anyway. Even on a system where shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG buffers. If we reduce the number of CLOG buffers from 8 to 4 (i.e. by 50%), we can increase the number of regular buffers from 1024 to 1028 (i.e. by 0.5%). Maybe you can find a case where that comes out to a win, but you might have to look pretty hard. I think you're rejecting the concept too easily. A setup with very little shmem is only going to be suitable for low-velocity systems that are not pushing too many transactions through per second, so it's not likely to need so many CLOG buffers. And frankly I'm not that concerned about what the performance is like: I'm more concerned about whether PG will start up at all without modifying the system shmem limits, on systems with legacy values for SHMMAX etc. Shaving a few single-purpose buffers to make back what we spent on SSI, for example, seems like a good idea to me. Having 32 clog buffers is important at the high end. I don't think that other complexities should mask that truth and lead to us not doing anything on this topic for this release. Please can we either make it user configurable? prepared transactions require config, lock table size is configurable also, so having SSI and clog require config is not too much of a stretch.
We can then discuss intelligent autotuning behaviour when we have more time and more evidence. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost sfr...@snowman.net wrote: * Simon Riggs (si...@2ndquadrant.com) wrote: I discover that non-all-zeroes holes are fairly common, just not very frequent. Curious, might be interesting to find out why. That may or may not be a problem, but not something to be dealt with here and now. But I agree that it's not the job of this patch/effort. It sounds like we have clear indication, however, that those areas, as they are not necessarily all zeros, should be included in the checksum. Disagree. Full page writes ignore the hole, so it's appropriate to do so here also. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] 16-bit page checksums for 9.2
On Friday, January 06, 2012 11:30:53 AM Simon Riggs wrote: On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost sfr...@snowman.net wrote: * Simon Riggs (si...@2ndquadrant.com) wrote: I discover that non-all-zeroes holes are fairly common, just not very frequent. Curious, might be interesting to find out why. That may or may not be a problem, but not something to be dealt with here and now. But I agree that it's not the job of this patch/effort. It sounds like we have clear indication, however, that those areas, as they are not necessarily all zeros, should be included in the checksum. Disagree. Full page writes ignore the hole, so it's appropriate to do so here also. Well, ignoring them in fpw has clear space benefits. Ignoring them while checksumming doesn't have that much of a benefit. Andres
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Thu, Jan 5, 2012 at 23:19, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Thu, Jan 5, 2012 at 17:13, Tom Lane t...@sss.pgh.pa.us wrote: I think link(2) would create race conditions of its own. I'd be inclined to suggest that maybe we should just special-case a zero length postmaster.pid file as meaning okay to proceed. In general, garbage That's pretty much what I meant - but with a warning message. Actually ... wait a minute. If we allow that, don't we create a race condition between two postmasters starting at almost the same instant? The second one could see the lock file when the first has created but not yet filled it. Good point, yeah, it should do that. But I still think it's rare enough that just special-casing the error message should be enough... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
[HACKERS] Unlogged tables and BufferSync()
In BufferSync(), we have this:

    /*
     * Unless this is a shutdown checkpoint, we write only permanent, dirty
     * buffers. But at shutdown time, we write all dirty buffers.
     */
    if (!(flags & CHECKPOINT_IS_SHUTDOWN))
        flags |= BM_PERMANENT;

That seems bogus to me. We're mixing CHECKPOINT_* flags and buffer BM_* flags in the same variable. Furthermore, we only use that flags variable to pass it down to CheckpointWriteDelay(), which only pays attention to the CHECKPOINT_IMMEDIATE flag. The intention was probably to do mask |= BM_PERMANENT ? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Unlogged tables and BufferSync()
On Fri, Jan 6, 2012 at 7:52 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: In BufferSync(), we have this:

    /*
     * Unless this is a shutdown checkpoint, we write only permanent, dirty
     * buffers. But at shutdown time, we write all dirty buffers.
     */
    if (!(flags & CHECKPOINT_IS_SHUTDOWN))
        flags |= BM_PERMANENT;

That seems bogus to me. We're mixing CHECKPOINT_* flags and buffer BM_* flags in the same variable. Furthermore, we only use that flags variable to pass it down to CheckpointWriteDelay(), which only pays attention to the CHECKPOINT_IMMEDIATE flag. The intention was probably to do mask |= BM_PERMANENT ? Ouch. Yeah, that's a bug, will fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] FATAL: bogus data in lock file postmaster.pid:
On Fri, Jan 6, 2012 at 6:13 AM, Magnus Hagander mag...@hagander.net wrote: On Thu, Jan 5, 2012 at 23:19, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Thu, Jan 5, 2012 at 17:13, Tom Lane t...@sss.pgh.pa.us wrote: I think link(2) would create race conditions of its own. I'd be inclined to suggest that maybe we should just special-case a zero length postmaster.pid file as meaning okay to proceed. In general, garbage That's pretty much what I meant - but with a warning message. Actually ... wait a minute. If we allow that, don't we create a race condition between two postmasters starting at almost the same instant? The second one could see the lock file when the first has created but not yet filled it. Good point, yeah, it should do that. But I still think it's rare enough that just special-casing the error message should be enough... so just something that does like

    stat(filename, &st);
    size = st.st_size;
    if (size == 0)
        elog(ERROR, "lock file \"%s\" has 0 length.", filename);

somewhere in CreateLockFile in miscinit.c?
Re: [HACKERS] [COMMITTERS] pgsql: Fix breakage from earlier plperl fix.
On 01/05/2012 10:59 PM, Alex Hunsaker wrote: After further digging I found it chokes on any non scalar (IOW any reference). I attached a simple c program that I tested with 5.8.9, 5.10.1, 5.12.4 and 5.14.2 (for those who did not know about it, perlbrew made testing across all those perls relatively painless). PFA a patch that copies if it's readonly and it's not a scalar. I didn't bother adding regression tests -- should I have? [redirecting to -hackers] I have several questions.

1. How much are we actually saving here? newSVsv() ought to be pretty cheap, no? I imagine it's pretty heavily used inside the interpreter.

2. Unless I'm insufficiently caffeinated right now, there's something wrong with this logic:

    !     if (SvREADONLY(sv) &&
    !         (type != SVt_IV ||
    !          type != SVt_NV ||
    !          type != SVt_PV))

3. The above is in any case almost certainly insufficient, because in my tests a typeglob didn't trigger SvREADONLY(), but did cause a crash. And yes, we should possibly add a regression test or two. Of course, we can't use the cause of the original complaint ($^V) in them, though. cheers andrew
Re: [HACKERS] Poorly thought out code in vacuum
On Thu, Jan 5, 2012 at 7:37 PM, Tom Lane t...@sss.pgh.pa.us wrote: I suppose Robert had something more intelligent in mind than a tight loop when the buffer can't be exclusively locked, so maybe there is some other change that should be made here instead. My intention was to skip the tuple, but I failed to notice the unusual way in which this loop iterates. How about something like the attached? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [Attachment: fix-vacuum-loop.patch]
Re: [HACKERS] Poorly thought out code in vacuum
On Fri, Jan 6, 2012 at 2:29 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 7:37 PM, Tom Lane t...@sss.pgh.pa.us wrote: I suppose Robert had something more intelligent in mind than a tight loop when the buffer can't be exclusively locked, so maybe there is some other change that should be made here instead. My intention was to skip the tuple, but I failed to notice the unusual way in which this loop iterates. How about something like the attached? It solves the waiting issue, but leaves unused tuples in the heap that previously would have been removed. I don't think that is a solution. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Poorly thought out code in vacuum
On Fri, Jan 6, 2012 at 9:53 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 6, 2012 at 2:29 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 7:37 PM, Tom Lane t...@sss.pgh.pa.us wrote: I suppose Robert had something more intelligent in mind than a tight loop when the buffer can't be exclusively locked, so maybe there is some other change that should be made here instead. My intention was to skip the tuple, but I failed to notice the unusual way in which this loop iterates. How about something like the attached? It solves the waiting issue, but leaves unused tuples in the heap that previously would have been removed. I don't think that is a solution. Uh, we discussed this before the patch was committed, and you agreed it made sense to do that in the second heap scan just as we do it in the first heap scan. If you now see a problem with that, that's fine, but please explain what the problem is, rather than just saying it's not acceptable for the patch to do the thing that the patch was designed to do. Actually, on further review, I do see another problem: we need to pass the scan_all flag down to lazy_vacuum_heap, and only do this conditionally if that flag is not set. Otherwise we might skip a page but still advance relfrozenxid, which would definitely be bad. But that looks easy to fix, and I don't see any other reason why this would be either unsafe or undesirable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Thu, Jan 5, 2012 at 5:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: There is no compiler anywhere that implements always inline, unless you are talking about a macro. inline is a hint and nothing more, and if you think you can force it you are mistaken. So this controversy is easily resolved: we do not need any such construct. That's basically in direct contradiction to the experimental evidence. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [COMMITTERS] pgsql: Fix breakage from earlier plperl fix.
Excerpts from Andrew Dunstan's message of vie ene 06 10:34:30 -0300 2012: And yes, we should possibly add a regression test or two. Of course, we can't use the cause of the original complaint ($^V) in them, though. Why not? You obviously can't output it verbatim, but you could compare it with a different function that returns $^V stringified by a different mechanism. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: I would be in favor of that, or perhaps some other formula (eg, maybe the minimum should be less than 8 for when you've got very little shmem). I have some results that show that, under the right set of circumstances, 8 -> 32 is a win, and I can quantify by how much it wins. I don't have any data at all to quantify the cost of dropping the minimum from 8 to 6, or from 8 to 4, and therefore I'm reluctant to do it. My guess is that it's a bad idea, anyway. Even on a system where shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG buffers. If we reduce the number of CLOG buffers from 8 to 4 (i.e. by 50%), we can increase the number of regular buffers from 1024 to 1028 (i.e. by 0.5%). Maybe you can find a case where that comes out to a win, but you might have to look pretty hard. I think you're rejecting the concept too easily. A setup with very little shmem is only going to be suitable for low-velocity systems that are not pushing too many transactions through per second, so it's not likely to need so many CLOG buffers. Well, if you take the same workload and spread it out over a long period of time, it will still have just as many CLOG misses or shared_buffers misses as it had when you did it all at top speed. Admittedly, you're unlikely to run into the situation where you have more people wanting to do simultaneous CLOG reads than there are buffers, but you'll still thrash the cache. And frankly I'm not that concerned about what the performance is like: I'm more concerned about whether PG will start up at all without modifying the system shmem limits, on systems with legacy values for SHMMAX etc. After thinking about this a bit, I think the problem is that the divisor we picked is still too high. Suppose we set num_clog_buffers = (shared_buffers / 4MB), with a minimum of 4 and maximum of 32.
That way, pretty much anyone who bothers to set shared_buffers to a non-default value will get 32 CLOG buffers, which should be fine, but people who are in the 32MB-or-less range can ramp down lower than what we've allowed in the past. That seems like it might give us the best of both worlds. Shaving a few single-purpose buffers to make back what we spent on SSI, for example, seems like a good idea to me. I think if we want to buy back that memory, the best way to do it would be to add a GUC to disable SSI at startup time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] CLOG contention
Simon Riggs si...@2ndquadrant.com writes: Please can we either make it user configurable? Weren't you just complaining that *I* was overcomplicating things? I see no evidence to justify inventing a user-visible GUC here. We have rough consensus on both the need for and the shape of a formula, with just minor discussion about the exact parameters to plug into it. Punting the problem off to a GUC is not a better answer. regards, tom lane
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com writes: After thinking about this a bit, I think the problem is that the divisor we picked is still too high. Suppose we set num_clog_buffers = (shared_buffers / 4MB), with a minimum of 4 and maximum of 32. Works for me. regards, tom lane
Re: [HACKERS] Poorly thought out code in vacuum
On Fri, Jan 6, 2012 at 3:28 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 6, 2012 at 9:53 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 6, 2012 at 2:29 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 7:37 PM, Tom Lane t...@sss.pgh.pa.us wrote: I suppose Robert had something more intelligent in mind than a tight loop when the buffer can't be exclusively locked, so maybe there is some other change that should be made here instead. My intention was to skip the tuple, but I failed to notice the unusual way in which this loop iterates. How about something like the attached? It solves the waiting issue, but leaves unused tuples in the heap that previously would have been removed. I don't think that is a solution. Uh, we discussed this before the patch was committed, and you agreed it made sense to do that in the second heap scan just as we do it in the first heap scan. If you now see a problem with that, that's fine, but please explain what the problem is, rather than just saying it's not acceptable for the patch to do the thing that the patch was designed to do. My name is on that patch, so of course I accept responsibility for the earlier decision making. I *have* explained what is wrong. It leaves unused tuples in the heap that would previously have been removed. More simply: lazy_vacuum_page() does some work and we can't skip that work just because we don't get the lock. Elsewhere in the patch we skipped getting the lock when there was no work to be done. In lazy_vacuum_heap() we only visit pages that need further work, hence every page is important. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [COMMITTERS] pgsql: Fix breakage from earlier plperl fix.
On 01/06/2012 10:49 AM, Alvaro Herrera wrote: Excerpts from Andrew Dunstan's message of vie ene 06 10:34:30 -0300 2012: And yes, we should possibly add a regression test or two. Of course, we can't use the cause of the original complaint ($^V) in them, though. Why not? You obviously can't output it verbatim, but you could compare it with a different function that returns $^V stringified by a different mechanism. not sure exactly how to do that in a way that exercises this code, but feel free to suggest something ;-) cheers andrew
Re: [HACKERS] Poorly thought out code in vacuum
On Fri, Jan 6, 2012 at 10:51 AM, Simon Riggs si...@2ndquadrant.com wrote: I *have* explained what is wrong. It leaves unused tuples in the heap that would previously have been removed. More simply: lazy_vacuum_page() does some work and we can't skip that work just because we don't get the lock. Elsewhere in the patch we skipped getting the lock when there was no work to be done. That is not true. We skip vacuumable pages in the first heap pass even if they contain dead tuples, unless scan_all is set or we can get the buffer lock without waiting. In lazy_vacuum_heap() we only visit pages that need further work, hence every page is important. If the system were to crash after the first heap pass and the index vacuuming, but before the second heap pass, nothing bad would happen. The next vacuum would collect those same dead tuples, scan the indexes for them (and not find them, since we already removed them), and then remove them from the heap. We already made a decision that it's a good idea to skip vacuuming a page on which we can't immediately obtain a cleanup lock, because waiting for the cleanup lock can cause the autovacuum worker to get stuck indefinitely, starving the table and in some cases the entire cluster of adequate vacuuming. That logic is just as valid in the second heap pass as it is in the first one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] CLOG contention
On Fri, Jan 6, 2012 at 3:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: Please can we either make it user configurable? Weren't you just complaining that *I* was overcomplicating things? I see no evidence to justify inventing a user-visible GUC here. We have rough consensus on both the need for and the shape of a formula, with just minor discussion about the exact parameters to plug into it. Punting the problem off to a GUC is not a better answer. As long as we get 32 buffers on big systems, I have no complaint. I'm sorry if I moaned at you personally. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 5 January 2012 20:23, Robert Haas robertmh...@gmail.com wrote: I don't have a problem with the idea of a pg_always_inline, but I'm wondering what sort of fallback mechanism you propose. It seems to me that if the performance improvement is conditioned on inlining being done and we're not confident that we can force the compiler to inline, we ought to fall back to not making the specialization available, rather than to making something available that may not work well. Of course if it's only a question of a smaller performance gain, rather than an actual loss, then that's not as much of an issue... pg_always_inline is more or less a neat adjunct to what I've already done. You can take it or leave it, but it would be irrational to dismiss it out of hand, because it is demonstrably effective in this case. Using non-portable things like function attributes is fairly well precedented - we use __attribute__((format(*) )) frequently. We don't need a fallback mechanism other than #else #define pg_always_inline inline #endif. I think these facilities are available to well over 99% of our users, who use GCC, a GCC-compatible compiler, or MSVC builds. I have basically no sympathy for anyone who uses a compiler that can't inline functions. Do such people actually exist? ISTM that protecting against that outside the qsort implementation (but only for index sorts) is wrong-headed. Assuming you mean that qsort_arg() will protect against this stuff so that callers don't need to do it separately, +1. That's exactly what I mean. Going quadratic in the face of lots of duplicates is something that, remarkably, was a problem well into the 1990s, apparently because some guy noticed it one day, though I think that lots of popular implementations happened to be unaffected anyway.
As you know, all queries tested have lots and lots of duplicates (a ~1.5GB table that contains the same number of distinct elements as a ~10MB table once did), and removing the duplicate protection for btrees, on top of everything else, sees the qsort do an awful lot better than HEAD does with the pseudo-protection. As I've said, we already lack any such protection for heap tuples. We are unaffected now anyway, in particular, because we do this, as apparently recommended by both Sedgewick and Knuth:

    while (pb <= pc && (r = cmp(pb, a)) <= 0)
    {
        if (r == 0)
        {
            swap(pa, pb);
            pa += es;
        }
        pb += es;
    }

Note that the partition swap occurs *because* the pivot and element are equal. You might imagine that this is superfluous. It turns out that it protects against the duplicates, resulting in sub-partitions about the same size (though it occurs to me that it does so at the expense of precluding parallelism). In short, Sedgewick would approve of our qsort implementation. As for the compare a tuple to itself thing, that's just silly, and was probably only ever seen in some homespun implementation at some point a relatively long time ago, but I've asserted against it anyway. I can't find any description of how this actually works... obviously, we'd like to find a close-to-median element, but how do we actually do that cheaply? Uh, it works whatever way you want it to work. The implementation has to figure out how to get the median ahead of time. I doubt that this would be worthwhile -- the pivot that we pick at the toplevel doesn't really matter much; even if it's bad, we can recover just fine if the pivot we pick one level down is good, or the level after that. We only really have a problem if we keep systematically picking bad pivots every time. And if we do have that problem, improving the toplevel choice of pivot is only going to help modestly, because we'll still be O(n^2) on each partition.
Can I get a straw poll on how much of a problem worst-case performance of qsort is believed to be? I'd be reluctant to remove any of the protections we currently have, because I bet they were all put in to fix problems that people hit in the field. But I don't know that we need more. Of course, consolidating them into qsort() itself rather than its callers probably makes sense, as noted above. I was merely proposing something that would complement the med3 method and our existing protections. In a perfect world, if it were somehow possible to know the perfect pivots ahead of time from a histogram or something, we'd have a quick sort variant with worst-case performance of O(n log(n)). That, or the limited approximation that I've sketched would perhaps be worthwhile, even if it was subject to a number of caveats. Wikipedia claims that the worst case for quick sort is O(n log(n)) with the refinements recommended by Sedgewick's 1978 paper, but this seems like nonsense to me - the med3 technique is a good heuristic in
Re: [HACKERS] Poorly thought out code in vacuum
I started to wonder how likely it would be that some other process would sit on a buffer pin for so long as to allow 134217727 iterations of ReadBufferExtended, even given the slowdowns associated with CLOBBER_CACHE_ALWAYS. This led to some fruitless searching for possible deadlock conditions, but eventually I realized that there's a much simpler explanation: if PrivateRefCount > 1 then ConditionalLockBufferForCleanup always fails. This means that once ConditionalLockBufferForCleanup has failed once, the currently committed code in lazy_vacuum_heap is guaranteed to loop until it gets a failure in enlarging the ResourceOwner buffer-reference array. Which in turn means that neither of you ever actually exercised the skip case, or you'd have noticed the problem. Nor are the current regression tests well designed to exercise the case. (There might well be failures of this type happening from time to time in autovacuum, but we'd not see any reported failure in the buildfarm, unless we went trawling in postmaster logs.) So at this point I've got serious doubts as to the quality of testing of that whole patch, not just this part. regards, tom lane
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: As you know, all queries tested have lots and lots of duplicates (a ~1.5GB table that contains the same number of distinct elements as a ~10MB table once did), and removing the duplicate protection for btrees, on top of everything else, sees the qsort do an awful lot better than HEAD does with the pseudo-protection. Does that also win vs. unpatched master? If so we may as well pull that part out and commit it separately. I can't find any description of how this actually works... obviously, we'd like to find a close-to-median element, but how do we actually do that cheaply? Uh, it works whatever way you want it to work. The implementation has to figure out how to get the median ahead of time. Oh. I'd be inclined to think that would be pretty inefficient, unless we only did it for very large partitions or when we observed that some less costly strategy was tanking. I gather from a quick read that this is supposed to be especially good when the data is already somewhat sorted. Would there be any merit in trying to guess when that's likely to be the case (e.g. based on the correlation statistic)? That seems like a stretch, but OTOH a GUC doesn't feel great either: what is the poor user to do with a query that does 2 sorts, one of which is faster with QS and the other of which is faster with Timsort? Ugh. I'd imagined that it might work a bit like default_statistics_target. Of course, I was just thinking out loud. Also, we do a very good job on *perfectly* pre-sorted input because we check for pre-sorted input. We occasionally have people who want to do SELECT * FROM foo WHERE a > 100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not (a, b). This tends to suck, because we fail to realize that we can batch the sort.
We either seqscan and filter the table then sort the results, or else we scan the index on (a) and then ignore the fact that we have data which is already partially sorted. It's particularly annoying if a is mostly unique and needs b only occasionally to break ties. It would be nice to improve this, but it may be more of a planner/executor problem than something we can fix directly inside the sort implementation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Poorly thought out code in vacuum
On Fri, Jan 6, 2012 at 12:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: I started to wonder how likely it would be that some other process would sit on a buffer pin for so long as to allow 134217727 iterations of ReadBufferExtended, even given the slowdowns associated with CLOBBER_CACHE_ALWAYS. This led to some fruitless searching for possible deadlock conditions, but eventually I realized that there's a much simpler explanation: if PrivateRefCount > 1 then ConditionalLockBufferForCleanup always fails. This means that once ConditionalLockBufferForCleanup has failed once, the currently committed code in lazy_vacuum_heap is guaranteed to loop until it gets a failure in enlarging the ResourceOwner buffer-reference array. Which in turn means that neither of you ever actually exercised the skip case, or you'd have noticed the problem. Nor are the current regression tests well designed to exercise the case. (There might well be failures of this type happening from time to time in autovacuum, but we'd not see any reported failure in the buildfarm, unless we went trawling in postmaster logs.) So at this point I've got serious doubts as to the quality of testing of that whole patch, not just this part. I tested the case where we skip a block during the first pass, but I admit that I punted on testing the case where we skip a block during the second pass, because I couldn't think of a good way to exercise it. Any suggestions? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pgsphere
I've been asked by someone to support pgsphere. It would appear that the two project owners are MIA. If anyone knows different can they let me know? Does anyone have any objection to me taking over the project? Dave Cramer dave.cramer(at)credativ(dot)ca http://www.credativ.ca -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Poorly thought out code in vacuum
Robert Haas robertmh...@gmail.com writes: On Fri, Jan 6, 2012 at 12:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: So at this point I've got serious doubts as to the quality of testing of that whole patch, not just this part. I tested the case where we skip a block during the first pass, but I admit that I punted on testing the case where we skip a block during the second pass, because I couldn't think of a good way to exercise it. Any suggestions? Hack ConditionalLockBufferForCleanup to have a 50% probability of failure regardless of anything else, for instance via static int ctr = 0; if ((++ctr) % 2) return false; regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsphere
On 01/06/2012 12:32 PM, Dave Cramer wrote: I've been asked by someone to support pgsphere. It would appear that the two project owners are MIA. If anyone knows different can they let me know? Does anyone have any objection to me taking over the project? One of the owners is Teodor, who is a core committer ... I hope he's not MIA. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Poorly thought out code in vacuum
On Fri, Jan 6, 2012 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Jan 6, 2012 at 12:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: So at this point I've got serious doubts as to the quality of testing of that whole patch, not just this part. I tested the case where we skip a block during the first pass, but I admit that I punted on testing the case where we skip a block during the second pass, because I couldn't think of a good way to exercise it. Any suggestions? Hack ConditionalLockBufferForCleanup to have a 50% probability of failure regardless of anything else, for instance via static int ctr = 0; if ((++ctr) % 2) return false; Oh, that's brilliant. OK, I'll go try that. Note to self: Try to remember to take that hack back out before committing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Thu, Jan 5, 2012 at 3:46 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Sure. I just think you are there already except for what I got into. New version attached, with your suggested changes included. Hole check code is there as well, but ifdef'd out since it isn't a valid check in all cases. Following private discussions, Kevin showed me the patch at v3 was inadequate because it didn't cater for torn pages as a result of hint bit writes. The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). Dan Scales also observed some cases where private writes aren't checksummed, e.g. index build. Those suggestions are still outstanding from me, but the v4 is worthy of more serious review. I'll now turn my attention more fully onto the DW patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..9b367a3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,48 @@ SET ENABLE_SEQSCAN TO OFF;
      </listitem>
     </varlistentry>
 
+    <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+     <indexterm>
+      <primary><varname>page_checksums</> configuration parameter</primary>
+     </indexterm>
+     <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+     <listitem>
+      <para>
+       When this parameter is on, the <productname>PostgreSQL</> server
+       calculates checksums when it writes main database pages to disk,
+       flagging the page as checksum protected. When this parameter is off,
+       no checksum is written, only a standard watermark in the page header.
+       The database may thus contain a mix of pages with checksums and pages
+       without checksums.
+      </para>
+
+      <para>
+       When pages are read into shared buffers any page flagged with a
+       checksum has the checksum re-calculated and compared against the
+       stored value to provide greatly improved validation of page contents.
+      </para>
+
+      <para>
+       Writes via temp_buffers are not checksummed.
+      </para>
+
+      <para>
+       Turning this parameter off speeds normal operation, but
+       might allow data corruption to go unnoticed. The checksum uses
+       16-bit checksums, using the fast Fletcher 16 algorithm. With this
+       parameter enabled there is still a non-zero probability that an error
+       could go undetected, as well as a non-zero probability of false
+       positives.
+      </para>
+
+      <para>
+       This parameter can only be set in the <filename>postgresql.conf</>
+       file or on the server command line.
+       The default is <literal>off</>.
+      </para>
+     </listitem>
+    </varlistentry>
+
     <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
      <term><varname>wal_buffers</varname> (<type>integer</type>)</term>
      <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e65962..c9538d3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -708,6 +708,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	bool		updrqst;
 	bool		doPageWrites;
 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+	bool		IsHint = (rmid == RM_SMGR_ID && info == XLOG_SMGR_HINT);
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
@@ -975,6 +976,18 @@ begin:;
 	}
 
 	/*
+	 * If this is a hint record and we don't need a backup block then
+	 * we have no more work to do and can exit quickly without inserting
+	 * a WAL record at all. In that case return InvalidXLogRecPtr.
+	 */
+	if (IsHint && !(info & XLR_BKP_BLOCK_MASK))
+	{
+		LWLockRelease(WALInsertLock);
+		END_CRIT_SECTION();
+		return InvalidXLogRecPtr;
+	}
+
+	/*
 	 * If there isn't enough space on the current XLOG page for a record
 	 * header, advance to the next page (leaving the unused space as zeroes).
 	 */
@@ -3670,6 +3683,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
 				   BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
 		}
 
+		/*
+		 * Any checksum set on this page will be invalid. We don't need
+		 * to reset it here since it will be reset before being written
+		 * but it seems worth doing this for general sanity and hygiene.
+		 */
+		PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
+
 		PageSetLSN(page, lsn);
 		PageSetTLI(page, ThisTimeLineID);
 		MarkBufferDirty(buffer);
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 6 January 2012 17:29, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: As you know, all queries tested have lots and lots of duplicates (a ~1.5GB table that contains the same number of distinct elements as a ~10MB table once did), and removing the duplicate protection for btrees, on top of everything else, sees the qsort do an awful lot better than HEAD does with the pseudo-protection. Does that also win vs. unpatched master? If so we may as well pull that part out and commit it separately. I didn't bother isolating that, because it doesn't really make sense to (not having it is probably only of particular value when doing what I'm doing anyway, but who knows). Go ahead and commit something to remove that code (get it in both comparetup_index_btree and comparetup_index_hash), as well as the tuple1 != tuple2 test now if you like. It's patently clear that it is unnecessary, and so doesn't have to be justified as a performance enhancement - it's a simple case of refactoring for clarity. As I've said, we don't do this for heap tuples and we've heard no complaints in all that time. It was added in commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when problems with system qsorts came to light. I'd imagined that it might work a bit like default_statistics_target. Of course, I was just thinking out loud. Also, we do a very good job on *perfectly* pre-sorted input because we check for pre-sorted input. We occasionally have people who want to do SELECT * FROM foo WHERE a > 100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not (a, b). This tends to suck, because we fail to realize that we can batch the sort. We either seqscan and filter the table then sort the results, or else we scan the index on (a) and then ignore the fact that we have data which is already partially sorted. It's particularly annoying if a is mostly unique and needs b only occasionally to break ties. 
It would be nice to improve this, but it may be more of a planner/executor problem than something we can fix directly inside the sort implementation. That sounds like an interesting problem. Something to look into for 9.3, perhaps. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
Peter Geoghegan pe...@2ndquadrant.com writes: I didn't bother isolating that, because it doesn't really make sense to (not having it is probably only of particular value when doing what I'm doing anyway, but who knows). Go ahead and commit something to remove that code (get it in both comparetup_index_btree and comparetup_index_hash), as well as the tuple1 != tuple2 test now if you like. It's patently clear that it is unnecessary, and so doesn't have to be justified as a performance enhancement - it's a simple case of refactoring for clarity. As I've said, we don't do this for heap tuples and we've heard no complaints in all that time. It was added in commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when problems with system qsorts came to light. Actually, I'm going to object to reverting that commit, as I believe that having equal-keyed index entries in physical table order may offer some performance benefits at access time. If you don't like the comment, we can change that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix breakage from earlier plperl fix.
On Fri, Jan 6, 2012 at 06:34, Andrew Dunstan and...@dunslane.net wrote: PFA that copies if it's readonly and it's not a scalar. I didn't bother adding regression tests-- should I have? I have several questions. 1. How much are we actually saving here? newSVsv() ought to be pretty cheap, no? I imagine it's pretty heavily used inside the interpreter. It will incur an extra copy for every return value from plperl and every value passed to a spi function (and possibly other areas I forgot about). Personally I think avoiding yet another copy of the return value is worth it. 2. Unless I'm insufficiently caffeinated right now, there's something wrong with this logic: ! if (SvREADONLY(sv) && ! (type != SVt_IV || ! type != SVt_NV || ! type != SVt_PV)) Oh my... I dunno exactly what I was smoking last night, but it's a good thing I didn't share :-). Uh so my test program was also completely wrong, I'll have to redo it all. I've narrowed it down to: if ((type == SVt_PVGV || SvREADONLY(sv))) { if (type != SVt_PV && type != SVt_NV) { sv = newSVsv(sv); } } One odd thing is we have to copy SVt_IV (plain numbers) as apparently that is shared with refs (my $str = 's'; my $ref = \$str;). Given this, #3 and the other unknowns we might run into in the future I think if (SvREADONLY(sv) || type == SVt_PVGV) proposed upthread makes the most sense. 3. The above is in any case almost certainly insufficient, because in my tests a typeglob didn't trigger SvREADONLY(), but did cause a crash. Hrm the glob I was testing was *STDIN. It failed to fail in my test program because I was not testing *STDIN directly but instead had @test = (*STDIN); Ugh. Playing with it a bit more it seems only *STDIN, *STDERR and *STDOUT have problems. For example this seems to work fine for me: do LANGUAGE plperlu $$ open(my $fh, '>', '/tmp/t.tmp'); elog(NOTICE, $fh) $$; And yes, we should possibly add a regression test or two. Of course, we can't use the cause of the original complaint ($^V) in them, though. 
I poked around the perl source looking for some candidates elog(INFO, ${^TAINT}) works fine on 5.8.9 and 5.14.2. I thought we could get away with elog(INFO, *STDIN) but 5.8.9 says: NOTICE: While other version of perl (5.14) say: NOTICE: *main::STDIN -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix breakage from earlier plperl fix.
On 01/06/2012 02:02 PM, Alex Hunsaker wrote: 3. The above is in any case almost certainly insufficient, because in my tests a typeglob didn't trigger SvREADONLY(), but did cause a crash. Hrm the glob I was testing was *STDIN. It failed to fail in my test program because I was not testing *STDIN directly but instead had @test = (*STDIN); Ugh. Playing with it a bit more it seems only *STDIN, *STDERR and *STDOUT have problems. For example this seems to work fine for me: do LANGUAGE plperlu $$ open(my $fh, '>', '/tmp/t.tmp'); elog(NOTICE, $fh) $$; $fh isn't a typeglob here, AIUI. But it's definitely not just *STDIN and friends. Try: do LANGUAGE plperlu $$ $foo=1; elog(NOTICE, *foo) $$; cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP(!) Double Writes
On Wed, Jan 04, 2012 at 10:19:16PM -0800, David Fetter wrote: Folks, Please find attached two patches, each under the PostgreSQL license, one which implements page checksums vs. REL9_0_STABLE, the other which depends on the first (i.e. requires that it be applied first) and implements double writes. They're vs. REL9_0_STABLE because they're extracted from vPostgres 1.0, a proprietary product currently based on PostgreSQL 9.0. I had wanted the first patch set to be: - Against git head, and - Based on feedback from Simon's patch. Simon's now given some feedback during a fruitful discussion, and has sent an updated checksum patch which will be the basis for the double-write stuff Dan's working on. Stay tuned! Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 6 January 2012 18:45, Tom Lane t...@sss.pgh.pa.us wrote: Peter Geoghegan pe...@2ndquadrant.com writes: I didn't bother isolating that, because it doesn't really make sense to (not having it is probably only of particular value when doing what I'm doing anyway, but who knows). Go ahead and commit something to remove that code (get it in both comparetup_index_btree and comparetup_index_hash), as well as the tuple1 != tuple2 test now if you like. It's patently clear that it is unnecessary, and so doesn't have to be justified as a performance enhancement - it's a simple case of refactoring for clarity. As I've said, we don't do this for heap tuples and we've heard no complaints in all that time. It was added in commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when problems with system qsorts came to light. Actually, I'm going to object to reverting that commit, as I believe that having equal-keyed index entries in physical table order may offer some performance benefits at access time. If you don't like the comment, we can change that. Please explain your position. When is this supposed to be useful? Even if you can present an argument for keeping it, it has to be weighed against the fact that people don't tend to have much use for indexes with lots of duplicates anyway. Prior to 2004, we did not do this - it was justified as insurance against bad qsort implementations only. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote: The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). Well, but it will hurt in those cases where hint bits already hurt hugely. Which is for example when accessing huge amounts of data the first time after loading it. Which is not exactly uncommon... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Fri, Jan 6, 2012 at 11:05 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: After thinking about this a bit, I think the problem is that the divisor we picked is still too high. Suppose we set num_clog_buffers = (shared_buffers / 4MB), with a minimum of 4 and maximum of 32. Works for me. Done. I tested this on my MacBook Pro and I see no statistically significant difference from the change on a couple of small pgbench tests. Hopefully that means this is good on large boxes and at worst harmless on small ones. As far as I can see, the trade-off is this: If you increase the number of CLOG buffers, then your CLOG miss rate will go down. On the other hand, the cost of looking up a CLOG buffer will go up. At some point, the reduction in the miss rate will not be enough to pay for a longer linear search - which also means holding CLogControlLock. I think it'd probably be worthwhile to think about looking for something slightly smarter than a linear search at some point, and maybe also looking for a way to partition the locking better. But, this at least picks the available low-hanging fruit, which is a good place to start. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On 06.01.2012 20:26, Simon Riggs wrote: The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). How's that going to work during recovery? Like in hot standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote: On 06.01.2012 20:26, Simon Riggs wrote: The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). How's that going to work during recovery? Like in hot standby. How's recovery a problem? Unless I miss something that doesn't actually introduce a new possibility to transport hint bits to the standby (think fpw's). A new transport will obviously increase traffic but ... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 2:35 PM, Andres Freund and...@anarazel.de wrote: On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote: The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). Well, but it will hurt in those cases where hint bits already hurt hugely. Which is for example when accessing huge amounts of data the first time after loading it. Which is not exactly uncommon... No, but I'd rather see us commit a checksum patch that sometimes carries a major performance penalty and then work to reduce that penalty later than never commit anything at all. I think it's unrealistic to assume we're going to get this perfectly right in one try. I am not sure whether this particular patch is close enough for government work or still needs more hacking, and it may well be the latter, but there is no use holding our breath and waiting for absolute perfection. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund and...@anarazel.de wrote: On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote: On 06.01.2012 20:26, Simon Riggs wrote: The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). How's that going to work during recovery? Like in hot standby. How's recovery a problem? Unless I miss something that doesn't actually introduce a new possibility to transport hint bits to the standby (think fpw's). A new transport will obviously increase traffic but ... The standby can set hint bits locally that weren't set on the data it received from the master. This will require rechecksumming and rewriting the page, but obviously we can't write the WAL records needed to protect those writes during recovery. So a crash could create a torn page, invalidating the checksum. Ignoring checksum errors during Hot Standby operation doesn't fix it, either, because eventually you might want to promote the standby, and the checksum will still be invalid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Friday, January 06, 2012 08:53:38 PM Robert Haas wrote: On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund and...@anarazel.de wrote: On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote: On 06.01.2012 20:26, Simon Riggs wrote: The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). How's that going to work during recovery? Like in hot standby. How's recovery a problem? Unless I miss something that doesn't actually introduce a new possibility to transport hint bits to the standby (think fpw's). A new transport will obviously increase traffic but ... The standby can set hint bits locally that weren't set on the data it received from the master. This will require rechecksumming and rewriting the page, but obviously we can't write the WAL records needed to protect those writes during recovery. So a crash could create a torn page, invalidating the checksum. Err. Stupid me, thanks. Ignoring checksum errors during Hot Standby operation doesn't fix it, either, because eventually you might want to promote the standby, and the checksum will still be invalid. It's funny. I have the feeling we all are missing a very obvious brilliant solution to this... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
Corrections: On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote: On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote: + *We set s to be the estimated frequency of the K'th element in a natural + *language's frequency table, where K is the target number of entries in + *the MCELEM array. We assume that the distribution of element frequencies + *follows Zipf's law with an exponent of 1. + * + *Assuming Zipfian distribution, the frequency of the K'th element is equal + *to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of + *elements in the language. Putting W as one million, we get roughly + *0.07/K. This gives s = 0.07/K. We set epsilon = s/10, which gives bucket + *width w = K/0.007 and maximum expected hashtable size of about 1000 * K. These last two paragraphs, adapted from ts_typanalyze.c, assume natural language documents. To what extent do these parameter choices remain sensible for arbitrary data such as users may place in arrays? In any event, we need a different justification, even if it's just a hand-wavy justification. If I'm following this correctly, this choice of s makes the algorithm guaranteed to find only elements constituting >= 7% of the input elements. Incidentally, isn't that far too high for natural language documents? If the English word "the" only constitutes 7% of typical documents, then this s value would permit us to discard practically every word; we'd be left with words read while filling the last bucket and words that happened to repeat exceedingly often in the column. I haven't tried to make a test case to observe this problem; am I missing something? (This question is largely orthogonal to your patch.) No, we'll find elements of frequency at least 0.07/(default_statistics_target * 10) -- in the default configuration, 0.007%. Also, ts_typanalyze() counts the number of documents that contain one or more instances of each lexeme, ignoring the number of appearances within each document. 
The word "the" may constitute 7% of a typical document, but it will appear at least once in nearly 100% of documents. Therefore, this s value is adequate even for the pathological case of each document containing just one lexeme. + * + *Note: in the above discussion, s, epsilon, and f/N are in terms of an + *element's frequency as a fraction of all elements seen in the input. + *However, what we actually want to store in the finished pg_statistic + *entry is each element's frequency as a fraction of all rows that it occurs + *in. Elements might be repeated in the same array. Since the operators + *<@, &&, @> care only about element occurrence itself and not about + *occurrence count, the function takes additional care about uniqueness of + *counting. Also we need to change the divisor from N to nonnull_cnt to get + *the number we want. On the same tangent, why does ts_typanalyze() not deduplicate the same way? The @@ operator has the same property. Answer: to_tsvector() will have already done so. + /* +* We set bucket width equal to (num_mcelem + 10) / 0.007 as per the +* comment above. +*/ + bucket_width = num_mcelem * 1000 / 7; The addend mentioned is not present in the code or discussed in the comment above. (I see the comment is copied verbatim from ts_typanalyze(), where the addend *is* present, though again the preceding comment says nothing of it.) The addend rationale in fact does appear in the ts_typanalyze() comment. Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 7:49 PM, Robert Haas robertmh...@gmail.com wrote: No, but I'd rather see us commit a checksum patch that sometimes carries a major performance penalty and then work to reduce that penalty later than never commit anything at all. I think it's unrealistic to assume we're going to get this perfectly right in one try. I am not sure whether this particular patch is close enough for government work or still needs more hacking, and it may well be the latter, but there is no use holding our breath and waiting for absolute perfection. Well said. My view completely. I do want jam tomorrow, I just want bread and butter today as well. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Fix breakage from earlier plperl fix.
Alex Hunsaker bada...@gmail.com writes: Oh my... I dunno exactly what I was smoking last night, but it's a good thing I didn't share :-). Uh so my test program was also completely wrong, I'll have to redo it all. I've narrowed it down to: if (type == SVt_PVGV || SvREADONLY(sv)) { if (type != SVt_PV && type != SVt_NV) { sv = newSVsv(sv); } } Has anyone tried looking at the source code for SvPVutf8 to see exactly what cases it fails on? The fact that there's an explicit croak() call makes me think it might not be terribly hard to tell. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
Peter Geoghegan pe...@2ndquadrant.com writes: On 6 January 2012 18:45, Tom Lane t...@sss.pgh.pa.us wrote: Actually, I'm going to object to reverting that commit, as I believe that having equal-keyed index entries in physical table order may offer some performance benefits at access time. If you don't like the comment, we can change that. Please explain your position. When is this supposed to be useful? When there are lots of duplicates of a particular indexed value, the existing code will cause an indexscan to search them in physical order, whereas if we remove the existing logic it'll be random --- in particular, accesses to the same heap page can no longer be expected to be clustered. Admittedly, I don't have any numbers quantifying just how useful that might be, but on the other hand you've not presented any evidence justifying removing the behavior either. If we believe your position that indexes don't generally have lots of duplicates, then the code in question will seldom be reached and therefore there would be no performance benefit in removing it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 3:47 PM, Simon Riggs si...@2ndquadrant.com wrote: Since we ignore hints during HS anyway, No, we don't. We need to ignore visibility map bits, but we need not and do not ignore HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED, etc. not setting them seems OK if checksums defined. Or we can recommend that you don't use checksums on a standby. Whichever fits. Writing pages during recovery doesn't need WAL. If we crash, we replay using the already generated WAL. Which is all fine, except when you start making changes that are not WAL-logged. Then, you have the same torn page problem that exists when you do it in normal running. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: Admittedly, I don't have any numbers quantifying just how useful that might be, but on the other hand you've not presented any evidence justifying removing the behavior either. If we believe your position that indexes don't generally have lots of duplicates, then the code in question will seldom be reached and therefore there would be no performance benefit in removing it. Obviously, many indexes are unique and thus won't have duplicates at all. But if someone creates an index and doesn't make it unique, odds are very high that it has some duplicates. Not sure how many we typically expect to see, but more than zero... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add SPI results constants available for PL/*
On Tue, Jan 03, 2012 at 09:11:17PM -0500, Andrew Dunstan wrote: Yeah, I'm with you and Pavel. Here's my quick perl one-liner to produce a set of SPI_* constants for pl/perl. I'm looking at the best way to include this in the bootstrap code. perl -ne 'BEGIN { print "use constant\n{\n"; } END { print "};\n"; } print "\t$1 => $2,\n" if /#define (SPI_\S+)\s+\(?(-?\d+)\)?/;' src/include/executor/spi.h Well, there's also h2ph, but that may be overkill... Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund and...@anarazel.de wrote: The standby can set hint bits locally that weren't set on the data it received from the master. This will require rechecksumming and rewriting the page, but obviously we can't write the WAL records needed to protect those writes during recovery. So a crash could create a torn page, invalidating the checksum. Err. Stupid me, thanks. Ignoring checksum errors during Hot Standby operation doesn't fix it, either, because eventually you might want to promote the standby, and the checksum will still be invalid. Its funny. I have the feeling we all are missing a very obvious brilliant solution to this... Like getting rid of hint bits? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] LWLOCK_STATS
It's been a while since I did any testing with LWLOCK_STATS defined, so I thought it might be about time to do that again and see how things look. Here's how they looked back in July: http://archives.postgresql.org/pgsql-hackers/2011-07/msg01373.php Here are the results from a test I ran today on latest sources, again on Nate Boley's machine. Five-minute pgbench run, scale factor 100, permanent tables, my usual config settings. Somewhat depressingly, virtually all of the interesting activity still centers around the same three locks that were problematic back then, which means that - although overall performance has improved quite a bit - we've not really delivered any knockout punches. Here are the perpetrators:

lwlock 4: shacq 26160717 exacq 2690379 blk 1129763
lwlock 11: shacq 97074202 exacq 2699639 blk 1482737
lwlock 7: shacq 0 exacq 16522284 blk 2926957
grand total: shacq 225919534 exacq 77179954 blk 6218570

There is some change in how the contention is distributed. Taking the number of times a request for each lock blocked as a percentage of the total number of lock acquisitions that blocked, we get this:

WALInsertLock - July 2011: 35%, January 2012: 47%
CLogControlLock - July 2011: 23%, January 2012: 24%
ProcArrayLock - July 2011: 32%, January 2012: 18%

Since there's been some change to the test configuration over the last six months, this has to be taken with a grain of salt, but in broad strokes it makes sense given what's been done - ProcArrayLock contention is down significantly (due to Pavan's patch, and followup tweaks), and the other locks are under correspondingly more pressure. We've done enough work on CLogControlLock (today's change, and Simon's prior patch to wake up the WAL writer more aggressively and thus allow hint bits to be set sooner) to allow it to keep pace, so it's only up slightly, but we haven't done anything about WALInsertLock and it's therefore grown from just over a third of the blocks to almost half.
The top locks in terms of number of shared acquisitions are CLogControlLock, which accounts for 47% of the shared lock acquisitions in the system all by itself, followed by ProcArrayLock, which accounts for another 12%. The buffer mapping locks make up another 23% in total, with the busiest one having about 3.5x the traffic of the least busy one. Even when these shared acquisitions are mostly uncontended at the lwlock level, the spinlock can still be a contention point, and thus these are possible future targets for further reducing our synchronization overhead despite the fact that (on this test) there's not much blocking on, say, the buffer mapping locks. Note that this fits in shared buffers; on a larger test case, there would be much more blocking on the buffer mapping locks (and presumably BufFreelistLock would be a big problem, too). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Intermittent regression test failures from index-only plan changes
Another regression test failure that I've been seeing lately is a change from index-only scan to seqscan in create_index, as for instance here: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2012-01-02%2023%3A05%3A02 I've managed to duplicate and debug this one too. What I find is that the planner switches from preferring index-only scan to preferring seqscan because normally it sees the table as being fully all-visible after the immediately preceding VACUUM, but in the failure cases the table is seen as having no all-visible pages. That inflates the estimated cost of an index-only scan by enough to make it look like a loser. The create_index test is usually running by itself at this point (since the concurrently started create_view test is much shorter), but I had autovacuum logging on and found that there was a concurrent auto-ANALYZE on another table when the manual VACUUM was running. So failure to set the all-visible flags is expected given that context. This is a bit troublesome because, AFAICS, this means that every single one of the fifty-odd regression test cases that expect to see an Index Only Scan plan might transiently fail, if there happens to be a background auto-ANALYZE running at just the moment that the previous vacuum would've otherwise set the all-visible bits. It might be that all the other ones are safe in practice for various reasons, but even if they are there's no guarantee that new regression tests added in future will be reliable. Background auto-VACUUMs shouldn't cause this problem because they don't take snapshots, and ideally it'd be nice if auto-ANALYZE couldn't create the issue either, but ANALYZE does need a snapshot so it's hard to see how to avoid having it trip the all-visible logic. Anybody have any ideas? If nothing else comes to mind I'll probably just remove the sometimes-failing EXPLAIN test case, but I'm worried about the prospects of future failures of the same ilk.
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 7:53 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund and...@anarazel.de wrote: On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote: On 06.01.2012 20:26, Simon Riggs wrote: The following patch (v4) introduces a new WAL record type that writes backup blocks for the first hint on a block in any checkpoint that has not previously been changed. IMHO this fixes the torn page problem correctly, though at some additional loss of performance but not the total catastrophe some people had imagined. Specifically we don't need to log anywhere near 100% of hint bit settings, much more like 20-30% (estimated not measured). How's that going to work during recovery? Like in hot standby. How's recovery a problem? Unless I miss something that doesn't actually introduce a new possibility to transport hint bits to the standby (think fpw's). A new transport will obviously increase traffic but ... The standby can set hint bits locally that weren't set on the data it received from the master. This will require rechecksumming and rewriting the page, but obviously we can't write the WAL records needed to protect those writes during recovery. So a crash could create a torn page, invalidating the checksum. Ignoring checksum errors during Hot Standby operation doesn't fix it, either, because eventually you might want to promote the standby, and the checksum will still be invalid. Since we ignore hints during HS anyway, not setting them seems OK if checksums defined. Or we can recommend that you don't use checksums on a standby. Whichever fits. Writing pages during recovery doesn't need WAL. If we crash, we replay using the already generated WAL. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure mmonc...@gmail.com wrote: On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund and...@anarazel.de wrote: The standby can set hint bits locally that weren't set on the data it received from the master. This will require rechecksumming and rewriting the page, but obviously we can't write the WAL records needed to protect those writes during recovery. So a crash could create a torn page, invalidating the checksum. Err. Stupid me, thanks. Ignoring checksum errors during Hot Standby operation doesn't fix it, either, because eventually you might want to promote the standby, and the checksum will still be invalid. Its funny. I have the feeling we all are missing a very obvious brilliant solution to this... Like getting rid of hint bits? Or even just not bothering to consider them as making buffers dirty, so the only writes are already protected by the double-write (WAL, or if they get some DW outside of WAL). I think I've said it before, but I'm guessing OLTP style database rarely have pages written that are dirty that aren't covered by real changes (so have the FPW anyways) and OLAP type generally freeze after loads to avoid the hint-bit-write penalty too... a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure mmonc...@gmail.com wrote: Its funny. I have the feeling we all are missing a very obvious brilliant solution to this... Like getting rid of hint bits? No. First, that solution has been proposed before, and it crashed and burned before. You may as well propose getting rid of VACUUM. In each case, the supposed problem is in fact a cure for a much more severe problem that would quickly make you long for the days when you had the first one. I think you (or someone else?) had a fairly well-developed patch for blunting the impact of hint bit setting, and that we might be able to do if you (or someone) wants to finish it. But getting rid of them altogether isn't likely, not because PostgreSQL developers are an intractable herd of mule-headed idiots (although I have been known to be all of those things on occasion) but because the performance characteristics demonstrably suck, as most recently demonstrated by commit 4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b, which significantly improved performance with synchronous_commit=off by shaving an average of about one-tenth of a second off the time before hint bits could be set on any given tuple. Second, even if you did it, it wouldn't fix this problem, because hint bits aren't the only changes we make without WAL logging. In particular, when an index scan runs across a tuple that turns out to be dead, we kill the index tuple so that future scans don't need to revisit it. I haven't actually done any testing to measure how significant this is, but I wouldn't bet on it not mattering. I think it would be wonderful to come up with a design that either blunted the effect of hint bits or eliminated them altogether, but the idea that there's an obvious way forward there that we've simply refused to pursue is, AFAICT, flat wrong.
Let's not get sidetracked into talking about things that aren't going to change in time for 9.2 (and probably not 9.3, either, given that no one seems to be putting any time into it) and aren't even feasible solutions to the problem at hand (checksums) anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Jan 6, 2012 at 6:48 PM, Aidan Van Dyk ai...@highrise.ca wrote: I think I've said it before, but I'm guessing OLTP style database rarely have pages written that are dirty that aren't covered by real changes (so have the FPW anyways) and OLAP type generally freeze after loads to avoid the hint-bit-write penalty too... But ya, again, I've never measured ;-) -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] patch: fix SSI finished list corruption
There's a corner case in the SSI cleanup code that isn't handled correctly. It can arise when running workloads that are comprised mostly (but not 100%) of READ ONLY transactions, and can corrupt the finished SERIALIZABLEXACT list, potentially causing a segfault. The attached patch fixes it. Specifically, when the only remaining active transactions are READ ONLY, we do a partial cleanup of committed transactions because certain types of conflicts aren't possible anymore. For committed r/w transactions, we release the SIREAD locks but keep the SERIALIZABLEXACT. However, for committed r/o transactions, we can go further and release the SERIALIZABLEXACT too. The problem was with the latter case: we were returning the SERIALIZABLEXACT to the free list without removing it from the finished list. The only real change in the patch is the SHMQueueDelete line, but I also reworked some of the surrounding code to make it obvious that r/o and r/w transactions are handled differently -- the existing code felt a bit too clever. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/ From 39f8462332f998d7363058adabac412c7654befe Mon Sep 17 00:00:00 2001 From: Dan R. K. Ports d...@csail.mit.edu Date: Thu, 29 Dec 2011 23:11:35 -0500 Subject: [PATCH] Read-only SERIALIZABLEXACTs are completely released when doing partial cleanup, so remove them from the finished list. This prevents the finished list from being corrupted. Also make it more clear that read-only transactions are treated differently here.
---
 src/backend/storage/lmgr/predicate.c |   26 +++++++++++++++++++++++---
 1 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index aefa698..c0b3cb4 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3531,10 +3531,29 @@ ClearOldPredicateLocks(void)
 		else if (finishedSxact->commitSeqNo > PredXact->HavePartialClearedThrough
 				 && finishedSxact->commitSeqNo <= PredXact->CanPartialClearThrough)
 		{
+			/*
+			 * Any active transactions that took their snapshot before
+			 * this transaction committed are read-only, so we can
+			 * clear part of its state.
+			 */
 			LWLockRelease(SerializableXactHashLock);
-			ReleaseOneSerializableXact(finishedSxact,
-									   !SxactIsReadOnly(finishedSxact),
-									   false);
+
+			if (SxactIsReadOnly(finishedSxact))
+			{
+				/* A read-only transaction can be removed entirely */
+				SHMQueueDelete(&(finishedSxact->finishedLink));
+				ReleaseOneSerializableXact(finishedSxact, false, false);
+			}
+			else
+			{
+				/*
+				 * A read-write transaction can only be partially
+				 * cleared. We need to keep the SERIALIZABLEXACT but
+				 * can release the SIREAD locks and conflicts in.
+				 */
+				ReleaseOneSerializableXact(finishedSxact, true, false);
+			}
+
 			PredXact->HavePartialClearedThrough = finishedSxact->commitSeqNo;
 			LWLockAcquire(SerializableXactHashLock, LW_SHARED);
 		}
@@ -3640,6 +3659,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 	Assert(sxact != NULL);
 	Assert(SxactIsRolledBack(sxact) || SxactIsCommitted(sxact));
+	Assert(partial || !SxactIsOnFinishedList(sxact));
 	Assert(LWLockHeldByMe(SerializableFinishedListLock));
 
 	/*
-- 
1.7.8.2

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Second thoughts on CheckIndexCompatible() vs. operator families
In 367bc426a1c22b9f6badb06cd41fc438fd034639, I introduced a CheckIndexCompatible() that approves btree and hash indexes having changed to a different operator class within the same operator family. To make that valid, I also tightened the operator family contracts for those access methods to address casts. However, I've found two retained formal hazards. They're boring and perhaps unlikely to harm real users, but I'd rather nail them down just in case. First, opclasses like array_ops have polymorphic opcintype. Members of such operator classes could, in general, change behavior arbitrarily in response to get_fn_expr_argtype(). No core polymorphic operator family member does this. Nor could they, for no core index access method sets fn_expr. In the absence of responses to my previous inquiry[1] on the topic, I'm taking the position that the lack of fn_expr in these calls is an isolated implementation detail that could change at any time. Therefore, we should only preserve an index of polymorphic operator class when the old and new opcintype match exactly. This patch's test suite addition illustrates one ALTER TABLE ALTER TYPE affected by this new restriction: a conversion from an array type to a domain over that array type will now require an index rebuild. Second, as a thought experiment, consider a database with these three types: 1. int4, the core type. 2. int4neg, stores to disk as the negation of its logical value. Includes a complete set of operators in the integer_ops btree family, but defines no casts to other integer_ops-represented types. 3. int4other, stores to disk like int4. No operators. Has an implicit binary coercion cast to int4 and an explicit binary coercion cast to int4neg. Suppose a table in this database has a column of type int4other with an index of default operator class. By virtue of the implicit binary coercion and lack of other candidates, the index will use int4_ops. Now, change the type of that column from int4other to int4neg.
The operator class changes to int4neg_ops, still within the integer_ops family, and we do not rebuild the index. However, the logical sign of each value just flipped, making the index invalid. Where did CheckIndexCompatible() miss? An operator family only promises cast-compatibility when there exists an implicit or binary coercion cast between the types in question. There's no int4->int4neg cast here, not even a multiple-step pathway. CheckIndexCompatible() assumes that our ability to perform a no-rewrite ALTER TABLE ALTER TYPE implies the existence of such a cast. It does imply that for the before-and-after _table column types_, but it's the before-and-after _opcintype_ that matter here. I think we could close this hazard by having CheckIndexCompatible() test for an actual implicit or binary coercion cast between the opcintype of the old and new operator classes. However, I'm not confident to say that no similar problem would remain. Therefore, I propose ceasing to optimize intra-family index operator transitions and simply requiring exact matches of operator classes and exclusion operators. This does not remove any actual optimization for changes among core types, and it will be simpler to validate. (I designed the current rules under the misapprehension that varchar_ops was the default operator class for varchar columns. However, varchar_ops is a copy of text_ops apart from the name and being non-default. We ship no operator with a varchar operand, relying instead on binary coercion to text.) This patch does not, however, remove the new terms from the operator family contracts. While core PostgreSQL will no longer depend on them, they may again prove handy for future optimizations like this. I remain of the opinion that they're already widely (perhaps even universally) obeyed. This patch conflicts trivially with my patch "Avoid FK validations for no-rewrite ALTER TABLE ALTER TYPE" in that they both add test cases to the same location in alter_table.sql.
They're otherwise independent, albeit reflecting parallel principles. Thanks, nm

[1] http://archives.postgresql.org/message-id/20111229211711.ga8...@tornado.leadboat.com

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 712b0b0..1bf1de5 100644
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
*** 23,28 ****
--- 23,29 ----
  #include "catalog/pg_opclass.h"
  #include "catalog/pg_opfamily.h"
  #include "catalog/pg_tablespace.h"
+ #include "catalog/pg_type.h"
  #include "commands/dbcommands.h"
  #include "commands/defrem.h"
  #include "commands/tablecmds.h"
*** 52,57 ****
--- 53,59 ----
  /* non-export function prototypes */
  static void CheckPredicate(Expr *predicate);
  static void ComputeIndexAttrs(IndexInfo *indexInfo,
+ 				  Oid *typeOidP,
  				  Oid *collationOidP,
  				  Oid *classOidP,
Re: [HACKERS] LWLOCK_STATS
On Fri, Jan 6, 2012 at 2:24 PM, Robert Haas robertmh...@gmail.com wrote: It's been a while since I did any testing with LWLOCK_STATS defined, so I thought it might be about time to do that again and see how things look. Here's how they looked back in July: http://archives.postgresql.org/pgsql-hackers/2011-07/msg01373.php Here are the results from a test I ran today on latest sources, again on Nate Boley's machine. Five-minute pgbench run, scale factor 100, permanent tables, my usual config settings. What was the tps and/or number of transactions? I assume this was -c80 -j40? Also, do you know what percent of CPU time was spent idle during the test? If very little time is spent sleeping on lwlocks (i.e. CPU time near 100%), it doesn't matter much how that waiting is distributed. Also, was there a big difference in tps between LWLOCK_STATS defined and not defined (i.e. the overhead of doing the accounting)? Somewhat depressingly, virtually all of the interesting activity still centers around the same three locks that were problematic back then, which means that - although overall performance has improved quite a bit - we've not really delivered any knockout punches. Here are the perpetrators: I don't think that is depressing at all. Certain locks needs to exist to protect certain things, and a benchmark which tests those things is going to take those locks rather than some other set of locks. X times faster is still X times faster, even if the bottleneck hasn't moved to some other part of the code. but we haven't done anything about WALInsertLock and it's therefore grown from just over a third of the blocks to almost half. But not all blocks are for the same length of time. Do we know how much time is spent blocking? I've seen some code around that tries to instrument that, but on my machine at the time it added a lot of overhead so it couldn't be used effectively.
I can try to dig it up and update it to git-head if you want to try it and aren't already aware of it. (My code was based on something I found somewhere in this list.) Also, I assume this is without the recent Moving more work outside WALInsertLock applied? Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LWLOCK_STATS
On Fri, Jan 6, 2012 at 9:29 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Jan 6, 2012 at 2:24 PM, Robert Haas robertmh...@gmail.com wrote: It's been a while since I did any testing with LWLOCK_STATS defined, so I thought it might be about time to do that again and see how things look. Here's how they looked back in July: http://archives.postgresql.org/pgsql-hackers/2011-07/msg01373.php Here are the results from a test I ran today on latest sources, again on Nate Boley's machine. Five-minute pgbench run, scale factor 100, permanent tables, my usual config settings. What was the tps and/or number of transactions? I assume this was -c80 -j40? Sorry. -c 32 -j 32. tps was 9-10k, but I don't take it too seriously with LWLOCK_STATS defined, because it has some performance impact. Also, do you know what percent of CPU time was spent idle during the test? Sorry, no. If the very little time is spend sleeping on lwlocks (i.e. CPU time near 100%), it doesn't matter much how that waiting is distributed. Well, clearly, there is clearly a pretty big impact, because unlogged tables are much faster than regular tables. See for example: http://archives.postgresql.org/pgsql-hackers/2011-12/msg00095.php ...where the comparable results on slightly older sources are:

8 CLOG buffers, permanent tables: tps = 10025.079556 (including connections establishing)
32 CLOG buffers, permanent tables: tps = 11247.358762 (including connections establishing)
8 CLOG buffers, unlogged tables: tps = 16999.301828 (including connections establishing)
32 CLOG buffers, unlogged tables: tps = 19653.023856 (including connections establishing)

As of today, you get 32 CLOG buffers without patching the source code. That test was also done before commits d573e239f03506920938bf0be56c868d9c3416da and 0d76b60db4684d3487223b003833828fe9655fe2, which further optimized ProcArrayLock. Also, was there a big difference in tps between LWLOCK_STATS defined and not defined (i.e. the overhead of doing the accounting)?
Yes, see notes above. Somewhat depressingly, virtually all of the interesting activity still centers around the same three locks that were problematic back then, which means that - although overall performance has improved quite a bit - we've not really delivered any knockout punches. Here are the perpetrators: I don't think that is depressing at all. Certain locks needs to exist to protect certain things, and a benchmark which tests those things is going to take those locks rather than some other set of locks. X times faster is still X times faster, even if the bottleneck hasn't move to some other part of the code. True. What I don't like is: I think we've really only pushed the bottleneck out a few cores. Throw a 64-core machine at it and we're going to have all these same problems over again. I'd like to find solutions that change the dynamic in a more fundamental way, so that we buy a little more. But I'm not going to complain too much; the performance gains we've gotten with these techniques are obviously quite substantial, even though they're not a total solution. but we haven't done anything about WALInsertLock and it's therefore grown from just over a third of the blocks to almost half. But not all blocks are for the same length of time. Do we know how much time is spent blocking? I've seen some code around that tries to instrument that, but on my machine of the time it added a lot of overhead so it couldn't be used effectively. I can try to dig it up and update it to git-head if you want to try it and aren't already aware of it. (My code was based on something I found somewhere in this list.) I haven't tried it for reasons of overhead, but I'm aware of the problem. Also, I assume this is without the recent Moving more work outside WALInsertLock applied? Right. If we can get that done for 9.2, we'll be cooking with gas - on my tests that was a big improvement. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [v9.2] Fix Leaky View Problem
On Tue, Jan 3, 2012 at 12:11 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:

2011/12/23 Robert Haas robertmh...@gmail.com: On Fri, Dec 23, 2011 at 5:56 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: I'd like the regression test on select_view to be committed also, to detect unexpected changes in the future. How about it?

Can you resend that as a separate patch? I remember there were some things I didn't like about it, but I don't remember what they were at the moment...

Sorry for this late response. The attached one is a patch for the regression test that checks the scenario of a malicious function with/without the security_barrier option. I guess you were concerned that expected/select_views_1.out is patched, not expected/select_views.out. I'm not sure why the regression test script tries to diff results/select_views against expected/select_views_1.out.

select_views.out and select_views_1.out are alternate expected output files. The regression tests pass if the actual output matches either one. Thus, you have to patch both.

BTW, can you also resubmit the leakproof stuff as a separate patch for the last CF? Want to make sure we get that into 9.2, if at all possible.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
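The alternate-expected-file rule explained above can be illustrated with a simplified sketch. This is not pg_regress itself, and the file contents are made up: the point is only that a test passes when the actual output matches any one of the expected variants, which is why a behavior change must be applied to every variant that is supposed to keep passing.

```python
def regression_passes(actual, alternates):
    """Simplified pg_regress-style check: a test passes if the actual
    output matches ANY of the alternate expected files
    (e.g. select_views.out, select_views_1.out)."""
    return any(actual == expected for expected in alternates)


# Hypothetical outputs: a base expected file and one alternate.
select_views_out = ["view1 | t", "view2 | f"]
select_views_1_out = ["view1 | t", "view2 | t"]
alternates = [select_views_out, select_views_1_out]

print(regression_passes(["view1 | t", "view2 | t"], alternates))  # True: matches _1.out
print(regression_passes(["view1 | f", "view2 | t"], alternates))  # False: matches neither
```

Patching only one of the two files therefore still lets the test "pass" on platforms where the other variant happens to match - the failure only shows up where the unpatched variant is the one compared.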
Re: [HACKERS] Extensions and 9.2
On Fri, Dec 23, 2011 at 5:45 AM, Daniel Farina dan...@heroku.com wrote:

On Wed, Dec 21, 2011 at 5:46 AM, Robert Haas robertmh...@gmail.com wrote: Assuming the command in question can be stuffed inside a function, the most you're gaining is a little notational convenience

I can answer that one (why a full-blown mechanism for a notational convenience). It has occurred to me to use this mechanism to support extensions, but I find the prospect of having to teach people special operators to understand how to use the standard extension feature an unstomachable prospect.

The single biggest problem is that pg_restore will not know to call this function rather than run CREATE EXTENSION, and then two databases, prepared exactly the same, cannot be pg_dump-ed and restored in a reasonable way. If there's a definable whitelist, then this vital functionality will work.

There are other sicknesses imposed if one has to hack up how to delegate extension creation to non-superusers:

* The postgres manual, wiki, and ecosystem of recipes on the web and internet at large basically are not going to work without modification. Death by a thousand cuts.

* \df in psql displays some new operators that you aren't familiar with. Also, putting aside the fact that your pg_dump/pg_restore-generated SQL will not work, it will look funny. This is an eyesore.

* If one tells someone else "yeah, the system supports extensions", they're going to type CREATE EXTENSION. And then it's not going to work, and then they're either going to give up, yak-shave for a few hours figuring out what they were supposed to do for their provider or organization, or - maybe worst of all - hack their way around functionality the extension could have provided in a cleaner way had it just worked nice and tidy to begin with.
I find this functionality basically vital because it greatly decreases the friction between people, organizations, and software in the domain of deploying, reasoning, and communicating about the installation and status of Postgres extensions in a database.

OK, I'll buy that. I think we need to consider the design of the mechanism carefully, though, or we'll end up with something messy and inconvenient.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
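The dump/restore argument made in this thread can be modeled in a few lines. This is a toy sketch with made-up names, not PostgreSQL's implementation or any proposed patch: pg_dump emits plain CREATE EXTENSION statements, so a provider-specific wrapper function breaks restores, whereas a whitelist lets the standard command succeed for non-superusers on the whitelisted extensions.

```python
# Extensions the administrator has deemed safe to delegate (hypothetical set).
WHITELIST = {"hstore", "pg_trgm"}


def run_create_extension(name, is_superuser):
    """Toy model of what the server does on CREATE EXTENSION <name>
    under a whitelist scheme: superusers can install anything,
    everyone else only whitelisted extensions."""
    if is_superuser or name in WHITELIST:
        return f"installed {name}"
    raise PermissionError(f'permission denied to create extension "{name}"')


# A restore script always contains the standard command, never a
# provider-specific wrapper call - so the whitelist is what makes it work.
dump_script = ["CREATE EXTENSION hstore", "CREATE EXTENSION pg_trgm"]
for stmt in dump_script:
    name = stmt.split()[-1]
    print(run_create_extension(name, is_superuser=False))
```

The design point is that the whitelist check lives behind the unchanged, standard syntax, so dumps, documentation, and user habits all keep working.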
[HACKERS] broken link to PostgreSQL 9.1 repository for Fedora 14
Hello,

There is a broken link on the http://yum.postgresql.org/repopackages.php page:

PostgreSQL 9.1 - Fedora 14 - http://yum.postgresql.org/9.1/fedora/fedora-14-i386/pgdg-fedora-9.1-2.noarch.rpm - 404 Not Found

Regards, Pavel Stehule
Re: [HACKERS] LWLOCK_STATS
On 07.01.2012 00:24, Robert Haas wrote: It's been a while since I did any testing with LWLOCK_STATS defined, so I thought it might be about time to do that again and see how things look. Here's how they looked back in July: http://archives.postgresql.org/pgsql-hackers/2011-07/msg01373.php

Interesting. A couple of weeks ago I wrote a little patch that's similar to LWLOCK_STATS, but it prints out the % of wallclock time that is spent acquiring, releasing, or waiting for a lock. I find that more useful than the counters. Here's the patch; I hope it's useful to others. It uses timer_create() and timer_settime(), so it probably won't work on all platforms, and requires linking with -lrt.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
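The sampling approach described above can be sketched in miniature. This standalone Python sketch uses signal.setitimer() rather than timer_create()/timer_settime() (so no -lrt, though it is still Unix-only), and the state labels are made up for the illustration rather than taken from PostgreSQL: a periodic profiling signal samples what the process is currently doing, yielding the percentage of time spent in each state instead of raw event counts.

```python
import signal
import time

# Buckets for the sampling counters; the labels are illustrative only.
samples = {"waiting": 0, "acquire/release": 0, "elsewhere": 0}
state = "elsewhere"


def on_tick(signum, frame):
    samples[state] += 1   # attribute this tick to whatever we were doing


signal.signal(signal.SIGPROF, on_tick)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # tick per ~1 ms of CPU time


def spin(cpu_seconds):
    """Burn roughly this much CPU time."""
    t0 = time.process_time()
    while time.process_time() - t0 < cpu_seconds:
        pass


# Simulate a workload: ~100 ms "waiting", ~40 ms in "acquire/release",
# ~60 ms everywhere else.
state = "waiting";          spin(0.10)
state = "acquire/release";  spin(0.04)
state = "elsewhere";        spin(0.06)

signal.setitimer(signal.ITIMER_PROF, 0)             # stop sampling

total = sum(samples.values())
for name, count in samples.items():
    print(f"{name}: {100.0 * count / total:.0f}% of CPU time")
```

Because each tick costs only a counter increment in the signal handler, the overhead stays low regardless of how many lock operations occur between ticks - the trade-off being statistical rather than exact figures.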