Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Wed, May 25, 2011 at 9:41 AM, Tom Lane wrote: > Yeah, I had been thinking about the latter point. We could be > conservative and just keep the reported tuple density the same (ie, > update relpages to the new correct value, while setting reltuples so > that the density ratio doesn't change). But that has its own problems > when the table contents *do* change. What I'm currently imagining is > to do a smoothed moving average, where we factor in the new density > estimate with a weight dependent on the percentage of the table we did > scan. That is, the calculation goes something like > > old_density = old_reltuples / old_relpages > new_density = counted_tuples / scanned_pages > reliability = scanned_pages / new_relpages > updated_density = old_density + (new_density - old_density) * reliability > new_reltuples = updated_density * new_relpages This amounts to assuming that the pages observed in the vacuum have the density observed and the pages that weren't seen have the density that were previously in the reltuples/relpages stats. That seems like a pretty solid approach to me. If the numbers were sane before it follows that they should be sane after the update. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
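Restated as a tiny function, Tom's pseudocode is easy to play with (this is a paraphrase of the formulas above, not the actual PostgreSQL code; the function name is invented):

```python
def updated_reltuples(old_reltuples, old_relpages,
                      counted_tuples, scanned_pages, new_relpages):
    """Smoothed moving-average update of reltuples, per Tom's formulas."""
    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    # Weight the new observation by the fraction of the table scanned.
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    return updated_density * new_relpages
```

A full scan (reliability = 1.0) adopts the observed density outright, while a scan that touched only a few pages barely moves the estimate, which matches Greg's reading: unseen pages keep their previously recorded density.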
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Wed, May 25, 2011 at 10:05 PM, Greg Stark wrote: >> updated_density = old_density + (new_density - old_density) * reliability >> new_reltuples = updated_density * new_relpages > > This amounts to assuming that the pages observed in the vacuum have > the density observed and the pages that weren't seen have the density > that were previously in the reltuples/relpages stats. In case it's not clear, Tom's expression for updated_density is equivalent by simple algebra to: updated_density = (old_density * (1-reliability)) + (new_density * reliability) -- greg
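The equivalence is just the convex-combination form of the same update, which a quick numeric check confirms (densities below are made up):

```python
def form_tom(old_density, new_density, reliability):
    # As written upthread: old value plus a reliability-weighted correction.
    return old_density + (new_density - old_density) * reliability

def form_greg(old_density, new_density, reliability):
    # Equivalent form: a weighted average of the old and new densities.
    return old_density * (1 - reliability) + new_density * reliability
```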
Re: [HACKERS] tackling full page writes
On Thu, May 26, 2011 at 1:18 PM, Robert Haas wrote: >> The replay of the WAL record for A doesn't rely on the content of chunk 1 >> which B modified. So I don't think that "partial page writes" has such >> a problem. >> No? > > Sorry. WAL records today DO rely on the prior state of the page. If > they didn't, we wouldn't need full page writes. They don't rely on > them terribly heavily - things like where pd_upper is pointing, and > what the page LSN is. But they do rely on them. Yeah, I'm sure that a normal WAL record (neither a full page write nor a "partial page write") relies on the prior state of the page. But the WAL record for A is a "partial page write"; does it also rely on the prior state? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] tackling full page writes
On Wed, May 25, 2011 at 10:09 PM, Fujii Masao wrote: > On Wed, May 25, 2011 at 9:34 PM, Robert Haas wrote: >> On Tue, May 24, 2011 at 10:52 PM, Jeff Davis wrote: >>> On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote: As I think about it a bit more, we'd need to XLOG not only the parts of the page we're actually modifying, but any that the WAL record would need to be correct on replay. >>> >>> I don't understand that statement. Can you clarify? >> >> I'll try. Suppose we have two WAL records A and B, with no >> intervening checkpoint, that both modify the same page. A reads chunk >> 1 of that page and then modifies chunk 2. B modifies chunk 1. Now, >> suppose we make A do a "partial page write" on chunk 2 only, and B do >> the same for chunk 1. At the point the system crashes, A and B are >> both on disk, and the page has already been written to disk as well. >> Replay begins from a checkpoint preceding A. >> >> Now, when we get to the record for A, what are we to do? If it were a >> full page image, we could just restore it, and everything would be >> fine after that. But if we replay the partial page write, we've got >> trouble. A will now see the state of chunk 1 as it existed after >> the action protected by B occurred, and will presumably do the wrong >> thing. > > If this is really true, full page writes would also cause a similar problem. > No? Imagine the case where A reads page 1, then modifies page 2, and B > modifies page 1. During recovery, A will see the state of page 1 as it existed > after the action by B. Yeah, but it won't matter, because the LSN interlock will prevent A from taking any action. If you only write parts of the page, though, the concept of "the" LSN of the page becomes a bit murky, because you may have different parts of the page from different points in the WAL stream. 
I believe it's possible to cope with that if we design it carefully, but it does seem rather complex and error-prone (which is not necessarily the best design for a recovery system, but hey). Anyway, you can either have the partial page write for A restore the older LSN, or not. If you do, then you have the problem as I described it. If you don't, then the effects of A vanish into the ether. Either way, it doesn't work. > The replay of the WAL record for A doesn't rely on the content of chunk 1 > which B modified. So I don't think that "partial page writes" has such > a problem. > No? Sorry. WAL records today DO rely on the prior state of the page. If they didn't, we wouldn't need full page writes. They don't rely on them terribly heavily - things like where pd_upper is pointing, and what the page LSN is. But they do rely on them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
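For readers following along, the whole-page LSN interlock Robert refers to works roughly like this (a simplified sketch, not the real recovery code; the point is that with partial page writes there is no longer a single page LSN to test against):

```python
def replay_record(page, record):
    """Apply a WAL record during replay only if the page predates it.

    This test is only meaningful when the page is written as a whole,
    so that its LSN reflects a single point in the WAL stream.
    """
    if page["lsn"] >= record["lsn"]:
        return False  # page already contains this change; skip it
    record["redo"](page)
    page["lsn"] = record["lsn"]
    return True
```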
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 11:51 PM, Pavan Deolasee wrote: >> Agreed. The only thing I'm trying to do further is to avoid the need >> for a reshuffle when the special LSN storage is reclaimed. > > Ah ok. That was never clear from your initial emails or maybe I > misread. Sorry, I must not have explained it very well. :-( > So what you are saying is by storing LSN after line pointer > array, we might be able to reclaim LSN storage without shuffling. That > makes sense. Having said that, it doesn't excite me too much because I > think we should do the dead line pointer reclaim operation during page > pruning and we are already holding cleanup lock at that time and most > likely do a reshuffle anyways. I'll give that a firm maybe. If there is no reshuffle, then you can do this with just an exclusive content lock. Maybe that's worthless, but I'm not certain of it. I guess we might need to see how the code shakes out. Also, reshuffling might be more expensive. I agree that if there are new dead tuples on the page, then you're going to be paying that price anyway; but if not, it might be avoidable. > Also a downside of storing LSN after line pointer array is that you > may waste space because of alignment issues. We could possibly store it unaligned and read it back two bytes at a time. Granted, that's not free. > I also thought that the > LSN might come in between extending line pointer array, but probably > that's not a big deal since if there is free space in the page (and > there should be if we are adding a new tuple), it should be available > immediately after the LSN. Yeah. I'm not sure how icky that is, though. > There are some other issues that we should think about too. Like > recording free space and managing visibility map. The free space is > recorded in the second pass today, but I don't see any reason why > that can't be moved to the first pass. It's not clear though if we > should also record free space after retail page vacuum or leave it as > it is. 
Not sure. Any idea why it's like that, or why we might want to change it? > For visibility maps, we should not update them while there are > LP_DEAD line pointers on the page. Now that's not good because all > tuples in the page may be visible, so we may lose some advantage, at > least for a while, but if we mark the page all-visible, the vacuum scan > would not find the dead line pointers in it and that would leave > dangling index pointers after an index vacuum. Also, an index-only scan might return index tuples that are pointing to dead line pointers. Currently, I believe the only way a page can get marked all-visible is by vacuum. But if we make this change, then it would be possible for a HOT cleanup to encounter a situation where all-visible could be set. We probably want to make that work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 11:39 PM, Robert Haas wrote: > On Wed, May 25, 2011 at 1:43 PM, Pavan Deolasee > Now, there >> is no way you can store it after the line pointer array without moving >> the live tuple somewhere else. > > So far I agree. But don't we always defragment immediately after > pruning dead tuples to line pointers? The removal of even one tuple > will give us more than enough space to store the LSN. > Yes, we do. But defragment means shuffling tuples around. So we agree that to find space for the LSN, we might need to move the tuples around. > > Agreed. The only thing I'm trying to do further is to avoid the need > for a reshuffle when the special LSN storage is reclaimed. Ah ok. That was never clear from your initial emails or maybe I misread. So what you are saying is by storing LSN after line pointer array, we might be able to reclaim LSN storage without shuffling. That makes sense. Having said that, it doesn't excite me too much because I think we should do the dead line pointer reclaim operation during page pruning and we are already holding cleanup lock at that time and most likely do a reshuffle anyways. Also a downside of storing LSN after line pointer array is that you may waste space because of alignment issues. I also thought that the LSN might come in between extending line pointer array, but probably that's not a big deal since if there is free space in the page (and there should be if we are adding a new tuple), it should be available immediately after the LSN. There are some other issues that we should think about too. Like recording free space and managing visibility map. The free space is recorded in the second pass today, but I don't see any reason why that can't be moved to the first pass. It's not clear though if we should also record free space after retail page vacuum or leave it as it is. For visibility maps, we should not update them while there are LP_DEAD line pointers on the page. 
Now that's not good because all tuples in the page may be visible, so we may lose some advantage, at least for a while, but if we mark the page all-visible, the vacuum scan would not find the dead line pointers in it and that would leave dangling index pointers after an index vacuum. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] SSI predicate locking on heap -- tuple or row?
> Dan Ports wrote: > On Tue, May 24, 2011 at 04:18:37AM -0500, Kevin Grittner wrote: >> These proofs show that there is no legitimate cycle which could >> cause an anomaly which the move from row-based to tuple-based >> logic will miss. They don't prove that the change will generate >> all the same serialization failures; and in fact, some false >> positives are eliminated by the change. > > Yes, that's correct. That's related to the part in the proof where > I claimed T3 couldn't have a conflict out *to some transaction T0 > that precedes T1*. > > I originally tried to show that T3 couldn't have any conflicts out > that T2 didn't have, which would mean we got the same set of > serialization failures, but that's not true. In fact, it's not too > hard to come up with an example where there would be a > serialization failure with the row version links, but not without. > However, because the rw-conflict can't be pointing to a transaction > that precedes T1 in the serial order, it won't create a cycle. In > other words, there are serialization failures that won't happen > anymore, but they were false positives. Dan and I went around a couple times chasing down all code, comment, and patch changes needed, resulting in the attached patch. We found and fixed the bug which originally manifested in a way which I confused with a need for row locks, as well as another which was nearby in the code. We backed out the changes which were causing merge problems for Robert, as those were part of the attempt at the row locking (versus tuple locking). We removed a function which is no longer needed. We adjusted the comments and an affected isolation test. As might be expected from removing an unnecessary feature, the lines of code went down -- a net decrease of 93 lines. This patch applies against HEAD, `make check-world` passes as does `make -C src/test/isolation installcheck` and `make installcheck-world` against a database with default_transaction_isolation = 'serializable'. 
Dan is running stress testing against the patched version tonight with DBT-2. These changes generate merge conflicts with the work I've done on handling CLUSTER, DROP INDEX, etc. It seems to me that the best course would be to commit this, then I can rebase the other work and post it. Since these issues are orthogonal, it didn't seem like a good idea to combine them in one patch, and this one seems more urgent. -Kevin [Attachment: ssi-tuple-not-row-locking-3.patch]
Re: [HACKERS] tackling full page writes
On Wed, May 25, 2011 at 9:34 PM, Robert Haas wrote: > On Tue, May 24, 2011 at 10:52 PM, Jeff Davis wrote: >> On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote: >>> As I think about it a bit more, we'd >>> need to XLOG not only the parts of the page we're actually modifying, but >>> any that the WAL record would need to be correct on replay. >> >> I don't understand that statement. Can you clarify? > > I'll try. Suppose we have two WAL records A and B, with no > intervening checkpoint, that both modify the same page. A reads chunk > 1 of that page and then modifies chunk 2. B modifies chunk 1. Now, > suppose we make A do a "partial page write" on chunk 2 only, and B do > the same for chunk 1. At the point the system crashes, A and B are > both on disk, and the page has already been written to disk as well. > Replay begins from a checkpoint preceding A. > > Now, when we get to the record for A, what are we to do? If it were a > full page image, we could just restore it, and everything would be > fine after that. But if we replay the partial page write, we've got > trouble. A will now see the state of chunk 1 as it existed after > the action protected by B occurred, and will presumably do the wrong > thing. If this is really true, full page writes would also cause a similar problem. No? Imagine the case where A reads page 1, then modifies page 2, and B modifies page 1. During recovery, A will see the state of page 1 as it existed after the action by B. The replay of the WAL record for A doesn't rely on the content of chunk 1 which B modified. So I don't think that "partial page writes" has such a problem. No? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner
On Wed, May 25, 2011 at 8:37 PM, Michael Nolan wrote: > On Wed, May 25, 2011 at 11:18 AM, Robert Haas wrote: >> >> I basically agree. There have been several recent discussions of this >> topic on both -hackers and -performance; it is likely that the TODO >> needs to be updated with some more recent links. > > Anything to help the NKOTB to get up to speed would be appreciated, though I > still think it is not just a 'caching' issue. > > The question I hesitated to ask in Ottawa was: So, what information would > you like and what would you do with it? *scratches head* I'm not sure I understand your question. Can you elaborate? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner
On Wed, May 25, 2011 at 11:18 AM, Robert Haas wrote: > > I basically agree. There have been several recent discussions of this > topic on both -hackers and -performance; it is likely that the TODO > needs to be updated with some more recent links. > Anything to help the NKOTB to get up to speed would be appreciated, though I still think it is not just a 'caching' issue. The question I hesitated to ask in Ottawa was: So, what information would you like and what would you do with it? -- Mike Nolan
Re: [HACKERS] tackling full page writes
On Tue, May 24, 2011 at 1:34 PM, Robert Haas wrote: > 1. Heikki suggested that instead of doing full page writes, we might > try to write only the parts of the page that have changed. For > example, if we had 16 bits to play with in the page header (which we > don't), then we could imagine the page as being broken up into 16 > 512-byte chunks, one per bit. Each time we update the page, we write > whatever subset of the 512-byte chunks we're actually modifying, > Alternately we could have change vectors which are something like which I think would be a lot less wasteful than dumping 512 byte chunks. The main advantage of 512 byte chunks is it's easier to figure out what chunks to include in the output and if you're replacing the entire block it looks just like our existing system including not having to read in the page before writing. If we output change vectors then we need to do some arithmetic to figure out when it makes sense to merge records and how much of the block we need to be replacing before we just decide to include the whole block. -- greg
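The 512-byte-chunk scheme amounts to a chunkwise diff of the before and after page images. A rough sketch (the chunk size and 16-bit mask come from Robert's description; the code itself is illustrative, assuming 8 kB pages):

```python
BLCKSZ = 8192    # assumed page size
CHUNKSZ = 512
NCHUNKS = BLCKSZ // CHUNKSZ  # 16 chunks, so one bit each fits in 16 bits

def changed_chunks(old_page: bytes, new_page: bytes) -> int:
    """Return a 16-bit mask with one bit set per modified 512-byte chunk."""
    mask = 0
    for i in range(NCHUNKS):
        lo, hi = i * CHUNKSZ, (i + 1) * CHUNKSZ
        if old_page[lo:hi] != new_page[lo:hi]:
            mask |= 1 << i
    return mask
```

A change vector would instead record (offset, length, data) runs, which is more compact for small scattered edits but needs the merge arithmetic Greg describes.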
Re: [HACKERS] pg_upgrade automatic testing
On Wed, May 25, 2011 at 3:07 PM, Peter Eisentraut wrote: > On ons, 2011-04-27 at 18:14 -0400, Noah Misch wrote: >> Enthusiastic +1 for this concept. There's at least one rough edge: it fails >> if >> you have another postmaster running on port 5432. > > This has now been addressed: pg_upgrade accepts PGPORT settings. > Attached is a slightly updated patch that runs the test suite with a port of > 65432, which you can override by setting PGPORT yourself. > > Anyway, is this something that people want in the repository? It's not > as polished as the pg_regress business, but it is definitely helpful. Is this going to result in using the built binaries with the installed libraries, a la Tom's recent complaint about the isolation tests? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Wed, May 25, 2011 at 5:54 PM, Tom Lane wrote: > Robert Haas writes: >> On Wed, May 25, 2011 at 1:04 PM, Tom Lane wrote: >>> Because the problem is not specific to TOAST tables. As things >>> currently stand, we will accept the word of an ANALYZE as gospel even if >>> it scanned 1% of the table, and completely ignore the results from a >>> VACUUM even if it scanned 99% of the table. This is not sane. > >> I agree that if VACUUM scanned 99% of the table, it's probably fine to >> use its numbers. It's also fine to use the numbers from ANALYZE, >> because those pages are chosen randomly. What bothers me is the idea >> of using a small *non-random* sample, and I'm not sure that >> incorporating possibly-bogus results slowly is any better than >> incorporating them quickly. > > The above is simply fuzzy thinking. The fact that ANALYZE looked at a > random subset of pages is *no guarantee whatsoever* that its results are > highly accurate. They might be more trustworthy than VACUUM's nonrandom > sample of a similar number of pages, but it doesn't hold even a little > bit of water to claim that we should believe ANALYZE completely and > VACUUM not at all even when the latter has looked at a significantly > larger sample of pages. I think you're arguing against a straw-man. > In any case, your line of thought doesn't help us for fixing the problem > with toast tables, because we aren't going to start doing ANALYZEs on > toast tables. Can we simply use a constant for tuple density on TOAST tables? > The bottom line here is that making use of stats we have is a lot better > than not making use of them, even if they aren't entirely trustworthy. http://dilbert.com/strips/comic/2008-05-07/ -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Displaying optimized CASE expressions (was Re: [HACKERS] Nested CASE-WHEN scoping)
I wrote: > Heikki Linnakangas writes: >> BTW, i just stumbled into this: >> postgres=# explain verbose SELECT CASE now() WHEN (29+random()::int4) >> THEN 'foo' ELSE 'bar' END; >> ERROR: unexpected CASE WHEN clause: 326 >> Looks like ruleutils.c is also not prepared for the case that the >> implicit equality operation gets inlined into something else than an OpExpr. > Grumble ... I thought we'd fixed that ... Yeah, you're right. We've hacked that code so it can handle some transformations that the optimizer might apply, but inlining some random expression to replace the equality operator is far beyond what we can hope to deal with. For those following along at home, the point is that if the user writes CASE test_expr WHEN cmp_expr THEN ... the parser identifies the equality operator to use and produces something that looks like this: CASE test_expr WHEN CaseTestExpr = cmp_expr THEN ... We really need ruleutils.c to generate the original form when it is looking at a stored rule (eg a view), so it goes to some lengths to recognize "CaseTestExpr = something" in a WHEN clause and only print the "something". However, this example shows that there's no chance of always being able to do that when looking at an expression that's been through the planner. I think what we'd better do, if we don't see something that looks like "CaseTestExpr = something", is just print whatever we do have in the WHEN clause. That will require inventing a print representation for CaseTestExpr, since in most cases that's going to appear in there somewhere. I suggest we just print CASE_TEST_EXPR, but if anyone wants to bikeshed, feel free ... Note that if Heikki does what I suggested upthread, the display will eventually probably look like "$nn" instead (since it'll be a Param not a CaseTestExpr). But that's 9.2 or later material. 
regards, tom lane
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
> I thought the problem was that they upgraded the OS and now the encoding > names changed, though they behaved the same. Is that not what is > happening? Can they supply the values with different cases? > In my case I never touched the locale. It was set by the OS. I presume this is true for most people.
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas writes: > On Wed, May 25, 2011 at 1:04 PM, Tom Lane wrote: >> Because the problem is not specific to TOAST tables. As things >> currently stand, we will accept the word of an ANALYZE as gospel even if >> it scanned 1% of the table, and completely ignore the results from a >> VACUUM even if it scanned 99% of the table. This is not sane. > I agree that if VACUUM scanned 99% of the table, it's probably fine to > use its numbers. It's also fine to use the numbers from ANALYZE, > because those pages are chosen randomly. What bothers me is the idea > of using a small *non-random* sample, and I'm not sure that > incorporating possibly-bogus results slowly is any better than > incorporating them quickly. The above is simply fuzzy thinking. The fact that ANALYZE looked at a random subset of pages is *no guarantee whatsoever* that its results are highly accurate. They might be more trustworthy than VACUUM's nonrandom sample of a similar number of pages, but it doesn't hold even a little bit of water to claim that we should believe ANALYZE completely and VACUUM not at all even when the latter has looked at a significantly larger sample of pages. In any case, your line of thought doesn't help us for fixing the problem with toast tables, because we aren't going to start doing ANALYZEs on toast tables. The bottom line here is that making use of stats we have is a lot better than not making use of them, even if they aren't entirely trustworthy. regards, tom lane
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
Tom Lane wrote: > Robert Haas writes: > > On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian wrote: > >> I thought the problem was that they upgraded the OS and now the encoding > >> names changed, though they behaved the same. Is that not what is > >> happening? Can they supply the values with different cases? > > > Oh, hmm. I don't know. > > The original complaint wasn't too clear. But I agree with Robert that > case-insensitive comparison is about as far as it seems safe to go in > trying to allow for variant spellings of locale names. I would think > that a platform that randomly whacks such names around is going to find > a whole lot of software breaking, not only Postgres. Agreed. Unless an expert shows up and can tell us exactly how the naming works, we have gone as far as we can go. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Nested CASE-WHEN scoping
Heikki Linnakangas writes: > On 25.05.2011 17:47, Tom Lane wrote: >> [ scratches head ... ] Why does the save/restore in ExecEvalCase not >> take care of this? > The mistake happens during planning, when the SQL function is inlined > and pre-evaluated. OK, I see the problem now: we have a CaseTestExpr in the arguments of the function, which we are unable to reduce to a constant, so it gets substituted as-is into the body of the function during inlining. And then it's physically inside the CASE expression that's in the function body, so it looks like it syntactically belongs to that expression, which it doesn't. You're probably right that this is impractical to fix without redesigning the expression representation, and that CoerceToDomainValue has got a similar issue. My advice is to not change the parser output representation, though. What I think you ought to do about this is to have the planner replace CaseTestExpr and CoerceToDomainValue with PARAM_EXEC Params during expression preprocessing, and assign suitable Param numbers which it sticks into the CaseExpr (resp CoerceToDomainExpr) so that those expressions know which Param slot to fill at runtime. The const-simplification logic can avoid getting dumber by treating the cases like known-Param-value cases. I don't think you need to invent something separate from the PARAM_EXEC infrastructure to handle these. The main annoyance here is that standalone expressions may now need Param slots to execute, which will complicate places that need to evaluate them, but there's probably no way around that (a bespoke mechanism would complicate those callers just as much, if the number of value slots it needs is variable, which it will be). 
> BTW, i just stumbled into this: > postgres=# explain verbose SELECT CASE now() WHEN (29+random()::int4) > THEN 'foo' ELSE 'bar' END; > ERROR: unexpected CASE WHEN clause: 326 > Looks like ruleutils.c is also not prepared for the case that the > implicit equality operation gets inlined into something else than an OpExpr. Grumble ... I thought we'd fixed that ... regards, tom lane
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
Robert Haas writes: > On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian wrote: >> I thought the problem was that they upgraded the OS and now the encoding >> names changed, though they behaved the same. Is that not what is >> happening? Can they supply the values with different cases? > Oh, hmm. I don't know. The original complaint wasn't too clear. But I agree with Robert that case-insensitive comparison is about as far as it seems safe to go in trying to allow for variant spellings of locale names. I would think that a platform that randomly whacks such names around is going to find a whole lot of software breaking, not only Postgres. regards, tom lane
Re: [HACKERS] Pull up aggregate subquery
Robert Haas writes: > I think getting it working is probably a good first goal. I am not > really sure that we want to commit it that way, and I think my vote > would be for you to work on the approach we discussed before rather > than this one, but it's your project, and I think you'll probably > learn enough in getting it working that it will be a step forward in > any case. The planner is complex enough that it's worth trying to get > something that works, first, before trying to make it perfect. Remember Brooks' advice: "Plan to throw one away. You will anyhow." regards, tom lane
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
Alvaro Herrera writes: > Excerpts from Tom Lane's message of mar may 24 17:11:17 -0400 2011: >> Right. It would also increase the cognitive load on the user to have >> to remember the command-line go-to-line-number switch for his editor. >> So I don't particularly want to redesign this feature. However, I can >> see the possible value of letting EDITOR_LINENUMBER_SWITCH be set from >> the same place that you set EDITOR, which would suggest that we allow >> the value to come from an environment variable. I'm not sure whether >> there is merit in allowing both that source and ~/.psqlrc, though >> possibly for Windows users it might be easier if ~/.psqlrc worked. > If we're going to increase the number of options in .psqlrc that do not > work with older psql versions, can I please have .psqlrc-MAJORVERSION or > some such? Having 8.3's psql complain all the time because it doesn't > understand "linestyle" is annoying. 1. I thought we already did have that. 2. In any case, EDITOR_LINENUMBER_SWITCH isn't a hazard for this, because older versions will just think it's a variable without any special meaning. But the real question here is whether we want to change it to be also (or instead?) an environment variable. regards, tom lane
Re: [HACKERS] use less space in xl_xact_commit patch
On Wed, May 25, 2011 at 3:05 PM, Simon Riggs wrote: > On Wed, May 25, 2011 at 3:41 PM, Simon Riggs wrote: >> On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci >> wrote: Da: Simon Riggs I can't find a clear discussion of what you are trying to do, and how, just a URL back to a complex discussion on another topic. >>> >>> >>> While trying to write a patch to allow changing an unlogged table into >>> a logged one, I had to add another int field to xl_xact_commit. >>> Robert Haas said: >>> >>> "I have to admit I don't like this approach very much. I can't see >>> adding 4 bytes to every commit record for this feature." >>> >>> >>> which is a correct remark. >>> >>> xl_xact_commit can contain some arrays (relations to drop, >>> committed sub-trans, shared invalidation msgs). The length of >>> these arrays is specified using 3 ints in the struct. >>> >>> So, to avoid adding more ints to the struct, I've been suggested to >>> remove all the ints, and use xl_xact_commit.xinfo to flag which >>> arrays are, in fact, present. >>> >>> So the whole idea is: >>> >>> - remove nrels, nsubxacts and nmsgs from xl_xact_commit >>> - use bits in xinfo to signal which arrays are present at the end >>> of xl_xact_commit >>> - for each present array, add the length of the array (as int) at >>> the end of xl_xact_commit >>> - add each present array after all the lengths >> >> >> OK, that's clear. Thanks. >> >> That formatting sounds quite complex. >> >> I would propose we split this into 2 WAL records: xl_xact_commit and >> xl_xact_commit_with_info >> >> xl_xact_commit doesn't have any flags, counts or arrays. >> >> xl_xact_commit_with_info always has all 3 counts, even if zero. >> Arrays follow the main record >> >> I think it might also be possible to remove dbId and tsId from >> xl_xact_commit if we use that definition. > > Yes, that's correct. We can remove them from the normal commit record > when nmsgs == 0. > > I think we can get away with these 2 definitions, but pls check.
> Using static definitions works better for me because we can see what they
> contain, rather than having the info flags imply that the record can
> contain any permutation of settings when that's not really possible.
>
> {
>     TimestampTz xact_time;      /* time of commit */
>     int         nsubxacts;      /* number of subtransaction XIDs */
>     /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
> } xl_xact_commit;
>
> {
>     TimestampTz xact_time;      /* time of commit */
>     uint32      xinfo;          /* info flags */
>     int         nrels;          /* number of RelFileNodes */
>     int         nsubxacts;      /* number of subtransaction XIDs */
>     int         nmsgs;          /* number of shared inval msgs */
>     Oid         dbId;           /* MyDatabaseId */
>     Oid         tsId;           /* MyDatabaseTableSpace */
>     /* Array of RelFileNode(s) to drop at commit */
>     RelFileNode xnodes[1];      /* VARIABLE LENGTH ARRAY */
>     /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
>     /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
> } xl_xact_commit_with_info;

Wow, that is identical to the conclusion that I came to about 60 seconds ago. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_upgrade automatic testing
On ons, 2011-04-27 at 18:14 -0400, Noah Misch wrote: > Enthusiastic +1 for this concept. There's at least one rough edge: it fails > if you have another postmaster running on port 5432. This has now been addressed: pg_upgrade accepts PGPORT settings. Attached is a slightly updated patch that runs the test suite with a port of 65432, which you can override by setting PGPORT yourself. Anyway, is this something that people want in the repository? It's not as polished as the pg_regress business, but it is definitely helpful.

diff --git i/contrib/pg_upgrade/Makefile w/contrib/pg_upgrade/Makefile
index 8f3fd7c..9ec7bc0 100644
--- i/contrib/pg_upgrade/Makefile
+++ w/contrib/pg_upgrade/Makefile
@@ -21,3 +21,6 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+check: test.sh
+	MAKE=$(MAKE) bindir=$(bindir) $(SHELL) $<
diff --git i/contrib/pg_upgrade/test.sh w/contrib/pg_upgrade/test.sh
index e69de29..c675f26 100644
--- i/contrib/pg_upgrade/test.sh
+++ w/contrib/pg_upgrade/test.sh
@@ -0,0 +1,41 @@
+set -eux
+
+: ${MAKE=make}
+: ${PGPORT=65432}
+export PGPORT
+
+temp_root=$PWD/tmp_check
+temp_install=$temp_root/install
+bindir=$temp_install/$bindir
+PATH=$bindir:$PATH
+export PATH
+
+PGDATA=$temp_root/data
+export PGDATA
+rm -rf "$PGDATA" "$PGDATA".old
+
+logdir=$PWD/log
+rm -rf "$logdir"
+mkdir "$logdir"
+
+$MAKE -C ../.. install DESTDIR="$temp_install" 2>&1 | tee "$logdir/install.log"
+$MAKE -C ../pg_upgrade_support install DESTDIR="$temp_install" 2>&1 | tee -a "$logdir/install.log"
+$MAKE -C . install DESTDIR="$temp_install" 2>&1 | tee -a "$logdir/install.log"
+
+initdb 2>&1 | tee "$logdir/initdb1.log"
+pg_ctl start -l "$logdir/postmaster1.log" -w
+$MAKE -C ../.. installcheck 2>&1 | tee "$logdir/installcheck.log"
+pg_dumpall >"$temp_root"/dump1.sql
+pg_ctl -m fast stop
+
+mv "${PGDATA}" "${PGDATA}.old"
+
+initdb 2>&1 | tee "$logdir/initdb2.log"
+
+pg_upgrade -d "${PGDATA}.old" -D "${PGDATA}" -b "$bindir" -B "$bindir"
+
+pg_ctl start -l "$logdir/postmaster2.log" -w
+pg_dumpall >"$temp_root"/dump2.sql
+pg_ctl -m fast stop
+
+diff -q "$temp_root"/dump1.sql "$temp_root"/dump2.sql
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
2011/5/25 Alvaro Herrera : > Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011: > >> > Well, we only actually need to store one number, because you can figure >> > out a much more precise number-of-pages figure with pg_relation_size() >> > divided by configured page size. > >> I may miss something but we need relation size in costsize.c even if >> we have a reldensity (or we need a reltuples). Else what values should >> be used to estimate the relation size ? (pg_relation_size() goes down >> to kernel/fs to ask the stat.st_size, is it really what we want?) > > Actually yes, the planner does go to kernel to determine the current > relation size, and then multiplies by density as computed from catalog > data to figure out current reasonably accurate number of tuples. Okay! I just read that part. Interesting. (If I dive correctly, we search our last segment and then use a fileseek to the end of this segment to get our information) Makes more sense, suddenly :) > > -- > Álvaro Herrera > The PostgreSQL Company - Command Prompt, Inc. > PostgreSQL Replication, Consulting, Custom Development, 24x7 support > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
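[Editor's note] The estimate described in this exchange can be sketched as follows. This is an illustrative sketch only, loosely modeled on what the planner does: the function name, BLCKSZ value, and signature are assumptions, not PostgreSQL's actual code.

```c
#include <assert.h>

#define BLCKSZ 8192     /* assumed default page size */

/*
 * Sketch of the planner's tuple-count estimate as explained above:
 * density comes from the catalog (reltuples / relpages), while the
 * current page count comes from the live file size on disk, so the
 * estimate tracks growth of the file between VACUUMs.
 */
static double
estimated_tuples(double reltuples, double relpages, long file_size_bytes)
{
    double density = reltuples / relpages;              /* tuples per page */
    double cur_pages = (double) file_size_bytes / BLCKSZ;

    return density * cur_pages;
}
```

So a table last vacuumed at 100 tuples/page that has since doubled in file size is estimated to hold twice as many tuples, without touching the catalog.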
Re: [HACKERS] use less space in xl_xact_commit patch
On Wed, May 25, 2011 at 10:41 AM, Simon Riggs wrote: > OK, that's clear. Thanks. > > That formatting sounds quite complex. > > I would propose we split this into 2 WAL records: xl_xact_commit and > xl_xact_commit_with_info > > xl_xact_commit doesn't have any flags, counts or arrays. > > xl_xact_commit_with_info always has all 3 counts, even if zero. > Arrays follow the main record > > I think it might also be possible to remove dbId and tsId from > xl_xact_commit if we use that definition. That's an interesting idea. I hadn't noticed that if nmsgs=0 then those fields aren't needed either. Also, both nrels>0 and nmsgs>0 will probably only happen if some DDL has been done, which most transactions probably don't, so it would be logical to have one record type that excludes nrels, nmsgs, dbId, tsId, and another record type that includes all of those. In fact, we could probably throw in xinfo as well: if there's no DDL involved, we shouldn't need to set XACT_COMPLETION_UPDATE_RELCACHE_FILE or XACT_COMPLETION_FORCE_SYNC_COMMIT either. But I'm not so sure about nsubxacts. It seems to me that we might lose a lot of the benefit of having two record types if we have to use the larger one whenever nsubxacts>0. I'm not exactly sure how common subtransactions are, but if I had to guess I'd speculate that they are far more frequent than DDL operations. Maybe we should think about making the xl_xact_commit record contain just xact_time, nsubxacts, and a subxact array; and then have the xl_xact_commit_with_info (or maybe xl_xact_commit_full, or whatever we decide to call it) contain everything else. That would shave off 20 bytes from the commit records of any non-DDL operation, which would be pretty nice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
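[Editor's note] The 20-byte figure checks out, assuming 4-byte int and Oid. A quick sanity sketch (this struct is illustrative, not an actual WAL record definition):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;           /* assumption: 4-byte Oid, as in PostgreSQL */

/*
 * The five fields that would move out of the common commit record
 * under this proposal; with 4-byte alignment they total 20 bytes.
 */
struct commit_extra_fields
{
    uint32_t    xinfo;          /* info flags */
    int32_t     nrels;          /* number of RelFileNodes */
    int32_t     nmsgs;          /* number of shared inval msgs */
    Oid         dbId;           /* MyDatabaseId */
    Oid         tsId;           /* MyDatabaseTableSpace */
};
```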
Re: [HACKERS] use less space in xl_xact_commit patch
On Wed, May 25, 2011 at 3:41 PM, Simon Riggs wrote: > On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci > wrote: >>> Da: Simon Riggs >>> I can't find a clear discussion of what you are trying to do, and how, >>> just a URL back to a complex discussion on another topic. >> >> >> While trying to write a patch to allow changing an unlogged table into >> a logged one, I had to add another int field to xl_xact_commit. >> Robert Haas said: >> >> "I have to admit I don't like this approach very much. I can't see >> adding 4 bytes to every commit record for this feature." >> >> >> which is a correct remark. >> >> xl_xact_commit can contain some arrays (relations to drop, >> committed sub-trans, shared invalidation msgs). The length of >> these arrays is specified using 3 ints in the struct. >> >> So, to avoid adding more ints to the struct, I've been suggested to >> remove all the ints, and use xl_xact_commit.xinfo to flag which >> arrays are, in fact, present. >> >> So the whole idea is: >> >> - remove nrels, nsubxacts and nmsgs from xl_xact_commit >> - use bits in xinfo to signal which arrays are present at the end >> of xl_xact_commit >> - for each present array, add the length of the array (as int) at >> the end of xl_xact_commit >> - add each present array after all the lengths > > > OK, that's clear. Thanks. > > That formatting sounds quite complex. > > I would propose we split this into 2 WAL records: xl_xact_commit and > xl_xact_commit_with_info > > xl_xact_commit doesn't have any flags, counts or arrays. > > xl_xact_commit_with_info always has all 3 counts, even if zero. > Arrays follow the main record > > I think it might also be possible to remove dbId and tsId from > xl_xact_commit if we use that definition. Yes, that's correct. We can remove them from the normal commit record when nmsgs == 0. I think we can get away with these 2 definitions, but pls check.
Using static definitions works better for me because we can see what they contain, rather than having the info flags imply that the record can contain any permutation of settings when that's not really possible.

{
    TimestampTz xact_time;      /* time of commit */
    int         nsubxacts;      /* number of subtransaction XIDs */
    /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
} xl_xact_commit;

{
    TimestampTz xact_time;      /* time of commit */
    uint32      xinfo;          /* info flags */
    int         nrels;          /* number of RelFileNodes */
    int         nsubxacts;      /* number of subtransaction XIDs */
    int         nmsgs;          /* number of shared inval msgs */
    Oid         dbId;           /* MyDatabaseId */
    Oid         tsId;           /* MyDatabaseTableSpace */
    /* Array of RelFileNode(s) to drop at commit */
    RelFileNode xnodes[1];      /* VARIABLE LENGTH ARRAY */
    /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
    /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
} xl_xact_commit_with_info;

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
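[Editor's note] For comparison, the flag-based variable format that Leonardo describes earlier in the thread could look roughly like this. The flag names and the helper are hypothetical, sketched only to show the decoding cost Simon is objecting to:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical xinfo bits marking which arrays follow the fixed header */
#define XINFO_HAS_RELS      (1u << 0)
#define XINFO_HAS_SUBXACTS  (1u << 1)
#define XINFO_HAS_MSGS      (1u << 2)

/*
 * Size of the variable block of length words: one int per array that is
 * actually present.  This is what lets the fixed struct drop its three
 * counters, at the cost that a reader must test flags before it can even
 * find where the arrays start.
 */
static size_t
commit_length_block_size(uint32_t xinfo)
{
    size_t nlengths = 0;

    if (xinfo & XINFO_HAS_RELS)
        nlengths++;
    if (xinfo & XINFO_HAS_SUBXACTS)
        nlengths++;
    if (xinfo & XINFO_HAS_MSGS)
        nlengths++;
    return nlengths * sizeof(int32_t);
}
```

The two-static-structs proposal above avoids this bookkeeping entirely, which is the readability argument being made.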
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian wrote: > Alvaro Herrera wrote: >> Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011: >> > On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian wrote: >> >> > > I can easily remove dashes before the compare if people like that idea >> > > --- I think you could argue that a dash is not significant, unless "ab-c" >> > > and "a-bc" are different locales. >> > >> > I think the more we mush that string around, the more chance we have >> > of breaking something. What's wrong with insisting that people set >> > the value to the same thing? Like, really the same? >> >> No objection here to that idea. > > I thought the problem was that they upgraded the OS and now the encoding > names changed, though they behaved the same. Is that not what is > happening? Can they supply the values with different cases? Oh, hmm. I don't know. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] The way to know whether the standby has caught up with the master
On Wed, May 25, 2011 at 12:34:59PM +0100, Simon Riggs wrote: > On Wed, May 25, 2011 at 6:16 AM, Heikki Linnakangas > wrote: > > >> To achieve that, I'm thinking to change walsender so that, when the > >> standby > >> has caught up with the master, it sends back the message indicating that > >> to > >> the standby. And I'm thinking to add new function (or view like > >> pg_stat_replication) > >> available on the standby, which shows that info. > > > > By the time the standby has received that message, it might not be caught-up > > anymore because new WAL might've been generated in the master already. > > AFAICS, this is an exact case of the Byzantine General's problem. Have they updated it to acknowledge that the city is now called Istanbul? Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
I think the command 'where' does the same. And the command showed something which looked like it was part of the evaluation... it got me confused. Anyways, thanks, Robert. I will check that too. I did not know the 'bt' command. -- Sent from my Android On 25 May 2011 23:02, "Robert Haas" wrote:
[HACKERS] Minor issues with docs
While I was working on automatic translation of PostgreSQL's documentation from SGML to XML, I found some minor issues. In the file doc/src/sgml/ecpg.sgml there are many lines of C code containing unescaped '<' characters. In the file doc/src/sgml/array.sgml there is a tag which has a case mismatch error with its end tag. Attached you can find the patch, if you want to apply it. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it >From d22539bdb7cabcb6bfbf0ce1b80a59fbba283ca4 Mon Sep 17 00:00:00 2001 From: Marco Nenciarini Date: Wed, 25 May 2011 19:35:36 +0200 Subject: [PATCH] Fix minor issues with documentation markup Signed-off-by: Marco Nenciarini --- doc/src/sgml/array.sgml |2 +- doc/src/sgml/ecpg.sgml | 20 ++-- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml index bb318c5..3508ba3 100644 --- a/doc/src/sgml/array.sgml +++ b/doc/src/sgml/array.sgml @@ -369,7 +369,7 @@ UPDATE sal_emp SET pay_by_quarter = ARRAY[25000,25000,27000,27000] UPDATE sal_emp SET pay_by_quarter[4] = 15000 WHERE name = 'Bill'; - + or updated in a slice: diff --git a/doc/src/sgml/ecpg.sgml b/doc/src/sgml/ecpg.sgml index 9c6ca4c..a8ffde5 100644 --- a/doc/src/sgml/ecpg.sgml +++ b/doc/src/sgml/ecpg.sgml @@ -4154,7 +4154,7 @@ switch (v.sqltype) { case ECPGt_char: memset(&var_buf, 0, sizeof(var_buf)); -memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf) - 1 : sqllen)); +memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf) - 1 : sqllen)); break; case ECPGt_int: /* integer */ @@ -4390,7 +4390,7 @@ main(void) case ECPGt_char: memset(&var_buf, 0, sizeof(var_buf)); -memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf)-1 : sqllen)); +memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ? 
sizeof(var_buf)-1 : sqllen)); break; case ECPGt_int: /* integer */ @@ -5871,39 +5871,39 @@ main(void) /* create */ loid = lo_create(conn, 0); -if (loid < 0) +if (loid < 0) printf("lo_create() failed: %s", PQerrorMessage(conn)); printf("loid = %d\n", loid); /* write test */ fd = lo_open(conn, loid, INV_READ|INV_WRITE); -if (fd < 0) +if (fd < 0) printf("lo_open() failed: %s", PQerrorMessage(conn)); printf("fd = %d\n", fd); rc = lo_write(conn, fd, buf, buflen); -if (rc < 0) +if (rc < 0) printf("lo_write() failed\n"); rc = lo_close(conn, fd); -if (rc < 0) +if (rc < 0) printf("lo_close() failed: %s", PQerrorMessage(conn)); /* read test */ fd = lo_open(conn, loid, INV_READ); -if (fd < 0) +if (fd < 0) printf("lo_open() failed: %s", PQerrorMessage(conn)); printf("fd = %d\n", fd); rc = lo_read(conn, fd, buf2, buflen); -if (rc < 0) +if (rc < 0) printf("lo_read() failed\n"); rc = lo_close(conn, fd); -if (rc < 0) +if (rc < 0) printf("lo_close() failed: %s", PQerrorMessage(conn)); /* check */ @@ -5912,7 +5912,7 @@ main(void) /* cleanup */ rc = lo_unlink(conn, loid); -if (rc < 0) +if (rc < 0) printf("lo_unlink() failed: %s", PQerrorMessage(conn)); EXEC SQL COMMIT; -- 1.7.5.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
Alvaro Herrera wrote: > Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011: > > On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian wrote: > > > > I can easily remove dashes before the compare if people like that idea > > > --- I think you could argue that a dash is not significant, unless "ab-c" > > > and "a-bc" are different locales. > > > > I think the more we mush that string around, the more chance we have > > of breaking something. What's wrong with insisting that people set > > the value to the same thing? Like, really the same? > > No objection here to that idea. I thought the problem was that they upgraded the OS and now the encoding names changed, though they behaved the same. Is that not what is happening? Can they supply the values with different cases? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 1:43 PM, Pavan Deolasee wrote: > I think the point is you can not *always* put it just after the line > pointer array without possibly shuffling the tuples. Remember we need > to put the LSN when the dead line pointer is generated because we > decided to prune away the dead tuple. Say, for example, the page is > completely full and there are no dead line pointers and hence no LSN > on the page. Also there is no free space after the line pointer array. > Now say we prune dead tuples and generate dead line pointers, but the > last line pointer in the array is still in-use and the first tuple > immediately after the line pointer array is live. Since you generated > dead line pointers you want to store the LSN on the page. Now, there > is no way you can store it after the line pointer array without moving > the live tuple somewhere else. So far I agree. But don't we always defragment immediately after pruning dead tuples to line pointers? The removal of even one tuple will give us more than enough space to store the LSN. > Let me summarize the sequence of operations and let me know if you > still disagree with the general principle: > > 1. There are no dead line pointers in the page - we are good. > 2. Few tuples become dead, HOT pruning is invoked either during normal > operation or heap vacuum. The dead tuples are pruned away and > truncated to dead line pointers. We already hold cleanup lock on the > buffer. We set the flag in the page header and store the LSN (either > at the end of line pointer array or at the end of the page) > 3. Someday index vacuum is run and it removes the index pointers to > the dead line pointers. We remember the start LSN of the index vacuum > somewhere, may be as a pg_class attribute (how does index vacuum get > the list of dead line pointers is not material in the general scheme > of things) > 4. When the page is again chosen for pruning, we check if the flag is
> set in the header. If so, get the LSN stored in the page, check it > against the last successful index vacuum LSN and if it precedes the > index vacuum LSN, we turn the LP_DEAD line pointers to LP_UNUSED. The > special LSN can be removed unless new LP_DEAD line pointers get > generated during the pruning, otherwise it's overwritten with the > current LSN. Since we hold the buffer cleanup lock, the special LSN > storage can be reclaimed by shuffling things around. Agreed. The only thing I'm trying to do further is to avoid the need for a reshuffle when the special LSN storage is reclaimed. For example, consider: 1. There are three tuples on the page. We are good. 2. Tuple #2 becomes dead. The tuple is pruned to a line pointer. The page is defragmented. At this point, it doesn't matter WHERE we put the LSN - we are rearranging the whole page anyway. 3. Index vacuum is run. 4. Now we want to make the dead line pointer unused, and reclaim the LSN storage. If the LSN is stored at the end of the page, then we now have to move all of the tuple data forward by 8 bytes. But if it's stored adjacent to the hole in the middle of the page, we need only clear the page-header bits saying it's there (and maybe adjust pd_lower). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] adding a new column in IDENTIFY_SYSTEM
On Tue, May 24, 2011 at 22:31, Jaime Casanova wrote: > On Tue, May 24, 2011 at 8:52 PM, Fujii Masao wrote: > >> + primary_xlp_magic = atoi(PQgetvalue(res, 0, 2)); >> >> You wrongly get the third field (i.e., current xlog location) as the >> WAL version. >> You should call PQgetvalue(res, 0, 3), instead. >> >>> errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.", >> >> You need to change the above message. >> > > Fixed. > > About your comments on the check... if you read the thread, you will > find that the whole reason for the field is future improvement, but > everyone wanted some use of the field now... so I made a patch to use > it in pg_basebackup before the transfer starts and avoid time and > bandwidth waste, but Magnus preferred this in walreceiver... The idea *at this point* was, I believe, to be able to provide a more useful error message in the case of walreceiver. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 10:36 PM, Robert Haas wrote: > >> I don't see how you'd store >> anything in the hole without it being in a fixed place, where it would >> eventually be hit by either the line pointer array or tuple data. > > The point is that it doesn't matter. Suppose we put it just after the > line pointer array. I think the point is you can not *always* put it just after the line pointer array without possibly shuffling the tuples. Remember we need to put the LSN when the dead line pointer is generated because we decided to prune away the dead tuple. Say, for example, the page is completely full and there are no dead line pointers and hence no LSN on the page. Also there is no free space after the line pointer array. Now say we prune dead tuples and generate dead line pointers, but the last line pointer in the array is still in-use and the first tuple immediately after the line pointer array is live. Since you generated dead line pointers you want to store the LSN on the page. Now, there is no way you can store it after the line pointer array without moving the live tuple somewhere else. That's the point Alvaro and I are making. Do you agree with that logic? Now that does not matter because we would always generate dead line pointers holding a buffer cleanup lock and hence we are free to shuffle tuples around. Maybe we are digressing on a trivial detail here, but I hope I got it correct. > Any time we're thinking about extending the line > pointer array, we already have an exclusive lock on the buffer. And > if we already have a exclusive lock on the buffer, then we can reclaim > the dead line pointers and now we no longer need the saved LSN, so > writing over it is perfectly fine. > The trouble is you may not be able to shrink the line pointer array. But of course, you can reuse the reclaimed dead line pointers. I would still advocate doing that during the pruning operation because we want to emit WAL records for the operation.
> OK, I lied: if we have an exclusive buffer lock, but the last vacuum > either failed, or is still in progress, then the LSN might not be old > enough for us to reclaim the dead line pointers yet. So ideally we'd > like to hold onto it. We can do that by either (a) moving the LSN out > another 6 bytes, if there's enough room; or (b) deciding not to put > the new tuple on this page, after all. There's no situation in which > we absolutely HAVE to get another tuple onto this particular page. We > can just decide that the effective size of a page that contains dead > line pointers is effectively 8 bytes less. The alternative is to eat > up 8 bytes of space on ALL pages, whether they contain dead line > pointers or not. > I think we are on the same page as far as storing LSN if and only if it's required. But what was not convincing is the argument that you can *always* find free space for the LSN without moving things around. Let me summarize the sequence of operations and let me know if you still disagree with the general principle: 1. There are no dead line pointers in the page - we are good. 2. Few tuples become dead, HOT pruning is invoked either during normal operation or heap vacuum. The dead tuples are pruned away and truncated to dead line pointers. We already hold cleanup lock on the buffer. We set the flag in the page header and store the LSN (either at the end of line pointer array or at the end of the page) 3. Someday index vacuum is run and it removes the index pointers to the dead line pointers. We remember the start LSN of the index vacuum somewhere, may be as a pg_class attribute (how does index vacuum get the list of dead line pointers is not material in the general scheme of things) 4. When the page is again chosen for pruning, we check if the flag is set in the header. If so, get the LSN stored in the page, check it against the last successful index vacuum LSN and if it precedes the index vacuum LSN, we turn the LP_DEAD line pointers to LP_UNUSED.
The special LSN can be removed unless new LP_DEAD line pointers get generated during the pruning, otherwise it's overwritten with the current LSN. Since we hold the buffer cleanup lock, the special LSN storage can be reclaimed by shuffling things around. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
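[Editor's note] Step 4 of this summary reduces to a single LSN comparison. A minimal sketch, with made-up names standing in for the real page header API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL pointer type */

/*
 * Can this page's LP_DEAD line pointers be flipped to LP_UNUSED?
 * Only if the page carries a stored LSN (the header flag is set) and
 * that LSN precedes the start of the last successful index vacuum,
 * meaning the index pointers to those items are already gone.
 */
static bool
dead_items_reclaimable(bool has_dead_lsn,
                       XLogRecPtr page_dead_lsn,
                       XLogRecPtr last_index_vacuum_lsn)
{
    return has_dead_lsn && page_dead_lsn < last_index_vacuum_lsn;
}
```

A failed or still-running index vacuum simply leaves last_index_vacuum_lsn at its older value, so the check stays safely false, which matches the "OK, I lied" caveat above.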
Re: [HACKERS] Pull up aggregate subquery
On Wed, May 25, 2011 at 1:37 PM, Hitoshi Harada wrote: >> How do you decide whether or not to push down? > > Yeah, that's the problem. In addition to the conditions of join-qual > == grouping key && outer is unique on qual, we need some criteria if > it should be done. At first I started to think I can compare cost of > two different plan nodes, which are generated by calling > subquery_planner() twice. But now my plan is to apply some heuristics > like that join qual selectivity is less than 10% or so. I don't > like magic numbers either, but given Query restructuring instead of > PlannerInfo (which means we cannot use Path) it is the only way left. To > get it working is my first goal anyway. I think getting it working is probably a good first goal. I am not really sure that we want to commit it that way, and I think my vote would be for you to work on the approach we discussed before rather than this one, but it's your project, and I think you'll probably learn enough in getting it working that it will be a step forward in any case. The planner is complex enough that it's worth trying to get something that works, first, before trying to make it perfect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
Excerpts from Tom Lane's message of mar may 24 17:11:17 -0400 2011: > Right. It would also increase the cognitive load on the user to have > to remember the command-line go-to-line-number switch for his editor. > So I don't particularly want to redesign this feature. However, I can > see the possible value of letting EDITOR_LINENUMBER_SWITCH be set from > the same place that you set EDITOR, which would suggest that we allow > the value to come from an environment variable. I'm not sure whether > there is merit in allowing both that source and ~/.psqlrc, though > possibly for Windows users it might be easier if ~/.psqlrc worked. If we're going to increase the number of options in .psqlrc that do not work with older psql versions, can I please have .psqlrc-MAJORVERSION or some such? Having 8.3's psql complain all the time because it doesn't understand "linestyle" is annoying. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011: > On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian wrote: > > I can easily remove dashes before the compare if people like that idea > > --- I think you could argue that a dash is not significant, unless "ab-c" > > and "a-bc" are different locales. > > I think the more we mush that string around, the more chance we have > of breaking something. What's wrong with insisting that people set > the value to the same thing? Like, really the same? No objection here to that idea. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Pull up aggregate subquery
On Mon, May 23, 2011 at 11:08:40PM -0400, Robert Haas wrote:
> I don't really like the idea of adding a GUC for this, unless we
> convince ourselves that nothing else is sensible. I mean, that leads
> to conversations like this:
>
> Newbie: My query is slow.
> Hacker: Turn on enable_magic_pixie_dust and it'll get better.
> Newbie: Oh, yeah, that is better. Why isn't this turned on by default, anyway?
> Hacker: Well, on pathological queries, it makes planning take way too
> long, so we think it's not really safe to enable it by default.
> Newbie: Wait... I thought you just told me to enable it. It's not safe?
> Hacker: Well, sort of. I mean, uh... hey, look, an elephant!

ROTFL! This needs to go on the wiki somewhere, discussing why -hackers rejects tunable knobs as often as it does.

Ross
--
Ross Reedstrom, Ph.D. reeds...@rice.edu
Systems Engineer & Admin, Research Scientist phone: 713-348-6166
Connexions http://cnx.org fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011: > > Well, we only actually need to store one number, because you can figure > > out a much more precise number-of-pages figure with pg_relation_size() > > divided by configured page size. > I may miss something but we need relation size in costsize.c even if > we have a reldensity (or we need a reltuples). Else what values should > be used to estimate the relation size ? (pg_relation_size() goes down > to kernel/fs to ask the stat.st_size, is it really what we want?) Actually yes, the planner does go to kernel to determine the current relation size, and then multiplies by density as computed from catalog data to figure out current reasonably accurate number of tuples. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
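Álvaro's description of how the planner combines the kernel-reported file size with the stored tuple density can be sketched like this (a simplified model, not the actual costsize.c code; the names and default block size are illustrative):

```python
def estimate_current_reltuples(file_size_bytes, reltuples, relpages,
                               block_size=8192):
    # The current number of pages comes from the filesystem via stat(),
    # which is what pg_relation_size() ultimately reports.
    cur_pages = file_size_bytes // block_size
    if relpages == 0:
        return float(reltuples)
    # Density from the catalog data, scaled to the current size.
    density = reltuples / relpages
    return density * cur_pages
```

So even if the catalog's relpages is stale, the tuple estimate tracks the table's actual growth, as long as the stored density is roughly right.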
Re: [HACKERS] Pull up aggregate subquery
2011/5/26 Robert Haas :
> On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada wrote:
>> 2011/5/25 Hitoshi Harada :
>>> So I'm still thinking which of pulling up and parameterized scan is better.
>>
>> After more investigation I came up with a third idea: pushing down the
>> RangeTblEntry to the aggregate subquery. This sounds like a crazy idea,
>> but it seems to me like it is slightly easier than pulling up the
>> agg-subquery. The main point is that when you want to pull up, you
>> must care whether the final output would be correct with other join
>> relations than the aggregate-related one. In contrast, when pushing
>> down the target relation to the agg-subquery it is simple to ensure the
>> result.
>>
>> I'm looking around pull_up_subqueries() in subquery_planner() to add
>> the pushing-down logic. It could be possible to do it around
>> make_one_rel(), but I bet query structure changes are doable against
>> Query, not PlannerInfo.
>
> How do you decide whether or not to push down?

Yeah, that's the problem. In addition to the conditions of join-qual == grouping key && outer is unique on qual, we need some criteria for deciding whether it should be done. At first I thought I could compare the costs of two different plan nodes, generated by calling subquery_planner() twice. But now my plan is to apply a heuristic such as "join qual selectivity is less than 10%" or so. I don't like magic numbers either, but given that we restructure the Query instead of the PlannerInfo (which means we cannot use Path), it is the only way left. Getting it to work is my first goal anyway.

Regards,
--
Hitoshi Harada

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian wrote: > Alvaro Herrera wrote: >> Excerpts from Bruce Momjian's message of mar may 24 15:59:59 -0400 2011: >> >> > > I think you misread what I wrote, or I misexplained it, but never >> > > mind. Matching locale names case-insensitively sounds reasonable to >> > > me, unless someone has reason to believe it will blow up. >> > >> > OK, that's what I needed to hear. I have applied the attached patch, >> > but only to 9.1 because of the risk of breakage. (This was only the >> > first bug report of this, and we aren't 100% certain about the case >> > issue.) >> >> Hmm, I know the locale can be spelled "en_US.UTF-8" (note dash) in some >> systems. So if you have a locale alias that makes en_US.utf8 the same >> as en_US.UTF-8, the patched code would still not work. I wonder if >> there's a need for canonicalization somewhere. Or maybe this is just >> not worth the trouble. > > I can easily remove dashes before the compare if people like that idea > --- I think you could argue that a dash is not significant, unless "ab-c" > and "a-bc" are different locales. I think the more we mush that string around, the more chance we have of breaking something. What's wrong with insisting that people set the value to the same thing? Like, really the same? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
On Wed, May 25, 2011 at 12:34 PM, Vaibhav Kaushal wrote:
> But somehow the execevalvar is being called. When i changed the logic of its
> working to use slot_getattr instead of cute_datum_array for the first run /
> call, everything just worked!

This is what gdb is for... set a breakpoint on that function using "b ExecEvalVar", and when it hits the breakpoint, use "bt" to see where the call came from.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
2011/5/25 Alvaro Herrera : > Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011: >> Tom Lane wrote: >> >> > I don't know what client-side code might be looking at >> > relpages/reltuples. >> >> I know that I find reltuples useful for getting an "accurate enough" >> sense of rows in a table (or set of tables) without resorting to >> count(*). I'd be OK with any two of pages, tuples, and density; but >> dropping to one would make databases harder to manage, IMV. >> >> Personally, I don't have much code that uses those columns; >> eliminating an existing column wouldn't involve much pain for me as >> long as it could still be derived. > > Well, we only actually need to store one number, because you can figure > out a much more precise number-of-pages figure with pg_relation_size() > divided by configured page size. > > (We feel free to hack around catalogs in other places, and we regularly > break pgAdmin and the like when we drop columns -- people just live with > it and update their tools. I don't think it's such a big deal in this > particular case.) I may miss something but we need relation size in costsize.c even if we have a reldensity (or we need a reltuples). Else what values should be used to estimate the relation size ? (pg_relation_size() goes down to kernel/fs to ask the stat.st_size, is it really what we want?) > > -- > Álvaro Herrera > The PostgreSQL Company - Command Prompt, Inc. > PostgreSQL Replication, Consulting, Custom Development, 24x7 support > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Wed, May 25, 2011 at 1:15 PM, Robert Haas wrote: > I agree that if VACUUM scanned 99% of the table, it's probably fine to > use its numbers. It's also fine to use the numbers from ANALYZE, > because those pages are chosen randomly. What bothers me is the idea > of using a small *non-random* sample, and I'm not sure that > incorporating possibly-bogus results slowly is any better than > incorporating them quickly. In particular, unless I'm misremembering, VACUUM *always* scans the first few pages of the table, until it sees enough consecutive all-visible bits that it decides to start skipping. If I'm right about that, then those pages could easily end up being overweighted when VACUUM does the counting; especially if ANALYZE or an actual full-table vacuum aren't allowed to snap the count back to reality. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tackling full page writes
On Wed, May 25, 2011 at 1:06 PM, Greg Smith wrote: > On 05/24/2011 04:34 PM, Robert Haas wrote: >> >> we could name this feature "partial full page writes" and enable it >> either with a setting of full_page_writes=partial > > +1 to overloading the initial name, but only if the parameter is named > 'maybe', 'sometimes', or 'perhaps'. Perfect! > I've been looking into a similar refactoring of the names here, where we > bundle all of these speed over safety things (fsync, full_page_writes, etc.) > into one control so they're easier to turn off at once. Not sure if it > should be named "web_scale" or "do_you_feel_lucky_punk". Actually, I suggested that same idea to you, or someone, a while back, only I was serious. crash_safety=off. I never got around to fleshing out the details, though. >> 3. Going a bit further, Greg proposed the idea of ripping out our >> current WAL infrastructure altogether and instead just having one WAL >> record that says "these byte ranges on this page changed to have these >> new contents". > > The main thing that makes this idea particularly interesting to me, over the > other two, is that it might translate well into the idea of using > sync_file_range to aim for a finer fsync call on Linux than is currently > possible. Hmm, maybe. But it's possible that the dirty blocks are the first and last ones in the file. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
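Robert's caveat — that the dirty blocks might be the first and last ones in the file — is easy to see if you coalesce dirty block numbers into the byte ranges a finer-grained flush would target. A toy sketch, not PostgreSQL code (names and the 8 kB default are illustrative):

```python
def dirty_ranges(dirty_blocks, block_size=8192):
    # Coalesce dirty block numbers into contiguous (start, end) byte
    # ranges that something like Linux's sync_file_range() could flush
    # individually instead of fsync()ing the whole file.
    ranges = []
    for b in sorted(set(dirty_blocks)):
        start, end = b * block_size, (b + 1) * block_size
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)  # extend the previous range
        else:
            ranges.append((start, end))
    return ranges
```

When the dirty blocks sit at opposite ends of the file, the two resulting ranges are as far apart as possible, so the finer call buys little over a plain fsync.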
Re: [HACKERS] Nested CASE-WHEN scoping
On 25.05.2011 20:11, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 25.05.2011 17:47, Tom Lane wrote:
>>> [ scratches head ... ] Why does the save/restore in ExecEvalCase not
>>> take care of this?
>>
>> The mistake happens during planning, when the SQL function is inlined
>> and pre-evaluated.
>
> Hm. I'm inclined to think this may be more of a bug in the inlining
> process than anything else.

Well, if you want to get away without the capability to reference multiple CaseTestExprs at a time, you'll have to detect the danger and abort the inlining process. That's a bit pessimal, although this is a pretty artificial case in the first place, so maybe we don't care much.

(I'm still going to need more placeholder slots to handle IN and BETWEEN. Of course, I can just copy-paste CaseTestExpr into something like InTestExpr and BetweenTestExpr, but it seems like it would be good to unite all that infrastructure.)

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
But somehow the ExecEvalVar is being called. When I changed the logic of its working to use slot_getattr instead of cute_datum_array for the first run / call, everything just worked! This would surely indicate that the function does get called at least once before being called by the executor for the qual check. This is what got me confused - where does the function get called? The gdb trace says its first call comes through ExecutePlan, but the results say something else. Moreover, gdb still confuses me. Is there some GUI for gdb? :(

The only thing I am trying to make sure of is that ExecEvalVar gets the values from cute_datum_array only when called during a scan, no way before it. Maybe there could be another way. I think I have to rethink this.

-- Sent from my Android

On 25 May 2011 20:34, "Tom Lane" wrote:
> Alvaro Herrera writes:
>> Excerpts from Vaibhav Kaushal's message of mié may 25 05:52:32 -0400 2011:
>>> If the above is confusing, I just want to ask: "Is the expression evaluator,
>>> even in part, responsible for PLANNEDSTMT creation?"
>
>> Yeah, as far as I understood Tom's talk, the expr evaluator is used to
>> reduce some expressions to constants and such.
>
> The planner would never call it with an expression containing a Var,
> though.
>
> regards, tom lane
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
Alvaro Herrera wrote: > Excerpts from Bruce Momjian's message of mar may 24 15:59:59 -0400 2011: > > > > I think you misread what I wrote, or I misexplained it, but never > > > mind. Matching locale names case-insensitively sounds reasonable to > > > me, unless someone has reason to believe it will blow up. > > > > OK, that's what I needed to hear. I have applied the attached patch, > > but only to 9.1 because of the risk of breakage. (This was only the > > first bug report of this, and we aren't 100% certain about the case > > issue.) > > Hmm, I know the locale can be spelled "en_US.UTF-8" (note dash) in some > systems. So if you have a locale alias that makes en_US.utf8 the same > as en_US.UTF-8, the patched code would still not work. I wonder if > there's a need for canonicalization somewhere. Or maybe this is just > not worth the trouble. I can easily remove dashes before the compare if people like that idea --- I think you could argue that a dash is not significant, unless "ab-c" and "a-bc" are different locales. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
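The comparison Bruce describes — case-insensitive, optionally ignoring dashes — can be sketched in Python rather than the actual pg_upgrade C code. Note that the sketch also demonstrates his own caveat: stripping dashes makes "ab-c" and "a-bc" compare equal.

```python
def locales_match(a, b, ignore_dashes=True):
    # Hypothetical canonicalization: lower-case both names and
    # optionally drop dashes, so "en_US.UTF-8" matches "en_US.utf8".
    def canon(s):
        s = s.lower()
        return s.replace("-", "") if ignore_dashes else s
    return canon(a) == canon(b)
```

Whether that last collision matters in practice depends on whether any platform actually distinguishes locales that differ only in dash placement.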
Re: [HACKERS] tackling full page writes
On 05/24/2011 04:34 PM, Robert Haas wrote:
> we could name this feature "partial full page writes" and enable it
> either with a setting of full_page_writes=partial

+1 to overloading the initial name, but only if the parameter is named 'maybe', 'sometimes', or 'perhaps'.

I've been looking into a similar refactoring of the names here, where we bundle all of these speed-over-safety things (fsync, full_page_writes, etc.) into one control so they're easier to turn off at once. Not sure if it should be named "web_scale" or "do_you_feel_lucky_punk".

> 3. Going a bit further, Greg proposed the idea of ripping out our
> current WAL infrastructure altogether and instead just having one WAL
> record that says "these byte ranges on this page changed to have these
> new contents".

The main thing that makes this idea particularly interesting to me, over the other two, is that it might translate well into the idea of using sync_file_range to aim for a finer fsync call on Linux than is currently possible.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Wed, May 25, 2011 at 1:04 PM, Tom Lane wrote: > [ shrug... ] When you don't have complete information, it's *always* > the case that you will sometimes make a mistake. That's not > justification for paralysis, especially not when the existing code is > demonstrably broken. > > What occurs to me after thinking a bit more is that the existing tuple > density is likely to be only an estimate, too (one coming from the last > ANALYZE, which could very well have scanned even less of the table than > VACUUM did). So what I now think is that both VACUUM and ANALYZE ought > to use a calculation of the above form to arrive at a new value for > pg_class.reltuples. In both cases it would be pretty easy to track the > number of pages we looked at while counting tuples, so the same raw > information is available. > >> I am wondering, though, why we're not just inserting a special-purpose >> hack for TOAST tables. > > Because the problem is not specific to TOAST tables. As things > currently stand, we will accept the word of an ANALYZE as gospel even if > it scanned 1% of the table, and completely ignore the results from a > VACUUM even if it scanned 99% of the table. This is not sane. I agree that if VACUUM scanned 99% of the table, it's probably fine to use its numbers. It's also fine to use the numbers from ANALYZE, because those pages are chosen randomly. What bothers me is the idea of using a small *non-random* sample, and I'm not sure that incorporating possibly-bogus results slowly is any better than incorporating them quickly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Excerpts from Robert Haas's message of mié may 25 12:54:28 -0400 2011: > I don't know. That's maybe better, but I'd be willing to wager that > in some cases it will just slow down the rate at which we converge to > a completely incorrect value, while in other cases it'll fail to > update the data when it really has changed. For regular tables I don't think there's a real problem because it'll get fixed next time a full scan happens anyway. For toast tables, I think the set of operations is limited enough that this is easy to prove correct (or fixed so that it is) -- no HOT updates (in fact no updates at all), etc. BTW one thing we haven't fixed at all is how do HOT updates affect vacuuming behavior ... > I am wondering, though, why we're not just inserting a special-purpose > hack for TOAST tables. Your email seems to indicate that regular > tables are already handled well enough, and certainly if we only whack > around the TOAST behavior it's much less likely to fry anything. Well, having two paths one of which is uncommonly used means that it will get all the bugs. As with the xl_commit WAL record comment from Simon, I'd rather stick with having a single one. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Nested CASE-WHEN scoping
Heikki Linnakangas writes: > On 25.05.2011 17:47, Tom Lane wrote: >> [ scratches head ... ] Why does the save/restore in ExecEvalCase not >> take care of this? > The mistake happens during planning, when the SQL function is inlined > and pre-evaluated. Hm. I'm inclined to think this may be more of a bug in the inlining process than anything else. I have to run off for a doctor's appointment, but will look at this closer later today. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Nested CASE-WHEN scoping
On 25.05.2011 17:47, Tom Lane wrote:
> Heikki Linnakangas writes:
>> While looking at fixing the multiple-evaluation issue in IN and BETWEEN
>> discussed a while ago, I realized that the current assumption that only
>> one CaseTestExpr placeholder needs to be valid at any given time is not
>> true.
>
> [ scratches head ... ] Why does the save/restore in ExecEvalCase not
> take care of this?

The mistake happens during planning, when the SQL function is inlined and pre-evaluated. It's a bit hard to see what happened once the planning is finished because the whole expression is folded into a constant, but here goes:

The original expression is:

CASE now() WHEN 29 THEN 'foo' ELSE 'bar' END;

In parse analysis, it is turned into this:

CASE WHEN CaseTestExpr = 29 THEN 'foo' ELSE 'bar' END;

where CaseTestExpr stands for the now(). Next the planner tries to simplify the WHEN condition, "CaseTestExpr = 29". The equality operator is implemented by the evileq(timestamptz, int4) SQL function, defined as:

CASE $2 WHEN length($1::text) THEN true ELSE false END;

That SQL function is transformed at parse analysis into:

CASE WHEN CaseTestExpr = length($1::text) THEN true ELSE false END;

This CaseTestExpr stands for the Param to the function, $2. When that transformed SQL function body is inlined into the outer WHEN clause, "CaseTestExpr = 29", and Params are substituted, it becomes:

CASE WHEN CaseTestExpr = length(CaseTestExpr::text) THEN true ELSE false END;

(you can see the expression tree for that if you print out 'newexpr' in inline_function(), just after the call to substitute_actual_parameters())

At this point it's easy to see that we have screwed up. The first CaseTestExpr stands for the inner CASE value, which is $2, which stands for 29, and the second CaseTestExpr stands for the *outer* CASE value, which is supposed to be now(). The planner cannot distinguish between the two anymore. Both CaseTestExprs are then incorrectly replaced with constant 29, and the whole expression is constant-folded into 'bar'.

>> So I'm going to put the BETWEEN/IN fix aside for now, and refactor the
>> placeholder infrastructure to handle several simultaneous placeholders,
>
> That sounds like a mess, and I'm not even convinced it'll solve the
> problem ...

Hmm, it would solve the above if we can keep the CaseTestExprs separate. It's not quite as straightforward as I originally thought, as the parse analysis of the inlined SQL function needs to use placeholder numbers that are different from those used in the outer context. But it seems doable.

BTW, I just stumbled into this:

postgres=# explain verbose SELECT CASE now() WHEN (29+random()::int4) THEN 'foo' ELSE 'bar' END;
ERROR: unexpected CASE WHEN clause: 326

Looks like ruleutils.c is also not prepared for the case where the implicit equality operation gets inlined into something other than an OpExpr.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
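The collision Heikki describes can be modeled with a toy expression tree in which, as with CaseTestExpr, a single shared marker stands for "the current CASE value". This is an illustrative sketch, not PostgreSQL code:

```python
# A single shared placeholder plays the role of CaseTestExpr.
CASETEST = "CaseTestExpr"

def substitute(expr, value):
    # Replace every occurrence of the placeholder with `value`, the way
    # constant-folding substitutes the known inner CASE value (29).
    if expr == CASETEST:
        return value
    if isinstance(expr, tuple):
        return tuple(substitute(e, value) for e in expr)
    return expr

# After inlining, the WHEN condition holds two placeholders that mean
# different things: the first stands for the inlined function's $2 (29),
# the second for the outer CASE value (now()):
inlined = ("=", CASETEST, ("length", CASETEST))

# Folding the inner value clobbers both, because nothing in the tree
# says which placeholder belongs to which CASE:
folded = substitute(inlined, 29)
```

Distinct placeholder numbers, as Heikki proposes, would let substitute() target only the inner occurrences and leave the outer one intact.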
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 12:48 PM, Alvaro Herrera wrote: > Excerpts from Robert Haas's message of mié may 25 12:27:38 -0400 2011: >> On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee >> wrote: > >> > I think if are reclaiming LP_DEAD line pointers only while >> > defragmenting the page, we can always reclaim the space for the LSN, >> > irrespective of where we store it. So may be we should decide >> > depending on what would matter for on-disk compatibility and whatever >> > requires least invasive changes. I don't know what is that yet. >> >> If we store the LSN at the beginning (page header) or end (special >> space) of the page, then we'll have to adjust the location of the data >> which follows it or precedes it if and when we want to reclaim the >> space it occupies. But if we store it in the hole in the middle >> somewhere (probably by reducing the size of the hole, not by actually >> referencing the area between pd_lower and pd_upper) then no tuples or >> item pointers have to move around when we want to reclaim the space. >> That way, we need only an exclusive lock (not a cleanup lock), and we >> don't need to memmove() anything. > > You can store anything in the "hole" in the data area (currently we only > store tuple data), but then you'd have to store the offset to where you > stored it in some fixed location, and make sure you adjust pd_upper/lower > so that the next tuple does not overwrite it. You'd still need space > in either page header or special space. You only need one bit of space in the page header, to indicate whether the LSN is present or not. And we have unused bit space there. > I don't see how you'd store > anything in the hole without it being in a fixed place, where it would > eventually be hit by either the line pointer array or tuple data. The point is that it doesn't matter. Suppose we put it just after the line pointer array. 
Any time we're thinking about extending the line pointer array, we already have an exclusive lock on the buffer. And if we already have a exclusive lock on the buffer, then we can reclaim the dead line pointers and now we no longer need the saved LSN, so writing over it is perfectly fine. OK, I lied: if we have an exclusive buffer lock, but the last vacuum either failed, or is still in progress, then the LSN might not be old enough for us to reclaim the dead line pointers yet. So ideally we'd like to hold onto it. We can do that by either (a) moving the LSN out another 6 bytes, if there's enough room; or (b) deciding not to put the new tuple on this page, after all. There's no situation in which we absolutely HAVE to get another tuple onto this particular page. We can just decide that the effective size of a page that contains dead line pointers is effectively 8 bytes less. The alternative is to eat up 8 bytes of space on ALL pages, whether they contain dead line pointers or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
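Robert's accounting rule — a page that carries dead line pointers reserves 8 bytes of its hole for the saved LSN — can be sketched as follows. The field names mimic the PostgreSQL page header, but this is an illustrative model, not the real free-space code:

```python
def effective_free_space(pd_upper, pd_lower, has_dead_line_pointers,
                         lsn_reserve=8):
    # The hole is the gap between the line pointer array (growing up
    # from pd_lower) and tuple data (growing down from pd_upper).
    hole = pd_upper - pd_lower
    # A page with dead line pointers keeps lsn_reserve bytes of the hole
    # for the vacuum LSN, so it is effectively that much smaller.
    return hole - lsn_reserve if has_dead_line_pointers else hole

def tuple_fits(tuple_len, pd_upper, pd_lower, has_dead_line_pointers):
    # A new tuple also needs a 4-byte line pointer.
    needed = tuple_len + 4
    return needed <= effective_free_space(pd_upper, pd_lower,
                                          has_dead_line_pointers)
```

The point of the scheme is that the 8-byte cost is paid only by pages that actually contain dead line pointers, instead of by every page in the database.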
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas writes: > On Wed, May 25, 2011 at 12:41 PM, Tom Lane wrote: >> Yeah, I had been thinking about the latter point. We could be >> conservative and just keep the reported tuple density the same (ie, >> update relpages to the new correct value, while setting reltuples so >> that the density ratio doesn't change). But that has its own problems >> when the table contents *do* change. What I'm currently imagining is >> to do a smoothed moving average, where we factor in the new density >> estimate with a weight dependent on the percentage of the table we did >> scan. That is, the calculation goes something like >> >> old_density = old_reltuples / old_relpages >> new_density = counted_tuples / scanned_pages >> reliability = scanned_pages / new_relpages >> updated_density = old_density + (new_density - old_density) * reliability >> new_reltuples = updated_density * new_relpages >> >> We could slow the moving-average convergence even further when >> reliability is small; for instance if you were really paranoid you might >> want to use the square of reliability in line 4. That might be >> overdoing it, though. > I don't know. That's maybe better, but I'd be willing to wager that > in some cases it will just slow down the rate at which we converge to > a completely incorrect value, while in other cases it'll fail to > update the data when it really has changed. [ shrug... ] When you don't have complete information, it's *always* the case that you will sometimes make a mistake. That's not justification for paralysis, especially not when the existing code is demonstrably broken. What occurs to me after thinking a bit more is that the existing tuple density is likely to be only an estimate, too (one coming from the last ANALYZE, which could very well have scanned even less of the table than VACUUM did). So what I now think is that both VACUUM and ANALYZE ought to use a calculation of the above form to arrive at a new value for pg_class.reltuples. 
In both cases it would be pretty easy to track the number of pages we looked at while counting tuples, so the same raw information is available. > I am wondering, though, why we're not just inserting a special-purpose > hack for TOAST tables. Because the problem is not specific to TOAST tables. As things currently stand, we will accept the word of an ANALYZE as gospel even if it scanned 1% of the table, and completely ignore the results from a VACUUM even if it scanned 99% of the table. This is not sane. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
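Tom's smoothed moving average is straightforward to state in code. Here is a direct transcription of the formula from this thread, in Python for illustration:

```python
def updated_reltuples(old_reltuples, old_relpages, counted_tuples,
                      scanned_pages, new_relpages):
    # Trust the freshly counted density in proportion to the fraction
    # of the table that was actually scanned. Callers must ensure the
    # page counts are greater than zero.
    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    return updated_density * new_relpages
```

A full scan (reliability = 1) adopts the counted number outright, while a 1%-of-table scan moves the estimate only 1% of the way toward the observed density, even when that density is wildly different.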
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011: > Tom Lane wrote: > > > I don't know what client-side code might be looking at > > relpages/reltuples. > > I know that I find reltuples useful for getting an "accurate enough" > sense of rows in a table (or set of tables) without resorting to > count(*). I'd be OK with any two of pages, tuples, and density; but > dropping to one would make databases harder to manage, IMV. > > Personally, I don't have much code that uses those columns; > eliminating an existing column wouldn't involve much pain for me as > long as it could still be derived. Well, we only actually need to store one number, because you can figure out a much more precise number-of-pages figure with pg_relation_size() divided by configured page size. (We feel free to hack around catalogs in other places, and we regularly break pgAdmin and the like when we drop columns -- people just live with it and update their tools. I don't think it's such a big deal in this particular case.) -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Wed, May 25, 2011 at 12:41 PM, Tom Lane wrote: > Yeah, I had been thinking about the latter point. We could be > conservative and just keep the reported tuple density the same (ie, > update relpages to the new correct value, while setting reltuples so > that the density ratio doesn't change). But that has its own problems > when the table contents *do* change. What I'm currently imagining is > to do a smoothed moving average, where we factor in the new density > estimate with a weight dependent on the percentage of the table we did > scan. That is, the calculation goes something like > > old_density = old_reltuples / old_relpages > new_density = counted_tuples / scanned_pages > reliability = scanned_pages / new_relpages > updated_density = old_density + (new_density - old_density) * reliability > new_reltuples = updated_density * new_relpages > > We could slow the moving-average convergence even further when > reliability is small; for instance if you were really paranoid you might > want to use the square of reliability in line 4. That might be > overdoing it, though. I don't know. That's maybe better, but I'd be willing to wager that in some cases it will just slow down the rate at which we converge to a completely incorrect value, while in other cases it'll fail to update the data when it really has changed. I am wondering, though, why we're not just inserting a special-purpose hack for TOAST tables. Your email seems to indicate that regular tables are already handled well enough, and certainly if we only whack around the TOAST behavior it's much less likely to fry anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
Excerpts from Robert Haas's message of mié may 25 12:27:38 -0400 2011: > On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee > wrote: > > I think if are reclaiming LP_DEAD line pointers only while > > defragmenting the page, we can always reclaim the space for the LSN, > > irrespective of where we store it. So may be we should decide > > depending on what would matter for on-disk compatibility and whatever > > requires least invasive changes. I don't know what is that yet. > > If we store the LSN at the beginning (page header) or end (special > space) of the page, then we'll have to adjust the location of the data > which follows it or precedes it if and when we want to reclaim the > space it occupies. But if we store it in the hole in the middle > somewhere (probably by reducing the size of the hole, not by actually > referencing the area between pd_lower and pd_upper) then no tuples or > item pointers have to move around when we want to reclaim the space. > That way, we need only an exclusive lock (not a cleanup lock), and we > don't need to memmove() anything. You can store anything in the "hole" in the data area (currently we only store tuple data), but then you'd have to store the offset to where you stored it in some fixed location, and make sure you adjust pd_upper/lower so that the next tuple does not overwrite it. You'd still need space in either page header or special space. I don't see how you'd store anything in the hole without it being in a fixed place, where it would eventually be hit by either the line pointer array or tuple data. As far as the general idea of this thread goes, I remember thinking about exactly this topic a couple of months ago -- I didn't get this far though, so thanks for fleshing out some details and getting the ball rolling. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. 
PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner
On Wed, May 25, 2011 at 10:17 AM, Greg Smith wrote: >> This data would probably need to be kept separately for each table or >> index, as some tables or indexes >> may be mostly or fully in cache or on faster physical media than others, >> although in the absence of other >> data about a specific table or index, data about other relations in the >> same tablespace might be of some use. > > This is the important part. Model how the data needs to get stored such > that the optimizer can make decisions using it, and I consider it easy to > figure out how it will get populated later. I basically agree. There have been several recent discussions of this topic on both -hackers and -performance; it is likely that the TODO needs to be updated with some more recent links. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] "errno" not set in case of "libm" functions (HPUX)
Please find the updated patch. I have added this "+Olibmerrno" compile flag check in configure/configure.in file. OS HP-UX B.11.31 U ia64 without patch --- postgres=# select acos(2); acos -- NaN (1 row) with patch --- postgres=# select acos(2); ERROR: input is out of range On Tue, May 24, 2011 at 11:13 PM, Tom Lane wrote: > Heikki Linnakangas writes: > > So the default is indeed non-standard. But I wonder if we should use -Aa > > instead? > > Probably not; at least on older HPUX versions, -Aa turns off access to > assorted stuff that we do want, eg "long long". "man cc" on my box > saith > > -Amode Specify the compilation standard to be used by the >compiler. mode can be one of the following letters: > > c(Default) Compile in a mode compatible with >HP-UX releases prior to 7.0. (See The C >Programming Language, First Edition by >Kernighan and Ritchie). This option also >defines the symbol _HPUX_SOURCE and allows the >user to access macros and typedefs provided by >the HPUX Operating System. The default >compilation mode may change in future releases. > > aCompile under ANSI mode (ANSI programming >language C standard ISO 9899:1990). When >compiling under ANSI mode, the header files >would define only those names (macros and >typedefs) specified by the Standard. To access >macros and typedefs that are not defined by the >ANSI Standard but are provided by the HPUX >Operating System, define the symbol >_HPUX_SOURCE; or use the extension option >described below. > > eExtended ANSI mode. Same as -Aa -D_HPUX_SOURCE >+e. This would define the names (macros and >typedefs) provided by the HPUX Operating System >and, in addition, allow the following >extensions: $ characters in identifier names, >sized enums, sized bit-fields, and 64-bit >integral type long long. Additional extensions >may be added to this option in the future. > > The +e option is elsewhere stated to mean > > +eEnables HP value-added features while compiling >in ANSI C mode, -Aa. 
This option is ignored >with -Ac because these features are already >provided. Features enabled: > > o Long pointers > o Integral type specifiers can appear in >enum declarations. > o The $ character can appear in >identifier names. > o Missing parameters on intrinsic calls > > which isn't 100% consistent with what it says under -Ae, so maybe some > additional experimentation is called for. But anyway, autoconf appears > to think that -Ae is preferable to the combination -Aa -D_HPUX_SOURCE > (that choice is coming from autoconf not our own code); so I'm not > optimistic that we can get more-standard behavior by overriding that. > >regards, tom lane > -- Ibrar Ahmed diff --git a/configure b/configure index 3b23c46..d448534 100755 --- a/configure +++ b/configure @@ -4468,6 +4468,66 @@ if test x"$pgac_cv_prog_cc_cflags__qnoansialias" = x"yes"; then CFLAGS="$CFLAGS -qnoansialias" fi +elif test "$PORTNAME" = "hpux"; then + # HP-UX libm functions on 'Integrity server' do not set errno by default, + # for errno setting, compile with the +Olibmerrno option. + { $as_echo "$as_me:$LINENO: checking whether $CC supports +Olibmerrno" >&5 +$as_echo_n "checking whether $CC supports +Olibmerrno... " >&6; } +if test "${pgac_cv_prog_cc_cflags_pOlibmerrno+set}" = set; then + $as_echo_n "(cached) " >&6 +else + pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS +Olibmerrno" +cat >conftest.$ac_ext <<_ACEOF +/* confdefs.h. */ +_ACEOF +cat confdefs.h >>conftest.$ac_ext +cat >>conftest.$ac_ext <<_ACEOF +/* end confdefs.h. */ + +int +main () +{ + + ; + return 0; +} +_ACEOF +rm -f conftest.$ac_objext +if { (ac_try="$ac_compile" +case "(($ac_try" in + *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;; + *) ac_try_echo=$ac_try;; +esac +eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\"" +$as_echo "$ac_try_echo") >&5 + (eval "$ac_compile") 2>conftest.er1 + ac
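The behavioral difference the patch produces — an explicit error instead of a silent NaN — is the same one Python's math module gives by raising on domain errors (an analogy only; the actual fix is the +Olibmerrno compile flag above, and this helper is hypothetical):

```python
import math

def acos_or_error(x):
    # math.acos raises ValueError for |x| > 1, much as the patched
    # HP-UX build reports "input is out of range" where the unpatched
    # libm silently returned NaN for acos(2).
    try:
        return math.acos(x)
    except ValueError:
        return None  # caller sees an explicit failure, not a quiet NaN

ok = acos_or_error(1.0)   # 0.0
bad = acos_or_error(2.0)  # None: outside the [-1, 1] domain
```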
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
@pavan - thanks a lot. Will try it when I go to a desktop the next time. -- Sent from my Android On 25 May 2011 15:33, "Pavan Deolasee" wrote: > On Wed, May 25, 2011 at 3:22 PM, Vaibhav Kaushal < > vaibhavkaushal...@gmail.com> wrote: > >> I see that the target list to be scanned is handled by the "ExecTargetList" >> function. >> >> I am not so sure about this because this function is not listed in the list >> of functions which GDB shows me (./configure --enable-debug && make clean && >> make && make install). Rest everything else (almost / perhaps) is shown! >> (Can anyone tell why? :( ) >> >> > You would need to turn optimization off by passing the "-O0" flag to the > compiler. Otherwise static functions may get optimized and you may not see > them in the gdb stack. > > CFLAGS="-O0" ./configure --enable-debug > > Thanks, > Pavan > > -- > Pavan Deolasee > EnterpriseDB http://www.enterprisedb.com
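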
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas writes: > On Wed, May 25, 2011 at 11:47 AM, Tom Lane wrote: >> 2. Revise the vacuum code so that it doesn't skip updating the pg_class >> entries. We could have it count the number of pages it skipped, rather >> than just keeping a bool, and then scale up the rel_tuples count to be >> approximately right by assuming the skipped pages have tuple density >> similar to the scanned ones. > This approach doesn't seem like a good idea to me. The skipped > portions of the table are the ones that haven't been modified in a > while, so this is or embeds an assumption that the tuples in the hot > and cold portions of the table are the same size. That might be true, > but it isn't hard to think of cases where it might not be. There > could also very easily be sampling error, because it's easy to imagine > a situation where 99% of the table is getting skipped. Yeah, I had been thinking about the latter point. We could be conservative and just keep the reported tuple density the same (ie, update relpages to the new correct value, while setting reltuples so that the density ratio doesn't change). But that has its own problems when the table contents *do* change. What I'm currently imagining is to do a smoothed moving average, where we factor in the new density estimate with a weight dependent on the percentage of the table we did scan. That is, the calculation goes something like old_density = old_reltuples / old_relpages new_density = counted_tuples / scanned_pages reliability = scanned_pages / new_relpages updated_density = old_density + (new_density - old_density) * reliability new_reltuples = updated_density * new_relpages We could slow the moving-average convergence even further when reliability is small; for instance if you were really paranoid you might want to use the square of reliability in line 4. That might be overdoing it, though. 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
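Rendered outside the server, the smoothed-moving-average update above looks like this (an illustrative Python sketch of the formulas only, not the actual vacuum C code; all names are hypothetical):

```python
def updated_reltuples(old_relpages, old_reltuples,
                      scanned_pages, counted_tuples, new_relpages):
    """Smoothed moving-average reltuples estimate, per the formulas above."""
    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    # Weight the new observation by the fraction of the table scanned.
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    return updated_density * new_relpages

# Scanning 20% of a 1000-page table at an observed density of 120
# nudges the old density of 100 toward 120 by a factor of 0.2,
# giving 104 tuples/page, i.e. reltuples = 104,000.
estimate = updated_reltuples(1000, 100_000, 200, 24_000, 1000)
```

When scanned_pages equals new_relpages the reliability is 1, so the old estimate is discarded entirely, matching a full-table vacuum.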
Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.
Excerpts from Bruce Momjian's message of mar may 24 15:59:59 -0400 2011: > > I think you misread what I wrote, or I misexplained it, but never > > mind. Matching locale names case-insensitively sounds reasonable to > > me, unless someone has reason to believe it will blow up. > > OK, that's what I needed to hear. I have applied the attached patch, > but only to 9.1 because of the risk of breakage. (This was only the > first bug report of this, and we aren't 100% certain about the case > issue.) Hmm, I know the locale can be spelled "en_US.UTF-8" (note dash) in some systems. So if you have a locale alias that makes en_US.utf8 the same as en_US.UTF-8, the patched code would still not work. I wonder if there's a need for canonicalization somewhere. Or maybe this is just not worth the trouble. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
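A canonicalization step of the kind Álvaro wonders about might look like this (purely hypothetical; pg_upgrade has no such helper, and real locale aliasing is platform-specific):

```python
def canonical_locale(name):
    # Fold case and drop '-' in the codeset so spellings like
    # 'en_US.UTF-8' and 'en_US.utf8' compare equal.
    lang, sep, codeset = name.partition('.')
    if not sep:
        return lang
    return lang + '.' + codeset.replace('-', '').lower()

same = canonical_locale('en_US.UTF-8') == canonical_locale('en_US.utf8')
```

Even this would not cover true aliases (e.g. a system mapping one language name onto another), which is why full canonicalization may not be worth the trouble.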
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Tom Lane wrote: > I don't know what client-side code might be looking at > relpages/reltuples. I know that I find reltuples useful for getting an "accurate enough" sense of rows in a table (or set of tables) without resorting to count(*). I'd be OK with any two of pages, tuples, and density; but dropping to one would make databases harder to manage, IMV. Personally, I don't have much code that uses those columns; eliminating an existing column wouldn't involve much pain for me as long as it could still be derived. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee wrote: >> I'm confused. A major point of the approach I was proposing was to >> avoid having to move tuples around. > > Well, I am not sure how you can always guarantee to make space > available for the LSN without moving tuples, irrespective of where > you store it. But probably it's not important as we discussed below. Well, if we just defragged the page, then we should be guaranteed that pd_lower + 8 < pd_upper. No? > I think if we are reclaiming LP_DEAD line pointers only while > defragmenting the page, we can always reclaim the space for the LSN, > irrespective of where we store it. So maybe we should decide > depending on what would matter for on-disk compatibility and whatever > requires the least invasive changes. I don't know what that is yet. If we store the LSN at the beginning (page header) or end (special space) of the page, then we'll have to adjust the location of the data which follows it or precedes it if and when we want to reclaim the space it occupies. But if we store it in the hole in the middle somewhere (probably by reducing the size of the hole, not by actually referencing the area between pd_lower and pd_upper) then no tuples or item pointers have to move around when we want to reclaim the space. That way, we need only an exclusive lock (not a cleanup lock), and we don't need to memmove() anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
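The defrag guarantee can be stated as a tiny invariant check (a hypothetical model, not real PostgreSQL structures; an XLogRecPtr is 8 bytes):

```python
LSN_SIZE = 8  # sizeof(XLogRecPtr)

def hole_can_hold_lsn(pd_lower, pd_upper):
    # The free "hole" runs from pd_lower up to pd_upper; storing the LSN
    # by shrinking the hole works whenever at least 8 bytes remain, with
    # no need to shift line pointers or tuple data.
    return pd_upper - pd_lower >= LSN_SIZE

# After defragmenting an 8 KB page with 40 line pointers (24-byte header,
# 4 bytes per pointer) and 7000 bytes of tuple data, plenty remains.
ok = hole_can_hold_lsn(pd_lower=24 + 40 * 4, pd_upper=8192 - 7000)
```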
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Alvaro Herrera writes: > Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011: >> I can see two basic approaches we might take here: >> >> 1. Modify autovacuum to use something from the stats collector, rather >> than reltuples, to make its decisions. I'm not too clear on why >> reltuples is being used there anyway; is there some good algorithmic or >> statistical reason why AV should be looking at a number from the last >> vacuum? > It uses reltuples simply because it was what the original contrib code > was using. Since pgstat was considerably weaker at the time, reltuples > might have been the only thing available. It's certainly the case that > pgstat has improved a lot since autovacuum got in, and some things have > been revised but not this one. On reflection I'm hesitant to do this, especially for a backpatched bug fix, because it would be changing the feedback loop behavior for autovacuum scheduling. That could have surprising consequences. >> 2. Revise the vacuum code so that it doesn't skip updating the pg_class >> entries. We could have it count the number of pages it skipped, rather >> than just keeping a bool, and then scale up the rel_tuples count to be >> approximately right by assuming the skipped pages have tuple density >> similar to the scanned ones. > Hmm, interesting idea. This would be done only for toast tables, or all > tables? I'm thinking just do it for all. The fact that these numbers don't necessarily update after a vacuum is certainly surprising in and of itself, and it did not work that way before the VM patch went in. I'm concerned about other stuff besides AV not dealing well with obsolete values. > At this point I only wonder why we store tuples & pages rather than just > live tuple density. It's just for backwards compatibility. I've thought about doing that in the past, but I don't know what client-side code might be looking at relpages/reltuples. It's not like collapsing them into one field would save much, anyway. 
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Wed, May 25, 2011 at 11:47 AM, Tom Lane wrote: > 2. Revise the vacuum code so that it doesn't skip updating the pg_class > entries. We could have it count the number of pages it skipped, rather > than just keeping a bool, and then scale up the rel_tuples count to be > approximately right by assuming the skipped pages have tuple density > similar to the scanned ones. This approach doesn't seem like a good idea to me. The skipped portions of the table are the ones that haven't been modified in a while, so this is, or embeds, an assumption that the tuples in the hot and cold portions of the table are the same size. That might be true, but it isn't hard to think of cases where it might not be. There could also very easily be sampling error, because it's easy to imagine a situation where 99% of the table is getting skipped. Any error that creeps into the estimate is going to get scaled up along with the estimate itself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
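The sampling-error worry can be made concrete with a toy calculation (hypothetical numbers; the straight scale-up here models option 2 from upthread, not code that exists anywhere):

```python
def naive_scale_up(counted_tuples, scanned_pages, rel_pages):
    # Option 2's straight extrapolation: assume the skipped pages have
    # the same tuple density as the pages the vacuum actually visited.
    return counted_tuples * rel_pages / scanned_pages

# 99% of a 10,000-page table is skipped. The hot 1% holds 50 tuples/page
# while the cold 99% really holds 100, so the true count is ~995,000 --
# but the extrapolation reports 500,000, off by nearly a factor of two.
estimate = naive_scale_up(counted_tuples=5_000, scanned_pages=100,
                          rel_pages=10_000)
```

The smaller the scanned fraction, the more any density skew in the visited pages is magnified.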
Re: [HACKERS] Reducing overhead of frequent table locks
Simon Riggs wrote: > On Tue, May 24, 2011 at 6:37 PM, Robert Haas wrote: > > >> That being said, it's a slight extra cost for all fast-path lockers to > >> benefit > >> the strong lockers, so I'm not prepared to guess whether it will pay off. > > > > Yeah. Basically this entire idea is about trying to make life easier > > for weak lockers at the expense of making it more difficult for strong > > lockers. I think that's a good trade-off in general, but we might > > need to wait until we have an actual implementation to judge whether > > we've turned the dial too far. > > I like this overall concept and like the way this has been described > with strong and weak locks. It seems very useful to me, since temp > tables can be skipped. That leaves shared DDL and we have done lots to > reduce the lock levels held and are looking at further reductions > also. I think even quite extensive delays are worth the trade-off. > > I'd been looking at this also, though hadn't mentioned it previously > because I found an Oracle patent that discusses dynamically turning on > and off locking. So that's something to be aware of. IMHO if we > discuss this in terms of sharing/not sharing locking information then > it is sufficient to avoid the patent. That patent also discusses the > locking state change needs to wait longer than required. FYI, I thought we had agreed not to look at any patents because looking at them might cause us more problems than not looking at them. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pull up aggregate subquery
On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada wrote: > 2011/5/25 Hitoshi Harada : >> So I'm still >> thinking which of pulling up and parameterized scan is better. > > After more investigation I came up with third idea, pushing down > RangeTblEntry to aggregate subquery. This sounds like a crazy idea, > but it seems to me like it is slightly easier than pulling up > agg-subquery. The main point is that when you want to pull up, you > must care if the final output would be correct with other join > relations than the aggregate-related one. In contrast, when pushing > down the target relation to agg-subquery it is simple to ensure the > result. > > I'm looking around pull_up_subqueries() in subquery_planner() to add > the pushing down logic. It could be possible to do it around > make_one_rel() but I bet query structure changes are doable against > Query, not PlannerInfo. How do you decide whether or not to push down? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tackling full page writes
On Wed, May 25, 2011 at 10:13 AM, Bruce Momjian wrote: >> > Idempotent does seem like the most promising idea. >> >> I tend to agree with you, but I'm worried it won't actually work out >> to a win. By the time we augment the records with enough additional >> information we may have eaten up a lot of the benefit we were hoping >> to get. > > This is where I was confused. Our bad case now is when someone modifies > one row on a page between checkpoints --- instead of writing 400 bytes, > we write 8400. What portion of between-checkpoint activity writes more > than a few rows to a page? I didn't think many, except for COPY. > Ideally we could switch in and out of this mode per page, but that seems > super-complicated. Well, an easy to understand example would be a page that gets repeated HOT updates. We'll do this: add a tuple, add a tuple, add a tuple, add a tuple, HOT cleanup, add a tuple, add a tuple, add a tuple, add a tuple, HOT cleanup... and so on. In the worst case, that could be done many, many times between checkpoints that might be up to an hour apart. The problem can also occur (with a little more difficulty) even without HOT. Imagine a small table without lots of inserts and deletes. Page fills up, some rows are deleted, vacuum frees up space, page fills up again, some more rows are deleted, vacuum frees up space again, and so on. But you raise an interesting point, which is that it might also be possible to reduce the impact of write-ahead logging in other ways. For example, if we're doing a large COPY into a table, we could buffer up a full block of tuples and then just emit an FPI for the page. This would likely be cheaper than logging each tuple individually. In fact, you could imagine keeping a queue of pending WAL for each block in shared buffers. You don't really need that WAL to be consolidated into a single stream until either (a) you need to write the block or (b) you need to commit the transaction. 
When one of those things happens, you can decide at that point whether it's cheaper to emit the individual records or do some sort of consolidation. Doing it in exactly that way is probably impractical, because every backend that wants to commit would have to make a sweep of every buffer it's dirtied and see if any of them still contain WAL that needs to be shoved into the main queue, and that would probably suck, especially for small transactions. But maybe there's some variant that could be made to work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
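The 400-vs-8400-byte arithmetic from upthread generalizes to a rough model of the COPY consolidation idea (all sizes are hypothetical round numbers, not measured WAL record sizes):

```python
PAGE_IMAGE = 8192       # assumed cost of one full-page image
PER_TUPLE_RECORD = 400  # assumed cost of one per-tuple WAL record

def wal_bytes_per_page(tuples_per_page, consolidate):
    # Consolidating a freshly filled page into a single page image beats
    # logging each tuple individually once the page holds more than
    # PAGE_IMAGE / PER_TUPLE_RECORD (about 20) tuples.
    if consolidate:
        return PAGE_IMAGE
    return tuples_per_page * PER_TUPLE_RECORD

# A COPY that packs 100 tuples onto a page: 40,000 bytes of per-tuple
# records versus a single 8,192-byte page image.
saving = wal_bytes_per_page(100, False) - wal_bytes_per_page(100, True)
```

Conversely, for the single-row-per-page case Bruce describes, the full-page image is the expensive path, which is why a per-page choice would be attractive if it weren't so complicated.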
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Excerpts from Tom Lane's message of mié may 25 11:47:52 -0400 2011: > I think I see what must be going on here: that toast table must contain > a long run of all-visible-according-to-the-VM pages (which is hardly a > surprising situation). This results in VACUUM choosing not to update > the pg_class entry: > > /* > * Update statistics in pg_class. But only if we didn't skip any pages; > * the tuple count only includes tuples from the pages we've visited, and > * we haven't frozen tuples in unvisited pages either. The page count is > * accurate in any case, but because we use the reltuples / relpages ratio > * in the planner, it's better to not update relpages either if we can't > * update reltuples. > */ > if (vacrelstats->scanned_all) > vac_update_relstats(onerel, > vacrelstats->rel_pages, vacrelstats->rel_tuples, > vacrelstats->hasindex, > FreezeLimit); > > For an ordinary table this wouldn't be fatal because we'll still do an > ANALYZE from time to time, and that will update the entries with new > (approximate) values. But we never run ANALYZE on toast tables. Ouch. > I can see two basic approaches we might take here: > > 1. Modify autovacuum to use something from the stats collector, rather > than reltuples, to make its decisions. I'm not too clear on why > reltuples is being used there anyway; is there some good algorithmic or > statistical reason why AV should be looking at a number from the last > vacuum? It uses reltuples simply because it was what the original contrib code was using. Since pgstat was considerably weaker at the time, reltuples might have been the only thing available. It's certainly the case that pgstat has improved a lot since autovacuum got in, and some things have been revised but not this one. > 2. Revise the vacuum code so that it doesn't skip updating the pg_class > entries. 
We could have it count the number of pages it skipped, rather > than just keeping a bool, and then scale up the rel_tuples count to be > approximately right by assuming the skipped pages have tuple density > similar to the scanned ones. Hmm, interesting idea. This would be done only for toast tables, or all tables? At this point I only wonder why we store tuples & pages rather than just live tuple density. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Volunteering as Commitfest Manager
On Wed, May 25, 2011 at 10:32 AM, Simon Riggs wrote: > On Wed, May 25, 2011 at 3:24 PM, Tom Lane wrote: >> Simon Riggs writes: >>> I hear that CF manager is a difficult role for a single individual. >>> So it makes sense to split that role between multiple people. >> >>> I volunteer to be the CF manager for Replication, and also for >>> Performance. ... >>> Patches need an eventual committer anyway, so this is a reasonable way >>> of having the process be managed by the eventual committer. >> >> ISTM that we really want the CF manager to be somebody who is *not* >> directly involved in reviewing or committing patches. It's a distinct >> skill set, so there is no reason why it's a good idea for a committer >> to do it. And we need to get the CF work load more spread out, not more >> concentrated. > > I agree it makes sense if a non-committer performs the role. If a > committer does take the role, I would volunteer to split the role and > for me to work on the suggested areas. I think it would be really valuable for you to get more involved in reviewing some of the patches, whether you end up doing any of the CommitFest management tasks or not. In the past, your involvement has been light; but we have a limited number of people who are qualified to do detailed technical reviews of complicated patches, and more help is very much needed. I share the opinion expressed upthread that in an ideal world, we would not have major reviewers or committers also doing CommitFest management work, but in some sense that's an ideal world that we don't live in. Greg Smith and Kevin Grittner, both of whom have successfully managed CommitFests in the past, also contribute in many other ways, and I would not want to say either that those other contributions disqualify them from managing CommitFests, or that because they are managing the CommitFest they should suspend their other involvement. Either rule would be a random bureaucratic obstacle that accomplishes nothing useful. 
So I don't see the fact that you are a committer as a reason why you can't or shouldn't help with CommitFest management - if, for example, you don't want to review patches yourself, but you do want to help organize other people to volunteer to review patches, we should accept the contribution that you're willing to make rather than insisting that you should make some other contribution instead. So, what I think is important is to understand exactly what you'd like to volunteer to do, and see whether that fits in well with what other people want to do. Typically, at the beginning of the CommitFest, there are a bunch of people who volunteer to be assigned a patch. This task is likely best done by one person who has the entire CommitFest in mind - in particular, I have found that it's a good idea to tackle the most significant patches first; the little stuff can be picked off at the end without much trouble, and is rarely what holds the process up. However, all of the other CommitFest management tasks can be easily split across multiple people. The main task, and where the vast majority of the work goes, is in following up on patches where discussion has died off or failed to reach a conclusion, and encouraging the patch author to resubmit or the reviewer to re-review or a committer to consider it for commit or just marking it Returned with Feedback if there's insufficient consensus and/or time and/or motivation to get it finished up in a month's time. We've allowed this to become the problem of a single and rather monolithic individual, but that system sucks, and I have no desire to perpetuate it. 
It sucks for that individual because it's a lot of work, and it's difficult to keep many unrelated patches in your head at the same time; and it sucks for everyone else, because about half the time that one monolithic individual finds that the rest of their life interferes with PostgreSQL, or the other things they are doing for PostgreSQL interfere with their role as CommitFest manager, and they don't do it as well as it needs to be done to really make things run smoothly. Therefore, if you'd like to volunteer to help keep on top of the replication and performance patches, and make sure that they are getting followed up on regularly, I think that would be excellent. We tried before having a position of "patch-chaser" or "assistant commitfest manager by topic" for the September of 2009 CommitFest, but it wasn't entirely successful. It's time to revisit that concept, because we're not always going to find one person who can devote 40+ hours to managing a CommitFest, and even when we can, it's not exactly accomplishing our goal of decentralizing the work. It's better than the "Tom do everything" approach, and we do have a lot of people helping review now, but there are still two or three people per CommitFest burning a lot of midnight oil. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pg
Re: [HACKERS] Nested CASE-WHEN scoping
Heikki Linnakangas writes: > While looking at fixing the multiple-evaluation issue in IN and BETWEEN > discussed a while ago, I realized that the current assumption that only > one CaseTestExpr placeholder needs to be valid at any given time is not > true. [ scratches head ... ] Why does the save/restore in ExecEvalCase not take care of this? > So I'm going to put the BETWEEN/IN fix aside for now, and refactor the > placeholder infrastructure to handle several simultaneous placeholders, That sounds like a mess, and I'm not even convinced it'll solve the problem ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pull up aggregate subquery
2011/5/26 Tom Lane : > Hitoshi Harada writes: >> 2011/5/25 Hitoshi Harada : >>> So I'm still >>> thinking which of pulling up and parameterized scan is better. > >> After more investigation I came up with a third idea, pushing down >> RangeTblEntry to aggregate subquery. This sounds like a crazy idea, > > Yes, it sure does. Won't that typically change the number of rows > produced by the subquery's jointree? And thus possibly change the > aggregate's result? No, because it will be restricted to the assumption that the join qual == the grouping key and the outer relation is unique on the var in that join qual. Pushing down the baserel to the sub-aggregate is eventually equivalent to pulling up the sub-aggregate if there are only two relations (which is my first example.) The difference is whether the optimizer needs to care about other relations which may change the number of rows produced. Regards, -- Hitoshi Harada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Florian Helmberger writes:
> On 25.05.11 04:47, Tom Lane wrote:
>> Florian Helmberger writes:
>>> I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
>>> Debian 5.0.4 and have an issue with a TOAST table and far too frequent
>>> autovacuum runs.
>>>
>>> I think I've pinned the problem down to the values pg_class holds for
>>> the affected TOAST table:
>>>
>>> relpages  | 433596
>>> reltuples | 1868538
>>>
>>> These values are significantly too low. Interestingly, the autovacuum
>>> log output reports the correct values:
>>>
>>> pages: 0 removed, 34788136 remain
>>> tuples: 932487 removed, 69599038 remain
>>>
>>> but these aren't stored in pg_class after each run.

>> That's exceedingly weird. Do the pg_stat_all_tables columns update
>> after autovacuums on that table?

> Yes they do:

I think I see what must be going on here: that toast table must contain a long run of all-visible-according-to-the-VM pages (which is hardly a surprising situation). This results in VACUUM choosing not to update the pg_class entry:

    /*
     * Update statistics in pg_class.  But only if we didn't skip any pages;
     * the tuple count only includes tuples from the pages we've visited, and
     * we haven't frozen tuples in unvisited pages either.  The page count is
     * accurate in any case, but because we use the reltuples / relpages ratio
     * in the planner, it's better to not update relpages either if we can't
     * update reltuples.
     */
    if (vacrelstats->scanned_all)
        vac_update_relstats(onerel,
                            vacrelstats->rel_pages,
                            vacrelstats->rel_tuples,
                            vacrelstats->hasindex,
                            FreezeLimit);

For an ordinary table this wouldn't be fatal because we'll still do an ANALYZE from time to time, and that will update the entries with new (approximate) values. But we never run ANALYZE on toast tables. And this would *still* be okay, because as noted in the comment, the planner only depends on the ratio being roughly correct, not on either individual value being current.
But autovacuum didn't get the memo; it thinks it can use reltuples to make decisions. I can see two basic approaches we might take here: 1. Modify autovacuum to use something from the stats collector, rather than reltuples, to make its decisions. I'm not too clear on why reltuples is being used there anyway; is there some good algorithmic or statistical reason why AV should be looking at a number from the last vacuum? 2. Revise the vacuum code so that it doesn't skip updating the pg_class entries. We could have it count the number of pages it skipped, rather than just keeping a bool, and then scale up the rel_tuples count to be approximately right by assuming the skipped pages have tuple density similar to the scanned ones. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] use less space in xl_xact_commit patch
On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci wrote: >> Da: Simon Riggs >> I can't find a clear discussion of what you are trying to do, and how, >> just a URL back to a complex discussion on another topic. > > > While trying to write a patch to allow changing an unlogged table into > a logged one, I had to add another int field to xl_xact_commit. > Robert Haas said: > > "I have to admit I don't like this approach very much. I can't see > adding 4 bytes to every commit record for this feature." > > > which is a correct remark. > > xl_xact_commit can contain some arrays (relations to drop, > committed sub-trans, shared invalidation msgs). The length of > these arrays is specified using 3 ints in the struct. > > So, to avoid adding more ints to the struct, it has been suggested that I > remove all the ints, and use xl_xact_commit.xinfo to flag which > arrays are, in fact, present. > > So the whole idea is: > > - remove nrels, nsubxacts and nmsgs from xl_xact_commit > - use bits in xinfo to signal which arrays are present at the end > of xl_xact_commit > - for each present array, add the length of the array (as int) at > the end of xl_xact_commit > - add each present array after all the lengths OK, that's clear. Thanks. That formatting sounds quite complex. I would propose we split this into 2 WAL records: xl_xact_commit and xl_xact_commit_with_info. xl_xact_commit doesn't have any flags, counts or arrays. xl_xact_commit_with_info always has all 3 counts, even if zero. Arrays follow the main record. I think it might also be possible to remove dbId and tsId from xl_xact_commit if we use that definition. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing overhead of frequent table locks
On Wed, May 25, 2011 at 8:56 AM, Simon Riggs wrote: > On Wed, May 25, 2011 at 1:44 PM, Robert Haas wrote: >> On Wed, May 25, 2011 at 8:27 AM, Simon Riggs wrote: >>> I got a bit lost with the description of a potential solution. It >>> seemed like you were unaware that there is a local lock and a shared >>> lock table, maybe just me? >> >> No, I'm not unaware of the local lock table. The point of this >> proposal is to avoid fighting over the LWLocks that protect the shared >> hash table by allowing some locks to be taken without touching it. > > Yes, I got that. I just couldn't work out where mmap came in. I don't think we were planning to use mmap anywhere. >>> Design seemed relatively easy from there: put local lock table in >>> shared memory for all procs. We then have a use_strong_lock at proc >>> and at transaction level. Anybody that wants a strong lock first sets >>> use_strong_lock at proc and transaction level, then copies all local >>> lock data into shared lock table, double checking >>> TransactionIdIsInProgress() each time. Then queues for lock using the >>> now fully set up shared lock table. When transaction with strong lock >>> completes we do not attempt to reset transaction level boolean, only >>> at proc level, since DDL often occurs in groups and we want to avoid >>> flip-flopping quickly between lock share states. Cleanup happens >>> regularly by bgwriter, perhaps every 10 seconds or so. All locks are >>> still visible for pg_locks. >> >> I'm not following this... > > Which bit aren't you following? It's a design outline for how to > implement, deliberately brief to allow a discussion of design > alternatives. Well, OK: 1. I don't think putting the local lock table in shared memory is a good idea, both for performance (keeping it uncontended has value) and for memory management (it would increase shared memory requirements quite a bit). 
The design Noah and I were discussing upthread uses only a small and fixed amount of shared memory for each backend, and preserves the local lock table as an unshared resource. 2. You haven't explained the procedure for acquiring a weak lock at all, and in particular how a weak locker would be able to quickly determine whether a conflicting strong lock was potentially present. 3. I don't understand the proposed procedure for acquiring a strong lock either; in particular, I don't see why TransactionIdIsInProgress() would be relevant. The lock manager doesn't really do anything with transaction IDs now, and you haven't offered any explanation of why that would be necessary or advisable. 4. You haven't explained what the transaction-level boolean would actually do. It's not even clear whether you intend for that to be kept in local or shared memory. It's also unclear what you intend for bgwriter to clean up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pull up aggregate subquery
Hitoshi Harada writes: > 2011/5/25 Hitoshi Harada : >> So I'm still >> thinking which of pulling up and parameterized scan is better. > After more investigation I came up with third idea, pushing down > RangeTblEntry to aggregate subquery. This sounds like a crazy idea, Yes, it sure does. Won't that typically change the number of rows produced by the subquery's jointree? And thus possibly change the aggregate's result? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
Alvaro Herrera writes: > Excerpts from Vaibhav Kaushal's message of mié may 25 05:52:32 -0400 2011: >> If the above is confusing, I just want to ask: "Is expression evaluator, >> even in part responsible for {PLANNEDSTMT creation?" > Yeah, as far as I understood Tom's talk, the expr evaluator is used to > reduce some expressions to constants and such. The planner would never call it with an expression containing a Var, though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
Excerpts from Vaibhav Kaushal's message of mié may 25 05:52:32 -0400 2011: > If the above is confusing, I just want to ask: "Is expression evaluator, > even in part responsible for {PLANNEDSTMT creation?" Yeah, as far as I understood Tom's talk, the expr evaluator is used to reduce some expressions to constants and such. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pull up aggregate subquery
2011/5/25 Hitoshi Harada : > So I'm still > thinking which of pulling up and parameterized scan is better. After more investigation I came up with third idea, pushing down RangeTblEntry to aggregate subquery. This sounds like a crazy idea, but it seems to me like it is slightly easier than pulling up agg-subquery. The main point is that when you want to pull up, you must care if the final output would be correct with other join relations than the aggregate-related one. In contrast, when pushing down the target relation to agg-subquery it is simple to ensure the result. I'm looking around pull_up_subqueries() in subquery_planner() to add the pushing down logic. It could be possible to do it around make_one_rel() but I bet query structure changes are doable against Query, not PlannerInfo. I appreciate any comments. Regards, -- Hitoshi Harada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing overhead of frequent table locks
Simon Riggs writes: > On Wed, May 25, 2011 at 1:44 PM, Robert Haas wrote: >> On Wed, May 25, 2011 at 8:27 AM, Simon Riggs wrote: >>> Design seemed relatively easy from there: put local lock table in >>> shared memory for all procs. We then have a use_strong_lock at proc >>> and at transaction level. Anybody that wants a strong lock first sets >>> use_strong_lock at proc and transaction level, then copies all local >>> lock data into shared lock table, >> I'm not following this... > Which bit aren't you following? It's a design outline for how to > implement, deliberately brief to allow a discussion of design > alternatives. What I'm not following is how moving the local lock table into shared memory can possibly be a good idea. The reason we invented the local lock table in the first place (we didn't use to have one) is so that a process could do some manipulations without touching shared memory. (Notably, it is currently nearly free, and certainly lock-free, to re-request a lock type you already hold. This is not an infrequent case.) That advantage will go away if you do this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Volunteering as Commitfest Manager
On Wed, May 25, 2011 at 3:24 PM, Tom Lane wrote: > Simon Riggs writes: >> I hear that CF manager is a difficult role for a single individual. >> So it makes sense to split that role between multiple people. > >> I volunteer to be the CF manager for Replication, and also for >> Performance. ... >> Patches need an eventual committer anyway, so this is a reasonable way >> of having the process be managed by the eventual committer. > > ISTM that we really want the CF manager to be somebody who is *not* > directly involved in reviewing or committing patches. It's a distinct > skill set, so there is no reason why it's a good idea for a committer > to do it. And we need to get the CF work load more spread out, not more > concentrated. I agree it makes sense if a non-committer performs the role. If a committer does take the role, I would volunteer to split the role and for me to work on the suggested areas. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] use less space in xl_xact_commit patch
Sorry, email sent without body. Fixed some English mistakes. commitlog_lessbytes02.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Volunteering as Commitfest Manager
Simon Riggs writes: > I hear that CF manager is a difficult role for a single individual. > So it makes sense to split that role between multiple people. > I volunteer to be the CF manager for Replication, and also for > Performance. ... > Patches need an eventual committer anyway, so this is a reasonable way > of having the process be managed by the eventual committer. ISTM that we really want the CF manager to be somebody who is *not* directly involved in reviewing or committing patches. It's a distinct skill set, so there is no reason why it's a good idea for a committer to do it. And we need to get the CF work load more spread out, not more concentrated. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner
Michael Nolan wrote: > Based on last year's discussion of this TODO item, it seems thoughts have been focused on estimating how much data is being satisfied from PG's shared buffers. However, I think that's only part of the problem. Sure, but neither it nor what you're talking about are the real gating factor on making an improvement here. Figuring out how to expose all this information to the optimizer so it can use it when planning is the hard part. Collecting a read time profile is just one of the many ways you can estimate what's in cache, and each of the possible methods has good and bad trade-offs. I've been suggesting that people assume that's a solved problem--I'm pretty sure what you're proposing was done by Greg Stark once, and even a prototype built--and instead ask what you're going to do next if you had this data. > This data would probably need to be kept separately for each table or index, as some tables or indexes may be mostly or fully in cache or on faster physical media than others, although in the absence of other data about a specific table or index, data about other relations in the same tablespace might be of some use. This is the important part. Model how the data needs to get stored such that the optimizer can make decisions using it, and I consider it easy to figure out how it will get populated later. There are actually multiple ways to do it, and it may end up being something people plug in an implementation that fits their workload into, rather than just having one available. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tackling full page writes
Robert Haas wrote: > That idea has the additional advantage of confusing the level of > detail of our WAL logging (minimal vs. archive vs. hot standby) with > the mechanism used to protect against torn pages (full page writes vs. > idempotent WAL records vs. prayer). When they set it wrong and > destroy their system, we can tell them it's their own fault for not > configuring the system properly! Bwahahahaha! I love it! Create confusing configuration options and blame the user! > In all seriousness, I can't imagine that we'd make this > user-configurable in the first place, since that would amount to > having two sets of WAL records each of which would be even less well > tested than what we have now; and for a project this complex, we > probably shouldn't even consider changing things that seem to work now > unless the new system is clearly better than the old. > > > Idempotent does seem like the most promising idea. > > I tend to agree with you, but I'm worried it won't actually work out > to a win. By the time we augment the records with enough additional > information we may have eaten up a lot of the benefit we were hoping > to get. This is where I was confused. Our bad case now is when someone modifies one row on a page between checkpoints --- instead of writing 400 bytes, we write 8400. What portion of between-checkpoint activity writes more than a few rows to a page? I didn't think many, except for COPY. Ideally we could switch in and out of this mode per page, but that seems super-complicated. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 5:57 PM, Robert Haas wrote: > > Alternatively, it's possible that we'd be better off vacuuming the > table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or > something) but only doing the index scans every once in a while when > enough dead line pointers have accumulated. That's precisely the reason I suggested separating heap and index vacuums instead of the tight integration we have now. If we don't spool the dead line pointers in a separate area, though, we would need to make sure that index vacuum runs through the heap first to collect the dead line pointers and then removes the corresponding index pointers. We would also need to take into consideration the implications on the visibility map for any such scheme to work correctly and efficiently. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Volunteering as Commitfest Manager
* Simon Riggs (si...@2ndquadrant.com) wrote: > So it makes sense to split that role between multiple people. I agree that it'd be good to have multiple people supporting the CF. I've been considering volunteering to be part of such a group for a given CF. > I volunteer to be the CF manager for Replication, and also for > Performance. I dislike the idea of splitting up the duties along these lines. However, if others are willing to handle it that way and we can get coverage for all the CF categories, I wouldn't object. The reason that I dislike this is simply that the actual work of the CF manager isn't really distinguished in any way based on if it's a Replication patch or a Performance patch or some other patch. Most of the CF work is about following-up with reviewers and authors and committers. I feel this kind of "I'd prefer to be working on what interests me/what I'm knowledgable in" concept is typically addressed through the self-volunteer reviewers, where someone will mark themselves down as a reviewer for a specific patch because it's what they're interested in. Additionally, reviewers, when volunteering, can, and often do, ask for specific kinds of patches (though it's usually driven more by 'size' or 'complexity' than category). That feels like a better place to put this than at the CF manager level. > Patches need an eventual committer anyway, so this is a reasonable way > of having the process be managed by the eventual committer. I also don't like the idea that committers, when supporting a commitfest, would only be involved/working on specific categories of patches. I feel that part of the idea of a commitfest is to have individuals, particularly committers, looking at patches which fall outside of what they're currently working on (since they can/could commit those things more-or-less anytime). 
My thinking (and I have no idea how others feel or if anyone's even considered it this way) is simply that part of the responsibility of a committer is to help get committed those non-committer patches which are good for PG and submitted through the right process, but which may not be of specific interest to any given committer. If a patch is already of interest to a committer because it hits on a hot-spot with them, then that patch doesn't really *need* a CF to get done. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 7:20 PM, Robert Haas wrote: > On Wed, May 25, 2011 at 7:07 AM, Pavan Deolasee > wrote: >>> But instead of allocating permanent space in the page header, which would >>> both reduce (admittedly only by 8 bytes) the amount of space available >>> for tuples, and more significantly have the effect of breaking on-disk >>> compatibility, I'm wondering if we could get by with making space for >>> that extra LSN only when it's actually present. In other words, when >>> it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch, >>> increment pd_upper, and use the extra space to store the LSN. There >>> is an alignment problem to worry about there but that shouldn't be a >>> huge issue. >> >> That might work but would require us to move tuples around when the first >> dead line pointer gets generated in the page. > > I'm confused. A major point of the approach I was proposing was to > avoid having to move tuples around. > Well, I am not sure how you can always guarantee to make space available for the LSN without moving tuples, irrespective of where you store it. But probably it's not important, as we discussed below. >> You may argue that we should >> be holding a cleanup-lock when that happens and the dead line pointer >> creation is always followed by a call to PageRepairFragmentation(), so it >> should be easier to make space for the LSN. > > I'm not sure if this is the same thing you're saying, but certainly > the only time we need to make space for this value is when we've just > removed tuples from the page and defragmented, and at that point there > should certainly be 8 bytes free somewhere. > Agree. >> Instead of storing the LSN after the page header, would it be easier to set >> pd_special and store the LSN at the end of the page ? > > I was proposing storing it after the line pointer array, not after the > page header. 
If we store it at the end of the page, I suspect we're > going to basically end up allocating permanent space for it, because > otherwise we'll have to shift all the tuple data forward and backward > by 8 bytes when we allocate or deallocate space for this. Now, maybe > that's OK: I'm not sure. But it's something to think about carefully. > If we are going to allocate permanent space, the special space seems > better than the page header, because we should be able to make that > work without breaking on-disk compatibility, and because AFAIUI we only need > the space for heap pages, not index pages. > I think if we are reclaiming LP_DEAD line pointers only while defragmenting the page, we can always reclaim the space for the LSN, irrespective of where we store it. So maybe we should decide depending on what would matter for on-disk compatibility and whatever requires the least invasive changes. I don't know what that is yet. > >> If so, how do we handle the case where after restart the page may get an LSN >> less than the index vacuum LSN if the index vacuum happened before the >> crash/stop? > > Well, on a crash, the unlogged relations get truncated, and their > indexes also, so no problem. On a clean shutdown, I guess we need to > arrange to save the counter across restarts. Oh ok. I was not aware that unlogged tables get truncated. I think we can just restore from the value stored for the last successful index vacuum (after incrementing it, maybe). That should be possible. > > Take a look at the existing logic around GetXLogRecPtrForTemp(). > That's slightly different, because there we don't even need to be > consistent across backends. We just need an increasing sequence of > values. For unlogged relations things are a bit more complex - but it > seems manageable. Ok. Will look at it. 
Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The way to know whether the standby has caught up with the master
Heikki Linnakangas writes: > On 25.05.2011 07:42, Fujii Masao wrote: >> To achieve that, I'm thinking to change walsender so that, when the standby >> has caught up with the master, it sends back the message indicating that to >> the standby. And I'm thinking to add new function (or view like >> pg_stat_replication) >> available on the standby, which shows that info. > By the time the standby has received that message, it might not be > caught-up anymore because new WAL might've been generated in the master > already. Even assuming that you believe this is a useful capability, there is no need to change walsender. It *already* sends the current-end-of-WAL in every message, which indicates precisely whether the message contains all of available WAL data. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] use less space in xl_xact_commit patch
commitlog_lessbytes02.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 7:07 AM, Pavan Deolasee wrote: >> But instead of allocating permanent space in the page header, which would >> both reduce (admittedly only by 8 bytes) the amount of space available >> for tuples, and more significantly have the effect of breaking on-disk >> compatibility, I'm wondering if we could get by with making space for >> that extra LSN only when it's actually present. In other words, when >> it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch, >> increment pd_upper, and use the extra space to store the LSN. There >> is an alignment problem to worry about there but that shouldn't be a >> huge issue. > > That might work but would require us to move tuples around when the first > dead line pointer gets generated in the page. I'm confused. A major point of the approach I was proposing was to avoid having to move tuples around. > You may argue that we should > be holding a cleanup-lock when that happens and the dead line pointer > creation is always followed by a call to PageRepairFragmentation(), so it > should be easier to make space for the LSN. I'm not sure if this is the same thing you're saying, but certainly the only time we need to make space for this value is when we've just removed tuples from the page and defragmented, and at that point there should certainly be 8 bytes free somewhere. > Instead of storing the LSN after the page header, would it be easier to set > pd_special and store the LSN at the end of the page ? I was proposing storing it after the line pointer array, not after the page header. If we store it at the end of the page, I suspect we're going to basically end up allocating permanent space for it, because otherwise we'll have to shift all the tuple data forward and backward by 8 bytes when we allocate or deallocate space for this. Now, maybe that's OK: I'm not sure. But it's something to think about carefully. 
If we are going to allocate permanent space, the special space seems better than the page header, because we should be able to make that work without breaking on-disk compatibility, and because AFAIUI we only need the space for heap pages, not index pages. > I think that should be not so difficult to handle. I think handling this by > special space mechanism might be less complicated. A permanent space allocation will certainly be simpler. I'm just not sure how much we care about giving up 8 bytes that could potentially be used to store tuple data. >> If the LSN stored in the page precedes the >> start-of-last-successful-index-vacuum LSN, and if, further, we can get >> a buffer cleanup lock on the page, then we can do a HOT cleanup and >> life is good. Otherwise, we can either (1) just forget about the >> most-recent-dead-line-pointer LSN - not ideal but not catastrophic >> either - or (2) if the start-of-last-successful-vacuum-LSN is old >> enough, we could overwrite an LP_DEAD line pointer in place. > I don't think we need the cleanup lock to turn the LP_DEAD line pointers to > LP_UNUSED since that does not involve moving tuples around. So a simple > EXCLUSIVE lock should be enough. But we would need to WAL log the operation > of turning DEAD to UNUSED, so it would be simpler to consolidate this in HOT > pruning. There could be exceptions such as, say, a large number of DEAD line > pointers accumulated towards the end, where reclaiming those would free up > substantial space in the page. But maybe we can use those conditions to > invoke HOT prune instead of handling them separately. Makes sense. >> Another issue is that this causes problems for temporary and unlogged >> tables, because no WAL records are generated and, therefore, the LSN >> does not advance. This is also a problem for GIST indexes; Heikki >> fixed temporary GIST indexes by generating fake LSNs off of a >> backend-local counter. Unlogged GIST indexes are currently not >> supported. 
I think what we need to do is create an API to which you >> can pass a relation and get an LSN. If it's a permanent relation, you >> get a regular LSN. If it's a temporary relation, you get a fake LSN >> based on a backend-local counter. If it's an unlogged relation, you >> get a fake LSN based on a shared-memory counter that is reset on >> restart. If we can encapsulate that properly, it should provide both >> what we need to make this idea work and allow a somewhat graceful fix >> for the GIST-vs-unlogged problem. > Can you explain more how things would work for unlogged tables? Do we use > the same shared memory counter for tracking last successful index vacuum? Yes. > If so, how do we handle the case where after restart the page may get an LSN > less than the index vacuum LSN if the index vacuum happened before the > crash/stop? Well, on a crash, the unlogged relations get truncated, and their indexes also, so no problem. On a clean shutdown, I guess we need to arrange to save the counter across restarts. Take a look at the existing logic around GetXLogRecPtrForTemp(). That's slightly different, because there we don't even need to be consistent across backends. We just need an increasing sequence of values. For unlogged relations things are a bit more complex - but it seems manageable.
Re: [HACKERS] use less space in xl_xact_commit patch
> From: Simon Riggs > I can't find a clear discussion of what you are trying to do, and how, > just a URL back to a complex discussion on another topic. While trying to write a patch to allow changing an unlogged table into a logged one, I had to add another int field to xl_xact_commit. Robert Haas said: "I have to admit I don't like this approach very much. I can't see adding 4 bytes to every commit record for this feature." which is a correct remark. xl_xact_commit can contain some arrays (relations to drop, committed sub-trans, shared invalidation msgs). The length of these arrays is specified using 3 ints in the struct. So, to avoid adding more ints to the struct, it was suggested that I remove all the ints, and use xl_xact_commit.xinfo to flag which arrays are, in fact, present. So the whole idea is: - remove nrels, nsubxacts and nmsgs from xl_xact_commit - use bits in xinfo to signal which arrays are present at the end of xl_xact_commit - for each present array, add the length of the array (as int) at the end of xl_xact_commit - add each present array after all the lengths > I can't see any analysis that shows whether this would be effective in > reducing space of WAL records and what the impacts might be. The space of WAL records should always be <= the space used without the patch: in the worst-case scenario, all ints are present (as they would be without the patch). In the best case, which I guess is the more common one, each xl_xact_commit will be 3 ints shorter. I don't think the effect will be "perceivable": the whole idea is to allow more variable-length structures in xl_xact_commit in the future, without increasing the space on disk. > The patch contains very close to zero comments. I tried to add some. Leonardo commitlog_lessbytes01.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Wed, May 25, 2011 at 1:27 PM, Robert Haas wrote: >> At the moment we scan indexes if we have > 0 rows to remove, which is >> probably wasteful. Perhaps it would be better to keep a running total >> of rows to remove, by updating pg_stats, then when we hit a certain >> threshold in total we can do the index scan. So we don't need to >> remember the TIDs, just remember how many there were and use that to >> avoid cleaning too vigorously. > > That occurred to me, too. If we're being launched by autovacuum then > we know that a number of updates and deletes equal ~20% (or whatever > autovacuum_vacuum_scale_factor is set to) of the table size have > occurred since the last autovacuum. But it's possible that many of > those were HOT updates, in which case the number of index entries to > be cleaned up might be much less than 20% of the table size. > Alternatively, it's possible that we'd be better off vacuuming the > table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or > something) but only doing the index scans every once in a while when > enough dead line pointers have accumulated. After all, it's the first > heap pass that frees up most of the space; cleaning dead line pointers > seems a bit less urgent. But not having done any real analysis of how > this would work out in practice, I'm not sure whether it's a good idea > or not. We know whether a TID was once in the index or not, so we can keep an exact count. HOT doesn't come into it. We can remove TIDs from the index as well without VACUUM during btree split avoidance. We can optimise the second scan by skipping htids no longer present in the index, though we'd need a spare bit to mark that usage, which I'm not sure we have. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] tackling full page writes
On Tue, May 24, 2011 at 9:34 PM, Robert Haas wrote: > I'm not sure that these ideas are much good, but > for the sake of posterity: Both (1) and (2) seem promising to me. Heikki mentioned (2) would only be effective if we managed to change *all* WAL records. ISTM likely that we would find that difficult and/or time consuming and/or buggy. I would qualify that by saying *all* WAL record types for a certain page type, such as all btree index pages. Not that much better, since heap and btree are the big ones, ISTM. (1) seems like we could do it incrementally if we supported both partial and full page writes at the same time. That way we could work on the most frequent record types over time. Not sure if that is possible, but seems worth considering. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Reducing overhead of frequent table locks
On Wed, May 25, 2011 at 1:44 PM, Robert Haas wrote: > On Wed, May 25, 2011 at 8:27 AM, Simon Riggs wrote: >> I got a bit lost with the description of a potential solution. It >> seemed like you were unaware that there is a local lock and a shared >> lock table, maybe just me? > > No, I'm not unaware of the local lock table. The point of this > proposal is to avoid fighting over the LWLocks that protect the shared > hash table by allowing some locks to be taken without touching it. Yes, I got that. I just couldn't work out where mmap came in. >> Design seemed relatively easy from there: put local lock table in >> shared memory for all procs. We then have a use_strong_lock at proc >> and at transaction level. Anybody that wants a strong lock first sets >> use_strong_lock at proc and transaction level, then copies all local >> lock data into shared lock table, double checking >> TransactionIdIsInProgress() each time. Then queues for lock using the >> now fully set up shared lock table. When a transaction with a strong lock >> completes we do not attempt to reset the transaction-level boolean, only >> at proc level, since DDL often occurs in groups and we want to avoid >> flip-flopping quickly between lock share states. Cleanup happens >> regularly via bgwriter, perhaps every 10 seconds or so. All locks are >> still visible in pg_locks. > > I'm not following this... Which bit aren't you following? It's a design outline for how to implement, deliberately brief to allow a discussion of design alternatives. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Reducing overhead of frequent table locks
On Wed, May 25, 2011 at 8:27 AM, Simon Riggs wrote: > I got a bit lost with the description of a potential solution. It > seemed like you were unaware that there is a local lock and a shared > lock table, maybe just me? No, I'm not unaware of the local lock table. The point of this proposal is to avoid fighting over the LWLocks that protect the shared hash table by allowing some locks to be taken without touching it. > Design seemed relatively easy from there: put local lock table in > shared memory for all procs. We then have a use_strong_lock at proc > and at transaction level. Anybody that wants a strong lock first sets > use_strong_lock at proc and transaction level, then copies all local > lock data into shared lock table, double checking > TransactionIdIsInProgress() each time. Then queues for lock using the > now fully set up shared lock table. When a transaction with a strong lock > completes we do not attempt to reset the transaction-level boolean, only > at proc level, since DDL often occurs in groups and we want to avoid > flip-flopping quickly between lock share states. Cleanup happens > regularly via bgwriter, perhaps every 10 seconds or so. All locks are > still visible in pg_locks. I'm not following this... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tackling full page writes
On Tue, May 24, 2011 at 11:52 PM, Bruce Momjian wrote: > Robert Haas wrote: >> 2. The other fairly obvious alternative is to adjust our existing WAL >> record types to be idempotent - i.e. to not rely on the existing page >> contents. For XLOG_HEAP_INSERT, we currently store the target tid and >> the tuple contents. I'm not sure if there's anything else, but we >> would obviously need the offset where the new tuple should be written, >> which we currently infer from reading the existing page contents. For >> XLOG_HEAP_DELETE, we store just the TID of the target tuple; we would >> certainly need to store its offset within the block, and maybe the >> infomask. For XLOG_HEAP_UPDATE, we'd need the old and new offsets and >> perhaps also the old and new infomasks. Assuming that's all we need >> and I'm not missing anything (which I won't bet on), that means we'd >> be adding, say, 4 bytes per insert or delete and 8 bytes per update. >> So, if checkpoints are spread out widely enough that there will be >> more than ~2K operations per page between checkpoints, then it makes >> more sense to just do a full page write and call it good. If not, >> this idea might have legs. > > I vote for "wal_level = idempotent" because so few people will know what > idempotent means. ;-) That idea has the additional advantage of confusing the level of detail of our WAL logging (minimal vs. archive vs. hot standby) with the mechanism used to protect against torn pages (full page writes vs. idempotent WAL records vs. prayer). When they set it wrong and destroy their system, we can tell them it's their own fault for not configuring the system properly! Bwahahahaha! 
In all seriousness, I can't imagine that we'd make this user-configurable in the first place, since that would amount to having two sets of WAL records each of which would be even less well tested than what we have now; and for a project this complex, we probably shouldn't even consider changing things that seem to work now unless the new system is clearly better than the old. > Idempotent does seem like the most promising idea. I tend to agree with you, but I'm worried it won't actually work out to a win. By the time we augment the records with enough additional information we may have eaten up a lot of the benefit we were hoping to get. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tackling full page writes
On Tue, May 24, 2011 at 10:52 PM, Jeff Davis wrote: > On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote: >> As I think about it a bit more, we'd >> need to XLOG not only the parts of the page we are actually modifying, but >> any that the WAL record would need to be correct on replay. > > I don't understand that statement. Can you clarify? I'll try. Suppose we have two WAL records A and B, with no intervening checkpoint, that both modify the same page. A reads chunk 1 of that page and then modifies chunk 2. B modifies chunk 1. Now, suppose we make A do a "partial page write" on chunk 2 only, and B do the same for chunk 1. At the point the system crashes, A and B are both on disk, and the page has already been written to disk as well. Replay begins from a checkpoint preceding A. Now, when we get to the record for A, what are we to do? If it were a full page image, we could just restore it, and everything would be fine after that. But if we replay the partial page write, we've got trouble: A will now see the state of chunk 1 as it existed after the action protected by B occurred, and will presumably do the wrong thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company