[HACKERS] Online base backup from the hot-standby
Hi, I would like to develop a function for 'Online base backup from the hot standby' in PostgreSQL 9.2.

Todo: Allow hot file system backups on standby servers (http://wiki.postgresql.org/wiki/Todo)

[GOAL]
* Allow pg_basebackup to run against a hot-standby server and take an online base backup.
  - In PostgreSQL 9.1, pg_basebackup can be run only against the primary server.
  - However, the physical copy performed by pg_basebackup increases the load on the primary server.
  - This function is therefore necessary.

[Problems]
(The following problems arise when a hot standby tries to take an online base backup the way pg_basebackup does from the primary.)
* pg_start_backup() and pg_stop_backup() cannot be executed on a hot-standby server.
  - A hot standby cannot insert a backup-end record into the WAL and cannot perform a CHECKPOINT, because a hot standby cannot write anything to the WAL.
* A hot standby cannot send WAL files to an archive server.
  - When pg_stop_backup() is executed on the primary, it waits until the WAL has been sent to the archive server; a hot standby cannot do this.

[Policy]
(I will develop this with the following policy.)
* This function must not affect the primary server.
  - I do not adopt the approach where the hot standby asks the primary to execute pg_basebackup, because many standbys may be connected to one primary.

[Approach]
* When pg_basebackup is executed against a hot-standby server, it performs a RESTARTPOINT instead of a CHECKPOINT. backup_label is built from the restartpoint's results and is sent to the designated backup server over the pg_basebackup connection.
* Instead of inserting a backup-end record, the hot standby writes the backup-end position into the backup history file and sends it to the designated backup server over the pg_basebackup connection.
  - In 9.1, the startup process learns the backup-end position only from the backup-end record. With this addition, the startup process can also learn the backup-end position from the backup history file. As a result, the startup process can recover reliably without a backup-end record.

[Preconditions]
(As a result of the policy and approach above, the following restrictions apply.)
* The WAL immediately after the backup starts must contain full-page writes, but the approach above cannot always guarantee this, because full_page_writes on the primary might be 'off'. The same applies when the standby replays WAL from which full-page writes have been removed by pg_lesslog.
* Because recovery starts from the last CHECKPOINT, it takes longer.
* I have not yet designed a replacement for waiting until the WAL has been sent to the archive server.

[Working Steps]
STEP 1: Teach the startup process to obtain the backup-end position not only from the backup-end record but also from the backup history file.
* The startup process is allowed to obtain the backup-end position from the backup history file.
* When pg_basebackup is executed, the backup history file is sent to the designated backup server.
STEP 2: Make pg_start_backup() and pg_stop_backup() executable on the hot-standby server.

[Plan until the first CommitFest (June 15)]
I will create a patch for STEP 1. (The patch should also resolve a problem with omnipitr-backup-slave; see http://archives.postgresql.org/pgsql-hackers/2011-03/msg01490.php)
* STEP 2 is scheduled for the next CommitFest (September 15).

Jun Ishizuka
NTT Software Corporation
TEL: 045-317-7018
E-Mail: ishizuka@po.ntts.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
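The backup-history-file idea in STEP 1 amounts to parsing the STOP WAL LOCATION line that the server already writes into backup history files. A minimal sketch in C of what that parsing step could look like — the helper name is hypothetical and the exact line format here is an assumption for illustration, not the proposed patch:

```c
#include <stdio.h>

/*
 * Extract the backup-end WAL position from one line of a backup
 * history file.  Such files contain a line of the form
 *   STOP WAL LOCATION: 0/B000190 (file 00000001000000000000000B)
 * Returns 1 and fills *xlogid / *xrecoff on success, 0 otherwise.
 */
static int
parse_backup_stop_location(const char *line,
                           unsigned int *xlogid, unsigned int *xrecoff)
{
    if (sscanf(line, "STOP WAL LOCATION: %X/%X", xlogid, xrecoff) == 2)
        return 1;
    return 0;
}
```

With this in hand, a startup process that finds no backup-end record could fall back to scanning the history file for this line and treating the parsed position as the end-of-backup point.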
Re: [HACKERS] Pre-alloc ListCell's optimization
Greg,

* Greg Stark (gsst...@mit.edu) wrote:
> On Thu, May 26, 2011 at 8:52 PM, Stephen Frost wrote:
> > list_concat() does explicitly say that cells will
> > be shared afterwards and that you can't pfree() either list (note that
> > there's actually a couple cases currently that I discovered which were
> > also addressed in the original patch where I commented out those
> > pfree()'s).
>
> So in traditional list it would splice the second argument onto the
> end of the first list. This has a few effects that it sounds like you
> haven't preserved. For example if I insert an element anywhere in
> list2 -- including in the first few elements -- it's also inserted
> into list1.

Reading through the comments, it doesn't look like we expressly forbid that, but it seems pretty unlikely that it's done. In any case, it wouldn't be difficult to fix, to be honest.. All we'd have to do is modify list2's head pointer to point to the new location. We do say that list1 is destructively changed and that the returned pointer must be used going forward.

> I'm not really sure we care about these semantics with our lists
> though. It's not like they're supposed to be a full-featured lisp
> emulator and it's not like the C code pulls any particularly clever
> tricks with lists. I suspect we may have already broken these
> semantics long ago but I haven't looked to see if that's the case.

It doesn't look like it was broken previously, but at the same time, it doesn't look like those semantics are depended upon (or at least, they're not tested through the regressions :).

Thanks,

	Stephen
Re: [HACKERS] Pre-alloc ListCell's optimization
* Greg Stark (gsst...@mit.edu) wrote:
> On Thu, May 26, 2011 at 11:57 AM, Stephen Frost wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > While I agree that there is some bloat that'll happen with this
> > approach, we could reduce it by just having a 4-entry cache instead of
> > an 8-entry cache. I'm not really sure that saving those 64 bytes per
> > list is really worth it though.
>
> First off this whole direction seems a bit weird to me. It sounds like
> you're just reimplementing palloc inside the List data structure with
> its allocator and everything. Why not just improve the memory
> allocator in palloc instead of layering a second one on top of it?

I do think it'd be great to improve palloc(), but having looked through that code, figuring out how to improve it for the small case (such as with the lists) while keeping it working well for larger and other cases doesn't exactly look trivial.

> But assuming there's an advantage I've missed there's another
> possibility here: Are most of these small lists constructed with
> list_makeN?

Looks like we've got 306 cases of list_make1(), 82 cases of list_makeN() (where N > 1), but that said, one can make a list w/ just lappend(), and that seems to happen with some regularity.

> But all this seems odd to me. The only reason for any of this is for
> api convenience so we can pass around lists instead of passing arrays.
> If the next links are really a big source of overhead we should just
> fix our apis to take arrays of the right length or arrays with a
> separate length argument.

I'm not really sure I agree with this.. Lists are pretty useful and easier to manage when you don't know the size. I expect quite a few of these lists are small for simple queries and can get pretty large for complex queries. Also, in many cases it's natural to step through the list and not need random access into it, which at least reduces the reasons to go to the effort of having a variable length array.

> Or if it's just palloc we should fix our memory allocator to behave
> the way the callers need it to. Heikki long ago suggested adding a
> stack allocator for the parser to use for its memory context to reduce
> overhead of small allocations which won't be freed until the context
> is freed for example.

Much of this originated from Greg's oprofile and Tom's further commentary on it here: http://archives.postgresql.org/pgsql-hackers/2011-04/msg00714.php

Thanks,

	Stephen
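The stack allocator Greg alludes to (Heikki's old suggestion) is essentially a bump allocator: allocations are carved sequentially off a single block and are only ever freed all at once, mimicking a parser memory context that gets reset when parsing finishes. A toy sketch — not PostgreSQL's actual MemoryContext API, and the names here are invented for illustration:

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy bump ("stack") allocator: no per-allocation free, only a bulk reset. */
typedef struct
{
    char   *base;   /* one big block the context lives in      */
    size_t  used;   /* bytes handed out so far                 */
    size_t  size;   /* total capacity of the block             */
} BumpContext;

static BumpContext *
bump_create(size_t size)
{
    BumpContext *cxt = malloc(sizeof(BumpContext));

    cxt->base = malloc(size);
    cxt->used = 0;
    cxt->size = size;
    return cxt;
}

static void *
bump_alloc(BumpContext *cxt, size_t nbytes)
{
    /* round the request up to pointer alignment */
    size_t  aligned = (nbytes + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    void   *p;

    if (cxt->used + aligned > cxt->size)
        return NULL;            /* a real allocator would grow here */
    p = cxt->base + cxt->used;
    cxt->used += aligned;
    return p;
}

static void
bump_reset(BumpContext *cxt)
{
    cxt->used = 0;              /* "frees" every allocation at once */
}
```

The appeal for small parse-tree allocations is that each allocation costs one addition and one comparison, with no per-chunk header at all; the trade-off is that nothing can be freed individually.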
Re: [HACKERS] Pre-alloc ListCell's optimization
On Thu, May 26, 2011 at 8:52 PM, Stephen Frost wrote:
> list_concat() does explicitly say that cells will
> be shared afterwards and that you can't pfree() either list (note that
> there's actually a couple cases currently that I discovered which were
> also addressed in the original patch where I commented out those
> pfree()'s).

So in traditional list it would splice the second argument onto the end of the first list. This has a few effects that it sounds like you haven't preserved. For example if I insert an element anywhere in list2 -- including in the first few elements -- it's also inserted into list1.

I'm not really sure we care about these semantics with our lists though. It's not like they're supposed to be a full-featured lisp emulator and it's not like the C code pulls any particularly clever tricks with lists. I suspect we may have already broken these semantics long ago but I haven't looked to see if that's the case.

--
greg
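The splice behaviour Greg describes is easiest to see with a toy singly linked list — a simplification, not PostgreSQL's actual pg_list.c, but enough to show why an element added through list2 becomes visible through list1 after a concat, while list1's cached length goes stale:

```c
#include <stdlib.h>

typedef struct Cell
{
    int          value;
    struct Cell *next;
} Cell;

typedef struct
{
    Cell *head;
    Cell *tail;
    int   length;   /* cached; can go stale after a splice */
} IntList;

static void
list_append(IntList *list, int value)
{
    Cell *cell = malloc(sizeof(Cell));

    cell->value = value;
    cell->next = NULL;
    if (list->tail)
        list->tail->next = cell;
    else
        list->head = cell;
    list->tail = cell;
    list->length++;
}

/* Splice list2 onto the end of list1; afterwards the cells are shared. */
static void
list_splice(IntList *list1, IntList *list2)
{
    if (list2->head == NULL)
        return;
    if (list1->tail)
        list1->tail->next = list2->head;
    else
        list1->head = list2->head;
    list1->tail = list2->tail;
    list1->length += list2->length;
}

/* Count cells by actually walking the chain. */
static int
list_walk_length(const IntList *list)
{
    int   n = 0;
    Cell *cell;

    for (cell = list->head; cell != NULL; cell = cell->next)
        n++;
    return n;
}
```

After `list_splice(&a, &b)`, appending to `b` links a new cell onto the shared tail, so a traversal of `a` sees it too even though `a.length` was never updated — exactly the hazard being discussed.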
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
Thanks Tom. Comparing to you people, I am definitely new to almost everything here. I did debug a few smaller programs and never seen anything as such. So asked. Moreover, those programs I compiled never used any optimization.

While everything seems to be working, it looks like the slot values do not change and all rows in a sequential scan return the first value it finds on the disk, n number of times, where n = number of rows in the table! I am going to compile without optimization now. Hopefully that would change a few things in the debugging process.

Seems beautiful, complicated, mysterious. And I thought I was beginning to understand computers. :) Whatever be the case, I will look more into it and ask again if I get into too much of trouble.

Regards,
Vaibhav

On Fri, May 27, 2011 at 9:18 AM, Tom Lane wrote:
> Vaibhav Kaushal writes:
> > Why do these lines:
> > ...
> > repeat twice?
>
> Hm, you're new to using gdb, no? That's pretty normal: gdb is just
> reflecting back the fact that the compiler rearranges individual
> instructions as it sees fit. You could eliminate most, though perhaps
> not all, of that noise if you built the program-under-test (ie postgres)
> at -O0.
>
> 			regards, tom lane
Re: [HACKERS] Pre-alloc ListCell's optimization
* Greg Stark (gsst...@mit.edu) wrote:
> On Thu, May 26, 2011 at 11:57 AM, Stephen Frost wrote:
> > Handling the 1-entry case would likely be pretty
> > straight-forward, but you need book-keeping as soon as you go to two,
> > and all that book-keeping feels like overkill for just a 2-entry cache
> > to me.
>
> Incidentally what if I call nconc and pass a second arg of a list that
> has the first few elements stashed in an array. Do you copy those
> elements into cells before doing the nconc? Does our nconc support
> having lists share cells? I suspect it doesn't actually so perhaps
> that's good enough.

nconc() turns into list_concat() which turns into adding list2 on to the end of list1 using the other normal lappend() routines which will utilize space in the cache of list1 if there is space available. Trying to use the old list2 for storage or much of anything turned into a real pain, unfortunately.

list_concat() does explicitly say that cells will be shared afterwards and that you can't pfree() either list (note that there's actually a couple cases currently that I discovered which were also addressed in the original patch where I commented out those pfree()'s).

Thanks,

	Stephen
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
Vaibhav Kaushal writes:
> Why do these lines:
> ...
> repeat twice?

Hm, you're new to using gdb, no? That's pretty normal: gdb is just reflecting back the fact that the compiler rearranges individual instructions as it sees fit. You could eliminate most, though perhaps not all, of that noise if you built the program-under-test (ie postgres) at -O0.

			regards, tom lane
Re: [HACKERS] Pre-alloc ListCell's optimization
On Thu, May 26, 2011 at 11:57 AM, Stephen Frost wrote:
> Handling the 1-entry case would likely be pretty
> straight-forward, but you need book-keeping as soon as you go to two,
> and all that book-keeping feels like overkill for just a 2-entry cache
> to me.

Incidentally what if I call nconc and pass a second arg of a list that has the first few elements stashed in an array. Do you copy those elements into cells before doing the nconc? Does our nconc support having lists share cells? I suspect it doesn't actually so perhaps that's good enough.

--
greg
Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?
OK, I ran a GDB trace into ExecScan and here is a part of it:

#
(gdb) finish
Run till exit from #0  ExecScanFetch (node=0x1d5c3c0, accessMtd=0x55dd10 , recheckMtd=0x55db70 ) at execScan.c:44
194		if (TupIsNull(slot))
(gdb) s
205		econtext->ecxt_scantuple = slot;
(gdb) s
206		int num_atts = slot->tts_tupleDescriptor->natts;
(gdb) s
207		elog(INFO, "[start] BEFORE ExecQual===");
(gdb) s
206		int num_atts = slot->tts_tupleDescriptor->natts;
(gdb) s
207		elog(INFO, "[start] BEFORE ExecQual===");
(gdb) s
elog_start (filename=0x7c9db2 "execScan.c", lineno=207, funcname=0x7c9e69 "ExecScan") at elog.c:1089
1089	{
(gdb)
##

Why do these lines:

206		int num_atts = slot->tts_tupleDescriptor->natts;
(gdb) s
207		elog(INFO, "[start] BEFORE ExecQual===");

repeat twice? I have written them only once! GDB documentation does not help! A few forums I am on, people accuse me of anything between bad programming to recursion. Any idea? I never face this with rest of the code (and in no other program).

I am on Fedora 13 X86_64.

Regards,
Vaibhav

On Wed, May 25, 2011 at 11:45 PM, Vaibhav Kaushal <vaibhavkaushal...@gmail.com> wrote:
> I think the command 'where' does the same. And the command showed something
> which looked like was part of evaluation...it got me confused. Anyways,
> thanks robert. I will check that too. I did not know the 'bt' command.
>
> --
> Sent from my Android
> On 25 May 2011 23:02, "Robert Haas" wrote:
Re: [HACKERS] Pre-alloc ListCell's optimization
On Thu, May 26, 2011 at 11:57 AM, Stephen Frost wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> I'm worried that this type of approach would
>> bloat the storage required in those cases to a degree that would make
>> the patch unattractive.
>
> While I agree that there is some bloat that'll happen with this
> approach, we could reduce it by just having a 4-entry cache instead of
> an 8-entry cache. I'm not really sure that saving those 64 bytes per
> list is really worth it though.

First off this whole direction seems a bit weird to me. It sounds like you're just reimplementing palloc inside the List data structure with its allocator and everything. Why not just improve the memory allocator in palloc instead of layering a second one on top of it?

But assuming there's an advantage I've missed there's another possibility here: Are most of these small lists constructed with list_makeN? In which case maybe the trick would be to special case the initial contents by hard coding a variable sized array which represents the first N elements and is only constructed when the list is first constructed with its initial values. So a list make with list_make3() would have a 3 element array and then any further elements added would be in the added cons cells. If any of those were removed we would decrement the count but leave the array in place.

This would reduce the overhead of any small static lists that aren't modified much which is probably the real case we're talking about. Things like operator arguments or things constructed in the parse tree. The cost would be the risk of bugs that only occur when something is passed a 2-element list that was made with list_make2() but not one made by list_make1() + list_append() or vice versa.

This has the side benefit of allowing an arbitrarily large initial array (well, as large as the length field for the array size allows) if we wanted to have something like list_copy_static() which made a list that was expected not to be modified a lot subsequently and might as well be stored in a single large array.

But all this seems odd to me. The only reason for any of this is for api convenience so we can pass around lists instead of passing arrays. If the next links are really a big source of overhead we should just fix our apis to take arrays of the right length or arrays with a separate length argument.

Or if it's just palloc we should fix our memory allocator to behave the way the callers need it to. Heikki long ago suggested adding a stack allocator for the parser to use for its memory context to reduce overhead of small allocations which won't be freed until the context is freed for example.

--
greg
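Greg's hybrid layout — an inline array holding the first N elements made by list_makeN(), with any further elements living in cons cells — might look roughly like the sketch below. All names here are hypothetical, invented for illustration; this is not pg_list.c:

```c
#include <stdlib.h>

#define LIST_INLINE_MAX 3       /* elements stored in the inline array */

typedef struct OverflowCell
{
    int                  value;
    struct OverflowCell *next;
} OverflowCell;

typedef struct
{
    int           ninline;                     /* used inline slots      */
    int           inline_vals[LIST_INLINE_MAX];
    OverflowCell *head;                        /* overflow cons cells    */
    OverflowCell *tail;
    int           length;                      /* total element count    */
} HybridList;

static void
hlist_append(HybridList *list, int value)
{
    /* appends always go to cons cells, so inline slots stay stable */
    OverflowCell *cell = malloc(sizeof(OverflowCell));

    cell->value = value;
    cell->next = NULL;
    if (list->tail)
        list->tail->next = cell;
    else
        list->head = cell;
    list->tail = cell;
    list->length++;
}

/* analogue of list_makeN(): first few values go into the inline array */
static HybridList *
hlist_make(const int *vals, int n)
{
    HybridList *list = calloc(1, sizeof(HybridList));
    int         i;

    for (i = 0; i < n; i++)
    {
        if (i < LIST_INLINE_MAX)
        {
            list->inline_vals[list->ninline++] = vals[i];
            list->length++;
        }
        else
            hlist_append(list, vals[i]);
    }
    return list;
}

static int
hlist_nth(const HybridList *list, int n)
{
    OverflowCell *cell;

    if (n < list->ninline)
        return list->inline_vals[n];
    cell = list->head;
    for (n -= list->ninline; n > 0; n--)
        cell = cell->next;
    return cell->value;
}
```

The hazard Greg points out falls straight out of this layout: code that only ever tests 2-element lists built by `hlist_make` would never exercise the overflow path that a `list_make1()` + append construction takes.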
Re: [HACKERS] patch for new feature: Buffer Cache Hibernation
On 05/07/2011 03:32 AM, Mitsuru IWASAKI wrote:
> For 1, I've just finish my work. The latest patch is available at:
> http://people.freebsd.org/~iwasaki/postgres/buffer-cache-hibernation-postgresql-20110507.patch

Reminder here--we can't accept code based on it being published to a web page. You'll need to e-mail it to the pgsql-hackers mailing list to be considered for the next PostgreSQL CommitFest, which is starting in a few weeks. Code submitted to the mailing list is considered a release of it to the project under the PostgreSQL license, which we can't just assume for things when given only a URL to them.

Also, you suggested you were out of time to work on this. If that's the case, we'd like to know that so we don't keep cc'ing you about things in expectation of an answer. Someone else may pick this up as a project to continue working on. But it's going to need a fair amount of revision before it matches what people want here, and I'm not sure how much of what you've written is going to end up in any commit that may happen from this idea.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.us
Re: [HACKERS] LOCK DATABASE
Hi all,

On Fri, May 27, 2011 at 2:13 AM, Robert Haas wrote:
> On Thu, May 26, 2011 at 12:28 PM, Ross J. Reedstrom wrote:
> > Perhaps the approach to restricting connections should not be a database
> > object lock, but rather an admin function that does the equivalent of
> > flipping datallowconn in pg_database?
>
> To me, that seems like a better approach, although it's a little hard
> to see how we'd address Alvaro's desire to have it roll back
> automatically when the session disconnected. The disconnect might be
> caused by a FATAL error, for example.
>
> I'm actually all in favor of doing more things via SQL rather than
> configuration files. The idea of some ALTER SYSTEM command seems very
> compelling to me. I just don't really like this particular
> implementation, which to me seems far too bound up in implementation
> details I'd rather not rely on.

Me too; it looks like I'm a little bit late on this topic, even though I have some interest in it.

Personally, I think such a lock system playing with the file system is perhaps not the best way of doing it, as argued until now. It would make the DBA able to perform superuser-like actions by modifying system files like pg_hba.conf. The SQL approach looks better.

At this point, perhaps you may be interested in such an approach: http://wiki.postgresql.org/wiki/Lock_database
I wrote that after the cluster summit.

Regards,
--
Michael Paquier
http://michael.otacoo.com
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
"Kevin Grittner" writes:
> When we prune or vacuum a page, I don't suppose we have enough
> information about that page's previous state to calculate a tuple
> count delta, do we? That would allow a far more accurate number to
> be maintained than anything suggested so far, as long as we tweak
> autovacuum to count inserts toward the need to vacuum.

Well, that was the other direction that was suggested upthread: stop relying on reltuples at all, but use the stats collector's counts. That might be a good solution in the long run, but there are some issues:

1. It's not clear how using a current count, as opposed to time-of-last-vacuum count, would affect the behavior of the autovacuum control logic. At first glance I think it would break it, since the basic logic there is "how much of the table changed since it was last vacuumed?". Even if the equations could be modified to still work, I remember enough feedback control theory from undergrad EE to think that this is something to be seriously scared of tweaking without extensive testing. IMO it is far more risky than what Robert is worried about.

2. You still have the problem that we're exposing inaccurate (or at least less accurate than they could be) counts to the planner and to onlooker clients. We could change the planner to also depend on the stats collector instead of reltuples, but at that point you just removed the option for people to turn off the stats collector. The implications for plan stability might be unpleasant, too.

So that's not a direction I want to go without a significant amount of work and testing.

			regards, tom lane
Re: [HACKERS] "errno" not set in case of "libm" functions (HPUX)
Peter Eisentraut writes:
> On tor, 2011-05-26 at 12:14 -0400, Tom Lane wrote:
>> I tried this on my HP-UX 10.20 box, and it didn't work very nicely:
>> configure decided that the compiler accepted +Olibmerrno, so I got a
>> compile full of
>>     cc: warning 450: Unrecognized option +Olibmerrno.
>> warnings. The reason is that PGAC_PROG_CC_CFLAGS_OPT does not pay any
>> attention to whether the proposed flag generates a warning. That seems
>> like a bug --- is there any situation where we'd want to accept a flag
>> that does generate a warning? I'm thinking that macro should set
>> ac_c_werror_flag=yes, the same way PGAC_C_INLINE does.

> I think so.

OK, committed with that addition.

> We could also do that globally, but that would probably be something for
> the next release.

Hmm. I'm a bit scared of how much might break. I don't think the autoconf tests are generally designed to guarantee no warnings.

			regards, tom lane
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas wrote:
> Kevin Grittner wrote:
>> By storing the ratio and one count you make changes to the
>> other count implied and less visible. It seems more
>> understandable and less prone to error (to me, anyway) to keep
>> the two "raw" numbers and calculate the ratio -- and when you
>> observe a change in one raw number which you believe should force
>> an adjustment to the other raw number before its next actual
>> value is observed, to comment on why that's a good idea, and do
>> the trivial arithmetic at that time.
>
> Except that's not how it works. At least in the case of ANALYZE,
> we *aren't* counting all the tuples in the table. We're selecting
> a random sample of pages and inferring a tuple density, which we
> then extrapolate to the whole table and store. Then when we pull
> it back out of the table, we convert it back to a tuple density.
> The real value we are computing and using almost everywhere is
> tuple density; storing a total number of tuples in the table
> appears to be just confusing the issue.

Well, if tuple density is the number which is most heavily used, it might shave a few nanoseconds doing the arithmetic in enough places to justify the change, but I'm skeptical. Basically I'm with Tom on the fact that this change would store neither more nor less information (and for that matter would not really change what information you can easily retrieve); and slightly changing the manner in which it is stored doesn't solve any of the problems you assert that it does.

When we prune or vacuum a page, I don't suppose we have enough information about that page's previous state to calculate a tuple count delta, do we? That would allow a far more accurate number to be maintained than anything suggested so far, as long as we tweak autovacuum to count inserts toward the need to vacuum. (It seems to me I saw a post giving some reason that would have benefits anyway.)

Except for the full pass during transaction wrap-around protection, where it could just set a new actual count, autovacuum would be skipping pages where the bit is set to indicate that all tuples are visible, right?

-Kevin
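Kevin's delta idea can be stated in a few lines: rather than re-deriving reltuples from a sample, every page prune/vacuum reports the page's before/after tuple counts and every insert bumps a running total. A toy sketch of that bookkeeping — not actual PostgreSQL code, names invented for illustration:

```c
/*
 * Toy delta accounting for a relation's tuple count: the stored count
 * absorbs per-page deltas instead of being resampled wholesale.
 */
typedef struct
{
    double reltuples;   /* running estimate of live tuples */
} RelTupleCount;

/* called when a page is pruned or vacuumed */
static void
count_page_change(RelTupleCount *counter,
                  int tuples_before, int tuples_after)
{
    counter->reltuples += (double) (tuples_after - tuples_before);
}

/* inserts must also be counted, per Kevin's caveat about autovacuum */
static void
count_insert(RelTupleCount *counter, int ntuples)
{
    counter->reltuples += ntuples;
}
```

A periodic full pass (e.g. the anti-wraparound vacuum Kevin mentions) would replace the running value with a fresh actual count, keeping accumulated drift bounded.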
Re: [HACKERS] "errno" not set in case of "libm" functions (HPUX)
On tor, 2011-05-26 at 12:14 -0400, Tom Lane wrote:
> Ibrar Ahmed writes:
> > Please find the updated patch. I have added this "+Olibmerrno" compile flag
> > check in configure/configure.in file.
>
> I tried this on my HP-UX 10.20 box, and it didn't work very nicely:
> configure decided that the compiler accepted +Olibmerrno, so I got a
> compile full of
>     cc: warning 450: Unrecognized option +Olibmerrno.
> warnings. The reason is that PGAC_PROG_CC_CFLAGS_OPT does not pay any
> attention to whether the proposed flag generates a warning. That seems
> like a bug --- is there any situation where we'd want to accept a flag
> that does generate a warning? I'm thinking that macro should set
> ac_c_werror_flag=yes, the same way PGAC_C_INLINE does.

I think so.

We could also do that globally, but that would probably be something for the next release.
Re: [HACKERS] inconvenient compression options in pg_basebackup
Peter Eisentraut writes:
> On tor, 2011-05-26 at 16:54 -0400, Tom Lane wrote:
>> But if you want to take such an extension into account right now,
>> maybe we ought to design that feature now. What are you seeing it as
>> looking like?
>>
>> My thought is that "-z" should just mean "give me compression; a good
>> default compression setting is fine". "-Zn" could mean "I want gzip
>> with exactly this compression level" (thus making the presence or
>> absence of -z moot). If you want to specify some other compression
>> method altogether, use something like --lzma=N. It seems unlikely to
>> me that somebody who wants to override the default compression method
>> wouldn't want to pick the settings for it too.

> I think of pg_basebackup as analogous to tar. tar has a bunch of
> options to set a compression method (-Z, -z, -j, -J), but no support for
> setting compression specific options. So in that sense that contradicts
> your suspicion.

I would think we'd be more concerned about preserving an analogy to pg_dump, which most certainly does expose compression-quality options.

			regards, tom lane
Re: [HACKERS] pg_basebackup compressed tar to stdout
On tor, 2011-05-26 at 17:06 -0400, Tom Lane wrote:
> Peter Eisentraut writes:
> > pg_basebackup currently doesn't allow compressed tar to stdout. That
> > should be added to make the interface consistent, and specifically to
> > allow common idioms like
> >     pg_basebackup -Ft -z -D - | ssh tar -x -z -f -
> > Small patch attached.
>
> I have not bothered to read this in context, but the visible part of the
> patch makes it look like you broke the not-HAVE_LIBZ case ... other than
> that gripe, no objection.

Ah yes, that needs some fine-tuning.
Re: [HACKERS] pg_basebackup compressed tar to stdout
Peter Eisentraut writes:
> pg_basebackup currently doesn't allow compressed tar to stdout. That
> should be added to make the interface consistent, and specifically to
> allow common idioms like
>     pg_basebackup -Ft -z -D - | ssh tar -x -z -f -
> Small patch attached.

I have not bothered to read this in context, but the visible part of the patch makes it look like you broke the not-HAVE_LIBZ case ... other than that gripe, no objection.

			regards, tom lane
Re: [HACKERS] inconvenient compression options in pg_basebackup
On tor, 2011-05-26 at 16:54 -0400, Tom Lane wrote:
> But if you want to take such an extension into account right now,
> maybe we ought to design that feature now. What are you seeing it as
> looking like?
>
> My thought is that "-z" should just mean "give me compression; a good
> default compression setting is fine". "-Zn" could mean "I want gzip
> with exactly this compression level" (thus making the presence or
> absence of -z moot). If you want to specify some other compression
> method altogether, use something like --lzma=N. It seems unlikely to
> me that somebody who wants to override the default compression method
> wouldn't want to pick the settings for it too.

I think of pg_basebackup as analogous to tar. tar has a bunch of options to set a compression method (-Z, -z, -j, -J), but no support for setting compression-specific options. So in that sense that contradicts your suspicion.
[HACKERS] pg_basebackup compressed tar to stdout
pg_basebackup currently doesn't allow compressed tar to stdout. That should be added to make the interface consistent, and specifically to allow common idioms like

    pg_basebackup -Ft -z -D - | ssh tar -x -z -f -

Small patch attached.

diff --git i/doc/src/sgml/ref/pg_basebackup.sgml w/doc/src/sgml/ref/pg_basebackup.sgml
index 8a7b833..32fa9f8 100644
--- i/doc/src/sgml/ref/pg_basebackup.sgml
+++ w/doc/src/sgml/ref/pg_basebackup.sgml
@@ -174,8 +174,7 @@ PostgreSQL documentation
         Enables gzip compression of tar file output. Compression is only
-        available when generating tar files, and is not available when sending
-        output to standard output.
+        available when using the tar format.
diff --git i/src/bin/pg_basebackup/pg_basebackup.c w/src/bin/pg_basebackup/pg_basebackup.c
index 1f31fe0..713c3af 100644
--- i/src/bin/pg_basebackup/pg_basebackup.c
+++ w/src/bin/pg_basebackup/pg_basebackup.c
@@ -261,7 +261,20 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 	 * Base tablespaces
 	 */
 	if (strcmp(basedir, "-") == 0)
-		tarfile = stdout;
+	{
+		if (compresslevel > 0)
+		{
+			ztarfile = gzdopen(dup(fileno(stdout)), "wb");
+			if (gzsetparams(ztarfile, compresslevel, Z_DEFAULT_STRATEGY) != Z_OK)
+			{
+				fprintf(stderr, _("%s: could not set compression level %i: %s\n"),
+						progname, compresslevel, get_gz_error(ztarfile));
+				disconnect_and_exit(1);
+			}
+		}
+		else
+			tarfile = stdout;
+	}
 	else
 	{
 #ifdef HAVE_LIBZ
@@ -384,7 +397,12 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		}
 	}
 
-	if (strcmp(basedir, "-") != 0)
+	if (strcmp(basedir, "-") == 0)
+	{
+		if (ztarfile)
+			gzclose(ztarfile);
+	}
+	else
 	{
 #ifdef HAVE_LIBZ
 		if (ztarfile != NULL)
@@ -1076,14 +1094,6 @@ main(int argc, char **argv)
 				progname);
 			exit(1);
 		}
-#else
-	if (compresslevel > 0 && strcmp(basedir, "-") == 0)
-	{
-		fprintf(stderr,
-				_("%s: compression is not supported on standard output\n"),
-				progname);
-		exit(1);
-	}
 #endif
 	/*
Re: [HACKERS] inconvenient compression options in pg_basebackup
Peter Eisentraut writes:
> On tis, 2011-05-24 at 15:34 -0400, Tom Lane wrote:
>> I would argue that -Z ought to turn on "gzip" without my having to write
>> -z as well (at least when the argument is greater than zero; possibly
>> -Z0 should be allowed as meaning "no compression").

> My concern with that is that if we ever add another compression method,
> would we then add another option to control the compression level of
> that method?

Um ... what's your point? Forcing the user to type two switches instead of one isn't going to make that hypothetical future extension any easier, AFAICS.

But if you want to take such an extension into account right now, maybe we ought to design that feature now. What are you seeing it as looking like? My thought is that "-z" should just mean "give me compression; a good default compression setting is fine". "-Zn" could mean "I want gzip with exactly this compression level" (thus making the presence or absence of -z moot). If you want to specify some other compression method altogether, use something like --lzma=N. It seems unlikely to me that somebody who wants to override the default compression method wouldn't want to pick the settings for it too.

regards, tom lane
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas writes: > On Thu, May 26, 2011 at 12:23 PM, Tom Lane wrote: >>> Another thought: Couldn't relation_needs_vacanalyze() just scale up >>> reltuples by the ratio of the current number of pages in the relation >>> to relpages, just as the query planner does? >> Hmm ... that would fix Florian's immediate issue, and it does seem like >> a good change on its own merits. But it does nothing for the problem >> that we're failing to put the best available information into pg_class. >> >> Possibly we could compromise on doing just that much in the back >> branches, and the larger change for 9.1? > Do you think we need to worry about the extra overhead of determining > the current size of every relation as we sweep through pg_class? It's > not a lot, but OTOH I think we'd be doing it once a minute... not sure > what would happen if there were tons of tables. Ugh ... that is a mighty good point, since the RelationGetNumberOfBlocks call would have to happen for each table, even the ones we then decide not to vacuum. We've already seen people complain about the cost of the AV launcher once they have a lot of databases, and this would probably increase it quite a bit. > Going back to your thought upthread, I think we should really consider > replacing reltuples with reltupledensity at some point. I continue to > be afraid that using a decaying average in this case is going to end > up overweighting the values from some portion of the table that's > getting scanned repeatedly, at the expense of other portions of the > table that are not getting scanned at all. Changing the representation of the information would change that issue not in the slightest. The fundamental point here is that we have new, possibly partial, information which we ought to somehow merge with the old, also possibly partial, information. Storing the data a little bit differently doesn't magically eliminate that issue. 
regards, tom lane
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas writes:
> Except that's not how it works. At least in the case of ANALYZE, we
> *aren't* counting all the tuples in the table. We're selecting a
> random sample of pages and inferring a tuple density, which we then
> extrapolate to the whole table and store. Then when we pull it back
> out of the table, we convert it back to a tuple density. The real
> value we are computing and using almost everywhere is tuple density;
> storing a total number of tuples in the table appears to be just
> confusing the issue.

If we were starting in a green field we might choose to store tuple density. However, the argument for changing it now is at best mighty thin; IMO it is not worth the risk of breaking client code.

regards, tom lane
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Thu, May 26, 2011 at 2:05 PM, Kevin Grittner wrote: >> I'm a bit confused by this - what the current design obfuscates is >> the fact that reltuples and relpages are not really independent >> columns; you can't update one without updating the other, unless >> you want screwy behavior. Replacing reltuples by reltupledensity >> would fix that problem - it would be logical and non-damaging to >> update either column independently. > > They don't always move in tandem. Certainly there can be available > space in those pages from which tuples can be allocated or which > increases as tuples are vacuumed. Your proposed change would > neither make more or less information available, because we've got > two numbers which can be observed as raw counts, and a ratio between > them. So far I agree. > By storing the ratio and one count you make changes to the > other count implied and less visible. It seems more understandable > and less prone to error (to me, anyway) to keep the two "raw" > numbers and calculate the ratio -- and when you observe a change in > one raw number which you believe should force an adjustment to the > other raw number before its next actual value is observed, to > comment on why that's a good idea, and do the trivial arithmetic at > that time. Except that's not how it works. At least in the case of ANALYZE, we *aren't* counting all the tuples in the table. We're selecting a random sample of pages and inferring a tuple density, which we then extrapolate to the whole table and store. Then when we pull it back out of the table, we convert it back to a tuple density. The real value we are computing and using almost everywhere is tuple density; storing a total number of tuples in the table appears to be just confusing the issue. Unless, of course, I am misunderstanding, which is possible. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] inconvenient compression options in pg_basebackup
On tis, 2011-05-24 at 15:34 -0400, Tom Lane wrote:
> I would argue that -Z ought to turn on "gzip" without my having to
> write
> -z as well (at least when the argument is greater than zero; possibly
> -Z0 should be allowed as meaning "no compression").

My concern with that is that if we ever add another compression method, would we then add another option to control the compression level of that method?
Re: [HACKERS] SSI predicate locking on heap -- tuple or row?
Heikki Linnakangas wrote: > Could you explain in the README, why it is safe to only take the > lock on the visible row version, please? Sure. I actually intended to do this last night but ran out of steam and posted what I had, planning on following up with that. The place it seemed to fit best was in the "Innovations" section, since the SSI papers and their prototype implementations seemed oriented toward "rows" -- certainly the SIREAD locks were at the row level, versus a row version level. Since this doesn't touch any of the files in yesterday's patch, and it seems entirely within the realm of possibility that people will want to argue about how best to document this more than the actual fix, I'm posting it as a separate patch -- README-SSI only. I mostly just copied from Dan's posted proof verbatim. -Kevin *** a/src/backend/storage/lmgr/README-SSI --- b/src/backend/storage/lmgr/README-SSI *** *** 402,407 is based on the top level xid. When looking at an xid that comes --- 402,455 from a tuple's xmin or xmax, for example, we always call SubTransGetTopmostTransaction() before doing much else with it. + * PostgreSQL does not use "update in place" with a rollback log + for its MVCC implementation. Where possible it uses "HOT" updates on + the same page (if there is room and no indexed value is changed). + For non-HOT updates the old tuple is expired in place and a new tuple + is inserted at a new location. Because of this difference, a tuple + lock in PostgreSQL doesn't automatically lock any other versions of a + row. We don't try to copy or expand a tuple lock to any other + versions of the row, based on the following proof that any additional + serialization failures we would get from that would be false + positives: + + o If transaction T1 reads a row (thus acquiring a predicate + lock on it) and a second transaction T2 updates that row, must a + third transaction T3 which updates the new version of the row have a + rw-conflict in from T1 to prevent anomalies? 
In other words, does it + matter whether this edge T1 -> T3 is there? + + o If T1 has a conflict in, it certainly doesn't. Adding the + edge T1 -> T3 would create a dangerous structure, but we already had + one from the edge T1 -> T2, so we would have aborted something + anyway. + + o Now let's consider the case where T1 doesn't have a + conflict in. If that's the case, for this edge T1 -> T3 to make a + difference, T3 must have a rw-conflict out that induces a cycle in + the dependency graph, i.e. a conflict out to some transaction + preceding T1 in the serial order. (A conflict out to T1 would work + too, but that would mean T1 has a conflict in and we would have + rolled back.) + + o So now we're trying to figure out if there can be an + rw-conflict edge T3 -> T0, where T0 is some transaction that precedes + T1. For T0 to precede T1, there has to be has to be some edge, or + sequence of edges, from T0 to T1. At least the last edge has to be a + wr-dependency or ww-dependency rather than a rw-conflict, because T1 + doesn't have a rw-conflict in. And that gives us enough information + about the order of transactions to see that T3 can't have a + rw-dependency to T0: + - T0 committed before T1 started (the wr/ww-dependency implies this) + - T1 started before T2 committed (the T1->T2 rw-conflict implies this) + - T2 committed before T3 started (otherwise, T3 would be aborted +because of an update conflict) + + o That means T0 committed before T3 started, and therefore + there can't be a rw-conflict from T3 to T0. + + o In both cases, we didn't need the T1 -> T3 edge. + * Predicate locking in PostgreSQL will start at the tuple level when possible, with automatic conversion of multiple fine-grained locks to coarser granularity as need to avoid resource exhaustion. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pre-alloc ListCell's optimization
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I'm worried that this type of approach would
> bloat the storage required in those cases to a degree that would make
> the patch unattractive.

While I agree that there is some bloat that'll happen with this approach, we could reduce it by just having a 4-entry cache instead of an 8-entry cache. I'm not really sure that saving those 64 bytes per list is really worth it though. The cost of allocating the memory doesn't seem like it changes a lot between those, and I don't think it's terribly common for us to copy lists around (copyList doesn't memcpy() them).

> ISTM the first thing we'd need to have before
> we could think about this rationally is some measurements about the
> frequencies of different List lengths in a typical workload.

I agree, that'd be a good thing to have. I'll look into measuring that.

> When Neil redid the List infrastructure a few years ago, there was some
> discussion of special-casing the very first ListCell, and allocating
> just that cell along with the List header.

Well, we do allocate the first cell when we create a list in new_list(), but it's a separate palloc() call. One of the annoying things that I ran into with this patch is trying to keep track of whether something could be free'd with pfree() or not. Can't allow pfree() of something inside the array, etc. Handling the 1-entry case would likely be pretty straightforward, but you need book-keeping as soon as you go to two, and all that book-keeping feels like overkill for just a 2-entry cache to me.

I'll try to collect some info on list lengths and whatnot though and get a feel for just how much this is likely to help. Of course, if someone else has time to help with that, I wouldn't complain. :)

Thanks, Stephen
[HACKERS] #PgWest 2011: CFP now open
Hello,

The CFP for #PgWest is now open. We are holding it at the San Jose Convention Center from September 27th - 30th. We look forward to seeing your submissions.

http://www.postgresqlconference.org/

Joshua D. Drake

-- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development The PostgreSQL Conference - http://www.postgresqlconference.org/ @cmdpromptinc - @postgresconf - 509-416-6579
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas wrote: > Kevin Grittner wrote: >> Given how trivial it would be to adjust reltuples to keep its >> ratio to relpages about the same when we don't have a new "hard" >> number, but some evidence that we should fudge our previous >> value, I don't see where this change buys us much. It seems to >> mostly obfuscate the fact that we're changing our assumption >> about how many tuples we have. I would rather that we did that >> explicitly with code comments about why it's justified than to >> slip it in the way you suggest. > > I'm a bit confused by this - what the current design obfuscates is > the fact that reltuples and relpages are not really independent > columns; you can't update one without updating the other, unless > you want screwy behavior. Replacing reltuples by reltupledensity > would fix that problem - it would be logical and non-damaging to > update either column independently. They don't always move in tandem. Certainly there can be available space in those pages from which tuples can be allocated or which increases as tuples are vacuumed. Your proposed change would neither make more or less information available, because we've got two numbers which can be observed as raw counts, and a ratio between them. By storing the ratio and one count you make changes to the other count implied and less visible. It seems more understandable and less prone to error (to me, anyway) to keep the two "raw" numbers and calculate the ratio -- and when you observe a change in one raw number which you believe should force an adjustment to the other raw number before its next actual value is observed, to comment on why that's a good idea, and do the trivial arithmetic at that time. As a thought exercise, what happens each way if a table is loaded with a low fillfactor and then a lot of inserts are done? What happens if mass deletes are done from a table which has a high density? 
-Kevin
Re: [HACKERS] Pre-alloc ListCell's optimization
On Tue, May 24, 2011 at 10:56 PM, Stephen Frost wrote:
> Someone (*cough*Haas*cough) made a claim over beers at PGCon that it
> would be very difficult to come up with a way to pre-allocate List
> entries and maintain the current List API. I'll admit that it wasn't
> quite as trivial as I had *hoped*, but attached is a proof-of-concept
> patch which does it.
>
> [ various points ]

So I guess the first question here is - does it improve performance? Because if it does, then it's worth pursuing ... if not, that's the first thing to fix.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Thu, May 26, 2011 at 1:28 PM, Kevin Grittner wrote: > Robert Haas wrote: >> I think we should really consider replacing reltuples with >> reltupledensity at some point. I continue to be afraid that using >> a decaying average in this case is going to end up overweighting >> the values from some portion of the table that's getting scanned >> repeatedly, at the expense of other portions of the table that are >> not getting scanned at all. Now, perhaps experience will prove >> that's not a problem. But storing relpages and reltupledensity >> separately would give us more flexibility, because we could feel >> free to bump relpages even when we're not sure what to do about >> reltupledensity. That would help Florian's problem quite a lot, >> even if we did nothing else. > > Given how trivial it would be to adjust reltuples to keep its ratio > to relpages about the same when we don't have a new "hard" number, > but some evidence that we should fudge our previous value, I don't > see where this change buys us much. It seems to mostly obfuscate > the fact that we're changing our assumption about how many tuples we > have. I would rather that we did that explicitly with code comments > about why it's justified than to slip it in the way you suggest. I'm a bit confused by this - what the current design obfuscates is the fact that reltuples and relpages are not really independent columns; you can't update one without updating the other, unless you want screwy behavior. Replacing reltuples by reltupledensity would fix that problem - it would be logical and non-damaging to update either column independently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] timezone GUC
Excerpts from Robert Haas's message of dom may 22 23:09:47 -0400 2011:
> On Sun, May 22, 2011 at 10:24 PM, Tom Lane wrote:
> > Robert Haas writes:
> >> On Sun, May 22, 2011 at 9:54 PM, Tom Lane wrote:
> >>> But also, 99.999% of the time
> >>> it would be completely wasted effort because the DBA wouldn't remove the
> >>> postgresql.conf setting at all, ever.
> >
> >> Well, by that argument, we ought not to worry about masterminding what
> >> happens if the DBA does do such a thing -- just run the whole process
> >> and damn the torpedoes. If it causes a brief database stall, at least
> >> they'll get the correct behavior.
> >
> > Yeah, maybe. But I don't especially want to document "If you remove a
> > pre-existing setting of TimeZone from postgresql.conf, expect your
> > database to lock up hard for multiple seconds" ... and I think we
> > couldn't responsibly avoid mentioning it. At the moment that disclaimer
> > reads more like "If you remove a pre-existing setting of TimeZone from
> > postgresql.conf, the database will fall back to a default that might not
> > be what you were expecting". Is A really better than B?
>
> Well, I'm not entirely sure, but I lean toward yes. Anyone else have
> an opinion?

Yes, I think the lock-up is better than weird behavior. Maybe we should add a short note in a postgresql.conf comment to this effect, so that it doesn't surprise anyone who deletes or comments out the line.

-- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas wrote: > I think we should really consider replacing reltuples with > reltupledensity at some point. I continue to be afraid that using > a decaying average in this case is going to end up overweighting > the values from some portion of the table that's getting scanned > repeatedly, at the expense of other portions of the table that are > not getting scanned at all. Now, perhaps experience will prove > that's not a problem. But storing relpages and reltupledensity > separately would give us more flexibility, because we could feel > free to bump relpages even when we're not sure what to do about > reltupledensity. That would help Florian's problem quite a lot, > even if we did nothing else. Given how trivial it would be to adjust reltuples to keep its ratio to relpages about the same when we don't have a new "hard" number, but some evidence that we should fudge our previous value, I don't see where this change buys us much. It seems to mostly obfuscate the fact that we're changing our assumption about how many tuples we have. I would rather that we did that explicitly with code comments about why it's justified than to slip it in the way you suggest. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pre-alloc ListCell's optimization
Stephen Frost writes: > Basically, I added a ListCell array into the List structure and then > added a bitmap to keep track of which positions in the array were > filled. Hm. I've gotten the impression from previous testing that there are an awful lot of extremely short lists (1 or 2 elements) running around in a typical query. (One source for those is the argument lists for operators and functions.) I'm worried that this type of approach would bloat the storage required in those cases to a degree that would make the patch unattractive. ISTM the first thing we'd need to have before we could think about this rationally is some measurements about the frequencies of different List lengths in a typical workload. When Neil redid the List infrastructure a few years ago, there was some discussion of special-casing the very first ListCell, and allocating just that cell along with the List header. That'd be sort of the minimal version of what you've done here, and would be guaranteed to never eat any wasted space (since a list that has a header isn't empty). We should probably compare the behavior of that minimalistic version to versions with different sizes of preallocated arrays. > An alternative approach that I was already considering would be to > just allocate ListCell's in bulk (kind of a poor-man's slab allocator, I > believe). To do that we'd have to make the bitmap be a variable length > array of bitmaps and then have a list of pointers to the ListCell block > allocations. Seems like that's probably overkill for this, however. That would be pointing in the direction of trying to save space for very long Lists, which is a use-case that I'm not sure occurs often enough for us to be worth spending effort on, and in any case is a distinct issue from that of saving palloc time for very short Lists. Again, some statistics about actual list lengths would be really nice to have ... 
regards, tom lane
Re: [HACKERS] LOCK DATABASE
On Thu, May 26, 2011 at 12:28 PM, Ross J. Reedstrom wrote:
> Perhaps the approach to restricting connections should not be a database
> object lock, but rather an admin function that does the equivalent of
> flipping datallowconn in pg_database?

To me, that seems like a better approach, although it's a little hard to see how we'd address Alvaro's desire to have it roll back automatically when the session disconnected. The disconnect might be caused by a FATAL error, for example.

I'm actually all in favor of doing more things via SQL rather than configuration files. The idea of some ALTER SYSTEM command seems very compelling to me. I just don't really like this particular implementation, which to me seems far too bound up in implementation details I'd rather not rely on.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Thu, May 26, 2011 at 12:23 PM, Tom Lane wrote: >> Another thought: Couldn't relation_needs_vacanalyze() just scale up >> reltuples by the ratio of the current number of pages in the relation >> to relpages, just as the query planner does? > > Hmm ... that would fix Florian's immediate issue, and it does seem like > a good change on its own merits. But it does nothing for the problem > that we're failing to put the best available information into pg_class. > > Possibly we could compromise on doing just that much in the back > branches, and the larger change for 9.1? Do you think we need to worry about the extra overhead of determining the current size of every relation as we sweep through pg_class? It's not a lot, but OTOH I think we'd be doing it once a minute... not sure what would happen if there were tons of tables. Going back to your thought upthread, I think we should really consider replacing reltuples with reltupledensity at some point. I continue to be afraid that using a decaying average in this case is going to end up overweighting the values from some portion of the table that's getting scanned repeatedly, at the expense of other portions of the table that are not getting scanned at all. Now, perhaps experience will prove that's not a problem. But storing relpages and reltupledensity separately would give us more flexibility, because we could feel free to bump relpages even when we're not sure what to do about reltupledensity. That would help Florian's problem quite a lot, even if we did nothing else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pre-alloc ListCell's optimization
* Alvaro Herrera (alvhe...@commandprompt.com) wrote: > I think what this patch is mainly missing is a description of how the > new allocation is supposed to work, so that we can discuss the details > without having to reverse-engineer them from the code. Sure, sorry I didn't include something more descriptive previously. Basically, I added a ListCell array into the List structure and then added a bitmap to keep track of which positions in the array were filled. I added it as an array simply because makeNode() assumes the size of a List is static and doesn't call through new_list() or anything. When a new ListCell is needed, it'll check if there's an available spot in the array and use it if there is. If there's no more room left, it'll just fall back to doing a palloc() for the ListCell. On list_delete(), it'll free up the spot that was used by that cell. One caveat is that it won't try to clean up the used spots on a list_truncate (since you'd have to traverse the entire list to figure out if anything getting truncated off is using a spot in the array). Of course, if you list_truncate to zero, you'll just get NIL back and the next round through will generate a whole new/empty List structure for you. An alternative approach that I was already considering would be to just allocate ListCell's in bulk (kind of a poor-man's slab allocator, I believe). To do that we'd have to make the bitmap be a variable length array of bitmaps and then have a list of pointers to the ListCell block allocations. Seems like that's probably overkill for this, however. The idea for doing this was to address the case of small lists having to go through the palloc() process over and over. We'd be penalizing those again if we add a lot of complexity so that vary large lists don't have to palloc() as much. Thanks Stephen signature.asc Description: Digital signature
Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH
Excerpts from Tom Lane's message of mié may 25 16:07:55 -0400 2011:
> Alvaro Herrera writes:
> > Excerpts from Tom Lane's message of mar may 24 17:11:17 -0400 2011:
> >> Right. It would also increase the cognitive load on the user to have
> >> to remember the command-line go-to-line-number switch for his editor.
> >> So I don't particularly want to redesign this feature. However, I can
> >> see the possible value of letting EDITOR_LINENUMBER_SWITCH be set from
> >> the same place that you set EDITOR, which would suggest that we allow
> >> the value to come from an environment variable. I'm not sure whether
> >> there is merit in allowing both that source and ~/.psqlrc, though
> >> possibly for Windows users it might be easier if ~/.psqlrc worked.
>
> > If we're going to increase the number of options in .psqlrc that do not
> > work with older psql versions, can I please have .psqlrc-MAJORVERSION or
> > some such? Having 8.3's psql complain all the time because it doesn't
> > understand "linestyle" is annoying.
>
> 1. I thought we already did have that.

Oh, true, we have that, though it's not very usable because you have to rename the file from .psqlrc-9.0.3 to .psqlrc-9.0.4 when you upgrade, which is kinda silly.

> 2. In any case, EDITOR_LINENUMBER_SWITCH isn't a hazard for this,
> because older versions will just think it's a variable without any
> special meaning.

Good point.

> But the real question here is whether we want to change it to be also
> (or instead?) an environment variable.

I vote yes.

-- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Pre-alloc ListCell's optimization
Excerpts from Stephen Frost's message of mar may 24 22:56:21 -0400 2011:
> Greetings,
>
> Someone (*cough*Haas*cough) made a claim over beers at PGCon that it
> would be very difficult to come up with a way to pre-allocate List
> entries and maintain the current List API. I'll admit that it wasn't
> quite as trivial as I had *hoped*, but attached is a proof-of-concept
> patch which does it.

I think what this patch is mainly missing is a description of how the new allocation is supposed to work, so that we can discuss the details without having to reverse-engineer them from the code.

-- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] LOCK DATABASE
On Thu, May 19, 2011 at 04:13:12PM -0400, Alvaro Herrera wrote:
> Excerpts from Robert Haas's message of jue may 19 15:32:57 -0400 2011:
> >
> > That's a bit of a self-defeating argument though, since it implies
> > that the effect of taking an exclusive lock via LockSharedObject()
> > will not simply prevent new backends from connecting, but rather will
> > also block any backends already in the database that try to perform
> > one of those operations.
>
> Well, the database that holds the lock is going to be able to run them,
> which makes sense -- and you probably don't want others doing it, which
> also does. I mean other backends are still going to be able to run
> administrative tasks like slon and so on, just not modifying the
> database. If they want to change the comments they can do so after
> you're done with your lock.
>
> Tom has a point though and so does Chris. I'm gonna put this topic to
> sleep though, 'cause I sure don't want to be seen like I'm proposing a
> connection pooler in the backend.

I know I'm late to this party, but just wanted to chime in with support for the idea that access to a particular database is properly within the scope of a DBA, and it would be good for it not to require filesystem/sysadmin action. It seems to me to be proper server-side support for poolers, shared hosting setups, and other use cases, without going whole hog. Arguably it would require versions of pg_cancel_backend and pg_terminate_backend that operate for the database owner as well as the superuser. Perhaps the approach to restricting connections should not be a database object lock, but rather an admin function that does the equivalent of flipping datallowconn in pg_database?

Ross

--
Ross Reedstrom, Ph.D.                       reeds...@rice.edu
Systems Engineer & Admin, Research Scientist    phone: 713-348-6166
Connexions                http://cnx.org       fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Robert Haas writes: > I would feel a lot better about something that is deterministic, like, > I dunno, if VACUUM visits more than 25% of the table, we use its > estimate. And we always use ANALYZE's estimate. Or something. This argument seems to rather miss the point. The data we are working from is fundamentally not deterministic, and you can't make it so by deciding to ignore what data we do have. That leads to a less useful estimate, not a more useful estimate. > Another thought: Couldn't relation_needs_vacanalyze() just scale up > reltuples by the ratio of the current number of pages in the relation > to relpages, just as the query planner does? Hmm ... that would fix Florian's immediate issue, and it does seem like a good change on its own merits. But it does nothing for the problem that we're failing to put the best available information into pg_class. Possibly we could compromise on doing just that much in the back branches, and the larger change for 9.1? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] "errno" not set in case of "libm" functions (HPUX)
Ibrar Ahmed writes:
> Please find the updated patch. I have added this "+Olibmerrno" compile flag
> check in configure/configure.in file.

I tried this on my HP-UX 10.20 box, and it didn't work very nicely: configure decided that the compiler accepted +Olibmerrno, so I got a compile full of

    cc: warning 450: Unrecognized option +Olibmerrno.

warnings. The reason is that PGAC_PROG_CC_CFLAGS_OPT does not pay any attention to whether the proposed flag generates a warning. That seems like a bug --- is there any situation where we'd want to accept a flag that does generate a warning? I'm thinking that macro should set ac_c_werror_flag=yes, the same way PGAC_C_INLINE does.

regards, tom lane
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
On Thu, May 26, 2011 at 11:25 AM, Tom Lane wrote:
> I'm still of the opinion that an incremental estimation process like
> the above is a lot saner than what we're doing now, snarky Dilbert
> references notwithstanding. The only thing that seems worthy of debate
> from here is whether we should trust ANALYZE's estimates a bit more than
> VACUUM's estimates, on the grounds that the former are more likely to be
> from a random subset of pages. We could implement that by applying a
> fudge factor when folding a VACUUM estimate into the moving average (ie,
> multiply its reliability by something less than one). I don't have any
> principled suggestion for just what the fudge factor ought to be, except
> that I don't think "zero" is the best value, which AFAICT is what Robert
> is arguing. I think Greg's argument shows that "one" is the right value
> when dealing with an ANALYZE estimate, if you believe that ANALYZE saw a
> random set of pages ... but using that for VACUUM does seem
> overoptimistic.

The problem is that it's quite difficult to predict what the relative frequency of full-relation-vacuum, vacuum-with-skips, and ANALYZE operations on the table will be. It matters how fast the table is being inserted into vs. updated/deleted; and it also matters how fast the table is being updated compared with the system's rate of XID consumption. So in general it seems hard to say, well, we know this number might drift off course a little bit, but there will be a freezing vacuum or analyze or something coming along soon enough to fix the problem. There might be, but it's difficult to be sure. My argument isn't so much that using a non-zero value here is guaranteed to have bad effects, but that we really have no idea what will work out well in practice, and therefore it seems dangerous to whack the behavior around ... especially in stable branches. If we changed this in 9.1, and that's the last time we ever get a complaint about it, problem solved.
But I would feel bad if we changed this in the back-branches and then found that, while solving this particular problem, we had created others. It also seems likely that the replacement problems would be more subtle and more difficult to diagnose, because they'd depend in a very complicated way on the workload, and having, say, the latest table contents would not necessarily enable us to reproduce the problem. I would feel a lot better about something that is deterministic, like, I dunno, if VACUUM visits more than 25% of the table, we use its estimate. And we always use ANALYZE's estimate. Or something. Another thought: Couldn't relation_needs_vacanalyze() just scale up reltuples by the ratio of the current number of pages in the relation to relpages, just as the query planner does? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
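Robert's closing suggestion — having relation_needs_vacanalyze() extrapolate reltuples by the ratio of the table's current page count to relpages, the way the query planner already does — is just a density extrapolation. A minimal sketch of the arithmetic (Python; the function name is mine, not PostgreSQL's actual code):

```python
def scaled_reltuples(reltuples, relpages, current_pages):
    """Extrapolate a stored tuple-count estimate to the table's current
    size by assuming tuple density has stayed constant, as the planner
    does. A sketch of the arithmetic only, not the server's C code."""
    if relpages == 0:
        return reltuples  # no stored density to extrapolate from
    density = reltuples / relpages
    return density * current_pages

# A table last counted at 1000 tuples in 10 pages that has since
# grown to 15 pages is estimated to hold 1500 tuples.
print(scaled_reltuples(1000.0, 10, 15))  # prints 1500.0
```

The appeal for the autovacuum trigger is exactly that this is deterministic: it uses only the stored statistics plus the table's current physical size, with no feedback loop that could converge to a bad value.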
Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum
Greg Stark writes: > On Wed, May 25, 2011 at 9:41 AM, Tom Lane wrote: >> ... What I'm currently imagining is >> to do a smoothed moving average, where we factor in the new density >> estimate with a weight dependent on the percentage of the table we did >> scan. That is, the calculation goes something like >> >> old_density = old_reltuples / old_relpages >> new_density = counted_tuples / scanned_pages >> reliability = scanned_pages / new_relpages >> updated_density = old_density + (new_density - old_density) * reliability >> new_reltuples = updated_density * new_relpages > This amounts to assuming that the pages observed in the vacuum have > the density observed and the pages that weren't seen have the density > that were previously in the reltuples/relpages stats. That seems like > a pretty solid approach to me. If the numbers were sane before it > follows that they should be sane after the update. Hm, that's an interesting way of looking at it, but I was coming at it from a signal-processing point of view. What Robert is concerned about is that if VACUUM is cleaning a non-representative sample of pages, and repeated VACUUMs examine pretty much the same sample each time, then over repeated applications of the above formula the estimated density will eventually converge to what we are seeing in the sample. The speed of convergence depends on the moving-average multiplier, ie the "reliability" number above, and what I was after was just to slow down convergence for smaller samples. So I wouldn't have any problem with including a fudge factor to make the convergence even slower. But your analogy makes it seem like this particular formulation is actually "right" in some sense. One other point here is that Florian's problem is really only with our failing to update relpages. I don't think there is any part of the system that particularly cares about reltuples for a toast table. 
So even if the value did converge to some significantly-bad estimate over time, it's not really an issue AFAICS. We do care about having a sane reltuples estimate for regular tables, but for those we should have a mixture of updates from ANALYZE and updates from VACUUM. Also, for both regular and toast tables we will have an occasional vacuum-for-wraparound that is guaranteed to scan all pages and hence do a hard reset of reltuples to the correct value. I'm still of the opinion that an incremental estimation process like the above is a lot saner than what we're doing now, snarky Dilbert references notwithstanding. The only thing that seems worthy of debate from here is whether we should trust ANALYZE's estimates a bit more than VACUUM's estimates, on the grounds that the former are more likely to be from a random subset of pages. We could implement that by applying a fudge factor when folding a VACUUM estimate into the moving average (ie, multiply its reliability by something less than one). I don't have any principled suggestion for just what the fudge factor ought to be, except that I don't think "zero" is the best value, which AFAICT is what Robert is arguing. I think Greg's argument shows that "one" is the right value when dealing with an ANALYZE estimate, if you believe that ANALYZE saw a random set of pages ... but using that for VACUUM does seem overoptimistic. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
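Tom's update rule quoted above is easy to experiment with outside the server. A sketch in Python (the function name and the explicit fudge parameter are mine; fudge=1.0 reproduces the formula exactly as written, while fudge < 1.0 is the proposed down-weighting of VACUUM's non-random page samples):

```python
def updated_reltuples(old_reltuples, old_relpages,
                      counted_tuples, scanned_pages,
                      new_relpages, fudge=1.0):
    """Smoothed moving-average density update per the quoted formula.
    fudge < 1.0 discounts the reliability of a non-random sample
    (e.g. a VACUUM that revisits the same dirty pages every time)."""
    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = (scanned_pages / new_relpages) * fudge
    updated_density = old_density + (new_density - old_density) * reliability
    return updated_density * new_relpages
```

A full scan (scanned_pages == new_relpages, fudge 1.0) trusts the new count completely, while a small scan barely moves the estimate — which is the slow-convergence behavior described above, and the knob Tom proposes for VACUUM is simply making `reliability` smaller still.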
[HACKERS] patch for distinguishing PG instances in event log
Hello,

I wrote and attached a patch for the TODO item below (which I proposed).

Allow multiple Postgres clusters running on the same machine to distinguish themselves in the event log
http://archives.postgresql.org/pgsql-hackers/2011-03/msg01297.php
http://archives.postgresql.org/pgsql-hackers/2011-05/msg00574.php

I changed two things from the original proposal.

1. regsvr32.exe needs /n when you specify the event source. I described the reason in src/bin/pgevent/pgevent.c.
2. I moved the documentation on event log registration to a more suitable place. The traditional place and what I originally proposed were not ideal, because those who don't build from source won't read those places.

I successfully tested event log registration/unregistration, event logging with/without the event_source parameter, and SHOWing the event_source parameter with psql on Windows Vista (32-bit). I would appreciate it if someone with a 64-bit environment could test it on 64-bit Windows.

I'll add this patch to the first CommitFest of 9.2. Thank you in advance for reviewing it.

Regards
MauMau

multi_event_source.patch
Description: Binary data
Re: [HACKERS] Should partial dumps include extensions?
Peter Eisentraut writes: > On tis, 2011-05-24 at 23:26 -0400, Robert Haas wrote: >> On Tue, May 24, 2011 at 4:44 PM, Tom Lane wrote: >>> There's a complaint here >>> http://archives.postgresql.org/pgsql-general/2011-05/msg00714.php >>> about the fact that 9.1 pg_dump always dumps CREATE EXTENSION commands >>> for all loaded extensions. Should we change that? A reasonable >>> compromise might be to suppress extensions in the same cases where we >>> suppress procedural languages, ie if --schema or --table was used >>> (see "include_everything" switch in pg_dump.c). >> Making it work like procedural languages seems sensible to me. > The same problem still exists for foreign data wrappers, servers, and > user mappings. It should probably be changed in the same way. No objection here, but I'm not going to go do it ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Thu, May 26, 2011 at 8:57 AM, Pavan Deolasee wrote: > On Thu, May 26, 2011 at 4:10 PM, Pavan Deolasee > wrote: >> On Thu, May 26, 2011 at 9:40 AM, Robert Haas wrote: >> >>> Currently, I believe the only way a page can get marked all-visible is >>> by vacuum. But if we make this change, then it would be possible for >>> a HOT cleanup to encounter a situation where all-visible could be set. >>> We probably want to make that work. >>> >> >> Yes. Thats certainly an option. > > BTW, I just realized that this design would expect the visibility map > to be always correct or at least it should always correctly report a > page having dead line pointers. We would expect the index vacuum to > clean index pointers to *all* dead line pointers because once the > index vacuum is complete, other backends or next heap vacuum may > remove any of those old dead line pointers assuming that index vacuum > would have taken care of the index pointers. > > IOW, the visibility map bit must always be clear when there are dead > line pointers on the page. Do we guarantee that today ? I think we do, > but the comment in the source file is not affirmative. It can end up in the wrong state after a crash. I have a patch to try to fix that, but I need someone to review it. (*looks meaningfully at Heikki, coughs loudly*) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Thu, May 26, 2011 at 6:40 AM, Pavan Deolasee wrote: >>> There are some other issues that we should think about too. Like >>> recording free space and managing visibility map. The free space is >>> recorded in the second pass pass today, but I don't see any reason why >>> that can't be moved to the first pass. Its not clear though if we >>> should also record free space after retail page vacuum or leave it as >>> it is. >> >> Not sure. Any idea why it's like that, or why we might want to change it? > > I think it precedes the HOT days when the dead space was reclaimed > only during the second scan. Even post-HOT, if we know we would > revisit the page anyways during the second scan, it makes sense to > delay recording free space because the dead line pointers can add to > it (if they are towards the end of the line pointer array). I remember > discussing this briefly during HOT, but can't recollect why we decided > not to update the FSM after retail vacuum. But the entire focus then > was to keep things simple and that could be one reason. It's important to keep in mind that page-at-a-time vacuum is happening in the middle of a routine INSERT/UPDATE/DELETE operation, so we don't want to do anything too expensive there. Whether updating the FSM falls into that category or not, I am not sure. >> Currently, I believe the only way a page can get marked all-visible is >> by vacuum. But if we make this change, then it would be possible for >> a HOT cleanup to encounter a situation where all-visible could be set. >> We probably want to make that work. > > Yes. Thats certainly an option. > > We did not discuss where to store the information about the start-LSN > of the last successful index vacuum. I am thinking about a new > pg_class attribute, just because I can't think of anything better. Any > suggestion ? That seems fairly grotty, but I don't have a lot of brilliant ideas. 
One possibility that occurred to me was to stick it in the special space on the first page of the relation. But that would mean that every HOT cleanup would need to look at that page, which seems poor. Even if we cached it after the first access, it still seems kinda poor. But it would make the unlogged case easier to handle... and we have thought previously about including some metadata in the relation file itself to help with forensics (which table was this, anyway?). So I don't know. > Also for the first version, I wonder if we should let the unlogged and > temp tables to be handled by the usual two pass vacuum. Once we have > proven that one pass is better, we will extend that to other tables as > discussed on this thread. We can certainly do that for testing. Whether we want to commit it that way, I'm not sure. > Do we need a modified syntax for vacuum, like "VACUUM mytab SKIP > INDEX" or something similar ? That way, user can just vacuum the heap > if she wishes so and can also help us with testing. There's an extensible-options syntax you can use... VACUUM (index off) mytab. > Do we need more autovacuum tuning parameters to control when to vacuum > just the heap and when to vacuum the index as well ? Again, we can > discuss and decide this later, but just wanted to mention this here. Let's make tuning that a separate effort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Thu, May 26, 2011 at 4:10 PM, Pavan Deolasee wrote:
> On Thu, May 26, 2011 at 9:40 AM, Robert Haas wrote:
>
>> Currently, I believe the only way a page can get marked all-visible is
>> by vacuum. But if we make this change, then it would be possible for
>> a HOT cleanup to encounter a situation where all-visible could be set.
>> We probably want to make that work.
>
> Yes. That's certainly an option.

BTW, I just realized that this design would expect the visibility map to always be correct, or at least to always correctly report a page having dead line pointers. We would expect the index vacuum to clean index pointers to *all* dead line pointers, because once the index vacuum is complete, other backends or the next heap vacuum may remove any of those old dead line pointers, assuming that the index vacuum would have taken care of the index pointers.

IOW, the visibility map bit must always be clear when there are dead line pointers on the page. Do we guarantee that today? I think we do, but the comment in the source file is not affirmative.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Should partial dumps include extensions?
On tis, 2011-05-24 at 23:26 -0400, Robert Haas wrote: > On Tue, May 24, 2011 at 4:44 PM, Tom Lane wrote: > > There's a complaint here > > http://archives.postgresql.org/pgsql-general/2011-05/msg00714.php > > about the fact that 9.1 pg_dump always dumps CREATE EXTENSION commands > > for all loaded extensions. Should we change that? A reasonable > > compromise might be to suppress extensions in the same cases where we > > suppress procedural languages, ie if --schema or --table was used > > (see "include_everything" switch in pg_dump.c). > > Making it work like procedural languages seems sensible to me. The same problem still exists for foreign data wrappers, servers, and user mappings. It should probably be changed in the same way. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latch implementation that wakes on postmaster death on both win32 and Unix
On Thu, May 26, 2011 at 11:58 AM, Peter Geoghegan wrote: > Attached revision doesn't use any threads or pipes on win32. It's far > neater there. I'm still seeing that "lagger" process (which is an > overstatement) at times, so I guess it is normal. On Windows, there is > no detailed PS output, so I actually don't know what the lagger > process is, and no easy way to determine that immediately occurs to > me. Process Explorer might help you there: http://technet.microsoft.com/en-us/sysinternals/bb896653 -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Latch implementation that wakes on postmaster death on both win32 and Unix
On 26 May 2011 11:22, Heikki Linnakangas wrote: > The Unix-stuff looks good to me at a first glance. Good. > There's one reference left to "life sign" in comments. (FWIW, I don't have a > problem with that terminology myself) Should have caught that one. Removed. > Looking at the MSDN docs again, can't you simply include PostmasterHandle in > the WaitForMultipleObjects() call to have it return when the process dies? > It should be possible to mix different kind of handles in one call, > including process handles. Does it not work as advertised? Uh, I might have done that, had I been aware of PostmasterHandle. I tried various convoluted ways to make it do what ReadFile() did for me, before finally biting the bullet and just using ReadFile() in a separate thread. I've tried adding PostmasterHandle though, and it works well - it appears to behave exactly the same as my original implementation. This simplifies things considerably. Now, on win32, things are actually simpler than on Unix. >> You'll see that it takes about a second for the archiver to exit. All >> processes exit. > > Hmm, shouldn't the archiver exit almost instantaneously now that there's no > polling anymore? Actually, just one "lagger" process sometimes remains that takes maybe as long as a second, a bit longer than the others. I assumed that it was the archiver, but I was probably wrong. I also didn't see that very consistently. Attached revision doesn't use any threads or pipes on win32. It's far neater there. I'm still seeing that "lagger" process (which is an overstatement) at times, so I guess it is normal. On Windows, there is no detailed PS output, so I actually don't know what the lagger process is, and no easy way to determine that immediately occurs to me. 
--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e71090f..b1d38f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10150,7 +10150,7 @@ retry:
 			/*
 			 * Wait for more WAL to arrive, or timeout to be reached
 			 */
-			WaitLatch(&XLogCtl->recoveryWakeupLatch, 5000000L);
+			WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT, 5000000L);
 			ResetLatch(&XLogCtl->recoveryWakeupLatch);
 		}
 		else
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 6dae7c9..fa1d382 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -94,6 +94,7 @@
 #include "miscadmin.h"
 #include "storage/latch.h"
+#include "storage/pmsignal.h"
 #include "storage/shmem.h"

 /* Are we currently in WaitLatch? The signal handler would like to know. */
@@ -108,6 +109,15 @@ static void initSelfPipe(void);
 static void drainSelfPipe(void);
 static void sendSelfPipeByte(void);

+/*
+ * Constants that represent which of a pair of fds given
+ * to pipe() is watched and owned in the context of
+ * dealing with postmaster death
+ */
+#define POSTMASTER_FD_WATCH 0
+#define POSTMASTER_FD_OWN 1
+
+extern int postmaster_alive_fds[2];

 /*
  * Initialize a backend-local latch.
@@ -188,22 +198,22 @@ DisownLatch(volatile Latch *latch)
  * backend-local latch initialized with InitLatch, or a shared latch
  * associated with the current process by calling OwnLatch.
  *
- * Returns 'true' if the latch was set, or 'false' if timeout was reached.
+ * Returns bit field indicating which condition(s) caused the wake-up.
  */
-bool
-WaitLatch(volatile Latch *latch, long timeout)
+int
+WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
 {
-	return WaitLatchOrSocket(latch, PGINVALID_SOCKET, false, false, timeout) > 0;
+	return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout);
 }

 /*
  * Like WaitLatch, but will also return when there's data available in
- * 'sock' for reading or writing. Returns 0 if timeout was reached,
- * 1 if the latch was set, 2 if the socket became readable or writable.
+ * 'sock' for reading or writing.
+ *
+ * Returns bit field indicating which condition(s) caused the wake-up.
  */
 int
-WaitLatchOrSocket(volatile Latch *latch, pgsocket sock, bool forRead,
-				  bool forWrite, long timeout)
+WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, long timeout)
 {
 	struct timeval tv,
 			   *tvp = NULL;
@@ -211,12 +221,13 @@ WaitLatchOrSocket(volatile Latch *latch, pgsocket sock, bool forRead,
 	fd_set		output_mask;
 	int			rc;
 	int			result = 0;
+	bool		found = false;

 	if (latch->owner_pid != MyProcPid)
 		elog(ERROR, "cannot wait on a latch owned by another process");

 	/* Initialize timeout */
-	if (timeout >= 0)
+	if (timeout >= 0 && (wakeEvents & WL_TIMEOUT))
 	{
 		tv.tv_sec = timeout / 1000000L;
 		tv.tv_usec = timeout % 1000000L;
 		tvp = &tv;
 	}
@@ -224,7 +235,7 @@ WaitLatchOrSocket(volatile Latch *latch, pgsocket sock, bool forRead,
 	waiting = true;
-	for (;;)
+	do
 	{
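For readers unfamiliar with the mechanism the Unix side of this patch relies on: the postmaster keeps the write end of a pipe open, every child watches the read end, and postmaster death surfaces as end-of-file, which makes select() report the fd readable. A toy Python demonstration of that pipe behavior (not the patch's C code; the names merely echo the patch's POSTMASTER_FD_WATCH/POSTMASTER_FD_OWN constants):

```python
import os
import select

POSTMASTER_FD_WATCH = 0   # read end: every child selects on this
POSTMASTER_FD_OWN = 1     # write end: held open only by the "postmaster"

def postmaster_died(fds):
    """The read end becomes readable (EOF) once every copy of the
    write end is closed -- i.e. once the postmaster is gone."""
    readable, _, _ = select.select([fds[POSTMASTER_FD_WATCH]], [], [], 0)
    return bool(readable)

postmaster_alive_fds = list(os.pipe())
alive_before = not postmaster_died(postmaster_alive_fds)  # write end open
os.close(postmaster_alive_fds[POSTMASTER_FD_OWN])         # "postmaster" dies
dead_after = postmaster_died(postmaster_alive_fds)        # EOF wakes watcher
```

This is also why, in the real patch, a forked child must close its inherited copy of the write end (the ReleasePostmasterDeathWatchHandle() call mentioned upthread): any surviving copy of the write end would keep the pipe from ever signalling EOF.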
[HACKERS] Database research papers
A bit of trivia: I remember spending weeks reading the ARIES paper during my postgraduate studies, and I loved the depth of knowledge in that paper. In fact, if I re-read it now, I would appreciate it even more. Are there other papers in the same league, especially ones more closely related to the PostgreSQL implementation?

http://www.almaden.ibm.com/u/mohan/RJ6649Rev.pdf

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Proposal: Another attempt at vacuum improvements
On Thu, May 26, 2011 at 9:40 AM, Robert Haas wrote:
> On Wed, May 25, 2011 at 11:51 PM, Pavan Deolasee wrote:
>> Having said that, it doesn't excite me too much because I
>> think we should do the dead line pointer reclaim operation during page
>> pruning and we are already holding cleanup lock at that time and most
>> likely do a reshuffle anyways.
>
> I'll give that a firm maybe. If there is no reshuffle, then you can
> do this with just an exclusive content lock. Maybe that's worthless,
> but I'm not certain of it. I guess we might need to see how the code
> shakes out.

Yeah, once we start working on it, we might have a better idea.

> Also, reshuffling might be more expensive. I agree that if there are
> new dead tuples on the page, then you're going to be paying that price
> anyway; but if not, it might be avoidable.

Yeah. We can tackle this later. As you suggested, maybe we can start with something simpler and then see if we need to do more.

>> There are some other issues that we should think about too. Like
>> recording free space and managing visibility map. The free space is
>> recorded in the second pass today, but I don't see any reason why
>> that can't be moved to the first pass. Its not clear though if we
>> should also record free space after retail page vacuum or leave it as
>> it is.
>
> Not sure. Any idea why it's like that, or why we might want to change it?

I think it precedes the HOT days when the dead space was reclaimed only during the second scan. Even post-HOT, if we know we would revisit the page anyways during the second scan, it makes sense to delay recording free space because the dead line pointers can add to it (if they are towards the end of the line pointer array). I remember discussing this briefly during HOT, but can't recollect why we decided not to update the FSM after retail vacuum. But the entire focus then was to keep things simple and that could be one reason.

> Currently, I believe the only way a page can get marked all-visible is
> by vacuum. But if we make this change, then it would be possible for
> a HOT cleanup to encounter a situation where all-visible could be set.
> We probably want to make that work.

Yes. That's certainly an option.

We did not discuss where to store the information about the start-LSN of the last successful index vacuum. I am thinking about a new pg_class attribute, just because I can't think of anything better. Any suggestion?

Also for the first version, I wonder if we should let the unlogged and temp tables be handled by the usual two-pass vacuum. Once we have proven that one pass is better, we will extend that to other tables as discussed on this thread.

Do we need a modified syntax for vacuum, like "VACUUM mytab SKIP INDEX" or something similar? That way, the user can just vacuum the heap if she wishes, and it can also help us with testing.

Do we need more autovacuum tuning parameters to control when to vacuum just the heap and when to vacuum the index as well? Again, we can discuss and decide this later, but just wanted to mention this here.

So are there any other objections/suggestions? Does anyone else care to look at the brief design that we discussed above? Otherwise, I would go ahead and work on this in the coming days. Of course, I will keep the list posted about any new issues that I see.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Latch implementation that wakes on postmaster death on both win32 and Unix
On 24.05.2011 23:43, Peter Geoghegan wrote:
> Attached is the latest revision of the latch implementation that
> monitors postmaster death, plus the archiver client that now relies on
> that new functionality and thereby works well without a tight
> PostmasterIsAlive() polling loop.

The Unix-stuff looks good to me at a first glance.

> The lifesign terminology has been dropped. We now close() the file
> descriptor that represents "ownership" - the write end of our anonymous
> pipe - in each child backend directly in the forking machinery (the
> thin fork() wrapper for the non-EXEC_BACKEND case), through a call to
> ReleasePostmasterDeathWatchHandle(). We don't have to do that on
> Windows, and we don't.

There's one reference left to "life sign" in comments. (FWIW, I don't have a problem with that terminology myself)

> Disappointingly, and despite a big effort, there doesn't seem to be a
> way to have the win32 WaitForMultipleObjects() call wake on postmaster
> death in addition to everything else in the same way that select()
> does, so there are now two blocking calls, each in a thread of its own
> (when the latch code is interested in postmaster death - otherwise,
> it's single threaded as before). The threading stuff (in particular,
> the fact that we used a named pipe in a thread where the name of the
> pipe comes from the process PID) is inspired by win32 signal emulation,
> src/backend/port/win32/signal.c .

That's a pity, all those threads and named pipes are a bit gross for a safety mechanism like this. Looking at the MSDN docs again, can't you simply include PostmasterHandle in the WaitForMultipleObjects() call to have it return when the process dies? It should be possible to mix different kinds of handles in one call, including process handles. Does it not work as advertised?

> You can easily observe that it works as advertised on Windows by
> starting Postgres with archiving, using task manager to monitor
> processes, and doing the following to the postmaster (assuming it has
> a PID of 1234). This is the Windows equivalent of kill -9 :
>
> C:\Users\Peter>taskkill /pid 1234 /F
>
> You'll see that it takes about a second for the archiver to exit. All
> processes exit.

Hmm, shouldn't the archiver exit almost instantaneously now that there's no polling anymore?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
[HACKERS] Re: Latch implementation that wakes on postmaster death on both win32 and Unix
I'm a bit disappointed that no one has commented on this yet. I would have appreciated some preliminary feedback. Anyway, I've added it to CommitFest 2011-06. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The way to know whether the standby has caught up with the master
On Wed, May 25, 2011 at 11:07 PM, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 25.05.2011 07:42, Fujii Masao wrote:
>>> To achieve that, I'm thinking to change walsender so that, when the
>>> standby has caught up with the master, it sends back the message
>>> indicating that to the standby. And I'm thinking to add new function
>>> (or view like pg_stat_replication) available on the standby, which
>>> shows that info.
>
>> By the time the standby has received that message, it might not be
>> caught-up anymore because new WAL might've been generated in the
>> master already.
>
> Even assuming that you believe this is a useful capability, there is
> no need to change walsender. It *already* sends the current end-of-WAL
> in every message, which indicates precisely whether the message
> contains all of the available WAL data.

That's not enough to determine whether failover is safe. Even if the standby's flush location is equal to the master's current end location, new WAL might already have been generated, and the "success" indication of the corresponding transaction might already have been returned to the client (this is possible only in async mode). So in addition to the master's current end location, the standby must know its sync mode, which walsender would need to send.

Another problem is that, when we can safely promote the standby, the standby's flush location isn't always equal to the master's current end location. Imagine the case where there is some unsent WAL in the master and the corresponding transactions are waiting for replication. In this case, obviously those locations are not the same. But in sync replication, we can guarantee that all the committed (from the client's point of view) transactions have been replicated to the standby, so failover is safe.
Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] The way to know whether the standby has caught up with the master
On Wed, May 25, 2011 at 3:11 PM, Jaime Casanova wrote:
> On Wed, May 25, 2011 at 12:28 AM, Fujii Masao wrote:
>> On Wed, May 25, 2011 at 2:16 PM, Heikki Linnakangas wrote:
>>> By the time the standby has received that message, it might not be
>>> caught-up anymore because new WAL might've been generated in the
>>> master already.
>>
>> Right. But, thanks to sync rep, until such new WAL has been
>> replicated to the standby, the commit of the transaction is not
>> visible to the client. So, even if there is some WAL not replicated
>> to the standby, the clusterware can promote the standby safely
>> without any data loss (from the client's point of view), I think.
>
> then, you also need to transmit to the standby if it is the current
> sync standby.

Yes. After further thought, I believe we can promote the standby safely only when the corresponding walsender meets the following conditions:

1. sync_state is "sync"
2. the standby's flush_location is greater than or equal to the smallest wait location in the sync rep queue

This guarantees that all the committed transactions (i.e., those whose "success" indications have been returned to the client) have been replicated to the standby. Once the above conditions are satisfied, failover remains safe until sync_state is flipped to "async". Using this logic, walsender would need to check whether failover is safe and send a message according to the result.

One problem is that, when sync_state is flipped to "async", walsender might start performing replication asynchronously before the standby receives the message indicating that failover is no longer safe. In this case, if the master crashes, the clusterware would wrongly conclude that failover is safe and promote the standby, which causes data loss. To solve this problem, walsender would need to send that message *synchronously*, i.e., wait for the ACK of the message to arrive from the standby before actually changing sync_state to "async".
Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] SSI predicate locking on heap -- tuple or row?
On 26.05.2011 06:19, Kevin Grittner wrote:
> Dan and I went around a couple times chasing down all code, comment,
> and patch changes needed, resulting in the attached patch. We found and
> fixed the bug which originally manifested in a way which I confused
> with a need for row locks, as well as another which was nearby in the
> code. We backed out the changes which were causing merge problems for
> Robert, as those were part of the attempt at the row locking (versus
> tuple locking). We removed a function which is no longer needed. We
> adjusted the comments and an affected isolation test.

Could you explain in the README why it is safe to take the lock on only the visible row version, please? It's not quite obvious, as we've seen from this discussion, and if I understood correctly the academic papers don't touch that subject either.

> As might be expected from removing an unnecessary feature, the lines
> of code went down -- a net decrease of 93 lines.

That's the kind of patch I like :-).

> These changes generate merge conflicts with the work I've done on
> handling CLUSTER, DROP INDEX, etc. It seems to me that the best course
> would be to commit this, then I can rebase the other work and post it.
> Since these issues are orthogonal, it didn't seem like a good idea to
> combine them in one patch, and this one seems more urgent.

Agreed.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com